Home > Archive > PERL Beginners > November 2007 > Invalid Unicode
| Mark Wagner 2007-11-20, 10:04 pm |
| I've got a program where I could greatly simplify things by
temporarily replacing strings with single characters. However, the
potential input includes any valid Unicode character. Assuming that
the invalid characters are never output to anything, are there any
problems that I'd encounter from using code points beyond what Unicode
defines (0x110000 and above)?
Thanks,
Mark Wagner
| |
| Tom Phoenix 2007-11-20, 10:04 pm |
| On 11/20/07, Mark Wagner <carnildo@gmail.com> wrote:
> I've got a program where I could greatly simplify things by
> temporarily replacing strings with single characters. However, the
> potential input includes any valid Unicode character. Assuming that
> the invalid characters are never output to anything, are there any
> problems that I'd encounter from using code points beyond what Unicode
> defines (0x110000 and above)?
It's not clear what you're trying to do. You can replace any character
with any other, but whether or not that causes problems depends upon
your ultimate goals.
It sounds as if this isn't a beginning Perl question. You may be able
to get better and faster answers if you inquire in a forum
specifically about Unicode issues.
Good luck with it!
--Tom Phoenix
Stonehenge Perl Training
| |
| Mark Wagner 2007-11-20, 10:04 pm |
| On Nov 20, 2007 7:28 PM, Tom Phoenix <tom@stonehenge.com> wrote:
>
> On 11/20/07, Mark Wagner <carnildo@gmail.com> wrote:
>
> It's not clear what you're trying to do. You can replace any character
> with any other, but whether or not that causes problems depends upon
> your ultimate goals.
To simplify the question, can I use characters that aren't valid
Unicode, and if so, are there any consequences?
--
Mark
| |
| Tom Phoenix 2007-11-21, 4:01 am |
| On 11/20/07, Mark Wagner <carnildo@gmail.com> wrote:
> To simplify the question, can I use characters that aren't valid
> Unicode, and if so, are there any consequences?
Do you mean, suppose I replace every non-ASCII character with some
invalid character:
my $invalid = chr(0x110000);
$data =~ s/[^\0-\177]/$invalid/g;
...Will there be any consequences? Yes; your data will be altered. (Is
that what you're asking?)
Are you trying to ask, will Perl prohibit the use of invalid Unicode
characters? Perl strings should be safe for any data. What happened
when you tried it?
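One way to try it, as a minimal sketch: the sample string and the names
$masked and $count below are illustrative assumptions, not part of the
original question. Nothing is printed, because the invalid characters
are assumed never to be output.

```perl
use strict;
use warnings;

# First code point past the Unicode range (U+10FFFF is the last valid one).
my $invalid = chr(0x110000);

# Hypothetical sample data: "naive cafe" with two non-ASCII characters.
my $data = "na\x{EF}ve caf\x{E9}";

# Replace every non-ASCII character with the out-of-range placeholder.
(my $masked = $data) =~ s/[^\x00-\x7F]/$invalid/g;

# Count the placeholders by comparing characters one at a time.
my $count = grep { $_ eq $invalid } split //, $masked;
```

If the string holds the placeholder as an ordinary single character,
length($masked) stays equal to length($data) and ord($invalid) is still
0x110000.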
Are you trying to ask, will today's invalid Unicode characters be used
for some valid purpose in tomorrow's Unicode, and thereby break my
program? Maybe.
Are you trying to ask something else?
Cheers!
--Tom Phoenix
Stonehenge Perl Training
| |
| Mark Wagner 2007-11-21, 4:01 am |
| On Nov 20, 2007 8:22 PM, Tom Phoenix <tom@stonehenge.com> wrote:
> Are you trying to ask, will Perl prohibit the use of invalid Unicode
> characters? Perl strings should be safe for any data.
That's basically what I needed to know, thanks.
--
Mark Wagner
| |
| Dr.Ruud 2007-11-21, 7:59 am |
| "Mark Wagner" schreef:
> I've got a program where I could greatly simplify things by
> temporarily replacing strings with single characters. However, the
> potential input includes any valid Unicode character. Assuming that
> the invalid characters are never output to anything, are there any
> problems that I'd encounter from using code points beyond what Unicode
> defines (0x110000 and above)?
$ perl -wle'
  $i = "0xFD";
  while (1) {
    $h = hex($i);
    $c = chr($h);
    last if $h != ord($c);
    substr($i, 2, 0, "F");
  }
  printf "--> %s\n", $i;
'
Integer overflow in hexadecimal number at -e line 4.
Hexadecimal number > 0xffffffff non-portable at -e line 4.
Unicode character 0xffffffff is illegal at -e line 5.
--> 0xFFFFFFFFD
See also pack()/unpack().
--
Affijn, Ruud
"Gewoon is een tijger."
| |
| Nobull67@Gmail.Com 2007-11-21, 10:01 pm |
| On Nov 21, 4:22 am, t...@stonehenge.com (Tom Phoenix) wrote:
> Are you trying to ask, will today's invalid Unicode characters be used
> for some valid purpose in tomorrow's Unicode, and thereby break my
> program? Maybe.
Really? I thought Unicode did make the promise that these codes would
never be assigned.
On the other hand, I don't think Perl makes the promise that it'll
always handle them. I've often wished that Perl did make this promise
because, as the OP says, it would be really handy to have "characters"
that could be used in Perl strings but that could not appear in any
valid Unicode input file.
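
The placeholder idea discussed in this thread could be sketched roughly
as follows. The helper names fold_tokens and unfold_tokens and the $BASE
constant are hypothetical, chosen only for illustration. The sketch
assumes the input contains only valid Unicode and that the placeholders
are never written to a file or socket.

```perl
use strict;
use warnings;

# Assumption: placeholders start at the first code point past U+10FFFF.
my $BASE = 0x110000;

# Replace each listed token with a single out-of-range character.
sub fold_tokens {
    my ($text, @tokens) = @_;
    return $text unless @tokens;
    my %to_char;
    $to_char{ $tokens[$_] } = chr($BASE + $_) for 0 .. $#tokens;
    # Try longer tokens first so a shorter prefix cannot shadow them.
    my $alt = join '|', map { quotemeta }
              sort { length($b) <=> length($a) } @tokens;
    $text =~ s/($alt)/$to_char{$1}/g;
    return $text;
}

# Restore the original tokens from their placeholder characters.
sub unfold_tokens {
    my ($text, @tokens) = @_;
    $text =~ s{(.)}{
        my $n = ord($1) - $BASE;
        ($n >= 0 && $n < @tokens) ? $tokens[$n] : $1;
    }gse;
    return $text;
}
```

A call such as fold_tokens("x<b>y</b>z", '<b>', '</b>') would leave a
five-character string to work on, and unfold_tokens with the same token
list restores the original text.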