

Home > Archive > PERL Beginners > November 2007 > Invalid Unicode










 

Invalid Unicode
Mark Wagner

2007-11-20, 10:04 pm

I've got a program where I could greatly simplify things by
temporarily replacing strings with single characters. However, the
potential input includes any valid Unicode character. Assuming that
the invalid characters are never output to anything, are there any
problems that I'd encounter from using code points beyond what Unicode
defines (0x110000 and above)?

Thanks,
Mark Wagner
Tom Phoenix

2007-11-20, 10:04 pm

On 11/20/07, Mark Wagner <carnildo@gmail.com> wrote:
> I've got a program where I could greatly simplify things by
> temporarily replacing strings with single characters. However, the
> potential input includes any valid Unicode character. Assuming that
> the invalid characters are never output to anything, are there any
> problems that I'd encounter from using code points beyond what Unicode
> defines (0x110000 and above)?


It's not clear what you're trying to do. You can replace any character
with any other, but whether or not that causes problems depends upon
your ultimate goals.

It sounds as if this isn't a beginning Perl question. You may be able
to get better and faster answers if you inquire in a forum
specifically about Unicode issues.

Good luck with it!

--Tom Phoenix
Stonehenge Perl Training
Mark Wagner

2007-11-20, 10:04 pm

On Nov 20, 2007 7:28 PM, Tom Phoenix <tom@stonehenge.com> wrote:
>
> On 11/20/07, Mark Wagner <carnildo@gmail.com> wrote:
>
> It's not clear what you're trying to do. You can replace any character
> with any other, but whether or not that causes problems depends upon
> your ultimate goals.


To simplify the question, can I use characters that aren't valid
Unicode, and if so, are there any consequences?

--
Mark
Tom Phoenix

2007-11-21, 4:01 am

On 11/20/07, Mark Wagner <carnildo@gmail.com> wrote:

> To simplify the question, can I use characters that aren't valid
> Unicode, and if so, are there any consequences?


Do you mean, suppose I replace every non-ASCII character with some
invalid character:

my $invalid = chr(0x110000);
$data =~ s/[^\x00-\x7F]/$invalid/g;

...Will there be any consequences? Yes; your data will be altered. (Is
that what you're asking?)

Are you trying to ask, will Perl prohibit the use of invalid Unicode
characters? Perl strings should be safe for any data. What happened
when you tried it?
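
One way to try it is a small round-trip check. This is only a minimal sketch, assuming a reasonably recent Perl 5; the variable names ($marker, $text and so on) are made up for illustration and do not come from the original poster's program.

```perl
use strict;
use warnings;

# First code point above the Unicode range (U+10FFFF is the last valid one).
my $marker = chr(0x110000);

# Embed the marker in an ordinary string; the sample text is illustrative only.
my $text = "abc" . $marker . "def";

# The marker should count as a single character and keep its code point.
my $length   = length($text);            # number of characters in $text
my $position = index($text, $marker);    # zero-based offset of the marker
my $code     = ord(substr($text, $position, 1));

# Replacing the marker leaves an ordinary string behind.
(my $restored = $text) =~ s/$marker/-/g;

# Only ASCII is printed here; the marker itself is never written out.
printf "length=%d position=%d code=0x%X restored=%s\n",
    $length, $position, $code, $restored;
```

Writing the marker itself to a filehandle is a separate matter: Perl may warn that the code point is not Unicode, which fits the stated plan of never outputting these characters.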

Are you trying to ask, will today's invalid Unicode characters be used
for some valid purpose in tomorrow's Unicode, and thereby break my
program? Maybe.

Are you trying to ask something else?

Cheers!

--Tom Phoenix
Stonehenge Perl Training
Mark Wagner

2007-11-21, 4:01 am

On Nov 20, 2007 8:22 PM, Tom Phoenix <tom@stonehenge.com> wrote:
> Are you trying to ask, will Perl prohibit the use of invalid Unicode
> characters? Perl strings should be safe for any data.


That's basically what I needed to know, thanks.

--
Mark Wagner
Dr.Ruud

2007-11-21, 7:59 am

"Mark Wagner" wrote:

> I've got a program where I could greatly simplify things by
> temporarily replacing strings with single characters. However, the
> potential input includes any valid Unicode character. Assuming that
> the invalid characters are never output to anything, are there any
> problems that I'd encounter from using code points beyond what Unicode
> defines (0x110000 and above)?


$ perl -wle'
  $i = "0xFD";
  while (1) {
    $h = hex($i);
    $c = chr($h);
    last if $h != ord($c);
    substr($i, 2, 0, "F");
  }
  printf "--> %s\n", $i;
'
Integer overflow in hexadecimal number at -e line 4.
Hexadecimal number > 0xffffffff non-portable at -e line 4.
Unicode character 0xffffffff is illegal at -e line 5.
--> 0xFFFFFFFFD

See also pack()/unpack().
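
As a rough illustration of the pack()/unpack() route (a sketch only, assuming the "U" template accepts code points above U+10FFFF on your Perl, which is worth checking locally):

```perl
use strict;
use warnings;

# Build a character string from code points, one of them above U+10FFFF.
my $string = pack("U*", 0x41, 0x110000, 0x42);

# Recover the code points from the string.
my @code_points = unpack("U*", $string);

# Print the code points in hex; the string itself is never written out.
printf "%d characters: %s\n",
    length($string), join(" ", map { sprintf "0x%X", $_ } @code_points);
```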

--
Affijn, Ruud

"Gewoon is een tijger."

Nobull67@Gmail.Com

2007-11-21, 10:01 pm

On Nov 21, 4:22 am, t...@stonehenge.com (Tom Phoenix) wrote:

> Are you trying to ask, will today's invalid Unicode characters be used
> for some valid purpose in tomorrow's Unicode, and thereby break my
> program? Maybe.


Really? I thought Unicode did make the promise that these codes would
never be assigned.

On the other hand, I don't think Perl makes the promise that it'll
always handle them. I've often wished that Perl did make this promise
because, as the OP says, it would be really handy to have "characters"
that could be used in Perl strings but that could not appear in any
valid Unicode input file.
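
For what it's worth, the placeholder idea discussed in this thread could be sketched roughly as follows. Everything here is an assumption for illustration: the token list, the helper names to_placeholders() and from_placeholders(), and the choice to start numbering at 0x110000. It relies on Perl continuing to accept such code points, which, as noted above, is not something Perl promises.

```perl
use strict;
use warnings;

# Hypothetical multi-character tokens to protect during processing.
my @tokens = ('<=', '>=', '!=');

# Give each token one code point, starting just above U+10FFFF.
my (%char_for, %token_for);
my $next = 0x110000;
for my $token (@tokens) {
    my $char = chr($next++);
    $char_for{$token} = $char;
    $token_for{$char} = $token;
}

# Alternation of all tokens, longest first, with metacharacters escaped.
my $token_re = join '|',
    map { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %char_for;

# Replace each token with its single-character placeholder.
sub to_placeholders {
    my ($text) = @_;
    # Refuse input that already uses the reserved range.
    die "input contains a code point above U+10FFFF\n"
        if $text =~ /[^\x{0}-\x{10FFFF}]/;
    $text =~ s/($token_re)/$char_for{$1}/g;
    return $text;
}

# Turn known placeholders back into their tokens; leave anything else alone.
sub from_placeholders {
    my ($text) = @_;
    $text =~ s/([^\x{0}-\x{10FFFF}])/exists $token_for{$1} ? $token_for{$1} : $1/ge;
    return $text;
}

my $input   = "a <= b and c != d";
my $encoded = to_placeholders($input);    # tokens collapsed to single characters
my $decoded = from_placeholders($encoded);

print $decoded eq $input ? "round trip ok\n" : "round trip failed\n";
```

The guard in to_placeholders() is there because the scheme only works if real input can never contain the reserved characters; valid Unicode input cannot, but data that has already passed through this code could.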
