

Home > Archive > PERL Beginners > October 2006 > Re: Yet another unicode question: windows platform










 

Author Re: Yet another unicode question: windows platform
Alexei A. Frounze

2006-10-30, 9:57 pm

tewilk@gmail.com wrote:
> I'm still searching for the answer but I can't find it yet so maybe
> someone can point me in the right direction. This is my problem...
> I'm reading MS SQL Server errorlog file- sql7 and sql2000 are standard
> ascii flat files that I can process with no problems. Then comes
> SQL2005... now the file is in unicode format. After a lot of reading
> (web and perl docs) I ran across my answer...
> http://blogs.msdn.com/brettsh/archi.../07/620986.aspx , the
> link talks about writing but I was able to use the same concept to
> read the file (open errorlogFH,"<:raw:encoding(UTF16-LE):crlf:utf8",
> "$TempErrorlog";) which worked fine. BTW... the information from this
> link was more helpful than the perldocs but maybe it's just me :o
>
> Now here is my issue that I'm sure is a no-brainer for someone out
> there... prior to my open, how can I check the file? Is it plain text
> so that I can use the standard open OR is it unicode so that I know
> that I need to use the "encoding" method?
>
> Also if anyone knows of some good information that has worked for them
> as it pertains to unicode, please post so that I can check it out.


Well, in general, if you don't know the type of the file (ASCII, UTF-8,
UTF-16LE/BE, UTF-32LE/BE, or some non-ASCII, non-Unicode 8/16-bit encoding),
you have to check it against every type you support, and if you find that
the file contains, say, valid UTF-8, then so be it. A few hints... Unicode
text files may begin with a so-called BOM (Byte Order Mark). Notepad usually
(if not always) puts it at the beginning of a saved Unicode text file.
It's a different sequence of bytes for each format: EF BB BF for UTF-8,
FF FE for UTF-16LE, FE FF for UTF-16BE, etc. If you find it, you can
validate the rest of the file on the assumption that it uses the Unicode
format the BOM indicates. The Unicode standard describes the valid
"code point" ranges. If you find something outside these ranges, it's
not Unicode or the file is corrupt. To find out whether the file is plain
ASCII, just check that all bytes in it are in the range 0...127. If a file
doesn't look like ASCII or Unicode, it's either some other 8-bit or 16-bit
encoding or it's corrupt. Btw, 7-bit ASCII is a subset of UTF-8, so a pure
ASCII file is also valid UTF-8.
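
The checks described above could be sketched in Perl roughly as follows.
This is only an illustration, not a complete detector: the names
guess_encoding and guess_file_encoding are made up for this example, the
whole file is read into memory, and only a BOM, pure ASCII and BOM-less
UTF-8 are recognized. Note also that FF FE 00 00 is ambiguous: it could
be a UTF-16LE file whose first character is NUL, and the sketch assumes
UTF-32LE in that case.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: guess the encoding of a byte string.
# Returns an Encode-style name, or undef when nothing matches.
sub guess_encoding {
    my ($bytes) = @_;

    # BOM checks. The 4-byte UTF-32 marks are tested before the 2-byte
    # UTF-16 ones because FF FE 00 00 also starts with FF FE.
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;

    # No BOM: plain ASCII if every byte is in the range 0..127.
    return 'ascii' if $bytes !~ /[^\x00-\x7F]/;

    # Otherwise accept the data as UTF-8 only if it decodes strictly.
    # decode() with a CHECK value may modify its argument, so use a copy.
    my $copy = $bytes;
    return 'UTF-8' if eval { decode('UTF-8', $copy, FB_CROAK); 1 };

    return;    # some other 8/16-bit encoding, or a corrupt file
}

# Hypothetical wrapper: slurp a file as raw bytes and guess its encoding.
sub guess_file_encoding {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                      # slurp mode
    my $bytes = <$fh>;
    close $fh;
    $bytes = '' unless defined $bytes;
    return guess_encoding($bytes);
}
```

With something like this, the result of guess_file_encoding($TempErrorlog)
could decide whether to use a plain open or one with an ":encoding(...)"
layer. Keep in mind that when the file is opened with an explicit
little- or big-endian layer, the BOM is still there as the first
character (U+FEFF) and has to be skipped by the caller.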

I highly suggest that you read the Unicode documentation on the Unicode
website: http://www.unicode.org. Must-reads are the Unicode FAQ and "To the
BMP and Beyond!" by Eric Muller, which must be somewhere on the net. I
suggest that you start with the latter to get an overall idea of Unicode
quickly. The ultimate source of information is the Unicode standard itself.

Alex







