| tewilk@gmail.com 2006-10-31, 7:56 am |
| Will do, thanks for the guidance!
Alexei A. Frounze wrote:
> tewilk@gmail.com wrote:
>
> Well, in general, if you don't know the type of file (ASCII, UTF8,
> UTF16LE/BE, UTF32LE/BE, some non-ASCII non-Unicode 8/16-bit encoding), you
> have to check against all supportable types, and if you find that the file
> contains, say, valid UTF8, then so be it. A few hints... Unicode
> text files may begin with a so-called BOM (Byte Order Mark). Notepad usually
> (if not always) puts it at the beginning of the saved Unicode text file.
> It's a different sequence of bytes for UTF8, UTF16LE, UTF16BE, etc. If you
> find it, you may validate the rest of the file pretending you know the
> Unicode format used (from the BOM). The Unicode standard describes valid
> "code point" number ranges (U+0000 to U+10FFFF, excluding the surrogates
> U+D800 to U+DFFF). If you find something outside these ranges, it's
> not Unicode or the file is corrupt. To find out whether the file is plain
> ASCII, just check that all bytes in it are in the range 0...127. If a file
> doesn't look
> like ASCII or Unicode, it's either some other 8-bit or 16-bit encoding or
> it's corrupt. Btw, 7-bit ASCII is a subset of UTF8.
>
> I highly suggest that you read the Unicode documentation on the Unicode
> website: http://www.unicode.org. Must-reads are the Unicode FAQ and "To the
> BMP and Beyond!" by Eric Muller, which must be somewhere on the net. I suggest
> that you start with the latter to get an overall idea of Unicode quickly.
> The ultimate source of information is the Unicode standard itself.
>
> Alex
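
For anyone finding this thread later, the checks described above could be sketched roughly as follows in Python. This is only an illustration, not code from the thread: the names `guess_encoding` and `_decodes_as` are made up here, and it leans on Python's strict decoders to reject invalid code points instead of walking the ranges by hand. Other 8-bit or 16-bit encodings are simply reported as "unknown".

```python
import codecs

# BOMs are tried longest-first, because the UTF-32LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def _decodes_as(data, encoding):
    """Return True if data is valid in the given encoding (strict decode)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def guess_encoding(data):
    """Hypothetical helper: guess the encoding of a bytes object."""
    # 1. If a BOM is present, validate the rest of the data against it.
    for bom, name in _BOMS:
        if data.startswith(bom) and _decodes_as(data[len(bom):], name):
            return name
    # 2. Plain ASCII: every byte is in the range 0...127.
    if all(b < 128 for b in data):
        return "ascii"
    # 3. No usable BOM: the data may still be valid UTF-8.
    if _decodes_as(data, "utf-8"):
        return "utf-8"
    # 4. Some other 8-bit/16-bit encoding, or a corrupt file.
    return "unknown"
```

Typical use would be something like `guess_encoding(open(path, "rb").read())`. Note that BOM-less UTF-16 or UTF-32 is not detected by this sketch; that would need heuristics beyond what is described above.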
|