auto-detecting the character set encoding of a text file
| martin.gerner@gmail.com 2006-03-28, 7:04 pm |
| Hi,
I just wanted to say that I'm new here, so I apologize in advance in
case I make any mistakes :)
My problem is that I have a bunch of text files with various
character-set encodings, and I would need a method for detecting what
encoding a certain file uses. (so that I can later open that file and
begin reading from it, using the correct encoding)
Is there some way I can do this? Some of the encodings I suspect I will
come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
rule out that others are present.
/Martin Gerner
| |
| Roedy Green 2006-03-28, 10:02 pm |
| On 28 Mar 2006 14:35:13 -0800, martin.gerner@gmail.com wrote, quoted
or indirectly quoted someone who said :
>Is there some way I can do this? Some of the encodings I suspect I will
>come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
>rule out that others are present.
Nothing simple like an encoding field. See
http://mindprod.com/projects/encodi...tification.html
for some approaches.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
| |
| Martin Gerner 2006-03-29, 8:02 am |
| Roedy Green <my_email_is_posted_on_my_website@munged.invalid> wrote in
news:cjqj22lm5dkd0e013odl3vnd8rt9ao4cdb@4ax.com:
> On 28 Mar 2006 14:35:13 -0800, martin.gerner@gmail.com wrote, quoted
> or indirectly quoted someone who said :
>
>
> Nothing simple like an encoding field. See
> http://mindprod.com/projects/encodi...tification.html
> for some approaches.
Unfortunately, this didn't help me much. So I take it that there is no
nifty little class I can download that will do this detection for me?
To clarify, the files I will be working with are _not_ HTML or XML files,
but rather standard-text log files from IM clients.
/Martin Gerner
| |
| Thomas Weidenfeller 2006-03-29, 8:02 am |
| martin.gerner@gmail.com wrote:
> My problem is that I have a bunch of text files with various
> character-set encodings, and I would need a method for detecting what
> encoding a certain file uses. (so that I can later open that file and
> begin reading from it, using the correct encoding)
>
> Is there some way I can do this? Some of the encodings I suspect I will
> come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
> rule out that others are present.
You can't in a general way. You have to know the encodings to be sure.
You can apply some heuristics to guess an encoding. But it will be a guess.
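One such heuristic can be sketched as follows. This is only an
illustration, not a method anyone in the thread proposed: it assumes the
whole file fits in memory, treats a UTF-8 byte order mark or a clean
strict UTF-8 decode as evidence for UTF-8, and otherwise returns a
fallback the caller chooses (windows-1252 in the usage note). The names
EncodingGuesser and guess are hypothetical, and StandardCharsets needs
Java 7 or later. It cannot tell windows-1252 from ISO-8859-15, so the
result is still a guess.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingGuesser {
    /**
     * Returns UTF-8 if the bytes look like UTF-8, otherwise the fallback.
     * The fallback is the caller's assumption, not something detected.
     */
    static Charset guess(byte[] data, Charset fallback) {
        // A UTF-8 byte order mark (EF BB BF) is a strong hint.
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        try {
            // Strict decoding: malformed input throws instead of being
            // silently replaced with U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so fall back to the caller's assumption.
            return fallback;
        }
    }
}
```

A caller might use guess(bytes, Charset.forName("windows-1252")). Pure
ASCII input is reported as UTF-8, which is harmless because ASCII decodes
identically in all three encodings mentioned above.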
/Thomas
--
The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS...ng/java/gui/faq
http://www.uni-giessen.de/faq/archi...g.java.gui.faq/
| |
| Roedy Green 2006-03-29, 7:04 pm |
| On Wed, 29 Mar 2006 13:13:40 +0000 (UTC), Martin Gerner
<martin.gerner@nospam.com> wrote, quoted or indirectly quoted someone
who said :
>Unfortunately, this didn't help me much. So I take it that there is no
>nifty little class I can download that will do this detection for me?
Exactly. It is a messy problem.
| |
| Roedy Green 2006-03-29, 7:04 pm |
| On Wed, 29 Mar 2006 13:13:40 +0000 (UTC), Martin Gerner
<martin.gerner@nospam.com> wrote, quoted or indirectly quoted someone
who said :
>To clarify, the files I will be working with are _not_ HTML or XML files,
>but rather standard-text log files from IM clients.
If you have control over the creation of these files, you could put
the name of the encoding at the front of the file, followed by a \n.
That would make your job much easier. Or you could tell everyone to use
UTF-8, which would make the problem disappear.
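The header-line idea could look roughly like this. It is a minimal
sketch under assumptions the thread does not spell out: the header is
the charset's canonical name in ASCII, terminated by a single \n, and
everything after it is the text in that charset. The names TaggedText,
write and read are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TaggedText {
    /** Encodes the text and prefixes it with "<charset name>\n". */
    static byte[] write(String text, Charset cs) {
        byte[] header = (cs.name() + "\n").getBytes(StandardCharsets.US_ASCII);
        byte[] body = text.getBytes(cs);
        byte[] all = new byte[header.length + body.length];
        System.arraycopy(header, 0, all, 0, header.length);
        System.arraycopy(body, 0, all, header.length, body.length);
        return all;
    }

    /** Reads the header line, then decodes the rest with that charset. */
    static String read(byte[] data) {
        // Find the first newline; the header before it is plain ASCII.
        int nl = 0;
        while (nl < data.length && data[nl] != '\n') {
            nl++;
        }
        if (nl == data.length) {
            throw new IllegalArgumentException("no encoding header line");
        }
        String name = new String(data, 0, nl, StandardCharsets.US_ASCII);
        // Charset.forName throws if the name is illegal or unsupported.
        Charset cs = Charset.forName(name);
        return new String(data, nl + 1, data.length - nl - 1, cs);
    }
}
```

For example, read(write(text, Charset.forName("windows-1252"))) returns
the original text as long as every character in it exists in that
charset.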
You might also do it by tracking the source of the file. You figure
out manually which encoding each source uses over which date range.
The habit of not recording the encoding goes way back. The idea was
that documents were local and all encoded the same way. You did not
exchange documents with others, or if you did, you exchanged a whole
tape full, all encoded the same way, so again the problem of
identification did not come up.