Home > Archive > PERL Beginners > December 2004 > combining getc() and unicode strings problem?
| Author |
combining getc() and unicode strings problem?
|
|
| tim23456@web.de 2004-12-16, 3:59 pm |
| Hello,
I have searched the web intensively for a solution to the following problem, but could not find any hint.
The following code does basically nothing other than read in a file one character at a time and write it out to a file again. The input file is encoded as UTF-8, as is the output file I want to create. I read in the characters using getc().
However, I still get incorrect results in my output file. Does anybody know of mistakes I am making when combining getc() with reading Unicode files?
Any input is greatly appreciated. Thanks very much in advance!
Tim
(I am using Perl 5.8.5 on Intel SuSE 9.2)
...
open(INFILE, "< $ARGV[0]") || die "\nCannot open from-file!";
open(OUTFILE, "> $ARGV[1]") || die "\nCannot create to-file!";
binmode(OUTFILE, ":utf8");
binmode(INFILE, ":utf8");
...
while(!eof(INFILE)) {
    for ($i = 1; $i < $Ntes_Zeichen; $i++) {
        $dummy = getc(INFILE);
        if (eof(INFILE)) {exit}
        print OUTFILE $dummy;
    }
    $dummy = getc(INFILE);
    print OUTFILE $ersetze_durch;
}
close(INFILE);
close(OUTFILE);
Summary of my perl5 (revision 5 version 8 subversion 5) configuration:
Platform:
osname=linux, osvers=2.6.8.1, archname=i586-linux-thread-multi
uname='linux g168 2.6.8.1 #1 smp thu jul 1 15:23:45 utc 2004 i686 i686 i386 gnulinux '
config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true -Doptimize=-O2 -march=i586 -mcpu=i686 -fmessage-length=0 -Wall -Wall -pipe'
hint=recommended, useposix=true, d_sigaction=define
usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2 -march=i586 -mcpu=i686 -fmessage-length=0 -Wall -Wall -pipe',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -pipe'
ccversion='', gccversion='3.3.4 (pre 3.3.5 20040809)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags =''
libpth=/lib /usr/lib /usr/local/lib
libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
libc=, so=so, useshrplib=true, libperl=libperl.so
gnulibc_version='2.3.3'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.5/i586-linux-thread-multi/CORE'
cccdlflags='-fPIC', lddlflags='-shared'
Characteristics of this binary (from libperl):
Compile-time options: MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
Built under linux
Compiled at Oct 1 2004 23:30:38
@INC:
/usr/lib/perl5/5.8.5/i586-linux-thread-multi
/usr/lib/perl5/5.8.5
/usr/lib/perl5/site_perl/5.8.5/i586-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.5
/usr/lib/perl5/site_perl
/usr/lib/perl5/vendor_perl/5.8.5/i586-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.5
/usr/lib/perl5/vendor_perl
__________________________________________________________
With WEB.DE FreePhone, make calls worldwide in the highest quality from 0 ct/min!
http://freephone.web.de/?mc=021201
| |
| Jonathan Paton 2004-12-16, 8:55 pm |
| Hi,
Not had the misfortune to need to play with this stuff, but I guess
the documentation for perl is a good place to start:
perldoc perl
Particularly:
perldoc perluniintro
perldoc perlunicode
Some aspects are version dependent, so make sure your script
insists on a minimum version of perl.
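One way to make a script insist on a minimum version is a `use VERSION` line at the top. This is a minimal sketch; the floor of 5.8.0 is an assumption chosen for illustration (the thread only states that 5.8.5 is installed).

```perl
#!/usr/bin/perl
use 5.008;        # compile-time check: dies on any perl older than 5.8.0 (assumed floor)
use strict;
use warnings;

# $] holds the running interpreter's version as a number, e.g. 5.008005.
printf "Running under perl %s\n", $];
```

On an older interpreter the script stops at the `use 5.008;` line with a version error instead of misbehaving later.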
Why are you doing this? Is most of your experience with C?
Jonathan Paton
--
#!perl
$J=' 'x25 ;for (qq< 1+10 9+14 5-10 50-9 7+13 2-18 6+13
17+6 02+1 2-10 00+4 00+8 3-13 3+12 01-5 2-10 01+1 03+4
00+4 00+8 1-21 01+1 00+5 01-7 >=~/ \S\S \S\S /gx) {m/(
\d+) (.+) /x,, vec$ J,$p +=$2 ,8,= $c+= +$1} warn $J,,
| |
| tim23456@web.de 2004-12-16, 8:55 pm |
| Hello Jonathan, all
> Not had the misfortune to need to play with this stuff, but I guess
> the documentation for perl is a good place to start:
>
[snip]
yes, I have read these man pages more than once now (at different times), so I think I should not have missed anything.
The Perl man pages in question do say which functions work and which do not work (with regard to Unicode); about the getc() function, however, nothing is mentioned.
I've also searched bugs.perl.org for any issue concerning 'getc()', but could not dig up anything concerning the Unicode context.
> Some aspects are version dependent, so make sure your script
> insists on a minimum version of perl.
This is not (yet) a problem, because I am developer and user at the same time.
> Why are you doing this? Is most of your experience with C?
I'm afraid I do not qualify as a programmer with any knowledge at all.
Consequently I am open to all suggestions for solving my problem in another way. Please help, if you can. What I cannot change, however, is the fact that I have to cope with UTF-8 input. That's because I am using characters such as "§", which cannot be represented with 8859-1 (=Latin1) or 8859-15 Euro (hope I am not starting incorrectly at this point!). US-ASCII also does not qualify for my needs.
The Perl man pages in general explicitly state that recent versions of Perl are "unicode ready by default".
I am using Perl 5.8.5 on Linux. Any input on this is very much appreciated.
Thank you, Tim
| |
| Jonathan Paton 2004-12-17, 3:55 am |
| Dear Tim,
I think your code is on the right track, as I got a modified
version of your code working. I never use getc...
In C, where getc originates, the getc function returns a
char type. The C char type is almost always 8 bits long.
By definition it doesn't support unicode, so neither does
Perl. It would be nice if they told you though.
My inner loop was simply:
while (<INFILE>) {
    print OUTFILE join "-", /./g;
}
Which reads line by line, and outputs the line but with a
dash between each character.
The /./g bit is using the regex engine to match a single
character, and the next etc. It returns a list of characters.
Alternatively, and probably more readable, is split //
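The two ways of breaking a line into characters can be compared side by side. This is a small sketch, assuming the line has already been decoded into a character string; the sample text and the variable names `@by_regex` and `@by_split` are made up for illustration.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');      # encode characters as UTF-8 on output

my $line = "Hello\x{A7}World";            # hypothetical sample; \x{A7} is the section sign

my @by_regex = $line =~ /./g;             # every character except newline
my @by_split = split //, $line;           # every character, newline included

print join('-', @by_regex), "\n";
print join('-', @by_split), "\n";
```

Both lines print the same text here. The difference only shows on input that still carries its trailing newline: `/./g` skips it, while `split //` returns it as a list element, so a `chomp` beforehand is usually wanted with `split`.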
My simple test file included your example character, and
the output was:
H-e-l-l-o-§-W-o-r-l-d
Working character by character on an input stream is
extremely unpopular in perl. Why do so when perl provides
a VERY powerful (ir)regular expression engine?
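As an illustration of letting the regex engine do the character counting, the original task (replace every Nth character with a fixed string) might be sketched as below. The helper name `replace_every_nth` is hypothetical; its `$n` and `$replacement` parameters stand in for `$Ntes_Zeichen` and `$ersetze_durch` from the original script, and the input is assumed to be a decoded character string.

```perl
use strict;
use warnings;

# Replace every $n-th character of $text with $replacement (hypothetical helper).
sub replace_every_nth {
    my ($text, $n, $replacement) = @_;
    die "n must be at least 1\n" if $n < 1;
    my $keep = $n - 1;                          # characters copied unchanged per group
    # Capture $keep characters, consume one more, and put the replacement in its place.
    # The /s flag lets "." match newlines too, so counting runs across line ends.
    $text =~ s/(.{$keep})./$1$replacement/gs;
    return $text;
}

print replace_every_nth("abcdefgh", 3, "*"), "\n";   # prints ab*de*gh
```

A trailing group shorter than N characters is left untouched, because the pattern cannot match it.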
>
> I'm afraid to say that i do not qualify as a programmer having
> any knowledge at all.
You know where the beginners list is ;-) I like the fact you have
worked on the problem first. Minimum boilerplate for all scripts
should be:
use strict;
use warnings;
Documentation available via perldoc.
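A sketch of what that boilerplate looks like in a small script follows. The `copy_utf8` helper is hypothetical, and the three-argument `open` with an `:encoding(UTF-8)` layer is a swapped-in alternative to the `binmode(..., ":utf8")` calls in the original script, not something proposed in this thread.

```perl
use strict;      # require declared variables, catch typos in names
use warnings;    # report suspicious constructs at run time

# Copy $from to $to, decoding and re-encoding as UTF-8 (hypothetical helper).
sub copy_utf8 {
    my ($from, $to) = @_;
    open(my $in,  '<:encoding(UTF-8)', $from) or die "Cannot open $from: $!\n";
    open(my $out, '>:encoding(UTF-8)', $to)   or die "Cannot create $to: $!\n";
    while (my $line = <$in>) {
        print {$out} $line;                  # $line holds characters, not bytes
    }
    close($in);
    close($out) or die "Cannot close $to: $!\n";
    return;
}

# Usage: perl copy.pl from-file to-file
copy_utf8(@ARGV) if @ARGV == 2;
```

With `use strict`, variables such as `$i` and `$dummy` in the original script would need a `my` declaration, which is the kind of slip these two lines are meant to surface early.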
Jonathan Paton
| |
| tim23456@web.de 2004-12-18, 3:55 pm |
| Hello Jonathan, all,
thank you for your kind response.
[snip]
> I never use getc...
>
> In C, where getc originates, the getc function returns a
> char type. The C char type is almost always 8 bits long.
> By definition it doesn't support unicode, so neither does
> Perl. It would be nice if they told you though.
[snip]
the getc() approach really turned out to be the wrong track for reaching my goal. Instead of getc()ing character after character, I now use the File::Slurp module. With this module I found it very useful to read the complete file content into a single scalar, though I do know that holding the content of a complete file in _one_ scalar is not very scalable. For the subsequent processing in my script, however, it seems to be the only solution.
Additionally, just one comment: my problem of dealing with Unicode input is quite comfortably solved now. Interestingly, the File::Slurp module from CPAN, which is referenced in the Perl documentation, seems not to be able to deal with Unicode content. For this case, using the CPAN module Perl6::Slurp, which requires at least Perl 5.8.0, completely resolved the Unicode problem.
Again, thank you very much for your valuable input,
Tim
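For readers who want the same whole-file read without a CPAN module, a minimal core-Perl sketch is shown here. It does not reproduce the Perl6::Slurp interface; the helper name `slurp_utf8` is hypothetical, and the input is assumed to be valid UTF-8.

```perl
use strict;
use warnings;

# Read a whole UTF-8 file into one scalar of characters (hypothetical helper).
sub slurp_utf8 {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!\n";
    local $/;                  # undefined record separator: one read returns everything
    my $content = <$fh>;
    close($fh);
    return defined $content ? $content : '';   # empty file yields an empty string
}

# Usage: perl slurp.pl some-file
if (@ARGV) {
    my $text = slurp_utf8($ARGV[0]);
    printf "%d characters read\n", length $text;
}
```

Because the layer decodes on input, `length` counts characters rather than bytes, so a character such as "§" counts once.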