Hello,
I have searched the web intensively for a solution to the following problem,
but could not find any pointer to one.
The following code does basically nothing more than read a file one
character at a time and write it out to a file again. The input file is
encoded as UTF-8, as is the output file I want to create. I read the
characters using getc().
However, I still get incorrect results in my output file. Does anybody know
of mistakes I am making when combining getc() with reading Unicode files?
Any input is greatly appreciated. Thanks very much in advance!
Tim
( I am using Perl 5.8.5 on Intel SuSE 9.2)
..
open(INFILE, "< $ARGV[0]") || die "\nCannot open from-file!";
open(OUTFILE, "> $ARGV[1]") || die "\nCannot create to-file!";
binmode(OUTFILE, ":utf8");
binmode(INFILE, ":utf8");
..
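# $Ntes_Zeichen = "Nth character", $ersetze_durch = "replace with"
while(!eof(INFILE)) {
    # copy the next N-1 characters unchanged
    for ($i = 1; $i < $Ntes_Zeichen; $i++) {
        $dummy = getc(INFILE); if (eof(INFILE)) {exit}
        print OUTFILE $dummy;
    }
    # read the Nth character and write the replacement instead
    $dummy = getc(INFILE);
    print OUTFILE $ersetze_durch;
}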
close(INFILE);
close(OUTFILE);
Summary of my perl5 (revision 5 version 8 subversion 5) configuration:
Platform:
osname=linux, osvers=2.6.8.1, archname=i586-linux-thread-multi
uname='linux g168 2.6.8.1 #1 smp thu jul 1 15:23:45 utc 2004 i686 i686 i386 gnulinux '
config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm-Duseshrplib=true -Doptimize=-O2 -march=i586 -mcpu=i686 -fmessage-length=0 -Wall -Wall -pipe'
hint=recommended, useposix=true, d_sigaction=define
usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2 -march=i586 -mcpu=i686 -fmessage-length=0 -Wall -Wall -pipe',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -pipe'
ccversion='', gccversion='3.3.4 (pre 3.3.5 20040809)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags =''
libpth=/lib /usr/lib /usr/local/lib
libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
libc=, so=so, useshrplib=true, libperl=libperl.so
gnulibc_version='2.3.3'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.5/i586-linux-thread-multi/CORE'
cccdlflags='-fPIC', lddlflags='-shared'
Characteristics of this binary (from libperl):
Compile-time options: MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
Built under linux
Compiled at Oct 1 2004 23:30:38
@INC:
/usr/lib/perl5/5.8.5/i586-linux-thread-multi
/usr/lib/perl5/5.8.5
/usr/lib/perl5/site_perl/5.8.5/i586-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.5
/usr/lib/perl5/site_perl
/usr/lib/perl5/vendor_perl/5.8.5/i586-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.5
/usr/lib/perl5/vendor_perl
Hi,
Not had the misfortune to need to play with this stuff, but I guess
the documentation for perl is a good place to start:
perldoc perl
Particularly:
perldoc perluniintro
perldoc perlunicode
Some aspects are version dependent, so make sure your script
insists on a minimum version of perl.
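A minimal sketch of what "insisting on a minimum version" can look like. The 5.008 here is an assumption, picked only because this thread is about Perl 5.8.5 and its I/O layers; adjust it to whatever your script really needs.

```perl
use 5.008;      # refuse to compile on perls older than 5.8.0
use strict;
use warnings;

# $^V holds the version of the running interpreter; %vd formats it as x.y.z
printf "Running under perl %vd\n", $^V;
```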
Why are you doing this? Is most of your experience with C?
Jonathan Paton
On Thu, 16 Dec 2004 19:18:06 +0200, tim23456@web.de <tim23456@web.de> wrote:
> [snip]
--
#!perl
$J=' 'x25 ;for (qq< 1+10 9+14 5-10 50-9 7+13 2-18 6+13
17+6 02+1 2-10 00+4 00+8 3-13 3+12 01-5 2-10 01+1 03+4
00+4 00+8 1-21 01+1 00+5 01-7 >=~/ \S\S \S\S /gx) {m/(
\d+) (.+) /x,, vec$ J,$p +=$2 ,8,= $c+= +$1} warn $J,,
Hello Jonathan, all

> Not had the misfortune to need to play with this stuff, but I guess
> the documentation for perl is a good place to start:
> [snip]

Yes, I have read these man pages more than once now (at different times),
so I think I should not have missed anything. The Perl man pages in
question do say which functions work with Unicode and which do not;
about the getc() function, however, nothing is mentioned.

I've also searched bugs.perl.org for any issue concerning 'getc()', but
could not dig up anything concerning the Unicode context.

> Some aspects are version dependent, so make sure your script
> insists on a minimum version of perl.

This is not (yet) a problem, because I am developer and user at the same
time.

> Why are you doing this? Is most of your experience with C?

I'm afraid to say that I do not qualify as a programmer with any knowledge
at all. Consequently I am open to all suggestions for how to solve my
problem in another way. Please help, if you can. What I cannot change,
however, is the fact that I have to cope with UTF-8 input. That's because
I am using characters such as "§", which cannot be represented with 8859-1
(=Latin1) or 8859-15 Euro (hope I am not starting incorrectly at this
point!). US-ASCII does not meet my needs either.

The Perl man pages in general explicitly state that recent versions of
Perl are "unicode ready by default".

I am using Perl 5.8.5 on Linux. Any input on this is very much
appreciated.

Thank you,
Tim
Dear Tim,
I think your code is on the right track, as I got a modified
version of your code working. I never use getc...
In C, where getc originates, the getc function returns a
single char value (widened to an int). The C char type is almost
always 8 bits long. By definition it doesn't support unicode, and I
suspect neither does Perl's getc. It would be nice if they told you though.
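If character-at-a-time input is still wanted, one option is read() with a length of 1: perldoc -f read states that on a handle with a UTF-8 layer the length counts characters, not bytes. A minimal sketch, assuming an in-memory input in place of a real file ($input_bytes and @chars are names made up for illustration):

```perl
use strict;
use warnings;

# UTF-8 bytes for "a", U+00A7 (section sign) and "b": four bytes, three characters
my $input_bytes = "a\xC2\xA7b";

# Open the scalar as an in-memory file and decode it as UTF-8 while reading
open my $in, '<:encoding(UTF-8)', \$input_bytes
    or die "Cannot open in-memory input: $!";

my @chars;
# read() returns 0 at end of file, so the loop stops there
while (read($in, my $char, 1)) {
    push @chars, $char;
}
close $in;

printf "%d characters read\n", scalar @chars;   # prints "3 characters read"
```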
My inner loop was simply:
while (<INFILE> ) {
print OUTFILE join "-", /./g;
}
Which reads line by line, and outputs the line but with a
dash between each character.
The /./g bit uses the regex engine to match one character after
another; in list context it returns a list of the characters (minus the
newline, which . does not match).
An alternative, and probably more readable, is split //
My simple test file included your example character, and
the output was:
H-e-l-l-o-§-W-o-r-l-d
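For completeness, here is a self-contained sketch of that loop using the split // variant. It assumes in-memory scalars instead of real files so it can be run as is; $input_bytes and $output_bytes are made-up names, and the explicit chomp and "\n" are my addition to keep the line ending.

```perl
use strict;
use warnings;

# UTF-8 bytes for "Hello§World\n"; \xC2\xA7 encodes U+00A7
my $input_bytes  = "Hello\xC2\xA7World\n";
my $output_bytes = '';

# Both handles get a UTF-8 layer: decode on input, encode on output
open my $in,  '<:encoding(UTF-8)', \$input_bytes
    or die "Cannot open input: $!";
open my $out, '>:encoding(UTF-8)', \$output_bytes
    or die "Cannot open output: $!";

while (my $line = <$in>) {
    chomp $line;                                     # set the newline aside
    print {$out} join('-', split //, $line), "\n";   # one dash between characters
}

close $in;
close $out or die "Cannot close output: $!";
```

After the run, $output_bytes holds the UTF-8 encoding of the dashed line, with the section sign still a single character between two dashes.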
Working character by character on an input stream is
extremely unpopular in Perl. Why do so when Perl provides
a VERY powerful (ir)regular expression engine?
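As a rough sketch of the regex route for what the original loop appears to do (copy N-1 characters, then write a replacement instead of the Nth), assuming the text is already in a decoded string. replace_every_nth and its parameters are hypothetical names, not something from your script:

```perl
use strict;
use warnings;

# Replace every $n-th character of $text with $replacement.
# The /s flag lets . match newlines too, so they count as characters.
sub replace_every_nth {
    my ($text, $n, $replacement) = @_;
    die "n must be at least 1\n" if $n < 1;
    my $keep = $n - 1;                      # characters copied unchanged
    $text =~ s/(.{$keep})./$1$replacement/gs;
    return $text;
}

print replace_every_nth("abcdefgh", 3, "X"), "\n";   # prints "abXdeXgh"
```

A trailing group shorter than N characters is left untouched, which may or may not match what the getc() loop was meant to do at end of file.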
>
> I'm afraid to say that i do not qualify as a programmer having
> any knowledge at all.
You know where the beginners list is ;-) I like the fact you have
worked on the problem first. Minimum boilerplate for all scripts
should be:
use strict;
use warnings;
Documentation available via perldoc.
Jonathan Paton
Hello Jonathan, all,

thank you for your kind response.

[snip]
> I never use getc...
[snip]

The getc() approach really turned out to be the wrong track for reaching
my goal.

Instead of getc()ing character after character, I now use the File::Slurp
module. With it I found it very useful to read the complete file content
into a single scalar, though I do know that holding the content of a
complete file in _one_ scalar is not very scalable. For the processing
afterwards in my script, however, it seems to be the only solution.

Additionally, just one comment: my problem of dealing with Unicode input
is now quite comfortably solved. Interestingly, the File::Slurp module from
CPAN, which is referenced in the Perl documentation, seems not to be able
to deal with Unicode content. For this case, using the CPAN module
Perl6::Slurp, which requires at least Perl 5.8.0, completely resolved the
Unicode problem.

Again, thank you very much for your valuable input,
Tim
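For reference, a whole file can also be read into one decoded scalar with core Perl only. This is a sketch of that alternative, not the Perl6::Slurp call used above; slurp_utf8 is a hypothetical helper name.

```perl
use strict;
use warnings;

# Read everything from $source into one scalar, decoding it as UTF-8.
# $source is a file name, or a reference to a scalar holding the bytes.
sub slurp_utf8 {
    my ($source) = @_;
    open my $fh, '<:encoding(UTF-8)', $source
        or die "Cannot open $source: $!";
    local $/;                 # undefined record separator: read it all at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Usage with in-memory bytes standing in for a file: "§ 1" encoded as UTF-8
my $bytes = "\xC2\xA7 1";
my $text  = slurp_utf8(\$bytes);
printf "%d characters\n", length $text;   # prints "3 characters"
```

Because the handle carries the UTF-8 layer, length() and regexes on the result work on characters rather than bytes.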