Code Comments

combining getc() and unicode strings problem?
Hello,

I have searched the web intensively for a solution to the following problem, but could not find any pointer to one.

The following code does basically nothing other than read in a file one character at a time and write it out to a file again. The input file is encoded as UTF-8, as is the output file I want to create. I read in the characters using getc().
However, I still get incorrect results in my output file. Does anybody know of mistakes I am making when combining getc() with reading Unicode files?

Any input is greatly appreciated. Thanks very much in advance!

Tim

( I am using Perl 5.8.5 on Intel SuSE 9.2)


..

open(INFILE, "< $ARGV[0]") || die "\nCannot open from-file!";
open(OUTFILE, "> $ARGV[1]") || die "\nCannot create to-file!";

binmode(OUTFILE, ":utf8");
binmode(INFILE, ":utf8");


..

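# $Ntes_Zeichen ("n-th character") and $ersetze_durch ("replace with")
# are set in the part of the script elided above.
while(!eof(INFILE)) {

  for ($i = 1; $i < $Ntes_Zeichen; $i++) {

    $dummy = getc(INFILE); if (eof(INFILE)) {exit}
    print OUTFILE $dummy;

  }

  $dummy = getc(INFILE);
  print OUTFILE $ersetze_durch;

}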

close(INFILE);
close(OUTFILE);
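
For comparison, here is a minimal sketch of the same every-n-th-character loop as a self-contained sub. It assumes a reasonably current Perl 5 in which getc() returns whole characters on a handle opened with an encoding layer. The sub name replace_every_nth and its parameters are made up for this illustration. It tests the return value of getc() rather than calling eof() after each read, because in the loop above the eof() check exits before the last character read has been printed.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in_path to $out_path, replacing every
# $n-th character with $replacement. Both files are treated as UTF-8.
sub replace_every_nth {
    my ($in_path, $out_path, $n, $replacement) = @_;

    open(my $in,  '<:encoding(UTF-8)', $in_path)  or die "Cannot open $in_path: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path) or die "Cannot create $out_path: $!";

    my $count = 0;
    # getc() returns undef at end of file, so no separate eof() test is needed.
    while (defined(my $char = getc($in))) {
        $count++;
        print {$out} ($count % $n == 0 ? $replacement : $char);
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return $count;    # number of characters read
}
```

It could be called as replace_every_nth($ARGV[0], $ARGV[1], $Ntes_Zeichen, $ersetze_durch) once those two variables are set.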


Summary of my perl5 (revision 5 version 8 subversion 5) configuration:
  Platform:
    osname=linux, osvers=2.6.8.1, archname=i586-linux-thread-multi
    uname='linux g168 2.6.8.1 #1 smp thu jul 1 15:23:45 utc 2004 i686 i686 i386 gnulinux '
    config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm-Duseshrplib=true -Doptimize=-O2 -march=i586 -mcpu=i686 -fmessage-length=0 -Wall -Wall -pipe'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -march=i586 -mcpu=i686 -fmessage-length=0 -Wall -Wall -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -pipe'
    ccversion='', gccversion='3.3.4 (pre 3.3.5 20040809)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =''
    libpth=/lib /usr/lib /usr/local/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.3.3'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.5/i586-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared'

Characteristics of this binary (from libperl):
  Compile-time options: MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
  Built under linux
  Compiled at Oct  1 2004 23:30:38
  @INC:
    /usr/lib/perl5/5.8.5/i586-linux-thread-multi
    /usr/lib/perl5/5.8.5
    /usr/lib/perl5/site_perl/5.8.5/i586-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.5
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.5/i586-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.5
    /usr/lib/perl5/vendor_perl

tim23456@web.de
12-16-04 08:59 PM


Re: combining getc() and unicode strings problem?
Hi,

Not had the misfortune to need to play with this stuff, but I guess
the documentation for perl is a good place to start:

perldoc perl

Particularly:

perldoc perluniintro
perldoc perlunicode

Some aspects are version dependent, so make sure your script
insists on a minimum version of perl.
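
A minimal sketch of such a version guard follows. The choice of 5.8.0 as the floor is an assumption made for this example, picked because, as far as I know, that is the release where I/O layers such as :utf8 became generally available.

```perl
use 5.008;    # checked at compile time: any perl older than 5.8.0 refuses to run this
use strict;
use warnings;

# $] holds the running interpreter's version as a number, e.g. 5.008005 for 5.8.5.
printf "Running under perl %s\n", $];
```

On an older interpreter the script stops with a "Perl v5.8.0 required" style error before any of the Unicode handling is attempted.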

Why are you doing this?  Is most of your experience with C?

Jonathan Paton



--
#!perl
$J=' 'x25 ;for (qq< 1+10 9+14 5-10 50-9 7+13 2-18 6+13
17+6 02+1 2-10 00+4 00+8 3-13 3+12 01-5 2-10 01+1 03+4
00+4 00+8 1-21 01+1 00+5 01-7 >=~/ \S\S \S\S /gx) {m/(
\d+) (.+) /x,, vec$ J,$p +=$2 ,8,= $c+= +$1} warn $J,,

Jonathan Paton
12-17-04 01:55 AM


Re: combining getc() and unicode strings problem?
Hello Jonathan, all

> Not had the misfortune to need to play with this stuff, but I guess
> the documentation for perl is a good place to start:
>
[snip]

Yes, I have read these man pages more than once now (at different times), so I think I have not missed anything.

The Perl man pages in question do say which functions work with Unicode and which do not; the getc() function, however, is not mentioned at all.

I've also searched bugs.perl.org for any issue concerning getc(), but could not dig up anything related to Unicode.

> Some aspects are version dependent, so make sure your script
> insists on a minimum version of perl.

This is not (yet) a problem, because I am developer and user at the same time.

> Why are you doing this?  Is most of your experience with C?

I'm afraid I do not qualify as a programmer with any knowledge at all.

Consequently I am open to all suggestions for solving my problem in another way. Please help, if you can. What I cannot change, however, is the fact that I have to cope with UTF-8 input. That's because I am using characters such as "§", which cannot be represented with 8859-1 (= Latin1) or 8859-15 Euro (hope I am not starting incorrectly at this point!). US-ASCII does not qualify for my needs either.
The Perl man pages in general explicitly state that recent versions of Perl are "unicode ready by default".

I am using Perl 5.8.5 on Linux. Any input on this is very much appreciated.

Thank you, Tim
tim23456@web.de
12-17-04 01:55 AM


Re: combining getc() and unicode strings problem?
Dear Tim,

I think your code is on the right track, as I got a modified
version of your code working.  I never use getc...

In C, where getc originates, the getc function returns an int
holding a single byte (an unsigned char value, or EOF).  The C
char type is almost always 8 bits long.  By definition that
doesn't support unicode, so my guess is that neither does Perl's
getc.  It would be nice if they told you though.

My inner loop was simply:

while (<INFILE>) {
    print OUTFILE join "-", /./g;
}

Which reads line by line, and outputs the line but with a
dash between each character.

The /./g bit is using the regex engine to match a single
character, and the next etc.  It returns a list of characters.
Alternatively, and probably more readable, is split //
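
To make the two alternatives concrete, here is a small self-contained sketch that runs both on the same string. The sample string and the variable names are chosen for this illustration only.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');    # so the section sign is printed as UTF-8

my $line = "Hello\x{A7}World";          # \x{A7} is the section sign

my @by_regex = $line =~ /./g;           # list-context match: one element per character
my @by_split = split //, $line;         # empty pattern: split between all characters

print join("-", @by_regex), "\n";
print join("-", @by_split), "\n";
```

One difference worth knowing: without the /s modifier "." does not match a newline, so /./g silently drops the line ending, while split // keeps it as a list element.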

My simple test file included your example character, and
the output was:

H-e-l-l-o-§-W-o-r-l-d

Working character by character on an input stream is
extremely unpopular in perl.  Why do so when perl provides
a VERY powerful (ir)regular expression engine?
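
As a sketch of what that can look like for the task in the original post (replace every n-th character), a single substitution is enough once the text is in a string. The sub name replace_every_nth_regex is hypothetical and only serves this example.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every $n-th character of $text with $replacement.
sub replace_every_nth_regex {
    my ($text, $n, $replacement) = @_;
    my $keep = $n - 1;    # characters copied unchanged before each replaced one
    # /s lets "." match newlines too, so they count as characters.
    $text =~ s/(.{$keep})./$1$replacement/gs;
    return $text;
}

print replace_every_nth_regex("abcdefgh", 3, "-"), "\n";    # prints "ab-de-gh"
```

Because the regex engine works on characters once the input has been decoded, this behaves the same for ASCII and non-ASCII text.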

 
>
> I'm afraid to say that i do not qualify as a programmer having
> any knowledge at all.

You know where the beginners list is ;-)  I like the fact you have
worked on the problem first.  Minimum boilerplate for all scripts
should be:

use strict;
use warnings;

Documentation available via perldoc.

Jonathan Paton

Jonathan Paton
12-17-04 08:55 AM


Re: combining getc() and unicode strings problem?
Hello Jonathan, all,

thank you for your kind response.

[snip]

> I never use getc...
>
> In C, where getc originates, the getc function returns an int
> holding a single byte (an unsigned char value, or EOF).  The C
> char type is almost always 8 bits long.  By definition that
> doesn't support unicode, so my guess is that neither does Perl's
> getc.  It would be nice if they told you though.

[snip]

The getc() approach really turned out to be the wrong track for reaching my goal. Instead of getc()ing character after character, I now use the File::Slurp module. With it I found it very useful to read the complete file content into a single scalar, though I do know that holding the content of a complete file in _one_ scalar is not very scalable. For the processing my script does afterwards, however, it seems to be the only solution.

Additionally, just one comment: my problem of dealing with Unicode input is now quite comfortably solved. Interestingly, the File::Slurp module from CPAN, which is referenced in the Perl documentation, does not seem to be able to deal with Unicode content. For this case the CPAN module Perl6::Slurp, which requires at least Perl 5.8.0, completely resolved the Unicode problem.
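
Slurping a UTF-8 file does not strictly need a CPAN module. Below is a minimal sketch using only core Perl, assuming the file is valid UTF-8. The helper name slurp_utf8 is invented for this example and is not part of File::Slurp or Perl6::Slurp.

```perl
use strict;
use warnings;

# Hypothetical helper: return the whole file as one string of decoded characters.
sub slurp_utf8 {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
    local $/;                 # undefined record separator: read up to end of file
    my $content = <$fh>;
    close($fh);
    return $content;
}
```

The returned string counts characters rather than bytes, so length() and regexes see "§" as a single character.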

Again, thank you very much for your valuable input,
Tim
tim23456@web.de
12-18-04 08:55 PM

