Code Comments
On 19 Mar., 14:05, Hermann Peifer <pei...@gmx.eu> wrote:
> Hermann Peifer wrote:
>
> Just in case someone is interested, here is yet another version of
> the same script, where chunk size is defined in bytes (and checked via
> XMLLEN, as suggested by Juergen).
>
> Hermann
>
> $ cat split_big_xmlfile.awk
>
> # Include the xmlcopy.awk library
> # Make sure that xgawk finds it
> @include xmlcopy
>
> # new_chunk can be anything here, but not 0 or ""
> # size value defines approx. chunk size in bytes
> # you might have to worry about XMLCHARSET (or not)
> BEGIN {
>     new_chunk = size = 250000000
>     # XMLCHARSET = "ISO-8859-1"
> }
>
> # Remember original XML declaration
> XMLDECLARATION { header = XmlCopy() }
>
> # Remember original root element, define the footer
> XMLSTARTELEM && XMLDEPTH == 1 {
>     header = header ORS XmlCopy() ORS
>     footer = ORS "</" XMLSTARTELEM ">"
> }
>
> # Only care about these elements and their children
> XMLPATH ~ /OfferInfo/ {
>     if (new_chunk) {
>         outfile = "chunk" sprintf("%07d", num) ".xml"
>         printf "%s", header > outfile
>         new_chunk = ""
>     }
>     printf "%s", XmlCopy() > outfile
>     chunk_size += XMLLEN
> }
>
> # Decide if it's time to add a footer and start with a new chunk
> XMLENDELEM == "OfferInfo" && chunk_size > size {
>     printf "%s", footer > outfile
>     num++
>     new_chunk = "it's time now"
>     chunk_size = 0
> }
>
> END {
>     # Footer for the last chunk, but avoid double footers
>     if (!new_chunk) printf "%s", footer > outfile
>
>     # Print XMLERRORs, if any. Xgawk is somewhat lazy in
>     # this respect and might silently die, if you don't have:
>     if (XMLERROR)
>         printf("XMLERROR '%s' at row %d col %d len %d\n",
>                XMLERROR, XMLROW, XMLCOL, XMLLEN)
> }
I am lost for words! Thanks a lot. BTW, I already searched for the
xmlcopy.awk script, but without luck. xgawk and the utils are installed,
but not xmlcopy. Do you have a URL?
Malapha
Malapha wrote:
> BTW, I already searched for the xmlcopy.awk script, but without luck.
> xgawk and the utils are installed, but not xmlcopy. Do you have a URL?

On my Linux laptop, it is here: /usr/local/share/xgawk/xmlcopy.awk

It is part of the latest xgawk release: xgawk-3.1.6-20080101.tar.gz
https://sourceforge.net/project/sho...group_id=133165

A third place is the source code repository, see here:
http://xmlgawk.cvs.sourceforge.net/...awk/awklib/xml/

Hermann
On 20 Mar., 09:52, Hermann Peifer <pei...@gmx.eu> wrote:
> On my Linux laptop, it is here: /usr/local/share/xgawk/xmlcopy.awk
>
> It is part of the latest xgawk release: xgawk-3.1.6-20080101.tar.gz
> https://sourceforge.net/project/showfiles.php?group_id=133165
>
> A third place is the source code repository, see here:
> http://xmlgawk.cvs.sourceforge.net/xmlgawk/xmlgawk/awklib/xml/
>
> Hermann

Thanks again. I got everything up and running, and it worked :-) I
also modified XMLCOPY as suggested.

Here are some benchmarks:

Type                 Minutes       Size
BYTESHRED_XMLCOPY    7,966666667   322 MB
COUNTSHRED           0,583333333   322 MB
COUNTSHRED_XMLCOPY   7,55          322 MB

COUNTSHRED_XMLCOPY uses the xmlcopy method. As you can see, the
text-based method (Hermann's first) is by far the fastest, with the
disadvantage that the XML input file has to be well formed. I am
still undecided about which method to use. As I have file sizes of up
to 3 GB, COUNTSHRED seems to be the one.

One more question: in my XML files there is another tag next to
<OfferInfo>, named <CancelOfferInfo>. Where do I need to place this in
the code so that it also gets processed?

Many thanks
Mala
On Mar 25, 11:39 am, Malapha <mala...@gmx.net> wrote:
>
> Thanks again. I got everything up and running, and it worked :-) I
> also modified XMLCOPY as suggested.
>
> Here are some benchmarks:
> Type                 Minutes       Size
> BYTESHRED_XMLCOPY    7,966666667   322 MB
> COUNTSHRED           0,583333333   322 MB
> COUNTSHRED_XMLCOPY   7,55          322 MB
>
> COUNTSHRED_XMLCOPY uses the xmlcopy method. As you can see, the
> text-based method (Hermann's first) is by far the fastest, with the
> disadvantage that the XML input file has to be well formed. I am
> still undecided about which method to use. As I have file sizes of up
> to 3 GB, COUNTSHRED seems to be the one.
>

If you already have "nicely formatted" XML files (or manage to get
there via xmllint --format), then I'd recommend using the faster
solution with regular awk.

If not... then you have to use xgawk in combination with XmlCopy. Some
performance tuning might be possible. I guess that Jürgen might have
some good ideas.

> One more question: in my XML files there is another tag next to
> <OfferInfo>, named <CancelOfferInfo>. Where do I need to place this in
> the code so that it also gets processed?
>

There is no single answer to this question, as there are 3 scripts now
with slightly different code. However, these rules will find both
OfferInfo and CancelOfferInfo elements:

/<.*OfferInfo>/ {do something with regular awk}

XMLPATH ~ /OfferInfo/ {do something with xgawk}

Another xgawk option could be to define the condition via XMLDEPTH,
e.g.:

XMLDEPTH > 2 {do something}

Hermann
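
To see what the regular-awk rule in the reply above picks up, here is a small sketch wrapped in a shell function. The function name offer_tag_lines is made up for illustration, and the sketch assumes "nicely formatted" input with one tag per line. It is not one of the three scripts discussed in this thread, only a way to check the pattern against a sample file.

```sh
# Hypothetical helper (name is an assumption): print the line number and text
# of every line carrying an OfferInfo or CancelOfferInfo tag, using the
# regular-awk pattern quoted in the reply above.
offer_tag_lines() {
    awk '/<.*OfferInfo>/ { print FNR ": " $0 }' "$@"
}
```

The pattern matches opening and closing tags of both elements, because ".*" absorbs the "/" and the "Cancel" prefix. It requires ">" directly after the element name, so a start tag that carries attributes would not match without adjusting the pattern.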
On 25 Mar., 14:32, Hermann Peifer <pei...@gmx.net> wrote:
> If you already have "nicely formatted" XML files (or manage to get
> there via xmllint --format), then I'd recommend to use the faster
> solution with regular awk.
>
> If not... then you have to use xgawk in combination with XmlCopy. Some
> performance tuning might be possible. I guess that Jürgen might have
> some good ideas.
>
>
> There is no single answer to this question, as there are 3 scripts now with
> slightly different code. However, these rules will find both:
> OfferInfo and CancelOfferInfo elements:
>
> /<.*OfferInfo>/ {do something with regular awk}
>
> XMLPATH ~ /OfferInfo/ {do something with xgawk}
>
> Another xgawk option could be to define the condition via XMLDEPTH,
> e.g.:
>
> XMLDEPTH > 2 {do something}
>
> Hermann
After having decided to use the fastest way, please let me come back to
the original problem: I also want some logging of the shredding process
at runtime, after each chunk is finished, so that the file sizes on the
filesystem after the shredding correspond to the values of the table's
attributes in the logfile. Here is my idea:
logfile = "shredder_log.txt"
cmd_original_length = "ls -l " FILENAME " | gawk '{print $5;}'"
cmd_original_length | getline original_size
cmd_part_length = "ls -l " outfile " | gawk '{print $5;}'"
cmd_part_length | getline part_size
print outfile ";" sprintf("%03d", num+1) ";" FILENAME ";" strftime("%m%d%Y%H%M%S", systime()) ";" original_size ";" part_size >> logfile
So far, so good, but I have problems with where to place that piece of
code. I tried several places in the script, but it is either too early
(file does not yet exist -> size 0), in the middle of the process (wrong
file size), or too late.
Many regards
Mala
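
One likely reason for the "wrong file size" symptom is that awk buffers output written with print or printf to a file, so the size on disk is only reliable after close(outfile). The sketch below illustrates that ordering in isolation. The function name, the payload and the use of wc -c are assumptions made for the example; in the splitting script the same sequence would presumably go right after the footer of a chunk has been printed.

```sh
# Hypothetical demo (names and payload are placeholders, not from the thread):
# write a chunk, close it, and only then ask the shell for its size.
chunk_size_after_close() {
    awk -v outfile="$1" 'BEGIN {
        printf "%s", "<root>payload</root>" > outfile
        close(outfile)                  # flush and close before measuring
        cmd = "wc -c < \"" outfile "\""
        if ((cmd | getline part_size) > 0)
            print outfile ";" (part_size + 0)   # + 0 drops padding some wc versions add
        close(cmd)                      # lets the same command run again for the next chunk
    }'
}
```

Closing the command as well matters when the same command string is reused, since awk otherwise keeps the finished pipe open and getline returns nothing new.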
Malapha wrote:
> After having decided to use the fastest way, please let me come back to
> the original problem: I also want some logging of the shredding process
> at runtime, after each chunk is finished, so that the file sizes on the
> filesystem after the shredding correspond to the values of the table's
> attributes in the logfile. Here is my idea:
>
> logfile = "shredder_log.txt"
> cmd_original_length = "ls -l " FILENAME " | gawk '{print $5;}'"
> cmd_original_length | getline original_size
> cmd_part_length = "ls -l " outfile " | gawk '{print $5;}'"
> cmd_part_length | getline part_size
> print outfile ";" sprintf("%03d", num+1) ";" FILENAME ";" strftime("%m%d%Y%H%M%S", systime()) ";" original_size ";" part_size >> logfile
>
Does this really have to be logged at runtime? I would do this after the
splitting is done. Furthermore: "Using getline is almost always the
wrong approach", to quote one of the regulars in this group. So far, I
have followed this advice and managed to avoid getline constructions in
my scripts.
Hermann
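
A minimal sketch of what logging after the split could look like, done in the shell so that no getline is needed. It assumes the chunks sit in the current directory and are named chunk*.xml, as in the xgawk script earlier in the thread. The function name log_chunks and the timestamp layout (month, day, year, hour, minute, second, following the strftime call in the question) are assumptions, and the timestamp records when the log line is written, not when the chunk was created.

```sh
# Hypothetical post-split logger (function name is an assumption).
# Arguments: original XML file, log file.
log_chunks() {
    original=$1
    logfile=$2
    original_size=$(( $(wc -c < "$original") ))   # arithmetic strips wc padding
    num=0
    for part in chunk*.xml; do
        [ -e "$part" ] || continue                # no chunks: glob stays unexpanded
        num=$((num + 1))
        part_size=$(( $(wc -c < "$part") ))
        printf '%s;%03d;%s;%s;%s;%s\n' \
            "$part" "$num" "$original" "$(date +%m%d%Y%H%M%S)" \
            "$original_size" "$part_size" >> "$logfile"
    done
}
```

It would be called once after the split has finished, for example as log_chunks big.xml shredder_log.txt, where big.xml stands in for the original input file.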