Code Comments


Re: Splitting huge XML Files into fixsized wellformed parts
On 19 Mar., 14:05, Hermann Peifer <pei...@gmx.eu> wrote:
> Hermann Peifer wrote:
>
> Just in case someone is interested, here is yet another version of
> the same script, where chunk size is defined in bytes (and checked via
> XMLLEN, as suggested by Juergen).
>
> Hermann
>
> $ cat split_big_xmlfile.awk
>
> # Include the xmlcopy.awk library
> # Make sure that xgawk finds it
> @include xmlcopy
>
> # new_chunk can be anything here, but not 0 or ""
> # size value defines approx. chunk size in bytes
> # you might have to worry about XMLCHARSET (or not)
> BEGIN {
>         new_chunk = size = 250000000
>         # XMLCHARSET = "ISO-8859-1"
> }
>
> # Remember original XML declaration
> XMLDECLARATION { header = XmlCopy() }
>
> # Remember original root element, define the footer
> XMLSTARTELEM && XMLDEPTH == 1 {
>         header = header ORS XmlCopy() ORS
>         footer = ORS "</" XMLSTARTELEM ">"
> }
>
> # Only care about these elements and their children
> XMLPATH ~ /OfferInfo/ {
>         if (new_chunk) {
>                 outfile = "chunk" sprintf("%07d", num) ".xml"
>                 printf "%s", header > outfile
>                 new_chunk = ""
>         }
>         printf "%s", XmlCopy() > outfile
>         chunk_size += XMLLEN
> }
>
> # Decide if it's time to add a footer and start with a new chunk
> XMLENDELEM == "OfferInfo" && chunk_size > size {
>         printf "%s", footer > outfile
>         num++
>         new_chunk = "it's time now"
>         chunk_size = 0
> }
>
> END {
>         # Footer for the last chunk, but avoid double footers
>         if (!new_chunk) printf "%s", footer > outfile
>
>         # Print XMLERRORs, if any. Xgawk is somewhat lazy in
>         # this respect and might silently die, if you don't have:
>         if (XMLERROR)
>                 printf("XMLERROR '%s' at row %d col %d len %d\n",
>                        XMLERROR, XMLROW, XMLCOL, XMLLEN)
> }

I am lost for words! Thanks a lot. BTW, I already searched for the
XMLCOPY.AWK script, but without luck. XGAWK and the utils are installed,
but not XMLCOPY. Do you have a URL?

Malapha

Malapha
03-19-08 11:59 PM


Re: Splitting huge XML Files into fixsized wellformed parts
Malapha wrote:
> On 19 Mar., 14:05, Hermann Peifer <pei...@gmx.eu> wrote:
>
> I am lost for words! Thanks a lot. BTW, I already searched for the
> XMLCOPY.AWK script, but without luck. XGAWK and the utils are installed,
> but not XMLCOPY. Do you have a URL?
>
> Malapha

On my Linux laptop, it is here: /usr/local/share/xgawk/xmlcopy.awk

It is part of the latest xgawk release: xgawk-3.1.6-20080101.tar.gz
https://sourceforge.net/project/sho...group_id=133165

A third place is the source code repository, see here:
http://xmlgawk.cvs.sourceforge.net/...awk/awklib/xml/

Hermann

Hermann Peifer
03-20-08 08:59 AM


Re: Splitting huge XML Files into fixsized wellformed parts
On 20 Mar., 09:52, Hermann Peifer <pei...@gmx.eu> wrote:
> Malapha wrote:
>
> On my Linux laptop, it is here: /usr/local/share/xgawk/xmlcopy.awk
>
> It is part of the latest xgawk release: xgawk-3.1.6-20080101.tar.gz
> https://sourceforge.net/project/showfiles.php?group_id=133165
>
> A third place is the source code repository, see here:
> http://xmlgawk.cvs.sourceforge.net/xmlgawk/xmlgawk/awklib/xml/
>
> Hermann

Thanks again. I got everything up and running - and it worked :-) I
also modified XMLCOPY as suggested.

Here are some benchmarks:

Type                 Minutes       Size
BYTESHRED_XMLCOPY    7.966666667   322 MB
COUNTSHRED           0.583333333   322 MB
COUNTSHRED_XMLCOPY   7.55          322 MB

COUNTSHRED_XMLCOPY uses the XmlCopy method. As you can see, the
text-based method (Hermann's first) is by far the fastest, with the
disadvantage that the XML input file has to be well formed. I am
still undecided about which method to use. As I have file sizes of up
to 3 GB, "COUNTSHRED" seems to be the one.

One more question: In my XML Files there is another tag next to the
<OfferInfo>, named <CancelOfferInfo>. Where do I need to place this in
the code, so that it also gets processed?


Many thanks
Mala

Malapha
03-25-08 12:58 PM


Re: Splitting huge XML Files into fixsized wellformed parts
On Mar 25, 11:39 am, Malapha <mala...@gmx.net> wrote:
>
> Thanks again. I got everything up and running - and it worked :-) I
> also modified XMLCOPY as suggested.
>
> Here are some benchmarks:
>
> Type                 Minutes       Size
> BYTESHRED_XMLCOPY    7.966666667   322 MB
> COUNTSHRED           0.583333333   322 MB
> COUNTSHRED_XMLCOPY   7.55          322 MB
>
> COUNTSHRED_XMLCOPY uses the XmlCopy method. As you can see, the
> text-based method (Hermann's first) is by far the fastest, with the
> disadvantage that the XML input file has to be well formed. I am
> still undecided about which method to use. As I have file sizes of up
> to 3 GB, "COUNTSHRED" seems to be the one.
>

If you already have "nicely formatted" XML files (or manage to get
there via xmllint --format), then I'd recommend using the faster
solution with regular awk.

If not... then you have to use xgawk in combination with XmlCopy. Some
performance tuning might be possible. I guess that Jürgen might have
some good ideas.

> One more question: In my XML Files there is another tag next to the
> <OfferInfo>, named <CancelOfferInfo>. Where do I need to place this in
> the code, so that it also gets processed?
>

There is no single answer to this question, as there are 3 scripts now
with slightly different code. However, these rules will find both
OfferInfo and CancelOfferInfo elements:

/<.*OfferInfo>/ {do something with regular awk}

XMLPATH ~ /OfferInfo/ {do something with xgawk}

Another xgawk option could be to define the condition via XMLDEPTH,
e.g.:

XMLDEPTH > 2 {do something}
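
The regular-awk rule can be tried on a small sample. The following is only a sketch: the helper name match_offer_tags and the sample data are made up for illustration, and it assumes one tag per line without attributes, as in a "nicely formatted" file.

```shell
#!/bin/sh
# Hypothetical helper: read XML text on stdin and print every line that
# the regular-awk rule /<.*OfferInfo>/ matches, prefixed by its line number.
match_offer_tags() {
    awk '/<.*OfferInfo>/ { print NR ": " $0 }'
}

# Made-up sample input, one tag per line.
match_offer_tags <<'EOF'
<Offers>
<OfferInfo>
<Price>10</Price>
</OfferInfo>
<CancelOfferInfo>
<Id>7</Id>
</CancelOfferInfo>
</Offers>
EOF
```

On this sample the rule selects lines 2, 4, 5 and 7, i.e. the opening and closing tags of both element types. The lines between the tags are not matched, so a text-based script still has to track whether it is currently inside such an element.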

Hermann

Hermann Peifer
03-26-08 12:00 AM


Re: Splitting huge XML Files into fixsized wellformed parts
On 25 Mar., 14:32, Hermann Peifer <pei...@gmx.net> wrote:
> On Mar 25, 11:39 am, Malapha <mala...@gmx.net> wrote:
>
> If you already have "nicely formatted" XML files (or manage to get
> there via xmllint --format), then I'd recommend using the faster
> solution with regular awk.
>
> If not... then you have to use xgawk in combination with XmlCopy. Some
> performance tuning might be possible. I guess that Jürgen might have
> some good ideas.
>
> There is no single answer to this question, as there are 3 scripts now
> with slightly different code. However, these rules will find both
> OfferInfo and CancelOfferInfo elements:
>
> /<.*OfferInfo>/ {do something with regular awk}
>
> XMLPATH ~ /OfferInfo/ {do something with xgawk}
>
> Another xgawk option could be to define the condition via XMLDEPTH,
> e.g.:
>
> XMLDEPTH > 2 {do something}
>
> Hermann

Having decided to use the fastest way, let me come back to the
original problem: I also want some logging of the shredding process
at runtime, after each chunk is finished, so that the file sizes on
the filesystem after shredding correspond to the values of the
table's attributes in the logfile. Here is my idea:

logfile = "shredder_log.txt"
cmd_original_length = "ls -l " FILENAME " | gawk '{print $5;}'"
cmd_original_length | getline original_size
cmd_part_length = "ls -l " outfile " | gawk '{print $5;}'"
cmd_part_length | getline part_size
print outfile ";" sprintf("%03d", num+1) ";" FILENAME ";" \
      strftime("%m %d%Y%H%M%S", systime()) ";" original_size ";" part_size >> logfile

So far so good, but I have problems placing that piece of code. I
tried several places in the script, but it is either too early (file
does not yet exist -> size 0), in the middle of the process (wrong
file size), or too late.
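
One plausible cause of the wrong sizes (an assumption, the thread does not confirm it) is output buffering: awk may keep data written to outfile in memory until the file is closed, so an external command can see an empty or partial file. The sketch below closes each chunk before measuring it, and also closes the command so it can be run again for the next chunk. It is not the thread's script: the function name split_and_log, the line-count chunking, the file names and the use of wc -c instead of ls -l are illustrative choices, and strftime is left out so that it runs in any awk.

```shell
#!/bin/sh
# Hypothetical demo: split stdin into files of n lines each and append
# "chunk name;size in bytes" to a log file as soon as a chunk is finished.
# Usage: split_and_log LINES_PER_CHUNK LOGFILE CHUNK_PREFIX
split_and_log() {
    awk -v n="$1" -v logfile="$2" -v prefix="$3" '
    function finish_chunk(   cmd, size) {
        close(outfile)                    # flush buffered output to disk first
        cmd = "wc -c < \"" outfile "\""
        cmd | getline size                # read the byte count of the chunk
        close(cmd)                        # allow the same command to run again
        print outfile ";" (size + 0) >> logfile
        outfile = ""
    }
    {
        if (outfile == "") outfile = sprintf("%s%03d.txt", prefix, ++num)
        print > outfile
        if (NR % n == 0) finish_chunk()
    }
    END { if (outfile != "") finish_chunk() }
    '
}

# Example run on made-up input: five lines, two lines per chunk.
tmp=$(mktemp -d)
printf 'a\nb\nc\nd\ne\n' | split_and_log 2 "$tmp/log.txt" "$tmp/part"
cat "$tmp/log.txt"
rm -rf "$tmp"
```

The important part is the order inside finish_chunk: close the output file, then measure, then log. Measuring before close(outfile) is what tends to produce a size of 0 or a partial size.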

Many regards
Mala




Malapha
03-27-08 12:03 AM


Re: Splitting huge XML Files into fixsized wellformed parts
Malapha wrote:

> Having decided to use the fastest way, let me come back to the
> original problem: I also want some logging of the shredding process
> at runtime, after each chunk is finished, so that the file sizes on
> the filesystem after shredding correspond to the values of the
> table's attributes in the logfile. Here is my idea:
>
> logfile = "shredder_log.txt"
> cmd_original_length = "ls -l " FILENAME " | gawk '{print $5;}'"
> cmd_original_length | getline original_size
> cmd_part_length = "ls -l " outfile " | gawk '{print $5;}'"
> cmd_part_length | getline part_size
> print outfile ";" sprintf("%03d", num+1) ";" FILENAME ";" \
>       strftime("%m %d%Y%H%M%S", systime()) ";" original_size ";" part_size >> logfile
>

Does this really have to be logged at runtime? I would do this after the
splitting is done. Furthermore: "Using getline is almost always the
wrong approach", to quote one of the regulars in this group. So far, I
followed this advice and managed to avoid getline constructions in my
scripts.
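
Logging after the splitting is done could look like the sketch below, which needs no getline at all. It rests on assumptions: the chunks are named chunk*.xml as in the xgawk script, the log file name is taken from Malapha's post, the log line is simplified to four fields, and the function name log_chunks is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: once all chunks are written, append one log line
# per chunk: chunk name;original file;original size;chunk size (bytes).
# Usage: log_chunks ORIGINAL LOGFILE CHUNK...
log_chunks() {
    original=$1
    logfile=$2
    shift 2
    # tr strips the padding that some wc implementations print
    original_size=$(wc -c < "$original" | tr -d ' ')
    for chunk in "$@"; do
        part_size=$(wc -c < "$chunk" | tr -d ' ')
        printf '%s;%s;%s;%s\n' "$chunk" "$original" "$original_size" "$part_size"
    done >> "$logfile"
}

# Example call (file names are assumptions):
# log_chunks big.xml shredder_log.txt chunk*.xml
```

Because every chunk is complete and closed by the time this runs, the sizes in the log match what the filesystem reports.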

Hermann

Hermann Peifer
03-27-08 12:03 AM


All times are GMT.

Copyrights CodeComments.com 2004 - 2006
