Code Comments

Splitting huge XML Files into fixsized wellformed parts
Hi,

I am kind of depressed :-) I want to split XML files larger than
2 GB into smaller chunks. As I don't want to end up with billions
of files, I want the split files to have a configurable size, e.g.
250 MB. Each file should be well-formed and carry an exact copy of
the header (and of the footer, i.e. the closing tag of the header)
from the original file. Furthermore, a table should be generated
where I can see that file X was separated into part N, with a
timestamp:

Table:

Originalfilename|Name of PartN|Size of PartN|Timestamp



The Original XML-Files look like this:
<?xml ...>
<Headerelement with some infos to be copied 1to1>
<OfferInfo>
<OfferID></OfferID>
..
</OfferInfo>
<OfferInfo>
<OfferID></OfferID>
..
</OfferInfo>
<OfferInfo>
<OfferID></OfferID>
..
</OfferInfo>
</Headerelement>



In the end I wound up reading the docs on XML processing with gawk,
but it seems I am lacking some deeper programming skills. Could
someone please help?

Thx
Malapha

Malapha
03-17-08 01:01 PM


Re: Splitting huge XML Files into fixsized wellformed parts
Malapha wrote:
> Hi,
>
> I am kind of depressed :-) I want to split XML files larger than
> 2 GB into smaller chunks. As I don't want to end up with billions
> of files, I want the split files to have a configurable size, e.g.
> 250 MB. Each file should be well-formed and carry an exact copy of
> the header (and of the footer, i.e. the closing tag of the header)
> from the original file. Furthermore, a table should be generated
> where I can see that file X was separated into part N, with a
> timestamp:

A nice and well described little homework with clear requirements.

I'd abstain from splitting the file by size in MB and suggest a
simpler measure for splitting instead, like the number of XML
blocks or the number of lines.

>
> Table:
>
> Originalfilename|Name of PartN|Size of PartN|Timestamp
>
>
>
> The Original XML-Files look like this:
> <?xml ...>
> <Headerelement with some infos to be copied 1to1>
>          <OfferInfo>
>                          <OfferID></OfferID>
>                           ...
>           </OfferInfo>
>          <OfferInfo>
>                          <OfferID></OfferID>
>                           ...
>           </OfferInfo>
>          <OfferInfo>
>                          <OfferID></OfferID>
>                           ...
>           </OfferInfo>
> </Headerelement>
>
>
>
> In the end I wound up reading the docs on XML processing with gawk,
> but it seems I am lacking some deeper programming skills.

Given your data above you can solve that all with basic awk pattern
matching capabilities, no deeper skills required. What have you tried
so far?

> Could
> someone please help?

Since, apparently, you don't have a complex XML structure the use of
xgawk seems unnecessary. The quick way I'd go would be...

Save everything in a variable until you match /Headerelement/.
Write that header to a file whose name contains a counter variable.
Write everything up to the end of the block, /<\/OfferInfo>/, to
that same file, while counting lines. Once the number of lines
exceeds some constant value, write the constant trailer, close()
the file, and increment the variable that counts the files. To
create the separate table, just write the information you already
have to a file with a fixed name (use awk's date functions or, if
unavailable, an external date program and getline).
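
The recipe above could look roughly like this in plain awk. It is a minimal sketch under assumptions, not a tested production script: the three-block sample, the names part1.xml and parts.tbl, and max=2 blocks per part are all made up for illustration. It counts OfferInfo blocks rather than lines, and it assumes pretty-printed input with the root start tag on a line of its own. The timestamp comes from the external date program via getline, since not every awk has date functions.

```sh
#!/bin/sh
# Sketch: sample data, file names and max=2 are illustrative assumptions.
work=$(mktemp -d)

cat > "$work/offers.xml" <<'EOF'
<?xml version="1.0"?>
<Headerelement source="demo">
<OfferInfo>
<OfferID>1</OfferID>
</OfferInfo>
<OfferInfo>
<OfferID>2</OfferID>
</OfferInfo>
<OfferInfo>
<OfferID>3</OfferID>
</OfferInfo>
</Headerelement>
EOF

LC_ALL=C awk -v max=2 -v dir="$work" -v orig="offers.xml" '
# Start a new part: write the saved header, reset the counters.
function open_part() {
    part++
    out = dir "/part" part ".xml"
    printf "%s", header > out
    bytes = length(header)
    blocks = 0
}
# Finish the current part: footer, close(), one row in the table.
function close_part(    cmd, stamp) {
    print footer > out
    bytes += length(footer) + 1
    close(out)
    cmd = "date +%Y-%m-%dT%H:%M:%S"
    cmd | getline stamp
    close(cmd)
    print orig "|part" part ".xml|" bytes "|" stamp > (dir "/parts.tbl")
    out = ""
}
# Save everything up to and including the root start tag.
!have_header {
    header = header $0 "\n"
    if ($0 ~ /<Headerelement/) { have_header = 1; footer = "</Headerelement>" }
    next
}
/<\/Headerelement>/ { next }              # original footer: we write our own
out == "" && NF     { open_part() }       # first non-blank line of a new part
out != ""           { print > out; bytes += length($0) + 1 }
/<\/OfferInfo>/     { if (++blocks >= max) close_part() }
END                 { if (out != "") close_part() }
' "$work/offers.xml"

cat "$work/parts.tbl"
```

length() counts characters, so the run is forced into the C locale to make the recorded sizes come out in bytes. The close() after every part keeps the number of open files constant, no matter how many parts a 2 GB input produces.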

If you have concrete questions feel free to ask.
(Or did you mean for us to write that program for you?)

Janis

>
> Thx
> Malapha

Janis Papanagnou
03-17-08 01:01 PM


Re: Splitting huge XML Files into fixsized wellformed parts
On 17 Mar., 13:37, Janis Papanagnou <Janis_Papanag...@hotmail.com>
wrote:
>
> A nice and well described little homework with clear requirements.
>
> I'd abstain from splitting the file according to file sizes in MB
> but suggest to take a more simple measure for splitting, like number
> of XML-blocks or number of lines.
>

I totally agree with you. Using the number of XML blocks as an
approximation for file size is good enough.
The problem I see is that line counts only work when the XML
document actually contains line breaks. If the input file has none,
I run into problems. So I decided to use the xgawk framework and
its "node hopping" technique. This lets me count the Offers without
having to solve the problem mentioned above.

> 
>
> Given your data above you can solve that all with basic awk pattern
> matching capabilities, no deeper skills required. What have you tried
> so far?

As I come from the VBA world, I have been trying to get familiar with
awk. What I do have is a theoretical solution in the form of a
structured process diagram :-)

Copy Header and Footer from Original to Var
Set Start_Offer = First Offer (from <Offer> to </Offer>)
Set End_Transaction = 0
Set Part = 0
Set FileSize = 0
Set MaxFileSize = 250
While Start_Offer < EOF(OriginalXMLFile)
    Part = Part + 1
    Open NewFile OriginalXMLFileName + Part + ".xml"
    Paste Header from Var to NewFile
    While FileSize(NewFile) < MaxFileSize And Start_Offer < EOF(OriginalXMLFile)
        Copy Offer (Start_Offer) from OriginalXMLFile to NewFile
        Start_Offer = Start_Offer + 1
    Wend
    Paste Footer from Var to NewFile
Wend

I am trying to translate this into awk right now. Please don't ask me
how far I am, it's frustrating :-)


> Save everything in a variable until you match the /Headerelement/.
> Write that header to a file whose name contains a variable as number.
> Write everything until the end of the block /<\/OfferInfo>/ to the
> file whose name contains a variable as number, while counting lines.
> If the number of lines exceeded some constant value write the constant
> trailer, and close() the file, and increase the variable that counts
> the files. To create a separate table just write out the information
> you already have to a file with fixed name (use awk's date functions
> or if unavailable an external date program and getline).

This looks very much like my approach - so I am quite happy that I am
not that wrong...



Malapha
03-17-08 11:59 PM


Re: Splitting huge XML Files into fixsized wellformed parts
Malapha wrote:
> On 17 Mar., 13:37, Janis Papanagnou <Janis_Papanag...@hotmail.com>
> wrote:
>
> I totally agree with you. Using the number of XML blocks as an
> approximation for file size is good enough.
> The problem I see is that line counts only work when the XML
> document actually contains line breaks. If the input file has none,
> I run into problems. So I decided to use the xgawk framework and
> its "node hopping" technique. This lets me count the Offers without
> having to solve the problem mentioned above.
>

Missing line breaks could be added via a preprocessing step with
$ xmllint --format bigfile.xml > formatted_bigfile.xml

I don't know how xmllint performs with a 2G file. On my old laptop, I am
running out of memory when trying to re-format a 600M file. However, you
might have better hardware available.

There are also other XML command line tools around that have some
"pretty print" option. xmlstarlet is one of them.
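
If a formatter runs out of memory on a file of this size, line breaks can also be added in a streaming fashion. The following is only a sketch under assumptions: oneline.xml is a made-up sample, and the approach blindly puts a newline between every ">" that is directly followed by "<", so it would also touch such a sequence inside a comment, CDATA section or attribute value. It relies on the input ending with ">" (plus optional whitespace), as well-formed XML does.

```sh
#!/bin/sh
work=$(mktemp -d)

# Hypothetical sample without any line breaks.
printf '%s' '<?xml version="1.0"?><Headerelement a="1"><OfferInfo><OfferID>1</OfferID></OfferInfo></Headerelement>' > "$work/oneline.xml"

awk '
BEGIN { RS = ">" }                                  # one record per tag, so memory use stays small
NR > 1 { printf ">" }                               # the previous record ended at a ">"
NR > 1 && substr($0, 1, 1) == "<" { printf "\n" }   # tag directly follows tag
{ printf "%s", $0; last = $0 }
END {
    # A non-blank final record means its closing ">" is still owed.
    if (last ~ /[^ \t\r\n]/) print ">"
}
' "$work/oneline.xml" > "$work/lines.xml"

cat "$work/lines.xml"
```

Because the record separator is ">", awk never has to hold more than one tag and its preceding text in memory, which is the point of doing it this way instead of loading the whole document.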
 
>

Before going deeper into xgawk: try to reformat the file as suggested
above. Then, as suggested by Janis, you could use regular awk for
the splitting task.

Hermann

Hermann Peifer
03-17-08 11:59 PM


Re: Splitting huge XML Files into fixsized wellformed parts
Malapha wrote:

> I totally agree with you. Using numbers of XML block as an
> approximation for filesize is well enough.

You may use the variable XMLLEN in xgawk.
Accumulate XMLLEN and you get a very precise
approximation for file size.

xgawk -lxml '{l+= XMLLEN};END{print l}' mssecure.xml
2309349

ll mssecure.xml
-rw-r--r-- 1 kahrs users 2309349 12. Jan 2005  mssecure.xml
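
XMLLEN is specific to xgawk. For line-oriented input, the same idea (summing up the length of everything read) can be tried in plain awk. A small sketch; tiny.xml is a made-up sample, and it assumes that every line ends in a newline and that the C locale is used so that length() counts bytes:

```sh
#!/bin/sh
work=$(mktemp -d)
printf '<a>\n<b>x</b>\n</a>\n' > "$work/tiny.xml"

# Each record contributes its own length plus one byte for the newline.
total=$(LC_ALL=C awk '{ n += length($0) + 1 } END { print n + 0 }' "$work/tiny.xml")
actual=$(wc -c < "$work/tiny.xml" | tr -d ' ')

echo "summed=$total wc=$actual"
```

Resetting such a running sum whenever a chunk is closed gives a byte-based threshold without asking the file system for the size of the output file.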

Jürgen Kahrs
03-17-08 11:59 PM


Re: Splitting huge XML Files into fixsized wellformed parts
Malapha wrote:
>
> As I come from the VBA world, I have been trying to get familiar with
> awk. What I do have is a theoretical solution in the form of a
> structured process diagram :-)
>
> Copy Header and Footer from Original to Var
> Set Start_Offer = First Offer (from <Offer> to </Offer>)
> Set End_Transaction = 0
> Set Part = 0
> Set FileSize = 0
> Set MaxFileSize = 250
> While Start_Offer < EOF(OriginalXMLFile)
>     Part = Part + 1
>     Open NewFile OriginalXMLFileName + Part + ".xml"
>     Paste Header from Var to NewFile
>     While FileSize(NewFile) < MaxFileSize And Start_Offer < EOF(OriginalXMLFile)
>         Copy Offer (Start_Offer) from OriginalXMLFile to NewFile
>         Start_Offer = Start_Offer + 1
>     Wend
>     Paste Footer from Var to NewFile
> Wend
>
> I am trying to translate this into awk right now. Please don't ask me
> how far I am, it's frustrating :-)
>
>

Below is one solution for splitting into well-formed chunks, here with
100 OfferInfos each. There might be better solutions (I just don't know
them ;-) It only works if the XML data is in "pretty print" format, like
the sample data you posted.


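$ cat split_bigfile.awk

# size = number of OfferInfo blocks per chunk
BEGIN { new_chunk = 1 ; size = 100 }

# Line 1: XML declaration. Line 2: root start tag, which also yields the footer.
NR == 1 { header = $0 ; next }
NR == 2 {
    header = header ORS $0
    root = substr($1, 2) ; sub(/>$/, "", root)   # works with or without attributes
    footer = "</" root ">"
    next
}

# Copy every line except the original footer
$0 !~ footer {
    if (new_chunk) {
        outfile = "chunk" sprintf("%07d", num) ".xml"
        print header > outfile
        new_chunk = 0
    }
    print > outfile
}

# After each complete block, decide whether the chunk is full
/<\/OfferInfo>/ {
    num = int(++count/size)     # pre-increment: exactly "size" blocks per chunk
    if (num > prev_num) {
        print footer > outfile
        close(outfile)          # do not keep every chunk open
        new_chunk = 1
    }
    prev_num = num
}

# Footer for the last, partially filled chunk
END { if (!new_chunk) print footer > outfile }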
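
The chunk number in this kind of script comes from integer division of a running block counter, and it matters whether the counter is incremented before or after the division. A small sketch of just that arithmetic, with a made-up size of 3 and the hypothetical names count_post and count_pre:

```sh
#!/bin/sh
boundaries=$(awk 'BEGIN {
    size = 3
    for (i = 1; i <= 7; i++) {
        post = int((count_post++) / size)   # divide first, then increment
        pre  = int((++count_pre) / size)    # increment first, then divide
        printf "block %d: post=%d pre=%d\n", i, post, pre
    }
}')
echo "$boundaries"
```

With the increment after the division, the index only changes at block 4, so a footer triggered by that change would leave size + 1 blocks in the first chunk. With the increment before the division, the index changes at block 3 and every chunk holds exactly size blocks.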

Hermann Peifer
03-17-08 11:59 PM


Re: Splitting huge XML Files into fixsized wellformed parts
On 17 Mar., 21:33, Jürgen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:
>
> You may use the variable XMLLEN in xgawk.
> Accumulate XMLLEN and you get a very precise
> approximation for file size.
>
> xgawk -lxml '{l+= XMLLEN};END{print l}' mssecure.xml
> 2309349
>
> ll mssecure.xml
> -rw-r--r-- 1 kahrs users 2309349 12. Jan 2005  mssecure.xml

Wow, this looks promising!

Unfortunately I caught a cold, so I am not able to test it in the
office. But thanks a lot. I'll try to combine it with the suggestions
Hermann made.

Thanks
Malapha

Malapha
03-18-08 11:59 PM


Re: Splitting huge XML Files into fixsized wellformed parts
On 18 Mar., 00:01, Hermann Peifer <pei...@gmx.eu> wrote:
>
> Below one solution for splitting in well-formed chunks, here: 100
> OfferInfos each.  There might be better solutions (I just don't know
> them ;-) It only works if the XML data is in "pretty print format", as
> the sample data you posted.
>
> $ cat split_bigfile.awk
>
> BEGIN { new_chunk = 1 ; size = 100 }
>
> NR == 1 { header = $0 ; next }
> NR == 2 { header = header ORS $0 ; footer = "</" substr($1,2) ">" ; next }
>
> $0 !~ footer {
>         if (new_chunk) {
>                 outfile = "chunk" sprintf("%07d", num) ".xml"
>                 print header > outfile
>                 new_chunk = 0
>         }
>         print > outfile
> }
>
> /<\/OfferInfo>/ {
>         num = int(count++/size)
>         if (num > prev_num) {
>                 print footer > outfile
>                 new_chunk = 1
>         }
>         prev_num = num
> }
>
> END { if (!new_chunk) print footer > outfile }

Hermann, you are great. As I wrote to Jürgen, I am unable to test it
right now, but I'll give it a try as soon as possible!

Thanks again
Malapha

Malapha
03-18-08 11:59 PM


Re: Splitting huge XML Files into fixsized wellformed parts
Malapha wrote:
> On 18 Mar., 00:01, Hermann Peifer <pei...@gmx.eu> wrote:
>
> Hermann, you are great. As I wrote to Jürgen, I am unable to test it
> right now, but I'll give it a try as soon as possible!
>
> Thanks again
> Malapha


Here is the xgawk version of the same script. It works fine for me with
your test data. No pre-formatting of bigfile.xml is needed. However, for
this solution you need to have xgawk and the library xmlcopy.awk
available. In xmlcopy.awk, I made a minor change at the very end:

# printf( "%s", token )
return token

Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml

$ cat split_big_xmlfile.awk

# Include the xmlcopy.awk library
# Make sure that xgawk finds it
@include xmlcopy

BEGIN { new_chunk = 1 ; size = 100 }

# Remember XML declaration of bigfile.xml
XMLDECLARATION { header = XmlCopy() }

# Remember root element, define the footer
XMLSTARTELEM && XMLDEPTH == 1 {
    header = header XmlCopy()
    footer = "</" XMLSTARTELEM ">"
}

# Only care about OfferInfos and their children
XMLPATH ~ /OfferInfo/ {
    if (new_chunk) {
        outfile = "chunk" sprintf("%07d", num) ".xml"
        printf "%s", header > outfile
        new_chunk = 0
    }
    printf "%s", XmlCopy() > outfile
}

# Decide if it's time to add a footer and start a new chunk
XMLENDELEM == "OfferInfo" {
    num = int(++count/size)
    if (num > prev_num) {
        print footer > outfile
        new_chunk = 1
    }
    prev_num = num
}

# Avoid double footers, if at the end: count%size == 0
END { if (!new_chunk) print footer > outfile }

Hermann Peifer
03-18-08 11:59 PM


Re: Splitting huge XML Files into fixsized wellformed parts
Hermann Peifer wrote:
>
> Here the xgawk version of the same script. It works fine for me with
> your testdata. No pre-formatting of bigfile.xml is needed. However, for
> this solution you need to have xgawk and the library xmlcopy.awk
> available. In xmlcopy.awk, I made a minor change at the very end:
>
>    # printf( "%s", token )
>    return token
>
> Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml
>
> $ cat split_big_xmlfile.awk
>
> # Include the xmlcopy.awk library
> # Make sure that xgawk finds it
> @include xmlcopy
>
> BEGIN { new_chunk = 1 ; size = 100 }
>
> # Remember XML declaration of bigfile.xml
> XMLDECLARATION { header = XmlCopy() }
>
> # Remember root element, define the footer
> XMLSTARTELEM && XMLDEPTH == 1 {
>     header = header XmlCopy()
>     footer = "</" XMLSTARTELEM ">"
> }
>
> # Only care about OfferInfos and their children
> XMLPATH ~ /OfferInfo/ {
>     if (new_chunk) {
>         outfile = "chunk" sprintf("%07d", num) ".xml"
>         printf "%s", header > outfile
>         new_chunk = 0
>     }
>     printf "%s", XmlCopy() > outfile
> }
>
> # Decide if it's time to add a footer and start a new chunk
> XMLENDELEM == "OfferInfo" {
>     num = int(++count/size)
>     if (num > prev_num) {
>         print footer > outfile
>         new_chunk = 1
>     }
>     prev_num = num
> }
>
> # Avoid double footers, if at the end: count%size = 0
> END { if (!new_chunk) print footer > outfile }


Just in case someone is interested, here is yet another version of
the same script, where the chunk size is defined in bytes (and checked
via XMLLEN, as suggested by Jürgen).

Hermann

$ cat split_big_xmlfile.awk

# Include the xmlcopy.awk library
# Make sure that xgawk finds it
@include xmlcopy

# new_chunk can be anything here, but not 0 or ""
# size value defines approx. chunk size in bytes
# you might have to worry about XMLCHARSET (or not)
BEGIN {
    new_chunk = size = 250000000
    # XMLCHARSET = "ISO-8859-1"
}

# Remember original XML declaration
XMLDECLARATION { header = XmlCopy() }

# Remember original root element, define the footer
XMLSTARTELEM && XMLDEPTH == 1 {
    header = header ORS XmlCopy() ORS
    footer = ORS "</" XMLSTARTELEM ">"
}

# Only care about these elements and their children
XMLPATH ~ /OfferInfo/ {
    if (new_chunk) {
        outfile = "chunk" sprintf("%07d", num) ".xml"
        printf "%s", header > outfile
        new_chunk = ""
    }
    printf "%s", XmlCopy() > outfile
    chunk_size += XMLLEN
}

# Decide if it's time to add a footer and start with a new chunk
XMLENDELEM == "OfferInfo" && chunk_size > size {
    printf "%s", footer > outfile
    num++
    new_chunk = "it's time now"
    chunk_size = 0
}

END {
    # Footer for the last chunk, but avoid double footers
    if (!new_chunk) printf "%s", footer > outfile

    # Print XMLERRORs, if any. Xgawk is somewhat lazy in
    # this respect and might silently die, if you don't have:
    if (XMLERROR)
        printf("XMLERROR '%s' at row %d col %d len %d\n",
               XMLERROR, XMLROW, XMLCOL, XMLLEN)
}
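
Whichever variant is used, a quick sanity check is to compare the number of closing OfferInfo tags in the original with the total over all chunks. A minimal sketch in plain awk; the three files below are tiny made-up stand-ins, and count_blocks is a hypothetical helper name. Using ">" as record separator means the check does not depend on line breaks in the input.

```sh
#!/bin/sh
work=$(mktemp -d)

# Hypothetical stand-ins for the original file and two chunks.
printf '%s' '<R><OfferInfo><OfferID>1</OfferID></OfferInfo><OfferInfo><OfferID>2</OfferID></OfferInfo><OfferInfo><OfferID>3</OfferID></OfferInfo></R>' > "$work/bigfile.xml"
printf '%s\n' '<R><OfferInfo><OfferID>1</OfferID></OfferInfo><OfferInfo><OfferID>2</OfferID></OfferInfo>' '</R>' > "$work/chunk0000000.xml"
printf '%s\n' '<R><OfferInfo><OfferID>3</OfferID></OfferInfo>' '</R>' > "$work/chunk0000001.xml"

# Count records that end in a closing OfferInfo tag (the ">" itself is the separator).
count_blocks() {
    awk 'BEGIN { RS = ">" } /<\/OfferInfo[ \t\r\n]*$/ { n++ } END { print n + 0 }' "$@"
}

in_original=$(count_blocks "$work/bigfile.xml")
in_chunks=$(count_blocks "$work"/chunk*.xml)

if [ "$in_original" -eq "$in_chunks" ]; then
    echo "ok: $in_chunks blocks"
else
    echo "mismatch: original=$in_original chunks=$in_chunks"
fi
```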

Hermann Peifer
03-19-08 11:59 PM

