Code Comments
Hi,

I am kind of depressed :-) I want to split XML files larger than 2 GB into smaller chunks. As I don't want to end up with billions of files, I want the split files to have a configurable size, e.g. 250 MB. Each file should be well-formed, with an exact copy of the header (and the footer that closes the header) from the original file. Furthermore, a table should be generated where I can see that file X was separated into part N, with a timestamp:

Table:
Originalfilename|Name of PartN|Size of PartN|Timestamp

The original XML files look like this:
<?xml ...>
<Headerelement with some infos to be copied 1to1>
<OfferInfo>
<OfferID></OfferID>
...
</OfferInfo>
<OfferInfo>
<OfferID></OfferID>
...
</OfferInfo>
<OfferInfo>
<OfferID></OfferID>
...
</OfferInfo>
</Headerelement>

All in all I ended up reading the XML processing docs for gawk, but it seems I am lacking some deeper programming skills.
Could someone please help?

Thx
Malapha
Malapha wrote:
> Hi,
>
> I am kind of depressed :-) I want to split XML files larger than
> 2 GB into smaller chunks. As I don't want to end up with billions of
> files, I want the split files to have a configurable size, e.g.
> 250 MB. Each file should be well-formed, with an exact copy of the
> header (and the footer that closes the header) from the original
> file. Furthermore, a table should be generated where I can see that
> file X was separated into part N, with a timestamp:

A nice and well described little homework with clear requirements.

I'd abstain from splitting the file according to file size in MB and suggest a simpler measure for splitting instead, like the number of XML blocks or the number of lines.

> Table:
> Originalfilename|Name of PartN|Size of PartN|Timestamp
>
> The original XML files look like this:
> <?xml ...>
> <Headerelement with some infos to be copied 1to1>
> <OfferInfo>
> <OfferID></OfferID>
> ...
> </OfferInfo>
> <OfferInfo>
> <OfferID></OfferID>
> ...
> </OfferInfo>
> <OfferInfo>
> <OfferID></OfferID>
> ...
> </OfferInfo>
> </Headerelement>
>
> All in all I ended up reading the XML processing docs for gawk,
> but it seems I am lacking some deeper programming skills.

Given your data above you can solve all of that with basic awk pattern matching capabilities, no deeper skills required. What have you tried so far?

> Could someone please help?

Since, apparently, you don't have a complex XML structure, the use of xgawk seems unnecessary. The quick way I'd go would be...

Save everything in a variable until you match the /Headerelement/.
Write that header to a file whose name contains a variable as number.
Write everything until the end of the block /<\/OfferInfo>/ to the file whose name contains a variable as number, while counting lines.
If the number of lines exceeds some constant value, write the constant trailer, close() the file, and increase the variable that counts the files.

To create a separate table, just write out the information you already have to a file with a fixed name (use awk's date functions or, if unavailable, an external date program and getline).

If you have concrete questions feel free to ask. (Or did you mean for us to write that program for you?)

Janis

> Thx
> Malapha
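The steps Janis lists can be sketched in plain awk roughly as follows. This is only an illustration, not code from the thread: the script name split_by_lines.awk, the part and table file names, the maxlines variable and the five-offer sample file are all assumptions, and it expects "pretty print" input with the root start tag on one line. The timestamp comes from an external date program read with getline, and the size column is a character count, which matches the byte count only for single-byte data.

```shell
# Hypothetical sample input: 5 offers, one tag per line.
{
  echo '<?xml version="1.0"?>'
  echo '<Headerelement version="1">'
  for i in 1 2 3 4 5; do
    printf '<OfferInfo>\n<OfferID>%s</OfferID>\n</OfferInfo>\n' "$i"
  done
  echo '</Headerelement>'
} > bigfile.xml

cat > split_by_lines.awk <<'EOF'
# maxlines, the part names and the table name are illustrative assumptions.
BEGIN {
    if (maxlines == "") maxlines = 6
    table  = "split_table.txt"
    footer = "</Headerelement>"
}

# Collect everything up to and including the root start tag as the header.
!have_header {
    header = header $0 ORS
    if ($0 ~ /<Headerelement/) have_header = 1
    next
}

# The original footer is written once per part instead, so skip it here.
index($0, footer) { next }

{
    if (out == "") {                  # no part open: start a new one
        out = sprintf("%s.part%04d.xml", FILENAME, ++part)
        printf "%s", header > out
        size = length(header)
        lines = 0
    }
    print > out
    size += length($0) + 1            # +1 for the newline
    lines++
}

# Only split on a block boundary, once enough lines were written.
/<\/OfferInfo>/ && lines >= maxlines { finish() }

END { if (out != "") finish() }

# Write the trailer, close the part, and log one table row for it.
function finish(    cmd, stamp) {
    print footer > out
    close(out)
    size += length(footer) + 1
    cmd = "date '+%Y-%m-%d %H:%M:%S'"
    cmd | getline stamp
    close(cmd)
    print FILENAME "|" out "|" size "|" stamp > table
    out = ""
}
EOF

awk -v maxlines=6 -f split_by_lines.awk bigfile.xml
cat split_table.txt
```

With the sample data this should leave three part files (two offers, two offers, one offer) plus split_table.txt with one Originalfilename|Name of PartN|Size of PartN|Timestamp row per part. Calling close() on each finished part keeps the number of simultaneously open files at one, which matters once a 2 GB input produces many parts.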
On 17 Mar., 13:37, Janis Papanagnou <Janis_Papanag...@hotmail.com> wrote:
> Malapha wrote:
>
> A nice and well described little homework with clear requirements.
>
> I'd abstain from splitting the file according to file sizes in MB
> but suggest to take a more simple measure for splitting, like number
> of XML-blocks or number of lines.

I totally agree with you. Using the number of XML blocks as an approximation for file size is good enough.
The problem I see is that line counts only work when the XML document contains line breaks (EOL). If the input file has no EOL, I run into problems. So I came to the solution of using the xgawk framework in order to make use of the "node hopping" technique. This lets me count the offers without having to solve the problems mentioned above.

> Given your data above you can solve that all with basic awk pattern
> matching capabilities, no deeper skills required. What have you tried
> so far?

As I come from the VBA world, I tried to get familiar with awk. What I do have is a theoretical solution in the form of a structured process diagram :-)

Copy Header and Footer from Original to Var
Set Start_Offer = First Offer (from <Offer> to </Offer> )
Set End_Transaction = 0
Set Part = 0
Set FileSize = 0
Set MaxFileSize = 250
while not Start_Offer < EOF(OriginalXMLFile)
    Part=part+1
    Open NewFile OriginalXMLFileName + Part + ".xml"
    Paste Header from Var to NewFile
    While filesize(NewFile)<MaxFileSize do
        Copy Offer (Start_Offer) from OriginalXMLFile to NewFile
        Start_Offer=Start_Offer + 1
    wend
    Paste Footer from Var to NewFile
wend

I am right now trying to translate this into awk. Please don't ask me how far I am, it's frustrating :-)

> Save everything in a variable until you match the /Headerelement/.
> Write that header to a file whose name contains a variable as number.
> Write everything until the end of the block /<\/OfferInfo>/ to the
> file whose name contains a variable as number, while counting lines.
> If the number of lines exceeded some constant value write the constant
> trailer, and close() the file, and increase the variable that counts
> the files. To create a separate table just write out the information
> you already have to a file with fixed name (use awk's date functions
> or if unavailable an external date program and getline).

This looks very much like my approach, so I am quite happy that I am not that wrong...
Malapha wrote:
> On 17 Mar., 13:37, Janis Papanagnou <Janis_Papanag...@hotmail.com>
> wrote:
>
> I totally agree with you. Using the number of XML blocks as an
> approximation for file size is good enough.
> The problem I see is that line counts only work when the XML document
> contains line breaks (EOL). If the input file has no EOL, I run into
> problems. So I came to the solution of using the xgawk framework in
> order to make use of the "node hopping" technique. This lets me count
> the offers without having to solve the problems mentioned above.

Missing line breaks could be added via a preprocessing step with

$ xmllint --format bigfile.xml > formatted_bigfile.xml

I don't know how xmllint performs with a 2G file. On my old laptop, I run out of memory when trying to reformat a 600M file. However, you might have better hardware available. There are also other XML command line tools around that have some "pretty print" option. xmlstarlet is one of them.

Before going deeper into xgawk: try to reformat the file as suggested above. Then, as suggested by Janis, you could make use of regular awk for the splitting task.

Hermann
Malapha wrote:
> I totally agree with you. Using the number of XML blocks as an
> approximation for file size is good enough.
You may use the variable XMLLEN in xgawk.
Accumulate XMLLEN and you get a very precise
approximation for file size.
xgawk -lxml '{l+= XMLLEN};END{print l}' mssecure.xml
2309349
ll mssecure.xml
-rw-r--r-- 1 kahrs users 2309349 12. Jan 2005 mssecure.xml
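For readers without xgawk, the same accumulate-and-compare idea can be tried in plain awk on line-oriented input. This is a swapped-in technique, not XMLLEN itself: it assumes every line ends in a single newline character and that awk runs in the C locale so that length() counts bytes. The file sample.xml is a made-up example.

```shell
# Hypothetical three-line sample file.
printf '<a>\n<b>text</b>\n</a>\n' > sample.xml

# Sum the length of every line plus one byte for its newline.
awk_bytes=$(LC_ALL=C awk '{ n += length($0) + 1 } END { print n }' sample.xml)

# Reference value from wc; tr strips the padding some wc versions add.
wc_bytes=$(wc -c < sample.xml | tr -d ' ')

echo "awk: $awk_bytes  wc: $wc_bytes"
```

Both numbers should agree, which is what makes a running total of this kind usable as a chunk size check while the file is being copied.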
Malapha wrote:
>
> As I come from the VBA world, I tried to get familiar with awk. What
> I do have is a theoretical solution in the form of a structured
> process diagram :-)
>
> Copy Header and Footer from Original to Var
> Set Start_Offer = First Offer (from <Offer> to </Offer> )
> Set End_Transaction = 0
> Set Part = 0
> Set FileSize = 0
> Set MaxFileSize = 250
> while not Start_Offer < EOF(OriginalXMLFile)
>     Part=part+1
>     Open NewFile OriginalXMLFileName + Part + ".xml"
>     Paste Header from Var to NewFile
>     While filesize(NewFile)<MaxFileSize do
>         Copy Offer (Start_Offer) from OriginalXMLFile to NewFile
>         Start_Offer=Start_Offer + 1
>     wend
>     Paste Footer from Var to NewFile
> wend
>
> I am right now trying to translate this into awk. Please don't ask me
> how far I am, it's frustrating :-)
>
>
Below is one solution for splitting into well-formed chunks, here: 100
OfferInfos each. There might be better solutions (I just don't know
them ;-) It only works if the XML data is in "pretty print" format, like
the sample data you posted.
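$ cat split_bigfile.awk

BEGIN { new_chunk = 1 ; size = 100 }

# Line 1 is the XML declaration, line 2 the root start tag.
# substr($1,2) assumes the root tag has attributes, so that $1 is
# "<Headerelement" without a trailing ">".
NR == 1 { header = $0 ; next }
NR == 2 { header = header ORS $0 ; footer = "</" substr($1,2) ">" ; next }

# Copy everything except the original footer into the current chunk
$0 !~ footer {
        if (new_chunk) {
                outfile = "chunk" sprintf("%07d", num) ".xml"
                print header > outfile
                new_chunk = 0
        }
        print > outfile
}

# After every size-th OfferInfo: write the footer, start a new chunk.
# Pre-increment, so that the first chunk also gets exactly size blocks.
/<\/OfferInfo>/ {
        num = int(++count/size)
        if (num > prev_num) {
                print footer > outfile
                close(outfile)
                new_chunk = 1
        }
        prev_num = num
}

# Avoid double footers, if at the end: count%size = 0
END { if (!new_chunk) print footer > outfile }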
On 17 Mar., 21:33, Jürgen Kahrs <Juergen.KahrsDELETET...@vr-web.de> wrote:
> Malapha wrote:
>
> You may use the variable XMLLEN in xgawk.
> Accumulate XMLLEN and you get a very precise
> approximation for file size.
>
> xgawk -lxml '{l+= XMLLEN};END{print l}' mssecure.xml
> 2309349
>
> ll mssecure.xml
> -rw-r--r-- 1 kahrs users 2309349 12. Jan 2005 mssecure.xml

Wow, this looks promising!!
Unfortunately I caught a cold, so I am not able to test it in the
office. But thanks a lot. I'll try to combine it with the suggestions
Hermann made.
Thanks
Malapha
On 18 Mar., 00:01, Hermann Peifer <pei...@gmx.eu> wrote:
> Malapha wrote:
>
> Below is one solution for splitting into well-formed chunks, here: 100
> OfferInfos each. There might be better solutions (I just don't know
> them ;-) It only works if the XML data is in "pretty print" format, like
> the sample data you posted.
>
> $ cat split_bigfile.awk
>
> BEGIN { new_chunk = 1 ; size = 100 }
>
> NR == 1 { header = $0 ; next }
> NR == 2 { header = header ORS $0 ; footer = "</" substr($1,2) ">" ; next }
>
> $0 !~ footer {
>         if (new_chunk) {
>                 outfile = "chunk" sprintf("%07d", num) ".xml"
>                 print header > outfile
>                 new_chunk = 0
>         }
>         print > outfile
> }
>
> /<\/OfferInfo>/ {
>         num = int(++count/size)
>         if (num > prev_num) {
>                 print footer > outfile
>                 close(outfile)
>                 new_chunk = 1
>         }
>         prev_num = num
> }
>
> END { if (!new_chunk) print footer > outfile }

Hermann, you are great. As I have written to Jürgen, I am unable to
check it right now. But as soon as possible I'll give it a try!!
Thanks again
Malapha
Malapha wrote:
> On 18 Mar., 00:01, Hermann Peifer <pei...@gmx.eu> wrote:
>
> Hermann, you are great. As I have written to Jürgen, I am unable to
> check it right now. But as soon as possible I'll give it a try!!
>
> Thanks again
> Malapha
Here is the xgawk version of the same script. It works fine for me with
your test data. No pre-formatting of bigfile.xml is needed. However, for
this solution you need to have xgawk and the library xmlcopy.awk
available. In xmlcopy.awk, I made a minor change at the very end:
# printf( "%s", token )
return token
Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml
$ cat split_big_xmlfile.awk

# Include the xmlcopy.awk library
# Make sure that xgawk finds it
@include xmlcopy

BEGIN { new_chunk = 1 ; size = 100 }

# Remember XML declaration of bigfile.xml
XMLDECLARATION { header = XmlCopy() }

# Remember root element, define the footer
XMLSTARTELEM && XMLDEPTH == 1 {
        header = header XmlCopy()
        footer = "</" XMLSTARTELEM ">"
}

# Only care about OfferInfos and their children
XMLPATH ~ /OfferInfo/ {
        if (new_chunk) {
                outfile = "chunk" sprintf("%07d", num) ".xml"
                printf "%s", header > outfile
                new_chunk = 0
        }
        printf "%s", XmlCopy() > outfile
}

# Decide if it's time to add a footer and start a new chunk
XMLENDELEM == "OfferInfo" {
        num = int(++count/size)
        if (num > prev_num) {
                print footer > outfile
                new_chunk = 1
        }
        prev_num = num
}

# Avoid double footers, if at the end: count%size = 0
END { if (!new_chunk) print footer > outfile }
Hermann Peifer wrote:
>
> Here is the xgawk version of the same script. It works fine for me with
> your test data. No pre-formatting of bigfile.xml is needed. However, for
> this solution you need to have xgawk and the library xmlcopy.awk
> available. In xmlcopy.awk, I made a minor change at the very end:
>
> # printf( "%s", token )
> return token
>
> Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml
>
> $ cat split_big_xmlfile.awk
>
> # Include the xmlcopy.awk library
> # Make sure that xgawk finds it
> @include xmlcopy
>
> BEGIN { new_chunk = 1 ; size = 100 }
>
> # Remember XML declaration of bigfile.xml
> XMLDECLARATION { header = XmlCopy() }
>
> # Remember root element, define the footer
> XMLSTARTELEM && XMLDEPTH == 1 {
>         header = header XmlCopy()
>         footer = "</" XMLSTARTELEM ">"
> }
>
> # Only care about OfferInfos and their children
> XMLPATH ~ /OfferInfo/ {
>         if (new_chunk) {
>                 outfile = "chunk" sprintf("%07d", num) ".xml"
>                 printf "%s", header > outfile
>                 new_chunk = 0
>         }
>         printf "%s", XmlCopy() > outfile
> }
>
> # Decide if it's time to add a footer and start a new chunk
> XMLENDELEM == "OfferInfo" {
>         num = int(++count/size)
>         if (num > prev_num) {
>                 print footer > outfile
>                 new_chunk = 1
>         }
>         prev_num = num
> }
>
> # Avoid double footers, if at the end: count%size = 0
> END { if (!new_chunk) print footer > outfile }
Just in case someone is interested, here is yet another version of
the same script, where the chunk size is defined in bytes (and checked
via XMLLEN, as suggested by Juergen).
Hermann
$ cat split_big_xmlfile.awk

# Include the xmlcopy.awk library
# Make sure that xgawk finds it
@include xmlcopy

# new_chunk can be anything here, but not 0 or ""
# size value defines approx. chunk size in bytes
# you might have to worry about XMLCHARSET (or not)
BEGIN {
        new_chunk = size = 250000000
        # XMLCHARSET = "ISO-8859-1"
}

# Remember original XML declaration
XMLDECLARATION { header = XmlCopy() }

# Remember original root element, define the footer
XMLSTARTELEM && XMLDEPTH == 1 {
        header = header ORS XmlCopy() ORS
        footer = ORS "</" XMLSTARTELEM ">"
}

# Only care about these elements and their children
XMLPATH ~ /OfferInfo/ {
        if (new_chunk) {
                outfile = "chunk" sprintf("%07d", num) ".xml"
                printf "%s", header > outfile
                new_chunk = ""
        }
        printf "%s", XmlCopy() > outfile
        chunk_size += XMLLEN
}

# Decide if it's time to add a footer and start with a new chunk
XMLENDELEM == "OfferInfo" && chunk_size > size {
        printf "%s", footer > outfile
        num++
        new_chunk = "it's time now"
        chunk_size = 0
}

END {
        # Footer for the last chunk, but avoid double footers
        if (!new_chunk) printf "%s", footer > outfile

        # Print XMLERRORs, if any. Xgawk is somewhat lazy in
        # this respect and might silently die, if you don't have:
        if (XMLERROR)
                printf("XMLERROR '%s' at row %d col %d len %d\n",
                        XMLERROR, XMLROW, XMLCOL, XMLLEN)
}