
xgawk: merging XML files
bruce_phipps@my-deja.com

2007-04-01, 6:56 pm

I'm trying to use xgawk to merge about 50 XML files into a single XML
file.
I first concatenate all the files:

cat test*.xml > bigfile.xml

So far, so good, but bigfile.xml contains 50 XML declarations and DTD
statements. I only want to keep the first of these.
So, I am looking to use xgawk to strip out all XML declarations except
the ones at the top of the file.
Any ideas?
I guess I need to use xgawk's XMLDECLARATION variable to do a match.

Thanks
Bruce

Patrick TJ McPhee

2007-04-01, 6:56 pm

In article <1175449116.471685.127160@b75g2000hsg.googlegroups.com>,
<bruce_phipps@my-deja.com> wrote:
% I'm trying to use xgawk to merge about 50 XML files into a single XML
% file.
% I first concatenate all the files:
%
% cat test*.xml > bigfile.xml

This seems like a bad idea to me. Even if you get rid of the XML
declarations, this won't be a valid XML file. I guess you could
do something like this (with any old awk)

BEGIN {
    print "<?xml version=\"1.0\"?>"
    print "<!DOCTYPE whatever ...>"
    print "<outer_element>"
}
END {
    print "</outer_element>"
}
# copy every line that holds neither a DOCTYPE nor an XML declaration
!/<!DOCTYPE/ && !/<\?xml/

then run it like this
awk -f xmlcat.awk test*.xml > bigfile.xml

Of course, that won't work if you have local entity declarations.
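
A minimal sketch of trying this out on two made-up input files. The directory name xmlcat_demo, the file contents and the DTD name x.dtd are assumptions for illustration only, and the DOCTYPE line is left out of the wrapper because the real one is not known here.

```shell
# Keep the experiment in its own directory (hypothetical name).
mkdir -p xmlcat_demo

# Two tiny stand-in input files.
printf '%s\n' '<?xml version="1.0"?>' '<!DOCTYPE sect1 SYSTEM "x.dtd">' \
    '<sect1>one</sect1>' > xmlcat_demo/test1.xml
printf '%s\n' '<?xml version="1.0"?>' '<!DOCTYPE sect1 SYSTEM "x.dtd">' \
    '<sect1>two</sect1>' > xmlcat_demo/test2.xml

# The wrapper script: one declaration and one outer element around everything.
cat > xmlcat_demo/xmlcat.awk <<'EOF'
BEGIN {
    print "<?xml version=\"1.0\"?>"
    print "<outer_element>"
}
# Copy every line that holds neither a DOCTYPE nor an XML declaration.
!/<!DOCTYPE/ && !/<\?xml/
END { print "</outer_element>" }
EOF

awk -f xmlcat_demo/xmlcat.awk xmlcat_demo/test1.xml xmlcat_demo/test2.xml \
    > xmlcat_demo/bigfile.xml
cat xmlcat_demo/bigfile.xml
```

With these inputs the result has a single declaration, then both <sect1> elements inside <outer_element>. Because whole lines are dropped, this only holds when the declarations sit on lines of their own.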
--

Patrick TJ McPhee
North York Canada
ptjm@interlog.com
Jürgen Kahrs

2007-04-01, 6:56 pm

bruce_phipps@my-deja.com wrote:

> I'm trying to use xgawk to merge about 50 XML files into a single XML
> file.
> I first concatenate all the files:
>
> cat test*.xml > bigfile.xml


Concatenating large numbers of XML files is one
of the most requested features. You probably know that
the resulting file containing all the individual
XML files is not a well-formed file. xgawk will be
able to read this file anyway.

> So far, so good, but bigfile.xml contains 50 XML declarations and DTD
> statements. I only want to keep the first of these.
> So, I am looking to use xgawk to strip out all XML declarations except
> the ones at the top of the file.


XMLgawk is actually not so good a tool if you want
to produce a one-to-one copy of an original file.
I guess that the solution posted by Patrick McPhee
is actually the best you can get if you need a
one-to-one copy in the first place. Patrick's
solution will run with any AWK around, I guess
even with oawk on Solaris.
Manuel Collado

2007-04-01, 6:56 pm

bruce_phipps@my-deja.com wrote:
> I'm trying to use xgawk to merge about 50 XML files into a single XML
> file.
> I first concatenate all the files:
>
> cat test*.xml > bigfile.xml
>
> So far, so good, but bigfile.xml contains 50 XML declarations and DTD
> statements. I only want to keep the first of these.
> So, I am looking to use xgawk to strip out all XML declarations except
> the ones at the top of the file.
> Any ideas?
> I guess I need to use xgawk's XMLDECLARATION variable to do a match.


Please clarify what you want. If the input files are:

<?xml ... ?>
<!DOCTYPE root1 ....>
<root1>
....
</root1>

....

<?xml ... ?>
<!DOCTYPE rootN ....>
<rootN>
....
</rootN>

What is the desired result of their concatenation?

--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
bruce_phipps@my-deja.com

2007-04-02, 9:57 pm

On 1 Apr, 21:40, Manuel Collado <m.coll...@fake.fi.upm.es> wrote:
> bruce_phi...@my-deja.com wrote:
>
>
>
>
> Please clarify what you want. If the input files are:
>
> <?xml ... ?>
> <!DOCTYPE root1 ....>
> <root1>
> ...
> </root1>
>
> ...
>
> <?xml ... ?>
> <!DOCTYPE rootN ....>
> <rootN>
> ...
> </rootN>
>
> What is the desired result of their concatenation?
>
> --
> Manuel Collado -http://lml.ls.fi.upm.es/~mcollado


All 50 XML files are of identical format:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sect1 ..." "xsolbook35.dtd">
<sect1>
...
</sect1>

It should be possible to combine these as I mentioned and still get a
well-formed XML file.
The end result of the concatenation is a large technical document.
Bruce

Thomas Weidenfeller

2007-04-02, 9:57 pm

bruce_phipps@my-deja.com wrote:
> All 50 XML files are of identical format:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE sect1 ..." "xsolbook35.dtd">
> <sect1>
> ...
> </sect1>
>
> It should be possible to combine these as I mentioned and still get a
> well-formed XML file.
> The end result of the concatenation is a large technical document.


You did not answer the question. Do you want

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sect1 ..." "xsolbook35.dtd">
<sect1>
...
</sect1>
<sect1>
...
</sect1>
<sect1>
...
</sect1>

?

If yes, that would not be a well-formed XML document. A well-formed XML
document must have only one root element. So you need to study the DTD
to figure out what is a possible root element and if and how <sect1>
elements are allowed to be nested inside a root element. You might need
to change to another DTD if the one you currently use does not allow
<sect1>s to be nested in some root element (which might or might not
itself be a <sect1>).

/Thomas
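
The single-root rule can be illustrated with a rough sketch that counts top-level elements. This is not a real XML parser: it assumes every tag starts and ends on one line and that no comment or CDATA section contains "<" or ">". The directory roots_demo, the sample contents and the <document> wrapper are made up for the example.

```shell
mkdir -p roots_demo

# Stand-in for a naive concatenation: one declaration, three sections.
printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>' \
    '<sect1>a</sect1>' '<sect1>b</sect1>' '<sect1>c</sect1>' \
    > roots_demo/flat.xml

# The same kind of sections wrapped in one (hypothetical) outer element.
printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>' '<document>' \
    '<sect1>a</sect1>' '<sect1>b</sect1>' '</document>' \
    > roots_demo/wrapped.xml

cat > roots_demo/count-roots.awk <<'EOF'
{
    line = $0
    # Walk through every "<...>" piece on the line.
    while (match(line, /<[^>]*>/)) {
        tag  = substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART + RLENGTH)
        if (tag ~ /^<[?!]/) continue             # declarations, comments, PIs
        if (tag ~ /^<\//) { depth--; continue }  # end tag: one level up
        if (depth == 0) roots++                  # start tag at top level
        if (tag !~ /\/>$/) depth++               # not self-closing: one level down
    }
}
END { print roots + 0 }   # a well-formed document gives exactly 1
EOF

awk -f roots_demo/count-roots.awk roots_demo/flat.xml    > roots_demo/flat.count
awk -f roots_demo/count-roots.awk roots_demo/wrapped.xml > roots_demo/wrapped.count
```

The flat concatenation reports three top-level elements, the wrapped file reports one. Whether the wrapper element is actually permitted is still a question for the DTD.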


bruce_phipps@my-deja.com

2007-04-02, 9:57 pm

On 2 Apr, 10:35, Thomas Weidenfeller <nob...@ericsson.invalid> wrote:
> bruce_phi...@my-deja.com wrote:
>
>
>
> You did not answer the question. Do you want
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE sect1 ..." "xsolbook35.dtd">
> <sect1>
> ...
> </sect1>
> <sect1>
> ...
> </sect1>
> <sect1>
> ...
> </sect1>
>
> ?
>
> If yes, that would not be a well-formed XML document. A well-formed XML
> document must have only one root element. So you need to study the DTD
> to figure out what is a possible root element and if and how <sect1>
> elements are allowed to be nested inside a root element. You might need
> to change to another DTD if the one you currently use does not allow
> <sect1>s to be nested in some root element (which might or might not
> itself be a <sect1>).
>
> /Thomas


Yes, you are correct. <sect1> is the root element and I should only
have one according to the DTD.
Bruce


Juergen Kahrs

2007-04-02, 9:57 pm

bruce_phipps@my-deja.com wrote:

>
> Yes, you are correct. <sect1> is the root element and I should only
> have one according to the DTD.


Thomas already pointed out that in this case, you
cannot simply use the same DTD anymore. The content
of the large file has a different "tag structure" than
the original files. Hence, the DTD must be different.
Manuel Collado

2007-04-02, 9:57 pm

bruce_phipps@my-deja.com wrote:
> On 2 Apr, 10:35, Thomas Weidenfeller <nob...@ericsson.invalid> wrote:
>
> Yes, you are correct. <sect1> is the root element and I should only
> have one according to the DTD.


If your "xsolbook35.dtd" defines a, say, <document> tag that can contain
<sect1> elements, then the merged whole document could be:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document "..." "xsolbook35.dtd">
<document>
<sect1>
...
</sect1>
<sect1>
...
</sect1>
<sect1>
...
</sect1>
</document>

This can probably be achieved with only regular awk (no need to use xgawk).
Perhaps with (untested):

--- file: concat-sect1.awk -----------
BEGIN {
print "<?xml version='1.0' encoding='UTF-8'?>"
print "<!DOCTYPE document '...' 'xsolbook35.dtd'>"
print "<document>"
}

/<\?/, /?>/ { next } # skip xml declarations
/<!/, />/ { next } # skip DOCTYPE declarations

{ print } # copy sections

END { print "</document" }
--------------------------------------

The global command could be:

awk -f concat-sect1.awk *.xml


Regards,
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
Manuel Collado

2007-04-02, 9:57 pm

Manuel Collado wrote:
> ...
> END { print "</document" }


Ooops! Should be:

END { print "</document>" }

--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
bruce_phipps@my-deja.com

2007-04-02, 9:57 pm


> If your "xsolbook35.dtd" defines a, say, <document> tag that can contain
> <sect1> elements, then the merged whole document could be:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE document "..." "xsolbook35.dtd">
> <document>
> <sect1>
> ...
> </sect1>
> <sect1>
> ...
> </sect1>
> <sect1>
> ...
> </sect1>
> </document>
>


Yes. This is the structure of my document.

> This can probably be achieved with only regular awk (no need to use xgawk).
> Perhaps with (untested):
>
> --- file: concat-sect1.awk -----------
> BEGIN {
> print "<?xml version='1.0' encoding='UTF-8'?>"
> print "<!DOCTYPE document '...' 'xsolbook35.dtd'>"
> print "<document>"
> }
>
> /<\?/, /?>/ { next } # skip xml declarations
> /<!/, />/ { next } # skip DOCTYPE declarations
>
> { print } # copy sections
>
> END { print "</document>" }
> --------------------------------------
>
> The global command could be:
>
> awk -f concat-sect1.awk *.xml
>
> Regards,
> --
> Manuel Collado -http://lml.ls.fi.upm.es/~mcollado

I get this error:

awk: syntax error near line 8
awk: bailing out near line 8

Bruce

Ed Morton

2007-04-02, 9:57 pm

bruce_phipps@my-deja.com wrote:
<snip>
> I get this error:
>
> awk: syntax error near line 8
> awk: bailing out near line 8
>
> Bruce
>


Then you're using old, broken awk. Use a modern awk such as GNU awk
(gawk), New awk (nawk) or /usr/xpg4/bin/awk on Solaris.

Ed.
bruce_phipps@my-deja.com

2007-04-02, 9:57 pm

On 2 Apr, 13:25, Ed Morton <mor...@lsupcaemnt.com> wrote:
> bruce_phi...@my-deja.com wrote:
>
> <snip>
>
>
>
>
> Then you're using old, broken awk. Use a modern awk such as GNU awk
> (gawk), New awk (nawk) or /usr/xpg4/bin/awk on Solaris.
>
> Ed.


I think the code was missing a backslash:

/<\?/, /?>/ { next } # skip xml declarations

It's now running but I am not getting the output I expected.
I'm not sure the line-based approach of awk will work if my XML flows
across lines.

Bruce

Manuel Collado

2007-04-02, 9:57 pm

bruce_phipps@my-deja.com wrote:
> ...
> I think the code was missing a backslash:
>
> /<\?/, /?>/ { next } # skip xml declarations


OK, please try:

/<\?/, /\?>/ { next } # skip xml declarations

>
> It's now running but I am not getting the output I expected.
> I'm not sure the line-based approach of awk will work if my XML flows
> across lines.


It should work as long as both the XML declaration and the DOCTYPE
declaration occupy whole lines (one or several). I.e., they don't share the
same input line with other markup.

Please post a simplified data sample if it still doesn't work.

Regards.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
bruce_phipps@my-deja.com

2007-04-03, 3:59 am


>
> It should work as long as both the XML declaration and the DOCTYPE
> declaration occupy whole lines (one or several). I.e., they don't share the
> same input line with other markup.
>


Unfortunately the XML declaration and DOCTYPE are on the same line as
other markup.

Bruce
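
For declarations that share a line with other markup, one plain-awk workaround is to delete them in place instead of skipping whole lines. This is only a sketch under assumptions: each declaration must start and end on the same line, the DOCTYPE must have no internal subset (no "[ ... ]" part containing ">"), and any other processing instructions are deleted too. The directory strip_demo and the sample input are made up.

```shell
mkdir -p strip_demo

# Stand-in input where the declarations share a line with markup.
printf '%s\n' \
    '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE sect1 SYSTEM "xsolbook35.dtd"><sect1>' \
    'text' \
    '</sect1>' > strip_demo/a.xml

cat > strip_demo/strip-decl.awk <<'EOF'
{
    # Delete any "<?...?>" piece that starts and ends on this line.
    gsub(/<\?[^>]*\?>/, "")
    # Delete any "<!DOCTYPE ...>" piece that starts and ends on this line.
    gsub(/<!DOCTYPE[^>]*>/, "")
    # Print whatever markup is left, dropping lines that became empty.
    if ($0 != "") print
}
EOF

awk -f strip_demo/strip-decl.awk strip_demo/a.xml > strip_demo/out.xml
cat strip_demo/out.xml
```

The remaining <sect1> markup survives on its line. The output would still need the single declaration and the outer element printed around it, as in the earlier BEGIN and END blocks.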


Manuel Collado

2007-04-03, 7:58 am

bruce_phipps@my-deja.com wrote:
>
> Unfortunately the XML declaration and DOCTYPE are on the same line as
> other markup.


Ok. In that case xgawk can really help. Please try the following:


--- file: concat-sect1.awk -------------
@include xmllib

BEGIN {
print "<?xml version='1.0' encoding='UTF-8'?>"
print "<!DOCTYPE document SYSTEM 'xsolbook35.dtd'>"
print "<document>"
XMLMODE = 1
XMLCHARSET = "UTF-8"
ORS = ""
}

# skip XML and DOCTYPE declarations
XMLDECLARATION || XMLSTARTDOCT || XMLENDDOCT || XMLUNPARSED { next }

{ print } # copy sections

END { print "\n</document>\n" }
----------------------------------------

NOTES:
1.- xmllib.awk must be in the AWKPATH or in the current directory
2.- You may want to add XMLPROCINST to the list of skipped tokens
3.- Invoke the script as:

xgawk -f concat-sect1.awk *.xml


Hope this helps.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
bsh

2007-04-10, 3:56 am

"Manuel Collado" <m.coll...@fake.fi.upm.es> wrote:
> <bruce_phi...@my-deja.com>:
> Ok. In that case xgawk can really help. Please try the following:


I am not skilled enough in XML to be able to say that the
following suggestion is apropos, but I'd like to have the
comments of especially those involved with xmlgawk concerning
the following XSLT to merge XML files. I hope it is not too
naive to ask: Can xmlgawk invoke XSLT?

"Merging two XML files" by Oliver Becker:
http://www2.informatik.hu-berlin.de/~obecker/XSLT/

=Brian

Jürgen Kahrs

2007-04-10, 6:56 pm

bsh wrote:

> I am not skilled enough in XML to be able to say that the
> following suggestion is apropos, but I'd like to have the
> comments of especially those involved with xmlgawk concerning
> the following XSLT to merge XML files. I hope it is not too
> naive to ask: Can xmlgawk invoke XSLT?


The function "system" in AWK can invoke any
command line. This command line may invoke
XSLT. So, invoking is possible, but what for ?
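
A small sketch of both ways of running an external command from awk. The "tr" call is only a stand-in so the example runs anywhere; an XSLT processor command line would take its place, which is an assumption and not something tested here. The directory system_demo and the input file are made up.

```shell
mkdir -p system_demo
printf '%s\n' '<sect1>one</sect1>' > system_demo/in.xml

cat > system_demo/call.awk <<'EOF'
BEGIN {
    # Stand-in command: "tr" merely upper-cases its input.
    cmd = "tr a-z A-Z < system_demo/in.xml"

    # system() runs the command and returns its exit status;
    # the command's own output goes straight to standard output.
    status = system(cmd)
    print "exit status: " status

    # "cmd | getline" reads the command's output back line by line,
    # so the awk program can keep working on the result.
    n = 0
    while ((cmd | getline line) > 0)
        n++
    close(cmd)
    print "lines read back: " n
}
EOF

awk -f system_demo/call.awk > system_demo/out.txt
cat system_demo/out.txt
```

system() is enough when the command's output should simply pass through. The getline form is the one to use when the script needs to process what the command produced.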

> "Merging two XML files" by Oliver Becker:
> http://www2.informatik.hu-berlin.de/~obecker/XSLT/


This understanding of "merging" is very special:

That means equivalent nodes appear only once in the output.
Two element nodes are treated as equivalent if their local names
are equal, their namespace-uris are equal, and all their
attributes are equal. Two text nodes are treated equal if
their normalized content is equal. The same rule applies
for comments and processing instructions.

This algorithm is similar to the creation of a "diff" tool:
Read two XML files and print equivalent parts only once,
print differing parts each. Several "diff" tools have been
suggested here in this newsgroup. It is not trivial to
implement this in XMLgawk, but it _is_ possible.
bsh

2007-04-11, 6:57 pm

On Apr 10, 8:25 am, Jürgen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:
> bsh wrote:
> The function "system" in AWK can invoke any command line....


Perhaps I was musing on the possibility for xmlgawk to invoke a
coprocess with XSLT processed XML and then continue to work on the
result.

> This understanding of "merging" is very special: ....


Okay. Thanks for the introduction, Juergen -- I was unaware of
the special definition given to the "merge" in XML, and this gives
me a pointer for additional research into the matter. Also, thanks
for the effort in making xmlgawk available to the programming
community.

=Brian
