Rough content share of XML files
| Hermann Peifer 2007-03-03, 3:57 am |
| Hi All,
Would you think that the below code is good enough to calculate the
*rough* content share of XML files (with typically 1 XML element per line)?
Missing is of course the content length of XML attributes, but this is
not so relevant in my current case.
$ cat ../simple_xmlstats.awk
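BEGIN {
    FS  = "[<>]"    # split each line at the tag delimiters
    OFS = "\t"
}
{
    total   += length($0)    # every character on the line
    content += length($3)    # text between the opening and the closing tag
}
END {
    print FILENAME, content "/" total, sprintf("%.0f", 100*content/total) "%"
}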
$ cat AL_meta.xml
<?xml version='1.0' ?>
<!DOCTYPE airbase SYSTEM 'airbase.dtd'>
<airbase>
<country>
<country_name>ALBANIA</country_name>
<country_iso_code>AL</country_iso_code>
<country_eu_member>N</country_eu_member>
</country>
</airbase>
$ awk -f ../simple_xmlstats.awk AL_meta.xml
AL_meta.xml 10/222 5%
Thanks in advance, Hermann
PS and OT:
If I wanted to pay an AWK expert for an answer to the above (and
other, potentially more serious ;-) AWK questions: where would you
recommend that I post a small ad? Probably not in this group?
| |
| Vassilis 2007-03-03, 6:57 pm |
|
Hermann Peifer wrote:
> Hi All,
>
> Would you think that the below code is good enough to calculate the
> *rough* content share of XML files (with typically 1 XML element per line)?
If the bulk of your work (and most of your input) is typical of the
script shown, I'd say that this is enough. If you suspect there's
more to it than this, I'd suggest talking to people that frequent this
NG, like Juergen Kahrs or Andrew Schorr, people who have implemented
xmlgawk. xmlgawk should be a good tool for you.
Have a look here [http://tinyurl.com/ymqa2b]
| |
| Hermann Peifer 2007-03-03, 6:57 pm |
| Vassilis wrote:
>
> If the bulk of your work (and most of your input) is typical of the
> script shown, I'd say that this is enough. If you suspect there's
> more to it than this, I'd suggest talking to people that frequent this
> NG, like Juergen Kahrs or Andrew Schorr, people who have implemented
> xmlgawk. xmlgawk should be a good tool for you.
> Have a look here [http://tinyurl.com/ymqa2b]
>
Thanks for the hint. I've indeed been there once, months ago. After some
clicking I ended up on a download page for source code. This is not my
turf. I *really* appreciate Open Source Software (however in compiled
format ;-)
Hermann
| |
| Vassilis 2007-03-03, 6:57 pm |
|
Hermann Peifer wrote:
>
> Thanks for the hint. I've indeed been there once, months ago. After some
> clicking I ended up on a download page for source code. This is not my
> turf. I *really* appreciate Open Source Software (however in compiled
> format ;-)
>
> Hermann
If you're using Linux, compiling should be a breeze, something like
./configure && make && make install
If you're running windose, you should try Cygwin.
IIRC, not long ago, someone posted a binary for win here.
Try googling for it.
| |
| Janis Papanagnou 2007-03-03, 6:57 pm |
| Hermann Peifer wrote:
> Hi All,
>
> Would you think that the below code is good enough to calculate the
> *rough* content share of XML files (with typically 1 XML element per line)?
I've read about your reluctance to try xgawk, but want to remark that
there's even a single-line solution for extracting the character data
from XML streams in the xgawk documentation. It's worth having a look
at that tool.
Janis
| |
| Jürgen Kahrs 2007-03-03, 6:57 pm |
| Vassilis wrote:
>
> If the bulk of your work (and most of your input) is typical of the
> script shown, I'd say that this is enough. If you suspect there's
> more to it than this, I'd suggest talking to people that frequent this
> NG, like Juergen Kahrs or Andrew Schorr, people who have implemented
> xmlgawk. xmlgawk should be a good tool for you.
> Have a look here [http://tinyurl.com/ymqa2b]
That's true. Arnold Robbins (maintainer of gawk) also
has worked his way through large amounts of XML (docbook)
data with the help of gawk (not xgawk). I think he is
running a consulting business that includes text processing
projects (he has written tons of O'Reilly books).
| |
| Jürgen Kahrs 2007-03-03, 6:57 pm |
| Vassilis wrote:
> Hermann Peifer wrote:
>
> If you're using Linux, compiling should be a breeze, something like
> ./configure && make && make install
> If you're running windose, you should try Cygwin.
Right. On Linux, it is really that easy (but remember
to do the installation as "root").
> IIRC, not long ago, someone posted a binary for win here.
> Try googling for it.
Right. Someone posted the link to xgawk.exe here.
Others reported ways to build it on Cygwin.
| |
| Hermann Peifer 2007-03-03, 6:57 pm |
| Janis Papanagnou wrote:
> Hermann Peifer wrote:
>
> I've read about your reluctance to try xgawk, but want to remark that
> there's even a single-line solution for extracting the character data
> from XML streams in the xgawk documentation. It's worth having a look
> at that tool.
>
> Janis
>
I would say that I normally keep a "respectful distance" between me
and source code. Thanks for the encouragement, I will give it a try and
compile the xgawk source code. After some 5 years of compilation
abstinence ;-)
Can someone comment on my OT question in return?
;-) Hermann
| |
| Janis Papanagnou 2007-03-03, 6:57 pm |
| Hermann Peifer wrote:
> Janis Papanagnou wrote:
>
> Can someone comment on my OT question in return?
In some way I already commented on that in my e-mail. :-)
Janis
>
> ;-) Hermann
| |
| Jürgen Kahrs 2007-03-03, 6:57 pm |
| Hermann Peifer wrote:
>
> Can someone comment on my OT question in return?
I have done so. Read my earlier posting.
| |
| Hermann Peifer 2007-03-03, 9:57 pm |
| Jürgen Kahrs wrote:
> Hermann Peifer wrote:
>
>
> I have done so. Read my earlier posting.
Not really, if you read my 2 original questions ;-)
The idea is not to spend my own money, but some of my employer's money.
Which in turn is the European taxpayers' money, i.e. your money, in the
widest sense.
In simple terms, this is the reason why I have to follow a thick book of
rules. The book is thicker than you would imagine.
In essence, the rules are saying that I have to "consult the market" in
a "transparent process" ensuring "fair competition" in order to select
"the bid offering the best value for money." (...)
To do all this in the context of a small AWK consulting contract is my
challenge of the month.
Hermann
http://www.eea.europa.eu/organisation/organigram.html
| |
| Hermann Peifer 2007-03-03, 9:57 pm |
| Hermann Peifer wrote:
> Janis Papanagnou wrote:
>
> I would say that I normally keep a "respectful distance" between me
> and source code. Thanks for the encouragement, I will give it a try and
> compile the xgawk source code. After some 5 years of compilation
> abstinence ;-)
>
Via a hint from another posting in this thread: I managed to find a
pre-compiled :-) binary for Cygwin.
$ xgawk --version
Extensible GNU Awk 3.1.5 (build beta.20060401) with dynamic loading, and
with statically-linked extensions
Copyright (C) 1989, 1991-2005 Free Software Foundation.
Can you send me your single line solution so that I can give it a try as
xgawk beta tester? ;-)
Thanks in advance, Hermann
| |
| Janis Papanagnou 2007-03-03, 9:57 pm |
| Hermann Peifer wrote:
> Hermann Peifer wrote:
>
>
> Via a hint from another posting in this thread: I managed to find a
> pre-compiled :-) binary for Cygwin.
>
> $ xgawk --version
> Extensible GNU Awk 3.1.5 (build beta.20060401) with dynamic loading, and
> with statically-linked extensions
> Copyright (C) 1989, 1991-2005 Free Software Foundation.
>
> Can you send me your single line solution so that I can give it a try as
> xgawk beta tester? ;-)
>
> Thanks in advance, Hermann
Sure thing. Here it is...
@load xml
XMLCHARDATA { printf $0 }
Taken from the xgawk online docs (chapter 3.4) found here:
http://home.vrweb.de/~juergen.kahrs...ML/xmlgawk.html
Janis
| |
| Hermann Peifer 2007-03-04, 3:56 am |
| Janis Papanagnou wrote:
> Hermann Peifer wrote:
>
> Sure thing. Here it is...
>
> @load xml
> XMLCHARDATA { printf $0 }
>
> Taken from the xgawk online docs (chapter 3.4) found here:
> http://home.vrweb.de/~juergen.kahrs...ML/xmlgawk.html
>
> Janis
Thanks.
By the way:
Is XMLgawk just a synonym for xgawk or is there a difference?
Hermann
| |
| Jürgen Kahrs 2007-03-04, 7:57 am |
| Hermann Peifer wrote:
>
> Not really, if you read my 2 original questions ;-)
OK, I thought that you just wanted the problem
to be solved quickly.
> The idea is not to spend my own money, but some of my employer's money.
> Which in turn is the European taxpayers' money, i.e. your money, in the
> widest sense.
Damn, this sounds really frightening.
> In essence, the rules are saying that I have to "consult the market" in
> a "transparent process" ensuring "fair competition" in order to select
> "the bid offering the best value for money." (...)
Ensuring "fair competition" is hard to do.
If you post an advertisement here in comp.lang.awk,
I guess that the posting will be tolerated.
Such postings appear here from time to time.
I know that some readers are really interested
in such postings.
>
> By the way:
> Is XMLgawk just a synonym for xgawk or is there a difference?
The x in xgawk indicates the extension libraries in xgawk.
XMLgawk is the name of the XML extension.
| |
| Hermann Peifer 2007-03-04, 6:57 pm |
| Jürgen Kahrs <Juergen.KahrsDELETETHIS@vr-web.de> wrote:
> Vassilis wrote:
>
> (...)
Once I have the father of XMLgawk on the wire, I dare to ask this
(potentially silly) question around processing data in/from XML files:
Fact is that with this (and other) stylesheets:
http://converters.eionet.europa.eu/xsl/EoI_tabsep.xsl
... plus xsltproc, plus a decent Linux server computer, I extract
millions of data elements into a tab separated format. Within 10-15
minutes. I then do some more or less intelligent data quality checking
and calculations with GAWK. In my current context, I do not have to
export the results back into XML.
Question: Do I then really have to bother about compiling xgawk from
sources (compiling source code is anyway not my hobby, as I mentioned
earlier ;-) What would be the added value of xgawk, given my limited use
case?
Hermann
| |
| Jürgen Kahrs 2007-03-04, 6:57 pm |
| Hermann Peifer wrote:
> ... plus xsltproc, plus a decent Linux server computer, I extract
> millions of data elements into a tab separated format. Within 10-15
> minutes. I then do some more or less intelligent data quality checking
> and calculations with GAWK. In my current context, I do not have to
> export the results back into XML.
If this pipeline is already established and produces
good results, then you should probably not change it
anymore. Checking quality of tab separated data with
GAWK is a classic application of AWK principles.
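As a rough illustration of that kind of check (a sketch only: the
three-column layout station / component / value and the sample rows are
made up for this example, they are not taken from the real data):

```shell
# Hypothetical tab-separated sample: station, component, value.
# Row 2 lacks a field, row 3 has a non-numeric value.
report=$(printf 'AL0001\tPM10\t41.5\nAL0002\tSO2\nAL0003\tPM10\tn/a\n' |
awk -F '\t' '
    # Wrong number of columns: report it and skip the value check.
    NF != 3 { print "line " FNR ": expected 3 fields, got " NF; next }
    # Third column should look like a plain decimal number.
    $3 !~ /^-?[0-9]+(\.[0-9]+)?$/ { print "line " FNR ": non-numeric value \"" $3 "\"" }
')
echo "$report"
```

Each rule is a pattern plus an action, so further checks (ranges, code
lists, duplicates) can be added as extra lines without restructuring
anything.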
> Question: Do I then really have to bother about compiling xgawk from
> sources (compiling source code is anyway not my hobby, as I mentioned
No, not necessary. If the input file of GAWK is already
tab separated, then there is no need for xgawk.
> earlier ;-) What would be the added value of xgawk, given my limited use
> case?
If you used XMLgawk, then the complete pipeline could
be implemented in one XMLgawk script. No need to learn
XSL, no need to use xsltproc (and the libraries it depends
on). The whole pipeline will probably operate much
faster. You have to decide on your own if you really need
the speed.
Perhaps, the tab separated data file could also become
obsolete. This depends on what you are doing with the
tab separated file.
| |
| Hermann Peifer 2007-03-06, 3:58 am |
| Jürgen Kahrs <Juergen.KahrsDELETETHIS@vr-web.de> wrote:
> Hermann Peifer wrote:
>
>
> If this pipeline is already established and produces
> good results, then you should probably not change it
> anymore. Checking quality of tab separated data with
> GAWK is a classic application of AWK principles.
>
>
> No, not necessary. If the input file of GAWK is already
> tab separated, then there is no need for xgawk.
>
Thanks for confirming my vague feeling that my current approach is not
necessarily stupid ;-)
>
> If you used XMLgawk, then the complete pipeline could
> be implemented in one XMLgawk script. No need to learn
> XSL, no need to use xsltproc (and the libraries it depends
> on). The whole pipeline will probably operate much
> faster. You have to decide on your own if you really need
> the speed.
>
> Perhaps, the tab separated data file could also become
> obsolete. This depends on what you are doing with the
> tab separated file.
I will give up xgawk (beta) testing now. I promise to get back to it
once I am able to install it via rpm -i, apt-get install etc. or, even
better, once it is part of a standard Linux distribution.
Hermann
| |
| fips152 2007-03-18, 4:00 am |
| In article <45E936A7.5040708@gmx.eu>, Hermann Peifer <peifer@gmx.eu> wrote:
> Hi All,
>
> Would you think that the below code is good enough to calculate the
> *rough* content share of XML files (with typically 1 XML element per line)?
Be careful if any of the XML data has ">" in the content.
If the XML data you will be processing is machine-generated it
may not be an issue, but the following text IS well-formed XML:
<airbase>
<country>
<country_region>EURASIA > EUROPE > BALKANS</country_region>
<country_name>ALBANIA</country_name>
<country_iso_code>AL</country_iso_code>
<country_eu_member>N</country_eu_member>
</country>
</airbase>
It might be better (simpler, cleaner, more reliable) to process
the PYX output of an XML parser. PYX has five types of lines:
( start-tag
) end-tag
A attribute
- character data (content)
? processing instruction
Total the (length($0) - 1) values for each "-" line and divide
that total by the size of the XML file.
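A minimal sketch of that summing step, assuming the PYX lines are
already available from whatever converter is at hand. The PYX below was
typed by hand for a fragment of the sample file, and the variable names
are only for illustration:

```shell
# Hand-written PYX for part of the sample; a real run would read the
# output of an XML-to-PYX converter instead.
pyx='(country
(country_name
-ALBANIA
)country_name
(country_iso_code
-AL
)country_iso_code
)country'

# Add up the character data: each "-" line minus its one-character marker.
content=$(printf '%s\n' "$pyx" |
    awk '/^-/ { n += length($0) - 1 } END { print n + 0 }')
echo "$content"
```

Because the parser has already separated markup from content, a literal
">" in the data ends up on a "-" line like any other character. Note
that PYX writes a newline inside character data as the two characters
\n, so each one is counted as two by this sum.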
| |
| Hermann Peifer 2007-03-18, 4:00 am |
| fips152 wrote:
> Be careful if any of the XML data has ">" in the content.
Although the XML files I am currently working with are
machine-generated, I found some literal ">" in the character data:
<statistic_name>AOT40 (c > 80 ug/m3,3 months,corrected)</statistic_name>
In other cases, characters like "<", "(", ")" and "µ" are encoded:
<component_name>Particulate matter < 10 µm
(aerosol)</component_name>
The ">" in the character data of statistic_name will indeed disturb my
simple awk script logic somewhat. However, I am currently only
interested in the *rough* results of my script.
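For a slightly less rough count, one option is to delete everything
that looks like a tag and measure what is left, instead of relying on
the third field. This is a sketch under assumptions, not a parser: it
still expects each tag to start and end on the same line, it would be
fooled by a ">" inside an attribute value, and it knows nothing about
comments or CDATA sections. The names by_field, by_strip and text are
just for this example:

```shell
# The line quoted above, with a literal ">" inside the character data.
line='<statistic_name>AOT40 (c > 80 ug/m3,3 months,corrected)</statistic_name>'

# Original logic: split on "<" and ">" and take the third field.
by_field=$(printf '%s\n' "$line" | awk -F '[<>]' '{ print length($3) }')

# Variant: remove the tags, trim the indentation, measure the rest.
by_strip=$(printf '%s\n' "$line" | awk '{
    text = $0
    gsub(/<[^>]*>/, "", text)          # drop every <...> run
    gsub(/^[ \t]+|[ \t]+$/, "", text)  # ignore leading and trailing blanks
    print length(text)
}')

echo "third field: $by_field, tags stripped: $by_strip"
```

On this line the third-field approach counts 9 characters (everything
up to the literal ">"), while the stripped version counts all 39.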
In any case: Thanks for the hint. This is good to know.
Hermann