Code Comments
To help me get my head around XMLGAWK can someone solve the following. I have a XMLTV data file from which I want to extract certain data and write to a tab-delimited flat file.

The XMLTV data is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<tv>
<programme start="20041218204000 +1000" stop="20041218225000 +1000" channel="Network TEN Brisbane"><title>The Frighteners</title><sub-title/><desc>A psychic private detective, who consorts with deceased souls, becomes engaged in a mystery as members of the town community begin dying mysteriously.</desc><rating system="ABA"><value>M</value></rating><length units="minutes">130</length><category>Horror</category></programme>
<programme start="20041218080000 +1000" stop="20041218083000 +1000" channel="Network TEN Brisbane"><title>Worst Best Friends</title><sub-title>Better Than Glen</sub-title><desc>Life's like that for Roger Thesaurus - two of his best friends are also his worst enemies!</desc><rating system="ABA"><value>C</value></rating><length units="minutes">30</length><category>Children</category></programme>
</tv>

The flat file needs to be as follows:

channel<tab>programme start<tab>length<tab>title<tab>description<tab>rating value

So the first record would read:

Network TEN Brisbane<tab>2004-12-18 hh:mm<tab>130<tab>The Frighteners<tab>A psychic private detective, who consorts with deceased souls, becomes engaged in a mystery as members of the town community begin dying mysteriously.<tab>M

The start time, which I've just shown as hh:mm, is obviously derived from the start record but the +1000 does not need to be taken into consideration.

Thanks for any and all help.
Jürgen Kahrs wrote:
> William James wrote:
> ...
>
> The pattern/action is the basis of course.
> But I sometimes wonder how difficult it is
> for users of XMLgawk that they *also* have
> to keep track of the sequence of XMLSTARTELEM,
> XMLCHARDATA and XMLENDELEM.
>
> For example, when writing the doc, I wonder
> if I have to keep explaining this fragment
> each time it occurs:
>
> XMLCHARDATA { data = $0 }
> XMLENDELEM == "desc" { desc = data }
This recurrent code fragment can be simplified by using the 'xmllib.awk'
library (included in XMLgawk). It becomes:
EE == "desc" { ... CDATA ... }
(use CDATA instead of 'desc')
--
To reply by e-mail, please remove the extra dot
in the given address: m.collado -> mcollado
Manuel Collado wrote:
> Jürgen Kahrs wrote:
> ...
>
> This recurrent code fragment can be simplified by using the 'xmllib.awk'
> library (included in XMLgawk). It becomes:
>
> EE == "desc" { ... CDATA ... }
>
> (use CDATA instead of 'desc')
Sorry. Better explained as --> It becomes:
EE == "desc" { desc = CDATA }
or use CDATA directly instead of 'desc'.
--
To reply by e-mail, please remove the extra dot
in the given address: m.collado -> mcollado
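For readers without XMLgawk at hand, the event sequence discussed above (start of element, character data, end of element) can be sketched with Python's xml.parsers.expat module. This is only an illustration of the pattern, not XMLgawk or xmllib.awk code: the function name collect and the variable names are made up here, and clearing the buffer at each start tag is an addition of this sketch so that an empty element such as <sub-title/> does not pick up text left over from the previous element.

```python
import xml.parsers.expat

def collect(xml_text, wanted):
    """Return {element name: text} for each element name in `wanted`."""
    found = {}
    data = []  # character data seen since the most recent start tag

    def start(name, attrs):
        # Forget text that belonged to an earlier element.
        data.clear()

    def chars(text):
        # Expat may deliver one run of text in several pieces.
        data.append(text)

    def end(name):
        # The text is only complete once the end tag arrives.
        if name in wanted:
            found[name] = "".join(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return found

sample = ("<programme><title>The Frighteners</title><sub-title/>"
          "<desc>A mystery.</desc></programme>")
result = collect(sample, {"title", "sub-title", "desc"})
print(result)
```

The end-of-element handler plays the role of the XMLENDELEM rule: it is the first point at which the element's text is known to be complete.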
William James <w_a_x_man@yahoo.com> wrote:
>
> I hope you're right. Since it seemed that a mere shell could do it so
> easily, I hesitated to post an awk solution.
Hee, hee... You can read up on Expat binding for Bash at
http://home.eol.ca/~parkw/park-january.html
near the bottom. It's part of tutorials I wrote. The other articles are at
http://linuxgazette.net/108/park.html
http://linuxgazette.net/109/park.html
--
William Park <opengeometry@yahoo.ca>
Open Geometry Consulting, Toronto, Canada
Linux solution for data processing.
Hello,
> To help me get my head around XMLGAWK can someone solve the following.
> I have a XMLTV data file from which I want to extract certain data and
> write to a tab-delimited flat file.
I think this one will do:
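BEGIN { XMLMODE=1 }
XMLSTARTELEM == "programme" {
  channel = XMLATTR["channel"]
  start = XMLATTR["start"]
}
# Remember the latest character data; the end-of-element rules pick it up.
XMLCHARDATA { data = $0 }
XMLENDELEM == "desc" { desc = data }
XMLENDELEM == "length" { leng = data }
XMLENDELEM == "title" { title = data }
# <value> sits inside <rating>; read the rating instead of hardcoding "M".
XMLENDELEM == "value" { rating = data }
XMLENDELEM == "programme" {
  # The comma makes print insert OFS (a space by default) between date and time.
  print channel "\t" substr(start,1,4) "-" substr(start,5,2) "-" substr(start,7,2),
        substr(start,9,2) ":" substr(start,11,2) "\t" leng "\t" title "\t" desc "\t" rating
}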
> So the first record would read:
>
> Network TEN Brisbane<tab>2004-12-18 hh:mm<tab>130<tab>The
> Frighteners<tab>A psychic private detective, who consorts with
> deceased souls, becomes engaged in a mystery as members of the town
> community begin dying mysteriously.<tab>M
I have tested the script above and the output
is as expected.
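As an independent cross-check of the requested record layout, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. It is not part of the XMLgawk solution; the function name xmltv_to_tsv is made up for this example, and the sample is the data from the original question without its XML declaration.

```python
import xml.etree.ElementTree as ET

def xmltv_to_tsv(xml_text):
    """Yield one tab-delimited line per <programme> element."""
    root = ET.fromstring(xml_text)
    for prog in root.iter("programme"):
        start = prog.get("start")  # e.g. "20041218204000 +1000"
        # Slice YYYYMMDDhhmm out of the timestamp; the offset is ignored.
        when = "%s-%s-%s %s:%s" % (start[0:4], start[4:6], start[6:8],
                                   start[8:10], start[10:12])
        yield "\t".join([
            prog.get("channel"),
            when,
            prog.findtext("length", default=""),
            prog.findtext("title", default=""),
            prog.findtext("desc", default=""),
            prog.findtext("rating/value", default=""),
        ])

sample = (
    '<tv>'
    '<programme start="20041218204000 +1000" stop="20041218225000 +1000" '
    'channel="Network TEN Brisbane"><title>The Frighteners</title><sub-title/>'
    '<desc>A psychic private detective, who consorts with deceased souls, '
    'becomes engaged in a mystery as members of the town community begin '
    'dying mysteriously.</desc><rating system="ABA"><value>M</value></rating>'
    '<length units="minutes">130</length><category>Horror</category></programme>'
    '<programme start="20041218080000 +1000" stop="20041218083000 +1000" '
    'channel="Network TEN Brisbane"><title>Worst Best Friends</title>'
    '<sub-title>Better Than Glen</sub-title>'
    "<desc>Life's like that for Roger Thesaurus - two of his best friends "
    'are also his worst enemies!</desc><rating system="ABA"><value>C</value>'
    '</rating><length units="minutes">30</length><category>Children</category>'
    '</programme></tv>'
)

lines = list(xmltv_to_tsv(sample))
for line in lines:
    print(line)
```

Unlike the event-driven approach, this reads the whole document into memory first, which is fine for a small listing but may not suit very large XMLTV files.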
> The start time, which I've just shown as hh:mm, is obviously derived
> from the start record but the +1000 does not need to be taken into
> consideration.
The script currently ignores it, but it could also be
taken into account.
William Park was faster in supplying a solution,
but I think the AWK solution is more readable.
Anyway, William is a tough competitor.
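If the +1000 ever does need to be taken into account, one possible approach is sketched below in Python. It assumes the suffix is a UTC offset in +HHMM form, which is how it reads in the sample data; the function name parse_xmltv_time is made up for this example.

```python
from datetime import datetime, timezone

def parse_xmltv_time(stamp):
    """Parse a timestamp such as '20041218204000 +1000' (offset assumed to be +HHMM)."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S %z")

local = parse_xmltv_time("20041218204000 +1000")
utc = local.astimezone(timezone.utc)

local_text = local.strftime("%Y-%m-%d %H:%M")  # wall-clock time as broadcast
utc_text = utc.strftime("%Y-%m-%d %H:%M")      # same instant with the offset applied
print(local_text, utc_text)
```

For the flat file as requested, only the wall-clock form is needed, so ignoring the offset gives the same result.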
William James wrote:
> I hope you're right. Since it seemed that a mere shell could do it so
> easily, I hesitated to post an awk solution.
Thanks for posting your solution in pure GNU Awk.
It is interesting to see the amount of overhead
needed to process XML. When comparing our solutions,
do you think that the simplification which XMLgawk
introduces justifies the effort of extending GNU Awk
once more ? Does XMLgawk's approach of signaling the
occurrence of a tag with special variables make sense
to you ?
> (How do you keep this #!@% google from removing indentation?)
Is it really Google who removes the indentation ?
When I post a script (not via Google), someone replaces
two leading blanks with one leading blank. This is
not the same as what happened to your script.
Jürgen Kahrs wrote:
> Thanks for posting your solution in pure GNU Awk.
In Kernighan's "One True Awk".
> It is interesting to see the amount of overhead
> needed to process XML.
I already had and frequently used the functions Match()
and _match(), so the only bothersome thing was writing
bookends().
> When comparing our solutions,
> do you think that the simplification which XMLgawk
> introduces justifies the effort of extending GNU Awk
> once more ?
I don't know; I'm not very familiar with gawk.
> Does XMLgawk's approach of signaling the
> occurrence of a tag with special variables make sense
> to you ?
Yes. It works well with the <test> { <actions> } pairs
of awk.
>
> Is it really Google who removes the indentation ?
Before Google changed its newsgroup handling recently,
the leading blanks were kept.