XMLGAWK help required
| RipBurn 2004-12-20, 8:55 am |
| To help me get my head around XMLGAWK, can someone solve the following?
I have an XMLTV data file from which I want to extract certain data and
write to a tab-delimited flat file.
The XMLTV data is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<tv><programme start="20041218204000 +1000" stop="20041218225000
+1000" channel="Network TEN Brisbane"><title>The
Frighteners</title><sub-title/><desc>A psychic private detective, who
consorts with deceased souls, becomes engaged in a mystery as members
of the town community begin dying mysteriously.</desc><rating
system="ABA"><value>M</value></rating><length
units="minutes">130</length><category>Horror</category></programme><programme
start="20041218080000 +1000" stop="20041218083000 +1000"
channel="Network TEN Brisbane"><title>Worst Best
Friends</title><sub-title>Better Than Glen</sub-title><desc>Life's
like that for Roger Thesaurus - two of his best friends are also his
worst enemies!</desc><rating
system="ABA"><value>C</value></rating><length
units="minutes">30</length><category>Children</category></programme></tv>
The flat file needs to be as follows:
channel<tab>programme start<tab>length<tab>title<tab>description<tab>rating value
So the first record would read:
Network TEN Brisbane<tab>2004-12-18 hh:mm<tab>130<tab>The
Frighteners<tab>A psychic private detective, who consorts with
deceased souls, becomes engaged in a mystery as members of the town
community begin dying mysteriously.<tab>M
The start time, which I've just shown as hh:mm, is obviously derived
from the start record but the +1000 does not need to be taken into
consideration.
Thanks for any and all help.
| |
| Manuel Collado 2004-12-20, 8:55 am |
| Jürgen Kahrs wrote:
> William James wrote:
> ...
>
> The pattern/action is the basis of course.
> But I sometimes wonder how difficult it is
> for users of XMLgawk that they *also* have
> to keep track of the sequence of XMLSTARTELEM,
> XMLCHARDATA and XMLENDELEM.
>
> For example, when writing the doc, I wonder
> if I have to keep explaining this fragment
> each time it occurs:
>
> XMLCHARDATA { data = $0 }
> XMLENDELEM == "desc" { desc = data }
This recurrent code fragment can be simplified by using the 'xmllib.awk'
library (included in XMLgawk). It becomes:
EE == "desc" { ... CDATA ... }
(use CDATA instead of 'desc')
--
To reply by e-mail, please remove the extra dot
in the given address: m.collado -> mcollado
| |
| Manuel Collado 2004-12-20, 3:55 pm |
| Manuel Collado wrote:
> Jürgen Kahrs wrote:
> ...
>
> This recurrent code fragment can be simplified by using the 'xmllib.awk'
> library (included in XMLgawk). It becomes:
>
> EE == "desc" { ... CDATA ... }
>
> (use CDATA instead of 'desc')
Sorry. Better explained as --> It becomes:
EE == "desc" { desc = CDATA }
or use CDATA directly instead of 'desc'.
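
The idiom under discussion (remember the latest character data, then assign it when the element closes) can be tried out without XMLgawk. The following is only a sketch in plain awk: the event names STARTELEM, CHARDATA and ENDELEM, the tab-separated stream format and the function name collect_desc are all invented for illustration and merely stand in for XMLSTARTELEM, XMLCHARDATA and XMLENDELEM.

```shell
# collect_desc is a hypothetical helper. It reads one simulated parser event
# per line, in the form "EVENT<tab>payload", and prints the text of each
# <desc> element when the enclosing <programme> closes.
collect_desc() {
    awk -F '\t' '
        $1 == "STARTELEM"                    { data = "" }    # forget text from earlier elements
        $1 == "CHARDATA"                     { data = $2 }    # remember the latest text
        $1 == "ENDELEM" && $2 == "desc"      { desc = data }  # the text belonged to <desc>
        $1 == "ENDELEM" && $2 == "programme" { print desc; desc = "" }
    '
}

# Simulated events for one programme; prints "A psychic private detective".
printf 'STARTELEM\tprogramme\nSTARTELEM\ttitle\nCHARDATA\tThe Frighteners\nENDELEM\ttitle\nSTARTELEM\tdesc\nCHARDATA\tA psychic private detective\nENDELEM\tdesc\nENDELEM\tprogramme\n' | collect_desc
```

Clearing data on every start event is a design choice of this sketch: an empty element such as <sub-title/> produces no character data, so without the reset it would inherit the text of the previous element.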
--
To reply by e-mail, please remove the extra dot
in the given address: m.collado -> mcollado
| |
| Jürgen Kahrs 2004-12-22, 8:55 am |
| Hello,
> To help me get my head around XMLGAWK can someone solve the following.
> I have a XMLTV data file from which I want to extract certain data and
> write to a tab-delimited flat file.
I think this one will do:
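BEGIN { XMLMODE=1 }
XMLSTARTELEM == "programme" {
    channel = XMLATTR["channel"]
    start = XMLATTR["start"]
}
# Remember the most recent character data; the end-element rules below pick it up.
XMLCHARDATA { data = $0 }
XMLENDELEM == "desc" { desc = data }
XMLENDELEM == "length" { leng = data }
XMLENDELEM == "title" { title = data }
XMLENDELEM == "value" { rating = data }   # rating value, e.g. M or C
XMLENDELEM == "programme" {
    # The comma in print inserts OFS (a space) between the date and the time.
    print channel "\t" substr(start,1,4) "-" substr(start,5,2) "-" substr(start,7,2),
          substr(start,9,2) ":" substr(start,11,2) "\t" leng "\t" title "\t" desc "\t" rating
}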
> So the first record would read:
>
> Network TEN Brisbane<tab>2004-12-18 hh:mm<tab>130<tab>The
> Frighteners<tab>A psychic private detective, who consorts with
> deceased souls, becomes engaged in a mystery as members of the town
> community begin dying mysteriously.<tab>M
I have tested the script above and the output
is as expected.
> The start time, which I've just shown as hh:mm, is obviously derived
> from the start record but the +1000 does not need to be taken into
> consideration.
The script currently ignores it, but it could also be
taken into account.
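
As a rough sketch of how the start attribute can be taken apart, here is the same substr() arithmetic in plain awk, wrapped in two shell functions. The names format_start and start_offset are made up for this example and are not part of XMLgawk. The sketch assumes the timestamp always has the fixed layout YYYYMMDDhhmmss followed by a space and a +hhmm offset, as in the sample data.

```shell
# format_start (hypothetical helper): "20041218204000 +1000" -> "2004-12-18 20:40"
format_start() {
    awk -v start="$1" 'BEGIN {
        # Fixed-width fields: YYYY MM DD hh mm (the seconds are ignored).
        date = substr(start, 1, 4) "-" substr(start, 5, 2) "-" substr(start, 7, 2)
        time = substr(start, 9, 2) ":" substr(start, 11, 2)
        print date " " time
    }'
}

# start_offset (hypothetical helper): "20041218204000 +1000" -> "+10:00"
# Prints nothing when the timestamp carries no offset.
start_offset() {
    awk -v start="$1" 'BEGIN {
        n = split(start, part, " ")
        if (n == 2)
            print substr(part[2], 1, 3) ":" substr(part[2], 4, 2)
    }'
}

format_start "20041218204000 +1000"   # prints 2004-12-18 20:40
start_offset "20041218204000 +1000"   # prints +10:00
```

The offset is only split off and reformatted here; converting the local time to another time zone would need real date arithmetic, which this sketch does not attempt.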
William Park was faster in supplying a solution,
but I think the AWK solution is more readable.
Anyway, William is a tough competitor.
| |
| Jürgen Kahrs 2004-12-22, 8:55 am |
| William James wrote:
> I hope you're right. Since it seemed that a mere shell could do it so
> easily, I hesitated to post an awk solution.
Thanks for posting your solution in pure GNU Awk.
It is interesting to see the amount of overhead
needed to process XML. When comparing our solutions,
do you think that the simplification which XMLgawk
introduces justifies the effort of extending GNU Awk
once more? Does XMLgawk's approach of signaling the
occurrence of a tag with special variables make sense
to you?
> (How do you keep this #!@% google from removing indentation?)
Is it really Google that removes the indentation?
When I post a script (not via Google), something
replaces two leading blanks with one leading blank.
This is not the same thing that happened to your script.
| |
| William James 2004-12-22, 8:55 am |
|
Jürgen Kahrs wrote:
> Thanks for posting your solution in pure GNU Awk.
In Kernighan's "One True Awk".
> It is interesting to see the amount of overhead
> needed to process XML.
I already had and frequently used the functions Match()
and _match(), so the only bothersome thing was writing
bookends().
> When comparing our solutions,
> do you think that the simplification which XMLgawk
> introduces justifies the effort of extending GNU Awk
> once more ?
I don't know; I'm not very familiar with gawk.
> Does XMLgawk's approach of signaling the
> occurrence of a tag with special variables make sense
> to you ?
Yes. It works well with the <test> { <actions> } pairs
of awk.
>
> Is it really Google who removes the indentation ?
Before google changed its newsgroup handling recently,
the leading blanks were kept.
| |