XML::LibXML navigation
| Beginner 2007-01-11, 6:59 pm |
| Hi,
I have to do some sanity checks on a large xml file of addresses
(snip below). I have been using XML::LibXML and seem to have started
ok but I am struggling to navigate around a record.
In the sample data below you'll see some addresses with "DO NOT..."
in. I can locate them easily enough but I am struggling to navigate
back up the DOM to access the code so I can record the code with
faulty addresses.
Here is my effort. Can anyone help me either to move back up to the
right element node or catch the code node before I begin to loop
through the address line(s).
TIA,
Dp.
======= My Effort ==========
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $file = 'ADDRESS.XML';
open(FH,$file) or die "Can't open file $file: $!\n";
my $parser = XML::LibXML->new;
my $doc = $parser->parse_fh(\*FH);
my @results = $doc->findnodes('//address');
foreach my $i (@results) {
    my @addlines = $i->findnodes('//line');
    foreach my $l (@addlines) {
        if ($l->string_value =~ /\s+NOT\s+/) {
            my $p = $i->nodePath;
            $p .= '/code';
            print $p->nodeValue,"\t";
            print $l->string_value, "\t";
            print $l->string_value, "\n";
        }
    }
}
=============================
=========== Sample Data ==========
<?xml version = "1.0" encoding= "utf-8"?>
....snip
<address number="1016">
<code>B679OOO00</code>
<record_type>client</record_type>
<address_type>shipping</address_type>
<Postcode></Postcode>
<Country>GBR</Country>
<lines>
<line>DO NOT USE THIS CODE</line>
</lines>
</address>
<address number="1014">
<code>P982LUS00</code>
<record_type>client</record_type>
<address_type>shipping</address_type>
<Postcode>HR2 0AU</Postcode>
<Country>GBR</Country>
<lines>
<line>UPPER HOUSE FARM</line>
<line>BACTON</line>
<line>ESSEX</line>
<line>EX2 0AU</line>
</lines>
</address>
<address number="1333">
<code>A234ULE00</code>
<record_type>client</record_type>
<address_type>shipping</address_type>
<Postcode></Postcode>
<Country>AND</Country>
<lines>
<line>QUEENS HOUSE</line>
<line>1 BUCKINGHAM PALACE</line>
<line>LONDON WC2H</line>
<line>****NOT AT THIS ADDRESS ANY MORE.</line>
<line>***************</line>
</lines>
</address>
<address number="1018">
<code>A&amp;MPUB00</code>
<record_type>client</record_type>
<address_type>shipping</address_type>
<Postcode>PO19 8SQ</Postcode>
<Country>GBR</Country>
<lines>
<line>THE ATRIUM</line>
<line>SOUTHERN GATE</line>
<line>CHICHESTER</line>
<line>SUSSEX</line>
<line>PO19 8SQ</line>
</lines>
</address>
| |
| Randal L. Schwartz 2007-01-11, 6:59 pm |
| >>>>> "Beginner" == Beginner <dermot@sciencephoto.com> writes:
"Beginner"> I have to do some sanity checks on a large xml file of addresses
"Beginner"> (snip below). I have been using XML::LibXML and seem to have started
"Beginner"> ok but I am struggling to navigate around a record.
Take a look at the XML::XSH2 language, which uses XML::LibXML underneath,
but lets you write common operations using a meta-language tailored
to tree manipulation, interspersed with Perl for the heavy lifting.
--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
| |
| Rob Dixon 2007-01-12, 6:59 pm |
| Beginner wrote:
> Hi,
>
> I have to do some sanity checks on a large xml file of addresses
> (snip below). I have been using XML::LibXML and seem to have started
> ok but I am struggling to navigate around a record.
>
> In the sample data below you'll see some addresses with "DO NOT..."
> in. I can locate them easily enough but I am struggling to navigate
> back up the DOM to access the code so I can record the code with
> faulty addresses.
>
> Here is my effort. Can anyone help me either to move back up to the
> right element node or catch the code node before I begin to loop
> through the address line(s).
>
> TIA,
> Dp.
>
>
> ======= My Effort ==========
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> use XML::LibXML;
>
> my $file = 'ADDRESS.XML';
> open(FH,$file) or die "Can't open file $file: $!\n";
>
> my $parser = XML::LibXML->new;
> my $doc = $parser->parse_fh(\*FH);
>
> my @results = $doc->findnodes('//address');
>
> foreach my $i (@results) {
> my @addlines = $i->findnodes('//line');
> foreach my $l (@addlines) {
> if ($l->string_value =~ /\s+NOT\s+/) {
> my $p = $i->nodePath;
> $p .= '/code';
> print $p->nodeValue,"\t";
> print $l->string_value, "\t";
> print $l->string_value, "\n";
> }
> }
>
> }
[snip XML]
If I understand you correctly then all you need is
my @results = $doc->findnodes('/dataroot/address[contains(lines/line, "DO NOT USE")]');
foreach my $address (@results) {
    my $code = $address->findvalue('code');
    print $code, "\n";
}
which prints the code of all those addresses that have a line containing 'DO NOT
USE'. Is that what was required?
(Note that I've assumed a root node <dataroot>. You will need to change the node
name to the actual value.)
HTH,
Rob
| |
| Beginner 2007-01-12, 6:59 pm |
| On 12 Jan 2007 at 17:06, Rob Dixon wrote:
Hi Rob,
#!/usr/bin/perl
use XML::LibXML;
my $file = 'ADDRESS.XML';
open(FH,$file) or die "Can't open file $file: $!\n";
my $parser = XML::LibXML->new;
my $doc = $parser->parse_fh(\*FH);
my @codes = $doc->findnodes('//code');
my @lines = $doc->findnodes('//lines');
for (my $i = 0; $i < $#codes; ++$i) {
    #print $codes[$i]->string_value, "\t";
    my @add = $lines[$i]->childNodes;
    for ( my $a = 1; $a <$#add; ++$a) {
        if ($add[$a]->string_value =~ /\s+NOT\s+/) {
            print $codes[$i]->string_value,": ",$add[$a]->string_value,"\n";
        }
    }
}
>
> If I understand you correctly then all you need is
>
> my @results = $doc->findnodes('/dataroot/address[contains(lines/line, "DO NOT USE")]');
>
> foreach my $address (@results) {
> my $code = $address->findvalue('code');
> print $code, "\n";
> }
>
> which prints the code of all those addresses that have a line containing 'DO NOT
> USE'. Is that what was required?
>
Yes ...and no. I guess I want to print out the 'code' for any address
so that I can get the data corrected but I guess I would also like to
remove those records at the /dataroot/address level so they don't
appear in the file.
I spent a lot of time on this today as this looks like an excellent
parser and DOM navigator, but I struggled moving around.
In your example @results looks like it would contain references to
all the /lines/line data with DO NOT USE in the string_value. What I
have been struggling with is that this is also a reference to the record
as a whole and my navigation techniques are not working out. For
example whenever I used findnodes I was getting every code in the
file. I think now that was because I was using /dataroot/address as
the starting point.
Aside from CPAN, I would appreciate any other sources of info about
using libXML with Perl and XPath expressions. It is
whoppingly fast.
Thanx again,
Dp.
>
> Rob
| |
| Rob Dixon 2007-01-12, 6:59 pm |
| Beginner wrote:
> On 12 Jan 2007 at 17:06, Rob Dixon wrote:
> #!/usr/bin/perl
>
>
> my $file = 'ADDRESS.XML';
> open(FH,$file) or die "Can't open file $file: $!\n";
>
> my $parser = XML::LibXML->new;
> my $doc = $parser->parse_fh(\*FH);
>
> my @codes = $doc->findnodes('//code');
> my @lines = $doc->findnodes('//lines');
It's never a good idea to use the double-slash unless you really need it, as it
forces the XPath engine to search through the whole of the data for a matching
node name. If you are working with an awfully designed data structure and you
really have no idea where the nodes will appear then fine, but in this case you
can tell the software exactly where to look with
my @codes = $doc->findnodes('/dataroot/address/code');
my @lines = $doc->findnodes('/dataroot/address/lines');
> for (my $i = 0; $i < $#codes; ++$i) {
> #print $codes[$i]->string_value, "\t";
> my @add = $lines[$i]->childNodes;
> for ( my $a = 1; $a <$#add; ++$a) {
> if ($add[$a]->string_value =~ /\s+NOT\s+/) {
> print $codes[$i]->string_value,": ",$add[$a]-> string_value,"\n";
> }
> }
> }
This will probably work, but only coincidentally! You're relying on the
elements in @codes and @lines arrays being paired exactly, which will be the
case only if every <address> node contains exactly one <code> element and
exactly one <lines> element. This may well be the case, but isn't something you
should be assuming.
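One way to avoid depending on that pairing is to walk the <address> elements and query each one with relative paths, so a code and its lines always come from the same record. This is only a minimal sketch: it assumes the <dataroot> root element guessed earlier in this thread, and the inline sample, the flagged_lines helper and the /\bNOT\b/ test are illustrative choices rather than part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down sample modelled on the data in this thread (root name assumed).
my $xml = <<'XML';
<dataroot>
  <address number="1016">
    <code>B679OOO00</code>
    <lines><line>DO NOT USE THIS CODE</line></lines>
  </address>
  <address number="1014">
    <code>P982LUS00</code>
    <lines><line>UPPER HOUSE FARM</line><line>BACTON</line></lines>
  </address>
  <address number="1333">
    <code>A234ULE00</code>
    <lines>
      <line>QUEENS HOUSE</line>
      <line>****NOT AT THIS ADDRESS ANY MORE.</line>
    </lines>
  </address>
</dataroot>
XML

# Return [code, line text] pairs for every <line> that mentions NOT.
sub flagged_lines {
    my ($doc) = @_;
    my @hits;
    for my $address ($doc->findnodes('/dataroot/address')) {
        # Relative paths are evaluated from this <address> only.
        my $code = $address->findvalue('code');
        for my $line ($address->findnodes('lines/line')) {
            my $text = $line->textContent;
            push @hits, [$code, $text] if $text =~ /\bNOT\b/;
        }
    }
    return @hits;
}

my $doc  = XML::LibXML->new->parse_string($xml);
my @hits = flagged_lines($doc);
print "$_->[0]: $_->[1]\n" for @hits;
```

Because the code is looked up from the same <address> node as the line, an address with a missing or repeated <code> cannot shift the results of the records after it.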
Your code also does the same as mine, except that you print the address line
that was found to contain /\s+NOT\s+/ as well as the code of the address in
which it was found.
>
> Yes ...and no. I guess I want to print out the 'code' for any address
> so that I can get the data corrected but I guess I would also like to
> remove those records at the /dataroot/address level so they don't
> appear in the file.
You mean you want to produce a modified version of the original file with the
flagged address elements removed? Then you want XSLT, not Perl!
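If staying in Perl is preferred, the same DOM can also be edited in place with XML::LibXML. The following is a hedged sketch rather than a recommendation over XSLT: it assumes the <dataroot> root element, uses a substring test for "NOT" on every <line>, and the remove_flagged helper and the sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Small sample modelled on the thread's data (root element name assumed).
my $xml = <<'XML';
<dataroot>
  <address number="1016">
    <code>B679OOO00</code>
    <lines><line>DO NOT USE THIS CODE</line></lines>
  </address>
  <address number="1014">
    <code>P982LUS00</code>
    <lines><line>UPPER HOUSE FARM</line><line>BACTON</line></lines>
  </address>
</dataroot>
XML

# Detach every flagged <address> from the tree and return the removed codes.
sub remove_flagged {
    my ($doc) = @_;
    my @removed;
    my $xpath = '/dataroot/address[lines/line[contains(., "NOT")]]';
    # findnodes returns a complete list first, so unbinding nodes
    # inside the loop does not disturb the iteration.
    for my $address ($doc->findnodes($xpath)) {
        push @removed, $address->findvalue('code');
        $address->unbindNode;
    }
    return @removed;
}

my $doc     = XML::LibXML->new->parse_string($xml);
my @removed = remove_flagged($doc);
print "Removed: @removed\n";
print $doc->toString;    # the document without the flagged records
```

The list of removed codes can be written to a report for correction while the serialized document goes to a new file.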
> I spent a lot of time on this today as this looks like an excellent
> parser and DOM navigator but I struggled moving around.
It is. I'm very impressed myself.
> In your example @results looks like it would contain references to
> all the /lines/line data with DO NOT USE in the string_value.
No. The XPath expression
/dataroot/address[contains(lines/line, "DO NOT USE")]
indicates all <address> elements that have at least one <line> element
containing the string "DO NOT USE".
> What I have been struggling with is that this is also a reference to the record as
> a whole and my navigation techniques are not working out. For example
> whenever I used findnodes I was getting every code in the file. I think now
> that was because I was using /dataroot/address as the starting point.
I'm not sure what you mean. In my code @results is a list of all marked
<address> nodes. Which nodes are found by the findnodes method depends on the
XPath expression as well as the current context node. A path that begins with
a slash is absolute, so both $doc->findnodes('//code') and
$address->findnodes('//code') return all of the <code> elements in the data.
To stay inside one address the path must be relative: (in my code)
$address->findnodes('.//code') returns all of the <code> elements within that
address. I have used $address->findvalue('code') because I want the text value
of the node and I also want to look for a <code> child of the <address> node
instead of any <code> descendant.
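The effect of the context node is easiest to see by counting what different paths select when they are all evaluated from the same <address>. A small sketch, assuming a <dataroot> root element; the sample data and the count_nodes helper are hypothetical and exist only for this demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical two-record sample modelled on the data in this thread.
my $xml = <<'XML';
<dataroot>
  <address number="1016">
    <code>B679OOO00</code>
    <lines><line>DO NOT USE THIS CODE</line></lines>
  </address>
  <address number="1014">
    <code>P982LUS00</code>
    <lines><line>UPPER HOUSE FARM</line><line>BACTON</line></lines>
  </address>
</dataroot>
XML

# Count the nodes an XPath expression selects from a given context node.
sub count_nodes {
    my ($context, $xpath) = @_;
    my @nodes = $context->findnodes($xpath);
    return scalar @nodes;
}

my $doc = XML::LibXML->new->parse_string($xml);
my ($first) = $doc->findnodes('/dataroot/address');

# '//x'  is absolute: it searches the whole document.
# './/x' is relative: descendants of the context node only.
# 'x'    is relative: children of the context node only.
for my $xpath ('//code', './/code', 'code', '//line', 'lines/line') {
    printf "%-12s %d\n", $xpath, count_nodes($first, $xpath);
}
```

From the first address, '//code' selects 2 nodes and '//line' selects 3, while './/code', 'code' and 'lines/line' each select 1.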
> Aside from CPAN, I would appreciate any other sources of info about
> using libXML with Perl and XPath expressions. It is
> whoppingly fast.
If you're doing a lot of XML work then I wholeheartedly recommend O'Reilly's
volumes
XML in a Nutshell, Third Edition
XPath and XPointer
XSLT
HTH,
Rob
| |
| Mumia W. 2007-01-12, 6:59 pm |
| On 01/12/2007 11:29 AM, Beginner wrote:
> On 12 Jan 2007 at 17:06, Rob Dixon wrote:
>
> Hi Rob,
>
> #!/usr/bin/perl
>
>
> my $file = 'ADDRESS.XML';
> open(FH,$file) or die "Can't open file $file: $!\n";
>
> my $parser = XML::LibXML->new;
> my $doc = $parser->parse_fh(\*FH);
>
> my @codes = $doc->findnodes('//code');
> my @lines = $doc->findnodes('//lines');
>
> for (my $i = 0; $i < $#codes; ++$i) {
> #print $codes[$i]->string_value, "\t";
> my @add = $lines[$i]->childNodes;
> for ( my $a = 1; $a <$#add; ++$a) {
> if ($add[$a]->string_value =~ /\s+NOT\s+/) {
> print $codes[$i]->string_value,": ",$add[$a]->string_value,"\n";
> }
> }
> }
>
>
> Yes ...and no. I guess I want to print out the 'code' for any address
> so that I can get the data corrected but I guess I would also like to
> remove those records at the /dataroot/address level so they don't
> appear in the file.
>
This works for me:
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use Data::Dumper;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_fh(\*DATA);
my %remove;
my @results = $doc->findnodes('//address');
foreach my $address (@results) {
    my @lines = $address->childNodes;
    @lines = grep $_->nodeName eq 'lines', @lines;
    @lines = map $_->childNodes, @lines;
    @lines = grep $_->nodeName eq 'line', @lines;
    foreach my $line (@lines) {
        if ($line->string_value =~ /\bNOT\b/) {
            my $number = $address->getAttribute('number');
            $remove{$number} = $address;
        }
    }
}
print Dumper(\%remove);
> I spent a lot of time on this today as this looks like an excellent
> parser and DOM navigator but I struggled moving around.
>
Yes, the unintuitive behavior of findnodes and find threw me too. It
seems that those methods look at every node in the file--not just the
children of the current element.
> In your example @results looks like it would contain references to
> all the /lines/line data with DO NOT USE in the string_value. What I
> have been struggling with is that this is also a reference to the record
> as a whole and my navigation techniques are not working out. For
> example whenever I used findnodes I was getting every code in the
> file. I think now that was because I was using /dataroot/address as
> the starting point.
>
I know ZIP about XPath (today is my first day dealing with it), but I'm
able to get the 1016 address element using this code:
my @results = $doc->findnodes('/dataroot/address[contains(lines/line, "NOT")]');
@results = map $_->getAttribute('number'), @results;
print Dumper(\@results);
-----------------
Unfortunately, it doesn't get the 1333 node. Using findnodes and
specifying an XPath statement only seems to pick up the first line in
the lines element. Other lines are ignored. If I put a dummy line before
the "DO NOT USE" line for 1016, 1016 is no longer recognized.
Perhaps it's possible to create an XPath statement that searches all of
the lines in a lines element.
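That behaviour follows from XPath 1.0: contains() takes strings, and a node-set passed as an argument is converted using only its first node in document order. Moving the test into a predicate on each <line> checks every line instead. A minimal sketch, assuming the <dataroot> root element; the sample data and the address_numbers helper are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Sample modelled on the thread's data (root element name assumed).
my $xml = <<'XML';
<dataroot>
  <address number="1016">
    <code>B679OOO00</code>
    <lines><line>DO NOT USE THIS CODE</line></lines>
  </address>
  <address number="1014">
    <code>P982LUS00</code>
    <lines><line>UPPER HOUSE FARM</line><line>BACTON</line></lines>
  </address>
  <address number="1333">
    <code>A234ULE00</code>
    <lines>
      <line>QUEENS HOUSE</line>
      <line>LONDON WC2H</line>
      <line>****NOT AT THIS ADDRESS ANY MORE.</line>
    </lines>
  </address>
</dataroot>
XML

# Return the number attribute of every <address> selected by $xpath.
sub address_numbers {
    my ($doc, $xpath) = @_;
    return map { $_->getAttribute('number') } $doc->findnodes($xpath);
}

my $doc = XML::LibXML->new->parse_string($xml);

# Tests only the first <line> of each address.
my $first_only = '/dataroot/address[contains(lines/line, "NOT")]';
# Tests each <line> in turn; '.' is the line being examined.
my $any_line   = '/dataroot/address[lines/line[contains(., "NOT")]]';

my @first = address_numbers($doc, $first_only);
my @any   = address_numbers($doc, $any_line);
print "first line only: @first\n";
print "any line:        @any\n";
```

With this sample the first expression finds only 1016 and the second finds 1016 and 1333. Note that contains() is a plain substring test, so a line such as NOTTINGHAM would also match; a word-boundary check still needs a Perl regex on the line text.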
> Aside from CPAN, I would appreciate any other sources of info about
> using libXML with Perl and XPath expressions. It is
> whoppingly fast.
>
> Thanx again,
> Dp.
>
>
XML::Simple isn't so bad :-)
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;
my $root = XMLin(\*DATA );
my @remove;
foreach my $address (@{$root->{address}}) {
    my $descent = $address->{lines}{line};
    my $lines = ref($descent) ? join("\n",@$descent) : $descent;
    if ($lines =~ /\bNOT\b/) {
        push @remove, $address->{number};
    }
}
print "To remove: @remove\n";
------------
For me, this prints "To remove: 1016 1333."
However, XML::Simple places some constraints on your XML document. Read
the POD.
| |
| Andreas Puerzer 2007-01-12, 6:59 pm |
| Mumia W. wrote:
> On 01/12/2007 11:29 AM, Beginner wrote:
[snip]
>
[more code using XML::LibXML snipped]
>
> XML::Simple isn't so bad :-)
>
XML::Twig neither ;->
#!/usr/bin/perl
use warnings;
use strict;
use XML::Twig;
my $xml;
$xml .= $_ while <DATA>;
my $twig = XML::Twig->new(
    twig_handlers => {
        'address' => sub {
            my $elt = $_;
            for ($elt->descendants('line')) {
                if ($_->text =~ /\bNOT\b/) {
                    print "Invalid address number: "
                        . $elt->{att}->{number} . ", code: "
                        . $elt->first_child('code')->text . "\n";
                    $elt->cut;
                    last;    # one report per address is enough
                }
            }
        }
    },
    pretty_print => 'indented',
)->parse($xml);
$twig->print;
__DATA__
<dataroot>
<address number="1016">
<code>B679OOO00</code>
<record_type>client</record_type>
<address_type>shipping</address_type>
<Postcode></Postcode>
<Country>GBR</Country>
<lines>
<line>DO NOT USE THIS CODE</line>
</lines>
</address>
<address number="1014">
<code>P982LUS00</code>
<record_type>client</record_type>
<address_type>shipping</address_type>
<Postcode>HR2 0AU</Postcode>
<Country>GBR</Country>
<lines>
<line>UPPER HOUSE FARM</line>
<line>BACTON</line>
<line>ESSEX</line>
<line>EX2 0AU</line>
</lines>
</address>
<address number="1333">
<code>A234ULE00</code>
<record_type>client</record_type>
<address_type>shipping</address_type>
<Postcode></Postcode>
<Country>AND</Country>
<lines>
<line>QUEENS HOUSE</line>
<line>1 BUCKINGHAM PALACE</line>
<line>LONDON WC2H</line>
<line>****NOT AT THIS ADDRESS ANY MORE.</line>
<line>***************</line>
</lines>
</address>
<address number="1018">
<code>A&amp;MPUB00</code>
<record_type>client</record_type>
<address_type>shipping</address_type>
<Postcode>PO19 8SQ</Postcode>
<Country>GBR</Country>
<lines>
<line>THE ATRIUM</line>
<line>SOUTHERN GATE</line>
<line>CHICHESTER</line>
<line>SUSSEX</line>
<line>PO19 8SQ</line>
</lines>
</address>
</dataroot>
HTH,
Andreas Puerzer
--
perl -mAcme::JAPH
| |
| Jenda Krynicky 2007-01-22, 6:58 pm |
| From: "Beginner" <dermot@sciencephoto.com>
> Hi,
>
> I have to do some sanity checks on a large xml file of addresses (snip
> below). I have been using XML::LibXML and seem to have started ok but
> I am struggling to navigate around a record.
>
> In the sample data below you'll see some addresses with "DO NOT..."
> in. I can locate them easily enough but I am struggling to navigate
> back up the DOM to access the code so I can record the code with
> faulty addresses.
A bit late and again using a different module:
use XML::Rules;
# $xml is assumed to hold the XML document as a string
# find the tags and print <code>
my $parser_find = XML::Rules->new(
    rules => [
        _default => '',
        line => sub {$_[1]->{_content}."\n\t"},
        'code,lines' => 'content',
        address => sub {
            if ($_[1]->{lines} =~ /\s+NOT\s+/) {
                print $_[1]->{code}."\n";
            }
        }
    ],
);
$parser_find->parse($xml);
# filter the <address> tags
my $parser_remove = XML::Rules->new(
    rules => [
        _default => 'raw',
        line => sub {
            my ($tag, $attrs, $context, $parents) = @_;
            if ($attrs->{_content} =~ /\s+NOT\s+/) {
                $parents->[-2]{_remove} = 1;
                # skip the <lines> and set the attribute
                # directly in <address>
            }
            return [$tag => $attrs];
        },
        address => sub {
            return $_[0] => $_[1] unless ($_[1]->{_remove});
            return;
        }
    ],
    style => 'filter',
);
my $result;
open my $FH, '>', \$result;
$parser_remove->filter($xml, $FH);
close $FH;
print $result;
__END__
The plus is that this doesn't keep the whole XML in memory, but
instead processes the bits as they are read and parsed, which may make a
big difference with huge files.
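For anyone who wants the same record-at-a-time behaviour while staying with XML::LibXML, its pull parser XML::LibXML::Reader can expand one <address> at a time. This is a different technique from the XML::Rules code above and only a sketch: the sample data and the flagged_codes helper are hypothetical, and the substring test for "NOT" is an assumption about what should count as a flagged line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::Reader;

# Sample modelled on the thread's data (root element name assumed).
my $xml = <<'XML';
<dataroot>
  <address number="1016">
    <code>B679OOO00</code>
    <lines><line>DO NOT USE THIS CODE</line></lines>
  </address>
  <address number="1014">
    <code>P982LUS00</code>
    <lines><line>UPPER HOUSE FARM</line><line>BACTON</line></lines>
  </address>
  <address number="1333">
    <code>A234ULE00</code>
    <lines>
      <line>QUEENS HOUSE</line>
      <line>****NOT AT THIS ADDRESS ANY MORE.</line>
    </lines>
  </address>
</dataroot>
XML

# Read <address> elements one by one and collect the codes of flagged ones.
sub flagged_codes {
    my ($reader) = @_;
    my @codes;
    # nextElement returns 1 while another <address> is found.
    while ($reader->nextElement('address') == 1) {
        # Deep-copy just this record into a small DOM subtree.
        my $address = $reader->copyCurrentNode(1);
        my @bad = $address->findnodes('lines/line[contains(., "NOT")]');
        push @codes, $address->findvalue('code') if @bad;
    }
    return @codes;
}

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "Cannot create reader\n";
my @codes = flagged_codes($reader);
print "$_\n" for @codes;
```

For a file on disk the reader would be created with location => 'ADDRESS.XML' instead of string => $xml.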
Jenda
===== Jenda@Krynicky.cz === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery