











 

Parsing large XML file
Mike Blezien

2007-07-15, 6:59 pm

Hello,

We need to parse some very large XML files, approximately 900-1000 KB each. A
sample of a typical XML file that would be parsed can be viewed here:
http://projects.thunder-rain.com/uploads/000001.xml

I was planning on using the XML::Twig module to do this, using the following
code snippet to loop through each of the <product> .... </product> elements. Not
every element is needed, but most of those within each <product></product>
are.

# Code snippet:
####################################################################
use strict;
use warnings;
use CGI;
use XML::Twig;

my $xmlfile = '/path/to/upload/000001.xml';
my $cgi  = CGI->new();
my $twig = XML::Twig->new(twig_handlers => {
    product => \&get_products,
});
$twig->parsefile($xmlfile);

sub get_products {
    my ($t, $elt) = @_;
    # Called once for each <product> element.

    my $article_number     = $elt->first_child_text('article_number');
    my $ean_upc            = $elt->first_child_text('ean_upc');
    my $distributor_number = $elt->first_child_text('distributor_number');
    my $distributor_name   = $elt->first_child_text('distributor_name');
    my $artist             = $elt->first_child_text('artist');

    # Now loop through each
    # <tracks><number_of_tracks></number_of_tracks><playtime></playtime>
    # <track> <sound> </sound> </track></tracks> for each product.
    # The <number_of_tracks> element determines how many
    # <track> <sound> </sound> </track> entries <tracks> .. </tracks> holds.

    $t->purge();    # free the memory used by the products already processed
}

exit();
####################################################################

The area I'm having a lot of trouble with is the elements within each product:
the <tracks> .... </tracks> element, and looping through each of its <track>
child elements and their <sound></sound> elements.
---------
<product>
........
<tracks>
<number_of_tracks></number_of_tracks><playtime></playtime>
<track> ....
<sound> ..
</sound>
</track>
</tracks>
.........
</product>
--------

Is there a better way to obtain all the data within each of the
<product> ... </product> elements? I've never really worked with XML files this
large or with such a complex tree. Any help or suggestions would be much
appreciated.

TIA
Mike(mickalo)Blezien
===============================
Thunder Rain Internet Publishing
Providing Internet Solution that Work
===============================
Rob Dixon

2007-07-15, 9:58 pm



Hi Mike

Your application of XML::Twig seems exactly right. I'm not sure what it is you
don't understand, but if you use this as your 'get_products' subroutine I hope
it answers some questions. All it does is print the title of the product and
the title of all the tracks in that product. Post again if you have any trouble
understanding what I've written.

sub get_products {

    # XML::Twig sets $_ to the element that triggered the handler
    my $product = $_;

    my $product_title = $product->first_child('title');
    print $product_title->trimmed_text, "\n";

    my $tracks = $product->first_child('tracks');
    return unless $tracks;

    foreach my $track ($tracks->children('track')) {
        my $track_title = $track->first_child('title');
        print ' ', $track_title->trimmed_text, "\n";
    }

    print "\n";
}
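
For anyone who wants to try the handler without the original file, here is a minimal, self-contained sketch. The inline XML is hypothetical sample data (the real file's layout may differ), and the @lines array and the '(untitled)' fallback are additions for illustration only. It also guards against a missing <title>, since first_child returns undef when no matching child exists.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data; the real file's layout may differ.
my $xml = <<'XML';
<products>
  <product>
    <title>First album</title>
    <tracks>
      <number_of_tracks>2</number_of_tracks>
      <track><title>Track one</title></track>
      <track><title>Track two</title></track>
    </tracks>
  </product>
  <product>
    <title>No tracks here</title>
  </product>
</products>
XML

my @lines;    # collected output, so it can be inspected as well as printed

my $twig = XML::Twig->new(
    twig_handlers => { product => \&get_products },
);
$twig->parse($xml);    # parse() takes a string; parsefile() takes a path
print "$_\n" for @lines;

sub get_products {
    my ($t, $product) = @_;

    # first_child returns undef when the element is missing, so guard it.
    my $title = $product->first_child('title');
    push @lines, $title ? $title->trimmed_text : '(untitled)';

    if (my $tracks = $product->first_child('tracks')) {
        for my $track ($tracks->children('track')) {
            my $track_title = $track->first_child('title');
            push @lines,
                '  ' . ($track_title ? $track_title->trimmed_text : '(untitled)');
        }
    }

    $t->purge;    # release the memory held by products already handled
}
```

The second <product> has no <tracks> element, so only its title is collected.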

HTH,

Rob
Mike Blezien

2007-07-15, 9:58 pm

Rob,



OK, this helps get me headed in the right direction; I much appreciate the help.

The only question I have now is that, while looping through each <track>
</track>, we have another loop inside each <track>:
.....
.....
<sound>
<file> ... </file>
<sound_type> ... </sound_type>
<codec> ... </codec>
<bitrate> ... </bitrate>
<channels>mono</channels>
</sound>
......
.......
</track>
Can one do something like this:

foreach my $track ($tracks->children('track'))
{
    for my $sound ($track->first_child('sound'))
    {
        my $soundtype = $sound->first_child_text('sound_type');
        my $codec = $sound->first_child_text('codec');
    }
    my $track_title = $track->first_child('title');
    print ' ', $track_title->trimmed_text, "\n";
}

Would this work or is there a better way to do this ?

Mike
Rob Dixon

2007-07-16, 6:59 pm



Almost right. You need

for my $sound ($track->children('sound')) {
:
}

(and you also need to test it!)
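
One way to test the nested loop is against a small inline document. This is only a sketch: the sample XML, the file names in it, and the @found array are made up for illustration, and the real file may be laid out differently.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data with two <sound> elements in one <track>.
my $xml = <<'XML';
<products>
  <product>
    <tracks>
      <track>
        <title>Track one</title>
        <sound>
          <file>one.mp3</file><sound_type>sample</sound_type><codec>mp3</codec>
        </sound>
        <sound>
          <file>one.wma</file><sound_type>full</sound_type><codec>wma</codec>
        </sound>
      </track>
    </tracks>
  </product>
</products>
XML

my @found;    # one entry per <sound>, collected for inspection

my $twig = XML::Twig->new(
    twig_handlers => {
        product => sub {
            my ($t, $product) = @_;
            my $tracks = $product->first_child('tracks') or return;

            for my $track ($tracks->children('track')) {
                my $title = $track->first_child_text('title');

                # children() returns every matching child, so the loop
                # runs once per <sound> element in this track.
                for my $sound ($track->children('sound')) {
                    push @found, join ' | ', $title,
                        $sound->first_child_text('sound_type'),
                        $sound->first_child_text('codec');
                }
            }
            $t->purge;
        },
    },
);
$twig->parse($xml);
print "$_\n" for @found;
```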

One thing to be careful of is that all the variables in my code were XML
nodes which could be both used to locate child nodes and to extract their
text values. You have variables which contain just the text values such as

my $codec = $sound->first_child_text('codec');

which is ok as long as you understand the difference and can keep track
of which is which. You may want to stick with variables being XML nodes
throughout, such as:

my $codec = $sound->first_child('codec');
print $codec->trimmed_text, "\n";
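
To make the node-versus-text distinction concrete, here is a small sketch on a made-up one-element document. The variable names are illustrative only. Note how the two calls behave differently when the child is missing.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<sound><codec> mp3 </codec></sound>');
my $sound = $twig->root;

# An element object: can be navigated further or asked for its text.
my $codec_elt = $sound->first_child('codec');

# A plain string: nothing more can be looked up from it.
my $codec_text = $sound->first_child_text('codec');

# For a child that does not exist, the two calls differ:
my $missing_elt  = $sound->first_child('bitrate');       # undef
my $missing_text = $sound->first_child_text('bitrate');  # empty string

print 'element class: ', ref($codec_elt), "\n";
print 'trimmed text:  ', $codec_elt->trimmed_text, "\n";
print 'missing element is ', (defined $missing_elt ? 'defined' : 'undef'), "\n";
```

Calling a method such as trimmed_text on $missing_elt would die, which is why code that keeps nodes around should check them before use.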

HTH,

Rob

Mike Blezien

2007-07-16, 6:59 pm

Rob,



OK, I appreciate all your help and advice. I'm going to do some testing later
and see how well this all works ;)

Thanks,
Mike
Jenda Krynicky

2007-08-02, 7:59 am

From: "Mike Blezien" <mickalo@frontiernet.net>
> we need to parse some very large XML files, approx., 900-1000KB's filesize. A
> sample of a typical XML file can be view here that would be parsed:
> http://projects.thunder-rain.com/uploads/000001.xml


I'm probably coming late, but anyway ... this looks like a
perfect task for my XML::Rules. The URL doesn't work anymore so I'm
guessing the structure of the XML.

Using XML::Rules the code would look somewhat like this:

#!perl
use strict;
use warnings;
use XML::Rules;

my $parser = XML::Rules->new(
    rules => [
        _default      => 'content',
        tracks        => 'pass no content',
        'track,sound' => 'no content array',

        product => sub {
            my ($tag, $attr) = @_;
            delete $attr->{_content};
            #use Data::Dumper;
            #print Dumper($attr);

            # The heredoc terminator must start in the first column.
            print <<"*END*";
article_number: $attr->{'article_number'}
distributor_number: $attr->{'distributor_number'}
distributor_name: $attr->{'distributor_name'}
artist: $attr->{'artist'}
ean_upc: $attr->{'ean_upc'}
set_total: $attr->{'set_total'}
*END*

            foreach my $track (@{$attr->{track}}) {
                print " Track: $track->{trackno}. $track->{title} ($track->{setno})\n";

                foreach my $sound (@{$track->{sound}}) {
                    print " Sound: $sound->{file}\n Type: $sound->{sound_type} (Codec: $sound->{codec})\n";
                }
            }
            print "\n";

            return;
        }
    ]
);

$parser->parse(\*DATA);

__DATA__
<products>
<product>
<article_number>Blah blah</article_number>
<distributor_number>Blah blah</distributor_number>
<distributor_name>Blah blah</distributor_name>
<artist>Blah blah</artist>
<ean_upc>Blah blah</ean_upc>
<set_total>Blah blah</set_total>
<tracks>
<number_of_tracks>2</number_of_tracks>
<track>
<title>Blah blah</title>
<trackno>1</trackno>
<setno>Blah blah</setno>
<sound>
<sound_type>Blah blah</sound_type>
<codec>Blah blah</codec>
<file>Blah blah</file>
</sound>
</track>
<track>
<title>YDFbibusdf</title>
<trackno>2</trackno>
<setno>Blah blah</setno>
<sound>
<sound_type>Blah blah</sound_type>
<codec>Blah blah</codec>
<file>Blah blah</file>
</sound>
</track>
</tracks>
</product>
</products>


I believe this will be even more efficient than XML::Twig.
http://xmltwig.com/article/ways_to_..._rome.html#todo

HTH, Jenda
===== Jenda@Krynicky.cz === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery
