The huge amount response data problem
(Archive: PERL Miscellaneous, March 2008)

falconzyx@gmail.com

2008-03-24, 10:20 pm

I have a problem:
1. I want to open a file and use the data from the file to construct
URLs.
2. After I construct each URL and send it, I get the response HTML
data, and I want to store some parts of it in files.

It seems like a very easy thing. However, the file I have to open is
huge: I have to construct almost 200000 URLs to send, and then parse
the response data. The speed is very, very slow.

I have no experience with threads or DB caching, so I would like some
help.

Please give me some advice on what I should do to improve the speed.

Thanks very much.
falconzyx@gmail.com

2008-03-25, 4:38 am



This is my code:

use threads;
use LWP::UserAgent;
use LWP::Simple;
use Data::Dumper;
use strict;
use threads::shared;

my $wordsList = &get_request;
#print Dumper( @wordsList );

my @words = split("\n", $wordsList);
#print Dumper(@words);

my @url = &get_url(@words);
#print Dumper(@url);
my @thr;
foreach my $i (1..100000) {
    push @thr, threads->new(\&get_html, $url[$i]);
}
foreach (@thr) {
    $_->detach; # it doesn't work!!!!!!!!!!!!!!!!
}

sub get_html {
    my (@url) = @_;

}
sub get_request {
    ..........
    return $wordsList;
}

sub get_url {
    my (@words) = @_;
    ................
    return @url;
}
Ben Bullock

2008-03-25, 4:38 am

Your code is hopelessly inefficient. 100,000 strings of even twenty
characters is at least two megabytes of memory. Then you've doubled
that number with the creation of the URL, and then you are creating
arrays of all these things, so you've used several megabytes of
memory.

Instead of first creating a huge array of names, then a huge array of
URLs, why don't you just read in one line of the file at a time, then
try to get data from each URL? Read in one line of the first file,
create its URL, get the response data, store it, then go back and get
the next line of the file, etc. A 100,000 line file actually isn't
that big.
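
A minimal sketch of the line-at-a-time loop described above. The helper
name process_lines, the example.invalid URL pattern and the three
callbacks are hypothetical, not taken from the original code. The fetch
step is a stub so the sketch runs without network access; in real use it
would presumably wrap an LWP::UserAgent request.

```perl
use strict;
use warnings;

# Hypothetical helper: handle each word before reading the next one, so
# only one word, one URL and one response are held in memory at a time.
#   $in       - open filehandle with one word per line
#   $make_url - code ref turning a word into a URL (assumption)
#   $fetch    - code ref returning the response body, or undef on failure
#   $store    - code ref saving the result for a word (assumption)
sub process_lines {
    my ( $in, $make_url, $fetch, $store ) = @_;
    my $count = 0;
    while ( my $word = <$in> ) {
        chomp $word;
        next if $word eq '';          # skip blank lines
        my $url  = $make_url->($word);
        my $body = $fetch->($url);
        next unless defined $body;    # fetch failed; move on to the next word
        $store->( $word, $body );
        $count++;
    }
    return $count;
}

# Demo with an in-memory "file" and stub callbacks.
my $input = "apple\nbanana\n\ncherry\n";
open my $in, '<', \$input or die "cannot open in-memory file: $!";
my %saved;
my $n = process_lines(
    $in,
    sub { "http://example.invalid/lookup/$_[0]" },    # placeholder URL pattern
    sub { "<html>$_[0]</html>" },                     # stub: echoes the URL
    sub { $saved{ $_[0] } = $_[1] },                  # stub: keeps result in a hash
);
close $in;
print "processed $n words\n";
```

This removes the two large arrays but still waits for each response in
turn, so it addresses memory use rather than speed.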

But if you are getting all these files from the internet, the biggest
bottleneck is probably the time the code spends waiting for a response
from the web servers it's requested. You'd have to think about making
parallel requests somehow to solve that.

falconzyx@gmail.com

2008-03-25, 4:38 am

On Mar 25, 3:06 pm, Ben Bullock <benkasminbull...@gmail.com> wrote:
> Your code is hopelessly inefficient. 100,000 strings of even twenty
> characters is at least two megabytes of memory. Then you've doubled
> that number with the creation of the URL, and then you are creating
> arrays of all these things, so you've used several megabytes of
> memory.
>
> Instead of first creating a huge array of names, then a huge array of
> URLs, why don't you just read in one line of the file at a time, then
> try to get data from each URL? Read in one line of the first file,
> create its URL, get the response data, store it, then go back and get
> the next line of the file, etc. A 100,000 line file actually isn't
> that big.
>
> But if you are getting all these files from the internet, the biggest
> bottleneck is probably the time the code spends waiting for a response
> from the web servers it's requested. You'd have to think about making
> parallel requests somehow to solve that.


Thanks Ben,

However, is there a good solution that uses threads? I tried that, and
it still runs out of memory from time to time after I refactored the
code as you suggested.
I tried Thread::Pool and some other thread modules that I found.
Is Perl really suited to multi-threaded programming?

Thanks again, everyone.
falconzyx@gmail.com

2008-03-25, 4:38 am


Here is my refactored code:

use threads;
use LWP::UserAgent;
use LWP::Simple;    # for getstore() in save_sound
use Data::Dumper;
use strict;

&get_request();

sub get_request {
    open (FH, "...") or die "can not open file $!";
    while ( my $i = <FH> ) {    # one line per iteration
        chomp $i;               # keep the newline out of the URL
        my $url = ".../$i";
        my $t = threads->new(\&get_html, $url);
        $t->join();
    }
    close (FH);
}
sub get_html {
    my ($url) = @_;
    my $user_agent = LWP::UserAgent->new();
    my $response = $user_agent->request(HTTP::Request->new('GET', $url));
    my $content = $response->content;
    format_html ($content);
}
sub format_html {
    my ($content) = shift;
    my $html_data = $content;
    my $word;
    my $data;
    while ( $html_data =~ m{...}igs ) {
        $word = $1;
    }
    while ( $html_data =~ m{...}igs ) {
        $data = $1;
        save_data( $word, $data );
    }
    while ( $data =~ m{...}igs ) {
        my $title = $1;
        my $sound = $1.$2;
        if ( defined($sound) ) {
            save_sound( $word, $title, $sound );
        }
    }
}

sub save_data {
    my ( $word, $data ) = @_;
    open ( FH, " > ..." ) or die "Can not open $!";
    print FH $data;
    close(FH);
}

sub save_sound {
    my ( $word, $title, $sound ) = @_;
    getstore("....", "...") or warn $!;
}
RedGrittyBrick

2008-03-25, 8:08 am

falconzyx@gmail.com wrote:
> On Mar 25, 3:06 pm, Ben Bullock <benkasminbull...@gmail.com> wrote:
>
> Thanks Ben,
>
> However, is there any good solution that use threads method? I use
> that, and out of memory time by time after I refactor the code as you
> told


That's because, if your file contains 100000 lines, your program tries
to create 100000 simultaneous threads, doesn't it?

I would create a pool with a fixed number of threads (say 10). I'd read
the file, adding tasks to a queue of the same size; after filling the
queue I'd pause reading the file until the queue has a spare space.
Maybe this could be achieved by sleeping a while (say 100ms) and
re-checking whether the queue is still full. When a thread is created or
has finished a task it should remove a task from the queue and process
it. If the queue is empty the thread should sleep for a while (say
200ms) and try again. You'd need some mechanism to signal threads that
all tasks have been queued (maybe a flag, a special marker task, a
signal or a certain number of consecutive failed attempts to find work).

I've never tried to program something like this in Perl so I'd imagine
someone (probably several people) has already solved this and added
modules to CPAN to assist in this sort of task.

There are probably some OO Design Patterns that apply too.
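
The pool-and-queue scheme described above might be sketched roughly as
follows, using the threads and Thread::Queue modules. This is one
possible arrangement, not a tested production design: run_pool and its
parameters are hypothetical names, the URLs are made up, and the handler
is a stub that does no network I/O. It also departs from the description
in one respect: the workers rely on the blocking dequeue() instead of
sleeping and retrying, and the "special marker task" is an undef queued
once per worker.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Time::HiRes qw(sleep);

# run_pool is a hypothetical helper, not part of any module.
#   $next_task - code ref returning the next task, or undef when input is exhausted
#   $handler   - code ref doing the work for one task (fetch, parse, save)
#   $workers   - number of threads in the pool
#   $max_queue - how many tasks may wait in the queue at once
sub run_pool {
    my ( $next_task, $handler, $workers, $max_queue ) = @_;
    my $queue = Thread::Queue->new;

    # Start the fixed pool. dequeue() blocks while the queue is empty,
    # so the workers need no sleep-and-retry loop of their own.
    my @pool = map {
        threads->create( sub {
            my $handled = 0;
            while ( defined( my $task = $queue->dequeue ) ) {
                $handler->($task);
                $handled++;
            }
            return $handled;    # number of tasks this worker processed
        } );
    } 1 .. $workers;

    # Producer: pause while the queue is full, then add the next task.
    while ( defined( my $task = $next_task->() ) ) {
        sleep(0.1) while $queue->pending >= $max_queue;
        $queue->enqueue($task);
    }

    # One undef per worker acts as the marker meaning "no more work".
    $queue->enqueue(undef) for 1 .. $workers;

    # Wait for every worker and add up how many tasks were processed.
    my $total = 0;
    for my $thr (@pool) {
        my ($handled) = $thr->join;
        $total += $handled;
    }
    return $total;
}

# Demo with made-up URLs; the handler stands in for the real fetch step.
my @urls  = map { "http://example.invalid/word/$_" } 1 .. 25;
my $total = run_pool( sub { shift @urls }, sub { length $_[0] }, 10, 10 );
print "handled $total tasks\n";
```

Because the number of threads is fixed, memory use no longer grows with
the length of the input file.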

> I try thread::Pool and some other thread module that I found.
> Doesn't it really Perl suit for mutil threads programming??


I find it hard to understand what you are saying but I think the answer
is: Yes, Perl is well suited to programming with multiple threads (or
processes).

--
RGB
Jürgen Exner

2008-03-25, 8:08 am

"falconzyx@gmail.com" <falconzyx@gmail.com> wrote:
>construct almost 200000 URLs to send, and then parse the response data.
>The speed is very, very slow.
>
>Please give me some advice on what I should do to improve the speed.


Get a T1 line.

jue
Jim Gibson

2008-03-25, 7:19 pm

In article <47e8caa2$0$32055$da0feed9@news.zen.co.uk>, RedGrittyBrick
<RedGrittyBrick@SpamWeary.foo> wrote:

> falconzyx@gmail.com wrote:


[problem getting data from 100_000 URLs snipped]

> That's because, if your file contains 100000 lines, your program tries
> to create 100000 simultaneous threads doesn't it?
>
> I would create a pool with a fixed number of threads (say 10). I'd read
> the file adding tasks to a queue of the same size, after filling the
> queue I'd pause reading the file until the queue has a spare space.
> Maybe this could be achieved by sleeping a while (say 100ms) and
> re-checking if the queue is still full. When a thread is created or has
> finished a task it should remove a task from the queue and process it.
> If the queue is empty the thread should sleep for a while (say 200ms)
> and try again, you'd need some mechanism to signal threads that all
> tasks have been queued (maybe a flag, a special marker task, a signal or
> a certain number of consecutive failed attempts to find work.)


With a fixed number of predefined URLs, I would use a simpler approach:

Fork off some number of processes (say 10) and assign to each of them
the job of reading the URL file and fetching the results from their set
of URLs sequentially.
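
A rough sketch of that fork-based approach, assuming a Unix-like system.
The names run_forked and handle_url, the part*.txt output files and the
example.invalid URLs are all hypothetical. Each child takes every Nth
line of the URL file, and handle_url is a stub that only records the URL
so the sketch runs without a network.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use POSIX ();

# Hypothetical stand-in for "fetch this URL and save what is needed".
sub handle_url {
    my ( $url, $out ) = @_;
    print {$out} "fetched $url\n";
}

# Fork $procs children; child number $k handles every line of $url_file
# whose zero-based line index is congruent to $k modulo $procs.
sub run_forked {
    my ( $url_file, $procs, $out_dir ) = @_;
    my @pids;
    for my $k ( 0 .. $procs - 1 ) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {    # child process
            open my $in, '<', $url_file or die "cannot read $url_file: $!";
            open my $out, '>', "$out_dir/part$k.txt"
                or die "cannot write part $k: $!";
            while ( my $url = <$in> ) {
                next unless ( $. - 1 ) % $procs == $k;    # another child's line
                chomp $url;
                handle_url( $url, $out );
            }
            close $out or die "cannot close part $k: $!";
            POSIX::_exit(0);    # exit at once, skipping END blocks and cleanup handlers
        }
        push @pids, $pid;
    }
    waitpid( $_, 0 ) for @pids;    # parent waits for every child
    return scalar @pids;
}

# Demo: seven made-up URLs shared among three processes.
my $dir      = tempdir( CLEANUP => 1 );
my $url_file = "$dir/urls.txt";
open my $fh, '>', $url_file or die "cannot write $url_file: $!";
print {$fh} "http://example.invalid/word/$_\n" for 1 .. 7;
close $fh;

my $children = run_forked( $url_file, 3, $dir );

my @results;
for my $part ( glob "$dir/part*.txt" ) {
    open my $part_in, '<', $part or die "cannot read $part: $!";
    push @results, <$part_in>;
    close $part_in;
}
print "children: $children, result lines: ", scalar(@results), "\n";
```

Writing one output file per child avoids any locking between processes;
merging the parts afterwards is left out of the sketch.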

The OP has not told us how the results from each URL will be used, so
that may affect whether to use threads or forked processes. A database
would be appropriate for merging or saving the results from all of
these URLs.

There is also the LWP::Parallel module, which allows one process to
simultaneously fetch responses from many URLs (I have not used it).

--
Jim Gibson

Posted Via Usenet.com Premium Usenet Newsgroup Services
----------------------------------------------------------
** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
----------------------------------------------------------
http://www.usenet.com
xhoster@gmail.com

2008-03-25, 7:19 pm

"falconzyx@gmail.com" <falconzyx@gmail.com> wrote:
> I have a problem:
> 1. I want to open a file and use the data from the file to construct
> URLs.
> 2. After I construct each URL and send it, I get the response HTML
> data, and I want to store some parts of it in files.
>
> It seems like a very easy thing. However, the file I have to open is
> huge: I have to construct almost 200000 URLs to send, and then parse
> the response data. The speed is very, very slow.


What part is slow, waiting for the response or parsing it?

Do those URLs point to *your* servers? If so, then you should be able
to bypass http and go directly to the source. If not, then do you have
permission from the owner of the servers to launch what could very well
be a denial of service attack against them?

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
falconzyx@gmail.com

2008-03-27, 4:49 am

On Mar 26, 1:50 am, xhos...@gmail.com wrote:
> RedGrittyBrick <RedGrittyBr...@SpamWeary.foo> wrote:
>
> I agree with the "(or processes)" part, provided you are running on a
> Unix like platform. But in my experience/opinion Perl threads mostly suck.


Here is my refactored code, which is still very slow. Please advise me
how to improve it; thanks very much:

require LWP::Parallel::UserAgent;
use HTTP::Request;
use LWP::Simple;
use threads;

# display tons of debugging messages. See 'perldoc LWP::Debug'
#use LWP::Debug qw(+);
my $reqs = [
    HTTP::Request->new('GET',"http://www...."),
    HTTP::Request->new('GET', "......"
    ..............# about nearly 200000 url here

];

my $pua = LWP::Parallel::UserAgent->new();
$pua->in_order (10000); # handle requests in order of registration
$pua->duplicates(0); # ignore duplicates
$pua->timeout (1); # in seconds
$pua->redirect (1); # follow redirects

foreach my $req (@$reqs) {
    print "Registering '".$req->url."'\n";
    if ( my $res = $pua->register ($req) ) {
        print STDERR $res->error_as_HTML;
    }
}
my $entries = $pua->wait();

foreach (keys %$entries) {
    my $res = $entries->{$_}->response;
    threads->new(\&format_html, $res->content);

}
foreach my $thr (threads->list()) {
    $thr->join(); # I think it does not work......
}

sub format_html {
    my ($html_data) = shift;
    my $word;
    my $data;
    while ( $html_data =~ m{...}igs ) {
        $word = $1;
    }
    while ( $html_data =~ m{...}igs ) {
        $data = $1;
        save_data( $word, $data );
    }
    while ( $data =~ m{...}igs ) {
        my $title = $1;
        my $sound = $1.$2;
        if ( defined($sound) ) {
            save_sound( $word, $title, $sound );
        }
    }
}

sub save_data {
    my ( $word, $data ) = @_;
    open ( FH, " > ..." ) or die "Can not open $!";
    print FH $data;
    close(FH);
}

sub save_sound {
    my ( $word, $title, $sound ) = @_;
    getstore("...", "...") or warn $!;
}

Jim Gibson

2008-03-27, 10:16 pm

In article
<c776f49d-4eb1-4765-a617-c2e4b1e125c8@s19g2000prg.googlegroups.com>,
<"falconzyx@gmail.com"> wrote:


> Here is my refactored code, which is still very slow. Please advise me
> how to improve it; thanks very much:
>
> require LWP::Parallel::UserAgent;
> use HTTP::Request;
> use LWP::Simple;
> use threads;
>
> # display tons of debugging messages. See 'perldoc LWP::Debug'
> #use LWP::Debug qw(+);
> my $reqs = [
> HTTP::Request->new('GET',"http://www...."),
> HTTP::Request->new('GET', "......"
> ..............# about nearly 200000 url here
>
> ];

[ rest snipped ]

You are flooding the web with 200_000 URL requests (but only 7-35 at a
time according to the LWP::Parallel documentation). Of course that is
going to be very slow. How slow is it? You should start with a small
number and see how long it takes, then add more URLs and see if it
scales OK.
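
One way to take that measurement, as a hedged sketch: time batches of
increasing size and compare the per-URL cost. The helper names
time_batch and estimate_seconds are hypothetical, the URLs are made up,
and the fetcher here is a stub, so the figures it prints say nothing
about a real server until the stub is replaced by the actual request
code.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run $fetch over a list of URLs and return the elapsed wall-clock seconds.
sub time_batch {
    my ( $fetch, @urls ) = @_;
    my $start = time;
    $fetch->($_) for @urls;
    return time - $start;
}

# Straight-line extrapolation from a sample to the full list. It assumes
# the per-URL cost stays constant, which is the very thing being checked.
sub estimate_seconds {
    my ( $sample_seconds, $sample_size, $total_urls ) = @_;
    return $sample_seconds / $sample_size * $total_urls;
}

# Demo with a stub fetcher; swap in the real request code to measure it.
my $stub = sub { my $len = length $_[0]; return $len };
for my $n ( 10, 100, 1000 ) {
    my @urls    = map { "http://example.invalid/word/$_" } 1 .. $n;
    my $elapsed = time_batch( $stub, @urls );
    printf "%5d URLs: %.4f s total, %.6f s per URL\n", $n, $elapsed, $elapsed / $n;
}
printf "2 s per 10 URLs would mean about %.0f s for 200000 URLs\n",
    estimate_seconds( 2, 10, 200_000 );
```

If the per-URL time rises as the batch grows, the run is not scaling
linearly and the straight-line estimate will be too optimistic.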

The bottleneck is going to be waiting for a response from each of your
URLs. There is no way any amount of Perl programming efficiency can
make up for that slowness.

You should consider caching your results. How often do you have to run
this program to get all 200_000 URLs? How often do they change content?
Does your list vary every time you run the program?
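
If the answers to those questions allow it, a simple on-disk cache could
avoid repeating requests. The following is only a sketch under stated
assumptions: cached_fetch is a hypothetical helper, the cache file name
is the MD5 of the URL, freshness is judged by file age in days, and the
fetcher is a stub standing in for the real request code.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Temp qw(tempdir);

# Return the body for $url, reading it from $cache_dir when a copy
# younger than $max_age_days exists, and calling $fetch otherwise.
sub cached_fetch {
    my ( $url, $cache_dir, $max_age_days, $fetch ) = @_;
    my $file = "$cache_dir/" . md5_hex($url);

    # -M gives the file's age in days relative to script start time.
    if ( -e $file && -M $file < $max_age_days ) {
        open my $in, '<', $file or die "cannot read $file: $!";
        local $/;                 # slurp the whole file
        my $body = <$in>;
        close $in;
        return $body;
    }

    my $body = $fetch->($url);
    return unless defined $body;  # do not cache failed requests

    open my $out, '>', $file or die "cannot write $file: $!";
    print {$out} $body;
    close $out or die "cannot close $file: $!";
    return $body;
}

# Demo: the second call for the same URL is served from the cache.
my $cache_dir = tempdir( CLEANUP => 1 );
my $calls     = 0;
my $fetcher   = sub { $calls++; return "<html>$_[0]</html>" };    # stub
my $url       = 'http://example.invalid/word/apple';              # made-up URL
my $first     = cached_fetch( $url, $cache_dir, 7, $fetcher );
my $second    = cached_fetch( $url, $cache_dir, 7, $fetcher );
print "fetcher called $calls time(s)\n";
```

Whether a seven-day lifetime is sensible depends on how often the pages
change, which the thread does not say.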

--
Jim Gibson

Copyright 2008 codecomments.com