Code Comments

Is there a maximum size to Perl strings?
I've noticed this problem in a message board script of mine lately. The
site has a file in the Java Properties format listing various ways of
blocking users - e.g. usernames, email addresses, or message content.
Each is a list of regexps.

Here's a sample from the file

# Content is a Perl RE, however the . operator doesn't work, nor
# can you specify ranges using the {n,m} notation. Unless otherwise
# specified (using ^ and $) the given text is assumed to be a fragment,
# not the entirety.
content: (?i)cgi.ebay., (?i)www.ebay., (?i)www.dontpayretail.co.uk,
         (?i)www.\S*sex\S*.co, (?i)www.goyemen.com, \
         MONEY, \
         THAI\s*MASSAGE, (?i)www.cgispy.com, \
         (?i)<textarea, (?i)textarea>, \
         (?i)Paypal, (?i)\$1.00\s+Bill, \
         DOLLARS, \
         GUARANTE+D?

As you can see, it's a series of comma-delimited regexps, which has been
broken onto multiple lines using the '\' character.

It worked fine until it grew in size (that's about a third of what I
have for content). It appears that the bottom of the section isn't being
read.

What the code does for reading in sections is easy enough:

my %blocks;
while (not eof (BLKS))
{	my ($key, $value) = split (/\s*[=:]\s*/, <BLKS> );
	next if ($key =~ /^#/);
	$key =~ s/^\s+//gi;
	while ($value =~ /\\\s*$/g and not eof (BLKS))
	{	$value =~ s/\\\s*$//g;

		my $valueContinued = <BLKS>;
		$value .= $valueContinued;
	}
	$value =~ s/\s*$//sgi;
	$value =~ s/^\s*//sgi;
	$value =~ s/\r?\n//gi;

	$blocks{lc ($key)} = $value;
}

# The content block accept minor REGEXP patterns. However dots are out!
$blocks{content}     =~ s/\./\\\./gi;

@blockedContent  = split (/\s*,\s*/, $blocks{content});


When a message is posted, the code then loops through every regexp in
the @blockedContent array, applying it in turn, and dying if any of them
match. Does anyone know why the whole of the block isn't being read in?
Is it a problem with String size, or the size of an array?
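
For readers following along, the matching step described above could look roughly like the sketch below. It is not the script's actual code: `find_blocked` and the sample patterns are made up for illustration. Each pattern is compiled with `qr//` inside an `eval`, on the assumption that a malformed entry in the file should be skipped rather than kill the whole script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first pattern that matches the message,
# or undef if none does. Patterns are plain strings, as read from the file.
sub find_blocked {
    my ( $message, @patterns ) = @_;
    for my $pattern (@patterns) {
        # Compile defensively: a pattern that fails to compile is skipped.
        my $re = eval { qr/$pattern/ };
        next unless defined $re;
        return $pattern if $message =~ $re;
    }
    return;
}

# Illustrative patterns only, with dots already escaped.
my @blockedContent = ( '(?i)www\.ebay\.', 'THAI\s*MASSAGE', '(?i)Paypal' );

my $hit = find_blocked( 'Pay me via PAYPAL today', @blockedContent );
print defined $hit ? "blocked by: $hit\n" : "allowed\n";
```

Returning the matching pattern, rather than just dying, makes it easier to log which entry blocked a post.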

Thanks
--
Bryan

Bryan Feeney
01-05-05 08:56 PM


Re: Is there a maximum size to Perl strings?
Bryan Feeney wrote:
> I've noticed this problem in a message board script of mine lately. The
> site has a file in the Java Properties format listing various ways of
> blocking users - e.g. usernames, email addresses, or message content.
> Each is a list of regexps.
>
> Here's a sample from the file
>
> # Content is a Perl RE, however the . operator doesn't work, nor
> # can you specify ranges using the {n,m} notation. Unless otherwise
> # specified (using ^ and $) the given text is assumed to be a fragment,
> # not the entirety.
> content: (?i)cgi.ebay., (?i)www.ebay., (?i)www.dontpayretail.co.uk,
>          (?i)www.\S*sex\S*.co, (?i)www.goyemen.com, \
>          MONEY, \
>          THAI\s*MASSAGE, (?i)www.cgispy.com, \
>          (?i)<textarea, (?i)textarea>, \
>          (?i)Paypal, (?i)\$1.00\s+Bill, \
>          DOLLARS, \
>          GUARANTE+D?
>
> As you can see, it's a series of comma-delimited regexps, which has been
> broken onto multiple lines using the '\' character.
>
> It worked fine until it grew in size (that's about a third of what I
> have for content). It appears that the bottom of the section isn't being
> read.

You don't have a '\' character at the end of the first line so that is
probably why it is not reading the whole section.
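
A small sketch of the effect, assuming a reader that treats a trailing backslash as "the next line continues this one". The helper name `join_continuations` and the sample lines are invented for the demonstration; they are not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: joins physical lines into logical lines, treating a
# trailing backslash (optionally followed by whitespace) as a continuation.
sub join_continuations {
    my @lines = @_;
    my ( @logical, $current );
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;              # drop the line ending
        $current = defined $current ? $current . $line : $line;
        if ( $current =~ s/\\\s*\z// ) {   # continuation: keep accumulating
            next;
        }
        push @logical, $current;
        undef $current;
    }
    push @logical, $current if defined $current;   # input ended mid-continuation
    return @logical;
}

# First line without a backslash, as in the posted sample, versus with one.
my @broken = ( "content: a, b,\n",     "  c, d, \\\n", "  e\n" );
my @fixed  = ( "content: a, b, \\\n", "  c, d, \\\n", "  e\n" );

print scalar( join_continuations(@broken) ), " logical line(s) without the backslash\n";
print scalar( join_continuations(@fixed) ),  " logical line(s) with it\n";
```

Without the backslash, the value for `content` ends at the first physical line, and whatever follows is parsed as a separate entry.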


> What the code does for reading in sections is easy enough:
>
> my %blocks;
> while (not eof (BLKS))
> {    my ($key, $value) = split (/\s*[=:]\s*/, <BLKS> );
>     next if ($key =~ /^#/);
>     $key =~ s/^\s+//gi;
                      ^^
You are using the /g option, which means that you want to match the pattern
everywhere in the string, but you have anchored the pattern to the beginning of
the string, so the /g option is superfluous.  You are using the /i option, which
means to ignore the case of alphabetic characters in the pattern, but there are
no alphabetic characters in the pattern, so the /i option is also superfluous.


>     while ($value =~ /\\\s*$/g and not eof (BLKS))
                               ^
The /g option is superfluous.


>     {    $value =~ s/\\\s*$//g;
                               ^
The /g option is superfluous.


>         my $valueContinued = <BLKS>;
>         $value .= $valueContinued;
>     }
>     $value =~ s/\s*$//sgi;
                        ^^^
You are using the /s option, which means to include the newline character when
using . to match any character, but you are not using the . meta-character in
the pattern, so the /s option is superfluous, as are the /g and /i options.


>     $value =~ s/^\s*//sgi;
                        ^^^
The /s, /g and /i options are superfluous.


>     $value =~ s/\r?\n//gi;
                          ^
The /i option is superfluous.


>     $blocks{lc ($key)} = $value;
> }
>
> # The content block accept minor REGEXP patterns. However dots are out!
> $blocks{content}     =~ s/\./\\\./gi;
                                     ^
The /i option is superfluous.
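
A quick way to check these remarks is to run the same substitution with and without the extra options and compare the results. The sample string below is made up for the test.

```perl
use strict;
use warnings;

my $sample = "   Mixed Case key  \n";

# Leading-whitespace trim, with and without the extra /g and /i.
( my $with_flags    = $sample ) =~ s/^\s+//gi;
( my $without_flags = $sample ) =~ s/^\s+//;
print $with_flags eq $without_flags ? "identical\n" : "different\n";

# /s only changes what "." matches, so it has no effect on \s*$ either.
( my $trim_s  = $sample ) =~ s/\s*$//s;
( my $trim_no = $sample ) =~ s/\s*$//;
print $trim_s eq $trim_no ? "identical\n" : "different\n";
```

The options are harmless here, but dropping them makes it clearer what each pattern is meant to do.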


> @blockedContent  = split (/\s*,\s*/, $blocks{content});

I would write it like this:


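my $section = '';
my %blocks;
while ( <DATA> ) {
    # Skip comments and blank lines.
    next if /^#/ or not /\S/;

    # Trim leading and trailing whitespace, including the newline.
    s/\A\s+//;
    s/\s+\z//;

    # Append every line to the current section, whether or not it ends in a
    # continuation backslash, so the last line of a section is kept too.
    my $continued = s/\\$//;
    $section .= $_;
    next if $continued;

    # The section is complete: split it into key and value.
    my ( $key, $value ) = split /\s*[=:]\s*/, $section, 2;
    $section = '';
    next unless defined $value;

    $value =~ s/\./\\./g;    # escape dots so they match literally
    $blocks{ lc $key } = [ split /\s*,\s*/, $value ];
}

my @blockedContent = @{ $blocks{ content } };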


__DATA__

# Content is a Perl RE, however the . operator doesn't work, nor
# can you specify ranges using the {n,m} notation. Unless otherwise
# specified (using ^ and $) the given text is assumed to be a fragment,
# not the entirety.

content: (?i)cgi.ebay., (?i)www.ebay., (?i)www.dontpayretail.co.uk, \
(?i)www.\S*sex\S*.co, (?i)www.goyemen.com, \
MONEY, \
THAI\s*MASSAGE, (?i)www.cgispy.com, \
(?i)<textarea, (?i)textarea>, \
(?i)Paypal, (?i)\$1.00\s+Bill, \
DOLLARS, \
GUARANTE+D?


__END__


John
--
use Perl;
program
fulfillment

John W. Krahn
01-07-05 01:55 PM


Re: Is there a maximum size to Perl strings?
My understanding is that strings in Perl can essentially be as large
as all available RAM...
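
This is easy to try out. The sketch below builds a 10 MB string (the size is an arbitrary choice for the demonstration, far larger than a block list is likely to be) and matches a regexp against it.

```perl
use strict;
use warnings;

# Build a 10 MB string out of 1 KB chunks.
my $chunk = 'x' x 1024;              # 1 KB
my $big   = $chunk x ( 10 * 1024 );  # 10 MB

printf "length: %d bytes\n", length $big;

# Regexp matching still works on a string of this size.
$big .= 'MONEY';
print "match found\n" if $big =~ /MONEY\z/;
```

If this runs without trouble, string length is unlikely to be what cuts off a block list of a few kilobytes.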


On Wed, 05 Jan 2005 15:29:00 +0000, Bryan Feeney
<bfeeney@NotARealAccount.net> wrote:

>I've noticed this problem in a message board script of mine lately. The
>site has a file in the Java Properties format listing various ways of
>blocking users - e.g. usernames, email addresses, or message content.
>Each is a list of regexps.
>
>Here's a sample from the file
>
># Content is a Perl RE, however the . operator doesn't work, nor
># can you specify ranges using the {n,m} notation. Unless otherwise
># specified (using ^ and $) the given text is assumed to be a fragment,
># not the entirety.
>content: (?i)cgi.ebay., (?i)www.ebay., (?i)www.dontpayretail.co.uk,
>          (?i)www.\S*sex\S*.co, (?i)www.goyemen.com, \
>          MONEY, \
>          THAI\s*MASSAGE, (?i)www.cgispy.com, \
>          (?i)<textarea, (?i)textarea>, \
>          (?i)Paypal, (?i)\$1.00\s+Bill, \
>          DOLLARS, \
>          GUARANTE+D?
>
>As you can see, it's a series of comma-delimited regexps, which has been
>broken onto multiple lines using the '\' character.
>
>It worked fine until it grew in size (that's about a third of what I
>have for content). It appears that the bottom of the section isn't being
>read.
>
>What the code does for reading in sections is easy enough:
>
>my %blocks;
>while (not eof (BLKS))
>{	my ($key, $value) = split (/\s*[=:]\s*/, <BLKS> );
>	next if ($key =~ /^#/);
>	$key =~ s/^\s+//gi;
>	while ($value =~ /\\\s*$/g and not eof (BLKS))
>	{	$value =~ s/\\\s*$//g;
>
>		my $valueContinued = <BLKS>;
>		$value .= $valueContinued;
>	}
>	$value =~ s/\s*$//sgi;
>	$value =~ s/^\s*//sgi;
>	$value =~ s/\r?\n//gi;
>
>	$blocks{lc ($key)} = $value;
>}
>
># The content block accept minor REGEXP patterns. However dots are out!
>$blocks{content}     =~ s/\./\\\./gi;
>
>@blockedContent  = split (/\s*,\s*/, $blocks{content});
>
>
>When a message is posted, the code then loops through every regexp in
>the @blockedContent array, applying it in turn, and dying if any of them
>match. Does anyone know why the whole of the block isn't being read in?
>Is it a problem with String size, or the size of an array?
>
>Thanks


Kevin Carlson
01-14-05 08:57 AM

