Scanning @array elements for similar content
Randy

2005-05-28, 3:58 am

Hello,

I have a text file that stores names and email addresses. This data is built
from a feedback form on my website. Here is the format of my textfile
entries:

Dan Smith,dan@email.com
Mike Roberts,mike@yahoo.com
Steve Anderson,steve@goto.com

and so on.

As you can see, it's pretty much a standard CSV textfile. Over time, this
database has grown very big, and there are several duplicate email addresses
in the data. Until recently I have had to visually go through the data and
remove duplicate email addresses I can find, regardless of what is found in
the name field. I am seeking assistance on how I could write a script that
would scan each line, separate the names field from the email address field,
then scan and remove duplicates. So far all I have is the following:

#!/usr/bin/perl

use CGI;
use CGI::Carp qw(fatalsToBrowser);
use strict;

my (@data, $data, $name, $email);

open (FH, "<data.txt") or die "Can't open file: $!";
@data=<FH>;
close(FH);

foreach $data (@data) {
chomp ($data);
($name,$email)=split(/\,/,$data);

# Missing scan for duplicates and removal code here
}

open (FH, ">data.txt") or die "Can't open file: $!";
print FH @data;
close(FH);

Yes I am a newbie Perl programmer. I'm not very good at brainstorming an
approach to sorting/matching routines. I would very much appreciate some
help understanding and building the final element. Another complication is
what if there are two identical email addresses but one is all caps and the
other isn't. I'm not looking for someone to write me the code I need,
instead to point me in the right direction so that I actually learn
something and forward my Perl skills. Thankx everyone.

Robert


Tad McClellan

2005-05-28, 3:58 am

Randy <rbutcher.nospam@hotmail.com> wrote:

> I have a text file that stores names and email addresses.



> I am seeking assistance on how I could write a script that
> would scan each line,



You have that already!

(though poorly done)


> separate the names field from the email address field,



You have that already too.


> then scan and remove duplicates.



perldoc -q duplicate

How can I remove duplicate elements from a list or array?

(pay particular attention to the last sentence of the answer given there.)

You are expected to check the Perl FAQ *before* posting to
the Perl newsgroup, you know.


> use strict;



Very good, but you should also have:

use warnings;

(and look in your server error logs for its output, or,
even better, run your CGI program from the command line
during early development, rather than in the CGI environment)


> open (FH, "<data.txt") or die "Can't open file: $!";
> @data=<FH>;
> close(FH);
>
> foreach $data (@data) {



It is bad practice to read an entire file into memory only to
process it line-by-line anyway.

Why not just read and process line-by-line?


> chomp ($data);
> ($name,$email)=split(/\,/,$data);



Whitespace is not a scarce resource, feel free to use as much of
it as you like to make your code easier to read.


> # Missing scan for duplicates and removal code here
> }



my %emails;
while ( my $data = <INPUT> ) {    # untested
    chomp $data;
    my($name, $email) = split(/\,/, $data);
    $emails{$email} = $name;
}

foreach my $adr ( sort keys %emails ) {
    print OUTPUT "$emails{$adr},$adr\n";
}


> Yes I am a newbie Perl programmer.



I was thinking that you are a newbie to programming itself.


> I would very much appreciate some
> help understanding and building the final element.



Use a hash to eliminate duplicates.


> Another complication is
> what if there are two identical email addresses but one is all caps and the
> other isn't.



You need to decide what to do, then we can help you write Perl
code that does that.

You could perhaps just normalize them all to a single case
before storing or searching the hash:

perldoc -f uc
perldoc -f lc
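
A minimal sketch of that normalization, assuming records arrive as
"Name,address" lines in the format shown in the original post. The sub name
unique_by_email and the sample records are made up for illustration; lc is
the built-in described by perldoc -f lc.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of lower-cased address => name.
# A later record with the same address overwrites the earlier one.
sub unique_by_email {
    my (@lines) = @_;
    my %emails;
    for my $line (@lines) {
        chomp( my $record = $line );
        my ( $name, $email ) = split /,/, $record, 2;
        next unless defined $email;      # skip lines without a comma
        $emails{ lc $email } = $name;    # normalize case before storing
    }
    return %emails;
}

# Sample records (assumed data, not from a real file).
my %emails = unique_by_email(
    "Dan Smith,dan\@email.com\n",
    "Daniel Smith,DAN\@EMAIL.COM\n",
    "Mike Roberts,mike\@yahoo.com\n",
);
print "$emails{$_},$_\n" for sort keys %emails;
```

With the sample data only one line is printed for dan@email.com, and it
carries the name from the later record, because the second assignment to the
same key replaces the first.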


> I'm not looking for someone to write me the code I need,



Oops, too late. :-)


> instead to point me in the right direction so that I actually learn
> something and forward my Perl skills.



A depressingly infrequent display of Good Attitude for this here group.

Good for you! (and us)


--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas


Randy

2005-05-28, 3:58 am

"Tad McClellan" <tadmc@augustmail.com> wrote in message
news:slrnd9fku9.e81.tadmc@magna.augustmail.com...
> Randy <rbutcher.nospam@hotmail.com> wrote:


> perldoc -q duplicate
>
> How can I remove duplicate elements from a list or array?
>
> (pay particular attention to the last sentence of the answer given there.)
>
> You are expected to check the Perl FAQ *before* posting to
> the Perl newsgroup, you know.



I had actually checked the perldoc for this and did find ways to remove
duplicate array entries, but didn't know what to do when I wanted to match
only the split-out item, i.e. the email address, not the entire array
element.


> Very good, but you should also have:
>
> use warnings;
>
> (and look in your server error logs for its output, or,
> even better, run your CGI program from the command line
> during early development, rather than in the CGI environment)



I have also added 'use warnings,' thank you.


>
>
> It is bad practice to read an entire file into memory only to
> process it line-by-line anyway.
>
> Why not just read and process line-by-line?



Agreed, I now use the CPAN module File::Slurp to read textfile entries into
an @array efficiently.
http://search.cpan.org/~uri/File-Sl...b/File/Slurp.pm


> Whitespace is not a scarce resource, feel free to use as much of
> it as you like to make your code easier to read.



Point noted.


>
> You need to decide what to do, then we can help you write Perl
> code that does that.
>
> You could perhaps just normalize them all to a single case
> before storing or searching the hash:



During the split phase where I separate the $name from the $email, I now use
this regex: $email =~ tr/A-Z/a-z/;


>
> A depressingly infrequent display of Good Attitude for this here group.
>
> Good for you! (and us)



You are correct Tad, I am new to programming in general. I'm trying my best
to better understand the basics. Here is the final code I use to remove the
duplicate entries and it does do its job:

#!/usr/bin/perl

use CGI;
use CGI::Carp qw(fatalsToBrowser);

use strict;
use warnings;

open (INPUT, "<data.txt") or die "Can't open file: $!";
my %entries;
while ( my $data = <INPUT> ) {
    chomp $data;
    my ($name, $email) = split(/\,/, $data);
    $name =~ s/(\w+)/\u\L$1/g;
    $email =~ tr/A-Z/a-z/;
    $entries{$email} = $name;
}
close(INPUT);

foreach my $adr ( sort keys %entries ) {
    print "$entries{$adr},$adr\n";
}

exit;

That said, I'm not entirely certain what part of the code IS actually
detecting and removing the duplicate entries. I have a hunch that this is
taking place in the foreach loop. I created a test data.txt file and
manually entered several duplicate email addresses. When the script is run,
any duplicate is removed; it seems to kill duplicates from the top down, i.e.
if dan@email.com was found on 5 lines, it keeps the last occurrence... or
maybe it removed all but the last alphabetically sorted item.

Tad, thank you for this. I would like to ask one final question on this
matter ... right now, when the script is run, it prints to screen all remaining
hash entries without any duplicates. Under that I would like it to show
which entries got removed. I assume to do this, I would need to modify the
script to push any matched duplicates into a secondary array or hash and
then print that last. Perhaps not. Your thoughts are appreciated.

Robert

P.S. You're going to laugh at this but until recently I have never used the
command 'use strict'. To be honest, I'm not 100% certain what exactly this
does, or how it is benefiting me or the script. All I know for certain is
that without adding "my" to variable definitions, the script doesn't
work/run. Most articles I have read online highly recommend using this
command but don't go into great detail why. I ask you this because I wish to
better my understanding of Perl and to ensure I write proper scripts in the
future.


Jürgen Exner

2005-05-28, 3:58 am

Randy wrote:
> During the split phase where I separate the $name from the $email, I
> now use this regex: $email =~ tr/A-Z/a-z/;


Just to be nitpicking: tr/// does not use REs. That's one big difference to
s///.

And it's better to use the function lc() instead of your tr/// code because
lc() handles non-English characters correctly, too, while your code fails
for anything that is outside the basic 26 Latin characters.
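
A small sketch of the difference, using a made-up address whose first
character is outside A-Z. The address and the variable names are assumptions
for illustration only.

```perl
use strict;
use warnings;

# "\x{141}" is LATIN CAPITAL LETTER L WITH STROKE; its lowercase form
# is "\x{142}". The address itself is invented sample data.
my $address = "\x{141}UKASZ\@EXAMPLE.COM";

( my $via_tr = $address ) =~ tr/A-Z/a-z/;    # only maps the 26 letters A-Z
my $via_lc = lc $address;                    # uses Perl's case mapping

print "tr/// left the first character unchanged\n"
    if substr( $via_tr, 0, 1 ) eq "\x{141}";
print "lc() lowercased the first character\n"
    if substr( $via_lc, 0, 1 ) eq "\x{142}";
```

Both messages are printed: the rest of the address is lowercased either way,
but only lc() changes the leading character.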

jue


Randy

2005-05-28, 8:56 am

"Randy" <rbutcher.nospam@hotmail.com> wrote in message
news:ZBTle.1500813$6l.598100@pd7tw2no...
> "Tad McClellan" <tadmc@augustmail.com> wrote in message


> That said, I'm not entirely certain what part of the code IS actually
> detecting and removing the duplicate entries. I have a hunch that this is
> taking place in the foreach loop.


Tad, I did a little more research on hashes. I now think the duplicate
elimination is NOT happening during the foreach loop, that loop is just
sorting the hash and printing it; instead it is occurring when you are
defining each hash element in the initial <while> loop. I think this happens
because you are assigning (your method) the key as the email address and the
value as the name. In doing so you can't have duplicated key names?!?!?! so
the hash just ignores when a request for a duplicate key name is
requested?!?!?!?

If I'm wrong about this I hope you don't think less of me ... I really am
trying to learn.

Robert


Damian James

2005-05-28, 8:56 am

On Sat, 28 May 2005 06:36:15 GMT, Randy said:
> ... so
> the hash just ignores when a request for a duplicate key name is
> requested?!?!?!?


No, the second assignment simply overrides the first one.

my %blah;
$blah{ x } = 'test 1';
$blah{ x } = 'test 2';
print "$blah{ x }\n";

> If I'm wrong about this I hope you don't think less of me ... I really am
> trying to learn.


Pfft, we've all been there. Never care about seeming foolish when
the object is to learn. It's the folks who try to look like they
already know everything who are foolish.

--damian


Brian McCauley

2005-05-28, 8:56 am



Randy wrote:

> "Tad McClellan" <tadmc@augustmail.com> wrote in message
> news:slrnd9fku9.e81.tadmc@magna.augustmail.com...
>
>
>
> I had actually checked the perldoc for this and did find ways to remove
> duplicate array entries but didn't know what to do when I wanted to match
> the specific split item .. ie .. match the email address only, not the
> entire array element.


>
> Agreed, I now use the CPAN module File::Slurp to read textfile entries into
> an @array efficiently.


If you have a need to slurp then File::Slurp will do so efficiently, but
you have no need. It is better to read a line at a time as Tad showed.
I see, looking at the end of the post, that you have indeed done so. Good.

>
> During the split phase where I separate the $name from the $email, I now use
> this regex: $email =~ tr/A-Z/a-z/;


There is no regex there. I agree with Tad that the lc() function would
be better than tr///.

$email = lc $email;


Ditto.
> You are correct Tad, I am new to programming in general. I'm trying my best
> to better understand the basics. Here is the final code I use to remove the
> duplicate entries and it does do its job:


It looks good. I will now proceed to criticise it but don't let this
detract from the fact that it is good.

> #!/usr/bin/perl
>
> use CGI;
> use CGI::Carp qw(fatalsToBrowser);


Is this a CGI script? It doesn't look like one.

> use strict;
> use warnings;


Generally best to put these two ASAP. That way you'll even get their
protection in your other use statements. The only thing I like to see
above these two are comments and, in the case of a module, a package
directive.

> open (INPUT, "<data.txt") or die "Can't open file: $!";
> my %entries;
> while ( my $data = <INPUT> ) {
> chomp $data;
> my ($name, $email) = split(/\,/, $data);


No need to backslash the comma in a regex. I'm not as paranoid about
leaning toothpick syndrome as Tad but I wouldn't bother here.

> $name =~ s/(\w+)/\u\L$1/g;


OK, nothing whatever to do with Perl, but this is bad. There are a lot
of names (like mine) that have non-trivial capitalization. You risk
offending and alienating many people. This has been oft discussed here.
There is no solution as sometimes there can be two distinct names that
differ only in capitalization.
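
To illustrate the problem, here is a short sketch that applies the same
substitution to two sample names. The wrapper sub force_title_case and the
second name are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted above.
sub force_title_case {
    my ($name) = @_;
    $name =~ s/(\w+)/\u\L$1/g;    # lowercase each word, then uppercase its first letter
    return $name;
}

for my $name ( 'Brian McCauley', 'LUDWIG VAN BEETHOVEN' ) {
    printf "%-22s => %s\n", $name, force_title_case($name);
}
```

The first name comes out as "Brian Mccauley", losing the inner capital, and
the second as "Ludwig Van Beethoven", with a capitalised "Van" that the
owner of the name may not want.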

> $email =~ tr/A-Z/a-z/;
> $entries{$email} = $name;
> }
> close(INPUT);


Your code looks nice but your use of indentation between the open/close
is rather unconventional.

> foreach my $adr ( sort keys %entries ) {
> print "$entries{$adr},$adr\n";
> }
>
> exit;


It is more conventional just to let perl fall off the end of your script
and exit() implicitly.

> That said, I'm not entirely certain what part of the code IS actually
> detecting and removing the duplicate entries. I have a hunch that this is
> taking place in the foreach loop.


No - it is the line

$entries{$email} = $name;

If you encounter a second record in the input with an e-mail address
that's been encountered before the above line will replace the old entry
in %entries with a new one, thus forgetting all but the last entry with
a given e-mail.

> .. if dan@email.com was found on 5 lines, it keeps the last occurrence...


Yep.

> Tad, thank you for this. I would like to ask one final question on this
> matter ... right now, when the script is run, it prints to screen all remaining
> hash entries without any duplicates. Under that I would like it to show
> which entries got removed. I assume to do this, I would need to modify the
> script to push any matched duplicates into a secondary array or hash and
> then print that last.


Yes that would work.

if ( defined $entries{$email} ) {
    push @duplicates => $data;
} else {
    $entries{$email} = $name;
}

Note - this now preserves the first instance of each address and puts
the rest into @duplicates.
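
Put together, the whole thing might look like the sketch below. It is an
illustration under assumptions, not the script from this thread: the sub
name split_duplicates and the sample records are invented, it lower-cases
addresses with lc, and it tests with exists rather than defined so that a
record with an empty name still counts as seen.

```perl
use strict;
use warnings;

# Hypothetical helper: separates records into kept entries and duplicates.
# Returns two references: a hash (address => name) holding the first record
# seen for each address, and an array of the raw lines that were skipped.
sub split_duplicates {
    my (@lines) = @_;
    my ( %entries, @duplicates );
    for my $line (@lines) {
        chomp( my $data = $line );
        my ( $name, $email ) = split /,/, $data, 2;
        next unless defined $email;    # skip lines without a comma
        $email = lc $email;
        if ( exists $entries{$email} ) {
            push @duplicates, $data;
        }
        else {
            $entries{$email} = $name;
        }
    }
    return ( \%entries, \@duplicates );
}

# Sample records (assumed data, not from a real file).
my ( $entries, $duplicates ) = split_duplicates(
    "Dan Smith,dan\@email.com\n",
    "Mike Roberts,mike\@yahoo.com\n",
    "D. Smith,DAN\@email.com\n",
);
print "$entries->{$_},$_\n" for sort keys %$entries;
print "\nRemoved:\n";
print "$_\n" for @$duplicates;
```

With the sample data the two kept entries are printed first, followed by the
"D. Smith" line under "Removed:".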

> P.S. You're going to laugh at this but until recently I have never used the
> command 'use strict'. To be honest, I'm not 100% certain what exactly this
> does, or how it is benefiting me or the script. All I know for certain is
> that without adding "my" to variable definitions, the script doesn't
> work/run.


Yes that is probably the most noticeable of the three effects. Without
'use strict' perl will treat the first mention of an undeclared variable
as an implicit declaration of a package-scoped variable (well kinda).
This can be a great convenience in 1-line scripts but is generally a
liability in scripts longer than about 10 lines.

> Most articles I have read online highly recommend using this
> command but don't go into great detail why. I ask you this because I wish to
> better my understanding of Perl and to ensure I write proper scripts in the
> future.


I would argue (and indeed have argued with giants) that it is best to
see 'use strict' as disabling three fairly obscure features and that
understanding of these features is something that should not concern
people too early in their learning of Perl.

http://groups-beta.google.com/group...9f307d6b9e83c65


Tad McClellan

2005-05-28, 3:57 pm

Randy <rbutcher.nospam@hotmail.com> wrote:


[snip stuff already answered in other followups]


> P.S. You're going to laugh at this



No I'm not.

I used to program without it myself (because it did not yet exist :-)


> but until recently I have never used the
> command 'use strict'.

  ^^^^^^^

It is more properly called a "pragma", which is fancy college-talk
for a "compiler directive".


> To be honest, I'm not 100% certain what exactly this
> does,



You can read its docs with

perldoc strict


> or how it is benefiting me or the script.



It finds common mistakes.

"strict vars" in particular, finds typos, which is a very common
mistake made by all of us.


> All I know for certain is
> that without adding "my" to variable definitions, the script doesn't
> work/run.



When you put "use strict" in your program, you are making a
promise to perl.

I promise to declare all of my variables before I use them
(or I will use their fully qualified package name).

If you break your promise, perl will refuse to run your program.
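
A small sketch of that refusal, assuming a misspelled variable name. The
snippet is compiled with a string eval only so that the failure can be
shown without stopping the demonstration itself; $total and the typo $totl
are made-up names.

```perl
use strict;
use warnings;

# Compile a tiny snippet at run time so the error can be inspected.
# $totl is a deliberate typo for $total and is never declared.
my $snippet = 'use strict; my $total = 1; $totl = 2; 1;';
my $ok      = eval $snippet;
my $error   = $@;

my $message = $ok ? "snippet compiled\n" : "perl refused: $error";
print $message;
```

The snippet does not compile, and the reported error starts with
'Global symbol "$totl" requires explicit package name'. Without
"use strict" the typo would silently create a second, unrelated variable.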


> Most articles I have read online highly recommend using this
> command but don't go into great detail why.



It finds bugs in microseconds rather than in a bazillion-microsecond
debugging session.


> I ask you this because I wish to
> better my understanding of Perl and to ensure I write proper scripts in the
> future.



strict and warnings will save you time by finding common
mistakes *for you*.



--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas


RedGrittyBrick

2005-05-29, 3:56 pm

Randy wrote:
> Hello,
>
> I have a text file that stores names and email addresses. This data is built
> from a feedback form on my website. Here is the format of my textfile
> entries:
>
> Dan Smith,dan@email.com
> Mike Roberts,mike@yahoo.com
> Steve Anderson,steve@goto.com
>
> and so on.
>
> As you can see, it's pretty much a standard CSV textfile. Over time, this
> database has grown very big, and there are several duplicate email addresses
> in the data. Until recently I have had to visually go through the data and
> remove duplicate email addresses I can find, regardless of what is found in
> the name field. I am seeking assistance on how I could write a script that
> would scan each line, separate the names field from the email address field,
> then scan and remove duplicates. So far all I have is the following:
>
> #!/usr/bin/perl
>
> use CGI;
> use CGI::Carp qw(fatalsToBrowser);
> use strict;
>
> my (@data, $data, $name, $email);
>
> open (FH, "<data.txt") or die "Can't open file: $!";
> @data=<FH>;
> close(FH);
>
> foreach $data (@data) {
> chomp ($data);
> ($name,$email)=split(/\,/,$data);
>
> # Missing scan for duplicates and removal code here
> }
>
> open (FH, ">data.txt") or die "Can't open file: $!";
> print FH @data;
> close(FH);
>
> Yes I am a newbie Perl programmer. I'm not very good at brainstorming an
> approach to sorting/matching routines. I would very much appreciate some
> help understanding and building the final element. Another complication is
> what if there are two identical email addresses but one is all caps and the
> other isn't. I'm not looking for someone to write me the code I need,
> instead to point me in the right direction so that I actually learn
> something and forward my Perl skills. Thankx everyone.
>


Rather than removing duplicates, I'd not insert them:

#!perl
use strict;
use warnings;

my %seen;    # addresses already printed
open my $fh, '<', 'data.txt'
    or die "unable to open data.txt because $!";
while (<$fh>) {
    chomp;
    my ($name, $address) = split(/\,/, $_, 2);
    print "$name, $address\n" unless $seen{$address}++;
}
close $fh;

Untested.