For Programmers: Free Programming Magazines  


Home > Archive > PERL Beginners > May 2004 > Finding missing numbers in sequence









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Finding missing numbers in sequence
Larry Wissink

2004-05-13, 7:31 am

I have a problem that I thought would be perfect for Perl, except that I
seem to be using all my system resources to run it. Of course this
probably means I'm doing it the wrong way...



The problem:

We have a backup server that is missing records from the production
server for a particular table. We know that it should have sequential
records and that it is missing some records. We want to get a sense of
the number of records missing. So, we know the problem started around
the beginning of March at id 70,000,000 (rounded for convenience).
Currently we are at 79,000,000. So, I dumped to a file all the ids
between 70,000,000 and 79,000,000 (commas inserted here for
readability). I need to figure out what numbers are missing. The way
that seemed easiest to me was to create two arrays. One with every
number between 70 and 79 million, the other with every number in our
dump file. Then compare them as illustrated in the Perl Cookbook using
a hash.

The simple script I came up with works fine with a test file of just 10
records.

But, when I try to scale that to 9 million records, it doesn't work.
This is probably because it is trying to do something like what db
people call a cartesian join (every record against every record).

So, does anybody have a suggestion for a better way to do it in Perl?



I'll probably end up doing it in SQL if I can't come up with a Perl
solution. (Create a second table like the first array with every number
between 70 and 79 million, and join the two tables.)



Larry

lwissink@ebates.com



script:



use strict;

use warnings;



my %seen;

my @list = ();

my @missing;

my @ids = ();

my $lis;

my $item;



foreach $lis (1 .. 10) { # sample list of 10

push(@ids, $lis);

}



open(DATA, "< ms_ids_test.txt") or die "Couldn't open data file: $!\n";
# create file like below



while (<DATA> ) {

chomp;

push(@list, $_);

}



@seen{@list} = ();



foreach $item (@ids) {

push(@missing, $item) unless exists $seen{$item};

}



print scalar(@missing);





#sample file (without the pounds)

#1

#2

#3

#4

#5

#9

#10

# note missing 6,7,8

# result is 3




Ramprasad A Padmanabhan

2004-05-13, 8:30 am

I think there will be some optimizations always possible, but You wont
get any dramatic improvements.

What I would do is something like this

First make sure that all the data is sorted in the file
Create a sequence array of all the required numbers, In your example it
was all numbers from 1..10
#!/usr/bin/perl
#
#
my @full_sequence = (1..10);
my @missing=();
my $i=0;
while(<DATA> ){
chomp;
while($full_sequence[$i] != $_) {
push @missing , $full_sequence[$i++] ;
}
$i++
}

print "MISSING ARE @missing\n";

exit 0;


___ END__
Just try that out , hope it will be better. Infact your requirement is
very data specific , It will hardly make any difference wether you code
in PERL or in C


Bye
Ram


On Thu, 2004-05-13 at 04:09, Larry Wissink wrote:
> I have a problem that I thought would be perfect for Perl, except that I
> seem to be using all my system resources to run it. Of course this
> probably means I'm doing it the wrong way...
>
>
>
> The problem:
>
> We have a backup server that is missing records from the production
> server for a particular table. We know that it should have sequential
> records and that it is missing some records. We want to get a sense of
> the number of records missing. So, we know the problem started around
> the beginning of March at id 70,000,000 (rounded for convenience).
> Currently we are at 79,000,000. So, I dumped to a file all the ids
> between 70,000,000 and 79,000,000 (commas inserted here for
> readability). I need to figure out what numbers are missing. The way
> that seemed easiest to me was to create two arrays. One with every
> number between 70 and 79 million, the other with every number in our
> dump file. Then compare them as illustrated in the Perl Cookbook using
> a hash.
>
> The simple script I came up with works fine with a test file of just 10
> records.
>
> But, when I try to scale that to 9 million records, it doesn't work.
> This is probably because it is trying to do something like what db
> people call a cartesian join (every record against every record).
>
> So, does anybody have a suggestion for a better way to do it in Perl?
>
>
>
> I'll probably end up doing it in SQL if I can't come up with a Perl
> solution. (Create a second table like the first array with every number
> between 70 and 79 million, and join the two tables.)
>
>
>
> Larry
>
> lwissink@ebates.com
>
>
>
> script:
>
>
>
> use strict;
>
> use warnings;
>
>
>
> my %seen;
>
> my @list = ();
>
> my @missing;
>
> my @ids = ();
>
> my $lis;
>
> my $item;
>
>
>
> foreach $lis (1 .. 10) { # sample list of 10
>
> push(@ids, $lis);
>
> }
>
>
>
> open(DATA, "< ms_ids_test.txt") or die "Couldn't open data file: $!\n";
> # create file like below
>
>
>
> while (<DATA> ) {
>
> chomp;
>
> push(@list, $_);
>
> }
>
>
>
> @seen{@list} = ();
>
>
>
> foreach $item (@ids) {
>
> push(@missing, $item) unless exists $seen{$item};
>
> }
>
>
>
> print scalar(@missing);
>
>
>
>
>
> #sample file (without the pounds)
>
> #1
>
> #2
>
> #3
>
> #4
>
> #5
>
> #9
>
> #10
>
> # note missing 6,7,8
>
> # result is 3
>
>
>



Chris Charley

2004-05-13, 11:31 am


----- Original Message -----
From: "Larry Wissink" <lwissink@corp.ebates.com>
Newsgroups: perl.beginners
To: <beginners@perl.org>
Sent: Wednesday, May 12, 2004 6:39 PM
Subject: Finding missing numbers in sequence


I have a problem that I thought would be perfect for Perl, except that I
seem to be using all my system resources to run it. Of course this
probably means I'm doing it the wrong way...



The problem:

We have a backup server that is missing records from the production
server for a particular table. We know that it should have sequential
records and that it is missing some records. We want to get a sense of
the number of records missing. So, we know the problem started around
the beginning of March at id 70,000,000 (rounded for convenience).
Currently we are at 79,000,000. So, I dumped to a file all the ids
between 70,000,000 and 79,000,000 (commas inserted here for
readability). I need to figure out what numbers are missing. The way
that seemed easiest to me was to create two arrays. One with every
number between 70 and 79 million, the other with every number in our
dump file. Then compare them as illustrated in the Perl Cookbook using
a hash.


So, does anybody have a suggestion for a better way to do it in Perl?

Hello Larry,

I was able to come up with a simple script that doesn't require using an
array/hash, so that your mem problems should not occur.

NOTE: the beginning number has to have 1 subtracted from it. This is a first
case situation so that that the algorithm works for the set. Also, if there
are missing records after the last number read in, it won't find that
sequence, but that should be easy enough to find (in this example, any
numbers after 20).

HTH
Chris


#!/usr/bin/perl
use strict;
use warnings;

my @missing;

my $i = 0; # Or 70,000,000 - 1 (69,999,999)

while(<DATA> ) {
if ($_ != ++$i) {
push @missing, $i .. $_ - 1;
$i = $_;
}
}

print join "\n", @missing;

__DATA__
4
5
7
10
12
13
14
15
20


OUTPUT
1
2
3
6
8
9
11
16
17
18
19


Jeff 'Japhy' Pinyan

2004-05-13, 11:31 am

On May 12, Larry Wissink said:

>We have a backup server that is missing records from the production
>server for a particular table. We know that it should have sequential
>records and that it is missing some records. We want to get a sense of
>the number of records missing. So, we know the problem started around
>the beginning of March at id 70,000,000 (rounded for convenience).
>Currently we are at 79,000,000. So, I dumped to a file all the ids
>between 70,000,000 and 79,000,000 (commas inserted here for
>readability). I need to figure out what numbers are missing. The way
>that seemed easiest to me was to create two arrays. One with every
>number between 70 and 79 million, the other with every number in our
>dump file. Then compare them as illustrated in the Perl Cookbook using
>a hash.
>
>But, when I try to scale that to 9 million records, it doesn't work.
>This is probably because it is trying to do something like what db
>people call a cartesian join (every record against every record).


Well, don't do that! ;) When you have a super-set and a sub-set, and
they're ordered, you only need to go through the set ONCE.

@superset = (1 .. 10);
@subset = (1, 2, 4, 7, 8, 9);
@missing = ();

my $idx = 0;

for (@superset) {
push @missing, $_
unless $subset[$idx] == $_ and ++$idx;
}

That's just a bit of shorthand for:

for (@superset) {
if ($subset[$idx] == $_) { $idx++ }
else { push @missing, $_ }
}

Anyway, that populates @missing with the missing elements, in linear time.

--
Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/
RPI Acacia brother #734 http://www.perlmonks.org/ http://www.cpan.org/
CPAN ID: PINYAN [Need a programmer? If you like my work, let me know.]
<stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.

Chris Charley

2004-05-13, 12:34 pm


----- Original Message -----
From: "Jeff 'Japhy' Pinyan" <japhy@perlmonk.org>
Newsgroups: perl.beginners
To: "Larry Wissink" <lwissink@corp.ebates.com>
Cc: <beginners@perl.org>
Sent: Thursday, May 13, 2004 9:56 AM
Subject: Re: Finding missing numbers in sequence


> On May 12, Larry Wissink said:
>

[snip]
[color=darkred]
> Well, don't do that! ;) When you have a super-set and a sub-set, and
> they're ordered, you only need to go through the set ONCE.
>
> @superset = (1 .. 10);
> @subset = (1, 2, 4, 7, 8, 9);


Why put subset in an array when he stated above they were stored
(sequentially?) already in a file?

Isn't this a (possible) cause of his problem, creating a large array and an
even larger one (@superset)?
I've found on my machine (WinXP with 512 meg memory), that arrays larger
than about 10-15 megs start using virtual memory (my hard drive starts to
get busy) (I think those were the numbers :-) ).

My solution assumed @subset was in a file, not an array, and I didn't use a
superset array to compare. So, I had dramatically less memory used than your
solution.

> @missing = ();
>
> my $idx = 0;
>
> for (@superset) {
> push @missing, $_
> unless $subset[$idx] == $_ and ++$idx;
> }
>


Chris


John W. Krahn

2004-05-20, 7:30 pm

Larry Wissink wrote:
>
> We have a backup server that is missing records from the production
> server for a particular table. We know that it should have sequential
> records and that it is missing some records. We want to get a sense of
> the number of records missing. So, we know the problem started around
> the beginning of March at id 70,000,000 (rounded for convenience).
> Currently we are at 79,000,000. So, I dumped to a file all the ids
> between 70,000,000 and 79,000,000 (commas inserted here for
> readability). I need to figure out what numbers are missing. The way
> that seemed easiest to me was to create two arrays. One with every
> number between 70 and 79 million, the other with every number in our
> dump file. Then compare them as illustrated in the Perl Cookbook using
> a hash.
>
> The simple script I came up with works fine with a test file of just 10
> records.
>
> But, when I try to scale that to 9 million records, it doesn't work.
> This is probably because it is trying to do something like what db
> people call a cartesian join (every record against every record).
>
> So, does anybody have a suggestion for a better way to do it in Perl?


Something like this should work:

my $last = <FILE>;
while ( my $current = <FILE> ) {
for ( my $missing = $last + 1; $missing < $current; ++$missing ) {
print "$missing\n";
}
$last = $current;
}



John
--
use Perl;
program
fulfillment
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com