Counting most frequently-occurring n-grams in a file (or over multiple files)

C3

2004-09-24, 3:57 am

I'm looking for, or willing to write, a program that will take a list of
files as command-line arguments, and then build up a frequency table of
n-grams (individual bytes, or strings of 2 or more bytes) for all these
files.

e.g. ngram 4 file1.txt file2.txt

would return the most frequently occurring sequences of 4 bytes over the two
files.

I am willing to go quick'n'dirty for this. I understand I need to build up a
table of all the n-grams that exist in each file. Can someone help me get
started on this?


cheers,


C3

2004-09-24, 4:01 pm

Unmatched curly brace :)

"John W. Krahn" <someone@example.com> wrote in message
news:AwQ4d.146861$XP3.70635@edtnps84...
> C3 wrote:
>
> Well if it's quick'n'dirty that you want:
>
> perl -lne'BEGIN{$r="."x shift}$h{$1}++while/(?=($r))/g}{print for keys%h' 4 file1.txt file2.txt
>
>
>
> John
> --
> use Perl;
> program
> fulfillment



C3

2004-09-24, 4:01 pm

Hmm, seems to run on the command-line, but it produces no output for me.

"John W. Krahn" <someone@example.com> wrote in message
news:AwQ4d.146861$XP3.70635@edtnps84...
> C3 wrote:
>
> Well if it's quick'n'dirty that you want:
>
> perl -lne'BEGIN{$r="."x shift}$h{$1}++while/(?=($r))/g}{print for keys%h' 4 file1.txt file2.txt
>
>
>
> John
> --
> use Perl;
> program
> fulfillment



Jeff 'japhy' Pinyan

2004-09-24, 4:01 pm

On Fri, 24 Sep 2004, it was written:

>I'm looking for, or willing to write, a program that will take a list of
>files as command-line arguments, and then build up a frequency table of
>n-grams (individual bytes, or strings of 2 or more bytes) for all these
>files.
>
>e.g. ngram 4 file1.txt file2.txt
>
>would return the most frequently occurring sequences of 4 bytes over the two
>files.


Open the file, read it in conveniently sized chunks, and for every group
of four characters, increment $ngram{$g}.
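
A minimal sketch of that approach follows. It is not code from anyone in this thread. The sub names count_ngrams and top_ngrams, the 64 KB chunk size, and the top-20 cutoff are all assumptions made for illustration. Each file is read in binary mode, the last n-1 bytes of every chunk are carried over so that n-grams spanning a chunk boundary are still counted, and nothing is carried from one file to the next.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every overlapping $n-byte sequence readable from $fh into %$counts.
sub count_ngrams {
    my ($n, $fh, $counts) = @_;
    binmode $fh;                          # treat input as raw bytes
    my $keep = $n - 1;                    # bytes to carry across chunk boundaries
    my $tail = '';
    while (read($fh, my $chunk, 65536)) { # 64 KB chunks (arbitrary choice)
        my $buf = $tail . $chunk;
        for my $i (0 .. length($buf) - $n) {
            $counts->{ substr($buf, $i, $n) }++;
        }
        # The carried tail is shorter than $n, so no n-gram is counted twice.
        $tail = $keep == 0            ? ''
              : length($buf) > $keep ? substr($buf, -$keep)
              :                        $buf;
    }
}

# Return up to $limit [hex string, count] pairs, most frequent first.
sub top_ngrams {
    my ($counts, $limit) = @_;
    my @grams = sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b }
                keys %$counts;
    splice(@grams, $limit) if @grams > $limit;
    return map { [ unpack('H*', $_), $counts->{$_} ] } @grams;
}

# Command-line use: ngram N file1 [file2 ...]
if (@ARGV >= 2) {
    my ($n, @files) = @ARGV;
    die "n must be a positive integer\n" unless $n =~ /^[1-9]\d*$/;
    my %counts;
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        count_ngrams($n, $fh, \%counts);  # tail is reset for each file
        close $fh;
    }
    printf "%s  %d\n", @$_ for top_ngrams(\%counts, 20);
}
```

Saved as ngram, it would be run as: perl ngram 4 file1.txt file2.txt. The hash holds one entry per distinct n-gram, so memory use can grow quickly for larger n on binary data.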

--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
Senior Dean, Fall 2004 % have long ago been overpaid?
RPI Corporation Secretary %
http://japhy.perlmonk.org/ % -- Meister Eckhart


Bill Smith

2004-09-24, 4:01 pm


"C3" <_> wrote in message
news:4153cc6b$0$20124$afc38c87@news.optusnet.com.au...
> I'm looking for, or willing to write, a program that will take a list of
> files as command-line arguments, and then build up a frequency table of
> n-grams (individual bytes, or strings of 2 or more bytes) for all these
> files.
>

--snip--


Are n-grams restricted to characters on a single line or can they flow
onto the next line? (or even next file?) In the latter case, are the
newline character(s) part of the n-gram?

Bill


Larry Felton Johnson

2004-09-24, 4:01 pm

"C3" <_> wrote in message news:<41542a17$0$23897$afc38c87@news.optusnet.com.au>...
> Hmm, seems to run on the command-line, but it produces no output for me.


What sort of environment are you running it in? I cut and pasted his
oneliner and ran it against a number of files on my workstation, and it
worked right away. I haven't really checked the output carefully, but
on trivial files of character sequences it seems to work as I'd expect.

Larry


C3

2004-09-24, 8:56 pm

> Are n-grams restricted to characters on a single line or can they flow
> onto the next line? (or even next file?) In the latter case, are the
> newline character(s) part of the n-gram?


n-grams are sequences of bytes, not ASCII characters, so line feeds and
carriage returns are treated like any other character. n-grams may not flow
onto other files.

cheers,


C3

2004-09-26, 3:56 pm

I'm running Perl 5.6.1 under Debian 3.0. I don't get any output, and have to
kill the app. Incidentally, what would it take to modify the program so that
it printed the ASCII code in hex (or decimal)? After all, it will be run on
binary files.
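
One possible way to print byte codes rather than raw bytes, sketched here as an illustration only: the helper names hex_bytes and dec_bytes are hypothetical, and the input is assumed to be a byte string (for example, a hash key built from a file read in binary mode).

```perl
use strict;
use warnings;

# Render a byte string as space-separated two-digit hex codes.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

# Same idea, but with decimal codes.
sub dec_bytes {
    my ($bytes) = @_;
    return join ' ', map { ord($_) } split //, $bytes;
}

print hex_bytes("AB\r\n"), "\n";   # prints: 41 42 0d 0a
print dec_bytes("AB\r\n"), "\n";   # prints: 65 66 13 10
```

Calling one of these on each n-gram before printing keeps the output readable even when the n-gram contains control characters or NUL bytes.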


cheers,

"Larry Felton Johnson" <larryj@gsu.edu> wrote in message
news:4ae7bf57.0409241054.2e2d081@posting.google.com...
> "C3" <_> wrote in message
> news:<41542a17$0$23897$afc38c87@news.optusnet.com.au>...
>
> What sort of environment are you running it in? I cut and pasted his
> oneliner and ran it against a number of files on my workstation, and it
> worked right away. I haven't really checked the output carefully, but
> on trivial files of character sequences it seems to work as I'd expect.
>
> Larry


