| Author |
Counting most frequently-occurring n-grams in a file (or over multiple files)
|
|
|
| I'm looking for, or willing to write, a program that will take a list of
files as command-line arguments, and then build up a frequency table of
n-grams (individual bytes, or strings of 2 or more bytes) for all these
files.
e.g. ngram 4 file1.txt file2.txt
would return the most frequently occurring sequences of 4 bytes over the two
files.
I am willing to go quick'n'dirty for this. I understand I need to build up a
table of all the n-grams that exist in each file. Can someone help me get
started on this?
cheers,
| |
|
| Unmatched curly brace :)
"John W. Krahn" <someone@example.com> wrote in message
news:AwQ4d.146861$XP3.70635@edtnps84...
> C3 wrote:
>
> Well if it's quick'n'dirty that you want:
>
> perl -lne'BEGIN{$r="."x shift}$h{$1}++while/(?=($r))/g}{print for keys%h' 4 file1.txt file2.txt
>
>
>
> John
> --
> use Perl;
> program
> fulfillment
| |
|
| Hmm, seems to run on the command-line, but it produces no output for me.
"John W. Krahn" <someone@example.com> wrote in message
news:AwQ4d.146861$XP3.70635@edtnps84...
> C3 wrote:
>
> Well if it's quick'n'dirty that you want:
>
> perl -lne'BEGIN{$r="."x shift}$h{$1}++while/(?=($r))/g}{print for keys%h' 4 file1.txt file2.txt
>
>
>
> John
> --
> use Perl;
> program
> fulfillment
| |
| Jeff 'japhy' Pinyan 2004-09-24, 4:01 pm |
| On Fri, 24 Sep 2004, it was written:
>I'm looking for, or willing to write, a program that will take a list of
>files as command-line arguments, and then build up a frequency table of
>n-grams (individual bytes, or strings of 2 or more bytes) for all these
>files.
>
>e.g. ngram 4 file1.txt file2.txt
>
>would return the most frequently occurring sequences of 4 bytes over the two
>files.
Open the file, read it in conveniently sized chunks, and for every group
of four characters, increment $ngram{$g}.
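A rough sketch of that approach follows. It is only an illustration, not anything posted in the thread: the sub name count_ngrams_in_file, the 64 KB chunk size, and the top-10 cutoff are all assumptions. The last n-1 bytes of each chunk are carried over so that n-grams straddling a chunk boundary are still counted, and the carry is reset for each file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count every overlapping
# $n-byte window in one file and add the counts to %$table.
sub count_ngrams_in_file {
    my ($n, $path, $table) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                 # raw bytes, no newline translation
    my $carry = '';              # tail of the previous chunk (at most n-1 bytes)
    while (read($fh, my $chunk, 65536)) {   # 64 KB is an arbitrary choice
        my $buf = $carry . $chunk;
        # Slide an $n-byte window across the buffer, one byte at a time.
        for my $i (0 .. length($buf) - $n) {
            $table->{ substr($buf, $i, $n) }++;
        }
        # Keep the last n-1 bytes so windows that straddle chunks are seen.
        $carry = length($buf) >= $n ? substr($buf, -($n - 1), $n - 1) : $buf;
    }
    close $fh;
}

# Only runs when arguments are given: ngram.pl N file1 [file2 ...]
if (@ARGV) {
    my ($n, @files) = @ARGV;
    my %ngram;
    count_ngrams_in_file($n, $_, \%ngram) for @files;
    # Most frequent first; ties are broken by byte order.
    my @top = sort { $ngram{$b} <=> $ngram{$a} or $a cmp $b } keys %ngram;
    $#top = 9 if @top > 10;      # keep the ten most frequent
    printf "%6d  %s\n", $ngram{$_}, $_ for @top;
}
```

Saved as, say, ngram.pl (a made-up name), it would be run as: perl ngram.pl 4 file1.txt file2.txt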
--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
Senior Dean, Fall 2004 % have long ago been overpaid?
RPI Corporation Secretary %
http://japhy.perlmonk.org/ % -- Meister Eckhart
| |
| Bill Smith 2004-09-24, 4:01 pm |
|
"C3" <_> wrote in message
news:4153cc6b$0$20124$afc38c87@news.optusnet.com.au...
> I'm looking for, or willing to write, a program that will take a list of
> files as command-line arguments, and then build up a frequency table of
> n-grams (individual bytes, or strings of 2 or more bytes) for all these
> files.
>
--snip--
Are n-grams restricted to characters on a single line or can they flow
onto the next line? (or even next file?) In the latter case, are the
newline character(s) part of the n-gram?
Bill
| |
| Larry Felton Johnson 2004-09-24, 4:01 pm |
| "C3" <_> wrote in message news:<41542a17$0$23897$afc38c87@news.optusnet.com.au>...
> Hmm, seems to run on the command-line, but it produces no output for me.
What sort of environment are you running it in? I cut and pasted his
oneliner and ran it against a number of files on my workstation, and it
worked right away. I haven't really checked the output carefully, but
on trivial files of character sequences it seems to work as I'd expect.
Larry
| |
|
| > Are n-grams restricted to characters on a single line or can they flow
> onto the next line? (or even next file?) In the latter case, are the
> newline character(s) part of the n-gram?
N-grams are sequences of bytes, not ASCII characters, so line feeds and
carriage returns are treated like any other byte. N-grams may not span
files.
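Those rules could be sketched with a lookahead regex, roughly as below. This is an assumption-laden illustration, not code from the thread: the sub name count_ngrams_in_string is made up, and it presumes each file's contents have already been read whole into a string (for example with binmode and $/ undefined). The /s flag makes "." match newlines as well, and calling the sub once per file keeps n-grams from spanning files.

```perl
use strict;
use warnings;

# Hypothetical helper: count overlapping $n-byte sequences in one string.
sub count_ngrams_in_string {
    my ($n, $data, $table) = @_;
    # /s lets "." match "\n" too; the lookahead captures $n bytes
    # without consuming them, so successive windows overlap.
    $table->{$1}++ while $data =~ /(?=(.{$n}))/gs;
}

# Two strings stand in for the contents of two separate files.
my %ngram;
count_ngrams_in_string(2, "ab\n", \%ngram);
count_ngrams_in_string(2, "ab\n", \%ngram);
# %ngram now holds "ab" => 2 and "b\n" => 2. There is no "\na" entry,
# because nothing is counted across the boundary between the two calls.
```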
cheers,
| |
|
| I'm running Perl 5.6.1 under Debian 3.0. I don't get any output, and have to
kill the app. Incidentally, what would it take to modify the program so that
it printed the ASCII code in hex (or decimal)? After all, it will be run on
binary files.
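For the hex or decimal output, one possibility is to unpack each n-gram into its byte values before printing. The sketch below is only an illustration; the names ngram_hex and ngram_dec are made up for it.

```perl
use strict;
use warnings;

# Hypothetical helpers: render an n-gram's bytes as hex or as decimal.
# unpack 'C*' returns the value of each byte as an unsigned integer.
sub ngram_hex { join ' ', map { sprintf '%02x', $_ } unpack 'C*', $_[0] }
sub ngram_dec { join ' ', unpack 'C*', $_[0] }

print ngram_hex("AB\r\n"), "\n";   # 41 42 0d 0a
print ngram_dec("AB\r\n"), "\n";   # 65 66 13 10
```

In the one-liner, printing unpack("H*",$_) instead of the raw key would presumably give a similar result, as one unbroken hex string per n-gram.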
cheers,
"Larry Felton Johnson" <larryj@gsu.edu> wrote in message
news:4ae7bf57.0409241054.2e2d081@posting.google.com...
> "C3" <_> wrote in message
> news:<41542a17$0$23897$afc38c87@news.optusnet.com.au>...
>
> What sort of environment are you running it in? I cut and pasted his
> oneliner and ran it against a number of files on my workstation, and it
> worked right away. I haven't really checked the output carefully, but
> on trivial files of character sequences it seems to work as I'd expect.
>
> Larry
|