Code Comments
I'm looking for, or willing to write, a program that will take a list of files as command-line arguments, and then build up a frequency table of n-grams (individual bytes, or strings of 2 or more bytes) for all these files.
e.g. ngram 4 file1.txt file2.txt
would return the most frequently occurring sequences of 4 bytes over the two files.
I am willing to go quick'n'dirty for this. I understand I need to build up a table of all the n-grams that exist in each file. Can someone help me get started on this?
cheers,
Unmatched curly brace :)
"John W. Krahn" <someone@example.com> wrote in message
news:AwQ4d.146861$XP3.70635@edtnps84...
> C3 wrote:
>
> Well if it's quick'n'dirty that you want:
>
> perl -lne'BEGIN{$r="."x shift}$h{$1}++while/(?=($r))/g}{print for keys%h'
> 4 file1.txt file2.txt
>
>
>
> John
> --
> use Perl;
> program
> fulfillment
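On the "unmatched" brace: with -n, perl wraps the code in a while (<>) { ... } loop, so the }{ in the one-liner closes that loop and opens a fresh block that runs once after all input has been read. Below is a rough expanded equivalent, offered as a sketch rather than as John's exact intent. The sub name count_line_ngrams is made up for illustration. Note that the 4 and the file names are arguments on the same command line as the perl invocation; if they are left off, perl reads standard input and sits waiting. Also, because -l strips each newline and the loop works line by line, n-grams here never span lines, and the output lists the n-grams without their counts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count overlapping n-grams within each line (newlines already removed,
# as -l does in the one-liner). Returns a hash ref of n-gram => count.
# The sub name is hypothetical, chosen for this sketch.
sub count_line_ngrams {
    my ($n, @lines) = @_;
    my $r = '.' x $n;    # e.g. "...." for n = 4
    my %h;
    for my $line (@lines) {
        # The lookahead matches at every position without consuming text,
        # so each overlapping n-gram is captured in $1 in turn.
        $h{$1}++ while $line =~ /(?=($r))/g;
    }
    return \%h;
}

# Only run when arguments are given: first the n-gram length, then files.
if (@ARGV) {
    my $n     = shift @ARGV;
    my @lines = <>;          # reads the remaining arguments as files
    chomp @lines;
    my $h = count_line_ngrams($n, @lines);
    print "$_\n" for keys %$h;   # n-grams only, unsorted, as in the one-liner
}
```

Saved as, say, ngram.pl, it would be run the same way as the one-liner: perl ngram.pl 4 file1.txt file2.txt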
Hmm, seems to run on the command-line, but it produces no output for me.
On Fri, 24 Sep 2004, it was written:
>I'm looking for, or willing to write, a program that will take a list of
>files as command-line arguments, and then build up a frequency table of
>n-grams (individual bytes, or strings of 2 or more bytes) for all these
>files.
>
>e.g. ngram 4 file1.txt file2.txt
>
>would return the most frequently occurring sequences of 4 bytes over the two
>files.
Open the file, read it in conveniently sized chunks, and for every group
of four characters, increment $ngram{$g}.
--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
Senior Dean, Fall 2004 % have long ago been overpaid?
RPI Corporation Secretary %
http://japhy.perlmonk.org/ % -- Meister Eckhart
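A minimal sketch of the chunked approach described above, under a few assumptions of mine: files are read as raw bytes with binmode, the chunk size (64 KB by default) is arbitrary, the last n-1 bytes of each chunk are carried forward so that n-grams straddling a chunk boundary still get counted, and nothing carries over from one file to the next. The sub name count_file_ngrams is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Add the n-grams of one file to %$counts, reading fixed-size chunks.
# $chunk_size is optional; 65536 is an arbitrary default for this sketch.
sub count_file_ngrams {
    my ($n, $path, $counts, $chunk_size) = @_;
    $chunk_size ||= 65536;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                        # raw bytes, no newline translation
    my $buf  = '';
    my $keep = $n - 1;                  # bytes to carry into the next chunk
    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        my $last = length($buf) - $n;   # last valid start offset (may be < 0)
        for my $i (0 .. $last) {
            $counts->{ substr($buf, $i, $n) }++;
        }
        # Keep only the trailing n-1 bytes; every n-gram built from them
        # next time includes at least one new byte, so nothing is counted twice.
        $buf = substr($buf, length($buf) - $keep) if length($buf) > $keep;
    }
    close $fh;
    return;
}

# Usage: script.pl N file1 [file2 ...]
if (@ARGV) {
    my $n = shift @ARGV;
    my %ngram;
    count_file_ngrams($n, $_, \%ngram) for @ARGV;
    # Most frequent first; ties broken by byte order.
    for my $g (sort { $ngram{$b} <=> $ngram{$a} || $a cmp $b } keys %ngram) {
        print "$ngram{$g}\t$g\n";
    }
}
```

Because the buffer is local to each call, the table accumulates across files while the n-grams themselves stop at each file's end. Printing raw bytes will look odd for binary input; converting each n-gram to hex or decimal before printing would address that.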
"C3" <_> wrote in message news:4153cc6b$0$20124$afc38c87@news.optusnet.com.au...
> I'm looking for, or willing to write, a program that will take a list of
> files as command-line arguments, and then build up a frequency table of
> n-grams (individual bytes, or strings of 2 or more bytes) for all these
> files.
> --snip--
Are n-grams restricted to characters on a single line or can they flow onto the next line? (or even next file?) In the latter case, are the newline character(s) part of the n-gram?
Bill
"C3" <_> wrote in message news:<41542a17$0$23897$afc38c87@news.optusnet.com.au>...
> Hmm, seems to run on the command-line, but it produces no output for me.
What sort of environment are you running it in? I cut and pasted his one-liner and ran it against a number of files on my workstation, and it worked right away. I haven't really checked the output carefully, but on trivial files of character sequences it seems to work as I'd expect.
Larry
> Are n-grams restricted to characters on a single line or can they flow
> onto the next line? (or even next file?) In the latter case, are the
> newline character(s) part of the n-gram?
n-grams are sequences of bytes, not ASCII characters, so line feeds and carriage returns are treated like any other character. n-grams may not flow onto other files.
cheers,
I'm running Perl 5.6.1 under Debian 3.0. I don't get any output, and have to kill the app.
Incidentally, what would it take to modify the program so that it printed the ASCII code in hex (or decimal)? After all, it will be run on binary files.
cheers,
"Larry Felton Johnson" <larryj@gsu.edu> wrote in message news:4ae7bf57.0409241054.2e2d081@posting.google.com...
> "C3" <_> wrote in message
> news:<41542a17$0$23897$afc38c87@news.optusnet.com.au>...
>
> What sort of environment are you running it in? I cut and pasted his
> oneliner and ran it against a number of files on my workstation, and it
> worked right away. I haven't really checked the output carefully, but
> on trivial files of character sequences it seems to work as I'd expect.
>
> Larry
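On the hex (or decimal) question: one possible way, sketched here with made-up helper names hex_bytes and dec_bytes, is to split each n-gram into single bytes and format each byte's ordinal value. This assumes the n-gram is held as a plain byte string, which is what reading a file under binmode would give.

```perl
use strict;
use warnings;

# Render a byte string as space-separated two-digit hex codes.
# Helper names are hypothetical, chosen for this sketch.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# Same idea, but with decimal codes.
sub dec_bytes {
    my ($bytes) = @_;
    return join ' ', map { ord($_) } split //, $bytes;
}

# Example: the four bytes 'A', 'B', line feed, NUL.
print hex_bytes("AB\n\0"), "\n";   # prints "41 42 0a 00"
print dec_bytes("AB\n\0"), "\n";   # prints "65 66 10 0"
```

In a counting script, the print step would then emit hex_bytes($g) in place of the raw n-gram $g, so control characters and line feeds show up as readable codes.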