Subject: Re: Counting most frequently-occurring n-grams in a file (or over
Author: John W. Krahn
Date: 2004-09-24, 8:58 am

C3 wrote:
> I'm looking for, or willing to write, a program that will take a list of
> files as command-line arguments, and then build up a frequency table of
> n-grams (individual bytes, or strings of 2 or more bytes) for all these
> files.
>
> e.g. ngram 4 file1.txt file2.txt
>
> would return the most frequently occurring sequences of 4 bytes over the two
> files.
>
> I am willing to go quick'n'dirty for this. I understand I need to build up a
> table of all the n-grams that exist in each file. Can someone help me get
> started on this?


Well if it's quick'n'dirty that you want:

perl -lne'BEGIN{$r="."x shift}$h{$1}++while/(?=($r))/g}{print for keys%h' 4 file1.txt file2.txt
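
Note that the one-liner reads line by line with the newline stripped, so its n-grams never span line boundaries, and it prints only the distinct n-grams, without counts or ordering. If a ranked table is wanted, a longer script along the following lines might do. This is only a sketch under some assumptions: the names count_ngrams and top_ngrams, the cutoff of 20 results, and the hex output format are illustrative choices, not anything from the original request. It slurps each file as raw bytes, so newlines count as ordinary bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every overlapping n-byte sequence in $data, adding to the hash in $counts.
# (count_ngrams is a hypothetical helper name.)
sub count_ngrams {
    my ($data, $n, $counts) = @_;
    # The range is empty when $data is shorter than $n.
    for my $i (0 .. length($data) - $n) {
        $counts->{ substr($data, $i, $n) }++;
    }
    return $counts;
}

# Return up to $limit [ngram, count] pairs, most frequent first.
# Ties are broken by string order so the output is deterministic.
sub top_ngrams {
    my ($counts, $limit) = @_;
    my @sorted = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
                 keys %$counts;
    splice(@sorted, $limit) if @sorted > $limit;
    return map { [ $_, $counts->{$_} ] } @sorted;
}

# Command-line use: ngram.pl N file1 [file2 ...]
if (@ARGV) {
    my ($n, @files) = @ARGV;
    die "usage: $0 N file...\n" unless $n =~ /^[1-9]\d*$/ && @files;

    my %counts;
    for my $file (@files) {
        open my $fh, '<:raw', $file or die "Cannot open $file: $!\n";
        local $/;                      # slurp the whole file at once
        my $data = <$fh>;
        $data = '' unless defined $data;
        close $fh;
        count_ngrams($data, $n, \%counts);
    }

    # Print the count and the n-gram in hex (assumed cutoff: top 20),
    # so binary bytes do not garble the terminal.
    printf "%6d  %s\n", $_->[1], unpack('H*', $_->[0])
        for top_ngrams(\%counts, 20);
}
```

Saved as, say, ngram.pl, it would be run as: perl ngram.pl 4 file1.txt file2.txt. Because each file is counted separately into the same hash, n-grams do not span file boundaries. The hash holds one entry per distinct n-gram, so memory use can grow quickly for large n on big binary files.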



John
--
use Perl;
program
fulfillment