Code Comments

Counting most frequently-occurring n-grams in a file (or over multiple files)
I'm looking for, or willing to write, a program that will take a list of
files as command-line arguments, and then build up a frequency table of
n-grams (individual bytes, or strings of 2 or more bytes) for all these
files.

e.g. ngram 4 file1.txt file2.txt

would return the most frequently occurring sequences of 4 bytes over the two
files.

I am willing to go quick'n'dirty for this. I understand I need to build up a
table of all the n-grams that exist in each file. Can someone help me get
started on this?


cheers,



C3
09-24-04 08:57 AM


Re: Counting most frequently-occurring n-grams in a file (or over multiple files)
Unmatched curly brace :)

"John W. Krahn" <someone@example.com> wrote in message
news:AwQ4d.146861$XP3.70635@edtnps84...
> C3 wrote: 
>
> Well if it's quick'n'dirty that you want:
>
> perl -lne'BEGIN{$r="."x shift}$h{$1}++while/(?=($r))/g}{print for keys%h' 4 file1.txt file2.txt
>
>
>
> John
> --
> use Perl;
> program
> fulfillment
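
On the "unmatched" brace in the quoted one-liner: under -n, perl wraps the code in a while (<>) { ... } loop, so the }{ closes that loop and opens a bare block that runs once after all input has been read. The braces therefore do balance. As quoted, the one-liner prints only the distinct n-grams, not their counts. Below is a minimal sketch of a longer script that counts the same per-line windows and prints them most frequent first. The sub names count_line_ngrams and by_frequency are made up for illustration and are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every overlapping n-character window in one line (newline already
# removed), the way the one-liner's /(?=($r))/g loop does.
sub count_line_ngrams {
    my ($n, $line, $counts) = @_;
    my $re = '.' x $n;                 # e.g. "...." for n = 4
    $counts->{$1}++ while $line =~ /(?=($re))/g;
    return $counts;
}

# Return the n-grams ordered by descending count (ties broken by string order).
sub by_frequency {
    my ($counts) = @_;
    return sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts;
}

if (@ARGV) {
    my $n = shift @ARGV;               # first argument is n; the rest are files
    my %counts;
    while (my $line = <>) {
        chomp $line;                   # the -l switch does this in the one-liner
        count_line_ngrams($n, $line, \%counts);
    }
    print "$counts{$_}\t$_\n" for by_frequency(\%counts);
}
```

Saved under a hypothetical name such as ngram.pl, it would be run as: perl ngram.pl 4 file1.txt file2.txt. The n and the file names have to be on the same command line as the script. With no file arguments, <> reads standard input and waits. Like the one-liner, this works line by line, so no window ever contains a newline.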



C3
09-24-04 09:01 PM


Re: Counting most frequently-occurring n-grams in a file (or over multiple files)
Hmm, seems to run on the command-line, but it produces no output for me.




C3
09-24-04 09:01 PM


Re: Counting most frequently-occurring n-grams in a file (or over multiple files)
On Fri, 24 Sep 2004, it was written:

>I'm looking for, or willing to write, a program that will take a list of
>files as command-line arguments, and then build up a frequency table of
>n-grams (individual bytes, or strings of 2 or more bytes) for all these
>files.
>
>e.g. ngram 4 file1.txt file2.txt
>
>would return the most frequently occurring sequences of 4 bytes over the two
>files.

Open the file, read it in conveniently sized chunks, and for every group
of four characters, increment $ngram{$g}.
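
A minimal sketch of that chunked approach, assuming a 64 KB default chunk size (an arbitrary choice) and a made-up sub name count_file_ngrams. The one detail the description leaves implicit is the chunk boundary: the last n-1 bytes of each chunk have to be carried into the next one, or windows that straddle two chunks are missed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count overlapping n-byte windows in one file, reading fixed-size chunks.
# The last n-1 bytes of the buffer are carried over so that windows
# straddling a chunk boundary are counted exactly once.
sub count_file_ngrams {
    my ($n, $path, $ngram, $chunk_size) = @_;
    $chunk_size ||= 65536;             # assumed default, tune as needed
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                       # raw bytes, no newline translation
    my $buf = '';
    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        my $last = length($buf) - $n;  # start of the last complete window
        for my $i (0 .. $last) {       # empty range if the buffer is too short
            $ngram->{ substr($buf, $i, $n) }++;
        }
        # keep only the tail whose windows have not been counted yet
        $buf = substr($buf, $last + 1) if $last >= 0;
    }
    close $fh;
    return $ngram;
}

if (@ARGV) {
    my $n = shift @ARGV;
    my %ngram;
    count_file_ngrams($n, $_, \%ngram) for @ARGV;
    print "$ngram{$_}\t$_\n"
        for sort { $ngram{$b} <=> $ngram{$a} } keys %ngram;
}
```

Because the buffer is local to each call, nothing carries over from one file to the next, while the hash of counts is shared across all files.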

--
Jeff "japhy" Pinyan         %  How can we ever be the sold short or
RPI Acacia Brother #734     %  the cheated, we who for every service
Senior Dean, Fall 2004    %  have long ago been overpaid?
RPI Corporation Secretary   %
http://japhy.perlmonk.org/  %    -- Meister Eckhart



Jeff 'japhy' Pinyan
09-24-04 09:01 PM


Re: Counting most frequently-occurring n-grams in a file (or over multiple files)
"C3" <_> wrote in message
news:4153cc6b$0$20124$afc38c87@news.optusnet.com.au...
> I'm looking for, or willing to write, a program that will take a list of
> files as command-line arguments, and then build up a frequency table of
> n-grams (individual bytes, or strings of 2 or more bytes) for all these
> files.
>
--snip--


Are n-grams restricted to characters on a single line or can they flow
onto the next line? (or even next file?)  In the latter case, are the
newline character(s) part of the n-gram?

Bill



Bill Smith
09-24-04 09:01 PM


Re: Counting most frequently-occurring n-grams in a file (or over multiple files)
"C3" <_> wrote in message news:<41542a17$0$23897$afc38c87@news.optusnet.com.au>...
> Hmm, seems to run on the command-line, but it produces no output for me.

What sort of environment are you running it in?  I cut and pasted his
oneliner and ran it against a number of files on my workstation, and it
worked right away.  I haven't really checked the output carefully, but
on trivial files of character sequences it seems to work as I'd expect.

Larry

Larry Felton Johnson
09-24-04 09:01 PM


Re: Counting most frequently-occurring n-grams in a file (or over multiple files)
> Are n-grams restricted to characters on a single line or can they flow
> onto the next line? (or even next file?)  In the latter case, are the
> newline character(s) part of the n-gram?

n-grams are sequences of bytes, not ASCII characters, so line feeds and
carriage returns are treated like any other character. n-grams may not flow
onto other files.
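
Those two requirements matter for any regex-based solution, because "." does not match a newline unless the /s modifier is used, and line-by-line reading never sees a window that crosses a line end. Here is a minimal sketch that treats each file as one byte string and keeps files separate. The sub names count_bytes_ngrams and count_files are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count overlapping n-byte windows in a byte string. The /s modifier lets
# "." match "\n" as well, so line feeds and carriage returns are ordinary
# bytes.
sub count_bytes_ngrams {
    my ($n, $bytes, $counts) = @_;
    $counts->{$1}++ while $bytes =~ /(?=(.{$n}))/sg;
    return $counts;
}

# Slurp each file on its own so that no n-gram spans two files.
sub count_files {
    my ($n, @paths) = @_;
    my %counts;
    for my $path (@paths) {
        open my $fh, '<', $path or die "Cannot open $path: $!";
        binmode $fh;                          # raw bytes
        my $bytes = do { local $/; <$fh> };   # whole file as one string
        close $fh;
        count_bytes_ngrams($n, $bytes, \%counts) if defined $bytes;
    }
    return \%counts;
}
```

Slurping is the simplest option but holds a whole file in memory at once, so very large files may call for reading in chunks instead.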

cheers,



C3
09-25-04 01:56 AM


Re: Counting most frequently-occurring n-grams in a file (or over multiple files)
I'm running Perl 5.6.1 under Debian 3.0. I don't get any output, and have to
kill the app. Incidentally, what would it take to modify the program so that
it printed the ASCII code in hex (or decimal)? After all, it will be run on
binary files.
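
On the hex question: one way, sketched here with made-up helper names hex_bytes and dec_bytes, is to split each n-gram into single bytes and format every byte's ord value. This is an illustration, not something posted in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Render a byte string as space-separated two-digit hex codes.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# Same idea in decimal.
sub dec_bytes {
    my ($bytes) = @_;
    return join ' ', map { ord($_) } split //, $bytes;
}

print hex_bytes("AB\n"), "\n";   # prints "41 42 0a"
print dec_bytes("AB\n"), "\n";   # prints "65 66 10"
```

In a counting script, the n-gram would be passed through hex_bytes at print time, so the hash keys stay raw bytes and only the output changes.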


cheers,

"Larry Felton Johnson" <larryj@gsu.edu> wrote in message
news:4ae7bf57.0409241054.2e2d081@posting.google.com...
> "C3" <_> wrote in message
> news:<41542a17$0$23897$afc38c87@news.optusnet.com.au>... 
>
> What sort of environment are you running it in?  I cut and pasted his
> oneliner and ran it against a number of files on my workstation, and it
> worked right away.  I haven't really checked the output carefully, but
> on trivial files of character sequences it seems to work as I'd expect.
>
> Larry



C3
09-26-04 08:56 PM


PERL Miscellaneous archive

Show a Printable Version Send to friend Email This Page to Someone! subscribe to this thread Receive updates to this thread
Computer Consultants
Programming Jobs
Visual Basic Controls
SQL Server Programming
Webservices
Java Security
Visual Studio
C# Programming
Visual J++
Software engineering
Open source Software
Perl Programming
PHP Programming
ASP Programming
ASP .NET Programming
Visual Basic Programming
Windows Scripting Host
Java Programming
Java Help
Java Beans
VBScript
Cobol
MAC Applications
Unix Programming
Forum Jump:
All times are GMT. The time now is 05:23 PM.

 
Free MCSE Braindumps | Real Estate Topics

Programming forum archive

Copyrights CodeComments.com 2004 - 2006

Powered by vBulletin Copyright 2000-2006 Jelsoft Enterprises Limited.