| glen herrmannsfeldt 2006-05-31, 4:05 am |
| Paul Van Delst wrote:
> I want to be able to search an array of strings (with many repeated
> elements) so I can count the unique elements. I've attached a working
> program at the bottom of the post so you can see what I've got so far.
If you expect duplicates a hash table is probably easier and faster.
You might find a hash routine with a built in hash function, otherwise
my favorite is the CRC32 polynomial, which is easy to implement using
a 256 element table and some shift and exclusive or functions. After
computing the CRC32 value for each string, take that value modulo the
hash table size. Easiest is a linked list at each hash position, but
there are other possibilities. Maybe you can find an already written
hashing routine.
As others have said, there are languages which make this very easy,
such as perl and awk. Many C libraries have a hash routine, as does
Java. From reading a file with a list to writing the result is
about two lines of AWK, 20 of C, or 30 of Java. (For the latter two,
most of the work is reading the input file and writing the result.)
Sorting should be O(N log N), even for an external sort. Hashing is
usually considered O(N), though that depends on N not being too large.
I have done hash table counting for millions of unique entries (out of
billions). You really want the hash table to fit in memory.
If it can't, then an external sort might be better.
-- glen
|