Code Comments
hi,

I am wondering what approaches you all might suggest for the following scenario. I'm not looking for code, but just general thoughts about the best way to approach this problem.

I need to substitute hundreds, possibly thousands, of character sequences (English words or phrases) in texts that are up to about 100KB or so in size, and I need to do this as fast as humanly possible (well, actually, faster). Some of the substitution terms require regular expressions, but some can be handled by a regular replace.

Any advice at all about Java resources that might be available would be very much appreciated. I can think of a few naive ways of doing this, but perhaps there are some lesser-known classes than String and StringBuffer that would be useful, or perhaps there is some open-source utility class that offers a mutable character-array type object with powerful search-and-replace/regex abilities, or maybe something else altogether.

Thanks in advance for any pointers...
anon wrote:
> hi,
>
> I am wondering what approaches you all might suggest for the following
> scenario. I'm not looking for code, but just general thoughts about
> the best way to approach this problem.
>
> I need to substitute hundreds, possibly thousands, of character
> sequences (English words or phrases) in texts that are up to about
> 100KB or so in size, and I need to do this as fast as humanly possible
> (well, actually, faster). Some of the substitution terms require
> regular expressions, but some can be handled by a regular replace.

The clearly defined terms (words) can be handled by a hashtable mapping them to their replacements. This leaves the issue of delimiting the terms, which is probably best handled by a conventional stream scanner. Ambiguous terms might be handled by a closest-match lookup using a binary search of nearby terms.

But frankly, if you are after raw speed and are adamant about getting the maximum performance, remember that with Java you are trading some performance for ease of development and a rich library. These factors should be weighed carefully as to their relative importance.

> Any advice at all about Java resources that might be available would
> be very much appreciated. I can think of a few naive ways of doing
> this, but perhaps there are some lesser known classes than String and
> StringBuffer that would be useful, or perhaps there is some
> open-source utility class that offers a mutable character-array type
> object with powerful search-and-replace/regex abilities, or maybe
> something else altogether.

You need to realize that regex in general is not the fastest approach to matching, although here too there are ways to gain speed, such as precompiling the pattern once and reusing it.

--
Paul Lutus
http://www.arachnoid.com
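For readers who do want something concrete, here is a minimal sketch of the hashtable-plus-scanner idea from the reply above. It is only one way to read that suggestion, not the poster's own code. The class name `TermReplacer`, its methods, and the sample terms are all made up for illustration. It assumes terms are single words made of letters (phrases would need a longer-match step), and it uses `StringBuilder`, the unsynchronized counterpart of `StringBuffer`. The one regex rule shown is hypothetical and is compiled once so it can be reused across texts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class TermReplacer {
    // Hypothetical regex rule: compiled once, reused for every text.
    private static final Pattern MULTI_SPACE = Pattern.compile(" {2,}");

    /** Single left-to-right pass: each whole word is looked up in the table. */
    static String replaceWords(String text, Map<String, String> table) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        int n = text.length();
        while (i < n) {
            if (Character.isLetter(text.charAt(i))) {
                int start = i;
                while (i < n && Character.isLetter(text.charAt(i))) {
                    i++;                       // scan to the end of the word
                }
                String word = text.substring(start, i);
                String sub = table.get(word);  // one hash lookup per word
                out.append(sub != null ? sub : word);
            } else {
                out.append(text.charAt(i++));  // copy delimiters unchanged
            }
        }
        return out.toString();
    }

    /** Applies the precompiled regex rule (illustrative only). */
    static String collapseSpaces(String text) {
        return MULTI_SPACE.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("colour", "color");
        table.put("lift", "elevator");
        // prints: The color of the elevator.
        System.out.println(replaceWords("The colour of the lift.", table));
    }
}
```

The point of the design is that the cost of the plain-word pass depends on the length of the text and not on the number of terms, since each word costs one hash lookup. Only the terms that genuinely need regular expressions would then go through precompiled patterns.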
On 23 Sep 2004 22:07:12 -0700, anon <tolchocked@gmail.com> wrote:
> hi,
>
> I am wondering what approaches you all might suggest for the following
> scenario. I'm not looking for code, but just general thoughts about
> the best way to approach this problem.
>
> I need to substitute hundreds, possibly thousands, of character
> sequences (English words or phrases) in texts that are up to about
> 100KB or so in size, and I need to do this as fast as humanly possible
> (well, actually, faster). Some of the substitution terms require
> regular expressions, but some can be handled by a regular replace.
>
> Any advice at all about Java resources that might be available would
> be very much appreciated. I can think of a few naive ways of doing
> this, but perhaps there are some lesser known classes than String and
> StringBuffer that would be useful, or perhaps there is some
> open-source utility class that offers a mutable character-array type
> object with powerful search-and-replace/regex abilities, or maybe
> something else altogether.
>
> Thanks in advance for any pointers...

I suggest two things:

1. Look at the algorithms used in spelling checkers.
2. Avoid characters and stay in bytes if you are sure your text can be represented that way.

Bill
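As a rough illustration of the second suggestion, the sketch below does a search-and-replace directly on a `byte[]` without ever decoding to `char`. It is an assumption-laden example, not something from the thread: the class `ByteReplacer` and its methods are hypothetical names, the search is a plain left-to-right comparison (no spelling-checker style index), and it only makes sense when the text is in a single-byte encoding such as ISO-8859-1, where one byte is one character.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ByteReplacer {
    /** Replaces every occurrence of target with replacement, working on raw bytes. */
    static byte[] replaceAll(byte[] text, byte[] target, byte[] replacement) {
        if (target.length == 0) {
            return text.clone();               // nothing to search for
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(text.length);
        int i = 0;
        while (i < text.length) {
            if (matchesAt(text, i, target)) {
                out.write(replacement, 0, replacement.length);
                i += target.length;            // skip past the matched bytes
            } else {
                out.write(text[i++]);          // copy one unmatched byte
            }
        }
        return out.toByteArray();
    }

    /** True if target occurs in text starting at index pos. */
    static boolean matchesAt(byte[] text, int pos, byte[] target) {
        if (pos + target.length > text.length) {
            return false;
        }
        for (int j = 0; j < target.length; j++) {
            if (text[pos + j] != target[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] text = "the lift, the lift".getBytes(StandardCharsets.ISO_8859_1);
        byte[] result = replaceAll(text,
                "lift".getBytes(StandardCharsets.ISO_8859_1),
                "elevator".getBytes(StandardCharsets.ISO_8859_1));
        // prints: the elevator, the elevator
        System.out.println(new String(result, StandardCharsets.ISO_8859_1));
    }
}
```

Whether this actually beats a `String`-based pass would have to be measured on real data; the sketch only shows what "staying in bytes" could look like.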