| Author |
Looking for suggestions on fast string comparison
|
|
| jimwu88NOOOSPAM@yahoo.com 2007-05-22, 7:12 pm |
| I need to diff two text files line by line. Does anybody know what the
best (fastest) way of doing it? I can use "string match" or "$line1 ==
$line2", but am not sure if those are the fastest comparison.
Thanks,
Jim
| |
| Glenn Jackman 2007-05-22, 7:12 pm |
| At 2007-05-22 02:13PM, "jimwu88NOOOSPAM@yahoo.com" wrote:
> I need to diff two text files line by line. Does anybody know what the
> best (fastest) way of doing it? I can use "string match" or "$line1 ==
> $line2", but am not sure if those are the fastest comparison.
I would guess [string compare] would be what you want.
Or perhaps [exec diff old_file new_file]
--
Glenn Jackman
"You can only be young once. But you can always be immature." -- Dave Barry
| |
| Michael Schlenker 2007-05-22, 7:12 pm |
| jimwu88NOOOSPAM@yahoo.com wrote:
> I need to diff two text files line by line. Does anybody know what the
> best (fastest) way of doing it? I can use "string match" or "$line1 ==
> $line2", but am not sure if those are the fastest comparison.
>
If you need something like the diff program:
Take a look at http://wiki.tcl.tk/3108 or the tcllib struct::list
package, which has the core of the code included.
If you simply need to say: line1 and line2 differ, but do not look for
insertions etc. a simple foreach loop with two indices might be the
fastest way:
    foreach line1 $file1 line2 $file2 {
        if {$line1 ne $line2} { puts "Different: $line1 $line2" }
    }
Basically you use the same bytecodes if you use string equal/string
compare or the Tcl 8.4 eq/ne operators.
But how fast do you need the compare and for what size/kind of files?
Michael
| |
| Donal K. Fellows 2007-05-22, 7:12 pm |
| Stephan Kuhagen wrote:
> In pure Tcl I would guess that
> if { $string1 ne $string2 } {
> not_equal_something
> }
> is the fastest comparison of two strings, because "ne" is direct string
> comparison implemented in C and not equal (ne) should be faster than equal
> (eq).
It really makes no difference; it just inverts the sense of a test. On
the other hand, the eq/ne operators are very fast since they check for
obvious stuff first and do things that modern processors like to do
anyway. The gripping hand is that [string equal] is normally compiled to
the same bytecode anyway; there's no penalty for verbosity.
Now [string compare] is slower since it has to work out which string is
the lesser. That makes a whole bunch of short-cuts invalid.
Donal.
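
A rough way to check the claims above on your own machine is Tcl's built-in [time] command. This is only a sketch: the proc names (cmpNe, cmpEqual, cmpCompare) and the test strings are made up for illustration, and the numbers will vary with Tcl version and hardware.

```tcl
# Hypothetical test data: two long strings that differ only in the last character.
set a [string repeat x 1000]a
set b [string repeat x 1000]b

# Wrap each comparison in a proc so that it gets bytecode-compiled.
proc cmpNe      {s1 s2} { expr {$s1 ne $s2} }
proc cmpEqual   {s1 s2} { expr {![string equal $s1 $s2]} }
proc cmpCompare {s1 s2} { expr {[string compare $s1 $s2] != 0} }

# Report the average time per call for each variant.
foreach p {cmpNe cmpEqual cmpCompare} {
    puts [format "%-11s %s" $p [time {$p $a $b} 10000]]
}
```

All three procs return 1 when the strings differ, so they are interchangeable for an equality check; only the timing should differ, if at all.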
| |
| jimwu88NOOOSPAM@yahoo.com 2007-05-23, 4:27 am |
| On May 22, 3:44 pm, Michael Schlenker <schl...@uni-oldenburg.de>
wrote:
> jimwu88NOOOS...@yahoo.com wrote:
> > I need to diff two text files line by line. Does anybody know what the
>
> If you need something like the diff program:
> Take a look at http://wiki.tcl.tk/3108 or the tcllib struct::list
> package, which has the core of the code included.
>
> If you simply need to say: line1 and line2 differ, but do not look for
> insertions etc. a simple foreach loop with two indices might be the
> fastest way:
>
> foreach line1 $file1 line2 $file2 {
> if {$line1 ne $line2} { puts "Different: $line1 $line2" }
>
> }
>
> Basically you use the same bytecodes if you use string equal/string
> compare or the Tcl 8.4 eq/ne operators.
>
> But how fast do you need the compare and for what size/kind of files?
>
> Michael
Thanks to everybody who replied.
I don't have a target speed. I just wanted to run as fast as I could.
Each file is about 20MB with ~300K lines. I need to skip two lines out
of every ~100 lines (one packet). Those two lines have timestamps, so
they will be different depending on when the packet is generated. All
other lines should be the same if everything works as expected.
Otherwise, the script flags an error. I don't need to know at which
position the lines are different.
Thanks,
Jim
| |
| Uwe Klein 2007-05-23, 4:27 am |
| jimwu88NOOOSPAM@yahoo.com wrote:
> I don't have a target speed. I just wanted to run as fast as I could.
> Each file is about 20MB with ~300K lines. I need to skip two lines out
> of every ~100 lines (one packet). Those two lines have timestamps, so
> they will be different depending on when the packet is generated. All
> other lines should be the same if everything works as expected.
> Otherwise, the script flags an error. I don't need to know at which
> position the lines are different.
are you on a unixy platform?
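# diff exits with status 1 when the files differ, which makes [exec]
# raise an error; catch it and keep the output in diffres.
catch { exec diff $filea $fileb } diffres
set diffres [ split $diffres \n ]
foreach line $diffres {
    switch -glob -- $line {
        {[0-9]*} {
            # handle linepos, e.g. "12c12"
            scan $line %d%c%d leftlineno what rightlineno
        }
        <* {
            # "<" marks a line from the left (first) file
            puts "leftline ( $leftlineno ):$line"
        }
        >* {
            # ">" marks a line from the right (second) file
            puts "rightline( $rightlineno ):$line"
        }
        ---* {
            puts .
        }
    }
}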
uwe
| |
| Larry W. Virden 2007-05-23, 8:08 am |
| On May 23, 2:49 am, Uwe Klein <uwe_klein_habertw...@t-online.de>
wrote:
> are you on a unixy platform?
That same technique will work on windows and other systems with a
command line - just fetch the appropriate diff command.
| |
| Uwe Klein 2007-05-23, 7:12 pm |
| Larry W. Virden wrote:
> On May 23, 2:49 am, Uwe Klein <uwe_klein_habertw...@t-online.de>
> wrote:
>
> That same technique will work on windows and other systems with a
> command line - just fetch the appropriate diff command.
>
Hey larry,
I am a bit slow on occasion.
but not that slow that I would need 4 reminders ;-))
uwe
apropos: with unix it is in the box.
with win every useful prog is extra hassle to get.
Thus I can understand that people write
their utilities in tcl on windows.
| |
| Neil Madden 2007-05-23, 7:12 pm |
| jimwu88NOOOSPAM@yahoo.com wrote:
....
> I don't have a target speed. I just wanted to run as fast as I could.
> Each file is about 20MB with ~300K lines. I need to skip two lines out
> of every ~100 lines (one packet). Those two lines have timestamps, so
> they will be different depending on when the packet is generated. All
> other lines should be the same if everything works as expected.
> Otherwise, the script flags an error. I don't need to know at which
> position the lines are different.
A simple foreach loop and [string equal] test will do then. For two 20MB
files you may as well just slurp the whole lot into memory using [read]
and then use [split] and [foreach]. Should be plenty fast enough.
-- Neil
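
The approach described above (slurp both files with [read], [split] them into lines, walk them with [foreach]) could look roughly like the sketch below. The proc names and the "Timestamp:*" glob pattern are assumptions made for illustration; the thread never says what the timestamp lines actually look like, so the pattern has to be adapted to the real data.

```tcl
# Read a whole file into memory and return its lines as a list.
proc readLines {path} {
    set f [open $path r]
    set data [read $f]
    close $f
    return [split $data \n]
}

# Return 1 if both files agree on every line that is not a timestamp line,
# 0 otherwise. tsPattern is a hypothetical glob pattern for the lines to skip.
proc sameExceptTimestamps {pathA pathB {tsPattern "Timestamp:*"}} {
    set linesA [readLines $pathA]
    set linesB [readLines $pathB]
    if {[llength $linesA] != [llength $linesB]} { return 0 }
    foreach a $linesA b $linesB {
        # Fast path: identical lines need no further checks.
        if {$a eq $b} { continue }
        # Lines differ: acceptable only if both are timestamp lines.
        if {[string match $tsPattern $a] && [string match $tsPattern $b]} {
            continue
        }
        return 0
    }
    return 1
}
```

A call such as [sameExceptTimestamps old.log new.log] (file names hypothetical) then gives a single yes/no answer. The pattern match runs only on lines that already differ, so for nearly identical files almost all of the work is the plain eq test.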
| |
| Larry W. Virden 2007-05-23, 7:12 pm |
| On May 23, 9:32 am, Uwe Klein <uwe_klein_habertw...@t-online.de>
wrote:
>
> Hey larry,
> I am a bit slow on occasion.
>
> but not that slow that I would need 4 reminders ;-))
>
Sigh - google groups kept saying "I'm sorry, but your posting failed;
try again later." I finally gave up hoping that the item would post.
I've deleted the duplicate postings (which, in itself, was painful...)
| |
| Bruce Hartweg 2007-05-23, 7:12 pm |
| Larry W. Virden wrote:
> On May 23, 9:32 am, Uwe Klein <uwe_klein_habertw...@t-online.de>
> wrote:
>
>
> Sigh - google groups kept saying "I'm sorry, but your posting failed;
> try again later." I finally gave up hoping that the item would post.
>
> I've deleted the duplicate postings (which, in itself, was painful...)
>
you're not the only one, I've seen multiple instances of multiple posts
today.
bruce
| |
| Stephan Kuhagen 2007-05-23, 7:12 pm |
| Donal K. Fellows wrote:
>
> It really makes no difference;
You are obviously right. My brain must have gone to bed some time earlier
than me yesterday...
Stephan
| |
|
| In article <T%I4i.38712$Ug.33944@fe1.news.blueyonder.co.uk>,
Donal K. Fellows <donal.k.fellows@manchester.ac.uk> wrote:
>Stephan Kuhagen wrote:
>
>It really makes no difference; it just inverts the sense of a test. On
>the other hand, the eq/ne operators are very fast since they check for
>obvious stuff first and do things that modern processors like to do
>anyway. The gripping hand is that [string equal] is normally compiled to
^^^^^^^^^^^^^^^^^
>the same bytecode anyway; there's no penalty for verbosity.
More of the watchmaker's work, no doubt? :-)
MH
| |
| Donal K. Fellows 2007-05-24, 8:08 am |
| MH wrote:
> Donal K. Fellows wrote:
> ^^^^^^^^^^^^^^^^^
>
> More of the watchmaker's work, no doubt? :-)
It's been motied about that that might be the case.
Donal.
|
|
|
|