Looking for suggestions on fast string comparison
jimwu88NOOOSPAM@yahoo.com

2007-05-22, 7:12 pm

I need to diff two text files line by line. Does anybody know what the
best (fastest) way of doing it is? I can use "string match" or "$line1 ==
$line2", but am not sure if those are the fastest comparisons.

Thanks,
Jim

Glenn Jackman

2007-05-22, 7:12 pm

At 2007-05-22 02:13PM, "jimwu88NOOOSPAM@yahoo.com" wrote:
> I need to diff two text files line by line. Does anybody know what the
> best (fastest) way of doing it is? I can use "string match" or "$line1 ==
> $line2", but am not sure if those are the fastest comparisons.


I would guess [string compare] would be what you want.

Or perhaps [exec diff old_file new_file]

--
Glenn Jackman
"You can only be young once. But you can always be immature." -- Dave Barry
Michael Schlenker

2007-05-22, 7:12 pm

jimwu88NOOOSPAM@yahoo.com wrote:
> I need to diff two text files line by line. Does anybody know what the
> best (fastest) way of doing it is? I can use "string match" or "$line1 ==
> $line2", but am not sure if those are the fastest comparisons.
>

If you need something like the diff program:
Take a look at http://wiki.tcl.tk/3108 or the tcllib struct::list
package, which has the core of the code included.

If you simply need to report that line1 and line2 differ, without
looking for insertions etc., a simple foreach loop over both lists of
lines in parallel might be the fastest way:

foreach line1 $file1 line2 $file2 {
    if {$line1 ne $line2} { puts "Different: $line1 $line2" }
}

Basically you use the same bytecodes if you use string equal/string
compare or the Tcl 8.4 eq/ne operators.

But how fast do you need the compare and for what size/kind of files?

Michael
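
A minimal, self-contained sketch of the parallel foreach loop described above, assuming each file has already been split into a list of lines. The proc name differingLines is made up for illustration and is not from the original post. It collects line numbers instead of printing, which makes the result easier to check.

```tcl
# Hypothetical helper: return the 1-based numbers of the lines that differ.
proc differingLines {lines1 lines2} {
    set result {}
    set n 0
    # foreach walks both lists in parallel; when one list is shorter,
    # its variable is set to the empty string for the missing elements.
    foreach line1 $lines1 line2 $lines2 {
        incr n
        if {$line1 ne $line2} {
            lappend result $n
        }
    }
    return $result
}

puts [differingLines {a b c} {a x c d}]   ;# prints "2 4"
```

Because the shorter list is padded with empty strings, a missing line and an empty line compare as equal. If that distinction matters, compare the list lengths with [llength] first.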
Donal K. Fellows

2007-05-22, 7:12 pm

Stephan Kuhagen wrote:
> In pure Tcl I would guess that
> if { $string1 ne $string2 } {
> not_equal_something
> }
> is the fastest comparison of two strings, because "ne" is direct string
> comparison implemented in C and not equal (ne) should be faster than equal
> (eq).


It really makes no difference; it just inverts the sense of a test. On
the other hand, the eq/ne operators are very fast since they check for
obvious stuff first and do things that modern processors like to do
anyway. The gripping hand is that [string equal] is normally compiled to
the same bytecode anyway; there's no penalty for verbosity.

Now [string compare] is slower since it has to work out which string is
the lesser. That makes a whole bunch of short-cuts invalid.

Donal.
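
For anyone who wants to check this on their own interpreter, here is a rough timing sketch using the built-in [time] command. The proc names viaNe, viaEqual and viaCompare are made up for illustration, and the figures will vary with Tcl version, string contents and machine, so treat it as a way to measure rather than as a result.

```tcl
# Two equal 1000-character strings built separately, so that any shortcut
# for "same object" should not apply and the contents have to be compared.
set a [string repeat x 1000]
set b [string repeat x 1000]

# Wrap each form in a proc so that it is bytecode-compiled.
proc viaNe      {s1 s2} { expr {$s1 ne $s2} }
proc viaEqual   {s1 s2} { string equal $s1 $s2 }
proc viaCompare {s1 s2} { string compare $s1 $s2 }

# [time] runs the script the given number of times and reports the
# average cost per iteration.
foreach cmd {viaNe viaEqual viaCompare} {
    puts "$cmd: [time {$cmd $a $b} 100000]"
}
```

Equal strings are the interesting case for the line-by-line check discussed in this thread, since nearly all lines are expected to match.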
jimwu88NOOOSPAM@yahoo.com

2007-05-23, 4:27 am

On May 22, 3:44 pm, Michael Schlenker <schl...@uni-oldenburg.de>
wrote:
> jimwu88NOOOS...@yahoo.com wrote:
> > I need to diff two text files line by line. Does anybody know what the
>
> If you need something like the diff program:
> Take a look at http://wiki.tcl.tk/3108 or the tcllib struct::list
> package, which has the core of the code included.
>
> If you simply need to say: line1 and line2 differ, but do not look for
> insertions etc. a simple foreach loop with two indices might be the
> fastest way:
>
> foreach line1 $file1 line2 $file2 {
>     if {$line1 ne $line2} { puts "Different: $line1 $line2" }
> }
>
> Basically you use the same bytecodes if you use string equal/string
> compare or the Tcl 8.4 eq/ne operators.
>
> But how fast do you need the compare and for what size/kind of files?
>
> Michael


Thanks to everybody who replied.

I don't have a target speed. I just wanted to run as fast as I could.
Each file is about 20MB with ~300K lines. I need to skip two lines out
of every ~100 lines (one packet). Those two lines have timestamps, so
they will be different depending on when the packet is generated. All
other lines should be the same if everything works as expected.
Otherwise, the script flags an error. I don't need to know at which
position the lines are different.

Thanks,
Jim

Uwe Klein

2007-05-23, 4:27 am

jimwu88NOOOSPAM@yahoo.com wrote:

> I don't have a target speed. I just wanted to run as fast as I could.
> Each file is about 20MB with ~300K lines. I need to skip two lines out
> of every ~100 lines (one packet). Those two lines have timestamps, so
> they will be different depending on when the packet is generated. All
> other lines should be the same if everything works as expected.
> Otherwise, the script flags an error. I don't need to know at which
> position the lines are different.


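are you on a unixy platform?

# diff exits non-zero when the files differ, which makes [exec] raise
# an error; catch it and keep the output in diffres.
catch { exec diff $filea $fileb } diffres
set diffres [ split $diffres \n ]
foreach line $diffres {
    switch -glob -- $line \
    \[0-9]* {
        # handle linepos, e.g. "12c12" (%c stores the character code)
        scan $line %d%c%d leftlineno what rightlineno
    } <* {
        # "<" marks a line from the first (left) file
        puts "leftline ( $leftlineno ):$line"
    } >* {
        # ">" marks a line from the second (right) file
        puts "rightline( $rightlineno ):$line"
    } ---* {
        puts .
    }
}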

uwe
Larry W. Virden

2007-05-23, 8:08 am

On May 23, 2:49 am, Uwe Klein <uwe_klein_habertw...@t-online.de>
wrote:

> are you on a unixy platform?


That same technique will work on windows and other systems with a
command line - just fetch the appropriate diff command.

Uwe Klein

2007-05-23, 7:12 pm

Larry W. Virden wrote:
> On May 23, 2:49 am, Uwe Klein <uwe_klein_habertw...@t-online.de>
> wrote:
>
>
>
>
> That same technique will work on windows and other systems with a
> command line - just fetch the appropriate diff command.
>

Hey larry,
I am a bit slow on occasion.

but not that slow that I would need 4 reminders ;-))

uwe

apropos: with unix it is in the box.
with win every useful prog is extra hassle to get.
Thus I can understand that people write
their utilities in tcl on windows.
Neil Madden

2007-05-23, 7:12 pm

jimwu88NOOOSPAM@yahoo.com wrote:
....
> I don't have a target speed. I just wanted to run as fast as I could.
> Each file is about 20MB with ~300K lines. I need to skip two lines out
> of every ~100 lines (one packet). Those two lines have timestamps, so
> they will be different depending on when the packet is generated. All
> other lines should be the same if everything works as expected.
> Otherwise, the script flags an error. I don't need to know at which
> position the lines are different.


A simple foreach loop and [string equal] test will do then. For two 20MB
files you may as well just slurp the whole lot into memory using [read]
and then use [split] and [foreach]. Should be plenty fast enough.

-- Neil
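
A minimal sketch of the [read], [split] and [foreach] approach, combined with the requirement to ignore the timestamp lines. The thread does not say how those lines are recognised, so the glob pattern "Timestamp:*" is purely an assumption, and the proc names readLines and sameExceptSkipped are made up for illustration.

```tcl
# Read a whole file into memory and return it as a list of lines.
proc readLines {path} {
    set chan [open $path r]
    set data [read $chan]
    close $chan
    return [split $data \n]
}

# Return 1 if the two line lists match, ignoring pairs of lines that both
# match skipPattern (an assumed glob pattern for the timestamp lines).
proc sameExceptSkipped {lines1 lines2 skipPattern} {
    foreach line1 $lines1 line2 $lines2 {
        # Fast path: most lines are expected to be identical.
        if {$line1 eq $line2} continue
        # Only lines that differ are checked against the pattern.
        if {[string match $skipPattern $line1]
                && [string match $skipPattern $line2]} continue
        return 0
    }
    return 1
}

# Demonstration on in-memory lists; for real files one would pass
# [readLines $fileA] and [readLines $fileB] instead.
puts [sameExceptSkipped {a {Timestamp: 1} b} {a {Timestamp: 2} b} Timestamp:*]
```

Testing equality first keeps the pattern match off the common path, so the roughly 2% of lines that carry timestamps are the only ones that pay for it.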
Larry W. Virden

2007-05-23, 7:12 pm

On May 23, 9:32 am, Uwe Klein <uwe_klein_habertw...@t-online.de>
wrote:

>
> Hey larry,
> I am a bit slow on occasion.
>
> but not that slow that I would need 4 reminders ;-))
>


Sigh - google groups kept saying "I'm sorry, but your posting failed;
try again later." I finally gave up hoping that the item would post.

I've deleted the duplicate postings (which, in itself, was painful...)

Bruce Hartweg

2007-05-23, 7:12 pm

Larry W. Virden wrote:
> On May 23, 9:32 am, Uwe Klein <uwe_klein_habertw...@t-online.de>
> wrote:
>
>
> Sigh - google groups kept saying "I'm sorry, but your posting failed;
> try again later." I finally gave up hoping that the item would post.
>
> I've deleted the duplicate postings (which, in itself, was painful...)
>

you're not the only one, I've seen multiple instances of multiple posts
today.

bruce
Stephan Kuhagen

2007-05-23, 7:12 pm

Donal K. Fellows wrote:

>
> It really makes no difference;


You are obviously right. My brain must have gone to bed some time earlier
than me yesterday...

Stephan

MH

2007-05-23, 7:12 pm

In article <T%I4i.38712$Ug.33944@fe1.news.blueyonder.co.uk>,
Donal K. Fellows <donal.k.fellows@manchester.ac.uk> wrote:
>Stephan Kuhagen wrote:
>
>It really makes no difference; it just inverts the sense of a test. On
>the other hand, the eq/ne operators are very fast since they check for
>obvious stuff first and do things that modern processors like to do
>anyway. The gripping hand is that [string equal] is normally compiled to
         ^^^^^^^^^^^^^^^^^
>the same bytecode anyway; there's no penalty for verbosity.


More of the watchmaker's work, no doubt? :-)

MH
Donal K. Fellows

2007-05-24, 8:08 am

MH wrote:
> Donal K. Fellows wrote:
> ^^^^^^^^^^^^^^^^^
>
> More of the watchmaker's work, no doubt? :-)


It's been motied about that that might be the case.

Donal.