Home > Archive > Tcl > January 2008 > Speeding up recursive directory traversal
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Speeding up recursive directory traversal
|
|
| ewilsonmail@gmail.com 2007-12-08, 10:28 pm |
| Our found that my Tcl script was significantly slower than equivalent
php and python scripts and traced the problem to the part that handles
recursive directory traversal. I wrote test scripts and here are
sample results indicative of the tests:
[edwil@fedora devel]$ time php rec.php /usr/share
PHP SUMMARY
dirs 7367
files 95280
real 0m6.836s
user 0m3.801s
sys 0m2.750s
[edwil@fedora devel]$ time tclsh rec.tcl /usr/share
Tcl SUMMARY
dirs 7367
files 95280
real 0m18.240s
user 0m10.705s
sys 0m6.760s
The relevant part of the Tcl code is:
------------------------------------------------------------
proc process {e} {
global nDir nFile
if {[file isfile $e]} {
incr nFile
return
}
if {![file isdirectory $e] || ![file readable $e]} {return}
incr nDir
catch {
set entries [glob -directory $e * \.*]
foreach de $entries {
set tail [file tail $de]
if {[string equal $tail . ] || [string equal $tail ..]} {
continue
}
process $de
}
}
}
------------------------------------------------------------
I suspect that the problem is in the way I use [glob]. Any ideas?
(I'm not worried about matching "non-standard" filenames.)
Regards
E Wilson
| |
| Michael Schlenker 2007-12-09, 8:12 am |
| ewilsonmail@gmail.com schrieb:
> Our found that my Tcl script was significantly slower than equivalent
> php and python scripts and traced the problem to the part that handles
> recursive directory traversal. I wrote test scripts and here are
> sample results indicative of the tests:
>
> [edwil@fedora devel]$ time php rec.php /usr/share
>
> PHP SUMMARY
> dirs 7367
> files 95280
>
> real 0m6.836s
> user 0m3.801s
> sys 0m2.750s
> [edwil@fedora devel]$ time tclsh rec.tcl /usr/share
>
> Tcl SUMMARY
> dirs 7367
> files 95280
>
> real 0m18.240s
> user 0m10.705s
> sys 0m6.760s
It could be that Tcl does a lot more under the hood to cater for its vfs
layer, which has some overhead.
Try to take a look at an strace output when running both scripts,
probably Tcl does a lot more syscalls. There probably is no easy fix.
Michael
| |
| Wojciech Kocjan 2007-12-09, 7:18 pm |
| Dnia 09-12-2007 o 05:22:18 <ewilsonmail@gmail.com> napisał(a):
> Our found that my Tcl script was significantly slower than equivalent
> php and python scripts and traced the problem to the part that handles
> recursive directory traversal. I wrote test scripts and here are
> sample results indicative of the tests:
>
> [edwil@fedora devel]$ time php rec.php /usr/share
>
> PHP SUMMARY
> dirs 7367
> files 95280
>
> real 0m6.836s
> user 0m3.801s
> sys 0m2.750s
> [edwil@fedora devel]$ time tclsh rec.tcl /usr/share
Is it possible for you to switch your code to do a file stat in the
beginning instead of a lot of isfile/isdirectory calls?
What are the results if you also skip file isreadable? I expect all of
these calls to do a stat().
Also, stracing the code should also show what is the main ovcerhead.
I think you could be able to get it down to 12s, but that's still slower
than PHP - which could be the overhead related to VFS code in Tcl.
--
Wojciech Kocjan
| |
| Alexandre Ferrieux 2007-12-09, 10:16 pm |
| On Dec 9, 5:22 am, ewilsonm...@gmail.com wrote:
> Our found that my Tcl script was significantly slower than equivalent
> php and python scripts and traced the problem to the part that handles
> recursive directory traversal. I wrote test scripts and here are
> sample results indicative of the tests:
It is possible to reduce all filesystem activity to a single [glob]
call per directory/file.
The key is that [glob -nocomplain -dir $e * .*] returns:
- an empty list for non-dirs
- a list with at least . and .. for dirs
thus avoiding [file isdirectory].
With this I gain a 2.6x speedup over your code, which is close to the
2.66 you get with PHP:
proc process2 e {
global nDir nFile
set l [glob -nocomplain -directory $e * .*]
if {![llength $l]} {incr nFile;return}
incr nDir
foreach de $l {
set tail [file tail $de]
if {[string equal $tail . ] || [string equal $tail ..]} continue
process2 $de
}
}
Granted, I omitted the rights-checking part. If you really want them,
you can [catch]. This may have a cost which I didn't bother to
measure ;-)
-Alex
| |
|
|
|
|
| ewilsonmail@gmail.com 2007-12-11, 7:17 pm |
| On Dec 11, 8:27 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
> On Dec 10, 7:33 pm, ewilsonm...@gmail.com wrote:
>
>
> Still, the above posted code with just [glob] and no [file state] is
> 2.4 times faster than your last code.
> You're not saying how important is the weeding out of non-regular
> files nor dirs (but [glob -types] may help).
Regarding the optimisation, the new script runs slower than the
original:
(I just need regular files by the way but that's ok)
[ed@fedora devel]$ time tclsh rec.tcl /usr/share
Tcl SUMMARY
dirs 7390
files 95754
real 0m21.118s
user 0m14.572s
sys 0m6.441s
[ed@fedora devel]$ time tclsh rec2.tcl /usr/share
Tcl SUMMARY
dirs 7390
files 95754
real 0m30.725s
user 0m20.992s
sys 0m9.605s
[ed@fedora devel]$ time php rec.php /usr/share
PHP SUMMARY
dirs 7390
files 95754
real 0m5.835s
user 0m3.422s
sys 0m2.387s
[ed@fedora devel]$ cat rec2.tcl
proc process2 e {
global nDir nFile
set l [glob -nocomplain -directory $e * .*]
if {![llength $l]} {incr nFile;return}
incr nDir
foreach de $l {
set tail [file tail $de]
if {[string equal $tail . ] || [string equal $tail ..]} continue
process2 $de
}
}
proc main {} {
global argc argv argv0 nFile nDir
set nFile 0
set nDir 0
if { $argc<1 } { puts "$argv0: usage tclsh $argv0 files|dirs"; exit
1 }
foreach a $argv { process2 $a }
puts "Tcl SUMMARY"
puts "dirs $nDir"
puts "files $nFile"
}
main
Tcl version is 8.4.15.
> More likely, the filesystem cache: 17usecs/lstat when hitting the
> head, 1 or less when running from cache.
> Always do your tests twice or thrice to cancel out scuh first-time
> effects in comparisons.
Of course, you're quite right. The 2-dir-deep run was the last thing
I did. Let's ignore it.
Regards
E Wilson
| |
| ewilsonmail@gmail.com 2007-12-13, 8:13 am |
| On Dec 11, 8:27 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
<snip>
> Still, the above posted code with just [glob] and no [file state] is
> 2.4 times faster than your last code.
> You're not saying how important is the weeding out of non-regular
> files nor dirs (but [glob -types] may help).
<snip>
I found it slower: http://groups.google.com/group/comp...37e234ac5b743ac
Could you show your test scripts?
Regards
E Wilson.
| |
| Alexandre Ferrieux 2007-12-13, 10:14 pm |
| On Dec 13, 2:04 pm, ewilsonm...@gmail.com wrote:
> On Dec 11, 8:27 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
> wrote:
> <snip>> Still, the above posted code with just [glob] and no [file state] is
>
> <snip>
>
> I found it slower:http://groups.google.com/group/comp...37e234ac5b743ac
>
> Could you show your test scripts?
Nothing more than the posted code, invoked in the tclsh console as
time {process2 c:/}
on a WindowsXP Dell D620 laptop.
Again, I did every test several times to get rid of first-hit effects.
-Alex
| |
| ewilsonmail@gmail.com 2007-12-13, 10:14 pm |
| On Dec 14, 12:55 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
> On Dec 13, 2:04 pm, ewilsonm...@gmail.com wrote:
>
>
>
>
>
> Nothing more than the posted code, invoked in the tclsh console as
>
> time {process2 c:/}
>
> on a WindowsXP Dell D620 laptop...
Ah, that might explain why we get different results.
Thank you.
E Wilson.
| |
| Bezoar 2007-12-13, 10:14 pm |
| On Dec 8, 10:22 pm, ewilsonm...@gmail.com wrote:
> Our found that my Tcl script was significantly slower than equivalent
> php and python scripts and traced the problem to the part that handles
> recursive directory traversal. I wrote test scripts and here are
> sample results indicative of the tests:
>
> [edwil@fedora devel]$ time php rec.php /usr/share
>
> PHP SUMMARY
> dirs 7367
> files 95280
>
> real 0m6.836s
> user 0m3.801s
> sys 0m2.750s
> [edwil@fedora devel]$ time tclsh rec.tcl /usr/share
>
> Tcl SUMMARY
> dirs 7367
> files 95280
>
> real 0m18.240s
> user 0m10.705s
> sys 0m6.760s
>
> The relevant part of the Tcl code is:
> ------------------------------------------------------------
> proc process {e} {
> global nDir nFile
>
> if {[file isfile $e]} {
> incr nFile
> return
> }
> if {![file isdirectory $e] || ![file readable $e]} {return}
> incr nDir
> catch {
> set entries [glob -directory $e * \.*]
> foreach de $entries {
> set tail [file tail $de]
> if {[string equal $tail . ] || [string equal $tail ..]} {
> continue
> }
> process $de
> }
> }}
>
> ------------------------------------------------------------
> I suspect that the problem is in the way I use [glob]. Any ideas?
>
> (I'm not worried about matching "non-standard" filenames.)
>
> Regards
> E Wilson
The reason for the time difference seems obvious to me ... You will
notice that glob in tcl is optimized to find particular patterns in a
directory so must do a match on every directory entry before returning
its list. In your tcl script two compares must be made on each
entry. In the php case, none of these compares are being made except
the ones you program yourself. Perhaps glob should be made to return
everything in a directory if no pattern is provided rather than give
an error. On the other hand glob ( at lease in 8.5 ) has some very
nice filtering capabilities using -types that can filter on filesystem
types and permissions as well. You can use glob itself to separate
the files from the directories as I have done (below) eliminating the
comparisons in the script. Without this redundancy you save 50%. I
demonstrate that in the below script.
In the script below, I also add an additional counter to see how many
hidden files are in the file system,to show the filtering
action. Suprisingly without it ( and adding .* to the glob for all
files) only saved about .05 seconds in runtime. I also recursively
create a directory list first then glob for the files by iteratively
globing over the resulting directory list. I did this to eliminate the
the isdirectory and isfile comparisons in my script. It appears that
glob filters by the type first then applies the more expensive
matching tests second. The good news is that I was able to speed up
the time to about 3.5 - 3.7 seconds from 7.8 to 8 seconds. More than
%50 percent speed up.
I wonder if you could provide the timing with filtering out the
hidden files as I have done for the php script to see if that brings
the scripts more in line.
Original script when run across my home directory.
Tcl SUMMARY PHP SUMMARY
dirs 5885 dirs 5888
files 143373 files 143373
real 0m7.383s real 0m2.035s
user 0m3.359s user 0m0.869s
sys 0m3.922s sys 0m1.167s
Timing for script provided below over my home directory
Tcl SUMMARY
dirs :5888
Regular files :143143
hidden files :229
Total Files :143372
other 0
real 0m3.649s
user 0m0.829s
sys 0m2.820s
-- script ---
proc allDirs { e retlist } {
upvar $retlist retval
set count 0;
foreach dir [ glob -nocomplain -directory $e -types {d } --
* .* ] {
if { $count < 2 } {
if { [ regexp {(\.|\.\.)$} $dir] } {
incr count;
continue
}
}
allDirs $dir retval
}
lappend retval $e
}
proc process { e } {
global nDir nFile nFileHidden
set dirlist {}
allDirs $e dirlist
foreach dir $dirlist {
incr nFile [llength [ glob -nocomplain -directory $dir -types {f} --
* ] ]
incr nFileHidden [llength [ glob -nocomplain -directory $dir -
types {f hidden } -- .* ] ]
}
set nDir [llength $dirlist ]
}
proc main {} {
global argc argv argv0 nFile nDir nFileHidden
set nFile 0
set nFileHidden 0
set nDir 0
set nOther 0
if { $argc<1 } {
puts "$argv0: usage tclsh $argv0 files|dirs";
exit 1
}
foreach a $argv { process $a }
puts ""
puts "Tcl SUMMARY"
puts "dirs :$nDir"
puts "Regular files :$nFile"
puts "hidden files :$nFileHidden"
puts "Total Files :[expr { $nFile + $nFileHidden}] "
puts "other $nOther"
}
main
---- script end
Carl
| |
| Bezoar 2007-12-13, 10:14 pm |
| On Dec 8, 10:22 pm, ewilsonm...@gmail.com wrote:
> Our found that my Tcl script was significantly slower than equivalent
> php and python scripts and traced the problem to the part that handles
> recursive directory traversal. I wrote test scripts and here are
> sample results indicative of the tests:
>
> [edwil@fedora devel]$ time php rec.php /usr/share
>
> PHP SUMMARY
> dirs 7367
> files 95280
>
> real 0m6.836s
> user 0m3.801s
> sys 0m2.750s
> [edwil@fedora devel]$ time tclsh rec.tcl /usr/share
>
> Tcl SUMMARY
> dirs 7367
> files 95280
>
> real 0m18.240s
> user 0m10.705s
> sys 0m6.760s
>
> The relevant part of the Tcl code is:
> ------------------------------------------------------------
> proc process {e} {
> global nDir nFile
>
> if {[file isfile $e]} {
> incr nFile
> return
> }
> if {![file isdirectory $e] || ![file readable $e]} {return}
> incr nDir
> catch {
> set entries [glob -directory $e * \.*]
> foreach de $entries {
> set tail [file tail $de]
> if {[string equal $tail . ] || [string equal $tail ..]} {
> continue
> }
> process $de
> }
> }}
>
> ------------------------------------------------------------
> I suspect that the problem is in the way I use [glob]. Any ideas?
>
> (I'm not worried about matching "non-standard" filenames.)
>
> Regards
> E Wilson
The reason for the time difference seems obvious to me ... You will
notice that glob in tcl is optimized to find particular patterns in a
directory so must do a match on every directory entry before returning
its list. In your tcl script two compares must be made on each
entry. In the php case, none of these compares are being made except
the ones you program yourself. Perhaps glob should be made to return
everything in a directory if no pattern is provided rather than give
an error. On the other hand glob ( at lease in 8.5 ) has some very
nice filtering capabilities using -types that can filter on filesystem
types and permissions as well. You can use glob itself to separate
the files from the directories as I have done (below) eliminating the
comparisons in the script. Without this redundancy you save 50%. I
demonstrate that in the below script.
In the script below, I also add an additional counter to see how many
hidden files are in the file system,to show the filtering
action. Suprisingly without it ( and adding .* to the glob for all
files) only saved about .05 seconds in runtime. I also recursively
create a directory list first then glob for the files by iteratively
globing over the resulting directory list. I did this to eliminate the
the isdirectory and isfile comparisons in my script. It appears that
glob filters by the type first then applies the more expensive
matching tests second. The good news is that I was able to speed up
the time to about 3.5 - 3.7 seconds from 7.8 to 8 seconds. More than
%50 percent speed up.
I wonder if you could provide the timing with filtering out the
hidden files as I have done for the php script to see if that brings
the scripts more in line.
Original script when run across my home directory.
Tcl SUMMARY PHP SUMMARY
dirs 5885 dirs 5888
files 143373 files 143373
real 0m7.383s real 0m2.035s
user 0m3.359s user 0m0.869s
sys 0m3.922s sys 0m1.167s
Timing for script provided below over my home directory
Tcl SUMMARY
dirs :5888
Regular files :143143
hidden files :229
Total Files :143372
other 0
real 0m3.649s
user 0m0.829s
sys 0m2.820s
-- script ---
proc allDirs { e retlist } {
upvar $retlist retval
set count 0;
foreach dir [ glob -nocomplain -directory $e -types {d } --
* .* ] {
if { $count < 2 } {
if { [ regexp {(\.|\.\.)$} $dir] } {
incr count;
continue
}
}
allDirs $dir retval
}
lappend retval $e
}
proc process { e } {
global nDir nFile nFileHidden
set dirlist {}
allDirs $e dirlist
foreach dir $dirlist {
incr nFile [llength [ glob -nocomplain -directory $dir -types {f} --
* ] ]
incr nFileHidden [llength [ glob -nocomplain -directory $dir -
types {f hidden } -- .* ] ]
}
set nDir [llength $dirlist ]
}
proc main {} {
global argc argv argv0 nFile nDir nFileHidden
set nFile 0
set nFileHidden 0
set nDir 0
set nOther 0
if { $argc<1 } {
puts "$argv0: usage tclsh $argv0 files|dirs";
exit 1
}
foreach a $argv { process $a }
puts ""
puts "Tcl SUMMARY"
puts "dirs :$nDir"
puts "Regular files :$nFile"
puts "hidden files :$nFileHidden"
puts "Total Files :[expr { $nFile + $nFileHidden}] "
puts "other $nOther"
}
main
-- end script ---
Carl
| |
| ewilsonmail@gmail.com 2007-12-14, 7:17 pm |
| On Dec 14, 1:57 pm, Bezoar <cwjo...@gmail.com> wrote:
> Perceptive comments plus some nice code snipped
Your version more than halves the time on my Linux Desktop.
Tcl SUMMARY
dirs :7390
Regular files :95725
hidden files :29
Total Files :95754
other 0
real 0m9.597s
user 0m4.546s
sys 0m5.004s
My initial version gives
Tcl SUMMARY
dirs 7390
files 95754
real 0m20.889s
user 0m13.973s
sys 0m6.841s
php tracking hidden files /and/ directories gives:
PHP SUMMARY
dirs 7390
files 95754
hidden 31
real 0m6.561s
user 0m3.927s
sys 0m2.584s
(about 15% slower than when it doesn't track hidden files/dirs - note /
usr/share has changed since my first post so don't use the old
numbers)
In the php script I just do "if ($fn[0] == '.') ++$nHidden;" to work
on *nix.
Your script has certainly narrowed the gap on my Linux machine while
Alexandre's makes a significant difference on Windows XP.
Perhaps the [glob] command needs an option (or maybe the default
option) that returns all directory entries.
Kind regards
E Wilson.
| |
| ewilsonmail@gmail.com 2007-12-15, 8:12 am |
| On Dec 15, 8:34 am, "Donal K. Fellows"
<donal.k.fell...@manchester.ac.uk> wrote:
> ewilsonm...@gmail.com wrote:
>
> What exactly do you mean by this? All subdirectories in a directory? All
> entries in a directory, whether subdirs or what? (The [glob] command can
> do both of those.)
>
> Donal.
The 2nd: all entries. But without having to match explicitly. So
instead of
[glob * .*]
we can just go
[glob] or [glob -all].
I'm no API designer so there are undoubtedly better ways to do it
including leaving things as they are.
Regards
E Wilson
(The message of mine that Donal replied to hasn't showing up in Google
Groups yet - for those who are using this interface ... and are
wondering)
| |
| Alexandre Ferrieux 2007-12-15, 8:12 am |
| On Dec 15, 7:45 am, ewilsonm...@gmail.com wrote:
>
> instead of
> [glob * .*]
> we can just go
> [glob] or [glob -all].
Yes, I'd like this too.
And while we're at it, I'd also suggest that the same option (or an
extra one) skip the "." and ".." entries, because most of the time we
need extra code not to recurse on them.
-Alex
| |
| ewilsonmail@gmail.com 2007-12-15, 7:15 pm |
| On Dec 15, 9:46 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
> On Dec 15, 7:45 am, ewilsonm...@gmail.com wrote:
>
> Yes, I'd like this too.
> And while we're at it, I'd also suggest that the same option (or an
> extra one) skip the "." and ".." entries, because most of the time we
> need extra code not to recurse on them.
>
> -Alex
+1
Regards
E Wilson
| |
| Donal K. Fellows 2007-12-15, 7:15 pm |
| ewilsonmail@gmail.com wrote:
> The 2nd: all entries. But without having to match explicitly. So
> instead of
> [glob * .*]
> we can just go
> [glob] or [glob -all].
Have you done a FRQ for this yet?
Donal.
| |
| ewilsonmail@gmail.com 2007-12-18, 7:20 pm |
| [Contents]
* where I'm at
* notes about the scripts
* the scripts
* timing results
* strace output
* comments and analysis welcomed
.... so,
I tested each script in single-user mode taking the median of 5 runs
with the disk cache warmed up.
I tested 4 different scripts:
1. My original Tcl script (rec.tcl)
2. The latest Tcl script (rec4.tcl) from this thread with tweaks
3. My original php script (rec.php)
4. My quick python script (rec.py)
Then I ran strace on all scripts except the python one ( ...long day)
[Some Notes about the scripts]
I've removed some of the micro-optimisations from rec3.tcl ... it
runs a little slower than the previous version. I've also made it
more realistic: rather than counting files by [llength] of the file
list, I walk the list as I do in the script that does real work. It
uses code contributed by others in this thread.
My original php can be improved by utilising methods of the
'DirectoryIterator' instead of passing strings to 'is_file' and
'is_directory'. But the script is already the fastest.
The python script does not follow links. I leave this as an exercise
for the interested reader ;-) Apart from the timing, the interesting
thing with the python script is that it is embarrassingly
short (across & down) as it uses the os module of the python library.
In fairness to php and tcl scripts, they also check that their
arguments are files or directories and act appropriately while the
python os.walk() was written to ignore exceptions.
[The Scripts]
# rec.tcl: the original -----------------------------------------
proc err-exit {} {
global argv0
puts "$argv0: usage <directory-name | file-name> ..."
exit
}
proc process {e} {
global nDir nFile
if {[file isfile $e]} {
incr nFile
return
}
if {![file isdirectory $e]} {return}
incr nDir
catch {
set entries [glob -directory $e * \.*]
foreach de $entries {
set tail [file tail $de]
if {[string equal $tail . ] || [string equal $tail ..]} {
continue
}
process $de
}
}
}
proc main {} {
global argc argv nFile nDir
set nFile 0
set nDir 0
if { $argc<1 } { err-exit }
foreach a $argv { process $a }
puts SUMMARY
puts "directories $nDir"
puts "files $nFile"
}
main
# rec4.tcl: current -------------------------------------------------
# with thanks to everyone in this thread
proc is-dot {dir} { regexp {(\.|\.\.)$} $dir }
proc process-file {f} {
global nFile
incr nFile
}
proc process {e} {
global nDir nFile
incr nDir
foreach f [glob -nocomplain -directory $e -types {f} -- * .*] {
process-file $f
}
foreach dir [glob -nocomplain -directory $e -types {d} -- * .*] {
if {![is-dot $dir]} { process $dir }
}
}
proc main {} {
global argc argv argv0 nFile nDir
set nDir [set nFile 0]
if {$argc<1} {
puts "$argv0: usage tclsh $argv0 dirs or files";
exit 1
}
foreach a $argv {
if {[file isfile $a]} { process-file $a; continue }
process $a
}
puts "Tcl SUMMARY"
puts "dirs \t $nDir"
puts "files \t $nFile"
}
main
# rec.py ------------------------------------------------------------
import os, sys
from os.path import join
def rec(d):
global numDirs, numFiles
for root, dirs, files in os.walk(arg):
numDirs+=1
for file in files:
numFiles+=1
numDirs = numFiles = 0
args = sys.argv[1:]
for arg in args: rec(arg)
print "PYTHON SUMMARY"
print "dirs %d" % numDirs
print "files %d" % numFiles
# rec.php -----------------------------------------------------------
<?php
function process_file ($f)
{
global $nFile;
++$nFile;
}
function process ($d)
{
global $nDir;
if (is_file ($d)) { process_file ($d); return; }
if (!is_dir ($d)) return;
try {
$dir = new DirectoryIterator ($d);
++$nDir;
foreach ($dir as $file) {
if ($file->isDot()) continue;
process($file->getpathName());
}
}
catch (Exception $ex) {
echo "Cannot iterate over $d\n";
}
}
$nDir = $nFile = 0;
$n = count ($argv);
for ($i=1; $i<$n; ++$i) {
$e = $argv[$i];
process($e);
}
echo "\nPHP SUMMARY\ndirs $nDir\nfiles $nFile\n";
---------------------------------------------------------------------
[Timing Results] on /usr/share
-- Median of 5 runs --------------------------------
rec4.tcl rec.tcl rec.php rec.py
real 8.792 14.014 4.536 10.453
user 4.777 9.270 2.648 7.838
sys 4.013 4.742 1.888 2.614
----------------------------------------------------
[Truncated strace output]
rec.tcl (the original script)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
62.76 0.582095 1 628971 1 lstat64
20.47 0.189866 2 125317 stat64
11.52 0.106819 4 30024 getdents
2.99 0.027711 2 14790 open
1.60 0.014857 1 14791 close
0.51 0.004709 0 14787 fstat64
0.12 0.001157 1 1179 readlink
0.01 0.000075 4 18 mmap2
0.01 0.000068 4 16 read
0.01 0.000066 4 16 brk
rec4.tcl
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
44.69 0.355098 1 265410 stat64
25.81 0.205072 3 60048 getdents
16.37 0.130055 1 131921 1 lstat64
7.51 0.059689 2 29570 open
3.98 0.031633 1 29571 close
1.51 0.012012 0 29567 fstat64
0.09 0.000727 1 601 readlink
0.03 0.000223 56 4 write
0.01 0.000062 6 10 7 access
0.01 0.000046 46 1 set_robust_list
rec.php
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
64.28 0.256811 2 103151 stat64
26.16 0.104530 5 22404 getdents
5.19 0.020739 3 7462 3 open
2.45 0.009807 1 7463 close
0.97 0.003856 1 7390 ls
0.84 0.003353 0 7465 fstat64
0.07 0.000279 2 141 mmap2
0.02 0.000096 3 31 munmap
0.01 0.000057 1 91 read
Comments and analysis welcomed.
Regards
E Wilson
| |
| ewilsonmail@gmail.com 2007-12-18, 7:20 pm |
| On Dec 16, 8:45 am, "Donal K. Fellows"
<donal.k.fell...@manchester.ac.uk> wrote:
> ewilsonm...@gmail.com wrote:
>
> Have you done a FRQ for this yet?
>
> Donal.
Will do :-)
Regards
E Wilson
| |
| schlenk@uni-oldenburg.de 2007-12-18, 7:20 pm |
| > The python script does not follow links. I leave this as an exercise
> for the interested reader ;-) Apart from the timing, the interesting
> thing with the python script is that it is embarrassingly
> short (across & down) as it uses the os module of the python library.
> In fairness to php and tcl scripts, they also check that their
> arguments are files or directories and act appropriately while the
> python os.walk() was written to ignore exceptions.
Maybe you should take a look at:
http://tcllib.sourceforge.net/doc/multiop.html
Its also very , not widely known and short. (but might not help
with the speed problem).
Michael
| |
| Alexandre Ferrieux 2007-12-18, 7:20 pm |
| On Dec 18, 6:40 pm, schl...@uni-oldenburg.de wrote:
>
> Maybe you should take a look at:http://tcllib.sourceforge.net/doc/multiop.html
>
> Its also very , not widely known and short. (but might not help
> with the speed problem).
It sits on top of fileutil::find, which is itself a variant of the
usual recursive-glob at the beginning of this thread. The only
difference I can see is that it insists on handling file entries at
each given level before diving into subdirs...
-Alex
| |
| ewilsonmail@gmail.com 2007-12-18, 7:20 pm |
| I should add that I'm happy with the speed of my current Tcl script.
I must admit that I'm surprised with the speed of the php and the
brevity of the python one.
(multiop looks too)
I sometimes wish that glob (and Tcl commands in general) didn't use
such long words for options.
Compare
glob -nocomplain -directory $e -types {f} -- * .*
glob -q -d $e -t {f} -- * .*
The first may indicate the meaning better but I still have to look at
the manpage.
I find it easier to read down than across and often use [set] in my
scripts to make up for long command invocations. Tcl isn't perfect
but I can't help liking it (and I haven't got into the fun stuff yet).
E Wilson
| |
|
|
|
|
|