Home > Archive > Tcl > January 2008 > Wouldn't it be nice to have a [binary set] command ?
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Wouldn't it be nice to have a [binary set] command ?
|
|
| Rolf Schroedter 2008-01-28, 9:02 am |
| I'm working a lot with binary data.
When I need to replace e.g. a memory cell at address 13, i usually do
something like
% set MEM [string repeat a 60]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
% set ofs 13
13
% binary scan $MEM a${ofs}I1a* head x tail
3
% set MEM [binary format a*I1a* $head 0x30313233 $tail]
aaaaaaaaaaaaa0123aaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
Obviously this usage of the [binary] command isn't very effective,
because lots of data copying and variable creation is performed, making
these kind of operations quite slow.
Wouldn't it be nice to have a [binary set varName] command working
simlar to [lset varName] on an existing binary data variable, supporting
all the [binary] format specifications ?
binary set varName formatString ?arg arg ...?
Then the code above could be written in plain Tcl as
binary set MEM @${ofs}I1 0x30313233
| |
| Donal K. Fellows 2008-01-28, 9:02 am |
| Rolf Schroedter wrote:
> Wouldn't it be nice to have a [binary set varName] command working
> simlar to [lset varName] on an existing binary data variable, supporting
> all the [binary] format specifications ?
You should be able to do this sort of thing using [string replace] and a
simple [binary format], but I agree that a [binary set] would be a nice
enhancement.
Donal.
| |
| Rolf Schroedter 2008-01-28, 9:02 am |
| Donal K. Fellows wrote, On 28.01.2008 14:20:
> You should be able to do this sort of thing using [string replace] and a
> simple [binary format], but I agree that a [binary set] would be a nice
> enhancement.
>
> Donal.
I agree.
However my feeling is that the combination [string replace] & [binary
format] is more expensive than [binary scan] & [binary format].
And the relative difference increases with the size of binary data
(see results below).
BTW, although I know that the double memmove() done by [setIntA]
shouldn't be very effective, I'm impressed by the performance:
1 kByte/microsecond for a Pentium4@3.2GHz isn't that bad...
% proc setIntA {varName ofs val} {
upvar $varName A
binary scan $A a${ofs}Ia* head x tail
set A [binary format a*Ia* $head $val $tail]
}
% proc setIntB {varName ofs val} {
upvar $varName A
set A [string replace $A $ofs $ofs+3 [binary format I $val]]
}
% set MEM [string repeat a 1000]
% time {setIntA MEM 13 0x30313233} 1000
5.364 microseconds per iteration
% time {setIntB MEM 13 0x30313233} 1000
5.78 microseconds per iteration
% expr 5.78/5.364
==> 1.0775540641312453
% set MEM [string repeat a 10000]
% time {setIntA MEM 13 0x30313233} 1000
16.981 microseconds per iteration
% time {setIntB MEM 13 0x30313233} 1000
18.803 microseconds per iteration
% expr 18.803/16.981
==> 1.107296390083034
% set MEM [string repeat a 100000]
% time {setIntA MEM 13 0x30313233} 1000
105.072 microseconds per iteration
% time {setIntB MEM 13 0x30313233} 1000
141.483 microseconds per iteration
% expr 141.483/105.072
==> 1.3465338053905893
| |
| Alexandre Ferrieux 2008-01-28, 7:43 pm |
| On Jan 28, 2:20 pm, "Donal K. Fellows"
<donal.k.fell...@manchester.ac.uk> wrote:
> Rolf Schroedter wrote:
>
> You should be able to do this sort of thing using [string replace] and a
> simple [binary format], but I agree that a [binary set] would be a nice
> enhancement.
>
> Donal.
Funnily, this relates to a discussion right now on the tcl-core list
about direct memory access. A [binary set] would indeed provide a
natural entry door for mmap'ed files and/or shared memory segments,
provided the ByteArray type is extended to accomodate backing by a non-
malloc'ed range (but even the read-only side with [string range] would
please me).
-Alex
| |
| Fredderic 2008-01-28, 7:43 pm |
| On Mon, 28 Jan 2008 15:31:32 +0100,
Rolf Schroedter <me@privacy.net> wrote:
> Donal K. Fellows wrote, On 28.01.2008 14:20:
> I agree. However my feeling is that the combination [string replace]
> & [binary format] is more expensive than [binary scan] & [binary
> format]. And the relative difference increases with the size of
> binary data (see results below).
> % proc setIntA {varName ofs val} {
> upvar $varName A
> binary scan $A a${ofs}Ia* head x tail
> set A [binary format a*Ia* $head $val $tail]
> }
There's also another method, that seemed to work quite neatly for me...
proc setIntC {varName ofs val} {
upvar $varName A
set A [binary format a*@${ofs}I $A $val]
}
I ran across this recently when I was faced with a bit of a problem;
trying to deal with a little over 500k items, each being a list
containing a character, 7-9 integers (depending on the specific type of
item, 8 actually being the norm), and one or two strings (again,
depending on the specific type of item, with one string being the norm).
If you work this out in terms of TCL's internal storage, that's 10-12
distinct entities per record, at a rough 40b per Tcl_Obj, plus the size
of the string reps, that's a minimum of about 480 bytes to store one
record, or about 240MB for the lot, and that's before you add
allocation of the array keys (which are a unique numeric index given to
each item). Dumping the entire structure into a text file, one item per
line, only consumes 44MB.
On a desktop machine with only 256MB physical RAM, and a fair bit going
on besides, handling this much data got a wee bit slow.
So, I re-coded it to store the integer fields as a binary blob, and
place that, along with any strings (usually one, possibly two), in a
list. That cut the per-item storage down to about 152b (based on my
40b per entity estimate) plus the length of the string (the two-string
case is uncommon, but adds another entitiy). At about 76MB, for exactly
the same data, that's a much better proposition. Still a bit more than
the 44MB it consumed on file, but adequate none the less; the
alternative is to flush the internal rep of each record every time I
access it in any way, and I'm guessing (I haven't actually tested it),
that'd cost a fair bit more in speed with the repeated construction and
disposal of a list internal rep., holding a bunch of distinct strings.
Probably not the best method, I'm more interested in getting it working
for now (which means keeping the memory footprint just low enough
that my machine can deal with it), but essentially what I did was this:
proc get-binary {field info} {
binary scan [lindex $info 0] $field value
return $value
}
proc get-binarg {field info} {
return [lindex $info $field]
}
proc set-binary {field infoVar value} {
upvar 1 $infoVar info
set rec [binary format $field [lindex $info 0] $value]
set info [lreplace $info 0 0 $rec]
}
proc set-binarg {field infoVar value} {
upvar 1 $infoVar info
set info [lreplace $info $field $field $value]
}
[proc "" {fields} {
foreach {key field} $fields {
interp alias {} get-$key {} get-binary @${field}
interp alias {} set-$key {} set-binary a*@${field}u
}
foreach {key field} {name 1 alias 2} {
interp alias {} get-$key {} get-binarg $field
interp alias {} set-$key {} set-binarg $field
}
}] [lsearch -all -not -inline {
{# 00 0 0 1 1 1 2 2 2 |2 3 }
{# 01 5 7 1 5 9 3 5 7 |9 1 }
{# an...t.n...n...n...n...t.t.t.|t.t. }
type 0a {# single-character record type}
id 1n {# identifier ID for this record}
refs 5t {# the data is reference-counted}
base 7n {# parent item in the data set}
{# etc. etc. etc.}
} {#*}]
The ID is left in the record simply because it makes a few things
easier. I've also considered incorporating the strings into the binary
blob also (should be doable with just a second [binary] statement), and
doing away with the underlying list entirely. I probably will, though
it won't save particularly much compared to what that saves already.
I have to say, one thing that would have made life a fair bit easier
at some points, would have been support for null-terminated strings, or
a length deliminator similar to [format]s version of the * deliminator.
For example;
binary scan "hello\0..." z*... string
binary scan "\05hello..." ca?... len len string
likewise support for [format]s %n in [binary format] could be helpful;
binary format "x2...a*N...@0s" ... "hello" ofs ... ofs
Although that example is highly evil, it could be handy in some far
more benign situations also.
Anyone think it's practical and TIP-worthy? ;)
Fredderic
| |
| Donal K. Fellows 2008-01-28, 7:43 pm |
| Fredderic wrote:
> If you work this out in terms of TCL's internal storage, that's 10-12
> distinct entities per record, at a rough 40b per Tcl_Obj,
On a 32-bit system, it's 24 bytes per Tcl_Obj, assuming no sharing
(it's actually exactly 6 machine words). The additional overhead per
object due to memory allocation is very low (sub-byte after
amortization).
Donal.
| |
| Rolf Schroedter 2008-01-29, 4:56 am |
| Fredderic wrote, On 28.01.2008 20:04:
>
> There's also another method, that seemed to work quite neatly for me...
>
> proc setIntC {varName ofs val} {
> upvar $varName A
> set A [binary format a*@${ofs}I $A $val]
> }
Great thing, your setIntC is much faster than SetIntA/B,
obviously because there is only one extra memcopy instead of two:
size microseconds
bytes SetIntA SetIntB setIntC
10^3 5.4 5.8 3.1
10^4 17.0 18.8 5.8
10^5 105.0 141.8 28.8
Thanks a lot,
Rolf.
| |
| Fredderic 2008-01-31, 4:53 am |
| On Mon, 28 Jan 2008 16:51:36 -0800 (PST),
"Donal K. Fellows" <donal.k.fellows@man.ac.uk> wrote:
> Fredderic wrote:
> On a 32-bit system, it's 24 bytes per Tcl_Obj, assuming no sharing
> (it's actually exactly 6 machine words). The additional overhead per
> object due to memory allocation is very low (sub-byte after
> amortization).
*nods* I was talking about 32-bit architectures, and I was ignoring
allocation overhead of the Tcl_Obj's for that exact reason.
I may have been a little overzealous in my estimation, though I was
allowing for a minimal string rep, also, which has its own overheads,
and at least a few characters. I suspect if I took an average, that
estimate is actually on the small side. My appologies if I didn't make
that obvious.
Fredderic
| |
| Fredderic 2008-01-31, 4:53 am |
| On Tue, 29 Jan 2008 09:15:59 +0100,
Rolf Schroedter <me@privacy.net> wrote:
> Fredderic wrote, On 28.01.2008 20:04:
> Great thing, your setIntC is much faster than SetIntA/B,
> obviously because there is only one extra memcopy instead of two:
> size microseconds
> bytes SetIntA SetIntB setIntC
> 10^3 5.4 5.8 3.1
> 10^4 17.0 18.8 5.8
> 10^5 105.0 141.8 28.8
I also went to pains to avoid any string reps, which is probably a
slice of the difference also. As I understand it, string reps aren't
just a memcopy, they're an encoding translation to and from almost-UTF-8
also, which would incur some significant overhead here.
Not to mention more than double memory for the dual value storage. And
with this many values, that's quite significant. ;)
Although there is an actual "string" value type, I think. I'll have to
hunt it down sometime, but I'm pretty sure I've seen one once, quite by
accident. I wonder if it's used to delay encoding transformations...?
Then again, I could just have been imagining it. :)
Fredderic
|
|
|
|
|