For Programmers: Free Programming Magazines  


Home > Archive > Smalltalk > October 2004 > Finding the intersection of two collections









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Finding the intersection of two collections
Thomas Gagne

2004-10-05, 4:04 am

I thought this had already been discussed, but I just returned from
groups.google.com empty-handed.

Is there a hidden method (or obscurely named one) for finding the common
members of two collections?

I have two collections (sets actually) and I'd like to find the members they
have in common.

It seems we could iterate through one looking for elements of the other. It
also seems it should be possible to add both collections' members to a bag
then select the items with multiple occurences but I didn't see an obvious way
to do it so I thought I'd ask...
Volker Zink

2004-10-05, 4:04 am

I don't know where you searched, but if i search at groups.google.com
within the newsgroup comp.lang.smalltalk i find something. For example
the thread named "[VW7.2] Intersection is missing?":

http://groups.google.com/groups?hl=...objects.com%26r
num%3D1

Volker

Thomas Gagne schrieb:
> I thought this had already been discussed, but I just returned from
> groups.google.com empty-handed.
>
> Is there a hidden method (or obscurely named one) for finding the common
> members of two collections?
>
> I have two collections (sets actually) and I'd like to find the members
> they have in common.
>
> It seems we could iterate through one looking for elements of the
> other. It also seems it should be possible to add both collections'
> members to a bag then select the items with multiple occurences but I
> didn't see an obvious way to do it so I thought I'd ask...

panu

2004-10-06, 4:00 am

Volker Zink wrote:

> I don't know where you searched, but if i search at groups.google.com
> within the newsgroup comp.lang.smalltalk i find something. For example
> the thread named "[VW7.2] Intersection is missing?":
>
> http://groups.google.com/groups?hl=...lyobjects.com%2

6rnum%3D1


I think the solution set1 - (set1 - set2)
is fabulous in its simplicity, and also
a great example of the power of Smalltalk.

-Panu Viljamaa



Volker Zink

2004-10-06, 4:00 am

panu schrieb:
> Volker Zink wrote:
>
26rnum%3D1[color=darkred]
>
>
>
>
> I think the solution set1 - (set1 - set2)
> is fabulous in its simplicity, and also
> a great example of the power of Smalltalk.
>
> -Panu Viljamaa
>
>
>


Its elegant but not efficient. So for practical purpose i use other
solutions which need less time.

Volker
Chris Uppal

2004-10-06, 8:57 am

Volker Zink wrote:

[color=darkred]
> Its elegant but not efficient. So for practical purpose i use other
> solutions which need less time.


Personally, I wouldn't even call it elegant. Short yes. Clear no. Obfuscated
is the word I'm looking for.

In the absence of "proper" set notation symbols (big U, inverted big U, etc)
I'd say that anything except:

aSet intersection: anotherSet

was pointlessly indirect. (Not that it has to be restricted to Sets; indeed
Dolphin defines #intersection: against Collection. And in (one of) the obvious
way(s):

^ self select: [:each | aCollection includes: each].

(In Dolphin, btw, #- is only defined on Set, not on Collection.)

-- chris



Normand Mongeau

2004-10-06, 4:00 pm

"Volker Zink" <Volker.Zink@porabo.ch> wrote in message
news:4163a292@news.totallyobjects.com...
> panu schrieb:
%26rnum%3D1[color=darkred]
>
> Its elegant but not efficient. So for practical purpose i use other
> solutions which need less time.
>
> Volker


Well it all depends on the dialect. In GemStone Smalltalk, set arithmetic
(IdentitySet that is) is extremely fast and efficient, if fact you can
execute somethink like the above on sets containing millions of objects and
the answer comes back in sub-second timings.

Normand
The Objects Consulting


panu

2004-10-09, 3:56 am

Chris Uppal wrote:

[color=darkred]
> Personally, I wouldn't even call it elegant. Short yes. Clear no. Obfuscated
> is the word I'm looking for.



I agree it is not trivial - probably
because the '-' of Sets doesn't obey
the same laws as '-' of numbers.

Yet I'd suspect that for a mathematician
the above is about as clear as it gets.

-Panu Viljamaa

Chris Uppal

2004-10-09, 8:56 am

panu wrote:
> Chris Uppal wrote:
>
>
>
>
> I agree it is not trivial - probably
> because the '-' of Sets doesn't obey
> the same laws as '-' of numbers.
>
> Yet I'd suspect that for a mathematician
> the above is about as clear as it gets.


Well, it's a long time since I did my degree in Pure Maths, but if I still
counted as a mathematician at all, then that'd be one vote against it ;-)

-- chris



Ian Upright

2004-10-11, 8:57 pm

Thomas Gagne <tgagne@wide-open-west.com> wrote:

>I thought this had already been discussed, but I just returned from
>groups.google.com empty-handed.
>
>Is there a hidden method (or obscurely named one) for finding the common
>members of two collections?
>
>I have two collections (sets actually) and I'd like to find the members they
>have in common.
>
>It seems we could iterate through one looking for elements of the other. It
>also seems it should be possible to add both collections' members to a bag
>then select the items with multiple occurences but I didn't see an obvious way
>to do it so I thought I'd ask...


Generally intersection performance (such as in GemStone) is linear in terms
of performance primarily related to the size of the smallest collection.

Eg, if intersecting ten thousand elements againt ten million, that will take
about ten thousand operations, and part of each of those operations is
skipping ahead to the next appopriate element in the big set.

I've implemented intersections in C++ using standard STL set collections,
and intersection performance has similar characteristics to that of
GemStone. (actually with some good compiler optimizations turned on it can
beat GemStone)

I've been also implementing other algorithms for set intersecting that use
tree structures and have different performance characteristics when dealing
with very large sets.

As far as I know, the most efficient simple implementation is to take two
hash tables where their hash values maintain their order regardless of the
collection size, and just walk down both of them in parallel, skipping ahead
where appropriate to make the comparisions.

Ian

---
http://www.upright.net/ian/
Chris Uppal

2004-10-12, 8:57 am

Ian Upright wrote:

[quoting out-of-order]

> As far as I know, the most efficient simple implementation is to take two
> hash tables where their hash values maintain their order regardless of the
> collection size, and just walk down both of them in parallel, skipping
> ahead where appropriate to make the comparisions.


Do you mean that the "hash tables" are really more like a SortedCollection
where the items are sorted by their integer hash values, rather than the
classic "hash table" implementation ?

If so, don't that mean that #contains:, etc, are genuinely O(log n), rather
than being FAPP constant-time as in classic hash tables.

> Eg, if intersecting ten thousand elements againt ten million, that will
> take about ten thousand operations, and part of each of those operations
> is skipping ahead to the next appopriate element in the big set.


I think, even without a specially-crafted Set implementation, simple code like:

Set>>intersection: anotherSet
^ self size > anotherSet size
ifTrue: [anotherSet intersection: self].
ifFalse: [self select: [:each | anotherSet includes: each]].

will (FAPP) have the same O(size of smaller set) run time, albeit with higher
constant factors.

-- chris



Ian Upright

2004-10-13, 4:00 pm

"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote:

>Ian Upright wrote:
>
>[quoting out-of-order]
>
>
>Do you mean that the "hash tables" are really more like a SortedCollection
>where the items are sorted by their integer hash values, rather than the
>classic "hash table" implementation ?


You can create a hash-table that uses straight division instead of the
modulus to position the elements in the array. Thus, the elements in the
table are always ordered by their hash value. This has a different set of
performance characteristics that make it not quite ideal, but should yield
better intersection performance. Alternative tree-like or paritioning
structures for the sets may be more appropriate.

Ian

>If so, don't that mean that #contains:, etc, are genuinely O(log n), rather
>than being FAPP constant-time as in classic hash tables.
>
>
>I think, even without a specially-crafted Set implementation, simple code like:
>
>Set>>intersection: anotherSet
> ^ self size > anotherSet size
> ifTrue: [anotherSet intersection: self].
> ifFalse: [self select: [:each | anotherSet includes: each]].
>
>will (FAPP) have the same O(size of smaller set) run time, albeit with higher
>constant factors.
>
> -- chris


---
http://www.upright.net/ian/
Chris Uppal

2004-10-14, 4:00 pm

Ian Upright wrote:

> You can create a hash-table that uses straight division instead of the
> modulus to position the elements in the array.


Right, thanks.


> Thus, the elements in the
> table are always ordered by their hash value.


Always /almost/ ordered, I think. I.e. given two hashes H1 and H2, where H2 >
H1, H2 can come first in the collection if it is in the same contingous group
of non-empty slots as H1 (assuming an otherwise "normal" hashtable
implementation).

-- chris


Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com