For Programmers: Free Programming Magazines  


Home > Archive > A86 Assembler > July 2007 > greatest multiple algorithm









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author greatest multiple algorithm
Bronson

2007-06-28, 6:57 pm

Hi,

i'm looking for a fast way to find greatest multiple of a, which is
smaller than b.
Recently i've used b-b%a, but maybe it could be done faster.

Any ideas?

Przemek

Terence

2007-06-29, 3:57 am

(b-1)/a

Tim Roberts

2007-06-29, 3:57 am

Bronson <spamtrap@crayne.org> wrote:
>
>i'm looking for a fast way to find greatest multiple of a, which is
>smaller than b.
>Recently i've used b-b%a, but maybe it could be done faster.


If you know that "a" is a power of 2, you can do much better by b & ~(a-1).
If you know that b is not very much greater than a, you can avoid the
division by using repeated subtraction.

Beyond that, what you've got is probably OK.
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Tim Roberts

2007-06-29, 3:57 am

Bronson <spamtrap@crayne.org> wrote:
>
>i'm looking for a fast way to find greatest multiple of a, which is
>smaller than b.
>Recently i've used b-b%a, but maybe it could be done faster.


Actually, (b/a)*a is probably faster than what you have.
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Sebastian Biallas

2007-06-29, 7:57 am

Bronson wrote:
> Hi,
>
> i'm looking for a fast way to find greatest multiple of a, which is
> smaller than b.
> Recently i've used b-b%a, but maybe it could be done faster.


Do you mean less than b or less-or-equal than b? Your formula finds the
latter one.

Regarding your question: If you know nothing about a, it's probably the
fastest. If you know a in advance, you can speed up the %-Operation by
some smart multiplication.

E.g. for edi = edi % 3:

movl %edi, %eax
movl $-1431655765, %edx
mull %edx
shrl %edx
leal (%edx,%edx,2), %edx
subl %edx, %edi

(Stolen from gcc output)

--
Gruß,
Sebastian

Rod Pemberton

2007-06-29, 6:57 pm


"Bronson" <spamtrap@crayne.org> wrote in message
news:1183066820.884123.297300@u2g2000hsc.googlegroups.com...
> Hi,
>
> i'm looking for a fast way to find greatest multiple of a, which is
> smaller than b.
> Recently i've used b-b%a, but maybe it could be done faster.
>



The answer to your question doesn't just depend on the algorithm. If the
cpu doesn't support modulo, division might be faster. For an x86 cpu, you
can use integer instructions to do division or modulo and/or floating point
instructions to do floating and integer division. It's possible that you
might find a faster algorithm using shifts: shr,shl,shrd,shrl. You might
also be able to code them using other instructions: mmx, xmm, sse, sse2,
etc. If you just use the integer or floating point instructions for x86,
there are at least three basic ways in x86 assembly to do (b/a)*a and one
for b-b%a. I've listed most of the required instructions needed to code a
routine for each method. The timings are for the pentium.

1) floating point floor(b/a)*a
FLD 1 cycle
FLD 1 cycle
FDIV 39 cycles
FRNDINT 9-20 cycles
FST 1-2 cycles
- at least 51 to 63 cycles

2) integer (b/a)*a - using x87 floating point unit
FILD 3/1 cycles
FILD 3/1 cycles
FIDIV 42 cycles
FIMUL 7/4 cycles
FIST 6 cycles
- at least 54 to 61 cycles

3) integer (b/a)*a - implicit truncation
MOV 1 cycle
MOV 1 cycle
DIV 40 cycles 32-bit
MUL 11 cycles
- at least 53 cycles

4) integer b-b%a - DIV performs modulo
MOV 1 cycle
MOV 1 cycle
DIV 40 cycles 32-bit
SUB 1 cycle
- at least 44 cycles


For these examples, the b-b%a is faster than (b/a)*a or floor(b/a)*a due to
the instruction timings of the needed instructions.


Rod Pemberton

Bronson

2007-07-02, 7:57 am

On 29 Cze, 20:38, "Rod Pemberton" <spamt...@crayne.org> wrote:
> "Bronson" <spamt...@crayne.org> wrote in message
>
> news:1183066820.884123.297300@u2g2000hsc.googlegroups.com...
>
>
>
> The answer to your question doesn't just depend on the algorithm. If the
> cpu doesn't support modulo, division might be faster. For an x86 cpu, you
> can use integer instructions to do division or modulo and/or floating point
> instructions to do floating and integer division. It's possible that you
> might find a faster algorithm using shifts: shr,shl,shrd,shrl. You might
> also be able to code them using other instructions: mmx, xmm, sse, sse2,
> etc. If you just use the integer or floating point instructions for x86,
> there are at least three basic ways in x86 assembly to do (b/a)*a and one
> for b-b%a. I've listed most of the required instructions needed to code a
> routine for each method. The timings are for the pentium.
>
> 1) floating point floor(b/a)*a
> FLD 1 cycle
> FLD 1 cycle
> FDIV 39 cycles
> FRNDINT 9-20 cycles
> FST 1-2 cycles
> - at least 51 to 63 cycles
>
> 2) integer (b/a)*a - using x87 floating point unit
> FILD 3/1 cycles
> FILD 3/1 cycles
> FIDIV 42 cycles
> FIMUL 7/4 cycles
> FIST 6 cycles
> - at least 54 to 61 cycles
>
> 3) integer (b/a)*a - implicit truncation
> MOV 1 cycle
> MOV 1 cycle
> DIV 40 cycles 32-bit
> MUL 11 cycles
> - at least 53 cycles
>
> 4) integer b-b%a - DIV performs modulo
> MOV 1 cycle
> MOV 1 cycle
> DIV 40 cycles 32-bit
> SUB 1 cycle
> - at least 44 cycles
>
> For these examples, the b-b%a is faster than (b/a)*a or floor(b/a)*a due to
> the instruction timings of the needed instructions.
>
> Rod Pemberton


Thanks all of you.
But i forgot to mention that those are 64-bit integers on a 32-bit
machine. And i like it to be smaller, equal to b.

Sorry about that ;-],

Przemek

Bronson

2007-07-02, 9:57 pm

On 2 Lip, 12:30, Bronson <spamt...@crayne.org> wrote:
> On 29 Cze, 20:38, "Rod Pemberton" <spamt...@crayne.org> wrote:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Thanks all of you.
> But i forgot to mention that those are 64-bit integers on a 32-bit
> machine. And i like it to be smaller, equal to b.
>
> Sorry about that ;-],
>
> Przemek


I'll tell you exactcly what i have to do...
I have one number of the form 30*k+i, where k is a integer >= 0 and i
is one of [1,7,11,13,17,19,23,29].
And i have another number M - positive integer.
I need to find greatest multiple of 30*k+i before M (less or equal).
Maybe there is a fast way to do that. In my algorithm i need to find
it for 30*k+7, and later for 30*(k+1)+7, so any quick way to find it
for 30*(k+1)+7 having done it for 30*k+7 would be also very helpfull.
M is a 64-bit integer, and 30*k+i is a 32-bit. Thanks for any help.

Przemek

Bronson

2007-07-03, 6:58 pm

On 2 Lip, 14:44, Bronson <spamt...@crayne.org> wrote:
> On 2 Lip, 12:30,Bronson <spamt...@crayne.org> wrote:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> I'll tell you exactcly what i have to do...
> I have one number of the form 30*k+i, where k is a integer >= 0 and i
> is one of [1,7,11,13,17,19,23,29].
> And i have another number M - positive integer.
> I need to find greatest multiple of 30*k+i before M (less or equal).
> Maybe there is a fast way to do that. In my algorithm i need to find
> it for 30*k+7, and later for 30*(k+1)+7, so any quick way to find it
> for 30*(k+1)+7 having done it for 30*k+7 would be also very helpfull.
> M is a 64-bit integer, and 30*k+i is a 32-bit. Thanks for any help.
>
> Przemek


What am i doing wrong here:

ull mod(ull max, uint q) {
ull r;
__asm {
finit
fild dword ptr q
fild qword ptr max
fdiv st(0), st(1)
fmul st(0), st(1)
fist qword ptr r
}
return r;
}

? It is inline assembly in VC++. Funny thing, it always return another
value, and i about 1000 cycles, while b-b%a takes just 270. Any ideas?

Thanks,

Przemek

Terence

2007-07-04, 3:57 am

You seem to be trying to crack public/private key algorithms by the
prime divisor remainder method.

Bronson

2007-07-04, 6:57 pm

On 4 Lip, 05:16, Terence <spamt...@crayne.org> wrote:
> You seem to be trying to crack public/private key algorithms by the
> prime divisor remainder method.


No, i am finding primes...

Terence

2007-07-05, 6:57 pm

On Jul 5, 4:03 am, Bronson <spamt...@crayne.org> wrote:
> On 4 Lip, 05:16, Terence <spamt...@crayne.org> wrote:
>
>
> No, i am finding primes...


VERY closely related...

Tim Roberts

2007-07-06, 3:57 am

Terence <spamtrap@crayne.org> wrote:
>On Jul 5, 4:03 am, Bronson <spamt...@crayne.org> wrote:
>
>VERY closely related...


Maybe so, but what was your point? Unlike some of the activities we see on
this forum, I don't think there is anything illegal or immoral in trying to
crack public/private key algorithms. If he is successful, I suspect the
encryption community would welcome more information about his technique.
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Bronson

2007-07-06, 9:57 pm

On 6 Lip, 04:53, Tim Roberts <spamt...@crayne.org> wrote:
> Terence <spamt...@crayne.org> wrote:
>
>
>
>
> Maybe so, but what was your point? Unlike some of the activities we see on
> this forum, I don't think there is anything illegal or immoral in trying to
> crack public/private key algorithms. If he is successful, I suspect the
> encryption community would welcome more information about his technique.
> --
> Tim Roberts, t...@probo.com
> Providenza & Boekelheide, Inc.


Can someone tell me why the code i posted doesn't work and is slower,
than b-b%a ??
Everyone said, that it would be faster...

Thanks,

Przemek

Tim Roberts

2007-07-07, 9:58 pm

Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-CLAX86-Policy: http://www.pacificsites.com/~ccrayne/clax86.html
X-CLAX86-Faq: http://www.faqs.org/faqs/by-newsgro...ng.asm.x86.html
X-CLAX86-Info-1: Send submissions to comp-lang-asm-x86@moderators.isc.org
X-CLAX86-Info-2: Send technical complaints to clax86-admin@crayne.org
X-CLAX86-Info-3: Send complaints about policy to clax86-admin@crayne.org
X-Comment: moderators do not necessarily agree or disagree with this article.
X-Newsreader: Forte Agent 4.2/32.1117
X-Null-Tag: f6b98e0de7d76dbd987dacd1f33dfd6b
X-Null-Tag: f6b98e0de7d76dbd987dacd1f33dfd6b
X-PacificInternet-MailScanner-Information: Please contact the ISP for more information
X-PacificInternet-MailScanner: Found to be clean
X-PacificInternet-MailScanner-SpamCheck: not spam (whitelisted),
SpamAssassin (not cached, score=0, required 5.5, autolearn=disabled)
X-PacificInternet-MailScanner-From: news@gnilink.net
X-Spam-Status: No
X-Complaints-To: abuse@supernews.com
Lines: 14
Bytes: 2431
Xref: number1.nntp.dca.giganews.com comp.lang.asm.x86:132668

Bronson <spamtrap@crayne.org> wrote:
>
>What am i doing wrong here:


Using floating point, for one. You have three int/float conversions in
this code, and those cost cycles. Plus, a double holds fewer significant
bits than an unsigned long long, so you are going to lose precision.

Just keep them as integers. You can do a 64 by 32 division, 32x32
multiply, and a 64 to 64 subtraction in a lot fewer cycles.
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Phil Carmody

2007-07-08, 7:57 am

Tim Roberts <spamtrap@crayne.org> writes:
> Bronson <spamtrap@crayne.org> wrote:
>
> Using floating point, for one. You have three int/float conversions in
> this code, and those cost cycles. Plus, a double holds fewer significant
> bits than an unsigned long long, so you are going to lose precision.
>
> Just keep them as integers. You can do a 64 by 32 division, 32x32
> multiply, and a 64 to 64 subtraction in a lot fewer cycles.


Weird. Yves Gallot uses floats. Paul Jobling uses floats. Marcel
Martin uses floats. I use floats. Peter Montgomery recommends
using floats (at least as reported by Francois Morain in one of
his ECPP papers, IIRC).

Some applications are best in pure FP, others combine int and FP.

Maybe we're all doing something wrong?

What was he doing wrong?
At the bug level - he wasn't rounding to an integer.
At the design level - he wasn't precomputing the reciprocal
of the number he wanted to divide by in advance, for re-use.
(Anything apart from the most trivial application will be
performing modular reduction more than once with the same
base. And if it is the most trivial application, he's
almost certainly using the wrong algorithm anyway.)

Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.

Jerome H. Fine

2007-07-08, 6:57 pm

>Bronson wrote:

>On 4 Lip, 05:16, Terence <spamt...@crayne.org> wrote:
>
>
>No, i am finding primes...
>

Jerome Fine replies:

As you may be aware, pi(10^18) = 24,739,954,287,740,860
and the binary values up to 10^18 are 60 bit numbers.
You mentioned elsewhere that you are finding 64-bit primes.
Are you attempting to locate and count all of the primes
which are less than 10^18 or just find individual primes?

If you are locating all of the primes up to 10^18, then
you will need to use all of the primes up to 10^9 for which
the count is pi(10^9) = 50,847,534. It is probably faster
to sieve for primes than to check each number individually.

On another aspect of dealing with primes, it is possible to
squeeze all of those 50,847,534 primes into less than 32 MB.
My current project is to locate and save all of the primes
up to 10^12 for which pi(10^12) = 37,607,912,018. Based on
my current success with the first 51 million primes, I
hope to squeeze the first 38 billion primes into less than
32 GB, perhaps as small as 28 GB or perhaps even 24 GB.

I am interested in the challenge of this latter problem.

In addition, if I can find a faster method up to sieve for
the primes up to 10^12, then it may be possible to consider
a full sieve up to pi(10^15), and perhaps up to pi(10^18).

If anyone else is also interested, I would appreciate some
feedback.

Sincerely yours,

Jerome Fine


Phil Carmody

2007-07-08, 6:57 pm

"Jerome H. Fine" <spamtrap@crayne.org> writes:
> In addition, if I can find a faster method up to sieve for
> the primes up to 10^12, then it may be possible to consider
> a full sieve up to pi(10^15), and perhaps up to pi(10^18).
>
> If anyone else is also interested, I would appreciate some
> feedback.


Don't store lists of primes (more than the first few thousand).
Calculate them on the fly.

Use DJB's primegen, or TOS's segmented sieve. Remember to
parameterise them correctly, or performance will go pear-
shaped.

JvB (who reads and contributes here) has a sieve that is
as fast as the above (faster than TOS's released C version,
similar to TOS's private asm optimised version) all the
way to 10^17, maybe 10^18, but he's not made his source
public, AFAIK.

Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.

James Van Buskirk

2007-07-08, 9:57 pm

"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
news:87tzse7hzi.fsf@nonospaz.fatphil.org...

> JvB (who reads and contributes here) has a sieve that is
> as fast as the above (faster than TOS's released C version,
> similar to TOS's private asm optimised version) all the
> way to 10^17, maybe 10^18, but he's not made his source
> public, AFAIK.


http://home.comcast.net/~kmbtib/Sieve30h/

But be warned that this version has a little AMD assembly in it
and may need to be messed with to get it working on Intel. Also
it's just something I found on my hard disk and posted in response
to some stupid troll who of course never bothered to compile it,
so it may be between versions and not working. Some timing data
for the program on an Athlon 1400 GHz may be found in

http://home.comcast.net/~kmbtib/Sie...00000000000.dat

To get it up to 1.0e18, some more code to handle the progression
of data structures that hold the large primes that don't hit the
sieve every buffer would be desirable.

I think it can be useful to run sieves out as far as possible:
propositions such as Chebyshev's bias, verification of Goldbach,
twin primes, and sums relating to primes are useful data in my
opinion, but lists of primes are not that useful because the
ancient algorithm is just that much faster than the hard disk
can spit them out (in agreement with you).

Oh yes, major rewrite is also needed if values of 1.0e18 or higher
are desired: the code as it stands doesn't have restart capability,
and this is vital for using multiple cores on a single system or
for creating a distributed version.

Anyhow my code kicks everybody else's butt as it stands and if a
distributed version were to go online it would be a shame if
something lame (like the original SETI@home) went up instead. If
there were enough interest in taking this as far as it can go, I
could put some more time into it, especially if that interest
involved my getting paid instead of just getting bitched at for
doing all the work (fat chance.)

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Rod Pemberton

2007-07-09, 7:57 am


"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
news:87tzse7hzi.fsf@nonospaz.fatphil.org...
> "Jerome H. Fine" <spamtrap@crayne.org> writes:
>
> Don't store lists of primes (more than the first few thousand).
> Calculate them on the fly.
>
> Use DJB's primegen, or TOS's segmented sieve. Remember to
> parameterise them correctly, or performance will go pear-
> shaped.
>
> JvB (who reads and contributes here) has a sieve that is
> as fast as the above (faster than TOS's released C version,
> similar to TOS's private asm optimised version) all the
> way to 10^17, maybe 10^18, but he's not made his source
> public, AFAIK.
>


I've tried to write a few trivial (odd) prime number programs over the
years. I'm curious as to how you guys think those two compare to some of
the trivial methods I've tried (esp. Bernstein).

integer array
array stores found primes
modulo with found primes in array to find new primes
slows down rapidly with increasing number of primes in array
consumes large amounts of memory

bit sieve
using single bits to represent odd numbers
finds primes upto memory_in_bytes*8*2
requires no modulo computation
requires many memory accesses
slows down based on size memory used
limited to available memory

brute computation
simple loop
i+=2 for number to test as a factor of the prime
j+=4*(i-1) for the square of i
i.e., to stop when i is roughly sqrt of prime being tested
skip 6k+3
test 6k-1 or 6k+1 modulo with i
fast, no real memory requirements
limited to cpu's largest integer size


The brute computation is been the fastest by far, but is obviously limited
to "small" primes.


Rod Pemberton

Phil Carmody

2007-07-09, 9:57 pm

"Rod Pemberton" <spamtrap@crayne.org> writes:
> I've tried to write a few trivial (odd) prime number programs over the
> years. I'm curious as to how you guys think those two compare to some of
> the trivial methods I've tried (esp. Bernstein).
>
> integer array
> array stores found primes
> modulo with found primes in array to find new primes
> slows down rapidly with increasing number of primes in array
> consumes large amounts of memory


That doesn't sound like a sieve, it sounds like trial division.
You're not reusing the results of prior computations.
I always advise a sieve.

> bit sieve
> using single bits to represent odd numbers
> finds primes upto memory_in_bytes*8*2
> requires no modulo computation
> requires many memory accesses
> slows down based on size memory used
> limited to available memory


You can use a wheel to decrease the memory requirements. Most
fast sieves use the 8 residues mod 30 as their wheel. That
almost halves the memory requirements. Even a wheel of 6 saves
you a third.

See DJB's eratosthenes as part of the primegen package.
James' sieve was a 30 too, IIRC.

> brute computation
> simple loop
> i+=2 for number to test as a factor of the prime
> j+=4*(i-1) for the square of i
> i.e., to stop when i is roughly sqrt of prime being tested
> skip 6k+3
> test 6k-1 or 6k+1 modulo with i
> fast, no real memory requirements
> limited to cpu's largest integer size


Again, that sounds like trial division rather than a sieve.

> The brute computation is been the fastest by far, but is obviously limited
> to "small" primes.


TOS's (Tomas Oliveira e Silva) GPL-ed C source is acceptably fast,
and very simple. 50% speedup should be possible without too much
work, by introducing the concepts of a mod 6 wheel to either or
both of the small primes and the sieving region.
You won't get much faster without cranking out specifically
optimised inner loops for all the different possibilities.
(Which is what both JvB and TOS have done.)

I'll stop rambling - this has everything you need to know:
http://www.ieeta.pt/~tos/software/prime_sieve.html

Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.

Rod Pemberton

2007-07-09, 9:57 pm


"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
news:878x9p6ga4.fsf@nonospaz.fatphil.org...
> "Rod Pemberton" <spamtrap@crayne.org> writes:
of[color=darkred]
>
> That doesn't sound like a sieve, it sounds like trial division.
> You're not reusing the results of prior computations.


They're recalled from the array. It slows down progessively as number of
array accesses increase, i.e., the potential primes become larger.

> I always advise a sieve.
>


Why? As I've pointed out, each has advantages and limitations. For
integers upto the largest cpu supported integer type, modulo is faster on
modern machines. This could be used to reduce the time needed to setup the
lower or smaller primes in bit sieve. The bit sieve becomes progressively
slower with larger sieve sizes due to the large number of memory accesses.
However, the bit sieve has the advantage that it isn't limited to finding
primes that fit within the cpu supported integer type.

>
> You can use a wheel to decrease the memory requirements. Most
> fast sieves use the 8 residues mod 30 as their wheel. That
> almost halves the memory requirements. Even a wheel of 6 saves
> you a third.
>


One question is how much does the extra logic needed to skip bits slow down
the algorithm versus the gain in computable range. For example, I found
that skipping multiples of 5 for the trial division algortihm below ("wheel
of 30" ?) slowed the algorithm down alot. The comparison and conditional
were much slower than the modulo. If I were to extend that to skipping
other values (larger "wheels"), it'd slow down even more. The bit sieve
would slow down too. Memory isn't scarce or expensive anymore, so skipping
values would only make sense to increase the range of computable primes.
The bit sieve does have the advantage that it can find primes larger than
the cpu's largest integer type, esp. on modern machines due to the increased
memory. Anyway, I'm more interested in speed than range, and the trial
division is much faster than this method.

Unfortunately, I really didn't like this algorithm as either recent C code
or ancient C64 BASIC version...where I first tried the idea. The C64 was so
*slow* doing division. I figured I'd be better off with a sieve, but there
wasn't enough memory to store enough of the primes. That's when I first
tried single bits for odds. I'm not versed in prime number algorithms, so I
don't know whether this was in use previously and my program was never
distributed. But, I'd figure that it probably was in use somewhere due to
the memory constraints of the earlier generation of mainframes.

> See DJB's eratosthenes as part of the primegen package.
> James' sieve was a 30 too, IIRC.
>
>
> Again, that sounds like trial division rather than a sieve.
>


Yes, trial division, skipping multiples of 2 and 3 ("wheel of 6" ?).

limited[color=darkred]
>
> TOS's (Tomas Oliveira e Silva) GPL-ed C source is acceptably fast,
> and very simple. 50% speedup should be possible without too much
> work, by introducing the concepts of a mod 6 wheel to either or
> both of the small primes and the sieving region.
> You won't get much faster without cranking out specifically
> optimised inner loops for all the different possibilities.
> (Which is what both JvB and TOS have done.)
>
> I'll stop rambling - this has everything you need to know:
> http://www.ieeta.pt/~tos/software/prime_sieve.html
>


Forgot to mention another one:

test potential primes with GCD composed of all lower primes
modulo with "primed" GCD
very fast
"primed" GCD exceeds cpu integer size rapidly (i.e., prime 47...)

Hmm, maybe I should try floats on that one just to see how far I can get it
to run without error...


Rod Pemberton

Phil Carmody

2007-07-10, 6:57 pm

"Rod Pemberton" <spamtrap@crayne.org> writes:
> "Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
>
> Why?


Because I'm interested in speed. And I'm always talking in
terms of 10^12-10^16. Anything that's just 10^9 or 10^10
is just a toy and not of interest to me.

> As I've pointed out, each has advantages and limitations. For
> integers upto the largest cpu supported integer type, modulo is faster on
> modern machines.


In practice, it seems that the pure bit sieve (although using
Bernstein or Galway's quadratic form (Atkin) sieve rather than
Eratosthenes) should remain optimal until well beyond 10^20,
and after that trial division makes no sense at all.

> This could be used to reduce the time needed to setup the
> lower or smaller primes in bit sieve. The bit sieve becomes progressively
> slower with larger sieve sizes due to the large number of memory accesses.

That can be reduced by a significant factor using the technique
documented on TOS's web page. Basically you throw the primes into
a simplified priority queue. Only look at the primes that are
relevant to each segment you're sieving at the time. (James -
did yours do this too - I've not looked at your code for a long
time, and barely speak ForTran.)

> However, the bit sieve has the advantage that it isn't limited to finding
> primes that fit within the cpu supported integer type.
>
>
> One question is how much does the extra logic needed to skip bits slow down
> the algorithm versus the gain in computable range. For example, I found
> that skipping multiples of 5 for the trial division algortihm below ("wheel
> of 30" ?)


Yup.

> slowed the algorithm down alot.


It's possible. It takes a lot of care to use a mod 30 wheel
effectively. TOS and DJB have done it, I almost never care
beyond the trivial mod 6 wheel.

> The comparison and conditional
> were much slower than the modulo.


Often efficient wheels are implemented by unrolling loops,
and creating 8 different versions of the code for the 8
different stepping patterns (James' and TOS's optimised
code does this, I'm sure (and it doesn't apply to an Atkin
sieve such as DJB's)).

[...]
> Forgot to mention another one:
>
> test potential primes with GCD composed of all lower primes
> modulo with "primed" GCD
> very fast
> "primed" GCD exceeds cpu integer size rapidly (i.e., prime 47...)
>
> Hmm, maybe I should try floats on that one just to see how far I can get it
> to run without error...



That technique goes all the way back to Euclid. It's very useful
in many situations. If you look in the GMP source for their
probable primality test you'll see them do a couple of GCDs with
products of primes before anything else.

Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.

Rod Pemberton

2007-07-12, 7:57 am


"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
news:878x9o4jgy.fsf@nonospaz.fatphil.org...
> "Rod Pemberton" <spamtrap@crayne.org> writes:
>
> Because I'm interested in speed. And I'm always talking in
> terms of 10^12-10^16. Anything that's just 10^9 or 10^10
> is just a toy and not of interest to me.
>
on[color=darkred]
>
> In practice, it seems that the pure bit sieve (although using
> Bernstein or Galway's quadratic form (Atkin) sieve rather than
> Eratosthenes) should remain optimal until well beyond 10^20,
> and after that trial division makes no sense at all.
>
progressively[color=darkred]
accesses.[color=darkred]
> That can be reduced by a significant factor using the technique
> documented on TOS's web page. Basically you throw the primes into
> a simplified priority queue. Only look at the primes that are
> relevant to each segment you're sieving at the time. (James -
> did yours do this too - I've not looked at your code for a long
> time, and barely speak ForTran.)
>
finding[color=darkred]
down[color=darkred]
("wheel[color=darkred]
>
> Yup.
>
>
> It's possible. It takes a lot of care to use a mod 30 wheel
> effectively. TOS and DJB have done it, I almost never care
> beyond the trivial mod 6 wheel.
>
>
> Often efficient wheels are implemented by unrolling loops,
> and creating 8 different versions of the code for the 8
> different stepping patterns (James' and TOS's optimised
> code does this, I'm sure (and it doesn't apply to an Atkin
> sieve such as DJB's)).
>
> [...]
47...)[color=darkred]
get it[color=darkred]
>
>
> That technique goes all the way back to Euclid. It's very useful
> in many situations. If you look in the GMP source for their
> probable primality test you'll see them do a couple of GCDs with
> products of primes before anything else.
>


Honestly, from my experiences, I wasn't really sold on the idea that a bit
sieve could be very fast. Looking at other sieves on the Internet didn't
change my attitude any. That was until I saw John Moyer's webpage. I'm not
sure how his compares to TOS, DJB, or James', but the performance numbers he
gives for his "wheel of 30" indicate that it's a "screamer." The one thing
I noticed right away was that he wasn't storing multiples of 2,3,5 - just
the other 8 possible primes per set of 30 - nice compact encoding...

http://www.rsok.com/~jrm/printprimes.html


Rod Pemberton

Jerome H. Fine

2007-07-13, 7:57 am

>Rod Pemberton wrote:

>Honestly, from my experiences, I wasn't really sold on the idea that a bit
>sieve could be very fast. Looking at other sieves on the Internet didn't
>change my attitude any. That was until I saw John Moyer's webpage. I'm not
>sure how his compares to TOS, DJB, or James', but the performance numbers he
>gives for his "wheel of 30" indicate that it's a "screamer." The one thing
>I noticed right away was that he wasn't storing multiples of 2,3,5 - just
>the other 8 possible primes per set of 30 - nice compact encoding...
>
>http://www.rsok.com/~jrm/printprimes.html
>

Jerome Fine replies:

>From what I understand of a bit sieve or a "wheel of 30"

(also using bits since it could also use bytes instead)
is that with a work area of a limited size (the work
area size is ALWAYS limited since it is ALWAYS best to
use as large a work area as possible), the largest primes
to be sieved must be processed MANY more times. So with
a bit sieve which allows 8 times as many elements in the
work area used as bits, the larger primes (of which there
are MANY more than the smaller primes) will need to be
processed 1/8 as many times. Thus, while the smaller
primes will not benefit significantly (it makes very
little difference for the primes less than 1000 as to
whether the work area is a 1,000,000 elements or
100,000,000 elements - most of the time is spent within
the inner loop), when primes larger than 100,000,000
are being processed for a sieve beyond 10^16, almost
none of the time is spent within the inner loop. For
the primes larger than 100,000,000 almost all of the
time is spent just checking to determine if the work
area should be modified - which is 90% of the primes
for a sieve up to 10^18 or just over 44 million of the
50,847,534 primes to be used for the portion from 10^16
up to 10^18 which is obviously 99% of the use of the
work area. Unless the number of elements in the work
area is over 1 billion (easy these days with a memory
of more than 1 GB - a bit sieve can do that with just
125 million bytes which is less than 25% of the memory
available to the programs I run on my Pentium III) in
which case only a few bits are being modified each time
by those very large primes greater than 100,000,000.

Compare a sieve up to just 10^12 which requires the use
of only 78,498 primes and a million times smaller work
area (or at least repeated number of sieves of the work
area) as compared with 10^18. For example, if it takes
a little as one second to complete a sieve up to 10^12,
the sieve up to 10^18 will take much longer than ten days.
I have not heard of anyone, as yet, completing the sieve
up to 10^12 in just one second.

A lot of words to describe the obvious for those
individuals who have written sieve programs.

What I am having a bit of difficulty with is just how
to efficiently convert a byte sieve to a bit sieve.
I don't think it is worth while to comment on the actual
code that could be used in this post, but on the next
one I will try. This post is probably too long as it is.

I just looked again at John Moyer's home page as well as
the link that you provide. While I have not yet progressed
to using a "wheel of 30", that is my immediate intention -
as soon as I understand the code sufficiently as well as
the algorithm for its implementation.

I presume that when you mean not "storing multiples of 2, 3, 5",
you are referring to the bits left in the working area being used
to sieve for the primes. Was that your intention?

What I am having difficulty with is in understanding just how the
increased code complexity compensates for having to modify just
8 / 15 of the bits with a "wheel of 30" bit sieve (or just over
half as many) as opposed to a simple bit sieve of only the
odd numbers. Yes the work area covers almost twice as many
values to be sieved for the "wheel of 30" as compare with a
simple bit sieve, but the algorithm needs to be much more
complicated. As for eliminating the need to sieve the primes
2, 3 and 5, it is trivial for a both a simple byte sieve and
a simple bit sieve to eliminate many more of the smallest primes
to be sieved after these extremely small primes have been done
once (the first time only is necessary) by saving a relatively
small portion of the large work area.

I am at a complete loss as to where you found the "screamer"
values which specify how fast John's implementation of the "wheel
of 30" code is in practice. Can you provide one of the timings
along with to a link to where you found them?

Can you or anyone else provide and timings from anyone else?

I presume that TOS is Tomas Oliveria e Silva? Please confirm.
And is DJB Daniel J. Bernstein? I can't seem to find a home page.
Finally, who is James'?

Sincerely yours,

Jerome Fine


Phil Carmody

2007-07-13, 7:57 am

"Jerome H. Fine" <spamtrap@crayne.org> writes:
> I presume that TOS is Tomas Oliveria e Silva? Please confirm.


Yup.

> And is DJB Daniel J. Bernstein? I can't seem to find a home page.


cr.yp.to

> Finally, who is James'?


James Van Buskirk, semi-regular contributer hereon.

Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.

James Van Buskirk

2007-07-13, 7:57 am

"Jerome H. Fine" <spamtrap@crayne.org> wrote in message
news:4696BBF5.9040909@compsys.to...

> Finally, who is James'?


A couple of things you may have missed or not have communicated
intelligibly:

o Have you looked at http://primes.utm.edu/ ?

o What are you trying to do with the sieve results: find pi(x) for
large x, verify Goldbach like TOS, some other, or some combination?

o Making a sieve go fast requires many optimizations. The TOS code is
in C, mine is in Fortran. At least I can help you if you have
questions about the code at http://home.comcast.net/~kmbtib/Sieve30h/
(time and energy permitting of course, it seems there is never a
shortage of space nor momentum :) ). My code has a few comments as
well; more at least than DJB's code.

o Mod 30 compression (I don't think the term 'wheel' is very descriptive
in the English language) is the way to go with current processors:
given a total loop explosion, you can hit the sieve table every
clock cycle, and each hit is twice as effective because there are
only half as many entries to sieve. Loop explosion would blow out
instruction cache for mod 210 compression which is why nobody uses it.

o A fundamental optimization is that of sieving by buffers; I use two
levels of buffers to match two levels of cache. On my 21164 I tried
2 levels because it has 2 MB of L3 cache, but it didn't seem to
make it go any faster, perhaps due to TLB thrashing.

If you want to write new code from scratch, start out by implementing
mod 30 compression, then implement buffering, then implement loop
explosion, then implement 'painting' where 128 bits of the sieve
table buffer are processed at once with PAND (or POR given negative
logic) then implement large prime data structures, then implement
'launch pads' writing large prime data structures (so that a whole
cache line or even several get written at once) and by then you
should have a pretty darned fast sieve. You will then be able to
think about what you want to do with it. Hopefully you will do a
lot of things with the primes generated, not just count them. After
all that you need restart code and some Windows programming so that
people can run it in the background without interfering with the
normal operation of their computers. Then you have to promote the
project to the extent that a reasonable number of people run it for
a reasonable amount of time and then you will get results about
primes that nobody else has yet achieved.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com