


 

Author SSE2 register addition (linux gas)
Alexander Knopf

2006-08-25, 3:56 am

quick questions about the above.

i'm writing a program that handles large numbers, over 128 bits.
now i understand i could just put 4 DWord sections of any number into an
xmm register, put another 4 DWords into a second and add them.
however, i don't know how the carry flag is handled.
so questions as follows:
1) which byte order would i have to use ?
2) would the carry bit be added to the next higher dword ?
3) what happens if the highest dword addition produces a carry ?

any help is greatly appreciated.

-Alexander Knopf

ldb

2006-08-25, 6:56 pm


Alexander Knopf wrote:
> quick questions about the above.
>
> i'm writing a program that handles large numbers, over 128 bits.
> now i understand i could just put 4 DWord sections of any number into an
> xmm register, put another 4 DWords into a second and add them.
> however, i don't know how the carry flag is handled.
> so questions as follows:
> 1) which byte order would i have to use ?
> 2) would the carry bit be added to the next higher dword ?
> 3) what happens if the highest dword addition produces a carry ?
>
> any help is greatly appreciated.
>
> -Alexander Knopf



The short answer is: They will add the four dwords completely
independently, and the carry flag goes into the EFLAGS register, if I
remember correctly. SSE instructions are NOT meant to operate on
128-bit data types. They are meant to work on elements up to qword
size. Doing wider-than-qword math in SSE registers correctly is not
obvious and is generally done incorrectly.
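
To illustrate the "completely independently" part, here is a minimal sketch in C using the SSE2 intrinsic _mm_add_epi32, which compiles to PADDD. The function name add_four_dwords and the array layout are assumptions made for this example, not something from the thread. It only shows that each 32-bit lane wraps on its own and nothing is carried into the neighbouring lane.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds four 32-bit lanes of a and b with PADDD.
   Each lane wraps modulo 2^32; no carry moves into the next lane. */
static void add_four_dwords(const uint32_t a[4], const uint32_t b[4],
                            uint32_t out[4])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

With a = {0xFFFFFFFF, 0, 0, 0} and b = {1, 0, 0, 0}, lane 0 wraps to 0 and lane 1 stays 0, whereas a real 128-bit add would have put a 1 in lane 1.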

Longer answer:
There is a horizontal component associated with this (the carry).
Horizontal in the sense that the result of the operation depends on
adjacent data elements in the same register (as opposed to vertical,
where it depends on the corresponding elements in a different
register). SSE works best when you confine all dependencies to the
vertical variety. This is possible with, for instance, 128-bit adds,
but it's fairly difficult to pull off correctly. Since you can use 2
qwords (which is better than 4 dwords), it involves calculating two
128-bit additions at the -same- time.

If you had two matrices (or vectors) A and B of 128-bit numbers that
you wanted to add, say a1h and a1l are the high and low qwords of
element a1 (a 128-bit number).

xmm1 = [a1l : a2l]
xmm2 = [b1l : b2l]
store(xmm1 + xmm2)
xmm3 = [a1h : a2h]
xmm4 = [b1h : b2h]
figure out carry flags for h term, add it
store

In this method, we are calculating -two- elements of our matrix at the
same time. And all our dependencies are vertical, so that all
information needed to add a1 and b1 occurs in the left half of the SSE
register, and all information needed to add a2 and b2 occurs in the
right half.
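
The scheme above could be sketched roughly as follows with SSE2 intrinsics. This is one possible way to "figure out carry flags", not necessarily the method the post had in mind: SSE2 has no unsigned 64-bit compare, so the carry out of each low qword is rebuilt from the top bits of the operands and the wrapped sum. The function name add_two_u128 and the array layout (index 0 for element 1, index 1 for element 2) are assumptions for the example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: computes r1 = a1 + b1 and r2 = a2 + b2 at once,
   where each number is 128 bits, split into a low and a high qword.
   x_lo[0], x_hi[0] belong to element 1; x_lo[1], x_hi[1] to element 2. */
static void add_two_u128(const uint64_t a_lo[2], const uint64_t a_hi[2],
                         const uint64_t b_lo[2], const uint64_t b_hi[2],
                         uint64_t r_lo[2], uint64_t r_hi[2])
{
    assert(a_lo && a_hi && b_lo && b_hi && r_lo && r_hi);
    __m128i al = _mm_loadu_si128((const __m128i *)a_lo);
    __m128i bl = _mm_loadu_si128((const __m128i *)b_lo);
    __m128i ah = _mm_loadu_si128((const __m128i *)a_hi);
    __m128i bh = _mm_loadu_si128((const __m128i *)b_hi);

    /* Low halves: PADDQ wraps each lane modulo 2^64. */
    __m128i lo = _mm_add_epi64(al, bl);

    /* Carry out of bit 63 in each lane: (a & b) | ((a | b) & ~sum),
       then move the top bit down so the lane holds 0 or 1. */
    __m128i carry = _mm_or_si128(
        _mm_and_si128(al, bl),
        _mm_andnot_si128(lo, _mm_or_si128(al, bl)));
    carry = _mm_srli_epi64(carry, 63);

    /* High halves plus the per-lane carry; a carry out of the high
       qword is dropped here, so results are modulo 2^128. */
    __m128i hi = _mm_add_epi64(_mm_add_epi64(ah, bh), carry);

    _mm_storeu_si128((__m128i *)r_lo, lo);
    _mm_storeu_si128((__m128i *)r_hi, hi);
}
```

Every step works lane by lane, so the carry for element 1 never touches element 2, which is the "all dependencies are vertical" point.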

jacob navia

2006-08-25, 6:56 pm

Alexander Knopf wrote:
> quick questions about the above.
>
> i'm writing a program that handles large numbers, over 128 bits.
> now i understand i could just put 4 DWord sections of any number into an
> xmm register, put another 4 DWords into a second and add them.
> however, i don't know how the carry flag is handled.
> so questions as follows:
> 1) which byte order would i have to use ?
> 2) would the carry bit be added to the next higher dword ?
> 3) what happens if the highest dword addition produces a carry ?
>
> any help is greatly appreciated.
>
> -Alexander Knopf
>


I would check first if there is a 128 bit ADD instruction...

Alexander Knopf

2006-08-25, 6:56 pm

ldb wrote:
> Alexander Knopf wrote:
>
>
> The short answer is: They will add the four dwords completely
> independently, and the carry flag goes into the EFLAGS register, if I
> remember correctly. SSE instructions are NOT meant to operate on
> 128-bit data types. They are meant to work on, up to, qwords. Doing
> higher than qword math in SSE registers correctly is not obvious and
> generally done incorrectly.


so basically what you're saying is, since i can do addition of 2 QWords,
and there's only one carry flag i would have to check the values anyhow.
now then, would it be easier to use 32 bit registers and use adc to add
the carry flag to the next value ?

-Alexander Knopf

ldb

2006-08-29, 6:56 pm


Alexander Knopf wrote:
> ldb wrote:
>
> so basically what you're saying is, since i can do addition of 2 QWords,
> and there's only one carry flag i would have to check the values anyhow.
> now then, would it be easier to use 32 bit registers and use adc to add
> the carry flag to the next value ?
>
> -Alexander Knopf


Actually, the Intel manual on the PADDQ (add packed quadword integers)
instruction says this:
"When a quadword result is too large to be represented in 64 bits
(overflow), the result is wrapped around and the low 64 bits are
written to the destination element (that is, the carry is ignored)....
... however, it does not set bits in the EFLAGS register to indicate
overflow and/or a carry. To prevent undetected overflow conditions,
software must control the ranges of the values operated on."
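
Since no flag is set, the software has to recover the lost carry itself. A minimal scalar sketch of the usual trick, assuming unsigned operands: after a wrapping add, the sum is smaller than either operand exactly when a carry was dropped. The name add_u64_carry is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns a + b wrapped modulo 2^64 (the same
   result one PADDQ lane would hold) and reports the dropped carry. */
static uint64_t add_u64_carry(uint64_t a, uint64_t b, unsigned *carry)
{
    assert(carry);
    uint64_t sum = a + b;   /* unsigned overflow wraps, by definition in C */
    *carry = (sum < a);     /* 1 if the true result needed a 65th bit */
    return sum;
}
```

For example, adding UINT64_MAX and 1 returns 0 with the carry set to 1, while adding 2 and 3 returns 5 with the carry set to 0.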

The moral of the story here is that SSE is very bad for multiprecision
arithmetic operations. People often come in here and want to use, for
instance, 128-bit integers, and believe that XMM registers (which are
128 bit) are the silver bullet. It turns out, in fact, they are not.

If you need to add 2 pairs (i.e. a+b and c+d) of 128-bit numbers, then
SSE -can- work, with some trickery, to some benefit (you do a+b in the
left 64 bits, and c+d in the right 64 bits)... but to do a single 128
bit add, it is -much- easier to just do it in normal 32-bit registers
with add and adc.
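
As a rough picture of what the add/adc chain does, here is a sketch in portable C rather than assembly. It assumes the number is stored as 32-bit limbs with the least significant limb first; the name add_limbs is hypothetical. The carry from each limb is added into the next higher limb, and whatever comes out of the top limb is returned to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: r = a + b over n 32-bit limbs, limb 0 being the
   least significant. Returns the carry out of the top limb (0 or 1). */
static uint32_t add_limbs(uint32_t *r, const uint32_t *a,
                          const uint32_t *b, size_t n)
{
    assert(n > 0);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* First pass acts like add, later passes like adc. */
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        r[i] = (uint32_t)t;   /* low 32 bits stay in this limb */
        carry = t >> 32;      /* bit 32 moves to the next limb */
    }
    return (uint32_t)carry;
}
```

With n = 4 this is one 128-bit add, and a larger n handles numbers over 128 bits. A non-zero return value means the result did not fit in n limbs.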

Now, if you add two giant vectors of 128-bit integers, with correct
programming, SSE would probably outperform a plain 32-bit
solution. The operative word there is 'probably'. It's more complicated
than just doing a for loop over each element, however. The gist of the
solution is you need to add pairs of elements simultaneously.


Copyright 2008 codecomments.com