For Programmers: Free Programming Magazines  


Home > Archive > Compression > January 2007 > JPEG decoder performance









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author JPEG decoder performance
cr88192

2006-12-26, 3:56 am

well, earlier I optimized the IDCT transform, and and now the IDCT
constitutes only a small amount of the total run-time it seems.

well, anyways, if anyone has any good ideas how I can optimize the IDCT
further, I am including the current version I am using at the bottom.


actully, most of the run-time is going into the code which takes the image
(in the form of decoded 8x8 blocks), and rearanges it into a set of raster
image planes, and also upsamples the Cb and Cr planes (constituting about
35% of the time).

another 25% goes into the function which takes these planes, and does the
YCbCr to RGB conversion, reassembling the image into a buffer.

just below this, about 10% goes into the huffman symbol-decode function
(could be optimized by an index table if needed).

just below this (about 9%), is the function for outputting the targa image.

then finally we get to the IDCT, which consists of 3 functions:
5.7% for the horizontal part;
4.7% for the verticle part;
and 2.7% for the frontend.

this is followed by misc functions...


note that with an 800x600 image, runtimes were too short to really be
measurable.
the test-image I was using here was 2576x1932, with a total decode time of
about 2s (with 0.3s going into DCT).

optimized vs profiler settings seems to make only a minor impact on total
time.


so, here are the IDCT functions:

---

void PDJPG_TransIDCT_Horiz(short *iblk, int *oblk)
{
int a, b, c, d, e, f;

a=iblk[0]*91; b=iblk[4]*91;
c=iblk[2]*118; d=iblk[2]*49;
e=iblk[6]*118; f=iblk[6]*49;
oblk[0]=a+iblk[1]*126+c+iblk[3]*106+b+ib
lk[5]* 71+f+iblk[7]* 25;
oblk[1]=a+iblk[1]*106+d-iblk[3]* 25-b-iblk[5]*126-e-iblk[7]* 71;
oblk[2]=a+iblk[1]* 71-d-iblk[3]*126-b+iblk[5]* 25+e+iblk[7]*106;
oblk[3]=a+iblk[1]* 25-c-iblk[3]* 71+b+iblk[5]*106-f-iblk[7]*126;
oblk[4]=a-iblk[1]* 25-c+iblk[3]* 71+b-iblk[5]*106-f+iblk[7]*126;
oblk[5]=a-iblk[1]* 71-d+iblk[3]*126-b-iblk[5]* 25+e-iblk[7]*106;
oblk[6]=a-iblk[1]*106+d+iblk[3]* 25-b+iblk[5]*126-e+iblk[7]* 71;
oblk[7]=a-iblk[1]*126+c-iblk[3]*106+b-iblk[5]* 71+f-iblk[7]* 25;
}

void PDJPG_TransIDCT_Vert(int *iblk, int *oblk)
{
int a, b, c, d, e, f;

a=iblk[0]*91; b=iblk[32]*91;
c=iblk[16]*118; d=iblk[16]*49;
e=iblk[48]*118; f=iblk[48]*49;
oblk[ 0]=a+iblk[8]*126+c+iblk[24]*106+b+iblk[4
0]* 71+f+iblk[56]* 25;
oblk[ 8]=a+iblk[8]*106+d-iblk[24]* 25-b-iblk[40]*126-e-iblk[56]* 71;
oblk[16]=a+iblk[8]* 71-d-iblk[24]*126-b+iblk[40]* 25+e+iblk[56]*106;
oblk[24]=a+iblk[8]* 25-c-iblk[24]* 71+b+iblk[40]*106-f-iblk[56]*126;
oblk[32]=a-iblk[8]* 25-c+iblk[24]* 71+b-iblk[40]*106-f+iblk[56]*126;
oblk[40]=a-iblk[8]* 71-d+iblk[24]*126-b-iblk[40]* 25+e-iblk[56]*106;
oblk[48]=a-iblk[8]*106+d+iblk[24]* 25-b+iblk[40]*126-e+iblk[56]* 71;
oblk[56]=a-iblk[8]*126+c-iblk[24]*106+b-iblk[40]* 71+f-iblk[56]* 25;
}

void PDJPG_TransIDCT(short *iblk, byte *oblk)
{
int s[DCTSZ2];
int t[DCTSZ2];
int i, j;

PDJPG_TransIDCT_Horiz(iblk+0, s+0);
PDJPG_TransIDCT_Horiz(iblk+8, s+8);
PDJPG_TransIDCT_Horiz(iblk+16, s+16);
PDJPG_TransIDCT_Horiz(iblk+24, s+24);
PDJPG_TransIDCT_Horiz(iblk+32, s+32);
PDJPG_TransIDCT_Horiz(iblk+40, s+40);
PDJPG_TransIDCT_Horiz(iblk+48, s+48);
PDJPG_TransIDCT_Horiz(iblk+56, s+56);

PDJPG_TransIDCT_Vert(s+0, t+0);
PDJPG_TransIDCT_Vert(s+1, t+1);
PDJPG_TransIDCT_Vert(s+2, t+2);
PDJPG_TransIDCT_Vert(s+3, t+3);
PDJPG_TransIDCT_Vert(s+4, t+4);
PDJPG_TransIDCT_Vert(s+5, t+5);
PDJPG_TransIDCT_Vert(s+6, t+6);
PDJPG_TransIDCT_Vert(s+7, t+7);

for(i=0; i<64; i++)
{
j=(t[i]>>16)+128;
oblk[i]=(j<0)?0:((j>255)?255:j);
}
}

---


and a few of the other major time-eating functions:
---

void PDJPG_BuildPlane(int cn)
{
byte *pbuf;
int i, j, k, l;
int id, w, h, n, xs, ys;
int a, b, c, d, v;

id=pdjpg_scn[cn];

w=(pdjpg_cxi[id]+7)/8;
h=(pdjpg_cyi[id]+7)/8;
pbuf=malloc((w*8)*(h*8));

xs=w*8; ys=h*8;

for(i=0; i<h; i++)
for(j=0; j<w; j++)
{
n=(i*w)+j;

for(k=0; k<8; k++)for(l=0; l<8; l++)
pbuf[(i*8+k)*xs+(j*8+l)]=
pdjpg_sibuf[cn][n*64+(k*8+l)];
}

if((pdjpg_cxi[id]==pdjpg_xs) && (pdjpg_cyi[id]==pdjpg_ys))
{
pdjpg_spbuf[cn]=pbuf;
return;
}

if((pdjpg_cxi[id]==(pdjpg_xs>>1)) && (pdjpg_cyi[id]==(pdjpg_ys>>1)))
{
pdjpg_spbuf[cn]=malloc(pdjpg_xs*pdjpg_ys
);

for(i=0; i<pdjpg_ys; i++)
for(j=0; j<pdjpg_xs; j++)
{
k=i>>1; l=j>>1;
a=pbuf[((k+0)*xs)+(l+0)];
b=pbuf[((k+0)*xs)+(l+1)];
c=pbuf[((k+1)*xs)+(l+0)];
d=pbuf[((k+1)*xs)+(l+1)];
k=(j&1)?(0.34*a+0.66*b):(0.66*a+0.34*b);
l=(j&1)?(0.34*c+0.66*d):(0.66*c+0.34*d);
v=(i&1)?(0.34*k+0.66*l):(0.66*k+0.34*l);

v=(v<0)?0:((v>255)?255:v);
pdjpg_spbuf[cn][i*pdjpg_xs+j]=v;
}
return;
}

//ok, yeah, skipped out on a general case...
}

void PDJPG_BuildImage(byte *obuf)
{
int i, j, k, l;
int y, u, v, r, g, b;

for(i=0; i<pdjpg_ys; i++)
for(j=0; j<pdjpg_xs; j++)
{
y=pdjpg_spbuf[0][i*pdjpg_xs+j];
if(pdjpg_nc==3)
{
u=pdjpg_spbuf[1][i*pdjpg_xs+j];
v=pdjpg_spbuf[2][i*pdjpg_xs+j];
r=y+ 1.402*(v-128);
g=y- 0.34414*(u-128)- 0.71414*(v-128);
b=y+ 1.772*(u-128);

r=(r<0)?0:((r>255)?255:r);
g=(g<0)?0:((g>255)?255:g);
b=(b<0)?0:((b>255)?255:b);
}else
{
r=y; g=y; b=y;
}

k=pdjpg_ys-1-i;
obuf[((k*pdjpg_xs)+j)*4+0]=r;
obuf[((k*pdjpg_xs)+j)*4+1]=g;
obuf[((k*pdjpg_xs)+j)*4+2]=b;
}
}

int PDJHUFF_DecodeSymbol(int tab)
{
int i, j, k;

tab<<=8;

i=PDJHUFF_PWord();
for(j=0; j<256; j++)
{
k=pdjhuff_len[tab|j];
if(!k)continue;

if((i>>(16-k))!=pdjhuff_code[tab|j])
continue;
PDJHUFF_SkipNBits(k);
return(j);
}
return(-1);
}


mihai cartoaje

2007-01-06, 6:56 pm

cr88192 wrote:
> well, earlier I optimized the IDCT transform, and and now the IDCT
> constitutes only a small amount of the total run-time it seems.
>
> well, anyways, if anyone has any good ideas how I can optimize the IDCT
> further, I am including the current version I am using at the bottom.
>
>
> actully, most of the run-time is going into the code which takes the image
> (in the form of decoded 8x8 blocks), and rearanges it into a set of raster
> image planes, and also upsamples the Cb and Cr planes (constituting about
> 35% of the time).
>
> another 25% goes into the function which takes these planes, and does the
> YCbCr to RGB conversion, reassembling the image into a buffer.
>
> just below this, about 10% goes into the huffman symbol-decode function
> (could be optimized by an index table if needed).
>
> just below this (about 9%), is the function for outputting the targa image.
>
> then finally we get to the IDCT, which consists of 3 functions:
> 5.7% for the horizontal part;
> 4.7% for the verticle part;
> and 2.7% for the frontend.
>
> this is followed by misc functions...
>
>
> note that with an 800x600 image, runtimes were too short to really be
> measurable.
> the test-image I was using here was 2576x1932, with a total decode time of
> about 2s (with 0.3s going into DCT).
>
> optimized vs profiler settings seems to make only a minor impact on total
> time.
>
>
> so, here are the IDCT functions:
>
> ---
>
> void PDJPG_TransIDCT_Horiz(short *iblk, int *oblk)
> {
> int a, b, c, d, e, f;
>
> a=iblk[0]*91; b=iblk[4]*91;
> c=iblk[2]*118; d=iblk[2]*49;
> e=iblk[6]*118; f=iblk[6]*49;
> oblk[0]=a+iblk[1]*126+c+iblk[3]*106+b+ib
lk[5]* 71+f+iblk[7]* 25;
> oblk[1]=a+iblk[1]*106+d-iblk[3]* 25-b-iblk[5]*126-e-iblk[7]* 71;
> oblk[2]=a+iblk[1]* 71-d-iblk[3]*126-b+iblk[5]* 25+e+iblk[7]*106;
> oblk[3]=a+iblk[1]* 25-c-iblk[3]* 71+b+iblk[5]*106-f-iblk[7]*126;
> oblk[4]=a-iblk[1]* 25-c+iblk[3]* 71+b-iblk[5]*106-f+iblk[7]*126;
> oblk[5]=a-iblk[1]* 71-d+iblk[3]*126-b-iblk[5]* 25+e-iblk[7]*106;
> oblk[6]=a-iblk[1]*106+d+iblk[3]* 25-b+iblk[5]*126-e+iblk[7]* 71;
> oblk[7]=a-iblk[1]*126+c-iblk[3]*106+b-iblk[5]* 71+f-iblk[7]* 25;
> }
>
> void PDJPG_TransIDCT_Vert(int *iblk, int *oblk)
> {
> int a, b, c, d, e, f;
>
> a=iblk[0]*91; b=iblk[32]*91;
> c=iblk[16]*118; d=iblk[16]*49;
> e=iblk[48]*118; f=iblk[48]*49;
> oblk[ 0]=a+iblk[8]*126+c+iblk[24]*106+b+iblk[4
0]* 71+f+iblk[56]* 25;
> oblk[ 8]=a+iblk[8]*106+d-iblk[24]* 25-b-iblk[40]*126-e-iblk[56]* 71;
> oblk[16]=a+iblk[8]* 71-d-iblk[24]*126-b+iblk[40]* 25+e+iblk[56]*106;
> oblk[24]=a+iblk[8]* 25-c-iblk[24]* 71+b+iblk[40]*106-f-iblk[56]*126;
> oblk[32]=a-iblk[8]* 25-c+iblk[24]* 71+b-iblk[40]*106-f+iblk[56]*126;
> oblk[40]=a-iblk[8]* 71-d+iblk[24]*126-b-iblk[40]* 25+e-iblk[56]*106;
> oblk[48]=a-iblk[8]*106+d+iblk[24]* 25-b+iblk[40]*126-e+iblk[56]* 71;
> oblk[56]=a-iblk[8]*126+c-iblk[24]*106+b-iblk[40]* 71+f-iblk[56]* 25;
> }
>
> void PDJPG_TransIDCT(short *iblk, byte *oblk)
> {
> int s[DCTSZ2];
> int t[DCTSZ2];
> int i, j;
>
> PDJPG_TransIDCT_Horiz(iblk+0, s+0);
> PDJPG_TransIDCT_Horiz(iblk+8, s+8);
> PDJPG_TransIDCT_Horiz(iblk+16, s+16);
> PDJPG_TransIDCT_Horiz(iblk+24, s+24);
> PDJPG_TransIDCT_Horiz(iblk+32, s+32);
> PDJPG_TransIDCT_Horiz(iblk+40, s+40);
> PDJPG_TransIDCT_Horiz(iblk+48, s+48);
> PDJPG_TransIDCT_Horiz(iblk+56, s+56);
>
> PDJPG_TransIDCT_Vert(s+0, t+0);
> PDJPG_TransIDCT_Vert(s+1, t+1);
> PDJPG_TransIDCT_Vert(s+2, t+2);
> PDJPG_TransIDCT_Vert(s+3, t+3);
> PDJPG_TransIDCT_Vert(s+4, t+4);
> PDJPG_TransIDCT_Vert(s+5, t+5);
> PDJPG_TransIDCT_Vert(s+6, t+6);
> PDJPG_TransIDCT_Vert(s+7, t+7);
>
> for(i=0; i<64; i++)
> {
> j=(t[i]>>16)+128;
> oblk[i]=(j<0)?0:((j>255)?255:j);
> }
> }
>
> ---
>
>
> and a few of the other major time-eating functions:
> ---
>
> void PDJPG_BuildPlane(int cn)
> {
> byte *pbuf;
> int i, j, k, l;
> int id, w, h, n, xs, ys;
> int a, b, c, d, v;
>
> id=pdjpg_scn[cn];
>
> w=(pdjpg_cxi[id]+7)/8;
> h=(pdjpg_cyi[id]+7)/8;
> pbuf=malloc((w*8)*(h*8));
>
> xs=w*8; ys=h*8;
>
> for(i=0; i<h; i++)
> for(j=0; j<w; j++)
> {
> n=(i*w)+j;
>
> for(k=0; k<8; k++)for(l=0; l<8; l++)
> pbuf[(i*8+k)*xs+(j*8+l)]=
> pdjpg_sibuf[cn][n*64+(k*8+l)];
> }
>
> if((pdjpg_cxi[id]==pdjpg_xs) && (pdjpg_cyi[id]==pdjpg_ys))
> {
> pdjpg_spbuf[cn]=pbuf;
> return;
> }
>
> if((pdjpg_cxi[id]==(pdjpg_xs>>1)) && (pdjpg_cyi[id]==(pdjpg_ys>>1)))
> {
> pdjpg_spbuf[cn]=malloc(pdjpg_xs*pdjpg_ys
);
>
> for(i=0; i<pdjpg_ys; i++)
> for(j=0; j<pdjpg_xs; j++)
> {
> k=i>>1; l=j>>1;
> a=pbuf[((k+0)*xs)+(l+0)];
> b=pbuf[((k+0)*xs)+(l+1)];
> c=pbuf[((k+1)*xs)+(l+0)];
> d=pbuf[((k+1)*xs)+(l+1)];
> k=(j&1)?(0.34*a+0.66*b):(0.66*a+0.34*b);
> l=(j&1)?(0.34*c+0.66*d):(0.66*c+0.34*d);
> v=(i&1)?(0.34*k+0.66*l):(0.66*k+0.34*l);
>
> v=(v<0)?0:((v>255)?255:v);
> pdjpg_spbuf[cn][i*pdjpg_xs+j]=v;
> }
> return;
> }
>
> //ok, yeah, skipped out on a general case...
> }
>
> void PDJPG_BuildImage(byte *obuf)
> {
> int i, j, k, l;
> int y, u, v, r, g, b;
>
> for(i=0; i<pdjpg_ys; i++)
> for(j=0; j<pdjpg_xs; j++)
> {
> y=pdjpg_spbuf[0][i*pdjpg_xs+j];
> if(pdjpg_nc==3)
> {
> u=pdjpg_spbuf[1][i*pdjpg_xs+j];
> v=pdjpg_spbuf[2][i*pdjpg_xs+j];
> r=y+ 1.402*(v-128);
> g=y- 0.34414*(u-128)- 0.71414*(v-128);
> b=y+ 1.772*(u-128);
>
> r=(r<0)?0:((r>255)?255:r);
> g=(g<0)?0:((g>255)?255:g);
> b=(b<0)?0:((b>255)?255:b);
> }else
> {
> r=y; g=y; b=y;
> }
>
> k=pdjpg_ys-1-i;
> obuf[((k*pdjpg_xs)+j)*4+0]=r;
> obuf[((k*pdjpg_xs)+j)*4+1]=g;
> obuf[((k*pdjpg_xs)+j)*4+2]=b;
> }
> }
>
> int PDJHUFF_DecodeSymbol(int tab)
> {
> int i, j, k;
>
> tab<<=8;
>
> i=PDJHUFF_PWord();
> for(j=0; j<256; j++)
> {
> k=pdjhuff_len[tab|j];
> if(!k)continue;
>
> if((i>>(16-k))!=pdjhuff_code[tab|j])
> continue;
> PDJHUFF_SkipNBits(k);
> return(j);
> }
> return(-1);
> }



One way of performing the DCT is by first doing a 2D Hadamaard
transformation and then converting it to a DCT. If H is a 1D Hadamaard
transformation, C is the DCT and X transforms the first into the
second, then,

C = X H
or,
X = C H^(-1)

Carsten Neubauer

2007-01-06, 6:56 pm

I only looked into PDJPG_TransIDCT_Horiz
and PDJPG_TransIDCT_Vert. Of course they
are not the major bottleneck, but you can
save ca. 40% of muls and adds there like:

void PDJPG_TransIDCT_Horiz(short *iblk, int *oblk)
{
int a, b, c, d, e, f;

a = iblk[0]* 91;
b = iblk[4]* 91;
c = iblk[2]*118;
d = iblk[2]* 49;
e = iblk[6]*118;
f = iblk[6]* 49;

int acbf0, acbf1, adbe0, adbe1, i126, i106, i071, i025;

acbf0 = a + c + b + f;
acbf1 = a - c + b - f;
adbe0 = a + d - b - e;
adbe1 = a - d - b + e;

i126 = iblk[1]*126 + iblk[3]*106 + iblk[5]* 71 + iblk[7]* 25;
i106 = iblk[1]*106 - iblk[3]* 25 - iblk[5]*126 - iblk[7]* 71;
i071 = iblk[1]* 71 - iblk[3]*126 + iblk[5]* 25 + iblk[7]*106;
i025 = iblk[1]* 25 - iblk[3]* 71 + iblk[5]*106 - iblk[7]*126;

oblk[0] = acbf0 + i126;
oblk[1] = adbe0 + i106;
oblk[2] = adbe1 + i071;
oblk[3] = acbf1 + i025;
oblk[4] = acbf1 - i025;
oblk[5] = adbe1 - i071;
oblk[6] = adbe0 - i106;
oblk[7] = acbf0 - i126;
}

It uses 22 muls and 32 adds,
instead of 38 muls and 56 adds.

Of course the same applies to
PDJPG_TransIDCT_Vert.


Carsten Neubauer
http://www.c14sw.de/

Carsten Neubauer

2007-01-06, 6:56 pm

Two more Ideas on PDJPG_BuildPlane:

1) The four nested loops

> for(i=0; i<h; i++)
> for(j=0; j<w; j++)
> {
> n=(i*w)+j;
>
> for(k=0; k<8; k++)
> for(l=0; l<8; l++)
> pbuf[(i*8+k)*xs+(j*8+l)]=pdjpg_sibuf[cn]
[n*64+(k*8+l)];
> }


Ehm, I see full position-arithmetics
in the inner loop and once per pixel.
The expression on the right side
n*64+(k*8+l)
only counts upwards.
The whole loops should do the same as

byte *pbSource = pdjpg_sibuf[cn];
byte *pbDest = pbuf;

for( i = 0; i < h; i++)
{
for( j = 0; j < w; j++)
{
for( k = 0; k < 8; k++)
{
for(l = 0; l < 8; l++)
*pbDest++ = *pbSource++;

pbDest += xs - 8; //one line down, 8 pixels left
}
pbDest -= xs*8 - 8; //eight up, eight right
}
pbDest += xs*7; //seven more down
}





2) The two nested loops

> for(i=0; i<pdjpg_ys; i++)
> for(j=0; j<pdjpg_xs; j++)
> {
> k=i>>1; l=j>>1;
> a=pbuf[((k+0)*xs)+(l+0)];
> b=pbuf[((k+0)*xs)+(l+1)];
> c=pbuf[((k+1)*xs)+(l+0)];
> d=pbuf[((k+1)*xs)+(l+1)];
> k=(j&1)?(0.34*a+0.66*b):(0.66*a+0.34*b);
> l=(j&1)?(0.34*c+0.66*d):(0.66*c+0.34*d);
> v=(i&1)?(0.34*k+0.66*l):(0.66*k+0.34*l);
>
> v=(v<0)?0:((v>255)?255:v);
> pdjpg_spbuf[cn][i*pdjpg_xs+j]=v;
> }


i*pdjpg_xs+j counts upwards.
The calculations take their time
and also four or five IFs per pixel.
They are hard to avoid, unless your loop
would process 2x2 pixels at once.
Did you already try lookup-tables
instead of the 0.34 and 0.66 muls?
Like
double dTab34[256] = {0.34, 0.68, ...};
Two of them should use 4KB and fit into the cache,
don't know, perhaps double is overkill,
float might also be sufficient and use half the space.

can be transformed to

int k0, k1;
pbDest = pdjpg_sibuf[cn];
for(i=0; i<pdjpg_ys; i++)
{
k0 = (i>>1)*xs;
for(j=0; j<pdjpg_xs; j++)
{
k1 = k0 + (j>>1);
a = pbuf[k1 + 0)];
b = pbuf[k1 + 1)];
k1 += xs;
c = pbuf[k1 + 0)];
d = pbuf[k1 + 1)];

if (j&1)
{
k = dTab34[a] + dTab66[b];
l = dTab34[c] + dTab66[d];
}
else
{
k = dTab66[a] + dTab34[b];
l = dTab66[c] + dTab34[d];
}
v = (i&1) ? (dTab34[k] + dTab66[l]) : (dTab66[k] + dTab34[l]);

*pbDest++ = (v<0)?0:((v>255)?255:v);
}
}

Well, not nicer, but hopefully faster.
I'd say processing 2x2 pixels at once
should speed those loops up by another
50%.


Hope it helps,

Carsten Neubauer

Nils Haeck

2007-01-06, 6:56 pm

Are you, by any chance, creating buffers (malloc) every time a function is
called? It certainly makes sense to create a few large buffers at the start
of the program and re-use these buffers over and over again. A lot of time
goes into memory management if you don't pay attention to it :)

Nils


mihai cartoaje

2007-01-06, 6:56 pm

mihai cartoaje wrote:
> cr88192 wrote:
>
>
> One way of performing the DCT is by first doing a 2D Hadamaard
> transformation and then converting it to a DCT. If H is a 1D Hadamaard
> transformation, C is the DCT and X transforms the first into the
> second, then,
>
> C = X H
> or,
> X = C H^(-1)


I meant Hadamard. A 2D Hadamard transformation can be done by
compositing Haar transformations each on a 2x2 group of pixels. A
lossless version assuming the pixels are arranged like,

ab
cd

is,

c0 = (a + b + c + d) >> 1
c1 = (- a + b - c + d) >> 1
c2 = (- a - b + c + d) >> 1
c3 = - ((-(a - b - c + d)) >> 1)

and the inverse transformation:

a = (c0 - c1 - c2 + c3) >> 1
b = -((-(c0 + c1 - c2 - c3)) >> 1)
c = -((-(c0 - c1 + c2 - c3)) >> 1)
d = -((-(c0 + c1 + c2 + c3)) >> 1)

Because of the shifts, I have to prove that the lowest order bits are
reconstructed:
(0, 0, 0, 0) -> (0, 0, 0, 0) -> (0, 0, 0, 0)
(0, 0, 0, 1) -> (0, 0, 0, 1) -> (0, 0, 0, 1)
(0, 0, 1, 0) -> (0,-1, 0, 0) -> (0, 0, 1, 0)
(0, 0, 1, 1) -> (1, 0, 1, 0) -> (0, 0, 1, 1)
(0, 1, 0, 0) -> (0, 0,-1, 0) -> (0, 1, 0, 0)
(0, 1, 0, 1) -> (1, 1, 0, 0) -> (0, 1, 0, 1)
(0, 1, 1, 0) -> (1, 0, 0,-1) -> (0, 1, 1, 0)
(0, 1, 1, 1) -> (1, 0, 0, 0) -> (0, 1, 1, 1)
(1, 0, 0, 0) -> (0,-1,-1, 1) -> (1, 0, 0, 0)
(1, 0, 0, 1) -> (1, 0, 0, 1) -> (1, 0, 0, 1)
(1, 0, 1, 0) -> (1,-1, 0, 0) -> (1, 0, 1, 0)
(1, 0, 1, 1) -> (1,-1, 0, 1) -> (1, 0, 1, 1)
(1, 1, 0, 0) -> (1, 0,-1, 0) -> (1, 1, 0, 0)
(1, 1, 0, 1) -> (1, 0,-1, 1) -> (1, 1, 0, 1)
(1, 1, 1, 0) -> (1,-1,-1, 0) -> (1, 1, 1, 0)
(1, 1, 1, 1) -> (2, 0, 0, 0) -> (1, 1, 1, 1)

cr88192

2007-01-06, 6:56 pm


"mihai cartoaje" <repstsb@yahoo.ca> wrote in message
news:1167536928.362115.298680@h40g2000cwb.googlegroups.com...
> mihai cartoaje wrote:
>
> I meant Hadamard. A 2D Hadamard transformation can be done by
> compositing Haar transformations each on a 2x2 group of pixels. A
> lossless version assuming the pixels are arranged like,
>
> ab
> cd
>
> is,
>
> c0 = (a + b + c + d) >> 1
> c1 = (- a + b - c + d) >> 1
> c2 = (- a - b + c + d) >> 1
> c3 = - ((-(a - b - c + d)) >> 1)
>
> and the inverse transformation:
>
> a = (c0 - c1 - c2 + c3) >> 1
> b = -((-(c0 + c1 - c2 - c3)) >> 1)
> c = -((-(c0 - c1 + c2 - c3)) >> 1)
> d = -((-(c0 + c1 + c2 + c3)) >> 1)
>
> Because of the shifts, I have to prove that the lowest order bits are
> reconstructed:
> (0, 0, 0, 0) -> (0, 0, 0, 0) -> (0, 0, 0, 0)
> (0, 0, 0, 1) -> (0, 0, 0, 1) -> (0, 0, 0, 1)
> (0, 0, 1, 0) -> (0,-1, 0, 0) -> (0, 0, 1, 0)
> (0, 0, 1, 1) -> (1, 0, 1, 0) -> (0, 0, 1, 1)
> (0, 1, 0, 0) -> (0, 0,-1, 0) -> (0, 1, 0, 0)
> (0, 1, 0, 1) -> (1, 1, 0, 0) -> (0, 1, 0, 1)
> (0, 1, 1, 0) -> (1, 0, 0,-1) -> (0, 1, 1, 0)
> (0, 1, 1, 1) -> (1, 0, 0, 0) -> (0, 1, 1, 1)
> (1, 0, 0, 0) -> (0,-1,-1, 1) -> (1, 0, 0, 0)
> (1, 0, 0, 1) -> (1, 0, 0, 1) -> (1, 0, 0, 1)
> (1, 0, 1, 0) -> (1,-1, 0, 0) -> (1, 0, 1, 0)
> (1, 0, 1, 1) -> (1,-1, 0, 1) -> (1, 0, 1, 1)
> (1, 1, 0, 0) -> (1, 0,-1, 0) -> (1, 1, 0, 0)
> (1, 1, 0, 1) -> (1, 0,-1, 1) -> (1, 1, 0, 1)
> (1, 1, 1, 0) -> (1,-1,-1, 0) -> (1, 1, 1, 0)
> (1, 1, 1, 1) -> (2, 0, 0, 0) -> (1, 1, 1, 1)
>


I hadn't noticed that anyone had responded to this, a lot of other posts and
other things it seems.

this transform, however, does not appear like it could be used as a
replacement for the DCT, or at least not one compatible with the DCT (as
done in jpeg and friends...).



cr88192

2007-01-06, 6:56 pm

interesting idea, yes...

in my case, I had considered factoring it more, but I had felt uncertain as
to how much to factor it, in particular, I was uncertain of the relative
cost between arithmetic operations and variable creation/assignment.


as for optimizing the scale functions, I had considered ideas, but much
after the post, decided it is fast enough. as it was, I was having to use
particularly large images in order to even really gain really measurable
time costs (for the most part, my images tend to be a lot smaller).


"Carsten Neubauer" <cn@c14sw.de> wrote in message
news:1167428997.531943.148360@n51g2000cwc.googlegroups.com...
>I only looked into PDJPG_TransIDCT_Horiz
> and PDJPG_TransIDCT_Vert. Of course they
> are not the major bottleneck, but you can
> save ca. 40% of muls and adds there like:
>
> void PDJPG_TransIDCT_Horiz(short *iblk, int *oblk)
> {
> int a, b, c, d, e, f;
>
> a = iblk[0]* 91;
> b = iblk[4]* 91;
> c = iblk[2]*118;
> d = iblk[2]* 49;
> e = iblk[6]*118;
> f = iblk[6]* 49;
>
> int acbf0, acbf1, adbe0, adbe1, i126, i106, i071, i025;
>
> acbf0 = a + c + b + f;
> acbf1 = a - c + b - f;
> adbe0 = a + d - b - e;
> adbe1 = a - d - b + e;
>
> i126 = iblk[1]*126 + iblk[3]*106 + iblk[5]* 71 + iblk[7]* 25;
> i106 = iblk[1]*106 - iblk[3]* 25 - iblk[5]*126 - iblk[7]* 71;
> i071 = iblk[1]* 71 - iblk[3]*126 + iblk[5]* 25 + iblk[7]*106;
> i025 = iblk[1]* 25 - iblk[3]* 71 + iblk[5]*106 - iblk[7]*126;
>
> oblk[0] = acbf0 + i126;
> oblk[1] = adbe0 + i106;
> oblk[2] = adbe1 + i071;
> oblk[3] = acbf1 + i025;
> oblk[4] = acbf1 - i025;
> oblk[5] = adbe1 - i071;
> oblk[6] = adbe0 - i106;
> oblk[7] = acbf0 - i126;
> }
>
> It uses 22 muls and 32 adds,
> instead of 38 muls and 56 adds.
>
> Of course the same applies to
> PDJPG_TransIDCT_Vert.
>
>
> Carsten Neubauer
> http://www.c14sw.de/
>



Carsten Neubauer

2007-01-06, 6:56 pm

cr88192 schrieb:

> interesting idea, yes...
>
> in my case, I had considered factoring it more, but I had felt uncertain as
> to how much to factor it, in particular, I was uncertain of the relative
> cost between arithmetic operations and variable creation/assignment.
>


Creating eight more local integers on the stack
should come without extra clockcycles.
Factoring completely should also result in
less generated assembler-code.
Of course some optimization-settings in your
compiler might do the same, but I learned not
to rely on that.


>
> as for optimizing the scale functions, I had considered ideas, but much
> after the post, decided it is fast enough. as it was, I was having to use
> particularly large images in order to even really gain really measurable
> time costs (for the most part, my images tend to be a lot smaller).
>


Sure, sometimes the memory-throughput limits
the execution-speed of optimized code anyway.
Especially in image-processing, where the whole
data exceeds the cache-size, our fast CPUs
spend a lot of time waiting for memory-accesses.


Carsten Neubauer
http://www.c14sw.de/

cr88192

2007-01-06, 6:56 pm


"Carsten Neubauer" <cn@c14sw.de> wrote in message
news:1168027719.662443.32190@42g2000cwt.googlegroups.com...
> cr88192 schrieb:
>
>
> Creating eight more local integers on the stack
> should come without extra clockcycles.
> Factoring completely should also result in
> less generated assembler-code.
> Of course some optimization-settings in your
> compiler might do the same, but I learned not
> to rely on that.
>


yes, ok.


>
>
> Sure, sometimes the memory-throughput limits
> the execution-speed of optimized code anyway.
> Especially in image-processing, where the whole
> data exceeds the cache-size, our fast CPUs
> spend a lot of time waiting for memory-accesses.
>


yes.

another idea I had thought up was this:
rather than decode all the 8x8 blocks into an array, and then rearrange this
array into the image buffers, I instead decode the DCT blocks directly into
the planar buffers.

likewise, I don't upsample the UV planes either (4:2:0), but instead have a
special YUV->RGB conversion that works with 4 pixels at a time.

this would probably allow faster decoding by reducing the amount of stuff
needing to be shuffled around in memory...


a few more mysterious thoughts have been considered as well for making a
JPEG-like codec using MDCT instead of DCT. in the past, this didn't exactly
work right (in part, I think, because I tried to come up with a 2D version
of the MDCT...).

if I do the MDCT like I did in the DCT original post (combined with the idea
mentioned above), it may well work better than my past attempts...

or such...


>
> Carsten Neubauer
> http://www.c14sw.de/
>



Carsten Neubauer

2007-01-06, 6:56 pm

cr88192 schrieb:

>
> another idea I had thought up was this:
> rather than decode all the 8x8 blocks into an array, and then rearrange this
> array into the image buffers, I instead decode the DCT blocks directly into
> the planar buffers.
>
> likewise, I don't upsample the UV planes either (4:2:0), but instead have a
> special YUV->RGB conversion that works with 4 pixels at a time.
>
> this would probably allow faster decoding by reducing the amount of stuff
> needing to be shuffled around in memory...
>


Yes, this strategy should be the fastest possible,
as you wouldn't have to touch anything twice.


>
> a few more mysterious thoughts have been considered as well for making a
> JPEG-like codec using MDCT instead of DCT.


Interesting. Never tried MDCT, but from what I read, it should
result in less block-artefacts. I only suspect it to be slower
and to produce more coefficients, as one has to do a 16x16
transform for each 8x8 block.


> in the past, this didn't exactly
> work right (in part, I think, because I tried to come up with a 2D version
> of the MDCT...).
>
> if I do the MDCT like I did in the DCT original post (combined with the idea
> mentioned above), it may well work better than my past attempts...
>
> or such...
>


Sure, most second tries result in a better implementation.
So it sounds promising.


Greetings,

Carsten Neubauer
http://www.c14sw.de/

cr88192

2007-01-07, 3:56 am


"Carsten Neubauer" <cn@c14sw.de> wrote in message
news:1168096391.132510.227370@11g2000cwr.googlegroups.com...
> cr88192 schrieb:
>
>
> Yes, this strategy should be the fastest possible,
> as you wouldn't have to touch anything twice.
>


yes.

tuned the decoder more, so now much less time goes into this phase.

so, now decoding is a little faster.

most of the time now is going into the function that decodes huffman
symbols, followed by the YUV420 specific decode function, and the function
that writes tga files to disk.

after this is the function that huffman decodes the blocks, the dequantizer
(a set of 64 integer multiplies/block), followed by the IDCT functions.

after this is a few misc functions for working with the bitstream (NextByte,
SkipBits, ReadBits, PeakWord, ...).


>
>
> Interesting. Never tried MDCT, but from what I read, it should
> result in less block-artefacts. I only suspect it to be slower
> and to produce more coefficients, as one has to do a 16x16
> transform for each 8x8 block.
>


well, one processes 16x16 pixel blocks, but it only produces an 8x8
coefficient block.

it could be interesting though in terms of reducing artifacting (especially
for low quality images), but, of course:
it would be slower than JPEG;
it would be incompatible;
it would require adding an 8 pixel border on all sides of the image (and
thus increase the total number of blocks by a small factor).

then again, given the fact that the IDCT is now fairly fast, similar should
be applicable to the MDCT, making it probably not "that" much slower.


>
>
> Sure, most second tries result in a better implementation.
> So it sounds promising.
>


yes.

I may try implementing this and see how it goes...


>
> Greetings,
>
> Carsten Neubauer
> http://www.c14sw.de/
>



Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com