For Programmers: Free Programming Magazines  


Home > Archive > Compression > April 2006 > Need a bit of information about Compression









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Need a bit of information about Compression
Dr.Whacked

2006-04-04, 6:55 pm

Well i have done quite a good amount of research in compression methods
and stuff but currently i am in a fix so i was hopeing to get my self
out of it with your help.

What my question is that "What is the best possible method to compress
a file that only has 3 different Chr and one of them repeates alot?"

for example

File =
"///1/////1/2//////2///21/////1/2/12//1/12/1/21/2/212121///////////12/////////12/////1/1///1///1//1//11/2//"

a file with random letter place some what like this should be able to
compress alot right since there are only 3 different Chr so as per the
combination formula it should create a relatively less combinations
from a file that has 256 different Chr

If you know of any method of compression that will compress this type
of file extreamly then please reply

(This type of files are not data files that are seriously huge comes
around 20 mb average and my application can generate alot of these
which are important for the application to work)

cr88192

2006-04-05, 3:55 am


Dr.Whacked wrote:
> Well i have done quite a good amount of research in compression methods
> and stuff but currently i am in a fix so i was hopeing to get my self
> out of it with your help.
>
> What my question is that "What is the best possible method to compress
> a file that only has 3 different Chr and one of them repeates alot?"
>
> for example
>
> File =
> "///1/////1/2//////2///21/////1/2/12//1/12/1/21/2/212121///////////12/////////12/////1/1///1///1//1//11/2//"
>

<snip>

"best possible", likely some manner of arithmetic coding, preferably
one only having 3 symbols. will save a few bytes at the start due to
not needing to adjust for the absence of the other symbols, or for
leaving enough space to allow possibly coding the others.

next up, deflate, which may do better or worse than ac based on
implementation details of the arithmetic coder.

third up, and likely the simplest to implement, would be a customized
coding:
/ -> 0
1 -> 10
2 -> 11

or, if there are enough runs, this could sacrifice 1 or 2 slightly to
add rle.

/ -> 0
1 -> 10
2 -> 110
R -> 111

where R is a run:
R LLL, runs, eg, between 6..14.

or such...

Earl Colby Pottinger

2006-04-05, 6:55 pm

"cr88192" <cr88192@hotmail.com> :

> Dr.Whacked wrote:
[color=darkred]
[color=darkred]
[color=darkred]
>

"///1/////1/2//////2///21/////1/2/12//1/12/1/21/2/212121///////////12/////////12/////1/1///1///1//1//11/2//"

> <snip>


> "best possible", likely some manner of arithmetic coding, preferably
> one only having 3 symbols. will save a few bytes at the start due to
> not needing to adjust for the absence of the other symbols, or for
> leaving enough space to allow possibly coding the others.


> next up, deflate, which may do better or worse than ac based on
> implementation details of the arithmetic coder.


> third up, and likely the simplest to implement, would be a customized
> coding:
> / -> 0
> 1 -> 10
> 2 -> 11


> or, if there are enough runs, this could sacrifice 1 or 2 slightly to
> add rle.


> / -> 0
> 1 -> 10
> 2 -> 110
> R -> 111


> where R is a run: R LLL, runs, eg, between 6..14. or such...


Like your third idea most, simple, fast and easy to do. Maybe what is needed
to know is how much compression Dr. Whacked needs in real life.

Earl Colby Pottinger

--
<Cruising, building a Catamaran, Rebuilding Cabin, New Peroxide Still Design,
Writting SF, Programming FOSS - What happened to the time?>
Dr.Whacked

2006-04-05, 6:55 pm

Thx you cr88192 and Earl Colby Pottinger for your replies i am
currently trying out the 3rd method hope it works and to answer the
question "how much compression Dr. Whacked needs in real life.?" well
MAX, i need alot of it.

if possible can u refer me to some documentation of the 3rd methods
since replaceing

/ = 00
1 = 10
2 = 11

it will increase the size twise basically so if possible just direct me
to how it works since i have created a method of how it should work
since i gained good information of compression via the research.

and i would like to know the 4th method that uses a run variable.

thx alot for ur replies guys helped me alot to understand compression
and helped me go in the right direction.

Giordano Ferdinandi

2006-04-06, 7:55 am

> it will increase the size twise basically so if possible just direct me
> to how it works


I believe they're suggesting that you should use 1 bit for "/" and 2
bits for "1" or "2". It won't increase the size twice, since if you
have 3 symbols you normally need log(3) bits to encode each symbol
without any compression.
I think that the best way of doing this, if the frequencies of each
symbol are predictable and don't have significant variations from one
file to another, is using a simple Huffman encoding (without storing
any data relative to the tree in the compressed file). In fact, if you
used the above file to build a Huffman tree, you would get

/ = 0
1 = 10
2 = 11

which is what cr88192 and Earl Colby Pottinger are suggesting.
However, you will probably get much better results by combining symbols
together, that is trying to build a Huffman tree for the symbols
//, /1, /2, 11, 1/, 12, 2/, 21, 22
or even better results if you do it with symbols
///, //1, //2 .....
and so on.
Try this link http://en.wikipedia.org/wiki/Huffman_coding and follow
the links at the bottom of that page.
I hope this helps

cr88192

2006-04-06, 6:55 pm


"Earl Colby Pottinger" <earlcolby.pottinger@sympatico.ca> wrote in message
news:fXUYf.2114$sh3.139298@news20.bellglobal.com...
> "cr88192" <cr88192@hotmail.com> :
>


<snip>

>
>
>
>
>
> Like your third idea most, simple, fast and easy to do. Maybe what is
> needed
> to know is how much compression Dr. Whacked needs in real life.
>

yeah.

one thing I wonder sometimes is how common my "newer" method of working with
bitstreams is, eg:
I maintain a word, which contains the next 32 bits of the stream, and a
count giving the current bit;
in the case of reading, when there is at least 8 bits read, the stream word
is shifted 8 bits and the next byte is pulled in;
in the case of writing, the byte is written out and the word is shifted.

slight issue is it likes to read over, which for my inflater required a hack
to locate the exact end of the stream (this was needed mostly because some
things like to follow immediately after the compressed data, eg, the adler32
checksum, ...).

seems like it wouldn't be too unusual, then again, I don't know...


skimming libjpeg, they do something vaguely similar, albeit their word is
kept aligned with the next bit, rather than on byte boundaries (as mine is).
in my case, the logic is simpler at least, writing:
shift value and or with word;
add number of bits;
while(bits>=8)write byte and shift (for jpeg: if byte was 0xFF write 0x00),
bits-=8.

vs:
get last word;
mask buffer;
add bits;
do a funky align involving a shift and a subtraction;
or with old bits;
while(bits>=8)
{
write byte;
if byte was 0xFF write 0x00;
shift;
bits-=8;
}
put value and bits count back into state.

or such...


> Earl Colby Pottinger
>
> --
> <Cruising, building a Catamaran, Rebuilding Cabin, New Peroxide Still
> Design,
> Writting SF, Programming FOSS - What happened to the time?>



cr88192

2006-04-06, 6:55 pm


"Dr.Whacked" <Dr.Whacked@gmail.com> wrote in message
news:1144269418.807817.211040@v46g2000cwv.googlegroups.com...
> Thx you cr88192 and Earl Colby Pottinger for your replies i am
> currently trying out the 3rd method hope it works and to answer the
> question "how much compression Dr. Whacked needs in real life.?" well
> MAX, i need alot of it.
>

in any case, those files should compress pretty well.

> if possible can u refer me to some documentation of the 3rd methods
> since replaceing
>
> / = 00
> 1 = 10
> 2 = 11
>
> it will increase the size twise basically so if possible just direct me
> to how it works since i have created a method of how it should work
> since i gained good information of compression via the research.
>

bits yo...

alternatively, one can escape having to manage a bitstream by grabbing the
symbols 4 at a time, and coding them as 2 bit pairs into a single byte. this
will reduce the files to about 1/4 the original size, but not much else.

//12

would be coded in a byte, eg, as:
00 00 01 10
or 0x06

> and i would like to know the 4th method that uses a run variable.
>

basically, scan forwards so long as the next symbol is equal to the last
symbol coded;
if the number of matching symbols is greater than the minimum then a run can
be encoded (6 was chosen, assuming that for that case, 6 would be a "break
even" point...);
if the run is longer than the max, it is clamped to the max prior to
writing;
one can then skip over the symbols.

an alternative with that idea could have been to have a variable length code
for the length, eg:

111 00: run of 5
111 01: run of 6
111 100: 7
111 100: 8
111 1100: 9
111 1101: 10
111 11100: 11
111 11101: 12
111 11110: 13
111 11111: 14

> thx alot for ur replies guys helped me alot to understand compression
> and helped me go in the right direction.
>

ok.


Nishu

2006-04-07, 7:55 am

I've a query. Isnt it necessary to encode the EOF?

For eg.
the data is 11//////////2211 --> 16 bytes
corresponding huffman code (bits) --> 1010 0000 0000 0011 1110 10xx
== A0 03 E? --> 3 bytes only

unless we encode the EOF character (end of file) how can we practically
determine its end or decode it or store it?

If i consider EOF as also the case then it will go like this..
/ - 0
1 - 10
2 - 110
EOF - 111

Where after encoding the whole stream u need to fill the remaining bits
by 1 with the min of 3 '1' bits.

so now the text becomes
1010 0000 0000 0011 0110 1010 1111 1111
A0 03 6A FF --> 4 bytes

Regards,
Nishu


cr88192 wrote:
> "Dr.Whacked" <Dr.Whacked@gmail.com> wrote in message
> news:1144269418.807817.211040@v46g2000cwv.googlegroups.com...
> in any case, those files should compress pretty well.
>
> bits yo...
>
> alternatively, one can escape having to manage a bitstream by grabbing the
> symbols 4 at a time, and coding them as 2 bit pairs into a single byte. this
> will reduce the files to about 1/4 the original size, but not much else.
>


cr88192

2006-04-07, 7:55 am


"Nishu" <naresh.attri@gmail.com> wrote in message
news:1144401362.171087.203050@i40g2000cwc.googlegroups.com...
> I've a query. Isnt it necessary to encode the EOF?
>
> For eg.
> the data is 11//////////2211 --> 16 bytes
> corresponding huffman code (bits) --> 1010 0000 0000 0011 1110 10xx
> == A0 03 E? --> 3 bytes only
>
> unless we encode the EOF character (end of file) how can we practically
> determine its end or decode it or store it?
>

typical approaches:
either encode a full length, or a length mod some constant.

also common is to know from the data how long it is.
less common is to set up some special condition, where the eof is implied by
an otherwise impossible occurance (less likely in this case).

say, for example, we know that all runs of 0 greater than or equal to 5 will
be encoded as an rle run. if we then encode, say, 8 consecutive 0's, then we
know that the eof is reached (some fudging would be involved with this one,
a final 1 bit could be used, eg, to properly align the eof).

> If i consider EOF as also the case then it will go like this..
> / - 0
> 1 - 10
> 2 - 110
> EOF - 111
>
> Where after encoding the whole stream u need to fill the remaining bits
> by 1 with the min of 3 '1' bits.
>

the length would be cheaper, for the 3 symbol case, providing for this sort
of an eof marker is rather expensive (it is cheaper when the alphabet is
somewhat larger, as then it is more so "just another character").

> so now the text becomes
> 1010 0000 0000 0011 0110 1010 1111 1111
> A0 03 6A FF --> 4 bytes
>
> Regards,
> Nishu
>
>
> cr88192 wrote:
>



David A. Scott

2006-04-07, 7:55 am

"Nishu" <naresh.attri@gmail.com> wrote in
news:1144401362.171087.203050@i40g2000cwc.googlegroups.com:

> I've a query. Isnt it necessary to encode the EOF?
>
> For eg.
> the data is 11//////////2211 --> 16 bytes
> corresponding huffman code (bits) --> 1010 0000 0000 0011 1110 10xx
> == A0 03 E? --> 3 bytes only
>
> unless we encode the EOF character (end of file) how can we
> practically determine its end or decode it or store it?
>


Just do it bijectively. So no you don't need to reserve a speial
EOF if stored to a binary byte data file. You really don't need to
treat it any differently than any other data file.

If I was using my style of bijective coding for 3 simple
character I would have used
/ = 1
1 = 01
2 = 00

that way in example above would have coded to exactly:
01011111 11111100 00010100

You would have to look at code in arb2x,zip to see this.
In this simple case the last two zero do not decode to
a 2 there are some simple bijective rules to see exactly
how it ends.





David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"

Nishu

2006-04-07, 6:55 pm

> > unless we encode the EOF character (end of file) how can we practically
> typical approaches:
> either encode a full length, or a length mod some constant.


Yes, full Length coding, arithmetic coding which is certainly better
option.
But would u explain me how length mode CONST coding works?

>
> also common is to know from the data how long it is.
> less common is to set up some special condition, where the eof is implied by
> an otherwise impossible occurance (less likely in this case).
>

I didnt understand this. Isnt that "111" would be the special case..
that last case. and hence forth we know the eof and so will stop the
decoding.

> say, for example, we know that all runs of 0 greater than or equal to 5 will
> be encoded as an rle run. if we then encode, say, 8 consecutive 0's, then we
> know that the eof is reached (some fudging would be involved with this one,
> a final 1 bit could be used, eg, to properly align the eof).


Yes, after doing Huffman, RLE too can be envoled for even better
compression. And in that if we mention the impossible case as the eof,
then yes, problem can be solved. and we can ignore eof while doing
huffman. Its certainly better option as you pointed out.

But still, if we just opt for huffman encoding, i'm not able find out
how can we determine the eof ?

Regds,
Nishu

Nishu

2006-04-07, 6:55 pm



> Just do it bijectively. So no you don't need to reserve a speial
> EOF if stored to a binary byte data file. You really don't need to
> treat it any differently than any other data file.
>
> If I was using my style of bijective coding for 3 simple
> character I would have used
> / = 1
> 1 = 01
> 2 = 00
>
> that way in example above would have coded to exactly:
> 01011111 11111100 00010100


Well, obviously, its just the one's complement of the bitstream. But i
do appreciate its added advantage while transmitting the bitstream
(from digital error resilience point of view), hence its better.

>
> You would have to look at code in arb2x,zip to see this.
> In this simple case the last two zero do not decode to
> a 2 there are some simple bijective rules to see exactly
> how it ends.
>


Would you plz brief me about how we can do that since your research
paper is quite lengthy and it will take appreciable amount of time to
go thro' the whole.

Best Regards,
Nishu

widding@birger.com

2006-04-07, 6:55 pm

cr88192 wrote:
> Dr.Whacked wrote:
> <snip>
>
> third up, and likely the simplest to implement, would be a customized
> coding:
> / -> 0
> 1 -> 10
> 2 -> 11
>
> or, if there are enough runs, this could sacrifice 1 or 2 slightly to
> add rle.
>
> / -> 0
> 1 -> 10
> 2 -> 110
> R -> 111
>
> where R is a run:
> R LLL, runs, eg, between 6..14.
>
> or such...


Another possibility, only run length encode the "/", as the original
statement of the problem "one of them repeats alot [sic]". This will
allow the expansion problem to be eliminated. Possible encoding:

10 <- 1
11 <- 2
00 <- /

010 <- //
011 000 <- /// (three)
011 001 <- //// (four)
....
011 110 <- ////////// (ten)

011 111 000000 <- /////////// (eleven)
....
011 111 111110 <- /////.... (seventy three)
011 111 111111 <- end of file (or indicator of next code length)

The last code grouping is an exageration to illustrate a point. If for
some reason your data set has a very high probability of seeing runs of
eleven the jump in code size should be smaller. On the other hand, if
your data sets only see very short, and very long runs it might make
sense to increase the code size for eleven to allow a greater max
count, or add another code length to the tree.

If one knows the character of his data (i.e. by having intimate
knowledge of what it represents, or by statistical analysis) the tree
structure can be set up in advance of seeing specific data. This will
not result in an optimal solution for a given stream, but the overall
complexity of the algorithm can often be reduced, and nearly optimal
coding can result.

We have done a number custom compression algorithms tailored to specifc
data sets, and when implementing in hardware, eliminating special cases
can greatly simplify the implementation and greatly speed up the
result. For example, to output a token per clock cycle in the
decompressor, the above coding never requires more than six bits of the
input stream to be seen in parallel, and only requires the ability to
shift by two and three positions, so a barrel shifter is replaced by
only a two tap shifter which is simply a two input mux. Eliminating
the special case of runs less than six simplifies the encoder.



Regards,
Erik.

---
Erik Widding
President
Birger Engineering, Inc.

(mail) 100 Boylston St #1070; Boston, MA 02116
(voice) 617.695.9233
(fax) 617.695.9234
(web) http://www.birger.com

David A. Scott

2006-04-07, 9:56 pm

"Nishu" <naresh.attri@gmail.com> wrote in
news:1144417370.346593.295240@g10g2000cwb.googlegroups.com:

>
> Would you plz brief me about how we can do that since your research
> paper is quite lengthy and it will take appreciable amount of time to
> go thro' the whole.
>
>


The point is you can compress this file to another file. One can
always make full use of the data. You let the operating system
handle the true EOF as it does with any other file. Many assume
this is not possible because of the fact simple in the box
thinking prevents many people from seeing the bijective ways to
do it.


The simple way for the 3 tokens 1 01 00 is always write them out
to a pretend infinite string. where you add either a tail of 1000..
or 0000... after the last token. You normally add the all zero
sting you only add the 1000.. string if the last token is either
the 00 token or the 00 token followed by one or more 1 tokens.
as soon as a 01 token seen you got back to the 000... for the
pretend endings.

From this pretend infinite we get ready to write the true
byte data file. But as we do this we keep track of three types
of bytes. The A all zero byte 00000000 and B leading one
follwed by seven zeroes 10000000 and C everything else.
Now group the pretend infinte string made above into 8 bit
bytes. You can quickly shorten this pretend file by dropping
the infinte end of all zeros that long tail of all zeros is
not needed. In fact at this point we have finite byte file
but we don't write it out yet. Notice that since the nature
of how the file built we also don't have an empty file.
If the file your left with does not end in type B 10000000
you can write it out your done. If it ends in a type B look
at preceeding bytes until you there are no more bytes or until
either a type C occurs or a type A. If the first non B type is
not an A you can also write the file out your done. Only if
its a type A in which case you drop the last B byte and then
write the file out.

The reverse from byte to the original 3 token should be
childs play.



David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"

Phil Carmody

2006-04-08, 3:55 am

"widding@birger.com" <widding@birger.com> writes:[color=darkred]
> cr88192 wrote:

There's always Tunstall codes that reels in some of the slack from huffman.

However, an adaptive arithmetic or range coder should be your best option.

Curiously, Tunstall codes can be considered as very similar to the two
run length encoding schemes PP & GPP suggest. However, they have the
additional flexibility of taking advantage of higher order statistics
than simple RLE. (e.g. if long runs are more likely to be followed by
a 1.)

Phil
--
What is it: is man only a blunder of God, or God only a blunder of man?
-- Friedrich Nietzsche (1844-1900), The Twilight of the Gods
Jasen Betts

2006-04-08, 3:55 am

On 2006-04-07, Nishu <naresh.attri@gmail.com> wrote:
> I've a query. Isnt it necessary to encode the EOF?
>
> For eg.
> the data is 11//////////2211 --> 16 bytes
> corresponding huffman code (bits) --> 1010 0000 0000 0011 1110 10xx
> == A0 03 E? --> 3 bytes only
>
> unless we encode the EOF character (end of file) how can we practically
> determine its end or decode it or store it?


use the last three bits of the file to indicate how many padding bits there
are.



--

Bye.
Jasen
cr88192

2006-04-08, 7:55 am


"Nishu" <naresh.attri@gmail.com> wrote in message
news:1144416807.604959.239040@i39g2000cwa.googlegroups.com...
>
> Yes, full Length coding, arithmetic coding which is certainly better
> option.

this approach is often used by arithmetic coders as well. when the length is
reached, we know the data is done.

> But would u explain me how length mode CONST coding works?
>

mod const usually involves some manner of guess, eg, one decodes until the
end of the compressed data is reached. one knows that some amount of the
trailing data is garbage. we then know that the true end of the data (mod
const) is less than the current end (mod the same const) (this may not hold
per se, if it doesn't it means that the decoded data wrapped around), one
can then back up the requisite amount, and have the decoded data (at the
correct length).

> I didnt understand this. Isnt that "111" would be the special case..
> that last case. and hence forth we know the eof and so will stop the
> decoding.
>

yeah, but with only 3 symbols, having a code dedicated to eof is very
expensive, better to look for a cheaper special case (an encoder
impossibility or somesuch...).

>
> Yes, after doing Huffman, RLE too can be envoled for even better
> compression. And in that if we mention the impossible case as the eof,
> then yes, problem can be solved. and we can ignore eof while doing
> huffman. Its certainly better option as you pointed out.
>
> But still, if we just opt for huffman encoding, i'm not able find out
> how can we determine the eof ?
>

this depends...

the eof symbol approach works, if the codespace is large enough to justify
it. with, eg, >50 symbols, the eof symbol doesn't really matter, with 3 it
does.

> Regds,
> Nishu
>



Dr.Whacked

2006-04-10, 3:55 am

Thx you all for your information and methods & guideing me in the right
direction. I am currently trying to code the method (in VB6) hopefully
it works. and if possible if you people have any example of this method
(replaceing / = 0, 1 = 10, 2 = 11) in vb6 then please reply.

Thx once again for ur efforts
I will post what happened after coding the method.

Nishu

2006-04-10, 7:59 am



Jasen Betts wrote:

> On 2006-04-07, Nishu <naresh.attri@gmail.com> wrote:
>
> use the last three bits of the file to indicate how many padding bits there
> are.


Yes, the idea is pretty ok but just consider the above example -->

1010 0000 0000 0011 1110 10xx

Only last two bits are xx... and the third last bit is actually a part
of code which can be anything. Ur point of reserving the last three
bits to denote padding bits is vague here.

Regds,
Nishu

>
>
>
> --
>
> Bye.
> Jasen


Nishu

2006-04-10, 7:59 am


cr88192 wrote:

> "Nishu" <naresh.attri@gmail.com> wrote in message
> news:1144416807.604959.239040@i39g2000cwa.googlegroups.com...
> this approach is often used by arithmetic coders as well. when the length is
> reached, we know the data is done.
>


The question is"when the length is reached !!"
In arithmetic coders, we can take care of whole string at a time and
can convert it into a single number which lies between a certain range.
hence there cant be any problem about whether all of bits in last byte
are filled or not, but this doesnt hold for huffman coding.


> mod const usually involves some manner of guess, eg, one decodes until the
> end of the compressed data is reached. one knows that some amount of the
> trailing data is garbage. we then know that the true end of the data (mod
> const) is less than the current end (mod the same const) (this may not hold
> per se, if it doesn't it means that the decoded data wrapped around), one
> can then back up the requisite amount, and have the decoded data (at the
> correct length).
>


One need to know which data is garbage and which is not !! Problem is
to find out the proper end of the bit stream...

and this can be done if we know how many '/', '1' and '0' are there.
but not otherwise. It seems as if we need extra counter to find out
them and then we know how many total no. of actual coded bits are
present.

> yeah, but with only 3 symbols, having a code dedicated to eof is very
> expensive, better to look for a cheaper special case (an encoder
> impossibility or somesuch...).
>


yes, coded EOF is certainly expensive, but, then theres not a good
alternative suggested here right now. I'm finding myself helpless in
following Mr. Davids point of view. If u understand that, plz explain
me.

and what special character or impossibilty??? we are dealing in bits!!!
so we need bits to assign special case and in case u dont encode eof, u
simply cant put it as it is in the bit stream. right? we dont have any
exception to point out eof.

Just that we can make a counter to count all the char and then encode
them.
so say, then we know, '/' == ten, '1' == 4, '2' == 2. then we have 1 x
10 + 2 x4 + 2 x2 = 22 bits. and hence we found the no. of coded bits.
Its a one alternative i can think of.

> this depends...
>
> the eof symbol approach works, if the codespace is large enough to justify
> it. with, eg, >50 symbols, the eof symbol doesn't really matter, with 3 it
> does.


Yes, eof is a good choice in case of large set of char and when u dont
know the length of the stream. And in this case, we dont know the
length of the stream.

Plz help me, if i'm not able to follow you.

Regards,
Nishu

David A. Scott

2006-04-10, 7:59 am

"Nishu" <naresh.attri@gmail.com> wrote in
news:1144667277.765377.196120@z34g2000cwc.googlegroups.com:

>
> The question is"when the length is reached !!"
> In arithmetic coders, we can take care of whole string at a time and
> can convert it into a single number which lies between a certain range.
> hence there cant be any problem about whether all of bits in last byte
> are filled or not, but this doesnt hold for huffman coding.
>
>


First of all huffman coding is a subset of arithmetic.
If you think of arithmetic as a single number you could also
think of Huffman as a single number.

Next point how do you determine the EOF for the file of
"/" "0" and "1" its the same way with airthmetic or huffman
if done in a bijective way so as to not waste file space.


David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"

Nishu

2006-04-10, 7:59 am


David A. Scott wrote:
> "Nishu" <naresh.attri@gmail.com> wrote in
> news:1144667277.765377.196120@z34g2000cwc.googlegroups.com:
>
[color=darkred]
>
> First of all huffman coding is a subset of arithmetic.
> If you think of arithmetic as a single number you could also
> think of Huffman as a single number.
>


Yes sir, huffman is subset of arithmetic.
And I'm thinking of arithmetic encoding as a number which is a integral
multile of byte. i.e. we can encode a set of chars to an arithmetic
number which would range between (x, y] where there will be certainly a
number which can be put as integral multile of byte without affecting
the decoding. In case of hufman we cant be sure.

> Next point how do you determine the EOF for the file of
> "/" "0" and "1" its the same way with airthmetic or huffman
> if done in a bijective way so as to not waste file space.
>


Plz explain me from the basics of bijective. I would be thankful for ur
efforts.

I'm considering bijective to be in accordance with this.

http://en.wikipedia.org/wiki/Bijection

Regds,
Nishu


>
> David A. Scott
> --
> My Crypto code
> http://bijective.dogma.net/crypto/scott19u.zip
> http://www.jim.com/jamesd/Kong/scott19u.zip old version
> My Compression code http://bijective.dogma.net/
> **TO EMAIL ME drop the roman "five" **
> Disclaimer:I am in no way responsible for any of the statements
> made in the above text. For all I know I might be drugged.
> As a famous person once said "any cryptograhic
> system is only as strong as its weakest link"


David A. Scott

2006-04-10, 7:59 am

"Nishu" <naresh.attri@gmail.com> wrote in
news:1144674748.901322.227480@i40g2000cwc.googlegroups.com:

>
> Yes sir, huffman is subset of arithmetic.
> And I'm thinking of arithmetic encoding as a number which is a integral
> multile of byte. i.e. we can encode a set of chars to an arithmetic
> number which would range between (x, y] where there will be certainly a
> number which can be put as integral multile of byte without affecting
> the decoding. In case of hufman we cant be sure.
>
>
> Plz explain me from the basics of bijective. I would be thankful for ur
> efforts.
>


I think I have explained it. Look at my site there are samples
and pointers to two other people that have same view.

> I'm considering bijective to be in accordance with this.
>
> http://en.wikipedia.org/wiki/Bijection
>


I feel its the same. For the example you have here take the
set of all files that only have the characters "/","0","1" you
can map it one-to-one to the set of all bianry bytes files such
that all files in both set have a one-to-one relation to the other.

I showed in a previous post that is less than a w old how
to do it for huffman case above its not rocket sicence.




> Regds,
> Nishu
>
>




David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"

cr88192

2006-04-10, 6:55 pm


"Nishu" <naresh.attri@gmail.com> wrote in message
news:1144667277.765377.196120@z34g2000cwc.googlegroups.com...
>
> cr88192 wrote:
>
>
> The question is"when the length is reached !!"
> In arithmetic coders, we can take care of whole string at a time and
> can convert it into a single number which lies between a certain range.
> hence there cant be any problem about whether all of bits in last byte
> are filled or not, but this doesnt hold for huffman coding.
>

the last byte doesn't matter.

you keep track of decoded length.

if you are decoding, say, 1024 symbols, then you stop as soon as you decode
those symbols, no problem, whatever is left is ignored.

>
>
> One need to know which data is garbage and which is not !!


because you know where the end "should" be.

Problem is
> to find out the proper end of the bit stream...
>

no, it is not.
you don't actually need to know this in this case.


> and this can be done if we know how many '/', '1' and '0' are there.
> but not otherwise. It seems as if we need extra counter to find out
> them and then we know how many total no. of actual coded bits are
> present.
>

storing coded bits is one way (possible with 3 bits, if one knows the
location of the last byte).

likewise, one can store how large it should be when decoded, and maybe use
the mod approach for this.

>
> yes, coded EOF is certainly expensive, but, then theres not a good
> alternative suggested here right now. I'm finding myself helpless in
> following Mr. Davids point of view. If u understand that, plz explain
> me.
>

well, in my case I don't personally care all that much about bijective
coding...

easy enough to do things the good ol-fashioned way, and unless major
additional compression or speed gains pop up, I am not all that interested
personally.

> and what special character or impossibilty??? we are dealing in bits!!!
> so we need bits to assign special case and in case u dont encode eof, u
> simply cant put it as it is in the bit stream. right? we dont have any
> exception to point out eof.
>


note, we transform some input to some output. we define impossibility in
terms of the rules of the transform. if the data located can't possibly be
generated by the encoder (say, a run of the same symbol that is too long and
not handled by RLE), it could be used as an eof marker, with no additional
costs.

a simple case would be, eg, defining an RLE scheme where any run >=2 of the
last symbol is replaced with a run, so 3 occurances of the same symbol in a
row is impossible.

AAAABBBCAABBBB
A#3B#2CAAB#3AAA

in this case, the rules don't allow the AAA on the end, so decoding the AAA
would flag that an eof has been reached (likewise, to save space one could
do this with the highest probability symbol, rather than needing an extra,
low probability symbol).

> Just that we can make a counter to count all the char and then encode
> them.
> so say, then we know, '/' == ten, '1' == 4, '2' == 2. then we have 1 x
> 10 + 2 x4 + 2 x2 = 22 bits. and hence we found the no. of coded bits.
> Its a one alternative i can think of.
>

no need to store the bit counts. this wont really work with AC anyways...
most of my approaches would work with both AC and huffman coding.

>
> Yes, eof is a good choice in case of large set of char and when u dont
> know the length of the stream. And in this case, we dont know the
> length of the stream.
>

so, we can encode it. that is what I was getting at.
if one, up front, is like: this stream is 31690 symbols, ok, we can decode
it.

likewise, if we state that the length mod 256 is 202, it can still be
decoded, as once a known end for the coded stream is reached, we can use the
mod to find the end of the decoded stream.

> Plz help me, if i'm not able to follow you.
>
> Regards,
> Nishu
>



cr88192

2006-04-10, 6:55 pm


"Nishu" <naresh.attri@gmail.com> wrote in message
news:1144665278.213680.90920@i39g2000cwa.googlegroups.com...
>
>
> Jasen Betts wrote:
>
>
> Yes, the idea is pretty ok but just consider the above example -->
>
> 1010 0000 0000 0011 1110 10xx
>
> Only last two bits are xx... and the third last bit is actually a part
> of code which can be anything. Ur point of reserving the last three
> bits to denote padding bits is vague here.
>

1010 0000 0000 0011 1110 10xx xxxx x111

of course, this approach does require we have a way to detect the location
of the last 2 bytes (may be problematic if the coded data is just part of a
larger file and does not have a "chunk size" or similar...).

> Regds,
> Nishu
>
>



Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com