Re: compressing a text file
Jasen Betts

2006-04-21, 7:55 am

On 2006-04-20, junky_fellow@yahoo.co.in <junky_fellow@yahoo.co.in> wrote:
> Hi guys,
>
> I am new to the field of data compression. I want to write an
> algorithm to compress a text file. One way I thought of is replacing
> frequently occurring words with a shorter symbol. Say, for example, if
> "the" is repeated in the text file 1000 times, I would replace "the"
> with a new symbol "@" at all 1000 places.
> But there is a possibility that the new symbol "@" is already present
> at some places in the text file, so I may mistake it for "the". Can
> anyone suggest how to solve this problem?


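when decoding, replace @ with the

before encoding, replace ~ with ~~ and then replace @ with ~@ (in that
order, so the ~ added in front of a literal @ is not doubled). a literal
@ or ~ is then always preceded by an escaping ~, and a bare @ can only
mean "the".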
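
A minimal Python sketch of this escaping scheme, assuming "~" as the escape character and "@" as the stand-in for "the", as in the post. The names ESC, SYM, WORD, encode and decode are made up for illustration. Encoding can be done with plain string replacement as long as "~" is escaped first; decoding is written as a left-to-right scan, because chained replacements cannot tell an escaped "@" from a bare one.

```python
ESC, SYM, WORD = "~", "@", "the"   # illustrative choices, taken from the post

def encode(text):
    # Escape the escape character first, so the "~" added in front of
    # each literal "@" is not doubled afterwards.
    text = text.replace(ESC, ESC + ESC)
    text = text.replace(SYM, ESC + SYM)
    # From here on, any "@" without a "~" in front stands for the word.
    return text.replace(WORD, SYM)

def decode(data):
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == ESC:
            # The character after an escape is always a literal.
            if i + 1 >= len(data):
                raise ValueError("dangling escape character")
            out.append(data[i + 1])
            i += 2
        elif ch == SYM:
            out.append(WORD)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, encode("the cat @ ~the") gives "@ cat ~@ ~~@", and decode turns that back into the original string. Note that this sketch replaces "the" wherever it occurs, including inside longer words such as "other"; that still round-trips correctly.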

--

Bye.
Jasen
cr88192

2006-04-21, 6:56 pm


"Jasen Betts" <jasen@free.net.nz> wrote in message
news:60f5.4448a72e.7ddcc@clunker.homenet...
> On 2006-04-20, junky_fellow@yahoo.co.in <junky_fellow@yahoo.co.in> wrote:
>
> replace @ with the
>
> replace @ with ~@ and replace ~ with ~~
>

still, as it stands, imo, specific word replacement (in this form) is
largely useless in general.

much more effective (assuming a lot of text files are involved and a shared
dictionary is ok) would be a kind of external dictionary approach (ime,
often termed "vector quantization").

assuming the dictionary is good and the files are consistent, this would
likely give the largest possible payoff...
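
One readily available way to try a shared external dictionary is the preset-dictionary feature of DEFLATE, exposed in Python through the zdict argument of zlib.compressobj and zlib.decompressobj. This is a swapped-in technique for illustration, not necessarily the scheme meant above. SHARED_DICT and the two helper names are hypothetical, and a real dictionary would be built from text typical of the files being compressed.

```python
import zlib

# Hypothetical shared dictionary: both sides must hold exactly the same
# bytes. It should contain strings that are common in the expected input.
SHARED_DICT = b"the quick brown fox jumps over the lazy dog and then the dog sleeps"

def compress_with_dict(data, zdict=SHARED_DICT):
    # The dictionary primes the compressor's window, so matches against
    # it can be emitted from the very first byte of the input.
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()

def decompress_with_dict(blob, zdict=SHARED_DICT):
    # The same dictionary has to be supplied again for decompression.
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()

message = b"the lazy dog jumps over the quick brown fox"
packed = compress_with_dict(message)
plain = zlib.compress(message, 9)      # same data, no shared dictionary
restored = decompress_with_dict(packed)
```

On a short message that overlaps heavily with the dictionary, as here, the dictionary-primed output should be noticeably smaller than the plain zlib output. The gain shrinks as files get longer or drift away from the dictionary's contents.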

personally, I would much prefer this over trying to match and replace
certain words...


then again, specific word replacement may be simplest (conceptually). in
this case, I would at least recommend using the upper 128 byte values
(128-255) for stored words, as then there is little or no clash with
printable characters (none at all if the text is plain 7-bit ASCII).
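
A rough Python sketch of the upper-128 idea, assuming the input is plain 7-bit ASCII so that byte values 128-255 are free to act as word codes. The function names and the word-picking heuristic (most frequent words of two or more letters) are assumptions made for this example. The word table has to be stored or shared separately, and its size is not counted here.

```python
import re
from collections import Counter

def build_table(text, size=128):
    # Hypothetical heuristic: the most frequent words of 2+ letters.
    counts = Counter(re.findall(r"[A-Za-z]{2,}", text))
    return [word for word, _ in counts.most_common(size)]

def encode(text, table):
    assert len(table) <= 128, "only byte values 128..255 are available"
    raw = text.encode("ascii")   # raises on non-ASCII input
    codes = {w.encode("ascii"): bytes([128 + i]) for i, w in enumerate(table)}
    # Replace whole words only; words not in the table pass through.
    return re.sub(rb"[A-Za-z]+", lambda m: codes.get(m.group(), m.group()), raw)

def decode(data, table):
    out = bytearray()
    for b in data:
        if b >= 128:
            out += table[b - 128].encode("ascii")   # word code
        else:
            out.append(b)                           # ordinary ASCII byte
    return out.decode("ascii")

text = "the cat and the dog and the bird"
table = build_table(text)
packed = encode(text, table)
```

Because the codes lie outside the ASCII range, no escaping is needed, and characters such as "@" in the input pass through untouched.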

> --
>
> Bye.
> Jasen


