Author help on reducing file size while having a readable format
pedro.ballester@gmail.com

2006-11-30, 6:55 pm

Hi everyone,

One of my C programs produces a huge ASCII file containing floats,
integers and characters with a given structure. The file is 6.5GB,
which can be reduced to 569MB once zipped. This led me to think
that there might be a way to achieve a significant size reduction while
still having a file that can be read from a C program. So, do you know any
way to write the file so as to get a similar size reduction while still
being able to read its contents?

I have already tried to open a binary file for writing, but there is
not much change. Of course, I could uncompress the file before using it
and compress it again afterwards, however this solution is undesirable
because of hard drive limitations as well as the associated
computational expense.

Thanks in advance,

Pedro

John Reiser

2006-11-30, 6:55 pm

> One of my C programs produces a huge ASCII file containing floats,
> integers and characters with a given structure. The file is 6.5GB,
> which can be reduced to 569MB once zipped. This led me to think
> that there might be a way to achieve a significant size reduction while
> still having a file that can be read from a C program.


#include <zlib.h>

then use gzopen(), gzread(), gztell(), gzseek(), gzeof(), gzerror(), etc.
Or, if the file is always read from stdin, then use zcat in a shell
pipeline, with no changes required to read()/fread()/etc in the program.


--
cr88192

2006-12-01, 6:55 pm


<pedro.ballester@gmail.com> wrote in message
news:1164914498.786962.84060@h54g2000cwb.googlegroups.com...
> Hi everyone,
>
> One of my C programs produces a huge ASCII file containing floats,
> integers and characters with a given structure. The file is 6.5GB,
> which can be reduced to 569MB once zipped. This led me to think
> that there might be a way to achieve a significant size reduction while
> still having a file that can be read from a C program. So, do you know any
> way to write the file so as to get a similar size reduction while still
> being able to read its contents?
>


well, here are a few possibilities:
generate binary data, and not ascii data;
if applicable, use vector or tree elimination/merging techniques (eg: if it
is a tree structure, and if some branches may contain similar or equivalent
data, then consider ways in which they can be merged, and if the numbers are
in the form of vectors, and if there is much possibility of similar or
identical vectors, consider a scheme by which they can be reused);
....


vector and tree elimination are applicable to text formats, and there are
different ways in which to approach them.

one simple approach is to keep a list referencing any recently encoded tree
fragments (say the past 1024 or whatever, added in the 'unwind' part of the
process, ie, after encoding said branch), and then when coding a new part of
the tree, one looks to see if a match has been encoded recently (is in the
list), and if so, encodes it as an index.

vector elimination can be implemented similarly (albeit with some small
epsilon).

in an ascii file, especially with large/precise numbers, and a similar
tendency of numbers to collide, this may be applicable to individual numbers
as well.

....


for example, the array:
1.013478 2.324354 3.66987654 2.324354 1.013478

can be coded:
1.013478 2.324354 3.66987654 N1 N2

and some vectors:
1 0 0
0 1 0
0 0 1
1 0 0
0 1 0
0 0 1

can be coded:
1 0 0
0 1 0
0 0 1
V2
V1
V0
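
A rough sketch of the number case in C, under some assumptions that are not in the post: tokens are whitespace-separated, each is shorter than TOKEN_MAX characters, no literal starts with 'N', and a back-reference "Nk" means "the literal added k entries before the most recent one". Repeated values are not re-added to the list, which is what makes the array example above come out as N1 N2. All names here are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RECENT_MAX 1024  /* window size ("the past 1024 or whatever") */
#define TOKEN_MAX  32    /* assumed upper bound on one token's length */

/* Ring buffer of recently seen literals; head is the newest slot. */
struct recent {
    char tok[RECENT_MAX][TOKEN_MAX];
    int head;
    int count;
};

/* Return how many entries back tok was added (0 = newest), or -1. */
static int recent_find(const struct recent *r, const char *tok)
{
    for (int back = 0; back < r->count; back++) {
        int slot = (r->head - back + RECENT_MAX) % RECENT_MAX;
        if (strcmp(r->tok[slot], tok) == 0)
            return back;
    }
    return -1;
}

/* Add a literal, overwriting the oldest entry once the window is full. */
static void recent_add(struct recent *r, const char *tok)
{
    r->head = (r->head + 1) % RECENT_MAX;
    strcpy(r->tok[r->head], tok);   /* caller guarantees it fits */
    if (r->count < RECENT_MAX)
        r->count++;
}

/* Shared loop for both directions; returns 0 on success, -1 on error. */
static int transcode(const char *in, char *out, size_t cap, int decoding)
{
    static struct recent r;         /* static: 32KB is large for a stack */
    size_t used = 0;
    if (cap == 0)
        return -1;
    memset(&r, 0, sizeof r);
    out[0] = '\0';
    for (const char *p = in;;) {
        p += strspn(p, " \t\r\n");
        size_t len = strcspn(p, " \t\r\n");
        if (len == 0)
            break;                  /* end of input */
        if (len >= TOKEN_MAX)
            return -1;              /* violates the length assumption */
        char tok[TOKEN_MAX], ref[TOKEN_MAX];
        memcpy(tok, p, len);
        tok[len] = '\0';
        p += len;
        const char *text = tok;
        if (decoding && tok[0] == 'N') {
            /* Back-reference: look the literal up in the window. */
            int back = atoi(tok + 1);
            if (back < 0 || back >= r.count)
                return -1;
            text = r.tok[(r.head - back + RECENT_MAX) % RECENT_MAX];
        } else {
            int back = decoding ? -1 : recent_find(&r, tok);
            if (back >= 0) {
                snprintf(ref, sizeof ref, "N%d", back);
                text = ref;         /* seen recently: emit an index */
            } else {
                recent_add(&r, tok); /* new literal: remember it */
            }
        }
        int n = snprintf(out + used, cap - used, "%s%s", used ? " " : "", text);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;              /* output buffer too small */
        used += (size_t)n;
    }
    return 0;
}

int encode_numbers(const char *in, char *out, size_t cap)
{
    return transcode(in, out, cap, 0);
}

int decode_numbers(const char *in, char *out, size_t cap)
{
    return transcode(in, out, cap, 1);
}
```

The vector case would work the same way with a whole line as the token, and the "small epsilon" variant would replace strcmp() with a numeric comparison, at the cost of making the encoding lossy.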


> I have already tried to open a binary file for writing, but there is
> not much change. Of course, I could uncompress the file before using it
> and compress it again afterwards, however this solution is undesirable
> because of hard drive limitations as well as the associated
> computational expense.
>


note:
depending on the OS, opening a file in ascii or binary mode will make little
or no difference.

on linux, there won't be any difference.
on windows, it is a difference of whether linebreaks are represented as
LF("\n") or CR LF ("\r\n").

the actual data, however, will still be ascii text.
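
To make the distinction concrete, here is a small sketch (function names are hypothetical): both functions open the file in "wb" mode, but only the second writes a binary representation, because it uses fwrite() on the in-memory values instead of formatting them with fprintf(). Note the assumption that the raw file is read back on a machine with the same byte order and floating-point format; raw dumps like this are not portable.

```c
#include <stdio.h>

/* Write n doubles as decimal text, one per line.
   Returns the file size in bytes, or -1 on error. */
long write_text(const char *path, const double *v, size_t n)
{
    FILE *f = fopen(path, "wb");  /* "b" only affects newline translation */
    if (f == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        fprintf(f, "%.17g\n", v[i]);  /* 17 digits round-trips a double */
    long size = ftell(f);
    fclose(f);
    return size;
}

/* Write n doubles as raw bytes, sizeof(double) bytes each.
   Returns the file size in bytes, or -1 on error. */
long write_raw(const char *path, const double *v, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(v, sizeof v[0], n, f);
    long size = ftell(f);
    fclose(f);
    return written == n ? size : -1;
}
```

The saving from raw output is fixed (a double always takes sizeof(double) bytes), so how much it helps depends on how many digits the text version prints per value.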


note wrt compression:
compressing/decompressing the data on io may well actually make it faster,
because the speed at which data can be compressed or decompressed (esp wrt
deflate) is potentially faster than the speed at which it can be read
from/written to disk.


> Thanks in advance,
>
> Pedro
>



pedro.ballester@gmail.com

2006-12-02, 7:55 am


John Reiser wrote:
>
> #include <zlib.h>
>
> then use gzopen(), gzread(), gztell(), gzseek(), gzeof(), gzerror(), etc.
> Or, if the file is always read from stdin, then use zcat in a shell
> pipeline, with no changes required to read()/fread()/etc in the program.
>
>
> --


Thank you. This looks very much like what I was looking for. I don't
have this library in any of my systems, I imagine it is non-standard.

Am I right thinking that just by placing the header file in my work
directory I will have access to all these compression facilities?

pedro.ballester@gmail.com

2006-12-02, 7:55 am


> well, here are a few possibilities:
> generate binary data, and not ascii data;
> if applicable, use vector or tree elimination/merging techniques (eg: if it
> is a tree structure, and if some branches may contain similar or equivalent
> data, then consider ways in which they can be merged, and if the numbers are
> in the form of vectors, and if there is much possibility of similar or
> identical vectors, consider a scheme by which they can be reused);
> ...
>
>
> vector and tree elimination are applicable to text formats, and there are
> different ways in which to approach them.
>
> one simple approach is to keep a list referencing any recently encoded tree
> fragments (say the past 1024 or whatever, added in the 'unwind' part of the
> process, ie, after encoding said branch), and then when coding a new part of
> the tree, one looks to see if a match has been encoded recently (is in the
> list), and if so, encodes it as an index.
>
> vector elimination can be implemented similarly (albeit with some small
> epsilon).
>
> in an ascii file, especially with large/precise numbers, and a similar
> tendency of numbers to collide, this may be applicable to individual numbers
> as well.
>
> ...
>
>
> for example, the array:
> 1.013478 2.324354 3.66987654 2.324354 1.013478
>
> can be coded:
> 1.013478 2.324354 3.66987654 N1 N2
>
> and some vectors:
> 1 0 0
> 0 1 0
> 0 0 1
> 1 0 0
> 0 1 0
> 0 0 1
>
> can be coded:
> 1 0 0
> 0 1 0
> 0 0 1
> V2
> V1
> V0
>
>
>
> note:
> depending on the OS, opening a file in ascii or binary mode will make little
> or no difference.
>
> on linux, there won't be any difference.
> on windows, it is a difference of whether linebreaks are represented as
> LF("\n") or CR LF ("\r\n").
>
> the actual data, however, will still be ascii text.
>
>
> note wrt compression:
> compressing/decompressing the data on io may well actually make it faster,
> because the speed at which data can be compressed or decompressed (esp wrt
> deflate) is potentially faster than the speed at which it can be read
> from/written to disk.
>


Thanks for this explanation. I think I will try zlib, as there
are not many obvious repetitions in the text.

cr88192

2006-12-02, 7:55 am


<pedro.ballester@gmail.com> wrote in message
news:1165061717.356700.39330@j72g2000cwa.googlegroups.com...
>
> John Reiser wrote:
>
> Thank you. This looks very much like what I was looking for. I don't
> have this library in any of my systems, I imagine it is non-standard.
>
> Am I right thinking that just by placing the header file in my work
> directory I will have access to all these compression facilities?
>


dude...


well, zlib is not part of the 'standard libraries', but is nonetheless
very common (primarily in open source land though, eg, with linux, cygwin,
....).

one must first know how to link with libraries though, and whether zlib is
present, which depends some on what compiler toolchain is in use (not
mentioned here), and how familiar said user is with said toolchain, and
with programming in general...

and if not present, find zlib, download, build, and install...




cr88192

2006-12-02, 7:55 am


<pedro.ballester@gmail.com> wrote in message
news:1165061957.780894.15150@80g2000cwy.googlegroups.com...
>


<snip>

>
> Thanks for this explanation. I think I will try with zlib.h, as there
> are not much obvious repetitions in the text.
>


ok.

just note, as noted elsewhere.


as for the tree-elimination approaches, these work both for data
serialization (I have used similar approaches in some of my file-formats),
and also for some compiler optimizations (though less directly
applicable to stack machines, and thus would require adding registers and/or
analyzing subsequent code...).

I guess how applicable they are depends on what one is encoding.



Mark Adler

2006-12-02, 6:55 pm

pedro.ballester@gmail.com wrote:
> Thank you. This looks very much like what I was looking for. I don't
> have this library in any of my systems, I imagine it is non-standard.


It is almost universal on Unix systems. What are your systems?

> Am I right thinking that just by placing the header file in my work
> directory I will have access to all these compression facilities?


You need the header files (zlib.h and zconf.h) and the compiled library
(libz.a or a shared library such as libz.so or libz.dylib). You can
get the source distribution here:

http://zlib.net/

and compile it yourself. There is also an already-compiled DLL version
for Windows available for download there.

mark

jasen

2006-12-11, 6:57 pm

On 2006-12-02, pedro.ballester@gmail.com <pedro.ballester@gmail.com> wrote:
>
> John Reiser wrote:
>
> Thank you. This looks very much like what I was looking for. I don't
> have this library in any of my systems, I imagine it is non-standard.


fairly standard, I think firefox uses it.

> Am I right thinking that just by placing the header file in my work
> directory I will have access to all these compression facilities?


No, you need to install the rest of the development kit too.

Bye.
Jasen