Re: [experiences] Fujitsu NetCOBOL for .NET
|
|
| Raymond Leech 2006-01-14, 3:55 am |
| For most of our Internet-facing files, we use variable length ISAM records
which we compress then encrypt before writing. The compression ratio is
nearly identical to WinZip with 60-90% reduction typical. Our largest file
contains 60 gig of data with an 86% compression ratio, resulting in an ISAM
file of only about 8 gig.
So long as the ISAM module is robust and reasonably efficient (at or near
NE), we should be able to port the application with very little trouble.
Thanks for your input HeyBub. ray
"HeyBub" <heybubNOSPAM@gmail.com> wrote in message
news:11s537ip039694b@news.supernews.com...
> Raymond Leech wrote:
>
> The ISAM handler - at least in the .NET predecessor (PowerCOBOL) - is made
> by someone other than Fujitsu (I forget who) and is pretty robust. It uses,
> for example, RLE to compact the file.
>
> We took a 1.2 Gig text file and made from it an ISAM file. The resulting
> file was 700 Meg. Almost half the original size!
>
> Point is, it might turn out, what with this improved ISAM handler, that
> your files actually get smaller!
>
> Ought to be easy to test. Just a thought.
>
| |
| HeyBub 2006-01-14, 9:55 pm |
| Raymond Leech wrote:
> For most of our Internet-facing files, we use variable length ISAM
> records which we compress then encrypt before writing. The
> compression ratio is nearly identical to WinZip with 60-90% reduction
> typical. Our largest file contains 60 gig data with an 86%
> compression ratio resulting in an ISAM file of only about 8 gig.
And how do you do that ("compress then encrypt")?
| |
| Raymond Leech 2006-01-15, 3:55 am |
| I have written wrappers for some C DLLs which do compression (addZip and
aPlib) and encryption (addCrypt, which supports Blowfish and several others).
As Russell suggested in his post, common routines are the way to go. They
make tasks like adding compression to a file both easy and nearly
foolproof. So long as the developers use the common routines (e.g., no
discrete read/write verbs), the system takes care of compressing and
encrypting the record prior to the write/rewrite, and of decrypting and
decompressing it after any read.
My preference is the addZip / addCrypt package since the same vendor created
both. The compression routine is a bit 'cpu expensive', but the
uncompression routine is very fast.
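A minimal sketch of the common-routine idea, written in Python rather than COBOL. It assumes Python's zlib in place of addZip and a placeholder XOR step in place of addCrypt; it does not show either library's real API, and every name in it (`pack_record`, `unpack_record`, `toy_cipher`, `KEY`) is hypothetical.

```python
import zlib
from itertools import cycle

KEY = b"demo-key"  # hypothetical key, for illustration only


def toy_cipher(data: bytes, key: bytes = KEY) -> bytes:
    """Placeholder XOR step (NOT secure); stands in for a real cipher such as Blowfish."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def pack_record(record: bytes) -> bytes:
    """Common write-side routine: compress first, then encrypt."""
    return toy_cipher(zlib.compress(record))


def unpack_record(stored: bytes) -> bytes:
    """Common read-side routine: decrypt first, then decompress."""
    return zlib.decompress(toy_cipher(stored))


record = b"SMITH     JOHN      " + b" " * 180  # a space-padded 200-byte record
stored = pack_record(record)
print(len(record), "->", len(stored))
assert unpack_record(stored) == record
```

Because callers only ever see `pack_record` and `unpack_record`, the compression or cipher library could be swapped without touching the code that reads and writes the file.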
If you need more information, I'd be happy to share whatever info I can. ray
| |
| HeyBub 2006-01-16, 3:55 am |
| Raymond Leech wrote
> I have written wrappers for some C dll's which do compression (addZip
> and aPlib) and encryption (addCrypt which supports Blowfish and
> several others).
> As Russell suggested in his post, common routines are the way to go.
> It makes tasks like adding compression to a file both easy and nearly
> foolproof. So long as the developers use the common routines (e.g.,
> no discrete read/write verbs), the system takes care of compressing
> and encrypting the record prior to the write/rewrite and decryption
> and uncompression after any read.
That's what befuddles me. How on earth can a compression program compress a
RECORD? I looked at the features for addZip [
http://www.littlebigware.com/addzip/features.html ] and see nothing about
handing a RECORD to the routine. If, in fact, all it does is ZIP files (with
appropriate bells and whistles), the $140 is way more than the free Active-X
control we already use.
| |
| Raymond Leech 2006-01-16, 9:55 pm |
| Get the addZip 0.71 zip files (in second section of download page). There
will be documentation in the zip file. You'll find two pages on the
"in-memory" routines which compress an input buffer, returning a compressed
output buffer.
aPlib is less expensive ($29 or $95). I had some problems with early
versions and had to use another vendor for encryption. I ended up opting for
one-stop-shopping. Either should work just fine though.
BTW, the compression ratio between the two was almost identical. I think the
compression speed was faster in addZip but the uncompression was pretty
close (it's been years since I did the benchmarks, so my memory might be a
bit rusty).
ray
| |
| HeyBub 2006-01-17, 6:55 pm |
| Raymond Leech wrote:
> Get the addZip 0.71 zip files (in second section of download page).
> There will be documentation in the zip file. You'll find two pages on
> the "in-memory" routines which compress an input buffer, returning a
> compressed output buffer.
Okay, did that. You hand the control an input buffer, it does its magic, and
puts the result in an output buffer. I presume, at this point, you treat the
compressed output buffer like a record and write it to a file.
Thing is, the contents of the "compressed" output buffer includes a header
plus the compressed data. This "header," if I remember correctly, can be
several hundred, if not thousands, of bytes big which would seem to negate
the effect of the compression.
Am I missing something?
| |
| Raymond Leech 2006-01-17, 9:55 pm |
| I would be shocked if the header ever exceeded a few hundred bytes. I
believe that in my tests the smallest input that came out smaller than it
went in was somewhere around 128-200 bytes. If you pass a buffer
which is too small, the routine returns either an output buffer exactly
the same as the input buffer or a return code telling you the compression
failed.
In practice, I wouldn't try to compress a record that small, mainly because
the savings aren't worth the overhead. I still encrypt some short records,
but that only adds up to 8 bytes so no biggie.
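A small sketch of the "too small to bother" rule, assuming Python's zlib rather than addZip. The one-byte flag and the names `encode_record` and `decode_record` are hypothetical; addZip's own convention (same buffer back, or a return code) is not reproduced here.

```python
import zlib

RAW, DEFLATED = b"\x00", b"\x01"  # hypothetical one-byte storage flags


def encode_record(record: bytes) -> bytes:
    """Keep the compressed form only when it is actually smaller."""
    packed = zlib.compress(record)
    if len(packed) < len(record):
        return DEFLATED + packed
    return RAW + record


def decode_record(stored: bytes) -> bytes:
    """Reverse encode_record using the leading flag byte."""
    flag, body = stored[:1], stored[1:]
    return zlib.decompress(body) if flag == DEFLATED else body


for rec in (b"ID42", b"ACME CORP " * 40):
    stored = encode_record(rec)
    assert decode_record(stored) == rec
    print(len(rec), "->", len(stored))
```

With this layout a short record costs one extra byte at worst, instead of growing by the full stream overhead.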
Again, you're making me reach far back in my memory (I have trouble
remembering what I ate for dinner an hour ago... I think there was corn
involved). I developed these routines back in 1995-1997. They've been
rock-solid for a long time so I've had little need to recall some of the
long-ago details.
ray
| |
| Michael Wojcik 2006-01-19, 6:55 pm |
|
In article <11sr0skgk9tv034@news.supernews.com>, "HeyBub" <heybubNOSPAM@gmail.com> writes:
> Raymond Leech wrote:
>
> Okay, did that. You hand the control an input buffer, it does its magic, and
> puts the result in an output buffer. I presume, at this point, you treat the
> compressed output buffer like a record and write it to a file.
>
> Thing is, the contents of the "compressed" output buffer includes a header
> plus the compressed data. This "header," if I remember correctly, can be
> several hundred, if not thousands, of bytes big which would seem to negate
> the effect of the compression.
>
> Am I missing something?
You're missing RFC 1950 and RFC 1951, which specify the zip file format
and the Deflate compression algorithm, respectively.
The zip format has a maximum of 10 bytes of header data.
The Deflate format has a fixed 3-bit header per block, where blocks
are of varying size and need not start or end on a byte boundary. The
contents of blocks vary by the algorithm selected for the block and
its parameters. In the worst case, where Deflate is not able to
compress any of the file, it adds a total of five bytes for every
64KB of data. That could amount to "hundreds ... of bytes" of header
data if you're trying to compress at least 2.5MB of uncompressible
data, but 1) that's still only 0.008% overhead and 2) that would be a
dumb thing to do.
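The worst case can be checked roughly with Python's zlib, assuming compression level 0 (stored blocks only) and a raw Deflate stream with no wrapper. The exact block sizes depend on the zlib build, so the sketch only compares the measured overhead against the theoretical floor of five bytes per maximum-size stored block; `stored_overhead` is a name made up for this example.

```python
import math
import os
import zlib


def stored_overhead(n: int) -> int:
    """Bytes added when n random bytes pass through Deflate with compression off."""
    data = os.urandom(n)
    # level 0 = stored blocks only; wbits=-15 = raw Deflate, no zlib/gzip wrapper
    comp = zlib.compressobj(0, zlib.DEFLATED, -15)
    out = comp.compress(data) + comp.flush()
    return len(out) - n


n = 4 * 1024 * 1024
overhead = stored_overhead(n)
floor = 5 * math.ceil(n / 65535)  # 5 bytes for each maximum-size stored block
print(f"{overhead} bytes added to {n} bytes ({100 * overhead / n:.4f}%), floor {floor}")
```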
Any decent implementation of a basic entropy encoder (Deflate is a
combination of LZ77 and Huffman coding) will have similar behavior.
Markov-model encoders like BWT-based ones (eg bzip2) and the PPM
family might have slightly more overhead but provide much better
compression in the general case, so the net effect is better.
There are compression algorithms that produce a relatively large
amount of metadata - typically dictionary-based algorithms meant to
compress a very large corpus of similar data - but they're not
appropriate for this kind of application anyway.
All that said, unless the data is highly redundant (which it may be),
the compression payoff will only be decent with a context-sensitive
encoder (such as an entropy encoder), and those do better when they
have more material to work with. Compressing a record at a time is
an inefficient use of the representation unless you need rapid random
access - which is often the case, of course. If you're only going to
process sequentially, it makes much more sense to compress the whole
file (eg using a compressing filesystem).
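The record-at-a-time penalty is easy to measure. This sketch assumes Python's zlib and a made-up layout of 80-byte, space-padded records; the field contents are invented for illustration.

```python
import zlib

# Hypothetical fixed-length records: 8-digit key, then space-padded fields.
records = [
    f"{i:08d}{'CUSTOMER':<20}{'HOUSTON':<20}{'TX':<32}".encode("ascii")
    for i in range(1000)
]

raw = sum(len(r) for r in records)
per_record = sum(len(zlib.compress(r)) for r in records)
whole_file = len(zlib.compress(b"".join(records)))

print(f"raw {raw}, record-at-a-time {per_record}, whole file {whole_file}")
```

Each record compressed on its own starts with an empty history, so the encoder never gets to exploit the repetition between neighbouring records; the whole-file pass does.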
If the data *is* highly redundant, you may get quite good compression
from a context-free encoder, such as a simple run-length encoder.
They aren't nearly as powerful but they are very fast and simple, and
in the event of data corruption the error is usually easy to correct.
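For reference, a simple run-length encoder can be sketched in a few lines. This is one naive (count, byte) scheme chosen for illustration, not the scheme used by any particular ISAM handler; `rle_encode` and `rle_decode` are hypothetical names.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode as (count, byte) pairs; each run is capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand (count, byte) pairs back into the original bytes."""
    out = bytearray()
    for i in range(0, len(encoded), 2):
        count, value = encoded[i], encoded[i + 1]
        out += bytes([value]) * count
    return bytes(out)


record = b"SMITH" + b" " * 75  # a space-padded 80-byte field
packed = rle_encode(record)
assert rle_decode(packed) == record
print(len(record), "->", len(packed))  # 80 -> 12
```

The same scheme doubles the size of data with no runs at all, which is why it only suits highly redundant records such as padded fixed-length fields.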
In sum: sometimes compressing record-by-record is useful; sometimes
compressing an entire file is useful; sometimes simple algorithms are
suitable; but in other circumstances some or all of those may not
apply.
(Also, I might point out that there is a free, open-source zip
library available which will compress and decompress in memory as
well as to and from files. The usual tradeoffs between free and
commercial software apply, of course.)
--
Michael Wojcik michael.wojcik@microfocus.com
Memory, I realize, can be an unreliable thing; often it is heavily coloured
by the circumstances in which one remembers, and no doubt this applies to
certain of the recollections I have gathered here. -- Kazuo Ishiguro
| |
| Raymond Leech 2006-01-20, 3:55 am |
| Michael, very nicely documented. As you mention, you really have to look at
it on a case-by-case basis to "pick the right solution for the problem".
A comment on file system compression. I've found it often the best answer if
all you're doing is compressing, because the OS handles the details -- log
files and reports are great examples. One downside I've experienced is that
encrypted data compresses poorly because the compression occurs after
encryption, whereas if the data is compressed prior to encryption, my
results were much better.
Possibly an Encrypted File System with compression would do much better. My
hesitance to pursue EFS or the like is that once the volume is mounted, any
process can access the data (including an errant copy to an unencrypted
volume), whereas doing it in the app prevents anyone without access to the
app from decoding the data.
Does anyone have any experience with an EFS with compression that they'd
like to share?
ray
| |
| HeyBub 2006-01-20, 9:55 pm |
| Michael Wojcik wrote:
> In article <11sr0skgk9tv034@news.supernews.com>, "HeyBub"
> <heybubNOSPAM@gmail.com> writes:
>
> You're missing RFC 1950 and RFC 1951, which specify the zip file
> format
> and the Deflate compression algorithm, respectively.
>
> The zip format has a maximum of 10 bytes of header data.
========
So, I created a text file of 69 bytes. I zipped it. The result was 176
bytes.
Then I created a text file of one character. The resulting file was 3 bytes
(character+CR+LF). I zipped that file. The resulting file was 115 bytes.
Next I took an uncompressible file (a zip file) of 6550 bytes and zipped it.
The result was 6662.
So it looks to me like, in the extreme cases, zipping a file/record results
in LARGER chunks than the original.
| |
| Michael Wojcik 2006-01-20, 9:55 pm |
|
In article <e7qdnctJHZRX603eRVn-pA@comcast.com>, "Raymond Leech" <rayNOUNDERSCOREleech@comcast.net> writes:
>
> A comment on file system compression. I've found it often the best answer if
> all you're doing is compressing, because the OS handles the details -- log
> files and reports are great examples. One downside I've experienced is
> encrypted data performs poorly because the compression occurs after
> encryption, whereas if the data is compressed prior to encryption, my
> results were much better.
True, and a good point. General-purpose encryption algorithms should
produce data that cannot be compressed well even by sophisticated
encoders, because compressibility indicates non-uniformity in the
data, and that in turn indicates that some of the original message's
information is still recoverable.
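The effect of ordering can be shown with a rough experiment. The sketch assumes Python's zlib for compression and uses a toy keystream built from SHA-256 as a stand-in for a real cipher; `toy_encrypt` is hypothetical, is not a vetted algorithm, and is there only to produce uniform-looking output.

```python
import hashlib
import zlib


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR with a SHA-256 counter keystream. Illustration only, not a vetted cipher."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))


data = b"INVOICE 2006-01 PAID      " * 2000  # highly redundant sample data
key = b"hypothetical-key"

compress_then_encrypt = toy_encrypt(zlib.compress(data), key)
encrypt_then_compress = zlib.compress(toy_encrypt(data, key))

print(len(data), len(compress_then_encrypt), len(encrypt_then_compress))
```

Compressing first shrinks the data before it is scrambled; compressing afterwards leaves the encoder with nothing to exploit, so the output stays at roughly the input size.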
> Does anyone have any experience on an EFS with compression that they'd like
> to share?
I don't, I'm afraid. I haven't had a need to spend much time looking
at encrypting filesystems.
Some general-purpose encryption software tries to compress the input
data first, both because it won't be compressible afterward and because
it's a useful "pre-whitening" step. I don't know whether that's true
of particular encrypted-filesystem implementations.
--
Michael Wojcik michael.wojcik@microfocus.com
HTML is as readable as C. You can take this either way. -- Charlie Gibbs
| |
| Michael Wojcik 2006-01-20, 9:55 pm |
|
In article <11t1ureb84l6d60@news.supernews.com>, "HeyBub" <heybubNOSPAM@gmail.com> writes:
> Michael Wojcik wrote:
>
> So, I created a text file of 69 bytes. I zipped it. The result was 176
> bytes.
Sorry, I should have been clearer. By "zip format" I meant the stream
format specified in RFC 1950 (strictly speaking, the zlib format), which
adds at most 10 bytes of header and trailer data. However, most "zip"
programs use a different file format that includes a list of files and
their attributes, and other information; this is the PKZIP file format,
and it is distinct from the format specified in RFC 1950.
The PKZIP file format includes significant additional data, so of
course it has more overhead. (Since a PKZIP zip file includes the
names of the files that are contained in the archive, obviously it
has to have more than a small fixed amount of overhead.)
When compressing individual records for a file, the RFC 1950 stream
format, and not the PKZIP file format, would be the appropriate choice.
The PKZIP file format is documented in the "PKZIP Application Note",
which is available from various online sources.
Examples of PKZIP implementations are PKZIP, WinZip, and Info-ZIP,
though the latter also includes support for bare RFC 1950 streams.
An example of a tool that writes a single compressed stream rather
than a PKZIP archive is gzip, though its format is specified
separately in RFC 1952.
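The difference between a bare compressed stream and a PKZIP archive can be seen with a rough sketch in Python. It assumes the standard zlib module (which writes the RFC 1950 stream format) and the standard zipfile module (which writes PKZIP archives); the member name `one.txt` is made up, and the payload mirrors the 3-byte test file described earlier in the thread.

```python
import io
import zipfile
import zlib

payload = b"a\r\n"  # one character plus CR and LF

# Bare RFC 1950 stream: Deflate data with a small header and checksum.
stream = zlib.compress(payload)

# PKZIP archive holding the same bytes under a hypothetical member name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("one.txt", payload)
archive_bytes = buf.getvalue()

print(f"payload {len(payload)}, stream {len(stream)}, archive {len(archive_bytes)}")
```

The archive carries a local file header, a central directory entry and an end-of-directory record, each repeating per-file metadata, which is where the hundred-odd bytes come from.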
--
Michael Wojcik michael.wojcik@microfocus.com
Is it any wonder the world's gone insane, with information come to be
the only real medium of exchange? -- Thomas Pynchon