fixing broken .Z files?

der Mouse mouse at Rodents.Montreal.QC.CA
Thu Mar 3 22:28:48 CST 2005


> [...], but once data gets transformed by LZW it has very little
> entropy left.

I'm not sure what you mean by "entropy" here.  The
information-theoretic meaning I know is diametrically opposed to the
sense in which you're using it: by that meaning, highly compressed data
has much, not little, entropy - almost as many bits of entropy, of
information content, as it has surface bits.  This is why I
consistently use "redundancy" below.

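A rough illustration of the information-theoretic sense, as a minimal
sketch: it estimates entropy from byte frequencies alone, which is only
a crude proxy for true information content.  The word list, the seed,
and the helper name bits_per_byte are all made up for the example.

```python
import math
import random
import zlib
from collections import Counter


def bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-frequency distribution, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Synthetic, highly redundant "text": words drawn from a tiny vocabulary.
words = ["entropy", "redundancy", "compress", "data", "bits", "key", "the", "of"]
rng = random.Random(0)
text = " ".join(rng.choice(words) for _ in range(50000)).encode("ascii")

# zlib (deflate) stands in for LZW here; it is a different algorithm.
packed = zlib.compress(text, 9)

print("plain :", len(text), "bytes,", round(bits_per_byte(text), 2), "bits/byte")
print("packed:", len(packed), "bytes,", round(bits_per_byte(packed), 2), "bits/byte")
```

The packed stream should come out much shorter, with a per-byte
estimate close to 8 bits, while the plain text sits well below that.
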
> (If it didn't, by your argument, you could still recompress it with a
> bitwise encoder like PAQ -- you can't (by a significant margin,
> anyway)).

That's only because the encoder isn't smart enough to take advantage of
the redundancy that's there.  When I say redundancy is present, I am
speaking from an information-theoretical standpoint; the presence of
redundancy does not mean that any particular compression algorithm can
squeeze it out.  Saying that no encoding software exists that can
compress some data blob, *even if true*, does not mean that there is no
redundancy in it, only that - if there is any - no extant software
knows how to find it and compress it out.

For example, I can produce effectively unlimited amounts of data that
has very little information content but which I defy any program to
compress significantly.  All I need to do is pick a random key and
encrypt a stream of all-0-bits with that key using some decent
algorithm (3DES, IDEA, arcfour, whatever).  Only a few bits of
information content (the size of the key, basically) but about as
uncompressible as it gets.
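
The experiment is easy to try.  Below is a minimal sketch, assuming a
16-byte random key, a textbook pure-Python arcfour, and zlib as the
stand-in compressor; the function names are invented for the example,
and arcfour is used only because it is short, not because it is a good
cipher by modern standards.

```python
import os
import zlib


def arcfour_keystream(key: bytes, n: int) -> bytes:
    """Return the first n bytes of the arcfour (RC4) keystream for key."""
    # Key scheduling: permute 0..255 under control of the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Output generation: keep swapping and emit one byte per step.
    out = bytearray()
    i = j = 0
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def encrypt_zeros(key: bytes, n: int) -> bytes:
    """Encrypt n all-zero bytes; XOR with zero leaves just the keystream."""
    plaintext = bytes(n)
    keystream = arcfour_keystream(key, n)
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


key = os.urandom(16)                 # the only real information content
blob = encrypt_zeros(key, 1 << 16)   # 64 KiB of encrypted zeros
packed = zlib.compress(blob, 9)

print("key bytes       :", len(key))
print("blob bytes      :", len(blob))
print("compressed bytes:", len(packed))
```

The compressed size should come out no smaller than the input, even
though the whole blob is determined by the 16-byte key and its length.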

>> Well, yeah; "is human parsable" is a form of redundancy, but one
>> that is almost impossible for programs to take advantage of -
> Actually, WinRK uses a dictionary to currently achieve the very best
> Calgary Corpus score, so it is most definitely exploitable.

Using a dictionary helps with *some* "is human-parsable" redundancy.
It's certainly not enough for all of it.  Consider

	Yours with every time an postmark hasn't empty of two kin
	beside a whose boot-print with.  Besides talk pathological we
	of chapter.  ...

or perhaps

	That's only because the encoder isn't smart enough to take
	advantage of the redundancy that's there.  When I say
	redundancy is present, I am speaking from a cook's pantry;
	buying pre-ground pepper at the corner store won't be nearly as
	good as grinding it yourself.  But a good vinegar exists that
	can compress some data blob, *even if true*, ...

The former might be recognizable as nonsense if the code knows enough
English grammar to recognize parts of speech and recognize the lack of
a valid parse tree.  The latter, well, recognizing that sort of
nonsense as nonsense is as AI-complete as the natural language problem.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse at rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

