--- James Newton <eplus1@san.rr.com> wrote:
> Has anyone had experience or have ideas about
> compressing / decompressing
> English text with a PIC?
>
> I have an application that would benefit from having
> the PC that is sending
> data to the PIC compress text and then having the
> PIC expand it. The text is
> just standard English words, phrases, sentences
> etc...
>
> To make things more complex, the compressed text
> cannot use non-printable
> characters, i.e. we get only 96 possible values per
> byte or only one of the
> following patterns in each byte. 3 x 2^5 = 96
>
> 001x xxxx
> 010x xxxx
> 011x xxxx
>
> BCD numbers pack into this fairly nicely, 9 x 9 = 81
> < 96 so each byte still
> holds two BCD digits. But 26 letters plus shift plus
> space don't seem to map
> well. <GRIN> (this is, of course, because they map

Actually, two BCD character won't map. There are ten
digits not 9 in BCD.

> perfectly) I was thinking
> that some Huffman compression system may exist that
> works well here? Easy to
> de-compress on a PIC is important as well.

Huffman is good (perfect?) except for one tiny
problem:
memory. I presume if you chose to implement a static
table then perhaps it'd work. But then it would no
longer be "optimal". However, if you know your
characters and there distribution then perhaps a
static table would work just fine.

How many characters do you "need"? I recall something
I believe Andy Warren posted a long time ago (where's
Andy?) that may be applicable here. It wasn't really
compression per se, but it made a more optimal use of
the available bits. Basically it involved sending
three characters in two bytes. Obviously you can't
send three 8-bit chunks in 16-bits, so you have to
reduce the character set size. The optimal character
set size is the largest cube that fits into 16-bits.
This happens to be 40: 40^3 = 64000 which is less than
2^16. This would give you the 26 letters of the
alphabet (only upper case or lower case), the numbers
0-9, and four punctuation characters of your choice.

The encoding and decoding process is essentially an
exercise in changing bases. For example, if you mapped
your 40 characters into the numbers 0-39, then you'd
encode them by:

x = a1 * 40^2 + a2 * 40^1 + a3 * 40^0
  = a1 * 1600 + a2 * 40 + a3

Where a1, a2, and a3 are the three encoded characters.
The decoding process goes like:

a1 = x mod 40^2
x = x - a1 * 40^2
a2 = x mod 40
x = x - a2 * 40
a3 = x

Like I said, this is not compression, it's a
re-arrangement of sorts.

Now in your application, you don't have the full 16
bits to work with. In fact, you only have:
 log2(96) = 6.58 bits
or about 6 and a half bits.

If you concatenate two of your "bytes" you get about
13 bits, which could encode 3 characters from a set of
20 characters. If you concatenate three bytes, you'd
have about 18.5 bits which would allow you to encode 4
characters from a set of 30.

So going back to my question, "How many characters do
you need?" Call this K. The number of characters n
encoded into m bytes is:

 K^n <= 96^m

or

  m >= n*log(K)/log(96)

m and n are integers, so you have a simple Diophantine
equation.

As a couple of examples:

K=20: (we know the answer from above)

has the solutions
m=2,n=3
m=3,n=4
m=4,n=5
m=4,n=6

K=30:
m=3,n=3
m=3,n=4
m=4,n=5
m=5,n=6

K=40:
m=3,n=3
m=4,n=4
m=5,n=5
m=5,n=6

As you can see, when K approaches 96 there are
diminishing returns.

For Andy's case, you have:

 m >= n*log(40)/log(2^16)

m=1,n=1
m=1,n=2
m=1,n=3
m=2,n=4
m=2,n=5
m=2,n=6

Where m is the number of 16-bit words and n is the
number of base 40 characters.

Neat.

.lo

=====

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com