--- James Newton wrote: > Has anyone had experience or have ideas about > compressing / decompressing > English text with a PIC? > > I have an application that would benefit from having > the PC that is sending > data to the PIC compress text and then having the > PIC expand it. The text is > just standard English words, phrases, sentences > etc... > > To make things more complex, the compressed text > cannot use non-printable > characters, i.e. we get only 96 possible values per > byte or only one of the > following patterns in each byte. 3 x 2^5 = 96 > > 001x xxxx > 010x xxxx > 011x xxxx > > BCD numbers pack into this fairly nicely, 9 x 9 = 81 > < 96 so each byte still > holds two BCD digits. But 26 letters plus shift plus > space don't seem to map > well. (this is, of course, because they map Actually, two BCD character won't map. There are ten digits not 9 in BCD. > perfectly) I was thinking > that some Huffman compression system may exist that > works well here? Easy to > de-compress on a PIC is important as well. Huffman is good (perfect?) except for one tiny problem: memory. I presume if you chose to implement a static table then perhaps it'd work. But then it would no longer be "optimal". However, if you know your characters and there distribution then perhaps a static table would work just fine. How many characters do you "need"? I recall something I believe Andy Warren posted a long time ago (where's Andy?) that may be applicable here. It wasn't really compression per se, but it made a more optimal use of the available bits. Basically it involved sending three characters in two bytes. Obviously you can't send three 8-bit chunks in 16-bits, so you have to reduce the character set size. The optimal character set size is the largest cube that fits into 16-bits. This happens to be 40: 40^3 = 64000 which is less than 2^16. This would give you the 26 letters of the alphabet (only upper case or lower case), the numbers 0-9, and four punctuation characters of your choice. The encoding and decoding process is essentially an exercise in changing bases. For example, if you mapped your 40 characters into the numbers 0-39, then you'd encode them by: x = a1 * 40^2 + a2 * 40^1 + a3 * 40^0 = a1 * 1600 + a2 * 40 + a3 Where a1, a2, and a3 are the three encoded characters. The decoding process goes like: a1 = x mod 40^2 x = x - a1 * 40^2 a2 = x mod 40 x = x - a2 * 40 a3 = x Like I said, this is not compression, it's a re-arrangement of sorts. Now in your application, you don't have the full 16 bits to work with. In fact, you only have: log2(96) = 6.58 bits or about 6 and a half bits. If you concatenate two of your "bytes" you get about 13 bits, which could encode 3 characters from a set of 20 characters. If you concatenate three bytes, you'd have about 18.5 bits which would allow you to encode 4 characters from a set of 30. So going back to my question, "How many characters do you need?" Call this K. The number of characters n encoded into m bytes is: K^n <= 96^m or m >= n*log(K)/log(96) m and n are integers, so you have a simple Diophantine equation. As a couple of examples: K=20: (we know the answer from above) has the solutions m=2,n=3 m=3,n=4 m=4,n=5 m=4,n=6 K=30: m=3,n=3 m=3,n=4 m=4,n=5 m=5,n=6 K=40: m=3,n=3 m=4,n=4 m=5,n=5 m=5,n=6 As you can see, when K approaches 96 there are diminishing returns. For Andy's case, you have: m >= n*log(40)/log(2^16) m=1,n=1 m=1,n=2 m=1,n=3 m=2,n=4 m=2,n=5 m=2,n=6 Where m is the number of 16-bit words and n is the number of base 40 characters. Neat. .lo ===== __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com