James Newton wrote:
> 
> Has anyone had experience or have ideas about compressing / decompressing
> English text with a PIC?
> 
> I have an application that would benefit from having the PC that is sending
> data to the PIC compress text and then having the PIC expand it. The text is
> just standard English words, phrases, sentences etc...
> 
> To make things more complex, the compressed text cannot use non-printable
> characters, i.e. we get only 96 possible values per byte or only one of the
> following patterns in each byte. 3 x 2^5 = 96
> 
> 001x xxxx
> 010x xxxx
> 011x xxxx
> 
> BCD numbers pack into this fairly nicely, 9 x 9 = 81 < 96 so each byte still
> holds two BCD digits. But 26 letters plus shift plus space don't seem to map
> well. <GRIN> (this is, of course, because they map perfectly) I was thinking
> that some Huffman compression system may exist that works well here? Easy to
> de-compress on a PIC is important as well.
> 
> If you have good text compression / expansion code that produces binary, I
> can translate that into printable only but thought I would mention it on the
> chance that someone else has hit this before.
> 
> If this sounds like a really weird application, think about this: if you
> wanted to make a device that could be updated via an email.... without
> running an application on the remote email users machine... just send the
> email and tell the user to connect the device and copy the email out...
> Anyone who has struggled with update files that get munched in transit or
> when the user on the other end can't figure out how to run your app or
> winzip (or doesn't want to or can't etc...) will see the utility here. And,
> of course, the smaller the email the better...


Hey James... :)  why in the hell are you saying "English" text?... just
because French, Portuguese, Spanish and a few others <g> use lots of à é
á í ó ú ã ñ õ and some other nice chars?  hehe,  remember that the
byte code 0111-1111 could be used to flag the next 8 bits as a
non-compressed character... so you could use all the rest of the 255
combinations of chars. Obviously for each "special" character outside
your "English" <g> 95 chars, you will spend 13 bits... but I am sure it
will still compress a lot.  IBM used lots of 6-bit character codes in
some old machines; at that time, a bit was made of gold... :)... and by
the way, you forgot the 000x-xxxx combination, you can go up to 128
combinations, not only 96... :)... (4 x 2^5)...
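
A rough sketch of that escape idea, written in Python for the PC side just
to show the bit counting. It reads the 13 bits above as a 5-bit flag code
plus the 8 raw bits; that reading, the 5-bit alphabet (26 letters plus
space), the flag value 11111 and the names ALPHABET, ESCAPE, encode and
decode are all assumptions for illustration, not a finished format.

```python
# Hypothetical 5-bit alphabet: 26 lowercase letters plus space (codes 0..26).
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
ESCAPE = 0b11111  # assumed flag code: the next 8 bits are one raw character


def encode(text: str) -> str:
    """Return a string of '0'/'1' bits; text must be Latin-1 encodable."""
    bits = []
    for ch in text:
        code = ALPHABET.find(ch)
        if code >= 0:
            bits.append(format(code, "05b"))  # common character: 5 bits
        else:
            raw = ch.encode("latin-1")[0]     # "special" character: raw byte
            bits.append(format(ESCAPE, "05b") + format(raw, "08b"))  # 13 bits
    return "".join(bits)


def decode(bits: str) -> str:
    """Inverse of encode(); expects a bit string produced by encode()."""
    out = []
    pos = 0
    while pos + 5 <= len(bits):
        code = int(bits[pos:pos + 5], 2)
        pos += 5
        if code == ESCAPE:
            # Flag seen: take the following 8 bits as a raw Latin-1 character.
            out.append(bytes([int(bits[pos:pos + 8], 2)]).decode("latin-1"))
            pos += 8
        else:
            out.append(ALPHABET[code])
    return "".join(out)
```

With this sketch "olá mundo" costs 8 x 5 + 13 = 53 bits instead of 72, and
the bit stream would still have to be packed into printable bytes
afterwards.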

When you said "English" text, I thought you would talk about word and
letter grouping data bases... easy to do... using a byte code such as
"FFh" as a flag, you can use the next two bytes to index a data base
(table of words or groups of letters) of the most used ones in that
specific language. So there is no compression at all, just fast
indexing.  You can find statistics about the 1000 most used words in
English by region, etc.  For example, the word "combination" could be
just "FF 05 93", 3 bytes... you could define FE as a flag for packets
that use 3-byte coding, and so on... think about it.
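
Something like this, as a minimal Python sketch of the table lookup. The
tiny WORDS table, the big-endian two-byte index and the names FLAG,
compress and expand are assumptions for illustration; a real table would
come from word-frequency statistics, and the FFh flag would still need
translating into printable-only bytes as James described.

```python
FLAG = 0xFF  # flag byte: the next two bytes are an index into WORDS

# Hypothetical tiny table; a real one would hold the most used words.
WORDS = ["the", "and", "combination"]


def compress(text: str) -> bytes:
    """Replace table words longer than 3 letters with FLAG + 2-byte index."""
    index = {word: i for i, word in enumerate(WORDS)}
    out = bytearray()
    for n, token in enumerate(text.split(" ")):
        if n:
            out.append(ord(" "))
        if token in index and len(token) > 3:
            i = index[token]
            out += bytes([FLAG, i >> 8, i & 0xFF])  # 3 bytes per table word
        else:
            out += token.encode("ascii")            # literal ASCII text
    return bytes(out)


def expand(data: bytes) -> str:
    """Inverse of compress(): a plain table lookup, no real decompression."""
    out = []
    pos = 0
    while pos < len(data):
        if data[pos] == FLAG:
            i = (data[pos + 1] << 8) | data[pos + 2]
            out.append(WORDS[i])
            pos += 3
        else:
            out.append(chr(data[pos]))
            pos += 1
    return "".join(out)
```

Words of 3 letters or fewer are left alone here because the reference
itself costs 3 bytes, so replacing them would save nothing.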

Wagner