James Newton wrote:
>
> Has anyone had experience or have ideas about compressing / decompressing
> English text with a PIC?
>
> I have an application that would benefit from having the PC that is sending
> data to the PIC compress text and then having the PIC expand it. The text is
> just standard English words, phrases, sentences etc...
>
> To make things more complex, the compressed text cannot use non-printable
> characters, i.e. we get only 96 possible values per byte or only one of the
> following patterns in each byte. 3 x 2^5 = 96
>
> 001x xxxx
> 010x xxxx
> 011x xxxx
>
> BCD numbers pack into this fairly nicely, 9 x 9 = 81 < 96 so each byte still
> holds two BCD digits. But 26 letters plus shift plus space don't seem to map
> well. (this is, of course, because they map perfectly) I was thinking
> that some Huffman compression system may exist that works well here? Easy to
> de-compress on a PIC is important as well.
>
> If you have good text compression / expansion code that produces binary, I
> can translate that into printable only but thought I would mention it on the
> chance that someone else has hit this before.
>
> If this sounds like a really weird application, think about this: if you
> wanted to make a device that could be updated via an email.... without
> running an application on the remote email users machine... just send the
> email and tell the user to connect the device and copy the email out...
> Anyone who has struggled with update files that get munched in transit or
> when the user on the other end can't figure out how to run your app or
> winzip (or doesn't want to or can't etc...) will see the utility here. And,
> of course, the smaller the email the better...

Hey James... :) why in the hell are you saying "English" text? Just because
French, Portuguese, Spanish and a few others use lots of accented letters and
some other nice chars? Hehe, remember that the byte code 0111-1111 could be
used to flag the next 8 bits as a non-compressed character...
so you could use all the rest of the 255 character codes. Obviously, for each
"special" character outside your "English" 95 chars you will spend 13 bits,
but I am sure it will still compress a lot. IBM used 6-bit character codes in
some old machines; at that time, a bit was made of gold... :)

And by the way, you forgot the 000x-xxxx combination; you can go up to 128
combinations, not only 96... :) (4 x 2^5)

When you said "English" text, I thought you were going to talk about
databases of word and letter groupings. Easy to do: using a byte code such as
"FFh" as a flag, you can use the next two bytes to index a database (a table
of words or groups of letters) of the most used ones in that specific
language. So there is no real compression at all, just fast indexing. You can
find statistics about the 1000 most used words in English by region, etc. For
example, the word "combination" could be just "FF 05 93", 3 bytes. You could
define FE as a flag for packets that use 3-byte coding, and so on... think
about it.

Wagner
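
The flag-plus-index idea above could be sketched roughly like this on the PC side. This is only a minimal illustration, not tested on a PIC: the word table WORDS, the helper names encode/decode, the big-endian index order and the "only replace words longer than 3 letters" rule are all assumptions made for the example, and the input is assumed to be plain 7-bit ASCII so that FFh can never appear as ordinary text.

```python
# Hypothetical word table; a real one would hold the most-used words of the
# target language and would also be stored in the PIC's program memory.
WORDS = ["combination", "compress", "character", "language", "english"]
INDEX = {w: i for i, w in enumerate(WORDS)}
FLAG = 0xFF  # flag byte from the post: the next two bytes are a table index


def encode(text: str) -> bytes:
    """Replace known words with FLAG + 2-byte index (assumes ASCII input)."""
    out = bytearray()
    i = 0
    while i < len(text):
        # Collect a run of letters so it can be looked up as a whole word.
        j = i
        while j < len(text) and text[j].isalpha():
            j += 1
        if j > i:
            word = text[i:j]
            # A code costs 3 bytes, so only longer words are worth replacing.
            if len(word) > 3 and word in INDEX:
                idx = INDEX[word]
                out += bytes([FLAG, idx >> 8, idx & 0xFF])
            else:
                out += word.encode("ascii")
            i = j
        else:
            # Spaces, digits and punctuation are passed through unchanged.
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


def decode(data: bytes) -> str:
    """Expand FLAG + index codes back into words; this is the PIC's job."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == FLAG:
            idx = (data[i + 1] << 8) | data[i + 2]
            out.append(WORDS[idx])
            i += 3
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)
```

The decoder is just a table lookup and a copy loop, which is what makes this approach attractive for a small micro. Note that FFh itself is not a printable byte, so for the email case the output would still need the binary-to-printable translation step James mentions.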