Text compression for embedded controllers

There are a series of problems when trying for text compression on an embedded controller:

Can't have big tables. I'm trying for 64 or fewer entries. I can easily combine the upper and lower case letters with a shift code and I'm using some huristics to guess when shifting should happen with out needing to send the code: After a period, ! or ?, expect the next letter to be shifted; one code means shift just the next letter but; if you have to shift two letters in a row, stay shifted untill you see the shift code again. I havent really settled on a way to handle symbols but I'm thinking about haveing more than one table so that I can take advantage of the fact that numbers are more likely to be A) grouped, B) follow a $ or a % or #. Since I have to use 8 bits in the main table, and need only 6, I'll use the upper 2 to shift to a new table when we hit certain symbols. The symbole "table" will actually be the end of the full table (change the lookup offset).
Can't have big code. I've come up with a nasty hack on Huffman encoding that (I think) will simplify the length encoding of the frequency translated data.

I wrote some QBASIC routines to test the proof of concept:
This one produces a variable length string of "1"s and "0"s that effeciently encodes numbers from 0 to 31 with smaller numbers haveing shorter lengths. The end of each string is marked with a "00x" or a "11x" and no other part of one value will have this pattern. Basically the 00 or 11 encodes the second bit of the value and the x is the LSb. The higher order bits are encoded by a series of "01"s or "10"s that are equal in number to the bit values. I know it sounds like this could be done more effeciently, but I could not find a way that did not involve A) a very big chunk of code or table in the PIC or B) a less than ideal distribution of lengths at the lower values.

FUNCTION LenEnc$ (in)
 REM PRINT #2, RIGHT$("   " + STR$(in), 5); " ";
 out$ = ""
 oldbit = INT((in + 2) / 4) MOD 2
 out$ = RIGHT$(STR$(oldbit), 1)
 WHILE in > 3
  IF oldbit = 0 THEN
   out$ = out$ + "1"

   oldbit = 1
  ELSE
   out$ = out$ + "0"

   oldbit = 0
   END IF
  in = in - 4
  WEND
 IF oldbit = 0 THEN
  out$ = out$ + "0"
 ELSE
  out$ = out$ + "1"
END IF
 out$ = out$ + RIGHT$(STR$(in MOD 2), 1)

 LenEnc$ = out$

END FUNCTION

This code produces the following encodings for 0 to 31:

  0 000
  1 001
  2 110
  3 111
  4 1000
  5 1001
  6 0110
  7 0111
  8 01000
  9 01001
 10 10110
 11 10111
 12 101000
 13 101001
 14 010110
 15 010111
 16 0101000
 17 0101001
 18 1010110
 19 1010111
 20 10101000
 21 10101001
 22 01010110
 23 01010111
 24 010101000
 25 010101001
 26 101010110
 27 101010111
 28 1010101000
 29 1010101001
 30 0101010110
 31 0101010111

If you look at the length of common huffman codes, this seems to compare favorably. I would like to find one that more closely matches the lengths to the actually frequency of the letters. According to a variety of sources Space, E, and T are from 10% to 15% of common english, then N, R, O, A, and I are about 7.5% followed by a steady decline from S at 6.1% through D at 4.2% to H at 3.4%. After this L, H, C, F, P, U and M are all about 2.5% to 3.5%. If you graph that data,

 ***************
E*************
T*********
N********
R*******.
O*******.
A*******.
I*******.
S******
D****
L***.
H***.
C***
F***
P***
U**.
M**.
Y**
G*.
W*.
V*.
B*
X.
Q.
K
J
Z

You can see that the pattern of 4, 3 bit codes followed by 4, 4 bit codes, etc... matches pretty darn well.

The decoding routine looks like heck in QBASIC, but once I get it into the PIC it should be very small.

FUNCTION LenDec (bits$)
 FOR i = 1 TO LEN(bits$)
  IF lastbit THEN
   accum = accum + VAL(MID$(bits$, i, 1))
   bits$ = MID$(bits$, i + 1)
   LenDec = accum
   lastbit = false
   firstbit = true
   EXIT FUNCTION
  ELSE
   IF firstbit THEN
    firstbit = false 'not anymore
    accum = 0
   ELSE
    IF oldbit = VAL(MID$(bits$, i, 1)) THEN
     lastbit = true
     accum = accum + (VAL(MID$(bits$, i, 1)) * 2)
    ELSE
     accum = accum + 4
     END IF 'delta bit
    END IF 'first bit
   oldbit = VAL(MID$(bits$, i, 1))
   END IF 'last bit
  NEXT i
END FUNCTION

Anyway, the basic idea is:

normalize everything to 32 values by inserting the shift code or table change codes
Lookup each value in the frequency table to get the table index
length encode the value
ship it to the PIC
length decode the value
look it up in the table, minding the table shift codes
reverse the shifting and
output the byte

Comments:

/techref/method/compress/etxtfreq.htm english text frequencies (letter frequencies, pair frequencies, most common words, etc.)

Code: