There are a series of problems when trying for text compression on an embedded controller:
I wrote some QBASIC routines to test the proof of concept:
This one produces a variable length string of "1"s and "0"s that effeciently
encodes numbers from 0 to 31 with smaller numbers haveing shorter lengths.
The end of each string is marked with a "00x" or a "11x" and no other part
of one value will have this pattern. Basically the 00 or 11 encodes the second
bit of the value and the x is the LSb. The higher order bits are encoded
by a series of "01"s or "10"s that are equal in number to the bit values.
I know it sounds like this could be done more effeciently, but I could not
find a way that did not involve A) a very big chunk of code or table in the
PIC or B) a less than ideal distribution of lengths at the lower values.
FUNCTION LenEnc$ (in) REM PRINT #2, RIGHT$(" " + STR$(in), 5); " "; out$ = "" oldbit = INT((in + 2) / 4) MOD 2 out$ = RIGHT$(STR$(oldbit), 1) WHILE in > 3 IF oldbit = 0 THEN out$ = out$ + "1" oldbit = 1 ELSE out$ = out$ + "0" oldbit = 0 END IF in = in - 4 WEND IF oldbit = 0 THEN out$ = out$ + "0" ELSE out$ = out$ + "1" END IF out$ = out$ + RIGHT$(STR$(in MOD 2), 1) LenEnc$ = out$ END FUNCTION
This code produces the following encodings for 0 to 31:
0 000 1 001 2 110 3 111 4 1000 5 1001 6 0110 7 0111 8 01000 9 01001 10 10110 11 10111 12 101000 13 101001 14 010110 15 010111 16 0101000 17 0101001 18 1010110 19 1010111 20 10101000 21 10101001 22 01010110 23 01010111 24 010101000 25 010101001 26 101010110 27 101010111 28 1010101000 29 1010101001 30 0101010110 31 0101010111
If you look at the length of common huffman codes, this seems to compare favorably. I would like to find one that more closely matches the lengths to the actually frequency of the letters. According to a variety of sources Space, E, and T are from 10% to 15% of common english, then N, R, O, A, and I are about 7.5% followed by a steady decline from S at 6.1% through D at 4.2% to H at 3.4%. After this L, H, C, F, P, U and M are all about 2.5% to 3.5%. If you graph that data,
*************** E************* T********* N******** R*******. O*******. A*******. I*******. S****** D**** L***. H***. C*** F*** P*** U**. M**. Y** G*. W*. V*. B* X. Q. K J Z
You can see that the pattern of 4, 3 bit codes followed by 4, 4 bit codes, etc... matches pretty darn well.
The decoding routine looks like heck in QBASIC, but once I get it into the PIC it should be very small.
FUNCTION LenDec (bits$) FOR i = 1 TO LEN(bits$) IF lastbit THEN accum = accum + VAL(MID$(bits$, i, 1)) bits$ = MID$(bits$, i + 1) LenDec = accum lastbit = false firstbit = true EXIT FUNCTION ELSE IF firstbit THEN firstbit = false 'not anymore accum = 0 ELSE IF oldbit = VAL(MID$(bits$, i, 1)) THEN lastbit = true accum = accum + (VAL(MID$(bits$, i, 1)) * 2) ELSE accum = accum + 4 END IF 'delta bit END IF 'first bit oldbit = VAL(MID$(bits$, i, 1)) END IF 'last bit NEXT i END FUNCTION
Anyway, the basic idea is:
Comments:
Code: