Compression Methods : English text frequencies

Data compression algorithms typically take advantage of the very non-uniform distribution of letters, letter pairs, and words in natural languages.

The frequency of letters in English text:

Letter 	Frequency 	Letter 	Frequency
a 	0.08167 	n 	0.06749
b 	0.01492 	o 	0.07507
c 	0.02782 	p 	0.01929
d 	0.04253 	q 	0.00095
e 	0.12702 	r 	0.05987
f 	0.02228 	s 	0.06327
g 	0.02015 	t 	0.09056
h 	0.06094 	u 	0.02758
i 	0.06966 	v 	0.00978
j 	0.00153 	w 	0.02360
k 	0.00772 	x 	0.00150
l 	0.04025 	y 	0.01974
m 	0.02406 	z 	0.00074

Sorted by frequency:

Letter 	Frequency 	Letter 	Frequency
e 	0.12702 	m 	0.02406 	
t 	0.09056 	w 	0.02360
a 	0.08167 	f 	0.02228 	
o 	0.07507 	g 	0.02015 	
i 	0.06966 	y 	0.01974
n 	0.06749 	p 	0.01929
s 	0.06327 	b 	0.01492 	
h 	0.06094 	v 	0.00978
r 	0.05987 	k 	0.00772 	
d 	0.04253 	j 	0.00153 	
l 	0.04025 	x 	0.00150
c 	0.02782 	q 	0.00095
u 	0.02758 	z 	0.00074

-- from Cryptographical Mathematics, by Robert Edward Lewand.

See also Wikipedia: Letter frequencies.
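
One way to put a number on how non-uniform this distribution is: compare the Shannon entropy of the table with the log2(26) bits per letter that a uniform alphabet would need. The following is a minimal Python sketch, not a definitive measurement. The names LETTERS, SHARES, LEWAND and entropy_bits are labels made up for this example, and the result is only a zeroth-order figure (it ignores letter pairs, words, spaces and punctuation).

```python
import math

# Letter shares a..z, copied from the Lewand table above.
LETTERS = "abcdefghijklmnopqrstuvwxyz"
SHARES = [0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228, 0.02015,
          0.06094, 0.06966, 0.00153, 0.00772, 0.04025, 0.02406, 0.06749,
          0.07507, 0.01929, 0.00095, 0.05987, 0.06327, 0.09056, 0.02758,
          0.00978, 0.02360, 0.00150, 0.01974, 0.00074]
LEWAND = dict(zip(LETTERS, SHARES))

def entropy_bits(freqs):
    """Shannon entropy, in bits per symbol, of a frequency table."""
    total = sum(freqs.values())  # normalise; the printed table sums to 0.99999
    return -sum((p / total) * math.log2(p / total)
                for p in freqs.values() if p > 0)

uniform_bits = math.log2(len(LEWAND))
letter_bits = entropy_bits(LEWAND)
print(f"uniform: {uniform_bits:.2f} bits/letter, table: {letter_bits:.2f} bits/letter")
```

With these figures the table comes out a little under 4.2 bits per letter, against about 4.7 for a uniform alphabet. That gap is what a letter-by-letter coder can exploit before pairs and words are even considered.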

(The frequency of letters in English text varies from one source text to another -- you can see that the text used in the charts below had a slightly different distribution.)
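
To see how much the distribution shifts from one text to another, you can count the letters in your own sample. A small Python sketch follows. The function name letter_frequencies and the sample sentence are illustrative only, and it assumes only the 26 unaccented letters matter.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: relative frequency} for a-z in text, ignoring case.

    Letters are listed from most to least frequent.
    """
    letters = [ch for ch in text.lower() if 'a' <= ch <= 'z']
    if not letters:
        return {}
    counts = Counter(letters)
    return {ch: n / len(letters) for ch, n in counts.most_common()}

sample = "The quick brown fox jumps over the lazy dog"
freqs = letter_frequencies(sample)
for ch, share in list(freqs.items())[:5]:
    print(ch, round(share, 5))
```

A short sample like this one is far from the published tables. The figures only settle down on texts of many thousands of letters.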

The most frequent letters, compared to their Morse Code representation.

Freq.    Letter       Morse code    Morse length     Total 
130        E                 1000        4             520 
 92        T               111000        6             552 
 79        N             11101000        8             632 
 76        R           1011101000       10             760 
 75        O       11101110111000       14            1050 
 74        A             10111000        8             592 
 74        I               101000        6             444 
 61        S             10101000        8             488 
 42        D           1110101000       10             420 
 36        L         101110101000       12             432 
 34        H           1010101000       10             340 
 31        C       11101011101000       14             434 
 28        F         101011101000       12             336 
 27        P       10111011101000       14             378 
 26        U           1010111000       10             260 
 25        M           1110111000       10             250 
 19        Y     1110101110111000       16             304 
 16        G         111011101000       12             192 
 16        W         101110111000       12             192 
 15        V         101010111000       12             180 
 10        B         111010101000       12             120 
  5        X       11101010111000       14              70 
  3        Q     1110111010111000       16              48 
  3        K         111010111000       12              36 
  2        J     1011101110111000       16              32 
  1        Z       11101110101000       14              14
1000                                           Total  9076
Average:                             11.23            9.076
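
The "Morse code" column is the keying pattern written out one time unit per digit: 1 is key down, 0 is key up, a dot is one unit, a dash is three, the gap between marks inside a letter is one unit, and each letter ends with a three-unit gap. The sketch below rebuilds that column from ordinary dot-dash notation and recomputes the totals. The names MORSE, FREQ and timing are made up for this example, and the counts are copied from the table above.

```python
# International Morse code for the 26 letters.
MORSE = {
    'A': '.-', 'B': '-...', 'C': '-.-.', 'D': '-..', 'E': '.', 'F': '..-.',
    'G': '--.', 'H': '....', 'I': '..', 'J': '.---', 'K': '-.-', 'L': '.-..',
    'M': '--', 'N': '-.', 'O': '---', 'P': '.--.', 'Q': '--.-', 'R': '.-.',
    'S': '...', 'T': '-', 'U': '..-', 'V': '...-', 'W': '.--', 'X': '-..-',
    'Y': '-.--', 'Z': '--..',
}

# Occurrences per 1000 letters, copied from the table above.
FREQ = {'E': 130, 'T': 92, 'N': 79, 'R': 76, 'O': 75, 'A': 74, 'I': 74,
        'S': 61, 'D': 42, 'L': 36, 'H': 34, 'C': 31, 'F': 28, 'P': 27,
        'U': 26, 'M': 25, 'Y': 19, 'G': 16, 'W': 16, 'V': 15, 'B': 10,
        'X': 5, 'Q': 3, 'K': 3, 'J': 2, 'Z': 1}

def timing(letter):
    """Keying pattern: dot = 1, dash = 111, gap inside a letter = 0, letter end = 000."""
    marks = ['1' if symbol == '.' else '111' for symbol in MORSE[letter]]
    return '0'.join(marks) + '000'

total_units = sum(FREQ[c] * len(timing(c)) for c in FREQ)
weighted = total_units / sum(FREQ.values())                     # typical English
unweighted = sum(len(timing(c)) for c in MORSE) / len(MORSE)    # random letters
print(total_units, weighted, round(unweighted, 2))
```

This reproduces the table's total of 9076 units per 1000 letters and its unweighted average of 11.23 units per letter.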

(What source text was used to calculate these frequencies? I suspect the table comes from The Art & Skill of Radio-Telegraphy by William G. Pierpont, N0HFF: http://www.zerobeat.net/tasrt/c28.htm )

"From the above, if we take five times the above average letter length and add the space required for word spacing (seven total or 0000000) we arrive at the normal English word length as 5 x 9.076 + 4 = 49.38 ... [approximately] 50 units per standard word. (By contrast, a random five-letter group averages [ 5 x 11.23 + 4 = ] 60.15 units. This is 20.3% longer than normal English word length.)"
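
The arithmetic in that quotation can be checked in a few lines. This is only a sketch of the calculation as quoted, and the constant and function names are invented here. Each letter already ends in a three-unit gap, so a seven-unit word gap adds four more units. Note that the quoted 20.3% is measured against the rounded 50-unit word. Measured against the unrounded 49.38 units, the difference is about 21.8%.

```python
AVG_ENGLISH = 9.076    # frequency-weighted units per letter (table above)
AVG_RANDOM = 11.23     # unweighted units per letter (table above)
LETTERS_PER_WORD = 5
EXTRA_WORD_GAP = 4     # letter already ends in 000; a word gap is 0000000

def units_per_word(avg_letter_units):
    """Units for a five-letter word followed by a word space."""
    return LETTERS_PER_WORD * avg_letter_units + EXTRA_WORD_GAP

english = units_per_word(AVG_ENGLISH)
random_group = units_per_word(AVG_RANDOM)
print(round(english, 2), round(random_group, 2))
print(round(100 * (random_group / 50 - 1), 1))        # vs. the rounded 50-unit word
print(round(100 * (random_group / english - 1), 1))   # vs. the unrounded figure
```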

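Common sequences by length (counted in characters, including the adjacent spaces, written _ below, so a four-letter word with a space on each side has length 6):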

6: that, with, have, this, will, your, from, they, know, want, been, good, much, some, time, very, when, come, here, just.

5: the, and, for, are, but, not, you, all, any, can, had, her, was, one, our, out, day, get, has, him.

4: of, to, in, it, is, be, as, at, so, we, he, by, or, on, do, if, me, my, up, an.

3: the, ing, and, ion, ent, for, tio, ere, her, ate, ver, ter, tha, ati, hat, ers, his, res, ill, are, _A_, _I_.

2: th, he, an, re, er, in, on, at, nd, st, es, en, of, te, ed, or, ti, hi, as, to, _T, _A, _O, _S, _W, e_, s_, o_, t_, ll, ee, ss, oo, tt, ff, rr, nn, pp, cc.

1: e, t, a, o, n, r, i, s, h, d, l, f, c, m, u, g, y, p, w, b, v, k, x, j, q, z.

                 Common        Most likely 
  Num    ASCII   Huffman       followed by:
    0 100 0101 E 011           er, es, en, ed, e_, ee
    1 101 0100 T 000           th, te, ti, to, t_, tt

   10 100 0001 A 1100          an, at, as
   11 100 1111 O 1101          on, of, or, o_, oo

  100 100 1110 N 1010          nd, nn
  101 101 0010 R 0101          re, rr
  110 100 1001 I 1110          in
  111 101 0011 S 1011          st, s_, ss

 1000 100 1000 H 0011          he, hi
 1001 100 0100 D 0010          
 1010 100 1100 L 10011
 1011 100 0110 F 10010
 1100 100 0011 C 01001
 1101 100 1101 M 010000
 1110 101 0101 U 111100
 1111 100 0111 G 111101

10000 101 1001 Y 100011
10001 101 0000 P 100000
10010 101 0111 W 100010
10011 100 0010 B 111110
10100 101 0110 V 1111111
10101 100 1011 K 11111100
10110 101 1000 X 111111011
10111 100 1010 J 1111110100
11000 101 0001 Q 11111101010
11001 101 1010 Z 11111101011
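
For comparison, here is a minimal sketch of how a Huffman code can be built from a frequency table, using the Lewand letter shares quoted earlier on this page. The names LETTERS, SHARES, LEWAND and huffman_codes are invented for this example. The output is not expected to match the table above. Huffman codes depend on the exact counts and on how ties are broken, and the table leaves two code words (010001 and 100001) unassigned, so it was presumably built over a larger symbol set.

```python
import heapq
from itertools import count

# Letter shares a..z, copied from the Lewand table earlier on this page.
LETTERS = "abcdefghijklmnopqrstuvwxyz"
SHARES = [0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228, 0.02015,
          0.06094, 0.06966, 0.00153, 0.00772, 0.04025, 0.02406, 0.06749,
          0.07507, 0.01929, 0.00095, 0.05987, 0.06327, 0.09056, 0.02758,
          0.00978, 0.02360, 0.00150, 0.01974, 0.00074]
LEWAND = dict(zip(LETTERS, SHARES))

def huffman_codes(freqs):
    """Build a binary Huffman code; returns {symbol: bit string}."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: ''}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, low = heapq.heappop(heap)
        w2, _, high = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in low.items()}
        merged.update({s: '1' + c for s, c in high.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes(LEWAND)
avg_bits = sum(LEWAND[s] * len(codes[s]) for s in LEWAND) / sum(LEWAND.values())
for sym in sorted(codes, key=lambda s: (len(codes[s]), s)):
    print(sym, codes[sym])
print(f"average: {avg_bits:.2f} bits/letter (the ASCII column uses 7)")
```

Rebuilding the dictionaries at every merge is wasteful for large alphabets but keeps the sketch short. A real coder would build a tree and walk it once.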

David Vandenburg says:

_Frequency Analysis of English Vocabulary and Grammar: Based on the LOB Corpus_ by Stig Johansson and Knut Hofland (OUP, 1989, ISBN 0-19-824221-2) gives the top eighteen words and their frequencies as:
          1.  the       68315
          2.  of        35716
          3.  and       27856
          4.  to        26760
          5.  a         22744
          6.  in        21108
          7.  that      11188
          8.  is        10978
          9.  was       10499
         10.  it        10010
         11.  for        9299
         12.  he         8776
         13.  as         7337
         14.  with       7197
         15.  be         7186
         16.  on         7027
         17.  I          6696
         18.  his        6266

_The American Heritage Word Frequency Book_ by John B. Carroll, Peter Davies, and Barry Richman (Houghton Mifflin, 1971, ISBN 0-395-13570-2) gives the top 300 words in order of frequency and in groups of 100 as:

the of and a to in is you that it he for was on are as with his they
at be this from I have or by one had not but what all were when we
there can an your which their said if do will each about how up out
them then she many some so these would other into has more her two
like him see time could no make than first been its who now people
my made over did down only way find use may water long little very
after words called just where most know

get through back much before go good new write out used me man too
any day same right look think also around another came come work
three word must because does part even place well such here take why
things help put years different away again off went old number great
tell men say small every found still between name should Mr home big
give air line set own under read last never us left end along while
might next sound below saw something thought both few those always
looked show large often together asked house don't world going want

school important until 1 form food keep children feet land side
without boy once animals life enough took sometimes four head above
kind began almost live page got earth need far hand high year mother
light parts country father let night following 2 picture being study
second eyes soon times story boys since white days ever paper hard
near sentence better best across during today others however sure
means knew it's try told young miles sun ways thing whole hear
example heard several change answer room sea against top turned 3
learn point city play toward five using himself usually

I played with substitution ciphers ages ago, and after the letter frequencies there are digraphs and trigraphs (the most common two- and three-letter groups), then the most common words. If it's typical English, then handling the first x most common words is sure to be of significant help (x being however many you have memory for).
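
A rough sketch of what "handling the first x most common words" could look like: each listed word is replaced by a single byte with the high bit set, which plain ASCII text never uses. This is only an illustration under those assumptions, not a scheme taken from any of the sources above. COMMON, TOKEN, compress and decompress are names invented here, and x is 18 simply because that is the length of the LOB list quoted earlier.

```python
import re

# The 18 most frequent words from the LOB list above ('I' kept as printed).
COMMON = ['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', 'was', 'it',
          'for', 'he', 'as', 'with', 'be', 'on', 'I', 'his']
TOKEN = {word: 0x80 + i for i, word in enumerate(COMMON)}

def compress(text):
    """Replace exact matches of common words with one byte >= 0x80 (ASCII input only)."""
    out = bytearray()
    # Split into runs of letters and single non-letter characters.
    for piece in re.findall(r'[A-Za-z]+|[^A-Za-z]', text):
        if piece in TOKEN:
            out.append(TOKEN[piece])
        else:
            out += piece.encode('ascii')
    return bytes(out)

def decompress(data):
    """Expand token bytes back into words; other bytes are plain ASCII."""
    return ''.join(COMMON[b - 0x80] if b >= 0x80 else chr(b) for b in data)

message = "it was the best of times, it was the worst of times"
packed = compress(message)
assert decompress(packed) == message
print(len(message), "->", len(packed), "bytes")
```

Only exact matches are replaced, so a capitalised "The" is left alone. That keeps the round trip lossless at the cost of missing some words.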

From the tiny handbook I made up ages ago:
Bigrams: Th, he, an, re, er, in, on, at, nd, st, es, en, of, te, ed, or, ti, hi, as, to.

Trigrams: The, ing, and, ion, ent, for, tio, ere, her, ate, ver, ter, tha, ati, hat, ers, his, res, ill, are.

Reversals: Er / re, es / se, an / na, ti / it, on / no.

Doubles: ll, ee, ss, oo, tt, ff, rr, nn, pp, cc.

One letter words: A, I.

Two letter words: Of, to, in, it, is, be, as, at, so, we, he, by, or, on, do, if, me, my, up, an.

Three letter words: The, and, for, are, but, not, you, all, any, can, had, her, was, one, our, out, day, get, has, him.

Four letter words: That, with, have, this, will, your, from, they, know, want, been, good, much, some, time, very, when, come, here, just.

Letters: E, t, a, o, n, r, i, s, h, d, l, f, c, m, u, g, y, p, w, b, v, k, x, j, q, z.

A, e, i, o, u = 39%
I, n, r, s, t = 33%
E, t, a, o, n = 45%
E, t, a, o, n, r, i, s, h = 70%
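
These group percentages can be compared with the Lewand table quoted at the top of the page. The sketch below just adds up the relevant shares, and the names LETTERS, SHARES, LEWAND and coverage are made up for this example. With the Lewand figures the four groups come to roughly 38%, 35%, 44% and 70%, within a point or two of the handbook values, which were evidently taken from a different count.

```python
# Letter shares a..z, copied from the Lewand table at the top of the page.
LETTERS = "abcdefghijklmnopqrstuvwxyz"
SHARES = [0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228, 0.02015,
          0.06094, 0.06966, 0.00153, 0.00772, 0.04025, 0.02406, 0.06749,
          0.07507, 0.01929, 0.00095, 0.05987, 0.06327, 0.09056, 0.02758,
          0.00978, 0.02360, 0.00150, 0.01974, 0.00074]
LEWAND = dict(zip(LETTERS, SHARES))

def coverage(letters):
    """Combined share of the given letters, as a percentage."""
    return 100 * sum(LEWAND[ch] for ch in letters.lower())

for group in ("aeiou", "inrst", "etaon", "etaonrish"):
    print(group, round(coverage(group), 1))
```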

More than 50% of all English words begin with the letters: T, a, o, s, w.
More than 50% of all English words end with the letters: E, s, o, t.

The above listings are the most significantly common, rated by *frequency of occurrence*.
These are all from multiple published sources, not my own making. I did write some (Pascal) programs to gather this data from supplied text files, and my results matched the published ones quite well.
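
The Pascal programs are not reproduced here. As a rough idea of what such a counting program does, here is a short Python sketch. The names ngram_counts and word_counts and the sample sentence are made up for this example. It counts only letter groups that fall inside a single word, which is one of several reasonable conventions.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count letter n-grams that lie entirely inside a word (case-insensitive)."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(word[i:i + n] for i in range(len(word) - n + 1))
    return counts

def word_counts(text):
    """Count words, keeping an internal apostrophe as in don't or it's."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))

text = "The weather there is better than the weather here"
print(ngram_counts(text, 2).most_common(3))   # bigrams
print(ngram_counts(text, 3).most_common(3))   # trigrams
print(word_counts(text).most_common(2))       # words
```

Doubles and reversals fall out of the bigram counts: doubles are the entries whose two letters are equal, and a reversal pair is a bigram together with its mirror image.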

Well, like I said, this is only of any use if you assume common English. If your expected data is consistent in its vocab & usage, you could make your own stats.

(Updated 2007-01-17)
