Data compression algorithms typically take advantage of the very non-uniform distribution of letters, letter pairs, and words in natural languages.
Letter  Frequency    Letter  Frequency
a       0.08167      n       0.06749
b       0.01492      o       0.07507
c       0.02782      p       0.01929
d       0.04253      q       0.00095
e       0.12702      r       0.05987
f       0.02228      s       0.06327
g       0.02015      t       0.09056
h       0.06094      u       0.02758
i       0.06966      v       0.00978
j       0.00153      w       0.02360
k       0.00772      x       0.00150
l       0.04025      y       0.01974
m       0.02406      z       0.00074
Sorted by frequency:
Letter  Frequency    Letter  Frequency
e       0.12702      m       0.02406
t       0.09056      w       0.02360
a       0.08167      f       0.02228
o       0.07507      g       0.02015
i       0.06966      y       0.01974
n       0.06749      p       0.01929
s       0.06327      b       0.01492
h       0.06094      v       0.00978
r       0.05987      k       0.00772
d       0.04253      j       0.00153
l       0.04025      x       0.00150
c       0.02782      q       0.00095
u       0.02758      z       0.00074
-- from Cryptological Mathematics, by Robert Edward Lewand.
See also Wikipedia: Letter frequencies.
(The frequency of letters in English text varies from one source text to another -- you can see that the text used in the charts below had a slightly different distribution.)
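To put a number on how non-uniform the distribution is, the single-letter table above can be turned into a Shannon entropy estimate. The following is a minimal sketch, assuming the Lewand figures above and ignoring spaces, punctuation, and any dependence between neighbouring letters; the names `LETTER_FREQ` and `entropy_bits` are made up for this illustration.

```python
import math

# Letter frequencies copied from the first table above (Lewand's figures).
LETTER_FREQ = {
    "a": 0.08167, "b": 0.01492, "c": 0.02782, "d": 0.04253, "e": 0.12702,
    "f": 0.02228, "g": 0.02015, "h": 0.06094, "i": 0.06966, "j": 0.00153,
    "k": 0.00772, "l": 0.04025, "m": 0.02406, "n": 0.06749, "o": 0.07507,
    "p": 0.01929, "q": 0.00095, "r": 0.05987, "s": 0.06327, "t": 0.09056,
    "u": 0.02758, "v": 0.00978, "w": 0.02360, "x": 0.00150, "y": 0.01974,
    "z": 0.00074,
}

def entropy_bits(freqs):
    """Shannon entropy, in bits per symbol, of a table of relative frequencies."""
    total = sum(freqs.values())  # normalise in case the table does not sum to exactly 1
    return -sum((f / total) * math.log2(f / total) for f in freqs.values() if f > 0)

# Cost per letter if all 26 letters were equally likely.
uniform_bits = math.log2(len(LETTER_FREQ))
# Best average achievable by a code that uses only single-letter frequencies.
english_bits = entropy_bits(LETTER_FREQ)

print(f"uniform: {uniform_bits:.2f} bits/letter, this table: {english_bits:.2f} bits/letter")
```

With these figures the result comes out at roughly 4.2 bits per letter against about 4.7 for a uniform alphabet. Letter pairs and whole words, covered further down, allow more savings than single letters alone.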
Freq.  Letter  Morse code        Morse length  Total
130    E       1000              4             520
92     T       111000            6             552
79     N       11101000          8             632
76     R       1011101000        10            760
75     O       11101110111000    14            1050
74     A       10111000          8             592
74     I       101000            6             444
61     S       10101000          8             488
42     D       1110101000        10            420
36     L       101110101000      12            432
34     H       1010101000        10            340
31     C       11101011101000    14            434
28     F       101011101000      12            336
27     P       10111011101000    14            378
26     U       1010111000        10            260
25     M       1110111000        10            250
19     Y       1110101110111000  16            304
16     G       111011101000      12            192
16     W       101110111000      12            192
15     V       101010111000      12            180
10     B       111010101000      12            120
5      X       11101010111000    14            70
3      Q       1110111010111000  16            48
3      K       111010111000      12            36
2      J       1011101110111000  16            32
1      Z       11101110101000    14            14
1000   Total                                   9076
       Average:                  11.23         9.07
(What source text was used to calculate these frequencies?) (I suspect the source text was The Art & Skill of Radio-Telegraphy by William G. Pierpont N0HFF http://www.zerobeat.net/tasrt/c28.htm )
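The chart's lengths and averages, and the word-length arithmetic in the quotation that follows, can be re-derived with a short script. This is only a sketch: it assumes the chart's notation (dot = 1, dash = 111, a single 0 between elements, 000 after each letter) and the per-1000 counts exactly as printed above. The names `MORSE`, `FREQ_PER_1000` and `to_units` are made up for this illustration.

```python
# International Morse patterns for A-Z.
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..",
}

# Occurrences per 1000 letters, copied from the chart above.
FREQ_PER_1000 = {
    "E": 130, "T": 92, "N": 79, "R": 76, "O": 75, "A": 74, "I": 74, "S": 61,
    "D": 42, "L": 36, "H": 34, "C": 31, "F": 28, "P": 27, "U": 26, "M": 25,
    "Y": 19, "G": 16, "W": 16, "V": 15, "B": 10, "X": 5, "Q": 3, "K": 3,
    "J": 2, "Z": 1,
}

def to_units(pattern):
    """Render a dot/dash pattern in the chart's notation:
    dot = 1, dash = 111, one 0 between elements, 000 after the letter."""
    return "0".join("1" if mark == "." else "111" for mark in pattern) + "000"

lengths = {letter: len(to_units(p)) for letter, p in MORSE.items()}

# Average length weighted by how often each letter occurs in English text.
weighted = sum(FREQ_PER_1000[l] * lengths[l] for l in MORSE) / sum(FREQ_PER_1000.values())
# Plain average over the 26 letters, as for random letter groups.
unweighted = sum(lengths.values()) / len(lengths)

# A word gap is 7 units; the last letter already carries 3 of them, so add 4.
english_word = 5 * weighted + 4
random_group = 5 * unweighted + 4

print(f"{weighted:.3f} {unweighted:.2f} {english_word:.2f} {random_group:.2f}")
# prints: 9.076 11.23 49.38 60.15
```

Note that 60.15 / 49.38 is about 1.218, while 60.15 / 50 is about 1.203, so the quoted 20.3% appears to compare the random group against the rounded 50-unit word rather than against 49.38.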
"From the above, if we take five times the above average letter length and add the space required for word spacing (seven total or 0000000) we arrive at the normal English word length as 5 x 9.076 + 4 = 49.38 ... [approximately] 50 units per standard word. (By contrast, a random five-letter group averages [ 5 x 11.23 + 4 = ] 60.15 units. This is 20.3% longer than normal English word length.)"
6: that, with, have, this, will, your, from, they, know, want, been, good, much, some, time, very, when, come, here, just.
5: the, and, for, are, but, not, you, all, any, can, had, her, was, one, our, out, day, get, has, him.
4: of, to, in, it, is, be, as, at, so, we, he, by, or, on, do, if, me, my, up, an.
3: the, ing, and, ion, ent, for, tio, ere, her, ate, ver, ter, tha, ati, hat, ers, his, res, ill, are, _A_, _I_.
2: th, he, an, re, er, in, on, at, nd, st, es, en, of, te, ed, or, ti, hi, as, to, _T, _A, _O, _S, _W, e_, s_, o_, t_, ll, ee, ss, oo, tt, ff, rr, nn, pp, cc
1: e, t, a, o, n, r, i, s, h, d, l, f, c, m, u, g, y, p, w, b, v, k, x, j, q, z.
Num    ASCII     Common  Huffman      Most likely followed by:
0      100 0101  E       011          er, es, en, ed, e_, ee
1      101 0100  T       000          th, te, ti, to, t_, tt
10     100 0001  A       1100         an, at, as
11     100 1111  O       1101         on, of, or, o_, oo
100    100 1110  N       1010         nd, nn
101    101 0010  R       0101         re, rr
110    100 1001  I       1110         in
111    101 0011  S       1011         st, s_, ss
1000   100 1000  H       0011         he, hi
1001   100 0100  D       0010
1010   100 1100  L       10011
1011   100 0110  F       10010
1100   100 0011  C       01001
1101   100 1101  M       010000
1110   101 0101  U       111100
1111   100 0111  G       111101
10000  101 1001  Y       100011
10001  101 0000  P       100000
10010  101 0111  W       100010
10011  100 0010  B       111110
10100  101 0110  V       1111111
10101  100 1011  K       11111100
10110  101 1000  X       111111011
10111  100 1010  J       1111110100
11000  101 0001  Q       11111101010
11001  101 1010  Z       11111101011
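A Huffman column like the one above can be produced by repeatedly merging the two least frequent entries. The sketch below is one way to do that in Python. It assumes the Lewand letter frequencies from the top of the page, because the page does not say which frequencies produced the table's codes, so the exact bit patterns (and some code lengths) will probably differ from the table. `LETTER_FREQ` and `huffman_code` are names made up for this illustration.

```python
import heapq
from itertools import count

# Assumption: Lewand's letter frequencies from the top of the page.
LETTER_FREQ = {
    "a": 0.08167, "b": 0.01492, "c": 0.02782, "d": 0.04253, "e": 0.12702,
    "f": 0.02228, "g": 0.02015, "h": 0.06094, "i": 0.06966, "j": 0.00153,
    "k": 0.00772, "l": 0.04025, "m": 0.02406, "n": 0.06749, "o": 0.07507,
    "p": 0.01929, "q": 0.00095, "r": 0.05987, "s": 0.06327, "t": 0.09056,
    "u": 0.02758, "v": 0.00978, "w": 0.02360, "x": 0.00150, "y": 0.01974,
    "z": 0.00074,
}

def huffman_code(freqs):
    """Return {symbol: bit string}, built by merging the two rarest subtrees
    until a single tree is left."""
    tiebreak = count()  # unique sequence number so the heap never compares dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prefix one branch with 0 and the other with 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_code(LETTER_FREQ)
avg_bits = sum(LETTER_FREQ[s] * len(c) for s, c in codes.items())

# List the codes from shortest to longest.
for sym in sorted(codes, key=lambda s: (len(codes[s]), s)):
    print(sym, codes[sym])
print(f"average: {avg_bits:.2f} bits per letter")
```

Because no code word is a prefix of another, the bit stream can be decoded without separators. Common letters such as E get the shortest codes and rare ones such as Q and Z the longest, which is the same pattern the table shows.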
David Vandenburg says:
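_Frequency Analysis of English Vocabulary and Grammar: Based on the LOB Corpus_ by Stig Johansson and Knut Hofland (OUP, 1989, ISBN 0-19-8242212-2) gives the top eighteen words and their frequencies as:

1. the 68315
2. of 35716
3. and 27856
4. to 26760
5. a 22744
6. in 21108
7. that 11188
8. is 10978
9. was 10499
10. it 10010
11. for 9299
12. he 8776
13. as 7337
14. with 7197
15. be 7186
16. on 7027
17. I 6696
18. his 6266

_The American Heritage Word Frequency Book_ by John B. Carroll, Peter Davies, and Barry Richman (Houghton Mifflin, 1971, ISBN 0-395-13570-2) gives the top 300 words in order of frequency and in groups of 100 as:

the of and a to in is you that it he for was on are as with his they at be this from I have or by one had not but what all were when we there can an your which their said if do will each about how up out them then she many some so these would other into has more her two like him see time could no make than first been its who now people my made over did down only way find use may water long little very after words called just where most know

get through back much before go good new write out used me man too any day same right look think also around another came come work three word must because does part even place well such here take why things help put years different away again off went old number great tell men say small every found still between name should Mr home big give air line set own under read last never us left end along while might next sound below saw something thought both few those always looked show large often together asked house don't world going want

school important until 1 form food keep children feet land side without boy once animals life enough took sometimes four head above kind began almost live page got earth need far hand high year mother light parts country father let night following 2 picture being study second eyes soon times story boys since white days ever paper hard near sentence better best across during today others however sure means knew it's try told young miles sun ways thing whole hear example heard several change answer room sea against top turned 3 learn point city play toward five using himself usually

I played with substitution ciphers ages ago, and after the letter frequencies there are digraphs and trigraphs (the most common 2- and 3-letter groups), then the most common words. If it's typical English, then handling the first x most common words is sure to be of significant help (x being however many you have memory for).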
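As a rough illustration of "handling the first x most common words", here is a minimal Python sketch that swaps each of the eighteen LOB words above for a one-character token. The token scheme (characters 0x80 and up, which never occur in pure ASCII text) and the names `COMMON_WORDS`, `compress` and `expand` are assumptions made for this example, not something taken from either book.

```python
# Top eighteen words from the LOB list above, most frequent first.
COMMON_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "was",
                "it", "for", "he", "as", "with", "be", "on", "I", "his"]

# Hypothetical token scheme: word number i becomes the single character
# 0x80 + i, which cannot appear in pure-ASCII input.
ENCODE = {word: chr(0x80 + i) for i, word in enumerate(COMMON_WORDS)}
DECODE = {token: word for word, token in ENCODE.items()}

def compress(text):
    """Replace each listed word (exact match, space-delimited) by its token."""
    if not text.isascii():
        raise ValueError("this sketch only handles ASCII text")
    return " ".join(ENCODE.get(word, word) for word in text.split(" "))

def expand(packed):
    """Undo compress(): turn tokens back into words."""
    return " ".join(DECODE.get(word, word) for word in packed.split(" "))

sample = "the cat sat on the mat and it was happy"
packed = compress(sample)
print(len(sample), "->", len(packed), "characters")
# prints: 39 -> 29 characters
```

A longer word table saves more but needs more memory, which is the trade-off described above. This sketch matches whole space-delimited words only, so "The" and "the," are left untouched, and the character counts equal byte counts only if the output is stored in a one-byte-per-character encoding such as Latin-1.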
the of and a to in is you that it he for was on are as with his they at be this from I have or by one had not but what all were when we there can an your which their said if do will each about how up out them then she many some so these would other into has more her two like him see time could no make than first been its who now people my made over did down only way find use may water long little very after words called just where most know get through back much before go good new write out used me man too any day same right look think also around another came come work three word must because does part even place well such here take why things help put years different away again off went old number great tell men say small every found still between name should Mr home big give air line set own under read last never us left end along while might next sound below saw something thought both few those always looked show large often together asked house don't world going want school important until 1 form food keep children feet land side without boy once animals life enough took sometimes four head above kind began almost live page got earth need far hand high year mother light parts country father let night following 2 picture being study second eyes soon times story boys since white days ever paper hard near sentence better best across during today others however sure means knew it's try told young miles sun ways thing whole hear example heard several change answer room sea against top turned 3 learn point city play toward five using himself usuallyI played with substitution ciphers ages ago, and after the letter frequency's, there's digraphs and trigraphs (most common 2 & 3 letter groups), then most common words. If it's typical English, then handling the first x most common words is sure to be of significant help. (x being how much you have memory for).
From the tiny handbook I made up ages ago:
Bigrams: Th, he, an, re, er, in, on, at, nd, st, es, en, of, te, ed, or, ti, hi, as, to.
Trigrams: The, ing, and, ion, ent, for, tio, ere, her, ate, ver, ter, tha, ati, hat, ers, his, res, ill, are.
Reversals: Er / re, es / se, an / na, ti / it, on / no.
Doubles: ll, ee, ss, oo, tt, ff, rr, nn, pp, cc.
One letter words: A, I.
Two letter words: Of, to, in, it, is, be, as, at, so, we, he, by, or, on, do, if, me, my, up, an.
Three letter words: The, and, for, are, but, not, you, all, any, can, had, her, was, one, our, out, day, get, has, him.
Four letter words: That, with, have, this, will, your, from, they, know, want, been, good, much, some, time, very, when, come, here, just.
Letters: E, t, a, o, n, r, i, s, h, d, l, f, c, m, u, g, y, p, w, b, v, k, x, j, q, z.
A, e, i, o, u = 39%
I, n, r, s, t = 33%
E, t, a, o, n = 45%
E, t, a, o, n, r, i, s, h = 70%
More than 50% of all English words begin with the letters: T, a, o, s, w.
More than 50% of all English words end with the letters: E, s, o, t.
The above listings are the most significantly common, rated by *frequency of occurrence*.
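The group percentages above are just sums of single-letter frequencies, so they are easy to check against any letter table. A small sketch, assuming the Lewand figures from the top of the page (a different source from the handbook's, so the results differ a little); `LETTER_FREQ` and `group_share` are made-up names:

```python
# Assumption: Lewand's letter frequencies from the top of the page.
LETTER_FREQ = {
    "a": 0.08167, "b": 0.01492, "c": 0.02782, "d": 0.04253, "e": 0.12702,
    "f": 0.02228, "g": 0.02015, "h": 0.06094, "i": 0.06966, "j": 0.00153,
    "k": 0.00772, "l": 0.04025, "m": 0.02406, "n": 0.06749, "o": 0.07507,
    "p": 0.01929, "q": 0.00095, "r": 0.05987, "s": 0.06327, "t": 0.09056,
    "u": 0.02758, "v": 0.00978, "w": 0.02360, "x": 0.00150, "y": 0.01974,
    "z": 0.00074,
}

def group_share(letters):
    """Fraction of all letters in running text that fall in the given group."""
    return sum(LETTER_FREQ[c] for c in letters.lower())

for group in ("aeiou", "inrst", "etaon", "etaonrish"):
    print(f"{group}: {group_share(group):.1%}")
```

With this table the four groups come out at about 38%, 35%, 44% and 70%, close to the handbook's 39%, 33%, 45% and 70%.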
These are all from multiple published sources, not my own making. I did write some (Pascal) programs to gather this data from supplied text files, and my results matched the published ones quite well.

Well, like I said, this is only of any use if you assume common English. If your expected data is consistent in its vocab & usage, you could make your own stats.
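Gathering such statistics from your own text takes only a few lines. The sketch below is in Python and is not the Pascal programs mentioned above, which are not shown on this page. It counts single letters, bigrams and trigrams, writing "_" for any run of non-letters so that word boundaries show up the same way as in the lists above (e.g. "_t", "e_"). The names `ngram_counts` and `top` are made up for this illustration.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive characters, after lowercasing and
    collapsing each run of non-letters (spaces, digits, punctuation) to "_"."""
    cleaned = re.sub(r"[^a-z]+", "_", text.lower())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))

def top(text, n, k=10):
    """The k most common n-grams with their share of all n-grams in the text."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return [(gram, count / total) for gram, count in counts.most_common(k)]

sample = "The cat and the hat and the bat sat on the mat."
for n in (1, 2, 3):
    print(n, top(sample, n, k=5))
```

For real use, read the text from a file first, for example `text = open("corpus.txt", encoding="utf-8").read()`, where `corpus.txt` is a placeholder name. Apostrophes count as non-letters here, so "don't" becomes "don_t"; adjust the regular expression if that matters for your data.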
(Updated 2007-01-17)
Comments:
That author ran a bunch of English works of fiction and reports the stats he found. (Naturally, the space is the most frequent "letter" at 18.74%, followed by "e" at 9.60%.) Many of the most frequent "letter" pairs include a space character. The most common pairs (in order of frequency) were "e ", " t", "th", "he", which were all more common than the 13th most common single letter but less common than the 12th most common single letter.
Questions:
Is there an application that scans text documents and supports trigraphs? I saw you linked freqanalysis above, but that only outputs single-letter and bigraph combos.

cartermarcus04-yahoo-K72 replies: Never mind. I found an online one here: http://practicalcryptography.com/cryptanalysis/text-characterisation/monogram-bigram-and-trigram-frequency-counts/
Hi,
I would like to know who I can reference for the work listed here in relation to the most common English letters and most common English words (or letter combinations). Surely this is part of a published academic work?
Thanks
lloyd

James Newton replies: ISBNs for two separate books are listed above. Do you need more than that?