I did this a while ago in VB as a proof of concept for a simple
compression of large texts (putting the Book of Mormon on a TI-89).
Using the output of the program
(example, not real data)
THE 2343
A 1000
I 500

I was able to pick two places where I could optimally use different code
lengths to represent each word.  So a bit sequence starting with 0 was a
total of 6 bits in length, and represented about 45% of the words in the
book.  A bit sequence starting with 00 was 8 total bits, and 01 was 10
total bits.  I never got to the point of creating a program to compress
and decompress the text(s), as there are several other programs out
there which do this very well using other, better-optimized algorithms,
but it was a fun diversion for an afternoon or two.
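
To make the idea of picking cut points concrete, here is a minimal Python
sketch that reports what share of all word occurrences falls into each
code-length tier, given a frequency table like the example output. The
default tier sizes (32, 64 and 256 words) are an assumption for this sketch:
they are what 6-, 8- and 10-bit codes could index with non-overlapping
prefixes such as 0, 10 and 11, which is not necessarily the original scheme.
The function name and the sample counts are made up for illustration.

```python
def tier_coverage(counts, tier_sizes=(32, 64, 256)):
    """Return the fraction of all word occurrences covered by each tier.

    counts: dict mapping word -> number of occurrences.
    tier_sizes: how many distinct words each tier can index (assumed values).
    """
    total = sum(counts.values())
    # Most frequent words first, so they land in the shortest-code tier.
    ranked = sorted(counts.values(), reverse=True)
    fractions = []
    start = 0
    for size in tier_sizes:
        covered = sum(ranked[start:start + size])
        fractions.append(covered / total if total else 0.0)
        start += size
    return fractions


# Made-up counts in the spirit of the example output (not real data).
sample = {"THE": 2343, "A": 1000, "I": 500, "ZION": 7}
print(tier_coverage(sample, tier_sizes=(1, 2, 1)))
```

Trying a few different tier sizes against the real counts shows where the
cut points pay off.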

At any rate (I cannot locate the code for it), it was essentially as
follows:
Allocate a two-dimensional array of 50,000 elements (which usually
exceeds the number of distinct words you'll need to account for; if not,
increase it), where each element holds a string and an integer (or long, etc).
Determine what delimits a word (in my case, it was any whitespace, return,
form feed, or punctuation.  Take care to preserve hyphenated words if
necessary).
Parse each word in the text (in VB it was easy: several INSTR (In
String) calls in a row, and the one returning the lowest non-zero value
gave you the end of the first word in the sentence).
Pass each word to a function which searches the array for the word.  If
found, it increments the integer element.  If not found, it adds the word to
the end of the list and puts a 1 in the integer element.
Parse the next word until the end of the text.
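
The steps above can be sketched in Python roughly as follows. This is not
the original VB code (which is lost), just an illustration of the same
approach: a list of [word, count] pairs searched linearly. The delimiter set
and the names DELIMITERS, split_words and count_words are assumptions chosen
for the example.

```python
import re

# Assumed delimiters: whitespace (including return and form feed) and common
# punctuation. The hyphen is deliberately left out so hyphenated words survive.
DELIMITERS = re.compile(r'[\s.,;:!?"()\[\]]+')


def split_words(text):
    """Split text on the delimiter set, dropping empty pieces."""
    return [w for w in DELIMITERS.split(text) if w]


def count_words(text):
    """Count each distinct word using a list of [word, count] pairs."""
    table = []  # stands in for the two-column array
    for word in split_words(text):
        for entry in table:      # linear search through the words seen so far
            if entry[0] == word:
                entry[1] += 1
                break
        else:                    # not found: append with a count of 1
            table.append([word, 1])
    return table


for word, n in count_words("the cat saw the well-fed cat. The end"):
    print(word, n)
```

In Python itself, collections.Counter(split_words(text)) would replace the
linear search with a hash lookup; the list version is shown only because it
mirrors the array approach described here.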

You'll have to deal with punctuation and case issues, but that is pretty
straightforward.
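
One possible way to handle the case and punctuation issues, sketched in
Python: strip anything that is not a letter or digit from both ends of a
token and fold the case, while leaving interior hyphens and accented letters
alone. The function name normalize is made up for this example, and whether
case should be folded at all depends on the text.

```python
import unicodedata


def normalize(token):
    """Trim surrounding punctuation and fold case; keep interior hyphens."""
    # Compose accented letters so "a" + combining accent equals the single "á".
    token = unicodedata.normalize("NFC", token)
    start, end = 0, len(token)
    while start < end and not token[start].isalnum():
        start += 1
    while end > start and not token[end - 1].isalnum():
        end -= 1
    return token[start:end].casefold()


print(normalize('"Am'))           # leading quote removed, case folded
print(normalize("yopor-aropan"))  # interior hyphen kept
```

A token that comes back empty (for example a stray dash) can simply be
skipped before counting.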

If you're dealing with especially large amounts of text, you can sort
the array by count every hundred thousand parsed words or so.  That way the
most common words end up near the front of the list and are found faster in
the searches.
My routine (in VB on a PII 300) parsed, IIRC, a 4MB text file in about
20 minutes, which isn't terrible by most standards for a proof of concept.
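
The periodic re-ordering could look something like this in Python. It is a
sketch under the same assumptions as the description (a list of
[word, count] pairs searched from the front); the name count_with_resort and
the resort_every parameter are hypothetical.

```python
def count_with_resort(words, resort_every=100_000):
    """Count words with a linear search, re-sorting the table periodically."""
    table = []
    for i, word in enumerate(words, start=1):
        for entry in table:
            if entry[0] == word:
                entry[1] += 1
                break
        else:
            table.append([word, 1])
        if i % resort_every == 0:
            # Highest counts first, so frequent words are found early.
            table.sort(key=lambda e: e[1], reverse=True)
    return table


print(count_with_resort(["b", "a", "a", "a"], resort_every=4))
```

The sort only changes the search order, not the counts, so the interval can
be tuned freely.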


I hope this helps!

-Adam

Russell McMahon wrote:

>Doesn't get much more OT than this but I'm stuck so I'm "casting my net
>fairly wide"..
>
>I need a program to count the occurrence of each separate word in a
>document.
>eg how many of each of "why", "what", "elephant", "aardvark" etc (NOT just
>the total number of words).
>
>I've done a substantial web search and there seem to be some such available
>but those I can find are written for other platforms than mine.
>This seems to be a favourite university assignment but the results don't
>seem to get posted.
>I could write one myself but the task seems less trivial than it first
>appeared and to do so MUST be reinventing the wheel.
>
>So - does anyone know where I can get such a program (preferably free).
>
>Source             Microsoft Word, or a format that it can save in
>                   (eg HTML, RTF etc).
>                   NOT plain ASCII, as there are special characters
>                   which would get lost.
>
>Platform           PC  / Windows (sorry)
>
>
>Here's a sample of the text to be converted (with some of the accented
>characters missing due to being in plain text.)
>
>  Moses ntiaka Josua
>  32 Kwaro Mosesén taokeyakámp kar
>  Kisim Bek 1-2.
>  Kwaromp arop tokwae Josep, tá maomp naeounáp kuri wokwaek surumpwiapono.
>  Aeapo maok, maomp yçri feknámp forokorinapao kápae kare fourouria kantri
>  Isip mek fopwenápia akwap. Tá wakmwaek Isip fárákapan poukeyakámp wourékam
>  king Fero te Josepén mér mo. Aeria mao nkenámp fek te Israel fárákap te
>  kápae karea wae, tá Isip fárákap sérrá, "Am fárákap te kápae karea wae
>  napon. Tá ankwap fi arop koropea nomo yorowar napo te Israelo kuri nomp
>  yopor-aropan yaewouria nomwan yorowar mwarea napon. Ae mwarea napara nomo
>  man kápae nénk nánko, nomp yae ankore mek yakápanápono."
>
>The language is native to a few thousand people in Western highland Papua
>New Guinea just next to the border with Irian Jaya. Anyone who has read this
>far can probably hazard a guess as to the nature of the content from this
>information and the somewhat English sounding nature of a few of the words.
>This extract is from a set of Old Testament stories and there are other
>related documents.
>
>
>TIA
>
>            Russell McMahon
>
>--
>http://www.piclist.com hint: The list server can filter out subtopics
>(like ads or off topics) for you. See http://www.piclist.com/#topics
