I did this a while ago in VB as a proof of concept of a simple
compression of large texts (putting the Book of Mormon on a TI-89).
Using the output of the program (example, not real data)

THE 2343
A 1000
I 500

I was able to pick two places where I could optimally use different bit
lengths to represent each word. So a bit sequence starting with 0 was a
total of 6 bits in length, and represented about 45% of the words in the
book. A bit sequence starting with 00 was 8 total bits, and 01 was 10
total bits. I never got to the point of creating a program to compress
and decompress the text(s), as there are several other programs out
there which do this very well using several other optimized algorithms,
but it was a fun diversion for an afternoon or two.

At any rate (I cannot locate the code for it), it was essentially as
follows:

Allocate a two-dimensional string array for 50,000 elements (which
usually exceeds the number of distinct words you'll need to account for;
if not, up it), where each element holds a string and an integer (or
long, etc).

Determine what ends a word (in my case, any whitespace, return, form
feed, or punctuation). Take care to preserve hyphenated words if
necessary.

Parse each word in the text. In VB it was easy: several INSTR (In
String) statements in a row, and the one returning the lowest non-zero
value gave you the end of the first word in the sentence.

Pass each word to a function which searches the array for the word. If
found, it increments the integer element. If not found, it adds the word
to the end of the list and puts a 1 in the integer element.

Parse the next word until the end of the text.

You'll have to deal with punctuation and case issues, but that is pretty
straightforward.

If you're dealing with especially large amounts of text, you can re-sort
the array by count (highest first) every hundred thousand parsed words
or so.
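
The steps above can be sketched in Python roughly as follows. This is a
minimal illustration, not the original VB code: the names count_words,
WORD_RE and resort_every are made up for the example, and the word
pattern (letters and digits, with internal hyphens or apostrophes kept)
is only one assumption about what counts as a word.

```python
import re
import unicodedata

# Assumption: a word is a run of letters/digits, optionally joined by
# hyphens or apostrophes, so "yopor-aropan" stays a single word.
WORD_RE = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def count_words(text, resort_every=100_000):
    """Return a list of [word, count] pairs, most frequent first."""
    # Compose accented letters so that "a" + combining accent matches "á".
    text = unicodedata.normalize("NFC", text)
    table = []   # each entry is [word, count], like the two-column array
    parsed = 0
    for match in WORD_RE.finditer(text):
        word = match.group().lower()   # fold case so "The" and "the" agree
        for entry in table:            # linear search through the list
            if entry[0] == word:
                entry[1] += 1          # found: increment the count
                break
        else:
            table.append([word, 1])    # not found: add at the end with 1
        parsed += 1
        if parsed % resort_every == 0:
            # Periodic re-sort, highest count first (the sort is stable).
            table.sort(key=lambda e: e[1], reverse=True)
    table.sort(key=lambda e: e[1], reverse=True)
    return table
```

Calling count_words(text) on a string returns the pairs sorted by count.
Because Python strings are Unicode, accented characters survive as long
as the file is read with the right encoding. The resort_every argument
corresponds to the periodic ordering of the array mentioned above.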
That way the most common words will be found sooner when the list is
searched.

My routine (in VB on a PII 300) parsed, IIRC, a 4MB text file in about
20 minutes, which isn't terrible by most standards for a proof of concept.

I hope this helps!

-Adam

Russell McMahon wrote:

>Doesn't get much more OT than this but I'm stuck so I'm "casting my net
>fairly wide"..
>
>I need a program to count the occurrence of each separate word in a
>document.
>eg how many of each of "why" , "what", "elephant", "aardvark" etc (NOT just
>the total number of words).
>
>I've done a substantial web search and there seem to be some such available
>but those I can find are written for other platforms than mine.
>This seems to be a favourite university assignment but the results don't
>seem to get posted.
>I could write one myself but the task seems less trivial than it first
>appeared and to do so MUST be reinventing the wheel.
>
>So - does anyone know where I can get such a program (preferably free).
>
>Source      Microsoft Word
>            or a format that it can save in
>            (eg HTML, RTF etc).
>            NOT plain ASCII
>            as there are special characters
>            which would get lost.
>
>Platform    PC / Windows (sorry)
>
>
>Here's a sample of the text to be converted (with some of the accented
>characters missing due to being in plain text.)
>
> Moses ntiaka Josua
> 32 Kwaro Mosesén taokeyakámp kar
> Kisim Bek 1-2.
> Kwaromp arop tokwae Josep, tá maomp naeounáp kuri wokwaek surumpwiapono.
> Aeapo maok, maomp yçri feknámp forokorinapao kápae kare fourouria kantri
> Isip mek fopwenápia akwap. Tá wakmwaek Isip fárákapan poukeyakámp wourékam
> king Fero te Josepén mér mo. Aeria mao nkenámp fek te Israel fárákap te
> kápae karea wae, tá Isip fárákap sérrá, "Am fárákap te kápae karea wae
> napon. Tá ankwap fi arop koropea nomo yorowar napo te Israelo kuri nomp
> yopor-aropan yaewouria nomwan yorowar mwarea napon.
> Ae mwarea napara nomo
> man kápae nénk nánko, nomp yae ankore mek yakápanápono."
>
>The language is native to a few thousand people in Western highland Papua
>New Guinea just next to the border with Irian Jaya. Anyone who has read this
>far can probably hazard a guess as to the nature of the content from this
>information and the somewhat English sounding nature of a few of the words.
>This extract is from a set of Old Testament stories and there are other
>related documents.
>
>
>TIA
>
>        Russell McMahon
>
>--
>http://www.piclist.com hint: The list server can filter out subtopics
>(like ads or off topics) for you. See http://www.piclist.com/#topics