Russell McMahon wrote:
>
> Doesn't get much more OT than this but I'm stuck so I'm "casting my net
> fairly wide"..
>
> I need a program to count the occurrence of each separate word in a
> document.
> e.g. how many of each of "why", "what", "elephant", "aardvark" etc. (NOT just
> the total number of words).


Hi Russell, this is a fairly specialized task. It is
not that hard to do in C, but you are going to need
large buffers. You can keep a table of the words seen
so far, each with its own counter, much like the
dictionary built by the Lempel-Ziv algorithm (ZIP).
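
A minimal sketch of the C approach, assuming a fixed-size
table with one count and one string per word. The names
(count_word, count_text, get_count) and the limits
MAX_WORDS and MAX_LEN are made up for illustration, and
the lookup is a plain linear search, so treat it as a
starting point rather than a finished program:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORDS 20000  /* assumed upper bound on distinct words */
#define MAX_LEN   28     /* assumed longest stored word, incl. terminator */

struct entry {
    unsigned int count;  /* occurrences seen so far */
    char word[MAX_LEN];  /* lowercase word, NUL-terminated */
};

static struct entry table[MAX_WORDS];
static size_t n_words = 0;

/* Record one occurrence of word (already lowercase and shorter than
   MAX_LEN). Returns 0 on success, -1 if the table is full. */
int count_word(const char *word)
{
    size_t i;
    for (i = 0; i < n_words; i++) {      /* linear search: simple but slow */
        if (strcmp(table[i].word, word) == 0) {
            table[i].count++;
            return 0;
        }
    }
    if (n_words == MAX_WORDS)
        return -1;
    strcpy(table[n_words].word, word);   /* safe: caller limits the length */
    table[n_words].count = 1;
    n_words++;
    return 0;
}

/* Split text into runs of letters, lowercase them, and count each word.
   Letters beyond MAX_LEN - 1 in one word are dropped. */
int count_text(const char *text)
{
    char buf[MAX_LEN];
    size_t len = 0;
    const char *p;
    for (p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        if (c != '\0' && isalpha(c)) {
            if (len < MAX_LEN - 1)
                buf[len++] = (char)tolower(c);
        } else {
            if (len > 0) {               /* end of a word: store it */
                buf[len] = '\0';
                if (count_word(buf) != 0)
                    return -1;
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return 0;
}

/* Return the count for a lowercase word, or 0 if it was never seen. */
unsigned int get_count(const char *word)
{
    size_t i;
    for (i = 0; i < n_words; i++)
        if (strcmp(table[i].word, word) == 0)
            return table[i].count;
    return 0;
}
```

To use it, read the document in chunks, pass each chunk to
count_text(), then walk the table and print each word with
its count. A word split across two chunks would be counted
as two words, so break chunks at whitespace.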

But you really need to specify a few details:
* Total size of document (characters)?
* Total number of words you need logged?
* Do you need EVERY word logged?

If the total number of logged words is reasonable,
the job is much easier. Note that even in well-written
C on a fast Pentium this is going to take a LONG
time if you want every word logged, since each word
read has to be looked up in the table. With maybe
10,000 to 20,000 words in a language at maybe 32
bytes each for count and string, that is 320,000 to
640,000 bytes of table, so you have memory issues
even in C on a big computer.
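
The memory estimate above can be checked with a few lines
of C. The struct layout here (a 4-byte count plus a
28-byte string buffer) is only an assumption about how
the 32 bytes might be split, and table_bytes is a
hypothetical helper:

```c
#include <stddef.h>

/* Hypothetical entry layout: count plus string, 32 bytes on
   platforms where unsigned int is 4 bytes. */
struct word_slot {
    unsigned int count;
    char text[28];
};

/* Bytes needed for a table of n entries of entry_size bytes each. */
size_t table_bytes(size_t n, size_t entry_size)
{
    return n * entry_size;
}
```

With 32-byte entries, table_bytes(10000, 32) gives 320,000
bytes and table_bytes(20000, 32) gives 640,000 bytes.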

Have you looked at Solway's "bigtext"?? This is
a shareware program that may compress your large
document to about 20% to 25% of its size, but I
suspect you need more...
-Roman
