At 09:12 PM 27/11/2002 -0600, you wrote:
>Ftp it to a UNIX box and use sort?  8-)

Actually, I could have done that, but this XP box has far more disk
space, a much faster processor and roughly double the RAM of the
Linux (Debian) box that I have available.

>Or import into Excel and sort it, if Excel will take it.  Or Access.  Or
>install the Cygwin tools.  Or install Perl and write a one-liner.  That's
>a joke, see, *ALL* Perl programs can be written as one-liners.

Cygwin was the solution.  It seems it's better not to re-invent the
wheel.  GNU sort can already handle this sort of thing.  I like having bash
around anyway.  It's way better than DOS, well... for me.

>After the second margarita I find the possibilities are endless.  I doubt
>any of the M$ Office products will work well on a 3GB file,

Most MS tools choke on that kind of file size.

DOS sort was what I originally tried, but it turned out to be case
insensitive, and case sensitivity was very important here.  It turns out
that the GNU sort that comes with Cygwin is case sensitive, and therefore
better.
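
To show why case sensitivity matters for a count like this, here is a
minimal Python sketch.  It is not the actual sort/uniq run; count_sorted
and the sample list are made up for illustration.  It sorts the lines and
then counts runs of equal lines, once comparing strings exactly and once
folding case first.

```python
from itertools import groupby

def count_sorted(lines, key=None):
    """Sort lines, then count runs of equal keys (sort followed by a run count).

    Hypothetical helper for illustration only.
    """
    ordered = sorted(lines, key=key)
    return [(k, len(list(group))) for k, group in groupby(ordered, key=key)]

sample = ["The", "the", "THE", "the"]  # made-up sample data

# Python compares strings by code point, so this comparison is case sensitive.
case_sensitive = count_sorted(sample)
# -> [('THE', 1), ('The', 1), ('the', 2)]

# Folding case before comparing merges all three spellings into one count.
case_insensitive = count_sorted(sample, key=str.lower)
# -> [('the', 4)]
```

With case folded, three distinct sequences collapse into a single entry,
which is the kind of result a case-insensitive sort would lead to.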

>If you do not care about duplicates you can skip uniq.

Actually, duplicates are important.  They're why I'm doing this to start
with.  I'm looking for the frequency of occurrence of character
combinations.  I already wrote the software to generate the character
sequences; it reads standard input and writes standard output, which made
the buffering issues of the programme non-existent.
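
For anyone who wants to try something similar, a rough Python sketch of
such a generator follows.  It is not the original programme.  The names
sequences and main, the default window of 5, and the choice to turn
newlines into spaces are all assumptions made for the example.

```python
import sys

def sequences(chunks, n=5):
    """Yield every n-character window across an iterable of text chunks."""
    tail = ""
    for chunk in chunks:
        buf = tail + chunk
        for i in range(len(buf) - n + 1):
            yield buf[i:i + n]
        # Keep the last n-1 characters so windows that span two chunks
        # are not lost and no window is emitted twice.
        tail = buf[-(n - 1):] if n > 1 else ""

def main():
    # Read text from standard input, write one sequence per line.
    # Assumption: newlines inside a sequence are shown as spaces so that
    # each output line holds exactly one sequence.
    for seq in sequences(sys.stdin):
        sys.stdout.write(seq.replace("\n", " ") + "\n")

# Chunk boundaries do not change the result:
windows = list(sequences(["ab", "cdefg"]))
# -> ['abcde', 'bcdef', 'cdefg']
```

Calling main() from a script gives a filter whose output can be piped
straight into sort, so only a few characters are held in memory at once.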

The entire sort and frequency count of the largest file took 2hr 43m to
complete.  Not bad, I think.

Thanks for all the suggestions,

Brendan

P.S. Who wants to know what the most common 5-character sequence in the
English language is?  It is (get this) " and ".  That's right.  Space, and,
space.
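
A small in-memory Python sketch of the same kind of count, for anyone
curious.  The sample sentence is made up and is not from the real data,
and count_ngrams is a hypothetical name.  Holding every count in memory
like this would not scale to a 3GB file, which is where an external sort
earns its keep.

```python
from collections import Counter

def count_ngrams(text, n=5):
    """Count every overlapping n-character sequence in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Made-up sample text, chosen only to illustrate the idea.
counts = count_ngrams("rock and roll and salt and pepper")
top = counts.most_common(1)
# -> [(' and ', 3)]
```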
