Issues with people not being able to reach the site, except through proxies or anonymizers

199912291515-0700
The techref and piclist.com sites were down from sometime after 199912291515 to 199912300700, the first time since installation. M$ ain't so bad!
200002022200-0630
Yeah, the damn thing screwed itself good last night.... Second time since we started. Blue screen by this morning and a lot of people said they couldn't get in. And the DNS server (a third party) had a problem yesterday. Sorry about the hassle. It's up again now.

The good news is PacBell finally got our 384kbps DSL connection up at a location where the server can be monitored 24/7. The current server only has human companionship <GRIN> 06:30 to 15:30 daily. We are testing the connection for reliability and setting up a new server machine. Then we will register a nicer domain name for the techref and move the techref and piclist sites. The old server will continue for other things and refer techref and piclist people to the new address.

I'm also experimenting with Linux (Red Hat 6.1) as a router and backup web server. I've seen some interesting things done where an NT box and a Linux box ping each other and if one dies, the other takes over. The idea is that even if something external (hacker, virus, anything that triggers an OS flaw, etc...) kills one, it's very unlikely to kill the other. Since I use a lot of ASP and 32-bit machine language MASM code, in my case the Linux box will only have a "We are experiencing technical difficulties" web page and it will spend its time screaming (pager, audio alarm, phone calls, etc...) for help and maybe rebooting the NT server. So far the Red Hat installer has some kind of security problem with the partition table, but I've not had time to really troubleshoot it.
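
The "two boxes watch each other" idea can be sketched roughly as below. This is only an illustration in Python, not what is running here: the `probe` function, the `Watchdog` class, the three-failure threshold, and the action names are all made up for the example. The backup box would call `probe` on a timer and feed the result to `record`.

```python
import http.client
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the primary server answers the URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error status, etc.
        return False


class Watchdog:
    """Decide when the backup box should take over (hypothetical policy)."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures  # failed probes in a row before takeover
        self.failures = 0
        self.failed_over = False

    def record(self, ok: bool) -> str:
        """Feed in one probe result and get back the action to take."""
        if ok:
            self.failures = 0
            if self.failed_over:
                self.failed_over = False
                return "restore"   # primary is back: hand the traffic back
            return "ok"
        self.failures += 1
        if self.failures >= self.max_failures and not self.failed_over:
            self.failed_over = True
            return "takeover"      # serve the "technical difficulties" page, start screaming
        return "down" if self.failed_over else "waiting"
```

Requiring several failed probes in a row keeps a single dropped packet from triggering a takeover.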

200003232200-0616
Locked up at 10 o'clock. No idea why. There is a defragment daemon that starts at that time on Wednesdays. I'll defragment manually and see if that has a problem.
200109280530-0930
ISP service outage. Road construction nearby cut the cable. First time in years.
20010917-now
All last week I've been having horrible problems with the server getting overloaded. % Processor Usage goes to 100% and stays there. Web site response gets more and more sluggish and the number of users connected climbs to insane levels. The TaskList shows that inetinfo or mtx is taking 99% of the available cycles. Using Performance Monitor to watch the threads of those processes shows that different threads are maxing out at different times. Running windbg, attaching to the process and viewing a stack dump for the maxed out threads always shows that MSVCRT.DLL is running. No info in MSKB related to that DLL other than that it is the Visual C runtime library. I don't run any Visual C code. Tried shutting off all CGI-BIN exe's. Upgrading that DLL, etc... has no effect. Stopping the site or the entire web service has no effect (!). And then... we noticed that stopping the content index service frees up the processor and disconnects most of the connected users. No content index corrupt messages... server ran fine for an entire day with the search functions disabled. Rebuilding the content index had no effect. I've disabled the search functions (but kept the index server running) and the processor never makes it above 5%! And I've served more pages than ever! Doing one or two searches at a time (locally) sends the processor up to about 80 or 90% for the duration of the search. Doing 10 searches at once pegs the meter and it never recovers.

So... the secret is: Don't run a busy NT web server with more than a Gig of text in the search engine unless you throttle its use. I'm frantically searching for a way to do that. For now, it just records the starting time of each search and will not allow another one for a number of seconds after that.
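
For anyone wanting to do the same, the "record the start time and refuse the next search for a few seconds" trick looks roughly like this. It is a sketch in Python, not the ASP the site actually uses; the `SearchThrottle` name and the 5 second interval are just assumptions for the example.

```python
import time


class SearchThrottle:
    """Allow a new search only if enough time has passed since the last one started."""

    def __init__(self, min_interval: float = 5.0, clock=time.monotonic):
        self.min_interval = min_interval  # seconds to wait between search starts (assumed value)
        self.clock = clock                # injectable so the logic can be tested without sleeping
        self.last_start = None            # start time of the most recent allowed search

    def try_start(self) -> bool:
        """Return True and record the start time if a search may run now."""
        now = self.clock()
        if self.last_start is not None and now - self.last_start < self.min_interval:
            return False                  # too soon: tell the user to try again shortly
        self.last_start = now
        return True
```

A search page would call `try_start()` before running the query and show a "please wait a few seconds" message when it returns False. Note that this limits how often searches start, not how many run at once, so a single very long search can still overlap the next one.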

A new, faster server would allow more, so I'm adding requests for support on all the search pages and during the time the search is running. If that doesn't generate some funds for a faster server, I'll look for sponsorship or advertising for the search pages ONLY! The main site will always be free and accessible.
200112272010-200112280935
ISP service outage. Road construction nearby cut the cable. Second time (for the same reason) in years.
20020110
Moved from Rancho Bernardo to Temecula. DSL is the only option. Verizon has us on a 384k up and 384k down modem.
20020112
Second day and the DSL connection slowed to a crawl. Support at Verizon will not talk to us unless we disconnect our router and connect only one PC. After verifying that the connection is at about 64k, a real tech was called and found that our account was configured incorrectly in one of the three systems that control connections.
20020125
A lineman decided that our wires were not punched down neatly enough at the main terminal block for the building, called up the main number, got a girl in the office and told her (not asked) that the phones would be down for 2 minutes. He then ripped us out and re-wired us... DSL, fax lines, and all. Took about 10 minutes.
20020226
Down for a while this am at the office: First it looked like the router might have locked up as I couldn't reach it on the internal network... but the DSL modem was also flashing its Data light, which according to the manual is abnormal. I power cycled both and all was well... for a while. Then the data light went out and the modem light on the DSL modem started blinking and that lasted for about 10 min and 2 power cycles. Sat on hold for 7 min for Tech support and they couldn't get the modem back up so they put in a trouble ticket, and about a minute later, it came back up. Tech got here just before closing and it had worked all day...
20020321
Up and down this afternoon from 2 to 3 at the office. DSL modem would flash its data or modem lights on and off. Support doesn't want to send a tech because it started working while I was on the phone with them.
20020322
Up and down this afternoon from 12 to 1 at the office. DSL modem would flash its data or modem lights on and off. DSL sucks. Support will send a tech. The tech found a bad connection on the breakout panel in the back of our suite. Seems ok for now.
200301061324-200301061509
After a long time of not having any problems, I apparently set the firewall box too close to a server monitor and it overheated... I know that internet access stopped suddenly at 1:24pm, and I could not access the firewall config page. I found the top of the unit was very hot and the vents were being blocked by the side of the monitor. After verifying that the problem was not with the DSL modem, I reset the firewall to factory defaults and started setting it up again (note to self, print out and keep handy all the settings... Doh!). Finally back up at 3:09pm.
20030517-2003051909
The SQL server took a dump. BSOD. Restarted and all is well. The only thing affected at this point is the email archive, but as I add more ecom stuff, that could cause real problems. Anytime you add more machines to a solution, the risk of failure increases. Sigh.
20030722
Was rearranging some network cables and kicked the power cord out of the back of the server not once, but twice. Doh!
20040213092344
I applied some updates from $MS last night, and restarted, and then about 11pm there was a big windmill, then a spike in the index service cpu usage (not unusual) and the inetinfo service started sucking all the available cycles. Committed bytes slowly climbed to about double the norm and available bytes oscillated wildly. Network utilization was also spiking with the windmills. This AM the server was reporting 500 Internal Server error on piclist, sxlist and massmind and running the others really slow. By the time I got to the office (9am) the system was still sluggish, but task manager did not show any unusual activity. The web sites were now reporting HTTP 1.1 Application Restarting. I restarted the machine and all seems to be well.
2004 08/12 19-2004 08/13 10
Power spikes from a local lightning storm apparently whacked a chunk out of the hard drive. No major damage, NTFS was up to the challenge.
2004 09/08 pm-2004 09/10 am
Well, the NTFS can only do so much I guess. <grin> The hard drive fried last night. I took it as a sign to move to the new server a bit early. I've been getting it ready but there are some rough edges... let me know if you find one.

I had good backups from the old server, but it took a while to get them all together. I made the mistake of using the new server as a development platform to try to work out some new features and solve old issues, but that meant that it was not in sync with the old server. And once I got that sorted out, there were the standard weird problems to work through:

It seems to be ok now and the new server is cranking nicely.

2005 02/12
"HTTP 1.1 Application Restarting" started about 23:00 and continued even after the regular restart a few hours later. I finally drove in to the office at around 14:00 on the 13th. I started and stopped the site, tried to stop the web service (not responding) and restarted the PC. Still the same. And then I realized that I had been editing an include file used by default.asp from another PC via the network and had left that file open. I closed it and suddenly all was well. Apparently, IIS was trying to rebuild the application and couldn't get access to that file because it was locked by the other PC? Just my guess. Note to self: Don't leave important files open!
2005 08/06-07
Another of the lovely power fail / shutdown started / power back on before shut down complete / system doesn't restart hang-ups. APC with NT is just lovely.
2006 07/12-14
Another of the lovely power fail / shutdown started / power back on before shut down complete / system doesn't restart hang-ups. APC with NT is just lovely. Happens every year when the heat waves come and the power grid can't keep up with the Air Conditioners.
2006 08/11-12
Started getting this "Couldn't load application" error on the web site. Event log is full of
DCOM got error "Logon failure: unknown user name or bad password. " and was unable to logon MASSMIND\IWAM_XX in order to run the server: {3DFEFADE-B61A-4096-90D8-F3C0F137D9BB}

and

The server failed to load application '/LM/W3SVC/2/Root/xxxx'. The error was '80080005'.

where xxxx was each of the applications in turn. E.g. techref, dict, and so on.

I deleted and re-created each of the applications for the virtual directories that have them. That seemed to solve the problem. No freaking idea what caused it except that I had been playing around with the old web server and I think it synchronized the SAM database and changed the password on that account.

2006 08/25
None of the web services are running after the normal restart this AM. And they won't start from the Internet Service Control Panel. The WWW service is running, so I stopped it and started it and then the web sites, etc... could be started in the ISCP. Event log shows: Event ID: 4098 Source: Transaction Server
The run-time environment has detected the absence of a critical resource and has caused the process that hosted it to terminate. HRESULT: 80070006 (Microsoft Transaction Server Internals Information: File: x:\viper\src\runtime\mtxex\vipthrd.cpp, Line: 862)

I can't seem to find anything about that on the net... Nothing weird in the log files, it just seems that something didn't start right this AM. I hate crap like this. Yesterday and the day before I was seeing really high counts on "Current Anonymous User" without a corresponding increase in traffic.

2006 08/28
Happened to notice that the site was running sort of slow and the processor seemed to be working harder than usual. After looking around, I found that none of the applications were running in "separate memory space (isolated process)". Oh... no problem, just some leftover from the issues of a day or so ago, so I'll just turn that on... And then the web site was down. "Application failed to load"

Event log is filling up with 80004005 "failed to load"

Oh boy, oh boy... turn that back off... Ok for now. But why can't I run in separate memory space? Anyway, long story short:
http://support.microsoft.com/kb/297989/ was the deal. After I got that worked out, all was well. Not only the IUSR but also the IWAM accounts had gotten their passwords changed by the backup domain controller and they needed to be reset. I also ended up having to delete two of the packages out of the Transaction Server \ Computers \ My Computer \ Packages Installed list and re-create them via the Internet Service Manager. Not sure what was wrong with them, but they would not accept the separate memory space option: it would just delete the application when I turned that on.

Fun, fun, fun.

2007 08/04
Permanent "Application Restarting" again. As far as I can tell, the backup (which runs from another server) froze with the application .asa file open (open for read not write, but IIS seems to have issues with anyone else even looking at that file) and restarting both machines did the trick.
 
Looking back, it appears August is a bad month for my web server...
2007 08/16
Power failure. UPS held the server up for about 20 minutes and then it died. The power came back on about an hour later... for about 2 minutes... Then went down for another hour and a half or so. Finally came back on around 5 and stayed up. August sucks. One of these years, I have to move to a hosted server.
2008 02/21
Hard drive lost some chunks and for some reason the server didn't restart on its own, so it didn't start working on the standard restart chkdsk until I restarted it in the morning around 9am. And the chkdsk ran REALLY slow for some reason (probably because it was finding and correcting the problems) so it didn't get done until around 3pm. Yeap: Fried the entire day. And me biting my nails to see if the drive was toast. Once it finally finished, everything reports good and the server is back up.
2008 07/17
Power failure. The TrippLight rack mount 750 kept the server up for about 10 minutes, but the monitoring software on the servers was set up to shut down after a few minutes and totally failed to do so. I happened to be in the office and so I shut down manually after discovering that the monitoring agent service was not running and could not be started. It had apparently failed after an IMF update had been installed earlier that day. So why did it fail? And why didn't the monitoring agent (which was running) log or broadcast any error message when its service wasn't running? Details will be posted after TrippLight support has had a chance to respond.

Interested:

Comments:

Questions:

Says: " That's what you get for using Linux. :) Try a real server OS, like FreeBSD (or any of the BSDs for that matter). "

Andrew Wilson Says:

What are the specs on the box you're serving with? What's bandwidth usage for the piclist server? Have you ever explored the possibility of mirroring the site and round-robin..ing the dns? I'd like to help if I can, I think piclist is a great resource (if only it were easier to access the info..) Feel free to respond here or via email at wil1(..at..)umbc.edu

James Newton replies: This is pretty normal volume: It seems to grow at a rate of about 50k per month on average.

        Hits               Bytes      Visits      PViews       Date

      29,061          65,586,744       3,720      13,217     21 Mar  
      46,723         105,378,142       5,733      21,479     22 Mar  
      42,573          95,996,292       5,941      18,481     23 Mar  
      50,140         113,094,810       6,040      22,739     24 Mar  
      43,310          97,862,740       5,846      21,123     25 Mar  


        Hits               Bytes      Visits      PViews      Month

   1,918,843       4,438,804,622     131,238     638,653   Jan 2004  
   1,078,755       2,434,471,520     119,662     490,111   Feb 2004  
   1,353,173       3,052,974,442     129,288     600,971   Mar 2004

        Hits               Bytes      Visits      PViews      Month  Unique IPs

   2,515,640      28.56_GB           168,732     782,610   Mar 2007  116,454
   2,140,602      24.33_GB           153,541     689,888   Apr 2007  106,116
   2,302,367      25.23_GB           163,775     747,945   May 2007  112,178 
  

The server is an NT box which is a bit starved for ram at the moment: 96MB only. But I'll be upgrading it again soon.

The only problem with mirroring is that the site accepts updates instantly via the form you used to post your offer. Keeping two sites in sync is something I haven't worked out, but I'll take any help I can get on that issue.

Thanks!

Ivan Kocher Says: " If you need help with linux I can help. I do so for a living and electronics is my second.

drop me a line, to see how this can be done :)

Ivan
" James Newton replies: Thank you kindly. Frankly, I have neither the time nor the desire to learn Linux. With the searches limited to one every few seconds, the server does just fine.

If you were interested in mirroring the site or providing a search engine from your own server, I would be willing to try to set up such a thing.

At one time I was looking into using RSYNC (I think?) on the NT box to send out notice of changes and keep a remote mirror up to date. The idea was that the remote mirror could provide the search function (as well as the content if desired) and so remove the burden from the local box.
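
The "send out notice of changes" part could be as simple as comparing a list of file hashes from one run to the next. The following is a rough Python sketch of that idea, not rsync and not anything the site runs; `build_manifest` and `diff_manifests` are made-up names, and it assumes the site content is just files under one directory.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to a SHA-256 hash of its contents."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def diff_manifests(old, new):
    """Return (changed_or_added, removed) path lists between two manifests."""
    changed = sorted(p for p in new if old.get(p) != new[p])
    removed = sorted(p for p in old if p not in new)
    return changed, removed
```

The local box would keep the previous manifest, build a new one after updates, and send only the changed and removed paths to the mirror. Hashing a gigabyte of text on every pass is not free, so a real setup would probably check file dates first.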

Again, I'm NOT interested in hosting, learning, or in other ways touching *nix. Nothing personal, just my preference based on my lone experience.

Thanks again.

See also:

Every now and again, I find "RUNDLL32 SETUPAPI,InstallHinfSection DefaultInstall 1 {out}.inf" running in an unattached process. It turns out that is part of a batch file that I wrote years ago to restart the server every night. The batch file gets called by task manager and tries to rundll this inf file that causes the server to restart. If it doesn't restart, for whatever reason, the process sits there doing nothing forever.
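
One way to keep a stuck restart command from sitting there forever is to run it with a deadline and kill it if it overstays. A minimal sketch in Python (the original is a batch file; `run_with_deadline` is a made-up helper, and the command passed to it is whatever the scheduled job would normally run):

```python
import subprocess
import sys


def run_with_deadline(cmd, seconds):
    """Run cmd; return its exit code, or None if it was killed for taking too long."""
    try:
        # subprocess.run kills the child process itself when the timeout expires.
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # Stand-in for the real restart command: a child that would hang for a minute.
    result = run_with_deadline([sys.executable, "-c", "import time; time.sleep(60)"], 2)
    print("killed after deadline" if result is None else f"exit code {result}")
```

A None result is the cue to log the failure or raise an alarm instead of leaving an orphaned process behind.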