Jim Korman wrote:
> And then blaming it on the hardware. ;-)

Actually, now you have my analytical side thinking... Truthfully, with the high quality of Unix OS's these days, it's pretty rare that the OS software (other than proprietary in-house stuff) crashes a box. We run across things during deployment and testing (one bad Linux ethernet driver, for example), but hardware is regularly a real cause of failure.

Looking through the system outage reports here, the real causes of system downtime in 2003 were (in order of frequency/importance/my pain):

Power loss
Human error
Hard disk failures
Security breach

Redundancy in the applications and load-balancing kept uptime well over four 9's, perhaps higher (some quick downtime math is at the end of this note). My boss keeps those numbers, and he's happy with them, whatever they are... I don't track them. My focus is keeping stuff running and mitigating risk -- not tracking what happened after the fact, other than to learn from it.

Power loss - Our data center provider had a meltdown (literally) of their power distribution hardware, due to a manufacturing defect in a copper bus bar and some water entering places water was never intended to go. One full day of downtime for that site; the redundant data center took most of the applications and load during the outage. They got roasted by all their customers for not having a redundant bus bar to carry the load -- everything else in their power distribution is double or triple redundant: two generators, city power, two transfer switches, multiple on-line UPS's that could be re-routed if one failed (manually, but it could be done), etc.

Hard disk failures - Numerous systems not using RAID 1 or RAID 5 on primary disks needed for the system to continue operation. ($$$) It's on "the list" for this year to fix this permanently, but it'll only get done with a real budget and a policy that no critical systems run on non-redundant hardware. In other words, as someone on the NANOG list put it so succinctly: "Let's not discuss building a Global network with household appliance machines any further." (Yes, I'm going to use that one in a meeting, sooner or later.) A quick-and-dirty disk health check is sketched at the end of this note.

Human error - Cultural: an aversion to using standard operating procedures, and a lack of testing. Deployment of somewhat untested software (in my opinion, but sometimes you have to shoot the Engineer and say "SHIP IT", I know... I know...). Silly stuff like power plugs getting bumped (no discipline in the cable plant -- again, cultural), etc.

Security breach - A couple of older systems that had been recommended for upgrade, but that human (time) resources weren't being expended on... until the recommendation became a security incident, that is. ;-) The DMZ contained the event to a limited number of boxes, as designed.

Not once did we have a known OS-level failure. This is on HP-UX, Solaris, and Linux. We did have a few Oracle hiccups, but nothing major and no full outages caused by that -- just silly configuration issues.

Some security patches were more of a pain than others this year. OpenSSH was a moving target, but it doesn't require a reboot. The constant kernel updates on Linux are getting annoying -- those require downtime, a reboot, and throwing the load over to redundant boxes. But if I had to count up how many times the poor Windows admin had to reboot his machines this year vs. how many times I have... I'd shoot myself if my machines were down as much as his are -- through no fault of his. About security updates for Windows, we admins have this joke: "You moved the mouse! Would you like to reboot now? (Y/N)" Poor guy.
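For anyone who hasn't done the "nines" arithmetic lately, here's the back-of-the-envelope version as a little Python sketch. Purely illustrative (flat 365-day year, not our actual numbers):

    # Downtime budget per year for N nines of availability.
    # Illustrative only -- assumes a flat 365-day year.
    minutes_per_year = 365 * 24 * 60
    for nines in (2, 3, 4, 5):
        availability = 1.0 - 10.0 ** -nines
        downtime_minutes = minutes_per_year * (1.0 - availability)
        print("%d nines = %.4f%% up, about %.1f minutes down per year"
              % (nines, availability * 100, downtime_minutes))

Four 9's sounds great until you notice it still allows nearly an hour of downtime a year; five 9's is closer to five minutes.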
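And since the disk problem is the one I can at least script around until the budget shows up, here's a minimal cron-able health check, assuming Linux software RAID (md). The "admins" mail alias is made up, and mdadm's own --monitor mode is the proper way to do this -- this is just the five-minute version:

    #!/bin/sh
    # Minimal sketch: complain if any Linux md array has a failed member.
    # A degraded mirror shows up in /proc/mdstat as [U_] instead of [UU].
    # "admins" is a hypothetical mail alias -- adjust to taste.
    if grep -q '\[.*_.*\]' /proc/mdstat 2>/dev/null; then
        { echo "Degraded md array on `hostname`:"; cat /proc/mdstat; } \
            | mail -s "RAID degraded on `hostname`" admins
    fi

It's no substitute for real redundant hardware on the boxes that matter, but it at least turns a silent single-disk failure into an e-mail instead of an outage.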
That was 2003 in a nutshell for me... how was yours? ;-)

Nate Duehr, nate@natetech.com