Jim Korman wrote:
> And then blaming it on the hardware. ;-)

Actually, now you have my analytical side thinking... Truthfully, with the high quality of Unix OS's these days, it's pretty rare that the OS software (other than proprietary in-house stuff) crashes a box. We run across things during deployment and testing (one bad Linux ethernet driver, for example), but hardware is regularly a real cause of failure.

Looking through the system outage reports here, the real causes of system downtime in 2003 were (in order of frequency/importance/my pain):

Power loss
Human error
Hard disk failures
Security breach

Redundancy in the applications and load-balancing kept uptime well over four 9's, perhaps higher (some quick downtime math is at the end of this note). My boss keeps those numbers, and he's happy with them, whatever they are... I don't track them. My focus is keeping stuff running and mitigating risk -- not tracking what happened after the fact, other than to learn from it.

Power loss - Our data center provider had a meltdown (literally) of their power distribution hardware, due to a manufacturing defect in a copper bus bar and some water entering places water was never intended to go. One full day of downtime for that site; the redundant data center took most of the applications and load during the outage. They got roasted by all their customers for not having a redundant bus bar to carry the load -- everything else in their power distribution is double or triple redundant: two generators, city power, two transfer switches, multiple on-line UPS's that could be re-routed if one failed (manually, but it could be done), etc.

Hard disk failures - Numerous systems not using RAID 1 or RAID 5 on primary disks needed for the system to continue operation. ($$$) It's on "the list" for this year to fix this permanently, but it'll only get done with a real budget and a policy that no critical systems run on non-redundant hardware. In other words, as someone on the NANOG list put it so succinctly: "Let's not discuss building a Global network with household appliance machines any further." (Yes, I'm going to use that one in a meeting, sooner or later.) A quick-and-dirty disk health check is sketched at the end of this note.

Human error - Cultural: an aversion to using standard operating procedures, and a lack of testing. Deployment of somewhat untested software (in my opinion, but sometimes you have to shoot the Engineer and say "SHIP IT", I know... I know...). Silly stuff like power plugs getting bumped (no discipline in the cable plant -- again, cultural), etc.

Security breach - A couple of older systems that had been recommended for upgrade, but that human (time) resources weren't being expended on... until the recommendation became a security incident, that is. ;-) The DMZ contained the event to a limited number of boxes, as designed.

Not once did we have a known OS-level failure. This is on HP-UX, Solaris, and Linux. We did have a few Oracle hiccups, but nothing major and no full outages caused by that -- just silly configuration issues.

Some security patches were more of a pain than others this year. OpenSSH was a moving target, but it doesn't require a reboot. The constant kernel updates on Linux are getting annoying -- those require downtime, a reboot, and throwing the load over to redundant boxes. But if I had to count up how many times the poor Windows admin had to reboot his machines this year vs. how many times I have... I'd shoot myself if my machines were down as much as his are -- through no fault of his. About security updates for Windows, we admins have this joke: "You moved the mouse! Would you like to reboot now? (Y/N)" Poor guy.
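For anyone who hasn't done the "nines" arithmetic lately, here's the back-of-the-envelope version as a little Python sketch. Purely illustrative (flat 365-day year, not our actual numbers):

    # Downtime budget per year for N nines of availability.
    # Illustrative only -- assumes a flat 365-day year.
    minutes_per_year = 365 * 24 * 60
    for nines in (2, 3, 4, 5):
        availability = 1.0 - 10.0 ** -nines
        downtime_minutes = minutes_per_year * (1.0 - availability)
        print("%d nines = %.4f%% up, about %.1f minutes down per year"
              % (nines, availability * 100, downtime_minutes))

Four 9's sounds great until you notice it still allows nearly an hour of downtime a year; five 9's is closer to five minutes.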
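And since the disk problem is the one I can at least script around until the budget shows up, here's a minimal cron-able health check, assuming Linux software RAID (md). The "admins" mail alias is made up, and mdadm's own --monitor mode is the proper way to do this -- this is just the five-minute version:

    #!/bin/sh
    # Minimal sketch: complain if any Linux md array has a failed member.
    # A degraded mirror shows up in /proc/mdstat as [U_] instead of [UU].
    # "admins" is a hypothetical mail alias -- adjust to taste.
    if grep -q '\[.*_.*\]' /proc/mdstat 2>/dev/null; then
        { echo "Degraded md array on `hostname`:"; cat /proc/mdstat; } \
            | mail -s "RAID degraded on `hostname`" admins
    fi

It's no substitute for real redundant hardware on the boxes that matter, but it at least turns a silent single-disk failure into an e-mail instead of an outage.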
That was 2003 in a nutshell for me... how was yours? ;-)

Nate Duehr, nate@natetech.com