On 4/24/07, Vitaliy wrote:

> I think you and Nate are missing the point.

Not really - Agile doesn't fit telco very well... I'll explain...

> By definition, projects involve uncertainties. If you remove all uncertainty
> from a project, it stops being a project and turns into a manufacturing
> activity.

Which is *exactly* what telco customers want. You don't *EVER* release code that causes massive problems or outages into a production telco network. Phone calls are a 24/7 business. This means changes are slow and gradual -- or -- an all-out "all hands on deck" push is made to a big new system, with insane amounts of testing/certification done before the thing ever goes anywhere near a Central Office.

Not to mention that in reality, a "manufacturing process" is exactly what software companies claim that they're working toward, every year... "code re-use", "object orientation", "virtual sandboxes"... all of that crap is just another way of saying, "stable, reliable changes that work".

Agile (to me, anyway) is just the latest promise from an industry full of ADD poster-children who can't seem to ever learn to stick to one thing and do it well, and are always chasing after... "oooh, shiny!".

> With Agile, you still have a schedule and a plan. The difference is, Agile
> recognizes the fact that schedules are guesstimates, and that plans tend to
> change.

Plans don't change in production telco systems unless they're in writing and reviewed. They are laid out and PLANNED ahead of time. This is the point of a PLAN. All changes are reviewed by the Engineers involved, their managers, a Change Control Board, and the customer. When I say "the customer", they have the same structured review by multiple people.

I think Agile tends to work well in organizations that lack the DISCIPLINE to follow their PLAN. Or perhaps who are stuck with terribly undisciplined customers. Telco can be frustratingly structure-bound, but at least our customers know their priorities up-front.
I laugh when I see people saying their plan changed... you don't DO that in telco.

> With Agile, you still define those things with the customer. You know that
> you're working with limited resources, so you rank the features (with
> customer's help) in order of priority, considering cost/benefit of each
> feature, etc. You end up with a Users Guide, a requirement spec, whatever
> else you end up with that you use for quoting.

Ranking isn't done in my world, once the purchase process is over with. The feature is either in, and PLANNED to be fully tested, or it's out at the very start. The dollar amount is agreed-to up-front, and the clock is ticking. Nothing truly gets paid for until after it's deployed and through production certification... there's a Purchase Order, a sales contract, and a deadline.

It's all based off of cost/benefit analysis. You can learn to live with a lot of bugs if it'll take five minutes to re-write the fix and two YEARS of lab certification testing before that code will ever see a production environment.

You missed some stuff below... I'll add in my world...

> With traditional "waterfall" development, you just follow these steps, in
> the order shown, until the project is done:
>

0.1 A year of hinting that new features are coming.
0.2 At least four quarterly meetings with the customer to carefully detail possible upgrades to get a feel for which ones they'll even consider doing a standard year-long certification cycle on in their lab. Maybe even a mock-up demo of an interface or idea that doesn't even exist yet, with lots of warnings that it's not even really started yet.
0.3 Regular visits by a salesperson to get a feel for which of the proposed ideas/products the customer is truly going to go fight for a capital budget for.
0.4 Some internal coding and design work to be prepared to move on any customer sales of things not even marketed yet, but already floated past the decision-makers at the customer. (Weird, huh?)
0.5 RFQ from customer to you and all your competitors, asking for the features you've been touting to them for a year.

> 1. Requirements

(Mostly done pre-sales above.)

> 2. Design

2.1 Customer review by their own lab of the ENTIRE system and any proposed design changes under NDA. All revisions are made here. The tight loop that Agile espouses is done up-front... not during Implementation, ever.

> 3. Implementation

3.1 Release of "beta" or better to customer's internal labs for interoperability testing, usually a fully-working product that is missing "polish" and/or a few non-critical features that are still being coded. This allows their internal deployment teams to understand the installation/restore/recovery process, so they can start to work with our support team to come up with procedures for the next step... full lab integration testing.
3.2 Full release. The customer re-analyzes all changes, and mountains of paperwork are signed off agreeing that, from the lowliest cable puller to the architecture folks, the system is documented to the point where every wire is traceable and every code change is approved.

> 4. Verification

4.1 Customer flattens a production quality machine or entire group of machines, loads it from the ground up with the previous version, and tests their internal deployment documentation, in another VERY expensive lab, to a point where a monkey could deploy the software. Every single command to execute the upgrade is documented, including a full roll-back procedure and escalation phone numbers for every person touching or responsible for each upgraded site; the tentative times/dates of upgrade are published; and "Method of Procedure" documentation is signed off by customer and by us after dry-runs of the upgrade.
4.2 Maintenance windows are set, usually at least a month in advance, with some deployments taking longer than a year to reach all affected sites.
Some customers even call this process a "Network Threat" process, and have to approve each "Threat" with their Network Operations Centers. No site is ever allowed to have multiple "Threats" open without the appropriate number of on-site staff. (Many sites are unmanned, otherwise.)
4.3 Test deployment into a lower-activity production site after multiple dry-runs through the upgrade in the lab environment. If anything so much as hiccups or looks different from the dry-runs, the system is immediately recovered to its original state. Timed sections and "red lines" mark stages of an upgrade that you shall NOT cross into without the express permission of the customer -- usually a Subject Matter Expert is present on the conference call throughout the first few deployments, and they can call an upgrade off at any step in the process. The recovery process timeline is KNOWN ahead of time, and everyone knows not only how to back out the upgrade, but how LONG it will take from any point in the process. (A screw-up here can send you back to lab re-certification, cost at least a quarter or two in sales revenue, and it's a major all-hands-on-deck experience.)
4.4 The rest of the sites are done. If any changes or mistakes in the process are found that affect the early sites, new maintenance windows are opened to return to those sites, and you don't move on until they're done unless the customer approves a "final cleanup" of any human error or configuration gaffes.

> 5. Maintenance

5.1 Bugs found at this point, once the system's deployed, are assessed for overall impact to the bottom line ("If you can't bill for it, it's just a hobby!"), and customer impact. The cardinal sins in telco are a) not answering the phone line during an inbound call, and b) hanging up on someone in the middle of a call. If you hit one of those two hot buttons with a bug, an Emergency patch might be issued.
Otherwise, you'll probably go a couple of years after deployment with only minor changes to system admin type settings, and those must go through the same process as deployed software, but they're considered "maintenance". Code doesn't have "maintenance", it has VERY small "emergency" changes to avoid only the largest customer-impacting or operational issues.
5.2 We're now one to two years from the original date of the release of the code, and changes at this point are cost/benefit analyzed to death by both sides. Nothing happens on this version again, unless a crash or other major issue comes up. The Sales/Marketing/Engineering side of the house is already on to steps 0 through 1 above for the next generation software, and the customer is moving toward the next "major" upgrade. Any changes made at this point are MINOR and only allowed if direct customer or revenue impact is seen. Minor bugs, interface problems, etc. are considered low-risk, and therefore low-priority, by all.
5.3 Sometime in the not too distant future after 5.1, an end-of-service and pre-end-of-life announcement is sent, usually around the 4 to 5 year mark.

In the highly disciplined release process above, Agile sucks.

I have customers running 4-8 year old software on production systems with KNOWN problems, and they're much happier with KNOWN problems than UNKNOWN ones. Every code change adds risk; leaving it alone adds none. Risk isn't tolerated, and those with bad risk assessment skills don't last long.

I have had maintenance windows "go south" on me, through no fault of my own (hardware failure, etc...), and changing the procedure mid-window, when the change adds risk and you're already past a reasonable fall-back point, can easily involve escalations to three layers of management on both sides, usually stopping at the VP levels in most telco organizations.
(Example: a bad SCSI controller and a very peculiar failure mode of both some Clustering software and an external JBOD disk array caused an outage of a site during a routine stop and start of a piece of software to clear a known issue. That particular failure had a VP on the phone in 5 minutes flat on a Saturday morning at 3AM his time. We barely had time to make sure our own management levels were notified that high up the food chain.)

So anyway, that's about the way I see it... this "world" differs from most in that one of these single systems makes enough revenue in cash in three days to pay my annual salary and have change left over. When that kind of money is flying around, processes like Agile and iterative development are scoffed at... and rightly so. But that cash is diluted severely by the amount of process and engineering effort that has to go around the system to keep it up 24/7, so while the numbers sound huge -- they're mostly eaten up by overhead.

Agile really can't be wedged into the above processes (which I've watched and done for close to 14 years now, with a few years off in the middle to play in the large data-center/ISP sector)... the process is too rigid, and for good reason... it's rare someone can't make a phone call. (VoIP general crappiness and cellular RF front-side and back-haul system overload during emergencies, notwithstanding.)

Oh, for those that made it this far... what day is NO maintenance or upgrades EVER done in telco without emergency authorization?

Mother's Day. Never ever. Busiest telco day of the year...

Nate