The Sessions paper: an analytical critique

Roger Sessions has published a white paper, “The IT Complexity Crisis: Danger and Opportunity” (PDF). It’s created a bit of a stir in tech circles, largely because Sessions estimates that “worldwide, we are already losing over USD 500 billion per month on IT failure, and the problem is getting worse” (page 1; emphasis in original). He feels that the consequence is a “coming IT meltdown”, then goes on to offer his own solution, namely designing simpler IT systems.

This naturally intrigued me, since for the last 15 years, I have been writing, consulting, lecturing, and testifying about troubled and failed IT projects. While there are indeed tremendous financial losses due to late and failed IT projects, the figures Sessions gives seem much too large to me, and so I decided to do this critique of his analysis.

Sessions is good enough to provide the basis of his estimates and calculations, including footnotes. But that’s where some of the problems start. For example, on page 3, Sessions cites (as his footnote “02”) the US Budget, Fiscal Year 2009, Analytical Perspective (PDF), p. 169, for information on “at-risk” or failed IT projects, specifically:

  • “According to the 2009 U.S. Budget [02], the failure rate is increasing at the rate of around 15% per year. If this trend continues, within another five years or so a total IT meltdown may be unavoidable.” (p. 3)
  • “According to the 2009 U.S. Budget [02], 66% of all Federal IT dollars are invested in projects that are ‘at risk’. I assume this number is representative of the rest of the world.” (p. 3, in “Calculating the Cost of IT Failure” box)
  • “A large number of these ['at risk' projects] will eventually fail. I assume the failure of an ‘at risk’ project is between 50% and 80%. For this analysis, I’ll use the average: 65%.”

These three statements run into immediate problems. The first problem is relatively minor: Sessions gets his page number wrong. He cites “page 169” of the Analytical Perspective document, but there is no discussion whatsoever of IT projects on page 169 of that document. However, page 157 of that document (which happens to be page 169 of the PDF file) does start a section titled “INTEGRATING SERVICES WITH INFORMATION TECHNOLOGY”, so I presume that Sessions made the simple mistake of using the PDF page count rather than the document’s actual page numbering.

Even so, serious problems remain with Sessions’ citations and analysis.

Page 157 of the Analytical Perspective document does not say what Sessions claims in the first two statements above. I have not been able to figure out where in the cited US Budget Analytical Perspective document Sessions gets his figure for “the failure rate increasing around 15% per year”, much less his conclusion that “if this trend continues, within another five years or so a total IT meltdown may be unavoidable.” As far as I can tell, the Analytical Perspective document does not talk about failed IT projects at all, much less the increase in failure rates.

Furthermore, the phrase “the failure rate increasing around 15% per year” is itself ambiguous and may not be that significant. To start with an arbitrary number, assume that 100 projects “fail” in a given year. If “the failure rate [is] increasing around 15% per year”, then that means that 115 projects would fail the next year, and 132 projects would fail the year after that. But unless we know both the actual number of failed IT projects and the total number of IT projects in that same year, Sessions’ figure tells us nothing. If there are only 150 IT projects total, then the 15% failure rate increase becomes very significant; if there are 1,000 IT projects total, then we’re many years away from Sessions’ threatened “meltdown”.
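To make the point concrete, here is a rough sketch of that arithmetic. It assumes (my assumption for illustration, not anything in Sessions' paper or the budget document) that the count of failed projects compounds at 15% per year while the total number of projects stays constant; the function names are hypothetical.

```python
import math

def projected_failures(initial, annual_growth, years):
    """Failed-project counts for year 0 through `years` under compound growth."""
    return [round(initial * (1 + annual_growth) ** y) for y in range(years + 1)]

def years_until_all_fail(initial, annual_growth, total_projects):
    """Whole years until failures would reach a constant project total (an assumption)."""
    return math.ceil(math.log(total_projects / initial) / math.log(1 + annual_growth))

counts = projected_failures(100, 0.15, 2)        # [100, 115, 132]
small = years_until_all_fail(100, 0.15, 150)     # 3 years if only 150 projects exist
large = years_until_all_fail(100, 0.15, 1000)    # 17 years if 1,000 projects exist
print(counts, small, large)
```

Under those assumptions the same 15% growth figure implies a "meltdown" in about 3 years or about 17 years, depending entirely on a total that the figure itself does not supply.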

Sessions also ignores or confuses the failure rate for new projects vs. the systems already deployed. In other words, the failure rate for new systems development says very little about the continued functionality of existing, deployed systems now in use. While there are occasions (most notably Y2K, now a decade behind us) where existing IT systems just won’t function, or won’t function properly, if they aren’t fixed or replaced, by and large both governments and private concerns have gotten along remarkably well for years or even decades with antiquated systems.

As for Sessions’ second statement, there is a table on page 158 that may represent the basis for it:

[Table image: ITtable]

As can be seen in the FY 2009 column, 66% (535 out of 810) of the FY 2009 “Major IT Investments” are projects that are “Not Well Planned and Managed”. Note that this table does not (as Sessions implies) indicate Federal dollars but rather actual projects; that is, in FY 2009, there are 810 projects listed as “Major IT investments”, of which 535 are designated as “Not Well Planned and Managed”. The previous page appears to indicate that these projects represent $27 billion, which is roughly 38% of the proposed Federal IT budget. That is not a great figure, but it is still not much more than half of the 66% that Sessions claims.

What’s more, supplementary data (PDF) for the FY 2009 Analytical Perspective makes it clear that the US Government’s designation of such projects, which puts them on a “Management Watch List” (MWL), has reduced the risk of such projects during each fiscal year:

[Table image: ITFY]

Note that in FY 2007 and 2008, the number of IT projects designated as “Not Well Planned and Managed” shrank significantly during the year (from Q1 to Q4) without a proportional shrinkage of the overall number of major IT projects. In other words, it appears that the government’s efforts to remove such projects from the “Not Well Planned and Managed” category are relatively successful. And the actual US IT budget dollars at risk at the end of each of those fiscal years ($4.2 billion for FY 07, $8.6 billion for FY 08) are a much smaller percentage (6.5% and 13%, respectively) of the Federal IT budget for each of those years ($64.2 billion for FY 07 (XLS), $66.4 billion for FY 08 (XLS)).
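As a quick check on those percentages, the following sketch divides the end-of-year at-risk dollars by the total Federal IT budget for each year, using only the dollar figures quoted above. The helper name is hypothetical.

```python
def share_at_risk(at_risk_billions, total_budget_billions):
    """Percentage of the IT budget tied to at-risk projects."""
    return 100 * at_risk_billions / total_budget_billions

fy07 = share_at_risk(4.2, 64.2)   # end of FY 07
fy08 = share_at_risk(8.6, 66.4)   # end of FY 08
print(f"FY 07: {fy07:.1f}%  FY 08: {fy08:.1f}%")
```

Both results land well below the 66% that Sessions uses.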

Sessions then states that “I assume this number [66% of all Federal IT dollars being at risk] is representative of the rest of the world.” There are numerous problems with this assumption, starting with the fact that the 66% figure is wrong; in fact, the actual “at risk” (his term, not the US Government’s) percentages of the IT budget at the end of FY 07 and FY 08 were, as noted above, 6.5% and 13%, respectively.

Sessions’ error here is significant, since he goes on in several places (cf. page 4) to treat his figure as a percentage of the total IT budget, when he’s not talking about the total IT budget at all.

Furthermore, it is unclear whether his phrase “the rest of the world” means all other national governments, or all other entities doing IT project development. It seems to be the latter, though it’s hard to tell from his statements. On the other hand, I have spent years consulting with corporations on troubled projects, and I can tell you that they do not have 66% of their IT budgets devoted to “at risk” projects. In fact, the majority of corporate IT budgets are devoted to maintenance of existing systems, not new and risky projects (cf. here, here, here, and here, as simple examples).

As noted, Sessions then assumes that the failure rate for “at risk” IT projects is 65%, which means that (as he says) “I am calculating that 43% (.65 x .66) of the total IT budget” is devoted to failed projects. At this point, his figures become nonsensical, as they are derived both from misreadings and lack of complete information about the Federal IT budget and projects. To wit:

  • The 535 “not well planned and managed” IT projects in the US FY 09 budget only represent 38% of the total IT budget, not 66% as Sessions mistakenly states.
  • In the two previous years (FY 07 and FY 08), the number of IT projects labeled as “not well planned and managed” dropped during the course of each year (see the 2nd table above). In FY 07, it dropped from 263 projects in Q1 to just 84 in Q4, which means that 68% were moved off of the “not well planned and managed” list during the year. Likewise, in FY 08, it dropped from 346 projects in Q1 to 134 projects in Q4, a drop of 61%. This directly contradicts Sessions’ assumption of a 65% failure rate for projects in the “not well planned and managed” category.
  • The FY ’09 Analytical Perspective says nothing about actual failed projects, as far as I can tell.
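The figures in the list above can be reproduced with a few lines of arithmetic. This is only a sketch using the project counts and percentages already quoted; the helper name is hypothetical, and "moved off the list" is treated simply as the Q1-to-Q4 drop in the count.

```python
def percent_drop(start, end):
    """Percentage decline from a Q1 count to a Q4 count."""
    return 100 * (start - end) / start

fy07_drop = percent_drop(263, 84)     # projects leaving the list during FY 07
fy08_drop = percent_drop(346, 134)    # projects leaving the list during FY 08

# Sessions' own calculation: assumed failure rate times assumed at-risk share.
sessions_failed_share = 0.65 * 0.66

print(f"FY 07 drop: {fy07_drop:.0f}%  FY 08 drop: {fy08_drop:.0f}%")
print(f"Sessions' failed share of budget: {sessions_failed_share:.0%}")
```

The drops work out to about 68% for FY 07 and 61% for FY 08, and Sessions' product rounds to the 43% he reports.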

Sessions then goes on to make further out-of-his-hat assumptions regarding “direct and indirect costs”. He cites an example of the IRS (an agency long troubled by IT woes) and notes a lost opportunity based on fraudulent tax returns due to the system not being operational. He projects a loss over two years ($1.788 billion), compares it to the cost of the failed modernization ($185 million over a ten-year period), and calculates an indirect costs ratio of 9.6 to 1. He then decides — with no other documentation or analysis whatsoever — that the universal ratio of indirect to direct costs for a failed IT project ranges from 5:1 to 10:1, and uses the “average” of 7.5:1.
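For reference, here is the ratio arithmetic as I read it, using only the dollar amounts cited above. The variable names are mine, and the "average" is simply the midpoint of the 5:1 to 10:1 range.

```python
indirect_loss = 1.788e9   # projected two-year loss from fraudulent returns, in dollars
direct_cost = 185e6       # cost of the failed modernization, in dollars

irs_ratio = indirect_loss / direct_cost   # indirect-to-direct ratio for the IRS example
midpoint = (5 + 10) / 2                   # the "average" of the assumed 5:1 to 10:1 range

print(f"IRS ratio: {irs_ratio:.2f} to 1, assumed universal ratio: {midpoint} to 1")
```

The single IRS data point comes out a little under 9.7 to 1, in line with the roughly 9.6 to 1 that Sessions reports; the 7.5 figure is a midpoint of an assumed range, not a measurement.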

There are so many problems here that I scarce know where to start. For starters, the term “average” assumes an even distribution of ratios from 5:1 to 10:1 and does not recognize any ratios lower than 5:1. I’ve seen many failed projects that had much lower ratios of “indirect” to “direct” costs, since the firm simply continued to operate using the existing systems, and the “lost opportunity” for not having the new system in place was relatively small.

More importantly, the IRS gets to collect taxes from the entire US: $2.5 trillion in tax collections each year. Using the IRS as a baseline makes little sense for most other government agencies, and even less sense for most corporations and non-government organizations (NGOs), because most IT systems in most organizations (government or private) do not have the ability to generate such magnitudes of revenue, period.

Indeed, there is a long-standing controversy within IT management circles as to whether a new computer system can be relied upon to provide any significant return on investment (ROI), or whether it exists merely to “keep up with the competition”.

Sessions concludes his section on calculations thusly (p. 5, emphasis his):

Of course, these calculations are estimates. I recommend you don’t get overly focused on the exact amounts. I could be off by ten or twenty percent in either directions. The real point is not the exact numbers, but the magnitude of the numbers and the fact that the numbers are getting worse.

Unfortunately, Sessions is fundamentally wrong in his numerical analysis, and his numbers are off by far more than “ten or twenty percent”. For the Federal Government alone, they are off by almost a full order of magnitude (10x), due to his critical errors both on the percentage of the Federal IT ’09 budget “at risk” (it’s 38%, not 66%) and the number of “at risk” projects that fail. He says 65%; the US government numbers for FY 07 and 08 show that only about 35% of the projects, representing just 6.5% to 13% of the Federal IT budget, were still “at risk” at the end of each fiscal year, and the government gives no figures that I can find for actual failed IT projects.

Furthermore, his projection of the (erroneous) 66%-of-IT-budget-at-risk figure on the rest of the world is just wrong, especially in corporations and business (which spend vastly more on IT than the US government). In those organizations, maintenance costs dominate, and the percentage of the IT budget devoted to new projects tends to be small (20% or less), with an even smaller fraction of that representing “at risk” projects.

I may comment more on Sessions’ paper, but my conclusion here is that his estimate of $500 billion/month in direct and indirect costs lost to IT systems failure just does not hold up. ..bruce..

HR 3200 from a systems design perspective (Part II)

In the first part of this three-part series, I briefly outlined the parallels between developing software and crafting legislation, while pointing out the great risks and issues in the latter. I also indicated what I felt were some of the general structural flaws in HR 3200, the House bill on health care reform. I was not criticizing any actual proposals, but rather highlighting some of the design and implementation problems that make it hard to understand HR 3200 and even harder to predict its consequences.

Here in Part II, I’ll talk about some of the well-established maxims and heuristics of complex systems development, and how they apply to legislation in general and to HR 3200 in particular. (More after the jump.)

Read the rest »

HR 3200 from a systems design perspective (Part I)

[Welcome Slashdotters -- feel free to leave comments here or there. But no debates on health care reform or what HR 3200 does or does not do, please -- just on the concept itself.]

[Part II is now up.]

On the occasions where I have reviewed the actual text of major legislation, I have been struck by the parallels between legislation and software, particularly in terms of the pitfalls and issues with architecture, design, implementation, testing, and deployment. Some of the tradeoffs are even the same, such as weighing the risk of “analysis paralysis” (never moving beyond the research and analysis phase) against the risk of unintended consequences from rushing ill-formed software into production. Yet another similarity is that both software and legislation tend to leverage off of, interact with, call upon, extend, and/or replace existing software and legislation. Finally, the more complex a given system or piece of legislation is, the less likely it is to achieve the original intent.

But there are some critical differences that make legislation design both harder and higher-risk than systems design. (More after the jump.)

Read the rest »

Book review: “Why New Systems Fail”

My review of Why New Systems Fail by Phil Simon is now up on Slashdot. Here’s the opening paragraph:

Over the last forty years, a small set of classic works on risks and pitfalls in software engineering and IT project management have been published and remained in print. The authors are well known, or should be: Gerry Weinberg, Fred Brooks, Ed Yourdon, Capers Jones, Stephen Flowers, Robert Glass, Tom DeMarco, Tim Lister, Steve McConnell, Steve Maguire, and so on. These books all focus largely on projects where actual software development is going on. A new book by Phil Simon, Why New Systems Fail, is likewise a risks-and-pitfalls book, but Simon covers largely uncharted territory for the genre: selection and implementation of enterprise-level, customizable, off-the-shelf (COTS) software packages, such as accounting systems, human resource systems, and enterprise resource planning (ERP) software. As such, Simon’s book is not only useful, it is important.

Go read the whole thing. ..bruce..

Techno-blindness

A few decades back, when handheld electronic calculators were still pretty neat, someone did a study on the authority people gave to them. As I recall, those conducting the study built some normal-looking calculators that were designed with specific errors in the calculation circuits such that in certain cases the calculators would give wrong answers. The subjects in the study were asked to perform a specific set of arithmetic calculations using these calculators. For some of these calculations, the doctored calculators were certain to give the wrong answer. The subjects were then asked to check the answers by hand and give what they felt was the correct answer.

A large percentage of the subjects, when given a wrong answer by the calculator, would get the right answer when they carried out their calculations by hand. But most of them would assume that they had made a mistake in their manual calculations and that the calculator was correct, and so put down the wrong answer. These individuals couldn’t conceive that the calculator was giving a wrong answer, and so they would doubt their own by-hand calculations.

Nice story, but old. Why am I telling this? I had a nearly identical experience yesterday, except it didn’t involve a calculator — it involved a GPS unit.

For the last four years, I have lived about 25 miles southeast of downtown Denver. During that period I have driven to and from Wyoming — north of us — multiple times, via the I-25 freeway, so I have a general sense of what lies between Denver and Cheyenne. On the other hand, it’s been probably at least a year since I’ve made that drive, so it’s not exactly fresh in my mind.

Yesterday, I had to drive to Ft. Collins, which is between Denver and Cheyenne (WY), to testify at a trial. I had it pretty fixed in my mind that the trip to Ft. Collins would take about 90 to 105 minutes. I was supposed to be there by 2:30 pm, so I planned on leaving around 12:30 pm, giving myself some slack. Around noon, I got a call from my client’s lawyer, asking if I could be there by 2:00 pm instead. I told him I thought I could make it by then, got things pulled together, and got into the car. I punched the courthouse address into our GPS system — a Magellan Maestro 4250 — asked it to plot my route (“fastest time”), and drove off.

A few miles into my drive, I noticed that the GPS unit showed an arrival time of 2:20 pm. That seemed a bit long to me, and I remembered that I had changed the time zone on the GPS system during my recent three-week trip to California. So I popped the GPS out of its cradle, worked the menus to set the time zone back to Mountain Time, backed out to the map display, and stuck it in its cradle again.

And panicked: the arrival time was now 3:20 pm (which I should have already known, since I was changing from Pacific to Mountain time). I was going to be over an hour late. I might not even be allowed to testify; I was scheduled to be the last witness before my client’s lawyer rested my client’s case. At best, I was looking at a scolding from the judge and a negative impression on the jury.

Now, with this GPS unit, I’ve occasionally had circumstances where the time-to-destination estimate shifts abruptly, but never by more than a modest number of minutes. I was going a back route, and I thought that the GPS might re-adjust the time estimate when I got onto the E-470 toll road, heading north.

It didn’t. I had picked up a few minutes — the estimated time of arrival was now 3:18 pm — but that bought me very little. While driving at a, uh, vigorous rate of speed on E-470, I canceled the route to the courthouse, then called it up again (on ‘Previous Destinations’) to ensure that I was actually going to LaPorte Road in Fort Collins, Colorado. I was. I selected that destination again, asked the GPS to calculate the route — and again got an arrival time of 3:18 pm or so.

I was dumbfounded. I had been certain that Ft. Collins was an hour and a half, maybe 1:45, from my house, and the GPS system was telling me that it was actually close to three hours. Given how long it had been since my last trip up I-25, I now wondered if I was simply misremembering. I punched the map ‘zoom out’ button repeatedly to ensure that I was going to Ft. Collins. Sure enough — the map showed my course up E-470 to I-25 and then to a point about halfway to the Wyoming border: Ft. Collins.

I was now even more confused. While I was still at home, I had called up Google Maps and plotted a course from my house to the courthouse, and I was sure that it had told me that my driving time was 1:49. But now I was wondering if I had misread the Google Maps page, and it was actually telling me that the distance was 149 miles.

I continued my rapid drive up E-470, hoping against hope that the estimated time of arrival would magically drop an hour or more at some point. It didn’t. Finally, a few miles from the turnoff to I-25 North, I bit the bullet and called the lawyer (via the client’s cell phone) to give him the bad news: that I wouldn’t be there until 3:10 or so (I had picked up a few minutes, though not nearly as many as I had hoped). He was professional on the phone, but I know he felt blindsided — how could I be so stupid (and unprofessional) as to not leave in time to arrive on time? We hung up, and I continued the rapid pace.

I transferred onto I-25 North and saw on the GPS display that my next turnoff was 80 miles away. I wondered for the nth time how I could have been so wrong about the distance to Ft. Collins. Then something bubbled up through my brain: if the Ft. Collins turnoff was only 80 miles away, that meant that I would get off of I-25 around 2:15 pm — it surely wouldn’t take another 45 minutes to get to the courthouse, would it? I tried to think back to the Google map and wondered if Ft. Collins was a really, really long town; if I had to drive another 20 to 30 miles once I was off the freeway. I looked more carefully at the GPS for information on my next exit, the one 80 miles away.

It was in Cheyenne, Wyoming. And I felt hope for about the first time in 45 minutes.

I canceled the route to the courthouse and punched in a new route to Ft. Collins, picking the first street name and number I could punch up. I pressed ‘fastest time’ (the same control I had punched both times for the courthouse) and got an arrival time of 1:55 pm, well over an hour sooner than the GPS had given for the courthouse. I called up the courthouse address again (from ‘previous destinations’), asked as before for the fastest time, and now got an arrival time of 1:56 pm (instead of 3:10 pm). About that time, I passed a freeway sign indicating that Ft. Collins was 37 miles away, so I knew I had the right time this time. I quickly called the lawyer back (actually the client, on his cell phone) and let him know that I would indeed be there by 2:00 pm.

And I was. I testified, left the courthouse, and headed back home at a much more leisurely pace, stopping in Longmont to have dinner with my daughter and grandsons.

This was a genuine bug in the GPS unit that I encountered. All three route calculations to the courthouse used the same address and the same preferences (fastest time), yet the first two times, the GPS was somehow routing me through Cheyenne, Wyoming. Furthermore, when I had zoomed out the map, it did not show me going up to Cheyenne and back; it clearly showed me going only partway to the Wyoming border and then getting off the freeway.

That said, I did exactly what those people in the calculator study did: I trusted the GPS more than my own experience (and more than my recollection of what Google Maps had said). What’s more, I had a set of local and regional atlases in the storage pouch behind the passenger seat; it would have taken me maybe 60 seconds to pull over to the side of the road, pull out an atlas, and verify just how far it was to Ft. Collins. But instead of doing that, I panicked, drove fast, and assumed that the GPS was correct, particularly when I got the same arrival time result the second time.

There was one more point of confusion in all this that likewise could have cleared things up sooner. When I called the lawyer the first time, to give him the bad news, he asked where I was. I told him that I was on 470, approaching I-25. E-470 is the toll portion of the 470 beltway that goes about 3/4ths of the way around the Denver metropolitan area. I-25, which runs north-south, crosses 470 twice: once north of Denver and again south of Denver. When I told the lawyer how late I was going to be and where I was, he likely assumed that I was at the southern intersection of I-25 and 470, which would put me about 80 miles from Ft. Collins and having to drive right through Denver (heavy traffic, lower speed limits) to get there. Instead, I was about 45 miles away and outside of the Denver metropolis, with light traffic and a 75 MPH speed limit.

I’ve had this GPS unit for at least two years. As noted, I’ve had a few glitches in time and route calculation, but never anything of this magnitude. So I let what the machine was telling me override my own experience and knowledge. It is an error in judgment all too frequent when information technology is involved.

Food for thought.  ..bruce..