Tuesday, November 11, 2008

Well Known Software Failures

Software systems are pervasive in all aspects of society. From electronic voting to online shopping, a significant part of our daily life is mediated by software. On this page, I collect a list of well-known software failures. I will start with a study of the economic cost of software bugs.

Contents

Economic Cost of Software Bugs
Air-Traffic Control System in LA Airport *****
Northeast Blackout **
NASA Mars Climate Orbiter ****
Denver Airport Baggage-handling System *

The number of *s is the irony factor I assign to each story. The story with the most *s is the most ironic. Why? You will find out.

Economic Cost of Software Bugs

Report Date: 2/2002 Price Tag: $60 Billion Annually

WASHINGTON (COMPUTERWORLD) - Software bugs are costing the U.S. economy an estimated $59.5 billion each year, with more than half of the cost borne by end users and the remainder by developers and vendors, according to a new federal study.

Improvements in testing could reduce this cost by about a third, or $22.5 billion, but it won't eliminate all software errors, the study said. Of the total $59.5 billion cost, users incurred 64% of the cost and developers 36%.

Curious about how the study calculated the cost, I skimmed through the report. The following is a summary of its methodology.

It divided the software development process into stages: Requirement Gathering and Analysis, Architectural Design, Coding, Unit Test, Integration and Component, RAISE System Test, Early Customer Feedback, Beta Test Programs, and Post-product Release.

Bugs are generated at each stage of the software development process. The later in the production process a bug is discovered, the more costly it is to repair. Impact estimates were then developed relative to two counterfactual scenarios. The first scenario investigates the cost reductions if all bugs and errors could be found in the same development stage in which they are introduced. This is referred to as the cost of an inadequate software testing infrastructure. The second scenario investigates the cost reductions associated with finding an increased percentage (but not 100 percent) of bugs and errors closer to the development stages where they are introduced. This is referred to as a cost reduction from feasible infrastructure improvements.
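To make the two scenarios concrete, here is a minimal sketch of the kind of calculation involved. The stage names, relative repair costs, and bug counts are all made-up illustrative numbers, not figures from the report, and the report's actual model is more detailed than this.

```python
# Hypothetical stages and relative repair costs (illustrative only,
# NOT taken from the report). Cost depends on the stage where a bug is found.
STAGES = ["requirements/design", "coding/unit test",
          "integration/system test", "post-release"]
RELATIVE_COST = [1, 5, 10, 30]

# found[i][j] = bugs introduced in stage i and found in stage j (made-up data).
BASELINE = [
    [10, 20, 10, 10],
    [0, 40, 20, 10],
    [0, 0, 10, 5],
    [0, 0, 0, 0],
]


def repair_cost(found):
    """Total repair cost: each bug costs the multiplier of the stage it was found in."""
    return sum(n * RELATIVE_COST[j] for row in found for j, n in enumerate(row))


def shift_to_origin(found, fraction):
    """Move `fraction` of the late-found bugs back to the stage that introduced them.

    fraction=1.0 models the first scenario (every bug found where it is introduced);
    0 < fraction < 1 models the second scenario (feasible improvements).
    """
    shifted = [list(map(float, row)) for row in found]
    for i, row in enumerate(found):
        for j in range(i + 1, len(row)):
            moved = row[j] * fraction
            shifted[i][j] -= moved
            shifted[i][i] += moved
    return shifted


baseline_cost = repair_cost(BASELINE)
scenario_1_cost = repair_cost(shift_to_origin(BASELINE, 1.0))
scenario_2_cost = repair_cost(shift_to_origin(BASELINE, 0.5))

print("cost of inadequate testing infrastructure:", baseline_cost - scenario_1_cost)
print("saving from feasible improvements:", baseline_cost - scenario_2_cost)
```

The difference between the baseline and the first scenario plays the role of the "cost of an inadequate software testing infrastructure"; the difference between the baseline and the second scenario plays the role of the "feasible" saving.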

The study examined the impact of buggy software in several major industries (automotive, aerospace and financial services) and then extrapolated the results to the U.S. economy. It concluded that, under the first scenario, software bugs cost the U.S. economy an estimated $59.5 billion each year. Under the second scenario, improvements in testing could reduce this cost by about a third, or $22.5 billion.

The report also included interesting tables showing how often errors are found at each stage, and the relative cost of repairing defects found at different stages.

Air-Traffic Control System in LA Airport

Incident Date: 9/14/2004

(IEEE Spectrum) -- It was an air traffic controller's worst nightmare. Without warning, on Tuesday, 14 September, at about 5 p.m. Pacific daylight time, air traffic controllers lost voice contact with 400 airplanes they were tracking over the southwestern United States. Planes started to head toward one another, something that occurs routinely under careful control of the air traffic controllers, who keep airplanes safely apart. But now the controllers had no way to redirect the planes' courses.

The controllers lost contact with the planes when the main voice communications system shut down unexpectedly. To make matters worse, a backup system that was supposed to take over in such an event crashed within a minute after it was turned on. The outage disrupted about 800 flights across the country.

Inside the control system unit is a countdown timer that ticks off time in milliseconds. The VCSU uses the timer as a pulse to send out periodic queries to the VSCS. It starts out at the highest possible number that the system's server and its software can handle: 2^32. It's a number just over 4 billion milliseconds. When the counter reaches zero, the system runs out of ticks and can no longer time itself. So it shuts down.

Counting down from 2^32 to zero in milliseconds takes just under 50 days. The FAA procedure of having a technician reboot the VSCS every 30 days resets the timer to 2^32 almost three weeks before it runs out of digits.
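The arithmetic behind those figures is easy to check. The sketch below assumes the counter starts at 2^32 and loses one tick per millisecond, as the excerpt describes; the variable names are mine.

```python
from datetime import timedelta

COUNTER_BITS = 32          # width of the countdown timer described above
REBOOT_INTERVAL_DAYS = 30  # FAA maintenance procedure: reboot every 30 days

# Starting value of the countdown, in milliseconds.
ticks_ms = 2 ** COUNTER_BITS

# How long until the counter reaches zero, expressed in days.
days_to_zero = timedelta(milliseconds=ticks_ms) / timedelta(days=1)

# Safety margin left when the reboot happens on schedule.
margin_days = days_to_zero - REBOOT_INTERVAL_DAYS

print(f"{ticks_ms} ms")                          # 4294967296 ms
print(f"{days_to_zero:.1f} days until shutdown")  # 49.7 days until shutdown
print(f"{margin_days:.1f} days of margin")        # 19.7 days of margin
```

So the 30-day reboot leaves a margin of a little under 20 days, which is safe only as long as nobody skips it.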

Northeast Blackout

Incident Date: 8/14/2003 Price Tag: $6 - $10 Billion

NEW YORK (AP) - A programming error has been identified as the cause of alarm failures that might have contributed to the scope of last summer's Northeast blackout, industry officials said Thursday.

The failures occurred when multiple systems trying to access the same information at once got the equivalent of busy signals, he said. The software should have given one system precedence.

With the software not functioning properly at that point, data that should have been deleted were instead retained, slowing performance, he said. Similar troubles affected the backup systems.
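The excerpt describes a classic contention problem: several parties touching the same data at once with nothing deciding who goes first. As a rough illustration only, here is a minimal sketch of the usual remedy, a lock that lets one accessor at a time proceed. The `AlarmLog` class and `run_demo` function are hypothetical names of mine and have nothing to do with the actual alarm software.

```python
import threading


class AlarmLog:
    """Hypothetical shared event buffer guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []

    def append(self, event):
        # Only one thread at a time may modify the buffer.
        with self._lock:
            self._events.append(event)

    def drain(self):
        # Hand processed events to the caller and delete them from the
        # buffer in one step, so stale data does not pile up.
        with self._lock:
            events, self._events = self._events, []
        return events


def run_demo(writers=4, events_per_writer=1000):
    """Start several writer threads and return how many events were drained."""
    log = AlarmLog()

    def write(writer_id):
        for n in range(events_per_writer):
            log.append((writer_id, n))

    threads = [threading.Thread(target=write, args=(i,)) for i in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(log.drain())


print(run_demo())
```

With the lock in place every writer's events arrive intact and `drain` empties the buffer, so nothing that should have been deleted is retained.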

NASA Mars Climate Orbiter

Incident Date: 9/23/1999 Price Tag: $125 million

WASHINGTON (AP) -- For nine months, the Mars Climate Orbiter was speeding through space and speaking to NASA in metric. But the engineers on the ground were replying in non-metric English.

It was a mathematical mismatch that was not caught until after the $125-million spacecraft, a key part of NASA's Mars exploration program, was sent crashing too low and too fast into the Martian atmosphere. The craft has not been heard from since.

Noel Hinners of Lockheed Martin Astronautics, the prime contractor for the Mars craft, said at a news conference it was up to his company's engineers to assure the metric systems used in one computer program were compatible with the English system used in another program. The simple conversion check was not done, he said.
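To get a feel for how large such a mismatch can be, here is a small sketch. It assumes the quantity being exchanged is an impulse, given in pound-force seconds on one side and expected in newton-seconds on the other; the excerpt above only says "metric" versus "English", so treat that as my assumption. The `Impulse` class is a hypothetical illustration of carrying the unit with the value, not anything from the actual flight software.

```python
from dataclasses import dataclass

# 1 pound-force is defined as exactly 4.4482216152605 newtons.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse stored internally in newton-seconds (hypothetical helper type)."""

    newton_seconds: float

    @classmethod
    def from_pound_force_seconds(cls, value):
        # Convert explicitly at the boundary between the two programs.
        return cls(value * LBF_S_TO_N_S)


# A made-up reading produced by the "English" program.
reported_lbf_s = 10.0

# Correct handling: convert before use.
correct = Impulse.from_pound_force_seconds(reported_lbf_s)

# The bug: the raw number is taken to be newton-seconds already.
misread = Impulse(reported_lbf_s)

error_factor = correct.newton_seconds / misread.newton_seconds
print(f"misread value is too small by a factor of {error_factor:.2f}")
```

Every value passed across without conversion is off by the same factor of roughly 4.45, which is why a "simple conversion check" would have caught it.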

Denver Airport Baggage-handling System

Incident Date: 11/1993 - 6/1994 Price Tag: > $200 million

(Scientific American) -- Scheduled for takeoff by last Halloween (1993), the airport's grand opening was postponed until December to allow BAE Automated Systems time to flush the gremlins out of its $193-million system. December yielded to March. March slipped to May. In June the airport's planners, their bond rating demoted to junk and their budget hemorrhaging red ink at the rate of $1.1 million a day in interest and operating costs, conceded that they could not predict when the baggage system would stabilize enough for the airport to open.
