Tuesday, November 11, 2008

Using Fault Tree Analysis to Improve Software Testing

Testing a software product to remove hidden defects is an integral part of the software development life cycle (SDLC). Yet it is well accepted that running a software product through every possible scenario to check for defects is not just difficult, but usually impossible. The cost and effort required are simply too great. Thus, more limited testing remains a major part of the software development effort, as do the challenges faced in software testing.

The application of process improvement tools to the software development life cycle is becoming popular in the software community. These techniques have already been successfully leveraged by manufacturers, which has encouraged software professionals to apply such tools to the SDLC. Using fault tree analysis (FTA) is one good way to improve the effectiveness of software testing. It can help identify the potential causes of a problem, suggest suitable corrective action and offer insight into preparing test case scenarios.

Challenges in Software Testing

1. Inherent challenge: It is next to impossible to test a software product of average complexity against all of its specifications and features. The number of test cases required to test every aspect of a software application would be so large that it would be economically impossible to prepare and execute them. For example, Watts S. Humphrey, in his book Managing the Software Process, considers a simple program that analyzes a string of 10 alphabetic characters, which could have 26^10 combinations. At a rate of one combination per microsecond, testing every combination would take roughly 4.5 years.

2. Laborious process of test case preparation and documentation: Test case preparation is labor intensive and has to fit into what is normally an already tight schedule. Often project teams are tempted to pay less attention to this activity. Considering the large number of test cases to be developed, it takes considerable effort to document and maintain the documentation. The project team seldom documents all the test cases and has to conduct testing with additional undocumented test cases.

3. Effectiveness of test cases: Identifying a test case is as important as writing a line of code. Lack of proper methods makes this task more challenging. Software engineering researchers, such as Glenford J. Myers in his book Software Reliability, Principles and Practices, have observed that it is impossible to test one's own programs. Test cases for a module created by the software developer tend to have an ingrained bias toward an application's functionality. Such test cases often are prepared to prove what is being developed, instead of to reveal defects – the proper objective of test cases.



Figure 1

4. Resource crunch: More effort is spent in software testing than in any other phase of software development. Figure 1 shows the distribution of effort in the software development life cycle. The percent distributions are typical industry averages. While specific amounts of effort are scheduled for testing, projects often end up with less testing time than planned because the design and construction phases consume more effort than estimated. Tools that may increase the effectiveness of testing are often unavailable or go unnoticed. Even if tools are available, a project team faced with a new learning curve may not be inclined to use them.
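The arithmetic behind the string example in item 1 can be checked with a short sketch. It assumes a single-case alphabet of 26 letters and one test per microsecond, as in the example; the function and constant names are illustrative only.

```python
# Back-of-the-envelope cost of exhaustively testing a 10-letter string.
ALPHABET_SIZE = 26            # single-case alphabetic characters (assumption)
STRING_LENGTH = 10
TESTS_PER_SECOND = 1_000_000  # one combination per microsecond

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def exhaustive_test_years(alphabet_size, length, tests_per_second):
    """Return (number of combinations, years needed to try them all)."""
    combinations = alphabet_size ** length
    seconds = combinations / tests_per_second
    return combinations, seconds / SECONDS_PER_YEAR


combinations, years = exhaustive_test_years(
    ALPHABET_SIZE, STRING_LENGTH, TESTS_PER_SECOND
)
print(f"{combinations:,} combinations, about {years:.1f} years")
```

Changing ALPHABET_SIZE to 52 (upper and lower case) pushes the same calculation into the thousands of years, which is why exhaustive testing stops being practical so quickly.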

How Can the Challenges Be Met?

An analysis of the testing process reveals that one of the root causes of ineffectiveness is the process of test case creation. A test case is considered effective when it can reveal a defect. With good test cases, most latent defects can be identified and fixed before a product is shipped. Hence improving the test case creation process will help make the software testing process more effective.

The complexity of conventional test case documents often becomes a bottleneck to improving effectiveness. The way out is to deploy the right tools to design useful test cases, ones which can reveal defects. It may not be necessary to test every possible combination, since many of them could be redundant. The focus must be on those tests which accurately indicate the health of the software.

Fault tree analysis may help simplify designing better test cases to improve effectiveness of the test process. The FTA preparation process brings in a variety of ideas, broadens the scope of thinking and adds creativity to the process.

What Is Fault Tree Analysis?

Fault tree analysis is a top-down approach to identifying all potential causes leading to a defect. Each cause is further broken down into the most basic events or faults. The analysis begins with a major defect. All the potential events, individually or in combination, that may cause the defect are identified. Potential events are then traced down in the same way to the lowest possible level.



Table 1

Two logic symbols, known as logic gates, are used to represent how events combine: And and Or. The And symbol indicates that all preceding events must exist simultaneously for a defect to occur. The Or symbol indicates that any one of the preceding events may lead to the defect. Table 1, known as a truth table, illustrates how the logic gates behave. Consider the And gate: the output exists only when both inputs are present simultaneously. In fault tree terms, the fault condition exists only if the preceding events exist simultaneously. In the case of the Or gate, either input alone is sufficient to produce the output condition. That is, either input state may result in a fault condition.
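The behavior of the two gates can be sketched in a few lines of Python. This is only an illustration of the logic described above; the function names and_gate, or_gate and truth_table are made up for this sketch.

```python
from itertools import product


def and_gate(*inputs):
    """Fault occurs only if every preceding event is present."""
    return all(inputs)


def or_gate(*inputs):
    """Fault occurs if at least one preceding event is present."""
    return any(inputs)


def truth_table():
    """Return rows of (input A, input B, And output, Or output)."""
    return [
        (a, b, and_gate(a, b), or_gate(a, b))
        for a, b in product([False, True], repeat=2)
    ]


print("A      B      And    Or")
for row in truth_table():
    print("  ".join(f"{str(value):<5}" for value in row))
```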



Figure 2: Demonstration of FTA

Figure 2 illustrates how FTA could be used for a typical situation of troubleshooting, e.g., computer not starting. The fault tree is shown with Levels 1 through 3 tracing the fault conditions, Level 1 being the highest.

There are at least two situations (faults) that may result in a computer not starting. Since either of the situations – power failure or booting failure – is capable of producing the Level 1 fault, the Or gate is used to represent their combination. The Level 2 fault, power failure, may result if the primary power source fails and at the same time the uninterruptible power supply (UPS) is down. An And gate is used to represent this situation. The event, "UPS down," may be further traced to faults like battery failure, hardware failure and so on.
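The tree described above can also be written down as data and evaluated. The following is a minimal sketch, not a standard FTA tool: the Event class and its occurs method are hypothetical names, and only the events named in the text are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    gate: str = "BASIC"  # "AND", "OR", or "BASIC" for a leaf event
    children: list = field(default_factory=list)

    def occurs(self, active_faults):
        """Return True if this event occurs given a set of active basic faults."""
        if self.gate == "BASIC":
            return self.name in active_faults
        results = [child.occurs(active_faults) for child in self.children]
        return all(results) if self.gate == "AND" else any(results)


# Level 2: power fails only if the primary source AND the UPS are both down.
power_failure = Event(
    "Power failure",
    "AND",
    [Event("Primary power source fails"), Event("UPS down")],
)

# Level 1: the computer does not start if power OR booting fails.
computer_not_starting = Event(
    "Computer not starting",
    "OR",
    [power_failure, Event("Booting failure")],
)

print(computer_not_starting.occurs({"UPS down"}))
print(computer_not_starting.occurs({"UPS down", "Primary power source fails"}))
print(computer_not_starting.occurs({"Booting failure"}))
```

Each combination of basic faults that makes the top event true is a candidate test scenario, which is how the tree feeds test case design.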

Deciding the scope of FTA at the beginning is essential to limit the analysis to the required level. For example, if the focus is on problems with a computer, there is no need to analyze the failures of a UPS as it may not be an integral part of a computer. However, the failure of a motherboard associated with a booting problem may be discussed further as it is very much part of a computer system.

Advantages of Applying FTA

FTA can be advantageous to software projects in at least three ways:

Value addition: FTA has the potential to serve as a defect-prevention tool. If FTA is performed before baselining the design, it can provide valuable information on application failures and their mechanisms. This information could be used to improve the design by preventing potential defects or by introducing fault tolerance. FTA is most effective for more complex functions and may not add much value when applied to the simple functions of a software application. FTA also uses the potential of teamwork to bring in a variety of ideas and broaden thinking.

Simplicity: FTA is very simple and can be prepared by project teams with minimum training. Its graphical presentation improves readability and makes it easy to maintain in the event of changes.

Traceability: Some of the conventional test case tools provide a unique identification to individual test cases. Such traceability could be added to FTA by appropriately identifying the individual scenario.

An FTA Case Study

Here is a common example of improving the security of a software application by using controlled access. A poorly chosen login name or password may weaken application security (this example focuses on the user ID and password rather than other factors, such as network or other interfaces). Figure 3 illustrates how this is represented.

The user ID and the password are considered further to see what could lead to a defect, i.e., poor security. The short length, non-use of digits or special characters, and validity not bounded by time, etc., could make a password weak. Similarly such situations could be listed for user IDs and other primary concerns.

Each scenario is identified with a unique number to establish traceability. Such traceability helps test cases to be related to other project artifacts like requirements, design or program specifications. The valid and invalid conditions for respective scenarios also can be noted for quick reference during testing.
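The password branch of this case study might be captured as follows. This is a hedged sketch: the scenario IDs (PW-01 and so on), the eight-character minimum and the function name are assumptions made for illustration, since the article does not give the actual numbering or thresholds.

```python
import string

MIN_LENGTH = 8  # hypothetical threshold, not taken from the article

# Each scenario has a unique ID so test cases can be traced back to it.
# A check returns True when the fault (a weakness) is present.
PASSWORD_SCENARIOS = {
    "PW-01": ("Short length",
              lambda pw, expiry: len(pw) < MIN_LENGTH),
    "PW-02": ("No digits",
              lambda pw, expiry: not any(c.isdigit() for c in pw)),
    "PW-03": ("No special characters",
              lambda pw, expiry: not any(c in string.punctuation for c in pw)),
    "PW-04": ("Validity not bounded by time",
              lambda pw, expiry: expiry is None),
}


def weak_password_faults(password, expires_after_days):
    """Return the IDs of all scenarios that make this password weak.

    expires_after_days is None when the password never expires.
    """
    return [
        scenario_id
        for scenario_id, (_description, check) in PASSWORD_SCENARIOS.items()
        if check(password, expires_after_days)
    ]


print(weak_password_faults("abc", None))
print(weak_password_faults("Str0ng!pass", 90))
```

Because the faults are combined with an Or gate, a password counts as weak as soon as the returned list is non-empty; the IDs then point to the test cases and requirements involved.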

Well Known Software Failures

Software systems are pervasive in all aspects of society. From electronic voting to online shopping, a significant part of our daily life is mediated by software. On this page, I collect a list of well-known software failures. I will start with a study of the economic cost of software bugs.

Contents

Economic Cost of Software Bugs
Air-Traffic Control System in LA Airport *****
Northeast Blackout **
NASA Mars Climate Orbiter ****
Denver Airport Baggage-handling System *

The number of *s is the irony factor I assign to each story. The one with the most *s is the most ironic. Why? You will find out.

Economic Cost of Software Bugs

Report Date: 2/2002 Price Tag: $60 Billion Annually

WASHINGTON (COMPUTERWORLD) - Software bugs are costing the U.S. economy an estimated $59.5 billion each year, with more than half of the cost borne by end users and the remainder by developers and vendors, according to a new federal study.

Improvements in testing could reduce this cost by about a third, or $22.5 billion, but it won't eliminate all software errors, the study said. Of the total $59.5 billion cost, users incurred 64% of the cost and developers 36%.

Curious about how the study calculated the cost, I skimmed through the report. The following is a summary of its methodology.

It divided the software development process into stages: Requirement Gathering and Analysis, Architectural Design, Coding, Unit Test, Integration and Component, RAISE System Test, Early Customer Feedback, Beta Test Programs, and Post-product Release.

Bugs are generated at each stage of the software development process. The later in the production process that a bug is discovered, the more costly it is to repair. Impact estimates were then developed relative to two counterfactual scenarios. The first scenario investigates the cost reductions if all bugs and errors could be found in the same development stage in which they are introduced. This is referred to as the cost of an inadequate software testing infrastructure. The second scenario investigates the cost reductions associated with finding an increased percentage (but not 100 percent) of bugs and errors closer to the development stages where they are introduced. This is referred to as a cost reduction from feasible infrastructure improvements.

The study examined the impact of buggy software in several major industries -- automotive, aerospace and financial services -- and then extrapolated the results to the U.S. economy. It concluded that software bugs cost the U.S. economy (the first scenario) an estimated $59.5 billion each year. Improvements in testing (the second scenario) could reduce this cost by about a third, or $22.5 billion.

The report also included interesting tables that show how frequently errors are found at each stage, and the relative cost of repairing defects found at different stages.

Air-Traffic Control System in LA Airport

Incident Date: 9/14/2004

(IEEE Spectrum) -- It was an air traffic controller's worst nightmare. Without warning, on Tuesday, 14 September, at about 5 p.m. Pacific daylight time, air traffic controllers lost voice contact with 400 airplanes they were tracking over the southwestern United States. Planes started to head toward one another, something that occurs routinely under careful control of the air traffic controllers, who keep airplanes safely apart. But now the controllers had no way to redirect the planes' courses.

The controllers lost contact with the planes when the main voice communications system shut down unexpectedly. To make matters worse, a backup system that was supposed to take over in such an event crashed within a minute after it was turned on. The outage disrupted about 800 flights across the country.

Inside the control system unit is a countdown timer that ticks off time in milliseconds. The VCSU uses the timer as a pulse to send out periodic queries to the VSCS. It starts out at the highest possible number that the system's server and its software can handle: 2^32. It's a number just over 4 billion milliseconds. When the counter reaches zero, the system runs out of ticks and can no longer time itself. So it shuts down.

Counting down from 2^32 to zero in milliseconds takes just under 50 days. The FAA procedure of having a technician reboot the VSCS every 30 days resets the timer to 2^32 almost three weeks before it runs out of digits.
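The timing quoted above is easy to verify. The sketch below only redoes the arithmetic from the excerpt; the constant and function names are illustrative and say nothing about how the actual system was implemented.

```python
TIMER_START_TICKS = 2 ** 32       # countdown start value, in milliseconds
MS_PER_DAY = 24 * 60 * 60 * 1000
REBOOT_INTERVAL_DAYS = 30         # reboot procedure described in the excerpt


def days_until_timer_expires(start_ticks=TIMER_START_TICKS):
    """Days needed to count down from start_ticks to zero at 1 tick per ms."""
    return start_ticks / MS_PER_DAY


days_to_zero = days_until_timer_expires()
safety_margin_days = days_to_zero - REBOOT_INTERVAL_DAYS

print(f"Timer reaches zero after {days_to_zero:.1f} days")
print(f"A 30-day reboot leaves a margin of {safety_margin_days:.1f} days")
```

The margin of just under 20 days matches the "almost three weeks" in the excerpt, and it also shows that a single missed reboot is enough to run the counter down to zero.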

Northeast Blackout

Incident Date: 8/14/2003 Price Tag: $6 - $10 Billion

NEW YORK (AP) - A programming error has been identified as the cause of alarm failures that might have contributed to the scope of last summer's Northeast blackout, industry officials said Thursday.

The failures occurred when multiple systems trying to access the same information at once got the equivalent of busy signals, he said. The software should have given one system precedence.

With the software not functioning properly at that point, data that should have been deleted were instead retained, slowing performance, he said. Similar troubles affected the backup systems.

NASA Mars Climate Orbiter

Incident Date: 9/23/1999 Price Tag: $125 million

WASHINGTON (AP) -- For nine months, the Mars Climate Orbiter was speeding through space and speaking to NASA in metric. But the engineers on the ground were replying in non-metric English.

It was a mathematical mismatch that was not caught until after the $125-million spacecraft, a key part of NASA's Mars exploration program, was sent crashing too low and too fast into the Martian atmosphere. The craft has not been heard from since.

Noel Hinners of Lockheed Martin Astronautics, the prime contractor for the Mars craft, said at a news conference it was up to his company's engineers to assure the metric systems used in one computer program were compatible with the English system used in another program. The simple conversion check was not done, he said.
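One way to make such a conversion check hard to skip is to carry the unit along with the value. The sketch below is only an illustration: the excerpt does not say which quantity was involved, so the choice of impulse in pound-force seconds versus newton-seconds is an assumption, and the Impulse class is hypothetical.

```python
from dataclasses import dataclass

# Newton-seconds per pound-force second.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    value: float
    unit: str  # "N*s" (metric) or "lbf*s" (English), assumed units

    def to_newton_seconds(self):
        """Convert to metric, refusing values whose unit is not recognized."""
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown unit: {self.unit!r}")


ground_value = Impulse(10.0, "lbf*s")
print(f"{ground_value.to_newton_seconds():.2f} N*s")
```

A program that accepted the bare number 10.0 as metric would be off by a factor of about 4.45, which is the kind of mismatch an explicit unit tag catches at the interface.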

Denver Airport Baggage-handling System

Incident Date: 11/1993 - 6/1994 Price Tag: > $200 million

(Scientific American) -- Scheduled for takeoff by last Halloween (1993), the airport's grand opening was postponed until December to allow BAE Automated Systems time to flush the gremlins out of its $193-million system. December yielded to March. March slipped to May. In June the airport's planners, their bond rating demoted to junk and their budget hemorrhaging red ink at the rate of $1.1 million a day in interest and operating costs, conceded that they could not predict when the baggage system would stabilize enough for the airport to open.