Building Ultra-Reliable Software

Reprinted from NASA
Why Is Formal Methods Necessary?

Digital systems can fail in catastrophic ways, leading to death or tremendous financial loss. Although there are many potential causes, including physical failure, human error, and environmental factors, design errors are increasingly becoming the most serious culprit. The following recent events show the potential of design errors for disaster:

David Hughes wrote in the December 21/28, 1998 issue of AVIATION WEEK & SPACE TECHNOLOGY:

At Lockheed Martin in 1995, I found that the Lockheed Martin/Boeing F-22 will really be a computer with wings… If the software is delivered on time and on budget, and works as expected, the F-22 will be a success. If not, the F-22 will be in big trouble.

Thus software becomes the Achilles’ heel of almost any aerospace defense project. To get it wrong means the whole project may go down the tubes.

On February 24, 1999, the President’s Information Technology Advisory Committee (PITAC) issued its report, “Information Technology Research: Investing in Our Future,” to President Clinton. They wrote:

Software research was judged by The President’s Information Technology Advisory Committee to be the highest priority area for fundamental research. From the desktop computer to the phone system to the stock market, our economy and society have become increasingly reliant on software. This Committee concluded that not only is the demand for software exceeding our ability to produce it; the software that is produced today is fragile, unreliable, and difficult to design, test, maintain, and upgrade.

The PITAC stated that demand for software has grown far faster than our ability to produce it. Furthermore, the Nation needs software that is far more usable, reliable, and powerful than what is being produced today. The PITAC noted that the U.S. has become dangerously dependent on large software systems whose behavior is not well understood and that often fail in unpredicted ways. Therefore, increases in research on software should be given a high priority, including software for managing large amounts of information, for making computers easier to use, for making software easier to create and maintain, and for improving the ways humans interact with computers.

We are very good at building complex software systems that work 95% of the time. But we do not know how to build complex software systems that are ultra-reliably safe (i.e., with a probability of failure on the order of 10⁻⁹). Consequently, industries that build safety-critical systems have relied upon relatively simple system designs and have minimized the amount of software that is safety-critical. They have also relied upon backups that are not software (e.g., a pilot who can take emergency action). Nevertheless, even with this approach, the verification and certification of such systems has been enormously expensive. It is not unusual to hear that 60% to 70% of the cost of an avionics box is verification and certification cost.

The traditional way of verifying software systems is through human inspection, simulation, and testing. Unfortunately, these approaches provide no guarantees about the quality of the software verified in this manner. Human inspection is limited by the abilities of the reviewers, and simulation and testing can only explore a minuscule fraction of the state space of any real software system:

Size of the State Space for a Simple System

  • Size of the input state space is 2⁵⁰
  • If each input could be tested in 1 millisecond, an exhaustive verification would require more than 35,000 years (see the calculation sketched below)
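
As a quick sanity check of the arithmetic in the bullets above, here is a back-of-the-envelope calculation in Python; the 50-bit input space and the 1 millisecond per test are the assumptions stated in the list, not additional data:

    # Back-of-the-envelope check of the exhaustive-testing figure above.
    inputs = 2 ** 50                         # number of distinct input states
    seconds = inputs * 1e-3                  # assume one test per millisecond
    years = seconds / (60 * 60 * 24 * 365)
    print(f"{inputs:.2e} inputs -> about {years:,.0f} years of testing")
    # -> roughly 35,700 years for this deliberately "simple" system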

Furthermore, the redundancy techniques used for hardware fault tolerance do not work for the design error problem. See the discussion of software fault tolerance for an explanation of why this is true.

https://shemesh.larc.nasa.gov/fm/fm-why-new.html


The Design Error Problem

A computer system may fail to perform as expected because a physical component fails or because a design error is uncovered. For a system to be both ultra-reliable and safe, both of these potential causes of failure must be handled.

Established techniques exist for handling physical component failure; these techniques use redundancy and voting. The reliability assessment problem in the presence of physical faults is based on Markov modeling techniques and is well understood.
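
To give a flavor of that analysis, the sketch below solves a small Markov reliability model for a triplex processor behind a majority voter (Python; the per-processor failure rate, the mission time, and the triplex architecture are illustrative assumptions, not figures from the article):

    # Markov reliability model of a triply redundant processor with a
    # perfect majority voter (illustrative numbers only).
    # States: 3 processors working, 2 working, system failed (< 2 working).
    lam = 1e-4        # assumed per-processor failure rate, per hour
    T = 10.0          # assumed mission time, hours
    dt = 1e-3         # Euler integration step, hours

    p3, p2, pfail = 1.0, 0.0, 0.0
    t = 0.0
    while t < T:
        d3 = -3 * lam * p3                  # any one of the 3 processors fails
        d2 = 3 * lam * p3 - 2 * lam * p2    # enter from state 3, leave on 2nd failure
        dfail = 2 * lam * p2                # second failure defeats the voter
        p3, p2, pfail = p3 + d3 * dt, p2 + d2 * dt, pfail + dfail * dt
        t += dt

    print(f"P(system failure within {T:.0f} h) ~ {pfail:.1e}")
    # ~3e-6, versus ~1e-3 for a single processor with the same failure rate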

The design error problem is a much greater threat. Unfortunately, no scientifically justifiable defense against this threat is currently used in practice. There are three basic strategies advocated for dealing with design errors:

  1. Testing (Lots of it)
  2. Design Diversity (i.e. software fault-tolerance: N-version programming, recovery blocks, etc.)
  3. Fault Avoidance (i.e. formal specification/verification, automatic program synthesis, reusable modules)

The problem with life testing is that, in order to measure ultra-reliability, one must test for exorbitant amounts of time. For example, to measure a 10⁻⁹ probability of failure for a 1-hour mission, one must test for more than 114,000 years.
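
The 114,000-year figure follows from the amount of failure-free operation needed to observe so small a rate. A rough check in Python, assuming a single copy under test and ignoring the additional factor required for statistical confidence:

    # To measure a 1e-9 per-hour failure probability one needs on the order
    # of 1e9 failure-free test hours (single test article, no confidence margin).
    failure_prob_per_hour = 1e-9
    test_hours = 1 / failure_prob_per_hour        # ~1e9 hours
    years = test_hours / (24 * 365)
    print(f"{test_hours:.0e} test hours ~ {years:,.0f} years")
    # -> about 114,000 years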

Many advocate design diversity as a means to overcome the limitations of testing. The basic idea is to use separate design/implementation teams to produce multiple versions from the same specification. Then, non-exact threshold voters are used to mask the effect of a design error in one of the versions. The hope is that the design flaws will manifest errors independently or nearly so.
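
To make the mechanism concrete, here is a minimal sketch of a non-exact (threshold) voter over three versions in Python; the tolerance and the version functions are hypothetical placeholders, not code from the article:

    # Minimal sketch of non-exact threshold voting over independently
    # developed versions; tolerance and version functions are hypothetical.
    EPS = 1e-6  # outputs within this tolerance are treated as agreeing

    def vote(outputs, eps=EPS):
        """Return a value backed by a majority of versions, or raise if none."""
        for candidate in outputs:
            agreeing = [o for o in outputs if abs(o - candidate) <= eps]
            if len(agreeing) > len(outputs) // 2:
                return sum(agreeing) / len(agreeing)
        raise RuntimeError("no majority: versions diverged")

    version_a = lambda x: 0.5 * x
    version_b = lambda x: 0.5 * x + 0.1   # a version with a design error
    version_c = lambda x: 0.5 * x

    x = 2.0
    print(vote([version_a(x), version_b(x), version_c(x)]))  # masks the faulty version -> 1.0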

By assuming independence, one can obtain ultra-reliable-level estimates of reliability even though the individual versions have failure rates on the order of 10⁻⁴. Unfortunately, the independence assumption has been rejected at the 99% confidence level in several experiments with low-reliability software.
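
The arithmetic behind that claim, for a three-version majority-voted system (Python; the 10⁻⁴ per-version failure rate comes from the text, while the 1% coincident-failure figure is purely illustrative):

    # Under the independence assumption, three versions that each fail with
    # probability 1e-4 yield an apparently ultra-reliable voted system.
    p = 1e-4
    independent = 3 * p**2 * (1 - p) + p**3   # at least 2 of 3 fail together
    print(f"assuming independence:   {independent:.1e}")   # ~3.0e-8

    # If just 1% of one version's failures coincide with another version's
    # (an illustrative correlation, not a measured value), the gain collapses.
    coincident = 0.01 * p                     # two versions fail on the same input
    print(f"with slight correlation: {coincident:.1e}")    # ~1.0e-6, far from 1e-9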

Furthermore, the independence assumption can never be validated for high-reliability software because of the exorbitant test times required. If one cannot assume independence, then one must measure the correlations. This is infeasible as well: it requires as much testing time as life-testing the system itself, because the correlations must be shown to lie in the ultra-reliable region for the system to be ultra-reliable. Therefore, it is not possible, within feasible amounts of testing time, to establish that design diversity achieves ultra-reliability. Consequently, design diversity can create an “illusion” of ultra-reliability without actually providing it.

From this analysis, we conclude that formal methods currently offers the most promising approach to the design error problem. Because the often-quoted 1 − 10⁻⁹ reliability is beyond the range of quantification, we have no choice but to develop life-critical systems in the most rigorous manner available to us: the use of formal methods.


References
  1. Butler, Ricky W.; and Finelli, George B.: The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software. IEEE Transactions on Software Engineering, vol. 19, no. 1, Jan. 1993, pp. 3-12.
  2. Holloway, C. Michael: Why Engineers Should Consider Formal Methods. Proceedings of the 16th AIAA/IEEE Digital Avionics Systems Conference, October 26-30, 1997, Irvine, CA, Volume 1, pp. 1.3-16 – 1.3-22. (Takes a very different approach than that given on these pages.)
