38 Software Reliability

R. Baskaran

SOFTWARE RELIABILITY

Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment.

LEARNING OBJECTIVES

• To differentiate between failures and faults.

• To highlight the importance of execution time and calendar time.

• To understand the time interval between failures.

• To understand the user's perception of reliability.

DEFINITIONS OF SOFTWARE RELIABILITY

Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment. The key elements of the definition are the probability of failure-free operation, the length of time of failure-free operation, and the given execution environment. Failure intensity, the number of failures per unit of time, is a measure of the reliability of a software system operating in a given environment. Example: an air traffic control system fails once in two years, which is a failure intensity of 0.5 failures per year.

Factors Influencing Software Reliability

A user's perception of the reliability of software depends upon two categories of information.
- The number of faults present in the software.
- The way users operate the system. This is known as the operational profile.
The fault count in a system is influenced by the following.
- Size and complexity of code.
- Characteristics of the development process used.
- Education, experience, and training of development personnel.
- Operational environment.

Applications of Software Reliability

The applications of software reliability include the following.

Comparison of software engineering technologies.
- What is the cost of adopting a technology?
- What is the return from the technology, in terms of cost and quality?
Measuring the progress of system testing – The failure intensity measure indicates the present quality of the system: a high intensity means that more testing is needed.
Controlling the system in operation – The amount of change made to software during maintenance affects its reliability.
Better insight into software development processes – Quantifying quality gives a better insight into the development processes.

FUNCTIONAL AND NON-FUNCTIONAL REQUIREMENTS

System functional requirements may specify error checking, recovery features, and system failure protection. System reliability and availability are specified as part of the non-functional requirements for the system.

SYSTEM RELIABILITY SPECIFICATION

Hardware reliability focuses on the probability that a hardware component fails.
Software reliability focuses on the probability that a software component produces an incorrect output.
Unlike hardware, software does not wear out, and it can continue to operate after producing a bad result.
Operator reliability focuses on the probability that a system user makes an error.

FAILURE PROBABILITIES

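If a system has two independent components, A and B, and its operation depends on both of them, then the system fails when either component fails: P(S) = 1 - (1 - P(A))(1 - P(B)) = P(A) + P(B) - P(A)P(B). When the individual failure probabilities are small, this is approximately P(S) = P(A) + P(B).

If a component is replicated n times and the replicas fail independently, the system fails only when all n replicas fail at once, so P(S) = P(A)^n.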
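The two cases above can be sketched in a few lines of Python. This is only an illustration under the assumption that component failures are statistically independent; the function names and the numeric values are hypothetical and do not come from the text.

```python
def series_failure_probability(p_a: float, p_b: float) -> float:
    """Failure probability of a system that needs both A and B to work."""
    # The system works only if A works and B works (independence assumed).
    return 1.0 - (1.0 - p_a) * (1.0 - p_b)


def replicated_failure_probability(p_a: float, n: int) -> float:
    """Failure probability of n independent replicas of component A."""
    # The replicated system fails only if every replica fails at once.
    return p_a ** n


if __name__ == "__main__":
    p_a, p_b = 0.01, 0.02  # hypothetical component failure probabilities
    print("series, exact:      ", series_failure_probability(p_a, p_b))
    print("series, approximate:", p_a + p_b)
    print("three replicas of A:", replicated_failure_probability(p_a, 3))
```

For small probabilities the exact series value and the simple sum are close, which is why the sum is commonly used as a quick estimate.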

FUNCTIONAL RELIABILITY REQUIREMENTS

The system will check all operator inputs to see that they fall within their required ranges.
The system will check all disks for bad blocks each time it is booted.
The system must be implemented using a standard implementation of Ada.
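The first requirement above, checking that operator inputs fall within their required ranges, might look like the following minimal sketch. The names InputRange and check_operator_input, and the example range, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputRange:
    """Allowed closed interval [low, high] for one operator input."""
    name: str
    low: float
    high: float


def check_operator_input(value: float, allowed: InputRange) -> float:
    """Return the value unchanged if it is in range, otherwise raise ValueError."""
    if not (allowed.low <= value <= allowed.high):
        raise ValueError(
            f"{allowed.name}={value} is outside [{allowed.low}, {allowed.high}]"
        )
    return value


if __name__ == "__main__":
    speed = InputRange("speed", 0.0, 10.0)  # hypothetical input and limits
    print(check_operator_input(5.0, speed))
    try:
        check_operator_input(11.0, speed)
    except ValueError as err:
        print("rejected:", err)
```

Rejecting the input at the point of entry keeps an out-of-range value from propagating into later processing, which is the purpose of stating the check as a functional requirement.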

NON-FUNCTIONAL RELIABILITY SPECIFICATION

The required level of reliability must be expressed quantitatively. Because reliability is a dynamic system attribute, specifications stated in terms of the source code (e.g. N faults/1000 LOC) are meaningless. An appropriate metric should be chosen to specify the overall system reliability.

HARDWARE RELIABILITY METRICS

Hardware reliability metrics are not well suited to software because they are based on the notion of component failure. Hardware components can wear out, whereas software failures are usually design failures, and the system is often still available after a failure has occurred.

SOFTWARE RELIABILITY METRICS

Reliability metrics are units of measure for system reliability. System reliability is measured by counting the number of operational failures and relating these to demands made on the system at the time of failure. A long-term measurement program is required to assess the reliability of critical systems.

PROBABILITY OF FAILURE ON DEMAND

Probability of failure on demand is the probability that the system will fail when a service request is made. It is useful when requests are made on an intermittent or infrequent basis. It is appropriate for protection systems, where service requests may be rare and the consequences can be serious if the service is not delivered. It is relevant for many safety-critical systems with exception handlers.

RATE OF FAULT OCCURRENCE

Rate of fault occurrence reflects the rate at which failures occur in the system. It is useful when the system has to process a large number of similar requests that are relatively frequent. It is relevant for operating systems and transaction processing systems.

RELIABILITY METRICS

Probability of Failure on Demand (PoFoD)
- PoFoD = 0.001.
- On average, the service fails for one in every 1000 requests.
Rate of Fault Occurrence (RoCoF)
- RoCoF = 0.02.
- Two failures are expected in every 100 operational time units.
Mean Time to Failure (MTTF)
- The average time between observed failures (often loosely called MTBF).
- It measures time between observable system failures.
- For stable systems MTTF = 1/RoCoF.
- It is relevant for systems in which individual transactions take a long time to process (e.g. CAD or word-processing systems).
Availability = MTBF / (MTBF+MTTR)
- MTBF = Mean Time Between Failure
- MTTR = Mean Time to Repair
Reliability = MTBF / (1+MTBF)
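As a rough illustration, the first three metrics above can be computed from simple counts. The sketch below assumes that the failure counts, request counts, and operational time have already been collected; the function names are hypothetical.

```python
def pofod(failed_requests: int, total_requests: int) -> float:
    """Probability of failure on demand: failed requests per request made."""
    return failed_requests / total_requests


def rocof(failures: int, operational_time: float) -> float:
    """Rate of fault occurrence: failures per operational time unit."""
    return failures / operational_time


def mttf(failures: int, operational_time: float) -> float:
    """Mean time to failure; for a stable system this equals 1 / RoCoF."""
    return operational_time / failures


if __name__ == "__main__":
    # The figures used in the list above.
    print("PoFoD:", pofod(1, 1000))   # 1 failed request in 1000
    print("RoCoF:", rocof(2, 100.0))  # 2 failures in 100 time units
    print("MTTF: ", mttf(2, 100.0))   # average time units between failures
```

With two failures in 100 time units, the RoCoF is 0.02 and the MTTF is 50 time units, which agrees with the relation MTTF = 1/RoCoF.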

TIME UNITS

Time units include:

Raw execution time, which is used for non-stop systems.
Calendar time, which is used when the system has regular usage patterns.
Number of transactions, which is used for demand-type transaction systems.

AVAILABILITY

Availability measures the fraction of time for which the system is actually available for use. It takes repair and restart times into account. It is relevant for non-stop, continuously running systems (e.g. a traffic signal system).
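A minimal sketch of how availability might be estimated from observed data, using Availability = MTBF / (MTBF + MTTR). It assumes that MTBF is taken as the mean operating time between failures and that both lists use the same time unit; the function name and the sample figures are hypothetical.

```python
def availability_from_intervals(uptimes, repair_times):
    """Estimate availability from observed uptimes and repair times."""
    mtbf = sum(uptimes) / len(uptimes)            # mean operating time between failures
    mttr = sum(repair_times) / len(repair_times)  # mean time to repair
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    uptimes_h = [99.0, 101.0, 100.0]  # hypothetical hours of operation before each failure
    repairs_h = [1.0, 1.0, 1.0]       # hypothetical hours needed for each repair
    print(f"availability = {availability_from_intervals(uptimes_h, repairs_h):.4f}")
```

Here the mean uptime is 100 hours and the mean repair time is 1 hour, so the estimated availability is 100/101, or about 0.99.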

FAILURE CONSEQUENCES – STUDY 1

Reliability does not take consequences into account. Transient faults may have no real consequences, but other faults might cause data loss or corruption. Hence it may be worthwhile to identify different classes of failure and to use a different metric for each.

FAILURE CONSEQUENCES – STUDY 2

When specifying reliability, both the number of failures and the consequences of each failure matter. Failures with serious consequences are more damaging than those for which repair and recovery are straightforward. In some cases, different reliability specifications may be defined for different failure types.

FAILURE CLASSIFICATION

Failures can be classified as follows.

Transient – only occurs with certain inputs.
Permanent – occurs with all inputs.
Recoverable – system can recover without operator help.
Unrecoverable – operator has to help.
Non-corrupting – failure does not corrupt system state or data.
Corrupting – system state or data are altered.

BUILDING RELIABILITY SPECIFICATION

Building a reliability specification involves analysing the consequences of possible system failures for each sub-system. From the system failure analysis, partition the failures into appropriate classes. For each class, set out the required reliability using an appropriate metric.
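One way to picture the outcome of these steps is a small table that pairs each failure class with a metric and a required value. The sketch below is only an illustration: the class names combine the categories listed earlier, while the metric choices, the limits, and the names ReliabilityRequirement and SPEC are hypothetical.

```python
from dataclasses import dataclass

# Metrics for which a smaller observed value is better. For the other
# metrics (e.g. MTTF, availability) a larger observed value is better.
LOWER_IS_BETTER = {"PoFoD", "RoCoF"}


@dataclass(frozen=True)
class ReliabilityRequirement:
    failure_class: str
    metric: str
    limit: float

    def is_met(self, observed: float) -> bool:
        """Compare an observed metric value against the required limit."""
        if self.metric in LOWER_IS_BETTER:
            return observed <= self.limit
        return observed >= self.limit


# Hypothetical specification: one requirement per failure class.
SPEC = [
    ReliabilityRequirement("transient, non-corrupting", "RoCoF", 0.02),
    ReliabilityRequirement("unrecoverable, corrupting", "PoFoD", 0.001),
]

if __name__ == "__main__":
    observed = {"RoCoF": 0.01, "PoFoD": 0.005}  # hypothetical measurements
    for req in SPEC:
        status = "met" if req.is_met(observed[req.metric]) else "NOT met"
        print(f"{req.failure_class}: {req.metric} limit {req.limit} is {status}")
```

Keeping one requirement per failure class makes it possible to set a much stricter target for corrupting failures than for transient ones.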

SPECIFICATION VALIDATION

It is practically impossible to validate high reliability specifications empirically. A requirement of no database corruption really means a PoFoD of less than 1 in 200 million. If each transaction takes 1 second to verify, simulating one day's transactions takes 3.5 days (that is, roughly 300,000 transactions per day).
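The arithmetic behind this argument can be checked with a short sketch. The figure of 300,000 transactions per day is an assumption chosen so that the result roughly matches the 3.5 days quoted above; it is not stated in the text. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def simulation_days(transactions: int, seconds_per_transaction: float = 1.0) -> float:
    """Days needed to verify the given number of transactions one after another."""
    return transactions * seconds_per_transaction / SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed daily load (see the note above): about 3.5 days of simulation.
    print(f"one day's load:  {simulation_days(300_000):.2f} days")
    # Observing 200 million transactions at the same speed takes several years.
    print(f"200 million:     {simulation_days(200_000_000) / 365:.1f} years")
```

Even a single pass over 200 million transactions would take years at one second each, which is why such a specification cannot be demonstrated by testing alone.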

Web Links

https://users.ece.cmu.edu/~koopman/des_s99/sw_reliability/
https://www.cs.drexel.edu/~spiros/teaching/CS576/slides/9.reliability.pdf
https://www.tutorialspoint.com/software_testing_dictionary/reliability_testing.htm

Supporting & Reference Materials

Roger S. Pressman, “Software Engineering: A Practitioner’s Approach”, Fifth Edition, McGraw Hill, 2001.
Pankaj Jalote, “An Integrated Approach to Software Engineering”, Second Edition, Springer Verlag, 2005.
Ian Sommerville, “Software Engineering”, Sixth Edition, Addison Wesley, 2000.
Doron A. Peled, “Software Reliability Methods”, Springer Publications.
http://www.softrel.com/IEEE1633.pdf