PRELIMINARY VERSION A Design Diversity Metric and Analysis of Redundant Systems

更新时间:2023-05-07 06:39:01 阅读量: 实用文档 文档下载

说明:文章内容仅供预览,部分内容可能不全。下载后的文档,内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的,是否完整无缺。

PRELIMINARY VERSION

Copyright ? 1999 by the Center for Reliable Computing, Stanford University.All rights reserved, including the right to reproduce this report, or portions thereof, in any form.Center for Reliable Computing TECHNICAL REPORT

A Design Diversity Metric and Analysis of Redundant Systems

Subhasish Mitra, Nirmal R. Saxena and Edward J. McCluskey

Imprimatur: Philip Shirvani and Santiago Fernandez-Gomez

PRELIMINARY VERSION 1

PRELIMINARY VERSION

2

A Design Diversity Metric and Analysis of Redundant Systems

Subhasish Mitra, Nirmal R. Saxena and Edward J. McCluskey

CRC Technical Report No. 99-4

(CSL TR No. ??)

May 1999

Center for Reliable Computing

Computer Systems Laboratory

Departments of Electrical Engineering and Computer Science

Stanford University, Stanford, California 94305

Abstract

Design diversity has long been used to protect redundant systems against common-mode failures. The conventional notion of diversity relies on “independent” generation of “different” implementations. This concept is qualitative and does not provide a basis to compare the reliabilities of two diverse systems. In this paper, for the first time, we present a metric to quantify diversity among several designs. Based on this metric, we derive analytical reliability models that show a simple relationship among design diversity, system failure rate, and mission time. We also perform availability analysis of redundant systems using our metric. In addition, we present simulation results to demonstrate the effectiveness of design diversity in duplex systems. For common-mode failures and design faults, there is a significant gain in using different implementations — however, as our analysis shows, the gain diminishes as the mission time increases. For independent multiple-module failures, we show that, mere use of different implementations does not always guarantee higher reliability compared to redundant systems with identical implementations — it is important to analyze the reliability of redundant systems using our metric. Our simulation results also demonstrate the usefulness of diversity in enhancing the self-testing properties of redundant systems

PRELIMINARY VERSION

3TABLE OF CONTENTS

1. Introduction (1)

2. Design Diversity Metric and Reliability Analysis (3)

3. Example (12)

4. A Simulation-Based Approach (13)

5. Self-Testing Property (17)

6. Diversity Advantages in Configurable Systems (18)

7. Conclusions (20)

8. Acknowledgments (21)

9. References (22)

10. Appendix 1: Reliability Analysis (24)

LIST OF FIGURES

Figure 1.1. A Duplex redundant system (1)

Figure 2.1 A discrete time model of the system (5)

Figure 2.2. Faults affecting modules simultaneously (5)

Figure 2.3. Fault-Secure probability of duplex systems (7)

Figure 2.4. Fault-secure probability of duplex systems (common-mode failures) (8)

Figure 2.5. Effect of diversity with mission time (8)

Figure 2.6. Markov chain for availability analysis (11)

Figure 2.7. Comparison of availability of duplex systems (12)

Figure 3.1. An example logic circuit (13)

Figure 6.1. Reconfigurable computing test-bed (18)

Figure 6.2. Results from experiments on a configurable computing test-bed (19)

Figure 6.3. Results from experiments on a configurable computing test-bed (20)

PRELIMINARY VERSION

4LIST OF TABLES

Table 2.1.Behavior of faulty multiple-output circuits4 Table 4.1.Characteristics of simulated designs14 Table 4.2.Simulation 1 results15 Table 4.3.Simulation 2 results16 Table 5.1.Self-testing properties of diverse and non-diverse duplex systems18

PRELIMINARY VERSION 5

PRELIMINARY VERSION

1

1. INTRODUCTION

The use of redundancy techniques for designing systems with high data-integrity

and availability has been studied extensively [Siewiorek 92][Pradhan 96]. A duplex system in the form of a self-checking pair is an example of a classical redundancy scheme (Fig. 1.1). As long as only one module fails, the system either produces correct

Figure 1.1. A Duplex Redundant System

In a redundant system, common-mode failures (CMFs) result from failures that affect more than one module at the same time, generally due to a common cause. These include operational failures that may be due to external (such as EMI, power-supply disturbances and radiation) or internal causes. In addition to these common-mode failures, with the increasing complexity of the various designs, design mistakes are becoming very significant. It has been pointed out in [Avizienis 84], although the use of redundant copies of hardware has proven to be quite effective in the detection of physical faults and subsequent system recovery, design faults are reproduced when redundant copies are made. Simple replication fails to enhance the system reliability against design faults.

Design diversity has been proposed in the past to protect redundant systems against common-mode failures. In [Avizienis 84], design diversity was defined as the independent generation of two or more software or hardware elements (e.g., program modules, VLSI circuit masks, etc.) to satisfy a given requirement. Design diversity was also proposed in [Lala 94] as an avoidance technique against common-mode failures. Design diversity has been applied to both software and hardware systems. N-version programming [Avizienis 77, Lyu 91] is used to achieve diversity in software systems. Hardware design diversity has been used in the Primary Flight Computer (PFC) system of Boeing 777 [Riter 95] and many other commercial systems [Briere 93]. For the Boeing 777, three different processors (from AMD, Intel and Motorola) are used in the

PRELIMINARY VERSION

PFC. Tohma proposed to use the implementations of logic functions in true and complemented forms during duplication [Tohma 71]. The use of a particular circuit and its dual was proposed in [Tamir 84] to achieve diversity in order to handle common-mode failures. The basic idea is that, with different implementations, common failure modes will probably cause different error effects.

Design diversity can prove to be useful in the context of dependable Adaptive Computing Systems (ACS). The field programmability of Field Programmable Gate Arrays (FPGAs) can be utilized to achieve diversity among the different modules. In an ACS environment, we can create diversity by synthesizing and downloading different implementations into FPGAs at any time. Thus, there is no need to manufacture multiple diverse ASICs. In order to quantify the effect of diversity on the reliability of a redundant system, a metric is needed to quantify diversity among designs with the same specification [Tamir 84].

In addition to common-mode failures, with the high density of logic gates in a VLSI chip, multiple failures may become more frequent. For example, current research shows that multiple-event upsets (possibly due to a single radiation source) are common in VLSI chips [Liu 97][Reed 97]. The classical reliability models of redundant systems are pessimistic because, in the presence of multiple module failures, they do not consider compensating effects of different faults [Siewiorek 75]. It is interesting to find out whether design diversity also helps in achieving better compensating effects of different faults, compared to simple replication.

In this paper, we address problems related to design diversity and examine their effects on the reliability of a redundant system. Some preliminary ideas related to this work were reported in [Saxena 98] and [Mitra 99]. Our main contributions are: (1) developing a metric to quantify diversity among several designs; and (2) using this metric to perform reliability and availability analysis of redundant systems. In Sec. 2, we introduce a design diversity metric and perform reliability and availability analysis of redundant systems using this metric. Section 3 presents some preliminaries related to the stuck-at fault model and illustrates our analysis with the help of an example. We present simulation results in Sec. 4. Section 5 examines the effect of design diversity on the self-testing properties of a duplex system. We present experimental results demonstrating the advantages of using design diversity in configurable systems in Sec. 6. Finally, we conclude in Sec. 7.

2

PRELIMINARY VERSION

3

2. Design Diversity Metric And Reliability Analysis

2.1. D: A Design Diversity Metric

In this section, we introduce a metric to quantify diversity among several designs.

We define the metric for a system with two designs implementing the same function. The metric has application in estimating the reliability of NMR systems with masking redundancy. Before defining the diversity metric, we first define the notion of diversity between two implementations with respect to a fault pair.

For two designs implementing the same function, the diversity with respect to a fault pair (f i, f j), d i,j, is the probability that the designs do not produce identical error patterns, in response to a given input sequence, when f i and f j affect the first and the second implementations, respectively.

For a given fault model, the design diversity metric, D, between two designs is the expected value of the diversity with respect to different fault pairs. Mathematically, we

have D = P(f

i ,f

j

)d

i,j

(f i,f j)

∑, where P (f i, f j) is the probability of the fault pair(f i, f j).

D is the probability that, in response to a given input sequence, the two implementations either produce error-free outputs or produce different error patterns on their outputs.

Example: Consider any combinational logic function with n inputs and a single output. The fault model considered is such, that a combinational circuit remains combinational in the presence of the fault. Let us consider two implementations (N1 and N2) of the given combinational logic function.

The joint detectability, k i,j, of a fault pair(f i, f j) is the number of input patterns that detect both f i and f j. This definition follows from the idea of detectability developed in [McCluskey 88].

If we assume that all the input patterns are equally likely, then we can write d i,j =

1 -k i,j

2n

.

The d i,j’s generate a diversity profile for the two implementations with respect to a

fault model. Consider a duplex system consisting of the two implementations under consideration. In response to any input combination, the implementations can produce one of the following cases at their outputs. (1) Both of them produce correct outputs. (2) One of them produces correct output and the other produces incorrect output. (3) Both of them produce the same incorrect value.

For the first case, the duplex system will produce correct outputs. For the second case, the system will report a mismatch so that appropriate recovery actions can be taken. However, for the third case, the system will produce an incorrect output without reporting

PRELIMINARY VERSION

4a mismatch — thus, for the third case, the integrity of the system is lost due to the presence of faults in the two implementations. In the literature on fault-tolerance

[Siewiorek 92][Pradhan 96], this system integrity has been referred to as the fault-secure property.

The quantity d i,j is the probability that a duplex system, having two implementations of the logic function under consideration, is fault-secure when faults f i and f j affect the first (N 1) and the second (N 2) implementations, respectively.

If we assume that all fault pairs are equally probable and there are m fault pairs (f i ,

f j ), then the D metric for the two implementations is: D = 1m d i ,j i ,j

∑.We extend the above example to consider multiple-output combinational logic circuits . For a fault pair (f i , f j ) affecting the two implementations, we define k ij as the number of input patterns, in response to each of which, both the implementations produce the same erroneous output pattern . We can use the same formulas as the single output case.Table 2.1. Behavior of faulty multiple output circuits Inputs

Fault-free outputs Faulty outputs (Implementation 1)Faulty outputs (Implementation 2)00

0 10 01 001

1 0 1 0 1 010

0 01 01 011 1 1 1 0 1 0

For example, consider a combinational logic function with two inputs and two outputs (Table 2.1). Suppose that, faults f i and f j affect the first and the second implementations, respectively. The responses of the two implementations in the presence of the faults are shown in Table 2.1. The faulty output bits are highlighted in the third and fourth columns of Table 2.1. It is clear that for the calculation of k ij , we have to consider only the input patterns 10 and 11.

The above illustration of the design diversity metric can also be extended to sequential circuits and software programs. For small or medium-sized systems, the exact value of the diversity metric can be calculated manually or using computer programs.For large systems, the value can be estimated by using simulation techniques.

For two identical implementations of the same function, a common-mode failure (e.g., a design mistake) can be modeled as the same fault f i affecting the two implementations. Let m be the number of input sequences for which these two implementations produce identical error patterns at the outputs. If the second implementation is different from the first, for any fault f j affecting the second

PRELIMINARY VERSION

5implementation (and f i affecting the first implementation), we cannot have more than m input sequences that produce identical error patterns at the outputs of the two implementations. Hence, d i,i ≤ d i,j . This property is useful for enhancing the reliability of a redundant system against common-mode failures by using diversity.

2.2. Reliability Analysis

In this section, we calculate the reliability of duplex systems using the diversity metric described in Sec. 2.1. We define the reliability of a duplex system as the probability that the system is fault-secure. The reliability calculation is independent of whether the redundant components are exact replicas or different implementations. We assume a discrete time model for the system. In such a model, the time axis is broken up into discrete time cycles and we apply inputs and observe outputs only at cycle boundaries.

As shown in Fig. 2.1, input combination (vector) v i is applied at the beginning of the i th cycle. Also, in Fig. 2.1, the first system becomes faulty (f 1) during cycle i and the second system becomes faulty (f 2) during cycle j . Let p be the probability that a particular module is affected by a fault at any cycle. For simplicity, we assume that this probability p is the same for all the modules in the system at all times. The probability p

can be looked upon as the failure rate per cycle.

v

v i

v 1

2

Figure 2.1. A discrete time model of the system

For a given fault pair (f 1, f 2)

, there are two possible cases. In the first case, both the faults appear in the same cycle. The situation is shown in Fig. 2.2.

Time

f 2

Figure 2.2. Faults affect the modules simultaneously

PRELIMINARY VERSION

6In Fig. 2.2, faults f1 and f2 affect modules 1 and 2, simultaneously at cycle i. It

may be argued that if a random fault appears in a particular module, then chances are high that the second fault will also appear in that same module. However, we do not assume any such correlation in this paper. At time 0, everything is fault-free. So, before cycle i the system will produce correct results. However, starting from time i, in each cycle, the system will produce correct results with the probability equal to d1,2. The probability s1(f1, f2, t) that the system is fault-secure up to time t, even in the presence of the two faults f1 and f2, is given by:

s1(f1, f2, t) = p2d

1,2[d

1,2

t?(1?p)2t]

[d

1,2

?(1?p)]

The derivation of the above expression is shown in the appendix. Next, we consider the case where f1 and f2 appear at different cycles.

As discussed earlier, in Fig. 2.1, Module 1 becomes faulty during cycle i and Module 2 becomes faulty during cycle j. It is clear that up to time j, a duplex system will be fault-secure. Hence, starting from time j, the system will be fault-secure with probability d1,2. Thus, the probability s2(f1, f2, t) that the system is fault-secure up to time t, in the presence of the two faults f1 and f2 is given by the following equation.

s2(f1, f2, t) =

2

(d

1,2

?1+p)

(1?p)p2d

1,2

2

[d

1,2

t?1?(1?p)2t?2]

[d

1,2

?(1?p)2]

?2

(d

1,2?1+p)

(1?p)t pd

1,2

[1?(1?p)t?1]

The derivation of the second case is also shown in the appendix. This case is more complicated than the first case and is useful when we consider random independent faults in multiple modules. We have:

s(f1, f2, t) = s1(f1, f2, t) + s2(f1, f2, t)

Here s(f1, f2, t) is the probability that a duplex system is fault-secure up to time t, when Module 1 is affected by fault f1 and Module 2 by fault f2.

We can characterize a duplex system using our diversity metric. In the following calculations, we assume that once a module becomes faulty, no other fault appears in that module. This assumption is simplistic and allows us to obtain closed-form reliability expressions. We calculate the probability that, up to time t, a duplex system is fault-secure. It is given by the following expression:

(1?p)2t+2(1?p)t[1?(1?p)t]+P(f

1,f

2

)s(f

1

,f

2

,t)

f1,f2

The above expression follows from the fact that, in a duplex system, when none of the modules fails the system produces correct outputs. When only one of the modules

PRELIMINARY VERSION

7fails (due to single or multiple faults), the system is fault-secure. When both modules are faulty, then we have to consider the d 1,2 value for the fault pair (f 1, f 2) in the two modules. P(f 1, f 2) is the probability that faults f 1 and f 2 appear in modules 1 and 2,respectively.

Mission Time (MTTF of Simplex)Prob.

fault

secure Classical d 1,2 =1-10-10

d 1,2=1-10-11Figur

e 2.3. Fault-secure probability o

f a duplex system with multiple independent failures In Fig. 2.3, for a given pair of faults (f 1, f 2), we show the plots of the above expression for different values of d 1,2. The mission time is shown alon

g the X-axis —the MTTF (Mean Time To Failure in cycles) of a simplex system corresponds to 1 time unit. The probability that a fault appears in one cycle is 10-12. Along the Y-axis, we show the probability that the duplex system is fault-secure. The classical analysis of duplex systems is pessimistic since it assumes that the system ceases to be fault-secure when two modules are faulty.

The above expressions can be modified for common-mode failures (CMF). The probability that a duplex system is fault-secure against common-mode failures up to time t , is given by the following expression:

(1?p )t +P (f 1,f 2)z (f 1,f 2,t )

f 1,f 2

∑Here, p is the probability that a CMF affects the two modules. In the above expression, z(f 1, f 2, t) is given by the following formula:

z(f 1, f 2, t) = pd 1,2[d 1,2t ?(1?p )t ]

[d 1,2?(1?p )]The above expression is maximized when d 1,2 is of the order of (1-p ). This suggests that, for a common-mode failure that can be modeled as fault pair (f 1, f 2), we can obtain appreciable reliability improvement over classical systems when the value of d 1,2 is of the order of (1-p ). The following observations can be derived from this

PRELIMINARY VERSION

8relationship.

9. When the failure rate is high, even a small diversity can help enhance the system reliability over traditional replication.

10. If the failure rate is low, then d 1,2 must be extremely high for appreciable reliability improvement over classical systems. As a limiting case, consider the situation when the CMF failure rate is 0. In that case, diversity will not buy us any extra reliability against CMFs.

Mission Time (MTTF of Simplex)Prob.

fault

secure Classical

d 1,2 = 1-10-12d 1,2 = 1-10-11

Figure 2.4. Fault-secure probability of a duplex system against common-mode failures In Fig. 2.4, for a given pair of faults (f 1, f 2), we show the plots of the above fault-secure probability expression for the different values of d 1,2. The failure rate per cycle is

10-13. It is clear that we get appreciable improvement in reliability (over classical systems) when the value of d 1,2 is very high (1-10-12 or more). When the value of d 1,2 is

less than 1-10-12

, we do not see high reliability improvement.

Mission Time (MTTF of Simplex)

Gain

Figure 2.5. Effect of diversity with mission time (for common-mode failures)

PRELIMINARY VERSION

9In Fig. 2.5, we show how the reliability improvement obtained from diversity depends on mission time. On the Y-axis of the graph in Fig. 2.5, we plot the ratio of the following two quantities.

9. The probability that a duplex system is not fault-secure at time i , for a fault pair (f 1,f 2) with d 1,2 = 1-10-11.

10. The probability that a duplex system is not fault-secure at time i , for fault pair (f 1, f 2)with d 1,2 = 1-10-12. The failure rate per cycle is 10-13.

We call this ratio the gain . On the X-axis, we plot the mission time. As Fig. 2.5shows, the gain diminishes with longer mission times. This analysis allows us to derive relationships between the reliability of a redundant system, the diversity incorporated to protect the system against common-mode failures and the mission time. The relationship between diversity and mission time can also be used to determine checkpoint intervals in a redundant system. For example, referring to Fig. 2.5, we can checkpoint the state of the system when the gain is close to 1. Thus, our design diversity metric is a very fundamental property and can be used to understand different trade-offs associated with the design of dependable systems using redundancy.

Next, we estimate the error latency using our design diversity metric. Consider a duplex system with two implementations N 1 and N 2 of the same logic function. Let us suppose that the faults f 1 and f 2 affect the two implementations at cycle c . The error latency is defined to be the number of cycles from c after which both the implementations produce the same error pattern at the output. For more discussions on error latency, the reader is referred to [Shedletsky 76]. The probability that the error latency is t (t > 0) is

given by: d 1,2t ?1(1?d 1,2). Here, the assumption is that d 1,2 value is strictly less than 1. If the d 1,2 value is equal to 1, then the error latency is always equal to T , the mission time.The expected error latency is given by the following formula:

P (f 1,f 2)td 1,2t ?1(1?d 1,2)t ,(f 1,f 2),d 1,2≠1∑+P (f 1,f 2)T

(f 1,f 2),d 1,2=1∑From this expression, it is clear that for long mission times (i.e., large values of t ),the probability value approaches to 0 when the d 1,2 value for the fault pair is less than 1.Thus, the fault pairs which have their d i,j values equal to 1 (i.e., the compensating fault pairs) play a dominant role in determining the error latency for long mission times.Hence, the value of the expected error latency is determined by the percentage of compensating fault pairs. Simplification of the above expression produces the following expression for the expected latency of a duplex system in terms of the diversity metrics with respect to the different fault pairs.

PRELIMINARY VERSION

10Expected error latency =

P(f

1

,f

2

)

(1?d

1,2

)

(f1,f2),d1,2≠1

∑+P(f1,f2)T

(f1,f2),d1,2=1

Consider the case of design mistakes that are special cases of common-mode

failures. For these cases, the fault is always present. Simple analysis reveals that the probability that a duplex system is fault-secure up to time t, in the presence of design mistakes, is:

P(f

1,f

2

)d

1,2

t

f1,f2

Thus, for design mistakes, for a given fault pair (f1, f2), the more the value of d1,2, the more is the system reliability. This implies that, for design mistakes, diversity among the two implementations in a duplex system helps to increase the probability that the system is fault-secure.

While diversity in hardware designs is the main focus of this paper, the above ideas can be extended to analyze diversity in software modules. For estimating the diversity metric for software modules, we need to have a fault model for the software under consideration. Considering the range of values the input variables to the software module can possibly take, it may be difficult to compute the exact value of the metric. However, the value of the metric can be estimated using simulation techniques. Note that, our observations about the relationships between diversity, mission time and failure rate still hold for software systems. One key feature of our analysis technique is that, it is powerful, but at the same time, simpler than the models in [Eckhardt 85], [Tomek 93] and [Lyu 95].

2.3. Availability Analysis

In this section, we perform availability analysis of duplex systems with repair capabilities using our diversity metric. For the purpose of our analysis, we assume that p is the probability that a (common-mode) failure affects the system during a particular cycle. The failure can manifest as fault f1 and f2 affecting Module 1 and Module 2, respectively. In our analysis, we use the following quantities, as described below.

The metric d1,2 is the probability that the two modules do not produce the same error pattern (at their outputs) in response to a given input sequence, when they are affected by the faults f1 and f2.

We define another quantity, t1,2, which is the probability that the two modules do not produce any error at their outputs in response to a given input sequence, when they are affected by the faults f1 and f2.

The quantity d1,2- t1,2 is the probability that the two modules will produce non-

PRELIMINARY VERSION

11identical error patterns (at their outputs) in response to a given input sequence, when they are affected by the faults f 1 and f 2.

The Markov chain used for our analysis is shown in Fig. 2.6. In the Markov chain, the system starts at the Good state. As long as a fault does not appear, the system remains in the Good state. However, as soon as a fault appears, the system goes to the Faulty Correct state. The probability that both the modules produce correct outputs, in spite of the presence of the fault, is t 1,2. The probability that the modules produce identical errors at their outputs is 1- d 1,2. Thus, with probability d 1,2- t 1,2, the modules produce non-identical erroneous — this means that the presence of the fault is detected.Once the fault is detected, the system enters the Repair state. We have assumed that the expected number of cycles required to repair the system is 1m

. For modeling the repair operation, we could as well use a repair rate. However, in the context of re-configurable systems, we can have bounds on repair time, which we can use during the above Markov analysis. The availability is given by the probability that the system is in the Good or the Faulty Correct state. In the following graph (Fig. 2.7), we show the dependence of availability on the values of d 1,2 and t 1,2. This analysis implications on the usefulness of diversity for enhancing the self-testing property and hence, the availability of duplex

Figure 2.6. Markov Chain for availability analysis In Fig. 2.7, we plot the availability values for two duplex systems. The probability (p ) that a fault pair appears in a particular cycle is 10-8. The number of repair cycles (1m

) is 100. Both the systems have the value of d 1,2 equal to 1-10-5 for a fault pair.However, one of the systems (shown in Fig. 2.7) has the value of t 1,2 equal to d 1,2 and the

PRELIMINARY VERSION

12other one has the t 1,2 value equal to half of d 1,2. As can be seen in Fig. 2.7, initially the system having t 1,2 = d 1,2 has a higher availability (since the probability that it stays in the Faulty Correct state is high). However, as time increases, the availability of the system with t 1,2 = 0.5*d 1,2 decreases at a much smaller rate compared to the system with t 1,2 =d 1,2. This is because, for the system with d 1,2 equal to t 1,2, there is no repair capability in contrast to the other system.

Availability Figure 2.7. Comparison of the availability of duplex systems

We validate our observations using simulation data in Sec. 4. For simulation purposes, we used the stuck-at fault model. In the next section, we introduce the preliminaries related to the stuck-at fault model and illustrate the calculation of our diversity metric using an example.

3. Example

Research in the area of digital testing and diagnosis of combinational and sequential logic circuits has demonstrated the effectiveness of the logical stuck-at fault model. In this model, the failures in a logic circuit behave as if as some lines in the circuit assume constant logical values, either 1 or 0, independent of the logic values on other lines of the circuit.

For the rest of this paper, we assume that all failures manifest as stuck-at faults in the circuit. We also assume that the failures are permanent; i.e., if a stuck-at fault shows up at some time instant t , then the fault remains at all time instants greater than t . For circuits made from SRAM-based FPGAs, unless we re-initialize the SRAMs (reload a given configuration), a transient fault in the configuration SRAM persists. Thus, the assumption of the permanent fault behavior is reasonable.

PRELIMINARY VERSION

13For example, consider the network shown in Fig. 3.1. The function implemented

by the network is wx + y. Consider a stuck-at-0 (s-a-0) fault on the line y, denoted by y/0. The function implemented by the network, in the presence of the fault, is wx. Thus, w = 1, x = 0 and y = 1, when applied to the input of the logic circuit causes the faulty network to produce a 0 and the fault-free network to produce a 1. Therefore, the fault y/0 is detected by the pattern w = 1, x = 0, y = 1.

Figure 3.1. An example logic circuit

A fault is said to be functionally equivalent to another fault if and only if the output function realized by the network with only the first fault present is equal to the function realized when only the second fault is present. For example, in the network of Fig. 3.1, in the presence of the fault x/0, the function implemented is y. In the presence of the fault z/0, the function implemented by the network is also y. Hence, the faults x/0 and z/0 are functionally equivalent. The set of functionally equivalent faults forms an equivalence class. A fault f1dominates fault f2 if and only if all input combinations that detect f2 also detect f1. In our example, the fault p/0 dominates fault z/0. Techniques for obtaining equivalence and dominance relationships among different fault pairs have been described in [McCluskey 71] and [To 73].

We illustrate the calculation of our design diversity metric with respect to single stuck-at faults in the circuit of Fig. 3.1. There are 10 single-stuck faults associated with this network. The faults are: w/0, w/1, x/0, x/1, y/0, y/1, z/0, z/1, p/0 and p/1. The corresponding fault equivalence classes are: F1 = {w/0, x/0, z/0}, F2 = {y/1, z/1, p/1}, F3 = {w/1}, F4 = {x/1}, F5 = {y/0} and F6 = {p/0}. The set of vectors that detect the faults in F1 is V1 = {w = 1, x = 1, y = 0}. We write V1 = {110}. Similarly, V2 = {000, 010, 100}, V3 = {010}, V4 = {100}, V5 = {001, 011, 101} and V6 = {001, 011, 101, 111, 110}. Here, the number of inputs (n) is 3. Consider the fault pair (f1, f2) = (w/0, p/0). The set of vectors that detect w/0 is V1. The set of vectors that detect p/0 is V6 and V1∩V6 = {110}. Thus, the value of d1,2is 7/8. In this way all the d i,j’s and the D metric can be calculated.

4. A Simulation-Based Approach

As we noted earlier, it is difficult to model the entire complex system

PRELIMINARY VERSION

mathematically. Even with the stuck-at fault model, it is difficult to derive the exact reliability equation for the following reasons:

9.For a given pair of faults (f1, f2), the calculation of d1,2 is an NP-complete problem

[Gary 79]. The problem is related to the NP-complete test generation problem.

10.If multiple stuck-at faults appear in the modules at different cycles, then the reliability

expressions will become complicated. In fact, it may not be possible to obtain a closed form.

Hence, we developed a simulation environment to examine the reliability of a redundant system in the presence of multiple faulty modules.

Table 4.1. Characteristics of simulated designs

Circuit# Inputs# Outputs# SSF (T)# SSF (C)

Z5xp1710550610

apex491996368578

clip95698664

inc79486506

rd8484568398

For generating different designs, we minimized the truth tables corresponding to some MCNC benchmark circuits (clip, inc, Z5xp1, apex4 and rd84) using espresso. Then, we synthesized logic circuits after applying multi-level optimizations using the rugged script available in sis [Sentovich 92]. We subsequently mapped the multi-level logic circuits to the LSI Logic G-10p technology library [LSI 96]. Next, we complemented the outputs in the truth tables of the benchmark circuits to generate new truth tables. We used the same synthesis procedure for these new truth tables. Finally, we added inverters at the outputs of the new designs obtained. Table 4.1 summarizes the characteristics of the different simulated designs.

In the fourth column of Table 4.1, we report the number of candidate single stuck faults for the implementations of the circuits, obtained by synthesizing the given specification. The fifth column shows the number of candidate single stuck faults for the implementations of the circuits, obtained by synthesizing the given specifications with complemented outputs.

Simulation 1

For Simulation 1, for each benchmark circuit, we built duplex systems with identical and different implementations. For each of these systems, we performed 100,000 experiments. In each experiment, we randomly picked up a single stuck-at fault pair (f1, f2) such that the fault f1 affects Module 1 and f2 affects Module 2. We injected these faults into the modules, applied input patterns from a counter (with random seed)

14

PRELIMINARY VERSION

15and calculated the error latency (the number of cycles after which the system ceases to be fault-secure). The expected error latency for the injected fault pairs is shown in Table

4.2. We also calculated the percentage of fault pairs for which none of the two modules produced the same erroneous outputs at the same time (compensating fault pairs). These are the fault pairs (f 1, f 2) that have d 1,2 equal to 1.Table 4.2. Simulation 1 results Circuit

Name

Copies Error Latency (cycles)% compensating fault pairs (d i,j = 1)Z5xp1

T, T 673366.96T, C 686968.76apex4

T, T 859485.71T, C 809480.51clip

T, T 795179.24T, C 786978.44T, T 766676.54inc

T, C 751675.08C, C 751274.90T, T 763876.23rd84T, C 679767.73

C, C 705170.40

As shown in Table 4.2, a duplex system consisting of different implementations of the Z5xp1 circuit has a higher percentage of compensating fault pairs, compared to the non-diverse version — however, that is not generally true. For example, for the clip benchmark, the non-diverse duplex system has a higher percentage of compensating fault pairs. For compensating fault pairs, the error latency is strictly infinity — we assumed the value to be 10,000 cycles for our experiments. This is because, the number of inputs of the benchmark circuits under consideration lie between 7 and 9. Thus, the total number of input patterns is between 128 and 512. Note that, the expected error latency is dependent on the number of compensating fault pairs. This dependence of error latency on the number of compensating fault pairs has been explained earlier in Sec. 2.1.

In [Sakov 87], for a given combinational logic function, the fault detectability profiles for different implementations have been reported. Further studies are needed to synthesize circuit structures with high values of d i,j for different fault pairs. It has been proved in [To 73] that, for fanout-free combinational logic networks, all internal single stuck-at faults are either equivalent to or dominate single stuck-at faults on the primary inputs of the network. Thus, if we want to implement two diverse fanout-free networks implementing the same function, the d i,j values of the different fault pairs will be strongly dependent on the input combinations detecting the single stuck-at faults on network inputs and outputs. For both the networks, the set of patterns that detect the input or output stuck-at faults is independent of the network structure and is directly determined

本文来源:https://www.bwwdw.com/article/6nde.html

Top