The Confounding Effect of Class Size on The Validity of Object-Oriented Metrics


National Research Council Canada
Institute for Information Technology

Conseil national de recherches Canada
Institut de technologie de l’information

ERB-1062
NRC 4360

The Confounding Effect of Class Size on the Validity

of Object-oriented Metrics

Khaled El Emam, Saida Benlarbi, and

Nishith Goel

September 1999

Copyright 1999 by

National Research Council of Canada

Permission is granted to quote short excerpts and to reproduce figures and tables from this report,

provided that the source of such material is fully acknowledged.


The Confounding Effect of Class Size on

The Validity of Object-Oriented Metrics

Khaled El Emam

National Research Council, Canada

Institute for Information Technology

Building M-50, Montreal Road

Ottawa, Ontario

Canada K1A 0R6

khaled.el-emam@iit.nrc.ca

Saida Benlarbi
Nishith Goel
Cistel Technology
210 Colonnade Road
Suite 204
Nepean, Ontario
Canada K2E 7L5
{benlarbi, ngoel}@

Abstract

Much effort has been devoted to the development and empirical validation of object-oriented metrics.

The empirical validations performed thus far would suggest that a core set of validated metrics is close

to being identified. However, none of these studies control for the potentially confounding effect of class

size. In this paper we demonstrate a strong size confounding effect, and question the results of previous

object-oriented metrics validation studies. We first investigated whether there is a confounding effect of

class size in validation studies of object-oriented metrics and show that based on previous work there is

reason to believe that such an effect exists. We then describe a detailed empirical methodology for

identifying those effects. Finally, we perform a study on a large C++ telecommunications framework to

examine if size is really a confounder. This study considered the Chidamber and Kemerer metrics, and

a subset of the Lorenz and Kidd metrics. The dependent variable was the incidence of a fault

attributable to a field failure (fault-proneness of a class). Our findings indicate that before controlling for

size, the results are very similar to previous studies: the metrics that are expected to be validated are

indeed associated with fault-proneness. After controlling for size none of the metrics we studied were

associated with fault-proneness anymore. This demonstrates a strong size confounding effect, and

casts doubt on the results of previous object-oriented metrics validation studies. It is recommended that

previous validation studies be re-examined to determine whether their conclusions would still hold after

controlling for size, and that future validation studies should always control for size.

1 Introduction

The validation of software product metrics has received much research attention by the software

engineering community. There are two types of validation that are recognized [48]: internal and external.

Internal validation is a theoretical exercise that ensures that the metric is a proper numerical

characterization of the property it claims to measure. External validation involves empirically

demonstrating that the product metric is associated with some important external metric (such as

measures of maintainability or reliability). These are also commonly referred to as theoretical and

empirical validation respectively [73], and procedures for achieving both are described in [15]. Our focus in this paper is empirical validation.

Product metrics are of little value by themselves unless there is empirical evidence that they are

associated with important external attributes [65]. The demonstration of such a relationship can serve

two important purposes: early prediction/identification of high risk software components, and the

construction of preventative design and programming guidelines.

1 Some authors distinguish between the terms ‘metric’ and ‘measure’ [2]. We use the term “metric” here to be consistent with prevailing international standards. Specifically, ISO/IEC 9126:1991 [64] defines a “software quality metric” as a “quantitative scale and method which can be used to determine the value a feature takes for a specific software product”.

2 Theoretical validations of many of the metrics that we consider in this paper can be found in [20][21][30].


Early prediction is commonly cast as a binary classification problem. This is achieved through a quality

model that classifies components into either a high or low risk category. The definition of a high risk

component varies depending on the context of the study. For example, a high risk component is one that

contains any faults found during testing [14][75], one that contains any faults found during operation [72],

or one that is costly to correct after an error has been found [3][13][1]. The identification of high risk

components allows an organization to take mitigating actions, such as focus defect detection activities on

high risk components, for example optimally allocating testing resources [56], or redesign components

that are likely to cause field failures or be costly to maintain. This is motivated by evidence showing that

most faults are found in only a few of a system’s components [86][51][67][91].

A number of organizations have integrated quality models and modeling techniques into their overall

quality decision making process. For example, Lyu et al. [81] report on a prototype system to support

developers with software quality models, and the EMERALD system is reportedly routinely used for risk

assessment at Nortel [62][63]. Ebert and Liedtke describe the application of quality models to control the

quality of switching software at Alcatel [46].

The construction of design and programming guidelines can proceed by first showing that there is a

relationship between say a coupling metric and maintenance cost. Then proscriptions on the maximum

allowable value on that coupling metric are defined in order to avoid costly rework and maintenance in the future.4 Examples of cases where guidelines were empirically constructed are [1][3]. Guidelines based

on anecdotal experience have also been defined [80], and experience-based guidelines are used directly

in the context of software product acquisition by Bell Canada [34].

Concordant with the popularity of the object-oriented paradigm, there has been a concerted research

effort to develop object oriented product metrics [8][17][30][80][78][27][24][60][106], and to validate them

[4][27][17][19][22][78][32][57][89][106][8][25][10]. For example, in [8] the relationship between a set of

new polymorphism metrics and fault-proneness is investigated. A study of the relationship between

various design and source code measures using a data set from student systems was reported in

[4][17][22][18], and a validation study of a large set of object-oriented metrics on an industrial system was

described in [19]. Another industrial study is described in [27] where the authors investigate the

relationship between object-oriented design metrics and two dependent variables: the number of defects

and size in LOC. Li and Henry [78] report an analysis where they related object-oriented design and code

metrics to the extent of code change, which they use as a surrogate for maintenance effort. Chidamber

et al. [32] describe an exploratory analysis where they investigate the relationship between object-

oriented metrics and productivity, rework effort and design effort on three different financial systems

respectively. Tang et al. [106] investigate the relationship between a set of object-oriented metrics and

faults found in three systems. Nesi and Querci [89] construct regression models to predict class

development effort using a set of new metrics. Finally, Harrison et al. [57] propose a new object-oriented

coupling metric, and compare its performance with a more established coupling metric.

Despite minor inconsistencies in some of the results, a reading of the object-oriented metrics validation

literature would suggest that a number of metrics are indeed ‘validated’ in that they are strongly

associated with outcomes of interest (e.g., fault-proneness) and that they can serve as good predictors of

high-risk classes. The former is of course a precursor for the latter. For example, it has been stated that

some metrics (namely the Chidamber and Kemerer – henceforth CK – metrics of [30]) “have been proven

empirically to be useful for the prediction of fault-prone modules” [106]. A recent review of the literature

stated that “Existing data suggests that there are important relationships between structural attributes and

external quality indicators” [23].

However, almost all of the validation studies that have been performed thus far completely ignore the

potential confounding impact of class size. This is the case because the analyses employed are

univariate: they only model the relationship between the product metric and the dependent variable of

interest. For example, recent studies used the bivariate correlation between object-oriented metrics and the number of faults to investigate the validity of the metrics [57][10]. Also, univariate logistic regression models are used as the basis for demonstrating the relationship between object-oriented product metrics and fault-proneness in [22][19][106]. The importance of controlling for potential confounders in empirical studies of object-oriented products has been emphasized [23]. However, size, the most obvious potential confounder, has not been controlled in previous validation studies.

3 It is not, however, always the case that binary classifiers are used. For example, there have been studies that predict the number of faults in individual components (e.g., [69]), and that produce point estimates of maintenance effort (e.g., [78][66]).

4 It should be noted that the construction of guidelines requires the demonstration of a causal relationship rather than a mere association.

The objective of this paper is to investigate the confounding effect of class size on the validation of object-

oriented product metrics. We first demonstrate based on previous work that there is potentially a size

confounding effect in object-oriented metrics validation studies, and present a methodology for empirically

testing this. We then perform an empirical study on an object-oriented telecommunications framework written in C++ [102].5 The metrics we investigate consist of the CK metrics suite [30], and some of the

metrics defined by Lorenz and Kidd [80]. The external metric that we validate against is the occurrence of

a fault, which we term the fault-proneness of the class. In our study a fault is detected due to a field

failure.

Briefly, when the commonly employed univariate analyses are used, our results are

consistent with previous studies. After controlling for the confounding effect of class size, none of the

metrics is associated with fault-proneness. This indicates a strong confounding effect of class size on

some common object-oriented metrics. The results cast serious doubt that many previous validation

studies demonstrate more than that size is associated with fault-proneness.

Perhaps the most important practical implication of these results is that design and programming

guidelines based on previous validation studies are questioned. Efforts to control cost and quality using

object-oriented metrics as early indicators of problems may be achieved just as well using early indicators

of size. The implications for research are that data from previous validation studies should be re-

examined to gauge the impact of the size confounding effect, and future validation studies should control

for size.

In Section 2 we provide the rationale behind the confounding effect of class size and present a framework

for its empirical investigation. Section 3 presents our research method, and Section 4 includes the results

of the study. We conclude the paper in Section 5 with a summary and directions for future work.

2 Background

This section is divided into two parts. First, we present the theoretical and empirical basis of the object-

oriented metrics that we attempt to validate. Second, we demonstrate that there is a potentially strong

size confounding effect in object-oriented metrics validation studies.

2.1 Theoretical and Empirical Basis of Object-Oriented Metrics

2.1.1 Theoretical Basis and Its Empirical Support

The primary reason why there is an interest in the development of product metrics in general is

exemplified by the following justification for a product metric validity study “There is a clear intuitive basis

for believing that complex programs have more faults in them than simple programs” [87]. However, an

intuitive belief does not make a theory. In fact, the lack of a strong theoretical basis driving the

development of traditional software product metrics has been criticized in the past [68]. Specifically,

Kearney et al. [68] state that “One of the reasons that the development of software complexity measures

is so difficult is that programming behaviors are poorly understood. A behavior must be understood before

what makes it difficult can be determined. To clearly state what is to be measured, we need a theory of

programming that includes models of the program, the programmer, the programming environment, and

the programming task.”

5 It has been stated that for historical reasons the CK metrics are the most referenced [23]. Most commercial metrics collection tools available at the time of writing also collect these metrics.


Figure 1: Theoretical basis for the development of object-oriented product metrics.

In the arena of object-oriented metrics, a slightly more detailed articulation of a theoretical basis for

developing quantitative models relating product metrics and external quality metrics has been provided in

[19], and is summarized in Figure 1. There, it is hypothesized that the structural properties of a software

component (such as its coupling) have an impact on its cognitive complexity. Cognitive complexity is

defined as the mental burden of the individuals who have to deal with the component, for example, the

developers, testers, inspectors, and maintainers. High cognitive complexity leads to a component exhibiting undesirable external qualities, such as increased fault-proneness and reduced maintainability.6

Certain structural features of the object-oriented paradigm have been implicated in reducing the

understandability of object-oriented programs, hence raising cognitive complexity. We describe these

below.

2.1.1.1 Distribution of Functionality

In traditional applications developed using functional decomposition, functionality is localized in specific

procedures, the contents of data structures are accessed directly, and data central to an application is

often globally accessible [110]. Functional decomposition makes procedural programs easier to

understand because it is based on a hierarchy in which a top-level function calls lower level functions to

carry out smaller chunks of the overall task [109]. Hence tracing through a program to understand its

global functionality is facilitated.

In one experimental study with students and professional programmers [11], the authors compared

maintenance time for three equivalent versions of each of three different applications (nine programs in all): one version consisted of a straight serial structure (i.e., one main function), one was developed following the principles of functional decomposition, and one was an object-oriented program (without

inheritance). In general, it took the students more time to change the object-oriented programs, and the

professionals exhibited the same effect, although not as strongly. Furthermore, both the students and

professionals noted that they found that it was most difficult to recognize program units in the object-

oriented programs, and the students felt that it was also most difficult to find information in the object-

oriented programs. Widenbeck et al. [109] make a distinction between program functionality at the local

level and at the global (application) level. At the local level they argue that the object-oriented paradigm’s

concept of encapsulation ensures that methods are bundled together with the data that they operate on,

making it easier to construct appropriate mental models and specifically to understand a class’ individual

functionality. At the global level, functionality is dispersed amongst many interacting classes, making it harder to understand what the program is doing. They support this in an experiment with equivalent small

C++ (with no inheritance) and Pascal programs whereby the subjects were better able to answer

questions about the functionality of the C++ program. They also performed an experiment with larger

programs. Here the subjects with the C++ program (with inheritance) were unable to answer questions

about its functionality much better than guessing. While this study was done with novices, it supports the

general notions that high cohesion makes object-oriented programs easier to understand, and high

coupling makes them more difficult to understand. Wilde et al.’s [110] conclusions based on an interview-

based study of two object-oriented systems at Bellcore implemented in C++ and an investigation of a PC

Smalltalk environment, all in different application domains, are concordant with this finding, in that

programmers have to understand a method’s context of use by tracing back through the chain of calls

that reach it, and tracing the chain of methods it uses. When there are many interactions, this exacerbates the understandability problem. An investigation of a C and a C++ system, both developed by the same staff in the same organization, concluded that “The developers found it much harder to trace faults in the OO C++ design than in the conventional C design. Although this may simply be a feature of C++, it appears to be more generally observed in the testing of OO systems, largely due to the distorted and frequently nonlocal relationships between cause and effect: the manifestation of a failure may be a ‘long way away’ from the fault that led to it. […] Overall, each C++ correction took more than twice as long to fix as each C correction.” [59].

6 To reflect the likelihood that not only structural properties affect a component’s external qualities, some authors have included additional metrics as predictor variables in their quantitative models, such as reuse [69], the history of corrected faults [70], and the experience of developers [72][71]. However, this does not detract from the importance of the primary relationship between product metrics and a component’s external qualities.

2.1.1.2 Inheritance Complications

As noted in [43], there has been a preoccupation within the community with inheritance, and therefore

more studies have investigated that particular feature of the object-oriented paradigm.

Inheritance introduces a new level of delocalization, making the understandability even more difficult. It

has been noted that “Inheritance gives rise to distributed class descriptions. That is, the complete

description for a class C can only be assembled by examining C as well as each of C’s superclasses.

Because different classes are described at different places in the source code of a program (often spread

across several different files), there is no single place a programmer can turn to get a complete

description of a class” [77]. While this argument is stated in terms of source code, it is not difficult to

generalize it to design documents. Wilde et al.’s study [110] indicated that to understand the behavior of

a method one has to trace inheritance dependencies, which is considerably complicated due to dynamic

binding. A similar point was made in [77] about the understandability of programs in languages that

support dynamic binding, such as C++.

In a set of interviews with 13 experienced users of object-oriented programming, Daly et al. [40] noted

that if the inheritance hierarchy is designed properly then the effect of distributing functionality over the

inheritance hierarchy would not be detrimental to understanding. However, it has been argued that there

exists increasing conceptual inconsistency as one travels down an inheritance hierarchy (i.e., deeper

levels in the hierarchy are characterized by inconsistent extensions and/or specializations of super-

classes) [45], therefore inheritance hierarchies may not be designed properly in practice. In one study

Dvorak [45] found that subjects were more inconsistent in placing classes deeper in the inheritance

hierarchy than at higher levels in the hierarchy.

An experimental investigation found that making changes to a C++ program with inheritance consumed

more effort than a program without inheritance, and the author attributed this to the subjects finding the

inheritance program more difficult to understand based on responses to a questionnaire [26]. A

contradictory result was found in [41], where the authors conducted a series of classroom experiments

comparing the time to perform maintenance tasks on a ‘flat’ C++ program and a program with three levels

of inheritance. This was premised on a survey of object-oriented practitioners showing 55% of

respondents agreeing that inheritance depth is a factor when attempting to understand object-oriented

software [39]. The result was a significant reduction in maintenance effort for the inheritance program.

An internal replication by the same authors found the results to be in the same direction, albeit the p-

value was larger. The second experiment in [41] found that C++ programs with 5 levels of inheritance

took more time to maintain than those with no inheritance, although the effect was not statistically

significant. The authors explain this by observing that searching/tracing through the bigger inheritance

hierarchy takes longer. Two experiments that were partial replications of the Daly et al. experiments

produced different conclusions [107]. In both experiments the subjects were given three equivalent Java

programs to make changes to, and the maintenance time was measured. One of the Java programs was

‘flat’, one had an inheritance depth of 3, and one had an inheritance depth of 5. The results for the first

experiment indicate that the programs with inheritance depth of 3 took longer to maintain than the ‘flat’

program, but the program with inheritance depth of 5 took as much time as the ‘flat’ program. The authors

attribute this to the fact that the amount of changes required to complete the maintenance task for the

deepest inheritance program was smaller. The results for a second task in the first experiment and the

results of the second experiment indicate that it took longer to maintain the programs with inheritance. To

explain this finding and its difference from the Daly et al. results, the authors showed that the “number of

methods relevant for understanding” (which is the number of methods that have to be traced in order to

perform the maintenance task) was strongly correlated with the maintenance time, and this value was

much larger in their study compared with the Daly et al. programs. The authors conclude that inheritance


depth per se is not the factor that affects understandability, but the number of methods that have to be

traced.

2.1.1.3 Summary

The current theoretical framework for explaining the effect of the structural properties of object-oriented

programs on external program attributes can be justified empirically. To be specific, studies that have

been performed indicate that the distribution of functionality across classes in object-oriented systems,

and the exacerbation of this through inheritance, potentially makes programs more difficult to understand.

This suggests that highly cohesive, sparsely coupled, and low inheritance programs are less likely to

contain a fault. Therefore, metrics that measure these three dimensions of an object-oriented program

would be expected to be good predictors of fault-proneness or the number of faults.

The empirical question is then whether contemporary object-oriented metrics measure the relevant

structural properties well enough to substantiate the above theory. Below we review the evidence on this.

2.1.2 Empirical Validation of Object-Oriented Metrics

In this section we review the empirical studies that investigate the relationship between the ten object-

oriented metrics that we study and fault-proneness (or number of faults). The product metrics cover the

following dimensions: coupling, cohesion, inheritance, and complexity. These dimensions are based on

the definition of the metrics, and may not reflect their actual behavior.

Coupling metrics characterize the static usage dependencies amongst the classes in an object-oriented

system [21]. Cohesion metrics characterize the extent to which the methods and attributes of a class

belong together [16]. Inheritance metrics characterize the structure of the inheritance hierarchy.

Complexity metrics, as used here, are adaptations of traditional procedural paradigm complexity metrics

to the object-oriented paradigm.

Current methodological approaches for the validation of object-oriented product metrics are best

exemplified by two articles by Briand et al. [19][22]. These are validation studies for an industrial

communications system and a set of student systems respectively, where a considerable number of

contemporary object-oriented product metrics were studied. We single out these studies because their

methodological reporting is detailed and because they reflect what can be considered best

methodological practice to date.

The basic approach starts with a data set of product metrics and binary fault data for a complete system

or multiple systems. The important element of the Briand et al. methodology that is of interest to us here

is the univariate analysis that they stipulate should be performed. In fact, the main association between

the product metrics and fault-proneness is established on the basis of the univariate analysis. If the

relationship is statistically significant (and in the expected direction) then a metric is considered validated.7 For instance, in [22] the authors state a series of hypotheses relating each metric with fault-

proneness. They then explain “Univariate logistic regression is performed, for each individual measure

(independent variable), against the dependent variable to determine if the measure is statistically related,

in the expected direction, to fault-proneness. This analysis is conducted to test the hypotheses...”

Subsequently, the results of the univariate analysis are used to evaluate the extent of evidence

supporting each of the hypotheses. Reliance on univariate results as the basis for drawing validity

conclusions is common practice (e.g., see [4][10][17][18][57][106]).

In this review we first present the definition of the metrics as we have operationalized them. The

operationalization of some of the metrics is programming language dependent. We then present the

magnitude of the coefficients and p values computed in the various studies. Validation coefficients were

either the change in odds ratio as a measure of the magnitude of the metric to fault-proneness

association from a logistic regression (see the appendix, Section 7) or the Spearman correlation

coefficient. Finally, this review focuses only on the fault-proneness or number of faults dependent

variable. Other studies that investigated effort, such as [32][89][78], are not covered as effort is not the

topic of the current paper.

7 Briand et al. use logistic regression, and consider the statistical significance of the regression parameters.


2.1.2.1 WMC

This is the Weighted Methods per Class metric [30], and can be classified as a traditional complexity

metric. It is a count of the methods in a class. The developers of this metric leave the weighting scheme

as an implementation decision [30]. We weight it using cyclomatic complexity as did [78]. However, other

authors did not adopt a weighting scheme [4][106]. Methods from ancestor classes are not counted and

neither are “friends” in C++. This is similar to the approach taken in, for example, [4][31]. To be precise, WMC was counted after preprocessing to avoid undercounts due to macros [33].8

One study found WMC to be associated with fault-proneness on three different sub-systems written in C++ with p-values 0.054, 0.0219 and 0.0602, and change in odds ratio 1.26, 1.45, and 1.26 [106].9 A

study that evaluated WMC on a C++ application and a Java application found WMC to have a Spearman

correlation of 0.414 and 0.456 with the number of faults due to field failures respectively, and highly

significant p-values (<0.0001 and <0.0056) [10]. Another study using student systems found WMC to be associated with fault-proneness with a p-value for the logistic regression coefficient of 0.0607 [4].10

2.1.2.2 DIT

The Depth of Inheritance Tree [30] metric is defined as the length of the longest path from the class to the

root in the inheritance hierarchy. It is stated that the further down the class hierarchy a class is located, the more complex it becomes, and hence the more fault-prone.
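As a concrete reading of this definition, the sketch below computes DIT from a map of direct base classes, taking the longest path when a class has multiple bases; this is a minimal illustration under our reading of [30], not the authors' collection tool, and the class names are invented.

```python
# Minimal sketch: DIT is the length of the longest path from a class
# to a root of the inheritance hierarchy.
def dit(cls, parents):
    """parents maps each class to the list of its direct base classes."""
    bases = parents.get(cls, [])
    if not bases:
        return 0  # a root class has depth 0
    return 1 + max(dit(base, parents) for base in bases)

# Invented example hierarchy.
parents = {
    "EventHandler": [],
    "Task": ["EventHandler"],
    "SvcHandler": ["EventHandler", "Task"],
}
assert dit("SvcHandler", parents) == 2  # longest path runs through Task
```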

The DIT metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was related

to fault-proneness (p=0.0074) with a change in odds ratio equal to 0.572 when measured on non-library

classes. The second study [22] also found it to be associated with fault-proneness (p=0.0001) with a

change in odds ratio of 2.311. Another study using student systems found DIT to be associated with fault-

proneness with a p-value for the logistic regression coefficient <0.0001 [4].

It will be noted that in the first study a negative association was found between DIT and fault-proneness.

The authors explain this by stating that in the system studied classes located deeper in the inheritance

hierarchy provide only implementations for a few specialized methods, and are therefore less likely to

contain faults than classes closer to the root [19]. This was a deliberate strategy to place as much

functionality as close as possible to the root of the inheritance tree. Note that for the latter two

investigations, the same data set was used, and therefore the slightly different coefficients may have

been due to removal of outliers.

One study using data from an industrial system found that classes involved in an inheritance structure

were more likely to have defects (found during integration testing and within 12 months post-delivery)

[27]. Another study did not find DIT to be associated with fault-proneness on three different sub-systems

written in C++, where faults were based on three years’ worth of trouble reports [106]. One study that

evaluated DIT on a Java application found that it had a Spearman correlation of 0.523 (p<0.0015) with the

number of faults due to field failures [10].

2.1.2.3 NOC

This is the Number of Children inheritance metric [30]. This metric counts the number of classes which

inherit from a particular class (i.e., the number of classes in the inheritance tree down from a class).

The NOC metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was not

related to fault-proneness. Conversely, the second study [22] found it to be associated with fault-

proneness (p=0.0276) with a change in odds ratio of 0.322. Another study using student systems found NOC to be associated with fault-proneness with a p-value for the regression coefficient <0.0001 [4]. Note that for the latter two investigations, the same data set was used, and therefore the slightly different coefficients may have been due to removal of outliers. In both studies NOC had a negative association with fault-proneness and this was interpreted as indicating that greater attention was given to these classes (e.g., through inspections) given that many classes were dependent on them.

8 Note that macros embodied in #ifdef’s are used to customize the implementation to a particular platform. Therefore, the method is defined at design time but its implementation is conditional on environment variables. Not counting it, as suggested in [31], would undercount methods known at design time.

9 In this study faults were classified as either object-oriented type faults or traditional faults. The values presented here are for all of the faults, although the same metrics were found to be significant for both all faults and the object-oriented only faults. Furthermore, the change in odds ratio reported is based on a change of one unit of the metric rather than a change in the standard deviation.

10 This study used the same data set as in [22], except that the data was divided into subsets using different criteria. The results presented here are for all of the classes.

Another study did not find NOC to be associated with fault-proneness on three different sub-systems

written in C++, where faults were based on three years’ worth of trouble reports [106]. NOC was not

associated with the number of faults due to field failures in a study of two systems, one implemented in

C++ and the other in Java [10].

2.1.2.4 CBO

This is the Coupling Between Object Classes coupling metric [30]. A class is coupled with another if

methods of one class uses methods or attributes of the other, or vice versa. In this definition, uses can

mean as a member type, parameter type, method local variable type or cast. CBO is the number of other

classes to which a class is coupled. It includes inheritance-based coupling (i.e., coupling between

classes related via inheritance).

The CBO metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was

related to fault-proneness (p<0.0001) with a change in odds ratio equal to 5.493 when measured on non-

library classes. The second study [22] also found it to be associated with fault-proneness (p<0.0001) with

a change in odds ratio of 2.012 when measured on non-library classes. Another study did not find CBO to

be associated with fault-proneness on three different sub-systems written in C++, where faults were

based on three years’ worth of trouble reports [106]. This was also the case in a recent empirical analysis

on two traffic simulation systems, where no relationship between CBO and the number of known faults

was found [57], and a study of a Java application where CBO was not found to be associated with faults

due to field failures [10]. Finally, another study using student systems found CBO to be associated with

fault-proneness with a p-value for the logistic regression coefficient <0.0001 [4].

2.1.2.5 RFC

This is the Response for a Class coupling metric [30]. The response set of a class consists of the set M of

methods of the class, and the set of methods invoked directly by methods in M (i.e., the set of methods

that can potentially be executed in response to a message received by that class). RFC is the number of

methods in the response set of the class.
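The following sketch makes this definition concrete under our reading of [30]; the method names and the call map (which a static analyzer would normally supply) are hypothetical.

```python
# Minimal sketch: the response set is the class's own methods (M) plus
# all methods they invoke directly; RFC is the size of this set.
def rfc(own_methods, direct_calls):
    """own_methods: set of methods defined in the class.
    direct_calls: maps each method to the set of methods it invokes directly."""
    response_set = set(own_methods)
    for m in own_methods:
        response_set |= direct_calls.get(m, set())
    return len(response_set)

own = {"open", "close"}
calls = {"open": {"handle_event", "log"}, "close": {"log"}}
assert rfc(own, calls) == 4  # {open, close, handle_event, log}
```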

The RFC metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was

related to fault-proneness (p=0.0019) with a change in odds ratio equal to 1.368 when measured on non-

library classes. The second study [22] also found it to be associated with fault-proneness (p<0.0001) with

a change in odds ratio of 3.208 when measured on non-library classes. Another study found RFC to be

associated with fault-proneness on two different sub-systems written in C++ with p-values 0.0401 and 0.0499, and change in odds ratio 1.0562 and 1.0654 [106].11 A study that evaluated RFC on a C++

application and a Java application found RFC to have a Spearman correlation of 0.417 and 0.775 with the

number of faults due to field failures respectively, and highly significant p-values (both <0.0001) [10].

Another study using student systems found RFC to be associated with fault-proneness with a p-value for

the logistic regression coefficient <0.0001 [4].

11 In this study faults were classified as either object-oriented type faults or traditional faults. The values presented here are for all of the faults, although the same metrics were found to be significant for both all faults and the object-oriented only faults. Furthermore, the change in odds ratio reported is based on a change of one unit of the metric rather than a change in the standard deviation.


2.1.2.6 LCOM

This is a cohesion metric that was defined in [30]. This measures the number of pairs of methods in the

class using no attributes in common minus the number of pairs of methods that do. If the difference is

negative it is set to zero.
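A small sketch of this definition, assuming we already know which attributes each method references (the mapping shown is invented for illustration):

```python
from itertools import combinations

# Minimal sketch of LCOM per [30]: pairs of methods sharing no attribute,
# minus pairs sharing at least one, floored at zero.
def lcom(attr_use):
    """attr_use maps each method name to the set of attributes it uses."""
    disjoint = shared = 0
    for m1, m2 in combinations(attr_use, 2):
        if attr_use[m1] & attr_use[m2]:
            shared += 1
        else:
            disjoint += 1
    return max(disjoint - shared, 0)

attr_use = {"a": {"x"}, "b": {"x", "y"}, "c": {"z"}}
assert lcom(attr_use) == 1  # 2 disjoint pairs - 1 sharing pair
```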

The LCOM metric was empirically evaluated in [19][22]. In [19] the authors found it to be associated with

fault-proneness (p=0.0249) with a change in odds ratio of 1.613. Conversely, the second study [22] did

not find it to be associated with fault-proneness.

2.1.2.7 NMO

This is an inheritance metric that has been defined in [80], and measures the number of inherited

methods overridden by a subclass. A large number of overridden methods indicates a design problem [80].

Since a subclass is intended to specialize its parent, it should primarily extend the parent’s services [94].

This should result in unique new method names. Numerous overrides indicate subclassing for the

convenience of reusing some code and/or instance variables when the new subclass is not purely a

specialization of its parent [80].

The NMO metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was

related to fault-proneness (p=0.0082) with a change in odds ratio equal to 1.724. The second study [22]

also found it to be associated with fault-proneness (p=0.0243) with a change in odds ratio of 1.948.

Lorenz and Kidd [80] caution that in the context of frameworks methods are often defined specifically for

reuse or are meant to be overridden. Therefore, for our study there is already an a priori expectation

that this metric may not be a good predictor.

2.1.2.8 NMA

This is an inheritance metric that has been defined in [80], and measures the number of methods added

by a subclass (inherited methods are not counted). As this value becomes larger for a class, the

functionality of that class becomes increasingly distinct from that of the parent classes.

The NMA metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was

related to fault-proneness (p=0.0021) with a change in odds ratio equal to 3.925, a rather substantial

effect. The second study [22] also found it to be associated with fault-proneness (p=0.0021) with a

change in odds ratio of 1.710.

2.1.2.9 SIX

This is an inheritance metric that has been defined in [80], and consists of a combination of inheritance

metrics. It is calculated as the product of the number of overridden methods and the class hierarchy nesting level, normalized by the total number of methods in the class. The higher the value of SIX, the more likely it is that a particular class does not conform to the abstraction of its superclasses [94].
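In symbols, our reading of this definition (the report gives it only in words, so this rendering is an assumption), with NM denoting the total number of methods in the class and the nesting level taken as DIT:

    SIX = (NMO × DIT) / NM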

The SIX metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was not

related to fault-proneness. Conversely, the second study [22] found it to be associated with fault-

proneness (p=0.0089) with a change in odds ratio of 1.337.

2.1.2.10 NPAVG

This can be considered as a coupling metric and has been defined in [80], and measures the average

number of parameters per method (not including inherited methods). Methods with a high number of

parameters generally require considerable testing (as their input can be highly varied). Also, large

numbers of parameters lead to more complex, less maintainable code.

2.1.2.11 Summary

The current empirical studies do provide some evidence that object oriented metrics are associated with

fault-proneness or the incidence of faults. The evidence, though, is equivocal. For some of the inheritance

metrics that were studied (DIT and NOC) some studies found a positive association, some found a

negative association, and some found no association. The CBO metric was found to be positively

associated with fault-proneness in some studies, and not associated with either the number of faults

found or fault-proneness in other studies. The RFC and WMC metrics were consistently found to be


associated with fault-proneness. The NMO and NMA metrics were found to be associated with fault-

proneness, but the evidence for the SIX metric is more equivocal. The LCOM cohesion metric also has

equivocal evidence supporting its validity.

It should be noted that the differences in the results obtained across studies may be a consequence of

the measurement of different dependent variables. For instance, some treat the dependent variable as

the (continuous) number of defects found. Other studies use a binary value of incidence of a fault during

testing or in the field, or both. It is plausible that the effects of product metrics may be different for each of

these.

An optimistic observer would conclude that the evidence as to the predictive validity of most of these

metrics is good enough to recommend their practical usage.

2.2 The Confounding Effect of Size

In this section we take as a starting point the stance of an optimistic observer and assume that there is

sufficient empirical evidence demonstrating the relationship between the object-oriented metrics that we

study and fault-proneness. We already showed that previous empirical studies drew their conclusions

from univariate analyses. Below we make the argument that univariate analyses ignore the potential

confounding effects of class size. We show that if there is indeed a size confounding effect, then

previous empirical studies could have harbored a large positive bias.

For ease of presentation we take as a running example a coupling metric as the main metric that we are

trying to validate. For our purposes, a validation study is designed to determine whether there is an

association between coupling and fault-proneness. Furthermore, we assume that this coupling metric is

appropriately dichotomized: Low Coupling (LC) and High Coupling (HC). This dichotomization

assumption simplifies the presentation, but the conclusions can be directly generalized to a continuous

metric.

2.2.1 The Case Control Analogy

An object-oriented metrics validation study can be easily seen as an unmatched case-control study.

Case-control studies are frequently used in epidemiology to, for example, study the effect of exposure to carcinogens on the incidence of cancers [95][12]. The reason for using case-control studies as opposed

to randomized experiments in certain instances is that it would not be ethically and legally defensible to

do otherwise. For example, it would not be possible to have deliberately composed ‘exposed’ and

‘unexposed’ groups in a randomized experiment when the exposure is a suspected carcinogen or toxic

substance. Randomized experiments are more appropriately used to evaluate treatments or preventative

measures [52].

In applying the conduct of a case-control study to the validation of an object-oriented product metric, one

would first proceed by identifying classes that have faults in them (the cases). Then, for the purpose of

comparison, another group of classes without faults in them is identified (the controls). We determine

the proportion of cases that have, say High Coupling and the proportion with Low Coupling. Similarly, we

determine the proportion of controls with High Coupling, and the proportion with Low Coupling. If there is

an association of coupling with fault-proneness then the prevalence of High Coupling classes would be

higher in the cases than in the controls. Effectively then, a case-control study follows a paradigm that

proceeds from effect to cause, attempting to find antecedents that lead to faults [99]. In a case-control

study, the control group provides an estimate of the frequency of High Coupling that would be expected

among the classes that do not have faults in them.

In an epidemiological context, it is common to have ‘hospital-based cases’ [52][95]. For example, a

subset or all patients that have been admitted to a hospital with a particular disease can be considered as cases.13 Controls can also be selected from the same hospital or clinic. The selection of controls is not

necessarily a simple affair. For example, one can match the cases with controls on some confounding variables, for instance, on age and sex. Matching ensures that the cases and controls are similar on the matching variable and therefore this variable cannot be considered a causal factor in the analysis. Alternatively, one can have an unmatched case-control study and control for confounding effects during the analysis stage.

12 Other types of studies that are used are cohort-studies [52], but we will not consider these here.

13 This raises the issue of generalizability of the results. However, as noted by Breslow and Day [12], generalization from the sample in a case-control study depends on non-statistical arguments. The concern with the design of the study is to maximize internal validity. In general, replication of results establishes generalizability [79].

In an unmatched case-control study the determination of an association between the exposure (product

metric) and the disease (fault-proneness) proceeds by calculating a measure of association and

determining whether it is significant. For example, consider the following contingency table that is

obtained from a hypothetical validation study:

                     Coupling
Fault Proneness      HC      LC
Faulty               91      19
Not Faulty           19      91

Table 1: A contingency table showing the results of a hypothetical validation study.

For this particular data set, the odds ratio is 22.9 (see the appendix, Section 7, for a definition of the odds

ratio), which is highly significant, indicating a strong positive association between coupling and fault-

proneness.
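As a quick check of that figure: for a 2x2 table with faulty counts a (HC) and b (LC) and not-faulty counts c (HC) and d (LC), the odds ratio is (a*d)/(b*c); the appendix (Section 7) gives the formal definition. A one-line computation:

```python
# Odds ratio for Table 1: faulty (HC=91, LC=19) vs. not faulty (HC=19, LC=91).
def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

print(odds_ratio(91, 19, 19, 91))  # ~22.9, the value quoted above
```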

2.2.2 The Potential Confounding Effect of Size

One important element that has been ignored in previous validation studies is the potential confounding

effect of class size. This is illustrated in Figure 2.

Figure 2: Path diagram illustrating the confounding effect of size.

The path diagram in Figure 2 depicts a classic text-book example of confounding in a case-control study [99][12].14 The path (a) represents the current causal beliefs about product metrics being an antecedent to fault-proneness. The path (b) depicts a positive causal relationship between size and fault-proneness. The path (c) depicts a positive association between product metrics and size.

14 We make the analogy to a case-control study because it provides us with a well tested framework for defining and evaluating confounding effects, as well as for conducting observational studies from which one can make stronger causal claims (if all known confounders are controlled). However, for the sole purposes of this paper, the characteristics of a confounding effect have been described and exemplified in [61] without resort to a case-control analogy.

If this path diagram is concordant with reality, then size distorts the relationship between product metrics

and fault-proneness. Confounding can result in considerable bias in the estimate of the magnitude of the

association. Size is a positive confounder, which means that ignoring size will always result in the

association between, say, coupling and fault-proneness appearing more positive than it really is.

The potential confounding effect of size can be demonstrated through an example (adapted from [12]).

Consider Table 1, which gave an odds ratio of 22.9. As mentioned earlier, this is representative of the current univariate analyses used in the object-oriented product metrics validation literature (which neither include size as a covariate nor employ stratification on size).

Now, let us say that if we analyze the data separately for small and large classes, we have the data in Table 2 for the large classes, and the data in Table 3 for the small classes.15

                     Coupling
Fault Proneness      HC      LC
Faulty               90      10
Not Faulty            9       1

Table 2: A contingency table showing the results for only large classes of a hypothetical validation study.

                     Coupling
Fault Proneness      HC      LC
Faulty                1       9
Not Faulty           10      90

Table 3: A contingency table showing the results for only small classes of a hypothetical validation study.

In both of the above tables the odds ratio is one. By stratifying on size (i.e., controlling for the effect of

size), the association between coupling and fault-proneness has been reduced dramatically. This is

because size was the reason why there was an association between coupling and fault-proneness in the

first place. Once the influence of size is removed, the example shows that the impact of the coupling

metric disappears.
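Repeating the same odds ratio computation per size stratum makes the disappearance explicit (the helper is repeated from the earlier sketch):

```python
def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

print(odds_ratio(91, 19, 19, 91))  # crude OR from Table 1: ~22.9
print(odds_ratio(90, 10, 9, 1))    # large classes only (Table 2): 1.0
print(odds_ratio(1, 9, 10, 90))    # small classes only (Table 3): 1.0
```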

Therefore, an important improvement on the conduct of validation studies of object oriented metrics is to

control for the effect of size, otherwise one may be getting the illusion that the product metric is strongly

associated with fault-proneness, when in reality the association is much weaker or non-existent.

2.2.3 Evidence of a Confounding Effect

Now we must consider whether the path diagram in Figure 2 can be supported in reality.

There is evidence that object-oriented product metrics are associated with size. For example, in [22] the

Spearman rho correlation coefficients go as high as 0.43 for associations between some coupling and

cohesion metrics with size, and 0.397 for inheritance metrics, and both are statistically significant (at an

alpha level of say 0.1). Similar patterns emerge in the study reported in [19], where relatively large

correlations are shown. In another study [27] the authors display the correlation matrix showing the

Spearman correlation between a set of object-oriented metrics that can be collected from Shlaer-Mellor

designs and C++ LOC. The correlations range from 0.563 to 0.968, all statistically significant at an alpha

level 0.05. This also indicates very strong correlations with size.

15 Note that in this example the odds ratio of the size to fault-proneness association is 100, and the size to coupling association is 81.3. Therefore, it follows the model in Figure 2.


Associations between size and defects have been reported in non-object oriented systems [58]. For

object oriented programs, the relationship between size and defects is clearly visible in the study of [27],

where the Spearman correlation was found to be 0.759 and statistically significant. Another study of

image analysis programs written in C++ found a Spearman correlation of 0.53 between size in LOC and

the number of errors found during testing [55], and was statistically significant at an alpha level of 0.05.

Briand et al. [22] find statistically significant associations between 6 different size metrics and fault-

proneness for C++ programs, with a change in odds ratio going as high as 4.952 for one of the size

metrics.

General indications of a confounding effect are seen in Figure 3, which shows the associations between a

set of coupling metrics and fault-proneness, and with size from a recent study [22]. The association

between coupling metrics and fault-proneness is given in terms of the change in the odds ratio and the p-

value of the univariate logistic regression parameter. The association with size is in terms of the

Spearman correlation. As can be seen in Figure 3, all the metrics that had a significant relationship with

fault-proneness in the univariate analysis also had a significant correlation with size. Furthermore, there

is a general trend of increasing association between the coupling metric and fault-proneness as its

association with size increases.

            Relationship with fault-proneness      Relationship with size
Metric      Change in odds ratio    p-value        rho        p-value
CBO         2.012                   <0.0001        0.3217     <0.0001
CBO’        2.062                   <0.0001        0.3359     <0.0001
RFC1        3.208                   <0.0001        0.3940     <0.0001
RFC         8.168                   <0.0001        0.4310     <0.0001
MPC         5.206                   <0.0001        0.3232     <0.0001
ICP         7.170                   <0.0001        0.3168     <0.0001
IH-ICP      1.090                   0.5898         -0.124     0.1082
NIH-ICP     9.272                   <0.0001        0.3455     <0.0001
DAC         1.395                   0.0329         0.1753     0.0163
DAC’        1.385                   0.0389         0.1958     0.0088
OCAIC       1.416                   0.0307         0.1296     0.0785
FCAEC       1.206                   0.3213         0.0297     0.7010
OCMIC       1.133                   0.3384         0.0493     0.4913
OCMEC       0.816                   0.252          -0.0855    0.2528
IFMMIC      1.575                   0.0922         0.2365     0.0019
AMMIC       1.067                   0.6735         -0.1229    0.1115
OMMIC       4.937                   <0.0001        0.2765     0.0001
OMMEC       1.214                   0.2737         -0.0345    0.6553

Figure 3: Relationship between coupling metrics and fault-proneness, and between coupling metrics and

size from [22]. This covers only coupling to non-library classes. This also excludes the following metrics

because no results pertaining to the relationship with fault-proneness were presented: ACAIC, DCAEC,

IFCMIC, ACMIC, IFCMEC, and DCMEC. The definition of these metrics is provided in the appendix.


This leads us to conclude that, potentially, previous validation studies have overestimated the impact of

object oriented metrics on fault-proneness due to the confounding effect of size.

2.3 Summary

In this section the theoretical basis for object-oriented product metrics was presented. This states that

cognitive complexity is an intervening variable between the structural properties of classes and fault-

proneness. Furthermore, the empirical evidence supporting the validity of the object oriented metrics that

we study was presented, and this indicates that some of the metrics are strongly associated with fault-

proneness or the number of faults. We have also demonstrated that there is potentially a strong size

confounding effect in empirical studies to date that validate object oriented product metrics. This makes it

of paramount importance to determine whether such a strong confounding effect really exists.

If a size confounding effect is found, this means that previous validation studies have a positive bias and

may have exaggerated the impact of product metrics on fault-proneness. The reason is that studies to

date relied exclusively on univariate analysis to test the hypothesis that the product metrics are

associated with fault-proneness or the number of faults. The objective of the study below then is to

directly test the existence of this confounding effect and its magnitude.

3 Research Method

3.1 Data Source

Our data set comes from a telecommunications framework written in C++ [102]. The framework

implements many core design patterns for concurrent communication software. The communication

software tasks provided by this framework include event demultiplexing and event handler dispatching,

signal handling, service initialization, interprocess communication, shared memory management,

message routing, dynamic (re)configuration of distributed services, and concurrent execution and

synchronization. The framework has been used in applications such as electronic medical imaging

systems, configurable telecommunications systems, high-performance real-time CORBA, and web

servers. Examples of its application include in the Motorola Iridium global personal communications

system [101] and in network monitoring applications for telecommunications switches at Ericsson [100]. A

total of 174 classes from the framework that were being reused in the development of commercial

switching software constitute the system that we study. A total of 14 different programmers were involved in the development of this set of classes.16

3.2 Measurement

3.2.1 Product Metrics

All product metrics are defined at the class level and constitute design metrics; they have been presented

in Section 2.1.2. In our study the size variable was measured as non-comment source LOC for the class.

Measurement of product metrics used a commercial metrics collection tool that is currently being used by

a number of large telecommunications software development organizations.

3.2.2 Dependent Variable

For this product, we obtained data on the faults found in the library from actual field usage.^17 Each fault

was due to a unique field failure and represents a defect in the program that caused the failure. Failures

were reported by the users of the framework. The developers of the framework documented the reasons

for each delta in the version control system, and it was from this that we extracted information on whether

a class was faulty.

16 This number was obtained from the different login names in the version control system associated with each class.

17 It has been argued that considering faults causing field failures is a more important question to address than faults found during testing [9]. In fact, it has been argued that it is the ultimate aim of quality modeling to predict post-release fault-proneness [50]. In at least one study it was found that pre-release fault-proneness is not a good surrogate measure for post-release fault-proneness, the reason posited being that pre-release fault-proneness is a function of testing effort [51].


A total of 192 faults were detected in the framework at the time of writing. These faults occurred in 70 out

of 174 classes. The dichotomous dependent variable that we used in our study was the detection or non-

detection of a fault. If one or more faults are detected then the class is considered to be faulty, and if not

then it is considered not faulty.
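To make the coding of the dependent variable concrete, the following minimal sketch (Python with pandas; the class names and fault counts are made-up stand-ins, not the study's data) derives the faulty/not-faulty flag from per-class fault counts:

```python
# Minimal sketch: deriving the dichotomous dependent variable from per-class
# fault counts. Column names and values are hypothetical illustrations.
import pandas as pd

faults = pd.DataFrame({
    "class_name": ["EventHandler", "Reactor", "Acceptor"],
    "n_field_faults": [0, 3, 1],
})

# A class is "faulty" if one or more field faults were traced to it.
faults["faulty"] = (faults["n_field_faults"] >= 1).astype(int)
print(faults)
```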

3.3 Data Analysis Methods

3.3.1 Testing for a Confounding Effect

It is tempting to use a simple approach to test for a confounding effect of size: examine the association

between size and fault-proneness. If this association is not significant at a traditional alpha level, then

conclude that size is not different between cases and controls (and hence has no confounding effect),

and proceed with a usual univariate analysis.

However, it has been noted that this is an incorrect approach [38]. The reason is that traditional

significance testing places the burden of proof on rejecting the null hypothesis. This means that one has

to prove that the cases and controls do differ in size. In evaluating confounding potential, the burden of

proof should be in the opposite direction: before discarding the potential for confounding, the researcher

should demonstrate that cases and controls do not differ on size. This means controlling the Type II error

rather than the Type I error. Since one usually has no control over the sample size, this means setting

the alpha level to 0.25, 0.5, or even larger.
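As an illustration of this burden-of-proof reasoning, the sketch below compares the class sizes of cases and controls at a deliberately inflated alpha. The LOC values are invented, and the Mann-Whitney test is simply one reasonable choice for heavy-tailed size data, not necessarily the test used in any particular study:

```python
# Illustrative sketch: screening for a size difference between faulty classes
# (cases) and non-faulty classes (controls) at an inflated alpha, so that the
# burden of proof lies on demonstrating "no difference". Data are made up.
from scipy.stats import mannwhitneyu

loc_faulty = [120, 340, 95, 410, 220]        # hypothetical LOC of faulty classes
loc_not_faulty = [60, 80, 150, 45, 200, 90]  # hypothetical LOC of the rest

stat, p = mannwhitneyu(loc_faulty, loc_not_faulty, alternative="two-sided")
ALPHA = 0.25  # inflated to control the Type II rather than the Type I error
if p < ALPHA:
    print(f"p={p:.3f}: cannot rule out a size difference -> treat size as a confounder")
else:
    print(f"p={p:.3f}: sizes comparable even at alpha={ALPHA}")
```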

A simpler and more parsimonious approach is as follows. For an unmatched case-control study, a

measured confounding variable can be controlled through a regression adjustment [12][99]. A regression

adjustment entails including the confounder as another independent variable in a regression model. If the

regression coefficient of the object-oriented metric changes dramatically (in magnitude and statistical

significance) with and without the size variable, then this is a strong indication that there was indeed a

confounding effect [61]. This is further elaborated below.
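The sketch below illustrates the mechanics of such a regression adjustment on simulated data, where fault-proneness is driven by size alone and the metric is merely correlated with size; all variable names and values are synthetic:

```python
# Sketch of a regression adjustment: fit the logistic model with the metric
# alone, then with size added, and compare the metric's coefficient.
# All data here are simulated for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
size = rng.exponential(200.0, 174)                 # class size in LOC
metric = 0.05 * size + rng.exponential(5.0, 174)   # metric correlated with size
# Faults are driven by size alone, so any apparent metric effect is confounding.
p_fault = 1.0 / (1.0 + np.exp(-(-3.0 + 0.01 * size)))
faulty = rng.binomial(1, p_fault)

m1 = sm.Logit(faulty, sm.add_constant(metric)).fit(disp=0)
m2 = sm.Logit(faulty, sm.add_constant(np.column_stack([metric, size]))).fit(disp=0)
print("metric coeff, univariate      :", round(m1.params[1], 4))
print("metric coeff, size-adjusted   :", round(m2.params[1], 4))
```

With data generated this way, the univariate coefficient will typically look significant while the size-adjusted one shrinks toward zero, which is the signature of confounding.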

3.3.2 Logistic Regression Model

Binary logistic regression is used to construct models when the dependent variable can only take on two

values, as in our case. It is most convenient to use a logistic regression (henceforth LR) model rather

than the contingency table analysis used earlier for illustrations since the model does not require

dichotomization of our product metrics. The general form of an LR model is:

$$\pi = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{i=1}^{k} \beta_i x_i\right)}} \qquad \text{Eqn. 1}$$

where $\pi$ is the probability of a class having a fault, and the $x_i$ are the independent variables. The $\beta$ parameters are estimated through the (unconditional) maximization of a log-likelihood [61].

In a univariate analysis only one $x_i$, namely $x_1$, is included in the model, and this is the product metric that is being validated:^18

$$\pi = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1)}} \qquad \text{Eqn. 2}$$

When controlling for size, a second $x_i$, $x_2$, is included that measures size:

$$\pi = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}} \qquad \text{Eqn. 3}$$

18 Conditional logistic regression is used when there has been matching in the case-control study and each matched set is treated as a stratum in the analysis [12].


In constructing our models, we could follow the previous literature and consider neither interaction effects nor any transformations (for example, see [4][8][17][18][19][22][106]). To err on the conservative side, however, we did test for interaction effects between the size metric and the product metric for all product metrics evaluated. In none of the cases was a significant interaction effect identified. Furthermore, we performed a logarithmic transformation^19 on our variables and re-evaluated all the models.^20 Our conclusions would not be affected by using the transformed models. Therefore, we only present the detailed results for the untransformed models.
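A sketch of these two robustness checks on synthetic data (hypothetical variables; statsmodels' Logit stands in for whatever package was actually used) might look as follows:

```python
# Sketch of the two robustness checks: (1) test a metric-by-size interaction
# term; (2) refit after a logarithmic transformation. Data are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
size = rng.exponential(200.0, 174)
metric = 0.05 * size + rng.exponential(5.0, 174)   # non-negative, count-like
faulty = rng.binomial(1, 1.0 / (1.0 + np.exp(3.0 - 0.01 * size)))

def z(v):
    # Standardize a column for numerical stability of the product term.
    return (v - v.mean()) / v.std()

# (1) metric-by-size interaction
X_int = sm.add_constant(np.column_stack([z(metric), z(size), z(metric) * z(size)]))
m_int = sm.Logit(faulty, X_int).fit(disp=0)
print("interaction p-value:", round(m_int.pvalues[3], 3))

# (2) logarithmic transformation; log1p accommodates zero counts
X_log = sm.add_constant(np.column_stack([np.log1p(metric), np.log1p(size)]))
m_log = sm.Logit(faulty, X_log).fit(disp=0)
print("metric coeff in log model:", round(m_log.params[1], 3))
```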

The magnitude of an association can be expressed in terms of the change in odds ratio as the $x_1$ variable changes by one standard deviation. This is explained in the appendix (Section 7), and is denoted by $\Psi$. Since we construct two models, as shown in Eqn. 2 and Eqn. 3, without and with controlling for size respectively, we denote the corresponding changes in odds ratio as $\Psi_{x_1}$ and $\Psi_{x_1+x_2}$. As suggested in [74], the extent to which this odds ratio changes between the two models can serve as an indication of the extent of confounding. We operationalize this as follows:

$$\Delta\Psi = \frac{\Psi_{x_1} - \Psi_{x_1+x_2}}{\Psi_{x_1+x_2}} \times 100 \qquad \text{Eqn. 4}$$

This gives the percent change in $\Psi_{x_1+x_2}$ obtained by removing the size confounder. If this value is large, then we can consider that class size does indeed have a confounding effect. The definition of “large” can be problematic; however, as will be seen in the results, the changes in our study are sufficiently big that by any reasonable threshold there is little doubt.
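As a worked illustration of Eqn. 4 with invented numbers, with $\Psi$ computed as $e^{\beta_1 \cdot sd(x_1)}$, a coefficient that collapses after adjustment produces a large percent change:

```python
# Worked illustration of Eqn. 4 with hypothetical numbers. Psi is the change
# in odds ratio for a one-standard-deviation increase in the metric.
import math

sd_metric = 12.0        # hypothetical standard deviation of the metric
beta_univariate = 0.08  # beta1 from the model of Eqn. 2 (hypothetical)
beta_adjusted = 0.01    # beta1 from the size-adjusted model of Eqn. 3

psi_x1 = math.exp(beta_univariate * sd_metric)       # ~2.61
psi_x1_x2 = math.exp(beta_adjusted * sd_metric)      # ~1.13
pct_change = (psi_x1 - psi_x1_x2) / psi_x1_x2 * 100  # Eqn. 4
print(f"Psi drops from {psi_x1:.2f} to {psi_x1_x2:.2f}: {pct_change:.0f}% change")
```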

3.3.3 Diagnostics and Hypothesis Testing

The appendix of this paper presents the details of the model diagnostics that were performed, and the

approach to hypothesis testing. Here we summarize these.

The diagnostics concerned checking for collinearity and identifying influential observations. We compute the condition number specific to logistic regression, $\eta_{LR}$, to determine whether dependencies amongst the independent variables are affecting the stability of the model (collinearity). The $\Delta\beta$ diagnostic gives us an indication of which observations are overly influential. For hypothesis testing, we use the likelihood ratio statistic, G, to test the significance of the overall model, the Wald statistic to test the significance of individual model parameters, and the Hosmer and Lemeshow R² value as a measure of goodness of fit. Note that for the univariate model the G statistic and the Wald test are statistically equivalent, but we present them both for completeness. All statistical tests were performed at an alpha level of 0.05.
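The sketch below shows how the G statistic, the Wald p-values, and a likelihood-ratio R² can be read off a fitted model in statsmodels (synthetic data; statsmodels' McFadden pseudo-R², 1 − LL_model/LL_null, is used here as the likelihood-ratio R², while the η_LR and Δβ diagnostics would require additional computation not shown):

```python
# Sketch of the hypothesis tests on a fitted logistic model (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.exponential(50.0, 174)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(2.0 - 0.02 * x)))

res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print("G statistic    :", round(res.llr, 2))         # 2*(LL_model - LL_null)
print("G p-value      :", round(res.llr_pvalue, 4))  # significance of the model
print("Wald p (slope) :", round(res.pvalues[1], 4))  # individual coefficient
print("LR R^2         :", round(res.prsquared, 3))   # 1 - LL_model/LL_null
```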

4 Results

4.1 Descriptive Statistics

Box and whisker plots for all the product metrics that we collected are shown in Figure 4. These indicate the median and the 25th and 75th quantiles.^21 Outliers and extreme points are also shown in the figure.

As is typical with product metrics, their distributions are clearly heavy-tailed. Most of the variables are counts, and therefore their minimal value is zero. Variables NOC, NMO, and SIX have fewer than six non-zero observations. Therefore, they were excluded from further analysis. This is the approach followed in [22].

19 Given that product metrics are counts, an appropriate transformation to stabilize the variance would be the logarithm.

20 We wish to thank an anonymous reviewer for making this suggestion.

21 As will be noted, in some cases the minimal value is zero. For metrics such as CBO, WMC, and RFC, this would be because the class was defined in a manner similar to a C struct, with no methods associated with it.


The fact that few classes have NOC values greater than zero indicates that most classes in the system

are leaf classes. Overall, 76 of the classes had a DIT value greater than zero, indicating that they are

subclasses. The remaining 98 classes are at the root of the inheritance hierarchy. The above makes

clear that the inheritance hierarchy for this system was “flat”. The DIT variable shows little variation, primarily because there is not much deep inheritance in this framework. Shallow

inheritance trees, indicating sparse use of inheritance, have been reported in a number of systems thus

far [27][30][32].

Figure 4: Box and whisker plots for all the object-oriented product metrics. Two charts are shown to

allow for the fact that two y-axis scales are required due to the different ranges.

The LCOM values may seem to be large. However, examination of the results from previous systems

indicates that they are not exceptional. For instance, in the C++ systems reported in [22], the maximum

LCOM value was 818, the mean 43, and the standard deviation was 106. Similarly, the system reported

in [19] had a maximum LCOM value of 4988, a mean of 99.6, and standard deviation of 547.7.

4.2 Correlation with Size

Table 4 shows the correlation of the metrics with size as measured in LOC. As can be seen, all of the associations are statistically significant except for DIT. But DIT did not have much variation, and therefore a weak association with size is not surprising. All metrics except LCOM and NPAVG have a substantial correlation coefficient, indicating a non-trivial association with size.


OO Metric   Rho      p-value
WMC         0.88     <0.0001
DIT         0.098    0.19
CBO         0.46     <0.0001
RFC         0.88     <0.0001
LCOM        0.24     0.0011
NMA         0.86     <0.0001
NPAVG       0.27     0.0002

Table 4: Spearman correlation of the object-oriented metrics (only the ones that have more than five non-zero values) with size in LOC.
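The Table 4 computation itself reduces to Spearman's rank correlation; a minimal sketch with stand-in arrays (not the study's data) follows:

```python
# Sketch of the Table 4 computation: Spearman's rho between a metric and size
# in LOC. The arrays are made-up stand-ins for the 174 per-class values.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
loc = rng.exponential(200.0, 174)          # hypothetical class sizes
wmc = 0.1 * loc + rng.exponential(2.0, 174)  # hypothetical WMC values

rho, p = spearmanr(wmc, loc)
print(f"WMC vs LOC: rho={rho:.2f}, p={p:.4f}")
```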

4.3 Validation Results

The results of the univariate analyses and the models controlling for size for each of the remaining

metrics are presented in this section. The complete results are presented in Table 5.


[Table 5 layout: for each metric (WMC, DIT, CBO, RFC, LCOM, NMA, NPAVG) the table reports, without size control, the G statistic (p-value), the H-L R², η_LR, the Coeff. (p-value), and Ψ; and, controlling for size, the same statistics plus the Size Coeff. (p-value) and the Size Ψ.]

Table 5: Overall results of the models without control of size (univariate models), and with control of size. The G value is the likelihood ratio test for the whole model. The “Coeff.” columns give the estimated parameters from the logistic regression model. The “p-value” is the one-sided test of the null hypothesis for the coefficient.^22 The R² values are based on the definition of R² provided by Hosmer and Lemeshow [61]; hence they are referred to as the H-L R² values. For the second half of the table, presenting the results of the models with size control, the coefficient for the size parameter is provided with its change in odds ratio. For the metrics where the model without size control is not significant (DIT and NPAVG), we do not present the models with size control since, given the hypothesized confounding effect, the results will not be substantively different from the no-size-control model.
