Computational Statistics and Data Analysis 53 (2009) 3706–3716
Robust probabilistic PCA with missing data and contribution analysis for outlier detection
Tao Chen a,*, Elaine Martin b, Gary Montague b
a School of Chemical and Biomedical Engineering, Nanyang Technological University, Singapore 637459, Singapore
b School of Chemical Engineering and Advanced Materials, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK
Article history: Received 9 March 2009; Accepted 20 March 2009; Available online 31 March 2009

Abstract
Principal component analysis (PCA) is a widely adopted multivariate data analysis technique, with interpretation being established on the basis of both the classical linear projection and a probability model (i.e. probabilistic PCA (PPCA)). Recently, robust PPCA models, based on the multivariate t-distribution, have been proposed to handle the situation where there may be outliers within the data set. This paper presents an overview of the robust PPCA technique and further discusses the issue of missing data. An expectation-maximization (EM) algorithm is presented for the maximum likelihood estimation of the model parameters in the presence of missing data. When applying robust PPCA for outlier detection, a contribution analysis method is proposed to identify which variables contribute the most to the occurrence of outliers, providing valuable information regarding the source of outlying data. The proposed technique is demonstrated on numerical examples and on an application to outlier detection and diagnosis in an industrial fermentation process.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction

Principal component analysis (PCA) (Jolliffe, 2002) is a general multivariate statistical projection technique for dimension reduction, and it has seen a wide spectrum of applications in various areas, including exploratory data analysis, pattern recognition, and quality monitoring and control. The traditional approach to the implementation of PCA is based on the linear projection of the original data onto a space where the variance is maximized. Let {x_n, n = 1, ..., N} be the d-dimensional data set; the first step in PCA is to compute the d × d sample covariance matrix S. The eigenvectors w_j and eigenvalues λ_j of S are then calculated (j = 1, ..., d). By retaining the eigenvectors corresponding to the q largest eigenvalues, the q-dimensional PCA score vectors t_n are calculated as t_n = W^T(x_n − μ), where W = (w_1, ..., w_q) and μ is the mean of the data set. Therefore the original data can be represented as a linear combination of the scores plus a noise vector: x_n = W t_n + μ + e_n.
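To make this construction concrete, here is a minimal Python sketch (our illustration, not taken from the paper; it assumes only NumPy) that computes the loadings W and the scores t_n by eigen-decomposition of the sample covariance matrix:

```python
import numpy as np

def classical_pca(X, q):
    """Classical PCA via eigen-decomposition of the sample covariance.

    X : (N, d) data matrix; q : number of retained components.
    Returns the (d, q) loading matrix W and the (N, q) scores T.
    """
    mu = X.mean(axis=0)                     # sample mean
    S = np.cov(X, rowvar=False)             # d x d sample covariance
    eigval, eigvec = np.linalg.eigh(S)      # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:q]    # indices of the q largest
    W = eigvec[:, order]                    # loadings w_1, ..., w_q
    T = (X - mu) @ W                        # scores t_n = W^T (x_n - mu)
    return W, T

# Example: 100 samples in 5 dimensions, reduced to q = 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W, T = classical_pca(X, q=2)
```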
More recently, Tipping and Bishop (1999b) proposed a probabilistic formulation of PCA (PPCA) from the perspective of a Gaussian latent variable model. The PPCA model is realized by specifying a Gaussian noise model e ~ G(0, σ²I), which implies that the conditional distribution of the data given the PCA scores is x | t ~ G(Wt + μ, σ²I). By adopting a Gaussian prior distribution for the PCA score vector, t ~ G(0, I), the marginal distribution of the data x is also shown to be Gaussian: x ~ G(μ, WW^T + σ²I). PPCA degenerates into traditional PCA as σ² → 0. Within the PPCA framework, the principal components are essentially the maximum likelihood estimates of the model parameters, which can be implemented using
* Corresponding author. Tel.: +65 6513 8267; fax: +65 6794 7553.
doi:10.1016/j.csda.2009.03.014
either the eigen-decomposition of the sample covariance matrix (as in traditional PCA), or an expectation-maximization (EM) algorithm (Dempster et al., 1977). Tipping and Bishop (1999b) argued that PPCA attains several advantages over traditional PCA, including its extendability to handling missing data and to forming a mixture model (Tipping and Bishop, 1999a), and its potential application in probability density estimation and multivariate statistical process monitoring (Chen and Sun, 2009; Kim and Lee, 2003).
It is a well-known issue that conventional PCA is sensitive to anomalous observations, because the calculation of the sample mean and covariance matrix can be significantly influenced by a small number of outliers. Similarly, PPCA is not robust to outliers, since the data are assumed to follow a multivariate Gaussian distribution that is easily affected by deviant observations. There is a rich literature on robust PCA methods that obtain principal components insensitive to outliers. The first category of methods is based on a robust estimate of the covariance matrix (Cambell, 1980; Devlin et al., 1981; Huber, 1981; Ruymagaart, 1981). The idea is to give different weights to the observations, where the weight is a function of the Mahalanobis distance. Observations with large Mahalanobis distance are automatically down-weighted, since they tend to be outliers. The major difficulty with these weighting methods is their high computational cost: computing the covariance matrix requires O(Nd²) operations, which is infeasible for high-dimensional data. Recently, more robust estimates of the covariance structure have been proposed, including a positive-breakdown estimator (Croux and Haesbroeck, 2000) and an estimator based on a convex loss function (Ibazizen and Dauxois, 2003). Nevertheless, these methods are still limited to moderate dimensions due to their computational cost.
An alternative approach to robust PCA is based on projection pursuit (Li and Chen, 1985; Hubert et al., 2002). When dealing with high-dimensional data, projection pursuit seeks low-dimensional projections that maximize a robust measure of spread. By obtaining the projections sequentially, projection pursuit is computationally efficient, particularly when q ≪ d. More recently, Hubert et al. (2005) proposed a robust PCA method that aims to combine the advantages of robust covariance estimation and projection pursuit. It should be noted that all the robust PCA methods reviewed above are based on conventional PCA, and thus they do not fall into the family of probability models.
This paper discusses a robust PCA method within a probabilistic framework, based on replacing the Gaussian distribution utilized in the original PPCA by a heavy-tailed t-distribution that is robust to the presence of outliers. The idea of using the t-distribution was originally proposed by Archambeau et al. (2006) for both robust PPCA and robust probabilistic canonical correlation analysis (PCCA), and was later extended to developing robust latent variable regression models (Fang and Jeong, 2008). This paper will initially review the rationale of the multivariate t-distribution in Section 2, and the formulation of robust PPCA in Section 3. In Section 4 we address the issue of missing data, which is common in many data analysis tasks due to sensor faults or human error, within the framework of robust PPCA. An EM algorithm is developed for the estimation of the model parameters in the presence of missing data. Section 5 gives two numerical examples to illustrate the effectiveness of the proposed robust PPCA model. Subsequently we demonstrate the application of robust PPCA to outlier detection in Section 6. Specifically, we present a contribution analysis method, previously proposed for multivariate statistical process control using non-robust PPCA and mixture models (Chen and Sun, 2009), to identify which variables contribute the most to the occurrence of outliers and to provide useful information regarding the source of outlying data. Finally, Section 7 concludes the paper.
2. Multivariate t-distribution

Assume that the random variable x follows a multivariate Gaussian distribution, x ~ G(m, Σ). In the presence of outliers, a two-component Gaussian mixture model can be employed to account for the relatively large variance of the outliers:

x ~ (1 − ε) G(m, Σ) + ε G(m, bΣ)    (1)

where b is a large positive factor, and ε ∈ [0, 1] is a small value reflecting the prior knowledge that a small portion of the data may be outliers. This two-component model has seen various applications in outlier detection and measurement rectification (Chen et al., 2008; Schick and Mitter, 1994). The mixture model in Eq. (1) can be extended to an infinite Gaussian mixture model as (Peel and McLachlan, 2000):

x ~ ∫ G(m, bΣ) p(b) db    (2)

where the integration is performed over the scaling factor b. Suppose u = 1/b is a chi-square random variable with degrees of freedom v, scaled such that u ~ Ga(v/2, v/2), where the Gamma probability density function is given by Ga(α, β) = β^α u^(α−1) e^(−βu) / Γ(α). Then the marginal distribution of x can be obtained by performing the integration in (2), resulting in a multivariate t-distribution with degrees of freedom v (Lange et al., 1989; Liu, 1997; Peel and McLachlan, 2000): x ~ t_v(m, Σ). An alternative perspective on the t-distribution is to treat u as a latent variable; the conditional distribution of x | u is then Gaussian: x | u ~ G(m, Σ/u).
The Gaussian distribution is a special case of the t-distribution when v → ∞. In general, the t-distribution has significantly heavier tails than the Gaussian distribution, which is a desirable property for handling data sets in the presence of outliers.
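To illustrate this scale-mixture construction, the following Python sketch (ours, not the authors'; NumPy assumed) draws from t_v(m, Σ) by sampling u ~ Ga(v/2, v/2) and then x | u ~ G(m, Σ/u), and contrasts the tail behavior with a Gaussian of the same scale matrix:

```python
import numpy as np

def sample_multivariate_t(m, Sigma, v, n, rng):
    """Draw n samples from t_v(m, Sigma) via the Gaussian scale mixture:
    u ~ Ga(v/2, v/2) (rate parameterization), then x | u ~ G(m, Sigma/u)."""
    d = len(m)
    u = rng.gamma(shape=v / 2.0, scale=2.0 / v, size=n)  # NumPy scale = 1/rate
    z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    return m + z / np.sqrt(u)[:, None]                   # x = m + z / sqrt(u)

rng = np.random.default_rng(1)
m = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
x_t = sample_multivariate_t(m, Sigma, v=3.0, n=10000, rng=rng)
x_g = rng.multivariate_normal(m, Sigma, size=10000)
# The t-samples exhibit much larger extreme values (heavier tails):
print(np.abs(x_t).max(), np.abs(x_g).max())
```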
3. Robust PPCA

This section reviews the robust PPCA model originally proposed by Archambeau et al. (2006) and Fang and Jeong (2008), including the probability model and the detailed parameter estimation method using the EM algorithm.
3.1. The probability model

To account for the presence of outliers in the data set, the Gaussian distribution in PPCA is replaced by the t-distribution to achieve a robust model. Specifically, the conditional distribution of the data x given the PCA scores t is:

x | t, u ~ G(Wt + μ, σ²I/u)    (3)

where u ~ Ga(v/2, v/2). Thus x | t ~ t_v(Wt + μ, σ²I). If the prior of the scores is also a t-distribution, t | u ~ G(0, I/u), or equivalently t ~ t_v(0, I), the distribution of the data given u is:

x | u ~ ∫ p(x | t, u) p(t | u) dt = G(μ, C/u)    (4)

where C = WW^T + σ²I is a d × d matrix. The marginal distribution of the data x is then a multivariate t-distribution: x ~ t_v(μ, C).

For the estimation of the model parameters presented subsequently, the conditional distributions u | x and t | x, u are required. The former was shown (Lange et al., 1989) to be a Gamma distribution:

u | x ~ Ga((v + d)/2, (v + p)/2)    (5)

where p = (x − μ)^T C⁻¹ (x − μ). The conditional distribution of the scores can be calculated by using Bayes' rule, resulting in:

t | x, u ~ G(M⁻¹ W^T (x − μ), σ²M⁻¹/u)    (6)

where M = W^T W + σ²I is a q × q matrix. Similarly, t | x is also t-distributed: t | x ~ t_v(M⁻¹ W^T (x − μ), σ²M⁻¹). In summary, the Gaussian distributions used in PPCA have been replaced by t-distributions in the robust PPCA model. For the purpose of inference, it is necessary to invert the covariance matrix C. This can be performed efficiently by using the Woodbury matrix identity if q ≪ d (which is often the case if d is large):

C⁻¹ = (WW^T + σ²I)⁻¹ = I/σ² − W M⁻¹ W^T/σ².    (7)

The objective of dimension reduction, one of the major motivations behind PCA, can be achieved by utilizing the mean of the latent variables:

⟨t | x⟩ = M⁻¹ W^T (x − μ)    (8)

where ⟨·⟩ is the expectation operator. This is also the projection of the original d-dimensional data onto the (lower) q-dimensional PCA scores.
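Eqs. (7) and (8) translate directly into code. A small Python sketch (our illustration under the notation above; NumPy assumed) inverts only a q × q system rather than the full d × d covariance:

```python
import numpy as np

def project_scores(X, W, mu, sigma2):
    """Posterior mean projection <t|x> = M^{-1} W^T (x - mu), Eq. (8),
    with M = W^T W + sigma^2 I (q x q); no d x d inverse is formed."""
    q = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(q)
    return np.linalg.solve(M, W.T @ (X - mu).T).T   # (N, q) scores

def inv_C(W, sigma2):
    """Woodbury identity, Eq. (7): C^{-1} = I/sigma^2 - W M^{-1} W^T / sigma^2."""
    d, q = W.shape
    M = W.T @ W + sigma2 * np.eye(q)
    return np.eye(d) / sigma2 - W @ np.linalg.solve(M, W.T) / sigma2
```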
3.2. Maximum likelihood estimation

The model parameters to be estimated are {μ, W, σ², v}. Given a set of training data, the maximum likelihood estimation of these parameters can be achieved by using the expectation-maximization (EM) algorithm (Dempster et al., 1977). To apply the EM algorithm, the latent variables {t_n, u_n} are treated as ''missing data'', and the ''complete'' data comprise the latent variables and the observed data x_n. The corresponding complete-data log-likelihood is:

L_C = Σ_{n=1}^N ln p(x_n, t_n, u_n)    (9)

where the joint distribution can be factorized as:

p(x_n, t_n, u_n) = p(x_n | t_n, u_n) p(t_n | u_n) p(u_n).    (10)

In the E-step, the expectation of the complete-data log-likelihood with respect to the conditional distribution t_n, u_n | x_n is calculated as:

⟨L_C⟩ = Σ_{n=1}^N [ −(d/2) ln σ² − (⟨u_n⟩/(2σ²)) (x_n − μ)^T (x_n − μ) + (1/σ²) ⟨u_n t_n⟩^T W^T (x_n − μ) − (1/(2σ²)) tr(W^T W ⟨u_n t_n t_n^T⟩) + (v/2) log(v/2) + ((v/2) − 1) ⟨log u_n⟩ − log Γ(v/2) − (v/2) ⟨u_n⟩ ]    (11)
where the terms that are independent of the model parameters are omitted. The expectation terms in (11) are:

⟨u_n⟩ = (v + d)/(v + p_n)    (12)

⟨t_n⟩ = M⁻¹ W^T (x_n − μ)    (13)

⟨u_n t_n⟩ = ⟨u_n⟩ ⟨t_n⟩    (14)

⟨u_n t_n t_n^T⟩ = σ² M⁻¹ + ⟨u_n⟩ ⟨t_n⟩ ⟨t_n⟩^T    (15)

⟨log u_n⟩ = ψ((v + d)/2) − log((v + p_n)/2)    (16)

where p_n = (x_n − μ)^T C⁻¹ (x_n − μ), and ψ(·) is the digamma function.
The M-step maximizes ⟨L_C⟩ with respect to the model parameters, resulting in the update equations:

μ̂ = Σ_{n=1}^N ⟨u_n⟩ (x_n − W⟨t_n⟩) / Σ_{n=1}^N ⟨u_n⟩    (17)

Ŵ = [Σ_{n=1}^N (x_n − μ̂) ⟨u_n t_n⟩^T] [Σ_{n=1}^N ⟨u_n t_n t_n^T⟩]⁻¹    (18)

σ̂² = (1/(Nd)) Σ_{n=1}^N [ ⟨u_n⟩ ‖x_n − μ̂‖² − 2 ⟨u_n t_n⟩^T Ŵ^T (x_n − μ̂) + tr(Ŵ^T Ŵ ⟨u_n t_n t_n^T⟩) ]    (19)

and v̂ can be updated using a scalar nonlinear maximization routine that is available in most computational software packages. The EM algorithm does not explicitly calculate the sample covariance matrix, which would require O(Nd²) operations. An inspection of (18) and (19) reveals that the computational complexity is only O(Ndq). When q ≪ d, considerable computational cost is saved.
Furthermore, the EM algorithm can be simplified by using a two-stage procedure, where the PCA scores t_n are not considered in the first stage (Tipping and Bishop, 1999a). Hence the objective of the first stage is to estimate μ. More specifically, the complete-data log-likelihood in the first stage is:

L_C1 = Σ_{n=1}^N ln p(x_n, u_n) = Σ_{n=1}^N ln [p(x_n | u_n) p(u_n)]    (20)

where p(x_n | u_n) is given by (4). The expectation of L_C1 with respect to u_n | x_n, as given in (5), is:

⟨L_C1⟩ = −Σ_{n=1}^N ⟨u_n⟩ (x_n − μ)^T C⁻¹ (x_n − μ)    (21)

where terms independent of μ are omitted. Maximization of (21) with respect to μ gives:

μ̂ = Σ_{n=1}^N ⟨u_n⟩ x_n / Σ_{n=1}^N ⟨u_n⟩.    (22)
In the second stage, the latent variables t_n are introduced, and the log-likelihood in (20) is increased through the EM algorithm to update W, σ² and v. It should be noted that in the second stage ⟨L_C⟩ is not actually maximized, because μ̂ is kept fixed. In this sense, the second stage corresponds to the generalized EM algorithm (Gelman et al., 1995). This two-stage EM algorithm leads to improved convergence speed (Tipping and Bishop, 1999a), since the expectation terms in (12)–(15) are calculated using the updated mean μ̂ when updating W, σ² and v.
In summary, the two-stage EM algorithm operates as follows.

(1) Stage 1:
- E-step: Given the current parameters {μ, W, σ², v}, calculate the expected value ⟨u_n⟩ as in (12).
- M-step: Update μ̂ as in (22).
(2) Stage 2:
- E-step: Given the current parameters {μ̂, W, σ², v}, re-calculate the expected values as in (12)–(16).
- M-step: Update Ŵ and σ̂² as in (18) and (19), followed by the update of v̂.
(3) Repeat Stages 1 and 2 until convergence is reached.
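A minimal Python implementation of this two-stage procedure for fully observed data is sketched below (our illustration of Section 3.2, not the authors' code; the random initialization, the fixed iteration budget, and the bounded one-dimensional search for v are our own choices, and NumPy and SciPy are assumed):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import digamma, gammaln

def robust_ppca(X, q, n_iter=100, v_init=5.0, seed=0):
    """Two-stage EM for robust PPCA on fully observed data (Section 3.2)."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    mu, W, sigma2, v = X.mean(axis=0), rng.normal(size=(d, q)), 1.0, v_init

    def woodbury_inv(W, sigma2):                      # Eq. (7)
        M = W.T @ W + sigma2 * np.eye(q)
        return np.eye(d) / sigma2 - W @ np.linalg.solve(M, W.T) / sigma2

    for _ in range(n_iter):
        # Stage 1: E-step for <u_n>, Eq. (12), then update mu, Eq. (22)
        dX = X - mu
        p = np.einsum('nd,de,ne->n', dX, woodbury_inv(W, sigma2), dX)
        u = (v + d) / (v + p)
        mu = (u[:, None] * X).sum(axis=0) / u.sum()
        # Stage 2: E-step with the updated mean, Eqs. (12)-(16)
        dX = X - mu
        p = np.einsum('nd,de,ne->n', dX, woodbury_inv(W, sigma2), dX)
        u = (v + d) / (v + p)
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(q))
        t = dX @ W @ Minv                             # <t_n>, Eq. (13)
        ut = u[:, None] * t                           # <u_n t_n>, Eq. (14)
        S_utt = N * sigma2 * Minv + ut.T @ t          # sum over n of Eq. (15)
        # M-step: Eqs. (18) and (19), then a bounded 1-D search for v
        W_new = (dX.T @ ut) @ np.linalg.inv(S_utt)
        sigma2 = (np.sum(u * np.sum(dX ** 2, axis=1))
                  - 2.0 * np.sum(ut * (dX @ W_new))
                  + np.trace(W_new.T @ W_new @ S_utt)) / (N * d)
        W = W_new
        elog_u = digamma((v + d) / 2.0) - np.log((v + p) / 2.0)   # Eq. (16)
        def neg_ll(vv):                               # v-terms of Eq. (11)
            return -(N * (0.5 * vv * np.log(0.5 * vv) - gammaln(0.5 * vv))
                     + (0.5 * vv - 1.0) * elog_u.sum() - 0.5 * vv * u.sum())
        v = minimize_scalar(neg_ll, bounds=(0.1, 200.0), method='bounded').x
    return mu, W, sigma2, v
```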
3.3. Post-processing of W

In general, the loading matrix W at convergence is not necessarily orthogonal (Tipping and Bishop, 1999b), and a rotation of W by an arbitrary q × q orthogonal matrix R, i.e. WR, is still a maximum likelihood estimate of the robust PPCA model. The rotational ambiguity can be resolved, if necessary, by computing the eigen-decomposition W^T W = R Λ R^T, where Λ = diag(λ_1, ..., λ_q), and rotating W according to WR. Based on this eigen-decomposition, the percentage of variance explained by the PCA model, a widely used measure to assess the effectiveness of PCA, can be calculated as Σ_{j=1}^q λ_j / (Σ_{j=1}^q λ_j + dσ²).
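A short sketch of this post-processing step (our illustration; NumPy assumed), returning the rotated loadings, the eigenvalues, and the explained-variance fraction, the latter with the total variance taken as the trace of C = WW^T + σ²I:

```python
import numpy as np

def orthogonalize_W(W, sigma2):
    """Resolve the rotational ambiguity: eigen-decompose W^T W = R L R^T
    and rotate W by R, so the columns of W R are mutually orthogonal."""
    lam, R = np.linalg.eigh(W.T @ W)        # ascending eigenvalues
    order = np.argsort(lam)[::-1]           # sort descending
    lam, R = lam[order], R[:, order]
    W_rot = W @ R
    # Explained variance: sum(lam) over trace(C) = sum(lam) + d * sigma^2
    explained = lam.sum() / (lam.sum() + W.shape[0] * sigma2)
    return W_rot, lam, explained
```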
4. Missing data

The issue of missing data refers to situations where the d-dimensional data x has some missing values. By assuming that the missing-data mechanism does not depend on the missing values, i.e. missing at random, the conditional distribution of the missing values given the observed data can be formulated, and it forms the basis of the EM algorithm for parameter estimation.
4.1. Conditional distribution of missing data

The data can be divided as x^T = [x_o^T, x_u^T], where x_o and x_u are the sub-vectors of observed and unobserved (missing) data respectively. According to the robust PPCA model, x | u ~ G(μ, C/u). For ease of derivation, the mean and covariance are also organized into blocks:

μ = [μ_o; μ_u],  C = [C_oo, C_ou; C_uo, C_uu].

The conditional distribution of the missing data given the observed data, x_u | x_o, u, is also Gaussian (Little and Rubin, 1987), with mean μ_u + C_uo C_oo⁻¹ (x_o − μ_o) and covariance (C_uu − C_uo C_oo⁻¹ C_ou)/u. For the development of the maximum likelihood estimation, it is convenient to utilize the distribution of the complete vector x, which is again Gaussian, x | x_o, u ~ G(z, Q/u), or equivalently x | x_o ~ t_v(z, Q), where

z = [x_o; μ_u + C_uo C_oo⁻¹ (x_o − μ_o)],  Q = [0, 0; 0, C_uu − C_uo C_oo⁻¹ C_ou].    (23)
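The block computations of Eq. (23) translate directly into code; the sketch below (ours; NumPy assumed, with NaN marking missing entries) returns z and Q for a single partially observed vector:

```python
import numpy as np

def conditional_moments(x, obs_mask, mu, C):
    """Mean z and covariance factor Q of x | x_o, u ~ G(z, Q/u), Eq. (23).

    x        : length-d vector with np.nan at the missing positions
    obs_mask : boolean length-d vector, True where x is observed
    """
    o, m = obs_mask, ~obs_mask
    Coo = C[np.ix_(o, o)]
    Cuo = C[np.ix_(m, o)]
    Cuu = C[np.ix_(m, m)]
    z = x.copy()
    # Conditional mean of the missing block: mu_u + C_uo C_oo^{-1} (x_o - mu_o)
    z[m] = mu[m] + Cuo @ np.linalg.solve(Coo, x[o] - mu[o])
    # Q is zero except for the missing block: C_uu - C_uo C_oo^{-1} C_ou
    Q = np.zeros_like(C)
    Q[np.ix_(m, m)] = Cuu - Cuo @ np.linalg.solve(Coo, Cuo.T)
    return z, Q
```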
4.2. Maximum likelihood estimation

The EM algorithm for the maximum likelihood estimation of the model parameters is similar to that presented in Section 3.2, the difference being the handling of the missing data represented by the Gaussian distribution x | u ~ G(z, Q/u). In the first stage of the EM algorithm, the PCA scores are not considered, and thus the expectation in the E-step is:

⟨L_C1⟩ = −Σ_{n=1}^N tr[⟨u_n (x_n − μ)(x_n − μ)^T⟩ C⁻¹].    (24)

Due to the presence of missing data, the expectation in (24) is taken with respect to x_n, u_n | x_n^o, as opposed to u_n | x_n in (21). p(x_n, u_n | x_n^o) further factorizes as:

p(x_n, u_n | x_n^o) = p(x_n | u_n, x_n^o) p(u_n | x_n^o)    (25)

where the first term is formulated in (23) as a Gaussian distribution, G(z_n, Q_n/u_n), and the second term is a Gamma distribution:

u_n | x_n^o ~ Ga((v + d_n^o)/2, (v + p_n^o)/2)    (26)

where d_n^o is the dimension of the observed data x_n^o, and p_n^o = (x_n^o − μ_o)^T C_oo⁻¹ (x_n^o − μ_o). Therefore the expectations can be obtained as:

⟨u_n⟩ = (v + d_n^o)/(v + p_n^o)    (27)

⟨u_n (x_n − μ)(x_n − μ)^T⟩ = Q_n + ⟨u_n⟩ (z_n − μ)(z_n − μ)^T.    (28)

Substituting (28) into (24) and setting ∂⟨L_C1⟩/∂μ = 0 results in the updating formula for μ:

μ̂ = Σ_{n=1}^N ⟨u_n⟩ z_n / Σ_{n=1}^N ⟨u_n⟩.    (29)
Fig. 1. The first principal component (straight line) and the 2.58 standard deviation contour (ellipsoid) obtained using the PCA models. Original data (×); true results (—); PPCA (−·−·); robust PPCA (−−); robust PPCA with 20% missing values (···). The results of the latter two situations are very similar, indicating the small impact of the missing data on the robust PPCA model.
In the second stage of the EM algorithm, the PCA scores t_n are considered in order to estimate W, σ² and v. Now the expectation of the complete-data log-likelihood in (9) and (10) must be calculated with respect to p(x_n, t_n, u_n | x_n^o), which can be factorized as p(x_n, t_n, u_n | x_n^o) = p(t_n | x_n, u_n) p(x_n, u_n | x_n^o), where the two terms are given in (6) and (25) respectively. Therefore the expected log-likelihood can be expanded as:

⟨L_C⟩ = −Σ_{n=1}^N [ (d/2) ln σ² + (1/(2σ²)) tr⟨u_n (x_n − μ̂)(x_n − μ̂)^T⟩ − (1/σ²) tr(⟨u_n (x_n − μ̂) t_n^T⟩ W^T) + (1/(2σ²)) tr(W^T W ⟨u_n t_n t_n^T⟩) ]    (30)

and the expectation terms are:

⟨u_n (x_n − μ̂) t_n^T⟩ = ⟨u_n (x_n − μ̂)(x_n − μ̂)^T⟩ W M⁻¹    (31)

⟨u_n t_n t_n^T⟩ = σ² M⁻¹ + M⁻¹ W^T ⟨u_n (x_n − μ̂)(x_n − μ̂)^T⟩ W M⁻¹    (32)

where ⟨u_n (x_n − μ̂)(x_n − μ̂)^T⟩ is given in (28) with μ replaced by μ̂. Maximization of ⟨L_C⟩ with respect to W and σ² results in the following updating formulae:

Ŵ = [Σ_{n=1}^N ⟨u_n (x_n − μ̂) t_n^T⟩] [Σ_{n=1}^N ⟨u_n t_n t_n^T⟩]⁻¹    (33)

σ̂² = (1/(Nd)) Σ_{n=1}^N [ tr⟨u_n (x_n − μ̂)(x_n − μ̂)^T⟩ − 2 tr(⟨u_n (x_n − μ̂) t_n^T⟩ Ŵ^T) + tr(Ŵ^T Ŵ ⟨u_n t_n t_n^T⟩) ]    (34)

and v̂ can be updated using a nonlinear maximization algorithm. The two-stage EM algorithm operates by alternating the E-step and M-step until convergence, similar to the procedure summarized in Section 3.2.
5. Numerical examples

We first consider a simple numerical example comprising a data set of 90 random samples generated from a two-dimensional Gaussian distribution with zero mean and the following covariance matrix:

[1.0, 0.5; 0.5, 1.0].    (35)

Another 10 data points were generated from a uniform distribution over the range [−5, 5] to simulate the presence of outliers. These 100 data points were utilized for the development of the PPCA and robust PPCA models. In addition, a separate data set with missing values was simulated by randomly removing each value in the original data with probability 0.2.
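This simulation setup is straightforward to reproduce; a possible sketch in Python (the random seed and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(42)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])                     # Eq. (35)
X_good = rng.multivariate_normal(np.zeros(2), Sigma, size=90)
X_out = rng.uniform(-5.0, 5.0, size=(10, 2))       # 10 uniform outliers
X = np.vstack([X_good, X_out])                     # 100 points in total
# A copy with each value removed independently with probability 0.2
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan
```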
Fig. 1 illustrates the first principal component and the 2.58 standard deviation contour obtained using the PCA models. The ''true results'' were obtained by eigen-decomposition of the covariance matrix in Eq. (35), which was used to generate the data.
Fig. 2. Projections of the first 20 data points using robust PPCA (a) on the full data set and (b) with 20% missing values.
Table 1. CPU time (s) for the parameter estimation (per iteration) of the robust PPCA models.
It can be seen that PPCA is sensitive to outliers, with both the principal component and the contour being significantly different from the true ones. In contrast, robust PPCA is much less susceptible to outliers. Furthermore, the presence of a reasonable level of missing data appears to have only a small impact on the results of robust PPCA. Fig. 2(a) shows the projections of the first 20 data points using robust PPCA, whilst in Fig. 2(b) 20% of the values are missing. Despite some variation, the two plots are largely similar, indicating the effectiveness of the proposed approach to the handling of missing data.
To demonstrate the computational efficiency of the proposed method, we further consider five numerical data sets, each having 500 data points (N = 500), with varying dimensions d = (10, 50, 200, 500, 1000). For each d, the N × d data matrix X is formed such that each element is a random sample generated from a univariate Gaussian distribution with zero mean and unit standard deviation. Note that the PCA model is not appropriate for analyzing these data sets, since the variables are independent; in this study the data are used purely to illustrate computation time. In all cases we fix the number of principal components at five (q = 5).

Table 1 gives the CPU time (s) for each iteration of the EM algorithm for the parameter estimation. The algorithm was implemented within the Matlab environment under the Windows XP system, equipped with a Pentium 2.8 GHz CPU. The results show that the proposed method is reasonably efficient in computation; the CPU time at 1000 variables is still feasible from a practical perspective. Clearly, the total computation time also depends on the number of EM iterations required to converge. We have observed that, in all the cases presented in Table 1, the EM algorithm appears to converge within 10 iterations. If necessary, the algorithm can run significantly faster by compiling the Matlab script into a binary executable file.
6. Application to outlier detection and diagnosis

As a general multivariate statistical tool, PCA has been applied to many practical problems in science, engineering and econometrics. As an example, this section considers the application to outlier detection in an industrial fermentation process.

Since there is no formal definition of an ''outlier'', this paper relies on the informal and intuitive statement that outliers are observations that are in some way inconsistent with the remainder of a data set (Barnett and Lewis, 1994). A large number of univariate outlier detection approaches have been suggested in the literature; see Barnett and Lewis (1994) for a review. However, these approaches, if applied to multivariate problems by investigating one variable at a time, would ignore the covariance structure of the data. Therefore multivariate techniques must be employed, and PCA has been one of the most accepted methods for outlier detection (Daszykowski et al., 2007; Jolliffe, 2002).
The procedure of using PCA for outlier detection is closely related to that of multivariate statistical process control (MSPC) (Qin, 2003), which monitors the performance of a manufacturing process to ensure process safety and the delivery of consistent product. Both methods require the modeling of the data using PCA (MSPC further requires the data to be collected under normal operating conditions), followed by the development of a confidence bound. If the confidence bound is exceeded, the occurrence of an outlier (or abnormal behavior in MSPC) is detected. Furthermore, for diagnosis purposes, a contribution analysis procedure can be applied to identify which variables contribute the most to the occurrence of the outlier or the abnormal process behavior. These issues are discussed in more detail subsequently.
6.1. Confidence bound

One of the advantages of a probabilistic PCA model over conventional PCA for outlier detection (and MSPC) is that it provides a single likelihood-based confidence bound to detect outliers, as opposed to confidence bounds for two metrics, i.e. Hotelling's T² and the squared prediction error (SPE). In the PCA literature, outliers are classified into two categories: leverage and orthogonal. Leverage outliers are far from the data majority in the score space (i.e. large T²), whilst orthogonal outliers have large residuals from the PCA model (i.e. large SPE) (Daszykowski et al., 2007). In the community of process fault detection and diagnosis based on conventional PCA, the combination of the T² and SPE metrics has attracted significant attention (Yue and Qin, 2001); however, this is not an issue with probabilistic models. The likelihood value is a sufficient metric to measure how far one observation lies from the majority of the data represented by the probabilistic model, and therefore the distinction between leverage and orthogonal outliers is unnecessary in the context of probabilistic models. In practice, a single monitoring metric reduces the workload of data analysts and plant operators, as they are only exposed to one monitoring chart. This is crucial for the wider acceptance of outlier detection and MSPC in practice (Chen et al., 2006).
On the basis of the probability distribution p(x), the 100α% confidence bound can be defined as a likelihood threshold h that satisfies the integral (Chen et al., 2006):

∫_{x: p(x) > h} p(x) dx = α.    (36)

For the PPCA model, p(x) is a multivariate Gaussian distribution, and the confidence bound equivalent to (36) is based on the squared Mahalanobis distance M²: the data point x is considered an outlier if

M² = (x − μ)^T C⁻¹ (x − μ) > χ²_d(α)    (37)

where χ²_d(α) is the α-fractile of the chi-square distribution with degrees of freedom d. For the robust PPCA model, where p(x) is a multivariate t-distribution, M²/d has an F-distribution with d and v degrees of freedom (Kotz and Nadarajah, 2004). Therefore a data point is detected as an outlier if M²/d exceeds the α-fractile of the corresponding F-distribution. However, as a result of the heavy-tail characteristic of the t-distribution, this confidence bound was observed in extensive preliminary studies (not reported) to be larger than required, and thus fails to identify potential outliers. An alternative approach is to regard robust PPCA as a method to robustly estimate the PCA projections (and thus μ and C), and to perform the outlier detection based on the chi-square distribution as in (37). This method was used in Peel and McLachlan (2000) for noise detection.
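In code, the chi-square based detection rule amounts to a one-line comparison once M² has been computed; a minimal Python sketch (ours; NumPy and SciPy assumed), with μ and C taken from a robust PPCA fit:

```python
import numpy as np
from scipy import stats

def flag_outliers(X, mu, C, alpha=0.99):
    """Flag x as an outlier when M^2 = (x-mu)^T C^{-1} (x-mu) exceeds the
    alpha-fractile of chi^2_d, Eq. (37); mu and C come from the robust
    PPCA fit (C = W W^T + sigma^2 I)."""
    d = X.shape[1]
    dX = X - mu
    M2 = np.einsum('nd,de,ne->n', dX, np.linalg.inv(C), dX)
    return M2, M2 > stats.chi2.ppf(alpha, df=d)
```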
Note that the confidence bound based on the chi-square distribution is approximate, since the mean and covariance are not exactly known but estimated. Indeed, the determination of an appropriate threshold for the confidence bound has been a difficult task in the outlier detection literature. Wilks (1962) showed that if the mean and covariance are estimated using the standard method (i.e. μ = Σ_{n=1}^N x_n / N and C = Σ_{n=1}^N (x_n − μ)(x_n − μ)^T / (N − 1)), then the squared Mahalanobis distance has a Beta distribution. However, the confidence bound derived from the Beta distribution is also approximate, as the mean and covariance are not estimated from these standard equations in robust methods. Hardin and Rocke (2005) developed an improved F approximation to the distribution of M² for the robust minimum covariance determinant (MCD) estimator (Rousseeuw and van Driessen, 1999); how this improved approach can be adapted to the proposed robust PPCA method is an interesting topic to explore in the future.
In the presence of missing data, M² can be calculated as the expected value with respect to the conditional distribution of the missing data (Section 4.1):

E[M²] = tr{C⁻¹ [(z − μ)(z − μ)^T + Q]}.    (38)
6.2. Contribution analysis

The objective of contribution analysis is to identify which variables contribute the most to the occurrence of outliers. In general, contribution analysis may not explicitly reveal the root cause of the presence of outliers, but it is undoubtedly helpful in pointing out the inconsistent variables that may undergo further diagnosis procedures.

The traditional contribution analysis (Miller et al., 1998) decomposes Hotelling's T² and the SPE obtained from the PCA model into the sum of d contributing terms, each corresponding to a variable. A similar method was developed for the PPCA model (Kim and Lee, 2003). These techniques are limited by the requirement of investigating the contribution to two metrics; it is not clear how to resolve the conflict if the two contribution analysis procedures reach different conclusions about the responsible variables. Alternatively, reconstruction-based methods have been proposed (Dunia et al., 1996; Yue and Qin, 2001), where each variable is treated in turn as if it were missing and is reconstructed, and the variables corresponding to the largest reconstruction errors are considered to contribute the most to the occurrence of the outlier.
Table 2. Variables that were used for the analysis of the fermentation process.
Motivated by the reconstruction-based techniques, this paper adopts a missing-variable based contribution analysis method, originally proposed for MSPC using non-robust PPCA and mixture models (Chen and Sun, 2009), for the robust PPCA model. Assume that x is identified as a candidate outlier. For the j-th variable of x (j = 1, ..., d), the proposed method operates as follows:

(1) Let x_j be the candidate outlier with the j-th variable missing. Calculate the conditional distribution of x_j given the observed variables (Section 4.1).
(2) Calculate E[M²] as in (38).
(3) Quantify the contribution of the j-th variable as M² − E[M²], i.e. the decrease in the monitoring metric if the variable is eliminated.

Furthermore, in step (3), if E[M²] is smaller than the confidence bound χ²_d(α), the corresponding variable can be regarded as significantly influential, since its elimination would bring the data point back to normal.
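A compact Python sketch of this procedure (our illustration of the steps above; NumPy and SciPy assumed, again with μ and C taken from a robust PPCA fit):

```python
import numpy as np
from scipy import stats

def contribution_analysis(x, mu, C):
    """Missing-variable contribution analysis (Section 6.2): treat each
    variable j of a detected outlier x in turn as missing, compute
    E[M^2] = tr(C^{-1}[(z-mu)(z-mu)^T + Q]), Eq. (38), and report the
    decrease M^2 - E[M^2] as the contribution of variable j."""
    d = len(x)
    Cinv = np.linalg.inv(C)
    M2 = (x - mu) @ Cinv @ (x - mu)
    contrib = np.zeros(d)
    for j in range(d):
        o = np.ones(d, dtype=bool)
        o[j] = False                                   # variable j "missing"
        Coo = C[np.ix_(o, o)]
        cj = C[j, o]                                   # row block C_uo
        z = x.copy()
        z[j] = mu[j] + cj @ np.linalg.solve(Coo, x[o] - mu[o])
        Q = np.zeros((d, d))
        Q[j, j] = C[j, j] - cj @ np.linalg.solve(Coo, cj)
        EM2 = np.trace(Cinv @ (np.outer(z - mu, z - mu) + Q))
        contrib[j] = M2 - EM2                          # decrease in the metric
    return M2, contrib

# Per step (3): a variable j is significantly influential when
# M2 - contrib[j] (= E[M^2]) falls below stats.chi2.ppf(alpha, df=len(x)).
```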
6.3. Application to a batch fermentation process

In the beer production industry, the fermentation process has the greatest influence on product quality and production time variability. The process under investigation is operated in batch mode, where the wort from previous operations is fed into the fermenter. Yeast is then added to metabolize sugars and amino acids, and alcohol is produced. As the sugars are used up the fermentation slows down, and a cooling operation can stop the process at the desired gravity (density of beer) and/or diacetyl concentration.

Since it is difficult to produce beer consistently at industrial scale, quality control in the beer industry relies significantly on the consistency of raw materials and process conditions. In this study, data were collected from an industrial fermenter comprising 100 batches (Basabe, 2004). Each batch is treated as one data point consisting of 9 variables (Table 2), including initial conditions and process parameters. Instead of including the temperature trajectory throughout each batch, only the initial and mean temperatures were used for the analysis; this is because the process temperature was manually controlled, and its large variation during the batch is not directly related to product quality (Basabe, 2004). Of all the 900 values (100 batches × 9 variables), 59 (6.56%) are missing. The data are preprocessed to zero mean and unit standard deviation on each dimension.
Both PPCA and robust PPCA were performed on the data, and the dimension of the problem was reduced to 5 principal components, which explained 76.4% (PPCA) and 77.9% (robust PPCA) of the total variance. Fig. 3(a) gives the outlier detection chart for robust PPCA. Due to the heavy-tail effect of the t-distribution, the F-distribution based confidence bound is significantly larger than the bound based on the chi-square distribution, and the F-distribution identifies only one outlier. A post-inspection revealed that the F-distribution missed a number of batches that are clearly regarded as outliers. Therefore the chi-square distribution based confidence bound is recommended and is used subsequently.

Fig. 3 shows that PPCA detects 6 outliers (batches 12, 14, 39, 48, 66 and 92), whilst robust PPCA identifies an additional three (batches 1, 6 and 46). Batch 14 is the most obvious outlier, and the contribution analysis in Fig. 4(d) clearly indicates that the first variable (dissolved oxygen) is responsible for the batch being detected as outlying. Variable 1 in batch 14 has a value of 99 ppm, which appears to be the result of a measurement error rather than an abnormal process condition, since a dissolved oxygen concentration of 99 ppm is not physically possible in this fermentation process. Fig. 4(a) shows that batch 1 is also detected as outlying by robust PPCA due to the first variable, whose value of 40 ppm is less extreme than that of batch 14. However, due to the dominant influence of batch 14, PPCA significantly over-estimates the variance of the first variable, and it fails to identify batch 1 as an outlier. In general, PPCA tends to be sensitive to a small number of influential data points; robust PPCA is therefore more appropriate for modeling data where outliers are present.

Fig. 4 also depicts the contribution analysis for batches 6 and 12. Clearly, the contribution plots provide important information for diagnosing the source of outliers.
Fig. 3. Outlier detection using (a) robust PPCA and (b) PPCA, with the 99% confidence bounds based on the chi-square distribution (—) and the F-distribution (−·−·). The batch numbers of the detected outliers are labeled.

Fig. 4. Contribution analysis (M² − E[M²] for each variable) for robust PPCA, with the 99% confidence bound (—) calculated as M² − χ²_d(0.99). (a) Batch 1; (b) Batch 6; (c) Batch 12; (d) Batch 14.

7. Conclusions and discussions

This paper presents a robust PCA model within a probabilistic framework, with specific focus on the handling of missing data and on applications to outlier detection and diagnosis. The idea is to replace the Gaussian distribution utilized by probabilistic PCA with the heavy-tailed and more robust multivariate t-distribution. An EM algorithm is implemented for the maximum likelihood parameter estimation and the handling of missing data. The numerical examples have shown that the presence of a reasonable level of missing data has only a small impact on the results of robust PPCA. Furthermore, a contribution analysis method has been developed to help identify the influential variables that contribute to the occurrence of outliers, and it has been successfully applied to the analysis of data collected from an industrial fermentation process.

One limitation of the robust PPCA model is that it does not consider the situation where outliers are clustered and the overall distribution of the data is thus multi-modal. In general, clustered outliers are more difficult to detect, and the presented robust PPCA, which assumes a uni-modal distribution of the data, is not specifically designed to address this issue. A potential solution, suggested by Rocke and Woodruff (2001), is to utilize cluster analysis to first identify the clusters, and then develop a metric from the largest identified cluster(s) for outlier detection. Following this idea, robust PPCA can be extended to a mixture model, similar to the mixture of (non-robust) PPCA models (Tipping and Bishop, 1999a). Alternatively, the methodology of ''forward search'' (Atkinson et al., 2004) can be adopted to incrementally include the data in the analysis, so that both isolated and clustered outliers can be identified sequentially. These methods are currently under investigation.

References
Archambeau, C., Delannay, N., Verleysen, M., 2006. Robust probabilistic projection. In: Proc. 23rd International Conference on Machine Learning, Pittsburgh, USA.
Atkinson, A.C., Riani, M., Cerioli, A., 2004. Exploring Multivariate Data with the Forward Search. Springer-Verlag, New York.
Barnett, V., Lewis, T., 1994. Outliers in Statistical Data, 3rd edition. John Wiley, New York.
Basabe, X.L., 2004. Towards improved fermentation consistency using multivariate analysis of process data. Master's Thesis, University of Newcastle upon Tyne, UK.
Cambell, N.A., 1980. Robust procedures in multivariate analysis. Applied Statistics 29, 231–237.
Chen, T., Morris, J., Martin, E., 2006. Probability density estimation via an infinite Gaussian mixture model: Application to statistical process monitoring. Journal of the Royal Statistical Society C (Applied Statistics) 55, 699–715.
Chen, T., Morris, J., Martin, E., 2008. Dynamic data rectification using particle filters. Computers and Chemical Engineering 32, 451–462.
Chen, T., Sun, Y., 2009. Probabilistic contribution analysis for statistical process monitoring: A missing variable approach. Control Engineering Practice 17, 469–477.
Croux, C., Haesbroeck, G., 2000. Principal components analysis based on robust estimators of the covariance or correlation matrix: Influence functions and efficiencies. Biometrika 87, 603–618.
Daszykowski, M., Kaczmarek, K., Heyden, Y.V., Walczak, B., 2007. Robust statistics in data analysis — A review: Basic concepts. Chemometrics and Intelligent Laboratory Systems 85, 203–219.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1–38.
Devlin, S.J., Gnanadesikan, R., Kettenring, J.R., 1981. Robust estimation of dispersion matrices and principal components. Journal of the American Statistical Association 12, 136–154.
Dunia, R., Qin, S., Edgar, T., McAvoy, T., 1996. Identification of faulty sensors using PCA. AIChE Journal 42, 2797–2812.
Fang, Y., Jeong, M.K., 2008. Robust probabilistic multivariate calibration model. Technometrics 50, 305–316.
Gelman, A.B., Carlin, J.S., Stern, H.S., Rubin, D.B., 1995. Bayesian Data Analysis. Chapman & Hall/CRC.
Hardin, J., Rocke, D.M., 2005. The distribution of robust distances. Journal of Computational and Graphical Statistics 14, 910–927.
Huber, P.J., 1981. Robust Statistics. Wiley, New York.
Hubert, M., Rousseeuw, P.J., Branden, K.V., 2005. ROBPCA: A new approach to robust principal component analysis. Technometrics 47, 64–79.
Hubert, M., Rousseeuw, P.J., Verboven, S., 2002. A fast method for robust principal components with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems 60, 101–111.
Ibazizen, M., Dauxois, J., 2003. A robust principal component analysis. Statistics 37, 73–83.
Jolliffe, I.T., 2002. Principal Component Analysis, 2nd edition. Springer.
Kim, D., Lee, I.-B., 2003. Process monitoring based on probabilistic PCA. Chemometrics and Intelligent Laboratory Systems 67, 109–123.
Kotz, S., Nadarajah, S., 2004. Multivariate t Distributions and Their Applications. Cambridge University Press.
Lange, K.L., Little, R.J.A., Taylor, J.M.G., 1989. Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84, 881–896.
Li, G., Chen, Z., 1985. Projection-pursuit approach to robust dispersion matrices and principal components: Primary theory and Monte Carlo. Journal of the American Statistical Association 80, 759–766.
Little, R.J.A., Rubin, D.B., 1987. Statistical Analysis with Missing Data. Wiley, Chichester.
Liu, C., 1997. ML estimation of the multivariate t distribution and the EM algorithm. Journal of Multivariate Analysis 63, 296–312.
Miller, P., Swanson, R.E., Heckler, C.F., 1998. Contribution plots: A missing link in multivariate quality control. International Journal of Applied Mathematics and Computer Science 8, 775–792.
Peel, D., McLachlan, G.J., 2000. Robust mixture modelling using the t distribution. Statistics and Computing 10, 339–348.
Qin, S.J., 2003. Statistical process monitoring: Basics and beyond. Journal of Chemometrics 17, 480–502.
Rocke, D.M., Woodruff, D.L., 2001. Multivariate outlier detection and robust covariance matrix estimation — Discussion. Technometrics 43, 300–303.
Rousseeuw, P.J., van Driessen, K., 1999. A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223.
Ruymagaart, F.H., 1981. A robust principal component analysis. Journal of Multivariate Analysis 11, 485–497.
Schick, I.C., Mitter, S.K., 1994. Robust recursive estimation in the presence of heavy-tailed observation noise. Annals of Statistics 22, 1045–1080.
Tipping, M.E., Bishop, C.M., 1999a. Mixtures of probabilistic principal component analysers. Neural Computation 11, 443–482.
Tipping, M.E., Bishop, C.M., 1999b. Probabilistic principal component analysis. Journal of the Royal Statistical Society B 61, 611–622.
Wilks, S., 1962. Mathematical Statistics. Wiley, New York.
Yue, H., Qin, S., 2001. Reconstruction based fault identification using a combined index. Industrial and Engineering Chemistry Research 40, 4403–4414.