Discovering the Hidden Structure of Complex Dynamic Systems


Xavier Boyen
Computer Science Dept.
Stanford University
Stanford, CA 94305-9010
xb@cs.stanford.edu

Nir Friedman
Inst. of Computer Science
Hebrew University
Jerusalem, 91904, Israel
nir@cs.huji.ac.il

Daphne Koller
Computer Science Dept.
Stanford University
Stanford, CA 94305-9010
koller@cs.stanford.edu

Abstract

Dynamic Bayesian networks provide a compact and natural representation for complex dynamic systems. However, in many cases, there is no expert available from whom a model can be elicited. Learning provides an alternative approach for constructing models of dynamic systems. In this paper, we address some of the crucial computational aspects of learning the structure of dynamic systems, particularly those where some relevant variables are partially observed or even entirely unknown. Our approach is based on the Structural Expectation Maximization (SEM) algorithm. The main computational cost of the SEM algorithm is the gathering of expected sufficient statistics. We propose a novel approximation scheme that allows these sufficient statistics to be computed efficiently. We also investigate the fundamental problem of discovering the existence of hidden variables without exhaustive and expensive search. Our approach is based on the observation that, in dynamic systems, ignoring a hidden variable typically results in a violation of the Markov property. Thus, our algorithm searches for such violations in the data, and introduces hidden variables to explain them. We provide empirical results showing that the algorithm is able to learn the dynamics of complex systems in a computationally tractable way.

1 Introduction

Many real-world phenomena are naturally modeled as dynamic systems: the stock market, measurements of a patient's vital signs in an intensive care unit, vehicles on a freeway, etc. Knowledge of a system's dynamics is essential for many tasks, including prediction, monitoring, and the detection of anomalies.

Real-world complex systems are typically highly structured, consisting of several interacting entities. Dynamic Bayesian networks (DBNs) provide a representational language for modeling such structured systems. By representing a system's state via several variables, and describing the relations between them, a DBN can capture the underlying structure of the system dynamics, e.g., which stocks depend on which others and on what external variables. Compared to less structured representations such as hidden Markov models (HMMs), DBNs support effective approximate inference [Boyen and Koller 1998b] and feature more robust learning due to their reduced number of parameters. Recent work has made significant progress on the problem of learning Bayesian networks from data (see, for example, Heckerman [1999]). These ideas have recently been applied to the problem of learning DBNs in the presence of partially observable data [Friedman et al. 1998], using the Structural EM (SEM) algorithm [Friedman 1997]. The basic outline of SEM is as follows: Each iteration starts with the current candidate DBN. In the E-step, the missing data is completed using the expected value relative to the current DBN, and the completed data is used to gather expected sufficient statistics: the completed "counts" corresponding to various events. In the M-step, these statistics are then used to score a variety of new candidate structures, and the best-scoring candidate is selected. After one or more structural changes, the completed data and statistics are recomputed. The methods described by Friedman et al. suffer from two major shortcomings. First, their approach has significant computational cost when applied to complex processes such as stock market data. Second, as this algorithm does not change the variables in the network, hidden variables must be prespecified by the user. In this paper, we outline these limitations and suggest methods that address both issues.
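To make this control flow concrete, here is a minimal, runnable Python sketch of the SEM loop just outlined. All helpers are toy stand-ins invented for illustration (the real system uses DBN inference for the E-step and a decomposable score such as BIC), not the actual implementation:

```python
# Toy sketch of the Structural EM (SEM) loop; all helpers are stand-ins.

GOOD = {("A", "B'"), ("B", "B'")}  # edges the fake score happens to reward

def compute_ess(structure, data):
    # E-step stand-in: real SEM runs DBN inference over the data to obtain
    # expected sufficient statistics ("completed counts") for many events.
    return {"n": len(data)}

def score(structure, ess):
    # Decomposable score stand-in (e.g., expected BIC applied to the ESS).
    return sum(1.0 if e in GOOD else -1.0 for e in structure) * ess["n"]

def local_moves(structure):
    # Candidate structures reachable by adding or deleting one edge.
    pool = {("A", "A'"), ("A", "B'"), ("B", "A'"), ("B", "B'")}
    return [structure | {e} for e in pool - structure] + \
           [structure - {e} for e in structure]

def structural_em(structure, data, n_outer=3):
    for _ in range(n_outer):
        ess = compute_ess(structure, data)            # E-step
        while True:                                   # inner structure search
            best = max(local_moves(structure), key=lambda s: score(s, ess))
            if score(best, ess) <= score(structure, ess):
                break                                 # no improving local change
            structure = best                          # adopt best-scoring candidate
    return structure

print(structural_em(frozenset(), data=range(100)))   # converges to the two "good" edges
```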

The computational cost of the SEM approach is due to the complexity of the E-step. The E-step uses inference to complete the missing data; the inference process propagates messages which are explicit distributions over the set of state variables, so that their representation is exponential in the size of this set. In particular, if a join tree algorithm [Jensen et al. 1990; Shenoy and Shafer 1990] is used for inference, the tree essentially reduces to one huge clique for each pair of consecutive time slices. This exponential cost is prohibitive in all but the simplest DBNs. This problem also appears in the simpler case of learning DBN parameters given the structure. For that setting, Boyen and Koller [1998a] propose an approximate E-step algorithm. This algorithm propagates approximate messages represented as factorized products over independent clusters. This representation allows the propagation of messages from one time slice to the next using a join tree with much smaller cliques than in the exact method. They show that the error in the sufficient statistics resulting from the approximation is small, and that the influence on the progress of the learning algorithm is negligible. Even for small networks, order-of-magnitude speedups can easily be obtained. In this paper, we extend this technique to the problem of structure learning.

In parametric EM, as used in Boyen and Koller [1998a], each family in the DBN is guaranteed to be contained in some clique in the clique tree for the two consecutive time slices. Thus, sufficient statistics could easily be accumulated during the execution of the inference algorithm. In SEM, on the other hand, the results of the same inference process must be used for scoring a variety of different candidate structures. Each of these requires sufficient statistics for a different set of events. While inference algorithms can in principle be used to compute the probability of any event, this procedure is fairly expensive, especially for a large number of arbitrary events. In Section 4, we propose a new approximation that circumvents this bottleneck.

Roughly speaking, this approximation estimates the posterior probability of the event for which we want statistics by a product of small factors. As we show, this estimate is quite good, and can be computed efficiently as a by-product of the inference algorithm.

Our final contribution addresses a fundamental problem in learning models for dynamic systems. In order to learn models that are statistically robust and computationally tractable, we must often introduce hidden variables into our structure. These variables serve many roles: enabling the Markov property, capturing hidden influences of the observables, etc. It is possible, in theory, to discover hidden variables simply by introducing them into the model and hoping that the search algorithm learns their meaning automatically (as in Friedman et al. [1998]). We propose a technique that allows for a more targeted search for hidden variables. Our approach is based on the observation that, in DBNs, ignoring a hidden variable typically results in a violation of the Markov property. For example, in a DBN for tracking traffic, eliminating the velocity variable would result in a non-Markovian dependence of the current location on the location at the two previous time slices. Our algorithm exploits this property by explicitly searching for violations of the Markov property. For example, a significant correlation between a variable $A^{(t+1)}$ and some other earlier variable $B^{(t-d)}$ (given the parents of $A^{(t+1)}$) indicates that $A$ must have additional influences not present in our model. In the general case there will be an additional hidden process that influences both $A$ and $B$, and possibly other variables as well. We thus add an extra variable to our model to account for this hidden process. We believe that our algorithmic ideas combine to provide a viable solution to the problem of learning complex dynamic systems from data. We provide some empirical results on real and synthetic data that, while very preliminary, are quite promising in their ability to scale up to larger systems.

2 Preliminaries: Learning DBNs

2.1 Dynamic Bayesian Networks

A dynamic Bayesian network (DBN) is a model of the evolution of a stochastic system over time. We choose to model the system evolution using a sequence of discrete time points at regular intervals. At each point in time, the instantaneous state of the process is specified in terms of a set of attributes $X_1, \dots, X_n$. We use $X_i^{(t)}$ to denote the random variable corresponding to the attribute $X_i$ at time $t$. A DBN represents a distribution over (possibly infinite) trajectories of the system. In order to allow such a distribution to be represented, we usually make two assumptions. The Markov assumption states that the future is independent of the past given the present. More precisely, for all $t$, we have the conditional independence $I(\mathbf{X}^{(t+1)}; \{\mathbf{X}^{(0)}, \dots, \mathbf{X}^{(t-1)}\} \mid \mathbf{X}^{(t)})$. This assumption allows us to reduce the representation problem to that of specifying $P(\mathbf{X}^{(0)})$ and $P(\mathbf{X}^{(t+1)} \mid \mathbf{X}^{(t)})$ for all $t$. The second assumption is that the process is stationary (or time-invariant), i.e., that $P(\mathbf{X}^{(t+1)} \mid \mathbf{X}^{(t)})$ is the same for all $t$.

Given these assumptions, we can specify the probabilistic model of a DBN using two Bayesian network (BN) fragments. The first is a prior network $B_0$ that specifies a distribution over initial states $\mathbf{X}^{(0)}$. The second is a transition network $B_\to$, which represents the transition probability from states $\mathbf{X}^{(t)}$ to states $\mathbf{X}^{(t+1)}$. The transition network is a BN fragment over the nodes $\{X_1, \dots, X_n, X'_1, \dots, X'_n\}$. A node $X_i$ represents $X_i^{(t)}$ and $X'_i$ represents $X_i^{(t+1)}$. The nodes $X_i$ in the network are forced to be roots (i.e., have no parents), and are not associated with conditional probability distributions. We denote the parents of $X'_i$ in the graph by $\mathrm{Pa}(X'_i)$. Each node $X'_i$ is associated with a conditional probability distribution (CPD), which specifies $P(X'_i \mid \mathrm{Pa}(X'_i))$. The transition probability from one state to another, $P(\mathbf{x}' \mid \mathbf{x})$, is then defined to be $\prod_i P(x'_i \mid \mathbf{u}_i)$, where $\mathbf{u}_i$ is the value in $(\mathbf{x}, \mathbf{x}')$ of the variables in $\mathrm{Pa}(X'_i)$. A DBN defines a distribution over infinite trajectories of states. In practice, we reason only about a finite time interval $0, \dots, T$. To do this reasoning, we can notionally "unroll" the DBN structure into a long BN over $\mathbf{X}^{(0)}, \dots, \mathbf{X}^{(T)}$. In time slice 0, the parents of $X_i^{(0)}$ and its CPD are those specified in the prior network $B_0$; in slice $t+1$, the parents of $X_i^{(t+1)}$ and its CPD are those specified for $X'_i$ in $B_\to$. Thus, given a DBN $B = (B_0, B_\to)$, the joint distribution over $\mathbf{X}^{(0)}, \dots, \mathbf{X}^{(T)}$ is

$$P_B(\mathbf{x}^{(0)}, \dots, \mathbf{x}^{(T)}) = P_{B_0}(\mathbf{x}^{(0)}) \prod_{t=0}^{T-1} P_{B_\to}(\mathbf{X}' = \mathbf{x}^{(t+1)} \mid \mathbf{X} = \mathbf{x}^{(t)})$$

2.2 Learning DBNs: Complete Data

We now consider the task of learning a DBN from data. For simplicity of notation, we assume that our data $D$ is a single trajectory $\mathbf{d}^{(0)}, \dots, \mathbf{d}^{(T)}$ through the system; in this section, we assume that this trajectory is fully observable. Also for simplicity, we will ignore the task of learning the prior network $B_0$. Friedman et al. [1998] provide a detailed description of how BN learning can be applied to DBNs. We briefly highlight the most relevant parts of their analysis. Given a training sequence, the learning task is to find the network $B_\to$ that "best matches" $D$. The notion of best match is defined using a scoring function. Several different scoring functions have been proposed in the literature. The most frequently used are the Bayesian Information Criterion (BIC) [Schwarz 1978] and the Bayesian scores, such as the BDe score of Heckerman et al. [1995]. Both of these scores combine "fit to data" with some penalty relating to the complexity of the network. For ease of presentation, we will focus on the BIC score. In both these scores, the term that represents "fit to data" is the log-likelihood function, defined as $\ell(B_\to : D) = \log P(D \mid B_\to)$. This function measures the extent to which the data is likely given a candidate model $B_\to$; it is thus an estimate of how well a given candidate fits the empirical data. The log-likelihood depends on the sufficient statistics that summarize the frequencies of the relevant events in the data. For any event $\mathbf{y}$ over $\mathbf{X}, \mathbf{X}'$, we define

$$N_{\mathbf{y}} = \sum_t \mathbf{1}(\mathbf{y}^{(t)} \mid D)$$

where $\mathbf{1}(\mathbf{y}^{(t)} \mid D)$ is an indicator function which takes on value 1 if the event $\mathbf{y}$ over $\mathbf{X}, \mathbf{X}'$ holds for the instantiation $\mathbf{X} = \mathbf{d}^{(t)}$ and $\mathbf{X}' = \mathbf{d}^{(t+1)}$, and 0 otherwise. The log-likelihood can now be described as:

$$\ell(B_\to : D) = \sum_i \sum_{x'_i \in \mathrm{Val}(X'_i)} \sum_{\mathbf{u} \in \mathrm{Val}(\mathrm{Pa}(X'_i))} N_{x'_i, \mathbf{u}} \log \theta_{x'_i \mid \mathbf{u}} \qquad (1)$$

The BIC score is simply the log-likelihood plus a penalty term for network complexity:

$$\mathrm{Score}_{BIC}(B_\to) = \ell(B_\to : D) - \frac{\log T}{2} \mathrm{Dim}(G_\to)$$

where $\mathrm{Dim}(G_\to)$ is the dimension of $G_\to$, which in the case of complete data is simply the number of parameters. Our goal is to find the network that maximizes this score. For a fixed structure, the parameters that maximize the score are exactly the maximum likelihood parameters, which simply mirror the frequencies in the data:

$$\hat{\theta}_{x'_i \mid \mathbf{u}} = \frac{N_{x'_i, \mathbf{u}}}{N_{\mathbf{u}}} \qquad (2)$$

Finding the highest scoring network structure is NP-hard [Chickering et al. 1995]. Thus, we usually resort to greedy local search procedures [Buntine 1991; Heckerman et al. 1995] that gradually improve a candidate structure by applying local structural transformations: adding, deleting, or reversing an edge. These transformations are usually applied in a greedy fashion, with occasional random steps to deal with local maxima and plateaux. Two crucial properties of the BIC score greatly facilitate this procedure. First, the score of a network can be written as a sum of terms, where each term determines the score of a particular choice of parents for a particular variable. Thus, a local change to one family $X'_i, \mathrm{Pa}(X'_i)$, such as the addition or removal of an arc, affects only one of these terms. As a consequence, the incremental value of any change to another family in the network remains unchanged. Hence, to determine the values of all local changes to the current network structure we need only re-evaluate changes to the family of $X'_i$. Second, the term that evaluates the family of $X'_i$ is a function only of the sufficient statistics for $X'_i$ and its parents. Thus, these sufficient statistics are the only aspects of the data that we need to preserve. For each choice of parents for $X'_i$, we need to collect statistics on different events. Evaluation of local changes usually involves computation of new sufficient statistics, and then an evaluation of the score with respect to the statistics of the new model and its dimension. The Bayesian score is somewhat more complex. It involves taking prior distributions over models and parameters into account. Without going into details, we note that for some choices of priors, such as the BDe priors of Heckerman et al. [1995], the main features of BIC also hold for the Bayesian score: the score decomposes into a sum of terms, and the score depends only on the sufficient statistics collected from the data. Although the Bayesian score and the BIC are asymptotically equivalent, for small sample sizes the Bayesian score often performs better.
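As a concrete illustration of equations (1) and (2) and of the decomposition of the BIC score into family terms, the following sketch scores a candidate transition structure from a fully observed binary trajectory. The data, structure, and helper names are invented for the example:

```python
import numpy as np

# Toy illustration of equations (1)-(2) and the BIC score, assuming binary
# variables and a single fully observed trajectory `d` of shape (T+1, n).
# `parents` maps each next-slice variable X'_i to its parents in slice t.
rng = np.random.default_rng(0)
d = rng.integers(0, 2, size=(501, 3))          # stand-in data, T = 500 transitions
parents = {0: [0], 1: [0, 1], 2: [2]}

def family_counts(i, pa):
    """Sufficient statistics N_{x'_i, u}: counts of (child value, parent config)."""
    N = {}
    for t in range(len(d) - 1):
        key = (d[t + 1, i], tuple(d[t, pa]))
        N[key] = N.get(key, 0) + 1
    return N

def family_bic(i, pa):
    """One family's term of the BIC score: log-likelihood minus penalty."""
    N, T = family_counts(i, pa), len(d) - 1
    ll = 0.0
    for u in {u for (_, u) in N}:
        Nu = sum(N.get((x, u), 0) for x in (0, 1))
        for x in (0, 1):
            if N.get((x, u), 0):
                theta = N[(x, u)] / Nu           # MLE parameter, equation (2)
                ll += N[(x, u)] * np.log(theta)  # log-likelihood term, equation (1)
    return ll - 0.5 * np.log(T) * 2 ** len(pa)   # one free parameter per parent config

print("BIC:", sum(family_bic(i, pa) for i, pa in parents.items()))
```

Because the score is a sum of such family terms, re-evaluating a candidate edge addition or deletion touches only the one family it changes.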

2.3 Learning DBNs: Incomplete Data

The main difficulty with learning from partial observations is that we no longer know the counts in the data. As a consequence, the score no longer decomposes into separate components corresponding to individual families. The most common solution to the missing data problem is the Expectation-Maximization (EM) algorithm [Dempster et al. 1977; Lauritzen 1995]. The algorithm is an iterative procedure that searches for a parameter vector which is a local maximum of the likelihood function. It starts with some initial (often random) parameter vector $\theta$. It then repeatedly executes a two-phase procedure. In the E-step, the current parameters are used to complete the data by "filling in" unobserved values with their expected values. In the M-step, the completed data is used as if it were real, in a maximum likelihood estimation step. More precisely, given a current parameter vector $\theta$, the algorithm computes the expected sufficient statistics (ESS) for $D$ relative to $\theta$:

$$\bar{N}_{\mathbf{y}} = E[N_{\mathbf{y}} : (G_\to, \theta)] = \sum_t E[\mathbf{1}(\mathbf{y}^{(t)} \mid D) : (G_\to, \theta)] = \sum_t P(\mathbf{y}^{(t)} \mid D, (G_\to, \theta)) \qquad (3)$$

It then uses the ESS $\bar{N}$ in place of the sufficient statistics $N$ in (2) for estimating a new set of parameters $\theta'$. The process then repeats. The central theorem justifying EM's performance is that each EM cycle is guaranteed to improve the likelihood of the data given the model, until it reaches a local maximum.

EM has traditionally been viewed as a method for adjusting the parameters of a fixed model structure. Friedman's Structural EM (SEM) algorithm [1997] extends it to the structure learning task. The SEM algorithm has the same E-step as EM, completing the data by computing expected counts based on the current structure and parameters. In addition to re-estimating parameters, the M-step of SEM uses the ESS, computed according to the current structure, to score other candidate structures. Essentially, the algorithm completes the data using the current network structure, then performs a complete-data structure search in the inner loop. After some number of steps, the algorithm stops the structure search, uses the current candidate network to complete the data, and the process repeats. Friedman [1998] shows that, for a large family of scoring rules, the network resulting from this inner loop must have a higher score than the original. This is true even though the expected counts used in evaluating the new structure are computed using the old structure. More precisely, Friedman defines a notion of expected score, which is the expectation of the score for different completions $D^+$ of the data $D$, where the probability of a completion $D^+$ is $P(D^+ \mid D, B)$. For a large class of scores, such as BIC or BDe, he shows that if we make a change to the network structure that increases the expected score, then the true score increases by at least as much. The crucial property of the expected score is that, like the score for complete data, it decomposes into a sum of local terms. For instance, the expected BIC score is simply the BIC score applied to the expected sufficient statistics. This property reinstates both of the important advantages that we had in the case of structure search for complete data: the ability to re-evaluate only a small number of structural changes following a structural change, and the ability to restrict attention to (expected) sufficient statistics.
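In code, the E-step/M-step cycle amounts to replacing counts with posterior probabilities. The sketch below accumulates expected counts as in equation (3) for a single binary family and re-estimates its parameters as in equation (2); the per-slice posteriors are fabricated stand-ins for the output of an inference routine:

```python
import numpy as np

# Expected sufficient statistics for one family (parent X, child X'), eq. (3):
# sum over time of the posterior P(X = x, X' = x' | D, theta). The posteriors
# here are random stand-ins; in practice they come from forward-backward.
rng = np.random.default_rng(1)
post = rng.dirichlet(np.ones(4), size=200)     # one flattened 2x2 posterior per step

ess = post.sum(axis=0).reshape(2, 2)           # expected counts N-bar_{x, x'}
theta = ess / ess.sum(axis=1, keepdims=True)   # M-step: eq. (2) on expected counts
print("expected counts:\n", ess)
print("re-estimated P(X' = 1 | X = x):", theta[:, 1])
```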

3 The E-step

As the discussion above clearly shows, the key requirement in applying SEM is the ability to compute the expected sufficient statistics. This computation requires that we run probabilistic inference over the entire trajectory $D$, as the distribution over any missing value will typically be affected by past and future evidence. Unfortunately, probabilistic inference for DBNs is notoriously expensive, as Friedman et al. [1998] clearly point out in their paper on learning DBNs.

The classical solution to DBN inference is based on the same ideas as the forward-backward algorithm for HMMs [Rabiner and Juang 1986]. At a high level, the algorithm propagates forward messages $\alpha^{(t)}$ from the start of the sequence forward, gathering evidence along the way; it uses a similar process to propagate backward messages $\beta^{(t)}$ in the reverse direction. Letting $\mathbf{r}^{(1)}, \dots, \mathbf{r}^{(T)}$ denote the observations along the sequence, $\alpha^{(t)}$ represents the distribution $P(\mathbf{X}^{(t)} \mid \mathbf{r}^{(1)}, \dots, \mathbf{r}^{(t)})$; $\beta^{(t)}$ represents the conditional distribution $P(\mathbf{r}^{(t+1)}, \dots, \mathbf{r}^{(T)} \mid \mathbf{X}^{(t)})$. The update rules for the messages are:

$$\alpha^{(t+1)}(\mathbf{x}') \propto P_{B_\to}(\mathbf{r}^{(t+1)} \mid \mathbf{x}') \sum_{\mathbf{x}} P_{B_\to}(\mathbf{x}' \mid \mathbf{x})\, \alpha^{(t)}(\mathbf{x})$$

$$\beta^{(t)}(\mathbf{x}) \propto \sum_{\mathbf{x}'} \beta^{(t+1)}(\mathbf{x}')\, P_{B_\to}(\mathbf{r}^{(t+1)} \mid \mathbf{x}')\, P_{B_\to}(\mathbf{x}' \mid \mathbf{x})$$

Now, the posterior distribution $\gamma^{(t)}$ over the states at time $t$ is simply $\alpha^{(t)}(\mathbf{x})\, \beta^{(t)}(\mathbf{x})$ (suitably renormalized). Similarly, the joint posterior $\gamma^{(t,t+1)}(\mathbf{x}, \mathbf{x}')$ over the states at $t$ and $t+1$ is proportional to $\alpha^{(t)}(\mathbf{x})\, P_{B_\to}(\mathbf{x}' \mid \mathbf{x})\, \beta^{(t+1)}(\mathbf{x}')\, P_{B_\to}(\mathbf{r}^{(t+1)} \mid \mathbf{x}')$.

This message passing algorithm can be implemented in a straightforward way for DBNs. Indeed, this algorithm was essentially the one used by Friedman et al. [1998]. Unfortunately, this approach is practical only for very small DBNs: its basic representation, the $\alpha$ and $\beta$ messages, has an entry for every possible state of the system, making the messages exponential in the number of state variables. Even in highly structured processes, these messages do not admit a more compact representation, as the variables in the message are almost invariably fully correlated [Ghahramani and Jordan 1996a; Boyen and Koller 1998b].

Boyen and Koller [1998b] describe a new approach to approximate inference in dynamic systems, which avoids the problem of explicitly maintaining distributions over large spaces. Their algorithm maintains factored representations that approximate these distributions. In Boyen and Koller [1998a], they apply this algorithm to the task of message passing in the forward-backward algorithm. Specifically, they represent messages as products of independent marginals over disjoint clusters. Let $\mathbf{Z}_1, \dots, \mathbf{Z}_K$ be a partition of $\mathbf{X}$. The approximate message $\tilde{\alpha}^{(t)}$ is represented as a product of marginals $\prod_k \tilde{\alpha}(\mathbf{Z}_k^{(t)})$. Similarly, $\tilde{\beta}^{(t)}$ is represented as $\prod_k \tilde{\beta}(\mathbf{Z}_k^{(t)})$. To apply forward propagation, $\tilde{\alpha}^{(t)}$ is multiplied by the transition model $P_{B_\to}$, and conditioned on $\mathbf{r}^{(t+1)}$. The result is then projected onto each cluster $\mathbf{Z}_k^{(t)}$, and the next message $\tilde{\alpha}^{(t+1)}$ is defined as the product of these marginals. Backward propagation works similarly, except that we take the product of $\tilde{\beta}^{(t+1)}$ by the transition model $P(\mathbf{X}^{(t+1)} \mid \mathbf{X}^{(t)})$, condition on the observations at time $t+1$, and project onto the clusters $\mathbf{Z}_k^{(t)}$ to get $\tilde{\beta}^{(t)}$.

In the case of DBNs, we can accomplish these propagations quite efficiently. For example, in the forward propagation case, we first generate a clique tree $\mathcal{T}^{(t)}$ [Lauritzen and Spiegelhalter 1988] in which, for every $k$, some clique contains $\mathbf{Z}_k^{(t)}$ and some clique contains $\mathbf{Z}_k^{(t+1)}$. A standard clique tree propagation algorithm can then be used to compute the posterior distribution over every clique. Once that is done, the distribution over $\mathbf{Z}_k^{(t+1)}$ is easily extracted from the appropriate clique potential. (See Boyen and Koller [1998b] for details.)

Boyen and Koller demonstrate the applicability of their approximate message passing algorithm for the task of parameter estimation for a real-life network with ten state variables (the bat network of Forbes et al. [1995]). They show that their algorithm is 10-15 times faster than exact message propagation, and that the degradation in the quality of the learned model is negligible.

4 Approximate Sufficient Statistics

This approximate message passing algorithm allows us to perform inference in much larger dynamic systems, and thereby perhaps to learn them. However, we are still faced with a serious problem in computing expected sufficient statistics for structure search. Recall that, to compute ESS, we need to compute joint marginals over sets of variables $\mathbf{Y}$, as in Equation (3). In standard EM for parameter learning, the situation is easy. Here, the only required statistics are the ones over the families of the individual variables in $B_\to$. By construction of the clique tree, each family is contained in some clique of the clique tree used in our approximate inference. Furthermore, we can combine a forward message $\tilde{\alpha}^{(t)}$ and a backward message $\tilde{\beta}^{(t+1)}$, and obtain a compact representation of the full joint posterior $\gamma^{(t,t+1)}$ as a tree $\mathcal{T}^{(t)}$. It is then easy to extract the needed marginals from the appropriate cliques [Boyen and Koller 1998a].

In SEM, this easy solution is no longer applicable, since we use the results of our E-step for a given $B_\to$ to compute sufficient statistics for a variety of other candidate structures $B'_\to$. Hence, we will need to compute ESS for events $\mathbf{Y}$ which do not correspond to families in the current structure $B_\to$. There is no reason to assume that the variables in an arbitrary event $\mathbf{Y}$ will be contained in a single clique in $\mathcal{T}$. Note that this issue did not arise in the work of Friedman et al. [1998]. In their approach, the corresponding clique tree was basically a single clique containing all of the variables in $\mathbf{X}^{(t)}, \mathbf{X}^{(t+1)}$. Thus, all possible queries $\mathbf{Y}$ were contained in a single clique (the one and only clique). It is our ability to provide a finer-grained decomposition of the clique tree that brings up this issue.

The straightforward approach to this problem is to compute the necessary posterior $P(\mathbf{Y}^{(t)} \mid D, B_\to)$ by performing a special-purpose inference step over $\mathcal{T}^{(t)}$, tailored to $\mathbf{Y}$. Unfortunately, this operation can be expensive, especially if $\mathbf{Y}$ contains variables which are "distant" in the current structure. It also needs to be performed a great many times every time slice, once for each statistic of interest. The problem is even more serious when we seek statistics over several time slices, as in our search for violations of the Markov property. In this case, $\mathbf{Y}^{(t)}$ contains nodes over several time slices, so that the computation of $P(\mathbf{Y}^{(t)} \mid D, B_\to)$ is almost invariably infeasible.

We propose a new approximate solution, in the same spirit as the approximate message decomposition of Boyen and Koller [1998b, 1998a] and of the variational approach of Ghahramani and Jordan [1996a]. Instead of computing the joint distribution $P(\mathbf{Y}^{(t)})$, we approximate it as a product of independent marginals over individual variables. We approximate $P(\mathbf{Y}^{(t)} \mid D, B_\to)$ as $\prod_i P(Y_i^{(t)} \mid D, B_\to)$, where for each of the individual variables $Y_i$, $P(Y_i^{(t)} \mid D, B_\to)$ is computed by marginalizing some clique in $\mathcal{T}^{(t)}$ in which the variable is present. From there, $\bar{N}_{\mathbf{Y}}$ is computed as in (3). This process requires a pass over the clique tree to perform the marginalization, and then a simple computation which requires linear time in the number of sufficient statistics that we are maintaining.¹ Furthermore, this approach applies easily to the task of computing ESS for events that span several time slices: we simply extract marginals from more than one clique tree $\mathcal{T}^{(t)}$. For example, to approximate the joint marginals over an event $X_i^{(t+1)}, X_j^{(t-2)}$, we could extract $P(X_i^{(t+1)})$ from $\mathcal{T}^{(t)}$ and $P(X_j^{(t-2)})$ from $\mathcal{T}^{(t-2)}$. We then multiply the two marginals, and add the result to the expected counts for this event, as usual.
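A tiny runnable illustration of the factored message passing of Section 3, for a state consisting of two binary clusters; the transition and observation models are invented, and the projection step is the product-of-marginals approximation:

```python
import numpy as np

# Toy sketch of Boyen-Koller factored forward propagation. The state has two
# binary clusters Z1, Z2; the forward message is kept as a product of cluster
# marginals: propagate exactly for one step, condition, then project back.
rng = np.random.default_rng(3)
T_mat = rng.dirichlet(np.ones(4), size=4)   # P(z1', z2' | z1, z2), rows = (z1, z2)
obs = np.array([0.9, 0.4, 0.6, 0.1])        # P(r = 1 | z1', z2'), invented

def bk_step(m1, m2, r):
    """One forward step: exact propagation of the factored message,
    conditioning on observation r, then projection onto Z1' and Z2'."""
    prior = np.outer(m1, m2).ravel()        # factored message over (z1, z2)
    joint = prior @ T_mat                   # exact one-step propagation
    joint *= obs if r else (1 - obs)        # condition on the evidence at t+1
    joint /= joint.sum()
    joint = joint.reshape(2, 2)
    return joint.sum(axis=1), joint.sum(axis=0)  # project: independent marginals

m1, m2 = np.array([0.5, 0.5]), np.array([0.5, 0.5])
for r in [1, 1, 0, 1]:                      # a short observation sequence
    m1, m2 = bk_step(m1, m2, r)
print("approximate filtered marginals:", m1, m2)
```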

Table 1: Negative log-likelihood on test data for parametric EM, for different starting points (bat network).

                   Seed #1   Seed #2   Seed #3
Gold standard      22.1860   22.1860   22.1860
Exact ESS          22.4026   22.2801   22.3269
Approximate ESS    22.2633   22.2676   22.2782

¹ We note that we could have used a more refined computation that would have taken advantage of the co-occurrence of some subsets of $\mathbf{Y}$ within a single clique in order to avoid approximating them as independent. However, this extension would require that we marginalize the cliques in a potentially different way for every statistic that we need to compute. Our experiments (see below) suggest that the error introduced by this approximation is probably not large enough to be worth the significant computational overhead.

At first glance one might think that this approximation discards all correlations between variables in different clusters $Y_i$. In general, however, this is not the case, since we are examining the posterior distribution over $\mathbf{Y}$ given different configurations of evidence. Consider, for example, the situation where we have two binary variables $A, B \in \mathbf{Y}$ which belong to different cliques. If, in our data, we have that, depending on the evidence, $A$ and $B$ are either both probably true or both probably false, then at each step the mass of $(A, B)$ in the product distribution $P(A^{(t)})\, P(B^{(t)})$ will be large at either $(0, 0)$ or $(1, 1)$, and its accumulation in our sufficient statistics will reveal the correlation. Of course, there are cases where this approximation would lose correlations. For example, if we have that $A^{(t)}$ and $B^{(t)}$ are both uniformly distributed yet correlated, our approximate sufficient statistics would not reveal the correlation. Thus, if the evidence does not give us information about the values of the variables, but only about the correlations, our approximation will lose this correlation. We argue that such models are hard to learn in general (with or without our approximation). Indeed, if the evidence does not give us enough information about the value of a hidden variable, our ability to learn something meaningful about its distribution is very limited.

We tested the error introduced by this approximation on the better-understood problem of parameter estimation. As in Boyen and Koller [1998a], we generated a long sequence of 20,000 synthetic data points from the bat network [Forbes et al. 1995]. We then attempted to estimate the parameters back from the data, given the correct structure. We ran two versions of this experiment: one with the approximate message passing algorithm (as above) but without the approximation of the ESS, and the other with both. As can be seen in Table 1, the approximation does not degrade the learning accuracy. On the contrary, the approximation even seems to be slightly beneficial, which could be explained as a regularization effect.
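A small numeric sketch of this approximation and of the argument above: expected counts for a pair of variables are accumulated as products of their (fabricated) individual posterior marginals, and when the evidence makes the two variables individually near-deterministic and co-varying, the accumulated counts concentrate on the diagonal, recovering the correlation:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 300
z = rng.integers(0, 2, size=T)                 # hidden state driving both variables

# Fabricated posterior marginals: evidence usually pins each variable near z.
pA = np.where(z[:, None] == 1, [0.05, 0.95], [0.95, 0.05])  # P(A^(t+1) | D)
pB = np.where(z[:, None] == 1, [0.10, 0.90], [0.90, 0.10])  # P(B^(t-2) | D)

# Approximate expected counts for the joint event (A, B): sum over t of the
# outer product of the two marginals (the independence approximation).
ess_AB = sum(np.outer(pA[t], pB[t]) for t in range(T))
print("approximate expected counts:\n", ess_AB)  # diagonal-heavy: correlation kept
```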

5 Structure Search

5.1 Efficient Structure Search

In the previous section, we suggested methods for efficient computation of expected sufficient statistics. In SEM search, we need to compute the ESS for each family we change. How do we choose which ESS we should compute? The first approach is to compute in advance all the expected sufficient statistics the search might need. However, since there are too many of these, this solution is impractical.² The second approach, which was used by Friedman et al. [1998], is to compute sufficient statistics "on demand", i.e., statistics for $\mathbf{Y}$ are computed only when the search needs to evaluate a structure with this family.³ Unfortunately, that also is typically quite expensive, as it requires a traversal over the entire training sequence. These two solutions are at the extreme ends of a spectrum. Friedman et al. [1999] present an intermediate solution, which we also adopt. The search procedure works in stages. At the beginning of each stage, the search procedure posts the statistics it will require for that stage. These are selected in an informed way, based on the current state of the search. The requested statistics are then computed in one batch, using a single inference phase for all of them at once. More specifically, the algorithm finds for each variable $X'_i$ a set $\mathrm{Pot}_i$ of potential parents, based on the current network structure. At each stage, the search is restricted to consider only operations that involve adding edges $Y \to X'_i$ for $Y \in \mathrm{Pot}_i$ or removing current arcs. The number of ESS required for these operations is fairly small, and they can be collected at once. The algorithm uses heuristics to focus the attention of the search procedure on "plausible" potential parents, and therefore requires relatively few statistics in each stage. After this restricted search is done, the process repeats, using the new network as a basis for finding new potential parents for each variable. This process is iterated until convergence of the scoring function.

² In general, there is an exponential number of statistics. When we restrict the indegree of each variable, the number is polynomial, but still unrealistically large.

³ Of course, we want to avoid unnecessary recomputations. The standard solution is to store a cache of computed statistics, and call the ESS computations only on statistics that are not found in the cache.
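The staged scheme described above can be sketched as follows. The "statistics" and the candidate-parent heuristic are toy stand-ins, but the structure of the loop (post a batch of statistics, compute it in one inference pass, then search within it) is the one just described:

```python
# Toy sketch of staged search: each stage posts the statistics it needs,
# computes them in one batch, then makes only moves covered by that batch.
VARS = ["A", "B", "C"]
SIGNAL = {("A", "B"): 2.0, ("B", "C"): 1.5}     # fake scores for candidate arcs

def potential_parents(child, parents):
    # Heuristic stand-in: any variable not already a parent of `child`.
    return [v for v in VARS if v not in parents[child]]

def compute_ess_batch(requests):
    # Single "inference phase" computing all posted statistics at once.
    return {(p, c): SIGNAL.get((p, c), -0.5) for (p, c) in requests}

def staged_search(n_stages=3):
    parents = {v: set() for v in VARS}
    for _ in range(n_stages):
        requests = [(p, c) for c in VARS for p in potential_parents(c, parents)]
        ess = compute_ess_batch(requests)        # one batched inference pass
        for (p, c), gain in ess.items():         # restricted greedy moves
            if gain > 0:
                parents[c].add(p)
    return parents

print(staged_search())  # e.g. B acquires parent A, C acquires parent B
```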

5.2 Discovering Hidden Variables

As we mentioned in the introduction, a fundamental problem when learning dynamic systems from real data is the discovery of hidden variables. In stock market data, for example, internet stocks are typically correlated. Unless we realize that this correlation is due to a hidden variable (public perception of the future growth of internet revenue), the model we learn is likely to be quite poor.

The task of discovering new hidden variables is a notoriously difficult one. In temporal sequences, however, we have some cues that can indicate the presence of such variables. In particular, ignoring a hidden variable will often lead to a non-Markovian correlation, induced by the loss of information as the hidden variable is "forgotten" from step to step. Thus, we can search for non-Markovian correlations, and use them as an indication that the process needs additional "memory" about the past. More precisely, suppose that we discover that we can predict $X^{(t+1)}$ using $X^{(t)}$ and $Y^{(t-1)}$. Then we might consider creating a new hidden variable $H$ such that $H$ is a parent of $X'$ and $Y$ is a parent of $H'$. Thus, we will have that $Y^{(t-1)}$ influences $H^{(t)}$ via the $Y \to H'$ edge, and that $H^{(t)}$ in turn influences $X^{(t+1)}$ via the $H \to X'$ edge. In other words, $H$ behaves as the "memory" of $Y$ with one step of lag.

In general, we propose the following algorithm. We start by learning the edges among variables in $k$ time slices (a $k$-TBN) for some fixed time window $k$. When some of the variables are unobserved, we use structural EM and our approximation methods to estimate the ESS of the variables in these $k$ consecutive time slices. We note that this process uses structural EM: the sufficient statistics are computed once, and then used for an extended search phase over structures. That, combined with our approach to computing ESS for variables that are far apart in the network, allows us to estimate the ESS for the $k$-TBN without ever doing inference on it. After we learn such a network, we eliminate the non-Markovian arcs by creating new hidden variables that "remember" those variables that participate as parents in non-Markovian correlations. Any variable $X$ in time slice $t - d$ which directly influences a variable $Y$ at time $t + 1$ requires $d$ new hidden variables: at time $t$, the $i$th introduced variable $X_{-i}$ has the same value that $X$ had at time slice $t - i$. In order to represent this "memory" model exactly, the CPDs of these newly created variables would have to be deterministic. However, deterministic models do not easily accommodate EM-style adaptation. Furthermore, since we want to encourage the search to construct variables that remember global phenomena, we also add "persistence" arcs that allow the hidden variables to depend on the longer-term past. Therefore, we initialize the parameters for these variables to be noisy versions of the appropriate deterministic CPDs, and make the noise biased toward persistence with the previous time slice of the hidden variable. Having constructed a new 2-TBN, we are now again in a position where we can run parametric EM to find better parameters for the new hidden variables.

Then we repeat the process (structure learning, introduction of variables, etc.).
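A sketch of the variable-introduction step under these conventions: a non-Markovian arc from $X$ at slice $t - d$ to $Y$ at slice $t + 1$ is replaced by a chain of $d$ binary "lag" variables, each initialized to a noisy copy of its source with a small bias toward persisting its own previous value. The specific noise values here are illustrative, not the ones used in our experiments:

```python
import numpy as np

def memory_chain_cpds(d, eps=0.1, persist=0.05):
    """Initial CPDs for d hidden lag variables H_1..H_d (binary): H_i at time
    t should copy the value its source had at the previous slice (X for i = 1,
    H_{i-1} for i > 1). Each CPD is a noisy-deterministic copy, biased toward
    persisting the variable's own previous value so EM can adapt it later."""
    cpds = []
    for _ in range(d):
        cpd = np.empty((2, 2))            # rows: source value, cols: own prev value
        for src in (0, 1):
            for prev in (0, 1):
                p_copy = (1 - eps) * src + eps * 0.5          # noisy copy of source
                cpd[src, prev] = (1 - persist) * p_copy + persist * prev
        cpds.append(cpd)                  # entry [src, prev] = P(H_i' = 1 | src, prev)
    return cpds

for i, cpd in enumerate(memory_chain_cpds(d=2), start=1):
    print(f"H_{i}: P(H' = 1 | source, prev) =\n", cpd)
```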

Table 2: Negative log-likelihood on test data for various algorithms and data sets (in bits/time slice).

                             Bach    Apnea   Stock      bat
Gold standard                 n/a     n/a     n/a     22.147
Parametric EM                 n/a     n/a     n/a     22.873
FHMM, 2 hid vars            8.486   3.635   24.268    23.957
FHMM, 4 hid vars            5.623     --    24.302    23.562
FHMM, 6 hid vars              --      --    23.213    23.773
Fully observable only       4.538   1.892   20.834    22.693
SEM, iteration 0            4.503   1.704   20.759    22.418
SEM, iteration 1            4.513   1.713   20.819    22.434
SEM, iteration 2            4.537   1.710   20.710    22.388

6 Experimental Results

We tried our algorithm on four different domains: three involve real-world data, while the fourth is sampled from a given DBN. We originally ran our algorithm using standard structure search on the observable variables only, allowing for non-Markovian edges. We tested the accuracy on the test set for the learned network, and then used it to introduce hidden variables, as described above. We then repeated this process, running inference to compute a certain predetermined set of ESS, and then using them for an extensive structure search. For structure search, we experimented with both BDe and BIC scores, and with both full and tree-structured CPDs. We report the results for the BDe score with trees; the results for the other cases are somewhat different, but exhibit the same general trends.

As points of comparison, we also learned two other types of structure: a standard Markovian DBN over the observable variables (using a standard BN structure learning algorithm), and a Factorial HMM (FHMM) structure [Ghahramani and Jordan 1996b] using parametric EM with the approximate message passing algorithm of Boyen and Koller [1998a]. An FHMM is best viewed as a DBN with some number $\ell$ of hidden variables, each of which evolves independently of the others; each observable variable depends on all of the hidden variables within its time slice. FHMMs have been shown to be a good candidate for modeling several interacting processes evolving in parallel (e.g., multiple articulatory processes in speech). In our experiments, we tried FHMMs with two, four, and six binary hidden variables. (For data sets with only a small number of observables, we tried fewer hidden variables.)

Bach chorales: This data set was proposed as part of the 1991 Santa Fe competition for learning time series [Weigend and Gershenfeld 1990]. It encodes the melodic line of 100 chorales attributed to J.S. Bach. The model has five discrete attributes: Key signature, Pitch, Duration, Fermata, and Time signature. The last three all represent temporal aspects of the piece. The training set consisted of 71 chorales, each of which is about 50 notes long, for a total training set size of 3212 transitions; the test set consisted of 29 chorales, chosen at random, for a total test set size of 1379. The first column of Table 2 shows the results for this data set. We can see that all of the instances of the DBN learning algorithms performed significantly better than all instances of the FHMM algorithm. We also see that the introduction of non-Markovian edges and hidden variables helps significantly with respect to standard structure search over the observables. Somewhat disappointingly, the introduction of hidden variables per se does not improve the log-likelihood. We attribute that to numerical overfitting, as the network structure learned (shown in Figure 1) is actually quite natural. The algorithm detected the correlations between the three tempo attributes. The other two are decoupled from those. All variables have persistence arcs, except (very naturally) the Fermata variable, which represents a momentary event corresponding to the end of a segment. The algorithm introduced several hidden variables that capture the non-Markovian nature of these variables. Most interestingly, the Duration variable has a time slice that represents a short non-Markovian dependence (two time slices); on the other hand, the hidden variables introduced for the Time signature variable, which represents much longer-term aspects of the piece, correspond to longer-range dependencies.

Sleep apnea: This data set was also proposed as part of the Santa Fe competition [Weigend and Gershenfeld 1990]. It was obtained by monitoring 3 medical parameters on a patient suffering from sleep apnea. The data contains 34000 points, collected in a single run. This data set is non-stationary, due to the various sleep phases and the condition of the patient. Following the suggestion of Dagum and Galper [1993], each variable was discretized into seven buckets. We partitioned the sequence into four training subsequences, for a total of 19994 transitions, and one test sequence of size 13999. The behavior of the log-likelihoods is very similar to the previous data set. Again, the structure is very natural. There is a strong correlation between BloodOxygen and HeartRate and between HeartRate and ChestVolume, but not directly between BloodOxygen and ChestVolume, a very natural assumption. As above, the algorithm discovered interesting non-Markovian correlations; here, the temporal granularity of BloodOxygen is shorter-term than that of the other variables.

[Figures: learned network structures, drawn over time slices t-2 through t+1 with evidence entering at t+1. The stock market network relates variables for Services, Internet industry, Hardware, PC industry, Chips, Japanese companies, Electronics, and Automobile.]

The network for the bat domain was learned after only a single iteration of SEM. We can see that the algorithm does discover a few interesting correlations, such as one between TurnSignal and XdotSens (sensed lateral movement).

7 Discussion and Conclusions

In this paper, we combine two lines of work. The first deals with search techniques for learning in the presence of hidden variables [Friedman 1997; Friedman et al. 1998]. The second deals with fast approximate inference in complex networks [Boyen and Koller 1998b; Boyen and Koller 1999]. While approximate DBN inference has been playing a major role in parametric learning [Boyen and Koller 1998a; Ghahramani and Jordan 1996a], this is the first paper to deal with the issues involved in applying it to structure search. In particular, we had to deal with the computation of a large number of different statistics and to introduce methods for discovering hidden variables. Although we based our solution on the Boyen-Koller approximation, many of these ideas can be applied to other approximate inference methods, including the variational methods of Ghahramani and Jordan [1996a].

Clearly, our work only scratches the surface of the problem of discovering hidden variables. While our algorithm discovers correlations that involve temporal interactions, it is less apt at detecting atemporal correlations, as we saw in the stock market data. At the other extreme, our algorithm does not support the discovery of truly long-range dependencies and aggregate influences from variables evolving at different speeds. Our current method for discovering hidden influences requires that the time scale of the interaction match the time scale of the model. If there is a hidden variable evolving much more slowly than the observables, then our algorithm would not find it. This problem can be addressed by explicitly searching for violations of the Markov property at widely varying time granularities. Specifically, applying our algorithm to a data sequence subsampled by a factor of $k$ would exhibit interactions with a time constant of the order of $k$. We believe this issue to be of crucial importance when learning from real-world data. In real systems, observable variables are typically influenced by hidden processes with widely differing time scales, which furthermore are not always related to the sampling rate of the observations.

References

Boyen, X. and D. Koller (1998a). Approximate learning of dynamic models. In NIPS 11.

Boyen, X. and D. Koller (1998b). Tractable inference for complex stochastic processes. In Proc. UAI.

Boyen, X. and D. Koller (1999). Exploiting the architecture of dynamic systems. In Proc. AAAI.

Buntine, W. (1991). Theory refinement on Bayesian networks. In Proc. UAI, pp. 52-60.

Chickering, D., D. Geiger, and D. Heckerman (1995). Learning Bayesian networks: search methods and experimental results. In Proc. AI & Stats, pp. 112-128.

Dagum, P. and A. Galper (1993). Forecasting sleep apnea with dynamic network models. In Proc. UAI, pp. 64-71.

Dempster, A., N. Laird, and D. Rubin (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1-38.

Forbes, J., T. Huang, K. Kanazawa, and S. Russell (1995). The BATmobile: Towards a Bayesian automated taxi. In Proc. IJCAI.

Friedman, N. (1997). Learning belief networks in the presence of missing values and hidden variables. In Proc. ICML.

Friedman, N. (1998). The Bayesian structural EM algorithm. In Proc. UAI, pp. 129-138.

Friedman, N., K. Murphy, and S. Russell (1998). Learning the structure of dynamic probabilistic networks. In Proc. UAI, pp. 139-147.

Friedman, N., D. Pe'er, and I. Nachman (1999). Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. In Proc. UAI.

Ghahramani, Z. and M. Jordan (1996a). Factorial hidden Markov models. In Proc. NIPS.

Ghahramani, Z. and M. Jordan (1996b). Factorial hidden Markov models. In Proc. NIPS.

Heckerman, D. (1999). A tutorial on learning with Bayesian networks. In M. I. Jordan (Ed.), Learning in Graphical Models. MIT Press.

Heckerman, D., D. Geiger, and D. M. Chickering (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20, 197-243.

Jensen, F., S. Lauritzen, and K. Olesen (1990). Bayesian updating in recursive graphical models by local computations. Computational Statistical Quarterly 4.

Lauritzen, S. and D. Spiegelhalter (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Roy. Stat. Soc. B 50.

Lauritzen, S. L. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis 19, 191-201.

Rabiner, L. and B. Juang (1986, January). An introduction to hidden Markov models. IEEE Acoustics, Speech & Signal Processing.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461-464.

Shenoy, P. P. and G. R. Shafer (1990). Axioms for probability and belief-function propagation. In Proc. UAI, pp. 169-198.

Spirtes, P., C. Glymour, and R. Scheines (1993). Causation, Prediction, and Search. Springer-Verlag.

Weigend, A. and N. Gershenfeld (1990). Time-series competition. In Proc. Nonlinear Modeling and Forecasting, Volume XII. Santa Fe Institute: Addison-Wesley.
