On-Line Learning of Prototypes and Principal Components


Institut für Theoretische Physik, Universität Würzburg, Am Hubland, D-97074 Würzburg, Germany

M. Biehl, A. Freking, M. Holzer, G. Reents, and E. Schlosser

Ref: WUE-ITP-98-007

Abstract

We review our recent investigation of on-line unsupervised learning from high-dimensional structured data. First, on-line competitive learning is studied as a method for the identification of prototype vectors from overlapping clusters of examples. Specifically, we analyse the dynamics of the well-known winner-takes-all or K-means algorithm. As a second standard learning technique, the application of Sanger's rule for principal component analysis is investigated. In both scenarios the necessary process of student specialization may be delayed significantly due to underlying symmetries.

1 Introduction

Methods from statistical physics have been applied to the theory of adaptive systems with great success in recent years. Perhaps the most prominent example is the analysis of feedforward neural networks which can learn from example data. The statistical mechanics approach allows one to investigate typical properties of very large systems on average over the randomness contained in the data. It complements results from computational learning theory and other disciplines. Most of the investigations concern the supervised learning of a rule; for reviews of the field see for instance (Watkin et al., 1993; Opper and Kinzel, 1996). A particularly successful line of research was initiated in (Kinzel and Rujan, 1990; Kinouchi and Caticha, 1992) and aims at the analysis of the physics of on-line learning schemes (Amari, 1967 and 1993; Hertz et al., 1991). On-line learning is attractive from a practical point of view because each update uses only the latest example in the sequence. Obviously, storage needs and computational effort are reduced in comparison with batch or off-line learning (Hertz et al., 1991).


On the other hand, the simplicity of on-line algorithms has allowed the study of a variety of learning scenarios and network architectures, including the simple perceptron (e.g. Kinzel and Rujan, 1990; Kinouchi and Caticha, 1992) and multilayered networks with threshold units (e.g. Sompolinsky et al., 1995; Copelli and Caticha, 1995) or sigmoidal activation functions, respectively (e.g. Saad and Solla, 1995; Biehl et al., 1996). Despite their simplicity, on-line algorithms compete well with costly off-line prescriptions (e.g. Kinouchi and Caticha, 1992; Opper, 1996; van den Broeck and Reimann, 1996; Kim and Sompolinsky, 1996; Copelli et al., 1997). Models of unsupervised learning have also been studied in the statistical mechanics framework, see e.g. (Biehl and Mietzner, 1994; Watkin and Nadal, 1994; Barkai and Sompolinsky, 1994; Lootens and van den Broeck, 1995). The investigation of on-line unsupervised learning schemes (Biehl, 1994; van den Broeck and Reimann, 1996) has provided new insights in this context as well.

In the following we review our recent investigation of on-line unsupervised learning from high-dimensional structured data (Biehl et al., 1997; Biehl and Schlosser, 1998). In the next section, on-line competitive learning is discussed as a method for the identification of prototype vectors from overlapping clusters of examples. Here, the focus will be on the well-known winner-takes-all or K-means algorithm (Duda and Hart, 1973; Hertz et al., 1991; Bishop, 1995). Section 3 revisits a second standard unsupervised learning technique: the identification of principal components by means of Sanger's rule (Sanger, 1989; Hertz et al., 1991; Bishop, 1995). In all these scenarios, the necessary process of student specialization can be delayed significantly. This effect is due to underlying symmetries which result in quasi-stationary plateau configurations in the learning dynamics. We conclude with a summary of the main results and a discussion of possible extensions in Section 4.

2 Competitive Learning

2.1 The Model

One of the possible objectives of unsupervised learning is the identification of prototype vectors from a given set of data $\{\xi^\mu \in \mathbb{R}^N\}$ ($\mu = 1, \dots, P$). The aim is to find a faithful representation of this set by use of only a few typical vectors $J_k \in \mathbb{R}^N$ ($k = 1, \dots, K$ with $K \ll P$) which capture the relevant features of the data. This is closely related to (yet not identical with) clustering problems, where the goal is to group the vectors into several sets of similar inputs (Hertz et al., 1991; Bishop, 1995). Frequently, the identification of prototypes is guided by the Euclidean distances

$$ d_k(\xi^\mu) = \left(\xi^\mu - J_k\right)^2. \qquad (2.1) $$


In particular, the family of so-called competitive learning algorithms updates the students $J_k$ according to a prescription of the generic form

$$ J_k^\mu = J_k^{\mu-1} + \frac{\eta}{N}\,\left(\xi^\mu - J_k^{\mu-1}\right)\, p_k(\{d_k\}). \qquad (2.2) $$

The change of the weight vectors is always along $(\xi^\mu - J_k^{\mu-1})$, i.e. the prototypes are moved in the direction of the presented example. The step size of this change is determined by the learning rate $\eta > 0$, which is scaled with the dimension $N$ of the data. The factors $p_k$ define the actual algorithm. Here, they are taken to be non-negative functions of the set of distances $d_k = (\xi^\mu - J_k^{\mu-1})^2$ and obey the normalization constraint

$$ \sum_{k=1}^{K} p_k(\{d_k\}) = 1, \qquad (2.3) $$

which fixes the total contribution of a single example to the learning process. Typically, a specific example will affect the prototypes which are closest in distance more efficiently than others. This competition of the students for updates is controlled by the assignment or labeling functions $p_k$. Note that no normalization constraint is imposed on the student vectors. This is in contrast to models of directional clustering, where only characteristic directions are sought in input space (Biehl and Mietzner, 1994; Watkin and Nadal, 1994; Lootens and van den Broeck, 1995). In the following, the magnitude of the student vectors is $J_k^2 = O(1)$,

whereas $\xi^2 = O(N)$ holds true for the example data.

We restrict the analysis to the case of only two prototype vectors $J_1$ and $J_2$ in an environment which provides a sequence of independent data drawn from a stochastic source. We assume a bimodal input distribution of the specific form

$$ P(\xi) = \frac{1}{2} \sum_{m=1}^{2} P(\xi|m) \quad \text{where} \quad P(\xi|m) = \frac{1}{(2\pi)^{N/2}}\, \exp\!\left[ -\frac{1}{2}\left(\xi - b\, B_m\right)^2 \right], \qquad (2.4) $$

which corresponds to a mixture of two overlapping Gaussians centered at $b\, B_m$. We take the characteristic vectors $B_m$ to be orthogonal and normalized ($B_k \cdot B_m = \delta_{km}$). Hence, the quantity $b > 0$ specifies the offset of the respective cluster centers from the origin. The dummy variable $m$ indicates from which of the clusters $\xi$ is drawn; both centers contribute with the same probability $1/2$. The assumed data distribution is only weakly anisotropic: given the cluster label $m$, the distance of the conditional mean $b\, B_m$ from the origin is $b = O(1)$, whereas the average length of input vectors is $O(\sqrt{N})$.



Similar clustered input distributions have been studied in various models of unsupervised learning as well as in supervised scenarios, see e.g. (Biehl and Mietzner, 1994; Watkin and Nadal, 1994; Barkai and Sompolinsky, 1994; Meir, 1995; Marangi et al., 1995). Note that here, of course, the labels $m = 1, 2$ are not provided with the example data.

We proceed by investigating the model in the thermodynamic limit $N \to \infty$. The random quantities $x_m = J_m \cdot \xi$ and $y_m = B_m \cdot \xi$ are distributed according to a mixture of Gaussians as well. By means of the central limit theorem this holds true for more general $P(\xi|m)$ in the limit $N \to \infty$, provided the first and second moments are the same as in (2.4). The joint density of the overlaps is uniquely determined by their conditional averages and covariances:

$$ \langle x_j x_k \rangle_n - \langle x_j \rangle_n \langle x_k \rangle_n = Q_{jk} = J_j \cdot J_k, \qquad \langle x_k y_m \rangle_n - \langle x_k \rangle_n \langle y_m \rangle_n = R_{km} = J_k \cdot B_m, $$
$$ \langle y_l y_m \rangle_n - \langle y_l \rangle_n \langle y_m \rangle_n = \delta_{lm} = B_l \cdot B_m, \qquad \langle x_k \rangle_n = b\, R_{kn} \quad \text{and} \quad \langle y_m \rangle_n = b\, \delta_{mn}. \qquad (2.5) $$

Here and in the following, $\langle \cdot \rangle_n$ denotes the average over the contribution $P(\xi|n)$ to the mixture density (2.4). Averages over the full joint density of $\{x_1, x_2, y_1, y_2\}$ are to be calculated as a sum of conditional means; one obtains for example $\langle x_1 \rangle = \frac{1}{2}\left(\langle x_1 \rangle_1 + \langle x_1 \rangle_2\right) = \frac{1}{2}\, b\, (R_{11} + R_{12})$.
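The conditional statistics (2.5) are easy to check in a small Monte Carlo experiment. The following Python sketch is purely illustrative (the dimension $N$, the offset $b$, and the randomly chosen students $J_k$ are assumptions, not values used here); it samples from the mixture (2.4) and compares the empirical conditional means of the projections $x_k$ and $y_m$ with $b R_{kn}$ and $b\,\delta_{mn}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, b = 1000, 1.0                                    # illustrative dimension and cluster offset

# orthonormal cluster directions B_1, B_2 and two arbitrary students with |J_k| = O(1)
B = np.linalg.qr(rng.standard_normal((N, 2)))[0].T  # rows: B_1, B_2
J = rng.standard_normal((2, N)) / np.sqrt(N)

def sample(P):
    """Draw P examples from the bimodal mixture (2.4); return data and hidden cluster labels."""
    m = rng.integers(0, 2, size=P)                  # each cluster with probability 1/2
    xi = rng.standard_normal((P, N)) + b * B[m]     # mean b*B_m, unit covariance
    return xi, m

xi, m = sample(20000)
x, y = xi @ J.T, xi @ B.T                           # projections x_k = J_k.xi, y_m = B_m.xi
R = J @ B.T                                         # overlaps R_kn = J_k.B_n

# conditional means, cf. (2.5): <x_k>_n = b R_kn and <y_m>_n = b delta_mn
for n in (0, 1):
    print("cluster", n + 1, "<x_k>:", x[m == n].mean(axis=0).round(3), " expected:", (b * R[:, n]).round(3))
    print("cluster", n + 1, "<y_m>:", y[m == n].mean(axis=0).round(3), " expected:", b * np.eye(2)[n])
```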

2.2 The Dynamics of Learning

In the thermodynamic limit $N \to \infty$ the set of self-averaging order parameters $Q_{jk}$ and $R_{km}$ is sufficient for a macroscopic description of the system. Further, the dynamics of the learning process can be analysed exactly in terms of these quantities. To this end we derive from (2.2) recursion relations for the evolution of the $Q_{jk}$ and $R_{km}$, which are then averaged over the latest random input vector. This is possible because the randomness of the data enters only through the projections $\{x_k, y_k\}$. Thus, all averages are over the four-dimensional Gaussian density which is specified by its moments (2.5). In terms of the continuous time $\alpha = \mu/N$ the dynamics is now described by a set of coupled first order differential equations, see e.g. (Biehl

and Schwarze, 1995) and (Saad and Solla, 1995) for a more detailed description of the formalism in the context of supervised learning. A discussion of self-averaging in on-line learning can be found in (Reents and Urbanczik, 1998). Here, the obtained system of differential equations reads

$$ \frac{dR_{km}}{d\alpha} = \eta\, \left\langle (y_m - R_{km})\, p_k \right\rangle $$
$$ \frac{dQ_{lm}}{d\alpha} = \eta\, \left\langle (x_l - Q_{lm})\, p_m + (x_m - Q_{lm})\, p_l \right\rangle + \eta^2\, \left\langle p_l\, p_m \right\rangle \qquad (2.6) $$



where the arguments of the assignment functions have been omitted for simplicity. For a discussion of the essential properties of our model we restrict the analysis to a partially symmetric subspace where

$$ Q_{11} = Q_{22} = Q, \quad Q_{12} = Q_{21} = C, \quad R_{11} = R_{22} = R, \quad R_{12} = R_{21} = S. \qquad (2.7) $$

In all cases investigated here, the dynamics (2.6) preserves this symmetry. We observe furthermore that, even if the initial conditions violate (2.7), the system approaches the restricted subspace (or an equivalent one with relabeled students) after a short transient. This is analogous to the findings of e.g. (Saad and Solla, 1995) and (Biehl et al., 1996) with respect to supervised learning schemes. In systems with two students only, a substantial simplification is achieved by introducing the following linear combinations of order parameters:

$$ R_\pm = R \pm S \quad \text{and} \quad Q_\pm = Q \pm C. \qquad (2.8) $$

In terms of these quantities and under the symmetry assumption (2.7) the equations of motion (2.6) are of the following form:

$$ \frac{dR_+}{d\alpha} = \frac{\eta}{2}\,\left(b - R_+\right), \qquad \frac{dQ_+}{d\alpha} = \frac{\eta}{2}\,\left(2 b R_+ - 2 Q_+ + \eta\right) \qquad (2.9) $$

$$ \frac{dR_-}{d\alpha} = \frac{\eta}{2}\left( \left\langle (y_1 - y_2)(p_1 - p_2) \right\rangle - \left\langle p_1 + p_2 \right\rangle R_- \right) $$
$$ \frac{dQ_-}{d\alpha} = \eta\left( \left\langle (x_1 - x_2)(p_1 - p_2) \right\rangle - \left\langle p_1 + p_2 \right\rangle Q_- \right) + \frac{\eta^2}{2}\left\langle (p_1 - p_2)^2 \right\rangle \qquad (2.10) $$

Here we have used the properties

$$ p_1 + p_2 = 1 \quad \text{and} \quad \langle p_{1,2} \rangle = 1/2 \qquad (2.11) $$

of the input distribution and assignment functions. Note that the dynamics of $R_+$ and $Q_+$ decouples from the remaining equations. Remarkably, Eqs. (2.9) are independent of the specific learning algorithm provided it satisfies the conditions (2.11). Hence, for two prototypes in the presence of the bimodal input distribution (2.4), the temporal evolution of $R_+$ and $Q_+$ is the same for all competitive learning schemes and can be studied beforehand. We obtain the analytic solution

$$ R_+(\alpha) = b + A\, e^{-\eta\alpha/2}, \qquad Q_+(\alpha) = \frac{\eta}{2} + b^2 + 2 b A\, e^{-\eta\alpha/2} + B\, e^{-\eta\alpha}, \qquad (2.12) $$

where the constants $A = R_+(0) - b$ and $B = -\eta/2 + b^2 + Q_+(0) - 2 b R_+(0)$ depend on the initial conditions.


The asymptotic ($\alpha \to \infty$) configuration $R_+ = b$ and $Q_+ = b^2 + \eta/2$ represents the only fixed point of the subsystem (2.9) and is approached exponentially fast with increasing $\alpha$ from all initial settings and for all values of $\eta$. The quantity $R_+$ measures the overlap of the prototype vectors with the sum $(B_1 + B_2)$; it therefore marks the unspecialized identification of the direction in which the center of mass of the input distribution is found. On the contrary, the overlap $R_- = J_1 \cdot (B_1 - B_2) = J_2 \cdot (B_2 - B_1)$ quantifies the specialization of the students, which is, in a way, the genuine aim of competitive learning.
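Because (2.9) is a closed linear subsystem, the analytic solution (2.12) can be checked with a few lines of numerical integration. The sketch below uses arbitrary illustrative values for $\eta$, $b$ and the initial conditions:

```python
import numpy as np

eta, b = 0.5, 1.0                 # illustrative choices
R, Q = 0.0, 0.25                  # R_+(0), Q_+(0)
A = R - b                         # constants of the analytic solution (2.12)
B = -eta / 2 + b**2 + Q - 2 * b * R

dalpha, alpha_max = 1e-3, 20.0
for _ in range(int(alpha_max / dalpha)):
    dR = eta / 2 * (b - R)                     # first equation of (2.9)
    dQ = eta / 2 * (2 * b * R - 2 * Q + eta)   # second equation of (2.9)
    R += dalpha * dR
    Q += dalpha * dQ

R_exact = b + A * np.exp(-eta * alpha_max / 2)
Q_exact = eta / 2 + b**2 + 2 * b * A * np.exp(-eta * alpha_max / 2) + B * np.exp(-eta * alpha_max)
print(f"Euler:    R+ = {R:.6f}, Q+ = {Q:.6f}")
print(f"Analytic: R+ = {R_exact:.6f}, Q+ = {Q_exact:.6f}")   # both approach b and b^2 + eta/2
```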

2.3 The Specialization Process

We will investigate the dynamics of the specialization process for a particularly simple and efficient choice of the labeling functions:

$$ p_k = \prod_{j \neq k} \Theta\!\left(d_j - d_k\right) = \begin{cases} 1 & \text{if } d_k < d_j \text{ for all } j \neq k \\ 0 & \text{else.} \end{cases} \qquad (2.13) $$

The corresponding training scheme updates at each time step only the student with the minimal distance to the current input. All other $K-1$ prototypes remain unchanged; hence the term winner-takes-all algorithm has been coined for this prescription (Hertz et al., 1991). It is identical with an on-line realization of the well-known K-means algorithm (Duda and Hart, 1973; Bishop, 1995). The resulting prescription can be interpreted as the stochastic gradient descent minimization of an instantaneous energy

$$ \varepsilon(\xi^\mu) = \frac{1}{2} \sum_{k=1}^{K} d_k(\xi^\mu)\, p_k - \frac{1}{2}\, (\xi^\mu)^2 \qquad (2.14) $$

which measures the representation error of input vector $\xi^\mu$ in terms of Euclidean distances, see (Hertz et al., 1991; Biehl et al., 1997) for details. As the change of weights is always in the direction $(\xi^\mu - J_k^{\mu-1})$, the updated prototype will be the winner for similar examples later in the sequence with even higher probability. Therefore, the strategy should yield specialized prototypes, each of which represents a region in input space where many examples have been observed in the course of learning.

For two competing students Eq. (2.13) reduces to

$$ p_1 = \Theta(d_2 - d_1), \qquad p_2 = \Theta(d_1 - d_2) = 1 - p_1. \qquad (2.15) $$

By use of the property $p_k\, p_m = p_k\, \delta_{km}$ of the Heaviside function, all averages in Eqs. (2.10) can be performed analytically and one obtains

$$ \frac{dR_-}{d\alpha} = \frac{\eta}{2}\left[ b\, \mathrm{erf}\!\left( \frac{b R_-}{2\sqrt{Q_-}} \right) + \frac{2 R_-}{\sqrt{\pi Q_-}}\, \exp\!\left( -\frac{b^2 R_-^2}{4 Q_-} \right) - R_- \right] $$
$$ \frac{dQ_-}{d\alpha} = \eta\left[ b R_-\, \mathrm{erf}\!\left( \frac{b R_-}{2\sqrt{Q_-}} \right) + 2\sqrt{\frac{Q_-}{\pi}}\, \exp\!\left( -\frac{b^2 R_-^2}{4 Q_-} \right) - Q_- \right] + \frac{\eta^2}{2}. \qquad (2.16) $$
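For illustration, the winner-takes-all prescription can also be simulated directly at finite $N$. The following sketch (all parameter values are illustrative assumptions) draws examples from (2.4), assigns each to the closest prototype according to (2.13), applies the update (2.2) to the winner only, and monitors the specialization $R_-$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, b, eta = 500, 1.0, 0.5                               # illustrative; eta below the critical rate
P = 200 * N                                             # number of on-line examples

B = np.linalg.qr(rng.standard_normal((N, 2)))[0].T      # orthonormal cluster directions B_1, B_2
J = 0.01 * rng.standard_normal((2, N))                  # almost unspecialized initial prototypes

for mu in range(P):
    m = rng.integers(0, 2)
    xi = rng.standard_normal(N) + b * B[m]              # example from cluster m, cf. (2.4)
    d = np.sum((xi - J) ** 2, axis=1)                   # Euclidean distances (2.1)
    k = np.argmin(d)                                    # winner-takes-all assignment (2.13)
    J[k] += (eta / N) * (xi - J[k])                     # update (2.2) of the winner only
    if mu % (20 * N) == 0:
        R = J @ B.T                                     # overlaps R_km = J_k.B_m
        spec = (R[0, 0] - R[0, 1] + R[1, 1] - R[1, 0]) / 2   # specialization R_- (up to relabeling)
        print(f"alpha = {mu / N:6.1f}   R_- = {spec:+.4f}")
```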


Figure 1: (a) Evolution of the order parameters $R$ and $S$ as functions of $\alpha$. (b) The critical learning rate as a line in the $(\eta, b)$-plane, separating the regime without specialization from the one with a non-zero asymptotic value of $R_-$.


In order to gain a theoretical understanding of the observed behavior we study the fixed point structure of (2.16). For all fixed points, the values of $R_+$ and $Q_+$ are given by the asymptotic ($\alpha \to \infty$) form of Eqs. (2.12). We observe that configurations with

$$ Q_-^{(\pm)} = \frac{\left(2 \pm \sqrt{4 + 2\pi\eta}\right)^2}{4\pi}, \qquad R_- = 0 \qquad (2.17) $$

are stationary under the dynamics (2.16). A linearization of the system shows that the fixed point $(0, Q_-^{(-)})$ is always repulsive, whereas $(0, Q_-^{(+)})$ becomes attractive for

$$ \eta > \eta_c = \frac{2}{\pi}\left(b^4 + 2 b^2\right), \qquad (2.18) $$

which defines a critical learning rate of the process. In Fig. 1(b) it is shown as a line in the $(\eta, b)$-plane which separates the region in which no specialization occurs from the one with a non-zero asymptotic value of $R_-$. For small enough learning rates $\eta < \eta_c$ all fixed points with $R_- = 0$ are repulsive and the students will eventually specialize upon presentation of an increasing number of examples. However, generic initial conditions with $R \approx S \approx 0$ will cause the system to approach a state in the vicinity of (2.17). A linearization around the fixed point $(0, Q_-^{(+)})$ shows that the specialization increases exponentially with $\alpha$: $R_- \propto R_-(0)\, e^{\lambda\alpha}$, where $\lambda$ is the relevant eigenvalue of the linearized system. The characteristic time needed to achieve significant specialization $R_- = O(1)$ is therefore proportional to $-\ln[R_-(0)]/\lambda$. This behavior is strongly reminiscent of the plateau states which have been found to delay supervised learning by gradient descent in multilayered networks (Biehl and Schwarze, 1995; Saad and Solla, 1995; Biehl et al., 1996), see other contributions to this volume. Note, however, that the dominant plateau in the winner-takes-all scenario is not characterized by almost identical student vectors. An effective mutual repulsion is imposed on the prototype vectors due to the pronounced competitive nature of the learning algorithm: at each time step only one of the students can be updated, even if $J_1 \approx J_2$. The prototypes separate very fast; however, their projections in the $(B_1, B_2)$-plane are almost equal in the plateau state. Eventually the system approaches a stable configuration where $R_-$ and $Q_-$ satisfy the conditions

$$ R_- = b\, \mathrm{erf}\!\left( \frac{b R_-}{2\sqrt{Q_-}} \right) + \frac{2 R_-}{\sqrt{\pi Q_-}}\, e^{-b^2 R_-^2/(4 Q_-)}, \qquad Q_- = \frac{\eta}{2} + b R_-\, \mathrm{erf}\!\left( \frac{b R_-}{2\sqrt{Q_-}} \right) + 2\sqrt{\frac{Q_-}{\pi}}\, e^{-b^2 R_-^2/(4 Q_-)}, \qquad (2.19) $$

i.e. the stationarity conditions of (2.16) with $R_- \neq 0$.



3 Principal Component Analysis

3.1 The Model

Another important problem in data analysis is the faithful representation of high-dimensional data by low-dimensional feature vectors which contain as much information about the original inputs as possible. One standard method for this task is principal component analysis (PCA). It determines, for a given set of observed data, the eigenvectors of the empirical covariance matrix which correspond to its largest eigenvalues. Projections on these characteristic vectors serve as a useful linear representation of the data, see e.g. (Hertz et al., 1991; Bishop, 1995; Deco and Obradovic, 1996) for the theoretical background. The purpose of competitive learning is to provide (few) typical prototype vectors of the same dimensionality as the (many) original input vectors. On the contrary, PCA aims at the low-dimensional representation of each of the examples by detecting the most relevant features of the data.

Principal component analysis takes into account only first and second moments of the observed data (i.e. their covariance matrix). Hence, we can consider a particularly simple model distribution in the following. Input vectors are taken to consist of random components independently drawn from zero mean Gaussian distributions. We assume that $M$ relevant directions $\{B_i\}_{i=1,\dots,M}$ in $\mathbb{R}^N$ (with $M \ll N$) exist which determine the correlation matrix $C = \langle \xi\, \xi^\top \rangle$:

$$ C = I_N + \sum_{i=1}^{M} \left( b_i^2 + 2 b_i \right) B_i B_i^\top \qquad (3.1) $$

with the $N$-dimensional identity matrix $I_N$. The vectors $\{B_i\}$ are taken to be orthogonal and normalized: $B_i \cdot B_j = \delta_{ij}$. This weakly anisotropic distribution can be interpreted as the result of deforming a single, isotropic Gaussian cluster with data points $\tilde\xi \in \mathbb{R}^N$:

$$ \xi = \tilde\xi + \sum_{i=1}^{M} b_i\, (B_i \cdot \tilde\xi)\, B_i. \qquad (3.2) $$

Distributions of this type have been previously considered in e.g. (Lootens and van den Broeck, 1995). The directions $B_i$ are, by construction, the eigenvectors of $C$ corresponding to eigenvalues $(1 + b_i)^2$. Assuming $b_1 \geq b_2 \geq \dots \geq b_M > 0$ without loss of generality, the set of vectors $B_i$ coincides therefore with the ordered principal components of the data distribution. As before, we assume that the environment generates a sequence of independent example vectors $\xi^\mu$ according to the above input distribution.


A matching number of student vectors $J_l \in \mathbb{R}^N$ ($l = 1, 2, \dots, M$) is updated following Sanger's rule (Sanger, 1989; Hertz et al., 1991) when a new example $\xi^\mu$ is presented:

$$ J_l^\mu = J_l^{\mu-1} + \frac{\eta_l}{N}\, x_l \left( \xi^\mu - \sum_{k=1}^{l} x_k\, J_k^{\mu-1} \right) \qquad (3.3) $$

with the student projections $x_l = J_l^{\mu-1} \cdot \xi^\mu$. The learning rates $\eta_l$ control the step size of the updates for the students. In contrast to the previous section we include the possibility of using different rates for different $J_k$. For small learning rates $\eta_l \to 0$ one can show that Sanger's rule yields normalized vectors (Hertz et al., 1991). Throughout this paper, however, we assume that an explicit normalization at each time step guarantees $(J_l^\mu)^2 = 1$ for all $\mu$. This introduces additional terms of order $\eta_l/N$ in the full form of the algorithm, see (Biehl, 1994) for the case $M = 1$.

The prescription (3.3) leads to an ordering of the student vectors which (in general) corresponds to the identification of one $B_i$ by each student when a large number of examples is presented (Hertz et al., 1991). An alternative scheme was suggested by Oja (Oja, 1982; Oja and Karhunen, 1985; Hertz et al., 1991; Oja, 1996) and has been proven to provide an arbitrary basis of the subspace spanned by the $\{B_i\}$; the actual result depends on the initial choice of the $J_l(0)$.

Again, the analysis is based on the fact that the quantities $x_k$ and $y_j = B_j \cdot \xi$ (indices $\mu$ omitted) are Gaussian variables. In the thermodynamic limit this holds true for more general input vectors consisting of non-Gaussian random components $\xi_i$ with the same second order statistics. Here, the relevant moments are

$$ \langle x_k\, y_l \rangle = (1 + b_l)^2\, R_{kl}, \qquad \langle y_k\, y_l \rangle = \delta_{kl}\, (1 + b_k)^2, \qquad (3.4) $$
$$ \langle x_k\, x_l \rangle = Q_{kl} + \sum_{i=1}^{M} \left( b_i^2 + 2 b_i \right) R_{li}\, R_{ki}, \qquad (3.5) $$

and $\langle x_k \rangle = \langle y_k \rangle = 0$ for all $k$. The order parameters $R_{kl} = J_k \cdot B_l$ and $Q_{kl} = J_k \cdot J_l$ are defined as in the previous section, but with all $Q_{kk} = 1$ due to the above mentioned normalization.
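A minimal finite-$N$ sketch of this procedure is given below; the dimension $N$, the anisotropies $b_i$, and the learning rates $\eta_l$ are illustrative assumptions. Data are generated by deforming an isotropic Gaussian according to (3.2), and each student is renormalized explicitly after every step of (3.3):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 400, 2
b = np.array([1.0, 0.8])                                # illustrative anisotropies b_1 >= b_2 > 0
eta = np.array([0.1, 0.1])                              # learning rates eta_l

Bmat = np.linalg.qr(rng.standard_normal((N, M)))[0].T   # orthonormal principal directions B_i
J = rng.standard_normal((M, N))
J /= np.linalg.norm(J, axis=1, keepdims=True)           # normalized students, (J_l)^2 = 1

def example():
    """Deform an isotropic Gaussian cluster according to (3.2)."""
    xi0 = rng.standard_normal(N)
    return xi0 + (b * (Bmat @ xi0)) @ Bmat

def sanger_step(J, xi):
    """One on-line step of Sanger's rule (3.3) with explicit normalization of each student."""
    x = J @ xi                                          # student projections x_l
    J_new = J.copy()
    for l in range(M):
        recon = x[: l + 1] @ J[: l + 1]                 # sum_{k <= l} x_k J_k
        J_new[l] = J[l] + (eta[l] / N) * x[l] * (xi - recon)
        J_new[l] /= np.linalg.norm(J_new[l])            # keep (J_l)^2 = 1
    return J_new

for mu in range(1, 300 * N + 1):
    J = sanger_step(J, example())
    if mu % (50 * N) == 0:
        print(f"alpha = {mu / N:5.0f}  diag R =", np.round(np.diag(J @ Bmat.T), 3))
```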

3.2 The Dynamics of Learning

The analysis of the temporal evolution of order parameters proceeds by deriving a system of $(3M^2 - M)/2$ coupled first order differential equations, which reads


$$ \frac{dR_{lj}}{d\alpha} = \eta_l\, \langle x_l y_j \rangle - \left( \eta_l + \frac{\eta_l^2}{2} \right) \left\langle x_l^2 \right\rangle R_{lj} - \eta_l \sum_{k=1}^{l-1} \langle x_l x_k \rangle \left( R_{kj} - Q_{lk}\, R_{lj} \right) $$

$$ \frac{dQ_{lj}}{d\alpha} = \left( \eta_l + \eta_j + \eta_l \eta_j \right) \langle x_l x_j \rangle - \left[ \left( \eta_l + \frac{\eta_l^2}{2} \right) \left\langle x_l^2 \right\rangle + \left( \eta_j + \frac{\eta_j^2}{2} \right) \left\langle x_j^2 \right\rangle \right] Q_{lj} $$
$$ \qquad\qquad - \eta_l \sum_{k=1}^{l-1} \langle x_l x_k \rangle \left( Q_{kj} - Q_{kl}\, Q_{lj} \right) - \eta_j \sum_{k=1}^{j-1} \langle x_j x_k \rangle \left( Q_{lk} - Q_{kj}\, Q_{lj} \right) \qquad (3.6) $$

where $l, j = 1, 2, \dots, M$ ($l \neq j$ in the second equation). Note that all averages are given in (3.4), (3.5). Therefore, a closed set of differential equations is obtained for arbitrary $M$ ($M \ll N$). In addition to the numerical integration of (3.6), an analytic treatment of its fixed point properties allows for an investigation of the system in the limit $\alpha \to \infty$. Further, plateau-like states in the transient dynamics can be studied.

As a measure of success we consider the deviation of the linear reconstruction

$$ \xi_{est} = \sum_{i=1}^{M} x_i\, J_i \qquad (3.7) $$

from the original data $\xi$. The expectation value of the associated quadratic error is minimized for $\{J_i = B_i\}$ ($i = 1, \dots, M$) or whenever the two sets of vectors span the same subspace. The simple linear combination of vectors $J_k$ (3.7) coincides with the optimal (with respect to the quadratic error) linear reconstruction only asymptotically, i.e. for $\{J_k \to B_k\}$ (Bishop, 1995; Deco and Obradovic, 1996). Nevertheless, (3.7) can serve as a rough measure of the achieved quality of the representation. The estimation error on average over the assumed input distribution is given by

$$ \varepsilon_{est} = \frac{1}{2} \left\langle \left( \xi_{est} - \xi \right)^2 \right\rangle - \frac{1}{2} \left\langle \xi^2 \right\rangle = - \sum_{k=1}^{M} \left\langle x_k^2 \right\rangle + \frac{1}{2} \sum_{k,l=1}^{M} \left\langle x_k x_l \right\rangle Q_{kl} \qquad (3.8) $$

and can be expressed in terms of the order parameters by means of (3.4), (3.5). Note that an irrelevant constant $\langle \xi^2 \rangle/2$ has been subtracted in the definition of $\varepsilon_{est}$. In Figure 3(a) a generic example of the temporal evolution of the diagonal overlaps $R_{ll}$ and of the cost function is shown. For small enough learning rates $\eta_l$ the following attractive fixed point configuration is approached asymptotically (with $\alpha \to \infty$):

$$ R_{ll} = \sqrt{ \frac{b_l^2 + 2 b_l - \eta_l/2}{\left( b_l^2 + 2 b_l \right) \left( 1 + \eta_l/2 \right)} }\,, \qquad R_{lj} = Q_{lj} = 0 \ \text{ for } l \neq j. \qquad (3.9) $$

This configuration reflects the identification of one specific principal component by each student. However, the achievable absolute values $|R_{ll}|$ remain smaller than 1 for non-zero learning rates ($\eta_l = 0.1$ for all $l$ in Fig. 3(a)).


Figure 3: (Sanger's rule) (a) A typical learning curve for $M = 3$: the representation error $\varepsilon_{est}$ decreases in a cascade-like manner; the corresponding evolution of the diagonal order parameters $R_{kk}$ is displayed in the inset. Here, $\eta_l = 0.1$ and all $R_{jk}(0) = 0.06$ plus random deviations of order $O(10^{-10})$. (b) The evolution of all overlaps in the case $M = 2$ with initial conditions $R_{lk}(0) = O(10^{-2})$, $Q_{12}(0) = O(10^{-10})$ and $X(0) = O(10^{-4})$ (cf. Eq. (3.11)); learning rates $\eta_l = 0.1$ for both students.


Very small values of $\eta_l$ yield good learning success, but many examples are needed. With larger $\eta_l$ learning is fast, but the asymptotic error remains large. Better results could be achieved by applying an appropriate annealing schedule for $\alpha$-dependent learning rates $\eta_l(\alpha)$. For step sizes

$$ \eta_l > \eta_l^c = 2\, b_l \left( b_l + 2 \right) \qquad (3.10) $$

configurations with the corresponding overlaps $R_{ll} = 0$ become stable, which is analogous to the result of (Biehl, 1994) for $M = 1$. Throughout the following we will assume that all learning rates are smaller than their respective critical value. Hence, (3.9) is the only attractive configuration of the system.
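In the thermodynamic limit, learning curves like those of Fig. 3(a) can be reproduced by integrating the order-parameter equations (3.6) directly, with the averages expressed through (3.4) and (3.5). The sketch below is a simple Euler integration with illustrative values of $M$, $b_i$ and $\eta_l$; it prints the diagonal overlaps $R_{ll}$ and the estimation error (3.8):

```python
import numpy as np

# Euler integration of the order-parameter equations (3.6); all parameters are illustrative.
M = 3
b = np.array([1.0, 0.9, 0.8])                   # anisotropies b_1 >= b_2 >= b_3 > 0
eta = np.full(M, 0.1)                           # equal learning rates
rng = np.random.default_rng(3)

R = 0.06 + 1e-6 * rng.standard_normal((M, M))   # small, almost symmetric initial overlaps
Q = np.eye(M)                                   # Q_kk = 1 by normalization, Q_lj(0) ~ 0

def moments(R, Q):
    """Averages (3.4), (3.5) as functions of the order parameters."""
    lam = b**2 + 2 * b                          # (1 + b_i)^2 - 1
    xy = R * (1 + b)**2                         # <x_k y_l> = (1 + b_l)^2 R_kl
    xx = Q + R @ np.diag(lam) @ R.T             # <x_k x_l> = Q_kl + sum_i lam_i R_ki R_li
    return xy, xx

dalpha = 0.01
for step in range(int(400.0 / dalpha)):
    xy, xx = moments(R, Q)
    dR = np.zeros((M, M))
    dQ = np.zeros((M, M))
    for l in range(M):
        for j in range(M):
            s = sum(xx[l, k] * (R[k, j] - Q[l, k] * R[l, j]) for k in range(l))
            dR[l, j] = (eta[l] * xy[l, j]
                        - (eta[l] + eta[l]**2 / 2) * xx[l, l] * R[l, j]
                        - eta[l] * s)
            if l != j:
                s1 = sum(xx[l, k] * (Q[k, j] - Q[k, l] * Q[l, j]) for k in range(l))
                s2 = sum(xx[j, k] * (Q[l, k] - Q[k, j] * Q[l, j]) for k in range(j))
                dQ[l, j] = ((eta[l] + eta[j] + eta[l] * eta[j]) * xx[l, j]
                            - ((eta[l] + eta[l]**2 / 2) * xx[l, l]
                               + (eta[j] + eta[j]**2 / 2) * xx[j, j]) * Q[l, j]
                            - eta[l] * s1 - eta[j] * s2)
    R += dalpha * dR
    Q += dalpha * dQ
    if step % 5000 == 0:
        eps = -np.trace(xx) + 0.5 * np.sum(xx * Q)      # estimation error (3.8)
        print(f"alpha = {step * dalpha:6.1f}  R_ll = {np.round(np.diag(R), 3)}  eps_est = {eps:.3f}")
```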

3.3 Symmetries and Specialization

Additional repulsive fixed points of the system exist due to underlying symmetries of the learning problem. After relabeling the students, all these repulsive states are characterized by (3.9) with some or all of the diagonal $R_{ll} = 0$. The influence of these repulsive states on the learning dynamics is exemplified in Fig. 3(a): before approaching its asymptotic configuration (3.9), the system is trapped close to a number of such fixed points. The intermediate configurations are almost stationary and resemble the plateau states observed in supervised learning and in the context of the competitive algorithm (see the previous section).


To begin with, we discuss the relevance and structure of the plateau-like configurations in terms of the model with two student vectors ($M = 2$) and equal learning rates $\eta_1 = \eta_2 = \eta$. First we note that for any initial configuration with $R_{11}(0) = 0$, this overlap will remain zero in the course of learning. This property of Sanger's rule is already apparent in a system with only one student (Biehl, 1994), since the update of $J_1$ is independent of all other vectors $J_k$ ($k > 1$). A non-zero $R_{11}(0)$ will enable the system to achieve the asymptotic value given in (3.9). In realistic situations one would expect $R_{11} = O(1/\sqrt{N})$, indicating that a characteristic time of the order $\ln N$ is needed to produce a significant overlap $R_{11}$. Here we focus on the dynamics of the subsequent students and assume a fairly large $R_{11}(0)$ throughout the following.

Because of the hierarchical structure of the algorithm (3.3) the above property does not simply carry over to other overlaps. For instance, $dR_{22}/d\alpha \neq 0$ holds true even with $R_{22} = 0$, in general. Instead, we can show that the specific symmetry measure $X = (R_{11} R_{22} - R_{12} R_{21})$ satisfies

$$ \frac{dX}{d\alpha} = 0 \quad \text{for} \quad X = 0. \qquad (3.11) $$

A value of $X = 0$ corresponds to unspecialized vectors $J_i$ with linearly dependent projections in the space spanned by $B_1$ and $B_2$. For a set of students with $X(0) = 0$ it is impossible to identify both principal components because the conservation of the symmetry (3.11), together with the effective orthogonalization for $\alpha \to \infty$, enforces

$R_{22} \to 0$ and $R_{21} \to 0$ when $R_{11}$ increases in the course of learning. This is possible even with $R_{22}(0), R_{21}(0) > 0$. Such a case is illustrated in Fig. 3(b), which demonstrates the effective loss of initial overlaps. A linearization of the equations of motion around the specific configuration $R_{jk} = 0$ for $j, k = 1, 2$ and $Q_{12} = 0$ shows that a small, non-zero $|X|$ increases exponentially with $\alpha$. Eqs. (3.6) decouple in the vicinity of this configuration and one obtains

$$ X(\alpha) = X(0)\, e^{\lambda\alpha} \quad \text{with} \quad \lambda = \eta\left( b_1^2 + 2 b_1 + b_2^2 + 2 b_2 \right) - \eta^2. \qquad (3.12) $$

It is interesting to note that the learning rate at which $\lambda$ becomes zero is given by $(\eta_1^c + \eta_2^c)/2$ with $\eta_l^c$ from Eq. (3.10). Thus, the stability of $X(0) = 0$ is directly linked to the critical rates associated with the diagonal overlaps themselves. The characteristic time which is needed to achieve a nonzero $X = O(1)$, i.e. to leave the repulsive fixed point, is proportional to $-\ln|X(0)|/\lambda$.
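The plateau length predicted in this way is easily evaluated; the following small helper (with illustrative parameter values) returns the characteristic escape time $-\ln|X(0)|/\lambda$ with $\lambda$ from (3.12):

```python
import math

def plateau_length(X0, b1, b2, eta):
    """Characteristic escape time -ln|X(0)| / lambda, with lambda from (3.12)."""
    lam = eta * (b1**2 + 2 * b1 + b2**2 + 2 * b2) - eta**2
    return -math.log(abs(X0)) / lam

# Illustrative values: each additional decade in X(0) adds the same amount to the plateau length.
for X0 in (1e-3, 1e-5, 1e-7):
    print(f"X(0) = {X0:.0e}  ->  plateau length ~ {plateau_length(X0, b1=1.0, b2=0.8, eta=0.1):.1f}")
```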


Figure 4: (Sanger's algorithm) (a) The plateau length of the learning process ($M = 2$) depends logarithmically on $X$ (from left to right: $-\log_{10} X = 3, 4, 5, \dots, 10$). Apart from the deviations $X$, initial conditions are the same as in Fig. 3(b). (b) Average estimation error ($M = 3$) for equal learning rates as in Fig. 3(a) (line a), learning rates $\eta_1 = 1$, $\eta_2 = 1.009\,\eta_1$ and $\eta_3 = 0.99\,\eta_1$ (line b), and $\eta_1 = 1$, $\eta_2 = 1.09\,\eta_1$ and $\eta_3 = 0.9\,\eta_1$ (line c).


This logarithmic dependence of the plateau length is displayed in Fig. 4(a) for a number of initial values $X(0)$. We would like to point out that the symmetry $X = 0$ (Eq. (3.11)) is preserved in the course of learning only if the learning rates are identical ($\eta_1 = \eta_2$). The relation

$$ \frac{dX}{d\alpha} = a\, X + b\, (\eta_1 - \eta_2) \qquad (3.13) $$

indicates that $|X|$ will increase with $\alpha$ even for $X(0) = 0$ as soon as $\eta_1 \neq \eta_2$. The coefficients $a$ and $b$ are in general non-zero and depend on the $R_{kj}$ and the $Q_{kl}$. Small differences $|\eta_1 - \eta_2|$ enable the system to escape from the symmetric configuration after a time of order $-\ln|\eta_1 - \eta_2|$. Note that the effect is not related to the natural ordering of the principal components: the improvement in comparison with $\eta_1 = \eta_2$ does not depend critically on which of the learning rates is taken to be the larger one.

In a system of $M > 2$ students and $\eta_l = \eta$ ($l = 1, 2, \dots, M$), statements analogous to Eqs. (3.11), (3.12) can be made with respect to

$$ X^{(M)} = \begin{vmatrix} R_{11} & R_{12} & \dots & R_{1M} \\ R_{21} & R_{22} & \dots & R_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ R_{M1} & R_{M2} & \dots & R_{MM} \end{vmatrix}. \qquad (3.14) $$


The full determinant of overlaps $R_{lm}$ as well as the determinants of all upper-left submatrices $X^{(k)}$ (related to the first $k$ students and eigenvectors) are found to satisfy

$$ \frac{dX^{(k)}}{d\alpha} = 0 \quad \text{if} \quad X^{(k)} = 0 \qquad (k = 1, 2, \dots, M). \qquad (3.15) $$

This reflects again the hierarchical structure of Sanger's algorithm. Note that for $k = 1, 2$ the above discussed properties of the overlap $R_{11}$ and the symmetry $X = X^{(2)}$ (3.11) are included here. In Figures 3(a) and 4(b) (line a) it is shown how, in a system of $M = 3$ students with $\eta_1 = \eta_2 = \eta_3$, two subsequent plateaus are visited which are characterized by $X^{(2)} = X^{(3)} \approx 0$ and $X^{(3)} \approx 0$, respectively. The use of only slightly different learning rates already breaks these symmetries very efficiently, as can be seen in Fig. 4(b) (lines b and c).
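In a simulation or a numerical integration, the symmetry measures $X^{(k)}$ of (3.14) are obtained directly as determinants of the upper-left submatrices of the overlap matrix $R$. A small helper, with made-up numbers for illustration:

```python
import numpy as np

def symmetry_measures(R):
    """Determinants X^(k) of the upper-left k x k submatrices of R, cf. (3.14);
    X^(1) = R_11 and X^(2) is the quantity X of Eq. (3.11)."""
    return [np.linalg.det(R[:k, :k]) for k in range(1, R.shape[0] + 1)]

# Nearly unspecialized M = 3 configuration (illustrative numbers): the rows are almost
# linearly dependent, so X^(2) and X^(3) are tiny and the corresponding plateaus are long.
R = np.array([[0.30, 0.29, 0.31],
              [0.31, 0.30, 0.29],
              [0.29, 0.31, 0.30]])
print(["%.1e" % x for x in symmetry_measures(R)])
```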

4 Summary and Outlook

In summary, we have discussed two solvable models of unsupervised learning from high-dimensional data. In the thermodynamic limit, the dynamics of the considered on-line learning processes is described in terms of differential equations for a small number of order parameters. Here we have focused on the process of student specialization in the course of learning. In both scenarios, plateau-like intermediate states of the system can dominate the time needed for successful training. These plateaus are due to the existence of weakly repulsive fixed points of the dynamics and reflect characteristic underlying symmetries.

In particular, we have studied the determination of prototype vectors from clustered example data by means of competitive learning. The investigated winner-takes-all procedure is identical with the on-line realization of the prominent K-means algorithm. It assigns each input deterministically to the prototype which is closest in distance. Obviously it is irrelevant precisely which of the prototypes represents which data cluster. This is similar to the permutation symmetry obeyed by the hidden nodes in a fully connected multilayered neural network and has analogous consequences.

The investigation of Sanger's rule for principal component analysis shows that repulsive fixed points and plateaus exist even though the algorithm imposes, by construction, a natural ordering on the student vectors. We have identified the relevant underlying symmetries and studied their effect on the learning dynamics. Here, the effect of choosing different step sizes for different students has been demonstrated. Preliminary results show a similar, drastic improvement for the supervised training of over-sophisticated students, i.e. multilayer nets with an inappropriately large number of hidden units (Schwarze, 1998).


Further investigations shall address more complex model situations, for instance competitive learning with more than two clusters and prototypes. In particular, situations with a number of students which does not match the structure of the example data should be interesting. Possible modifications of the competitive learning algorithm replace the hard step functions in Eq. (2.13) by a soft minimum type of assignment. As an example one could consider a stochastic gradient ascent maximization of the likelihood associated with a model distribution of the form (2.4), see e.g. (Bishop, 1995).

By introducing a topology in the space of prototypes it should be possible to extend the analysis to the dynamics of self-organized feature maps, see e.g. (Hertz et al., 1991) for an introduction and related references. Non-linear extensions of principal component analysis (Oja et al., 1996) could be studied, which take into account higher moments of the presented data. In a sense such algorithms fill in the gap between the methods of prototype identification and low-dimensional representation. Finally, for all these learning algorithms the use of a proper annealing schedule for the learning rates seems promising.

References

Amari, S., 1967, IEEE Trans. Elect. Comput. EC-16, 299
Amari, S., 1993, Neurocomputing 5, 185
Barkai, N. and H. Sompolinsky, 1994, Phys. Rev. E 50, 1766
Biehl, M. and A. Mietzner, 1994, J. Phys. A 27, 1885
Biehl, M., 1994, Europhys. Lett. 30, 391
Biehl, M. and H. Schwarze, 1995, J. Phys. A 28, 643
Biehl, M., P. Riegler, and C. Wohler, 1996, J. Phys. A 29, 4769
Biehl, M., A. Freking, and G. Reents, 1997, Europhys. Lett. 38, 1
Biehl, M. and E. Schlosser, 1998, J. Phys. A 31, L97
Bishop, C.M., 1995, 'Neural Networks for Pattern Recognition' (Clarendon Press, Oxford)
Copelli, M. and N. Caticha, 1995, J. Phys. A 28, 1615
Copelli, M., R. Eichhorn, O. Kinouchi, M. Biehl, R. Simonetti, P. Riegler, and N. Caticha, 1997, Europhys. Lett. 37, 432
Deco, G. and D. Obradovic, 1996, 'An Information-Theoretic Approach to Neural Computing' (Springer, Berlin)


Duda, R.O. and P.E. Hart, 1973, 'Pattern Classification and Scene Analysis' (Wiley, New York)
Hertz, J.A., A. Krogh, and R.G. Palmer, 1991, 'Introduction to the Theory of Neural Computation' (Addison Wesley, Redwood City, CA)
Kim, J. and H. Sompolinsky, 1996, Phys. Rev. Lett. 76, 3021
Kinouchi, O. and N. Caticha, 1992, Phys. Rev. E 26, 6243
Kinzel, W. and P. Rujan, 1990, Europhys. Lett. 13, 473
Lootens, E. and C. van den Broeck, 1995, Europhys. Lett. 30, 381
Marangi, C., M. Biehl, and S.A. Solla, 1995, Europhys. Lett. 30, 117
Meir, R., 1995, Neural Comp. 7, 144
Oja, E., 1982, J. Math. Biol. 15, 267
Oja, E. and J. Karhunen, 1985, J. Math. An. Appl. 106, 69
Oja, E., J. Karhunen, L. Wang, and R. Vigario, 1996, in 'Neural Nets, WIRN Vietri-95', eds. M. Marinaro and R. Tagliaferri (World Scientific, Singapore)
Opper, M., 1996, Phys. Rev. Lett. 77, 4671
Opper, M. and W. Kinzel, 1996, in 'Models of Neural Networks', Vol. III, eds. E. Domany, J.L. van Hemmen, and K. Schulten (Springer, Berlin)
Reents, G. and R. Urbanczik, 1998, 'Self-averaging and On-line learning', preprint Universität Würzburg
Saad, D. and S.A. Solla, 1995, Phys. Rev. Lett. 74, 4337 and Phys. Rev. E 52, 4225
Sanger, T.D., 1989, Neural Networks 2, 549
Schwarze, S., 1998, diploma thesis, Universität Würzburg
Sompolinsky, H., N. Barkai, and H.S. Seung, 1995, in 'Neural Networks: The Statistical Mechanics Perspective', eds. J.H. Oh, C. Kwon, and S. Cho (World Scientific, Singapore)
van den Broeck, C. and P. Reimann, 1996, Phys. Rev. Lett. 76, 2188
Watkin, T.L.H. and J.-P. Nadal, 1994, J. Phys. A 27, 1889
Watkin, T.L.H., A. Rau, and M. Biehl, 1993, Rev. Mod. Phys. 65, 499
