

4286 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 22, NO. 11, NOVEMBER 2013

A Gaussian Process Guided Particle Filter for Tracking 3D Human Pose in Video

Suman Sedai, Mohammed Bennamoun, Member, IEEE, and Du Q. Huynh, Member, IEEE

Abstract—In this paper, we propose a hybrid method that combines Gaussian process learning, a particle filter, and annealing to track the 3D pose of a human subject in video sequences. Our approach, which we refer to as annealed Gaussian process guided particle filter, comprises two steps. In the training step, we use a supervised learning method to train a Gaussian process regressor that takes the silhouette descriptor as an input and produces multiple output poses modeled by a mixture of Gaussian distributions. In the tracking step, the output pose distributions from the Gaussian process regression are combined with the annealed particle filter to track the 3D pose in each frame of the video sequence. Our experiments show that the proposed method does not require initialization and does not lose tracking of the pose. We compare our approach with a standard annealed particle filter using the HumanEva-I dataset and with other state of the art approaches using the HumanEva-II dataset. The evaluation results show that our approach can successfully track the 3D human pose over long video sequences and give more accurate pose tracking results than the annealed particle filter.

Index Terms—3D human pose tracking, Gaussian process regression, particle filter, hybrid method.

I. INTRODUCTION AND MOTIVATION

IMAGE and video-based human pose estimation and tracking is a popular research area due to its large number of applications including surveillance, security and human computer interaction. For example, in video-based smart surveillance systems, 3D poses can be used to infer the action of a subject in a scene and detect abnormal behaviors. Pose tracking can also provide an advanced human computer interface for gaming and virtual reality applications. In addition, 3D poses of a person computed over a number of video frames can be useful for biometric applications to recognize a person. These applications require only simple video cameras or still images as input and, as a result, provide a low cost solution in contrast to marker-based systems.

Manuscript received January 5, 2012; revised September 15, 2012 and March 13, 2013; accepted June 12, 2013. Date of publication July 4, 2013; date of current version September 12, 2013. This work was supported in part by the ARC Discovery Project under Grant DP0771294. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Nikolaos V. Boulgouris.

S. Sedai was with the University of Western Australia, Crawley, WA 6009, Australia. He is now with IBM Research Australia, Melbourne 3000, Australia (e-mail: ssedai@0ca8f1da71fe910ef02df81c).

M. Bennamoun and D. Q. Huynh are with the University of Western Australia, Crawley, WA 6009, Australia (e-mail: mohammed.bennamoun@0ca8f1da71fe910ef02df81c.au; du.huynh@0ca8f1da71fe910ef02df81c.au).

Color versions of one or more of the figures in this paper are available online at 0ca8f1da71fe910ef02df81c.

Digital Object Identifier 10.1109/TIP.2013.2271850

Human pose estimation and tracking systems can be broadly classified into three different approaches: discriminative, generative and hybrid methods.

In generative methods, the output pose is estimated by searching the solution space for a pose that best explains the observed image features [1], [2]. In this approach, a generative model is constructed which measures how close the hypothesized pose is to the observed image features. A hypothesized pose that is most consistent with the observed image features is chosen as the output pose. The particle filter [2], [3] is a generative tracking method that estimates the pose in each frame using an estimate of the previous frame and the motion information. Given an image of a human subject, there are multiple 3D human poses associated with the image. Such pose ambiguities can be resolved using images from multiple camera views [4].

In discriminative methods, a regression function that maps image features to the pose parameters is obtained using supervised learning [5], [6]. Although discriminative methods can quickly estimate the 3D pose, they can produce incorrect predictions for new inputs if the system is trained using small datasets. Moreover, the relationship between image features and the human pose is often multimodal. For example, when the human silhouette is used as an image feature, one silhouette can be associated with more than one pose, resulting in ambiguities. In such cases, multiple discriminative models are needed to build one-to-many relationships between image features and poses [5], [7].

Discriminative methods are powerful for specific tasks such as human pose estimation as they are only based on the mapping from the input to the desired output. The generative methods, on the other hand, are flexible because they provide room for using partial knowledge of the solution space and exploit the human body model to explore the solution space. Due to their different ways of predicting the final output, the two methods are considered to complement each other. Thus hybrid generative and discriminative methods have been shown to have the potential to improve pose estimation performance. As a result, they have gained more attention recently.

In hybrid methods, such as the ones presented in Refs. [8], [9], a discriminative mapping function is used to generate a pose hypothesis space. The pose is then estimated by searching the hypothesis space using a generative method. The method of Ref. [8] is only based on pose estimation from a single image and is therefore unable to track the pose from a video sequence. Moreover, these methods assume that the pose hypothesis space generated from the discriminative model is always correct and hence fail to handle the case when the discriminative model predicts incorrect poses.

1057-7149 © 2013 IEEE

In this paper, we propose a hybrid discriminative and generative method to track the 3D human pose from both single and multiple cameras. For the discriminative model, we use a mixture of Gaussian Process (GP) regression models, which are obtained by training GP models in different regions of the pose space. The GP regression has the advantage of being able to give a probabilistic estimate of the 3D human pose. It provides an effective way of incorporating more confident discriminative predictions into the tracking process while discarding the uncertain ones. To the best of our knowledge, this is the first hybrid method that takes into account the predictive uncertainty of the discriminative model and combines it with a generative model to improve pose estimation. We treat the probabilistic output from the GP regression as one component of the hypothesis space. In the tracking step, we combine this hypothesis space with the hypothesis space obtained from the motion model and the search for the optimal pose is performed using an annealed particle filter.

A major contribution of this paper is the introduction of a novel method to combine Gaussian Process regression and annealed particle filter for 3D human pose tracking. Our method can probabilistically discard uncertain predictions from the regression model and hence only use predictions that are likely to be correct to track the 3D pose. Moreover, our method can resolve ambiguities when motion information is used along with the image cues from multiple views during pose tracking.

The organization of the paper is as follows. Section II presents the related work. Section III presents the details of the training of the discriminative model. Section IV presents the proposed AGP-PF method for 3D human pose tracking. Section V provides the experimental results and Section VI concludes the paper.

II. BACKGROUND AND RELATED WORK

In discriminative estimation, a direct mapping between image features and the pose parameters is learned using a training dataset. Examples of discriminative learning include nearest neighbor based (example based) regression [6] and sparse regression [10], [11]. Often the relationship between the image features and the pose space is multimodal. The multimodal relationship is established by learning the one-to-many mapping functions that are defined in terms of a mixture of sparse regressors [5], [7] or Gaussian Process regression [12], [13]. This setting produces a multiplicity of the pose solutions that can either be ranked using a gating function [14] or be verified using an observation likelihood [8] or be disambiguated using temporal constancy [11]. For effective pose estimation using a discriminative approach, it is important that image features are compact and distinctive. Approaches such as [15], [16] use a metric learning method to suppress the feature components that are irrelevant to pose estimation. Recent research shows that feature selection and pose estimation can be carried out using regression trees [17] and random forest [18]. Other methods such as [19], [20] use dimensionality reduction to make the feature vector more distinctive for pose estimation. The fusion of multiple image cues based on regression has also been used to improve the pose estimation performance [21], [22]. In another approach [23], 2D trajectories of the human limbs in a video are used as features and the mapping between the trajectory features and the 3D poses is modelled using Gaussian Process regression.
It has also been shown that pose estimation performance can be improved by taking into account the dependencies between the output dimensions, e.g., using structural SVM [24] and Gaussian Process regression [25]. Dimensionality reduction can also be used to address the correlation between output dimensions, e.g., the training of the mapping function is performed in a lower dimensional subspace of features and pose so as to make use of unlabelled data [26].

In generative inference, the pose that best explains the observed image features is determined from a population of hypothesized poses. A search algorithm is used to search the prior pose space for the pose that exhibits the maximum likelihood value for the observed image feature [1], [27]. Generative methods thus consist of three basic components: a state prior (pose hypotheses), an observation likelihood model (matching function), and a search algorithm.

The search algorithm can be local or global in nature. Most of the local search algorithms employ Newton's optimization method [4], [28]. Global search methods are based on stochastic sampling techniques where the solution space is represented by a set of points sampled according to some hypotheses. Successful global search algorithms include annealing [27], Markov Chain Monte-Carlo [29], covariance scaled sampling [1], and Dynamic Programming [30]. In order to refine the pose tracking, the sampling based global search techniques have been combined with some local search techniques [31], [32]. In order to reduce the search space, the pose prior space based on a predefined set of activities, such as walking and jogging, has been used [33]. Pose priors based on models that constrain the human body pose to physically plausible configurations have also been used to reduce the search space [34]. In order to compute the likelihood of a pose hypothesis, image features, such as silhouettes, edges [1], [27], color [29], [30] and learned features [35], have been used. First, the human body model corresponding to the hypothesized pose is projected onto the image feature space to obtain a model feature (a.k.a. template feature). The likelihood value is computed as a similarity measure between the observed image feature and the template feature. For 3D pose estimation, human body models based on cylinders [33], [36], superquadrics [37] and 3D surfaces [4], [9] are commonly used. For 2D pose estimation, a simple rectangular cardboard model has been used [38], [39].

To reduce the complexity of tracking in higher dimensions, prior models based on a lower-dimensional representation of the pose have been used. The commonly used models include linear models such as the Principal Component Analysis [33] and non-linear approximation such as the Mixture of Factor Analysers [40]. In another approach [41], the Restricted Boltzmann Machine is used to model human motion in a discrete latent space. A more commonly used non-linear dimensionality reduction technique for pose tracking is the Gaussian Process Latent Variable Model (GPLVM) [42], [43], which does not always provide a smooth latent space as required for tracking. To overcome this problem, a Gaussian Process Dynamical Model (GPDM) [44] which uses non-linear regression to model the motion dynamics in a latent space in a GPLVM framework can be employed. Tracking in the latent space often has a lower computational complexity; however, it has also been argued that models based on dimensionality reduction have limited capacity [45].

In the hybrid methods, discriminative and generative approaches are combined together to exploit their complementary power to predict the pose more accurately. To combine these two methods, the observation likelihood obtained from a generative model is used to verify the pose hypotheses obtained from the discriminative mapping functions for pose estimation [8], [9]. In other work, e.g., [45], generative and discriminative models are iteratively trained using Expectation Maximization (EM). In each step of the EM, the predictions made by one model are used to train the other model. Recently, the work of Ref. [46] combines the discriminative and generative models by applying distance constraints to the predictions made by the discriminative model and verifying them using image likelihoods. None of these approaches utilizes the prediction uncertainty of the discriminative model to track the 3D human pose. In this paper, we propose a hybrid method that takes into account the prediction uncertainty of the discriminative model and effectively combines it with the generative model.

We use Gaussian Process regression as the discriminative model. The probabilistic output poses from the discriminative model are integrated with a particle filter and a subsequent annealing step for 3D human pose tracking. Consequently, the pose tracking performance is improved, since the search space is reduced to only include the correct hypotheses generated from the discriminative model combined with the hypotheses of the motion model. This is an extension of our previous work [47], where we used a Relevance Vector Machine as a discriminative model combined with a particle filter for 2D pose tracking. GP regression has been used in Ref. [48] to learn the dynamical model and the observation likelihood model for object tracking. However, their method does not involve a mapping from the feature space to the output space. Our method, on the other hand, uses GP regression to learn a discriminative model which gives a conditional distribution of the 3D human pose. In this paper, we use a generic form of the motion model and observation likelihood model for tracking. In another purely discriminative filtering-based approach [49], the unreliability of the observation is modelled using a probabilistic classifier to regulate the predictions from multiple regressors. Our method, on the other hand, models the prediction uncertainty by the variance of the prediction from the Gaussian Process regressor and hence does not require a separate classifier to model the unreliability. Furthermore, unlike [49], our method incorporates the image likelihoods obtained from the generative model to verify the pose hypotheses and to estimate the target state distribution.

Fig. 1. A block diagram showing the training of the mixture of Gaussian Process regression models.

III. TRAINING STEP

In the training step, we use supervised learning to construct a multimodal mapping between the shape descriptor space and the 3D pose space. Once the mapping model is trained, the multimodal 3D poses can be estimated using the trained model. A block diagram showing the training of our mixture of Gaussian Process regressors is shown in Fig. 1. Given the training images, we first extract the silhouette images using background subtraction [50]. Then, we extract the silhouette descriptors using Discrete Cosine Transform (DCT). We divide the 3D pose space into K clusters and we train the mapping from the silhouette descriptor space to the 3D pose space of each cluster. The final output of the training stage is a mixture of Gaussian Process regression models.

A. Feature and Pose Representation

A review of many shape and appearance descriptors that are applicable to discriminative pose estimation is available in Ref. [51]. In this paper, we use the silhouette as an image feature and the Discrete Cosine Transform (DCT) of the silhouette as the shape descriptor. We use the DCT because it is simple to compute and yet more discriminative than other shape descriptors [51], [52]. It is shown in Ref. [52] that the DCT descriptor outperforms other shape descriptors such as Histogram of Shape Contexts and Lipschitz embeddings for human pose estimation.

The DCT descriptor, which belongs to the family of orthogonal moments, represents the silhouette image by the sum of two dimensional cosine functions of different frequencies characterized by the various coefficients. The DCT has been popularly used for image compression. First a silhouette window is cropped from the foreground image obtained from background subtraction. As shown in Fig. 2, the cropped image is scaled to the size of $H \times W$ and the DCT descriptor for the image window is computed as

$$M_{p,q} = \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} f_p(x)\, g_q(y)\, I(x, y), \tag{1}$$

for $p = 0, \ldots, W-1$; $q = 0, \ldots, H-1$, where $f_p(x) = \alpha_p \cos\{p\pi(x + 0.5)/W\}$; $g_q(y) = \alpha_q \cos\{q\pi(y + 0.5)/H\}$; and $\alpha_p = \sqrt{(1 + \min(p, 1))/W}$ and $\alpha_q = \sqrt{(1 + \min(q, 1))/H}$. We take $W = 64$ and $H = 128$ pixels. This window size is the most commonly used size for human subjects that are in an upright pose. The DCT descriptor has a nice property that most of the rich information about the silhouette is encoded in just a few of its coefficients. We empirically found that setting the descriptor to a 64 dimensional vector (corresponding to 8 rows and 8 columns of the DCT matrix $M$) is sufficient to represent each silhouette. In this paper, we assume that the subject is upright in the image. Although the DCT descriptor is not rotation invariant, it does not affect the pose estimation so long as the human subject is in an upright pose in the image. It is possible to handle athletic motions such as handstands and cartwheels by training a separate discriminative model for such activities and using a classifier to select the most appropriate model for an input feature. Since the silhouette is centered and re-scaled to the standard size, the descriptor is invariant to translation and scale.

Fig. 2. Computation of the DCT descriptor from a silhouette image.
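The computation in Eq. (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the synthetic rectangular silhouette, and the choice of the lowest-frequency 8 x 8 block as the "8 rows and 8 columns" are assumptions made here.

```python
import numpy as np

def dct_basis(n_coeffs, length):
    """Rows hold the 1D cosine basis of Eq. (1): alpha_p * cos(p*pi*(x+0.5)/length)."""
    p = np.arange(n_coeffs)[:, None]   # frequency index
    x = np.arange(length)[None, :]     # pixel index
    alpha = np.sqrt((1 + np.minimum(p, 1)) / length)
    return alpha * np.cos(p * np.pi * (x + 0.5) / length)

def dct_descriptor(silhouette, n_low=8):
    """Flattened n_low x n_low block of M for an H x W window indexed as [y, x].

    Assumption: the retained coefficients are the lowest frequencies.
    """
    H, W = silhouette.shape
    F = dct_basis(n_low, W)            # F[p, x] = f_p(x)
    G = dct_basis(n_low, H)            # G[q, y] = g_q(y)
    # M[p, q] = sum_x sum_y f_p(x) g_q(y) I(x, y), with I(x, y) = silhouette[y, x]
    M = F @ silhouette.T @ G.T
    return M.ravel()

# Hypothetical input: a 128 x 64 binary window with a centred rectangular "body".
window = np.zeros((128, 64))
window[16:112, 20:44] = 1.0
descriptor = dct_descriptor(window)
print(descriptor.shape)
```

Writing the double sum as two matrix products avoids explicit loops over pixels; only the first `n_low` basis functions in each direction are ever built.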

We represent each 3D pose as the relative orientations of the body parts in terms of Euler angles. We take the torso as the root segment and the orientation of the torso segment is measured with respect to a global coordinate system. The orientations of the upper arms are measured relative to the orientations of the torso. The orientations of the lower arms are measured relative to the upper arms. Relative orientations between the upper legs and the torso and between the lower and upper legs are defined in a similar manner.

B. Mixture of Gaussian Process (MGP) Regressors

We use a supervised learning technique to estimate the 3D pose directly from the silhouette descriptor. We learn a piecewise mapping with Gaussian Process (GP) regressors [53] from the shape descriptor space $\mathbf{x} \in \mathbb{R}^m$ to the 3D pose space $\mathbf{y} \in \mathbb{R}^d$ using the training data samples $T = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}$, for $i = 1, \ldots, N$, where $N$ is the number of training samples. With such a mixture of Gaussian Process (MGP) regression model [54], the GP regressors are trained for each region of the data space and a separate classifier is used to select the GP model that is appropriate for an input feature. First, the 3D pose space is partitioned into $K$ clusters using the hierarchical k-means algorithm. The training set is then divided into $K$ subsets, $T_1, \ldots, T_K$, such that $(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) \in T_k$ if $\mathbf{y}^{(i)}$ belongs to the $k$th cluster. We assume that the components of the output vector $\mathbf{y}$ are independent of each other, so we train a separate GP regression model for each output component of $\mathbf{y} = [y_1, \ldots, y_d]^T$. Without loss of generality, we drop the subscript $q$ in $y_q$ (which represents the $q$th component of $\mathbf{y}$) and present only a one-dimensional Gaussian Process regression. In each cluster, the relationship between $\mathbf{x}^{(i)}$ and each component $y^{(i)}$ of the training pose instance $\mathbf{y}^{(i)}$ (the superscript $(i)$ represents the $i$th training instance) is modeled by

$$y^{(i)} = f_k(\mathbf{x}^{(i)}) + \epsilon_k^{(i)}, \tag{2}$$

where $\epsilon_k^{(i)} \sim \mathcal{N}(0, \beta_k^{-1})$ and $\beta_k$ is the hyper-parameter representing the precision of the noise. From the definition of a Gaussian Process [53], the joint distribution of the output variables is given by a Gaussian:

$$p(\mathbf{y}_k \mid X_k) = \mathcal{N}(\mathbf{0}, C_k), \tag{3}$$

where $\mathbf{y}_k = [y^{(1)}, \ldots, y^{(N_k)}]^T$, $X_k = [\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N_k)}]^T$ for all $\mathbf{y}^{(i)}$ in the $k$th cluster, and $C_k$ is a covariance matrix whose entries are given by $C_k(i, j) = \kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) + \beta_k^{-1}\delta_{ij}$. The covariance function $\kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$ can be expressed as

$$\kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \theta_{k,1} \exp\left(-\frac{\theta_{k,2}}{2} \left\| \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \right\|^2\right) + \theta_{k,3}, \tag{4}$$

where the parameters $\Theta_k = \{\theta_{k,1}, \theta_{k,2}, \theta_{k,3}, \beta_k\}$ are referred to as the hyper-parameters of the GP and $\delta_{ij}$ is the Kronecker delta function. This covariance function combines the Radial Basis Function (RBF) and a bias term. The learning of the hyper-parameters is based on the evaluation of the likelihood function $p(\mathbf{y}_k \mid \Theta_k)$ and the maximization of the log likelihood using a gradient based optimization technique such as conjugate gradient. The log likelihood function for a Gaussian process can be evaluated as

$$\ln p(\mathbf{y}_k \mid \Theta_k) = -\frac{1}{2} \ln |C_k| - \frac{1}{2} \mathbf{y}_k^T C_k^{-1} \mathbf{y}_k - \frac{N_k}{2} \ln(2\pi). \tag{5}$$

Once the hyper-parameters are trained, the next step is to predict the output pose component $y_{k*}$ for the unseen test feature vector $\mathbf{x}_*$. This requires the evaluation of the predictive distribution $p(y_{k*} \mid \mathbf{y}_k, X_k)$. Let us define $\mathbf{y}_{k*} = [y^{(1)}, \ldots, y^{(N_k)}, y_{k*}]^T$, whose joint distribution is given by

$$p(\mathbf{y}_{k*}) = \mathcal{N}(\mathbf{0}, C_{k*}), \tag{6}$$

where $C_{k*} \in \mathbb{R}^{(N_k+1) \times (N_k+1)}$ is a covariance matrix. That is,

$$C_{k*} = \begin{bmatrix} C_k & \mathbf{c}_{k*} \\ \mathbf{c}_{k*}^T & c_k \end{bmatrix}, \tag{7}$$

where the vector $\mathbf{c}_{k*}$ has elements $\mathbf{c}_{k*}(i) = \kappa(\mathbf{x}^{(i)}, \mathbf{x}_*)$, for $i = 1, \ldots, N_k$, and the scalar $c_k = \kappa(\mathbf{x}_*, \mathbf{x}_*) + \beta_k^{-1} = \theta_{k,1} + \theta_{k,3} + \beta_k^{-1}$ from Eq. (4). The conditional distribution $p(y_{k*} \mid \mathbf{y}_k, X_k)$ is a Gaussian distribution with mean and variance given by

$$y_{k*} = \mathbf{c}_{k*}^T C_k^{-1} \mathbf{y}_k \tag{8}$$

$$\sigma_k^2(\mathbf{x}_*) = c_k - (\mathbf{c}_{k*})^T C_k^{-1} \mathbf{c}_{k*}. \tag{9}$$
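Eqs. (4), (5), (8) and (9) translate almost directly into NumPy. The sketch below is a minimal illustration for a single cluster and a single pose component. The function names, the toy one-dimensional data and the hand-picked hyper-parameter values are assumptions made here, and the gradient-based hyper-parameter search described above is not implemented; `log_likelihood` only evaluates the objective that such a search would maximize.

```python
import numpy as np

def kernel(A, B, theta1, theta2, theta3):
    """Eq. (4): RBF term plus bias, for every row of A against every row of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return theta1 * np.exp(-0.5 * theta2 * sq_dist) + theta3

def train_covariance(X, theta1, theta2, theta3, beta):
    """C_k with entries kappa(x_i, x_j) + delta_ij / beta."""
    return kernel(X, X, theta1, theta2, theta3) + np.eye(len(X)) / beta

def log_likelihood(X, y, theta1, theta2, theta3, beta):
    """Eq. (5): log marginal likelihood of the training targets."""
    C = train_covariance(X, theta1, theta2, theta3, beta)
    chol = np.linalg.cholesky(C)
    log_det = 2.0 * np.log(np.diag(chol)).sum()
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * log_det - 0.5 * quad - 0.5 * len(X) * np.log(2 * np.pi)

def predict(X, y, x_star, theta1, theta2, theta3, beta):
    """Eqs. (8) and (9): predictive mean and variance at one test feature."""
    C = train_covariance(X, theta1, theta2, theta3, beta)
    c_star = kernel(X, x_star[None, :], theta1, theta2, theta3)[:, 0]
    c = theta1 + theta3 + 1.0 / beta
    mean = c_star @ np.linalg.solve(C, y)
    var = c - c_star @ np.linalg.solve(C, c_star)
    return mean, var

# Toy data (hypothetical): a 1D "descriptor" and one pose component.
X_train = np.linspace(0.0, 3.0, 10)[:, None]
y_train = np.sin(X_train[:, 0])
params = dict(theta1=1.0, theta2=4.0, theta3=0.1, beta=1000.0)  # assumed values

ll = log_likelihood(X_train, y_train, **params)
mean_near, var_near = predict(X_train, y_train, np.array([1.0]), **params)
mean_far, var_far = predict(X_train, y_train, np.array([10.0]), **params)
print(ll, mean_near, var_near, mean_far, var_far)
```

The test point far from the training data receives a much larger predictive variance than the one inside it, which is the property the tracking step relies on to identify uncertain predictions.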

In this manner, we obtain the prediction for each component of the pose vector. Let $y_{q,k*}$ be the prediction for the $q$th component of the pose vector and $\sigma^2_{q,k}(\mathbf{x}_*)$ be the corresponding variance. The full pose vector is $\mathbf{y}_{k*} = [y_{1,k*}, \ldots, y_{d,k*}]^T$ and the corresponding covariance matrix becomes $\Sigma_k(\mathbf{x}_*) = \mathrm{diag}(\sigma^2_{1,k}(\mathbf{x}_*), \ldots, \sigma^2_{d,k}(\mathbf{x}_*))$. The multimodal relationship between the silhouette descriptor $\mathbf{x}$ and the 3D pose vector $\mathbf{y}$ is thus represented by a mixture of $K$ regressors:

$$p(\mathbf{y} \mid \mathbf{x}_*) = \sum_{k=1}^{K} g_k(\mathbf{x}_*)\, \mathcal{N}(\mathbf{y}_{k*}, \Sigma_k(\mathbf{x}_*)), \tag{10}$$

where $g_k(\mathbf{x})$ is a $K$-class classifier which gives the probability that the $k$th GP regressor is selected to predict the given feature instance $\mathbf{x}_*$. We model the multi-class classifier $g_k(\mathbf{x})$ as a multinomial logistic regressor (i.e., the softmax function):

$$g_k(\mathbf{x}) = \frac{\exp(-\mathbf{v}_k^T \mathbf{x})}{\sum_{j=1}^{K} \exp(-\mathbf{v}_j^T \mathbf{x})}, \tag{11}$$

where $\mathbf{v}_k$ is an $m$-dimensional parameter vector. The parameter vectors $\mathbf{v}_1, \ldots, \mathbf{v}_K$ are estimated from the training data $\{\mathbf{x}^{(i)}, \mathbf{l}^{(i)}\}_{i=1}^{N}$, where $\mathbf{l}^{(i)} = [l_1^{(i)}, \ldots, l_K^{(i)}]$ and $l_k^{(i)}$ denotes the probability that feature vector $\mathbf{x}^{(i)}$ belongs to the $k$th cluster. We set $l_k^{(i)} = 1$ if $\mathbf{y}^{(i)}$ belongs to the $k$th cluster, otherwise we set it to 0. The maximum likelihood estimation of the parameter vectors is then performed using the iteratively reweighted least squares method. We use a fast method based on bound optimization described in [55] to train these parameter vectors.

Instead of the multinomial logistic regressor, an alternative is to model the multi-class classifier $g_k(\mathbf{x})$ as a function of the variance of the $k$th GP mode, i.e., by setting $g_k(\mathbf{x}) \propto 1/\mathrm{trace}(\Sigma_k(\mathbf{x}))$. As the average variance of the prediction is lower when the test feature is closer to the training samples, a higher weight is given to the cluster which is closer to the test feature vector. In our empirical evaluation shown in Table III, we found that the pose estimation performance of the MGP regressor is improved when the multinomial logistic regressor is used in comparison to the case when the function of the variance is used.
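The two gating choices and the mixture of Eq. (10) can be compared with a short sketch. It assumes that the per-cluster means and diagonal variances have already been produced by the GP regressors and that the softmax parameters are given; the function names and the numerical values are hypothetical, and the training of the parameter vectors by bound optimization [55] is not shown.

```python
import numpy as np

def softmax_gate(x, V):
    """Eq. (11): g_k(x) = exp(-v_k^T x) / sum_j exp(-v_j^T x); rows of V are v_k."""
    scores = -(V @ x)
    scores = scores - scores.max()     # stabilise the exponentials
    weights = np.exp(scores)
    return weights / weights.sum()

def variance_gate(variances):
    """Alternative gate: g_k proportional to 1 / trace(Sigma_k(x)).

    `variances` is a K x d array holding the diagonal of each Sigma_k.
    """
    inv_trace = 1.0 / variances.sum(axis=1)
    return inv_trace / inv_trace.sum()

def mixture_density(y, means, variances, gates):
    """Eq. (10) evaluated at pose y, using the diagonal covariances."""
    diff = y[None, :] - means
    log_norm = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances).sum(axis=1)
    return float(gates @ np.exp(log_norm))

# Hypothetical example with K = 2 clusters, m = 2 features and d = 2 pose components.
V = np.array([[1.0, 0.0], [0.0, 1.0]])
x_star = np.array([2.0, 0.0])
means = np.array([[0.0, 0.0], [1.0, 1.0]])
variances = np.array([[1.0, 1.0], [0.1, 0.1]])

g_soft = softmax_gate(x_star, V)
g_var = variance_gate(variances)
density = mixture_density(np.array([1.0, 1.0]), means, variances, g_soft)
print(g_soft, g_var, density)
```

With the variance-based gate the weights depend only on how confident each regressor is, whereas the softmax gate is a separately trained classifier on the feature vector.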

1) Pose Space Clustering: The motivation behind clustering the pose space and learning a discriminative model in each cluster is to model the depth ambiguities introduced by the silhouettes. For that purpose, we cluster the pose space into six partitions using hierarchical k-means. At the first level, the pose space is partitioned into four clusters representing poses which face forward, backward, left and right with respect to the camera by a careful initialization. At the second level, each cluster that represents the lateral pose (left or right) is further partitioned into two clusters. Figure 3 shows the representative poses in each cluster along with the silhouettes that give rise to ambiguous poses. Clusters C1 and C2 model the forward-backward ambiguity as the silhouette S1 could be generated from the poses that are both in C1 and C2. Similarly, clusters C3 and C4 model the ambiguity associated with the silhouette S2; clusters C5 and C6 model the ambiguity associated with the silhouette S3. Other approaches address the ambiguities by sampling from the multimodal posterior obtained from a mixture of regressors models [5], [7]. In another interesting approach [56], it is shown that the ambiguous poses can be distinguished by sampling from the posterior in a latent space that is shared by both the observation and the pose spaces. In their approach, the posterior in the latent space is obtained using the GPLVM model.

Fig. 3. Representative pose in each cluster (C1 to C6) obtained using hierarchical k-means. Each pose is rendered as a 3D cylindrical model with red denoting left limbs and blue denoting right limbs. Silhouettes that give rise to ambiguous poses are also shown (Figure best viewed in color).
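The two-level partitioning can be illustrated with plain Lloyd iterations. The sketch below is a hedged illustration rather than the authors' procedure: the three-dimensional toy "pose" vectors, the hand-set first-level centres, and the rule used to seed the second-level split (the first member of a lateral cluster and the member farthest from it) are all assumptions, and the resulting label order is not claimed to match C1 to C6.

```python
import numpy as np

def assign(points, centres):
    """Index of the nearest centre for every point."""
    dists = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def kmeans(points, centres, n_iter=20):
    """Plain Lloyd iterations started from the given centres."""
    centres = np.array(centres, dtype=float)
    for _ in range(n_iter):
        labels = assign(points, centres)
        for k in range(len(centres)):
            members = points[labels == k]
            if len(members):
                centres[k] = members.mean(axis=0)
    return assign(points, centres), centres

def two_level_clusters(poses, init_centres, lateral=(2, 3)):
    """Level 1: one cluster per initial centre. Level 2: split the lateral ones in two."""
    top, _ = kmeans(poses, init_centres)
    labels = np.empty(len(poses), dtype=int)
    next_id = 0
    for k in range(len(init_centres)):
        idx = np.flatnonzero(top == k)
        if k in lateral and len(idx) >= 2:
            sub = poses[idx]
            far = ((sub - sub[0]) ** 2).sum(axis=1).argmax()   # assumed seeding rule
            halves, _ = kmeans(sub, sub[[0, far]])
            labels[idx] = next_id + halves
            next_id += 2
        else:
            labels[idx] = next_id
            next_id += 1
    return labels

# Toy poses (hypothetical): (cos yaw, sin yaw, one limb coordinate).
z = np.linspace(-0.1, 0.1, 5)
z_two_modes = np.concatenate([1 + z, -1 + z])
forward = np.column_stack([np.ones(5), np.zeros(5), z])
backward = np.column_stack([-np.ones(5), np.zeros(5), z])
left = np.column_stack([np.zeros(10), np.ones(10), z_two_modes])
right = np.column_stack([np.zeros(10), -np.ones(10), z_two_modes])
poses = np.vstack([forward, backward, left, right])

init = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0]])
labels = two_level_clusters(poses, init)
print(np.bincount(labels))
```

Seeding the first level with one centre per facing direction plays the role of the "careful initialization" mentioned above; any other deterministic seeding for the second level would serve equally well in this sketch.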

This approach of hard clustering could result in a poor prediction performance at the boundary of the clusters. However, since we used the Gaussian Process regression as the discriminative model, such poor predictions could often be detected as they tend to produce larger prediction variances. In Section IV-C, we discuss the adoption of a probabilistic method to discard poor predictions made by the MGP regression model during the tracking of the 3D pose. The method of Ref. [12] provides an interesting solution to the boundary problem in that a MGP regressor is trained on the subset of the data that is closest to the test feature. However, their method requires computing the GP hyper-parameters for each test image.

Fig. 4. A block diagram showing the testing of our proposed 3D pose Tracking System. The 3D human pose for each video frame is predicted from the mixture of GP regressions; the tracking incorporates the predicted pose, prediction uncertainty, edges and silhouette observations using the AGP-PF method.

IV. POSE TRACKING

In the tracking step, our goal is to determine the 3D pose of a person in the video frames. The block diagram of our proposed tracking method is shown in Fig. 4. For each image frame in a video, the human pose is predicted using a Mixture of Gaussian Process (MGP) regressors (Section III). The output of the MGP regressors is a mixture of Gaussian distributions. The output distribution is taken as one component of the hypothesis space; the other component is the output pose distribution of the previous frame. The output pose is then computed by searching the combined pose hypothesis space using our proposed AGP-PF tracking method.

Our method needs a human body model to compute the likelihood of each pose hypothesis. In this paper, we use a 3D cylindrical model of the human body. Each body part is represented by a tapered cylinder as shown in Fig. 5. The lengths of the body parts and the diameters of the cylinders are assumed to be fixed for a given person and are initialized at the beginning of the tracking.

We use human kinematic constraints to determine the degrees of freedom (DOF) of the pose. The torso has six degrees of freedom after incorporating the global translation and rotation. The head, upper arms and upper legs have three degrees of freedom each. The forearms have one degree of freedom each (they are only allowed to rotate about their Y axis) and the lower legs have two degrees of freedom (they are allowed to rotate about their X and Y axes). Hence the human body model shown in Fig. 5 has 27 degrees of freedom. We enforce the joint angle limit by restricting the variation of the angles to a kinematically possible range. The kinematically possible angles are computed from the training data.
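As a quick check of the total, and assuming one torso, one head, and a left and a right instance of each limb segment as in the body model described above, the per-part degrees of freedom add up as follows:

$$
\underbrace{6}_{\text{torso}} + \underbrace{3}_{\text{head}} + \underbrace{2 \times 3}_{\text{upper arms}} + \underbrace{2 \times 3}_{\text{upper legs}} + \underbrace{2 \times 1}_{\text{forearms}} + \underbrace{2 \times 2}_{\text{lower legs}} = 27.
$$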

In this paper, we use our proposed Gaussian Process guided particle filter for pose tracking. Below, we will first give a review of a standard particle filter (for completeness) in Section IV-B followed by our proposed Gaussian Process guided particle filter for pose tracking in Section IV-C.

Fig. 5. A 3D human body model in a neutral pose. Each body part of the model is approximated by a tapered cylinder. The X-axis is denoted by a large circled dot and is perpendicular to both the Y- and Z-axes.

A. Likelihood Distribution

Given an image observation denoted by $\mathbf{r}_t$ at time $t$, the likelihood density $p(\mathbf{r}_t \mid \mathbf{y}_t)$ measures how well a hypothesized pose vector $\mathbf{y}_t$ explains the image observation $\mathbf{r}_t$. In our method, the likelihood value is computed by matching the 3D human body model corresponding to the hypothesized pose with the observed image features. We use the silhouette and edge features of the image to compute the likelihood of each pose state vector. We first project the 3D cylindrical model corresponding to the hypothesized pose onto the image plane using the camera calibration matrix to obtain the hypothesized edge features and the hypothesized region features. The likelihood value is computed by matching the hypothesized features with the observed image features. We use the silhouette and edge features of the observed image to compute the likelihood based on two cost measures: silhouette cost and edge cost, as detailed below.

1) Silhouette Cost: The silhouette cost measures how well the region projected by the hypothesized model fits into the observed silhouette. Given a 3D pose hypothesis, we first generate a binary image $H$ of the corresponding 3D human body model such that $H(x, y) = 1$ if the pixel corresponds to the hypothesized foreground and $H(x, y) = 0$ otherwise. An example of a hypothesized foreground is shown in Fig. 6(d). Let $Z$ be the observed silhouette image as shown in Fig. 6(b). The part of the silhouette $Z$ that is not explained by the model region $H$ is given by $R_1 = Z \cap \bar{H}$, where $\cap$ denotes the pixel-wise “and” operator and $\bar{H}$ denotes the inverted image of $H$. Similarly, the part of the model region that is not explained by the silhouette is given by $R_2 = H \cap \bar{Z}$. Since our objective is to minimize these unexplained silhouette and model regions, the cost can be expressed as

$$C_{sil} = 0.5 \left( \frac{\mathrm{Area}(R_1)}{\mathrm{Area}(Z)} + \frac{\mathrm{Area}(R_2)}{\mathrm{Area}(H)} \right), \tag{12}$$

Fig. 6. (a) Input test image. (b) Observed silhouette image Z. (c) 3D human body model for a hypothesized pose. (d) Projected silhouette image H obtained from the model. (e) H superimposed on Z. (f) An example of the visible model edge points projected into the observed edge image.

where $\mathrm{Area}(I)$ gives the number of non-zero pixels in the binary image $I$. When the hypothesized model fits the observed silhouette exactly, $\mathrm{Area}(R_1) = \mathrm{Area}(R_2) = 0$ and hence $C_{sil}$ takes its lowest value of 0. When the two regions have zero overlap, no part of the silhouette region is explained by the model (i.e., $Z \cap \bar{H} = Z$) and no part of the model region is explained by the silhouette (i.e., $H \cap \bar{Z} = H$). In this case, $C_{sil}$ takes its highest value of 1. This measure of the silhouette cost is similar to that of [57], [58].
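The silhouette cost of Eq. (12) can be illustrated with a minimal sketch. It assumes that both masks are equally sized binary images stored as nested lists; the function names `area` and `silhouette_cost` are hypothetical and do not come from the paper.

```python
def area(mask):
    """Number of non-zero pixels in a binary image given as a list of rows."""
    return sum(1 for row in mask for v in row if v)


def silhouette_cost(Z, H):
    """Eq. (12): average of the unexplained-silhouette and unexplained-model fractions.

    Z is the observed silhouette and H the projected model mask. Both are
    assumed to have the same size and to contain at least one foreground pixel.
    """
    pairs = [(z, h) for z_row, h_row in zip(Z, H) for z, h in zip(z_row, h_row)]
    r1 = sum(1 for z, h in pairs if z and not h)  # Area(Z AND NOT H)
    r2 = sum(1 for z, h in pairs if h and not z)  # Area(H AND NOT Z)
    area_z, area_h = area(Z), area(H)
    if area_z == 0 or area_h == 0:
        raise ValueError("both masks must contain foreground pixels")
    return 0.5 * (r1 / area_z + r2 / area_h)
```

A perfect fit gives a cost of 0 and disjoint regions give a cost of 1, matching the two limiting cases described above.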

2) Edge Cost: The edge cost measures how well the boundary line corresponding to the hypothesized model fits the observed edge image. Given an input image, we detect edges by thresholding the gradient image to obtain a binary edge map [2]. We segment the foreground edges corresponding to the human subject by masking the binary edge image with the silhouette image. We then construct a Gaussian distance map $E_1$ of the segmented edge image to determine the edge probability of a given pixel. The Gaussian distance map, which gives the proximity of a pixel to the edge, can be obtained by convolving the binary edge image with a Gaussian kernel and rescaling the pixel values between 0 and 1 [2]. In the next step, we generate a set of hypothesized edge points $E_2$ by projecting the visible boundary of the 3D cylindrical model corresponding to a pose hypothesis onto the edge image and sparsely sampling the points along the boundary. The points that are hidden due to self-occlusion are discarded using the depth information of the body model. The edge cost is then obtained by computing the mean square error (MSE) of the edge probability values:

$$C_{edge} = \frac{1}{|E_2|} \sum_{p \in E_2} \left(1 - E_1(p)\right)^2. \tag{13}$$

Similar to $C_{sil}$, the value of $C_{edge}$ falls between 0 and 1. Assuming equal influence of edge and silhouette features on tracking, the final likelihood for a given hypothesized pose is approximated by

$$p(r_t \mid y_t) \approx \exp\left(-(C_{sil} + C_{edge})\right). \tag{14}$$

When images from more than one camera view are available, the silhouette and edge costs are computed using the images from each camera, and the costs are averaged to obtain the final cost of a pose hypothesis in Eq. (14).
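As a rough illustration of Eqs. (13) and (14), the sketch below evaluates the edge cost at a set of sampled boundary points and combines per-view costs into a likelihood. It assumes that the Gaussian distance map has already been computed and is stored as a nested list with values in [0, 1]; the names `edge_cost` and `likelihood` are hypothetical.

```python
import math


def edge_cost(edge_map, points):
    """Eq. (13): mean squared error of the edge probabilities at the model edge points.

    edge_map[y][x] plays the role of the Gaussian distance map E1, and points is
    the set E2 of (x, y) pixel coordinates sampled on the visible model boundary.
    """
    if not points:
        raise ValueError("E2 must contain at least one point")
    return sum((1.0 - edge_map[y][x]) ** 2 for x, y in points) / len(points)


def likelihood(sil_costs, edge_costs):
    """Eq. (14), with the per-view costs averaged over the available cameras."""
    c_sil = sum(sil_costs) / len(sil_costs)
    c_edge = sum(edge_costs) / len(edge_costs)
    return math.exp(-(c_sil + c_edge))
```

Because both costs lie in [0, 1], the likelihood computed this way lies between exp(-2) and 1.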

B. Particle Filter

The particle filter is a Monte Carlo approximation to sequential Bayesian estimation, which propagates the posterior probability of a first-order Markov process from time $t-1$ to $t$ through the following equation:

$$p(y_t \mid R_t) = c\, p(r_t \mid y_t) \int_{y_{t-1}} p(y_t \mid y_{t-1})\, p(y_{t-1} \mid R_{t-1})\, \mathrm{d}y_{t-1}, \tag{15}$$

where $c$ is a constant; $y_t$ is the 3D pose state at time $t$; $r_t$ is the image observation; $R_t = [r_1, \ldots, r_t]$ comprises all observations observed sequentially up to time $t$; $p(y_t \mid y_{t-1})$ is a distribution that describes the motion model; and $p(r_t \mid y_t)$ is the observation likelihood distribution. The multi-dimensional integral of Eq. (15) can only be evaluated in closed form for the simple case where the posterior distribution of the state variable is Gaussian. When the state variable corresponds to the human pose, the posterior distribution is non-Gaussian and methods like the Kalman filter generally fail [59]. Particle filters are therefore often used to approximate Eq. (15) using a set of weighted samples $S_t = \{y_t^{(i)}, \pi_t^{(i)}\}_{i=1}^{n}$, where each $y_t^{(i)}$ is a particle, $\pi_t^{(i)}$ is the corresponding particle weight, normalized to ensure $\sum_i \pi_t^{(i)} = 1$, and $n$ denotes the number of particles. The particle filter does not make any explicit assumption about the form of the posterior and hence is applicable to systems where the posterior distribution of the state variable is non-Gaussian.

To estimate the pose using the particle filter, one must design two models: a dynamical model (a.k.a. motion model), namely $p(y_t^{(i)} \mid y_{t-1}^{(i)})$, which describes the movement of the human subject from one frame to another in 3D space; and the observation likelihood model $p(r_t \mid y_t)$, which gives the probability that observation $r_t$ can be generated by a pose sample $y_t$, as described in Section IV-A. At each time step $t$, given the particle set $S_{t-1}$, basic sequential importance resampling updates the particles in three steps [60]. First, $n$ particles are sampled with replacement from the discrete distribution denoted by $S_{t-1}$. Second, each sampled particle is modified by the motion model. Third, the normalized importance weight is computed from the observation likelihood. The new particle-weight set at time $t$ is obtained as $S_t$. The particle-weight set $S_t$ represents the posterior distribution of the state, and the output can be computed by taking the expected value of the posterior distribution represented by the particle-weight set.
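The three resampling steps described above can be sketched for a scalar state as follows. This is a toy illustration only: it assumes a zero-mean Gaussian motion model and a user-supplied likelihood function, and the names `sir_step` and `expected_state` are hypothetical.

```python
import math
import random


def sir_step(particles, weights, motion_std, likelihood, rng):
    """One sequential importance resampling update for a scalar state."""
    n = len(particles)
    # Step 1: sample n particles with replacement according to the old weights.
    resampled = rng.choices(particles, weights=weights, k=n)
    # Step 2: modify each sampled particle with the (assumed Gaussian) motion model.
    moved = [y + rng.gauss(0.0, motion_std) for y in resampled]
    # Step 3: weight each particle by the observation likelihood and normalize.
    raw = [likelihood(y) for y in moved]
    total = sum(raw)
    return moved, [w / total for w in raw]


def expected_state(particles, weights):
    """Expected value of the posterior represented by the particle-weight set."""
    return sum(y * w for y, w in zip(particles, weights))
```

In a usage scenario, `sir_step` would be called once per frame with `rng = random.Random(seed)`, and `expected_state` would give the pose estimate for that frame.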

A simple particle filter does not work accurately in higher dimensions because a large number of particles is required to populate such a high-dimensional state space. Using a small number of particles might lead to entrapment of the particles around local maxima. Local maxima are common in the state space of human pose tracking because there are many ways the model can partially fit the observed image. Variants of the particle filter, such as the annealed particle filter [27], which iteratively pushes the particles towards the high-probability regions of the state space, have been developed. Moreover, the computed importance weights might not always be correct, mainly because of noisy or ambiguous observations and an incorrect generative model. This leads to frequent mistracking. Our Annealed Gaussian Process guided particle filter, described in the following subsection, aims at tackling this problem.

Fig. 7. Graphical model for the AGP-PF method that includes the conditional $p(y_t \mid y_{t-1}, x_t)$ and the observation likelihood $p(r_t \mid y_t)$. We assume that $p(y_t \mid y_{t-1}, x_t)$ can be factored into the discriminative model $p(y_t \mid x_t)$ and the motion model $p(y_t \mid y_{t-1})$ according to Eq. (18).

C. Annealed Gaussian Process Guided Particle Filter

In the Annealed Gaussian Process Guided Particle Filter (AGP-PF), the pose is tracked from one frame to another using not only the motion information but also the discriminative distribution obtained from the supervised learning in Section III. The graphical model for the AGP-PF is shown in Fig. 7. Let $y_t$ denote a hidden state that represents the 3D pose at time $t$, and let $x_t$ be the image observation at time $t$ that is specifically used to generate the discriminative distribution $p(y_t \mid x_t)$. Let $r_t$ be the image observation at time $t$ that is used to compute the likelihood distribution $p(r_t \mid y_t)$. Let $X_t = [x_1, \ldots, x_t]$ and $R_t = [r_1, \ldots, r_t]$ be all observations that have been sequentially observed until time $t$. Then the posterior density of the state after the new observations $x_t$ and $r_t$ is given by the recursive Bayesian equation

$$p(y_t \mid R_t, X_t) = c\, p(r_t \mid y_t)\, p(y_t \mid R_{t-1}, X_t), \tag{16}$$

where $c = 1 / p(r_t \mid R_{t-1}, X_t)$ is constant with respect to $y_t$, and $p(r_t \mid y_t) = p(r_t \mid y_t, X_t)$ follows from the conditional independence implied by the graphical model in Fig. 7. Since the integral of the target distribution $p(y_t \mid R_t, X_t)$ over the entire space of $y_t$ must be one, the target distribution can be calculated by first computing Eq. (16) without the constant $c$ and then normalizing the result. Therefore, the constant $c$ does not influence the estimation of the target distribution. The conditional independence assumption holds when the features $r$ and $x$ have different properties and do not depend on each other. In our case, $x$ is taken as the Discrete Cosine Transform (DCT) of the silhouette and hence is a much coarser description of the shape. On the other hand, $r$ represents the edges of the body segments, which correspond to a richer appearance representation of the body parts. As the value of the DCT descriptor of a silhouette gives no knowledge about the value of the edge features, these two features can be considered independent. The prior distribution at time $t$ can be written as

$$p(y_t \mid R_{t-1}, X_t) = \int p(y_t \mid y_{t-1}, x_t)\, p(y_{t-1} \mid R_{t-1}, X_{t-1})\, \mathrm{d}y_{t-1}. \tag{17}$$

We assume that the conditional distribution $p(y_t \mid y_{t-1}, x_t)$ can be expressed as a mixture of simpler conditionals [61], i.e.,

$$p(y_t \mid y_{t-1}, x_t) = (1 - \alpha(x_t))\, p(y_t \mid y_{t-1}) + \alpha(x_t)\, p(y_t \mid x_t), \tag{18}$$

where $p(y_t \mid x_t)$ is a discriminative distribution expressed in terms of a mixture of Gaussians obtained from the trained Gaussian Process models (see Eq. (10)), and $\alpha(x_t) \in [0, 1]$ is a mixing coefficient that denotes the contribution of the discriminative distribution towards the prior at $t$. We discard any contribution of the discriminative distribution $p(y_t \mid x_t)$ when it has a large variance $\sigma^2(x_t)$. This is achieved by relating the mixing coefficient $\alpha(x_t)$ to the variance $\sigma^2(x_t)$ using the following equation:

$$\alpha(x_t) = \exp\left(-\sigma^2(x_t) / \lambda\right), \tag{19}$$

where $\sigma^2(x_t) = \mathrm{trace}[\Sigma_k(x_t)] / d$ is the average variance of the most likely GP model selected from Eq. (10).
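The mapping from predictive variance to mixing coefficient in Eq. (19) is small enough to sketch directly. The function below assumes that only the diagonal of the predictive covariance of the most likely GP model is available, which is sufficient for computing the trace; the name `mixing_coefficient` is hypothetical.

```python
import math


def mixing_coefficient(cov_diag, lam):
    """Eq. (19): alpha = exp(-sigma^2 / lambda), with sigma^2 = trace(cov) / d.

    cov_diag holds the diagonal entries of the predictive covariance, so its
    sum is the trace and its length is the pose dimension d.
    """
    if lam <= 0:
        raise ValueError("lambda must be positive")
    avg_var = sum(cov_diag) / len(cov_diag)
    return math.exp(-avg_var / lam)
```

For instance, with the control parameter set to 0.003 and an average variance of 0.015, the exponent is -5 and the coefficient is about 0.007, so almost no samples would be drawn from the discriminative distribution.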

The most likely GP model is the one with the highest gating probability $g_k(x_t)$. Figure 9 shows the variation of $\alpha(x_t)$ with respect to $\sigma^2(x_t)$ given in Eq. (19) for different values of $\lambda$. The value of the control parameter $\lambda$ is determined empirically as discussed in Section V-C1. Eq. (19) expresses that a prediction from a GP model with a large variance (i.e., low certainty) is less likely to be correct, and thus fewer samples should be drawn from it. Substituting Eq. (18) into Eq. (17), and noting that $\alpha(x_t)\, p(y_t \mid x_t)$ does not depend on $y_{t-1}$ while the posterior at $t-1$ integrates to one, the prior distribution can be expressed as

$$p(y_t \mid R_{t-1}, X_t) = (1 - \alpha(x_t))\, p(y_t \mid R_{t-1}, X_{t-1}) + \alpha(x_t)\, p(y_t \mid x_t), \tag{20}$$

where

$$p(y_t \mid R_{t-1}, X_{t-1}) = \int p(y_t \mid y_{t-1})\, p(y_{t-1} \mid R_{t-1}, X_{t-1})\, \mathrm{d}y_{t-1}, \tag{21}$$

which is the first term of the prior distribution and is obtained by applying the motion model to the posterior distribution at $t-1$. The second term of the prior is the discriminative density $p(y_t \mid x_t)$ given by Eq. (10), obtained from supervised learning. We approximate the posterior at time $t-1$ by the particle-weight set $S_{t-1} = \{y_{t-1}^{(i)}, \pi_{t-1}^{(i)}\}_{i=1}^{n}$ and assume a statistically stationary motion model, i.e., $p(y_t \mid y_{t-1}) \sim \mathcal{N}(y_{t-1}, \Sigma)$, where $\Sigma$ is a diagonal covariance matrix. A method to determine the value of $\Sigma$ is discussed in Section V-C1. We can write the first term of the prior distribution as $p(y_t \mid R_{t-1}, X_{t-1}) = \sum_i \pi_{t-1}^{(i)}\, \mathcal{N}(y_{t-1}^{(i)}, \Sigma)$. The final prior of Eq. (20) can then be expressed as

$$p(y_t \mid R_{t-1}, X_t) = (1 - \alpha(x_t)) \sum_{i=1}^{n} \pi_{t-1}^{(i)}\, \mathcal{N}(y_{t-1}^{(i)}, \Sigma) + \alpha(x_t)\, p(y_t \mid x_t), \tag{22}$$

where the discriminative distribution $p(y_t \mid x_t)$ is given by Eq. (10). We use the silhouette and edge based likelihood to model the likelihood distribution $p(r_t \mid y_t)$ as described in Section IV-A. The importance weight for each particle can be computed as a normalized likelihood value, i.e.,

$$\pi_t^{(i)} \propto p(r_t \mid y_t^{(i)})^{A_l}, \tag{23}$$

where $p(r_t \mid y_t^{(i)})$ is given by Eq. (14), $A_l$ is the inverse annealing temperature for the $l$th annealing layer, and $\pi_t^{(i)}$ is normalized so that $\sum_{i=1}^{n} \pi_t^{(i)} = 1$. Simulated annealing works on the principle that a region of high probability lies in the vicinity of the particles that have higher weights [2]. At the first annealing layer, particles are allowed to float in a high-energy state. This means that all the particles have similar probabilities associated with them, resulting in a smooth and non-peaky likelihood function. The system is then gradually allowed to cool down at successive annealing layers, so that peaks in the likelihood function are introduced slowly. Consequently, more particles are concentrated around the high-probability region at the end of the annealing layers. We automatically find the value of the inverse annealing temperature $A_l$ so that the particle survival rate at each layer is around 50% (following the method of [2]):

$$\zeta(A_l) = \frac{1}{n} \left( \sum_{i=1}^{n} \left(\pi_t^{(i)}\right)^2 \right)^{-1}, \tag{24}$$

where $A_l$ is estimated by minimizing $|\zeta(A_l) - 0.5|$. This process of decreasing the particle survival rate has the same effect as the cooling of a mechanical system.
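One way to search for an inverse annealing temperature that meets the 50% survival target of Eqs. (23) and (24) is a simple bisection, sketched below. The bisection itself, its search bounds, and all function names are assumptions made for illustration; the text only states that the absolute difference from 0.5 is minimized. The sketch relies on the survival rate being non-increasing in the exponent and on all likelihood values being strictly positive, as they are under Eq. (14).

```python
import math


def annealed_weights(likelihoods, a):
    """Eq. (23): weights proportional to likelihood**a, normalized to sum to one."""
    logs = [a * math.log(p) for p in likelihoods]
    peak = max(logs)  # subtract the maximum for numerical stability
    raw = [math.exp(v - peak) for v in logs]
    total = sum(raw)
    return [w / total for w in raw]


def survival_rate(weights):
    """Eq. (24): zeta = 1 / (n * sum of squared normalized weights)."""
    return 1.0 / (len(weights) * sum(w * w for w in weights))


def find_inverse_temperature(likelihoods, target=0.5, lo=0.0, hi=100.0, iters=60):
    """Bisection on the exponent a so that the survival rate approaches the target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if survival_rate(annealed_weights(likelihoods, mid)) > target:
            lo = mid  # weights are still too flat, so sharpen them
        else:
            hi = mid  # weights are too peaked, so flatten them
    return 0.5 * (lo + hi)
```

With equal weights the survival rate is 1, and it falls towards 1/n as the weights concentrate on a single particle, which is why a target of 0.5 can be bracketed.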

The steps of our proposed method are given in Table I. Let $S_{t,l}$ be the particle-weight set output at the end of the $l$th annealing layer of frame $t$. For $t = 0$, the initial set of particles $S_0$ is obtained by sampling from the discriminative distribution $p(y_0 \mid x_0)$, and equal importance weights are assigned to them. For $t > 0$, the input to the first annealing layer is the posterior distribution of the previous frame, i.e., $S_{t,0} = S_{t-1}$. Then, at each annealing layer, the prior distribution is constructed from the discriminative distribution $p(y_t \mid x_t)$, the mixing coefficient $\alpha(x_t)$, and the discrete distribution $S_{t,l-1}$ using Eq. (22). For layer $l = 1$, we compute the value of $\alpha(x_t)$ according to Eq. (19). For $l > 1$, we force $\alpha(x_t) = 0$ to prevent any sampling from $p(y_t \mid x_t)$, because samples from $p(y_t \mid x_t)$ have already been taken at $l = 1$. The particles are then sampled from the prior distribution, and the importance weights of the particles are computed according to Eq. (23). At the end of each layer, the covariance matrix of the motion model is shrunk by a factor of 0.5 so that the search in the next layer focuses on a narrower region of the pose space.

TABLE I

ALGORITHM FOR THE ANNEALED GAUSSIAN PROCESS GUIDED PARTICLE FILTER FOR 3D HUMAN POSE TRACKING

Pose samples obtained from $p(y_t \mid x_t)$ need to be aligned if the test images are taken from a camera view that is different from the camera view of the training images. Let $\tau_t$ be the torso orientation of a sample $y_t$ obtained from the discriminative distribution. To align the pose sample $y_t$ to the camera view of the test image, we subtract from $\tau_t$ the camera direction angle difference between the training and test views. This angular difference is determined from the camera calibration matrices of the two camera views. Also, each sample obtained from $p(y_t \mid x_t)$ is a pose vector in terms of Euler angles. We convert the pose vector to a motion vector by prepending the global translation vector to the pose vector. The global translation vectors are obtained by sampling from the first term of the prior distribution given in Eq. (22). In this way, all the samples in the particle set are obtained as motion vectors.
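The per-frame procedure described above (mixture prior at the first layer, motion-only sampling at later layers, annealed weighting, and covariance shrinking) can be sketched for a scalar pose. This is a toy illustration under several assumptions that are not from the paper: the discriminative distribution is a single Gaussian rather than a mixture, the inverse temperatures follow a fixed hypothetical schedule instead of the survival-rate search, and the function name `agp_pf_frame` is invented.

```python
import math
import random


def agp_pf_frame(particles, weights, gp_mean, gp_var, likelihood,
                 motion_var, lam, rng, inv_temps=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Process one frame of a scalar-pose toy AGP-PF and return (particles, weights)."""
    n = len(particles)
    alpha = math.exp(-gp_var / lam)              # Eq. (19)
    var = motion_var
    for layer, a_l in enumerate(inv_temps, start=1):
        mix = alpha if layer == 1 else 0.0       # alpha is forced to 0 for l > 1
        # Bases for the motion-model term of Eq. (22): resample the previous set.
        bases = rng.choices(particles, weights=weights, k=n)
        proposed = []
        for base in bases:
            if rng.random() < mix:
                # Draw from the discriminative distribution p(y_t | x_t).
                proposed.append(rng.gauss(gp_mean, math.sqrt(gp_var)))
            else:
                # Draw from the Gaussian motion model centred on a resampled particle.
                proposed.append(rng.gauss(base, math.sqrt(var)))
        raw = [likelihood(y) ** a_l for y in proposed]   # Eq. (23)
        total = sum(raw)
        particles, weights = proposed, [w / total for w in raw]
        var *= 0.5                               # shrink the motion covariance
    return particles, weights
```

In this toy setting, a confident discriminative prediction (small `gp_var`) lets the particle set jump to a pose that the motion model alone would reach only slowly, which mirrors the recovery behaviour the method is designed for.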

It can be seen that the annealed particle filter (APF) is a special case of the AGP-PF when $\alpha = 0$. When $\alpha = 1$, the method does not use the motion model; instead, it performs pose detection by sampling from the discriminative distribution alone and validating the samples using the importance weights computed from the likelihood distribution. Hence, the method of Ref. [8] can be seen as a special case of the AGP-PF when $\alpha = 1$. Our method, on the other hand, adaptively chooses the value of $\alpha$ between 0 and 1, as the mixing coefficient $\alpha(x_t)$ in Eq. (20) is inversely related to the prediction uncertainty of the discriminative model. Hence the pose predictions that are likely to be correct are retained to guide the tracking process. This provides stable tracking since, at each time step, even if the motion model produces wrong samples, the samples obtained from the discriminative distribution are used to compute the posterior. Conversely, if the predictions from the discriminative model are uncertain, less emphasis is given to them, and the tracking is driven more by the results of the previous frame.

V. EXPERIMENTS

A. Dataset Description

We trained and evaluated our proposed 3D human pose tracking method using the HumanEva-I dataset [62]. We used video frames and the corresponding 3D poses of the three subjects from the Walking and Jogging sequences of the dataset to train and evaluate our approach. The dataset was originally partitioned into training, validation, and testing sets. However, as the ground truth of the testing set was not provided, we used the validation set (which provides ground truth) as our testing set and the original training set as our training set. Table II shows the number of images in the training and testing sets for each activity. For each image, the corresponding ground truth 3D motion is given by the 3D pose and the global translation vector. The 3D pose is given by the relative orientations of the 10 body parts in terms of Euler angles. We also used the Walking data of subject S2 from the HumanEva-II dataset to compare our method with other methods.

B. Training

For all the images in the training and testing sets, we extracted the silhouette images using the background subtraction method described in Ref. [50]. We computed 64-dimensional DCT shape descriptor vectors from the silhouettes following the process described in Section III-A. We then trained the mixture of Gaussian Process regressors, which map the DCT descriptor to the 3D poses, using the supervised learning approach described in Section III-B. We set the number of clusters to $K = 6$ so as to allow sufficient partitioning of the pose space and to model the potential ambiguities associated with the silhouettes as described in Section III-B1. We trained a Gaussian Process regressor for each cluster. The final output of the discriminative model is the mixture of Gaussians given by Eq. (10).

C. Tracking and Evaluation

The discriminative model is combined with the motion model to track the pose using our proposed AGP-PF algorithm described in Section IV. We sampled 300 particles at each iteration of AGP-PF. The final output of the AGP-PF is a 33-dimensional motion vector whose components are the global translation vector and the relative Euler angles of the ten body parts. Given the lengths of the body parts, we converted the motion vector to absolute 3D joint locations using forward kinematics. We computed the 3D error between the estimated 3D joint locations $\bar{y}$ and the ground truth 3D joint locations $\hat{y}$ as follows [58]:

$$E(\bar{y}, \hat{y}) = \frac{1}{J} \sum_{i=1}^{J} \left\| m_i(\bar{y}) - m_i(\hat{y}) \right\|,$$

where $m_i(y) \in \mathbb{R}^3$ denotes the three-dimensional coordinates of the $i$th joint location from the pose vector $y \in \mathbb{R}^d$; $\|\cdot\|$ denotes the Euclidean distance; and $J$ is the number of joint locations. The formula measures the 3D error in mm between the two pose vectors. The mean 3D error over the $T$ test images is computed by $\bar{E} = \frac{1}{T} \sum_{i=1}^{T} E(\bar{y}^{(i)}, \hat{y}^{(i)})$.

Fig. 8. The variance of the output from the GP regressor and the mixing coefficient ($\alpha$) for each image frame from the Walking sequence of Subject S2. The mixing coefficient ($\alpha$) was computed according to Eq. (19).

TABLE II

NUMBER OF IMAGES USED FOR TRAINING AND TESTING SET IN THE HUMANEVA-I DATASET

We investigated three cases for pose tracking. The first is the AGP-PF, where the value of the mixing coefficient $\alpha$ decreases exponentially with the variance of the most likely Gaussian component of the discriminative distribution, as given in Eq. (19). In this case, $\alpha$ takes values between 0 and 1. When the variance of the discriminative distribution is lower, $\alpha$ is set to a higher value, and consequently more samples are taken from the discriminative distribution during tracking. Conversely, a higher variance of the discriminative distribution sets $\alpha$ to a lower value, and as a result fewer samples are taken from the discriminative distribution. Since a higher variance of the discriminative distribution implies an uncertain prediction and vice versa, the discriminative predictions that are more certain are automatically selected for tracking. Figure 8 shows an example of the inverse relationship between the variance of the discriminative distribution and the mixing coefficient $\alpha$ for each image in the video. It can be seen that the value of $\alpha$ varies from frame to frame.

In the second case, we set $\alpha = 1$, which suppresses the sequential sampling nature of AGP-PF. The search is performed only in the pose space predicted by GP regression and discards the output pose of the preceding frame. In this case, the motion model is used only to predict the global translation vectors. We refer to this case as the Annealed Gaussian Process (AGP) method.

Fig. 9. Mixing coefficient ($\alpha$) versus variance $\sigma^2$ for different values of the control parameter $\lambda$.

In the third case, we set $\alpha = 0$ so that only the motion model is used to predict the pose. In this case, the discriminative distributions from the Gaussian Process regressors are discarded. This case is equivalent to the annealed particle filter (APF). Note that all three cases use annealing for pose tracking. We also evaluated the performance of a standard particle filter (PF) and of the GP-PF, neither of which uses annealing for tracking.

1) Parameter Selection: There are four parameters that need to be set in the tracking system. The first is the value of $\lambda$ in Eq. (19). Figure 9 shows $\alpha$ as a function of $\sigma^2$ for different values of the control parameter $\lambda$. We empirically observed that predictions from the MGP regressor whose variance ($\sigma^2$) is greater than 0.015 have larger pose estimation errors. To suppress the influence of such predictions on tracking, we select $\lambda = 0.003$, so that $\alpha$ is close to zero for $\sigma^2 > 0.015$ (at $\sigma^2 = 0.015$, Eq. (19) gives $\alpha = \exp(-5) \approx 0.007$).

The second parameter is the diagonal sampling covariance matrix $\Sigma$ of the motion model. The diagonal components of $\Sigma$ correspond to the sampling variance of each body angle. They are computed so that the standard deviation of each body angle equals the maximum absolute inter-frame angular difference for a particular activity [27].
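The rule for setting the motion covariance can be sketched as below. The sketch assumes that a training sequence of pose vectors for one activity is available as a list of frames and that angle wrap-around can be ignored; the name `motion_variances` is hypothetical.

```python
def motion_variances(angle_sequence):
    """Diagonal of the motion covariance estimated from a sequence of pose vectors.

    angle_sequence is a list of frames, each a list of body angles. The standard
    deviation of each angle is set to its maximum absolute inter-frame difference,
    so the returned variance is that maximum squared.
    """
    if len(angle_sequence) < 2:
        raise ValueError("need at least two frames")
    dims = len(angle_sequence[0])
    max_diff = [0.0] * dims
    for prev, curr in zip(angle_sequence, angle_sequence[1:]):
        for j in range(dims):
            max_diff[j] = max(max_diff[j], abs(curr[j] - prev[j]))
    return [d * d for d in max_diff]
```

Faster activities produce larger inter-frame differences and therefore a wider sampling distribution, which is why the covariance is computed per activity.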

The third free parameter is the number of particles $n$. We found its optimal value via validation. Figure 10 plots the mean 3D errors against $n$ for the different tracking methods. The mean 3D errors were computed using 150 images of a walking human subject. It can be seen that for AGP-PF, APF, and GP-PF, the performance did not improve for $n > 300$. Hence we set $n = 300$. Our proposed AGP-PF tracking gave the lowest error for every number of particles tested. The figure also shows the advantage of annealing: AGP-PF and APF perform better than GP-PF and PF.

Fig. 10. Number of particles ($n$) versus mean 3D pose tracking errors for different pose tracking methods.

TABLE III

COMPARISON OF 3D POSE ESTIMATION ERRORS OF THE MGP REGRESSOR FOR DIFFERENT TYPES OF GATING FUNCTIONS ON WALKING AND JOGGING ACTIVITIES

TABLE IV

THE MEAN AND STANDARD DEVIATION OF 3D TRACKING ERRORS IN mm OF AGP-PF, AGP, AND APF. THE TRACKING WAS PERFORMED ON IMAGES CAPTURED BY (A) THREE CAMERAS AND (B) A SINGLE CAMERA

The fourth free parameter is the number of annealing layers $M$. We set $M = 5$, as the pose tracking performance does not improve for $M > 5$. Table III compares the pose estimation performance of the MGP regressor for different choices of gating functions on the Walking and Jogging sequences of the dataset. The performance of the MGP regressor is superior when the multinomial logistic regressor is used. We therefore use the multinomial logistic regressor as the gating function of our MGP regressor. It should be noted that, in order to compute the 3D pose estimation error, the joint locations are measured relative to the torso joint.
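The 3D error measure used in this subsection, including the torso-relative variant mentioned above, can be sketched as follows. The sketch assumes that joint locations are already available as lists of (x, y, z) tuples in mm after forward kinematics; the function names and the `torso_index` parameter are hypothetical.

```python
import math


def joint_error(est_joints, gt_joints, torso_index=None):
    """Mean Euclidean distance between corresponding 3D joints.

    If torso_index is given, each joint set is first expressed relative to its
    own torso joint before the distances are computed.
    """
    if len(est_joints) != len(gt_joints) or not est_joints:
        raise ValueError("joint lists must be non-empty and of equal length")

    def relative(joints):
        if torso_index is None:
            return joints
        origin = joints[torso_index]
        return [tuple(c - o for c, o in zip(j, origin)) for j in joints]

    est, gt = relative(est_joints), relative(gt_joints)
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for p, q in zip(est, gt)]
    return sum(dists) / len(dists)


def mean_error(est_frames, gt_frames, torso_index=None):
    """Average of the per-frame joint error over all test frames."""
    errors = [joint_error(e, g, torso_index) for e, g in zip(est_frames, gt_frames)]
    return sum(errors) / len(errors)
```

Measuring joints relative to the torso removes the global translation from the error, so it isolates the accuracy of the articulated pose.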

2) Experimental Results: Tables IV(a) and IV(b) show the mean 3D tracking errors for the three cases for subjects S1, S2, and S3 for the Walking and Jogging activities. Table IV(a) shows the tracking errors for a single camera whereas Table IV(b) shows the tracking errors for three cameras. Also included in the tables are the standard deviations of these errors. The results show that our AGP-PF with a dynamic mixing coefficient gave the lowest mean errors. It also produced the smallest standard deviations, indicating that the poses estimated using our method are more stable. It can be seen that the annealed particle filter (APF) produced a larger error because tracking failed at an early stage. Although the pose detection method (AGP) produced smaller errors than the APF, the standard deviations of its errors were larger than those from AGP-PF. Moreover, AGP-PF with dynamic mixing coefficients was able to track the pose more accurately over the frames and to recover from mistracking.

Fig. 11. Comparison of the 3D tracking errors in mm for the Walking sequence evaluated from AGP-PF, APF and AGP pose tracking methods for (a) single camera and (b) multiple cameras.

Fig. 12. Examples of the 3D poses obtained from the GP regression (rendered in blue) and AGP-PF tracking (rendered in green). Below each output is the corresponding value of $\alpha$. Each 3D pose is illustrated using the boundaries of the projected cylinders of the 3D pose (figure best viewed in color).

Figures 11(a) and (b) show the pose tracking errors on the Walking sequence for all three cases for single and multiple camera tracking. It can be seen that AGP-PF gave the lowest errors for most of the frames and provided more stable pose estimates than the other two cases for both single and multiple camera tracking. AGP-PF performs tracking by giving higher weights to the correct output poses from GP regression while discarding incorrect output poses. Examples of how incorrect output poses from GP regression are given less weight for tracking are shown in Fig. 12. In these cases, our AGP-PF gives more weight to the poses sampled from the motion model and hence gives correct predictions.

Fig. 13. (a) An image from the test set. (b) Corresponding silhouette image. (c) and (d) The 3D pose estimates of the two GP regressors with the highest gating probability, with red denoting left limbs and blue denoting right limbs. The associated gating probabilities are displayed below each output pose. These two probable solutions denote the pose ambiguities associated with the silhouette. (e) Our AGP-PF used them as input for 3D pose tracking (figure best viewed in color).

Fig. 14. Comparison of the ground truth and the estimated knee flexion angle using our proposed AGP-PF method.

Figure 13 shows an example of a multi-modal pose output given by the mixture of GP models. The 3D pose estimates of the two GP regressors with the highest gating probability are shown in Fig. 13(c) and (d). These two probable solutions denote the pose ambiguities associated with the silhouette. Our AGP-PF tracking method uses them to predict the correct 3D pose as shown in Fig. 13(e). Figure 14 compares the knee flexion angles estimated using our method with the ground truth knee flexion angles for each video frame. Figures 15 and 16 display some of the output poses predicted using our method for the Walking and Jogging sequences. The faster and larger arm and leg movements of the human subject in the Jogging sequence make the pose estimation problem more challenging. Our experiments show that our proposed AGP-PF can effectively track the 3D pose in a video sequence. Examples of videos to illustrate the tracking of the 3D pose are available.1

10ca8f1da71fe910ef02df81c.au/?

suman/videos

Fig. 15. Pose estimation results on some of the images of the test set of the Walking sequence. Each 3D pose is illustrated using the boundaries of the projected cylinders of the 3D pose (figure best viewed in color).

TABLE V

COMPARISON OF MEAN 3D TRACKING ERRORS (mm) OF OUR AGP-PF METHOD WITH OTHER APPROACHES ON THE WALKING ACTIVITY OF THE HUMANEVA-I DATASET. THE STANDARD DEVIATION OVER FRAMES IS NOT PROVIDED BY [43]

D. Comparison With Other Works

We compared our work with the state-of-the-art tracking methods [4], [32], [34], [58]. These approaches use an annealed particle filter [58], a smoothing particle filter [63], a particle filter with a physics-based prior [34], local optimization [4], a hybrid of local and global optimization [32], and a latent variable model [43] for 3D pose tracking.

Table V shows our pose tracking results compared to the smoothing particle filter of Ref. [63] and the latent variable model of Ref. [43] on the HumanEva-I dataset. The results show that our AGP-PF outperforms the method of Ref. [63] for all the subjects. The mean 3D tracking error of our approach is lower than that of Ref. [43] for subjects S1 and S2, and higher than that of Ref. [43] for subject S3.

Table VI shows our pose tracking results compared to those from other state-of-the-art approaches for the Walking sequence of Subject S2 of the HumanEva-II dataset. The camera view of the dataset that is used to train the discriminative model is different from the view of Camera C1 of the HumanEva-II dataset, which is used as the test sequence. Therefore, we normalize the view of the pose predicted by the discriminative model (aligning it with respect to the test camera view) before tracking. With this pre-processing step, we obtained a mean tracking error of 73.0 mm, which surpasses the performance of Refs. [58] and [4]. This shows that our method is able to generalize well not only with respect to the subjects' gaits and genders but also between camera views.

Fig. 16. Pose estimation results on some of the images of the test set of the Jogging sequence. Each 3D pose is illustrated using the boundaries of the projected cylinders of the 3D pose. In comparison with the Walking sequence in Fig. 15, because of the larger and faster arm and leg movements, the estimated poses for this sequence are less accurate (figure best viewed in color).

TABLE VI

3D POSE TRACKING ERRORS OF OUR AGP-PF METHOD AND SOME STATE OF ART POSE TRACKING APPROACHES ON THE HUMANEVA-II DATASET. THE WALKING ACTIVITY OF SUBJECT S2 IS USED IN THE EXPERIMENT

We also compared our algorithm to the approaches in Refs. [34] and [32], which achieve better performance (average errors of 37 mm and 53 mm, respectively). This is understandable because Ref. [34] uses strong priors based on a complex biomechanical model of the human body, and Ref. [32] uses local optimization with a surface-based 3D human model obtained by scanning the person's body offline (using an expensive body scanner). This scanning phase requires the full cooperation of the person, which is not required in our case.

VI. CONCLUSION

We have presented an Annealed Gaussian Process Guided Particle Filter (AGP-PF) for 3D human pose tracking in video sequences. Our method combines the discriminative distribution obtained from the mixture of Gaussian Process regression models with a motion model to obtain accurate and stable tracking of the 3D pose in each video frame. We use the prediction uncertainty obtained from the Gaussian Process regression to dynamically determine the contribution of the discriminative model to tracking. Our method does not require initialization and can resolve pose ambiguities during tracking using motion information and multi-view images. Experimental results show that our proposed AGP-PF can accurately track the 3D pose in a long video sequence. Although this paper uses a stationary motion model for pose tracking, we believe that the pose can be tracked more accurately by using a learned motion model. Tracking in a lower-dimensional latent space could also be employed for more efficient performance. Moreover, the scalability of our method could be improved by training the discriminative model using data from various other activities.

R EFERENCES

[1] C. Sminchisescu and B. Triggs, “Estimating articulated human motion with covariance scaled sampling,” Int. J. Robot. Res., vol. 22, no. 6, pp. 371–391, Jun. 2003.

[2] J. Deutscher, A. Blake, and I. Reid, “Articulated body motion capture by annealed particle filtering,” in Proc. IEEE Conf. CVPR, vol. 2. Jun. 2000, pp. 126–133.

[3] M. Isard and A. Blake, “Condensation—Conditional density propagation for visual tracking,” Int. J. Comput. Vis., vol. 29, no. 1, pp. 5–28, Aug. 1998.

[4] S. Corazza, L. Mündermann, E. Gambaretto, G. Ferrigno, and T. P. Andriacchi, “Markerless motion capture through visual hull, articulated ICP and subject specific model generation,” Int. J. Comput. Vis., vol. 87, nos. 1–2, pp. 156–169, Mar. 2010.

[5] A. Agarwal and B. Triggs, “Monocular human motion capture with a mixture of regressors,” in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 3. Jun. 2005, p. 72.

[6] G. Shakhnarovich, P. Viola, and T. Darrell, “Fast pose estimation with parameter-sensitive hashing,” in Proc. IEEE Int. Conf. Comput. Vis., vol. 2. Oct. 2003, pp. 750–757.

[7] A. Kanaujia and D. Metaxas, “Learning ambiguities using Bayesian mixture of experts,” in Proc. 18th IEEE Int. Conf. Tools Artif. Intell., Nov. 2006, pp. 436–440.

[8] R. Rosales and S. Sclaroff, “Combining generative and discriminative models in a framework for articulated pose estimation,” Int. J. Comput. Vis., vol. 67, no. 3, pp. 251–276, May 2006.

[9] L. Sigal, A. Balan, and M. Black, “Combined discriminative and generative articulated pose and non-rigid shape estimation,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2008, pp. 1337–1344.

[10] A. Agarwal and B. Triggs, “Recovering 3D human pose from monocular images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 1, pp. 44–58, Jan. 2006.

[11] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas, “Discriminative density propagation for 3D human motion estimation,” in Proc. IEEE Conf. CVPR, vol. 1. Jun. 2005, pp. 390–397.

[12] R. Urtasun and T. Darrell, “Sparse probabilistic regression for activity-independent human pose inference,” in Proc. IEEE Conf. CVPR, Jun. 2008, pp. 1–8.

[13] X. Zhao, Y. Fu, and Y. Liu, “Human motion tracking by temporal-spatial local Gaussian process experts,” IEEE Trans. Image Process., vol. 20, no. 4, pp. 1141–1151, Apr. 2011.

[14] L. Bo, C. Sminchisescu, A. Kanaujia, and D. Metaxas, “Fast algorithms for large scale conditional 3D prediction,” in Proc. IEEE Conf. CVPR, Jun. 2008, pp. 1–8.

[15] A. Kanaujia, C. Sminchisescu, and D. N. Metaxas, “Semi-supervised hierarchical models for 3D human pose reconstruction,” in Proc. IEEE Conf. CVPR, Jun. 2007, pp. 1–8.

[16] H. Ning, W. Xu, Y. Gong, and T. Huang, “Discriminative learning of visual words for 3D human pose estimation,” in Proc. IEEE Conf. CVPR, Jun. 2008, pp. 1–8.

[17] A. Bissacco, M.-H. Yang, and S. Soatto, “Fast human pose estimation using appearance and motion via multi-dimensional boosting regression,” in Proc. IEEE Conf. CVPR, Jun. 2007, pp. 1–8.

[18] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-time human pose recognition in parts from single depth images,” in Proc. CVPR, 2011.

[19] A. Agarwal and B. Triggs, “A local basis representation for estimating human pose from cluttered images,” in Proc. ACCV, Jan. 2006, pp. 50–59.

[20] S. Sedai, M. Bennamoun, and D. Q. Huynh, “Context-based appearance descriptor for 3D human pose estimation from monocular images,” in Proc. DICTA, Dec. 2009, pp. 484–491.

[21] S. Sedai, M. Bennamoun, and D. Q. Huynh, “Localized fusion of shape and appearance features for 3D human pose estimation,” in Proc. BMVC, Sep. 2010, pp. 1–10.

[22] S. Sedai, M. Bennamoun, and D. Q. Huynh, “Discriminative fusion of shape and appearance features for human pose estimation,” Pattern Recognit., to be published.

[23] A. Fossati, M. Salzmann, and P. Fua, “Observable subspaces for 3D human motion recovery,” in Proc. IEEE CVPR, Jun. 2009, pp. 1137–1144.

[24] C. Ionescu, L. Bo, and C. Sminchisescu, “Structural SVM for visual localization and continuous state estimation,” in Proc. 12th Int. Conf. Comput. Vis., Oct. 2009, pp. 1157–1164.

[25] L. Bo and C. Sminchisescu, “Twin Gaussian processes for structured prediction,” Int. J. Comput. Vis., vol. 87, nos. 1–2, pp. 28–52, Mar. 2010.

[26] R. Navaratnam, A. Fitzgibbon, and R. Cipolla, “The joint manifold model for semi-supervised multi-valued regression,” in Proc. IEEE 11th ICCV, Oct. 2007, pp. 1–8.

[27] J. Deutscher and I. D. Reid, “Articulated body motion capture by stochastic search,” Int. J. Comput. Vis., vol. 61, no. 2, pp. 185–205, Feb. 2005.

[28] C. Bregler and J. Malik, “Tracking people with twists and exponential maps,” in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 1998, pp. 8–15.

[29] M. W. Lee and I. Cohen, “A model-based approach for estimating human 3D poses in static images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 6, pp. 905–916, Jun. 2006.

[30] D. Ramanan, D. Forsyth, and A. Zisserman, “Strike a pose: Tracking people by finding stylized poses,” in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 1. Jun. 2005, pp. 271–278.

[31] J. Li and N. M. Allinson, “A comprehensive review of current local features for computer vision,” Neurocomput., vol. 71, nos. 10–12, pp. 1771–1787, Jun. 2008.

[32] J. Gall, B. Rosenhahn, T. Brox, and H.-P. Seidel, “Optimization and filtering for human motion capture,” Int. J. Comput. Vis., vol. 87, nos. 1–2, pp. 75–92, 2010.

[33] H. Sidenbladh, M. J. Black, and D. J. Fleet, “Stochastic tracking of 3D human figures using 2D image motion,” in Proc. ECCV, vol. 2. 2000, pp. 702–718.

[34] M. A. Brubaker, D. J. Fleet, and A. Hertzmann, “Physics-based person tracking using the anthropomorphic walker,” Int. J. Comput. Vis., vol. 87, nos. 1–2, pp. 140–155, Mar. 2010.

[35] H. Sidenbladh and M. J. Black, “Learning the statistics of people in images and video,” Int. J. Comput. Vis., vol. 54, nos. 1–3, pp. 181–207, 2003.

[36] J. Deutscher, A. J. Davison, and I. D. Reid, “Automatic partitioning of high dimensional search spaces associated with articulated body motion capture,” in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 2. 2001, pp. II-669–II-676.

[37] A. Sundaresan and R. Chellappa, “Multicamera tracking of articulated human motion using shape and motion cues,” IEEE Trans. Image Process., vol. 18, no. 9, pp. 2114–2126, Sep. 2009.

[38] R. Navaratnam, A. Thayananthan, P. H. Torr, and R. Cipolla, “Hierarchical part-based human body pose estimation,” in Proc. BMVC, Sep. 2005.

[39] D. Ramanan, D. Forsyth, and A. Zisserman, “Tracking people by learning their appearance,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 65–81, Jan. 2007.

[40] R. Li, T.-P. Tian, S. Sclaroff, and M.-H. Yang, “3D human motion tracking with a coordinated mixture of factor analyzers,” Int. J. Comput. Vis., vol. 87, nos. 1–2, pp. 170–190, Mar. 2010.

[41] G. W. Taylor and G. E. Hinton, “Factored conditional restricted Boltzmann machines for modeling motion style,” in Proc. 26th Annu. Int. Conf. Mach. Learn., Jun. 2009, pp. 1025–1032.

[42] R. Urtasun, D. J. Fleet, and P. Fua, “3D people tracking with Gaussian process dynamical models,” in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 1. Jun. 2006, pp. 238–245.

[43] A. Yao, J. Gall, L. J. V. Gool, and R. Urtasun, “Learning probabilistic non-linear latent variable models for tracking complex activities,” in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 1359–1367.

[44] J. Wang, D. Fleet, and A. Hertzmann, “Gaussian process dynamical models for human motion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 283–298, Feb. 2008.

[45] C. Sminchisescu, A. Kanaujia, and D. Metaxas, “Learning joint top-down and bottom-up processes for 3D visual inference,” in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 2. Jun. 2006, pp. 1743–1752.

[46] M. Salzmann and R. Urtasun, “Combining discriminative and generative methods for 3D deformable surface and articulated pose reconstruction,” in Proc. IEEE Conf. CVPR, Jun. 2010, pp. 647–654.

[47] S. Sedai, D. Q. Huynh, and M. Bennamoun, “Supervised particle filter for tracking 2D human pose in monocular video,” in Proc. IEEE WACV, Jan. 2011, pp. 367–373.

[48] J. Ko and D. Fox, “GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models,” Auto. Robot., vol. 27, no. 1, pp. 75–90, 2009.

[49] I. Patras and E. R. Hancock, “Coupled prediction classification for robust visual tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1553–1567, Sep. 2010.

[50] A. Elgammal, R. Duraiswami, D. Harwood, and L. Davis, “Background and foreground modeling using nonparametric kernel density estimation for visual surveillance,” Proc. IEEE, vol. 90, no. 7, pp. 1151–1163, Jul. 2002.

[51] S. Sedai, M. Bennamoun, and D. Q. Huynh, “Evaluating shape and appearance descriptors for 3D human pose estimation,” in Proc. 6th IEEE ICIEA, Jun. 2011, pp. 293–298.

[52] P. Tresadern and I. Reid, “An evaluation of shape descriptors for image retrieval in human pose estimation,” in Proc. BMVC, vol. 2. Sep. 2007, pp. 800–809.

[53] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). Cambridge, MA, USA: MIT Press, 2005.

[54] V. Tresp, “Mixtures of Gaussian processes,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2001, pp. 654–660.

[55] B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink, “Sparse multinomial logistic regression: Fast algorithms and generalization bounds,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 957–968, Jun. 2005.

[56] C. H. Ek, J. Rihan, P. H. Torr, G. Rogez, and N. Lawrence, “Ambiguity modeling in latent spaces,” in Proc. 5th Int. Workshop Mach. Learn. Multimodal Int., 2008, pp. 62–73.

[57] C. Sminchisescu and A. Telea, “Human pose estimation from silhouettes: A consistent approach using distance level sets,” in Proc. Int. Conf. Comput. Graph., Vis. Comput. Vis., 2002, pp. 413–420.

[58] L. Sigal, A. Balan, and M. Black, “HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion,” Int. J. Comput. Vis., vol. 87, nos. 1–2, pp. 4–27, Mar. 2010.

[59] J. Deutscher, B. North, B. Bascle, and A. Blake, “Tracking through singularities and discontinuities by random sampling,” in Proc. 7th IEEE Int. Conf. Comput. Vis., vol. 2. Sep. 1999, pp. 1144–1149.

[60] A. Doucet, N. De Freitas, and N. Gordon, Sequential Monte Carlo Methods in Practice. New York, NY, USA: Springer-Verlag, 2001.

[61] A. Pfeffer, “Sufficiency, separability and temporal probabilistic models,” in Proc. 17th Conf. UAI, 2001, pp. 421–428.

[62] L. Sigal and M. J. Black, “HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion,” Dept. Comput. Sci., Brown Univ., Providence, RI, USA, Tech. Rep. CS-06-08, 2006.

[63] P. Peursum, S. Venkatesh, and G. West, “A study on smoothing for particle-filtered 3D human body tracking,” Int. J. Comput. Vis., vol. 87, nos. 1–2, pp. 53–74, Mar. 2010.

Suman Sedai received the M.Sc. degree from Inha University, Incheon, Korea, and the Ph.D. degree from the University of Western Australia, Crawley, Australia, in 2007 and 2012, respectively. His current research interests include image processing, visual tracking, object recognition, pattern recognition, and machine learning. He is currently a Post-Doctoral Researcher with IBM Research, Melbourne, Australia.

Mohammed Bennamoun received the M.Sc. degree in control theory from Queen’s University, Kingston, ON, Canada, and the Ph.D. degree in computer vision from Queen’s/QUT, Brisbane, Australia. He was a lecturer in robotics with Queen’s University and joined QUT in 1993 as an Associate Lecturer. He is currently a Winthrop Professor. He served as the Head of the School of Computer Science and Software Engineering, The University of Western Australia, Crawley, Australia, from 2007 to 2012. He served as the Director of the University Centre at QUT: The Space Centre for Satellite Navigation from 1998 to 2002. He was an Erasmus Mundus Scholar and a Visiting Professor with the University of Edinburgh, Edinburgh, U.K., in 2006. He was a Visiting Professor with the Centre National de la Recherche Scientifique and Telecom Lille 1, France, in 2009, Helsinki University of Technology, Helsinki, Finland, in 2006, and University of Bourgogne and Paris 13, Paris, France, from 2002 to 2003. He is the co-author of Object Recognition: Fundamentals and Case Studies (Springer-Verlag, 2001) and the co-author of an edited book on Ontology Learning and Knowledge Discovery Using the Web in 2011. He has published over 250 journal and conference publications and secured highly competitive national grants from the Australian Research Council (ARC). Some of these grants were in collaboration with industry partners (through the ARC Linkage Project scheme) to solve real research problems for industry, including Swimming Australia, the West Australian Institute of Sport, a textile company (Beaulieu Pacific), and AAM-GeoScan. He has worked on research problems and collaborated (through joint publications, grants, and supervision of Ph.D. students) with researchers from different disciplines, including animal biology, speech processing, biomechanics, ophthalmology, dentistry, linguistics, robotics, photogrammetry, and radiology. He received the Best Supervisor of the Year Award from QUT. He received an award for research supervision from UWA in 2008. He served as a Guest Editor for a couple of special issues in international journals, such as the International Journal of Pattern Recognition and Artificial Intelligence. He was selected to give conference tutorials at the European Conference on Computer Vision and the International Conference on Acoustics, Speech and Signal Processing. He has organized several special sessions for conferences, including a special session for the IEEE International Conference on Image Processing. He was on the program committee of many international conferences. He has contributed to the organization of many local and international conferences. His current research interests include control theory, robotics, obstacle avoidance, object recognition, artificial neural networks, signal/image processing, and computer vision (particularly 3D).

Du Q. Huynh is an Associate Professor with the School of Computer Science and Software Engineering, University of Western Australia, Crawley, Australia. She received the Ph.D. degree in computer vision from the University of Western Australia in 1994. She was with the Australian Cooperative Research Centre for Sensor Signal and Information Processing, Murdoch University, Murdoch, Australia. She has been a Visiting Scholar with Lund University, Lund, Sweden, Malmö University, Malmö, Sweden, Chinese University of Hong Kong, Hong Kong, Nagoya University, Nagoya, Japan, Gunma University, Gunma, Japan, and the University of Melbourne, Melbourne, Australia. She has received several grants funded by the Australian Research Council. Her current research interests include shape from motion, multiple view geometry, video image processing, and visual tracking.
