


A Discriminative CNN Video Representation for Event Detection

Zhongwen Xu†   Yi Yang†   Alexander G. Hauptmann§
†QCIS, University of Technology, Sydney    §SCS, Carnegie Mellon University

Abstract

In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkits. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating frame level static features, we show that performance can be significantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the Mean Average Precision (mAP) from 27.6% to 36.8% for the TRECVID MEDTest14 dataset and from 34.0% to 44.6% for the TRECVID MEDTest13 dataset.

1. Introduction and Related Work

Complex event detection [1, 2], which targets the detection of such events as "renovating a home" in a large video collection crawled from Youtube, has recently attracted a lot of research attention in computer vision. Compared to concept analysis in videos, e.g., action recognition, event detection is more difficult primarily because an event is more complex and thus has greater intra-class variations. For example, a "marriage proposal" event may take place indoors or outdoors, and may consist of multiple concepts such as ring (object), kneeling down (action) and kissing (action).

Recent research efforts have shown that combining multiple features, including static appearance features [9, 25, 41], motion features [23, 7, 43, 44, 33] and acoustic features [28], yields good performance in event detection, as evidenced by the reports of the top ranked teams in the TRECVID Multimedia Event Detection (MED) competition [3, 22, 29, 30] and by research papers [26, 31, 40, 45] that have tackled this problem. To utilize additional data to assist complex event detection, researchers have proposed the use of "video attributes" derived from other sources to facilitate event detection [27], or the use of related exemplars when the training exemplars are very few [46]. As we focus on improving the video representation in this paper, our new method can be readily fed into those frameworks to further improve their performance.

Dense Trajectories and its enhanced version, improved Dense Trajectories (IDT) [44], have dominated complex event detection in recent years due to their superior performance over other features such as the motion feature STIP [23] and the static appearance feature Dense SIFT [3].

Despite good performance, heavy computation costs greatly restrict the usage of improved Dense Trajectories on a large scale. In the TRECVID MED 2014 competition [2], the National Institute of Standards and Technology (NIST) introduced a very large video collection containing 200,000 videos with a total duration of 8,000 hours. Parallelizing over 1,000 cores, it takes about one week to extract improved Dense Trajectories for the 200,000 videos in the TRECVID MEDEval14 collection. Even after spatial re-sizing and temporal down-sampling, it still takes 500 cores one week to extract the features [3]. As a result of this unaffordable computation cost, it would be extremely difficult for a relatively small research group with limited computational resources to process large scale MED datasets. It therefore becomes important to propose an efficient representation for complex event detection that requires only affordable computational resources, e.g., a single machine, while at the same time attempting to achieve better performance.

One instinctive idea would be to utilize a deep learning approach, especially Convolutional Neural Networks (CNNs), given their overwhelming accuracy in image analysis and fast processing speed, which is achieved by leveraging the massive parallel processing power of GPUs [21].

However, it has been reported that the event detection performance of CNN based video representation is worse than that of improved Dense Trajectories in TRECVID MED 2013 [22, 3], as shown in Table 1. A few technical problems remain unsolved.

                          MEDTest13   MEDTest14
  IDT [44, 3]               34.0        27.6
  CNN in Lan et al. [22]    29.0        N.A.
  CNN avg                   32.7        24.8

Table 1. Performance comparison (mean Average Precision in percentage). Lan et al. [22] is the only attempt to apply CNN features in TRECVID MED. CNN avg are our results from the average pooling representation of frame level CNN descriptors.

Firstly, CNN requires a large amount of labeled video data to train good models from scratch. The large scale TRECVID MED datasets (i.e., MEDTest13 [1] and MEDTest14 [2]) only have 100 positive examples per event, along with many null videos which are irrelevant. The number of labeled videos is smaller than that of the video collection for sports videos [20]. In addition, as indicated in [46], event videos are quite different from action videos, so it makes little sense to use an action dataset to train models for event detection.

Secondly, when dealing with a domain specific task with a small amount of training data, fine-tuning [12] is an effective technique for adapting ImageNet pre-trained models to new tasks. However, the video level event labels are rather coarse at the frame level, i.e., not all frames necessarily contain the semantic information of the event. If we use the coarse video level label for each frame, performance is barely improved by frame level fine-tuning; this was verified by our preliminary experiment¹.

Lastly, given the frame level CNN descriptors, we need to generate a discriminative video level representation. Average pooling is the standard approach [32, 3] for static local features, as well as for CNN descriptors [22]. Table 1 shows the performance comparison of improved Dense Trajectories and the CNN average pooling representation. We provide the performance of Lan et al. [22] for reference as well. We can see that the CNN average pooling representation does not outperform the hand-crafted improved Dense Trajectories feature, which is quite different from the observations in other vision tasks [12, 6, 13].

The contributions of this paper are threefold. First, this is the first work to leverage encoding techniques to generate video representation based on CNN descriptors. Second, we propose to use a set of latent concept descriptors as frame descriptors, which further diversifies the output with aggregation over multiple spatial locations at a deeper stage of the network. The approach forwards video frames through the deep CNNs only once for descriptor extraction. With these two contributions, the proposed video CNN representation achieves more than 30% relative improvement over the state-of-the-art video representation on the large scale MED dataset, and this can be conducted on a single machine in two days with 4 GPU cards installed. In addition, we propose to use Product Quantization [15] on the CNN video representation to speed up the execution (event search) time. Our extensive experiments show that this approach significantly reduces the I/O cost, thereby making event prediction much faster while retaining almost the same level of precision.

¹However, with certain modifications of the CNN structure, e.g., cross-frame max-pooling [11], fine-tuning could be helpful.

2. Preliminaries

Unless otherwise specified, this work is based on the network architecture released by [37], i.e., the configuration with 16 weight layers used in the VGG team's winning ILSVRC 2014 classification solution. The first 13 weight layers are convolutional layers, five of which are followed by a max-pooling layer. The last three weight layers are fully-connected layers. In the rest of this paper, we follow the notations in [6, 12]: pool5 refers to the activation of the last pooling layer, while fc6 and fc7 refer to the activations of the first and second fully-connected layers, respectively. Though the structure in [37] is much deeper than the classic CNN structures in [21, 6, 12], the pool5, fc6 and fc7 notations still correspond if we regard the convolutional layers between the max-pooling layers as a "compositional convolutional layer" [37]. We utilize the activations both before the Rectified Linear Units (i.e., fc6 and fc7) and after them (i.e., fc6relu and fc7relu), since we observe significant differences in performance between these two variants.

3. Video CNN Representation

We begin by extracting the frame level CNN descriptors using the Caffe toolkit [18] with the model shared by [37]. We then need to generate video level vector representations on top of the frame level CNN descriptors.

3.1. Average Pooling on CNN Descriptors

As described in state-of-the-art complex event detection systems [3, 32], the standard way to obtain an image-based video representation, in which local descriptor extraction relies on individual frames alone, is as follows: (1) obtain the descriptors for individual frames; (2) apply normalization to the frame descriptors; (3) apply average pooling to the frame descriptors to obtain the video representation, i.e., $x_{\mathrm{video}} = \frac{1}{N}\sum_{i=1}^{N} x_i$, where $x_i$ is the frame-level descriptor and $N$ is the total number of frames extracted from the video; (4) re-normalize the video representation.
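For concreteness, the following is a minimal sketch of this pipeline, assuming the frame-level descriptors (e.g., fc7 activations) are already available as a NumPy array; it is an illustration rather than the authors' implementation:

```python
# Minimal sketch of the average pooling pipeline in Section 3.1.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """L2-normalize along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def average_pool_video(frame_descriptors):
    """frame_descriptors: (N, D) array of per-frame CNN descriptors."""
    # (2) normalize each frame descriptor
    frames = l2_normalize(frame_descriptors, axis=1)
    # (3) average pooling over the N frames
    video = frames.mean(axis=0)
    # (4) re-normalize the video-level representation
    return l2_normalize(video, axis=0)

# Usage with dummy data standing in for fc7 activations of 100 frames:
video_repr = average_pool_video(np.random.rand(100, 4096))
```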

Max pooling over frames to generate the video representation is an alternative method, but it is not typical in event detection. We observe similar performance to average pooling, so we omit this method.

3.2. Video Pooling on CNN Descriptors

Video pooling computes the video representation over the whole video by pooling all the descriptors from all the frames in that video. The Fisher vector [35, 36] and the Vector of Locally Aggregated Descriptors (VLAD) [16, 17] have been shown to have great advantages over Bag-of-Words (BoW) [38] among local descriptor encoding methods. The Fisher vector and VLAD were proposed for image classification and image retrieval to encode image local descriptors such as dense SIFT and Histogram of Oriented Gradients (HOG). Attempts have also been made to apply the Fisher vector and VLAD to local motion descriptors such as Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH) to capture the motion information in videos. To our knowledge, this is the first work on video pooling of CNN descriptors, and we thereby broaden the application of these encoding methods from local descriptors to CNN descriptors in video analysis.

3.2.1 Fisher Vector Encoding

In Fisher vector encoding [35, 36], a Gaussian Mixture Model (GMM) with K components can be denoted as Θ = {(μ_k, Σ_k, π_k), k = 1, 2, ..., K}, where μ_k, Σ_k and π_k are the mean, variance and prior parameters of the k-th component, respectively, learned from the training CNN descriptors at the frame level. Given a set X = (x_1, ..., x_N) of CNN descriptors extracted from a video, we have the mean and covariance deviation vectors for the k-th component as:

u_k = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} q_{ki}\,\frac{x_i - \mu_k}{\sigma_k}, \qquad
v_k = \frac{1}{N\sqrt{2\pi_k}} \sum_{i=1}^{N} q_{ki}\left[\left(\frac{x_i - \mu_k}{\sigma_k}\right)^2 - 1\right],    (1)

where q_{ki} is the posterior probability. By concatenating u_k and v_k over all K components, we form the Fisher vector for the video with size 2D′K, where D′ is the dimension of the CNN descriptor x_i after PCA pre-processing. PCA pre-processing is necessary for a better fit to the diagonal covariance matrix assumption [36]. Power normalization, often the Signed Square Root (SSR) z ← sign(z)·√|z|, and ℓ2 normalization are then applied to the Fisher vectors [35, 36].
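A hedged sketch of this encoding, re-implemented with scikit-learn's diagonal-covariance GMM rather than the vlfeat toolkit used in the paper, is given below; names such as train_descriptors in the usage comment are placeholders:

```python
# Illustrative Fisher vector encoding following Eq. (1); not the authors' code.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """X: (N, D') PCA-reduced frame descriptors; gmm: fitted diagonal GMM."""
    N = X.shape[0]
    q = gmm.predict_proba(X)                  # (N, K) posteriors q_ki
    pi, mu = gmm.weights_, gmm.means_         # (K,), (K, D')
    sigma = np.sqrt(gmm.covariances_)         # (K, D') diagonal std-devs
    fv = []
    for k in range(gmm.n_components):
        z = (X - mu[k]) / sigma[k]            # standardized residuals
        u_k = (q[:, k, None] * z).sum(axis=0) / (N * np.sqrt(pi[k]))
        v_k = (q[:, k, None] * (z ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * pi[k]))
        fv.extend([u_k, v_k])
    fv = np.concatenate(fv)                   # length 2 * D' * K
    fv = np.sign(fv) * np.sqrt(np.abs(fv))    # power (SSR) normalization
    return fv / (np.linalg.norm(fv) + 1e-12)  # l2 normalization

# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(train_descriptors)
```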

3.2.2 VLAD Encoding

VLAD encoding [16, 17] can be regarded as a simplified version of Fisher vector encoding. With K coarse centers {c_1, c_2, ..., c_K} generated by K-means, we can obtain the difference vector with respect to center c_k by:

u_k = \sum_{i:\,\mathrm{NN}(x_i) = c_k} (x_i - c_k),    (2)

where NN(x_i) indicates the nearest center of x_i among the K coarse centers.

The VLAD encoding vector with size D′K is obtained by concatenating u_k over all the K centers. Another variant of VLAD called VLAD-k, which extends the nearest center to the k-nearest centers, has shown good performance in action recognition [19, 34]. Unless otherwise specified, we utilize VLAD-k with k = 5 by default. In addition to the power and ℓ2 normalization, we also apply intra-normalization [4] to VLAD.

Figure 1. Probability distribution of the cosine similarity between positive-positive (blue, solid) and positive-negative (red, dashed) videos using fc7 features, for average pooling (top), encoding with the Fisher vector using a 256-component GMM (middle), and encoding with VLAD using 256 centers (bottom). As the range of probability for the Fisher vectors is very different from that of average pooling and VLAD, we only use consistent axes for average pooling and VLAD. This figure is best viewed in color.
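The following is a small illustrative sketch of VLAD-k with intra-, power (SSR) and ℓ2 normalization; the paper relies on vlfeat, so this NumPy version is only meant to make the steps explicit:

```python
# Illustrative VLAD-k encoding (Eq. (2) generalized to the k nearest centers).
import numpy as np
from sklearn.cluster import KMeans

def vlad_k(X, centers, k=5):
    """X: (N, D') descriptors; centers: (K, D') K-means centers."""
    K, D = centers.shape
    # squared distances to all centers, then indices of the k nearest centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    nn = np.argsort(d2, axis=1)[:, :k]                              # (N, k)
    u = np.zeros((K, D))
    for i, x in enumerate(X):
        for c in nn[i]:
            u[c] += x - centers[c]            # accumulate residuals
    # intra-normalization: l2-normalize the residual block of each center
    u /= np.linalg.norm(u, axis=1, keepdims=True) + 1e-12
    v = u.ravel()                             # D'K-dimensional vector
    v = np.sign(v) * np.sqrt(np.abs(v))       # power (SSR) normalization
    return v / (np.linalg.norm(v) + 1e-12)    # l2 normalization

# centers = KMeans(n_clusters=256).fit(train_descriptors).cluster_centers_
```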

3.2.3 Quantitative Analysis

Figure 2. Illustration of the latent concept descriptors encoding procedure. We adopt the M filters in the last convolutional layer as M latent concept classifiers. Before the last convolutional layer, the M filters (e.g., a cuboid of size 3×3×512) produce the prediction outputs at every convolution location, followed by the max-pooling operations. Then, we get the responses of windows of different sizes and strides (in this example the output size is 2×2) for each latent concept. Color strength corresponds to the strength of the response of each filter. Finally, we accumulate the responses of the M filters at the same location into the latent concept descriptors. Each dimension corresponds to one latent concept. After obtaining all latent concept descriptors of all frames, we then apply encoding methods to get the final video representation. This figure is best viewed in color.

Given the above three approaches, we need to find out which one is the most appropriate for the CNN descriptors. To this end, we conduct an analytic experiment on the MEDTest14 training set [2] to study the discriminative ability of three types of video representations, i.e., average pooling, video pooling with the Fisher vector, and video pooling with VLAD on the CNN descriptors. Specifically, we calculate the cosine similarity within the positive exemplars among all the events (denoted as pos-pos), and the cosine similarity between positive exemplars and negative exemplars (denoted as pos-neg). The results are shown in Figure 1. With a good representation, the data points of positive and negative exemplars should be far away from each other, i.e., the cosine similarity of "pos-neg" pairs should be close to zero. In addition, there should be a clear difference between the distributions of "pos-pos" and "pos-neg".

Average pooling: In Figure 1, we observe that the "pos-neg" cosine similarity distribution is far from zero, which strongly indicates that a large portion of the positive and negative exemplar pairs are similar to each other. In addition, the intersection of the areas under the two curves spans a large range of [0.2, 0.8]. Both observations imply that average pooling may not be the best choice.

Fisher vector: Although the "pos-neg" similarity distribution is fairly close to zero, a large proportion of the "pos-pos" pairs also fall into the same range. No obvious difference between the distributions of "pos-pos" and "pos-neg" can be observed.

VLAD: The distribution of the "pos-neg" pairs is much closer to zero than with average pooling, while only a relatively small proportion of the "pos-pos" similarities is close to the peak of the "pos-neg" similarity.

From the above analytic study, we can see that VLAD is the best fit for the CNN descriptors because the VLAD representation has the best discriminative ability, which is also consistent with the experimental results in Section 5.1.
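The analytic comparison above can be reproduced, under the assumption that video-level representations and binary event labels are available as arrays, with a short script like this sketch:

```python
# Sketch of the pos-pos / pos-neg cosine similarity analysis of Section 3.2.3;
# the resulting distributions are what Figure 1 visualizes.
import numpy as np

def pos_pos_and_pos_neg_similarities(V, labels):
    """V: (n, d) video representations; labels: (n,) array in {0, 1}."""
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    S = V @ V.T                                   # cosine similarity matrix
    pos, neg = np.where(labels == 1)[0], np.where(labels == 0)[0]
    iu = np.triu_indices(len(pos), k=1)           # unordered positive pairs
    pos_pos = S[np.ix_(pos, pos)][iu]
    pos_neg = S[np.ix_(pos, neg)].ravel()
    return pos_pos, pos_neg
```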

3.3. CNN Latent Concept Descriptors

Compared to the fully-connected layers, pool5 contains spatial information. However, if we follow the standard way and flatten pool5 into a vector, the feature dimension will be very high, which induces heavy computational cost. Specifically, the feature dimension of pool5 is a × a × M, where a is the spatial size of the filtered images at the last pooling layer and M is the number of convolutional filters in the last convolutional layer (in our case, a = 7 and M = 512). In the VGG network [37], pool5 features are therefore 25,088-D vectors, while the fc6 and fc7 features have only 4,096-D. As a result, researchers tend to ignore the general features extracted from pool5 [6, 13]. The problem is even more severe in the video pooling scheme, because frame descriptors with high dimensions would lead to instability problems [10].

Note that the convolutional filters can be regarded as generalized linear classifiers on the underlying data patches, and each convolutional filter corresponds to a latent concept [24]. We propose to formulate the general features from pool5 as vectors of latent concept descriptors, in which each dimension of the latent concept descriptor represents the response of a specific latent concept. Each filter in the last convolutional layer is independent of the other filters. The response of a filter is the prediction of its linear classifier at the convolutional location for the corresponding latent concept. In this way, the pool5 layer of size a × a × M can be converted into a² latent concept descriptors with M dimensions. Each latent concept descriptor represents the responses of the M filters at a specific pooling location. Once we obtain the latent concept descriptors for all the frames in a video, we then apply an encoding method to generate the video representation. In this case, each frame contains a² descriptors instead of one descriptor for the whole frame, as illustrated in Figure 2.
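A minimal sketch of this conversion, assuming a channels-first (M, a, a) pool5 activation map as produced by Caffe, is:

```python
# Convert a pool5 map of size a x a x M into a*a latent concept descriptors
# of dimension M (Section 3.3); illustrative layout assumption: (M, a, a).
import numpy as np

def latent_concept_descriptors(pool5):
    """pool5: (M, a, a) activation map of one frame -> (a*a, M) descriptors."""
    M, a, _ = pool5.shape
    # each spatial location becomes one M-dimensional latent concept descriptor
    return pool5.reshape(M, a * a).T

# e.g. for VGG-16: one frame gives 7*7 = 49 descriptors of 512-D
lcd = latent_concept_descriptors(np.random.rand(512, 7, 7))   # shape (49, 512)
```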

In [14], He et al. claim that aggregation at a deeper layer is more compatible with the hierarchical information processing in our brains than cropping or warping the original inputs, and they propose a Spatial Pyramid Pooling (SPP) layer for object classification and detection, which not only achieves better performance but also relaxes the constraint that the input must be of fixed size. Different from [14], we do not train the network with the SPP layer from scratch, because it takes a much longer time, especially for a very deep neural network. Instead, at the last pooling layer, we adopt multiple windows with different sizes and strides without retraining the CNNs. In this way, visual information is enriched while only marginal computation cost is added, as we forward frames through the network only once to extract the latent concept descriptors.

After extracting the CNN latent concept descriptors at all spatial locations of each frame in a video, we then apply video pooling to all the latent concept descriptors of that video. As in [14], we apply four different CNN max-pooling operations and obtain (6×6), (3×3), (2×2) and (1×1) outputs for each independent convolutional filter, i.e., a total of 50 spatial locations for a single frame. The dimension of the latent concept descriptors (512-D) is lower than that of the descriptors from the fully-connected layers (4,096-D), while the visual information is enriched via multiple spatial locations on the filtered images.
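The multi-window pooling can be sketched as follows; the window boundaries below follow a generic adaptive pooling rule and are an assumption, since the exact window sizes and strides of the SPP implementation in [14] may differ slightly:

```python
# Multi-window max pooling over the last convolutional maps: outputs of size
# 6x6, 3x3, 2x2 and 1x1 give 36 + 9 + 4 + 1 = 50 spatial locations per frame.
import numpy as np

def spp_latent_concept_descriptors(conv_maps, levels=(6, 3, 2, 1)):
    """conv_maps: (M, a, a) response maps -> (sum(n*n), M) descriptors."""
    M, a, _ = conv_maps.shape
    descriptors = []
    for n in levels:
        edges = [(int(np.floor(i * a / n)), int(np.ceil((i + 1) * a / n)))
                 for i in range(n)]
        for r0, r1 in edges:
            for c0, c1 in edges:
                # max-pool each of the M filter responses over this window
                descriptors.append(conv_maps[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.stack(descriptors)               # e.g. (50, 512) for a=7, M=512

lcd_spp = spp_latent_concept_descriptors(np.random.rand(512, 7, 7))  # (50, 512)
```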

3.4. Representation Compression

From the engineering perspective of fast event search [2] on a large video collection, we can utilize techniques such as Product Quantization (PQ) [15] to compress the Fisher vector or VLAD representation. With PQ compression, the storage space on disk and in memory can be reduced by more than an order of magnitude, while the performance remains almost the same. The basic idea of PQ is to decompose the representation into sub-vectors of equal length B, and then, within each sub-vector, apply K-means to generate 2^m centers as representative points. Each sub-vector is approximated by its nearest center and encoded as the index of that center. In this way, B float numbers in the original representation become an m-bit code, so the compression ratio is (B × 32)/m. For example, if we take m = 8 and B = 4, we achieve a 16-fold reduction in storage space.

To predict on the compressed data instead of the original features, we can decompose the learned linear classifier w into sub-vectors of the same length B. With look-up tables storing the dot-products between the 2^m centers of each block and the corresponding sub-vector of w, the prediction for each video can be computed with D/B look-up operations and D/B − 1 addition operations, where D is the feature dimension [36].
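A compact sketch of PQ compression and of linear prediction on the compressed codes follows (illustrative only; the parameters B = 4 and m = 8 mirror the example above):

```python
# Product Quantization sketch: sub-vectors of length B are quantized to 2^m
# K-means centers, and a linear classifier w is applied on the codes via
# per-sub-vector look-up tables (Section 3.4). Not the authors' implementation.
import numpy as np
from sklearn.cluster import KMeans

def pq_train(X, B=4, m=8):
    """Learn 2^m centers per sub-vector block; X: (n, D) with D divisible by B."""
    n, D = X.shape
    return [KMeans(n_clusters=2 ** m, n_init=4).fit(X[:, j:j + B]).cluster_centers_
            for j in range(0, D, B)]            # list of (2^m, B) codebooks

def pq_encode(X, codebooks, B=4):
    """Replace each length-B sub-vector by the index of its nearest center."""
    codes = [np.argmin(((X[:, j * B:(j + 1) * B][:, None, :] - cb[None]) ** 2).sum(-1), axis=1)
             for j, cb in enumerate(codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)   # (n, D/B), m = 8 bits each

def pq_decision_scores(codes, codebooks, w, b=0.0, B=4):
    """Linear scores w.x + b computed with D/B table look-ups per video."""
    luts = [cb @ w[j * B:(j + 1) * B] for j, cb in enumerate(codebooks)]  # (2^m,)
    return sum(lut[codes[:, j]] for j, lut in enumerate(luts)) + b
```

With B = 4 and m = 8, each block of 4 floats (128 bits) becomes a single byte, matching the 16-fold reduction discussed above.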

4. Experiment Settings

4.1. Datasets

In our experiments, we utilize the largest labeled event detection datasets², namely TRECVID MEDTest13 [1] and TRECVID MEDTest14 [2]. They were introduced by NIST for all participants in the TRECVID competition and the research community to conduct experiments on. Each dataset contains 20 complex events, with 10 events overlapping between the two. MEDTest13 contains events E006-E015 and E021-E030, while MEDTest14 contains events E021-E040. Event names include "Birthday party", "Bike trick", etc.; refer to [1, 2] for the complete list of event names. In the training section, there are approximately 100 positive exemplars per event, and all events share negative exemplars comprising about 5,000 videos. The testing section has approximately 23,000 search videos. The total duration of the videos in each collection is about 1,240 hours.

4.2. Features for Comparisons

As reported in [3] and compared with the features from other top performers [30, 29, 22] in the TRECVID MED 2013 competition, improved Dense Trajectories has clear advantages over the original Dense Trajectories (used by all other teams except [3]), and is even better than approaches that combine many low-level visual features [30, 29, 22]. Improved Dense Trajectories extracts local descriptors such as trajectory, HOG, HOF and MBH, and the Fisher vector is then applied to encode the local descriptors into a video representation. Following [44, 3], we first reduce the dimension of each descriptor by a factor of 2 and then utilize 256 components to generate the Fisher vectors. We evaluate four types of descriptors in improved Dense Trajectories, and report the results of the best combination of descriptors as well as the two individual descriptors with the best performance (HOG and MBH).

In addition, we report the results of some popular features used in the TRECVID competition for reference, such as STIP [23], MoSIFT [7] and CSIFT [41], though their performance is far weaker than that of improved Dense Trajectories.

4.3. Evaluation Details

In all the experiments, we apply a linear Support Vector Machine (SVM) using the LIBSVM toolkit [5]. We conduct extensive experiments under two standard training conditions: in 100Ex, 100 positive exemplars are given for each event, and in 10Ex, 10 positive exemplars are given. In the 100Ex condition, we utilize 5-fold cross-validation to choose the regularization coefficient C of the linear SVM. In the 10Ex condition, we follow [22] and set C in the linear SVM to 1.

²Labels for MEDEval13 and MEDEval14 are not publicly available.

We sample every five frames in the videos and follow the pre-processing of [21, 6] for CNN descriptor extraction. We extract the features from the center crop only. CNN descriptors are extracted using Caffe [18] with the best publicly available model [37], and we utilize vlfeat [42] to generate the Fisher vector and VLAD representations.

Mean Average Precision (mAP) for binary classification is used to evaluate event detection performance according to the NIST standard [1, 2].
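A sketch of this evaluation protocol using scikit-learn instead of the LIBSVM toolkit is shown below; the grid of C values is an assumption, as the paper does not list the candidate values used in cross-validation:

```python
# Linear SVM training and AP evaluation, mirroring the 100Ex / 10Ex protocol.
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import average_precision_score

def train_event_detector(X_train, y_train, n_positives=100):
    if n_positives >= 100:                        # 100Ex: cross-validate C
        grid = GridSearchCV(LinearSVC(), {'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
        return grid.fit(X_train, y_train).best_estimator_
    return LinearSVC(C=1).fit(X_train, y_train)   # 10Ex: fixed C = 1, following [22]

def evaluate(clf, X_test, y_test):
    return average_precision_score(y_test, clf.decision_function(X_test))
```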

5. Experiment Results

5.1. Results for Video Pooling of CNN Descriptors

In this section, we show the experiments on video pooling of fc6, fc6relu, fc7 and fc7relu. Before aggregation, we first apply PCA with whitening on the ℓ2 normalized CNN descriptors. Unlike local descriptors such as HOG and MBH, which have fewer than 200 dimensions, the CNN descriptors have much higher dimensionality (4,096-D). We conduct experiments with different reduced dimensions, i.e., 128, 256, 512 and 1,024, and adopt the reduced dimension that best balances performance and storage cost for the corresponding features, i.e., 512-D for fc6 and fc6relu and 256-D for fc7 and fc7relu. We utilize 256 components for the Fisher vectors and 256 centers for VLAD, as is common practice [36, 16]. We study the impact of these parameters in Section 5.3. The PCA projections, the GMM components for the Fisher vectors, and the K-means centers for VLAD are learned from approximately 256,000 frames sampled from the training set.
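The pre-processing step can be sketched as follows (illustrative; the exact sampling and training pipeline may differ):

```python
# PCA-whitening of l2-normalized fc descriptors before FV/VLAD encoding,
# e.g. 256-D for fc7 as chosen in Section 5.1.
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_whitening(train_descriptors, n_components=256):
    X = train_descriptors / (np.linalg.norm(train_descriptors, axis=1,
                                            keepdims=True) + 1e-12)
    return PCA(n_components=n_components, whiten=True).fit(X)
```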

Since we observe similar patterns on MEDTest13 and MEDTest14 under both 100Ex and 10Ex, we take MEDTest14 100Ex as an example to compare the different representations, namely average pooling, video pooling with Fisher vectors, and video pooling with VLAD. From Table 2, we can see that both video pooling with Fisher vectors and video pooling with VLAD demonstrate great advantages over the average pooling representation. For video pooling of CNN descriptors, Fisher vector encoding does not exhibit better performance than VLAD. Similar observations have been made in [10]. We suspect that the distribution of CNN descriptors is quite different from that of local descriptors, e.g., HOG and HOF. We will study the theoretical reasons for the poorer performance of the Fisher vector compared to VLAD on CNN video pooling in future research.

                   fc6    fc6relu   fc7    fc7relu
  Average pooling  19.8   24.8      18.8   23.8
  Fisher vector    28.3   28.4      27.4   29.1
  VLAD             33.1   32.6      33.2   31.5

Table 2. Performance comparison (mAP in percentage) on MEDTest14 100Ex.

We compare the performance of the VLAD encoded CNN descriptors with the state-of-the-art feature improved Dense Trajectories (IDT) and with average pooling on CNN descriptors in Figure 3. We also show the performance of the two strongest descriptors inside IDT (HOG and MBH).

We can see very clearly that the VLAD encoded CNN features significantly outperform IDT and average pooling on CNN descriptors over all settings. For further reference, we provide the performance of a number of widely used features [29, 30, 22] on MEDTest14 for comparison: MoSIFT [7] with the Fisher vector achieves 18.1% mAP on 100Ex and 5.3% on 10Ex; STIP [23] with the Fisher vector achieves 15.0% mAP on 100Ex and 7.1% on 10Ex; CSIFT [41] with the Fisher vector achieves 14.7% mAP on 100Ex and 5.3% on 10Ex. Note that with the VLAD encoded CNN descriptors, we achieve better performance with 10Ex than the relatively weaker features such as MoSIFT, STIP and CSIFT achieve with 100Ex.

Figure 3. Performance comparisons on MEDTest13 and MEDTest14, for both 100Ex and 10Ex. This figure is best viewed in color.

5.2. Results for CNN Latent Concept Descriptors with Spatial Pyramid Pooling

We evaluate the performance of the latent concept descriptors (LCD) for both the original CNN structure and the structure with a Spatial Pyramid Pooling (SPP) layer plugged in, to validate the effectiveness of SPP. Before encoding the latent concept descriptors, we first apply PCA with whitening. We reduce the dimension from 512-D to a range of dimensions such as 32-D, 64-D, 128-D and 256-D, and find that 256-D is the best choice. We observe a pattern similar to the video pooling of the fc layers, indicating that the Fisher vector is inferior to VLAD for video pooling; we omit the results for the Fisher vector due to limited space.

We show the performance of our proposed latent concept descriptors (LCD) in Table 3 and Table 4. In both 100Ex and 10Ex over the two datasets, we can see clear gaps over the pool5 features with average pooling, which demonstrates the advantage of our proposed novel utilization of pool5. With the SPP layer, VLAD encoded LCD (LCD_VLAD+SPP) further increases the performance over the original structure (LCD_VLAD). Aggregation at a deeper stage to generate multiple levels of spatial information via multiple CNN max-pooling operations demonstrates advantages over the original CNN structure while adding only minimal computation cost. The SPP layer enables a single forward pass through the network, compared to the multiple passes required when applying a spatial pyramid to the original input images.

                   100Ex   10Ex
  Average pooling  31.2    18.8
  LCD_VLAD         38.2    25.0
  LCD_VLAD+SPP     40.3    25.6

Table 3. Performance comparisons for pool5 on MEDTest13. LCD_VLAD is VLAD encoded LCD from the original CNN structure, while LCD_VLAD+SPP indicates VLAD encoded LCD with the SPP layer plugged in.

                   100Ex   10Ex
  Average pooling  24.6    15.3
  LCD_VLAD         33.9    22.8
  LCD_VLAD+SPP     35.7    23.2

Table 4. Performance comparisons for pool5 on MEDTest14. Notations are the same as in Table 3.

5.3. Analysis of the Impact of Parameters

We take the VLAD encoded fc7 features under MEDTest14 100Ex as an example to study the impact of the parameters in the video pooling process.

Dimensions of PCA: The original dimension of fc7 is quite high compared to local descriptors. It is essential to investigate the impact of the PCA dimension in the pre-processing stage, since it is critical to achieving a good trade-off between performance and storage cost. Table 5 shows that for dimensions of 256-D and above, performance remains similar, whereas reducing to 128-D hurts performance significantly.

  Dimension   128-D   256-D   512-D   1024-D
  mAP         30.6    33.2    33.1    33.2

Table 5. Impact of the dimension of the CNN descriptors after PCA, with fixed K = 256 in VLAD.

Number of Centers in Encoding: We explore various numbers of centers K in VLAD, and the results are shown in Table 6. As K increases, the discriminative ability of the generated features improves. However, when K = 512, the generated vector may be too sparse, which is somewhat detrimental to performance.

  K     32     64     128    256    512
  mAP   28.7   29.7   30.4   33.2   32.1

Table 6. Impact of the number of centers (K) in VLAD, with the PCA dimension fixed at 256-D.

VLAD-k: We also experiment with the traditional VLAD, which uses the nearest center only instead of the k-nearest centers; mAP drops from 33.2% to 32.0%.

Power Normalization: We remove the SSR post-processing and test the VLAD encoded fc7 features; mAP drops from 33.2% to 27.0%, which shows the significant effect of SSR post-processing.

Intra-normalization: We turn off intra-normalization; mAP drops from 33.2% to 30.6%.

5.4. Results for Product Quantization Compression

                    original   B = 4          B = 8
  mAP               33.2       33.5 (↑0.3)    33.0 (↓0.2)
  space reduction   -          16×            32×

Table 7. Performance change analysis for VLAD encoded fc7 with PQ compression. B is the length of the sub-vectors in PQ and m = 8.

We conduct experiments on the VLAD encoded fc7 features to examine the performance changes with Product Quantization (PQ) compression. From the results in Table 7, we can see that PQ with B = 4 maintains the performance and even improves it slightly. With B = 8, performance drops slightly.

If we compress with B = 4, we can store the VLAD encoded fc7 features of MEDEval14, which contains 200,000 videos with 8,000 hours' duration, in 3.1 GB. With further compression by a lossless technique such as Blosc³ [8], we can store the features of the whole collection in less than 1 GB, which can be read from a normal SSD disk in a few seconds. Without PQ compression, the storage size of the features would be 48.8 GB, which severely compromises the execution time due to the I/O cost. Utilizing compression techniques largely saves the I/O cost in the prediction procedure, while preserving the performance.

In our speed test on the MEDEval14 collection using the compressed data rather than the original features, we can finish the prediction on 200,000 videos in 4.1 seconds per event using 20 threads on an Intel Xeon E5-2690 v2 @ 3.00 GHz.

³Blosc can reduce the storage space by a factor of 4.

5.5. Results for Fusing Multiple Layers Extracted from the Same Model

We investigate average late fusion [39] to fuse the prediction results from the different layers with PQ compression, i.e., VLAD encoded LCD with SPP, fc6 and fc7. From Table 8 we can see that this simple fusion pushes the performance beyond that of the single layers on MEDTest13 and MEDTest14, and achieves significant advantages over improved Dense Trajectories (IDT). Our proposed method pushes the state-of-the-art performance much further, achieving more than 30% relative improvement on 100Ex and more than 65% relative improvement on 10Ex over both challenging datasets.

Figure 4. MEDTest13 100Ex per-event performance comparison (in mAP percentage). This figure is best viewed in color.

                 Ours   IDT    Relative Improv.
  MED13 100Ex    44.6   34.0   31.2%
  MED13 10Ex     29.8   18.0   65.6%
  MED14 100Ex    36.8   27.6   33.3%
  MED14 10Ex     24.5   13.9   76.3%

Table 8. Performance comparison under all settings; the last column shows the relative improvement of our proposed representation over IDT.

Figure 4 and Figure 5 show the per-event mAP comparisons under the 100Ex setting on MEDTest13 and MEDTest14. We also provide results for average pooling on CNN descriptors with late fusion of the three layers, denoted as CNN avg. Our proposed representation beats the two other strong baselines on 15 out of 20 events in MEDTest13 and on 14 out of 20 events in MEDTest14.
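For reference, a minimal sketch of the average late fusion used here is given below; the min-max score normalization is an assumption, since the paper only specifies average late fusion [39]:

```python
# Average late fusion of detection scores from separately trained classifiers
# (VLAD-encoded LCD with SPP, fc6 and fc7), combined per video.
import numpy as np

def average_late_fusion(score_lists):
    """score_lists: list of (n_videos,) score arrays, one per representation."""
    normalized = []
    for s in score_lists:
        s = np.asarray(s, dtype=float)
        normalized.append((s - s.min()) / (s.max() - s.min() + 1e-12))
    return np.mean(normalized, axis=0)
```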

5.6. Comparison to the State-of-the-art Systems

We compare the MEDTest13⁴ results with those of the top performers in the TRECVID MED 2013 competition [3, 30, 22]. The AXES team does not report performance on MEDTest13 [3]. Natarajan et al. [30] report 38.5% mAP on 100Ex and 17.9% on 10Ex for their whole visual system combining all of their low-level visual features. Lan et al. [22] report 39.3% mAP on 100Ex for their whole system including non-visual features, while they conducted 10Ex experiments on their internal dataset. Our results achieve 44.6% mAP on 100Ex and 29.8% mAP on 10Ex, which significantly outperforms the top performers in the competition, who combine more than 10 kinds of features with sophisticated schemes. To show that our representation is complementary to features from other modalities, we perform average late fusion of our proposed representation with IDT and MFCC, and generate a lightweight system with static, motion and acoustic features, which achieves 48.6% mAP on 100Ex and 32.2% mAP on 10Ex.

Figure 5. MEDTest14 100Ex per-event performance comparison (in mAP percentage). This figure is best viewed in color.

⁴In [3, 30, 22], the teams report performance on MEDEval13 as well, but MEDEval13 is a different collection used in the competition, on which only NIST can evaluate the performance.

6. Conclusion

TRECVID Multimedia Event Detection (MED) has suffered from huge computation costs in the feature extraction and classification processes. Using a Convolutional Neural Network (CNN) representation seems to be a good solution, but generating a video representation from CNN descriptors has different characteristics from generating an image representation. We are the first to leverage encoding techniques to generate video representations from CNN descriptors, and we propose latent concept descriptors to generate CNN descriptors more appropriately. For fast event search, we utilize Product Quantization to compress the video representation and predict on the compressed data. Extensive experiments on the two largest event detection collections under different training conditions demonstrate the advantages of our proposed representation. We have achieved promising performance which is superior to that of the state-of-the-art systems, which combine more than 10 features. The proposed representation is extensible, and its performance can be further improved by better CNN models and/or appropriate fine-tuning techniques.


7. Acknowledgement

This paper is supported in part by the 973 program (2012CB316400), in part by an ARC DECRA project, and in part by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20068. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.

References

[1] TRECVID MED13. http://nist.gov/itl/iad/mig/med13.cfm.
[2] TRECVID MED14. http://nist.gov/itl/iad/mig/med14.cfm.
[3] R. Aly, R. Arandjelovic, K. Chatfield, M. Douze, B. Fernando, Z. Harchaoui, K. McGuinness, N. E. O'Connor, D. Oneata, O. M. Parkhi, et al. The AXES submissions at TrecVid 2013. 2013.
[4] R. Arandjelović and A. Zisserman. All about VLAD. In CVPR, 2013.
[5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[6] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[7] M.-Y. Chen and A. Hauptmann. MoSIFT: Recognizing human actions in surveillance videos. CMU TR, 2009.
[8] R. G. Cinbis, J. Verbeek, and C. Schmid. Segmentation driven object detection with Fisher vectors. In ICCV, 2013.
[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[10] M. Douze, J. Revaud, C. Schmid, and H. Jégou. Stable hyper-pooling and query expansion for event detection. In ICCV, 2013.
[11] C. Gan, N. Wang, Y. Yang, D.-Y. Yeung, and A. Hauptmann. DevNet: A deep event network for multimedia event detection and evidence recounting. In CVPR, 2015.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[13] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV, 2014.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[15] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. TPAMI, 33(1):117–128, 2011.
[16] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[17] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local image descriptors into compact codes. TPAMI, 34(9):1704–1716, 2012.
[18] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
[19] V. Kantorov and I. Laptev. Efficient feature extraction, encoding and classification for action recognition. In CVPR, 2014.
[20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] Z.-Z. Lan, L. Jiang, S.-I. Yu, et al. CMU-Informedia at TRECVID 2013 Multimedia Event Detection. In TRECVID 2013 Workshop, 2013.
[23] I. Laptev. On space-time interest points. IJCV, 64(2-3):107–123, 2005.
[24] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[25] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[26] Z. Ma, Y. Yang, N. Sebe, and A. Hauptmann. Knowledge adaptation with partially shared features for event detection using few exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(9):1789–1802, 2014.
[27] Z. Ma, Y. Yang, Z. Xu, S. Yan, N. Sebe, and A. G. Hauptmann. Complex event detection via multi-source video attributes. In CVPR, 2013.
[28] F. Metze, S. Rawat, and Y. Wang. Improved audio features for large-scale multimedia event detection. In ICME, 2014.
[29] G. K. Myers, R. Nallapati, J. van Hout, et al. The 2013 SESAME Multimedia Event Detection and Recounting system. In TRECVID 2013 Workshop, 2013.
[30] P. Natarajan, S. Wu, F. Luisier, et al. BBN VISER TRECVID 2013 Multimedia Event Detection and Multimedia Event Recounting Systems. In TRECVID 2013 Workshop, 2013.
[31] P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis, U. Park, and R. Prasad. Multimodal feature fusion for robust event detection in web videos. In CVPR, 2012.
[32] D. Oneata, M. Douze, J. Revaud, S. Jochen, D. Potapov, H. Wang, Z. Harchaoui, J. Verbeek, C. Schmid, R. Aly, et al. AXES at TRECVid 2012: KIS, INS, and MED. In TRECVID Workshop, 2012.
[33] D. Oneata, J. Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In ICCV, 2013.
[34] X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. arXiv preprint arXiv:1405.4506, 2014.
[35] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
[36] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 105(3):222–245, 2013.
[37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[38] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In CVPR, 2003.
[39] C. G. Snoek, M. Worring, and A. W. Smeulders. Early versus late fusion in semantic video analysis. In MM. ACM, 2005.
[40] A. Tamrakar, S. Ali, Q. Yu, J. Liu, O. Javed, A. Divakaran, H. Cheng, and H. Sawhney. Evaluation of low-level features and their combinations for complex event detection in open source videos. In CVPR, 2012.
[41] K. E. Van De Sande, T. Gevers, and C. G. Snoek. Evaluating color descriptors for object and scene recognition. TPAMI, 32(9):1582–1596, 2010.
[42] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. In MM. ACM, 2010.
[43] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[44] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[45] Z. Xu, Y. Yang, I. Tsang, N. Sebe, and A. G. Hauptmann. Feature weighting via optimal thresholding for video analysis. In ICCV, 2013.
[46] Y. Yang, Z. Ma, Z. Xu, S. Yan, and A. G. Hauptmann. How related exemplars help complex event detection in web videos? In ICCV, 2013.

