A Discriminative CNN Video Representation for Event Detection
Zhongwen Xu†, Yi Yang†, Alexander G. Hauptmann§
†QCIS, University of Technology, Sydney; §SCS, Carnegie Mellon University
zhongwen.xu@[…].au, yee.i.yang@[…], alex@[…]
Abstract
In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkits. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating frame level static features, we show that performance can be significantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the Mean Average Precision (mAP) from 27.6% to 36.8% for the TRECVID MEDTest14 dataset and from 34.0% to 44.6% for the TRECVID MEDTest13 dataset.
1. Introduction and Related Work
Complex event detection [1, 2], which targets the detection of such events as “renovating a home” in a large video collection crawled from YouTube, has recently attracted a lot of research attention in computer vision. Compared to concept analysis in videos, e.g., action recognition, event detection is more difficult primarily because an event is more complex and thus has greater intra-class variations. For example, a “marriage proposal” event may take place indoors or outdoors, and may consist of multiple concepts such as ring (object), kneeling down (action) and kissing (action).
Recent research efforts have shown that combining multiple features, including static appearance features [9, 25, 41], motion features [23, 7, 43, 44, 33] and acoustic features [28], yields good performance in event detection, as evidenced by the reports of the top ranked teams in the TRECVID Multimedia Event Detection (MED) competition [3, 22, 29, 30] and research papers [26, 31, 40, 45] that have tackled this problem. By utilizing additional data to assist complex event detection, researchers propose the use of “video attributes” derived from other sources to facilitate event detection [27], or to utilize related exemplars when the training exemplars are very few [46]. As we focus on improving video representation in this paper, this new method can be readily fed into those frameworks to further improve their performance.
Dense Trajectories and its enhanced version, improved Dense Trajectories (IDT) [44], have dominated complex event detection in recent years due to their superior performance over other features such as the motion feature STIP [23] and the static appearance feature Dense SIFT [3].
Despite good performance, heavy computation costs greatly restrict the usage of the improved Dense Trajectories on a large scale. In the TRECVID MED competition 2014 [2], the National Institute of Standards and Technology (NIST) introduced a very large video collection, containing 200,000 videos of 8,000 hours in duration. Running on 1,000 cores in parallel, it takes about one week to extract the improved Dense Trajectories for the 200,000 videos in the TRECVID MEDEval14 collection. Even after the spatial re-sizing and temporal down-sampling processing, it still takes 500 cores one week to extract the features [3]. As a result of the unaffordable computation cost, it would be extremely difficult for a relatively smaller research group with limited computational resources to process large scale MED datasets.
It becomes important to propose an efficient representation for complex event detection with only affordable computational resources, e.g., a single machine, while at the same time attempting to achieve better performance.
One instinctive idea would be to utilize the deep learning approach, especially Convolutional Neural Networks (CNNs), given their overwhelming accuracy in image analysis and fast processing speed, which is achieved by leveraging the massive parallel processing power of GPUs [21].
However, it has been reported that the event detection performance of CNN based video representation is worse than the improved Dense Trajectories in TRECVID MED 2013 [22, 3], as shown in Table 1. A few technical problems remain unsolved.

                          MEDTest13   MEDTest14
IDT [44, 3]               34.0        27.6
CNN in Lan et al. [22]    29.0        N.A.
CNN avg                   32.7        24.8

Table 1. Performance comparison (mean Average Precision in percentage). Lan et al. [22] is the only attempt to apply CNN features in TRECVID MED. CNN avg are our results from the average pooling representation of frame level CNN descriptors.

978-1-4673-6964-0/15/$31.00 ©2015 IEEE
Firstly, CNN requires a large amount of labeled video data to train good models from scratch. The large scale TRECVID MED datasets (i.e., MEDTest13 [1] and MEDTest14 [2]) only have 100 positive examples per event, with many null videos which are irrelevant. The number of labeled videos is smaller than that of the video collection for sports videos [20]. In addition, as indicated in [46], event videos are quite different from action videos, so it makes little sense to use the action dataset to train models for event detection.
Secondly, when dealing with a domain specific task with a small number of training data, fine-tuning [12] is an effective technique for adapting the ImageNet pre-trained models for new tasks. However, the video level event labels are rather coarse at the frame level, i.e., not all frames necessarily contain the semantic information of the event. If we use the coarse video level label for each frame, performance is barely improved by frame level fine-tuning; this was verified by our preliminary experiment¹.
Lastly, given the frame level CNN descriptors, we need to generate a discriminative video level representation. Average pooling is the standard approach [32, 3] for static local features, as well as for the CNN descriptors [22]. Table 1 shows the performance comparisons of the improved Dense Trajectories and CNN average pooling representation. We provide the performance of Lan et al. [22] for reference as well. We can see that the performance of CNN average pooling representation cannot get better than the hand-crafted feature improved Dense Trajectories, which is fairly different from the observations in other vision tasks [12, 6, 13].
The contributions of this paper are threefold. First, this is the first work to leverage the encoding techniques to generate video representation based on CNN descriptors. Second, we propose to use a set of latent concept descriptors as frame descriptors, which further diversifies the output with aggregation on multiple spatial locations at a deeper stage of the network. The approach forwards video frames for only one round along the deep CNNs for descriptor extraction. With these two contributions, the proposed video CNN representation achieves more than 30% relative improvement over the state-of-the-art video representation on the large scale MED dataset, and this can be conducted on a single machine in two days with 4 GPU cards installed. In addition, we propose to use Product Quantization [15] based on CNN video representation to speed up the execution (event search) time. According to our extensive experiments, we show that the approach significantly reduces the I/O cost, thereby making event prediction much faster while retaining almost the same level of precision.
¹ However, with certain modification of the CNN structure, e.g., cross-frame max-pooling [11], fine-tuning could be helpful.
2. Preliminaries
Unless otherwise specified, this work is based on the network architecture released by [37], i.e., the configuration with 16 weight layers in the VGG ILSVRC 2014 classification task winning solutions. The first 13 weight layers are convolutional layers, five of which are followed by a max-pooling layer. The last three weight layers are fully-connected layers. In the rest of this paper, we follow the notations in [6, 12]: pool5 refers to the activation of the last pooling layer, fc6 and fc7 refer to the activation of the first and second fully-connected layers, respectively. Though the structure in [37] is much deeper than the classic CNN structure in [21, 6, 12], the subscripts of the pool5, fc6 and fc7 notations still correspond if we regard the convolution layers between the max-pooling layers as a “compositional convolutional layer” [37]. We utilize the activations before Rectified Linear Units (i.e., fc6 and fc7) and after them (i.e., fc6_relu and fc7_relu), since we observe significant differences in performance between these two variants.
3. Video CNN Representation
We begin by extracting the frame level CNN descriptors using the Caffe toolkit [18] with the model shared by [37]. We then need to generate video level vector representations on top of the frame level CNN descriptors.
3.1. Average Pooling on CNN Descriptors
As described in state-of-the-art complex event detection systems [3, 32], the standard way to achieve image-based video representation, in which local descriptor extraction relies on individual frames alone, is as follows: (1) Obtain the descriptors for individual frames; (2) Apply normalization on frame descriptors; (3) Average pooling on frame descriptors to obtain the video representation, i.e., $x_{\text{video}} = \frac{1}{N}\sum_{i=1}^{N} x_i$, where $x_i$ is the frame-level descriptor and $N$ is the total number of frames extracted from the video; (4) Re-normalization on video representation.
Max pooling on frames to generate video representation is an alternative method but it is not typical in event detection. We observe similar performance with average pooling, so we omit this method.
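Steps (2) to (4) of the average pooling pipeline can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the text does not say which normalization is applied, so ℓ2 normalization is assumed here, and the function names and the random stand-in descriptors are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit Euclidean length (eps guards against division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def average_pool_video(frame_descriptors):
    """Normalize each frame descriptor, average over frames, re-normalize the result.

    Assumption: both normalization steps are l2 normalization.
    """
    frames = l2_normalize(np.asarray(frame_descriptors, dtype=np.float64))
    x_video = frames.mean(axis=0)          # x_video = (1/N) * sum_i x_i
    return l2_normalize(x_video)

# Stand-in for N = 30 frame descriptors of 4,096-D (e.g., fc7 activations).
rng = np.random.default_rng(0)
frames = rng.standard_normal((30, 4096))
x_video = average_pool_video(frames)
```

The result is a single 4,096-D vector per video, regardless of how many frames were sampled.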
3.2. Video Pooling on CNN Descriptors
Video pooling computes video representation over the whole video by pooling all the descriptors from all the frames in a video. The Fisher vector [35, 36] and Vector of Locally Aggregated Descriptors (VLAD) [16, 17] have been shown to have great advantages over Bag-of-Words (BoWs) [38] in local descriptor encoding methods. The Fisher vector and VLAD have been proposed for image classification and image retrieval to encode image local descriptors such as dense SIFT and Histogram of Oriented Gradients (HOG). Attempts have also been made to apply Fisher vector and VLAD on local motion descriptors such as Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH) to capture the motion information in videos. To our knowledge, this is the first work on the video pooling of CNN descriptors and we broaden the encoding methods from local descriptors to CNN descriptors in video analysis.
3.2.1 Fisher Vector Encoding
In Fisher vector encoding [35, 36], a Gaussian Mixture Model (GMM) with K components can be denoted as $\Theta = \{(\mu_k, \Sigma_k, \pi_k),\ k = 1, 2, \ldots, K\}$, where $\mu_k$, $\Sigma_k$, $\pi_k$ are the mean, variance and prior parameters of the k-th component learned from the training CNN descriptors in the frame level, respectively. Given $X = (x_1, \ldots, x_N)$ of CNN descriptors extracted from a video, we have mean and covariance deviation vectors for the k-th component as:

$$u_k = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} q_{ki}\, \frac{x_i - \mu_k}{\sigma_k}, \qquad v_k = \frac{1}{N\sqrt{2\pi_k}} \sum_{i=1}^{N} q_{ki} \left[ \left( \frac{x_i - \mu_k}{\sigma_k} \right)^{2} - 1 \right], \qquad (1)$$

where $q_{ki}$ is the posterior probability and $\sigma_k$ is the vector of per-dimension standard deviations (the square root of the diagonal of $\Sigma_k$); the division and the square are element-wise. By concatenation of the $u_k$ and $v_k$ of all the K components, we form the Fisher vector for the video with size $2D'K$, where $D'$ is the dimension of CNN descriptor $x_i$ after PCA pre-processing. PCA pre-processing is necessary for a better fit on the diagonal covariance matrix assumption [36]. Power normalization, often Signed Square Root (SSR) with $z \leftarrow \mathrm{sign}(z)\sqrt{|z|}$, and $\ell_2$ normalization are then applied to the Fisher vectors [35, 36].
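As an illustration of Eq. (1), the following NumPy sketch encodes one video given an already-trained diagonal GMM. It is a sketch under stated assumptions rather than the implementation used in the experiments (which rely on vlfeat): the function name, the argument layout, the order in which the $u_k$ and $v_k$ blocks are concatenated, and the toy GMM parameters are all hypothetical.

```python
import numpy as np

def fisher_vector(X, means, sigmas, priors):
    """Fisher vector of Eq. (1) with SSR and l2 normalization.

    X: (N, D) PCA-reduced descriptors of one video.
    means, sigmas: (K, D) GMM means and per-dimension standard deviations.
    priors: (K,) mixture weights. Returns a vector of size 2 * D * K.
    """
    N = X.shape[0]
    diff = (X[:, None, :] - means[None, :, :]) / sigmas[None, :, :]   # (N, K, D)
    # log(pi_k * N(x_i | mu_k, diag(sigma_k^2))), dropping the constant shared by all k
    log_p = (np.log(priors)[None, :]
             - np.log(sigmas).sum(axis=1)[None, :]
             - 0.5 * (diff ** 2).sum(axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)        # numerical stability
    q = np.exp(log_p)
    q /= q.sum(axis=1, keepdims=True)                # posteriors q_ki, shape (N, K)
    u = (q[:, :, None] * diff).sum(axis=0) / (N * np.sqrt(priors)[:, None])
    v = (q[:, :, None] * (diff ** 2 - 1.0)).sum(axis=0) / (N * np.sqrt(2.0 * priors)[:, None])
    fv = np.concatenate([u.ravel(), v.ravel()])      # assumed block order: all u_k, then all v_k
    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # signed square root (SSR)
    return fv / max(np.linalg.norm(fv), 1e-12)       # l2 normalization

# Toy example: 20 descriptors of 8-D and a 4-component GMM with made-up parameters.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 8))
K = 4
means = rng.standard_normal((K, 8))
sigmas = np.ones((K, 8))
priors = np.full(K, 1.0 / K)
fv = fisher_vector(X, means, sigmas, priors)
```

In practice the GMM parameters would be learned from sampled training frames after PCA, as described in Section 5.1.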
3.2.2 VLAD Encoding
VLAD encoding [16, 17] can be regarded as a simplified version of Fisher vector encoding. With K coarse centers $\{c_1, c_2, \ldots, c_K\}$ generated by K-means, we can obtain the difference vector regarding center $c_k$ by:

$$u_k = \sum_{i:\, NN(x_i) = c_k} (x_i - c_k), \qquad (2)$$

where $NN(x_i)$ indicates $x_i$'s nearest neighbor among the K coarse centers.
The VLAD encoding vector with size $D'K$ is obtained by concatenating $u_k$ over all the K centers. Another variant of VLAD called VLAD-k, which extends the nearest center to the k-nearest centers, has shown good performance in action recognition [19, 34]. Unless otherwise specified, we utilize VLAD-k with k = 5 by default. In addition to the power and $\ell_2$ normalization, we apply intra-normalization [4] to VLAD.

Figure 1. Probability distribution of the cosine similarity between positive-positive (blue and plain) and positive-negative (red and dashed) videos using fc7 features, for average pooling (top), encoding with the Fisher vector using 256-component GMM (middle), and encoding with VLAD using 256 centers (bottom). As the range of probability of Fisher vectors is very different from average pooling and VLAD, we only use consistent axes for average pooling and VLAD. This figure is best viewed in color.
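The VLAD-k encoding of Section 3.2.2 can be sketched as follows. This is an illustrative NumPy version under assumptions, not the vlfeat-based pipeline used in the experiments: the function name is hypothetical, and the order of the three normalizations (power, then intra-normalization per center, then global ℓ2) is an assumption, since the text only lists which normalizations are applied.

```python
import numpy as np

def vlad_k(X, centers, k=5):
    """VLAD-k: each descriptor contributes its residual to its k nearest centers.

    X: (N, D) descriptors of one video; centers: (K, D) K-means centers.
    With k=1 the accumulation reduces to Eq. (2). Returns a vector of size D * K.
    """
    K, D = centers.shape
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    nearest = np.argsort(d2, axis=1)[:, :k]                         # indices of the k nearest centers
    U = np.zeros((K, D))
    for j in range(nearest.shape[1]):
        idx = nearest[:, j]
        np.add.at(U, idx, X - centers[idx])                         # u_k += (x_i - c_k)
    U = np.sign(U) * np.sqrt(np.abs(U))                             # power normalization (SSR)
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)  # intra-normalization per center
    v = U.ravel()
    return v / max(np.linalg.norm(v), 1e-12)                        # global l2 normalization

# Toy example: 40 descriptors of 8-D and 6 made-up centers.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
centers = rng.standard_normal((6, 8))
v = vlad_k(X, centers, k=5)
```

Setting `k=1` gives the traditional VLAD that is compared against VLAD-k in Section 5.3.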
3.2.3 Quantitative Analysis
Given the above three approaches, we need to find out which one is the most appropriate for the CNN descriptors. To this end, we conduct an analytic experiment on the MEDTest14 training set [2] to study the discriminative ability of three types of video representations, i.e., average pooling, video pooling with Fisher vector, and video pooling with VLAD on the CNN descriptors. Specifically, we calculate the cosine similarity within the positive exemplars among all the events (denoted as pos-pos), and the cosine similarity between positive exemplars and negative exemplars (denoted as pos-neg). The results are shown in Figure 1. With a good representation, the data points of positive and negative exemplars should be far away from each other, i.e., the cosine similarity of “pos-neg” should be close to zero. In addition, there should be a clear difference between the distributions of “pos-pos” and “pos-neg”.
Average pooling: In Figure 1, we observe that the “pos-neg” cosine similarity distribution is far from zero, which is highly indicative that a large portion of the positive and negative exemplar pairs are similar to each other. In addition, the intersection of areas under the two lines spans a large range of [0.2, 0.8]. Both observations imply that average pooling may not be the best choice.
Fisher vector: Although the “pos-neg” similarity distribution is fairly close to zero, a large proportion of the “pos-pos” pairs also fall into the same range. No obvious difference between the distributions of “pos-pos” and “pos-neg” can be observed.
VLAD: The distribution of the “pos-neg” pairs is much closer to zero than average pooling while a relatively small proportion of the “pos-pos” similarity is close to the peak of the “pos-neg” similarity.
From the above analytic study, we can see that VLAD is the best fit for the CNN descriptors because the VLAD representation has the best discriminative ability, which is also consistent with the experimental results in Section 5.1.

Figure 2. Illustration of the latent concept descriptors encoding procedure. We adopt M filters in the last convolutional layer as M latent concept classifiers. Before the last convolutional layer, M filters (e.g., a cuboid of size 3×3×512) produce the prediction outputs at every convolution location, followed by the max-pooling operations. Then, we get the responses of windows of different sizes and strides (in this example the output size is 2×2) for each latent concept. Color strength corresponds to the strength of response of each filter. Finally, we accumulate the responses for the M filters at the same location into the latent concept descriptors. Each dimension corresponds to one latent concept. After obtaining all latent concept descriptors of all frames, we then apply encoding methods to get the final video representation. This figure is best viewed in color.
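The pos-pos and pos-neg similarity analysis of Section 3.2.3 amounts to computing pairwise cosine similarities between video representations. The sketch below shows one way to collect the two sets of similarities for a single event; the function names and the random stand-in data are hypothetical, and the exact pairing protocol of the study (for example, how pairs are pooled across events) is not specified in the text.

```python
import numpy as np

def cosine_similarity_matrix(A, B):
    """Pairwise cosine similarity between the rows of A (n, D) and the rows of B (m, D)."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def pos_pos_and_pos_neg(pos, neg):
    """Return the pos-pos similarities (distinct pairs only) and all pos-neg similarities."""
    pp = cosine_similarity_matrix(pos, pos)
    upper = np.triu_indices(len(pos), k=1)     # skip self-pairs and mirrored duplicates
    pn = cosine_similarity_matrix(pos, neg)
    return pp[upper], pn.ravel()

# Stand-in video representations: 5 positives and 7 negatives, 16-D each.
rng = np.random.default_rng(0)
pos = rng.standard_normal((5, 16))
neg = rng.standard_normal((7, 16))
pp_sims, pn_sims = pos_pos_and_pos_neg(pos, neg)
```

Histograms of `pp_sims` and `pn_sims` would then give curves of the kind shown in Figure 1.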
3.3. CNN Latent Concept Descriptors
Compared to the fully-connected layers, pool5 contains spatial information. However, if we follow the standard way and flatten pool5 into a vector, the feature dimension will be very high, which will induce heavy computational cost. Specifically, the feature dimension of pool5 is a × a × M, where a is the size of filtered images of the last pooling layer and M is the number of convolutional filters in the last convolutional layer (in our case, a = 7 and M = 512). In the VGG network [37], pool5 features are vectors of 25,088-D while the fc6 and fc7 features have only 4,096-D. As a result, researchers tend to ignore the general features extracted from pool5 [6, 13]. The problem is even more severe in the video pooling scheme because the frame descriptors with high dimensions would lead to instability problems [10].
Note that the convolutional filters can be regarded as generalized linear classifiers on the underlying data patches, and each convolutional filter corresponds to a latent concept [24]. We propose to formulate the general features from pool5 as the vectors of latent concept descriptors, in which each dimension of the latent concept descriptors represents the response of the specific latent concept. Each filter in the last convolutional layer is independent from other filters. The response of the filter is the prediction of the linear classifier on the convolutional location for the corresponding latent concept. In that way, the pool5 layer of size a × a × M can be converted into a² latent concept descriptors with M dimensions. Each latent concept descriptor represents the responses from the M filters for a specific pooling location. Once we obtain the latent concept descriptors for all the frames in a video, we then apply an encoding method to generate the video representation. In this case, each frame contains a² descriptors instead of one descriptor for the frame, as illustrated in Figure 2.
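Converting a pool5 activation of size a × a × M into a² latent concept descriptors is a pure reshaping step, sketched below. The function name and the `channels_first` flag are hypothetical; the flag is included because toolkits such as Caffe store activations as (M, a, a), whereas the text writes the size as a × a × M.

```python
import numpy as np

def latent_concept_descriptors(pool5, channels_first=False):
    """Turn one frame's pool5 activation into a*a descriptors of M dimensions each.

    pool5: array of shape (a, a, M), or (M, a, a) if channels_first is True.
    Returns an (a*a, M) array: one row per spatial location, one column per filter.
    """
    if channels_first:
        pool5 = np.transpose(pool5, (1, 2, 0))   # (M, a, a) -> (a, a, M)
    a_rows, a_cols, M = pool5.shape
    return pool5.reshape(a_rows * a_cols, M)

# Stand-in for one frame's pool5 output with a = 7 and M = 512.
rng = np.random.default_rng(0)
pool5 = rng.standard_normal((7, 7, 512))
lcd = latent_concept_descriptors(pool5)
```

Flattening the same activation instead would give a single 7 × 7 × 512 = 25,088-D vector, which is the high-dimensional case the section argues against.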
In [14], He et al. claim that the aggregation at a deeper layer is more compatible with the hierarchical information processing in our brains than cropping or warping the original inputs, and they propose to use a Spatial Pyramid Pooling (SPP) layer for object classification and detection, which not only achieves better performance but also relaxes the constraint that the input must be fixed-size. Different from [14], we do not train the network with the SPP layer from scratch, because it takes much longer time, especially for a very deep neural network. Instead, at the last pooling layer, we adopt multiple windows with different sizes and strides without retraining the CNNs. In that way, visual information is enriched while only marginal computation cost is added, as we forward frames through the networks only once to extract the latent concept descriptors.
After extracting the CNN latent concept descriptors for all spatial locations of each frame in a video, we then apply video pooling to all the latent concept descriptors of that video. As in [14], we apply four different CNN max-pooling operations and obtain (6×6), (3×3), (2×2) and (1×1) outputs for each independent convolutional filter, a total of 50 spatial locations for a single frame. The dimension of latent concept descriptors (512-D) is shorter than the descriptors from the fully-connected layers (4,096-D), while the visual information is enriched via multiple spatial locations on the filtered images.
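The four max-pooling operations can be sketched as follows. The text does not give the window sizes and strides, so this sketch assumes a 14 × 14 × 512 input (the size of the last convolutional layer's output in the 16-layer VGG network for a 224 × 224 frame) and derives each bin from evenly spaced boundaries; the function names and this binning rule are assumptions, not the authors' configuration.

```python
import numpy as np
from math import ceil, floor

def spatial_max_pool(fmap, n):
    """Max-pool an (H, W, M) feature map into an (n, n, M) grid of bins."""
    H, W, M = fmap.shape
    out = np.empty((n, n, M), dtype=fmap.dtype)
    for i in range(n):
        r0, r1 = floor(i * H / n), ceil((i + 1) * H / n)
        for j in range(n):
            c0, c1 = floor(j * W / n), ceil((j + 1) * W / n)
            out[i, j] = fmap[r0:r1, c0:c1].max(axis=(0, 1))   # per-filter maximum in the bin
    return out

def spp_latent_concept_descriptors(fmap, levels=(6, 3, 2, 1)):
    """Stack the descriptors from all pyramid levels: 36 + 9 + 4 + 1 = 50 rows of M dims."""
    return np.concatenate([spatial_max_pool(fmap, n).reshape(n * n, -1) for n in levels])

# Stand-in for the last convolutional layer's output of one frame.
rng = np.random.default_rng(0)
conv5 = rng.standard_normal((14, 14, 512))
lcd_spp = spp_latent_concept_descriptors(conv5)
```

All 50 descriptors come from a single forward pass, since only the final pooling step is repeated with different windows.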
3.4. Representation Compression
For the engineering aspect of a fast event search [2] on a large video collection, we can utilize techniques such as Product Quantization (PQ) [15] to compress the Fisher vector or VLAD representation. With PQ compression, the storage space in disk and memory can be reduced by more than an order of magnitude, while the performance remains almost the same. The basic idea of PQ is to decompose the representation into sub-vectors with equal length B, and then within each sub-vector, K-means is applied to generate $2^m$ centers as representative points. All the sub-vectors are approximated by the nearest center and encoded into the index of the nearest center. In this way, B float numbers in the original representation become an m-bit code; thus, the compression ratio is $\frac{B \times 32}{m}$. For example, if we take m = 8 and B = 4, we can achieve 16 times reduction in storage space.
Targeting prediction on compressed data instead of on the original features, we can decompose the learned linear classifier w into sub-vectors of equal length B. With look-up tables to store the dot-product between sub-vectors of $2^m$ centers and the corresponding sub-vector of w, the prediction on a large number of videos can be accelerated to $\frac{D}{B}$ look-up operations and $\frac{D}{B} - 1$ addition operations for each video, assuming D is the feature dimension [36].
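The PQ encoding and look-up-table prediction described in Section 3.4 can be sketched as below. This is a toy NumPy version under assumptions: the plain Lloyd's K-means, the function names, and the small sizes (D = 16, B = 4, m = 4) are illustrative choices and not the configuration or code used in the experiments.

```python
import numpy as np

def kmeans(X, n_centers, n_iter=10, seed=0):
    """Plain Lloyd's K-means; returns an (n_centers, dim) array of centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_centers, replace=False)].copy()
    for _ in range(n_iter):
        assign = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2).argmin(axis=1)
        for c in range(n_centers):
            members = X[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers

def pq_train(X, B, m):
    """One codebook of 2^m centers per sub-vector of length B: shape (D/B, 2^m, B)."""
    return np.stack([kmeans(X[:, s:s + B], 2 ** m, seed=s) for s in range(0, X.shape[1], B)])

def pq_encode(X, codebooks):
    """Replace every sub-vector by the index of its nearest center (uint8 holds m <= 8 bits)."""
    B = codebooks.shape[2]
    codes = np.empty((X.shape[0], codebooks.shape[0]), dtype=np.uint8)
    for j, cb in enumerate(codebooks):
        sub = X[:, j * B:(j + 1) * B]
        codes[:, j] = ((sub[:, None, :] - cb[None]) ** 2).sum(axis=2).argmin(axis=1)
    return codes

def pq_scores(codes, codebooks, w):
    """Linear scores from codes only: D/B table look-ups and D/B - 1 additions per video."""
    B = codebooks.shape[2]
    lut = np.einsum('jcb,jb->jc', codebooks, w.reshape(-1, B))   # lut[j, c] = <center c, w_j>
    return lut[np.arange(codes.shape[1]), codes].sum(axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 16))     # stand-in video representations
w = rng.standard_normal(16)            # stand-in linear classifier
codebooks = pq_train(X, B=4, m=4)
codes = pq_encode(X, codebooks)
scores = pq_scores(codes, codebooks, w)
compression_ratio = 4 * 32 / 8         # B = 4 floats (32 bits each) -> one m = 8 bit code
```

The scores equal the dot product between w and the PQ-reconstructed vectors, so the only approximation error comes from the quantization itself.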
4. Experiment Settings
4.1. Datasets
In our experiments, we utilize the largest event detection datasets with labels², namely TRECVID MEDTest13 [1] and TRECVID MEDTest14 [2]. They have been introduced by NIST for all participants in the TRECVID competition and the research community to conduct experiments on. Each dataset contains 20 complex events, 10 of which are shared between the two. MEDTest13 contains events E006-E015 and E021-E030, while MEDTest14 has events E021-E040. Event names include “Birthday party”, “Bike trick”, etc. Refer to [1, 2] for the complete list of event names. In the training section, there are approximately 100 positive exemplars per event, and all events share negative exemplars with about 5,000 videos. The testing section has approximately 23,000 search videos. The total duration of videos in each collection is about 1,240 hours.
4.2. Features for Comparisons
As reported in [3] and compared with the features from other top performers [30, 29, 22] in the TRECVID MED 2013 competition, we can see that the improved Dense Trajectories has clear advantages over the original Dense Trajectories (used by all other teams except [3]), and is even better than approaches that combine many low-level visual features [30, 29, 22]. Improved Dense Trajectories extracts local descriptors such as trajectory, HOG, HOF, and MBH, and Fisher vector is then applied to encode the local descriptors into video representation. Following [44, 3], we first reduce the dimension of each descriptor by a factor of 2 and then utilize 256 components to generate the Fisher vectors. We evaluate four types of descriptor in improved Dense Trajectories, and report the results of the best combination of descriptors and the two individual descriptors that have the best performance (HOG and MBH).
In addition, we report the results of some popular features used in the TRECVID competition for reference, such as STIP [23], MoSIFT [7] and CSIFT [41], though their performance is far weaker than improved Dense Trajectories.
4.3. Evaluation Details
In all the experiments, we apply linear Support Vector Machine (SVM) with the LIBSVM toolkit [5]. We conduct extensive experiments on two standard training conditions: in 100Ex, 100 positive exemplars are given in each event and in 10Ex, 10 positive exemplars are given. In the 100Ex condition, we utilize 5-fold cross-validation to choose the regularization coefficient C in linear SVM. In the 10Ex condition, we follow [22] and set C in linear SVM to 1.
We sample every five frames in the videos and follow the pre-processing of [21, 6] on CNN descriptor extraction. We extract the features from the center crop. CNN descriptors are extracted using Caffe [18] with the best publicly available model [37], and we utilize vlfeat [42] to generate Fisher vector and VLAD representation.
Mean Average Precision (mAP) for binary classification is applied to evaluate the performance of event detection according to the NIST standard [1, 2].
² Labels for MEDEval13 and MEDEval14 are not publicly available.
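For readers unfamiliar with the metric, the sketch below computes average precision for one event from ranked detection scores and averages it over events. It uses the common definition (mean of the precision values at the ranks of the positive videos); the function names are hypothetical, and the official NIST evaluation tooling, including its handling of tied scores, is not reproduced here.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one event: mean precision at the rank of each positive video."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(labels)[order].astype(bool)
    if not hits.any():
        return 0.0
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_rank[hits].mean())

def mean_average_precision(scores_per_event, labels_per_event):
    """mAP: the unweighted mean of the per-event AP values."""
    aps = [average_precision(s, l) for s, l in zip(scores_per_event, labels_per_event)]
    return float(np.mean(aps))

# Toy ranking: positives at ranks 1 and 3, so AP = (1/1 + 2/3) / 2.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

In the experiments, each event has its own binary classifier, so one AP value is computed per event before averaging.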
5. Experiment Results
5.1. Results for Video Pooling of CNN Descriptors
In this section, we show the experiments on video pooling of fc6, fc6_relu, fc7 and fc7_relu. Before aggregation, we first apply PCA with whitening on the $\ell_2$ normalized CNN descriptors. Unlike local descriptors such as HOG and MBH, which have dimensions less than 200-D, the CNN descriptors have much higher dimensions (4,096-D). We conduct experiments with different reduced dimensions, i.e., 128, 256, 512 and 1,024, and utilize the reduced dimensions that best balance performance and storage cost in corresponding features, i.e., 512-D for fc6 and fc6_relu and 256-D for fc7 and fc7_relu. We utilize 256 components for Fisher vectors and 256 centers for VLAD as common choices in [36, 16]. We will study the impact of parameters in Section 5.3. PCA projections, components in GMM for Fisher vectors, and centers in K-means for VLAD are learned from approximately 256,000 sampled frames in the training set.
Since we observe similar patterns in MEDTest13 and MEDTest14 under both 100Ex and 10Ex, we take MEDTest14 100Ex as an example to compare different representations, namely average pooling, video pooling with Fisher vectors and video pooling with VLAD. From Table 2, we can see that both video pooling with Fisher vectors and VLAD demonstrate great advantages over the average pooling representation. On the video pooling of CNN descriptors, Fisher vector encoding does not exhibit better performance than VLAD. Similar observations have been expressed in [10]. We suspect that the distribution of CNN descriptors is quite different from the local descriptors, e.g., HOG, HOF. We will study the theoretical reasons for the poorer performance of Fisher vector than VLAD on CNN video pooling in future research.

                   fc6     fc6_relu   fc7     fc7_relu
Average pooling    19.8    24.8       18.8    23.8
Fisher vector      28.3    28.4       27.4    29.1
VLAD               33.1    32.6       33.2    31.5

Table 2. Performance comparison (mAP in percentage) on MEDTest14 100Ex.
We compare the performance of VLAD encoded CNN descriptors with the state-of-the-art feature improved Dense Trajectories (IDT) and average pooling on CNN descriptors in Figure 3. We also illustrate the performance of the two strongest descriptors inside IDT (HOG and MBH).

Figure 3. Performance comparisons on MEDTest13 and MEDTest14, both 100Ex and 10Ex. This figure is best viewed in color.

We can see very clearly that VLAD encoded CNN features significantly outperform IDT and average pooling on CNN descriptors over all settings. For more references, we provide the performance of a number of widely used features [29, 30, 22] on MEDTest14 for comparison. MoSIFT [7] with Fisher vector achieves mAP 18.1% on 100Ex and 5.3% on 10Ex; STIP [23] with Fisher vector achieves mAP 15.0% on 100Ex and 7.1% on 10Ex; CSIFT [41] with Fisher vector achieves mAP 14.7% on 100Ex and 5.3% on 10Ex. Note that with VLAD encoded CNN descriptors, we can achieve better performance with 10Ex than the relatively poorer features such as MoSIFT, STIP, and CSIFT with 100Ex!
5.2. Results for CNN Latent Concept Descriptors with Spatial Pyramid Pooling
We evaluate the performance of latent concept descriptors (LCD) of both the original CNN structure and the structure with the Spatial Pyramid Pooling (SPP) layer plugged in to validate the effectiveness of SPP. Before encoding the latent concept descriptors, we first apply PCA with whitening. Dimension reduction is conducted from 512-D to a range of dimensions such as 32-D, 64-D, 128-D, and 256-D, and we find that 256-D is the best choice. We observe a similar pattern with video pooling of fc layers, indicating that Fisher vector is inferior to VLAD on video pooling. We omit the results for Fisher vector due to limited space.
We show the performance of our proposed latent concept descriptors (LCD) in Table 3 and Table 4. In both 100Ex and 10Ex over two datasets, we can see clear gaps over the pool5 features with average pooling, which demonstrates the advantages of our proposed novel utilization of pool5. With the SPP layer, VLAD encoded LCD (LCD VLAD + SPP) continues to increase the performance further from the original structure (LCD VLAD). The aggregation at a deeper stage to generate multiple levels of spatial information via multiple CNN max-pooling demonstrates advantages over the original CNN structure while having only minimal computation costs. The SPP layer enables a single pass of the forwarding in the network compared to the multiple passes of applying spatial pyramid on the original input images.

                   100Ex   10Ex
Average pooling    31.2    18.8
LCD VLAD           38.2    25.0
LCD VLAD + SPP     40.3    25.6

Table 3. Performance comparisons for pool5 on MEDTest13. LCD VLAD is VLAD encoded LCD from the original CNN structure, while LCD VLAD + SPP indicates VLAD encoded LCD with the SPP layer plugged in.

                   100Ex   10Ex
Average pooling    24.6    15.3
LCD VLAD           33.9    22.8
LCD VLAD + SPP     35.7    23.2

Table 4. Performance comparisons for pool5 on MEDTest14. Notations are the same as Table 3.
5.3. Analysis of the Impact of Parameters
We take VLAD encoded fc7 features under MEDTest14 100Ex as an example to see the impact of parameters in the video pooling process.
Dimensions of PCA: The original dimension of fc7 is quite high compared to local descriptors. It is essential to investigate the impact of dimensions in PCA in the pre-processing stage, since it is critical to achieve a better trade-off of performance and storage costs. Table 5 shows that in dimensions of more than 256-D, performance remains similar, whereas encoding in 128-D damages the performance significantly.

Dimension   128-D   256-D   512-D   1024-D
mAP         30.6    33.2    33.1    33.2

Table 5. Impact of dimensions of CNN descriptors after PCA, with fixed K = 256 in VLAD.

Number of Centers in Encoding: We explore various numbers of centers K in VLAD, and the results are shown in Table 6. With the increase of K, we can see that the discriminative ability of the generated features improves. However, when K = 512, the generated vector may be too sparse, which is somewhat detrimental to performance.

K     32     64     128    256    512
mAP   28.7   29.7   30.4   33.2   32.1

Table 6. Impact of the number of centers (K) in VLAD, with fixed PCA dimension of 256-D.

VLAD-k: We experiment with the traditional VLAD as well, with the nearest center only instead of the k-nearest centers. mAP drops from 33.2% to 32.0%.
Power Normalization: We remove the SSR post-processing and test the features on the VLAD encoded fc7. mAP drops from 33.2% to 27.0%, from which we can see the significant effect of SSR post-processing.
Intra-normalization: We turn off the intra-normalization. mAP drops from 33.2% to 30.6%.
5.4. Results for Product Quantization Compression

                  original   B = 4          B = 8
mAP               33.2       33.5 (↑0.3)    33.0 (↓0.2)
space reduction   -          16×            32×

Table 7. Performance change analysis for VLAD encoded fc7 with PQ compression. B is the length of the sub-vectors in PQ and m = 8.

We conduct experiments on VLAD encoded fc7 to see the performance changes with Product Quantization (PQ) compression. From the results in Table 7, we can see that PQ with B = 4 maintains the performance and even improves slightly. When B = 8, performance drops slightly.
If we compress with B = 4, we can store VLAD encoded fc7 features in 3.1 GB for the MEDEval14, which contains 200,000 videos of 8,000 hours' duration. With further compression with a lossless technique such as Blosc³ [8], we can store the features of the whole collection in less than 1 GB, which can be read by a normal SSD disk in a few seconds. Without PQ compression, the storage size of the features would be 48.8 GB, which severely compromises the execution time due to the I/O cost. Utilization of compression techniques largely saves the I/O cost in the prediction procedure, while preserving the performance.
In our speed test on the MEDEval14 collection using the compressed data but not the original features, we can finish the prediction on 200,000 videos in 4.1 seconds per event using 20 threads on an Intel Xeon E5-2690 v2 @ 3.00 GHz.
5.5.Results for Fusing Multiple Layers Extracted
from the Same Model
We investigate average late fusion [39] to fuse the prediction results from different layers with PQ compression, i.e., VLAD encoded LCD with SPP, fc6 and fc7. From Table 8 we can see that the simple fusion pushes the performance further beyond the single layers on MEDTest 13 and MEDTest 14, and achieves significant advantages over improved Dense Trajectories (IDT). Our proposed method pushes the state-of-the-art performance much further, achieves more than 30% relative improvement on 100Ex, and more than 65% relative improvement on 10Ex over both challenging datasets.

³ Blosc can reduce the storage space by a factor of 4.

Figure 4. MEDTest 13 100Ex per event performance comparison (in mAP percentage). This figure is best viewed in color.
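Average late fusion, as used above to combine the three layers, can be sketched as follows. This is a minimal illustration: the function name and the score values are hypothetical, and the sketch assumes that the per-layer prediction scores are already on comparable scales.

```python
def average_late_fusion(score_lists):
    """Average the prediction scores of several models, video by video."""
    if not score_lists:
        raise ValueError("need at least one list of scores")
    n = len(score_lists[0])
    if any(len(s) != n for s in score_lists):
        raise ValueError("all models must score the same videos")
    return [sum(col) / len(score_lists) for col in zip(*score_lists)]


# Hypothetical scores of three videos for one event, one list per layer.
lcd_spp = [0.75, 0.25, 0.50]
fc6 = [0.50, 0.00, 1.00]
fc7 = [0.25, 0.50, 0.75]

fused = average_late_fusion([lcd_spp, fc6, fc7])

# Rank videos by fused score, highest first.
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

The fused scores are then ranked per event exactly as single-layer scores would be, so the evaluation procedure is unchanged.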
                Ours   IDT    Relative Improv
MED13 100Ex     44.6   34.0   31.2%
MED13 10Ex      29.8   18.0   65.6%
MED14 100Ex     36.8   27.6   33.3%
MED14 10Ex      24.5   13.9   76.3%
Table 8. Performance comparison of all settings; the last column shows the relative improvement of our proposed representation over IDT.
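The last column of Table 8 follows from the first two as (Ours - IDT) / IDT. A short check, with a hypothetical helper name:

```python
def relative_improvement(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# (Ours, IDT) mAP values copied from Table 8.
table8 = {
    "MED13 100Ex": (44.6, 34.0),
    "MED13 10Ex": (29.8, 18.0),
    "MED14 100Ex": (36.8, 27.6),
    "MED14 10Ex": (24.5, 13.9),
}

improvements = {
    name: f"{relative_improvement(ours, idt):.1f}%"
    for name, (ours, idt) in table8.items()
}

for name, value in improvements.items():
    print(name, value)
```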
Figure 4 and Figure 5 show the per-event mAP comparison of the 100Ex setting on MEDTest 13 and MEDTest 14. We provide results for average pooling on CNN descriptors with late fusion of three layers as well, denoted as CNN avg. Our proposed representation beats two other strong baselines in 15 out of 20 events in MEDTest 13 and 14 out of 20 events in MEDTest 14, respectively.
5.6. Comparison to the State-of-the-Art Systems
We compare the MEDTest 13 results⁴ with the top performers in the TRECVID MED 2013 competition [3, 30, 22]. The AXES team does not show their performance on MEDTest 13 [3]. Natarajan et al. [30] report mAP 38.5% on 100Ex, 17.9% on 10Ex from their whole visual system of combining all their low-level visual features. Lan et al. [22] report 39.3% mAP on 100Ex of their whole system including non-visual features while they conducted 10Ex on their internal dataset. Our results achieve 44.6% mAP on 100Ex and 29.8% mAP on 10Ex, which significantly outperforms the top performers in the competition who combine more than 10 kinds of features with sophisticated schemes. To show that our representation is complementary to features from other modalities, we perform average late fusion of our proposed representation with IDT and MFCC, and generate a lightweight system with static, motion and acoustic features, which achieves 48.6% mAP on 100Ex, and 32.2% mAP on 10Ex.

⁴ In [3, 30, 22], teams report performance on MEDEval 13 as well, while MEDEval 13 is a different collection used in the competition, where only NIST can evaluate the performance.

Figure 5. MEDTest 14 100Ex per event performance comparison (in mAP percentage). This figure is best viewed in color.
6. Conclusion
TRECVID Multimedia Event Detection (MED) has suffered from huge computation costs in feature extraction and classification. Using Convolutional Neural Network (CNN) representation seems to be a good solution, but generating video representation from CNN descriptors has different characteristics from image representation. We are the first to leverage encoding techniques to generate video representation from CNN descriptors. We also propose latent concept descriptors to generate CNN descriptors more properly. For fast event search, we utilize Product Quantization to compress the video representation and predict on the compressed data. Extensive experiments on the two largest event detection collections under different training conditions demonstrate the advantages of our proposed representation. We have achieved promising performance which is superior to the state-of-the-art systems that combine more than 10 kinds of features. The proposed representation is extendible and the performance can be further improved by better CNN models and/or appropriate fine-tuning techniques.
7. Acknowledgement
This paper is in part supported by the 973 program 2012CB316400, in part by the ARC DECRA project, and in part by Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20068. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.
References
[1] TRECVID MED 13. […]/itl/iad/mig/med13.cfm.
[2] TRECVID MED 14. […]/itl/iad/mig/med14.cfm.
[3] R. Aly, R. Arandjelović, K. Chatfield, M. Douze, B. Fernando, Z. Harchaoui, K. McGuinness, N. E. O’Connor, D. Oneata, O. M. Parkhi, et al. The AXES submissions at TrecVid 2013. 2013.
[4] R. Arandjelović and A. Zisserman. All about VLAD. In CVPR, 2013.
[5] C.-C. Chang and C.-J. Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[6] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[7] M.-Y. Chen and A. Hauptmann. Mosift: Recognizing human actions in surveillance videos. CMU TR, 2009.
[8] R. G. Cinbis, J. Verbeek, and C. Schmid. Segmentation driven object detection with Fisher vectors. In ICCV, 2013.
[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[10] M. Douze, J. Revaud, C. Schmid, and H. Jégou. Stable hyper-pooling and query expansion for event detection. In ICCV, 2013.
[11] C. Gan, N. Wang, Y. Yang, D.-Y. Yeung, and A. Hauptmann. Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR, 2015.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[13] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV, 2014.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[15] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. TPAMI, 33(1):117–128, 2011.
[16] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[17] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local image descriptors into compact codes. TPAMI, 34(9):1704–1716, 2012.
[18] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. […], 2013.
[19] V. Kantorov and I. Laptev. Efficient feature extraction, encoding and classification for action recognition. In CVPR, 2014.
[20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[22] Z.-Z. Lan, L. Jiang, S.-I. Yu, et al. CMU-Informedia at TRECVID 2013 Multimedia Event Detection. In TRECVID 2013 Workshop, 2013.
[23] I. Laptev. On space-time interest points. IJCV, 64(2-3):107–123, 2005.
[24] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[25] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[26] Z. Ma, Y. Yang, N. Sebe, and A. Hauptmann. Knowledge adaptation with partially shared features for event detection using few exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(9):1789–1802, 2014.
[27] Z. Ma, Y. Yang, Z. Xu, S. Yan, N. Sebe, and A. G. Hauptmann. Complex event detection via multi-source video attributes. In CVPR, 2013.
[28] F. Metze, S. Rawat, and Y. Wang. Improved audio features for large-scale multimedia event detection. In ICME, 2014.
[29] G. K. Myers, R. Nallapati, J. van Hout, et al. The 2013 SESAME Multimedia Event Detection and Recounting system. In TRECVID 2013 Workshop, 2013.
[30] P. Natarajan, S. Wu, F. Luisier, et al. BBN VISER TRECVID 2013 Multimedia Event Detection and Multimedia Event Recounting Systems. In TRECVID 2013 Workshop, 2013.
[31] P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis, U. Park, and R. Prasad. Multimodal feature fusion for robust event detection in web videos. In CVPR, 2012.
[32] D. Oneata, M. Douze, J. Revaud, S. Jochen, D. Potapov, H. Wang, Z. Harchaoui, J. Verbeek, C. Schmid, R. Aly, et al. AXES at TRECVid 2012: KIS, INS, and MED. In TRECVID workshop, 2012.
[33] D. Oneata, J. Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In ICCV, 2013.
[34] X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. arXiv preprint arXiv:1405.4506, 2014.
[35] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
[36] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 105(3):222–245, 2013.
[37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[38] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In CVPR, 2003.
[39] C. G. Snoek, M. Worring, and A. W. Smeulders. Early versus late fusion in semantic video analysis. In MM. ACM, 2005.
[40] A. Tamrakar, S. Ali, Q. Yu, J. Liu, O. Javed, A. Divakaran, H. Cheng, and H. Sawhney. Evaluation of low-level features and their combinations for complex event detection in open source videos. In CVPR, 2012.
[41] K. E. Van De Sande, T. Gevers, and C. G. Snoek. Evaluating color descriptors for object and scene recognition. TPAMI, 32(9):1582–1596, 2010.
[42] A. Vedaldi and B. Fulkerson. Vlfeat: An open and portable library of computer vision algorithms. In MM. ACM, 2010.
[43] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[44] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[45] Z. Xu, Y. Yang, I. Tsang, N. Sebe, and A. G. Hauptmann. Feature weighting via optimal thresholding for video analysis. In ICCV, 2013.
[46] Y. Yang, Z. Ma, Z. Xu, S. Yan, and A. G. Hauptmann. How related exemplars help complex event detection in web videos? In ICCV, 2013.