AUTOMATIC VIDEO STRUCTURING BASED ON HMMS AND AUDIO VISUAL INTEGRATION
This paper focuses on the use of Hidden Markov Models (HMMs) for structure analysis of sport videos. The video structure parsing relies on the analysis of the temporal interleaving of video shots, with respect to a priori information about video content an
P. Gros (1), E. Kijak (2) and G. Gravier (1)
(1) IRISA – CNRS
(2) IRISA – Université de Rennes 1
Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
{Patrick.Gros,Ewa.Kijak,Guillaume.Gravier}@irisa.fr
ABSTRACT
This paper focuses on the use of Hidden Markov Models (HMMs) for structure analysis of sport videos. The video structure parsing relies on the analysis of the temporal interleaving of video shots, with respect to a priori information about video content and editing rules. The basic temporal unit is the video shot and both audio and visual features are used to characterize its type of view. Our approach is validated in the particular domain of tennis videos. As a result, typical tennis scenes are identified. In addition, each shot is assigned to a level in the hierarchy described in terms of point, game and set.
1. INTRODUCTION
Video content analysis is an active research domain that aims at automatically extracting high-level semantic events from video. This semantic information can be used to produce indexes or tables of contents that enable efficient search and browsing of video content. Low-level visual features, largely used for indexing generic video contents, are not sufficient to provide meaningful information to an end-user. To achieve such a goal, algorithms have to be dedicated to one particular type of video.
One domain-specific application is the detection and recognition of highlights in sport videos. Sport video analysis is motivated by the growing amount of archived sport video material, and by the broadcasters' need for a detailed annotation of video contents to select relevant excerpts to be edited for summaries or magazines. Up to now, this logging task has been performed manually by librarians.
Most of the existing works in the domain of sports video analysis are related to the detection of specific events. A common approach in event detection consists in combining the extraction of low-level features with heuristic rules to infer predetermined highlights [1, 2]. Recent approaches use Hidden Markov Models (HMMs) for event classification in soccer [3] and baseball [4]. Nevertheless, these works attempt to detect specific events, and the reconstruction of higher-level temporal structure is not addressed.
Inside the category of sport videos, a distinction should be made between time-constrained sports such as soccer, and score-constrained sports such as tennis or baseball. Time-constrained sports have a relatively loose structure. The game can be decomposed into equal periods. During a period, the content flow is quite unpredictable. As a result, structure analysis of a soccer video is restricted to "play" / "out-of-play" segmentation [5].
In score-constrained sports, the content presents a strong hierarchical structure. For example, a tennis match can be broken down into sets, games and points. A previous work on tennis and baseball [6] studies the detection of basic units, such as the serve in tennis or the pitch in baseball; however, the well-defined structure of these sports is not taken into account to recover the whole hierarchical structure.
In this paper, we address the problem of recovering sport video structure, through the example of tennis, which presents a strong structure. Video structure parsing consists in extracting logical story units from the considered video. The structure to be estimated relies on the nature of the video. For instance, a news video can be considered as a sequence of units which starts with an image frame presenting an anchor person followed by a variety of news and commercials [7].
Producing a table of contents implies performing a temporal segmentation of the video into shots. Such a task has been widely studied and generally relies on the detection of discontinuities in low-level visual features such as color or motion [8]. The critical step is to automatically group shots into "scenes", or "story units", that are defined as a coherent group of shots that is meaningful for the end-user. Recent efforts have been made on fusing information provided by different streams. It seems reasonable to think that integrating several media improves the performance of the analysis. This is confirmed by some existing works reported in [9, 10]. Multimodal approaches have been investigated for different areas of content-based analysis, such as scene boundary detection [11], structure analysis of news [12], and genre classification [13]. However, fusing multimodal features is not a trivial task. We can highlight two problems among many others.
- a synchronization and time scale problem: the sampling rate used to compute and analyse low-level features is not the same for the different media;
- a decision problem: what should be the final decision when the different media provide opposite information?
Multimodal fusion can be performed at two levels: feature and decision levels. At the feature level, low-level audio and visual features are combined into a single audio-visual feature vector before the classification. The multimodal features have to be synchronized [12]. This early integration strategy is computationally costly due to the size of typical feature spaces. At the decision level, a common approach consists in classifying separately according to each modality before integrating the classification results. However, some dependencies among features from different modalities are not taken into account in this late integration scheme.
But usually these approaches rely on a successive use of visual and audio classification [14, 15]. For example in [15], visual features are first used to identify the court views of a tennis video. Then ball hits, silence, applause, and speech are detected in these shots. The analysis of the sound transition pattern finally makes it possible to refine the model, and to identify specific events like scores, reserves, aces, serves and returns.
In this work, an intermediate strategy is used which consists in extracting separately shot-based "high level" audio and visual cues. The classification is then made using the audio and visual cues simultaneously (Figure 2). In other words, we choose a transitional level between decision and feature levels. Before analyzing shots from raw image intensity and audio data, some preliminary decisions can be made using the features of the data (e.g. representation of audio features in terms of classes like music, noise, silence, speech, and applause). In this way, after making some basic decisions, the feature space size is reduced and the modalities can be combined more easily.
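The two fusion levels discussed above can be illustrated with a minimal Python sketch. It is not the system described in this paper: the function names, the toy feature values, and the product rule used for the decision-level combination are all assumptions chosen for illustration.

```python
def early_fusion(audio_feat, visual_feat):
    """Feature-level fusion: one joint vector, classified later as a whole."""
    return list(audio_feat) + list(visual_feat)


def late_fusion(audio_post, visual_post):
    """Decision-level fusion: combine per-class posteriors of two classifiers.

    The product rule used here is an illustrative assumption; it treats the
    two modalities as independent.
    """
    joint = [a * v for a, v in zip(audio_post, visual_post)]
    total = sum(joint)
    return [p / total for p in joint]


# Hypothetical per-shot features (values are made up).
audio_feat = [0.2, 0.7, 0.1]
visual_feat = [0.9, 0.4]
fused = early_fusion(audio_feat, visual_feat)

# Hypothetical class posteriors for two classes, one list per modality.
audio_post = [0.6, 0.4]
visual_post = [0.3, 0.7]
decision = late_fusion(audio_post, visual_post)
best_class = decision.index(max(decision))
```

Concatenation keeps cross-modal dependencies available to the classifier but enlarges the feature space, while the product rule is cheap but ignores those dependencies, which is the trade-off that motivates an intermediate level.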
Our aim is to exploit multimodal information and temporal relations between shots in order to identify the global structure. The proposed method simultaneously performs a scene classification and segmentation using HMMs. HMMs provide an efficient way to integrate features from different media [13], and to represent the hierarchical structure of a tennis match. At the first level, several consecutive shots of a tennis video are classified within one of the following four predefined tennis units: missed first serve, rally, replay, and break. At the higher level, the classified segments are grouped and assigned to their corresponding label in the structure hierarchy, described in terms of point, game, and set (see Figure 1).
Fig. 1. Structure of Tennis Game
This paper is organized as follows. Section 2 provides elements on tennis video syntax. Section 3 gives an overview of the system and briefly describes the audio-visual features exploited. Section 4 introduces the structure analysis mechanism. Experimental results are presented and discussed in section 5.
2. TENNIS SYNTAX
Sport video production is characterized by the use of a limited number of cameras at almost fixed positions. The different types of views present in a tennis video can be divided into four principal classes: global, medium, close-up, and audience. In a tennis video production, global views contain much of the pertinent information. The remaining information relies on the presence or the absence of non-global views but is independent of the type of these views.
Considering a given instant, the point of view giving the most relevant information is selected by the producer, and broadcast. Therefore sport videos are composed of a restricted number of typical scenes producing a repetitive pattern. For example, during a rally in a tennis video, the content provided by the camera filming the whole court is selected (global view). After the end of the rally, the player who has just carried out an action of interest is captured with a close-up. As close-up views never appear during a rally but right after or before it, global views are generally indicative of a rally. Another example consists in replays, which are notified to the viewers by inserting special transitions. Because of the presence of typical scenes and the finite number of views, the tennis video has a predictable temporal syntax.
Fig. 2. Structure analysis system overview
[Figure labels: Program Syntax States; Audio Stream]
We identify here four typical patterns in tennis videos, called tennis units: missed first serve, rally, replay, and break. A break is characterized by a long succession of scenes unrelated to games, such as commercials or close-ups. It appears when players change ends, generally every two games. We also take advantage of the well-defined and strong structure of tennis broadcast to break it down into sets, games and points.
3. SYSTEM OVERVIEW
In this section, we give an overview of the system (Figure 2) and briefly describe the extraction of visual features. First, the video stream is automatically segmented into shots by detecting cuts and dissolve transitions [16]. For each shot, shot features are computed and one keyframe is extracted from the beginning of the shot along with its image features. The segmented video results in an observation sequence which is parsed by a HMM process.
3.1. Visual Features
The features we currently use are shot length, a visual similarity based on a dominant color descriptor, and relative player position.
Shot length l: the shot length is given by the shot segmentation process. It is the number of frames within the shot.
Visual similarity v: we use visual features to identify the global views within all the extracted keyframes. The process can be divided into two steps. First, a keyframe K_ref representative of a global view is selected automatically without making any assumption about the playing area color. Our approach tries to avoid the use of a predefined field color, as the game field color can vary widely from one video to another. Once K_ref has been found, each keyframe K_t is characterized by a similarity distance to K_ref. The visual similarity measure v(K_t, K_ref) is defined as a weighted function of the spatial coherency, the distance function between the dominant color vectors, and the activity (average camera motion during a shot):
$$v(K_t, K_{ref}) = w_1 \, |C_t - C_{ref}| + w_2 \, d(F_t, F_{ref}) + w_3 \, |A_t - A_{ref}| \tag{1}$$
where w_1, w_2, and w_3 are the weights.
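As an illustration of the weighted sum in Eq. (1), the following Python sketch computes the measure for two toy keyframes. The Euclidean distance standing in for d(F_t, F_ref), the unit weights, and all numeric values are assumptions; the text above does not specify them.

```python
import math


def dominant_color_distance(f_t, f_ref):
    """Hypothetical stand-in for d(F_t, F_ref): plain Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_t, f_ref)))


def visual_similarity(c_t, f_t, a_t, c_ref, f_ref, a_ref, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of coherency, dominant-color and activity differences."""
    w1, w2, w3 = weights  # placeholder weights, not values from the paper
    return (w1 * abs(c_t - c_ref)
            + w2 * dominant_color_distance(f_t, f_ref)
            + w3 * abs(a_t - a_ref))


# Made-up descriptors for the reference global view K_ref.
c_ref, f_ref, a_ref = 0.8, (0.1, 0.6, 0.3), 0.2

# A keyframe identical to the reference gives a distance of zero.
same_view = visual_similarity(0.8, (0.1, 0.6, 0.3), 0.2, c_ref, f_ref, a_ref)

# A keyframe with different colors and activity gives a larger distance.
close_up = visual_similarity(0.3, (0.7, 0.2, 0.1), 0.9, c_ref, f_ref, a_ref)
```

Since v is a distance to K_ref, small values point to a global view and large values to a non-global view.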
Player position d: the player position is given by the gravity center of a raw segmentation of the player (also named blob). As it is a time-consuming process, this feature extraction is only performed on potential global view keyframes. It uses domain knowledge on the tennis court model and associated potential player positions. Only the player on the bottom of the court is considered to ensure a more reliable detection. The blob corresponding to this player is given by an image filtering and segmentation process based on dominant colors. Tennis court lines are also detected by a Hough transform and the half-court line, i.e. the line separating the left and right halves of the court, is identified. The distance d between the detected blob and the half-court line is computed. If the extraction process fails, this feature is not taken into account for the considered shot.
3.2. Audio Features
As mentioned previously, the video stream is segmented into a sequence of shots. Since shot boundaries are more suitable for a structure analysis based on production rules than boundaries extracted from the soundtrack, the shot is considered as the base entity, and features describing the audio content for each shot are used to provide additional information.
For each shot, a binary vector a_t describing which audio events, among speech, applause, ball hits, noise and music, are present in the shot is extracted from an automatic segmentation of the audio stream.
The soundtrack is first segmented into spectrally homogeneous chunks. For each chunk, tests are performed independently for each of the audio events considered in order to determine which events are present. More details can be found in [17]. Using this approach, the frame correct classification rate obtained is 77.83% while the total frame classification error rate is 34.41% due to insertions. The confusion matrix shows that ball hits, speech and applause are well classified while noise is often misclassified as ball hits, probably due to the fact that the ball hits class is a mix of ball hits and court noises. The ball hits class is often inserted, and the music class is often deleted.
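One possible reading of how the two rates can coexist is sketched below. It assumes the scoring convention common in speech recognition, which the text does not state: with N reference frames, S substitutions, D deletions and I insertions,

$$\text{correct} = \frac{N - S - D}{N} = 77.83\%, \qquad \text{error} = \frac{S + D + I}{N} = 34.41\%$$

Under that assumption, substitutions and deletions account for 100% − 77.83% = 22.17% of the reference frames, and the remaining 34.41% − 22.17% = 12.24% of the error rate comes from insertions, which is why the two rates do not sum to 100%.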
Finally, the shot audio vectors a_t are created by looking up the audio events that occur within the shot boundaries according to the audio segmentation.
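The construction of the shot audio vector can be sketched as follows. This is a minimal illustration, assuming the audio segmentation is available as a list of (start, end, events) triples in the same time unit as the shot boundaries; the data layout, the function name and the example values are hypothetical.

```python
# Order of the five audio events in the binary vector (order is an assumption).
AUDIO_EVENTS = ("speech", "applause", "ball_hits", "noise", "music")


def shot_audio_vector(shot_start, shot_end, audio_segments):
    """Return a binary vector marking which audio events overlap the shot."""
    present = set()
    for seg_start, seg_end, events in audio_segments:
        # Keep the events of every audio segment that overlaps the shot.
        if seg_start < shot_end and seg_end > shot_start:
            present |= set(events)
    return [1 if event in present else 0 for event in AUDIO_EVENTS]


# Made-up audio segmentation, times in seconds.
segments = [
    (0.0, 4.0, {"speech"}),
    (4.0, 9.5, {"ball_hits"}),
    (9.5, 12.0, {"applause", "speech"}),
    (12.0, 20.0, {"music"}),
]

# A shot from 4.0 s to 11.0 s overlaps the second and third segments.
rally_vector = shot_audio_vector(4.0, 11.0, segments)
```

Here `rally_vector` flags speech, applause and ball hits, and leaves noise and music at zero.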
4. STRUCTURE ANALYSIS
We integrate a priori information by deriving syntactical basic elements from the tennis video syntax. We define four basic structural units: two of them are related to game phases (missed first serves and rallies), the two others deal with video segments where no play occurs (breaks and replays). Each of these units is modelled by a HMM. The HMMs rely on the temporal relationships between shots.
Fig. 3. Content hierarchy of broadcast tennis video
Each state of the HMMs models either a single shot or a dissolve transition between shots. For a shot t, the observation o_t consists of the similarity measure v_t, the shot duration l_t, the audio description vector a_t, and the relative player position d_t, if it exists. The relative player position to the half-court line is used as a cue to determine if the server has changed or not. The probability of an observation o_t to be in state j at t is then given by:
$$b_j(o_t) = p(v_t \mid j) \, p(l_t \mid j) \, p(s_t \mid j) \, P[a_t \mid j] \tag{2}$$
where the probability distributions p(v_t|j), p(d_t|j) and P[a_t|j] are estimated by a learning step. p(v_t|j) and p(d_t|j) are modelled by smoothed histograms, and P[a_t|j] is the product over each sound class k of the discrete probability P[a_t(k)|j]. p(s_t|j) is the probability that the server changed, given by:
$$p(s_t \mid j) = \begin{cases} \dfrac{\big|\, |d_t| - |d_p| \,\big|}{\mathrm{Norm}} & \text{if state } j \text{ doesn't correspond to a change of server} \\[2ex] 0.5 & \text{if } d_t \text{ has not been extracted, whatever the state} \end{cases}$$
where Norm is a normalization factor that is taken as the half width of the court baseline, and d_p is the previous existing distance extracted. In addition, for each state representing a global view, the player position at the left or right side of the half-court line is considered.
Segmentation and classification of the whole observed sequence into the different structural elements are performed simultaneously using a Viterbi algorithm.
$$\hat{s} = \arg\max_{s} \Big[ \ln p(s) + \sum_{t} \ln b_{s_t}(o_t) \Big] \tag{3}$$
To take into account the long-term structure of a tennis game, the four HMMs are connected to a hierarchical HMM, as represented in Figure 3. This higher level reflects the structure of a tennis game in terms of sets, games, and points. The search space can be described as a huge network where the best state transition path has to be found. The search is performed at different levels. Transition probabilities between states of the hierarchical HMM result entirely from a priori information, while transition probabilities for the sub-HMMs result from a learning process.
Several comments about the hierarchical HMM are in order. The point is the basic scoring unit. It corresponds to a winner rally, that is to say almost all rallies except first missed serves. A break happens at the end of at least ten consecutive points. Boundaries detection between games, and consequently game composition in terms of points, rely essentially on the player position detection, which indicates if the server has changed.
5. EXPERIMENTAL RESULTS
In this section, we describe the experimental results of the audiovisual tennis video segmentation by HMMs. Experimental data are composed of 8 videos, representing about 5 hours of manually labelled tennis video. The videos are distributed among 3 different tournaments, implying different production styles and playing fields. 3 sequences are used to train the HMM while the remaining part is reserved for the tests. One tournament is completely excluded from the training set.
Several experiments are conducted using visual features only, audio features only and the combined audiovisual approach. The segmentation results are compared with the manually annotated ground truth. Classification rates of typical tennis scenes are given in Table 1.
Considering visual features only, the main source of mismatch is when a rally is identified as a missed first serve. In this case the similarity measure is well computed but the player detection or the analysis of the interleaving of shots failed. Replay detection relies essentially on dissolve transition detection. Our dissolve detection algorithm gives a lot of false detections, which leads to a small precision rate (48%). We checked that correcting the temporal segmentation improves the replay detection rates up to 100%.
Using only audio features, the precision and recall rates for rallies and first missed serves suggest that audio features are effective to describe rally scenes. Indeed, a rally is essentially characterized by the presence of ball hit sounds and applause which happens at the end of the exchange, although a missed first serve is only characterized by the presence of ball hits. On the contrary, replays are not characterized by a representative audio content, and almost all replays are missed. The correct detections are more due to the characteristic shot durations of dissolve transitions, which are very short. For the same reasons, replay shots can also be confused with commercials that are non-global views of short duration. Break is the only state characterized by the presence of music. That means music is a relevant event for break detection and particularly for commercials.
Fusing the audio and visual cues enhanced the performance. Compared with results using visual features only, there are two significant improvements: the recall and precision rates for rallies, and missed first serve. Introducing audio cues increases the correct detection rate thanks to ball hit sounds and applause.
Recovering the global structure is more interesting to reach a higher level in the structure analysis (see Table 2). The point boundaries detection is highly correlated with the correct detection of typical tennis scenes. However, the change of server detection is of high relevance for the structure parsing process. Without any information about the end of a game, the Viterbi process falls in local minima when searching the best state transition path, because of the equiprobable transitions between games. The game boundaries detection is then very sensitive to the player extraction, and requires this process to be robust. All misplaced game boundaries are due to errors or ambiguities in player position. Another way to deal with this problem would be to analyze the score displays on superimposed captions.
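The Viterbi search for the best state transition path, using the decoding criterion of Eq. (3) and the per-state observation score of Eq. (2), can be sketched in the log domain as follows. This is a flat two-state toy and not the hierarchical HMM used in this work; the state names, cue names, probability tables, transition values and the probability floor are all made up for illustration.

```python
import math


def log_emission(obs, state, models):
    """ln b_j(o_t): sum of log-probabilities of cues treated as independent.

    `models[state]` maps a cue name to a function returning a probability.
    The 1e-12 floor avoids log(0) and is an assumption of this sketch.
    """
    return sum(math.log(max(prob(obs[cue]), 1e-12))
               for cue, prob in models[state].items())


def viterbi(observations, states, log_init, log_trans, models):
    """Return the state path maximising ln p(s) + sum_t ln b_{s_t}(o_t)."""
    # best[t][j]: best log-score of any path ending in state j at step t.
    best = [{j: log_init[j] + log_emission(observations[0], j, models)
             for j in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for j in states:
            prev = max(states, key=lambda i: best[-1][i] + log_trans[i][j])
            scores[j] = (best[-1][prev] + log_trans[prev][j]
                         + log_emission(obs, j, models))
            pointers[j] = prev
        best.append(scores)
        back.append(pointers)
    # Backtrack from the best final state.
    path = [max(states, key=lambda j: best[-1][j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy models: v is the distance to the reference global view (small = global).
states = ["rally", "other"]
models = {
    "rally": {"v": lambda v: 0.9 if v < 0.5 else 0.1,
              "ball_hits": lambda a: 0.8 if a else 0.2},
    "other": {"v": lambda v: 0.2 if v < 0.5 else 0.8,
              "ball_hits": lambda a: 0.3 if a else 0.7},
}
log_init = {"rally": math.log(0.5), "other": math.log(0.5)}
log_trans = {
    "rally": {"rally": math.log(0.3), "other": math.log(0.7)},
    "other": {"rally": math.log(0.6), "other": math.log(0.4)},
}
shots = [{"v": 0.1, "ball_hits": 1},
         {"v": 0.9, "ball_hits": 0},
         {"v": 0.2, "ball_hits": 1}]

path = viterbi(shots, states, log_init, log_trans, models)
```

On this toy input the decoded path alternates between a rally shot and a non-global shot. When transitions between competing paths are equiprobable, as described above for game boundaries, the decision rests entirely on the emission terms, which is why a robust player position cue matters.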
Point boundaries    Game boundaries
Segmentation
precision
First serve
92%
Replay
89%
82% 74%
Audio features
65% precision
65%
91%
47%
92%
92% 94%
66%
precision
88%