AUTOMATIC VIDEO STRUCTURING BASED ON HMMS AND AUDIO VISUAL INTEGRATION



P. Gros (1), E. Kijak (2) and G. Gravier (1)

(1) IRISA – CNRS
(2) IRISA – Université de Rennes 1

Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France

{Patrick.Gros,Ewa.Kijak,Guillaume.Gravier}@irisa.fr

ABSTRACT

This paper focuses on the use of Hidden Markov Models (HMMs) for structure analysis of sport videos. The video structure parsing relies on the analysis of the temporal interleaving of video shots, with respect to a priori information about video content and editing rules. The basic temporal unit is the video shot and both audio and visual features are used to characterize its type of view. Our approach is validated in the particular domain of tennis videos. As a result, typical tennis scenes are identified. In addition, each shot is assigned to a level in the hierarchy described in terms of point, game and set.

1. INTRODUCTION

Video content analysis is an active research domain that aims at automatically extracting high-level semantic events from video. This semantic information can be used to produce indexes or tables of contents that enable efficient search and browsing of video content. Low-level visual features, largely used for indexing generic video contents, are not sufficient to provide meaningful information to an end-user. To achieve such a goal, algorithms have to be dedicated to one particular type of video.

One domain-specific application is the detection and recognition of highlights in sport videos. Sport video analysis is motivated by the growing amount of archived sport video material, and by the broadcasters' need for a detailed annotation of video contents to select relevant excerpts to be edited for summaries or magazines. Up to now, this logging task has been performed manually by librarians.

Most of the existing works in the domain of sports video analysis are related to the detection of specific events. A common approach in event detection consists in combining the extraction of low-level features with heuristic rules to infer predetermined highlights [1, 2]. Recent approaches use Hidden Markov Models (HMMs) for event classification in soccer [3] and baseball [4]. Nevertheless, these works attempt to detect specific events, but the reconstruction of higher-level temporal structure is not addressed.

Inside the category of sport videos, a distinction should be made between time-constrained sports such as soccer, and score-constrained sports such as tennis or baseball. Time-constrained sports have a relatively loose structure. The game can be decomposed into equal periods. During a period, the content flow is quite unpredictable. As a result, structure analysis of a soccer video is restricted to "play" / "out-of-play" segmentation [5].

In score-constrained sports, the content presents a strong hierarchical structure. For example, a tennis match can be broken down into sets, games and points. A previous work on tennis and baseball [6] studies the detection of basic units, such as serve in tennis or pitch in baseball; however, the well-defined structure of these sports is not taken into account to recover the whole hierarchical structure.

In this paper, we address the problem of recovering sport video structure, through the example of tennis which presents a strong structure. Video structure parsing consists in extracting logical story units from the considered video. The structure to be estimated relies on the nature of the video. For instance, a news video can be considered as a sequence of units which starts with an image frame presenting an anchor person followed by a variety of news and commercials [7].

Producing a table of contents implies performing a temporal segmentation of the video into shots. Such a task has been widely studied and generally relies on the detection of discontinuities in low-level visual features such as color or motion [8]. The critical step is to automatically group shots into "scenes", or "story units", that are defined as a coherent group of shots that is meaningful for the end-user. Recent efforts have been made on fusing information provided by different streams. It seems reasonable to think that integrating several media improves the performance of the analysis. This is confirmed by some existing works reported in [9, 10]. Multimodal approaches have been investigated for different areas of content-based analysis, such as scene boundary detection [11], structure analysis of news [12], and genre classification [13]. However, fusing multimodal features is not a trivial task. We can highlight two problems among many others:

- a synchronization and time scale problem: the sampling rate used to compute and analyse low-level features is not the same for the different media;

- a decision problem: what should be the final decision when the different media provide opposite information?

Multimodal fusion can be performed at two levels: feature and decision levels. At the feature level, low-level audio and visual features are combined into a single audio-visual feature vector before the classification. The multimodal features have to be synchronized [12]. This early integration strategy is computationally costly due to the size of typical feature spaces. At the decision level, a common approach consists in classifying separately according to each modality before integrating the classification results. However, some dependencies among features from different modalities are not taken into account in this late integration scheme.

But usually these approaches rely on a successive use of visual and audio classification [14, 15]. For example in [15], visual features are first used to identify the court views of a tennis video. Then ball hits, silence, applause, and speech are detected in these shots. The analysis of the sound transition pattern finally makes it possible to refine the model, and to identify specific events like scores, reserves, aces, serves and returns.

In this work, an intermediate strategy is used which consists in extracting separately shot-based "high level" audio and visual cues. The classification is then made using the audio and visual cues simultaneously (Figure 2). In other words, we choose a transitional level between decision and feature levels. Before analyzing shots from raw image intensity and audio data, some preliminary decisions can be made using the features of the data (e.g. representation of audio features in terms of classes like music, noise, silence, speech, and applause). In this way, after making some basic decisions, the feature space size is reduced and the modalities can be combined more easily.
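The three integration levels discussed above can be contrasted with a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the product rule used for the decision-level combination, and the layout of the shot observation are hypothetical choices and are not taken from the system described in this paper.

```python
from math import log

# Audio classes named in the text; the ordering is an arbitrary assumption.
AUDIO_CLASSES = ("speech", "applause", "ball_hits", "noise", "music")


def early_fusion(audio_vec, visual_vec):
    """Feature level: concatenate synchronized low-level vectors into one."""
    return list(audio_vec) + list(visual_vec)


def late_fusion(audio_post, visual_post):
    """Decision level: each modality is classified separately, then the
    per-class scores are merged (product rule, an assumed combination).
    Posteriors must be strictly positive for the logarithm to be defined."""
    scores = {c: log(audio_post[c]) + log(visual_post[c]) for c in audio_post}
    return max(scores, key=scores.get)


def intermediate_observation(similarity, shot_length, audio_events):
    """Intermediate level: a compact shot observation built from cues that
    already encode preliminary decisions (here, detected audio classes)."""
    audio_bits = tuple(int(c in audio_events) for c in AUDIO_CLASSES)
    return (similarity, shot_length, audio_bits)
```

In this sketch the intermediate observation stays small (two scalars and five bits per shot), which is the kind of feature space reduction the paragraph above refers to.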

Our aim is to exploit multimodal information and temporal relations between shots in order to identify the global structure. The proposed method simultaneously performs a scene classification and segmentation using HMMs. HMMs provide an efficient way to integrate features from different media [13], and to represent the hierarchical structure of a tennis match. At the first level, several consecutive shots of a tennis video are classified within one of the following four predefined tennis units: missed first serve, rally, replay, and break. At the higher level, the classified segments are grouped and assigned to their corresponding label in the structure hierarchy, described in terms of point, game, and set (see Figure 1).

Fig. 1. Structure of Tennis Game

This paper is organized as follows. Section 2 provides elements on tennis video syntax. Section 3 gives an overview of the system and briefly describes the audio-visual features exploited. Section 4 introduces the structure analysis mechanism. Experimental results are presented and discussed in Section 5.

2. TENNIS SYNTAX

Sport video production is characterized by the use of a limited number of cameras at almost fixed positions. The different types of views present in a tennis video can be divided into four principal classes: global, medium, close-up, and audience. In a tennis video production, global views contain much of the pertinent information. The remaining information relies on the presence or the absence of non-global views but is independent of the type of these views.

Considering a given instant, the point of view giving the most relevant information is selected by the producer, and broadcast. Therefore sport videos are composed of a restricted number of typical scenes producing a repetitive pattern. For example, during a rally in a tennis video, the content provided by the camera filming the whole court is selected (global view). After the end of the rally, the player who has just carried out an action of interest is captured with a close-up. As close-up views never appear during a rally but right after or before it, global views are generally indicative of a rally. Another example consists in replays that are notified to the viewers by inserting special transitions. Because of the presence of typical scenes and the finite number of views, the tennis video has a predictable temporal syntax.

Fig. 2. Structure analysis system overview

We identify here four typical patterns in tennis videos, called tennis units: missed first serve, rally, replay, and break. A break is characterized by a long succession of scenes unrelated to games, such as commercials or close-ups. It appears when players change ends, generally every two games. We also take advantage of the well-defined and strong structure of a tennis broadcast to break it down into sets, games and points.

3. SYSTEM OVERVIEW

In this section, we give an overview of the system (Figure 2) and briefly describe the extraction of visual features. First, the video stream is automatically segmented into shots by detecting cuts and dissolve transitions [16]. For each shot, shot features are computed and one keyframe is extracted from the beginning of the shot along with its image features. The segmented video results in an observation sequence which is parsed by a HMM process.

3.1. Visual Features

The features we currently use are shot length, a visual similarity based on a dominant colors descriptor, and relative player position.

Shot length $l$: the shot length is given by the shot segmentation process. It is the number of frames within the shot.

Visual similarity $v$: we used visual features to identify the global views within all the extracted keyframes. The process can be divided into two steps. First, a keyframe $K_{ref}$ representative of a global view is selected automatically without making any assumption about the playing area color. Our approach tries to avoid the use of a predefined field color as the game field color can largely vary from one video to another. Once $K_{ref}$ has been found, each keyframe $K_t$ is characterized by a similarity distance to $K_{ref}$. The visual similarity measure $v(K_t, K_{ref})$ is defined as a weighted function of the spatial coherency, the distance function between the dominant color vectors, and the activity (average camera motion during a shot):

$$v(K_t, K_{ref}) = w_1\,|C_t - C_{ref}| + w_2\,d(F_t, F_{ref}) + w_3\,|A_t - A_{ref}| \qquad (1)$$

where $w_1$, $w_2$, and $w_3$ are the weights.
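As an illustration of Eq. (1), the following Python sketch computes the similarity between a keyframe and the reference keyframe. The `KeyframeFeatures` container, the Euclidean form of the color distance $d$, and the default unit weights are assumptions made for this example only; the paper does not specify them.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class KeyframeFeatures:
    """Hypothetical container for the three quantities used in Eq. (1)."""
    coherency: float         # spatial coherency C
    dominant_colors: tuple   # dominant color vector F
    activity: float          # average camera motion A during the shot


def color_distance(f1, f2):
    """Euclidean distance between color vectors (an assumed choice for d)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def visual_similarity(k, k_ref, w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of Eq. (1); small values mean k resembles the reference."""
    return (w1 * abs(k.coherency - k_ref.coherency)
            + w2 * color_distance(k.dominant_colors, k_ref.dominant_colors)
            + w3 * abs(k.activity - k_ref.activity))
```

A keyframe compared with itself yields 0, and the value grows as any of the three terms departs from the reference, so low values would be expected for global views.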

Player position $d$: the player position is given by the center of gravity of a raw segmentation of the player (also called a blob). As it is a time-consuming process, this feature extraction is only performed on potential global view keyframes. It uses domain knowledge on the tennis court model and associated potential player positions. Only the player at the bottom of the court is considered to ensure a more reliable detection. The blob corresponding to this player is given by an image filtering and segmentation process based on dominant colors. Tennis court lines are also detected by a Hough transform and the half-court line, i.e. the line separating the left and right halves of the court, is identified. The distance $d$ between the detected blob and the half-court line is computed. If the extraction process fails, this feature is not taken into account for the considered shot.

3.2. Audio Features

As mentioned previously, the video stream is segmented into a sequence of shots. Since shot boundaries are more suitable for a structure analysis based on production rules than boundaries extracted from the soundtrack, the shot is considered as the base entity, and features describing the audio content for each shot are used to provide additional information.

For each shot, a binary vector $a_t$ describing which audio events, among speech, applause, ball hits, noise and music, are present in the shot is extracted from an automatic segmentation of the audio stream.

The soundtrack is first segmented into spectrally homogeneous chunks. For each chunk, tests are performed independently for each of the audio events considered in order to determine which events are present. More details can be found in [17]. Using this approach, the frame correct classification rate obtained is 77.83% while the total frame classification error rate is 34.41% due to insertions. The confusion matrix shows that ball hits, speech and applause are well classified while noise is often misclassified as ball hits, probably due to the fact that the ball hits class is a mix of ball hits and court noises. The ball hits class is often inserted, and the music class is often deleted.

Finally, the shot audio vectors $a_t$ are created by looking up the audio events that occur within the shot boundaries according to the audio segmentation.
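The construction of the shot audio vector can be sketched in Python as follows. This is a minimal illustration: the representation of the audio segmentation as `(start, end, events)` triples, the use of seconds as the time unit, and the class ordering are assumptions introduced here, not details given in the paper.

```python
# Audio event classes listed in the text; the ordering is an assumption.
AUDIO_CLASSES = ("speech", "applause", "ball_hits", "noise", "music")


def shot_audio_vector(shot_start, shot_end, audio_segments):
    """Return the binary vector a_t for one shot.

    audio_segments is an iterable of (start, end, events) triples, where
    events is the set of audio classes detected in that chunk. A class is
    marked present when a chunk containing it overlaps the shot interval.
    """
    present = set()
    for seg_start, seg_end, events in audio_segments:
        if seg_start < shot_end and seg_end > shot_start:  # intervals overlap
            present.update(events)
    return tuple(int(c in present) for c in AUDIO_CLASSES)
```

Because several events can be detected in the same chunk, the vector may contain more than one non-zero entry, which matches the independent per-event tests described above.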

4. STRUCTURE ANALYSIS

We integrate a priori information by deriving syntactical basic elements from the tennis video syntax. We define four basic structural units: two of them are related to game phases (missed first serves and rallies), the two others deal with video segments where no play occurs (breaks and replays). Each of these units is modelled by a HMM. The HMMs rely on the temporal relationships between shots.

Each state of the HMMs models either a single shot or a dissolve transition between shots. For a shot $t$, the observation $o_t$ consists of the similarity measure $v_t$, the shot duration $l_t$, the audio description vector $a_t$, and the relative player position $d_t$, if it exists. The relative player position to the half-court line is used as a cue to determine if the server has changed or not. The probability of observing $o_t$ in state $j$ at time $t$ is then given by:

$$b_j(o_t) = p(v_t \mid j)\, p(l_t \mid j)\, p(s_t \mid j)\, P[a_t \mid j] \qquad (2)$$

where the probability distributions $p(v_t \mid j)$, $p(l_t \mid j)$ and $P[a_t \mid j]$ are estimated by a learning step. $p(v_t \mid j)$ and $p(l_t \mid j)$ are modelled by smoothed histograms, and $P[a_t \mid j]$ is the product over each sound class $k$ of the discrete probability $P[a_t(k) \mid j]$. $p(s_t \mid j)$ is the probability that the server changed, given by:

$$p(s_t \mid j) = \begin{cases} \dfrac{\big|\,|d_t| - |d_p|\,\big|}{\mathrm{Norm}} & \text{if state } j \text{ doesn't correspond to a change of server} \\[2ex] 0.5 & \text{if } d_t \text{ has not been extracted, whatever the state} \end{cases}$$

where $\mathrm{Norm}$ is a normalization factor that is taken as the half width of the court baseline, and $d_p$ is the previous existing distance extracted. In addition, for each state representing a global view, the player position at the left or right side of the half-court line is considered.

Segmentation and classification of the whole observed sequence into the different structural elements are performed simultaneously using a Viterbi algorithm:

$$\hat{s} = \arg\max_{s} \Big[ \ln p(s) + \sum_{t} \ln b_{s_t}(o_t) \Big] \qquad (3)$$

To take into account the long-term structure of a tennis game, the four HMMs are connected to a hierarchical HMM, as represented in Figure 3. This higher level reflects the structure of a tennis game in terms of sets, games, and points. The search space can be described as a huge network where the best state transition path has to be found. The search is performed at different levels. Transition probabilities between states of the hierarchical HMM result entirely from a priori information, while transition probabilities for the sub-HMMs result from a learning process.

Several comments about the hierarchical HMM are in order. The point is the basic scoring unit. It corresponds to a winning rally, that is to say almost all rallies except missed first serves. A break happens at the end of at least ten consecutive points. Boundary detection between games, and consequently game composition in terms of points, relies essentially on the player position detection, which indicates if the server has changed.

Fig. 3. Content hierarchy of broadcast tennis video

5. EXPERIMENTAL RESULTS

In this section, we describe the experimental results of the audiovisual tennis video segmentation by HMMs. Experimental data are composed of 8 videos, representing about 5 hours of manually labelled tennis video. The videos are distributed among 3 different tournaments, implying different production styles and playing fields. 3 sequences are used to train the HMM while the remaining part is reserved for the tests. One tournament is completely excluded from the training set.

Several experiments are conducted using visual features only, audio features only and the combined audiovisual approach. The segmentation results are compared with the manually annotated ground truth. Classification rates of typical tennis scenes are given in Table 1.

Considering visual features only, the main source of mismatch is when a rally is identified as a missed first serve. In this case the similarity measure is well computed but the player detection or the analysis of the interleaving of shots failed. Replay detection relies essentially on dissolve transition detection. Our dissolve detection algorithm gives a lot of false detections, which leads to a small precision rate (48%). We checked that correcting the temporal segmentation improves the replay detection rates up to 100%.

Using only audio features, the precision and recall rates for rallies and missed first serves suggest that audio features are effective to describe rally scenes. Indeed, a rally is essentially characterized by the presence of ball hit sounds and of applause, which happens at the end of the exchange, whereas a missed first serve is only characterized by the presence of ball hits. On the contrary, replays are not characterized by a representative audio content, and almost all replays are missed. The correct detections are more due to the characteristic shot durations of dissolve transitions, which are very short. For the same reasons, replay shots can also be confused with commercials that are non-global views of short duration. Break is the only state characterized by the presence of music. That means music is a relevant event for break detection and particularly for commercials.
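For reference, the precision and recall rates quoted in this section can be computed as in the short Python sketch below. It assumes a shot-level comparison between predicted and reference labels; the exact evaluation protocol is not detailed in the paper, and the function name and label strings are hypothetical.

```python
def precision_recall(predicted, reference, label):
    """Precision and recall of one class over aligned label sequences.

    precision = correct detections / all detections of the class
    recall    = correct detections / all reference occurrences of the class
    """
    correct = sum(p == label and r == label
                  for p, r in zip(predicted, reference))
    n_detected = sum(p == label for p in predicted)
    n_reference = sum(r == label for r in reference)
    precision = correct / n_detected if n_detected else 0.0
    recall = correct / n_reference if n_reference else 0.0
    return precision, recall
```

With this definition, false detections lower the precision of a class while missed occurrences lower its recall.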

Fusing the audio and visual cues enhanced the performance. Compared with the results using visual features only, there are two significant improvements: the recall and precision rates for rallies, and for missed first serves. Introducing audio cues increases the correct detection rate thanks to ball hit sounds and applause.

Recovering the global structure is more interesting to reach a higher level in the structure analysis (see Table 2). The point boundaries detection is highly correlated with the correct detection of typical tennis scenes. However, the change of server detection is of high relevance for the structure parsing process. Without any information about the end of a game, the Viterbi process falls into local minima when searching for the best state transition path, because of the equiprobable transitions between games. The game boundaries detection is then very sensitive to the player extraction, and requires this process to be robust. All misplaced game boundaries are due to errors or ambiguities in player position. Another way to deal with this problem would be to analyze the score displays on superimposed captions.
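The best-path search of Eq. (3), whose behavior is discussed above, can be sketched in Python as a log-domain Viterbi decoder over a flat HMM. This is a generic illustration and not the authors' implementation: the hierarchical HMM, the learned observation distributions of Eq. (2), and the names `log_init`, `log_trans` and `log_obs` are assumptions of this example, with `log_obs[t][j]` standing for $\ln b_j(o_t)$.

```python
from math import log


def viterbi(log_init, log_trans, log_obs):
    """Return (best_path, best_log_score) for a flat HMM.

    log_init[j]     : log prior of starting in state j
    log_trans[i][j] : log transition probability from state i to state j
    log_obs[t][j]   : log observation probability ln b_j(o_t)
    """
    n_states = len(log_init)
    score = [log_init[j] + log_obs[0][j] for j in range(n_states)]
    back_pointers = []
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor of state j at time t.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + log_obs[t][j])
            pointers.append(best_i)
        score = new_score
        back_pointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    best_score = score[state]
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score


# Toy usage with two states and three shots (values are illustrative only).
log_init = [log(0.5), log(0.5)]
log_trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
log_obs = [[log(0.8), log(0.2)], [log(0.7), log(0.3)], [log(0.05), log(0.95)]]
path, best = viterbi(log_init, log_trans, log_obs)
```

When several transitions carry equal probability, as with the equiprobable transitions between games mentioned above, the choice between competing paths rests entirely on the observation terms, which is consistent with the sensitivity to player extraction reported here.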

Point boundaries    Game boundaries

Segmentation

precision

Firstserve

92%

Replay

89%

82%    74%

Audio features

65%    precision

65%

91%