
The MediaMill TRECVID 2005 Semantic Video Search Engine

C.G.M. Snoek, J.C. van Gemert, J.M. Geusebroek, B. Huurnink, D.C. Koelma, G.P. Nguyen,
O. de Rooij, F.J. Seinstra, A.W.M. Smeulders, C.J. Veenman, M. Worring

Intelligent Systems Lab Amsterdam, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands

http://www.mediamill.nl

Abstract

In this paper we describe our TRECVID 2005 experiments. The UvA-MediaMill team participated in four tasks. For the detection of camera work (run id: A CAM) we investigate the benefit of using a tessellation of detectors in combination with supervised learning over a standard approach using global image information. Experiments indicate that average precision results increase drastically, especially for pan (+51%) and tilt (+28%). For concept detection we propose a generic approach using our semantic pathfinder. The most important novelty compared to last year's system is the improved visual analysis using proto-concepts based on Wiccest features. In addition, the path selection mechanism was extended. Based on the semantic pathfinder architecture we are currently able to detect an unprecedented lexicon of 101 semantic concepts in a generic fashion. We performed a large set of experiments (run id: B vA). The results show that an optimal strategy for generic multimedia analysis is one that learns from the training set, on a per-concept basis, which tactic to follow. Experiments also indicate that our visual analysis approach is highly promising. The lexicon of 101 semantic concepts forms the basis for our search experiments (run id: B). We participated in automatic, manual (using only visual information), and interactive search. The lexicon-driven retrieval paradigm aids substantially in all search tasks. When coupled with interaction, exploiting several novel browsing schemes of our semantic video search engine, results are excellent. We obtain a top-3 result for 19 out of 24 search topics. In addition, we obtain the highest mean average precision of all search participants. We exploited the technology developed for the above tasks to explore the BBC rushes. The most intriguing result is that, from the lexicon of 101 visual-only models trained for news data, 25 concepts also perform reasonably well on BBC data.

1 Introduction

Despite the emergence of commercial video search engines, such as Google [9] and Blinkx [3], multimedia retrieval is by no means a solved problem. In fact, present-day video search engines rely mainly on text, in the form of closed captions [9] or transcribed speech [3], for retrieval. This results in disappointing performance when the visual content is not reflected in the associated text. In addition, when the videos originate from non-English speaking countries, such as China or The Netherlands, querying the content becomes even harder, as automatic speech recognition results are much poorer. For videos from these sources, an additional visual analysis potentially yields more robustness. For effective video retrieval there is a need for multimedia analysis, in which text retrieval is an important factor, but not the decisive element. We advocate that the ideal multimedia retrieval system should first learn a large lexicon of concepts, based on multimedia analysis, to be used for the initial search. Then, the ideal system should employ similarity and interaction to refine the search until satisfaction. We propose a multimedia retrieval paradigm built on three principles: learning of a lexicon of semantic concepts, multimedia data similarity, and user interaction. Within the proposed paradigm, we explore the combination of query-by-concept, query-by-similarity, and interactive filtering using advanced visualizations of the MediaMill semantic video search engine. To demonstrate the effectiveness of our multimedia retrieval paradigm, several components are evaluated within the 2005 NIST TRECVID video retrieval benchmark [16].

The organization of this paper is as follows. First, we discuss our general learning architecture and data preparation steps. Our system architecture for generic semantic indexing is presented in Section 3. We describe our approach for camera work indexing in Section 4. Our multimedia retrieval paradigm is presented in Section 5. Our explorative work on BBC rushes is addressed in Section 6.

2 Preliminaries

The MediaMill semantic video search engine exploits a common architecture with a standardized input-output model to allow for semantic integration. The conventions to describe the modular system architecture are indicated in Fig. 1.

2.1 General Learning Architecture

We perceive of video indexing as a pattern recognition problem. We first need to segment a video. We opt for camera shots [18], indicated by i, following the standard in TRECVID evaluations. Given pattern x, part of a shot,


Figure 1: Data flow conventions as used in this paper. Different arrows indicate differences in data flows.

the aim is to detect an index ω from shot i using the probability p_i(ω|x_i). We exploit supervised learning to learn the relation between ω and x_i. The training data of the multimedia archive, together with labeled samples, are used for learning classifiers. The other data, the test data, are set aside for testing. The general architecture for supervised learning in the MediaMill semantic video search engine is illustrated in Fig. 2.

We can choose from a large variety of supervised machine learning approaches to obtain p_i(ω|x_i). For our purpose, the method of choice should be capable of handling video documents. To that end, ideally it must learn from a limited number of examples, it must handle unbalanced data, and it should account for unknown or erroneously detected data. Given such heavy demands, the Support Vector Machine (SVM) framework [35, 4] has proven to be a solid choice [1, 29]. The usual SVM method provides a margin in the result. We prefer Platt's conversion method [19] to achieve a posterior probability of the result. SVM classifiers thus trained for ω result in an estimate p_i(ω|x_i, q), where q are parameters of the SVM yet to be optimized.

The influence of the SVM parameters on video indexing is significant [14]. We obtain good parameter settings for a classifier by using an iterative search over a large number of SVM parameter combinations. We measure average precision performance of all parameter combinations and select the combination that yields the best performance, q*. Here we use 3-fold cross validation [11] with 3 repetitions to prevent overfitting of parameters. The result of the parameter search over q is the improved model p_i(ω|x_i, q*). In the following we drop q* where obvious.
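As an illustration, the sketch below trains one concept detector with scikit-learn, whose SVC wraps the LIBSVM implementation cited above [4] and whose probability option applies Platt's conversion method. The feature matrix, labels, and the C/γ grid are hypothetical stand-ins, not the authors' settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

def train_concept_detector(x_train, y_train):
    """Fit an SVM for one concept, returning a model with
    probabilistic output p_i(omega | x_i, q*).

    x_train (shots x features) and y_train (binary labels) are
    hypothetical; the parameter grid is illustrative only."""
    # 3-fold cross validation with 3 repetitions, scored by average
    # precision, selects the best parameter combination q*.
    cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)
    grid = {"C": np.logspace(-1, 3, 5), "gamma": np.logspace(-4, 0, 5)}
    # probability=True converts the SVM margin into a posterior
    # probability via Platt's fitted sigmoid.
    search = GridSearchCV(
        SVC(kernel="rbf", probability=True, class_weight="balanced"),
        param_grid=grid,
        scoring="average_precision",
        cv=cv,
    )
    search.fit(x_train, y_train)
    return search.best_estimator_  # p_i(omega | x_i, q*)
```

The balanced class weight is one common way to meet the unbalanced-data requirement mentioned above; the paper does not state how its implementation handles it.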

Figure 2: General architecture for supervised learning in the MediaMill semantic video search engine, using the conventions of Fig. 1.

2.2 Data Preparation

Supervised learning requires labeled examples. In part, we rely on the provided ground truth of the TRECVID 2005 common annotation effort [36]. It is extended manually to arrive at an incomplete, but reliable ground truth for an unprecedented amount of 101 semantic concepts in lexicon Λ_S. In addition, we manually labeled a substantial part of the training set with respect to the dominant type of camera work, i.e. pan, tilt, and/or zoom, if present.

In order to recognize concepts based on low-level visual analysis, we annotated 15 different proto-concepts: building (321), car (192), charts (52), crowd (270), desert (82), fire (67), US-flag (98), maps (44), mountain (41), road (143), sky (291), smoke (64), snow (24), vegetation (242), water (108), where the number in brackets indicates the number of annotation samples of that concept. We again used the TRECVID 2005 common annotation effort as a basis for selecting relevant shots containing the proto-concepts. In those shots, we annotated rectangular regions where the proto-concept is visible for at least 20 frames.

We split the training data a priori into four non-overlapping training and validation sets to prevent overfitting of classifiers. Training sets A, B, and C each contain 30% of the 2005 training data; validation set D contains the remaining 10%. We assign all shots in the training set randomly to either set A, B, C, or D.

3 Semantic Pathfinder Indexing

The central assumption in our semantic indexing architecture is that any broadcast video is the result of an authoring process. When we want to extract semantics from a digital broadcast video, this authoring process needs to be reversed. For authoring-driven analysis we proposed the semantic pathfinder [30]. The semantic pathfinder is composed of three analysis steps. It follows the reverse authoring process. Each analysis step in the path detects semantic concepts. In addition, one can exploit the output of an analysis step in the path as the input for the next one. The semantic pathfinder starts in the content analysis step. In this analysis step, we follow a data-driven approach of indexing semantics. The style analysis step is the second analysis step. Here we tackle the indexing problem by viewing a video from the perspective of production. This analysis step aids especially in indexing of rich semantics. Finally, to enhance the indexes further, in the context analysis step, we view semantics in context. One would expect that some concepts, like vegetation, have their emphasis on content, where the style (of the camera work, that is) and context (of

concepts like graphics) do not add much. In contrast, more complex events, like people walking, profit from incremental adaptation of the analysis to the intention of the author. The virtue of the semantic pathfinder is its ability to find the best path of analysis steps on a per-concept basis. An overview of the semantic pathfinder is given in Fig. 3.

Figure 3: The semantic pathfinder for one concept, using the conventions of Fig. 1.

3.1 Content Analysis Step

We view video in the content analysis step from the data perspective. In general, three data streams or modalities exist in video: the auditory modality, the textual modality, and the visual one. As speech is often the most informative part of the auditory source, we focus on visual features, and on textual features obtained from transcribed speech. After modality-specific data processing, we combine features in a multimodal representation using early fusion and late fusion [32].

3.1.1 Visual Analysis

Modeling visual data heavily relies on qualitative features. Good features describe the relevant information in an image while reducing the amount of data representing the image. To achieve this goal, we use Wiccest features as introduced in [6]. Wiccest features combine color invariance with natural image statistics. Color invariance aims to remove accidental lighting conditions, while natural image statistics efficiently represent image data.

Color invariance aims at keeping the measurements constant under varying intensity, viewpoint, and shading. In [7] several color invariants are described. We use the W invariant, which normalizes the spectral information with the energy. This normalization makes the measurements independent of illumination changes under uniform lighting conditions.

When modeling scenes, edges are highly informative. Edges reveal where one region ends and another begins. Thus, an edge has at least twice the information content of a uniformly colored patch, since an edge contains information about all regions it divides. Besides serving as region boundaries, an ensemble of edges describes texture information. Texture characterizes the material an object is made of. Moreover, a compilation of cluttered objects can

Figure 4: An example of dividing an image up in overlapping regions. In this particular example, the region size is half of the image size for both the x-dimension and y-dimension. The regions are uniformly sampled across the image with a step size of half a region. Sampling in this manner identifies nine overlapping regions.

be described as texture information. Therefore, a scene can be modeled with textured regions.

Texture is described by the distribution of edges at a certain region in an image. Hence, a histogram of Gaussian derivative filter responses represents the edge statistics. Since there are more non-edge pixels than there are edge pixels, the distribution of edge responses for natural images always has a peak around zero, i.e. many pixels have no edge response. Additionally, the shape of the tails of the distribution is often in between a power-law and a Gaussian distribution. This specific distribution can be well modeled with an integrated Weibull distribution [8]. This distribution is given by

  f(r) = γ / (2 γ^{1/γ} β Γ(1/γ)) · exp( −(1/γ) |(r − µ)/β|^γ ),   (1)

where r is the edge response to the Gaussian derivative filter and Γ(·) is the complete Gamma function, Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt. The parameter β denotes the width of the distribution, the parameter γ represents the 'peakness' of the distribution, and the parameter µ denotes the origin of the distribution.

To assess the similarity between Wiccest features, a goodness-of-fit test is utilized. The measure is based on the integrated squared error between the two cumulative distributions, which is obtained by a Cramér-von Mises measure. For two Weibull distributions with parameters F_β, F_γ and G_β, G_γ, a first-order Taylor approximation of the Cramér-von Mises statistic yields the log difference between the parameters. Therefore, a measure of similarity between two Weibull distributions F and G is given by the ratio of the parameters,

  W2(F, G) = ( min(F_β, G_β) / max(F_β, G_β) ) · ( min(F_γ, G_γ) / max(F_γ, G_γ) ).   (2)

The µ parameter represents the mode of the distribution. The position of the mode is influenced by uneven illumination and colored illumination. Hence, to achieve color constancy, the values for µ may be ignored.

In summary, Wiccest features provide a color invariant texture descriptor. Moreover, the features rely heavily on natural image statistics to compactly represent the visual information.
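To make equations (1) and (2) concrete, the sketch below fits β and γ to a sample of edge-filter responses by maximum likelihood, with µ fixed at 0 since the µ values are ignored, and then compares two fits with W2. The optimizer, starting values, and synthetic test data are our own choices; the paper does not describe its fitting procedure.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def weibull_nll(params, r):
    """Negative log-likelihood of the integrated Weibull of Eq. (1),
    with mu = 0; parameters are optimized in log space so that
    beta, gamma > 0."""
    beta, gamma = np.exp(params)
    log_norm = (np.log(gamma) - np.log(2.0) - np.log(gamma) / gamma
                - np.log(beta) - gammaln(1.0 / gamma))
    return -np.sum(log_norm - (1.0 / gamma) * np.abs(r / beta) ** gamma)

def fit_weibull(r):
    """Fit (beta, gamma) to the edge responses r of one region."""
    res = minimize(weibull_nll, x0=[0.0, 0.0], args=(r,), method="Nelder-Mead")
    return np.exp(res.x)

def w2(f, g):
    """Similarity between two fitted distributions, Eq. (2)."""
    (fb, fg), (gb, gg) = f, g
    return (min(fb, gb) / max(fb, gb)) * (min(fg, gg) / max(fg, gg))

# Two synthetic regions; gamma = 1 makes Eq. (1) a Laplace density,
# so Laplacian samples are a convenient test case.
rng = np.random.default_rng(0)
r1, r2 = rng.laplace(0, 1.0, 2000), rng.laplace(0, 2.0, 2000)
print(w2(fit_weibull(r1), fit_weibull(r2)))  # ~0.5: widths differ by 2x
```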


3.1.2 Contextures: Regional Texture Descriptors and their Context

The visual detectors aim to decompose an image into proto-concepts like vegetation, water, fire, sky, etc. To achieve this goal, an image is divided up into several overlapping rectangular regions. The regions are uniformly sampled across the image, with a step size of half a region. The region size has to be large enough to assess statistical relevance, and small enough to capture local textures in an image. We utilize a multi-scale approach, using small and large regions. An example of region sampling is displayed in Fig. 4.

A visual scene is characterized by both global and local texture information. For example, a picture with an aircraft in midair might be described as "sky, with a hole in it". To model this type of information, we use a proto-concept occurrence histogram where each bin is a proto-concept. The values in the histogram are the similarity responses of each proto-concept annotation to the regions in the image.

We use the proto-concept occurrence histogram to characterize both global and local texture information. Global information is described by computing an occurrence histogram accumulated over all regions in the image. Local information is taken into account by constructing another occurrence histogram for only the response of the best region. For each proto-concept, or bin, b, the accumulated occurrence histogram and the best occurrence histogram are constructed by

  H_accumulated(b) = Σ_{r ∈ R(im)} Σ_{a ∈ A(b)} W2(a, r),

  H_best(b) = max_{r ∈ R(im)} Σ_{a ∈ A(b)} W2(a, r),

where R(im) denotes the set of regions in image im, A(b) represents the set of stored annotations for proto-concept b, and W2 is the Cramér-von Mises statistic as introduced in equation 2.

We denote a proto-concept occurrence histogram as a contexture for that image. We have chosen this name, as our method incorporates texture features in a context. The texture features are given by the use of Wiccest features, using color invariance and natural image statistics. Furthermore, context is taken into account by the combination of both local and global region combinations.

Contextures can be computed for different parameter settings. Specifically, we calculate the contextures at scales σ = 1 and σ = 3 of the Gaussian filter. Furthermore, we use two different region sizes, expressed as ratios of the x-dimension and y-dimension of the image. Moreover, contextures are based on one image, and not on a shot. To generalize our approach to shot level, we extract 1 frame per second from the video, and then aggregate the frames that belong to the same shot. We use two ways to aggregate frames: 1) average the contexture responses for all extracted frames in a shot, and 2) keep the maximum response of all frames in a shot. This aggregation strategy accounts for information about the whole shot i, and for information about accidental frames, which might occur with high camera motion. The combination of all these parameters yields a vector of contextures v_i, containing the final result of the visual analysis.
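A minimal sketch of the two occurrence histograms reconstructed above. Here `sim[b]` stands in for precomputed W2 similarities between each region r in R(im) and each stored annotation a in A(b); the shapes and random test data are illustrative only.

```python
import numpy as np

def contexture_histograms(sim):
    """Build the accumulated and best proto-concept occurrence
    histograms from per-region, per-annotation W2 similarities."""
    h_acc = np.zeros(len(sim))
    h_best = np.zeros(len(sim))
    for b, s in enumerate(sim):           # s: regions x annotations of bin b
        per_region = s.sum(axis=1)        # sum over annotations a in A(b)
        h_acc[b] = per_region.sum()       # accumulated over all regions
        h_best[b] = per_region.max()      # response of the best region only
    return h_acc, h_best

# 15 proto-concepts, 9 overlapping regions, varying annotation counts.
rng = np.random.default_rng(1)
sim = [rng.random((9, int(rng.integers(5, 50)))) for _ in range(15)]
h_acc, h_best = contexture_histograms(sim)
```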

3.1.3 Textual Analysis

In the textual modality, we aim to learn the association between uttered speech and semantic concepts. A detection system transcribes the speech into text. For the Chinese and Arabic sources we exploit the provided machine translations. The resulting translation is mapped from story level to shot level. From the text we remove the frequently occurring stopwords. After stopword removal, we are ready to learn semantics.

To learn the relation between uttered speech and concepts, we connect words to shots. We make this connection within the temporal boundaries of a shot. We derive a lexicon of uttered words that co-occur with ω using the shot-based annotations of the training data. For each concept ω, we learn a separate lexicon, Λ_T^ω, as this uttered word lexicon is specific for that concept. For feature extraction we compare the text associated with each shot with Λ_T^ω. This comparison yields a text vector t_i for shot i, which contains the histogram of the words in association with ω.

3.1.4 Early Fusion

Indexing approaches that rely on early fusion first extract unimodal features of each stream. The extracted features of all streams are combined into a single representation. After combination of unimodal features in a multimodal representation, early fusion methods rely on supervised learning to classify semantic concepts. Early fusion yields a truly multimedia feature representation, since the features are integrated from the start. An added advantage is the requirement of only one learning phase. A disadvantage of the approach is the difficulty of combining features into a common representation. The general scheme for early fusion is illustrated in Fig. 5a.

We rely on vector concatenation in the early fusion scheme to obtain a multimodal representation. We concatenate the visual vector v_i with the text vector t_i. After feature normalization, we obtain early fusion vector e_i.

3.1.5 Late Fusion

Indexing approaches that rely on late fusion also start with extraction of unimodal features. In contrast to early fusion, where features are combined into a multimodal representation first, approaches for late fusion learn semantic concepts directly from unimodal features. In general, late fusion focuses on the individual strength of modalities. Unimodal concept detection scores are fused into a multimodal semantic representation rather than a feature representation.


Figure 5: (a) General scheme for early fusion. Output of unimodal analysis is fused before a concept is learned. (b) General scheme for late fusion. Output of unimodal analysis is used to learn separate scores for a concept. After fusion a final score is learned for the concept. We use the conventions of Fig. 1.

A big disadvantage of late fusion schemes is their expense in terms of learning effort, as every modality requires a separate supervised learning stage. Moreover, the combined representation requires an additional learning stage. Another disadvantage of the late fusion approach is the potential loss of correlation in mixed feature space. A general scheme for late fusion is illustrated in Fig. 5b.

For the late fusion scheme, we concatenate the probabilistic output score after visual analysis, i.e. p_i(ω|v_i, q*), with the probabilistic score resulting from textual analysis, i.e. p_i(ω|t_i, q*), into late fusion vector l_i.
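The two fusion vectors can be sketched as follows; the L2 normalization in the early fusion scheme is an assumption, as the paper does not specify which feature normalization it applies.

```python
import numpy as np

def early_fusion(v_i, t_i):
    """Concatenate visual and text features into e_i after
    per-modality normalization (L2 assumed)."""
    v = v_i / (np.linalg.norm(v_i) + 1e-12)
    t = t_i / (np.linalg.norm(t_i) + 1e-12)
    return np.concatenate([v, t])          # one classifier learns on e_i

def late_fusion(p_visual, p_text):
    """Concatenate the per-modality posteriors p_i(w|v_i) and
    p_i(w|t_i) into l_i; a second classifier is trained on l_i."""
    return np.array([p_visual, p_text])
```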

3.1.6 Content Pathfinder

We learn 101 semantic concepts based on the four vectors resulting from analysis in the content analysis step. Thus v_i, t_i, e_i, and l_i serve as the input for our supervised learning module, which learns an optimized SVM model for each semantic concept ω using 3-fold cross validation with 3 repetitions on training set A. These models are then validated on set D, yielding a best performing model p_i(ω|m_i) for all ω in Λ_S, where m_i ∈ {v_i, t_i, e_i, l_i}.

3.2 Style Analysis Step

In the style analysis step we conceive of a video from the production perspective. Based on the four roles involved in the video production process [31], layout detectors analyze the role of the editor. Content detectors analyze the role of production design. Capture detectors analyze the role of the production recording unit. Finally, context detectors analyze the role of the preproduction team, see Fig. 6.

Figure 6: Feature extraction and classification in the style analysis step, a special case of Fig. 2.

3.2.1 Style Analysis

We develop detectors for all four production roles as feature extraction in the style analysis step. We refer to our previous work for specific implementation details of the detectors [31, Electronic Appendix]. We have chosen to convert the output of all style detectors to an ordinal scale, as this allows for elegant fusion.

For the layout L, the length of a camera shot is used as a feature, as this is known to be an informative descriptor for genre [31]. Overlayed text is another informative descriptor. Its presence is detected by a text localization


algorithm [25]. To segment the auditory layout, periods of speech and silence are detected based on the provided automatic speech recognition results. We obtain a voice-over detector by combining the speech segmentation with the camera shot segmentation [31]. The set of layout features is thus given by: L = {shot length, overlayed text, silence, voice-over}.

As concerns the content C, a frontal face detector [27] is applied to detect people. We count the number of faces, and for each face its location is derived [31]. In addition, we measure the average amount of object motion in a camera shot [29]. Based on provided speaker identification we identify each of the three most frequent speakers. Each camera shot is checked for presence of speech from one of the three [31]. We also exploit the provided named entity recognition. The set of content features is thus given by: C = {faces, face location, object motion, frequent speaker, voice named entity}.

For capture T, we compute the camera distance from the size of detected faces [27, 31]. It is undefined when no face is detected. In addition to camera distance, several types of camera work are detected [2], e.g. pan, tilt, zoom, and so on. Finally, for capture we also estimate the amount of camera motion [2]. The set of capture features is thus given by: T = {camera distance, camera work, camera motion}.

The context S serves to enhance or reduce the correlation between semantic concepts. Detection of vegetation can aid in the detection of a forest, for example. Likewise, the co-occurrence of a space shuttle and a bicycle in one shot is improbable. As the performance of semantic concept detectors is unknown and likely to vary between concepts, we exploit iteration to add them to the context. The rationale here is to add concepts that are relatively easy to detect first. They aid in detection performance by increasing the number of true positives or reducing the number of false positives. To prevent bias from domain knowledge, we use the performance on validation set D of all concepts from Λ_S in the content analysis step as the ordering for the context. To assign detection results for the first and least difficult concept, we rank all shot results on p_i(ω1|m_i). This ranking is then exploited to categorize results for ω1 into one of five levels, as sketched below. The basic set of context features is thus given by: S = {content analysis step ω1}.
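The five-level categorization might look like the following; the equal-frequency cut of the ranking is an assumption, since the paper only states that results are categorized into five levels.

```python
import numpy as np

def rank_to_levels(scores, n_levels=5):
    """Turn shot scores p_i(w1|m_i) into an ordinal context feature:
    rank all shots, then split the ranking into n_levels bins."""
    order = np.argsort(-np.asarray(scores))            # best shot first
    levels = np.empty(len(scores), dtype=int)
    for level, chunk in enumerate(np.array_split(order, n_levels)):
        levels[chunk] = level                          # 0 = most confident
    return levels
```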

The concatenation of {L, C, T, S} for shot i yields style vector s_i. This vector forms the input for an iterative classifier [31] that trains a style model for each concept in lexicon Λ_S. We classify all ω in Λ_S again in the style analysis step. We use 3-fold cross validation with 3 repetitions on training set B to optimize parameter settings in this analysis step. We use the resulting probability as output for concept detection in the style analysis step.

3.3 Context Analysis Step

The context analysis step adds context to our interpretation of the video. Our ultimate aim is the reconstruction of the author's intent by considering detected concepts in context.

Figure 7: Feature extraction and classification in the context analysis step, a special case of Fig. 2.

Both the content analysis step and the style analysis step yield a probability for each shot i and all concepts ω in Λ_S. The probability indicates whether a concept is present. We fuse these semantic features of an analysis step for a shot i into a context vector, see Fig. 7.

We consider three paths in the context analysis step. The first path stems directly from the content analysis step. We fuse the 101 p_i(ω|m_i) concept scores into context vector d_i. The second path stems from the style analysis step, where we fuse the 101 p_i(ω|s_i) scores into context vector p_i. The third path selects the best performer on validation set D from either the content analysis step or the style analysis step. These best performers are fused in context vector b_i.

From these three vectors we learn relations between concepts automatically. To that end the vectors serve as the input for a supervised learning module, which associates a contextual probability p_i(ω|c_i) to a shot i for all ω in Λ_S, where c_i ∈ {d_i, p_i, b_i}. To optimize parameter settings, we use 3-fold cross validation with 3 repetitions on the previously unused data from training set C.

The output of the context analysis step is also the output of the entire semantic pathfinder on video documents. On the way we have included in the semantic pathfinder the results of the analysis on raw data, facts derived from production by the use of style features, and a context perspective on the author's intent by using semantic features. For each concept we obtain several probabilities based on (partial) content, style, and context. We select from all possibilities the one that maximizes average precision based on performance on validation set D. The semantic pathfinder provides us with the opportunity to decide whether one analysis step, concentrating only on (visual) content, is best for the concept, or a two-analysis-step classifier, increasing discriminatory power by adding production style to content, or whether a concept profits most from a consecutive analysis on content, style, and context level.

3.4 Experiments

We traversed the entire semantic pathfinder for all 101 concepts. The average precision performance of the semantic pathfinder and its sub-systems on validation set D is shown in Fig. 8.

We evaluated four analysis strategies for each concept in the content analysis step: text-only, visual-only, early fusion, and late fusion.


Table 1: UvA-MediaMill TRECVID 2005 run comparison for all 10 benchmark concepts. The best path of the semantic pathfinder is submitted as SP-1; the last column indicates results of our visual-only run.

Concept          SP-1    SP-2    SP-3    SP-4    SP-5    SP-6    Visual-only
People walking   0.199   0.172   0.154   0.179   0.101   0.103   0.031
Explosion        0.041   0.027   0.032   0.035   0.036   0.034   0.073
Map              0.142   0.160   0.135   0.123   0.099   0.127   0.138
US flag          0.100   0.063   0.110   0.095   0.072   0.114   0.129
Building         0.235   0.229   0.226   0.225   0.210   0.157   0.269
Waterscape       0.201   0.198   0.137   0.164   0.124   0.136   0.166
Mountain         0.220   0.193   0.182   0.195   0.170   0.128   0.207
Prisoner         0.005   0.001   0.000   0.001   0.001   0.001   0.003
Sports           0.342   0.225   0.289   0.202   0.137   0.153   0.272
Car              0.213   0.192   0.182   0.201   0.196   0.199   0.233
MAP              0.1698  0.146   0.1447  0.142   0.1146  0.1152  0.1521

Results confirm the importance of visual analysis for generic concept detection. Text analysis yields the best approach for only 8 concepts, whereas visual analysis yields the best performance for as many as 45 concepts. Fusion is optimal for the remaining 48 concepts, with a clear advantage for early fusion (33 concepts) over late fusion (15 concepts).

The style analysis step again confirms the importance of including professional television production facets for semantic video indexing, especially for concepts which share many similarities in their production process, like anchors, monologues, and entertainment. For other concepts, content is more decisive, like tennis and baseball for example. Thus some concepts are just content, whereas others are pure production style.

We boost concept detection performance further by the usage of context. The pathfinder again exploits variation in performance of the various paths to select an optimal pathway. The results demonstrate the virtue of the semantic pathfinder. Concepts are divided by the analysis step after which they achieve best performance. Based on these results we conclude that an optimal strategy for generic multimedia analysis is one that learns from the training set, on a per-concept basis, which tactic to follow.

3.4.1 Pathfinder Runs

We submitted six paths for each benchmark concept, prioritized according to validation set performance. For concept explosion, for example, the optimal path (SP-1) indicates that visual-only analysis is the best performer. However, in most cases the best path is a consecutive path of content, style, and context. We report the official TRECVID benchmark results in Table 1.

The results show that the pathfinder mechanism is a good way to estimate the best performing analysis path. The SP-1 run containing the optimal path is indeed the best performer in 8 out of 10 cases. Overall, this is also our best performing run. However, what strikes us most is that average precision results are much lower than can be expected based on the validation set performance reported in Fig. 8. This may indicate that, despite the use of separate training and validation sets, we are still overfitting the data. A point of concern here is the random assignment of shots to the separate training and validation sets. This may bias the classifiers, as it is possible that similar news items from several channels are distributed to separate sets. For two concepts (map and explosion) performance suffered from misinterpretation of correct concepts. Had we included examples of news anchors with maps in the background of the studio setting (for the map concept) and smoke (for explosion) in our training sets, results would be higher. When looking at the judged results, we also found that three concepts (waterscape, mountain, and car) are dominated by commercials. We do not perform well on commercial detection. This can be explained because we take 1 frame per second out of the video in the visual analysis. Sampling in this manner will select different frames for the same commercials that reappear at different timestamps in a video. We anticipate that improvement in frame sampling yields increased robustness for the entire pathfinder.

3.4.2 Visual-only Run

Validation set performance in Fig. 8 indicates that our visual analysis step performs quite well. To determine the contribution of the visual analysis step, we therefore submitted a visual-only run. This involved training a Support Vector Machine on the vector of contextures as introduced in Section 3.1.1. We trained an SVM for each of the 10 concepts of the concept detection task. An experiment for recognizing proto-concepts was submitted by another group [37]. The visual features in the submitted visual-only run are slightly different from the visual features in the semantic pathfinder system. This difference is caused by ongoing development of the visual analysis. Specifically, we improved the Weibull fit to be more robust and we added the proto-concept car. The newer version of the visual analysis was not incorporated in the semantic pathfinder. It was not integrated because visual analysis is the first step in the semantic path; thus, a change in the visual analysis means that all further paths would have to be recomputed. However, for a visual-only run, the improvements were feasible to compute.


[Figure 8 data: for each of the 101 semantic concepts, the validation set average precision of the text analysis, visual analysis, early fusion, late fusion, style, content-context, style-context, and best-context sub-systems, the resulting optimal path, and the TRECVID MAP.]

Figure 8: Validation set average precision performance for 101 semantic concepts using sub-systems of the semantic pathfinder. The best path for each concept is marked with gray cells. Empty cells indicate the impossibility to learn models, due to a lack of annotated examples in the training sub-set used.


Table 2: Validation set average precision performance for 3 types of camera work using several versions of our camera work detector.

Detector version                 Pan     Tilt    Zoom    MAP
Late Fusion                      0.862   0.786   0.862   0.837
Late Fusion + Selected Context   0.859   0.752   0.866   0.826
Late Fusion + Context            0.856   0.656   0.856   0.789
Early Fusion                     0.703   0.558   0.783   0.681
Global                           0.569   0.613   0.813   0.665
Global + Context                 0.591   0.562   0.792   0.648
Early Fusion + Context           0.616   0.461   0.765   0.614

The results of our visual-only run reflect the importance of visual analysis. For four concepts (explosion, US flag, building, car) we outperform the pathfinder system. This improvement might be attributed to the use of improved visual features and to the fact that we use the entire training set in SVM training. However, since the visual analysis step is embedded in the pathfinder system, the visual analysis should never perform better. Therefore we believe that results of the pathfinder system will improve when the new features are included.

4 Camera Work

For the detection of camera work we start with an existing implementation based on spatiotemporal image analysis [34, 12]. Given a set of global intensity images from shot i, the algorithm first extracts spatiotemporal images. On these images a direction analysis is applied to estimate direction parameters. These parameters form the input for a supervised learning module to learn three types of camera work. We modified the algorithm in various ways. We superimposed a tessellation of 8 regions on each input frame to decrease the effect of local disturbances. Parameters thus obtained are exploited using an early fusion and a late fusion approach. In addition we explored whether the 101 concept scores obtained from the semantic pathfinder aid in the detection of camera work.
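The modified detector's data flow can be sketched as follows, assuming a 2×4 arrangement for the 8-region tessellation (the paper does not state the grid shape) and an external per-region direction estimator.

```python
import numpy as np

def tessellate(frame, rows=2, cols=4):
    """Split a frame into 8 regions; the 2x4 grid is assumed."""
    h, w = frame.shape[:2]
    return [frame[r * h // rows:(r + 1) * h // rows,
                  c * w // cols:(c + 1) * w // cols]
            for r in range(rows) for c in range(cols)]

def early_fusion_features(direction_params):
    """Early fusion: concatenate the per-region direction parameters
    into one vector for a single supervised classifier."""
    return np.concatenate(direction_params)

def late_fusion_score(region_probabilities):
    """Late fusion: combine per-region posteriors for one camera-work
    class, e.g. pan; averaging is assumed, the paper does not name
    the combination rule."""
    return float(np.mean(region_probabilities))
```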

4.1 Experiments

Experiments on validation set D indicate that average precision results increase drastically, especially for pan (+51%) and tilt (+28%); see Table 2. The best approach is a late fusion scheme without the usage of context. Relative to other participants we performed quite well in precision, but quite badly in terms of recall. These results indicate that the base detector is too conservative. However, they also show that any global image based camera work detector has the potential to profit from a tessellation of region-based detectors.

5 Lexicon-driven Retrieval

We propose a lexicon-driven retrieval paradigm to equip users with semantic access to multimedia archives. The aim is to retrieve from a multimedia archive S, which is composed of n unique shots {s1, s2, ..., sn}, the best possible answer set in response to a user information need. To that end, we use the 101 concepts in the lexicon as well as the 3 types of camera work for our automatic, manual, and interactive search systems.

5.1 Automatic Search

Our automatic search engine uses only topic text as input [10], as we postulate that it is unreasonable to expect a user to provide a video search system with example videos in a real-world scenario. We rely purely on text and the lexicon of 101 semantic concept detectors that we have developed using the semantic pathfinder (see Section 3) to search through the video collection. We developed our search system using the video data, topics, and ground truths from the 2003 and 2004 TRECVID evaluations as a training set.

5.1.1 Indexing Components

Our automatic search system incorporates regular TFIDF-based indices for standard retrieval using the bfx-bfx [24] formula, Latent Semantic Indexing [5] for text retrieval with implicit query expansion, and the 101 different semantic concept indices for query-by-concept. Each index was matched to one or more concepts, or synsets, in the WordNet [13] lexical database on an individual basis, according to whether the concept directly matches the content of the detectors. For example, the detector for the concept baseball finds shots of baseball games, and these shots invariably include baseball players, baseball equipment, and a baseball diamond, so these concepts are also matched. Additional synsets are added to WordNet for semantic concepts that do not have a direct WordNet equivalent.

5.1.2 Automatic Query Interface Selection

We perform the standard stopping and stemming procedures on the topic text (using the SMART stop list [23] with the addition of the words find and shots, and the Porter stemming algorithm [20], respectively). In addition, we perform part-of-speech tagging and chunking using the TreeTagger [26]. This grammatical information is used to identify two different query categorizations: complex vs. simple queries and general vs. specific queries. Any topic containing more than one noun chunk is classified as complex, as it refers to more than one object, while requests containing only a single noun chunk are classified as simple. If a request contains a name (a proper noun) it refers to a specific object, rather than a general category, so we categorize all requests containing proper nouns as specific requests, and all others as general requests.
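A sketch of this categorization, substituting NLTK's tagger and a textbook noun-chunk grammar for the TreeTagger used by the authors:

```python
import nltk  # requires the punkt and averaged_perceptron_tagger data

CHUNKER = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def categorize_topic(text):
    """Classify a topic as complex/simple and specific/general."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    noun_chunks = [st for st in CHUNKER.parse(tagged).subtrees()
                   if st.label() == "NP"]
    # More than one noun chunk means the topic refers to several objects.
    complexity = "complex" if len(noun_chunks) > 1 else "simple"
    # A proper noun (NNP/NNPS) signals a specific request.
    has_proper = any(tag.startswith("NNP") for _, tag in tagged)
    return complexity, "specific" if has_proper else "general"

print(categorize_topic("tennis players on a court"))  # ('complex', 'general')
```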

Subsequently, we extract the WordNet words in the topic text through dictionary lookup of noun chunks and nouns. We identify the correct synset for WordNet words with multiple meanings through disambiguation. We evaluated


a number of disambiguation strategies using the WordNet::Similarity [17] resource, and found that for the purposes of our system, the best approach was to choose the most commonly occurring meaning of a word. Then we look for related semantic concept index synsets in the hypernym and hyponym trees of each of the topic synsets. If an index synset is found, we calculate the similarity between the two synsets using the Resnik similarity measure [21].
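The disambiguation and matching step can be sketched with NLTK's WordNet interface, which provides Resnik similarity against the Brown information-content file. The example index synsets are hypothetical, and for brevity the sketch scores all indices directly instead of walking the hypernym and hyponym trees.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic  # requires the wordnet and wordnet_ic data

BROWN_IC = wordnet_ic.ic("ic-brown.dat")

def match_concept_index(topic_word, index_synsets):
    """Pick the concept index most related to a topic word.

    Disambiguation takes the most common sense, i.e. synsets[0] in
    WordNet's frequency ordering, the strategy the paper found best."""
    senses = wn.synsets(topic_word, pos=wn.NOUN)
    if not senses:
        return None
    topic = senses[0]
    # Resnik similarity: information content of the deepest common
    # ancestor of the two synsets.
    return max((topic.res_similarity(syn, BROWN_IC), name)
               for name, syn in index_synsets.items())

indices = {"baseball": wn.synset("baseball.n.01"),
           "car": wn.synset("car.n.01")}
print(match_concept_index("pitcher", indices))
```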

Finally, queries are formed. We create both a stemmed and an unstemmed TFIDF query using all of the topic terms. We create an extra TFIDF query on proper nouns only for specific topics, and a query on all nouns only for general topics. For the LSI index we also create a query using all of the topic terms, and in addition we create an additional query using proper nouns only for specific topics, and all nouns for general topics. Finally, we select the concept index with the highest Resnik similarity to a topic synset as the best match, and query on this concept.

5.1.3 Combining Query Results

We use a tiered approach for result fusion, first fusing the text results from the TFIDF and LSI searches individually, then fusing the resultant two sets, and finally combining them with the results from the semantic concept search. We use weighted Borda fusion to combine results, and developed the weights through optimization experiments on the training set. We use results from unstemmed searches to boost stemmed results for simple topics, as these benefit from using the exact spelling to search on text. We also boost text searches with a search on proper nouns for specific topics, as proper nouns are a good indicator of result relevance.
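A minimal sketch of the tiered weighted Borda fusion; the weights and toy shot lists are invented for illustration, since the optimized weights are not published.

```python
def weighted_borda(lists, weights, n):
    """Fuse ranked lists: each list awards Borda points (n for the
    top shot, n-1 for the next, ...) scaled by its weight."""
    scores = {}
    for ranked, w in zip(lists, weights):
        for rank, shot in enumerate(ranked):
            scores[shot] = scores.get(shot, 0.0) + w * (n - rank)
    return sorted(scores, key=scores.get, reverse=True)

tfidf_ranked = ["s3", "s1", "s7"]
lsi_ranked = ["s1", "s3", "s9"]
concept_ranked = ["s7", "s9", "s1"]
# Tier 1: fuse the two text runs; tier 2: add the concept run.
text = weighted_borda([tfidf_ranked, lsi_ranked], [0.6, 0.4], n=3)
final = weighted_borda([text, concept_ranked], [0.7, 0.3], n=3)
print(final)
```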

When combining text results with concept results, we use two measures developed specifically for WordNet by Resnik [21]: concept information content and similarity (previously mentioned). The information content measure captures the specificity of a concept: as a concept becomes more abstract, the information content decreases. When the matching index concept has high information content, and the words in the concept do not, we give priority to the concept results. Likewise, when the matched concept index is very similar to the topic, we give the concept search a very high weighting.

5.2 Manual Search

Our manual search approach investigates the power of lexicon-driven retrieval used in a visual-only setting. We put the principle of lexicon-driven retrieval to the test by using only the 101 concepts in answering the queries. Furthermore, we test the hypothesis that visual information, this year, is significantly more important than textual information. To test the impact of visual information, we use no other modality whatsoever, and rely only on visual features. This entails training a Support Vector Machine on the vector of contextures as introduced in Section 3.1.1. This SVM is trained for every one of the 101 concepts with the whole development set as a training set. This lexicon of 101 visual concepts is subsequently used in answering the queries. For each query, we manually select one or two concepts that fit the question, and use the outcome of these detectors as our final answer to the question.

5.3 Interactive Search

Our interactive search system stores the probabilities of all detected concepts and types of camera work for each shot in a database. In addition to learning, the paradigm also facilitates multimedia analysis at a similarity level. In the similarity component, 2 similarity functions are applied to index the data in the visual and textual modality. This results in 2 similarity distances for all shots, which are stored in a database. The MediaMill search engine offers users access to the stored indexes and the video data in the form of 106 query interfaces: 2 query-by-similarity interfaces, 101 query-by-concept interfaces, and 3 query-by-camera-work interfaces. The query interfaces emphasize the lexicon-driven nature of the paradigm. Each query interface acts as a ranking operator Φ_i on the multimedia archive S, where i ∈ {1, 2, ..., 106}. The search engine stores results of each ranking operator in a ranked list ρ_i, which we denote by:

  ρ_i = Φ_i(S).   (3)

The search engine handles the query requests, combines the results, and displays them to an interacting user. Within the paradigm, we perceive of interaction as a combination of querying the search engine and selecting relevant results using one of many display visualizations. A schematic overview of the retrieval paradigm is given in Fig. 9.

To support browsing with advanced visualizations, the data is further processed. The high-dimensional feature space is projected to the 2D visualization space to allow for visual browsing. Clusters, and representatives for each cluster, are identified to support hierarchical browsing. Finally, semantic threads are identified to allow for fast semantic browsing. For interactive search, users map topics to query-by-multimodal-concept or query-by-keyword to create a set of candidate results to explore. When there is a one-to-one relation between the query and the concept, a rank-time browsing method is employed. In other cases, the set forms the starting point for visual, hierarchical, or semantic browsing. The browsing methods are supported by advanced visualization and active learning tools.

5.3.1 Multimedia Similarity Indexing

After all the concepts are detected, the low-level features are usually ignored. We believe, however, that these features are still valuable in adding information to the results of query-by-concept search. Except for specific concepts, such as person X (Allawi, Bush, Blair) and the USA flag, most of the provided concepts have a general meaning, like sport, animal, maps, or drawing. These concepts can be classified further into sub-concepts. For instance, the maps concept may contain maps in a weather forecast, or a map of a country in a news report. Hence, we allow users to distinguish query-by-concept results further based on low-level features.

Figure 9: The lexicon-driven paradigm for interactive multimedia retrieval combines learning, similarity, and interaction. It learns to detect a lexicon of 101 semantic concepts together with 3 types of camera work. In addition, it computes 2 similarity distances. A search engine then presents 2 interfaces for query-by-similarity, 3 interfaces for query-by-camera-work, and 101 interfaces for query-by-concept. Based on interaction a user may refine search results until an acceptable standard is reached.

There are different options for selecting low-level features: using colors, textures, shapes, or combinations of those. We use the visual concept features from the visual analysis step of the semantic pathfinder, see Section 3.1.1. We exploit the same 15 proto-concepts, but now with 6 different parameter sets for each shot. Those values are represented as a feature vector per shot. All the shots with their corresponding feature vectors build up a 90-dimensional feature space.

Obtaining the best performance on retrieving images depends not only on the features, but also on the selection of an appropriate similarity function. The aim is to choose the distance function that returns the maximum number of relevant images in its nearest neighbors. Based on experimental results we choose the L2 measure as a distance function.

5.3.2 Combining Query Results

Combination by Linear Weighting. To reorder ranked lists of results, we first determine the rank r_ij of shot s_j over the various ρ_i, denoted by:

  r_ij = ρ_i(s_j).   (4)

We define a weight function w(·) that computes the weight of s_j in ρ_i based on r_ij. This linear weight function gives a higher weight to shots that are retrieved in the top of ρ_i and gradually reduces to 0. It is defined as:

  w(r_ij) = (n − r_ij + 1) / n.   (5)

We aggregate the results for each shot s_j by adding the contribution from each ranked list ρ_i. We then use the final ranking operator Φ* to rank all shots from S in descending order based on this new weight. This combination method yields a final ranked list of results ρ*, defined as:

  ρ* = Φ*( Σ_{i=1}^{m} w(r_ij) ),  j = 1, 2, ..., n,   (6)

where m indicates the number of selected query interfaces.
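Equations (4) to (6) translate directly into code. The sketch below assumes shots are identified by integer ids 0..n−1 and gives unranked shots zero weight, a detail the paper leaves unspecified.

```python
import numpy as np

def fuse_ranked_lists(ranked_lists, n):
    """Implement Eqs. (4)-(6): rank r_ij of shot s_j in list rho_i,
    linear weight w(r) = (n - r + 1) / n summed over the m selected
    query interfaces, and the final descending sort Phi*."""
    weight = np.zeros(n)
    for rho in ranked_lists:                  # rho: shot ids, best first
        r = np.full(n, n + 1)                 # unranked shots get weight 0
        r[rho] = np.arange(1, len(rho) + 1)   # r_ij, Eq. (4)
        weight += np.maximum(n - r + 1, 0) / n  # w(r_ij), Eq. (5)
    return np.argsort(-weight)                # rho*, Eq. (6)

# Three query interfaces ranking 6 shots (ids 0..5).
print(fuse_ranked_lists([[2, 0, 4], [0, 2], [5, 2]], n=6))
```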


Combination by Semantic Threads. The generated concept probabilities more or less describe the content of each shot. However, since there is only a limited number of categories for detection, a problem arises when a shot doesn't fit into any category, i.e. each individual concept detector returns a near-zero value. All shots with all concept values below a threshold could simply be removed. However, some detectors produce low-value results while their top-ranked shots are still correct. This needs to be taken into account when combining shots. We use a round-robin pruning procedure to ensure that at least the top-N shots from each concept detector are included, even when that detector has very low values compared to other detectors.

Each remaining shot now contains at least one detected concept. With this information a distance measurement between shots can be created. But how do we measure distance between concept vectors? If we assume equal distances between concepts, we can construct a distance matrix made up from the similarity S_pq between shots p and q using well-known distance metrics such as Euclidean distance or histogram intersection. Given the computed distance between shots, it is possible to find groups of related shots using clustering techniques. Currently we use K-means clustering.

Now that clusters of related shots exist, the task of forming a single coherent line of shots from each cluster must be examined. We apply a shortest path algorithm, so that shots next to each other usually have a very low distance to each other, which means that shots with similar semantic content are near each other.
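A sketch of the thread construction: K-means on the concept probability vectors, followed by a greedy nearest-neighbor walk through each cluster as a simple stand-in for the shortest-path ordering, which the paper does not detail.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_threads(concept_vectors, n_threads=10):
    """Group shots by concept probabilities and unroll each cluster
    into a line of semantically similar shots."""
    labels = KMeans(n_clusters=n_threads, n_init=10).fit_predict(concept_vectors)
    threads = []
    for k in range(n_threads):
        members = list(np.flatnonzero(labels == k))
        if not members:
            continue
        thread = [members.pop(0)]
        while members:  # append the remaining shot closest to the last one
            last = concept_vectors[thread[-1]]
            nxt = min(members,
                      key=lambda j: np.linalg.norm(concept_vectors[j] - last))
            thread.append(nxt)
            members.remove(nxt)
        threads.append(thread)
    return threads

# 200 shots, 101 concept probabilities each.
rng = np.random.default_rng(2)
threads = semantic_threads(rng.random((200, 101)))
```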

5.4 Display of Results


For effective interaction, an interface for communicating between the user and the system is needed. We consider two issues that are required for an effective interface:

(1) For query specification, support should be given to explore the collection in search of good examples, as the user seldom has a good example at his/her disposal.


Figure 10: Interfaces of the MediaMill semantic video search engine. On the left the CrossBrowser showing results for tennis. On top the SphereBrowser, displaying several semantic threads. Bottom right: active learning using a semantic cluster-based visualization in the GalaxyBrowser.

Most existing systems browse keyframes in sequence (left-right, top-down) [28]. Hence, relations between frames are not taken into account. For effective interaction this may be inappropriate, as the user cannot benefit from the inherent structure found in video collections. Therefore, (2) in the visualization, relations between keyframes should be taken into account to allow selection of several frames by one user action.

For these reasons, visualization of keyframes, including support for browsing and exploring, is essential in an interactive search system. We explored three advanced visualizations.

5.4.1 Cross Browser

To visualize query-by-concept results we propose a CrossBrowser. The browser displays two orthogonal dimensions. The horizontal one is the time thread, using the original TRECVID shot sequence. The vertical dimension contains the ranked list of query results. The GUI gives the user a cross layout of nearby shots on the screen. It exploits the observation that semantically similar shots tend to cluster in the time dimension. The resulting browser is visible in Fig. 10.

5.4.2 Galaxy Browser

To speed up the search within the time limitation, we want to support the user with a system in which they are able to select more than one keyframe in one mouse action. It can be assumed that the keyframes relevant to a search topic share similar features. Hence, they should be close to each other in the feature space. Therefore, visualization based on the similarity between them will make the search easier, as similar images are grouped together in a specific location of the search space. Hence, fewer navigation and interaction actions will be needed. We propose the GalaxyBrowser, which integrates advanced similarity-based visualization with active learning.

The similarity-based visualization of [15] is the basis for our retrieval. In brief, we have pointed out that an optimal visualization system has to obey three requirements: overview, structure preservation, and visibility. The first requirement ensures that the displayed set represents the whole collection, the so-called representative set. For user interaction, the collection should be projected to the display space. Hence, the second requirement tries to preserve the relations between keyframes in the original feature space. The final requirement keeps the content of displayed keyframes feasible for interaction.


These are conflicting requirements. For example, to satisfy the overview requirement, the number of representative keyframes should be increased. Because of the fixed size of the display space, more keyframes mean a higher chance of overlap, so the visibility requirement will be violated. Moreover, while preserving visibility, images are spread out from each other, and the original relations between them are changed, i.e. structure is not preserved. Therefore, cost functions for each requirement and balancing functions between them are proposed.

Active learning algorithms mostly use support vector machines (SVM) as a feedback learning base [38, 33]. In interactive search, using this approach, the system first shows some images and asks the user to label those as positive and/or negative. The learning is either based on both positive and negative examples (known as two-class SVM) or on positive/negative ones only (known as one-class SVM). These examples are used to train the SVM to learn classifiers separating positive and negative examples. The process is repeated until the performance satisfies given constraints. We have done a comparison between the two approaches; the results show that one-class SVM generally performs better than two-class, as well as being faster in returning the result. We concentrate on the use of one-class SVM for learning the relevance feedback.
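One feedback round can be sketched with scikit-learn's OneClassSVM, which wraps the same LIBSVM library [4] the authors use; all parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def feedback_round(features, positive_ids, shown_ids, k=20):
    """Fit a one-class SVM on the user's positive keyframes, then
    propose the k unseen keyframes closest to the decision border."""
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
    model.fit(features[positive_ids])
    margin = model.decision_function(features)   # signed border distance
    candidates = [i for i in np.argsort(np.abs(margin))
                  if i not in shown_ids]          # never re-show keyframes
    return model, candidates[:k]

def final_ranking(model, features):
    """Final result: keyframes farthest inside the border are assumed
    most likely relevant to the search topic."""
    return np.argsort(-model.decision_function(features))
```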

The combination of the two techniques is drawn into one scheme (see Fig. 11). The offline stage contains feature extraction and similarity function selection. The ISOSNE from [15] is applied to project the collection from the high-dimensional space to the visualization space. The next step decides which set of keyframes will be used as the representative one. To do so, we employ the k-means algorithm to cluster keyframes into a fixed number of groups. A set of keyframes selected from the different groups serves as the representative set. The information of each keyframe belonging to a certain group, and its position in the visualization space, is stored as offline data.

In the interactive stage, query results are the input for starting up the search. First, the set of top k keyframes from the query results is displayed. The user then uses the system to explore the collection and find relevant keyframes. In particular, if the currently displayed set contains any positive one, the user selects that keyframe and goes into the corresponding cluster with the expectation of finding more similar ones. With the advantage of similarity-based visualization, instead of clicking on an individual keyframe for labeling, the system supports the user with mouse dragging to draw the area of keyframes in the same category. This means that when the user finds a group of relevant keyframes, he/she draws a rectangle around those and marks them all as positive examples. Therefore, our system can reduce the number of actions from the user with the same amount of information for relevance feedback. In case there is no positive keyframe in the current set, the user then asks the system to display another set, which contains the next k keyframes from the query results. Keyframes which are selected as training examples or displayed before will not be shown again.

Figure 11: Scheme of an interactive search in the GalaxyBrowser with the combination of active learning and similarity-based visualization.

In the learning step, when a certain number of training examples are provided, the SVM trains the support vectors. We use the well-known SVM library developed by Chang and Lin [4], which provides a one-class implementation. After the learning, a set of images closest to the border is returned. The process is repeated until a certain constraint is satisfied, such as the number of iterations, a time limitation, or simply that the user does not want to give any more feedback. At that point, the system returns the final result containing keyframes with maximum distances to the border, as they are assumed to have high probabilities of being relevant to the search topic.

5.4.3 Sphere Browser

To visualize the thread structure a so-called SphereBrowser [22] was developed. The browser displays two orthogonal dimensions. The horizontal one is the time thread, using the original TRECVID shot sequence. The vertical dimension contains, for each shot, cluster threads of semantically similar footage. The GUI gives the user a spherical layout of nearby shots on the screen. Using the mouse and arrow keys the user can then navigate either through time or through related shots, selecting relevant shots when found. Also selecting (parts


TRECVID 2005 Overall Search Results

IyadMahmoudSearch Topic

Figure 12: Comparison of automatic, manual, and interactive search results for 24 topics. Results for the users of the lexicon-driven retrieval paradigm are indicated with special markers.


5.5 Experiments

5.5.1 Automatic Search

We submitted two runs for automatic search: one baseline run using the final text search strategy only, and one full run incorporating text and semantic concepts. As can be seen in Fig. 12, the combined semantic and text run outperformed the text run on nearly all counts. We did best for those topics that had a clear mapping to the semantic concept indices, i.e. tennis for topic 156, meeting for topic 163 (achieving the best result for this topic), and basketball for topic 165. In some cases the concept weighting strategy was not optimal, for example for topic 158. In this case we detected the aircraft index, but the concept results were given a weight of 0 in the result fusion, because the information content of the concept helicopter was calculated to be much higher than the information content of the concept aircraft. If we had utilized the aircraft detector in this case, we would have achieved an average precision of 0.17, which is higher than the best evaluated average precision of 0.14. We have demonstrated that automatic search using only text as input is a realistic task. We perform better than the median for a number of topics, and even achieve the best score for one topic. Postulating that all other systems incorporate multimodal examples in their search, this is a significant result. The performance of our search engine is best when one or more related indices are present; we expect that the results of our system will improve as we add more semantic concept indices, using our semantic pathfinder strategy.
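The weighting issue can be illustrated with a small information-content computation in the spirit of Resnik [21], where rarer concepts score higher. The frequencies below are invented, and winner-takes-all fusion is assumed purely for illustration.

```python
# Illustrative information-content weighting: IC(c) = -log p(c).
import math

concept_freq = {"aircraft": 1200, "helicopter": 90}   # hypothetical shot counts
total = 100000                                        # hypothetical corpus size

ic = {c: -math.log(f / total) for c, f in concept_freq.items()}
# ic["helicopter"] (~7.0) > ic["aircraft"] (~4.4), so winner-takes-all fusion
# would give the aircraft detector a weight of 0 for topic 158, as above.
best = max(ic, key=ic.get)   # -> "helicopter"
```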

5.5.2 Manual Search

We submitted one run for manual search, where we only use the 101 concepts in the lexicon to answer the queries. Moreover, we restrict ourselves to using only visual information. For thirteen topics we score above the median. Specifically, for two queries, i.e. vehicle with flames (160) and tennis players (156), we perform the best of all manual runs, and for two other queries, i.e. people with banners (161) and basketball players (165), we are second best. For ten queries we score below the median; three of those are not covered by our lexicon, and seven are person-x type queries. We perform badly for person-x queries because the features describe visual scene layout; consequently, names and faces are not modeled. For the remaining fourteen topics there is only one, i.e. boat (164), where the text baseline is better. Compared to our automatic search text baseline, we perform worse on eight queries. Of those eight queries, the text baseline performs better for all person-x queries, and for one other query (164). Consequently, a visual-only approach outperforms the text baseline in 16 queries, including the out-of-lexicon queries.

We believe our results support the lexicon-driven retrieval approach and show the importance of visual analysis. Despite the obvious disadvantages of using only visual information, we outperform the text baseline, and even score the best of all manual runs in two queries.


Figure 13: Overview of all search runs submitted to TRECVID 2005 (mean average precision per system run). Users who exploited the proposed paradigm are indicated with special markers.

5.5.3 Interactive Search

We submitted four runs for interactive search. Three users focused on using only one browser; the fourth user mixed all browsers. Results in Fig. 12 indicate that for most search topics, users of the proposed paradigm for interactive multimedia retrieval score above average. Furthermore, users of our approach obtain a top-3 average precision result for 19 out of 24 topics. Best performance is obtained for 7 topics. Best results are obtained with the CrossBrowser.

Depending on the search topic, the proposed GalaxyBrowser aids users in searching for the relevant subset of the collection. As the features used are visually based, the system works well in case relevant images of a certain topic share visual similarity, e.g. queries related to tennis or car. However, when topics have a large variety in visual settings, for instance person-x topics, visual features hardly yield additional information to aid the user in the interactive search process. To our knowledge, no existing features work well in these cases.

Two search strategies were discovered during the interactive retrieval task using the Sphere Browser. There were topics for which multiple cluster threads yielded good results, such as Tennis (156), People with banners or signs (161), Meeting (163), and Tall building (170). For these topics only the relevant parts of the threads needed to be selected. Another selection method was found in queries such as Airplane takeoff (167) and Office setting (172). Here there were only a limited number of consecutive valid shots visible in each thread, but because of the combination of both time and cluster threads there was always another valid but not yet selected shot visible. For these queries, selection was done by hopping from one valid result to another. A number of topics were not answerable with the Sphere Browser because of a lack of nearby shots. These include person-x topics 149, 151, and 153.

To gain insight into the overall quality of our lexicon-driven retrieval paradigm, we compare the results of our users with all other users that participated in the retrieval tasks of the 2005 TRECVID benchmark. We visualize the results for all submitted search runs in Fig. 13. The results are state-of-the-art.

6 Exploration of BBC Rushes

The BBC rushes consist of raw material used to produce a video. Since there is little to no speech, this material is very suitable for visual-only indexing. We first segmented the videos using our shot segmentation algorithm [2]. Then we applied our best performing camera motion detector (see Section 4) on the BBC rushes, using the models trained on the news data. To further investigate the robustness of our visual features, we performed visual-only concept detection on the BBC rushes data, without re-training the visual models. The visual models are the same as used in the visual-only feature task (Section 3.4) and in the manual search task (Section 5.2). The detectors thus learned on news data are subsequently evaluated on the BBC rushes videos. Obviously, not all 101 concepts are useful, since they are trained on broadcast news. However, 25 concepts transcend the news domain and some perform surprisingly well on the BBC rushes: aircraft, bird, boat, building, car, charts, cloud, crowd, face, female, food, government building, grass, meeting, mountain, outdoor, overlayed text, sky, smoke, tower, tree, urban, vegetation, vehicle, water body. We developed a version of the MediaMill semantic video search engine tailored to the BBC rushes, based on the computed indexes. While still primitive in terms of utility, the search engine allows users to explore the collection in a surprising manner. The results again confirm the importance of robust visual features. Hence, for this task much is to be expected from improved visual analysis yielding a large lexicon of semantic concepts.
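The cross-domain experiment amounts to the following sketch, assuming stored scikit-learn-style probabilistic models and a precomputed feature matrix; all file names and the concept subset are hypothetical.

```python
# Apply concept detectors trained on news data to BBC rushes, unchanged.
import joblib
import numpy as np

bbc_feats = np.load("bbc_rushes_features.npy")      # visual features per keyframe

for concept in ["aircraft", "boat", "car", "grass", "sky", "vegetation"]:
    model = joblib.load(f"news_models/{concept}.pkl")   # trained on news only
    probs = model.predict_proba(bbc_feats)[:, 1]        # no re-training
    top = np.argsort(-probs)[:50]                       # best-scoring keyframes
    print(concept, top[:5])
```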

Acknowledgements

We thank NIST for the evaluation effort, Kevin Walker for solving hard disk problems, the DCU team for the creation of keyframes, and Alex Hauptmann for the missing machine translations.

References

[1] A. Amir et al. IBM research TRECVID-2003 video retrieval system. In Proc. of the TRECVID Workshop, Gaithersburg, USA, 2003.

[2] J. Baan et al. Lazy users and automatic video retrieval tools in (the) lowlands. In E. Voorhees and D. Harman, editors, Proc. of the 10th Text REtrieval Conference, volume 500-250, Gaithersburg, USA, 2001.

[3] Blinkx Video Search, 2005.

[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[5] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. J. American Society Inform. Science, 41(6):391–407, 1990.

[6] J. Geusebroek. Visual object recognition, 2005. Patent PCT/NL2005/000485, filed July 6.

[7] J. Geusebroek, R. van den Boomgaard, A. Smeulders, and H. Geerts. Color invariance. IEEE Trans. PAMI, 23(12):1338–1350, 2001.

[8] J. Geusebroek and A. W. M. Smeulders. A six-stimulus theory for stochastic texture. International Journal of Computer Vision, 62(1/2):7–16, 2005.

[9] Google Video Search, 2005.

[10] B. Huurnink. AutoSeek: Towards a fully automated video search system. Master's thesis, Universiteit van Amsterdam, 2005.

[11] A. Jain, R. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Trans. PAMI, 22(1):4–37, 2000.

[12] P. Joly and H.-K. Kim. Efficient automatic analysis of camera work and microsegmentation of video using spatiotemporal images. Signal Processing: Image Communication, 8(4):295–307, 1996.

[13] G. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[14] M. Naphade. On supervision and statistical learning for semantic multimedia analysis. J. Visual Commun. Image Representation, 15(3):348–369, 2004.

[15] G. Nguyen and M. Worring. Similarity based visualization of image collections. In Int'l Worksh. Audio-Visual Content and Information Visualization in Digital Libraries, 2005.

[16] NIST. TRECVID Video Retrieval Evaluation, 2001–2005. http://www-nlpir.nist.gov/projects/trecvid/.

[17] T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity - measuring the relatedness of concepts. In Nat'l Conf. Artificial Intelligence, San Jose, USA, 2004.

[18] C. Petersohn. Fraunhofer HHI at TRECVID 2004: Shot boundary detection system. In Proc. of the TRECVID Workshop, Gaithersburg, USA, 2004.

[19] J. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 2000.

[20] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[21] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In International Joint Conference on Artificial Intelligence, San Mateo, USA, 1995.

[22] O. de Rooij. Browsing news video using semantic threads. Master's thesis, Universiteit van Amsterdam, 2006.

[23] G. Salton. The SMART retrieval system: experiments in automatic document processing. Prentice-Hall, Englewood Cliffs, USA, 1971.

[24] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

[25] T. Sato, T. Kanade, E. Hughes, M. Smith, and S. Satoh. Video OCR: Indexing digital news libraries by recognition of superimposed captions. Multimedia Systems, 7(5):385–395, 1999.

[26] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 1994.

[27] H. Schneiderman and T. Kanade. Object detection using the statistics of parts. International Journal of Computer Vision, 56(3):151–177, 2004.

[28] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. PAMI, 22(12):1349–1380, 2000.

[29] C. Snoek and M. Worring. Multimedia event-based video indexing using time intervals. IEEE Trans. Multimedia, 7(4):638–647, 2005.

[30] C. Snoek, M. Worring, J. Geusebroek, D. Koelma, F. Seinstra, and A. Smeulders. The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing. IEEE Trans. PAMI, 2006. In press.

[31] C. Snoek, M. Worring, and A. Hauptmann. Learning rich semantics from news video archives by style analysis. ACM Trans. Multimedia Computing, Communications and Applications, 2(2), May 2006. In press.

[32] C. Snoek, M. Worring, and A. Smeulders. Early versus late fusion in semantic video analysis. In ACM Multimedia, Singapore, 2005.

[33] S. Tong and E. Chang. Support vector machine active learning for image retrieval. In ACM Multimedia, pages 107–118, Ottawa, Canada, 2001.

[34] Y. Tonomura, A. Akutsu, Y. Taniguchi, and G. Suzuki. Structured video computing. IEEE Multimedia, 1(3):34–43, 1994.

[35] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, USA, 2nd edition, 2000.

[36] T. Volkmer, J. Smith, A. Natsev, M. Campbell, and M. Naphade. A web-based system for collaborative annotation of large image and video collections. In ACM Multimedia, Singapore, 2005.

[37] T. Westerveld, R. Cornacchia, J. van Gemert, D. Hiemstra, and A. de Vries. An integrated approach to text and image retrieval – the lowlands team at TRECVID 2005. In Proc. of the TRECVID Workshop, Gaithersburg, USA, 2005.

[38] X. Zhou and T. Huang. Relevance feedback in image retrieval: a comprehensive overview. Multimedia Systems, 8(6):536–544, 2003.
