
The MediaMill TRECVID 2005 Semantic Video Search Engine

C.G.M. Snoek, J.C. van Gemert, J.M. Geusebroek, B. Huurnink, D.C. Koelma, G.P. Nguyen,
O. de Rooij, F.J. Seinstra, A.W.M. Smeulders, C.J. Veenman, M. Worring

Intelligent Systems Lab Amsterdam, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands

http://www.mediamill.nl

Abstract

In this paper we describe our TRECVID 2005 experiments. The UvA-MediaMill team participated in four tasks. For the detection of camera work (run id: A CAM) we investigate the benefit of using a tessellation of detectors in combination with supervised learning over a standard approach using global image information. Experiments indicate that average precision results increase drastically, especially for pan (+51%) and tilt (+28%). For concept detection we propose a generic approach using our semantic pathfinder. The most important novelty compared to last year's system is the improved visual analysis using proto-concepts based on Wiccest features. In addition, the path selection mechanism was extended. Based on the semantic pathfinder architecture we are currently able to detect an unprecedented lexicon of 101 semantic concepts in a generic fashion. We performed a large set of experiments (run id: B vA). The results show that an optimal strategy for generic multimedia analysis is one that learns from the training set, on a per-concept basis, which tactic to follow. Experiments also indicate that our visual analysis approach is highly promising. The lexicon of 101 semantic concepts forms the basis for our search experiments (run id: B). We participated in automatic, manual (using only visual information), and interactive search. The lexicon-driven retrieval paradigm aids substantially in all search tasks. When coupled with interaction, exploiting several novel browsing schemes of our semantic video search engine, results are excellent. We obtain a top-3 result for 19 out of 24 search topics. In addition, we obtain the highest mean average precision of all search participants. We exploited the technology developed for the above tasks to explore the BBC rushes. The most intriguing result is that, from the lexicon of 101 visual-only models trained for news data, 25 concepts also perform reasonably well on BBC data.

1 Introduction

Despite the emergence of commercial video search engines, such as Google [9] and Blinkx [3], multimedia retrieval is by no means a solved problem. In fact, present-day video search engines rely mainly on text, in the form of closed captions [9] or transcribed speech [3], for retrieval. This results in disappointing performance when the visual content is not reflected in the associated text. In addition, when the videos originate from non-English speaking countries, such as China or The Netherlands, querying the content becomes even harder, as automatic speech recognition results are much poorer. For videos from these sources, an additional visual analysis potentially yields more robustness. For effective video retrieval there is a need for multimedia analysis, in which text retrieval is an important factor, but not the decisive element. We advocate that the ideal multimedia retrieval system should first learn a large lexicon of concepts, based on multimedia analysis, to be used for the initial search. Then, the ideal system should employ similarity and interaction to refine the search until satisfaction. We propose a multimedia retrieval paradigm built on three principles: learning of a lexicon of semantic concepts, multimedia data similarity, and user interaction. Within the proposed paradigm, we explore the combination of query-by-concept, query-by-similarity, and interactive filtering using advanced visualizations of the MediaMill semantic video search engine. To demonstrate the effectiveness of our multimedia retrieval paradigm, several components are evaluated within the 2005 NIST TRECVID video retrieval benchmark [16].

The organization of this paper is as follows. First, we discuss our general learning architecture and data preparation steps. Our system architecture for generic semantic indexing is presented in Section 3. We describe our approach for camera work indexing in Section 4. Our multimedia retrieval paradigm is presented in Section 5. Our explorative work on BBC rushes is addressed in Section 6.

2 Preliminaries

The MediaMill semantic video search engine exploits a common architecture with a standardized input-output model to allow for semantic integration. The conventions to describe the modular system architecture are indicated in Fig. 1.

2.1 General Learning Architecture

We perceive of video indexing as a pattern recognition problem. We first need to segment a video. We opt for camera shots [18], indicated by i, following the standard in TRECVID evaluations. Given pattern x, part of a shot,


Figure 1: Data flow conventions as used in this paper. Different arrows indicate differences in data flows.

the aim is to detect an index ω from shot i using the probability p_i(ω|x_i). We exploit supervised learning to learn the relation between ω and x_i. The training data of the multimedia archive, together with labeled samples, are used for learning classifiers. The other data, the test data, are set aside for testing. The general architecture for supervised learning in the MediaMill semantic video search engine is illustrated in Fig. 2.

We can choose from a large variety of supervised machine learning approaches to obtain p_i(ω|x_i). For our purpose, the method of choice should be capable of handling video documents. To that end, ideally it must learn from a limited number of examples, it must handle unbalanced data, and it should account for unknown or erroneously detected data. Given such heavy demands, the Support Vector Machine (SVM) framework [35, 4] has proven to be a solid choice [1, 29]. The usual SVM method provides a margin in the result. We prefer Platt's conversion method [19] to achieve a posterior probability of the result. SVM classifiers thus trained for ω result in an estimate p_i(ω|x_i, q), where q are parameters of the SVM yet to be optimized.

The influence of the SVM parameters on video indexing is significant [14]. We obtain good parameter settings for a classifier by using an iterative search over a large number of SVM parameter combinations. We measure average precision performance of all parameter combinations and select the combination that yields the best performance, q*. Here we use 3-fold cross validation [11] with 3 repetitions to prevent overfitting of parameters. The result of the parameter search over q is the improved model p_i(ω|x_i, q*). In the following we drop q* where obvious.
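As an illustration, the sketch below trains one concept detector with scikit-learn, whose SVC wraps the LIBSVM implementation cited above [4] and whose probability option applies Platt's conversion method. The feature matrix, labels, and the C/γ grid are hypothetical stand-ins, not the authors' settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

def train_concept_detector(x_train, y_train):
    """Fit an SVM for one concept, returning a model with
    probabilistic output p_i(omega | x_i, q*).

    x_train (shots x features) and y_train (binary labels) are
    hypothetical; the parameter grid is illustrative only."""
    # 3-fold cross validation with 3 repetitions, scored by average
    # precision, selects the best parameter combination q*.
    cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)
    grid = {"C": np.logspace(-1, 3, 5), "gamma": np.logspace(-4, 0, 5)}
    # probability=True converts the SVM margin into a posterior
    # probability via Platt's fitted sigmoid.
    search = GridSearchCV(
        SVC(kernel="rbf", probability=True, class_weight="balanced"),
        param_grid=grid,
        scoring="average_precision",
        cv=cv,
    )
    search.fit(x_train, y_train)
    return search.best_estimator_  # p_i(omega | x_i, q*)
```

The balanced class weight is one common way to meet the unbalanced-data requirement mentioned above; the paper does not state how its implementation handles it.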

Figure 2: General architecture for supervised learning in the MediaMill semantic video search engine, using the conventions of Fig. 1.

2.2 Data Preparation

Supervised learning requires labeled examples. In part, we rely on the provided ground truth of the TRECVID 2005 common annotation effort [36]. It is extended manually to arrive at an incomplete, but reliable ground truth for an unprecedented amount of 101 semantic concepts in lexicon Λ_S. In addition, we manually labeled a substantial part of the training set with respect to the dominant type of camera work, i.e. pan, tilt, and/or zoom, if present.

In order to recognize concepts based on low-level visual analysis, we annotated 15 different proto-concepts: building (321), car (192), charts (52), crowd (270), desert (82), fire (67), US-flag (98), maps (44), mountain (41), road (143), sky (291), smoke (64), snow (24), vegetation (242), water (108), where the number in brackets indicates the number of annotation samples of that concept. We again used the TRECVID 2005 common annotation effort as a basis for selecting relevant shots containing the proto-concepts. In those shots, we annotated rectangular regions where the proto-concept is visible for at least 20 frames.

We split the training data a priori into four non-overlapping training and validation sets to prevent overfitting of classifiers. Training sets A, B, and C each contain 30% of the 2005 training data; validation set D contains the remaining 10%. We assign all shots in the training set randomly to either set A, B, C, or D.

3 Semantic Pathfinder Indexing

The central assumption in our semantic indexing architecture is that any broadcast video is the result of an authoring process. When we want to extract semantics from a digital broadcast video, this authoring process needs to be reversed. For authoring-driven analysis we proposed the semantic pathfinder [30]. The semantic pathfinder is composed of three analysis steps. It follows the reverse authoring process. Each analysis step in the path detects semantic concepts. In addition, one can exploit the output of an analysis step in the path as the input for the next one. The semantic pathfinder starts in the content analysis step. In this analysis step, we follow a data-driven approach of indexing semantics. The style analysis step is the second analysis step. Here we tackle the indexing problem by viewing a video from the perspective of production. This analysis step aids especially in indexing of rich semantics. Finally, to enhance the indexes further, in the context analysis step, we view semantics in context. One would expect that some concepts, like vegetation, have their emphasis on content, where the style (of the camera work, that is) and context (of

concepts like graphics) do not add much. In contrast, more complex events, like people walking, profit from incremental adaptation of the analysis to the intention of the author. The virtue of the semantic pathfinder is its ability to find the best path of analysis steps on a per-concept basis. An overview of the semantic pathfinder is given in Fig. 3.

Figure 3: The semantic pathfinder for one concept, using the conventions of Fig. 1.

3.1 Content Analysis Step

We view video in the content analysis step from the data perspective. In general, three data streams or modalities exist in video: the auditory modality, the textual modality, and the visual one. As speech is often the most informative part of the auditory source, we focus on visual features, and on textual features obtained from transcribed speech. After modality-specific data processing, we combine features in a multimodal representation using early fusion and late fusion [32].

3.1.1 Visual Analysis

Modeling visual data heavily relies on qualitative features. Good features describe the relevant information in an image while reducing the amount of data representing the image. To achieve this goal, we use Wiccest features as introduced in [6]. Wiccest features combine color invariance with natural image statistics. Color invariance aims to remove accidental lighting conditions, while natural image statistics efficiently represent image data.

Color invariance aims at keeping the measurements constant under varying intensity, viewpoint, and shading. In [7] several color invariants are described. We use the W invariant, which normalizes the spectral information with the energy. This normalization makes the measurements independent of illumination changes under uniform lighting conditions.

When modeling scenes, edges are highly informative. Edges reveal where one region ends and another begins. Thus, an edge has at least twice the information content of a uniformly colored patch, since an edge contains information about all regions it divides. Besides serving as region boundaries, an ensemble of edges describes texture information. Texture characterizes the material an object is made of. Moreover, a compilation of cluttered objects can

Figure 4: An example of dividing an image up in overlapping regions. In this particular example, the region size is half of the image size for both the x-dimension and y-dimension. The regions are uniformly sampled across the image with a step size of half a region. Sampling in this manner identifies nine overlapping regions.

be described as texture information. Therefore, a scene can be modeled with textured regions.

Texture is described by the distribution of edges at a certain region in an image. Hence, a histogram of Gaussian derivative filter responses represents the edge statistics. Since there are more non-edge pixels than there are edge pixels, the distribution of edge responses for natural images always has a peak around zero, i.e. many pixels have no edge response. Additionally, the shape of the tails of the distribution is often in between a power-law and a Gaussian distribution. This specific distribution can be well modeled with an integrated Weibull distribution [8]. This distribution is given by

  f(r) = γ / (2 γ^{1/γ} β Γ(1/γ)) · exp( −(1/γ) |(r − µ)/β|^γ ),   (1)

where r is the edge response to the Gaussian derivative filter and Γ(·) is the complete Gamma function, Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt. The parameter β denotes the width of the distribution, the parameter γ represents the 'peakness' of the distribution, and the parameter µ denotes the origin of the distribution.

To assess the similarity between Wiccest features, a goodness-of-fit test is utilized. The measure is based on the integrated squared error between the two cumulative distributions, which is obtained by a Cramér-von Mises measure. For two Weibull distributions with parameters F_β, F_γ and G_β, G_γ, a first-order Taylor approximation of the Cramér-von Mises statistic yields the log difference between the parameters. Therefore, a measure of similarity between two Weibull distributions F and G is given by the ratio of the parameters,

  W2(F, G) = ( min(F_β, G_β) / max(F_β, G_β) ) · ( min(F_γ, G_γ) / max(F_γ, G_γ) ).   (2)

The µ parameter represents the mode of the distribution. The position of the mode is influenced by uneven illumination and colored illumination. Hence, to achieve color constancy, the values for µ may be ignored.

In summary, Wiccest features provide a color invariant texture descriptor. Moreover, the features rely heavily on natural image statistics to compactly represent the visual information.
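To make equations (1) and (2) concrete, the sketch below fits β and γ to a sample of edge-filter responses by maximum likelihood, with µ fixed at 0 since the µ values are ignored, and then compares two fits with W2. The optimizer, starting values, and synthetic test data are our own choices; the paper does not describe its fitting procedure.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def weibull_nll(params, r):
    """Negative log-likelihood of the integrated Weibull of Eq. (1),
    with mu = 0; parameters are optimized in log space so that
    beta, gamma > 0."""
    beta, gamma = np.exp(params)
    log_norm = (np.log(gamma) - np.log(2.0) - np.log(gamma) / gamma
                - np.log(beta) - gammaln(1.0 / gamma))
    return -np.sum(log_norm - (1.0 / gamma) * np.abs(r / beta) ** gamma)

def fit_weibull(r):
    """Fit (beta, gamma) to the edge responses r of one region."""
    res = minimize(weibull_nll, x0=[0.0, 0.0], args=(r,), method="Nelder-Mead")
    return np.exp(res.x)

def w2(f, g):
    """Similarity between two fitted distributions, Eq. (2)."""
    (fb, fg), (gb, gg) = f, g
    return (min(fb, gb) / max(fb, gb)) * (min(fg, gg) / max(fg, gg))

# Two synthetic regions; gamma = 1 makes Eq. (1) a Laplace density,
# so Laplacian samples are a convenient test case.
rng = np.random.default_rng(0)
r1, r2 = rng.laplace(0, 1.0, 2000), rng.laplace(0, 2.0, 2000)
print(w2(fit_weibull(r1), fit_weibull(r2)))  # ~0.5: widths differ by 2x
```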


3.1.2 Contextures: Regional Texture Descriptors and their Context

The visual detectors aim to decompose an image into proto-concepts like vegetation, water, fire, sky, etc. To achieve this goal, an image is divided up into several overlapping rectangular regions. The regions are uniformly sampled across the image, with a step size of half a region. The region size has to be large enough to assess statistical relevance, and small enough to capture local textures in an image. We utilize a multi-scale approach, using small and large regions. An example of region sampling is displayed in Fig. 4.

A visual scene is characterized by both global and local texture information. For example, a picture with an aircraft in midair might be described as "sky, with a hole in it". To model this type of information, we use a proto-concept occurrence histogram where each bin is a proto-concept. The values in the histogram are the similarity responses of each proto-concept annotation to the regions in the image.

We use the proto-concept occurrence histogram to characterize both global and local texture information. Global information is described by computing an occurrence histogram accumulated over all regions in the image. Local information is taken into account by constructing another occurrence histogram for only the response of the best region. For each proto-concept, or bin, b, the accumulated occurrence histogram and the best occurrence histogram are constructed by

  H_accumulated(b) = Σ_{r ∈ R(im)} Σ_{a ∈ A(b)} W2(a, r),

  H_best(b) = max_{r ∈ R(im)} Σ_{a ∈ A(b)} W2(a, r),

where R(im) denotes the set of regions in image im, A(b) represents the set of stored annotations for proto-concept b, and W2 is the Cramér-von Mises statistic as introduced in equation 2.

We denote a proto-concept occurrence histogram as a contexture for that image. We have chosen this name, as our method incorporates texture features in a context. The texture features are given by the use of Wiccest features, using color invariance and natural image statistics. Furthermore, context is taken into account by the combination of both local and global region combinations.

Contextures can be computed for different parameter settings. Specifically, we calculate the contextures at scales σ = 1 and σ = 3 of the Gaussian filter. Furthermore, we use two different region sizes, expressed as ratios of the x-dimension and y-dimension of the image. Moreover, contextures are based on one image, and not on a shot. To generalize our approach to shot level, we extract 1 frame per second from the video, and then aggregate the frames that belong to the same shot. We use two ways to aggregate frames: 1) average the contexture responses for all extracted frames in a shot, and 2) keep the maximum response of all frames in a shot. This aggregation strategy accounts for information about the whole shot i, and for information about accidental frames, which might occur with high camera motion. The combination of all these parameters yields a vector of contextures v_i, containing the final result of the visual analysis.
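A minimal sketch of the two occurrence histograms reconstructed above. Here `sim[b]` stands in for precomputed W2 similarities between each region r in R(im) and each stored annotation a in A(b); the shapes and random test data are illustrative only.

```python
import numpy as np

def contexture_histograms(sim):
    """Build the accumulated and best proto-concept occurrence
    histograms from per-region, per-annotation W2 similarities."""
    h_acc = np.zeros(len(sim))
    h_best = np.zeros(len(sim))
    for b, s in enumerate(sim):           # s: regions x annotations of bin b
        per_region = s.sum(axis=1)        # sum over annotations a in A(b)
        h_acc[b] = per_region.sum()       # accumulated over all regions
        h_best[b] = per_region.max()      # response of the best region only
    return h_acc, h_best

# 15 proto-concepts, 9 overlapping regions, varying annotation counts.
rng = np.random.default_rng(1)
sim = [rng.random((9, int(rng.integers(5, 50)))) for _ in range(15)]
h_acc, h_best = contexture_histograms(sim)
```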

3.1.3 Textual Analysis

In the textual modality, we aim to learn the association between uttered speech and semantic concepts. A detection system transcribes the speech into text. For the Chinese and Arabic sources we exploit the provided machine translations. The resulting translation is mapped from story level to shot level. From the text we remove the frequently occurring stopwords. After stopword removal, we are ready to learn semantics.

To learn the relation between uttered speech and concepts, we connect words to shots. We make this connection within the temporal boundaries of a shot. We derive a lexicon of uttered words that co-occur with ω using the shot-based annotations of the training data. For each concept ω, we learn a separate lexicon, Λ_T^ω, as this uttered word lexicon is specific for that concept. For feature extraction we compare the text associated with each shot with Λ_T^ω. This comparison yields a text vector t_i for shot i, which contains the histogram of the words in association with ω.

3.1.4 Early Fusion

Indexing approaches that rely on early fusion first extract unimodal features of each stream. The extracted features of all streams are combined into a single representation. After combination of unimodal features in a multimodal representation, early fusion methods rely on supervised learning to classify semantic concepts. Early fusion yields a truly multimedia feature representation, since the features are integrated from the start. An added advantage is the requirement of only one learning phase. A disadvantage of the approach is the difficulty of combining features into a common representation. The general scheme for early fusion is illustrated in Fig. 5a.

We rely on vector concatenation in the early fusion scheme to obtain a multimodal representation. We concatenate the visual vector v_i with the text vector t_i. After feature normalization, we obtain early fusion vector e_i.

3.1.5 Late Fusion

Indexing approaches that rely on late fusion also start with extraction of unimodal features. In contrast to early fusion, where features are combined into a multimodal representation first, approaches for late fusion learn semantic concepts directly from unimodal features. In general, late fusion focuses on the individual strength of modalities. Unimodal concept detection scores are fused into a multimodal semantic representation rather than a feature representation.


Figure 5: (a) General scheme for early fusion. Output of unimodal analysis is fused before a concept is learned. (b) General scheme for late fusion. Output of unimodal analysis is used to learn separate scores for a concept. After fusion a final score is learned for the concept. We use the conventions of Fig. 1.

A big disadvantage of late fusion schemes is their expense in terms of learning effort, as every modality requires a separate supervised learning stage. Moreover, the combined representation requires an additional learning stage. Another disadvantage of the late fusion approach is the potential loss of correlation in mixed feature space. A general scheme for late fusion is illustrated in Fig. 5b.

For the late fusion scheme, we concatenate the probabilistic output score after visual analysis, i.e. p_i(ω|v_i, q*), with the probabilistic score resulting from textual analysis, i.e. p_i(ω|t_i, q*), into late fusion vector l_i.
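The two fusion vectors can be sketched as follows; the L2 normalization in the early fusion scheme is an assumption, as the paper does not specify which feature normalization it applies.

```python
import numpy as np

def early_fusion(v_i, t_i):
    """Concatenate visual and text features into e_i after
    per-modality normalization (L2 assumed)."""
    v = v_i / (np.linalg.norm(v_i) + 1e-12)
    t = t_i / (np.linalg.norm(t_i) + 1e-12)
    return np.concatenate([v, t])          # one classifier learns on e_i

def late_fusion(p_visual, p_text):
    """Concatenate the per-modality posteriors p_i(w|v_i) and
    p_i(w|t_i) into l_i; a second classifier is trained on l_i."""
    return np.array([p_visual, p_text])
```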

3.1.6 Content Pathfinder

We learn 101 semantic concepts based on the four vectors resulting from analysis in the content analysis step. Thus v_i, t_i, e_i, and l_i serve as the input for our supervised learning module, which learns an optimized SVM model for each semantic concept ω using 3-fold cross validation with 3 repetitions on training set A. These models are then validated on set D, yielding a best performing model p_i(ω|m_i) for all ω in Λ_S, where m_i ∈ {v_i, t_i, e_i, l_i}.

3.2 Style Analysis Step

In the style analysis step we conceive of a video from the production perspective. Based on the four roles involved in the video production process [31], layout detectors analyze the role of the editor. Content detectors analyze the role of production design. Capture detectors analyze the role of the production recording unit. Finally, context detectors analyze the role of the preproduction team, see Fig. 6.

Figure 6: Feature extraction and classification in the style analysis step, a special case of Fig. 2.

3.2.1 Style Analysis

We develop detectors for all four production roles as feature extraction in the style analysis step. We refer to our previous work for specific implementation details of the detectors [31, Electronic Appendix]. We have chosen to convert the output of all style detectors to an ordinal scale, as this allows for elegant fusion.

For the layout L, the length of a camera shot is used as a feature, as this is known to be an informative descriptor for genre [31]. Overlayed text is another informative descriptor. Its presence is detected by a text localization


algorithm [25]. To segment the auditory layout, periods of speech and silence are detected based on the provided automatic speech recognition results. We obtain a voice-over detector by combining the speech segmentation with the camera shot segmentation [31]. The set of layout features is thus given by: L = {shot length, overlayed text, silence, voice-over}.

As concerns the content C, a frontal face detector [27] is applied to detect people. We count the number of faces, and for each face its location is derived [31]. In addition, we measure the average amount of object motion in a camera shot [29]. Based on provided speaker identification we identify each of the three most frequent speakers. Each camera shot is checked for presence of speech from one of the three [31]. We also exploit the provided named entity recognition. The set of content features is thus given by: C = {faces, face location, object motion, frequent speaker, voice named entity}.

For capture T, we compute the camera distance from the size of detected faces [27, 31]. It is undefined when no face is detected. In addition to camera distance, several types of camera work are detected [2], e.g. pan, tilt, zoom, and so on. Finally, for capture we also estimate the amount of camera motion [2]. The set of capture features is thus given by: T = {camera distance, camera work, camera motion}.

The context S serves to enhance or reduce the correlation between semantic concepts. Detection of vegetation can aid in the detection of a forest, for example. Likewise, the co-occurrence of a space shuttle and a bicycle in one shot is improbable. As the performance of semantic concept detectors is unknown and likely to vary between concepts, we exploit iteration to add them to the context. The rationale here is to add concepts that are relatively easy to detect first. They aid in detection performance by increasing the number of true positives or reducing the number of false positives. To prevent bias from domain knowledge, we use the performance on validation set D of all concepts from Λ_S in the content analysis step as the ordering for the context. To assign detection results for the first and least difficult concept, we rank all shot results on p_i(ω1|m_i). This ranking is then exploited to categorize results for ω1 into one of five levels, as sketched below. The basic set of context features is thus given by: S = {content analysis step ω1}.
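The five-level categorization might look like the following; the equal-frequency cut of the ranking is an assumption, since the paper only states that results are categorized into five levels.

```python
import numpy as np

def rank_to_levels(scores, n_levels=5):
    """Turn shot scores p_i(w1|m_i) into an ordinal context feature:
    rank all shots, then split the ranking into n_levels bins."""
    order = np.argsort(-np.asarray(scores))            # best shot first
    levels = np.empty(len(scores), dtype=int)
    for level, chunk in enumerate(np.array_split(order, n_levels)):
        levels[chunk] = level                          # 0 = most confident
    return levels
```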

The concatenation of {L, C, T, S} for shot i yields style vector s_i. This vector forms the input for an iterative classifier [31] that trains a style model for each concept in lexicon Λ_S. We classify all ω in Λ_S again in the style analysis step. We use 3-fold cross validation with 3 repetitions on training set B to optimize parameter settings in this analysis step. We use the resulting probability as output for concept detection in the style analysis step.

3.3 Context Analysis Step

The context analysis step adds context to our interpretation of the video. Our ultimate aim is the reconstruction of the author's intent by considering detected concepts in context.

Figure 7: Feature extraction and classification in the context analysis step, a special case of Fig. 2.

Both the content analysis step and the style analysis step yield a probability for each shot i and all concepts ω in Λ_S. The probability indicates whether a concept is present. We fuse these semantic features of an analysis step for a shot i into a context vector, see Fig. 7.

We consider three paths in the context analysis step. The first path stems directly from the content analysis step. We fuse the 101 p_i(ω|m_i) concept scores into context vector d_i. The second path stems from the style analysis step, where we fuse the 101 p_i(ω|s_i) scores into context vector p_i. The third path selects the best performer on validation set D from either the content analysis step or the style analysis step. These best performers are fused in context vector b_i.

From these three vectors we learn relations between concepts automatically. To that end the vectors serve as the input for a supervised learning module, which associates a contextual probability p_i(ω|c_i) to a shot i for all ω in Λ_S, where c_i ∈ {d_i, p_i, b_i}. To optimize parameter settings, we use 3-fold cross validation with 3 repetitions on the previously unused data from training set C.

The output of the context analysis step is also the output of the entire semantic pathfinder on video documents. On the way we have included in the semantic pathfinder the results of the analysis on raw data, facts derived from production by the use of style features, and a context perspective on the author's intent by using semantic features. For each concept we obtain several probabilities based on (partial) content, style, and context. We select from all possibilities the one that maximizes average precision based on performance on validation set D. The semantic pathfinder provides us with the opportunity to decide whether one analysis step, concentrating only on (visual) content, is best for the concept, or a two-analysis-step classifier, increasing discriminatory power by adding production style to content, or whether a concept profits most from a consecutive analysis on content, style, and context level.

3.4 Experiments

We traversed the entire semantic pathfinder for all 101 concepts. The average precision performance of the semantic pathfinder and its sub-systems on validation set D is shown in Fig. 8.

We evaluated four analysis strategies for each concept in the content analysis step: text-only, visual-only, early fusion, and late fusion.


Table 1: UvA-MediaMill TRECVID 2005 run comparison for all 10 benchmark concepts. The best path of the semantic pathfinder is submitted as SP-1; the last column indicates results of our visual-only run.

Concept          SP-1    SP-2    SP-3    SP-4    SP-5    SP-6    Visual-only
People walking   0.199   0.172   0.154   0.179   0.101   0.103   0.031
Explosion        0.041   0.027   0.032   0.035   0.036   0.034   0.073
Map              0.142   0.160   0.135   0.123   0.099   0.127   0.138
US flag          0.100   0.063   0.110   0.095   0.072   0.114   0.129
Building         0.235   0.229   0.226   0.225   0.210   0.157   0.269
Waterscape       0.201   0.198   0.137   0.164   0.124   0.136   0.166
Mountain         0.220   0.193   0.182   0.195   0.170   0.128   0.207
Prisoner         0.005   0.001   0.000   0.001   0.001   0.001   0.003
Sports           0.342   0.225   0.289   0.202   0.137   0.153   0.272
Car              0.213   0.192   0.182   0.201   0.196   0.199   0.233
MAP              0.1698  0.146   0.1447  0.142   0.1146  0.1152  0.1521

Results confirm the importance of visual analysis for generic concept detection. Text analysis yields the best approach for only 8 concepts, whereas visual analysis yields the best performance for as many as 45 concepts. Fusion is optimal for the remaining 48 concepts, with a clear advantage for early fusion (33 concepts) over late fusion (15 concepts).

The style analysis step again confirms the importance of including professional television production facets for semantic video indexing, especially for concepts which share many similarities in their production process, like anchors, monologues, and entertainment. For other concepts, content is more decisive, like tennis and baseball for example. Thus some concepts are just content, whereas others are pure production style.

We boost concept detection performance further by the usage of context. The pathfinder again exploits variation in performance of the various paths to select an optimal pathway. The results demonstrate the virtue of the semantic pathfinder. Concepts are divided by the analysis step after which they achieve best performance. Based on these results we conclude that an optimal strategy for generic multimedia analysis is one that learns from the training set, on a per-concept basis, which tactic to follow.

3.4.1 Pathfinder Runs

We submitted six paths for each benchmark concept, prioritized according to validation set performance. For concept explosion, for example, the optimal path (SP-1) indicates that visual-only analysis is the best performer. However, in most cases the best path is a consecutive path of content, style, and context. We report the official TRECVID benchmark results in Table 1.

The results show that the pathfinder mechanism is a good way to estimate the best performing analysis path. The SP-1 run containing the optimal path is indeed the best performer in 8 out of 10 cases. Overall, this is also our best performing run. However, what strikes us most is that average precision results are much lower than can be expected based on the validation set performance reported in Fig. 8. This may indicate that, despite the use of separate training and validation sets, we are still overfitting the data. A point of concern here is the random assignment of shots to the separate training and validation sets. This may bias the classifiers, as it is possible that similar news items from several channels are distributed to separate sets. For two concepts (map and explosion) performance suffered from misinterpretation of correct concepts. Had we included examples of news anchors with maps in the background of the studio setting (for the map concept) and smoke (for explosion) in our training sets, results would be higher. When looking at the judged results, we also found that three concepts (waterscape, mountain, and car) are dominated by commercials. We do not perform well on commercial detection. This can be explained because we take 1 frame per second out of the video in the visual analysis. Sampling in this manner will select different frames for the same commercials that reappear at different timestamps in a video. We anticipate that improvement in frame sampling yields increased robustness for the entire pathfinder.

3.4.2 Visual-only Run

Validation set performance in Fig. 8 indicates that our visual analysis step performs quite well. To determine the contribution of the visual analysis step, we therefore submitted a visual-only run. This involved training a Support Vector Machine on the vector of contextures as introduced in Section 3.1.1. We trained an SVM for each of the 10 concepts of the concept detection task. An experiment for recognizing proto-concepts was submitted by another group [37]. The visual features in the submitted visual-only run are slightly different from the visual features in the semantic pathfinder system. This difference is caused by ongoing development of the visual analysis. Specifically, we improved the Weibull fit to be more robust and we added the proto-concept car. The newer version of the visual analysis was not incorporated in the semantic pathfinder. It was not integrated because visual analysis is the first step in the semantic path; thus, a change in the visual analysis means that all further paths would have to be recomputed. However, for a visual-only run, the improvements were feasible to compute.


[Figure 8 data: for each of the 101 semantic concepts, the validation set average precision of the text analysis, visual analysis, early fusion, late fusion, style, content-context, style-context, and best-context sub-systems, the resulting optimal path, and the TRECVID MAP.]

Figure 8: Validation set average precision performance for 101 semantic concepts using sub-systems of the semantic pathfinder. The best path for each concept is marked with gray cells. Empty cells indicate the impossibility to learn models, due to a lack of annotated examples in the training sub-set used.


Table 2: Validation set average precision performance for 3 types of camera work using several versions of our camera work detector.

Detector version                 Pan     Tilt    Zoom    MAP
Late Fusion                      0.862   0.786   0.862   0.837
Late Fusion + Selected Context   0.859   0.752   0.866   0.826
Late Fusion + Context            0.856   0.656   0.856   0.789
Early Fusion                     0.703   0.558   0.783   0.681
Global                           0.569   0.613   0.813   0.665
Global + Context                 0.591   0.562   0.792   0.648
Early Fusion + Context           0.616   0.461   0.765   0.614

The results of our visual-only run reflect the importance of visual analysis. For four concepts (explosion, US flag, building, car) we outperform the pathfinder system. This improvement might be attributed to the use of improved visual features and to the fact that we use the entire training set in SVM training. However, since the visual analysis step is embedded in the pathfinder system, the visual analysis should never perform better. Therefore we believe that results of the pathfinder system will improve when the new features are included.

4 Camera Work

For the detection of camera work we start with an existing implementation based on spatiotemporal image analysis [34, 12]. Given a set of global intensity images from shot i, the algorithm first extracts spatiotemporal images. On these images a direction analysis is applied to estimate direction parameters. These parameters form the input for a supervised learning module to learn three types of camera work. We modified the algorithm in various ways. We superimposed a tessellation of 8 regions on each input frame to decrease the effect of local disturbances. Parameters thus obtained are exploited using an early fusion and a late fusion approach. In addition we explored whether the 101 concept scores obtained from the semantic pathfinder aid in the detection of camera work.
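The modified detector's data flow can be sketched as follows, assuming a 2×4 arrangement for the 8-region tessellation (the paper does not state the grid shape) and an external per-region direction estimator.

```python
import numpy as np

def tessellate(frame, rows=2, cols=4):
    """Split a frame into 8 regions; the 2x4 grid is assumed."""
    h, w = frame.shape[:2]
    return [frame[r * h // rows:(r + 1) * h // rows,
                  c * w // cols:(c + 1) * w // cols]
            for r in range(rows) for c in range(cols)]

def early_fusion_features(direction_params):
    """Early fusion: concatenate the per-region direction parameters
    into one vector for a single supervised classifier."""
    return np.concatenate(direction_params)

def late_fusion_score(region_probabilities):
    """Late fusion: combine per-region posteriors for one camera-work
    class, e.g. pan; averaging is assumed, the paper does not name
    the combination rule."""
    return float(np.mean(region_probabilities))
```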

4.1 Experiments

Experiments on validation set D indicate that average precision results increase drastically, especially for pan (+51%) and tilt (+28%); see Table 2. The best approach is a late fusion scheme without the usage of context. Relative to other participants we performed quite well in precision, but quite badly in terms of recall. These results indicate that the base detector is too conservative. However, they also show that any global image based camera work detector has the potential to profit from a tessellation of region-based detectors.

5 Lexicon-driven Retrieval

We propose a lexicon-driven retrieval paradigm to equip users with semantic access to multimedia archives. The aim is to retrieve from a multimedia archive S, which is composed of n unique shots {s1, s2, ..., sn}, the best possible answer set in response to a user information need. To that end, we use the 101 concepts in the lexicon as well as the 3 types of camera work for our automatic, manual, and interactive search systems.

5.1 Automatic Search

Our automatic search engine uses only topic text as input [10], as we postulate that it is unreasonable to expect a user to provide a video search system with example videos in a real-world scenario. We rely purely on text and the lexicon of 101 semantic concept detectors that we have developed using the semantic pathfinder (see Section 3) to search through the video collection. We developed our search system using the video data, topics, and ground truths from the 2003 and 2004 TRECVID evaluations as a training set.

5.1.1 Indexing Components

Our automatic search system incorporates regular TFIDF-based indices for standard retrieval using the bfx-bfx [24] formula, Latent Semantic Indexing [5] for text retrieval with implicit query expansion, and the 101 different semantic concept indices for query-by-concept. Each index was matched to one or more concepts, or synsets, in the WordNet [13] lexical database on an individual basis, according to whether the concept directly matches the content of the detectors. For example, the detector for the concept baseball finds shots of baseball games, and these shots invariably include baseball players, baseball equipment, and a baseball diamond, so these concepts are also matched. Additional synsets are added to WordNet for semantic concepts that do not have a direct WordNet equivalent.

5.1.2 Automatic Query Interface Selection

We perform the standard stopping and stemming procedures on the topic text (using the SMART stop list [23] with the addition of the words find and shots, and the Porter stemming algorithm [20], respectively). In addition, we perform part-of-speech tagging and chunking using the TreeTagger [26]. This grammatical information is used to identify two different query categorizations: complex vs. simple queries and general vs. specific queries. Any topic containing more than one noun chunk is classified as complex, as it refers to more than one object, while requests containing only a single noun chunk are classified as simple. If a request contains a name (a proper noun) it refers to a specific object, rather than a general category, so we categorize all requests containing proper nouns as specific requests, and all others as general requests.
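A sketch of this categorization, substituting NLTK's tagger and a textbook noun-chunk grammar for the TreeTagger used by the authors:

```python
import nltk  # requires the punkt and averaged_perceptron_tagger data

CHUNKER = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def categorize_topic(text):
    """Classify a topic as complex/simple and specific/general."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    noun_chunks = [st for st in CHUNKER.parse(tagged).subtrees()
                   if st.label() == "NP"]
    # More than one noun chunk means the topic refers to several objects.
    complexity = "complex" if len(noun_chunks) > 1 else "simple"
    # A proper noun (NNP/NNPS) signals a specific request.
    has_proper = any(tag.startswith("NNP") for _, tag in tagged)
    return complexity, "specific" if has_proper else "general"

print(categorize_topic("tennis players on a court"))  # ('complex', 'general')
```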

Subsequently, we extract the WordNet words in the topic text through dictionary lookup of noun chunks and nouns. We identify the correct synset for WordNet words with multiple meanings through disambiguation. We evaluated


a number of disambiguation strategies using the WordNet::Similarity [17] resource, and found that for the purposes of our system, the best approach was to choose the most commonly occurring meaning of a word. Then we look for related semantic concept index synsets in the hypernym and hyponym trees of each of the topic synsets. If an index synset is found, we calculate the similarity between the two synsets using the Resnik similarity measure [21].
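The disambiguation and matching step can be sketched with NLTK's WordNet interface, which provides Resnik similarity against the Brown information-content file. The example index synsets are hypothetical, and for brevity the sketch scores all indices directly instead of walking the hypernym and hyponym trees.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic  # requires the wordnet and wordnet_ic data

BROWN_IC = wordnet_ic.ic("ic-brown.dat")

def match_concept_index(topic_word, index_synsets):
    """Pick the concept index most related to a topic word.

    Disambiguation takes the most common sense, i.e. synsets[0] in
    WordNet's frequency ordering, the strategy the paper found best."""
    senses = wn.synsets(topic_word, pos=wn.NOUN)
    if not senses:
        return None
    topic = senses[0]
    # Resnik similarity: information content of the deepest common
    # ancestor of the two synsets.
    return max((topic.res_similarity(syn, BROWN_IC), name)
               for name, syn in index_synsets.items())

indices = {"baseball": wn.synset("baseball.n.01"),
           "car": wn.synset("car.n.01")}
print(match_concept_index("pitcher", indices))
```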

Finally, queries are formed. We create both a stemmed and an unstemmed TFIDF query using all of the topic terms. We create an extra TFIDF query on proper nouns only for specific topics, and a query on all nouns only for general topics. For the LSI index we also create a query using all of the topic terms, and in addition we create an additional query using proper nouns only for specific topics, and all nouns for general topics. Finally, we select the concept index with the highest Resnik similarity to a topic synset as the best match, and query on this concept.

5.1.3 Combining Query Results

We use a tiered approach for result fusion, first fusing the text results from the TFIDF and LSI searches individually, then fusing the resultant two sets, and finally combining them with the results from the semantic concept search. We use weighted Borda fusion to combine results, and developed the weights through optimization experiments on the training set. We use results from unstemmed searches to boost stemmed results for simple topics, as these benefit from using the exact spelling to search on text. We also boost text searches with a search on proper nouns for specific topics, as proper nouns are a good indicator of result relevance.
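A minimal sketch of the tiered weighted Borda fusion; the weights and toy shot lists are invented for illustration, since the optimized weights are not published.

```python
def weighted_borda(lists, weights, n):
    """Fuse ranked lists: each list awards Borda points (n for the
    top shot, n-1 for the next, ...) scaled by its weight."""
    scores = {}
    for ranked, w in zip(lists, weights):
        for rank, shot in enumerate(ranked):
            scores[shot] = scores.get(shot, 0.0) + w * (n - rank)
    return sorted(scores, key=scores.get, reverse=True)

tfidf_ranked = ["s3", "s1", "s7"]
lsi_ranked = ["s1", "s3", "s9"]
concept_ranked = ["s7", "s9", "s1"]
# Tier 1: fuse the two text runs; tier 2: add the concept run.
text = weighted_borda([tfidf_ranked, lsi_ranked], [0.6, 0.4], n=3)
final = weighted_borda([text, concept_ranked], [0.7, 0.3], n=3)
print(final)
```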

When combining text results with concept results, we use two measures developed specifically for WordNet by Resnik [21]: concept information content and similarity (previously mentioned). The information content measure captures the specificity of a concept: as a concept becomes more abstract, the information content decreases. When the matching index concept has high information content, and the words in the concept do not, we give priority to the concept results. Likewise, when the matched concept index is very similar to the topic, we give the concept search a very high weighting.

5.2 Manual Search

Our manual search approach investigates the power of lexicon-driven retrieval used in a visual-only setting. We put the principle of lexicon-driven retrieval to the test by using only the 101 concepts in answering the queries. Furthermore, we test the hypothesis that visual information, this year, is significantly more important than textual information. To test the impact of visual information, we use no other modality whatsoever, and rely only on visual features. This entails training a Support Vector Machine on the vector of contextures as introduced in Section 3.1.1. This SVM is trained for every one of the 101 concepts with the whole development set as a training set. This lexicon of 101 visual concepts is subsequently used in answering the queries. For each query, we manually select one or two concepts that fit the question, and use the outcome of these detectors as our final answer to the question.

5.3 Interactive Search

Our interactive search system stores the probabilities of all detected concepts and types of camera work for each shot in a database. In addition to learning, the paradigm also facilitates multimedia analysis at a similarity level. In the similarity component, 2 similarity functions are applied to index the data in the visual and textual modality. This results in 2 similarity distances for all shots, which are stored in a database. The MediaMill search engine offers users access to the stored indexes and the video data in the form of 106 query interfaces: 2 query-by-similarity interfaces, 101 query-by-concept interfaces, and 3 query-by-camera-work interfaces. The query interfaces emphasize the lexicon-driven nature of the paradigm. Each query interface acts as a ranking operator Φ_i on the multimedia archive S, where i ∈ {1, 2, ..., 106}. The search engine stores results of each ranking operator in a ranked list ρ_i, which we denote by:

  ρ_i = Φ_i(S).   (3)

The search engine handles the query requests, combines the results, and displays them to an interacting user. Within the paradigm, we perceive of interaction as a combination of querying the search engine and selecting relevant results using one of many display visualizations. A schematic overview of the retrieval paradigm is given in Fig. 9.

To support browsing with advanced visualizations, the data is further processed. The high-dimensional feature space is projected to the 2D visualization space to allow for visual browsing. Clusters, and representatives for each cluster, are identified to support hierarchical browsing. Finally, semantic threads are identified to allow for fast semantic browsing. For interactive search, users map topics to query-by-multimodal-concept or query-by-keyword to create a set of candidate results to explore. When there is a one-to-one relation between the query and the concept, a rank-time browsing method is employed. In other cases, the set forms the starting point for visual, hierarchical, or semantic browsing. The browsing methods are supported by advanced visualization and active learning tools.

5.3.1 Multimedia Similarity Indexing

After all the concepts are detected, the low-level features are usually ignored. We believe, however, that these features are still valuable in adding information to the results of query-by-concept search. Except for specific concepts, such as person X (Allawi, Bush, Blair) and the USA flag, most of the provided concepts have a general meaning, like sport, animal, maps, or drawing. These concepts can be classified further into sub-concepts. For instance, the maps concept may contain maps in a weather forecast, or a map of a country in a news report. Hence, we allow users to distinguish query-by-concept results further based on low-level features.

Figure 9: The lexicon-driven paradigm for interactive multimedia retrieval combines learning, similarity, and interaction. It learns to detect a lexicon of 101 semantic concepts together with 3 types of camera work. In addition, it computes 2 similarity distances. A search engine then presents 2 interfaces for query-by-similarity, 3 interfaces for query-by-camera-work, and 101 interfaces for query-by-concept. Based on interaction a user may refine search results until an acceptable standard is reached.

There are different options for selecting low-level features: using colors, textures, shapes, or combinations of those. We use the visual concept features from the visual analysis step of the semantic pathfinder, see Section 3.1.1. We exploit the same 15 proto-concepts, but now with 6 different parameter sets for each shot. Those values are represented as a feature vector per shot. All the shots with their corresponding feature vectors build up a 90-dimensional feature space.

Obtaining the best performance on retrieving images depends not only on the features, but also on the selection of an appropriate similarity function. The aim is to choose the distance function that returns the maximum number of relevant images in its nearest neighbors. Based on experimental results we choose the L2 measure as a distance function.

5.3.2 Combining Query Results

Combination by Linear Weighting. To reorder ranked lists of results, we first determine the rank r_ij of shot s_j over the various ρ_i, denoted by:

  r_ij = ρ_i(s_j).   (4)

We define a weight function w(·) that computes the weight of s_j in ρ_i based on r_ij. This linear weight function gives a higher weight to shots that are retrieved in the top of ρ_i and gradually reduces to 0. It is defined as:

  w(r_ij) = (n − r_ij + 1) / n.   (5)

We aggregate the results for each shot s_j by adding the contribution from each ranked list ρ_i. We then use the final ranking operator Φ* to rank all shots from S in descending order based on this new weight. This combination method yields a final ranked list of results ρ*, defined as:

  ρ* = Φ*( Σ_{i=1}^{m} w(r_ij) ),  j = 1, 2, ..., n,   (6)

where m indicates the number of selected query interfaces.
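Equations (4) to (6) translate directly into code. The sketch below assumes shots are identified by integer ids 0..n−1 and gives unranked shots zero weight, a detail the paper leaves unspecified.

```python
import numpy as np

def fuse_ranked_lists(ranked_lists, n):
    """Implement Eqs. (4)-(6): rank r_ij of shot s_j in list rho_i,
    linear weight w(r) = (n - r + 1) / n summed over the m selected
    query interfaces, and the final descending sort Phi*."""
    weight = np.zeros(n)
    for rho in ranked_lists:                  # rho: shot ids, best first
        r = np.full(n, n + 1)                 # unranked shots get weight 0
        r[rho] = np.arange(1, len(rho) + 1)   # r_ij, Eq. (4)
        weight += np.maximum(n - r + 1, 0) / n  # w(r_ij), Eq. (5)
    return np.argsort(-weight)                # rho*, Eq. (6)

# Three query interfaces ranking 6 shots (ids 0..5).
print(fuse_ranked_lists([[2, 0, 4], [0, 2], [5, 2]], n=6))
```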


Combination by Semantic Threads. The generated concept probabilities more or less describe the content of each shot. However, since there is only a limited number of categories for detection, a problem arises when a shot doesn't fit into any category, i.e. each individual concept detector returns a near-zero value. All shots with all concept values below a threshold could simply be removed. However, some detectors produce low-value results while their top-ranked shots are still correct. This needs to be taken into account when combining shots. We use a round-robin pruning procedure to ensure that at least the top-N shots from each concept detector are included, even when that detector has very low values compared to other detectors.

Each remaining shot now contains at least one detected concept. With this information a distance measurement between shots can be created. But how do we measure distance between concept vectors? If we assume equal distances between concepts, we can construct a distance matrix made up from the similarity S_pq between shots p and q using well-known distance metrics such as Euclidean distance or histogram intersection. Given the computed distance between shots, it is possible to find groups of related shots using clustering techniques. Currently we use K-means clustering.

Now that clusters of related shots exist, the task of forming a single coherent line of shots from each cluster must be examined. We apply a shortest path algorithm, so that shots next to each other usually have a very low distance to each other, which means that shots with similar semantic content are near each other.
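A sketch of the thread construction: K-means on the concept probability vectors, followed by a greedy nearest-neighbor walk through each cluster as a simple stand-in for the shortest-path ordering, which the paper does not detail.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_threads(concept_vectors, n_threads=10):
    """Group shots by concept probabilities and unroll each cluster
    into a line of semantically similar shots."""
    labels = KMeans(n_clusters=n_threads, n_init=10).fit_predict(concept_vectors)
    threads = []
    for k in range(n_threads):
        members = list(np.flatnonzero(labels == k))
        if not members:
            continue
        thread = [members.pop(0)]
        while members:  # append the remaining shot closest to the last one
            last = concept_vectors[thread[-1]]
            nxt = min(members,
                      key=lambda j: np.linalg.norm(concept_vectors[j] - last))
            thread.append(nxt)
            members.remove(nxt)
        threads.append(thread)
    return threads

# 200 shots, 101 concept probabilities each.
rng = np.random.default_rng(2)
threads = semantic_threads(rng.random((200, 101)))
```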

5.4 Display of Results


For effective interaction, an interface for communicating between the user and the system is needed. We consider two issues that are required for an effective interface:

(1) For query specification, support should be given to explore the collection in search of good examples, as the user seldom has a good example at his/her disposal.


Figure 10: Interfaces of the MediaMill semantic video search engine. On the left the CrossBrowser showing results for tennis. On top the SphereBrowser, displaying several semantic threads. Bottom right: active learning using a semantic cluster-based visualization in the GalaxyBrowser.

Most existing systems browse keyframes in sequence (left-right, top-down) [28]. Hence, relations between frames are not taken into account. For effective interaction this may be inappropriate, as the user cannot benefit from the inherent structure found in video collections. Therefore, (2) in the visualization, relations between keyframes should be taken into account to allow selection of several frames by one user action.

For these reasons, visualization of keyframes, including support for browsing and exploring, is essential in an interactive search system. We explored three advanced visualizations.

5.4.1 Cross Browser

To visualize query-by-concept results we propose a CrossBrowser. The browser displays two orthogonal dimensions. The horizontal one is the time thread, using the original TRECVID shot sequence. The vertical dimension contains the ranked list of query results. The GUI gives the user a cross layout of nearby shots on the screen. It exploits the observation that semantically similar shots tend to cluster in the time dimension. The resulting browser is visible in Fig. 10.

5.4.2 Galaxy Browser

To speed up the search within the time limitation, we want to support the user with a system in which they are able to select more than one keyframe in one mouse action. It can be assumed that the keyframes relevant to a search topic share similar features. Hence, they should be close to each other in the feature space. Therefore, visualization based on the similarity between them will make the search easier, as similar images are grouped together in a specific location of the search space. Hence, fewer navigation and interaction actions will be needed. We propose the GalaxyBrowser, which integrates advanced similarity-based visualization with active learning.

The similarity-based visualization of [15] is the basis for our retrieval. In brief, we have pointed out that an optimal visualization system has to obey three requirements: overview, structure preservation, and visibility. The first requirement ensures that the displayed set represents the whole collection, the so-called representative set. For user interaction, the collection should be projected to the display space. Hence, the second requirement tries to preserve the relations between keyframes in the original feature space. The final requirement keeps the content of displayed keyframes feasible for interaction.


These are conflicting requirements. For example, to satisfy the overview requirement, the number of representative keyframes should be increased. Because of the fixed size of the display space, more keyframes mean a higher chance of overlap, so the visibility requirement will be violated. Moreover, while preserving visibility, images are spread out from each other, and the original relations between them are changed, i.e. structure is not preserved. Therefore, cost functions for each requirement and balancing functions between them are proposed.

Active learning algorithms mostly use support vector machines (SVM) as a feedback learning base [38, 33]. In interactive search, using this approach, the system first shows some images and asks the user to label those as positive and/or negative. The learning is either based on both positive and negative examples (known as two-class SVM) or on positive/negative ones only (known as one-class SVM). These examples are used to train the SVM to learn classifiers separating positive and negative examples. The process is repeated until the performance satisfies given constraints. We have done a comparison between the two approaches; the results show that one-class SVM generally performs better than two-class, as well as being faster in returning the result. We concentrate on the use of one-class SVM for learning the relevance feedback.
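One feedback round can be sketched with scikit-learn's OneClassSVM, which wraps the same LIBSVM library [4] the authors use; all parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def feedback_round(features, positive_ids, shown_ids, k=20):
    """Fit a one-class SVM on the user's positive keyframes, then
    propose the k unseen keyframes closest to the decision border."""
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
    model.fit(features[positive_ids])
    margin = model.decision_function(features)   # signed border distance
    candidates = [i for i in np.argsort(np.abs(margin))
                  if i not in shown_ids]          # never re-show keyframes
    return model, candidates[:k]

def final_ranking(model, features):
    """Final result: keyframes farthest inside the border are assumed
    most likely relevant to the search topic."""
    return np.argsort(-model.decision_function(features))
```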

The combination of the two techniques is drawn into one scheme (see Fig. 11). The offline stage contains feature extraction and similarity function selection. The ISOSNE from [15] is applied to project the collection from the high-dimensional space to the visualization space. The next step decides which set of keyframes will be used as the representative one. To do so, we employ the k-means algorithm to cluster keyframes into a fixed number of groups. A set of keyframes selected from the different groups serves as the representative set. The information of each keyframe belonging to a certain group, and its position in the visualization space, is stored as offline data.

In the interactive stage, query results are the input for starting up the search. First, the set of top k keyframes from the query results is displayed. The user then uses the system to explore the collection and find relevant keyframes. In particular, if the currently displayed set contains any positive one, the user selects that keyframe and goes into the corresponding cluster with the expectation of finding more similar ones. With the advantage of similarity-based visualization, instead of clicking on an individual keyframe for labeling, the system supports the user with mouse dragging to draw the area of keyframes in the same category. This means that when the user finds a group of relevant keyframes, he/she draws a rectangle around those and marks them all as positive examples. Therefore, our system can reduce the number of actions from the user with the same amount of information for relevance feedback. In case there is no positive keyframe in the current set, the user then asks the system to display another set, which contains the next k keyframes from the query results. Keyframes which are selected as training examples or displayed before will not be shown again.

Figure 11: Scheme of an interactive search in the GalaxyBrowser with the combination of active learning and similarity-based visualization.

In the learning step, when a certain number of training examples are provided, the SVM trains the support vectors. We use the well-known SVM library developed by Chang and Lin [4], which provides a one-class implementation. After the learning, a set of images closest to the border is returned. The process is repeated until a certain constraint is satisfied, such as the number of iterations, a time limitation, or simply that the user does not want to give any more feedback. At that point, the system returns the final result containing keyframes with maximum distances to the border, as they are assumed to have high probabilities of being relevant to the search topic.

5.4.3 Sphere Browser

To visualize the thread structure a so-called SphereBrowser [22] was developed. The browser displays two orthogonal dimensions. The horizontal one is the time thread, using the original TRECVID shot sequence. The vertical dimension contains, for each shot, cluster threads of semantically similar footage. The GUI gives the user a spherical layout of nearby shots on the screen. Using the mouse and arrow keys the user can then navigate either through time or through related shots, selecting relevant shots when found. Also selecting (parts


TRECVID 2005 Overall Search Results

IyadMahmoudSearch Topic

Figure 12: Comparison of automatic, manual, and interactive search results for 24 topics. Results for the users of the lexicon-driven retrieval paradigm are indicated with special markers.


5.5 Experiments

5.5.1 Automatic Search

We submitted two runs for automatic search: one baseline run using the final text search strategy only, and one full run incorporating text and semantic concepts. As can be seen in Fig. 12, the combined semantic and text run outperformed the text run on nearly all counts. We did best for those topics that had a clear mapping to the semantic concept indices, i.e. tennis for topic 156, meeting for topic 163 (achieving the best result for this topic), and basketball for topic 165. In some cases the concept weighting strategy was not optimal, for example for topic 158. In this case we detected the aircraft index, but the concept results were given a weight of 0 in the result fusion, because the information content of the concept helicopter was calculated to be much higher than the information content of the concept aircraft. If we had utilized the aircraft detector in this case, we would have achieved an average precision of 0.17, which is higher than the best evaluated average precision of 0.14. We have demonstrated that automatic search using only text as input is a realistic task. We perform better than the median for a number of topics, and even achieve the best score for one topic. Postulating that all other systems incorporate multimodal examples in their search, this is a significant result. The performance of our search engine is best when one or more related indices are present; we expect that the results of our system will improve as we add more semantic concept indices, using our semantic pathfinder strategy.
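The weighting issue can be illustrated with a small information-content computation in the spirit of Resnik [21], where rarer concepts score higher. The frequencies below are invented, and winner-takes-all fusion is assumed purely for illustration.

```python
# Illustrative information-content weighting: IC(c) = -log p(c).
import math

concept_freq = {"aircraft": 1200, "helicopter": 90}   # hypothetical shot counts
total = 100000                                        # hypothetical corpus size

ic = {c: -math.log(f / total) for c, f in concept_freq.items()}
# ic["helicopter"] (~7.0) > ic["aircraft"] (~4.4), so winner-takes-all fusion
# would give the aircraft detector a weight of 0 for topic 158, as above.
best = max(ic, key=ic.get)   # -> "helicopter"
```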

5.5.2 Manual Search

We submitted one run for manual search, where we only use the 101 concepts in the lexicon to answer the queries. Moreover, we restrict ourselves to using only visual information. For thirteen topics we score above the median. Specifically, for two queries, i.e. vehicle with flames (160) and tennis players (156), we perform the best of all manual runs, and for two other queries, i.e. people with banners (161) and basketball players (165), we are second best. For ten queries we score below the median; three of those are not covered by our lexicon, and seven are person-x type queries. We perform badly for person-x queries because the features describe visual scene layout; consequently, names and faces are not modeled. For the remaining fourteen topics there is only one, i.e. boat (164), where the text baseline is better. Compared to our automatic search text baseline, we perform worse on eight queries. Of those eight queries, the text baseline performs better for all person-x queries, and for one other query (164). Consequently, a visual-only approach outperforms the text baseline in 16 queries, including the out-of-lexicon queries.

We believe our results support the lexicon-driven retrieval approach and show the importance of visual analysis. Despite the obvious disadvantages of using only visual information, we outperform the text baseline, and even score the best of all manual runs in two queries.


Figure 13: Overview of all search runs submitted to TRECVID 2005 (mean average precision per system run). Users who exploited the proposed paradigm are indicated with special markers.

5.5.3 Interactive Search

We submitted four runs for interactive search. Three users focused on using only one browser; the fourth user mixed all browsers. Results in Fig. 12 indicate that for most search topics, users of the proposed paradigm for interactive multimedia retrieval score above average. Furthermore, users of our approach obtain a top-3 average precision result for 19 out of 24 topics. Best performance is obtained for 7 topics. Best results are obtained with the CrossBrowser.

Depending on the search topic, the proposed GalaxyBrowser aids users in searching for the relevant subset of the collection. As the features used are visually based, the system works well in case relevant images of a certain topic share visual similarity, e.g. queries related to tennis or car. However, when topics have a large variety in visual settings, for instance person-x topics, visual features hardly yield additional information to aid the user in the interactive search process. To our knowledge, no existing features work well in these cases.

Two search strategies were discovered during the interactive retrieval task using the Sphere Browser. There were topics for which multiple cluster threads yielded good results, such as Tennis (156), People with banners or signs (161), Meeting (163), and Tall building (170). For these topics only the relevant parts of the threads needed to be selected. Another selection method was found in queries such as Airplane takeoff (167) and Office setting (172). Here there were only a limited number of consecutive valid shots visible in each thread, but because of the combination of both time and cluster threads there was always another valid but not yet selected shot visible. For these queries, selection was done by hopping from one valid result to another. A number of topics were not answerable with the Sphere Browser because of a lack of nearby shots. These include person-x topics 149, 151, and 153.

To gain insight into the overall quality of our lexicon-driven retrieval paradigm, we compare the results of our users with all other users that participated in the retrieval tasks of the 2005 TRECVID benchmark. We visualize the results for all submitted search runs in Fig. 13. The results are state-of-the-art.

6 Exploration of BBC Rushes

The BBC rushes consist of raw material used to produce a video. Since there is little to no speech, this material is very suitable for visual-only indexing. We first segmented the videos using our shot segmentation algorithm [2]. Then we applied our best performing camera motion detector (see Section 4) on the BBC rushes, using the models trained on the news data. To further investigate the robustness of our visual features, we performed visual-only concept detection on the BBC rushes data, without re-training the visual models. The visual models are the same as used in the visual-only feature task (Section 3.4) and in the manual search task (Section 5.2). The detectors thus learned on news data are subsequently evaluated on the BBC rushes videos. Obviously, not all 101 concepts are useful, since they are trained on broadcast news. However, 25 concepts transcend the news domain and some perform surprisingly well on the BBC rushes: aircraft, bird, boat, building, car, charts, cloud, crowd, face, female, food, government building, grass, meeting, mountain, outdoor, overlayed text, sky, smoke, tower, tree, urban, vegetation, vehicle, water body. We developed a version of the MediaMill semantic video search engine tailored to the BBC rushes, based on the computed indexes. While still primitive in terms of utility, the search engine allows users to explore the collection in a surprising manner. The results again confirm the importance of robust visual features. Hence, for this task much is to be expected from improved visual analysis yielding a large lexicon of semantic concepts.
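The cross-domain experiment amounts to the following sketch, assuming stored scikit-learn-style probabilistic models and a precomputed feature matrix; all file names and the concept subset are hypothetical.

```python
# Apply concept detectors trained on news data to BBC rushes, unchanged.
import joblib
import numpy as np

bbc_feats = np.load("bbc_rushes_features.npy")      # visual features per keyframe

for concept in ["aircraft", "boat", "car", "grass", "sky", "vegetation"]:
    model = joblib.load(f"news_models/{concept}.pkl")   # trained on news only
    probs = model.predict_proba(bbc_feats)[:, 1]        # no re-training
    top = np.argsort(-probs)[:50]                       # best-scoring keyframes
    print(concept, top[:5])
```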

Acknowledgements

We thank NIST for the evaluation effort, Kevin Walker for solving hard disk problems, the DCU team for the creation of keyframes, and Alex Hauptmann for the missing machine translations.

References

[1] A. Amir et al. IBM research TRECVID-2003 video retrieval system. In Proc. of the TRECVID Workshop, Gaithersburg, USA, 2003.

[2] J. Baan et al. Lazy users and automatic video retrieval tools in (the) lowlands. In E. Voorhees and D. Harman, editors, Proc. of the 10th Text REtrieval Conference, volume 500-250, Gaithersburg, USA, 2001.

[3] Blinkx Video Search, 2005.

[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[5] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. J. American Society Inform. Science, 41(6):391–407, 1990.

[6] J. Geusebroek. Visual object recognition, 2005. Patent PCT/NL2005/000485, filed July 6.

[7] J. Geusebroek, R. van den Boomgaard, A. Smeulders, and H. Geerts. Color invariance. IEEE Trans. PAMI, 23(12):1338–1350, 2001.

[8] J. Geusebroek and A. W. M. Smeulders. A six-stimulus theory for stochastic texture. International Journal of Computer Vision, 62(1/2):7–16, 2005.

[9] Google Video Search, 2005.

[10] B. Huurnink. AutoSeek: Towards a fully automated video search system. Master's thesis, Universiteit van Amsterdam, 2005.

[11] A. Jain, R. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Trans. PAMI, 22(1):4–37, 2000.

[12] P. Joly and H.-K. Kim. Efficient automatic analysis of camera work and microsegmentation of video using spatiotemporal images. Signal Processing: Image Communication, 8(4):295–307, 1996.

[13] G. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[14] M. Naphade. On supervision and statistical learning for semantic multimedia analysis. J. Visual Commun. Image Representation, 15(3):348–369, 2004.

[15] G. Nguyen and M. Worring. Similarity based visualization of image collections. In Int'l Worksh. Audio-Visual Content and Information Visualization in Digital Libraries, 2005.

[16] NIST. TRECVID Video Retrieval Evaluation, 2001–2005. http://www-nlpir.nist.gov/projects/trecvid/.

[17] T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity - measuring the relatedness of concepts. In Nat'l Conf. Artificial Intelligence, San Jose, USA, 2004.

[18] C. Petersohn. Fraunhofer HHI at TRECVID 2004: Shot boundary detection system. In Proc. of the TRECVID Workshop, Gaithersburg, USA, 2004.

[19] J. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 2000.

[20] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[21] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In International Joint Conference on Artificial Intelligence, San Mateo, USA, 1995.

[22] O. de Rooij. Browsing news video using semantic threads. Master's thesis, Universiteit van Amsterdam, 2006.

[23] G. Salton. The SMART retrieval system: experiments in automatic document processing. Prentice-Hall, Englewood Cliffs, USA, 1971.

[24] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

[25] T. Sato, T. Kanade, E. Hughes, M. Smith, and S. Satoh. Video OCR: Indexing digital news libraries by recognition of superimposed captions. Multimedia Systems, 7(5):385–395, 1999.

[26] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 1994.

[27] H. Schneiderman and T. Kanade. Object detection using the statistics of parts. International Journal of Computer Vision, 56(3):151–177, 2004.

[28] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. PAMI, 22(12):1349–1380, 2000.

[29] C. Snoek and M. Worring. Multimedia event-based video indexing using time intervals. IEEE Trans. Multimedia, 7(4):638–647, 2005.

[30] C. Snoek, M. Worring, J. Geusebroek, D. Koelma, F. Seinstra, and A. Smeulders. The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing. IEEE Trans. PAMI, 2006. In press.

[31] C. Snoek, M. Worring, and A. Hauptmann. Learning rich semantics from news video archives by style analysis. ACM Trans. Multimedia Computing, Communications and Applications, 2(2), May 2006. In press.

[32] C. Snoek, M. Worring, and A. Smeulders. Early versus late fusion in semantic video analysis. In ACM Multimedia, Singapore, 2005.

[33] S. Tong and E. Chang. Support vector machine active learning for image retrieval. In ACM Multimedia, pages 107–118, Ottawa, Canada, 2001.

[34] Y. Tonomura, A. Akutsu, Y. Taniguchi, and G. Suzuki. Structured video computing. IEEE Multimedia, 1(3):34–43, 1994.

[35] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, USA, 2nd edition, 2000.

[36] T. Volkmer, J. Smith, A. Natsev, M. Campbell, and M. Naphade. A web-based system for collaborative annotation of large image and video collections. In ACM Multimedia, Singapore, 2005.

[37] T. Westerveld, R. Cornacchia, J. van Gemert, D. Hiemstra, and A. de Vries. An integrated approach to text and image retrieval – the lowlands team at TRECVID 2005. In Proc. of the TRECVID Workshop, Gaithersburg, USA, 2005.

[38] X. Zhou and T. Huang. Relevance feedback in image retrieval: a comprehensive overview. Multimedia Systems, 8(6):536–544, 2003.
