Video Google: A Text Retrieval Approach to Object Matching in Videos


A classic work on large-scale retrieval


Josef Sivic and Andrew Zisserman

Robotics Research Group, Department of Engineering Science

University of Oxford, United Kingdom

Abstract

We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video. The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors.

The analogy with text retrieval is in the implementation where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. The result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of Google.

The method is illustrated for matching on two full length feature films.

1. Introduction

The aim of this work is to retrieve those key frames and shots of a video containing a particular object with the ease, speed and accuracy with which Google retrieves text documents (web pages) containing particular words. This paper investigates whether a text retrieval approach can be successfully employed for object recognition.

Identifying an (identical) object in a database of images is now reaching some maturity. It is still a challenging problem because an object's visual appearance may be very different due to viewpoint and lighting, and it may be partially occluded, but successful methods now exist. Typically an object is represented by a set of overlapping regions each represented by a vector computed from the region's appearance. The region segmentation and descriptors are built with a controlled degree of invariance to viewpoint and illumination conditions. Similar descriptors are computed for all images in the database. Recognition of a particular object proceeds by nearest neighbour matching of the descriptor vectors, followed by disambiguating using local spatial coherence (such as neighbourhoods, ordering, or spatial layout), or global relationships (such as epipolar geometry). Examples include [5, 6, 8, 11, 13, 12, 14, 16, 17].

We explore whether this type of approach to recognition can be recast as text retrieval. In essence this requires a visual analogy of a word, and here we provide this by vector quantizing the descriptor vectors. However, it will be seen that pursuing the analogy with text retrieval is more than a simple optimization over different vector quantizations. There are many lessons and rules of thumb that have been learnt and developed in the text retrieval literature and it is worth ascertaining if these also can be employed in visual retrieval.

The benefit of this approach is that matches are effectively pre-computed so that at run-time frames and shots containing any particular object can be retrieved with no delay. This means that any object occurring in the video (and conjunctions of objects) can be retrieved even though there was no explicit interest in these objects when descriptors were built for the video. However, we must also determine whether this vector quantized retrieval misses any matches that would have been obtained if the former method of nearest neighbour matching had been used.

Review of text retrieval: Text retrieval systems generally employ a number of standard steps [1]. The documents are first parsed into words. Second the words are represented by their stems, for example 'walk', 'walking' and 'walks' would be represented by the stem 'walk'. Third a stop list is used to reject very common words, such as 'the' and 'an', which occur in most documents and are therefore not discriminating for a particular document. The remaining words are then assigned a unique identifier, and each document is represented by a vector with components given by the frequency of occurrence of the words the document contains. In addition the components are weighted in various ways (described in more detail in section 4), and in the case of Google the weighting of a web page depends on the number of web pages linking to that particular page [3]. All of the above steps are carried out in advance of actual retrieval, and the set of vectors representing all the documents in a corpus is organized as an inverted file [18] to facilitate efficient retrieval. An inverted file is structured like an ideal book index. It has an entry for each word in the corpus followed by a list of all the documents (and position in that document) in which the word occurs.

A text is retrieved by computing its vector of word frequencies and returning the documents with the closest (measured by angles) vectors. In addition the match on the ordering and separation of the words may be used to rank the returned documents.

Paper outline: Here we explore visual analogies of each of these steps. Section 2 describes the visual descriptors used. Section 3 then describes their vector quantization into visual 'words', and section 4 weighting and indexing for the vector model. These ideas are then evaluated on a ground truth set of frames in section 5. Finally, a stop list and ranking (by a match on spatial layout) are introduced in section 6, and used to evaluate object retrieval throughout two feature films: 'Run Lola Run' ('Lola Rennt') [Tykwer, 1999], and 'Groundhog Day' [Ramis, 1993].

Although previous work has borrowed ideas from the text retrieval literature for image retrieval from databases (e.g. [15] used the weighting and inverted file schemes) to the best of our knowledge this is the first systematic application of these ideas to object retrieval in videos.

2. Viewpoint invariant description

Two types of viewpoint covariant regions are computed for each frame. The first is constructed by elliptical shape adaptation about an interest point. The method involves iteratively determining the ellipse centre, scale and shape. The scale is determined by the local extremum (across scale) of a Laplacian, and the shape by maximizing intensity gradient isotropy over the elliptical region [2, 4]. The implementation details are given in [8, 13]. This region type is referred to as Shape Adapted (SA).

The second type of region is constructed by selecting areas from an intensity watershed image segmentation. The regions are those for which the area is approximately stationary as the intensity threshold is varied. The implementation details are given in [7]. This region type is referred to as Maximally Stable (MS).

Two types of regions are employed because they detect different image areas and thus provide complementary representations of a frame. The SA regions tend to be centered on corner like features, and the MS regions correspond to blobs of high contrast with respect to their surroundings such as a dark window on a gray wall. Both types of regions are represented by ellipses. These are computed at twice the originally detected region size in order for the image appearance to be more discriminating. For a 720 × 576 pixel video frame the number of regions computed is typically 1600. An example is shown in Figure 1.

Each elliptical affine invariant region is represented by a 128-dimensional vector using the SIFT descriptor developed by Lowe [5]. In [9] this descriptor was shown to be superior to others used in the literature, such as the response of a set of steerable filters [8] or orthogonal filters [13], and we have also found SIFT to be superior (by comparing scene retrieval results against ground truth as in section 5.1). The reason for this superior performance is that SIFT, unlike the other descriptors, is designed to be invariant to a shift of a few pixels in the region position. Combining the SIFT descriptor with affine covariant regions gives region description vectors which are invariant to affine transformations of the image. Note that both region detection and description are computed on monochrome versions of the frames; colour information is not currently used in this work.

Figure 1: Top row: Two frames showing the same scene from very different camera viewpoints (from the film 'Run Lola Run'). Middle row: frames with detected affine invariant regions superimposed. 'Maximally Stable' (MS) regions are in yellow. 'Shape Adapted' (SA) regions are in cyan. Bottom row: Final matched regions after indexing and spatial consensus. Note that the correspondences define the scene overlap between the two frames.

To reduce noise and reject unstable regions, information is aggregated over a sequence of frames. The regions detected in each frame of the video are tracked using a simple constant velocity dynamical model and correlation. Any region which does not survive for more than three frames is rejected. Each region of the track can be regarded as an independent measurement of a common scene region (the pre-image of the detected region), and the estimate of the descriptor for this scene region is computed by averaging the descriptors throughout the track. This gives a measurable improvement in the signal to noise of the descriptors (which again has been demonstrated using the ground truth tests of section 5.1).
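The track aggregation step can be illustrated with a minimal sketch. It assumes the tracker has already grouped descriptors into tracks; the function name `average_track_descriptors`, the dictionary layout and the 2-dimensional toy descriptors are hypothetical and stand in for the 128-dimensional SIFT vectors.

```python
def average_track_descriptors(tracks, min_survival=3):
    """Return {track_id: mean descriptor} for tracks that survive
    for more than `min_survival` frames; shorter tracks are rejected."""
    averaged = {}
    for track_id, descriptors in tracks.items():
        if len(descriptors) <= min_survival:
            continue  # unstable region: does not survive more than three frames
        dim = len(descriptors[0])
        # Component-wise mean of the descriptors along the track.
        averaged[track_id] = [
            sum(d[j] for d in descriptors) / len(descriptors) for j in range(dim)
        ]
    return averaged


# Toy data: one track seen in four frames, one seen in only two.
tracks = {
    "stable": [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0], [2.0, 2.0]],
    "flicker": [[9.0, 9.0], [8.0, 8.0]],
}
means = average_track_descriptors(tracks)
```

Only the "stable" track is kept here, and its averaged descriptor is what would later be vector quantized.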

3. Building a visual vocabulary

The objective here is to vector quantize the descriptors into clusters which will be the visual 'words' for text retrieval. Then when a new frame of the movie is observed each descriptor of the frame is assigned to the nearest cluster, and this immediately generates matches for all frames throughout the movie. The vocabulary is constructed from a sub-part of the movie, and its matching accuracy and expressive power are evaluated on the remainder of the movie, as described in the following sections.

The vector quantization is carried out here by K-means clustering, though other methods (K-medoids, histogram binning, etc) are certainly possible.
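As a rough illustration of vector quantization by K-means, the sketch below clusters a handful of toy points and then assigns each point to its nearest centre, which plays the role of its visual word. It is a plain Lloyd-style K-means written for readability, not the implementation used in the paper: the function names, the fixed iteration count, the single run with a fixed seed and the 2-dimensional data are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centre(point, centres):
    """Index of the closest cluster centre, i.e. the visual word id."""
    return min(range(len(centres)), key=lambda c: squared_distance(point, centres[c]))


def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Initialise the centres with k randomly chosen data points.
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for p in points:
            members[nearest_centre(p, centres)].append(p)
        for c, group in enumerate(members):
            if group:  # keep the previous centre if a cluster becomes empty
                centres[c] = [sum(col) / len(group) for col in zip(*group)]
    return centres


# Two well separated groups of toy descriptors.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centres = kmeans(points, k=2)
words = [nearest_centre(p, centres) for p in points]
```

Descriptors from a new frame would be quantized with `nearest_centre` against the same centres, so two regions match exactly when they receive the same word id.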

Figure 2: Samples from the clusters corresponding to a single visual word. (a) Two examples of clusters of Shape Adapted regions. (b) Two examples of clusters of Maximally Stable regions.

3.1. Implementation

Regions are tracked through contiguous frames, and a mean vector descriptor $\bar{x}_i$ computed for each of the i regions. To reject unstable regions the 10% of tracks with the largest diagonal covariance matrix are rejected. This generates an average of about 1000 regions per frame.

Each descriptor is a 128-vector, and to simultaneously cluster all the descriptors of the movie would be a gargantuan task. Instead a subset of 48 shots is selected (these shots are discussed in more detail in section 5.1) covering about 10k frames which represent about 10% of all the frames in the movie. Even with this reduction there are still 200K averaged track descriptors that must be clustered.

To determine the distance function for clustering the Mahalanobis distance is computed as follows: it is assumed that the covariance Σ is the same for all tracks, and this is computed by estimating from all the available data, i.e. all descriptors for all tracks in the 48 shots. The Mahalanobis distance enables the more noisy components of the 128-vector to be weighted down, and also decorrelates the components. Empirically there is a small degree of correlation. The distance function between two descriptors (represented by their mean track descriptors) $\bar{x}_1$, $\bar{x}_2$, is then given by

$$d(\bar{x}_1, \bar{x}_2) = \sqrt{(\bar{x}_1 - \bar{x}_2)^\top \Sigma^{-1} (\bar{x}_1 - \bar{x}_2)}.$$

As is standard, the descriptor space is affine transformed by the square root of Σ so that Euclidean distance may be used.
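The step from the Mahalanobis distance to a Euclidean one can be written out explicitly. The short derivation below assumes that Σ is symmetric positive definite, so that a symmetric inverse square root $\Sigma^{-1/2}$ exists; the symbol $y_j$ for the transformed descriptor is introduced here only for illustration.

```latex
\begin{align}
d(\bar{x}_1,\bar{x}_2)^2
  &= (\bar{x}_1-\bar{x}_2)^\top \Sigma^{-1} (\bar{x}_1-\bar{x}_2) \\
  &= (\bar{x}_1-\bar{x}_2)^\top \Sigma^{-1/2}\,\Sigma^{-1/2} (\bar{x}_1-\bar{x}_2) \\
  &= \lVert y_1 - y_2 \rVert^2 ,
  \qquad y_j = \Sigma^{-1/2}\,\bar{x}_j .
\end{align}
```

Under this reading, the affine transformation of the descriptor space amounts to multiplying each descriptor by $\Sigma^{-1/2}$ once, after which ordinary Euclidean K-means operates on the $y_j$.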

About 6k clusters are used for Shape Adapted regions, and about 10k clusters for Maximally Stable regions. The ratio of the number of clusters for each type is chosen to be approximately the same as the ratio of detected descriptors of each type. The number of clusters is chosen empirically to maximize retrieval results on the ground truth set of section 5.1. The K-means algorithm is run several times with random initial assignments of points as cluster centres, and the best result used.

Figure 2 shows examples of regions belonging to particular clusters, i.e. which will be treated as the same visual word. The clustered regions reflect the properties of the SIFT descriptors which penalize variations amongst regions less than cross-correlation. This is because SIFT emphasizes orientation of gradients, rather than the position of a particular intensity within the region.

The reason that SA and MS regions are clustered separately is that they cover different and largely independent regions of the scene. Consequently, they may be thought of as different vocabularies for describing the same scene, and thus should have their own word sets, in the same way as one vocabulary might describe architectural features and another the state of repair of a building.

4. Visual indexing using text retrieval methods

In text retrieval each document is represented by a vector of word frequencies. However, it is usual to apply a weighting to the components of this vector [1], rather than use the frequency vector directly for indexing. Here we describe the standard weighting that is employed, and then the visual analogy of document retrieval to frame retrieval.
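A minimal sketch of the weighting and ranking treated in this section may help fix ideas. It computes tf-idf components of the form $t_i = (n_{id}/n_d)\log(N/n_i)$ and ranks frames by the cosine of the angle to the query vector. The function names and the toy word ids are hypothetical. The code takes $n_i$ as the number of occurrences of word i in the whole database, following the definition given in this paper; many text retrieval systems instead count the documents that contain the word, and with the definition used here a word occurring more than N times receives a negative weight.

```python
import math
from collections import Counter


def tfidf_weights(word_ids, occurrences, n_docs):
    """Sparse tf-idf vector {word: t_i} with t_i = (n_id / n_d) * log(N / n_i)."""
    counts = Counter(word_ids)  # n_id for every word present in the document
    n_d = len(word_ids)         # total number of words in the document
    return {
        w: (c / n_d) * math.log(n_docs / occurrences[w])
        for w, c in counts.items()
        if occurrences.get(w, 0) > 0  # ignore words never seen in the database
    }


def cosine(u, v):
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_frames(query_words, frames):
    """Frame indices ordered by decreasing cosine similarity to the query."""
    n_docs = len(frames)
    occurrences = Counter(w for frame in frames for w in frame)  # n_i
    frame_vectors = [tfidf_weights(f, occurrences, n_docs) for f in frames]
    query_vector = tfidf_weights(query_words, occurrences, n_docs)
    scores = [cosine(query_vector, fv) for fv in frame_vectors]
    return sorted(range(n_docs), key=lambda d: scores[d], reverse=True)


# Each frame is a list of visual word ids; the query shares words with frames 2 and 3.
frames = [[1, 2, 2], [3, 4], [5, 6, 0], [7, 8, 0]]
ranking = rank_frames([0, 5], frames)
```

Frame 2 shares both query words and frame 3 only the more common word 0, so frame 2 is ranked ahead of frame 3, and both precede the frames with no word in common.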


The standard weighting is known as 'term frequency-inverse document frequency', tf-idf, and is computed as follows. Suppose there is a vocabulary of k words, then each document is represented by a k-vector $V_d = (t_1, \ldots, t_i, \ldots, t_k)^\top$ of weighted word frequencies with components

$$t_i = \frac{n_{id}}{n_d} \log \frac{N}{n_i}$$

where $n_{id}$ is the number of occurrences of word i in document d, $n_d$ is the total number of words in the document d, $n_i$ is the number of occurrences of term i in the whole database and N is the number of documents in the whole database. The weighting is a product of two terms: the word frequency $n_{id}/n_d$, and the inverse document frequency $\log(N/n_i)$. The intuition is that word frequency weights words occurring often in a particular document, and thus describe it well, whilst the inverse document frequency downweights words that appear often in the database.

At the retrieval stage documents are ranked by their normalized scalar product (cosine of angle) between the query vector $V_q$ and all document vectors $V_d$ in the database.

In our case the query vector is given by the visual words contained in a user specified sub-part of a frame, and the other frames are ranked according to the similarity of their weighted vectors to this query vector. Various weighting models are evaluated in the following section.

5. Experimental evaluation of scene matching using visual words

Here the objective is to match scene locations within a closed world of shots [12]. The method is evaluated on 164 frames from 48 shots taken at 19 different 3D locations in the movie Run Lola Run. We have between 4-9 frames from each location. Examples of three frames from each of four different locations are shown in figure 3a. There are significant viewpoint changes over the triplets of frames shown for the same location. Each frame of the triplet is from a different (and distant in time) shot in the movie.

In the retrieval tests the entire frame is used as a query region. The retrieval performance is measured over all 164 frames using each in turn as a query region. The correct retrieval consists of all the other frames which show the same location, and this ground truth is determined by hand for the complete 164 frame set.

The retrieval performance is measured using the average normalized rank of relevant images [10] given by

$$\mathrm{Rank} = \frac{1}{N N_{rel}} \left( \sum_{i=1}^{N_{rel}} R_i - \frac{N_{rel}(N_{rel}+1)}{2} \right)$$

where $N_{rel}$ is the number of relevant images for a particular query image, N is the size of the image set, and $R_i$ is the rank of the i-th relevant image. In essence Rank is zero if all $N_{rel}$ images are returned first. The Rank measure lies in the range 0 to 1, with 0.5 corresponding to random retrieval.

5.1. Ground truth image set results

Figure 3b shows the average normalized rank using each image of the data set as a query image with the tf-idf weighting described in section 4. The benefit in having two feature types is evident. The combination of both clearly gives better performance than either one alone. The performance of each feature type varies for different frames or locations. For example, in frames 46-49 MS regions perform better, and conversely for frames 126-127 SA regions are superior. The retrieval ranking is perfect for 17 of the 19 locations, even those with significant viewpoint changes. The ranking results are less impressive for images 61-70 and 119-121, though even in these cases the frame matches are not missed, just low ranked. This is due to a lack of regions in the overlapping part of the scene, see figure 4. This is not a problem of vector quantization (the regions that are in common are correctly matched), but due to few features being detected for this type of scene (pavement texture). We return to this point in section 7.

Table 1 shows the mean of the Rank measure computed from all 164 images for three standard text retrieval term weighting methods [1]. The tf-idf weighting outperforms both the binary weights (i.e. the vector components are one if the image contains the descriptor, zero otherwise) and term frequency weights (the components are the frequency of word occurrence). The differences are not very significant for the ranks averaged over the whole ground truth set. However, for particular frames (e.g. 49) the difference can be as high as 0.1.

The average precision recall curve for all frames is shown in figure 3c. For each frame as a query, we have computed precision as the number of relevant images (i.e. of the same location) relative to the total number of frames retrieved, and recall as the number of correctly retrieved frames relative to the number of relevant frames. Again the benefit of combining the two feature types is clear.

These retrieval results demonstrate that there is no loss of performance in using vector quantization (visual words) compared to direct nearest neighbour (or ε-nearest neighbour) matching of invariants [12].

This ground truth set is also used to learn the system parameters including: the number of cluster centres; the minimum tracking length for stable features; and the proportion of unstable descriptors to reject based on their covariance.

          SA      MS      SA+MS
binary    0.0265  0.0237  0.0165
tf        0.0275  0.0208  0.0153
tf-idf    0.0209  0.0196  0.0132

Table 1: The mean of the Rank measure computed from all 164 images of the ground truth set for different term weighting methods.

Figure 4: Top: Frames 61 and 64 from the ground truth data set. A poor ranking score is obtained for this pair. Bottom: superimposed detected affine invariant regions. The careful reader will note that, due to the very different viewpoints, only two of the 564 (left) and 533 (right) regions correspond between frames.

6. Object retrieval

In this section we evaluate searching for objects throughout the entire movie. The object of interest is specified by the user as a sub-part of any frame.

A feature length film typically has 100K-150K frames. To reduce complexity one keyframe is used per second of the video. Descriptors are computed for stable regions in each keyframe and the mean values are computed using two frames either side of the keyframe. The descriptors are vector quantized using the centres clustered from the ground truth set.

Here we are also evaluating the expressiveness of the visual vocabulary since frames outside the ground truth set contain new objects and scenes, and their detected regions have not been included in forming the clusters.

Figure 3: Ground truth data. (a) Each row shows a frame from three different shots of the same location in the ground truth data set. (b) Average normalized rank for location matching on the ground truth set. (c) Average Precision-Recall curve for location matching on the ground truth set.

6.1. Stop list

Using a stop list analogy the most frequent visual words that occur in almost all images are suppressed. Figure 5 shows the frequency of visual words over all the keyframes of Lola. The top 5% and bottom 10% are stopped. In our case the very common words are due to large clusters of over 3K points. The stop list boundaries were determined empirically to reduce the number of mismatches and size of the inverted file while keeping sufficient visual vocabulary.
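The stop list can be sketched as follows. The code assumes that "top 5%" and "bottom 10%" refer to fractions of the vocabulary after sorting the visual words by how often they occur over all keyframes; that reading, the function name `build_stop_list` and the toy keyframes are assumptions made for illustration.

```python
from collections import Counter


def build_stop_list(keyframes, top_percent=5, bottom_percent=10):
    """Visual words to suppress: the most and the least frequent ones."""
    frequency = Counter(w for frame in keyframes for w in frame)
    # Most frequent first; ties are broken by word id to keep the result deterministic.
    ordered = sorted(frequency, key=lambda w: (-frequency[w], w))
    n_top = len(ordered) * top_percent // 100
    n_bottom = len(ordered) * bottom_percent // 100
    stopped = set(ordered[:n_top])
    if n_bottom:
        stopped.update(ordered[-n_bottom:])
    return stopped


def apply_stop_list(frame, stopped):
    return [w for w in frame if w not in stopped]


# Toy data with 20 words: word w occurs in 20 - w of the 20 keyframes.
keyframes = [list(range(n)) for n in range(1, 21)]
stop_list = build_stop_list(keyframes)
filtered = apply_stop_list(keyframes[-1], stop_list)
```

With 20 words in the toy vocabulary, one word is removed from the frequent end and two from the rare end; the remaining words are the ones entered into the inverted file.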


Figure 6 shows the benefit of imposing a stop list: the very common visual words occur at many places in the image and are responsible for mis-matches. Most of these are removed once the stop list is applied. The removal of the remaining mis-matches is described next.

6.2. Spatial consistency

Google increases the ranking for documents where the searched for words appear close together in the retrieved texts (measured by word order). This analogy is especially relevant for querying objects by a subpart of the image, where matched covariant regions in the retrieved frames should have a similar spatial arrangement [12, 14] (compactness) to those of the outlined region in the query image. The idea is implemented here by first retrieving frames using the weighted frequency vector alone, and then re-ranking them based on a measure of spatial consistency.

Spatial consistency can be measured quite loosely simply by requiring that neighbouring matches in the query region lie in a surrounding area in the retrieved frame. It can also be measured very strictly by requiring that neighbouring matches have the same spatial layout in the query region and retrieved frame. In our case the matched regions provide the affine transformation between the query and retrieved image so a point to point map is available for this strict measure.

We have found that the best performance is obtained in the middle of this possible range of measures. A search area is defined by the 15 nearest neighbours of each match, and each region which also matches within this area casts a vote for that frame. Matches with no support are rejected. The total number of votes determines the rank of the frame. This works very well as is demonstrated in the last row of figure 6, which shows the spatial consistency rejection of incorrect matches. The object retrieval examples of figures 7 to 9 employ this ranking measure and amply demonstrate its usefulness.
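One possible reading of this voting scheme is sketched below. It is an interpretation rather than the paper's implementation: neighbours are searched among the matched regions only, a match is supported by every other match that lies among its k nearest neighbours in both the query region and the retrieved frame, and region positions are reduced to 2-D points. The function names and the toy coordinates are hypothetical, and k is set to 2 in the example because the toy data has only seven matches.

```python
def k_nearest(index, positions, k):
    """Indices of the k positions closest to positions[index], excluding itself."""
    x0, y0 = positions[index]
    others = [j for j in range(len(positions)) if j != index]
    others.sort(key=lambda j: (positions[j][0] - x0) ** 2 + (positions[j][1] - y0) ** 2)
    return set(others[:k])


def spatial_consistency(matches, k=15):
    """matches: list of ((xq, yq), (xf, yf)) position pairs for matched regions.
    Returns the supported matches and the total vote count for the frame."""
    query_pts = [q for q, _ in matches]
    frame_pts = [f for _, f in matches]
    votes = []
    for i in range(len(matches)):
        # Matches that are neighbours of match i in both images support it.
        support = k_nearest(i, query_pts, k) & k_nearest(i, frame_pts, k)
        votes.append(len(support))
    kept = [m for m, v in zip(matches, votes) if v > 0]  # reject unsupported matches
    return kept, sum(votes)


# Two consistent groups of matches plus one outlier whose query position is
# near the first group but whose frame position is near the second.
outlier = ((4, 4), (208, 208))
matches = [
    ((0, 0), (5, 5)), ((10, 0), (15, 5)), ((0, 10), (5, 15)),
    ((200, 200), (205, 205)), ((210, 200), (215, 205)), ((200, 210), (205, 215)),
    outlier,
]
kept, total_votes = spatial_consistency(matches, k=2)
```

The outlier has different neighbours in the two images, receives no votes and is dropped; the total vote count is what would be used to re-rank the retrieved frame.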

Other measures which take account of the affine mapping between images may be required in some situations, but this involves a greater computational expense.

Figure 5: Frequency of MS visual words among all 3768 keyframes of Run Lola Run (a) before, and (b) after, application of a stop list.

6.3. Object retrieval

Implementation - use of inverted files: In a classical file structure all words are stored in the document they appear in. An inverted file structure has an entry (hit list) for each word where all occurrences of the word in all documents are stored. In our case the inverted file has an entry for each visual word, which stores all the matches, i.e. occurrences of the same word in all frames. The document vector is very sparse and use of an inverted file makes the retrieval very fast. Querying a database of 4k frames takes about 0.1 second with a Matlab implementation on a 2GHz pentium.
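The inverted file structure can be illustrated with a small sketch. The layout (a dictionary from visual word id to a hit list of `(frame_id, position)` pairs), the function names and the toy frames are assumptions chosen for the example; here `position` is simply the index of the word in the frame's word list.

```python
from collections import defaultdict


def build_inverted_file(frames):
    """frames: {frame_id: list of visual word ids}.
    Returns {word: [(frame_id, position), ...]}, one hit list per visual word."""
    index = defaultdict(list)
    for frame_id, words in frames.items():
        for position, word in enumerate(words):
            index[word].append((frame_id, position))
    return index


def candidate_frames(index, query_words):
    """Frames sharing at least one visual word with the query,
    mapped to the number of distinct query words they contain."""
    hits = defaultdict(int)
    for word in set(query_words):
        for frame_id in {f for f, _ in index.get(word, [])}:
            hits[frame_id] += 1
    return dict(hits)


frames = {"f1": [3, 7, 7], "f2": [7, 9], "f3": [1, 2]}
index = build_inverted_file(frames)
candidates = candidate_frames(index, [7, 9])
```

Only the hit lists of the query's words are touched, so frames such as "f3" that share no word with the query are never visited; this is what makes retrieval fast when the document vectors are sparse.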

after filtering on spatial consistency.


Example queries: Figures 7 and 8 show results of two object queries for the movie 'Run Lola Run', and figure 9 shows the result of an object query on the film 'Groundhog Day'. Both movies contain about 4K keyframes. Both the actual frames returned and their ranking are excellent: as far as it is possible to tell, no frames containing the object are missed (no false negatives), and the highly ranked frames all do contain the object (good precision).

The object query results do demonstrate the expressive power of the visual vocabulary. The visual words learnt for Lola are used unchanged for the Groundhog Day retrieval.

7. Summary and Conclusions

The analogy with text retrieval really has demonstrated its worth: we have immediate run-time object retrieval throughout a movie database, despite significant viewpoint changes in many frames. The object is specified as a sub-part of an image, and this has proved sufficient for quasi-planar rigid objects.

There are, of course, improvements that can be made mainly to overcome problems in the visual processing. Low rankings are currently due to a lack of visual descriptors for some scene types. However, the framework allows other existing affine co-variant regions to be added (they will define an extended visual vocabulary), for example those of [17]. Another improvement would be to define the object of interest over more than a single frame to allow for search on all its visual aspects.

The text retrieval analogy also raises interesting questions for future work. In text retrieval systems the textual vocabulary is not static, growing as new documents are added to the collection. Similarly, we do not claim that our vector quantization is universal for all images. So far we have learnt vector quantizations sufficient for two movies, but ways of upgrading the visual vocabulary will need to be found. One could think of learning visual vocabularies for different scene types (e.g. cityscape vs a forest).

Finally, we now have the intriguing possibility of following other successes of the text retrieval community, such as latent semantic indexing to find content, and automatic clustering to find the principal objects that occur throughout the movie.

Acknowledgements: We are grateful to David Lowe, Jiri Matas, Krystian Mikolajczyk and Frederik Schaffalitzky for supplying their region detector/descriptor codes. Thanks to Andrew Blake, Mark Everingham, Andrew Fitzgibbon, Krystian Mikolajczyk and Frederik Schaffalitzky for fruitful discussions. This work was funded by EC project VIBES.

Figure 7: Object query example I. First row: (left) frame with user specified query region (a poster) in yellow, and (right) close up of the query region. The four remaining rows show (left) the 1st, 12th, 16th, and 20th retrieved frames with the identified region of interest shown in yellow, and (right) a close up of the image with matched elliptical regions superimposed. In this case 20 keyframes were retrieved: six from the same shot as the query image, the rest from different shots at later points in the movie. All retrieved frames contain the specified object. Note the poster appears on various billboards throughout the movie (and Berlin).


Figure 8: Object query example II. Run Lola Run. First row: (left) query region, and (right) its close up. Next rows: The 9th, 16th and 25th retrieved frames (left) and object close-ups (right) with matched regions. 33 keyframes were retrieved. 31 contained the object. The two incorrect frames were ranked 29 and 30.

Figure 9: Object query example III. Groundhog Day. First row: (left) query region, and (right) its close up. Next rows: The 12th, 35th and 50th retrieved frames (left) and object close-ups with matched regions (right). 73 keyframes were retrieved of which 53 contained the object. The first incorrect frame was ranked 27th.

References

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, ISBN: 020139829, 1999.
[2] A. Baumberg. Reliable feature matching across widely separated views. In Proc. CVPR, pages 774–781, 2000.
[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In 7th Int. WWW Conference, 1998.
[4] T. Lindeberg and J. Gårding. Shape-adapted smoothing in estimation of 3-d depth cues from affine distortions of local 2-d brightness structure. In Proc. ECCV, LNCS 800, pages 389–400, 1994.
[5] D. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, pages 1150–1157, 1999.
[6] D. Lowe. Local feature view clustering for 3D object recognition. In Proc. CVPR, pages 682–688. Springer, 2001.
[7] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. BMVC, pages 384–393, 2002.
[8] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. ECCV. Springer-Verlag, 2002.
[9] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In CVPR, 2003.
[10] H. Müller, S. Marchand-Maillet, and T. Pun. The truth about corel - evaluation in image retrieval. In Proc. CIVR, 2002.
[11] Š. Obdržálek and J. Matas. Object recognition using local affine frames on distinguished regions. In Proc. BMVC, pages 113–122, 2002.
[12] F. Schaffalitzky and A. Zisserman. Automated scene matching in movies. In Proc. CIVR 2002, LNCS 2383, pages 186–197. Springer-Verlag, 2002.
[13] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or "How do I organize my holiday snaps?". In Proc. ECCV, volume 1, pages 414–431. Springer-Verlag, 2002.
[14] C. Schmid and R. Mohr. Local greyvalue invariants for image retrieval. IEEE PAMI, 19(5):530–534, 1997.
[15] D. M. Squire, W. Müller, H. Müller, and T. Pun. Content-based query of image databases: inspirations from text retrieval. Pattern Recognition Letters, 21:1193–1198, 2000.
[16] D. Tell and S. Carlsson. Combining appearance and topology for wide baseline matching. In Proc. ECCV, LNCS 2350, pages 68–81. Springer-Verlag, 2002.
[17] T. Tuytelaars and L. Van Gool. Wide baseline stereo matching based on local, affinely invariant regions. In Proc. BMVC, pages 412–425, 2000.
[18] I. H. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, ISBN: 1558605703, 1999.
