Video Google: A Text Retrieval Approach to Object Matching in Videos
Josef Sivic and Andrew Zisserman
Robotics Research Group, Department of Engineering Science
University of Oxford, United Kingdom
Abstract
We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video. The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors.
The analogy with text retrieval is in the implementation where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. The result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of Google.
The method is illustrated for matching on two full length feature films.
1. Introduction
The aim of this work is to retrieve those key frames and shots of a video containing a particular object with the ease, speed and accuracy with which Google retrieves text documents (web pages) containing particular words. This paper investigates whether a text retrieval approach can be successfully employed for object recognition.
Identifying an (identical) object in a database of images is now reaching some maturity. It is still a challenging problem because an object's visual appearance may be very different due to viewpoint and lighting, and it may be partially occluded, but successful methods now exist. Typically an object is represented by a set of overlapping regions, each represented by a vector computed from the region's appearance. The region segmentation and descriptors are built with a controlled degree of invariance to viewpoint and illumination conditions. Similar descriptors are computed for all images in the database. Recognition of a particular object proceeds by nearest neighbour matching of the descriptor vectors, followed by disambiguating using local spatial coherence (such as neighbourhoods, ordering, or spatial layout), or global relationships (such as epipolar geometry).
Examples include [5, 6, 8, 11, 12, 13, 14, 16, 17].
We explore whether this type of approach to recognition can be recast as text retrieval. In essence this requires a visual analogy of a word, and here we provide this by vector quantizing the descriptor vectors. However, it will be seen that pursuing the analogy with text retrieval is more than a simple optimization over different vector quantizations. There are many lessons and rules of thumb that have been learnt and developed in the text retrieval literature and it is worth ascertaining if these can also be employed in visual retrieval.
The benefit of this approach is that matches are effectively pre-computed so that at run-time frames and shots containing any particular object can be retrieved with no delay. This means that any object occurring in the video (and conjunctions of objects) can be retrieved even though there was no explicit interest in these objects when descriptors were built for the video. However, we must also determine whether this vector quantized retrieval misses any matches that would have been obtained if the former method of nearest neighbour matching had been used.

Review of text retrieval: Text retrieval systems generally employ a number of standard steps [1]. The documents are first parsed into words. Second, the words are represented by their stems, for example 'walk', 'walking' and 'walks' would be represented by the stem 'walk'. Third, a stop list is used to reject very common words, such as 'the' and 'an', which occur in most documents and are therefore not discriminating for a particular document. The remaining words are then assigned a unique identifier, and each document is represented by a vector with components given by the frequency of occurrence of the words the document contains. In addition the components are weighted in various ways (described in more detail in section 4), and in the case of Google the weighting of a web page depends on the number of web pages linking to that particular page [3]. All of the above steps are carried out in advance of actual retrieval, and the set of vectors representing all the documents in a corpus are organized as an inverted file [18] to facilitate efficient retrieval. An inverted file is structured like an ideal book index. It has an entry for each word in the corpus followed by a list of all the documents (and the position in that document) in which the word occurs.
A text is retrieved by computing its vector of word frequencies and returning the documents with the closest (measured by angles) vectors. In addition the match on the ordering and separation of the words may be used to rank the returned documents.
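To make these steps concrete, the following is a minimal sketch (not from the paper) of an inverted file with positional posting lists and cosine ranking over raw word-frequency vectors; the toy documents are invented and the stemming, stop-list and weighting stages are omitted.

```python
from collections import Counter, defaultdict
import math

# Toy corpus: each "document" is already parsed into words.
docs = {
    "d1": ["run", "lola", "run", "berlin"],
    "d2": ["groundhog", "day", "berlin"],
    "d3": ["lola", "berlin", "poster"],
}

# Inverted file: word -> list of (document id, positions of the word in that document).
inverted = defaultdict(list)
for doc_id, words in docs.items():
    positions = defaultdict(list)
    for pos, w in enumerate(words):
        positions[w].append(pos)
    for w, pos_list in positions.items():
        inverted[w].append((doc_id, pos_list))

def tf_vector(words):
    """Raw word-frequency vector as a sparse dict."""
    n = len(words)
    return {w: c / n for w, c in Counter(words).items()}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def query(words):
    # The inverted file lets us touch only documents containing at least one query word.
    candidates = {doc_id for w in words for doc_id, _ in inverted.get(w, [])}
    q = tf_vector(words)
    scores = {d: cosine(q, tf_vector(docs[d])) for d in candidates}
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(query(["lola", "berlin"]))
```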
Paper outline: Here we explore visual analogies of each of these steps. Section 2 describes the visual descriptors used. Section 3 then describes their vector quantization into visual 'words', and section 4 weighting and indexing for the vector model. These ideas are then evaluated on a ground truth set of frames in section 5. Finally, a stop list and ranking (by a match on spatial layout) are introduced in section 6, and used to evaluate object retrieval throughout two feature films: 'Run Lola Run' ('Lola Rennt') [Tykwer, 1999], and 'Groundhog Day' [Ramis, 1993].
Although previous work has borrowed ideas from the text retrieval literature for image retrieval from databases (e.g. [15] used the weighting and inverted file schemes), to the best of our knowledge this is the first systematic application of these ideas to object retrieval in videos.
2. Viewpoint invariant description
Two types of viewpoint covariant regions are computed for each frame. The first is constructed by elliptical shape adaptation about an interest point. The method involves iteratively determining the ellipse centre, scale and shape. The scale is determined by the local extremum (across scale) of a Laplacian, and the shape by maximizing intensity gradient isotropy over the elliptical region [2, 4]. The implementation details are given in [8, 13]. This region type is referred to as Shape Adapted (SA).
The second type of region is constructed by selecting areas from an intensity watershed image segmentation. The regions are those for which the area is approximately stationary as the intensity threshold is varied. The implementation details are given in [7]. This region type is referred to as Maximally Stable (MS).
Two types of regions are employed because they detect different image areas and thus provide complementary representations of a frame. The SA regions tend to be centered on corner-like features, and the MS regions correspond to blobs of high contrast with respect to their surroundings, such as a dark window on a gray wall. Both types of regions are represented by ellipses. These are computed at twice the originally detected region size in order for the image appearance to be more discriminating. For a 720 × 576 pixel video frame the number of regions computed is typically 1600. An example is shown in Figure 1.
Each elliptical affine invariant region is represented by a 128-dimensional vector using the SIFT descriptor developed by Lowe [5].
Figure 1: Top row: Two frames showing the same scene from very different camera viewpoints (from the film 'Run Lola Run'). Middle row: frames with detected affine invariant regions superimposed. 'Maximally Stable' (MS) regions are in yellow; 'Shape Adapted' (SA) regions are in cyan. Bottom row: Final matched regions after indexing and spatial consensus. Note that the correspondences define the scene overlap between the two frames.
In [9] this descriptor was shown to be superior to others used in the literature, such as the response of a set of steerable filters [8] or orthogonal filters [13], and we have also found SIFT to be superior (by comparing scene retrieval results against ground truth as in section 5.1). The reason for this superior performance is that SIFT, unlike the other descriptors, is designed to be invariant to a shift of a few pixels in the region position. Combining the SIFT descriptor with affine covariant regions gives region description vectors which are invariant to affine transformations of the image. Note that both the region detection and the description are computed on monochrome versions of the frames; colour information is not currently used in this work.
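As a rough illustration of this stage, the sketch below uses OpenCV's MSER detector and SIFT descriptor as stand-ins; it does not reproduce the paper's Shape Adapted detector, the elliptical shape estimation, or the exact measurement-region scaling, and the file name is hypothetical.

```python
import cv2
import numpy as np

# Stand-in for the region description stage: MSER regions (in the spirit of
# Maximally Stable regions) described by 128-D SIFT vectors on a monochrome frame.
frame = cv2.imread("keyframe.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file name

mser = cv2.MSER_create()
regions, _ = mser.detectRegions(frame)

# Approximate each detected region by a circular keypoint (centre + radius),
# measured at roughly twice the detected size, then compute SIFT on it.
keypoints = []
for pts in regions:
    (cx, cy), radius = cv2.minEnclosingCircle(pts.astype(np.float32))
    keypoints.append(cv2.KeyPoint(float(cx), float(cy), max(2.0 * radius, 1.0)))

sift = cv2.SIFT_create()
keypoints, descriptors = sift.compute(frame, keypoints)
# descriptors has shape (number of surviving regions, 128), or is None if no regions were found.
print(None if descriptors is None else descriptors.shape)
```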
To reduce noise and reject unstable regions, information is aggregated over a sequence of frames. The regions detected in each frame of the video are tracked using a simple constant velocity dynamical model and correlation. Any region which does not survive for more than three frames is rejected. Each region of the track can be regarded as an
independent measurement of a common scene region (the pre-image of the detected region), and the estimate of the descriptor for this scene region is computed by averaging the descriptors throughout the track. This gives a measurable improvement in the signal to noise of the descriptors (which again has been demonstrated using the ground truth tests of section 5.1).
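A minimal sketch of this aggregation step, assuming tracks are already available as lists of (frame index, descriptor) pairs; the constant velocity tracker itself is not shown and the data below are synthetic.

```python
import numpy as np

def aggregate_tracks(tracks, min_length=3):
    """tracks: one list of (frame_index, 128-D descriptor) pairs per tracked region.
    Rejects tracks that do not survive for min_length frames and returns one averaged
    descriptor per surviving track, reducing descriptor noise as in section 2."""
    averaged = []
    for track in tracks:
        if len(track) < min_length:      # region did not survive long enough -> unstable
            continue
        descs = np.stack([d for _, d in track])
        averaged.append(descs.mean(axis=0))
    return np.array(averaged)

# Hypothetical example: two tracks, only the first is long enough to keep.
rng = np.random.default_rng(0)
tracks = [
    [(t, rng.normal(size=128)) for t in range(5)],
    [(t, rng.normal(size=128)) for t in range(2)],
]
print(aggregate_tracks(tracks).shape)  # (1, 128)
```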
3. Building a visual vocabulary
The objective here is to vector quantize the descriptors into clusters which will be the visual 'words' for text retrieval. Then when a new frame of the movie is observed each descriptor of the frame is assigned to the nearest cluster, and this immediately generates matches for all frames throughout the movie. The vocabulary is constructed from a sub-part of the movie, and its matching accuracy and expressive power are evaluated on the remainder of the movie, as described in the following sections.
The vector quantization is carried out here by K-means clustering, though other methods (K-medoids, histogram binning, etc.) are certainly possible.
Figure 2: Samples from the clusters corresponding to a single visual word. (a) Two examples of clusters of Shape Adapted regions. (b) Two examples of clusters of Maximally Stable regions.
3.1. Implementation
Regions are tracked through contiguous frames, and a mean vector descriptor $\bar{x}_i$ computed for each of the $i$ regions. To reject unstable regions the 10% of tracks with the largest diagonal covariance matrix are rejected. This generates an average of about 1000 regions per frame.
Each descriptor is a 128-vector, and to simultaneously cluster all the descriptors of the movie would be a gargantuan task. Instead a subset of 48 shots is selected (these shots are discussed in more detail in section 5.1) covering about 10k frames, which represent about 10% of all the frames in the movie. Even with this reduction there are still 200K averaged track descriptors that must be clustered.

To determine the distance function for clustering, the Mahalanobis distance is computed as follows: it is assumed that the covariance $\Sigma$ is the same for all tracks, and this is computed by estimating from all the available data, i.e. all descriptors for all tracks in the 48 shots. The Mahalanobis distance enables the more noisy components of the 128-vector to be weighted down, and also decorrelates the components. Empirically there is a small degree of correlation. The distance function between two descriptors (represented by their mean track descriptors) $\bar{x}_1, \bar{x}_2$ is then given by

$$d(\bar{x}_1, \bar{x}_2) = \sqrt{(\bar{x}_1 - \bar{x}_2)^\top \Sigma^{-1} (\bar{x}_1 - \bar{x}_2)}.$$

As is standard, the descriptor space is affine transformed by the square root of $\Sigma$ so that Euclidean distance may be used.
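The affine transformation by the (inverse) square root of $\Sigma$ can be sketched as follows; the small ridge term added for numerical stability is my assumption, not part of the paper, and the data are synthetic.

```python
import numpy as np

def whiten(descriptors, eps=1e-6):
    """Affine-transform descriptors by the inverse square root of their covariance so
    that Euclidean distance in the new space equals the Mahalanobis distance
    d(x1, x2) = sqrt((x1 - x2)^T Sigma^{-1} (x1 - x2)).
    eps is a small ridge for numerical stability (not in the paper)."""
    mean = descriptors.mean(axis=0)
    centered = descriptors - mean
    sigma = np.cov(centered, rowvar=False) + eps * np.eye(descriptors.shape[1])
    # Symmetric eigendecomposition gives Sigma^{-1/2} directly.
    eigvals, eigvecs = np.linalg.eigh(sigma)
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return centered @ inv_sqrt, mean, inv_sqrt

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 128))
xw, mean, inv_sqrt = whiten(x)
# The whitened covariance is (numerically) the identity, so plain Euclidean
# distance between whitened vectors equals the Mahalanobis distance between originals.
print(np.abs(np.cov(xw, rowvar=False) - np.eye(128)).max())
```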
About 6k clusters are used for Shape Adapted regions, and about 10k clusters for Maximally Stable regions. The ratio of the number of clusters for each type is chosen to be approximately the same as the ratio of detected descriptors of each type. The number of clusters is chosen empirically to maximize retrieval results on the ground truth set of section 5.1. The K-means algorithm is run several times with random initial assignments of points as cluster centres, and the best result used.
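A small sketch of this vocabulary-building step using scikit-learn's KMeans as a stand-in; the synthetic descriptors, the cluster count and the number of restarts are illustrative only (the paper uses roughly 6k and 10k centres for the two region types, clustered separately on whitened descriptors).

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy descriptors standing in for the averaged, whitened track descriptors.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(5000, 128)).astype(np.float32)

# n_init restarts K-means from several random initialisations and keeps the best,
# mirroring the "run several times, keep the best result" strategy above.
kmeans = KMeans(n_clusters=100, n_init=5, random_state=0).fit(descriptors)

# Descriptors of a new frame are assigned to their nearest centre: these cluster
# indices are the frame's visual words.
new_frame_descriptors = rng.normal(size=(1600, 128)).astype(np.float32)
visual_words = kmeans.predict(new_frame_descriptors)
print(visual_words[:10])
```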
Figure 2 shows examples of regions belonging to particular clusters, i.e. which will be treated as the same visual word. The clustered regions reflect the properties of the SIFT descriptors, which penalize variations amongst regions less than cross-correlation. This is because SIFT emphasizes orientation of gradients, rather than the position of a particular intensity within the region.
The reason that SA and MS regions are clustered separately is that they cover different and largely independent regions of the scene. Consequently, they may be thought of as different vocabularies for describing the same scene, and thus should have their own word sets, in the same way as one vocabulary might describe architectural features and another the state of repair of a building.
4. Visual indexing using text retrieval methods
In text retrieval each document is represented by a vector of word frequencies. However, it is usual to apply a weighting to the components of this vector [1], rather than use the frequency vector directly for indexing. Here we describe the standard weighting that is employed, and then the visual analogy of document retrieval to frame retrieval.
The standard weighting is known as 'term frequency–inverse document frequency', tf-idf, and is computed as follows. Suppose there is a vocabulary of $k$ words, then each document is represented by a $k$-vector $V_d = (t_1, \ldots, t_i, \ldots, t_k)^\top$ of weighted word frequencies with components

$$t_i = \frac{n_{id}}{n_d} \log \frac{N}{n_i}$$

where $n_{id}$ is the number of occurrences of word $i$ in document $d$, $n_d$ is the total number of words in the document $d$, $n_i$ is the number of occurrences of term $i$ in the whole database and $N$ is the number of documents in the whole database. The weighting is a product of two terms: the word frequency $n_{id}/n_d$, and the inverse document frequency $\log N/n_i$. The intuition is that the word frequency weights words occurring often in a particular document, and thus describing it well, whilst the inverse document frequency downweights words that appear often in the database.

At the retrieval stage documents are ranked by their normalized scalar product (cosine of angle) between the query vector $V_q$ and all document vectors $V_d$ in the database.

In our case the query vector is given by the visual words contained in a user specified sub-part of a frame, and the other frames are ranked according to the similarity of their weighted vectors to this query vector. Various weighting models are evaluated in the following section.
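Before moving on, here is a minimal sketch of the tf-idf weighting and cosine ranking just described, applied to a toy matrix of visual-word counts; the array shapes, names and data are illustrative, not the authors' implementation.

```python
import numpy as np

def tfidf_weight(counts):
    """counts: (num_frames, vocab_size) array of visual-word counts n_id.
    Returns the weighted vectors with components t_i = (n_id / n_d) * log(N / n_i)."""
    N = counts.shape[0]                            # number of documents (frames)
    n_d = counts.sum(axis=1, keepdims=True)        # total words per frame
    n_i = counts.sum(axis=0, keepdims=True)        # occurrences of each word in the database
    tf = counts / np.maximum(n_d, 1)
    idf = np.log(N / np.maximum(n_i, 1))
    return tf * idf

def rank_frames(query_vec, frame_vecs):
    """Rank frames by the normalized scalar product (cosine of angle) with the query."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    f = frame_vecs / (np.linalg.norm(frame_vecs, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(f @ q))

rng = np.random.default_rng(0)
counts = rng.poisson(lam=0.05, size=(100, 1000))   # toy sparse counts: 100 frames, 1000 words
weighted = tfidf_weight(counts)
query = weighted[0]                                # e.g. words from a user-outlined region
print(rank_frames(query, weighted)[:5])            # frame 0 should rank first
```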
5. Experimental evaluation of scene matching using visual words

Here the objective is to match scene locations within a closed world of shots [12]. The method is evaluated on 164 frames from 48 shots taken at 19 different 3D locations in the movie 'Run Lola Run'. We have between 4 and 9 frames from each location. Examples of three frames from each of four different locations are shown in figure 3a. There are significant viewpoint changes over the triplets of frames shown for the same location. Each frame of the triplet is from a different (and distant in time) shot in the movie.

In the retrieval tests the entire frame is used as a query region. The retrieval performance is measured over all 164 frames using each in turn as a query region. The correct retrieval consists of all the other frames which show the same location, and this ground truth is determined by hand for the complete 164 frame set.

The retrieval performance is measured using the average normalized rank of relevant images [10] given by

$$\widetilde{\mathrm{Rank}} = \frac{1}{N\,N_{rel}} \left( \sum_{i=1}^{N_{rel}} R_i - \frac{N_{rel}(N_{rel}+1)}{2} \right)$$

where $N_{rel}$ is the number of relevant images for a particular query image, $N$ is the size of the image set, and $R_i$ is the rank of the $i$th relevant image. In essence the Rank is zero if all $N_{rel}$ images are returned first. The Rank measure lies in the range 0 to 1, with 0.5 corresponding to random retrieval.

5.1. Groundtruth image set results

Figure 3b shows the average normalized rank using each image of the dataset as a query image with the tf-idf weighting described in section 4. The benefit in having two feature types is evident: the combination of both clearly gives better performance than either one alone. The performance of each feature type varies for different frames or locations. For example, in frames 46-49 MS regions perform better, and conversely for frames 126-127 SA regions are superior. The retrieval ranking is perfect for 17 of the 19 locations, even those with significant viewpoint changes. The ranking results are less impressive for images 61-70 and 119-121, though even in these cases the frame matches are not missed, just low ranked. This is due to a lack of regions in the overlapping part of the scene, see figure 4. This is not a problem of vector quantization (the regions that are in common are correctly matched), but is due to few features being detected for this type of scene (pavement texture). We return to this point in section 7.

Table 1 shows the mean of the Rank measure computed from all 164 images for three standard text retrieval term weighting methods [1]. The tf-idf weighting outperforms both the binary weights (i.e. the vector components are one if the image contains the descriptor, zero otherwise) and term frequency weights (the components are the frequency of word occurrence). The differences are not very significant for the ranks averaged over the whole ground truth set. However, for particular frames (e.g. 49) the difference can be as high as 0.1.

The average precision-recall curve for all frames is shown in figure 3c. For each frame as a query, we have computed precision as the number of relevant images (i.e. of the same location) relative to the total number of frames retrieved, and recall as the number of correctly retrieved frames relative to the number of relevant frames. Again the benefit of combining the two feature types is clear.

Figure 3: Ground truth data. (a) Each row shows a frame from three different shots of the same location in the ground truth data set. (b) Average normalized rank for location matching on the ground truth set. (c) Average Precision-Recall curve for location matching on the ground truth set.

These retrieval results demonstrate that there is no loss of performance in using vector quantization (visual words) compared to direct nearest neighbour (or ε-nearest neighbour) matching of invariants [12].

This ground truth set is also used to learn the system parameters including: the number of cluster centres; the minimum tracking length for stable features; and the proportion of unstable descriptors to reject based on their covariance.
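For reference, the average normalized rank measure defined above can be computed as in the following sketch; the ranked list and relevant sets here are toy values.

```python
def normalized_rank(ranked_ids, relevant_ids):
    """Average normalized rank of section 5: 0 when all relevant frames are returned
    first, around 0.5 for random retrieval. ranked_ids is the retrieval order over the
    whole image set; relevant_ids is the ground-truth set for the query."""
    N = len(ranked_ids)
    relevant = set(relevant_ids)
    N_rel = len(relevant)
    # R_i: 1-based positions of the relevant images in the ranked list.
    ranks = [pos + 1 for pos, fid in enumerate(ranked_ids) if fid in relevant]
    return (sum(ranks) - N_rel * (N_rel + 1) / 2) / (N * N_rel)

ranked = list(range(10))
print(normalized_rank(ranked, relevant_ids=[0, 1, 2]))  # relevant frames first -> 0.0
print(normalized_rank(ranked, relevant_ids=[7, 8, 9]))  # relevant frames last -> larger value
```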
Table 1: The mean of the Rank measure computed from all 164 images of the ground truth set for different term weighting methods.

            SA       MS       SA+MS
  binary    0.0265   0.0237   0.0165
  tf        0.0275   0.0208   0.0153
  tf-idf    0.0209   0.0196   0.0132
Figure 4: Top: Frames 61 and 64 from the ground truth data set. A poor ranking score is obtained for this pair. Bottom: superimposed detected affine invariant regions. The careful reader will note that, due to the very different viewpoints, only two of the 564 (left) and 533 (right) regions correspond between frames.
6. Object retrieval

In this section we evaluate searching for objects throughout the entire movie. The object of interest is specified by the user as a sub-part of any frame.
A feature length film typically has 100K-150K frames. To reduce complexity one keyframe is used per second of the video. Descriptors are computed for stable regions in each keyframe and the mean values are computed using two frames either side of the keyframe. The descriptors are vector quantized using the centres clustered from the ground truth set.
Here we are also evaluating the expressiveness of the visual vocabulary, since frames outside the ground truth set contain new objects and scenes, and their detected regions have not been included in forming the clusters.
6.1. Stop list
Using a stop list analogy the most frequent visual words that occur in almost all images are suppressed. Figure 5 shows the frequency of visual words over all the keyframes of Lola. The top 5% and bottom 10% are stopped. In our case the very common words are due to large clusters of over 3K points. The stop list boundaries were determined empirically to reduce the number of mismatches and the size of the inverted file while keeping sufficient visual vocabulary.
Figure 6 shows the benefit of imposing a stop list: the very common visual words occur at many places in the image and are responsible for mis-matches. Most of these are removed once the stop list is applied. The removal of the remaining mis-matches is described next.
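One possible implementation of this stop list, assuming the total frequency of each visual word over all keyframes is available; the frequency data below are synthetic and the exact tie-breaking is my choice.

```python
import numpy as np

def build_stop_list(word_frequencies, top_frac=0.05, bottom_frac=0.10):
    """word_frequencies: total occurrence count of each visual word over all keyframes.
    Returns the set of word ids to suppress: the most frequent top_frac and the least
    frequent bottom_frac of the vocabulary (5% / 10% in section 6.1)."""
    order = np.argsort(word_frequencies)           # ascending frequency
    k = len(order)
    stopped = set(order[: int(bottom_frac * k)])   # rarest words
    stopped |= set(order[k - int(top_frac * k):])  # most common words
    return stopped

rng = np.random.default_rng(0)
freqs = rng.poisson(lam=20, size=10000)            # toy frequencies for a 10k-word vocabulary
stop = build_stop_list(freqs)
print(len(stop))                                   # about 1500 words suppressed

# At query time, visual words on the stop list are simply ignored:
query_words = [w for w in [5, 42, 9000] if w not in stop]
```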
6.2. Spatial consistency
Google increases the ranking for documents where the searched-for words appear close together in the retrieved texts (measured by word order). This analogy is especially relevant for querying objects by a sub-part of the image, where matched covariant regions in the retrieved frames should have a similar spatial arrangement [12, 14] (compactness) to those of the outlined region in the query image. The idea is implemented here by first retrieving frames using the weighted frequency vector alone, and then re-ranking them based on a measure of spatial consistency.

Spatial consistency can be measured quite loosely simply by requiring that neighbouring matches in the query region lie in a surrounding area in the retrieved frame. It can also be measured very strictly by requiring that neighbouring matches have the same spatial layout in the query region and retrieved frame. In our case the matched regions provide the affine transformation between the query and retrieved image, so a point to point map is available for this strict measure.
We have found that the best performance is obtained in the middle of this possible range of measures. A search area is defined by the 15 nearest neighbours of each match, and each region which also matches within this area casts a vote for that frame. Matches with no support are rejected. The total number of votes determines the rank of the frame. This works very well as is demonstrated in the last row of figure 6, which shows the spatial consistency rejection of incorrect matches. The object retrieval examples of figures 7 to 9 employ this ranking measure and amply demonstrate its usefulness.
Other measures which take account of the affine mapping between images may be required in some situations, but this involves a greater computational expense.
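The loose voting scheme can be sketched as below; this is one plausible reading of the idea (support counted among the 15 spatial nearest neighbours on both the query and retrieved sides), not the authors' code, and the match data are synthetic.

```python
import numpy as np

def spatial_consistency_votes(query_xy, frame_xy, k=15):
    """query_xy, frame_xy: (M, 2) positions of M tentatively matched regions, match i
    pairing query_xy[i] with frame_xy[i]. Each match votes for every other match that
    lies among its k nearest neighbours in BOTH images; matches with no support are
    rejected. Returns per-match votes of the kept matches and the frame score."""
    def knn_sets(xy):
        d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        return [set(np.argsort(row)[:k]) for row in d]

    nn_q, nn_f = knn_sets(query_xy), knn_sets(frame_xy)
    votes = np.array([len(nn_q[i] & nn_f[i]) for i in range(len(query_xy))])
    kept = votes > 0                        # matches with no support are rejected
    return votes[kept], int(votes.sum())    # total votes determine the frame's rank

rng = np.random.default_rng(0)
q = rng.uniform(size=(40, 2))
f = q + rng.normal(scale=0.01, size=(40, 2))   # consistent spatial layout -> many votes
votes, score = spatial_consistency_votes(q, f)
print(score)
```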
Figure 5: Frequency of MS visual words among all 3768 keyframes of Run Lola Run (a) before, and (b) after, application of a stop list.
6.3. Object retrieval
Implementation – use of inverted files: In a classical file structure all words are stored in the document they appear in. An inverted file structure has an entry (hit list) for each word where all occurrences of the word in all documents are stored. In our case the inverted file has an entry for each visual word, which stores all the matches, i.e. occurrences of the same word in all frames. The document vector is very sparse and use of an inverted file makes the retrieval very fast. Querying a database of 4k frames takes about 0.1 second with a Matlab implementation on a 2 GHz Pentium.
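A minimal sketch of query scoring through an inverted file over sparse weighted frame vectors; the data structure, names and toy data are illustrative stand-ins for the tf-idf frame vectors of section 4.

```python
from collections import defaultdict
import numpy as np

# Inverted file: visual word id -> list of (frame id, weight of that word in the frame).
# Built once, offline, from the weighted frame vectors.
inverted_file = defaultdict(list)

def index_frame(frame_id, weighted_vector):
    for word_id in np.flatnonzero(weighted_vector):
        inverted_file[word_id].append((frame_id, weighted_vector[word_id]))

def query_inverted(query_vector, frame_norms):
    """Scores only the frames that share a visual word with the query (sparse dot
    products accumulated word by word), then normalizes to the cosine similarity."""
    scores = defaultdict(float)
    for word_id in np.flatnonzero(query_vector):
        for frame_id, w in inverted_file[word_id]:
            scores[frame_id] += query_vector[word_id] * w
    q_norm = np.linalg.norm(query_vector) + 1e-12
    return sorted(scores, key=lambda f: -scores[f] / (frame_norms[f] * q_norm))

# Toy usage with invented sparse frame vectors:
rng = np.random.default_rng(0)
frames = {i: rng.random(1000) * (rng.random(1000) < 0.02) for i in range(200)}
frame_norms = {i: np.linalg.norm(v) + 1e-12 for i, v in frames.items()}
for i, v in frames.items():
    index_frame(i, v)
print(query_inverted(frames[3], frame_norms)[:5])  # frame 3 should rank first
```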
Figure 6: … after filtering on spatial consistency.
Example queries: Figures 7 and 8 show results of two object queries for the movie 'Run Lola Run', and figure 9 shows the result of an object query on the film 'Groundhog Day'. Both movies contain about 4K keyframes. Both the actual frames returned and their ranking are excellent – as far as it is possible to tell, no frames containing the object are missed (no false negatives), and the highly ranked frames all do contain the object (good precision).
The object query results do demonstrate the expressive power of the visual vocabulary. The visual words learnt for Lola are used unchanged for the Groundhog Day retrieval.
7. Summary and Conclusions
The analogy with text retrieval really has demonstrated its worth: we have immediate run-time object retrieval throughout a movie database, despite significant viewpoint changes in many frames. The object is specified as a sub-part of an image, and this has proved sufficient for quasi-planar rigid objects.
There are, of course, improvements that can be made, mainly to overcome problems in the visual processing. Low rankings are currently due to a lack of visual descriptors for some scene types. However, the framework allows other existing affine co-variant regions to be added (they will define an extended visual vocabulary), for example those of [17]. Another improvement would be to define the object of interest over more than a single frame to allow for search on all its visual aspects.
The text retrieval analogy also raises interesting questions for future work. In text retrieval systems the textual vocabulary is not static, growing as new documents are added to the collection. Similarly, we do not claim that our vector quantization is universal for all images. So far we have learnt vector quantizations sufficient for two movies, but ways of upgrading the visual vocabulary will need to be found. One could think of learning visual vocabularies for different scene types (e.g. a cityscape vs a forest).
Finally, we now have the intriguing possibility of following other successes of the text retrieval community, such as latent semantic indexing to find content, and automatic clustering to find the principal objects that occur throughout the movie.
Acknowledgements: We are grateful to David Lowe, Jiri Matas, Krystian Mikolajczyk and Frederik Schaffalitzky for supplying their region detector/descriptor codes. Thanks to Andrew Blake, Mark Everingham, Andrew Fitzgibbon, Krystian Mikolajczyk and Frederik Schaffalitzky for fruitful discussions. This work was funded by EC project VIBES.
Figure 7: Object query example I. First row: (left) frame with user specified query region (a poster) in yellow, and (right) close-up of the query region. The four remaining rows show (left) the 1st, 12th, 16th, and 20th retrieved frames with the identified region of interest shown in yellow, and (right) a close-up of the image with matched elliptical regions superimposed. In this case 20 keyframes were retrieved: six from the same shot as the query image, the rest from different shots at later points in the movie. All retrieved frames contain the specified object. Note the poster appears on various billboards throughout the movie (and Berlin).
References
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, ISBN: 020139829, 1999.
[2] A. Baumberg. Reliable feature matching across widely separated views. In Proc. CVPR, pages 774–781, 2000.
[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In 7th Int. WWW Conference, 1998.
[4] T. Lindeberg and J. Gårding. Shape-adapted smoothing in estimation of 3-D depth cues from affine distortions of local 2-D brightness structure. In Proc. ECCV, LNCS 800, pages 389–400, 1994.
[5] D. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, pages 1150–1157, 1999.
[6] D. Lowe. Local feature view clustering for 3D object recognition. In Proc. CVPR, pages 682–688. Springer, 2001.
[7] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. BMVC., pages 384–393, 2002.
[8] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. ECCV. Springer-Verlag, 2002.
[9] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In CVPR, 2003.
[10] H. Müller, S. Marchand-Maillet, and T. Pun. The truth about Corel – evaluation in image retrieval. In Proc. CIVR, 2002.
[11] Š. Obdržálek and J. Matas. Object recognition using local affine frames on distinguished regions. In Proc. BMVC., pages 113–122, 2002.
[12] F. Schaffalitzky and A. Zisserman. Automated scene matching in movies. In Proc. CIVR 2002, LNCS 2383, pages 186–197. Springer-Verlag, 2002.
[13] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or "How do I organize my holiday snaps?". In Proc. ECCV, volume 1, pages 414–431. Springer-Verlag, 2002.
[14] C. Schmid and R. Mohr. Local greyvalue invariants for image retrieval. IEEE PAMI, 19(5):530–534, 1997.
[15] D. M. Squire, W. Müller, H. Müller, and T. Pun. Content-based query of image databases: inspirations from text retrieval. Pattern Recognition Letters, 21:1193–1198, 2000.
[16] D. Tell and S. Carlsson. Combining appearance and topology for wide baseline matching. In Proc. ECCV, LNCS 2350, pages 68–81. Springer-Verlag, 2002.
[17] T. Tuytelaars and L. Van Gool. Wide baseline stereo matching based on local, affinely invariant regions. In Proc. BMVC., pages 412–425, 2000.
[18] I. H. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, ISBN: 1558605703, 1999.
Figure 9: Object query example III. 'Groundhog Day'. First row: (left) query region, and (right) its close-up. Next rows: the 12th, 35th and 50th retrieved frames (left) and object close-ups with matched regions (right). 73 keyframes were retrieved, of which 53 contained the object. The first incorrect frame was ranked 27th.
Figure 8: Object query example II. 'Run Lola Run'. First row: (left) query region, and (right) its close-up. Next rows: the 9th, 16th and 25th retrieved frames (left) and object close-ups (right) with matched regions. 33 keyframes were retrieved, of which 31 contained the object. The two incorrect frames were ranked 29 and 30.