
The BellKor Solution to the Netflix Grand Prize

Yehuda Koren

August 2009

I. INTRODUCTION

This article describes part of our contribution to the "BellKor's Pragmatic Chaos" final solution, which won the Netflix Grand Prize. The other portion of the contribution was created while working at AT&T with Robert Bell and Chris Volinsky, as reported in our 2008 Progress Prize report [3]. The final solution includes all the predictors described there. In this article we describe only the newer predictors.

So what is new over last year's solution? First, we further improved the baseline predictors (Sec. III). This in turn improves our other models, which incorporate those predictors, like the matrix factorization model (Sec. IV). In addition, an extension of the neighborhood model that addresses temporal dynamics was introduced (Sec. V). On the Restricted Boltzmann Machines (RBM) front, we use a new RBM model with superior accuracy by conditioning the visible units (Sec. VI). The final addition is the introduction of a new blending algorithm, which is based on gradient boosted decision trees (GBDT) (Sec. VII).

II. PRELIMINARIES

The Netflix dataset contains more than 100 million date-stamped movie ratings performed by anonymous Netflix customers between Dec 31, 1999 and Dec 31, 2005 [4]. This dataset gives ratings about m = 480,189 users and n = 17,770 movies (aka, items).

The contest was designed in a training-test set format. A Hold-out set of about 4.2 million ratings was created consisting of the last nine movies rated by each user (or fewer if a user had not rated at least 18 movies over the entire period). The remaining data made up the training set. The Hold-out set was randomly split three ways, into subsets called Probe, Quiz, and Test. The Probe set was attached to the training set, and labels (the rating that the user gave the movie) were attached. The Quiz and Test sets made up an evaluation set, which is known as the Qualifying set, that competitors were required to predict ratings for. Once a competitor submits predictions, the prizemaster returns the root mean squared error (RMSE) achieved on the Quiz set, which is posted on a public leaderboard. RMSE values mentioned in this article correspond to the Quiz set. Ultimately, the winner of the prize is the one that scores best on the Test set, and those scores were never disclosed by Netflix. This precludes clever systems which might "game" the competition by learning about the Quiz set through repeated submissions.

Compared with the training data, the Hold-out set contains many more ratings by users that do not rate much and are therefore harder to predict. In a way, this represents real requirements for a collaborative filtering (CF) system, which needs to predict new ratings from older ones, and to equally address all users, not just the heavy raters.

Y. Koren is with Yahoo! Research, Haifa, Israel. Email: yehuda@

We reserve special indexing letters to distinguish users from movies: for users u, v, and for movies i, j. A rating r_ui indicates the preference by user u of movie i. Values range from 1 (star) indicating no interest to 5 (stars) indicating a strong interest. We distinguish predicted ratings from known ones by using the notation $\hat{r}_{ui}$ for the predicted value of r_ui.

The scalar t_ui denotes the time of rating r_ui. Here, time is measured in days, so t_ui counts the number of days elapsed since some early time point. About 99% of the possible ratings are missing, because a user typically rates only a small portion of the movies. The (u,i) pairs for which r_ui is known are stored in the training set K = {(u,i) | r_ui is known}. Notice that K also includes the Probe set. Each user u is associated with a set of items denoted by R(u), which contains all the items for which ratings by u are available. Likewise, R(i) denotes the set of users who rated item i. Sometimes, we also use a set denoted by N(u), which contains all items for which u provided a rating, even if the rating value is unknown. Thus, N(u) extends R(u) by also considering the ratings in the Qualifying set.

Models for the rating data are learned by fitting the previously observed ratings (training set). However, our goal is to generalize those in a way that allows us to predict future, unknown ratings (Qualifying set). Thus, caution should be exercised to avoid overfitting the observed data. We achieve this by regularizing the learned parameters, whose magnitudes are penalized. The extent of regularization is controlled by tunable constants. Unless otherwise stated, we use L2 regularization.

This is a good place to add some words on the constants controlling our algorithms (including step sizes, regularization, and number of iterations). Exact values of these constants are determined by validation on the Probe set. In all cases but one (to be mentioned below), such validation is done in a manual greedy manner. That is, when a newly introduced constant needs to get tuned, we execute multiple runs of the algorithms and pick the value that yields the best RMSE on the Netflix Probe set [4]. This scheme does not result in optimal settings for several reasons. First, once a constant is set we do not revisit its value, even though future introduction of other constants may require modifying earlier settings. Second, we use the same constants under multiple variants of the same algorithm (e.g., multiple dimensionalities of a factorization model), whereas a more delicate tuning would require a different setting for each variant. We chose this convenient, but less accurate method, because our experience showed that over-tuning the accuracy of a single predictor does not deliver a real contribution after being incorporated within the overall blend.

III. BASELINE PREDICTORS

Collaborative filtering models try to capture the interactions between users and items that produce the different rating values. However, many of the observed rating values are due to effects associated with either users or items, independently of their interaction. A prime example is that typical CF data exhibit large user and item biases – i.e., systematic tendencies for some users to give higher ratings than others, and for some items to receive higher ratings than others.

We will encapsulate those effects, which do not involve user-item interaction, within the baseline predictors. Because these predictors tend to capture much of the observed signal, it is vital to model them accurately. This enables isolating the part of the signal that truly represents user-item interaction, and subjecting it to more appropriate user preference models.

Denote by µ the overall average rating. A baseline prediction for an unknown rating r_ui is denoted by b_ui and accounts for the user and item effects:

$$b_{ui} = \mu + b_u + b_i \qquad (1)$$

The parameters b_u and b_i indicate the observed deviations of user u and item i, respectively, from the average. For example, suppose that we want a baseline estimate for the rating of the movie Titanic by user Joe. Now, say that the average rating over all movies, µ, is 3.7 stars. Furthermore, Titanic is better than an average movie, so it tends to be rated 0.5 stars above the average. On the other hand, Joe is a critical user, who tends to rate 0.3 stars lower than the average. Thus, the baseline estimate for Titanic's rating by Joe would be 3.9 stars by calculating 3.7 − 0.3 + 0.5.

A way to estimate the parameters is by decoupling the calculation of the b_i's from the calculation of the b_u's. First, for each item i we set

$$b_i = \frac{\sum_{u \in \mathrm{R}(i)} (r_{ui} - \mu)}{\lambda_1 + |\mathrm{R}(i)|}. \qquad (2)$$

Then, for each user u we set

$$b_u = \frac{\sum_{i \in \mathrm{R}(u)} (r_{ui} - \mu - b_i)}{\lambda_2 + |\mathrm{R}(u)|}. \qquad (3)$$

Averages are shrunk towards zero by using the regularization parameters, λ1, λ2, which are determined by validation on the Probe set. We were using: λ1 = 25, λ2 = 10. Whenever this work refers to baseline predictors estimated in this decoupled fashion, they are denoted by $\tilde{b}_{ui}$.
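To make the decoupled estimates concrete, here is a minimal sketch in Python, assuming the training set is available as a list of (user, item, rating) triples; the function and variable names are illustrative, not taken from any reference implementation.

```python
from collections import defaultdict

def decoupled_biases(ratings, lam1=25.0, lam2=10.0):
    """Shrunk item and user biases per Eqs. (2)-(3)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Eq. (2): item biases, shrunk towards zero by lam1.
    i_sum, i_cnt = defaultdict(float), defaultdict(int)
    for u, i, r in ratings:
        i_sum[i] += r - mu
        i_cnt[i] += 1
    b_i = {i: i_sum[i] / (lam1 + i_cnt[i]) for i in i_sum}

    # Eq. (3): user biases on the residuals, shrunk by lam2.
    u_sum, u_cnt = defaultdict(float), defaultdict(int)
    for u, i, r in ratings:
        u_sum[u] += r - mu - b_i[i]
        u_cnt[u] += 1
    b_u = {u: u_sum[u] / (lam2 + u_cnt[u]) for u in u_sum}
    return mu, b_u, b_i
```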

A more accurate estimation of b_u and b_i will treat them symmetrically, by solving the least squares problem

$$\min_{b_*} \sum_{(u,i) \in \mathrm{K}} (r_{ui} - \mu - b_u - b_i)^2 + \lambda_3 \left( \sum_u b_u^2 + \sum_i b_i^2 \right). \qquad (4)$$

Hereinafter, b_* denotes all user and item biases (b_u's and b_i's). The first term, $\sum_{(u,i) \in \mathrm{K}} (r_{ui} - \mu - b_u - b_i)^2$, strives to find b_u's and b_i's that fit the given ratings. The regularizing term, $\lambda_3 (\sum_u b_u^2 + \sum_i b_i^2)$, avoids overfitting by penalizing the magnitudes of the parameters. This least squares problem can be solved fairly efficiently by the method of stochastic gradient descent. In practice, we were using more comprehensive versions of (4), to which we turn now.
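As an illustration of the stochastic gradient descent approach to (4), the following sketch performs the per-rating updates; the step size, regularization, and iteration count are placeholders to be tuned on the Probe set, as discussed in Sec. II.

```python
def sgd_biases(ratings, mu, lr=0.005, lam3=0.02, n_iters=30):
    """Jointly fit b_u and b_i by SGD on the objective (4)."""
    b_u, b_i = {}, {}
    for _ in range(n_iters):
        for u, i, r in ratings:
            e = r - (mu + b_u.get(u, 0.0) + b_i.get(i, 0.0))  # prediction error
            # Gradient steps; the -lam3*b terms implement the L2 penalty.
            b_u[u] = b_u.get(u, 0.0) + lr * (e - lam3 * b_u.get(u, 0.0))
            b_i[i] = b_i.get(i, 0.0) + lr * (e - lam3 * b_i.get(i, 0.0))
    return b_u, b_i
```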

A. Time changing baseline predictors

Much of the temporal variability in the data is included within the baseline predictors, through two major temporal effects. The first addresses the fact that an item's popularity may change over time. For example, movies can go in and out of popularity as triggered by external events such as the appearance of an actor in a new movie. This is manifested in our models by treating the item bias b_i as a function of time. The second major temporal effect allows users to change their baseline ratings over time. For example, a user who tended to rate an average movie "4 stars", may now rate such a movie "3 stars". This may reflect several factors, including a natural drift in a user's rating scale, the fact that ratings are given in the context of other ratings that were given recently, and also the fact that the identity of the rater within a household can change over time. Hence, in our models we take the parameter b_u as a function of time. This induces a template for a time sensitive baseline predictor for u's rating of i at day t_ui:

$$b_{ui} = \mu + b_u(t_{ui}) + b_i(t_{ui}) \qquad (5)$$

Here, b_u(·) and b_i(·) are real valued functions that change over time. The exact way to build these functions should reflect a reasonable way to parameterize the involved temporal changes.

A major distinction is between temporal effects that span extended periods of time and more transient effects. We do not expect movie likeability to fluctuate on a daily basis, but rather to change over more extended periods. On the other hand, we observe that user effects can change on a daily basis, reflecting inconsistencies natural to customer behavior. This requires finer time resolution when modeling user biases compared with a lower resolution that suffices for capturing item-related time effects.

We start with our choice of time-changing item biases b_i(t). We found it adequate to split the item biases into time-based bins, using a constant item bias for each time period. The decision of how to split the timeline into bins should balance the desire to achieve finer resolution (hence, smaller bins) with the need for enough ratings per bin (hence, larger bins). In fact, there is a wide variety of bin sizes that yield about the same accuracy. In our implementation, each bin corresponds to roughly ten consecutive weeks of data, leading to 30 bins spanning all days in the dataset. A day t is associated with an integer Bin(t) (a number between 1 and 30 in our data), such that the movie bias is split into a stationary part and a time changing part:

$$b_i(t) = b_i + b_{i,\mathrm{Bin}(t)} \qquad (6)$$

While binning the parameters works well on the items, it is more of a challenge on the users' side. On the one hand, we would like a finer resolution for users to detect very short lived temporal effects. On the other hand, we do not expect enough ratings per user to produce reliable estimates for isolated bins. Different functional forms can be considered for parameterizing temporal user behavior, with varying complexity and accuracy.

One simple modeling choice uses a linear function to capture a possible gradual drift of user bias. For each user u, we denote the mean date of rating by t_u. Now, if u rated a movie on day t, then the associated time deviation of this rating is defined as

$$\mathrm{dev}_u(t) = \mathrm{sign}(t - t_u) \cdot |t - t_u|^{\beta}.$$

Here |t − t_u| measures the number of days between dates t and t_u. We set the value of β by validation on the Probe set; in our implementation β = 0.4. We introduce a single new parameter for each user called α_u so that we get our first definition of a time-dependent user bias:

$$b_u^{(1)}(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t) \qquad (7)$$

This simple linear model for approximating a drifting behavior requires learning two parameters per user: b_u and α_u.

The linear function for modeling the user bias meshes well with gradual drifts in the user behavior. However, we also observe sudden drifts emerging as "spikes" associated with a single day or session. For example, we have found that multiple ratings a user gives in a single day tend to concentrate around a single value. Such an effect need not span more than a single day. This may reflect the mood of the user that day, the impact of ratings given in a single day on each other, or changes in the actual rater in multi-person accounts. To address such short lived effects, we assign a single parameter per user and day, absorbing the day-specific variability. This parameter is denoted by b_{u,t}.

In the Netflix data, a user rates on 40 different days on average. Thus, working with b_{u,t} requires, on average, 40 parameters to describe each user bias. It is expected that b_{u,t} is inadequate as a standalone for capturing the user bias, since it misses all sorts of signals that span more than a single day. Thus, it serves as an additive component within the previously described schemes. The user bias model (7) becomes

$$b_u^{(3)}(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t) + b_{u,t}. \qquad (8)$$

The discussion so far leads to the baseline predictor

$$b_{ui} = \mu + b_u + \alpha_u \cdot \mathrm{dev}_u(t_{ui}) + b_{u,t_{ui}} + b_i + b_{i,\mathrm{Bin}(t_{ui})}. \qquad (9)$$

If used as a standalone predictor, its resulting RMSE would be 0.9605.
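The pieces introduced so far assemble into the following sketch of the standalone predictor (9). The bin width of roughly 70 days (ten weeks, giving 30 bins) follows the text; the parameter container `p`, its field names, and the 0-based binning are illustrative.

```python
import math

BETA = 0.4      # dev_u exponent, tuned on the Probe set
BIN_DAYS = 70   # roughly ten consecutive weeks per bin

def dev_u(t, t_mean):
    """Time deviation sign(t - t_u) * |t - t_u|^beta."""
    dt = t - t_mean
    return math.copysign(abs(dt) ** BETA, dt) if dt else 0.0

def baseline_9(p, u, i, t):
    """Time-sensitive baseline (9); the day-specific bias defaults to 0."""
    return (p.mu
            + p.b_u[u]
            + p.alpha_u[u] * dev_u(t, p.t_mean[u])
            + p.b_ut.get((u, t), 0.0)
            + p.b_i[i]
            + p.b_i_bin[(i, t // BIN_DAYS)])
```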

Another effect within the scope of baseline predictors is related to the changing scale of user ratings. While b_i(t) is a user-independent measure for the merit of item i at time t, users tend to respond to such a measure differently. For example, different users employ different rating scales, and a single user can change his rating scale over time. Accordingly, the raw value of the movie bias is not completely user-independent. To address this, we add a time-dependent scaling feature to the baseline predictors, denoted by c_u(t). Thus, the baseline predictor (9) becomes

$$b_{ui} = \mu + b_u + \alpha_u \cdot \mathrm{dev}_u(t_{ui}) + b_{u,t_{ui}} + (b_i + b_{i,\mathrm{Bin}(t_{ui})}) \cdot c_u(t_{ui}). \qquad (10)$$


All discussed ways to implement b_u(t) would be valid for implementing c_u(t) as well. We chose to dedicate a separate parameter per day, resulting in: c_u(t) = c_u + c_{u,t}. As usual, c_u is the stable part of c_u(t), whereas c_{u,t} represents day-specific variability.

Adding the multiplicative factor c_u(t) to the baseline predictor (as per (10)) lowers RMSE to 0.9555. Interestingly, this basic model, which captures just main effects disregarding user-item interactions, can explain almost as much of the data variability as the commercial Netflix Cinematch recommender system, whose published RMSE on the same Quiz set is 0.9514 [4].

B. Frequencies

It was brought to our attention by our colleagues at the Pragmatic Theory team (PT) that the number of ratings a user gave on a specific day explains a significant portion of the variability of the data during that day. Formally, denote by F_ui the overall number of ratings that user u gave on day t_ui. The value of F_ui will be henceforth dubbed a "frequency", following PT's notation. In practice we work with a rounded logarithm of F_ui, denoted by $f_{ui} = \lfloor \log_a F_{ui} \rfloor$.¹
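The frequency variable can be computed as in this small sketch; the log basis `a` is a tuned constant (it was later chosen by the automatic tuner of Sec. III-D), so the default below is only a placeholder.

```python
import math

def log_frequency(F_ui, a=2.0):
    """f_ui = floor(log_a F_ui); F_ui counts user u's ratings on day t_ui."""
    return math.floor(math.log(F_ui, a))
```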

Interestingly, even though f_ui is solely driven by user u, it will influence the item biases, rather than the user biases. Accordingly, for each item i we introduce a term b_{i,f}, capturing the bias specific for the item i at log-frequency f. Baseline predictor (10) is extended to be

$$b_{ui} = \mu + b_u + \alpha_u \cdot \mathrm{dev}_u(t_{ui}) + b_{u,t_{ui}} + (b_i + b_{i,\mathrm{Bin}(t_{ui})}) \cdot c_u(t_{ui}) + b_{i,f_{ui}}. \qquad (11)$$

We note that it would be sensible to multiply b_{i,f_ui} by c_u(t_ui), but we have not experimented with this.

The effect of adding the frequency term to the movie bias is quite dramatic. RMSE drops from 0.9555 to 0.9278. Notably, this yields a baseline predictor with a prediction accuracy significantly better than that of the original Netflix Cinematch algorithm.

Here, it is important to remind that a baseline predictor, no matter how accurate, cannot yield personalized recommendations on its own, as it misses all interactions between users and items. In a sense, it is capturing the portion of the data that is less relevant for establishing recommendations, and in doing so enables deriving accurate recommendations by subjecting other models to cleaner data. Nonetheless, we included two of the more accurate baseline predictors in our blend.

Why do frequencies work?: In order to grasp the source of the contribution of frequencies, we make two empirical observations. First, we could see that frequencies are extremely powerful for a standalone baseline predictor, but as we will see, they contribute much less within a full method, where most of their benefit disappears when adding the user-movie interaction terms (matrix factorization or neighborhood). Second is the fact that frequencies seem to be much more helpful when used with movie biases, but not so when used with user-related parameters.

¹ Notice that F_ui is strictly positive whenever it is used, so the logarithm is well defined.


Frequencies help in distinguishing days when users rate a lot in a bulk. Typically, such ratings are given not closely to the actual watching day. Our theory is that when rating in a bulk, users still reflect their normal preferences. However, certain movies exhibit an asymmetric attitude towards them. Some people like them, and will remember them for long as their all-time favorites. On the other hand, some people dislike them and just tend to forget them. Thus, when giving bulk ratings, only those with the positive approach will mark them as their favorites, while those disliking them will not mention them. Such a behavior is expected towards most popular movies, which can be either remembered as very good or just be forgotten. A similar phenomenon can also happen with a negative approach. Some movies are notoriously bad, and people who did not like them always give them as negative examples, indicating what they do not want to watch. However, for the other part of the population, who liked those movies, they are not going to be remembered long as salient positive examples. Thus, when rating in bulk, long after watching the movie, only those who disliked the movie will rate it.

This explains why such biases should be associated with movies, not with users. This also explains why most of the effect disappears when adding the interaction terms, which already "understand" that the user is of the type that likes/dislikes the movie. In other words, we hypothesize that high frequencies (or bulk ratings) do not represent much change in people's taste, but mostly a biased selection of movies to be rated – some movies are natural candidates as "bad examples", while others are natural "good examples". We believe that further validating our hypothesis bears practical implications. If, indeed, frequencies represent biased selection, they should be treated as capturing noise, which needs to get isolated out when making recommendations.

Finally, we should comment that a movie renter such as Netflix might have additional data sources that complement frequencies. For example, data on time passed since the actual watching date, or on whether ratings were entered in response to a given questionnaire or initiated by the user.

C. Predicting future days

Our models include day-specific parameters. We are often asked how these models can be used for predicting ratings in the future, on new dates for which we cannot train the day-specific parameters? The simple answer is that for those future (untrained) dates, the day-specific parameters should take their default value. In particular for (11), c_u(t_ui) is set to c_u, and b_{u,t_ui} is set to zero. Yet, one wonders, if we cannot use the day-specific parameters for predicting the future, why are they good at all? After all, prediction is interesting only when it is about the future. To further sharpen the question, we should mention the fact that the Netflix Qualifying set includes many ratings on dates for which we have no other rating by the same user, and hence day-specific parameters cannot be exploited. To answer this, notice that our temporal modeling makes no attempt to capture future changes. All it is trying to do is to capture transient temporal effects, which had a significant influence on past user feedback. When such effects are identified they must be tuned down, so that we can model the more enduring signal. This allows our model to better capture the long-term characteristics of the data, while letting dedicated parameters absorb short term fluctuations. For example, if a user gave many higher than usual ratings on a particular single day, our models discount those by accounting for a possible day-specific good mood, which does not reflect the longer term behavior of this user. This way, the day-specific parameters accomplish a kind of data cleaning, which improves prediction of future dates.

D. What's in the blend?

The RMSE = 0.9555 result of model (10) is included in the blend. To learn the involved parameters, b_u, α_u, b_{u,t}, b_i, b_{i,Bin(t)}, c_u, and c_{u,t}, one should minimize the regularized squared error on the training set. Learning is done by a stochastic gradient descent algorithm running for 30 iterations. We use a separate learning rate (step size) and regularization (weight decay) for each kind of learned parameter, by minimizing the cost function

$$\min_{b_*,\alpha_*,c_*} \sum_{(u,i) \in \mathrm{K}} \Big( r_{ui} - \mu - b_u - \alpha_u \cdot \mathrm{dev}_u(t_{ui}) - b_{u,t_{ui}} - (b_i + b_{i,\mathrm{Bin}(t_{ui})}) \cdot (c_u + c_{u,t_{ui}}) \Big)^2 + \lambda_a b_u^2 + \lambda_b \alpha_u^2 + \lambda_c b_{u,t_{ui}}^2 + \lambda_d b_i^2 + \lambda_e b_{i,\mathrm{Bin}(t_{ui})}^2 + \lambda_f (c_u - 1)^2 + \lambda_g c_{u,t_{ui}}^2. \qquad (12)$$

Actual values of the learning rates and regularization constants (λ_a, λ_b, ..., λ_g) are as follows:

             b_u   b_{u,t}   α_u    b_i   b_{i,Bin(t)}   c_u   c_{u,t}
reg × 10^2    3     5e-1     5000    3        10          1     5e-1

Notice that regularization shrinks parameters towards zero, with one exception. The multiplier c_u is shrunk towards 1, i.e., we penalize (c_u − 1)², rather than c_u². Similarly, all learned parameters are initialized to zero, except c_u, which is initialized to 1.
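For concreteness, one SGD update of cost (12) might look as follows, reusing `dev_u` and `BIN_DAYS` from the sketch after (9). The `lr` and `reg` containers carry the per-parameter constants from the table above; all field names are illustrative.

```python
def sgd_step_12(p, lr, reg, u, i, t, r):
    """A single stochastic gradient step on the cost function (12)."""
    dev = dev_u(t, p.t_mean[u])
    b_it = p.b_i[i] + p.b_i_bin[(i, t // BIN_DAYS)]   # time-dependent item bias
    c_ut = p.c_u[u] + p.c_ut.get((u, t), 0.0)         # time-dependent scaling
    e = r - (p.mu + p.b_u[u] + p.alpha_u[u] * dev
             + p.b_ut.get((u, t), 0.0) + b_it * c_ut)

    p.b_u[u] += lr.b_u * (e - reg.b_u * p.b_u[u])
    p.alpha_u[u] += lr.alpha_u * (e * dev - reg.alpha_u * p.alpha_u[u])
    p.b_ut[(u, t)] = p.b_ut.get((u, t), 0.0) \
        + lr.b_ut * (e - reg.b_ut * p.b_ut.get((u, t), 0.0))
    p.b_i[i] += lr.b_i * (e * c_ut - reg.b_i * p.b_i[i])
    p.b_i_bin[(i, t // BIN_DAYS)] += lr.b_i_bin * (
        e * c_ut - reg.b_i_bin * p.b_i_bin[(i, t // BIN_DAYS)])
    # c_u is shrunk towards 1, hence the (c_u - 1) penalty term.
    p.c_u[u] += lr.c_u * (e * b_it - reg.c_u * (p.c_u[u] - 1.0))
    p.c_ut[(u, t)] = p.c_ut.get((u, t), 0.0) \
        + lr.c_ut * (e * b_it - reg.c_ut * p.c_ut.get((u, t), 0.0))
```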

The blend also includes the result of the more accurate baseline predictor (11). In fact, this is the only case where we resorted to an automatic parameter tuner (APT) to find the best constants (learning rates, regularization, and log basis). Specifically, we were using APT1, which is described in [13]. The reason we used APT here is twofold. First, this baseline predictor component is embedded in our more comprehensive models (described later). Therefore, it is worthwhile to highly optimize it. Second, this is a small quickly-trained model. So we could easily afford many hundreds of automatic executions seeking optimal settings. Still, it is worth mentioning the benefit of APT was an RMSE reduction of (only) 0.0016 over our initial manual settings.

The parameters of the RMSE = 0.9278 result of model (11) were learned with a 40-iteration stochastic gradient descent process, with the following constants governing the learning of each kind of parameter:

             b_u   b_{u,t}   α_u    b_i   b_{i,Bin(t)}   c_u   c_{u,t}   b_{i,f}
reg × 10^2   2.5    5.23    1395   2.55      9.29       4.76    1.90    1.10e-6

The log basis, a, was likewise determined by APT. Later, we refer to this model as [PQ1].


IV. MATRIX FACTORIZATION WITH TEMPORAL DYNAMICS

Matrix factorization with temporal dynamics was already described in last year's Progress Report [3], or with more detail in a KDD'09 paper [8]. The major enhancement for this year is the incorporation of the improved baseline predictors described in Sec. III.

The full model, which is known as timeSVD++ [8], is based on the prediction rule

$$\hat{r}_{ui} = b_{ui} + q_i^T \left( p_u(t_{ui}) + |\mathrm{N}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{N}(u)} y_j \right). \qquad (13)$$

Here, the exact definition of the time-dependent baseline predictor, b_ui, follows (10).

As is typical for a SVD++ model [7], we employ two sets of static movie factors: q_i, y_i ∈ R^f. The first set (the q_i's) is common to all factor models. The second set (the y_i's) facilitates approximating a user factor through the set of movies rated by the same user, using the normalized sum $|\mathrm{N}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{N}(u)} y_j$. Different normalizations of the form |N(u)|^{−α} could be employed. Our choice of α = 1/2 aims at fixing the variance of the sum (see also [7] for more attempts at intuition on this choice).

User factors, p_u(t) ∈ R^f, are time-dependent. We modeled each of the components of p_u(t)^T = (p_{u1}(t), ..., p_{uf}(t)) in the same way that we treated user biases. In particular we have found modeling after (8) effective, leading to

$$p_{uk}(t) = p_{uk} + \alpha_{uk} \cdot \mathrm{dev}_u(t) + p_{ukt} \qquad k = 1, \ldots, f. \qquad (14)$$

Here p_{uk} captures the stationary portion of the factor, α_{uk} · dev_u(t) approximates a possible portion that changes linearly over time, and p_{ukt} absorbs the very local, day-specific variability.

We were occasionally also using a more memory efficient version, without the day-specific portion:

$$p_{uk}(t) = p_{uk} + \alpha_{uk} \cdot \mathrm{dev}_u(t) \qquad k = 1, \ldots, f. \qquad (15)$$

The same model was also extended with the aforementioned frequencies. Since frequency affects the perception of movies, we tried to inject frequency awareness into the movie factors. To this end we created another copy of the movie factors, for each possible frequency value. This leads to the model

$$\hat{r}_{ui} = b_{ui} + (q_i + q_{i,f_{ui}})^T \left( p_u(t_{ui}) + |\mathrm{N}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{N}(u)} y_j \right). \qquad (16)$$

Here the definition of b_ui is frequency-aware, following (11). Notice that while the transition to frequency-aware biases was measurably effective, the introduction of frequency-dependent movie factors was barely beneficial.

A. What's in the blend?

We included multiple variations of the matrix factorization models in the blend. All models are learned by stochastic gradient descent applied directly on the raw data; no pre- or post-processing is involved. In other words, all parameters (biases, user-factors and movie-factors) are simultaneously learned from the data. Constants (learning rates and regularization to be specified shortly) are tuned to reach the lowest RMSE after 40 iterations. (Practically, one can give or take around ten iterations without a meaningful RMSE impact.) However, for blending we have found that over-training is helpful. That is, we often let the algorithm run far more than 40 iterations, thereby overfitting the train data, which happens to be beneficial when blending with other predictors.

The first model is the one using rule (13), together with the more memory efficient user-factors (15). The settings controlling the learning of bias-related parameters are as described in Sec. III-D. As for learning the factors themselves (q_i, p_u and y_j), we are using a learning rate of 0.008 and regularization of 0.0015, where the learning rate decays by a multiplicative factor of 0.9 after each iteration. Finally, for α_{uk} the learning rate is 1e-5 and the regularization is 50. These same settings remain the same throughout this section. The three variants within our blend are:

1) f = 20, #iterations = 40, RMSE = 0.8914
2) f = 200, #iterations = 40, RMSE = 0.8814
3) f = 500, #iterations = 50, RMSE = 0.8815

The next model still employs rule (13), but with the more accurate user-factor representation (14). This adds one type of parameter, p_{ukt}, which is learned with a learning rate of 0.004 and regularization of 0.01. The two variants within the blend were both heavily over-trained to overfit the training data:

1) f = 200, #iterations = 80, RMSE = 0.8825
2) f = 500, #iterations = 110, RMSE = 0.8841

Finally we have our most accurate factor model, which follows (16). While the main novelty of this model (over the previous one) is in the bias term, we also added the frequency-specific movie-factors q_{i,f_ui}. Their respective learning rate is 2e-5, with regularization of 0.02. The blend includes six variants:

1) f = 200, #iterations = 40, RMSE = 0.8777
2) f = 200, #iterations = 60, RMSE = 0.8787
3) f = 500, #iterations = 40, RMSE = 0.8769
4) f = 500, #iterations = 60, RMSE = 0.8784
5) f = 1000, #iterations = 80, RMSE = 0.8792
6) f = 2000, #iterations = 40, RMSE = 0.8762

Later, we refer to the model with f = 200 and #iterations = 40 as [PQ2].

V. NEIGHBORHOOD MODELS WITH TEMPORAL DYNAMICS

The most common approach to CF is based on neighborhood models. While typically less accurate than their factorization counterparts, neighborhood methods enjoy popularity thanks to some of their merits, such as explaining the reasoning behind computed recommendations, and seamlessly accounting for new entered ratings. The method described in this section is based on Sec. 5 of our KDD'09 paper [8]. Recently, we suggested an item-item model based on global optimization [7], which will enable us here to capture time dynamics in a principled manner. The static model, without temporal dynamics, is centered on the following prediction rule:

$$\hat{r}_{ui} = b_{ui} + |\mathrm{R}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{R}(u)} (r_{uj} - \tilde{b}_{uj}) w_{ij} + |\mathrm{N}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{N}(u)} c_{ij} \qquad (17)$$

Here, the n² item-item weights w_ij and c_ij represent the adjustments we need to make to the predicted rating of item i, given a rating of item j. It was proven greatly beneficial to use two sets of item-item weights: one (the w_ij's) is related to the values of the ratings, and the other disregards the rating value, considering only which items were rated (the c_ij's). These weights are automatically learned from the data together with the biases. The constants $\tilde{b}_{uj}$ are precomputed according to (2)–(3).

When adapting rule (17) to address temporal dynamics, two components should be considered separately. First is the baseline predictor portion, b_ui = µ + b_i + b_u, which explains most of the observed signal. Second is the part that captures the more informative signal, dealing with user-item interaction: $|\mathrm{R}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{R}(u)} (r_{uj} - \tilde{b}_{uj}) w_{ij} + |\mathrm{N}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{N}(u)} c_{ij}$. For the baseline part, nothing changes from the factor model, and we make it time-aware, according to either (10) or (11). The latter one adds frequencies and is generally preferred. However, capturing temporal dynamics within the interaction part requires a different strategy.

Item-item weights (w_ij and c_ij) reflect inherent item characteristics and are not expected to drift over time. The learning process should make sure that they capture unbiased long term values, without being too affected by drifting aspects. Indeed, the time-changing nature of the data can mask much of the longer term item-item relationships if not treated adequately. For instance, a user rating both items i and j high in a short time period is a good indicator for relating them, thereby pushing higher the value of w_ij. On the other hand, if those two ratings are given five years apart, while the user's taste (if not her identity) could considerably change, this is less evidence of any relation between the items. On top of this, we would argue that those considerations are pretty much user-dependent – some users are more consistent than others and allow relating their longer term actions.

Our goal here is to distill accurate values for the item-item weights, despite the interfering temporal effects. First we need to parameterize the decaying relations between two items rated by user u. We adopt an exponential decay formed by the function $e^{-\beta_u \cdot \Delta t}$, where β_u > 0 controls the user-specific decay rate and should be learned from the data. This leads to the prediction rule

$$\hat{r}_{ui} = b_{ui} + |\mathrm{N}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{N}(u)} e^{-\beta_u \cdot |t_{ui} - t_{uj}|} c_{ij} + |\mathrm{R}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{R}(u)} e^{-\beta_u \cdot |t_{ui} - t_{uj}|} (r_{uj} - \tilde{b}_{uj}) w_{ij}. \qquad (18)$$
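The following sketch evaluates rule (18). `R_u` maps each item j rated by u (with a known value) to its rating, precomputed baseline b̃_uj, and rating day, while `N_u` maps every item u touched to its rating day; all names are illustrative.

```python
import math

def predict_neighborhood(p, u, i, t, R_u, N_u):
    """Time-decayed item-item neighborhood rule (18)."""
    decay = lambda t_j: math.exp(-p.beta_u[u] * abs(t - t_j))
    c_part = len(N_u) ** -0.5 * sum(
        decay(t_j) * p.c[(i, j)] for j, t_j in N_u.items())
    w_part = len(R_u) ** -0.5 * sum(
        decay(t_j) * (r_j - b_j) * p.w[(i, j)]
        for j, (r_j, b_j, t_j) in R_u.items())
    return p.baseline(u, i, t) + c_part + w_part
```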

The involved parameters are learned by minimizing the associated regularized squared error. Minimization is performed by stochastic gradient descent for 20–30 iterations. The model is applied directly to the raw data, so all parameters (biases and movie-movie weights) are learned simultaneously. Learning of bias-related parameters is governed by the same constants discussed in Sec. III. As for the movie-movie weights (both w_ij and c_ij), their learning rate is 0.005 with a regularization constant of 0.002. Finally, the update of the exponent β_u uses a particularly small step size of 1e-7, with a regularization constant equaling 0.01.

We also experimented with other decay forms, like the more computationally-friendly $(1 + \beta_u \Delta t)^{-1}$, which resulted in the same accuracy, with an improved running time. (No need to change meta-parameters.)

As in the factor case, properly considering temporal dynamics improves the accuracy of the neighborhood model. The RMSE decreases from 0.9002 [7] to 0.8870 (see next subsection). To our best knowledge, this is significantly better than previously known results by neighborhood methods. To put this in some perspective, this result is even better than those reported [1,2,11,15] by using hybrid approaches such as applying a neighborhood approach on residuals of other algorithms. A lesson is that addressing temporal dynamics in the data can have a more significant impact on accuracy than designing more complex learning algorithms.

A. What's in the blend?

We ran the time-aware neighborhood model, with biases following (10), for 20, 25, and 30 iterations of stochastic gradient descent. The resulting RMSEs were 0.8887, 0.8885 and 0.8887, respectively. The results with 20 and 30 iterations are in the blend.

We also tried extending (18) with a non-normalized term. This involved adding a third set of movie-movie weights, d_ij, as follows:

$$\hat{r}_{ui} = b_{ui} + |\mathrm{N}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{N}(u)} e^{-\beta_u \cdot |t_{ui} - t_{uj}|} c_{ij} + |\mathrm{R}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{R}(u)} e^{-\beta_u \cdot |t_{ui} - t_{uj}|} (r_{uj} - \tilde{b}_{uj}) w_{ij} + \sum_{j \in \mathrm{R}(u)} e^{-\gamma_u \cdot |t_{ui} - t_{uj}|} (r_{uj} - \tilde{b}_{uj}) d_{ij}.$$

Here, we also tried to emphasize the very adjacent ratings made by the user. Therefore, the new decay-controlling constants, the γ_u's, were initialized with a relatively high value of 0.5 (compared to initializing β_u with zero). In addition, for d_ij we used a slower learning rate of 1e-5. Learning was done by 25 iterations of stochastic gradient descent. The result with RMSE = 0.8881 is included in the blend. In retrospect, we believe that such a minuscule RMSE reduction does not justify adding a third set of movie-movie weights.

Finally, we ran the time-aware neighborhood model, with biases following (11), for 20 iterations of stochastic gradient descent. (The third set of movie-movie weights was not used.) The result of RMSE = 0.8870 is included in the blend. Later, we refer to this model as [PQ3].


VI. EXTENSIONS TO RESTRICTED BOLTZMANN MACHINES

A. RBMs with conditional visible units

We extended the Restricted Boltzmann Machines (RBM) model suggested by [12]. The original work showed a significant performance boost by making the hidden units conditional on which movies the current user has rated. We have found that a similarly significant performance boost is achieved by conditioning the visible units.

Intuitively speaking, each of the RBM visible units corresponds to a specific movie. Thus their biases represent movie biases. However, we know that other kinds of biases are more significant; namely, user bias, single-day user bias, and frequency-based movie bias. Therefore, we add those biases to the visible units through conditional connections, which depend on the currently shown user and date.

Let us borrow the original notation [12], which uses a conditional multinomial distribution for modeling each column of the observed visible binary rating matrix V:

$$p(v_i^k = 1 \mid h) = \frac{\exp\left(b_i^k + \sum_{j=1}^{F} h_j W_{ij}^k\right)}{\sum_{l=1}^{5} \exp\left(b_i^l + \sum_{j=1}^{F} h_j W_{ij}^l\right)} \qquad (19)$$

We extend this conditional distribution as follows. Let the current instance refer to user u and date t. We add a user/date bias term:

$$p(v_i^k = 1 \mid h, u, t) = \frac{\exp\left(b_{ut}^k + b_u^k + b_i^k + \sum_{j=1}^{F} h_j W_{ij}^k\right)}{\sum_{l=1}^{5} \exp\left(b_{ut}^l + b_u^l + b_i^l + \sum_{j=1}^{F} h_j W_{ij}^l\right)} \qquad (20)$$

where b_u^k is a user-specific parameter, and b_ut^k is a user×date-specific variable. The learning rule is

$$\Delta b_u^k = \varepsilon_1 \left( \langle v_i^k \rangle_{\mathrm{data}} - \langle v_i^k \rangle_T \right), \qquad \Delta b_{ut}^k = \varepsilon_2 \left( \langle v_i^k \rangle_{\mathrm{data}} - \langle v_i^k \rangle_T \right).$$

We have found it convenient here to deviate from the mini-batch learning scheme suggested in [12], and to learn b_u^k and b_ut^k in a fully online manner. That is, we update each immediately after observing a corresponding training instance. The used learning rates are: ε1 = 0.0025 and ε2 = 0.008. Notice that unless otherwise stated, we use the same weight decay suggested in the original paper, which is 0.001.
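For illustration, the conditional distribution (20) over the five rating values of a single visible unit reduces to a softmax with the extra biases added to the scores, as in this sketch (names are illustrative):

```python
import math

def visible_softmax(W_i, b_i, b_u, b_ut, h):
    """p(v_i^k = 1 | h, u, t) per (20); W_i[k][j] are visible-hidden weights."""
    scores = [b_ut[k] + b_u[k] + b_i[k]
              + sum(h_j * w for h_j, w in zip(h, W_i[k]))
              for k in range(5)]
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```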

Considering the significant effect of frequencies, we can further condition on them here. Let the current instance refer to user u and date t with associated frequency f. The resulting conditional distribution is as follows:

$$p(v_i^k = 1 \mid h, u, t, f) = \frac{\exp\left(b_{if}^k + b_{ut}^k + b_u^k + b_i^k + \sum_{j=1}^{F} h_j W_{ij}^k\right)}{\sum_{l=1}^{5} \exp\left(b_{if}^l + b_{ut}^l + b_u^l + b_i^l + \sum_{j=1}^{F} h_j W_{ij}^l\right)} \qquad (21)$$

where b_if^k is a movie×frequency-specific variable. Its learning rule will be

$$\Delta b_{if}^k = \varepsilon_3 \left( \langle v_i^k \rangle_{\mathrm{data}} - \langle v_i^k \rangle_T \right),$$

where ε3 = 0.0002. Online learning is used as well.

When using frequencies we also employed the following modification to the visible-hidden weights, which was brought to our attention by our colleagues at the Pragmatic Theory team. Instead of using the original weights W_ij^k, we will use frequency-dependent weights W_ij^{kf}, which are factored as

$$W_{ij}^{kf} = W_{ij}^k \cdot (1 + C_{fj}).$$

We use online learning for the new parameters C_fj, with a learning rate of 1e-5.

As it turned out, this extension of the weights barely improves performance when frequency-biases are already present, while being somewhat onerous in terms of running time. Thus, we are unlikely to recommend it. Still, it is part of our frequency-aware RBM implementation.

B. RBMs with day-specific hidden units

Motivated by the day-specific user factor (14), we also tried to create day-specific RBM hidden units.² On top of the F hidden units, we also add G day-specific units. For a user that rated on r different days, we create r parallel copies of the G day-specific units. All those parallel copies share the same hidden-visible weights, hidden biases, and conditional connections. Also, each parallel copy is connected only to the visible units corresponding to the ratings given in its respective day.

To put this formally, for a day-specific hidden unit indexed by j, with a corresponding rating date t, we use the indicator vector r^t ∈ {0,1}^n to denote which movies the current user rated on date t. Then, the Bernoulli distribution for modeling hidden user features becomes

$$p(h_j = 1 \mid V, r^t) = \sigma\Big( b_j + \sum_{i=1}^{n} \sum_{k=1}^{5} r_i^t v_i^k W_{ij}^k + \sum_{i:\, r_i^t = 1} D_{ij} \Big). \qquad (22)$$

In our implementation we used this model together with the frequency-biased RBM. All parameters associated with the day-specific units were learned in mini-batches, as their non-day-specific counterparts, but with a learning rate of 0.005 and a weight decay of 0.01. Results were not encouraging, and further refinement is still needed. Still, a single variant of this scheme contributes to the blend.

C. What's in the blend?

First, a note on over-training. Our parameter setting made the RBM typically converge at lowest Quiz RMSE with 60–90 iterations. However, for the overall blend it was beneficial to continue overfitting the training set, and let the RBM run for many additional iterations, as will be seen in the following. We include in the blend four variants of the RBM model following (20):

1) F = 200, #iterations = 52, RMSE = 0.8951
2) F = 400, #iterations = 62, RMSE = 0.8942
3) F = 400, #iterations = 82, RMSE = 0.8944
4) F = 400, #iterations = 100, RMSE = 0.8952

There are also two variants of the RBM with frequencies (21):

1) F = 200, #iterations = 90, RMSE = 0.8928
2) F = 200, #iterations = 140, RMSE = 0.8949

² An idea developed together with Martin Piotte.


Later, we refer to these two models as [PQ4] and [PQ5]. Interestingly, the RMSE = 0.8928 result is the best we know of by using a pure RBM. If our good experience with post-processing an RBM by kNN [2] is repeatable, one can achieve a further significant RMSE reduction by applying kNN to the residuals. However, we have not experimented with this. Finally, there is a single predictor: an RBM with 50 hidden units and 50 day-specific hidden units, which ran for 70 iterations; its result is included in the blend. Later, we refer to this model as [PQ6].

VII. GBDT BLENDING

A key to achieving highly competitive results on the Netflix data is usage of sophisticated blending schemes, which combine³ the multiple individual predictors into a single final solution. This significant component was managed by our colleagues at the BigChaos team [14]. Still, we were producing a few blended solutions, which were later incorporated as individual predictors in the final blend.

Our blending techniques were applied to three distinct sets of predictors. First is a set of 454 predictors, which represent all predictors of the BellKor's Pragmatic Chaos team for which we have matching Probe and Qualifying results [14]. Second is a set of 75 predictors, which the BigChaos team picked out of the 454 predictors by forward selection [14]. Finally, a set of 24 BellKor predictors for which we had matching Probe and Qualifying results. Details of this set are given at the end of this section.

A. Gradient Boosted Decision Trees

While major breakthroughs in the competition were achieved by uncovering new features underlying the data, those became rare and very hard to get. As we entered the final 30 days of the competition ("last call for grand prize period"), we realized that individual predictors, even if novel and accurate, are unlikely to make a difference to the blend. We speculated that the most impact during a short period of 30 days would be achieved by exploring new blending techniques or improving the existing ones. Blending offers a lower risk path to improvement in a short time. First, unlike individual predictors, better blending is directly connected to the final result. Second, blending simultaneously touches many predictors, rather than improving one at a time. This led to the idea of employing Gradient Boosted Decision Trees, which was raised together with Michael Jahrer and Andreas Töscher. Eventually, it did indeed make a contribution to the blend, though we hoped for a more significant impact.

Gradient Boosted Decision Trees (GBDT) are an additive regression model consisting of an ensemble of trees, fitted to current residuals in a forward step-wise manner. In the traditional boosting framework, the weak learners are generally shallow decision trees consisting of a few leaf nodes. GBDT ensembles are found to work well when there are hundreds of such decision trees. Standard references are [5,6], and a known implementation is Treenet [16].

³ While we use here the generic term "blending", the more accurate term would be "stacked generalization".

GBDT combine a few advantages, including an ability to find non-linear transformations, an ability to handle skewed variables without requiring transformations, computational robustness (e.g., highly collinear variables are not an issue) and high scalability. They also naturally lend themselves to parallelization. This has made them a good choice for several large scale practical problems such as ranking results of a search engine [9,17], or query-biased summarization of search results [10]. In practice we had found them, indeed, very flexible and convenient. However, their accuracy lags behind that of the Neural Network regressors described in [13].

There are four parameters controlling GBDT, which are: (1) number of trees, (2) size of each tree, (3) shrinkage (or, "learning rate"), and (4) sampling rate. Our experiments did not show much sensitivity to any of these parameters (exact choices are described later).

Since GBDT can handle very skewed variables, we added to the list of predictors four additional features: user support (number of rated movies), movie support (number of rating users), frequency and date of rating (number of days passed since the earliest rating in the dataset).

We applied GBDT learning on the aforementioned sets of 454 and 75 predictors. The Probe set is used for training the GBDT, which is then applied on the Qualifying set. Parameter settings are: #trees = 200, tree-size = 20, shrinkage = 0.18, and sampling-rate = 0.9. The results, which are included in the blend, are of RMSE = 0.8603 (454 predictors) and RMSE = 0.8606 (75 predictors).
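As a rough modern stand-in for the Treenet tooling used at the time, the setup can be reproduced with scikit-learn's gradient boosted trees. The feature matrices below are assumed to hold, per rating, the individual predictors plus the four extra features; mapping "tree-size" to `max_leaf_nodes` is our interpretation, not something the text specifies.

```python
from sklearn.ensemble import GradientBoostingRegressor

def gbdt_blend(X_probe, y_probe, X_qual):
    """Train the blender on Probe features, then predict the Qualifying set."""
    gbdt = GradientBoostingRegressor(
        n_estimators=200,    # number of trees
        max_leaf_nodes=20,   # size of each tree
        learning_rate=0.18,  # shrinkage
        subsample=0.9,       # sampling rate
    )
    gbdt.fit(X_probe, y_probe)
    return gbdt.predict(X_qual)
```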

When working with the much smaller set of 24 BellKor predictors, we used the settings: #trees = 150, tree-size = 20, shrinkage = 0.2, and sampling-rate = 1.0. The result of RMSE = 0.8664 was included in the blend.

It is also beneficial to introduce a clustering of users or movies, which will allow GBDT to treat all users (or movies) of a certain kind similarly. In the past [2], we touted splitting users into bins based on their support, and applying an equal blending strategy for all users in the same bin. This is already addressed in the GBDT implementation described above, thanks to adding the user support variable to the blended features. However, we can introduce additional kinds of user relationships to the scheme. For example, a matrix factorization model computes a short vector characterizing each user (a user factor). Like-minded users are expected to get mapped to similar vectors. Hence, adding such vectors to the blended feature sets will effectively allow GBDT to slice and dice the user base into subsets of similar users on which the same blending rules should be applied. The same can be done with movies.

We included in the blend three forms of this idea, all applied on the set of 24 BellKor predictors. First, we added to the blended predictors features from the timeSVD++ model (16) of dimensionality f = 20. This way, all individual bias terms were added as features. In addition, for each movie-user pair u-i, we added the 20-D movie factor (q_i + q_{i,f_ui}), and the 20-D user factor $p_u(t_{ui}) + |\mathrm{R}(u)|^{-\frac{1}{2}} \sum_{j \in \mathrm{R}(u)} y_j$. This resulted in RMSE = 0.8661.

Second, we used the 20 hidden units of an RBM as a 20-D user representation (in lieu of the timeSVD++ user representation). The movie representation is still based on the timeSVD++ model. The resulting RMSE is also 0.8661.

Finally, we added k-NN features on top of the timeSVD++ features. That is, for each u-i pair, we found the top 20 movies most similar to i, which were rated by u. We added the movie scores, each multiplied by their respective similarities, as additional features. Similarities here were shrunk Pearson correlations [1]. This slightly reduces the RMSE to 0.8660.

Another usage of GBDT is for solving a regression problem per movie. For each user we computed a 50-D characteristic vector formed by the values of the 50 hidden units of a respective RBM. Then, for each movie we used GBDT for solving the regression problem of linking the 50-D user vectors to the true user ratings of the movie. The result, with RMSE = 0.9248, will be denoted as [PQ7] in the following description.
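A sketch of that per-movie regression, again with scikit-learn as a stand-in; the tree settings here are illustrative, since the text does not specify the constants used for this variant.

```python
from sklearn.ensemble import GradientBoostingRegressor

def per_movie_models(ratings_by_movie, rbm_user_vec):
    """Fit one GBDT per movie, mapping 50-D RBM user vectors to ratings."""
    models = {}
    for movie, user_ratings in ratings_by_movie.items():  # {movie: [(u, r), ...]}
        X = [rbm_user_vec[u] for u, _ in user_ratings]
        y = [r for _, r in user_ratings]
        models[movie] = GradientBoostingRegressor(
            n_estimators=150, max_leaf_nodes=20).fit(X, y)
    return models
```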

B. List of BellKor's Probe-Qualifying pairs

We list the 24 BellKor predictors which participated in the GBDT blending. Notice that many more of our predictors are in the final blend of Qualifying results (as mentioned earlier in this article). However, only for those listed below do we possess corresponding Probe results, which require extra computational resources to fully re-train the model while excluding the Probe set from the training set.

Post Progress Prize 2008 predictors

Those were mentioned earlier in this document:

1) PQ1
2) PQ2
3) PQ3
4) PQ4
5) PQ5
6) PQ6
7) PQ7

Progress Prize 2008 predictors

The following is based on our notation in [3]:

8) SVD++^(1) (f = 200)
9) Integrated (f = 100, k = 300)
10) SVD++^(3) (f = 500)
11) First neighborhood model of Sec. 2.2 of [3] (RMSE = 0.9002)
12) A neighborhood model mentioned towards the end of Sec. 2.2 of [3] (RMSE = 0.8914)

Progress Prize 2007 predictors

The following is based on our notation in [2]:

13) Predictor #40
14) Predictor #35
15) Predictor #67
16) Predictor #75
17) NNMF (60 factors) with adaptive user factors
18) Predictor #81
19) Predictor #73
20) 100 neighbors User-kNN on residuals of all global effects but the last 4
21) Predictor #85
22) Predictor #45
23) Predictor #83
24) Predictor #106

One last predictor with RMSE = 0.8713 is in the final blend. It is based on the blending technique described in page 12 of [3]. The technique was applied to the four predictors indexed above by: 2, 9, 12, and 13.

VIII. CONCLUDING REMARKS

Granting the grand prize celebrates the conclusion of the Netflix Prize competition. Wide participation, extensive press coverage and many publications all reflect the immense success of the competition. Dealing with movies, a subject close to the hearts of many, was definitely a good start. Yet, much could go wrong, but did not, thanks to several enabling factors. The first success factor is on the organizational side – Netflix. They did a great service to the field by releasing a precious dataset, an act which is so rare, yet courageous and important to the progress of science. Beyond this, both design and conduct of the competition were flawless and non-trivial. For example, the size of the data was right on target. Much larger and more representative than comparable datasets, yet small enough to make the competition accessible to anyone with a commodity PC. As another example, I would mention the split of the test set into three parts: Probe, Quiz, and Test, which was essential to ensure the fairness of the competition. Despite being planned well ahead, it proved to be a decisive factor at the very last minute of the competition, three years later. The second success factor is the wide engagement of many competitors. This created positive buzz, leading to further enrollment of many more. Much was said and written on the collaborative spirit of the competitors, which openly published and discussed their innovations on the web forum and through scientific publications. The feeling was of a big community progressing together, making the experience more enjoyable and efficient to all participants. In fact, this facilitated the nature of the competition, which proceeded like a long marathon, rather than a series of short sprints.

Another helpful factor was some touch of luck. The most prominent one is the choice of the 10% improvement goal. Any small deviation from this number would have made the competition either too easy or impossibly difficult. In addition, the goddess of luck ensured most suspenseful finish lines in both the 2007 Progress Prize and the 2009 Grand Prize, matching best sports events.

The science of recommender systems is a prime beneficiary of the contest. Many new people became involved in the field and made their contributions. There is a clear spike in related publications, and the Netflix dataset is the direct catalyst to developing some of the better algorithms known in the field. Out of the numerous new algorithmic contributions, I would like to highlight one – those humble baseline predictors (or biases), which capture main effects in the data. While the literature mostly concentrates on the more sophisticated algorithmic aspects, we have learned that an accurate treatment of main effects is probably at least as significant as coming up with modeling breakthroughs.

Finally, we were lucky to win this competition, but recognize the important contributions of the many other contestants, from which we have learned so much. We would like to thank all those who published their results, those who participated in the web forum, and those who emailed us with questions and ideas. We got our best intuitions this way.

REFERENCES

[1] R. Bell and Y. Koren. Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights. IEEE International Conference on Data Mining (ICDM'07), pp. 43–52, 2007.

[2] R. Bell, Y. Koren and C. Volinsky. The BellKor Solution to the Netflix Prize. 2007.

[3] R. Bell, Y. Koren and C. Volinsky. The BellKor 2008 Solution to the Netflix Prize. 2008.

[4] J. Bennett and S. Lanning. The Netflix Prize. KDD Cup and Workshop, 2007.

[5] J. H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, Vol. 29 (2001), pp. 1189–1232.

[6] J. H. Friedman. Stochastic Gradient Boosting. Computational Statistics & Data Analysis, Vol. 38 (2002), pp. 367–378.

[7] Y. Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. Proc. 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'08), pp. 426–434, 2008.

[8] Y. Koren. Collaborative Filtering with Temporal Dynamics. Proc. 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'09), pp. 447–456, 2009.

[9] P. Li, C. J. Burges and Q. Wu. McRank: Learning to Rank Using Multiple Classification and Gradient Boosting. Proc. 21st Conference on Advances in Neural Information Processing Systems, 2007.

[10] D. Metzler and T. Kanungo. Machine Learned Sentence Selection Strategies for Query-Biased Summarization. SIGIR Learning to Rank Workshop, 2008.

[11] A. Paterek. Improving Regularized Singular Value Decomposition for Collaborative Filtering. Proc. KDD Cup and Workshop, 2007.

[12] R. Salakhutdinov, A. Mnih and G. Hinton. Restricted Boltzmann Machines for Collaborative Filtering. Proc. 24th Annual International Conference on Machine Learning, pp. 791–798, 2007.

[13] A. Töscher and M. Jahrer. The BigChaos Solution to the Netflix Prize 2008. 2008.

[14] A. Töscher and M. Jahrer. The BigChaos Solution to the Netflix Grand Prize. 2009.

[15] A. Töscher, M. Jahrer and R. Legenstein. Improved Neighborhood-based Algorithms for Large-Scale Recommender Systems. KDD'08 Workshop on Large Scale Recommender Systems and the Netflix Prize, 2008.

[16] Treenet. /TreeNet.php.

[17] Z. Zheng, H. Zha, T. Zhang, O. Chapelle, K. Chen and G. Sun. A General Boosting Method and its Application to Learning Ranking Functions for Web Search. Proc. 21st Conference on Advances in Neural Information Processing Systems, 2007.
