
k-means++: The Advantages of Careful Seeding

David Arthur (Stanford University; supported in part by an NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC)

Sergei Vassilvitskii (Stanford University; supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC)

Abstract

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.

1 Introduction

Clustering is one of the classic problems in machine learning and computational geometry. In the popular k-means formulation, one is given an integer k and a set of n data points in ℝ^d. The goal is to choose k centers so as to minimize φ, the sum of the squared distances between each point and its closest center.

Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it "is by far the most popular clustering algorithm used in scientific and industrial applications" [5].

Usually referred to simply as k-means, Lloyd's algorithm begins with k arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps (assignment and center calculation) are repeated until the process stabilizes.

One can check that the total error φ is monotonically decreasing, which ensures that no clustering is repeated during the course of the algorithm. Since there are at most k^n possible clusterings, the process will always terminate. In practice, very few iterations are usually required, which makes the algorithm much faster than most of its competitors.

Unfortunately, the empirical speed and simplicity of the k-means algorithm come at the price of accuracy. There are many natural examples for which the algorithm generates arbitrarily bad clusterings (i.e., φ/φ_OPT is unbounded even when n and k are fixed). Furthermore, these examples do not rely on an adversarial placement of the starting centers, and the ratio can be unbounded with high probability even with the standard randomized seeding technique.

In this paper, we propose a way of initializing k-means by choosing random starting centers with very specific probabilities. Specifically, we choose a point p as a center with probability proportional to p's contribution to the overall potential. Letting φ denote the potential after choosing centers in this way, we show the following.

Theorem 1.1. For any set of data points, E[φ] ≤ 8(ln k + 2)·φ_OPT.

This sampling is both fast and simple, and it already achieves approximation guarantees that k-means cannot. We propose using it to seed the initial centers for k-means, leading to a combined algorithm we call k-means++.

This complements a very recent result of Ostrovsky et al. [24], who independently proposed much the same algorithm. Whereas they showed this randomized seeding is O(1)-competitive on data sets following a certain separation condition, we show it is O(log k)-competitive on all data sets.

We also show that the analysis for Theorem 1.1 is tight up to a constant factor, and that it can be easily extended to various potential functions in arbitrary metric spaces. In particular, we can also get a simple O(log k) approximation algorithm for the k-median objective. Furthermore, we provide preliminary experimental data showing that in practice, k-means++ really does outperform k-means in terms of both accuracy and speed, often by a substantial margin.

1.1 Related work  As a fundamental problem in machine learning, k-means has a rich history. Because of its simplicity and its observed speed, Lloyd's method [20] remains the most popular approach in practice,


despite its limited accuracy. The convergence time of Lloyd's method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy.

In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^{kd}). Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in k, and are unfortunately impractical even for relatively small n, k and d.

Kanungo et al. [17] proposed an O(n³ε^{−d}) algorithm that is (9+ε)-competitive. However, n³ compares unfavorably with the almost linear running time of Lloyd's method, and the exponential dependence on d can also be problematic. For these reasons, Kanungo et al. also suggested a way of combining their techniques with Lloyd's algorithm, but in order to avoid the exponential dependence on d, their approach sacrifices all approximation guarantees.

Mettu and Plaxton [22] also achieved a constant-probability O(1) approximation using a technique called successive sampling. They match our running time of O(nkd), but only if k is sufficiently large and the spread is sufficiently small. In practice, our approach is simpler, and our experimental results seem to be better in terms of both speed and accuracy.

Very recently, Ostrovsky et al. [24] independently proposed an algorithm that is essentially identical to ours, although their analysis is quite different. Letting φ_OPT,k denote the optimal potential for a k-clustering on a given data set, they prove k-means++ is O(1)-competitive in the case where φ_OPT,k / φ_OPT,k−1 ≤ ε². The intuition here is that if this condition does not hold, then the data is not well suited for clustering with the given value for k.

Combining this result with ours gives a strong characterization of the algorithm's performance. In particular, k-means++ is never worse than O(log k)-competitive, and on very well formed data sets, it improves to being O(1)-competitive.

Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works.

2 Preliminaries

In this section, we formally define the k-means problem, as well as the k-means and k-means++ algorithms.

For the k-means problem, we are given an integer k and a set of n data points X ⊂ ℝ^d. We wish to choose k centers C so as to minimize the potential function,

    φ = Σ_{x∈X} min_{c∈C} ∥x − c∥².
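As a concrete reference for this definition, a direct evaluation of the potential might look as follows (a minimal NumPy sketch; the function name and array conventions are ours, not the paper's):

    import numpy as np

    def potential(X, C):
        """phi = sum over x in X of min over c in C of ||x - c||^2."""
        # Pairwise squared distances, shape (n, k): one row per point.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        # Each point contributes its distance to the closest center.
        return d2.min(axis=1).sum()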

Choosing these centers implicitly defines a clustering: for each center, we set one cluster to be the set of data points that are closer to that center than to any other. As noted above, finding an exact solution to the k-means problem is NP-hard.

Throughout the paper, we will let C_OPT denote the optimal clustering for a given instance of the k-means problem, and we will let φ_OPT denote the corresponding potential. Given a clustering C with potential φ, we also let φ(A) denote the contribution of A ⊂ X to the potential (i.e., φ(A) = Σ_{x∈A} min_{c∈C} ∥x − c∥²).

2.1 The k-means algorithm  The k-means method is a simple and fast algorithm that attempts to locally improve an arbitrary k-means clustering. It works as follows.

1. Arbitrarily choose k initial centers C = {c1, . . . , ck}.
2. For each i ∈ {1, . . . , k}, set the cluster Ci to be the set of points in X that are closer to ci than they are to cj for all j ≠ i.
3. For each i ∈ {1, . . . , k}, set ci to be the center of mass of all points in Ci: ci = (1/|Ci|)·Σ_{x∈Ci} x.
4. Repeat Steps 2 and 3 until C no longer changes.

It is standard practice to choose the initial centers uniformly at random from X. For Step 2, ties may be broken arbitrarily, as long as the method is consistent.

Steps 2 and 3 are both guaranteed to decrease φ, so the algorithm makes local improvements to an arbitrary clustering until it is no longer possible to do so. To see that Step 3 does in fact decrease φ, it is helpful to recall a standard result from linear algebra (see [14]).

Lemma 2.1. Let S be a set of points with center of mass c(S), and let z be an arbitrary point. Then,

    Σ_{x∈S} ∥x − z∥² − Σ_{x∈S} ∥x − c(S)∥² = |S|·∥c(S) − z∥².

Monotonicity for Step 3 follows from taking S to be a single cluster and z to be its initial center.

As discussed above, the k-means algorithm is attractive in practice because it is simple and it is generally fast. Unfortunately, it is guaranteed only to find a local optimum, which can often be quite poor.
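For concreteness, here is a minimal sketch of Steps 1 through 4 in NumPy (our own illustration, with uniform random seeding as described above; keeping an old center when its cluster goes empty is one arbitrary but consistent tie-breaking choice):

    import numpy as np

    def lloyd(X, k, seed=0):
        """Lloyd's k-means: uniform seeding, then Steps 2-3 until stable."""
        rng = np.random.default_rng(seed)
        C = X[rng.choice(len(X), size=k, replace=False)].copy()  # Step 1
        while True:
            # Step 2: assign every point to its nearest center.
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Step 3: move each center to the mean of its cluster
            # (keep the old center if its cluster went empty).
            newC = np.array([X[labels == i].mean(axis=0)
                             if np.any(labels == i) else C[i]
                             for i in range(k)])
            if np.allclose(newC, C):   # Step 4: C no longer changes
                return C, labels
            C = newC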

2.2 The k-means++ algorithm  The k-means algorithm begins with an arbitrary set of cluster centers. We propose a specific way of choosing these centers. At any given time, let D(x) denote the shortest distance from a data point x to the closest center we have already chosen. Then, we define the following algorithm, which we call k-means++.

1a. Choose an initial center c1 uniformly at random from X.
1b. Choose the next center ci, selecting ci = x′ ∈ X with probability D(x′)² / Σ_{x∈X} D(x)².
1c. Repeat Step 1b until we have chosen a total of k centers.
2–4. Proceed as with the standard k-means algorithm.

We call the weighting used in Step 1b simply "D² weighting".
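In code, Steps 1a through 1c amount to a short loop (again a NumPy sketch of ours; the array of squared distances is maintained incrementally, so each round costs O(nd)):

    import numpy as np

    def d2_seed(X, k, seed=0):
        """k-means++ seeding: Steps 1a-1c with D^2 weighting."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]          # 1a: uniform choice
        D2 = ((X - centers[0]) ** 2).sum(axis=1)     # D(x)^2 so far
        for _ in range(k - 1):
            probs = D2 / D2.sum()                    # 1b: D^2 weighting
            centers.append(X[rng.choice(len(X), p=probs)])
            # A new center can only shrink each point's distance.
            D2 = np.minimum(D2, ((X - centers[-1]) ** 2).sum(axis=1))
        return np.array(centers)                     # 1c: k centers total

Running the local search of Section 2.1 from these centers, instead of from uniformly random ones, gives the combined k-means++ algorithm.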

3 k-means++ is O(log k)-competitive

In this section, we prove our main result.

Theorem 3.1. If C is constructed with k-means++, then the corresponding potential function φ satisfies E[φ] ≤ 8(ln k + 2)·φ_OPT.

In fact, we prove this holds after only Step 1 of the algorithm above. Steps 2 through 4 can then only decrease φ. Not surprisingly, our experiments show this local optimization is important in practice, although it is difficult to quantify this theoretically.

Our analysis consists of two parts. First, we show that k-means++ is competitive in those clusters of C_OPT from which it chooses a center. This is easiest in the case of our first center, which is chosen uniformly at random.

Lemma 3.1. Let A be an arbitrary cluster in C_OPT, and let C be the clustering with just one center, which is chosen uniformly at random from A. Then, E[φ(A)] = 2·φ_OPT(A).

Proof. Let c(A) denote the center of mass of the data points in A. By Lemma 2.1, we know that since C_OPT is optimal, it must be using c(A) as the center corresponding to the cluster A. Using the same lemma again, we see E[φ(A)] is given by,

    (1/|A|)·Σ_{a0∈A} Σ_{a∈A} ∥a − a0∥²
        = (1/|A|)·Σ_{a0∈A} ( Σ_{a∈A} ∥a − c(A)∥² + |A|·∥a0 − c(A)∥² )
        = 2·Σ_{a∈A} ∥a − c(A)∥²,

and the result follows.

Our next step is to prove an analog of Lemma 3.1 for the remaining centers, which are chosen with D² weighting.

Lemma 3.2. Let A be an arbitrary cluster in C_OPT, and let C be an arbitrary clustering. If we add a random center to C from A, chosen with D² weighting, then E[φ(A)] ≤ 8·φ_OPT(A).

Proof. The probability that we choose some fixed a0 as our center, given that we are choosing our center from A, is precisely D(a0)² / Σ_{a∈A} D(a)². Furthermore, after choosing the center a0, a point a will contribute precisely min(D(a), ∥a − a0∥)² to the potential. Therefore,

    E[φ(A)] = Σ_{a0∈A} ( D(a0)² / Σ_{a∈A} D(a)² )·Σ_{a∈A} min(D(a), ∥a − a0∥)².

Note by the triangle inequality that D(a0) ≤ D(a) + ∥a − a0∥ for all a, a0. From this, the power-mean inequality implies that D(a0)² ≤ 2·D(a)² + 2·∥a − a0∥². (The power-mean inequality states for any real numbers a1, · · · , am that Σ ai² ≥ (1/m)·(Σ ai)²; it follows from the Cauchy-Schwarz inequality. We are only using the case m = 2 here, but we will need the general case for Lemma 3.3.) Summing over all a, we then have D(a0)² ≤ (2/|A|)·Σ_{a∈A} D(a)² + (2/|A|)·Σ_{a∈A} ∥a − a0∥², and hence, E[φ(A)] is at most,

    (2/|A|)·Σ_{a0∈A} ( Σ_{a∈A} D(a)² / Σ_{a∈A} D(a)² )·Σ_{a∈A} min(D(a), ∥a − a0∥)²
    + (2/|A|)·Σ_{a0∈A} ( Σ_{a∈A} ∥a − a0∥² / Σ_{a∈A} D(a)² )·Σ_{a∈A} min(D(a), ∥a − a0∥)².

In the first expression, we substitute min(D(a), ∥a − a0∥)² ≤ ∥a − a0∥², and in the second expression, we substitute min(D(a), ∥a − a0∥)² ≤ D(a)². Simplifying, we then have,

    E[φ(A)] ≤ (4/|A|)·Σ_{a0∈A} Σ_{a∈A} ∥a − a0∥² = 8·φ_OPT(A).

The last step here follows from Lemma 3.1.

We have now shown that seeding by D² weighting is competitive as long as it chooses centers from each cluster of C_OPT, which completes the first half of our argument. We now use induction to show the total error in general is at most O(log k).
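Before the induction, the constant in Lemma 3.1 is easy to sanity-check numerically. The following throwaway NumPy experiment (ours, not from the paper) computes E[φ(A)] exactly for a single cluster under a uniformly random center, and the ratio to φ_OPT(A) comes out as 2, matching the lemma:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(200, 5))                # an arbitrary cluster
    phi_opt = ((A - A.mean(axis=0)) ** 2).sum()  # optimal: center of mass
    # E[phi(A)] over a uniformly random center a0 in A, computed exactly:
    phi_exp = np.mean([((A - a0) ** 2).sum() for a0 in A])
    print(phi_exp / phi_opt)   # -> 2.0 up to float error, as Lemma 3.1 says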

Lemma 3.3. Let C be an arbitrary clustering. Choose u > 0 "uncovered" clusters from C_OPT, and let X_u denote the set of points in these clusters. Also let X_c = X − X_u. Now suppose we add t ≤ u random centers to C, chosen with D² weighting. Let C′ denote the resulting clustering, and let φ′ denote the corresponding potential. Then, E[φ′] is at most,

    ( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_t) + ((u − t)/u)·φ(X_u).

Here, H_t denotes the harmonic sum, 1 + 1/2 + · · · + 1/t.

Proof. We prove this by induction, showing that if the result holds for (t − 1, u) and (t − 1, u − 1), then it also holds for (t, u). Therefore, it suffices to check t = 0, u > 0 and t = u = 1 as our base cases.

If t = 0 and u > 0, the result follows from the fact that 1 + H_t = (u − t)/u = 1.

Next, suppose t = u = 1. We choose our one new center from the one uncovered cluster with probability exactly φ(X_u)/φ. In this case, Lemma 3.2 guarantees that E[φ′] ≤ φ(X_c) + 8·φ_OPT(X_u). Since φ′ ≤ φ even if we choose a center from a covered cluster, we have,

    E[φ′] ≤ (φ(X_u)/φ)·( φ(X_c) + 8·φ_OPT(X_u) ) + (φ(X_c)/φ)·φ
          ≤ 2·φ(X_c) + 8·φ_OPT(X_u).

Since 1 + H_t = 2 here, we have shown the result holds for both base cases.

We now proceed to prove the inductive step. It is convenient here to consider two cases. First suppose we choose our first center from a covered cluster. As above, this happens with probability exactly φ(X_c)/φ. Note that this new center can only decrease φ. Bearing this in mind, apply the inductive hypothesis with the same choice of covered clusters, but with t decreased by one. It follows that our contribution to E[φ′] in this case is at most,

    (φ(X_c)/φ)·( ( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_{t−1}) + ((u − t + 1)/u)·φ(X_u) ).

On the other hand, suppose we choose our first center from some uncovered cluster A. This happens with probability φ(A)/φ. Let p_a denote the probability that we choose a ∈ A as our center, given the center is somewhere in A, and let φ_a denote φ(A) after we choose a as our center. Once again, we apply our inductive hypothesis, this time adding A to the set of covered clusters, as well as decreasing both t and u by 1. It follows that our contribution to E[φ′] in this case is at most,

    (φ(A)/φ)·Σ_{a∈A} p_a·( ( φ(X_c) + φ_a + 8·φ_OPT(X_u) − 8·φ_OPT(A) )·(1 + H_{t−1}) + ((u − t)/(u − 1))·( φ(X_u) − φ(A) ) )
    ≤ (φ(A)/φ)·( ( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_{t−1}) + ((u − t)/(u − 1))·( φ(X_u) − φ(A) ) ).

The last step here follows from the fact that Σ_{a∈A} p_a·φ_a ≤ 8·φ_OPT(A), which is implied by Lemma 3.2.

Now, the power-mean inequality implies that Σ_{A⊂X_u} φ(A)² ≥ (1/u)·φ(X_u)². Therefore, if we sum over all uncovered clusters A, we obtain a potential contribution of at most,

    (φ(X_u)/φ)·( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_{t−1}) + (1/φ)·((u − t)/(u − 1))·( φ(X_u)² − (1/u)·φ(X_u)² )
    = (φ(X_u)/φ)·( ( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_{t−1}) + ((u − t)/u)·φ(X_u) ).

Combining the potential contribution to E[φ′] from both cases, we now obtain the desired bound:

    E[φ′] ≤ ( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_{t−1}) + ((u − t)/u)·φ(X_u) + (φ(X_c)/φ)·(φ(X_u)/u)
          ≤ ( φ(X_c) + 8·φ_OPT(X_u) )·( 1 + H_{t−1} + 1/u ) + ((u − t)/u)·φ(X_u).

The inductive step now follows since t ≤ u implies 1/u ≤ 1/t, and hence 1 + H_{t−1} + 1/u ≤ 1 + H_t.

We specialize Lemma 3.3 to obtain our main result.

Theorem 3.1. If C is constructed with k-means++, then the corresponding potential function φ satisfies E[φ] ≤ 8(ln k + 2)·φ_OPT.

Proof. Consider the clustering C after we have completed Step 1. Let A denote the C_OPT cluster in which we chose the first center. Applying Lemma 3.3 with t = u = k − 1 and with A being the only covered cluster, we have,

    E[φ] ≤ ( φ(A) + 8·φ_OPT − 8·φ_OPT(A) )·(1 + H_{k−1}).

The result now follows from Lemma 3.1, and from the fact that H_{k−1} ≤ 1 + ln k.

4 A matching lower bound

In this section, we show that the D² seeding used by k-means++ is no better than Ω(log k)-competitive in expectation, thereby proving Theorem 3.1 is tight within a constant factor.

Fix k, and then choose n, Δ, δ with n ≫ k and Δ ≫ δ. We construct X with n points. First choose k centers c1, c2, . . . , ck such that ∥ci − cj∥² = Δ² − ((n−k)/n)·δ² for all i ≠ j. Now, for each ci, add data points x_{i,1}, x_{i,2}, · · · , x_{i,n/k} arranged in a regular simplex with center ci, side length δ, and radius √((n−k)/(2n))·δ. If we do this in orthogonal dimensions for each i, we then have,

    ∥x_{i,i′} − x_{j,j′}∥ = δ if i = j, or Δ otherwise.
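This construction is easy to realize exactly in code. The sketch below (our embedding; the paper specifies only the pairwise distances) places each simplex in its own block of coordinates and gives each cluster one extra coordinate for the center separation:

    import numpy as np

    def lower_bound_instance(k, m, delta, Delta):
        """k clusters of m points each: side delta inside a cluster,
        distance Delta across clusters (requires Delta >> delta)."""
        r2 = delta ** 2 * (m - 1) / (2 * m)      # squared simplex radius
        s = np.sqrt((Delta ** 2 - 2 * r2) / 2)   # per-cluster center offset
        X = np.zeros((k * m, k * m + k))
        for i in range(k):
            for j in range(m):                   # vertices (delta/sqrt 2)*e_j
                X[i * m + j, i * m + j] = delta / np.sqrt(2)
            # Recenter the simplex on the origin of its block...
            X[i * m:(i + 1) * m, i * m:(i + 1) * m] -= delta / (np.sqrt(2) * m)
            X[i * m:(i + 1) * m, k * m + i] = s  # ...then shift to c_i.
        return X

Within a block, any two vertices are exactly delta apart; across blocks, the two simplex radii and the center offsets combine orthogonally to give exactly Delta.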

We prove our seeding technique is in expectation Ω(log k) worse than the optimal clustering in this case. Clearly, the optimal clustering has centers {ci}, which leads to an optimal potential of φ_OPT = ((n−k)/2)·δ². Conversely, using an induction similar to that of Lemma 3.3, we show D² seeding cannot match this bound. As before, we bound the expected potential in terms of the number of centers left to choose and the number of uncovered clusters (those clusters of C_OPT from which we have not chosen a center).

Lemma 4.1. Let C be an arbitrary clustering on X with k − t ≥ 1 centers, but with u clusters from C_OPT uncovered. Now suppose we add t random centers to C, chosen with D² weighting. Let C′ denote the resulting clustering, and let φ′ denote the corresponding potential.

Furthermore, let α = (n − k²)/n, β = (Δ² − 2kδ²)/Δ², and H′_u = Σ_{i=1}^{u} (k−i)/(ik). Then, E[φ′] is at least,

    α^{t+1}·( nδ²·(1 + H′_u)·β + ((n/k)·Δ² − 2nδ²)·(u − t) ).

Proof. We prove this by induction on t. If t = 0, note that,

    φ′ = φ = ( n − u·(n/k) − k )·δ² + u·(n/k)·Δ².

Since n − u·(n/k) − k ≥ α·( n − u·(n/k) ), and since α, β ≤ 1, we have,

    φ ≥ α·( n − u·(n/k) )·δ²·β + u·(n/k)·Δ².

Finally, since nδ²·u ≥ u·(n/k)·δ²·β and nδ²·u ≥ nδ²·H′_u·β, we have,

    φ ≥ α·( nδ²·(1 + H′_u)·β + ((n/k)·Δ² − 2nδ²)·u ).

This completes the base case.

We now proceed to prove the inductive step. As with Lemma 3.3, we consider two cases. The probability that our first center is chosen from an uncovered cluster is,

    u·(n/k)·Δ² / ( u·(n/k)·Δ² + ((k−u)·(n/k) − (k−t))·δ² ) ≥ α·( uΔ² / (uΔ² + (k−u)δ²) ).

Applying our inductive hypothesis with t and u both decreased by 1, we obtain a potential contribution from this case of at least,

    ( uΔ² / (uΔ² + (k−u)δ²) )·α^{t+1}·( nδ²·(1 + H′_{u−1})·β + ((n/k)·Δ² − 2nδ²)·(u − t) ).

The probability that our first center is chosen from a covered cluster is,

    ((k−u)·(n/k) − (k−t))·δ² / ( u·(n/k)·Δ² + ((k−u)·(n/k) − (k−t))·δ² ) ≥ α·( (k−u)δ² / (uΔ² + (k−u)δ²) ).

Applying our inductive hypothesis with t decreased by 1 but with u constant, we obtain a potential contribution from this case of at least,

    ( (k−u)δ² / (uΔ² + (k−u)δ²) )·α^{t+1}·( nδ²·(1 + H′_u)·β + ((n/k)·Δ² − 2nδ²)·(u − t + 1) ).

Therefore, E[φ′] is at least,

    α^{t+1}·( nδ²·(1 + H′_u)·β + ((n/k)·Δ² − 2nδ²)·(u − t) )
    + ( α^{t+1} / (uΔ² + (k−u)δ²) )·( (k−u)δ²·((n/k)·Δ² − 2nδ²) − uΔ²·(H′_u − H′_{u−1})·nδ²·β ).

However, H′_u − H′_{u−1} = (k−u)/(uk) and β = (Δ² − 2kδ²)/Δ², so,

    uΔ²·(H′_u − H′_{u−1})·nδ²·β = ((k−u)/k)·nδ²·(Δ² − 2kδ²) = (k−u)δ²·((n/k)·Δ² − 2nδ²),

and the result follows.

As in the previous section, we obtain the desired result by specializing the induction.

Theorem 4.1. D² seeding is no better than 2(ln k)-competitive.

Proof. Suppose a clustering with potential φ is constructed using k-means++ on the set X described above. Apply Lemma 4.1 with u = t = k − 1 after the first center has been chosen. Noting that 1 + H′_{k−1} = 1 + Σ_{i=1}^{k−1} (k−i)/(ik) = H_k > ln k, we then have,

    E[φ] ≥ α^k·β·nδ²·ln k.

Now, fix k and δ but let n and Δ approach infinity. Then α and β both approach 1, and the result follows from the fact that φ_OPT = ((n−k)/2)·δ².

5 Generalizations

Although the k-means algorithm itself applies only in vector spaces with the potential function φ = Σ_{x∈X} min_{c∈C} ∥x − c∥², we note that our seeding technique does not have the same limitations. In this section, we discuss extending our results to arbitrary metric spaces with the more general potential function,

    φ^[ℓ] = Σ_{x∈X} min_{c∈C} ∥x − c∥^ℓ  for ℓ ≥ 1.

In particular, note that the case ℓ = 1 is the k-medians potential function.

These generalizations require only one change to the algorithm itself. Instead of using D² seeding, we switch to D^ℓ seeding; i.e., we choose x0 as a center with probability D(x0)^ℓ / Σ_{x∈X} D(x)^ℓ.

For the analysis, the most important change appears in Lemma 3.1. Our original proof uses an inner product structure that is not available in the general case. However, a slightly weaker result can be proven using only the triangle inequality.

Lemma 5.1. Let A be an arbitrary cluster in C_OPT, and let C be the clustering with just one center, which is chosen uniformly at random from A. Then, E[φ^[ℓ](A)] ≤ 2^ℓ·φ^[ℓ]_OPT(A).

Proof. Let c denote the center of A in C_OPT. Then,

    E[φ^[ℓ](A)] = (1/|A|)·Σ_{a0∈A} Σ_{a∈A} ∥a − a0∥^ℓ
               ≤ (1/|A|)·Σ_{a0∈A} Σ_{a∈A} 2^{ℓ−1}·( ∥a − c∥^ℓ + ∥a0 − c∥^ℓ )
               = 2^ℓ·φ^[ℓ]_OPT(A).

The second step here follows from the triangle inequality and the power-mean inequality.

The rest of our upper bound analysis carries through without change, except that in the proof of Lemma 3.2, we lose a factor of 2^{ℓ−1} from the power-mean inequality, instead of just 2. Putting everything together, we obtain the general theorem.

Theorem 5.1. If C is constructed with D^ℓ seeding, then the corresponding potential function φ^[ℓ] satisfies, E[φ^[ℓ]] ≤ 2^{2ℓ}·(ln k + 2)·φ^[ℓ]_OPT.
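The change from D² to D^ℓ seeding is a one-line generalization of the earlier sketch, and it needs only a distance oracle rather than a vector space (hedged as before: the function shape is ours; ℓ = 2 with Euclidean distances recovers Section 2.2):

    import numpy as np

    def dl_seed(X, k, l, dist, seed=0):
        """D^l seeding: P(x) proportional to D(x)^l, for any metric.

        dist(X, c) must return the distances from every row of X to c;
        l = 1 targets the k-medians potential."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]
        D = dist(X, centers[0])
        for _ in range(k - 1):
            w = D ** l
            centers.append(X[rng.choice(len(X), p=w / w.sum())])
            D = np.minimum(D, dist(X, centers[-1]))
        return np.array(centers)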

6 Empirical results

In order to evaluate k-means++ in practice, we have implemented and tested it in C++ [3]. In this section, we discuss the results of these preliminary experiments. We found that D² seeding substantially improves both the running time and the accuracy of k-means.

6.1 Datasets  We evaluated the performance of k-means and k-means++ on four datasets.

The first dataset, Norm25, is synthetic. To generate it, we chose 25 "true" centers uniformly at random from a 15-dimensional hypercube of side length 500. We then added points from Gaussian distributions of variance 1 around each true center. Thus, we obtained a number of well separated Gaussians with the true centers providing a good approximation to the optimal clustering.

We chose the remaining datasets from real-world examples off the UC-Irvine Machine Learning Repository. The Cloud dataset [7] consists of 1024 points in 10 dimensions, and it is Philippe Collard's first cloud cover database. The Intrusion dataset [18] consists of 494019 points in 35 dimensions, and it represents features available to an intrusion detection system. Finally, the Spam dataset [25] consists of 4601 points in 58 dimensions, and it represents features available to an e-mail spam detection system.

For each dataset, we tested k = 10, 25, and 50.

6.2 Metrics  Since we were testing randomized seeding processes, we ran 20 trials for each case. We report the minimum and the average potential (actually divided by the number of points), as well as the mean running time. Our implementations are standard with no special optimizations.


                k = 25                        k = 50
                k-means     k-means++         k-means     k-means++
    Average φ   4.233·10⁴   99.96%            7.750·10³   99.81%
    Minimum φ   1.914·10⁴   99.92%            1.474·10¹   0.53%
    Average T   0.90        87.79%            2.04        1.62%

Table 1: Experimental results on the Norm25 dataset (n = 10000, d = 15). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means: 100%·(1 − k-means++ value / k-means value).

                k = 25                        k = 50
                k-means     k-means++         k-means     k-means++
    Average φ   3.637·10³   42.76%            1.867·10³   39.01%
    Minimum φ   2.550·10³   22.60%            1.407·10³   23.07%
    Average T   0.11        43.21%            0.16        41.99%

Table 2: Experimental results on the Cloud dataset (n = 1024, d = 10). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

                k = 25                        k = 50
                k-means     k-means++         k-means     k-means++
    Average φ   3.149·10⁸   99.20%            3.079·10⁸   99.84%
    Minimum φ   3.100·10⁸   99.32%            3.076·10⁸   99.87%
    Average T   257.34      49.19%            917.00      66.70%

Table 3: Experimental results on the Intrusion dataset (n = 494019, d = 35). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

                k = 25                        k = 50
                k-means     k-means++         k-means     k-means++
    Average φ   3.288·10⁴   88.76%            3.183·10⁴   95.35%
    Minimum φ   3.280·10⁴   89.58%            2.384·10⁴   94.30%
    Average T   7.36        79.84%            12.20       75.76%

Table 4: Experimental results on the Spam dataset (n = 4601, d = 58). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.


6.3 Results  The results for k-means and k-means++ are displayed in Tables 1 through 4. We list the absolute results for k-means, and the percentage improvement achieved by k-means++ (e.g., a 90% improvement in the running time is equivalent to a factor 10 speedup). We observe that k-means++ consistently outperformed k-means, both by achieving a lower potential value, in some cases by several orders of magnitude, and also by having a faster running time. The D² seeding is slightly slower than uniform seeding, but it still leads to a faster algorithm since it helps the local search converge after fewer iterations.

The synthetic example is a case where standard k-means does very badly. Even though there is an "obvious" clustering, the uniform seeding will inevitably merge some of these clusters, and the local search will never be able to split them apart (see [12] for further discussion of this phenomenon). The careful seeding method of k-means++ avoided this problem altogether, and it almost always attained the optimal clustering on the synthetic dataset.

The difference between k-means and k-means++ on the real-world datasets was also substantial. In every case, k-means++ achieved at least a 10% accuracy improvement over k-means, and it often performed much better. Indeed, on the Spam and Intrusion datasets, k-means++ achieved potentials 20 to 1000 times smaller than those achieved by standard k-means. Each trial also completed two to three times faster, and each individual trial was much more likely to achieve a good clustering.

7 Conclusion and future work

We have presented a new way to seed the k-means algorithm that is O(log k)-competitive with the optimal clustering. Furthermore, our seeding technique is as fast and as simple as the k-means algorithm itself, which makes it attractive in practice. Towards that end, we ran preliminary experiments on several real-world datasets, and we observed that k-means++ substantially outperformed standard k-means in terms of both speed and accuracy.

Although our analysis of the expected potential E[φ] achieved by k-means++ is tight to within a constant factor, a few open questions still remain. Most importantly, it is standard practice to run the k-means algorithm multiple times, and then keep only the best clustering found. This raises the question of whether k-means++ achieves asymptotically better results if it is allowed several trials. For example, if k-means++ is run 2^k times, our arguments can be modified to show it is likely to achieve a constant approximation at least once. We ask whether a similar bound can be achieved for a smaller number of trials.

Also, experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased φ as much as possible. Unfortunately, our proofs do not carry over to this scenario. It would be interesting to see a comparable (or better) asymptotic result proven here.
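For reference, this greedy variant can be sketched as follows (our illustration; the number of candidates per round is a free parameter, and nothing in our analysis covers this selection rule):

    import numpy as np

    def greedy_d2_seed(X, k, trials=5, seed=0):
        """Per round, draw several D^2-weighted candidates and keep the
        one that decreases phi the most."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]
        D2 = ((X - centers[0]) ** 2).sum(axis=1)
        for _ in range(k - 1):
            idx = rng.choice(len(X), size=trials, p=D2 / D2.sum())
            # Potential that would result from each candidate center.
            cand_D2 = [np.minimum(D2, ((X - X[i]) ** 2).sum(axis=1))
                       for i in idx]
            best = int(np.argmin([d.sum() for d in cand_D2]))
            centers.append(X[idx[best]])
            D2 = cand_D2[best]
        return np.array(centers)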

Finally, we are currently working on a more thorough experimental analysis. In particular, we are measuring the performance of not only k-means++ and standard k-means, but also other variants that have been suggested in the theory community.

Acknowledgements

We would like to thank Rajeev Motwani for his helpful comments.

References

[1] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 155–165, New York, NY, USA, 2004. ACM Press.
[2] D. Arthur and S. Vassilvitskii. Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method. In Symposium on Foundations of Computer Science, 2006.
[3] David Arthur and Sergei Vassilvitskii. k-means++ test code. http://www.stanford.edu/~darthur/kMeansppTest.zip.
[4] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In SCG '06: Proceedings of the twenty-second annual symposium on computational geometry. ACM Press, 2006.
[5] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[6] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 30–39, New York, NY, USA, 2003. ACM Press.
[7] Philippe Collard's cloud cover database. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/taylor/cloud.data.
[8] Sanjoy Dasgupta. How fast is k-means? In Bernhard Schölkopf and Manfred K. Warmuth, editors, COLT, volume 2777 of Lecture Notes in Computer Science, page 735. Springer, 2003.
[9] W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Approximation schemes for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 50–58, New York, NY, USA, 2003. ACM Press.
[10] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the singular value decomposition. Mach. Learn., 56(1-3):9–33, 2004.
[11] Frédéric Gibou and Ronald Fedkiw. A fast hybrid k-means level set algorithm for segmentation. In 4th Annual Hawaii International Conference on Statistics and Mathematics, pages 281–291, 2005.
[12] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.
[13] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC '04: Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 291–300, New York, NY, USA, 2004. ACM Press.
[14] Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? In SODA '05: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 877–885, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.
[15] R. Herwig, A. J. Poustka, C. Müller, C. Bull, H. Lehrach, and J. O'Brien. Large-scale clustering of cDNA-fingerprinting data. Genome Research, 9:1093–1105, 1999.
[16] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In SCG '94: Proceedings of the tenth annual symposium on Computational geometry, pages 332–339, New York, NY, USA, 1994. ACM Press.
[17] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004.
[18] KDD Cup 1999 dataset. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[19] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS '04: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 454–462, Washington, DC, USA, 2004. IEEE Computer Society.
[20] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–136, 1982.
[21] Jiří Matoušek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61–84, 2000.
[22] Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. In Adnan Darwiche and Nir Friedman, editors, UAI, pages 344–351. Morgan Kaufmann, 2002.
[23] Adam Meyerson. Online facility location. In FOCS '01: Proceedings of the 42nd IEEE symposium on Foundations of Computer Science, page 426, Washington, DC, USA, 2001. IEEE Computer Society.
[24] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Symposium on Foundations of Computer Science, 2006.
[25] Spam e-mail database. http://www.ics.uci.edu/~mlearn/databases/spambase/.
