k-means++: The Advantages of Careful Seeding
David Arthur (Stanford University; supported in part by NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC)

Sergei Vassilvitskii (Stanford University; supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC)

Abstract

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
1 Introduction

Clustering is one of the classic problems in machine learning and computational geometry. In the popular k-means formulation, one is given an integer k and a set of n data points in R^d. The goal is to choose k centers so as to minimize φ, the sum of the squared distances between each point and its closest center.

Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it "is by far the most popular clustering algorithm used in scientific and industrial applications" [5].

Usually referred to simply as k-means, Lloyd's algorithm begins with k arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps (assignment and center calculation) are repeated until the process stabilizes.

One can check that the total error φ is monotonically decreasing, which ensures that no clustering is repeated during the course of the algorithm. Since there are at most k^n possible clusterings, the process will always terminate. In practice, very few iterations are usually required, which makes the algorithm much faster than most of its competitors.

Unfortunately, the empirical speed and simplicity of the k-means algorithm come at the price of accuracy. There are many natural examples for which the algorithm generates arbitrarily bad clusterings (i.e., φ/φ_OPT is unbounded even when n and k are fixed). Furthermore, these examples do not rely on an adversarial placement of the starting centers, and the ratio can be unbounded with high probability even with the standard randomized seeding technique.

In this paper, we propose a way of initializing k-means by choosing random starting centers with very specific probabilities. Specifically, we choose a point p as a center with probability proportional to p's contribution to the overall potential. Letting φ denote the potential after choosing centers in this way, we show the following.

Theorem 1.1. For any set of data points, E[φ] ≤ 8(ln k + 2)φ_OPT.

This sampling is both fast and simple, and it already achieves approximation guarantees that k-means cannot. We propose using it to seed the initial centers for k-means, leading to a combined algorithm we call k-means++.

This complements a very recent result of Ostrovsky et al. [24], who independently proposed much the same algorithm. Whereas they showed this randomized seeding is O(1)-competitive on data sets following a certain separation condition, we show it is O(log k)-competitive on all data sets.

We also show that the analysis for Theorem 1.1 is tight up to a constant factor, and that it can be easily extended to various potential functions in arbitrary metric spaces. In particular, we can also get a simple O(log k) approximation algorithm for the k-median objective. Furthermore, we provide preliminary experimental data showing that in practice, k-means++ really does outperform k-means in terms of both accuracy and speed, often by a substantial margin.

1.1 Related work  As a fundamental problem in machine learning, k-means has a rich history. Because of its simplicity and its observed speed, Lloyd's method [20] remains the most popular approach in practice,
despite its limited accuracy. The convergence time of Lloyd's method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy.

In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^{kd}). Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in k, and are unfortunately impractical even for relatively small n, k and d.

Kanungo et al. [17] proposed an O(n^3 ε^{-d}) algorithm that is (9+ε)-competitive. However, n^3 compares unfavorably with the almost linear running time of Lloyd's method, and the exponential dependence on d can also be problematic. For these reasons, Kanungo et al. also suggested a way of combining their techniques with Lloyd's algorithm, but in order to avoid the exponential dependence on d, their approach sacrifices all approximation guarantees.

Mettu and Plaxton [22] also achieved a constant-probability O(1) approximation using a technique called successive sampling. They match our running time of O(nkd), but only if k is sufficiently large and the spread is sufficiently small. In practice, our approach is simpler, and our experimental results seem to be better in terms of both speed and accuracy.
Very recently, Ostrovsky et al. [24] independently proposed an algorithm that is essentially identical to ours, although their analysis is quite different. Letting φ_OPT,k denote the optimal potential for a k-clustering on a given dataset, they prove k-means++ is O(1)-competitive in the case where φ_OPT,k / φ_OPT,k−1 ≤ ε². The intuition here is that if this condition does not hold, then the data is not well suited for clustering with the given value for k.

Combining this result with ours gives a strong characterization of the algorithm's performance. In particular, k-means++ is never worse than O(log k)-competitive, and on very well formed datasets, it improves to being O(1)-competitive.

Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works.
2 Preliminaries

In this section, we formally define the k-means problem, as well as the k-means and k-means++ algorithms.

For the k-means problem, we are given an integer k and a set of n data points X ⊂ R^d. We wish to choose k centers C so as to minimize the potential function,

    \phi = \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^2.

Choosing these centers implicitly defines a clustering – for each center, we set one cluster to be the set of data points that are closer to that center than to any other. As noted above, finding an exact solution to the k-means problem is NP-hard.
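To make the objective concrete, the following is a minimal sketch of the potential computation in Python with NumPy. This is our own illustration, not the paper's implementation (which is in C++); the names potential, X and C are ours.

    import numpy as np

    def potential(X, C):
        # phi: for each point, the squared distance to its closest center, summed.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # |X| x |C| matrix
        return d2.min(axis=1).sum()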
Throughout the paper, we will let C_OPT denote the optimal clustering for a given instance of the k-means problem, and we will let φ_OPT denote the corresponding potential. Given a clustering C with potential φ, we also let φ(A) denote the contribution of A ⊂ X to the potential (i.e., φ(A) = \sum_{x \in A} \min_{c \in C} \lVert x - c \rVert^2).

2.1 The k-means algorithm  The k-means method is a simple and fast algorithm that attempts to locally improve an arbitrary k-means clustering. It works as follows.
1. Arbitrarily choose k initial centers C = {c_1, ..., c_k}.
2. For each i ∈ {1, ..., k}, set the cluster C_i to be the set of points in X that are closer to c_i than they are to c_j for all j ≠ i.
3. For each i ∈ {1, ..., k}, set c_i to be the center of mass of all points in C_i: c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x.
4. Repeat Steps 2 and 3 until C no longer changes.

It is standard practice to choose the initial centers uniformly at random from X. For Step 2, ties may be broken arbitrarily, as long as the method is consistent.

Steps 2 and 3 are both guaranteed to decrease φ, so the algorithm makes local improvements to an arbitrary clustering until it is no longer possible to do so. To see that Step 3 does in fact decrease φ, it is helpful to recall a standard result from linear algebra (see [14]).

Lemma 2.1. Let S be a set of points with center of mass c(S), and let z be an arbitrary point. Then,

    \sum_{x \in S} \lVert x - z \rVert^2 - \sum_{x \in S} \lVert x - c(S) \rVert^2 = |S| \cdot \lVert c(S) - z \rVert^2.

Monotonicity for Step 3 follows from taking S to be a single cluster and z to be its initial center.

As discussed above, the k-means algorithm is attractive in practice because it is simple and it is generally fast. Unfortunately, it is guaranteed only to find a local optimum, which can often be quite poor.
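As an illustration of Steps 1-4 (a sketch only, not the authors' code), here is Lloyd's iteration in NumPy. We assume ties in Step 2 are broken by argmin, and we keep a center unchanged if its cluster becomes empty; the paper leaves both conventions open.

    import numpy as np

    def lloyd(X, C, max_iter=100):
        for _ in range(max_iter):
            # Step 2: assign each point to its nearest center.
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Step 3: recompute each center as the mean of its cluster.
            newC = np.array([X[labels == i].mean(axis=0) if (labels == i).any() else C[i]
                             for i in range(len(C))])
            # Step 4: stop when the centers no longer change.
            if np.allclose(newC, C):
                break
            C = newC
        return C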
2.2 The k-means++ algorithm  The k-means algorithm begins with an arbitrary set of cluster centers. We propose a specific way of choosing these centers. At any given time, let D(x) denote the shortest distance from a data point x to the closest center we have already chosen. Then, we define the following algorithm, which we call k-means++.

1a. Choose an initial center c_1 uniformly at random from X.
1b. Choose the next center c_i, selecting c_i = x′ ∈ X with probability \frac{D(x')^2}{\sum_{x \in X} D(x)^2}.
1c. Repeat Step 1b until we have chosen a total of k centers.
2-4. Proceed as with the standard k-means algorithm.

We call the weighting used in Step 1b simply "D² weighting".
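Steps 1a-1c translate directly into code. The sketch below (our own rendering in NumPy; the function name d2_seed and the rng parameter are ours) samples each new center with probability proportional to D(x)², maintaining the D(x)² values incrementally:

    import numpy as np

    def d2_seed(X, k, rng=np.random.default_rng()):
        n = len(X)
        centers = [X[rng.integers(n)]]                # Step 1a: uniform first center
        d2 = ((X - centers[0]) ** 2).sum(axis=1)      # current D(x)^2 for all x
        for _ in range(k - 1):
            c = X[rng.choice(n, p=d2 / d2.sum())]     # Step 1b: D^2 weighting
            centers.append(c)
            d2 = np.minimum(d2, ((X - c) ** 2).sum(axis=1))
        return np.array(centers)                      # Step 1c: k centers total

Seeding Lloyd's iteration with these centers, e.g. lloyd(X, d2_seed(X, k)), gives the combined k-means++ algorithm.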
3 k-means++ is O(log k)-competitive

In this section, we prove our main result.

Theorem 3.1. If C is constructed with k-means++, then the corresponding potential function φ satisfies E[φ] ≤ 8(ln k + 2)φ_OPT.

In fact, we prove this holds after only Step 1 of the algorithm above. Steps 2 through 4 can then only decrease φ. Not surprisingly, our experiments show this local optimization is important in practice, although it is difficult to quantify this theoretically.

Our analysis consists of two parts. First, we show that k-means++ is competitive in those clusters of C_OPT from which it chooses a center. This is easiest in the case of our first center, which is chosen uniformly at random.

Lemma 3.1. Let A be an arbitrary cluster in C_OPT, and let C be the clustering with just one center, which is chosen uniformly at random from A. Then, E[φ(A)] = 2φ_OPT(A).

Proof. Let c(A) denote the center of mass of the data points in A. By Lemma 2.1, we know that since C_OPT is optimal, it must be using c(A) as the center corresponding to the cluster A. Using the same lemma again, we see E[φ(A)] is given by,

    E[φ(A)] = \frac{1}{|A|} \sum_{a_0 \in A} \sum_{a \in A} \lVert a - a_0 \rVert^2
            = \frac{1}{|A|} \sum_{a_0 \in A} \Big( \sum_{a \in A} \lVert a - c(A) \rVert^2 + |A| \cdot \lVert a_0 - c(A) \rVert^2 \Big)
            = 2 \sum_{a \in A} \lVert a - c(A) \rVert^2,

and the result follows.

Our next step is to prove an analog of Lemma 3.1 for the remaining centers, which are chosen with D² weighting.

Lemma 3.2. Let A be an arbitrary cluster in C_OPT, and let C be an arbitrary clustering. If we add a random center to C from A, chosen with D² weighting, then E[φ(A)] ≤ 8φ_OPT(A).

Proof. The probability that we choose some fixed a_0 as our center, given that we are choosing our center from A, is precisely \frac{D(a_0)^2}{\sum_{a \in A} D(a)^2}. Furthermore, after choosing the center a_0, a point a will contribute precisely min(D(a), ‖a − a_0‖)² to the potential. Therefore,

    E[φ(A)] = \sum_{a_0 \in A} \frac{D(a_0)^2}{\sum_{a \in A} D(a)^2} \sum_{a \in A} \min(D(a), \lVert a - a_0 \rVert)^2.

Note by the triangle inequality that D(a_0) ≤ D(a) + ‖a − a_0‖ for all a, a_0. From this, the power-mean inequality¹ implies that D(a_0)² ≤ 2D(a)² + 2‖a − a_0‖². Summing over all a, we then have that

    D(a_0)^2 \le \frac{2}{|A|} \sum_{a \in A} D(a)^2 + \frac{2}{|A|} \sum_{a \in A} \lVert a - a_0 \rVert^2,

and hence, E[φ(A)] is at most,

    \frac{2}{|A|} \sum_{a_0 \in A} \frac{\sum_{a \in A} D(a)^2}{\sum_{a \in A} D(a)^2} \cdot \sum_{a \in A} \min(D(a), \lVert a - a_0 \rVert)^2
    + \frac{2}{|A|} \sum_{a_0 \in A} \frac{\sum_{a \in A} \lVert a - a_0 \rVert^2}{\sum_{a \in A} D(a)^2} \cdot \sum_{a \in A} \min(D(a), \lVert a - a_0 \rVert)^2.

In the first expression, we substitute min(D(a), ‖a − a_0‖)² ≤ ‖a − a_0‖², and in the second expression, we substitute min(D(a), ‖a − a_0‖)² ≤ D(a)². Simplifying, we then have,

    E[φ(A)] \le \frac{4}{|A|} \sum_{a_0 \in A} \sum_{a \in A} \lVert a - a_0 \rVert^2 = 8φ_OPT(A).

The last step here follows from Lemma 3.1.

We have now shown that seeding by D² weighting is competitive as long as it chooses centers from each cluster of C_OPT, which completes the first half of our argument. We now use induction to show the total error in general is at most O(log k).

¹The power-mean inequality states for any real numbers a_1, ..., a_m that \sum a_i^2 \ge \frac{1}{m} (\sum a_i)^2. It follows from the Cauchy-Schwarz inequality. We are only using the case m = 2 here, but we will need the general case for Lemma 3.3.
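Lemma 3.1 is easy to verify numerically: averaging the single-center potential over all |A| choices of the center computes E[φ(A)] exactly. A small self-contained check (our own sketch, on synthetic data):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(200, 5))                    # an arbitrary cluster A
    phi_opt = ((A - A.mean(axis=0)) ** 2).sum()      # potential with the center of mass
    phi_exp = np.mean([((A - a0) ** 2).sum() for a0 in A])
    print(phi_exp / phi_opt)                         # 2.0 up to floating-point error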
Lemma 3.3. Let C be an arbitrary clustering. Choose u > 0 "uncovered" clusters from C_OPT, and let X_u denote the set of points in these clusters. Also let X_c = X − X_u. Now suppose we add t ≤ u random centers to C, chosen with D² weighting. Let C′ denote the resulting clustering, and let φ′ denote the corresponding potential. Then, E[φ′] is at most,

    \big( φ(X_c) + 8φ_OPT(X_u) \big) \cdot (1 + H_t) + \frac{u - t}{u} \cdot φ(X_u).

Here, H_t denotes the harmonic sum, 1 + 1/2 + ··· + 1/t.

Proof. We prove this by induction, showing that if the result holds for (t−1, u) and (t−1, u−1), then it also holds for (t, u). Therefore, it suffices to check t = 0, u > 0 and t = u = 1 as our base cases.

If t = 0 and u > 0, the result follows from the fact that 1 + H_t = (u − t)/u = 1. Next, suppose t = u = 1. We choose our one new center from the one uncovered cluster with probability exactly φ(X_u)/φ. In this case, Lemma 3.2 guarantees that E[φ′] ≤ φ(X_c) + 8φ_OPT(X_u). Since φ′ ≤ φ even if we choose a center from a covered cluster, we have,

    E[φ′] \le \frac{φ(X_u)}{φ} \cdot \big( φ(X_c) + 8φ_OPT(X_u) \big) + \frac{φ(X_c)}{φ} \cdot φ
          \le 2φ(X_c) + 8φ_OPT(X_u).

Since 1 + H_t = 2 here, we have shown the result holds for both base cases.

We now proceed to prove the inductive step. It is convenient here to consider two cases. First suppose we choose our first center from a covered cluster. As above, this happens with probability exactly φ(X_c)/φ. Note that this new center can only decrease φ. Bearing this in mind, apply the inductive hypothesis with the same choice of covered clusters, but with t decreased by one. It follows that our contribution to E[φ′] in this case is at most,

    \frac{φ(X_c)}{φ} \cdot \Big( \big( φ(X_c) + 8φ_OPT(X_u) \big) \cdot (1 + H_{t-1}) + \frac{u - t + 1}{u} \cdot φ(X_u) \Big).

On the other hand, suppose we choose our first center from some uncovered cluster A. This happens with probability φ(A)/φ. Let p_a denote the probability that we choose a ∈ A as our center, given the center is somewhere in A, and let φ_a denote φ(A) after we choose a as our center. Once again, we apply our inductive hypothesis, this time adding A to the set of covered clusters, as well as decreasing both t and u by 1. It follows that our contribution to E[φ′] in this case is at most,

    \frac{φ(A)}{φ} \cdot \sum_{a \in A} p_a \Big( \big( φ(X_c) + φ_a + 8φ_OPT(X_u) - 8φ_OPT(A) \big) \cdot (1 + H_{t-1}) + \frac{u - t}{u - 1} \cdot \big( φ(X_u) - φ(A) \big) \Big)
    \le \frac{φ(A)}{φ} \cdot \Big( \big( φ(X_c) + 8φ_OPT(X_u) \big) \cdot (1 + H_{t-1}) + \frac{u - t}{u - 1} \cdot \big( φ(X_u) - φ(A) \big) \Big).

The last step here follows from the fact that \sum_{a \in A} p_a φ_a \le 8φ_OPT(A), which is implied by Lemma 3.2.

Now, the power-mean inequality implies that \sum_{A \subset X_u} φ(A)^2 \ge \frac{1}{u} φ(X_u)^2. Therefore, if we sum over all uncovered clusters A, we obtain a potential contribution of at most,

    \frac{φ(X_u)}{φ} \cdot \big( φ(X_c) + 8φ_OPT(X_u) \big) \cdot (1 + H_{t-1}) + \frac{u - t}{u - 1} \cdot \frac{1}{φ} \cdot \Big( φ(X_u)^2 - \frac{1}{u} φ(X_u)^2 \Big)
    = \frac{φ(X_u)}{φ} \cdot \Big( \big( φ(X_c) + 8φ_OPT(X_u) \big) \cdot (1 + H_{t-1}) + \frac{u - t}{u} \cdot φ(X_u) \Big).

Combining the potential contribution to E[φ′] from both cases, we now obtain the desired bound:

    E[φ′] \le \big( φ(X_c) + 8φ_OPT(X_u) \big) \cdot (1 + H_{t-1}) + \frac{u - t}{u} \cdot φ(X_u) + \frac{φ(X_c)}{φ} \cdot \frac{φ(X_u)}{u}
          \le \big( φ(X_c) + 8φ_OPT(X_u) \big) \cdot \Big( 1 + H_{t-1} + \frac{1}{u} \Big) + \frac{u - t}{u} \cdot φ(X_u).

The inductive step now follows from the fact that t ≤ u, and hence H_{t−1} + 1/u ≤ H_{t−1} + 1/t = H_t.

We specialize Lemma 3.3 to obtain our main result.

Theorem 3.1. If C is constructed with k-means++, then the corresponding potential function φ satisfies E[φ] ≤ 8(ln k + 2)φ_OPT.

Proof. Consider the clustering C after we have completed Step 1. Let A denote the C_OPT cluster in which we chose the first center. Applying Lemma 3.3 with
t = u = k − 1 and with A being the only covered cluster, we have,

    E[φ] \le \big( φ(A) + 8φ_OPT - 8φ_OPT(A) \big) \cdot (1 + H_{k-1}).

The result now follows from Lemma 3.1, and from the fact that H_{k−1} ≤ 1 + ln k.
4 A matching lower bound

In this section, we show that the D² seeding used by k-means++ is no better than Ω(log k)-competitive in expectation, thereby proving Theorem 3.1 is tight within a constant factor.

Fix k, and then choose n, Δ, δ with n ≫ k and Δ ≥ δ. We construct X with n points. First choose k centers c_1, c_2, ..., c_k such that \lVert c_i - c_j \rVert^2 = \Delta^2 - \frac{n-k}{n} \delta^2 for all i ≠ j. Now, for each c_i, add data points x_{i,1}, x_{i,2}, ..., x_{i,n/k} arranged in a regular simplex with center c_i, side length δ, and radius \sqrt{\frac{n-k}{2n}} \cdot \delta. If we do this in orthogonal dimensions for each i, we then have,

    \lVert x_{i,i'} - x_{j,j'} \rVert = \delta \text{ if } i = j, \text{ or } \Delta \text{ otherwise}.

We prove our seeding technique is in expectation Ω(log k) worse than the optimal clustering in this case. Clearly, the optimal clustering has centers {c_i}, which leads to an optimal potential of φ_OPT = \frac{n-k}{2} \delta^2. Conversely, using an induction similar to that of Lemma 3.3, we show D² seeding cannot match this bound. As before, we bound the expected potential in terms of the number of centers left to choose and the number of uncovered clusters (those clusters of C_OPT from which we have not chosen a center).
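One concrete way to realize this construction in coordinates (a sketch under our own layout choices; the paper does not prescribe one) is to place the k centers on scaled orthonormal basis vectors, which are automatically mutually equidistant, and to embed each cluster's simplex in its own block of coordinates:

    import numpy as np

    def lower_bound_instance(k, m, delta, Delta):
        # n = k*m points: k regular simplices of side delta whose centers are
        # mutually equidistant, so cross-cluster distances all equal Delta.
        n = k * m
        s2 = Delta**2 - (n - k) / n * delta**2       # ||c_i - c_j||^2
        X = np.zeros((n, k + k * m))
        for i in range(k):
            simplex = (delta / np.sqrt(2)) * np.eye(m)
            simplex -= simplex.mean(axis=0)          # centroid of the simplex at c_i
            X[i*m:(i+1)*m, i] = np.sqrt(s2 / 2)      # the center c_i
            X[i*m:(i+1)*m, k + i*m : k + (i+1)*m] = simplex
        return X

A quick check confirms that all within-cluster distances equal δ, all cross-cluster distances equal Δ, and the potential of the centers {c_i} is k · (m−1)δ²/2 = ((n−k)/2)δ².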
Lemma 4.1. Let C be an arbitrary clustering on X with k − t ≥ 1 centers, but with u clusters from C_OPT uncovered. Now suppose we add t random centers to C, chosen with D² weighting. Let C′ denote the resulting clustering, and let φ′ denote the corresponding potential.

Furthermore, let α = \frac{n - k^2}{n}, β = \frac{\Delta^2 - 2k\delta^2}{\Delta^2} and H_u = \sum_{i=1}^{u} \frac{k-i}{ki}. Then, E[φ′] is at least,

    \alpha^{t+1} \cdot n\delta^2 \cdot (1 + H_u) \cdot \beta + \Big( \frac{n}{k}\Delta^2 - 2n\delta^2 \Big) \cdot (u - t).

Proof. We prove this by induction on t. If t = 0, note that,

    φ′ = φ = \Big( n - u \cdot \frac{n}{k} - (k - t) \Big) \cdot \delta^2 + u \cdot \frac{n}{k} \cdot \Delta^2.

Since n − u·(n/k) − (k − t) ≥ α·(n − u·(n/k)) and α, β ≤ 1, we have,

    φ \ge \alpha \cdot \Big( n - u \cdot \frac{n}{k} \Big) \cdot \delta^2 \cdot \beta + u \cdot \frac{n}{k} \cdot \Delta^2.

Finally, since nδ²u ≥ u·nδ²·β and nδ²u ≥ nδ²·H_u·β, we have,

    φ \ge \alpha \cdot n\delta^2 \cdot (1 + H_u) \cdot \beta + \Big( \frac{n}{k}\Delta^2 - 2n\delta^2 \Big) \cdot u.

This completes the base case.

We now proceed to prove the inductive step. As with Lemma 3.3, we consider two cases. The probability that our first center is chosen from an uncovered cluster is,

    \frac{u \cdot \frac{n}{k} \cdot \Delta^2}{u \cdot \frac{n}{k} \cdot \Delta^2 + \big( (k-u) \cdot \frac{n}{k} - (k-t) \big) \cdot \delta^2} \ge \alpha \cdot \frac{u\Delta^2}{u\Delta^2 + (k-u)\delta^2}.

Applying our inductive hypothesis with t and u both decreased by 1, we obtain a potential contribution from this case of at least,

    \alpha \cdot \frac{u\Delta^2}{u\Delta^2 + (k-u)\delta^2} \cdot \Big( \alpha^{t} \cdot n\delta^2 \cdot (1 + H_{u-1}) \cdot \beta + \Big( \frac{n}{k}\Delta^2 - 2n\delta^2 \Big) \cdot (u - t) \Big).

The probability that our first center is chosen from a covered cluster is,

    \frac{\big( (k-u) \cdot \frac{n}{k} - (k-t) \big) \cdot \delta^2}{u \cdot \frac{n}{k} \cdot \Delta^2 + \big( (k-u) \cdot \frac{n}{k} - (k-t) \big) \cdot \delta^2} \ge \alpha \cdot \frac{(k-u)\delta^2}{u\Delta^2 + (k-u)\delta^2}.

Applying our inductive hypothesis with t decreased by 1 but with u constant, we obtain a potential contribution from this case of at least,

    \alpha \cdot \frac{(k-u)\delta^2}{u\Delta^2 + (k-u)\delta^2} \cdot \Big( \alpha^{t} \cdot n\delta^2 \cdot (1 + H_u) \cdot \beta + \Big( \frac{n}{k}\Delta^2 - 2n\delta^2 \Big) \cdot (u - t + 1) \Big).

Therefore, E[φ′] is at least,

    \alpha^{t+1} \cdot n\delta^2 \cdot (1 + H_{u-1}) \cdot \beta + \Big( \frac{n}{k}\Delta^2 - 2n\delta^2 \Big) \cdot (u - t)
    + \alpha^{t+1} \cdot \frac{(k-u)\delta^2}{u\Delta^2 + (k-u)\delta^2} \cdot \Big( \frac{n}{k}\Delta^2 - 2n\delta^2 + n\delta^2 \cdot \beta \cdot (H_u - H_{u-1}) \Big).
However, H_u − H_{u−1} = \frac{k-u}{ku} and β = \frac{\Delta^2 - 2k\delta^2}{\Delta^2}, so

    \frac{(k-u)\delta^2}{u\Delta^2 + (k-u)\delta^2} \cdot \Big( \frac{n}{k}\Delta^2 - 2n\delta^2 + n\delta^2 \cdot \beta \cdot (H_u - H_{u-1}) \Big) = (H_u - H_{u-1}) \cdot n\delta^2 \cdot \beta,

and the result follows.

As in the previous section, we obtain the desired result by specializing the induction.

Theorem 4.1. D² seeding is no better than 2(ln k)-competitive.

Proof. Suppose a clustering with potential φ is constructed using k-means++ on X described above. Apply Lemma 4.1 with u = t = k − 1 after the first center has been chosen. Noting that 1 + H_{k−1} = 1 + \sum_{i=1}^{k-1} \frac{k-i}{ki} = \sum_{i=1}^{k} \frac{1}{i} > \ln k, we then have,

    E[φ] \ge \alpha^k \cdot \beta \cdot n\delta^2 \cdot \ln k.

Now, fix k and δ but let n and Δ approach infinity. Then α and β both approach 1, and the result follows from the fact that φ_OPT = \frac{n-k}{2} \cdot \delta^2.

5 Generalizations

Although the k-means algorithm itself applies only in vector spaces with the potential function φ = \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^2, we note that our seeding technique does not have the same limitations. In this section, we discuss extending our results to arbitrary metric spaces with the more general potential function, φ^{[ℓ]} = \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^ℓ for ℓ ≥ 1. In particular, note that the case of ℓ = 1 is the k-medians potential function.

These generalizations require only one change to the algorithm itself. Instead of using D² seeding, we switch to D^ℓ seeding – i.e., we choose x_0 as a center with probability \frac{D(x_0)^ℓ}{\sum_{x \in X} D(x)^ℓ}.

For the analysis, the most important change appears in Lemma 3.1. Our original proof uses an inner product structure that is not available in the general case. However, a slightly weaker result can be proven using only the triangle inequality.

Lemma 5.1. Let A be an arbitrary cluster in C_OPT, and let C be the clustering with just one center, which is chosen uniformly at random from A. Then, E[φ^{[ℓ]}(A)] ≤ 2^ℓ φ^{[ℓ]}_OPT(A).

Proof. Let c denote the center of A in C_OPT. Then,

    E[φ^{[ℓ]}(A)] = \frac{1}{|A|} \sum_{a_0 \in A} \sum_{a \in A} \lVert a - a_0 \rVert^ℓ
                  \le \frac{2^{ℓ-1}}{|A|} \sum_{a_0 \in A} \sum_{a \in A} \big( \lVert a - c \rVert^ℓ + \lVert a_0 - c \rVert^ℓ \big)
                  = 2^ℓ φ^{[ℓ]}_OPT(A).

The second step here follows from the triangle inequality and the power-mean inequality.

The rest of our upper bound analysis carries through without change, except that in the proof of Lemma 3.2, we lose a factor of 2^{ℓ−1} from the power-mean inequality, instead of just 2. Putting everything together, we obtain the general theorem.

Theorem 5.1. If C is constructed with D^ℓ seeding, then the corresponding potential function φ^{[ℓ]} satisfies, E[φ^{[ℓ]}] ≤ 2^{2ℓ}(ln k + 2)φ^{[ℓ]}_OPT.
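In code, the generalization is a one-parameter change to the D² sketch from Section 2.2 (again our own illustration, not the paper's implementation): raise the distances to the ℓ-th power before normalizing. With ell=2 this recovers k-means++ seeding; ell=1 targets the k-medians potential.

    import numpy as np

    def dl_seed(X, k, ell=2, rng=np.random.default_rng()):
        n = len(X)
        centers = [X[rng.integers(n)]]
        d = np.linalg.norm(X - centers[0], axis=1)   # current D(x) for all x
        for _ in range(k - 1):
            w = d ** ell                             # D^ell weighting
            c = X[rng.choice(n, p=w / w.sum())]
            centers.append(c)
            d = np.minimum(d, np.linalg.norm(X - c, axis=1))
        return np.array(centers)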
6 Empirical results

In order to evaluate k-means++ in practice, we have implemented and tested it in C++ [3]. In this section, we discuss the results of these preliminary experiments. We found that D² seeding substantially improves both the running time and the accuracy of k-means.
6.1 Datasets  We evaluated the performance of k-means and k-means++ on four datasets.

The first dataset, Norm25, is synthetic. To generate it, we chose 25 "true" centers uniformly at random from a 15-dimensional hypercube of side length 500. We then added points from Gaussian distributions of variance 1 around each true center. Thus, we obtained a number of well separated Gaussians with the true centers providing a good approximation to the optimal clustering.

We chose the remaining datasets from real-world examples off the UC-Irvine Machine Learning Repository. The Cloud dataset [7] consists of 1024 points in 10 dimensions, and it is Philippe Collard's first cloud cover database. The Intrusion dataset [18] consists of 494019 points in 35 dimensions, and it represents features available to an intrusion detection system. Finally, the Spam dataset [25] consists of 4601 points in 58 dimensions, and it represents features available to an e-mail spam detection system.

For each dataset, we tested k = 10, 25, and 50.

6.2 Metrics  Since we were testing randomized seeding processes, we ran 20 trials for each case. We report the minimum and the average potential (actually divided by the number of points), as well as the mean running time. Our implementations are standard with no special optimizations.
          Average φ                Minimum φ                Average T
  k    k-means     k-means++   k-means     k-means++   k-means   k-means++
  25   4.233·10^4   99.96%     1.914·10^4   99.92%      0.90      87.79%
  50   7.750·10^3   99.81%     1.474·10^1    0.53%      2.04       1.62%

Table 1: Experimental results on the Norm25 dataset (n = 10000, d = 15). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means: 100% · (1 − k-means++ value / k-means value).
          Average φ                Minimum φ                Average T
  k    k-means     k-means++   k-means     k-means++   k-means   k-means++
  25   3.637·10^3   42.76%     2.550·10^3   22.60%      0.11      43.21%
  50   1.867·10^3   39.01%     1.407·10^3   23.07%      0.16      41.99%

Table 2: Experimental results on the Cloud dataset (n = 1024, d = 10). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.
          Average φ                Minimum φ                Average T
  k    k-means     k-means++   k-means     k-means++   k-means    k-means++
  25   3.149·10^8   99.20%     3.100·10^8   99.32%     257.34      49.19%
  50   3.079·10^8   99.84%     3.076·10^8   99.87%     917.00      66.70%

Table 3: Experimental results on the Intrusion dataset (n = 494019, d = 35). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.
          Average φ                Minimum φ                Average T
  k    k-means     k-means++   k-means     k-means++   k-means   k-means++
  25   3.288·10^4   88.76%     3.280·10^4   89.58%      7.36      79.84%
  50   3.183·10^4   95.35%     2.384·10^4   94.30%     12.20      75.76%

Table 4: Experimental results on the Spam dataset (n = 4601, d = 58). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.
6.3 Results  The results for k-means and k-means++ are displayed in Tables 1 through 4. We list the absolute results for k-means, and the percentage improvement achieved by k-means++ (e.g., a 90% improvement in the running time is equivalent to a factor 10 speedup). We observe that k-means++ consistently outperformed k-means, both by achieving a lower potential value, in some cases by several orders of magnitude, and also by having a faster running time. The D² seeding is slightly slower than uniform seeding, but it still leads to a faster algorithm since it helps the local search converge after fewer iterations.

The synthetic example is a case where standard k-means does very badly. Even though there is an "obvious" clustering, the uniform seeding will inevitably merge some of these clusters, and the local search will never be able to split them apart (see [12] for further discussion of this phenomenon). The careful seeding method of k-means++ avoided this problem altogether, and it almost always attained the optimal clustering on the synthetic dataset.

The difference between k-means and k-means++ on the real-world datasets was also substantial. In every case, k-means++ achieved at least a 10% accuracy improvement over k-means, and it often performed much better. Indeed, on the Spam and Intrusion datasets, k-means++ achieved potentials 20 to 1000 times smaller than those achieved by standard k-means. Each trial also completed two to three times faster, and each individual trial was much more likely to achieve a good clustering.
7 Conclusion and future work

We have presented a new way to seed the k-means algorithm that is O(log k)-competitive with the optimal clustering. Furthermore, our seeding technique is as fast and as simple as the k-means algorithm itself, which makes it attractive in practice. Towards that end, we ran preliminary experiments on several real-world datasets, and we observed that k-means++ substantially outperformed standard k-means in terms of both speed and accuracy.

Although our analysis of the expected potential E[φ] achieved by k-means++ is tight to within a constant factor, a few open questions still remain. Most importantly, it is standard practice to run the k-means algorithm multiple times, and then keep only the best clustering found. This raises the question of whether k-means++ achieves asymptotically better results if it is allowed several trials. For example, if k-means++ is run 2^k times, our arguments can be modified to show it is likely to achieve a constant approximation at least once. We ask whether a similar bound can be achieved for a smaller number of trials.

Also, experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased φ as much as possible. Unfortunately, our proofs do not carry over to this scenario. It would be interesting to see a comparable (or better) asymptotic result proven here.
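For concreteness, a sketch of this greedy variant (our own rendering, not the authors' code; the number of candidates per round, here trials, is a free parameter): each round we draw several candidates by D² weighting and keep the one that lowers φ the most.

    import numpy as np

    def greedy_d2_seed(X, k, trials=3, rng=np.random.default_rng()):
        n = len(X)
        centers = [X[rng.integers(n)]]
        d2 = ((X - centers[0]) ** 2).sum(axis=1)
        for _ in range(k - 1):
            cand = X[rng.choice(n, size=trials, p=d2 / d2.sum())]
            new_d2 = [np.minimum(d2, ((X - c) ** 2).sum(axis=1)) for c in cand]
            best = int(np.argmin([nd.sum() for nd in new_d2]))  # greedy choice
            centers.append(cand[best])
            d2 = new_d2[best]
        return np.array(centers)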
Finally, we are currently working on a more thorough experimental analysis. In particular, we are measuring the performance of not only k-means++ and standard k-means, but also other variants that have been suggested in the theory community.
Acknowledgements
We would like to thank Rajeev Motwani for his helpful comments.

References
[1] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 155–165, New York, NY, USA, 2004. ACM Press.
[2] D. Arthur and S. Vassilvitskii. Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method. In Symposium on Foundations of Computer Science, 2006.
[3] David Arthur and Sergei Vassilvitskii. k-means++ test code. http://www.stanford.edu/~darthur/kMeansppTest.zip.
[4] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In SCG '06: Proceedings of the twenty-second annual Symposium on Computational Geometry. ACM Press, 2006.
[5] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[6] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 30–39, New York, NY, USA, 2003. ACM Press.
[7] Philippe Collard's cloud cover database. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/taylor/cloud.data.
[8] Sanjoy Dasgupta. How fast is k-means? In Bernhard Schölkopf and Manfred K. Warmuth, editors, COLT, volume 2777 of Lecture Notes in Computer Science, page 735. Springer, 2003.
[9] W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Approximation schemes for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 50–58, New York, NY, USA, 2003. ACM Press.
[10] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3):9–33, 2004.
[11] Frédéric Gibou and Ronald Fedkiw. A fast hybrid k-means level set algorithm for segmentation. In 4th Annual Hawaii International Conference on Statistics and Mathematics, pages 281–291, 2005.
[12] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.
[13] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC '04: Proceedings of the thirty-sixth annual ACM Symposium on Theory of Computing, pages 291–300, New York, NY, USA, 2004. ACM Press.
[14] Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? In SODA '05: Proceedings of the sixteenth annual ACM-SIAM Symposium on Discrete Algorithms, pages 877–885, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.
[15] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien. Large-scale clustering of cDNA-fingerprinting data. Genome Research, 9:1093–1105, 1999.
[16] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In SCG '94: Proceedings of the tenth annual Symposium on Computational Geometry, pages 332–339, New York, NY, USA, 1994. ACM Press.
[17] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89–112, 2004.
[18] KDD Cup 1999 dataset. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[19] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS '04: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 454–462, Washington, DC, USA, 2004. IEEE Computer Society.
[20] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–136, 1982.
[21] Jiří Matoušek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61–84, 2000.
[22] Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. In Adnan Darwiche and Nir Friedman, editors, UAI, pages 344–351. Morgan Kaufmann, 2002.
[23] Adam Meyerson. Online facility location. In FOCS '01: Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, page 426, Washington, DC, USA, 2001. IEEE Computer Society.
[24] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Symposium on Foundations of Computer Science, 2006.
[25] Spam e-mail database. http://www.ics.uci.edu/~mlearn/databases/spambase/.