Storage device performance prediction with CART models

更新时间:2023-07-19 03:45:01 阅读量: 实用文档 文档下载

说明:文章内容仅供预览,部分内容可能不全。下载后的文档,内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的,是否完整无缺。

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

StorageDevicePerformancePredictionwithCART

Models

MengzhiWang,KinmanAu,AnastassiaAilamaki,

AnthonyBrockwell,ChristosFaloutsos,andGregoryR.Ganger

CMU-PDL-04-103

March2004

ParallelDataLaboratoryCarnegieMellonUniversityPittsburgh,PA15213-3890

Abstract

Storagedeviceperformancepredictionisakeyelementofself-managedstoragesystemsandapplicationplanningtasks,suchasdataassignment.Thisworkexplorestheapplicationofamachinelearningtool,CARTmodels,tostoragedevicemodeling.Ourapproachpredictsadevice’sperformanceasafunctionofinputworkloads,requiringnoknowledgeofthedeviceinternals.WeproposetwousesofCARTmodels:onethatpredictsper-requestresponsetimes(andthenderivesaggregatevalues)andonethatpredictsaggregatevaluesdirectlyfromworkloadcharacteristics.Afterbeingtrainedonourexperimentalplatforms,bothprovideaccurateblack-boxmodelsacrossarangeoftesttracesfromrealenvironments.Experimentsshowthatthesemodelspredicttheaverageand90thpercentileresponsetimewithanrelativeerroraslowas16%,whenthetrainingworkloadsaresimilartothetestingworkloads,andinterpolatewellacrossdifferentworkloads.

Acknowledgements:WethankthemembersandcompaniesofthePDLConsortium(includingEMC,Hewlett-Packard,Hitachi,HitachiGlobalStorageTechnologies,IBM,Intel,LSILogic,Microsoft,NetworkAppliance,Oracle,Panasas,Seagate,Sun,andVeritas)fortheirinterest,insights,feedback,andsupport.WethankIBMforpartlyfundingthisworkthroughaCASstudentfellowshipandafacultypartnershipaward.ThisworkisfundedinpartbyNSFgrantsCCR-0205544,IIS-0133686,BES-0329549,IIS-0083148,IIS-0113089,IIS-0209107,andIIS-0205224.WewouldalsoliketothankEnoThereska,MikeMesnier,andJohnStrunkfortheirparticipationanddiscussionintheearlystageofthisproject.

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

Keywords:Performanceprediction,storagedevicemodeling

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

1Introduction

Thecostsandcomplexityofsystemadministrationinstoragesystems[17,35,11]anddatabasesystems[12,1,15,21]makeautomationofadministrationtasksacriticalresearchchallenge.Oneimportantaspectofadministeringself-managedstoragesystems,particularlylargestorageinfrastructures,isdecidingwhichdatasetstostoreonwhichdevices.To ndanoptimalornearoptimalsolutionrequirestheabilitytopredicthowwelleachdevicewillserveeachworkload,sothatloadscanbebalancedandparticularlygoodmatchescanbeexploited.

Researchershavelongutilizedperformancemodelsforsuchpredictiontocomparealternativestoragedevicedesigns.Givensuf cienteffortandexpertise,accuratesimulations(e.g.,[5,28])oranalyticmodels(e.g.,[22,30,31])canbegeneratedtoexploredesignquestionsforaparticulardevice.Unfortunately,inpractice,suchtimeandexpertiseisnotavailablefordeployedinfrastructures,whichareoftencomprisedofnumerousanddistinctdevicetypes,andtheiradministratorshaveneitherthetimenortheexpertiseneededtocon guredevicemodels.

Thispaperattacksthisobstaclebyprovidingablack-boxmodelgenerationalgorithm.By“blackbox,”wemeanthatthemodel(andmodelgenerationsystem)hasnoinformationabouttheinternalcomponentsoralgorithmsofthestoragedevice.Givenaccesstoadeviceforsome“trainingperiod,”themodelgen-erationsystemlearnsadevice’sbehaviorasafunctionofinputworkloads.Theresultingdevicemodelapproximatesthisfunctionusingexistingmachinelearningtools.OurapproachemploystheClassi cationAndRegressionTrees(CART)toolbecauseofitsef ciencyandaccuracy.CARTmodels,inanutshell,approximatefunctionsonamulti-dimensionalCartesianspaceusingpiece-wiseconstantfunctions.

Suchlearning-basedblackboxmodelingisdif cultfortworeasons.First,allthemachinelearningtoolswehaveexaminedusevectorsofscalarsasinput.Existingworkloadcharacterizationmodels,however,pressingthesedistributionsintoasetofscalarsisnotstraightforward.Second,thequalityofthegeneratedmodelsdependshighlyonthequalityofthetrainingworkloads.Thetrainingworkloadsshouldbediverseenoughtoprovidehighcoverageoftheinputspace.

Thisworkdevelopstwowaysofencodingworkloadsasvectors:avectorperrequestoravectorperworkload.Thetwoencodingschemesleadtotwotypesofdevicemodels,operatingattheper-requestandper-workloadgranularities,respectively.Therequest-leveldevicemodelspredicteachrequest’sresponsetimebasedonitsper-requestvector,or“requestdescription.”Theworkload-leveldevicemodels,ontheotherhand,predictaggregateperformancedirectlyfromper-workloadvectors,or“workloaddescriptions.”Ourexperimentsonavarietyofrealworldworkloadshaveshownthatthesedescriptionsarereasonablygoodatcapturingworkloadperformancefrombothsingledisksanddiskarrays.ThetwoCART-basedmodelshaveamedianrelativeerrorof17%and38%,respectively,foraverageresponsetimeprediction,and18%and43%respectivelyforthe90thpercentile,whenthetrainingandtestingtracescomefromthesameworkload.TheCART-basedmodelsalsointerpolatewellacrossworkloads.

Theremainderofthispaperisorganizedasfollows.Section2discussespreviousworkintheareaofstoragedevicemodelingandworkloadcharacterization.Section3describesCARTanditsproperties.Section4describestwoCART-baseddevicemodels.Section5evaluatesthemodelsusingseveralreal-worldworkloadtraces.Section6concludesthepaper.

2RelatedWork

Performancemodelinghasalongandsuccessfulhistory.Almostalways,however,thoroughknowledgeofthesystembeingmodeledisassumed.Disksimulators,suchasPantheon[33]andDiskSim[5],usesoftwaretosimulatestoragedevicebehaviorandproduceaccurateper-requestresponsetimes.Developingsuchsim-ulatorsischallenging,especiallywhendiskparametersarenotpubliclyavailable.Predictingperformance

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

usingsimulatorsisalsoresourceintensive.Analyticalmodels[7,22,24,30,31]aremorecomputationallyef cientbecausethesemodelsdescribedevicebehaviorwithasetofformulae.Findingtheformulasetrequiresdeepunderstandingoftheinteractionbetweenstoragedevicesandworkloads.Inaddition,bothdisksimulatorsandanalyticalmodelsaretightlycoupledwiththemodeleddevice.Therefore,newdevicetechnologiesmayinvalidateexistingmodelsandrequireanewroundofmodelbuilding.

OurapproachusesCART,whichtreatsstoragedevicesasblackboxes.Asaresult,themodelconstruc-tionalgorithmisfullyautomatedandshouldbegeneralenoughtohandleanytypeofstoragedevice.Thedegenerateformsof“black-boxmodels”areperformancespeci cations,suchasthemaximumthroughputofthedevices,publishedbydevicemanufacturers.Theactualperformance,however,willbenowherenearthesenumbersundersomeworkloads.Anderson’s“table-based”approach[3]includesworkloadcharacter-isticsinthemodelinput.Thetable-basedmodelsrememberdevicebehaviorforawiderangeofworkloadanddevicepairsandinterploatesamongtablesentriesinpredicting.Anderson’smodelsareusedinanautomatedstorageprovisiontool,Ergastulum[2],whichformulatesautomaticstorageinfrastructureprovi-sioningasanoptimizationproblemandusesdevicemodelstoguidethesearchalgorithminlocatingthesolution.Ourapproachimprovesonthetable-basedmodelsbyemployingmachinelearningtoolstocapturedevicebehavior.Becauseofthegoodscalabilityofthetoolstohighdimensionaldatasets,weareabletousemoresophisticatedworkloadcharacteristicsasthemodelinput.Asaresult,themodelsaremoreef cientinbothcomputationandstorage.

Workloadcharacterizationisanimportantpartofdevicemodelingbecauseitprovidesasuitablerep-resentationofworkloads.Despiteabundantpublishedworkinmodelingwebtraf c[23,25,8],I/Otraf cmodelingreceiveslessattention.Directapplicationofwebtraf canalysismethodstoI/worktraf chasacategoricaladdressspace,andthereisnonotionofsequentialscans.Incontrast,theperformancevariabilitycanbeseveralordersofmagnitudebetweenrandomandsequentialaccessesforI/Oworkloads.Ganger[10]pointedoutthecomplexityofI/Oworkloads,andeventhedetectionofsequentialscansisahardproblem[19].Gomezetal.[14]identi edself-similarityinI/Otraf candadoptedstructuralmodelstogenerateI/Oworkloads.Kurmasetal.[20]employedaniterativeapproachtodetectimportantworkloadcharacteristics.Rome[34]providedageneralframeworkofworkloadspeci cations.Alltheapproaches,inonewayoranother,useempiricaldistribu-tionsderivedfromgivenworkloadsastheparametervalues.Ourpreviouswork[32]takesadvantageoftheself-similarityofI/Oworkloadsandproposesatool,the“entropyplot,”tocharacterizethespatio-temporalcharacteristicsofI/Oworkloadswiththreescalars.SinceourCART-basedmodelsrequireworkloadstobepresentedintheformofvectorsofscalars,theentropyplotisanattractivechoice.

3Background:CARTModels

ThissectiongivesabriefintroductionoftheCARTmodelsandjusti esourchoiceofthetool.AdetaileddiscussionofCARTisavailablein[4].

3.1CARTModels

CARTmodelingisamachinelearningtoolthatcanapproximaterealfunctionsinmulti-dimensionalCarte-fXε,sianspace.(Itcanalsobethoughtofasatypeofnon-linearregression.)GivenafunctionY

dwhereX ,Y ,andεiszero-meannoise,aCARTmodelapproximatesYusingapiece-wiseconstant f X.WerefertothecomponentsofXasfeaturesinthefollowingtext.Theterm,ε,capturesfunction,Y

theintrinsicrandomnessofthedataandthevariabilitycontributedbytheunobservablevariables.Thevari-anceofthenoisecouldbedependentonX.Forexample,thevarianceofresponsetimeoftendependsonthearrivalrate.

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

x<5.94851

x<3.23184

x<7.7123 100

x<9.0008x<1.92033

x<5.0035x<7.05473

80 60 40 20 0

y

f(x) = x * x

CART

x<1.69098

x<3.60137

30.83 48.68 56.33 72.06 16.64 26.85x<4.4543x<0.889

88.01 2.57 21.67 -1.92 6.05 8.94-20

0 2 4

x

6 8 10

(a)Fittedtree

(b)Datapointsandregressionline

Figure1:CARTmodelforasimpleone-dimensionaldataset.Thedatasetcontains100datapointsgen-x2ε,whereεfollowsaGuassiandistributionwithmean0andstandarddeviationeratedusingfx

10.

Thepiece-wiseconstantfunctionf Xcanbevisualizedasabinarytree.Figure1(a)showsaCARTmodelconstructedonthesampleone-dimensionaldatasetin(b).Thesampledatasetisgeneratedusing

yi

x2i

εi

i

12

100

wherexiisuniformlydistributedwithin(0,10),andεifollowsaGuassiandistributionofN010.The

leafnodescorrespondtodisjointhyper-rectanglesinthefeaturevectorspace.Thehyper-rectanglesaredegeneratedintointervalsforone-dimensionaldatasets.Eachleafisassociatedwithavalue,f X,whichisthepredictionforallXswithinthecorrespondinghyper-rectangle.Theinternalnodescontainsplitpoints,andapathfromtheroottoaleafde nesthehyper-rectangleoftheleafnode.Thetree,therefore,representsapiece-wiseconstantfunctiononthefeaturevectorspace.Figure1(b)showstheregressionlineofthesampleCARTmodel.

3.2CARTModelProperties

CARTmodelsarecomputationallyef cientinbothconstructionandprediction.Theconstructionalgorithmstartswithatreewithasinglerootnodecorrespondingtotheentireinputvectorspaceandgrowsthetreebygreedilyselectingthesplitpointthatyieldsthemaximumreductioninmeansquarederror.AmoredetaileddiscussionofthesplitpointselectionispresentedinAppendixA.Eachpredictioninvolvesatreetraversaland,therefore,isfast.

CARToffersgoodinterpretabilityandallowsustoevaluatetheimportanceofvariousworkloadchar-acteristicsinpredictingworkloadperformance.ACARTmodelisabinarytree,makingiteasytoplotonpaperasinFigure1(a).Moreimportantly,onecanevaluateafeature’simportancebyitscontributioninerrorreduction.Intuitively,amoreimportantfeatureshouldcontributemoretotheerrorreduction;thus,leavingitoutofthefeaturevectorwouldsigni cantlyraisethepredictionerror.InaCARTmodel,weusethesumoftheerrorreductionrelatedtoalltheappearancesofafeatureasitsimportance.

3.3ComparisonWithOtherRegressionTools

OtherregressiontoolscanachievethesamefunctionalityasCART.WechoosetouseCARTbecauseofitsaccuracy,ef ciency,robustness,andeaseofuse.Table1comparesCARTwithfourotherpopulartoolsto

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

FeatureCART

high(505%)

Neuralnetworks

fair(66%)

Poor

Fair

Poor

Poor

fast(seconds)

slow(hours)

fast

(milliseconds)

low(60B)

low(2MB)

Fair

k-nearestneighbors

InterpretabilityAbilitytohandleirrelevantinput

GoodGood

PoorPoor

PredictiontimeEaseofuse

fast

(milliseconds)

Good

slow(minutes)Fair

Table1:Comparisonofregressiontoolsinpredictingper-requestresponsetime.(ThesamedatasetisusedinFigure5.)Thecomparisononrow2,3,4andthelastoneistakenfrom[16].Werankthefeaturesintheorderoftheirimportance.Interpretabilityisthemodel’sabilitytoinfertheimportanceofinputvariables.Robustnessistheabilitytofunctionwellundernoisydataset.Irrelevantinputreferstofeaturesthathavelittlepredictivepowers.

buildtherequest-leveldevicemodelasdescribedinSection4.2.Themodelswereconstructedonthe rstdayofcello99aandtestsrunonthesecondofthesametrace.TheinformaiononthetracesweusedmaybefoundinSection5.

Themodel[29]usesalinearfunctionofXtoapproximatefX.Duetonon-linearstoragedevicebehavior,linearmodelshavepooraccuracy.

Themodel[26]consistsofasetofhighlyinterconnectedprocessingelementsworkinginunisontoapproximatethetargetfunction.Weuseasinglehiddenlayerof20nodes(bestamong20and40)andalearningrateof0.05.Halfofthetrainingsetisusedinbuildingthemodelandtheotherhalfforvalidation.Suchamodeltakesalongtimetoconverge.

The[6]mapstheinputdataintoahighdimensionalspaceandperformsalinearregressionthere.Ourmodelusestheradialbasisfunction

Kxix

expγx

xi

2

asthekernelfunction,andγissettobe2(bestamong1,3,4,6).Weuseanef cientimplementation,SVMlight[18],inourexperiment.Selectingtheparametervaluesrequiresexpertiseandmultipleroundsoftrials.

Themodel[9]ismemory-basedbecausethemodelremembersallthetrain-ingdatapointsandpredictionisdonethroughaveragingtheoutputoftheknearestneighborsofthedatapointbeingpredicted.WeusetheEuclideandistancefunctionandakvalueof5(bestamong5,10,15,and20).Themodelisaccurate,butisinef cientinstorageandcomputation.

Thelastthreetoolsrequirethatallthefeaturesandoutputbenormalizedtotheunitlength.Forfeaturesoflargevaluerange,wetakelogarithmsbeforenormalization.Overall,CARTisthebestatpredictingper-requestresponsetimes,withtheonlydownsidebeingslightlyloweraccuracycomparedtothemuchmorespace-andtime-consumingapproach.

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

Figure2:Modelconstructionthroughtraining.RTiistheresponsetimeofrequestri.

4PredictingPerformancewithCART

ThissectionpresentstwowaysofconstructingdevicemodelsbasedonCARTmodels.

4.1Overview

OurgoalistobuildamodelforagivenstoragedevicewhichpredictsdeviceperformanceasafunctionofI/Oworkload.Thedevicemodelreceivesaworkloadasinputandpredictsitsaggregateperformance.Wede neaworkloadasasequenceofdiskrequests,witheachrequest,ri,uniquelydescribedbyfourattributes:arrivaltime(ArrivalTimei),logicalblocknumber(LBNi),requestsizeinnumberofdiskblocks(Sizei),andread/writetype(RWi).Thestoragedevicecouldbeasingledisk,adiskarray,orsomeotherlike-interfacedcomponent.Theaggregateperformancecanbeeithertheaverageorthe90-thpercentileresponsetime.

OurapproachusesCARTtoapproximatethefunction.Weassumethatthemodelconstructionalgorithmcanfeedanyworkloadintothedevicetoobserveitsbehaviorforacertainperiodoftime,alsoknownas“training.”Thealgorithmthenbuildsthedevicemodelbasedontheobservedresponsetimes,asillustratedinFigure2.Modelconstructiondoesnotrequireanyinformationabouttheinternalsofthemodeleddevice.Therefore,itisgeneralenoughtomodelanydevice.

Regressiontoolsareanaturalchoicetomodeldevicebehavior.Suchtoolsaredesignedtomodelfunc-tionsonmulti-dimensionalspacegivenasetofsampleswithknownoutput.Thedif cultyistotransformworkloadsintodatapointsinamulti-dimensionalfeaturespace.Weexploretwowaystoachievethetrans-formationasillustratedinFigure3.Arequest-levelmodelrepresentsarequestriasavectorRi,alsoknownasthe“requestdescription,”andusesCARTmodelstopredictper-requestresponsetimes.Theaggregateperformanceisthencalculatedbyaggregatingtheresponsetimes.Aworkload-levelmodel,ontheotherhand,representstheentireworkloadasasinglevectorW,orthe“workloaddescription,”andpredictstheaggregateperformancedirectlyfromW.Inbothapproaches,thequalityoftheinputvectorsiscriticaltothemodelaccuracy.Thenexttwosectionspresenttherequestandworkloaddescriptionsindetail.

4.2Request-LevelDeviceModels

ThissectiondescribestheCART-basedrequest-leveldevicemodel.ThismodelusesaCARTmodeltopredicttheresponsetimesofindividualrequestsbasedonrequestdescriptions.Themodel,therefore,isabletogeneratetheentireresponsetimedistributionandoutputanyaggregateperformancemeasures.

Weadoptthefollowingtwoconstraintsindesigningtherequestdescription.1.Ridoesnotincludeanyactualresponsetimes.Onecouldrelaxthisconstraintbyallowingthein-clusionoftheresponsetimeinformationforalltherequeststhathavealreadybeenservedwhenthecurrentrequestarrives.Thisrelaxation,however,isfeasibleonlyforonlineresponsetimepredictions;itwouldnotbeappropriateforapplicationplanningtasksbecausetheplannerdoesnotrunworkloadsondevices.

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

Figure3:TwotypesofCART-baseddevicemodels.

2.Ricanbecalculatedfromrj,ji.Thisconstraintsimpli estherequestdescription.Inmostcases,theresponsetimeofacurrentrequestdependsonlyonpreviousrequestsandtherequestitself.

OurrequestdescriptionRiforrequestricontainsthefollowingvariables:

Ri

TimeDiffi1

TimeDiffik

LBNi

LBNDiffi1

LBNDiffil

Sizei

RWi

Seqi

whereTimeDiffikArrivalTimeiArrivalTimei2k1andLBNDiffilLBNiLBNil.The rstthreegroupsoffeaturescapturethreecomponentsoftheresponsetime,andSeqiindicateswhethertherequestisasequentialaccess.The rstk1featuresmeasurethetemporalburstinessoftheworkloadwhenriarrives,andsupportpredictionofthequeuingtime.WeallowtheTimeDifffeaturestoexponentiallygrowthedistancefromthecurrentrequesttohistoryrequesttoaccommodatelargebursts.Thenextl1featuresmeasurethespatiallocality,supportingpredictionoftheseektimeoftherequest.SizeiandRWisupportpredictionofthedatatransfertime.

Thetwoparameters,kandl,determinehowfarwelookbackforrequestburstsandlocality.Smallvaluesdonotadequatelycapturethesecharacteristics,rgevalues,ontheotherhand,leadstoahigherdimensionality,meaningtheneedforalargertrainingsetandalongertrainingtime.Theoptimalvaluesfortheseparametersarehighlydevicespeci c,andSection5.1showshowweselecttheparametervaluesinourexperiments.

4.3Workload-LevelDeviceModels

Theworkload-levelmodelrepresentstheentireworkloadasasingleworkloaddescriptionandpredictsaggregatedeviceperformancedirectly.TheworkloaddescriptionWcontainsthefollowingfeatures.

W

Averagearrivalrate

Readratio

Averagerequestsize

Percentageofsequentialrequests

Temporalandspatialburstiness

Correlationsbetweenpairsofattributes

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

Theworkloaddescriptionusestheentropyplot[32]toquantifytemporalandspatialburstinessandcorrela-tionsbetweenattributes.Entropyvalueareplottedononeortwoattributesagainsttheentropycalculationgranularity.Theincrementoftheentropyvaluescharacterizeshowtheburstinessandcorrelationschangefromonegranularitytothenext.Becauseoftheself-similarityofI/Oworkloads[13],theincrementisusuallyconstant,allowingustousetheentropyplotslopetocharacterizetheburstinessandcorrelations.AppendixBdescribestheentropyplotindetail.

Theworkload-leveldevicemodeloffersfastpredictions.ThemodelcompressesaworkloadintoaworkloaddescriptionandfeedsthedescriptionintoaCARTmodeltoproducethedesiredperformancemeasure.Featureextractionisalsofast.Topredictboththeaverageand90thpercentileresponsetime,themodelmusthavetwoseparatetrees,oneforeachperformancemetric.

Workloadmodelingintroducesaparametercalled“windowsize.”Thewindowsizeistheunitofper-formancepredictionand,thus,theworkloadlengthforworkloaddescriptiongeneration.Forexample,wecandividealongtraceintoone-minutefragmentsandusetheworkload-levelmodeltopredicttheaverageresponsetimeoverone-minuteintervals.Fragmentingworkloadshasseveraladvantages.First,performanceproblemsareusuallytransient.A“problem”ingtheworkloadinitsentirety,ontheotherhand,failstoindentifysuchtransientproblems.Second,fragmentingthetrainingtraceproducesmoresamplesfortrainingandreducestherequiredtrainingtime.Windowsthataretoosmall,however,containtoofewrequestsfortheentropyplottobeeffective.Weuseone-minutewindowsinallofourexperiments.

4.4ComparisonofTwoTypesofModels

Thereisacleartradeoffbetweentherequest-levelandworkload-leveldevicemodels.Theformerisfastintrainingandslowinprediction,andthelatteristheopposite.

Themodeltrainingtimeisdominatedbytracereplay,which,whentakingplaceonactualdevices,requiresexactlythesameamountoftimeasthetracelength.BuildingaCARTmodelneedsonlysecondsofcomputation,buttracereplaycanrequirehundredsofhourstoacquireenoughdatapointsformodelconstruction.Whenoperatingattherequestlevel,thedevicemodelgetsonedatapointperrequestasopposedtoonedatapointperone-minuteworkloadfragmentasintheworkload-leveldevicemodel.Inordertogetthesamenumberofdatapoints,theworkload-leveldevicemodelneedsatrainingtime100timeslongerthantherequest-levelmodelwhenthearrivalrateis100requestsperminute.

Thenumberoftreetraversalsdeterminesthepredictiontime,sinceeachpredictedvaluerequiresatreetraversal.Therefore,thetotalnumberoftreetraversalsisthenumberofrequestsintheworkloadfortherequest-leveldevicemodelandthenumberofworkloadfragmentsfortheworkload-levelmodel.Withanaveragearrivalrateof100requestsperminute,therequest-levelmodelis100timesslowerinprediction.

Anitemforfutureresearchistheexplorationofthepossibilityofcombiningthetwomodelstodeliveronesthatareef cientinbothtrainingandprediction.

5ExperimentalResults

ThissectionevaluatestheCART-baseddevicemodelspresentedintheprevioussectionusingarangeofworkloadtraces.

Devices.Wemodeltwodevices:asinglediskandadiskarray.Thesinglediskisa9GBAtlas10Kdiskwithanaveragerotationallatencyof3milliseconds.ThediskarrayisaRAID5diskarrayconsistingof8Atlas10Kdiskswitha32KBstripesize.WereplayallthetracesonthetwodevicesexcepttheSAPtrace,whichisbeyondthecapacityoftheAtlas10Kdisk.

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

Tracenamecello99acello99bcello99c

Length4weeks4weeks4weeks

Tracedescription

AverageSize

7.8Million

7.1KB118.0KB8.5KB

1.1Million

singledisk

35.4%

115.71ms113.61ms5.04ms

99.9%

7.40ms59.28ms

Table2:Tracesummary.WemodelanAtlas10K9GBandaRAID5diskarrayconsistingof8Atlas10Kdisks.TheresponsetimeiscollectedbyreplayingthetracesonDiskSim3.0[5].

Traces.Weusethreesetsofreal-worldtracesinthisstudy.Table2liststhesummarystatisticsoftheeditedtraces.The rsttwo,cello92andcello99capturetypicalcomputersystemresearchI/Oworkloads,collectedatHPLabsin1992and1999respectively[27,14].Wepreprocesscello92toconcatenatetheLBNsofthethreemostactivedevicesfromthetraceto llthemodeleddevice.Forcello99,wepickthethreemostactivedevices,amongthe23devices,andlabelthemcello99a,cello99b,andcello99c.Thecello99traces tina9GBdiskperfectly,sonotraceeditingisnecessary.Asthesetracesarelong(twomonthsforcello92andoneyearforcello99),wereportdataforafour-weeksnapshot(5/1/92to5/28/92and2/1/99to2/28/99).

TheSAPtracewascollectedfromanOracledatabaseserverrunningSAPISUCCS2.5Binapowerutilitycompany.Theserverhasmorethan3,000usersanddiskaccessesre ecttheretrievalofcustomerinvoicesforupdatingandreviewing.SequentialreadsdominatetheSAPtrace.

Evaluationmethodology.Theevaluationusesthedevicemodelstopredicttheaverageand90thper-centileresponsetimeforone-minuteworkloadfragments.Wereportthepredictionerrorsusingtwometrics:

Y,andrelativeerrorabsoluteerrorde nedasthedifferencebetweenthepredictedandtheactualvalue,Y

de nedasYY

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

Relative importanc

e

60%40%20%0%

TimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiff1LBNLBNDiffLBNDiffLBNDiffLBNDiffLBNDiffSize(a)relativeimportancemeasuredonthe9GBAtlas10KBdiskusingcello99aand

cello99c

SeR(b)relativeimportancemeasuredontheRAID5diskarrayusingcello99aandcello99c

Figure4:Relativeimportanceoftherequestdescriptionfeatures.

5.1CalibratingRequest-LevelModels

Thissectiondescribeshowweselectparametervaluesforkandlfortherequest-leveldevicemodels.

Figure4showstherelativeimportanceoftherequestdescriptionfeaturesindeterminingper-requestresponsetimebysettingkto10andlto5.Thefeature’srelativeimportanceismeasuredbyitscontributioninerrorreduction.Thegraphsshowtheimportanceofrequestdescriptionfeaturesmeasuredonbothdevices,trainedontwotraces(cello99aandcello99c).Weuseonlythe rstdayofthetracesandreducethedatasetsizeby90%withuniformsampling.

First,weobservethattherelativeimportanceisworkloaddependent.Asweexpected,forbusytraf- csuchasthatwhichoccurredinthecello99atrace,thequeuingtimedominatestheresponsetime,andthereby,theTimeDifffeaturesaremoreimportant.Ontheotherhand,cello99chassmallresponsetimes,andfeaturesthatcharacterizethedatatransfertime,suchasSizeandRW,havegoodpredictivepowerinmodelingthesingledisk.

Second,weobservethatthemostimporantfeatureshiftsfromTimeDiff8toTimeDiff7wherecom-paringthesingledisktothediskarrayforcello99abecausethequeuingtimebecomeslesssigni cantforthediskarray.Thedistinctionbetweenthetwotraces,however,persists.

Wesetkto10forTimeDiffandlto3forLBNDiffinthesubsequentexperimentssothatwecanmodeldevicebehaviorunderbothtypesofworkloads.

Weshowthemodelaccuracyinpredictingper-requestresponsetimesinFigure5.ThemodelisbuiltfortheAtlas10Kdisk.Thetrainingtraceisthe rstdayofcello99a,andthetestingtraceistheseconddayofthesametrace.Figure5(a)isascatterplot,showingthepredictedresponsetimesagainsttheactualonesforthe rst5,000requests.Mostofthepointsstayclosetothediagonalline,suggestingaccuratepredictionoftherequest-leveldevicemodel.Figure5(b)furthercomparestheresponsetimedistributions.Thelongtailofthedistributioniswellcapturedbytherequest-levelmodel,indicatingthattherequestdescriptionis

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

Predicted response time (ms)

1600 1200 800 400 0

100 %80 %% of requests

60 %40 %20 %0 %

ActualPredicted

1000 2000 3000 4000

Response time (ms)

400 800 1200Actual response time (ms)

1600

(a)Scatterplot(b)Responsetimedistribution

Figure5:Predictionaccuracyoftherequest-levelmodel.Theactualandpredictedaverageresponsetimesare137.96msand133.01msrespectively.Thecorrespondingdemerit,de nedin[28]astherootmeansquareofhorizontaldistancebetweentheactualandpredictedcurvesin(b),is46.06milliseconds(33.4%oftheactualaverageresponsetime).

effectiveincapturingrequest-levelcharacteristicsneededtopredictresponsetimes.

Insummary,therequestdescriptioneffectivelycapturesimportantper-requestcharacteristics,leadingtoaccuraterequest-leveldevicemodels.

5.2ModelingASingleDisk

Figure6comparestheaccuracyofallthepredictorsinmodelinganAtlas10K9GBdiskonreal-worldtraces.Asmentionedearlier,allthepredictorsaretrainedusingthe rsttwoweeksofcello99a.Overall,thetwoCART-baseddevicemodelsprovidegoodpredictionaccuracyinpredictingboththeaverageand90thpercentileresponsetimes,comparedtootherpredictors.Severalmoredetailedobservationscanbemade.

First,allofthemodelsperformthebestwhenthetrainingandtestingtracesarefromthesameworkload,e.g.cello99a,becausethemodelshaveseenhowthedevicebehavesundersuchworkloads.Thepredictoralsocutsthemedianpredictionerrorofthepredictorbymorethanahalfbecauseofthestrongperiodicityoftheworkload.andfurtherreducetheerrorto4.84milliseconds(19%)and14.83milliseconds(47%)respectivelyfortheaverageresponsetimeprediction,and20.46milliseconds(15%)and49.50milliseconds(45%)respectivelyforthe90thpercentile.Theperfor-androughlymeasuresthebene tofusinganon-linearmancedifferencebetween

model,suchasCART,becausebothacceptthesameinput.Weobserveasigni cantimprovementfromtheformertothelatter,suggestingnon-lineardevicebehavior.

Second,bothCART-baseddevicemodelsinterpolatebetteracrossworkloadsthantheothermodels.

andrelyblindlyonsimilaritiesbetweenthetrainingandtestingworkloadstomake

goodpredictions.Consequently,itisnotsurprisingtoseehugepredictionerrorswhenthetrainingandtestingworkloadsdiffer.TheCART-basedpredictors,ontheotherhand,areabletodistinguishbetweenworkloadsofdifferentcharacteristicsandaremorerobusttothedifferencebetweenthetrainingandtestingworkloads.

Third,modelaccuracyishighlydependentonthetrainingworkloadqualityfortheCART-basedmodels.Thepredictionerrorincreasesforworkloadsotherthancello99a,becauseoftheaccesspatterndifferencesamongthesetraces.TheCART-basedmodelslearndevicebehaviorthroughtraining;therefore,theycannotpredictperformanceforworkloadsthathavetotallydifferentcharacteristicsfromthetrainingworkloads.

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

averageresponsetime

(b)Predictionerrorfor90thpercentileresponsetime

Figure6:Comparisonofpredictorsforasingle9GBAtlas10Kdisk.

Forexample,constantlyover-predictsforcello99cbecausethemodelwasnevertrainedwththesmallsequentialaccessesthatareparticulartocello99c.Section5.4givesaninformalerroranalysisandidenti esinadequatetrainingbeingthemostsigni canterrorsource.

Fourth,highquantileresponsetimesaremoredif culttopredict.Weobservelargerpredictionerrorsfromallthepredictorsfor90thpercentileresponsetimepredictionsthanforaverageresponsetimepredic-tions.TheaccuracyadvantageofthetwoCART-basedmodelsishigherfor90thpercentilepredictions.

Insummary,thetwoCART-basedmodelsgiveaccuratepredictionswhenthetrainingandtestingwork-loadssharethesamecharacteristicsandinterpolatewellotherwise.Thegoodaccuracysuggeststheeffec-tivenessoftherequestandworkloaddescriptionsincapturingimportantworkloadcharacteristics.

5.3ModelingADiskArray

Figure7comparestheaccuracyofthefourpredictorsinmodelingthediskarray.ThepredictorisnotpresentedbecausetheSAPtracedoesnotprovideenoughinformationonarrivaltimeforustoknowtheoffsetwithinaweek.Theoverallresultsaresimilartothoseforthesingledisk.ThetwoCART-basedmodelsarethemostaccuratepredictors.Theabsoluteerrorsbecomesmallerduetothedecreasedresponsetimefromthesingledisktothediskarray.Therelativeaccuracyamongthepredictors,however,staysthesame.Overall,theCART-baseddevicemodelingapproachworkswellforthediskarray.

5.4ErrorAnalysis

Thissectionpresentsaninformalerroranalysistoidentifythemostsigni canterrorsourcefortheCART-baseddevicemodels.

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

averageresponsetime

(b)predictionerrorfor90thpercentileresponsetime

Figure7:ComparisonofpredictorsforaRAID5diskarrayof8Atlas10Kdisks.

Amodel’serrorconsistsoftwoparts.The rstpartcomesfromintrinsicrandomnessoftheinputdata,suchasmeasurementerror,andthiserrorcannotbecapturedbyanymodel.Therestoftheerrorcomesfromthemodelingapproachitself.TheCART-basedmodelsincurerroratthreeplaces.First,thetransformationfromworkloadstovectorsintroducesinformationloss.Second,theCART-basedmodelsusepiece-wiseconstantfunctions,whichcouldbedifferentfromthetruefunctions.Third,alow-qualitytrainingtraceyieldsinaccuratemodelsbecauseCARTreliesontheinformationfromthetrainingdatatomakepredictions.Aninadequatetrainingsethasonlyalimitedrangeofworkloadsandleadstolargepredictionerrorsforworkloadsoutsideofthisrange.We ndthatthelasterrorsource,inadequatetrainingdata,causesthemosttroubleinourexperiments.

Weconductasmallexperimenttoverifyourhypothesis.Figure8(a)comparesthedifferenceinse-quentialitybetweencello99aandcello99c.Thespectrumofsequentiality(from0%to100%ofrequestsintheworkloadbeingsequential)isdividedinto20buckets,andthegraphsshowsthenumberofone-minuteworkloadfragmentsineachbucketforbothtraces.Weobserveasigni cantnumberofhighsequentialityfragmentsincello99b,butnofragmentgoesbeyond50%sequentialityincello99a.Thisdifferenceleadstolargepredictionerrorsforhighsequentialityfragmentswhenwebuildtheworkload-levelmodeloncello99aanduseittopredicttheperformanceofcello99b,asshownin(b).Theerrorsarereducedsigni cantlywhenweincludethe rsthalfofcello99bintraining.Thedramaticerrorreductionsuggeststhatpredictioner-rorsfromtheothersourcesarenegligiblewhencomparedwiththeonesintroducedbyinadequatetraining.Figure8(c)furthershowstheabsoluteerrorhistogramwith1millisecondbuckets.Thespikeshiftto0millisecondswhenwetrainthemodelonthecombinedtrainingtrace,indicatingthatitisreasonabletoas-sumeazero-meannoiseterm.Weconcludefromthisevidencethatcontributingeffortsinblack-boxdevicemodelingshouldbedirectedtowardgeneratingagoodtrainingsetthatcoversabroadrangeofworkloadtypes.

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

200

# of intervals

1501005000%

25%50%75%100%% of sequential requests

(a)Histogramsonsequentiality(b)Averageabsoluteerror(c)Absoluteerrordistribution

Figure8:Effectsofdifferenttrainingworkloads.

6Conclusions

Storagedeviceperformancemodelingisanimportantelementinself-managedstoragesystemsandotherapplicationplanningtasks.Ourtargetmodeltakesaworkloadasinputandpredictsitsaggregateperfor-manceonthemodeleddeviceef cientlyandaccurately.Thispaperpresentsourinitialresultsinexploringmachinelearningtoolstobuilddevicemodels.Ablackboxpredictivetool,CART,makesdevicemodelsindependentofthestoragedevicesbeingmodeled,andthus,generalenoughtohandleanytypeofdevices.Themodelconstruction,alsoknownastraining,consistsoftwophases:replayingtracesonthedevicesandbuildingaCARTmodelbasedontheobservedresponsetimes.Modelinganewdeviceinvolvesonlytrainingonthetargetdevice.

CART-basedmodelstakeinputintheformofvectors,soworkloadsmustbetransformedintovectorsinordertouseCARTasthebasisfordevicemodels.Thispaperpresentstwowaystoaccomplishsuchatransformation,yieldingtwotypesofdevicemodels.Therequest-leveldevicemodelsrepresenteachrequestasavectorandpredictitsresponsetime.Asaresult,themodelsareabletopredicttheentireresponsetimedistribution.Theexperimentsshowthatthepredictedresponsetimehasademerit gureof33%foramodernUNIX leservertrace,leadingtoamedianrelativeerroraslowas16%foraggregateperformancepredictions.Theworkload-leveldevicemodels,ontheotherhand,transformaworkloadfragmentintoavectorandpredictitsaggregateperformancedirectly.Thevectortakesadvantageoftheef ciententropyplotmetrictocapturethetemporalandspatialburstinessaswellasthecorrelationswithinI/Oworkloads.Themedianrelativeerrorcanbeaslowas29%fortheworkload-leveldevicemodels.

Theerroranalysissuggeststhatthequalityofthetrainingworkloadsplaysacriticalroleinthemodelaccuracy.Themodelsareunabletopredictworkloadsthataredifferentfromthetrainingworkloads.Toaccuratelypredictarbitraryworkloads,itisimportantforthetrainingworkloadstobeasdiverseaspossibletocoverawiderangeofworkloads.Ourfutureworkwillexploretheeffectivenessofexistingsyntheticworkloadgeneratorsinproducinghigh-qualitytrainingworkloads.

Continuingresearchcanimprovethemodelpredictionaccuracy.First,ourexperimentsshowtherele-vanceoftrainingtraces.Generatingrulestoassistintrainingsuchmodelsbroadlyenoughwillbeimportant.Second,theworkloadcharacterizationproblempersists,affectingtheworkload-levelmodels.Webelieve,however,thatthecontextofferedbythemodelscanhelpproduceinsightintothislong-standingproblem.Third,thetwotypesofdevicemodelsshowdesirablepropertiesintrainingandpredicting,respectively.Itshouldbevaluabletohaveamodelthatcombinesthebestofbothapproaches.

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

References

[1]N.Allen.Don’twasteyourstoragedollars:Whatyouneedtoknow.Researchnote,GartnerGroup,

2001.[2]E.Anderson,M.Kallahalla,S.Spence,R.Swaminathan,andQ.Wang.Ergastulum:Quickly nding

near-optimalstoragesystemdesigns.TechnicalReportHPL-SSP-2001-05,HPLabs,2001.[3]EricAnderson.Simpletable-basedmodelingofstoragedevices.TechnicalReportHPL-SSP-2001-4,

HPLabs,2001.[4]L.Brieman,J.H.Friedman,R.A.Olshen,andC.J.Stone.Classi cationandRegressionTrees.

Wadsworth,1984.[5]JohnBucy,GregGanger,andcontributors.TheDiskSimsimulationenvironmentversion3.0reference

manual.TechnicalReportCMU-CS-03-102,CarnegieMellonUniversity,2003.[6]C.J.C.Burges.AtutorialonSupportVectorMachinesforpatternrecognition.KnowledgeDiscovery

andDataMining,2(2):121–167.[7]ShenzeChenandDonTowsley.AperformanceevaluationofRAIDarchitectures.IEEETransactions

onComputers,45(10):1116–1130,1996.[8]MarkE.CrovellaandAzerBestavros.Self-similarityinworldwidewebtraf cevidenceandpos-siblecauses.InProc.ofthe1996ACMSIGMETRICSIntl.Conf.onMeasurementandModelingofComputerSystems,May1996.[9]B.Dasarathy.Neareastneighborpatternclassi cationtechniques.IEEEComputerSocietyPress,

1991.[10]GregoryR.Ganger.Generatingrepresentativesyntheticworkloads:Anunsolvedproblem.InPro-ceedingsoftheComputerMeasurementGroup(CMG)Conference,pages1263–1269,1995.[11]GregoryR.Ganger,JohnD.Strunk,andAndrewJ.Klosterman.Self-*storage:Brick-basedstorage

withautomatedadministration.TechnicalReportCMU-CS-03-178,CarnegieMellonUniversity,2003.[12]GartnerGroup.Totalcostofstorageownership—Auser-orientedapproach.Researchnote,Gartner

Group,2000.[13]Mar´iaE.G´omezandVicenteSantonja.Analysisofself-similarityinI/Oworkloadusingstructural

modeling.In7thInternationalSymposiumonModeling,AnalysisandSimulationofComputerandTelecommunicationSystems,pages234–243,1999.[14]Mar´iaE.G´omezandVicenteSantonja.Anewapproachinthemodelingandgenerationofsynthetic

diskworkload.InProceedingsof9thInternationalSymposiumonModeling,AnalysisandSimulationofComputerandTelecommunicationSystems,pages199–206,2000.[15]J.Gray.AconversationwithJimGray.ACMQueue,1(4),2003.

[16]TrevorHastie,RobertTibshirani,andJeromeFriedman.TheElementsofStatisticalLearning:Data

Mining,Inference,andPrediction.Springer,2001.[17]InternationalBusinessMachinesCorp.Autonomiccomputing:IBM’sperspectiveonthestateof

informationtechnology./autonomic/manifesto/.

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

[18]T.Joachims.Makinglarge-scaleSVMlearningpractical.InB.Sch¨olkopf,C.Burges,andSmola,

editors,AdvancesinKernelMethods–SupportVectorLearning.MIT-Press,1999.[19]K.Keeton,G.Alvarez,E.Riedel,andM.Uysal.CharacterizingI/O-intensiveworkloadsequentiality

onmoderndiskarrays.InTheFourthWorkshopOnComputerArchitectureEvaluationUsingCom-mercialWorkloads,2001.[20]ZacharyKurmas,KimberlyKeeton,andRalphBecker-Szendy.I/Oworkloadcharacterization.In4th

workshoponcomputerarchitectureevaluationusingcommercialworkloads,2001.[21]mb.Hardwarespendingsputters.RedHerring,pages32–33,2001.

[22]EdwardK.LeeandRandyH.Katz.Ananalyticperformancemodelofdiskarrays.InProceedingsof

the1993ACMSIGMETRICS,pages98–109,1993.[23]WillE.Leland,MuradS.Taqq,WalterWillinger,andDanielV.Wilson.Ontheself-similarnatureof

Ethernettraf c.InACMSIGCOMM,pages13–17,1993.[24]ArifMerchantandGuillermoA.Alvarez.DiskarraymodelsinMinerva.TechnicalReportHPL-2001-118,HPLaboratories,2001.[25]RuldolfH.Riedi,MatthewS.Crouse,VinayJ.Ribeiro,andRichardG.Baraniuk.Amultifrac-talwaveletmodelwithapplicationtonetworktraf c.IEEETransactionsonInformationTheory,45(3):992–1018,April1999.[26]B.D.Ripley.PaternrecognitionsandNeuralNetworks.CambridgeUniversityPress,1996.

[27]ChrisRuemmlerandJohnWilkes.Unixdiskaccesspatterns.InWinterUSENIXConference,pages

405–420,1993.[28]ChrisRuemmlerandJohnWilkes.Anintroductiontodiskdrivemodeling.IEEEComputer,27(3):17–

28,1994.[29]G.Seber.Multivariateobservations.Wiley,1984.

[30]ElizabethShriver,ArifMerchant,andJohnWilkes.Ananalyticalbehaviormodelfordiskdriveswith

readaheadcachesandrequestreordering.InProceedingsofInternationalConferenceonMeasurementandModelingofComputerSystems,pages182–191,1998.[31]MustafaUysal,GuillermoA.Alvarez,andArifMerchant.Amodular,analyticalthroughputmodel

formoderndiskarrays.InProceedingsof9thInternationalSymposiumonModeling,AnalysisandSimulationofComputerandTelecommunicationSystems,pages183–192,2001.[32]MengzhiWang,AnastassiaAilamaki,andChristosFaloutsos.Capturingthespatio-temporalbehavior

ofrealtraf cdata.PerformanceEvaluation,49(1/4):147–163,2002.[33]J.Wilkes.ThePantheonstorage-systemsimulator.TechnicalReportHPL–SSP–95–14,Hewlett-PackardLaboratories,1995.[34]JohnWilkes.TravelingtoRome,QoSspeci cationsforautomatedstoragesystemmanagement.In

Proc.Intl.WorkshoponQualityofService(IWQoS’2001),pages75–91.Springer-VerlagLectureNotesinComputerScience2092,2001.[35]JohnWilkes.Dataservices–fromdatatocontainers.KeynoteaddressatFileandStorageTechnologies

Conference(FAST’03),March–April2003.

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

AppendixA:ConstructingCARTModels

ACARTmodelisapiecewise-constantfunctiononamulti-dimensionalspace.Thisappendixgivesabriefdescriptionofthemodelconstructionalgorithm.Pleasereferto[4]foracompletediscussionofCARTmodels.

TheCARTmodelhasabinarytreestructurebuiltbyrecursivebinarysplits.SupposewehaveNobser-vations,Xii12N,withcorrespondingoutputsYii12N.Eachobservationconsistsofpinputfeatures,Xixi1xip).Theconstructionalgorithmstartswithatreewithonlyarootnodeandgrowsthetreedownwardbysplittingonenodeatime.Thechosensplitoffersthemostbene tinreducingthemeansquarederror.TheaverageYiforalltheXisinaleafnodeisusedasthepredictivevaluefortheleafnode.Thealgorithmcontinuesuntilcertaincriteriaaremet.

Wedescribehowthesplitischosenindetailnext.Thealgorithmevaluatesallthepossibledistinctsplitsonalltheleafnodesofthetree(ortherootnodeinthe rststep).Anodecorrespondstoahyer-rectangleregionoftheinputvectorspace,andasplitdecidesalongwhichfeatureandatwhatvaluetheregionshouldbedividedintotwo.Forexample,atnodet,asplitonfeaturejatvaluevde nestwonodes,nodet1andnodet2.

Xi

nodet1

Xixij

v

Xi

nodet

Xi

nodet2

Xixij

v

Xinodet

IfwedenotethenumberofobservationsinnodetasNtandthepredictivevalueasYt,themeansquarederroratnodetbeforethesplitis

MSEt

i:Xinodet

1

Nt1

Yi

¯t1Y

2

i:Xinodet2

1

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

12000

Number of Requests

9000

6000

3000

Time (aggregated in 10 seconds)

Disk Blocks (aggregated in 1000 blocks)

20

Disk block number

Entropy value

15 10 5 0

Entropy on timeEntropy on LBNJoint entropyCorrelation

60000

40000

20000

Arrival time

0 5

10Scale

15 20

Entropyplotonone-dimensionaldatasets.Theone-dimensionalentropyplotcapturescharacteristicsofindividualattributes,suchasthetemporalandspatialburstiness.Thesetwotypesofburstinessmeasurestheburstinessinthearrivalprocessandtheskewinaccessfrequenciesofdiskblocks.Weusetheentropyplotforarrivaltimeasanexampletoshowhowtheentropyplotworks.

Givenaworkload,wecanderiveits“margin”onthearrivaltimebycountingthenumberofrequeststhatarriveintothesystemateachtimetick.ThetopgraphofFigure9(a)showsthesampletrace’smarginonarrivaltime.

2n.WecalculatetheAssumethatthetraceis2ntimetickslong,andthemarginisCii12

entropyvalueatscalekbyapplyingtheentropyfunctionontheaggregatedmarginatscalek.TheaggregatedmarginisCkjj122kwhere

Intuitively,theentirelengthofthemarginisdividedinto2kequi-lengthedintervalsatscalek.Thus,applyingtheentropyfunctiononCkgives

where

Pkj

Number of Requests

(a)Sampledisktrace(b)Entropyplot

Figure9:Asampledisktraceanditsentropyplot.

C

k

2n

k

j

i1

∑C2n

k

j

1

i

H

k

2n

k

∑Pkjlog2Pkj

j1

C

k

j

Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr

1 0.8Entropy value

0.8 0.6 0.4 0.2

Time and RWLocation and RW

0.8 0.6 0.4 0.2 0

Time and SizeLocation and Size

5

10 15Scale

20

25

Entropy value

0.6 0.4 0.2 0

0 1 2

3 4Scale

5 6 7 0 5

10 15Scale

20 25

(a)Entropyplotontime(b)Entropyplotwithoperationtype

Entropy value

(c)Entropyplotwithsize

Figure10:Entropyplotstoquantifyotherworkloadcharacteristics.

Asimilarcalculationonthetrace’s“margin”onLBNgivestheentropyplotonLBN.Figure9(b)showstheentropyplotonbotharrivaltimeandLBNofthesampletrace.

Wemaketwoobservations.First,theentropyplotshowsstronglinearity,suggestingtheskewinarrivaltimeandLBNstaysconstantatallgranularities.Theconstantincrementoftheentropyvaluefromonescaletothenextsuggeststhatthedegreeofskewstaysthesameatallthescales.Thatis,thesampletracehasthesameburstinessatallscales,whichcon rmstheself-similarityofI/Oworkloadsobservedinpreviousstudies[13].Second,thelinearentropyplotallowsustousetheentropyplotslopestocharacterizetheburstiness.Smoothtraf chasanentropyplotofslopecloseto1.Real-worldtraces,however,havestrongburstiness.Insummary,theentropyplotde nedonthetracemarginsallowsustousetwoscalarstocharacterizeboththetemporalandspatialburstinessofI/Oworkloads.

Entropyplotontwo-dimensionaldatasets.Weextendtheentropyplottohandletwo-dimensionaldatasetstomeasurethecorrelationsbetweentwoattributes.Asbefore,theentropyplotcalculatestheentropyvalueatdifferentscales,onlythistimeontwo-dimensionaldatasets.Givenatwo-dimensionalprojectionofa

2n,wedividetheprojectioninto2k2kgrids,whichaggregatesbothdimensionstrace,Cijij12

withscalek.ThisgivesaseriesCkof2k2kelements.ApplyingtheentropyfunctiontoCkgivesthejointentropyvalueatscalekonthetwodimensions.

Thejointentropyallowsustocalculatethecorrelationbetweenthetwoattributes.Thecorrelationisthedifferencebetweenthesumoftheentropyvalueonthetwoattributesandthejointentropyplot.Figure9(b)showsboththejointentropyandthecorrelationonarrivaltimeandLBNforthesampledisktrace.WeobservethatastrongcorrelationexistsbetweenarrivaltimeandLBN,andalsothatthecorrelationstaysconstantatallscales.Thus,weareabletouseascalarvalue,thecorrelationslope,toquantifythecorrelationbetweenarrivaltimeandLBN.

Entropyplotinvolvingrequestsizeandoperationtype.Itispossibletoextendtheentropyplottohandleoperationtypeandrequestsize.Theonlydifferenceisthelimitedvaluerangesofthetwoattributes,whichlimitthenumberofdatapointsintheentropyplot.Asaresult,theworkloaddescriptiondoesnotincludeentropyplotslopesonthesetwoattributes.

Quantifyingthecorrelationsinvolvingeitherofthetwoattributesfacesthesameproblem.Oursolutionistoalwaysusethe nestgranularityontherequestsizeoroperationtype,buttochangethescaleontheotherattribute.Forexample,tocalculatethejointentropyplotonarrivaltimeandrequesttype,theaggregationhappensonlyonthearrivaltime.Figure10showstheentropyplotsthatinvolveoperationtypeandrequestsize.Theseentropyplotsarenotaslinearaspreviousones.Therefore,itisnotstraightforwardtocompresseachlineintoascalar.Currently,ourworkloaddescriptionusestheaverageincrementbetweentwoadjacentscales.

本文来源:https://www.bwwdw.com/article/fqg1.html

Top