Storage device performance prediction with CART models
更新时间:2023-07-19 03:45:01 阅读量: 实用文档 文档下载
- storage推荐度:
- 相关推荐
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
StorageDevicePerformancePredictionwithCART
Models
MengzhiWang,KinmanAu,AnastassiaAilamaki,
AnthonyBrockwell,ChristosFaloutsos,andGregoryR.Ganger
CMU-PDL-04-103
March2004
ParallelDataLaboratoryCarnegieMellonUniversityPittsburgh,PA15213-3890
Abstract
Storagedeviceperformancepredictionisakeyelementofself-managedstoragesystemsandapplicationplanningtasks,suchasdataassignment.Thisworkexplorestheapplicationofamachinelearningtool,CARTmodels,tostoragedevicemodeling.Ourapproachpredictsadevice’sperformanceasafunctionofinputworkloads,requiringnoknowledgeofthedeviceinternals.WeproposetwousesofCARTmodels:onethatpredictsper-requestresponsetimes(andthenderivesaggregatevalues)andonethatpredictsaggregatevaluesdirectlyfromworkloadcharacteristics.Afterbeingtrainedonourexperimentalplatforms,bothprovideaccurateblack-boxmodelsacrossarangeoftesttracesfromrealenvironments.Experimentsshowthatthesemodelspredicttheaverageand90thpercentileresponsetimewithanrelativeerroraslowas16%,whenthetrainingworkloadsaresimilartothetestingworkloads,andinterpolatewellacrossdifferentworkloads.
Acknowledgements:WethankthemembersandcompaniesofthePDLConsortium(includingEMC,Hewlett-Packard,Hitachi,HitachiGlobalStorageTechnologies,IBM,Intel,LSILogic,Microsoft,NetworkAppliance,Oracle,Panasas,Seagate,Sun,andVeritas)fortheirinterest,insights,feedback,andsupport.WethankIBMforpartlyfundingthisworkthroughaCASstudentfellowshipandafacultypartnershipaward.ThisworkisfundedinpartbyNSFgrantsCCR-0205544,IIS-0133686,BES-0329549,IIS-0083148,IIS-0113089,IIS-0209107,andIIS-0205224.WewouldalsoliketothankEnoThereska,MikeMesnier,andJohnStrunkfortheirparticipationanddiscussionintheearlystageofthisproject.
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
Keywords:Performanceprediction,storagedevicemodeling
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
1Introduction
Thecostsandcomplexityofsystemadministrationinstoragesystems[17,35,11]anddatabasesystems[12,1,15,21]makeautomationofadministrationtasksacriticalresearchchallenge.Oneimportantaspectofadministeringself-managedstoragesystems,particularlylargestorageinfrastructures,isdecidingwhichdatasetstostoreonwhichdevices.To ndanoptimalornearoptimalsolutionrequirestheabilitytopredicthowwelleachdevicewillserveeachworkload,sothatloadscanbebalancedandparticularlygoodmatchescanbeexploited.
Researchershavelongutilizedperformancemodelsforsuchpredictiontocomparealternativestoragedevicedesigns.Givensuf cienteffortandexpertise,accuratesimulations(e.g.,[5,28])oranalyticmodels(e.g.,[22,30,31])canbegeneratedtoexploredesignquestionsforaparticulardevice.Unfortunately,inpractice,suchtimeandexpertiseisnotavailablefordeployedinfrastructures,whichareoftencomprisedofnumerousanddistinctdevicetypes,andtheiradministratorshaveneitherthetimenortheexpertiseneededtocon guredevicemodels.
Thispaperattacksthisobstaclebyprovidingablack-boxmodelgenerationalgorithm.By“blackbox,”wemeanthatthemodel(andmodelgenerationsystem)hasnoinformationabouttheinternalcomponentsoralgorithmsofthestoragedevice.Givenaccesstoadeviceforsome“trainingperiod,”themodelgen-erationsystemlearnsadevice’sbehaviorasafunctionofinputworkloads.Theresultingdevicemodelapproximatesthisfunctionusingexistingmachinelearningtools.OurapproachemploystheClassi cationAndRegressionTrees(CART)toolbecauseofitsef ciencyandaccuracy.CARTmodels,inanutshell,approximatefunctionsonamulti-dimensionalCartesianspaceusingpiece-wiseconstantfunctions.
Suchlearning-basedblackboxmodelingisdif cultfortworeasons.First,allthemachinelearningtoolswehaveexaminedusevectorsofscalarsasinput.Existingworkloadcharacterizationmodels,however,pressingthesedistributionsintoasetofscalarsisnotstraightforward.Second,thequalityofthegeneratedmodelsdependshighlyonthequalityofthetrainingworkloads.Thetrainingworkloadsshouldbediverseenoughtoprovidehighcoverageoftheinputspace.
Thisworkdevelopstwowaysofencodingworkloadsasvectors:avectorperrequestoravectorperworkload.Thetwoencodingschemesleadtotwotypesofdevicemodels,operatingattheper-requestandper-workloadgranularities,respectively.Therequest-leveldevicemodelspredicteachrequest’sresponsetimebasedonitsper-requestvector,or“requestdescription.”Theworkload-leveldevicemodels,ontheotherhand,predictaggregateperformancedirectlyfromper-workloadvectors,or“workloaddescriptions.”Ourexperimentsonavarietyofrealworldworkloadshaveshownthatthesedescriptionsarereasonablygoodatcapturingworkloadperformancefrombothsingledisksanddiskarrays.ThetwoCART-basedmodelshaveamedianrelativeerrorof17%and38%,respectively,foraverageresponsetimeprediction,and18%and43%respectivelyforthe90thpercentile,whenthetrainingandtestingtracescomefromthesameworkload.TheCART-basedmodelsalsointerpolatewellacrossworkloads.
Theremainderofthispaperisorganizedasfollows.Section2discussespreviousworkintheareaofstoragedevicemodelingandworkloadcharacterization.Section3describesCARTanditsproperties.Section4describestwoCART-baseddevicemodels.Section5evaluatesthemodelsusingseveralreal-worldworkloadtraces.Section6concludesthepaper.
2RelatedWork
Performancemodelinghasalongandsuccessfulhistory.Almostalways,however,thoroughknowledgeofthesystembeingmodeledisassumed.Disksimulators,suchasPantheon[33]andDiskSim[5],usesoftwaretosimulatestoragedevicebehaviorandproduceaccurateper-requestresponsetimes.Developingsuchsim-ulatorsischallenging,especiallywhendiskparametersarenotpubliclyavailable.Predictingperformance
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
usingsimulatorsisalsoresourceintensive.Analyticalmodels[7,22,24,30,31]aremorecomputationallyef cientbecausethesemodelsdescribedevicebehaviorwithasetofformulae.Findingtheformulasetrequiresdeepunderstandingoftheinteractionbetweenstoragedevicesandworkloads.Inaddition,bothdisksimulatorsandanalyticalmodelsaretightlycoupledwiththemodeleddevice.Therefore,newdevicetechnologiesmayinvalidateexistingmodelsandrequireanewroundofmodelbuilding.
OurapproachusesCART,whichtreatsstoragedevicesasblackboxes.Asaresult,themodelconstruc-tionalgorithmisfullyautomatedandshouldbegeneralenoughtohandleanytypeofstoragedevice.Thedegenerateformsof“black-boxmodels”areperformancespeci cations,suchasthemaximumthroughputofthedevices,publishedbydevicemanufacturers.Theactualperformance,however,willbenowherenearthesenumbersundersomeworkloads.Anderson’s“table-based”approach[3]includesworkloadcharacter-isticsinthemodelinput.Thetable-basedmodelsrememberdevicebehaviorforawiderangeofworkloadanddevicepairsandinterploatesamongtablesentriesinpredicting.Anderson’smodelsareusedinanautomatedstorageprovisiontool,Ergastulum[2],whichformulatesautomaticstorageinfrastructureprovi-sioningasanoptimizationproblemandusesdevicemodelstoguidethesearchalgorithminlocatingthesolution.Ourapproachimprovesonthetable-basedmodelsbyemployingmachinelearningtoolstocapturedevicebehavior.Becauseofthegoodscalabilityofthetoolstohighdimensionaldatasets,weareabletousemoresophisticatedworkloadcharacteristicsasthemodelinput.Asaresult,themodelsaremoreef cientinbothcomputationandstorage.
Workloadcharacterizationisanimportantpartofdevicemodelingbecauseitprovidesasuitablerep-resentationofworkloads.Despiteabundantpublishedworkinmodelingwebtraf c[23,25,8],I/Otraf cmodelingreceiveslessattention.Directapplicationofwebtraf canalysismethodstoI/worktraf chasacategoricaladdressspace,andthereisnonotionofsequentialscans.Incontrast,theperformancevariabilitycanbeseveralordersofmagnitudebetweenrandomandsequentialaccessesforI/Oworkloads.Ganger[10]pointedoutthecomplexityofI/Oworkloads,andeventhedetectionofsequentialscansisahardproblem[19].Gomezetal.[14]identi edself-similarityinI/Otraf candadoptedstructuralmodelstogenerateI/Oworkloads.Kurmasetal.[20]employedaniterativeapproachtodetectimportantworkloadcharacteristics.Rome[34]providedageneralframeworkofworkloadspeci cations.Alltheapproaches,inonewayoranother,useempiricaldistribu-tionsderivedfromgivenworkloadsastheparametervalues.Ourpreviouswork[32]takesadvantageoftheself-similarityofI/Oworkloadsandproposesatool,the“entropyplot,”tocharacterizethespatio-temporalcharacteristicsofI/Oworkloadswiththreescalars.SinceourCART-basedmodelsrequireworkloadstobepresentedintheformofvectorsofscalars,theentropyplotisanattractivechoice.
3Background:CARTModels
ThissectiongivesabriefintroductionoftheCARTmodelsandjusti esourchoiceofthetool.AdetaileddiscussionofCARTisavailablein[4].
3.1CARTModels
CARTmodelingisamachinelearningtoolthatcanapproximaterealfunctionsinmulti-dimensionalCarte-fXε,sianspace.(Itcanalsobethoughtofasatypeofnon-linearregression.)GivenafunctionY
dwhereX ,Y ,andεiszero-meannoise,aCARTmodelapproximatesYusingapiece-wiseconstant f X.WerefertothecomponentsofXasfeaturesinthefollowingtext.Theterm,ε,capturesfunction,Y
theintrinsicrandomnessofthedataandthevariabilitycontributedbytheunobservablevariables.Thevari-anceofthenoisecouldbedependentonX.Forexample,thevarianceofresponsetimeoftendependsonthearrivalrate.
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
x<5.94851
x<3.23184
x<7.7123 100
x<9.0008x<1.92033
x<5.0035x<7.05473
80 60 40 20 0
y
f(x) = x * x
CART
x<1.69098
x<3.60137
30.83 48.68 56.33 72.06 16.64 26.85x<4.4543x<0.889
88.01 2.57 21.67 -1.92 6.05 8.94-20
0 2 4
x
6 8 10
(a)Fittedtree
(b)Datapointsandregressionline
Figure1:CARTmodelforasimpleone-dimensionaldataset.Thedatasetcontains100datapointsgen-x2ε,whereεfollowsaGuassiandistributionwithmean0andstandarddeviationeratedusingfx
10.
Thepiece-wiseconstantfunctionf Xcanbevisualizedasabinarytree.Figure1(a)showsaCARTmodelconstructedonthesampleone-dimensionaldatasetin(b).Thesampledatasetisgeneratedusing
yi
x2i
εi
i
12
100
wherexiisuniformlydistributedwithin(0,10),andεifollowsaGuassiandistributionofN010.The
leafnodescorrespondtodisjointhyper-rectanglesinthefeaturevectorspace.Thehyper-rectanglesaredegeneratedintointervalsforone-dimensionaldatasets.Eachleafisassociatedwithavalue,f X,whichisthepredictionforallXswithinthecorrespondinghyper-rectangle.Theinternalnodescontainsplitpoints,andapathfromtheroottoaleafde nesthehyper-rectangleoftheleafnode.Thetree,therefore,representsapiece-wiseconstantfunctiononthefeaturevectorspace.Figure1(b)showstheregressionlineofthesampleCARTmodel.
3.2CARTModelProperties
CARTmodelsarecomputationallyef cientinbothconstructionandprediction.Theconstructionalgorithmstartswithatreewithasinglerootnodecorrespondingtotheentireinputvectorspaceandgrowsthetreebygreedilyselectingthesplitpointthatyieldsthemaximumreductioninmeansquarederror.AmoredetaileddiscussionofthesplitpointselectionispresentedinAppendixA.Eachpredictioninvolvesatreetraversaland,therefore,isfast.
CARToffersgoodinterpretabilityandallowsustoevaluatetheimportanceofvariousworkloadchar-acteristicsinpredictingworkloadperformance.ACARTmodelisabinarytree,makingiteasytoplotonpaperasinFigure1(a).Moreimportantly,onecanevaluateafeature’simportancebyitscontributioninerrorreduction.Intuitively,amoreimportantfeatureshouldcontributemoretotheerrorreduction;thus,leavingitoutofthefeaturevectorwouldsigni cantlyraisethepredictionerror.InaCARTmodel,weusethesumoftheerrorreductionrelatedtoalltheappearancesofafeatureasitsimportance.
3.3ComparisonWithOtherRegressionTools
OtherregressiontoolscanachievethesamefunctionalityasCART.WechoosetouseCARTbecauseofitsaccuracy,ef ciency,robustness,andeaseofuse.Table1comparesCARTwithfourotherpopulartoolsto
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
FeatureCART
high(505%)
Neuralnetworks
fair(66%)
Poor
Fair
Poor
Poor
fast(seconds)
slow(hours)
fast
(milliseconds)
low(60B)
low(2MB)
Fair
k-nearestneighbors
InterpretabilityAbilitytohandleirrelevantinput
GoodGood
PoorPoor
PredictiontimeEaseofuse
fast
(milliseconds)
Good
slow(minutes)Fair
Table1:Comparisonofregressiontoolsinpredictingper-requestresponsetime.(ThesamedatasetisusedinFigure5.)Thecomparisononrow2,3,4andthelastoneistakenfrom[16].Werankthefeaturesintheorderoftheirimportance.Interpretabilityisthemodel’sabilitytoinfertheimportanceofinputvariables.Robustnessistheabilitytofunctionwellundernoisydataset.Irrelevantinputreferstofeaturesthathavelittlepredictivepowers.
buildtherequest-leveldevicemodelasdescribedinSection4.2.Themodelswereconstructedonthe rstdayofcello99aandtestsrunonthesecondofthesametrace.TheinformaiononthetracesweusedmaybefoundinSection5.
Themodel[29]usesalinearfunctionofXtoapproximatefX.Duetonon-linearstoragedevicebehavior,linearmodelshavepooraccuracy.
Themodel[26]consistsofasetofhighlyinterconnectedprocessingelementsworkinginunisontoapproximatethetargetfunction.Weuseasinglehiddenlayerof20nodes(bestamong20and40)andalearningrateof0.05.Halfofthetrainingsetisusedinbuildingthemodelandtheotherhalfforvalidation.Suchamodeltakesalongtimetoconverge.
The[6]mapstheinputdataintoahighdimensionalspaceandperformsalinearregressionthere.Ourmodelusestheradialbasisfunction
Kxix
expγx
xi
2
asthekernelfunction,andγissettobe2(bestamong1,3,4,6).Weuseanef cientimplementation,SVMlight[18],inourexperiment.Selectingtheparametervaluesrequiresexpertiseandmultipleroundsoftrials.
Themodel[9]ismemory-basedbecausethemodelremembersallthetrain-ingdatapointsandpredictionisdonethroughaveragingtheoutputoftheknearestneighborsofthedatapointbeingpredicted.WeusetheEuclideandistancefunctionandakvalueof5(bestamong5,10,15,and20).Themodelisaccurate,butisinef cientinstorageandcomputation.
Thelastthreetoolsrequirethatallthefeaturesandoutputbenormalizedtotheunitlength.Forfeaturesoflargevaluerange,wetakelogarithmsbeforenormalization.Overall,CARTisthebestatpredictingper-requestresponsetimes,withtheonlydownsidebeingslightlyloweraccuracycomparedtothemuchmorespace-andtime-consumingapproach.
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
Figure2:Modelconstructionthroughtraining.RTiistheresponsetimeofrequestri.
4PredictingPerformancewithCART
ThissectionpresentstwowaysofconstructingdevicemodelsbasedonCARTmodels.
4.1Overview
OurgoalistobuildamodelforagivenstoragedevicewhichpredictsdeviceperformanceasafunctionofI/Oworkload.Thedevicemodelreceivesaworkloadasinputandpredictsitsaggregateperformance.Wede neaworkloadasasequenceofdiskrequests,witheachrequest,ri,uniquelydescribedbyfourattributes:arrivaltime(ArrivalTimei),logicalblocknumber(LBNi),requestsizeinnumberofdiskblocks(Sizei),andread/writetype(RWi).Thestoragedevicecouldbeasingledisk,adiskarray,orsomeotherlike-interfacedcomponent.Theaggregateperformancecanbeeithertheaverageorthe90-thpercentileresponsetime.
OurapproachusesCARTtoapproximatethefunction.Weassumethatthemodelconstructionalgorithmcanfeedanyworkloadintothedevicetoobserveitsbehaviorforacertainperiodoftime,alsoknownas“training.”Thealgorithmthenbuildsthedevicemodelbasedontheobservedresponsetimes,asillustratedinFigure2.Modelconstructiondoesnotrequireanyinformationabouttheinternalsofthemodeleddevice.Therefore,itisgeneralenoughtomodelanydevice.
Regressiontoolsareanaturalchoicetomodeldevicebehavior.Suchtoolsaredesignedtomodelfunc-tionsonmulti-dimensionalspacegivenasetofsampleswithknownoutput.Thedif cultyistotransformworkloadsintodatapointsinamulti-dimensionalfeaturespace.Weexploretwowaystoachievethetrans-formationasillustratedinFigure3.Arequest-levelmodelrepresentsarequestriasavectorRi,alsoknownasthe“requestdescription,”andusesCARTmodelstopredictper-requestresponsetimes.Theaggregateperformanceisthencalculatedbyaggregatingtheresponsetimes.Aworkload-levelmodel,ontheotherhand,representstheentireworkloadasasinglevectorW,orthe“workloaddescription,”andpredictstheaggregateperformancedirectlyfromW.Inbothapproaches,thequalityoftheinputvectorsiscriticaltothemodelaccuracy.Thenexttwosectionspresenttherequestandworkloaddescriptionsindetail.
4.2Request-LevelDeviceModels
ThissectiondescribestheCART-basedrequest-leveldevicemodel.ThismodelusesaCARTmodeltopredicttheresponsetimesofindividualrequestsbasedonrequestdescriptions.Themodel,therefore,isabletogeneratetheentireresponsetimedistributionandoutputanyaggregateperformancemeasures.
Weadoptthefollowingtwoconstraintsindesigningtherequestdescription.1.Ridoesnotincludeanyactualresponsetimes.Onecouldrelaxthisconstraintbyallowingthein-clusionoftheresponsetimeinformationforalltherequeststhathavealreadybeenservedwhenthecurrentrequestarrives.Thisrelaxation,however,isfeasibleonlyforonlineresponsetimepredictions;itwouldnotbeappropriateforapplicationplanningtasksbecausetheplannerdoesnotrunworkloadsondevices.
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
Figure3:TwotypesofCART-baseddevicemodels.
2.Ricanbecalculatedfromrj,ji.Thisconstraintsimpli estherequestdescription.Inmostcases,theresponsetimeofacurrentrequestdependsonlyonpreviousrequestsandtherequestitself.
OurrequestdescriptionRiforrequestricontainsthefollowingvariables:
Ri
TimeDiffi1
TimeDiffik
LBNi
LBNDiffi1
LBNDiffil
Sizei
RWi
Seqi
whereTimeDiffikArrivalTimeiArrivalTimei2k1andLBNDiffilLBNiLBNil.The rstthreegroupsoffeaturescapturethreecomponentsoftheresponsetime,andSeqiindicateswhethertherequestisasequentialaccess.The rstk1featuresmeasurethetemporalburstinessoftheworkloadwhenriarrives,andsupportpredictionofthequeuingtime.WeallowtheTimeDifffeaturestoexponentiallygrowthedistancefromthecurrentrequesttohistoryrequesttoaccommodatelargebursts.Thenextl1featuresmeasurethespatiallocality,supportingpredictionoftheseektimeoftherequest.SizeiandRWisupportpredictionofthedatatransfertime.
Thetwoparameters,kandl,determinehowfarwelookbackforrequestburstsandlocality.Smallvaluesdonotadequatelycapturethesecharacteristics,rgevalues,ontheotherhand,leadstoahigherdimensionality,meaningtheneedforalargertrainingsetandalongertrainingtime.Theoptimalvaluesfortheseparametersarehighlydevicespeci c,andSection5.1showshowweselecttheparametervaluesinourexperiments.
4.3Workload-LevelDeviceModels
Theworkload-levelmodelrepresentstheentireworkloadasasingleworkloaddescriptionandpredictsaggregatedeviceperformancedirectly.TheworkloaddescriptionWcontainsthefollowingfeatures.
W
Averagearrivalrate
Readratio
Averagerequestsize
Percentageofsequentialrequests
Temporalandspatialburstiness
Correlationsbetweenpairsofattributes
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
Theworkloaddescriptionusestheentropyplot[32]toquantifytemporalandspatialburstinessandcorrela-tionsbetweenattributes.Entropyvalueareplottedononeortwoattributesagainsttheentropycalculationgranularity.Theincrementoftheentropyvaluescharacterizeshowtheburstinessandcorrelationschangefromonegranularitytothenext.Becauseoftheself-similarityofI/Oworkloads[13],theincrementisusuallyconstant,allowingustousetheentropyplotslopetocharacterizetheburstinessandcorrelations.AppendixBdescribestheentropyplotindetail.
Theworkload-leveldevicemodeloffersfastpredictions.ThemodelcompressesaworkloadintoaworkloaddescriptionandfeedsthedescriptionintoaCARTmodeltoproducethedesiredperformancemeasure.Featureextractionisalsofast.Topredictboththeaverageand90thpercentileresponsetime,themodelmusthavetwoseparatetrees,oneforeachperformancemetric.
Workloadmodelingintroducesaparametercalled“windowsize.”Thewindowsizeistheunitofper-formancepredictionand,thus,theworkloadlengthforworkloaddescriptiongeneration.Forexample,wecandividealongtraceintoone-minutefragmentsandusetheworkload-levelmodeltopredicttheaverageresponsetimeoverone-minuteintervals.Fragmentingworkloadshasseveraladvantages.First,performanceproblemsareusuallytransient.A“problem”ingtheworkloadinitsentirety,ontheotherhand,failstoindentifysuchtransientproblems.Second,fragmentingthetrainingtraceproducesmoresamplesfortrainingandreducestherequiredtrainingtime.Windowsthataretoosmall,however,containtoofewrequestsfortheentropyplottobeeffective.Weuseone-minutewindowsinallofourexperiments.
4.4ComparisonofTwoTypesofModels
Thereisacleartradeoffbetweentherequest-levelandworkload-leveldevicemodels.Theformerisfastintrainingandslowinprediction,andthelatteristheopposite.
Themodeltrainingtimeisdominatedbytracereplay,which,whentakingplaceonactualdevices,requiresexactlythesameamountoftimeasthetracelength.BuildingaCARTmodelneedsonlysecondsofcomputation,buttracereplaycanrequirehundredsofhourstoacquireenoughdatapointsformodelconstruction.Whenoperatingattherequestlevel,thedevicemodelgetsonedatapointperrequestasopposedtoonedatapointperone-minuteworkloadfragmentasintheworkload-leveldevicemodel.Inordertogetthesamenumberofdatapoints,theworkload-leveldevicemodelneedsatrainingtime100timeslongerthantherequest-levelmodelwhenthearrivalrateis100requestsperminute.
Thenumberoftreetraversalsdeterminesthepredictiontime,sinceeachpredictedvaluerequiresatreetraversal.Therefore,thetotalnumberoftreetraversalsisthenumberofrequestsintheworkloadfortherequest-leveldevicemodelandthenumberofworkloadfragmentsfortheworkload-levelmodel.Withanaveragearrivalrateof100requestsperminute,therequest-levelmodelis100timesslowerinprediction.
Anitemforfutureresearchistheexplorationofthepossibilityofcombiningthetwomodelstodeliveronesthatareef cientinbothtrainingandprediction.
5ExperimentalResults
ThissectionevaluatestheCART-baseddevicemodelspresentedintheprevioussectionusingarangeofworkloadtraces.
Devices.Wemodeltwodevices:asinglediskandadiskarray.Thesinglediskisa9GBAtlas10Kdiskwithanaveragerotationallatencyof3milliseconds.ThediskarrayisaRAID5diskarrayconsistingof8Atlas10Kdiskswitha32KBstripesize.WereplayallthetracesonthetwodevicesexcepttheSAPtrace,whichisbeyondthecapacityoftheAtlas10Kdisk.
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
Tracenamecello99acello99bcello99c
Length4weeks4weeks4weeks
Tracedescription
AverageSize
7.8Million
7.1KB118.0KB8.5KB
1.1Million
singledisk
35.4%
115.71ms113.61ms5.04ms
99.9%
7.40ms59.28ms
Table2:Tracesummary.WemodelanAtlas10K9GBandaRAID5diskarrayconsistingof8Atlas10Kdisks.TheresponsetimeiscollectedbyreplayingthetracesonDiskSim3.0[5].
Traces.Weusethreesetsofreal-worldtracesinthisstudy.Table2liststhesummarystatisticsoftheeditedtraces.The rsttwo,cello92andcello99capturetypicalcomputersystemresearchI/Oworkloads,collectedatHPLabsin1992and1999respectively[27,14].Wepreprocesscello92toconcatenatetheLBNsofthethreemostactivedevicesfromthetraceto llthemodeleddevice.Forcello99,wepickthethreemostactivedevices,amongthe23devices,andlabelthemcello99a,cello99b,andcello99c.Thecello99traces tina9GBdiskperfectly,sonotraceeditingisnecessary.Asthesetracesarelong(twomonthsforcello92andoneyearforcello99),wereportdataforafour-weeksnapshot(5/1/92to5/28/92and2/1/99to2/28/99).
TheSAPtracewascollectedfromanOracledatabaseserverrunningSAPISUCCS2.5Binapowerutilitycompany.Theserverhasmorethan3,000usersanddiskaccessesre ecttheretrievalofcustomerinvoicesforupdatingandreviewing.SequentialreadsdominatetheSAPtrace.
Evaluationmethodology.Theevaluationusesthedevicemodelstopredicttheaverageand90thper-centileresponsetimeforone-minuteworkloadfragments.Wereportthepredictionerrorsusingtwometrics:
Y,andrelativeerrorabsoluteerrorde nedasthedifferencebetweenthepredictedandtheactualvalue,Y
de nedasYY
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
Relative importanc
e
60%40%20%0%
TimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiff1LBNLBNDiffLBNDiffLBNDiffLBNDiffLBNDiffSize(a)relativeimportancemeasuredonthe9GBAtlas10KBdiskusingcello99aand
cello99c
SeR(b)relativeimportancemeasuredontheRAID5diskarrayusingcello99aandcello99c
Figure4:Relativeimportanceoftherequestdescriptionfeatures.
5.1CalibratingRequest-LevelModels
Thissectiondescribeshowweselectparametervaluesforkandlfortherequest-leveldevicemodels.
Figure4showstherelativeimportanceoftherequestdescriptionfeaturesindeterminingper-requestresponsetimebysettingkto10andlto5.Thefeature’srelativeimportanceismeasuredbyitscontributioninerrorreduction.Thegraphsshowtheimportanceofrequestdescriptionfeaturesmeasuredonbothdevices,trainedontwotraces(cello99aandcello99c).Weuseonlythe rstdayofthetracesandreducethedatasetsizeby90%withuniformsampling.
First,weobservethattherelativeimportanceisworkloaddependent.Asweexpected,forbusytraf- csuchasthatwhichoccurredinthecello99atrace,thequeuingtimedominatestheresponsetime,andthereby,theTimeDifffeaturesaremoreimportant.Ontheotherhand,cello99chassmallresponsetimes,andfeaturesthatcharacterizethedatatransfertime,suchasSizeandRW,havegoodpredictivepowerinmodelingthesingledisk.
Second,weobservethatthemostimporantfeatureshiftsfromTimeDiff8toTimeDiff7wherecom-paringthesingledisktothediskarrayforcello99abecausethequeuingtimebecomeslesssigni cantforthediskarray.Thedistinctionbetweenthetwotraces,however,persists.
Wesetkto10forTimeDiffandlto3forLBNDiffinthesubsequentexperimentssothatwecanmodeldevicebehaviorunderbothtypesofworkloads.
Weshowthemodelaccuracyinpredictingper-requestresponsetimesinFigure5.ThemodelisbuiltfortheAtlas10Kdisk.Thetrainingtraceisthe rstdayofcello99a,andthetestingtraceistheseconddayofthesametrace.Figure5(a)isascatterplot,showingthepredictedresponsetimesagainsttheactualonesforthe rst5,000requests.Mostofthepointsstayclosetothediagonalline,suggestingaccuratepredictionoftherequest-leveldevicemodel.Figure5(b)furthercomparestheresponsetimedistributions.Thelongtailofthedistributioniswellcapturedbytherequest-levelmodel,indicatingthattherequestdescriptionis
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
Predicted response time (ms)
1600 1200 800 400 0
100 %80 %% of requests
60 %40 %20 %0 %
ActualPredicted
1000 2000 3000 4000
Response time (ms)
400 800 1200Actual response time (ms)
1600
(a)Scatterplot(b)Responsetimedistribution
Figure5:Predictionaccuracyoftherequest-levelmodel.Theactualandpredictedaverageresponsetimesare137.96msand133.01msrespectively.Thecorrespondingdemerit,de nedin[28]astherootmeansquareofhorizontaldistancebetweentheactualandpredictedcurvesin(b),is46.06milliseconds(33.4%oftheactualaverageresponsetime).
effectiveincapturingrequest-levelcharacteristicsneededtopredictresponsetimes.
Insummary,therequestdescriptioneffectivelycapturesimportantper-requestcharacteristics,leadingtoaccuraterequest-leveldevicemodels.
5.2ModelingASingleDisk
Figure6comparestheaccuracyofallthepredictorsinmodelinganAtlas10K9GBdiskonreal-worldtraces.Asmentionedearlier,allthepredictorsaretrainedusingthe rsttwoweeksofcello99a.Overall,thetwoCART-baseddevicemodelsprovidegoodpredictionaccuracyinpredictingboththeaverageand90thpercentileresponsetimes,comparedtootherpredictors.Severalmoredetailedobservationscanbemade.
First,allofthemodelsperformthebestwhenthetrainingandtestingtracesarefromthesameworkload,e.g.cello99a,becausethemodelshaveseenhowthedevicebehavesundersuchworkloads.Thepredictoralsocutsthemedianpredictionerrorofthepredictorbymorethanahalfbecauseofthestrongperiodicityoftheworkload.andfurtherreducetheerrorto4.84milliseconds(19%)and14.83milliseconds(47%)respectivelyfortheaverageresponsetimeprediction,and20.46milliseconds(15%)and49.50milliseconds(45%)respectivelyforthe90thpercentile.Theperfor-androughlymeasuresthebene tofusinganon-linearmancedifferencebetween
model,suchasCART,becausebothacceptthesameinput.Weobserveasigni cantimprovementfromtheformertothelatter,suggestingnon-lineardevicebehavior.
Second,bothCART-baseddevicemodelsinterpolatebetteracrossworkloadsthantheothermodels.
andrelyblindlyonsimilaritiesbetweenthetrainingandtestingworkloadstomake
goodpredictions.Consequently,itisnotsurprisingtoseehugepredictionerrorswhenthetrainingandtestingworkloadsdiffer.TheCART-basedpredictors,ontheotherhand,areabletodistinguishbetweenworkloadsofdifferentcharacteristicsandaremorerobusttothedifferencebetweenthetrainingandtestingworkloads.
Third,modelaccuracyishighlydependentonthetrainingworkloadqualityfortheCART-basedmodels.Thepredictionerrorincreasesforworkloadsotherthancello99a,becauseoftheaccesspatterndifferencesamongthesetraces.TheCART-basedmodelslearndevicebehaviorthroughtraining;therefore,theycannotpredictperformanceforworkloadsthathavetotallydifferentcharacteristicsfromthetrainingworkloads.
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
averageresponsetime
(b)Predictionerrorfor90thpercentileresponsetime
Figure6:Comparisonofpredictorsforasingle9GBAtlas10Kdisk.
Forexample,constantlyover-predictsforcello99cbecausethemodelwasnevertrainedwththesmallsequentialaccessesthatareparticulartocello99c.Section5.4givesaninformalerroranalysisandidenti esinadequatetrainingbeingthemostsigni canterrorsource.
Fourth,highquantileresponsetimesaremoredif culttopredict.Weobservelargerpredictionerrorsfromallthepredictorsfor90thpercentileresponsetimepredictionsthanforaverageresponsetimepredic-tions.TheaccuracyadvantageofthetwoCART-basedmodelsishigherfor90thpercentilepredictions.
Insummary,thetwoCART-basedmodelsgiveaccuratepredictionswhenthetrainingandtestingwork-loadssharethesamecharacteristicsandinterpolatewellotherwise.Thegoodaccuracysuggeststheeffec-tivenessoftherequestandworkloaddescriptionsincapturingimportantworkloadcharacteristics.
5.3ModelingADiskArray
Figure7comparestheaccuracyofthefourpredictorsinmodelingthediskarray.ThepredictorisnotpresentedbecausetheSAPtracedoesnotprovideenoughinformationonarrivaltimeforustoknowtheoffsetwithinaweek.Theoverallresultsaresimilartothoseforthesingledisk.ThetwoCART-basedmodelsarethemostaccuratepredictors.Theabsoluteerrorsbecomesmallerduetothedecreasedresponsetimefromthesingledisktothediskarray.Therelativeaccuracyamongthepredictors,however,staysthesame.Overall,theCART-baseddevicemodelingapproachworkswellforthediskarray.
5.4ErrorAnalysis
Thissectionpresentsaninformalerroranalysistoidentifythemostsigni canterrorsourcefortheCART-baseddevicemodels.
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
averageresponsetime
(b)predictionerrorfor90thpercentileresponsetime
Figure7:ComparisonofpredictorsforaRAID5diskarrayof8Atlas10Kdisks.
Amodel’serrorconsistsoftwoparts.The rstpartcomesfromintrinsicrandomnessoftheinputdata,suchasmeasurementerror,andthiserrorcannotbecapturedbyanymodel.Therestoftheerrorcomesfromthemodelingapproachitself.TheCART-basedmodelsincurerroratthreeplaces.First,thetransformationfromworkloadstovectorsintroducesinformationloss.Second,theCART-basedmodelsusepiece-wiseconstantfunctions,whichcouldbedifferentfromthetruefunctions.Third,alow-qualitytrainingtraceyieldsinaccuratemodelsbecauseCARTreliesontheinformationfromthetrainingdatatomakepredictions.Aninadequatetrainingsethasonlyalimitedrangeofworkloadsandleadstolargepredictionerrorsforworkloadsoutsideofthisrange.We ndthatthelasterrorsource,inadequatetrainingdata,causesthemosttroubleinourexperiments.
Weconductasmallexperimenttoverifyourhypothesis.Figure8(a)comparesthedifferenceinse-quentialitybetweencello99aandcello99c.Thespectrumofsequentiality(from0%to100%ofrequestsintheworkloadbeingsequential)isdividedinto20buckets,andthegraphsshowsthenumberofone-minuteworkloadfragmentsineachbucketforbothtraces.Weobserveasigni cantnumberofhighsequentialityfragmentsincello99b,butnofragmentgoesbeyond50%sequentialityincello99a.Thisdifferenceleadstolargepredictionerrorsforhighsequentialityfragmentswhenwebuildtheworkload-levelmodeloncello99aanduseittopredicttheperformanceofcello99b,asshownin(b).Theerrorsarereducedsigni cantlywhenweincludethe rsthalfofcello99bintraining.Thedramaticerrorreductionsuggeststhatpredictioner-rorsfromtheothersourcesarenegligiblewhencomparedwiththeonesintroducedbyinadequatetraining.Figure8(c)furthershowstheabsoluteerrorhistogramwith1millisecondbuckets.Thespikeshiftto0millisecondswhenwetrainthemodelonthecombinedtrainingtrace,indicatingthatitisreasonabletoas-sumeazero-meannoiseterm.Weconcludefromthisevidencethatcontributingeffortsinblack-boxdevicemodelingshouldbedirectedtowardgeneratingagoodtrainingsetthatcoversabroadrangeofworkloadtypes.
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
200
# of intervals
1501005000%
25%50%75%100%% of sequential requests
(a)Histogramsonsequentiality(b)Averageabsoluteerror(c)Absoluteerrordistribution
Figure8:Effectsofdifferenttrainingworkloads.
6Conclusions
Storagedeviceperformancemodelingisanimportantelementinself-managedstoragesystemsandotherapplicationplanningtasks.Ourtargetmodeltakesaworkloadasinputandpredictsitsaggregateperfor-manceonthemodeleddeviceef cientlyandaccurately.Thispaperpresentsourinitialresultsinexploringmachinelearningtoolstobuilddevicemodels.Ablackboxpredictivetool,CART,makesdevicemodelsindependentofthestoragedevicesbeingmodeled,andthus,generalenoughtohandleanytypeofdevices.Themodelconstruction,alsoknownastraining,consistsoftwophases:replayingtracesonthedevicesandbuildingaCARTmodelbasedontheobservedresponsetimes.Modelinganewdeviceinvolvesonlytrainingonthetargetdevice.
CART-basedmodelstakeinputintheformofvectors,soworkloadsmustbetransformedintovectorsinordertouseCARTasthebasisfordevicemodels.Thispaperpresentstwowaystoaccomplishsuchatransformation,yieldingtwotypesofdevicemodels.Therequest-leveldevicemodelsrepresenteachrequestasavectorandpredictitsresponsetime.Asaresult,themodelsareabletopredicttheentireresponsetimedistribution.Theexperimentsshowthatthepredictedresponsetimehasademerit gureof33%foramodernUNIX leservertrace,leadingtoamedianrelativeerroraslowas16%foraggregateperformancepredictions.Theworkload-leveldevicemodels,ontheotherhand,transformaworkloadfragmentintoavectorandpredictitsaggregateperformancedirectly.Thevectortakesadvantageoftheef ciententropyplotmetrictocapturethetemporalandspatialburstinessaswellasthecorrelationswithinI/Oworkloads.Themedianrelativeerrorcanbeaslowas29%fortheworkload-leveldevicemodels.
Theerroranalysissuggeststhatthequalityofthetrainingworkloadsplaysacriticalroleinthemodelaccuracy.Themodelsareunabletopredictworkloadsthataredifferentfromthetrainingworkloads.Toaccuratelypredictarbitraryworkloads,itisimportantforthetrainingworkloadstobeasdiverseaspossibletocoverawiderangeofworkloads.Ourfutureworkwillexploretheeffectivenessofexistingsyntheticworkloadgeneratorsinproducinghigh-qualitytrainingworkloads.
Continuingresearchcanimprovethemodelpredictionaccuracy.First,ourexperimentsshowtherele-vanceoftrainingtraces.Generatingrulestoassistintrainingsuchmodelsbroadlyenoughwillbeimportant.Second,theworkloadcharacterizationproblempersists,affectingtheworkload-levelmodels.Webelieve,however,thatthecontextofferedbythemodelscanhelpproduceinsightintothislong-standingproblem.Third,thetwotypesofdevicemodelsshowdesirablepropertiesintrainingandpredicting,respectively.Itshouldbevaluabletohaveamodelthatcombinesthebestofbothapproaches.
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
References
[1]N.Allen.Don’twasteyourstoragedollars:Whatyouneedtoknow.Researchnote,GartnerGroup,
2001.[2]E.Anderson,M.Kallahalla,S.Spence,R.Swaminathan,andQ.Wang.Ergastulum:Quickly nding
near-optimalstoragesystemdesigns.TechnicalReportHPL-SSP-2001-05,HPLabs,2001.[3]EricAnderson.Simpletable-basedmodelingofstoragedevices.TechnicalReportHPL-SSP-2001-4,
HPLabs,2001.[4]L.Brieman,J.H.Friedman,R.A.Olshen,andC.J.Stone.Classi cationandRegressionTrees.
Wadsworth,1984.[5]JohnBucy,GregGanger,andcontributors.TheDiskSimsimulationenvironmentversion3.0reference
manual.TechnicalReportCMU-CS-03-102,CarnegieMellonUniversity,2003.[6]C.J.C.Burges.AtutorialonSupportVectorMachinesforpatternrecognition.KnowledgeDiscovery
andDataMining,2(2):121–167.[7]ShenzeChenandDonTowsley.AperformanceevaluationofRAIDarchitectures.IEEETransactions
onComputers,45(10):1116–1130,1996.[8]MarkE.CrovellaandAzerBestavros.Self-similarityinworldwidewebtraf cevidenceandpos-siblecauses.InProc.ofthe1996ACMSIGMETRICSIntl.Conf.onMeasurementandModelingofComputerSystems,May1996.[9]B.Dasarathy.Neareastneighborpatternclassi cationtechniques.IEEEComputerSocietyPress,
1991.[10]GregoryR.Ganger.Generatingrepresentativesyntheticworkloads:Anunsolvedproblem.InPro-ceedingsoftheComputerMeasurementGroup(CMG)Conference,pages1263–1269,1995.[11]GregoryR.Ganger,JohnD.Strunk,andAndrewJ.Klosterman.Self-*storage:Brick-basedstorage
withautomatedadministration.TechnicalReportCMU-CS-03-178,CarnegieMellonUniversity,2003.[12]GartnerGroup.Totalcostofstorageownership—Auser-orientedapproach.Researchnote,Gartner
Group,2000.[13]Mar´iaE.G´omezandVicenteSantonja.Analysisofself-similarityinI/Oworkloadusingstructural
modeling.In7thInternationalSymposiumonModeling,AnalysisandSimulationofComputerandTelecommunicationSystems,pages234–243,1999.[14]Mar´iaE.G´omezandVicenteSantonja.Anewapproachinthemodelingandgenerationofsynthetic
diskworkload.InProceedingsof9thInternationalSymposiumonModeling,AnalysisandSimulationofComputerandTelecommunicationSystems,pages199–206,2000.[15]J.Gray.AconversationwithJimGray.ACMQueue,1(4),2003.
[16]TrevorHastie,RobertTibshirani,andJeromeFriedman.TheElementsofStatisticalLearning:Data
Mining,Inference,andPrediction.Springer,2001.[17]InternationalBusinessMachinesCorp.Autonomiccomputing:IBM’sperspectiveonthestateof
informationtechnology./autonomic/manifesto/.
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
[18]T.Joachims.Makinglarge-scaleSVMlearningpractical.InB.Sch¨olkopf,C.Burges,andSmola,
editors,AdvancesinKernelMethods–SupportVectorLearning.MIT-Press,1999.[19]K.Keeton,G.Alvarez,E.Riedel,andM.Uysal.CharacterizingI/O-intensiveworkloadsequentiality
onmoderndiskarrays.InTheFourthWorkshopOnComputerArchitectureEvaluationUsingCom-mercialWorkloads,2001.[20]ZacharyKurmas,KimberlyKeeton,andRalphBecker-Szendy.I/Oworkloadcharacterization.In4th
workshoponcomputerarchitectureevaluationusingcommercialworkloads,2001.[21]mb.Hardwarespendingsputters.RedHerring,pages32–33,2001.
[22]EdwardK.LeeandRandyH.Katz.Ananalyticperformancemodelofdiskarrays.InProceedingsof
the1993ACMSIGMETRICS,pages98–109,1993.[23]WillE.Leland,MuradS.Taqq,WalterWillinger,andDanielV.Wilson.Ontheself-similarnatureof
Ethernettraf c.InACMSIGCOMM,pages13–17,1993.[24]ArifMerchantandGuillermoA.Alvarez.DiskarraymodelsinMinerva.TechnicalReportHPL-2001-118,HPLaboratories,2001.[25]RuldolfH.Riedi,MatthewS.Crouse,VinayJ.Ribeiro,andRichardG.Baraniuk.Amultifrac-talwaveletmodelwithapplicationtonetworktraf c.IEEETransactionsonInformationTheory,45(3):992–1018,April1999.[26]B.D.Ripley.PaternrecognitionsandNeuralNetworks.CambridgeUniversityPress,1996.
[27]ChrisRuemmlerandJohnWilkes.Unixdiskaccesspatterns.InWinterUSENIXConference,pages
405–420,1993.[28]ChrisRuemmlerandJohnWilkes.Anintroductiontodiskdrivemodeling.IEEEComputer,27(3):17–
28,1994.[29]G.Seber.Multivariateobservations.Wiley,1984.
[30]ElizabethShriver,ArifMerchant,andJohnWilkes.Ananalyticalbehaviormodelfordiskdriveswith
readaheadcachesandrequestreordering.InProceedingsofInternationalConferenceonMeasurementandModelingofComputerSystems,pages182–191,1998.[31]MustafaUysal,GuillermoA.Alvarez,andArifMerchant.Amodular,analyticalthroughputmodel
formoderndiskarrays.InProceedingsof9thInternationalSymposiumonModeling,AnalysisandSimulationofComputerandTelecommunicationSystems,pages183–192,2001.[32]MengzhiWang,AnastassiaAilamaki,andChristosFaloutsos.Capturingthespatio-temporalbehavior
ofrealtraf cdata.PerformanceEvaluation,49(1/4):147–163,2002.[33]J.Wilkes.ThePantheonstorage-systemsimulator.TechnicalReportHPL–SSP–95–14,Hewlett-PackardLaboratories,1995.[34]JohnWilkes.TravelingtoRome,QoSspeci cationsforautomatedstoragesystemmanagement.In
Proc.Intl.WorkshoponQualityofService(IWQoS’2001),pages75–91.Springer-VerlagLectureNotesinComputerScience2092,2001.[35]JohnWilkes.Dataservices–fromdatatocontainers.KeynoteaddressatFileandStorageTechnologies
Conference(FAST’03),March–April2003.
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
AppendixA:ConstructingCARTModels
ACARTmodelisapiecewise-constantfunctiononamulti-dimensionalspace.Thisappendixgivesabriefdescriptionofthemodelconstructionalgorithm.Pleasereferto[4]foracompletediscussionofCARTmodels.
TheCARTmodelhasabinarytreestructurebuiltbyrecursivebinarysplits.SupposewehaveNobser-vations,Xii12N,withcorrespondingoutputsYii12N.Eachobservationconsistsofpinputfeatures,Xixi1xip).Theconstructionalgorithmstartswithatreewithonlyarootnodeandgrowsthetreedownwardbysplittingonenodeatime.Thechosensplitoffersthemostbene tinreducingthemeansquarederror.TheaverageYiforalltheXisinaleafnodeisusedasthepredictivevaluefortheleafnode.Thealgorithmcontinuesuntilcertaincriteriaaremet.
Wedescribehowthesplitischosenindetailnext.Thealgorithmevaluatesallthepossibledistinctsplitsonalltheleafnodesofthetree(ortherootnodeinthe rststep).Anodecorrespondstoahyer-rectangleregionoftheinputvectorspace,andasplitdecidesalongwhichfeatureandatwhatvaluetheregionshouldbedividedintotwo.Forexample,atnodet,asplitonfeaturejatvaluevde nestwonodes,nodet1andnodet2.
Xi
nodet1
Xixij
v
Xi
nodet
Xi
nodet2
Xixij
v
Xinodet
IfwedenotethenumberofobservationsinnodetasNtandthepredictivevalueasYt,themeansquarederroratnodetbeforethesplitis
MSEt
i:Xinodet
∑
1
Nt1
Yi
¯t1Y
2
i:Xinodet2
∑
1
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
12000
Number of Requests
9000
6000
3000
Time (aggregated in 10 seconds)
Disk Blocks (aggregated in 1000 blocks)
20
Disk block number
Entropy value
15 10 5 0
Entropy on timeEntropy on LBNJoint entropyCorrelation
60000
40000
20000
Arrival time
0 5
10Scale
15 20
Entropyplotonone-dimensionaldatasets.Theone-dimensionalentropyplotcapturescharacteristicsofindividualattributes,suchasthetemporalandspatialburstiness.Thesetwotypesofburstinessmeasurestheburstinessinthearrivalprocessandtheskewinaccessfrequenciesofdiskblocks.Weusetheentropyplotforarrivaltimeasanexampletoshowhowtheentropyplotworks.
Givenaworkload,wecanderiveits“margin”onthearrivaltimebycountingthenumberofrequeststhatarriveintothesystemateachtimetick.ThetopgraphofFigure9(a)showsthesampletrace’smarginonarrivaltime.
2n.WecalculatetheAssumethatthetraceis2ntimetickslong,andthemarginisCii12
entropyvalueatscalekbyapplyingtheentropyfunctionontheaggregatedmarginatscalek.TheaggregatedmarginisCkjj122kwhere
Intuitively,theentirelengthofthemarginisdividedinto2kequi-lengthedintervalsatscalek.Thus,applyingtheentropyfunctiononCkgives
where
Pkj
Number of Requests
(a)Sampledisktrace(b)Entropyplot
Figure9:Asampledisktraceanditsentropyplot.
C
k
2n
k
j
i1
∑C2n
k
j
1
i
H
k
2n
k
∑Pkjlog2Pkj
j1
C
k
j
Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment. This work explores the application of a machine learning tool, CART models, to storage device modeling. Our appr
1 0.8Entropy value
0.8 0.6 0.4 0.2
Time and RWLocation and RW
0.8 0.6 0.4 0.2 0
Time and SizeLocation and Size
5
10 15Scale
20
25
Entropy value
0.6 0.4 0.2 0
0 1 2
3 4Scale
5 6 7 0 5
10 15Scale
20 25
(a)Entropyplotontime(b)Entropyplotwithoperationtype
Entropy value
(c)Entropyplotwithsize
Figure10:Entropyplotstoquantifyotherworkloadcharacteristics.
Asimilarcalculationonthetrace’s“margin”onLBNgivestheentropyplotonLBN.Figure9(b)showstheentropyplotonbotharrivaltimeandLBNofthesampletrace.
Wemaketwoobservations.First,theentropyplotshowsstronglinearity,suggestingtheskewinarrivaltimeandLBNstaysconstantatallgranularities.Theconstantincrementoftheentropyvaluefromonescaletothenextsuggeststhatthedegreeofskewstaysthesameatallthescales.Thatis,thesampletracehasthesameburstinessatallscales,whichcon rmstheself-similarityofI/Oworkloadsobservedinpreviousstudies[13].Second,thelinearentropyplotallowsustousetheentropyplotslopestocharacterizetheburstiness.Smoothtraf chasanentropyplotofslopecloseto1.Real-worldtraces,however,havestrongburstiness.Insummary,theentropyplotde nedonthetracemarginsallowsustousetwoscalarstocharacterizeboththetemporalandspatialburstinessofI/Oworkloads.
Entropyplotontwo-dimensionaldatasets.Weextendtheentropyplottohandletwo-dimensionaldatasetstomeasurethecorrelationsbetweentwoattributes.Asbefore,theentropyplotcalculatestheentropyvalueatdifferentscales,onlythistimeontwo-dimensionaldatasets.Givenatwo-dimensionalprojectionofa
2n,wedividetheprojectioninto2k2kgrids,whichaggregatesbothdimensionstrace,Cijij12
withscalek.ThisgivesaseriesCkof2k2kelements.ApplyingtheentropyfunctiontoCkgivesthejointentropyvalueatscalekonthetwodimensions.
Thejointentropyallowsustocalculatethecorrelationbetweenthetwoattributes.Thecorrelationisthedifferencebetweenthesumoftheentropyvalueonthetwoattributesandthejointentropyplot.Figure9(b)showsboththejointentropyandthecorrelationonarrivaltimeandLBNforthesampledisktrace.WeobservethatastrongcorrelationexistsbetweenarrivaltimeandLBN,andalsothatthecorrelationstaysconstantatallscales.Thus,weareabletouseascalarvalue,thecorrelationslope,toquantifythecorrelationbetweenarrivaltimeandLBN.
Entropyplotinvolvingrequestsizeandoperationtype.Itispossibletoextendtheentropyplottohandleoperationtypeandrequestsize.Theonlydifferenceisthelimitedvaluerangesofthetwoattributes,whichlimitthenumberofdatapointsintheentropyplot.Asaresult,theworkloaddescriptiondoesnotincludeentropyplotslopesonthesetwoattributes.
Quantifyingthecorrelationsinvolvingeitherofthetwoattributesfacesthesameproblem.Oursolutionistoalwaysusethe nestgranularityontherequestsizeoroperationtype,buttochangethescaleontheotherattribute.Forexample,tocalculatethejointentropyplotonarrivaltimeandrequesttype,theaggregationhappensonlyonthearrivaltime.Figure10showstheentropyplotsthatinvolveoperationtypeandrequestsize.Theseentropyplotsarenotaslinearaspreviousones.Therefore,itisnotstraightforwardtocompresseachlineintoascalar.Currently,ourworkloaddescriptionusestheaverageincrementbetweentwoadjacentscales.
正在阅读:
Storage device performance prediction with CART models07-19
XXXX年中国零售业发展20大预测04-27
Linux常用命令练习07-07
快考试了作文400字07-06
马铃薯播种机设计05-06
关于前进方向529期04-22
新闻宣传工作汇报11-02
03火灾爆炸应急预案05-16
沟通的力量作文1000字07-15
- 1HP OpenView Storage Data Protector 6.0
- 2Cyclone III Device管脚使用推荐
- 3Risk assessment models and uncertainty estimation
- 4Dell Storage Center SCv2000 and SCv2020 Storage System Deployment Guide
- 5Lenovo Storage 产品线介绍
- 6Performance Management ( A model and research agenda)
- 7Abstract Attribute-Based Prediction of File Properties
- 8碳捕捉(Carbon capture and storage,简称CCS)
- 9Abstract Attribute-Based Prediction of File Properties
- 10Sterilization Medical Device Packing,EO Sterilization Confir
- 教学能力大赛决赛获奖-教学实施报告-(完整图文版)
- 互联网+数据中心行业分析报告
- 2017上海杨浦区高三一模数学试题及答案
- 招商部差旅接待管理制度(4-25)
- 学生游玩安全注意事项
- 学生信息管理系统(文档模板供参考)
- 叉车门架有限元分析及系统设计
- 2014帮助残疾人志愿者服务情况记录
- 叶绿体中色素的提取和分离实验
- 中国食物成分表2020年最新权威完整改进版
- 推动国土资源领域生态文明建设
- 给水管道冲洗和消毒记录
- 计算机软件专业自我评价
- 高中数学必修1-5知识点归纳
- 2018-2022年中国第五代移动通信技术(5G)产业深度分析及发展前景研究报告发展趋势(目录)
- 生产车间巡查制度
- 2018版中国光热发电行业深度研究报告目录
- (通用)2019年中考数学总复习 第一章 第四节 数的开方与二次根式课件
- 2017_2018学年高中语文第二单元第4课说数课件粤教版
- 上市新药Lumateperone(卢美哌隆)合成检索总结报告
- performance
- prediction
- Storage
- device
- models
- with
- CART
- 幼儿园大班10以内加减法口算练习题
- 水稻高产栽培的理论与技术4.4
- 小学信息技术四年级下册教学计划
- 劳务派遣公司转换的相关工作
- 第八章二元一次方程组-宁夏石嘴山市第八中学人教版七年级数学下册单元测试(无答案)
- 对底层框架抗震墙砖房结构设计的探讨
- 2019学年度第一学期六年级英语上册期末试卷 (1)
- 骑龙学区五年级下册数学期中试题
- 新警察个人心得体会
- 深圳XX公园绿化管理养护承包合同
- 江西森林资源资产评估培训试卷闭1
- 2015中国虚拟现实行业发展前景及趋势分析
- 总2011级毕业论文指导信息
- 北京环境保护年鉴
- DIY航模脉冲式喷气发动机
- 青年马克思主义者培训工程培训总结
- 高考诗词鉴赏:15种常用的诗歌表现手法
- 2015国家公务员面试热点:儒家文化
- 毛概试题总结 论述题
- 第1章 管理信息系统的基本概念