Storage device performance prediction with CART models

更新时间：2023-07-19 03:45:01 阅读量：实用文档文档下载

说明：文章内容仅供预览，部分内容可能不全。下载后的文档，内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的，是否完整无缺。

storage推荐度：
相关推荐

StorageDevicePerformancePredictionwithCART

Models

MengzhiWang,KinmanAu,AnastassiaAilamaki,

AnthonyBrockwell,ChristosFaloutsos,andGregoryR.Ganger

CMU-PDL-04-103

March2004

ParallelDataLaboratoryCarnegieMellonUniversityPittsburgh,PA15213-3890

Abstract

Storagedeviceperformancepredictionisakeyelementofself-managedstoragesystemsandapplicationplanningtasks,suchasdataassignment.Thisworkexplorestheapplicationofamachinelearningtool,CARTmodels,tostoragedevicemodeling.Ourapproachpredictsadevice’sperformanceasafunctionofinputworkloads,requiringnoknowledgeofthedeviceinternals.WeproposetwousesofCARTmodels:onethatpredictsper-requestresponsetimes(andthenderivesaggregatevalues)andonethatpredictsaggregatevaluesdirectlyfromworkloadcharacteristics.Afterbeingtrainedonourexperimentalplatforms,bothprovideaccurateblack-boxmodelsacrossarangeoftesttracesfromrealenvironments.Experimentsshowthatthesemodelspredicttheaverageand90thpercentileresponsetimewithanrelativeerroraslowas16%,whenthetrainingworkloadsaresimilartothetestingworkloads,andinterpolatewellacrossdifferentworkloads.

Acknowledgements:WethankthemembersandcompaniesofthePDLConsortium(includingEMC,Hewlett-Packard,Hitachi,HitachiGlobalStorageTechnologies,IBM,Intel,LSILogic,Microsoft,NetworkAppliance,Oracle,Panasas,Seagate,Sun,andVeritas)fortheirinterest,insights,feedback,andsupport.WethankIBMforpartlyfundingthisworkthroughaCASstudentfellowshipandafacultypartnershipaward.ThisworkisfundedinpartbyNSFgrantsCCR-0205544,IIS-0133686,BES-0329549,IIS-0083148,IIS-0113089,IIS-0209107,andIIS-0205224.WewouldalsoliketothankEnoThereska,MikeMesnier,andJohnStrunkfortheirparticipationanddiscussionintheearlystageofthisproject.

Keywords:Performanceprediction,storagedevicemodeling

1Introduction

Thecostsandcomplexityofsystemadministrationinstoragesystems[17,35,11]anddatabasesystems[12,1,15,21]makeautomationofadministrationtasksacriticalresearchchallenge.Oneimportantaspectofadministeringself-managedstoragesystems,particularlylargestorageinfrastructures,isdecidingwhichdatasetstostoreonwhichdevices.To ndanoptimalornearoptimalsolutionrequirestheabilitytopredicthowwelleachdevicewillserveeachworkload,sothatloadscanbebalancedandparticularlygoodmatchescanbeexploited.

Researchershavelongutilizedperformancemodelsforsuchpredictiontocomparealternativestoragedevicedesigns.Givensuf cienteffortandexpertise,accuratesimulations(e.g.,[5,28])oranalyticmodels(e.g.,[22,30,31])canbegeneratedtoexploredesignquestionsforaparticulardevice.Unfortunately,inpractice,suchtimeandexpertiseisnotavailablefordeployedinfrastructures,whichareoftencomprisedofnumerousanddistinctdevicetypes,andtheiradministratorshaveneitherthetimenortheexpertiseneededtocon guredevicemodels.

Thispaperattacksthisobstaclebyprovidingablack-boxmodelgenerationalgorithm.By“blackbox,”wemeanthatthemodel(andmodelgenerationsystem)hasnoinformationabouttheinternalcomponentsoralgorithmsofthestoragedevice.Givenaccesstoadeviceforsome“trainingperiod,”themodelgen-erationsystemlearnsadevice’sbehaviorasafunctionofinputworkloads.Theresultingdevicemodelapproximatesthisfunctionusingexistingmachinelearningtools.OurapproachemploystheClassi cationAndRegressionTrees(CART)toolbecauseofitsef ciencyandaccuracy.CARTmodels,inanutshell,approximatefunctionsonamulti-dimensionalCartesianspaceusingpiece-wiseconstantfunctions.

Suchlearning-basedblackboxmodelingisdif cultfortworeasons.First,allthemachinelearningtoolswehaveexaminedusevectorsofscalarsasinput.Existingworkloadcharacterizationmodels,however,pressingthesedistributionsintoasetofscalarsisnotstraightforward.Second,thequalityofthegeneratedmodelsdependshighlyonthequalityofthetrainingworkloads.Thetrainingworkloadsshouldbediverseenoughtoprovidehighcoverageoftheinputspace.

Thisworkdevelopstwowaysofencodingworkloadsasvectors:avectorperrequestoravectorperworkload.Thetwoencodingschemesleadtotwotypesofdevicemodels,operatingattheper-requestandper-workloadgranularities,respectively.Therequest-leveldevicemodelspredicteachrequest’sresponsetimebasedonitsper-requestvector,or“requestdescription.”Theworkload-leveldevicemodels,ontheotherhand,predictaggregateperformancedirectlyfromper-workloadvectors,or“workloaddescriptions.”Ourexperimentsonavarietyofrealworldworkloadshaveshownthatthesedescriptionsarereasonablygoodatcapturingworkloadperformancefrombothsingledisksanddiskarrays.ThetwoCART-basedmodelshaveamedianrelativeerrorof17%and38%,respectively,foraverageresponsetimeprediction,and18%and43%respectivelyforthe90thpercentile,whenthetrainingandtestingtracescomefromthesameworkload.TheCART-basedmodelsalsointerpolatewellacrossworkloads.

Theremainderofthispaperisorganizedasfollows.Section2discussespreviousworkintheareaofstoragedevicemodelingandworkloadcharacterization.Section3describesCARTanditsproperties.Section4describestwoCART-baseddevicemodels.Section5evaluatesthemodelsusingseveralreal-worldworkloadtraces.Section6concludesthepaper.

2RelatedWork

Performancemodelinghasalongandsuccessfulhistory.Almostalways,however,thoroughknowledgeofthesystembeingmodeledisassumed.Disksimulators,suchasPantheon[33]andDiskSim[5],usesoftwaretosimulatestoragedevicebehaviorandproduceaccurateper-requestresponsetimes.Developingsuchsim-ulatorsischallenging,especiallywhendiskparametersarenotpubliclyavailable.Predictingperformance

usingsimulatorsisalsoresourceintensive.Analyticalmodels[7,22,24,30,31]aremorecomputationallyef cientbecausethesemodelsdescribedevicebehaviorwithasetofformulae.Findingtheformulasetrequiresdeepunderstandingoftheinteractionbetweenstoragedevicesandworkloads.Inaddition,bothdisksimulatorsandanalyticalmodelsaretightlycoupledwiththemodeleddevice.Therefore,newdevicetechnologiesmayinvalidateexistingmodelsandrequireanewroundofmodelbuilding.

OurapproachusesCART,whichtreatsstoragedevicesasblackboxes.Asaresult,themodelconstruc-tionalgorithmisfullyautomatedandshouldbegeneralenoughtohandleanytypeofstoragedevice.Thedegenerateformsof“black-boxmodels”areperformancespeci cations,suchasthemaximumthroughputofthedevices,publishedbydevicemanufacturers.Theactualperformance,however,willbenowherenearthesenumbersundersomeworkloads.Anderson’s“table-based”approach[3]includesworkloadcharacter-isticsinthemodelinput.Thetable-basedmodelsrememberdevicebehaviorforawiderangeofworkloadanddevicepairsandinterploatesamongtablesentriesinpredicting.Anderson’smodelsareusedinanautomatedstorageprovisiontool,Ergastulum[2],whichformulatesautomaticstorageinfrastructureprovi-sioningasanoptimizationproblemandusesdevicemodelstoguidethesearchalgorithminlocatingthesolution.Ourapproachimprovesonthetable-basedmodelsbyemployingmachinelearningtoolstocapturedevicebehavior.Becauseofthegoodscalabilityofthetoolstohighdimensionaldatasets,weareabletousemoresophisticatedworkloadcharacteristicsasthemodelinput.Asaresult,themodelsaremoreef cientinbothcomputationandstorage.

Workloadcharacterizationisanimportantpartofdevicemodelingbecauseitprovidesasuitablerep-resentationofworkloads.Despiteabundantpublishedworkinmodelingwebtraf c[23,25,8],I/Otraf cmodelingreceiveslessattention.Directapplicationofwebtraf canalysismethodstoI/worktraf chasacategoricaladdressspace,andthereisnonotionofsequentialscans.Incontrast,theperformancevariabilitycanbeseveralordersofmagnitudebetweenrandomandsequentialaccessesforI/Oworkloads.Ganger[10]pointedoutthecomplexityofI/Oworkloads,andeventhedetectionofsequentialscansisahardproblem[19].Gomezetal.[14]identi edself-similarityinI/Otraf candadoptedstructuralmodelstogenerateI/Oworkloads.Kurmasetal.[20]employedaniterativeapproachtodetectimportantworkloadcharacteristics.Rome[34]providedageneralframeworkofworkloadspeci cations.Alltheapproaches,inonewayoranother,useempiricaldistribu-tionsderivedfromgivenworkloadsastheparametervalues.Ourpreviouswork[32]takesadvantageoftheself-similarityofI/Oworkloadsandproposesatool,the“entropyplot,”tocharacterizethespatio-temporalcharacteristicsofI/Oworkloadswiththreescalars.SinceourCART-basedmodelsrequireworkloadstobepresentedintheformofvectorsofscalars,theentropyplotisanattractivechoice.

3Background:CARTModels

ThissectiongivesabriefintroductionoftheCARTmodelsandjusti esourchoiceofthetool.AdetaileddiscussionofCARTisavailablein[4].

3.1CARTModels

CARTmodelingisamachinelearningtoolthatcanapproximaterealfunctionsinmulti-dimensionalCarte-fXε,sianspace.(Itcanalsobethoughtofasatypeofnon-linearregression.)GivenafunctionY

dwhereX ,Y ,andεiszero-meannoise,aCARTmodelapproximatesYusingapiece-wiseconstant f X.WerefertothecomponentsofXasfeaturesinthefollowingtext.Theterm,ε,capturesfunction,Y

theintrinsicrandomnessofthedataandthevariabilitycontributedbytheunobservablevariables.Thevari-anceofthenoisecouldbedependentonX.Forexample,thevarianceofresponsetimeoftendependsonthearrivalrate.

x<5.94851

x<3.23184

x<7.7123 100

x<9.0008x<1.92033

x<5.0035x<7.05473

80 60 40 20 0

f(x) = x * x

CART

x<1.69098

x<3.60137

30.83 48.68 56.33 72.06 16.64 26.85x<4.4543x<0.889

88.01 2.57 21.67 -1.92 6.05 8.94-20

0 2 4

6 8 10

(a)Fittedtree

(b)Datapointsandregressionline

Figure1:CARTmodelforasimpleone-dimensionaldataset.Thedatasetcontains100datapointsgen-x2ε,whereεfollowsaGuassiandistributionwithmean0andstandarddeviationeratedusingfx

10.

Thepiece-wiseconstantfunctionf Xcanbevisualizedasabinarytree.Figure1(a)showsaCARTmodelconstructedonthesampleone-dimensionaldatasetin(b).Thesampledatasetisgeneratedusing

x2i

εi

100

wherexiisuniformlydistributedwithin(0,10),andεifollowsaGuassiandistributionofN010.The

leafnodescorrespondtodisjointhyper-rectanglesinthefeaturevectorspace.Thehyper-rectanglesaredegeneratedintointervalsforone-dimensionaldatasets.Eachleafisassociatedwithavalue,f X,whichisthepredictionforallXswithinthecorrespondinghyper-rectangle.Theinternalnodescontainsplitpoints,andapathfromtheroottoaleafde nesthehyper-rectangleoftheleafnode.Thetree,therefore,representsapiece-wiseconstantfunctiononthefeaturevectorspace.Figure1(b)showstheregressionlineofthesampleCARTmodel.

3.2CARTModelProperties

CARTmodelsarecomputationallyef cientinbothconstructionandprediction.Theconstructionalgorithmstartswithatreewithasinglerootnodecorrespondingtotheentireinputvectorspaceandgrowsthetreebygreedilyselectingthesplitpointthatyieldsthemaximumreductioninmeansquarederror.AmoredetaileddiscussionofthesplitpointselectionispresentedinAppendixA.Eachpredictioninvolvesatreetraversaland,therefore,isfast.

CARToffersgoodinterpretabilityandallowsustoevaluatetheimportanceofvariousworkloadchar-acteristicsinpredictingworkloadperformance.ACARTmodelisabinarytree,makingiteasytoplotonpaperasinFigure1(a).Moreimportantly,onecanevaluateafeature’simportancebyitscontributioninerrorreduction.Intuitively,amoreimportantfeatureshouldcontributemoretotheerrorreduction;thus,leavingitoutofthefeaturevectorwouldsigni cantlyraisethepredictionerror.InaCARTmodel,weusethesumoftheerrorreductionrelatedtoalltheappearancesofafeatureasitsimportance.

3.3ComparisonWithOtherRegressionTools

OtherregressiontoolscanachievethesamefunctionalityasCART.WechoosetouseCARTbecauseofitsaccuracy,ef ciency,robustness,andeaseofuse.Table1comparesCARTwithfourotherpopulartoolsto

FeatureCART

high(505%)

Neuralnetworks

fair(66%)

Poor

Fair

Poor

fast(seconds)

slow(hours)

fast

(milliseconds)

low(60B)

low(2MB)

Fair

k-nearestneighbors

InterpretabilityAbilitytohandleirrelevantinput

GoodGood

PoorPoor

PredictiontimeEaseofuse

fast

(milliseconds)

Good

slow(minutes)Fair

Table1:Comparisonofregressiontoolsinpredictingper-requestresponsetime.(ThesamedatasetisusedinFigure5.)Thecomparisononrow2,3,4andthelastoneistakenfrom[16].Werankthefeaturesintheorderoftheirimportance.Interpretabilityisthemodel’sabilitytoinfertheimportanceofinputvariables.Robustnessistheabilitytofunctionwellundernoisydataset.Irrelevantinputreferstofeaturesthathavelittlepredictivepowers.

buildtherequest-leveldevicemodelasdescribedinSection4.2.Themodelswereconstructedonthe rstdayofcello99aandtestsrunonthesecondofthesametrace.TheinformaiononthetracesweusedmaybefoundinSection5.

Themodel[29]usesalinearfunctionofXtoapproximatefX.Duetonon-linearstoragedevicebehavior,linearmodelshavepooraccuracy.

Themodel[26]consistsofasetofhighlyinterconnectedprocessingelementsworkinginunisontoapproximatethetargetfunction.Weuseasinglehiddenlayerof20nodes(bestamong20and40)andalearningrateof0.05.Halfofthetrainingsetisusedinbuildingthemodelandtheotherhalfforvalidation.Suchamodeltakesalongtimetoconverge.

The[6]mapstheinputdataintoahighdimensionalspaceandperformsalinearregressionthere.Ourmodelusestheradialbasisfunction

Kxix

expγx

asthekernelfunction,andγissettobe2(bestamong1,3,4,6).Weuseanef cientimplementation,SVMlight[18],inourexperiment.Selectingtheparametervaluesrequiresexpertiseandmultipleroundsoftrials.

Themodel[9]ismemory-basedbecausethemodelremembersallthetrain-ingdatapointsandpredictionisdonethroughaveragingtheoutputoftheknearestneighborsofthedatapointbeingpredicted.WeusetheEuclideandistancefunctionandakvalueof5(bestamong5,10,15,and20).Themodelisaccurate,butisinef cientinstorageandcomputation.

Thelastthreetoolsrequirethatallthefeaturesandoutputbenormalizedtotheunitlength.Forfeaturesoflargevaluerange,wetakelogarithmsbeforenormalization.Overall,CARTisthebestatpredictingper-requestresponsetimes,withtheonlydownsidebeingslightlyloweraccuracycomparedtothemuchmorespace-andtime-consumingapproach.

Figure2:Modelconstructionthroughtraining.RTiistheresponsetimeofrequestri.

4PredictingPerformancewithCART

ThissectionpresentstwowaysofconstructingdevicemodelsbasedonCARTmodels.

4.1Overview

OurgoalistobuildamodelforagivenstoragedevicewhichpredictsdeviceperformanceasafunctionofI/Oworkload.Thedevicemodelreceivesaworkloadasinputandpredictsitsaggregateperformance.Wede neaworkloadasasequenceofdiskrequests,witheachrequest,ri,uniquelydescribedbyfourattributes:arrivaltime(ArrivalTimei),logicalblocknumber(LBNi),requestsizeinnumberofdiskblocks(Sizei),andread/writetype(RWi).Thestoragedevicecouldbeasingledisk,adiskarray,orsomeotherlike-interfacedcomponent.Theaggregateperformancecanbeeithertheaverageorthe90-thpercentileresponsetime.

OurapproachusesCARTtoapproximatethefunction.Weassumethatthemodelconstructionalgorithmcanfeedanyworkloadintothedevicetoobserveitsbehaviorforacertainperiodoftime,alsoknownas“training.”Thealgorithmthenbuildsthedevicemodelbasedontheobservedresponsetimes,asillustratedinFigure2.Modelconstructiondoesnotrequireanyinformationabouttheinternalsofthemodeleddevice.Therefore,itisgeneralenoughtomodelanydevice.

Regressiontoolsareanaturalchoicetomodeldevicebehavior.Suchtoolsaredesignedtomodelfunc-tionsonmulti-dimensionalspacegivenasetofsampleswithknownoutput.Thedif cultyistotransformworkloadsintodatapointsinamulti-dimensionalfeaturespace.Weexploretwowaystoachievethetrans-formationasillustratedinFigure3.Arequest-levelmodelrepresentsarequestriasavectorRi,alsoknownasthe“requestdescription,”andusesCARTmodelstopredictper-requestresponsetimes.Theaggregateperformanceisthencalculatedbyaggregatingtheresponsetimes.Aworkload-levelmodel,ontheotherhand,representstheentireworkloadasasinglevectorW,orthe“workloaddescription,”andpredictstheaggregateperformancedirectlyfromW.Inbothapproaches,thequalityoftheinputvectorsiscriticaltothemodelaccuracy.Thenexttwosectionspresenttherequestandworkloaddescriptionsindetail.

4.2Request-LevelDeviceModels

ThissectiondescribestheCART-basedrequest-leveldevicemodel.ThismodelusesaCARTmodeltopredicttheresponsetimesofindividualrequestsbasedonrequestdescriptions.Themodel,therefore,isabletogeneratetheentireresponsetimedistributionandoutputanyaggregateperformancemeasures.

Weadoptthefollowingtwoconstraintsindesigningtherequestdescription.1.Ridoesnotincludeanyactualresponsetimes.Onecouldrelaxthisconstraintbyallowingthein-clusionoftheresponsetimeinformationforalltherequeststhathavealreadybeenservedwhenthecurrentrequestarrives.Thisrelaxation,however,isfeasibleonlyforonlineresponsetimepredictions;itwouldnotbeappropriateforapplicationplanningtasksbecausetheplannerdoesnotrunworkloadsondevices.

Figure3:TwotypesofCART-baseddevicemodels.

2.Ricanbecalculatedfromrj,ji.Thisconstraintsimpli estherequestdescription.Inmostcases,theresponsetimeofacurrentrequestdependsonlyonpreviousrequestsandtherequestitself.

OurrequestdescriptionRiforrequestricontainsthefollowingvariables:

TimeDiffi1

TimeDiffik

LBNi

LBNDiffi1

LBNDiffil

Sizei

RWi

Seqi

whereTimeDiffikArrivalTimeiArrivalTimei2k1andLBNDiffilLBNiLBNil.The rstthreegroupsoffeaturescapturethreecomponentsoftheresponsetime,andSeqiindicateswhethertherequestisasequentialaccess.The rstk1featuresmeasurethetemporalburstinessoftheworkloadwhenriarrives,andsupportpredictionofthequeuingtime.WeallowtheTimeDifffeaturestoexponentiallygrowthedistancefromthecurrentrequesttohistoryrequesttoaccommodatelargebursts.Thenextl1featuresmeasurethespatiallocality,supportingpredictionoftheseektimeoftherequest.SizeiandRWisupportpredictionofthedatatransfertime.

Thetwoparameters,kandl,determinehowfarwelookbackforrequestburstsandlocality.Smallvaluesdonotadequatelycapturethesecharacteristics,rgevalues,ontheotherhand,leadstoahigherdimensionality,meaningtheneedforalargertrainingsetandalongertrainingtime.Theoptimalvaluesfortheseparametersarehighlydevicespeci c,andSection5.1showshowweselecttheparametervaluesinourexperiments.

4.3Workload-LevelDeviceModels

Theworkload-levelmodelrepresentstheentireworkloadasasingleworkloaddescriptionandpredictsaggregatedeviceperformancedirectly.TheworkloaddescriptionWcontainsthefollowingfeatures.

Averagearrivalrate

Readratio

Averagerequestsize

Percentageofsequentialrequests

Temporalandspatialburstiness

Correlationsbetweenpairsofattributes

Theworkloaddescriptionusestheentropyplot[32]toquantifytemporalandspatialburstinessandcorrela-tionsbetweenattributes.Entropyvalueareplottedononeortwoattributesagainsttheentropycalculationgranularity.Theincrementoftheentropyvaluescharacterizeshowtheburstinessandcorrelationschangefromonegranularitytothenext.Becauseoftheself-similarityofI/Oworkloads[13],theincrementisusuallyconstant,allowingustousetheentropyplotslopetocharacterizetheburstinessandcorrelations.AppendixBdescribestheentropyplotindetail.

Theworkload-leveldevicemodeloffersfastpredictions.ThemodelcompressesaworkloadintoaworkloaddescriptionandfeedsthedescriptionintoaCARTmodeltoproducethedesiredperformancemeasure.Featureextractionisalsofast.Topredictboththeaverageand90thpercentileresponsetime,themodelmusthavetwoseparatetrees,oneforeachperformancemetric.

Workloadmodelingintroducesaparametercalled“windowsize.”Thewindowsizeistheunitofper-formancepredictionand,thus,theworkloadlengthforworkloaddescriptiongeneration.Forexample,wecandividealongtraceintoone-minutefragmentsandusetheworkload-levelmodeltopredicttheaverageresponsetimeoverone-minuteintervals.Fragmentingworkloadshasseveraladvantages.First,performanceproblemsareusuallytransient.A“problem”ingtheworkloadinitsentirety,ontheotherhand,failstoindentifysuchtransientproblems.Second,fragmentingthetrainingtraceproducesmoresamplesfortrainingandreducestherequiredtrainingtime.Windowsthataretoosmall,however,containtoofewrequestsfortheentropyplottobeeffective.Weuseone-minutewindowsinallofourexperiments.

4.4ComparisonofTwoTypesofModels

Thereisacleartradeoffbetweentherequest-levelandworkload-leveldevicemodels.Theformerisfastintrainingandslowinprediction,andthelatteristheopposite.

Themodeltrainingtimeisdominatedbytracereplay,which,whentakingplaceonactualdevices,requiresexactlythesameamountoftimeasthetracelength.BuildingaCARTmodelneedsonlysecondsofcomputation,buttracereplaycanrequirehundredsofhourstoacquireenoughdatapointsformodelconstruction.Whenoperatingattherequestlevel,thedevicemodelgetsonedatapointperrequestasopposedtoonedatapointperone-minuteworkloadfragmentasintheworkload-leveldevicemodel.Inordertogetthesamenumberofdatapoints,theworkload-leveldevicemodelneedsatrainingtime100timeslongerthantherequest-levelmodelwhenthearrivalrateis100requestsperminute.

Thenumberoftreetraversalsdeterminesthepredictiontime,sinceeachpredictedvaluerequiresatreetraversal.Therefore,thetotalnumberoftreetraversalsisthenumberofrequestsintheworkloadfortherequest-leveldevicemodelandthenumberofworkloadfragmentsfortheworkload-levelmodel.Withanaveragearrivalrateof100requestsperminute,therequest-levelmodelis100timesslowerinprediction.

Anitemforfutureresearchistheexplorationofthepossibilityofcombiningthetwomodelstodeliveronesthatareef cientinbothtrainingandprediction.

5ExperimentalResults

ThissectionevaluatestheCART-baseddevicemodelspresentedintheprevioussectionusingarangeofworkloadtraces.

Devices.Wemodeltwodevices:asinglediskandadiskarray.Thesinglediskisa9GBAtlas10Kdiskwithanaveragerotationallatencyof3milliseconds.ThediskarrayisaRAID5diskarrayconsistingof8Atlas10Kdiskswitha32KBstripesize.WereplayallthetracesonthetwodevicesexcepttheSAPtrace,whichisbeyondthecapacityoftheAtlas10Kdisk.

Tracenamecello99acello99bcello99c

Length4weeks4weeks4weeks

Tracedescription

AverageSize

7.8Million

7.1KB118.0KB8.5KB

1.1Million

singledisk

35.4%

115.71ms113.61ms5.04ms

99.9%

7.40ms59.28ms

Table2:Tracesummary.WemodelanAtlas10K9GBandaRAID5diskarrayconsistingof8Atlas10Kdisks.TheresponsetimeiscollectedbyreplayingthetracesonDiskSim3.0[5].

Traces.Weusethreesetsofreal-worldtracesinthisstudy.Table2liststhesummarystatisticsoftheeditedtraces.The rsttwo,cello92andcello99capturetypicalcomputersystemresearchI/Oworkloads,collectedatHPLabsin1992and1999respectively[27,14].Wepreprocesscello92toconcatenatetheLBNsofthethreemostactivedevicesfromthetraceto llthemodeleddevice.Forcello99,wepickthethreemostactivedevices,amongthe23devices,andlabelthemcello99a,cello99b,andcello99c.Thecello99traces tina9GBdiskperfectly,sonotraceeditingisnecessary.Asthesetracesarelong(twomonthsforcello92andoneyearforcello99),wereportdataforafour-weeksnapshot(5/1/92to5/28/92and2/1/99to2/28/99).

TheSAPtracewascollectedfromanOracledatabaseserverrunningSAPISUCCS2.5Binapowerutilitycompany.Theserverhasmorethan3,000usersanddiskaccessesre ecttheretrievalofcustomerinvoicesforupdatingandreviewing.SequentialreadsdominatetheSAPtrace.

Evaluationmethodology.Theevaluationusesthedevicemodelstopredicttheaverageand90thper-centileresponsetimeforone-minuteworkloadfragments.Wereportthepredictionerrorsusingtwometrics:

Y,andrelativeerrorabsoluteerrorde nedasthedifferencebetweenthepredictedandtheactualvalue,Y

de nedasYY

Relative importanc

60%40%20%0%

TimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiffTimeDiff1LBNLBNDiffLBNDiffLBNDiffLBNDiffLBNDiffSize(a)relativeimportancemeasuredonthe9GBAtlas10KBdiskusingcello99aand

cello99c

SeR(b)relativeimportancemeasuredontheRAID5diskarrayusingcello99aandcello99c

Figure4:Relativeimportanceoftherequestdescriptionfeatures.

5.1CalibratingRequest-LevelModels

Thissectiondescribeshowweselectparametervaluesforkandlfortherequest-leveldevicemodels.

Figure4showstherelativeimportanceoftherequestdescriptionfeaturesindeterminingper-requestresponsetimebysettingkto10andlto5.Thefeature’srelativeimportanceismeasuredbyitscontributioninerrorreduction.Thegraphsshowtheimportanceofrequestdescriptionfeaturesmeasuredonbothdevices,trainedontwotraces(cello99aandcello99c).Weuseonlythe rstdayofthetracesandreducethedatasetsizeby90%withuniformsampling.

First,weobservethattherelativeimportanceisworkloaddependent.Asweexpected,forbusytraf- csuchasthatwhichoccurredinthecello99atrace,thequeuingtimedominatestheresponsetime,andthereby,theTimeDifffeaturesaremoreimportant.Ontheotherhand,cello99chassmallresponsetimes,andfeaturesthatcharacterizethedatatransfertime,suchasSizeandRW,havegoodpredictivepowerinmodelingthesingledisk.

Second,weobservethatthemostimporantfeatureshiftsfromTimeDiff8toTimeDiff7wherecom-paringthesingledisktothediskarrayforcello99abecausethequeuingtimebecomeslesssigni cantforthediskarray.Thedistinctionbetweenthetwotraces,however,persists.

Wesetkto10forTimeDiffandlto3forLBNDiffinthesubsequentexperimentssothatwecanmodeldevicebehaviorunderbothtypesofworkloads.

Weshowthemodelaccuracyinpredictingper-requestresponsetimesinFigure5.ThemodelisbuiltfortheAtlas10Kdisk.Thetrainingtraceisthe rstdayofcello99a,andthetestingtraceistheseconddayofthesametrace.Figure5(a)isascatterplot,showingthepredictedresponsetimesagainsttheactualonesforthe rst5,000requests.Mostofthepointsstayclosetothediagonalline,suggestingaccuratepredictionoftherequest-leveldevicemodel.Figure5(b)furthercomparestheresponsetimedistributions.Thelongtailofthedistributioniswellcapturedbytherequest-levelmodel,indicatingthattherequestdescriptionis

Predicted response time (ms)

1600 1200 800 400 0

100 %80 %% of requests

60 %40 %20 %0 %

ActualPredicted

1000 2000 3000 4000

Response time (ms)

400 800 1200Actual response time (ms)

1600

(a)Scatterplot(b)Responsetimedistribution

Figure5:Predictionaccuracyoftherequest-levelmodel.Theactualandpredictedaverageresponsetimesare137.96msand133.01msrespectively.Thecorrespondingdemerit,de nedin[28]astherootmeansquareofhorizontaldistancebetweentheactualandpredictedcurvesin(b),is46.06milliseconds(33.4%oftheactualaverageresponsetime).

effectiveincapturingrequest-levelcharacteristicsneededtopredictresponsetimes.

Insummary,therequestdescriptioneffectivelycapturesimportantper-requestcharacteristics,leadingtoaccuraterequest-leveldevicemodels.

5.2ModelingASingleDisk

Figure6comparestheaccuracyofallthepredictorsinmodelinganAtlas10K9GBdiskonreal-worldtraces.Asmentionedearlier,allthepredictorsaretrainedusingthe rsttwoweeksofcello99a.Overall,thetwoCART-baseddevicemodelsprovidegoodpredictionaccuracyinpredictingboththeaverageand90thpercentileresponsetimes,comparedtootherpredictors.Severalmoredetailedobservationscanbemade.

First,allofthemodelsperformthebestwhenthetrainingandtestingtracesarefromthesameworkload,e.g.cello99a,becausethemodelshaveseenhowthedevicebehavesundersuchworkloads.Thepredictoralsocutsthemedianpredictionerrorofthepredictorbymorethanahalfbecauseofthestrongperiodicityoftheworkload.andfurtherreducetheerrorto4.84milliseconds(19%)and14.83milliseconds(47%)respectivelyfortheaverageresponsetimeprediction,and20.46milliseconds(15%)and49.50milliseconds(45%)respectivelyforthe90thpercentile.Theperfor-androughlymeasuresthebene tofusinganon-linearmancedifferencebetween

model,suchasCART,becausebothacceptthesameinput.Weobserveasigni cantimprovementfromtheformertothelatter,suggestingnon-lineardevicebehavior.

Second,bothCART-baseddevicemodelsinterpolatebetteracrossworkloadsthantheothermodels.

andrelyblindlyonsimilaritiesbetweenthetrainingandtestingworkloadstomake

goodpredictions.Consequently,itisnotsurprisingtoseehugepredictionerrorswhenthetrainingandtestingworkloadsdiffer.TheCART-basedpredictors,ontheotherhand,areabletodistinguishbetweenworkloadsofdifferentcharacteristicsandaremorerobusttothedifferencebetweenthetrainingandtestingworkloads.

Third,modelaccuracyishighlydependentonthetrainingworkloadqualityfortheCART-basedmodels.Thepredictionerrorincreasesforworkloadsotherthancello99a,becauseoftheaccesspatterndifferencesamongthesetraces.TheCART-basedmodelslearndevicebehaviorthroughtraining;therefore,theycannotpredictperformanceforworkloadsthathavetotallydifferentcharacteristicsfromthetrainingworkloads.

averageresponsetime

(b)Predictionerrorfor90thpercentileresponsetime

Figure6:Comparisonofpredictorsforasingle9GBAtlas10Kdisk.

Forexample,constantlyover-predictsforcello99cbecausethemodelwasnevertrainedwththesmallsequentialaccessesthatareparticulartocello99c.Section5.4givesaninformalerroranalysisandidenti esinadequatetrainingbeingthemostsigni canterrorsource.

Fourth,highquantileresponsetimesaremoredif culttopredict.Weobservelargerpredictionerrorsfromallthepredictorsfor90thpercentileresponsetimepredictionsthanforaverageresponsetimepredic-tions.TheaccuracyadvantageofthetwoCART-basedmodelsishigherfor90thpercentilepredictions.

Insummary,thetwoCART-basedmodelsgiveaccuratepredictionswhenthetrainingandtestingwork-loadssharethesamecharacteristicsandinterpolatewellotherwise.Thegoodaccuracysuggeststheeffec-tivenessoftherequestandworkloaddescriptionsincapturingimportantworkloadcharacteristics.

5.3ModelingADiskArray

Figure7comparestheaccuracyofthefourpredictorsinmodelingthediskarray.ThepredictorisnotpresentedbecausetheSAPtracedoesnotprovideenoughinformationonarrivaltimeforustoknowtheoffsetwithinaweek.Theoverallresultsaresimilartothoseforthesingledisk.ThetwoCART-basedmodelsarethemostaccuratepredictors.Theabsoluteerrorsbecomesmallerduetothedecreasedresponsetimefromthesingledisktothediskarray.Therelativeaccuracyamongthepredictors,however,staysthesame.Overall,theCART-baseddevicemodelingapproachworkswellforthediskarray.

5.4ErrorAnalysis

Thissectionpresentsaninformalerroranalysistoidentifythemostsigni canterrorsourcefortheCART-baseddevicemodels.

averageresponsetime

(b)predictionerrorfor90thpercentileresponsetime

Figure7:ComparisonofpredictorsforaRAID5diskarrayof8Atlas10Kdisks.

Amodel’serrorconsistsoftwoparts.The rstpartcomesfromintrinsicrandomnessoftheinputdata,suchasmeasurementerror,andthiserrorcannotbecapturedbyanymodel.Therestoftheerrorcomesfromthemodelingapproachitself.TheCART-basedmodelsincurerroratthreeplaces.First,thetransformationfromworkloadstovectorsintroducesinformationloss.Second,theCART-basedmodelsusepiece-wiseconstantfunctions,whichcouldbedifferentfromthetruefunctions.Third,alow-qualitytrainingtraceyieldsinaccuratemodelsbecauseCARTreliesontheinformationfromthetrainingdatatomakepredictions.Aninadequatetrainingsethasonlyalimitedrangeofworkloadsandleadstolargepredictionerrorsforworkloadsoutsideofthisrange.We ndthatthelasterrorsource,inadequatetrainingdata,causesthemosttroubleinourexperiments.

Weconductasmallexperimenttoverifyourhypothesis.Figure8(a)comparesthedifferenceinse-quentialitybetweencello99aandcello99c.Thespectrumofsequentiality(from0%to100%ofrequestsintheworkloadbeingsequential)isdividedinto20buckets,andthegraphsshowsthenumberofone-minuteworkloadfragmentsineachbucketforbothtraces.Weobserveasigni cantnumberofhighsequentialityfragmentsincello99b,butnofragmentgoesbeyond50%sequentialityincello99a.Thisdifferenceleadstolargepredictionerrorsforhighsequentialityfragmentswhenwebuildtheworkload-levelmodeloncello99aanduseittopredicttheperformanceofcello99b,asshownin(b).Theerrorsarereducedsigni cantlywhenweincludethe rsthalfofcello99bintraining.Thedramaticerrorreductionsuggeststhatpredictioner-rorsfromtheothersourcesarenegligiblewhencomparedwiththeonesintroducedbyinadequatetraining.Figure8(c)furthershowstheabsoluteerrorhistogramwith1millisecondbuckets.Thespikeshiftto0millisecondswhenwetrainthemodelonthecombinedtrainingtrace,indicatingthatitisreasonabletoas-sumeazero-meannoiseterm.Weconcludefromthisevidencethatcontributingeffortsinblack-boxdevicemodelingshouldbedirectedtowardgeneratingagoodtrainingsetthatcoversabroadrangeofworkloadtypes.

200

# of intervals

1501005000%

25%50%75%100%% of sequential requests

(a)Histogramsonsequentiality(b)Averageabsoluteerror(c)Absoluteerrordistribution

Figure8:Effectsofdifferenttrainingworkloads.

6Conclusions

Storagedeviceperformancemodelingisanimportantelementinself-managedstoragesystemsandotherapplicationplanningtasks.Ourtargetmodeltakesaworkloadasinputandpredictsitsaggregateperfor-manceonthemodeleddeviceef cientlyandaccurately.Thispaperpresentsourinitialresultsinexploringmachinelearningtoolstobuilddevicemodels.Ablackboxpredictivetool,CART,makesdevicemodelsindependentofthestoragedevicesbeingmodeled,andthus,generalenoughtohandleanytypeofdevices.Themodelconstruction,alsoknownastraining,consistsoftwophases:replayingtracesonthedevicesandbuildingaCARTmodelbasedontheobservedresponsetimes.Modelinganewdeviceinvolvesonlytrainingonthetargetdevice.

CART-basedmodelstakeinputintheformofvectors,soworkloadsmustbetransformedintovectorsinordertouseCARTasthebasisfordevicemodels.Thispaperpresentstwowaystoaccomplishsuchatransformation,yieldingtwotypesofdevicemodels.Therequest-leveldevicemodelsrepresenteachrequestasavectorandpredictitsresponsetime.Asaresult,themodelsareabletopredicttheentireresponsetimedistribution.Theexperimentsshowthatthepredictedresponsetimehasademerit gureof33%foramodernUNIX leservertrace,leadingtoamedianrelativeerroraslowas16%foraggregateperformancepredictions.Theworkload-leveldevicemodels,ontheotherhand,transformaworkloadfragmentintoavectorandpredictitsaggregateperformancedirectly.Thevectortakesadvantageoftheef ciententropyplotmetrictocapturethetemporalandspatialburstinessaswellasthecorrelationswithinI/Oworkloads.Themedianrelativeerrorcanbeaslowas29%fortheworkload-leveldevicemodels.

Theerroranalysissuggeststhatthequalityofthetrainingworkloadsplaysacriticalroleinthemodelaccuracy.Themodelsareunabletopredictworkloadsthataredifferentfromthetrainingworkloads.Toaccuratelypredictarbitraryworkloads,itisimportantforthetrainingworkloadstobeasdiverseaspossibletocoverawiderangeofworkloads.Ourfutureworkwillexploretheeffectivenessofexistingsyntheticworkloadgeneratorsinproducinghigh-qualitytrainingworkloads.

Continuingresearchcanimprovethemodelpredictionaccuracy.First,ourexperimentsshowtherele-vanceoftrainingtraces.Generatingrulestoassistintrainingsuchmodelsbroadlyenoughwillbeimportant.Second,theworkloadcharacterizationproblempersists,affectingtheworkload-levelmodels.Webelieve,however,thatthecontextofferedbythemodelscanhelpproduceinsightintothislong-standingproblem.Third,thetwotypesofdevicemodelsshowdesirablepropertiesintrainingandpredicting,respectively.Itshouldbevaluabletohaveamodelthatcombinesthebestofbothapproaches.

References

[1]N.Allen.Don’twasteyourstoragedollars:Whatyouneedtoknow.Researchnote,GartnerGroup,

2001.[2]E.Anderson,M.Kallahalla,S.Spence,R.Swaminathan,andQ.Wang.Ergastulum:Quickly nding

near-optimalstoragesystemdesigns.TechnicalReportHPL-SSP-2001-05,HPLabs,2001.[3]EricAnderson.Simpletable-basedmodelingofstoragedevices.TechnicalReportHPL-SSP-2001-4,

HPLabs,2001.[4]L.Brieman,J.H.Friedman,R.A.Olshen,andC.J.Stone.Classi cationandRegressionTrees.

Wadsworth,1984.[5]JohnBucy,GregGanger,andcontributors.TheDiskSimsimulationenvironmentversion3.0reference

manual.TechnicalReportCMU-CS-03-102,CarnegieMellonUniversity,2003.[6]C.J.C.Burges.AtutorialonSupportVectorMachinesforpatternrecognition.KnowledgeDiscovery

andDataMining,2(2):121–167.[7]ShenzeChenandDonTowsley.AperformanceevaluationofRAIDarchitectures.IEEETransactions

onComputers,45(10):1116–1130,1996.[8]MarkE.CrovellaandAzerBestavros.Self-similarityinworldwidewebtraf cevidenceandpos-siblecauses.InProc.ofthe1996ACMSIGMETRICSIntl.Conf.onMeasurementandModelingofComputerSystems,May1996.[9]B.Dasarathy.Neareastneighborpatternclassi cationtechniques.IEEEComputerSocietyPress,

1991.[10]GregoryR.Ganger.Generatingrepresentativesyntheticworkloads:Anunsolvedproblem.InPro-ceedingsoftheComputerMeasurementGroup(CMG)Conference,pages1263–1269,1995.[11]GregoryR.Ganger,JohnD.Strunk,andAndrewJ.Klosterman.Self-*storage:Brick-basedstorage

withautomatedadministration.TechnicalReportCMU-CS-03-178,CarnegieMellonUniversity,2003.[12]GartnerGroup.Totalcostofstorageownership—Auser-orientedapproach.Researchnote,Gartner

Group,2000.[13]Mar´iaE.G´omezandVicenteSantonja.Analysisofself-similarityinI/Oworkloadusingstructural

modeling.In7thInternationalSymposiumonModeling,AnalysisandSimulationofComputerandTelecommunicationSystems,pages234–243,1999.[14]Mar´iaE.G´omezandVicenteSantonja.Anewapproachinthemodelingandgenerationofsynthetic

diskworkload.InProceedingsof9thInternationalSymposiumonModeling,AnalysisandSimulationofComputerandTelecommunicationSystems,pages199–206,2000.[15]J.Gray.AconversationwithJimGray.ACMQueue,1(4),2003.

[16]TrevorHastie,RobertTibshirani,andJeromeFriedman.TheElementsofStatisticalLearning:Data

Mining,Inference,andPrediction.Springer,2001.[17]InternationalBusinessMachinesCorp.Autonomiccomputing:IBM’sperspectiveonthestateof

informationtechnology./autonomic/manifesto/.

[18]T.Joachims.Makinglarge-scaleSVMlearningpractical.InB.Sch¨olkopf,C.Burges,andSmola,

editors,AdvancesinKernelMethods–SupportVectorLearning.MIT-Press,1999.[19]K.Keeton,G.Alvarez,E.Riedel,andM.Uysal.CharacterizingI/O-intensiveworkloadsequentiality

onmoderndiskarrays.InTheFourthWorkshopOnComputerArchitectureEvaluationUsingCom-mercialWorkloads,2001.[20]ZacharyKurmas,KimberlyKeeton,andRalphBecker-Szendy.I/Oworkloadcharacterization.In4th

workshoponcomputerarchitectureevaluationusingcommercialworkloads,2001.[21]mb.Hardwarespendingsputters.RedHerring,pages32–33,2001.

[22]EdwardK.LeeandRandyH.Katz.Ananalyticperformancemodelofdiskarrays.InProceedingsof

the1993ACMSIGMETRICS,pages98–109,1993.[23]WillE.Leland,MuradS.Taqq,WalterWillinger,andDanielV.Wilson.Ontheself-similarnatureof

Ethernettraf c.InACMSIGCOMM,pages13–17,1993.[24]ArifMerchantandGuillermoA.Alvarez.DiskarraymodelsinMinerva.TechnicalReportHPL-2001-118,HPLaboratories,2001.[25]RuldolfH.Riedi,MatthewS.Crouse,VinayJ.Ribeiro,andRichardG.Baraniuk.Amultifrac-talwaveletmodelwithapplicationtonetworktraf c.IEEETransactionsonInformationTheory,45(3):992–1018,April1999.[26]B.D.Ripley.PaternrecognitionsandNeuralNetworks.CambridgeUniversityPress,1996.

[27]ChrisRuemmlerandJohnWilkes.Unixdiskaccesspatterns.InWinterUSENIXConference,pages

405–420,1993.[28]ChrisRuemmlerandJohnWilkes.Anintroductiontodiskdrivemodeling.IEEEComputer,27(3):17–

28,1994.[29]G.Seber.Multivariateobservations.Wiley,1984.

[30]ElizabethShriver,ArifMerchant,andJohnWilkes.Ananalyticalbehaviormodelfordiskdriveswith

readaheadcachesandrequestreordering.InProceedingsofInternationalConferenceonMeasurementandModelingofComputerSystems,pages182–191,1998.[31]MustafaUysal,GuillermoA.Alvarez,andArifMerchant.Amodular,analyticalthroughputmodel

formoderndiskarrays.InProceedingsof9thInternationalSymposiumonModeling,AnalysisandSimulationofComputerandTelecommunicationSystems,pages183–192,2001.[32]MengzhiWang,AnastassiaAilamaki,andChristosFaloutsos.Capturingthespatio-temporalbehavior

ofrealtraf cdata.PerformanceEvaluation,49(1/4):147–163,2002.[33]J.Wilkes.ThePantheonstorage-systemsimulator.TechnicalReportHPL–SSP–95–14,Hewlett-PackardLaboratories,1995.[34]JohnWilkes.TravelingtoRome,QoSspeci cationsforautomatedstoragesystemmanagement.In

Proc.Intl.WorkshoponQualityofService(IWQoS’2001),pages75–91.Springer-VerlagLectureNotesinComputerScience2092,2001.[35]JohnWilkes.Dataservices–fromdatatocontainers.KeynoteaddressatFileandStorageTechnologies

Conference(FAST’03),March–April2003.

AppendixA:ConstructingCARTModels

ACARTmodelisapiecewise-constantfunctiononamulti-dimensionalspace.Thisappendixgivesabriefdescriptionofthemodelconstructionalgorithm.Pleasereferto[4]foracompletediscussionofCARTmodels.

TheCARTmodelhasabinarytreestructurebuiltbyrecursivebinarysplits.SupposewehaveNobser-vations,Xii12N,withcorrespondingoutputsYii12N.Eachobservationconsistsofpinputfeatures,Xixi1xip).Theconstructionalgorithmstartswithatreewithonlyarootnodeandgrowsthetreedownwardbysplittingonenodeatime.Thechosensplitoffersthemostbene tinreducingthemeansquarederror.TheaverageYiforalltheXisinaleafnodeisusedasthepredictivevaluefortheleafnode.Thealgorithmcontinuesuntilcertaincriteriaaremet.

Wedescribehowthesplitischosenindetailnext.Thealgorithmevaluatesallthepossibledistinctsplitsonalltheleafnodesofthetree(ortherootnodeinthe rststep).Anodecorrespondstoahyer-rectangleregionoftheinputvectorspace,andasplitdecidesalongwhichfeatureandatwhatvaluetheregionshouldbedividedintotwo.Forexample,atnodet,asplitonfeaturejatvaluevde nestwonodes,nodet1andnodet2.

nodet1

Xixij

nodet

nodet2

Xixij

Xinodet

IfwedenotethenumberofobservationsinnodetasNtandthepredictivevalueasYt,themeansquarederroratnodetbeforethesplitis

MSEt

i:Xinodet

∑

Nt1

¯t1Y

i:Xinodet2

∑

12000

Number of Requests

9000

6000

3000

Time (aggregated in 10 seconds)

Disk Blocks (aggregated in 1000 blocks)

Disk block number

Entropy value

15 10 5 0

Entropy on timeEntropy on LBNJoint entropyCorrelation

60000

40000

20000

Arrival time

0 5

10Scale

15 20

Entropyplotonone-dimensionaldatasets.Theone-dimensionalentropyplotcapturescharacteristicsofindividualattributes,suchasthetemporalandspatialburstiness.Thesetwotypesofburstinessmeasurestheburstinessinthearrivalprocessandtheskewinaccessfrequenciesofdiskblocks.Weusetheentropyplotforarrivaltimeasanexampletoshowhowtheentropyplotworks.

Givenaworkload,wecanderiveits“margin”onthearrivaltimebycountingthenumberofrequeststhatarriveintothesystemateachtimetick.ThetopgraphofFigure9(a)showsthesampletrace’smarginonarrivaltime.

2n.WecalculatetheAssumethatthetraceis2ntimetickslong,andthemarginisCii12

entropyvalueatscalekbyapplyingtheentropyfunctionontheaggregatedmarginatscalek.TheaggregatedmarginisCkjj122kwhere

Intuitively,theentirelengthofthemarginisdividedinto2kequi-lengthedintervalsatscalek.Thus,applyingtheentropyfunctiononCkgives

where

Pkj

Number of Requests

(a)Sampledisktrace(b)Entropyplot

Figure9:Asampledisktraceanditsentropyplot.

∑C2n

∑Pkjlog2Pkj

1 0.8Entropy value

0.8 0.6 0.4 0.2

Time and RWLocation and RW

0.8 0.6 0.4 0.2 0

Time and SizeLocation and Size

10 15Scale

Entropy value

0.6 0.4 0.2 0

0 1 2

3 4Scale

5 6 7 0 5

10 15Scale

20 25

(a)Entropyplotontime(b)Entropyplotwithoperationtype

Entropy value

(c)Entropyplotwithsize

Figure10:Entropyplotstoquantifyotherworkloadcharacteristics.

Asimilarcalculationonthetrace’s“margin”onLBNgivestheentropyplotonLBN.Figure9(b)showstheentropyplotonbotharrivaltimeandLBNofthesampletrace.

Wemaketwoobservations.First,theentropyplotshowsstronglinearity,suggestingtheskewinarrivaltimeandLBNstaysconstantatallgranularities.Theconstantincrementoftheentropyvaluefromonescaletothenextsuggeststhatthedegreeofskewstaysthesameatallthescales.Thatis,thesampletracehasthesameburstinessatallscales,whichcon rmstheself-similarityofI/Oworkloadsobservedinpreviousstudies[13].Second,thelinearentropyplotallowsustousetheentropyplotslopestocharacterizetheburstiness.Smoothtraf chasanentropyplotofslopecloseto1.Real-worldtraces,however,havestrongburstiness.Insummary,theentropyplotde nedonthetracemarginsallowsustousetwoscalarstocharacterizeboththetemporalandspatialburstinessofI/Oworkloads.

Entropyplotontwo-dimensionaldatasets.Weextendtheentropyplottohandletwo-dimensionaldatasetstomeasurethecorrelationsbetweentwoattributes.Asbefore,theentropyplotcalculatestheentropyvalueatdifferentscales,onlythistimeontwo-dimensionaldatasets.Givenatwo-dimensionalprojectionofa

2n,wedividetheprojectioninto2k2kgrids,whichaggregatesbothdimensionstrace,Cijij12

withscalek.ThisgivesaseriesCkof2k2kelements.ApplyingtheentropyfunctiontoCkgivesthejointentropyvalueatscalekonthetwodimensions.

Thejointentropyallowsustocalculatethecorrelationbetweenthetwoattributes.Thecorrelationisthedifferencebetweenthesumoftheentropyvalueonthetwoattributesandthejointentropyplot.Figure9(b)showsboththejointentropyandthecorrelationonarrivaltimeandLBNforthesampledisktrace.WeobservethatastrongcorrelationexistsbetweenarrivaltimeandLBN,andalsothatthecorrelationstaysconstantatallscales.Thus,weareabletouseascalarvalue,thecorrelationslope,toquantifythecorrelationbetweenarrivaltimeandLBN.

Entropyplotinvolvingrequestsizeandoperationtype.Itispossibletoextendtheentropyplottohandleoperationtypeandrequestsize.Theonlydifferenceisthelimitedvaluerangesofthetwoattributes,whichlimitthenumberofdatapointsintheentropyplot.Asaresult,theworkloaddescriptiondoesnotincludeentropyplotslopesonthesetwoattributes.

Quantifyingthecorrelationsinvolvingeitherofthetwoattributesfacesthesameproblem.Oursolutionistoalwaysusethe nestgranularityontherequestsizeoroperationtype,buttochangethescaleontheotherattribute.Forexample,tocalculatethejointentropyplotonarrivaltimeandrequesttype,theaggregationhappensonlyonthearrivaltime.Figure10showstheentropyplotsthatinvolveoperationtypeandrequestsize.Theseentropyplotsarenotaslinearaspreviousones.Therefore,itisnotstraightforwardtocompresseachlineintoascalar.Currently,ourworkloaddescriptionusestheaverageincrementbetweentwoadjacentscales.

本文来源：https://www.bwwdw.com/article/fqg1.html

相关文章：

XXXX年中国零售业发展20大预测04-27

2010银行从业资格考试《风险管理》冲刺考点06-01

沟通的力量作文1000字07-15

上一篇：ABSTRACT Context-Sensitive Information Retrieval Using Implicit Feedback 下一篇：黑龙江地域文化平时作业四参考答案