L1 Cache and TLB Enhancements to the RAMpage Memory Hierarchy





Philip Machanick¹ and Zunaid Patel²

¹ School of ITEE, University of Queensland, Brisbane, Qld 4072, Australia
philip@itee.uq.edu.au

² School of Computer Science, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa
zunaid@cs.wits.ac.za

Abstract. The RAMpage hierarchy moves main memory up a level to replace the lowest-level cache by an equivalent-sized SRAM main memory, with a TLB caching page translations for that main memory. This paper illustrates how more aggressive components higher in the hierarchy increase the fraction of total execution time spent waiting for DRAM. For an instruction issue rate of 1 GHz, the simulated standard hierarchy waited for DRAM 10% of the time, increasing to 40% at an instruction issue rate of 8 GHz. For a larger L1 cache, the fraction of time waiting for DRAM was even higher. RAMpage with context switches on misses was able to hide almost all DRAM latency. A larger TLB was shown to increase the viable range of RAMpage SRAM page sizes.

1 Introduction

The RAMpage memory hierarchy moves main memory up a level to replace the lowest-level cache with an SRAM main memory, while DRAM becomes a first-level paging device. Previous work has shown that RAMpage represents an alternative, viable design in terms of hardware-software trade-offs [22] and that it scales better as the CPU-DRAM speed gap grows, particularly by virtue of being able to take context switches on misses [21].

In previous work, it was hypothesized that RAMpage would be more competitive across a wider range of SRAM page sizes (equivalent to line size of the lowest-level cache) with a more aggressive TLB. Secondly, it was hypothesized that a more aggressive L1 cache would emphasize differences in lower levels of the hierarchy. In this paper, we report on investigation of both hypotheses as separate issues. Improving the TLB and L1 has different effects on performance. The intent in presenting both in the same paper is to add several data points to our case for RAMpage.

In some studies, TLB misses have accounted for as much as 40% of run time [13], with figures in the region of 20–30% common [6,23]. RAMpage has the potential to reduce the significance of the TLB on performance for two reasons. Firstly, unless the reference which causes a TLB miss would also miss in the SRAM main memory, no reference to update the TLB needs go to DRAM, with the page table organization used for RAMpage. Secondly, there is no mismatch between the size of page mapped by the TLB and the "line size" of the "lowest-level cache", as would be the case with a conventional hierarchy. Consequently, the TLB can more easily be designed to map a specific fraction of the SRAM main memory, than is the case for a conventional cache.

The role of increasingly aggressive on-chip caches also needs to be evaluated, against the view that such caches address the memory wall problem. Quadrupling the size of a cache may halve the number of misses [28], but such expansion may not always be practical. Increasing the size of caches in any case makes it harder to scale up their speed [11].
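As a rough illustration of that rule of thumb (quadrupling capacity halving misses implies a miss rate roughly proportional to the inverse square root of cache size), the following sketch uses an assumed, purely illustrative baseline miss rate, not a figure measured in this paper:

```python
def estimated_miss_rate(base_rate, base_size_kb, new_size_kb):
    """Rule of thumb from the text: quadrupling cache size halves
    misses, i.e. miss rate scales as capacity**-0.5."""
    return base_rate * (base_size_kb / new_size_kb) ** 0.5

# Assumed baseline (illustrative only): 5% miss rate at 16 KB.
print(estimated_miss_rate(0.05, 16, 64))   # 4x the size -> half the misses
print(estimated_miss_rate(0.05, 16, 256))  # 16x the size -> a quarter
```

The diminishing returns are immediate: each further quadrupling buys only another halving, which is part of why this paper looks to hiding DRAM latency rather than ever-larger caches.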

The approach in this paper is to compare RAMpage with a conventional 2-level cache hierarchy as the size of the TLB scales up, across different SRAM main memory page sizes, as well as a variety of L1 cache sizes, in separate experiments. The simulated L2 cache of 4 Mbytes runs at a third of the issue rate excluding misses. The intent is to emphasize that even a very fast, large on-chip cache results in a large fraction of run time being spent waiting for DRAM. Even so, given that DRAM references are the dominant effect being measured, a fast cache should not invalidate the general trends being studied.

TLB measurements show that both models see a reduction in TLB miss rates as the TLB size increases, but RAMpage becomes more viable with smaller SRAM main memory page sizes. Cache measurements show that as L1 size increases, the fraction of time spent waiting for DRAM increases (even if overall run time decreases), which makes the option in the RAMpage hierarchy of taking a context switch on a miss more attractive.

The remainder of this paper is structured as follows. Section 2 presents more detail of the RAMpage hierarchy and related research. Section 3 explains the experimental approach, while Section 4 presents experimental results. In conclusion, Section 5 summarizes the findings and outlines future work.

2 Background

The RAMpage model was proposed [20] in response to the memory wall [30,16]. The key idea of the RAMpage model is to minimize hardware complexity, while moving more of the memory management intelligence into software. A RAMpage machine therefore looks very like a conventional model, except the lowest-level cache is replaced by a conventionally-addressed physical memory, though implemented in SRAM rather than DRAM.

A number of other approaches to addressing the memory wall have been proposed. This section summarizes the memory wall issue, followed by more detail of RAMpage. After presenting other alternatives, the options are discussed.

2.1 Memory Wall

The memory wall is the situation where the effect of CPU improvements becomes insignificant as the speed improvement of DRAM becomes a limiting factor. Since the mid-1980s, CPU speeds have improved at a rate of 50–100% per year, while DRAM latency has only improved at around 7% per year [12]. If predictions of the memory wall [30] are correct, DRAM latency will become a serious limiting factor in performance improvement. Attempts at working around the memory wall are becoming increasingly common [9], but the fundamental underlying DRAM and CPU latency trends continue [27].

2.2 The RAMpage Approach

RAMpage is based on the notion that DRAM, while still orders of magnitude faster than disk, is increasingly starting to display one attribute of a peripheral: there is time to do other work while waiting for it [24], particularly if relatively large units are moved between DRAM and SRAM levels. In RAMpage, the lowest-level cache is managed as the main memory (i.e., as a paged virtually-addressed memory), with disk a secondary paging device. The RAMpage main memory page table is inverted, to minimize its size. An inverted page table has another benefit: no TLB miss can result in a DRAM reference, unless the reference causing the TLB lookup is not in any of the SRAM layers [22].
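A minimal sketch of why the inverted organization keeps the table small: there is one entry per physical SRAM frame, found via a hash chain, so table size is fixed by SRAM capacity rather than by the virtual address space. The class and method names here are our illustrative assumptions, not the structure of the RAMpage implementation:

```python
class InvertedPageTable:
    """One entry per physical SRAM frame, so the table's size tracks
    the SRAM main memory, not the virtual address space."""

    def __init__(self, num_frames, num_buckets=64):
        self.entries = [None] * num_frames   # frame -> (pid, vpn)
        self.buckets = [[] for _ in range(num_buckets)]
        self.num_buckets = num_buckets

    def _bucket(self, pid, vpn):
        return hash((pid, vpn)) % self.num_buckets

    def insert(self, pid, vpn, frame):
        self.entries[frame] = (pid, vpn)
        self.buckets[self._bucket(pid, vpn)].append(frame)

    def lookup(self, pid, vpn):
        """Probe the hash chain; None means the page is not in SRAM
        at all, which is the only case needing a DRAM reference."""
        for frame in self.buckets[self._bucket(pid, vpn)]:
            if self.entries[frame] == (pid, vpn):
                return frame
        return None

ipt = InvertedPageTable(num_frames=1024)
ipt.insert(pid=1, vpn=0x42, frame=7)
print(ipt.lookup(1, 0x42))  # 7
```

The same structure explains the TLB property quoted above: a miss in this table is resolved entirely within SRAM unless the page itself is absent.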

RAMpage is intended to have the following advantages:

– fast hits – a hit physically addresses an SRAM memory
– full associativity – full associativity through paging avoids the slower hits of hardware full associativity
– software-managed paging – replacement can be as sophisticated as needed
– TLB misses to DRAM minimized – as explained above
– pinning in SRAM – critical OS data and code can be pinned in SRAM
– hardware simplicity – the complexity of a cache controller is removed from the lowest level of SRAM
– context switches on misses to DRAM – the CPU can be kept busy

These advantages come at the cost of slower misses because of software miss handling, and the need to make operating system changes. However, the latter problem could be avoided by adding hardware support for the model.

The RAMpage approach has in the past been shown to scale well in the face of the growing CPU-DRAM speed gap, particularly with context switches on misses. The effect of context switches on misses is that, provided there is work available for the CPU, waiting for DRAM can effectively be eliminated [21]. Context switches on misses have the most significant effect.
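The effect can be sketched with a toy trace-driven loop: charge one cycle per reference, and on a miss to DRAM switch to another ready process instead of stalling. This is a minimal sketch under our own simplifying assumptions (no switch cost, per-reference hit flags), not the authors' simulator:

```python
from collections import deque

def run(processes, dram_latency):
    """processes: iterators yielding per-reference flags
    (True = hit in SRAM levels, False = miss to DRAM).
    Returns total cycles, hiding DRAM latency whenever another
    process is ready to run."""
    ready = deque(enumerate(processes))
    blocked = []                        # (wakeup_cycle, pid, proc)
    cycles = 0
    while ready or blocked:
        if not ready:                   # nothing runnable: wait for a DRAM fill
            cycles = max(cycles, min(w for w, _, _ in blocked))
        else:
            pid, proc = ready.popleft()
            for hit in proc:
                cycles += 1             # one cycle per reference issued
                if not hit:             # miss to DRAM: block and switch
                    blocked.append((cycles + dram_latency, pid, proc))
                    break
        ready.extend((pid, p) for w, pid, p in blocked if w <= cycles)
        blocked = [b for b in blocked if b[0] > cycles]
    return cycles

workload = [iter([True, False, True]) for _ in range(2)]
print(run(workload, dram_latency=10))   # 15 cycles vs 26 if run serially
```

With two interleaved processes and a 10-cycle DRAM penalty, each process's miss overlaps with the other's useful work, so total cycles approach the pure reference count rather than references plus stalls.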

2.3 Alternatives

Approaches to addressing the memory wall can loosely (with some overlaps) be grouped into latency tolerance and miss reduction. Some approaches to latency tolerance include prefetch, critical word first, memory compression, write buffering, non-blocking caches, and simultaneous multithreading (SMT).


Prefetch requires loading a cache block before it is requested, either by hardware [5] or with compiler support [25]; predictive prefetch attempts to improve accuracy of prefetch for relatively varied memory access patterns [1]. In critical word first, the word containing the reference which caused the miss is fetched first, followed by the rest of the block [11]. Memory compression in effect reduces latency because a smaller amount of information must be moved on a miss. The overhead must be less than the time saved [18]. There are many variations on write miss strategy, but the most effective generally include write buffering [17]. A non-blocking (lockup-free) cache can allow an aggressive pipeline to continue with other instructions while waiting for a miss [4].

SMT is aimed at masking DRAM latency as well as other causes of pipeline stalls, by hardware support for more than one active thread [19]. SMT aims to solve a wider range of CPU performance problems than RAMpage.

These ideas have costs (e.g., prefetching can displace needed content, causing unnecessary misses). The biggest problem is that most of these approaches do not scale with the growing CPU-DRAM speed gap. Critical word first is less helpful as latency for one reference grows in relation to total time for a big DRAM transaction. Prefetch, memory compression and non-blocking caches have limits as to how much they can reduce effective latency. Write buffering can scale provided buffer size can be scaled, and references to buffered writes can be handled before they are written back. SMT could mask much of the time spent waiting for DRAM, but at the cost of a more complex CPU.

Reducing misses has been addressed by increasing cache size, associativity, or both. There are limits on how large a cache can be at a given speed, so the number of levels has increased. Full associativity can be achieved in hardware with less overhead for hits than a conventional fully-associative cache, in an indirect index cache (IIC), by what amounts to a hardware implementation of RAMpage's page table lookup [10]. A drawback of IIC is that all references incur the overhead of an extra level of indirection. Earlier work on software-based cache management has not focused on replacement policy [7,14].

The advantages of RAMpage over SMT and other hardware-based multithreading approaches are that the CPU can be kept simple, and software implementation of support for multiple processes is more flexible (the balance between multitasking and multithreading can be dynamically adjusted, according to workload). An advantage of IIC is that the OS need not be invoked to handle the equivalent of a TLB miss in RAMpage. As compared with RAMpage, an IIC has more overhead for a hit, and less for a miss.

2.4 Summary

RAMpage masks time which would otherwise be spent waiting for DRAM by taking context switches on misses. Other approaches either do not aim to mask time spent waiting for DRAM, but to reduce it, or require more complex hardware. RAMpage can potentially be combined with some of the other approaches (such as SMT), so it is not necessarily in conflict with other ideas.


3 Experimental Approach

This section outlines the approach to the reported experiments. The simulation strategy is explained, followed by some detail of simulation parameters; in conclusion, expected findings are discussed.

3.1 Simulation Strategy

A range of variations on a standard 2-level hierarchy is compared to similar variations on RAMpage, with and without context switches on misses. RAMpage without context switches on misses conveys the effects of adding associativity (with an operating system-style replacement strategy). Adding context switches on misses shows the value of alternative work on a miss to DRAM. Simulations are trace-driven, and do not model the pipeline. Processor speed is in GHz, representing instruction issue rate without misses, not clock speed.

Ignoring the pipeline level neglects effects like branches and the potential for other improvements like non-blocking caches. However, the results being looked for here are relatively large improvements, so inaccuracies of this kind are unlikely to be significant. What is important is the effect as the CPU-DRAM speed gap increases, and the simulation is of sufficient accuracy to capture such effects, as has been demonstrated in previous work.

3.2 Simulation Parameters

Parameters are similar to previously published work to make results comparable. The following parameters are common across RAMpage and the conventional hierarchy. This represents the baseline before new L1 and TLB variations:

– L1 cache – 16 Kbytes each of data and instruction cache, physically tagged and indexed, direct-mapped, 32-byte block size, 1-cycle read hit, 12-cycle penalty for misses to L2 (or RAMpage SRAM main memory); for data cache: perfect write buffering (zero effective hit time), writeback (12-cycle penalty; 9 cycles for RAMpage: no L2 tag to update), write allocate on miss
– TLB – 64 entries, fully associative, random replacement
– DRAM level – Direct Rambus [8] without pipelining: 50 ns before first reference started, thereafter 2 bytes every 1.25 ns
– paging of DRAM – inverted page table: same organization as RAMpage main memory for simplicity, infinite DRAM with no misses to disk
– TLB and L1 data hits fully pipelined – only time for L1d or TLB replacements or maintaining inclusion costed as "hits"

The same memory timing is used as in earlier simulations. Although faster DRAM has since become available, the timing can be seen as relative to a particular CPU-DRAM speed gap, and the figures can accordingly be rescaled. Context switches are modelled by interleaving a trace of text-book code. A context switch is taken every 500,000 references, though RAMpage with context switches on misses also takes a context switch on a miss to DRAM. TLB misses are handled by a trace of page table lookup code, with variations on time for a lookup based on probable variations in probes into an inverted page table [22].
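The Direct Rambus parameters above imply a simple per-block transfer cost: a fixed 50 ns before the first reference, then 2 bytes every 1.25 ns. A quick sketch (our own helper, not the simulator's code):

```python
def dram_transfer_ns(block_bytes, latency_ns=50.0, ns_per_2_bytes=1.25):
    """Time to move one block from the simulated Direct Rambus DRAM:
    fixed startup latency, then 2 bytes per 1.25 ns."""
    return latency_ns + (block_bytes / 2) * ns_per_2_bytes

print(dram_transfer_ns(128))   # 130.0 ns for a 128 B block
print(dram_transfer_ns(4096))  # 2610.0 ns for a 4 KB RAMpage page
```

Note how the fixed latency dominates a small transfer but shrinks to a small fraction of a large page move, which is the scaling argument behind moving large units and finding other work in the meantime.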


Specific to conventional hierarchy

The "conventional" system has a 2-way associative 4 Mbyte L2. The L2 cache and its bus to the CPU are clocked at one third of the CPU issue rate (the cycle time is intended to represent a superscalar issue rate). The L2 cache-CPU bus is 128 bits wide. Hits on the L2 cache take 4 cycles including the tag check and transfer to L1. Inclusion between L1 and L2 is maintained [12]. The TLB caches virtual page translations to DRAM physical frames.

Specific to RAMpage hierarchy

In RAMpage simulations, most parameters remain the same, except that the TLB maps the SRAM main memory, and full associativity is implemented in software, through a software miss handler. The OS keeps 6 pages pinned in the SRAM main memory when simulating a 4 Kbyte SRAM page, i.e., 24 Kbytes, which increases to 5336 pages for a 128-byte block size, a total of 667 Kbytes.

Inputs and variations

Traces are from the Tracebase trace archive at New Mexico State University. 1.1 billion references are used, with traces interleaved to create the effect of a multiprogramming workload.

To measure variations on L1 caches, the size of each of the instruction and data caches was varied from the original size of 16 KB to 32 KB, 64 KB, 128 KB and 256 KB. To explore more of the design space, L1 block size was measured at sizes of 32, 64 and 128 bytes. We did not vary L2 block sizes when varying L1: an optimum size was determined in previous work [21,22]. However, while varying the TLB, we did vary L2 block size in the conventional hierarchy, for comparison with varying the RAMpage SRAM main memory page size. To measure the effect of increasing the TLB size, we varied it from the original 64 entries to 128, 256 and 512. Even larger TLBs exist (e.g., Power4 has a 1024-entry TLB [29]), but this range is sufficient to capture variations of interest.

3.3 Expected Findings

As L1 becomes larger, RAMpage without context switches on misses should see less of a gain. While improving L1 should not affect time spent in DRAM, RAMpage's extra overheads in managing DRAM may have a more significant effect on overall run time. However, as the fraction of references in upper levels increases without a decrease in references to DRAM, context switches on misses should become more favourable.

As the TLB size increases, we expect to see smaller SRAM page sizes become viable. If the TLB has 64 entries and the page size is 4 KB with a 4 MB SRAM main memory, 6.25% of the memory is mapped by the TLB. If the TLB has 512 entries, the TLB maps 50% of the memory. By comparison, with a 128 B page, a 64-entry TLB only maps about 0.2% of the memory, and a big increase in the size of the TLB is likely to have a significant effect.
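The coverage figures in the preceding paragraph follow directly from entries × page size as a fraction of the 4 MB SRAM main memory; a quick check:

```python
def tlb_coverage(entries, page_bytes, memory_bytes=4 * 2**20):
    """Fraction of the SRAM main memory mapped by the TLB."""
    return entries * page_bytes / memory_bytes

print(tlb_coverage(64, 4096))   # 0.0625   -> 6.25%
print(tlb_coverage(512, 4096))  # 0.5      -> 50%
print(tlb_coverage(64, 128))    # ~0.00195 -> about 0.2%
```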

The effect on a conventional architecture of increasing TLB size is not as significant because it maps DRAM pages (fixed at 4 KB), not SRAM pages. Further, variation across L2 block sizes should not be related to TLB size.

4 Results

This section presents results of simulations, with some discussion. The main focus is on differences introduced by changes over previous simulations, but some advantages of RAMpage, as previously described, should be evident again from these new results. Presentation of results is broken down into effects of increasing L1 cache size, and effects of increasing TLB size, since these improvements have very different effects on the hierarchies modelled. Results are presented for 3 cases: the conventional 2-level cache with a DRAM main memory, and RAMpage with and without context switches on misses.

The remainder of this section presents the effects of L1 changes, then the effects of TLB changes, followed by a summary.

4.1 Increasing L1 Size

Fig. 1 shows how miss rates of the L1 instruction and data caches vary as their size increases for both RAMpage with context switches on misses and the standard hierarchy. (RAMpage without switches on misses follows the same trend as the standard hierarchy.) As cache sizes increase, the miss rate decreases, initially fairly rapidly. The trend is similar for all models.

Execution times are plotted in fig. 2, normalised to the best execution time at each CPU speed. As expected, larger caches decrease execution times by reducing capacity misses, as evident from the reduced miss rates – with limits to the benefits as L1 scales up. The best overall effect is from the combination of RAMpage with context switches on misses and increasing the size of L1. The execution time of the fastest variation speeds up 10.7 over the slowest configuration. Comparing a given hierarchy's slowest (1 GHz, 32 KB L1) and fastest case (8 GHz, 256 KB total L1) results in a speedup of 6.12 for the conventional hierarchy, 6.5 for RAMpage without switches on misses and 9.9 for switches on misses. For the slowest CPU and smallest L1, RAMpage with switches on misses has a speedup of 1.08 over the conventional hierarchy, rising to 1.74 with the fastest CPU and biggest L1. For RAMpage without switches on misses, the scaling up of improvement over the conventional hierarchy is not as strong: for the slowest CPU with least aggressive L1, RAMpage has a speedup of 1.03, as opposed to 1.11 for the fastest CPU with largest L1. So, whether by comparison with a conventional architecture or by scaling up the CPU-DRAM speed gap, RAMpage with context switches on misses gains most from larger L1 sizes.


[Fig. 3: fraction of time at each level vs. total L1 size (32K–512K); panel (a): issue rate 1 GHz (Std)]

Fig. 3 shows relative times each variation of the slowest and fastest CPU spend waiting for each level of the various hierarchies, as L1 size increases. The 8 GHz issue rate for the conventional hierarchy spends over 40% of total execution time waiting for DRAM for the largest L1 cache – in line with measurements of the Pentium 4, which spends 35% of its time waiting for DRAM running SPECint2k on average at 2 GHz [28]. This Pentium 4 configuration corresponds roughly to a 6 GHz issue rate in this paper. The similarity of the time waiting for DRAM lends some credibility to our view that our results are reasonably in line with real systems.


While cache size increases boost performance significantly, as CPU speed increases, a large L1 cannot save a conventional hierarchy from the high penalty of waiting for DRAM. In fig. 3(d), it can be seen that RAMpage only improves the situation marginally without context switches on misses.

With RAMpage with context switches on misses, time waiting for DRAM remains negligible as the CPU-DRAM speed gap increases by a factor of 8 (fig. 3(f)). The largest L1 (combined L1i and L1d size 512 KB) results in only about 10% of execution time being spent waiting for SRAM main memory, while DRAM wait time remains negligible. By contrast, the other hierarchies, while seeing a significant reduction in time waiting for L2 (or SRAM main memory), do not see a similar reduction in time waiting for DRAM as L1 size increases.

4.2 TLB Variations

All TLB variations are measured with the L1 parameters fixed at the original RAMpage measurements – 16 KB each of instruction and data cache.

The TLB miss rate (fig. 4), even with increased TLB sizes, is significantly higher in all RAMpage cases than for the standard hierarchy, except for a 4 KB RAMpage page size. As SRAM main memory page size increases, TLB miss rates drop, as expected. Further, as TLB size increases, smaller pages' miss rates decrease. In the case of context switches on misses, the number of context switches is higher than that of the standard hierarchy. RAMpage TLB misses do not result in references to DRAM, unless there is a page fault, so the additional references should not result in a similarly substantial performance hit.

[Fig. 5: overhead (%) across page sizes (B)]

Fig. 6 illustrates execution times for the hierarchies at 1 and 8 GHz, the speed gap which shows off differences most clearly. There are two competing effects: as L2 block (SRAM page size) increases, miss penalty to DRAM increases. In RAMpage, reduced TLB misses compensate for the higher DRAM miss penalty, but the performance of the standard hierarchy becomes worse as block size increases. TLB size variation makes little difference to performance of the standard hierarchy with the simulated workload. Performance of RAMpage with switches on misses does not vary much for pages of 512 B and greater even with TLB variations, while RAMpage without switches is best with 1024 B pages.

The performance-optimal TLB and page size combination for RAMpage without context switches on misses, with a 512-entry TLB, is a 1024 B page for all issue rates. In previous work, with a 64-entry TLB, the optimal page size at 1 GHz was 2048 B, while other issue rates performed best with 1024 B pages. Thus, a larger TLB results in a smaller page size being optimal for the 1 GHz speed. While other page sizes are still slower than the 1024 B page size, for all cases with pages of 512 B and greater RAMpage without context switches on misses is faster than the standard hierarchy.

For RAMpage with context switches on misses, the performance-optimal page size has shifted to 1024 B with a larger TLB. Previously the best page size was 4096 B for 1, 2 and 4 GHz and 2048 B for 8 GHz. A TLB of 256 or even 128 entries combined with the 1024 B page will yield optimum or almost optimum performance. With a 1024 B page and 256 entries, a total of 256 KB, or 6.25% of the RAMpage main memory is mapped by the TLB, which appears to be sufficient for this workload (a 4 KB page with a 512-entry TLB maps half the SRAM main memory, overkill for any workload with reasonable locality of reference). Nonetheless, TLB performance is highly dependent on application code, so results presented here need to be considered in that light.

Contrasting the 1 GHz and 8 GHz cases in fig. 6 makes it clear again how the differences between RAMpage and a conventional hierarchy scale as the CPU-DRAM speed gap increases. At 1 GHz, all variations are reasonably comparable across a range of parameters. At 8 GHz, RAMpage is clearly better in all variations, but even more so with context switches on misses. A larger TLB broadens the range of useful RAMpage configurations, without significantly altering the standard hierarchy's competitiveness.

4.3 Summary

In summary, the RAMpage model with context switches on misses gains most from L1 cache improvements, though the other hierarchies also reduce execution time. However, without taking context switches on misses, increasing the size of L1 has the effect of increasing the fraction of time spent waiting for DRAM, since the number of DRAM references is not reduced, nor is their latency hidden. As was shown by scaling up the CPU-DRAM speed gap, only RAMpage with context switches on misses, of the variations presented here, is able to hide the increasing effective latency of DRAM. Increasing the size of the TLB, as predicted, increased the range of SRAM main memory page sizes over which RAMpage is viable, widening the range of choices for a designer.

5 Conclusion

This paper has examined enhancements to RAMpage, which measure its potential for further improvement, as opposed to similar improvements to a conventional hierarchy. As in previous work, RAMpage has been shown to scale better as the CPU-DRAM speed gap grows. In addition, it has been shown that context switches on misses can take advantage of a more aggressive core including a bigger L1 cache, and a bigger TLB. The remainder of this section summarizes results, outlines future work and sums up overall findings.

5.1 Summary of Results

Introducing significantly larger L1 caches – even if this could be done without problems with meeting clock cycle targets – has limited benefits. Scaling the clock speed up by a factor of 8 achieves only about 77% of this speedup in the conventional hierarchy measured here. RAMpage with context switches on misses is able to make effective use of a larger L1 cache, and achieves superlinear speedup with respect to a slower clock speed and smaller L1 cache. While this effect can only be expected in RAMpage with an unrealistically large L1, this result shows that increasingly aggressive L1 caches are not as important a solution to the memory wall problem as finding alternative work on a miss to DRAM.

That results for RAMpage without context switches on misses are an improvement but not as significant as results with context switches on misses suggests that attempts at improving associativity and replacement strategy will not be sufficient to bridge the growing CPU-DRAM speed gap.

Larger TLBs, as expected, increase the range of useful RAMpage SRAM main memory page sizes, though the performance benefit on the workload measured was not significant versus larger page sizes and a more modest-sized TLB.

5.2 Future Work

It would be interesting to match RAMpage with models for supporting more than one instruction stream. SMT, while adding hardware complexity, is an established approach [19], with existing implementations [3]. Another thing to explore is alternative interconnect architectures, so multiple requests for DRAM could be overlapped [24]. HyperTransport [2] is a candidate. A more detailed simulation modelling operating system effects accurately would be useful. SimOS [26], for example, could be used. Further variations to explore include virtually-addressed L1 and hardware TLB miss handling. Finally, it would be interesting to build a RAMpage machine.


5.3 Overall Conclusion

RAMpage has been simulated in a variety of forms. In this latest study, enhancing L1 and the TLB has shown that it gains significantly more from such improvements than a conventional architecture in some cases. The most important finding generally from RAMpage work is that finding other work on a miss to DRAM is becoming increasingly viable. While RAMpage is not the only approach to finding such alternative work, it is a potential solution. As compared with hardware multithreading approaches, its main advantage is the flexibility of a software solution, though this needs to be compared to hardware solutions to establish the performance cost of extra flexibility.

References

1. Thomas Alexander and Gershon Kedem. Distributed prefetch-buffer/cache design for high-performance memory systems. In Proc. 2nd IEEE Symp. on High-Performance Computer Architecture, pages 254–263, San Jose, CA, February 1996.

2. AMD. HyperTransport technology: Simplifying system design [online]. October 2002. /docs/26635A_HT_System_Design.pdf.

3. J.M. Borkenhagen, R.J. Eickemeyer, R.N. Kalla, and S.R. Kunkel. A multithreaded PowerPC processor for commercial servers. IBM J. Research and Development, 44(6):885–898, November 2000.

4. T. Chen and J. Baer. Reducing memory latency via non-blocking and prefetching caches. In Proc. 5th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-5), pages 51–61, September 1992.

5. T-F. Chen. An effective programmable prefetch engine for on-chip caches. In Proc. 28th Int. Symp. on Microarchitecture (MICRO-28), pages 237–242, Ann Arbor, MI, 29 November–1 December 1995.

6. D.R. Cheriton, H.A. Goosen, H. Holbrook, and P. Machanick. Restructuring a parallel simulation to improve cache behavior in a shared-memory multiprocessor: The value of distributed synchronization. In Proc. 7th Workshop on Parallel and Distributed Simulation, pages 159–162, San Diego, May 1993.

7. D.R. Cheriton, G. Slavenburg, and P. Boyle. Software-controlled caches in the VMP multiprocessor. In Proc. 13th Int. Symp. on Computer Architecture (ISCA '86), pages 366–374, Tokyo, June 1986.

8. R. Crisp. Direct Rambus technology: The new main memory standard. IEEE Micro, 17(6):18–28, November/December 1997.

9. B. Davis, T. Mudge, B. Jacob, and V. Cuppu. DDR2 and low latency variants. In Solving the Memory Wall Problem Workshop, Vancouver, Canada, June 2000. In conjunction with 26th Annual Int. Symp. on Computer Architecture.

10. Erik G. Hallnor and Steven K. Reinhardt. A fully associative software-managed cache design. In Proc. 27th Annual Int. Symp. on Computer Architecture, pages 107–116, Vancouver, BC, 2000.

11. J. Handy. The Cache Memory Book. Academic Press, San Diego, CA, 2 ed., 1998.

12. J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Francisco, CA, 2 ed., 1996.

13. J. Huck and J. Hays. Architectural support for translation table management in large address space machines. In Proc. 20th Int. Symp. on Computer Architecture (ISCA '93), pages 39–50, San Diego, CA, May 1993.


14. B. Jacob and T. Mudge. Software-managed address translation. In Proc. Third Int. Symp. on High-Performance Computer Architecture, pages 156–167, San Antonio, TX, February 1997.

15. Bruce L. Jacob and Trevor N. Mudge. A look at several memory management units, TLB-refill mechanisms, and page table organizations. In Proc. 8th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), pages 295–306, San Jose, CA, 1998.

16. E.E. Johnson. Graffiti on the memory wall. Computer Architecture News, 23(4):7–8, September 1995.

17. Norman P. Jouppi. Cache write policies and performance. In Proc. 20th Annual Int. Symp. on Computer Architecture, pages 191–201, San Diego, CA, 1993.

18. Jang-Soo Lee, Won-Kee Hong, and Shin-Dug Kim. Design and evaluation of a selective compressed memory system. In Proc. IEEE Int. Conf. on Computer Design, pages 184–191, Austin, TX, 10–13 October 1999.

19. J.L. Lo, J.S. Emer, H.M. Levy, R.L. Stamm, and D.M. Tullsen. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Trans. on Computer Systems, 15(3):322–354, August 1997.

20. P. Machanick. The case for SRAM main memory. Computer Architecture News, 24(5):23–30, December 1996.

21. P. Machanick. Scalability of the RAMpage memory hierarchy. South African Computer Journal, (25):68–73, August 2000.

22. P. Machanick, P. Salverda, and L. Pompe. Hardware-software trade-offs in a Direct Rambus implementation of the RAMpage memory hierarchy. In Proc. 8th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), pages 105–114, San Jose, CA, October 1998.

23. Philip Machanick. An Object-Oriented Library for Shared-Memory Parallel Simulations. PhD Thesis, Dept. of Computer Science, University of Cape Town, 1996.

24. Philip Machanick. What if DRAM is a slow peripheral? Computer Architecture News, 30(6):16–19, December 2002.

25. T.C. Mowry, M.S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proc. 5th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 62–73, September 1992.

26. M. Rosenblum, S.A. Herrod, E. Witchel, and A. Gupta. Complete computer system simulation: The SimOS approach. IEEE Parallel and Distributed Technology, 3(4):34–43, Winter 1995.

27. Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the memory wall: the case for processor/memory integration. In Proc. 23rd Annual Int. Symp. on Computer Architecture, pages 90–101, Philadelphia, PA, 1996.

28. Eric Sprangle and Doug Carmean. Increasing processor performance by implementing deeper pipelines. In Proc. 29th Annual Int. Symp. on Computer Architecture, pages 25–34, Anchorage, Alaska, 2002.

29. J.M. Tendler, J.S. Dodson, J.S. Fields, Jr., H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM J. Research and Development, 46(1):5–25, 2002.

30. W.A. Wulf and S.A. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23(1):20–24, March 1995.

Acknowledgements

Financial support for this work has been received from the Universities of Queensland and Witwatersrand, and the South African National Research Foundation. We would like to thank the referees for helpful suggestions.

