L1 Cache and TLB Enhancements to the RAMpage Memory Hierarchy
Philip Machanick¹ and Zunaid Patel²

¹ School of ITEE, University of Queensland, Brisbane, Qld 4072, Australia
philip@itee.uq.edu.au
² School of Computer Science, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa
zunaid@cs.wits.ac.za
Abstract. The RAMpage hierarchy moves main memory up a level to replace the lowest-level cache by an equivalent-sized SRAM main memory, with a TLB caching page translations for that main memory. This paper illustrates how more aggressive components higher in the hierarchy increase the fraction of total execution time spent waiting for DRAM. For an instruction issue rate of 1GHz, the simulated standard hierarchy waited for DRAM 10% of the time, increasing to 40% at an instruction issue rate of 8GHz. For a larger L1 cache, the fraction of time waiting for DRAM was even higher. RAMpage with context switches on misses was able to hide almost all DRAM latency. A larger TLB was shown to increase the viable range of RAMpage SRAM page sizes.
1 Introduction
The RAMpage memory hierarchy moves main memory up a level to replace the lowest-level cache with an SRAM main memory, while DRAM becomes a first-level paging device. Previous work has shown that RAMpage represents an alternative, viable design in terms of hardware-software trade-offs [22] and that it scales better as the CPU-DRAM speed gap grows, particularly by virtue of being able to take context switches on misses [21].
In previous work, it was hypothesized that RAMpage would be more competitive across a wider range of SRAM page sizes (equivalent to the line size of the lowest-level cache) with a more aggressive TLB. Secondly, it was hypothesized that a more aggressive L1 cache would emphasize differences in lower levels of the hierarchy. In this paper, we report on investigation of both hypotheses as separate issues. Improving the TLB and L1 has different effects on performance. The intent in presenting both in the same paper is to add several data points to our case for RAMpage.
In some studies, TLB misses have accounted for as much as 40% of run time [13], with figures in the region of 20–30% common [6,23]. RAMpage has the potential to reduce the significance of the TLB on performance for two reasons. Firstly, unless the reference which causes a TLB miss would also miss in the
SRAM main memory, no reference to update the TLB needs to go to DRAM, with the page table organization used for RAMpage. Secondly, there is no mismatch between the size of page mapped by the TLB and the “line size” of the “lowest-level cache”, as would be the case with a conventional hierarchy. Consequently, the TLB can more easily be designed to map a specific fraction of the SRAM main memory than is the case for a conventional cache.
The role of increasingly aggressive on-chip caches also needs to be evaluated, against the view that such caches address the memory wall problem. Quadrupling the size of a cache may halve the number of misses [28], but such expansion may not always be practical. Increasing the size of caches in any case makes it harder to scale up their speed [11].
The approach in this paper is to compare RAMpage with a conventional 2-level cache hierarchy as the size of the TLB scales up, across different SRAM main memory page sizes, as well as a variety of L1 cache sizes, in separate experiments. The simulated L2 cache of 4 Mbytes runs at a third of the issue rate excluding misses. The intent is to emphasize that even a very fast, large on-chip cache results in a large fraction of run time being spent waiting for DRAM. Even so, given that DRAM references are the dominant effect being measured, a fast cache should not invalidate the general trends being studied.
TLB measurements show that both models see a reduction in TLB miss rates as the TLB size increases, but RAMpage becomes more viable with smaller SRAM main memory page sizes. Cache measurements show that as L1 size increases, the fraction of time spent waiting for DRAM increases (even if overall run time decreases), which makes the option in the RAMpage hierarchy of taking a context switch on a miss more attractive.
The remainder of this paper is structured as follows. Section 2 presents more detail of the RAMpage hierarchy and related research. Section 3 explains the experimental approach, while Section 4 presents experimental results. In conclusion, Section 5 summarizes the findings and outlines future work.
2 Background
The RAMpage model was proposed [20] in response to the memory wall [30,16]. The key idea of the RAMpage model is to minimize hardware complexity, while moving more of the memory management intelligence into software. A RAMpage machine therefore looks very like a conventional model, except the lowest-level cache is replaced by a conventionally-addressed physical memory, though implemented in SRAM rather than DRAM.
A number of other approaches to addressing the memory wall have been proposed. This section summarizes the memory wall issue, followed by more detail of RAMpage. After presenting other alternatives, the options are discussed.
2.1 Memory Wall
The memory wall is the situation where the effect of CPU improvements becomes insignificant as the speed improvement of DRAM becomes a limiting factor. Since
the mid-1980s, CPU speeds have improved at a rate of 50–100% per year, while DRAM latency has only improved at around 7% per year [12]. If predictions of the memory wall [30] are correct, DRAM latency will become a serious limiting factor in performance improvement. Attempts at working around the memory wall are becoming increasingly common [9], but the fundamental underlying DRAM and CPU latency trends continue [27].
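To make the compounding concrete, the following sketch (an illustrative calculation of ours, not taken from the paper) projects how the CPU-DRAM speed gap grows under the annual improvement rates quoted above:

```python
def speed_gap(years, cpu_growth=0.50, dram_growth=0.07):
    """Ratio of cumulative CPU speed improvement to cumulative DRAM
    latency improvement after `years` years, assuming the compound
    annual rates cited above (50%/year CPU, 7%/year DRAM)."""
    return ((1 + cpu_growth) / (1 + dram_growth)) ** years

# Even at the conservative 50%/year CPU figure, the gap roughly
# doubles every two years, and grows by about 30x over a decade.
```

At the 100%/year end of the quoted CPU range the divergence is steeper still, which is the scenario the memory wall prediction rests on.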
2.2 The RAMpage Approach
RAMpage is based on the notion that DRAM, while still orders of magnitude faster than disk, is increasingly starting to display one attribute of a peripheral: there is time to do other work while waiting for it [24], particularly if relatively large units are moved between DRAM and the SRAM level. In RAMpage, the lowest-level cache is managed as the main memory (i.e., as a paged virtually-addressed memory), with disk a secondary paging device. The RAMpage main memory page table is inverted, to minimize its size. An inverted page table has another benefit: no TLB miss can result in a DRAM reference, unless the reference causing the TLB lookup is not in any of the SRAM layers [22].
RAMpage is intended to have the following advantages:
– fast hits – a hit physically addresses an SRAM memory
– full associativity – full associativity through paging avoids the slower hits of hardware full associativity
– software-managed paging – replacement can be as sophisticated as needed
– TLB misses to DRAM minimized – as explained above
– pinning in SRAM – critical OS data and code can be pinned in SRAM
– hardware simplicity – the complexity of a cache controller is removed from the lowest level of SRAM
– context switches on misses to DRAM – the CPU can be kept busy
These advantages come at the cost of slower misses because of software miss handling, and the need to make operating system changes. However, the latter problem could be avoided by adding hardware support for the model.
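The software-managed, fully-associative SRAM main memory can be pictured with a small sketch. This is a hypothetical illustration of the mechanism (an inverted-table view of residency plus a software replacement hook), not the authors' actual miss handler, and the round-robin victim choice is a placeholder for whatever OS-style policy is used:

```python
class RAMpageSRAM:
    """Toy model of a software-managed SRAM main memory: an inverted
    page table maps virtual page numbers (vpn) to SRAM frames; a miss
    runs a software handler, which could also trigger a context switch."""

    def __init__(self, n_frames):
        self.resident = {}                     # vpn -> frame
        self.frame_to_vpn = [None] * n_frames  # inverted-table view
        self.next_victim = 0                   # placeholder: round-robin

    def translate(self, vpn):
        """Return the SRAM frame for vpn, or None on a miss to DRAM."""
        return self.resident.get(vpn)

    def handle_miss(self, vpn):
        """Software miss handler: pick a victim frame, evict its page
        (a real handler would write it back to DRAM), install vpn."""
        victim = self.next_victim
        self.next_victim = (victim + 1) % len(self.frame_to_vpn)
        old = self.frame_to_vpn[victim]
        if old is not None:
            del self.resident[old]
        self.frame_to_vpn[victim] = vpn
        self.resident[vpn] = victim
        return victim
```

Any page can occupy any frame (full associativity) without associative hardware; the cost, as noted above, is that `handle_miss` runs as software on the critical path of a miss.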
The RAMpage approach has in the past been shown to scale well in the face of the growing CPU-DRAM speed gap, particularly with context switches on misses. The effect of context switches on misses is that, provided there is work available for the CPU, waiting for DRAM can effectively be eliminated [21]. Context switches on misses have the most significant effect.
2.3 Alternatives
Approaches to addressing the memory wall can loosely (with some overlaps) be grouped into latency tolerance and miss reduction. Some approaches to latency tolerance include prefetch, critical word first, memory compression, write buffering, non-blocking caches, and simultaneous multithreading (SMT).
Prefetch requires loading a cache block before it is requested, either by hardware [5] or with compiler support [25]; predictive prefetch attempts to improve accuracy of prefetch for relatively varied memory access patterns [1]. In critical word first, the word containing the reference which caused the miss is fetched first, followed by the rest of the block [11]. Memory compression in effect reduces latency because a smaller amount of information must be moved on a miss. The overhead must be less than the time saved [18]. There are many variations on write miss strategy, but the most effective generally include write buffering [17]. A non-blocking (lockup-free) cache can allow an aggressive pipeline to continue with other instructions while waiting for a miss [4].
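Critical word first only reorders the transfer of a block, as a minimal sketch of the wrap-around fetch order makes clear (our illustration, not from the paper):

```python
def fetch_order(block_words, critical_word):
    """Words of a cache block in critical-word-first (wrap-around)
    order: the word that missed arrives first, then the remainder
    of the block in wrapping sequence."""
    return [(critical_word + i) % block_words for i in range(block_words)]

# For an 8-word block where word 5 caused the miss:
# fetch_order(8, 5) -> [5, 6, 7, 0, 1, 2, 3, 4]
```

The CPU can restart as soon as the critical word arrives, but the saving is bounded by a single block transfer time, which is why, as discussed below, the technique helps less as DRAM access latency comes to dominate.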
SMT is aimed at masking DRAM latency as well as other causes of pipeline stalls, by hardware support for more than one active thread [19]. SMT aims to solve a wider range of CPU performance problems than RAMpage.
These ideas have costs (e.g., prefetching can displace needed content, causing unnecessary misses). The biggest problem is that most of these approaches do not scale with the growing CPU-DRAM speed gap. Critical word first is less helpful as latency for one reference grows in relation to total time for a big DRAM transaction. Prefetch, memory compression and non-blocking caches have limits as to how much they can reduce effective latency. Write buffering can scale provided buffer size can be scaled, and references to buffered writes can be handled before they are written back. SMT could mask much of the time spent waiting for DRAM, but at the cost of a more complex CPU.
Reducing misses has been addressed by increasing cache size, associativity, or both. There are limits on how large a cache can be at a given speed, so the number of levels has increased. Full associativity can be achieved in hardware with less overhead for hits than a conventional fully-associative cache, in an indirect index cache (IIC), by what amounts to a hardware implementation of RAMpage's page table lookup [10]. A drawback of the IIC is that all references incur the overhead of an extra level of indirection. Earlier work on software-based cache management has not focused on replacement policy [7,14].
The advantages of RAMpage over SMT and other hardware-based multithreading approaches are that the CPU can be kept simple, and software implementation of support for multiple processes is more flexible (the balance between multitasking and multithreading can be dynamically adjusted, according to workload). An advantage of the IIC is that the OS need not be invoked to handle the equivalent of a TLB miss in RAMpage. As compared with RAMpage, an IIC has more overhead for a hit, and less for a miss.
2.4 Summary
RAMpage masks time which would otherwise be spent waiting for DRAM by taking context switches on misses. Other approaches either do not aim to mask time spent waiting for DRAM, but to reduce it, or require more complex hardware. RAMpage can potentially be combined with some of the other approaches (such as SMT), so it is not necessarily in conflict with other ideas.
3 Experimental Approach
This section outlines the approach to the reported experiments. The simulation strategy is explained, followed by some detail of simulation parameters; in conclusion, expected findings are discussed.
3.1 Simulation Strategy
A range of variations on a standard 2-level hierarchy is compared to similar variations on RAMpage, with and without context switches on misses. RAMpage without context switches on misses conveys the effects of adding associativity (with an operating system-style replacement strategy). Adding context switches on misses shows the value of alternative work on a miss to DRAM. Simulations are trace-driven, and do not model the pipeline. Processor speed is in GHz, representing instruction issue rate without misses, not clock speed.
Ignoring the pipeline level neglects effects like branches and the potential for other improvements like non-blocking caches. However, the results being looked for here are relatively large improvements, so inaccuracies of this kind are unlikely to be significant. What is important is the effect as the CPU-DRAM speed gap increases, and the simulation is of sufficient accuracy to capture such effects, as has been demonstrated in previous work.
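The essence of trace-driven simulation is to replay recorded addresses against a memory-hierarchy model and count hits and misses, with no pipeline state. A minimal direct-mapped sketch (our illustration; the simulator used in the paper is of course far more detailed):

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model for trace-driven simulation:
    one tag per index, fetch-on-miss, no write policy modelled."""

    def __init__(self, size_bytes, block_bytes):
        self.block_bytes = block_bytes
        self.n_blocks = size_bytes // block_bytes
        self.tags = [None] * self.n_blocks
        self.hits = self.misses = 0

    def access(self, addr):
        block = addr // self.block_bytes      # block-aligned address
        index = block % self.n_blocks         # direct-mapped index
        if self.tags[index] == block:
            self.hits += 1
        else:
            self.tags[index] = block          # fetch block on miss
            self.misses += 1

def run_trace(cache, trace):
    """Replay a trace (an iterable of addresses); return the miss rate."""
    for addr in trace:
        cache.access(addr)
    return cache.misses / (cache.hits + cache.misses)
```

Because no pipeline is modelled, each access costs a fixed hit or miss penalty, which matches the paper's use of issue rate rather than clock speed.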
3.2 Simulation Parameters
Parameters are similar to previously published work to make results comparable. The following parameters are common across RAMpage and the conventional hierarchy. This represents the baseline before new L1 and TLB variations:
– L1 cache – 16 Kbytes each of data and instruction cache, physically tagged and indexed, direct-mapped, 32-byte block size, 1-cycle read hit, 12-cycle penalty for misses to L2 (or RAMpage SRAM main memory); for data cache: perfect write buffering (zero effective hit time), writeback (12-cycle penalty; 9 cycles for RAMpage: no L2 tag to update), write allocate on miss
– TLB – 64 entries, fully associative, random replacement
– DRAM level – Direct Rambus [8] without pipelining: 50ns before first reference started, thereafter 2 bytes every 1.25ns
– paging of DRAM – inverted page table: same organization as RAMpage main memory for simplicity, infinite DRAM with no misses to disk
– TLB and L1 data hits fully pipelined – only time for L1d or TLB replacements or maintaining inclusion costed as “hits”
The same memory timing is used as in earlier simulations. Although faster DRAM has since become available, the timing can be seen as relative to a particular CPU-DRAM speed gap, and the figures can accordingly be rescaled. Context switches are modelled by interleaving a trace of textbook code. A context switch is taken every 500,000 references, though RAMpage with context switches on misses also takes a context switch on a miss to DRAM. TLB misses are handled by a trace of page table lookup code, with variations on time for a lookup based on probable variations in probes into an inverted page table [22].
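Under the DRAM parameters above, a transfer costs a fixed 50ns access latency plus 1.25ns per 2 bytes, so moving larger RAMpage pages amortizes the latency over more data. A quick calculation (our sketch of the stated timing model):

```python
def dram_transfer_ns(bytes_moved, start_ns=50.0, ns_per_2_bytes=1.25):
    """Direct Rambus timing as modelled above: fixed access latency,
    then 2 bytes every 1.25ns, with no pipelining of requests."""
    return start_ns + (bytes_moved / 2) * ns_per_2_bytes

# A 128-byte block costs 50 + 64*1.25 = 130ns (latency dominates);
# a 4KB RAMpage page costs 50 + 2048*1.25 = 2610ns (bandwidth dominates).
```

This asymmetry is one reason RAMpage favours large SRAM page sizes, and why doing other work during the long transfer (a context switch on a miss) is attractive.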
Specific to the conventional hierarchy. The “conventional” system has a 2-way associative 4 Mbyte L2. The L2 cache and its bus to the CPU are clocked at one third of the CPU issue rate (the cycle time is intended to represent a superscalar issue rate). The L2 cache-CPU bus is 128 bits wide. Hits on the L2 cache take 4 cycles including the tag check and transfer to L1. Inclusion between L1 and L2 is maintained [12]. The TLB caches virtual page translations to DRAM physical frames.

Specific to the RAMpage hierarchy. In RAMpage simulations, most parameters remain the same, except that the TLB maps the SRAM main memory, and full associativity is implemented in software, through a software miss handler. The OS keeps 6 pages pinned in the SRAM main memory when simulating a 4 Kbyte SRAM page, i.e., 24 Kbytes, which increases to 5336 pages for a 128-byte block size, a total of 667 Kbytes.

Inputs and variations.
Traces are from the Tracebase trace archive at New Mexico State University. 1.1 billion references are used, with traces interleaved to create the effect of a multiprogramming workload.
To measure variations on L1 caches, the size of each of the instruction and data caches was varied from the original size of 16KB to 32KB, 64KB, 128KB and 256KB. To explore more of the design space, L1 block size was measured at sizes of 32, 64 and 128 bytes. We did not vary L2 block sizes when varying L1: an optimum size was determined in previous work [21,22]. However, while varying the TLB, we did vary L2 block size in the conventional hierarchy, for comparison with varying the RAMpage SRAM main memory page size. To measure the effect of increasing the TLB size, we varied it from the original 64 entries to 128, 256 and 512. Even larger TLBs exist (e.g., Power4 has a 1024-entry TLB [29]), but this range is sufficient to capture variations of interest.
3.3 Expected Findings
As L1 becomes larger, RAMpage without context switches on misses should see less of a gain. While improving L1 should not affect time spent in DRAM, RAMpage's extra overheads in managing DRAM may have a more significant effect on overall run time. However, as the fraction of references in upper levels increases without a decrease in references to DRAM, context switches on misses should become more favourable.
As the TLB size increases, we expect to see smaller SRAM page sizes become viable. If the TLB has 64 entries and the page size is 4KB with a 4MB SRAM
main memory, 6.25% of the memory is mapped by the TLB. If the TLB has 512 entries, the TLB maps 50% of the memory. By comparison, with a 128B page, a 64-entry TLB only maps about 0.2% of the memory, and a big increase in the size of the TLB is likely to have a significant effect.
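These coverage figures follow directly from TLB reach = entries × page size, taken as a fraction of the 4MB SRAM main memory. A check of the numbers quoted above:

```python
def tlb_reach_fraction(entries, page_bytes, sram_bytes=4 * 1024 * 1024):
    """Fraction of the SRAM main memory mapped by the TLB."""
    return entries * page_bytes / sram_bytes

assert tlb_reach_fraction(64, 4096) == 0.0625          # 6.25%
assert tlb_reach_fraction(512, 4096) == 0.5            # 50%
assert round(tlb_reach_fraction(64, 128), 4) == 0.002  # about 0.2%
```

The same arithmetic underlies the 256-entry, 1024B-page configuration discussed in Section 4.2, which also maps 6.25% of the SRAM main memory.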
The effect on a conventional architecture of increasing TLB size is not as significant because it maps DRAM pages (fixed at 4KB), not SRAM pages. Further, variation across L2 block sizes should not be related to TLB size.

4 Results
This section presents results of simulations, with some discussion. The main focus is on differences introduced by changes over previous simulations, but some advantages of RAMpage, as previously described, should be evident again from these new results. Presentation of results is broken down into effects of increasing L1 cache size, and effects of increasing TLB size, since these improvements have very different effects on the hierarchies modelled. Results are presented for 3 cases: the conventional 2-level cache with a DRAM main memory, and RAMpage with and without context switches on misses.
The remainder of this section presents the effects of L1 changes, then the effects of TLB changes, followed by a summary.
4.1 Increasing L1 Size
Fig. 1 shows how miss rates of the L1 instruction and data caches vary as their size increases for both RAMpage with context switches on misses and the standard hierarchy. (RAMpage without switches on misses follows the same trend as the standard hierarchy.) As cache sizes increase, the miss rate decreases, initially fairly rapidly. The trend is similar for all models.
Execution times are plotted in fig. 2, normalised to the best execution time at each CPU speed. As expected, larger caches decrease execution times by reducing capacity misses, as evident from the reduced miss rates – with limits to the benefits as L1 scales up. The best overall effect is from the combination of RAMpage with context switches on misses and increasing the size of L1. The execution time of the fastest variation speeds up 10.7 over the slowest configuration. Comparing a given hierarchy's slowest (1GHz, 32KB L1) and fastest case (8GHz, 256KB total L1) results in a speedup of 6.12 for the conventional hierarchy, 6.5 for RAMpage without switches on misses and 9.9 for switches on misses. For the slowest CPU and smallest L1, RAMpage with switches on misses has a speedup of 1.08 over the conventional hierarchy, rising to 1.74 with the fastest CPU and biggest L1. For RAMpage without switches on misses, the scaling up of improvement over the conventional hierarchy is not as strong: for the slowest CPU with the least aggressive L1, RAMpage has a speedup of 1.03, as opposed to 1.11 for the fastest CPU with the largest L1. So, whether by comparison with a conventional architecture or by
(Fig. 3: fraction of time at each level as total L1 size varies from 32K to 512K; panel (a): issue rate 1GHz, standard hierarchy. Figure not reproduced.)
Fig. 3 shows the relative time each variation, at the slowest and fastest CPU, spends waiting for each level of the various hierarchies, as L1 size increases. At an 8GHz issue rate, the conventional hierarchy spends over 40% of total execution time waiting for DRAM for the largest L1 cache – in line with measurements of the Pentium 4, which spends 35% of its time waiting for DRAM running SPECint2k on average at 2GHz [28]. This Pentium 4 configuration corresponds roughly to a 6GHz issue rate in this paper. The similarity of the time waiting for DRAM lends some credibility to our view that our results are reasonably in line with real systems.
While cache size increases boost performance significantly, as CPU speed increases, a large L1 cannot save a conventional hierarchy from the high penalty of waiting for DRAM. In fig. 3(d), it can be seen that RAMpage only improves the situation marginally without context switches on misses.
With RAMpage with context switches on misses, time waiting for DRAM remains negligible as the CPU-DRAM speed gap increases by a factor of 8 (fig. 3(f)). The largest L1 (combined L1i and L1d size 512KB) results in only about 10% of execution time being spent waiting for SRAM main memory, while DRAM wait time remains negligible. By contrast, the other hierarchies, while seeing a significant reduction in time waiting for L2 (or SRAM main memory), do not see a similar reduction in time waiting for DRAM as L1 size increases.
4.2 TLB Variations
All TLB variations are measured with the L1 parameters fixed at the original RAMpage measurements – 16KB each of instruction and data cache.
The TLB miss rate (fig. 4), even with increased TLB sizes, is significantly higher in all RAMpage cases than for the standard hierarchy, except for a 4KB RAMpage page size. As SRAM main memory page size increases, TLB miss rates drop, as expected. Further, as TLB size increases, smaller pages' miss rates decrease. In the case of context switches on misses, the number of context
(Figure: overhead (%) as a function of page size (B). Figure not reproduced.)
of the standard hierarchy. RAMpage TLB misses do not result in references to DRAM, unless there is a page fault, so the additional references should not result in a similarly substantial performance hit.
Fig. 6 illustrates execution times for the hierarchies at 1 and 8GHz, the speed gap which shows off differences most clearly. There are two competing effects: as L2 block size (SRAM page size) increases, miss penalty to DRAM increases. In RAMpage, reduced TLB misses compensate for the higher DRAM miss penalty, but the performance of the standard hierarchy becomes worse as block size increases. TLB size variation makes little difference to performance of the standard hierarchy with the simulated workload. Performance of RAMpage with switches on misses does not vary much for pages of 512B and greater even with TLB variations, while RAMpage without switches is best with 1024B pages.
The performance-optimal TLB and page size combination for RAMpage without context switches on misses, with a 512-entry TLB, is a 1024B page for all issue rates. In previous work, with a 64-entry TLB, the optimal page size at 1GHz was 2048B, while other issue rates performed best with 1024B pages. Thus, a larger TLB results in a smaller page size being optimal for the 1GHz speed. While other page sizes are still slower than the 1024B page size, for all cases with pages of 512B and greater, RAMpage without context switches on misses is faster than the standard hierarchy.
For RAMpage with context switches on misses, the performance-optimal page size has shifted to 1024B with a larger TLB. Previously the best page size was 4096B for 1, 2 and 4GHz and 2048B for 8GHz. A TLB of 256 or even 128 entries combined with the 1024B page will yield optimum or almost optimum performance. With a 1024B page and 256 entries, a total of 256KB, or 6.25% of the RAMpage main memory, is mapped by the TLB, which appears to be sufficient for this workload (a 4KB page with a 512-entry TLB maps half the SRAM main memory, overkill for any workload with reasonable locality of reference). Nonetheless, TLB performance is highly dependent on application code, so results presented here need to be considered in that light.
Contrasting the 1GHz and 8GHz cases in fig. 6 makes it clear again how the differences between RAMpage and a conventional hierarchy scale as the CPU-DRAM speed gap increases. At 1GHz, all variations are reasonably comparable across a range of parameters. At 8GHz, RAMpage is clearly better in all variations, but even more so with context switches on misses. A larger TLB broadens the range of useful RAMpage configurations, without significantly altering the standard hierarchy's competitiveness.
4.3 Summary
In summary, the RAMpage model with context switches on misses gains most from L1 cache improvements, though the other hierarchies also reduce execution time. However, without taking context switches on misses, increasing the size of L1 has the effect of increasing the fraction of time spent waiting for DRAM, since the number of DRAM references is not reduced, nor is their latency hidden. As was shown by scaling up the CPU-DRAM speed gap, only RAMpage with
context switches on misses, of the variations presented here, is able to hide the increasing effective latency of DRAM. Increasing the size of the TLB, as predicted, increased the range of SRAM main memory page sizes over which RAMpage is viable, widening the range of choices for a designer.
5 Conclusion
This paper has examined enhancements to RAMpage, which measure its potential for further improvement, as opposed to similar improvements to a conventional hierarchy. As in previous work, RAMpage has been shown to scale better as the CPU-DRAM speed gap grows. In addition, it has been shown that context switches on misses can take advantage of a more aggressive core including a bigger L1 cache, and a bigger TLB. The remainder of this section summarizes results, outlines future work and sums up overall findings.
5.1 Summary of Results
Introducing significantly larger L1 caches – even if this could be done without problems with meeting clock cycle targets – has limited benefits. Scaling the clock speed up by a factor of 8 achieves only about 77% of this speedup in the conventional hierarchy measured here. RAMpage with context switches on misses is able to make effective use of a larger L1 cache, and achieves superlinear speedup with respect to a slower clock speed and smaller L1 cache. While this effect can only be expected in RAMpage with an unrealistically large L1, this result shows that increasingly aggressive L1 caches are not as important a solution to the memory wall problem as finding alternative work on a miss to DRAM.
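The 77% figure follows from the speedups reported in Section 4.1: the conventional hierarchy's speedup of 6.12 against an 8× faster issue rate realizes only a fraction of the ideal scaling:

```python
clock_scaling = 8
conventional_speedup = 6.12            # reported in Section 4.1

fraction_realized = conventional_speedup / clock_scaling
# 6.12 / 8 = 0.765, i.e. about 77% of the ideal 8x speedup,
# the shortfall being the growing time spent waiting for DRAM.
```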
That results for RAMpage without context switches on misses are an improvement, but not as significant as results with context switches on misses, suggests that attempts at improving associativity and replacement strategy will not be sufficient to bridge the growing CPU-DRAM speed gap.
Larger TLBs, as expected, increase the range of useful RAMpage SRAM main memory page sizes, though the performance benefit on the workload measured was not significant versus larger page sizes and a more modest-sized TLB.
5.2 Future Work
It would be interesting to match RAMpage with models for supporting more than one instruction stream. SMT, while adding hardware complexity, is an established approach [19], with existing implementations [3]. Another thing to explore is alternative interconnect architectures, so multiple requests for DRAM could be overlapped [24]. HyperTransport [2] is a candidate. A more detailed simulation modelling operating system effects accurately would be useful. SimOS [26], for example, could be used. Further variations to explore include virtually-addressed L1 and hardware TLB miss handling. Finally, it would be interesting to build a RAMpage machine.
5.3 Overall Conclusion
RAMpage has been simulated in a variety of forms. In this latest study, enhancing L1 and the TLB has shown that it gains significantly more from such improvements than a conventional architecture in some cases. The most important finding generally from RAMpage work is that finding other work on a miss to DRAM is becoming increasingly viable. While RAMpage is not the only approach to finding such alternative work, it is a potential solution. As compared with hardware multithreading approaches, its main advantage is the flexibility of a software solution, though this needs to be compared to hardware solutions to establish the performance cost of extra flexibility.
References
1. Thomas Alexander and Gershon Kedem. Distributed prefetch-buffer/cache design for high-performance memory systems. In Proc. 2nd IEEE Symp. on High-Performance Computer Architecture, pages 254–263, San Jose, CA, February 1996.
2. AMD. HyperTransport technology: Simplifying system design [online]. October 2002. /docs/26635A_HT_System_Design.pdf.
3. J.M. Borkenhagen, R.J. Eickemeyer, R.N. Kalla, and S.R. Kunkel. A multithreaded PowerPC processor for commercial servers. IBM J. Research and Development, 44(6):885–898, November 2000.
4. T. Chen and J. Baer. Reducing memory latency via non-blocking and prefetching caches. In Proc. 5th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-5), pages 51–61, September 1992.
5. T-F. Chen. An effective programmable prefetch engine for on-chip caches. In Proc. 28th Int. Symp. on Microarchitecture (MICRO-28), pages 237–242, Ann Arbor, MI, 29 November–1 December 1995.
6. D.R. Cheriton, H.A. Goosen, H. Holbrook, and P. Machanick. Restructuring a parallel simulation to improve cache behavior in a shared-memory multiprocessor: The value of distributed synchronization. In Proc. 7th Workshop on Parallel and Distributed Simulation, pages 159–162, San Diego, May 1993.
7. D.R. Cheriton, G. Slavenburg, and P. Boyle. Software-controlled caches in the VMP multiprocessor. In Proc. 13th Int. Symp. on Computer Architecture (ISCA '86), pages 366–374, Tokyo, June 1986.
8. R. Crisp. Direct Rambus technology: The new main memory standard. IEEE Micro, 17(6):18–28, November/December 1997.
9. B. Davis, T. Mudge, B. Jacob, and V. Cuppu. DDR2 and low latency variants. In Solving the Memory Wall Problem Workshop, Vancouver, Canada, June 2000. In conjunction with 26th Annual Int. Symp. on Computer Architecture.
10. Erik G. Hallnor and Steven K. Reinhardt. A fully associative software-managed cache design. In Proc. 27th Annual Int. Symp. on Computer Architecture, pages 107–116, Vancouver, BC, 2000.
11. J. Handy. The Cache Memory Book. Academic Press, San Diego, CA, 2 ed., 1998.
12. J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Francisco, CA, 2 ed., 1996.
13. J. Huck and J. Hays. Architectural support for translation table management in large address space machines. In Proc. 20th Int. Symp. on Computer Architecture (ISCA '93), pages 39–50, San Diego, CA, May 1993.
14. B. Jacob and T. Mudge. Software-managed address translation. In Proc. Third Int. Symp. on High-Performance Computer Architecture, pages 156–167, San Antonio, TX, February 1997.
15. Bruce L. Jacob and Trevor N. Mudge. A look at several memory management units, TLB-refill mechanisms, and page table organizations. In Proc. 8th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), pages 295–306, San Jose, CA, 1998.
16. E.E. Johnson. Graffiti on the memory wall. Computer Architecture News, 23(4):7–8, September 1995.
17. Norman P. Jouppi. Cache write policies and performance. In Proc. 20th Annual Int. Symp. on Computer Architecture, pages 191–201, San Diego, CA, 1993.
18. Jang-Soo Lee, Won-Kee Hong, and Shin-Dug Kim. Design and evaluation of a selective compressed memory system. In Proc. IEEE Int. Conf. on Computer Design, pages 184–191, Austin, TX, 10–13 October 1999.
19. J.L. Lo, J.S. Emer, H.M. Levy, R.L. Stamm, and D.M. Tullsen. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Trans. on Computer Systems, 15(3):322–354, August 1997.
20. P. Machanick. The case for SRAM main memory. Computer Architecture News, 24(5):23–30, December 1996.
21. P. Machanick. Scalability of the RAMpage memory hierarchy. South African Computer Journal, (25):68–73, August 2000.
22. P. Machanick, P. Salverda, and L. Pompe. Hardware-software trade-offs in a Direct Rambus implementation of the RAMpage memory hierarchy. In Proc. 8th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), pages 105–114, San Jose, CA, October 1998.
23. Philip Machanick. An Object-Oriented Library for Shared-Memory Parallel Simulations. PhD Thesis, Dept. of Computer Science, University of Cape Town, 1996.
24. Philip Machanick. What if DRAM is a slow peripheral? Computer Architecture News, 30(6):16–19, December 2002.
25. T.C. Mowry, M.S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proc. 5th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 62–73, September 1992.
26. M. Rosenblum, S.A. Herrod, E. Witchel, and A. Gupta. Complete computer system simulation: The SimOS approach. IEEE Parallel and Distributed Technology, 3(4):34–43, Winter 1995.
27. Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the memory wall: the case for processor/memory integration. In Proc. 23rd Annual Int. Symp. on Computer Architecture, pages 90–101, Philadelphia, PA, 1996.
28. Eric Sprangle and Doug Carmean. Increasing processor performance by implementing deeper pipelines. In Proc. 29th Annual Int. Symp. on Computer Architecture, pages 25–34, Anchorage, Alaska, 2002.
29. J.M. Tendler, J.S. Dodson, J.S. Fields, Jr., H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM J. Research and Development, 46(1):5–25, 2002.
30. W.A. Wulf and S.A. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23(1):20–24, March 1995.
Acknowledgements
Financial support for this work has been received from the Universities of Queensland and Witwatersrand, and the South African National Research Foundation. We would like to thank the referees for helpful suggestions.