Graduation Project Translation  10031124 Chen Yang



Undergraduate Graduation Project (Thesis)

Translation of Foreign Scientific Literature

Translation Title (Chinese): 裸片间与裸片内参数变化对多核处理器时钟频率和吞吐量的影响

(English): Impact of Die-to-Die and Within-Die Parameter Variations on the Clock Frequency and Throughput of Multi-Core Processors

College: College of Engineering    Major and Class: Electronic Information Engineering, Class 1 of 2010    Student: Chen Yang    Student ID: 10031124    Supervisor: Xu Xiaojie

Date: January 2, 2014


Translated Text

Impact of Die-to-Die and Within-Die Parameter Variations on the Clock Frequency and Throughput of Multi-Core Processors

Keith A. Bowman, Member, IEEE, Alaa R. Alameldeen, Member, IEEE, Srikanth T. Srinivasan, Member, IEEE, and Chris B. Wilkerson, Member, IEEE

Abstract

A statistical performance simulator is developed to explore the impact of parameter variations on the maximum clock frequency (FMAX) and throughput distributions of multi-core processors in a future 22 nm technology. The simulator captures the effects of die-to-die (D2D) and within-die (WID) transistor and interconnect parameter variations on critical path delays in a die. A key component of the simulator is an analytical multi-core processor throughput model, which enables computationally efficient and accurate throughput calculations, as compared with cycle-accurate performance simulators, for single-threaded and highly parallel multi-threaded (MT) workloads. Based on microarchitecture designs from previous microprocessors, three multi-core processors with either small, medium, or large cores are projected for the 22 nm technology generation to investigate a range of design options. These three multi-core processors are optimized for maximum throughput within a constant die area. A traditional single-core processor is also scaled to the 22 nm technology to provide a baseline comparison. The salient contributions of this paper are: 1) product-level variation analysis for multi-core processors must focus on throughput, rather than just FMAX, and 2) multi-core processors are more variation tolerant than single-core processors due to the larger impact of memory latency and bandwidth on throughput. To elucidate these two points, statistical simulations indicate that multi-core and single-core processors with an equivalent total core area have similar FMAX distributions (mean degradation of 9% and standard deviation of 5%) for MT applications. In contrast to single-core processors, memory latency and bandwidth constraints significantly limit the throughput dependency on FMAX in multi-core processors, thus reducing the throughput mean degradation and standard deviation by ~50% for the small and medium core designs and by ~30% for the large core design. This improvement in the throughput distribution indicates that multi-core processors could significantly reduce the product design and process development complexities due to parameter variations as compared to single-core processors, enabling faster time to market for high-performance microprocessor products.


Index Terms: Clock frequency distribution, critical path delay variations, die-to-die (D2D) variations, inter-die variations, intra-die variations, maximum clock frequency (FMAX) distribution, multi-core, parameter fluctuations, parameter variations, performance distribution, throughput distribution, within-die (WID) variations.

1. Introduction

Microprocessors have always been vulnerable to parameter variations in the manufacturing process.

Manuscript received May 17, 2008; revised August 15, 2008. First published May 19, 2009; current version published November 18, 2009. The authors are with Intel Corporation, Hillsboro, OR 97124 USA (e-mail: keith.a.bowman@intel.com; alaa.r.alameldeen@intel.com; srikanth.t.srinivasan@intel.com; chris.wilkerson@intel.com). Digital Object Identifier 10.1109/TVLSI.2008.2006057

As process technology continues scaling, variations in transistor and interconnect characteristics are increasing relative to nominal design targets. The adverse effects of parameter variations on the maximum clock frequency (FMAX) and power of a microprocessor are also becoming more pronounced with technology scaling [1], [2]. Parameter variations can be classified into two categories: die-to-die (D2D) and within-die (WID). D2D variations, resulting from lot-to-lot, wafer-to-wafer, and a portion of the within-wafer variations, affect all transistors and interconnects on a die equally. Conversely, WID variations, consisting of random and systematic components, induce different electrical characteristics across a die [3]. A random WID parameter variation fluctuates randomly and independently from device to device (i.e., the device-to-device correlation is zero). A systematic WID parameter variation results from a repeatable and governing principle, where the device-to-device correlation is empirically determined as a function of the distance between the devices. Although systematic WID variations exhibit a correlated behavior, the profile of these variations can randomly change from die to die. From a design perspective, systematic WID variations behave as continuous and smooth correlated random WID variations [1], [3]-[6].

In designing high-performance microprocessors, the importance of accurately estimating the impact of parameter variations on product-level performance directly relates to the overall revenue of a company. An overestimation increases design complexity, possibly leading to higher power consumption, an increase in design time, an increase in die size, rejection of otherwise good design options, and even missed market windows [3]. Conversely, an underestimation can compromise product performance and overall yield as well as increase the silicon debug time [3]. In summary, overestimating variations impacts the design effort, and underestimating variations impacts the manufacturing effort.


In recent technology generations, multi-core processors have emerged as a power-efficient approach to designing high-performance microprocessors. Multi-core processors employ more than one core on a die, where the number of cores and core complexity is a key design tradeoff. Multi-core processors can achieve better performance than single-core processors for multi-threaded (MT) applications by executing threads in parallel across the cores.

Previous research has investigated the impact of D2D and WID parameter variations on the FMAX and power distributions of single-core processors [1], [2], [4], [5], [7], [8]. The impact of parameter variations on power, where leakage is the dominant variation component, does not fundamentally change from single-core to multi-core processors. A multi-core processor may enable much finer granularity in placing portions of the chip into a sleep state. When all transistors on the chip are in an operational mode, however, the relative effect of D2D and WID parameter variations on the leakage is expected to be similar between single-core and multi-core processors. In contrast, the multi-core design represents a fundamental shift in microprocessor performance from the traditional single-core design, where the parallelism in MT applications is exploited across the cores in a die.

In this paper, the impact of D2D and WID parameter variations on the FMAX and throughput distributions of multi-core processors [9] is explored. The throughput metric represents the actual microprocessor performance, thus providing an architecture-level perspective of device and circuit parameter variability. In Section 2, an analytical multi-core processor throughput model is derived to enable accurate throughput calculations for highly parallel workloads with runtime efficiency. In Section 3, three multi-core processors and a single-core processor are projected for a future 22 nm technology generation based on historical data and traditional scaling trends. Applying the analytical throughput model, a multi-core processor optimization is described in Section 4 to maximize the throughput of the three multi-core processors. In Section 5, the analytical throughput model is integrated into a statistical performance simulator that captures the effects of D2D and WID parameter variations on critical path delays across a die to generate the FMAX and throughput distributions for a given multi-core design. In Section 6, the impact of parameter variations on the FMAX and throughput distributions of the three optimal multi-core processors and the single-core processor is presented. Section 7 concludes by summarizing the key insights.

2. Multi-Core Processor Throughput Model

A compact analytical throughput model is derived to enable computationally efficient and accurate projections of multi-core processor throughput for highly parallel MT applications. Since the statistical performance simulator, which will be described in Section 5, performs thousands of throughput calculations per multi-core design, runtime efficiency is an essential feature. For this reason, an analytical modeling approach is desired rather than a computationally expensive throughput simulator. The throughput model derivation starts by separating the die area ($A_{die}$) into two main parts as

$A_{die} = A_{cores} + A_{L2}(N)$.  (1)

$A_{cores}$ is the total area allocated to the cores, where each core is assumed to contain private level-1 (L1) instruction and data caches. $A_{L2}(N)$ is the total level-2 (L2) cache area, with the N cores sharing the cache. The L2 cache size in units of megabytes is calculated as

$S_{L2}(N) = A_{L2}(N) / A_{1MB}$  (2)

where $A_{1MB}$ is the cache area per 1 MB, as determined by the process technology.

For a given workload, the cycles per instruction (CPI) for a single core are modeled as

$CPI(1) = CPI_{com} + M_{rate}(S_{L2}(1)) \cdot L_{miss}(F_{clk})$.  (3)

$CPI_{com}$, the computation component of CPI, is the core CPI with a perfect L2 cache (i.e., no cache misses). $CPI_{com}$ is independent of the processor clock frequency ($F_{clk}$). $M_{rate}(S_{L2}(1))$, the miss rate, is the number of misses per instruction for a cache of size $S_{L2}(1)$. $L_{miss}(F_{clk})$, the miss penalty, is the average number of cycles per L2 cache miss. $L_{miss}(F_{clk})$ is a function of $F_{clk}$. The product of $M_{rate}(S_{L2}(1))$ and $L_{miss}(F_{clk})$ represents the memory latency and memory bandwidth components of CPI. $S_{L2}(1)$ is the effective L2 cache size for one core. If the cores do not share code or data in the cache, then the average cache size per core is 1/N-th of the entire L2 cache size ($S_{L2}(1) = S_{L2}(N)/N$). For applications that share code or data, the working set size is adjusted by the average number ($N_{share}$) of cores that share an L2 cache line, where $N_{share}(N)$ is a function of N. The average cache size for a single core is calculated as [10]


$S_{L2}(1) = \frac{S_{L2}(N)}{N - N_{share}(N) + 1}$.  (4)

To project the miss rate for caches of different sizes, the square-root rule-of-thumb is typically applied, which models the cache miss rate as

$M_{rate}(S_{L2}(1)) = \frac{M_{rate}(1\,MB)}{\sqrt{S_{L2}(1)/S_{1MB}}}$  (5)

where $S_{1MB}$ is 1 MB. For some applications, the square-root model in (5) is less accurate than the working set model, where the miss rate remains constant as the cache size increases until the working set fits in the cache; subsequently, the miss rate sharply falls off. Since the miss rate dependency on cache size is application specific, the miss rate of a single core is simulated at multiple cache sizes with an industrial cycle-accurate simulator to determine the appropriate miss rate model for an individual application. Based on simulations across a wide range of applications, the square-root model provides the most accurate approximation of the average miss rate.

To model the instructions per cycle (IPC) of the multi-core processor, the effect of limited off-chip memory bandwidth is captured by separating $L_{miss}(F_{clk})$ into two components as

$L_{miss}(F_{clk}) = \frac{L_{mem}(F_{clk})}{N_{pr}} + L_{link}(F_{clk})$.  (6)

$L_{mem}(F_{clk})$, the off-chip DRAM memory latency, is calculated as the average number of cycles spent in the DRAM array to obtain data. In modeling out-of-order nonblocking cores that exploit memory-level parallelism (MLP), $L_{mem}(F_{clk})$ is divided by the average number ($N_{pr}$) of parallel memory requests, since each request blocks the processor for only a fraction of the total memory latency [11]. For in-order blocking cores, $N_{pr}$ equals one. $L_{link}(F_{clk})$, the total link latency, includes the latency of the physical off-chip link and the queuing latency (e.g., waiting in miss status handling registers (MSHRs) and bus queues).


$L_{link}(F_{clk})$ is calculated as the average number of cycles per off-chip memory access. $L_{link}(F_{clk})$ is separated into two components as

$L_{link}(F_{clk}) = L_s(F_{clk}) + L_q(F_{clk})$  (7)

where $L_s(F_{clk})$ and $L_q(F_{clk})$ are the service and queuing latencies per cache miss, respectively. $L_s(F_{clk})$ is the physical off-chip link latency for data to traverse across the link from the processor to the DRAM chip and back, where no transmission errors are assumed. $L_q(F_{clk})$ is computed as the mean queuing latency. Assuming the physical off-chip link to memory represents an M/D/1 queue (Markovian arrival rate for requests with a deterministic service time and an infinite number of request sources), $L_q(F_{clk})$ is modeled as

$L_q(F_{clk}) = \frac{U \cdot L_s(F_{clk})}{2(1-U)}$  (8)

where U is the link utilization. Using Little's law, U is computed as

$U = \lambda \cdot L_s(F_{clk})$.  (9)

The parameter $\lambda$ is the number of memory requests per cycle, which is calculated as

$\lambda = IPC(N) \cdot M_{rate}(S_{L2}(1))$  (10)

where $IPC(N)$ represents the IPC of a multi-core processor with N cores. From (7)-(9), the total link latency is calculated as

$L_{link}(F_{clk}) = L_s(F_{clk}) + \frac{\lambda (L_s(F_{clk}))^2}{2(1 - \lambda L_s(F_{clk}))}$.  (11)

As described in (12), the IPC for a multi-core processor is calculated from (3), (6), and (11) [10]. Since $\lambda$ is a function of $IPC(N)$, (12) reduces to a quadratic equation, where the roots of the quadratic formula result in an explicit $IPC(N)$ expression:

$IPC(N) = \frac{N}{CPI(1)} = \frac{N}{CPI_{com} + M_{rate}(S_{L2}(1))\left(\frac{L_{mem}(F_{clk})}{N_{pr}} + L_s(F_{clk}) + \frac{\lambda (L_s(F_{clk}))^2}{2(1 - \lambda L_s(F_{clk}))}\right)}$.  (12)

The $F_{clk}$ dependencies of $L_{mem}(F_{clk})$ and $L_s(F_{clk})$ are modeled as $L_{mem}(F_{clk}) = L_{mem}(F_{clk,nom}) \cdot F_{clk}/F_{clk,nom}$ and $L_s(F_{clk}) = L_s(F_{clk,nom}) \cdot F_{clk}/F_{clk,nom}$, where $F_{clk,nom}$ is the nominal processor clock frequency. Assuming all N cores have the same $F_{clk}$, the throughput (TP) in instructions per second for the multi-core processor is calculated as in (13):

$TP(N) = IPC(N) \cdot F_{clk} = \frac{N}{\frac{CPI_{com}}{F_{clk}} + \frac{CPI_{mem,lat}(F_{clk})}{F_{clk}} + \frac{CPI_{mem,bw}(F_{clk})}{F_{clk}}}$.  (13)


$CPI_{mem,lat}(F_{clk})/F_{clk}$ and $CPI_{mem,bw}(F_{clk})/F_{clk}$ represent the memory latency and memory bandwidth components of throughput, which are modeled as

$\frac{CPI_{mem,lat}(F_{clk})}{F_{clk}} = \frac{M_{rate}(S_{L2}(1)) \cdot L_{mem}(F_{clk,nom})}{F_{clk,nom} \cdot N_{pr}}$  (14)

and

$\frac{CPI_{mem,bw}(F_{clk})}{F_{clk}} = M_{rate}(S_{L2}(1)) \frac{L_s(F_{clk,nom})}{F_{clk,nom}} \left(1 + \frac{\lambda L_s(F_{clk,nom}) \frac{F_{clk}}{F_{clk,nom}}}{2\left(1 - \lambda L_s(F_{clk,nom}) \frac{F_{clk}}{F_{clk,nom}}\right)}\right)$.  (15)

Additional assumptions are applied to trade off accuracy for runtime efficiency: 1) MT benchmarks are perfectly parallelizable (i.e., only the parallel portion of MT applications is modeled); 2) average benchmark performance is an appropriate metric for evaluating general trends; and 3) the additional inter-thread interactions and operating system overhead when scheduling threads on a multi-core processor are negligible.

The analytical model in (13)-(15) is validated for both single-threaded (ST) and highly parallel MT applications. For ST applications, one core is assumed to have access to the entire L2 cache. Although the model primarily targets the performance of highly parallel MT applications, the analytical model is easily modified for ST applications by adjusting the miss rate from $M_{rate}(S_{L2}(1))$ to $M_{rate}(S_{L2}(N))$. In validating the analytical model for ST applications, the model projections of the average IPC from 460 workloads are compared with an industrial cycle-accurate simulator for different core types and cache sizes. The 460 workloads consist of server, multimedia, games, SPEC2K, and office productivity applications. The only workload-specific model parameters are $CPI_{com}$, $M_{rate}(1\,MB)$, and $N_{pr}$. $CPI_{com}$ is extracted by operating the simulator with a perfect L2 cache;


$M_{rate}(1\,MB)$ and $N_{pr}$ are extracted by operating with a 1 MB cache. The $CPI_{com}$, $M_{rate}(1\,MB)$, and $N_{pr}$ values applied in the analytical model represent the average extracted values across the 460 workloads. Comparing the analytical model to the industrial cycle-accurate simulator across a variety of core types and L2 cache sizes, the model projections of average IPC for the 460 workloads are within 4% of the simulation results.

In validating the analytical model for highly parallel MT applications, the IPC model in (12) is compared with Asim simulations [12] in Fig. 1 for a variety of recognition, mining, and synthesis (RMS) benchmarks [13] across the number of cores contained in the multi-core processor. These RMS benchmarks focus on the basic building blocks of matrix-oriented data manipulation and calculations that are increasingly being utilized to model and process complex systems [13]. The benchmarks include: 1) kmeans, fuzzy c-means clustering; 2) ADAt, sparse-matrix multiplication of a sparse matrix (A) by a diagonal matrix (D) by the transpose of the sparse matrix (At); 3) sparse_mvm_sym, symmetric sparse matrix-vector multiplication; 4) dense_mmm, dense matrix-matrix multiplication; and 5) sparse_mvm, sparse matrix-vector multiplication. The Asim simulator [12] evaluates each workload while capturing the effects of the multiple cores, the shared L2 cache, and the interconnection network between the L2 cache and the off-chip DRAM memory. The comparison in Fig. 1 is based on a 2-wide in-order core with a 32 MB L2 cache, a 128-byte cache line size, and a 200-cycle memory latency. Since $N_{pr} = 1$ for an in-order core, the only workload-specific inputs to the analytical model are $CPI_{com}$ and $M_{rate}(1\,MB)$, which are extracted from the Asim simulator for one core. For three of the benchmarks (kmeans, ADAt, and dense_mmm), the square-root cache miss rate model in (5) is applied. For the other two benchmarks (sparse_mvm_sym and sparse_mvm), the working set model is used to estimate the cache miss rate. For the kmeans, ADAt, dense_mmm, and sparse_mvm benchmarks, the analytical model agrees closely with the Asim simulations, where the worst-case error is less than 5%. The sparse_mvm_sym benchmark contains large sections of serial execution, leading to a worst-case model error of 22%. Although the model is less accurate for MT applications with large portions of serial execution, the multi-core processor throughput model agrees well with the Asim simulator for MT applications with large sections of parallel execution and with an industrial cycle-accurate simulator for ST applications. As previously discussed, the analytical model primarily targets highly parallel MT workloads with negligible serial execution. In the remainder of this paper, the MT applications are assumed to be perfectly parallelizable, where the analytical model is sufficiently accurate. If MT applications with large portions of serial execution are considered in future work, then the analytical throughput model in (13)-(15) may be extended [14] to improve the accuracy for these applications.


3. Multi-Core Processor Designs

In optimizing a multi-core processor in Section 4 and in exploring the impact of parameter variations on multi-core processor FMAX and throughput in Section 6, three separate multi-core processors are evaluated.

Fig. 1. Comparison of IPC model projections from (12) with the Asim simulator [12] for a variety of RMS benchmarks [13] versus the number of cores.

These three processors contain either small, medium, or large cores to investigate a range of multi-core processor design options. In addition, a traditional single-core processor, containing a monolithic core, is used as a baseline comparison. The small, medium, and large cores are based on the Intel Pentium P54C (in-order) [15], the Intel Pentium III (out-of-order) [16], and the Intel Core 2 (advanced out-of-order) [17] microprocessors, respectively. In Fig. 2, the product introduction technology generation, core area, average $F_{clk}$, normalized average SPECint throughput, cache size, supply voltage ($V_{DD}$), and core power for each core type are summarized based on historical data [15]-[20]. Note that the core area excludes the L2 cache area.

Original Text

Impact of Die-to-Die and Within-Die Parameter Variations on the Clock Frequency and Throughput of Multi-Core Processors

Keith A. Bowman, Member, IEEE, Alaa R. Alameldeen, Member, IEEE, Srikanth T. Srinivasan, Member, IEEE, and Chris B. Wilkerson, Member, IEEE

Abstract—A statistical performance simulator is developed to explore the impact of parameter variations on the maximum clock frequency (FMAX) and throughput distributions of multi-core processors in a future 22 nm technology. The simulator captures the effects of die-to-die (D2D) and within-die (WID) transistor and interconnect parameter variations on critical path delays in a die. A key component of the simulator is an analytical multi-core processor throughput model, which enables computationally efficient and accurate throughput calculations, as compared with cycle-accurate performance simulators, for single-threaded and highly parallel multi-threaded (MT) workloads. Based on microarchitecture designs from previous microprocessors, three multi-core processors with either small, medium, or large cores are projected for the 22 nm technology generation to investigate a range of design options. These three multi-core processors are optimized for maximum throughput within a constant die area. A traditional single-core processor is also scaled to the 22 nm technology to provide a baseline comparison. The salient contributions from this paper are: 1) product-level variation analysis for multi-core processors must focus on throughput, rather than just FMAX, and 2) multi-core processors are more variation tolerant than single-core processors due to the larger impact of memory latency and bandwidth on throughput. To elucidate these two points, statistical simulations indicate that multi-core and single-core processors with an equivalent total core area have similar FMAX distributions (mean degradation of 9% and standard deviation of 5%) for MT applications. In contrast to single-core processors, memory latency and bandwidth constraints significantly limit the throughput dependency on FMAX in multi-core processors, thus reducing the throughput mean degradation and standard deviation by ~50% for the small and medium core designs and by ~30% for the large core design. This improvement in the throughput distribution indicates that multi-core processors could significantly reduce the product design and process development complexities due to parameter variations as compared to single-core processors, enabling faster time to market for high-performance microprocessor products.

Index Terms—Clock frequency distribution, critical path delay variations, die-to-die (D2D) variations, inter-die variations, intra-die variations, maximum clock frequency (FMAX) distribution, multi-core, parameter fluctuations, parameter variations, performance distribution, throughput distribution, within-die (WID) variations.

I. INTRODUCTION

MICROPROCESSORS have always been vulnerable to parameter variations in the manufacturing process. As process technology continues scaling, variations in transistor and interconnect characteristics are increasing relative to nominal design targets. The adverse effects of parameter variations on the maximum clock frequency (FMAX) and power of a microprocessor are also becoming more pronounced with technology scaling [1], [2]. Parameter variations can be classified into two categories: die-to-die (D2D) and within-die (WID). D2D variations, resulting from lot-to-lot, wafer-to-wafer, and a portion of the within-wafer variations, affect all transistors and interconnects on a die equally. Conversely, WID variations, consisting of random and systematic components, induce different electrical characteristics across a die [3]. A random WID parameter variation fluctuates randomly and independently from device to device (i.e., device-to-device correlation is zero). A systematic WID parameter variation results from a repeatable and governing principle, where the device-to-device correlation is empirically determined as a function of the distance between the devices. Although systematic WID variations exhibit a correlated behavior, the profile of these variations can randomly change from die to die. From a design perspective, systematic WID variations behave as continuous and smooth correlated random WID variations [1], [3]-[6].

Manuscript received May 17, 2008; revised August 15, 2008. First published May 19, 2009; current version published November 18, 2009. The authors are with Intel Corporation, Hillsboro, OR 97124 USA (e-mail: keith.a.bowman@intel.com; alaa.r.alameldeen@intel.com; srikanth.t.srinivasan@intel.com; chris.wilkerson@intel.com). Digital Object Identifier 10.1109/TVLSI.2008.2006057

In designing high-performance microprocessors, the importance of accurately estimating the impact of parameter variations on product-level performance directly relates to the overall revenue of a company. An overestimation increases design complexity, possibly leading to higher power consumption, an increase in design time, an increase in die size, rejection of otherwise good design options, and even missed market windows [3]. Conversely, an underestimation can compromise product performance and overall yield as well as increase the silicon debug time [3]. In summary, overestimating variations impacts the design effort and underestimating variations impacts the manufacturing effort.

In recent technology generations, multi-core processors have emerged as a power-efficient approach to designing high-performance microprocessors. Multi-core processors employ more than one core on a die, where the number of cores and core complexity is a key design tradeoff. Multi-core processors can achieve better performance than single-core processors for multi-threaded (MT) applications by executing threads in parallel across the cores.

Previous research has investigated the impact of D2D and WID parameter variations on the FMAX and power distributions of single-core processors [1], [2], [4], [5], [7], [8]. The impact of parameter variations on power, where leakage is the dominant variation component, does not fundamentally change from single-core to multi-core processors. A multi-core processor may enable much finer granularity in placing portions of the chip into a sleep state. When all transistors on the chip are in an operational mode, however, the relative effect of D2D and WID parameter variations on the leakage is expected to be similar between single-core and multi-core processors. In contrast, the multi-core design represents a fundamental shift in microprocessor performance from the traditional single-core design, where the parallelism in MT applications is exploited across the cores in a die.

In this paper, the impact of D2D and WID parameter variations on the FMAX and throughput distributions of multi-core processors [9] is explored. The throughput metric represents the actual microprocessor performance, thus providing an architecture-level perspective of device and circuit parameter variability. In Section II, an analytical multi-core processor throughput model is derived to enable accurate throughput calculations for highly parallel workloads with runtime efficiency. In Section III, three multi-core processors and a single-core processor are projected for a future 22 nm technology generation based on historical data and traditional scaling trends. Applying the analytical throughput model, a multi-core processor optimization is described in Section IV to maximize the throughput of the three multi-core processors. In Section V, the analytical throughput model is integrated into a statistical performance simulator that captures the effects of D2D and WID parameter variations on critical path delays across a die to generate FMAX and throughput distributions for a given multi-core design. In Section VI, the impact of parameter variations on the FMAX and throughput distributions of the three optimal multi-core processors and the single-core processor is presented. Section VII concludes by summarizing the key insights.
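The statistical simulator itself is developed in the paper's Section V, which lies outside this excerpt. Purely as an illustrative sketch of the flow described here — sample one shared D2D delay offset per die plus independent random WID noise per critical path, let the slowest path set each core's frequency, and collect the die FMAX over many dies — the following Python uses distribution shapes, path counts, and sigma values that are our assumptions, not the paper's calibrated inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_die_fmax(n_cores=16, paths_per_core=500,
                    sigma_d2d=0.03, sigma_wid=0.05, t_nom=1.0):
    """One Monte Carlo die: FMAX is set by the slowest critical path.

    Assumed model: path delay = t_nom * (1 + d2d + wid), where d2d is a
    single offset shared by the whole die and wid is independent per-path
    Gaussian noise (random WID only; no systematic WID component).
    """
    d2d = rng.normal(0.0, sigma_d2d)
    wid = rng.normal(0.0, sigma_wid, size=(n_cores, paths_per_core))
    path_delay = t_nom * (1.0 + d2d + wid)
    core_fmax = 1.0 / path_delay.max(axis=1)  # slowest path limits each core
    return core_fmax.min()                    # one clock: slowest core limits the die

die_fmax = np.array([sample_die_fmax() for _ in range(5000)])
print(f"mean FMAX degradation: {1.0 - die_fmax.mean():.1%}")
print(f"sigma/mean: {die_fmax.std() / die_fmax.mean():.1%}")
```

Even this toy version reproduces the qualitative effect discussed later: taking the maximum over many path delays pushes the mean FMAX below nominal, and the shared D2D term dominates the die-to-die spread.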

II. MULTI-CORE PROCESSOR THROUGHPUT MODEL

A compact analytical throughput model is derived to enable computationally efficient and accurate projections of multi-core processor throughput for highly parallel MT applications. Since the statistical performance simulator, which will be described in Section V, performs thousands of throughput calculations per multi-core design, runtime efficiency is an essential feature. For this reason, an analytical modeling approach is desired rather than a computationally expensive throughput simulator. The throughput model derivation starts by separating the die area ($A_{die}$) into two main parts as

$A_{die} = A_{cores} + A_{L2}(N)$.  (1)

$A_{cores}$ is the total area allocated to the cores, where each core is assumed to contain private level-1 (L1) instruction and data caches. $A_{L2}(N)$ is the total level-2 (L2) cache area, with the N cores sharing the cache. The L2 cache size in units of megabytes is calculated as

$S_{L2}(N) = A_{L2}(N) / A_{1MB}$  (2)

where $A_{1MB}$ is the cache area per 1 MB, as determined by the process technology.

For a given workload, the cycles per instruction (CPI) for a single core are modeled as

$CPI(1) = CPI_{com} + M_{rate}(S_{L2}(1)) \cdot L_{miss}(F_{clk})$.  (3)

$CPI_{com}$, the computation component of CPI, is the core CPI with a perfect L2 cache (i.e., no cache misses). $CPI_{com}$ is independent of the processor clock frequency ($F_{clk}$). $M_{rate}(S_{L2}(1))$, the miss rate, is the number of misses per instruction for a cache of size $S_{L2}(1)$. $L_{miss}(F_{clk})$, the miss penalty, is the average number of cycles per L2 cache miss. $L_{miss}(F_{clk})$ is a function of $F_{clk}$. The product of $M_{rate}(S_{L2}(1))$ and $L_{miss}(F_{clk})$ represents the memory latency and memory bandwidth components of CPI. $S_{L2}(1)$ is the effective L2 cache size for one core. If the cores do not share code or data in the cache, then the average cache size per core is 1/N-th of the entire L2 cache size ($S_{L2}(1) = S_{L2}(N)/N$). For


applications that share code or data, the working set size is adjusted by the average number ($N_{share}$) of cores that share an L2 cache line, where $N_{share}(N)$ is a function of N. The average cache size for a single core is calculated as [10]

$S_{L2}(1) = \frac{S_{L2}(N)}{N - N_{share}(N) + 1}$.  (4)

To project the miss rate for caches of different sizes, the square-root rule-of-thumb is typically applied, which models the cache miss rate as

$M_{rate}(S_{L2}(1)) = \frac{M_{rate}(1\,MB)}{\sqrt{S_{L2}(1)/S_{1MB}}}$  (5)

where $S_{1MB}$ is 1 MB. For some applications, the square-root model in (5) is less accurate than the working set model, where the miss rate remains constant as cache size increases until the working set fits in the cache; subsequently, the miss rate sharply falls off. Since the miss rate dependency on cache size is application specific, the miss rate of a single core is simulated at multiple cache sizes with an industrial cycle-accurate simulator to determine the appropriate miss rate model for an individual application. Based on simulations across a wide range of applications, the square-root model provides the most accurate approximation of the average miss rate.
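A minimal runnable sketch of (2), (4), and (5), with the working-set alternative included for comparison; only the formulas come from the paper, while the function names and the example numbers are our assumptions:

```python
import math

def l2_size_mb(a_l2_mm2: float, a_1mb_mm2: float) -> float:
    """Eq. (2): total L2 cache size from its area budget."""
    return a_l2_mm2 / a_1mb_mm2

def effective_cache_per_core(s_l2_n: float, n: int, n_share: float) -> float:
    """Eq. (4): average L2 size seen by one core when, on average,
    n_share cores share each L2 line (n_share = 1 means no sharing)."""
    return s_l2_n / (n - n_share + 1)

def miss_rate_sqrt(m_rate_1mb: float, size_mb: float) -> float:
    """Eq. (5): square-root rule of thumb, anchored at a 1 MB cache."""
    return m_rate_1mb / math.sqrt(size_mb / 1.0)

def miss_rate_working_set(m_rate_1mb: float, size_mb: float,
                          ws_mb: float, fit_rate: float = 0.0) -> float:
    """Working-set alternative: flat until the working set fits in the
    cache, after which the miss rate drops sharply (here, to fit_rate)."""
    return m_rate_1mb if size_mb < ws_mb else fit_rate

# Example: 16 cores, 32 MB shared L2, 2 cores sharing each line on average
s1 = effective_cache_per_core(32.0, 16, 2.0)   # ~2.13 MB seen per core
print(s1, miss_rate_sqrt(0.01, s1))
```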

To model the instructions per cycle (IPC) for the multi-core processor, the effect of limited off-chip memory bandwidth is captured by separating $L_{miss}(F_{clk})$ into two components as

$L_{miss}(F_{clk}) = \frac{L_{mem}(F_{clk})}{N_{pr}} + L_{link}(F_{clk})$.  (6)


$L_{mem}(F_{clk})$, the off-chip DRAM memory latency, is calculated as the average number of cycles spent in the DRAM array to obtain data. In modeling out-of-order nonblocking cores that exploit memory-level parallelism (MLP), $L_{mem}(F_{clk})$ is divided by the average number ($N_{pr}$) of parallel memory requests, since each request blocks the processor for a fraction of the total memory latency [11]. For in-order blocking cores, $N_{pr}$ equals one. $L_{link}(F_{clk})$, the total link latency, includes the latency of the physical off-chip link and the queuing latency (e.g., waiting in miss status handling registers (MSHRs) and bus queues). $L_{link}(F_{clk})$ is calculated as the average number of cycles per off-chip memory access. $L_{link}(F_{clk})$ is separated into two components as

$L_{link}(F_{clk}) = L_s(F_{clk}) + L_q(F_{clk})$  (7)

where $L_s(F_{clk})$ and $L_q(F_{clk})$ are the service and queuing latencies per cache miss, respectively. $L_s(F_{clk})$ is the physical off-chip link latency for data to traverse across the link from the processor to the DRAM chip and back, where no transmission errors are assumed. $L_q(F_{clk})$ is computed as the mean queuing latency. Assuming the physical off-chip link to memory represents an M/D/1 queue (Markovian arrival rate for requests with a deterministic service time and an infinite number of request sources), $L_q(F_{clk})$ is modeled as

$L_q(F_{clk}) = \frac{U \cdot L_s(F_{clk})}{2(1-U)}$  (8)

where U is the link utilization. Using Little's law, U is computed as

$U = \lambda \cdot L_s(F_{clk})$.  (9)

The parameter $\lambda$ is the number of memory requests per cycle, which is calculated as

$\lambda = IPC(N) \cdot M_{rate}(S_{L2}(1))$  (10)

where $IPC(N)$ represents the IPC for a multi-core processor with N cores. From (7)-(9), the total link latency is calculated as


$L_{link}(F_{clk}) = L_s(F_{clk}) + \frac{\lambda (L_s(F_{clk}))^2}{2(1 - \lambda L_s(F_{clk}))}$.  (11)

As described in (12), the IPC for a multi-core processor is calculated from (3), (6), and (11) [10]. Since $\lambda$ is a function of $IPC(N)$, (12) reduces to a quadratic equation, where the roots of the quadratic formula result in an explicit $IPC(N)$ expression:

$IPC(N) = \frac{N}{CPI(1)} = \frac{N}{CPI_{com} + M_{rate}(S_{L2}(1))\left(\frac{L_{mem}(F_{clk})}{N_{pr}} + L_s(F_{clk}) + \frac{\lambda (L_s(F_{clk}))^2}{2(1 - \lambda L_s(F_{clk}))}\right)}$.  (12)

The $F_{clk}$ dependencies of $L_{mem}(F_{clk})$ and $L_s(F_{clk})$ are modeled as $L_{mem}(F_{clk}) = L_{mem}(F_{clk,nom}) \cdot F_{clk}/F_{clk,nom}$ and $L_s(F_{clk}) = L_s(F_{clk,nom}) \cdot F_{clk}/F_{clk,nom}$, where $F_{clk,nom}$ is the nominal processor clock frequency. Assuming all N cores have the same $F_{clk}$, the throughput (TP) in instructions per second for the multi-core processor is calculated as in (13):

$TP(N) = IPC(N) \cdot F_{clk} = \frac{N}{\frac{CPI_{com}}{F_{clk}} + \frac{CPI_{mem,lat}(F_{clk})}{F_{clk}} + \frac{CPI_{mem,bw}(F_{clk})}{F_{clk}}}$.  (13)

$CPI_{mem,lat}(F_{clk})/F_{clk}$ and $CPI_{mem,bw}(F_{clk})/F_{clk}$ represent the memory latency and memory bandwidth components of throughput, which are modeled as

$\frac{CPI_{mem,lat}(F_{clk})}{F_{clk}} = \frac{M_{rate}(S_{L2}(1)) \cdot L_{mem}(F_{clk,nom})}{F_{clk,nom} \cdot N_{pr}}$  (14)

and

$\frac{CPI_{mem,bw}(F_{clk})}{F_{clk}} = M_{rate}(S_{L2}(1)) \frac{L_s(F_{clk,nom})}{F_{clk,nom}} \left(1 + \frac{\lambda L_s(F_{clk,nom}) \frac{F_{clk}}{F_{clk,nom}}}{2\left(1 - \lambda L_s(F_{clk,nom}) \frac{F_{clk}}{F_{clk,nom}}\right)}\right)$.  (15)

Additional assumptions are applied to trade off accuracy for runtime efficiency: 1) MT benchmarks are perfectly parallelizable (i.e., only the parallel portion of MT applications is modeled); 2) average benchmark performance is an appropriate metric for evaluating general trends; and 3) the additional inter-thread interactions and operating system overhead when scheduling threads on a multi-core processor are negligible.
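Collecting (3) and (6)-(13) into code: rather than writing out the explicit quadratic root for $IPC(N)$, the sketch below solves the $\lambda$-$IPC(N)$ fixed point by bisection on the interval allowed by the M/D/1 stability condition $\lambda L_s < 1$. All numeric inputs in the usage lines are illustrative assumptions, not values from the paper:

```python
def ipc_multicore(n, cpi_com, m_rate, l_mem, l_s, n_pr):
    """Solve eq. (12) for IPC(N), with lambda = IPC(N) * m_rate (eq. 10).

    All latencies are in cycles at the current Fclk. The M/D/1 queue in
    (8)-(9) requires lambda * l_s < 1 (link utilization below 100%), which
    bounds the achievable IPC; IPC * CPI(1) is monotone in IPC, so bisect.
    """
    def cycles_per_instr(ipc):
        lam = ipc * m_rate                               # eq. (10)
        l_q = lam * l_s**2 / (2.0 * (1.0 - lam * l_s))   # eqs. (8)-(9)
        l_miss = l_mem / n_pr + l_s + l_q                # eqs. (6), (7), (11)
        return cpi_com + m_rate * l_miss                 # eq. (3)

    lo, hi = 0.0, min(n / cpi_com, 0.999 / (m_rate * l_s))
    for _ in range(100):                                 # find IPC * CPI(1) = N
        mid = 0.5 * (lo + hi)
        if mid * cycles_per_instr(mid) < n:
            lo = mid
        else:
            hi = mid
    return lo

def throughput(n, fclk, fclk_nom, cpi_com, m_rate, l_mem_nom, l_s_nom, n_pr):
    """Eq. (13): TP(N) = IPC(N) * Fclk. Memory latencies are fixed in
    absolute time, so their cycle counts scale linearly with Fclk."""
    scale = fclk / fclk_nom
    ipc = ipc_multicore(n, cpi_com, m_rate,
                        l_mem_nom * scale, l_s_nom * scale, n_pr)
    return ipc * fclk

# Illustrative numbers: 16 cores, 0.5% miss rate, 200-cycle DRAM latency,
# 20-cycle link service latency, Npr = 4 parallel requests per core type.
tp = throughput(n=16, fclk=4e9, fclk_nom=4e9, cpi_com=0.7,
                m_rate=0.005, l_mem_nom=200.0, l_s_nom=20.0, n_pr=4)
print(f"{tp / 1e9:.1f} GIPS")
```

With these inputs the queuing term caps IPC well below the compute-bound limit $N/CPI_{com}$, which is the bandwidth-limiting effect on throughput that the abstract highlights for multi-core designs.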

The analytical model in (13)-(15) is validated for both single-threaded (ST) and highly parallel MT applications. For ST applications, one core is assumed to have access to the entire L2 cache. Although the model primarily targets the performance of highly parallel MT applications, the analytical model is easily modified for ST applications by adjusting the miss rate from $M_{rate}(S_{L2}(1))$ to $M_{rate}(S_{L2}(N))$. In validating the analytical model for ST applications, the model projections of the average IPC from 460 workloads are compared with an industrial cycle-accurate simulator for different core types and cache sizes. The 460 workloads consist of server, multimedia, games, SPEC2K, and office productivity applications. The only workload-specific model


parameters are $CPI_{com}$, $M_{rate}(1\,MB)$, and $N_{pr}$. $CPI_{com}$ is extracted by operating the simulator with a perfect L2 cache; $M_{rate}(1\,MB)$ and $N_{pr}$ are extracted by operating with a 1 MB cache. The $CPI_{com}$, $M_{rate}(1\,MB)$, and $N_{pr}$ values applied in the analytical model represent the average extracted values across the 460 workloads. Comparing the analytical model to the industrial cycle-accurate simulator across a variety of core types and L2 cache sizes, the model projections of average IPC for the 460 workloads are within 4% of the simulation results.
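The extraction amounts to two simulator configurations per workload. The paper measures $M_{rate}(1\,MB)$ directly from miss counters; purely as a toy illustration of the eq. (3) bookkeeping, the same value can be backed out of the two CPI measurements when the miss penalty is known (the numbers below are made up):

```python
def extract_model_params(cpi_perfect_l2, cpi_1mb_l2, l_miss_cycles):
    """Recover CPI_com and M_rate(1MB) from two cycle-accurate runs:
    one with a perfect L2 and one with a 1 MB L2, using eq. (3):
    CPI(1) = CPI_com + M_rate(1MB) * L_miss."""
    cpi_com = cpi_perfect_l2          # perfect L2: no miss component
    m_rate_1mb = (cpi_1mb_l2 - cpi_com) / l_miss_cycles
    return cpi_com, m_rate_1mb

# e.g., CPI 0.70 with a perfect L2, 1.30 with a 1 MB L2, 200-cycle penalty
print(extract_model_params(0.70, 1.30, 200.0))   # -> (0.7, 0.003)
```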

In validating the analytical model for highly parallel MT applications, the IPC model in (12) is compared with Asim simulations [12] in Fig. 1 for a variety of recognition, mining, and synthesis (RMS) benchmarks [13] across the number of cores contained in the multi-core processor. These RMS benchmarks focus on the basic building blocks of matrix-oriented data manipulation and calculations that are increasingly being utilized to model and process complex systems [13]. The benchmarks include: 1) kmeans, fuzzy c-means clustering; 2) ADAt, sparse-matrix multiplication of a sparse matrix (A) by a diagonal matrix (D) by the transpose of the sparse matrix (At); 3) sparse_mvm_sym, symmetric sparse matrix-vector multiplication; 4) dense_mmm, dense matrix-matrix multiplication; and 5) sparse_mvm, sparse matrix-vector multiplication.


The Asim simulator [12] evaluates each workload while capturing the effects of multiple cores, the shared L2 cache, and the interconnection network between the L2 cache and off-chip DRAM memory. The comparison in Fig. 1 is based on a 2-wide in-order core with a 32 MB L2 cache, a 128-byte cache line size, and a 200-cycle memory latency. Since $N_{pr} = 1$ for an in-order core, the only workload-specific inputs to the analytical model are $CPI_{com}$ and $M_{rate}(1\,MB)$, which are extracted from the Asim simulator for one core. For three of the benchmarks (kmeans, ADAt, and dense_mmm), the square-root cache miss rate model in (5) is applied. For the other two benchmarks (sparse_mvm_sym and sparse_mvm), the working set model is used to estimate the cache miss rate. For the kmeans, ADAt, dense_mmm, and sparse_mvm benchmarks, the analytical model agrees closely with the Asim simulations, where the worst-case error is less than 5%. The sparse_mvm_sym benchmark contains large sections of serial execution, leading to a worst-case model error of 22%. Although the model is less accurate for MT applications with large portions of serial execution, the multi-core processor throughput model agrees well with the Asim simulator for MT applications with large sections of parallel execution and with an industrial cycle-accurate simulator for ST applications. As previously discussed, the analytical model primarily targets highly parallel MT workloads with negligible serial execution. In the remainder of this paper, the MT applications are assumed to be perfectly parallelizable, where the analytical model is sufficiently accurate. If MT applications with large portions of serial execution are considered in a future work, then the analytical throughput model in (13)-(15) may be extended [14] to improve the accuracy for these applications.
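The extension in [14] is not reproduced in this excerpt. Purely as a hypothetical illustration of how a serial fraction could be bolted onto (13), an Amdahl-style blend might look like the following, where the helper function and the serial fraction s are our assumptions rather than the paper's model:

```python
def throughput_with_serial_fraction(tp_parallel, tp_single, s):
    """Amdahl-style blend: a fraction s of the instructions runs serially
    on one core at tp_single; the rest runs at the multi-core rate
    tp_parallel from eq. (13). Time per instruction is the weighted mix."""
    return 1.0 / (s / tp_single + (1.0 - s) / tp_parallel)

# sparse_mvm_sym-like case: a 10% serial section erodes multi-core gains
print(throughput_with_serial_fraction(40e9, 4e9, s=0.10))   # ~21 GIPS
```

Even a 10% serial fraction roughly halves the effective throughput in this toy example, consistent with the large sparse_mvm_sym error reported above when serial execution is ignored.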

III. MULTI-CORE PROCESSOR DESIGNS

In optimizing a multi-core processor in Section IV and in exploring the impact of parameter variations on multi-core processor FMAX and throughput in Section VI, three separate multi-core processors are evaluated. These three processors


Fig. 1. Comparison of IPC model projections from (12) with the Asim simulator [12] for a variety of RMS benchmarks [13] versus the number of cores.

contain either small, medium, or large cores to investigate a range of multi-core processor design options. In addition, a traditional single-core processor, containing a monolithic core, is used as a baseline comparison. The small, medium, and large cores are based on the Intel Pentium P54C (in-order) [15], the Intel Pentium III (out-of-order) [16], and the Intel Core 2 (advanced out-of-order) [17] microprocessors, respectively. In Fig. 2, the product introduction technology generation, core area, average $F_{clk}$, normalized average SPECint throughput, cache size, supply voltage ($V_{DD}$), and core power for each core type are summarized based on historical data [15]-[20]. Note that the core area excludes the L2 cache area.
