Parallel Communication Mechanisms for Sparse, Irregular Applications

Frederic T. Chong
Abstract
Parallel systems are becoming a significant computing technology, not only for high performance computing, but also for commodity servers. The goal of this research is to identify core communication mechanisms which both exploit architectural trends and support real applications. We demonstrate that cache-coherent shared memory hardware is such a core mechanism, even for applications with little data re-use and data-driven synchronization. This thesis makes three major contributions. First, we perform an in-depth study of the interaction between communication mechanisms and sparse, irregular applications. Second, we present the Remote Queues (RQ) communication model, an abstraction which synthesizes more efficient synchronization for hardware-supported shared memory and other complex systems. Third, we characterize the relative performance of all of our mechanisms as processor speed and machine size scale.
On the MIT Alewife Multiprocessor, we find that shared memory provides high performance with lower code complexity than message passing on our irregular problems. This is primarily due to four reasons. First, a 5-to-1 ratio between global and local cache misses makes memory copies in bulk communication expensive relative to communication via shared memory. Second, although message passing has synchronization semantics superior to shared memory for data-driven computation, efficient shared memory can overcome this handicap by using global read-modify-writes to change from the traditional owner-computes model to a producer-computes model. Third, the Remote Queues communication model generalizes such a change, providing the semantics and performance of polling active messages on a wide variety of systems. Fourth, bulk transfers can result in high processor idle times in irregular applications.
Finally, we characterize multiprocessor design points where message passing and bulk transfer can perform better than shared memory. In particular, we find that shared memory uses more than four times the network bandwidth of message passing. Unless an application's performance is already limited by local memory speeds, network bandwidth and latency threaten to become a serious problem. Our study indicates that machines based on modern microprocessors, such as the Cray T3E, must resort to expensive, high-dimensional networks to support shared-memory traffic. Furthermore, the round-trip nature of shared memory may not be able to tolerate the latencies of future networks.
Acknowledgments
I would like to thank Anant Agarwal for some great advising over the years. He helped steer me through a rewarding graduate career. His enthusiasm and energy are amazing. Whenever I felt a little burned out, a short meeting with Anant would have me charging for a couple more weeks.

I would also like to thank the rest of my committee. I have enjoyed my interactions with Tom Leighton since my work on interconnection networks for my Master's thesis. A ten-minute talk with Tom can solve problems that might otherwise take months. An early meeting with Joel Saltz at Thinking Machines got me started on irregular applications. His perspective and extensive research group have helped me immensely. Rob Schreiber hosted my stay at NASA and helped me delve into what I saw as numerical voodoo. He taught me to look at sparse matrix computations as graph computations that I could understand more intuitively.

I would also like to thank Tom Knight for advising my master's thesis and helping my growth from back when I was an undergraduate. I had a lot of good times in his group.

I would like to thank Eric Brewer for being a great friend, colleague, and example. I miss the times when we were just a couple floors apart. I think he was the peer that I learned the most from.
My graduate career has been largely a product of the environment here at the lab. For that I have the members of the Alewife group to thank, especially John Kubiatowicz, whose unparalleled knowledge and service form the heart of the group. Thanks to Kirk Johnson, David Chaiken, Ken Mackenzie, Don Yeung, Matt Frank, and David Kranz for attending all those practice talks and giving such good advice. I worked on some great projects with Beng-Hong Lim, Rajeev Barua, Fredrik Dahlgren, and Ricardo Bianchini (the last two, "honorary" members of the group). I also owe a lot to the interactions I have had with many other members of the lab in general. They are too numerous to list completely, but I would especially like to thank Debby Wallach, Kathy Knobe, André DeHon, Steve Keckler, Peter Nuth, Rich Lethin, Stuart Fiske, Greg Ganger, Frans Kaashoek, Charles Leiserson, Bill Weihl, Patrick Sobalvarro, and Alan Edelman. Kathy Knobe, in particular, was a great neighbor and fellow traveler on the interview trail.
I would also like to thank Anne McCarthy and Jeanne Speckman, administrative assistants who were vital to the groups I've worked in.

I have also been fortunate to work with many people at other institutions. One of the first was Shamik Sharma at Maryland, who works way too hard. Lok Liu contributed to the Remote Queues paper. John Gilbert and Shang-Hua Teng helped me with some more numerical knowledge.

Thanks to Tom Leighton and Charles Leiserson for starting the ice hockey tradition that has become such a phenomenon in our lab. I'll miss the sport as I move to warmer climes. And to Tom Simon, hockey fanatic and long-time office-mate, may your skates never rust and your gear never walk away on its own.

Thanks to my family, especially my little sister.

Above all, thanks to Debby.

This research was supported in part by an Office of Naval Research Graduate Fellowship, by ARPA contract N00014-94-1-0985, by NSF grant MIP-9504399, by NSF Experimental Systems grant MIP-9012773, by an NSF Presidential Young Investigator Award to Prof. Anant Agarwal, by Project SCOUT (ARPA Contract MDA972-92-J-1032), and by NASA via Contract NAS2-13721 between NASA and the Universities Space Research Association (USRA).
Contents
1 Introduction
1.1 Multiprocessors
1.2 Irregular Applications
1.3 Mechanisms
1.4 Bias
1.5 Data Transfer
1.6 Synchronization and Remote Queues
1.7 Multiprocessor Design Points
1.7.1 Variations of Bisection Bandwidth
1.7.2 Variations in Latency
1.8 Summary
1.9 Roadmap
2 Applications
2.1 EM3D
2.2 UNSTRUC
2.3 ICCG
2.3.1 Communication in the Kernel
2.3.2 Motivation and Benchmark Matrices
2.3.3 Overview of ICCG
2.3.4 Data Mapping
2.3.5 Multicolor Reordering and Message Aggregation
3 Platforms
3.1 The Alewife Multiprocessor
4 Shared Memory versus Message Passing
4.1 EM3D
4.1.1 EM3D with Message Passing
4.1.2 EM3D with Shared Memory
4.1.3 EM3D Performance Results
4.1.4 Varying Problem Size
4.1.5 Scaling Locality with Problem Size
4.2 UNSTRUC
4.2.1 UNSTRUC with Message Passing
4.2.2 UNSTRUC with Shared Memory
4.2.3 UNSTRUC Performance Results
4.3 ICCG
4.3.1 ICCG with Message Passing
4.3.2 ICCG with Shared Memory
4.3.3 ICCG Performance Results
4.4 Summary
5 Remote Queues
5.1 Polling versus Interrupts
5.2 RQ Definition
5.3 RQ Overview
5.4 Advantages of Polling and Interrupts
5.4.1 Advantages of Polling
5.4.2 Advantages of Interrupts
5.4.3 Advantages of RQ
5.4.4 RQ NI Implementation
5.4.5 RQ NI Performance
5.5 Memory-Based RQ
5.5.1 RQ SM and DMA Implementations
5.5.2 Performance
5.6 RQ Summary
6 Sensitivity to Bandwidth and Latency
6.0.1 Variations of Bisection Bandwidth
6.0.2 Variations in Latency
6.0.3 Communication Volume
6.0.4 Bisection Bandwidth Emulation
6.0.5 Network Latency Emulation
6.0.6 Compute- vs. Memory-bound
7 Related Work
7.1 Mechanisms
7.2 Remote Queues
8 Future Work
8.0.1 Applications
8.0.2 Mechanisms
9 Conclusion
9.1 Performance versus Effort
9.2 Overhead
9.3 Synchronization
9.4 Communication Volume
9.5 Producer-Computes
9.6 Message Polling
9.7 Bulk Transfer
9.8 Conclusion
A Mathematical Techniques
A.1 Motivation
A.2 Direct vs. Iterative Solution
A.3 Substitution
A.4 Partitioned Inverses
A.5 Substitution vs. Partitioned Inverses
List of Figures
1-1 General multiprocessor architecture
1-2 Polling versus interrupting active messages
1-3 Performance versus Coding Effort
1-4 Regions of performance in processor cycles as bisection bandwidth varies
1-5 Regions of performance in processor cycles as network latency varies
2-1 EM3D graph structure
2-2 A DAG representing solution by substitution
2-3 Parallel DAG computation for iterative solution
2-4 Message aggregation during DAG computation
2-5 Multicolor reordering and incomplete Cholesky factorization
3-1 General multiprocessor architecture
3-2 The MIT Alewife multiprocessor
4-1 Ghost nodes in EM3D
4-2 Communication in EM3D
4-3 EM3D shared-memory versus message-passing performance
4-4 EM3D performance as data size scales
4-5 EM3D performance as data size and locality scale
4-6 UNSTRUC shared-memory versus message-passing performance
4-7 Dataflow computation for our DAG
4-8 Message-passing code for DAG execution
4-9 Shared-memory code for DAG execution
4-10 Alewife Speedups
4-11 BCSSTK18 breakdowns
4-12 BCSSTK32 performance with larger buffers
4-13 BCSSTK32 speedups with increasing communication overhead
5-1 Polling versus interrupting active messages
5-2 Avoiding polling problems with the RQ abstraction
5-3 Implementations of the Remote Queue (RQ) model
5-4 Implementing polling-based active messages on RQ
5-5 Remote Queues overview
5-6 Performance benefits of polling on the CM-5
5-7 Selective interrupts in RQ
5-8 enqueue(proc, NIq, arg0, buf, nbytes) on Alewife
5-9 User-level polling and an inlined handler on Alewife
5-10 RQ versus interrupts
5-11 The cost of floating point save/restore
5-12 The benefits of inlining
5-13 RQ benefits on higher overhead machines
5-14 RQ NI versus other mechanisms on EM3D
5-15 RQ in memory-based systems
5-16 Memory-based RQ implementations
6-1 Regions of performance in processor cycles as bisection bandwidth varies
6-2 Regions of performance in processor cycles as network latency varies
6-3 Communication Volume
6-4 I/O Traffic in Cross-Traffic Experiment
6-5 Sensitivity to IO-traffic message length
6-6 Execution Time versus Bisection Bandwidth
6-7 Clock-Emulated Network Latency
6-8 Network Latencies Emulated with Context-Switching
9-1 Performance versus Coding Effort
A-1 A DAG representing solution by substitution
A-2 Fill during inversion
A-3 A partitioned matrix and its column subgraphs
A-4 Computing the solution with partitioned inverses
List of Tables
2.1 Benchmark matrices
2.2 Computations for Parallel ICCG
2.3 Locality, convergence, and rough ICCG timing attributes
2.4 Average message length
3.1 Multiprocessors discussed in this thesis
3.2 Typical shared memory miss penalties on Alewife
4.1 Costs of relevant Alewife operations
4.2 Large benchmark speedups and breakdowns
5.1 Polling and interrupt message reception costs
6.1 Multiprocessor Parameters
6.2 Recalculated Multiprocessor Parameters
Chapter 1
Introduction
Parallel systems are becoming a significant computing technology, not only for high performance computing, but also for commodity servers. These systems, however, are often built with communication mechanisms chosen from intuition or anecdotal experience. Furthermore, much of the focus has been on simple, regular applications which are not representative of many real-world problems. The goal of our research is to identify core communication mechanisms which both exploit architectural trends and support real applications.
A core communication mechanism must provide cost-effective benefits over a range of applications. These benefits can include high performance and ease-of-use. We demonstrate that cache-coherent shared memory hardware is a core communication mechanism, even for applications with little data re-use and with data-driven synchronization. This thesis makes three major contributions:
1. We perform an in-depth study of the interaction between communication mechanisms and sparse, irregular applications.

2. We present the Remote Queues (RQ) communication model, an abstraction which synthesizes more efficient synchronization for hardware-supported shared memory and other complex systems.

3. We characterize the relative performance of all of our mechanisms as network bandwidth and latency vary.
Figure 1-1: General multiprocessor architecture.
with each other? This is the focus of this thesis, in terms of both hardware and software interfaces.
1.2 Irregular Applications
We drive our experiments with an important class of irregular applications, iterative computations on irregular graphs. We focus on irregular computations because problems in the real world are irregular. Examples of irregular graph computations include aircraft and automobile design, oil reservoir simulation, economic modeling, circuit simulation, and most finite element simulation.
For purposes of this thesis, an irregular computation is a computation that involves data structures with indirection; for example, a loop that writes through an index array is an irregular computation, while the same loop over a dense array without indirection is a simple, regular computation (both are sketched below). The source of the indirection in irregular computations is the sparsity, or non-uniformity, of the problem being solved. A simulation of airflow over an airplane wing, for example, has many important data points near the surface of the wing, but very few interesting data points as the air gets farther from the wing. A regular representation as a dense array would result in many "empty" or zero elements, wasting both space and computation.
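As a minimal sketch of the distinction (the array names here are illustrative, not taken from the applications studied in this thesis):

    /* Irregular: each write location is found through the index array
       idx, so the access pattern is known only at run time. */
    void irregular_update(int n, const int *idx,
                          const double *a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[idx[i]] += a[i] * x[i];
    }

    /* Regular: a dense, unit-stride sweep whose access pattern is
       known completely at compile time. */
    void regular_update(int n, const double *a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a[i] * x[i];
    }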
It is important, however, to solve sparse, irregular problems with a solution method that preserves their sparsity. Solution by direct mathematical methods approximately computes the transitive closure of a sparse graph, which can introduce a substantial number of new data elements and new computation (a phenomenon often referred to as fill; see Appendix A). As a result, sparse problems are often solved with iterative methods, which are the focus of this thesis.
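As a small, standard illustration of fill (a textbook example, not one of the benchmark matrices used here): the graph of an "arrow" matrix is a star, and eliminating the center vertex connects every pair of its neighbors, so a single step of direct factorization makes the trailing block completely dense:

    A = \begin{pmatrix} d_1 & a & b & c \\ a & d_2 & 0 & 0 \\ b & 0 & d_3 & 0 \\ c & 0 & 0 & d_4 \end{pmatrix}
    \quad\longrightarrow\quad
    \begin{pmatrix} d_1 & a & b & c \\ a & d_2' & * & * \\ b & * & d_3' & * \\ c & * & * & d_4' \end{pmatrix}

where each * marks a fill entry created by eliminating the first row and column. An iterative method, by contrast, only multiplies by the original matrix and never creates such entries.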
The importance of iterative methods to our study is that the computation, and resulting inter-processor communication, reflect the underlying irregular structure of the problem. This results in a challenging computation for parallel processors. The granularity of the computation is often very fine, which means that the amount of computation per communication operation can be very small.

Regular computations are generally much coarser grained than irregular computations. This is because most regular computations are easy to organize into blocks (e.g., contiguous sections of arrays) for which a single communication operation provides all the data needed to update a block. Even when a regular application contains many dependencies, block-pipelining techniques can be used. Many irregular applications have no known techniques for blocking or aggregating the communication and computation effectively.

Although some of our irregular applications are extreme in their fine granularity, they are both an important class in themselves and serve to emphasize properties also present in many data-parallel and High-Performance Fortran (HPF) programs. Additionally, although our experiments often use hand-optimized, low-level code, our communication techniques are intended for compilers such as those for HPF or data-parallel languages.

This thesis examines three irregular applications. All three applications are computations on irregular graph structures. They vary in available parallelism, ratio of computation to communication (granularity), coding style, and realism. The applications are:
EM3D – an electromagnetic calculation which uses a randomly-generated dataset. The synthetic dataset allows us to easily vary volume and distance of communication.

UNSTRUC – a computational fluid dynamics application which uses real datasets.

ICCG – the forward-solve kernel of a general iterative solver, which is the most challenging of our applications.

These applications are described in detail in Chapter 2.
1.3 Mechanisms
The multiprocessor communication mechanisms discussed in this thesis are:
Shared Memory (sm). More accurately, distributed shared memory is a mechanism by which processors can read and write to each other's physical memory. Such remote memory references require communication between processors. Many of the machines we will examine also support caching to avoid communication when re-using a remote memory location. Since the iterative applications we focus on do not take significant advantage of caching, however, our results are not specific to caches. Many of our results are relevant to the Cray T3D, for example, even though it does not support caching of remote data.
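As a rough sketch of the style (the variable names are illustrative, not code from this thesis; a real implementation would also need a memory fence between the two stores):

    /* Sketch: communication through distributed shared memory. The
       globals are assumed to be allocated in the consumer's physical
       memory; the producer's ordinary stores become remote writes. */
    extern volatile double ghost_value;
    extern volatile int    ghost_ready;

    void produce(double v)
    {
        ghost_value = v;    /* remote write via the memory system */
        ghost_ready = 1;    /* consumer checks this flag for
                               data-driven synchronization */
    }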
Shared Memory with Prefetching (pre-sm). Prefetching is often used to hide the latency of remote shared memory references. When a processor reads or writes to remote memory, communication must take place before the reference can succeed. To reduce processor stall time, the processor can first issue a prefetch for the memory location, do some other computation, and then perform the reference.
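A sketch of the pattern over an indirect reference stream, assuming a hypothetical prefetch() primitive standing in for a machine-specific prefetch instruction:

    /* Sketch: software prefetching for irregular remote references. */
    extern void prefetch(const volatile void *addr);   /* hypothetical */

    double sum_indirect(int n, const int *idx, const volatile double *remote)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + 1 < n)
                prefetch(&remote[idx[i + 1]]);  /* start the next remote
                                                   fetch early */
            sum += remote[idx[i]];  /* this reference was prefetched on
                                       the previous iteration */
        }
        return sum;
    }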
Interrupt-Based Fine-Grained Message Passing (fine-mp). We focus on message-passing communication based on Active Messages. An Active Message is a message that triggers a small procedure, or handler, at the receiving processor. This handler consumes and processes the message: for example, it may add the message data to an accumulation variable. The most intuitive version of Active Messages interrupts the receiving processor and causes the message handler to be immediately executed.
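A sketch with hypothetical primitives (am_send and the handler type follow the spirit of Active Messages rather than any particular machine's interface):

    /* Sketch: an active message names a handler to run on the receiver. */
    typedef void (*am_handler_t)(double arg);

    extern void am_send(int dest, am_handler_t handler, double arg); /* hypothetical */

    volatile double accum = 0.0;   /* accumulation variable on the receiver */

    void accum_handler(double arg)
    {
        /* Under interrupt-based delivery, this runs as soon as the message
           arrives, interrupting the receiver's main computation. */
        accum += arg;
    }

    void send_contribution(int dest, double value)
    {
        am_send(dest, accum_handler, value);
    }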
Polling-Based Fine-Grained Message Passing (poll-mp). As shown in Figure 1-2, an alternative to interrupt-based Active Messages is polling. On the left, when a message arrives, the receiving processor is interrupted from its main thread of computation and the handler is executed. On the right, messages arrive but handler execution is deferred until a specific polling point that the user or compiler has inserted into the application code. This polling point checks for pending messages and runs their handlers.
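Under the polling model the same handlers run only at explicit poll points; a sketch, assuming a hypothetical am_poll() that drains pending messages and runs their handlers inline:

    /* Sketch: polling-based message reception. */
    extern void am_poll(void);          /* hypothetical: run any pending
                                           handlers, then return */
    extern void do_local_work(int i);   /* hypothetical per-iteration
                                           computation */

    void compute_loop(int n)
    {
        for (int i = 0; i < n; i++) {
            do_local_work(i);
            am_poll();   /* deferred handlers execute here, at a point
                            the programmer or compiler chose */
        }
    }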
Figure 1-3: Performance versus Coding Effort. Each point represents an implementation of each application running on 32 processors of the MIT Alewife multiprocessor. Points nearest the origin are the most desirable. (The three panels plot execution time in Mcycles against coding effort, from easier to harder, for EM3D, ICCG, and UNSTRUC, comparing shared memory, shared memory with prefetch, and fine-grained, polling-based, and bulk message passing.)