Parallel Communication Mechanisms for Sparse, Irregular Applications

Frederic T. Chong
Abstract
Parallel systems are becoming a significant computing technology, not only for high performance computing, but also for commodity servers. The goal of this research is to identify core communication mechanisms which both exploit architectural trends and support real applications. We demonstrate that cache-coherent shared memory hardware is such a core mechanism, even for applications with little data re-use and data-driven synchronization. This thesis makes three major contributions. First, we perform an in-depth study of the interaction between communication mechanisms and sparse, irregular applications. Second, we present the Remote Queues (RQ) communication model, an abstraction which synthesizes more efficient synchronization for hardware-supported shared memory and other complex systems. Third, we characterize the relative performance of all of our mechanisms as processor speed and machine size scale.
On the MIT Alewife Multiprocessor, we find that shared memory provides high performance with lower code complexity than message passing on our irregular problems. This is primarily due to four reasons. First, a 5-to-1 ratio between global and local cache misses makes memory copies in bulk communication expensive relative to communication via shared memory. Second, although message passing has synchronization semantics superior to shared memory for data-driven computation, efficient shared memory can overcome this handicap by using global read-modify-writes to change from the traditional owner-computes model to a producer-computes model. Third, the Remote Queues communication model generalizes such a change, providing the semantics and performance of polling active messages on a wide variety of systems. Fourth, bulk transfers can result in high processor idle times in irregular applications.
Finally, we characterize multiprocessor design points where message passing and bulk transfer can perform better than shared memory. In particular, we find that shared memory uses more than four times the network bandwidth of message passing. Unless an application's performance is already limited by local memory speeds, network bandwidth and latency threaten to become a serious problem. Our study indicates that machines based on modern microprocessors, such as the Cray T3E, must resort to expensive, high-dimensional networks to support shared-memory traffic. Furthermore, the round-trip nature of shared memory may not be able to tolerate the latencies of future networks.
Acknowledgments
I would like to thank Anant Agarwal for some great advising over the years. He helped steer me through a rewarding graduate career. His enthusiasm and energy are amazing. Whenever I felt a little burned out, a short meeting with Anant would have me charging for a couple more weeks.

I would also like to thank the rest of my committee. I have enjoyed my interactions with Tom Leighton since my work on interconnection networks for my Master's thesis. A ten-minute talk with Tom can solve problems that might otherwise take months. An early meeting with Joel Saltz at Thinking Machines got me started on irregular applications. His perspective and extensive research group have helped me immensely. Rob Schreiber hosted my stay at NASA and helped me delve into what I saw as numerical voodoo. He taught me to look at sparse matrix computations as graph computations that I could understand more intuitively.

I would also like to thank Tom Knight for advising my master's thesis and helping my growth from back when I was an undergraduate. I had a lot of good times in his group.

I would like to thank Eric Brewer for being a great friend, colleague, and example. I miss the times when we were just a couple floors apart. I think he was the peer that I learned the most from.
My graduate career has been largely a product of the environment here at the lab. For that I have the members of the Alewife group to thank, especially John Kubiatowicz, whose unparalleled knowledge and service form the heart of the group. Thanks to Kirk Johnson, David Chaiken, Ken Mackenzie, Don Yeung, Matt Frank, and David Kranz for attending all those practice talks and giving such good advice. I worked on some great projects with Beng-Hong Lim, Rajeev Barua, Fredrik Dahlgren, and Ricardo Bianchini (the last two, "honorary" members of the group). I also owe a lot to the interactions I have had with many other members of the lab in general. They are too numerous to list completely, but I would especially like to thank Debby Wallach, Kathy Knobe, André DeHon, Steve Keckler, Peter Nuth, Rich Lethin, Stuart Fiske, Greg Ganger, Frans Kaashoek, Charles Leiserson, Bill Weihl, Patrick Sobalvarro, and Alan Edelman. Kathy Knobe, in particular, was a great neighbor and fellow traveler on the interview trail.
I would also like to thank Anne McCarthy and Jeanne Speckman, administrative assistants who were vital to the groups I've worked in.

I have also been fortunate to work with many people at other institutions. One of the first was Shamik Sharma at Maryland, who works way too hard. Lok Liu contributed to the Remote Queues paper. John Gilbert and Shang-Hua Teng helped me with some more numerical knowledge.

Thanks to Tom Leighton and Charles Leiserson for starting the ice hockey tradition that has become such a phenomenon in our lab. I'll miss the sport as I move to warmer climes. And to Tom Simon, hockey fanatic and long-time office-mate, may your skates never rust and your gear never walk away on its own.

Thanks to my family, especially my little sister.

Above all, thanks to Debby.

This research was supported in part by an Office of Naval Research Graduate Fellowship, by ARPA contract N00014-94-1-0985, by NSF grant MIP-9504399, by NSF Experimental Systems grant MIP-9012773, by an NSF Presidential Young Investigator Award to Prof. Anant Agarwal, by Project SCOUT (ARPA Contract MDA972-92-J-1032), and by NASA via Contract NAS2-13721 between NASA and the Universities Space Research Association (USRA).
Contents
1 Introduction
1.1 Multiprocessors
1.2 Irregular Applications
1.3 Mechanisms
1.4 Bias
1.5 Data Transfer
1.6 Synchronization and Remote Queues
1.7 Multiprocessor Design Points
1.7.1 Variations of Bisection Bandwidth
1.7.2 Variations in Latency
1.8 Summary
1.9 Roadmap
2 Applications
2.1 EM3D
2.2 UNSTRUC
2.3 ICCG
2.3.1 Communication in the Kernel
2.3.2 Motivation and Benchmark Matrices
2.3.3 Overview of ICCG
2.3.4 Data Mapping
2.3.5 Multicolor Reordering and Message Aggregation
3 Platforms
3.1 The Alewife Multiprocessor
4 Shared Memory versus Message Passing
4.1 EM3D
4.1.1 EM3D with Message Passing
4.1.2 EM3D with Shared Memory
4.1.3 EM3D Performance Results
4.1.4 Varying Problem Size
4.1.5 Scaling Locality with Problem Size
4.2 UNSTRUC
4.2.1 UNSTRUC with Message Passing
4.2.2 UNSTRUC with Shared Memory
4.2.3 UNSTRUC Performance Results
4.3 ICCG
4.3.1 ICCG with Message Passing
4.3.2 ICCG with Shared Memory
4.3.3 ICCG Performance Results
4.4 Summary
5 Remote Queues
5.1 Polling versus Interrupts
5.2 RQ Definition
5.3 RQ Overview
5.4 Advantages of Polling and Interrupts
5.4.1 Advantages of Polling
5.4.2 Advantages of Interrupts
5.4.3 Advantages of RQ
5.4.4 RQ NI Implementation
5.4.5 RQ NI Performance
5.5 Memory-Based RQ
5.5.1 RQ SM and DMA Implementations
5.5.2 Performance
5.6 RQ Summary
6 Sensitivity to Bandwidth and Latency
6.0.1 Variations of Bisection Bandwidth
6.0.2 Variations in Latency
6.0.3 Communication Volume
6.0.4 Bisection Bandwidth Emulation
6.0.5 Network Latency Emulation
6.0.6 Compute- vs. Memory-bound
7 Related Work
7.1 Mechanisms
7.2 Remote Queues
8 Future Work
8.0.1 Applications
8.0.2 Mechanisms
9 Conclusion
9.1 Performance versus Effort
9.2 Overhead
9.3 Synchronization
9.4 Communication Volume
9.5 Producer-Computes
9.6 Message Polling
9.7 Bulk Transfer
9.8 Conclusion
A Mathematical Techniques
A.1 Motivation
A.2 Direct vs. Iterative Solution
A.3 Substitution
A.4 Partitioned Inverses
A.5 Substitution vs. Partitioned Inverses
List of Figures
1-1 General multiprocessor architecture
1-2 Polling versus interrupting active messages
1-3 Performance versus Coding Effort
1-4 Regions of performance in processor cycles as bisection bandwidth varies
1-5 Regions of performance in processor cycles as network latency varies
2-1 EM3D graph structure
2-2 A DAG representing solution by substitution
2-3 Parallel DAG computation for iterative solution
2-4 Message aggregation during DAG computation
2-5 Multicolor reordering and incomplete Cholesky factorization
3-1 General multiprocessor architecture
3-2 The MIT Alewife multiprocessor
4-1 Ghost nodes in EM3D
4-2 Communication in EM3D
4-3 EM3D shared-memory versus message-passing performance
4-4 EM3D performance as data size scales
4-5 EM3D performance as data size and locality scale
4-6 UNSTRUC shared-memory versus message-passing performance
4-7 Dataflow computation for our DAG
4-8 Message-passing code for DAG execution
4-9 Shared-memory code for DAG execution
4-10 Alewife Speedups
4-11 BCSSTK18 breakdowns
4-12 BCSSTK32 performance with larger buffers
4-13 BCSSTK32 speedups with increasing communication overhead
5-1 Polling versus interrupting active messages
5-2 Avoiding polling problems with the RQ abstraction
5-3 Implementations of the Remote Queue (RQ) model
5-4 Implementing polling-based active messages on RQ
5-5 Remote Queues overview
5-6 Performance benefits of polling on the CM-5
5-7 Selective interrupts in RQ
5-8 enqueue(proc, NIq, arg0, buf, nbytes) on Alewife
5-9 User-level polling and an inlined handler on Alewife
5-10 RQ versus interrupts
5-11 The cost of floating point save/restore
5-12 The benefits of inlining
5-13 RQ benefits on higher overhead machines
5-14 RQ NI versus other mechanisms on EM3D
5-15 RQ in memory-based systems
5-16 Memory-based RQ implementations
6-1 Regions of performance in processor cycles as bisection bandwidth varies
6-2 Regions of performance in processor cycles as network latency varies
6-3 Communication Volume
6-4 I/O Traffic in Cross-Traffic Experiment
6-5 Sensitivity to IO-traffic message length
6-6 Execution Time versus Bisection Bandwidth
6-7 Clock-Emulated Network Latency
6-8 Network Latencies Emulated with Context-Switching
9-1 Performance versus Coding Effort
A-1 A DAG representing solution by substitution
A-2 Fill during inversion
A-3 A partitioned matrix and its column subgraphs
A-4 Computing the solution with partitioned inverses
List of Tables
2.1 Benchmark matrices
2.2 Computations for Parallel ICCG
2.3 Locality, convergence, and rough ICCG timing attributes
2.4 Average message length
3.1 Multiprocessors discussed in this thesis
3.2 Typical shared memory miss penalties on Alewife
4.1 Costs of relevant Alewife operations
4.2 Large benchmark speedups and breakdowns
5.1 Polling and interrupt message reception costs
6.1 Multiprocessor Parameters
6.2 Recalculated Multiprocessor Parameters
Chapter 1
Introduction
Parallel systems are becoming a significant computing technology, not only for high performance computing, but also for commodity servers. These systems, however, are often built with communication mechanisms chosen from intuition or anecdotal experience. Furthermore, much of the focus has been on simple, regular applications which are not representative of many real-world problems. The goal of our research is to identify core communication mechanisms which both exploit architectural trends and support real applications.
A core communication mechanism must provide cost-effective benefits over a range of applications. These benefits can include high performance and ease-of-use. We demonstrate that cache-coherent shared memory hardware is a core communication mechanism, even for applications with little data re-use and with data-driven synchronization. This thesis makes three major contributions:
1. We perform an in-depth study of the interaction between communication mechanisms and sparse, irregular applications.

2. We present the Remote Queues (RQ) communication model, an abstraction which synthesizes more efficient synchronization for hardware-supported shared memory and other complex systems.

3. We characterize the relative performance of all of our mechanisms as network bandwidth and latency vary.
Figure 1-1: General multiprocessor architecture.
with each other? This is the focus of this thesis, in terms of both hardware and software interfaces.
1.2 Irregular Applications
We drive our experiments with an important class of irregular applications, iterative computations on irregular graphs. We focus on irregular computations because problems in the real world are irregular. Examples of irregular graph computations include aircraft and automobile design, oil reservoir simulation, economic modeling, circuit simulation, and most finite element simulation.
For purposes of this thesis, an irregular computation is a computation that involves data structures with indirection; for example, a loop that writes through an index array is an irregular computation, while the same loop over a dense array without indirection is a simple, regular computation (both are sketched below). The source of the indirection in irregular computations is the sparsity, or non-uniformity, of the problem being solved. A simulation of airflow over an airplane wing, for example, has many important data points near the surface of the wing, but very few interesting data points as the air gets farther from the wing. A regular representation as a dense array would result in many "empty" or zero elements, wasting both space and computation.
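As a minimal sketch of the distinction (the array names here are illustrative, not taken from the applications studied in this thesis):

    /* Irregular: each write location is found through the index array
       idx, so the access pattern is known only at run time. */
    void irregular_update(int n, const int *idx,
                          const double *a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[idx[i]] += a[i] * x[i];
    }

    /* Regular: a dense, unit-stride sweep whose access pattern is
       known completely at compile time. */
    void regular_update(int n, const double *a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a[i] * x[i];
    }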
It is important, however, to solve sparse, irregular problems with a solution method that preserves their sparsity. Solution by direct mathematical methods approximately computes the transitive closure of a sparse graph, which can introduce a substantial number of new data elements and new computation (a phenomenon often referred to as fill; see Appendix A). As a result, sparse problems are often solved with iterative methods, which are the focus of this thesis.
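As a small, standard illustration of fill (a textbook example, not one of the benchmark matrices used here): the graph of an "arrow" matrix is a star, and eliminating the center vertex connects every pair of its neighbors, so a single step of direct factorization makes the trailing block completely dense:

    A = \begin{pmatrix} d_1 & a & b & c \\ a & d_2 & 0 & 0 \\ b & 0 & d_3 & 0 \\ c & 0 & 0 & d_4 \end{pmatrix}
    \quad\longrightarrow\quad
    \begin{pmatrix} d_1 & a & b & c \\ a & d_2' & * & * \\ b & * & d_3' & * \\ c & * & * & d_4' \end{pmatrix}

where each * marks a fill entry created by eliminating the first row and column. An iterative method, by contrast, only multiplies by the original matrix and never creates such entries.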
The importance of iterative methods to our study is that the computation, and resulting inter-processor communication, reflect the underlying irregular structure of the problem. This results in a challenging computation for parallel processors. The granularity of the computation is often very fine, which means that the amount of computation per communication operation can be very small.

Regular computations are generally much coarser grained than irregular computations. This is because most regular computations are easy to organize into blocks (e.g., contiguous sections of arrays) for which a single communication operation provides all the data needed to update a block. Even when a regular application contains many dependencies, block-pipelining techniques can be used. Many irregular applications have no known techniques for blocking or aggregating the communication and computation effectively.

Although some of our irregular applications are extreme in their fine granularity, they are both an important class in themselves and serve to emphasize properties also present in many data-parallel and High-Performance Fortran (HPF) programs. Additionally, although our experiments often use hand-optimized, low-level code, our communication techniques are intended for compilers such as those for HPF or data-parallel languages.

This thesis examines three irregular applications. All three applications are computations on irregular graph structures. They vary in available parallelism, ratio of computation to communication (granularity), coding style, and realism. The applications are:
EM3D – an electromagnetic calculation which uses a randomly-generated dataset. The synthetic dataset allows us to easily vary volume and distance of communication.

UNSTRUC – a computational fluid dynamics application which uses real datasets.

ICCG – the forward-solve kernel of a general iterative solver, which is the most challenging of our applications.

These applications are described in detail in Chapter 2.
1.3 Mechanisms
The multiprocessor communication mechanisms discussed in this thesis are:
Shared Memory (sm). More accurately, distributed shared memory is a mechanism by which processors can read and write to each other's physical memory. Such remote memory references require communication between processors. Many of the machines we will examine also support caching to avoid communication when re-using a remote memory location. Since the iterative applications we focus on do not take significant advantage of caching, however, our results are not specific to caches. Many of our results are relevant to the Cray T3D, for example, even though it does not support caching of remote data.
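As a rough sketch of the style (the variable names are illustrative, not code from this thesis; a real implementation would also need a memory fence between the two stores):

    /* Sketch: communication through distributed shared memory. The
       globals are assumed to be allocated in the consumer's physical
       memory; the producer's ordinary stores become remote writes. */
    extern volatile double ghost_value;
    extern volatile int    ghost_ready;

    void produce(double v)
    {
        ghost_value = v;    /* remote write via the memory system */
        ghost_ready = 1;    /* consumer checks this flag for
                               data-driven synchronization */
    }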
Shared Memory with Prefetching (pre-sm). Prefetching is often used to hide the latency of remote shared memory references. When a processor reads or writes to remote memory, communication must take place before the reference can succeed. To reduce processor stall time, the processor can first issue a prefetch for the memory location, do some other computation, and then perform the reference.
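A sketch of the pattern over an indirect reference stream, assuming a hypothetical prefetch() primitive standing in for a machine-specific prefetch instruction:

    /* Sketch: software prefetching for irregular remote references. */
    extern void prefetch(const volatile void *addr);   /* hypothetical */

    double sum_indirect(int n, const int *idx, const volatile double *remote)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + 1 < n)
                prefetch(&remote[idx[i + 1]]);  /* start the next remote
                                                   fetch early */
            sum += remote[idx[i]];  /* this reference was prefetched on
                                       the previous iteration */
        }
        return sum;
    }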
Interrupt-Based Fine-Grained Message Passing (fine-mp). We focus on message-passing communication based on Active Messages. An Active Message is a message that triggers a small procedure, or handler, at the receiving processor. This handler consumes and processes the message: for example, it may add the message data to an accumulation variable. The most intuitive version of Active Messages interrupts the receiving processor and causes the message handler to be immediately executed.
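A sketch with hypothetical primitives (am_send and the handler type follow the spirit of Active Messages rather than any particular machine's interface):

    /* Sketch: an active message names a handler to run on the receiver. */
    typedef void (*am_handler_t)(double arg);

    extern void am_send(int dest, am_handler_t handler, double arg); /* hypothetical */

    volatile double accum = 0.0;   /* accumulation variable on the receiver */

    void accum_handler(double arg)
    {
        /* Under interrupt-based delivery, this runs as soon as the message
           arrives, interrupting the receiver's main computation. */
        accum += arg;
    }

    void send_contribution(int dest, double value)
    {
        am_send(dest, accum_handler, value);
    }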
Polling-Based Fine-Grained Message Passing (poll-mp). As shown in Figure 1-2, an alternative to interrupt-based Active Messages is polling. On the left, when a message arrives, the receiving processor is interrupted from its main thread of computation and the handler is executed. On the right, messages arrive but handler execution is deferred until a specific polling point that the user or compiler has inserted into the application code. This polling point checks for pending messages and runs their handlers.
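Under the polling model the same handlers run only at explicit poll points; a sketch, assuming a hypothetical am_poll() that drains pending messages and runs their handlers inline:

    /* Sketch: polling-based message reception. */
    extern void am_poll(void);          /* hypothetical: run any pending
                                           handlers, then return */
    extern void do_local_work(int i);   /* hypothetical per-iteration
                                           computation */

    void compute_loop(int n)
    {
        for (int i = 0; i < n; i++) {
            do_local_work(i);
            am_poll();   /* deferred handlers execute here, at a point
                            the programmer or compiler chose */
        }
    }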
Figure 1-3: Performance versus Coding Effort. Each point represents an implementation of each application running on 32 processors of the MIT Alewife multiprocessor. Points nearest the origin are the most desirable. (The three panels plot execution time in Mcycles against coding effort, from easier to harder, for EM3D, ICCG, and UNSTRUC, comparing shared memory, shared memory with prefetch, and fine-grained, polling-based, and bulk message passing.)