Network Interface Multicast Protocols for Wormhole-based Networks of Workstations


Cosimo Anglano a, Claudio Casetti b, Emilio Leonardi b, Fabio Neri b

a Dipartimento di Informatica, Università del Piemonte Orientale, Alessandria, Italy

b Dipartimento di Elettronica, Politecnico di Torino, Torino, Italy

1 Introduction

In recent years Networks of Workstations (NOWs) have been increasingly used as the execution platform for parallel applications. The dramatic advances in low-cost computing technology, together with the availability of new high-speed Local Area Networks (LANs), have enabled the construction of NOWs whose raw processing and communication performance are comparable to those of traditional massively parallel processors. Wormhole routing LANs have attracted significant attention from the parallel computing community, as witnessed by a significant number of research projects focusing on these LANs (e.g., the Avalanche Project [1], the SHRIMP Project [3], and the HPVM Project [2], only to name a few), since they are able to deliver to parallel applications performance higher than other LAN technologies such as ATM or Fast/Gigabit Ethernet [4,17]. An important example of a wormhole LAN is Myrinet [7], which will be taken as the reference architecture in this paper.

The importance of multicast communication, where the same message is delivered from a source node to an arbitrary number of destination nodes, is well understood. Its applications range from the implementation of collective communication operations [19], directly supported by systems like MPI [10], to a variety of system-level functions, such as synchronization barriers and cache coherence in distributed shared memory systems, among others. In spite of that, however, wormhole switches currently support only unicast (one-to-one) communication in hardware. In such systems, therefore, multicast must be implemented at the software level by sending one or more unicast messages among the involved hosts [18,20,22]. Efficient communication among the hosts is critical to the performance of a NOW since, regardless of how well the computation is distributed among the processors, communication overhead can severely limit the speedup. Software multicast is particularly prone to high overheads, since each unicast transmission requires an interaction between the host and its Network Interface (NI), which is usually rather expensive since the NI is located on the I/O bus rather than on the memory bus. By executing the multicast algorithm on the programmable processor available on the NI of several modern high-speed LANs [7,25], it is however possible to greatly reduce the number of host-NI interactions, so that significant reductions of the multicast latency may be achieved [5,12,14,16]. To orchestrate the interactions among the network interfaces of the various hosts involved in the multicast, Network Interface multicast protocols [6] are necessary.

Two issues are involved in the design of network interface multicast protocols. The first one, which has been studied in many papers [18,22,20], is concerned with the multicast algorithm used to schedule the various unicast communications with the goal of minimizing the multicast latency. Most multicast algorithms arrange the hosts involved in a multicast communication in a tree-like structure to dictate the ordering of the communication. In particular, upon receiving a multicast message, each host passes it to the application and uses unicast communication to forward it to a subset of other hosts, so that different destinations may be reached in parallel. A second issue, which is in many regards more critical than the previous one, is flow control, that is, the management of the software buffer space available at the NIs, which is required in order to obtain reliable delivery of multicast worms. As a matter of fact, the amount of memory available on NIs is usually limited (because of its cost), so an unrestricted use of this scarce resource may easily result in buffer overflows, with consequent losses of packets. Although reliable delivery of multicast messages can be obtained by retransmitting packets, the cost of retransmissions is usually unacceptable in high-speed networks, since very large buffers are necessary at the source to recover losses. Also, providing extra memory to the interfaces does not solve the problem, since it adds delay problems without avoiding losses upon congestion. By using a suitable flow control scheme that stalls senders before buffer overflow occurs, wormhole LANs obtain reliability without retransmitting packets. Circular waits are however possible, and deadlocks may occur: a suitable way to avoid deadlocks is therefore required.

In this paper we address the problem of designing reliable and efficient Network Interface multicast protocols. This problem has already been addressed by several researchers. Verstoep et al. [24] propose a flow control scheme in which all the buffers needed for the various unicasts required by a multicast communication are allocated before the operation is actually started, and use a binomial tree to forward the unicast messages among hosts (although trees with arbitrary topology are allowed). Delaying a multicast communication until all the necessary buffers are available guarantees the absence of circular waits and, hence, of deadlocks. The presence of a central resource manager performing buffer allocation, however, makes this scheme poorly scalable [5,12]. Also, latency tends to increase when multiple multicasts take place in the same multicast group, since in these situations the probability of finding all the needed buffer space simultaneously available decreases very quickly.

Bhoedjang et al. [5] propose a flow control scheme in which buffer pre-allocation is performed between pairs of hosts, and not on all the hosts of the multicast group. That is, each time a host has to forward a unicast message to another host, it must acquire a buffer on the receiving side. If no buffers are available, the message is blocked. Again, multicast trees with general topology are allowed. Deadlocks are assumed to be rare, so no special care is taken to avoid them; deadlock detection and recovery is used instead. One problem of this approach is that the NI memory must be statically partitioned among all the hosts, so more memory is needed as the number of hosts increases. Also, the assumption of a low deadlock probability may not be satisfied in situations in which there are medium-to-large multicast groups and several multicast communications take place simultaneously (such as, for instance, in the all-to-all communication pattern). In these cases, the cost of deadlock recovery may become prohibitive.

A different approach, which does not present the above drawbacks, has been proposed by Gerla et al. [12]. This approach is based on a flow control technique, more consistent with the wormhole philosophy, in which each multicast message acquires the resources it needs as it is forwarded in the network, and becomes blocked when resources are not available. Deadlock freedom is achieved by assigning a unique numerical identifier to each host and dividing the receive buffers into two classes. The first class is used when a message travels from a host to a higher-numbered one, the second when going in the opposite direction.

In this paper we address the problem of designing efficient multicast protocols that follow the paradigm outlined in [12]. We propose a flow control technique (called FIFO ACK/NACK) that extends the one proposed in [12] in order to improve multicast performance. Starting from this strategy, we develop four Network Interface multicast protocols, obtained by combining the FIFO ACK/NACK technique with multicast algorithms specifically conceived for Network Interface protocols. Moreover, we investigate the effects that several factors have on the performance of the above protocols. As a matter of fact, to determine the best combination of flow control and multicast algorithm for a particular NOW, several factors must be considered, such as the complexity of the above algorithms, the performance of unicast communication, some characteristics of the network (such as the link speed and the topology), and the presence of interfering unicast and multicast traffic. In order to understand the tradeoffs involved in the above design, we compare the performance of the protocols we propose for different combinations of the above factors. Interestingly, in contrast with the results reported in [12], our study shows that simple multicast algorithms (e.g., separate addressing) may often outperform more sophisticated algorithms usually considered more effective, such as those based on spanning trees.

The paper is organized as follows. In Section 2 we discuss the features of the systems under consideration. In Section 3 we discuss flow control and deadlock avoidance for network interface multicast protocols, and we present our flow control technique. In Section 4 we present some deadlock-free multicast algorithms, whose performance is studied by simulation in Section 5. Finally, Section 6 concludes the paper and outlines future research directions.


2 Wormhole-based Networks of Workstations

Before describing our multicast protocols, let us summarize the main features of the systems under consideration. As anticipated in the Introduction, our wormhole network model is inspired by Myrinet, although in several cases we assume features that, albeit feasible, are not currently supported by the Myrinet hardware.

In this paper we consider NOWs comprising a set of hosts interconnected by a switched wormhole LAN with an arbitrary topology. Each host is connected to the network through a Network Interface, which is in turn connected to a switch via a bidirectional link. As in Myrinet, we assume that the NI is equipped with a programmable processor, and is attached to the I/O bus of the host. The NI is assumed to operate in cut-through [15] mode, that is, it can forward an incoming worm to a switch without having to completely assemble it. Moreover, we assume that a NI cannot backpressure incoming messages, that is, an incoming message must be either drained from the network or discarded, but it cannot be blocked on a link connecting the NI to a switch. We assume that for unicast messages the network uses wormhole switching, backpressure flow control, and deadlock-free source routing, which are briefly recalled in the remainder of this section.

Wormhole Switching

In wormhole switching [21] the unit of information transfer is a worm, which can range in length from a few bytes to several thousand bytes. Each intermediate switch forwards a worm to the desired output port (if available) as soon as its head is received, without waiting for the entire worm to be assembled. Thus, the worm can stretch across several nodes and links at any one time. When the desired output port is busy the worm is blocked, and this information is propagated upstream using backpressure.

Backpressure Flow Control

Backpressure is an explicit node-to-node flow control mechanism that requires bidirectional network links. A fixed-size buffer, called a slack buffer, is associated with each receiver in the nodes. Two thresholds are associated with each slack buffer: the stop threshold K_s and the go threshold K_g (K_g ≤ K_s). Upon blocking (due to contention for a desired output port) the buffer starts to fill up. When the buffer fills up to K_s, a stop signal is sent to the sending node, which stops transmitting the remainder of the worm. When the desired port becomes available the slack buffer is drained, and when K_g is reached a go signal is sent to the sender, which resumes the worm transmission. With backpressure, worms may thus be blocked on their way to the destination if no resources are available along the path. This avoids worm losses due to congestion but, as circular waits are possible, deadlocks may arise [9,21]; hence deadlock-free routing must be employed. An alternative flow control method is deflection [11], where worms that cannot proceed because the desired output port is busy are misrouted (deflected) to another switch.
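The stop/go hysteresis described above can be sketched as follows; the class name, the flit-level granularity, and the threshold values are illustrative assumptions, not taken from any real switch firmware.

```python
class SlackBuffer:
    """Sketch of one slack buffer with stop/go thresholds (hysteresis)."""

    def __init__(self, capacity, k_stop, k_go):
        assert 0 <= k_go <= k_stop <= capacity
        self.capacity, self.k_stop, self.k_go = capacity, k_stop, k_go
        self.occupancy = 0
        self.sender_stopped = False

    def receive_flit(self):
        """A flit arrives while the desired output port is blocked."""
        self.occupancy += 1
        if self.occupancy >= self.k_stop and not self.sender_stopped:
            self.sender_stopped = True      # emit 'stop' upstream
            return "stop"
        return None

    def drain_flit(self):
        """The output port frees and one buffered flit is forwarded."""
        self.occupancy -= 1
        if self.occupancy <= self.k_go and self.sender_stopped:
            self.sender_stopped = False     # emit 'go' upstream
            return "go"
        return None
```

The gap between the two thresholds prevents the stop/go signals from oscillating on every flit.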

Deadlock-Free (Unicast) Routing

Two popular approaches used to prevent deadlocks in wormhole networks [11] are Up/Down routing (which is adopted in the Autonet [23] and in Myrinet), and Virtual Channels routing [9].

In Up/Down routing, a spanning tree is constructed on the topology, and all hosts use this tree to determine routes to all other hosts. One of the nodes is chosen arbitrarily as the root of the spanning tree, and all links of the topology are designated 'up' or 'down' links with respect to this root. A link is 'up' if it points from a lower- to a higher-level node in the tree (i.e., to a node at a shorter distance from the root); otherwise, it is 'down'. For nodes at the same level, node identifiers break the tie. Routes consist of zero or more 'up' moves (towards the root) followed by zero or more 'down' moves (away from the root). This prevents circular waits and guarantees deadlock-free routes, at the cost of non-minimum-distance routing and of a performance degradation due to congestion around the root of the tree.
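As a minimal sketch of the up-before-down rule (node levels and the tie-breaking on node identifiers follow the description above; the function and its inputs are invented for illustration):

```python
def is_updown_legal(route, level):
    """Check that a route (list of node ids) makes all its 'up' moves
    before any 'down' move. level[n] is n's distance from the
    spanning-tree root; ties are broken by node identifier."""
    seen_down = False
    for a, b in zip(route, route[1:]):
        # The hop a -> b is 'up' if it moves closer to the root
        # (or, at equal level, towards the smaller identifier).
        up = (level[b], b) < (level[a], a)
        if up and seen_down:
            return False          # an 'up' move after a 'down' move
        if not up:
            seen_down = True
    return True
```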

In the Virtual Channels routing approach, each link of the topology is split into a number of virtual channels, i.e., groups of channels that share a physical link. Each virtual channel has its own slack buffer and flow-control logic. Thus, the buffer space is statically partitioned according to the number of virtual channels, while the bandwidth available on the physical channel can be dynamically allocated to virtual channels. Physical channels that belong to a cycle in the network are decomposed into a group of virtual channels, and routing is restricted only in the sense that the proper virtual channel must be selected when a physical link must be crossed.

In this paper we assume that any one of the above approaches can be used to ensure deadlock freedom for unicast communication. The impact of each of them on the performance of multicast protocols will be discussed in Section 5.


Source Routing

Each source host uses the deadlock-free routing algorithm to compute a route (a series of switch output port numbers) for a given destination host. This source route is placed in the header of the worm that is sent out by the source. As the worm comes in on an input port of a switch, the leading byte is read to identify the desired output port. The leading byte is then stripped, and the remaining part of the worm is sent to the next switch.
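A per-hop sketch of this header stripping, assuming a one-byte output-port identifier per switch (real Myrinet headers differ in detail):

```python
def switch_forward(worm):
    """One switch hop under source routing: read the leading byte to
    pick the output port, strip it, and forward the remainder."""
    port, rest = worm[0], worm[1:]
    return port, rest

# A route of three switch output ports followed by the payload.
worm = bytes([2, 5, 1]) + b"payload"
taken = []
for _ in range(3):            # three switches on the path
    port, worm = switch_forward(worm)
    taken.append(port)
# The header shrinks by one byte at each switch until only the
# payload reaches the destination NI.
```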

3 Multicasting at the Network Interface Level

Let us now discuss the issues involved in the design of Network Interface multicast protocols. In what follows we assume that several multicast groups may be simultaneously active in the network, that any host in a multicast group can generate (independently from any other host) a multicast message, and that a given host may belong to multiple multicast groups.

3.1 Basic Operation

As already mentioned in the Introduction, the basic idea of NI multicast protocols is to implement multicast communications by means of unicast worms (henceforth referred to as multicast worms). Each host uses a multicast algorithm to compute, for each multicast group it belongs to and for any other host of that group, the subset of hosts to which an incoming multicast worm must be forwarded, and stores this information in its multicast group routing table. Each multicast worm carries information about the multicast group it refers to and about the source host. Upon receiving an incoming worm, the NI processor (a) identifies the worm as a multicast worm (by the multicast group identifier carried in the header), (b) determines (from the multicast group routing table) the subsequent host(s) to which the worm must be forwarded, and (c) transmits the worm to each one of the immediate successors in cut-through mode and, at the same time, copies the worm to the local host. For the sake of simplicity, we consider only the case in which the forwarding of multicast messages is done by the members of the multicast group, although this is not necessarily the optimal solution.
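Step (b) above amounts to a table lookup; a minimal sketch, with the table contents, field names, and host/group identifiers invented for illustration:

```python
# Multicast group routing table of one NI: for each (group, source)
# pair, the hosts to which an incoming multicast worm is relayed.
routing_table = {
    ("grp7", "h0"): ["h4"],   # worms from h0 in grp7 are relayed to h4
    ("grp7", "h4"): [],       # worms from h4 terminate at this host
}

def next_hops(worm_header):
    """Step (b): look up the immediate successors for an incoming worm."""
    group, source = worm_header["group"], worm_header["source"]
    return routing_table.get((group, source), [])
```

Worms carrying an unknown (group, source) pair map to no successors, i.e., they are only delivered locally.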

Fig. 1. Operations in a small multicast group

As an example, let us examine the operations in a small multicast group. Let us assume that hosts h0, h3, and h4 (see Fig. 1) belong to the same multicast group, that the path from h0 to h3 passes through switches s0, s1, s2, and s3, and that the multicast transmission is initiated by host h0. Also, let us assume

that the hosts are arranged into a linear list, that is, a unicast message is sent from h0 to h3 first, and then from h3 to h4. Let us first consider a situation where no other messages will use switches {s0, ..., s4} during the transmission of the multicast message generated by h0. In this case, h0 generates the worm and sets its destination to the nearest member of the multicast group, which, according to the multicast routing tables, turns out to be host h3. It then sends a multicast worm from its interface to switch s0, whence it is sent to switch s3 and host h3. Through a header information check, host h3 finds out that it belongs to the multicast group to which the worm is directed, and starts copying the worm to its local memory. At the same time, it checks whether the worm has reached every host in the group; since this is not the case, the worm is not only copied, but also forwarded to host h4 in cut-through mode, i.e., even before its reception is complete. Eventually, the worm completes its transfer by being copied to host h4, which does not forward it any further, since all hosts in the group have been reached (h4 deduces this information from the joint knowledge of the multicast group and of the originator node).

The above example assumed that the multicast worm could be forwarded through the interested switches and hosts without having to compete for buffering and/or transmission resources in the network. However, if a worm cannot be successfully forwarded because other worms are using the resources it needs, it can be blocked in a host. The network interface must in this case provide full buffering for at least one worm in its memory. Failure to provide full worm reassembly buffering at each intermediate network interface can potentially lead to path deadlocks [12] if the NI can backpressure incoming worms, or to worm losses and consequent retransmissions if not. For example, in Fig. 2 the multicast worm X, directed to some host reachable through switch S4, and being forwarded through switches S1, S2, S3 and through the NIs of hosts A and B, is blocked at switch S3 by another worm (say, unicast worm Y). Worm Y is instead forwarded by switches S3, S4, and S5, and is blocked at switch S1 by worm X. If the network interface of host B has insufficient buffering to store worm X and can backpressure it, a deadlock occurs. If, conversely, backpressure is not possible (as assumed in this paper), B must drop worm X; but since A does not have enough buffering space to store X, reliable delivery is compromised and worm retransmission from the source host may be required. This situation can be avoided by reserving at A enough buffer space to store worm X entirely.


Fig. 2. Reliability problems caused by insufficient buffering at Network Interface B. Grey squares represent switches, continuous lines represent the paths occupied by worms, and dotted lines represent the paths that worms need to traverse.

3.2 The FIFO ACK/NACK Flow Control Technique

As already mentioned in the Introduction, Network Interface protocols require that flow control be exerted among NIs in order to avoid buffer overflows. The FIFO ACK/NACK flow control technique we propose in this paper extends the ACK/NACK protocol of Gerla et al. [12], which uses a fast buffer allocation mechanism in which three short "control" worms (ACK, NACK, READY) are exchanged between a sending and a receiving NI. To allow NI processors to know in advance the amount of buffer space necessary to store an incoming worm, the header of each worm (both unicast and multicast) carries an indication of the worm size. Let us describe the ACK/NACK protocol using the example depicted in Fig. 3, where it is assumed that network interface A has secured enough buffering to store the entire worm X, and is forwarding it to B (Fig. 3(a)). If B also has enough buffer space, the worm is accepted and acknowledged with an ACK control worm; otherwise B drops the incoming worm and notifies A of the refusal by means
of a NACK control worm (Fig. 3(e)). In the meantime, A has been storing the incoming worm (and also copying it to its local host). Upon receiving a NACK, A stops the transmission to B, which will be resumed (from the beginning of the worm) when it gets a READY control worm from B (Fig. 3(f)), which is generated when the buffer clears, or after a time-out. Note that since eventually the buffer in B will clear (buffer deadlock prevention is discussed below), sooner or later A will get the READY worm. In this way, neither can deadlocks due to NI backpressure (if permitted) occur, nor do worms get discarded, since in the worst case the entire worm will be stored in A, waiting to be transmitted.

The ACK/NACK protocol described above is a simple form of window protocol with unit windows (i.e., a stop-and-wait protocol). It can severely limit network performance when the host-to-host path exhibits a large latency (note that blocking along the path contributes to this latency) and the destination host buffer has a non-negligible probability of being busy (i.e., there is a non-negligible worm discard probability). As a matter of fact, if a buffer is unavailable, the NACK issued by the corresponding network interface can prime a chain reaction in which a sequence of NACKs is propagated upstream, thus causing a large number of retransmissions that may excessively hamper performance. Note also that the higher the latency, the higher the retransmission cost, since a large amount of data may have been transmitted before the NACK reaches the source. Another situation in which this protocol may yield unsatisfactory performance is when the different host-to-host hops in the same path exhibit different latencies. If the slowest hop lies in the middle of the path, it will very quickly become overwhelmed by the upstream hosts, and will generate a large number of NACKs, each of them resulting in a retransmission.

Our FIFO ACK/NACK policy improves multicast performance by allowing NIs to buffer K ≥ 1 incoming multicast worms. Each NI buffer is managed as a FIFO reception queue. The ACK control worm is issued only when the corresponding multicast worm hits the head of this FIFO queue, while the NACK control worm is issued as soon as an incoming multicast worm must be discarded due to lack of buffer space in the NI. The ACKs are delayed in order to achieve a simple throttling mechanism when a host has to send several worms to another host, as happens for instance with communication patterns like all-to-all communication. In these cases, if an ACK is sent too early, the sending host may flood the receiving host, thus increasing the chance of filling all the buffer positions. The situation may get even worse if the receiving host belongs to more than one multicast group. The READY control message is sent in response to the oldest NACKed multicast worm only when all multicast worms but one have cleared the buffer; it is not issued earlier in order to avoid undesired queue build-ups, with consequent increases in delay, when the network traffic grows. This flow control mechanism exploits the availability of a larger amount of memory to reduce the probability of discarding worms at the NI without having to face delay penalties in congested situations, thereby


significantly improving network performance, as will be shown later.
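The receive-side behaviour of FIFO ACK/NACK can be sketched as follows; the class interface is an assumption, while the head-of-queue ACK rule, the NACK-on-discard rule, and the all-but-one READY rule follow the description above.

```python
from collections import deque

class FifoAckNack:
    """Sketch of the FIFO ACK/NACK receive side with K buffer slots.
    Each method returns the control worm(s) the NI would emit."""

    def __init__(self, k):
        self.k = k
        self.queue = deque()       # FIFO of buffered worms (by sender)
        self.nacked = deque()      # senders waiting for a READY

    def worm_arrives(self, sender):
        if len(self.queue) >= self.k:
            self.nacked.append(sender)
            return ("NACK", sender)        # worm discarded, sender holds
        self.queue.append(sender)
        if len(self.queue) == 1:           # worm is at the head of the FIFO
            return ("ACK", sender)
        return None                        # ACK delayed until head is reached

    def head_cleared(self):
        """The head worm has been consumed (forwarded and delivered)."""
        self.queue.popleft()
        out = []
        if self.queue:
            out.append(("ACK", self.queue[0]))     # new head worm is ACKed
        if len(self.queue) <= 1 and self.nacked:   # all worms but one cleared
            out.append(("READY", self.nacked.popleft()))
        return out
```

With K = 2, a third arriving worm is NACKed; its sender is released by a READY only once the queue has drained down to a single worm.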

3.2.1 Dealing with deadlocks

Fig. 4. Buffer deadlock caused by competing multicast worms

While both ACK/NACK and FIFO ACK/NACK guarantee the absence of path deadlocks, they may cause buffer deadlocks. This situation can occur, for

example, when two hosts are trying to send each other a worm, as depicted in Fig. 4: neither buffer becomes available, because each one is waiting for the other to complete its transmission. Situations like this may occur when two different hosts in the multicast group to which A and B belong have originated different messages, or, if A and B are members of more than one multicast group, when worms X and Y correspond to different multicast groups. As discussed in [12], a simple way of preventing this type of deadlock is to sort the hosts according to some criterion (for instance by assigning them different numerical identifiers, henceforth denoted as IDs) and to make all multicast worms propagate from lower-ID hosts to higher-ID hosts. Since multicast messages may originate from an arbitrary host in the group, and not exclusively from the lowest-ID host, at least one order reversal (i.e., a transmission from a higher-ID to a lower-ID host) may occur. To handle this situation, the multicast buffers in each network interface are divided into two classes: the first (or lower) class is used before the reversal; the second (or upper) class after the reversal. For example, if multicast worms propagate along a chain of hosts sorted by increasing ID, the lower buffer class is used until the highest-ID host is reached. At this point, the worm is forwarded to the lowest-ID host, and the upper buffer class is used. Likewise, upper-class buffers are used at subsequent hosts, until the tour is completed. Fig. 5 shows a two-buffer-class allocation example: the multicast message is originated by host 11 and terminates its trip at host 10. The two-buffer deadlock prevention rule is similar to handling two virtual channels at the NI: each NI, in addition to having enough memory to store two maximum-size worms and the corresponding incoming ACK/NACK/READY control worms, must maintain two completely independent state machines to control retransmissions. Although more sophisticated deadlock avoidance techniques are possible if more than two buffers are used, we stick to the two-buffer scheme for the sake of simplicity in the NIs.
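The class-selection rule can be sketched as one per-hop decision; the function interface is invented, and the ordered chain 11 → 12 → 13 → 10 is an assumed reading of the Fig. 5 example (a group of hosts 10 through 13, originated at host 11).

```python
def buffer_class(prev_id, next_id, reversed_yet):
    """Which buffer class a multicast worm uses on the hop
    prev_id -> next_id: lower class before the single ID-order
    reversal, upper class from the reversal hop onwards."""
    if next_id < prev_id:
        reversed_yet = True            # the one allowed order reversal
    return ("upper" if reversed_yet else "lower"), reversed_yet

# Worm originated by host 11 on the assumed chain 11 -> 12 -> 13 -> 10.
path = [11, 12, 13, 10]
classes, rev = [], False
for a, b in zip(path, path[1:]):
    cls, rev = buffer_class(a, b, rev)
    classes.append(cls)
```

Once the reversal flag is set it stays set, so every hop after the highest-ID host draws from the upper class, as the two-class rule requires.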


Fig. 5. Deadlock prevention using two buffer classes

3.2.2 Assigning numerical identifiers to the Network Interfaces

From the above discussion it should be clear that the choice of host ordering is critical, since the end-to-end latency of a multicast worm depends on the order in which it visits the various hosts. The assignment of host IDs can be done either separately for each multicast group in the NOW, or once for the entire network. In the former case, it is in general easier to find an ordering that minimizes the latency inside a multicast group, but each host must provide in its NI two buffers for each multicast group it belongs to. Conversely, in the latter case each host must provide only two buffers regardless of the number of multicast groups it belongs to, but it is in general harder to find an ordering that is optimal for all the multicast groups. To avoid complexity, in particular in the presence of dynamic multicast groups, and to put reasonable requirements on the amount of memory that should be available on each NI, in this paper we choose the second alternative, that is, host IDs are assigned for the whole network.

Of course, to minimize latency one should assign adjacent IDs to hosts that are as close as possible to each other from a topological standpoint. To identify the global host ordering, it may be convenient to consider the graph of host connectivity induced on the network topology: this is a complete graph whose nodes are the hosts of the topology, and whose edges are weighted with some metric accounting for the cost of the resources used in traversing the unicast path between the hosts (we simply use the Hamming distance, i.e., the hop count of the corresponding path), taking into account possible routing limitations. Each edge of the host connectivity graph therefore represents a simple path in the network graph. The transformation from the network graph to the host connectivity graph is illustrated for an example network in Fig. 6, assuming minimum-distance routing and unit cost for each traversed link.

The global ordering of hosts determined by finding the shortest Hamiltonian path on the host connectivity graph induces a reasonable host ordering, although it does not guarantee that the latency within a particular multicast group is minimized. It is important to point out that the above task is equivalent to the Travelling Salesman Problem, so, although for simple topologies like the 2D torus it might be straightforward, for general topologies its complexity may make the search for an optimal solution impractical. In these cases, therefore, it might be necessary to resort to some suitable heuristics (for instance, if the problem can be reduced to the Euclidean version of the TSP, the heuristics proposed in [8] may be used). We will not dig further into the problem of finding the host ordering, assuming in the following that the ordering is provided as an input to the multicast strategies.

Fig. 6. Transformation from the network graph to the host connectivity graph
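As one possible heuristic (our illustration, not a method used in the paper), a greedy nearest-neighbour pass over the host connectivity graph produces a Hamiltonian path quickly, with no optimality guarantee:

```python
def order_hosts(dist, start):
    """Greedy nearest-neighbour Hamiltonian path on the host
    connectivity graph. dist[a][b] is the hop count of the unicast
    path between hosts a and b (the edge weight described above)."""
    order, left = [start], set(dist) - {start}
    while left:
        here = order[-1]
        nxt = min(left, key=lambda h: dist[here][h])  # closest unvisited host
        order.append(nxt)
        left.remove(nxt)
    return order
```

The resulting order can then be used directly as the global ID assignment.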

4 Network Interface Multicast Algorithms

The purpose of a multicast algorithm is to schedule the various unicast transmissions so as to minimize the latency of multicast communications and maximize their throughput. As should be evident from the previous section, a multicast algorithm that has to be coupled with an ACK/NACK-like flow control technique must employ a buffer deadlock avoidance strategy. The lack of this characteristic makes the multicast algorithms for wormhole-based NOWs already proposed in the literature (e.g., [18,22]) unsuitable for our purposes. In this section, we consider three general strategies for constructing multicast algorithms: (1) multiple unicast worms issued by the source host; (2) the linear path, where each host in the multicast group sends a copy of the message to at most one other host; and (3) the rooted tree, where one host can replicate the message to several hosts. Starting from the linear path and the rooted tree strategies, we develop four different multicast algorithms, while the multiple unicasts approach does not require any buffer deadlock avoidance technique. These approaches have their drawbacks and tradeoffs, of course. While sending multiple copies seems more appealing for small, scattered multicast groups, the other strategies entail a shorter transmission activity at the originator, and can exhibit superior performance depending on the topology and on the traffic load, as will be shown later on. Indeed, since network interfaces are equipped with a single transmitter (and a single receiver), when multiple copies are to be forwarded by a particular NI, they must be transmitted one after the other. This means that, while the first copy can be forwarded in cut-through mode, the second and subsequent copies must wait for the transmission of the previous copy to complete. This serialization of the multiple copies can remarkably increase latencies, especially at low loads. Note that the multiple unicast and the linear path can be viewed as extreme cases of the tree approach: the multiple unicast is a tree of depth one, while the linear path is a skewed tree (i.e., of width one).

4.1 Multiple unicasts at the source

The first, straightforward multicast algorithm that we consider has the originator NI take it upon itself to send a unicast copy of the multicast worm to each group member except itself. This is similar to how broadcast is currently implemented in Myrinet. The advantage of this strategy is simplicity: since relaying of the multicast worm is not necessary, no buffer deadlock can occur, hence neither an ACK/NACK protocol nor buffer reservation is necessary. On the negative side, one may object that the approach does not scale well, as several resources are needlessly used for large multicast groups, and the latency is increased due to the serialization of the copies at the single source transmitter.

4.2 Multicasting on a linear path

In the case of multicasting over a linear path, we form a directed circuit among the hosts that are in the multicast group. There are two possible approaches to transmitting around the circuit: ring-like routing and bus-like routing. Both require the two-buffer scheme to prevent buffer deadlocks.

For the ring-like routing, hosts originating a multicast transmission are required to send only one worm, which will visit group members one by one. Assuming that there is a host ordering inside the group (which reflects the global host ordering of the network), we force multicast worms to travel from lower ID hosts to higher ID hosts initially using only the lower class of the NI buffers. Since a multicast transmission does not necessarily start from the lowest ID host, a reversal of the order is required at some point. In order to avoid buffer deadlocks, we force worm transmissions onto upper class buffers after an order reversal has occurred (but hosts are always traversed in increasing ID order after the reversal). The worm header must store, in addition to the information for unicasting, a multicast group identifier and a hop count. At each step the hop count is decremented, and retransmission is halted when zero is reached. The information required at each host is a table of multicast groups in which each entry consists of the group identifier, the address of the next hop, and the hop length of the circuit. All hosts must map the multicast group identifier to the next hop; the originator must also initialize the hop count. We have two variants of the ring-like scheme: the multicast worm may be retransmitted until it returns to its originator, or the last hop may be omitted. The first method has the advantage of providing confirmation of successful multicast, at the slight expense of additional bandwidth use. In the case in which deadlock prevention is not strictly enforced, this facility could provide (combined with timeout and retransmission) the guarantee of reliable delivery.

For the bus-like variant of the linear path approach, two worm transmissions are required at the source network interface (except for the lowest and highest ID hosts), one towards higher ID hosts, the other towards lower ID hosts. Host operations upon reception of a worm remain the same. As a rule, worms propagating from lower to higher ID hosts use upper buffers, whereas worms traveling from higher to lower ID hosts use lower buffers. When the worm reaches either the lowest or the highest ID host, it is not forwarded any further.

It is worth pointing out that if the multicast messages within a group must be delivered in sequence (i.e., total ordering), these algorithms have to be modified, since a typical multicast message will originate from an arbitrary host and will travel along the linear path in ascending order. The consequence is that if two arbitrary hosts inject independent messages into the linear path at the same time, the sequencing is violated, since the destinations will get the messages in different orders, depending on their position in the circuit.
The sequencing of messages can, however, be easily enforced by requiring that the originator send the message to the lowest ID host, which then starts the multicast. This way, the lowest ID host serializes all the transmissions (but also tends to be more loaded).
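The per-host relaying step of the ring-like scheme can be sketched as follows (the header tuple and table layout are hypothetical illustrations of the state described above, not the actual wire format):

```python
# Sketch: ring-like forwarding decision at one host. The worm header carries
# the multicast group ID, the remaining hop count, and the buffer class in
# use; each NI keeps a table mapping group IDs to the next hop on the circuit.
def forward_ring_worm(header, my_id, group_table):
    """Return (next_hop, new_header), or None when the circuit is exhausted."""
    group_id, hops_left, buf_class = header
    if hops_left - 1 == 0:
        return None  # hop count reached zero: stop retransmitting
    next_hop = group_table[group_id]
    # The single permitted order reversal (higher ID -> lower ID) forces the
    # worm onto the upper buffer class for the rest of its journey.
    if next_hop < my_id:
        buf_class = "upper"
    return next_hop, (group_id, hops_left - 1, buf_class)
```

For instance, host 5 relaying a worm of group 7 to host 9 keeps the lower class, while relaying it to host 2 (the order reversal) switches to the upper class.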

4.3 Multicasting on a rooted tree

Intuitively, the latency of multicast messages on linear paths should be reduced by allowing some hosts to make copies of an incoming worm and to forward them separately to different subsets of the multicast group. The delivery of a multicast worm can in this case be described as the traversal of a rooted tree, i.e., a spanning tree of the host connectivity graph whose root is the host which originates the worm, while the remaining nodes are the other hosts of the multicast group. Each node has as many child nodes as the number of different copies it generates, and these children are the hosts to which the various copies are transmitted. As an example, let us consider the delivery of a multicast worm generated by host 7, whose rooted tree is depicted in Fig. 7.


Fig. 7. Rooted multicast tree

In this case, the root node (host 7) transmits three different copies to hosts 5, 10, and 12, respectively. Host 5 in turn retransmits two separate copies to hosts 9 and 15, while host 10 does not retransmit any copy (it is a leaf in the tree).

One may note that, if the load is light (i.e., there is no contention for bandwidth or buffers), the tree can lead to higher latency than multicasting on a linear path because of the serialization of multiple copies at hosts with two or more children. A limitation on the maximum branching factor (i.e., on the number of children) limits this drawback. On the other hand, if the load is heavy, practically requiring full worm reassembly at almost every host, the parallelism of the tree will help reduce latency. The tree will also help reduce latency when cut-through operation in the network interface is not available. Note that this scheme requires that k different trees be built for each multicast group made up of k hosts, and that it does not guarantee the total ordering of multicast worms. If total ordering is required, an alternative approach in which a single tree is used for each multicast group is possible. In this case, the tree can be formed over the host connectivity graph restricted to the hosts in the multicast group, using the lowest ID host as the root of the tree: hosts must be sorted by increasing ID from the root down, according to the global host numbering. That is, the "children" must have higher IDs than the "parent". This scheme provides total ordering, but in general results in a higher latency than the previous one, so we will not investigate it any further.

As in the case of multicasting over a linear path, in order to avoid buffer deadlocks the two-buffers rule must be used, and the routing (i.e., the rooted tree) must be restricted accordingly. We have two possibilities, which are in some sense analogous to the bus-like and the ring-like schemes of the linear path. In the first case, which is assumed in Fig. 7 and is called up-down tree, any directed path from the source host (the root) to any node in the tree must be made up of first zero or more hops with decreasing IDs and then zero or more hops with increasing IDs (for instance, in Fig. 7, paths 7→12→27 and 7→5→15→49 are legal paths, while path 5→15→49→41 would not be allowed, since it has a decreasing sequence following an increasing sequence). Transmissions from hosts with higher IDs to hosts with lower IDs use the lower buffer class, while the upper class is used for transmissions in the opposite direction (i.e., from lower to higher IDs).

In the second case, which is called wrap-around tree, all paths traverse network interfaces in increasing ID order, except for one possible transmission from a higher ID host to a lower ID host, where a switch in the buffer class takes place. Worms are initially transmitted in lower buffers, and continue their journey in upper buffers after the switch. In both cases, a worm inverts its direction at most once, so buffer deadlocks cannot arise.
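The legality constraints on the two routings can be checked in a few lines (a sketch of our own; host IDs are plain integers):

```python
# Sketch: legality checks for the two deadlock-free tree routings.
# An up-down path has zero or more decreasing-ID hops followed by zero or
# more increasing-ID hops; a wrap-around path is increasing in ID order
# except for at most one "wrap" from a higher to a lower ID.
def is_updown_path(ids):
    turned = False  # becomes True once an increasing hop has been seen
    for a, b in zip(ids, ids[1:]):
        if b > a:
            turned = True
        elif turned:  # decreasing hop after an increasing one: illegal
            return False
    return True

def is_wraparound_path(ids):
    wraps = sum(1 for a, b in zip(ids, ids[1:]) if b < a)
    return wraps <= 1

# Examples from the text: 7->12->27 and 7->5->15->49 are legal up-down
# paths, while 5->15->49->41 is not.
```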

4.3.1 Computation of Multicast Trees

The multicast tree corresponding to a given source node can be computed from the host connectivity graph after a suitable ordering has been determined (as discussed in Section 3.2.2). Starting from the originator of the multicast message, a tree comprising all the hosts of the multicast group is built, with the constraint of using only backward→forward host ID sequences in the up-down case, and only forward sequences, with at most one wrap-around in the ID sequence, in the wrap-around case.

Given a source host belonging to a multicast group, it is possible to build several multicast trees which differ from each other in the latency of the multicast messages injected by the source. The optimal multicast tree (OMT) [22] can be defined as a multicast tree which minimizes the latency of the multicast worm generated by its root, i.e., the time taken to reach the last host of the group. In order to build an OMT, a tradeoff between the cost of longer paths and the cost of serialization at branching points must be considered. As a matter of fact, sometimes a longer path (i.e., a path crossing more transmission links) may be more effective than a shorter one in which several serializations take place. In some special cases, such as when all the arcs of the host connectivity graph have equal weight, and when no load-dependency of the transmission and serialization costs is assumed, techniques like dynamic programming may be used to compute an OMT in polynomial time, as for instance done in [22]. In more general cases, however, the network topology is irregular, so the arc weights are not identical. Moreover, both the transmission cost and the serialization cost heavily depend on the network load, so the computation of OMTs is in general computationally intractable, since it would require the exploration of a state space whose size is exponential in the number of hosts in the multicast group. Therefore, we propose a simple heuristic algorithm, based upon the host connectivity graph, which computes in polynomial time suboptimal multicast trees in which the influence of the network load is neglected.

To determine a (suboptimal) multicast tree, we compute a minimum-cost spanning tree of the host connectivity graph (restricted to the hosts of the multicast group, under the constraints dictated by the two-buffers scheme). Following a standard approach, each path in a multicast tree is associated with a cost directly proportional to the latency of the multicast message following that path, and the cost of the multicast tree is defined as the cost of its largest-cost path. In computing path costs, we neglect traffic-dependent delays, i.e., delays due to backpressure blocking and ACK/NACK retransmissions. We define the cost of a path v_1, v_2, ..., v_n as

Σ_{i=1}^{n−1} ( N_q[v_i] × CT + T_X(v_i, v_{i+1}) )        (1)

where N_q[v_i] is the order of transmission at host v_i of the copy of the multicast destined to host v_{i+1} (i.e., the cost of serialization at branching points), CT is the copy time, that is, the time taken by a single copy operation (one worm transmission time), and T_X(v_i, v_{i+1}) is the cost of a worm transmission from v_i to v_{i+1} (equal to the propagation delay, since we assume cut-through mode of operation).
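Eq. 1 transcribes directly into code (a sketch; the dictionaries `nq` and `tx` are hypothetical containers for the per-host transmission orders and per-link costs):

```python
# A direct transcription of the path cost of Eq. 1, with load-dependent
# delays neglected: nq[v] is the transmission order at host v of the copy
# sent to the next hop, ct the copy time, tx[(u, v)] the link cost.
def path_cost(path, nq, tx, ct):
    return sum(nq[u] * ct + tx[(u, v)] for u, v in zip(path, path[1:]))

# For path 7 -> 5 -> 15 with CT = 10, a first-in-queue copy at host 7 and a
# second-in-queue copy at host 5 (illustrative link costs 3 and 4):
# (1*10 + 3) + (2*10 + 4) = 37.
cost = path_cost([7, 5, 15], {7: 1, 5: 2}, {(7, 5): 3, (5, 15): 4}, 10)
```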

Computation of Wrap-Around Trees

The algorithm for the computation of the wrap-around multicast tree is shown in Fig. 8, where the following notation has been used. The children of host j are contained in the list CHILDREN[j]; this list contains a pair (x, class) if host x is a child of host j and buffer class class is used when a multicast worm is forwarded from j to x. CHILDREN[j][x].class denotes the element class of the pair (x, class) contained in list CHILDREN[j], while the symbol ⊕ denotes list concatenation. Also, the function parent(j) returns the parent of host j in the multicast tree (the parent of the root node is defined to be the root itself). Finally, the IDs of the N hosts are contained in the array ID[], whose elements are assumed to be arranged in increasing order.

Starting from host i (the root of the multicast tree), the algorithm examines each host in the multicast group in increasing ID order, continuing with host 1 after host N has been considered. The lowest cost path from host i to a generic host j is determined by comparing the cost of a direct copy from i to j with the cost of forwarding the worm to any host k preceding j (for which a path has already been computed) and then making an additional direct copy from k to j. The above costs are computed according to Eq. 1, where N_q[j] is estimated by assuming that k forwards the worm to j in the order specified by the algorithm; that is, if k has to forward a copy to another host j′ that the algorithm "assigned" to it before j, then the copy for j′ is performed before the one for j. The buffer class used to forward multicast worms from host to host is chosen according to their IDs: if ID[k] > ID[j], then the only permitted order reversal takes place, and the upper class is used. If instead ID[k] < ID[j], the worm keeps the buffer class it was using when it reached k (the lower class before the reversal, the upper class afterwards).

Fig. 8. Algorithm for the computation of wrap-around multicast trees
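The greedy construction just described can be sketched as follows (our own rendering, with hypothetical names; link costs and CT are supplied by the caller, and load effects are neglected as stated above):

```python
# Sketch: greedy wrap-around tree construction. Hosts are examined in
# increasing ID order with wrap-around starting from the root; each new host
# is attached under the already-placed host yielding the cheapest path,
# charging CT once per copy already queued at that host (cf. Eq. 1).
def compute_wraparound_tree(root, ids, tx, ct):
    """ids: sorted group host IDs; root: index into ids; tx[(a, b)]: link cost."""
    n = len(ids)
    order = [(root + s) % n for s in range(n)]   # root first, then increasing IDs
    cost = {root: 0.0}
    copies = {j: 0 for j in range(n)}            # copies already queued per host
    cls = {root: "lower"}                        # buffer class held at each host
    children = {j: [] for j in range(n)}
    for j in order[1:]:
        best_k = min(cost, key=lambda k: cost[k] + tx[(ids[k], ids[j])]
                                         + ct * (copies[k] + 1))
        cost[j] = cost[best_k] + tx[(ids[best_k], ids[j])] + ct * (copies[best_k] + 1)
        copies[best_k] += 1
        # the single permitted ID-order reversal switches to the upper class
        cls[j] = "upper" if ids[best_k] > ids[j] else cls[best_k]
        children[best_k].append((j, cls[j]))
    return children, cost
```

With distance-like link costs, any child attached below its parent's ID correctly inherits the upper buffer class, and every group member ends up with exactly one parent.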

To give a concrete example, let us consider the multicast group made up of hosts with IDs 5, 8, 15, 20, and 25, whose transmission costs are indicated in the host connectivity graph of Fig. 9(a), and let us describe the construction of the multicast tree having host 15 as its root (that is, i = 3), which is shown in Fig. 9(b). In this example, we have assumed that the copy time CT is equal to 10.

Fig. 9. (a) Host connectivity graph of a simple multicast group, and (b) the corresponding wrap-around tree with root 15.

In the first step, j = 4 (since, as indicated in the algorithm in Fig. 8, the index j starts from i + 1) and k = {3}, so cost[4] = cost[3] + T_X(3, 4) + CT × N_q[3].

Fig. 10. Algorithm for the computation of up-down multicast trees

Fig. 11. Up-down multicast tree for the multicast group of Fig. 9.

5 Simulation Experiments

In this section we present selected simulation results to assess the performance of the various multicast protocols presented in the paper. We consider two different network topologies: an 8×8 2D-torus (see Fig. 12(a), where each node comprises a worm switch and an attached host), and a hierarchical star comprising 32 hosts (see Fig. 12(b), where boxes and circles depict worm switches and hosts, respectively). We have performed several sets of experiments, each one aimed at evaluating the impact of a specific factor on the performance of the multicast algorithms presented in the previous section. In all the

