Routing Algorithms for IBM SP1
Bulent Abali¹ and Cevdet Aykanat²

¹ IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598
² Bilkent University, 06533 Ankara, Turkey

Abstract. Scalable multicomputers are based upon interconnection networks that typically provide multiple communication routes between any given pair of processor nodes. Routes must be selected for communication so that the load is distributed evenly among the links and switches to prevent congestion in the network. We describe the route selection algorithm used in the IBM 9076 SP1 multicomputer. We then describe a new algorithm for reducing network congestion and compare the two algorithms.
1 Introduction

Scalable multicomputers are based upon interconnection networks that typically provide multiple communication routes between any given pair of processor nodes. Multiple routes provide low latency, high bandwidth, and reliable interprocessor communication. In such networks, the selection of routes is an important problem because of its impact on communication performance. Routes must be selected so that the communication load is distributed evenly among the links and switches to prevent congestion in the network. In this paper we describe the route selection algorithm used in the IBM 9076 SP1 multicomputer. We then describe an experimental algorithm for reducing network congestion and compare the two algorithms. In the remainder of this section we give an overview of the SP1 network architecture. In Section 2 we describe the SP1 routing algorithm. In Section 3 we describe the experimental routing algorithm, and in Section 4 we compare the two algorithms.
1.1 The SP1 Network Architecture

The 9076 SP1 is a commercially available multicomputer whose communication architecture is based upon the Vulcan architecture [1]. The SP1 processor nodes attach to a multistage interconnection network consisting of 8-input, 8-output non-blocking switches [1]. The switch chip shown in Fig. 3 consists of 8 receiver and 8 transmitter modules, an unbuffered 8×8 crossbar, and a 1-KByte central queue. Each input and output port consists of 8 data lines and 2 control lines. Processor nodes communicate by sending and receiving message packets. Packets are variable length, up to 255 bytes in size. The method of packet
transfer is similar to wormhole routing [2], with the difference that when a packet is blocked its bytes are not buffered in place but are temporarily transferred to the central queue until the blocked output port clears. The method of packet transfer also differs from the virtual cut-through technique [3, 2] in that flow control is byte based, not packet based. When there is no output contention, packet bytes pass through the switch chip via the crossbar in 5 clock cycles. Packets are formatted such that the first byte of each packet indicates the packet length, followed by a number of routing bytes, followed by data. The source routing technique is used for routing packets [4]. In this technique, the source processor node determines the complete route and puts the respective route bytes in the packet. As the packet proceeds to its destination, each switch chip examines the first route byte of the packet and determines the destination output port. The switch chip also strips off the portion of the routing information pertaining to itself. The packet has no route bytes remaining upon arriving at the destination node. In the SP1 implementation, the switch chip operates at 40 MHz, resulting in a peak bandwidth of 40 MB/s per port, and ports may be interconnected with cables over 100 feet in length, which makes it easy to construct large networks. In the network implementations, switch chip input port i and output port i are paired together to form a full-duplex bidirectional channel. The resulting 4×4 bidirectional switch element can forward a packet to any of the 8 output ports, including the output ports on the same side as the input port (called "turn-around routing"). In that respect, the SP1 network topologies differ from more commonly known unidirectional multistage interconnection networks (MIN) such as the Omega and indirect binary n-cube networks [5, 6]. Bidirectionality enhances the modularity, fault-tolerance, and diagnosis of the network as described in [1]. Eight switches placed in a 2-stage configuration interconnected with a shuffle form the switch board shown in Fig. 4. The switch board provides full connectivity; it can route a packet from any of its 32 input ports to any of its 32 output ports. Switch boards may be interconnected in various ways to construct larger networks. A 16-node network is constructed using only one switch board, with the 16 processor nodes attached to the left-hand side of the board and the 16 ports on the right-hand side unused. A 32-node network is constructed using two switch boards whose right-hand sides are interconnected with straight wires. 128-node and 256-node network examples are shown in Fig. 5 and Fig. 6. Custom network topologies of any size can be constructed easily due to the interconnect technology used.
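To make the source-routing mechanism concrete, the following is a minimal sketch of how a switch stage could consume the leading route byte of a packet. It is not the SP1 hardware behavior; the packet layout details (for example, whether the length byte is adjusted as route bytes are stripped) are assumptions made only for illustration.

```python
# Hypothetical packet layout: [length, route_byte_0, route_byte_1, ..., data...]
# Each switch pops the first route byte, uses it as its output port, and
# forwards the remainder; the destination node sees no route bytes.

def make_packet(route_ports, payload):
    """Build a packet: length byte, then route bytes, then data bytes."""
    body = list(route_ports) + list(payload)
    assert len(body) + 1 <= 255, "SP1 packets are at most 255 bytes"
    return [len(body) + 1] + body

def switch_forward(packet):
    """Return (output_port, stripped_packet) as one switch stage would."""
    length, first_route_byte, rest = packet[0], packet[1], packet[2:]
    # Assumption: the length byte is decremented when a route byte is stripped.
    return first_route_byte, [length - 1] + rest

# Example: a route through two switches (output ports 3, then 5) with 4 data bytes.
pkt = make_packet([3, 5], [0xDE, 0xAD, 0xBE, 0xEF])
port, pkt = switch_forward(pkt)    # first switch forwards out of port 3
port2, pkt = switch_forward(pkt)   # second switch forwards out of port 5
print(port, port2, pkt)            # remaining packet holds only length + data
```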
2 The SP1 Routing Algorithm

We developed the SP1 routing algorithm originally for the Vulcan prototype [1]. A modified version of the algorithm is also used in IBM's recently announced SP2 multicomputer. The SP1 routing algorithm is a simple algorithm that selects a single shortest path between each pair of processor nodes, although multiple shortest paths may exist. In that respect, the SP1 routing
algorithm is comparable to the commonly known XY routing algorithm for 2-dimensional meshes and the e-cube routing algorithm for hypercubes [2]. In a 2-dimensional mesh, the XY routing algorithm uses the single route that goes along the X dimension first and then along the Y dimension, although two nodes have (h_x + h_y)! / (h_x! h_y!) different shortest paths from one to another, where h_x and h_y are the internode distances in the X and Y dimensions, respectively. In the hypercube topology, the e-cube algorithm uses the single route that goes along the increasing order of dimensions, although two nodes with a Hamming distance of k have k! shortest paths from one to another. Shortest-path routing is not necessarily the best choice for all communication patterns [7]. However, in the absence of any information on communication patterns, we decided to use shortest paths since fewer switches and links would be used. We use the modified Breadth-First Search algorithm shown in Fig. 1 for building a breadth-first spanning (BFS) tree rooted at each source node (src), and then we follow the spanning tree paths to find the shortest paths from the source node to the rest of the processor nodes. The algorithm is originally due to [8] and uses a first-in, first-out (FIFO) queue Q for the breadth-first search. We added a simple static load balancing strategy to ensure that links are included in the selected routes in a balanced manner. The network graph G = (V, A) is represented by a linked list of vertices. Each vertex v ∈ V represents a processor node or a switch, and each arc e ∈ A represents a half-duplex link. Only the non-faulty links and switches are represented in G. The direction of an arc indicates the direction of message transmission. Each switch vertex has a maximum in-degree of 8 and out-degree of 8, and each processor vertex has an in-degree of 1 and out-degree of 1. The u.parent field indicates the parent of vertex u in the spanning tree, and u.distance indicates the distance of vertex u to the root (the source node) of the tree. The u.port[i] field indicates the vertex attached to output port i of vertex u, and hence also represents the arc from vertex u to vertex u.port[i]. Load balancing is facilitated by the u.portusage[i] field, which indicates how many times an output port has been used during route generation. While building a spanning tree from a given source node, each time a source-destination path is found, the portusage field is incremented for each output port on that path. The usage counts of the ports determine the order of the breadth-first search from the next source node, such that from a given vertex v we first visit the vertices adjacent to the least frequently used output ports (i.e., those with the smallest counts), which is accomplished by sorting the port usage counts in lines 10-17. The routes are stored in a route table in each processor's memory. The route table approach enables routing to be done in a topology-independent fashion. Note also that by design the SP1 routing algorithm does not assume a topology, whereas the e-cube and XY routing algorithms assume hypercube and 2-dimensional mesh topologies, respectively. The topology independence property is important for fault-tolerance and scalability; missing links and switches are handled properly by the SP1 routing algorithm, and larger networks with different topological properties can be implemented easily without having to change the
RTG(G)   /* Route Table Generator */
 1  for each vertex u ∈ V[G]
 2      for i ← 0 to 7
 3          u.portusage[i] ← 0
 4  for each vertex src ∈ V[G]
 5      BFS_RTG(G, src)

BFS_RTG(G, src)
 1  for each vertex u ∈ V[G]
 2      u.visit ← WHITE
 3      u.distance ← 0
 4      u.parent ← NIL
 5  src.visit ← GRAY
 6  ENQUEUE(Q, src)
 7  while Q ≠ ∅
 8      u ← head[Q]
 9      if u.type = SWITCH then
10          for i ← 0 to 7
11              index[i] ← i
12          for j ← 7 downto 1
13              for i ← 0 to j - 1
14                  if u.portusage[index[i]] > u.portusage[index[i+1]] then
15                      tmp ← index[i]
16                      index[i] ← index[i+1]
17                      index[i+1] ← tmp
18          for j ← 0 to 7
19              i ← index[j]
20              v ← u.port[i]
21              if v ≠ NIL and v.visit = WHITE then
22                  v.visit ← GRAY
23                  v.distance ← u.distance + 1
24                  v.parent ← u
25                  v.parentport ← i
26                  ENQUEUE(Q, v)
27      if u.type = PROCESSOR then
28          TRACEBACK(G, u)
29      DEQUEUE(Q)
30      u.visit ← BLACK

TRACEBACK(G, u)
 1  while u.distance ≠ 0
 2      v ← u.parent
 3      i ← u.parentport
 4      v.portusage[i] ← v.portusage[i] + 1
 5      u ← v
Fig. 1. The SP1 algorithm for route selection
routing hardware or the algorithm. Although the SP1 routing algorithm attempts to include links in the routes in a balanced manner, it does not base its routing decisions on any measured or estimated network traffic. Therefore, the SP1 routing algorithm is non-adaptive, as are the e-cube and XY routing algorithms. Adaptive routers are known to perform better than non-adaptive routers in general, at the price of somewhat increased switch complexity [9, 7]. However, in the experiments we observed that the SP1 routing algorithm realizes many commonly used communication patterns without link conflicts for some network topologies. In the SP1 multicomputer, network topologies are generally designed to be deadlock-free [2] with shortest-path routes. For example, all the topologies used in the experiments reported in this paper are deadlock-free with shortest-path routes. However, we have some experimental topologies that may cause deadlock cycles due to "turn-around routing", where a packet enters and leaves a switch from the same side (Craig Stunkel: private communication). In such cases we eliminate the deadlock-causing routes by putting routing restrictions on some switches while generating the routes.
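For illustration, the following is a minimal executable sketch of the route-selection idea in Fig. 1: a breadth-first search per source node, visiting neighbors in order of least-used output port and incrementing port-usage counters along each selected path. The graph representation and function names are simplifications for this sketch, not the SP1 implementation.

```python
from collections import deque

def generate_routes(neighbors, processors):
    """Sketch of Fig. 1: per-source BFS with static port-usage balancing.

    neighbors[u] is a list of (port, v) pairs for every vertex u (switches and
    processors); processors is the set of processor vertices. Returns a dict
    routes[(src, dst)] = list of (vertex, output_port) hops.
    """
    portusage = {u: {p: 0 for p, _ in neighbors[u]} for u in neighbors}
    routes = {}
    for src in processors:
        parent = {src: None}              # parent[v] = (u, port used to reach v)
        queue = deque([src])
        while queue:
            u = queue.popleft()
            # Visit least-used output ports first (the load-balancing heuristic).
            for port, v in sorted(neighbors[u], key=lambda pv: portusage[u][pv[0]]):
                if v not in parent:
                    parent[v] = (u, port)
                    queue.append(v)
            if u in processors and u != src:
                # Trace the path back to src, recording hops and bumping usage.
                hops, w = [], u
                while parent[w] is not None:
                    p, port = parent[w]
                    portusage[p][port] += 1
                    hops.append((p, port))
                    w = p
                routes[(src, u)] = list(reversed(hops))
    return routes

# Tiny example: two processors attached to one switch "s" by full-duplex links.
neighbors = {"p0": [(0, "s")], "p1": [(0, "s")],
             "s": [(0, "p0"), (1, "p1")]}
print(generate_routes(neighbors, {"p0", "p1"}))
```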
3 An Experimental Routing Algorithm

We developed an experimental algorithm for adaptive route selection in SP1 networks. We were motivated by the fact that although the SP1 switch is not designed for adaptive routing, the multiplicity of routes between any pair of nodes would allow us to make better routing decisions if estimates of the network traffic were available. We assume that the network traffic is represented by a Node Interaction Graph (NIG). The NIG is a directed graph whose vertices represent the processor nodes and whose arcs represent interprocessor communication. NIG arcs may have weights that denote the amount of information transmitted from the source node to the destination node. The NIG model may appear unrealistic for general applications since it does not model the temporal interactions between the processor nodes. However, a large class of applications, such as the iterative solution of systems of equations that arise in numerical
computing, may be represented with NIGs. See [10, 11] for examples. When all vertices of the NIG have an in-degree and out-degree of 1, the NIG is called a permutation routing. NIGs may be obtained in several ways, such as users or compilers supplying NIGs based on the expected program behavior, or the operating system supplying NIGs based on the history of the system workload. In the experimental algorithm, the route selection problem is formulated as minimization of the cost function

    cost = \sum_{\ell \in L} W_\ell^2 + K \sum_{s \in S} W_s^2        (1)
where L is the set of all links, W_\ell is the total flow through link \ell, S is the set of all switches, and W_s is the total flow through switch s. The nonlinear
cost function penalizes the links and switches with higher flow. For example, n messages each with a unit flow routed over one link will contribute n² units to the cost, whereas the same n messages routed over n different links will contribute n units to the cost. K ≥ 0 is the weight of the total switch penalty and is a hardware-dependent constant. K ≠ 0 is used to minimize switch sharing. In some switch designs, messages sharing switch resources such as a central queue may impact performance, and this may be taken into account in the cost function by a nonzero constant K derived empirically or by analysis. A cost function similar to Eq. 1 was used in [12] for routing in networks with virtual cut-through capability. However, Eq. 1 differs from that of [12] in that the second term, due to switch sharing, does not exist in [12]. Furthermore, the distance metric that we use in our algorithm is based on the number of network hops, whereas in [12] it is based on link utilization.

ROUTER(NIG, G)
    let R be the set of all routes, where R[i][j] is the set of routes from node i to node j
    for each arc e = (src, dst, flow) ∈ NIG
        select an initial route r ∈ R[e.src][e.dst]
        add e.flow to the links and switches on the path of route r
        update cost
    previous_cost ← ∞
    n_trials ← NTRY    /* try routes with the same cost a number of times */
    while previous_cost > cost or n_trials ≠ 0
        previous_cost ← cost
        for each arc e = (src, dst, flow) ∈ NIG
            cost ← ROUTE_ONE_EDGE(e, G, R)
        if cost = previous_cost then
            n_trials ← n_trials - 1
        else
            n_trials ← NTRY

ROUTE_ONE_EDGE(e, G, R)
    rip up the previously selected route for e and update cost
    find a route r ∈ R[e.src][e.dst] with the smallest incremental cost;
        if there are multiple such routes, select one randomly
    update G and cost

Fig. 2. The adaptive algorithm for route selection

A brief sketch of the adaptive algorithm is given in Fig. 2. The objective is to minimize the cost. For each communication arc (s, d, f) ∈ NIG an initial route is selected, where f is the required amount of flow from node s to node d. After the initial selection of routes, the total cost is calculated. Then, sequentially for
each arc (s, d, f) ∈ NIG, the previously selected route is ripped up and a new route with smaller incremental cost is selected from the set of routes R[s][d]. The procedure is repeated iteratively until the cost converges to a local minimum. The algorithm is guaranteed to converge because the cost is monotonically nonincreasing. If the cost from the previous iteration does not change, the algorithm does not terminate immediately but allows a different set of routes with the same cost to be tried a bounded number of times (NTRY = 2 in this case) in anticipation of further cost reduction in the next iteration. For the topologies we used, the route set R[s][d] consists of all deadlock-free shortest-path routes from node s to node d. However, in richer topologies a restricted subset of the routes between nodes s and d may also be considered, because the number of routes may become quite large, increasing the execution time.
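As an illustration of the rip-up-and-reroute idea (not the actual SP1 tooling; the route and flow representations below are assumptions made for this sketch), the following minimizes the cost of Eq. 1 with K = 0 over a fixed set of candidate routes per arc:

```python
import math, random

def link_cost(link_flow):
    """Cost of Eq. 1 with K = 0: sum of squared link flows."""
    return sum(w * w for w in link_flow.values())

def adaptive_router(nig, candidate_routes, ntry=2):
    """nig: list of (src, dst, flow); candidate_routes[(src, dst)]: list of routes,
    each route being a tuple of link identifiers. Returns (chosen routes, cost)."""
    link_flow = {}                        # total flow per link
    chosen = {}
    for src, dst, flow in nig:            # initial selection: first candidate
        r = candidate_routes[(src, dst)][0]
        chosen[(src, dst)] = r
        for link in r:
            link_flow[link] = link_flow.get(link, 0) + flow
    cost, prev_cost, trials = link_cost(link_flow), math.inf, ntry
    while prev_cost > cost or trials:
        prev_cost = cost
        for src, dst, flow in nig:
            # Rip up the current route for this arc ...
            for link in chosen[(src, dst)]:
                link_flow[link] -= flow
            # ... and pick the candidate with the smallest incremental cost,
            # breaking ties randomly.
            best = min(candidate_routes[(src, dst)],
                       key=lambda r: (sum((link_flow.get(l, 0) + flow) ** 2
                                          - link_flow.get(l, 0) ** 2 for l in r),
                                      random.random()))
            chosen[(src, dst)] = best
            for link in best:
                link_flow[link] = link_flow.get(link, 0) + flow
        cost = link_cost(link_flow)
        trials = trials - 1 if cost == prev_cost else ntry
    return chosen, cost

# Example: two arcs that would collide on link "a" unless one takes link "b".
nig = [("p0", "p2", 1), ("p1", "p2", 1)]
candidates = {("p0", "p2"): [("a", "c")], ("p1", "p2"): [("a", "c"), ("b", "c")]}
print(adaptive_router(nig, candidates))
```

Because the current route of an arc is always among its candidates, each rerouting step can never increase the cost, which mirrors the convergence argument given above.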
4 Results and Conclusions

We have implemented the experimental route selection algorithm and compared its performance with the SP1 routing algorithm using a set of communication workloads. The results given in Tables 1 through 3 show how well the two algorithms deal with network congestion.
4.1 Workloads and Methods

In the experiments, we used the standard network topologies available from IBM for 16, 32, and 64 node systems. For 128, 256, and 512 node networks we used the topologies shown in Figs. 5 and 6. The 256 node topology has all the nodes connected to the left-hand side of the network, with the right-hand side ports remaining unconnected. The 512 node topology is constructed from two 256 node networks (Fig. 6) whose right-hand sides are interconnected with straight wires. Not shown in the figures is the 256-A topology, which consists of 8 second-stage boards instead of the 16 used in Fig. 6. We used different communication workloads (NIGs): in the RANDOM-F workload each node i sends a unit-size message to a randomly selected node j. RANDOM-V is similar except that message sizes vary randomly between 1 and 10. DOLOOP refers to a commonly used communication pattern in parallel programs coded in Fortran. Each node executes

    DO I = 1, N-1
        each node J = 0 ... N-1 sends a message to node (I+J) mod N,
        where N is the number of processors
    CONTINUE

Note that each iteration of the loop corresponds to one NIG. EXOR refers to a communication pattern that provides conflict-free routing in hypercubes, as shown in [13]. It is similar to DOLOOP, except that the order of communication is different, as shown below:
    for i = 1 to N-1
        each node j = 0 ... N-1 sends a message to node (i EXOR j),
        where N is the number of processors
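As a concrete illustration of these two patterns (a sketch under the assumption that the NIG for one iteration is simply a list of (source, destination) pairs), the arcs of step i can be enumerated as follows; each step of both patterns is a permutation:

```python
def doloop_step(i, n):
    """Arcs of one DOLOOP iteration: node j sends to node (i + j) mod n."""
    return [(j, (i + j) % n) for j in range(n)]

def exor_step(i, n):
    """Arcs of one EXOR iteration: node j sends to node (i XOR j)."""
    return [(j, i ^ j) for j in range(n)]

n = 8
for i in range(1, n):
    for arcs in (doloop_step(i, n), exor_step(i, n)):
        dests = [d for _, d in arcs]
        assert sorted(dests) == list(range(n))   # each step is a permutation
```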
NCUBE refers to a commonly used communication pattern in divide-and-conquer type algorithms. Given 2^n processor nodes, each node sends to n other nodes:

    for i = 0 to n-1
        each node (j_{n-1} ... j_i ... j_0) sends a message to node (j_{n-1} ... j̄_i ... j_0)

where (j_{n-1} ... j_i ... j_0) is the binary representation of the node number and j̄_i denotes the i-th bit complemented. The remaining workloads are derived from the Harwell-Boeing sparse matrix collection. We mapped task graphs obtained from the sparse matrices to processor graphs using the Kernighan-Lin heuristic to minimize communication [14]. Then, we assumed that the resulting communication workload would be executed using the DOLOOP communication pattern. Thus, the workloads BCSPWR10, BCSSTK9, BLCKHOLE, and JAGMESH resemble DOLOOP, with the exception that the arcs of the resultant NIGs have variable weights. In the tables, the COST column refers to the minimum cost obtained by the algorithms as given by Eq. 1. We set the constant K = 0 in the experiments since switch sharing does not incur any penalty in the SP1 switch. As a performance metric we also report the maximum link load in the network, given in the FLOW column. Note that a smaller cost does not necessarily mean a smaller maximum link flow. However, in practice we have not observed a case of the maximum link flow increasing with decreasing cost.

4.2 Results

The main result of the paper is shown in Table 1, which indicates that the SP1 routing algorithm generates conflict-free routes for the 16, 32, and 512 node topologies for the DOLOOP, EXOR, and NCUBE workloads. The FLOW columns show that the maximum link load is 1.0, indicating conflict-free routing. The experimental routing algorithm was most effective with the RANDOM workloads; the maximum link load was smaller by a factor of 2 to 3 compared to the SP1 routing algorithm. For the BCSPWR10, BCSSTK9, BLCKHOLE, and JAGMESH workloads the difference between the two algorithms was negligible, most probably because the NIGs were sparse and used the DOLOOP pattern, so messages rarely shared any links or switches. In the 16 and 32 node DOLOOP cases the experimental algorithm performed worse than the SP1 routing algorithm, pointing us to a weakness of the experimental algorithm: since it is a local minimization heuristic, the quality of the results depends very much on the initial selection of the routes. To fix this problem we modified the experimental routing algorithm such that instead of selecting the initial routes randomly, we used
routes generated by the SP1 routing algorithm as the initial routes. The results of this experiment are reported in Table 2, which shows that the experimental routing algorithm always performs better than or equal to the SP1 routing algorithm. We performed a third set of experiments, reported in Table 3, to test the effects of the logical-to-physical node mapping. In the SP1 a user is presented with a logical sequence of node numbers from 0 to N-1. The logical node number observed by a user program is not necessarily equal to the physical node number of the underlying node. A logical-to-physical node number mapping is performed by the system. This is necessary because some nodes may already have been allocated to other users, and some nodes may be down and therefore cannot be allocated. To test the effect of this mapping on routing, we randomly interchanged the node numbers. When mapped randomly, the DOLOOP, EXOR, and NCUBE communication patterns could no longer use the conflict-free routes. The experimental routing algorithm performed much better than the SP1 routing algorithm in this case.
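A minimal sketch of what this experiment does to a workload (assuming the NIG-as-arc-list representation used above): the logical node numbers are permuted at random before the routes are selected.

```python
import random

def apply_random_mapping(nig, n, seed=0):
    """Relabel a NIG's logical nodes with a random logical-to-physical permutation."""
    rng = random.Random(seed)
    physical = list(range(n))
    rng.shuffle(physical)                 # physical[logical] = physical node number
    return [(physical[s], physical[d], f) for s, d, f in nig]

# Example: one DOLOOP step on 8 nodes, remapped.
nig = [(j, (1 + j) % 8, 1) for j in range(8)]
print(apply_random_mapping(nig, 8))
```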
4.3 Conclusions

Our results show that the experimental router is most advantageous when the node interactions are spatially random. The main advantage of the SP1 routing algorithm is its simplicity. While the experimental routing algorithm performs better than the SP1 routing algorithm in a number of cases, it leaves many system-level issues unaddressed: it is not clear how to obtain the node interaction graphs (NIGs), it is not clear whether the system or the user should run the routing algorithm, and it is not clear whether the routes should be stored in system space or user space. It is probably too much to ask a user to provide NIGs; compiler-provided NIGs would be most convenient. Another issue that needs to be addressed is the parallelization of the route selection algorithms, since networks are getting larger.
Acknowledgements

We wish to thank Craig Stunkel of IBM Research for his valuable discussions and the figures provided. We also would like to thank D. G. Shea and C. J. Tan of IBM Research for their managerial support.
References

1. C. B. Stunkel, D. G. Shea, B. Abali, M. M. Denneau, P. H. Hochschild, D. J. Joseph, B. J. Nathanson, M. Tsao, and P. R. Varker, "Architecture and Implementation of Vulcan," in Proceedings of the International Parallel Processing Symposium, April 1994.
2. W. J. Dally and C. L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks," IEEE Transactions on Computers, vol. C-36, pp. 547-553, May 1987.
3. P. Kermani and L. Kleinrock, "Virtual Cut-Through: A new computer communications switching technique," Computer Networks, vol. 3, pp. 267-286, September 1979.
4. A. S. Tanenbaum, Computer Networks. Englewood Cliffs, NJ: Prentice-Hall, 1981.
5. D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Transactions on Computers, vol. C-24, pp. 1145-1155, December 1975.
6. M. C. Pease, "The indirect binary n-cube microprocessor array," IEEE Transactions on Computers, vol. C-26, pp. 458-473, May 1977.
7. S. A. Felperin, L. Gravano, G. D. Pifarre, and J. L. C. Sanz, "Routing Techniques for Massively Parallel Communication," Proceedings of the IEEE, vol. 79, pp. 488-503, April 1991.
8. T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. NY: McGraw-Hill, 1990.
9. R. V. Boppana and S. Chalasani, "A Comparison of Adaptive Wormhole Routing Algorithms," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 351-360, May 1993.
10. T. Bultan and C. Aykanat, "A New Mapping Heuristic Based on Mean Field Annealing," Journal of Parallel and Distributed Computing, vol. 16, pp. 292-305, 1992.
11. C. Aykanat, F. Ozguner, F. Ercal, and P. Sadayappan, "Iterative Algorithms for Solution of Large Sparse Systems of Linear Equations on Hypercubes," IEEE Transactions on Computers, vol. 37, no. 12, pp. 1554-1567, 1988.
12. D. D. Kandlur and K. G. Shin, "Traffic Routing for Multicomputer Networks with Virtual Cut-Through Capability," IEEE Transactions on Computers, vol. 41, pp. 1257-1270, October 1992.
13. B. Abali, F. Ozguner, and A. Bataineh, "Balanced Parallel Sort on Hypercube Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 4, pp. 572-581, May 1993.
14. B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell System Technical Journal, vol. 49, pp. 291-307, 1970.
[Figure omitted. The switch chip contains 8 receiver modules (deserializer, FIFO, CRC/route check, flow control), 8 transmitter modules (FIFO, bypass arbitration, CRC generation, serializer, flow control), the unbuffered 8×8 crossbar, and the central queue (128×64 RAM) with its receive and transmit arbiters.]

Fig. 3. The Switch Chip Block Diagram.
Fig. 4. The Switch Board consisting of 8 Switch Chips
[Figure omitted: 8 node switch boards, each with 16 processor nodes attached, interconnected with 4 second-stage switch boards.]

Fig. 5. A 128 node network consisting of 8 first stage and 4 second stage switch boards.
[Figure omitted: 16 node switch boards, each with 16 processor nodes attached, interconnected with 16 second-stage switch boards.]

Fig. 6. A 256 node network consisting of 16 first stage and 16 second stage switch boards.
This article was processed using the LaTeX macro package with LLNCS style.
# NODES           16 NODES        32 NODES        64 NODES        128 NODES
WORKLOAD          FLOW   COST     FLOW   COST     FLOW   COST     FLOW   COST
RANDOM-F  SP1     2.10   30.4     2.50   86.0     3.20   250.2    3.50   712.0
RANDOM-F  EXP     1.20   25.2     1.70   73.2     2.00   198.0    2.00   571.6
RANDOM-V  SP1     15.30  1169.8   18.50  3348.7   22.00  9469.4   27.50  26914.1
RANDOM-V  EXP     9.30   947.8    10.20  2836.3   13.00  7384.2   10.90  21566.7
DOLOOP    SP1     1.00   25.6     1.00   74.3     3.19   247.7    3.87   863.2
DOLOOP    EXP     1.13   26.1     1.55   77.5     2.73   244.6    1.93   575.4
EXOR      SP1     1.00   32.0     1.00   82.3     3.40   307.2    2.55   949.7
EXOR      EXP     1.00   32.0     1.00   82.3     3.40   302.9    1.52   571.9
NCUBE     SP1     1.00   32.0     1.00   74.7     2.50   236.0    1.60   588.8
NCUBE     EXP     1.00   32.0     1.00   74.7     2.50   234.0    1.20   442.4
BCSPWR10  SP1     12.13  1025.6   8.00   535.2    5.02   210.9    3.43   139.7
BCSPWR10  EXP     12.13  1025.6   8.00   535.2    4.86   208.3    3.38   138.5
BCSSTK9   SP1     61.71  21521.7  36.31  11841.4  28.67  6225.3   13.37  2183.5
BCSSTK9   EXP     61.71  21521.7  36.31  11841.4  26.88  6077.8   12.15  2080.8
BLCKHOLE  SP1     19.77  2108.9   13.55  1144.6   9.51   614.2    6.32   402.3
BLCKHOLE  EXP     19.77  2108.9   13.55  1144.6   9.39   610.0    6.30   400.7
JAGMESH9  SP1     18.67  1621.0   10.17  798.6    7.30   387.5    5.24   309.3
JAGMESH9  EXP     18.67  1621.0   10.17  798.6    6.91   380.3    5.16   307.1

# NODES           256 NODES-C     256 NODES-A     512 NODES
WORKLOAD          FLOW   COST     FLOW   COST     FLOW   COST
RANDOM-F  SP1     4.50   1874.8   4.20   1741.0   4.80   4348.8
RANDOM-F  EXP     2.00   1503.8   2.00   1374.0   2.00   3504.2
RANDOM-V  SP1     33.80  67670.2  30.00  62225.3  32.30  158898.9
RANDOM-V  EXP     14.00  53969.6  16.20  47799.3  17.20  126871.5
DOLOOP    SP1     1.96   1421.3   3.03   1521.3   1.00   3166.2
DOLOOP    EXP     1.95   1503.8   3.07   1633.7   2.01   3498.2
EXOR      SP1     1.76   1438.5   3.29   1779.8   1.00   3184.9
EXOR      EXP     1.76   1406.0   3.29   1804.2   1.88   3222.2
NCUBE     SP1     1.33   1045.3   2.00   1194.7   1.00   2267.4
NCUBE     EXP     1.33   1032.0   2.00   1205.3   1.43   2282.3
BCSPWR10  SP1     2.29   80.6     2.30   75.4     1.63   47.8
BCSPWR10  EXP     2.29   80.6     2.29   75.3     1.63   47.8
BCSSTK9   SP1     5.00   584.2    4.96   561.2    2.38   219.5
BCSSTK9   EXP     4.96   583.7    4.96   561.2    2.38   219.6
BLCKHOLE  SP1     4.06   226.6    4.11   217.0    2.01   83.1
BLCKHOLE  EXP     4.06   226.6    4.06   216.8    2.01   83.1
JAGMESH9  SP1     2.53   116.4    2.53   113.2    1.74   114.0
JAGMESH9  EXP     2.53   116.4    2.53   113.2    1.74   114.0

Table 1. Comparison of the SP1 routing algorithm and the experimental routing algorithm, which uses random initial routes.
# NODES           16 NODES        32 NODES        64 NODES        128 NODES
WORKLOAD          FLOW   COST     FLOW   COST     FLOW   COST     FLOW   COST
RANDOM-F  SP1     2.10   30.4     2.50   86.0     3.20   250.2    3.50   712.0
RANDOM-F  EXP     1.30   26.2     1.90   75.4     2.00   199.0    2.00   572.6
RANDOM-V  SP1     15.30  1169.8   18.50  3348.7   22.00  9469.4   27.50  26914.1
RANDOM-V  EXP     9.90   957.0    10.50  2839.7   14.00  7437.0   11.70  21584.7
DOLOOP    SP1     1.00   25.6     1.00   74.3     3.19   247.7    3.87   863.2
DOLOOP    EXP     1.00   25.6     1.00   74.3     2.92   241.5    1.83   552.6
EXOR      SP1     1.00   32.0     1.00   82.3     3.40   307.2    2.55   949.7
EXOR      EXP     1.00   32.0     1.00   82.3     3.40   302.9    1.52   560.5
NCUBE     SP1     1.00   32.0     1.00   74.7     2.50   236.0    1.60   588.8
NCUBE     EXP     1.00   32.0     1.00   74.7     2.50   234.0    1.20   438.0
BCSPWR10  SP1     12.13  1025.6   8.00   535.2    5.02   210.9    3.43   139.7
BCSPWR10  EXP     12.13  1025.6   8.00   535.2    4.86   208.3    3.38   138.5
BCSSTK9   SP1     61.71  21521.7  36.31  11841.4  28.67  6225.3   13.37  2183.5
BCSSTK9   EXP     61.71  21521.7  36.31  11841.4  26.88  6077.8   12.15  2080.8
BLCKHOLE  SP1     19.77  2108.9   13.55  1144.6   9.51   614.2    6.32   402.3
BLCKHOLE  EXP     19.77  2108.9   13.55  1144.6   9.39   610.0    6.30   400.7
JAGMESH9  SP1     18.67  1621.0   10.17  798.6    7.30   387.5    5.24   309.3
JAGMESH9  EXP     18.67  1621.0   10.17  798.6    6.91   380.3    5.16   307.1

# NODES           256 NODES-C     256 NODES-A     512 NODES
WORKLOAD          FLOW   COST     FLOW   COST     FLOW   COST
RANDOM-F  SP1     4.50   1874.8   4.20   1741.0   4.80   4348.8
RANDOM-F  EXP     2.00   1503.8   2.00   1376.8   2.10   3512.6
RANDOM-V  SP1     33.80  67670.2  30.00  62225.3  32.30  158898.9
RANDOM-V  EXP     15.00  54114.4  17.00  47788.7  17.10  126901.1
DOLOOP    SP1     1.96   1421.3   3.03   1521.3   1.00   3166.2
DOLOOP    EXP     1.09   1373.8   3.02   1521.3   1.00   3166.2
EXOR      SP1     1.76   1438.5   3.29   1779.8   1.00   3184.9
EXOR      EXP     1.00   1389.7   3.29   1779.8   1.00   3184.9
NCUBE     SP1     1.33   1045.3   2.00   1194.7   1.00   2267.4
NCUBE     EXP     1.00   1024.0   2.00   1194.7   1.00   2267.4
BCSPWR10  SP1     2.29   80.6     2.30   75.4     1.63   47.8
BCSPWR10  EXP     2.29   80.6     2.29   75.3     1.63   47.8
BCSSTK9   SP1     5.00   584.2    4.96   561.2    2.38   219.5
BCSSTK9   EXP     4.96   583.7    4.96   561.2    2.38   219.5
BLCKHOLE  SP1     4.06   226.6    4.11   217.0    2.01   83.1
BLCKHOLE  EXP     4.06   226.6    4.06   216.8    2.01   83.1
JAGMESH9  SP1     2.53   116.4    2.53   113.2    1.74   114.0
JAGMESH9  EXP     2.53   116.4    2.53   113.2    1.74   114.0

Table 2. Comparison of the SP1 routing algorithm and the experimental routing algorithm, which uses the initial routes selected by the SP1 routing algorithm.
# NODES           16 NODES        32 NODES        64 NODES        128 NODES
WORKLOAD          FLOW   COST     FLOW   COST     FLOW   COST     FLOW   COST
DOLOOP    SP1     2.00   30.4     2.74   92.5     3.21   240.3    3.74   712.3
DOLOOP    EXP     1.13   25.9     1.87   78.1     2.00   191.5    2.00   565.0
EXOR      SP1     2.13   30.9     2.81   92.0     3.13   240.2    3.63   714.1
EXOR      EXP     1.67   27.3     1.94   80.1     2.00   193.2    2.00   566.5
NCUBE     SP1     2.25   30.5     3.00   92.0     3.00   253.3    3.71   728.6
NCUBE     EXP     1.50   26.0     2.00   78.0     2.00   201.3    2.00   572.6
BCSPWR10  SP1     17.20  2099.5   10.32  1066.4   6.27   588.3    3.76   424.3
BCSPWR10  EXP     15.33  1977.9   8.42   970.5    5.40   509.5    3.49   394.4
BCSSTK9   SP1     82.08  41564.3  51.46  30824.4  30.25  15574.1  14.49  6979.4
BCSSTK9   EXP     76.83  40468.7  44.21  28000.8  28.02  13817.9  12.38  6158.5
BLCKHOLE  SP1     22.60  3173.2   18.00  2825.9   11.00  1641.3   7.13   1281.2
BLCKHOLE  EXP     22.33  3065.6   15.40  2521.4   9.67   1415.9   6.45   1165.4
JAGMESH9  SP1     20.17  3119.7   13.62  1967.1   8.55   1185.1   6.17   1018.9
JAGMESH9  EXP     18.67  2955.7   10.92  1722.1   7.29   1043.5   5.21   928.3

# NODES           256 NODES-C     256 NODES-A     512 NODES
WORKLOAD          FLOW   COST     FLOW   COST     FLOW   COST
DOLOOP    SP1     4.26   1850.6   4.21   1705.7   4.76   4319.3
DOLOOP    EXP     2.00   1488.5   2.02   1359.2   2.03   3487.1
EXOR      SP1     4.24   1850.0   4.24   1706.6   4.73   4311.7
EXOR      EXP     2.01   1493.1   2.01   1362.1   2.03   3495.4
NCUBE     SP1     4.25   1862.5   4.25   1719.0   4.78   4322.0
NCUBE     EXP     2.00   1496.2   2.00   1375.5   2.00   3487.8
BCSPWR10  SP1     2.52   295.6    2.53   257.2    1.72   181.9
BCSPWR10  EXP     2.34   269.1    2.34   229.5    1.64   165.2
BCSSTK9   SP1     5.87   2791.8   5.88   2477.2   2.69   646.2
BCSSTK9   EXP     5.10   2396.8   5.10   2075.1   2.39   564.0
BLCKHOLE  SP1     4.36   793.8    4.36   700.2    2.24   836.8
BLCKHOLE  EXP     4.07   713.5    4.08   615.5    2.09   717.1
JAGMESH9  SP1     3.18   1518.7   3.15   1326.4   1.92   465.5
JAGMESH9  EXP     2.68   1299.1   2.71   1109.5   1.75   402.1

Table 3. Comparison of the SP1 routing algorithm and the experimental routing algorithm, with random mapping of tasks onto processors.