


A new distributed fault-tolerant algorithm for the Simple Assembly Line Balancing Problem 1

Working paper

Stefan Bock, Otto Rosenberg
Business Administration, in particular Production Management
University of Paderborn, 33095 Paderborn, Germany

Abstract

In this paper the Simple Assembly Line Balancing Problem 1 is considered. This well-known NP-hard problem has been widely studied in the last thirty years, and several sequential algorithms have been designed for solving it. But although these algorithms have become more and more efficient, they cannot cope with many instances of realistic size. For such instances the algorithms achieve only poor results or need unreasonable time for finding solutions of acceptable quality. The consequence is to use more than one processor. Based on the observation that Local Area Networks (LANs) with often more than 100 computers are installed in many companies today, the use of distributed algorithms appears to be a promising approach for solving production-planning problems in practice. Therefore an extended version of the up to now most efficient branch-and-bound procedure for SALBP-1 has been parallelized. Although the search tree of the sequential algorithm is extremely unbalanced, the new distributed algorithm reaches optimal speedups if the problem size is large enough. Besides even overlinear speedups for some instances, the algorithm implemented on a 1024-transputer network is nearly 800 times faster than the sequential version. To achieve this, a new work-balancing technique has been designed that can in principle be used for many parallel depth-first-search branch-and-bound algorithms. Being designed for a Local Area Network of personal computers, the algorithm uses no special transputer attribute. Due to its predefined communication structure it can easily be supported by practical hardware solutions. In addition, a fault-tolerance concept is supported by the algorithm: if some node fails, the algorithm can go on working without losing any solution, which is very useful in a network of more than one hundred personal computers.

1 Introduction

Flow-line production systems are often used for producing products in high quantity. In such a system the facilities are ordered by the sequence of operations for producing the product. Often there is additionally a global time rate in which every working station has to do its work on the product, and a conveyor belt transports the product from one station to another. This special kind of flow-line production system is called an assembly line production system. In this paper the problem of balancing such a line for producing only one product is considered. Here the problem is focused on the allocation of operations to stations: the whole equipment of the production system is given, and every station can execute every operation in the same time. These execution times are deterministic and given. In addition, the possible sequences of operations are restricted by precedence constraints, which define which operations must be done before a particular operation can be executed. By defining a global cycle time C, every station gets a limit on the working time maximally available for every product. This leads to the restriction that the total execution time of all operations allocated to the same station cannot exceed the cycle time C. This becomes obvious if you keep in mind that every C time units a product leaves the production system. The objective is to minimize the number of stations: a given production rate has to be achieved with minimal equipment, where an elementary equipment unit is a single station.

email: stbo@uni-paderborn.de, rosenberg@prowinet.uni-paderborn.de, Fax: +49-5251-603511. This work is supported by the DFG Sonderforschungsbereich 376: "Massive Parallelitat: Algorithmen, Entwurfsmethoden, Anwendungen."


1.1 The SALB-Problem 1

To define the Simple Assembly Line Balancing Problem 1, some terms are required:

Definition 1.1 A precedence graph G = (V, E, t) is a directed acyclic graph with a set V = {1, ..., n} of nodes, a set of edges E = {(i, j) | i, j in V}, and a vector t in R^n of values. A node in F_i = {j | (i, j) in E} is called a successor of i, and the nodes in S_i = {j | (j, i) in E} are the predecessors of i. Nodes stand for operations, and an edge (i, j) describes a precedence constraint between i and j: operation i must be executed before j can be done. The value t_i represents the execution time of operation i.

Because the graph is acyclic, it can be assumed that the nodes are sorted topologically, i.e. for all (i, j) in E: i < j. Besides the number of nodes (n), the size of the set of edges (|E|) influences the complexity of a problem instance. This can be measured by the average node out-degree.

Definition 1.2 Let G = (V, E, t) be a precedence graph. The average node out-degree is the quotient |E| / |V| = |E| / n.

With the average node out-degree it is possible to describe the real development of the complexity for an increasing number of operations (n), and therefore to examine by computational experiments whether the foreseen speedups are in fact reachable. Note that this is not true for the term order strength that is often used in the literature (cf. [MAST70]); keep in mind that a constant order strength leads to a quadratic growth of the number of precedence constraints measured in the number of operations. With these terms the Simple Assembly Line Balancing Problem 1 can be defined as follows:

Definition 1.3 A precedence graph G = (V, E, t) and a global cycle time C are given. A function f: {1, 2, ..., n} -> N is an admissible solution of this problem if for all (i, j) in E: f(i) <= f(j), and for all k in N the sum of t_i over all i with f(i) = k is at most C. The searched solution is the admissible solution g with max{g(i) | i in {1, ..., n}} = min{max{f(i) | i in {1, ..., n}} | f is an admissible solution}.

With the found function the operations are allocated to the stations. It is obvious that SALBP-1 is a generalized form of the bin-packing problem; that means problems without precedence constraints are already NP-hard.

What possibilities does a company realistically have today to use distributed algorithms? At first you can think of acquiring a special parallel computer with one hundred or more processors. But this requires high investments, and because such a system is often used only rarely, the invested capital does not work efficiently. In addition, the fast development of the hardware market makes it necessary to upgrade the system nearly every year. Another possibility is to lease time on such a system over the internet. But then transaction costs have to be paid, and it is possible that the companies do not get the time at the desired moment. This is especially true if you take into account that the capacity of the internet is already strained although only a few companies use it today. These reasons lead to the actual situation that nearly no company tries to solve its problems with distributed algorithms. Besides these aspects there are also many irrational reasons which should not be underestimated: the employees of the companies, for example, want to keep their familiar computational environment and often cannot be convinced of such new solutions. Because of that there is a third possibility, which seems to be the most realistic one for using distributed algorithms in practice. Today you can find in several companies over a hundred personal computers connected by a local area network. New developments such as ATM or SCI (cf. [GESI95]) improve network performance in a way that makes it possible to use such a network as a parallel system where every personal computer serves as a processor. The greatest advantage of this solution is that it requires only minor changes.
So no one has to be convinced of a new software system, and because of that, weaker opposition by the employees can be expected. In addition, the distributed algorithms use only already existing hardware equipment, which makes additional investments unnecessary.
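The conditions of Definitions 1.2 and 1.3 can be made concrete with a small sketch. The following Python fragment is hypothetical illustration code, not part of the paper; all function names are invented. It checks a station assignment f against the precedence and cycle-time conditions and computes the average node out-degree:

```python
# Hypothetical sketch of Definitions 1.2 and 1.3 (not from the paper).

def is_admissible(f, edges, t, C):
    """f maps operation -> station number; edges are precedence pairs (i, j);
    t maps operation -> execution time; C is the global cycle time."""
    # Precedence: operation i must be in a station no later than j's station.
    if any(f[i] > f[j] for (i, j) in edges):
        return False
    # Capacity: the total execution time per station must not exceed C.
    load = {}
    for op, station in f.items():
        load[station] = load.get(station, 0) + t[op]
    return all(total <= C for total in load.values())

def num_stations(f):
    # The objective value: an admissible f is rated by its largest station index.
    return max(f.values())

def average_out_degree(edges, n):
    # |E| / n, used above to measure the complexity of an instance.
    return len(edges) / n
```

For example, for four operations with times {1: 3, 2: 4, 3: 3, 4: 5}, precedences (1,2), (1,3), (2,4), (3,4) and C = 8, the assignment {1: 1, 2: 1, 3: 2, 4: 2} is admissible with two stations, but ceases to be admissible for C = 7.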

1.2 Computational equipment and its consequences


But the use of personal computers connected by a network as a parallel machine has special consequences for the design of distributed algorithms. Standard PCs are not designed for use as a node in a parallel system, so several attributes must be respected if a distributed algorithm is to work efficiently on such a system:

1. There is no form of shared memory at all; every node can access only its own local memory.
2. Although the network behaviour is improved by new technologies, unpredefined communication cannot be executed nearly as fast as on parallel computers.
3. The system is heterogeneous and can work extremely asynchronously.
4. Process-change procedures are extremely expensive in comparison to parallel computers.
5. A system of more than a hundred different personal computers is quite susceptible to failures, so while a distributed program is executing, a personal computer can be lost from the network.

It is obvious that all these aspects have consequences for the design of distributed algorithms. Especially points two, four and five restrict programming, since they concern performance. Communication, for example, is the bottleneck on parallel computers as well, but the ratio between the time required for an internal computation step and for a communication operation is much more unprofitable on a network of personal computers. In addition to the slower behaviour, the topology of these networks often has a high diameter. So if large systems are to be used, special hardware solutions (fast bus systems, for example) become necessary. But in that case the communication structure should be predefined, i.e. the structure and the size of the information sent in every communication step is problem-independent and always the same. So a predefined communication structure, nearly worthless on parallel computers, becomes a considerable advantage here. The same holds for one-process algorithms: many known parallel algorithms place several processes on one node to achieve much better performance. This is reasonable for transputer systems, where a change from one process to another is supported by the hardware, but not for personal computers, where process changes take much more time. So if statements are to be made about LANs while computational tests are run on parallel transputer systems, the algorithms should not use special abilities of those systems. Considering the computational environment also means enabling the algorithm to cope with failures of the hardware system.
Therefore an algorithm running on such a local network should also work if some node in the network is lost during computation. In such a situation the algorithm should be able to execute the work of the lost personal computer on another processor, to find the optimal solution in reasonable time in spite of the failure.

1.3 Sequential algorithms for SALBP-1

Because SALBP-1 has been analysed for over thirty years, a high number of sequential algorithms are known for this problem. These can be separated into exact and heuristic procedures. The first group searches for an optimal solution, while the other algorithms try to find only good solutions. Because SALBP-1 is NP-hard, the exact algorithms are often reduced to heuristic ones by using a time limit: after that limit the algorithm stops and returns the currently best known solution. With this technique it can be shown that the best exact algorithms reach better results than the newest heuristic techniques like tabu search (cf. [SCHO95], pp. 244-245). So the most efficient exact procedures seem to be the best instrument for getting good solutions for SALBP-1. These are branch-and-bound algorithms with different enumeration schemes. The newest and most efficient one is called SALOME-1 (cf. [SCHO95]). It uses an extension of the well-known IDA*-technique (cf. [KORF88] or [HNRA68]) to combine the advantages of two previously known procedures. Scholl calls this enumeration scheme the "local-lower-bound technique". His algorithm starts with an initial lower bound that is given to the root node as a local lower bound. Every node has such a local lower bound and tries to find a solution with at most that many stations. The operations are allocated to the actual station in increasing number, and a new node is built if no further operation with higher number is allocatable to the last station. So every node represents a new complete station load, i.e. a set of operations allocated to the last considered station. By using the local-lower-bound technique, only those station loads are considered at first that can still reach a solution with at most llb stations (2). After considering all those station loads, the local lower bound is increased. If this new value is still less than the already found upper bound, the other station loads are considered. The intended effect of the local-lower-bound technique is to prefer the currently best solutions first, in order to obtain good upper bounds fast. With these found solutions it is often possible to fathom parts of the tree earlier. Besides this technique, Scholl gives a rule that enables SALOME-1 to decide dynamically about the planning direction (cf. [SCHO95], pp. 138-142). This means that SALOME-1 opens a new station in forward or backward direction depending on the allocatable operations. This flexible rule often reduces the size of the search tree, because nodes with smaller degree get a lower depth. Additionally, SALOME-1 uses a number of extended bounds and rules (cf. [SCHO95], pp. 135-138) to reduce the search tree. For the purposes of this paper they are only enumerated briefly:
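To illustrate the principle of bounding on the number of stations (this is a hypothetical sketch, not Scholl's actual implementation; all names are invented and the bounds used are deliberately simple), the following Python fragment performs iterative deepening on the station count: a DFS builds complete station loads, prunes partial solutions that can no longer meet the current bound, and raises the bound when the search at the current bound is exhausted.

```python
# Hypothetical sketch of iterative deepening on the station count, the core
# idea behind the local-lower-bound / IDA*-style search described above.
# Assumes every execution time t[i] <= C. Operations are numbered 1..n.

import math

def solve_salbp1(n, edges, t, C):
    preds = {i: {j for (j, k) in edges if k == i} for i in range(1, n + 1)}

    def station_loads(assigned, prefix=frozenset(), used=0, start=1):
        # Yield station loads that are maximal w.r.t. higher-numbered
        # operations; restricting to such loads preserves optimality.
        avail = [i for i in range(start, n + 1)
                 if i not in assigned and i not in prefix
                 and preds[i] <= (assigned | prefix) and used + t[i] <= C]
        if not avail:
            if prefix:
                yield prefix
            return
        for i in avail:
            yield from station_loads(assigned, prefix | {i}, used + t[i], i + 1)

    def dfs(assigned, stations_used, bound):
        if len(assigned) == n:
            return stations_used
        # Quick lower bound on the stations still required (LB-1 style).
        rest = sum(t[i] for i in range(1, n + 1) if i not in assigned)
        if stations_used + math.ceil(rest / C) > bound:
            return None                    # cannot meet the bound: prune
        for load in station_loads(assigned):
            result = dfs(assigned | load, stations_used + 1, bound)
            if result is not None:
                return result
        return None

    bound = math.ceil(sum(t.values()) / C)  # LB-1 as the starting bound
    while True:                             # iterative deepening on the bound
        result = dfs(frozenset(), 0, bound)
        if result is not None:
            return result
        bound += 1
```

On the small instance from Section 1.1 (times {1: 3, 2: 4, 3: 3, 4: 5}, precedences (1,2), (1,3), (2,4), (3,4)), the sketch returns 2 stations for C = 8 and 3 stations for C = 7.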

Bounding concepts

1. Lower Bound-1 (LB-1): This bound sums the processing times of the operations that are still not allocated to a station and divides this sum by the cycle time C. The quotient is rounded up to the next natural number (cf. [BAYB86]).

2. Lower Bound-2 (LB-2): This bound works in the same way but uses the fact that operations with execution times exceeding C/2 cannot share a station with another operation. So the lower bound can be computed as the number of such operations plus half the number of operations with execution time equal to C/2 (cf. [JOHN88]).

3. Lower Bound-3 (LB-3): By separating the operations into more than two groups, LB-2 can be extended to a further bound. Operations with execution times exceeding 2C/3 are placed in the first group and each increases the lower bound by a whole station, and operations with execution time equal to 2C/3 get the weight 2/3. Operations with execution times exceeding C/3 and less than 2C/3 can share a further station only in pairs, so the lower bound is increased by half the number of such operations, while the last group, operations with execution times equal to C/3, is weighted with 1/3 (cf. [JOHN88]).

4. Lower Bound-4 (LB-4): This bound, computed only in the root node, interprets the operations as jobs that have to be scheduled on one machine. Every operation has a tail as a lower bound on the stations required by its (direct and indirect) succeeding operations. By allocating the jobs in the sequence of non-increasing tails, an optimal solution of the related one-machine problem is found, and its makespan can be interpreted as a lower bound for the SALB-Problem-1. The tails are computed for every operation recursively, starting from the sinks; the global lower bound is the tail of an imaginary operation that is a predecessor of every other operation. Because this lower bound takes computation time O(n^2), it is only used in the root node, where it is computed for both directions, which yields tails and heads for every operation (cf. [JOHN88]).

5. Lower Bound-5 (LB-5): This bound computes earliest and latest stations for every operation to find the minimal number of stations for which, for every operation, the latest station is greater than or equal to the first possible station (cf. [SABA87]).

6. Lower Bound-6 (LB-6): This bound is a combination of LB-2 and LB-3 in which further aspects are respected (cf. [BERG92]).

7. Lower Bound-7 (LB-7): By using the duality of SALBP-1 and SALBP-2, a further bound can be computed: a lower bound for the cycle time is computed and compared with C. The actual lower bound on the number of stations can be increased if C is smaller than this cycle-time bound.
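As a concrete illustration, the first three bounds can be sketched as follows. This is hypothetical Python, not the paper's code; the function names are invented, and each function takes the execution times of the not yet allocated operations.

```python
# Hypothetical sketches of LB-1, LB-2 and LB-3 as described above.

import math

def lb1(times, C):
    # LB-1: total remaining work divided by the cycle time, rounded up.
    return math.ceil(sum(times) / C)

def lb2(times, C):
    # LB-2: operations longer than C/2 cannot share a station; operations
    # equal to C/2 can share only in pairs, so each counts one half.
    full = sum(1 for x in times if 2 * x > C)
    half = sum(1 for x in times if 2 * x == C)
    return math.ceil(full + half / 2)

def lb3(times, C):
    # LB-3: weight 1 for x > 2C/3, 2/3 for x = 2C/3, 1/2 for C/3 < x < 2C/3,
    # 1/3 for x = C/3; the rounded-up weight sum is a valid bound.
    w = 0.0
    for x in times:
        if 3 * x > 2 * C:
            w += 1.0
        elif 3 * x == 2 * C:
            w += 2.0 / 3.0
        elif 3 * x > C:
            w += 0.5
        elif 3 * x == C:
            w += 1.0 / 3.0
    return math.ceil(w - 1e-9)   # epsilon guards against float round-off
```

For times [6, 6, 5, 5] and C = 10, for example, LB-1 and LB-2 both give 3 stations while LB-3 gives only 2, showing that none of the bounds dominates the others on all instances.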

Logical concepts

2 llb is the value of the actual local-lower-bound in this node.

1. Maximum load rule: If a non-allocated operation is assignable to the last station of a non-completed solution without exceeding the cycle time, the actual node is fathomed by this rule, because there is a solution in which this operation additionally belongs to this station (cf. [JACK56]).
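The maximum load rule amounts to a simple check on the last opened station. The following hypothetical Python fragment (invented names, not the paper's code) shows that check:

```python
# Hypothetical sketch of the maximum load rule. A node whose last station
# could still take another available operation without exceeding the cycle
# time is fathomed, because some solution exists in which that operation
# is added to this station.

def violates_maximum_load(last_station_ops, available_ops, t, C):
    """last_station_ops: operations in the last opened station;
    available_ops: unassigned operations whose predecessors are done;
    t: operation -> execution time; C: cycle time."""
    used = sum(t[i] for i in last_station_ops)
    # If any available operation still fits, the station load is not maximal.
    return any(used + t[i] <= C for i in available_ops)
```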


2. Task incrementing rule: Before starting the search, every operation gets the execution time C if no other task can share a station with it (cf. [JOHN88]).

3. Jackson dominance rule: If an operation is assignable to a station in place of another, already assigned operation without exceeding the cycle time, this rule can be applied if the assignable operation dominates the other one. This is the case if it has a higher operation time (if both have the same execution time, the operation number decides) and every direct successor of the already allocated operation is an (indirect) successor of the dominating operation. The idea is that no possible solution gets lost, because there is a solution built by exchanging both operations; this solution has at most the same slack time and at least as many allocatable operations at this point of the search tree (cf. [JACK56]).

4. New ordering: Every operation gets a new number for both directions. Respecting the precedence constraints, an operation gets a lower number if it potentially dominates another operation (cf. [SCHO95]).

5. Prefixing: This rule allocates operations to stations that cannot be assigned to later stations if the actual lower bound is to be reached. It is used in the extended version of SALOME-1 and can be simulated by a special form of LB-5 (cf. [SCHO95]).

6. Simple permutation: This rule is a weak form of the labeling dominance rule that will be described later. Because the labeling dominance rule is deactivated in SALOME-1, this rule is used instead; it is applied only in the extended version (cf. [SCHO95]).

Scholl proposes two versions of SALOME-1. Both have been implemented and tested on different instances. The extended version uses all mentioned bounding and test concepts, while the normal version works without LB-5, LB-6, LB-7, the prefixing rule and the simple-permutation rule. In the tests, both use the bidirectional rule to decide dynamically about the working direction. Since the normal version achieves at least as good average results as the extended one on the tested instances, it has been chosen for parallelization. A further distributed algorithm using the mentioned bounding concepts achieves, parallelized in the same way, nearly the same speedup results.

2 Parallel techniques and their applicability to the SALB-Problem-1

In the last ten years parallel programming has become a more and more interesting instrument for attaining greater speed in solving combinatorial problems. For nearly every sequential optimization technique there exist several load-balancing algorithms for parallelizing it. Especially branch-and-bound algorithms have often been parallelized, and a few authors have compared several techniques problem-independently and proposed a particular load-balancing algorithm as the best known in general (cf. [CXTM95] or [RS94]). But before these techniques can be compared, this paper examines the special attributes of the SALB-Problem-1.

2.1 Characteristics of SALBP-1

If you consider the cost function of the SALB-Problem-1 that rates the different admissible solutions (by the number of required stations), it becomes obvious that upper and lower bounds do not differ enormously: even if the best solution uses over thirty stations, fewer than five stations lie between the lower and the upper bound. That makes it very difficult to sort the considered non-completed solutions according to their values, because they have nearly the same lower bound. In addition, a non-completed solution whose last station is not maximally filled by operations cannot be rated completely. (Such a station is open and there are still allocatable operations available; an operation is available if all its predecessors, for an actual station in forward direction, or all its successors, for an actual station in backward direction, are already allocated to a station opened in the same direction.) So the algorithms for SALBP-1 build only whole station loads in every node: in the arising tree every level represents a new station and every node at this depth a particular set of operations allocated to it. Because of the high complexity of SALBP-1 the search tree grows exponentially with a rising number of operations if the average node out-degree is kept constant. So a best-first algorithm, which has to store every considered non-completed solution, cannot be used for instances of realistic size. In addition, the degree of the search tree can be huge. This becomes obvious if you take into account that nearly up to n^d station loads are allocatable for a just-opened station if d operations are on average placeable in one station and precedence constraints are absent. This makes it very complicated to combine breadth-first with depth-first search algorithms. If you consider the search trees of SALOME-1 for several instances, it becomes obvious that their structure is extremely unbalanced: there are usually a lot of nodes near the root, but in the next levels only a few of them expand over enough levels to build a considerable subtree. This can be explained by the mentioned characteristics of the cost function: because of the small differences, it takes some levels in the search tree to separate good from worse non-completed solutions. All these attributes, besides the constraints given by the computational equipment, must be taken into account if a parallelization of the best SALBP-1 algorithms on a network is to be successful. Due to the extended IDA*-technique of SALOME-1, distributed IDA*-algorithms are of interest for parallelizing it. Such a parallel algorithm has to split the whole work into at least p tasks that can be done simultaneously by the processors. Static tree partitioning is insufficient to keep all processors busy during the whole computation time, so a dynamic load-balancing algorithm is necessary: processors have to equalize their outstanding work by communication. In practice there are currently two main strategies (cf. [RS94]) known for parallel IDA*-algorithms.

2.2 Parallelization by stack-splitting

This scheme can be seen as a direct parallelization of a DFS branch-and-bound algorithm. Kumar and Rao (cf. [MGNPK92] or [RKR87]) describe an algorithm where every processor works on its own local stack, as the sequential one does. Initially only one processor gets the whole work. Every processor that is idle, i.e. whose stack has become empty, issues a work request to a neighbouring processor of the network. This processor splits its stack to give the idle one an unsearched subtree. So at the beginning more and more processors get a working packet and start searching on a different part of the tree. When an iterative deepening A* algorithm is parallelized this way, all processors work with the same maximum cost bound, which is increased after a global barrier synchronisation once all solutions that can reach this cost bound have been considered. The efficiency of the stack-splitting technique depends on the way the donor processor splits its stack. Kumar and Rao (cf. [KURA90]) define a cut-off level below which no further nodes can be given away as a working packet. They suggest three possible strategies:

1. Send nodes near the root node.
2. Send nodes near the cut-off level.
3. Send several nodes, one from each level.

The stack-splitting scheme, being a straightforward parallelization of the DFS technique, has been widely analysed and practically implemented. The performance results presented in the literature (cf. [KURA90], [RK87] or [RE94]) show that this technique does not seem to scale up to larger system sizes (cf. [RS94]). This can be explained first by the starting phase, where only one processor is busy and it takes some time until a sufficient number of processors get enough work. In addition, the work is often not well balanced, since every splitting of a stack generates one packet that represents a node lying one level deeper in the tree. If this often significantly smaller packet is split again and again in this way, only a few processors get sufficient work. In contrast to the same technique in best-first branch-and-bound algorithms (cf. [LM92]), where the packets lie on nearly the same level, it takes much more time until the work of the biggest subtrees can be spread over the whole network. So an algorithm that deals with stack-splitting needs much more information about the actual situation in the network to decide which processor should split its stack. Generally one can also observe that this stack-splitting algorithm needs a lot of sent packets before the network is well balanced. In addition, every packet consists of all the information a processor needs to continue searching at this part of the tree; for SALBP-1 these packets have at least linear size in the number of operations, i.e. O(n).

Applicability to the SALB-Problem-1

As mentioned above, SALOME-1 works with an extended IDA*-technique, the local lower bounds. With a global bound that has to hold for every node, the described stack-splitting algorithm does much more work. Since the best solution often lies nearer to the start upper bound, parts of the tree are considered more than once, which is
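The three donation strategies of the stack-splitting scheme can be sketched as follows. This is hypothetical Python; a real implementation operates on distributed stacks, and this fragment only shows the selection logic of the donor processor.

```python
# Hypothetical sketch of stack splitting: a donor processor answers a work
# request by cutting part of its DFS stack. Nodes at or below the cut-off
# level are never given away.

def split_stack(stack, cutoff_level, strategy="near_root"):
    """stack: list of (node, level) pairs, bottom (root side) first.
    Returns (donated, remaining); nodes with level >= cutoff_level stay."""
    splittable = [e for e in stack if e[1] < cutoff_level]
    if not splittable:
        return [], stack
    if strategy == "near_root":        # donate nodes close to the root
        donated = splittable[: max(1, len(splittable) // 2)]
    elif strategy == "near_cutoff":    # donate nodes close to the cut-off level
        donated = splittable[-max(1, len(splittable) // 2):]
    else:                              # "one_per_level": one node per level
        seen, donated = set(), []
        for node, level in splittable:
            if level not in seen:
                seen.add(level)
                donated.append((node, level))
    remaining = [e for e in stack if e not in donated]
    return donated, remaining
```

For a stack [("a",0), ("b",0), ("c",1), ("d",2), ("e",3)] with cut-off level 3, "near_root" donates the two nodes at level 0, while "one_per_level" donates one node from each of the levels 0, 1 and 2.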
sequentially prevented by the local-lower-bound-technique of SALOME-1. But if every processor uses the locallower-bound-technique and do not respect the given global bound the danger of examining unnecessary parts of the searching-tree rises up since no global control is given that splits the best nodes a

t rst. By considering the special attributes of the underlying computational equipment mentioned above a further problem arises: Since there is no global control given it is unknown which parts of the tree are examined by a speci c processor in the network. So if a failure occurs besides a new start of the algorithm there is no instrument possible to guarantee a optimal solution. So stack-splitting as it is presented in the literature does not seem to be very useful to parallelize the extended SALOME-1 algorithm. This technique described by Reinefeld and Schnecke (cf. RS94]) tries to prevent the disadvantages of the stacksplitting-algorithm by building as much working-packets as possible. Therefore they mix breadth- rst-search with depth- rst-search. This leads to two main phases. In the rst phase called initial partitoning phase every processor starts expanding the root node over several levels in a breadth- rst-search manner. So the same nodes are saved on every processor in the network and build if there are enough of them found, a frontier node array (cf. gure 1). By seperating them into p groups every processor gets its local working jobs. Reinefeld and Schnecke suggest in addition to it an intermediate phase where every processor uses breadth- rst-search again but only on its own nodes to get a high number of smaller jobs. But in contrast to stack-splitting this jobs (i.e. the nodes) which has been built in the rst period are indivisible working packets. That means that a processor which have become idle during the algorithm can only get such a job by another processor. There is no possibility to share this node between them. In the second phase which represents the real computation every processor uses the sequential algorithm to nd a new solution. It works on its own nodes until all these subtrees have been considered. 
To get new work Reinefeld and Schnecke suggest a communication-algorithm that uses the structure of a torus-network to reduce the number of messages. They use their algorithm for the oorplan area optimization problem and achieve almost linear speedup for all tested system sizes. So they conclude that the xed packet algorithm can be used e ciently for many combinatorial applications on parallel systems of large size. They explain that this is possible due to the fact that the initial distribution phase keeps every processor busy for over 90% of the whole running time. This means no work-balancing is necassary before. So besides new upper-bounds to be broadcasted no communication or synchronization has to be done during the main part of the runtime. Due to the nearly linear speedup even for large scale systems several tests for solving SALBP-1 has been done with the xed packet size technique. But in contrast to the results achieved for the oorplan area optimization problem the initial working packets has not kept the processors busy for enough time. Usually they become idle only after 5% of the running time. By numbering ev

ery working packet it becomes obvious that only a few of them stand active for a considerable amount of time. Because of the unsplittable structure of the remaining jobs only a few processor have been still busy. The reason for this unsatisfactory behaviour has been mentioned above. The structure of the searching-tree of SALOME-1 is extremely unbalanced and so only a few nodes of the search-frontier represents the overcoming maiority of the whole work. In addition there is again no global control given to reduce the danger that the parallel algorithm examines parts of the tree the sequential version would not consider. Due to the fact that both known techniques do not work e ciently for SALBP-1 on the described hardware environment it seems to be necessary to design a new parallel scheme that pay much more attention to the special attributes of the SALB-Problem-1 and the computational equipment. This is done in the following section.

2.3 The fixed packet DFS technique

Applicability to the SALB-Problem-1

3 The new distributed algorithm

This section presents a new parallel DFS-technique. It has been designed and analysed with the BSP-Model that has been described by Valiant (cf. [VAL90]). It uses a quite simple encoding-scheme that is motivated next


[Figure 1 graphic: the first tree levels are expanded breadth-first into a frontier node array; the frontier packets i, i+p, i+2p, ... are then examined by local asynchronous dfs.]

Figure 1: The searching-tree arising from the fixed-packet dfs algorithm.

to separate the whole work. After that the algorithm is described in detail before it is analysed and measured results are presented. If you consider the searching-tree that is built up by the sequential algorithm you see that every level stands for a new station and every node there represents a particular station-load in it. Note in this connection that the algorithm in this paper can open stations in forward- as well as in backward-direction. So the sequence of stations opened by the algorithm and given by the tree does not correspond to the numbering of stations of a completed solution. The algorithm starts at the root-node and opens the first station in an arbitrary direction. It gives this starting-node the global-lower-bound as its "local-lower-bound". In every node the algorithm opens a new station and builds there the first possible station-load. Note that the sequential algorithm starts with the global-lower-bound and increases this bound by one if there are no further station-loads that can reach the bound. Due to that every station that is opened during the algorithm defines an unchangeable local sequence of station-loads. By considering these sequences it becomes obvious that the expected number of required stations does not decrease from low numbered loads to higher numbered ones. This is due to the local-lower-bound-technique where only station-loads are considered that can reach the actual lower-bound, and not before all of them are examined

this bound is increased to check the next group. So you can recognize a "clustering" in the sequence starting from the bound given by the father-node and ending at the number of stations required by the actual best found solution. By using parts of the sequential algorithm a procedure "build next stationload" can be built to allocate the following station-load in the actual station that would be examined if the sequential algorithm tracks back to that station. Note in this connection that if a particular station-load in the sequence cannot reach a bound all following station-loads would also fail. An example for such a sequence is illustrated in figure 2. The sequential algorithm has to examine all these station-loads in the defined sequence in every node while the distributed one can share them between the used processors. To do so working-commissions are defined:

Definition 3.1 A working-commission woco(x, y) is a tuple of two sequences of integers of equal length, i.e. x = (x1, x2, ..., xk) and y = (y1, y2, ..., yk) with xi, yi ∈ ℕ. A processor that receives such an order has to execute the following procedure:

    number = -1;

3.1 An encoding scheme


[Figure 2 graphic: a small precedence-graph; i = number of operation, ti = execution time of operation i.]

number of station-load   allocated operations   computed lower-bound
1                        1,2                    6
2                        1,3                    6
3                        1,4                    7
4                        1,5                    7
5                        2,3                    7
6                        3,6                    8

Figure 2: The beginning of a precedence-graph and a possible sequence of station-loads for the first station with its computed lower-bounds. Note that the cycle-time C is 10.

    pointer = 1;
    while ((pointer <= k) AND (number != -1))
        if (number = -1) then
            pointer = pointer + 1;
            number = x_pointer;
        end if;
        open a new station (direction given by the sequential algorithm);
        if (number = -1) then
            l = min{ c | pointer < c <= k and x_c != -1 };
            build the last possible station-load respecting upper in all
                following stations until the l-th station is reached;
            pointer = l;
            number = x_pointer;
            upper = y_pointer;
            open a new station (this is the l-th station)
                (direction given by the sequential algorithm);
        end if;
        while ((the actual load is not the last possible in respect to upper)
                AND (number > -1))
            number = number - 1;
            build the next station-load in the sequential sequence;
        end while;
    end while;



[Figure 3 graphic: a searching-tree below the root node; every node is labeled with the sequence that leads to it (00, 01, 02, 0x, 10, 11, 12, 1x, 20, 21, ..., over the station levels 1, 2, 3), and two subtrees are marked.]

Figure 3: A searching-tree and the subtrees that are reached with the sequences (1, 2) and (2, 4). In every node the sequence is given that leads to it using the procedure of the definition 3.1. Note that the corresponding working-commission contains additionally to the underlying subtree the examination of the subtrees under the nodes belonging to the same father but lying on the right hand side of the reached node. In this connection the "x" represents a non-completed number in a sequence.

After that the processor starts the sequential algorithm. During this algorithm all station-loads built by the procedure above are kept unchanged except the last station. In this station all loads following the one allocated by the procedure have to be considered.

By considering this definition it becomes obvious that every number in the sequence tells how many station-loads have to be skipped next. In this connection the last load cannot be skipped. So if a processor should take the next station-load in the actual station, i.e. it has to skip zero station-loads, and there is only one load possible, it holds the number zero for the next station. This is due to the fact that the numbers of the sequence are used as a decision-string that enables every processor to find a different starting-node in the tree. Therefore a single possible station-load in an arbitrary node gives no possibility of alternative decisions. So it is allocated but not respected in the sequences. If a new number in a sequence is reached by the procedure a further station is opened. So the intended use of the sequence is to skip xj possible loads of the sequence before the next station is opened. Note again that during skipping further stations can be opened if the alternative loads are not of sufficient size. A little example illustrating the run of the procedure is given in figure 3. The number "-1" has a special meaning in this connection. This number is used during the algorithm to define the last possible loads in several stations. Since this is done for all stations up to the position of the next number unequal to "-1", a "-1" is never the last number of a sequence. To work with these defined working-commissions in a distributed algorithm it is necessary to find a method of splitting them. This is given by the following lemma:

Lemma 3.2 Respecting the definition of working-commissions the following statements can be made:
(i) ((0), (n)) is the whole work of the sequential algorithm.
(ii) If a processor executes the working-commissions ((x1, ..., xk, 0), (y1, ..., yk, yk)) and ((x1, ..., (xk+1)), (y1, ..., yk)) it executes ((x1, ..., xk), (y1, ..., yk)), i.e. ((x1, ..., xk, 0), (y1, ..., yk, yk)) "+" ((x1, ..., (xk+1)), (y1, ..., yk)) "=" ((x1, ..., xk), (y1, ..., yk)).

Proof:

For (i) you have only to note that n is a simple instance-independent upper-bound for the required stations since


there are only n operations to allocate. So no possible station-load is skipped. Note that n can be exchanged by every valid upper-bound. To show (ii) consider ((x1, ..., xk), (y1, ..., yk)). If the procedure described above is executed a node in the searching-tree is reached. Then the algorithm closes this last station and examines the underlying subtree[7] (⇒ ((x1, ..., xk, 0), (y1, ..., yk, yk)) is executed). After that the next load in the last[8] station is built and the remaining work is done (⇒ ((x1, ..., (xk+1)), (y1, ..., yk)) is executed). □

Due to the lemma it is easy to define an expansion of a working-commission as a valid splitting operation.

Definition 3.3 Let woco(x, y) = woco((x1, ..., xk), (y1, ..., yk)) be an arbitrary working-commission and g be an arbitrary natural number greater than 1. The set
S = { woco((x1, ..., (xk+c), 0), (y1, ..., yk, yk)) | 0 ≤ c ≤ g-2 } ∪ { woco((x1, ..., (xk+g-1)), (y1, ..., yk)) }
of g working-commissions is called a grade-g expansion of (x, y). Due to the fact that such an expansion can be simulated by repeated application of lemma 3.2 (ii) it is obvious that a grade-g expansion of an arbitrary working-commission (x, y) leads to g new working-commissions that together represent the same work.
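To make the splitting concrete, a grade-g expansion can be sketched in Python (a sketch: a working-commission is represented as a pair of equal-length tuples as in definition 3.1, and the function name is our own):

```python
def grade_g_expansion(woco, g):
    """Grade-g expansion of a working-commission (definition 3.3).

    Returns g new working-commissions that together represent the same
    work as woco = ((x1,...,xk), (y1,...,yk))."""
    (x, y) = woco
    head, xk = x[:-1], x[-1]
    # the first g-1 children append a trailing 0 to x and repeat y_k
    children = [(head + (xk + c, 0), y + (y[-1],)) for c in range(g - 1)]
    # the last child only increases x_k by g-1 and keeps y unchanged
    children.append((head + (xk + g - 1,), y))
    return children
```

For g = 2 this is exactly one application of lemma 3.2 (ii) read backwards: the whole work ((0), (n)) splits into ((0, 0), (n, n)) and ((1), (n)).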

3.2 The used labeling-dominance-rule

Before designing the parallel algorithm a short description of the sequential one should be given. As mentioned above this algorithm is the well known procedure SALOME-1 (cf. [SCHO95]) extended by the labeling-dominance rule. This rule, also used in FABLE (cf. [JOHN88]), has been adapted to work correctly with the local-lower-bound-technique. Every operation that is first allocated to a station gets a value called a label. These labels are computed in a way that makes it possible to get a different sum of all labels for every specific set of operations. So by building these sums of labels it is possible to distinguish different sets of operations. After allocating an operation and building the actual sum of labels it is compared with the best result for this set of operations where the last station has been opened in the same direction[9]. This is done by computing the number of closed required stations and the additionally needed allocated operation time in the last open station. If the actual value is higher this solution can be fathomed since there can not be a better new solution in the underlying subtree. To realize this you must take into account that the same set of operations is allocatable for both solutions. This is trivial if the sets of operations allocated in forward- and backward-direction are equal. Otherwise consider the first operation allocated in opposite directions. Since this operation can be allocated in both directions all successors must be allocated in one solution while all predecessors can be found in the other one. But due to the fact that the solutions have allocated the same set of operations all successors and predecessors of this operation have already been allocated in both. So the sets of allocatable operations in both solutions are also equal. Another and often more successful comparison can be done if a station must be closed since there is no further operation that can be allocated to it. Then an actual solution can be fathomed even if the best solution does not require more stations. An actual solution can become the actual best solution for

the set of allocated operations only if the actual station must be closed and the solution can reach the local-lower-bound of this node or the actual local-lower-bound of this node is only one station away from the actual upper-bound[10]. This is done iteratively for every set that remains after deleting the last allocated operation in the last station until this becomes empty. The basic idea of this rule is illustrated in figure 4. Since every set of operations must get a different memory-position the memory-requirement grows exponentially with the number of operations in the worst case. The optimal scheme for computing the labels in this connection has been designed by Schrage and Baker (cf. [SCBA78]). It respects the fact that some sequences of allocated operations are not possible because of the given precedence-constraints. But in spite of this the memory requirement can still grow exponentially. So the algorithm in this paper uses a label-limit that can not be exceeded by an operation. If this would be the case no more new labels can be computed and the rule is deactivated (cf. [SCHO95] pp. 137). As soon as all operations without a label are again deallocated from the stations the rule becomes active again. To prevent wasting any label the algorithm used here only starts labeling if there are real alternative station-loads for a first station. So if a specific working-commission should be executed the labeling also starts only if alternative station-loads are possible[7].

[7] Note that there are further station-loads in the last just closed station, since the procedure has been ended with it.
[8] This is the last station after executing the procedure of definition 3.1.
[9] Note in this connection that it is necessary to distinguish best solutions by the direction of the last opened station. So there are two groups of best solutions, one in forward- and another in backward-direction.
[10] Note that only in this case the underlying subtree is examined completely.
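The set-identification idea behind the rule can be sketched in Python (a deliberately simplified sketch: operation i gets the label 2^i, so a sum of labels identifies a set of operations uniquely, whereas the paper uses the more memory-efficient Schrage/Baker labels; the comparison value is reduced here to the pair of closed stations and used time in the open station):

```python
def make_labels(n):
    """label[i] = 2**i: every subset of operations has a distinct label sum."""
    return [1 << i for i in range(n)]

# best result seen so far per (label sum, direction of last opened station)
best = {}

def fathom(label_sum, direction, closed_stations, used_time):
    """Return True if the partial solution is dominated by an earlier one
    that allocated the same set of operations in the same direction."""
    key = (label_sum, direction)
    value = (closed_stations, used_time)
    if key in best and best[key] <= value:
        return True       # an equal or better partial solution is known
    best[key] = value     # remember the new best partial solution
    return False
```

The dictionary stands in for the label-indexed memory of the rule; deactivating the rule when the label-limit is reached would simply mean skipping these lookups.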


[Figure 4 graphic: the small precedence-graph from figure 2 (i = number of operation, ti = execution time of operation i) and two non-completed solutions:

station   non-completed solution 1   non-completed solution 2
          allocated operations       allocated operations
1         1,2                        1,3
2         3,4                        2,4 ]

Figure 4: The already known small precedence-graph (the cycle-time C is again 10) and two possible non-completed solutions. Note that the subtree starting with the second solution can not contain a better solution than the already examined subtree starting with the station-loads defined by the first solution.

So this instrument starts working at the earliest in the last station that is opened by the procedure of definition 3.1. The algorithm presented here has been implemented on a transputer system where the processors have only poor memory resources. As a consequence of using MPI as the parallel environment and the several bounding techniques of SALOME-1 there remain only a few bytes for this rule. So only ten operations could be labeled in the tested instances, which reduces the effect of this instrument.

3.3 The algorithm in detail

The first distributed algorithm for the SALB-Problem-1 can be separated into three phases. In the first phase a good solution should be computed fast by a simple parallel heuristic. After that the initial working-commissions are built to distribute the whole work that is done in the last phase.

3.3.1 Finding a start-upper-bound

At the beginning of the distributed algorithm the same renumbering of the operations is done as used in SALOME-1. By this renumbering it is guaranteed that an arbitrary operation cannot dominate a lower numbered operation (cf. [SCHO95] p. 136). In addition to this the renumbering procedure also prefers operations with higher execution times to give them a lower number. This is done due to the experience that good first solutions are possible by allocating the operations in these sequences (cf. [TPG86]). The distributed algorithm presented here uses this simple heuristic parallelized to get an initial upper-bound. Therefore p different sequences of integers are built to enable every processor to start the sequential heuristic on a different node to get a solution. Due to this the procedure finds simultaneously p different solutions while the best of them is taken as the initial upper-bound. But before every processor can start allocating the operations in the sequence given through the renumbering it must find its own node in the searching-tree. This is done by the following procedure that produces p sequences, one for every processor. Note that c is an arbitrary constant.


Iteration   Built nodes after this iteration
1           *((0),(n))
2           *((0,0),(n,n)); ((1,0),(n,n)); ((2),(n))
3           ((0,0,0),(n,n,n)); *((1,0),(n,n)); ((2),(n)); ((0,1,0),(n,n,n)); ((0,2),(n,n))
4           ((0,0,0),(n,n,n)); ((1,0,0),(n,n,n)); *((2),(n)); ((0,1,0),(n,n,n)); ((0,2),(n,n)); ((1,1,0),(n,n,n)); ((1,2),(n,n))
5           ((0,0,0),(n,n,n)); ((1,0,0),(n,n,n)); ((2,0),(n,n)); *((0,1,0),(n,n,n)); ((0,2),(n,n)); ((1,1,0),(n,n,n)); ((1,2),(n,n)); ((3,0),(n,n)); ((4),(n))

Table 1: Building exactly p nodes to find an initial upper-bound in parallel. Note that a star in front of a sequence marks the node that will be expanded in the next iteration by using the expanding-degree 3.

    field[0] = (0); last = 0; elements = 1; first = 1;
    while (elements < p)
        g = min{ c, p - elements + 1 };
        make a grade-g expansion of the sequence in field[last];
        if (first = 1) then first = 0;
        else last = last + 1; end if;
        elements = elements + g - 1;
    end while;

Note that the first sequence (0) is expanded twice in this procedure. This is not necessary and has only been done with the intention to get a more balanced distribution. A simple example illustrating the work of the procedure is given in table 1. After computing these p sequences every processor x takes the x-th sequence and executes the procedure in definition 3.1 with the simple upper-bound n. Then it can start allocating operations in the numbered sequence to find one solution. Since no test or bound must be executed every processor builds immediately one solution by filling every station maximally. If a complete solution is found every processor stops computing and the solutions are compared by communication. Note that this heuristic can be used in an expanded manner. Therefore it can be executed with different values for c to get better results. The algorithm tested in this paper only uses one iteration and sets c to 5.

3.3.2 The initial distribution phase

In this phase the whole work woco((0), (n)) is completely distributed over all processors. This is done quite similarly to the parallel heuristic described above. But now the sequences of the working-commissions are checked if they can still reach the actual lower-bound. This is done to give nearly all processors good starting-points in the searching-tree. So it is quite improbable that a considerable number of processors work on parts of the tree that would not be examined by the sequential one. Therefore the procedure above must be extended. The algorithm starts with the computed global-lower-bound and expands the working-commissions pointed out by "last". But now the expansion of a working-commission is stopped only if p commissions have been already
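The sequence-building procedure described above (with the variables field, last, elements and first) can be sketched in Python; a sketch that stores only the x-sequences, since in this phase every y-part is simply (n, ..., n), and that mirrors definition 3.3 on the x-part alone:

```python
def grade_g_expansion_x(x, g):
    """Grade-g expansion (definition 3.3), x-sequences only."""
    head, xk = x[:-1], x[-1]
    return [head + (xk + c, 0) for c in range(g - 1)] + [head + (xk + g - 1,)]

def build_p_sequences(p, c):
    """Build exactly p start-sequences, one per processor (cf. table 1)."""
    field = [(0,)]
    last, elements, first = 0, 1, 1
    while elements < p:
        g = min(c, p - elements + 1)
        children = grade_g_expansion_x(field[last], g)
        field[last] = children[0]   # first child replaces the expanded node
        field.extend(children[1:])  # the other children are appended
        if first == 1:
            first = 0               # the first sequence (0) is expanded twice
        else:
            last += 1
        elements += g - 1
    return field
```

For p = 9 and c = 3 this reproduces the nine sequences of the last column of table 1.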


built or if the last built commission starts at a point of the tree where the actual lower-bound is not reachable any more. Then the next working-commission is expanded and so on. If it is not possible to build p working-commissions the algorithm increases the lower-bound[11] and goes on in the same way. So after this period there are exactly p working-commissions to be executed and their starting-nodes begin executing with a local-lower-bound equal to the actual global-lower-bound. Note that this distribution-phase is done in the same way by every processor. So every processor has computed the sequences and in addition to this the allocated operations for every working-commission starting-node. This makes it unnecessary to execute the procedure of definition 3.1 before checking the next working-commission in the distribution-phase. The operations can be copied from the expanded commission and the next load is built only in the last station. The tested algorithm has also been implemented with many more working-commissions[12] to get a better initial distribution. But this effect has been rather poor and does not pay for the additionally required computational-time in the distribution phase.

3.3.3 The searching-phase

In this main phase the whole searching-tree is examined. Therefore every processor executes its working-commission. This is done by using the sequential algorithm respecting the global lower- and upper-bound. The labeling-dominance-rule is reset for every working-commission, i.e. the labels are deleted, and starts only if a first station is reached where alternative station-loads are possible[13]. The search is divided into several rounds of influenceable size and number. Every round itself starts with a local-computation-time where every processor does its actual working-commission. Then a communication round begins where results are exchanged. The time maximally consumable by the local-computation period is bounded by the time-variable T which is predefined[14] and newly computed in every round. Note that the whole searching-phase is only necessary if the actual found upper-bound is greater than the global-lower-bound computed before[15].

The local-computation-period

During this period every processor executes its own working-commission. Therefore it uses the sequential algorithm at this part of the searching-tree. Before this can happen the algorithm executes the procedure of definition 3.1 to find its starting-node. After doing so the last number of every sequence in every working-commission is deleted since the corresponding allocated operations in this station are still changeable. While examining the underlying subtree the processors save the number of the highest level hl in the searching-tree where the actual load is not the last possible. The number of the allocated load in this station minus one is stored in sl[16]. Note that hl is the number of used stations, opened in forward- as well as in backward-direction. So as a consequence the processor would not change any allocation above this level in its actual solution. So hl is the highest position in the searching-tree where further station-loads have to be examined later during backtracking. Besides these two variables every processor keeps the local-lower-bound of the following station-load in the level hl and computes a quality-value for the solution above to make it comparable with other non-completed solutions. Note that nothing is said about the way of computing the quality-value. Here the actual slack-time up to the level hl has been used but this is easily exchangeable. If a new upper-bound is found by one processor all other processors are informed by a broadcast[17] to stop the actual local-computation-period. Besides this the period can only come to an end for a processor if T seconds are expired or the actual working-commission is done. Other variants which control the end of a period by counting the actual number of idle processors work much more inefficiently due to the higher communication costs during the computation.

[11] But only if the new lower-bound is still lower than the already found upper-bound. Otherwise this upper-bound is optimal and the algorithm can stop.
[12] Every processor gets c commissions while cp have been built before.
[13] This means that the operations allocated to earlier considered stations are not respected since they are unchangeable for this working-commission.
[14] In the computational tests made here this variable has been initially set to 2 seconds.
[15] This bound is initially equal to the lower-bound computed by SALOME-1 but can be increased during the distribution-phase.
[16] I.e. sl is the number of skipped loads in this station.
[17] Therefore a tree with grade t can be used.


Global-communication and -updating period

At the beginning of this period the new upper-bounds are broadcasted. Note that this is only done since it is not guaranteed that all processors have received the message of a newly found upper-bound during the local-computation-period[18]. If this new upper-bound is equal to the global-lower-bound the algorithm stops immediately since an optimal solution is found. Otherwise the processors that have not become idle during the last computation-period start controlling their stack. Beginning at the station actually deepest situated in the tree every station-load is fathomed whose local-lower-bound is not lower than the global-upper-bound. In addition to this the station at level hl is considered. If the saved local-lower-bound of the next station-load is not lower than the upper-bound the variable hl is increased[19]. By considering the next stations the respectively following station-loads are built to compute the resulting local-lower-bound to find the first where this bound is lower than the upper-bound (cf. figure 5). The number of this level in the tree[20] is the new value for hl while the number of the actual station-load in it minus one is stored in sl[21]. After correcting the local stacks the main communication-period starts. Therefore every processor sends the variables hl, sl and a quality-value qu to every processor. Note that besides these three numbers no further information has to be exchanged. The value qu should tell something about the size of the underlying subtree as well as about the best possible solution that can be found in it. The algorithm implemented in this paper uses therefore both the local-lower-bound of the next possible station-load at level hl and the complete slack-time of the solution up to this point[22]. On the other hand if a processor is idle it encodes this by sending three "-1" values. After receiving these 3p numbers every processor starts distributing the remaining work.

Note that this is only possible if there are some processors that have not become idle. Otherwise the algorithm stops immediately. For distributing the work the non-idle processors are sorted according to their quality-value while the numbers of job-less processors are treated in an arbitrary sequence. These idle processors now get, one by one, a new working-commission for the next round. Therefore the next processor in the list of non-idle ones is considered. This processor has to split its working-commission to share the work. But every processor has only saved the working-commission of every processor at the beginning of the last round. This says nothing about the work done during the local-computation-period just ended before. So if this working-commission is split some work is executed more than once. Due to this the working-commission of the processor that should share its work has to be updated. To do so the following definition can be used:

Definition 3.4 Let woco(x, y) = woco((x1, ..., xk), (y1, ..., yk)) be an arbitrary working-commission executed by processor pi in the last local-computation-period. If this processor sends the values hl and sl during the communication-period and ub is the actual upper-bound-value, the working-commission woco((x'1, ..., x'hl), (y'1, ..., y'hl)) with
x'j = xj and y'j = yj for 1 ≤ j < k,
x'j = -1 for k ≤ j < hl,
x'hl = sl,
y'j = ub for k ≤ j ≤ hl
is called the (hl, sl)-updated working-commission of woco(x, y).
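Definition 3.4, combined with a grade-2 expansion for the actual splitting, can be sketched in Python (a sketch with our own function names; the numbers in the usage below are the values behind the tables of figure 6, i.e. hl = 3, sl = 1 and ub = 8):

```python
def update_woco(x, y, hl, sl, ub):
    """(hl, sl)-updated working-commission of woco(x, y) (definition 3.4).
    Assumes hl >= k = len(x)."""
    k = len(x)
    xu = list(x[:k - 1]) + [-1] * (hl - k + 1)  # x'_j = -1 for k <= j < hl
    xu[hl - 1] = sl                             # x'_hl = sl
    yu = list(y[:k - 1]) + [ub] * (hl - k + 1)  # y'_j = ub for k <= j <= hl
    return tuple(xu), tuple(yu)

def split_grade_2(x, y):
    """Split one working-commission into two (lemma 3.2 (ii) read backwards)."""
    donor = (x + (0,), y + (y[-1],))    # the subtree below the reached node
    idle = (x[:-1] + (x[-1] + 1,), y)   # the remaining loads in the last station
    return donor, idle
```

With these numbers update_woco((1,), (9,), 3, 1, 8) yields ((-1, -1, 1), (8, 8, 8)), which splits into ((-1, -1, 1, 0), (8, 8, 8, 8)) for the donor and ((-1, -1, 2), (8, 8, 8)) for the formerly idle processor.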

Note that the last number of both sequences is deleted while updating it through (hl, sl). This is done since the encoded allocated operations in the last of the corresponding stations are still changeable. Note further that a "-1" always represents the last possible station-load until the station hl is reached. The following lemma shows that this updated working-commission can be seen as the remaining work the donor processor has to execute, while parts of it can already have been done in the last local-computation-period.

Lemma 3.5 Let woco(x, y) = woco((x1, ..., xk), (y1, ..., yk)) be an arbitrary working-commission executed by processor pi in the last local-computation-period. Assume that this processor sends the values hl and sl

[18] Otherwise the new upper-bound can already be broadcasted by informing every processor about it.
[19] Note that now no further loads have to be considered in this station.
[20] The number of opened stations up to this node.
[21] Note that it is useful to keep every actual solution encoded in numbers of station-loads in a special array to get the new value for sl as fast as possible.
[22] I.e. hl · C - sum, while sum is the sum of execution-times of all allocated operations up to this level.


[Figure 5 graphic: two searching-trees, before and after the broadcast of a new upper-bound up = 8. Legend: llb = actual local lower-bound in this node; sl = actual station-load in the level hl; hl = highest level where further station-loads have to be considered; up = actual upper-bound.]

Figure 5: A typical situation after receiving a new upper-bound lower than the own local one. The processor has the working-commission woco((1), (9)) and has loaded the second station-load in the first station. Due to the fact that the upper-bound is 9 it is actually necessary to examine the station-loads 3 and 4 which belong to woco((1), (9)). But after receiving the new lower upper-bound hl has to be increased twice. Because of the fact that there are no further station-loads to be considered in the second level hl is set to 3. Here the first station-load is actually allocated but there are further ones to examine.
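The hl-correction illustrated here can be sketched in Python (a simplified sketch: `next_llb` is a hypothetical array holding, per level, the local-lower-bound of the next not yet examined station-load; levels are 0-based in the code while the figure counts from 1):

```python
def advance_hl(next_llb, hl, ub):
    """Return the first level >= hl whose next station-load can still beat
    the upper-bound ub, or None if the whole stack is fathomed."""
    for level in range(hl, len(next_llb)):
        if next_llb[level] < ub:
            return level    # a station-load below the upper-bound remains here
    return None             # the processor becomes idle
```

In the figure 5 situation the next loads at the first two levels cannot beat the new upper-bound 8, so hl moves past them to the third level, whose next load still has bound 7.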



7

m

@ (((((( (((@ ((( R@ ( ((7

m ((hh7

llb

m

llb

: actual local lower-bound in this node

7

m m m m7 7 8

Q A Q A Q Q A Q

7

m m7 7

A A A

m

H hhh hhhh HH hhhh HH hhhh H8

station 1 2 3 4

m

8

8

actual node of processor p1 sending hl= 3 and sl= 0 actual node of processor p1 after splitting (hl= 4 and sl= 0)

-

7

?@?@?@ I@@

m m m m8 8 8

Q A Q A Q Q A Q

8

m m m8 8

?@@?@?

m

7

?@@?@?7

m m m7 7

m m m

actual node of processor p2 after splitting (hl= 3 and sl= 1)

woc executed by processor p1 woc changed through hl, sl woc executed by processor p1 woc executed by processor p2

(x1; y1 ) (x2; y2 ) (x3; y3) (x4; y4) (1,9) (-1,8) (-1,8) (1,8) (x1; y1 ) (x2; y2 ) (x3; y3) (x4; y4) (-1,8) (-1,8) (1,8) (0,8) (-1,8) (-1,8) (2,8)

Figure 6: The two new working-commissions built by splitting the actual one of processor p1. The first tabular shows the initial working-commission of p1 and how it is modified by the distribution procedure after receiving the variables hl and sl. The second one illustrates the new working-commissions built by a grade-2 expansion of woco((-1,-1,1); (8,8,8)). Note that if there had been numbers in front of the first tuple (x1; y1) = (1; 9), they would have been kept unchanged during updating and splitting.

... during the communication-period, and woco(x̄; ȳ) = woco((x̄1,...,x̄hl); (ȳ1,...,ȳhl)) is the (hl; sl)-updated working-commission of woco(x; y). Then a processor pj that has executed the procedure in definition 3.1 with woco(x̄; ȳ) has allocated the same operations to the first hl stations as processor pi at the end of the local-computation-period. In addition, exactly hl stations have been opened by processor pj.

Proof:

Consider the execution of the procedure in definition 3.1 with woco(x̄; ȳ) done by processor pj. Let z be the number of opened stations before reaching the k'th number in the sequence x̄^23. Then it is obvious that the operations allocated to these stations are equal to the operations actually placed in the same stations on processor pi, since it could not change them while executing woco(x; y). It remains to show that the following stations up to the hl'th station are likewise equal. Therefore consider at first the hl'th station itself. Note in this connection that z < hl, since the last number in x̄ has not been considered when computing z. Due to the variable sl two facts can be pointed out: first, processor pi actually has the (sl+1)'th station-load allocated there, and second, there is a further load possible. So, due to the construction of the procedure, exactly this load is likewise allocated to the hl'th station by processor pj. To realize this you have to take into account that k <= hl. So if k = hl the procedure does not find a "-1" value, and by finding x̄k = sl the right load is allocated by

23 These numbers are equal to the numbers in the sequence x.


processor pj. Otherwise, if k < hl, a "-1" value is found by processor pj, and because of this all stations up to hl are filled, but this station itself also gets the load number sl+1. So, as claimed, both processors allocate the same operations to the station hl. But what can be said about the stations situated between the levels z and hl? There is nothing to show if z is equal to hl-1. So the case z < hl-1 remains. By considering the processor pi, an arbitrary station s with z < s < hl is taken. Due to s < hl there are no further station-loads possible, respecting the actual upper-bound. So the last possible station-load is actually allocated there, which is also done by the procedure of definition 3.1 due to the "-1" in x̄k^24 and the actual upper-bound in ȳk. The second claim has already been shown by the consideration of the hl'th station. □

By this lemma it is obvious that the processor has to execute the updated working-commission, while some of its work may already be done. But by knowing the load number sl+1 in the station hl it is guaranteed that the node with the next load there has not been reached yet. So if the updated working-commission is split by a grade-2 expansion, the work of the updated working-commission is separated into two disjoint parts: no work is done more than once, and none is forgotten. So after updating the working-commission of the donor-processor, a grade-2 expansion is done with it to get two new working-commissions woco(a; b) and woco(c; d). Due to the fact that this distribution is done by every processor on its own but in the same way, every processor knows all updated working-commissions and, in addition, every expansion that has been done. So if it has been a donor-processor, it can encode its stack at the hl'th station to make the allocated operations there unchangeable. Otherwise, if it has been idle after the last local-computation-period, it can execute the procedure in definition 3.1 to start executing its new working-commission.
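The updating and splitting steps above can be sketched in a few lines. This is an illustrative reconstruction only: the encoding (function names included) is inferred from the tabulars of Figure 6, where an entry -1 stands for "the last station-load still possible in this level under the corresponding bound" and other values select a concrete station-load.

```python
def update_woco(x, y, hl, sl, new_up):
    """(hl, sl)-update of a working-commission woco(x; y).

    Levels 1..hl-1 keep only their last possible load under the new
    upper-bound; in level hl the (sl+1)-th load is currently allocated.
    """
    new_x = [-1] * (hl - 1) + [sl + 1]
    new_y = [new_up] * hl
    return new_x, new_y


def grade2_expansion(x, y):
    """Grade-2 expansion: split an updated working-commission into two
    disjoint parts, so no node is examined twice and none is forgotten."""
    # Donor-processor: descend one level, starting with the first load there.
    donor = (x + [0], y + [y[-1]])
    # Idle processor: continue with the next load in the current last level.
    idle = (x[:-1] + [x[-1] + 1], y[:])
    return donor, idle
```

Replaying the example of Figure 6, `update_woco([1], [9], 3, 0, 8)` yields `([-1, -1, 1], [8, 8, 8])`, and the grade-2 expansion of that commission produces the two commissions executed by p1 and p2 in the second tabular.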
So one by one every idle processor gets a new working-commission from the next donor-processor in the sorted list. Note that it is also possible to saturate more idle processors in one iteration by building an arbitrary g-expansion of the updated working-commission. Further note that every donor-processor is deleted from the donor-list and exchanged with the processor that gets work from it. So if there are no further donor-processors in the sequence but still idle processors in the other list, the sequence of donor-processors, which cannot become empty, is started again. So after at most O(p) iterations every idle processor has received a new working-commission and can examine a new part of the searching-tree. A simple example to illustrate the distribution is given in figure 6.

After distributing the remaining work, a new value for T has to be computed. Therefore a few definitions are necessary.

Definition 3.6: An arbitrary round in the distributed algorithm is given. Let x <= p be the number of processors which are still busy after the local-computation-period. Then x/p is called the processor-utilization-ratio (pur).

This ratio tendentially indicates the efficiency of the algorithm. Note that processors that become idle just before the period-time expires are not respected.

Definition 3.7: Let T be the value of the maximal local-computation-period-time and C the total amount of seconds needed for the whole global-communication and -update period. Then T/(T+C) is called the parallel-computation-ratio (pcr).
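Both ratios are straightforward to compute. A minimal sketch (the function names are mine, not the paper's):

```python
def processor_utilization_ratio(x, p):
    """Definition 3.6: pur = x/p, where x of the p processors are
    still busy after the local-computation-period."""
    return x / p


def parallel_computation_ratio(T, C):
    """Definition 3.7: pcr = T/(T+C) for period-time T and total
    global-communication and -update time C."""
    return T / (T + C)
```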

It is obvious that both terms influence the efficiency of the described algorithm directly. To compute a new value for T, the described algorithm supposes that the efficiency can be computed as the product of both. Note that these two terms behave in a contradictory manner: if the parallel-computation-ratio is increased by a high value for T, a lower number of busy processors at the end of the computation-period can be expected. So the algorithm, which can only influence the new value for T, tries to equalize these two values to gain a higher efficiency in the next round. So both values are computed in every round using the old value for T. By supposing linear behaviour, the mean value of both terms is aspired to in order to reach the maximal efficiency for the algorithm. So T is changed to a new value that gives a new parallel-computation-ratio equal to (pur+pcr)/2. Therefore the following lemma can be used.

Lemma 3.8: Consider an arbitrary round while T, C, pur and pcr are given. Then T can be computed through T = C(pur+pcr)/(2-pur-pcr) to achieve that the new value pcr_new is equal to (pur+pcr)/2.

Proof:

24 Note that k-1 <= z, and so by z < hl-1 the inequality k-1 < hl-1 follows immediately. And k < hl leads to a "-1" value in x̄k as claimed.

This proof is easily done by the following transformations:


    T/(C+T) = (pur+pcr)/2
⇔  2T = (C+T)(pur+pcr)
⇔  2T = C(pur+pcr) + T(pur+pcr)
⇔  T(2-pur-pcr) = C(pur+pcr)
⇔  T = C(pur+pcr)/(2-pur-pcr)                                      □
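In code, the T-update of Lemma 3.8, together with the clamping to the predefined interval mentioned in footnote 25, can be sketched as follows (a sketch with assumed names, not the paper's implementation):

```python
def next_period_time(C, pur, pcr, t_min, t_max):
    """Lemma 3.8: choose T so that the new ratio T/(T+C) equals
    (pur + pcr)/2, then clamp T to the interval [Tmin, Tmax]."""
    T = C * (pur + pcr) / (2.0 - pur - pcr)
    return min(max(T, t_min), t_max)
```

For example, with C = 2, pur = 0.8 and pcr = 0.6 the new T is 14/3, and indeed T/(T+C) = 0.7 = (pur+pcr)/2.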

Using this simple formula, every processor computes the new value for T^25. Then it begins executing its old or new working-commission in a further local-computation-period, which starts a new round. Note that the formula for computing T can easily be exchanged by a procedure that respects detailed information about the underlying hardware.

3.4 Fault-tolerant concepts in the algorithm

Due to the encoding-scheme it is quite easy to incorporate a fault-tolerant concept into the algorithm. Therefore every working-commission of a still working processor is (hl; sl)-updated in every communication-round. By doing so, every processor knows the updated working-commissions of every processor in the network. So if a processor gets lost, only this commission has to be repeated by another processor. To save the results of previous computations, every new upper-bound is written to a file or is sent to a screen^26. During the communication phase every processor tries to execute the communication-algorithm to find a new upper-bound or to distribute the variables sl, hl and the quality-value qu. If this execution is not completed after S seconds, a failure is assumed. So a token is sent subsequently from processor to processor to find a new list of active processors. Therefore a transfer from node to node should be executed in Z seconds. So every processor i waits (i-1)Z seconds for a packet from a processor with a lower number^27. If no packet is received during this time-period, processor i assumes that all processors up to number i are dead. So it deletes them from the list and tries to send this list to number i+1. If this processor does not answer after Z seconds, it is also removed from the list and i+2 is contacted. This is repeated until some processor answers or processor p has been contacted without success. Then i is the last active processor and broadcasts the new list to every active processor. Then all processors use the same procedure to find a new routing-scheme that is used until a new failure occurs. The working-commissions of the dead processors are saved in a special list. If a processor becomes idle, this list is used first to give it a new working-commission. So if no failure occurs, the algorithm works exactly like the already described algorithm. Note that no solution gets lost, and so the algorithm is still able to find the optimal solution if a failure occurs. On the other hand, the additional work due to a processor-death is quite small, since only one working-commission has to be executed again.

25 Note that T cannot leave the predefined interval [Tmin; Tmax].
26 If there is a quite safe processor in the network, it is also possible to send the result to its memory.
27 Note that if processor 1 is still alive, it would immediately send a packet to processor 2.

3.5 A short analysis using the BSP*-Model

The algorithm works in several rounds. Every round itself consists of a local-computation- and a global-communication-period. But only the computation-periods can achieve speedups compared with the sequential algorithm. So the algorithm must achieve long computation-periods that keep nearly all processors busy, and short communication-periods. Additionally, the whole work executed by the algorithm should not increase while using more processors. Only if all these conditions are fulfilled is an efficient parallel execution possible. In the following short analysis the last condition is relaxed: it is supposed that the whole work is independent of the number of processors used for it. The analysis uses an extended version of the well-known BSP model defined by Valiant (cf. [VAL90]) which has been proposed by Meyer auf der Heide, Baumker and Dittrich (cf. [MDB96]). This model consists of:

- p processors with local memory (no shared memory at all),
- a router that can deliver messages to every node in the network,
- a synchronization mechanism.

The computation is subdivided into several supersteps, which consist of a computational and a communication-period. In the computational period every processor can only use information given at the beginning of this superstep. New information can only be received in a communication-period. An instance of the model can be described by the parameters L and g (cf. [MDB96]): L is the minimal time for a superstep, i.e. the minimal time between two synchronizations; g is the time the router in continuous use needs to deliver a packet. Meyer auf der Heide, Baumker and Dittrich additionally propose a further parameter B to characterize the bandwidth of the router. By considering realistic behaviours of parallel computers, they want to incorporate the advantage of sending few large packets instead of many small ones. Therefore the communication costs are divided into the number of packets that are maximally received by one processor in this period and the size of the largest packet that is routed to this node. By dividing the size by B^28 and multiplying the product of both values with g, larger packets reach better results in this model^29. Tests executed for several examples have shown that this model is quite realistic, since efficiency analyses have often been confirmed by experiments.

The analysis considers an arbitrary round of the whole algorithm. Since the algorithm works in the local-computation-period like the sequential one, only the runtime caused by the global-communication-period has to be considered. Note that the procedures presented next are optimized for the analysis, while the implemented algorithm often uses a straight-forward version.

Finding the minimum of all actual upper-bounds: To find a new upper-bound, every processor is placed as a node in a t-degree tree. This tree is used in two phases. In the first phase every processor besides the leaves receives t upper-bounds sent by its sons, while the best of them is given to the next father. In the second phase the found upper-bound is broadcasted again, using the tree in the opposite direction:

  log_t(p) * max{L, t*g*ceil(1/B)} + log_t(p) * max{L, g*ceil(1/B)}  ∈  O(L*log_t(p) + t*g*log_t(p)).

Broadcasting all actual state-informations: As done before, the t-degree tree is used, but now the messages must be combined. So the root node receives a packet of size qp, while q integers have been sent by every processor in the tree. By broadcasting it, all processors are informed. So the time-requirements can be estimated by:

  sum_{i=1}^{log_t(p)-1} max{L, t*g*ceil(q*t^(i+1)/B)} + log_t(p) * max{L, g*ceil(qp/B)}  ∈  O(L*log_t(p) + g*ceil(qp/B)).

Stack-correction after broadcasting a new upper-bound (this is not often the case, since the difference between upper- and lower-bound is only small): ∈ O(n).

Additional global computations in every communication-period: Allocating the processors to groups and sorting the non-idle ones. By using the t-degree tree to distribute the state-informations of every processor, every processor in this tree can merge the informations into a sorted list. This always needs time linear in the actual length of the local lists^30. So, summed over the depth of the whole tree, it can be estimated by:

  O(sum_{i=0}^{log_t(p)} t^i) = O(p).
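As a sanity check, the cost of the two-phase upper-bound reduction can be evaluated for concrete BSP* parameters. This is a sketch; the parameter names follow the analysis above, and the tree depth is taken as ceil(log_t p):

```python
import math


def tree_depth(p, t):
    """Smallest d with t**d >= p, i.e. ceil(log_t p)."""
    d = 0
    while t ** d < p:
        d += 1
    return d


def min_upper_bound_cost(p, t, L, g, B):
    """Cost of finding and re-broadcasting the minimal upper-bound in a
    t-degree tree: log_t(p)*max{L, t*g*ceil(1/B)} + log_t(p)*max{L, g*ceil(1/B)}."""
    d = tree_depth(p, t)
    return d * max(L, t * g * math.ceil(1 / B)) + d * max(L, g * math.ceil(1 / B))
```

For instance, for p = 1024 processors and degree t = 4 the tree has depth 5, so both phases touch 5 tree levels each.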

28 The result is rounded to the next higher integer.
29 For a more detailed description of the models see [VAL90] or [MDB96].
30 Note that t is a chosen constant.


Work distribution: ∈ O(p).

So the whole period requires runtime^31

  O(p + n + L*log(p) + g*ceil(qp/B)) = O(p + n + L*log(p) + g*ceil(p/B)) =: C^32.

By assuming that the whole work is independent of the number of processors used for it and that every processor is busy in all rounds, T is set after a few rounds to its maximal value, which is defined by Tmax = kC*log(n). So if the sequential algorithm takes runtime Tseq, there are

  ceil(Tseq / (p*k*C*log(n)))

rounds to be executed by the distributed one. So the whole computation time can be estimated by:

  ceil(Tseq / (p*k*C*log(n))) * [C + kC*log(n)] = Tseq/(p*k*log(n)) + Tseq/p ∈ O(Tseq/p).

Note that this asymptotic analysis neglects the influence of changing searching-tree sizes caused by using different numbers of processors. Positive as well as negative effects of increasing processor counts are imaginable. Assuming successful load-balancing, the result shows that if these effects are omitted, or at least of the same probability, the algorithm can reach an optimal speedup when solving instances of sufficient complexity. This trend can be accelerated by higher values for the constant k^33.
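The round-count estimate above can be replayed numerically. A sketch with assumed names, passing log(n) directly and using Tmax = k*C*log(n) as defined above:

```python
import math


def distributed_runtime(T_seq, p, k, C, log_n):
    """Total time: ceil(T_seq/(p*k*C*log n)) rounds, each of length
    C + k*C*log n (communication plus maximal computation period)."""
    T_max = k * C * log_n
    rounds = math.ceil(T_seq / (p * T_max))
    return rounds * (C + T_max)
```

For T_seq = 1000, p = 10, k = 1, C = 1 and log(n) = 1 this gives 100 rounds of length 2, i.e. exactly Tseq/(p*k*log(n)) + Tseq/p; doubling k halves the round count while each round grows by less than a factor of two, illustrating why higher k accelerates the trend toward O(Tseq/p).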

3.6 Computational results

The described algorithm has been widely tested on several distributed systems. In this paper the results that have been measured on the largest network (GCel), a 1024-processor Transputer T805 system, are presented. The programs are written in ANSI C using MPI libraries. During compilation these MPI functions are translated to PARIX, which is the parallel environment for using the system. Note that since this translation is not very efficient on the GCel, the communication bottleneck is raised. (Note that there are many possibilities to reduce the runtime required by this period.) As mentioned above, no special attribute of the transputer-system is used.

  Problem        |         | SALOME   | EXT-SAL  | 16 Proc | 32 Proc | 64 Proc | 128 Proc
  ---------------+---------+----------+----------+---------+---------+---------+---------
  N=26  ND=1.45  | Time    |  1288.3  |  1232.4  |  89.1   |  48.05  |  27.7   |  17.65
                 | Speedup |  1       |          |  13.83  |  25.65  |  44.49  |  69.82
  N=28  ND=1.45  | Time    |  687     |  693.35  |  46.85  |  26.25  |  16.35  |  11.65
                 | Speedup |  1       |          |  14.66  |  26.17  |  42.02  |  58.97
  N=30  ND=1.45  | Time    | >852.2   | >807.15  |  38.75  |  23.3   |  17.35  |  14.25
                 | Speedup |  1       |          | >20.83  | >34.64  | >46.52  | >56.64
  N=46  ND=4.9   | Time    | >2177.55 | >2168.2  |  66.85  |  57.05  |  45.1   |  18.05
                 | Speedup |  1       |          | >32.43  | >38.01  | >48.08  | >120.12
  N=48  ND=4.9   | Time    |  1043.25 |  1055.4  |  70.25  |  39.55  |  26.05  |  20.6
                 | Speedup |  1       |          |  14.85  |  26.38  |  40.05  |  50.64
  N=50  ND=4.9   | Time    | >1821.6  | >1783.2  |  71.75  |  44.5   |  21.65  |  15.75
                 | Speedup |  1       |          | >24.85  | >40.07  | >82.36  | >113.22

Table 2: The average execution-times (in seconds) and the computed speedups for 20 randomly chosen instances of the defined sizes. Note that N stands for the number of operations and ND for the average node out-degree. A ">" in front of a number is used if there have been some instances that could not be solved within the given time.

31 Note that q is a constant (in the implemented algorithm it is reduced to four integers).
32 Tests made with a relaxed, not optimized version on a transputer system lead to time requirements of nearly 0.003p seconds.
33 Note that this is only true if a higher value for k does not prevent successful load-balancing.
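The speedups of Table 2 can be reproduced from the times if they are taken relative to the faster of the two sequential algorithms. This baseline is an assumption on my part, but it matches the printed numbers (e.g. N=26: 1232.4/89.1 = 13.83, while 1288.3/89.1 would give 14.46):

```python
def table2_speedup(t_salome, t_ext_sal, t_parallel):
    """Speedup of a parallel run against the better sequential time
    (assumed baseline; it reproduces the printed values of Table 2)."""
    return min(t_salome, t_ext_sal) / t_parallel
```

For instance, the 128-processor value for N=26 comes out as 1232.4/17.65 = 69.82, as printed.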
