
Eliminating Synchronization Overhead in Automatically Parallelized Programs Using Dynamic Feedback

PEDRO C. DINIZ

Information Sciences Institute

University of Southern California

and

MARTIN C. RINARD

Laboratory for Computer Science

Massachusetts Institute of Technology

1. INTRODUCTION

The most efficient implementation of a given abstraction often depends on the environment in which it is used. For example, the best consistency protocol in a software distributed shared memory system often depends on the access pattern of the parallel program [Falsafi et al. 1994]. The best data distribution of dense matrices on distributed memory machines depends on how the different parts of the program access the matrices [Amarasinghe and Lam 1993; Anderson and Lam 1993; Gupta and Banerjee 1992; Kennedy and Kremer 1995]. The best concrete data structure to implement a given abstract data type often depends on how it is used [Freudenberger et al. 1983; Kiczales 1986]. The best algorithm to solve a given problem often depends on the combination of input and hardware platform used to execute the algorithm [Brewer 1995]. In all of these cases, it is impossible to statically choose the best implementation—the best implementation depends on information (such as the input data, dynamic program characteristics, or hardware features) that is either difficult to extract or unavailable at compile time. If a programmer has a program with these characteristics, he or she is currently faced with two unattractive alternatives: either manually tune the program for each environment, or settle for suboptimal performance.

We have developed a technique, dynamic feedback, that enables programs to automatically adapt to different execution environments. A compiler that uses dynamic feedback produces several different versions of the same code. Each version uses a different optimization policy. The generated code alternately performs sampling phases and production phases. During a sampling phase, the generated code measures the overhead of each version in the current environment by running that version for a fixed time interval. Each production phase then uses the version with the least overhead in the previous sampling phase. After running a production phase for a fixed time interval, the generated code performs another sampling phase. If the environment has changed, the generated code dynamically adapts by using a different version in the next production phase.
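To make the alternation concrete, here is a minimal sketch of how a generated parallel section might drive the sampling and production phases. The interface names (read_timer_us, run_policy, NUM_POLICIES) are our own illustrative assumptions, not the compiler's actual generated code:

#include <cstdint>

// Assumed interfaces, for illustration only.
extern uint64_t read_timer_us();                // reads a hardware timer (microseconds)
extern double run_policy(int policy,            // runs work under one policy for the
                         uint64_t interval_us); //   given interval; returns overhead in [0,1]

const int NUM_POLICIES = 3;                     // e.g., Original, Bounded, Aggressive

void parallel_section(uint64_t sampling_us, uint64_t production_us,
                      const bool *done) {
    while (!*done) {
        // Sampling phase: run each version for a fixed interval, record its overhead.
        int best = 0;
        double best_overhead = 1.0;
        for (int p = 0; p < NUM_POLICIES; p++) {
            double overhead = run_policy(p, sampling_us);
            if (overhead < best_overhead) { best_overhead = overhead; best = p; }
        }
        // Production phase: use the version with the least overhead, then resample.
        run_policy(best, production_us);
    }
}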

We see dynamic feedback as part of a general trend towards adaptive computing. As the complexity of systems and the capabilities of compilers increase, compiler developers will find that they can automatically apply a large range of transformations, but have no good way of statically determining which transformations will deliver good results when the program is actually executed. The problem will become even more acute with the emergence of new computing paradigms such as mobile programs in the Internet. Mobile programs will be expected to execute efficiently on a wide range of hardware platforms and execution environments. The extreme heterogeneity of these systems will make it difficult, if not impossible, to generate optimal code for all platforms. Dynamic feedback is one example of the adaptive techniques that will enable the generated code to deliver good performance in a variety of different environments.

In this article we illustrate the application of dynamic feedback to a particular set of program transformations: synchronization transformations for parallel programs. These transformations occur in the context of a parallelizing compiler for object-based languages. The compiler generates parallel code that uses synchronization constructs to make operations execute atomically [Rinard and Diniz 1997]. Our experimental results show that the resulting synchronization overhead can significantly degrade the performance [Diniz and Rinard 1998; 1997]. We have developed a set of synchronization transformations and a set of synchronization optimization policies that use the transformations to reduce the synchronization overhead [Diniz and Rinard 1998; 1997].

Unfortunately, the best policy is different for different programs, and may even vary dynamically for different parts of the same program. Furthermore, the best policy depends on information, such as the global topology of the manipulated data structures and the dynamic execution schedule of the parallel tasks, that is unavailable at compile time. The compiler is therefore unable to statically choose the best synchronization optimization policy.

Our implemented compiler generates code that uses dynamic feedback to automatically choose the best synchronization optimization policy. Our experimental results show that dynamic feedback enables the automatically generated code to exhibit performance comparable to that of code that has been manually tuned to use the best policy.

1.1 Contributions

This article makes the following contributions:

—It presents a technique, dynamic feedback, that enables systems to automatically evaluate several different implementations of the same source code, then use the evaluation to choose the best implementation for the current environment.

—It shows how to apply dynamic feedback in the context of a parallelizing compiler for object-based programs. The generated code uses dynamic feedback to automatically choose the best synchronization optimization policy.

—It presents a theoretical analysis that characterizes the worst-case performance of systems that use dynamic feedback. This analysis provides, under certain assumptions, a guaranteed optimality bound for dynamic feedback relative to a hypothetical (and unrealizable) optimal algorithm that uses the best policy at every point during the execution.

—It presents experimental results for the automatically generated parallel code. These results show that, for our set of benchmark applications,

—it is possible to implement dynamic feedback with very low overhead, and

—the performance of the code that uses dynamic feedback is comparable to the performance of code that has been manually tuned to use the best synchronization optimization policy.

1.2 Scope

This article presents an application of dynamic feedback to a specific problem, the choice of the synchronization optimization policy, and to a specific class of programs, irregular scientific applications with parallel loops that contain synchronized atomic operations. Characteristics such as a small number of possible synchronization optimization policies, the use of parallel loops as a primary control structure, the lock acquisition frequency, the existence of a performance metric that allows the system to compare the performance of different policies even when the policies are used for different computations, and the fact that there is usually a single best optimization policy for each parallel loop, make dynamic feedback work well for this combination of optimization problem and application class. We expect that there may be significant issues that must be addressed when generalizing the specific techniques presented in this article to other optimization problems and other application and system contexts. We postpone a more detailed discussion of these issues until we have presented the important details of our specific approach. Section 4.6 presents this discussion.

1.3 Structure

The remainder of the article is structured as follows. In Section 2 we briefly summarize the analysis technique, commutativity analysis, that our compiler is based on. Section 3 presents the issues associated with synchronization optimizations and describes the different synchronization optimization policies. Section 4 discusses the basic issues that affect the performance impact of dynamic feedback. Section 5 presents a theoretical analysis that provides, under certain assumptions, a performance bound for dynamic feedback relative to the optimal algorithm, which uses the best policy at every point in the computation. Section 6 presents the experimental results. We discuss related work in Section 7 and conclude in Section 8.

2. COMMUTATIVITY ANALYSIS

We next provide an overview of commutativity analysis, the technique that our compiler uses to automatically parallelize our set of applications [Rinard and Diniz 1997]. This technique is designed to parallelize object-based programs. It analyzes the program at the granularity of operations on objects to determine if the operations commute, i.e., if they generate the same result regardless of the order in which they execute. If all operations in a given computation commute, the compiler can automatically generate parallel code.

To test that two operations A and B commute, the compiler must consider two execution orders: the execution order A;B in which A executes first, then B executes, and the execution order B;A in which B executes first, then A executes. The two operations commute if they meet the following commutativity testing conditions:

—Instance Variables: The new value of each instance variable of the receiver objects of A and B under the execution order A;B must be the same as the new value under the execution order B;A.

—Invoked Operations: The multiset of operations directly invoked by either A or B under the execution order A;B must be the same as the multiset of operations directly invoked by either A or B under the execution order B;A.

Both commutativity testing conditions are trivially satisfied if the two operations have different receiver objects or if neither operation writes an instance variable that the other accesses—in both of these cases the operations are independent. If the operations may not be independent, the compiler reasons about the values computed in the two execution orders.

The compiler uses symbolic execution [Kemmerer and Eckmann 1985] to extract expressions that denote the new values of instance variables and the multiset of invoked operations. Symbolic execution simply executes the methods, computing with expressions instead of values. It maintains a set of bindings that map variables to expressions that denote their values and updates the bindings as it executes the methods. The compiler uses the extracted expressions from the symbolic execution to apply the commutativity testing conditions presented above. If the compiler cannot determine that corresponding expressions are equivalent, it must conservatively assume that the two operations do not commute.
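To make the test concrete, consider a small example of our own (not from the article): an operation that accumulates a value into an instance variable of the receiver. Symbolic execution extracts the same new value for either execution order, so two invocations commute:

class Node {
    double sum = 0;  // instance variable of the receiver object
public:
    void add(double v) { sum = sum + v; }  // operation A: add(a); operation B: add(b)
};

// Symbolic execution with initial binding sum -> s0:
//   order A;B : sum -> (s0 + a) + b
//   order B;A : sum -> (s0 + b) + a
// The extracted expressions are equivalent (addition is associative and
// commutative), and both orders invoke the same (empty) multiset of
// operations, so the two invocations commute.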

The compiler uses commutativity analysis as outlined above to parallelize regions of the program. For each operation invocation site, it traverses the call graph to compute the set of operations in the computation rooted at that site. It then uses commutativity analysis to determine if all pairs of operations commute. If so, the compiler can generate code that executes all of the operations in that computation in parallel.

To ensure that operations in parallel computations execute atomically, the compiler augments each object with a mutual exclusion lock. It then automatically inserts synchronization constructs into operations that update objects. These operations first acquire the object's lock, perform the update, then release the lock. The synchronization constructs ensure that the operation executes atomically with respect to all other operations that access the object. Figure 1 presents a general example of the generated parallel code.

The compiler also applies an optimization that exposes parallel loops to the run-time system. If a for loop contains nothing but invocations of parallel versions of methods, the compiler generates parallel loop code instead of code that serially spawns the invoked operations. The run-time system can then apply standard loop scheduling techniques such as guided self-scheduling [Polychronopoulos and Kuck 1987]. A barrier at the end of the loop ensures that all of the iterations complete before the computation continues its execution after the loop.


// Original serial code:

class Object {
  private:
    // instance variables
  public:
    void Method(...);
};

void Object::Method(...) {
  // receiver object update
  ...;
  // operation invocations
  ...->Method(...);
  ...
  ...->Method(...);
}

// Generated parallel code:

class lock {
  public:
    void acquire();
    void release();
};

class Object {
  private:
    lock mutex; // mutual exclusion lock
    // instance variables
  public:
    void ParMethod(...);
};

void Object::ParMethod(...) {
  // receiver object update, executed atomically
  mutex.acquire();
  ...;
  mutex.release();
  // operation invocations, spawned concurrently
  spawn ...->ParMethod(...);
  ...
  spawn ...->ParMethod(...);
}

Fig. 1. Generic Parallel Code (parallel constructs in bold font)

In our applications, parallel loops are the primary source of available concurrency. The compiler therefore exploits concurrency only within parallel loops. The generated code executes a sequence of parallel and serial sections. Parallel sections correspond to the execution of parallel loops; serial sections correspond to the execution of serial code between parallel loops.

3. SYNCHRONIZATION OPTIMIZATIONS

We found that, in practice, the overhead generated by the synchronization constructs often reduced the performance. We therefore developed several synchronization optimization algorithms [Diniz and Rinard 1998; 1997]. These algorithms are designed for parallel programs, such as those generated by our compiler, that use mutual exclusion locks to implement critical regions. Each critical region acquires its mutual exclusion lock, performs its computation, then releases the lock.

Computations that use mutual exclusion locks may incur two kinds of overhead: locking overhead and waiting overhead. Locking overhead is the overhead generated by the execution of constructs that successfully acquire or release a lock. Waiting overhead is the overhead generated when one processor waits to acquire a lock held by another processor. If a computation releases a lock, then reacquires the same lock, it is possible to reduce the locking overhead by eliminating the release and acquire. Our synchronization optimization algorithms statically detect computations that repeatedly release and reacquire the same lock. They then apply lock elimination transformations to eliminate the intermediate release and acquire constructs [Diniz and Rinard 1998]. The goal of the lock elimination algorithm is to reduce the number of times the computation releases and acquires locks. The basic idea is to identify a computation that contains multiple critical regions that acquire and release the same lock, then transform the computation so that it contains one large critical region that acquires and releases the lock only once. Because the transformed computation acquires and releases the lock fewer times, it generates less locking overhead. Given a region over which to eliminate synchronization constructs, the algorithm uses the lock movement transformation to increase the sizes of critical regions that acquire and release the same lock until they are adjacent in the Interprocedural Control-Flow Graph (ICFG). It then uses the lock cancellation transformation to eliminate adjacent release and acquire nodes. In effect, this optimization coalesces multiple critical regions that acquire and release the same lock multiple times into a single larger critical region that includes all of the original critical regions. The larger critical region, of course, acquires and releases the lock only once. This reduction in the number of times that the computation acquires and releases locks translates directly into a reduction in the locking overhead.

Figures 2 and 3 present an example of how synchronization optimizations can reduce the number of executed acquire and release constructs. Figure 2 presents a program (inspired by the Barnes-Hut application described in Section 6) that uses mutual exclusion locks to make body::one_interaction operations execute atomically; Figure 3 presents the same computation after the optimization has coalesced the critical regions in the loop into a single critical region.

The ICFG is a generalization of the standard control-flow graph to include control-flow edges from call sites to the entry nodes of invoked operations, and from exit nodes of invoked operations back to the nodes after the call sites. The compiler constructs the ICFG in two phases. During the first phase, the compiler constructs the control-flow graph for each operation. It splits each operation invocation statement into two separate nodes: a call node and a return node. In the second phase, the compiler inserts an edge from the call node to the entry node of the invoked operation, and an edge from the exit node of the invoked operation back to the return node in the caller.
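A minimal sketch of the second phase of this construction, with representation details (node kinds, successor vectors) that are our own assumptions rather than the compiler's actual data structures:

#include <vector>

struct Node {
    int callee = -1;              // for call nodes: index of the invoked operation
    std::vector<Node*> succ;      // control-flow successors; for a call node,
                                  //   succ[0] is assumed to be its return node
};

struct Operation {
    Node *entry, *exit;           // entry and exit nodes of the operation's CFG
    std::vector<Node*> calls;     // call nodes created in the first phase
};

// Phase 2: link each call site to the callee's entry node, and the callee's
// exit node back to the return node after the call site.
void link_icfg(std::vector<Operation>& ops) {
    for (Operation& op : ops) {
        for (Node* call : op.calls) {
            Operation& callee = ops[call->callee];
            Node* ret = call->succ[0];           // return node paired with this call
            call->succ.push_back(callee.entry);  // edge: call site -> callee entry
            callee.exit->succ.push_back(ret);    // edge: callee exit -> return node
        }
    }
}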


extern double interact(double, double);

class body {
  private:
    lock mutex;
    double pos, sum;
  public:
    void one_interaction(body *b);
    void interactions(body b[], int n);
};

void body::one_interaction(body *b) {
  double val = interact(this->pos, b->pos);
  mutex.acquire();
  sum = sum + val;
  mutex.release();
}

void body::interactions(body b[], int n) {
  for (int i = 0; i < n; i++) {
    this->one_interaction(&b[i]);
  }
}

Fig. 2. Example Computation

void body::one_interaction(body *b) {
  double val = interact(this->pos, b->pos);
  sum = sum + val;
}

void body::interactions(body b[], int n) {
  mutex.acquire();
  for (int i = 0; i < n; i++) {
    this->one_interaction(&b[i]);
  }
  mutex.release();
}

Fig. 3. Optimized Example Computation


Our compiler supports three synchronization optimization policies that determine how aggressively it applies these transformations:

—Original: Use the original placement of acquire and release constructs. In the default placement, each operation that updates an object acquires and releases that object's lock.

—Bounded: Apply the transformation only if the new critical region will contain no cycles in the call graph. The idea is to limit the severity of any false exclusion by limiting the dynamic size of the critical region.

—Aggressive: Always apply the transformation.

In general, the amount of overhead depends on complicated dynamic properties of the computation such as the global topology of the manipulated data structures and the run-time scheduling of the parallel tasks. Our experimental results show that the synchronization optimizations have a large impact on the performance of our benchmark applications. Unfortunately, there is no one best policy. Because the best policy depends on information that is not available at compile time, the compiler is unable to statically choose the best policy.

4. IMPLEMENTING DYNAMIC FEEDBACK

The compiler generates code that executes an alternating sequence of serial and parallel sections. Within each parallel section, the generated code uses dynamic feedback to automatically choose the best synchronization optimization policy. The execution starts with a sampling phase, then continues with a production phase. During the sampling phase, code compiled with each policy executes for a fixed time interval. The parallel section measures the overhead of each policy, and uses the best policy during the production phase. The parallel section periodically resamples to adapt to changes in the best policy. We next discuss the specific issues associated with implementing this general approach.

4.1 Detecting Interval Expiration

The generated code for the sampling phase executes each policy for a fixed sampling time interval. The production phase also executes for a fixed production time interval, although the production intervals are typically much longer than the sampling intervals. The compiler uses two values to control the lengths of the sampling and production intervals: the target sampling interval and the target production interval. At the start of each interval, the generated code reads a timer to obtain the starting time. As it executes, the code periodically polls the timer: it reads the timer, computes the difference of the current time and the starting time, then compares the difference with the target interval. The comparison enables the code to detect when the interval has expired; a code sketch of this polling loop appears at the end of this subsection. Several implementation issues determine the effectiveness of this approach:

—Potential Switch Points: In general, it is possible to switch policies only at specific potential switch points during the execution of the program. The rate at which the potential switch points occur in the execution determines the minimum polling rate, which in turn determines how quickly the generated code responds to the expiration of the current interval.

In all of our benchmark applications, each parallel section executes a parallel loop. A potential switch point occurs at each iteration of the loop, and the generated code tests for the expiration of the current interval each time it completes an iteration. A potential drawback of using polling to determine interval expiration is that if the execution of a single iteration is significantly longer than the target sampling or production interval, the code will be able to react only after it has successfully completed the computation allotted to the sampled code. In our benchmark applications, the individual iterations of the loops are small enough so that each processor can respond reasonably quickly to the expiration of the interval.

Code that uses parallel operations instead of parallel loops requires a more involved code generation scheme. For the problem of choosing the synchronization optimization policy, a potential switch point would occur at the beginning of each parallel operation. The compiler would therefore generate code that checks for the expiration of the current interval at the start of each parallel operation instead of at the start of each loop iteration.

—Polling Overhead: The polling overhead is determined in large part by the overhead of reading the timer. Our currently generated code uses the timer on the Stanford DASH machine. The overhead of accessing this timer is approximately nine microseconds, which, as we report in Section 6, is negligible compared with the sizes of the iterations of the parallel loops in our benchmark set of applications.

—Synchronous Switching: The generated code switches policies synchronously. When an interval expires, each processor waits at a barrier until all of the other processors detect that the interval has expired and arrive at the barrier. This strategy ensures that all processors use the same policy during each sampling interval. The measured overhead therefore accurately reflects the overhead of the policy. Synchronous switching also avoids the possibility of interference between incompatible policies.

Synchronous switching poses two potential drawbacks: barrier waiting and barrier overhead. Barrier waiting occurs when each processor must wait for all of the other processors to detect the expiration of the current interval before it can proceed to the next interval. This effect can have a significant negative impact on the performance if one of the iterations of the parallel loop executes for a long time relative to the other iterations and to the sampling interval. The combination of an especially bad policy (for example, a synchronization optimization policy that serializes the computation) and iterations of the loop that execute for a significant time relative to the sampling interval can also cause poor performance.

Barrier overhead is the inherent overhead of the barrier construct itself. On the Stanford DASH machine, we implemented the barrier by maintaining a count of the number of processors that reach the barrier. We used a lock to make the count operations atomic. For this implementation, the overhead of the barrier is negligible compared to the other overheads associated with the use of dynamic feedback.

—Timer Precision: The precision of the timer places a lower bound on the size of each interval. The timer must tick at least once before the interval expires. In general, we do not expect the precision of the timer to cause any problems. Our generated code uses target sampling intervals of at least several milliseconds in length. Most systems provide timers with at least this resolution.

All of these issues combine to determine the effective sampling interval, or the minimum time from the start of the interval to the time when all of the processors detect that the interval has expired and proceed to the next interval.
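The following sketch shows the per-iteration polling and synchronous switch described above. The timer and barrier interfaces stand in for the DASH timer and the counter-based barrier discussed in the text; the function signature is an illustrative assumption:

#include <cstdint>

extern uint64_t read_timer_us();  // hardware timer; roughly 9 us per read on DASH
extern void barrier_wait();       // counter-based barrier described above

// Runs iterations under the current policy until the target interval expires,
// then synchronizes. Returns the first unexecuted iteration so the caller can
// continue the remaining work under the next policy after the barrier.
int run_interval(uint64_t target_us, int begin, int end, void (*iteration)(int)) {
    uint64_t start = read_timer_us();
    int i = begin;
    while (i < end) {
        iteration(i++);                            // one potential switch point
        if (read_timer_us() - start >= target_us)  // poll: has the interval expired?
            break;
    }
    barrier_wait();  // synchronous switching: all processors arrive before proceeding
    return i;
}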

4.2 Switching Policies

During the sampling phase, the generated code must switch quickly between different synchronization optimization policies. The current compiler generates three versions of each parallel section of code. Each version uses a different synchronization optimization policy. The advantage of this approach is that the code for each policy is always available, which enables the compiler to switch very quickly between different policies. The currently generated code simply executes a switch statement at each parallel loop iteration to dispatch to the code that implements the current policy.
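A minimal sketch of this dispatch (the version names and function structure are illustrative assumptions; the article does not show the generated code itself):

enum Policy { ORIGINAL, BOUNDED, AGGRESSIVE };

extern void iter_original(int i);    // loop body compiled with the Original policy
extern void iter_bounded(int i);     // loop body compiled with the Bounded policy
extern void iter_aggressive(int i);  // loop body compiled with the Aggressive policy

// Executed once per parallel loop iteration to select the current version.
void loop_iteration(Policy current, int i) {
    switch (current) {
        case ORIGINAL:   iter_original(i);   break;
        case BOUNDED:    iter_bounded(i);    break;
        case AGGRESSIVE: iter_aggressive(i); break;
    }
}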

The potential disadvantage is an increase in the size of the generated code. Table I presents the sizes of the text segments for several different versions of the benchmark applications from Section 6. These data are from the object files of the compiled applications before linking and therefore include code only from the applications—there is no code from libraries. The Serial version is the original serial program, the Original version uses the Original synchronization optimization policy, and the Dynamic version uses dynamic feedback. In general, the increases in the code size are quite small. This is due, in part, to an algorithm in the compiler that locates closed subgraphs of the call graph that are the same for all optimization policies. The compiler generates a single version of each method in the subgraph, instead of one version per synchronization optimization policy.

Table I. Executable Code Sizes (bytes): the sizes of the Serial, Original, and Dynamic versions of each application.

We also considered using dynamic compilation [Auslander et al. 1996; Engler 1996; Lee and Leone 1996] to produce the different versions of the parallel sections as they were required. Although this approach would reduce the amount of code present at any given point in time, it would significantly increase the amount of time required to switch policies in the sampling phases. This alternative would therefore become viable only for situations in which the minimum sampling phases could be significantly longer than we wished.

Finally, it is possible for the compiler to generate a single parameterized version of the code that can use any of the three synchronization optimization policies. The idea is to generate a conditional acquire or release construct at all of the sites that may acquire or release a lock in any of the synchronization optimization policies. Each site has a flag that controls whether it actually executes the construct; each acquire or release site tests its flag to determine if it should acquire or release the lock. In this scenario, the generated code switches policies by changing the values of the flags. The advantage of this approach is the guarantee of no code growth; the disadvantage is the residual flag checking overhead at each conditional acquire or release site.
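A sketch of such a conditional site, reusing the lock class from Figure 1; the flag layout is our own assumption:

// One flag per policy at each potential acquire/release site.
struct CondLockSite {
    bool enabled[3];  // does this site really lock under each policy?
};

// The site tests its flag before executing the construct; switching policies
// amounts to changing flag values, so no extra code versions are needed.
inline void cond_acquire(lock& mutex, const CondLockSite& site, int policy) {
    if (site.enabled[policy]) mutex.acquire();
}

inline void cond_release(lock& mutex, const CondLockSite& site, int policy) {
    if (site.enabled[policy]) mutex.release();
}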

4.3 Measuring the Overhead of Each Policy

To choose the policy with the least overhead, the generated code must first measure the overhead. The compiler instruments the code to collect three measurements:

—Locking Overhead: The generated code computes the locking overhead by counting the number of times that the computation acquires and releases a lock. This number is computed by incrementing a counter every time the computation acquires a lock. The locking overhead is simply the time required to acquire and release a lock times the number of times the computation acquires a lock.

—Waiting Overhead: The current implementation uses spin locks. The hardware exports a construct that allows the computation to attempt to acquire a lock; the return value indicates whether the lock was actually acquired. To acquire a lock, the computation repeatedly executes the hardware lock acquire construct until the attempted acquire succeeds. The computation increments a counter every time an attempt to acquire a lock fails. The waiting overhead is the time required to attempt, and fail, to acquire a lock times the number of failed acquires.

—Execution Time: The amount of time that the computation spends executing code from the application. This time is measured by reading the timer when a processor starts to execute application code, then reading the timer again when the processor finishes executing application code. The processor then subtracts the first time from the second time, and adds the difference to a running sum. As measured, the execution time includes the waiting time and the time spent acquiring and releasing locks. It is possible to subtract these two sources of overhead to obtain the amount of time spent performing useful computation.

Together, these measurements allow the compiler to evaluate the total overhead of each synchronization optimization policy. The total overhead is simply the sum of the locking and waiting overheads divided by the execution time. The total overhead is therefore always between zero and one. The compiler uses the total overhead to choose the best synchronization optimization policy—the policy with the lowest overhead is the best.
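In code form, the bookkeeping and the overhead formula might look like the following sketch; the per-operation lock costs are machine parameters, and the constants below are placeholders rather than measured values:

struct PolicyStats {
    long acquires = 0;         // successful acquire/release pairs
    long failed_acquires = 0;  // failed spin-lock acquire attempts
    double exec_time = 0.0;    // application execution time (seconds)
};

const double LOCK_PAIR_COST  = 1.0e-6;  // assumed time to acquire and release once
const double FAILED_TRY_COST = 5.0e-7;  // assumed time for one failed attempt

// Total overhead = (locking + waiting) / execution time. It always falls in
// [0, 1] because the execution time includes both sources of overhead.
double total_overhead(const PolicyStats& s) {
    double locking = LOCK_PAIR_COST  * s.acquires;
    double waiting = FAILED_TRY_COST * s.failed_acquires;
    return (locking + waiting) / s.exec_time;
}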

One potential concern is the instrumentation overhead, which consists primarily of the operations that count the number of times each processor acquires and releases a lock. The operations that count the number of times that a processor attempted to acquire a lock and failed have little impact on the execution, because they occur during an interval when the processor would otherwise be waiting idle for another processor to release the lock.

We experimentally measured the impact of the instrumentation overhead on our set of benchmark applications by generating versions of the applications that use a single, statically chosen, synchronization optimization policy. We then executed these versions with and without the instrumentation. The performance differences between the instrumented and uninstrumented versions were very small, which indicates that the instrumentation overhead had little or no impact on the overall performance.


4.4 Choosing Sampling and Production Intervals

The sizes of the target sampling and production intervals can have a significant impact on the overall performance of the generated code. Excessively long sampling intervals may degrade the performance by executing non-optimal versions of the code for a long time. But if the sampling interval is too short, it may not yield an accurate measurement of the overhead. In the worst case, an inaccurate overhead measurement may cause the production phase to use the wrong synchronization optimization policy.

We expect the minimum absolute length of the sampling interval to be different for different applications. In practice, we have had little difficulty choosing default values that work well for our applications. In fact, it is possible to make the target sampling intervals very small for all of our applications—the minimum effective sampling intervals are large enough to provide overhead measurements that accurately reflect the relative overheads in the production phases.

To achieve good performance, the production phase must be long enough to profitably amortize the cost of the sampling phase. In practice, we have found that the major component of the sampling cost is the time spent executing the non-optimal versions.

In our current implementation of dynamic feedback, the length of the parallel section may also limit the performance. Our current implementation always executes a sampling phase at the beginning of each parallel section. If a parallel section does not contain enough computation for a production phase of the desired length, the computation may be unable to successfully amortize the sampling overhead. It should be possible to eliminate this potential problem by generating code that allows sampling and production intervals to span multiple executions of the parallel phase. This code would still maintain separate sampling and production intervals for each parallel section, but allow the intervals to contain multiple executions of the section.

In practice, we have had little difficulty choosing target production intervals that work well for our applications. All of our applications perform well with fixed target production intervals that range from five to … seconds.

4.5 Early Cut Off and Policy Ordering

In many cases, we expect that the individual sources of overhead will be either monotonically nondecreasing or monotonically nonincreasing across the set of possible implementations. The locking overhead, for example, never increases as the policy goes from Original to Bounded to Aggressive. The waiting overhead, on the other hand, should never decrease as the policy goes from Original to Bounded to Aggressive. These properties suggest the use of an early cut off to limit the number of sampled policies. If the Aggressive policy generates very little waiting overhead or the Original policy generates very little locking overhead, there is no need to sample any other policy.

It may therefore be possible to improve the sampling phase by trying extreme policies first, then going directly to the production phase if the overhead measurements indicate that no other policy would do significantly better. It may also be possible to improve the sampling phase by ordering the policies. The generated code could sample a given policy first if it has done well in the past. If the measured overhead continued to be acceptable, the generated code could go directly to the production phase.
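A sketch of the early cut off, reusing run_policy and the Policy names from the earlier sketches. Because the total overhead is the sum of the locking and waiting overheads, a tiny total overhead implies that both components are tiny, so testing the total is a conservative version of the criterion above; the cutoff value is an assumption, not a number from the article:

const double CUTOFF = 0.01;  // assumed threshold for "very little" overhead

int choose_policy(uint64_t sampling_us) {
    // Try the extreme policies first.
    double agg = run_policy(AGGRESSIVE, sampling_us);
    if (agg < CUTOFF) return AGGRESSIVE;   // almost no waiting overhead
    double orig = run_policy(ORIGINAL, sampling_us);
    if (orig < CUTOFF) return ORIGINAL;    // almost no locking overhead
    // Otherwise fall back to full sampling and pick the minimum overhead.
    double bnd = run_policy(BOUNDED, sampling_us);
    if (agg <= orig && agg <= bnd) return AGGRESSIVE;
    return (orig <= bnd) ? ORIGINAL : BOUNDED;
}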


4.6 Limitations

In this article, we present an application of dynamic feedback to a specific problem: the choice of the best synchronization optimization policy. In the context of our benchmark applications, this problem has a set of characteristics that make it particularly suitable for the use of dynamic feedback. First, we were able to identify a small number of potential policies that were suitable for the synchronization optimization problem. The final executable contains code compiled with each different policy. For problems with a large number of potential policies, it is clearly infeasible to include code for all of the policies in the shipped executable. As discussed in Section 4.2, it may be possible in some cases to generate parameterized code that would implement different policies depending on the setting of the parameters. In general, however, it would be necessary to generate the code for each policy dynamically, which would add to the overhead of using dynamic feedback.

The sampling phases in the dynamic feedback algorithm execute different computations with different optimal execution times. This raises the question of how to compare performance results from different computations with different optimal execution characteristics. Locks support a performance measure (the sum of the locking and waiting overheads) that can be used to compute the overhead of each policy as a percentage of the running time of the computation. Because these overhead measurements factor out the running times of the computations, they can be used to compare the performance impact of the different policies even when the policies were used for different computations. In general, however, it may not be clear how to derive a performance metric with this property.

In all of our applications, the locking frequency is small enough and the overhead of acquiring and releasing a lock is large enough so that it is possible to instrument the locking code without significantly perturbing the performance or execution characteristics of the sampling code. In general, however, instrumentation code may unacceptably perturb the computation. In some situations, zero-overhead hardware instrumentation such as performance counters may provide an acceptable alternative [Anderson et al. 1997].

In all of our applications, the concurrency is exploited within parallel loops that contain synchronized atomic operations. Parallel loops provide a convenient way to set up the boundaries of the sampling and production phases: the compiler can simply check for the expiration of the phase at the beginning of each loop iteration. In our applications, the loop iterations are large enough to profitably amortize the checking overhead, but small enough to quickly detect the expiration of sampling and production phases. In general, the issue of where to place the code that checks for the expiration of phases may complicate the application of dynamic feedback to different optimization problems.

In all of our applications, the best policy is relatively constant for each parallel loop. So the approach of sampling to find the best policy, then using the best policy for a relatively long production phase, works well for our applications. For applications in which the best policy changes more quickly, it would be necessary to sample more frequently. Because increasing the sampling frequency also increases the sampling overhead, it is possible to imagine applications for which it is simply not possible to sample fast enough to profitably adapt to changes in the best policy. This raises the question of how to choose the best sampling and production intervals. In most situations, we expect the smallest sampling frequency that profitably amortizes the sampling overhead to be significantly smaller than the largest sampling frequency that adapts quickly enough to changes in the best policy.

In extreme cases, it might be possible to use a second-order application of dynamic feedback to choose an acceptable sampling frequency. In these scenarios, the compiler would generate code that accepts the sampling frequency as a parameter. A second-order sampling algorithm would test different values of the sampling frequency to find one with an acceptable trade-off between the adaptation latency and the sampling overhead.

Another issue related to the sizes of the sampling and production intervals is that the production interval must be long enough to profitably amortize the time wasted while sampling a policy with poor performance. Section 5 presents a theoretical analysis that can be used to guide the choice of sampling and production interval lengths, but this analysis is valid only under some fairly restrictive assumptions. In practice, some of our applications sample policies that deliver almost no useful work. Even for these applications, it was not difficult to obtain a production interval long enough to amortize the sampling overhead.

5. DYNAMIC SELECTION OF SAMPLING AND PRODUCTION INTERVALS

In this section we present a theoretical analysis of the performance of dynamic feedback. We start by observing that if no constraint is imposed on how fast the overhead of each policy may change, it is difficult to obtain any meaningful optimality result for any sampling algorithm—the overhead of each policy may change dramatically right after the sampling phase. To simplify the analysis, we assume that the best policy does not change during the production intervals. In this situation, the worst-case scenario for dynamic feedback is for the best policy to have no overhead at all, and for all of the other policies to perform no useful work. We analyze a slightly more general scenario in which there are $n$ policies whose relative overheads do not change over time. While simple, this scenario provides insight into how to dynamically select the lengths of the sampling and production intervals. As part of this theoretical analysis, we derive a performance bound for dynamic feedback as compared to the optimal algorithm that always uses the best policy.

5.1 Definitions

We first establish some notation. There are $n$ different policies, $p_1, \ldots, p_n$. Each variable $s_i$ denotes the effective sampling interval for the policy $p_i$. The total length of the sampling phase is therefore the sum of the sampling intervals for the $n$ sampled policies, i.e., $S = \sum_{i=1}^{n} s_i$. The length of the production interval is denoted by $P$.

The computation starts with a sampling phase. During this phase, the dynamic feedback algorithm executes each of the $n$ policies $p_i$ for a sampling interval $s_i$ to derive overhead measurements $v_1, \ldots, v_n$ for each of the policies. The overhead $v_i$ is the proportion of the total execution time spent acquiring and releasing locks or waiting for other processors to release locks. The overhead therefore varies between zero (if the computation never acquires a lock) and one (if the computation performs no useful work), i.e., $0 \le v_i \le 1$. We now define a precise way to measure the amount of useful work that a given policy delivers during a given period of time.

DEFINITION 1. The amount of useful work delivered by a policy $p_i$ over a period of time $t$ is denoted by $\mathrm{Work}(p_i, t)$, where Work is defined as

$$\mathrm{Work}(p_i, t) = (1 - v_i)\, t \qquad (1)$$

Conversely, we define the performance of a given policy as the amount of time taken to deliver $w$ units of work. Given the definition above for the amount of useful work, finding the time it takes for a given policy $p_i$ to deliver $w$ units of work amounts to finding the time $t$ for which $\mathrm{Work}(p_i, t)$ equals $w$. The next definition formalizes this notion.

DEFINITION 2. The performance of a given policy $p_i$ for a given amount of work $w$ is defined to be the amount of time $\mathrm{Time}(p_i, w)$ that $p_i$ takes to deliver $w$ units of work. Given an amount of work $w$, the following equations provide the value of $\mathrm{Time}(p_i, w)$:

$$w = \mathrm{Work}(p_i, \mathrm{Time}(p_i, w)) = (1 - v_i)\, \mathrm{Time}(p_i, w) \qquad (2)$$

$$\mathrm{Time}(p_i, w) = \frac{w}{1 - v_i} \qquad (3)$$

5.2 Performance Analysis

We now analyze the performance of dynamic feedback. We first derive the performance of the dynamic feedback algorithm, then compare the performance of this algorithm with that of the optimal algorithm that always executes the policy with the lowest overhead. Based on this comparison, we determine the production interval $P$ for which the dynamic feedback algorithm is guaranteed to be at most a factor of $1 + \epsilon$ worse than the optimal algorithm.

Without loss of generality, we assume that policy $p_1$ is the policy with the lowest sampled overhead. The dynamic feedback algorithm therefore executes policy $p_1$ during the production interval. We further assume that the values $v_1, \ldots, v_n$ measured during the sampling phase accurately reflect the actual overheads at the start of the production phase, and that the measured overheads do not change during the sampling or production phases.

Under these assumptions, the worst-case scenario is for the dynamic feedback algorithm to complete the work at the very end of a sampling phase. In other words, the dynamic feedback algorithm completes $m$ sampling phases but only $m - 1$ production phases. This is the worst-case scenario because it maximizes the amount of time that the dynamic feedback algorithm spends executing suboptimal policies. We derive the time it would take for the optimal algorithm to deliver the same amount of work, then compare the times taken by the two algorithms.

For this scenario, the amount of work performed by the dynamic feedback algorithm during the $m$ sampling phases and $m - 1$ production phases is given by:

$$\mathrm{Work}_{df} = m \sum_{i=1}^{n} (1 - v_i)\, s_i + (m - 1)(1 - v_1)\, P \qquad (4)$$

The corresponding amount of time required to deliver this amount of work is simply the combined time of the $m$ sampling phases and $m - 1$ production phases:

$$\mathrm{Time}_{df} = m \sum_{i=1}^{n} s_i + (m - 1)\, P = m S + (m - 1)\, P \qquad (5)$$

The time required for the optimal algorithm to deliver the same amount of work is:

$$\mathrm{Time}_{opt} = \mathrm{Time}(p_1, \mathrm{Work}_{df}) = \frac{\mathrm{Work}_{df}}{1 - v_1} \qquad (6)$$

The relative performance of the two algorithms, as defined above, is:

$$\frac{\mathrm{Time}_{df}}{\mathrm{Time}_{opt}} = \frac{m S + (m - 1)\, P}{\dfrac{m \sum_{i=1}^{n} (1 - v_i)\, s_i + (m - 1)(1 - v_1)\, P}{1 - v_1}} \qquad (7)$$

If we let

$$S' = \frac{\sum_{i=1}^{n} (1 - v_i)\, s_i}{1 - v_1} \qquad (8)$$

denote the time the optimal algorithm would take to deliver the work of one sampling phase, then as $m \to \infty$, we obtain the following intermediate equation:

$$\frac{\mathrm{Time}_{df}}{\mathrm{Time}_{opt}} = \frac{S + P}{S' + P} \qquad (9)$$

We can rearrange the terms in the numerator to obtain the following equation:

$$\frac{\mathrm{Time}_{df}}{\mathrm{Time}_{opt}} = 1 + \frac{S - S'}{S' + P} \qquad (10)$$

Given a performance bound $1 + \epsilon$, we use Equation 10 above to derive the minimum value for the production interval required to satisfy the performance bound:

$$1 + \frac{S - S'}{S' + P} \le 1 + \epsilon \qquad (11)$$

Algebraic simplification of this inequality yields:

$$S - S' \le \epsilon\, (S' + P) \qquad (12)$$

We can divide both sides of this inequality by $\epsilon$ to obtain the following inequality:

$$P \ge \frac{S - S'}{\epsilon} - S' \qquad (13)$$
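As a worked illustration with hypothetical numbers (not measurements from the article): suppose $n = 3$ policies with sampled overheads $v_1 = 0.1$, $v_2 = 0.2$, $v_3 = 0.4$, equal sampling intervals $s_i = 10$ ms, and a desired bound of $\epsilon = 0.05$. Then

$$S = 30 \text{ ms}, \qquad S' = \frac{(0.9 + 0.8 + 0.6) \times 10 \text{ ms}}{0.9} \approx 25.6 \text{ ms},$$

$$P \ge \frac{30 - 25.6}{0.05} - 25.6 \approx 63 \text{ ms},$$

so under these assumptions a production interval of roughly 63 ms or more keeps dynamic feedback within 5% of the optimal algorithm.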

6. EXPERIMENTAL RESULTS

6.1 Experimental Platform


We performed our experiments on the Stanford DASH machine, which is organized as a group of processing clusters connected by a mesh interconnection network. Each of the clusters is a Silicon Graphics 4D/340 bus-based multiprocessor. The 4D/340 system has four processing nodes, each of which contains a 33 MHz R3000 processor, an R3010 floating-point co-processor, a 64 KByte instruction cache, a 64 KByte first-level write-through data cache, and a 256 KByte second-level write-back data cache. Each node has a peak performance of 25 VAX MIPS and 10 double-precision MFLOPS. Cache coherence within a cluster is maintained at the level of 16-byte lines via a bus-based snoopy protocol. Each cluster also includes a directory processor that snoops on the bus and handles references to and from other clusters. The directory processor maintains directory information on the cacheable main memory within that cluster that indicates which clusters, if any, currently cache each line.

The interconnection network consists of a pair of wormhole routed meshes, one for request messages and one for replies. The total bandwidth in and out of each cluster is … megabytes per second.

6.2 Barnes-Hut

Table II presents the execution times for the different versions of Barnes-Hut. Figure 4 presents the corresponding speedup curves. All experimental results are for an input data set of 16,384 bodies. The static versions (Original, Bounded, and Aggressive) execute without the instrumentation required to compute the locking or waiting overhead. The Dynamic version, the version that uses dynamic feedback, must contain this instrumentation because it uses the locking and waiting overhead measurements to determine the best synchronization optimization policy.

Table II. Execution Times for Barnes-Hut (seconds): Serial, Original, Bounded, Aggressive, and Dynamic versions.

For this application, the synchronization optimization policy has a significant impact on the overall performance, with the Aggressive version significantly outperforming both the Original and the Bounded versions. The performance of the Dynamic version is quite close to that of the Aggressive version. A programmer manually tuning this application would select Aggressive as the best static synchronization optimization policy.

Table III presents the locking overhead for the different versions of Barnes-Hut. The execution times are correlated with the locking overhead. For all versions except Dynamic, the number of executed acquire and release constructs (and therefore the locking overhead) does not vary as the number of processors varies. For the Dynamic version, the number of executed acquire and release constructs increases slightly as the number of processors increases. The numbers in the table for the Dynamic version are from an eight processor run.

Fig. 4. Speedups for Barnes-Hut (speedup versus number of processors for the Original, Bounded, Aggressive, and Dynamic Feedback versions)

Table III. Locking Overhead for Barnes-Hut: executed acquire and release constructs and the resulting overhead (seconds) for each version.


Fig. 5. Sampled Overhead for the Barnes-Hut FORCES Section on Eight Processors (sampled overhead versus execution time in seconds for the Original, Bounded, and Aggressive policies)

… of useful work in the section. We report the mean iteration size because the generated code checks for the expiration of the sampling and production intervals only at the granularity of the loop iterations. The loop iterations must therefore be large enough to amortize the expiration checks. For Barnes-Hut this is clearly the case—the cost of the expiration check is dominated by the 9 microsecond cost of reading the timer, which is negligible compared with the 4.2 millisecond mean iteration size. Tables X and XVI present the mean loop iteration sizes for Water and String, respectively. The cost of the expiration check is also negligible compared with the mean iteration sizes in these applications.

Table IV. Mean Section Size and Mean Iteration Size for Barnes-Hut (input size 16,384).

Table V. Mean Sampled Overhead for Each Version of the Barnes-Hut FORCES Section.

Table VI. Execution Times for Varying Production and Sampling Intervals for Barnes-Hut on Eight Processors (seconds): FORCES Section and Complete Application times for several combinations of production and sampling intervals.

In Section 5 we derived Equation 15, which provides the asymptotic performance bound of dynamic feedback given values for the overheads, sampling intervals, and production interval. This equation is valid under the assumption that the overheads do not change during the production phase. As Figure 5 indicates, the overheads of the different policies are roughly constant for the FORCES section of Barnes-Hut. Table VII presents the calculated asymptotic performance bounds for the FORCES section of Barnes-Hut on eight processors as a function of the measured overheads and different production and sampling intervals.

For each policy, we use the mean sampled overhead as $v_i$ and the maximum of the target sampling interval and that policy's mean minimum effective sampling interval as the value for that policy's sampling interval $s_i$ in the calculations. This choice reflects the fact that for some policies, it may not be possible to achieve a small target sampling interval. For the FORCES section of Barnes-Hut, all of the mean minimum effective sampling intervals are smaller than all of the target sampling intervals. This is not true for our other applications.
