Accurate and Efficient SLA Compliance Monitoring

更新时间:2023-04-09 13:13:01 阅读量: 实用文档 文档下载

说明:文章内容仅供预览,部分内容可能不全。下载后的文档,内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的,是否完整无缺。

Accurate and Ef?cient SLA Compliance Monitoring

Joel Sommers University of Wisconsin-Madison jsommers@6a664dba1a37f111f1855b08

Paul Barford University of Wisconsin-Madison pb@6a664dba1a37f111f1855b08

Nick Duf?eld

A T&T Labs-Research

duf?eld@6a664dba1a37f111f1855b08

Amos Ron University of Wisconsin-Madison amos@6a664dba1a37f111f1855b08

ABSTRACT

Service level agreements(SLAs)de?ne performance guarantees made by service providers,e.g,in terms of packet loss,delay,delay variation,and network availability.In this paper,we describe a new active measurement methodology to accurately monitor whether measured network path characteristics are in compliance with per-formance targets speci?ed in SLAs.Speci?cally,(1)we describe a new methodology for estimating packet loss rate that signi?cantly improves accuracy over existing approaches;(2)we introduce a new methodology for measuring mean delay along a path that im-proves accuracy over existing methodologies,and propose a method for obtaining con?dence intervals on quantiles of the empirical de-lay distribution without making any assumption about the true dis-tribution of delay;(3)we introduce a new methodology for mea-suring delay variation that is more robust than prior techniques; and(4)we extend existing work in network performance tomog-raphy to infer lower bounds on the quantiles of a distribution of performance measures along an unmeasured path given measure-ments from a subset of paths.We unify active measurements for these metrics in a discrete time-based tool called SLA M.The uni-

?ed probe stream from SLA M consumes lower overall bandwidth than if inpidual streams are used to measure path properties.We demonstrate the accuracy and convergence properties of SLA M in

a controlled laboratory environment using a range of background traf?c scenarios and in one-and two-hop settings,and examine its accuracy improvements over existing standard techniques. Categories and Subject Descriptors:C.2.3[Network Operations]: Network management,Network monitoring,C.2.5[Local and Wide-Area Networks]:Internet(e.g.,TCP/IP),C.4[Performance of Sys-tems]:Measurement Techniques

General Terms:Algorithms,Experimentation,Management,Mea-surement,Performance

Keywords:Active Measurement,Network Congestion,Network Delay,Network Jitter,Packet Loss,Service-Level Agreements,SLA M 1.INTRODUCTION

Network service level agreements(SLAs)detail the contractual obligations between service providers and their customers.It is Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for pro?t or commercial advantage and that copies bear this notice and the full citation on the?rst page.To copy otherwise,to republish,to post on servers or to redistribute to lists,requires prior speci?c permission and/or a fee.

SIGCOMM’07,August27–31,2007,Kyoto,Japan.

Copyright2007ACM978-1-59593-713-1/07/0008...$5.00.increasingly common for SLAs to specify transport-level perfor-mance assurances using metrics such as packet loss,delay,delay variation,and network availability[2,3,4,33].Meeting SLA guar-antees results in revenue for the ISP.However,failing to meet SLA guarantees can result in credits to the customer.The implications of not meeting SLA guarantees are therefore serious:a disruption in service can result in signi?cant revenue loss to both the customer and provider.SLA compliance monitoring,assessing whether per-formance characteristics are within speci?ed bounds,is therefore critical to both parties.

Compliance monitoring is a critical challenge for SLA engineer-ing.SLAs must be designed that can be accurately and ef?ciently monitored,while simultaneously limiting the risk of non-compliance. For example,assuring a low loss rate might be possible only if loss rates can be estimated with suf?ciently high con?dence.Although passive measurements(e.g.,SNMP MIB counters)may provide high accuracy for a metric such as loss on a link-by-link basis,they may be insuf?cient for estimating the performance experienced by customer traf?c.Thus,although there are situations where active measurements may be too heavyweight or may yield inaccurate re-sults[10,31,35],they nonetheless remain a key mechanism for SLA compliance monitoring.

In this paper,we address the following questions:can SLA com-pliance along a path be accurately monitored with a single lightweight probe stream?and can this stream be the basis for ef?cient network-wide compliance monitoring?There have been a large number of active measurement methodologies proposed to estimate transport-level performance characteristics.Nonetheless,there has been little work to directly address the speci?c problem of SLA compliance monitoring.In this context,measurement accuracy,ability to re-port con?dence bounds,ability to quickly adapt to changing net-work conditions,and ability to ef?ciently assess performance on a network-wide basis are of great importance.

The?rst contribution of this paper is the introduction of a new active measurement methodology to accurately assess whether mea-sured network path characteristics are in compliance with speci?ed targets.We describe a heuristic technique for estimating packet loss rate along a path that signi?cantly improves accuracy over ex-isting approaches.Second,we introduce a new method for mea-suring mean delay along a path that is more accurate than exist-ing methodologies.We also develop a mathematical foundation for obtaining con?dence intervals for the quantiles of the empirical delay distribution.Third,we introduce a new method for mea-suring delay variation that is more robust than prior techniques. These probe algorithms are uni?ed in a multi-objective discrete-time based tool called SLA M(SLA M onitor),which was sketched in an earlier workshop paper[36].That paper was limited to intro-ducing SLA M’s architectural framework and outlining the loss rate

measurement heuristic used by SLA M.

The second contribution of this paper is to extend prior work in the area of performance tomography toward the goal of network-wide SLA compliance monitoring.In particular,we develop a methodology to infer lower bounds on the quantiles of a distri-bution of path performance measures using measurements from a subset of network paths.

We demonstrate the properties of SLA M in a controlled labora-tory environment using a range of background traf?c scenarios and using both one-and two-hop topologies.We compare SLA M’s de-lay and loss estimation accuracy with standard IPPM probe method-ologies[7,8]of the same rate,and examine the convergence and robustness of SLA M estimates of delay,delay variation,and loss. Our experiments show that our estimates of mean delay are within one msec of the true mean delay,while the standard probe method-ology[7]can suffer inaccuracies up to about a factor of two.We also show that for a con?dence level of90%,SLA M’s estimated bounds on a wide range of delay quantiles,with few exceptions, include the true quantile value.We show that in a simple two-hop topology,the inferred bound on the delay distribution is tight,and close to the actual distribution.Our experiments also reveal that SLA M estimates the end-to-end loss rate with high accuracy and with good con?dence bounds.For example,in a scenario using self-similar background traf?c,the true loss rate over a15minute period is0.08%and the SLA M estimate is0.07%.In contrast, the standard method for estimating loss rate[8]can have errors of more than two orders of magnitude.We demonstrate the robustness of SLA M’s delay variation monitoring methodology,showing how the existing standard RTP jitter metric[32]may be too sensitive to network path conditions,and that SLA M performs well in our more complex two-hop scenario.

2.RELATED WORK

General aspects and structure of SLAs have been discussed in[27, 33].Performance assurances provided by SLAs range from net-work path availability,to transport-level metrics,to application-speci?c metrics.These guarantees may be based on a variety of statistics of the particular metric,such as the mean,median,or a high quantile,computed over various time scales.Examples of the kinds of guarantees offered by service providers are available on-line[2,3,4].

To ensure that SLA performance targets are met with high proba-bility,service providers collect measurements either passively within the network,by injecting measurement probes into the network, or by using a combination of both[6,13,18,42].While active measurement-based compliance monitoring has received some at-tention in the past,e.g.,[18],there has been little validation in realistic environments where a reliable basis for comparison can be established.There has been limited work addressing the ac-curacy of some active measurement approaches;exceptions are found in[10,31,35].The issue of accuracy clearly has serious implications for SLA compliance monitoring.Other efforts have been limited in focus to estimation and optimization of a single metric,e.g.,[16,19].Our work takes an active measurement ap-proach,focusing on simultaneous,or multi-objective,measurement of transport-level performance metrics.We further differentiate our work through validation in a controlled,realistic testbed.

In general,there has been a great deal of work on active mea-surements of end-to-end delay,delay variation,and loss,e.g.,[7, 8,11,19,21,28,29,30,40,41].IETF standardization efforts for active measurement of delay,delay variation,loss,and reordering have taken place within the IETF IPPM working group[7,8,21,30]. Regarding delay,our method for distribution quantile estimation is distinguished from the earlier work of Choi et al.[16]in that we do not require the quantile of interest to be speci?ed a priori,and that we do not make any assumption regarding the underlying de-lay distribution.As a result,our method is robust to abrupt changes in underlying network 6a664dba1a37f111f1855b08stly,we note that our formu-lation of a delay variation measurement methodology stands apart from the related IPPM[21]and real-time protocol(RTP)[32]spec-i?cations in that rather than considering highly localized variations in delay(e.g.,between consecutive probe packets),we consider de-lay variations over streams of packets.

3.PATH-ORIENTED SLA COMPLIANCE

MONITORING

We now describe the basic assumptions and methods for estimat-ing delay,delay variation,and loss along a single end-to-end path. Our objective is to develop accurate,robust estimators based on a discrete-time probe process.Moreover,we seek to improve on the best known standard IPPM methodologies[7,8,32].Another met-ric that is often part of SLA speci?cations is network availability. Availability can be loosely de?ned as the capability of the network to successfully transmit any end-to-end probe over an interval of time,e.g.,60seconds[26].Although availability may be consid-ered as a special case of loss,we have yet to examine this metric in detail.

3.1Delay

Both mean delay and high quantiles of the empirical delay dis-tribution are used in SLAs.We?rst consider estimation of mean delay along a path,which we model as a continuous function f(t) whose independent variable is the time that a probe packet is sent and the dependent variable is measured one-way delay.Based on this model,a natural approach to mean delay estimation is to use Simpson’s method for numerical integration.The Simpson’s for-mulation is straightforward:once the domain of integration is par-titioned,the integral of the function f over the subinterval I j is estimated by1

2880

|I j|5,withξj some point in the interval I j.Thus,if the fourth derivative of f exists and is not too large,it is safe to state that the local error is of order5;i.e.,if we double the number of samples,the error in the estimate will be reduced locally by a factor of32,and globally by a factor of16.

To apply Simpson’s method to a discrete-time probe process for estimating mean end-to-end delay,we do the following:at time slot i,we draw a value k from a geometric distribution with parameter p delay.The geometric distribution is the discrete analog of the ex-ponential distribution and should yield unbiased samples.Probes representing the endpoints a j and b j are sent at time slots i and i+2(k+1)with the midpoint probe sent a time slot i+(k+1).At time slot i+2(k+1)the next subinterval begins,thus the last probe of a given subinterval is the?rst probe of the next one.Simpson’s estimates from each subinterval are summed to form the total area under the delay function.The mean delay estimate is then obtained by piding the integral estimate by the number of subintervals. With the above formulation,the subintervals are not of equal lengths(the lengths form a geometric distribution).Thus,we can either directly apply Simpson’s method to estimate the mean de-lay,or we can apply relative weights to the subintervals according to their lengths.In our results described below,we use weighted subintervals which we found to give more accurate results,though the absolute differences were small.

There are several considerations in using this approach.First,

probes may be lost in transit.We presently discard subintervals where probe loss occurs.Second,while the assumption that delay largely behaves as a smooth function seems reasonable,it may be more accurate to account for random spikes in delay by modeling the process as the sum of two processes,one smooth and one ran-dom.For example,if the function f(t)is written as f1(t)+f2(t), with f1smooth and f2random,then our numerical integration does much better on f1and slightly worse on f2as compared to straight averaging.The Simpson’s approach should be effective for this model as well:if the values of the random part are quite small compared to the smooth part,then our estimate should be better than simple averaging(i.e.,the sampling method advocated in RFC 2679[7]).Note that there is little risk in using Simpson’s method: even if delay is a completely random process(which is not likely), the variance of the Simpson’s rule estimator for mean delay is in-creased only slightly as compared to simple averaging. Distribution-Free Quantile Estimation.Besides using mean delay as the basis of service-level guarantees,ISPs also use high quantiles of the delay distribution,such as the95th percentile[16].

Let{x i:i=1,...,n}be n independent samples drawn at ran-dom from a common distribution F,sorted in increasing order.For simplicity,assume F is continuous.Let Q p denote the p th quantile of that distribution,i.e.,the unique solution of F(Q p)=p.

We wish to obtain con?dence intervals for Q p based on{x i}. One approach would be to start with the empirical distribution func-tion: F(x)=n?1#{i:x i≤x}and use a quantile estimate of the form Q p=max{x: F(x)≤p}.Analysis of the variance of this estimator might give us asymptotic con?dence intervals as n becomes large. Instead,we seek rigorous probabilistic bounds on Q p that hold for all n.

Now{x k≤x}is the event that at least k of the samples are less than or equal to x,an event which has probability G(n,F(x),k), where G(n,p,k)=∑j≥k p j(1?p)n?j n j .Taking x=Q p we have Pr[x k≤Q p]=G(n,p,k).

Based on the x i,we now wish to determine a level X+(n,p,ε) that the true quantile Q p is guaranteed to exceed only with some small probabilityε.Thus,we chose X+(n,p,ε)=x K+(n,p,ε)with K+(n,p,ε)=min{k:G(n,p,k)≤ε}.

Similarly,Pr[x k≥Q p]=1?G(n,p,k).Based on the x i,we now wish to determine a level X?(n,p,ε)that the true quantile Q p is guaranteed to fall below only with some small probabilityε. Thus,we chose X?(n,p,ε)=x K?(n,p,ε)with K?(n,p,ε)=max{k: 1?G(n,p,k)≤ε}.

Put another way,K+(n,p,ε)is the1?εquantile of the binomial B n,p distribution,while K?(n,p,ε)is theεquantile of the binomial B n,p distribution.The K±can be computed exactly;examples are given in Table1.

1:Example quantile Indices K±for various sample sizes n,and quantiles p.Con?dence level is1?ε=90%.Also shown is the reference quantile index K0=np.—indicates that no upper bound K+was available,which can occur when the top atom has mass greater than the desired signi?cance level,i.e.,p n>ε.

90

n K?K0K+

100869095

480500521986990995 10000896190009039

We start with the methodology in

[35],which initiates a probe pair at a given time slot with probability p loss for estimation of the

end-to-end frequency of congestion episodes F

and the mean du-ration of congestion episodes D .In this approach,each probe con-sists of three packets,sent back-to-back.We measure the loss rate

l of the probes during congestion episodes .Since the methodology of [35]does not identify inpidual congestion episodes,we take an empirical approach,treating consecutive probes in which at least one packet is lost as indication of a congestion episode (i.e.,similar to [41]).We assume that the end-to-end loss rate L is stationary and

ergodic.Given an estimate of the frequency of congestion F

,we estimate the end-to-end loss rate as L = F l .

The key assumption of this heuristic is that we treat the probe stream as a marker ?ow ,viz.,that the loss rate observed by this ?ow has a meaningful relationship to other ?ows along the path.We note again that the probes in [35]consist of multiple packets (3by default),which has some similarity to a TCP stream where delayed ACKs cause a sender to release two closely-spaced packets.While we do not claim that the probe stream is,in general,the same as a TCP stream,our results below demonstrate that such an assumption may be reasonable in this context.

3.4Multi-Objective Probing

We use the term multi-objective probing to refer to simultaneous estimation of multiple performance metrics using a single probe stream.The inpidual discrete-time algorithms described above operating at the same time may schedule probes to be sent at the same time slot.Such requests can be accommodated by tagging probes according to the relevant estimator.Thus,a single probe stream can be used for concurrent estimation of packet loss,delay,delay variation,and other quantities,thereby reducing the impact of measurement traf?c on the network.

The basic architecture of our multi-objective probe scheduler is depicted in Figure 1.The main component of the architecture is a discrete-time scheduler that provides callback and probe schedul-ing mechanisms.Probe modules implement the various path-oriented estimation methods described above.This design allows for logical separation among multiple,simultaneously operating measurement methods and for optimizations of network bandwidth.

time

1:Multi-objective probe scheduler architecture.Algorithmic modules interact with a discrete-time probe scheduler to perform estimation of delay,delay variation,and loss characteristics.

4.

TOWARD NETWORK-WIDE SLA COMPLIANCE MONITORING

The previous section described a set of methodologies for ef?-cient per-path monitoring.SLA compliance monitoring,however,requires accurate and ef?cient measurement on a network-wide ba-sis.However,the measurement overhead of sending probes over a full n 2mesh of paths is highly undesirable.In this section,we de-scribe the mathematical foundation that enables economical moni-toring over a subset of network paths.This new methodology en-ables greater ?exibility for specifying performance assurances in

terms of quantiles of a distribution,while attaining a high level of measurement ef?ciency.

4.1Routing Matrices,Measurement,

and Linear Dependence

Let G =(V ,E )be a directed graph comprising vertices (nodes)V and directed edges (links)(v 1,v 2)∈E ?V ×V .Let R be a set of paths (routes)i.e.,each r ∈R is an ordered set of n >0con-tiguous links (v 0,v 1),(v 1,v 2),...,(v n ?1,v n ).The routing matrix A associated with R is the incidence matrix of the links in the routes,namely,A re =1if link e occurs in route r and zero otherwise.

We now describe what we term the scalar additive network per-formance model .Let x :E →R be a function on the links.This naturally gives rise to the path function y :R →R de?ned as y r =∑e ∈r x e =∑e ∈E A re x e .This relation is a prototype for additive net-work performance models.Two examples are:

Network Delay:The latency of packet traversing the path r is

the sum of the latencies incurred on each link of the path.This may be understood either as the x e being inpidual measurements,or as x e being mean latencies.This is the example on which we focus in this paper.

Network Loss:In this model,x e is the log transmission proba-bility of traversing link e ;if there is no spatial correlation between link losses we can write y r as the log transmission probability along the path r .

Performance Tomography.

Two classes of inference problems arising from the framework above have been studied recently.In link performance tomography the aim is to infer the distribution of the link variable x e given only path measurements y r .Variants of this problem have been studied,mostly depending on exploiting correlations between measurement on different paths,e.g.,either at the packet level,e.g.,by using multicast probes [12,25]or groups of unicast probes [23,39],or more generally of distinct packet streams that experience common performance impairments [9,22].

A second class of problem has more recently attracted attention [14,15,17]:given a set of path performance measures across inter-secting paths,is it possible to infer the whole set of measures if only a subset is known?Clearly there is some relation between the two problems in the sense that if all link performance measures could be inferred from a subset of path measures,then the remaining path measures could be determined simply.

For scalar additive performance measures,the second problem has a simple expression in terms of the routing matrix A .Suppose that the matrix A is not of full (row)rank,i.e.,the set of row vectors is not linearly independent.Consequently there exists a minimal set of paths S R which span in the sense that such that every row of a r ={A re :e ∈E }of A can be expressed as a linear combination of the {a r :r ∈S }.For the scalar additive performance model,this translates to saying that all {y r :r ∈R }can be determined from the subset {y r :r ∈S }.Recent work on this problem has focused on understanding how the dimension of the set S depends on network topology.Chen et al.[15]concluded that the number of paths in S grows as O (#V )(i.e.,linear in the number of network nodes #V )in a real router-level topology,or at worst like O (#v log #V )in some simulated topologies.

Distributional Path Performance Measures.

In this work we extend the computational approach described above to infer distributions of a set of path performance measures

from a subset.We assume in a given network setting the existence of the set S R with the properties detailed above has been estab-lished.This means in particular that for every network path in R, every link in this path is traversed by some path in the subset R, and below we show how the distributions of delay in path in R can be inferred from only those in S.This inference depends on the as-sumption that any packet traversing a given link will experience the same delay distribution,even if the actual delays differ.The proofs of the results are relatively straightforward but have been omitted due to space limitations and will appear in a future technical report. There are two challenges in trying to extend the scalar approach to distributions.The?rst is dependence among link measurements. Dependence is not an issue in the linear algebra of mean quantities since the average of a linear combination of random variables is equal to same linear combination of respective averages even when the variables are dependent.Working with distributions is more complex,for example the distribution of a sum of random variables is not equal to the convolution of their distributions unless the ran-dom variables are independent.A second complexity is algebraic: there is no simple subtraction operation for distributions.For ex-ample,if X and Y are independent random variables and X=Y in distribution,it is not the case that X?Y is identically zero.

4.2Delay Distributional Inference

We suppose routing(and hence the matrix A)is static over a measurement interval.On each path r a stream of measurement packets labeled i=1,2,...,n r is launched along the path.Packet i incurs a latency X i re on traversing the link e∈r.The latency of the packet on the path is Y i r=∑e∈r X i re.

To motivate the following,consider the star topology network in Figure3b in which source nodes v1,v2and destination nodes v3,v4are linked through a central node v c.Denote the edges by e1=(v1,v c),e2=(v2,v c),e3=(v c,v3)and e4=(v c,v4).We consider the4paths r1=(e1,e3),r2=(e1,e4),r3=(e2,e3)and r4=(e2,e4).Let X n be the delay on link e n,and Y n the delay on path r n.Clearly,Y1+Y4=d Y2+Y3.Assume that the distributions of Y2,Y3and Y4are known;we focus on inferring that of Y1.

Our major statistical assumption is that all X i re are independent. We remark that the opposite type of assumption,i.e.,the iden-tity of certain link variables,has been employed for multicast per-formance tomography(and some unicast variants)to describe the propagation of multicast packets.The identity assumption is natu-ral in that case,since it re?ects either the delay encountered by a single multicast packet or a train of closely spaced unicast packets prior to branching to distinct endpoints.

In the present case,we can consider two types of dependence.In the?rst case we consider dependence between different measure-ments.Provided probe packets are dispatched at intervals longer than the duration of a network congestion event,then probes on the same path or on intersecting paths are unlikely to exhibit delay dependence,even if inpidual packets experience the distribution of congestion events similarly on the same link.Thus,is seems reasonable to model the Y i re as independent.The second case to consider is dependence among the inpidual link delays X i re on a given path r.Violation of this property might occur in packet streams traversing a set of links congested by the same background traf?c.As far as we are aware,there are no live network or testbed studies that have investigated this property.Dependence was found in a network simulation study,but was pronounced only in a small network con?guration with few traf?c streams[25].For this rea-son we believe that link delay correlation need not be signi?cant in a large network with a perse traf?c.

For r∈R let{b rr′:r′∈S}be the coef?cients of the spanning set {a r′:r′∈S}in the expression of a r,i.e.,

a r=∑

r′∈S

b rr′a r′(1) Let S+r={r′∈S:b rr′>0}and S?r={r′∈S:b rr′<0}.

L EMMA 1.Assume{a r′:r′∈S R}is a minimal spanning set. For each r∈R there exist positive integers d r and{d rr′:r′∈S} such that

d r a r+∑

r′∈S?r

d rr′a r′=∑

r′∈S+r

d rr′a r′(2)

For each r∈R,e∈E let X(i)re,i=1,2,...denote the sum of i independent copies of a single delay on link e,e.g.,X1re;likewise let Y(i)r denote the sum of i independent copies of Y i r.The symbol =d will denote equality in distribution.

T HEOREM 1.

Y(d r)

r+∑

r′∈S?r

Y(d rr′)

r′

=d∑

r′∈S+r

Y(d rr′)

r′

(3)

One can already see in Theorem1a basic feature of our results that follows merely from the partition of S into S?r and S+r.Suppose we are primarily interested in determining whether Y r often takes some large value.Suppose measurements tell us that some of the {Y r′:r′∈S+r}tend to take large values,but that none of the{Y r′: r′∈S?r}do.Then we know from the equality(3)that Y r must also tend to take large values.If none of the{Y r′:r′∈S}tend to take large values,then neither does Y r.But when some Y r′for r′in both S+r and S?r tend to take large values,then it is dif?cult to draw conclusions about Y r.These observations pre?gure our later results on distributional bounds for Y r.

Distributions and Inversion.

Let Y r denote the common distribution of the Y i r,and Y r its Laplace transform,i.e., Y r(s)= ∞0Y r(dy)e?sy.Let?denote con-volution.In terms of distributions,(3)becomes

Y?d r

r

?

r′∈S?r

Y?d rr′

r′

=?

r′∈S+r

Y?d rr′

r′

(4)

To what extent can we solve these convolution equations?In Laplace transform space we obtain from(4):

Y d r r∏r′∈S?r Y d rr′r′=∏r′∈S+r Y d rr′r′(5)

Given empirical estimates of{Y r′:r′∈S}one can in principle use numerical Laplace transform inversion to recover all Y r.This is an approach we intend to pursue in a subsequent work.In this paper, we use(4)directly in order to obtain bounds on the distributions Y r.

Convolution Bounds.

Let V i,i=1,2,...,n be independent random variables and set V=∑n i=1V i be their sum.Let Q p(V i)denote the p-quantile of V i, i.e.,

Pr[V≤x]≥p?Q p(V)≤x(6) The following result formalizes the perhaps obvious statement that if you know that V1≤x a fraction p of the time,and V2≤y a fraction q of the time,then you can conclude that V1+V2is less than x+y no less than a fraction pq of the time.

T HEOREM 2.Let V i,i=1,2,...,n be independent random vari-ables with sum V=∑n i=1V i,and let p i∈(0,1]with p=∏n i=1p i.

Q p(V)≤

n

i=1

Q p

i

(V i)(7)

Network Quantile Bounds.

T HEOREM 3.Denote Y±r=∑r′∈S+

r Y(d rr′) r′

(i)Q p(Y r)≥(d r)?1Q p d r(Y(d r)

r).

(ii)Q p(Y(d r)

r)≥Q pq(Y+r)?Q q(Y?r)

(iii)Q p(Y r)≥(d r)?1sup q∈(0,1](Q p d r q(Y+r)?Q q(Y?r))

Theorem3provides a lower bound on the quantiles,or,equiv-alently,an upper bound on the cumulative distribution.Thus,it underestimates the frequency with which a given level is exceeded. This may or may not be desirable if the measured quantiles are to be used for detecting SLA violations(i.e.,raising alarms).On the one hand false positives will be reduced,while at the same time some high quantiles may be underestimated.Following a network exam-ple below,we describe how knowledge of the topology of measured paths may be used to adjust alarm thresholds in order to mitigate the effects of quantile underestimation.

Computation of Quantiles.

We use the measured end-to-end latencies on the paths r∈S, the?r={Y i r:i=1,2,...,n r},to estimate the required quantiles on the RHS of Theorem3(iii).To compute the distribution of Y±r we might construct the sets of values{∑r′∈S±

r

∑d rr′i=1y rr′:y rr′∈?r}.

However,this gives rise to n±r=∏r′∈S±

r n d rr′

r′

member of each set,

which may require prohibitively large amounts of memory.Instead, memory can be controlled by discretizing the distributions before convolution.

Discrete Mass Distributions and Their Convolution.

A positive discrete mass distribution is speci?ed by a tuple(ε,n,m= {m i:i=0,...,n})whereεis the bin width,with a mass m i in bin [iε,(i+1)ε)for i=0,1,...,n?1,and mass m n in[nε,∞).Two such distributions(ε,n,m)and(ε′,n′,m′)the have convolution (ε,n,m)?(ε′,n′,m′)=(ε+ε′,1+(n?1)(n′?1),m′′)(8) where m′′j=∑j i=0m i m′j?i.Givenε,n,an set of measurements{Y i r:

i=1,2,...,n r}gives rise to a empirical discrete mass distribution (ε,n,m)with m i=#{Y i r:Y i r∈[iε,(i+1)ε)}for i=0,1,...,n?1

and m n=#{Y i r:Y i r≥nε}.The distribution of each{∑r′∈S±

r ∑d rr′i=1y rr′:

y rr′∈?r}is then estimated by taking the grand convolution over r′∈S±r of the d rr′-fold convolutions of the empirical mass distri-bution generated from each#{Y i r′:Y i r′∈[iε,(i+1)ε)}.A target resolutionεin the?nal distribution is achieved by choosing resolu-tionsε′for the constituent distribution that sum toε,for example,

ε′=ε/∑r′∈S±

r d rr′.Finally,we normalize to a probability distribu-

tion by piding each mass element by n±r.We call the resulting variables Y±r,and use them in place of the Y±r in Theorem3. Network Example.

In the above formalism,we have S+1={2,3},S?1={4}with d12=d13=d14=1and Y+1=Y2+Y3and Y?1=Y4.Suppose now that X i are exponentially distributed with distinct meansμi.Then Y+1has a mixed exponential distribution with PDF

y+1(x)=

4

i=1

e?x/μiμ2i

μ2?μ4

(10)

For the optimization of Theorem3,elementary calculus shows that when Y±r have densities y±r,the stationary points of q→Q p d r q(Y+r)?Q q(Y?r))obey

y+r(Q p d r q(Y+r))=p d r y?r(Q p d r q(Y?r))(11)

We use the above expression to compute the bounds and con-sider four cases.For cases(a)–(c)we plot the actual CDF on the unmeasured path,together with the CDF bound in Figure2.

(a)Homogeneous Delay.m1=1.0,m2=1.1,m3=1.2,m4=

1.3.The delay on path r1is somewhat underestimated,but

then large delays only very rarely occur.

(b)High Delay on Unmeasured Path,Low Delay Elsewhere.m1=

10,m2=1.1,m3=1.2,m4=1.3.The low delays on links

not included in the unmeasured path allow fairly close esti-

mation of the delay distribution on r1.

(c)High Delay on Unmeasured Path,Some High Delay Else-

where.m1=10,m2=11,m3=1.2,m4=1.3.Although

elevation of delay on r1is detected,the amount is somewhat

underestimated due to the presence of high delay on one of

the measured paths;this parallels the remarks following The-

orem1.

(d)Low Delay on Unmeasured Path,Some High Delay Else-

where.m1=1.0,m2=11,m3=1.2,m4=1.3.The results

are similar to the homogeneous case;the presence of high

delay elsewhere in the network does not further perturb the

delay bound.

If this delay bound estimates are to be used for raising alarms based on crossing threshold levels,it may be desirable to adjust alarm thresholds based on the spatial distribution of measured path delays.Speci?cally,case(c)above illustrates that when higher de-lays are encountered on a path in S?r,a lower alarm threshold may be used in order to compensate for the partial“obscuring”of the delay on the unmeasured path.In situations exempli?ed by cases (a)and(b),no adjustment to the threshold is needed,since there are no measured paths with high delay(so in particular,none in S?r).

5.EXPERIMENTAL TESTBED

We implemented a tool to perform multi-objective probing,called SLA M(SLA M onitor).SLA M sends UDP packets in a one-way manner between a sender and receiver.It consists of about2,000 lines of C++,including code to implement the loss,delay,and de-lay variation probe modules.The implementation is extensible and can accommodate other discrete-time probe algorithms.In this sec-tion,we describe the controlled laboratory environment in which we evaluated SLA M.We considered two topologies,shown in Fig-ure3.Each setup consisted of commodity workstation end hosts and commercial IP routers.

The?rst topology(Figure3a)was set up in a dumbbell-like con-?guration.We used10workstations on each side of the bottleneck

0.1

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

5 10 15 20 25 30 35 40 45 50

c u m u l a t i v e p r o b a b i l i t y

delay on path r 1 = (e 1,e 3)

m 1 = 1.0, m 2 = 1.1, m 3 = 1.2, m 4 = 1.3

bound actual cdf

0.1

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

5 10 15 20 25 30 35 40 45 50

c u m u l a t i v e p r o b a b i l i t y

delay on path r 1 = (e 1,e 3)

m 1 = 10, m 2 = 1.1, m 3 = 1.2, m 4 = 1.3

bound actual cdf

0.1

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

5 10 15 20 25 30 35 40 45 50

c u m u l a t i v e p r o b a b i l i t y

delay on path r 1 = (e 1,e 3)

m 1 = 10, m 2 = 11, m 3 = 1.2, m 4 = 1.3

bound actual cdf

2:Example bounds on the inferred delay distribution.(a)Left:homogeneous delay;(b)Center:high delay on unmeasured path;(c)Right:high delay on unmeasured path and some others.

GSR 12000(hop B).Packets exited the OC3via another Cisco GSR 12000(hop C)and passed to receiving hosts via a Cisco 6500(hop D).NetPath [5]was used between hops C and D to emulate prop-agation delays for the background traf?c hosts in the testbed.We used a uniform distribution of delays with a mean of 50msec,mini-gation routers (Cisco 6500’s at hops A and E)were con?gured to direct traf?c over four primary con?gured paths,r 1–r 4,as shown in the ?gure.In addition,traf?c ?owed over path (e 1,e 2)to create suf?cient load on e 1to include queueing delay and loss.SLA M probes ?owed over the four primary traf?c paths to monitor delay,loss,and delay variation.SLA M was also con?gured to monitor paths (e 1,e 6),(e 2,e 6),(e 5,e 3),and (e 5,e 4).Only probe traf?c tra-versed links e 5and e 6,thus it was assumed that these additional probe measurements were suf?cient to separately measure charac-teristics on links e 1,e 2,e 3,and e 4.As with the dumbbell topol-ogy,NetPath [5]was used to emulate propagation delays for the background traf?c hosts in the testbed.We used a uniform distri-bution of delays with a mean of 50msec,minimum of 20msec,and maximum of 80msec.Each queue was con?gured to perform tail 6a664dba1a37f111f1855b08ing the notation (r ,e )=B to denote the output queue at router r on to link e in msec,buffer size con?gurations were fol-lows:(v 1,e 1)≈25msec,(v 2,e 2)≈12.5msec,(v c ,e 3)≈50msec,and (v c ,e 4)≈100msec.

Each workstation used in our experiments had a Pentium 4pro-cessor running at 2GHz or better,with at least 1GB RAM and an Intel Pro/1000network interface card and was con?gured to run either FreeBSD 5.4or Linux 2.6.The SLA M hosts were con?g-ured with a default installation of FreeBSD 5.4.The SLA M work-stations used a Stratum 0NTP server con?gured with a TrueTime GPS card for synchronization.We used the software developed by Corell et al.[20]to provide accurate timestamps for SLA M .All management traf?c for the two topological con?gurations ?owed over separate network paths (not pictured in either ?gure).

A critical aspect of our laboratory environment is the ability to measure a reliable basis for comparison for our experiments.For the dumbbell topology,optical splitters were attached to the links

A

B

C

D

hop identifier traffic generator hosts

(a)Dumbbell topology.Probes and cross traf?c are multiplexed onto a bottleneck OC3(155Mb/s)link where queueing delay and loss occurs.

A

B

C

D

E

Cisco 12000

hop identifier

(b)Star topology.Probes and cross traf?c follow paths r 1,r 2,r 3,and r 4,shown in the ?gure.

3:Laboratory testbeds.

between hops A and B and to the link between hops B and C and synchronized Endace DAG 4.3(Gigabit Ethernet)and 3.8(OC3)passive monitoring cards were used to capture packet traces enter-ing and leaving the bottleneck node.For the star topology,optical splitters were attached to the Gigabit ethernet links entering the core star topology (just after hop A),and exiting the star (just be-fore hop E).We used synchronized DAG 4.3passive monitoring cards to capture packet traces entering and leaving the star setup.By comparing packet header information,we were able to identify which packets were lost along each path.Furthermore,these cards provide synchronization of better than one microsecond allowing precise delay measurement through the bottleneck router.

We used four background traf?c scenarios for experiments using the dumbbell setup.For the ?rst scenario,we used Iperf [38]to produce constant-bit rate (CBR)UDP traf?c for creating a series of approximately constant duration (about 65msec)loss episodes,spaced randomly at exponential intervals with a mean of 10sec-onds over a 10minute period.We found that short loss episodes were dif?cult to consistently produce with Iperf,thus the duration we used was a compromise between a desire for short episodes and the ability to predictably produce them.The second scenario con-sisted of 100long-lived TCP sources run over a 10minute period.For the ?nal two scenarios,we used Harpoon [34]with a heavy-tailed ?le size distribution to create self-similar traf?c approximat-ing a mix of web-like and peer-to-peer traf?c commonly seen in

today’s networks.We used two different offered loads of60%and 75%of the bottleneck OC3.Since good performance cannot be guaranteed when resources are oversubscribed,SLAs often contain clauses to allow discarding performance measurements if utiliza-tion exceeds a given threshold[33].Thus,we chose these offered loads to re?ect relatively high,yet acceptable average loads in light of this practice.Experiments using the self-similar traf?c scenario were run for15minutes.For all scenarios,we discarded the?rst and last30seconds of the traces.

For the star setup,we used three background traf?c scenarios in our experiments.For the?rst scenario,we used Iperf[38]to pro-duce CBR UDP traf?c over the four primary traf?c paths to create a series of approximately constant duration loss episodes at(v c,e3) and(v c,e4).We used an additional Iperf?ow over path(e1,e2)to produce a series of loss episodes at(v1,e1).All loss episodes were spaced at exponential intervals with a mean of10seconds,and the test duration was10minutes.The second scenario consisted of long-lived TCP sources con?gured to use all four primary traf?c paths plus path(e1,e2).There were at least100traf?c sources con-?gured to use each path,and the test duration was10minutes.In the third scenario,we used Harpoon[34]with a heavy-tailed?le size distribution to create self-similar traf?c as in scenarios three and four for the dumbbell topology.Traf?c sources were con?g-ured to produce approximate average loads of65%on link e1,15% on link e2,75%on link e3,and60%on link e4,and the test duration was15minutes.For all scenarios,we discarded the?rst and last30 seconds of the traces.Finally,we note that while maximum queue-ing delays at(v2,e2)were non-zero for all three traf?c scenarios, no loss occurred at(v2,e2).

6.EV ALUATION

We now describe the experimental evaluation of SLA M using the testbed described above.We examine the accuracy of SLA M’s delay and loss estimates,comparing its results with estimates ob-tained using standard IPPM methodologies[7,8],which are based on Poisson-modulated probes.We also compare the DV matrix metric with other standard methodologies[21,32].

6.1SLA M Measurement Overhead

Two important implementation decisions were made in the SLA M probe sender.First,the scheduler must accommodate estimation techniques that use multi-packet probes,such as the loss rate esti-mation method we use.Second,the scheduler must arbitrate among probe modules that may use different packet sizes.At present,the smallest packet size scheduled to be sent at a given time slot is used. An effect of the implementation decision for probe packet sizes is that the overall bandwidth requirement for the multi-objective stream is less than the aggregate bandwidth requirement for indi-vidual probe modules if used separately.One concern with this implementation decision is the issue of packet size dependence in the measurement technique.For delay and delay variation,packet sizes should be small to keep bandwidth requirements low.For de-lay variation,the packet size should closely match that used by a codec referred to in the G.107and related standards so that the E-model formulas can be directly used[1].We use48bytes at an interval of30msec in our evaluation below,which approximates the G.723.1codec.For delay,another concern is the relative differ-ence between end-to-end transmission and propagation delays.In situations where propagation delay is large relative to transmission delay,the packet size can be small since the transmission delays along a path contribute little to the overall delay.In cases where the opposite situation holds,packet sizes should be large enough to estimate delays experienced by packets of average size.In our evaluation described below,we use100byte packets for delay es-timation.For loss estimation packet sizes,the key consideration is that multi-packet probes should admit accurate instantaneous in-dications of congestion.In previous work[35],a packet size of 600bytes was used and was found to be a reasonable balance be-tween limiting measurement impact while still obtaining accurate congestion indications.We veri?ed this previous?nding and leave a detailed analysis for future work.

In the experiments below,we?x SLA M probe parameters as shown in Table2.In prior work,p loss=0.3was found to give good loss characteristic estimates[35].We veri?ed the results regarding the setting of the parameter p loss but omit detailed results in this paper.We experimented with a range of values for p delay from0.01 to0.5(mean probe intervals from5msec to about500msec)and found that estimation accuracy for SLA M is virtually unchanged over the range of parameter settings except those below about0.02 (above about200msec mean probe spacing).We do not include detailed results in this paper due to space limitations.For delay variation,we used a packet size of48bytes sent at periodic intervals of30msec.We used a stream length k of100probes in computing the DV matrix metric.

2:SLA M parameters used in evaluation experiments.For all ex-periments,we set the discrete time interval for the scheduler to be 5msec.

Loss Delay Delay Variation Packet size p loss Packet size Interval

600bytes0.348bytes30msec

With the parameters of Table2,the bandwidth savings due to multi-objective probing is about100Kb/s.Separately,the loss probe stream is about490Kb/s,the delay probe stream is about 20Kb/s,and the delay variation is about60Kb/s:a sum of about 570Kb/s.With SLA M,the probe stream is actually about470 Kb/s.Note that for the dumbbell topology,the SLA M parameters used in our experiments result in only about0.3%of the bottleneck OC3consumed for measurement traf?c.For the star topology,three SLA M streams traverse links e3and e4(namely,for link e3,paths r1,r3and(e5,e3)are monitored,resulting in three streams travers-ing e3).The measurement traf?c consumption on these OC3links is still less than1%of the capacity.

6.2Delay

Table3compares the true delay measured using the DAG-collected passive traces with the mean delay estimate produced by SLA M and the estimates produced using standard RFC2679[7](Poisson-modulated probes),sent at the same rate.Values are shown for each traf?c scenario and are averages over full experiment dura-tion.Note that the differences in true values are due to inherent variability in traf?c sources,but the results are representative of tests run with different random seeds.First,we see in Table3a that the SLA M results are close to the true values.We also see that while results for the standard stream are close for the CBR and long-lived TCP traf?c scenarios,they are less accurate for the more realistic self-similar traf?c scenarios,with with relative errors rang-ing from about25%to120%.Second,we see that in Table3b that the SLA M results are close to the true values,though somewhat less accurate than for the simple dumbbell topology.The accuracy of the mean delay estimate for the RFC2679stream varies over the range of traf?c scenarios and paths,but is generally better than in the dumbbell topology.A possible explanation for this behavior is that the increased level of aggregation of traf?c sources in the star topology leads to an improvement in mean delay estimates.

Figure4shows true mean delay and the SLA M-estimated mean

3:Comparison of mean delay estimation accuracy for SLA M and RFC2679(Poisson)streams using the(a)dumbbell and(b)star testbed topologies.Values are in seconds and are averages over the full experiment duration.

(a)Delay accuracy using the dumbbell topology.

Probe stream→

true estimate CBR0.00180.0018

0.03860.0391

Harpoon self-similar(60%load)0.00580.0059

0.00600.0132

(b)Delay accuracy using the star topology.

Probe stream→

true estimate CBR(r1)0.00660.0064

0.00870.0056

CBR(r3)0.00530.0048

0.00730.0043

0.05980.0612

Long-lived TCP(r2)0.11680.1172

0.03620.0364

Long-lived TCP(r4)0.09360.0936

Harpoon self-similar(r1)0.05080.0503

0.01230.0112

Harpoon self-similar(r3)0.04140.0417

0.00280.0024

delay over the duration of experiments using CBR traf?c(top)in the dumbbell topology,and for self-similar traf?c on route r1in the star topology.

Results for other experiments are consistent with plots shown in Figure4.True delay estimates are shown for10 second intervals and estimates for SLA M are shown for30sec-ond intervals.We see that in each case after an initial convergence period,the SLA M estimate tracks the true delay quite well. Distribution-Free Quantile Estimation.Figure5compares the true delay distribution with the SLA M-estimated delay distribution with90%con?dence bounds.Representative plots are shown for the long-lived TCP traf?c scenario in the dumbbell topology(Fig-ure5a)and for the CBR UDP traf?c scenario in the star topology (Figure5b).We see that for these vastly different traf?c and topo-logical setups that the delay distribution is estimated quite well and that with few exceptions,the con?dence bounds include the true delay distribution for the range of estimated quantiles shown. Delay Distribution Inference.We now examine the problem of inferring the delay distribution along a path given measured delay distributions along a subset of paths.Speci?cally,given measure-ments along paths r2,r3,and r4,we wish to infer the delay distri-bution for path r1.

Figure6shows representative results for two traf?c scenarios considered using the star topology.For these results,we used a bin widthεof100μsec for the input discrete mass distributions.The computed bound and the actual CDF measured using SLA M are shown for the CBR UDP traf?c(top)and self-similar TCP traf?c (bottom).We see that for each traf?c scenario the computed bound is relatively tight,with the closest qualitative match for the more re-alistic self-similar traf?c scenario.The skewed distribution arising from the CBR UDP traf?c scenario results in an underestimation of the high delay values along path r1.For the self-similar TCP traf?c scenarios,the delay distributions are somewhat smoother(though not homogeneous along paths in the star topology),and the result-ing bounds are tighter.

6.3Delay Variation

Evaluation of measured delay variation is complicated by the fact

100200300400500

.

1

.

1

5

.

2

.

2

5

time (seconds)

m

e

a

n

d

e

l

a

y

true delay

SLAm estimate

0200400600800

.

3

5

.

4

5

.

5

5

time (seconds)

m

e

a

n

d

e

l

a

y

true delay

SLAm estimate

4:Comparison of true mean delay with SLA M estimates over time.True mean delays are plotted using10second intervals. SLA M estimates are plotted using30second intervals.Plots shown for CBR traf?c in the dumbbell topology(top),and self-similar traf?c on route r1in the star topology(bottom).

delay (seconds)

c

d

f

(a)Long-lived TCP sources,dumb-

bell topology.

delay (seconds)

c

d

f

(b)Constant-bit rate UDP sources,

star topology,route r1.

5:Delay distribution quantile estimates,with90%con?dence interval.

that there is no clear basis by which to compare estimates.As dis-cussed in§3,there are multiple de?nitions of delay variation,for example in the RTP standard RFC3550and in the IPPM standard RFC3393.Therefore,we focus on a comparative analysis among these two IETF standards and our DV matrix formulation.

We?rst look at the one-way-ipdv metric of RFC3393.Each one-way-ipdv sample is produced by choosing consecutive pack-ets of a probe stream identical to the SLA M stream(48byte packets sent at30msec intervals).Histograms of one-way-ipdv samples for the long-lived TCP traf?c scenario(left)and for the self-similar traf?c scenario at60%offered load(right)in the dumbbell topol-ogy are shown in Figure7.The plots show that while there is a narrower range of values for the long-lived TCP source scenario the shapes of each distribution are qualitatively similar.The nar-row range for the long-lived TCP scenario arises because the queue is often close to full.Also,the left tail of the long-lived TCP plot and both left and right tails of the self-similar plot show that there are some large one-way-ipdv values.Beyond simple qualitative observations of these plots,however,it is not clear how queuing dy-namics along the path are captured by this metric since it only cap-tures local differences in delays.It is also not clear how one might infer application performance,e.g.,for a V oIP stream,since large

20

40

60

80

100

0.0

0.20.40.60.8

1.0delay (milliseconds)

c u m u l a t i v e p r o b a b i l i t y boun

d actual cdf

20

40

60

80

0.0

0.20.40.60.81.0

delay (milliseconds)

c u m u l a t i v e p r o b a b i l i t y

bound actual cdf

6:Computed bounds for the delay distribution on path r 1,given measured delay distributions for paths r 2,r 3,and r 4.Results are shown for the UDP CBR scenario (top),and self-similar TCP traf-?c (bottom).

one?way delay variation (seconds)c o u n t

?0.04

?0.02

0.00

0.02

0.04

500

1000

1500

one?way delay variation (seconds)

c o u n t

?0.04

?0.020.000.020.04

100020003000

4000

7:Histograms of RFC 3393One-way-ipdv samples for the long-lived TCP traf?c scenario (left),and for the self-similar self-similar traf?c scenario at 60%offered load (right)using the dumb-bell topology.Each One-way-ipdv sample is produced by choos-ing consecutive packets of a periodic stream.

values of one-way-ipdv do not necessarily translate into packet losses because of underbuffering at an application playout buffer.Figure 8a plots 60second periods of the RTP jitter metric along with a time series of queuing delays (top)and the DV matrix metric along with a time series of queuing delays (bottom).The back-ground traf?c used for these plots is the self-similar traf?c at a 60%offered load using the dumbbell topology.We calculate the two metrics using a probe stream identical to the SLA M stream.In these plots we observe ?rst that although the RTP jitter and DV ma-trix metrics are calculated in very different ways,they have similar qualitative characteristics over time with the DV matrix exhibiting a somewhat smoother pro?le.

In order to expose additional aspects of the RTP and DV matrix metrics,we introduced a CBR traf?c source that was sent in addi-tion to the self-similar traf?c at a 60%load,also using the dumb-bell topology.Over periods of approximately 30seconds,the CBR source alternated between on/off periods,each of about 500msec.The addition of the CBR source results in a period of oscillation of the queue between full and empty as shown in Figure 8b.As with Figure 8a,the top plot shows the RTP jitter metric along with a time series of queuing delays and the bottom plot shows the DV matrix metric along with the same time series of queuing delays.We ob-serve in these two plots that at the onset of the CBR on/off bursts,the RTP jitter metric oscillates in a similar way as the queue.The

DV matrix metric,however,remains smooth and at an increased level,suggesting that relative to the other DV matrix measurements over this 60second time interval,queuing turbulence along the path is greatest during the period of CBR bursts.In contrast,over the CBR burst period the RTP jitter values are often smaller than many other jitter values during the trace segment.Also,relative to the range of jitter values observed over the 60second segment,the jit-ter values during the CBR burst period do not stand out—they stand out only in their oscillatory behavior.This effect is explained by the fact that although an EWMA ?lter with a small value for αis used (1/16)in the RTP jitter formulation,the view is still of inpidual delay variations rather than the behavior over a longer interval of time.Although the CBR traf?c source we used to reveal this be-havior is somewhat pathological,our observations in this context are consistent with the behavior of the RTP and DV matrix values during periods of queuing turbulence in other traf?c scenarios and topologies/paths (not shown due to space limitations).

Finally,we examine the performance of the DV matrix metric in the star topology.A desirable property of a method for measur-ing delay variation is that,in a multihop setting,it should report a maximum over all the links comprising the path.In Figure 9,we plot the DV matrix metric for links e 1and e 4which make up path r 2for the CBR UDP traf?c scenario.Plots for other traf?c scenar-ios and routes are qualitatively similar to Figure 9.Observe that the DV matrix value reported over the path over time is generally the maximum reported for the inpidual links.These results are encouraging.First,the DV matrix methodology appears to yield reliable measures of delay variation over a single hop.Second,the performance of the DV matrix metric in the two-hop star topology appears to be robust.In the future we plan to examine its sensitivity to different matrix sizes and in more complex multihop settings.

6.4Loss

Table 4compares the true loss rate measured using the passive traces (true values)with the loss rate estimates of SLA M and the standard RFC 2680[8](Poisson-modulated)probe stream sent at the same rate.Values are shown for each of the traf?c scenarios,and for the two topologies and are average loss rates over the du-ration of each experiment.Note that differences in true values are due to inherent variability in traf?c sources.Considering both re-sults for the dumbbell topology (Table 4a)and for the star topology (Table 4b),we see that the standard stream yields very poor es-timates of the true loss rate,and that the estimates produced by SLA M are close to the true values.Moreover,in all but a few cases,the RFC 2680probe estimates are off by more than an order of magnitude—a signi?cant relative error.For a number of ex-periments,the Poisson estimates are close to zero—a phenomenon consistent with earlier experiments [35]and primarily due to the fact that single packet probes generally yield poor indications of congestion along a path.(Note that these accuracy improvements are consistent with experiments described in [35].)The estimates produced by SLA M are signi?cantly better,with a maximum rela-tive error occurring in the case of the open-loop CBR background traf?c for both the dumbbell and star topologies.

Figure 10shows the true loss rate and SLA M -estimated loss rate over the duration of experiments using long-lived TCP traf?c in the dumbbell topology (top)and self-similar traf?c on route r 2in the star topology (bottom).True loss rate estimates are shown for 10second intervals and estimates for SLA M are shown for 30second intervals.Results for other experiments are consistent with plots in Figure 10.The upper and lower bars for SLA M indicate estimates of one standard deviation above and below the mean using the vari-ance estimates derived from [37].For the SLA M estimates we see

time (seconds)

(a)Time series plots of60second periods of the RTP jitter met-

ric along with a time series of queuing delays(top)and the DV

matrix metric along with a time series of queuing delays(bot-

tom).Background traf?c is the self-similar traf?c at a60%of-

fered load.

time (seconds)

(b)Time series plots of60second periods of the RTP jitter met-

ric along with a time series of queuing delays(top)and the DV

matrix metric along with a time series of queuing delays(bot-

tom).Background traf?c is created using periodic intervals of

CBR UDP traf?c that are sent in on/off bursts each of approxi-

mately500msec in addition to continuous self-similar traf?c at

a60%offered load.

8:A comparison of the behavior of the RTP(RFC3550)jitter metric and the DV matrix metric using the dumbbell topology.

4:Comparison of loss rate estimation accuracy for SLA M and

RFC2680(Poisson)streams using the(a)dumbbell and(b)star

testbed topologies.Values are average loss rates over the full ex-

periment duration.

(a)Loss accuracy using the dumbbell topology.

Probe stream→

true estimate

CBR0.00510.0073

0.01630.0062

Harpoon self-similar(60%load)0.00080.0007

0.00550.0000

(b)Loss accuracy using the star topology.

Probe stream→

true estimate

CBR(r1)0.03910.0370

0.03390.0064

CBR(r3)0.04580.0359

0.03900.0089

0.00920.0008

Long-lived TCP(r2)0.04630.0446

0.00280.0006

Long-lived TCP(r4)0.04790.0478

Harpoon self-similar(r1)0.01700.0205

0.00690.0000

Harpoon self-similar(r3)0.01920.0178

0.00020.0000

the narrowing of variance bounds as an experiment progresses,and

that the true loss rate is usually within these bounds.We also see

that SLA M tracks the loss rate over time quite well,with its esti-

mated mean closely following the true loss mean.

7.DISCUSSION AND CONCLUSIONS

We believe that SLA M represents a signi?cant step forward for

SLA compliance monitoring using active measurements.However,

there are a number of issues that remain.First,there are additional

issues to consider in the network-wide setting.For example,a de-

ployment strategy must be developed to coordinate probe streams

so that links internal to the network are not carrying“too much”

measurement traf?c.Another key question is:given a daily(or

based on some other time scale)budget of probes that may be used

to monitor compliance with a SLA,what are the considerations for

480500520540560580

5

1

1

5

2

time (seconds)

D

V

m

a

t

r

i

x

e1

e4

r2 (e1e4)

9:Performance of the DV matrix in a two-hop setting(r2)using

the star topology.Time series plot shown for CBR UDP traf?c

scenario.Curves show DV matrix metric for route r2and sepa-

rately for links e1and e4that comprise r2.

optimizing the probe process?Should the probing period be over a

relatively long time scale(e.g.,the entire interval of interest),thus

potentially limiting the accuracy of estimates,or should the probing

period be over a shorter time scale,potentially improving estima-

tion accuracy but at the cost of not probing over the entire interval,

thus potentially missing important events?We have assumed in this

paper that perfect accuracy is the goal for compliance monitoring.

However,for some SLAs,a tradeoff(if it is predictable)between

accuracy and measurement overhead may be appropriate.Next,

our examples of distributional inference have focussed on delay.

We plan to more closely examine loss in the future.Finally,while

measuring availability in a simple path-oriented scenario is rather

straightforward,simple application of performance tomography to

infer network-wide availability may not be suf?cient in the face of

routing changes.

In summary,this paper introduces a new methodology for SLA

compliance monitoring using active measurements,including new

methods for measuring end-to-end packet loss,mean delay,and de-

lay variation.We propose a new method for obtaining con?dence

intervals on the empirical delay distribution.We also describe a

new methodology for inferring lower bounds on the quantiles of

a distribution of a performance metric along a path in a network-

wide setting from a subset of known paths.We implemented these

measurement methods in a tool called SLA M that uni?es the var-

ious probe streams resulting in lower overall probe volume.We

evaluated the capabilities of the tool in a controlled laboratory en-

100200300400500

0.0000.0100.0200.030

time (seconds)

l o s s r a t e ??

????????????

?????

????????????????

?

true loss rate

SLAm estimate

0200400600800

0.0000.0010.0020.0030.004time (seconds)

l o s s r a t e

?????

?

??????

??

??

??

????

??????

????

????????????????????????true loss rate

SLAm estimate

10:Comparison of true loss rate with SLA M estimates over time.True loss rates are plotted using 10second intervals.SLA M esti-mates are plotted using 30second intervals.Plots shown for long-lived TCP traf?c in the dumbbell topology (top)and self-similar traf?c on route r 2in the star topology (bottom).The upper and lower bars for SLA M indicate estimates of one standard deviation

above and below the mean using the variance formulation of [37].

vironment using a range of traf?c conditions and in one-and two-hop settings.Our results show that SLA M ’s delay and loss rate

estimates are much more accurate than estimates obtained through

standard probe methodologies.Furthermore,we illustrated the con-vergence and robustness properties of the loss,delay,and delay variation estimates of SLA M which make it useful in an opera-tional setting.

Acknowledgments

We thank the anonymous reviewers and our shepherd Anees Shaikh for their feedback.This work is supported in part by NSF grant numbers CNS-0347252,CNS-0627102,CNS-0646256and CCR-0325653and by Cisco Systems.Any opinions,?ndings,conclu-sions or recommendations expressed in this material are those of the authors and do not necessarily re?ect the views of the NSF or of Cisco Systems.

8.REFERENCES

[1]ITU-T Recommendation G.107,The E-model,a computational model for use in transmission planning,March 2005.[2]AT&T Managed Internet Service (MIS).6a664dba1a37f111f1855b08/mis.htm ,2007.[3]NTT Communications Global IP Network Service Level Agreement (SLA).6a664dba1a37f111f1855b08/support/sla/network/,2007.[4]Sprint NEXTEL service level agreements.6a664dba1a37f111f1855b08/business/support/serviceLevelAgreements.jsp ,2007.[5]S.Agarwal,J.Sommers,and P.Barford.Scalable network path emulation.In Proceedings of IEEE MASCOTS ’05,September 2005.[6]M.Aida,N.Miyoshi,and K.Ishibashi.A scalable and lightweight QoS

monitoring technique combining passive and active approaches.In Proceedings of IEEE INFOCOM ’03,March 2003.[7]G.Almes,S.Kalidindi,and M.Zekauskas.A one-way delay metric for IPPM.

IETF RFC 2679,September 1999.

[8]G.Almes,S.Kalidindi,and M.Zekauskas.A one way packet loss metric for

IPPM.IETF RFC 2680,September 1999.

[9] D.Ari?er,G.de Veciana,and B.L.Evans.‘network tomography based on ?ow

level measurements.In IEEE Int.Conf.on Acoustics,Speech,and Signal Proc.,

Montreal,Canada,May 17-212004.

[10]P.Barford and 6a664dba1a37f111f1855b08paring probe-and router-based packet loss

measurements.IEEE Internet Computing ,September/October 2004.

[11]J.Bolot.End-to-end packet delay and loss behavior in the Internet.In

Proceedings of ACM SIGCOMM ’93,September 1993.

[12]R.C′a ceres,N.Duf?eld,J.Horowitz,and D.Towsley.Multicast-based inference

of network internal loss characteristics.IEEE Trans.on Information Theory ,45(7):2462–2480,1999.[13]M.C.Chan,Y.J.Lin,and X.Wang.A scalable monitoring approach for service

level agreements validation.In IEEE International Conference on Network Protocols (ICNP),pages 37–48,2000.

[14]Y.Chen,D.Bindel,and R.Katz.Tomography-based overlay network

monitoring.In Proceedings of ACM SIGCOMM Internet Measurement Conference ’03,October 2003.

[15]Y.Chen,D.Bindel,H.Song,and R.H.Katz.An algebraic approach to practical

and scalable overlay network monitoring.In Proceedings of ACM SIGCOMM

’04,2004.[16] B.Y.Choi,S.Moon,R.Cruz,Z.-L.Zhang,and C.Diot.Practical delay monitoring for ISPs.In Proceedings of ACM CoNEXT ’05,2005.[17] D.B.Chua,E.D.Kolaczyk,and M.Crovella.Ef?cient estimation of end-to-end network properties.In Proceedings of IEEE INFOCOM ’05,2005.[18]L.Ciavattone,A.Morton,and G.Ramachandran.Standardized active measurements on a tier 1IP backbone.IEEE Communications ,41(6):90–97,June 2003.[19]R.Cole and J.Rosenbluth.Voice over IP Performance Monitoring.ACM SIGCOMM Computer Communcation Review ,April 2001.

[20] E.Corell,P.Saxholm,and D.Veitch.A user friendly TSC clock.In

Proceedings of Passive and Active Measurement Conference ,March 2006.

[21] C.Demichelis and P.Chimento.IP packet delay variation metric for IP performance metrics (IPPM).IETF RFC 3393,November 2002.[22]N.Duf?6a664dba1a37f111f1855b08work Tomography of Binary Network Performance

Characteristics.IEEE Transactions on Information Theory ,52,2006.[23]N.Duf?eld,F.Lo Presti,V.Paxson,and D.Towsley.Inferring link loss using

striped unicast probes.In Proceedings of IEEE INFOCOM ’01,April 2001.[24]Y.Liang,N.Farber,and B.Girod.Adaptive playout scheduling and loss concealment for voice communication over IP networks.IEEE Transactions on Multimedia ,5(4),December 2003.

[25] F.Lo Presti,N.G.Duf?eld,J.Horowitz,and D.Towsley.Multicast-based

inference of network-internal delay distributions.IEEE/ACM Transactions on Networking ,10(6):761–775,2002.[26]J.Mahdavi and V.Paxson.IPPM metrics for measuring connectivity.IETF RFC 2678,September 1999.

[27]J.Martin and A.Nilsson.On service level agreements for IP networks.In IEEE

INFOCOM ’02,2002.

[28] A.Pasztor and D.Veitch.A precision infrastructure for active probing.In Passive and Active Measurement Workshop ,2001.[29]V.Paxson.Measurements and Analysis of End-to-End Internet Dynamics .PhD

thesis,University of California Berkeley,1997.

[30]V.Paxson,G.Almes,J.Mahdavi,and M.Mathis.Framework for IP

performance metrics.IETF RFC 2330,1998.[31]M.Roughan.Fundamental bounds on the accuracy of network performance

measurements.In ACM SIGMETRICS ,June 2005.[32]H.Schulzrinne,S.Casner,R.Frederick,and V.Jacobson.RTP:A transport

protocol for real-time applications.IETF RFC 3550,July 2003.

[33] A.Shaikh and A.Greenberg.Operations and Management of IP Networks:What Researchers Should Know.Tutorial Session,ACM SIGCOMM ’05.

August,2005.[34]J.Sommers and P.Barford.Self-con?guring network traf?c generation.In

Proceedings of ACM SIGCOMM Internet Measurement Conference ’04,2004.

[35]J.Sommers,P.Barford,N.Duf?eld,and A.Ron.Improving accuracy in

end-to-end packet loss measurement.In Proceedings of ACM SIGCOMM ’05,2005.

[36]J.Sommers,P.Barford,N.Duf?eld,and A.Ron.A Framework for Multi-objective SLA Compliance Monitoring.In Proceedings of IEEE

INFOCOM (minisymposium),May 2007.[37]J.Sommers,P.Barford,N.Duf?eld,and A.Ron.A geometric approach to

improving active packet loss measurement.To appear,IEEE/ACM Transactions on Networking ,2008.[38] A.Tirumala,F.Qin,J.Dugan,J.Ferguson,and K.Gibbs.Iperf 1.7.0–the

TCP/UDP bandwidth measurement tool.

6a664dba1a37f111f1855b08/Projects/Iperf .2007.[39]Yolanda Tsang,Mark Coates,and Robert Nowak.Passive unicast network tomography using em algorithms.In IEEE International Conference on

Acoustics,Speech,and Signal Processing ,pages 1469–1472,Salt Lake City,Utah,May 2001.[40]M.Yajnik,S.Moon,J.Kurose,and D.Towsley.Measurement and modeling of temporal dependence in packet loss.In Proceedings of IEEE INFOCOM ’99,March 1999.[41]Y.Zhang,N.Duf?eld,V.Paxson,and S.Shenker.On the constancy of Internet path properties.In Proceedings of ACM SIGCOMM Internet Measurement Workshop ’01,November 2001.[42]T.Zseby.Deployment of sampling methods for SLA validation with non-intrusive measurements.In Proceedings of Passive and Active Measurement Workshop ,2001.

本文来源:https://www.bwwdw.com/article/2sbl.html

Top