CONTEXT-SENSITIVE POINTER ANALYSIS USING BINARY DECISION DIA

更新时间:2023-04-25 01:37:01 阅读量: 实用文档 文档下载

说明:文章内容仅供预览,部分内容可能不全。下载后的文档,内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的,是否完整无缺。

CONTEXT-SENSITIVE POINTER ANALYSIS

USING BINARY DECISION DIAGRAMS

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

John Whaley

March2007

c Copyright by John Whaley2007

All Rights Reserved

ii

I certify that I have read this dissertation and that,in my opinion,it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(David L.Dill)

I certify that I have read this dissertation and that,in my opinion,it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

iv

Abstract

This thesis shows that whole-program context-sensitive inclusion-based pointer anal-ysis,a previously intractable problem,can be e?ciently solved using binary decision diagrams.In addition,we show that it is possible to automatically translate from a high-level analysis speci?cation written in Datalog into an e?cient implementation using binary decision diagrams.

We present the?rst scalable context-sensitive,inclusion-based pointer analysis for Java programs.Our approach to context sensitivity is to create a clone of a method for every context of interest,and run a context-insensitive algorithm over the expanded call graph to get context-sensitive results.For precision,we generate a clone for every acyclic path through a program’s call graph,treating methods in a strongly connected component as a single node.Normally,this formulation is hopelessly intractable as a call graph often has1014acyclic paths or more.We show that these exponential relations can be computed e?ciently using BDDs.

We also describe bddbddb,a BDD-Based Deductive DataBase,which implements the declarative language Datalog with strati?ed negation,totally-ordered?nite do-mains and comparison operators.bddbddb uses binary decision diagrams(BDDs)to e?ciently represent large relations.BDD operations take time proportional to the size of the data structure,not the number of tuples in a relation,which leads to fast execution times.bddbddb is an e?ective tool for implementing a large class of program analyses.We show that a context-insensitive pointer analysis implemented with bddbddb is about twice as fast as a carefully hand-tuned version.

v

Preface

When I’m working on a problem,I never think about beauty.I think only

how to solve the problem.But when I have?nished,if the solution is not

beautiful,I know it is wrong.

R.Buckminster Fuller(1895–1983)

Shortly after arriving at Stanford,my adviser Monica Lam challenged me to “solve”pointer analysis.At that time,the only pointer analysis that could success-fully scale to large programs was an imprecise equivalence-based pointer analysis.I played around with various techniques to improve performance and scalability,achiev-ing some moderate success but a breakthrough remained elusive.The best I was able to achieve was selective context sensitivity,where a client analysis could selectively invoke context sensitivity to improve precision on parts of the program where it would make a di?erence.Interesting,but not totally satisfying,and the algorithm was so di?cult few people could understand it.

In2003,I read a paper by Berndl et al.about implementing a context-insensitive pointer analysis using binary decision diagrams.What made the paper interesting was not the experimental results—the BDD implementation was actually slower than a more traditional implementation—but the fact that the algorithm formulation and implementation was greatly simpli?ed by using BDDs.

I implemented Berndl’s algorithm and began to experiment with it.I soon realized the key to its performance was in its e?cient data representation.At that point,I realized I had been on the wrong path.I had been trying to come up with more and more complicated algorithms to solve the pointer analysis problem.Instead,I

vi

should have focused on how to represent the points-to data e?ciently—an e?cient algorithm would follow from the data representation.

As my experience with BDDs grew,I began to think they could be used to imple-ment context sensitivity e?ciently.The common substructure,e?cient computation on large sets,and memoization features of BDDs all seemed to be good matches for context-sensitive analysis.My?rst attempts were unsuccessful,but I persevered because I believed it could be made to work.And after many months of slow and methodical progress,I?nally had a fully context-sensitive inclusion-based pointer analysis that could analyze very large Java programs on the order of tens of minutes.

Unfortunately,after all of the optimizations and hacks that were required for good performance,the implementation had little semblance to the high-level algorithm. The simple elegance of a BDD solution was lost.Furthermore,I had worked on that one algorithm for nearly a year.There were still many other analyses I wanted to work on,and I did not relish the idea of spending a year for each one.

At that time,Monica Lam was working on a new edition of the venerated Dragon book and was struggling with the question of how to present di?cult material in an easy and understandable way.Her desire to make things simpler and more general for the Dragon book greatly in?uenced my direction and fundamentally changed my view of research.I realized that for research to have a signi?cant impact,it should be easy to understand and applicable to a wide variety of problems.

I started writing a tool that would automatically translate from a high-level algo-rithm speci?cation into an e?cient implementation.Interactions with Thomas Reps and Je?Ullman lead me to Datalog,which had a rich body of research behind it and a semantics that closely matched BDDs.I incorporated all of the tricks I had learned from my one year experience with using BDDs for program analysis.My intention was to abstract away the BDD back-end as much as possible,and allow the user to simply specify a set of inference rules and have it work e?ciently.

One of my main goals was that I wanted the system to be usable by others.Many people were intrigued by the possibility of writing program analyses in just a few lines of code and being able to automatically translate a speci?cation into an e?cient implementation.Early on,I worked closely with other Stanford students,especially

vii

Chris Unkel,Dzintars Avots,Michael Martin,Ben Livshits,Michael Carbin,Jim Zhuang,and Mayur Naik,who were using the system to implement their own program analyses.Their feedback led to great improvements in usability.

The tool’s name,bddbddb,came out of an email discussion with Monica Lam and Chris Unkel.The name was both silly and memorable,and would certainly show up ?rst on a Google search.In other words,an excellent name.I still get chuckles from the audience about the name when I do presentations about bddbddb.

My intention with this dissertation is for it to be the most complete and authorita-tive reference on the bddbddb system and program analysis using Datalog and BDDs. As with any system,bddbddb is constantly evolving and any documentation will soon be hopelessly out-of-date.However,this dissertation will attempt to describe the core principles and techniques without getting caught up in the particulars.After reading this dissertation,the reader should have enough knowledge to be able to formulate their own program analyses in Datalog and build their own BDD-based solver,if they were so inclined.

I also intended this dissertation to be readable by someone who has some knowl-edge of compilers and program analysis.Thus,I have tried to include enough back-ground for someone with the motivation and perseverance to be able to pick up this dissertation and read it without getting too lost.I don’t fancy myself a writer,but I have made an e?ort to make the text and concepts as clear as possible.

Most of the work in this dissertation is my own,with the notable exceptions of the BDD variable learning algorithm(Section4.3),which was primarily the work of Michael Carbin,the re?ection algorithm(Section5.2)which was joint work with Benjamin Livshits,and the static race detection algorithm(Section5.3),which was designed by Mayur Naik.I included these for completeness and as a demonstration that others have been able to build useful analyses using bddbddb.

I have organized this dissertation into eight chapters.Chapter1gives an intro-duction and describes why I believe program analysis,high-level speci?cation,and context sensitivity are important areas of research.Chapter2gives some background on the research areas relevant to the rest of the dissertation—namely,program anal-ysis(2.1),pointer analysis(2.2),Datalog(2.3),binary decision diagrams(2.4),and

viii

context sensitivity(2.5).Chapter3describes my approach to pointer analysis and context sensitivity using BDDs.Chapter4describes the bddbddb system in detail. Chapter5describes various applications of the bddbddb system by myself and others. Chapter6presents some experimental results from the system.Chapter7compares my work with related work,and Chapter8concludes.

Someone once said,“A thesis is never?nished,merely abandoned.”I now realize the point where one comes to fully understand and appreciate that statement is the point of enlightenment that allows one to graduate.There are many areas that I wanted to pursue and investigate more thoroughly,however after time and re?ection I realized it was better to move on and pursue other directions.

When I consider this dissertation as a whole,there was a frightening amount of work.Even more frightening perhaps is the realization that a large portion of what I worked on during my Masters and Ph.D.—in the areas of pro?ling[26,244], dynamic compilation[10,46,243,245],pointer and escape analysis[203,252,255], checkpointing[246],component interfaces[254],virtual machines[247,249],and spec-ulative parallelization[251]—does not even warrant a mention in this dissertation. But I’ve realized that a Ph.D.is not about the end result,it is about the journey.

John Whaley

March2007

ix

Acknowledgments

?First o?,I must thank my wife Miho,who gives meaning to everything I do.?I want to thank my adviser Monica Lam for teaching me so much and pushing

me to excellence.Much of the work in this thesis is directly or indirectly due to her.And never in my wildest dreams would I have imagined that my research would end up as Chapter12in the venerable Dragon book.

?I would also like to thank Martin Rinard,who was the advisor for my Masters thesis at MIT.He taught me many valuable lessons about research that I still use today.

?My Ph.D.orals committee—Juan Alonso,David Dill,Dawson Engler,Monica Lam,and Je?Ullman—did a stellar job.I was extremely lucky to have such

a superstar committee and I’ll never forget the experience.I am especially

indebted to my dissertation readers Monica,David,and Dawson,who gave me very fast turnarounds and great comments.

?I would also like to thank all of my coworkers from IBM Watson and IBM Tokyo who introduced me to compiler research and gave me many opportunities early in my career.

?I want to give thanks to all of the Stanford students I’ve worked with over the years:Dzintars Avots,Godmar Back,David Bloom,Michael Carbin,Brian Carlstrom,Ramesh Chandra,Benjamin Chelf,Andy Chou,Jim Chow,Ben D’Angelo,Michael Dalton,Seth Hallem,Sudheendra Hangal,David Heine,Ted Kremenek,Shih-Wei Liao,Amy Lim,Ben Livshits,Michael Martin,Mayur

x

Naik,Brian Murphy,Mayur Naik,Jim Norris,Je?Oplinger,Shankar Pon-nekanti,Will Robinson,Olatunji Ruwase,Joel Sandin,Costa Sapuntzakis, Patrick Sathyanathan,Brad Schumitsch,Garrett Smith,Paul Twohey,Chris Unkel,Yichen Xie,Junfeng Yang,Nickolai Zeldovich,Jim Zhuang,and the countless others that I forgot to write.

?I especially want to give thanks to my paper collaborators Michael Martin,Ben Livshits,Dzintars Avots,Michael Carbin,Chris Unkel,Christos Kozyrakis, Mayur Naik,and Alex Aiken.

?I want to thank my parents for buying that TI/99-4back in1980that started a lifetime passion with computers and programming.

?I’d also like to thank NSF and Intel for providing?nancial support for my time at Stanford.

?Finally,I would like to thank you,the reader.Most Ph.D.theses are never read after they are signed,and so the fact you are sitting down and reading this means you probably think there is some value in it.Although I cannot promise anything,I will do my best not to disappoint you.

xi

Contents

Abstract v

Preface vi Acknowledgments x

1Introduction1

1.1Motivation (1)

1.2Challenges (2)

1.2.1Pointer Analysis (4)

1.2.2Context-Sensitive Program Analysis (5)

1.3Our Solution (5)

1.3.1Binary Decision Diagrams (6)

1.3.2Cloning to Achieve Context Sensitivity (6)

1.3.3High-level Speci?cation to E?cient Implementation (7)

1.4List of Contributions (10)

1.5Organization (13)

2Background14

2.1Program Analysis (14)

2.1.1Applications of Program Analysis (15)

2.1.2Styles of Program Analysis (15)

2.2Pointer Analysis (16)

2.2.1Types of Pointer Analysis (17)

xii

2.2.2Call Graph Discovery (21)

2.3Datalog (23)

2.3.1Introduction to Datalog (23)

2.3.2Datalog Semantics (25)

2.3.3Evaluation Strategy (27)

2.3.4Magic Sets Transformation (28)

2.3.5Other Datalog variants (28)

2.4Binary Decision Diagrams (29)

2.4.1BDD Variable Ordering (30)

2.4.2Other Variants (32)

2.5Context Sensitivity (32)

2.5.1De?nitions of Context (33)

2.5.2Techniques for Solving Context-sensitive Analyses (34)

3Pointer Analysis35

3.1Context-Insensitive Pointer Analysis (35)

3.1.1Improving Pointer Analysis with Types (38)

3.2Call Graph Discovery (40)

3.3Context-Sensitive Pointer Analysis (43)

3.3.1Numbering Call Paths (44)

3.3.2Context-Sensitive Pointer Analysis with a Precomputed Call

Graph (48)

3.4Object-Sensitive Pointer Analysis (50)

4bddbddb:BDD-Based Deductive Database53

4.1From Datalog to BDD Operations (54)

4.1.1Relational Algebra (54)

4.1.2Boolean Functions (55)

4.1.3BDD Operations (56)

4.2Translating and Optimizing Datalog Programs (57)

4.2.1Datalog Source Transformations (57)

4.2.2Datalog Rule Optimization (58)

xiii

4.2.3Intermediate Representation (61)

4.2.4IR Optimizations (61)

4.2.5BDD Decision Variable Assignment (63)

4.2.6Additional Optimizations (64)

4.2.7Interpretation (65)

4.2.8Other Considerations (65)

4.3Learning BDD Variable Orderings (66)

4.3.1The Problem Space (67)

4.3.2Problem Characteristics (68)

4.3.3Integrating Learning into bddbddb (69)

4.3.4Algorithm Overview (71)

4.3.5Data Sets (72)

4.3.6Order Constructor (73)

4.3.7Uncertainty Sampler (76)

4.3.8Generating a New Order (77)

4.4Querying bddbddb (78)

4.4.1The Come-from Query (79)

5Applications for Program Analysis83

5.1Queries and Other Analyses (84)

5.1.1Debugging a Memory Leak (84)

5.1.2Heap Connectivity (84)

5.1.3Finding a Security Vulnerability (85)

5.1.4Aliased Parameters (86)

5.1.5Type Re?nement (86)

5.1.6Interprocedural Data Flow (87)

5.1.7Context-Sensitive Mod-Ref Analysis (88)

5.1.8Context-Sensitive Type Analysis (89)

5.1.9Thread Escape Analysis (90)

5.2Pointer Analysis for Re?ection (93)

5.2.1Re?ection Resolution Using Casts (98)

xiv

5.3Static Race Detection (101)

5.3.1Race Detection Algorithm Overview (102)

5.3.2Original-Pairs Computation (102)

5.3.3Reachable-Pairs Computation (104)

5.3.4Aliasing-Pairs Computation (106)

5.3.5Escaping-Pairs Computation (106)

5.3.6Unlocked-Pairs Computation (109)

5.3.7Summary of Static Race Detection Results (112)

5.4Program Analyses using bddbddb (112)

6Experimental Results117

6.1Pointer Analysis Results (117)

6.1.1Methodology (118)

6.1.2Analysis Times (120)

6.1.3Evaluation of Results (122)

6.1.4Experience (123)

6.2E?ectiveness of bddbddb Compilation (126)

6.2.1Comparing Lines of Code (127)

6.2.2Comparing Analysis Times (128)

6.2.3External Lock and SQL Injection Analyses (131)

6.3Re?ection Resolution Results (131)

6.3.1Experimental Setup (131)

6.3.2Evaluation Approach (132)

6.3.3Local Analysis for Re?ection Resolution(Local) (134)

6.3.4Points-to Information for Re?ection Resolution(Points-to).134

6.3.5Casts for Re?ection Resolution(Casts) (136)

6.3.6Achieving a Sound Call Graph Approximation(Sound) (138)

6.3.7E?ect of Re?ection Resolution on Call Graph Size (138)

6.4Results from Machine Learning BDD Variable Orders (140)

6.4.1Methodology (140)

6.4.2Results (142)

xv

7Related Work144

7.1Pointer Analysis (145)

7.1.1Scalable Pointer Analysis (145)

7.1.2Context-sensitive Pointer Analysis (146)

7.1.3BDD-based Pointer Analysis (147)

7.2BDDs for Program Analysis (147)

7.2.1BDDs for Model Checking (148)

7.2.2BDDs for Predicate Abstraction (148)

7.2.3BDDs as a Data Representation (148)

7.3High-level Languages for Program Analysis (149)

7.3.1Constraint Languages (149)

7.3.2User-speci?ed Program Queries (150)

7.3.3Program Analysis with Databases (150)

7.4Optimizing Datalog (152)

7.4.1Datalog Implementations (152)

7.4.2Logic programming with BDDs (153)

7.4.3Datalog Evaluation Strategies (153)

7.5BDD Optimizations (153)

7.5.1BDD Libraries (154)

7.5.2BDD Variable Ordering (154)

7.6Applications of Program Analysis (156)

7.6.1Call Graph Discovery (156)

7.6.2Analyzing Re?ection (158)

7.6.3Finding Program Errors (159)

8Conclusions161

8.1Future Work (162)

8.1.1Other Data Representations (162)

8.1.2Beyond Datalog (163)

8.1.3Integration in the Software Development Process (163)

Bibliography164

xvi

List of Tables

4.1The grouping of the data entries in our training set.Each episode is

tied to a particular rule,rule application,and operation within the

rule.Within each episode the order along with its run time is stored.72

6.1Descriptions of the benchmarks we used to evaluate our pointer analyses.117

6.2Information about the benchmarks we used to test our pointer analyses.118

6.3Analysis times and peak memory usages for each of the benchmarks

and analyses.Time is in seconds and memory is in megabytes (121)

6.4Results of escape analysis (124)

6.5Results of the type re?nement query.Numbers are percentages.

Columns labeled multi and re?ne refer to multi-type variables and

re?nable-type variables,respectively (125)

6.6LOC for hand-coded analyses versus lines of Datalog using bddbddb.127

6.7Comparison of context-insensitive Java pointer analysis runtimes.

Times are in seconds (129)

6.8Comparison of context-sensitive Java pointer analysis runtimes.Times

are in seconds (129)

6.9Comparison of C pointer analysis runtimes.Times are in seconds (129)

6.10External lock analysis runtimes.Times are in seconds (130)

6.11SQL injection query results.Times are in seconds.∞indicates that

the analysis did not?nish (130)

6.12Summary of information about our benchmarks.Applications are

sorted by the number of lines of code in column3 (132)

6.13Results of resolving Class.forName calls for di?erent analysis versions.133

xvii

6.14Number of classes and methods in the call graph for di?erent analysis

versions (139)

6.15Information about the analyses that we used to evaluate our BDD order

?nding algorithm.The four columns of numbers are the number of

rules,relations,and domains in the input Datalog?le,and the number

of possible domain orders (140)

6.16The results of our learning algorithm.The?rst four columns of num-

bers compare the speed of a random order,an order generated with a

sifting algorithm,our best hand-tuned order,and the order output by

the algorithm.∞means that the analysis did not complete because

it ran out of memory.The next four columns give statistics on the

performance of the learning algorithm (142)

6.17A comparison of the run times of our hand-tuned and generated or-

ders for the j pacs and j

List of Figures

2.1An example Datalog program with two minimal solutions (26)

2.2(a)Binary encoding of a relation.(b)and(c)are BDD encodings

of the relation given by(a)with decision variable orders b1,b2,b3and

b2,b1,b3,respectively (30)

3.1Example of path numbering.The graph on the left is the original

graph.Nodes M2and M3are in a cycle and therefore are placed in

one equivalence class.Each edge is marked with path numbers at the

source and target of the edge.The graph on the right is the graph with

all of the paths expanded (45)

3.2The six contexts of function M6in Example1 (45)

4.1(a)Predicate dependency graph for Algorithm1.(b)Breaking the

PDG into SCCs and?nding cycles (58)

4.2The learning algorithm embedded in Datalog resolution (70)

4.3Steps of a learning episode.This corresponds to the contents of the

Active Learner box in Figure4.2 (71)

4.4An example decision tree induced from the training set data in Ta-

ble4.1on rule2 (75)

4.5Datalog program generated by bddbddb to compute come-from query

“vP(100,200):–?”on Algorithm1.The second,third,fourth,

and?fth chunks of rules come from Rules(3.2),(3.4),(3.1),and(3.3)

respectively (82)

xix

5.1Typical use of re?ection to create new objects (94)

5.2A fragment of a speci?cation?le accepted by our system.A string

identifying a call site to Class.forName is mapped to a class name that

that call may resolve to (97)

5.3A case in freetts where our analysis is unable to determine the type

of objects instantiated on line5using casts (101)

6.1Re?ection resolution using points-to results in

javax.xml.transform.FactoryFinder in the JDK (135)

xx

Chapter1

Introduction

1.1Motivation

Program analysis is the process of analyzing the behavior of a computer program. Program analysis has many important applications.It has traditionally been used primarily for optimizing programs so that they will run faster.More recently,program analysis is increasingly being used in tools to aid software development.

Program analysis has the potential to be extremely useful in?nding bugs and security vulnerabilities in software.Research has shown that static analysis can reduce defects by up to a factor of six[136],and that60%of software faults that were found in released software products could have been detected with static analysis tools[34]. 40%of the faults that could be found through static analysis will eventually become a defect in production software[119].

Software bugs are a serious problem.They cost the U.S.economy an estimated $59.5billion per year[178].While not all errors can be removed,more than a third of the cost—an estimated$22.2billion—could be eliminated by an improved infrastructure that includes static analysis tools[178].

Static analysis brings numerous bene?ts to the software development process. Static analysis is attractive as it can catch bugs early in development,before the software is released.A common rule of thumb in software development is that a bug that costs$1to?x before release will cost$100to?x after release.Thus,it is

1

本文来源:https://www.bwwdw.com/article/n7oq.html

Top