Program Analysis Notes

The following are some notes for understanding some trivial terms or learning resource for program analysis. Basically copy and paste from some slides.

BTW, some courses I recommended:

I don’t think Principle of Program Analysis is friendly to beginner, don’t read it in the beginning.

BTW, compiler optimizations also share some common theory with program analysis. You can study CMU CS 745 meanwhile.

Some tools for program analysis:

  • BAP: For Binary Analysis
  • LLVM: Mostly for C/C++
  • Soot: Java
  • GitHub Semantic: A unified platform for analyzing different languages.

Static Analysis

Soundness and Completeness

Note from UW

A static analysis tool S analyzes a program P to determine whether it satisfies a property R:

  • If S is sound it will never miss any violations, but it may say that P violates R even though it doesn’t (resulting in false positives).
  • If S is complete, it will never report false positives, but it may miss real violations of R (resulting in false negatives).

To be less abstract, we can image soundness means an analyzer set up a set of rules, only one program follows the rules can be safe. However, the set of rules might not cover all the correct cases, then resulting false positive. For example, a newbie write a formal verification and forgets to cover some cases.

Completeness is more like exclusion(e.g. fuzzer). When a program demonstrate some behaviors, it will be unsafe(like a segment fault). Sometimes a program will not crash under fuzzing, but we can’t say it is safe(false negative).

Basic Blocks

Basic Blocks are a sequence a block that do not execute jmp or call which affect the normal flow of a program.

Static Program Representation

Note from UT

  • AST(Abstract Syntax Tree): AST is a finite, labeled, directed tree, where the internal nodes are labeled by operators, and the leaf nodes represent the operands of the operator. Widely used in compiling for parsing source code. But not good for binary code. You can find a detail graph in most compiler theory course material.
  • CFG(Control Flow Graph): A directed graph where each node represents a statement and edges represent control flow. For example, the graph view of IDA/Binary Ninja/Ghidra.
  • Program Dependence Graph(describe the change of data rather than control flow):
    • Data Dependency: S data depends on T if there exists control flow path from T to S and a variable is defined at T and then used at S.
    • Dominator: A block M dominates a block N if every path from the entry that reaches block N has to pass through block M. If there is an addition block P. M dominates P, P dominates N. Then M immediately dominates N.
    • Post Dominator: In the reverse direction, block M post-dominates block N if every path from N to the exit has to pass through block M.
    • Control Dependence: A node (basic block) Y is control-dependent on another X iff X directly determines whether Y executes
  • Points-to Graph. At a program point, compute a set of pairs of the form p -> x, where p MAY/MUST points to x:
    • Aliases: Two expressions that denote the same memory location.(e.g. pointers, call-by-reference)
  • Call Graph(a CFG but only branch in function call):
    • Nodes are procedures/Edges are calls
    • Hard cases: calls through function pointers

Data Flow Analysis

Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program.

  • Forward/Backward = Data flow from in to out/from out to in
  • Must/May = Property must/may hold on all paths to the join
  May Must
Forward Reaching definitions Available expressions
Backward Live variables Very busy expressions

Available Expressions

An expression e is available at program point p if:

  • e is computed on every path to p, and
  • the value of e has not changed since the last time e was computed on every path to p
  • If an expression is available, need not be recomputed
  • join point: A program point where two branches meet

Available expressions is a forward must analysis:

  • Data flow propagate in same dir as CFG edges
  • Expr is available only if available on all paths

Liveness Analysis

A variable v is live at program point p if:

  • v will be used on some execution path originating from p.
  • before v is overwritten

For optimization:

  • If a variable is not live, no need to keep it in a register
  • If variable is dead at assignment, can eliminate assignment

Liveness is a backward may problem:

  • To know if variable live, need to look at future uses
  • Variable is live if used on some path

Reaching Definitions

A definition of a variable v is an assignment to v. A definition of variable v reaches point p if there is no intervening assignment to v.

Very Busy Expression

An expression e is very busy at point p if on every path from p, expression e is evaluated before the value of e is changed

Single Static Assignment

The values of temporary variables are updated (“mutated”) during evaluation, so we use SSA = Single Static Assignment eliminates this problem, by introducing more temporaries – each one assigned to only once.

Dynamic Analysis

Dynamic Program Representations

  • Control flow trace, address and value traces
  • Dynamic Slicing:
    • Slice: Slice of v at S is the set of statements involved in computing v’ s value at S
    • static slice easily to be imprecise, e.g. function pointers call.
    • Dynamic slicing: Dynamic slicing makes use of all information about a particular execution of a program and computes the slice based on an execution history (trace). However, it relies on heavy running history and high space requirement.

Dynamic Binary Instrumentation

Basic idea: insert extra code to collect runtime information. Primarily used for compiler optimization. Also common in vulnerability identification.

Pros:

  • Easily handle multi-lingual programs
  • Less worries in libraries
  • Turning off compiler optimizations can maintain an almost perfect mapping from instructions to source code lines
  • Worms and viruses are rarely provided with source code
  • Handle dynamically-generated code

Tools:

  • QEMU(used by AFL)
  • Valgrind(used by Angr)
  • Pin