The following are some notes for understanding some trivial terms or learning resource for program analysis.
Basically copy and paste from some slides.
BTW, some courses I recommended:
I don’t think Principle of Program Analysis is friendly to beginner, don’t read it in the beginning.
BTW, compiler optimizations also share some common theory with program analysis. You can study CMU CS 745 meanwhile.
Some tools for program analysis:
- BAP: For Binary Analysis
- LLVM: Mostly for C/C++
- Soot: Java
- GitHub Semantic: A unified platform for analyzing different languages.
Soundness and Completeness
Note from UW
A static analysis tool
S analyzes a program
P to determine whether it satisfies a property
Sis sound it will never miss any violations, but it may say that
Reven though it doesn’t (resulting in false positives).
- If S is complete, it will never report false positives, but it may miss real violations of
R(resulting in false negatives).
To be less abstract, we can image soundness means an analyzer set up a set of rules, only one program follows the rules can be safe. However, the set of rules might not cover all the correct cases, then resulting false positive. For example, a newbie write a formal verification and forgets to cover some cases.
Completeness is more like exclusion(e.g. fuzzer). When a program demonstrate some behaviors, it will be unsafe(like a segment fault). Sometimes a program will not crash under fuzzing, but we can’t say it is safe(false negative).
Basic Blocks are a sequence a block that do not execute
call which affect the normal flow of a program.
Static Program Representation
Note from UT
AST(Abstract Syntax Tree): AST is a finite, labeled, directed tree, where the internal nodes are labeled by operators, and the leaf nodes represent the operands of the operator. Widely used in compiling for parsing source code. But not good for binary code. You can find a detail graph in most compiler theory course material.
CFG(Control Flow Graph): A directed graph where each node represents a statement and edges represent control flow. For example, the graph view of IDA/Binary Ninja/Ghidra.
Program Dependence Graph(describe the change of data rather than control flow):
- Data Dependency:
Sdata depends on
Tif there exists control flow path from
Sand a variable is defined at
Tand then used at
- Dominator: A block M dominates a block
Nif every path from the entry that reaches block
Nhas to pass through block M. If there is an addition block
- Post Dominator: In the reverse direction, block M post-dominates block N if every path from N to the exit has to pass through block M.
- Control Dependence: A node (basic block) Y is control-dependent on another X iff X directly determines whether Y executes
- Data Dependency:
Points-to Graph. At a program point, compute a set of pairs of the form
pMAY/MUST points to
- Aliases: Two expressions that denote the same memory location.(e.g. pointers, call-by-reference)
Call Graph(a CFG but only branch in function call):
- Nodes are procedures/Edges are calls
- Hard cases: calls through function pointers
Data Flow Analysis
Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program.
- Forward/Backward = Data flow from
- Must/May = Property
mayhold on all paths to the join
|Forward||Reaching definitions||Available expressions|
|Backward||Live variables||Very busy expressions|
e is available at program point
eis computed on every path to
- the value of
ehas not changed since the last time
ewas computed on every path to
- If an expression is available, need not be recomputed
join point: A program point where two branches meet
Available expressions is a forward must analysis:
- Data flow propagate in same dir as CFG edges
- Expr is available only if available on all paths
v is live at program point
vwill be used on some execution path originating from
- If a variable is not live, no need to keep it in a register
- If variable is dead at assignment, can eliminate assignment
Liveness is a backward may problem:
- To know if variable live, need to look at future uses
- Variable is live if used on some path
A definition of a variable
v is an assignment to
v. A definition of variable
v reaches point
p if there is no intervening assignment to
Very Busy Expression
e is very busy at point
p if on every path from p, expression e is evaluated before the
value of e is changed
Single Static Assignment
The values of temporary variables are updated (“mutated”) during evaluation, so we use
SSA = Single Static Assignment eliminates this problem, by introducing more temporaries – each one assigned to only once.
Dynamic Program Representations
Control flow trace, address and value traces
- Slice: Slice of v at S is the set of statements involved in computing v’ s value at S
- static slice easily to be imprecise, e.g. function pointers call.
- Dynamic slicing: Dynamic slicing makes use of all information about a particular execution of a program and computes the slice based on an execution history (trace). However, it relies on heavy running history and high space requirement.
Dynamic Binary Instrumentation
Basic idea: insert extra code to collect runtime information. Primarily used for compiler optimization. Also common in vulnerability identification.
- Easily handle multi-lingual programs
- Less worries in libraries
- Turning off compiler optimizations can maintain an almost perfect mapping from instructions to source code lines
- Worms and viruses are rarely provided with source code
- Handle dynamically-generated code
- QEMU(used by
- Valgrind(used by