About

I tried to learn program analysis via directly writing analysis for binary for months. But I waste a lot of time :(. The resource of binary analysis is much more limited and user-unfriendly. Thus, I choose to start from LLVM for analysis instead, which is indeed much better.

I used MacOS and LLVM 8.0. For some reasons, I cannot configure the brew version LLVM. I would also recommend you to build from source code to avoid weird bugs.

We won’t talk about too much about program analysis part too much. You can view my previous note in program analysis or read SPA book thoroughly. You can use llvm pass tutorial as setup guide.

LLVM is a huge project, which contains a large set of binary utility. Notwithstanding, we only care about clang and opt - the C front end and LLVM optimizer.

Intro to LLVM Pass

In LLVM, a Pass is usually designed for optimization, analysis, or transformation. You can derives own analysis from base class. We have following base classes:

  • ImmutablePass: This pass will neither run nor change state. But it stores information about current compiled target.
  • ModulePass: In this pass, we use entire program as a unit.
  • CallGraphSCPass: The pass allows us to build and traverse the call graph. The SCC refers to strongly connected component origin from Tarjan Algorithm, which means processing functions bottom-up.
  • FunctionPass: FunctionPass executes on each functions independently.
  • LoopPass: Inspecting loops.
  • BasicBlockPass: Similar to FunctionPass, but basic block refers to a block without jumps.

Above passes are common. For full passed with detailed explanations, you can view the official document.

Most classes have following special functions:

  • doInitialization(): We can do things that the pass not allowed to do (I didn’t note the restrictions of above passes, but you can view the detail from official document). This functions also help us to initialize pass-independent things.
  • runXXX(): XXX usually is the name of current pass. We start our analysis here.

A Naive Hello-World

Let’s see how to write a basic LLVM pass. First, let’s do some preparations (to avoid modify environment just edit the skeleton.c from llvm pass tutorial):

#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"
#include "llvm/IR/InstIterator.h"

using namespace llvm;

We need to put our customized class Hello in an anonymous namespace. It derives from FunctionPass, which is a utility class for analyzing functions. Then, a RegisterPass is used to register our analysis pass:

namespace {
  struct Hello : public FunctionPass {
      static char ID;
      Hello() : FunctionPass(ID) {}
    ...
  }
}

char Hello::ID = 0; i
static RegisterPass<Hello> X("hello", "Hello World Pass",
  false /* Only looks at CFG */,
  false /* Analysis Pass */); 

Now, since we just want to write a naive analysis. It’s unnecessary to have a complicated constructor. Adding code to runOnFunction, which will be executed by opt, is enough for out naive algorithm.

Let’s set up a simple task: find the first and the last LLVM IR for each function. Output the first instruction in the set. Then, iterate until we find the last one:

bool runOnFunction(Function &F) override {
    errs() << "-> ";
    errs().write_escaped(F.getName()) << '\n';

    for (inst_iterator I = inst_begin(F), E = inst_end(F); I != E; ++I)
    {
        if (I == inst_begin(F))
            errs() << "  Begin: " << *I << "\n";

        auto tmp = I;
        if ((++tmp) == inst_end(F))
            errs() << "  End: " << *I << "\n";
    }
    errs() << "\n";
    return false;
}

The Function type record information for a function. We iterate a function via inst_iterator and use inst_begin() and inst_end() to locate where to start and end.

In LLVM, we use errs() rather than std::cout to output. It’s an output utility provided as substitution to STD. The following is a testing file, and compile with /PATH_TO/clang -emit-llvm -c test.c -o test.ll:

#include<stdio.h>
int test() {
    int a = 1;
    for (int i = 1; i < 10; i++) {
        a = a * i;
    }
    return a;
}

int main() {
    printf("test");
    test();
    return 100;
}

You might need to add an addition flag for clang in macOS:

-I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/

In build directory, use cmake ../ && make to build, and run

/PATH_TO/opt -load skeleton/libSkeletonPass.so -hello < test.ll

And the output is:

// removed some extra information
-> test
  Begin:   %a = alloca i32, align 4
  End:   ret i32 %4

-> main
  Begin:   %call = call i32 (i8*, ...) @printf(...)
  End:   ret i32 100

The full script is here. Above instructions are LLVM IRs, a type of representation between source code and binary code. clang is a compiler for C. Here, however, we use it to transform source code to Bite Code. Subsequently, opt load our pass to analysis the test.ll file. This is the life of an analysis.

CallGraph

A call graph is a diagram rendering how each function is calling other. The idea is simple: when function A is calling function B, use a graph structure to record A is calling B. We have tasted LLVM IR in previous section. Now, let’s try additional manipulations.

In LLVM, there are two instructions related to calling a function: call and invoke. The latter is usually used for functions with exceptions. It has a special flag to mark whether exceptions have happened.

For this analysis, we need to cast instruction iterator to CallInst and InvokeInst. Then output the function name when succeed: