Auxy's Blog

Learning Binary Ninja API

Preface

Recently, I spent some time learning CMU’s Binary Analysis Platform (BAP), an open source binary analysis tool. But you have to learn OCaml to understand it in depth. What’s worse, its documentation is sparse: only a short official tutorial and some complicated plugins. After a month of study... I gave up.

LLVM is another software analysis tool, but it is too low level. Most LLVM tutorials are about writing compilers, so for those who are interested in software analysis, LLVM is a little unfriendly.

So I picked up my Binary Ninja license again. Although it’s a personal license, which disallows headless processing, we can use the Python console or the GUI as substitutes. A personal license isn’t expensive. Buy it as your Christmas gift :P.

Turn on the Python console via View -> Script Console, and you are ready to go.

Implement a Plugin

Preparation

First, we compile test.c to test (my environment is Ubuntu 18.04 LTS):

#include<stdio.h>
int main() {
    char buf[20];
    scanf("%s", buf);
    printf(buf);
}

Let’s load the compiled file and introduce some basic concepts via the Script Console. For help operating this console, you can view this cheatsheet.

You may have noticed that there is an obvious format string vulnerability. How can we use Binary Ninja to detect it automatically?

Create Main Script

Open the plugin folder via Tools -> Open Plugin Folder.... Then create a directory fmt_str. Inside fmt_str, create __init__.py, which is the entry point of every plugin.

Let’s use this template in __init__.py:

from binaryninja import *

def fmt_str_detect(bv,function):
	pass

PluginName = "Format String Detections" 
Description = "Detect Format String Vulnerability" 
PluginCommand.register_for_address(
	PluginName, 
	Description, 
	fmt_str_detect
)

So, we use PluginCommand.register_for_address to register a command that runs in the context of an address. Format String Detections is the plugin name, and Detect Format String Vulnerability is its description. fmt_str_detect is the callback that Binary Ninja invokes with two arguments: the binary view and the selected address (the second parameter is named function in this template, but the value passed in is an address).

In the fmt_str_detect function, bv stands for the binary object. We operate on this object to analyze the program.

Find Vulnerable Functions

For now, you can type the following code in the Python console. We will put together the full script later.

Our first step is to find printf:

printf_addr = bv.get_symbols_by_name(
	"printf"
)[0].address

printf_refs = bv.get_code_refs(
	printf_addr
)

Okay, we use bv.get_symbols_by_name("XXXX") to get a list of symbols named XXXX. Usually we get more than one entry, such as an imported function and an exported function. The address attribute is the symbol’s address.

The next step is to find where printf is called. bv.get_code_refs returns a list of references to that address.
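Because the lookup can return several symbols, indexing with [0] picks whichever happens to come first. A minimal sketch of filtering by symbol type instead is shown here. The helper name imported_addresses is made up for illustration, and it assumes nothing beyond the get_symbols_by_name, type.name and address members already used in this post.

```python
def imported_addresses(bv, name):
    """Return the addresses of all imported-function symbols called `name`.

    `bv` is expected to be a Binary Ninja BinaryView. This is a
    hypothetical helper for illustration, not part of the API.
    """
    return [
        sym.address
        for sym in bv.get_symbols_by_name(name)
        if sym.type.name == "ImportedFunctionSymbol"
    ]
```

In the Script Console this would be called as imported_addresses(bv, "printf"), and each returned address can then be passed to bv.get_code_refs.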

Locating Vulnerable Parameter via MLIL

We will use Medium Level IL (MLIL) to locate vulnerable parameters. MLIL represents assembly in a more readable and abstract way. Between MLIL and assembly code there is another layer called Low Level IL (LLIL), which still retains many features of assembly code.

printf_ref = printf_refs[0]

low_il = printf_ref.function.get_low_level_il_at(
	printf_ref.address
)

medium_il = low_il.medium_level_il

The first line extracts the first element of printf_refs. The next statement looks up the LLIL instruction at the address of that reference. Finally, we convert it to MLIL. The MLIL looks like:

>>> medium_il
<il: 0x5e0(rdi)>

0x5e0 is the PIE address of the symbol printf, and rdi is its first parameter (the string).

We can extract the parameters now. We only need to check the first one:

medium_il.params[0].value
# <stack frame offset -0x28>

Binary Ninja infers correctly! How about a secure version and a heap version?

<undetermined> # Heap version
<const ptr 0x817> # Safe version

Since the heap is initialized at runtime, Binary Ninja cannot infer the value. In the safe version (a static string in the .rodata section), the parameter is represented as a constant pointer.
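The lookup chain above can fail at several points: a reference may sit outside any function, there may be no IL at the address, or the call may have fewer parameters than expected. Here is a defensive sketch of the same steps. The helper name call_argument is hypothetical, and the None checks are an assumption about how missing IL shows up rather than documented behavior.

```python
def call_argument(ref, index):
    """Return the MLIL expression for argument `index` of the call at `ref`.

    `ref` is expected to be a code reference as returned by
    bv.get_code_refs(). Returns None when any step of the lookup fails.
    Hypothetical helper for illustration.
    """
    if ref.function is None:
        return None
    low_il = ref.function.get_low_level_il_at(ref.address)
    if low_il is None:
        return None
    medium_il = low_il.medium_level_il
    if medium_il is None:
        return None
    # Only call-like MLIL instructions carry a `params` list
    params = getattr(medium_il, "params", None)
    if params is None or index >= len(params):
        return None
    return params[index]
```

With this helper, call_argument(printf_refs[0], 0).value would give the same value as medium_il.params[0].value whenever the lookup succeeds.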

Verify Vulnerabilities

Now, let’s filter the conditions. Binary Ninja provides several different value types, but we don’t need to care much about most of them: the call is safe only when the format string lives in read-only memory (ignoring some more complicated cases). So we need to check whether the parameter’s value points to read-only memory:

... # When medium_il.params[0].value.type is not ConstantPointerValue
    # we can infer that the call is vulnerable

bv.is_offset_writable(
	medium_il.params[0].value.value
) # Check whether the pointed-to memory is writable
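The decision rule can be written down as a small pure function, which makes it easy to check without opening a binary. This is only a sketch: looks_vulnerable is a made-up name, the value type is passed in as a plain string, and the writability flag stands in for the result of bv.is_offset_writable.

```python
def looks_vulnerable(value_type_name, pointer_is_writable=False):
    """Apply the rule from the text to one format-string argument.

    value_type_name:     name of the inferred value type, as a string
    pointer_is_writable: for constant pointers, whether the target
                         memory is writable (ignored otherwise)
    """
    # Anything that is not a constant pointer (stack, heap, unknown)
    # cannot be proven read-only, so it is flagged.
    if value_type_name != "ConstantPointerValue":
        return True
    # A constant pointer is only safe when it targets read-only memory.
    return bool(pointer_is_writable)
```

The three cases from the previous section map to looks_vulnerable("StackFrameOffset"), looks_vulnerable("UndeterminedValue") and looks_vulnerable("ConstantPointerValue", False).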

Okay, we might have found a format string bug now, and we want to notify the user. Instead of only printing to the console, we can mark the location in the GUI with set_user_instr_highlight() and set_comment_at():

printf_ref.function.set_comment_at(printf_ref.address, "fmt_str vuln!")
printf_ref.function.set_user_instr_highlight(printf_ref.address, HighlightStandardColor.RedHighlightColor)

Full Script

Putting it all together:

from binaryninja import *

# Format string functions and the index of their format parameter
fmtstr_list = [("printf", 0),
               ("sprintf", 1)]

RED = HighlightStandardColor.RedHighlightColor

# Comment on and highlight the vulnerable address
def set_alert(func_ref):
    func_ref.function.set_comment_at(
        func_ref.address,
        "fmt_str vuln!"
    )
    func_ref.function.set_user_instr_highlight(
        func_ref.address,
        RED
    )

def fmt_str_detect(bv, function):
    for name, offset in fmtstr_list:
        for func in bv.get_symbols_by_name(name):
            # Only imported function symbols are of interest
            if func.type.name != "ImportedFunctionSymbol":
                continue

            # Walk every reference and check whether the format
            # parameter points to read-only memory
            for func_ref in bv.get_code_refs(func.address):
                try:
                    low_il = func_ref.function.get_low_level_il_at(
                        func_ref.address
                    )
                    medium_il = low_il.medium_level_il
                    param = medium_il.params[offset]

                    if param.value.type.name == "ConstantPointerValue":
                        if bv.is_offset_writable(param.value.value):
                            set_alert(func_ref)
                    else:
                        set_alert(func_ref)
                except Exception:
                    print("[!] Load Function Error\n")

# Register the command in the main UI
PluginCommand.register_for_address(
    "Format String Detections",
    "Detect Format String Vulnerability",
    fmt_str_detect
)

Customize your Architecture

Sometimes we run into a weird or unsupported architecture, which requires us to build our own disassembler. Fortunately, Binary Ninja provides a powerful API for this.

Architecture Subclass Overview

Let’s look at a sketch of an Architecture subclass:

class YourArch(Architecture):
    name = "YourArch" # Name of the Arch
    address_size = 8 # Each Address Size
    default_int_size = 4
    max_instr_length = 16

    regs = {"ptr": RegisterInfo("ptr", 2)}
    stack_pointer = "ptr"

    def get_instruction_info(self, data, addr):
        # returns an InstructionInfo at the given virtual address 
        pass

    def get_instruction_text(self, data, addr):
        # returns a list of InstructionTextToken
        pass

    def get_instruction_low_level_il(self, data, addr, il):
        # appends LowLevelILExpr objects to il for the instruction  
        pass

YourArch.register()

The most important function is get_instruction_text, which converts the raw data into displayable tokens. regs is an attribute that defines all the registers you will use; declare each register with RegisterInfo.

get_instruction_info gives Binary Ninja branching information. Binary Ninja needs to know when control jumps from one address to another in order to build a correct CFG. And get_instruction_low_level_il tells Binary Ninja how to convert our assembly code to side-effect free Low Level IL. The subsequent Medium Level IL is generated automatically.

We will implement the Manchester Baby architecture. Its instruction set, as given by Wikipedia, is the following:

Binary Code | Modern Mnemonic | Operation
000 | JMP S | Jump to the instruction at the address obtained from the specified memory address S (absolute unconditional jump)
100 | JRP S | Jump to the instruction at the program counter plus (+) the relative value obtained from the specified memory address S (relative unconditional jump)
010 | LDN S | Take the number from the specified memory address S, negate it, and load it into the accumulator
110 | STO S | Store the number in the accumulator to the specified memory address S
001 or 101 | SUB S | Subtract the number at the specified memory address S from the value in accumulator, and store the result in the accumulator
011 | CMP | Skip next instruction if the accumulator contains a negative value
111 | STP | Stop

So, let’s define several global constants to represent those operations:

JMP = "000"
JRP = "100"
LDN = "010"
STO = "110"
CMP = "011"
STP = "111"
SUB = ["001", "101"]

We use an 8-character ASCII string rather than bits to represent the binary code. The first 3 characters are the opcode, and the last 5 characters are the operand in decimal. For example, JMP 100, CMP and LDN 2000 will be:

00000100 ; JMP 100
01101234 ; CMP, and the 01234 will be ignored
01002000 ; LDN 2000
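To make the encoding concrete, here is a small standalone decoder for this 8-character format. It is a sketch for experimenting outside Binary Ninja: the names OPCODES and decode are made up for this example, and the bytes handling is included on the assumption that the data may arrive as bytes rather than str under Python 3.

```python
# Opcode table from the specification above; both SUB encodings map to "SUB"
OPCODES = {
    "000": "JMP", "100": "JRP", "010": "LDN", "110": "STO",
    "001": "SUB", "101": "SUB", "011": "CMP", "111": "STP",
}
INSTR_LEN = 8

def decode(data):
    """Decode one instruction into (mnemonic, operand), or None if invalid."""
    if len(data) < INSTR_LEN:
        return None
    text = data[:INSTR_LEN]
    if isinstance(text, bytes):
        # Undecodable bytes become U+FFFD and fail the checks below
        text = text.decode("ascii", "replace")
    mnemonic = OPCODES.get(text[:3])
    operand = text[3:]
    if mnemonic is None or not all(c in "0123456789" for c in operand):
        return None
    return mnemonic, int(operand)
```

For instance, decode("00000100") yields the JMP mnemonic with operand 100, and input shorter than 8 characters yields None.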

get_instruction_info

Let’s implement branching first. get_instruction_info takes three parameters. data is the raw binary data from the executable file, which we need to parse ourselves. Since an instruction is always 8 bytes long, we read 8 characters each time. addr is the address of the current instruction.

This function must return either None, when there is nothing to decode, or an InstructionInfo object, which defines the length and branch behavior of one instruction.

Every branch we add to the InstructionInfo must use one of the BranchType values. We can find all BranchType here:

def get_instruction_info(self, data, addr):

    # Ensure data length
    if len(data) < 8:
        return None

    # Extract 8 bytes of data for one instruction
    instruction = data[:8]
    op_code = instruction[:3]
    src = int(instruction[3:])

    # Create the InstructionInfo to fill in
    res = InstructionInfo()
    res.length = 8

    # Add a branch for absolute and relative jumps
    if op_code == JMP:
        res.add_branch(
            BranchType.UnconditionalBranch,
            src
        )
    elif op_code == JRP:
        res.add_branch(
            BranchType.UnconditionalBranch,
            addr + src
        )

    # This is a conditional branch,
    # so we need a TrueBranch and a FalseBranch
    elif op_code == CMP:
        res.add_branch(
            BranchType.TrueBranch,
            addr + 16
        )
        res.add_branch(
            BranchType.FalseBranch,
            addr + 8
        )

    return res
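The branch logic can also be checked on its own, without an InstructionInfo object. The sketch below mirrors the same rules as a pure function; branch_targets and the string labels are made up for this example and merely stand in for the BranchType values.

```python
def branch_targets(mnemonic, operand, addr, instr_len=8):
    """Return the (kind, target address) pairs produced by one instruction."""
    if mnemonic == "JMP":
        # Absolute jump to the operand
        return [("unconditional", operand)]
    if mnemonic == "JRP":
        # Jump relative to the current address
        return [("unconditional", addr + operand)]
    if mnemonic == "CMP":
        # True: skip the next instruction; false: fall through to it
        return [("true", addr + 2 * instr_len),
                ("false", addr + instr_len)]
    # All other instructions fall through and add no branches
    return []
```

A CMP at address 40 therefore produces a true branch to 56 and a false branch to 48, matching the addr + 16 and addr + 8 targets above.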

get_instruction_text

Now we should tell Binary Ninja how to show the assembly code in the GUI. get_instruction_text returns a list of typed tokens. We could return nothing but plain text, but using the other token types helps the user operate Binary Ninja (for example, an address token can be double-clicked to jump there). We then append the tokens in the order in which they should appear in the GUI.

We write a helper function to save some typing:

def token(tokenType, text, data=None):

	# Define short names for InstructionTextTokenType
	tokenType = {
		'inst':InstructionTextTokenType.InstructionToken,
		'text':InstructionTextTokenType.TextToken,
		'addr':InstructionTextTokenType.PossibleAddressToken,
		'sep':InstructionTextTokenType.OperandSeparatorToken,
		'num':InstructionTextTokenType.IntegerToken
	}[tokenType]

	if data is None:
		return InstructionTextToken(
			tokenType, 
			text
		)

	return InstructionTextToken(
		tokenType, 
		text, 
		data
	)

get_instruction_text needs to return the list of tokens and the length of the instruction:

def get_instruction_text(self, data, addr):
    # If we can't decode an instruction, return None
    if len(data) < 8:
        return None

    instruction = data[:8]
    op_code = instruction[:3]
    src = int(instruction[3:])

    tokens = []

    # Parse op_code with src
    if op_code in [LDN, STO]:
        if op_code == LDN:
            tokens.append(token('inst', '{:7s}'.format('LDN')))
        elif op_code == STO:
            tokens.append(token('inst', '{:7s}'.format('STO')))
        tokens.append(token('text', '['))
        tokens.append(token('addr', hex(src)))
        tokens.append(token('text', ']'))

    elif op_code in SUB:
        tokens.append(token('inst', '{:7s}'.format('SUB')))
        tokens.append(token('num', hex(src)))

    elif op_code in [JMP, JRP]:
        if op_code == JMP:
            tokens.append(token('inst', '{:7s}'.format('JMP')))
            tokens.append(token('addr', hex(src), src))
        elif op_code == JRP:
            tokens.append(token('inst', '{:7s}'.format('JRP')))
            tokens.append(token('text', hex(addr)))
            tokens.append(token('text', ' + '))
            tokens.append(token('text', hex(src)))
    elif op_code == CMP:
        tokens.append(token('inst', '{:7s}'.format('CMP')))
        tokens.append(token('addr', hex(addr + 16), addr + 16))
    elif op_code == STP:
        tokens.append(token('inst', '{:7s}'.format('STP')))

    return tokens, 8

get_instruction_low_level_il

And finally, we should implement get_instruction_low_level_il. Unlike get_instruction_text, which only tells the disassembler how to display data, get_instruction_low_level_il describes the meaning of each instruction.

It’s like building an AST: we expand each instruction into LowLevelILExpr objects, append the final statement to the il object, and then return the size of the lifted instructions. Since we won’t merge any instructions to eliminate side effects, we can safely return 8:

def get_instruction_low_level_il(self, data, addr, il):
    if len(data) < 8:
        return None

    instruction = data[:8]
    op_code = instruction[:3]
    src = int(instruction[3:])

    if op_code in [LDN, STO]:
        # Use a separate name so the addr parameter is not shadowed
        mem = il.const_pointer(5, src)
        if op_code == LDN:
            # Load the value at the address, negate it,
            # and put it into register accu
            instru = il.set_reg(
                5,
                'accu',
                il.neg_expr(5, il.load(5, mem))
            )
            il.append(instru)

        elif op_code == STO:
            # Store register accu to the address
            instru = il.store(5, mem, il.reg(5, 'accu'))
            il.append(instru)

    elif op_code == CMP:
        # Create TrueBranch and FalseBranch
        f_target = il.get_label_for_address(
            Architecture['Manchester Baby'],
            addr + 8
        )
        t_target = il.get_label_for_address(
            Architecture['Manchester Baby'],
            addr + 16
        )
        # Skip the next instruction when accu is negative
        check_reg = il.compare_signed_less_than(
            5,
            il.reg(5, 'accu'),
            il.const(5, 0)
        )
        # if_expr selects the branch target
        il.append(il.if_expr(check_reg, t_target, f_target))

    elif op_code in SUB:
        # src is treated as an immediate value here
        val = il.const(5, src)
        # Subtract the value from the register and write back the result
        instru = il.set_reg(
            5,
            'accu',
            il.sub(5, il.reg(5, 'accu'), val)
        )
        il.append(instru)

    elif op_code == STP:
        # Stop the disassembler once STP is read
        il.append(il.no_ret())

    elif op_code == JMP:
        # An absolute jump
        il.append(il.jump(il.const_pointer(5, src)))

    elif op_code == JRP:
        # A relative jump
        il.append(il.jump(il.const_pointer(5, src + addr)))

    return 8
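When lifting, it helps to have a reference for what each instruction is supposed to do. The following is a tiny interpreter sketch for the accumulator instructions only, written against the specification table (so SUB and LDN read from memory) and the 8-byte instruction spacing used in this post. The step function and the dictionary-based state are made up for this example, and jumps and STP are deliberately left out.

```python
def step(state, mnemonic, operand):
    """Apply one instruction to `state` in place and return it.

    `state` is a dict with keys 'acc' (accumulator), 'pc' (address of
    the current instruction) and 'mem' (dict mapping address to value).
    """
    mem = state["mem"]
    next_pc = state["pc"] + 8
    if mnemonic == "LDN":
        # Load the negated memory value into the accumulator
        state["acc"] = -mem.get(operand, 0)
    elif mnemonic == "STO":
        # Store the accumulator to memory
        mem[operand] = state["acc"]
    elif mnemonic == "SUB":
        # Subtract the memory value from the accumulator
        state["acc"] -= mem.get(operand, 0)
    elif mnemonic == "CMP":
        # Skip the next instruction when the accumulator is negative
        if state["acc"] < 0:
            next_pc += 8
    state["pc"] = next_pc
    return state
```

Comparing the Low Level IL of each instruction with the matching branch here is a quick way to spot swapped loads and stores or an inverted condition.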

Done!

After appending Manch_Baby.register(), the plugin should work. You can find the full script in my gist. Here is a test program and its results:

1100000011000003100000161012345610165432011000080010000100100002
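As a sanity check outside Binary Ninja, the test string can be split into 8-character instructions and listed with plain Python. This is only a sketch; OPCODES, PROGRAM and listing are names made up for this example.

```python
OPCODES = {
    "000": "JMP", "100": "JRP", "010": "LDN", "110": "STO",
    "001": "SUB", "101": "SUB", "011": "CMP", "111": "STP",
}

PROGRAM = "1100000011000003100000161012345610165432011000080010000100100002"

def listing(program, instr_len=8):
    """Split `program` into (offset, mnemonic, operand) triples."""
    rows = []
    # Stop before any trailing partial instruction
    for offset in range(0, len(program) - instr_len + 1, instr_len):
        word = program[offset:offset + instr_len]
        rows.append((offset, OPCODES[word[:3]], int(word[3:])))
    return rows

for offset, mnemonic, operand in listing(PROGRAM):
    print("{:#04x}  {:7s}{}".format(offset, mnemonic, operand))
```

The listing should line up, instruction by instruction, with what the Linear Disassembly view shows once the architecture is selected.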

You need to switch the view from Hex Editor to Linear Disassembly, then press p to select the architecture.

Disassemble Graph:

Low Level IL Graph:

References