Melting Down PatchGuard: Leveraging KPTI to Bypass Kernel Patch Protection

The mitigation for Meltdown created a new part in the kernel which PatchGuard left unprotected, making hooking of system calls and interrupts possible, even with HVCI enabled.

Introduction

In early January 2018 a new class of vulnerabilities called speculative execution, was publicly disclosed. The vulnerabilities themselves, dubbed Spectre and Meltdown, stem from side-channels in the CPU itself. Although these vulnerabilities reside in hardware some mitigations can be implemented in software. One of the more complex mitigations, Kernel Page Table Isolation (KPTI) required significant changes to the OS kernel and due to its importance was even backported to previous OS versions.

In this blog post we’ll describe a new method to bypass Microsoft PatchGuard by leveraging KPTI to hook to system service calls and interrupts for user-mode.

Note - after developing a working PoC we contacted Microsoft Security Response Center to report our finding, along with feasible solutions. Microsoft confirmed the issue and introduced a fix in Windows 10 RS5 that followed our suggested resolution.

Some Context

Speculative execution is a trait of modern CPUs that is meant to boost performance. The CPU executes instructions internally out-of-order and ahead of time and in case they shouldn’t have been executed they are discarded before the final state of the CPU is affected. When instructions are discarded they impact the cache thus it’s possible to use it as a side channel by measuring access times to memory.

Meltdown (also referred to as Rogue Data Cache Load) works by leveraging a CPU race condition that can arise between instruction execution and privilege checking. It allows user-mode code to read kernel-space memory in a reliable and relatively fast manner.

KPTI Basics

Kernel Virtual Address (KVA) Shadow is Microsoft’s implementation of KPTI for Windows. Basically, the process virtual address space is now split to two separate page tables – one is used while code runs in the kernel and the second while running user-mode code. The kernel-mode page table maps the complete address space (both user and kernel) while the user-mode page table maps only the bare minimum of the kernel space which is needed to perform the transitions between ring 3 and ring 0.

The code for performing the transitions resides in ntoskrnl.exe and is placed in a specific section named KVASCODE. This section is mapped to the same virtual address on both user and kernel page tables along with stack space and a dedicated part at the end of Kernel Processor Control Block (KPRCB) structure so sensitive kernel information won’t leak.

The user address space is shared between the two page tables of each process. The minimal kernel address space mapped in the user page table is shared between different processes, and the same goes for the regular kernel address space. This is implemented by having different PML4 (the top-most level of the page-table) entries in different page tables point to the same PDP tables.

Image 1 meltdown blogFigure 1: Example of page tables with KVAS enabled

To maximize performance and reduce overhead, KVAS is enabled on less privileged applications as a fully privileged processes which can gain access to kernel memory by loading drivers.

There are additional changes to PTEs related to TLB management, but we won’t get into that as it’s nonessential to the scope of this post.

System service calls and interrupts

The transitions between ring 3 and ring 0 now goes through a new step, the minimal kernel space mapped in the user address table. We can think about this a bit like an indirect jump having been inserted to the transition process.

image 2 meltdown blogFigure 2: Stages of ring 3 to ring 0 transition with KVAS enabled

The new system call and interrupts transition code checks whether the user page table is used, and if so, switches to the kernel page table and resumes regular operations in the kernel. The reverse operation happens when transitioning back to ring 3.

image 3 meltdown blog

Figure 3: Disassembly of KVAS syscall entrypoint


Foreseen Side Effect

PatchGuard, or Kernel Patch Protection, is designed to protect the OS from tampering during run-time. Among the things it detects are patching of code in ntoskrnl, HAL and NDIS and modification of critical structures such as IDT and SSDT.

With the understanding that the first and last instructions of ring 3 to ring 0 transitions are no longer executed in the actual OS kernel but in a separate address space, we now have a new environment which runs code in kernel-mode and may be left unprotected.

Since the kernel isn’t mapped in the user page table, PatchGuard can’t operate during these transitions back and forth between kernel-mode and user-mode. To put it simply, the kernel address space in the user page table is invisible to PatchGuard so we can intercept and manipulate in ring 0 system service calls and interrupts that triggered while in user-mode.

Accessing the User Page Table

Before we go further we need to be able to access this new user page table. Since address translation is performed by the MMU with physical addresses, the page table of a different address space isn’t necessarily accessible in the virtual address space we are currently in.

While running in kernel-mode we can access physical memory. So, first we needed to find the physical address of the user page table. It can be done by obtaining it either from EPROCESS.Pcb.UserDirectoryTableBase or CR3 register in a thread’s user-mode context.

From here onward, PTE remapping and page table traversals are straightforward and result in us having full control over the kernel address space in the user page table. We can allocate memory by making PTEs valid and swap pages by changing PTEs PFN (Page Frame Number) field. In case read\write is required we can map pages in the same manner back to the kernel page table.

Hooking

As the logic of ring 0 entry points changed, hooking them needs to change respectively.

Memory Management

The pages of KVASCODE section are mapped to the same physical pages in all page tables, user and kernel. We start by changing the user page table so KVASCODE virtual addresses will be mapped to different physical pages. This frees the prolog of the function for our own use as it’s now redundant, there’s no need to check anymore which page table is in use. It will also prevent PatchGuard from detecting changes of code since no actual pages of ntoskrnl will be changed.

Code Execution

Next thing we need to tackle is how to actually run our code. KVASCODE section size is very limited and we can map additional pages in the user page table. This won’t allow us to use services provided by the OS. What’s possible instead is to switch to the kernel page table.

As our code is not mapped in the user page table we can’t jump to it directly before switching page tables. Simply switching page tables will result just in returning to the original KVASCODE pages which aren’t under our control. What’s possible to do at this point is to utilize ROP.

In the pages we remapped in the user page table we’ll create a CR3 assignment gadget with a ret instruction overlapping in both virtual address spaces. This way even after changing the page table the ret instruction will be executed. Following that, we’ll change the prolog to push the address of our driver and jump to the CR3 assignment gadget we created.

 

image 4 meltdown blogFigure 4: Disassembly of the original and hooked nt!KiSystemCall64Shadow

Handling VBS

VBS doesn’t render the technique impossible but does introduce some complications.

With HyperVisor Code Integrity (HVCI) it's no longer possible to run unsigned code in the kernel. HVCI prevents creating the hooks dynamically so it’s necessary to have them compiled in the driver. This creates a significant difficulty because it means we have to know in advance the locations of the functions in every page. Since these functions don’t tend to change much and the size of the KVASCODE section doesn't really change (and includes a large cave), it's not totally impossible. In case offsets of data structures or values that are determined only on runtime are required they can be stored in a data page which we'll map in the user page table in a fixed offset adjacent to the KVASCODE pages.

Another thing to pay attention to is that on a machine with VBS enabled, the root partition reverts back to using the int 2e instruction instead of the syscall\sysenter instructions.

Mitigation

The solution is quite simple – PatchGuard should check that the PFNs of PTEs for KVASCODE pages in both kernel and user page tables are the same. This will ensure that the code in the minimal kernel address space matches the code in the actual kernel address space. Seeing that the code in the kernel is validated by PatchGuard, it ensures no tampering occured.

Summary

Major architectural changes to modern operating systems introduce new complexities and new avenues for attackers and researchers alike. Researchers can utilize this new method to monitor system calls made by user-mode processes without depending on virtualization and other hardware capabilities, or messing with OS security mechanisms. Due to KVAS performance optimization, coverage can be applied to less privileged applications which usually are more likely to pose security risks.

This research was conducted on Windows 10 RS4 x64. However, since KVAS has been backported to all supported Windows versions it is also applicable to Windows 7, 8.1 and corresponding server versions.

Although this post focuses on Windows the technique we detailed is relevant to other major operating systems that use KPTI, such as Linux and macOS.

Disclosure timeline

  • 29/3/2018: Report to MSRC
  • 18/4/2018: Confirmation from MSRC
  • 25/4/2018: Release of Windows RS5 insider preview build 17655 with the fix as we suggested.
Related Blog Posts