Side-Channel Detection Attacks Against Unauthorized Hypervisors

Thomas Ptacek | August 20th, 2007

1.

A refresher.

Here’s an IP address:

1.png

Here’s the same IP address, in integer form, as it would appear in an IP packet:

2.png
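If you'd rather see that conversion as code, here is a minimal sketch in C. The figures aren't reproduced here, so the address used in the usage note (192.0.2.1) is a stand-in, and `pack_ipv4` is a name made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack the four octets of a dotted-quad address into one 32-bit integer.
 * The first octet lands in the most significant byte, which is the order
 * the bytes appear in on the wire. */
uint32_t pack_ipv4(const uint8_t octets[4])
{
    assert(octets != NULL);
    return ((uint32_t)octets[0] << 24) |
           ((uint32_t)octets[1] << 16) |
           ((uint32_t)octets[2] << 8)  |
            (uint32_t)octets[3];
}
```

Feeding it the octets 192, 0, 2, 1 gives 0xC0000201: same address, different spelling.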

Got that? Then you’ll have no problem with protected memory, which is how your machine can simultaneously run a web server and a web browser.

Here’s an address in your browser’s memory, in integer form:

3.png

Your webserver on the same machine has data at the same address (it points towards the bottom of the stack). How’s that work?

Well, here’s the same address, with the fields broken out:

4.png

Like the IP address, these representations mean the same thing. A virtual address has 3 parts: a page directory index, a page table index, and a page offset. The first two parts form a “virtual page number (VPN)”. When a program in protected mode tries to access data in memory, the memory management hardware uses the VPN to find the unshared real honest physical memory to read from:

6.png

You’ll see variants of this diagram all over the place. Here’s how to read mine, which is simplified:

  1. Start with CR3, which is a register: a value stored directly in the CPU. CR3 stores the Page Directory Base Address. Think of the Page Directory Base as the hardware’s equivalent of your process’s PID.

  2. Take the Page Directory Index from the virtual address you’re trying to read from. Offset that many entries into the Page Directory from CR3. That’s the Page Table for the address.

  3. Take the Page Table Index from the address. Offset that many entries from the Page Table you got from step 2. There’s the physical memory page the address corresponds to.

  4. Finally, take the Page Offset from the address, and step that many bytes into the page you found from step 3.
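The four steps above can be sketched as a software model in C. This is a toy under stated assumptions: it uses the classic 32-bit 10/10/12 split, entries hold bare frame numbers (real entries also carry flag bits), a NULL directory slot stands for "no page table here", and the type and function names (`page_directory`, `page_table`, `translate`) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t frame[1024];           /* physical frame number per entry */
} page_table;

typedef struct {
    const page_table *table[1024];  /* NULL = no page table for this slot */
} page_directory;

/* The three fields of a 32-bit virtual address (10 + 10 + 12 bits). */
static uint32_t pd_index(uint32_t va)    { return va >> 22; }
static uint32_t pt_index(uint32_t va)    { return (va >> 12) & 0x3FFu; }
static uint32_t page_offset(uint32_t va) { return va & 0xFFFu; }

/* Walk the hierarchy rooted at cr3. Returns 1 and stores the physical
 * address in *pa on success, or 0 if the address has no page table. */
int translate(const page_directory *cr3, uint32_t va, uint32_t *pa)
{
    assert(cr3 != NULL && pa != NULL);
    const page_table *pt = cr3->table[pd_index(va)];  /* step 2 */
    if (pt == NULL)
        return 0;
    uint32_t frame = pt->frame[pt_index(va)];         /* step 3 */
    *pa = (frame << 12) | page_offset(va);            /* step 4 */
    return 1;
}
```

Hand `translate` two different directories and the same virtual address, and you get two different physical addresses back.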

Two different processes have two different CR3 values, and so have two different page table hierarchies. So the same virtual address in your web browser and web server points to two different values.

7.png

By the way, if the Page Directory step seems complicated, consider: most processes use a tiny fraction of the entire 4 gigabyte virtual address space. Each page table describes 4 megs of that space. Without the page directory, every process would need all 1024 page tables, each with 1024 entries. With it, you only need page tables for the address space you actually use, plus 1 more page for the page directory itself.
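To put numbers on that, here is a small C sketch of the arithmetic. It assumes 4 KiB pages and 1024-entry tables as described above; `page_tables_needed` is a hypothetical helper, and the addresses in the usage note are only illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE          4096u
#define ENTRIES_PER_TABLE  1024u

/* Address space described by one page table: 1024 entries * 4 KiB = 4 MiB. */
uint64_t bytes_per_page_table(void)
{
    return (uint64_t)ENTRIES_PER_TABLE * PAGE_SIZE;
}

/* Page tables needed to map the range [start, start + len)
 * in a 32-bit address space. */
uint32_t page_tables_needed(uint32_t start, uint64_t len)
{
    assert(len > 0 && (uint64_t)start + len <= (1ULL << 32));
    uint64_t first = start / bytes_per_page_table();
    uint64_t last  = ((uint64_t)start + len - 1) / bytes_per_page_table();
    return (uint32_t)(last - first + 1);
}
```

Mapping the whole 4 gigabytes takes 1024 tables. A process with 8 megs mapped at 0x08048000 and a single stack page at 0xBFFFF000 gets by with 3 + 1 tables, plus the directory.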

2.

All good, right?

Not kablamo!

See, there’s something called a memory hierarchy:

8.png

Your goal as a modern computer system is to stay as close to Oscar the Register as possible. Your goal as a modern computer system is to stay the hell away from Ernie the DRAM cell, as much as possible. Ernie is slow. That’s what Cache Monster is for.

But the page tables are stored in DRAM. Uncached, page translations double the number of times you hit DRAM. Ow. So of course address translation is cached. The cache is called the TLB:

9.png

Address space use in normal programs is very predictable, with lots of locality. Most address translations run out of the TLB. Your CPU’s TLB design is important.

3.

Let’s see how to exploit the TLB cache to detect virtualization.

Consider your CPU in steady state. The TLB cache mirrors the page tables (there are more page table entries than TLB entries, which are reclaimed as needed; obviously, there are more physical pages than PTEs).

10.png

Now, saturate the TLB. Allocate a big block of memory, which has the side effect of filling in a bunch of PTEs. Allocate another page of memory. Color the page with one value, and the big block of pages with another. As above, the TLB and the page tables will reflect each other, and the TLB has a fixed size, so if you grab enough memory, all the TLB entries will point to that block.

11.png

Here’s the fun part: desync the TLB from the page hierarchy.

There’s no magic that synchronizes the X86 TLB with the page tables. If a TLB entry and a PTE are in sync, and you modify the PTE without updating the TLB, memory accesses will reflect the “stale” TLB value, not the “current” PTE value.

When you desync the TLB and the page tables, there are a few things you can do to sync them back up. You can write to CR3 (as if switching to another process), which flushes the TLB. Or you can issue an “invlpg” instruction to clear an individual page.

Or you can do neither, and instead deliberately wire all the PTEs for the big block of memory you allocated to the dummy page:

12.png

Turn off interrupts and preemption (and in all other respects halt the running OS kernel) and then do this.

Your CPU is now in an interesting state.

As long as you don’t. touch. anything. else, memory accesses to the big block of memory will behave like they did before you desynced the TLB (you’ll read values out of the big block of memory).

But breathe on the memory hierarchy the wrong way right now, and that will change. An entry will get evicted from the TLB, and the next access to the address that was cached in that TLB entry will get translated out of the PTE, which now points somewhere else.

So, the trick is simple:

  1. Saturate the TLB.

  2. Desync it.

  3. “Do something”

  4. See if a TLB entry was lost by reading from the block of memory, from each page, and seeing if you get the PTE’s version or the TLB’s.

For example:

13.png

A no-op instruction won’t change anything. Neither will zeroing a register. But access a random address outside the big block, and that address will need to get translated; it will miss the TLB, and the resulting translation from the page table will get cached, evicting one of our “big block” entries.
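Here is the whole trick as a software model in C. It does not touch a real TLB: the 8-entry TLB, the FIFO eviction policy, the identity page mapping, and every name in it are assumptions made for this sketch, and real replacement policies differ.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_SLOTS   8     /* toy size; real TLBs are bigger */
#define NUM_PAGES   16    /* pages in the toy address space */
#define BLOCK_COLOR 0xFF
#define DUMMY_COLOR 0x08

typedef struct {
    int     tlb_vpn[TLB_SLOTS];    /* cached virtual page, -1 = empty */
    int     tlb_frame[TLB_SLOTS];  /* cached translation for that page */
    int     next;                  /* FIFO eviction cursor (assumed policy) */
    int     pte[NUM_PAGES];        /* the "page table": vpn -> frame */
    uint8_t color[NUM_PAGES];      /* one byte of "physical memory" per frame */
} machine;

/* Read through the TLB; on a miss, walk the page table and cache the result. */
static uint8_t read_page(machine *m, int vpn)
{
    assert(vpn >= 0 && vpn < NUM_PAGES);
    for (int i = 0; i < TLB_SLOTS; i++)
        if (m->tlb_vpn[i] == vpn)
            return m->color[m->tlb_frame[i]];     /* hit, stale or not */
    int slot = m->next;                           /* miss: evict the oldest */
    m->next = (m->next + 1) % TLB_SLOTS;
    m->tlb_vpn[slot] = vpn;
    m->tlb_frame[slot] = m->pte[vpn];
    return m->color[m->pte[vpn]];
}

/* Returns how many big-block pages read back the dummy page's color. */
int count_lost_entries(int touch_outside)
{
    machine m;
    const int dummy = TLB_SLOTS, outside = TLB_SLOTS + 1;

    m.next = 0;
    for (int i = 0; i < TLB_SLOTS; i++)
        m.tlb_vpn[i] = -1;
    for (int i = 0; i < NUM_PAGES; i++) {
        m.pte[i] = i;                             /* identity mapping */
        m.color[i] = (i < TLB_SLOTS) ? BLOCK_COLOR : 0;
    }
    m.color[dummy] = DUMMY_COLOR;

    for (int i = 0; i < TLB_SLOTS; i++)           /* 1. saturate the TLB */
        (void)read_page(&m, i);
    for (int i = 0; i < TLB_SLOTS; i++)           /* 2. desync: no flush */
        m.pte[i] = m.pte[dummy];
    if (touch_outside)                            /* 3. "do something" */
        (void)read_page(&m, outside);

    int lost = 0;                                 /* 4. probe, newest first */
    for (int i = TLB_SLOTS - 1; i >= 0; i--)
        if (read_page(&m, i) == DUMMY_COLOR)
            lost++;
    return lost;
}
```

In this model `count_lost_entries(0)` reports nothing lost, and `count_lost_entries(1)` reports exactly one page reading the dummy color. The probe runs newest-first because, with FIFO eviction, a probe that misses loads a fresh entry and would otherwise push out a page that hasn't been probed yet.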

Now the bit about virtualization.

The CPUID instruction retrieves info about the CPU directly from the chip. Issuing a CPUID instruction shouldn’t touch memory.

But on Intel chips, if you’re in a VT-x guest virtual machine, CPUID causes a “VM exit”, which is a trap to the hypervisor. The hypervisor has to emulate the CPUID instruction to the guest machine on behalf of the actual hardware. The hypervisor is just code, probably written in C, just like the kernel. And at a minimum, it has to touch memory to figure out what kind of trap it is handling.

In the original Intel VT-x implementation, a VM exit flushed the whole TLB, just like a CR3 write does in a process switch. So that’s pretty noticeable. On AMD SVM, where Blue Pill runs, there are ASIDs that tag the TLB, so not every entry is flushed:

14.png

But the hypervisor still has to touch memory to figure out what kind of trap it’s handling, which evicts a TLB entry. When control is handed back to the VM, you’ll see this as an offset into the big block of memory that reads the wrong value.

4.

A neat twist:

There’s a separate TLB for instructions (the ITLB) and for data (the DTLB). Instruction execution implies virtual memory reads, to fetch the instructions.

Here’s a very, very short subroutine:

{ 0xb8, 0xff, 0x00, 0x00, 0x00, 0xc3 };

This is:

mov eax, 0xff
ret

Or, somewhat equivalently:

int return_FFh(void) { return(0xff); }

Change the 0xff byte to 1, and you’ve got “return_01h()”. And so on.
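A minimal C sketch of stamping out those variants follows. It only builds the six bytes; actually calling them would mean copying them into executable memory, which is left out here. `stub` and `make_return_stub` are names invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define STUB_LEN 6

typedef struct {
    uint8_t bytes[STUB_LEN];
} stub;

/* Build the machine code for "mov eax, imm32 ; ret" with imm32 = value. */
stub make_return_stub(unsigned value)
{
    assert(value <= 0xFFu);   /* only the low byte of imm32 is patched */
    stub s = { { 0xb8, 0xff, 0x00, 0x00, 0x00, 0xc3 } };
    s.bytes[1] = (uint8_t)value;   /* imm32 is little-endian: low byte first */
    return s;
}
```

`make_return_stub(0x01)` gives the bytes for “return_01h()”; `make_return_stub(0xff)` gives back the original subroutine.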

So repeat the TLB desync trick, but instead of probing the cache with reads, probe it with subroutine calls. Your big block of memory is effectively filled with “return_FFh()”; your test page is effectively “return_08h()”.

Now, when you do something that causes a VM exit, the hypervisor’s own code execution evicts ITLB entries. Before the VM exit, every call into the big block returns 0xFF. After it, one or more of them will return 8 instead.

5.

Another neat twist:

As you write to DRAM, you fill lines in the data cache. A write to memory doesn’t instantly update DRAM. Dirty locations modified by code are written back to memory as cache lines are evicted.

But the X86 has an instruction, “invd”, which clears out the data caches without necessarily flushing the cached values back to main memory.

This suggests another variation of the “saturate-and-probe” trick:

  1. Allocate a big block of memory. Color it 0xFF.

  2. Saturate the cache, queuing up enough writes to fill it.

  3. “Do something”.

  4. Issue “invd” to clear out the cache.

  5. Read the block of memory and see if any of your queued-up writes “leaked” to main memory.

Again, hypervisor memory accesses will evict cache entries; the side effect here is that a memory write we expected to throw away will get burned into memory.
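As with the TLB, this can be sketched as a software model in C. Everything here is an assumption made for illustration: an 8-line write-back cache with FIFO eviction, one byte per line, and an `invd` that always discards cleanly. Real hardware is messier.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINES 8     /* toy size */
#define MEM_CELLS   16    /* one byte of "DRAM" per address */

typedef struct {
    int     addr[CACHE_LINES];   /* cached address, -1 = empty line */
    uint8_t data[CACHE_LINES];
    int     dirty[CACHE_LINES];
    int     next;                /* FIFO eviction cursor (assumed policy) */
    uint8_t dram[MEM_CELLS];
} cache_model;

/* Write-back cache: a write lands in the cache, and only reaches DRAM
 * when its dirty line is evicted to make room for another address. */
static void cache_write(cache_model *c, int addr, uint8_t value)
{
    assert(addr >= 0 && addr < MEM_CELLS);
    int slot = -1;
    for (int i = 0; i < CACHE_LINES; i++)
        if (c->addr[i] == addr)
            slot = i;
    if (slot < 0) {
        slot = c->next;
        c->next = (c->next + 1) % CACHE_LINES;
        if (c->addr[slot] >= 0 && c->dirty[slot])
            c->dram[c->addr[slot]] = c->data[slot];   /* the "leak" */
    }
    c->addr[slot] = addr;
    c->data[slot] = value;
    c->dirty[slot] = 1;
}

/* Model of "invd": drop every line without writing anything back. */
static void cache_invd(cache_model *c)
{
    for (int i = 0; i < CACHE_LINES; i++) {
        c->addr[i] = -1;
        c->dirty[i] = 0;
    }
}

/* Returns how many queued writes reached DRAM despite the invd. */
int count_leaked_writes(int touch_outside)
{
    cache_model c;
    c.next = 0;
    cache_invd(&c);                               /* start with an empty cache */
    for (int i = 0; i < MEM_CELLS; i++)           /* 1. color memory 0xFF */
        c.dram[i] = 0xFF;
    for (int i = 0; i < CACHE_LINES; i++)         /* 2. saturate the cache */
        cache_write(&c, i, 0x00);
    if (touch_outside)                            /* 3. "do something" */
        cache_write(&c, CACHE_LINES, 0x00);
    cache_invd(&c);                               /* 4. throw the cache away */

    int leaked = 0;                               /* 5. look for leaks */
    for (int i = 0; i < CACHE_LINES; i++)
        if (c.dram[i] != 0xFF)
            leaked++;
    return leaked;
}
```

With nothing in step 3, `count_leaked_writes(0)` finds the block still all 0xFF. With one outside access, `count_leaked_writes(1)` finds a single write that made it to memory.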

Two caveats here:

  1. I haven’t tested this. I don’t think Keith Adams has tested it. The TLB desync trick turned out trickier than we expected it to be when we wrote it (for instance, you have to iterate over the pages backwards).

  2. According to People Who Would Know, Intel hardware doesn’t promise to honor “invd”: writes could have leaked to main memory even if you told the CPU to chuck them.

6.

Of course, you don’t need to be this tricky with the caches to detect the footprints that a hypervisor leaves in them. Evicting entries from a cache will influence the timings of subsequent instructions.

So even if “invd” doesn’t let you get your CPU to a state where “cpuid” changes the result of a memory read, you can still monitor cache timing, using any of the local timers, to detect unexpected cache evictions. Cryptanalysts have done it to steal RSA and AES keys, and they don’t even have the OS cooperating with them.
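A bare-bones timing harness might look like the C sketch below. Treat it as a shape, not a detector: standard `clock()` is far too coarse for real cache timing (you would want a cycle counter), the 64-byte stride is an assumed cache-line size, and `time_sweeps` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Time `passes` sweeps over buf, touching one byte per 64-byte stride.
 * Returns elapsed CPU time in seconds. */
double time_sweeps(const volatile uint8_t *buf, size_t len, int passes)
{
    assert(buf != NULL && len > 0 && passes > 0);
    uint8_t sink = 0;
    clock_t start = clock();
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < len; i += 64)
            sink ^= buf[i];          /* volatile read: not optimized away */
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

The idea is to time a sweep over a cache-resident buffer, “do something”, and time the sweep again; a sweep that comes back consistently slower suggests something evicted your lines in between.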

7.

A word about credit for these ideas: none of it should go to me. At the same time as Peter Ferrie, a member of our team, published the first paper mentioning the TLB attack, a team led by Tal Garfinkel was working on a paper independently documenting the same attack.
Tal’s research partner Keith Adams wrote a blog post, which was the first public mention of the TLB desync technique, and almost certainly the origin of the “invd” idea.

And apparently some guy at McAfee has had the TLB idea for over a year, although it looks like he got so excited about it during Joanna’s talk that he wrote it down on a napkin for the first time. (Note to McAfee guy: your notes from the show are cool and all, but I think the slides from our talk are easier to read.)

7 Comments so far

  • yakov

    August 20th, 2007 9:53 am

    Excellent explanation Thomas. Still I don’t see how it detects only unauthorized hypervisors. Won’t legitimate use case of software running in a VM cause false positive?

  • Thomas Ptacek

    August 20th, 2007 10:52 am

    If run from within ring 0 of a guest operating system, of course; it will simply detect the fact that the guest is in fact a guest.

    But if run from within ring 0 of the host (“ring -1”, as it were), it spots unexpected virtualization: a “smoking gun” when the hypervisor is itself not expected to be virtualized.

  • Matt

    August 20th, 2007 3:33 pm

    The first link (“all over the place”) is not kablamo. Also, the combinatoric cognitive dissonance of the HSAS, Sesame Street, and the memory hierarchy nearly made my head explode this morning.

  • Andrew

    August 20th, 2007 10:09 pm

    Your memory hierarchy diagram is beautiful.

  • Alfred Huger

    August 23rd, 2007 2:48 pm

    Wow, that is a fantastic writeup Tom, thanks.

  • TK

    September 4th, 2007 10:29 am

    Can we add a pwnie category for best blog post of the year? Anything describing security with sesame street chars is sure to become an instant classic… :)

  • Ivanlef0u’s Blog » TLBs are your friends

    February 3rd, 2008 7:31 pm

    […] known for their use in VM detection; the idea can be summed up by the post from Thomas Ptacek of Matasano. The goal is to desynchronize the TLBs and the PTEs, that is, to leave […]
