Rafal Wojtczuk’s User-Mode Single Stepping: 100x Faster Than Debuggers
Thomas Ptacek | November 30th, 2006 | Filed Under: Bitching About Protocols, Reversing, Uncategorized
‘Tis the season, apparently. Cool new project and excellent blog post from libnids author and Eastern European reversing psychopath Rafal Wojtczuk, now at MCAF’s AVERT labs. He’s announced UMSS, the User Mode Single-Stepper, a tool for tracing the execution of Win32 binaries.
Refresher: single-stepping stops a program after each individual CPU instruction, usually to record what it did. It’s usually done with a debugger; on Intel, you do it by setting the “trap flag”, which tells the CPU to raise an exception after every instruction.
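For concreteness: the trap flag is bit 8 of the x86 EFLAGS register (mask 0x100). The Python sketch below only shows the bit arithmetic on a saved flags value. The helper names are made up for illustration, and nothing here touches a real thread context.

```python
# TF (trap flag) is bit 8 of the x86 EFLAGS register.
TRAP_FLAG = 1 << 8  # 0x100


def set_trap_flag(eflags: int) -> int:
    """Return the flags value with single-stepping requested."""
    return eflags | TRAP_FLAG


def clear_trap_flag(eflags: int) -> int:
    """Return the flags value with single-stepping turned off."""
    return eflags & ~TRAP_FLAG


def is_single_stepping(eflags: int) -> bool:
    """True if the CPU would trap after the next instruction."""
    return bool(eflags & TRAP_FLAG)
```

A debugger would read the thread’s saved flags, apply something like `set_trap_flag`, write them back, and resume the thread; that last part is the expensive bit.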

The problem here is, each instruction traps to the kernel, which then transfers control to another process (the debugger), which then calls back into the kernel to find out what happened. A single user/kernel (u/k) transition is expensive: network programmers, who execute thousands of instructions between I/O operations, still try to minimize the number of transitions they make. Debugger single-stepping involves multiple u/k context switches per instruction. It’s just nightmarishly slow.
Rafal’s project speeds this up by two orders of magnitude by single-stepping entirely in userland. He does it by continuously rewriting the “next” instruction on the fly so that it transfers control to a handler function.
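Here is a toy model of that patch-and-restore cycle, in Python. It illustrates the idea as described in this post and is not UMSS’s actual code: “memory” is a dict of address to instruction text, the `HOOK` string stands in for the patched-in jump, the names are hypothetical, and only straight-line code is handled.

```python
HOOK = "jmp handler"  # stand-in for the control transfer patched over an instruction


def trace(memory, start, count):
    """Log up to `count` straight-line instructions, hooking the next one each time."""
    memory = dict(memory)   # work on a copy of the toy "address space"
    addrs = sorted(memory)  # instruction start addresses, in order
    log = []
    ip = start
    saved = None            # (address, original instruction) currently overwritten
    for _ in range(count):
        # Handler entry: put back the instruction we overwrote to get here.
        if saved is not None:
            addr, original = saved
            memory[addr] = original
        log.append((ip, memory[ip]))
        # Locate the following instruction and overwrite it with the hook.
        idx = addrs.index(ip) + 1
        if idx == len(addrs):
            break           # ran off the end of the toy program
        nxt = addrs[idx]
        saved = (nxt, memory[nxt])
        memory[nxt] = HOOK
        ip = nxt            # running the instruction at ip falls into the hook at nxt
    return log
```

The point of the model is that the log only ever contains original instructions, never the hook, and that no kernel transition is needed between steps.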

This is similar to what Detours does in that Rafal is swapping out instructions with handler jumps. But Detours only instruments the prologues of each function. UMSS instruments every instruction, on the fly. This is tricky, because to do that for each instruction, you have to know where the next instruction is. It’s not always “the next instruction in memory”, because of jumps. It’s not always “the target of a jump”, because jumps are conditional. It’s not always even possible to look at an instruction and know the jump target, because jumps can be indirected through registers.
UMSS solves this problem in two ways:
It uses an embedded disassembler to decode jumps with static targets, and peeks at the condition flags to figure out whether jumps will be taken.
It has a simple and clever heuristic for indirected jumps: just switch back to kernel-assisted debugging for that instruction. The overwhelming majority of the instruction stream doesn’t need it, so you still get the huge speedup.
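The two cases above can be sketched as a single “where does execution go next?” function. This is a minimal Python sketch under stated assumptions, not UMSS’s implementation: the instruction is a pre-decoded dict with hypothetical fields (`kind`, `addr`, `length`, `target`), only `jz`/`jnz` are modeled, and returning `None` means “fall back to a trap-flag single step”. The one hard fact used is that ZF is bit 6 of EFLAGS.

```python
ZF = 1 << 6  # zero flag, bit 6 of the x86 EFLAGS register


def next_address(insn, eflags):
    """Predict the next instruction address, or None if it can't be known statically."""
    kind = insn["kind"]
    fallthrough = insn["addr"] + insn["length"]
    if kind == "plain":          # ordinary instruction: next one in memory
        return fallthrough
    if kind == "jmp":            # unconditional jump with a static target
        return insn["target"]
    if kind == "jz":             # taken only when ZF is set
        return insn["target"] if eflags & ZF else fallthrough
    if kind == "jnz":            # taken only when ZF is clear
        return fallthrough if eflags & ZF else insn["target"]
    if kind == "jmp_indirect":   # target lives in a register or memory
        return None              # caller falls back to kernel-assisted stepping
    raise ValueError(f"unknown instruction kind: {kind}")
```

For example, a `jnz` at 0x1000 that is 2 bytes long and targets 0x2000 resolves to 0x2000 when ZF is clear and to 0x1002 when ZF is set.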
Why is this stuff important? To be honest, I don’t know. The “state of the art” in tracing programs right now is in instrumenting basic blocks, which are the ~10-20 instruction chunks that functions are composed of. For reversing purposes, this level of detail is usually more than enough. Clearly for malware research, where code is deliberately designed to be unclear, instruction-by-instruction detail is critical. I’d love for someone to tell me how I could exploit fast single-stepping to get a different project done.
The bigger story is the apparent renaissance we’re experiencing in binary program manipulation. 7 years ago, technology like Detours, PaiMei, and UMSS would have been the closely-guarded crown jewels of security companies. Now they’re free side-projects.


Halvar
November 30th, 2006 1:15 pm
The question is the value for single stepping malicious code in a situation like:
call $+5
pop ebp
mov eax, [ebp+5]
cmp eax, 0xBADDEED
jnz you’re tracing me
High speed is good of course, but single-steps
are usually only needed on truly nasty code,
and in truly nasty code heavy modification of
the target address space should be avoided.
Ryan Russell
November 30th, 2006 7:57 pm
Is it fast enough that I can still accidentally have it tracing through the message handler loop, and not kiss my process goodbye? ’Cause I can’t seem to do that in Paimei.
Thomas Ptacek
December 1st, 2006 1:18 am
PaiMei shouldn’t hit any breakpoint more than once in a run (at least in stalker mode), so a tight loop shouldn’t be a problem.
Ryan Russell
December 1st, 2006 3:38 am
Yeah, that’s with Restore Breakpoints on. I don’t like functions being left out of the chain I’m trying to follow, if I can help it. I need to look into how hard it will be to have a little more fine-grain control over which breakpoints you track. Obviously, it can be done since you can take one sample and exclude it, so it should just be some UI to twiddle those sets.
Jason Haley
December 1st, 2006 11:24 am
Interesting Finds: Week after Thanksgiving 4
jeremy
December 4th, 2006 10:03 pm
psst.. your last Detours link in the post points to the UMSS link.
Thomas Ptacek
December 5th, 2006 12:55 pm
Thanks, Jeremy.
arkon
July 22nd, 2008 2:45 pm
Halvar,
in a way you’re right, but if they implemented their tracer well, then ‘call $+5’ will result in the ‘real’ eip and thus your trick won’t work.