ARMv7/Thumb2 Inline Code Hooking


At Hackito Ergo Sum 2012, I presented on the exploitation of the RenderArena allocator in WebKit (PDF), with a focus on the Android mobile platform. Since one of the techniques for hijacking a vtable (and subsequently achieving code execution) requires careful heap massaging, we developed an internal tool that hooks the various heap allocation functions inline and logs all allocations and frees in memory with as little overhead as possible. Because the gist of the talk was the reliable exploitation of this specific bug class, I did not go too deep into how we built this tool. Since some people asked about its internals, the basic ideas are presented here.

The general idea was to log all heap allocations and deallocations while maintaining allocation order in a multi-threaded environment (such as the Android Browser), introducing as little per-allocation overhead as possible. Since using a debugger to set a breakpoint on the respective heap functions incurs too much overhead, I decided on a different approach:

To log every allocation in a timely manner, each such function is hot-patched at runtime with a non-intrusive call to a helper function that logs information about the caller to a memory buffer. Only when this memory buffer is full does it escape to the analysis software, which flushes the buffer over the network to an analyst's computer. This method adds so little overhead to memory allocation that the program under analysis remains interactively usable, i.e. we can still use the Browser normally. Of course, this approach extends to hot-patching functions other than the normal system heap allocator and, in fact, for the talk we also instrumented a special WebKit sub-allocator (please refer to the slides for more information).
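The buffer-then-flush idea can be modelled in a few lines of Python. This is only a sketch, not the tool's actual code: the four-word record layout, the `SampleBuffer` class and the `sink` callback are assumptions made for illustration, and the real logger is native ARM code.

```python
import struct

# Hypothetical record layout: tag, thread id, caller address, sampled value
# (four unsigned 32-bit little-endian words).
RECORD = struct.Struct("<IIII")


class SampleBuffer:
    """Fixed-size in-memory log that is handed to a sink only when full."""

    def __init__(self, capacity_records, sink):
        self._buf = bytearray(capacity_records * RECORD.size)
        self._used = 0        # bytes currently occupied
        self._sink = sink     # called with the buffered bytes on flush

    def log(self, tag, thread_id, caller, value):
        # Fast path: one fixed-size store into preallocated memory.
        RECORD.pack_into(self._buf, self._used, tag, thread_id, caller, value)
        self._used += RECORD.size
        if self._used == len(self._buf):
            self.flush()      # slow path, taken once per full buffer

    def flush(self):
        if self._used:
            self._sink(bytes(self._buf[:self._used]))
            self._used = 0


def decode(blob):
    """Split a flushed blob back into (tag, thread_id, caller, value) tuples."""
    return [RECORD.unpack_from(blob, off)
            for off in range(0, len(blob), RECORD.size)]
```

The expensive step (here the `sink` call, in the tool a network transfer) is paid once per full buffer rather than once per allocation, which is what keeps the per-allocation cost low.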

The code to log information is generic in the sense that we can sample one arbitrary register at any point to be hooked. This is usually sufficient for capturing function parameters and return values at the beginning or end of certain functions, or at arbitrary points in between. If you know a little bit more about the function you are looking at, you can even sample arbitrary values (because ARM is a RISC architecture, which requires any value to be processed to be loaded into a register at some point). The native code to log a single collected register sample looks like this:

Native Code to Log a Single Sample

 

Besides logging the desired register value (passed in R0) and the calling function, it logs a user-defined tag (useful for distinguishing multiple hooked locations) and the current thread ID (obtained from the TLS, which in turn is located by a call to a magic location defined by the EABI). If your ARM assembly is a little bit rusty, here is IDA's corresponding decompilation (note that ldrex and strex are the exclusive, race-condition-safe memory load and store, respectively; the strex instruction fails if the memory location has been written to since the corresponding ldrex, hence the surrounding loop):

Decompilation of Log Code

Note that the original code was written in hand-crafted assembly, but with sufficient manually added type and pseudo-calling-convention information, IDA does a good job of reconstructing equivalent C code.
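For readers unfamiliar with exclusive loads and stores, the retry loop can be illustrated with a toy Python model. Everything here is an assumption for illustration: `ExclusiveCell` merely imitates an exclusive monitor, and using the loop to claim the next free slot in the log buffer is one plausible use of such a loop, not something confirmed by the code shown in this post.

```python
class ExclusiveCell:
    """Toy model of one memory word guarded by an exclusive monitor."""

    def __init__(self, value=0):
        self.value = value
        self._version = 0      # bumped on every successful store
        self._reserved = {}    # observer id -> version seen at load time

    def load_exclusive(self, who):
        # Models ldrex: read the value and remember the reservation.
        self._reserved[who] = self._version
        return self.value

    def store_exclusive(self, who, new_value):
        # Models strex: succeed only if nobody stored since our load.
        ok = self._reserved.pop(who, None) == self._version
        if ok:
            self.value = new_value
            self._version += 1
        return ok


def reserve_slot(cell, who, record_size, interfere=None):
    """Advance the write offset by record_size and return the old offset."""
    while True:
        old = cell.load_exclusive(who)
        if interfere is not None:
            interfere()        # test hook: another thread runs right here
            interfere = None
        if cell.store_exclusive(who, old + record_size):
            return old         # this caller now owns [old, old + record_size)
```

If another thread updates the offset between the load and the store, the store fails and the loop simply retries with the fresh value, so no two threads ever receive the same slot and no lock is needed.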

Unfortunately for us, there is typically no slack space in real-world code that would allow us to insert simple branches to our logging functions. Inline hooking therefore typically overwrites the code at the location to be probed with a branch to a trampoline. This trampoline then needs to compensate for the overwritten instructions, so it usually consists of:

  1. A call to the desired hook (the logging function in our case) with potentially required state saving and restoring to preserve the expected state of the original code
  2. Semantically equivalent code to the code overwritten, often just a potentially fixed up copy of the original instructions
  3. A branch to the code following the overwritten instructions to continue the original code

Two such generated trampolines are depicted below. All this code is generated at runtime by disassembling the original code, determining the necessary fix-up steps and, of course, generating code to sample the desired register by copying it into R0 for the log function. The instructions in cyan are the original instructions copied over.
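Copying the original instructions requires knowing where each one ends, because Thumb2 mixes 16-bit and 32-bit encodings. A minimal length decoder can be sketched as follows; the function names and the `patch_size` parameter are hypothetical, and the sketch assumes little-endian code bytes.

```python
import struct


def thumb_insn_length(first_halfword):
    """Return the size in bytes (2 or 4) of a Thumb2 instruction."""
    # Top five bits 0b11101, 0b11110 or 0b11111 mark a 32-bit encoding.
    top5 = first_halfword >> 11
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def bytes_to_relocate(code, patch_size):
    """Size of the whole instructions at the start of `code` that must be
    moved to the trampoline so that patch_size bytes can be overwritten."""
    offset = 0
    while offset < patch_size:
        (halfword,) = struct.unpack_from("<H", code, offset)
        offset += thumb_insn_length(halfword)
    return offset
```

The result can be larger than `patch_size`: if the patch ends in the middle of a 32-bit instruction, that instruction has to be relocated as a whole.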

These trampolines highlight some of the caveats one encounters when hooking RISC code. The first trampoline contains only arithmetic instructions that do not require any modification and were copied over as they were (based on a simple length disassembly, as Thumb2 supports mixing 16-bit and 32-bit instruction lengths). However, the branch back to the original code is so far away that it cannot be performed by a regular jump but must be done by loading a full 32-bit address into the program counter.
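One common way to encode such a far jump in Thumb2 is `ldr.w pc, [pc, #0]` followed by the absolute target as a literal word. The sketch below assumes that form (the generated trampolines may use a different sequence); `b_w_reaches` and `far_jump` are illustrative names.

```python
import struct

LDR_W_PC_PC = struct.pack("<HH", 0xF8DF, 0xF000)  # ldr.w pc, [pc, #0]
NOP_16 = struct.pack("<H", 0xBF00)                # 16-bit nop


def b_w_reaches(src, dst):
    """True if a 32-bit Thumb2 `b.w` placed at src can branch to dst."""
    offset = dst - (src + 4)       # Thumb reads PC as instruction + 4
    return offset % 2 == 0 and -(1 << 24) <= offset < (1 << 24)


def far_jump(src, dst):
    """Encode an absolute jump, placed at src, to Thumb code at dst."""
    code = b""
    if src % 4 == 2:
        # The literal is fetched from Align(PC, 4); pad with a nop so that
        # it sits directly behind the load instruction.
        code += NOP_16
    # Bit 0 set keeps the core in Thumb state when PC is loaded.
    return code + LDR_W_PC_PC + struct.pack("<I", dst | 1)
```

A generator would typically emit the shorter `b.w` whenever `b_w_reaches` holds (roughly ±16 MB) and fall back to the 8 or 10 byte literal form otherwise.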
The second trampoline highlights some necessary fix-ups to copied instructions: because the constant loads it performs are relative to the PC register, and the constants are again too far away to be addressed from the trampoline, the values have been copied to the new trampoline and the instructions have been adjusted to reference these copies.

Example JIT’ed Probe Trampolines
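The literal fix-up can be sketched for the 16-bit `ldr rt, [pc, #imm]` encoding. This is an illustrative reconstruction, not the tool's generator: the function names and the `read_word` callback are hypothetical, and the 32-bit `ldr.w` literal form is not handled.

```python
def decode_ldr_literal(halfword, insn_addr):
    """Decode 16-bit `ldr rt, [pc, #imm]`.

    Returns (rt, literal_address), or None for any other instruction.
    """
    if halfword >> 11 != 0b01001:
        return None
    rt = (halfword >> 8) & 0x7
    base = (insn_addr + 4) & ~3        # Align(PC, 4), PC = instruction + 4
    return rt, base + (halfword & 0xFF) * 4


def encode_ldr_literal(rt, insn_addr, literal_addr):
    """Encode 16-bit `ldr rt, [pc, #imm]` at insn_addr reading literal_addr."""
    delta = literal_addr - ((insn_addr + 4) & ~3)
    if not 0 <= rt <= 7 or delta < 0 or delta % 4 or delta > 1020:
        raise ValueError("literal not reachable by a 16-bit ldr")
    return 0x4800 | (rt << 8) | (delta // 4)


def relocate_ldr_literal(halfword, old_addr, new_addr, pool_addr, read_word):
    """Return (new_halfword, literal_value) for a load moved to new_addr.

    The caller is expected to store literal_value at pool_addr, a word
    inside the trampoline's own literal pool.
    """
    rt, old_literal = decode_ldr_literal(halfword, old_addr)
    value = read_word(old_literal)     # copy the constant itself
    return encode_ldr_literal(rt, new_addr, pool_addr), value
```

For instance, `0x4A03` at address `0x8002` is `ldr r2, [pc, #12]` and reads the word at `0x8010`; moved to `0x20000` with its constant copied to `0x20008`, it is re-encoded as `0x4A01`.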

The code for generating these trampolines grew a little more complex than anticipated in order to cover the corner cases encountered. It now consists of three passes:

  1. Disassemble the original to-be-overwritten code and look for special instructions (thankfully the Thumb2 instruction set is very regular)
  2. A length-only reassembly of the original code to calculate relative addresses and offsets
  3. Actual reassembly of original code, fixing relative references and adding branches to original code

The decompilation of the generated trampoline code then looks trivial, as the semantics of that code are indeed simple:

Decompilation of Example Trampoline

This tool has proven extremely valuable to whitehat researchers prototyping attacks, as we could use existing visualization tools to render a view of the heap at any given point in time. The technique can, however, be leveraged for benign purposes as well, e.g. some Windows malware analysis sandboxes are based on the same inline hooking approach on the x86 architecture. Although it looked simple when designed on paper, this project turned out to be difficult to implement because of all the corner cases encountered.

George Kurtz

Co-founder of CrowdStrike, Kurtz is an internationally recognized security expert, author, entrepreneur, and speaker. He has been part of the security community for more than 20 years including leadership roles at McAfee and as the brains behind Foundstone. He also authored the best-selling security book of all time, Hacking Exposed: Network Security Secrets & Solutions.

 
