This is the second post in my “2018 Roundup” series. For an index of all entries, please see my blog entry for Q1.
As I have alluded to previously, Gecko includes a Detours-style API hooking mechanism for Windows. In Gecko, this code is referred to as the “DLL Interceptor.” We use the DLL interceptor to instrument various functions within our own processes. As a prerequisite for future DLL injection mitigations, I needed to spend a good chunk of Q2 refactoring this code. While I was in there, I took the opportunity to improve the interceptor’s memory efficiency, thus benefitting the Fission MemShrink project. [When these changes landed, we were not yet tracking the memory savings, but I will include a rough estimate later in this post.]
A Brief Overview of Detours-style API Hooking
While many distinct function hooking techniques are used in the Windows ecosystem, the Detours-style hook is one of the most effective and most popular. While I am not going to go into too many specifics here, I’d like to offer a quick overview. In this description, “target” is the function being hooked.
Here is what happens when a function is detoured:
1. Allocate a chunk of memory to serve as a “trampoline.” We must be able to adjust the protection attributes on that memory.
2. Disassemble enough of the target to make room for a jmp instruction. On 32-bit x86 processors, this requires 5 bytes. x86-64 is more complicated, but generally, to jmp to an absolute address, we try to make room for 13 bytes.
3. Copy the instructions from step 2 over to the trampoline.
4. At the beginning of the target function, write a jmp to the hook function.
5. Append additional instructions to the trampoline that, when executed, will cause the processor to jump back to the first valid instruction after the jmp written in step 4.
6. If the hook function wants to pass control on to the original target function, it calls the trampoline.
Note that these steps don’t occur exactly in the order specified above; I selected the above ordering in an effort to simplify my description.
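To make the 5-byte and 13-byte figures concrete, here is a minimal sketch of how such jumps can be encoded. It is not the interceptor's actual code: the function names are hypothetical, and the choice of r11 as the scratch register for the 64-bit case is an assumption (r11 is volatile in the Windows x64 calling convention, which makes it a plausible candidate).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 32-bit x86: opcode E9 followed by a 32-bit displacement that is relative
// to the address of the instruction *after* the jmp (aFrom + 5).
// Hypothetical helper, for illustration only.
std::array<std::uint8_t, 5> EncodeRelativeJmp32(std::uint32_t aFrom,
                                                std::uint32_t aTo) {
  std::array<std::uint8_t, 5> bytes{};
  bytes[0] = 0xE9;
  std::uint32_t rel = aTo - (aFrom + 5);  // wraps modulo 2^32 by design
  for (std::size_t i = 0; i < 4; ++i) {
    bytes[1 + i] = static_cast<std::uint8_t>(rel >> (8 * i));  // little-endian
  }
  return bytes;
}

// x86-64: "mov r11, imm64" (49 BB + 8 bytes) then "jmp r11" (41 FF E3),
// for 10 + 3 = 13 bytes in total. Using r11 here is an assumption.
std::array<std::uint8_t, 13> EncodeAbsoluteJmp64(std::uint64_t aTo) {
  std::array<std::uint8_t, 13> bytes{};
  bytes[0] = 0x49;
  bytes[1] = 0xBB;
  for (std::size_t i = 0; i < 8; ++i) {
    bytes[2 + i] = static_cast<std::uint8_t>(aTo >> (8 * i));  // little-endian
  }
  bytes[10] = 0x41;
  bytes[11] = 0xFF;
  bytes[12] = 0xE3;
  return bytes;
}
```

The relative form only reaches targets within ±2GiB of the jmp, which is why a 64-bit process generally needs the longer absolute form.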
Here is my attempt at visualizing the control flow of a detoured function on x86-64:
Previously, the DLL interceptor relied on directly manipulating pointers in order to read and write the various instructions involved in the hook. In bug 1432653 I changed things so that the memory operations are parameterized based on two orthogonal concepts:
- In-process vs out-of-process memory access: I wanted to be able to abstract reads and writes such that we could optionally set a hook in another process from our own.
- Virtual memory allocation scheme: I wanted to be able to change how trampoline memory was allocated.
Previously, each instance of WindowsDllInterceptor allocated its own page of memory for trampolines, but each instance also typically only sets one or two hooks. This means that most of the 4KiB page was unused. Furthermore, since Windows allocates blocks of pages on a 64KiB boundary, this wasted a lot of precious virtual address space in our 32-bit builds.
By refactoring and parameterizing these operations, we ended up with the following combinations:
- In-process memory access, each WindowsDllInterceptor instance receives its own trampoline space;
- In-process memory access, all WindowsDllInterceptor instances within a module share trampoline space;
- Out-of-process memory access, each WindowsDllInterceptor instance receives its own trampoline space;
- Out-of-process memory access, all WindowsDllInterceptor instances within a module share trampoline space (currently not implemented as this option is not particularly useful at the moment).
Instead of directly manipulating pointers, we now use instances of
Trampoline to manipulate our code/data. Those classes in turn use the
memory management and virtual memory allocation policies to perform the actual reading and writing.
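As a rough illustration of this kind of parameterization, here is a minimal policy-based sketch. The names InProcessMemory and PatchableBytes are hypothetical and do not come from the Gecko source; the sketch only shows how code that patches bytes can be written against a policy type instead of raw pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical in-process policy: reads and writes are plain memcpy calls.
// An out-of-process policy would expose the same two methods but forward
// them to cross-process read/write system calls instead.
struct InProcessMemory {
  bool Read(void* aDest, const void* aSrc, std::size_t aLen) const {
    std::memcpy(aDest, aSrc, aLen);
    return true;
  }
  bool Write(void* aDest, const void* aSrc, std::size_t aLen) const {
    std::memcpy(aDest, aSrc, aLen);
    return true;
  }
};

// Hypothetical wrapper around a region of code or data. It never touches
// the memory directly; every access goes through the policy.
template <typename MemoryPolicy>
class PatchableBytes {
 public:
  PatchableBytes(MemoryPolicy aPolicy, std::uint8_t* aBase)
      : mPolicy(aPolicy), mBase(aBase) {
    assert(aBase);
  }

  bool ReadByte(std::size_t aOffset, std::uint8_t& aOut) const {
    return mPolicy.Read(&aOut, mBase + aOffset, sizeof(aOut));
  }

  bool WriteByte(std::size_t aOffset, std::uint8_t aValue) {
    return mPolicy.Write(mBase + aOffset, &aValue, sizeof(aValue));
  }

 private:
  MemoryPolicy mPolicy;
  std::uint8_t* mBase;
};
```

Because the policy is a template parameter, the compiler can inline the memcpy calls, so the abstraction need not add any run-time cost.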
Memory Management Policies
The interceptor now supports two policies, MMPolicyInProcess and MMPolicyOutOfProcess. Each policy must implement the following memory operations:
- Read
- Write
- Change protection attributes
- Reserve trampoline space
- Commit trampoline space
MMPolicyInProcess is implemented using memcpy for read and write, VirtualProtect for protection attribute changes, and VirtualAlloc for reserving and committing trampoline space.
MMPolicyOutOfProcess uses ReadProcessMemory and WriteProcessMemory for read and write. As a perf optimization, we try to batch reads and writes together to reduce the system call traffic. We obviously use VirtualProtectEx to adjust protection attributes in the other process.
Out-of-process trampoline reservation and commitment, however, are a bit different and are worth a separate call-out. We allocate trampoline space using shared memory. It is mapped into the local process with read+write permissions using MapViewOfFile. The memory is mapped into the remote process as read+execute using some code that I wrote in bug 1451511 that uses either MapViewOfFile2 or an alternative mapping API, depending on availability. Individual pages from those chunks are then committed via VirtualAlloc in the local process and VirtualAllocEx in the remote process. This scheme enables us to read and write to trampoline memory directly, without needing to do cross-process reads and writes!
VM Sharing Policies
The code for these policies is a lot simpler than the code for the memory management policies. We now have two: one that gives each interceptor its own trampoline space, and VMSharingPolicyShared. Each of these policies must implement the following operations:
- Reserve space for up to N trampolines of size K;
- Obtain a Trampoline object for the next available K-byte trampoline slot;
- Return an iterable collection of all extant trampolines.
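The three operations above can be sketched as a small slot allocator. This is only an illustration under stated assumptions: TrampolinePool and TrampolineSlot are hypothetical names, and the backing store is an ordinary heap buffer rather than reserved and committed executable virtual memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a trampoline: a base address and a size.
struct TrampolineSlot {
  std::uint8_t* mBase;
  std::size_t mSize;
};

// Hypothetical pool. Real trampoline space must be executable; a
// std::vector is used here purely to keep the sketch portable.
class TrampolinePool {
 public:
  // Reserve space for up to aCount trampolines of aSlotSize bytes each.
  // Only a single reservation is allowed, so slot addresses stay stable.
  bool Reserve(std::size_t aCount, std::size_t aSlotSize) {
    if (mCapacity != 0 || aCount == 0 || aSlotSize == 0) {
      return false;
    }
    mStorage.resize(aCount * aSlotSize);
    mCapacity = aCount;
    mSlotSize = aSlotSize;
    return true;
  }

  // Hand out the next free slot, or return false once all are in use
  // (or if nothing has been reserved yet).
  bool GetNextSlot(TrampolineSlot& aOut) {
    if (mHandedOut.size() == mCapacity) {
      return false;
    }
    aOut = TrampolineSlot{mStorage.data() + mHandedOut.size() * mSlotSize,
                          mSlotSize};
    mHandedOut.push_back(aOut);
    return true;
  }

  // Iterable collection of every slot handed out so far.
  const std::vector<TrampolineSlot>& Slots() const { return mHandedOut; }

 private:
  std::vector<std::uint8_t> mStorage;
  std::vector<TrampolineSlot> mHandedOut;
  std::size_t mCapacity = 0;
  std::size_t mSlotSize = 0;
};
```

A shared policy could then be little more than a single pool of this kind that every interceptor in the module draws from.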
VMSharingPolicyShared is actually implemented by delegating to a
static instance of
Implications of Refactoring
To determine the performance implications, I added timings to our DLL Interceptor unit test. I was very happy to see that, despite the additional layers of abstraction, the C++ compiler’s optimizer was doing its job: There was no performance impact whatsoever!
Once the refactoring was complete, I switched the default VM Sharing Policy for WindowsDllInterceptor to VMSharingPolicyShared in bug 1451524.
At mozilla-central tip, I count 14 locations where we instantiate interceptors inside xul.dll. Given that not all interceptors are necessarily instantiated at once, I am now offering a worst-case back-of-the-napkin estimate of the memory savings:
- Each interceptor would likely be consuming 4KiB (most of which is unused) of committed VM. Due to Windows’ 64KiB allocation granularity, each interceptor would be leaving a further 60KiB of address space in a free but unusable state. Assuming all 14 interceptors were actually instantiated, they would thus consume a combined 56KiB of committed VM and 840KiB of free but unusable address space.
- By sharing trampoline VM, the interceptors would consume only 4KiB combined and waste only 60KiB of address space, thus yielding savings of 52KiB in committed memory and 780KiB in addressable memory.
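The arithmetic behind this estimate can be checked with a few compile-time constants. The constant and type names below are hypothetical; the figures (4KiB pages, 64KiB granularity, 14 interceptors) are the ones used in the estimate above.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kPageKiB = 4;               // one committed page
constexpr std::size_t kAllocGranularityKiB = 64;  // Windows allocation granularity
constexpr std::size_t kInterceptorCount = 14;     // worst case: all instantiated

struct VmCostKiB {
  std::size_t committed;  // committed VM
  std::size_t unusable;   // free but unusable address space
};

// Each separately allocated region commits one page and strands the rest
// of its 64KiB allocation block.
constexpr VmCostKiB CostOfRegions(std::size_t aRegionCount) {
  return VmCostKiB{aRegionCount * kPageKiB,
                   aRegionCount * (kAllocGranularityKiB - kPageKiB)};
}

constexpr VmCostKiB kUnique = CostOfRegions(kInterceptorCount);  // one region each
constexpr VmCostKiB kShared = CostOfRegions(1);                  // one shared region

static_assert(kUnique.committed == 56 && kUnique.unusable == 840,
              "per-interceptor trampoline space");
static_assert(kShared.committed == 4 && kShared.unusable == 60,
              "shared trampoline space");
static_assert(kUnique.committed - kShared.committed == 52,
              "committed VM saved, in KiB");
static_assert(kUnique.unusable - kShared.unusable == 780,
              "address space saved, in KiB");
```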
Oh, and One More Thing
Another problem that I discovered during this refactoring was bug 1459335. It turns out that some of the interceptor’s callers were not distinguishing between “I have not set this hook yet” and “I attempted to set this hook but it failed” scenarios. Across several call sites, I discovered that our code would repeatedly retry setting hooks even when they had previously failed, causing leakage of trampoline space!
To fix this, I modified the interceptor’s interface so that we use one-time initialization APIs to set hooks; since this fix landed, it is no longer possible for clients of the DLL interceptor to retry setting a hook that had previously failed.
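The idea of distinguishing "never attempted" from "attempted and failed" can be sketched with an explicit three-state flag. This is a simplified, single-threaded illustration with hypothetical names (HookState, OneTimeHook); it swaps in a plain enum for the one-time initialization APIs mentioned above and makes no attempt at thread safety.

```cpp
#include <cassert>
#include <functional>

enum class HookState { NotAttempted, Succeeded, Failed };

// Hypothetical wrapper that runs the install step at most once and
// remembers the outcome, so a failed hook is never retried.
class OneTimeHook {
 public:
  // aInstall performs the actual hooking and reports success or failure.
  bool Set(const std::function<bool()>& aInstall) {
    if (mState == HookState::NotAttempted) {
      mState = aInstall() ? HookState::Succeeded : HookState::Failed;
    }
    return mState == HookState::Succeeded;
  }

  HookState State() const { return mState; }

 private:
  HookState mState = HookState::NotAttempted;
};
```

With this shape, a caller that keeps calling Set after a failure simply gets false back each time, and no additional trampoline space is consumed.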
Quantifying the memory costs of this bug is… non-trivial, but it suffices to say that fixing this bug probably resulted in the savings of at least a few hundred KiB in committed VM on affected machines.
That’s it for today’s post, folks! Thanks for reading! Coming up in Q2, Part 2: Implementing a Skeletal Launcher Process