This is the second post in my “2018 Roundup” series. For an index of all entries, please see my
blog entry for Q1.
As I have alluded to previously,
Gecko includes a Detours-style API hooking mechanism for Windows. In Gecko, this code is referred to
as the “DLL Interceptor.” We use the DLL interceptor to instrument various functions within our own
processes. As a prerequisite for future DLL injection mitigations, I needed to spend a good chunk of
Q2 refactoring this code. While I was in there, I took the opportunity to improve the interceptor’s
memory efficiency, thus benefitting the Fission MemShrink project. [When these changes landed, we were
not yet tracking the memory savings, but I will include a rough estimate later in this post.]
A Brief Overview of Detours-style API Hooking
While many distinct function hooking techniques are used in the Windows ecosystem, the Detours-style
hook is one of the most effective and most popular. While I am not going to go into too many specifics
here, I’d like to offer a quick overview. In this description, “target” is the function being hooked.
Here is what happens when a function is detoured:
1. Allocate a chunk of memory to serve as a “trampoline.” We must be able to adjust the protection
   attributes on that memory.
2. Disassemble enough of the target to make room for a jmp instruction. On 32-bit x86 processors,
   this requires 5 bytes. x86-64 is more complicated, but generally, to jmp to an absolute address, we
   try to make room for 13 bytes.
3. Copy the instructions from step 2 over to the trampoline.
4. At the beginning of the target function, write a jmp to the hook function.
5. Append additional instructions to the trampoline that, when executed, will cause the processor to
   jump back to the first valid instruction after the jmp written in step 4.
6. If the hook function wants to pass control on to the original target function, it calls the
   trampoline.
Note that these steps don’t occur exactly in the order specified above; I selected the above ordering
in an effort to simplify my description.
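To make the 5-byte and 13-byte figures concrete, here is a minimal sketch of how such jumps can be encoded. The helper names (StoreLE, EncodeJmpRel32, EncodeJmpAbs64) are hypothetical, and the x86-64 sequence shown (mov r11, imm64 followed by jmp r11) is one common 13-byte way to reach an absolute address; I am assuming it for illustration rather than quoting the interceptor's actual code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Store aValue at aDst in little-endian byte order, as x86 expects.
template <typename T>
inline void StoreLE(uint8_t* aDst, T aValue) {
  for (size_t i = 0; i < sizeof(T); ++i) {
    aDst[i] = static_cast<uint8_t>(aValue >> (8 * i));
  }
}

// 32-bit x86: opcode E9 followed by a 32-bit displacement (5 bytes total).
// The displacement is relative to the address of the instruction that
// follows the jmp, i.e. aFrom + 5.
inline std::array<uint8_t, 5> EncodeJmpRel32(uint32_t aFrom, uint32_t aTo) {
  std::array<uint8_t, 5> bytes{};
  bytes[0] = 0xE9;
  // Unsigned arithmetic wraps modulo 2^32, which also covers backward jumps.
  StoreLE<uint32_t>(bytes.data() + 1, aTo - (aFrom + 5));
  return bytes;
}

// x86-64 (assumed sequence): mov r11, imm64 (49 BB + 8 bytes),
// then jmp r11 (41 FF E3), for 13 bytes in total.
inline std::array<uint8_t, 13> EncodeJmpAbs64(uint64_t aTo) {
  std::array<uint8_t, 13> bytes{};
  bytes[0] = 0x49;
  bytes[1] = 0xBB;
  StoreLE<uint64_t>(bytes.data() + 2, aTo);
  bytes[10] = 0x41;
  bytes[11] = 0xFF;
  bytes[12] = 0xE3;
  return bytes;
}
```

The sketch only produces the bytes; writing them over the start of the target, and copying the displaced instructions into the trampoline first, are the parts that need the disassembler and the protection changes described above.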
Here is my attempt at visualizing the control flow of a detoured function on x86-64:
Previously, the DLL interceptor relied on directly manipulating pointers in order to read and write the
various instructions involved in the hook. In bug 1432653 I changed things so that the memory
operations are parameterized based on two orthogonal concepts:
- In-process vs out-of-process memory access: I wanted to be able to abstract reads and writes such
that we could optionally set a hook in another process from our own.
- Virtual memory allocation scheme: I wanted to be able to change how trampoline memory was allocated.
Previously, each instance of
WindowsDllInterceptor allocated its own page of memory for trampolines,
but each instance also typically only sets one or two hooks. This means that most of the 4KiB page
was unused. Furthermore, since Windows allocates blocks of pages on a 64KiB boundary, this wasted a
lot of precious virtual address space in our 32-bit builds.
By refactoring and parameterizing these operations, we ended up with the following combinations:
- In-process memory access, each
WindowsDllInterceptor instance receives its own trampoline space;
- In-process memory access, all
WindowsDllInterceptor instances within a module share trampoline space;
- Out-of-process memory access, each
WindowsDllInterceptor instance receives its own trampoline space;
- Out-of-process memory access, all
WindowsDllInterceptor instances within a module share trampoline space (currently
not implemented as this option is not particularly useful at the moment).
Instead of directly manipulating pointers, we now use instances of wrapper classes such as
Trampoline to manipulate our code/data. Those classes in turn use the
memory management and virtual memory allocation policies to perform the actual reading and writing.
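As a rough illustration of that layering, here is a minimal sketch of a writer class that never touches memory directly and instead routes every access through a policy type. SketchInProcessMM and SketchCodeWriter are hypothetical names invented for this example; the real classes are more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical in-process policy: reads and writes are plain memcpy calls.
// An out-of-process policy would expose the same two methods but forward
// them to cross-process system calls instead.
struct SketchInProcessMM {
  bool Read(void* aDst, const void* aSrc, size_t aLen) const {
    std::memcpy(aDst, aSrc, aLen);
    return true;
  }
  bool Write(void* aDst, const void* aSrc, size_t aLen) const {
    std::memcpy(aDst, aSrc, aLen);
    return true;
  }
};

// Hypothetical writer: tracks an offset into a fixed-size region and
// delegates the actual store to the policy.
template <typename MMPolicy>
class SketchCodeWriter {
 public:
  SketchCodeWriter(MMPolicy aPolicy, uint8_t* aBase, size_t aCapacity)
      : mPolicy(aPolicy), mBase(aBase), mCapacity(aCapacity), mOffset(0) {}

  // Fails, leaving the region untouched, if the bytes would not fit.
  bool WriteBytes(const void* aSrc, size_t aLen) {
    if (aLen > mCapacity - mOffset) {
      return false;
    }
    if (!mPolicy.Write(mBase + mOffset, aSrc, aLen)) {
      return false;
    }
    mOffset += aLen;
    return true;
  }

  bool WriteByte(uint8_t aByte) { return WriteBytes(&aByte, 1); }

  size_t Offset() const { return mOffset; }

 private:
  MMPolicy mPolicy;
  uint8_t* mBase;
  size_t mCapacity;
  size_t mOffset;
};
```

Because the policy is a template parameter, the compiler can inline the memcpy path entirely, which is consistent with the lack of measurable overhead reported later in this post.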
Memory Management Policies
The interceptor now supports two policies, MMPolicyInProcess and MMPolicyOutOfProcess. Each policy
must implement the following memory operations:
- Read
- Write
- Change protection attributes
- Reserve trampoline space
- Commit trampoline space
MMPolicyInProcess is implemented using memcpy for read and write, VirtualProtect for protection
attribute changes, and VirtualAlloc for reserving and committing trampoline space.
MMPolicyOutOfProcess uses ReadProcessMemory and WriteProcessMemory for read and write. As a perf
optimization, we try to batch reads and writes together to reduce the system call traffic. We obviously
use VirtualProtectEx to adjust protection attributes in the other process.
Out-of-process trampoline reservation and commitment, however, is a bit different and is worth a
separate call-out. We allocate trampoline space using shared memory. It is mapped into the local
process with read+write permissions using
MapViewOfFile. The memory is mapped into the remote process
as read+execute using some code that I wrote in bug 1451511 that uses either MapViewOfFile2 or an
alternative API, depending on availability. Individual pages from those chunks are then committed via
VirtualAlloc in the local process and
VirtualAllocEx in the remote process. This scheme enables
us to read and write to trampoline memory directly, without needing to do cross-process reads and writes!
VM Sharing Policies
The code for these policies is a lot simpler than the code for the memory management policies. We now
have two: a policy that gives each interceptor its own trampoline space, and VMSharingPolicyShared.
Each of these policies must implement the following operations:
- Reserve space for up to N trampolines of size K;
- Obtain a
Trampoline object for the next available K-byte trampoline slot;
- Return an iterable collection of all extant trampolines.
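The three operations above can be sketched as a small slot allocator. This is only an illustration: SketchTrampoline and SketchTrampolinePool are hypothetical names, and a std::vector stands in for memory that the real policies would reserve and commit through the memory management policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical handle to one K-byte trampoline slot.
struct SketchTrampoline {
  uint8_t* mBase;
  size_t mSize;
};

class SketchTrampolinePool {
 public:
  // Reserve space for up to aCount trampolines of aSlotSize bytes each.
  // Only one reservation is allowed per pool in this sketch.
  bool Reserve(size_t aCount, size_t aSlotSize) {
    if (!mStorage.empty() || aCount == 0 || aSlotSize == 0) {
      return false;
    }
    mStorage.resize(aCount * aSlotSize);
    mSlotSize = aSlotSize;
    mCapacity = aCount;
    return true;
  }

  // Hand out the next unused slot; returns false once the pool is exhausted.
  bool GetNextTrampoline(SketchTrampoline* aOut) {
    if (mUsed == mCapacity) {
      return false;
    }
    aOut->mBase = mStorage.data() + mUsed * mSlotSize;
    aOut->mSize = mSlotSize;
    ++mUsed;
    return true;
  }

  // Return every trampoline handed out so far, in allocation order.
  std::vector<SketchTrampoline> Items() {
    std::vector<SketchTrampoline> items;
    for (size_t i = 0; i < mUsed; ++i) {
      items.push_back({mStorage.data() + i * mSlotSize, mSlotSize});
    }
    return items;
  }

 private:
  std::vector<uint8_t> mStorage;
  size_t mSlotSize = 0;
  size_t mCapacity = 0;
  size_t mUsed = 0;
};
```

A shared policy could then hold a single pool like this one for a whole module, while a per-interceptor policy would own one pool per instance.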
VMSharingPolicyShared is actually implemented by delegating to a
static instance of the other, per-interceptor policy.
Implications of Refactoring
To determine the performance implications, I added timings to our DLL Interceptor unit test. I was
very happy to see that, despite the additional layers of abstraction, the C++ compiler’s optimizer was
doing its job: There was no performance impact whatsoever!
Once the refactoring was complete, I switched the default VM Sharing Policy for
WindowsDllInterceptor to VMSharingPolicyShared in bug 1451524.
On mozilla-central tip, I count 14 locations where we instantiate interceptors inside
xul.dll. Given that not all interceptors are necessarily instantiated at once, I am now offering a
worst-case back-of-the-napkin estimate of the memory savings:
- Each interceptor would likely be consuming 4KiB (most of which is unused) of committed VM. Due to
Windows’ 64KiB allocation granularity, each interceptor would be leaving a further 60KiB
of address space in a free but unusable state. Assuming all 14 interceptors were actually instantiated,
they would thus consume a combined 56KiB of committed VM and 840KiB of free but unusable address space.
- By sharing trampoline VM, the interceptors would consume only 4KiB combined and waste only 60KiB of
address space, thus yielding savings of 52KiB in committed memory and 780KiB in addressable memory.
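The arithmetic behind that estimate can be checked mechanically. The sketch below simply restates the numbers from the two bullets; the constant and function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Figures taken from the estimate above, all in KiB.
constexpr size_t kPageKiB = 4;          // one committed page per region
constexpr size_t kGranularityKiB = 64;  // Windows allocation granularity
constexpr size_t kInterceptors = 14;    // worst case: all instantiated

// Committed memory for aRegions separately allocated trampoline regions.
constexpr size_t CommittedKiB(size_t aRegions) { return aRegions * kPageKiB; }

// Address space left free but unusable behind each region's single page.
constexpr size_t UnusableKiB(size_t aRegions) {
  return aRegions * (kGranularityKiB - kPageKiB);
}

// Before: one region per interceptor. After: one shared region.
static_assert(CommittedKiB(kInterceptors) == 56, "14 * 4KiB");
static_assert(UnusableKiB(kInterceptors) == 840, "14 * 60KiB");
static_assert(CommittedKiB(1) == 4, "shared page");
static_assert(UnusableKiB(1) == 60, "shared remainder");
static_assert(CommittedKiB(kInterceptors) - CommittedKiB(1) == 52, "savings");
static_assert(UnusableKiB(kInterceptors) - UnusableKiB(1) == 780, "savings");
```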
Oh, and One More Thing
Another problem that I discovered during this refactoring was bug 1459335. It turns out that some
of the interceptor’s callers were not distinguishing between “I have not set this hook yet” and “I
attempted to set this hook but it failed” scenarios. Across several call sites, I discovered that
our code would repeatedly retry setting hooks even when they had previously failed, leaking
trampoline space each time!
To fix this, I modified the interceptor’s interface so that we use one-time initialization APIs to
set hooks; since landing this bug, it is no longer possible for clients of the DLL interceptor to
retry setting a hook that had previously failed to be set.
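The distinction between “not attempted” and “attempted but failed” can be sketched with a three-state flag. This is a simplified, single-threaded illustration with hypothetical names (HookState, SketchOnceHook); it assumes nothing about the actual interface, which relies on one-time initialization APIs that also handle concurrent callers.

```cpp
#include <cassert>

// The caller-visible states that the old interface could not tell apart.
enum class HookState { NotAttempted, Succeeded, Failed };

class SketchOnceHook {
 public:
  // Runs aSetter at most once. If it fails, the failure is remembered and
  // later calls return false without running aSetter again, so no further
  // trampoline space would be consumed by retries.
  template <typename SetterT>
  bool InitOnce(SetterT aSetter) {
    if (mState == HookState::NotAttempted) {
      mState = aSetter() ? HookState::Succeeded : HookState::Failed;
    }
    return mState == HookState::Succeeded;
  }

  HookState State() const { return mState; }

 private:
  HookState mState = HookState::NotAttempted;
};
```

With this shape, a call site that keeps calling InitOnce after a failure costs nothing beyond the first attempt.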
Quantifying the memory costs of this bug is… non-trivial, but it suffices to say that fixing
this bug probably resulted in the savings of at least a few hundred KiB in committed VM on
That’s it for today’s post, folks! Thanks for reading! Coming up in Q2, Part 2: Implementing a Skeletal Launcher Process