Aaron Klotz at Mozilla

My adventures as a member of Mozilla’s GeckoView Team

2019 Roundup: Part 1 - Porting the DLL Interceptor to AArch64

In my continuing efforts to get caught up on discussing my work, I am now commencing a roundup for 2019. I think I am going to structure this one slightly differently from the last one: I am going to try to segment this roundup by project.

Here is an index of all the entries in this series:

Porting the DLL Interceptor to AArch64

During early 2019, Mozilla was working to port Firefox to run on the new AArch64 builds of Windows. At our December 2018 all-hands, I brought up the necessity of including the DLL Interceptor in our porting efforts. Since no good deed goes unpunished, I was put in charge of doing the work! [I’m actually kidding here; this project was right up my alley and I was happy to do it! – Aaron]

Before continuing, you might want to review my previous entry describing the Great Interceptor Refactoring of 2018, as this post revisits some of the concepts introduced there.

Let us review some DLL Interceptor terminology:

  • The target function is the function we want to hook (Note that this is a distinct concept from a branch target, which is also discussed in this post);
  • The hook function is our function that we want the intercepted target function to invoke;
  • The trampoline is a small chunk of executable code generated by the DLL interceptor that facilitates calling the target function’s original implementation.

On more than one occasion I had to field questions about why this work was even necessary for AArch64: there aren’t going to be many injected DLLs in a Win32 ecosystem running on a shiny new processor architecture! In fact, the DLL Interceptor is used for more than just facilitating the blocking of injected DLLs; we also use it for other purposes.

Not all of this work was done in one bug: some tasks were more urgent than others. I began this project by enumerating our extant uses of the interceptor to determine which instances were relevant to the new AArch64 port. I threw a record of each instance into a colour-coded spreadsheet, which proved to be very useful for tracking progress: Reds were “must fix” instances, yellows were “nice to have” instances, and greens were “fixed” instances. Coordinating with the milestones laid out by program management, I was able to assign each instance to a bucket which would help determine a total ordering for the various fixes. I landed the first set of changes in bug 1526383, and the second set in bug 1532470.

It was now time to sit down, download some AArch64 programming manuals, and take a look at what I was dealing with. While I have been messing around with x86 assembly since I was a teenager, my first exposure to RISC architectures was via the DLX architecture introduced by Hennessy and Patterson in their textbooks. While DLX was crafted specifically for educational purposes, it served for me as a great point of reference. When I was a student taking CS 241 at the University of Waterloo, we had to write a toy compiler that generated DLX code. That experience ended up saving me a lot of time when looking into AArch64! While the latter is definitely more sophisticated, I could clearly recognize analogs between the two architectures.

In some ways, targeting a RISC architecture greatly simplifies things: The DLL Interceptor only needs to concern itself with a small subset of the AArch64 instruction set: loads and branches. In fact, the DLL Interceptor’s AArch64 disassembler only looks for nine distinct instructions! As a bonus, since the instruction length is fixed, we can easily copy over verbatim any instructions that are not loads or branches!

On the other hand, one thing that increased complexity of the port is that some branch instructions to relative addresses have maximum offsets. If we must branch farther than that maximum, we must take alternate measures. For example, in AArch64, an unconditional branch with an immediate offset must land in the range of ±128 MiB from the current program counter.
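The ±128 MiB figure follows from the instruction encoding: B carries a signed 26-bit immediate that counts 4-byte instruction words rather than bytes. Here is a minimal sketch of the range check and the encoding; the function names are illustrative and are not taken from the interceptor's source.

```python
B_OPCODE = 0x14000000   # A64 "B <label>": top six bits are 000101
B_RANGE = 1 << 27       # 2**25 words * 4 bytes = 128 MiB in each direction


def b_in_range(pc, target):
    """True if an immediate B located at `pc` can reach `target`."""
    offset = target - pc
    return offset % 4 == 0 and -B_RANGE <= offset < B_RANGE


def encode_b(pc, target):
    """Return the 32-bit encoding of `B target` for a branch placed at `pc`."""
    if not b_in_range(pc, target):
        raise ValueError("target is out of range for an immediate branch")
    imm26 = ((target - pc) >> 2) & 0x03FFFFFF  # word offset, two's complement
    return B_OPCODE | imm26
```

A branch to the very next instruction encodes as `0x14000001`, and a branch one instruction backwards as `0x17FFFFFF`.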

Why is this a problem, you ask? Well, Detours-style interception must overwrite the first several instructions of the target function. To write an absolute jump, we require at least 16 bytes: 4 for an LDR instruction, 4 for a BR instruction, and another 8 for the 64-bit absolute branch target address.
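To make the 16-byte figure concrete, here is one way those bytes could be laid out. This is a sketch under an assumption: it uses X16 as the scratch register, following the veneer convention, whereas the text above does not say which register the real patch uses.

```python
import struct

LDR_X16_PC_PLUS_8 = 0x58000050  # ldr x16, #8  (load the 8 bytes that follow BR)
BR_X16 = 0xD61F0200             # br  x16


def absolute_jump(destination):
    """Return 16 bytes: LDR, BR, then the little-endian 64-bit destination."""
    return struct.pack("<IIQ", LDR_X16_PC_PLUS_8, BR_X16, destination)
```

The literal sits 8 bytes past the LDR, which is why the load uses a PC-relative offset of 8.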

Unfortunately, target functions may be really short! Some of the target functions that we need to patch consist only of a single 4-byte instruction!

In this case, our only option for patching the target is to use an immediate B instruction, but that only works if our hook function falls within that ±128 MiB limit. If it does not, we need to construct a veneer. A veneer is a special trampoline whose location falls within the target range of a branch instruction. Its sole purpose is to provide an unconditional jump to the “real” desired branch target that lies outside of the range of the original branch. Using veneers, we can successfully hook a target function even if it is only one instruction (i.e., 4 bytes) in length, and the hook function lies more than 128 MiB away from it. The AArch64 Procedure Call Standard specifies X16 as a volatile register that is explicitly intended for use by veneers: veneers load an absolute target address into X16 (without needing to worry about whether or not they’re clobbering anything), and then unconditionally jump to it.

Measuring Target Function Instruction Length

To determine how many instructions the target function has for us to work with, we make two passes over the target function’s code. The first pass simply counts how many instructions are available for patching (up to the 4-instruction maximum needed for absolute branches; we don’t really care beyond that).

The second pass actually populates the trampoline, builds the veneer (if necessary), and patches the target function.
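The count from the first pass determines which kind of patch the second pass can write. The sketch below is one reading of that decision, not the interceptor's actual logic: the function name and the returned labels are hypothetical, and it assumes the absolute jump is preferred whenever four instructions are available.

```python
ABSOLUTE_JUMP_INSTRUCTIONS = 4  # LDR + BR + 8-byte address = 16 bytes
B_RANGE = 1 << 27               # reach of an immediate B: 128 MiB each way


def choose_patch(available_instructions, target_pc, hook_address):
    """Pick a patching strategy from the first pass's instruction count."""
    if available_instructions < 1:
        raise ValueError("nothing to patch")
    if available_instructions >= ABSOLUTE_JUMP_INSTRUCTIONS:
        return "absolute"   # room for the full 16-byte absolute jump
    offset = hook_address - target_pc
    if -B_RANGE <= offset < B_RANGE:
        return "branch"     # a single immediate B reaches the hook directly
    return "veneer"         # B to a nearby veneer, which jumps to the hook
```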

Veneer Support

Since the DLL interceptor is already well-equipped to build trampolines, it did not take much effort to add support for constructing veneers. However, where to write out a veneer is just as important as what to write to a veneer.

Recall that we need our veneer to reside within ±128 MiB of an immediate branch. Therefore, we need to be able to exercise some control over where the trampoline memory for veneers is allocated. Until this point, our trampoline allocator had no need to care about this; I had to add this capability.

Adding Range-Aware VM Allocation

Firstly, I needed to make the MMPolicy classes range-aware: we need to be able to allocate trampoline space within acceptable distances from branch instructions.

Consider that, as described above, a branch instruction may have limits on the extents of its target. As data, this is easily formatted as a pivot (i.e., the PC at the location where the branch instruction is encountered), and a maximum distance in either direction from that pivot.

On the other hand, range-constrained memory allocation tends to work in terms of lower and upper bounds. I wrote a conversion method, MMPolicyBase::SpanFromPivotAndDistance, to convert between the two formats. In addition to format conversion, this method also constrains resulting bounds such that they are above the 1MiB mark of the process’ address space (to avoid reserving memory in VM regions that are sensitive to compatibility concerns), as well as below the maximum allowable user-mode VM address.
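The conversion itself is a small amount of arithmetic. This is a minimal sketch rather than the real method: the Python function name mirrors the C++ one for readability only, and the maximum user-mode address is an assumed constant where real code would query the OS.

```python
MIB = 1 << 20
MIN_ADDRESS = 1 * MIB                  # stay above the first 1 MiB
MAX_USER_ADDRESS = 0x00007FFFFFFEFFFF  # assumed value; real code would ask the OS


def span_from_pivot_and_distance(pivot, distance):
    """Convert (pivot, max distance) into clamped (lower, upper) bounds."""
    lower = max(pivot - distance, MIN_ADDRESS)
    upper = min(pivot + distance, MAX_USER_ADDRESS)
    if lower > upper:
        return None                    # nothing usable inside the limits
    return lower, upper
```

For a pivot of `0x40000000` and a distance of 128 MiB, this yields the bounds `(0x38000000, 0x48000000)`; pivots near either end of the address space are clamped instead.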

Another issue with range-aware VM allocation is determining the location, within the allowable range, for the actual VM reservation. Ideally we would like the kernel’s memory manager to choose the best location for us: its holistic view of existing VM layout (not to mention ASLR) across all processes will provide superior VM reservations. On the other hand, the Win32 APIs that facilitate this are specific to Windows 10. When available, MMPolicyInProcess uses VirtualAlloc2 and MMPolicyOutOfProcess uses MapViewOfFile3. When we’re running on Windows versions where those APIs are not yet available, we need to fall back to finding and reserving our own range. The MMPolicyBase::FindRegion method handles this for us.

All of this logic is wrapped up in the MMPolicyBase::Reserve method. In addition to the desired VM size and range, the method also accepts two functors that wrap the OS APIs for reserving VM. Reserve uses those functors when available, otherwise it falls back to FindRegion to manually locate a suitable reservation.
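The shape of that fallback can be sketched as follows. This is a simplification under stated assumptions: it takes a single optional callable instead of two functors, it models the address space as a plain list of free `(base, length)` regions, and it assumes a 64 KiB allocation granularity. None of the names below come from the real code.

```python
GRANULARITY = 0x10000  # assumed 64 KiB allocation granularity


def find_region(free_regions, size, lower, upper):
    """Return the first aligned base within [lower, upper] that fits `size`."""
    for base, length in free_regions:
        start = max(base, lower)
        start = (start + GRANULARITY - 1) & ~(GRANULARITY - 1)  # align up
        end = min(base + length, upper)
        if start + size <= end:
            return start
    return None


def reserve(size, lower, upper, free_regions, reserve_in_range=None):
    """Prefer the OS-provided ranged reservation; otherwise search manually."""
    if reserve_in_range is not None:
        return reserve_in_range(size, lower, upper)
    return find_region(free_regions, size, lower, upper)
```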

Now that our memory management primitives were range-aware, I needed to shift my focus over to our VM sharing policies.

One impetus for the Great Interceptor Refactoring was to enable separate Interceptor instances to share a unified pool of VM for trampoline memory. To make this range-aware, I needed to make some additional changes to VMSharingPolicyShared. It would no longer be sufficient to assume that we could just share a single block of trampoline VM — we now needed to make the shared VM policy capable of potentially allocating multiple blocks of VM.

VMSharingPolicyShared now contains a mapping of ranges to VM blocks. If we request a reservation which an existing block satisfies, we re-use that block. On the other hand, if we require a range that is yet unsatisfied, then we need to allocate a new one. I admit that I kind of half-assed the implementation of the data structure we use for the mapping; I was too lazy to implement a fully-fledged interval tree. The current implementation is probably “good enough,” however it’s probably worth fixing at some point.
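A "good enough" lookup of that kind can be as simple as a linear scan. The sketch below illustrates the idea only and does not reproduce the actual data structure: the class and method names are hypothetical, it treats a block as satisfying a request when the block lies entirely inside the requested bounds, and it ignores how much free space remains in the block.

```python
class SharedPools:
    """Linear-scan stand-in for an interval tree of reserved VM blocks."""

    def __init__(self):
        self._blocks = []  # list of (base, size) tuples

    def reserve(self, size, lower, upper, allocate):
        # Re-use any existing block that sits inside the requested bounds.
        for base, block_size in self._blocks:
            if lower <= base and base + block_size <= upper:
                return base
        # Otherwise allocate a new block and remember it for later requests.
        base = allocate(size, lower, upper)
        self._blocks.append((base, size))
        return base
```

With only a handful of blocks per process, the O(n) scan costs very little, which is presumably why an interval tree could wait.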

Finally, I added a new generic class, TrampolinePool, that acts as an abstraction of a reserved block of VM address space. The main interceptor code requests a pool by calling the VM sharing policy’s Reserve method, then it uses the pool to retrieve new Trampoline instances to be populated.

AArch64 Trampolines

It is much simpler to generate trampolines for AArch64 than it is for x86(-64). The most noteworthy addition to the Trampoline class is the WriteLoadLiteral method, which writes an absolute address into the trampoline’s literal pool, followed by writing an LDR instruction referencing that literal into the trampoline.
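As an illustration of what a method like WriteLoadLiteral has to do, here is a toy trampoline buffer. The layout is an assumption made for the sketch (instructions grow upward from the start, literals grow downward from the end, and the size is a multiple of 8), and the class is not the real Trampoline.

```python
import struct


class Trampoline:
    """Toy trampoline: code at the front, 8-byte literal pool at the back."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.code_off = 0        # offset of the next instruction slot
        self.literal_off = size  # literals are allocated downward from here

    def write_instruction(self, word):
        if self.code_off + 4 > self.literal_off:
            raise MemoryError("trampoline is full")
        struct.pack_into("<I", self.buf, self.code_off, word)
        self.code_off += 4

    def write_load_literal(self, address, reg):
        """Store `address` in the pool and emit `LDR Xreg, <that literal>`."""
        if self.code_off + 4 > self.literal_off - 8:
            raise MemoryError("trampoline is full")
        self.literal_off -= 8
        struct.pack_into("<Q", self.buf, self.literal_off, address)
        offset = self.literal_off - self.code_off  # PC-relative byte offset
        imm19 = offset >> 2                        # encoded in 4-byte words
        assert 0 < imm19 < (1 << 18), "literal out of LDR range"
        self.write_instruction(0x58000000 | (imm19 << 5) | reg)
```

In a 32-byte trampoline, the first call with register 16 places the literal at offset 24 and emits `0x580000D0`, which is `ldr x16, #24`.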


Thanks for reading! Coming up next time: My Untrusted Modules Opus.

2018 Roundup: H2 - Preparing to Enable the Launcher Process by Default

This is the fifth post in my “2018 Roundup” series. For an index of all entries, please see my blog entry for Q1.

Yes, you are reading the dates correctly: I am posting this over two years after I began this series. I am trying to get caught up on documenting my past work!

CI and Developer Tooling

Given that the launcher process completely changes how our Win32 Firefox builds start, I needed to update our CI harnesses as well as the launcher process itself. I didn’t do much that was particularly noteworthy from a technical standpoint, but I will mention some important points:

During normal use, the launcher process usually exits immediately after the browser process is confirmed to have started. This was a deliberate design decision that I made. Having the launcher process wait for the browser process to terminate would not do any harm, however I did not want the launcher process hanging around in Task Manager and being misunderstood by users who are checking their browser’s resource usage.

On the other hand, such a design completely breaks scripts that expect to start Firefox and be able to synchronously wait for the browser to exit before continuing! Clearly I needed to provide an opt-in for the latter case, so I added the --wait-for-browser command-line option. The launcher process also implicitly enables this mode under a few other scenarios.
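For script authors, opting in is a matter of adding the flag to the command line. Here is a minimal sketch of a harness helper; the helper names and the executable path are placeholders, and only the `--wait-for-browser` option comes from the text above.

```python
import subprocess


def firefox_command(firefox_path, wait_for_exit, extra_args=()):
    """Build an argv list, opting in to synchronous waiting when requested."""
    cmd = [firefox_path]
    if wait_for_exit:
        cmd.append("--wait-for-browser")
    cmd.extend(extra_args)
    return cmd


def run_and_wait(firefox_path, extra_args=()):
    """Start Firefox and block until the process we started has exited."""
    return subprocess.run(firefox_command(firefox_path, True, extra_args)).returncode
```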

Secondly, there is the issue of debugging. Developers were previously used to attaching to the first firefox.exe process they see and expecting to be debugging the browser process. With the launcher process enabled by default, this is no longer the case.

There are a few options here:

  • Visual Studio users may install the Child Process Debugging Power Tool, which enables the VS debugger to attach to child processes;
  • WinDbg users may start their debugger with the -o command-line flag, or use the Debug child processes also checkbox in the GUI;
  • I added support for a MOZ_DEBUG_BROWSER_PAUSE environment variable, which allows developers to set a timeout (in seconds) for the browser process to print its pid to stdout and wait for a debugger attachment.

Performance Testing

As I have alluded to in previous posts, I needed to measure the effect of adding an additional process to the critical path of Firefox startup. Since in-process testing will not work in this case, I needed to use something that could provide a holistic view across both launcher and browser processes. I decided to enhance our existing xperf suite in Talos to support my use case.

I already had prior experience with xperf; I spent a significant part of 2013 working with Joel Maher to put the xperf Talos suite into production. I also knew that the existing code was not sufficiently generic to be able to handle my use case.

I threw together a rudimentary analysis framework for working with CSV-exported xperf data. Then, after Joel’s review, I vendored it into mozilla-central and used it to construct an analysis for startup time. [While a more thorough discussion of this framework is definitely warranted, I also feel that it is tangential to the discussion at hand; I’ll write a dedicated blog entry about this topic in the future. – Aaron]

In essence, the analysis considers the following facts when processing an xperf recording:

  • The launcher process will be the first firefox.exe process that runs;
  • The browser process will be started by the launcher process;
  • The browser process will fire a session store window restored event.

For our analysis, we needed to do the following:

  1. Find the event showing the first firefox.exe process being created;
  2. Find the session store window restored event from the second firefox.exe process;
  3. Output the time interval between the two events.

This block of code demonstrates how that analysis is specified using my analyzer framework.
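Setting the real framework aside, the three steps above amount to a small search over a list of events. The following is a sketch only: it assumes events have already been parsed into dictionaries, and every field name and event type in it is hypothetical rather than part of the xperf CSV format or the analyzer framework.

```python
def startup_interval(events):
    """Time from launcher creation to the browser's window-restored event."""
    starts = sorted(
        (e for e in events
         if e["type"] == "process_start" and e["image"] == "firefox.exe"),
        key=lambda e: e["time"],
    )
    if not starts:
        return None
    launcher = starts[0]  # step 1: the first firefox.exe to be created
    # The browser process is the firefox.exe started by the launcher.
    browser_pids = {e["pid"] for e in starts[1:]
                    if e["parent_pid"] == launcher["pid"]}
    for e in sorted(events, key=lambda e: e["time"]):
        # Step 2: the restored event fired by the browser process.
        if e["type"] == "session_restored" and e["pid"] in browser_pids:
            return e["time"] - launcher["time"]  # step 3: the interval
    return None
```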

Overall, these test results were quite positive. We saw a very slight, imperceptible increase in startup time on machines with solid-state drives, however the security benefits from the launcher process outweigh this very small regression.

Most interestingly, we saw a significant improvement in startup time on Windows 10 machines with magnetic hard disks! As I mentioned in Q2 Part 3, I believe this improvement is due to reduced hard disk seeking thanks to the launcher process forcing \windows\system32 to the front of the dynamic linker’s search path.

Error and Experimentation Readiness

By Q3 I had the launcher process in a state where it was built by default into Firefox, but it was still opt-in. As I have written previously, we needed the launcher process to gracefully fail even without having the benefit of various Gecko services such as preferences and the crash reporter.

Error Propagation

Firstly, I created a new class, WindowsError, that encapsulates all types of Windows error codes. As an aside, I would strongly encourage all Gecko developers who are writing new code that invokes Windows APIs to use this class in their error handling.

WindowsError is currently able to store Win32 DWORD error codes, NTSTATUS error codes, and HRESULT error codes. Internally the code is stored as an HRESULT, since that type has encodings to support the other two. WindowsError also provides a method to convert its error code to a localized string for human-readable output.
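The reason HRESULT can hold the other two is visible in the standard conversions, which mirror the Windows SDK macros HRESULT_FROM_WIN32 and HRESULT_FROM_NT. The sketch below shows that arithmetic in Python; it is not the WindowsError class itself, and the function names are illustrative.

```python
FACILITY_WIN32 = 7
FACILITY_NT_BIT = 0x10000000
SEVERITY_ERROR = 0x80000000


def hresult_from_win32(code):
    """Wrap a Win32 DWORD error code the way HRESULT_FROM_WIN32 does."""
    if code == 0 or code & SEVERITY_ERROR:
        return code & 0xFFFFFFFF  # success, or already an HRESULT
    return (code & 0xFFFF) | (FACILITY_WIN32 << 16) | SEVERITY_ERROR


def hresult_from_nt(status):
    """Tag an NTSTATUS the way HRESULT_FROM_NT does."""
    return (status | FACILITY_NT_BIT) & 0xFFFFFFFF
```

ERROR_ACCESS_DENIED (5) becomes `0x80070005`, and STATUS_ACCESS_VIOLATION (`0xC0000005`) becomes `0xD0000005`.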

As for the launcher process itself, nearly every function in the launcher process returns a mozilla::Result-based type. In case of error, we return a LauncherResult, which [as of 2018; this has changed more recently – Aaron] is a structure containing the error’s source file, line number, and WindowsError describing the failure.

Detecting Browser Process Failures

While all Results in the launcher process may be indicating a successful start, we may not yet be out of the woods! Consider the possibility that the various interventions taken by the launcher process might have somehow impaired the browser process’ ability to start!

To deal with this situation, the launcher process and the browser process share code that tracks whether both processes successfully started in sequence.

When the launcher process is started, it checks information recorded about the previous run. If the browser process previously failed to start correctly, the launcher process disables itself and proceeds to start the browser process without any of its typical interventions.

Once the browser has successfully started, it reflects the launcher process state into telemetry, preferences, and about:support.

Future attempts to start Firefox will bypass the launcher process until the next time the installation’s binaries are updated, at which point we reset and attempt once again to start with the launcher process. We do this in the hope that whatever was failing in version n might be fixed in version n + 1.
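The behaviour described in this section can be modelled as a tiny state machine. This is a sketch of the logic only: the persisted `state` dictionary, its keys, and the function names are all hypothetical, and nothing here reflects how or where the real implementation stores its data.

```python
def should_use_launcher(state, installed_version):
    """Decide at startup; `state` is assumed to persist between runs."""
    if state.get("version") != installed_version:
        # The binaries were updated: forget earlier failures and try again.
        state.clear()
        state["version"] = installed_version
    if state.get("disabled"):
        return False
    if state.get("launcher_started") and not state.get("browser_started"):
        # The previous run reached the launcher, but the browser never
        # reported a successful start, so stop intervening.
        state["disabled"] = True
        return False
    state["launcher_started"] = True
    state["browser_started"] = False
    return True


def note_browser_started(state):
    """Called by the browser process once it has started successfully."""
    state["browser_started"] = True
```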

Note that this update behaviour implies that there is no way to forcibly and permanently disable the launcher process. This is by design: the error detection feature is designed to prevent the browser from becoming unusable, not to provide configurability. The launcher process is a security feature and not something that we should want users adjusting any more than we would want users to be disabling the capability system or some other important security mitigation. In fact, my original roadmap for InjectEject called for eventually removing the failure detection code if the launcher failure rate ever reached zero.

Experimentation and Emergency

The pref reflection built into the failure detection system is bi-directional. This allowed us to ship a release where we ran a study with a fraction of users running with the launcher process enabled by default.

Once we rolled out the launcher process at 100%, this pref also served as a useful “emergency kill switch” that we could have flipped if necessary.

Fortunately our experiments were successful and we rolled the launcher process out to release at 100% without ever needing the kill switch!

At this point, this pref should probably be removed, as we no longer need nor want to control launcher process deployment in this way.

Error Reporting

When telemetry is enabled, the launcher process is able to convert its LauncherResult into a ping which is sent in the background by ping-sender. When telemetry is disabled, we perform a last-ditch effort to surface the error by logging details about the LauncherResult failure in the Windows Event Log.

In Conclusion

Thanks for reading! This concludes my 2018 Roundup series! There is so much more work from 2018 that I did for this project that I wish I could discuss, but for security reasons I must refrain. Nonetheless, I hope you enjoyed this series. Stay tuned for more roundups in the future!

2018 Roundup: Q2, Part 3 - Fleshing Out the Launcher Process

This is the fourth post in my “2018 Roundup” series. For an index of all entries, please see my blog entry for Q1.

Yes, you are reading the dates correctly: I am posting this nearly two years after I began this series. I am trying to get caught up on documenting my past work!

Once I had landed the skeletal implementation of the launcher process, it was time to start making it do useful things.

Ensuring Medium Integrity

[For an overview of Windows integrity levels, check out this MSDN page – Aaron]

Since Windows Vista, security tokens for standard users have run at a medium integrity level (IL) by default. When UAC is enabled, members of the Administrators group also run as a standard user with a medium IL, with the additional ability of being able to “elevate” themselves to a high IL. When UAC is disabled, an administrator receives a token that always runs at the high integrity level.

Running a process at a high IL is something that is not to be taken lightly: at that level, the process may alter system settings and access files that would otherwise be restricted by the OS.

While our sandboxed content processes always run at a low IL, I believed that defense-in-depth called for ensuring that the browser process did not run at a high IL. In particular, I was concerned about cases where elevation might be accidental. Consider, for example, a hypothetical scenario where a system administrator is running two open command prompts, one elevated and one not, and they accidentally start Firefox from the one that is elevated.

This was a perfect use case for the launcher process: it detects whether it is running at high IL, and if so, it launches the browser with medium integrity.

Unfortunately some users prefer to configure their accounts to run at all times as Administrator with high integrity! This is a terrible idea from a security perspective, but it is what it is; in my experience, most users who run with this configuration do so deliberately, and they have no interest in being lectured about it.

Unfortunately, users running under this account configuration will experience side-effects of the Firefox browser process running at medium IL. Specifically, a medium IL process is unable to initiate IPC connections with a process running at a higher IL. This will break features such as drag-and-drop, since even the administrator’s shell processes are running at a higher IL than Firefox.

Being acutely aware of this issue, I included an escape hatch for these users: I implemented a command line option that prevents the launcher process from de-elevating when running with a high IL. I hate that I needed to do this, but moral suasion was not going to be an effective technique for solving this problem.
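Put together, the de-elevation rule and its escape hatch reduce to a short decision. In this sketch the RID values are the standard Windows mandatory-label constants, while the function name and the `keep_elevated` parameter are hypothetical stand-ins for the launcher's logic and its command line option.

```python
SECURITY_MANDATORY_LOW_RID = 0x1000
SECURITY_MANDATORY_MEDIUM_RID = 0x2000
SECURITY_MANDATORY_HIGH_RID = 0x3000


def browser_integrity_level(launcher_il, keep_elevated=False):
    """Integrity level to give the browser, based on the launcher's own IL."""
    if launcher_il >= SECURITY_MANDATORY_HIGH_RID and not keep_elevated:
        return SECURITY_MANDATORY_MEDIUM_RID  # de-elevate the browser
    return launcher_il                        # otherwise leave it unchanged
```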

Process Mitigation Policies

Another tool that the launcher process enables us to utilize is process mitigation options. Introduced in Windows 8, the kernel provides several opt-in flags that allow us to add prophylactic policies to our processes in an effort to harden them against attacks.

Additional flags have been added over time, so we must be careful to only set flags that are supported by the version of Windows on which we’re running.

We could have set some of these policies by calling the SetProcessMitigationPolicy API. Unfortunately this API is designed for a process to use on itself once it is already running. This implies that there is a window of time between process creation and the time that the process enables its mitigations where an attack could occur.

Fortunately, Windows provides a second avenue for setting process mitigation flags: These flags may be set as part of an attribute list in the STARTUPINFOEX structure that we pass into CreateProcess.

Perhaps you can now see where I am going with this: The launcher process enables us to specify process mitigation flags for the browser process at the time of browser process creation, thus preventing the aforementioned window of opportunity for attacks to occur!

While there are other flags that we could support in the future, the initial mitigation policy that I added was the PROCESS_CREATION_MITIGATION_POLICY_IMAGE_LOAD_PREFER_SYSTEM32_ALWAYS_ON flag. [Note that I am only discussing flags applied to the browser process; sandboxed processes receive additional mitigations. – Aaron] This flag forces the Windows loader to always use the Windows system32 directory as the first directory in its search path, which prevents library preload attacks. Using this mitigation also gave us an unexpected performance gain on devices with magnetic hard drives: most of our DLL dependencies are either loaded using absolute paths, or reside in system32. With system32 at the front of the loader’s search path, the resulting reduction in hard disk seek times produced a slight but meaningful decrease in browser startup time! How I made these measurements is addressed in a future post.

Next Time

This concludes the Q2 topics that I wanted to discuss. Thanks for reading! Coming up in H2: Preparing to Enable the Launcher Process by Default.

2018 Roundup: Q2, Part 2 - Implementing a Skeletal Launcher Process

This is the third post in my “2018 Roundup” series. For an index of all entries, please see my blog entry for Q1.

Yes, you are reading the dates correctly: I am posting this nearly two years after I began this series. I am trying to get caught up on documenting my past work!

One of the things I added to Firefox for Windows was a new process called the “launcher process.” “Bootstrap process” would be a better name, but we already used the term “bootstrap” for our XPCOM initialization code. Instead of overloading that term and adding potential confusion, I opted for using “launcher process” instead.

The launcher process is intended to be the first process that runs when the user starts Firefox. Its sole purpose is to create the “real” browser process in a suspended state, set various attributes on the browser process, resume the browser process, and then self-terminate.

In bug 1454745 I implemented an initial skeletal (and opt-in) implementation of the launcher process.

This seems like pretty straightforward code, right? Naïvely, one could just rip a CreateProcess sample off of MSDN and call it a day. The actual launcher process implementation is more complicated than that, for reasons that I will outline in the following sections.

Built into firefox.exe

I wanted the launcher process to exist as a special “mode” of firefox.exe, as opposed to a distinct executable.

Performance

By definition, the launcher process lies on the critical path to browser startup. I needed to be very conscious of how we affect overall browser startup time.

Since the launcher process is built into firefox.exe, I needed to examine that executable’s existing dependencies to ensure that it is not loading any dependent libraries that are not actually needed by the launcher process. Other than the essential Win32 DLLs kernel32.dll and advapi32.dll (and their dependencies), I did not want anything else to load. In particular, I wanted to avoid loading user32.dll and/or gdi32.dll, as this would trigger the initialization of Windows’ GUI facilities, which would be a huge performance killer. For that reason, most browser-mode library dependencies of firefox.exe are either delay-loaded or are explicitly loaded via LoadLibrary.

Safe Mode

We wanted the launcher process both to respect Firefox’s safe mode and to alter its behaviour as necessary when safe mode is requested.

There are multiple mechanisms used by Firefox to detect safe mode. The launcher process detects all of them except for one: Testing whether the user is holding the shift key. Retrieving keyboard state would trigger loading of user32.dll, which would harm performance as I described above.

This is not too severe an issue in practice: The browser process itself would still detect the shift key. Furthermore, while the launcher process may in theory alter its behaviour depending on whether or not safe mode is requested, none of its behaviour changes are significant enough to materially affect the browser’s ability to start in safe mode.

Also note that, for serious cases where the browser is repeatedly unable to start, the browser triggers a restart in safe mode via environment variable, which is a mechanism that the launcher process honours.

Testing and Automation

We wanted the launcher process to behave well with respect to automated testing.

The skeletal launcher process that I landed in Q2 included code to pass its console handles on to the browser process, but there was more work necessary to completely handle this case. These capabilities were not yet an issue because the launcher process was opt-in at the time.

Error Recovery

We wanted the launcher process to gracefully handle failures even though, also by definition, it does not have access to facilities that internal Gecko code has, such as preferences and the crash reporter.

The skeletal launcher process that I landed in Q2 did not yet utilize any special error handling code, but this was also not yet an issue because the launcher process was opt-in at this point.

Next Time

Thanks for reading! Coming up in Q2, Part 3: Fleshing Out the Launcher Process

Coming Around Full Circle

One thing about me that most Mozillians don’t know is that, when I first applied to work at MoCo, I had applied to work on the mobile platform. When all was said and done, it was decided at the time that I would be a better fit for an opening on Taras Glek’s platform performance team.

My first day at Mozilla was October 15, 2012 — I will be celebrating my seventh anniversary at MoCo in just a couple short weeks! Some people with similar tenures have suggested to me that we are now “old guard,” but I’m not sure that I feel that way! Anyway, I digress.

The platform performance team eventually evolved into a desktop-focused performance team by late 2013. By the end of 2015 I had decided that it was time for a change, and by March 2016 I had moved over to work for Jim Mathies, focusing on Gecko integration with Windows. I ended up spending the next twenty or so months helping the accessibility team port their Windows implementation over to multiprocess.

Once Firefox Quantum 57 hit the streets, I scoped out and provided technical leadership for the InjectEject project, whose objective was to tackle some of the root problems with DLL injection that were causing us grief in Windows-land.

I am proud to say that, over the past three years on Jim’s team, I have done the best work of my career. I’d like to thank Brad Lassey (now at Google) for his willingness to bring me over to his group, as well as Jim, and David Bolter (a11y manager at the time) for their confidence in me. As somebody who had spent most of his adult life having no confidence in his work whatsoever, their willingness to entrust me with taking on those risks and responsibilities made an enormous difference in my self esteem and my professional life.

Over the course of H1 2019, I began to feel restless again. I knew it was time for another change. What I did not expect was that the agent of that change would be James Willcox, aka Snorp. In Whistler, Snorp planted the seed in my head that I might want to come over to work with him on GeckoView, within the mobile group which David was now managing.

The timing seemed perfect, so I made the decision to move to GeckoView. I had to finish tying up some loose ends with InjectEject, so all the various stakeholders agreed that I’d move over at the end of Q3 2019.

Which brings me to this week, when I officially join the GeckoView team, working for Emily Toop. I find it somewhat amusing that I am now joining the team that evolved from the team that I had originally applied for back in 2012. I have truly come full circle in my career at Mozilla!

So, what’s next?

  • I have a couple of InjectEject bugs that are pretty much finished, but just need some polish and code reviews before landing.

  • For the next month or two at least, I am going to continue to meet weekly with Jim to assist with the transition as he ramps up new staff on the project.

  • I still plan to be the module owner for the Firefox Launcher Process and the MSCOM library, however most day-to-day work will be done by others going forward.

  • I will continue to serve as the mozglue peer in charge of the DLL blocklist and DLL interceptor, with the same caveat.

Switching over to Android from Windows does not mean that I am leaving my Windows experience at the door; I would like to continue to be a resource on that front, so I would encourage people to continue to ask me for advice.

On the other hand, I am very much looking forward to stepping back into the mobile space. My first crack at mobile was as an intern back in 2003, when I was working with some code that had to run on PalmOS 3.0! I have not touched Android since I shipped a couple of utility apps back in 2011, so I am looking forward to learning more about what has changed. I am also looking forward to learning more about native development on Android, which is something that I never really had a chance to try.

As they used to say on Monty Python’s Flying Circus, “And now for something completely different!”

2018 Roundup: Q2, Part 1 - Refactoring the DLL Interceptor

This is the second post in my “2018 Roundup” series. For an index of all entries, please see my blog entry for Q1.

As I have alluded to previously, Gecko includes a Detours-style API hooking mechanism for Windows. In Gecko, this code is referred to as the “DLL Interceptor.” We use the DLL interceptor to instrument various functions within our own processes. As a prerequisite for future DLL injection mitigations, I needed to spend a good chunk of Q2 refactoring this code. While I was in there, I took the opportunity to improve the interceptor’s memory efficiency, thus benefitting the Fission MemShrink project. [When these changes landed, we were not yet tracking the memory savings, but I will include a rough estimate later in this post.]

A Brief Overview of Detours-style API Hooking

While many distinct function hooking techniques are used in the Windows ecosystem, the Detours-style hook is one of the most effective and most popular. While I am not going to go into too many specifics here, I’d like to offer a quick overview. In this description, “target” is the function being hooked.

Here is what happens when a function is detoured:

  1. Allocate a chunk of memory to serve as a “trampoline.” We must be able to adjust the protection attributes on that memory.

  2. Disassemble enough of the target to make room for a jmp instruction. On 32-bit x86 processors, this requires 5 bytes. x86-64 is more complicated, but generally, to jmp to an absolute address, we try to make room for 13 bytes.

  3. Copy the instructions from step 2 over to the trampoline.

  4. At the beginning of the target function, write a jmp to the hook function.

  5. Append additional instructions to the trampoline that, when executed, will cause the processor to jump back to the first valid instruction after the jmp written in step 4.

  6. If the hook function wants to pass control on to the original target function, it calls the trampoline.

Note that these steps don’t occur exactly in the order specified above; I selected the above ordering in an effort to simplify my description.
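
To make step 4 a little more concrete, here is a minimal sketch of how a 5-byte jmp could be encoded on 32-bit x86. This is not the interceptor's actual code: the name WriteJmpRel32 and the use of a plain byte buffer in place of real executable memory are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size of a 32-bit x86 near jump: one opcode byte plus a 4-byte displacement.
constexpr size_t kJmpRel32Size = 5;

// Writes "jmp rel32" (opcode 0xE9 followed by a little-endian displacement)
// into aCode. aFrom is the address at which the jump will execute and aTo is
// the address it should transfer control to.
inline void WriteJmpRel32(uint8_t* aCode, size_t aCodeLen, uint32_t aFrom,
                          uint32_t aTo) {
  assert(aCodeLen >= kJmpRel32Size);
  // The displacement is measured from the end of the jmp instruction.
  // Unsigned arithmetic wraps modulo 2^32, which also covers backward jumps.
  const uint32_t rel = aTo - (aFrom + static_cast<uint32_t>(kJmpRel32Size));
  aCode[0] = 0xE9;
  for (size_t i = 0; i < 4; ++i) {
    aCode[1 + i] = static_cast<uint8_t>(rel >> (8 * i));
  }
}
```

The five bytes that this overwrites are exactly the bytes that step 2 must first disassemble and step 3 must copy to the trampoline, which is why the patch size determines how much of the target needs to be relocated.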

Here is my attempt at visualizing the control flow of a detoured function on x86-64:

Refactoring

Previously, the DLL interceptor relied on directly manipulating pointers in order to read and write the various instructions involved in the hook. In bug 1432653 I changed things so that the memory operations are parameterized based on two orthogonal concepts:

  • In-process vs out-of-process memory access: I wanted to be able to abstract reads and writes such that we could optionally set a hook in another process from our own.
  • Virtual memory allocation scheme: I wanted to be able to change how trampoline memory was allocated. Previously, each instance of WindowsDllInterceptor allocated its own page of memory for trampolines, but each instance also typically only sets one or two hooks. This means that most of the 4KiB page was unused. Furthermore, since Windows allocates blocks of pages on a 64KiB boundary, this wasted a lot of precious virtual address space in our 32-bit builds.

By refactoring and parameterizing these operations, we ended up with the following combinations:

  • In-process memory access, each WindowsDllInterceptor instance receives its own trampoline space;
  • In-process memory access, all WindowsDllInterceptor instances within a module share trampoline space;
  • Out-of-process memory access, each WindowsDllInterceptor instance receives its own trampoline space;
  • Out-of-process memory access, all WindowsDllInterceptor instances within a module share trampoline space (currently not implemented as this option is not particularly useful at the moment).

Instead of directly manipulating pointers, we now use instances of ReadOnlyTargetFunction, WritableTargetFunction, and Trampoline to manipulate our code/data. Those classes in turn use the memory management and virtual memory allocation policies to perform the actual reading and writing.

Memory Management Policies

The interceptor now supports two policies, MMPolicyInProcess and MMPolicyOutOfProcess. Each policy must implement the following memory operations:

  • Read
  • Write
  • Change protection attributes
  • Reserve trampoline space
  • Commit trampoline space

MMPolicyInProcess is implemented using memcpy for read and write, VirtualProtect for protection attribute changes, and VirtualAlloc for reserving and committing trampoline space.
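
As a rough illustration of the idea, the following portable sketch shows a reader class that is parameterized on a memory policy and never dereferences the target pointer itself. The names InProcessPolicySketch and TargetReaderSketch are hypothetical and the signatures are assumptions; the real classes have a much larger surface (protection changes, trampoline reservation and commitment) that is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for an in-process policy: reads and writes are
// plain memcpy calls because the target lives in our own address space.
struct InProcessPolicySketch {
  bool Read(void* aToPtr, const void* aFromPtr, size_t aLen) const {
    std::memcpy(aToPtr, aFromPtr, aLen);
    return true;
  }
  bool Write(void* aToPtr, const void* aFromPtr, size_t aLen) const {
    std::memcpy(aToPtr, aFromPtr, aLen);
    return true;
  }
};

// Hypothetical reader: every access goes through the policy, so an
// out-of-process policy could be substituted without touching this class.
template <typename MMPolicy>
class TargetReaderSketch {
 public:
  TargetReaderSketch(const MMPolicy& aPolicy, const uint8_t* aBase,
                     size_t aLen)
      : mPolicy(aPolicy), mBase(aBase), mLen(aLen) {}

  // Returns the byte at aOffset within the target, read via the policy.
  uint8_t operator[](size_t aOffset) const {
    assert(aOffset < mLen);
    uint8_t result = 0;
    const bool ok = mPolicy.Read(&result, mBase + aOffset, 1);
    assert(ok);
    (void)ok;  // silence unused-variable warnings when asserts are disabled
    return result;
  }

 private:
  MMPolicy mPolicy;  // stored by value; policies in this sketch are stateless
  const uint8_t* mBase;
  size_t mLen;
};
```

Because the policy is a template parameter, the compiler can inline the memcpy calls, so the abstraction need not cost anything at run time in the in-process case.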

MMPolicyOutOfProcess uses ReadProcessMemory and WriteProcessMemory for read and write. As a perf optimization, we try to batch reads and writes together to reduce the system call traffic. We obviously use VirtualProtectEx to adjust protection attributes in the other process.

Out-of-process trampoline reservation and commitment, however, is a bit different and is worth a separate call-out. We allocate trampoline space using shared memory. It is mapped into the local process with read+write permissions using MapViewOfFile. The memory is mapped into the remote process as read+execute using some code that I wrote in bug 1451511 that either uses NtMapViewOfSection or MapViewOfFile2, depending on availability. Individual pages from those chunks are then committed via VirtualAlloc in the local process and VirtualAllocEx in the remote process. This scheme enables us to read and write to trampoline memory directly, without needing to do cross-process reads and writes!

VM Sharing Policies

The code for these policies is a lot simpler than the code for the memory management policies. We now have VMSharingPolicyUnique and VMSharingPolicyShared. Each of these policies must implement the following operations:

  • Reserve space for up to N trampolines of size K;
  • Obtain a Trampoline object for the next available K-byte trampoline slot;
  • Return an iterable collection of all extant trampolines.

VMSharingPolicyShared is actually implemented by delegating to a static instance of VMSharingPolicyUnique.
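
The delegation can be sketched roughly as follows. This is an illustration under stated assumptions, not the real implementation: UniqueSpaceSketch and SharedSpaceSketch are hypothetical names, a std::vector stands in for memory obtained from the memory management policy, and the third operation (iterating over extant trampolines) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical "unique" policy: owns one block of trampoline space and hands
// out fixed-size slots from it.
class UniqueSpaceSketch {
 public:
  // Reserves space for up to aCount trampolines of aSlotSize bytes each.
  // A second call keeps the existing reservation.
  bool Reserve(size_t aCount, size_t aSlotSize) {
    if (!mStorage.empty()) {
      return true;
    }
    if (aCount == 0 || aSlotSize == 0) {
      return false;
    }
    mSlotSize = aSlotSize;
    mStorage.resize(aCount * aSlotSize);
    return true;
  }

  // Returns the next free slot, or nullptr once the reservation is exhausted.
  uint8_t* NextSlot() {
    if (mStorage.empty() || (mUsed + 1) * mSlotSize > mStorage.size()) {
      return nullptr;
    }
    return mStorage.data() + (mUsed++) * mSlotSize;
  }

 private:
  std::vector<uint8_t> mStorage;  // stand-in for reserved virtual memory
  size_t mSlotSize = 0;
  size_t mUsed = 0;
};

// Hypothetical "shared" policy: every instance forwards to one static
// UniqueSpaceSketch, so all users draw slots from the same block.
class SharedSpaceSketch {
 public:
  bool Reserve(size_t aCount, size_t aSlotSize) {
    return Instance().Reserve(aCount, aSlotSize);
  }
  uint8_t* NextSlot() { return Instance().NextSlot(); }

 private:
  static UniqueSpaceSketch& Instance() {
    static UniqueSpaceSketch sInstance;
    return sInstance;
  }
};
```

Note that this sketch is not thread-safe; a real shared policy would need to synchronize access to the shared instance.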

Implications of Refactoring

To determine the performance implications, I added timings to our DLL Interceptor unit test. I was very happy to see that, despite the additional layers of abstraction, the C++ compiler’s optimizer was doing its job: There was no performance impact whatsoever!

Once the refactoring was complete, I switched the default VM Sharing Policy for WindowsDllInterceptor over to VMSharingPolicyShared in bug 1451524.

Browsing today’s mozilla-central tip, I count 14 locations where we instantiate interceptors inside xul.dll. Given that not all interceptors are necessarily instantiated at once, I am now offering a worst-case back-of-the-napkin estimate of the memory savings:

  • Each interceptor would likely be consuming 4KiB (most of which is unused) of committed VM. Due to Windows’ 64KiB allocation granularity, each interceptor would be leaving a further 60KiB of address space in a free but unusable state. Assuming all 14 interceptors were actually instantiated, they would thus consume a combined 56KiB of committed VM and 840KiB of free but unusable address space.
  • By sharing trampoline VM, the interceptors would consume only 4KiB combined and waste only 60KiB of address space, thus yielding savings of 52KiB in committed memory and 780KiB in addressable memory.
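
For anyone who wants to check the napkin math, the estimate above can be written out as compile-time arithmetic. The constant names are made up for this sketch; the inputs (14 interceptors, 4KiB pages, 64KiB allocation granularity) come straight from the bullets above.

```cpp
#include <cstddef>

// All quantities are in KiB.
constexpr size_t kInterceptors = 14;
constexpr size_t kPageKiB = 4;
constexpr size_t kGranularityKiB = 64;

// Address space left free but unusable behind each single-page allocation.
constexpr size_t kStrandedPerBlockKiB = kGranularityKiB - kPageKiB;

// Before: every interceptor allocates its own page.
constexpr size_t kCommittedBeforeKiB = kInterceptors * kPageKiB;
constexpr size_t kStrandedBeforeKiB = kInterceptors * kStrandedPerBlockKiB;

// After: all interceptors share a single page.
constexpr size_t kCommittedAfterKiB = kPageKiB;
constexpr size_t kStrandedAfterKiB = kStrandedPerBlockKiB;

static_assert(kCommittedBeforeKiB == 56, "committed VM before sharing");
static_assert(kStrandedBeforeKiB == 840, "unusable address space before");
static_assert(kCommittedBeforeKiB - kCommittedAfterKiB == 52,
              "committed VM saved");
static_assert(kStrandedBeforeKiB - kStrandedAfterKiB == 780,
              "address space saved");
```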

Oh, and One More Thing

Another problem that I discovered during this refactoring was bug 1459335. It turns out that some of the interceptor’s callers were not distinguishing between “I have not set this hook yet” and “I attempted to set this hook but it failed” scenarios. Across several call sites, I discovered that our code would repeatedly retry setting hooks even when they had previously failed, causing leakage of trampoline space!

To fix this, I modified the interceptor’s interface so that we use one-time initialization APIs to set hooks; since landing this bug, it is no longer possible for clients of the DLL interceptor to set a hook that had previously failed to be set.
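
The distinction between the two scenarios can be sketched with a small tri-state wrapper. This is only an illustration of the idea and not the interface that landed: OneShotHookSketch, HookState, and the std::function-based Set method are all hypothetical, and the sketch is deliberately not thread-safe, whereas real one-time initialization APIs are.

```cpp
#include <functional>

// The three states that callers previously conflated.
enum class HookState { NotAttempted, Failed, Succeeded };

// Hypothetical wrapper: the install callback runs at most once, and its
// outcome is remembered so that a failed hook is never retried.
class OneShotHookSketch {
 public:
  // aInstall stands in for the real work of setting the hook; it returns
  // true on success.
  bool Set(const std::function<bool()>& aInstall) {
    if (mState == HookState::NotAttempted) {
      mState = aInstall() ? HookState::Succeeded : HookState::Failed;
    }
    return mState == HookState::Succeeded;
  }

  HookState State() const { return mState; }

 private:
  HookState mState = HookState::NotAttempted;
};
```

Because a second call to Set returns the remembered result without invoking the callback, a caller that blindly retries can no longer consume additional trampoline space.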

Quantifying the memory costs of this bug is… non-trivial, but it suffices to say that fixing this bug probably resulted in the savings of at least a few hundred KiB in committed VM on affected machines.

That’s it for today’s post, folks! Thanks for reading! Coming up in Q2, Part 2: Implementing a Skeletal Launcher Process

2018 Roundup: Q1 - Learning More About DLLs Injected Into Firefox

I had a very busy 2018. So busy, in fact, that I have not been able to devote any time to actually discussing what I worked on! I had intended to write these posts at the end of December, but a hardware failure delayed that until the new year. Alas, here we are in 2019, and I am going to do a series of retrospectives on last year’s work, broken up by quarter.

Here is an index of all the entries in this series:

Overview

The general theme of my work in 2018 was dealing with the DLL injection problem: On Windows, third parties love to forcibly load their DLLs into other processes — web browsers in particular, thus making Firefox a primary target.

Many of these libraries tend to alter Firefox processes in ways that hurt the stability and/or performance of our code; many chemspill releases have gone out over the years to deal with these problems. While I could rant for hours over this, the fact is that DLL injection is rampant in the ecosystem of Windows desktop applications and is not going to disappear any time soon. In the meantime, we need to be able to deal with it.

Some astute readers might be ready to send me an email or post a comment about how ignorant I am about the new(-ish) process mitigation policies that are available in newer versions of Windows. While those features are definitely useful, they are not panaceas:

  • We cannot turn on the “Extension Point Disable” policy for users of assistive technologies; screen readers rely heavily on DLL injection using SetWindowsHookEx and SetWinEventHook, both of which are covered by this policy;
  • We could enable the “Microsoft Binary Signature” policy, however that requires us to load our own DLLs first before enabling; once that happens, it is often already too late: other DLLs have already injected themselves by the time we are able to activate this policy. (Note that this could easily be solved if this policy were augmented to also permit loading of any DLL signed by the same organization as that of the process’s executable binary, but Microsoft seems to be unwilling to do this.)
  • The above mitigations are not universally available. They do not help us on Windows 7.

For me, Q1 2018 was all about gathering better data about injected DLLs.

Learning More About DLLs Injected into Firefox

One of our major pain points over the years of dealing with injected DLLs has been that the vendor of the DLL is not always apparent to us. In general, our crash reports and telemetry pings only include the leaf name of the various DLLs on a user’s system. This is intentional on our part: we want to preserve user privacy. On the other hand, this severely limits our ability to determine which party is responsible for a particular DLL.

One avenue for obtaining this information is to look at any digital signature that is embedded in the DLL. By examining the certificate that was used to sign the binary, we can extract the organization of the cert’s owner and include that with our crash reports and telemetry.

In bug 1430857 I wrote a bunch of code that enables us to extract that information from signed binaries using the Windows Authenticode APIs. Originally, in that bug, all of that signature extraction work happened from within the browser itself, while it was running: It would gather the cert information on a background thread while the browser was running, and include those annotations in a subsequent crash dump, should such a thing occur.

After some reflection, I realized that I was not gathering annotations in the right place. As an example, what if an injected DLL were to trigger a crash before the background thread had a chance to grab that DLL’s cert information?

I realized that the best place to gather this information was in a post-processing step after the crash dump had been generated, and in fact we already had the right mechanism for doing so: the minidump-analyzer program was already doing post-processing on Firefox crash dumps before sending them back to Mozilla. I moved the signature extraction and crash annotation code out of Gecko and into the analyzer in bug 1436845.

(As an aside, while working on the minidump-analyzer, I found some problems with how it handled command line arguments: it was assuming that main passes its argv as UTF-8, which is not true on Windows. I fixed those issues in bug 1437156.)

In bug 1434489 I also ended up adding this information to the “modules ping” that we have in telemetry; IIRC this ping is only sent weekly. When the modules ping is requested, we gather the module cert info asynchronously on a background thread.

Finally, in bug 1434495, I had to modify Socorro (the back-end for crash-stats) so that it could understand the signature annotations and display them. This required two commits: one to modify the Socorro stackwalker to merge the module signature information into the full crash report, and another to add a “Signed By” column to every report’s “Modules” tab to display the signature information (note that this column is only present when at least one module in a particular crash report contains signature information).

The end result was very satisfying: Most of the injected DLLs in our Windows crash reports are signed, so it is now much easier to identify their vendors!

This project was very satisfying for me in many ways: First of all, surfacing this information was an itch that I had been wanting to scratch for quite some time. Secondly, this really was a “full stack” project, touching everything from extracting signature info from binaries using C++, all the way up to writing some back-end code in Python and a little touch of front-end stuff to surface the data in the web app.

Note that, while this project focused on Windows because of the severity of the library injection problem on that platform, it would be easy enough to reuse most of this code for macOS builds as well; the only major work for the latter case would be for extracting signature information from a dylib. This is not currently a priority for us, though.

Thanks for reading! Coming up in Q2: Refactoring the DLL Interceptor!

Legacy Firefox Extensions and “Userspace”

This week’s release of Firefox Quantum has prompted all kinds of feedback, both positive and negative. That is not surprising to anybody — any software that has a large number of users is going to be a topic for discussion, especially when the release in question is undoubtedly a watershed.

While I have previously blogged about the transition to WebExtensions, now that we have actually passed through the cutoff for legacy extensions, I have decided to add some new commentary on the subject.

One analogy that has been used in the discussion of the extension ecosystem is that of kernelspace and userspace. The crux of the analogy is that Gecko is equivalent to an operating system kernel, and thus extensions are the user-mode programs that run atop that kernel. The argument then follows that Mozilla’s deprecation and removal of legacy extension capabilities is akin to “breaking” userspace. [Some people who say this are using the same tone as Linus does whenever he eviscerates Linux developers who break userspace, which is neither productive nor welcomed by anyone, but I digress.] Unfortunately, that analogy simply does not map to the legacy extension model.

Legacy Extensions as Kernel Modules

The most significant problem with the userspace analogy is that legacy extensions effectively meld with Gecko and become part of Gecko itself. If we accept the premise that Gecko is like a monolithic OS kernel, then we must also accept that the analogical equivalent of loading arbitrary code into that kernel, is the kernel module. Such components are loaded into the kernel and effectively become part of it. Their code runs with full privileges. They break whenever significant changes are made to the kernel itself.

Sound familiar?

Legacy extensions were akin to kernel modules. When there is no abstraction, there can be no such thing as userspace. This is precisely the problem that WebExtensions solves!

Building Out a Legacy API

Maybe somebody out there is thinking, “well what if you took all the APIs that legacy extensions used, turned that into a ‘userspace,’ and then just left that part alone?”

Which APIs? Where do we draw the line? Do we check the code coverage for every legacy addon in AMO and use that to determine what to include?

Remember, there was no abstraction; installed legacy addons are fused to Gecko. If we pledge not to touch anything that legacy addons might touch, then we cannot touch anything at all.

Where do we go from here? Freeze an old version of Gecko and host an entire copy of it inside web content? Compile it to WebAssembly? [Oh God, what have I done?]

If that’s not a maintenance burden, I don’t know what is!

A Kernel Analogy for WebExtensions

Another problem with the legacy-extensions-as-userspace analogy is that it leaves awkward room for web content, whose API is abstract and well-defined. I do not think that it is appropriate to consider web content to be equivalent to a sandboxed application, as sandboxed applications use the same (albeit restricted) API as normal applications. I would suggest that the presence of WebExtensions gives us a better kernel analogy:

  • Gecko is the kernel;
  • WebExtensions are privileged user applications;
  • Web content runs as unprivileged user applications.

In Conclusion

Declaring that legacy extensions are userspace does not make them so. The way that the technology actually worked defies the abstract model that the analogy attempts to impose upon it. On the other hand, we can use the failure of that analogy to explain why WebExtensions are important and construct an extension ecosystem that does fit with that analogy.

Win32 Gotchas

For the second time since I have been at Mozilla I have encountered a situation where hooks are called for notifications of a newly created window, but that window has not yet been initialized properly, causing the hooks to behave badly.

The first time was inside our window neutering code in IPC, while the second time was in our accessibility code.

Every time I have seen this, there is code that follows this pattern:

HWND hwnd = CreateWindowEx(/* ... */);
if (hwnd) {
  // Do some follow-up initialization to hwnd (Using SetProp as an example):
  SetProp(hwnd, "Foo", bar);
}

This seems innocuous enough, right?

The problem is that CreateWindowEx calls hooks. If those hooks then try to do something like GetProp(hwnd, "Foo"), that call is going to fail because the “Foo” prop has not yet been set.

The key takeaway from this is that, if you are creating a new window, you must do any follow-up initialization from within your window proc’s WM_CREATE handler. This will guarantee that your window’s initialization will have completed before any hooks are called.

You might be thinking, “But I don’t set any hooks!” While this may be true, you must not forget about hooks set by third-party code.

“But those hooks won’t know anything about my program’s internals, right?”

Perhaps, perhaps not. But when those hooks fire, they give third-party software the opportunity to run. In some cases, those hooks might even cause the thread to reenter your own code. Your window had better be completely initialized when this happens!

In the case of my latest discovery of this issue in bug 1380471, I made it possible to use a C++11 lambda to simplify this pattern.

CreateWindowEx accepts a lpParam parameter which is then passed to the WM_CREATE handler as the lpCreateParams member of a CREATESTRUCT.

By setting lpParam to a pointer to a std::function<void(HWND)>, we may then supply any callable that we wish for follow-up window initialization.

Using the previous code sample as a baseline, this allows me to revise the code to safely set a property like this:

std::function<void(HWND)> onCreate([](HWND aHwnd) -> void {
  SetProp(aHwnd, "Foo", bar);
});

HWND hwnd = CreateWindowEx(/* ... */, &onCreate);
// At this point it is already too late to further initialize hwnd!

Note that since lpParam is always passed during WM_CREATE, which always fires before CreateWindowEx returns, it is safe for onCreate to live on the stack.

I liked this solution for the a11y case because it preserved the locality of the initialization code within the function that called CreateWindowEx; the window proc for this window is implemented in another source file and the follow-up initialization depends on the context surrounding the CreateWindowEx call.

Speaking of window procs, here is how that window’s WM_CREATE handler invokes the callable:

switch (uMsg) {
  case WM_CREATE: {
    auto createStruct = reinterpret_cast<CREATESTRUCT*>(lParam);
    auto createProc = reinterpret_cast<std::function<void(HWND)>*>(
      createStruct->lpCreateParams);

    if (createProc && *createProc) {
      (*createProc)(hwnd);
    }

    return 0;
  }
  // ...

TL;DR: If you see a pattern where further initialization work is being done on an HWND after a CreateWindowEx call, move that initialization code to your window’s WM_CREATE handler instead.

Why I Prefer Using CRITICAL_SECTIONs for Mutexes in Windows Nightly Builds

In the past I have argued that our Nightly builds, both debug and release, should use CRITICAL_SECTIONs (with full debug info) for our implementation of mozilla::Mutex. I’d like to illustrate some reasons why this is so useful.

They enable more utility in WinDbg extensions

Every time you initialize a CRITICAL_SECTION, Windows inserts the CS’s debug info into a process-wide linked list. This enables their discovery by the Windows debugging engine, and makes the !cs, !critsec, and !locks commands more useful.

They enable profiling of their initialization and acquisition

When the “Create user mode stack trace database” gflag is enabled, Windows records the call stack of the thread that called InitializeCriticalSection on that CS. Windows also records the call stack of the owning thread once it has acquired the CS. This can be very useful for debugging deadlocks.

They track their contention counts

Since every CS has been placed in a process-wide linked list, we may now ask the debugger to dump statistics about every live CS in the process. In particular, we can ask the debugger to output the contention counts for each CS in the process. After running a workload against Nightly, we may then take the contention output, sort it in descending order, and determine which CRITICAL_SECTIONs are the most contended in the process.

We may then want to more closely inspect the hottest CSes to determine whether there is anything that we can do to reduce contention and all of the extra context switching that entails.

In Summary

When we use SRWLOCKs or initialize our CRITICAL_SECTIONs with the CRITICAL_SECTION_NO_DEBUG_INFO flag, we are denying ourselves access to this information. That’s fine on release builds, but on Nightly I think it is worth having around. While I realize that most Mozilla developers have not used this until now (otherwise I would not be writing this blog post), this rich debugger info is one of those things that you do not miss until you do not have it.

For further reading about critical section debug info, check out this archived article from MSDN Magazine.