Aaron Klotz’s Software Blog

My Adventures in Software Development

2018 Roundup: H2 - Preparing to Enable the Launcher Process by Default

| Comments

This is the fifth post in my “2018 Roundup” series. For an index of all entries, please see my blog entry for Q1.

Yes, you are reading the dates correctly: I am posting this over two years after I began this series. I am trying to get caught up on documenting my past work!

CI and Developer Tooling

Given that the launcher process completely changes how our Win32 Firefox builds start, I needed to update both our CI harnesses, as well as the launcher process itself. I didn’t do much that was particularly noteworthy from a technical standpoint, but I will mention some important points:

During normal use, the launcher process usually exits immediately after the browser process is confirmed to have started. This was a deliberate design decision that I made. Having the launcher process wait for the browser process to terminate would not do any harm, however I did not want the launcher process hanging around in Task Manager and being misunderstood by users who are checking their browser’s resource usage.

On the other hand, such a design completely breaks scripts that expect to start Firefox and be able to synchronously wait for the browser to exit before continuing! Clearly I needed to provide an opt-in for the latter case, so I added the --wait-for-browser command-line option. The launcher process also implicitly enables this mode under a few other scenarios.

Secondly, there is the issue of debugging. Developers were previously used to attaching to the first firefox.exe process they see and expecting to be debugging the browser process. With the launcher process enabled by default, this is no longer the case.

There are few options here:

  • Visual Studio users may install the Child Process Debugging Power Tool, which enables the VS debugger to attach to child processes;
  • WinDbg users may start their debugger with the -o command-line flag, or use the Debug child processes also checkbox in the GUI;
  • I added support for a MOZ_DEBUG_BROWSER_PAUSE environment variable, which allows developers to set a timeout (in seconds) for the browser process to print its pid to stdout and wait for a debugger attachment.

Performance Testing

As I have alluded to in previous posts, I needed to measure the effect of adding an additional process to the critical path of Firefox startup. Since in-process testing will not work in this case, I needed to use something that could provide a holistic view across both launcher and browser processes. I decided to enhance our existing xperf suite in Talos to support my use case.

I already had prior experience with xperf; I spent a significant part of 2013 working with Joel Maher to put the xperf Talos suite into production. I also knew that the existing code was not sufficiently generic to be able to handle my use case.

I threw together a rudimentary analysis framework for working with CSV-exported xperf data. Then, after Joel’s review, I vendored it into mozilla-central and used it to construct an analysis for startup time. [While a more thorough discussion of this framework is definitely warranted, I also feel that it is tangential to the discussion at hand; I’ll write a dedicated blog entry about this topic in the future. – Aaron]

In essence, the analysis considers the following facts when processing an xperf recording:

  • The launcher process will be the first firefox.exe process that runs;
  • The browser process will be started by the launcher process;
  • The browser process will fire a session store window restored event.

For our analysis, we needed to do the following:

  1. Find the event showing the first firefox.exe process being created;
  2. Find the session store window restored event from the second firefox.exe process;
  3. Output the time interval between the two events.

This block of code demonstrates how that analysis is specified using my analyzer framework.

Overall, these test results were quite positive. We saw a very slight but imperceptible increase in startup time on machines with solid-state drives, however the security benefits from the launcher process outweigh this very small regression.

Most interestingly, we saw a signficant improvement in startup time on Windows 10 machines with magnetic hard disks! As I mentioned in Q2 Part 3, I believe this improvement is due to reduced hard disk seeking thanks to the launcher process forcing \windows\system32 to the front of the dynamic linker’s search path.

Error and Experimentation Readiness

By Q3 I had the launcher process in a state where it was built by default into Firefox, but it was still opt-in. As I have written previously, we needed the launcher process to gracefully fail even without having the benefit of various Gecko services such as preferences and the crash reporter.

Error Propagation

Firstly, I created a new class, WindowsError, that encapsulates all types of Windows error codes. As an aside, I would strongly encourage all Gecko developers who are writing new code that invokes Windows APIs to use this class in your error handling.

WindowsError is currently able to store Win32 DWORD error codes, NTSTATUS error codes, and HRESULT error codes. Internally the code is stored as an HRESULT, since that type has encodings to support the other two. WindowsError also provides a method to convert its error code to a localized string for human-readable output.

As for the launcher process itself, nearly every function in the launcher process returns a mozilla::Result-based type. In case of error, we return a LauncherResult, which [as of 2018; this has changed more recently – Aaron] is a structure containing the error’s source file, line number, and WindowsError describing the failure.

Detecting Browser Process Failures

While all Results in the launcher process may be indicating a successful start, we may not yet be out of the woods! Consider the possibility that the various interventions taken by the launcher process might have somehow impaired the browser process’ ability to start!

To deal with this situation, the launcher process and the browser process share code that tracks whether both processes successfully started in sequence.

When the launcher process is started, it checks information recorded about the previous run. If the browser process previously failed to start correctly, the launcher process disables itself and proceeds to start the browser process without any of its typical interventions.

Once the browser has successfully started, it reflects the launcher process state into telemetry, preferences, and about:support.

Future attempts to start Firefox will bypass the launcher process until the next time the installation’s binaries are updated, at which point we reset and attempt once again to start with the launcher process. We do this in the hope that whatever was failing in version n might be fixed in version n + 1.

Note that this update behaviour implies that there is no way to forcibly and permanently disable the launcher process. This is by design: the error detection feature is designed to prevent the browser from becoming unusable, not to provide configurability. The launcher process is a security feature and not something that we should want users adjusting any more than we would want users to be disabling the capability system or some other important security mitigation. In fact, my original roadmap for InjectEject called for eventually removing the failure detection code if the launcher failure rate ever reached zero.

Experimentation and Emergency

The pref reflection built into the failure detection system is bi-directional. This allowed us to ship a release where we ran a study with a fraction of users running with the launcher process enabled by default.

Once we rolled out the launcher process at 100%, this pref also served as a useful “emergency kill switch” that we could have flipped if necessary.

Fortunately our experiments were successful and we rolled the launcher process out to release at 100% without ever needing the kill switch!

At this point, this pref should probably be removed, as we no longer need nor want to control launcher process deployment in this way.

Error Reporting

When telemetry is enabled, the launcher process is able to convert its LauncherResult into a ping which is sent in the background by ping-sender. When telemetry is disabled, we perform a last-ditch effort to surface the error by logging details about the LauncherResult failure in the Windows Event Log.

In Conclusion

Thanks for reading! This concludes my 2018 Roundup series! There is so much more work from 2018 that I did for this project that I wish I could discuss, but for security reasons I must refrain. Nonetheless, I hope you enjoyed this series. Stay tuned for more roundups in the future!

2018 Roundup: Q2, Part 3 - Fleshing Out the Launcher Process

| Comments

This is the fourth post in my “2018 Roundup” series. For an index of all entries, please see my blog entry for Q1.

Yes, you are reading the dates correctly: I am posting this nearly two years after I began this series. I am trying to get caught up on documenting my past work!

Once I had landed the skeletal implementation of the launcher process, it was time to start making it do useful things.

Ensuring Medium Integrity

[For an overview of Windows integrity levels, check out this MSDN page – Aaron]

Since Windows Vista, security tokens for standard users have run at a medium integrity level (IL) by default. When UAC is enabled, members of the Administrators group also run as a standard user with a medium IL, with the additional ability of being able to “elevate” themselves to a high IL. When UAC is disabled, an administrator receives a token that always runs at the high integrity level.

Running a process at a high IL is something that is not to be taken lightly: at that level, the process may alter system settings and access files that would otherwise be restricted by the OS.

While our sandboxed content processes always run at a low IL, I believed that defense-in-depth called for ensuring that the browser process did not run at a high IL. In particular, I was concerned about cases where elevation might be accidental. Consider, for example, a hypothetical scenario where a system administrator is running two open command prompts, one elevated and one not, and they accidentally start Firefox from the one that is elevated.

This was a perfect use case for the launcher process: it detects whether it is running at high IL, and if so, it launches the browser with medium integrity.

Unfortunately some users prefer to configure their accounts to run at all times as Administrator with high integrity! This is terrible idea from a security perspective, but it is what it is; in my experience, most users who run with this configuration do so deliberately, and they have no interest in being lectured about it.

Unfortunately, users running under this account configuration will experience side-effects of the Firefox browser process running at medium IL. Specifically, a medium IL process is unable to initiate IPC connections with a process running at a higher IL. This will break features such as drag-and-drop, since even the administrator’s shell processes are running at a higher IL than Firefox.

Being acutely aware of this issue, I included an escape hatch for these users: I implemented a command line option that prevents the launcher process from de-elevating when running with a high IL. I hate that I needed to do this, but moral suasion was not going to be an effective technique for solving this problem.

Process Mitigation Policies

Another tool that the launcher process enables us to utilize is process mitigation options. Introduced in Windows 8, the kernel provides several opt-in flags that allows us to add prophylactic policies to our processes in an effort to harden them against attacks.

Additional flags have been added over time, so we must be careful to only set flags that are supported by the version of Windows on which we’re running.

We could have set some of these policies by calling the SetProcessMitigationPolicy API. Unfortunately this API is designed for a process to use on itself once it is already running. This implies that there is a window of time between process creation and the time that the process enables its mitigations where an attack could occur.

Fortunately, Windows provides a second avenue for setting process mitigation flags: These flags may be set as part of an attribute list in the STARTUPINFOEX structure that we pass into CreateProcess.

Perhaps you can now see where I am going with this: The launcher process enables us to specify process mitigation flags for the browser process at the time of browser process creation, thus preventing the aforementioned window of opportunity for attacks to occur!

While there are other flags that we could support in the future, the initial mitigation policy that I added was the PROCESS_CREATION_MITIGATION_POLICY_IMAGE_LOAD_PREFER_SYSTEM32_ALWAYS_ON flag. [Note that I am only discussing flags applied to the browser process; sandboxed processes receive additional mitigations. – Aaron] This flag forces the Windows loader to always use the Windows system32 directory as the first directory in its search path, which prevents library preload attacks. Using this mitigation also gave us an unexpected performance gain on devices with magnetic hard drives: most of our DLL dependencies are either loaded using absolute paths, or reside in system32. With system32 at the front of the loader’s search path, the resulting reduction in hard disk seek times produced a slight but meaningful decrease in browser startup time! How I made these measurements is addressed in a future post.

Next Time

This concludes the Q2 topics that I wanted to discuss. Thanks for reading! Coming up in H2: Preparing to Enable the Launcher Process by Default.

2018 Roundup: Q2, Part 2 - Implementing a Skeletal Launcher Process

| Comments

This is the third post in my “2018 Roundup” series. For an index of all entries, please see my blog entry for Q1.

Yes, you are reading the dates correctly: I am posting this nearly two years after I began this series. I am trying to get caught up on documenting my past work!

One of the things I added to Firefox for Windows was a new process called the “launcher process.” “Bootstrap process” would be a better name, but we already used the term “bootstrap” for our XPCOM initialization code. Instead of overloading that term and adding potential confusion, I opted for using “launcher process” instead.

The launcher process is intended to be the first process that runs when the user starts Firefox. Its sole purpose is to create the “real” browser process in a suspended state, set various attributes on the browser process, resume the browser process, and then self-terminate.

In bug 1454745 I implemented an initial skeletal (and opt-in) implementation of the launcher process.

This seems like pretty straightforward code, right? Naïvely, one could just rip a CreateProcess sample off of MSDN and call it day. The actual launcher process implementation is more complicated than that, for reasons that I will outline in the following sections.

Built into firefox.exe

I wanted the launcher process to exist as a special “mode” of firefox.exe, as opposed to a distinct executable.

Performance

By definition, the launcher process lies on the critical path to browser startup. I needed to be very conscious of how we affect overall browser startup time.

Since the launcher process is built into firefox.exe, I needed to examine that executable’s existing dependencies to ensure that it is not loading any dependent libraries that are not actually needed by the launcher process. Other than the essential Win32 DLLs kernel32.dll and advapi32.dll (and their dependencies), I did not want anything else to load. In particular, I wanted to avoid loading user32.dll and/or gdi32.dll, as this would trigger the initialization of Windows’ GUI facilities, which would be a huge performance killer. For that reason, most browser-mode library dependencies of firefox.exe are either delay-loaded or are explicitly loaded via LoadLibrary.

Safe Mode

We wanted the launcher process to both respect Firefox’s safe mode, as well as alter its behaviour as necessary when safe mode is requested.

There are multiple mechanisms used by Firefox to detect safe mode. The launcher process detects all of them except for one: Testing whether the user is holding the shift key. Retrieving keyboard state would trigger loading of user32.dll, which would harm performance as I described above.

This is not too severe an issue in practice: The browser process itself would still detect the shift key. Furthermore, while the launcher process may in theory alter its behaviour depending on whether or not safe mode is requested, none of its behaviour changes are significant enough to materially affect the browser’s ability to start in safe mode.

Also note that, for serious cases where the browser is repeatedly unable to start, the browser triggers a restart in safe mode via environment variable, which is a mechanism that the launcher process honours.

Testing and Automation

We wanted the launcher process to behave well with respect to automated testing.

The skeletal launcher process that I landed in Q2 included code to pass its console handles on to the browser process, but there was more work necessary to completely handle this case. These capabilities were not yet an issue because the launcher process was opt-in at the time.

Error Recovery

We wanted the launcher process to gracefully handle failures even though, also by definition, it does not have access to facilities that internal Gecko code has, such as preferences and the crash reporter.

The skeletal launcher process that I landed in Q2 did not yet utilize any special error handling code, but this was also not yet an issue because the launcher process was opt-in at this point.

Next Time

Thanks for reading! Coming up in Q2, Part 3: Fleshing Out the Launcher Process