I have recently been working with DnsQueryEx, but unfortunately this has been less than pleasant. Not only are there errors in its documentation, but the API itself contains a bug that, IMHO, should never have made it to release.
Like many other Win32 APIs, DnsQueryEx is an asynchronous interface that also supports being called synchronously. Whether their completion mechanism uses an event object, an APC, an I/O Completion Port, or some other technique, asynchronous Win32 APIs consistently employ a common convention: when a caller invokes the API, and that API is able to execute asynchronously, it returns ERROR_IO_PENDING. On the other hand, when the API fails, is able to immediately satisfy the request, or was invoked synchronously, the function immediately returns the final error code.
For emphasis: In Win32, most asynchronous APIs reserve the right to complete synchronously if they are able to immediately satisfy a request.
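A caller-side sketch of this convention (the stub stands in for a real async-capable API; the two error-code values are the genuine Win32 constants):

```cpp
#include <cassert>

// Real Win32 numeric values for these two codes.
constexpr int ERROR_SUCCESS = 0;
constexpr int ERROR_IO_PENDING = 997;

// Stub async-capable API: completes immediately when it can; otherwise it
// returns ERROR_IO_PENDING and would deliver the final result later via its
// completion mechanism (event, APC, IOCP, ...).
int StubAsyncApi(bool canSatisfyImmediately) {
    return canSatisfyImmediately ? ERROR_SUCCESS : ERROR_IO_PENDING;
}

// What a well-behaved caller does with the return value: ERROR_IO_PENDING
// means "wait for completion"; anything else is already the final result.
bool InvokeAndCheckPending(bool canSatisfyImmediately, bool* pendingOut) {
    int rc = StubAsyncApi(canSatisfyImmediately);
    *pendingOut = (rc == ERROR_IO_PENDING);
    return rc == ERROR_SUCCESS || rc == ERROR_IO_PENDING;
}
```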
Enter DnsQueryEx: while its internal implementation follows this convention, the implementation of its public interface does not! This is really easy to reproduce (on a fully-updated Windows 10 21H1, at least) by setting up an asynchronous call to DnsQueryEx and querying for "localhost".
The caller must populate the pQueryCompletionCallback field in the DNS_QUERY_REQUEST structure. DnsQueryEx returns ERROR_SUCCESS. Great, the asynchronous API was able to immediately fulfill the request!
Everything works according to plan until we examine the pQueryRecords field of the DNS_QUERY_RESULT structure. That field is NULL! Every other output from this function points to a successful query, and yet we receive no results!
I spent several hours poring over the documentation and attempting different permutations of the localhost query; however, the only way that I could coerce DnsQueryEx into actually producing the expected output was to invoke it synchronously.
I finally determined that this poking around was becoming futile and decided to examine the disassembly. Here’s some (highly-simplified) pseudocode of what I found:
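In compilable, stubbed form it looks roughly like this (the types and the Query_PrivateExW stand-in are approximations, not the actual dnsapi.dll internals):

```cpp
#include <cassert>
#include <cstddef>

constexpr int ERROR_SUCCESS = 0;
constexpr int ERROR_IO_PENDING = 997;

struct DnsRecord {};                     // stand-in for DNS_RECORD
struct DnsQueryResult {
    DnsRecord* pQueryRecords = nullptr;  // stand-in for DNS_QUERY_RESULT
};

// Stand-in for Query_PrivateExW: for a query like "localhost" it satisfies
// the request immediately, returning ERROR_SUCCESS rather than
// ERROR_IO_PENDING, even when the caller requested an asynchronous query.
int Query_PrivateExW(DnsRecord** aOutRecords) {
    static DnsRecord sRecords;
    *aOutRecords = &sRecords;  // allocated from the heap in the real code
    return ERROR_SUCCESS;
}

// Highly-simplified shape of the public DnsQueryEx wrapper.
int DnsQueryExSim(bool isSynchronous, DnsQueryResult* pQueryResult) {
    DnsRecord* internalRecords = nullptr;
    int win32ErrorCode = Query_PrivateExW(&internalRecords);

    // The records reach the caller only under this condition...
    if (isSynchronous) {
        pQueryResult->pQueryRecords = internalRecords;
    }
    return win32ErrorCode;
}
```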
Based on the background that I outlined above, do you see the bug?
I’ll give you a hint: ERROR_IO_PENDING.
See it now?
Okay, here goes: isSynchronous is the wrong condition for determining whether to copy the internal records to pQueryResult and immediately return! In fact, I would argue that isSynchronous should not be checked at all: instead, DnsQueryEx should be checking that win32ErrorCode != ERROR_IO_PENDING!
To add insult to injury, Query_PrivateExW correctly allocates the output records from the heap, so DnsQueryEx is effectively leaking them.
I’m going to try reporting this issue via Feedback Hub, but if any Microsofties
see this, I’d appreciate it if you could flag the maintainer of dnsapi.dll
and
get this fixed.
I suppose one workaround is to look for a successful call to DnsQueryEx with NULL records, and then fall back to invoking it synchronously. On the other hand, that doesn’t help with the memory leak.
Another gross, hacky option could be to manually check for special queries like localhost prior to calling the API, but this isn’t exhaustive: there could be other reasons that Query_PrivateExW decides to execute synchronously.
As you can see, this is a pretty trivial test case, which is why I find this bug to be so disappointing. I am a big proponent of not attributing bugs to the OS until I have proof otherwise, but the disassembly I encountered was pretty damning.
Hopefully this gets fixed. Until next time…
UPDATE: Microsoft’s Tommy Jensen noted that this bug has been fixed in Windows 11, but unfortunately will not be backported to Windows 10. Thanks to Brad Fitzpatrick for amplifying this post on Twitter.
My first patch landed in Firefox 19, and my final patch as an employee has landed in Nightly for Firefox 93.
I’ll be moving on to something new in a few weeks’ time, but for now, I’d just like to say this:
My time at Mozilla has made me into a better software developer, a better leader, and more importantly, a better person.
I’d like to thank all the Mozillians whom I have interacted with over the years for their contributions to making that happen.
I will continue to update this blog with catch-up posts describing my Mozilla work, though I am unsure what content I will be able to contribute beyond that. Time will tell!
Until next time…
Here is an index of all the entries in this series:
During early 2019, Mozilla was working to port Firefox to run on the new AArch64 builds of Windows. At our December 2018 all-hands, I brought up the necessity of including the DLL Interceptor in our porting efforts. Since no deed goes unpunished, I was put in charge of doing the work! [I’m actually kidding here; this project was right up my alley and I was happy to do it! – Aaron]
Before continuing, you might want to review my previous entry describing the Great Interceptor Refactoring of 2018, as this post revisits some of the concepts introduced there.
Let us review some DLL Interceptor terminology:
On more than one occasion I had to field questions about why this work was even necessary for AArch64: there aren’t going to be many injected DLLs in a Win32 ecosystem running on a shiny new processor architecture! In fact, the DLL Interceptor is used for more than just facilitating the blocking of injected DLLs; we also use it for other purposes.
Not all of this work was done in one bug: some tasks were more urgent than others. I began this project by enumerating our extant uses of the interceptor to determine which instances were relevant to the new AArch64 port. I threw a record of each instance into a colour-coded spreadsheet, which proved to be very useful for tracking progress: Reds were “must fix” instances, yellows were “nice to have” instances, and greens were “fixed” instances. Coordinating with the milestones laid out by program management, I was able to assign each instance to a bucket which would help determine a total ordering for the various fixes. I landed the first set of changes in bug 1526383, and the second set in bug 1532470.
It was now time to sit down, download some AArch64 programming manuals, and take a look at what I was dealing with. While I have been messing around with x86 assembly since I was a teenager, my first exposure to RISC architectures was via the DLX architecture introduced by Hennessy and Patterson in their textbooks. While DLX was crafted specifically for educational purposes, it served for me as a great point of reference. When I was a student taking CS 241 at the University of Waterloo, we had to write a toy compiler that generated DLX code. That experience ended up saving me a lot of time when looking into AArch64! While the latter is definitely more sophisticated, I could clearly recognize analogs between the two architectures.
In some ways, targeting a RISC architecture greatly simplifies things: The DLL Interceptor only needs to concern itself with a small subset of the AArch64 instruction set: loads and branches. In fact, the DLL Interceptor’s AArch64 disassembler only looks for nine distinct instructions! As a bonus, since the instruction length is fixed, we can easily copy over verbatim any instructions that are not loads or branches!
On the other hand, one thing that increased the complexity of the port is that branch instructions to relative addresses have maximum offsets: if we must branch farther than that maximum, we must take alternate measures. For example, in AArch64, an unconditional branch with an immediate offset must land within ±128 MiB of the current program counter.
Why is this a problem, you ask? Well, Detours-style interception must overwrite the first several instructions of the target function. To write an absolute jump, we require at least 16 bytes: 4 for an LDR instruction, 4 for a BR instruction, and another 8 for the 64-bit absolute branch target address.
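As an illustration (my own helper, not the interceptor’s actual code), those 16 bytes can be assembled as LDR X16, <pc+8>; BR X16; followed by the raw 64-bit target:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assemble the 16-byte absolute-jump patch described above. The LDR (literal)
// loads the 8-byte address that immediately follows the BR instruction.
void AssembleAbsoluteJump(uint8_t patch[16], uint64_t target) {
    const uint32_t kLdrX16Pc8 = 0x58000050;  // LDR X16, <pc + 8>
    const uint32_t kBrX16 = 0xD61F0200;      // BR X16
    std::memcpy(patch, &kLdrX16Pc8, 4);
    std::memcpy(patch + 4, &kBrX16, 4);
    std::memcpy(patch + 8, &target, 8);      // the literal that LDR reads
}
```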
Unfortunately, target functions may be really short! Some of the target functions that we need to patch consist only of a single 4-byte instruction!
In this case, our only option for patching the target is to use an immediate B instruction, but that only works if our hook function falls within that ±128 MiB limit. If it does not, we need to construct a veneer. A veneer is a special trampoline whose location falls within the target range of a branch instruction. Its sole purpose is to provide an unconditional jump to the “real” desired branch target that lies outside of the range of the original branch. Using veneers, we can successfully hook a target function even if it is only one instruction (ie, 4 bytes) in length and the hook function lies more than 128 MiB away from it. The AArch64 Procedure Call Standard specifies X16 as a volatile register that is explicitly intended for use by veneers: veneers load an absolute target address into X16 (without needing to worry about whether or not they’re clobbering anything), and then unconditionally jump to it.
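The decision of whether a veneer is required reduces to a range check on the B instruction’s signed 26-bit word offset. A sketch (illustrative helpers, not the interceptor’s code):

```cpp
#include <cassert>
#include <cstdint>

// B <imm26> reaches ±128 MiB (a signed 26-bit count of 4-byte words) from
// the branch instruction itself.
bool IsDirectlyReachable(uint64_t branchPc, uint64_t target) {
    int64_t offset = static_cast<int64_t>(target - branchPc);
    return (offset & 3) == 0 && offset >= -(1LL << 27) && offset < (1LL << 27);
}

// Encode "B <target>"; only valid when IsDirectlyReachable() is true.
uint32_t EncodeBImm(uint64_t branchPc, uint64_t target) {
    int64_t offset = static_cast<int64_t>(target - branchPc);
    return 0x14000000u | (static_cast<uint32_t>(offset >> 2) & 0x03FFFFFFu);
}
```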
To determine how many instructions the target function has for us to work with, we make two passes over the target function’s code. The first pass simply counts how many instructions are available for patching (up to the 4 instruction maximum needed for absolute branches; we don’t really care beyond that).
The second pass actually populates the trampoline, builds the veneer (if necessary), and patches the target function.
Since the DLL interceptor is already well-equipped to build trampolines, it did not take much effort to add support for constructing veneers. However, where to write out a veneer is just as important as what to write to a veneer.
Recall that we need our veneer to reside within ±128 MiB of an immediate branch. Therefore, we need to be able to exercise some control over where the trampoline memory for veneers is allocated. Until this point, our trampoline allocator had no need to care about this; I had to add this capability.
Firstly, I needed to make the MMPolicy classes range-aware: we need to be able to allocate trampoline space within acceptable distances from branch instructions.
Consider that, as described above, a branch instruction may have limits on the extents of its target. As data, this is easily formatted as a pivot (ie, the PC at the location where the branch instruction is encountered) and a maximum distance in either direction from that pivot.
On the other hand, range-constrained memory allocation tends to work in terms of lower and upper bounds. I wrote a conversion method, MMPolicyBase::SpanFromPivotAndDistance, to convert between the two formats. In addition to format conversion, this method also constrains the resulting bounds such that they are above the 1 MiB mark of the process’ address space (to avoid reserving memory in VM regions that are sensitive to compatibility concerns), as well as below the maximum allowable user-mode VM address.
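In sketch form, the conversion looks something like this (a simplified approximation; the real method’s exact signature and constants differ):

```cpp
#include <cassert>
#include <cstdint>

struct Span {
    uint64_t lowerBound;
    uint64_t upperBound;
};

// Convert (pivot, max distance) to [lower, upper] bounds, clamped to stay
// above the 1 MiB mark and below the user-mode VM ceiling.
Span SpanFromPivotAndDistance(uint64_t pivot, uint64_t distance,
                              uint64_t maxUserModeAddr) {
    const uint64_t kMinAllowed = 0x100000;  // 1 MiB
    uint64_t lower = (pivot > distance) ? pivot - distance : 0;
    uint64_t upper = pivot + distance;      // overflow ignored in this sketch
    if (lower < kMinAllowed) lower = kMinAllowed;
    if (upper > maxUserModeAddr) upper = maxUserModeAddr;
    return {lower, upper};
}
```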
Another issue with range-aware VM allocation is determining the location, within the allowable range, for the actual VM reservation. Ideally we would like the kernel’s memory manager to choose the best location for us: its holistic view of existing VM layout (not to mention ASLR) across all processes will provide superior VM reservations. On the other hand, the Win32 APIs that facilitate this are specific to Windows 10. When available, MMPolicyInProcess uses VirtualAlloc2 and MMPolicyOutOfProcess uses MapViewOfFile3.
When we’re running on Windows versions where those APIs are not yet available, we need to fall back to finding and reserving our own range. The MMPolicyBase::FindRegion method handles this for us.
All of this logic is wrapped up in the MMPolicyBase::Reserve method. In addition to the desired VM size and range, the method also accepts two functors that wrap the OS APIs for reserving VM. Reserve uses those functors when available; otherwise it falls back to FindRegion to manually locate a suitable reservation.
Now that our memory management primitives were range-aware, I needed to shift my focus over to our VM sharing policies.
One impetus for the Great Interceptor Refactoring was to enable separate
Interceptor instances to share a unified pool of VM for trampoline memory.
To make this range-aware, I needed to make some additional changes to VMSharingPolicyShared. It would no longer be sufficient to assume that we could just share a single block of trampoline VM — we now needed to make the shared VM policy capable of allocating multiple blocks of VM.
VMSharingPolicyShared now contains a mapping of ranges to VM blocks. If we request a reservation that an existing block satisfies, we re-use that block. On the other hand, if we require a range that is not yet satisfied, we need to allocate a new block. I admit that I kind of half-assed the implementation of the data structure we use for the mapping; I was too lazy to implement a fully-fledged interval tree. The current implementation is probably “good enough,” however it’s probably worth fixing at some point.
Finally, I added a new generic class, TrampolinePool, that acts as an abstraction of a reserved block of VM address space. The main interceptor code requests a pool by calling the VM sharing policy’s Reserve method, then it uses the pool to retrieve new Trampoline instances to be populated.
It is much simpler to generate trampolines for AArch64 than it is for x86(-64). The most noteworthy addition to the Trampoline class is the WriteLoadLiteral method, which writes an absolute address into the trampoline’s literal pool, followed by writing an LDR instruction referencing that literal into the trampoline.
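The encoding that such a method emits can be computed as follows (an illustrative encoder, not the actual Trampoline code):

```cpp
#include <cassert>
#include <cstdint>

// Encode "LDR Xt, <pc + byteOffset>": the 64-bit load-literal form carries a
// signed 19-bit word offset in bits [23:5] and the destination register in
// bits [4:0].
uint32_t EncodeLdrLiteralX(uint32_t reg, int32_t byteOffset) {
    uint32_t imm19 = static_cast<uint32_t>(byteOffset >> 2) & 0x7FFFFu;
    return 0x58000000u | (imm19 << 5) | (reg & 0x1Fu);
}
```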
Thanks for reading! Coming up next time: My Untrusted Modules Opus.
Yes, you are reading the dates correctly: I am posting this over two years after I began this series. I am trying to get caught up on documenting my past work!
Given that the launcher process completely changes how our Win32 Firefox builds start, I needed to update both our CI harnesses, as well as the launcher process itself. I didn’t do much that was particularly noteworthy from a technical standpoint, but I will mention some important points:
During normal use, the launcher process usually exits immediately after the browser process is confirmed to have started. This was a deliberate design decision that I made. Having the launcher process wait for the browser process to terminate would not do any harm, however I did not want the launcher process hanging around in Task Manager and being misunderstood by users who are checking their browser’s resource usage.
On the other hand, such a design completely breaks scripts that expect to start Firefox and be able to synchronously wait for the browser to exit before continuing! Clearly I needed to provide an opt-in for the latter case, so I added the --wait-for-browser command-line option. The launcher process also implicitly enables this mode under a few other scenarios.
Secondly, there is the issue of debugging. Developers were previously used to attaching to the first firefox.exe process they see and expecting to be debugging the browser process. With the launcher process enabled by default, this is no longer the case.
There are a few options here:

- use the -o command-line flag, or use the Debug child processes also checkbox in the GUI;
- use the MOZ_DEBUG_BROWSER_PAUSE environment variable, which allows developers to set a timeout (in seconds) for the browser process to print its pid to stdout and wait for a debugger attachment.

As I have alluded to in previous posts, I needed to measure the effect of adding
an additional process to the critical path of Firefox startup. Since in-process
testing will not work in this case, I needed to use something that could provide
a holistic view across both launcher and browser processes. I decided to enhance
our existing xperf
suite in Talos to support my use case.
I already had prior experience with xperf; I spent a significant part of 2013 working with Joel Maher to put the xperf Talos suite into production. I also knew that the existing code was not sufficiently generic to be able to handle my use case.
I threw together a rudimentary analysis framework for working with CSV-exported xperf data. Then, after Joel’s review, I vendored it into mozilla-central and used it to construct an analysis for startup time. [While a more thorough discussion of this framework is definitely warranted, I also feel that it is tangential to the discussion at hand; I’ll write a dedicated blog entry about this topic in the future. – Aaron]
In essence, the analysis considers the following facts when processing an xperf recording:

- …firefox.exe process that runs;

For our analysis, we needed to do the following:

- …firefox.exe process being created;
- …firefox.exe process;

This block of code demonstrates how that analysis is specified using my analyzer framework.
Overall, these test results were quite positive. We saw a very slight but imperceptible increase in startup time on machines with solid-state drives, however the security benefits from the launcher process outweigh this very small regression.
Most interestingly, we saw a significant improvement in startup time on Windows 10 machines with magnetic hard disks! As I mentioned in Q2 Part 3, I believe this improvement is due to reduced hard disk seeking, thanks to the launcher process forcing \windows\system32 to the front of the dynamic linker’s search path.
By Q3 I had the launcher process in a state where it was built by default into Firefox, but it was still opt-in. As I have written previously, we needed the launcher process to gracefully fail even without having the benefit of various Gecko services such as preferences and the crash reporter.
Firstly, I created a new class, WindowsError, that encapsulates all types of Windows error codes. As an aside, I would strongly encourage all Gecko developers who are writing new code that invokes Windows APIs to use this class in your error handling.
WindowsError is currently able to store Win32 DWORD error codes, NTSTATUS error codes, and HRESULT error codes. Internally the code is stored as an HRESULT, since that type has encodings to support the other two. WindowsError also provides a method to convert its error code to a localized string for human-readable output.
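To illustrate how an HRESULT can carry a Win32 code (this is the standard HRESULT_FROM_WIN32 mapping, not WindowsError’s actual implementation):

```cpp
#include <cassert>
#include <cstdint>

// Standard Win32 -> HRESULT mapping: tag the code with the failure severity
// bit and FACILITY_WIN32 (7); ERROR_SUCCESS (0) maps to S_OK unchanged.
uint32_t HresultFromWin32(uint32_t win32Code) {
    if (win32Code == 0) {
        return 0;  // S_OK
    }
    return (win32Code & 0xFFFFu) | (7u << 16) | 0x80000000u;
}
```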
As for the launcher process itself, nearly every function in the launcher process returns a mozilla::Result-based type. In case of error, we return a LauncherResult, which [as of 2018; this has changed more recently – Aaron] is a structure containing the error’s source file, line number, and a WindowsError describing the failure.
While all Results in the launcher process may indicate a successful start, we may not yet be out of the woods! Consider the possibility that the various interventions taken by the launcher process might have somehow impaired the browser process’ ability to start!
To deal with this situation, the launcher process and the browser process share code that tracks whether both processes successfully started in sequence.
When the launcher process is started, it checks information recorded about the previous run. If the browser process previously failed to start correctly, the launcher process disables itself and proceeds to start the browser process without any of its typical interventions.
Once the browser has successfully started, it reflects the launcher process state into telemetry, preferences, and about:support.
Future attempts to start Firefox will bypass the launcher process until the next time the installation’s binaries are updated, at which point we reset and attempt once again to start with the launcher process. We do this in the hope that whatever was failing in version n might be fixed in version n + 1.
Note that this update behaviour implies that there is no way to forcibly and permanently disable the launcher process. This is by design: the error detection feature is designed to prevent the browser from becoming unusable, not to provide configurability. The launcher process is a security feature and not something that we should want users adjusting any more than we would want users to be disabling the capability system or some other important security mitigation. In fact, my original roadmap for InjectEject called for eventually removing the failure detection code if the launcher failure rate ever reached zero.
The pref reflection built into the failure detection system is bi-directional. This allowed us to ship a release where we ran a study with a fraction of users running with the launcher process enabled by default.
Once we rolled out the launcher process at 100%, this pref also served as a useful “emergency kill switch” that we could have flipped if necessary.
Fortunately our experiments were successful and we rolled the launcher process out to release at 100% without ever needing the kill switch!
At this point, this pref should probably be removed, as we no longer need nor want to control launcher process deployment in this way.
When telemetry is enabled, the launcher process is able to convert its LauncherResult into a ping which is sent in the background by ping-sender. When telemetry is disabled, we perform a last-ditch effort to surface the error by logging details about the LauncherResult failure in the Windows Event Log.
Thanks for reading! This concludes my 2018 Roundup series! There is so much more work from 2018 that I did for this project that I wish I could discuss, but for security reasons I must refrain. Nonetheless, I hope you enjoyed this series. Stay tuned for more roundups in the future!
Yes, you are reading the dates correctly: I am posting this nearly two years after I began this series. I am trying to get caught up on documenting my past work!
Once I had landed the skeletal implementation of the launcher process, it was time to start making it do useful things.
[For an overview of Windows integrity levels, check out this MSDN page – Aaron]
Since Windows Vista, security tokens for standard users have run at a medium integrity level (IL) by default.
When UAC is enabled, members of the Administrators group also run as standard users with a medium IL, with the additional ability to “elevate” themselves to a high IL. When UAC is disabled, an administrator receives a token that always runs at the high integrity level.
Running a process at a high IL is something that is not to be taken lightly: at that level, the process may alter system settings and access files that would otherwise be restricted by the OS.
While our sandboxed content processes always run at a low IL, I believed that defense-in-depth called for ensuring that the browser process did not run at a high IL. In particular, I was concerned about cases where elevation might be accidental. Consider, for example, a hypothetical scenario where a system administrator is running two open command prompts, one elevated and one not, and they accidentally start Firefox from the one that is elevated.
This was a perfect use case for the launcher process: it detects whether it is running at high IL, and if so, it launches the browser with medium integrity.
Unfortunately, some users prefer to configure their accounts to run at all times as Administrator with high integrity! This is a terrible idea from a security perspective, but it is what it is; in my experience, most users who run with this configuration do so deliberately, and they have no interest in being lectured about it.
Unfortunately, users running under this account configuration will experience side-effects of the Firefox browser process running at medium IL. Specifically, a medium IL process is unable to initiate IPC connections with a process running at a higher IL. This will break features such as drag-and-drop, since even the administrator’s shell processes are running at a higher IL than Firefox.
Being acutely aware of this issue, I included an escape hatch for these users: I implemented a command line option that prevents the launcher process from de-elevating when running with a high IL. I hate that I needed to do this, but moral suasion was not going to be an effective technique for solving this problem.
Another tool that the launcher process enables us to utilize is process mitigation options. Introduced in Windows 8, the kernel provides several opt-in flags that allows us to add prophylactic policies to our processes in an effort to harden them against attacks.
Additional flags have been added over time, so we must be careful to only set flags that are supported by the version of Windows on which we’re running.
We could have set some of these policies by calling the SetProcessMitigationPolicy API. Unfortunately, this API is designed for a process to use on itself once it is already running, which implies that there is a window of time between process creation and the moment the process enables its mitigations during which an attack could occur. Fortunately, Windows provides a second avenue for setting process mitigation flags: they may be set as part of an attribute list in the STARTUPINFOEX structure that we pass into CreateProcess.
Perhaps you can now see where I am going with this: The launcher process enables us to specify process mitigation flags for the browser process at the time of browser process creation, thus preventing the aforementioned window of opportunity for attacks to occur!
While there are other flags that we could support in the future, the initial mitigation policy that I added was the PROCESS_CREATION_MITIGATION_POLICY_IMAGE_LOAD_PREFER_SYSTEM32_ALWAYS_ON flag. [Note that I am only discussing flags applied to the browser process; sandboxed processes receive additional mitigations. – Aaron]
This flag forces the Windows loader to always use the Windows system32 directory as the first directory in its search path, which prevents library preload attacks. Using this mitigation also gave us an unexpected performance gain on devices with magnetic hard drives: most of our DLL dependencies are either loaded using absolute paths or reside in system32. With system32 at the front of the loader’s search path, the resulting reduction in hard disk seek times produced a slight but meaningful decrease in browser startup time! How I made these measurements is addressed in a future post.
This concludes the Q2 topics that I wanted to discuss. Thanks for reading! Coming up in H2: Preparing to Enable the Launcher Process by Default.
Yes, you are reading the dates correctly: I am posting this nearly two years after I began this series. I am trying to get caught up on documenting my past work!
One of the things I added to Firefox for Windows was a new process called the “launcher process.” “Bootstrap process” would be a better name, but we already used the term “bootstrap” for our XPCOM initialization code. Instead of overloading that term and adding potential confusion, I opted for using “launcher process” instead.
The launcher process is intended to be the first process that runs when the user starts Firefox. Its sole purpose is to create the “real” browser process in a suspended state, set various attributes on the browser process, resume the browser process, and then self-terminate.
In bug 1454745 I implemented an initial skeletal (and opt-in) implementation of the launcher process.
This seems like pretty straightforward code, right? Naïvely, one could just rip a CreateProcess sample off of MSDN and call it a day. The actual launcher process implementation is more complicated than that, for reasons that I will outline in the following sections.
firefox.exe
I wanted the launcher process to exist as a special “mode” of firefox.exe, as opposed to a distinct executable.
By definition, the launcher process lies on the critical path to browser startup. I needed to be very conscious of how we affect overall browser startup time.
Since the launcher process is built into firefox.exe, I needed to examine that executable’s existing dependencies to ensure that it is not loading any dependent libraries that are not actually needed by the launcher process. Other than the essential Win32 DLLs kernel32.dll and advapi32.dll (and their dependencies), I did not want anything else to load. In particular, I wanted to avoid loading user32.dll and/or gdi32.dll, as this would trigger the initialization of Windows’ GUI facilities, which would be a huge performance killer. For that reason, most browser-mode library dependencies of firefox.exe are either delay-loaded or are explicitly loaded via LoadLibrary.
We wanted the launcher process to both respect Firefox’s safe mode, as well as alter its behaviour as necessary when safe mode is requested.
There are multiple mechanisms used by Firefox to detect safe mode. The launcher process detects all of them except for one: testing whether the user is holding the shift key. Retrieving keyboard state would trigger loading of user32.dll, which would harm performance as I described above.
This is not too severe an issue in practice: The browser process itself would still detect the shift key. Furthermore, while the launcher process may in theory alter its behaviour depending on whether or not safe mode is requested, none of its behaviour changes are significant enough to materially affect the browser’s ability to start in safe mode.
Also note that, for serious cases where the browser is repeatedly unable to start, the browser triggers a restart in safe mode via environment variable, which is a mechanism that the launcher process honours.
We wanted the launcher process to behave well with respect to automated testing.
The skeletal launcher process that I landed in Q2 included code to pass its console handles on to the browser process, but there was more work necessary to completely handle this case. These capabilities were not yet an issue because the launcher process was opt-in at the time.
We wanted the launcher process to gracefully handle failures even though, also by definition, it does not have access to facilities that internal Gecko code has, such as preferences and the crash reporter.
The skeletal launcher process that I landed in Q2 did not yet utilize any special error handling code, but this was also not yet an issue because the launcher process was opt-in at this point.
Thanks for reading! Coming up in Q2, Part 3: Fleshing Out the Launcher Process
My first day at Mozilla was October 15, 2012 — I will be celebrating my seventh anniversary at MoCo in just a couple short weeks! Some people with similar tenures have suggested to me that we are now “old guard,” but I’m not sure that I feel that way! Anyway, I digress.
The platform performance team eventually evolved into a desktop-focused performance team by late 2013. By the end of 2015 I had decided that it was time for a change, and by March 2016 I had moved over to work for Jim Mathies, focusing on Gecko integration with Windows. I ended up spending the next twenty or so months helping the accessibility team port their Windows implementation over to multiprocess.
Once Firefox Quantum 57 hit the streets, I scoped out and provided technical leadership for the InjectEject project, whose objective was to tackle some of the root problems with DLL injection that were causing us grief in Windows-land.
I am proud to say that, over the past three years on Jim’s team, I have done the best work of my career. I’d like to thank Brad Lassey (now at Google) for his willingness to bring me over to his group, as well as Jim, and David Bolter (a11y manager at the time) for their confidence in me. As somebody who had spent most of his adult life having no confidence in his work whatsoever, their willingness to entrust me with taking on those risks and responsibilities made an enormous difference in my self esteem and my professional life.
Over the course of H1 2019, I began to feel restless again. I knew it was time for another change. What I did not expect was that the agent of that change would be James Willcox, aka Snorp. In Whistler, Snorp planted the seed in my head that I might want to come over to work with him on GeckoView, within the mobile group which David was now managing.
The timing seemed perfect, so I made the decision to move to GeckoView. I had to finish tying up some loose ends with InjectEject, so all the various stakeholders agreed that I’d move over at the end of Q3 2019.
Which brings me to this week, when I officially join the GeckoView team, working for Emily Toop. I find it somewhat amusing that I am now joining the team that evolved from the team that I had originally applied for back in 2012. I have truly come full circle in my career at Mozilla!
So, what’s next?
I have a couple of InjectEject bugs that are pretty much finished, but just need some polish and code reviews before landing.
For the next month or two at least, I am going to continue to meet weekly with Jim to assist with the transition as he ramps up new staff on the project.
I still plan to be the module owner for the Firefox Launcher Process and the MSCOM library, however most day-to-day work will be done by others going forward;
I will continue to serve as the mozglue peer in charge of the DLL blocklist and DLL interceptor, with the same caveat.
Switching over to Android from Windows does not mean that I am leaving my Windows experience at the door; I would like to continue to be a resource on that front, so I would encourage people to continue to ask me for advice.
On the other hand, I am very much looking forward to stepping back into the mobile space. My first crack at mobile was as an intern back in 2003, when I was working with some code that had to run on PalmOS 3.0! I have not touched Android since I shipped a couple of utility apps back in 2011, so I am looking forward to learning more about what has changed. I am also looking forward to learning more about native development on Android, which is something that I never really had a chance to try.
As they used to say on Monty Python’s Flying Circus, “And now for something completely different!”
]]>As I have alluded to previously, Gecko includes a Detours-style API hooking mechanism for Windows. In Gecko, this code is referred to as the “DLL Interceptor.” We use the DLL interceptor to instrument various functions within our own processes. As a prerequisite for future DLL injection mitigations, I needed to spend a good chunk of Q2 refactoring this code. While I was in there, I took the opportunity to improve the interceptor’s memory efficiency, thus benefitting the Fission MemShrink project. [When these changes landed, we were not yet tracking the memory savings, but I will include a rough estimate later in this post.]
While many distinct function hooking techniques are used in the Windows ecosystem, the Detours-style hook is one of the most effective and most popular. While I am not going to go into too many specifics here, I’d like to offer a quick overview. In this description, “target” is the function being hooked.
Here is what happens when a function is detoured:
1. Allocate a chunk of memory to serve as a "trampoline." We must be able to adjust the protection attributes on that memory.
2. Disassemble enough of the target to make room for a jmp instruction. On 32-bit x86 processors, this requires 5 bytes. x86-64 is more complicated, but generally, to jmp to an absolute address, we try to make room for 13 bytes.
3. Copy the instructions from step 2 over to the trampoline.
4. At the beginning of the target function, write a jmp to the hook function.
5. Append additional instructions to the trampoline that, when executed, will cause the processor to jump back to the first valid instruction after the jmp written in step 4.
6. If the hook function wants to pass control on to the original target function, it calls the trampoline.
Note that these steps don’t occur exactly in the order specified above; I selected the above ordering in an effort to simplify my description.
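To make the jmp patching in step 4 concrete, here is a minimal sketch (portable C++ with a hypothetical helper name; no actual memory patching or protection changes) of how the 5-byte rel32 jmp might be encoded on 32-bit x86. The displacement is measured from the address immediately after the instruction:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Encode "jmp rel32" (opcode 0xE9). The 32-bit displacement is relative to
// the address immediately following the 5-byte instruction, so we subtract
// (aFrom + 5) from the destination. Unsigned wraparound gives the correct
// two's-complement encoding for backward jumps.
std::array<uint8_t, 5> EncodeJmpRel32(uintptr_t aFrom, uintptr_t aTo) {
  std::array<uint8_t, 5> bytes{};
  bytes[0] = 0xE9;
  int32_t disp = static_cast<int32_t>(aTo - (aFrom + 5));
  std::memcpy(&bytes[1], &disp, sizeof(disp));
  return bytes;
}
```

A real detour must also make the patched page writable (e.g. via VirtualProtect) before storing these bytes, and on x86-64 must fall back to a longer absolute jmp when the hook lies outside the ±2GiB range reachable by rel32.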
Here is my attempt at visualizing the control flow of a detoured function on x86-64:
Previously, the DLL interceptor relied on directly manipulating pointers in order to read and write the various instructions involved in the hook. In bug 1432653 I changed things so that the memory operations are parameterized based on two orthogonal concepts:

* How the interceptor reads and writes memory (in-process vs out-of-process access);
* How the interceptor allocates and shares trampoline space between instances.

Previously, each WindowsDllInterceptor instance allocated its own page of memory for trampolines, but each instance also typically only sets one or two hooks. This means that most of the 4KiB page was unused. Furthermore, since Windows allocates blocks of pages on a 64KiB boundary, this wasted a lot of precious virtual address space in our 32-bit builds.

By refactoring and parameterizing these operations, we ended up with the following combinations:

* In-process memory operations, with each WindowsDllInterceptor instance receiving its own trampoline space;
* In-process memory operations, with all WindowsDllInterceptor instances within a module sharing trampoline space;
* Out-of-process memory operations, with each WindowsDllInterceptor instance receiving its own trampoline space;
* Out-of-process memory operations, with all WindowsDllInterceptor instances within a module sharing trampoline space (currently not implemented, as this option is not particularly useful at the moment).

Instead of directly manipulating pointers, we now use instances of ReadOnlyTargetFunction, WritableTargetFunction, and Trampoline to manipulate our code and data. Those classes in turn use the memory management and virtual memory allocation policies to perform the actual reading and writing.
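As a rough, hypothetical sketch of what this parameterization looks like (simplified names and signatures; the real classes live in mozglue and are considerably more involved), the two orthogonal policies can be supplied as template parameters so that the interceptor core is written once against both:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// In-process memory access: plain memcpy. An out-of-process policy would
// substitute ReadProcessMemory/WriteProcessMemory behind the same interface.
struct MMPolicyInProcess {
  bool Read(void* aDst, const void* aSrc, size_t aLen) const {
    std::memcpy(aDst, aSrc, aLen);
    return true;
  }
};

// Each instance owns its own trampoline pool (a heap buffer here; the real
// policy reserves and commits pages of virtual memory).
struct VMSharingPolicyUnique {
  std::vector<unsigned char> mPool = std::vector<unsigned char>(4096);
  size_t mNext = 0;
  unsigned char* Reserve(size_t aLen) {
    if (mNext + aLen > mPool.size()) {
      return nullptr;
    }
    unsigned char* slot = mPool.data() + mNext;
    mNext += aLen;
    return slot;
  }
};

template <typename MMPolicy, typename VMPolicy>
class Interceptor {
  MMPolicy mMem;
  VMPolicy mVM;

 public:
  // Copy aLen bytes of the target's prologue into freshly reserved
  // trampoline space, using whichever policies were supplied.
  const unsigned char* CopyToTrampoline(const void* aTarget, size_t aLen) {
    unsigned char* slot = mVM.Reserve(aLen);
    if (!slot || !mMem.Read(slot, aTarget, aLen)) {
      return nullptr;
    }
    return slot;
  }
};
```

Because the policies are resolved at compile time, the optimizer can inline them away, which is consistent with the measurement described below that the abstraction cost nothing at runtime.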
The interceptor now supports two memory management policies, MMPolicyInProcess and MMPolicyOutOfProcess. Each policy must implement the following memory operations:

* Reading;
* Writing;
* Changing protection attributes;
* Reserving and committing trampoline space.

MMPolicyInProcess is implemented using memcpy for read and write, VirtualProtect for protection attribute changes, and VirtualAlloc for reserving and committing trampoline space.
MMPolicyOutOfProcess uses ReadProcessMemory and WriteProcessMemory for read and write. As a perf optimization, we try to batch reads and writes together to reduce the system call traffic. We obviously use VirtualProtectEx to adjust protection attributes in the other process.
Out-of-process trampoline reservation and commitment, however, is a bit different and is worth a separate call-out. We allocate trampoline space using shared memory. It is mapped into the local process with read+write permissions using MapViewOfFile. The memory is mapped into the remote process as read+execute using some code that I wrote in bug 1451511 that either uses NtMapViewOfSection or MapViewOfFile2, depending on availability. Individual pages from those chunks are then committed via VirtualAlloc in the local process and VirtualAllocEx in the remote process. This scheme enables us to read and write to trampoline memory directly, without needing to do cross-process reads and writes!
The code for the VM sharing policies is a lot simpler than the code for the memory management policies. We now have VMSharingPolicyUnique and VMSharingPolicyShared. Each of these policies must implement operations for reserving trampoline space and for obtaining a Trampoline object for the next available K-byte trampoline slot.

VMSharingPolicyShared is actually implemented by delegating to a static instance of VMSharingPolicyUnique.
To determine the performance implications, I added timings to our DLL Interceptor unit test. I was very happy to see that, despite the additional layers of abstraction, the C++ compiler’s optimizer was doing its job: There was no performance impact whatsoever!
Once the refactoring was complete, I switched the default VM sharing policy for WindowsDllInterceptor over to VMSharingPolicyShared in bug 1451524.
Browsing today’s mozilla-central tip, I count 14 locations where we instantiate interceptors inside xul.dll. Given that not all interceptors are necessarily instantiated at once, I am now offering a worst-case, back-of-the-napkin estimate of the memory savings:
Another problem that I discovered during this refactoring was bug 1459335. It turns out that some of the interceptor’s callers were not distinguishing between the “I have not set this hook yet” and “I attempted to set this hook but it failed” scenarios. Across several call sites, I discovered that our code would repeatedly retry setting hooks even when they had previously failed, causing leakage of trampoline space!
To fix this, I modified the interceptor’s interface so that we use one-time initialization APIs to set hooks; since landing this bug, it is no longer possible for clients of the DLL interceptor to set a hook that had previously failed to be set.
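Sketched abstractly (hypothetical names; the real interface uses Gecko's one-time initialization idioms), the fix amounts to recording a failed attempt as a terminal state rather than as "not yet set":

```cpp
#include <functional>

// A tri-state, rather than a boolean, distinguishes "never attempted" from
// "attempted and failed", so a failed hook is never retried and no further
// trampoline space can leak.
class HookSetter {
  enum class State { NotAttempted, Succeeded, Failed };
  State mState = State::NotAttempted;

 public:
  // aSetHook performs the actual (possibly failing) installation. It runs
  // at most once over the lifetime of this object.
  bool EnsureHook(const std::function<bool()>& aSetHook) {
    if (mState == State::NotAttempted) {
      mState = aSetHook() ? State::Succeeded : State::Failed;
    }
    return mState == State::Succeeded;
  }
};
```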
Quantifying the memory costs of this bug is… non-trivial, but suffice it to say that fixing it probably saved at least a few hundred KiB of committed VM on affected machines.
That’s it for today’s post, folks! Thanks for reading! Coming up in Q2, Part 2: Implementing a Skeletal Launcher Process
]]>Here is an index of all the entries in this series:
The general theme of my work in 2018 was dealing with the DLL injection problem: On Windows, third parties love to forcibly load their DLLs into other processes — web browsers in particular, thus making Firefox a primary target.
Many of these libraries tend to alter Firefox processes in ways that hurt the stability and/or performance of our code; many chemspill releases have gone out over the years to deal with these problems. While I could rant for hours over this, the fact is that DLL injection is rampant in the ecosystem of Windows desktop applications and is not going to disappear any time soon. In the meantime, we need to be able to deal with it.
Some astute readers might be ready to send me an email or post a comment about how ignorant I am about the new(-ish) process mitigation policies that are available in newer versions of Windows. While those features are definitely useful, they are not panaceas:
SetWindowsHookEx and SetWinEventHook, both of which are covered by this policy;

For me, Q1 2018 was all about gathering better data about injected DLLs.
One of our major pain points over the years of dealing with injected DLLs has been that the vendor of the DLL is not always apparent to us. In general, our crash reports and telemetry pings only include the leaf name of the various DLLs on a user’s system. This is intentional on our part: we want to preserve user privacy. On the other hand, this severely limits our ability to determine which party is responsible for a particular DLL.
One avenue for obtaining this information is to look at any digital signature that is embedded in the DLL. By examining the certificate that was used to sign the binary, we can extract the organization of the cert’s owner and include that with our crash reports and telemetry.
In bug 1430857 I wrote a bunch of code that enables us to extract that information from signed binaries using the Windows Authenticode APIs. Originally, in that bug, all of that signature extraction work happened from within the browser itself, while it was running: It would gather the cert information on a background thread while the browser was running, and include those annotations in a subsequent crash dump, should such a thing occur.
After some reflection, I realized that I was not gathering annotations in the right place. As an example, what if an injected DLL were to trigger a crash before the background thread had a chance to grab that DLL’s cert information?
I realized that the best place to gather this information was in a post-processing step after the crash dump had been generated, and in fact we already had the right mechanism for doing so: the minidump-analyzer program was already doing post-processing on Firefox crash dumps before sending them back to Mozilla. I moved the signature extraction and crash annotation code out of Gecko and into the analyzer in bug 1436845.
(As an aside, while working on the minidump-analyzer, I found some problems with how it handled command line arguments: it was assuming that main passes its argv as UTF-8, which is not true on Windows. I fixed those issues in bug 1437156.)
In bug 1434489 I also ended up adding this information to the “modules ping” that we have in telemetry; IIRC this ping is only sent weekly. When the modules ping is requested, we gather the module cert info asynchronously on a background thread.
Finally, I had to modify Socorro (the back-end for crash-stats) to be able to understand the signature annotations and be able to display them via bug 1434495. This required two commits: one to modify the Socorro stackwalker to merge the module signature information into the full crash report, and another to add a “Signed By” column to every report’s “Modules” tab to display the signature information (Note that this column is only present when at least one module in a particular crash report contains signature information).
The end result was very satisfying: Most of the injected DLLs in our Windows crash reports are signed, so it is now much easier to identify their vendors!
This project was very satisfying for me in many ways: First of all, surfacing this information was an itch that I had been wanting to scratch for quite some time. Secondly, this really was a “full stack” project, touching everything from extracting signature info from binaries using C++, all the way up to writing some back-end code in Python and a little touch of front-end work to surface the data in the web app.
Note that, while this project focused on Windows because of the severity of the library injection problem on that platform, it would be easy enough to reuse most of this code for macOS builds as well; the only major work for the latter case would be for extracting signature information from a dylib. This is not currently a priority for us, though.
Thanks for reading! Coming up in Q2: Refactoring the DLL Interceptor!
]]>While I have previously blogged about the transition to WebExtensions, now that we have actually passed through the cutoff for legacy extensions, I have decided to add some new commentary on the subject.
One analogy that has been used in the discussion of the extension ecosystem is that of kernelspace and userspace. The crux of the analogy is that Gecko is equivalent to an operating system kernel, and thus extensions are the user-mode programs that run atop that kernel. The argument then follows that Mozilla’s deprecation and removal of legacy extension capabilities is akin to “breaking” userspace. [Some people who say this are using the same tone as Linus does whenever he eviscerates Linux developers who break userspace, which is neither productive nor welcomed by anyone, but I digress.] Unfortunately, that analogy simply does not map to the legacy extension model.
The most significant problem with the userspace analogy is that legacy extensions effectively meld with Gecko and become part of Gecko itself. If we accept the premise that Gecko is like a monolithic OS kernel, then we must also accept that the analogical equivalent of loading arbitrary code into that kernel, is the kernel module. Such components are loaded into the kernel and effectively become part of it. Their code runs with full privileges. They break whenever significant changes are made to the kernel itself.
Sound familiar?
Legacy extensions were akin to kernel modules. When there is no abstraction, there can be no such thing as userspace. This is precisely the problem that WebExtensions solves!
Maybe somebody out there is thinking, “well what if you took all the APIs that legacy extensions used, turned that into a ‘userspace,’ and then just left that part alone?”
Which APIs? Where do we draw the line? Do we check the code coverage for every legacy addon in AMO and use that to determine what to include?
Remember, there was no abstraction; installed legacy addons are fused to Gecko. If we pledge not to touch anything that legacy addons might touch, then we cannot touch anything at all.
Where do we go from here? Freeze an old version of Gecko and host an entire copy of it inside web content? Compile it to WebAssembly? [Oh God, what have I done?]
If that’s not a maintenance burden, I don’t know what is!
Another problem with the legacy-extensions-as-userspace analogy is that it leaves awkward room for web content, whose API is abstract and well-defined. I do not think that it is appropriate to consider web content to be equivalent to a sandboxed application, as sandboxed applications use the same (albeit restricted) API as normal applications. I would suggest that the presence of WebExtensions gives us a better kernel analogy:
Declaring that legacy extensions are userspace does not make them so. The way that the technology actually worked defies the abstract model that the analogy attempts to impose upon it. On the other hand, we can use the failure of that analogy to explain why WebExtensions are important and construct an extension ecosystem that does fit with that analogy.
]]>The first time was inside our window neutering code in IPC, while the second time was in our accessibility code.
Every time I have seen this, there is code that follows this pattern:
```cpp
HWND hwnd = CreateWindowEx(/* ... */);

// Do some follow-up initialization on the new window
SetProp(hwnd, "Foo", data);
```
This seems innocuous enough, right?
The problem is that CreateWindowEx calls hooks. If those hooks then try to do something like GetProp(hwnd, "Foo"), that call is going to fail because the “Foo” prop has not yet been set.
The key takeaway from this is that, if you are creating a new window, you must do any follow-up initialization from within your window proc’s WM_CREATE handler. This will guarantee that your window’s initialization will have completed before any hooks are called.
You might be thinking, “But I don’t set any hooks!” While this may be true, you must not forget about hooks set by third-party code.
“But those hooks won’t know anything about my program’s internals, right?”
Perhaps, perhaps not. But when those hooks fire, they give third-party software the opportunity to run. In some cases, those hooks might even cause the thread to reenter your own code. Your window had better be completely initialized when this happens!
In the case of my latest discovery of this issue in bug 1380471, I made it possible to use a C++11 lambda to simplify this pattern.
CreateWindowEx accepts an lpParam parameter which is then passed to the WM_CREATE handler as the lpCreateParams member of a CREATESTRUCT. By setting lpParam to a pointer to a std::function<void(HWND)>, we may then supply any callable that we wish for follow-up window initialization.
Using the previous code sample as a baseline, this allows me to revise the code to safely set a property like this:
```cpp
std::function<void(HWND)> onCreate([](HWND aHwnd) {
  SetProp(aHwnd, "Foo", data);
});

HWND hwnd = CreateWindowEx(/* ..., */ reinterpret_cast<LPVOID>(&onCreate));
```
Note that since lpParam is always passed during WM_CREATE, which always fires before CreateWindowEx returns, it is safe for onCreate to live on the stack.
I liked this solution for the a11y case because it preserved the locality of the initialization code within the function that called CreateWindowEx; the window proc for this window is implemented in another source file, and the follow-up initialization depends on the context surrounding the CreateWindowEx call.
Speaking of window procs, here is how that window’s WM_CREATE handler invokes the callable:

```cpp
case WM_CREATE: {
  auto createStruct = reinterpret_cast<CREATESTRUCT*>(lParam);
  auto onCreate = static_cast<std::function<void(HWND)>*>(
      createStruct->lpCreateParams);
  if (onCreate) {
    (*onCreate)(hwnd);
  }
  return 0;
}
```
TL;DR: If you see a pattern where further initialization work is being done on an HWND after a CreateWindowEx call, move that initialization code to your window’s WM_CREATE handler instead.
CRITICAL_SECTIONs (with full debug info) for our implementation of mozilla::Mutex. I’d like to illustrate some reasons why this is so useful.
Every time you initialize a CRITICAL_SECTION, Windows inserts the CS’s debug info into a process-wide linked list. This enables their discovery by the Windows debugging engine, and makes the !cs, !critsec, and !locks commands more useful.
When the “Create user mode stack trace database” gflag is enabled, Windows records the call stack of the thread that called InitializeCriticalSection on that CS. Windows also records the call stack of the owning thread once it has acquired the CS. This can be very useful for debugging deadlocks.
Since every CS has been placed in a process-wide linked list, we may now ask the debugger to dump statistics about every live CS in the process. In particular, we can ask the debugger to output the contention counts for each CS in the process. After running a workload against Nightly, we may then take the contention output, sort it in descending order, and determine which CRITICAL_SECTIONs are the most contended in the process.
We may then want to more closely inspect the hottest CSes to determine whether there is anything that we can do to reduce contention and all of the extra context switching that entails.
When we use SRWLOCKs or initialize our CRITICAL_SECTIONs with the CRITICAL_SECTION_NO_DEBUG_INFO flag, we deny ourselves access to this information. That’s fine on release builds, but on Nightly I think it is worth having around. While I realize that most Mozilla developers have not used this until now (otherwise I would not be writing this blog post), this rich debugger info is one of those things that you do not miss until you no longer have it.
For further reading about critical section debug info, check out this archived article from MSDN Magazine.
]]>Obviously the removal of that code does not prevent me from discussing some of the more interesting facets of that work.
Today I am going to talk about how async plugin init worked when web content attempted to access a property on a plugin’s scriptable object, when that plugin had not yet completed its asynchronous initialization.
As described on MDN, the DOM queries a plugin for scriptability by calling NPP_GetValue with the NPPVpluginScriptableNPObject constant. With async plugin init, we did not return the true NPAPI scriptable object back to the DOM. Instead we returned a surrogate object. This meant that we did not need to synchronously wait for the plugin to initialize before returning a result back to the DOM.
If the DOM subsequently called into that surrogate object, the surrogate would be forced to synchronize with the plugin. There was a limit on how much fakery the async surrogate could do once the DOM needed a definitive answer — after all, the NPAPI itself is entirely synchronous. While you may question whether the asynchronous surrogate actually bought us any responsiveness, performance profiles and measurements that I took at the time did indeed demonstrate that the asynchronous surrogate did buy us enough additional concurrency to make it worthwhile. A good number of plugin instantiations were able to complete in time before the DOM had made a single invocation on the surrogate.
Once the surrogate object had synchronized with the plugin, it would then mostly act as a pass-through to the plugin’s real NPAPI scriptable object, with one notable exception: property accesses.
The reason for this is not necessarily obvious, so allow me to elaborate:
The DOM usually sets up a scriptable object as follows:

* this is the WebIDL object (ie, content’s <embed> element);
* this.__proto__ is the plugin’s scriptable object (which, under async plugin init, was the surrogate);
* this.__proto__.__proto__ is Object.prototype.
NPAPI is reentrant (some might say insanely reentrant). It is possible (and indeed common) for a plugin to set properties on the WebIDL object from within the plugin’s NPP_New.
Suppose that the DOM tries to access a property on the plugin’s WebIDL object that is normally set by the plugin’s NPP_New. In the asynchronous case, the plugin’s initialization might still be in progress, so that property might not yet exist.
In the case where the property does not yet exist on the WebIDL object, JavaScript fails to retrieve an “own” property. It then moves on to the first prototype and attempts to resolve the property on that. As outlined above, this prototype would actually be the async surrogate. The async surrogate would then be in a situation where it must absolutely produce a definitive result, so this would trigger synchronization with the plugin. At this point the plugin would be guaranteed to have finished initializing.
Now we have a problem: JS was already examining the NPAPI scriptable object when it blocked to synchronize with the plugin. Meanwhile, the plugin went ahead and set properties (including the one that we’re interested in) on the WebIDL object. By the time that JS execution resumes, it would already be looking too far up the prototype chain to see those new properties!
The surrogate needed to be aware of this when it synchronized with the plugin during a property access. If the plugin had already completed its initialization (thus rendering synchronization unnecessary), the surrogate would simply pass the property access on to the real NPAPI scriptable object. On the other hand, if a synchronization was performed, the surrogate would first retry the WebIDL object by querying for the WebIDL object’s “own” properties, and return the own property if it now existed. If no own property existed on the WebIDL object, then the surrogate would revert to its “pass through all the things” behaviour.
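The surrogate's lookup logic can be modeled abstractly like this (a hypothetical sketch using plain maps; the real code operates on NPAPI scriptable objects and JSAPI):

```cpp
#include <map>
#include <optional>
#include <string>

// Stand-in for the WebIDL object (content's <embed> element).
struct WebIdlObject {
  std::map<std::string, int> ownProps;
};

class AsyncSurrogate {
  WebIdlObject* mOwner;
  bool mInitialized = false;
  std::map<std::string, int> mPluginProps;  // stand-in for the real NPAPI object

  void SyncWithPlugin() {
    // Blocks until initialization completes. In this sketch, "NPP_New" sets
    // an own property on the WebIDL object while we are blocked.
    mOwner->ownProps["width"] = 300;
    mInitialized = true;
  }

 public:
  explicit AsyncSurrogate(WebIdlObject* aOwner) : mOwner(aOwner) {}

  std::optional<int> GetProperty(const std::string& aName) {
    if (!mInitialized) {
      SyncWithPlugin();
      // We had to block, so NPP_New may have added own properties that JS
      // already skipped: re-check the WebIDL object before falling through.
      auto own = mOwner->ownProps.find(aName);
      if (own != mOwner->ownProps.end()) {
        return own->second;
      }
    }
    auto it = mPluginProps.find(aName);
    if (it == mPluginProps.end()) {
      return std::nullopt;
    }
    return it->second;
  }
};
```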
If I hadn’t made the asynchronous surrogate scriptable object do that, we would have ended up with a strange situation where the DOM’s initial property access on an embed could fail non-deterministically during page load.
That’s enough chatter for today. I enjoy blogging about my crazy hacks that make the impossible, umm… possible, so maybe I’ll write some more of these in the future.
]]>I am pretty excited about this new role!
My first project is to sort out the accessibility situation under Windows e10s. This started back at Mozlando last December. A number of engineers from across the Platform org, plus me, got together to brainstorm. Not too long after we had all returned home, I ended up making a suggestion on an email thread that has evolved into the core concept that I am currently attempting. As is typical at Mozilla, no deed goes unpunished, so I have been asked to flesh out my ideas. An overview of this plan is available on the wiki.
My hope is that I’ll be able to deliver a working, “version 0.9” type of demo in time for our London all-hands at the end of Q2. Hopefully we will be able to deliver on that!
I am using this section of the blog post to make some additional notes. I don’t feel that these ideas are strong enough to commit to a wiki yet, but I do want them to be publicly available.
One concern that our colleagues at NVAccess have identified is that the current COM interfaces are too chatty; this is a major reason why screen readers frequently inject libraries into the Firefox address space. If we serve our content a11y objects as remote COM objects, there is concern that performance would suffer. This concern is not due to latency, but rather due to frequency of calls; one function call does not provide sufficient information to the a11y client. As a result, multiple round trips are required to fetch all of the information that is required for a particular DOM node.
My gut feeling about this is that this is probably a legitimate concern, however we cannot make good decisions without quantifying the performance. My plan going forward is to proceed with a naïve implementation of COM remoting to start, followed by work on reducing round trips as necessary.
One idea that was discussed is the idea of the content process speculatively sending information to the chrome process that might be needed in the future. For example, if we have an IAccessible, we can expect that multiple properties will be queried off that interface. A smart proxy could ship that data across the RPC channel during marshaling so that querying that additional information does not require additional round trips.
COM makes this possible using “handler marshaling.” I have dug up some information about how to do this and am posting it here for posterity:
House of COM, May 1999 Microsoft Systems Journal;
Implementing and Activating a Handler with Extra Data Supplied by Server on MSDN;
Wicked Code, August 2000 MSDN Magazine. This is not available on the MSDN Magazine website but I have an archived copy on CD-ROM.
I have added a new command to mozdbgext: !iat.
The syntax is pretty simple:
!iat <hexadecimal address>
This address shouldn’t be just any pointer; it should be the address of an entry in the current module’s import address table (IAT). These addresses are easily identifiable by the _imp_ prefix in their symbol names.
The purpose of this extension is to look up the name of the DLL from whom the function is being imported. Furthermore, the extension checks the expected target address of the import with the actual target address of the import. This allows us to detect API hooking via IAT patching.
I fired up a local copy of Nightly, attached a debugger to it, and dumped the call stack of its main thread:
```
 # ChildEBP RetAddr
00 0018ecd0 765aa32a ntdll!NtWaitForMultipleObjects+0xc
01 0018ee64 761ec47b KERNELBASE!WaitForMultipleObjectsEx+0x10a
02 0018eecc 1406905a USER32!MsgWaitForMultipleObjectsEx+0x17b
03 0018ef18 1408e2c8 xul!mozilla::widget::WinUtils::WaitForMessage+0x5a
04 0018ef84 13fdae56 xul!nsAppShell::ProcessNextNativeEvent+0x188
05 0018ef9c 13fe3778 xul!nsBaseAppShell::DoProcessNextNativeEvent+0x36
06 0018efbc 10329001 xul!nsBaseAppShell::OnProcessNextEvent+0x158
07 0018f0e0 1038e612 xul!nsThread::ProcessNextEvent+0x401
08 0018f0fc 1095de03 xul!NS_ProcessNextEvent+0x62
09 0018f130 108e493d xul!mozilla::ipc::MessagePump::Run+0x273
0a 0018f154 108e48b2 xul!MessageLoop::RunInternal+0x4d
0b 0018f18c 108e448d xul!MessageLoop::RunHandler+0x82
0c 0018f1ac 13fe78f0 xul!MessageLoop::Run+0x1d
0d 0018f1b8 14090f07 xul!nsBaseAppShell::Run+0x50
0e 0018f1c8 1509823f xul!nsAppShell::Run+0x17
0f 0018f1e4 1514975a xul!nsAppStartup::Run+0x6f
10 0018f5e8 15146527 xul!XREMain::XRE_mainRun+0x146a
11 0018f650 1514c04a xul!XREMain::XRE_main+0x327
12 0018f768 00215c1e xul!XRE_main+0x3a
13 0018f940 00214dbd firefox!do_main+0x5ae
14 0018f9e4 0021662e firefox!NS_internal_main+0x18d
15 0018fa18 0021a269 firefox!wmain+0x12e
16 0018fa60 76e338f4 firefox!__tmainCRTStartup+0xfe
17 0018fa74 77d656c3 KERNEL32!BaseThreadInitThunk+0x24
18 0018fabc 77d6568e ntdll!__RtlUserThreadStart+0x2f
19 0018facc 00000000 ntdll!_RtlUserThreadStart+0x1b
```
Let us examine the code at frame 3:
```
14069042 6a04            push    4
14069044 68ff1d0000      push    1DFFh
14069049 8b5508          mov     edx,dword ptr [ebp+8]
1406904c 2b55f8          sub     edx,dword ptr [ebp-8]
1406904f 52              push    edx
14069050 6a00            push    0
14069052 6a00            push    0
14069054 ff159cc57d19    call    dword ptr [xul!_imp__MsgWaitForMultipleObjectsEx (197dc59c)]
1406905a 8945f4          mov     dword ptr [ebp-0Ch],eax
```
Notice that the function call to MsgWaitForMultipleObjectsEx occurs indirectly; the call instruction references a pointer within the xul.dll binary itself. This is the IAT entry that corresponds to that function.
Now, if I load mozdbgext, I can take the address of that IAT entry and execute the following command:

```
0:000> !iat 0x197dc59c
Expected target: USER32.dll!MsgWaitForMultipleObjectsEx
Actual target:   USER32!MsgWaitForMultipleObjectsEx+0x0
```
!iat has done two things for us:

* It resolved the name of the module and function that this IAT entry is expected to import;
* It resolved the actual address that the IAT entry currently points to.
Normally we want both the expected and actual targets to match. If they don’t, we should investigate further, as this mismatch may indicate that the IAT has been patched by a third party.
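In miniature, the check that !iat performs amounts to comparing two pointers (a hypothetical toy model; the real extension resolves the expected target from symbol information):

```cpp
// Toy model of an IAT entry: a slot holding the function pointer that the
// module actually calls through. IAT patching overwrites the slot, so
// hooking is detectable by comparing it against the import's expected target.
using FuncPtr = int (*)(int);

int RealImport(int aValue) { return aValue + 1; }
int HookedImport(int aValue) { return aValue + 100; }

struct IatEntry {
  FuncPtr slot;      // what calls through this entry will actually reach
  FuncPtr expected;  // where the import should point
};

bool IsIatPatched(const IatEntry& aEntry) {
  return aEntry.slot != aEntry.expected;
}
```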
Note that the !iat command is breakpad-aware (provided that you’ve already loaded the symbols using !bploadsyms), but it can fall back to the Microsoft symbol engine as necessary.
Further note that the !iat command does not yet accept the _imp_ symbolic names for IAT entries; you need to enter the hexadecimal representation of the pointer.
In many ways this problem is also a Catch-22: people don’t want to use Windows for many reasons, and tooling is a big part of the problem. OTOH, nobody is motivated to improve the tooling situation if nobody is actually going to use those tools.
A couple of weeks ago my frustrations with the situation boiled over when I learned that our Cpp unit test suite could not log symbolicated call stacks, resulting in my filing of bug 1238305 and bug 1240605. Not only could we not log those stacks, in many situations we could not view them in a debugger either.
Due to the fact that PDB files consume a large amount of disk space, we don’t keep them when building from integration or try repositories. Unfortunately they can be quite useful to have when there is a build failure. Most of our integration builds, however, do include breakpad symbols. Developers may also explicitly request symbols for their try builds.
A couple of years ago I had begun working on a WinDbg debugger extension that was tailored to Mozilla development. It had mostly bitrotted over time, but I decided to resurrect it for a new purpose: to help WinDbg* grok breakpad.
mozdbgext is the result. This extension adds a few commands that make Win32 debugging with breakpad a little bit easier.
The original plan was that I wanted mozdbgext to load breakpad symbols and then insert them into the debugger's symbol table via the IDebugSymbols3::AddSyntheticSymbol API. Unfortunately the design of this API is not well equipped for bulk loading of synthetic symbols: each individual symbol insertion causes the debugger to re-sort its entire symbol table. Since xul.dll's quantity of symbols is in the six-figure range, using this API to load that quantity of symbols is prohibitively expensive. I tweeted a Microsoft PM who works on Debugging Tools for Windows, asking if there would be any improvements there, but it sounds like this is not going to be happening any time soon.
My original plan would have been ideal from a UX perspective: the breakpad symbols would look just like any other symbols in the debugger and could be accessed and manipulated using the same set of commands. Since synthetic symbols would not work for me in this case, I went for "Plan B:" extension commands that are separate from, but analogous to, regular WinDbg commands.
I plan to continuously improve the commands that are available. Until I have a proper README checked in, I’ll introduce the commands here.
To load the extension, use the .load command: .load <path_to_mozdbgext_dll>
To load breakpad symbols: !bploadsyms <path_to_breakpad_symbol_directory>
Note: You must have successfully run the !bploadsyms command first!
As a general guide, I am attempting to name each breakpad command similarly to the native WinDbg command, except that the command name is prefixed by !bp.
!bpk (the breakpad analogue of the native k command, for symbolicated stack traces)
!bpln <address> (the analogue of ln, for finding the nearest symbol), where address is specified as a hexadecimal value.
I have pre-built binaries (32-bit, 64-bit) available for download.
Note that there are several other commands that are “roughed-in” at this point and do not work correctly yet. Please stick to the documented commands at this time.
* When I write "WinDbg", I am really referring to any debugger in the Debugging Tools for Windows package, including cdb.
I’m finally getting ‘round to writing about a nasty bug that I had to spend a bunch of time with in Q4 2015. It’s one of the more challenging problems that I’ve had to break and I’ve been asked a lot of questions about it. I’m talking about bug 1218473.
In bug 1213567 I had landed a patch to intercept calls to CreateWindowEx. This was necessary because it was apparent in that bug that window subclassing was occurring while a window was neutered ("neutering" is terminology that is specific to Mozilla's Win32 IPC code).
While I’ll save a discussion on the specifics of window neutering for another day, for our purposes it is sufficient for me to point out that subclassing a neutered window is bad because it creates an infinite recursion scenario with window procedures that will eventually overflow the stack.
Neutering is triggered during certain types of IPC calls as soon as a message is sent to an unneutered window on the thread making the IPC call. Unfortunately, in the case of bug 1213567, the message triggering the neutering was WM_CREATE. Shortly after creating that window, the code responsible would subclass said window. Since WM_CREATE had already triggered neutering, this would result in the pathological case that triggers the stack overflow.
For a fix, what I wanted to do was to prevent messages sent immediately during the execution of CreateWindow (such as WM_CREATE) from triggering neutering prematurely. By intercepting calls to CreateWindowEx, I could wrap those calls with a RAII object that temporarily suppresses the neutering. Since the subclassing occurs immediately after window creation, this meant that the subclassing operation was now safe.
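The suppression technique described above can be sketched as a small RAII guard. This is a simplified model under assumed names (SuppressNeuteringRegion, gSuppressNeutering); it is not Mozilla's actual implementation:

```cpp
#include <cassert>

// A thread-local flag that the neutering code would consult before
// neutering a window. (Name and mechanism are illustrative.)
thread_local bool gSuppressNeutering = false;

// RAII guard: neutering is suppressed for the lifetime of the object,
// and the previous state is restored on scope exit, even if the wrapped
// call throws.
class SuppressNeuteringRegion {
public:
    SuppressNeuteringRegion() : mPrev(gSuppressNeutering) {
        gSuppressNeutering = true;
    }
    ~SuppressNeuteringRegion() { gSuppressNeutering = mPrev; }

    // Non-copyable: the guard is tied to exactly one scope.
    SuppressNeuteringRegion(const SuppressNeuteringRegion&) = delete;
    SuppressNeuteringRegion& operator=(const SuppressNeuteringRegion&) = delete;

private:
    bool mPrev;
};

// Stand-in for the intercepted CreateWindowEx: any WM_CREATE dispatched
// during the real call would observe gSuppressNeutering == true and skip
// the neutering step.
bool CreateWindowStub() {
    SuppressNeuteringRegion guard;
    // ... forward to the real CreateWindowExW here ...
    return gSuppressNeutering;  // true while the guard is alive
}
```

The key property is that the flag is restored automatically when the guard goes out of scope, so neutering resumes for any later messages.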
Unfortunately, shortly after landing bug 1213567, bug 1218473 was filed.
It wasn't obvious where to start debugging this. While a crash spike was clearly correlated with the landing of bug 1213567, the crashes were occurring in code that had nothing to do with IPC or Win32. For example, the first stack that I looked at was js::CreateRegExpMatchResult!
When it is just not clear where to begin, I like to start by looking at our correlation data in Socorro — you’d be surprised how often they can bring problems into sharp relief!
In this case, the correlation data didn't disappoint: there was 100% correlation with a module called _etoured.dll. There was also correlation with the presence of both NVIDIA video drivers and Intel video drivers. Clearly this was a concern only when NVIDIA Optimus technology was enabled.
I also had a pretty strong hypothesis about what _etoured.dll was: for many years, Microsoft Research has shipped a package called Detours. Detours is a library that is used for intercepting Win32 API calls. While the changelog for Detours 3.0 points out that it has "Removed [the] requirement for including detoured.dll in processes," in previous versions of the package, this library was required to be injected into the target process.
I concluded that _etoured.dll was most likely a renamed version of detoured.dll from Detours 2.x.
Now that I knew the likely culprit, I needed to know how it was getting there. During a November trip to the Mozilla Toronto office, I spent some time debugging a test laptop that was configured with Optimus.
Knowing that the presence of Detours was somehow interfering with our own API interception, I decided to find out whether it was also trying to intercept CreateWindowExW. I launched windbg, started Firefox with it, and then told it to break as soon as user32.dll was loaded:
sxe ld:user32.dll
Then I pressed F5 to resume execution. When the debugger broke again, user32 was now in memory. I wanted the debugger to break as soon as CreateWindowExW was touched:
ba w 4 user32!CreateWindowExW
Once again I resumed execution. Then the debugger broke on the memory access and gave me this call stack:
nvd3d9wrap!setDeviceHandle+0x1c91
nvd3d9wrap!initialise+0x373
nvd3d9wrap!setDeviceHandle+0x467b
nvd3d9wrap!setDeviceHandle+0x4602
ntdll!LdrpCallInitRoutine+0x14
ntdll!LdrpRunInitializeRoutines+0x26f
ntdll!LdrpLoadDll+0x453
ntdll!LdrLoadDll+0xaa
mozglue!`anonymous namespace'::patched_LdrLoadDll+0x1b0
KERNELBASE!LoadLibraryExW+0x1f7
KERNELBASE!LoadLibraryExA+0x26
kernel32!LoadLibraryA+0xba
nvinit+0x11cb
nvinit+0x5477
nvinit!nvCoprocThunk+0x6e94
nvinit!nvCoprocThunk+0x6e1a
ntdll!LdrpCallInitRoutine+0x14
ntdll!LdrpRunInitializeRoutines+0x26f
ntdll!LdrpLoadDll+0x453
ntdll!LdrLoadDll+0xaa
mozglue!`anonymous namespace'::patched_LdrLoadDll+0x1b0
KERNELBASE!LoadLibraryExW+0x1f7
kernel32!BasepLoadAppInitDlls+0x167
kernel32!LoadAppInitDlls+0x82
USER32!ClientThreadSetup+0x1f9
USER32!__ClientThreadSetup+0x5
ntdll!KiUserCallbackDispatcher+0x2e
GDI32!GdiDllInitialize+0x1c
USER32!_UserClientDllInitialize+0x32f
ntdll!LdrpCallInitRoutine+0x14
ntdll!LdrpRunInitializeRoutines+0x26f
ntdll!LdrpLoadDll+0x453
ntdll!LdrLoadDll+0xaa
mozglue!`anonymous namespace'::patched_LdrLoadDll+0x1b0
KERNELBASE!LoadLibraryExW+0x1f7
firefox!XPCOMGlueLoad+0x23c
firefox!XPCOMGlueStartup+0x1d
firefox!InitXPCOMGlue+0xba
firefox!NS_internal_main+0x5c
firefox!wmain+0xbe
firefox!__tmainCRTStartup+0xfe
kernel32!BaseThreadInitThunk+0xe
ntdll!__RtlUserThreadStart+0x70
ntdll!_RtlUserThreadStart+0x1b
This stack is a gold mine of information. In particular, it tells us the following:
1. The offending DLLs are being injected by AppInit_DLLs (and in fact, Raymond Chen has blogged about this exact case in the past).
2. nvinit.dll is the name of the DLL that is injected by step 1.
3. nvinit.dll loads nvd3d9wrap.dll, which then uses Detours to patch our copy of CreateWindowExW.
I then became curious as to which other functions they were patching.
Since Detours is patching executable code, we know that at some point it is going to need to call VirtualProtect to make the target code writable. In the worst case, VirtualProtect's caller is going to pass the address of the page where the target code resides. In the best case, the caller will pass in the address of the target function itself!
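Telling those two cases apart reduces to a page-base computation. A minimal sketch, assuming 4 KiB pages; the helper names are mine:

```cpp
#include <cassert>
#include <cstdint>

// When watching VirtualProtect calls, the address argument may be the target
// function itself (best case) or merely the base of the page containing it
// (worst case). This helper computes the page base so either form can be
// matched against a candidate function address. 4 KiB pages assumed.
constexpr uintptr_t kPageSize = 4096;

constexpr uintptr_t page_base(uintptr_t addr) {
    return addr & ~(kPageSize - 1);  // clear the low 12 bits
}

// True when `protectArg` could plausibly refer to `fnAddr`.
constexpr bool could_refer_to(uintptr_t protectArg, uintptr_t fnAddr) {
    return protectArg == fnAddr || protectArg == page_base(fnAddr);
}
```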
I restarted windbg, but this time I set a breakpoint on VirtualProtect:
bp kernel32!VirtualProtect
I then resumed the debugger and examined the call stack every time it broke.
While not every single VirtualProtect call would correspond to a detour, it would be obvious when one did, as the NVIDIA DLLs would be on the call stack.
The first time I caught a detour, I examined the address being passed to VirtualProtect. I ended up with the best possible case: the address was pointing to the actual target function! From there I was able to distill a list of other functions being hooked by the injected NVIDIA DLLs.
By this point I knew who was hooking our code and knew how it was getting there.
I also noticed that CreateWindowEx was the only function that the NVIDIA DLLs and our own code were both trying to intercept. Clearly there was some kind of bad interaction occurring between the two interception mechanisms, but what was it?
I decided to go back and examine a specific crash dump. In particular, I wanted to examine three different memory locations:
1. user32!CreateWindowExW;
2. xul!CreateWindowExWHook; and
3. the call to user32!CreateWindowExW that triggered the crash.
Of those three locations, the only one that looked off was location 2:
6b1f6611 56              push    esi
6b1f6612 ff15f033e975    call    dword ptr [combase!CClassCache::CLSvrClassEntry::GetDDEInfo+0x41 (75e933f0)]
6b1f6618 c3              ret
6b1f6619 7106            jno     xul!`anonymous namespace'::CreateWindowExWHook+0x6 (6b1f6621)
xul!`anonymous namespace'::CreateWindowExWHook:
6b1f661b cc              int     3
6b1f661c cc              int     3
6b1f661d cc              int     3
6b1f661e cc              int     3
6b1f661f cc              int     3
6b1f6620 cc              int     3
6b1f6621 ff              ???
Why the hell were the first six bytes filled with breakpoint instructions?
I decided at this point to look at some source code. Fortunately Microsoft publishes the 32-bit source code for Detours, licensed for non-profit use, under the name “Detours Express.”
I found a copy of Detours Express 2.1 and checked out the code. First I wanted to know where all of these 0xcc bytes were coming from. A quick grep turned up what I was looking for:
[Detours Express 2.1, lines 93-99: the routine that emits the int 3 (0xCC) padding bytes.]
Now that I knew which function was generating the int 3 instructions, I then wanted to find its callers. Soon I found:
[Detours Express 2.1, lines 1247-1251: the caller, which writes the int 3 padding immediately after writing the jmp to the trampoline.]
Okay, so Detours writes the breakpoints out immediately after it has written a jmp pointing to its trampoline.
Why is our hook function being trampolined?
The reason must be because our hook was installed first! Detours has detected that and has decided that the best place to trampoline to the NVIDIA hook is at the beginning of our hook function.
But Detours is using the wrong address!
We can see that because the int 3 instructions are written out at the beginning of CreateWindowExWHook, even though there should be a jmp instruction first.
Detours is calculating the wrong address to write its jmp!
Once I knew what the problem was, I needed to know more about the why – only then would I be able to come up with a way to work around this problem.
I decided to reconstruct the scenario where both our code and Detours are trying to hook the same function, but our hook was installed first. I would then follow along through the Detours code to determine how it calculated the wrong address to install its jmp.
The first thing to keep in mind is that Mozilla's function interception code takes advantage of hot-patch points in Windows. If the target function begins with a mov edi, edi prolog, we use a hot-patch style hook instead of a trampoline hook. I am not going to go into detail about hot-patch hooks here; the above Raymond Chen link contains enough details to answer your questions. For the purposes of this blog post, the important point is that Mozilla's code patches the mov edi, edi, so NVIDIA's Detours library would need to recognize and follow the jmps that our code patched in, in order to write its own jmp at CreateWindowExWHook.
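For readers unfamiliar with the layout: a hot-patch hook overwrites the two-byte mov edi, edi with a short jmp back into the five bytes of padding that precede the function, where a long jmp to the hook lives. The arithmetic can be sketched like this (illustrative helpers, not Mozilla's actual detection code):

```cpp
#include <cassert>
#include <cstdint>

// Hot-patchable Win32 functions begin with the two-byte no-op
// `mov edi, edi` (8B FF), and the compiler leaves five padding bytes
// immediately before the function for a patcher's long jmp.
bool is_hotpatch_prolog(const uint8_t* code) {
    return code[0] == 0x8B && code[1] == 0xFF;
}

// A two-byte short jmp (EB rel8) jumps relative to the *next* instruction,
// i.e. its own address plus two. For a hot-patch hook the rel8 is
// typically -7: back over the two-byte jmp plus the five padding bytes.
intptr_t short_jmp_target(intptr_t jmp_addr, int8_t rel8) {
    return jmp_addr + 2 + rel8;
}
```

So a patcher following the hook must decode two jumps in sequence: the short jmp into the padding, then the long rel32 jmp stored there.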
Tracing through the Detours code, I found the place where it checks for a hot-patch hook and follows the jmp if necessary. While examining a function called detour_skip_jmp, I found the bug:
[Detours Express 2.1, line 124 of detour_skip_jmp: pbNew is computed from pbCode plus the jmp's rel32 operand, without accounting for the length of the jmp instruction itself.]
This code is supposed to be telling Detours where the target address of a jmp is, so that Detours can follow it. pbNew is supposed to be the target address of the jmp. pbCode is referencing the address of the beginning of the jmp instruction itself. Unfortunately, with this type of jmp instruction, target addresses are always relative to the address of the next instruction, not the current instruction! Since the current jmp instruction is five bytes long, Detours ends up writing its jmp five bytes prior to the intended target address!
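The arithmetic is easy to demonstrate. The sketch below contrasts the correct rel32 target computation with the off-by-five mistake described above; it is an illustrative reconstruction, not the verbatim Detours source:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// A rel32 jmp (E9 xx xx xx xx) encodes its target relative to the address
// of the *next* instruction, i.e. the jmp's own address plus its five-byte
// length.
intptr_t jmp_target_correct(const uint8_t* pbCode) {
    int32_t rel;
    std::memcpy(&rel, pbCode + 1, sizeof(rel));  // skip the E9 opcode byte
    return reinterpret_cast<intptr_t>(pbCode) + 5 + rel;
}

// The Detours 2.x mistake: applying the displacement to the address of the
// jmp instruction itself, landing five bytes short of the real target.
intptr_t jmp_target_buggy(const uint8_t* pbCode) {
    int32_t rel;
    std::memcpy(&rel, pbCode + 1, sizeof(rel));
    return reinterpret_cast<intptr_t>(pbCode) + rel;  // forgot the +5!
}
```

Those five missing bytes are exactly why the corruption lands just before the hook function instead of on top of it.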
I went and checked the source code for Detours Express 3.0 to see if this had been fixed, and indeed it had:
[Detours Express 3.0, line 163: the same computation, now correctly adding the five-byte length of the jmp instruction before applying the rel32 displacement.]
That doesn’t do much for me right now, however, since the NVIDIA stuff is still using Detours 2.x.
In the case of Mozilla's code, there is legitimate executable code at that incorrect address that Detours writes to. It corrupts the last few instructions of that function, thus explaining those mysterious crashes in seemingly unrelated code.
I confirmed this by downloading the binaries from the build that was associated with the crash dump that I was analyzing. [As an aside, I should point out that you need to grab the identical binaries for this exercise; you cannot build from the same source revision and expect this to work due to variability that is introduced into builds by things like PGO.]
The five bytes preceding CreateWindowExWHook in the crash dump diverged from those same bytes in the original binaries. I could also make out that the overwritten bytes had consisted of a jmp instruction.
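The comparison itself amounts to a byte diff between the crash dump and the pristine binary. A trivial sketch of that check (my helper, not a real tool):

```cpp
#include <cassert>
#include <cstddef>

// Returns the offset of the first byte where the two ranges diverge,
// or -1 if they are identical. Used here to model comparing bytes from
// a crash dump against the same bytes in the original binary.
long first_divergence(const unsigned char* a, const unsigned char* b,
                      size_t len) {
    for (size_t i = 0; i < len; ++i) {
        if (a[i] != b[i]) return static_cast<long>(i);
    }
    return -1;
}
```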
Let us now review what we know at this point:
1. Detours 2.x miscalculates the target address when following the jmps from hot-patch hooks;
2. the offending DLLs are injected via the AppInit_DLLs entry for nvinit.dll.
How can we best distill this into a suitable workaround?
One option could be to block the NVIDIA DLLs outright. In most cases this would probably be the simplest option, but I was hesitant to do so this time. I was concerned about the unintended consequences of blocking what, for better or worse, is a user-mode component of NVIDIA video drivers.
Instead I decided to take advantage of the fact that we now know how this bug is triggered. I have modified our API interception code such that if it detects the presence of NVIDIA Optimus, it disables hot-patch style hooks.
Not only will this take care of the crash spike that started when I landed bug 1213567, I also expect it to take care of other crash signatures whose relationship to this bug was not obvious.
That concludes this episode of Bugs from Hell. Until next time…
I shall begin with the proposition that the legacy, non-jetpack environment for addons is not an API. As ridiculous as some readers might consider this to be, please humour me for a moment.
Let us go back to the acronym, “API.” Application Programming Interface. While the usage of the term “API” seems to have expanded over the years to encompass just about any type of interface whatsoever, I’d like to explore the first letter of that acronym: Application.
An Application Programming Interface is a specific type of interface that is exposed for the purposes of building applications. It typically provides a formal abstraction layer that isolates applications from the implementation details behind the lower tier(s) in the software stack. In the case of web browsers, I suggest that there are two distinct types of applications: web content, and extensions.
There is obviously a very well defined API for web content. On the other hand, I would argue that Gecko’s legacy addon environment is not an API at all! From the point of view of an extension, there is no abstraction, limited formality, and not necessarily an intention to be used by applications.
An extension is imported into Firefox with full privileges and can access whatever it wants. Does it have access to interfaces? Yes, but are those interfaces intended for applications? Some are, but many are not. The environment that Gecko currently provides for legacy addons is analogous to an operating system running every single application in kernel mode. Is that powerful? Absolutely! Is that the best thing to do for maintainability and robustness? Absolutely not!
Somewhere a line needs to be drawn to demarcate this abstraction layer and improve Gecko developers’ ability to make improvements under the hood. Last week’s announcement was an invitation to addon developers to help shape that future. Please participate and please do so constructively!
When I first heard rumors about WebExtensions in Whistler, my source made it very clear to me that the WebExtensions initiative is not about making Chrome extensions run in Firefox. In fact, I am quite disappointed with some of the press coverage that seems to completely miss this point.
Yes, WebExtensions will be implementing some APIs to be source compatible with Chrome. That makes it easier to port a Chrome extension, but porting will still be necessary. I like the Venn Diagram concept that the WebExtensions FAQ uses: Some Chrome APIs will not be available in WebExtensions. On the other hand, WebExtensions will be providing APIs above and beyond the Chrome API set that will maintain Firefox’s legacy of extensibility.
Please try not to think of this project as Mozilla taking functionality away. In general I think it is safe to think of this as an opportunity to move that same functionality to a mechanism that is more formal and abstract.
While my diff showed these APIs as new exports for Windows 10, the MSDN docs claim that these APIs are actually new for the Windows 8.1 Update. Using the OfferVirtualMemory and ReclaimVirtualMemory functions, we can now specify ranges of virtual memory that are safe to be discarded under memory pressure. Later on, should we request that access be restored to that memory, the kernel will either return that virtual memory to us unmodified, or advise us that the associated pages have been discarded.
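For readers more familiar with POSIX, a rough Linux analogue of the discard half of this pattern is madvise(MADV_DONTNEED) on a private anonymous mapping. Unlike OfferVirtualMemory there is no "returned intact" outcome here (Linux discards the pages immediately, and later reads see zero-fill), so this is only a loose parallel:

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Linux-only sketch: dirty a page, tell the kernel we don't need it, and
// observe that a subsequent read sees a fresh zero-filled page. Returns 0
// when the discard behaved as expected, -1 on setup failure.
int discard_demo() {
    const size_t len = 4096;
    void* mem = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    auto* p = static_cast<unsigned char*>(mem);

    p[0] = 42;  // dirty the page
    if (madvise(mem, len, MADV_DONTNEED) != 0) {
        munmap(mem, len);
        return -1;
    }
    int after = p[0];  // the discarded page reads back as zero
    munmap(mem, len);
    return after;  // 0 when the discard happened
}
```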
A couple of years ago we had an intern on the Perf Team who was working on bringing this capability to Linux. I am pleasantly surprised that this is now offered on Windows.
madvise(MADV_WILLNEED)
for Win32For the longest time we have been hacking around the absence of a madvise
-like
API on Win32. On Linux we will do a madvise(MADV_WILLNEED)
on memory-mapped
files when we want the kernel to read ahead. On Win32, we were opening the
backing file and then doing a series of sequential reads through the file to
force the kernel to cache the file data. As of Windows 8, we can now call
PrefetchVirtualMemory
for a similar effect.
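On the Linux side, the read-ahead hint looks like this. A minimal POSIX sketch using a throwaway temporary file; PrefetchVirtualMemory plays the analogous role on Windows 8+:

```cpp
#include <cstdio>
#include <sys/mman.h>
#include <unistd.h>

// Map one page of a temporary file and ask the kernel to read it ahead.
// Returns 0 on success (posix_madvise's success value), -1 on setup failure.
int prefetch_demo() {
    std::FILE* f = std::tmpfile();
    if (!f) return -1;

    long pagesz = sysconf(_SC_PAGESIZE);
    int rc = -1;
    if (ftruncate(fileno(f), pagesz) == 0) {
        void* p = mmap(nullptr, static_cast<size_t>(pagesz), PROT_READ,
                       MAP_SHARED, fileno(f), 0);
        if (p != MAP_FAILED) {
            // Hint that the whole file-backed mapping will be read soon.
            rc = posix_madvise(p, static_cast<size_t>(pagesz),
                               POSIX_MADV_WILLNEED);
            munmap(p, static_cast<size_t>(pagesz));
        }
    }
    std::fclose(f);
    return rc;
}
```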
The OperationStart and OperationEnd APIs are intended to record access patterns during a file I/O operation. SuperFetch will then create prefetch files for the operation, enabling prefetch capabilities above and beyond the use case of initial process startup.
This API is not actually new, but I couldn't find any invocations of it in the Mozilla codebase. CreateMemoryResourceNotification allocates a kernel handle that becomes signalled when physical memory is running low. Gecko already has facilities for handling memory pressure events on other platforms, so we should probably add this to the Win32 port.
Today I want to talk about some code that we imported from Chromium some time ago. I replaced it in Mozilla’s codebase a few months back in bug 1072752:
[The imported Chromium snippet: it checks the queue with GetQueueStatus and PeekMessage, and if mouse messages belonging to an attached thread are present, it calls WaitMessage.]
This code is wrong. Very wrong.
Let us start with the calls to GetQueueStatus and PeekMessage. Those APIs mark any messages already in the thread's message queue as having been seen, such that they are no longer considered "new." Even though those function calls do not remove messages from the queue, any messages that were in the queue at this point are considered to be "old."
The logic in this code snippet is essentially saying, "if the queue contains mouse messages that do not belong to this thread, then they must belong to an attached thread." The code then calls WaitMessage in an effort to give the other thread(s) a chance to process their mouse messages. This is where the code goes off the rails.
The documentation for WaitMessage states the following:
Note that WaitMessage does not return if there is unread input in the message queue after the thread has called a function to check the queue. This is because functions such as PeekMessage, GetMessage, GetQueueStatus, WaitMessage, MsgWaitForMultipleObjects, and MsgWaitForMultipleObjectsEx check the queue and then change the state information for the queue so that the input is no longer considered new. A subsequent call to WaitMessage will not return until new input of the specified type arrives. The existing unread input (received prior to the last time the thread checked the queue) is ignored.
WaitMessage will only return if there is a new (as opposed to any) message in the queue for the calling thread. Any messages for the calling thread that were already in there at the time of the GetQueueStatus and PeekMessage calls are no longer new, so they are ignored.
There might very well be a message at the head of that queue that should be processed by the current thread. Instead it is ignored while we wait for other threads. Here is the crux of the problem: we’re waiting on other threads whose input queues are attached to our own! That other thread can’t process its messages because our thread has messages in front of its messages; on the other hand, our thread has blocked itself!
The only way to break this deadlock is for new messages to be added to the queue. That is a big reason why we're seeing things like bug 1105386: moving the mouse adds new messages to the queue, making WaitMessage unblock.
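The queue semantics that produce this deadlock can be captured in a toy model: checking the queue marks everything in it as "seen," and a WaitMessage-style wait wakes only for unseen input. This is an illustration of the rules quoted above, not how USER32 is actually implemented:

```cpp
#include <cassert>
#include <deque>

// Toy model of Win32 message-queue "new vs old" state. peek() plays the
// role of PeekMessage/GetQueueStatus (marks everything as seen), and
// wait_would_return() plays the role of WaitMessage (wakes only on NEW
// input). All names are illustrative.
struct ToyQueue {
    struct Msg { int id; bool seen; };
    std::deque<Msg> msgs;

    void post(int id) { msgs.push_back({id, false}); }  // new input arrives

    bool peek() {                      // like PeekMessage(PM_NOREMOVE)
        for (auto& m : msgs) m.seen = true;  // everything is now "old"
        return !msgs.empty();
    }

    bool wait_would_return() {         // like WaitMessage
        for (const auto& m : msgs)
            if (!m.seen) return true;  // only NEW input wakes us
        return false;                  // old, unread input is ignored
    }
};
```

Walking through the bug with this model: the code peeks (everything becomes old), then waits; the wait ignores the pending-but-old input, and only fresh input, such as a mouse move, unblocks it.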
I’ve already eliminated this code in Mozilla’s codebase, but the challenge is going to be getting rid of this code in third-party binaries that attach their own windows to Firefox’s windows.