Aaron Klotz’s Software Blog

My Adventures in Software Development

DnsQueryEx Needs Love

Recently I’ve been doing some work with DnsQueryEx, but unfortunately this has been less than pleasant. Not only are there errors in its documentation, but the API itself contains a bug that IMHO should never have made it to release.

Like many other Win32 APIs, DnsQueryEx is an asynchronous interface that also supports being called synchronously. Whether their completion mechanism uses an event object, an APC, an I/O Completion Port, or some other technique, asynchronous Win32 APIs consistently employ a common convention:

When a caller invokes the API and the API is able to execute asynchronously, it returns ERROR_IO_PENDING. On the other hand, when the API fails, is able to satisfy the request immediately, or was invoked synchronously, it immediately returns the final error code.

For emphasis: In Win32, most asynchronous APIs reserve the right to complete synchronously if they are able to immediately satisfy a request.

Enter DnsQueryEx: while its internal implementation follows this convention, the implementation of its public interface does not!

This is really easy to reproduce (on a fully-updated Windows 10 21H1, at least) by setting up an asynchronous call to DnsQueryEx, and querying for "localhost". The caller must populate the pQueryCompletionCallback field in the DNS_QUERY_REQUEST structure.

DnsQueryEx returns ERROR_SUCCESS. Great, the asynchronous API was able to immediately fulfill the request!

Everything works according to plan until we examine the pQueryRecords field of the DNS_QUERY_RESULT structure. That field is NULL! Every other output from this function points to a successful query, and yet we receive no results!

I spent several hours poring over the documentation and attempting different permutations of the localhost query; however, the only way that I could coerce DnsQueryEx to actually produce the expected output was to invoke it synchronously.

I finally determined that this poking around was becoming futile and decided to examine the disassembly. Here’s some (highly-simplified) pseudocode of what I found:

  // Inside DnsQueryEx
  bool isSynchronous = pQueryRequest->pQueryCompletionCallback == nullptr;
  PDNS_QUERY_RESULT internalDnsQueryResult = /*<make private copy of pQueryResults>*/;
  // Call internal implementation. It returns the same error codes as DnsQueryEx
  DWORD win32ErrorCode = Query_PrivateExW(pQueryRequest, internalDnsQueryResult);
  if (isSynchronous) {
    memcpy(pQueryResult, internalDnsQueryResult, sizeof(DNS_QUERY_RESULT));
    return win32ErrorCode;
  }
  // Otherwise we're executing asynchronously, continue on that path...

Based on the background that I outlined above, do you see the bug?

I’ll give you a hint: ERROR_IO_PENDING.

See it now?

Okay, here goes: isSynchronous is the wrong condition for determining whether to copy the internal records to pQueryResult and immediately return! In fact, I would argue that isSynchronous should not be checked at all: instead, DnsQueryEx should be checking that win32ErrorCode != ERROR_IO_PENDING!
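To make the proposed fix concrete, here is a minimal, portable C++ model of the corrected control flow. It is only a sketch: QueryResult, InternalQuery, and PublicQuery are hypothetical stand-ins rather than dnsapi.dll symbols, and only the numeric values of ERROR_SUCCESS (0) and ERROR_IO_PENDING (997) are taken from Win32.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// These two values match the Win32 definitions; all other names are hypothetical.
constexpr uint32_t kErrorSuccess = 0;      // ERROR_SUCCESS
constexpr uint32_t kErrorIoPending = 997;  // ERROR_IO_PENDING

// Stand-in for DNS_QUERY_RESULT: a status plus a pointer to the records.
struct QueryResult {
  uint32_t status = 0;
  const char* records = nullptr;
};

// Stand-in for the internal implementation (Query_PrivateExW in the post).
using InternalQuery = std::function<uint32_t(bool wantsAsync, QueryResult&)>;

// Model of the public entry point with the proposed fix applied: whether to
// copy the results is decided by the returned code, not by how we were called.
uint32_t PublicQuery(const InternalQuery& internal, bool wantsAsync,
                     QueryResult& callerResult) {
  assert(internal);
  QueryResult internalResult;
  const uint32_t code = internal(wantsAsync, internalResult);
  if (code != kErrorIoPending) {
    // Completed (or failed) immediately: hand the results back right away.
    callerResult = internalResult;
    return code;
  }
  // Truly asynchronous: results arrive later via the completion callback.
  return kErrorIoPending;
}
```

With this shape, an asynchronous request that happens to complete immediately still delivers its records to the caller, which is exactly what the localhost case needs.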

To add insult to injury, Query_PrivateExW correctly allocates the output records from the heap, so DnsQueryEx is effectively leaking them.

I’m going to try reporting this issue via Feedback Hub, but if any Microsofties see this, I’d appreciate it if you could flag the maintainer of dnsapi.dll and get this fixed.

I suppose one workaround is to look for a successful call to DnsQueryEx with NULL records, and then fall back to invoking it synchronously. On the other hand, that doesn’t help with the memory leak.
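A rough sketch of that fallback, written against a hypothetical wrapper rather than the real API: QueryFn stands in for "one call to DnsQueryEx", and QueryResult stands in for DNS_QUERY_RESULT. Both names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

constexpr uint32_t kErrorSuccess = 0;  // same value as Win32 ERROR_SUCCESS

// Hypothetical stand-in for DNS_QUERY_RESULT.
struct QueryResult {
  const char* records = nullptr;
};

// Hypothetical wrapper around a single DnsQueryEx call; `async` says whether a
// completion callback was supplied with the request.
using QueryFn = std::function<uint32_t(bool async, QueryResult&)>;

// Try asynchronously first. If the call claims success but produced no
// records, repeat it synchronously. This does not recover whatever the first
// call leaked.
uint32_t QueryWithFallback(const QueryFn& query, QueryResult& result) {
  assert(query);
  const uint32_t code = query(/* async = */ true, result);
  if (code == kErrorSuccess && result.records == nullptr) {
    return query(/* async = */ false, result);
  }
  return code;
}
```

A result of ERROR_IO_PENDING passes straight through, so the genuinely asynchronous path is unaffected by the retry.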

Another gross, hacky option could be to manually check for special queries like localhost prior to calling the API, but this isn’t exhaustive: there could be other reasons that Query_PrivateExW decides to execute synchronously.

As you can see, this is a pretty trivial test case, which is why I find this bug to be so disappointing. I am a big proponent of not attributing bugs to an OS until I have proof otherwise, but the disassembly I encountered was pretty damning.

Hopefully this gets fixed. Until next time…

UPDATE: Microsoft’s Tommy Jensen noted that this bug has been fixed in Windows 11, but unfortunately will not be backported to Windows 10. Thanks to Brad Fitzpatrick for amplifying this post on Twitter.

All Good Things…

Today is my final day as an employee of Mozilla Corporation.

My first patch landed in Firefox 19, and my final patch as an employee has landed in Nightly for Firefox 93.

I’ll be moving on to something new in a few weeks’ time, but for now, I’d just like to say this:

My time at Mozilla has made me into a better software developer, a better leader, and more importantly, a better person.

I’d like to thank all the Mozillians whom I have interacted with over the years for their contributions to making that happen.

I will continue to update this blog with catch-up posts describing my Mozilla work, though I am unsure what content I will be able to contribute beyond that. Time will tell!

Until next time…

2019 Roundup: Part 1 - Porting the DLL Interceptor to AArch64

In my continuing efforts to get caught up on discussing my work, I am now commencing a roundup for 2019. I think I am going to structure this one slightly differently from the last one: I am going to try to segment this roundup by project.

Here is an index of all the entries in this series:

Porting the DLL Interceptor to AArch64

During early 2019, Mozilla was working to port Firefox to run on the new AArch64 builds of Windows. At our December 2018 all-hands, I brought up the necessity of including the DLL Interceptor in our porting efforts. Since no good deed goes unpunished, I was put in charge of doing the work! [I’m actually kidding here; this project was right up my alley and I was happy to do it! – Aaron]

Before continuing, you might want to review my previous entry describing the Great Interceptor Refactoring of 2018, as this post revisits some of the concepts introduced there.

Let us review some DLL Interceptor terminology:

  • The target function is the function we want to hook (Note that this is a distinct concept from a branch target, which is also discussed in this post);
  • The hook function is our function that we want the intercepted target function to invoke;
  • The trampoline is a small chunk of executable code generated by the DLL interceptor that facilitates calling the target function’s original implementation.

On more than one occasion I had to field questions about why this work was even necessary for AArch64: there aren’t going to be many injected DLLs in a Win32 ecosystem running on a shiny new processor architecture! In fact, the DLL Interceptor is used for more than just facilitating the blocking of injected DLLs; we also use it for other purposes.

Not all of this work was done in one bug: some tasks were more urgent than others. I began this project by enumerating our extant uses of the interceptor to determine which instances were relevant to the new AArch64 port. I threw a record of each instance into a colour-coded spreadsheet, which proved to be very useful for tracking progress: Reds were “must fix” instances, yellows were “nice to have” instances, and greens were “fixed” instances. Coordinating with the milestones laid out by program management, I was able to assign each instance to a bucket which would help determine a total ordering for the various fixes. I landed the first set of changes in bug 1526383, and the second set in bug 1532470.

It was now time to sit down, download some AArch64 programming manuals, and take a look at what I was dealing with. While I have been messing around with x86 assembly since I was a teenager, my first exposure to RISC architectures was via the DLX architecture introduced by Hennessy and Patterson in their textbooks. While DLX was crafted specifically for educational purposes, it served for me as a great point of reference. When I was a student taking CS 241 at the University of Waterloo, we had to write a toy compiler that generated DLX code. That experience ended up saving me a lot of time when looking into AArch64! While the latter is definitely more sophisticated, I could clearly recognize analogs between the two architectures.

In some ways, targeting a RISC architecture greatly simplifies things: The DLL Interceptor only needs to concern itself with a small subset of the AArch64 instruction set: loads and branches. In fact, the DLL Interceptor’s AArch64 disassembler only looks for nine distinct instructions! As a bonus, since the instruction length is fixed, we can easily copy over verbatim any instructions that are not loads or branches!

On the other hand, one thing that increased complexity of the port is that some branch instructions to relative addresses have maximum offsets. If we must branch farther than that maximum, we must take alternate measures. For example, in AArch64, an unconditional branch with an immediate offset must land in the range of ±128 MiB from the current program counter.

Why is this a problem, you ask? Well, Detours-style interception must overwrite the first several instructions of the target function. To write an absolute jump, we require at least 16 bytes: 4 for an LDR instruction, 4 for a BR instruction, and another 8 for the 64-bit absolute branch target address.
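To illustrate where those 16 bytes go, here is a small sketch that assembles the sequence into a byte buffer. The choice of X16 as the scratch register and the helper names (StoreLE, MakeAbsoluteJump) are assumptions of this example, not a description of the actual interceptor code; the two instruction encodings are the standard ones for LDR X16, #8 and BR X16.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Writes `value` to `dst` in little-endian byte order, regardless of host.
template <typename T>
void StoreLE(uint8_t* dst, T value) {
  for (size_t i = 0; i < sizeof(T); ++i) {
    dst[i] = static_cast<uint8_t>(value >> (8 * i));
  }
}

// Builds the 16-byte absolute jump:
//   LDR X16, #8   ; load the 64-bit literal located 8 bytes past this LDR
//   BR  X16       ; branch to the loaded address
//   .quad target  ; the literal itself
std::array<uint8_t, 16> MakeAbsoluteJump(uint64_t target) {
  const uint32_t kLdrX16Literal8 = 0x58000050;  // LDR X16, #8
  const uint32_t kBrX16 = 0xD61F0200;           // BR X16
  std::array<uint8_t, 16> patch{};
  StoreLE(patch.data() + 0, kLdrX16Literal8);
  StoreLE(patch.data() + 4, kBrX16);
  StoreLE(patch.data() + 8, target);
  return patch;
}
```

Because the literal sits immediately after the BR, the LDR offset is always 8 and the whole sequence is position-independent.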

Unfortunately, target functions may be really short! Some of the target functions that we need to patch consist only of a single 4-byte instruction!

In this case, our only option for patching the target is to use an immediate B instruction, but that only works if our hook function falls within that ±128 MiB limit. If it does not, we need to construct a veneer. A veneer is a special trampoline whose location falls within the target range of a branch instruction. Its sole purpose is to provide an unconditional jump to the “real” desired branch target that lies outside of the range of the original branch. Using veneers, we can successfully hook a target function even if it is only one instruction (i.e., 4 bytes) in length, and the hook function lies more than 128 MiB away from it. The AArch64 Procedure Call Standard specifies X16 as a volatile register that is explicitly intended for use by veneers: veneers load an absolute target address into X16 (without needing to worry about whether or not they’re clobbering anything), and then unconditionally jump to it.
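The "direct branch or veneer?" decision can be sketched as a range check followed by encoding the B instruction. TryEncodeBranch is a hypothetical helper written for this example; the encoding itself (opcode 0x14000000 with a signed 26-bit word offset) is the standard AArch64 one.

```cpp
#include <cassert>
#include <cstdint>

// Tries to encode `B target` for an instruction located at `pc`. Returns false
// when the target is misaligned or lies beyond the +/-128 MiB reach, in which
// case the caller would have to go through a veneer instead.
bool TryEncodeBranch(uint64_t pc, uint64_t target, uint32_t* encoding) {
  assert(encoding != nullptr);
  const int64_t offset = static_cast<int64_t>(target - pc);
  constexpr int64_t kReach = int64_t{1} << 27;  // 128 MiB
  if (offset % 4 != 0 || offset < -kReach || offset >= kReach) {
    return false;
  }
  // imm26 holds the signed offset in units of 4-byte instructions.
  const uint32_t imm26 = static_cast<uint32_t>(offset / 4) & 0x03FFFFFFu;
  *encoding = 0x14000000u | imm26;
  return true;
}
```

When this returns false for the hook function's address, the same check can be repeated against a veneer's address, which is why the veneer must be allocated within reach of the patched instruction.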

Measuring Target Function Instruction Length

To determine how many instructions the target function has for us to work with, we make two passes over the target function’s code. The first pass simply counts how many instructions are available for patching (up to the 4-instruction maximum needed for absolute branches; we don’t really care beyond that).

The second pass actually populates the trampoline, builds the veneer (if necessary), and patches the target function.

Veneer Support

Since the DLL interceptor is already well-equipped to build trampolines, it did not take much effort to add support for constructing veneers. However, where to write out a veneer is just as important as what to write to a veneer.

Recall that we need our veneer to reside within ±128 MiB of an immediate branch. Therefore, we need to be able to exercise some control over where the trampoline memory for veneers is allocated. Until this point, our trampoline allocator had no need to care about this; I had to add this capability.

Adding Range-Aware VM Allocation

Firstly, I needed to make the MMPolicy classes range-aware: we need to be able to allocate trampoline space within acceptable distances from branch instructions.

Consider that, as described above, a branch instruction may have limits on the extents of its target. As data, this is easily formatted as a pivot (i.e., the PC at the location where the branch instruction is encountered), and a maximum distance in either direction from that pivot.

On the other hand, range-constrained memory allocation tends to work in terms of lower and upper bounds. I wrote a conversion method, MMPolicyBase::SpanFromPivotAndDistance, to convert between the two formats. In addition to format conversion, this method also constrains resulting bounds such that they are above the 1MiB mark of the process’ address space (to avoid reserving memory in VM regions that are sensitive to compatibility concerns), as well as below the maximum allowable user-mode VM address.
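The conversion described above might look roughly like the following. This is a simplified illustration, not the actual MMPolicyBase code: the Span type is hypothetical, and the maximum user-mode address is passed in as a parameter here, whereas real code would obtain it from the OS.

```cpp
#include <cassert>
#include <cstdint>

// Inclusive [lower, upper] address bounds. Hypothetical type for this sketch.
struct Span {
  uint64_t lower;
  uint64_t upper;
};

constexpr uint64_t kMinAllowable = uint64_t{1} << 20;  // the 1 MiB mark

// Converts (pivot, maxDistance) into lower/upper bounds, clamped to the
// region between the 1 MiB mark and `maxUserAddress`.
Span SpanFromPivotAndDistance(uint64_t pivot, uint64_t maxDistance,
                              uint64_t maxUserAddress) {
  assert(maxUserAddress >= kMinAllowable);
  // Saturate instead of wrapping when the distance exceeds the address space.
  uint64_t lower = pivot > maxDistance ? pivot - maxDistance : 0;
  uint64_t upper =
      pivot > UINT64_MAX - maxDistance ? UINT64_MAX : pivot + maxDistance;
  if (lower < kMinAllowable) lower = kMinAllowable;
  if (upper > maxUserAddress) upper = maxUserAddress;
  return Span{lower, upper};
}
```

For an AArch64 B instruction, maxDistance would be 128 MiB and pivot would be the address of the instruction being patched.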

Another issue with range-aware VM allocation is determining the location, within the allowable range, for the actual VM reservation. Ideally we would like the kernel’s memory manager to choose the best location for us: its holistic view of existing VM layout (not to mention ASLR) across all processes will provide superior VM reservations. On the other hand, the Win32 APIs that facilitate this are specific to Windows 10. When available, MMPolicyInProcess uses VirtualAlloc2 and MMPolicyOutOfProcess uses MapViewOfFile3. When we’re running on Windows versions where those APIs are not yet available, we need to fall back to finding and reserving our own range. The MMPolicyBase::FindRegion method handles this for us.

All of this logic is wrapped up in the MMPolicyBase::Reserve method. In addition to the desired VM size and range, the method also accepts two functors that wrap the OS APIs for reserving VM. Reserve uses those functors when available, otherwise it falls back to FindRegion to manually locate a suitable reservation.

Now that our memory management primitives were range-aware, I needed to shift my focus over to our VM sharing policies.

One impetus for the Great Interceptor Refactoring was to enable separate Interceptor instances to share a unified pool of VM for trampoline memory. To make this range-aware, I needed to make some additional changes to VMSharingPolicyShared. It would no longer be sufficient to assume that we could just share a single block of trampoline VM — we now needed to make the shared VM policy capable of potentially allocating multiple blocks of VM.

VMSharingPolicyShared now contains a mapping of ranges to VM blocks. If we request a reservation which an existing block satisfies, we re-use that block. On the other hand, if we require a range that is yet unsatisfied, then we need to allocate a new one. I admit that I kind of half-assed the implementation of the data structure we use for the mapping; I was too lazy to implement a fully-fledged interval tree. The current implementation is probably “good enough”; however, it’s probably worth fixing at some point.
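As a rough picture of what a "good enough" mapping could look like, the sketch below keeps a flat list of reserved blocks and searches it linearly. Block and BlockMap are hypothetical names for this example, and the layout is an assumption; it is not the data structure that VMSharingPolicyShared actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A reserved block of trampoline VM covering [base, base + size).
struct Block {
  uint64_t base;
  uint64_t size;
};

// Flat list searched linearly, standing in for a proper interval tree.
class BlockMap {
 public:
  // Returns an existing block lying entirely within [lower, upper], or
  // nullptr if none does. The pointer is only valid until the next Add().
  const Block* Find(uint64_t lower, uint64_t upper) const {
    for (const Block& block : mBlocks) {
      if (block.base >= lower && block.base + block.size - 1 <= upper) {
        return &block;
      }
    }
    return nullptr;
  }

  // Records a newly reserved block so that later requests can reuse it.
  void Add(const Block& block) {
    assert(block.size > 0);
    mBlocks.push_back(block);
  }

 private:
  std::vector<Block> mBlocks;
};
```

A caller would first try Find with the span it needs and only reserve (and Add) a new block when that comes back empty. Lookup cost grows linearly with the number of blocks, which is tolerable only while that number stays small.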

Finally, I added a new generic class, TrampolinePool, that acts as an abstraction of a reserved block of VM address space. The main interceptor code requests a pool by calling the VM sharing policy’s Reserve method, then it uses the pool to retrieve new Trampoline instances to be populated.

AArch64 Trampolines

It is much simpler to generate trampolines for AArch64 than it is for x86(-64). The most noteworthy addition to the Trampoline class is the WriteLoadLiteral method, which writes an absolute address into the trampoline’s literal pool, followed by writing an LDR instruction referencing that literal into the trampoline.

Thanks for reading! Coming up next time: My Untrusted Modules Opus.