Recently I’ve been doing some work with DnsQueryEx
,
but unfortunately this has been less than pleasant. Not only are there errors in its documentation,
but the API itself contains a bug that IMHO should never have made it to release.
Like many other Win32 APIs, DnsQueryEx
is an asynchronous interface that also
supports being called synchronously. Whether their completion mechanism uses an
event object, an APC, an I/O Completion Port, or some other technique,
asynchronous Win32 APIs consistently employ a common convention:
When a caller invokes the API, and that API is able to execute asynchronously,
it returns ERROR_IO_PENDING
. On the other hand, when the API fails, the API is
able to immediately satisfy the request, or the API was invoked synchronously,
the function immediately returns the final error code.
For emphasis: In Win32, most asynchronous APIs reserve the right to complete synchronously if they are able to immediately satisfy a request.
Enter DnsQueryEx
: while its internal implementation follows this convention,
the implementation of its public interface does not!
This is really easy to reproduce (on a fully-updated Windows 10 21H1, at least)
by setting up an asynchronous call to DnsQueryEx
, and querying for "localhost"
.
The caller must populate the pQueryCompletionCallback
field in the DNS_QUERY_REQUEST
structure.
DnsQueryEx
returns ERROR_SUCCESS
. Great, the asynchronous API was able to
immediately fulfill the request!
Everything works according to plan until we examine the pQueryRecords
field of
the DNS_QUERY_RESULT
structure. That field is NULL
! Every other output from this function points to
a successful query, and yet we receive no results!
I spent several hours pouring over the documentation and attempting different
permutations of the localhost
query, however the only way that I could coerce
DnsQueryEx
to actually produce the expected output is if I invoked it
synchronously.
I finally determined that this poking around was becoming futile and decided to examine the disassembly. Here’s some (highly-simplified) pseudocode of what I found:
1 2 3 4 5 6 7 8 9 10 |
|
Based on the background that I outlined above, do you see the bug?
I’ll give you a hint: ERROR_IO_PENDING
.
See it now?
Okay, here goes: isSynchronous
is the wrong condition for determining
whether to copy the internal records to pQueryResult
and immediately
return! In fact, I would argue that isSynchronous
should not be checked at
all: instead, DnsQueryEx
should be checking that win32ErrorCode != ERROR_IO_PENDING
!
To add insult to injury, Query_PrivateExW
correctly allocates the output
records from the heap, so DnsQueryEx
is effectively leaking them.
I’m going to try reporting this issue via Feedback Hub, but if any Microsofties
see this, I’d appreciate it if you could flag the maintainer of dnsapi.dll
and
get this fixed.
I suppose one workaround is to look for a successful call to DnsQueryEx
with
NULL
records, and then fall back to invoking it synchronously. On the other
hand, that doesn’t help with the memory leak.
Another gross, hacky option could be to manually check for special queries like
localhost
prior to calling the API, but this isn’t exhaustive: there could
be other reasons that Query_PrivateExW
decides to execute synchronously.
As you can see, this is a pretty trivial test case, which is why I find this bug to be so disappointing. I am a big proponent of attributing bugs to an OS until I have proof otherwise, but the disassembly I encountered was pretty damning.
Hopefully this gets fixed. Until next time…
UPDATE: Microsoft’s Tommy Jensen noted that this bug has been fixed in Windows 11, but unfortunately will not be backported to Windows 10. Thanks to Brad Fitzpatrick for amplifying this post on Twitter.