Aaron Klotz at Mozilla

My adventures as a member of Mozilla’s Desktop Performance Team

On WebExtensions

| Comments

There has been enough that has been said over the past week about WebExtensions that I wasn’t sure if I wanted to write this post. As usual, I can’t seem to help myself. Note the usual disclaimer that this is my personal opinion. Further note that I have no involvement with WebExtensions at this time, so I write this from the point of view of an observer.

API? What API?

I shall begin with the proposition that the legacy, non-jetpack environment for addons is not an API. As ridiculous as some readers might consider this to be, please humour me for a moment.

Let us go back to the acronym, “API.” Application Programming Interface. While the usage of the term “API” seems to have expanded over the years to encompass just about any type of interface whatsoever, I’d like to explore the first letter of that acronym: Application.

An Application Programming Interface is a specific type of interface that is exposed for the purposes of building applications. It typically provides a formal abstraction layer that isolates applications from the implementation details behind the lower tier(s) in the software stack. In the case of web browsers, I suggest that there are two distinct types of applications: web content, and extensions.

There is obviously a very well defined API for web content. On the other hand, I would argue that Gecko’s legacy addon environment is not an API at all! From the point of view of an extension, there is no abstraction, limited formality, and not necessarily an intention to be used by applications.

An extension is imported into Firefox with full privileges and can access whatever it wants. Does it have access to interfaces? Yes, but are those interfaces intended for applications? Some are, but many are not. The environment that Gecko currently provides for legacy addons is analagous to an operating system running every single application in kernel mode. Is that powerful? Absolutely! Is that the best thing to do for maintainability and robustness? Absolutely not!

Somewhere a line needs to be drawn to demarcate this abstraction layer and improve Gecko developers’ ability to make improvements under the hood. Last week’s announcement was an invitation to addon developers to help shape that future. Please participate and please do so constructively!

WebExtensions are not Chrome Extensions

When I first heard rumors about WebExtensions in Whistler, my source made it very clear to me that the WebExtensions initiative is not about making Chrome extensions run in Firefox. In fact, I am quite disappointed with some of the press coverage that seems to completely miss this point.

Yes, WebExtensions will be implementing some APIs to be source compatible with Chrome. That makes it easier to port a Chrome extension, but porting will still be necessary. I like the Venn Diagram concept that the WebExtensions FAQ uses: Some Chrome APIs will not be available in WebExtensions. On the other hand, WebExtensions will be providing APIs above and beyond the Chrome API set that will maintain Firefox’s legacy of extensibility.

Please try not to think of this project as Mozilla taking functionality away. In general I think it is safe to think of this as an opportunity to move that same functionality to a mechanism that is more formal and abstract.

Interesting Win32 APIs

| Comments

Yesterday I decided to diff the export tables of some core Win32 DLLs to see what’s changed between Windows 8.1 and the Windows 10 technical preview. There weren’t many changes, but the ones that were there are quite exciting IMHO. While researching these new APIs, I also stumbled across some others that were added during the Windows 8 timeframe that we should be considering as well.

Volatile Ranges

While my diff showed these APIs as new exports for Windows 10, the MSDN docs claim that these APIs are actually new for the Windows 8.1 Update. Using the OfferVirtualMemory and ReclaimVirtualMemory functions, we can now specify ranges of virtual memory that are safe to discarded under memory pressure. Later on, should we request that access be restored to that memory, the kernel will either return that virtual memory to us unmodified, or advise us that the associated pages have been discarded.

A couple of years ago we had an intern on the Perf Team who was working on bringing this capability to Linux. I am pleasantly surprised that this is now offered on Windows.

madvise(MADV_WILLNEED) for Win32

For the longest time we have been hacking around the absence of a madvise-like API on Win32. On Linux we will do a madvise(MADV_WILLNEED) on memory-mapped files when we want the kernel to read ahead. On Win32, we were opening the backing file and then doing a series of sequential reads through the file to force the kernel to cache the file data. As of Windows 8, we can now call PrefetchVirtualMemory for a similar effect.

Operation Recorder: An API for SuperFetch

The OperationStart and OperationEnd APIs are intended to record access patterns during a file I/O operation. SuperFetch will then create prefetch files for the operation, enabling prefetch capabilities above and beyond the use case of initial process startup.

Memory Pressure Notifications

This API is not actually new, but I couldn’t find any invocations of it in the Mozilla codebase. CreateMemoryResourceNotification allocates a kernel handle that becomes signalled when physical memory is running low. Gecko already has facilities for handling memory pressure events on other platforms, so we should probably add this to the Win32 port.

WaitMessage Considered Harmful

| Comments

I could apologize for the clickbaity title, but I won’t. I have no shame.

Today I want to talk about some code that we imported from Chromium some time ago. I replaced it in Mozilla’s codebase a few months back in bug 1072752:

(message_pump_win.cc) download
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
    // A WM_* message is available.
    // If a parent child relationship exists between windows across threads
    // then their thread inputs are implicitly attached.
    // This causes the MsgWaitForMultipleObjectsEx API to return indicating
    // that messages are ready for processing (Specifically, mouse messages
    // intended for the child window may appear if the child window has
    // capture).
    // The subsequent PeekMessages call may fail to return any messages thus
    // causing us to enter a tight loop at times.
    // The WaitMessage call below is a workaround to give the child window
    // some time to process its input messages.
    MSG msg = {0};
    DWORD queue_status = GetQueueStatus(QS_MOUSE);
    if (HIWORD(queue_status) & QS_MOUSE &&
        !PeekMessage(&msg, NULL, WM_MOUSEFIRST, WM_MOUSELAST, PM_NOREMOVE)) {
      WaitMessage();
    }

This code is wrong. Very wrong.

Let us start with the calls to GetQueueStatus and PeekMessage. Those APIs mark any messages already in the thread’s message queue as having been seen, such that they are no longer considered “new.” Even though those function calls do not remove messages from the queue, any messages that were in the queue at this point are considered to be “old.”

The logic in this code snippet is essentially saying, “if the queue contains mouse messages that do not belong to this thread, then they must belong to an attached thread.” The code then calls WaitMessage in an effort to give the other thread(s) a chance to process their mouse messages. This is where the code goes off the rails.

The documentation for WaitMessage states the following:

Note that WaitMessage does not return if there is unread input in the message queue after the thread has called a function to check the queue. This is because functions such as PeekMessage, GetMessage, GetQueueStatus, WaitMessage, MsgWaitForMultipleObjects, and MsgWaitForMultipleObjectsEx check the queue and then change the state information for the queue so that the input is no longer considered new. A subsequent call to WaitMessage will not return until new input of the specified type arrives. The existing unread input (received prior to the last time the thread checked the queue) is ignored.

WaitMessage will only return if there is a new (as opposed to any) message in the queue for the calling thread. Any messages for the calling thread that were already in there at the time of the GetQueueStatus and PeekMessage calls are no longer new, so they are ignored.

There might very well be a message at the head of that queue that should be processed by the current thread. Instead it is ignored while we wait for other threads. Here is the crux of the problem: we’re waiting on other threads whose input queues are attached to our own! That other thread can’t process its messages because our thread has messages in front of its messages; on the other hand, our thread has blocked itself!

The only way to break this deadlock is for new messages to be added to the queue. That is a big reason why we’re seeing things like bug 1105386: Moving the mouse adds new messages to the queue, making WaitMessage unblock.

I’ve already eliminated this code in Mozilla’s codebase, but the challenge is going to be getting rid of this code in third-party binaries that attach their own windows to Firefox’s windows.

Attached Input Queues on Firefox for Windows

| Comments

I’ve previously blogged indirectly about attached input queues, but today I want to address the issue directly. What once was a nuisance in the realm of plugin hangs has grown into a more serious problem in the land of OMTC and e10s.

As a brief recap for those who are not very familiar with this problem: imagine two windows, each on their own separate threads, forming a parent-child relationship with each other. When this situation arises, Windows implicitly attaches together and synchronizes their input queues, putting each thread at the mercy of the other attached threads’ ability to pump messages. If one thread does something bad in its message pump, any other threads that are attached to it are likely to be affected as well.

One of the biggest annoyances when it comes to knowledge about which threads are affected, is that we are essentially flying blind. There is no way to query Windows for information about attached input queues. This is unfortunate, as it would be really nice to obtain some specific knowledge to allow us to analyze the state of Firefox threads’ input queues so that we can mitigate the problem.

I had previously been working on a personal side project to make this possible, but in light of recent developments (and a tweet from bsmedberg), I decided to bring this investigation under the umbrella of my full-time job. I’m pleased to announce that I’ve finished the first cut of a utility that I call the Input Queue Visualizer, or iqvis.

iqvis consists of two components, one of which is a kernel-mode driver. This driver exposes input queue attachment data to user mode. The iqvis user-mode executable is the client that queries the driver and outputs the results. In the next section I’m going to discuss the inner workings of iqvis. Following that, I’ll discuss the results of running iqvis on an instance of Firefox.

Input Queue Visualizer Internals

First of all, let’s start off with this caveat: Nearly everything that this driver does involves undocumented APIs and data structures. Because of this, iqvis does some things that you should never do in production software.

One of the big consequences of using undocumented information is that iqvis requires pointers to very specific locations in kernel memory to accomplish things. These pointers will change every time that Windows is updated. To mitigate this, I kind of cheated: it turns out that debugging symbols exist for all of the locations that iqvis needs to access! I wrote the iqvis client to invoke the dbghelp engine to extract the pointers that I need from Windows symbols and send those values as the input to the DeviceIoControl call that triggers the data collection. Passing pointers from user mode to be accessed in kernel mode is a very dangerous thing to do (and again, I would never do it in production software), but it is damn convenient for iqvis!

Another issue is that these undocumented details change between Windows versions. The initial version of iqvis works on 64-bit Windows 8.1, but different code is required for other major releases, such as Windows 7. The iqvis driver theoretically will work on Windows 7 but I need to make a few bug fixes for that case.

So, getting those details out of the way, we can address the crux of the problem: we need to query input queue attachment information from win32k.sys, which is the driver that implements USER and GDI system calls on Windows NT systems.

In particular, the window manager maintains a linked list that describes thread attachment info as a triple that points to the “from” thread, the “to” thread, and a count. The count is necessary because the same two threads may be attached to each other multiple times. The iqvis driver walks this linked list in a thread-safe way to obtain the attachment data, and then copies it to the output buffer for the DeviceIoControl request.

Since iqvis involves a device driver, and since I have not digitally signed that device driver, one can’t just run iqvis and call it a day. This program won’t work unless the computer was either booted with kernel debugging enabled, or it was booted with driver signing temporarily disabled.

Running iqvis against Firefox

Today I ran iqvis using today’s Nightly 39 as well as the lastest release of Flash. I also tried it with Flash Protected Mode both disabled and enabled. (Note that these examples used an older version of iqvis that outputs thread IDs in hexadecimal. The current version uses decimal for its output.)

Protected Mode Disabled

FromTID ToTID Count
ac8 df4 1

Looking up the thread IDs:

  • df4 is the Firefox main thread;
  • ac8 is the plugin-container main thread.

I think that the output from this case is pretty much what I was expecting to see. The protected mode case, however, is more interesting.

Protected Mode Enabled

FromTID ToTID Count
f8c dbc 1
794 f8c 3

Looking up the thread IDs:

  • dbc is the Firefox main thread;
  • f8c is the plugin-container main thread;
  • 794 is the Flash sandbox main thread.

Notice how Flash is attached to plugin-container, which is then attached to Firefox. Notice that transitively the Flash sandbox is effectively attached to Firefox, confirming previous hypotheses that I’ve discussed with colleagues in the past.

Also notice how the Flash sandbox attachment to plugin-container has a count of 3!

In Conclusion

In my opinion, my Input Queue Visualizer has already yielded some very interesting data. Hopefully this will help us to troubleshoot our issues in the future. Oh, and the code is up on GitHub! It’s poorly documented at the moment, but just remember to only try running it on 64-bit Windows 8.1 for the time being!

Profile Unlocking in Firefox 34 for Windows

| Comments

Today’s Nightly 34 build includes the work I did for bug 286355: a profile unlocker for our Windows users. This should be very helpful to those users whose workflow is interrupted by a Firefox instance that cannot start because a previous Firefox instance has not finished shutting down.

Firefox 34 users running Windows Vista or newer will now be presented with this dialog box:

Clicking “Close Firefox” will terminate that previous instance and proceed with starting your new Firefox instance.

Unfortunately this feature is not available to Windows XP users. To support this feature on Windows XP we would need to call undocumented API functions. I prefer to avoid calling undocumented APIs when writing production software due to the potential stability and compatibility issues that can arise from doing so.

While this feature adds some convenience to an otherwise annoying issue, please be assured that the Desktop Performance Team will continue to investigate and fix the root causes of long shutdowns so that a profile unlocker hopefully becomes unnecessary.

Diffusion of Responsibility

| Comments

Something that I’ve been noticing on numerous social media and discussion forum sites is that whenever Firefox comes up, inevitably there are comments in those threads about Firefox performance. Given my role at Mozilla, these comments are of particular interest to me.

The reaction to roc’s recent blog post has motivated me enough to respond to a specific subset of comments. These comments all exhibit a certain pattern: their authors are experiencing problems with Firefox, they are very dissatisfied, but they are not discussing them in a way that is actionable by Mozilla.

How Mozilla Finds Problems

Mozilla encourages our contributors to run prerelease versions of Firefox, especially Nightly builds. This allows us to do some good old-fashioned dogfooding during the development of a Firefox release.

We also have many tools that run as part of our continuous integration infrastructure. Valgrind, Address Sanitizer, Leak Sanitizer, reference count tracking, deadlock detection, assertions, Talos performance tests, and xperf are some of the various tools that we apply to our builds. I do not claim that this list is exhaustive! :-)

We use numerous technologies to discover problems that occur while running on our users’ computers. We have a crash reporter that (with the user’s consent) reports data about the crash. We have Firefox Health Report and Telemetry that, when consented to, send us useful information for discovering problems.

Our ability to analyze crash report/FHR/telemetry data is limited to those users who consent to share it with us. As much as I am proud of the fact that we respect the privacy of our users, this means that we only receive data from a fraction of them; many users who are experiencing problems are not included in this data.

Despite the fact that we have all of these wonderful tools to help us deliver quality releases, the fact is that they cannot exhaustively catch every possible bug that is encountered out in the wild. There are too many combinations of extensions and configurations out there to possibly allow us to catch everything before release.

That’s where you, our users, come in!

If You See Something, Report It!

Reddit, Hacker News, Slashdot and other similar sites are fantastic for ranting. I should know – I do it with the best of them! Having said that, they are also terrible for the purposes of bug reporting!

As users it’s easy for us to assume that somebody else will encounter our problems and report them. Unfortunately that is not always the case, especially with a browser that is as configurable as Firefox.

Reporting Bugs

If you are experiencing a bug, the best way to ensure that something can be done about your bug is to report it in Bugzilla. This might seem a little bit intimidating for somebody who is new to bug reporting, but I assure you, Mozillians are really nice! As long as you follow the etiquette guidelines, you’ll be fine! One suggestion though: try to follow our bug writing guidelines. Doing so will maximize the likelihood of a contributor being able to reproduce your problem. In addition to these suggestions for bug filing, I also suggest including certain types of data for specific types of problems:

Reporting a Bug for High Memory Usage

If you’re experiencing problems with Firefox’s memory use, open a tab, and point your browser to about:memory. This nifty feature provides a breakdown of Firefox memory consumption. Save that report and attach it to the bug that you’ve filed.

Reporting a Bug for Slowness or Heavy CPU Usage

If you want report a problem with Firefox being slow and/or mysteriously consuming large amounts of CPU time, the best way to help us is is to include data that has been generated by the Gecko Profiler. [This data is referred to as a “profile,” but please do not confuse it with the user profile that stores your settings and browser history – they are different things with the same name.] This tool tracks how Firefox uses your CPU over time. If you are able to attach a profile that was taken during a period of time where Firefox was running poorly, it will help us understand what exactly Firefox was doing. Unfortunately this is tool requires a bit of technical savvy, but attaching the URL of an uploaded profile to your performance bug can be very helpful.

Reporting a Bug for a Persistent, Reproducable Crash

As you can see in our crash report data, crashes reported to Mozilla are ranked by frequency. As you might expect, this implies that it’s often the squeaky wheels that get the grease.

If you have an easily reproducable crash and you are sending your reports to Mozilla, you can help us by pointing Firefox to about:crashes. This page lists all of the crash reports that have been generated on your computer. If the crash that you are experiencing isn’t on our list of top crashers, you can still help us to fix it: filing a bug that includes multiple crash report URLs from your about:crashes screen will help tremendously.

Reporting a Bug for a Website that Looks Wrong in Firefox

I’d suggest reporting your problem over at webcompat.com. Volunteers will review your report and determine whether your problem is caused by the website or by Firefox, and file a bug with the appropriate entity.

I have a problem that isn’t listed here. What should I do?

Send us feedback!

In Conclusion

If there is one idea that you can take away from this post (a TL;DR, if you will), it is this: Mozilla cannot fix 100% of the bugs that we do not know about.

Taking an active role in the Mozilla community by reporting your issues through the proper channels is the best way to ensure that your problems can be fixed.

EDIT: To be clear: What I am suggesting is that users who are enthusiastic enough to post a comment to Hacker News (for example) should also be savvy enough to be able to file a proper bug report. Please do not misconstrue this post as a demand that novice users start filing bugs.

EDIT August 15, 2014: Nick Nethercote just blogged about a tricky memory bug that couldn’t have been diagnosed without the help of a Redditor whose complaint we steered to Bugzilla.

EDIT August 23, 2014: Wow, this has blown up. More edits to update this post with additional information in repose to some of the questions posted in the comments. Thanks for reading!

Asynchronous Plugin Initialization: An Introduction

| Comments

I have spent a lot of time this quarter working on bug 998863, “Asynchronous Initialization of Out-of-process Plugins.” While the bug summary is fairly self explanatory, I would like to provide some more details about why I am doing this and what kind of work it entails. I would also like to wrap up the post with an early demonstration of this feature and present some profiles to illustrate the potential performance improvement.

Rationale

The reason that I am undertaking this project is because NPAPI plugin startup is our most frequent cause of jank. In fact, at the time of this writing, our Chrome Hangs telemetry is showing that 4 out of our top 10 most frequent offending call stacks are related to plugin initialization and instantiation. Furthermore, creating the plugin-container.exe child process is the #1 most frequent chrome hang offender (Note that our Chrome Hang telemetry consists entirely of Windows builds, where process creation is quite expensive).

A High-level Breakdown of NPAPI Initialization and Instantiation

The typical steps involved can be broken down as follows:

  1. Launch the plugin-container process;
  2. Call NP_Initialize to load the plugin;
  3. Create instances by calling NPP_New;
  4. Call NPP_NewStream for instances that load stream data;
  5. If an instance is scriptable, call NPP_GetValue to obtain information about the plugin’s scriptable object.

The patch that I am working on modifies steps 1 through 4 to run asynchronously. Step 5 is a special case – we asynchronously return a proxy object, but if a synchronous JS method is called on that object, we must wait for the plugin to initialize (if it has not yet done so). My hope is that if we have to call a synchronous JS method on the proxy object, plugin initialization will be far enough along that the wait will be minimal.

A Brief Demonstration

The following video compares two locally-built Nightlies that are identical except for the asynchronous initialization patch. After loading the browser with a page containing several embedded Flash objects, we can profile and observe the effects of this patch.

Here are links to some profiles:

Synchronous Plugin Initialization

Asynchronous Plugin Initialization

Disclaimer

This patch requires some further work on scripting and stabilization. The information in this post is subject to change. :-)

Detecting Main Thread I/O With SPS

| Comments

In bug 867762 I am landing I/O interpose facilities for SPS. This feature will allow main thread I/O to be displayed in the profiler UI.

There are a few components to this patch. Classes that implement the IOInterposerModule interface are responsible for hooking into an arbitrary I/O facility and calling into IOInterposer with those events.

The IOInterposer receives those events and then dispatches them to any registered observers. For the purposes of SPS there is only one observer, ProfilerIOInterposeObserver, however I wrote this keeping in mind that we may eventually want to use the I/O interpose facilities for other purposes (such as shutdown write poisioning).

For bug 867762 I have implemented two modules. NSPRInterposer hooks into NSPR file I/O and generates events whenever a PR_Read, PR_Write, or PR_Sync occurs. The second module, SQLiteInterposer, leverages the SQLite VFS that we use for telemetry to tap into reads, writes, and fsyncs generated by SQLite. In the future we expect to expand this into several additional modules. Some ideas include a module that uses Event Tracing for Windows to read events from the Windows kernel and a module that interposes calls to read, write, and fsync on Linux.

On the observer side of things, for now we simply insert a marker into the profiler timeline whenever main thread I/O is reported. Here is a sample screenshot of the markers in action (the pink stuff for readers who are unfamiliar with SPS markers):

In bug 867757 (under review) this will become more sophisticated, as SPS will immediately sample the callstack of the thread whose I/O has been intercepted. This annotated stack will be inserted directly into the timeline.

There will need to be some UI changes for this to be presented in a reasonable way, but with both existing patches applied the data is already useful. By firing up Cleopatra and filtering on either NSPRInterposer or SQLiteInterposer, you can isolate and view the main thread I/O stacks:

Hopefully these patches will prove to be beneficial to our efforts to eliminate main thread I/O.

Plugin Hang UI on Aurora

| Comments

The UI Has Landed

The Plugin Hang UI landed in mozilla-central in time for January’s merge. This means that it is now available on both Nightly and Aurora.

While it’s great that this code is now available to a larger audience, there are consequences to this. :-)

Telemetry and Pref Adjustments

The first (and more pleasant) consequence is that we are now receiving telemetry about the UI’s usage patterns. This allows us to make some adjustments as the Plugin Hang UI gets closer to the release channel. As happy as I am that this feature will be putting users in the driver’s seat when dealing with hung plugins, it’s also important to not annoy users.

Initially our telemetry suggested that the Plugin Hang UI frequently appeared but then cancelled automatically because the plugin resumed execution. This indicated to us that we should increase the default value of the dom.ipc.plugins.hangUITimeoutSecs preference (bug 833560). There has also been some discussion about scaling the Hang UI threshold depending on hardware performance and user behaviour. This threshold is tricky to balance; while we want users to be able to terminate a hanging plugin, we want to provide that feature with minimal annoyance.

Crashes

Another consequence of the feature landing is that we received reports from Nightly and Aurora showing that the Plugin Hang UI was inducing crashes in the browser itself. Unfortunately the only thing that the stack traces were telling us was that the browser-side code that hosts out-of-process plugins was not being cleaned up properly after forcibly terminating the plugin container. We didn’t have any steps to reproduce. What broke this mystery wide open was when our crash stats were finally able to show some correlations. I learned that nearly 50% of the crashes happened on single-core CPUs.

The Problem

Once I was able to test this out for myself, the issue revealed itself in short order. Fortunately for me, even though the problem involved thread scheduling, I was still able to reproduce it in a debugger. This patch took care of the problem.

There’s a bit of nuance to what was happening with this crash. When plugin-container.exe is forcibly terminated, cleanup of the browser-facing plugin code may be executed in one of two different sequences:

Sequence 1 (Most common)

  1. After terminating plugin-container.exe, the Plugin Hang UI posts a CleanupFromTimeout work item to the main thread. Concurrently on the I/O thread, Firefox’s RPCChannel detects an error and posts a OnNotifyMaybeChannelError work item.

  2. OnNotifyMaybeChannelError executes and sets the channel’s state to ChannelError. It also cleans up the PluginModuleParent actor and its actors for subprotocols.

  3. CleanupFromTimeout runs and attempts to Close the channel. This is effectively a no-op since the channel was already closed with an error status by step 2.

Sequence 2 (Less common unless running on fewer cores)

  1. Same as in Sequence 1.

  2. CleanupFromTimeout runs before OnNotifyMaybeChannelError. This work item attempts to do a regular Close on the channel. Since the channel’s state is still set to ChannelConnected, the close operation doesn’t realize that it needs to do additonal cleanup. It performs a clean shutdown of the RPC channel without properly cleaning up the IPDL actors.

  3. OnMaybeNotifyChannelError runs, sees the channel is already closed due to the activities in step 2, and does nothing.

  4. A crash later occurs because the actors were never cleaned up properly.

Some Additional Analysis

Sequence 2 cannot crash if the plugin-container.exe process is terminated by the main thread. This is because PluginModuleParent::ShouldContinueFromReplyTimeout returns false in this case and the channel’s state is set to ChannelTimeout by the time that the CleanupFromTimeout work item executes. This guarantees that a full cleanup will be done by CleanupFromTimeout.

With the Plugin Hang UI, plugin-container.exe is not terminated by the main thread, so the channel’s state must be explicitly updated after termination.

Solution

The patch modifies the CleanupFromTimeout work item so that if the plugin container was terminated outside the main thread, the channel is explicitly closed with an error state. This ensures that the actors are properly cleaned up.

Hangs

I filed bug 834127 when QA discovered that sometimes the Plugin Hang UI was not being displayed. I found out that Firefox was correctly spawning plugin-hang-ui.exe, however it was not showing any UI.

This lead be back down the input queue rabbit hole that I briefly discussed last time. I learned that on an intermittent basis, the Plugin Hang UI dialog box was being hung up on Win32 ShowWindow and SetFocus calls. What I did know was that my attempts to explicitly detach the Plugin Hang UI’s thread from the hung Firefox thread weren’t working as well as I would have liked.

After numerous attempts at fixing these issues, I determined that for the time being we will need to rescind the owner-owned relationship between Firefox and the Plugin Hang UI dialog. If we didn’t do so, seemingly benign actions like calling PeekMessage or passing a WM_NCLBUTTONDOWN message to DefWindowProc would be enough to bring the Plugin Hang UI to a halt. Why is this, you ask? I decided to peek into kernel mode to find out.

Adventures in Kernel Mode

Recall that the guts of USER32 and GDI32 were moved into kernel mode via win32k.sys in the Windows NT 3.5 timeframe. It follows that any efforts to examine Win32 internals necessitates peering into kernel mode. I didn’t need an elaborate remote debugging setup for my purposes, so I used Sysinternals LiveKd to take a snapshot of my kernel and debug it.

My workflow was essentially as follows:

  1. Plugin Hang UI gets stuck
  2. Attach a debugger (I use WinDbg) to plugin-hang-ui.exe
  3. Run LiveKd
  4. Type !thread -t <tid> into the kernel debugger, where <tid> is the thread ID that is hung in the user-mode WinDbg

Every kernel-mode call stack that I examined on a hung thread usually ended up going through this code path:


...
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`0def3ec0 fffff800`0327c652 : fffffa80`079c43c0 fffffa80`079c43c0 00000000`00000000 fffffa80`00000008 : nt!KiSwapContext+0x7a
fffff880`0def4000 fffff800`0328da9f : fffffd54`0000000c fffffd54`000002a0 000002ac`00000000 00000804`fffffd54 : nt!KiCommitThreadWait+0x1d2
fffff880`0def4090 fffff800`03278a14 : 00000000`00000000 00000000`00000005 00000000`00000000 fffff800`03279600 : nt!KeWaitForSingleObject+0x19f
fffff880`0def4130 fffff800`03279691 : fffffa80`079c43c0 fffffa80`079c4410 00000000`00000000 00000000`00000000 : nt!KiSuspendThread+0x54
fffff880`0def4170 fffff800`0327c85d : fffffa80`079c43c0 00000000`00000000 fffff800`032789c0 00000000`00000000 : nt!KiDeliverApc+0x201
fffff880`0def41f0 fffff800`0328da9f : fffffa80`0a7510e0 fffff800`0327c26f fffffa80`00000000 fffff800`03402e80 : nt!KiCommitThreadWait+0x3dd
fffff880`0def4280 fffff960`0010d457 : fffff900`c2255200 00000000`0000000d 00000000`00000001 00000000`00000000 : nt!KeWaitForSingleObject+0x19f
fffff880`0def4320 fffff960`0010d4f1 : fffff880`0def0000 fffff900`c08e0e20 00000000`00000000 00000000`00000000 : win32k!xxxRealSleepThread+0x257
fffff880`0def43c0 fffff960`000c46d3 : 00000000`00000000 fffff900`c093ee90 00000000`00000200 00000000`00000046 : win32k!xxxSleepThread+0x59
fffff880`0def43f0 fffff960`0010c53e : fffff900`c093ee90 fffff900`c08e0e20 00000000`00000000 fffff900`c25724b0 : win32k!xxxInterSendMsgEx+0x112a
fffff880`0def4500 fffff960`000d8ccf : fffff900`c25724b0 fffff900`c25724b0 00000000`003c031a fffff900`c0800b90 : win32k!xxxSendMessageTimeout+0x1de
fffff880`0def45b0 fffff960`000dae0b : fffff960`00000006 fffff880`0def47c0 fffff960`00000000 fffff900`00000000 : win32k!xxxCalcValidRects+0x1a3
fffff880`0def46f0 fffff960`000dac3a : fffff900`00000001 fffff900`c08e0e20 fffff880`0def49b8 fffff900`00000000 : win32k!xxxEndDeferWindowPosEx+0x18f
fffff880`0def47b0 fffff960`000d6e09 : 00000000`00000000 00000000`00000001 fffff900`00000000 fffff800`00000000 : win32k!xxxSetWindowPos+0x156
fffff880`0def4830 fffff960`0007c66d : fffff900`00000000 fffff900`0004366c fffff900`00000000 fffff900`c209f410 : win32k!xxxActivateThisWindow+0x441
...

The highlighted hexadecimal value above is a handle to a window with class MSCTFIME UI (i.e. a Microsoft IME window) that was created by the main UI thread in the hung Flash process. Despite my best efforts to try to disconnect the Plugin Hang UI input queue from these threads, Windows is insistent on sending a synchronous, inter-thread message to the IME window on the hung Flash thread. This doesn’t happen if we omit setting Firefox’s HWND as the Plugin Hang UI’s owner window.

Implications

Now that the Plugin Hang UI specifies a NULL owner, Windows can no longer guarantee that the Plugin Hang UI will always appear above the Firefox window that was active at the time that the plugin hang occurred. Having said that, our experiments have shown that Windows does not try to promote an unresponsive window to the front of the Z-order.

In a multiple-window situation, Windows may move the other, inactive Firefox windows ahead of the Plugin Hang UI in the Z-order, but the window that was active when Firefox first hung does not move. I think that this is acceptable. In fact, the same behavour would be observable if we had been able to specify an owner.

If you’re interested in helping to test this feature, please install Nightly and try the repro from bug 834127. When the Plugin Hang UI fires, try to move around the Plugin Hang UI as well as the other windows on your desktop, including Firefox. If the Z-order changes differently from the way that I have observed here, please try to generate some steps to reproduce, drop me a line (aklotz on #perf or MoCo email) and let me know what’s going on. Thanks!