Tales of Science-Fiction Bugs: The Thing That Killed Talos
November 12, 2012
Have you ever encountered one of these bugs? One in which every single line of your code is correct, every type-check passes, every unit test succeeds, and the specifications are fulfilled, but somehow, for no reason that can be rationally explained, it just does not work? I call them Science-Fiction Bugs. I am sure that you have met some of them. For some reason, the Mozilla Performance Team seems to stumble upon such bugs rather often, perhaps because we spend so much time refactoring other teams' code long after the original authors have moved on to other features, and combining that code with undertested libraries and technologies. Truly, this is life on the Frontier.
Today, I would like to tell you the tale of one of these Science-Fiction Bugs: The Thing That Killed Talos.
As such tales often do, this one started innocuously, with a minor refactoring of our Session Store mechanism. Ten lines were changed in the code, to move an already asynchronous operation off the main thread. As this simple change did not alter any API and simplified the code a little, it quickly passed review. And as it passed all our unit tests, it was promptly landed.
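To give an idea of the flavor of that change, here is a minimal sketch. It is not the actual patch: the function names, the arguments, the exact OS.File call and the TextEncoder-based string conversion are all illustrative assumptions. The old approach set up an asynchronous stream copy with NetUtil.jsm from the main thread; the new one hands the whole write to the worker thread owned by OS.File.

    // Sketch only – names and options are illustrative, not the actual patch.
    Components.utils.import("resource://gre/modules/NetUtil.jsm");
    Components.utils.import("resource://gre/modules/FileUtils.jsm");
    Components.utils.import("resource://gre/modules/osfile.jsm");

    // Before: an asynchronous stream copy, set up from the main thread.
    function writeSessionBefore(file, data, callback) {
      let ostream = FileUtils.openSafeFileOutputStream(file);
      let converter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                        .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
      converter.charset = "UTF-8";
      let istream = converter.convertToInputStream(data);
      NetUtil.asyncCopy(istream, ostream, callback); // callback receives a status code
    }

    // After: the file I/O happens entirely on the OS.File worker thread.
    function writeSessionAfter(path, data) {
      let bytes = new TextEncoder().encode(data); // encode the string as UTF-8 bytes
      return OS.File.writeAtomic(path, bytes, { tmpPath: path + ".tmp" });
    }

Both versions expose the same asynchronous contract to their callers, which is why such a change could pass review and unit tests so easily.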
And immediately reverted.
As it turns out, the patch killed Talos, our performance-testing system.
Step 1 – Is that really a failure?
No amount of double-checking or updating to the latest version of our source code changed anything. This almost trivial patch, that passed all unit tests, that worked nicely when tested manually, that did not touch anything remotely related to Talos, killed Talos. Reproducibly. On all platforms.
In other words, there really was a bug, hidden somewhere in the code. The hunt for the Thing That Killed Talos was on.
At this stage, I had the following information:
- there was a bug, somewhere in the code;
- this bug was causing timeouts on all platforms, but not on the exact same tests for all platforms;
- this bug was somehow related to replacing one use of NetUtils.jsm with one essentially equivalent use of OS.File;
- the problem took place at some point during Talos shutdown.
Step 2 – Killing me softly
To fix a bug, one needs to reproduce it. Reproducing it on our test servers was a sure thing, so I set out to reproduce it locally, figuring that it would only be a matter of a little setup and configuration effort. So, I installed Talos locally, I rebuilt Firefox with the exact same configuration, I configured Talos exactly as it is configured on the server, and I expected to end up with a sample that could be used for experimenting and debugging purposes.
Unfortunately, this is not what happened. Locally, all Talos tests passed. Again, no amount of double-checking build options or Talos options, no amount of moving backwards in the source code of Talos or Firefox, or of using the binaries produced by the test server, could change the fact that the tests succeeded. Reproducibly. On both MacOS X and Linux – I had not tested Windows.
Oh, great. This trivial issue was turning into a Science-Fiction Bug.
Additional information gathered:
- the bug was somehow related to the server on which Talos tests were executed or its configuration;
- that was going to be a long ride.
Step 3 – Psychic debugging
Since reproducing the bug locally proved impossible, the next step was to attempt to corner it remotely, by carefully experimenting with our code and combining this with a healthy dose of psychic debugging. Unfortunately, this also meant waiting a few hours between the start of an experiment and its result. Luckily for me, I do not sleep much these days, so I could keep up a near-permanent stream of experiments.
During the course of the following week, I tinkered with the following parts of the code:
- introducing logging at every step of OS.File [result: everything was in order, all calls completed successfully];
- introducing logging in the use of OS.File by SessionStore [result: everything was in order];
- monitoring the shutdown of OS.File [result: everything was in order];
- monitoring the shutdown of threads [result: locally, everything was in order – remotely, it never finished];
- programmatically turning logging to its maximal level [result: locally, nothing interesting – remotely, some cycle collection shortly before the failure];
- determining whether the issue disappeared if we used the same temporary file that NetUtils used [result: it did not];
- determining whether the issue disappeared if all uses of OS.File were removed [result: it did];
- determining whether the issue disappeared if the number of calls to OS.File was artificially limited [result: it did not];
- determining whether the issue disappeared if OS.File stopped being used long before shutdown [result: it did not];
- determining whether the issue disappeared if OS.File cleaned up after itself manually, instead of counting on the runtime [result: it changed nothing];
- determining whether the issue disappeared if we replaced the call to OS.File with another call to OS.File that did nothing [result: it changed nothing];
- adding time information to all logs [result: remotely, the DOM worker subsystem shutdown was caught in an apparently endless loop, repeating exactly every 5 seconds] (see the sketch just after this list).
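For reference, the time information of that last experiment required nothing fancy; here is a minimal sketch, with a hypothetical helper name:

    // Hypothetical helper: prefix each log line with the time elapsed since
    // startup, so that remote-only logs can reveal timing patterns.
    let loggingStart = Date.now();
    function logWithTime(message) {
      dump("[" + (Date.now() - loggingStart) + " ms] " + message + "\n");
    }

It is this kind of trivial instrumentation that exposed the suspicious 5-second periodicity.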
Additional information gathered:
- the DOM worker subsystem failed to shut down;
- the issue involved a 5-second timeout in the shutdown of DOM workers;
- the offending operation in DOM workers was NS_ProcessNextEvent(), which could in turn cause a call to just about anything;
- the issue was not due to anything high-tech in OS.File, but most likely only to the fact that OS.File was spawning a thread;
- the issue might be related to cycle collection.
All the signs pointed to something preventing DOM workers from closing, though not anything in the code of OS.File itself. The hunt was still on.
Interlude – Fixing a DOM Worker shutdown bug
At some point during the hunt, I was informed of another OS.File-related bug that also involved DOM workers shutting down improperly. Unlike my Science-Fiction bug, however, this new bug was simple to reproduce and to fix.
Unfortunately, the fix did not help resolve my Science-Fiction bug. Drats.
Step 4 – Working the slaves
At this stage, there did not seem to be anything more that could be learnt remotely. The only way forward was to request access to one of the machines used to run the Talos tests (the so-called “slaves”) and run Firefox through a debugger. This involved a few technical stumbles along the way (in particular, Apple does seem to put a lot of effort into not letting me install gdb on that Mac slave), but nothing that could not be fixed.
A quick test confirmed that the whole world had not gone crazy and that the Talos tests did fail on the slave, even when launched by me.
So, once armed with a Talos slave, an ssh connection, my trusty gdb, and a version of Talos slightly hacked into some compatibility with said gdb, I had the ability to obtain stack information:
- as expected, the immediate cause of the freeze was the DOM worker subsystem looping forever in its shutdown loop, waiting for all DOM workers to have terminated;
- the call to NS_ProcessNextEvent() did not seem to do anything interesting, just timing out;
- one of the DOM Workers (presumably the one spawned by OS.File) was effectively still active, despite the fact that it had stopped executing any code several minutes earlier.
Now, that is interesting. Upon further examination:
- this DOM Worker was also calling NS_ProcessNextEvent(), presumably waiting for instructions;
- the C stack of this DOM Worker was somewhat surprising, insofar as it involved no less than 5 depths of nested event loops, apparently left there by calls to importScripts().
Step 5 – Shooting in the dark
What is that? Calling importScripts() is an operation that leaves a trace on the stack long after the end of the execution of said importation? This deserves at least a wild attempt at fixing the issue.
The implementation of OS.File was designed for modularity. Each file defines a small module (scoped within a function), which imports its dependencies through importScripts(). Changing this so that the main file imports all dependencies breaks this modularity and is not very satisfying, but it is a matter of fewer than 20 lines of code.
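Concretely, the change looks something like the following sketch; the file names are illustrative, not the actual OS.File sources:

    // Before (sketch): each module imported its own dependencies, so
    // importScripts() calls could end up nested during worker startup.
    // In osfile_front.js:
    (function() {
      importScripts("osfile_shared.js"); // nested import
      // ... definition of the module ...
    })();

    // After (sketch): the worker's main script imports everything up front,
    // in dependency order, so no importScripts() call is ever nested.
    // In worker_main.js:
    importScripts("osfile_shared.js", "osfile_front.js");

Since importScripts() accepts several scripts in a single call and executes them in order, the change stays small, at the price of the main file having to know the full dependency graph.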
And Talos confirms that it works.
Victory, I guess.
Step 6 – Pitch Black
The bug is fixed, all rejoice, and the updated SessionStore can land.
But let us push the matter a little further and try to determine what can have caused the issue in the first place. Unfortunately, I cannot draw decisive conclusions, but I can offer a few educated guesses.
Firstly, now that we know that importScripts() is related to the issue, we can deduce a few things. Function importScripts() is synchronous: it must suspend the worker's script until the imported script has been loaded and executed. A natural way to implement this kind of blocking is to enter a nested event loop that keeps processing events until the import completes. This would explain nicely why the C++ stack shows 5 depths of nested event loops.
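For illustration, here is roughly how a synchronous wait is traditionally implemented on top of an event loop in chrome JavaScript; this is a sketch of the general technique, not the actual implementation of importScripts(), which lives in C++:

    // Sketch: a blocking wait implemented by spinning a nested event loop.
    Components.utils.import("resource://gre/modules/Services.jsm");

    function waitUntil(condition) {
      let thread = Services.tm.currentThread;
      while (!condition()) {
        // Process one pending event, blocking until one is available.
        thread.processNextEvent(true);
      }
    }

If an event processed by this loop triggers another such wait, the loops nest, and the outer loop cannot resume until the inner one has exited: one level of nesting per level of import.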
Also, since these nested event loops are introduced to deal with slow sources, their presence suggests one factor that was missing to explain the difference between running on a Talos slave and running on a development machine: speed. A very slow input device might cause the nested event loop to run, while a fast one might avoid it. As the Talos slaves are used for benchmarking, one can theorize that their hardware is seldom or never upgraded, to keep results consistent, unlike unit-testing slaves or developer machines.
If so, I might have accidentally hit a sophisticated race condition, one that might show up again one of these days on devices with slow file access, such as cellphones. In that case, the issue might resurface with Firefox for Android or with Firefox OS, and should be monitored.
Is this scenario credible? Ben Turner or Kyle Huey, if one of you reads these lines, you might be able to tell me whether I am wrong.