Firefox, the Browser that has your Back[up]

June 26, 2014 § 27 Comments

One of the most important features of Firefox, in my opinion, is Session Restore. This component is responsible for ensuring that, even in case of crash, or if you upgrade your browser or an add-on that requires restart, your browser can reopen immediately and in the state in which you left it. As far as I am concerned, this feature is a life-safer.

Unfortunately, there are a few situations in which the Session Restore file may be corrupted – typically, if the computer is rebooted before the write is complete, or if it loses power, or if the operating system crashes or the disk is disconnected, we may end up losing our precious Session Restore. While any of these circumstances happens quite seldom, it needs to be applied as part of the following formula:

seldom · .5 billion users = a lot

I am excited to announce that we have just landed a new and improved Session Restore component in Firefox 33 that protects your precious data better than ever.

How it works

Firefox needs Session Restore to handle the following situations:

  • restarting Firefox without data loss after a crash of either Firefox, the Operating System, a driver or the hardware, or after Firefox has been killed by the Operating System during shutdown;
  • restarting Firefox without data loss after Firefox has been restarted due to an add-on or an upgrade;
  • quitting Firefox and, later, restarting without data loss.

In order to handle all of this, Firefox needs to take a snapshot of the state of the browser whenever anything happens, whether the user browses, fills a form, scrolls, or an application sets a Session Cookie, Session Storage, etc. (this is actually capped to one save every 15 seconds, to avoid overloading the computer). In addition, Firefox performs a clean save during shutdown.

While at the level of the application, the write mechanism itself is simple and robust, a number of things beyond the control of the developer can prevent either the Operating System or the hard drive itself from completing this write consistently – a typical example being tripping on the power plug of a desktop computer during the write.

The new mechanism involves two parts:

  • keeping smart backups to maximize the chances that at least one copy will be readable;
  • making use of the available backups to transparently avoid or minimize data loss.

The implementation actually takes very few lines of code, the key being to know the risks against which we defend.

Keeping backups

During runtime, Firefox remembers which files are known to be valid backups and which files should be discarded. Whenever a user interaction or a script requires it, Firefox writes the contents of Session Restore to a file called sessionstore-backups/recovery.js. If it is known to be good, the previous version of sessionstore-backups/recovery.js is first moved to sessionstore-backups/recovery.bak. In most cases, both files are valid and recovery.js contains a state less than 15 seconds old, while recovery.bak contains a state less than 30 seconds old. Additionally, the writes on both files are separated by at least 15 seconds. In most circumstances, this is sufficient to ensure that, even of hard drive crash during a write to recover.js, at least recovery.bak has been entirely written to disk.

During shutdown, Firefox writes a clean startup file to sessionstore.js. In most cases, this file is valid and contains the exact state of Firefox at the time of shutdown (minus some privacy filters). During startup, if sessionstore.js is valid, Firefox moves it to sessiontore-backup/previous.js. Whenever this file exists, it is valid and contains the exact state of Firefox at the time of the latest clean shutdown/startup. Note that, in case of crash, the latest clean shutdown/startup might be older than the latest actual startup, but this backup is useful nevertheless.

Finally, on the first startup after an update, Firefox copies sessionstore.js, if it is available and valid, to sessionstore-backups/upgrade.js-[build id]. This mechanism is designed primarily for testers of Firefox Nightly, who keep on the very edge, upgrading Firefox every day to check for bugs. Testers, if we introduce a bug that affects Session Restore, this can save your life.

As a side-note, we never use the operating system’s flush call, as 1/ it does not provide the guarantees that most developers expect; 2/ on most operating systems, it causes catastrophic slowdowns.

Recovering

All in all, Session Restore may contain the following files:

  • sessionstore.js (contains the state of Firefox during the latest shutdown – this file is absent in case of crash);
  • sessionstore-backups/recovery.js (contains the state of Firefox ≤ 15 seconds before the latest shutdown or crash – the file is absent in case of clean shutdown, if privacy settings instruct us to wipe it during shutdown, and after the write to sessionstore.js has returned);
  • sessionstore-backups/recovery.bak (contains the state of Firefox ≤ 30 seconds before the latest shutdown or crash – the file is absent in case of clean shutdown, if privacy settings instruct us to wipe it during shutdown, and after the removal of sessionstore-backups/recovery.js has returned);
  • sessionstore-backups/previous.js (contains the state of Firefox during the previous successful shutdown);
  • sessionstore-backups/upgrade.js-[build id] (contains the state of Firefox after your latest upgrade).

All these files use the JSON format. While this format has drawbacks, it has two huge advantages in this setting:

  • it is quite human-readable, which makes it easy to recover manually in case of an extreme crash;
  • its syntax is quite rigid, which makes it easy to find out whether it was written incompletely.

As our main threat is a crash that prevents us from writing the file entirely, we take advantage of the latter quality to determine whether a file is valid. Based on this, we test each file in the order indicated above, until we find one that is valid. We then proceed to restore it.

If Firefox was shutdown cleanly:

  1. In most cases, sessionstore.js is valid;
  2. In most cases in which sessionstore.js is invalid, sessionstore-backups/recovery.js is still present and valid (the likelihood of it being present is obviously higher if privacy settings do not instruct Firefox to remove it during shutdown);
  3. In most cases in which sessionstore-backups/recovery.js is invalid, sessionstore-backups/recovery.bak is still present, with an even higher likelihood of being valid (the likelihood of it being present is obviously higher if privacy settings do not instruct Firefox to remove it during shutdown);
  4. In most cases in which the previous files are absent or invalid, sessionstore-backups/previous.js is still present, in which case it is always valid;
  5. In most cases in which the previous files are absent or invalid, sessionstore-backups/upgrade.js-[...] is still present, in which case it is always valid.

Similarly, if Firefox crashed or was killed:

  1. In most cases, sessionstore-backups/recovery.js is present and valid;
  2. In most cases in which sessionstore-backups/recovery.js is invalid, sessionstore-backups/recovery.bak is pressent, with an even higher likelihood of being valid;
  3. In most cases in which the previous files are absent or invalid, sessionstore-backups/previous.js is still present, in which case it is always valid;
  4. In most cases in which the previous files are absent or invalid, sessionstore-backups/upgrade.js-[...] is still present, in which case it is always valid.

Numbers crunching

Statistics collected on Firefox Nightly 32 suggest that, out of 11.95 millions of startups, 75,310 involved a corrupted sessionstore.js. That’s roughly a corrupted sessionstore.js every 158 startups, which is quite a lot. This may be influenced by the fact that users of Firefox Nightly live on pre-alpha, so are more likely to encounter crashes or Firefox bugs than regular users, and that some of them use add-ons that may modify sessionstore.js themselves.

With the new algorithm, assuming that the probability for each file to be corrupted is independent and is p = 1/158, the probability of losing more than 30 seconds of data after a crash goes down to p^3 ≅ 1 / 4,000,000. If we haven’t removed the recovery files, the probability of losing more than 30 seconds of data after a clean shutdown and restart goes down to p^4 ≅ 1 / 630,000,000. This still means that , statistically speaking, at every startup, there is one user of Firefox somewhere around the world who will lose more than 30 seconds of data, but this is much, better than the previous situation by several orders of magnitude.

It is my hope that this new mechanism will transparently make your life better. Have fun with Firefox!

Season 1 Episode 2 – The Fight for File I/O

April 2, 2014 § Leave a comment

Plot Our heroes set out for the first battle. Session Restore’s file I/O was clearly inefficient. Not only was it performing redundant operations, but also it was blocking the main thread doing so. The time had come to take it back. Little did our heroes know that the forces of Regression were lurking and that real battle would be fought long after the I/O had been rewritten and made non-blocking.

For historical reasons, some of Session Restore’s File I/O was quite inefficient. Reading and backing up were performed purely on the main thread, which could cause multi-second pauses in extreme cases, and 100ms+ pauses in common cases. Writing was done mostly off the main thread, but the underlying library used caused accidental main thread I/O, with the same effect, and disk flushing. Disk flushing is extremely inefficient on most operating systems and can quickly bring the whole system to its knees, so needs to be avoided.

Fortunately, OS.File, the (then) new JavaScript library designed to provide off main thread I/O had just become available. Turning Session Restore’s I/O into OS.File-based off main thread I/O was surprisingly simple, and even contributed to make the relevant fragments of the code more readable.

In addition to performing main thread I/O and flushing, Session Restore’s I/O had several immediate weaknesses. One of the weaknesses was due to its crash detection mechanism, that required Session Restore to rewrite sessionstore.js immediately after startup, just to store a boolean indicating that we had not crashed. Recall that the largest sessionsstore.js known to this date weighs 150+Mb, and that 1Mb+ instances represented ~5% of our users. Rewriting all this data (and blocking startup while doing so) for a simple boolean flag was clearly unacceptable. We fixed this issue by separating the crash detection mechanism into its own module and ensuring that it only needed to write a few bytes. Another weakness was due to the backup code, which required a full (and inefficient) copy during startup, whereas a simple renaming would have been sufficient.

Having fixed all of this, we were happy. We were wrong.

Speed improvements?

Sadly, Telemetry archives do not reach back far enough to let me provide data confirming any speed improvement. Note for future perf developers including future self: backup your this data or blog immediately before The Cloud eats it.

As for measuring the effects of a flush, at the moment, we do not have a good way to do this, as the main impact is not on the process itself but on the whole system. The best we can do is measure the total number of flushes, but that doesn’t really help.

Full speed… backwards?

The first indication that something was wrong was a large increase in Telemetry measure SESSIONRESTORED, which measures the total amount of time between the launch of the browser and the moment Session Restore has completed initialization. After a short period of bafflement, we concluded that this increase was normal and was due to a change of initialization order – indeed, since OS.File I/O was executed off the main thread, the results of reading the sessionstore.js file could only be received once the main thread was idle and could receive messages from other threads. While this interpretation was partly correct, it masked a very real problem that we only detected much later. Additionally, during our refactorings, we changed the instant at which Session Restore initialization was executed, which muddled the waters even further.

The second indication arrived much later, when the Metrics team extracted Firefox Health Report data from released versions and got in touch with the Performance team to inform us of a large regression in firstPaint-to-sessionRestored time. For most of our users, Firefox was now taking more than 500ms more to load, which was very bad.

After some time spent understanding the data, attempting to reproduce the measure and bisecting to find out at which changeset the regression had taken place, as well as instrumenting code with additional performance probes, we finally concluded that the problem was due to our use I/O thread, the “SessionWorker”. More precisely, this thread was very slow to launch during startup. Digging deeper, we concluded that the problem was not in the code of the SessionWorker itself, but that the loading of the underlying thread was simply too slow. More precisely, loading was fine on a first run, but on second run, disk I/O contention between the resources required by the worker (the cache for the source code of SessionWorker and its dependencies) and the resources required by the rest of the browser (other source code, but also icons, translation files, etc) slowed down things considerably. Replacing the SessionWorker by a raw use of OS.File would not have improved the situation – ironically, just as the SessionWorker, our fast I/O library was loading slowly because of slow file I/O. Further measurements indicated that this slow loading could take up to 6 seconds in extreme cases, with an average of 340ms.

Once the problem had been identified, we could easily develop a stopgap fix to recover most of the regression. We kept OS.File-based writing, as it was not in cause, but we fell back to NetUtil-based loading, which did not require a JavaScript Worker. According to Firefox Health Report, this returned us to a level close to what we had prior to our changes, although we are still worse by 50-100ms. We are still attempting to find out what causes this regression and whether this regression was indeed caused by our work.

With this stopgap fix in place, we set out to provide a longer-term fix, in the form of a reimplementation of OS.File.read(), the critical function used during startup, that did not need to boot a JavaScript worker to proceed. This second implementation was written in C++ and had a number of additional side-improvements, such as the ability to decode strings off the main thread, and transmit them to the main thread at no cost.

The patch using the new version of OS.File.read() has landed a few days ago. We are still in the process of trying to make sense of Telemetry numbers. While Telemetry indicates that the total time to read and decode the file has considerably increased, the total time between the start of the read and the time we finish startup seems to have decreased nicely by .5 seconds (75th percentile) to 4 seconds (95th percentile). We suspect that we are confronted to yet another case in which concurrency makes performance measurement more difficult.

Shutdown duration?

We have not attempted to measure the duration of shutdown-time I/O at the moment.

Losing data or privacy

By definition, since we write data asynchronously, we never wait until the write is complete before proceeding. In most cases, this is not a problem. However, process shutdown may interrupt the write during its execution. While the APIs we use to write the data ensure that shutdown will never cause a file to be partially written, it may cause us to lose the final write, i.e. 15 seconds of browsing, working, etc. To make things slightly worse, the final write of Session Restore is special, insofar as it removes some information that is considered somewhat privacy-sensitive and that is required for crash recovery but not for a clean restart. The risk already existed before our refactoring, but was increased by our work, as the new I/O model was based on JavaScript workers, which are shutdown earlier than the mechanism previously used, and without ensuring that their work is complete.

While we received no reports of bugs caused by this risk, we solved the issue by plugging Session Restore’s shutdown into AsyncShutdown.

Changing the back-end

One of our initial intuitions when starting with this work was that the back-end format used to store session data (large JSON file) was inefficient and needed to be changed. Before doing so, however, we instrumented the relevant code carefully. As it turns out, we could indeed gain some performance by improving the back-end format, but this would be a relatively small win in comparison with everything else that we have done.

We have several possible designs for a new back-end, but we have decided not to proceed for the time being, as there are still larger gains to be obtained with simpler changes. More on this in future blog entries.

Epilogue

Before setting out on this quest, we were already aware that performance refactorings were often more complex than they appeared. Our various misadventures have confirmed it. I strongly believe that, by changing I/O, we have improved the performance of Session Restore in many ways. Unfortunately, I cannot prove that we have improved runtime (because old data has disappeared), and we are still not certain that we have not regressed start-up.

If there are lessons to be learned, it is that:

  • there is no performance work without performance measurements;
  • once your code is sophisticated enough, measuring and understanding the results is much harder than improving performance.

On the upside, all this work has succeeded at:

  • improving our performance measurements of many points of Session Restore;
  • finding out weaknesses of ChromeWorkers and fixing some of these;
  • finding out weaknesses of OS.File and fixing some of these;
  • fixing Session Restore’s backup code that consumed resources and didn’t really do much useful;
  • avoiding unnecessary performance refactorings where they would not have helped.

The work on improving Session Restore file I/O is still ongoing. For one thing, we are still waiting for confirmation that our latest round of optimizations does not cause unwanted regressions. Also, we are currently working on Talos benchmarks and Telemetry measurements to let us catch such regressions earlier.

This work has also spawned other works for other teams on improving the performance of ChromeWorkers’ startup and communication speed.

In the next episode

Drama. Explosions. Asynchronicity. Electrolysis. And more.

Where Am I?

You are currently browsing entries tagged with session restore at Il y a du thé renversé au bord de la table.

Follow

Get every new post delivered to your Inbox.

Join 32 other followers