Q2 2014 Report

July 1, 2014 § Leave a comment

Q2 2014 was a difficult quarter at Mozilla, with all the agitation around Brendan Eich, Australis, Media Extensions, etc. Still, I have the feeling that we managed to get a lot done despite the intense pressure. Here is a quick highlight of my main accomplishments for Q2 2014.

Session Restore

A considerable amount of my time was spent working on Session Restore. The main objective is to decrease the jank caused by Session Restore taking snapshots of the session and to decrease the time Session Restore takes to restore the state of Firefox. Much of the activity this quarter dealt with measuring performance, so as to best optimize it and improving safety.

Reworking Session Restore backups

With Firefox 33, the backups of Session Restore state have been completely redesigned. The new system should prove orders of magnitude safer, in addition to now being fully transparent.

Next steps We are still lacking measurements to confirm that this is as successful as the mathematics suggest. If you are interested, there is a mentored bug open.

Talos tests and Telemetry on Session Restore startup

Optimizing startup is difficult, and generally impossible if you do not know what to optimize. With Firefox 32 and 33, we have new benchmarks and real world measurements to help us determine immediately the influence of patches on Session Restore startup.

Next steps Using these benchmarks to experiment with possible optimizations. This is in progress.

Cleaning up Session Restore file

One of our objectives is to decrease the size of the Session Restore file, to reduce the amount of I/O (hence battery use and hardware wear and tear) and memory usage. As a first step, we have introduced a mechanism that progressively removes from the “Undo Close” feature tabs and windows that have been closed at least 2 weeks ago. Interestingly, Telemetry indicates that this clean-up has no effect on the size of the Session Restore file. Experiments run later during the quarter, using the Talos tests, also strongly suggest that the data that we could clean up and that we do not clean up yet have essentially no influence on startup duration.

Next steps I believe that this strategy will therefore not be pursued during the next quarters.

Preserving compatibility with Tor Browser

While refactoring Session Restore, we have hit a number of obstacles in the form of add-ons using private or semi-private APIs that we wished to remove. We have managed to work along with add-on authors and, as far as I know, we have not broken any add-on yet. In particular, we have maintained compatibility with the Tor Browser, which is a heavily customized distribution of Firefox targeted towards privacy.

Next steps Providing a clean API for add-ons. This will require discussing with add-on authors to find out what they need.

Async tooling

I am in charge of the Async Project, which is all about giving front-end and add-on developers tools to develop asynchronous code that does not jank. As usual, this involved plenty of activity in a number of different directions.

Auto-closing Sqlite.jsm databases (mentoring Michael Brennan)

Sqlite.jsm databases can now be closed automatically during garbage-collection. On user’s computers, this will increase safety, as failing to close a database causes shutdown-time assertion failures. However, to use resources effectively, pragmatism dictates that databaes should be closed manually, so failing to close a database in the Mozilla codebase will still cause test failures.

Reworking OS.File shutdown

On devices with little memory (typically Firefox Phones), one of the techniques used to save memory is to shutdown the OS.File worker as early as possible, re-launching it later if necessary. As it turns out, the task is more complicated than it seems, due to possibilities of race conditions. Unfortunately, this means that in some extreme cases, Firefox OS applications could lock down and fail to shutdown properly without being killed by the OS. This is now fixed. Somewhere along the way, this helped us to make the PromiseWorker used by OS.File more resilient to low-level errors.

Next steps Making the PromiseWorker usable by other modules than OS.File, including testing and add-ons.

OS.File for Android and Firefox OS

OS.File was initially designed for desktop devices. Now that it is used in a number of places on mobile devices, I have mercilessly hunted down all compatibility issues between OS.File and our two mobile platforms. Compatibility tests are now activated on all platforms and should avoid any regression.

AsyncShutdown Barrier mechanism

The shutdown process of Firefox has always been a dark and scary place, full of unspecified dependencies. As a result, any refactoring or addition a new dependency could break many things in new and interesting ways. I have introduced the AsyncShutdown Barrier mechanism that lets us specify clear, explicit and extensible dependencies, handles ordering of shutdowns, as well as error reporting if a dependency is unmet. This Barrier is now used by Sqlite.jsm, OS.File, Firefox Health Report, Session Restore, Page Thumbnails and fixes a number of major issues.

Next steps Porting AsyncShutdown Barrier to allow native components to register with it.

Fixing Firefox 30 shutdown freezes (with Tim Taubert)

Many users of Firefox 30 encountered issues that caused Firefox to freeze during shutdown. We found out that the issue was caused was triggered by Page Thumbnails and caused by a bug in ChromeWorkers, which did not handle an error case gracefully. I applied AsyncShutdown Barrier to ensure that Page Thumbnails always completed without triggering the error case, while Tim Taubert ensured that the Chrome Workers handled the error robustly.

Making Firefox Health Report shutdown more robust

While porting Firefox Health Report to AsyncShutdown, we encountered an elusive bug that manifested itself by causing rare shutdown crashes. After months of experimenting, instrumenting and attempting to fix the issue, we eventually traced it back to a more serious bug in shutdown, which apparently does not always send the proper notifications. Using the AsyncShutdown Barrier, we managed to work around the issue and make FHR’s shutdown both more robust and better instrumented in case of crash. This later helped us locate another issue that prevents a proper shutdown when some databases have been corrupted.

Next steps Fix the upstream shutdown bug, make our shutdown more resilient in case of database corruption.

Async testing

The other aspect of writing asynchronous code is making sure that developers can debug it. Now that we have hit a critical mass of developers writing async code, it was high time to help them work with it.

Rewriting Task stack traces to be meaningful

Now that we know how to handle uncaught errors, the main remaining weaknesses of Promise-based and Task-based code is that their stack traces lose much information. Since Firefox 33, Task-based stack traces are now transparently rewritten into something developer-redable. Somewhere along the way, I have also patched xpcshell and mochitests to ensure that they take advantage of this rewriting. Experience shows that this is very useful and that the runtime cost is negligible.

Next steps Evaluate the runtime cost of doing the same thing for Promise-based code.

Making xpcshell tests fail in case of uncaught promise error

Uncaught promise errors were treated by the test suites as warnings, TBPL did not report them, and they remained consequantly more often than not ignored (or even unseen) by the developers. I have reworked the xpcshell test harness to consider all uncaught promise errors as oranges and fixed all offenders.

Next steps Doing the same for mochitests. Code is ready, but a few offenders remain.

Community

Dealing with political feedback around the nomination and departure of Brendan Eich

Along with many others, I made my best to engage people who voiced their negative feedback either at the nomination or at the departure of Brendan Eich. Unfortunately, this took time and efforts, but I believe that staying in touch with our users is part of what makes the difference between Mozilla and other browser vendors.

Working with new contributors

I estimate that I have worked with ~30 potential new contributors during the quarter. Many have unfortunately decided to postpone or abandon their efforts towards contributing, but a few have stayed, to work either with me or with other teams. At the moment, I am following 5 promising contributors. In particular, I am quite happy to welcome Dexter (who is working on a very sophisticated patch to let code watch for file modifications) and Kushagra (who has landed several test suite bugs).

Next steps More of it!

Working with universities

A group of École Centrale de Lyon successfully completed an online tool to help grassroot projects find volunteers. It was nice mentoring them.

Zedge

I was invited to deliver a presentation on performance at Zedge, in Trondheim, Norway. That was fun :)

Next steps Publish the slides.

And now?

Let’s get started with Q3!

Firefox, the Browser that has your Back[up]

June 26, 2014 § 27 Comments

One of the most important features of Firefox, in my opinion, is Session Restore. This component is responsible for ensuring that, even in case of crash, or if you upgrade your browser or an add-on that requires restart, your browser can reopen immediately and in the state in which you left it. As far as I am concerned, this feature is a life-safer.

Unfortunately, there are a few situations in which the Session Restore file may be corrupted – typically, if the computer is rebooted before the write is complete, or if it loses power, or if the operating system crashes or the disk is disconnected, we may end up losing our precious Session Restore. While any of these circumstances happens quite seldom, it needs to be applied as part of the following formula:

seldom · .5 billion users = a lot

I am excited to announce that we have just landed a new and improved Session Restore component in Firefox 33 that protects your precious data better than ever.

How it works

Firefox needs Session Restore to handle the following situations:

  • restarting Firefox without data loss after a crash of either Firefox, the Operating System, a driver or the hardware, or after Firefox has been killed by the Operating System during shutdown;
  • restarting Firefox without data loss after Firefox has been restarted due to an add-on or an upgrade;
  • quitting Firefox and, later, restarting without data loss.

In order to handle all of this, Firefox needs to take a snapshot of the state of the browser whenever anything happens, whether the user browses, fills a form, scrolls, or an application sets a Session Cookie, Session Storage, etc. (this is actually capped to one save every 15 seconds, to avoid overloading the computer). In addition, Firefox performs a clean save during shutdown.

While at the level of the application, the write mechanism itself is simple and robust, a number of things beyond the control of the developer can prevent either the Operating System or the hard drive itself from completing this write consistently – a typical example being tripping on the power plug of a desktop computer during the write.

The new mechanism involves two parts:

  • keeping smart backups to maximize the chances that at least one copy will be readable;
  • making use of the available backups to transparently avoid or minimize data loss.

The implementation actually takes very few lines of code, the key being to know the risks against which we defend.

Keeping backups

During runtime, Firefox remembers which files are known to be valid backups and which files should be discarded. Whenever a user interaction or a script requires it, Firefox writes the contents of Session Restore to a file called sessionstore-backups/recovery.js. If it is known to be good, the previous version of sessionstore-backups/recovery.js is first moved to sessionstore-backups/recovery.bak. In most cases, both files are valid and recovery.js contains a state less than 15 seconds old, while recovery.bak contains a state less than 30 seconds old. Additionally, the writes on both files are separated by at least 15 seconds. In most circumstances, this is sufficient to ensure that, even of hard drive crash during a write to recover.js, at least recovery.bak has been entirely written to disk.

During shutdown, Firefox writes a clean startup file to sessionstore.js. In most cases, this file is valid and contains the exact state of Firefox at the time of shutdown (minus some privacy filters). During startup, if sessionstore.js is valid, Firefox moves it to sessiontore-backup/previous.js. Whenever this file exists, it is valid and contains the exact state of Firefox at the time of the latest clean shutdown/startup. Note that, in case of crash, the latest clean shutdown/startup might be older than the latest actual startup, but this backup is useful nevertheless.

Finally, on the first startup after an update, Firefox copies sessionstore.js, if it is available and valid, to sessionstore-backups/upgrade.js-[build id]. This mechanism is designed primarily for testers of Firefox Nightly, who keep on the very edge, upgrading Firefox every day to check for bugs. Testers, if we introduce a bug that affects Session Restore, this can save your life.

As a side-note, we never use the operating system’s flush call, as 1/ it does not provide the guarantees that most developers expect; 2/ on most operating systems, it causes catastrophic slowdowns.

Recovering

All in all, Session Restore may contain the following files:

  • sessionstore.js (contains the state of Firefox during the latest shutdown – this file is absent in case of crash);
  • sessionstore-backups/recovery.js (contains the state of Firefox ≤ 15 seconds before the latest shutdown or crash – the file is absent in case of clean shutdown, if privacy settings instruct us to wipe it during shutdown, and after the write to sessionstore.js has returned);
  • sessionstore-backups/recovery.bak (contains the state of Firefox ≤ 30 seconds before the latest shutdown or crash – the file is absent in case of clean shutdown, if privacy settings instruct us to wipe it during shutdown, and after the removal of sessionstore-backups/recovery.js has returned);
  • sessionstore-backups/previous.js (contains the state of Firefox during the previous successful shutdown);
  • sessionstore-backups/upgrade.js-[build id] (contains the state of Firefox after your latest upgrade).

All these files use the JSON format. While this format has drawbacks, it has two huge advantages in this setting:

  • it is quite human-readable, which makes it easy to recover manually in case of an extreme crash;
  • its syntax is quite rigid, which makes it easy to find out whether it was written incompletely.

As our main threat is a crash that prevents us from writing the file entirely, we take advantage of the latter quality to determine whether a file is valid. Based on this, we test each file in the order indicated above, until we find one that is valid. We then proceed to restore it.

If Firefox was shutdown cleanly:

  1. In most cases, sessionstore.js is valid;
  2. In most cases in which sessionstore.js is invalid, sessionstore-backups/recovery.js is still present and valid (the likelihood of it being present is obviously higher if privacy settings do not instruct Firefox to remove it during shutdown);
  3. In most cases in which sessionstore-backups/recovery.js is invalid, sessionstore-backups/recovery.bak is still present, with an even higher likelihood of being valid (the likelihood of it being present is obviously higher if privacy settings do not instruct Firefox to remove it during shutdown);
  4. In most cases in which the previous files are absent or invalid, sessionstore-backups/previous.js is still present, in which case it is always valid;
  5. In most cases in which the previous files are absent or invalid, sessionstore-backups/upgrade.js-[...] is still present, in which case it is always valid.

Similarly, if Firefox crashed or was killed:

  1. In most cases, sessionstore-backups/recovery.js is present and valid;
  2. In most cases in which sessionstore-backups/recovery.js is invalid, sessionstore-backups/recovery.bak is pressent, with an even higher likelihood of being valid;
  3. In most cases in which the previous files are absent or invalid, sessionstore-backups/previous.js is still present, in which case it is always valid;
  4. In most cases in which the previous files are absent or invalid, sessionstore-backups/upgrade.js-[...] is still present, in which case it is always valid.

Numbers crunching

Statistics collected on Firefox Nightly 32 suggest that, out of 11.95 millions of startups, 75,310 involved a corrupted sessionstore.js. That’s roughly a corrupted sessionstore.js every 158 startups, which is quite a lot. This may be influenced by the fact that users of Firefox Nightly live on pre-alpha, so are more likely to encounter crashes or Firefox bugs than regular users, and that some of them use add-ons that may modify sessionstore.js themselves.

With the new algorithm, assuming that the probability for each file to be corrupted is independent and is p = 1/158, the probability of losing more than 30 seconds of data after a crash goes down to p^3 ≅ 1 / 4,000,000. If we haven’t removed the recovery files, the probability of losing more than 30 seconds of data after a clean shutdown and restart goes down to p^4 ≅ 1 / 630,000,000. This still means that , statistically speaking, at every startup, there is one user of Firefox somewhere around the world who will lose more than 30 seconds of data, but this is much, better than the previous situation by several orders of magnitude.

It is my hope that this new mechanism will transparently make your life better. Have fun with Firefox!

Revisiting uncaught asynchronous errors in the Mozilla Platform

May 30, 2014 § Leave a comment

Consider the following feature and its xpcshell test:

// In a module Foo
function doSomething() {
  // ...
  OS.File.writeAtomic("/an invalid path", "foo");
  // ...
}

// In the corresponding unit test
add_task(function*() {
  // ...
  Foo.doSomething();
  // ...
});

Function doSomething is obviously wrong, as it performs a write operation that cannot succeed. Until we started our work on uncaught asynchronous errors, the test passed without any warning. A few months ago, we managed to rework Promise to ensure that the test at least produced a warning. Now, this test will actually fail with the following message:

A promise chain failed to handle a rejection – Error during operation ‘write’ at …

This is particularly useful for tracking subsystems that completely forget to handle errors or tasks that forget to call yield.

Who is affected?

This change does not affect the runtime behavior of application, only test suites.

  • xpcshell: landed as part of bug 976205;
  • mochitest / devtools tests: waiting for all existing offending tests to be fixed, code is ready as part of bug 1016387;
  • add-on sdk: no started, bug 998277.

This change only affects the use of Promise.jsm. Support for DOM Promise is in bug 989960.

Details

We obtain a rejected Promise by:

  • throwing from inside a Task; or
  • throwing from a Promise handler; or
  • calling Promise.reject.

A rejection can be handled by any client of the rejected promise by registering a rejection handler. To complicate things, the rejection handler can be registered either before the rejection or after it.

In this series of patches, we cause a test failure if we end up with a Promise that is rejected and has no rejection handler either:

  • immediately after the Promise is garbage-collected;
  • at the end of the add_task during which the rejection took place;
  • at the end of the entire xpcshell test;

(whichever comes first).

Opting out

There are extremely few tests that should need to raise asynchronous errors and not catch them. So far, we have needed this two tests: one that tests the asynchronous error mechanism itself and another one that willingly crashes subprocesses to ensure that Firefox remains stable.

You should not need to opt out of this mechanism. However, if you absolutely need to, we have a mechanism for opting out. For more details, see object Promise.Debugging in Promise.jsm.

Any question?

Feel free to contact either me or Paolo Amadini.

Shutting down Asynchronously, part 2

May 26, 2014 § Leave a comment

During shutdown of Firefox, subsystems are closed one after another. AsyncShutdown is a module dedicated to express shutdown-time dependencies between:

  • services and their clients;
  • shutdown phases (e.g. profile-before-change) and their clients.

Barriers: Expressing shutdown dependencies towards a service

Consider a service FooService. At some point during the shutdown of the process, this service needs to:

  • inform its clients that it is about to shut down;
  • wait until the clients have completed their final operations based on FooService (often asynchronously);
  • only then shut itself down.

This may be expressed as an instance of AsyncShutdown.Barrier. An instance of AsyncShutdown.Barrier provides:

  • a capability client that may be published to clients, to let them register or unregister blockers;
  • methods for the owner of the barrier to let it consult the state of blockers and wait until all client-registered blockers have been resolved.

Shutdown timeouts

By design, an instance of AsyncShutdown.Barrier will cause a crash if it takes more than 60 seconds awake for its clients to lift or remove their blockers (awake meaning that seconds during which the computer is asleep or too busy to do anything are not counted). This mechanism helps ensure that we do not leave the process in a state in which it can neither proceed with shutdown nor be relaunched.

If the CrashReporter is enabled, this crash will report: – the name of the barrier that failed; – for each blocker that has not been released yet:

  • the name of the blocker;
  • the state of the blocker, if a state function has been provided (see AsyncShutdown.Barrier.state).

Example 1: Simple Barrier client

The following snippet presents an example of a client of FooService that has a shutdown dependency upon FooService. In this case, the client wishes to ensure that FooService is not shutdown before some state has been reached. An example is clients that need write data asynchronously and need to ensure that they have fully written their state to disk before shutdown, even if due to some user manipulation shutdown takes place immediately.

// Some client of FooService called FooClient

Components.utils.import("resource://gre/modules/FooService.jsm", this);

// FooService.shutdown is the `client` capability of a `Barrier`.
// See example 2 for the definition of `FooService.shutdown`
FooService.shutdown.addBlocker(
  "FooClient: Need to make sure that we have reached some state",
  () => promiseReachedSomeState
);
// promiseReachedSomeState should be an instance of Promise resolved once
// we have reached the expected state

Example 2: Simple Barrier owner

The following snippet presents an example of a service FooService that wishes to ensure that all clients have had a chance to complete any outstanding operations before FooService shuts down.

    // Module FooService

    Components.utils.import("resource://gre/modules/AsyncShutdown.jsm", this);
    Components.utils.import("resource://gre/modules/Task.jsm", this);

    this.exports = ["FooService"];

    let shutdown = new AsyncShutdown.Barrier("FooService: Waiting for clients before shutting down");

    // Export the `client` capability, to let clients register shutdown blockers
    FooService.shutdown = shutdown.client;

    // This Task should be triggered at some point during shutdown, generally
    // as a client to another Barrier or Phase. Triggering this Task is not covered
    // in this snippet.
    let onshutdown = Task.async(function*() {
      // Wait for all registered clients to have lifted the barrier
      yield shutdown.wait();

      // Now deactivate FooService itself.
      // ...
    });

Frequently, a service that owns a AsyncShutdown.Barrier is itself a client of another Barrier.

 

Example 3: More sophisticated Barrier client

The following snippet presents FooClient2, a more sophisticated client of FooService that needs to perform a number of operations during shutdown but before the shutdown of FooService. Also, given that this client is more sophisticated, we provide a function returning the state of FooClient2 during shutdown. If for some reason FooClient2’s blocker is never lifted, this state can be reported as part of a crash report.

    // Some client of FooService called FooClient2

    Components.utils.import("resource://gre/modules/FooService.jsm", this);

    FooService.shutdown.addBlocker(
      "FooClient2: Collecting data, writing it to disk and shutting down",
      () => Blocker.wait(),
      () => Blocker.state
    );

    let Blocker = {
      // This field contains information on the status of the blocker.
      // It can be any JSON serializable object.
      state: "Not started",

      wait: Task.async(function*() {
        // This method is called once FooService starts informing its clients that
        // FooService wishes to shut down.

        // Update the state as we go. If the Barrier is used in conjunction with
        // a Phase, this state will be reported as part of a crash report if FooClient fails
        // to shutdown properly.
        this.state = "Starting";

        let data = yield collectSomeData();
        this.state = "Data collection complete";

        try {
          yield writeSomeDataToDisk(data);
          this.state = "Data successfully written to disk";
        } catch (ex) {
          this.state = "Writing data to disk failed, proceeding with shutdown: " + ex;
        }

        yield FooService.oneLastCall();
        this.state = "Ready";
      }.bind(this)
    };

Example 4: A service with both internal and external dependencies

    // Module FooService2

    Components.utils.import("resource://gre/modules/AsyncShutdown.jsm", this);
    Components.utils.import("resource://gre/modules/Task.jsm", this);
    Components.utils.import("resource://gre/modules/Promise.jsm", this);

    this.exports = ["FooService2"];

    let shutdown = new AsyncShutdown.Barrier("FooService2: Waiting for clients before shutting down");

    // Export the `client` capability, to let clients register shutdown blockers
    FooService2.shutdown = shutdown.client;

    // A second barrier, used to avoid shutting down while any connections are open.
    let connections = new AsyncShutdown.Barrier("FooService2: Waiting for all FooConnections to be closed before shutting down");

    let isClosed = false;

    FooService2.openFooConnection = function(name) {
      if (isClosed) {
        throw new Error("FooService2 is closed");
      }

      let deferred = Promise.defer();
      connections.client.addBlocker("FooService2: Waiting for connection " + name + " to close",  deferred.promise);

      // ...


      return {
        // ...
        // Some FooConnection object. Presumably, it will have additional methods.
        // ...
        close: function() {
          // ...
          // Perform any operation necessary for closing
          // ...

          // Don't hoard blockers.
          connections.client.removeBlocker(deferred.promise);

          // The barrier MUST be lifted, even if removeBlocker has been called.
          deferred.resolve();
        }
      };
    };


    // This Task should be triggered at some point during shutdown, generally
    // as a client to another Barrier. Triggering this Task is not covered
    // in this snippet.
    let onshutdown = Task.async(function*() {
      // Wait for all registered clients to have lifted the barrier.
      // These clients may open instances of FooConnection if they need to.
      yield shutdown.wait();

      // Now stop accepting any other connection request.
      isClosed = true;

      // Wait for all instances of FooConnection to be closed.
      yield connections.wait();

      // Now finish shutting down FooService2
      // ...
    });

Phases: Expressing dependencies towards phases of shutdown

The shutdown of a process takes place by phase, such as: – profileBeforeChange (once this phase is complete, there is no guarantee that the process has access to a profile directory); – webWorkersShutdown (once this phase is complete, JavaScript does not have access to workers anymore); – …

Much as services, phases have clients. For instance, all users of web workers MUST have finished using their web workers before the end of phase webWorkersShutdown.

Module AsyncShutdown provides pre-defined barriers for a set of well-known phases. Each of the barriers provided blocks the corresponding shutdown phase until all clients have lifted their blockers.

List of phases

AsyncShutdown.profileChangeTeardown

The client capability for clients wishing to block asynchronously during observer notification “profile-change-teardown”.

AsyncShutdown.profileBeforeChange

The client capability for clients wishing to block asynchronously during observer notification “profile-change-teardown”. Once the barrier is resolved, clients other than Telemetry MUST NOT access files in the profile directory and clients MUST NOT use Telemetry anymore.

AsyncShutdown.sendTelemetry

The client capability for clients wishing to block asynchronously during observer notification “profile-before-change2”. Once the barrier is resolved, Telemetry must stop its operations.

AsyncShutdown.webWorkersShutdown

The client capability for clients wishing to block asynchronously during observer notification “web-workers-shutdown”. Once the phase is complete, clients MUST NOT use web workers.

Season 1 Episode 2 – The Fight for File I/O

April 2, 2014 § Leave a comment

Plot Our heroes set out for the first battle. Session Restore’s file I/O was clearly inefficient. Not only was it performing redundant operations, but also it was blocking the main thread doing so. The time had come to take it back. Little did our heroes know that the forces of Regression were lurking and that real battle would be fought long after the I/O had been rewritten and made non-blocking.

For historical reasons, some of Session Restore’s File I/O was quite inefficient. Reading and backing up were performed purely on the main thread, which could cause multi-second pauses in extreme cases, and 100ms+ pauses in common cases. Writing was done mostly off the main thread, but the underlying library used caused accidental main thread I/O, with the same effect, and disk flushing. Disk flushing is extremely inefficient on most operating systems and can quickly bring the whole system to its knees, so needs to be avoided.

Fortunately, OS.File, the (then) new JavaScript library designed to provide off main thread I/O had just become available. Turning Session Restore’s I/O into OS.File-based off main thread I/O was surprisingly simple, and even contributed to make the relevant fragments of the code more readable.

In addition to performing main thread I/O and flushing, Session Restore’s I/O had several immediate weaknesses. One of the weaknesses was due to its crash detection mechanism, that required Session Restore to rewrite sessionstore.js immediately after startup, just to store a boolean indicating that we had not crashed. Recall that the largest sessionsstore.js known to this date weighs 150+Mb, and that 1Mb+ instances represented ~5% of our users. Rewriting all this data (and blocking startup while doing so) for a simple boolean flag was clearly unacceptable. We fixed this issue by separating the crash detection mechanism into its own module and ensuring that it only needed to write a few bytes. Another weakness was due to the backup code, which required a full (and inefficient) copy during startup, whereas a simple renaming would have been sufficient.

Having fixed all of this, we were happy. We were wrong.

Speed improvements?

Sadly, Telemetry archives do not reach back far enough to let me provide data confirming any speed improvement. Note for future perf developers including future self: backup your this data or blog immediately before The Cloud eats it.

As for measuring the effects of a flush, at the moment, we do not have a good way to do this, as the main impact is not on the process itself but on the whole system. The best we can do is measure the total number of flushes, but that doesn’t really help.

Full speed… backwards?

The first indication that something was wrong was a large increase in Telemetry measure SESSIONRESTORED, which measures the total amount of time between the launch of the browser and the moment Session Restore has completed initialization. After a short period of bafflement, we concluded that this increase was normal and was due to a change of initialization order – indeed, since OS.File I/O was executed off the main thread, the results of reading the sessionstore.js file could only be received once the main thread was idle and could receive messages from other threads. While this interpretation was partly correct, it masked a very real problem that we only detected much later. Additionally, during our refactorings, we changed the instant at which Session Restore initialization was executed, which muddled the waters even further.

The second indication arrived much later, when the Metrics team extracted Firefox Health Report data from released versions and got in touch with the Performance team to inform us of a large regression in firstPaint-to-sessionRestored time. For most of our users, Firefox was now taking more than 500ms more to load, which was very bad.

After some time spent understanding the data, attempting to reproduce the measure and bisecting to find out at which changeset the regression had taken place, as well as instrumenting code with additional performance probes, we finally concluded that the problem was due to our use I/O thread, the “SessionWorker”. More precisely, this thread was very slow to launch during startup. Digging deeper, we concluded that the problem was not in the code of the SessionWorker itself, but that the loading of the underlying thread was simply too slow. More precisely, loading was fine on a first run, but on second run, disk I/O contention between the resources required by the worker (the cache for the source code of SessionWorker and its dependencies) and the resources required by the rest of the browser (other source code, but also icons, translation files, etc) slowed down things considerably. Replacing the SessionWorker by a raw use of OS.File would not have improved the situation – ironically, just as the SessionWorker, our fast I/O library was loading slowly because of slow file I/O. Further measurements indicated that this slow loading could take up to 6 seconds in extreme cases, with an average of 340ms.

Once the problem had been identified, we could easily develop a stopgap fix to recover most of the regression. We kept OS.File-based writing, as it was not in cause, but we fell back to NetUtil-based loading, which did not require a JavaScript Worker. According to Firefox Health Report, this returned us to a level close to what we had prior to our changes, although we are still worse by 50-100ms. We are still attempting to find out what causes this regression and whether this regression was indeed caused by our work.

With this stopgap fix in place, we set out to provide a longer-term fix, in the form of a reimplementation of OS.File.read(), the critical function used during startup, that did not need to boot a JavaScript worker to proceed. This second implementation was written in C++ and had a number of additional side-improvements, such as the ability to decode strings off the main thread, and transmit them to the main thread at no cost.

The patch using the new version of OS.File.read() has landed a few days ago. We are still in the process of trying to make sense of Telemetry numbers. While Telemetry indicates that the total time to read and decode the file has considerably increased, the total time between the start of the read and the time we finish startup seems to have decreased nicely by .5 seconds (75th percentile) to 4 seconds (95th percentile). We suspect that we are confronted to yet another case in which concurrency makes performance measurement more difficult.

Shutdown duration?

We have not attempted to measure the duration of shutdown-time I/O at the moment.

Losing data or privacy

By definition, since we write data asynchronously, we never wait until the write is complete before proceeding. In most cases, this is not a problem. However, process shutdown may interrupt the write during its execution. While the APIs we use to write the data ensure that shutdown will never cause a file to be partially written, it may cause us to lose the final write, i.e. 15 seconds of browsing, working, etc. To make things slightly worse, the final write of Session Restore is special, insofar as it removes some information that is considered somewhat privacy-sensitive and that is required for crash recovery but not for a clean restart. The risk already existed before our refactoring, but was increased by our work, as the new I/O model was based on JavaScript workers, which are shutdown earlier than the mechanism previously used, and without ensuring that their work is complete.

While we received no reports of bugs caused by this risk, we solved the issue by plugging Session Restore’s shutdown into AsyncShutdown.

Changing the back-end

One of our initial intuitions when starting with this work was that the back-end format used to store session data (large JSON file) was inefficient and needed to be changed. Before doing so, however, we instrumented the relevant code carefully. As it turns out, we could indeed gain some performance by improving the back-end format, but this would be a relatively small win in comparison with everything else that we have done.

We have several possible designs for a new back-end, but we have decided not to proceed for the time being, as there are still larger gains to be obtained with simpler changes. More on this in future blog entries.

Epilogue

Before setting out on this quest, we were already aware that performance refactorings were often more complex than they appeared. Our various misadventures have confirmed it. I strongly believe that, by changing I/O, we have improved the performance of Session Restore in many ways. Unfortunately, I cannot prove that we have improved runtime (because old data has disappeared), and we are still not certain that we have not regressed start-up.

If there are lessons to be learned, it is that:

  • there is no performance work without performance measurements;
  • once your code is sophisticated enough, measuring and understanding the results is much harder than improving performance.

On the upside, all this work has succeeded at:

  • improving our performance measurements of many points of Session Restore;
  • finding out weaknesses of ChromeWorkers and fixing some of these;
  • finding out weaknesses of OS.File and fixing some of these;
  • fixing Session Restore’s backup code that consumed resources and didn’t really do much useful;
  • avoiding unnecessary performance refactorings where they would not have helped.

The work on improving Session Restore file I/O is still ongoing. For one thing, we are still waiting for confirmation that our latest round of optimizations does not cause unwanted regressions. Also, we are currently working on Talos benchmarks and Telemetry measurements to let us catch such regressions earlier.

This work has also spawned other works for other teams on improving the performance of ChromeWorkers’ startup and communication speed.

In the next episode

Drama. Explosions. Asynchronicity. Electrolysis. And more.

Is my data on the disk? Safety properties of OS.File.writeAtomic

February 5, 2014 § 1 Comment

If you have been writing front-end or add-on code recently, chances are that you have been using library OS.File and, in particular, OS.File.writeAtomic to write files. (Note: If you have been writing files without using OS.File.writeAtomic, chances are that you are doing something wrong that will cause Firefox to jank – please don’t.) As the name implies, OS.File.writeAtomic will make efforts to write your data atomically, so as to ensure its survivability in case of crash, power loss, etc.

However, you should not trust this function blindly, because it has its limitations. Let us take a look at exactly what the guarantees provided by writeAtomic.

Algorithm: just write

Snippet OS.File.writeAtomic(path, data)

What it does

  1. reduce the size of the file at path to 0;
  2. send data to the operating system kernel for writing;
  3. close the file.

Worst case scenarios

  1. if the process crashes between 1. and 2. (a few microseconds), the full content of path may be lost;
  2. if the operating system crashes or the computer loses power suddenly before the kernel flushes its buffers (which may happen at any point up to 30 seconds after 1.), the full content of path may be lost;
  3. if the operating system crashes or the computer loses power suddenly while the operating system kernel is flushing (which may happen at any point after 1., typically up to 30 seconds), and if your data is larger than one sector (typically 32kb), data may be written incompletely, resulting in a corrupted file at path.

Performance very good.

Algorithm: write and rename

Snippet OS.File.writeAtomic(path, data, { tmpPath: path + ".tmp" })

What it does

  1. create a new file at tmpPath;
  2. send data to the operating system kernel for writing to tmpPath;
  3. close the file;
  4. rename tmpPath on top of path.

Worst case scenarios

  1. if the process crashes at any moment, nothing is lost, but a file tmpPath may be left on the disk;
  2. if the operating system crashes or the computer loses power suddenly while the operating system kernel is flushing metadata (which may happen at any point after 1., typically up to 30 seconds), the full content of path may be lost;
  3. if the operating system crashes or the computer loses power suddenly while the operating system kernel is flushing (which may happen at any point after 1., typically up to 30 seconds), and if your data is larger than one sector (typically 32kb), data may be written incompletely, resulting in a corrupted file at path.

Performance almost as good as Just Write.

Side-note On the ext4fs file system, the kernel automatically adds a flush, which transparently transforms the safety properties of this operation into those of the algorithm detailed next.

Native equivalent In XPCOM/C++, the mostly-equivalent solution is the atomic-file-output-stream.

Algorithm: write, flush and rename

Use OS.File.writeAtomic(path, data, { tmpPath: path + ".tmp", flush: true })

What it does

  1. create a new file at tmpPath;
  2. send data to the operating system kernel for writing to tmpPath;
  3. close the file;
  4. flush the writing of data to tmpPath;
  5. rename tmpPath on top of path.

Worst case scenarios

  1. if the process crashes at any moment, nothing is lost, but a file tmpPath may be left on the disk;
  2. if the operating system crashes, nothing is lost, but a file tmpPath may be left on the disk;
  3. if the computer loses power suddenly while the hard drive is flushing its internal hardware buffers (which is very hard to predict), nothing is lost, but an incomplete file tmpPath may be left on the disk;.

Performance some operating systems (Windows) or file systems (ext3fs) cannot flush a single file and rather need to flush all the files on the device, which considerably slows down the full operating system. On some others (ext4fs) this operation is essentially free. On some versions of MacOS X, flushing actually doesn’t do anything.

Native equivalent In XPCOM/C++, the mostly-equivalent solution is the safe-file-output-stream.

Algorithm: write, backup, rename

(not landed yet)

Snippet OS.File.writeAtomic(path, data, { tmpPath: path + ".tmp", backupTo: path + ".backup"})

What it does

  1. create a new file at tmpPath;
  2. send data to the operating system kernel for writing to tmpPath;
  3. close the file;
  4. rename the file at path to backupTo;
  5. rename the file at tmpPath on top of path;

Worst case scenarios

  1. if the process crashes between 4. and 5, file path may be lost and backupTo should be used instead for recovery;
  2. if the operating system crashes or the computer loses power suddenly while the operating system kernel is flushing metadata (which may happen at any point after 1., typically up to 30 seconds), the file at path may be empty and backupTo should be used instead for recovery;
  3. if the operating system crashes or the computer loses power suddenly while the operating system kernel is flushing (which may happen at any point after 1., typically up to 30 seconds), and if your data is larger than one sector (typically 32kb), data may be written incompletely, resulting in a corrupted file at path and backupTo should be used instead for recovery;

Performance almost as good as Write and Rename.

Making Firefox Feel as Fast as its Benchmarks – Part 2 – Towards multi-process

October 9, 2013 § 2 Comments

As we saw in the first post of this series, our browser behaves as follows:

function browser() {
  while (true) {
    handleEvents();  // Let's make this faster
    updateDisplay();
  }
}

As we discussed, the key to making the browser smooth is to make handleEvents() do less. One way of doing this is to go multi-process.

Going multi-process

Chrome is multi-process. Internet Explorer 4 was multi-process and so is Internet Explorer 8+ do (don’t ask me where IE 5, 6, 7 went). Well, Firefox OS is multi-process, too and Firefox for Android used to be multi-process until we canceled this approach due to Android-specific issues. For the moment, Firefox Desktop is only slightly multi-process, although we are heading further in this direction with project electrolysis (e10s, for short).

In a classical browser (i.e. not FirefoxOS, not the Firefox Web Runtime), going multi-process means running the user interface and system-style code in one process (the “parent process”) and running code specific to the web or to a single tab in another process (the “child process”). Whether all tabs share a process or each tab is a separate process, or even each iframe is a process, is an additional design choice that, I believe, is still open to discussion in Firefox. In FirefoxOS and in the Firefox Web Runtime (part of Firefox Desktop and Firefox for Android), going multi-process generally means one process per (web) application.

Since code is separated between processes, each handleEvents() has less to do and will therefore, at least in theory, execute faster. Additionally, this is better for security, insofar as a compromised web-specific process affords an attacker less control than compromising the full process. Finally, this gives the possibility to crash a child process if necessary, without having to crash the whole browser.

Coding for multi-process

In the Firefox web browser, the multi-process architecture is called e10s and looks as follows:

function parent() {
  while (true) {
    handleEvents();  // Some of your code here
    updateDisplay(); // Just the ui
  }
}
function child() {
  while (true) {
    handleEvents();  // Some of your code here
    updateDisplay(); // Just some web
  }
}

parent() ||| child()

The parent process and the child process are not totally independent. Very often, they need to communicate. For instance, when the user browses in a tab, the parent needs to change the history menu displayed by the parent process. Similarly, every few seconds, Firefox saves its state to provide quick recovery in case of crash – the parent asks each tab for its information and, once all replies have arrived, gathers them into one data structure and saves them all.

For this purpose, parent and the child can send messages to each other through the Message Manager. A Message Manager can let a child process communicate with a single parent process and can let a parent process communicate with one or more children processes:

// Sender-side
messageManager.sendAsyncMessage("MessageTopic", {data})

// Receiver-side
messageManager.addMesageListener("MessageTopic", this);
// ...
receiveMessage: function(message) {
  switch (message.name) {
  case "MessageTopic":
    // do something with message.data
    // ...
    break;
  }
}

Additionally, code executed in the parent process can inject code in the child process using the Message Manager, as follows:

messageManager.loadFrameScript("resource://path/to/script.js", true);

Once injected, the code behaves as any (privileged) code in the child process.

As you may see, communications are purely asynchronous – we do not wish the Message Manager to stop a process and wait until another process is done with it tasks, as this would totally defeat the purpose of multi-processing. There is an exception, called the Cross Process Object Wrapper, which I am not going to cover, as this mechanism is meant to be used only during a transition phase.

Limitations

It is tempting to see multi-process architecture as a silver bullet that semi-magically makes Firefox (or its competitors) fast and smooth. There are, however, quite clear limitations to the model.

Firstly, going multi-process has a cost. As demonstrated by Chrome, each process consumes lots of memory. Each process needs to load its libraries, its JavaScript VM, each script must be JIT-ed for each process, each process needs its communication channgel towards the GPU etc. Optimizing this is possible, as demonstrated by FirefoxOS (which runs nicely with 256 Mb) but is a challenge.

Similarly, starting a multi-process browser can be much slower than starting a single-process browser. Between the instant the user launches the browser and the instant it becomes actually usable, many things need to happen: launching the parent process, which in turn launches the children processes, setting up the communication channels, JIT compiling all the scripts that need compilation, etc. The same cost appears when shutting down the processes.

Also, using several processes brings about a risk of contention on resources. Two processes may need to access the disk cache at the same time, or the cookies, or the session storage, or the GPU or the audio device. All of this needs to be managed carefully and can, in certain cases, slow down considerably both processes.

Also, some APIs are synchronous by specifications. If, for some reason, a child process needs to access the DOM of another child process – as may happen in the case of iframes – both child processes need to become synchronous. During the operation, they both behave as a single process, with just extremely slower DOM operations.

And finally, going multi-process will of course not make a tab magically responsive if this tab itself is the source of the slowdown – in other words, multi-process it not very useful for games.

Refactoring for multi-process

Many APIs, both in Firefox itself and in add-ons, are not e10s-compliant yet. The task of refactoring Firefox APIs into something e10s-compliant is in progress and can be followed here. Let’s see what needs to be done to refactor an API for multi-process.

Firstly, this does not apply to all APIs. APIs that access web content for non-web content need to be converted to e10s-style – an example is Page Info, which needs to access web content (the list of links and images from that page) for the purpose of non-web content (the Page Info button and dialog). As multi-process communications is asynchronous, this means that such APIs must be asynchronous already or must be made asynchronous if they are not, and that code that calls these APIs needs to be made asynchronous if it is not asynchronous already, which in itself is already a major task. We will cover making things asynchronous in another entry of this series.

Once we have decided to make an asynchronous API e10s-compliant, the following step is to determine which part of the implementation needs to reside in a child process and which part in the parent process. Typically, anything that touches the web content must reside in the child process. As a rule of thumb, we generally consider that the parent process is more performance-critical than children-processes, so if you have code that could reside either in the child process or in the parent process, and if placing that code in the child process will not cause duplication of work or very expensive communications, it is a good idea to move it to the child process. This is, of course, a rule of thumb, and nothing replaces testing and benchmarking.

The next step is to define a communication protocol. Messages exchanged between the parent process and children processes all need a name. If are working on feature Foo, by conventions, the name of your messages should start with “Foo:”. Recall that message sending is asynchronous, so if you need a message to receive an answer, you will need two messages: one for sending the request (“Foo:GetState”) and one for replying once the operation is complete (“Foo:State”). Messages can carry arbitrary data in the form of a JavaScript structure (i.e. any object that can be converted to/from JSON without loss). If necessary, these structures may be used to attach unique identifiers to messages, so as to easily match a reply to its request – this feature is not built into the message manager but may easily be implemented on top of it. Also, do not forget to take into account communication timeouts – recall that a process may fail to reply because it has crashed or been killed for any reason.

The last step is to actually write the code. Code executed by the parent process typically goes into some .js file loaded from XUL (e.g. browser.js) or a .jsm module, as usual. Code executed by a child process goes into its own file, typically a .js, and must be injected into the child process during initialization by using window.messageManager.loadFrameScript (to inject in all children process) or browser.messageManager.loadFrameScript (to inject in a specific child process).

That’s it! In a future blog entry, I will write more about common patterns for writing or refactoring asynchronous code, which comes in very handy for code that uses your new API.

Contributing to e10s

The ongoing e10s refactoring of Firefox is a considerable task. To make it happen, the best way is to contribute to coding or testing.

What’s next?

In the next blog entry, I will demonstrate how to make front-end and add-on code multi-threaded.

Where Am I?

You are currently browsing the Sûreté / Security category at Il y a du thé renversé au bord de la table.

Follow

Get every new post delivered to your Inbox.

Join 32 other followers