The Battle of Session Restore – Season 1 Episode 3 – All With Measure

July 17, 2014 § 2 Comments

Plot For the second time, our heroes prepared for battle. The startup of Firefox was too slow and Session Restore was one of the battle fields.

When Firefox starts, Session Restore is in charge of restoring the browser to its previous state, in case of a crash, a restart, or for the users who have configured Firefox to resume from its previous state. This entails numerous activities during startup:

  1. read sessionstore.js from disk, decode it and parse it (recall that the file is potentially several Mb large), handling errors;
  2. backup sessionstore.js in case of startup crash.
  3. create windows, tabs, frames;
  4. populate history, scroll position, forms, session cookies, session storage, etc.

It is common wisdom that Session Restore must have a large impact on Firefox startup. But before we could minimize this impact, we needed to measure it.

Benchmarking is not easy

When we first set foot on Session Restore territory, the contribution of that module to startup duration was uncharted. This was unsurprising, as this aspect of the Firefox performance effort was still quite young. To this day, we have not finished chartering startup or even Session Restore’s startup.

So how do we measure the impact of Session Restore on startup?

A first tool we use is Timeline Events, which let us determine how long it takes to reach a specific point of startup. Session Restore has had events `sessionRestoreInitialized` and `sessionRestored` for years. Unfortunately, these events did not tell us much about Session Restore itself.

The first serious attempt at measuring the impact of Session Restore on startup Performance was actually not due to the Performance team but rather to the metrics team. Indeed, data obtained through Firefox Health Report participants indicated that something wrong had happened.

Oops, something is going wrong

Indicator `d2` in the graph measures the duration between `firstPaint` (which is the instant at which we start displaying content in our windows) and `sessionRestored` (which is the instant at which we are satisfied that Session Restore has opened its first tab). While this measure is imperfect, the dip was worrying – indeed, it represented startups that lasted several seconds longer than usual.

Upon further investigation, we concluded that the performance regression was indeed due to Session Restore. While we had not planned to start optimizing the startup component of Session Restore, this battle was forced upon us. We had to recover from that regression and we had to start monitoring startup much better.

A second tool is Telemetry Histograms for measuring duration of individual operations, such as reading sessionstore.js or parsing it. We progressively added measures for most of the operations of Session Restore. While these measures are quite helpful, they are also unfortunately very unstable in real-world conditions, as they are affected both by scheduling (the operations are asynchronous), by the work load of the machine, by the actual contents of sessionstore.js, etc.

The following graph displays the average duration of reading and decoding sessionstore.js among Telemetry participants: Telemetry 4

Difference in colors represent successive versions of Firefox. As we can see, this graph is quite noisy, certainly due to the factors mentioned above (the spikes don’t correspond to any meaningful change in Firefox or Session Restore). Also, we can see a considerable increase in the duration of the read operation. This was quite surprising for us, given that this increase corresponds to the introduction of a much faster, off the main thread, reading and decoding primitive. At the time, we were stymied by this change, which did not correspond to our experience. We have now concluded that by changing the asynchronous operation used to read the file, we have simply changed the scheduling, which makes the operation appear longer, while in practice it simply does not block the rest of the startup from taking place on another thread.

One major tool was missing for our arsenal: a stable benchmark, always executed on the same machine, with the same contents of sessionstore.js, and that would let us determine more exactly (almost daily, actually) the impact of our patches upon Session Restore:Session Restore Talos

This test, based on our Talos benchmark suite, has proved both to be very stable, and to react quickly to patches that affected its performance. It measures the duration between the instant at which we start initializing Session Restore (a new event `sessionRestoreInit`) and the instant at which we start displaying the results (event `sessionRestored`).

With these measures at hand, we are now in a much better position to detect performance regressions (or improvements) to Session Restore startup, and to start actually working on optimizing it – we are now preparing to using this suite to experiment with “what if” situations to determine which levers would be most useful for such an optimization work.

Evolution of startup duration

Our first benchmark measures the time elapsed between start and stop of Session Restore if the user has requested all windows to be reopened automatically

restoreAs we can see, the performance on Linux 32 bits, Windows XP and Mac OS 10.6 is rather decreasing, while the performance on Linux 64 bits, Windows 7 and 8 and MacOS 10.8 is improving. Since the algorithm used by Session Restore upon startup is exactly the same for all platforms, and since “modern” platforms are speeding up while “old” platforms are slowing down, this suggests that the performance changes are not due to changes inside Session Restore. The origin of these changes is unclear. I suspect the influence of newer versions of the compilers or some of the external libraries we use, or perhaps new and improved (for some platforms) gfx.

Still, seeing the modern platforms speed up is good news. As of Firefox 31, any change we make that causes a slowdown of Session Restore will cause an immediate alert so that we can react immediately.

Our second benchmark measures the time elapsed if the user does not wish windows to be reopened automatically. We still need to read and parse sessionstore.js to find whether it is valid, so as to decide whether we can show the “Restore” button on about:home.

norestoreWe see peaks in Firefox 27 and Firefox 28, as well as a slight decrease of performance on Windows XP and Linux. Again, in the future, we will be able to react better to such regressions.

The influence of factors upon startup

With the help of our benchmarks, we were able to run “what if” scenarios to find out which of the data manipulated by Session Restore contributed to startup duration. We did this in a setting in which we restore windows:size-restore

and in a setting in which we do not:

size-norestore

Interestingly, increasing the size of sessionstore.js has apparently no influence on startup duration. Therefore, we do not need to optimize reading and parsing sessionstore.js. Similarly, optimizing history, cookies or form data would not gain us anything.

The single largest most expensive piece of data is the set of open windows – interestingly, this is the case even when we do not restore windows. More precisely, any optimization should target, by order of priority:

  1. the cost of opening/restoring windows;
  2. the cost of opening/restoring tabs;
  3. the cost of dealing with windows data, even when we do not restore them.

What’s next?

Now that we have information on which parts of Session Restore startup need to be optimized, the next step is to actually optimize them. Stay tuned!

Souhaitez-vous aider le renard à accélérer ?

July 2, 2014 § Leave a comment

Je reçois régulièrement des propositions de volontaires qui souhaiteraient contribuer à Firefox. Habituellement, je les guide vers Bugs Ahoy – si vous ne connaissez pas Bugs Ahoy, foncez le voir, ce moteur de recherche dédié aux tâches accessibles aux débutants est fabuleux. Aujourd’hui, changement de programme : si vous souhaitez contribuer à Firefox, et plus précisément si vous souhaitez contribuer à améliorer les performances de Firefox, voici quelques manières de participer à l’effort de l’équipe Performance.

Session Restore

Session Restore est le composant de Firefox chargé de sauvegarder l’état du navigateur en permanence pour permettre de récupérer d’un crash du navigateur, du système ou du matériel ou d’un redémarrage intempestif sans perdre de données. Je suis en train de réécrire certaines parties de Session Restore pour améliorer sa réactivité (en le rendant parallèle) et sa contribution au temps de démarrage.

Pour vous lancer, quelques bugs d’introduction, e-mentorés par moi :

Tous ces bugs sont en JavaScript.

File I/O

OS.File est le composant de Firefox qui permet à JavaScript d’accéder au disque à haute performance. Je suis en train de réécrire certaines parties de OS.File pour le rendre plus extensible et pour améliorer la réactivité de certaines fonctions critiques. Quelques bugs d’introduction, e-mentorés par moi :

Shutting down Asynchronously, part 2

May 26, 2014 § Leave a comment

During shutdown of Firefox, subsystems are closed one after another. AsyncShutdown is a module dedicated to express shutdown-time dependencies between:

  • services and their clients;
  • shutdown phases (e.g. profile-before-change) and their clients.

Barriers: Expressing shutdown dependencies towards a service

Consider a service FooService. At some point during the shutdown of the process, this service needs to:

  • inform its clients that it is about to shut down;
  • wait until the clients have completed their final operations based on FooService (often asynchronously);
  • only then shut itself down.

This may be expressed as an instance of AsyncShutdown.Barrier. An instance of AsyncShutdown.Barrier provides:

  • a capability client that may be published to clients, to let them register or unregister blockers;
  • methods for the owner of the barrier to let it consult the state of blockers and wait until all client-registered blockers have been resolved.

Shutdown timeouts

By design, an instance of AsyncShutdown.Barrier will cause a crash if it takes more than 60 seconds awake for its clients to lift or remove their blockers (awake meaning that seconds during which the computer is asleep or too busy to do anything are not counted). This mechanism helps ensure that we do not leave the process in a state in which it can neither proceed with shutdown nor be relaunched.

If the CrashReporter is enabled, this crash will report: – the name of the barrier that failed; – for each blocker that has not been released yet:

  • the name of the blocker;
  • the state of the blocker, if a state function has been provided (see AsyncShutdown.Barrier.state).

Example 1: Simple Barrier client

The following snippet presents an example of a client of FooService that has a shutdown dependency upon FooService. In this case, the client wishes to ensure that FooService is not shutdown before some state has been reached. An example is clients that need write data asynchronously and need to ensure that they have fully written their state to disk before shutdown, even if due to some user manipulation shutdown takes place immediately.

// Some client of FooService called FooClient

Components.utils.import("resource://gre/modules/FooService.jsm", this);

// FooService.shutdown is the `client` capability of a `Barrier`.
// See example 2 for the definition of `FooService.shutdown`
FooService.shutdown.addBlocker(
  "FooClient: Need to make sure that we have reached some state",
  () => promiseReachedSomeState
);
// promiseReachedSomeState should be an instance of Promise resolved once
// we have reached the expected state

Example 2: Simple Barrier owner

The following snippet presents an example of a service FooService that wishes to ensure that all clients have had a chance to complete any outstanding operations before FooService shuts down.

    // Module FooService

    Components.utils.import("resource://gre/modules/AsyncShutdown.jsm", this);
    Components.utils.import("resource://gre/modules/Task.jsm", this);

    this.exports = ["FooService"];

    let shutdown = new AsyncShutdown.Barrier("FooService: Waiting for clients before shutting down");

    // Export the `client` capability, to let clients register shutdown blockers
    FooService.shutdown = shutdown.client;

    // This Task should be triggered at some point during shutdown, generally
    // as a client to another Barrier or Phase. Triggering this Task is not covered
    // in this snippet.
    let onshutdown = Task.async(function*() {
      // Wait for all registered clients to have lifted the barrier
      yield shutdown.wait();

      // Now deactivate FooService itself.
      // ...
    });

Frequently, a service that owns a AsyncShutdown.Barrier is itself a client of another Barrier.

 

Example 3: More sophisticated Barrier client

The following snippet presents FooClient2, a more sophisticated client of FooService that needs to perform a number of operations during shutdown but before the shutdown of FooService. Also, given that this client is more sophisticated, we provide a function returning the state of FooClient2 during shutdown. If for some reason FooClient2’s blocker is never lifted, this state can be reported as part of a crash report.

    // Some client of FooService called FooClient2

    Components.utils.import("resource://gre/modules/FooService.jsm", this);

    FooService.shutdown.addBlocker(
      "FooClient2: Collecting data, writing it to disk and shutting down",
      () => Blocker.wait(),
      () => Blocker.state
    );

    let Blocker = {
      // This field contains information on the status of the blocker.
      // It can be any JSON serializable object.
      state: "Not started",

      wait: Task.async(function*() {
        // This method is called once FooService starts informing its clients that
        // FooService wishes to shut down.

        // Update the state as we go. If the Barrier is used in conjunction with
        // a Phase, this state will be reported as part of a crash report if FooClient fails
        // to shutdown properly.
        this.state = "Starting";

        let data = yield collectSomeData();
        this.state = "Data collection complete";

        try {
          yield writeSomeDataToDisk(data);
          this.state = "Data successfully written to disk";
        } catch (ex) {
          this.state = "Writing data to disk failed, proceeding with shutdown: " + ex;
        }

        yield FooService.oneLastCall();
        this.state = "Ready";
      }.bind(this)
    };

Example 4: A service with both internal and external dependencies

    // Module FooService2

    Components.utils.import("resource://gre/modules/AsyncShutdown.jsm", this);
    Components.utils.import("resource://gre/modules/Task.jsm", this);
    Components.utils.import("resource://gre/modules/Promise.jsm", this);

    this.exports = ["FooService2"];

    let shutdown = new AsyncShutdown.Barrier("FooService2: Waiting for clients before shutting down");

    // Export the `client` capability, to let clients register shutdown blockers
    FooService2.shutdown = shutdown.client;

    // A second barrier, used to avoid shutting down while any connections are open.
    let connections = new AsyncShutdown.Barrier("FooService2: Waiting for all FooConnections to be closed before shutting down");

    let isClosed = false;

    FooService2.openFooConnection = function(name) {
      if (isClosed) {
        throw new Error("FooService2 is closed");
      }

      let deferred = Promise.defer();
      connections.client.addBlocker("FooService2: Waiting for connection " + name + " to close",  deferred.promise);

      // ...


      return {
        // ...
        // Some FooConnection object. Presumably, it will have additional methods.
        // ...
        close: function() {
          // ...
          // Perform any operation necessary for closing
          // ...

          // Don't hoard blockers.
          connections.client.removeBlocker(deferred.promise);

          // The barrier MUST be lifted, even if removeBlocker has been called.
          deferred.resolve();
        }
      };
    };


    // This Task should be triggered at some point during shutdown, generally
    // as a client to another Barrier. Triggering this Task is not covered
    // in this snippet.
    let onshutdown = Task.async(function*() {
      // Wait for all registered clients to have lifted the barrier.
      // These clients may open instances of FooConnection if they need to.
      yield shutdown.wait();

      // Now stop accepting any other connection request.
      isClosed = true;

      // Wait for all instances of FooConnection to be closed.
      yield connections.wait();

      // Now finish shutting down FooService2
      // ...
    });

Phases: Expressing dependencies towards phases of shutdown

The shutdown of a process takes place by phase, such as: – profileBeforeChange (once this phase is complete, there is no guarantee that the process has access to a profile directory); – webWorkersShutdown (once this phase is complete, JavaScript does not have access to workers anymore); – …

Much as services, phases have clients. For instance, all users of web workers MUST have finished using their web workers before the end of phase webWorkersShutdown.

Module AsyncShutdown provides pre-defined barriers for a set of well-known phases. Each of the barriers provided blocks the corresponding shutdown phase until all clients have lifted their blockers.

List of phases

AsyncShutdown.profileChangeTeardown

The client capability for clients wishing to block asynchronously during observer notification “profile-change-teardown”.

AsyncShutdown.profileBeforeChange

The client capability for clients wishing to block asynchronously during observer notification “profile-change-teardown”. Once the barrier is resolved, clients other than Telemetry MUST NOT access files in the profile directory and clients MUST NOT use Telemetry anymore.

AsyncShutdown.sendTelemetry

The client capability for clients wishing to block asynchronously during observer notification “profile-before-change2”. Once the barrier is resolved, Telemetry must stop its operations.

AsyncShutdown.webWorkersShutdown

The client capability for clients wishing to block asynchronously during observer notification “web-workers-shutdown”. Once the phase is complete, clients MUST NOT use web workers.

Recent changes to OS.File

April 8, 2014 § 3 Comments

A quick post to summarize some of the recent improvements to OS.File.

Encoding/decoding

To write a string, you can now pass the string directly to writeAtomic:

OS.File.writeAtomic(path, "Here is a string", { encoding: "utf-8"})

Similarly, you can now read strings from read:

OS.File.read(path, { encoding: "utf-8" } ); // Resolves to a string.

Doing this is at least as fast as calling TextEncoder/TextDecoder yourself (see below).

Native implementation

OS.File.read has been reimplemented in C++. The main consequence is that this function can now be used safely during startup, without having to wait for the underlying OS.File ChromeWorker to start. Also, decoding (see above) is performed off the main thread, which makes it much faster.

According to my benchmarks, using OS.File.read to read strings is about 2-5x faster than NetUtil.asyncFetch on large files and doesn’t block the main thread for more than 5ms, while asyncFetch performs string decoding on the main thread. Also, it doesn’t perform any main thread I/O by opposition to NetUtil.asyncFetch.

Backups

When using writeAtomic, it is now possible to request existing files to be backed up almost atomically. In many cases, this is a good strategy to ensure that data is safely written to disk, without having to use a flush, which would be expensive for the whole system.

yield OS.File.writeAtomic(path, data, { tmpPath: path + ".tmp", backupTo: path + ".backup} } );

Compression

writeAtomic and read both now support an implementation of lz4 compression

yield OS.File.writeAtomic(path, data, { compression: "lz4"});
yield OS.File.read(path, { compression: "lz4"});

Note that this format will not be understood by any command-line tool. It is somewhat proprietary. Also note that (de)compression is performed on the ChromeWorker thread for the time being, so it doesn’t benefit from the native reimplementation mentioned above.

Creating directories recursively

let dir = OS.Path.join(OS.Constants.Path.profileDir, "a", "b", "c", "d");
yield OS.File.makeDir(dir, { from: OS.Constants.Path.profileDir });

Season 1 Episode 2 – The Fight for File I/O

April 2, 2014 § Leave a comment

Plot Our heroes set out for the first battle. Session Restore’s file I/O was clearly inefficient. Not only was it performing redundant operations, but also it was blocking the main thread doing so. The time had come to take it back. Little did our heroes know that the forces of Regression were lurking and that real battle would be fought long after the I/O had been rewritten and made non-blocking.

For historical reasons, some of Session Restore’s File I/O was quite inefficient. Reading and backing up were performed purely on the main thread, which could cause multi-second pauses in extreme cases, and 100ms+ pauses in common cases. Writing was done mostly off the main thread, but the underlying library used caused accidental main thread I/O, with the same effect, and disk flushing. Disk flushing is extremely inefficient on most operating systems and can quickly bring the whole system to its knees, so needs to be avoided.

Fortunately, OS.File, the (then) new JavaScript library designed to provide off main thread I/O had just become available. Turning Session Restore’s I/O into OS.File-based off main thread I/O was surprisingly simple, and even contributed to make the relevant fragments of the code more readable.

In addition to performing main thread I/O and flushing, Session Restore’s I/O had several immediate weaknesses. One of the weaknesses was due to its crash detection mechanism, that required Session Restore to rewrite sessionstore.js immediately after startup, just to store a boolean indicating that we had not crashed. Recall that the largest sessionsstore.js known to this date weighs 150+Mb, and that 1Mb+ instances represented ~5% of our users. Rewriting all this data (and blocking startup while doing so) for a simple boolean flag was clearly unacceptable. We fixed this issue by separating the crash detection mechanism into its own module and ensuring that it only needed to write a few bytes. Another weakness was due to the backup code, which required a full (and inefficient) copy during startup, whereas a simple renaming would have been sufficient.

Having fixed all of this, we were happy. We were wrong.

Speed improvements?

Sadly, Telemetry archives do not reach back far enough to let me provide data confirming any speed improvement. Note for future perf developers including future self: backup your this data or blog immediately before The Cloud eats it.

As for measuring the effects of a flush, at the moment, we do not have a good way to do this, as the main impact is not on the process itself but on the whole system. The best we can do is measure the total number of flushes, but that doesn’t really help.

Full speed… backwards?

The first indication that something was wrong was a large increase in Telemetry measure SESSIONRESTORED, which measures the total amount of time between the launch of the browser and the moment Session Restore has completed initialization. After a short period of bafflement, we concluded that this increase was normal and was due to a change of initialization order – indeed, since OS.File I/O was executed off the main thread, the results of reading the sessionstore.js file could only be received once the main thread was idle and could receive messages from other threads. While this interpretation was partly correct, it masked a very real problem that we only detected much later. Additionally, during our refactorings, we changed the instant at which Session Restore initialization was executed, which muddled the waters even further.

The second indication arrived much later, when the Metrics team extracted Firefox Health Report data from released versions and got in touch with the Performance team to inform us of a large regression in firstPaint-to-sessionRestored time. For most of our users, Firefox was now taking more than 500ms more to load, which was very bad.

After some time spent understanding the data, attempting to reproduce the measure and bisecting to find out at which changeset the regression had taken place, as well as instrumenting code with additional performance probes, we finally concluded that the problem was due to our use I/O thread, the “SessionWorker”. More precisely, this thread was very slow to launch during startup. Digging deeper, we concluded that the problem was not in the code of the SessionWorker itself, but that the loading of the underlying thread was simply too slow. More precisely, loading was fine on a first run, but on second run, disk I/O contention between the resources required by the worker (the cache for the source code of SessionWorker and its dependencies) and the resources required by the rest of the browser (other source code, but also icons, translation files, etc) slowed down things considerably. Replacing the SessionWorker by a raw use of OS.File would not have improved the situation – ironically, just as the SessionWorker, our fast I/O library was loading slowly because of slow file I/O. Further measurements indicated that this slow loading could take up to 6 seconds in extreme cases, with an average of 340ms.

Once the problem had been identified, we could easily develop a stopgap fix to recover most of the regression. We kept OS.File-based writing, as it was not in cause, but we fell back to NetUtil-based loading, which did not require a JavaScript Worker. According to Firefox Health Report, this returned us to a level close to what we had prior to our changes, although we are still worse by 50-100ms. We are still attempting to find out what causes this regression and whether this regression was indeed caused by our work.

With this stopgap fix in place, we set out to provide a longer-term fix, in the form of a reimplementation of OS.File.read(), the critical function used during startup, that did not need to boot a JavaScript worker to proceed. This second implementation was written in C++ and had a number of additional side-improvements, such as the ability to decode strings off the main thread, and transmit them to the main thread at no cost.

The patch using the new version of OS.File.read() has landed a few days ago. We are still in the process of trying to make sense of Telemetry numbers. While Telemetry indicates that the total time to read and decode the file has considerably increased, the total time between the start of the read and the time we finish startup seems to have decreased nicely by .5 seconds (75th percentile) to 4 seconds (95th percentile). We suspect that we are confronted to yet another case in which concurrency makes performance measurement more difficult.

Shutdown duration?

We have not attempted to measure the duration of shutdown-time I/O at the moment.

Losing data or privacy

By definition, since we write data asynchronously, we never wait until the write is complete before proceeding. In most cases, this is not a problem. However, process shutdown may interrupt the write during its execution. While the APIs we use to write the data ensure that shutdown will never cause a file to be partially written, it may cause us to lose the final write, i.e. 15 seconds of browsing, working, etc. To make things slightly worse, the final write of Session Restore is special, insofar as it removes some information that is considered somewhat privacy-sensitive and that is required for crash recovery but not for a clean restart. The risk already existed before our refactoring, but was increased by our work, as the new I/O model was based on JavaScript workers, which are shutdown earlier than the mechanism previously used, and without ensuring that their work is complete.

While we received no reports of bugs caused by this risk, we solved the issue by plugging Session Restore’s shutdown into AsyncShutdown.

Changing the back-end

One of our initial intuitions when starting with this work was that the back-end format used to store session data (large JSON file) was inefficient and needed to be changed. Before doing so, however, we instrumented the relevant code carefully. As it turns out, we could indeed gain some performance by improving the back-end format, but this would be a relatively small win in comparison with everything else that we have done.

We have several possible designs for a new back-end, but we have decided not to proceed for the time being, as there are still larger gains to be obtained with simpler changes. More on this in future blog entries.

Epilogue

Before setting out on this quest, we were already aware that performance refactorings were often more complex than they appeared. Our various misadventures have confirmed it. I strongly believe that, by changing I/O, we have improved the performance of Session Restore in many ways. Unfortunately, I cannot prove that we have improved runtime (because old data has disappeared), and we are still not certain that we have not regressed start-up.

If there are lessons to be learned, it is that:

  • there is no performance work without performance measurements;
  • once your code is sophisticated enough, measuring and understanding the results is much harder than improving performance.

On the upside, all this work has succeeded at:

  • improving our performance measurements of many points of Session Restore;
  • finding out weaknesses of ChromeWorkers and fixing some of these;
  • finding out weaknesses of OS.File and fixing some of these;
  • fixing Session Restore’s backup code that consumed resources and didn’t really do much useful;
  • avoiding unnecessary performance refactorings where they would not have helped.

The work on improving Session Restore file I/O is still ongoing. For one thing, we are still waiting for confirmation that our latest round of optimizations does not cause unwanted regressions. Also, we are currently working on Talos benchmarks and Telemetry measurements to let us catch such regressions earlier.

This work has also spawned other works for other teams on improving the performance of ChromeWorkers’ startup and communication speed.

In the next episode

Drama. Explosions. Asynchronicity. Electrolysis. And more.

The Battle of Session Restore – Pilot

March 26, 2014 § 7 Comments

Plot Our heroes received their assignment. They had to go deep into the Perflines, in the long lost territory of Session Restore, and do whatever it took to get Session Restore back into Perfland. However, they quickly realized that they had been sent on a mission without return – and without a map. This is their tale.

Session Restore is a critical component of Firefox. This component records the current state of your browser to ensure that you can always resume browsing without losing the state of your browser, even if Firefox crashes, if your computer loses power, or if your browser is being upgraded. Unfortunately, we have had many reports of Session Restore slowing down Firefox. In February 2013, a two person Perf/Fx-team task force started working on the Performance of Session Restore. This task force eventually grew to four persons from Perf, Fx-team and e10s, along with half a dozen of punctual contributors.

To this day, the effort has lasted 13 months. In this series of blg entries, I intend to present our work, our results and, more importantly, the lessons we have learnt along the way, sometimes painfully.

Fixing yes, but fixing what?

We had reports of Session Restore blocking Firefox for several seconds every 15 seconds, which made Firefox essentially useless.

The job of Session Restore is to record everything possible of the state of the current browsing session. This means the list of windows, the list of tabs, the current address of each tab, but also the history of each tab, scroll position, anchors, DOM SessionStorage. session cookies, etc. Oh, and this goes recursively for both nested frames and history. All of this is saved to a JSON-formatted file called sessionstore.js, every 15 seconds of user activity. To this day, the largest reported sessionstore.js files is 150Mb, but Telemetry indicates that 95% of users used to have a file of less than 1Mb (numbers are lower these days, after we spent time eliminating unnecessary data from sessionstore.js).

We started the effort to fix Session Restore from only a few bug reports:

 

  • sometimes, users lost sessionstore.js data;
  • sometimes, data collection took ages.

Unfortunately, we had no data on:

 

  • the size of the file;
  • the actual duration of data collection;
  • how long it took to write data to the disk.

To complicate things further, Session Restore had been left without owner for several years. Irregular patching to support new features of the web and new configurations had progressively turned the code and data structures into a mess that nobody fully understood.

We had, however, a few hints:

  • Session Restore needs to collect lots of data;
  • Session Restore had been designed a long time ago, for users with few tabs, and when web pages stored very little information;
  • serializing and writing to JSON is inefficient;
  • in bad cases, saving could take several seconds;
  • the collection of data was purely monolithic;
  • reading and writing data was done entirely on the main thread, which was a very bad thing to do;
  • the client API caused full recollections at each request;
  • the data structure used by Session Restore had progressively become an undocumented mess.

While there were a number of obvious sources of inefficiency that we could fix without further data, and that we set out to fix immediately. In a few cases, however, we found out the hard way that optimizing without hard data is a time-consuming and useless exercise. Consequently, a considerable part of our work has been to use Telemetry to determine where we could best apply our optimization effort, and to confirm that this effort yielded results. In many cases, this meant adding coarse-grained probes, then progressively completing them with finer-grained probes, in parallel with actually writing optimizations.

To be continued…

In the next episode, our heroes will fight Main Thread File I/O… and the consequences of removing it.

Shutting down things asynchronously

February 14, 2014 § Leave a comment

This blog entry is part of the Making Firefox Feel As Fast As Its Benchmarks series. The fourth entry of the series was growing much too long for a single blog post, so I have decided to cut it into bite-size entries.

A long time ago, Firefox was completely synchronous. One operation started, then finished, and then we proceeded to the next operation. However, this model didn’t scale up to today’s needs in terms of performance and performance perception, so we set out to rewrite the code and make it asynchronous wherever it matters. These days, many things in Firefox are asynchronous. Many services get started concurrently during startup or afterwards. Most disk writes are entrusted to an IO thread that performs and finishes them in the background, without having to stop the rest of Firefox.

Needless to say, this raises all sorts of interesting issues. For instance: « how do I make sure that Firefox will not quit before it has finished writing my files? » In this blog entry, I will discuss this issue and, more generally, the AsyncShutdown mechanism, designed to implement shutdown dependencies for asynchronous services.

« Read the rest of this entry »

Is my data on the disk? Safety properties of OS.File.writeAtomic

February 5, 2014 § 1 Comment

If you have been writing front-end or add-on code recently, chances are that you have been using library OS.File and, in particular, OS.File.writeAtomic to write files. (Note: If you have been writing files without using OS.File.writeAtomic, chances are that you are doing something wrong that will cause Firefox to jank – please don’t.) As the name implies, OS.File.writeAtomic will make efforts to write your data atomically, so as to ensure its survivability in case of crash, power loss, etc.

However, you should not trust this function blindly, because it has its limitations. Let us take a look at exactly what the guarantees provided by writeAtomic.

Algorithm: just write

Snippet OS.File.writeAtomic(path, data)

What it does

  1. reduce the size of the file at path to 0;
  2. send data to the operating system kernel for writing;
  3. close the file.

Worst case scenarios

  1. if the process crashes between 1. and 2. (a few microseconds), the full content of path may be lost;
  2. if the operating system crashes or the computer loses power suddenly before the kernel flushes its buffers (which may happen at any point up to 30 seconds after 1.), the full content of path may be lost;
  3. if the operating system crashes or the computer loses power suddenly while the operating system kernel is flushing (which may happen at any point after 1., typically up to 30 seconds), and if your data is larger than one sector (typically 32kb), data may be written incompletely, resulting in a corrupted file at path.

Performance very good.

Algorithm: write and rename

Snippet OS.File.writeAtomic(path, data, { tmpPath: path + ".tmp" })

What it does

  1. create a new file at tmpPath;
  2. send data to the operating system kernel for writing to tmpPath;
  3. close the file;
  4. rename tmpPath on top of path.

Worst case scenarios

  1. if the process crashes at any moment, nothing is lost, but a file tmpPath may be left on the disk;
  2. if the operating system crashes or the computer loses power suddenly while the operating system kernel is flushing metadata (which may happen at any point after 1., typically up to 30 seconds), the full content of path may be lost;
  3. if the operating system crashes or the computer loses power suddenly while the operating system kernel is flushing (which may happen at any point after 1., typically up to 30 seconds), and if your data is larger than one sector (typically 32kb), data may be written incompletely, resulting in a corrupted file at path.

Performance almost as good as Just Write.

Side-note On the ext4fs file system, the kernel automatically adds a flush, which transparently transforms the safety properties of this operation into those of the algorithm detailed next.

Native equivalent In XPCOM/C++, the mostly-equivalent solution is the atomic-file-output-stream.

Algorithm: write, flush and rename

Use OS.File.writeAtomic(path, data, { tmpPath: path + ".tmp", flush: true })

What it does

  1. create a new file at tmpPath;
  2. send data to the operating system kernel for writing to tmpPath;
  3. close the file;
  4. flush the writing of data to tmpPath;
  5. rename tmpPath on top of path.

Worst case scenarios

  1. if the process crashes at any moment, nothing is lost, but a file tmpPath may be left on the disk;
  2. if the operating system crashes, nothing is lost, but a file tmpPath may be left on the disk;
  3. if the computer loses power suddenly while the hard drive is flushing its internal hardware buffers (which is very hard to predict), nothing is lost, but an incomplete file tmpPath may be left on the disk;.

Performance some operating systems (Windows) or file systems (ext3fs) cannot flush a single file and rather need to flush all the files on the device, which considerably slows down the full operating system. On some others (ext4fs) this operation is essentially free. On some versions of MacOS X, flushing actually doesn’t do anything.

Native equivalent In XPCOM/C++, the mostly-equivalent solution is the safe-file-output-stream.

Algorithm: write, backup, rename

(not landed yet)

Snippet OS.File.writeAtomic(path, data, { tmpPath: path + ".tmp", backupTo: path + ".backup"})

What it does

  1. create a new file at tmpPath;
  2. send data to the operating system kernel for writing to tmpPath;
  3. close the file;
  4. rename the file at path to backupTo;
  5. rename the file at tmpPath on top of path;

Worst case scenarios

  1. if the process crashes between 4. and 5, file path may be lost and backupTo should be used instead for recovery;
  2. if the operating system crashes or the computer loses power suddenly while the operating system kernel is flushing metadata (which may happen at any point after 1., typically up to 30 seconds), the file at path may be empty and backupTo should be used instead for recovery;
  3. if the operating system crashes or the computer loses power suddenly while the operating system kernel is flushing (which may happen at any point after 1., typically up to 30 seconds), and if your data is larger than one sector (typically 32kb), data may be written incompletely, resulting in a corrupted file at path and backupTo should be used instead for recovery;

Performance almost as good as Write and Rename.

Making Firefox Feel as Fast as its Benchmarks – part 3 – Going multi-threaded

October 29, 2013 § 11 Comments

As we saw in the previous posts, our browser behaves as follows

function browser() {
  while (true) {
    handleEvents();  // Let's make this faster
    updateDisplay();
  }
}

The key to making the browser smooth is to make handleEvents() do less. We have already discussed the ongoing work to make Firefox multi-process, their goals and their limitations. Another, mostly orthogonal, path, is to go multi-threaded.

Going multi-threaded

Going multi-threaded is all about splitting the event loop in several loops, executed concurrently, on several cores whenever applicable and possible:

function browser() {
  main() ||| worker() ||| worker() // Running concurrently
}

task main() { // Main thread (time-critical)
  while (true) {
    handleEvents(); // Some of your code here
    updateDisplay();
  }
}

task worker() {
  while (true) {
    handleEvents(); // Some of your code here
  }
}

task worker() {
  while (true) {
    updateDisplay();
  }
}

The main thread remains time-critical and needs to loop 60 times per second, while other threads handle some of the workload of both handleEvents() and updateDisplay(). Now, this treatment is only useful if we can isolate operations that slow down the main loop measurably. As it turns out, there are many such operations lying around, including:

  • Network I/O;
  • Disk I/O;
  • Database I/O;
  • GPU I/O;
  • Treating large amounts of data.

It is easy to see why Network I/O could considerably slow down the main loop, if it were handled on the main thread – after all, some requests take several seconds to receive a reply, or never do, and if the main thread had to wait for the completion of these requests before it proceeded, this would cause multi-second gaps between two frames, which is simply not acceptable.

The cost of disk I/O, however, is often underestimated. Few developers realize that _any_ disk operation can take an unbounded amount of time – even closing a file or checking whether a file exists can, in some cases, take several seconds. This may seem counter-intuitive, as these operations do very little besides book-keeping, but one must not forget that they rely upon the device itself and that said device can unpredictably become very slow, typically because it is otherwise busy, or asleep – or even because that device is actually a network device. Database I/O is a special case of Disk I/O that we generally single out because its cost is often much higher than users suspect – recall that, in addition to saving, a database management system will typically need to maintain a journal and to flush the drive regularly, to protect data against both software or hardware failures, including sudden power loss. Consequently, unless the database has been heavily customized to lift the safety requirements in favor of performance, you should expect that every operation on your database will cause heavy disk I/O.

Finally, treating large amounts of data, or applying any other form of heavy algorithm, will of course take time.

None of these operations should take place on the main thread. Moving them off the main thread will largely contribute to getting rid of the jank caused by these operations.

Coding for multi-threading

In the Firefox web browser, threads are materialized as instances of nsIThread in C++ code and as instances of ChromeWorker in JavaScript code. For this discussion, I will concentrate on JavaScript code as refactoring C++ code is, well, complicated. Side-note: if you are new here, recall that Chrome Workers have nothing to with the Chrome Web Browser and everything to do with the Mozilla Chrome, i.e. the parts of Gecko and Firefox written in JavaScript.

Chrome Workers are an extension of Web Workers, and have the same semantics, plus a few additions. Instantiating a ChromeWorker requires a source file:

let worker = new ChromeWorker("resource://path/to/my_file.js");

We may send messages to and from a Chrome Worker

// In the parent
worker.postMessage(someValue);

// In the worker
self.postMessage(someValue);

and, of course, receive messages

// In the parent
worker.addEventListener("message", function(msg) {
// A copy of the message appears in msg.data
});

// In the worker
self.addEventListener("message", function(msg) {
// A copy of the message appears in msg.data
});

In either case, the contents of the message gets copied between threads, with essentially the same semantics as JSON.stringify/JSON.parse. If necessary, binary data in messages (ArrayBuffer or the upcoming Typed Objects) can be transferred instead of being copied, which is faster.

As Web Workers, Chrome Workers are very good to perform computations. In addition, they have a number of low-level libraries to access system features. Such libraries can be loaded with the chrome worker module loader:

let MyModule = require("resource://...");

Further modules can be defined for consumption with the chrome worker module loader:

module.exports = {
  foo: // ...
};

Finally, they can call into C code using the js-ctypes foreign function interface:

let lib = ctypes.open("path/to/my_lib");
let fun = lib.declare("myFunction", ctypes.void); // void myFunction()
fun(); // Call into C

Combining the module loader and js-ctypes makes for a powerful combination that has been used to provide access to low-level libraries, including low-level file manipulation (module OS.File), phone communication (module RIL, shorthand for Radio Interface Layer), file (de)compression, etc.

Limitations

Where multi-process is good at protecting a process against other processes, going multi-threaded works nicely to protect a process (a tab, the ui, etc.) against itself. Threads take up much less resources than processes and are also much faster to start and stop. However, they have very strict limitations.

The main limitation is that they do not have access to all the main thread APIs. Each API needs to be ported individually to chrome workers. Until recently, there was no manner to define or load modules. At the moment, there is no way to read or write a compressed file from a Chrome Worker, or to access a database from a Chrome Worker. In most cases, this is only a question of time and manpower, and we can hope to eventually bring almost all important APIs to Chrome Workers. However, some APIs cannot be ported at all, in particular any API that requires a DOM window, which is most (fortunately not all) DOM APIs.

Also, the paradigm behind Chrome Workers is purely asynchronous. This means that there is no way for a Chrome Worker to wait synchronously until some treatment has been completed by the main thread. This complicates code in a few cases but, in general, this is rarely a problem.

Also, the communication mechanism needs to be taken into account:  as copying long messages can block the main thread. In some cases, it may be necessary to perform aggressive optimization of messages to avoid such situations.

Refactoring for multi-threading

The first thing to take into consideration while refactoring for multi-process is whether this is the best strategy. Since most APIs and most customization possibilities live on the main thread, most features need to be produced and/or consumed by the main thread. This does not mean that going multi-threaded is not possible, only that your code will probably end up looking like an asynchronous API meant to be used mostly on the main thread but implemented off the main thread. This also means that your consumers must be architectured to accept an asynchronous API. We will cover making things asynchronous in another entry of this series.

Once we have decided to go multi-threaded, the next part is to determine what goes of the main thread. Generally, you want to move as much as you can off the main thread. The only limits are things that you simply cannot move off the main thread (e.g. access to the document), or if you realize that the data you need to copy (not transfer) across threads will slow down the main thread inacceptably. This, of course, is something that can be determined only by benchmarking.

Next, you will need to define a communication protocol between the main thread and the worker. Threads communicate by sending pure data (i.e. objects without methods, without DOM nodes, etc.) and binary data can be transfered for high-performance. Recall that communications are asynchronous, so if you want a thread to respond to another one, you will need to build into your protocol identification to match a reply to a request. This is not built-in, but quite easy to do. Handling errors requires a little finesse, as uncaught exceptions on the worker are transmitted to a onerror listener instead of the usual onmessage listener, and lose some information along the way.

In some (hopefully rare) cases, you will need to add new bindings to native code, so as to call C functions (only C, not C++) from JavaScript. For this purpose, take a look at the documentation of js-ctypes, our JavaScript FFI, and osfile_shared_allthreads.jsm, a set of lightweight extensions to js-ctypes that handle a number of platform-specific gotchas. As finding the correct libraries to link is sometimes tricky, you should take advantage of OS.Constants.Path, that already lists some of them. Don’t hesitate to file bugs if you realize that something important is missing. Also, in a few (hopefully almost non-existent) cases, you will need to expose additional C code to native code, typically to expose some C++-only features. For this purpose, take a look at an example.

Unsurprisingly, the next step is to write the JS code. The usual caveats apply, just don’t forget to use the module system. Worker code goes into its own file, typically with extension “.js”. It is generally a good idea to mention “worker” in the name of the file, e.g. “foo_worker.js”, and to deploy your code to "resource://.../worker/..." or "chrome://.../worker/..." to avoid ambiguities. To construct the worker, it is then sufficient to call new ChromeWorker("resource://path/to/your/file.js"). The worker code will be started lazily when the first message is sent.

For automated testing, you can for instance use mochitest-chrome or (once bug 930924 has landed) xpcshell-tests. In the latter, if you need to add new worker code for the sake of testing, you should install it with the chrome:// protocol. Also, for any testing, don’t forget to look at your system console, as worker errors are displayed on that console by default.

That’s it! In a future blog entry, I will write more about common patterns for writing or refactoring asynchronous code, which comes in very handy for code that uses your new API.

Contributing

Refactoring Firefox as a set of asynchronous APIs backed by off main thread implementations is a considerable task. To make it happen, the best way is to contribute to coding, testing or documentation

Copying streams asynchronously

October 18, 2013 § Leave a comment

In the Mozilla Platform, I/O is largely about streams. Copying streams is a rather common activity, e.g. for the purpose of downloading files, decompressing archives, saving decoded images, etc. As usual, doing any I/O on the main thread is a very bad idea, so the recommended manner of copying streams is to use one of the asynchronous string copy APIs provided by the platform: NS_AsyncCopy (in C++) and NetUtil.asyncCopy (in JavaScript). I have recently audited both to ascertain whether they accidentally cause main thread I/O and here are the results of my investigations.

In C++

What NS_AsyncCopy does

NS_AsyncCopy is a well-designed (if a little complex) API. It copies the full contents of an input stream into an output stream, then closes both. NS_AsyncCopy can be called with both synchronous and asynchronous streams. By default, all operations take place off the main thread, which is exactly what is needed.

In particular, even when used with the dreaded Safe File Output Stream, NS_AsyncCopy will perform every piece of I/O out of the main thread.

The default setting of reading data by chunks of 4kb might not be appropriate to all data, as it may cause too much I/O, in particular if you are reading a small file. There is no obvious way for clients to detect the right setting without causing file I/O, so it might be a good idea to eventually extend NS_AsyncCopy to autodetect the “right” chunk size for simple cases.

Bottom line: NS_AsyncCopy is not perfect but it is quite good and it does not cause main thread I/O.

Limitations

NS_AsyncCopy will, of course, not remove main thread I/O that takes place externally. If you open a stream from the main thread, this can cause main thread I/O. In particular, file streams should really be opened with flag DEFER_OPEN flag. Other streams, such as nsIJARInputStream do not support any form of deferred opening (bug 928329), and will cause main thread I/O when they are opened.

While NS_AsyncCopy does only off main thread I/O, using a Safe File Output Stream will cause a Flush. The Flush operation is very expensive for the whole system, even when executed off the main thread. For this reason, Safe File Output Stream is generally not the right choice of output stream (bug 928321).

Finally, if you only want to copy a file, prefer OS.File.copy (if you can call JS). This function is simpler, entirely off main thread, and supports OS-specific accelerations.

In JavaScript

What NetUtil.asyncCopy does

NetUtil.asyncCopy is a utility method that lets JS clients call NS_AsyncCopy. Theoretically, it should have the same behavior. However, some oddities make its performance lower.

As NS_AsyncCopy requires one of its streams to be buffered, NetUtil.asyncCopy calls nsIIOUtil::inputStreamIsBuffered and nsIIOUtil::outputStreamIsBuffered. These methods detect whether a stream is buffered by attempting to perform buffered I/O. Whenever they succeed, this causes main thread I/O (bug 928340).

Limitations

Generally speaking, NetUtil.asyncCopy has the same limitations as NS_AsyncCopy. In particular, in any case in which you can replace NetUtil.asyncCopy with OS.File.copy, you should pick the latter, which is both simpler and faster.

Also, NetUtil.asyncCopy cannot read directly from a Zip file (bug 927366).

Finally, NetUtil.asyncCopy does not fit the “modern” way of writing asynchronous code on the Mozilla Platform (bug 922298).

Helping out

We need to fix a few bugs to improve the performance of asynchronous copy. If you wish to help, please do not hesitate to pick any of the bugs listed above and get in touch with me.

Where Am I?

You are currently browsing the Performance category at Il y a du thé renversé au bord de la table.

Follow

Get every new post delivered to your Inbox.

Join 30 other followers