Shutting down Asynchronously, part 2

May 26, 2014

During the shutdown of Firefox, subsystems are closed one after another. AsyncShutdown is a module dedicated to expressing shutdown-time dependencies between:

  • services and their clients;
  • shutdown phases (e.g. profile-before-change) and their clients.

Barriers: Expressing shutdown dependencies towards a service

Consider a service FooService. At some point during the shutdown of the process, this service needs to:

  • inform its clients that it is about to shut down;
  • wait until the clients have completed their final operations based on FooService (often asynchronously);
  • only then shut itself down.

This may be expressed as an instance of AsyncShutdown.Barrier. An instance of AsyncShutdown.Barrier provides:

  • a `client` capability, which may be published to clients to let them register or unregister blockers;
  • methods that let the owner of the barrier consult the state of blockers and wait until all client-registered blockers have been resolved.

Shutdown timeouts

By design, an instance of AsyncShutdown.Barrier will cause a crash if its clients take more than 60 awake seconds to lift or remove their blockers ("awake" meaning that seconds during which the computer is asleep or too busy to do anything are not counted). This mechanism helps ensure that we do not leave the process in a state in which it can neither proceed with shutdown nor be relaunched.

If the CrashReporter is enabled, this crash will report:

  • the name of the barrier that failed;
  • for each blocker that has not been released yet:
      • the name of the blocker;
      • the state of the blocker, if a state function has been provided (see AsyncShutdown.Barrier.state).
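
For instance, a client might register a blocker along the following lines (a minimal sketch; `promiseDone` and `pendingWrites` are hypothetical placeholders, and `FooService.shutdown` is the `client` capability defined in Example 2 below):

    // Sketch: registering a blocker with a state function. If shutdown hangs
    // on this blocker for more than 60 awake seconds, the resulting crash
    // report contains both the name and the result of the state function.
    FooService.shutdown.addBlocker(
      "FooClient: Flushing pending writes",             // name, crash-reported
      () => promiseDone,                                // condition to wait for
      () => ({ pendingWrites: pendingWrites.length })   // state, crash-reported
    );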

Example 1: Simple Barrier client

The following snippet presents an example of a client of FooService that has a shutdown dependency upon FooService. In this case, the client wishes to ensure that FooService does not shut down before some state has been reached. A typical example is a client that writes data asynchronously and needs to ensure that its state has been fully written to disk before shutdown, even if, due to some user manipulation, shutdown takes place immediately.

    // Some client of FooService called FooClient

    Components.utils.import("resource://gre/modules/FooService.jsm", this);

    // FooService.shutdown is the `client` capability of a `Barrier`.
    // See Example 2 for the definition of `FooService.shutdown`.
    FooService.shutdown.addBlocker(
      "FooClient: Need to make sure that we have reached some state",
      () => promiseReachedSomeState
    );
    // promiseReachedSomeState should be an instance of Promise, resolved once
    // we have reached the expected state.
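
For illustration, here is one way FooClient might produce promiseReachedSomeState (a hypothetical sketch; `writeStateToDisk` is a placeholder, and `Promise.defer()` is used as in Example 4 below):

    // Hypothetical sketch: resolve promiseReachedSomeState once an
    // asynchronous write has completed.
    Components.utils.import("resource://gre/modules/Promise.jsm", this);

    let deferred = Promise.defer();
    let promiseReachedSomeState = deferred.promise;

    writeStateToDisk().then(
      () => deferred.resolve(),
      error => {
        // Do not block shutdown forever if the write fails: report the error
        // and release the blocker anyway.
        Components.utils.reportError(error);
        deferred.resolve();
      }
    );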

Example 2: Simple Barrier owner

The following snippet presents an example of a service FooService that wishes to ensure that all clients have had a chance to complete any outstanding operations before FooService shuts down.

    // Module FooService

    Components.utils.import("resource://gre/modules/AsyncShutdown.jsm", this);
    Components.utils.import("resource://gre/modules/Task.jsm", this);

    this.EXPORTED_SYMBOLS = ["FooService"];

    // The FooService object, as seen by clients.
    this.FooService = {};

    let shutdown = new AsyncShutdown.Barrier("FooService: Waiting for clients before shutting down");

    // Export the `client` capability, to let clients register shutdown blockers
    FooService.shutdown = shutdown.client;

    // This Task should be triggered at some point during shutdown, generally
    // as a client to another Barrier or Phase. Triggering this Task is not covered
    // in this snippet.
    let onshutdown = Task.async(function*() {
      // Wait for all registered clients to have lifted the barrier
      yield shutdown.wait();

      // Now deactivate FooService itself.
      // ...
    });

Frequently, a service that owns an AsyncShutdown.Barrier is itself a client of another Barrier, as sketched below.
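
For instance, here is a hedged sketch of how the onshutdown Task of Example 2 might be triggered, by registering FooService as a client of the profileBeforeChange phase (phases are described later in this post):

    // Sketch: FooService is itself a client of a phase barrier; the phase
    // will not complete until onshutdown() has resolved.
    AsyncShutdown.profileBeforeChange.addBlocker(
      "FooService: Shutting down once all clients are done",
      () => onshutdown()
    );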

 

Example 3: More sophisticated Barrier client

The following snippet presents FooClient2, a more sophisticated client of FooService that needs to perform a number of operations during shutdown but before the shutdown of FooService. Also, given that this client is more sophisticated, we provide a function returning the state of FooClient2 during shutdown. If for some reason FooClient2’s blocker is never lifted, this state can be reported as part of a crash report.

    // Some client of FooService called FooClient2

    Components.utils.import("resource://gre/modules/FooService.jsm", this);
    Components.utils.import("resource://gre/modules/Task.jsm", this); // Needed for Task.async below.

    FooService.shutdown.addBlocker(
      "FooClient2: Collecting data, writing it to disk and shutting down",
      () => Blocker.wait(),
      () => Blocker.state
    );

    let Blocker = {
      // This field contains information on the status of the blocker.
      // It can be any JSON serializable object.
      state: "Not started",

      wait: Task.async(function*() {
        // This method is called once FooService starts informing its clients that
        // FooService wishes to shut down.

        // Update the state as we go. If the Barrier is used in conjunction with
        // a Phase, this state will be reported as part of a crash report if FooClient2 fails
        // to shutdown properly.
        this.state = "Starting";

        let data = yield collectSomeData();
        this.state = "Data collection complete";

        try {
          yield writeSomeDataToDisk(data);
          this.state = "Data successfully written to disk";
        } catch (ex) {
          this.state = "Writing data to disk failed, proceeding with shutdown: " + ex;
        }

        yield FooService.oneLastCall();
        this.state = "Ready";
      })
      // Note: Task.async forwards `this`, so `this` is `Blocker` above
      // when the method is invoked as `Blocker.wait()`.
    };

Example 4: A service with both internal and external dependencies

    // Module FooService2

    Components.utils.import("resource://gre/modules/AsyncShutdown.jsm", this);
    Components.utils.import("resource://gre/modules/Task.jsm", this);
    Components.utils.import("resource://gre/modules/Promise.jsm", this);

    this.EXPORTED_SYMBOLS = ["FooService2"];

    // The FooService2 object, as seen by clients.
    this.FooService2 = {};

    let shutdown = new AsyncShutdown.Barrier("FooService2: Waiting for clients before shutting down");

    // Export the `client` capability, to let clients register shutdown blockers
    FooService2.shutdown = shutdown.client;

    // A second barrier, used to avoid shutting down while any connections are open.
    let connections = new AsyncShutdown.Barrier("FooService2: Waiting for all FooConnections to be closed before shutting down");

    let isClosed = false;

    FooService2.openFooConnection = function(name) {
      if (isClosed) {
        throw new Error("FooService2 is closed");
      }

      let deferred = Promise.defer();
      connections.client.addBlocker("FooService2: Waiting for connection " + name + " to close",  deferred.promise);

      // ...


      return {
        // ...
        // Some FooConnection object. Presumably, it will have additional methods.
        // ...
        close: function() {
          // ...
          // Perform any operation necessary for closing
          // ...

          // Don't hoard blockers.
          connections.client.removeBlocker(deferred.promise);

          // The barrier MUST be lifted, even if removeBlocker has been called.
          deferred.resolve();
        }
      };
    };


    // This Task should be triggered at some point during shutdown, generally
    // as a client to another Barrier. Triggering this Task is not covered
    // in this snippet.
    let onshutdown = Task.async(function*() {
      // Wait for all registered clients to have lifted the barrier.
      // These clients may open instances of FooConnection if they need to.
      yield shutdown.wait();

      // Now stop accepting any other connection request.
      isClosed = true;

      // Wait for all instances of FooConnection to be closed.
      yield connections.wait();

      // Now finish shutting down FooService2
      // ...
    });
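
A hypothetical client of FooService2 could then open and close connections as follows; closing the connection is what lifts the corresponding blocker:

    // Hypothetical usage of FooService2.
    let connection = FooService2.openFooConnection("cache");
    // ... use the connection ...
    connection.close(); // lifts the blocker registered by openFooConnection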

Phases: Expressing dependencies towards phases of shutdown

The shutdown of a process takes place by phases, such as:

  • profileBeforeChange (once this phase is complete, there is no guarantee that the process has access to a profile directory);
  • webWorkersShutdown (once this phase is complete, JavaScript does not have access to workers anymore);
  • …

Like services, phases have clients. For instance, all users of web workers MUST have finished using their web workers before the end of phase webWorkersShutdown.

Module AsyncShutdown provides pre-defined barriers for a set of well-known phases. Each of the barriers provided blocks the corresponding shutdown phase until all clients have lifted their blockers.

List of phases

AsyncShutdown.profileChangeTeardown

The client capability for clients wishing to block asynchronously during observer notification “profile-change-teardown”.

AsyncShutdown.profileBeforeChange

The client capability for clients wishing to block asynchronously during observer notification “profile-before-change”. Once the barrier is resolved, clients other than Telemetry MUST NOT access files in the profile directory and clients MUST NOT use Telemetry anymore.

AsyncShutdown.sendTelemetry

The client capability for clients wishing to block asynchronously during observer notification “profile-before-change2”. Once the barrier is resolved, Telemetry must stop its operations.

AsyncShutdown.webWorkersShutdown

The client capability for clients wishing to block asynchronously during observer notification “web-workers-shutdown”. Once the phase is complete, clients MUST NOT use web workers.
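
For instance, a module that owns a ChromeWorker might block this phase while requests are still in flight (a sketch; `promisePendingRequestsDone` is a hypothetical placeholder):

    // Sketch: prevent web workers from being torn down while our worker
    // still has outstanding requests.
    AsyncShutdown.webWorkersShutdown.addBlocker(
      "MyModule: Waiting for outstanding worker requests to complete",
      () => promisePendingRequestsDone
    );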

Shutting down things asynchronously

February 14, 2014

This blog entry is part of the Making Firefox Feel As Fast As Its Benchmarks series. The fourth entry of the series was growing much too long for a single blog post, so I have decided to cut it into bite-size entries.

A long time ago, Firefox was completely synchronous. One operation started, then finished, and then we proceeded to the next operation. However, this model didn’t scale up to today’s needs in terms of performance and performance perception, so we set out to rewrite the code and make it asynchronous wherever it matters. These days, many things in Firefox are asynchronous. Many services get started concurrently during startup or afterwards. Most disk writes are entrusted to an I/O thread that performs and finishes them in the background, without having to stop the rest of Firefox.

Needless to say, this raises all sorts of interesting issues. For instance: « how do I make sure that Firefox will not quit before it has finished writing my files? » In this blog entry, I will discuss this issue and, more generally, the AsyncShutdown mechanism, designed to implement shutdown dependencies for asynchronous services.


What David Did During Q3

September 30, 2014

September is ending, and with it Q3 of 2014. It’s time for a brief report, so here is what happened during the summer.

Session Restore

After ~18 months working on Session Restore, I am progressively switching away from that topic. Most of the main performance issues that we set out to solve have already been solved; we have considerably improved safety, cleaned up much of the code, and added plenty of measurements.

During this quarter, I have been working on various attempts to optimize both loading speed and saving speed. Unfortunately, both ongoing works were delayed by external factors and postponed to a yet undetermined date. I have also been hard at work on trying to pin down performance regressions (which turned out to be external to Session Restore) and safety bugs (which were eventually found and fixed by Tim Taubert).

In the next quarter, I plan to work on Session Restore only in a support role, for the purpose of reviewing and mentoring.

Also, a rant: the work on Session Restore has relied heavily on collaboration between the Perf team and the FxTeam. Unfortunately, the resources were not always available to make this collaboration work. I imagine that the FxTeam is spread too thin across too many tasks, with too many fires to fight. Regardless, the symptom I experienced is that during the course of this work, low-priority, high-priority and safety-critical patches alike were left to rot without reviews, despite my repeated requests, for 6, 8 or 10 weeks, much to the dismay of everyone involved. This means man·months of work thrown to /dev/null, along with quarterly objectives, morale, opportunities, contributors and good ideas.

I will try and blog about this, eventually. But please, in the future, everyone: remember that in the long run, the priority of getting reviews done (or explaining that you’re not going to) is quite a bit higher than the priority of writing code.

Async Tooling

Many improvements to Async Tooling landed during Q3. We now have the PromiseWorker, which considerably simplifies interactions between the main thread and workers, for both Firefox and add-on developers. I hear that the first add-on to make use of this new feature is currently being developed. New features, bugfixes and optimizations landed for OS.File. We have also landed the ability to watch for changes in a directory (under Windows only, for the time being).

Sadly, my work on interactions between Promise and the Test Suite is currently blocked until the DevTools team manages to get all the uncaught asynchronous errors under control. It’s hard work, and I can understand that it is not a high priority for them, so in Q4, I will try to find a way to land my work and activate it only for a subset of the mochitest suites.

Places

I have recently joined the newly restarted effort to improve the performance of Places, the subsystem that handles our bookmarks, history, etc. For the moment, I am still getting warmed up, but I expect that most of my work during Q4 will be related to Places.

Shutdown

Most of my effort during Q3 was spent improving the shutdown of Firefox. Where we already had support for shutting down JavaScript services/clients asynchronously, we now also have support for native services and clients. Also, I am in the process of landing Telemetry that will let us find out the duration of the various stages of shutdown, information that we could not access until now.

As it turns out, we had many crashes during asynchronous shutdown, a few of them safety-critical. At the time, we did not have the necessary tools to prioritize our efforts or to find out whether our patches had effectively fixed bugs, so I built a dashboard to extract and display the relevant information on such crashes. This proved a wise investment, as we spent plenty of time fighting AsyncShutdown-related fires using this dashboard.

In addition to the “clean shutdown” mechanism provided by AsyncShutdown, we also now have the Shutdown Terminator. This is a watchdog subsystem, launched during shutdown, and it ensures that, no matter what, Firefox always eventually shuts down. I am waiting for data from our Crash Scene Investigators to tell us how often we need this watchdog in practice.

Community

I lost track of how many code contributors I interacted with during the quarter, but that represents hundreds of e-mails, as well as countless hours on IRC and Bugzilla, and a few hours on ask.mozilla.org. This year’s mozEdu teaching is also looking good.

We also launched FirefoxOS in France, with great success. I found myself in a supermarket, presenting the ZTE Open C and the activities of Mozilla to the crowds, and this was a pleasing experience.

For Q4, expect more mozEdu, more mentoring, and more sleepless hours helping contributors debug their patches 🙂

Season 1 Episode 2 – The Fight for File I/O

April 2, 2014

Plot: Our heroes set out for the first battle. Session Restore’s file I/O was clearly inefficient. Not only was it performing redundant operations, it was also blocking the main thread while doing so. The time had come to take it back. Little did our heroes know that the forces of Regression were lurking and that the real battle would be fought long after the I/O had been rewritten and made non-blocking.

For historical reasons, some of Session Restore’s file I/O was quite inefficient. Reading and backing up were performed purely on the main thread, which could cause multi-second pauses in extreme cases, and 100ms+ pauses in common cases. Writing was done mostly off the main thread, but the underlying library we used caused accidental main thread I/O, with the same effect, as well as disk flushing. Disk flushing is extremely inefficient on most operating systems and can quickly bring the whole system to its knees, so it needs to be avoided.

Fortunately, OS.File, the (then) new JavaScript library designed to provide off main thread I/O had just become available. Turning Session Restore’s I/O into OS.File-based off main thread I/O was surprisingly simple, and even contributed to make the relevant fragments of the code more readable.

In addition to performing main thread I/O and flushing, Session Restore’s I/O had several immediate weaknesses. One of the weaknesses was due to its crash detection mechanism, which required Session Restore to rewrite sessionstore.js immediately after startup, just to store a boolean indicating that we had not crashed. Recall that the largest sessionstore.js known to this date weighs 150+MB, and that 1MB+ instances represented ~5% of our users. Rewriting all this data (and blocking startup while doing so) for a simple boolean flag was clearly unacceptable. We fixed this issue by separating the crash detection mechanism into its own module and ensuring that it only needed to write a few bytes. Another weakness was due to the backup code, which required a full (and inefficient) copy during startup, whereas a simple renaming would have been sufficient.
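
In outline, the crash-detection fix amounts to something like the following (a hedged sketch with illustrative names, not the actual module):

    // Sketch: record a tiny checkpoint instead of rewriting all of
    // sessionstore.js. OS.File performs the write off the main thread,
    // and writeAtomic never leaves a partially-written file behind.
    Components.utils.import("resource://gre/modules/osfile.jsm", this);

    let path = OS.Path.join(OS.Constants.Path.profileDir, "sessionCheckpoints.json");
    OS.File.writeAtomic(path, JSON.stringify({ "profile-after-change": true }),
                        { tmpPath: path + ".tmp" });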

Having fixed all of this, we were happy. We were wrong.

Speed improvements?

Sadly, Telemetry archives do not reach back far enough to let me provide data confirming any speed improvement. Note for future perf developers, including future self: back up this data or blog about it immediately, before The Cloud eats it.

As for measuring the effects of a flush, at the moment, we do not have a good way to do this, as the main impact is not on the process itself but on the whole system. The best we can do is measure the total number of flushes, but that doesn’t really help.

Full speed… backwards?

The first indication that something was wrong was a large increase in the Telemetry measure SESSIONRESTORED, which measures the total amount of time between the launch of the browser and the moment Session Restore has completed initialization. After a short period of bafflement, we concluded that this increase was normal and was due to a change of initialization order – indeed, since OS.File I/O was executed off the main thread, the results of reading the sessionstore.js file could only be received once the main thread was idle and could receive messages from other threads. While this interpretation was partly correct, it masked a very real problem that we only detected much later. Additionally, during our refactorings, we changed the moment at which Session Restore initialization was executed, which muddied the waters even further.

The second indication arrived much later, when the Metrics team extracted Firefox Health Report data from released versions and got in touch with the Performance team to inform us of a large regression in firstPaint-to-sessionRestored time. For most of our users, Firefox was now taking more than 500ms more to load, which was very bad.

After some time spent understanding the data, attempting to reproduce the measure and bisecting to find out at which changeset the regression had taken place, as well as instrumenting code with additional performance probes, we finally concluded that the problem was due to our use of an I/O thread, the “SessionWorker”. More precisely, this thread was very slow to launch during startup. Digging deeper, we concluded that the problem was not in the code of the SessionWorker itself, but that the loading of the underlying thread was simply too slow. It turned out that loading was fine on a first run; on a second run, however, disk I/O contention between the resources required by the worker (the cache for the source code of SessionWorker and its dependencies) and the resources required by the rest of the browser (other source code, but also icons, translation files, etc.) slowed things down considerably. Replacing the SessionWorker with a raw use of OS.File would not have improved the situation – ironically, just like the SessionWorker, our fast I/O library was loading slowly because of slow file I/O. Further measurements indicated that this slow loading could take up to 6 seconds in extreme cases, with an average of 340ms.

Once the problem had been identified, we could easily develop a stopgap fix to recover most of the regression. We kept OS.File-based writing, as it was not at fault, but we fell back to NetUtil-based loading, which did not require a JavaScript worker. According to Firefox Health Report, this returned us to a level close to what we had prior to our changes, although we are still worse off by 50-100ms. We are still attempting to find out what causes this remaining regression and whether it was indeed caused by our work.
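
For reference, NetUtil-based loading looks roughly like this (a sketch; `file` is a hypothetical nsIFile, and the actual Session Restore code differs):

    // Sketch: reading a file asynchronously without booting a JavaScript worker.
    Components.utils.import("resource://gre/modules/NetUtil.jsm", this);

    NetUtil.asyncFetch(file, function(inputStream, status) {
      if (!Components.isSuccessCode(status)) {
        return; // Handle the error.
      }
      let data = NetUtil.readInputStreamToString(inputStream,
                                                 inputStream.available());
      // ... parse `data` ...
    });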

With this stopgap fix in place, we set out to provide a longer-term fix, in the form of a reimplementation of OS.File.read(), the critical function used during startup, that did not need to boot a JavaScript worker to proceed. This second implementation was written in C++ and had a number of additional side improvements, such as the ability to decode strings off the main thread and transmit them to the main thread at no cost.

The patch using the new version of OS.File.read() landed a few days ago. We are still in the process of trying to make sense of Telemetry numbers. While Telemetry indicates that the total time to read and decode the file has considerably increased, the total time between the start of the read and the time we finish startup seems to have decreased nicely, by 0.5 seconds (75th percentile) to 4 seconds (95th percentile). We suspect that we are confronted with yet another case in which concurrency makes performance measurement more difficult.
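
From the caller’s perspective, usage remains simple; with an encoding option, decoding also happens off the main thread (a sketch, assuming a hypothetical `path` variable):

    // Sketch: read and decode off the main thread; the promise resolves
    // to a string rather than raw bytes.
    OS.File.read(path, { encoding: "utf-8" }).then(contents => {
      // ... parse `contents` ...
    });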

Shutdown duration?

We have not attempted to measure the duration of shutdown-time I/O at the moment.

Losing data or privacy

By definition, since we write data asynchronously, we never wait until the write is complete before proceeding. In most cases, this is not a problem. However, process shutdown may interrupt the write during its execution. While the APIs we use to write the data ensure that shutdown will never cause a file to be partially written, it may cause us to lose the final write, i.e. 15 seconds of browsing, working, etc. To make things slightly worse, the final write of Session Restore is special, insofar as it removes some information that is considered somewhat privacy-sensitive and that is required for crash recovery but not for a clean restart. The risk already existed before our refactoring, but our work increased it, as the new I/O model was based on JavaScript workers, which are shut down earlier than the mechanism previously used, and without ensuring that their work is complete.

While we received no reports of bugs caused by this risk, we solved the issue by plugging Session Restore’s shutdown into AsyncShutdown.
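
In outline, this amounts to something like the following (a hedged sketch; `promiseFinalWriteComplete` is a hypothetical placeholder and the actual Session Restore code differs):

    // Sketch: block profile-before-change until the final write is complete.
    AsyncShutdown.profileBeforeChange.addBlocker(
      "SessionFile: Finishing write of sessionstore.js",
      () => promiseFinalWriteComplete
    );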

Changing the back-end

One of our initial intuitions when starting with this work was that the back-end format used to store session data (large JSON file) was inefficient and needed to be changed. Before doing so, however, we instrumented the relevant code carefully. As it turns out, we could indeed gain some performance by improving the back-end format, but this would be a relatively small win in comparison with everything else that we have done.

We have several possible designs for a new back-end, but we have decided not to proceed for the time being, as there are still larger gains to be obtained with simpler changes. More on this in future blog entries.

Epilogue

Before setting out on this quest, we were already aware that performance refactorings were often more complex than they appeared. Our various misadventures have confirmed it. I strongly believe that, by changing I/O, we have improved the performance of Session Restore in many ways. Unfortunately, I cannot prove that we have improved runtime (because old data has disappeared), and we are still not certain that we have not regressed start-up.

If there are lessons to be learned, it is that:

  • there is no performance work without performance measurements;
  • once your code is sophisticated enough, measuring and understanding the results is much harder than improving performance.

On the upside, all this work has succeeded at:

  • improving our performance measurements of many points of Session Restore;
  • finding out weaknesses of ChromeWorkers and fixing some of these;
  • finding out weaknesses of OS.File and fixing some of these;
  • fixing Session Restore’s backup code, which consumed resources and didn’t really do anything useful;
  • avoiding unnecessary performance refactorings where they would not have helped.

The work on improving Session Restore file I/O is still ongoing. For one thing, we are still waiting for confirmation that our latest round of optimizations does not cause unwanted regressions. Also, we are currently working on Talos benchmarks and Telemetry measurements to let us catch such regressions earlier.

This work has also spawned work for other teams on improving the performance of ChromeWorkers’ startup and communication speed.

In the next episode

Drama. Explosions. Asynchronicity. Electrolysis. And more.

Tales of Science-Fiction Bugs: The Thing That Killed Talos

November 12, 2012

Have you ever encountered one of these bugs? One in which every single line of your code is correct, in which every type-check passes, every single unit test succeeds, the specifications are fulfilled but somehow, for no reason that can be explained rationally, it just does not work? I call them Science-Fiction Bugs. I am sure that you have met some of them. For some reason, the Mozilla Performance Team seems to stumble upon such bugs rather often, perhaps because we spend so much time refactoring other teams’ code long after the original authors have moved on to other features, and combining their code with undertested libraries and technologies. Truly, this is life on the Frontier.

Today, I would like to tell you the tale of one of these Science-Fiction Bugs: The Thing That Killed Talos.

