The Gecko monoculture

March 7, 2016 § 8 Comments

I remember a time, not so very long ago, when Gecko powered 4 or 5 non-Mozilla browsers, some of them on exotic platforms, as well as GPS devices, wysiwyg editors, geographic platforms, email clients, image editors, eBook readers, documentation browsers, the UX of virus scanners, etc, as well as a host of innovative and exotic add-ons. In these days, Gecko was considered, among other things, one of the best cross-platform development toolkits available.

The year is now 2016 and, if you look around, you’ll be hard-pressed to find Gecko used outside of Firefoxen (alright, and Thunderbird and Bluegriffon). Did Google or Apple or Microsoft do that? Not at all. I don’t know how many in the Mozilla community remember this, but this was part of a Mozilla strategy. In this post, I’d like to discuss this strategy, its rationale, and the lessons that we may draw from it.

« Read the rest of this entry »

Designing the Firefox Performance Monitor (2): Monitoring Add-ons and Webpages

November 6, 2015 § Leave a comment

In part 1, we discussed the design of time measurement within the Firefox Performance Monitor. Despite the intuition, the Performance Monitor had neither the same set of objectives as the Gecko Profiler, nor the same set of constraints, and we ended up picking a design that was not a sampling profiler. In particular, instead of capturing performance data on stacks, the Monitor captures performance data on Groups, a notion that we have not discussed yet. In this part, we will focus on bridging the gap between our low-level instrumentation and actual add-ons and webpages, as may be seen by the user.

« Read the rest of this entry »

Living in a Go Faster, post-XUL world

July 13, 2015 § 31 Comments

A long time ago, XUL was an extraordinary component of Firefox. It meant that front-end and add-on developers could deliver user interfaces in a single, mostly-declarative, language, and see them adapt automatically to the look and feel of each OS. Ten years later, XUL has become a burden: most of its features have been ported to HTML5, often with slightly different semantics – which makes Gecko needlessly complex – and nobody understands XUL – which makes contributions harder than they should be. So, we have reached a stage at which we basically agree that, in a not-too-distant future, Firefox should not rely upon XUL anymore.

But wait, it’s not the only thing that needs to change. We also want to support piecewise updates for Firefox. We want Firefox to start fast. We want the UI to remain responsive. We want to keep supporting add-ons. Oh, and we want contributors, too. And we don’t want to lose internationalization.

Mmmh… and perhaps we don’t want to restart Firefox from bare Gecko.

All of the above are worthy objectives, but getting them all will require some careful thought.

So I’d like to put together a list of all our requirements, against which we could evaluate potential solutions, re-architectures, etc. for the front-end:

High-level

  • Get rid of the deprecated (XUL) bits of Gecko in a finite time.
  • Don’t break Firefox [1].

User-oriented goals

  • Firefox should start fast.
  • The UI should not suffer from jank.
  • The UI should not cause jank.
  • Look and feel like a native app, even with add-ons.
  • Keep supporting internationalization.
  • Keep supporting lightweight themes.
  • Keep supporting acccessibility.

Contributor/dev-oriented goals

  • Use technologies that the world understands.
  • Use technologies that are useful to add-on authors.
  • Support piece-wise, restart-less front-end updates.
  • Provide an add-ons API that won’t break.
  • Code most of the front-end with the add-ons API.

[1] I have heard this claim contested. Some apparently suggest that we should actually break Firefox and base all our XUL-less, Go Faster initiatives on a clean slate from e.g. Browser.html or Servo. If you wish to defend this, please step forward 🙂

Does this sound like a correct list for all of you?

Detecting slow add-ons

May 6, 2015 § 13 Comments

When it is at its best, Firefox is fast. Really, really fast. When things start slowing down, though, using Firefox is much less fun. So, one of main objectives of the developers of Firefox is making sure that Firefox is and remains as smooth and responsive as humanly possible. There is, however, one thing that can slow down Firefox, and that remains out of the control of the developers: add-ons. Good add-ons are extraordinary, but small coding errors – or sometimes necessary hacks – can quickly drive the performance of Firefox into the ground.

So, how can an add-on developer (or add-on reviewer) find out whether her add-on is fast? Sadly, not much. Testing certainly helps, and the Profiler is invaluable to help pinpoint a slowdown once it has been noticed, but what about the performance of add-ons in everyday use? What about the experience of users?

To solve this issue, we decided to work on a set of tools to help add-on developers and reviewers find out the performance of their add-ons. Oh, and also to let users find out quickly if an add-on is slowing down their everyday experience.

about:performance

On recent Nightly builds of Firefox, you may now open about:performance to get an overview of the performance cost of add-ons and webpages :

Screen Shot 2015-05-06 at 17.46.15

The main resources we monitor are :

  • jank, which measures how much the add-on impacts the responsiveness of Firefox. For 60fps performance, jank should always remain ≤ 4. If an add-on regularly causes jank to increase past 6, you should be worried.
  • CPOW aka blocking cross-process communications, which measures how much the add-on is causing Firefox to freeze waiting for a process to respond. Anything above 0 is bad.

Note that the design of this page is far from stable. I realise it’s not very user-friendly at the moment, so don’t hesitate to file bugs to help us improve it. Also note that, when running with e10s, the page doesn’t display all the useful information. We are working on it.

add-on Telemetry

Add-on developers and reviewers can now find information on the performance of their add-ons on a dedicated dashboard.

These are real-world performance data, as extracted from user’s computers. The two histograms available for the time being are:

  • MISBEHAVING_ADDONS_JANK_LEVEL, which measures the jank, as detailed above;
  • MISBEHAVING_ADDONS_CPOW_TIME_MS, which measure the amount of time spent in CPOW, as detailed above.

If you are an add-on developer, you should monitor regularly the performance of your add-on on this page. If you notice suspicious values, you should try and find out what causes these performance issues. Don’t hesitate and reach out to us, we will try and help you.

Slow add-on Notification

Add-on developers and reviewers, as well as end-users, are now informed when an add-on causes either jank or CPOW performance issues:

Screen Shot 2015-05-06 at 19.16.19

Note that this feature is not ready to ride the trains, and we do not have a specific idea of when it will be made available for users of Aurora/DeveloperEdition. This is partly because the UX is not good enough yet, partly because the thresholds will certainly change, and partly because we want to give add-on developers time to fix any issue before the users see a dialog that suggest that an add-on should be uninstalled.

Performance Stats API

By the way, we have an API for accessing performance stats. Very imaginatively, it’s called PerformanceStats.jsm [link]. While this API will still change during the coming weeks you can start playing with it if you are interested. Some add-ons may be able to throttle their performance use based on this data. Also, I hope that, in time, someone will be able to write a version of about:performance much nicer than mine 🙂

Challenges and work ahead

For the moment, we are in the process of stabilizing the API, its implementation and its performance. In parallel, we are working on making the UX of about:performance more useful. Once both are done, we are going to proceed with adding more measurements, making the code more e10s-friendly and measuring the performance of webpages.

If you are an add-on developer and if you feel that your add-on is tagged as slow by error, or if you have great ideas on how to make this data useful, feel free to ping me, preferably on IRC. You can find me on irc.mozilla.org, channel #developers, where I am Yoric.

Vous souhaitez apprendre à développer des Logiciels Libres ?

November 29, 2014 § Leave a comment

Cette année, la Communauté Mozilla propose à Paris un cycle de Cours/TDs autour du Développement de Logiciels Libres.

Au programme :

  • comment se joindre à un projet existant ;
  • comment communiquer dans une équipe distribuée ;
  • comment financer un projet de logiciel libre ;
  • qualité du code ;
  • du code !
  • (et beaucoup plus).

Pour plus de détails, et pour vous inscrire, tout est ici.

Attention, les cours commencent le 8 décembre !

RFC: We deserve better than runtime warnings

November 20, 2014 § 2 Comments

Consider the following scenario:

  1. Module A prints warnings when it’s used incorrectly;
  2. Module B uses module A correctly;
  3. Some future refactoring of module B starts using module A incorrectly, hence displaying the warnings;
  4. Nobody realises for months, because we have too many warnings;
  5. Eventually, something breaks.

How often has this happened to everyone of us?

This scenario has many variants (e.g. module A changed and nobody realized that module B is now in a situation it misuses module A), but they all boil down to the same thing: runtime warnings are designed to be lost, not fixed. To make things worse, many of our warnings are not actionable, simply because we have no way of knowing where they come from – I’m looking at you, Cu.reportError.

So how do we fix this?

We would certainly save considerable amounts of time if warnings caused immediate assertion failures, or alternatively test failures (i.e. fail, but only when running the unit tests). Unfortunately, we can do neither, as we have a number of tests that trigger the warnings either

  • by design (e.g. to check that we can recover from such misuses of A, or because we still need a now-considered-incorrect use of an API to keep working until we have ported all the clients to the better API);
  • or for historical reasons (e.g. the now incorrect use of A used to be correct, but we haven’t fixed all tests that depend on it yet).

However, I believe that causing test failures is still the solution. We just need a mechanism that supports a form of whitelisting to cope with the aforementioned cases.

Introducing RuntimeAssert

RuntimeAssert is an experiment at provoding a standard mechanism to replace warnings. I have a prototype implemented as part of bug 1080457. Feedback would be appreciated.

The key features are the following:

  • when a test suite is running, a call to `RuntimeAssert` causes the test suite to fail;
  • when a test suite is running, a call to `RuntimeAssert` contains at least the filename/line number of where it was triggered, preferably a stack wherever available;
  • individual tests can whitelist families of calls to `RuntimeAssert` and mark them either as expected;
  • individual tests can whitelist families of calls to `RuntimeAssert` and mark them as pending fix;
  • when a test suite is not running, a call to `RuntimeAssert` does nothing costly (it may default to PR_LOG or Cu.reportError).

Possible API:

  • in JS, we trigger a test failure by calling RuntimeAssert.fail(keyword, string or Error) from production code;
  • in C++, we likewise trigger a test failure by calling MOZ_RUNTIME_ASSERT(keyword, string);
  • in the testsuite, we may whitelist errors by calling Assert.whitelist.expected(keyword, regexp)  or Assert.whitelist.FIXME(keyword, regexp).

Examples:

//
// Module
//
let MyModule = {
  oldAPI: function(foo) {
    RuntimeAssert.fail(“Deprecation”, “Please use MyModule.newAPI instead of MyModule.oldAPI”);
    // ...
  },
  newAPI: function(foo) {
    // ...
  },
};

let MyModule2 = {
  api: function() {
    return somePromise().then(null, error => {
      RuntimeAssert.fail(“MyModule2.api”, error);
      // Rather than leaving this error uncaught, let’s make it actionable.
    });
  },

  api2: function(date) {
    if (typeof date == “number”) {
      RuntimeAssert.fail(“MyModule2.api2”, “Passing a number has been deprecated, please pass a Date”);
      date = new Date(date);
    }
    // ...
   }
}


//
// Whitelisting a RuntimeAssert in a test.
//

// This entire test is about MyModule.oldAPI, warnings are normal.
Assert.whitelist.expected(“Deprecation”, /Please use MyModule.newAPI/);

// We haven’t fixed all calls to MyModule2.api2, so they should still warn, but not cause an orange.
Assert.whitelist.FIXME(“MyModule2.api2”, /please pass a Date/);

Assert.whitelist.expected(“MyModule2.api”, /TypeError/, function() {
  // In this test, we will trigger a TypeError in MyModule2.api, that’s entirely expected.
  // Ignore such errors within the (async) scope of this function.
});

Applications

In the long-term, I believe that RuntimeAssert (or some other mechanism) should replace almost all our calls to Cu.reportError.

In the short-term, I plan to use this for reporting

  • uncaught Promise rejections, which currently require a bit too much hacking for my tastes;
  • errors in XPCOM.lazyModuleGetter & co;
  • failures during AsyncShutdown;
  • deprecation warnings as part of Deprecated.jsm.

The Battle of Session Restore – Season 1 Episode 3 – All With Measure

July 17, 2014 § 4 Comments

Plot For the second time, our heroes prepared for battle. The startup of Firefox was too slow and Session Restore was one of the battle fields.

When Firefox starts, Session Restore is in charge of restoring the browser to its previous state, in case of a crash, a restart, or for the users who have configured Firefox to resume from its previous state. This entails numerous activities during startup:

  1. read sessionstore.js from disk, decode it and parse it (recall that the file is potentially several Mb large), handling errors;
  2. backup sessionstore.js in case of startup crash.
  3. create windows, tabs, frames;
  4. populate history, scroll position, forms, session cookies, session storage, etc.

It is common wisdom that Session Restore must have a large impact on Firefox startup. But before we could minimize this impact, we needed to measure it.

Benchmarking is not easy

When we first set foot on Session Restore territory, the contribution of that module to startup duration was uncharted. This was unsurprising, as this aspect of the Firefox performance effort was still quite young. To this day, we have not finished chartering startup or even Session Restore’s startup.

So how do we measure the impact of Session Restore on startup?

A first tool we use is Timeline Events, which let us determine how long it takes to reach a specific point of startup. Session Restore has had events `sessionRestoreInitialized` and `sessionRestored` for years. Unfortunately, these events did not tell us much about Session Restore itself.

The first serious attempt at measuring the impact of Session Restore on startup Performance was actually not due to the Performance team but rather to the metrics team. Indeed, data obtained through Firefox Health Report participants indicated that something wrong had happened.

Oops, something is going wrong

Indicator `d2` in the graph measures the duration between `firstPaint` (which is the instant at which we start displaying content in our windows) and `sessionRestored` (which is the instant at which we are satisfied that Session Restore has opened its first tab). While this measure is imperfect, the dip was worrying – indeed, it represented startups that lasted several seconds longer than usual.

Upon further investigation, we concluded that the performance regression was indeed due to Session Restore. While we had not planned to start optimizing the startup component of Session Restore, this battle was forced upon us. We had to recover from that regression and we had to start monitoring startup much better.

A second tool is Telemetry Histograms for measuring duration of individual operations, such as reading sessionstore.js or parsing it. We progressively added measures for most of the operations of Session Restore. While these measures are quite helpful, they are also unfortunately very unstable in real-world conditions, as they are affected both by scheduling (the operations are asynchronous), by the work load of the machine, by the actual contents of sessionstore.js, etc.

The following graph displays the average duration of reading and decoding sessionstore.js among Telemetry participants: Telemetry 4

Difference in colors represent successive versions of Firefox. As we can see, this graph is quite noisy, certainly due to the factors mentioned above (the spikes don’t correspond to any meaningful change in Firefox or Session Restore). Also, we can see a considerable increase in the duration of the read operation. This was quite surprising for us, given that this increase corresponds to the introduction of a much faster, off the main thread, reading and decoding primitive. At the time, we were stymied by this change, which did not correspond to our experience. We have now concluded that by changing the asynchronous operation used to read the file, we have simply changed the scheduling, which makes the operation appear longer, while in practice it simply does not block the rest of the startup from taking place on another thread.

One major tool was missing for our arsenal: a stable benchmark, always executed on the same machine, with the same contents of sessionstore.js, and that would let us determine more exactly (almost daily, actually) the impact of our patches upon Session Restore:Session Restore Talos

This test, based on our Talos benchmark suite, has proved both to be very stable, and to react quickly to patches that affected its performance. It measures the duration between the instant at which we start initializing Session Restore (a new event `sessionRestoreInit`) and the instant at which we start displaying the results (event `sessionRestored`).

With these measures at hand, we are now in a much better position to detect performance regressions (or improvements) to Session Restore startup, and to start actually working on optimizing it – we are now preparing to using this suite to experiment with “what if” situations to determine which levers would be most useful for such an optimization work.

Evolution of startup duration

Our first benchmark measures the time elapsed between start and stop of Session Restore if the user has requested all windows to be reopened automatically

restoreAs we can see, the performance on Linux 32 bits, Windows XP and Mac OS 10.6 is rather decreasing, while the performance on Linux 64 bits, Windows 7 and 8 and MacOS 10.8 is improving. Since the algorithm used by Session Restore upon startup is exactly the same for all platforms, and since “modern” platforms are speeding up while “old” platforms are slowing down, this suggests that the performance changes are not due to changes inside Session Restore. The origin of these changes is unclear. I suspect the influence of newer versions of the compilers or some of the external libraries we use, or perhaps new and improved (for some platforms) gfx.

Still, seeing the modern platforms speed up is good news. As of Firefox 31, any change we make that causes a slowdown of Session Restore will cause an immediate alert so that we can react immediately.

Our second benchmark measures the time elapsed if the user does not wish windows to be reopened automatically. We still need to read and parse sessionstore.js to find whether it is valid, so as to decide whether we can show the “Restore” button on about:home.

norestoreWe see peaks in Firefox 27 and Firefox 28, as well as a slight decrease of performance on Windows XP and Linux. Again, in the future, we will be able to react better to such regressions.

The influence of factors upon startup

With the help of our benchmarks, we were able to run “what if” scenarios to find out which of the data manipulated by Session Restore contributed to startup duration. We did this in a setting in which we restore windows:size-restore

and in a setting in which we do not:

size-norestore

Interestingly, increasing the size of sessionstore.js has apparently no influence on startup duration. Therefore, we do not need to optimize reading and parsing sessionstore.js. Similarly, optimizing history, cookies or form data would not gain us anything.

The single largest most expensive piece of data is the set of open windows – interestingly, this is the case even when we do not restore windows. More precisely, any optimization should target, by order of priority:

  1. the cost of opening/restoring windows;
  2. the cost of opening/restoring tabs;
  3. the cost of dealing with windows data, even when we do not restore them.

What’s next?

Now that we have information on which parts of Session Restore startup need to be optimized, the next step is to actually optimize them. Stay tuned!

Q2 2014 Report

July 1, 2014 § Leave a comment

Q2 2014 was a difficult quarter at Mozilla, with all the agitation around Brendan Eich, Australis, Media Extensions, etc. Still, I have the feeling that we managed to get a lot done despite the intense pressure. Here is a quick highlight of my main accomplishments for Q2 2014.

Session Restore

A considerable amount of my time was spent working on Session Restore. The main objective is to decrease the jank caused by Session Restore taking snapshots of the session and to decrease the time Session Restore takes to restore the state of Firefox. Much of the activity this quarter dealt with measuring performance, so as to best optimize it and improving safety.

Reworking Session Restore backups

With Firefox 33, the backups of Session Restore state have been completely redesigned. The new system should prove orders of magnitude safer, in addition to now being fully transparent.

Next steps We are still lacking measurements to confirm that this is as successful as the mathematics suggest. If you are interested, there is a mentored bug open.

Talos tests and Telemetry on Session Restore startup

Optimizing startup is difficult, and generally impossible if you do not know what to optimize. With Firefox 32 and 33, we have new benchmarks and real world measurements to help us determine immediately the influence of patches on Session Restore startup.

Next steps Using these benchmarks to experiment with possible optimizations. This is in progress.

Cleaning up Session Restore file

One of our objectives is to decrease the size of the Session Restore file, to reduce the amount of I/O (hence battery use and hardware wear and tear) and memory usage. As a first step, we have introduced a mechanism that progressively removes from the “Undo Close” feature tabs and windows that have been closed at least 2 weeks ago. Interestingly, Telemetry indicates that this clean-up has no effect on the size of the Session Restore file. Experiments run later during the quarter, using the Talos tests, also strongly suggest that the data that we could clean up and that we do not clean up yet have essentially no influence on startup duration.

Next steps I believe that this strategy will therefore not be pursued during the next quarters.

Preserving compatibility with Tor Browser

While refactoring Session Restore, we have hit a number of obstacles in the form of add-ons using private or semi-private APIs that we wished to remove. We have managed to work along with add-on authors and, as far as I know, we have not broken any add-on yet. In particular, we have maintained compatibility with the Tor Browser, which is a heavily customized distribution of Firefox targeted towards privacy.

Next steps Providing a clean API for add-ons. This will require discussing with add-on authors to find out what they need.

Async tooling

I am in charge of the Async Project, which is all about giving front-end and add-on developers tools to develop asynchronous code that does not jank. As usual, this involved plenty of activity in a number of different directions.

Auto-closing Sqlite.jsm databases (mentoring Michael Brennan)

Sqlite.jsm databases can now be closed automatically during garbage-collection. On user’s computers, this will increase safety, as failing to close a database causes shutdown-time assertion failures. However, to use resources effectively, pragmatism dictates that databaes should be closed manually, so failing to close a database in the Mozilla codebase will still cause test failures.

Reworking OS.File shutdown

On devices with little memory (typically Firefox Phones), one of the techniques used to save memory is to shutdown the OS.File worker as early as possible, re-launching it later if necessary. As it turns out, the task is more complicated than it seems, due to possibilities of race conditions. Unfortunately, this means that in some extreme cases, Firefox OS applications could lock down and fail to shutdown properly without being killed by the OS. This is now fixed. Somewhere along the way, this helped us to make the PromiseWorker used by OS.File more resilient to low-level errors.

Next steps Making the PromiseWorker usable by other modules than OS.File, including testing and add-ons.

OS.File for Android and Firefox OS

OS.File was initially designed for desktop devices. Now that it is used in a number of places on mobile devices, I have mercilessly hunted down all compatibility issues between OS.File and our two mobile platforms. Compatibility tests are now activated on all platforms and should avoid any regression.

AsyncShutdown Barrier mechanism

The shutdown process of Firefox has always been a dark and scary place, full of unspecified dependencies. As a result, any refactoring or addition a new dependency could break many things in new and interesting ways. I have introduced the AsyncShutdown Barrier mechanism that lets us specify clear, explicit and extensible dependencies, handles ordering of shutdowns, as well as error reporting if a dependency is unmet. This Barrier is now used by Sqlite.jsm, OS.File, Firefox Health Report, Session Restore, Page Thumbnails and fixes a number of major issues.

Next steps Porting AsyncShutdown Barrier to allow native components to register with it.

Fixing Firefox 30 shutdown freezes (with Tim Taubert)

Many users of Firefox 30 encountered issues that caused Firefox to freeze during shutdown. We found out that the issue was caused was triggered by Page Thumbnails and caused by a bug in ChromeWorkers, which did not handle an error case gracefully. I applied AsyncShutdown Barrier to ensure that Page Thumbnails always completed without triggering the error case, while Tim Taubert ensured that the Chrome Workers handled the error robustly.

Making Firefox Health Report shutdown more robust

While porting Firefox Health Report to AsyncShutdown, we encountered an elusive bug that manifested itself by causing rare shutdown crashes. After months of experimenting, instrumenting and attempting to fix the issue, we eventually traced it back to a more serious bug in shutdown, which apparently does not always send the proper notifications. Using the AsyncShutdown Barrier, we managed to work around the issue and make FHR’s shutdown both more robust and better instrumented in case of crash. This later helped us locate another issue that prevents a proper shutdown when some databases have been corrupted.

Next steps Fix the upstream shutdown bug, make our shutdown more resilient in case of database corruption.

Async testing

The other aspect of writing asynchronous code is making sure that developers can debug it. Now that we have hit a critical mass of developers writing async code, it was high time to help them work with it.

Rewriting Task stack traces to be meaningful

Now that we know how to handle uncaught errors, the main remaining weaknesses of Promise-based and Task-based code is that their stack traces lose much information. Since Firefox 33, Task-based stack traces are now transparently rewritten into something developer-redable. Somewhere along the way, I have also patched xpcshell and mochitests to ensure that they take advantage of this rewriting. Experience shows that this is very useful and that the runtime cost is negligible.

Next steps Evaluate the runtime cost of doing the same thing for Promise-based code.

Making xpcshell tests fail in case of uncaught promise error

Uncaught promise errors were treated by the test suites as warnings, TBPL did not report them, and they remained consequantly more often than not ignored (or even unseen) by the developers. I have reworked the xpcshell test harness to consider all uncaught promise errors as oranges and fixed all offenders.

Next steps Doing the same for mochitests. Code is ready, but a few offenders remain.

Community

Dealing with political feedback around the nomination and departure of Brendan Eich

Along with many others, I made my best to engage people who voiced their negative feedback either at the nomination or at the departure of Brendan Eich. Unfortunately, this took time and efforts, but I believe that staying in touch with our users is part of what makes the difference between Mozilla and other browser vendors.

Working with new contributors

I estimate that I have worked with ~30 potential new contributors during the quarter. Many have unfortunately decided to postpone or abandon their efforts towards contributing, but a few have stayed, to work either with me or with other teams. At the moment, I am following 5 promising contributors. In particular, I am quite happy to welcome Dexter (who is working on a very sophisticated patch to let code watch for file modifications) and Kushagra (who has landed several test suite bugs).

Next steps More of it!

Working with universities

A group of École Centrale de Lyon successfully completed an online tool to help grassroot projects find volunteers. It was nice mentoring them.

Zedge

I was invited to deliver a presentation on performance at Zedge, in Trondheim, Norway. That was fun 🙂

Next steps Publish the slides.

And now?

Let’s get started with Q3!

Firefox, the Browser that has your Back[up]

June 26, 2014 § 54 Comments

One of the most important features of Firefox, in my opinion, is Session Restore. This component is responsible for ensuring that, even in case of crash, or if you upgrade your browser or an add-on that requires restart, your browser can reopen immediately and in the state in which you left it. As far as I am concerned, this feature is a life-safer.

Unfortunately, there are a few situations in which the Session Restore file may be corrupted – typically, if the computer is rebooted before the write is complete, or if it loses power, or if the operating system crashes or the disk is disconnected, we may end up losing our precious Session Restore. While any of these circumstances happens quite seldom, it needs to be applied as part of the following formula:

seldom · .5 billion users = a lot

I am excited to announce that we have just landed a new and improved Session Restore component in Firefox 33 that protects your precious data better than ever.

How it works

Firefox needs Session Restore to handle the following situations:

  • restarting Firefox without data loss after a crash of either Firefox, the Operating System, a driver or the hardware, or after Firefox has been killed by the Operating System during shutdown;
  • restarting Firefox without data loss after Firefox has been restarted due to an add-on or an upgrade;
  • quitting Firefox and, later, restarting without data loss.

In order to handle all of this, Firefox needs to take a snapshot of the state of the browser whenever anything happens, whether the user browses, fills a form, scrolls, or an application sets a Session Cookie, Session Storage, etc. (this is actually capped to one save every 15 seconds, to avoid overloading the computer). In addition, Firefox performs a clean save during shutdown.

While at the level of the application, the write mechanism itself is simple and robust, a number of things beyond the control of the developer can prevent either the Operating System or the hard drive itself from completing this write consistently – a typical example being tripping on the power plug of a desktop computer during the write.

The new mechanism involves two parts:

  • keeping smart backups to maximize the chances that at least one copy will be readable;
  • making use of the available backups to transparently avoid or minimize data loss.

The implementation actually takes very few lines of code, the key being to know the risks against which we defend.

Keeping backups

During runtime, Firefox remembers which files are known to be valid backups and which files should be discarded. Whenever a user interaction or a script requires it, Firefox writes the contents of Session Restore to a file called sessionstore-backups/recovery.js. If it is known to be good, the previous version of sessionstore-backups/recovery.js is first moved to sessionstore-backups/recovery.bak. In most cases, both files are valid and recovery.js contains a state less than 15 seconds old, while recovery.bak contains a state less than 30 seconds old. Additionally, the writes on both files are separated by at least 15 seconds. In most circumstances, this is sufficient to ensure that, even of hard drive crash during a write to recover.js, at least recovery.bak has been entirely written to disk.

During shutdown, Firefox writes a clean startup file to sessionstore.js. In most cases, this file is valid and contains the exact state of Firefox at the time of shutdown (minus some privacy filters). During startup, if sessionstore.js is valid, Firefox moves it to sessiontore-backup/previous.js. Whenever this file exists, it is valid and contains the exact state of Firefox at the time of the latest clean shutdown/startup. Note that, in case of crash, the latest clean shutdown/startup might be older than the latest actual startup, but this backup is useful nevertheless.

Finally, on the first startup after an update, Firefox copies sessionstore.js, if it is available and valid, to sessionstore-backups/upgrade.js-[build id]. This mechanism is designed primarily for testers of Firefox Nightly, who keep on the very edge, upgrading Firefox every day to check for bugs. Testers, if we introduce a bug that affects Session Restore, this can save your life.

As a side-note, we never use the operating system’s flush call, as 1/ it does not provide the guarantees that most developers expect; 2/ on most operating systems, it causes catastrophic slowdowns.

Recovering

All in all, Session Restore may contain the following files:

  • sessionstore.js (contains the state of Firefox during the latest shutdown – this file is absent in case of crash);
  • sessionstore-backups/recovery.js (contains the state of Firefox ≤ 15 seconds before the latest shutdown or crash – the file is absent in case of clean shutdown, if privacy settings instruct us to wipe it during shutdown, and after the write to sessionstore.js has returned);
  • sessionstore-backups/recovery.bak (contains the state of Firefox ≤ 30 seconds before the latest shutdown or crash – the file is absent in case of clean shutdown, if privacy settings instruct us to wipe it during shutdown, and after the removal of sessionstore-backups/recovery.js has returned);
  • sessionstore-backups/previous.js (contains the state of Firefox during the previous successful shutdown);
  • sessionstore-backups/upgrade.js-[build id] (contains the state of Firefox after your latest upgrade).

All these files use the JSON format. While this format has drawbacks, it has two huge advantages in this setting:

  • it is quite human-readable, which makes it easy to recover manually in case of an extreme crash;
  • its syntax is quite rigid, which makes it easy to find out whether it was written incompletely.

As our main threat is a crash that prevents us from writing the file entirely, we take advantage of the latter quality to determine whether a file is valid. Based on this, we test each file in the order indicated above, until we find one that is valid. We then proceed to restore it.

If Firefox was shutdown cleanly:

  1. In most cases, sessionstore.js is valid;
  2. In most cases in which sessionstore.js is invalid, sessionstore-backups/recovery.js is still present and valid (the likelihood of it being present is obviously higher if privacy settings do not instruct Firefox to remove it during shutdown);
  3. In most cases in which sessionstore-backups/recovery.js is invalid, sessionstore-backups/recovery.bak is still present, with an even higher likelihood of being valid (the likelihood of it being present is obviously higher if privacy settings do not instruct Firefox to remove it during shutdown);
  4. In most cases in which the previous files are absent or invalid, sessionstore-backups/previous.js is still present, in which case it is always valid;
  5. In most cases in which the previous files are absent or invalid, sessionstore-backups/upgrade.js-[…] is still present, in which case it is always valid.

Similarly, if Firefox crashed or was killed:

  1. In most cases, sessionstore-backups/recovery.js is present and valid;
  2. In most cases in which sessionstore-backups/recovery.js is invalid, sessionstore-backups/recovery.bak is pressent, with an even higher likelihood of being valid;
  3. In most cases in which the previous files are absent or invalid, sessionstore-backups/previous.js is still present, in which case it is always valid;
  4. In most cases in which the previous files are absent or invalid, sessionstore-backups/upgrade.js-[…] is still present, in which case it is always valid.

Numbers crunching

Statistics collected on Firefox Nightly 32 suggest that, out of 11.95 millions of startups, 75,310 involved a corrupted sessionstore.js. That’s roughly a corrupted sessionstore.js every 158 startups, which is quite a lot. This may be influenced by the fact that users of Firefox Nightly live on pre-alpha, so are more likely to encounter crashes or Firefox bugs than regular users, and that some of them use add-ons that may modify sessionstore.js themselves.

With the new algorithm, assuming that the probability for each file to be corrupted is independent and is p = 1/158, the probability of losing more than 30 seconds of data after a crash goes down to p^3 ≅ 1 / 4,000,000. If we haven’t removed the recovery files, the probability of losing more than 30 seconds of data after a clean shutdown and restart goes down to p^4 ≅ 1 / 630,000,000. This still means that , statistically speaking, at every startup, there is one user of Firefox somewhere around the world who will lose more than 30 seconds of data, but this is much, better than the previous situation by several orders of magnitude.

It is my hope that this new mechanism will transparently make your life better. Have fun with Firefox!

Revisiting uncaught asynchronous errors in the Mozilla Platform

May 30, 2014 § Leave a comment

Consider the following feature and its xpcshell test:

// In a module Foo
function doSomething() {
  // ...
  OS.File.writeAtomic("/an invalid path", "foo");
  // ...
}

// In the corresponding unit test
add_task(function*() {
  // ...
  Foo.doSomething();
  // ...
});

Function doSomething is obviously wrong, as it performs a write operation that cannot succeed. Until we started our work on uncaught asynchronous errors, the test passed without any warning. A few months ago, we managed to rework Promise to ensure that the test at least produced a warning. Now, this test will actually fail with the following message:

A promise chain failed to handle a rejection – Error during operation ‘write’ at …

This is particularly useful for tracking subsystems that completely forget to handle errors or tasks that forget to call yield.

Who is affected?

This change does not affect the runtime behavior of application, only test suites.

  • xpcshell: landed as part of bug 976205;
  • mochitest / devtools tests: waiting for all existing offending tests to be fixed, code is ready as part of bug 1016387;
  • add-on sdk: no started, bug 998277.

This change only affects the use of Promise.jsm. Support for DOM Promise is in bug 989960.

Details

We obtain a rejected Promise by:

  • throwing from inside a Task; or
  • throwing from a Promise handler; or
  • calling Promise.reject.

A rejection can be handled by any client of the rejected promise by registering a rejection handler. To complicate things, the rejection handler can be registered either before the rejection or after it.

In this series of patches, we cause a test failure if we end up with a Promise that is rejected and has no rejection handler either:

  • immediately after the Promise is garbage-collected;
  • at the end of the add_task during which the rejection took place;
  • at the end of the entire xpcshell test;

(whichever comes first).

Opting out

There are extremely few tests that should need to raise asynchronous errors and not catch them. So far, we have needed this two tests: one that tests the asynchronous error mechanism itself and another one that willingly crashes subprocesses to ensure that Firefox remains stable.

You should not need to opt out of this mechanism. However, if you absolutely need to, we have a mechanism for opting out. For more details, see object Promise.Debugging in Promise.jsm.

Any question?

Feel free to contact either me or Paolo Amadini.

Where Am I?

You are currently browsing entries tagged with javascript at Il y a du thé renversé au bord de la table.