Season 1 Episode 2 – The Fight for File I/O

April 2, 2014

Plot. Our heroes set out for the first battle. Session Restore’s file I/O was clearly inefficient. Not only was it performing redundant operations, but it was also blocking the main thread while doing so. The time had come to take it back. Little did our heroes know that the forces of Regression were lurking and that the real battle would be fought long after the I/O had been rewritten and made non-blocking.

For historical reasons, some of Session Restore’s file I/O was quite inefficient. Reading and backing up were performed purely on the main thread, which could cause multi-second pauses in extreme cases, and 100 ms+ pauses in common cases. Writing was done mostly off the main thread, but the underlying library caused accidental main-thread I/O with the same effect, as well as disk flushing. Disk flushing is extremely inefficient on most operating systems and can quickly bring the whole system to its knees, so it needs to be avoided.

Fortunately, OS.File, the (then) new JavaScript library designed to provide off-main-thread I/O, had just become available. Turning Session Restore’s I/O into OS.File-based off-main-thread I/O was surprisingly simple, and even helped make the relevant fragments of the code more readable.
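
To give an idea of the shape of this change, here is a minimal sketch of OS.File-based writing. This is not the actual Session Restore code; the path and state string are placeholders:

    Components.utils.import("resource://gre/modules/osfile.jsm");

    let path = OS.Path.join(OS.Constants.Path.profileDir, "sessionstore.js");

    function saveState(stateString) {
      // The write happens off the main thread; writeAtomic writes to a
      // temporary file and renames it into place, and does not flush the
      // disk unless explicitly asked to.
      return OS.File.writeAtomic(path, stateString, {
        encoding: "utf-8",
        tmpPath: path + ".tmp"
      });
    }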

In addition to performing main-thread I/O and flushing, Session Restore’s I/O had several immediate weaknesses. One was its crash detection mechanism, which required Session Restore to rewrite sessionstore.js immediately after startup, just to store a boolean indicating that we had not crashed. Recall that the largest sessionstore.js known to date weighs more than 150 MB, and that instances of 1 MB or more represented ~5% of our users. Rewriting all this data (and blocking startup while doing so) for a simple boolean flag was clearly unacceptable. We fixed this issue by separating the crash detection mechanism into its own module and ensuring that it only needed to write a few bytes. Another weakness was the backup code, which performed a full (and inefficient) copy during startup, whereas a simple rename would have been sufficient.
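
As an illustration of both fixes, here is a hedged sketch, with hypothetical file names and without the real module structure, of a crash-detection marker that only writes a few bytes, and of a backup performed by renaming rather than copying:

    Components.utils.import("resource://gre/modules/osfile.jsm");

    let profileDir = OS.Constants.Path.profileDir;
    let markerPath = OS.Path.join(profileDir, "session.state");   // hypothetical name
    let storePath  = OS.Path.join(profileDir, "sessionstore.js");
    let backupPath = OS.Path.join(profileDir, "sessionstore.bak");

    // A few bytes are enough to record whether the previous session ended cleanly.
    function markRunning() {
      return OS.File.writeAtomic(markerPath, "running", { encoding: "utf-8" });
    }
    function markCleanShutdown() {
      return OS.File.writeAtomic(markerPath, "clean", { encoding: "utf-8" });
    }

    // Renaming is a metadata-only operation, so the cost of the backup no
    // longer depends on the size of sessionstore.js.
    function backupPreviousSession() {
      return OS.File.move(storePath, backupPath);
    }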

Having fixed all of this, we were happy. We were wrong.

Speed improvements?

Sadly, Telemetry archives do not reach back far enough to let me provide data confirming any speed improvement. Note for future perf developers, including future self: back up this data or blog about it immediately, before The Cloud eats it.

As for measuring the effects of a flush, at the moment, we do not have a good way to do this, as the main impact is not on the process itself but on the whole system. The best we can do is measure the total number of flushes, but that doesn’t really help.

Full speed… backwards?

The first indication that something was wrong was a large increase in the Telemetry measure SESSIONRESTORED, which measures the total amount of time between the launch of the browser and the moment Session Restore has completed initialization. After a short period of bafflement, we concluded that this increase was normal and was due to a change in initialization order – indeed, since OS.File I/O was executed off the main thread, the results of reading the sessionstore.js file could only be received once the main thread was idle and could receive messages from other threads. While this interpretation was partly correct, it masked a very real problem that we only detected much later. Additionally, during our refactorings, we changed the point at which Session Restore initialization was executed, which muddied the waters even further.

The second indication arrived much later, when the Metrics team extracted Firefox Health Report data from released versions and got in touch with the Performance team to inform us of a large regression in firstPaint-to-sessionRestored time. For most of our users, Firefox was now taking more than 500 ms longer to load, which was very bad.

After some time spent understanding the data, attempting to reproduce the measure and bisecting to find out at which changeset the regression had taken place, as well as instrumenting code with additional performance probes, we finally concluded that the problem was due to our use of an I/O thread, the “SessionWorker”. More precisely, this thread was very slow to launch during startup. Digging deeper, we concluded that the problem was not in the code of the SessionWorker itself, but that the loading of the underlying thread was simply too slow. More precisely, loading was fine on a first run, but on a second run, disk I/O contention between the resources required by the worker (the cache for the source code of the SessionWorker and its dependencies) and the resources required by the rest of the browser (other source code, but also icons, translation files, etc.) slowed things down considerably. Replacing the SessionWorker with a raw use of OS.File would not have improved the situation – ironically, just like the SessionWorker, our fast I/O library was loading slowly because of slow file I/O. Further measurements indicated that this slow loading could take up to 6 seconds in extreme cases, with an average of 340 ms.

Once the problem had been identified, we could easily develop a stopgap fix to recover most of the regression. We kept OS.File-based writing, as it was not at fault, but we fell back to NetUtil-based loading, which does not require a JavaScript worker. According to Firefox Health Report, this returned us to a level close to what we had prior to our changes, although we are still worse off by 50–100 ms. We are still attempting to find out what causes this residual regression and whether it was indeed caused by our work.
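
The stopgap looked roughly like the classic NetUtil pattern sketched below. This is a sketch under the assumption of the NetUtil API of the time, not the actual patch:

    Components.utils.import("resource://gre/modules/NetUtil.jsm");
    Components.utils.import("resource://gre/modules/FileUtils.jsm");

    function readSessionFile(callback) {
      let file = FileUtils.getFile("ProfD", ["sessionstore.js"]);
      NetUtil.asyncFetch(file, function (inputStream, status) {
        if (!Components.isSuccessCode(status)) {
          callback(null);
          return;
        }
        // The fetch itself is asynchronous, so startup is not blocked on the
        // read; only the stream-to-string conversion runs on the main thread.
        let data = NetUtil.readInputStreamToString(
          inputStream, inputStream.available(), { charset: "UTF-8" });
        callback(data);
      });
    }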

With this stopgap fix in place, we set out to provide a longer-term fix, in the form of a reimplementation of OS.File.read(), the critical function used during startup, that does not need to boot a JavaScript worker to proceed. This second implementation was written in C++ and brought a number of additional improvements, such as the ability to decode strings off the main thread and transmit them to the main thread at no cost.
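
Usage-wise, the new read path is essentially a one-liner. The sketch below assumes the documented encoding option of OS.File.read():

    Components.utils.import("resource://gre/modules/osfile.jsm");

    let path = OS.Path.join(OS.Constants.Path.profileDir, "sessionstore.js");

    // With an encoding, both the read and the UTF-8 decoding take place off
    // the main thread, and the resulting string is handed back to the main
    // thread without an extra copy.
    OS.File.read(path, { encoding: "utf-8" }).then(
      function onSuccess(contents) {
        let state = JSON.parse(contents);
        // ... hand `state` over to Session Restore initialization ...
      },
      function onError(reason) {
        // e.g. first startup, when the file does not exist yet
      }
    );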

The patch using the new version of OS.File.read() landed a few days ago. We are still in the process of trying to make sense of the Telemetry numbers. While Telemetry indicates that the total time to read and decode the file has increased considerably, the total time between the start of the read and the moment we finish startup seems to have decreased nicely, by 0.5 seconds (75th percentile) to 4 seconds (95th percentile). We suspect that we are confronted with yet another case in which concurrency makes performance measurement more difficult.

Shutdown duration?

We have not yet attempted to measure the duration of shutdown-time I/O.

Losing data or privacy

By definition, since we write data asynchronously, we never wait until the write is complete before proceeding. In most cases, this is not a problem. However, process shutdown may interrupt the write during its execution. While the APIs we use to write the data ensure that shutdown will never cause a file to be partially written, it may cause us to lose the final write, i.e. 15 seconds of browsing, working, etc. To make things slightly worse, the final write of Session Restore is special, insofar as it removes some information that is considered somewhat privacy-sensitive and that is required for crash recovery but not for a clean restart. The risk already existed before our refactoring, but our work increased it, as the new I/O model was based on JavaScript workers, which are shut down earlier than the mechanism previously used, and without ensuring that their work is complete.

While we received no reports of bugs caused by this risk, we solved the issue by plugging Session Restore’s shutdown into AsyncShutdown.
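
For illustration, registering a shutdown blocker looks roughly like the sketch below. The phase and blocker message are illustrative, not the exact ones used by Session Restore:

    Components.utils.import("resource://gre/modules/AsyncShutdown.jsm");
    Components.utils.import("resource://gre/modules/osfile.jsm");
    Components.utils.import("resource://gre/modules/Promise.jsm");

    let path = OS.Path.join(OS.Constants.Path.profileDir, "sessionstore.js");
    let latestWrite = Promise.resolve();

    function write(stateString) {
      latestWrite = OS.File.writeAtomic(path, stateString,
                                        { encoding: "utf-8", tmpPath: path + ".tmp" });
      return latestWrite;
    }

    // Shutdown now waits until the latest write has completed, so the final
    // state (including the privacy-sensitive cleanup) is not lost.
    AsyncShutdown.profileBeforeChange.addBlocker(
      "Sketch: Session Restore is waiting for its final write to complete",
      function () { return latestWrite; }
    );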

Changing the back-end

One of our initial intuitions when starting with this work was that the back-end format used to store session data (large JSON file) was inefficient and needed to be changed. Before doing so, however, we instrumented the relevant code carefully. As it turns out, we could indeed gain some performance by improving the back-end format, but this would be a relatively small win in comparison with everything else that we have done.

We have several possible designs for a new back-end, but we have decided not to proceed for the time being, as there are still larger gains to be obtained with simpler changes. More on this in future blog entries.

Epilogue

Before setting out on this quest, we were already aware that performance refactorings were often more complex than they appeared. Our various misadventures have confirmed it. I strongly believe that, by changing I/O, we have improved the performance of Session Restore in many ways. Unfortunately, I cannot prove that we have improved runtime (because old data has disappeared), and we are still not certain that we have not regressed start-up.

If there are lessons to be learned, it is that:

  • there is no performance work without performance measurements;
  • once your code is sophisticated enough, measuring and understanding the results is much harder than improving performance.

On the upside, all this work has succeeded at:

  • improving our performance measurements of many points of Session Restore;
  • finding out weaknesses of ChromeWorkers and fixing some of these;
  • finding out weaknesses of OS.File and fixing some of these;
  • fixing Session Restore’s backup code, which consumed resources without doing anything particularly useful;
  • avoiding unnecessary performance refactorings where they would not have helped.

The work on improving Session Restore file I/O is still ongoing. For one thing, we are still waiting for confirmation that our latest round of optimizations does not cause unwanted regressions. Also, we are currently working on Talos benchmarks and Telemetry measurements to let us catch such regressions earlier.

This work has also spawned work for other teams on improving the performance of ChromeWorkers’ startup and communication speed.

In the next episode

Drama. Explosions. Asynchronicity. Electrolysis. And more.

Chrome Workers, now with modules!

July 17, 2013

One of the main objectives of Project Async is to encourage Firefox developers and Firefox add-on developers to use Chrome Workers to ensure that whatever they do doesn’t block Firefox’s UI thread. The main obstacle, for the moment, is that Chrome Workers have access to very few features, so one of the tasks of Project Async is to add features to Chrome Workers.

Today, let me introduce the Module Loader for Chrome Workers.


Helping Bugzilla Help Newcomers Help Mozilla Help Users

October 13, 2011

tl;dr

We can make Bugzilla more newbie-friendly.

Edit: Thanks for pointing me to mentored bugs, Josh!

The context

Earlier today, a discussion got me thinking about how we can lower the entry barriers for people who want to contribute to Mozilla in a technical or semi-technical way but do not know where to start – by “technical or semi-technical”, I have in mind developers, web developers, translators, and more generally people who need to commit some form of patch to any of the Mozilla subprojects.

There are many things that we can improve, but for this post, I would like to concentrate on Bugzilla tweaks that I believe could be useful. These tweaks come from a simple remark: for all contributors Bugzilla is, in fact, a form of contribution console. Unfortunately, for the moment, I see the following problems:

  • Bugzilla doesn’t help newcomers find easy entry points;
  • Bugzilla doesn’t guide newcomers towards their first patch;
  • Bugzilla doesn’t help old-timers help newcomers.

So, let’s see how we could improve Bugzilla.

Making newcomers visible and encouraging old-timers to help them

I believe that the first thing we need to do is to make sure that newbies are more visible to old-timers. For this purpose, we can provide an account attribute “I’m a beginner, please be gentle”, which appears in the profile, ticked by default, and remains there until unticked. Every message, attachment or patch provided by a newbie is clearly labelled as such.

This label may come with a link to the wiki with suggestions for old-timers wanting to help the newbie.

Similarly, we can encourage old-timers to adopt a newbie. For this purpose, we can provide an account attribute “Adopt a newbie”, with a list of accounts. Every message, attachment or patch provided by an adopted newbie is automatically tracked by the adopter.

Note As far as I understand, these features are already essentially implemented. I’m just proposing a small tweak.

Having Bugzilla tell stories

Bugzilla is a nice place for reporting bugs and discussing how to solve them. I propose a simple tweak to make Bugzilla tell the story of a patch. Let me show you one possible way of doing this:

Each of these links should help a newcomer take one step towards getting a patch submitted and accepted:

  1. “Help me set up for contributing” points to some of the wiki pages we already have containing project-specific setup documentation. For Mozilla coding bugs, this explains how to build Mozilla. For localization bugs, this explains how to set up whichever tools are required, etc. This is set up on a per-component basis.
  2. “Help me contact the developers” points to the component-specific wiki page with all the contact information (IRC channel, forum, mailing-list, module owner e-mail address, etc.).
  3. “Help me find project doc” points to the project-specific wiki page with all the documentation.
  4. “Help me submit a contribution” points to a beginner-oriented wiki page with the information on submitting. For mozilla-central, for instance, beginners should be able to find the two lines that will let them export everything they have done as one hg patch.
  5. “What’s next” points to a wiki page detailing the submission approval process and the various forums, IRC channels, etc. in which the contributor can participate.

This is for beginners. As you may have seen, there is another toolbar, designed for experienced developers/community members:

This toolbar is designed to keep track of activity on bugs that the user follows:

  1. “Unconfirmed” lists the number of UNCO bugs followed by the user.
  2. “In progress” lists the number of NEW bugs followed by the user.
  3. “Review” lists the number of patches pending review in bugs followed by the user.
  4. “Mentoring” lists the unread messages and patches submitted by newbies adopted by the user.

Clicking on each link points to a bugzilla search of the corresponding bugs.

Making easy bugs easy to discover

New contributors need to find good entry points. For this purpose, we have project-specific lists of good first bugs. However, at the moment, the lists I could find are both hard to discover and unmaintained. So, let’s integrate this into Bugzilla. As it turns out, it’s quite easy; in fact, it’s done already (and better than what I suggested initially).

I suggest adding two tags: “easy” and “average”. Every bug can be annotated as “easy” if a developer believes that it’s a good first bug or “average” if a developer believes that it’s a bug that requires some experience but no architectural overview.

Once this is done, we can link from all our wikis and contributor engagement pages to Bugzilla searches with all the “easy” and all the “average” bugs, both project-wide and by component.

Submission add-on

This one is more of a dream. I believe that, at some point, we could develop an add-on that can be installed directly from the Bugzilla toolbar, and whose role is to provide a wizard that helps users:

  • install the appropriate source management tool;
  • once the tool is installed, submit a patch for review.

Initially, this add-on covers only mozilla-central. Later, upgrades to this add-on and/or project-specific add-ons may be added for projects not covered by mozilla-central (typically projects hosted on GitHub).

Inspiration: we want to do better than the (otherwise excellent) GitHub “pull request” :)

Badges!

  • First bug report tagged “new”? Earn a badge!
  • First patch submitted? Earn a badge!
  • First patch accepted? Earn a badge!
  • Same thing for the marks of 10, 50, 100 …

Any thoughts?

OK, these are the tweaks I can think of to improve the newbie experience with contributions. What do you think? Is there any way we can make this better?

Extrapol, part 1: from C to Effects

June 3, 2008

Here comes the long-promised description of Extrapol, my main ongoing research project. In a few words, our objective with Extrapol is to fill a hole in the current suite of tools built to ensure the security of systems. While there is an ample amount of tooling designed to analyse the behaviour of processes either during their execution (dynamic analysis) or after their completion (trace analysis), there is little work on applying static analysis to actual system security.

