Designing the Firefox Performance Stats Monitor, part 1: Measuring time without killing battery or performance

October 27, 2015 § Leave a comment

For a few versions, Firefox Nightly has been monitoring the performance of add-ons, thanks to the Performance Stats API. While we are waiting for the greenlight to let it graduate to Firefox Aurora, as well as investigating a few lingering false-positives, and while v2 is approaching steadily, it is time for a brain dump on this toolbox and its design.

The initial objective of this monitor is to be able to flag both add-ons and webpages that cause noticeable slowdowns, so as to let users disable/close whatever is making their use of Firefox miserable. We also envision more advanced uses that could let us find out if features of webpages cause slowdowns on specific OS/hardware combinations.

I. Performance Stats vs. Gecko Profiler

Firefox has long had a built-in profiler, that can be used to obtain detailed performance information about a specific webpage (through the DevTools), or about the entirety of Firefox (though a dedicated add-on). If you have not tried the Gecko Profiler, you really should. However, the Performance Stats API does not build upon this profiler. Let’s see why.

The Gecko Profiler is a Sampling Profiler. This means that it spawns a thread, which wakes up regularly (every ~1 millisecond by default) to capture the stack of the main thread and store it to memory. Once profiling is over, it examines the symbols in the captured stacks, and extrapolates that if a symbol appears in n% of the samples, it must also take n% of the CPU time. Also, it consolidates the stacks into a tree for easier visualisation.

This technique has several big advantages:

  • it has a small and bounded impact on performance (1 stack capture per wakeup);
  • if samples are sufficiently numerous, it generally provides very precise information;
  • it provides nicely detailed output.

Also, the stack contains all the information needed to determine if the code being executed belongs to an add-on or a webpage, which means that the Gecko Profiler could theoretically be used as a back-end for the Performance Stats API.

Unfortunately, it also has a few major drawbacks:

  1. waking up one thread every ~1ms to communicate with another thread is rather battery-hostile, so this is not something that the browser should do for long periods, especially in a multi-process browser;
  2. recording the stacks can also become pretty expensive quickly;
  3. the mechanism is designed to allow extracting data after the fact, not for real-time monitoring;
  4. interaction between the threads can quickly become non-trivial.

It might be possible to partially mitigate point 1/ by ensuring that the sampling thread is stopped whenever the execution thread is not executing “interesting” code. The battery gains are not clear – especially since we do not have a good way to measure such costs – and, more importantly, this would make point 4/ much more complex and error-prone.

It is also likely that points 2/ and 3/ could be addressed by making sure that, for the sake of Performance Stats, we only extract simplified stacks containing solely information about the owner of the code, whether add-on, webpage or platform. To do this, we would need to hack relatively deep in both the profiler (to be able to extract and treat various kinds of data, and to be able to do it both after-the-fact and in real time), and the JavaScript VM (to annotate the stack with ownership information), as well as to introduce support code (to provide a mapping between stack ownership data and add-on identification/webpage identification). Interestingly, introducing these VM changes and the support code essentially amount to writing a form of Event Profiler, which would use the existing Sampling Profiler essentially as a form of high-resolution clock. Sadly, these changes would probably make the interaction between threads quite more complicated.

Finally, the only way to get rid of points 4/ would be to move entirely away from a Sampling Provider to an Event Profiler. So let’s see how we could design Performance Stats as a form of Event-based Profiler.

II. Low-level API

One of the main interests of Sampling Profilers is that they interfere with the code considerably less than most other kinds of profilers. However, we have an advantage that most profilers do not have: we need much, much fewer details. Indeed, knowing whether a statement or even a function takes time will not help us find out whether an add-on is slow. Rather, we are interested in knowing whether a Performance Groups take time – we’ll return later to a more precise definition of Group, but for the moment, let’s just assume that a Group is anything we want to monitor: an add-on, or a webpage including its frames, or a webpage without its frames, etc.

In other words, we are interested in two events:

  • the VM is starting to execute code that belongs to an interesting Group;
  • the VM has finished executing code that belongs to an interesting Group.

If we can just implement a quick and reasonably accurate Stopwatch to measure the duration between these events, we have all the information we need.

Note that the structure is purely stack-based. So we can already start designing our Stopwatch API along the following lines:

class Stopwatch { // RAII
  explicit Stopwatch(JSRuntime*, Group*); // Start the measure.
  ~Stopwatch(); // Stop the measure.

Now, let’s look at the issues we may encounter with such a design:

  1. the time between “start” and “finish” may be extremely short (possibly just a few µs), so we need an extremely precise clock – recall that the usual Windows XP CPU usage clock has a measurement precision of 16ms, and sometimes even stops updating for much longer;
  2. since we perform all measurements on the main thread, and since we may enter/exit thousands of Groups per frame, we need the clock to be extremely fast;
  3. we need to decide what happens when Group A calls Group B, e.g. when an add-on calls the platform, or when the platform calls an add-on, or when a treatment in the webpage causes an add-on to fire, etc.;
  4. a few functions in Firefox need to have a blocking API but a non-blocking implementation (e.g. `alert()`, synchronous XMLHttpRequest, but also Firefox Sync), a situation that is implemented through nested event loops – we need to make sure that we do not display wildly inaccurate results in such cases.

II.1. Dealing with calls

Let’s start with issue 3/.

  • If Group A calls Group B, and if Group B performs a slow operation, who should be charged for this operation? We answer this question as would a Sampling Provider: both are charged. The most elegant implementation is simply to keep the Stopwatch of Group A running while Group B is executing with its own Stopwatch.
  • If Group A calls Group B, and if Group B calls back Group A (possibly indirectly), who should be charged for this operation? Well, since Group A is already charged because it’s calling into Group B, we should only charge Group A once. In other words, since Group A already has a Stopwatch, we should not start a second Stopwatch for a nested call to Group A. As a nice side-effect, this improves our speed and memory usage by ensuring that we don’t need to record as many things.

In terms of API, this means something along the lines of:

class Group {
  bool HasStopwatch();
  void AttachStopwatch();
  void DetachStopwatch();

This almost immediately gives us a design for issue 4/.

  • If Group A starts a nested event loop, and Group B starts working in the nested loop, who should be charged? While spawning nested event loops is an evil practice, which may cause all sorts of issues, Group A is not causing any visible slowdown, no matter how long Group B may be working, and in practice has no idea that Group B is working. Therefore, we cannot charge anything to Group A. In other words, we cancel all ongoing measurements and remove all executing Stopwatches.

To avoid having to maintain long lists of current stopwatches in case we may need to invalidate them, and to avoid the linear cost of such invalidation, we alter the API to take into account an iteration number, which increases whenever we start processing an event:

class Stopwatch { // RAII
  explicit Stopwatch(JSRuntime*, Group*); // Start the measure. Record the current iteration.
  ~Stopwatch(); // Stop the measure. If the current iteration is not the one that was originally recorded, this Stopwatch is stale, don’t bother with a final measure, just drop everything.

class Group {
  bool HasStopwatch(uint64_t iteration); // Does the group have a stopwatch for the current event?
  void AttachStopwatch(uint64_t iteration); // Attach a stopwatch for the current event.
  void DetachStopwatch(uint64_t iteration); // Does nothing if `iteration` is stale.

Note that, had we chosen to base our profiler on sampling, we would still have needed to find solutions to issues 3/ and 4/, and the implementation would certainly have been similar.

II.2. Dealing with clocks

Sampling profilers rely on a fixed rate system timer, which solves elegantly both the problem of clock imprecision (issue 1/) and that of the speed of actually reading the clock (issue 2/). Unfortunately, due to the issues mentioned in I., this is a solution that we could not adopt. Hence the need to find another mechanism to implement the Stopwatch. A mechanism that is both fast and reasonably accurate.

We experimented with various clocks and were disappointed in varying manners. After these experiments, we decided to go for a statistical design, in which each `Stopwatch` does not measure an actual time (which proved either slow or unreliable) but a number of CPU clock cycles, through the CPU’s built-in TimeStamp Counter and instruction RDTSC. In terms of performance, this instruction is a dream: on modern architectures, reading the TimeStamp Counter takes a few dozen CPU cycles, i.e. less than 10ns.

However, the value provided by RDTSC is neither a wall clock time, nor a CPU time. Indeed, RDTSC increases whenever code is executed on the CPU (or, on older architectures, on the Core), regardless of whether the code belongs to the current application, the kernel, or any other application.

In other words, we have the following issues:

  1. we need to map between clock cycles and wall time;
  2. if the computer has gone to sleep, the counter may be reset to 0;
  3. if the process is moved between CPUs/cores, it may end up on a CPU core with an unsynchronized counter;
  4. the mapping between clock cycles and walltime varies with the current frequency of the CPU;
  5. other threads/processes using the same CPU will also increment the counter.

1/ Mapping clock cycles to wall time

If you recall, we have decided to use RDTSC instead of the usual OS-provided clocks because clocks were generally either not fine-grained enough to measure the time spent in a Group or too slow whenever we needed to switch between groups thousands of times per second. However, we can easily perform two (and exactly two) clock reads per event to measure the total CPU Time spent executing the event. This measure does not require a precision as high as we would have needed to measure the time spent in a group, nor does it need to be as fast. Once we have this measure, we can extrapolate: if n% of the cycles were spent executing Group A, then n% of the time spent in the event was spent executing Group A.

If n<0 or n>100 due to issues 2/ or 3/, we simply discard the measure.

2/ Handling the case in which the counter is reset to 0

Now that we have mapped clock cycles and wall time, we can revisit the issue of a computer going to sleep. Since the counter is reset to 0, we have n<0, hence the measure is discarded. Theoretically, if the computer goes to sleep many times in a row within the execution of the event loop, we could end up in an exotic case of n≥0 and n≤100, but the error would remain bounded to the total CPU cost of this event loop. In addition, I am unconvinced that a computer can physically go to sleep quite that fast.

3/ Handling the case in which threads move between CPUs/cores

In theory, this is the scariest issue. Fortunately, modern Operating Systems try to avoid this whenever possible, as this decreases overall cache performance, so we can hope that this will happen seldom.

Also, on platforms that support it (i.e. not MacOS/BSD), we only measure the number of cycles if we start and end execution of a group on the same CPU/core. While there is a small window (a few cycles) during which the thread can be migrated without us noticing, we expect that this will happen rarely enough that this won’t affect the statistics meaningfully.

On other platforms, but on modern architectures, if a thread jumps between cores during the execution of a Group, this has no effect, as cores of a single CPU have synchronized counters.

Even in the worst case scenario (MacOS/BSD, jump to a different CPU), statistics are with us. Assuming that the probability of jumping between CPUs/cores during a given (real) cycle is constant, and that the distribution of differences between clocks is even, the probability that the number of cycles reported by a measure is modified by X cycles should be a gaussian distribution, with groups with longer execution having a larger amplitude than groups with shorter execution. Since we discard measures that result in a negative number of cycles, this distribution is actually skewed towards over-estimating the number of cycles of groups that already have many cycles and under-estimating the number of cycles that already have fewer cycles.

Recall that this error is bounded by the total CPU time spent in the event. So if the total time spent in the event is low, the information will not contribute to us detecting a slow add-on or webpage, because the add-on or webpage is not slow in the first place. If the total time spent in the event in high, i.e. if something is slow in the first place, this will tend to make the slowest offenders appear even slower, something which we accept without difficulty.

4/ Variable mapping between clock cycles and walltime

Once again, this is an issue that cannot be solved in theory but that works nicely in practice. For once thing, recent architectures actually make sure that the mapping remains constant.

Moreover, even on older architectures, we suspect that the mapping between clock cycles and wall time will change rarely on a given thread during the execution of a single event. This may, of course, be less true in case of I/O on the thread.

Finally, even in the worst case, if we assume that the mapping between clock cycles and wall time is evenly distributed, we expect that this will eventually balance out between Groups.

5/ Cycles increasing with any system activity

Assuming that, within an iteration of the event loop, this happens uniformly over time, this will skew towards over-estimating the number of cycles of groups that already have many cycles and under-estimating the number of cycles that already have fewer cycles.

Again, for the same reason as issue 3/, this effect is bounded, and this is a bias that we accept without difficulty.

Once we have decided on the algorithm above, the design of the Stopwatch becomes mostly natural.

To be continued

In followup blog entries, I plan to discuss Performance Groups, measuring the duration of cross-process blocking calls, collating data from processes, determining when an add-on is slow and more. Stay tuned

Tagged: , , , , , , ,

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

What’s this?

You are currently reading Designing the Firefox Performance Stats Monitor, part 1: Measuring time without killing battery or performance at Il y a du thé renversé au bord de la table.


%d bloggers like this: