What’s Next For Speedometer

I have some thoughts about what’s next for the Speedometer benchmark, after the Speedometer 3 launch a few months back.

First: I’m proud of the role Mozilla played on Speedometer 3. We made significant contributions to the benchmark itself as well as to the structure of the group, enabling it to work towards consensus on many difficult decisions with principles aimed at getting the incentives right, such that competing on score would make real-world pages more responsive. With buy-in from each major engine, browsing is now a bit more pleasant for everyone regardless of their browser choice. This is a practical example of the strength of the Web as an open platform with multiple competitive implementations.

In parallel to developing the benchmark itself, there’s a much larger engineering effort still ongoing within Gecko — fixing nearly 500 bugs and counting, with dozens of engineers sweating the details to make typical web pages in Firefox Desktop and Android more responsive for users.

Incremental improvements for the next version

I don’t think we should change the underlying goals for the project, and largely want to build on the recent success. But we could make the benchmark better reflect the real-world Web and continue pushing forward engine performance by making some changes. Here’s what I’d like to see.

Faster releases

The longer the benchmark stands still, the worse it becomes at making the web faster for users. It needs to evolve over time, because:

  • The ecosystem changes, which makes the current set of tests less representative.
  • Engines run out of low-hanging fruit and end up needing to get juice out of changes tuned to quirks of the tests or the benchmark methodology, which are less generally applicable to the web.

While there’s no release date set for Speedometer 4, the gap before it should be significantly shorter than the 6 years between versions 2 and 3.

More tests

Speedometer 2 had 15 tests, and Speedometer 3 has 20. More importantly, they test a much more diverse set of content than before — the posts on the Chromium Blog and the MS Edge blog do a good job outlining this in more detail.

Speedometer 4 should have a bunch more, and they should be as distinct from each other as possible to cast a wide net for engine improvements. Practically speaking, at this stage it would be best to have a near-zero-friction process for engines and open source contributors to create and share experimental tests, even if they’ll never be suitable for inclusion in an official release. That’ll help iterate on ideas and tests to include, and provide internal targets for engines that want to find optimizations outside of what’s measured in the official score.

For example, an engine could turn test cases in bug reports into Speedometer tests, which it could then run in its own CI as regression tests, and which may be useful for other engines to optimize. A mechanism for sharing these unofficial test cases, similar to Web Platform Tests, could facilitate a broader ecosystem of Speedometer tools and tests, in parallel with an official release cadence.

Asynchronous measurements

The core measurement loop in Speedometer 3 does a better job at capturing layout and paint than previous versions (see more detail on the WebKit blog), but the actual work being measured still has to be synchronous. This makes it impossible to measure functionality that relies on Promises or asynchronous events, both of which are very common on the web. For example, when I researched text editing it was clear that it’d be good to add a test for the popular Monaco editor, but we were unable to do so because some of its initialization happens in Workers that we couldn’t deterministically pull into measured time.

Making the measurement loop asynchronous will open up all sorts of technical questions that we’ll need to carefully evaluate, like the effects of CPU throttling. But we can add the ability to the core runner as a developerMode feature without affecting current tests. Combined with a low-friction development process for experiments, we can build out a set of tests that use it to inform the evaluation.
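To make the idea concrete, here is a minimal sketch of what an asynchronous measurement could look like. It is not taken from the Speedometer runner: the names measureStep, StepResult, and Clock are hypothetical, and the clock is injected only so the behavior can be checked deterministically. A real runner would more likely read performance.now().

```typescript
// Hypothetical sketch; none of these names come from the Speedometer codebase.
type Clock = () => number;

interface StepResult {
  name: string;
  syncTime: number;  // elapsed time until the step function returned
  asyncTime: number; // elapsed time until the step's promise settled
}

// Runs one step and records both the synchronous portion and the total
// time including any awaited work. A synchronous step (returning void)
// is handled too, since awaiting a non-promise resolves immediately.
async function measureStep(
  name: string,
  step: () => void | Promise<void>,
  now: Clock = Date.now, // assumption: a real runner would use a finer clock
): Promise<StepResult> {
  const start = now();
  const pending = step(); // the synchronous part of the step runs here
  const syncEnd = now();
  await pending;          // wait for Promise-based or event-driven work
  const asyncEnd = now();
  return { name, syncTime: syncEnd - start, asyncTime: asyncEnd - start };
}
```

Reporting the two durations separately is a design choice for this sketch: it would let existing synchronous tests keep their current meaning while experimental tests opt in to the asynchronous total.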

Test against remote content

All tests are currently embedded in the main repo and hosted in iframes on the same domain, with the test logic itself defined in the parent frame and accessing elements in child frames.

However, since engines with site isolation separate subframes from different domains into different processes, there are all sorts of real-world performance considerations that are not exercised today. While we may not ever ship cross-origin hosting to browserbench.org, we’d like the ability to test this ourselves in a variety of configurations (e.g., separate origin per test) in order to optimize Gecko.

There are technical details to figure out with the runner design here. The Chrome team has a promising suggestion, worth exploring further, which pushes the test logic into each individual test. A design like this would also make it easier to include additional metrics like networking and page load, which aren’t feasible to include in official scores due to variability but would be extremely useful signals for measurement in Firefox.
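As a rough illustration of what "test logic inside each test" might mean, the sketch below has the test own its steps and answer requests from the runner with timing results. The message shapes and the handleRequest function are hypothetical. They are not the Chrome team's proposal or the current runner's API. In a browser, the request and reply could plausibly travel between frames via window.postMessage, which works across origins, but that wiring is left out here.

```typescript
// Hypothetical message shapes; not taken from any existing runner design.
interface RunStepRequest { type: "run-step"; step: string; }
interface StepDoneReply { type: "step-done"; step: string; duration: number; }
interface StepErrorReply { type: "step-error"; step: string; reason: string; }
type Reply = StepDoneReply | StepErrorReply;

// The steps live inside the test page, keyed by name.
type Steps = Record<string, () => void>;

// Runs the requested step inside the test and returns a reply for the runner.
// The runner never touches the test's DOM; it only sees the reply.
function handleRequest(
  steps: Steps,
  request: RunStepRequest,
  now: () => number = Date.now, // assumption: injected clock, for determinism
): Reply {
  // Own-property check so inherited names like "toString" are not run as steps.
  if (!Object.prototype.hasOwnProperty.call(steps, request.step)) {
    return { type: "step-error", step: request.step, reason: "unknown step" };
  }
  const start = now();
  steps[request.step]();
  return { type: "step-done", step: request.step, duration: now() - start };
}
```

Because the runner only exchanges messages with the test, the same test could be hosted on a separate origin without the parent frame needing access to its elements.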