Writing JS Functions in BigQuery

While querying Custom Metrics in HTTP Archive it can be convenient to use JavaScript instead of SQL. Luckily there’s a way to do this with BigQuery, which I learned reading the queries for the Web Almanac (link to all the SQL used to generate reports for the 2022 edition).

Here’s a simple example that takes in a JSON string and returns an array of strings:

CREATE TEMPORARY FUNCTION
  helloWorld(input STRING)
  RETURNS ARRAY<STRING>
  LANGUAGE js AS '''
try {
  input = JSON.parse(input);
  return ["hello", input.str];
} catch (e) {
  return [];
}
''';

SELECT helloWorld(input) AS result
FROM 
  UNNEST([
    ('{"str": "world"}')
]) AS input;

This will return

-----------------
| Row  | result |
|------|--------|
| 1    | hello  |
|      | world  |
-----------------

Putting this together with some of the HTTP Archive datasets, here’s a more real-world example based on the previous HTTP Archive Analysis of News Homepages:

-- #standardSQL
-- Taking an object like { html: 1, body: 1, div: 2 } and returning 4
CREATE TEMPORARY FUNCTION
  getTotalElements(element_count STRING)
  RETURNS STRING
  LANGUAGE js AS '''
try {
  element_count = JSON.parse(element_count);
  if (Array.isArray(element_count)) {
    return -1;
  }
  let c = 0;
  for (let k of Object.keys(element_count)) {
    c += element_count[k];
  }
  return c;
} catch (e) {
  return -1;
}
''';

-- Taking a JSON object like { "summary": { "--hjFeedbackAccentTextColor": { "get": [...], "set": [...] }, ... } }
-- and returning an array of strings like [ "--hjFeedbackAccentTextColor", ... ]
CREATE TEMPORARY FUNCTION
  getCssVariables(css_variables STRING)
  RETURNS ARRAY<STRING>
  LANGUAGE js AS '''
try {
  css_variables = JSON.parse(css_variables).summary;
  return Object.keys(css_variables);
} catch (e) {
  return [];
}
''';

SELECT
  url,
  _TABLE_SUFFIX AS client,
  getTotalElements(JSON_EXTRACT_SCALAR(payload, '$._element_count')) as total_elements,
  getCssVariables(JSON_EXTRACT_SCALAR(payload, '$._css-variables')) as css_variables,
  payload,
FROM
-- TABLESAMPLE SYSTEM helps make testing the query cheap, since the pages tables are very large
`httparchive.pages.2023_03_01_*` TABLESAMPLE SYSTEM (.1 PERCENT)

-- Restrict to only relatively popular hosts
WHERE NET.HOST(url) in (
  SELECT DISTINCT NET.HOST(origin)
  FROM
    `chrome-ux-report.all.202303`
  WHERE experimental.popularity.rank <= 10000
)

This is querying the payload field (see an example row at this gist) by extracting keys off the JSON string and passing the value into the JS function, running some field-specific logic, and then returning the results (in one case as a string and in another as an array of strings).

HTTP Archive Analysis of News Homepages

Here’s a quick analysis I did for Speedometer 3 about the size of the DOM for some common news sites, in order to support the development of a test covering interactions within a complex news-like site.

How this was done

  • Gather a list of sites from https://en.wikipedia.org/wiki/Wikipedia:News_sources and child pages.
  • Run some BigQuery SQL against the HTTP Archive desktop and mobile crawls from 2023_03_01 to generate the average DOM depth, total element count, number of css variables, and a few other data points. These rely heavily Custom Metrics (with links to the specific metric in the table caption below).

Some more details and the source code to reproduce the results are available at this gist.

The raw and aggregated data is available in this spreadsheet.

Results

DOM Depth
Average Median Max Standard Deviation
mobile 11.99 11 128 7.17
desktop 12.15 11 131 7.36
Total Elements
Average Median Max Standard Deviation
mobile 1856.04 1621 10444 1357.67
desktop 1911.26 1714 10444 1351.03
Total CSS Variables
Average Median Max Standard Deviation
mobile 31.54 0 887 99.59
desktop 34.43 0 887 101.19

Automatically Publishing a Vite Project to GitHub Pages

Here are my notes from publishing https://bgrins.github.io/editor-tests/ from https://github.com/bgrins/editor-tests. There’s an action at https://github.com/peaceiris/actions-gh-pages that makes this easy without setting up additional API tokens, but there are a few Vite specific steps worth noting.

Create the project

If you don’t already have one npm create [email protected], and cd into the directory. Then:

git init
git add .
git commit -m "initial commit"
git branch -M main
git remote add origin http://github.com/username/repo-name.git
git push -u origin main

Create gh-pages branch

If you don’t already have a gh-pages branch you can create an empty one like so:

git switch --orphan gh-pages
git commit --allow-empty -m "gh-pages"
git push origin gh-pages

Add files to the repo

Create a vite.config.js with:

import { defineConfig } from "vite";
export default defineConfig({
  base: "./",
});

Create a .github/workflows/node.js.yml with:

name: Node.js CI

on:
  push:
    branches: ["main"]
  pull_request:
    branches: ["main"]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/[email protected]
      - name: Use Node.js 18.x
        uses: actions/[email protected]
        with:
          node-version: 18.x
          cache: "npm"
      - run: npm install
      - run: npm run build --if-present

      - name: Deploy
        uses: peaceiris/[email protected]
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./dist

Update GitHub settings

At this point there’s an action running at https://github.com/bgrins/editor-tests/actions which builds the project from scratch and publishes to the root of the gh-pages branch, making it available at something like https://bgrins.github.io/editor-tests/.

Fetch the latest copy of a page from the Wayback Machine

Simple way to fetch the latest copy of a given URL from the wayback machine, without including the banner.

  1. Fetch the metadata (like https://archive.org/wayback/available?url=https://example.com)
  2. Prepend id_ after the timestamp in the returned URL to remove the banner (http://web.archive.org/web/20221212143457/https://example.com/ becomes http://web.archive.org/web/20221212143457id_/https://example.com/).

Here’s an example function with vanilla JS:

/**
 * Given a URL, return the latest copy of that URL in the Wayback machine.
 * To do this, find the last available response for the URL, then modify it
 * to omit the default banner in the UI.
 *
 * To do this for https://example.com, fetch https://archive.org/wayback/available?url=https://example.com
 * and get a response like:
 *
 *    url: "https://example.com",
 *    archived_snapshots: {
 *      closest: {
 *        status: "200",
 *        available: true,
 *        url: "http://web.archive.org/web/20221212143457/https://example.com/",
 *        timestamp: "20221212143457",
 *      },
 *    }
 *
 *  Then add "_id" before the URL portion and fetch that. In this case
 *  http://web.archive.org/web/20221212143457id_/https://example.com/
 */
async function get_wayback_response(url) {
  if (!url) {
    throw new Error("No URL provided");
  }
  const result = {
    timings: {},
  };
  const start = performance.now();
  const resp = await fetch(`http://archive.org/wayback/available?url=${url}`);
  result.metadata = await resp.json();
  result.timings.fetch_metadata = performance.now() - start;

  const closest = result.metadata?.archived_snapshots?.closest;
  if (!closest) {
    throw new Error(`No snapshot available from wayback server for ${url}`);
  }

  // Adding "id_" before the URL excludes the banner.
  const constructed_url = closest.url.replace(
    new RegExp(/(.*\/web\/[0-9]*)/),
    `$1id_`
  );
  result.response = await fetch(constructed_url);
  result.text = await result.response.text();
  result.timings.fetch_wayback = performance.now() - start;

  return result;
}

XUL/XBL Replacement Newsletter 20

This is the twentieth edition of the XUL/XBL Replacement Newsletter. As of last week, we have 0 xul files in mozilla-central. We’ve also been scoping out the work remaining in “XUL Replacement”, and have an update on post-XBL cleanups.

No more XUL files in mozilla-central

Back in May, I posted an outline of a plan to remove all XUL documents from mozilla-central to dev-platform. Based on feedback, we adjusted the plan to be smaller in scope and focused on (1) loading all XUL files as if they were HTML and (2) renaming .xul files to .xhtml. The first step was completed by Brendan Dahl in July (see newsletter #16), and Emma Malysz started on the second in October (see newsletter #18).

Emma did a great job breaking down and renaming the ~1500 files. It’s been split up primarily by directory using a script to handle the simpler test file renames, and manually handling the more complicated cases like files loaded over chrome://. This has led to a really nice burndown of xul files in mozilla-central, hitting 0 on Friday.

Another nice result of this project is that eslint operates on .xhtml files but not .xul, so we have increased lint coverage across the tree. Mark Banner also recently improved this support by handling self closing script tags in xhtml files.

XUL document burndown chart, starting with around 1400 documents on 2019-10-02 and ending with 0 documents on 2019-12-13

What’s left to do?

Over the last 2 years, we’ve accomplished a lot in this program. We’ve removed XUL Templates, XUL Overlays, XBL Stylesheets, XUL Documents, XBL, and converted all XUL files to XHTML - migrating to web technologies when possible. So what’s left to do? The three remaining projects we’re tracking are:

1) Complete the DTD conversion to Fluent. 2) Remove the XUL flex model and the other XUL layout frames. 3) Migrate XULElements that have a web equivalent to HTMLElement.

The work in (1) is well scoped and you can expect to see more updates about it early next year. For (2) we’ve been making progress on removing XUL layout frames like stack, grid, and groupbox but still have some performance issues with flexbox along with some functional issues that will be helped by doing (3). So I’d like to talk about more about (3) here.

First, I expect that we will still ship a number of XULElements even after we complete this program. I’ve talked about this before, but things like panel, browser, the menu elements (menu, menupopup, menucaption, menuitem, menulist) don’t have HTML equivalents and are important for our desktop browser experience. While we could invent a way to do this in chrome HTML, I don’t think the cost/benefit justifies doing that ahead of the rest of the things in our list.

So that leaves more “vanilla” XUL elements that are primarily implemented in JS. These cover the majority of our UI and are things like hbox, button, textbox, tab, richlistbox, etc. There are a couple of hurdles here:

a) XULElements act differently from HTMLElements in some ways. For example, getAttribute("not-existent") returns null in HTML and "" in XUL. XUL has some “magic attributes” that dictate layout and other behavior. And XUL supports features like [command] & [tooltip] which don’t have an equivalent in HTML. b) There’s a long tail of UI migrations that need to happen. These include changing the base Custom Element classes and also updating markup / CSS to use the new element name / namespace.

We’re thinking it makes sense to work on (a) tree-wide first (metabug), and treat (b) as a separate project (metabug). The reason to treat these separately is so that we don’t need to repeat the subset of changes needed for (a) for every single element in (b). In other words, we’d like (b) to be able to be as automated as possible, and selectively rewriting how frontend JS interacts with elements would be a roadblock to doing so.

More XBL cleanup

There have been some really nice simplifications and code removals happening in the remove-xbl-implementation metabug. Lines of code isn’t the best measurement for this type of simplification, but for what it’s worth there’s been a net removal of around 24K LOC so far.