Writing JS Functions in BigQuery

While querying Custom Metrics in HTTP Archive it can be convenient to use JavaScript instead of SQL. Luckily there’s a way to do this with BigQuery, which I learned reading the queries for the Web Almanac (link to all the SQL used to generate reports for the 2022 edition).

Here’s a simple example that takes in a JSON string and returns an array of strings:

CREATE TEMPORARY FUNCTION
  helloWorld(input STRING)
  RETURNS ARRAY<STRING>
  LANGUAGE js AS '''
try {
  input = JSON.parse(input);
  return ["hello", input.str];
} catch (e) {
  return [];
}
''';

SELECT helloWorld(input) AS result
FROM 
  UNNEST([
    ('{"str": "world"}')
]) AS input;

This will return

-----------------
| Row  | result |
|------|--------|
| 1    | hello  |
|      | world  |
-----------------

Putting this together with some of the HTTP Archive datasets, here’s a more real-world example based on the previous HTTP Archive Analysis of News Homepages:

-- #standardSQL
-- Taking an object like { html: 1, body: 1, div: 2 } and returning 4
CREATE TEMPORARY FUNCTION
  getTotalElements(element_count STRING)
  RETURNS STRING
  LANGUAGE js AS '''
try {
  element_count = JSON.parse(element_count);
  if (Array.isArray(element_count)) {
    return -1;
  }
  let c = 0;
  for (let k of Object.keys(element_count)) {
    c += element_count[k];
  }
  return c;
} catch (e) {
  return -1;
}
''';

-- Taking a JSON object like { "summary": { "--hjFeedbackAccentTextColor": { "get": [...], "set": [...] }, ... } }
-- and returning an array of strings like [ "--hjFeedbackAccentTextColor", ... ]
CREATE TEMPORARY FUNCTION
  getCssVariables(css_variables STRING)
  RETURNS ARRAY<STRING>
  LANGUAGE js AS '''
try {
  css_variables = JSON.parse(css_variables).summary;
  return Object.keys(css_variables);
} catch (e) {
  return [];
}
''';

SELECT
  url,
  _TABLE_SUFFIX AS client,
  getTotalElements(JSON_EXTRACT_SCALAR(payload, '$._element_count')) as total_elements,
  getCssVariables(JSON_EXTRACT_SCALAR(payload, '$._css-variables')) as css_variables,
  payload,
FROM
-- TABLESAMPLE SYSTEM helps make testing the query cheap, since the pages tables are very large
`httparchive.pages.2023_03_01_*` TABLESAMPLE SYSTEM (.1 PERCENT)

-- Restrict to only relatively popular hosts
WHERE NET.HOST(url) in (
  SELECT DISTINCT NET.HOST(origin)
  FROM
    `chrome-ux-report.all.202303`
  WHERE experimental.popularity.rank <= 10000
)

This is querying the payload field (see an example row at this gist) by extracting keys off the JSON string and passing the value into the JS function, running some field-specific logic, and then returning the results (in one case as a string and in another as an array of strings).