Shrinking Visual Web Arena: Reddit

September 2025

The Visual Web Arena Reddit app is a comprehensive reproduction of a forum website, with realistic content and a functional backend using Postmill. These sorts of web-environments-in-a-box are useful for both agentic evaluations and browser testing, where test cases often resist isolation into a static page and testing on the live web is neither reliable nor reproducible.

As with my reproduction of the VWA classifieds app, there was some room to optimize images. This time there was much less low-hanging fruit in image compression, but some additional Docker build optimizations were available.

To run the original environment, you download a 49.8 GB tar file and load it into Docker.

The reproduction is available as a package at ghcr.io/bgrins/vwa-reddit-optimized-bundled:latest.

Here’s what the site looks like:

A screenshot of a reddit-inspired site using Postmill

Why So Big?

As the paper explains, the environment holds a lot of content, and most of the disk space goes to images.

The Reddit site also follows the same environment from WebArena, and represents a social forum platform. The site contains 31,464 posts containing a diverse set of images across different subreddits and forums, such as natural images, memes, consumer electronics, and charts.

Images

Most of the space within the container (38.5 GB) is in /var/www/html/public/submission_images, across over 31K images (26,169 JPG, 4,130 PNG, 1,166 GIF). There's not nearly as much low-hanging fruit as in Classifieds (which shrank from 73 GB to 4.9 GB), in part because Postmill doesn't keep multiple copies of each image and instead generates thumbnails dynamically with LiipImagineBundle. Changing image formats felt too invasive, so I used gifsicle, optipng, and jpegoptim to process the images in place. The filenames are content hashes of the original uploads, so the optimized files no longer match their names, but this doesn't seem to cause any problems.
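As a rough sketch of what an in-place pass like this could look like, the snippet below maps each file extension to one of the three tools and walks a directory. The exact flags used for the real run aren't stated here, so the ones below (--strip-all, -o2, --batch -O3) are assumptions chosen because they keep the file format and pixel data unchanged. The helper names optimizer_command and optimize_tree are made up for illustration.

```python
import subprocess
from pathlib import Path

# Assumed flags, not necessarily the ones used for the real run.
# All three tools rewrite the file in place with these options.
OPTIMIZERS = {
    ".jpg": ["jpegoptim", "--strip-all"],
    ".jpeg": ["jpegoptim", "--strip-all"],
    ".png": ["optipng", "-o2"],
    ".gif": ["gifsicle", "--batch", "-O3"],
}


def optimizer_command(path):
    """Return the argv list for optimizing `path`, or None if unsupported."""
    args = OPTIMIZERS.get(Path(path).suffix.lower())
    if args is None:
        return None
    return [*args, str(path)]


def optimize_tree(root, dry_run=True):
    """Walk `root` and optimize every supported image in place.

    With dry_run=True the commands are only collected, never executed.
    """
    commands = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        cmd = optimizer_command(path)
        if cmd is None:
            continue
        commands.append(cmd)
        if not dry_run:
            # check=False: one corrupt image should not abort the whole pass.
            subprocess.run(cmd, check=False)
    return commands
```

Calling optimize_tree("/var/www/html/public/submission_images") with the default dry_run=True only lists what would be run, which is a cheap way to sanity-check the plan before touching tens of gigabytes of files.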

The in-place optimization shrank the image directory only slightly, from 38.5 GB to 34 GB.
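To put the two results side by side, here is the arithmetic on the figures quoted above (38.5 GB to 34 GB for Reddit, 73 GB to 4.9 GB for Classifieds). The function name percent_saved is just for illustration.

```python
def percent_saved(before_gb, after_gb):
    """Percentage reduction going from before_gb to after_gb."""
    return 100 * (before_gb - after_gb) / before_gb


reddit = percent_saved(38.5, 34.0)
classifieds = percent_saved(73.0, 4.9)

print(f"Reddit images: {reddit:.1f}% smaller")    # 11.7% smaller
print(f"Classifieds:   {classifieds:.1f}% smaller")  # 93.3% smaller
```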

Other Cleanup

The remaining changes were minor: excluding caches, node modules, and other folders not needed at runtime, and using a minimal base image.

Docker layer caching misbehaved when trying to build it in a single multistage Dockerfile, causing the entire app directory to get re-copied even if none of the files changed and filling the hard drive. It turned out to be easiest to create a separate base image with the large app directory, and two separate images for the container, with and without Postgres bundled. The published tar includes the database in the container, which is convenient, but in some test environments (like The Zoo) it's better to use a shared database server.

Current sizes are 38.2 GB with the database bundled, and 36.3 GB without.

New Tests and Observations

I've added integration tests to make sure basic functionality (logging in, creating posts, etc.) works consistently in both the original and the reproduction.

A bug I noticed in both environments is that when upvoting a post, the score flips back to 1. I may have missed something, but it seems to be an issue with the data import, where the individual votes (submission_votes) were not imported but the aggregate score (submissions.net_score) was. Once a vote is cast, Postmill recalculates the net score based on the single recorded vote.
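A toy model of that hypothesis, to make the mechanism concrete. This is not Postmill's code: the Submission class and cast_vote method are invented for illustration, and the only behavior modeled is "the stored score is replaced by the sum of recorded votes whenever a vote is cast".

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    # Stands in for submissions.net_score, which the import populated.
    net_score: int
    # Stands in for rows in submission_votes (+1 or -1), which the import left empty.
    votes: list = field(default_factory=list)

    def cast_vote(self, choice):
        """Record a vote, then recompute the score from recorded votes only."""
        self.votes.append(choice)
        self.net_score = sum(self.votes)


post = Submission(net_score=13000)  # imported score, no backing vote rows
post.cast_vote(+1)
print(post.net_score)  # 1, not 13001
```

If the import had also created one vote row per point of score, the recalculation would land on 13001 instead.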

This isn’t a huge deal for basic testing, but it does mean some research use cases are limited (e.g., test cases requiring accurate interaction counts or reconstructing synthetic user profiles). Regardless, I've captured this behavior as an expected failure in the test case.
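For readers unfamiliar with expected failures, here is a minimal sketch of the idea using Python's unittest. The test framework used for the real suite isn't specified here, so this is only an analogy; score_after_upvote is a hypothetical stand-in that returns the value observed in the browser rather than driving the site.

```python
import unittest


def score_after_upvote(initial_score):
    """Hypothetical stand-in for the real browser interaction.

    It returns the buggy value observed in both environments.
    """
    return 1


class UpvoteTest(unittest.TestCase):
    @unittest.expectedFailure
    def test_upvote_increments_score(self):
        # The correct behavior; marked as a known failure for now.
        self.assertEqual(score_after_upvote(13000), 13001)


def run_suite():
    """Run the test case and return the raw unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UpvoteTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The suite stays green while the bug exists, and if the behavior is ever fixed the test is reported as an unexpected success, which is a prompt to remove the marker.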

An animated gif of an upvote bug where the score changes from over 13000 to 1 after pressing upvote