Asynchronous Challenges with Lounge and Jetbridge
Two forces conspired to bring about the asynchronous challenges in Hack-A-Sat 4 Qualifiers: the precise timing required by CPU execution vulnerabilities, and the coarse, isolated schedules that small satellite operations require.
Meltdown and Spectre (CVE-2017-5753, CVE-2017-5754, and CVE-2017-5715) are landmark vulnerabilities in techniques fundamental to making computers fast. As soon as they hit, one of my coworkers, Jason, became enthralled with the possibilities of building a CTF (the hacking kind of capture-the-flag) challenge around them. However, they landed after our time building contests as Legitimate Business Syndicate, so we didn’t have an immediate outlet for the idea.
As the Hack-A-Sat (HaS) program was starting to spin up, I started learning more about satellites (anything at all about satellites, really; my career background before HaS was focused on datacenter type stuff on reliable networks), and the surprising thing to me was that cheaper satellites often have what I consider to be awful connectivity, with slow connections to a scant few ground stations. This leads to a programming model more like my mom’s punchcard experience than ssh-ing to a shared machine in a big rack somewhere.
Hack-A-Sat gave our team an opportunity to develop challenges in a very freeform manner. Our main goal for qualifiers is to prepare teams for satellite operations, defense, and offense in finals; part of that can be vulnerabilities we find interesting, so Spectre was up for consideration.
Jason turned around a proof of concept pretty quickly, and it worked on the servers we liked to rent, but we didn’t like how vulnerable it was to noisy neighbors.
Hack-A-Sat Quals Challenge Infrastructure and Spectre
Our normal challenge infrastructure uses DNS round-robin load balancing to send a player to one of several machines running the challenge, and after their ticket is accepted, we spin up a per-connection container for them to try to get the flag out of. It works well, as long as there are enough machines for each concurrent player to get the capacity they need to make progress on the challenge. Sometimes this is easy and cheap, and sometimes it isn’t, but most challenges don’t run at wide-open throttle all the time, so it works pretty well even when we have many times more connections than cores.
Our implementation of the Spectre weakness used the real CPU’s cache and clock, and also took minutes for our reference solver. It wasn’t going to work if multiple players shared the same core. That meant we’d either need enough machines for each player to get one of their own, or our normal architecture would give players a ridiculously frustrating experience.
I started researching ways to make the challenge more tolerant of noisy neighbors through simulation, but I ended up deciding that the most tolerable solution was a queueing system that runs submissions one per instance, asynchronously and isolated from player interaction, with the added benefit of simulating the satellite-operations style of challenge.
Job Queues
I’ve built a lot of web applications, and a common experience is users dispatching a slow process that doesn’t need to happen within the request/response cycle. While Queues Don’t Fix Overload, they do allow you to decompose some operations into a latency-sensitive part (sending the user’s browser some HTML and letting them get on with their day) and a latency-insensitive part (sending the user an email so they can read it minutes or days later). In the Ruby on Rails world, they’ve been mainstream for over a decade, and we have a lot of experience using them on Hack-A-Sat. The registrar (where players register and play quals) uses que, which I picked because it works with PostgreSQL and doesn’t require a second datastore.
Using que was kind of awkward with generator challenges, though. These have a Ruby-based “generator runner” that gets generation requests after the general-purpose registrar worker makes a ticket for a generator challenge. The generator runner hangs out on the que table, locks a request it knows how to answer, runs the generator container, and puts the files that container outputs in a tarball on an HTTP server for players. It meant there was a Ruby thing that had to be copied around and have native extensions set up, which wasn’t too bad, but I kept running into weird issues with how its database table metadata interacted with how Rails/pg_dump does structure dumps. Additionally, I’d been reading more about database locks, and wondered if you could build a job queue with row-level locks enforced by the database.
Lounge and Jetbridge Design
After Hack-A-Sat 3 Finals, I had a really nice vacation and worked on another project for a couple months. Once that project closed out, it was time to really get into HaS4 quals, and by then I had a workable design for running asynchronous challenges like Spectre. After running a couple of big CTF finals projects in the Elixir Phoenix framework, I was pretty happy with how the LiveView real-time system worked.
Specifically, I had a workable strategy for sending events from Postgres triggers to browser LiveView clients:
A subscriber process in the Phoenix application LISTENs on a Postgres channel. This costs a connection slot per app server, but they’re pretty cheap in 2023.
A LiveView for a player subscribes to notifications relevant to them on the application’s Endpoint.
A trigger in Postgres NOTIFYs a channel (with the primary key in the payload) on any INSERT or UPDATE to the table in question; a sketch of such a trigger appears after these steps. This does add some latency to these changes, but 2023…
When the notification comes in to the subscriber, it loads the columns we care about from the database, and local_broadcasts them through the application’s Endpoint.
When the LiveView receives the notification and materialized row in handle_info, it updates the assigns (and through them, the page).
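Here's a minimal sketch of the trigger side, assuming a submissions table with an id primary key; the function, trigger, and channel names (submissions_changed) are illustrative, not our exact schema:

-- Sketch: notify listeners whenever a submission row is inserted or updated.
-- Table, function, and channel names here are illustrative.
CREATE FUNCTION notify_submissions_changed() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('submissions_changed', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER submissions_changed
  AFTER INSERT OR UPDATE ON submissions
  FOR EACH ROW EXECUTE FUNCTION notify_submissions_changed();

The payload carries only the primary key; the subscriber re-reads the row itself, so the notification stays small and never carries stale column data.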
With this strategy, I could build the “Lounge” application that accepts player submissions, puts them in Postgres, and waits for those submissions to change.
The other entity is a Crystal program called “Jetbridge” that takes player submissions from the Lounge when they’re ready and actually lets them run.
Why Crystal? I'd done this kind of work in Ruby and Python before, but the runtime requirements aren't my favorite to install, especially since Jetbridge runs outside of a container. Crystal really nails the single-binary installation, and I'm a lot more comfortable working with it than I am with Golang or Rust.
Database Design
We’ve been using a system of tickets and receipts since Hack-A-Sat 1 in 2020, and we really like what it gives us! Our async challs system has a tickets table; this is basically team authentication, and we keep the full encrypted ticket string there and also separate out the human-readable slug.
The other table is for submissions. It holds a ticket_id, the submission content, the IP address of the player submitting it, private and team-visible results, the exit status, the normal inserted_at and updated_at timestamps, and metrics about when (started_at, finished_at) and where (instance_id, instance_identity_document) it ran.
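For illustration, here's a rough sketch of what those two tables could look like; the column names follow the prose above and the queries later in this post, but the types and constraints are my guesses rather than our real schema:

-- Sketch only: types, defaults, and constraints are assumptions.
CREATE TABLE tickets (
  id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  slug        text NOT NULL UNIQUE,  -- human-readable team slug
  ticket      text NOT NULL,         -- full encrypted ticket string
  inserted_at timestamptz NOT NULL DEFAULT now(),
  updated_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE submissions (
  id                         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  ticket_id                  bigint NOT NULL REFERENCES tickets (id),
  content                    bytea NOT NULL,  -- the player's submission
  submitter_ip               inet NOT NULL,
  private_result             text,            -- full output, visible to us
  result                     text,            -- team-visible output
  status                     integer,         -- container exit status
  started_at                 timestamptz,
  finished_at                timestamptz,
  instance_id                text,
  instance_identity_document text,
  inserted_at                timestamptz NOT NULL DEFAULT now(),
  updated_at                 timestamptz NOT NULL DEFAULT now()
);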
To make brute-forcing a less-effective strategy, we have a calculated_penalty view. This adds the team’s submission count as seconds to their inserted_at column for sorting, so that a submission from a team with 5 submissions gets to run before a simultaneously-submitted submission from a team with 6 submissions. Originally, this would also delay a team’s submission even with idle jetbridges, but that’s actually worse for everyone, since we couldn’t pause an inflight submission for a less-punished one.
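A view along these lines would match that description and the locking query below; the exact SQL we shipped may have differed:

-- Sketch of calculated_penalty: one row per ticket, with one second of
-- penalty per submission that team has already made.
CREATE VIEW calculated_penalty AS
SELECT t.id,
       count(s.id) * interval '1 second' AS penalty_seconds
FROM tickets AS t
LEFT JOIN submissions AS s ON s.ticket_id = t.id
GROUP BY t.id;

Since penalty_seconds comes out as an interval, it can be added straight onto inserted_at for sorting, which is exactly what the locking query does.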
Locking and Running Submissions
Jetbridge handles locking and running submissions.
In a loop, it tries to start a transaction and lock a submission, and failing that, waits for a notification before looping again.
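The waiting half is tiny in SQL terms: the Jetbridge connection LISTENs on a channel that a trigger on the submissions table NOTIFYs (the channel name here is illustrative, as in the trigger sketch earlier), and the process blocks until a notification arrives:

-- Wake up when a submission row changes; the client blocks on the connection
-- until Postgres delivers a NOTIFY, then loops back to try locking a submission.
LISTEN submissions_changed;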
The “What is SKIP LOCKED for in PostgreSQL 9.5” post from 2ndQuadrant was very useful in building this part. We lock and load a submission for the duration of a transaction (in a way that’ll automatically unlock if our connection breaks for some reason):
UPDATE submissions
SET started_at = clock_timestamp()
WHERE id = (
  SELECT sub.id
  FROM submissions AS sub
  JOIN calculated_penalty AS pen ON pen.id = sub.ticket_id
  WHERE
    sub.started_at IS NULL
  ORDER BY
    sub.inserted_at + pen.penalty_seconds ASC
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
RETURNING id, content, abbrev(submitter_ip), ticket_id;
An aside, from https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-CURRENT:
transaction_timestamp() is equivalent to CURRENT_TIMESTAMP, but is named to clearly reflect what it returns. […] clock_timestamp() returns the actual current time, and therefore its value changes even within a single SQL command. […] now() is a traditional PostgreSQL equivalent to transaction_timestamp().
From the inner select out, we find the unlocked submission with the smallest sum of insertion time and penalty seconds, lock it for update, update it with Postgres’ current time (see aside), and return what we need to run the submission.
Once we’ve got the submission ready to fly, Jetbridge creates a container from the challenge image, copies the submission content into it, starts it with its stdout and stderr file descriptors connected to the Jetbridge process, and starts a timer.
Then it waits for one of a few things. If the container exits, we keep the exit status, stdout, and stderr. If the timer expires or the stdout/stderr buffers blow past their limit, we clobber stdout and the exit status.
They go back into the database, along with information about where it ran:
UPDATE submissions
SET
finished_at = clock_timestamp(),
status = $1,
private_result = $2,
result = $3,
instance_id = $5,
instance_identity_document = $6
WHERE id = $4;
We commit this transaction and then exit.
The container system was already told to delete the container after it ran, the OS frees our resources, Postgres remembers that we ran, and systemd will restart Jetbridge for the next run.
What's the Security Story about Running Player Code in Containers?
There are some risks, and we worked to mitigate them.
We generally ran things as unprivileged users both inside and outside containers.
The Spectre challenge "Spectrel Imaging" didn't use a normal programming language; instead, it accepted very limited input for a program that handled the low-level primitives necessary for players to creep on a variable.
The other four challenges all used the Deno JavaScript runtime, which has pretty granular permissions. We carefully (through trial and error) found a minimum set of permissions for each challenge, none of which involved network access or file writes. Without that, I'd've had to do much harder research into seccomp (the LegitBS strategy), a safer container runtime (gVisor), a more isolated container environment (k8s on someone else's computer), or some other option.
Operations and Frustrations
We started each async challenge off with twenty-one Jetbridge runners. The first of each we provisioned by hand; then we shut them down, imaged them, and spun each image up twenty more times. Most challenges we cut back to five runners after an hour, except Spectrel Imaging.
Since we only had five challenges using this system, the initial Jetbridge instances and the Lounge instances were all built with a hand-operated playbook. Next time.
Tickets have a very strict format:
@unwrapper ~r{(ticket\{)?(?<slug>[a-z0-9]+):(?<content>[a-zA-Z0-9\-_]+)(\})?}
This regular expression parses a ticket, but the version the player presents often has whitespace around it from the registrar. We put the parsed, whitespace-free ticket in the database but put the player-given, whitespace-decorated ticket in the session, which meant future page loads wouldn't find it in the database. We fixed that a couple hours into the game.
Big submissions had occasional issues. This didn't show up with any of the Deno challenges, but Spectrel Imaging required multi-megabyte submissions. Since Spectrel Imaging only showed up towards the end of the game, and it was merely unstable and not unusable, we left it as is.
We never figured out a good way to see what's locked, since Postgres isolates transactions. I spent a bit of time with the pg_locks view, but other priorities won out. If anyone has some PG tips about this, let me know.
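A starting point (a sketch, not something we ran during the game) is joining pg_locks to pg_stat_activity to see which backends hold locks on the submissions table and what they're running. It still won't name the locked rows, since row-level locks live in the tuple headers and only appear in pg_locks when another backend is waiting on them:

-- Which backends hold locks on submissions, and what are they running?
-- (a starting point only; it shows table-level locks, not the locked rows)
SELECT a.pid,
       l.mode,
       l.granted,
       a.state,
       left(a.query, 60) AS current_query
FROM pg_locks AS l
JOIN pg_stat_activity AS a ON a.pid = l.pid
WHERE l.relation = 'submissions'::regclass
ORDER BY a.pid;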
Conclusion
Almost a year later, I'm still really happy this system worked. Jetbridge and Lounge allowed us to build a unique set of challenges and introduce some unfamiliar concepts both to players and to us in Mission Control. I learned quite a bit about Phoenix and Postgres, and I'm looking forward to being able to do more work with these in the future.