Bryce Kerley

Asynchronous Challenges with Lounge and Jetbridge

Two forces conspired to bring about the asynchronous challenges in Hack-A-Sat 4 Qualifiers: the precise timing required by CPU execution vulnerabilities, and the coarse, isolated scheduling that small satellite operations require.

Meltdown and Spectre are landmark vulnerabilities (CVE-2017-5753, CVE-2017-5754, and CVE-2017-5715) in speculative execution, one of the techniques fundamental to making computers fast. As soon as they hit, one of my coworkers, Jason, became enthralled with the possibilities of building a CTF (the hacking kind of capture-the-flag) challenge using them. However, they hit after our time building contests as Legitimate Business Syndicate, so we didn’t have an immediate outlet for them.

As the Hack-A-Sat (HaS) program was starting to spin up, I started learning more about satellites (anything at all about satellites, really; my career background before HaS was focused on datacenter-type stuff on reliable networks). The surprising thing to me was that cheaper satellites often have what I consider to be awful connectivity: slow links to a scant few ground stations. This leads to a programming model more like my mom’s punchcard experience than ssh-ing to a shared machine in a big rack somewhere.

Hack-A-Sat gave our team an opportunity to develop challenges in a very freeform manner. Our main goal for qualifiers was to prepare teams for satellite operations, defense, and offense in finals; part of that can be vulnerabilities we find interesting, so Spectre was up for consideration.

Jason turned around a proof-of-concept pretty quickly, and it worked on the servers we liked to rent, but we didn’t like how vulnerable it was to noisy neighbors.

Hack-A-Sat Quals Challenge Infrastructure and Spectre

Our normal challenge infrastructure uses DNS round-robin load balancing to send a player to one of several machines running the challenge, and after their ticket is accepted, we spin up a per-connection container for them to try and get the flag out of. It works well, as long as there’re enough machines for each concurrent player to get the capacity they need to make progress on the challenge. Sometimes this is easy and cheap, and sometimes it isn’t, but most challenges don’t run at wide-open throttle all the time, so it works pretty well even when we have many times more connections than cores.

Our implementation of the Spectre weakness used the real CPU’s cache and clock, and also took minutes for our reference solver. It wasn’t going to work if multiple players shared the same core. This meant we’d either have to have enough machines for each player to get one, or accept a ridiculously frustrating experience for players with our normal architecture.

I started researching ways to make the challenge more tolerant of noisy neighbors through simulation, but I ended up deciding that the most tolerable solution was a queueing system that runs submissions one per instance, asynchronously and isolated from player interaction, with the added benefit of simulating the satellite-operations style of challenge.

Job Queues

I’ve built a lot of web applications, and a common experience is users dispatching a slow process that doesn’t need to happen within the request/response cycle. While Queues Don’t Fix Overload, they do allow you to decompose some operations into a latency-sensitive part (sending the user’s browser some HTML and letting them get on with their day) and a latency-insensitive part (sending the user an email so they can read it minutes or days later). In the Ruby on Rails world, they’ve been mainstream for over a decade, and we have a lot of experience using them on Hack-A-Sat. The registrar (where players register and play quals) uses que, which I picked because it works with PostgreSQL and doesn’t require a second datastore.

Using que was kind of awkward with generator challenges though. These have a Ruby-based “generator runner” that gets generation requests after the general-purpose registrar worker makes a ticket for a generator challenge. The generator runner hangs out on the que table, locks a request it knows how to answer, runs the generator container, and puts the files that container outputs in a tarball on an HTTP server for players. It meant there was a Ruby thing that had to be copied around and needed native extensions set up, which wasn’t too bad, but I kept running into weird issues with how its database table metadata interacted with how Rails/pg_dump does structure dumps. Additionally, I’d been reading more about database locks, and wondered if you could build a job queue with row-level locks enforced by the database.

Lounge and Jetbridge Design

After Hack-A-Sat 3 Finals, I had a really nice vacation and worked on another project for a couple months. Once that project closed out, it was time to really get into HaS4 quals, and by then I had a workable design for running asynchronous challenges like Spectre. After running a couple big CTF finals projects in the Elixir Phoenix framework, I was pretty happy with how the LiveView real-time system worked.

Specifically, I had a workable strategy for sending events from Postgres triggers to browser LiveView clients:

  1. A subscriber process in the Phoenix application LISTENs on a Postgres channel. This costs a connection slot per app server, but they’re pretty cheap in 2023.

  2. A LiveView for a player subscribes to notifications relevant to them on the application’s Endpoint.

  3. A trigger in Postgres NOTIFYs a channel (with the primary key in the payload) on any INSERTs or UPDATEs to the table in question. This does add some latency to these changes, but 2023…

  4. When the notification comes in to the subscriber, it loads the columns we care about from the database, and local_broadcasts them through the application’s Endpoint.

  5. When the LiveView receives the notification and materialized row in handle_info, it updates the assigns (and, through them, the page).
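
The database side of step 3 is plain Postgres. Here’s a minimal sketch of what such a trigger could look like; the function, trigger, and channel names are hypothetical, and only the submissions table comes from the real schema:

-- Send the changed row's primary key on a channel whenever a
-- submission is inserted or updated.
CREATE OR REPLACE FUNCTION notify_submission_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('submission_changes', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER submissions_notify
    AFTER INSERT OR UPDATE ON submissions
    FOR EACH ROW EXECUTE FUNCTION notify_submission_change();

The subscriber from step 1 then only needs to LISTEN on that channel and reload the row named in each payload.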

With this strategy, I could build the “Lounge” application that accepts player submissions, puts them in Postgres, and waits for those submissions to change.

The other entity is a Crystal program called “Jetbridge,” that takes the player submissions from the Lounge when ready and actually lets them run.

Why Crystal? I'd done this kind of work in Ruby and Python before, but the runtime requirements aren't my favorite to install, especially since Jetbridge runs outside of a container. Crystal really nails the single-binary installation, and I'm a lot more comfortable working with it than I am with Golang or Rust.

Database Design

We’ve been using a system of tickets and receipts since Hack-A-Sat 1 in 2020, and really like what it gives us! Our async challs system has a tickets table; this is basically team authentication, and we keep both the full encrypted ticket string there, and also separate out the human-readable slug.

The other table is for submissions. It holds a ticket_id, the submission content, the IP address of the player submitting it, private and team-visible results, the exit status, normal inserted_at and updated_at timestamps, and metrics about when (started_at, finished_at) and where (instance_identity) it ran.
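
Put together, the submissions table looks roughly like this. This is a hypothetical reconstruction from the description above and the queries below, not the shipped schema; the exact column types are guesses:

CREATE TABLE submissions (
    id bigserial PRIMARY KEY,
    ticket_id bigint NOT NULL,        -- which team's ticket this belongs to
    content text NOT NULL,            -- the player's submission
    submitter_ip inet,                -- who sent it
    result text,                      -- team-visible result
    private_result text,              -- result the team doesn't see
    status integer,                   -- container exit status
    started_at timestamptz,           -- when a Jetbridge picked it up
    finished_at timestamptz,          -- when it finished running
    instance_id text,                 -- where it ran
    instance_identity_document text,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);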

To make brute-forcing a less-effective strategy, we have a calculated_penalty view. This adds the team’s submission count as seconds to their inserted_at column for sorting, so that a submission from a team with 5 submissions gets to run before a simultaneously-submitted submission from a team with 6 submissions. Originally, this would also delay a team’s submission even with idle jetbridges, but that’s actually worse for everyone, since we couldn’t pause an inflight submission for a less-punished one.
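
A view along these lines could compute that penalty; this is a hypothetical sketch that matches how the view is used in the locking query below, not the exact definition we shipped:

CREATE VIEW calculated_penalty AS
    SELECT ticket_id AS id,
           count(*) * interval '1 second' AS penalty_seconds
        FROM submissions
        GROUP BY ticket_id;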

Locking and Running Submissions

Jetbridge handles locking and running submissions.

In a loop, it tries to start a transaction and lock a submission, and failing that, waits for a notification before looping again.

The “What is SKIP LOCKED for in PostgreSQL 9.5” post from 2ndQuadrant was very useful in building this part. We lock and load a submission for the duration of a transaction (in a way that’ll automatically unlock if our connection breaks for some reason):

UPDATE submissions
    SET started_at = clock_timestamp()
    WHERE id = (
        SELECT sub.id
            FROM submissions AS sub
            JOIN calculated_penalty AS pen ON pen.id = sub.ticket_id
            WHERE sub.started_at IS NULL
            ORDER BY sub.inserted_at + pen.penalty_seconds ASC
            FOR UPDATE SKIP LOCKED
            LIMIT 1
    )
    RETURNING id, content, abbrev(submitter_ip), ticket_id;

From https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-CURRENT:

transaction_timestamp() is equivalent to CURRENT_TIMESTAMP, but is named to clearly reflect what it returns. […] clock_timestamp() returns the actual current time, and therefore its value changes even within a single SQL command. […] now() is a traditional PostgreSQL equivalent to transaction_timestamp().

From the inner select out, we find the unlocked submission with the smallest sum of insertion time and penalty seconds, lock it for update, update it with Postgres’ current time (see aside), and return what we need to run the submission.

Once we’ve got the submission ready to fly, Jetbridge creates a container from the challenge image, copies the submission content into it, starts it with stdout and stderr’s file descriptors connected to the Jetbridge process, and starts a timer.

Then it waits for one of a few things. If the container exits, we keep the exit status, stdout and stderr. If the timer expires or the stdout/stderr buffers blow past their limit, we clobber stdout and the exit status.

They go back into the database, along with information about where it ran:

UPDATE submissions
    SET
        finished_at = clock_timestamp(),
        status = $1,
        private_result = $2,
        result = $3,
        instance_id = $5,
        instance_identity_document = $6
    WHERE id = $4;

We commit this transaction and then exit.

The container system was already told to delete the container after it ran, the OS frees our resources, Postgres remembers that we ran, and systemd will restart Jetbridge for the next run.

What's the Security Story about Running Player Code in Containers?

There are some risks, and we worked to mitigate them.

We generally ran things as unprivileged users both inside and outside containers.

The Spectre challenge "Spectrel Imaging" didn't use a normal programming language, instead accepting very limited input for a program that handled the low-level primitives necessary for players to creep on a variable.

The other four challenges all used the Deno JavaScript runtime, which has pretty granular permissions. We carefully (through trial and error) found a minimum set of permissions for each challenge, none of which involved networking or file writes. Without that, I'd've had to do much harder research into seccomp (the LegitBS strategy), a safer container runtime (gVisor), a more isolated container environment (k8s on someone else's computer), or some other option.

Operations and Frustrations

We started each async challenge off with twenty-one Jetbridge runners. The first of each we provisioned by hand, then shut down, imaged, and spun up twenty more from that image. Most challenges we cut back to five runners after an hour, except Spectrel Imaging.

Since we only had five challenges using this system, both the initial Jetbridge instances and the Lounge instances were all built with a hand-operated playbook. Next time.

Tickets have a very strict format:

@unwrapper ~r{(ticket\{)?(?<slug>[a-z0-9]+):(?<content>[a-zA-Z0-9\-_]+)(\})?}

This regular expression parses a ticket, but if you save the version the player presents, it often has whitespace from the registrar. We put the parsed whitespace-free ticket in the database, but put the player-given whitespace-decorated ticket in the session, which meant future page loads wouldn't find it in the database. We fixed that a couple hours into the game.

Big submissions had occasional issues. This didn't show up with any of the Deno challenges, but Spectrel Imaging required multi-megabyte submissions. Since Spectrel Imaging only showed up towards the end of the game, and it was merely unstable and not unusable, we left it as is.

We never figured out a good way to see what's locked, since Postgres isolates transactions. I spent a bit of time with the pg_locks view, but other priorities won out. If anyone has some PG tips about this, let me know.
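
One starting point is joining pg_locks to pg_stat_activity: it shows which backends hold locks on the submissions table, though not which specific rows, since FOR UPDATE row locks are recorded on the tuples themselves rather than in pg_locks. A sketch:

-- Which backends hold locks that mention the submissions table,
-- and what they're currently running.
SELECT l.pid, l.mode, l.granted, a.state, a.query
    FROM pg_locks AS l
    JOIN pg_stat_activity AS a ON a.pid = l.pid
    WHERE l.relation = 'submissions'::regclass;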

Conclusion

Almost a year later, I'm still really happy this system worked. Jetbridge and Lounge allowed us to build a unique set of challenges and introduce some unfamiliar concepts both to players and to us in Mission Control. I learned quite a bit about Phoenix and Postgres, and I'm looking forward to being able to do more work with these in the future.

Mike Walker

Hack-A-Sat 2022 Finals: Teams on the attack

At the end of October, we hosted the 2022 Hack-A-Sat Finals competition.  Finals was structured as an attack-defend CTF where each team was given control of their own satellite. Teams earned points by maintaining control of it, defending it from attacks while attacking other satellites, and solving a series of hacking challenges. Challenges varied widely, including ground station crypto vulnerabilities, flight software bugs, RISC-V ROP chains, webserver attacks, data mining, on-orbit science missions, scheduling ground station contacts, pointing antennas, and safeguarding radio links. In this chaotic environment, teams needed to balance their offensive and defensive tactics to disrupt the satellite operations of others while optimizing their own.

At around six and a half hours into the game, teams launched their first strikes against each other’s spacecraft. In this post, we dig into the attack itself, the effects it had on the satellite Attitude Determination and Control System (ADCS), and how defenders could mitigate it.

Understanding the attack

The most successful attack rested on two key pillars: leaking radio configurations used by other teams and sending malicious commands aimed at abusing an on-board ADCS control algorithm.


The radio configurations for all ground stations were leaked

The ‘403 Denied’ challenge presented teams with a webserver vulnerability which, when exploited, provided not only a flag and points, but also access to a treasure trove of data being collected. Teams were able to scrape this database and leak data from all 27 ground stations in the game. Critically, this data included the configuration for each ground station radio, which was updated at a 30-second interval.

By continuously collecting this data, an offensively minded team could deduce the radio settings being used by other teams. Once another team’s radio settings were known, they could be used to send malicious commands to that team’s satellite.

The ADCS algorithm doesn’t check if control constants are stable

The initial ADCS settings provided were poorly optimized, and teams were expected to modify control constants to improve their satellite’s performance and gain more points via SLA. This mechanism, however, implements weak bounds checking and allows malicious, unstable control constants to be used. In real systems, this occurs when designers assume that “only valid commands will ever reach the satellite.” Assumptions like this invite would-be attackers, and this one was included in the Hack-A-Sat flight software to create an attack vector for teams against the spacecraft ADCS of their opponents. This invitation was accepted.

Event log from Solar Wine during an ADCS attack

 

Occurrences

To see the attack in action, you can watch the game visualization at game time 2023-01-01 06:39:00 UTC. In this scenario, Poland Can Into Space takes control of the Mauritius ground station and uses it to attack both Single Event Upset and Welt ALLES!.

These two instances of the attack are the first occurrences of it in the game. After this, the attack was used frequently by multiple teams.

Effects of the attack

Command: Unstable Control Constants

The malicious control constants cause the satellite to lose stability and begin tumbling. The rate of tumble increases until the reaction wheels no longer have any command authority.

The reaction wheels lose command authority when they reach their max spin rate (saturation). This means that full 3-axis control of the satellite is no longer possible until the wheels have been de-spun via magnetorquers or a space tug request to the admins. Since de-spinning the wheels takes approximately 40 minutes, this leaves the satellite without attitude control for quite a while.

Command: Safe Mode Off

Turning off the safe mode app compounds the effects of this attack.

In normal operation, once the satellite is tumbling or the wheels reach a certain percentage of saturation, safe mode would activate. This would immediately do the following:

  • Reset the control constants to their defaults

  • Begin desaturation of the reaction wheels

  • Set the radio to the default parameters

With safe mode disabled, the satellite is allowed to continue gaining angular momentum until the wheels reach saturation.

After enough time, the wheels saturate, and eventually the numerical integration in the control loop encounters a floating-point error from the unstable growth. This crashes the flight software.

Combined

To summarize, the combined effects of these malicious commands are devastating:

  • Loss of SLA in 2/4 categories

  • Loss of contact with satellite

  • Crashed flight software

  • Time required to recover

Unstable ADCS control constants result in a satellite that is tumbling

Unstable ADCS control constants result in saturated reaction wheels

Poland Can Into Space attacks Welt ALLES! (30x real-time)

Defending against the attack

While attacking the ADCS can cause frustration and confusion, there are strategies to mitigate or even avoid the attack altogether.

Mitigate the Attack

While the attack against the ADCS can be devastating, there is an opportunity to detect it and take corrective action. Once the attack lands, it takes approximately 6 minutes for the reaction wheels to spin up and reach saturation. It takes additional time after that for the flight software to hit a floating-point error and crash.

The wheel speed and angular velocity of the satellite are available through telemetry. This time window provides defenders an opportunity to detect that an attack has occurred.

The simplest strategy to recover from this attack would be:

  1. Set ADCS to the ‘uncontrolled’ state. This will stop all positive control of the satellite and prevent the attack from causing further damage.

  2. Re-enable the safe mode app. This will immediately begin to de-spin the reaction wheels.

This strategy was not observed in practice.

The following strategy is simpler and mostly effective.

Protect Your Radio Settings

As previously discussed, this attack relies on knowing the configuration parameters of a specific satellite’s radio. However, this information was never leaked directly. Only ground station radio parameters were leaked, and only once every 30 seconds.

This means that when defenders are using a ground station to communicate with their satellite, they are potentially vulnerable. This is particularly true at the polar locations where many ground stations (controlled by other teams) are in view at the same time.

Once communication is complete, the final step a team should take is to change the settings of their spacecraft radio to something different than the parameters leaked by the ground station.

Leverage the Network of Competitive Ground Stations

Every team was given dedicated ground stations located at Svalbard and McMurdo Station. Given the near polar orbit of the satellite, these ground stations were convenient and available every half orbit.

Using these ground stations was risky, however. An offensively minded team can wait for you to start communicating with your satellite and then attack using their polar ground station.

The remaining ground stations were more sparsely distributed and limited users to 6 minutes of communication.

Maximizing use of these ground stations limits exposure to other teams looking to steal your radio settings and connect to your satellite.

Make Contact at Higher Elevation Angles During Each Pass

Completely avoiding the ground stations at McMurdo and Svalbard is an unrealistic strategy, as it is too costly in terms of missed opportunities. During the beginning and end of each pass (i.e., at low elevation angles), many satellites fit within one antenna ground beam. This provides an easy opportunity to attack multiple satellites at once by looping through all known radio settings, without the need to steer the antenna. Defenders can defer communication until closer to the zenith of each pass, forcing attackers to specifically point their antenna at the defender’s satellite.

An opportunity to mitigate the attack exists in the period after the attack lands and before the wheels saturate

The non-polar ground stations could be accessed by any team and provide less exposure to attacks while communicating with your satellite

At the horizon, 3-4 satellites fit into an antenna beam making it easy to spam attacks

At zenith, the beam only covers 1-2 satellites which means attacks must be more deliberate

Analysis

First Teams to Act on Offense had an Advantage

Poland Can Into Space was the first team to deploy this attack and it gave them a competitive advantage. While other teams were dealing with the effects of this attack, Poland Can Into Space was able to continue operating their satellite, gain SLA points and work on other challenges. They also waited to release this attack until they had successfully implemented at least one defensive strategy against it.

Attacking Prevents Other Teams from Completing Satellite Operations

Satellite contact with the ground is limited by the available ground stations and the orbit mechanics of the satellite. Most teams had plans for how they wanted to use each contact window with a ground station. Attacking forces another team to abandon their plans and instead focus on recovery of their satellite.

A successful attack interrupts another team's decision loop and forces them to reallocate their resources to defense.

Attacking Stops Other Teams from Attacking You

When other teams are busy recovering their satellite, they have less time and resources to mount retaliatory attacks.

Understanding the Attack Surface is Critical

Poland Can Into Space and Spacebits R Us realized that leaked radio settings were the keystone of any attack. Both teams were able to devise a scheme to rotate their radio settings and avoid detection.

This scheme protected them, and as a result it was rare that they were successfully attacked.

Situational Awareness is Extremely Important

It was possible to detect an attack and correct it before it resulted in significant cost. This required either automated validation checking of incoming telemetry or an alerting system.

Even with Perfect Data, it’s Hard to Understand what is Happening

The admin team for this game had access to "truth level" information throughout the entire game. Even with a large amount of information, it took the admin team time, and a lot of caffeine (look at my desk!!!) to fully understand the attacks as they occurred.

Lots of caffeine was required to keep track of the attacks during the game

Elizabeth Moeller

Hack-A-Sat 2020

After months of collaboration with the Air Force Research Laboratory, Cromulence successfully hosted the Hack-A-Sat competition at DEF CON 28 Safe Mode. The HAS qualifying round concluded in May with 2,213 teams registered and over 6,000 players competing in the Jeopardy-style Capture the Flag (CTF) competition. Top teams qualified for the HAS final event, hosted virtually from Cromulence's Melbourne, FL facility. The final event, which included flatsats "orbiting" on a custom-built carousel, was held between August 7th and 8th and featured 8 teams with players from 12 countries competing for a chance to win a piece of the $100,000 prize pool ($50,000 1st place, $30,000 2nd place, and $20,000 3rd place).

In addition to being a technical challenge for top hacking teams, HAS also had the goals of spreading awareness of the need for cybersecurity in space and providing opportunities for education of up-and-coming hackers. To this end, the challenges and solutions for both events have now been made available on Cromulence's GitHub. We hope the release of this software will help launch new collaborative efforts between the cybersecurity community and satellite operations.

https://github.com/cromulencellc/hackasat-qualifier-2020

https://github.com/cromulencellc/hackasat-final-2020

