What happens after you run git push?


Security. It makes some customers suspicious of a product. It turns other customers away from a product entirely. Convincing customers that trusting a new vendor over the existing incumbent is a good idea rarely takes a brief or minor explanation. So here we are, blowing the lid off and sharing our security architecture with you so you can decide: can you trust us?

To give you the information you need to make a sound decision, we'll begin with a single request and see what happens after you run git push to trigger a CI run. Following its journey, we'll see the interfaces it encounters, the boundaries it crosses, and the extensive measures we take to keep your GitHub secrets, code, and cache artifacts safe — across three axes: CPU, Network, and Disk.

Crossing the GitHub valley.

At the time of writing, over 600 organizations trust us with their CI, so our security needs to be hardcore — from job initialization to completion — for every request. But before you can even send us a single request, you must first sign up for Blacksmith. Your first concern may, and should, be your login credentials. The good news is you don’t have to worry about them. We exclusively support GitHub SSO, and authentication is fully delegated to GitHub’s OAuth flow. So, in this specific scenario, if you already trust GitHub, you have a strong basis for trusting us.

Next, you must set up our GitHub integration for your GitHub organization. Naturally, your second concern is likely the permissions granted to our GitHub integration. It’s important to note up front that our GitHub integration has no ability to directly access organization- or repository-level secrets. In fact, GitHub doesn’t even allow us to request direct access! Moreover, we request only the minimum permissions necessary to make your experience with CI much, much easier, and we want to be transparent about exactly why we request each one. So below is the complete list of permissions our GitHub integration requests, alongside the reasons for requesting them:

  • Read access to members and metadata, so we can list users in our settings page.
  • Read and write access to actions, code, pull requests, and workflows, so our migration wizard can open a pull request with all the required code changes.
  • Read and write access to organization self-hosted runners, so we can mint just-in-time (JIT) tokens and enable our managed runners to run your jobs.

Once you’ve set up our integration, GitHub will begin forwarding your job requests to Blacksmith’s control plane via webhooks, and we will act only on jobs that use one of our valid runner tags. A nice bonus is that, after some usage, this allows us to provide our customers with a comparative performance report of pre-Blacksmith versus post-Blacksmith results.
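To make this concrete, here is a minimal Go sketch of what a webhook endpoint for this step might look like. The signature check and the workflow_job payload fields are standard GitHub webhook machinery; the handler names, the "blacksmith-" label prefix, and the /webhooks/github path are illustrative rather than our actual implementation.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"io"
	"net/http"
	"os"
	"strings"
)

// workflowJobEvent captures the handful of fields we care about from
// GitHub's workflow_job webhook payload.
type workflowJobEvent struct {
	Action      string `json:"action"`
	WorkflowJob struct {
		Labels []string `json:"labels"`
	} `json:"workflow_job"`
}

// verifySignature checks GitHub's X-Hub-Signature-256 header against an
// HMAC-SHA256 of the raw body, keyed with the webhook secret.
func verifySignature(body []byte, header, secret string) bool {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	expected := "sha256=" + hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(expected), []byte(header))
}

func handleWebhook(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	if !verifySignature(body, r.Header.Get("X-Hub-Signature-256"), os.Getenv("GITHUB_WEBHOOK_SECRET")) {
		http.Error(w, "invalid signature", http.StatusUnauthorized)
		return
	}
	var event workflowJobEvent
	if err := json.Unmarshal(body, &event); err != nil {
		http.Error(w, "malformed payload", http.StatusBadRequest)
		return
	}
	// Only act on newly queued jobs that target one of our runner tags,
	// e.g. "blacksmith-4vcpu-ubuntu-2204" (label prefix is illustrative).
	if event.Action == "queued" && hasBlacksmithLabel(event.WorkflowJob.Labels) {
		enqueueJob(event) // hand off to the rest of the control plane (not shown)
	}
	w.WriteHeader(http.StatusAccepted)
}

func hasBlacksmithLabel(labels []string) bool {
	for _, l := range labels {
		if strings.HasPrefix(l, "blacksmith-") {
			return true
		}
	}
	return false
}

func enqueueJob(e workflowJobEvent) { /* placeholder */ }

func main() {
	http.HandleFunc("/webhooks/github", handleWebhook)
	http.ListenAndServe(":8080", nil)
}
```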

Directing traffic from Blacksmith’s control plane.

We’ve made it across the GitHub valley and landed in the heart of Blacksmith: the control plane. This is our central command center — the brain that orchestrates all incoming GitHub Actions jobs. It’s built on AWS, with metadata stored in a Postgres database via Supabase. Only the control plane can access our Postgres database, and yes, everything is encrypted at rest.

One of the great things about hosting our control plane on AWS is that it allows us to easily set up AWS Identity and Access Management (IAM) policies — all following the Principle of Least Privilege (PoLP). All resource access is locked down, and all traffic is encrypted in transit through the enforcement of TLS. Moreover, before a request can even get a foot in the door, it must pass several security checkpoints: authenticated endpoints, rate limiting, input validation, and protection against SQL injection.
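As a rough illustration of what a couple of those checkpoints look like in practice, here is a simplified Go sketch of an authentication and rate-limiting wrapper, plus a parameterized query in place of string concatenation. The limits, route, and table names are made up for illustration; our production checkpoints are more involved.

```go
package main

import (
	"context"
	"database/sql"
	"net/http"
	"sync"
	"time"
)

// rateLimiter is a tiny fixed-window limiter keyed by client identity.
type rateLimiter struct {
	mu     sync.Mutex
	counts map[string]int
	limit  int
}

func newRateLimiter(limit int, window time.Duration) *rateLimiter {
	rl := &rateLimiter{counts: make(map[string]int), limit: limit}
	go func() {
		for range time.Tick(window) { // reset counts every window
			rl.mu.Lock()
			rl.counts = make(map[string]int)
			rl.mu.Unlock()
		}
	}()
	return rl
}

func (rl *rateLimiter) allow(key string) bool {
	rl.mu.Lock()
	defer rl.mu.Unlock()
	rl.counts[key]++
	return rl.counts[key] <= rl.limit
}

// withCheckpoints wraps a handler with two of the checkpoints mentioned
// above: an authentication gate and per-client rate limiting.
func withCheckpoints(rl *rateLimiter, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" { // token validation elided
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		if !rl.allow(r.RemoteAddr) {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next(w, r)
	}
}

// lookupJobs shows the SQL-injection side of the story: user input only ever
// reaches Postgres as a bound parameter, never via string concatenation.
func lookupJobs(ctx context.Context, db *sql.DB, orgID string) (*sql.Rows, error) {
	return db.QueryContext(ctx, "SELECT id, status FROM jobs WHERE org_id = $1", orgID)
}

func main() {
	rl := newRateLimiter(100, time.Minute) // e.g. 100 requests per minute per client
	http.HandleFunc("/api/jobs", withCheckpoints(rl, func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	http.ListenAndServe(":8080", nil)
}
```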

Only after your request has cleared all of our checkpoints do we construct your job’s payload. Part of this payload is a freshly minted JIT token, scoped to a single job and set to expire after 1 hour. The payload is then handed off to our AWS-hosted Redis queue, where it waits for an agent in our dataplane to take over.
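For the curious, GitHub exposes a REST endpoint for minting these JIT runner configurations (POST /orgs/{org}/actions/runners/generate-jitconfig). The sketch below shows roughly how a control plane could call it and assemble a job payload; the struct shapes, label, and payload fields are illustrative, not our exact internals.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// jitRequest mirrors the body of GitHub's
// POST /orgs/{org}/actions/runners/generate-jitconfig endpoint.
type jitRequest struct {
	Name          string   `json:"name"`
	RunnerGroupID int      `json:"runner_group_id"`
	Labels        []string `json:"labels"`
}

// jitResponse holds the field we care about: the encoded JIT config that the
// runner binary consumes to register for exactly one job.
type jitResponse struct {
	EncodedJITConfig string `json:"encoded_jit_config"`
}

// jobPayload is an illustrative shape for what gets handed to the Redis
// queue for an agent in the dataplane to pick up.
type jobPayload struct {
	JobID     string    `json:"job_id"`
	JITConfig string    `json:"jit_config"`
	ExpiresAt time.Time `json:"expires_at"`
}

func mintJITConfig(ctx context.Context, org, installationToken, jobID string) (string, error) {
	body, _ := json.Marshal(jitRequest{
		Name:          "blacksmith-" + jobID,                    // one runner per job
		RunnerGroupID: 1,                                        // default runner group
		Labels:        []string{"blacksmith-4vcpu-ubuntu-2204"}, // illustrative label
	})
	url := fmt.Sprintf("https://api.github.com/orgs/%s/actions/runners/generate-jitconfig", org)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+installationToken)
	req.Header.Set("Accept", "application/vnd.github+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return "", fmt.Errorf("generate-jitconfig failed: %s", resp.Status)
	}
	var out jitResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.EncodedJITConfig, nil
}

func main() {
	cfg, err := mintJITConfig(context.Background(), "acme-org", "<github-app-installation-token>", "job-123")
	if err != nil {
		panic(err)
	}
	payload := jobPayload{JobID: "job-123", JITConfig: cfg, ExpiresAt: time.Now().Add(time.Hour)}
	fmt.Printf("enqueue to Redis: %+v\n", payload) // in the real system this is pushed onto the queue
}
```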

Into the depths of Blacksmith’s dataplane.

We now descend into the lowest layer of Blacksmith: our dataplane. If the control plane is the brain, the dataplane is the muscle. This is where your GitHub Actions jobs actually run, on our hardware. More specifically, a fleet of 32 vCPU boxes, procured from data center providers in the US and EU. But this layer is filled not only with bare metal machines, but also with Firecracker microVMs, GitHub Actions Runners, MinIO blob stores, Ceph storage clusters, Tailscale VPNs, and so much more. It is here, in the heat and pressure of the dataplane, that we isolate the execution of each GitHub Actions job across three axes: CPU, Network, and Disk.

The path into our dataplane is intentionally a difficult one, and it begins with a strong first line of defense: our network. Our network is secured with Tailscale, a VPN service built on WireGuard, an open-source protocol for encrypted virtual private networks. With Tailscale, our fleet of bare metal machines lives behind a tight-knit, private network. Every one of them is part of a Tailscale tailnet, meaning SSH access from the outside world is entirely locked down. No public ports, no guessable IPs, no surprises. What’s more, communication between services in the dataplane flows through a Tailscale VPN. But it doesn’t stop there — all deployments to our machines happen exclusively over Tailscale SSH, ensuring encrypted, identity-based access between trusted devices only.

Once inside, each physical machine in our private network runs an agent that, among other responsibilities, is tasked with authenticating to our AWS-hosted Redis queue using Doppler-injected credentials and pulling job payloads from it. Once your job request is picked up by an agent, it runs your job in an ephemeral microVM managed by Firecracker — the same microVM technology used by AWS to run millions of untrusted workloads for AWS Lambda and Fargate. These microVMs leverage Kernel-based Virtual Machine (KVM) virtualization to run their own guest kernel and user space, isolated from both the host and other microVMs. This strong isolation lets us safely run multiple customer workloads on the same machine — unlike Docker, where containers share the host kernel and rely on a much thinner security boundary. Firecracker also allows us to use cgroups to enforce CPU and memory limits on each microVM, ensuring fairness across jobs and preventing noisy-neighbor problems.
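Firecracker is driven through a small HTTP API served over a local unix socket. A stripped-down Go sketch of how an agent might configure and boot one of these microVMs looks roughly like this; the socket path, kernel and rootfs paths, and resource sizes are illustrative stand-ins, not our production values.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net"
	"net/http"
)

// fcClient talks to a Firecracker process over its API unix socket.
type fcClient struct {
	http *http.Client
}

func newFCClient(socketPath string) *fcClient {
	return &fcClient{http: &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return net.Dial("unix", socketPath)
			},
		},
	}}
}

func (c *fcClient) put(path string, body any) error {
	buf, err := json.Marshal(body)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPut, "http://localhost"+path, bytes.NewReader(buf))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.http.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("PUT %s: %s", path, resp.Status)
	}
	return nil
}

func main() {
	fc := newFCClient("/tmp/firecracker.sock")

	// Size the guest; CPU and memory limits are also enforced on the host
	// side via cgroups (values here are illustrative).
	must(fc.put("/machine-config", map[string]any{"vcpu_count": 4, "mem_size_mib": 16384}))

	// Boot a guest kernel that is fully isolated from the host kernel.
	must(fc.put("/boot-source", map[string]any{
		"kernel_image_path": "/var/lib/blacksmith/vmlinux",
		"boot_args":         "console=ttyS0 reboot=k panic=1",
	}))

	// Attach the copy-on-write rootfs clone described in the next paragraph.
	must(fc.put("/drives/rootfs", map[string]any{
		"drive_id":       "rootfs",
		"path_on_host":   "/var/lib/blacksmith/rootfs-job-123.ext4",
		"is_root_device": true,
		"is_read_only":   false,
	}))

	// Start the microVM; the GitHub Actions runner inside it takes over.
	must(fc.put("/actions", map[string]any{"action_type": "InstanceStart"}))
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}
```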

When booted up, each microVM gets a copy-on-write clone of the root file system from GitHub’s official GitHub Actions runner images, which we routinely update to pick up the latest dependency versions as GitHub releases them upstream. We also hydrate the microVM with GitHub’s official GitHub Actions runner binary, which contains the logic responsible for coordinating with GitHub’s control plane to adopt a job. That same runner binary automatically masks any secrets in command-line output and logs.

Once the official GitHub Actions runner binary is up and running, we rely exclusively on the JIT token to adopt and execute a single job. While your job is running, your code and secrets are safe from the outside world since each microVM operates within its own network namespace.
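As a small illustration of that hand-off, the official runner binary accepts an encoded JIT configuration via its --jitconfig flag; it registers, runs exactly one job, and exits. The Go wrapper below is only a sketch, and the install path and environment variable name are assumptions rather than our actual bootstrap code.

```go
package main

import (
	"os"
	"os/exec"
)

// Inside the microVM, a small init step hands the encoded JIT config to the
// official runner binary. With --jitconfig, the runner registers, executes
// exactly one job, and then exits; the token cannot be reused for another job.
func main() {
	jitConfig := os.Getenv("BLACKSMITH_JIT_CONFIG") // injected at VM creation (name is illustrative)

	cmd := exec.Command("./run.sh", "--jitconfig", jitConfig)
	cmd.Dir = "/home/runner/actions-runner" // illustrative install path
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```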

If all this isn’t enough, you’ll be happy to know that after your job completes, the VM is destroyed, along with any of its state, including its filesystem, ensuring that modifications — benign or malicious — don’t persist beyond the life of the job. This, of course, does not include the caching artifacts that you can opt in to store on our disks so they can be shared across job runs — speeding up your Docker builds and more generally your GitHub Actions workflows.

For those not using our caching features (bad choice!), our journey is nearing its end, and you may skip the next section and go straight to the conclusion. For all others (good choice!), we have a few more things to cover regarding how we secure access to your cached artifacts across job runs.

Cache is most definitely king.

For jobs that opt into caching (and really, they should), your artifacts are stored on either our self-hosted MinIO or Ceph clusters — depending on what you’re caching. Simple dependencies go to MinIO. Docker layers and other hefty artifacts land in Ceph. These clusters live on the same fleet of bare metal machines that host the agents, so they inherit the exact same security controls. But storing cache artifacts in a durable manner is only half the battle. Getting them safely in and out is where things get interesting.

It starts with indirect access. MicroVMs — the isolated environments your jobs run in — never talk directly to storage. Instead, all storage traffic is proxied through the agent running on the host machine. This proxying step isn’t just for show; it’s a deliberate control point that ensures all access to storage is authenticated, authorized, and auditable.
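In Go terms, the idea looks something like the sketch below: the microVM can only reach the host agent, and the agent forwards cache traffic onward only after checking it. The addresses, routes, and header name are illustrative, and the real authorization check is the token validation described next.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// The microVM can only reach the host agent; the agent forwards cache
// traffic to the control plane after checking it.
func main() {
	controlPlane, _ := url.Parse("https://control-plane.example") // placeholder URL
	proxy := httputil.NewSingleHostReverseProxy(controlPlane)

	mux := http.NewServeMux()
	mux.HandleFunc("/cache/", func(w http.ResponseWriter, r *http.Request) {
		// Authenticate and authorize the calling microVM before forwarding
		// (see the token-validation sketch further down).
		if r.Header.Get("X-Cache-Token-Signature") == "" {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		proxy.ServeHTTP(w, r) // every cache request passes through this auditable choke point
	})

	// Listen only on the VM-facing interface; MinIO and Ceph are never
	// reachable from inside the guest.
	http.ListenAndServe("172.16.0.1:9000", mux)
}
```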

When a job payload gets created, the control plane includes more than just a JIT token. It also mints a separate cache token, scoped to your organization, and embeds it in the job payload. The host agent fetches this payload, associates the cache token with the microVM it’s about to launch, and injects that token directly into the microVM’s memory at creation time.

So, when your job attempts to access a cache artifact — whether to read or write — the microVM sends a request to the host agent, asking it to make an authenticated request to the control plane using the cache token. This request is signed with the cache token, allowing the agent to validate it against the in-memory metadata it knows about the microVM. This validation step is what helps prevent man-in-the-middle (MitM) attacks that could attempt to access cache artifacts outside the intended organization or repository scope.
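A minimal sketch of that validation step, assuming an HMAC-style signature over the request (the exact signing scheme, field names, and key layout here are illustrative, not our wire protocol):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// vmMetadata is what the agent remembers about each microVM it launched:
// which org/repo it belongs to and the cache token it was given at creation.
type vmMetadata struct {
	Org        string
	Repo       string
	CacheToken string
}

type agent struct {
	mu  sync.RWMutex
	vms map[string]vmMetadata // keyed by microVM ID
}

// validateCacheRequest checks that a request claiming to come from vmID was
// signed with that VM's cache token, so a request from (or replayed by) a
// different VM cannot reach another tenant's artifacts.
func (a *agent) validateCacheRequest(vmID, artifactKey, signature string) error {
	a.mu.RLock()
	meta, ok := a.vms[vmID]
	a.mu.RUnlock()
	if !ok {
		return fmt.Errorf("unknown microVM %q", vmID)
	}

	mac := hmac.New(sha256.New, []byte(meta.CacheToken))
	mac.Write([]byte(vmID + "\n" + artifactKey))
	expected := hex.EncodeToString(mac.Sum(nil))

	if !hmac.Equal([]byte(expected), []byte(signature)) {
		return fmt.Errorf("signature mismatch for microVM %q", vmID)
	}
	return nil // safe to forward to the control plane on the VM's behalf
}

// signFor is what the in-VM client would do before calling the agent.
func signFor(token, vmID, artifactKey string) string {
	mac := hmac.New(sha256.New, []byte(token))
	mac.Write([]byte(vmID + "\n" + artifactKey))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	a := &agent{vms: map[string]vmMetadata{
		"vm-42": {Org: "acme", Repo: "api", CacheToken: "example-cache-token"},
	}}
	sig := signFor("example-cache-token", "vm-42", "acme/api/docker-layer-abc")
	fmt.Println(a.validateCacheRequest("vm-42", "acme/api/docker-layer-abc", sig))
}
```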

If it all checks out, the agent makes the authenticated request to the control plane, acting on the microVM’s behalf. The control plane logs this action in its database for auditing, validates usage against throttling policies, and then replies with metadata about the artifact: its name, its location, and nothing more than what the token was scoped to access.

With access granted and metadata in hand, the agent performs the final step: an authenticated query to either MinIO or Ceph. Authentication happens via credentials provided by Doppler — injected directly into the agent as environment variables. And, like all internal communication in our dataplane, this request flows over our Tailscale VPN.
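A simplified sketch of that final hop, using the MinIO Go SDK with Doppler-injected credentials and the org/repo namespacing described in the next paragraph; the bucket name, key layout, and environment variable names are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// cacheKey builds the per-tenant namespace: artifacts are addressed by
// organization and repository so one tenant can never name another tenant's
// objects. The layout shown here is illustrative.
func cacheKey(org, repo, artifact string) string {
	return fmt.Sprintf("%s/%s/%s", org, repo, artifact)
}

func main() {
	// Credentials are injected into the agent's environment by Doppler;
	// the variable names and endpoint are placeholders.
	client, err := minio.New(os.Getenv("MINIO_ENDPOINT"), &minio.Options{
		Creds:  credentials.NewStaticV4(os.Getenv("MINIO_ACCESS_KEY"), os.Getenv("MINIO_SECRET_KEY"), ""),
		Secure: true, // traffic also rides over the Tailscale VPN
	})
	if err != nil {
		panic(err)
	}

	obj, err := client.GetObject(context.Background(),
		"blacksmith-cache",                              // bucket (illustrative)
		cacheKey("acme", "api", "node-modules.tar.zst"), // org/repo-scoped key
		minio.GetObjectOptions{})
	if err != nil {
		panic(err)
	}
	defer obj.Close()
	io.Copy(io.Discard, obj) // in reality, streamed back to the requesting microVM
}
```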

But where your data ends up matters just as much as how it gets there. MinIO artifacts are namespaced by organization and repository, ensuring isolation between tenants. These artifacts are encrypted at rest using KES with AWS KMS keys. Ceph goes further: each block device follows the same namespacing pattern, but benefits from cephx authentication and smart data placement via CRUSH maps to ensure resilience against node failure.

Can you trust us?

Oh yeah, and for the box checkers — we are SOC 2 Type 2 and GDPR compliant, and we are in the process of becoming HIPAA compliant in the coming weeks. If you’d like, you can request access to the reports via our trust center. And rest assured that, just to sanity-check ourselves, we pay hackers to pen test our system every quarter. So — what do you think? Can you trust us?
