# nix-ota
Open-source OTA updates for fleets of NixOS devices. A self-hostable
control server + lightweight device agent that ship prebuilt system
closures from a binary cache to devices that don't have your flake.
Think Cachix Deploy, but you run it.
## Architecture
```
┌─────────┐  1. nix build + nix copy   ┌──────────┐
│  CI /   │ ─────────────────────────► │  Binary  │  (Attic / S3 / nix-serve / Cachix)
│ Builder │                            │  Cache   │
└────┬────┘                            └────▲─────┘
     │ 2. publish signed manifest           │
     ▼                                      │ 4. nix copy --from <cache>
┌─────────────┐   3. GET current   ┌────────┴─────┐
│  Control    │ ◄───────────────── │   Device     │
│ Server + UI │   5. POST checkin  │   Agent      │ ──► switch-to-configuration
└─────────────┘                    └──────────────┘
```
The control server **never holds the signing key**. Operators (or CI)
sign manifests with an offline ed25519 key and POST them; devices
verify against a pinned public key. A server compromise cannot push
arbitrary closures.
## Components
| Crate | Binary | Role |
|--------------------|--------------------|------------------------------------|
| `crates/server` | `nix-ota-server` | REST API + SQLite + HTMX dashboard |
| `crates/agent` | `nix-ota-agent` | Polls, verifies, applies, rolls back|
| `crates/publisher` | `nix-ota` | Operator/CI CLI (keygen + publish) |
| `crates/common` | (lib) | Manifest types + ed25519 |
## Quickstart (< 10 minutes)
> 👉 For a complete copy-pasteable setup with two real NixOS flakes
> (server host + device host), see [`examples/`](./examples/).
### 1. Generate a signing key on your workstation
```sh
nix run git+https://linus.dyrehytten.dk/max/nix-ota#nix-ota -- keygen --out ./sign.key
# prints the public key — save it, you'll bake it into every device.
```
### 2. Deploy the server
```nix
# configuration.nix
{
  imports = [ nix-ota.nixosModules.server ];

  services.nix-ota-server = {
    enable = true;
    openFirewall = true;
    publishTokenFile = "/run/secrets/nix-ota-publish-token";
  };
}
```
### 3. Install the agent on a device
```nix
{
  imports = [ nix-ota.nixosModules.agent ];

  services.nix-ota-agent = {
    enable = true;
    server = "https://ota.example.com";
    channel = "prod";
    deviceId = "fridge-007";
    publicKey = "<base64 ed25519 pubkey from step 1>";
    cacheUrl = "https://cache.example.com";
    cachePublicKey = "cache.example.com:abc...=";
    healthCmd = "systemctl is-system-running --wait"; # optional
  };
}
```
### 4. Publish your first update
```sh
nix build .#nixosConfigurations.fridge-007.config.system.build.toplevel
nix copy --to s3://my-cache ./result
nix run git+https://linus.dyrehytten.dk/max/nix-ota#nix-ota -- publish \
  --server https://ota.example.com \
  --token "$(cat publish-token)" \
  --key ./sign.key \
  --channel prod \
  --store-path "$(readlink -f result)" \
  --substituter https://cache.example.com
```
Open `https://ota.example.com/` to watch the fleet pick it up.
## How updates apply
On each poll the agent:
1. Fetches `/channels/<name>/current`.
2. Verifies the ed25519 signature against the pinned key.
3. Rejects manifests with a revision ≤ the last one applied (replay defense).
4. Fetches the closure with `nix copy --from <substituter> <storePath>`.
   Nix verifies cache signatures on every store path.
5. Sets the system profile with
   `nix-env -p /nix/var/nix/profiles/system --set <storePath>`.
6. Activates it with `<storePath>/bin/switch-to-configuration switch`.
7. Runs the optional `healthCmd`. On failure it switches back to the
   previous generation and reports `rolled_back`.
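The checks in steps 2 and 3 and the command sequence in steps 4 to 6 can be sketched in Rust roughly as follows. This is a minimal illustration, not the agent's actual code: `Manifest` and its fields, `Reject`, `argv`, and `plan_update` are hypothetical names, and `signature_ok` stands in for the result of the ed25519 verification.

```rust
/// Hypothetical manifest shape. The real type lives in `crates/common`
/// and may carry different fields.
#[derive(Debug, Clone)]
pub struct Manifest {
    pub revision: u64,
    pub store_path: String,
    pub substituter: String,
}

/// Reasons the sketch refuses a manifest before touching the system.
#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    /// Step 2 failed: the signature does not match the pinned key.
    BadSignature,
    /// Step 3 failed: the revision is not newer than the last one applied.
    Replay { offered: u64, last_applied: u64 },
}

/// Turns a list of string slices into an owned argv vector.
fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|s| s.to_string()).collect()
}

/// Returns the commands for steps 4, 5 and 6 in order, or the reason
/// the manifest is refused. `signature_ok` is an assumption: it stands
/// in for the ed25519 check, which this sketch does not implement.
pub fn plan_update(
    manifest: &Manifest,
    signature_ok: bool,
    last_applied: Option<u64>,
) -> Result<Vec<Vec<String>>, Reject> {
    if !signature_ok {
        return Err(Reject::BadSignature);
    }
    if let Some(last) = last_applied {
        // Replay defense: equal or older revisions are rejected.
        if manifest.revision <= last {
            return Err(Reject::Replay {
                offered: manifest.revision,
                last_applied: last,
            });
        }
    }
    let store_path = manifest.store_path.as_str();
    let switch = format!("{}/bin/switch-to-configuration", store_path);
    Ok(vec![
        argv(&["nix", "copy", "--from", manifest.substituter.as_str(), store_path]),
        argv(&["nix-env", "-p", "/nix/var/nix/profiles/system", "--set", store_path]),
        argv(&[switch.as_str(), "switch"]),
    ])
}
```

Returning the commands as data instead of running them keeps the ordering easy to test; a caller would execute each argv in turn and stop at the first failure.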
## Threat model
| Threat | Mitigation |
|-----------------------------------------|---------------------------------------------------------------------------|
| Compromised control server pushes evil | Manifests must be signed by offline ed25519 key pinned on every device. |
| Compromised cache serves wrong closure | Nix verifies per-path signatures against `trusted-public-keys`. |
| Replay of an older (vulnerable) closure | Manifest carries a monotonic `revision`; the agent persists the last applied value and rejects anything not newer. |
| Random internet caller publishes | `POST /channels/:name/publish` requires bearer token. |
| Random caller reads fleet state | UI/API should be put behind your reverse proxy / SSO. (v1: no built-in auth on reads.) |
| Bad closure bricks device | Health-check + magic-rollback to previous system generation. |
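The health-check row can be pictured with a short Rust sketch. It assumes the `healthCmd` string is run through `sh -c` and that a non-zero exit status means unhealthy; both are assumptions, as are the names `Outcome`, `is_healthy`, and `finish_update` and the `applied` status string. Only `rolled_back` comes from this README. The rollback itself is passed in as a closure because how the agent switches generations is not shown here.

```rust
use std::process::Command;

/// Result of the post-activation health check.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Outcome {
    Applied,
    RolledBack,
}

impl Outcome {
    /// Status string for the check-in. `rolled_back` is the value named
    /// in this README; `applied` is a placeholder chosen for the sketch.
    pub fn as_str(self) -> &'static str {
        match self {
            Outcome::Applied => "applied",
            Outcome::RolledBack => "rolled_back",
        }
    }
}

/// Runs `health_cmd` through `sh -c` (an assumption) and treats a zero
/// exit status as healthy. A command that cannot be spawned counts as
/// unhealthy.
pub fn is_healthy(health_cmd: &str) -> bool {
    Command::new("sh")
        .arg("-c")
        .arg(health_cmd)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Decides the outcome after activation. With no health command the
/// update counts as applied. On a failed check, `rollback` is called
/// once and the outcome is `RolledBack`.
pub fn finish_update<F: FnOnce()>(health_cmd: Option<&str>, rollback: F) -> Outcome {
    match health_cmd {
        Some(cmd) if !is_healthy(cmd) => {
            rollback();
            Outcome::RolledBack
        }
        _ => Outcome::Applied,
    }
}
```

For example, `finish_update(Some("exit 1"), || { /* switch back */ })` yields `Outcome::RolledBack`, while `finish_update(None, || {})` yields `Outcome::Applied` without calling the closure.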
**Key management:** keep `sign.key` offline (hardware token, ops laptop,
or a sealed CI secret). The server never sees it. To rotate: generate a
new key, publish a closure signed with the old key that sets `publicKey`
to the new one, and once devices have applied it, start signing with the
new key.
## Non-goals (v1)
- The server does no Nix evaluation or building — CI does that.
- No replacement for your binary cache — use Attic, Cachix, S3, nix-serve.
- No per-device secrets (use sops-nix / agenix inside the closure).
- No web-based config editing — config lives in your flake repo.
## Development
```sh
nix develop
cargo build --workspace
cargo test --workspace
nix flake check # runs the full NixOS VM test
```
## License
MIT OR Apache-2.0.