Talos Linux

Talos Linux is the operating system on every node of both Kubernetes clusters — the production cluster on Proxmox (3× control-plane + 3× workers) and the edge cluster on Hetzner (1× control-plane). It is immutable, API-driven, and ships with no shell, no SSH, and no package manager.

Why Talos

In short, fewer moving parts:

  • Immutable. No drift between nodes — what you don't configure, you can't accidentally change. Upgrades are atomic image swaps with one-command rollback.
  • API-only. talosctl over mTLS is the only way to interact with a node. There is no SSH to forget to disable, no shell history to leak, no apt-day inventory to maintain.
  • Pre-baked for Kubernetes. Kubelet, etcd, containerd, the CNI dance — all wired up by the OS. The whole machine config fits in a single YAML.
  • Boots from a UKI (Unified Kernel Image). Secure boot, signed kernels, no GRUB to reason about.
  • Predictable. Patches are declarative; the node's running state is exactly what talconfig.yaml says it is.

Alternatives considered

| Option | Why not |
| --- | --- |
| Stock Ubuntu / Debian + kubeadm | Mutable. Drift, package upgrades, SSH attack surface. No upside vs Talos. |
| Flatcar | Closer in spirit but still has SSH and a shell; less Kubernetes-specific |
| Bottlerocket | AWS-flavored; not aimed at on-prem; no Hetzner support |
| Fedora CoreOS | Mutable enough to drift; updates via rpm-ostree; not as opinionated for k8s |
| Plain Talos without Talhelper | Fine, but Talhelper turns 6 nearly-identical machine configs into one config file |

Talhelper

Talhelper is a thin wrapper that turns one declarative config into per-node machine configs:

talos/
├── talos/                    ← production cluster
│   ├── talconfig.yaml        ← cluster-wide + per-node settings
│   ├── talsecret.sops.yaml   ← cluster secrets (PKI, tokens), SOPS-encrypted
│   ├── clusterconfig/        ← rendered per-node configs (apply target)
│   └── patch-*.yaml          ← shared strategic-merge patches
└── edge/                     ← edge cluster
    ├── talconfig.yaml
    ├── talsecret.sops.yaml
    ├── clusterconfig/
    └── patch-*.yaml
Render and apply:

# Decrypt secrets, render, then apply
talhelper genconfig
talhelper gencommand apply | sh

clusterconfig/ is what actually goes onto the nodes; everything else is source.
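
Piping generated commands straight into sh gives no chance to read them first. The sketch below is one possible variation, not the repo's actual workflow: it saves the generated commands to a scratch file, prints them, and only executes them when a CONFIRM variable is set. The function name, the scratch file name, and the CONFIRM variable are hypothetical.

```shell
# Render configs, save the generated talosctl commands for review, and run
# them only when CONFIRM=yes. Run from a cluster directory (talos/ or edge/).
render_and_apply() {
  apply_script="${1:-apply.sh}"      # hypothetical scratch file name
  talhelper genconfig || return 1
  talhelper gencommand apply > "$apply_script" || return 1
  cat "$apply_script"                # show exactly what would run
  if [ "${CONFIRM:-no}" = "yes" ]; then
    sh "$apply_script"
  else
    echo "dry run: set CONFIRM=yes to execute $apply_script" >&2
  fi
}
```

Running `render_and_apply` alone is a dry run; `CONFIRM=yes render_and_apply` executes the saved commands.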

Patches

Both clusters share a baseline of strategic-merge patches:

| Patch | Why |
| --- | --- |
| patch-kubelet.yaml | Tweaks kubelet config |
| patch-etcd.yaml | etcd defaults / quotas |
| patch-disable-kube-proxy.yaml | kube-proxy is replaced by Cilium |
| patch-cilium-fix.yaml | Cilium-specific Talos kernel/config tweaks |
| patch-nameservers.yaml | Resolver pinned to known-good upstreams |

The production cluster carries two extras:

| Patch | Why |
| --- | --- |
| patch-longhorn-extramount.yaml | Extra mount for Longhorn data |
| patch-spegel.yaml | Spegel image-mirror config (containerd registry mirror) |

The edge cluster does not run Longhorn or Spegel, so it doesn't need either.
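
The actual patch contents are not shown here. As a rough illustration of the shape such patches take, the helper below writes two example files: one that disables kube-proxy through the Talos cluster.proxy.disabled setting, and one that sets machine.network.nameservers. The resolver addresses are documentation-range placeholders, and the function name, output directory, and ".example.yaml" file names are hypothetical.

```shell
# Writes two illustrative patch files into a separate directory so the
# real patch-*.yaml files are left alone.
write_example_patches() {
  patch_dir="${1:-example-patches}"
  mkdir -p "$patch_dir" || return 1

  # Disable the bundled kube-proxy (here, Cilium takes over its role).
  cat > "$patch_dir/patch-disable-kube-proxy.example.yaml" <<'EOF'
cluster:
  proxy:
    disabled: true
EOF

  # Pin resolvers; these addresses are placeholders, not real upstreams.
  cat > "$patch_dir/patch-nameservers.example.yaml" <<'EOF'
machine:
  network:
    nameservers:
      - 192.0.2.53
      - 192.0.2.54
EOF
}
```

Each file holds only the keys it changes; the patch is merged on top of the generated machine config rather than replacing it.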

Bootstrap flow

  1. Provision VMs. Proxmox for production (one VM per planned node), Hetzner for edge.
  2. Generate machine configs. talhelper genconfig produces clusterconfig/<node>.yaml for each node.
  3. Boot from Talos image. PXE / ISO / Hetzner snapshot — the node comes up in maintenance mode.
  4. Apply config. talhelper gencommand apply | sh pushes the per-node config; nodes reboot into "configured" state.
  5. Bootstrap etcd. talosctl bootstrap on the first control-plane node.
  6. Hand the kubeconfig over. talhelper gencommand kubeconfig | sh writes a kubeconfig.
  7. Install Flux. From there everything else is GitOps.
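
Steps 2, 4, 5, and 6 above can be sketched as a single shell function. This is a sketch under assumptions, not the repo's actual script: the function name is hypothetical, the first control-plane IP is supplied by the caller, and the talosconfig path assumes Talhelper wrote it to clusterconfig/talosconfig. VM provisioning, image boot, and the Flux install are left out.

```shell
# Run from a cluster directory (talos/ or edge/) once the nodes have
# booted the Talos image and are waiting in maintenance mode.
bootstrap_cluster() {
  first_cp="${1:-}"   # IP of the first control-plane node
  if [ -z "$first_cp" ]; then
    echo "usage: bootstrap_cluster <first-control-plane-ip>" >&2
    return 2
  fi

  # Step 2: render per-node configs into clusterconfig/.
  talhelper genconfig || return 1

  # Step 4: push each node's config. Nodes in maintenance mode may need
  # talosctl's --insecure flag on the first apply (assumption: pass it
  # through gencommand's --extra-flags if so).
  talhelper gencommand apply | sh || return 1

  # Step 5: bootstrap etcd on exactly one control-plane node. This may
  # need a retry if the node is still rebooting.
  talosctl --talosconfig ./clusterconfig/talosconfig \
    --nodes "$first_cp" --endpoints "$first_cp" bootstrap || return 1

  # Step 6: fetch a kubeconfig for the new cluster.
  talhelper gencommand kubeconfig | sh
}
```

Bootstrap must run once per cluster, against a single node. Running it against several control-plane nodes would create competing etcd clusters.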

Upgrades

talhelper genconfig
talhelper gencommand upgrade --extra-flags "--preserve" | sh

--preserve keeps user-data partitions (Longhorn, etc.) across the image swap. Talos upgrades are sequential per node and recoverable — if a node fails to come back, talosctl rollback reverts to the previous image.
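
A sketch of how the upgrade and the rollback path might be wrapped, assuming the talosconfig lives at clusterconfig/talosconfig. The function names and the node IP in the usage note are hypothetical.

```shell
# Re-render configs, then run the generated upgrade commands one node
# after another.
upgrade_cluster() {
  talhelper genconfig || return 1
  talhelper gencommand upgrade --extra-flags "--preserve" | sh
}

# Revert a single node to its previous image if it misbehaves after
# the upgrade.
rollback_node() {
  bad_node="${1:-}"
  if [ -z "$bad_node" ]; then
    echo "usage: rollback_node <node-ip>" >&2
    return 2
  fi
  talosctl --talosconfig ./clusterconfig/talosconfig \
    --nodes "$bad_node" rollback
}
```

For example, `upgrade_cluster` followed by `rollback_node 192.0.2.21` would upgrade everything and then revert only the node at that placeholder address.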

Recovery & rotation

| Scenario | Action |
| --- | --- |
| Single node lost | Re-provision the VM with the same name, re-apply config; etcd readmits it |
| All control-planes lost (worst case) | Restore etcd from a Restic snapshot, reapply configs |
| Cluster CA rotation | Edit talconfig.yaml, regen secrets, roll the cluster one node at a time |
| Talsecret leaked | Wipe + reinstall is faster than rotating; treat the cluster as cattle |
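
For the worst-case row, the etcd side of the procedure might look like the sketch below. It assumes the snapshot file has already been restored from Restic to a local path (the Restic step itself is not shown), that configs have been re-applied to fresh control-plane nodes, and that the talosconfig is at clusterconfig/talosconfig. The function names and the default snapshot filename are hypothetical.

```shell
# Take an etcd snapshot from a healthy control-plane node, for example as
# the input to a backup job.
snapshot_etcd() {
  snap_node="${1:-}"
  snap_out="${2:-etcd.snapshot}"     # hypothetical default filename
  if [ -z "$snap_node" ]; then
    echo "usage: snapshot_etcd <control-plane-ip> [output-file]" >&2
    return 2
  fi
  talosctl --talosconfig ./clusterconfig/talosconfig \
    --nodes "$snap_node" etcd snapshot "$snap_out"
}

# Bootstrap one freshly configured control-plane node from a snapshot
# instead of starting with an empty etcd.
recover_etcd() {
  rec_node="${1:-}"
  rec_snap="${2:-}"
  if [ -z "$rec_node" ] || [ ! -f "$rec_snap" ]; then
    echo "usage: recover_etcd <control-plane-ip> <snapshot-file>" >&2
    return 2
  fi
  talosctl --talosconfig ./clusterconfig/talosconfig \
    --nodes "$rec_node" bootstrap --recover-from="$rec_snap"
}
```

As with a normal bootstrap, recovery targets a single control-plane node; the remaining control-plane nodes join the restored etcd cluster afterwards.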

Where to look next