Proxmox VE

Proxmox VE is the hypervisor for the on-prem production site. Three physical nodes — proxmox1, proxmox2, proxmox3 — host the Talos control-plane and worker VMs plus a small fleet of NetBird connector LXCs.

All of it is driven from tofu/environment/production via the bpg/proxmox provider.

Why Proxmox

  • Open-source, locally hosted hypervisor with a real REST API the IaC layer can talk to.
  • Mixed workloads in one box. KVM VMs for Talos nodes, LXC containers for the NetBird connectors. The connectors don't deserve a full VM and Proxmox handles both natively.
  • Live migration between nodes during planned maintenance — the cluster keeps reconciling while one host gets patched.
  • GPU passthrough that actually works for the talos-worker-* nodes feeding Jellyfin / Immich / Tube Archivist.

Alternatives considered

| Option | Why not |
| --- | --- |
| Bare-metal Talos directly on the boxes | No room for the LXC connectors and ad-hoc utility VMs |
| ESXi / VMware | Closed source; recent licensing changes ruled it out |
| XCP-ng | Solid alternative; Proxmox wins on community + tooling familiarity |
| Harvester | HCI, but pulls in its own Kubernetes, which would conflict with Talos |

Cluster layout

| Node | Role |
| --- | --- |
| proxmox1 | Hypervisor + control-plane VM talos-cp-01 |
| proxmox2 | Hypervisor + control-plane VM talos-cp-02 |
| proxmox3 | Hypervisor + control-plane VM talos-cp-03 |

Each node also runs:

  • One Talos worker VM (talos-worker-0{1,2,3}) — extra NICs into VLAN 104 (storage) and VLAN 105 (public), GPU passthrough.
  • One NetBird connector LXC (lxc-proxmox{1,2,3}-netbird) — three IPs, one per VLAN it routes (mgmt / storage / public). See NetBird.
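The guest names follow a predictable pattern, so an inventory check can derive the expected names from the node index. A minimal sketch, assuming the numeric suffix of each guest matches the Proxmox node it runs on (the source implies this but does not state it); `expected_guests` is a hypothetical helper:

```shell
# Print the guests expected on Proxmox node N (1, 2 or 3), one per line.
# Assumption: guest suffixes line up with the node number.
expected_guests() {
  n="$1"
  printf 'talos-cp-0%s\n' "$n"            # control-plane VM
  printf 'talos-worker-0%s\n' "$n"        # worker VM with GPU passthrough
  printf 'lxc-proxmox%s-netbird\n' "$n"   # NetBird connector LXC
}

for n in 1 2 3; do
  echo "proxmox$n:"
  expected_guests "$n" | sed 's/^/  /'
done
```

The output can be compared by hand against `pvesh get /cluster/resources --type vm` on any node.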

The trio is a real Proxmox cluster (corosync + cluster filesystem), so VMs can live-migrate. Quorum needs 2 of the 3 nodes, so the cluster stays quorate with one node down.
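The quorum arithmetic is a strict majority of votes. A small sketch, assuming every node carries one vote (the corosync default); the function names are illustrative:

```shell
# Votes needed for quorum in a cluster of N equally weighted nodes:
# a strict majority, floor(N / 2) + 1.
quorum_votes() {
  echo $(( $1 / 2 + 1 ))
}

# Nodes that can be offline while the remainder still holds quorum.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

echo "nodes=3 quorum=$(quorum_votes 3) tolerated=$(tolerated_failures 3)"
```

On a live node, `pvecm status` reports the expected votes and the current quorum state.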

Bridge layout per node

     ┌───────── vmbr0 (trunk) ──────────┐
     │                                  │
┌────┴────────┐                  ┌──────┴────────┐
│ untagged    │                  │ tagged 104    │
│ VLAN 100    │                  │ VLAN 105      │
│ (mgmt)      │                  │ (storage/pub) │
└─────────────┘                  └───────────────┘
     │                                  │
pve host + cp-VMs +              worker VMs (extra
netbird LXC eth0 +               NICs) + netbird
worker VM eth0                   LXC eth1 / eth2

The full VLAN reference and IP plan live on the Fabric overview and the UniFi page.
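To make the tagging concrete, here is what the three NICs of a worker VM would look like as `qm set` options. This is only a dry-run sketch: in this setup the NICs are attached by OpenTofu, and the `virtio` model, the VMID and the `net_arg` helper are assumptions, not values from the repository.

```shell
# Build the value for a `qm set <vmid> --netN ...` option on vmbr0.
# VLAN 100 is untagged on the trunk, so it gets no tag;
# 104 (storage) and 105 (public) are tagged.
net_arg() {
  vlan="$1"
  case "$vlan" in
    100)     echo "virtio,bridge=vmbr0" ;;
    104|105) echo "virtio,bridge=vmbr0,tag=$vlan" ;;
    *)       echo "unknown VLAN: $vlan" >&2; return 1 ;;
  esac
}

# Dry run: print the commands instead of executing them.
# VMID 201 is a placeholder, not a real ID from this environment.
vmid=201
echo "qm set $vmid --net0 $(net_arg 100)"
echo "qm set $vmid --net1 $(net_arg 104)"
echo "qm set $vmid --net2 $(net_arg 105)"
```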

Storage

| Pool | Backing | Used by |
| --- | --- | --- |
| local-lvm | Each node's local NVMe | Talos VM disks, LXC root volumes |
| truenas-iscsi | iSCSI to the TrueNAS NAS | Optional shared volumes (not used in steady state) |

Talos VM disks are intentionally on local NVMe, not on shared storage. The etcd quorum tolerates one node down; we don't want a NAS outage to take all three control-planes with it. Persistent application data lives on Longhorn inside the cluster, with bulk media on TrueNAS via NFS.
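The "local NVMe only" rule can be spot-checked from a VM's configuration. The sketch below reads text shaped like `qm config <vmid>` output and flags any disk that is not on local-lvm; the `check_local_disks` helper and the sample values are illustrative, not taken from the real environment.

```shell
# Read `qm config <vmid>` output on stdin and print any disk line whose
# storage is not local-lvm. Exit status is 1 if such a disk is found.
check_local_disks() {
  awk -F': ' '
    /^(scsi|virtio|sata|ide)[0-9]+: / {
      if ($2 ~ /media=cdrom/) next        # skip ISO / cloud-init drives
      split($2, parts, ":")               # parts[1] is the storage ID
      if (parts[1] != "local-lvm") { print; bad = 1 }
    }
    END { exit bad }
  '
}

# Sample input shaped like `qm config` output (values are made up).
sample='cores: 4
scsi0: local-lvm:vm-101-disk-0,size=32G
scsi1: truenas-iscsi:vm-101-disk-1,size=100G
net0: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0'

printf '%s\n' "$sample" | check_local_disks || echo "found disks off local NVMe"
```

On a Proxmox node the same function can be fed directly, for example `qm config 101 | check_local_disks`.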

OpenTofu workflow

Everything in tofu/environment/production is declarative. Plan / apply against the Proxmox API:

cd tofu/environment/production
tofu init # idempotent
tofu plan -out=plan
tofu apply plan

Each apply will:

  • create / reconcile VMs from cloud-init templates,
  • attach VLAN-tagged NICs as required,
  • update DNS records via the NetBird overlay's DNS module,
  • register peers (the connector LXCs) into the right NetBird groups via setup keys.
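Because one apply touches VMs, NICs, DNS and NetBird peers together, it can help to know whether a plan has anything to do before applying it. A sketch using the `-detailed-exitcode` flag of `tofu plan`; the `describe_plan_status` helper is hypothetical, and the block only runs the plan where `tofu` and the environment directory are present:

```shell
# Map the exit status of `tofu plan -detailed-exitcode` to a message.
# 0 = no changes, 1 = error, 2 = changes pending.
describe_plan_status() {
  case "$1" in
    0) echo "no changes: infrastructure matches configuration" ;;
    1) echo "plan failed" ;;
    2) echo "changes pending: review the plan before applying" ;;
    *) echo "unexpected exit status: $1" ;;
  esac
}

# Run the plan only where the tofu binary and the environment exist.
if command -v tofu >/dev/null 2>&1 && [ -d tofu/environment/production ]; then
  rc=0
  ( cd tofu/environment/production && tofu plan -detailed-exitcode -out=plan ) || rc=$?
  describe_plan_status "$rc"
fi
```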

The provider talks to Proxmox over HTTPS using a token (tofu_provider). Token + state are SOPS-encrypted at rest.
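One way to hand the token to the provider is through the `PROXMOX_VE_API_TOKEN` environment variable, which the bpg/proxmox provider reads. The sketch below is an assumption about the wiring, not a description of this repository: the file name `secrets.enc.yaml`, the key `proxmox_api_token` and the `is_api_token` helper are all hypothetical.

```shell
# Check that a string looks like a Proxmox API token in the form
# user@realm!tokenid=secret, where the secret is a UUID.
is_api_token() {
  printf '%s\n' "$1" | grep -Eq '^[^@!=]+@[^@!=]+![^@!=]+=[0-9a-f-]{36}$'
}

# Hypothetical file and key names; adjust to the real secrets layout.
secrets_file="secrets.enc.yaml"
if command -v sops >/dev/null 2>&1 && [ -f "$secrets_file" ]; then
  token="$(sops -d --extract '["proxmox_api_token"]' "$secrets_file")"
  if is_api_token "$token"; then
    export PROXMOX_VE_API_TOKEN="$token"
  else
    echo "decrypted value does not look like an API token" >&2
  fi
fi
```

Validating the shape before exporting catches an empty or truncated secret early, before the provider reports a less specific authentication error.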

Operational notes

  • Rolling reboots during a Proxmox upgrade: drain Talos workloads off one node first (kubectl drain), live-migrate the worker VM to a peer, then reboot. The control-plane VM auto-fails-over to the migrated host.
  • GPU passthrough is per-VM. After replacing a worker, re-confirm the IOMMU groups before passing the GPU through again (group numbering can shift on kernel upgrades).
  • Cluster-network split on UniFi means corosync rides on VLAN 100. If you change VLAN 100's gateway, plan an extra-careful apply window.
  • Backups of the Proxmox VE configuration itself (PVE storage cfg, replication, cluster join) are in scope for the operations layer, separate from in-cluster backups. See Operations → Backups.
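The rolling-reboot note above can be sketched as a dry run that only prints the commands in order. Several details are assumptions: the Kubernetes node name is taken to match the VM name, VMID 201 is a placeholder, the final `kubectl uncordon` step is added here, and `--with-local-disks` is included because the VM disks sit on local-lvm. Proxmox may also refuse an online migration while a PCI device is passed through, so verify the procedure on a non-critical VM first.

```shell
# Print (do not run) the commands for taking one Proxmox node out of
# service, following the order in the rolling-reboot note.
# Arguments: Kubernetes node name, worker VMID, target Proxmox node.
rolling_reboot_plan() {
  k8s_node="$1"; vmid="$2"; target="$3"
  echo "kubectl drain $k8s_node --ignore-daemonsets --delete-emptydir-data"
  echo "qm migrate $vmid $target --online --with-local-disks"
  echo "reboot"
  echo "kubectl uncordon $k8s_node"
}

# VMID 201 is a placeholder; look up the real ID before running anything.
rolling_reboot_plan talos-worker-01 201 proxmox2
```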

Where to look next