Proxmox VE

Proxmox VE is the hypervisor for the on-prem production site. Three physical nodes — proxmox1, proxmox2, proxmox3 — host the Talos control-plane and worker VMs plus a small fleet of NetBird connector LXCs.

The hypervisor layer is managed from tofu/environment/production through the bpg/proxmox provider.

Why Proxmox

Open-source, locally hosted hypervisor with a real REST API the IaC layer can talk to.
Mixed workloads in one box. KVM VMs for Talos nodes, LXC containers for the NetBird connectors. The connectors don't deserve a full VM and Proxmox handles both natively.
Live migration between nodes during planned maintenance — the cluster keeps reconciling while one host gets patched.
GPU passthrough that actually works for the talos-worker-* nodes feeding Jellyfin / Immich / Tube Archivist.

Alternatives considered

Option	Why not
Bare-metal Talos directly on the boxes	No room for the LXC connectors and ad-hoc utility VMs
ESXi / VMware	Closed source; recent licensing changes ruled it out
XCP-ng	Solid alternative; Proxmox wins on community + tooling familiarity
Harvester	HCI, but pulls in its own Kubernetes — would conflict with Talos

Cluster layout

Node	Role
`proxmox1`	Hypervisor + control-plane VM `talos-cp-01`
`proxmox2`	Hypervisor + control-plane VM `talos-cp-02`
`proxmox3`	Hypervisor + control-plane VM `talos-cp-03`

Each node also runs:

One Talos worker VM (talos-worker-0{1,2,3}) — extra NICs into VLAN 104 (storage) and VLAN 105 (public), GPU passthrough.
One NetBird connector LXC (lxc-proxmox{1,2,3}-netbird) — three IPs, one per VLAN it routes (mgmt / storage / public). See NetBird.

The three nodes form a real Proxmox cluster (corosync + cluster filesystem), so VMs can live-migrate between them. Quorum requires 2 of the 3 nodes, so the cluster stays quorate with one node down.
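
The 2-of-3 figure is the usual strict-majority rule. A minimal sketch of the arithmetic, assuming every node carries exactly one corosync vote (the function names are illustrative, not part of any Proxmox tooling):

```shell
#!/usr/bin/env bash
# Votes needed for quorum when every node has one vote:
# a strict majority, i.e. floor(n / 2) + 1.
quorum_votes() {
  echo $(( $1 / 2 + 1 ))
}

# How many nodes can be offline while the rest still hold quorum.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

quorum_votes 3        # prints 2
tolerated_failures 3  # prints 1
```

On a live node, `pvecm status` reports the expected votes, total votes and quorum threshold that corosync is actually using.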

Bridge layout per node

        ┌───────── vmbr0 (trunk) ──────────┐
        │                                  │
   ┌────┴────────┐                  ┌──────┴────────┐
   │ untagged    │                  │ tagged VLANs  │
   │ VLAN 100    │                  │ 104 / 105     │
   │ (mgmt)      │                  │ (storage/pub) │
   └─────────────┘                  └───────────────┘
        │                                  │
   pve host + cp-VMs +              worker VMs (extra
   netbird LXC eth0 +               NICs) + netbird
   worker VM eth0                   LXC eth1 / eth2

The full VLAN reference and IP plan live on the Fabric overview and the UniFi page.
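
To make the bridge diagram concrete, here is a hedged sketch of the per-interface settings it implies for a connector LXC, written as the `netN` strings that `pct set` accepts. The interface-to-VLAN order (eth1 on 104, eth2 on 105) and the helper name are assumptions, and in this setup the NICs are created by OpenTofu rather than by hand.

```shell
#!/usr/bin/env bash
# Build a Proxmox LXC network spec for one interface on vmbr0.
# $1 = interface index (0, 1, 2), $2 = VLAN id.
# VLAN 100 is the untagged side of the trunk, so it gets no tag.
lxc_nic_spec() {
  if [ "$2" = "100" ]; then
    echo "name=eth$1,bridge=vmbr0"
  else
    echo "name=eth$1,bridge=vmbr0,tag=$2"
  fi
}

lxc_nic_spec 0 100   # name=eth0,bridge=vmbr0
lxc_nic_spec 1 104   # name=eth1,bridge=vmbr0,tag=104
lxc_nic_spec 2 105   # name=eth2,bridge=vmbr0,tag=105
```

Applied manually, this would look like `pct set <vmid> --net1 "$(lxc_nic_spec 1 104)"`, with `<vmid>` left as a placeholder.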

Storage

Pool	Backing	Used by
`local-lvm`	Each node's local NVMe	Talos VM disks, LXC root volumes
`truenas-iscsi`	iSCSI to the TrueNAS NAS	Optional shared volumes (not used in steady state)

Talos VM disks are intentionally on local NVMe, not on shared storage. The etcd quorum tolerates one node down; we don't want a NAS outage to take all three control-planes with it. Persistent application data lives on Longhorn inside the cluster, with bulk media on TrueNAS via NFS.

OpenTofu workflow

Everything in tofu/environment/production is declarative. Run plan and apply against the Proxmox API:

cd tofu/environment/production
tofu init                        # idempotent
tofu plan -out=plan
tofu apply plan
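
The same cycle can be wrapped so that apply only runs when the plan actually contains changes. This is a sketch rather than the repository's own tooling: it relies on `tofu plan -detailed-exitcode` (exit 0 for no changes, 2 for pending changes, 1 for errors), and the function names are illustrative.

```shell
#!/usr/bin/env bash
# Translate the exit status of `tofu plan -detailed-exitcode`.
plan_outcome() {
  case "$1" in
    0) echo "no-changes" ;;
    2) echo "changes-pending" ;;
    *) echo "error" ;;
  esac
}

# Plan, then apply the saved plan only if it contains changes.
# $1 = working directory (defaults to the production environment).
plan_and_apply() {
  dir="${1:-tofu/environment/production}"
  rc=0
  tofu -chdir="$dir" init -input=false || return 1
  tofu -chdir="$dir" plan -detailed-exitcode -out=plan || rc=$?
  case "$(plan_outcome "$rc")" in
    no-changes)      echo "nothing to apply" ;;
    changes-pending) tofu -chdir="$dir" apply plan ;;
    *)               echo "plan failed (exit $rc)" >&2; return 1 ;;
  esac
}
```

Calling `plan_and_apply` with no arguments targets tofu/environment/production. Applying the saved plan file means the changes applied are exactly the ones that were reviewed.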

Each apply will:

create / reconcile VMs from cloud-init templates,
attach VLAN-tagged NICs as required,
update DNS records via the NetBird overlay's DNS module,
register peers (the connector LXCs) into the right NetBird groups via setup keys.

The provider talks to Proxmox over HTTPS using an API token (tofu_provider). Both the token and the OpenTofu state are SOPS-encrypted at rest.
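
One way to hand the decrypted token to the provider without writing it to disk is `sops exec-env`, which decrypts a file and runs a command with its keys exported as environment variables. This sketch assumes a hypothetical `secrets.enc.yaml` containing a `PROXMOX_VE_API_TOKEN` key (the environment variable the bpg/proxmox provider reads for token authentication); the real file name and layout may differ.

```shell
#!/usr/bin/env bash
# Succeeds if the value has the shape user@realm!tokenid=secret.
looks_like_pve_token() {
  case "$1" in
    ?*@?*!?*=?*) return 0 ;;
    *)           return 1 ;;
  esac
}

# Run a tofu subcommand with the secrets exported only for that process.
# SECRETS_FILE is a hypothetical path; override it to match the repo.
tofu_with_secrets() {
  secrets="${SECRETS_FILE:-secrets.enc.yaml}"
  sops exec-env "$secrets" "tofu $*"
}
```

For example, `tofu_with_secrets plan -out=plan` would run the plan with the token present only in that process's environment.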

Operational notes

Rolling reboots during a Proxmox upgrade: take one host at a time. Drain workloads off its Talos worker first (kubectl drain), live-migrate the worker VM to a peer, then reboot the host. The control-plane VM auto-fails-over to the migrated host.
GPU passthrough is configured per VM. After replacing a worker, re-check the IOMMU groups before passing the GPU through again (the numbering can shift after a kernel upgrade).
Cluster-network split on UniFi means corosync rides on VLAN 100. If you change VLAN 100's gateway, plan an extra-careful apply window.
Backups of the Proxmox VE configuration itself (PVE storage cfg, replication, cluster join) are in scope for the operations layer, separate from in-cluster backups. See Operations → Backups.
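
For the IOMMU check mentioned in the notes above, a small sketch that lists which PCI addresses sit in which group by walking sysfs. The sysfs path is the standard Linux location; the function name and the optional root argument (there so it can be pointed at a test directory) are illustrative.

```shell
#!/usr/bin/env bash
# Print one "group N: PCI-address" line per device in each IOMMU group.
# $1 = sysfs root to scan (defaults to the real location).
list_iommu_groups() {
  root="${1:-/sys/kernel/iommu_groups}"
  for dev in "$root"/*/devices/*; do
    [ -e "$dev" ] || continue      # skip the literal glob if nothing matched
    group="${dev%/devices/*}"      # strip the trailing /devices/<address>
    group="${group##*/}"           # keep only the group number
    echo "group $group: ${dev##*/}"
  done
}
```

Running it on the Proxmox host before and after a kernel upgrade and diffing the output shows whether a GPU has moved groups; `lspci -nn -s <address>` identifies the device behind a given address.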

Where to look next

Hetzner — same role for the edge cluster
Talos — what runs inside these VMs
Fabric / UniFi — VLAN trunk feeding vmbr0
Hardware → NAS — physical box behind the storage VLAN
