
Tool Choices — An ADR for Every Pick

TL;DR

Every tool named in the whitepaper gets a defensible answer to: what problem, what alternatives, why this, license and backing, pricing, lock-in risk, migration path. The exercise changed several of the original picks after honest re-evaluation. Notably: drop Vault (overkill); SOPS + age is the deployed secrets pick, corrected through three rounds — the rig was always on SOPS, and the earlier retractions assumed otherwise (see the Secrets section); add Phoenix as an alternative to Langfuse (the 8 GB VM reality); defer feature flags (YAGNI at our scale); hedge pgroll with an inspectable SQL trail (single-vendor bus factor, correctly framed).

This document is the reasoning the other whitepaper docs just assert. Every line of the form "We use X" elsewhere has a row here that explains why X and not Y.

How to read each entry

Every pick is evaluated against the same rubric:

| Column | What it captures |
| --- | --- |
| License | OSI-approved? Copyleft? Source-available-but-restricted? Specific license string (MIT, Apache-2.0, MPL-2.0, BSL, ELv2, AGPL, GPL, proprietary). |
| Owner / governance | Single company? Foundation? BDFL? Community-elected? |
| Pricing | Free for our use? Tier structure? Where does the pricing curve bite? |
| Bus factor | If the primary maintainer disappears, who keeps this alive? |
| Lock-in risk | If we need to leave, how bad is the migration? |
| Escape hatch | Concrete alternative we'd adopt if we had to move. |
| Re-evaluate when | The signal that tells us this pick is no longer right. |

The goal is not to minimize every axis (impossible) but to be explicit about each, so future us — or a future maintainer — can argue with our choices from a base of evidence rather than vibes.

Headline changes from original whitepaper

Where the honest re-evaluation changed the pick

  • Secrets: drop Vault. SOPS + age + Flux is what's actually deployed (verified live in apps/*/*.sops.yaml). External Secrets Operator + GCP Secret Manager is deferred until needed. GitHub App installation tokens are minted on-demand. OpenBao is the correct choice if and when we ever need Vault-class dynamic-secret capability — not now. Earlier drafts claimed SealedSecrets was our current state; that was wrong (never deployed). Third-order correction recorded in the retraction log.
  • LLM observability: add Phoenix. Langfuse v3 wants 16 GB RAM min and a separate ClickHouse cluster. On our 8 GB VM, Phoenix (ELv2, OTel-native, SQLite/Postgres, no ClickHouse) is the honest self-host pick.
  • Feature flags: defer. flagd + OpenFeature is defensible eventually but for 1-2 humans and few services with no A/B testing need, env-vars-via-Kustomize is sufficient. Adopt a flag system when there's a concrete targeting / experimentation requirement.
  • Unleash: explicitly reject. OSS edition deprecated and reached EOL 2025-12-31. Was previously a reasonable alternative; no longer is.
  • Instrument with OpenTelemetry GenAI semantic conventions, not vendor SDKs. This is the single highest-leverage lock-in defense in the stack.

Secrets management

The section that drew the most reader pushback. The original whitepaper promised Vault for short-lived credentials; honest re-evaluation says we don't need it.

The Vault-vs-SOPS question (directly)

They solve overlapping but distinct problems:

| Dimension | SOPS + age (or SealedSecrets) | Vault / OpenBao |
| --- | --- | --- |
| What it encrypts | Files at rest in git | Secret values fetched at runtime |
| Dynamic secrets | No | Yes — mint short-lived DB users, cloud creds, GitHub App tokens |
| Ops footprint | Zero runtime service | 3+ node HA cluster, unsealing, upgrades |
| Reviewable in PRs | Yes (encrypted blob diffs cleanly) | No (secrets never in git) |
| Revoke on compromise | Git commit + rotate everywhere | One API call, cluster-wide |
| Audit log | Git history | Vault audit log |
| Disaster recovery | Git repo + decryption key | Vault snapshot + unseal keys |

For a high-traffic production system serving paying customers: Vault (or OpenBao) wins clearly — dynamic secrets + centralized revocation + audit log are irreplaceable.

For a 1-2 person rig on one 8 GB VM: SOPS-style encryption + an ESO shim + a cloud-KMS-backed secret manager is simpler, cheaper, and covers the real threat model.
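For a sense of why SOPS blobs review cleanly in PRs: only the values are ciphertext, while keys and structure stay readable, so a diff shows exactly which entries changed. An illustrative (not real) encrypted manifest, with the field layout roughly following current SOPS output:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: dev-e-secrets
stringData:
  # Values are opaque ciphertext; the key names still diff meaningfully in a PR
  ANTHROPIC_API_KEY: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
sops:
  age:
    - recipient: age1examplepublickey...   # placeholder recipient
      enc: |
        -----BEGIN AGE ENCRYPTED FILE-----
        ...
        -----END AGE ENCRYPTED FILE-----
  lastmodified: "2026-04-17T00:00:00Z"
  version: 3.9.0
```

The `sops` trailer carries the recipients and integrity metadata, so the file is self-describing: anyone with the age private key can decrypt it, and anyone without it can still review which keys changed.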

Retracted (third-order correction), 2026-04-17 — we were never on SealedSecrets

Two previous retractions in this ADR (log below) framed a migration from SealedSecrets to SOPS. That framing was wrong about the ground-truth deployed state. Verified by grepping the repo: zero kind: SealedSecret references, zero sealed-secrets-controller HelmRelease, zero bitnami-labs image pulls. Every secret in the rig — Dev-E, Review-E, Conductor-E, Cloudflared — is already SOPS-encrypted (*.sops.yaml files in apps/). SOPS + age + Flux was always the deployed pick; there was never any SealedSecrets to migrate from.

The earlier narratives ("SealedSecrets keep", then "SealedSecrets legacy migrating") were built on an earlier research-agent summary that asserted SealedSecrets was our current deployment. I accepted that without running grep -r SealedSecret apps/. I ran that grep today and it returned nothing. The Broadcom-paywall concern is real for anyone using SealedSecrets but was theoretical for us — we avoided the risk by already being on SOPS, not by deliberately migrating off. The "GHCR hedge" I proposed is unnecessary because we don't pull the image at all.

Meta-lesson added to the fresh-start evaluation log (below): verify ground-truth deployed state, not research-agent summaries, before writing retraction narratives. A 10-second grep would have prevented two rounds of wrong framing.

Current pick (verified live in apps/ as of 2026-04-17):

SOPS + age + Flux kustomize-controller (deployed primary, has been all along)
  + .sops.yaml at repo root with creation_rules covering apps/*/*.sops.yaml
  + Cluster-scoped age key in flux-system/sops-age Secret
  + Per-app encrypted manifests: apps/dev-e/dev-e-secrets.sops.yaml,
    apps/review-e/review-e-secrets.sops.yaml,
    apps/conductor-e/conductor-e-secrets.sops.yaml,
    apps/cloudflared/tunnel-token.sops.yaml
  + Each kustomization sets decryption.provider: sops + secretRef.name: sops-age
  + GitHub App installation tokens minted on-demand at pod startup (1h TTL)
  + Static narrow-grant Postgres service accounts
  + External Secrets Operator deferred (not yet needed — git-at-rest scales to our inventory)
  + Vault / OpenBao deferred (no dynamic-secret requirement yet)

See docs/sops.md for the operational reference (how to bootstrap an encrypted secret, rotation procedure, key management).
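The deployed wiring above is small enough to sketch in two fragments. The age recipient is a placeholder; resource names follow the list above:

```yaml
# .sops.yaml at repo root: which files get encrypted, and to which age key
creation_rules:
  - path_regex: apps/.*/.*\.sops\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: age1examplepublickey...   # placeholder; the real recipient is the cluster key
---
# Flux Kustomization for one app: kustomize-controller decrypts inline
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: dev-e
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/dev-e
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops
    secretRef:
      name: sops-age   # cluster-scoped age private key, per the list above
```

No extra controller is involved: the same kustomize-controller that applies the manifests performs the decryption, which is why the ops footprint row reads "zero runtime service."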

Secrets tooling matrix

| Tool | License | Owner | Our pick? | Why / why not |
| --- | --- | --- | --- | --- |
| HashiCorp Vault | BSL 1.1 | IBM (acq. Feb 2025) | No | BSL is tolerable (we're non-competing) but HCP Vault Secrets EOL July 2026, IBM pricing plays, velocity concerns. Operationally expensive (3+ node HA, unsealing). Dynamic secrets are genuinely excellent — we just don't need them yet. |
| OpenBao | MPL-2.0 | Linux Foundation | Deferred | The correct answer if we ever need Vault-class capability. API-compatible with Vault; ESO works unchanged. Same ops burden as Vault. Adopt when we have a concrete unmet need for dynamic secrets. |
| SOPS + age | MPL-2.0 | CNCF (getsops org) | Yes (deployed primary) | The actual deployed pattern. Verified live in apps/*/*.sops.yaml — all four active app namespaces use it. Flux decrypts inline via kustomize-controller --decryption-provider=sops (no additional controller); age keys are simpler than GPG. MPL-2.0 forever, CNCF governance. |
| SealedSecrets | Apache-2.0 | bitnami-labs (Broadcom-owned) | Not in use | Not deployed, not used, not a migration target. Broadcom's Bitnami catalog paywall (verified real — bitnami/postgresql:17.5.0 returns 404, same namespace as the sealed-secrets-controller image) is a risk for shops that use it; we avoided it by default, not by design. Leaving the row here for the ADR audit trail. |
| External Secrets Operator | Apache-2.0 | CNCF (incubating) | Yes (add) | The reversibility insurance. Backend-agnostic — swap GCP SM → OpenBao → Infisical by changing a CRD, workloads untouched. |
| GCP Secret Manager | Proprietary (GCP) | Google | Yes (add) | We're already on GCP. Free tier covers our inventory. No dynamic secrets, but it doesn't need them. Access via ESO = low lock-in. |
| Infisical | MIT (core) + SaaS | Infisical Inc. (YC) | No (for now) | Strong middle ground between Bitwarden and Vault. Reasonable alternative if we outgrow GCP SM before we need Vault. |
| Doppler | Proprietary SaaS | Doppler Inc. | Reject | SaaS-only; no self-host. Makes a Doppler outage our deploy outage. Strongest lock-in on the list. |
| 1Password Connect | Proprietary | 1Password | Partial | We already use Bitwarden for the human vault. 1Password Connect is fine if we switched, but there's no reason to. |
| CSI Secrets Store | Apache-2.0 | Kubernetes | No | DaemonSet footprint too heavy on a single 8 GB VM. Right choice for regulated workloads avoiding etcd. |
| cert-manager + trust-manager | Apache-2.0 | CNCF (graduated) | Yes (add) | Table stakes. Non-controversial. |

Re-evaluate secrets when

  • We add a second K8s cluster or second Postgres instance (static narrow grants stop scaling)
  • We take a compliance requirement that mandates audit log on secret access
  • A secret actually gets compromised (rotation scope pain becomes real)
  • Our team grows past ~5 operators (human secret-handling becomes the bottleneck)
  • The getsops.io project stalls or is archived (then fork or migrate to an alternative; currently healthy)

Retraction log — secrets picks (three rounds)

Honest disclosure of where the first, second, and third drafts of this ADR got it wrong about secrets.

| Round | What the draft said | What changed | Why it was wrong |
| --- | --- | --- | --- |
| 1 | "SealedSecrets — Yes (keep) + ESO + GCP SM + GitHub App tokens. Governance risk post-Broadcom/Bitnami is real but no migration pressure yet." | Promoted SealedSecrets to the declared primary; treated SOPS as redundant. | Accepted a research-agent summary that claimed SealedSecrets was our deployed state. Never grep-verified. Built a whole defense around an incorrect premise. |
| 2 | "SOPS + age is now primary. SealedSecrets is Legacy (migrating). Interim hedge: switch image source to ghcr.io/bitnami-labs/sealed-secrets-controller." | SOPS promoted, SealedSecrets relabeled legacy-migrating, elaborate Broadcom-paywall hedge proposed. | Still wrong about ground truth. Corrected the right-and-wrong framing of the tools, but kept asserting SealedSecrets was our deployment. The Broadcom paywall research was real but the "migration path" was literature for a migration that didn't need to happen. |
| 3 (this entry) | "SOPS + age + Flux was always the deployed pick. There was never any SealedSecrets. Earlier retractions were based on an unverified premise." | Corrected: zero SealedSecrets in the repo (grep -r SealedSecret apps/ returns nothing). .sops.yaml at repo root covers all apps. Every deployed app uses *.sops.yaml with decryption.provider: sops. | The meta-lesson: verify ground-truth deployed state, not research-agent summaries, before writing retractions. A 10-second grep -r would have prevented two rounds of wrong framing. The second-order lesson — also named in my Fresh-start log's "three patterns" — was "hedge narratives need re-verification." This round adds: "assertions about current deployed state need re-verification too," which is the stronger version of the same principle. |

The incumbent-bias lesson still stands

Round 1's incumbent-bias problem (documented in the Round 1 retraction above) was a real pattern: I skipped the fresh-start test on secrets while applying it elsewhere. Round 2's correction — "SOPS wins on governance, operational cost, license permanence" — was the right conclusion reached via the right reasoning. What was wrong in Round 2 was the framing (claimed migration from incumbent) not the verdict (SOPS over SealedSecrets). Round 3 leaves the verdict intact and fixes only the factually-inaccurate framing.

The ground-truth-verification lesson

Add to the fresh-start evaluation meta-rules (below): before asserting what's deployed, run grep against the repo. Research-agent summaries can be wrong or stale. I had the tools to verify in the first round and didn't use them. Don't skip verification of ground truth; it's cheaper than two rounds of retraction.
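The verification rule can be mechanized. A self-contained Python sketch of the same check as grep -r SealedSecret apps/ — the helper name and the mock tree are hypothetical, only the check itself mirrors what was run against the real repo:

```python
import os
import tempfile

def manifests_referencing(root, needle):
    """Walk a manifest tree; return YAML files mentioning `needle` (like grep -rl)."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith((".yaml", ".yml")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as fh:
                    if needle in fh.read():
                        hits.append(path)
    return sorted(hits)

# Self-contained demo: a mock apps/ tree with a SOPS secret and no SealedSecrets.
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, "apps", "dev-e"))
with open(os.path.join(demo, "apps", "dev-e", "dev-e-secrets.sops.yaml"), "w") as fh:
    fh.write("apiVersion: v1\nkind: Secret\n")

sealed = manifests_referencing(os.path.join(demo, "apps"), "SealedSecret")
# An empty list is the precondition for writing "we were never on SealedSecrets".
```

The point is not the tooling but the habit: run the check before writing the narrative, not after.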

Fresh-start evaluation log (April 2026)

Honest application of the fresh-start test — "if we were picking this from scratch today with no prior context, what would we pick?" — to every tool category in this ADR. This log replaces the earlier "Broader incumbent-bias check" which was a checklist, not an evaluation. Two picks were verified in depth this round (LiteLLM and pgroll); the rest were audited against 2026-current alternatives.

Summary table

| Category | Current pick | Fresh-start (2026) verdict | Rigor of this round |
| --- | --- | --- | --- |
| Secrets (git-at-rest) | SOPS + age (deployed; always was) | Corrected — SOPS is the deployed pick; earlier retractions #1 and #2 framed a migration that didn't need to happen because SealedSecrets was never deployed | Deep (three rounds of retraction; see log) |
| Policy engine | Kyverno | Keep — unchanged | Shallow |
| Supply-chain signing | Sigstore (cosign, gitsign, rekor, slsa-github-generator) | Keep — no credible OSS competitor | Shallow |
| Networking / L7 egress | Cilium | Keep — no equivalent at L7 in OSS | Shallow |
| Metrics / logs / traces | Prometheus (local) + Grafana Cloud Free (managed) + OTel Collector | Keep — name VictoriaMetrics as lighter-weight alternative for multi-node future | Medium |
| LLM observability | Phoenix (8 GB) / Langfuse (16 GB+) | Keep — already retracted in earlier PR | Medium |
| LLM gateway | LiteLLM | Keep — verified. Portkey's "fully OSS" March 2026 announcement kept per-key budget enforcement Enterprise-only; original pick rationale holds | Deep (verified this PR) |
| Progressive delivery | Flagger | Keep — Flux-native; Argo Rollouts fights Flux | Shallow |
| Feature flags | Deferred (flagd when needed) | Keep — no change; PostHog named as bundled alternative | Shallow |
| DB migration safety | pgroll | Keep — verified. Atlas has closed most gaps but does not implement expand/contract multi-version schemas — pgroll's core differentiator stands | Deep (verified this PR) + hedge-narrative fixed |
| Supply chain (deps) | Dependabot + Socket.dev + Syft+Grype + package-age policy | Keep — name trivy as Grype alternative | Shallow |
| Container + CI | GitHub Actions + GHCR | Keep — incumbent-and-defensible (SCM + CI + registry bundle dominates) | Shallow |
| GitOps | Flux | Keep — incumbent-and-defensible (Flagger picks Flux-native; switching cascades) | Shallow |
| Cluster runtime | k3s | Keep — name Talos Linux as multi-node future consideration | Medium |
| Event-driven autoscale | KEDA | Keep — no credible competitor | Shallow |
| Cloud compute | GCP Compute | Keep — incumbent-and-defensible (Workload Identity + DNS already wired) | Shallow |
| Human vault | Bitwarden | Keep — name Vaultwarden (Rust self-host port) for future | Shallow |
| Docs site | MkDocs Material | Keep — Docusaurus/VitePress/Astro Starlight are reasonable if we want more customization | Shallow |
| Evaluation harness | Inspect AI (candidate) | Already flagged candidate — validate in Era 2 | Already done |

Verified deep this round: LiteLLM (stays)

Portkey Gateway went fully open source March 2026 (Apache-2.0, 1T+ tokens/day). Original LiteLLM pick reason was "only OSS option with per-virtual-key budget enforcement." Re-verified against Portkey's 2026 documentation:

The 2026 "fully OSS" announcement was a governance + observability + MCP-registry open-sourcing, not a cost-controls open-sourcing. The original LiteLLM differentiator (per-virtual-key budget envelopes with duration windows returning 429 on exceed, free) still holds.

LiteLLM's known bugs (#12905, #10750, #12977, #25386) don't touch our specific config pattern (we have ~5 explicitly-configured keys, no team-scoped nesting, no pass-through routes, no AzureOpenAI direct client, no auto-created end users). Verdict: stay on LiteLLM. Revisit if (a) Portkey moves budget-limits to OSS, or (b) we scale past ~500 RPS where LiteLLM's documented memory issues at 2k RPS start to bite.

Verified deep this round: pgroll (stays) + hedge narrative corrected

Atlas (Ariga) has shipped rapidly in 2025–2026 — v1.2.0 on 2026-04-10, Kubernetes operator (Apache-2.0 with some EULA image layers), 50+ migration safety analyzers, weekly-to-biweekly release cadence. Feature gap against pgroll narrowed significantly. But Atlas does NOT implement real expand/contract with multi-version schema views + triggered backfill — it lints for unsafe DDL, emits concurrent-index DDL, and rolls out carefully, but it executes a single migration against a single schema.

For our specific workload (one Postgres, ~10–30 tables, expand/contract required for zero-downtime), pgroll is still the only tool that keeps v1 and v2 of a table simultaneously queryable. Verdict: stay on pgroll. Revisit if Xata misses another release quarter (no v0.17 by end of Q3 2026), announces a shutdown/acquisition, or Atlas ships native expand/contract.

Corrected the hedge narrative: earlier drafts implied pgroll migrations are plain SQL. They're not — they're pgroll-specific operation YAML. The correct hedge is to commit generated SQL alongside each operation YAML (via pgroll SQL emission) so schema history stays reconstructible. See the pgroll section above for the corrected wording.
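For reference, roughly what an operation file looks like, and why it is not plain SQL. Field names follow pgroll's documented operation schema as best understood at time of writing; the table, column, and values are hypothetical, and the shape should be verified against the pinned pgroll version:

```yaml
# migrations/0007_set_email_not_null.yaml — pgroll operations, not raw DDL
operations:
  - alter_column:
      table: users
      column: email
      nullable: false
      # `up` backfills old rows into the new (NOT NULL) schema version;
      # `down` maps new writes back into the old version. Both are SQL
      # expressions, but the file itself only makes sense to pgroll.
      up: CASE WHEN email IS NULL THEN 'unknown@example.com' ELSE email END
      down: email
```

Because the file is pgroll-specific, the hedge is to commit the SQL pgroll emits for each operation next to the operation file, so the schema history stays reconstructible even without pgroll itself.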

Shallow-audited: what "fresh-start keep" actually means

For the shallow-audited picks (Kyverno, Sigstore, Cilium, Flagger, k3s, KEDA, Dependabot/Socket, GitHub/GHCR/Flux, Bitwarden, MkDocs), "fresh-start keep" means: I considered the current 2026 alternatives to each and none clearly beat the incumbent for our scale on license, governance, operational cost, and feature coverage. They are the picks I would make today if starting from scratch.

A stronger level of rigor would be individual per-category research agents (like I did for LiteLLM and pgroll). That's worth doing when a specific concern surfaces (as with Portkey-announcement, Xata-release-cadence, Broadcom-Bitnami). Applying it to every pick every month is over-engineering.

What the deeper-audited rounds taught us

Four patterns emerged from the SOPS (three rounds), LiteLLM, and pgroll deep audits:

  1. Announcements lie about feature scope. Portkey's March 2026 "fully OSS" announcement was marketing; the feature we care about stayed paywalled. Always verify against current docs, not the press release.
  2. Release cadence is a signal. pgroll's decelerating release cadence (v0.16.1 in February, nothing since) is consistent with Xata treating pgroll as an internal-product-first tool in maintenance mode. Not alarming on its own, but worth tracking.
  3. Hedge narratives need re-verification. The "keep migrations as plain SQL" hedge in the earlier pgroll writeup turned out to be wrong — pgroll operation files are YAML. When we write a hedge, we should confirm it's actually realisable, not just aspirational.
  4. Ground-truth deployed state needs re-verification too. The SealedSecrets retraction had to happen three times because the first two rounds accepted a research-agent summary about "what's currently deployed" instead of grep-verifying the repo. A 10-second grep (grep -r SealedSecret apps/) would have prevented it. Stronger version of pattern (3): "before asserting what's deployed, verify."

Categories that warrant re-examination eventually

Not actionable today, flagged for future attention:

  • Bitwarden — picked because humans already use it. 1Password has better team-grant ergonomics; Vaultwarden is an unofficial Rust self-host port if we want more control; Infisical covers human+automation in one product (at the cost of a YC-company dependency). Re-evaluate if team grows past 3 operators or if we start needing per-project secret segregation.
  • MkDocs Material — Python-docs gold standard today, but Docusaurus (Meta-backed, React), VitePress (Vue/Vite), and Astro Starlight (Astro) are reasonable alternatives with better customization. Low priority — the docs site works.
  • k3s — ideal for single-VM. If we go multi-node for any reason, Talos Linux (immutable, API-only, no SSH, no shell) is a stronger security baseline. Not a k3s replacement — runs K8s, including k3s — but changes the host OS story.

Meta-rule, reaffirmed

When an ADR row reads "already deployed — keep" without a license/governance/operational comparison against the best fresh-start alternative, that's a flag for re-examination. Path dependence is a cost, not a reason. Every pick in this ADR has now had the fresh-start test applied at least shallowly; two picks got deep verification this round; the retraction log above grows whenever a pick turns out to have been defended on sunk-cost reasoning.

Next scheduled re-audit: monthly for deep-picks (LiteLLM, pgroll, SOPS health at getsops.io, Langfuse/Phoenix VM sizing). Quarterly for shallow-picks. Immediate whenever a tool's governance / license / owner changes (Broadcom-Bitnami style events). Always verify ground-truth deployed state with a grep before framing a retraction.

Policy engine

Tool License Owner Our pick? Why
Kyverno Apache-2.0 CNCF (incubating) Yes YAML CRDs (no Rego language). Native Sigstore verification first-class. Reports in OpenReports format. Operational cost is lower for a 2-person team.
OPA Gatekeeper Apache-2.0 CNCF (graduated) No Rego language requires learning + maintenance. Cosign verification exists but not as polished. Better for large orgs that already run OPA for non-K8s policy.
jsPolicy Apache-2.0 Loft Labs No Single-vendor. JavaScript-based policies. Niche.
OpenPolicyAgent (OPA) core Apache-2.0 CNCF No Lower-level; Gatekeeper is the K8s-admission wrapper.
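To make the "YAML CRDs, Sigstore first-class" row concrete, a minimal Kyverno image-verification policy sketch. The registry path and signing identity are hypothetical; the resource shape follows Kyverno's verifyImages rule:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-cosign-keyless
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "ghcr.io/our-org/*"          # hypothetical registry namespace
          attestors:
            - entries:
                - keyless:
                    subject: "https://github.com/our-org/*"   # hypothetical identity
                    issuer: "https://token.actions.githubusercontent.com"
                    rekor:
                      url: https://rekor.sigstore.dev
```

The whole policy is declarative YAML — no Rego to learn, and the keyless Fulcio/Rekor checks are first-class fields rather than bolted-on external data.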

Re-evaluate policy engine when

  • We need to write non-K8s policies (API gateway, CI/CD gates) — OPA's broader reach becomes attractive
  • We have a Rego-fluent engineer — learning cost drops
  • Kyverno governance shifts unfavorably (currently healthy as CNCF incubating)

Supply chain / signing

All picks are Sigstore ecosystem; there's no real competitor in 2026 open-source territory.

Tool License Owner Our pick? Why
Cosign Apache-2.0 Sigstore (Linux Foundation) Yes Industry default. Keyless via Fulcio.
Gitsign Apache-2.0 Sigstore Yes Agent commit signing with ephemeral Fulcio certs. Known gotcha: GitHub UI doesn't display "Verified" — workaround is a CI-side gitsign verify check.
Rekor Apache-2.0 Sigstore (public instance) Yes (public) We use the public Rekor. Private Rekor is possible but over-engineered.
slsa-github-generator Apache-2.0 SLSA framework Yes Isolated-builder reusable workflow produces SLSA v1.0 L3 provenance. ~5 lines in any GitHub Actions file.
Notary v2 / ORAS Apache-2.0 CNCF No OCI-artifact-focused; Sigstore covers our image case more simply.
Docker Content Trust Proprietary-ish Docker No Deprecated in favor of Sigstore.
HSM-backed PGP varies Reject Long-lived keys to rotate. Worse threat model than Sigstore for our case.
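The CI-side gitsign verify workaround from the Gitsign row could look like the following. The workflow wiring, identity regexp, and issuer are all hypothetical, and the flag names should be confirmed against the pinned gitsign version:

```yaml
# .github/workflows/verify-commits.yaml (hypothetical file)
name: verify-commit-signatures
on: [pull_request]
jobs:
  gitsign-verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history so HEAD's signature context is present
      - uses: actions/setup-go@v5
        with:
          go-version: stable
      - name: Install gitsign
        run: go install github.com/sigstore/gitsign@latest
      - name: Verify HEAD commit against expected signer identity
        run: |
          gitsign verify \
            --certificate-identity-regexp '.*@our-domain\.example' \
            --certificate-oidc-issuer 'https://accounts.google.com' \
            HEAD
```

This makes the check machine-enforced in CI, compensating for GitHub's UI not showing "Verified" on gitsign commits.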

Re-evaluate signing when

  • Sigstore public goods service (Fulcio / Rekor free tier) changes its commitment
  • Compliance requires private transparency log (run private Rekor)
  • Multi-tenant signing needs emerge (HSM delegation tooling)

Networking / service mesh

Tool License Owner Our pick? Why
Cilium (L7 via CNPs) Apache-2.0 CNCF (graduated) Yes eBPF CNI with L7 HTTP/DNS filtering. The single biggest ROI defense against prompt-injection exfiltration. Gets 80% of service-mesh value at 10% of the cost.
Istio Apache-2.0 CNCF (graduated) No Full service mesh — mTLS, traffic mgmt, etc. Overkill for our traffic volume + single-cluster setup. Revisit if we go multi-cluster or need mTLS to external services.
Linkerd Apache-2.0 CNCF (graduated) No Simpler than Istio; still more than we need. Good alternative if Istio's complexity is the only objection.
Calico (OSS) Apache-2.0 Tigera No Solid CNI but L7 filtering requires Calico Enterprise (commercial). Cilium's OSS L7 wins.
Native NetworkPolicy only Kubernetes No L3/L4 only. Cannot filter by FQDN or HTTP method. Insufficient for our egress-allowlist goal.
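What the egress allowlist looks like in practice: a CiliumNetworkPolicy sketch with a hypothetical workload label and a single allowed FQDN. The DNS rule is required so Cilium can observe lookups and resolve toFQDNs entries:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: agent-egress-allowlist
spec:
  endpointSelector:
    matchLabels:
      app: dev-e   # hypothetical workload label
  egress:
    # Allow DNS to kube-dns, with query inspection enabled
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Only this external host is reachable; everything else is dropped
    - toFQDNs:
        - matchName: api.anthropic.com
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
```

This is the filtering native NetworkPolicy cannot express: the allowlist is by hostname, not IP, so it survives provider IP churn and blocks exfiltration to arbitrary destinations.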

Re-evaluate networking when

  • Multi-cluster deployment emerges — a service mesh becomes more compelling
  • Cilium L7 Envoy proxy memory overhead becomes the bottleneck on our VM
  • External mTLS requirement (e.g., to a customer-facing API) — then Istio or Linkerd

Observability — metrics, logs, traces

Split picks: local for SLO-decisive data, managed for everything else.

Tool License Owner Our pick? Why
Prometheus (local) Apache-2.0 CNCF (graduated) Yes Source of truth for Flagger canary analysis. Must be local so SLO gates work when external egress blips. ~1 GB RAM.
Grafana Cloud Free Proprietary SaaS Grafana Labs Yes 10k series metrics, 50 GB logs, 50 GB traces, 14-day retention. Fits a 1-2 person rig. Predictable paid scale.
Mimir / Thanos / VictoriaMetrics Apache-2.0 Grafana Labs / Other No Large-scale Prometheus backends. Overkill — Grafana Cloud Free covers us.
Datadog / New Relic Proprietary Reject Vendor lock + pricing curves bite at scale.
Self-hosted LGTM stack Apache-2.0 Grafana Labs No Would memory-starve our 8 GB VM. Hybrid with Grafana Cloud is the right answer.
OpenTelemetry Collector Apache-2.0 CNCF (incubating) Yes One exporter that forwards to both Prometheus (local) and Grafana Cloud (managed). Standard plumbing.
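The split-export plumbing from the OTel Collector row, sketched as a collector config. The Grafana Cloud endpoint and credential are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch: {}
exporters:
  # Local Prometheus scrapes this endpoint for SLO-decisive series
  prometheus:
    endpoint: 0.0.0.0:8889
  # Everything also goes up to Grafana Cloud over OTLP/HTTP
  otlphttp/grafana-cloud:
    endpoint: https://otlp-gateway.example.grafana.net/otlp   # placeholder endpoint
    headers:
      Authorization: Basic ${env:GRAFANA_CLOUD_TOKEN}          # placeholder credential
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, otlphttp/grafana-cloud]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/grafana-cloud]
```

One pipeline, two destinations: canary gates stay on the local Prometheus even if the uplink to Grafana Cloud blips.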

Re-evaluate metrics/logs when

  • Grafana Cloud Free limits bite — 10k series isn't enough
  • Cost-visible scaling past ~$50/mo on Grafana Cloud makes self-hosted LGTM attractive on a bigger VM
  • Regulatory requirement forces log residency — self-host becomes mandatory

LLM observability

Tool License Owner Pricing Our pick? Why
Langfuse (self-host) MIT core + EE license-key for a few features Langfuse GmbH (YC) Free Conditional Official min 4 CPU / 16 GB RAM for app alone, plus ClickHouse cluster. Too heavy for 8 GB VM. Pick only if we size up.
Arize Phoenix (self-host) ELv2 (source-available, non-OSI) Arize AI Free Yes (for our scale) OTel-native. SQLite or Postgres, no ClickHouse. Runs fine on our VM. ELv2 is non-concern for internal self-host (restricts SaaS resale, which we don't do).
Langfuse Cloud Hobby N/A (SaaS) Langfuse GmbH Free 50k units/mo Backup 50k billable units sounds like a lot but complex agent traces = 15–20 units each. Hits cap fast; hard-stop at cap.
Helicone (self-host) Apache-2.0 Helicone Inc. Free self-host, 10k req/mo SaaS free Alternative Gateway + observability combined. Reasonable plan B if Phoenix unsuitable.
LangSmith Proprietary SaaS LangChain Paid tiers Reject SaaS-only, paid-gated features, LangChain-native assumptions.
Braintrust Proprietary SaaS Braintrust Free → $249/mo Reject for cost Strong on prompt regression but pricing bites for our scale.
Arize AI enterprise Proprietary Arize Contact sales Reject Phoenix OSS covers us.
W&B Traces Proprietary SaaS CoreWeave Paid Reject Broader ML observability overkill.
Cloudflare AI Gateway Proprietary SaaS Cloudflare Free passive analytics Yes (secondary) Free since we're already on Cloudflare. Passive cost tracking. Not sufficient primary — no virtual-key budgets.

The Langfuse vs Phoenix call

Langfuse is the de facto OSS LLM observability tool in 2026 and its feature set (prompt versioning, evaluations, dataset management, team workflows) is the richest. But its resource floor is genuine: v3 officially requires 4 CPU / 16 GB for the app, plus a ClickHouse cluster (3 nodes × 2 cores × 8 GB recommended). On our 8 GB single-VM k3s, it will boot and demo but will not survive sustained load.

Phoenix is the honest pick at our scale. Lighter infra, OTel-native (so traces are portable), solid eval story. We lose Langfuse's prompt-management UI and team features — acceptable for a 1-2 person team.

The lock-in defense that trumps both: instrument our code with OpenTelemetry GenAI semantic conventions, not Langfuse/Phoenix SDKs. Both platforms accept OTLP. The choice of observability backend becomes swappable.
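What instrumenting with the conventions rather than a vendor SDK means in practice: spans carry standard gen_ai.* attributes that any OTLP backend can read. A minimal sketch — attribute names are from the OTel GenAI semantic conventions, which are still incubating, so pin the semconv version you target; the helper and values are hypothetical:

```python
def genai_chat_attributes(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Standard OTel GenAI span attributes for one chat completion.

    Attach these to any span (e.g. span.set_attributes(...)); no Langfuse
    or Phoenix SDK is involved, so the backend stays swappable.
    """
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.system": "anthropic",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

# Example: one completion's attributes, ready for span.set_attributes(attrs)
attrs = genai_chat_attributes("example-model", 1200, 350)
```

Because both Langfuse and Phoenix ingest OTLP, swapping backends is an exporter-config change, not a re-instrumentation.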

Re-evaluate LLM observability when

  • We scale the VM to 16 GB+ (Langfuse becomes viable)
  • Team grows and prompt-versioning / collaborative eval becomes load-bearing
  • Phoenix ELv2 changes (unlikely but license watch)

LLM gateway / proxy

Tool License Owner Pricing Our pick? Why
LiteLLM MIT (core) + commercial Enterprise BerriAI (YC W23) Free Yes Only OSS option with proper per-virtual-key budget enforcement + duration resets. OpenAI-format passthrough. Anthropic provider first-class.
Portkey Gateway Apache-2.0 (fully OSS since March 2026) Portkey Free self-host; $9 per 100k logs managed Documented fallback Fully OSS escape hatch if LiteLLM stumbles. Processing 1T+ tokens/day across users.
Cloudflare AI Gateway Proprietary SaaS Cloudflare Free Secondary Passive observability already in our stack. No virtual-key budgets — not sufficient as primary.
OpenRouter Proprietary SaaS OpenRouter 5.5% markup No Adds hop, no self-host, no per-key budgets like LiteLLM.
Kong AI Gateway Proprietary (Enterprise plugin) Kong Enterprise contract Reject Enterprise pricing not justified.
TrueFoundry Proprietary SaaS TrueFoundry Paid Reject Platform-level opinion.
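The per-virtual-key budget mechanism the LiteLLM row leans on, sketched as a proxy config. The model id and env-var names are placeholders; parameter names follow the LiteLLM proxy docs and should be verified against the pinned version:

```yaml
# litellm config.yaml sketch
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/example-model          # placeholder model id
      api_key: os.environ/ANTHROPIC_API_KEY
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL       # Postgres; required for virtual keys
```

Virtual keys are then minted against the running proxy (a POST to /key/generate with max_budget and budget_duration); a key that exceeds its envelope inside the window is refused with 429 until the window resets, which is the behavior the verification note above confirmed stays free in LiteLLM and Enterprise-only in Portkey.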

The LiteLLM SPoF concern

LiteLLM is a single point of failure: if it's down, all agents block.

Mitigations:

  1. Run ≥2 replicas behind a k3s Service with Postgres + Redis shared state.
  2. Client-side fallback to direct api.anthropic.com after N seconds of 5xx from the proxy — but this bypasses budget enforcement by design; document as acceptable degraded mode.
  3. Monitor proxy health as a first-class SLO in Prometheus.

LiteLLM funding / license risk

  • License: MIT intact. No BSL drift as of Q2 2026.
  • Owner: BerriAI, YC W23, publicly reported ~$2.1M seed. No Series A disclosed. Venture-stage risk is real.
  • 12–24 month watch: (a) BSL-style moves on enterprise features only would be fine for us (we're on OSS), (b) aggressive monetization could lock some OSS features behind keys, (c) a quiet under-maintenance period is the likeliest failure mode.

Re-evaluate gateway when

  • LiteLLM license changes or project health deteriorates
  • Our traffic exceeds what LiteLLM's Postgres-backed rate limiter can handle (~10k RPS)
  • Portkey Gateway momentum surpasses LiteLLM's

Progressive delivery / canary

Tool License Owner Our pick? Why
Flagger Apache-2.0 CNCF (graduated via Flux) Yes FluxCD-native. Owns its own Canary CRD that shadows the Deployment — no field-level fights with Flux. Webhooks at every phase for Conductor-E integration. ~100 MB controller footprint.
Argo Rollouts Apache-2.0 CNCF (graduated via Argo) No Mutates fields Flux also reconciles — recurring drift fights. Pair with ArgoCD, not Flux.
Keptn CNCF archived 2025-09-03 Reject (dead) Dynatrace team pulled back. Do not adopt.
OpenKruise Rollout Apache-2.0 OpenKruise (CNCF sandbox) No Mostly Alibaba ecosystem. Right only if we need StatefulSet canary.
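A sketch of the Canary resource the Flagger row describes, including the webhook hook a Conductor-E integration would use. Thresholds, ports, and the webhook URL are hypothetical; the resource shape follows Flagger's Canary CRD:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: dev-e
  namespace: dev-e
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dev-e
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5        # failed checks before automatic rollback
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate   # built-in check, answered by local Prometheus
        thresholdRange:
          min: 99
        interval: 1m
    webhooks:
      - name: notify-conductor-e
        type: event
        url: http://conductor-e.conductor-e/hooks/canary   # hypothetical endpoint
```

Flagger owns the Canary CRD and generates the shadow Deployment itself, which is why it coexists with Flux instead of fighting it over reconciled fields.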

Re-evaluate canary when

  • We migrate from Flux to ArgoCD (Argo Rollouts becomes natural)
  • Flagger project health deteriorates (currently active)

Feature flags

Honest YAGNI

Feature flags at our scale (1-2 humans, few services, no A/B testing need) are overkill today. Env vars + Kustomize overlays per environment cover the actual use case — compile/deploy-time toggles — at zero operational cost. We should adopt a flag system when there's a concrete targeting, experimentation, or kill-switch need — not before.
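The env-vars-via-Kustomize pattern, for the record. The app and flag names are hypothetical; the generator fields follow standard Kustomize:

```yaml
# base/kustomization.yaml
resources:
  - deployment.yaml
configMapGenerator:
  - name: dev-e-flags
    literals:
      - ENABLE_REVIEW_QUEUE=false   # hypothetical compile/deploy-time toggle
---
# overlays/prod/kustomization.yaml — flip the toggle per environment
resources:
  - ../../base
configMapGenerator:
  - name: dev-e-flags
    behavior: merge
    literals:
      - ENABLE_REVIEW_QUEUE=true
```

The container consumes the ConfigMap via envFrom, so a toggle flip is a normal GitOps commit: reviewed, audited, and rolled back the same way as any other change, with zero extra runtime components.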

Tool License Owner Our pick? Why
env vars + Kustomize overlays Yes (now) Zero ops cost. Covers 100% of actual current need.
OpenFeature + flagd Apache-2.0 CNCF (incubating) Deferred Right pick when we need runtime toggles. Sidecar ~30-60 MB. JSON flag config.
Flipt GPL-3.0 (server) + MIT (clients) flipt-io Alternative GitOps-native YAML flags. Single Go binary. GPL server is sticky but fine for internal use.
GrowthBook MIT GrowthBook Alternative If we need A/B experimentation with stats out of the box. OpenFeature SDK.
PostHog feature flags MIT (self-host) + SaaS PostHog Consider if we adopt PostHog Bundled with analytics. Zero marginal cost if already using PostHog.
Unleash Apache-2.0 core (EOL 2025-12-31) Unleash Reject (dying OSS) Enterprise-only going forward. Avoid for new adoption.
LaunchDarkly Proprietary SaaS LaunchDarkly Reject $12/seat/mo + MAU overages. Overkill by an order of magnitude.
Statsig Proprietary SaaS Statsig Reject for lock-in Generous free tier (1M MTUs) but SaaS-only.
ConfigCat Proprietary SaaS ConfigCat Alternative Free tier forever, simple, Hungarian SaaS. If we want SaaS and not LaunchDarkly.
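For scale, the "JSON flag config" that the deferred flagd pick consumes looks like this — a sketch in flagd's flag-definition format, with a hypothetical flag name:

```json
{
  "$schema": "https://flagd.dev/schema/v0/flags.json",
  "flags": {
    "new-planner": {
      "state": "ENABLED",
      "variants": { "on": true, "off": false },
      "defaultVariant": "off"
    }
  }
}
```

A file like this could itself live in Git and be reconciled by Flux, so adopting flagd later would extend the GitOps posture rather than break it.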

Re-evaluate feature flags when

  • We need per-user or per-tenant targeting that env vars can't express
  • A/B experimentation with real statistical significance becomes a product need
  • We adopt PostHog for analytics (flags come bundled)
  • Any T1 incident where a kill switch faster than kubectl rollout undo would have saved us

DB migration safety

Tool License Owner Our pick? Why
pgroll Apache-2.0 Xata Yes Automates expand/contract safely for Postgres — the only tool in this category that keeps v1 and v2 of a schema simultaneously queryable via views, with triggered backfill. Atlas does not implement this; it lints for unsafe DDL and rolls out carefully but executes a single migration against a single schema. Moderate single-vendor bus factor (Xata is ~27 employees, still operating; pivoted mid-2025 to serverless Postgres with Simplyblock). Release cadence decelerated: v0.16.1 last released 2026-02-17. Verified April 2026.
Atlas (Community Edition) Apache-2.0 + EULA on official binaries Ariga Alternative / hedge Declarative schema-as-code + linting. Source Apache-2.0; official binaries under Atlas EULA. Build from source if EULA matters.
Flyway Community Apache-2.0 core (Redgate-owned) Redgate Alternative Classic versioned SQL migrations. Not zero-downtime-automated. License creep concern (Redgate moving features out of OSS).
gh-ost MIT GitHub Irrelevant MySQL only. We're on Postgres.
Reshape Apache-2.0 fabianlindfors Reject (bus factor 1) Single-author, author's focus shifted. Don't adopt for production.
Bytebase Apache-2.0 (5-source limit) Bytebase No UI-heavy workflow tool. Overkill for 1-2 person rig.

The pgroll bus factor hedge

Corrected: pgroll files are YAML, not SQL

An earlier draft claimed we could "keep migrations inspectable SQL... runnable by plain psql." That's wrong — pgroll migration files are pgroll-specific operation YAML (e.g., add_column, drop_column, set_not_null), not raw SQL. The actual hedge: keep a parallel SQL trail. For every pgroll operation that runs, commit the generated SQL (pgroll migrate --dry-run --json | pgroll generate-sql) alongside the operation YAML. If Xata folds, the SQL trail lets us reconstruct schema state; we then pick up with plain Flyway or Atlas going forward. This does not make individual operations portable — it keeps the history reconstructible.
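To make the correction concrete, here is what a pgroll operation file actually looks like — a sketch in pgroll's YAML operation format, with hypothetical table and column names; note there is no SQL in it for `psql` to run:

```yaml
# Sketch of a pgroll migration file: pgroll-specific operations, not SQL.
# Table and column names are hypothetical.
operations:
  - add_column:
      table: runs
      column:
        name: retry_count
        type: integer
        nullable: true
        default: "0"
```

pgroll expands this into views, triggers, and backfill DDL at execution time — which is precisely the generated SQL the parallel trail needs to capture.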

Re-evaluate DB migrations when

  • Xata pivots or folds
  • A migration we need is outside pgroll's expand/contract model (e.g., type changes with data loss implications)

Supply chain for dependencies

Tool Our pick? Why
GitHub Dependabot (malware mode) Yes Free with GitHub. Flags npm packages that appear in the GitHub Advisory Database malware feed.
Socket.dev Yes Per-dependency security score. PR check fails below threshold.
Package-age policy (14d minimum) Yes (via CI gate) Datadog's pattern. Catches typosquat account-takeovers.
Syft (SBOM) + Grype (CVE scan) Yes Apache-2.0, Anchore, widely adopted.
Snyk Reject for cost Dependabot + Socket covers it cheaper.

Container and CI

Tool License Owner Our pick? Why
GitHub Actions Proprietary GitHub Yes Already in use. OIDC to cosign and Sigstore. Moderate vendor lock-in, acceptable given GitHub is also our SCM.
GHCR Proprietary GitHub Yes Already in use. Paired with Actions. Lock-in acceptable.
Flux CD Apache-2.0 CNCF (graduated) Yes Already our GitOps. Stable.
Argo CD Apache-2.0 CNCF (graduated) No Alternative to Flux; switching costs exceed benefit for us.

Cluster and runtime

Tool License Owner Our pick? Why
k3s Apache-2.0 CNCF (sandbox, maintained by SUSE) Yes Lightweight K8s, single-binary install, fits 8 GB VM.
KEDA Apache-2.0 CNCF (graduated) Yes Event-driven autoscaling + scale-to-zero. Already deployed.
GCP Compute (one VM) Proprietary Google Yes Small bill, predictable, good enough.
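The scale-to-zero shape KEDA gives us can be sketched as a ScaledObject. This is an illustration, not our deployed manifest: the Deployment name `worker`, the Prometheus address, and the queue-depth query are all hypothetical.

```yaml
# Sketch: KEDA ScaledObject with scale-to-zero on an idle queue.
# All names and the query below are hypothetical examples.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker
spec:
  scaleTargetRef:
    name: worker            # the Deployment KEDA scales
  minReplicaCount: 0        # scale-to-zero when no work is queued
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(jobs_queued_total[2m]))
        threshold: "1"
```

On an 8 GB VM, `minReplicaCount: 0` is the feature that matters: idle workers cost zero memory, which is what makes the single-node footprint viable.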

Re-evaluate cluster when

  • We outgrow a single VM (multi-node Kubernetes warranted)
  • GCP pricing shifts unfavorably
  • k3s project health deteriorates (currently healthy under SUSE)

Human vault and docs

Tool License Owner Our pick? Why
Bitwarden GPL-3.0 (self-hostable) + SaaS Bitwarden Inc. Yes Already in use. Self-host option if SaaS changes unfavorably.
MkDocs Material MIT (community) + commercial Insiders Martin Donath Yes Our docs-site. Community edition is sufficient.

Evaluation

Tool License Owner Our pick? Why
Inspect AI MIT UK AISI Candidate — validate in Era 2 Released March 2026. Adopted by METR, Apollo, major labs. OSS, agent-aware, production-shaped. Too new to call chosen; revisit once we have a nightly run with 60 days of data comparing it against raw pytest-style harnesses.
SWE-bench Pro MIT Scale AI Yes (benchmark) Replacement for Verified (contaminated). 1,865 multi-language tasks.
lm-eval-harness MIT EleutherAI No (benchmark-only) Raw model quality, not agent-scaffolding quality.
OpenAI Evals MIT OpenAI Reject (abandoned) Historical.
Hypothesis MPL-2.0 Community Yes Property-based testing for Python code agents write.

Lock-in exposure summary

The rig's total lock-in exposure, honestly:

Vendor Lock-in level Criticality Why
Anthropic (as default LLM provider) High Critical LLM is the engine. LiteLLM + OTel GenAI conventions make runtime and backend swappable (see provider-portability.md); prompts are the sticky layer — migrating to OpenAI or Gemini needs per-prompt re-authoring and a re-run of the eval suite. Concrete, not unbounded.
GitHub High Critical Source, CI, OIDC, artifact registry, Issues — deeply wired.
GCP Medium High One VM — replaceable with any VPS vendor, but DNS/network moves cost ~1 week.
Cloudflare Medium Medium DNS, tunnels, Pages — replaceable, 1-2 days of work.
Sigstore public infra Low Medium Public good service. Private Rekor is the escape if the service model changes.
All CNCF-graduated tools (Flux, k3s, KEDA, Kyverno, Cilium, Flagger, cert-manager) Very low High Portable, active foundations.
LiteLLM Low High MIT + Portkey as fallback.
Langfuse/Phoenix Low Medium OTel GenAI conventions make swap trivial.
SOPS (getsops) Very low Medium MPL-2.0, CNCF governance, active maintainers; SOPS files are portable ciphertext — any decrypter reads them.
pgroll Medium Medium Single-vendor bus factor (Xata). The parallel SQL trail keeps schema history reconstructible.

The ones that would hurt to lose: Anthropic (prompt portability hard), GitHub (everything wired there). Every other pick has a concrete escape hatch.

When any pick is re-evaluated

The whitepaper's picks are living decisions. Trigger a re-evaluation when:

  • License change on a critical tool (BSL drift is the modern pattern)
  • Ownership change — acquisitions, foundations handing off, single maintainers disappearing
  • Material scale change — we grow past 5 operators, add a second cluster, serve customer traffic
  • Active incident — a pick contributed to an outage and the compensating controls weren't enough
  • Cheaper / better alternative emerges with 2+ years of production adoption evidence

Every re-evaluation ends in one of: keep, migrate, or defer. The decision gets a timestamp and a link to this document's updated version.

What we explicitly reject

Short list of things we have evaluated and ruled out:

  • Vault now (OpenBao later if needed, not adopt Vault)
  • SealedSecrets (never deployed; SOPS is the chosen primary with better governance and no extra controller pod)
  • Full self-hosted LGTM stack on 8 GB (memory-starves)
  • Argo Rollouts with Flux (drift fights)
  • Unleash OSS (EOL)
  • Keptn (CNCF-archived)
  • Doppler / LaunchDarkly / Kong AI Gateway / TrueFoundry (SaaS-only or enterprise pricing)
  • Reshape (bus factor 1)
  • HSM-backed PGP signing (worse than keyless Sigstore)
  • CSI Secrets Store at single-VM scale (DaemonSet footprint)
  • OpenAI Evals (abandoned)
  • microVMs (e2b, Daytona, Firecracker) (wrong threat model)
  • Dev-E .NET standalone worker (dashecorp/dev-e, archived 2026-04-17 — CommandCodeExecutor shells out to claude-cli without MCP injection, stream-json parsing, or token refresh; the value lives in the CLI driver, not the outer state machine, and Node rig-agent-runtime already implements both)

See also