Tool Choices — An ADR for Every Pick¶
TL;DR
Every tool named in the whitepaper gets a defensible answer to: what problem, what alternatives, why this, license and backing, pricing, lock-in risk, migration path. The exercise changed several of the original picks after honest re-evaluation:
- Drop Vault (overkill).
- SOPS + age is the deployed secrets pick, corrected through three rounds — see the Secrets section; the rig was always on SOPS, and the earlier retractions assumed otherwise.
- Add Phoenix as the alternative to Langfuse (8 GB VM reality).
- Defer feature flags (YAGNI at our scale).
- Hedge pgroll with an inspectable SQL trail (single-vendor bus factor, correctly framed).
This document is the reasoning the other whitepaper docs just assert. Every line of the form "We use X" elsewhere has a row here that explains why X and not Y.
How to read each entry¶
Every pick is evaluated against the same rubric:
| Column | What it captures |
|---|---|
| License | OSI-approved? Copyleft? Source-available-but-restricted? Specific license string (MIT, Apache-2.0, MPL-2.0, BSL, ELv2, AGPL, GPL, proprietary). |
| Owner / governance | Single company? Foundation? BDFL? Community-elected? |
| Pricing | Free for our use? Tier structure? Where does the pricing curve bite? |
| Bus factor | If the primary maintainer disappears, who keeps this alive? |
| Lock-in risk | If we need to leave, how bad is the migration? |
| Escape hatch | Concrete alternative we'd adopt if we had to move. |
| Re-evaluate when | The signal that tells us this pick is no longer right. |
The goal is not to minimize every axis (impossible) but to be explicit about each, so future us — or a future maintainer — can argue with our choices from a base of evidence rather than vibes.
Headline changes from original whitepaper¶
Where the honest re-evaluation changed the pick
- Secrets: drop Vault. SOPS + age + Flux is what's actually deployed (verified live in apps/*/*.sops.yaml). External Secrets Operator + GCP Secret Manager is deferred until needed. GitHub App installation tokens are minted on-demand. OpenBao is the correct choice if and when we ever need Vault-class dynamic-secret capability — not now. Earlier drafts claimed SealedSecrets was our current state; that was wrong (never deployed). Third-order correction recorded in the retraction log.
- LLM observability: add Phoenix. Langfuse v3 wants 16 GB RAM minimum and a separate ClickHouse cluster. On our 8 GB VM, Phoenix (ELv2, OTel-native, SQLite/Postgres, no ClickHouse) is the honest self-host pick.
- Feature flags: defer. flagd + OpenFeature is defensible eventually but for 1-2 humans and few services with no A/B testing need, env-vars-via-Kustomize is sufficient. Adopt a flag system when there's a concrete targeting / experimentation requirement.
- Unleash: explicitly reject. OSS edition deprecated and reached EOL 2025-12-31. Was previously a reasonable alternative; no longer is.
- Instrument with OpenTelemetry GenAI semantic conventions, not vendor SDKs. This is the single highest-leverage lock-in defense in the stack.
Secrets management¶
The most user-called-out section. The original whitepaper promised Vault for short-lived credentials; honest re-evaluation says we don't need it.
The Vault-vs-SOPS question (directly)¶
They solve overlapping but distinct problems:
| Dimension | SOPS + age (or SealedSecrets) | Vault / OpenBao |
|---|---|---|
| What it encrypts | Files at rest in git | Secret values fetched at runtime |
| Dynamic secrets | No | Yes — mint short-lived DB users, cloud creds, GitHub App tokens |
| Ops footprint | Zero runtime service | 3+ node HA cluster, unsealing, upgrades |
| Reviewable in PRs | Yes (encrypted blob diffs cleanly) | No (secrets never in git) |
| Revoke on compromise | Git commit + rotate everywhere | One API call, cluster-wide |
| Audit log | Git history | Vault audit log |
| Disaster recovery | Git repo + decryption key | Vault snapshot + unseal keys |
For a high-traffic production system serving paying customers: Vault (or OpenBao) wins clearly — dynamic secrets + centralized revocation + audit log are irreplaceable.
For a 1-2 person rig on one 8 GB VM: SOPS-style encryption + ESO shim + cloud-KMS-backed secret manager is simpler, cheaper, and covers the real threat model.
Retracted (third-order correction), 2026-04-17 — we were never on SealedSecrets
Two previous retractions in this ADR (log below) framed a migration from SealedSecrets to SOPS. That framing was wrong about the ground-truth deployed state. Verified by grep-ing the repo: zero kind: SealedSecret references, zero sealed-secrets-controller HelmRelease, zero bitnami-labs image pulls. Every secret in the rig — Dev-E, Review-E, Conductor-E, Cloudflared — is already SOPS-encrypted (*.sops.yaml files in apps/). SOPS + age + Flux was always the deployed pick; there was never any SealedSecrets to migrate from.
The earlier narratives ("SealedSecrets keep", then "SealedSecrets legacy migrating") were built on an earlier research-agent summary that asserted SealedSecrets was our current deployment. I accepted that without running grep -r SealedSecret apps/. I ran that grep today and it returned nothing. The Broadcom-paywall concern is real for anyone using SealedSecrets but was theoretical for us — we avoided the risk by already being on SOPS, not by deliberately migrating off. The "GHCR hedge" I proposed is unnecessary because we don't pull the image at all.
Meta-lesson added to the fresh-start evaluation log (below): verify ground-truth deployed state, not research-agent summaries, before writing retraction narratives. A 10-second grep would have prevented two rounds of wrong framing.
Current pick (verified live in apps/ as of 2026-04-17):
SOPS + age + Flux kustomize-controller (deployed primary, has been all along)
+ .sops.yaml at repo root with creation_rules covering apps/*/*.sops.yaml
+ Cluster-scoped age key in flux-system/sops-age Secret
+ Per-app encrypted manifests: apps/dev-e/dev-e-secrets.sops.yaml, apps/review-e/review-e-secrets.sops.yaml, apps/conductor-e/conductor-e-secrets.sops.yaml, apps/cloudflared/tunnel-token.sops.yaml
+ Each kustomization sets decryption.provider: sops + secretRef.name: sops-age
+ GitHub App installation tokens minted on-demand at pod startup (1h TTL)
+ Static narrow-grant Postgres service accounts
+ External Secrets Operator deferred (not yet needed — git-at-rest scales to our inventory)
+ Vault / OpenBao deferred (no dynamic-secret requirement yet)
See docs/sops.md for the operational reference (how to bootstrap an encrypted secret, rotation procedure, key management).
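A minimal sketch of that layout, assuming the standard SOPS creation-rules and Flux `decryption` field names; the age recipient, app name, and intervals are illustrative, not copied from the repo:

```yaml
# .sops.yaml at repo root — creation rules so sops encrypts only the
# Secret data fields of app manifests (age recipient is a placeholder)
creation_rules:
  - path_regex: apps/.*\.sops\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: age1examplerecipientxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
---
# Flux Kustomization stanza — kustomize-controller decrypts inline
# using the cluster-scoped age key in the flux-system/sops-age Secret
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: dev-e
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/dev-e
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops
    secretRef:
      name: sops-age
```

No extra controller is involved: the decryption happens inside kustomize-controller itself, which is why the ops-footprint row above reads "zero runtime service."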
Secrets tooling matrix¶
| Tool | License | Owner | Our pick? | Why / why not |
|---|---|---|---|---|
| HashiCorp Vault | BSL 1.1 | IBM (acq. Feb 2025) | No | BSL is tolerable (we're non-competing) but HCP Vault Secrets EOL July 2026, IBM pricing plays, velocity concerns. Operationally expensive (3+ node HA, unsealing). Dynamic secrets are genuinely excellent — we just don't need them yet. |
| OpenBao | MPL-2.0 | Linux Foundation | Deferred | The correct answer if we ever need Vault-class capability. API-compatible with Vault; ESO works unchanged. Same ops burden as Vault. Adopt when we have a concrete unmet need for dynamic secrets. |
| SOPS + age | MPL-2.0 | CNCF (getsops org) | Yes (deployed primary) | The actual deployed pattern. Verified live in apps/*/*.sops.yaml — all four active app namespaces use it. Flux decrypts inline via kustomize-controller --decryption-provider=sops (no additional controller); age keys are simpler than GPG. MPL-2.0 forever, CNCF governance. |
| SealedSecrets | Apache-2.0 | bitnami-labs (Broadcom-owned) | Not in use | Not deployed. Not used. Not a migration target. Broadcom's Bitnami catalog paywall (verified real — bitnami/postgresql:17.5.0 returns 404, same namespace as sealed-secrets-controller image) is a risk for shops that use it; we avoided it by default, not by design. Leaving the row here for the ADR audit trail. |
| External Secrets Operator | Apache-2.0 | CNCF (incubating) | Yes (add) | The reversibility insurance. Backend-agnostic — swap GCP SM → OpenBao → Infisical by changing a CRD, workloads untouched. |
| GCP Secret Manager | Proprietary (GCP) | Google | Yes (add) | We're already on GCP. Free tier covers our inventory. No dynamic secrets, but it doesn't need them. Access via ESO = low lock-in. |
| Infisical | MIT (core) + SaaS | Infisical Inc. (YC) | No (for now) | Strong middle-ground between Bitwarden and Vault. Reasonable alternative if we outgrow GCP SM before we need Vault. |
| Doppler | Proprietary SaaS | Doppler Inc. | Reject | SaaS-only; no self-host. Makes Doppler outage = our deploy outage. Strongest lock-in on the list. |
| 1Password Connect | Proprietary | 1Password | Partial | We already use Bitwarden for human vault. 1Password Connect is fine if we switched, but no reason to. |
| CSI Secrets Store | Apache-2.0 | Kubernetes | No | DaemonSet footprint too heavy on single 8 GB VM. Right choice for regulated workloads avoiding etcd. |
| cert-manager + trust-manager | Apache-2.0 | CNCF (graduated) | Yes (add) | Table stakes. Non-controversial. |
Re-evaluate secrets when¶
- We add a second K8s cluster or second Postgres instance (static narrow grants stop scaling)
- We take a compliance requirement that mandates audit log on secret access
- A secret actually gets compromised (rotation scope pain becomes real)
- Our team grows past ~5 operators (human secret-handling becomes the bottleneck)
- The getsops.io project stalls or is archived (then fork or migrate to an alternative; currently healthy)
Retraction log — secrets picks (three rounds)¶
Honest disclosure of where the first, second, and third drafts of this ADR got it wrong about secrets.
| Round | What the draft said | What changed | Why it was wrong |
|---|---|---|---|
| 1 | "SealedSecrets — Yes (keep) + ESO + GCP SM + GitHub App tokens. Governance risk post-Broadcom/Bitnami is real but no migration pressure yet." | Promoted SealedSecrets to the declared primary; treated SOPS as redundant. | Accepted a research-agent summary that claimed SealedSecrets was our deployed state. Never grep-verified. Built a whole defense around an incorrect premise. |
| 2 | "SOPS + age is now primary. SealedSecrets is Legacy (migrating). Interim hedge: switch image source to ghcr.io/bitnami-labs/sealed-secrets-controller." | SOPS promoted, SealedSecrets relabeled legacy-migrating, elaborate Broadcom-paywall hedge proposed. | Still wrong about ground truth. Corrected the right-and-wrong framing of the tools but kept asserting SealedSecrets was our deployment. The Broadcom paywall research was real, but the "migration path" was literature for a migration that didn't need to happen. |
| 3 (this entry) | "SOPS + age + Flux was always the deployed pick. There was never any SealedSecrets. Earlier retractions were based on an unverified premise." | Corrected: zero SealedSecrets in the repo (grep -r SealedSecret apps/ returns nothing). .sops.yaml at repo root covers all apps. Every deployed app uses *.sops.yaml with decryption.provider: sops. | The meta-lesson: verify ground-truth deployed state, not research-agent summaries, before writing retractions. A 10-second grep -r would have prevented two rounds of wrong framing. The second-order lesson — also named in the Fresh-start log's "three patterns" — was "hedge narratives need re-verification." This round adds the stronger version: "assertions about current deployed state need re-verification too." |
The incumbent-bias lesson still stands¶
Round 1's incumbent-bias problem (documented in the Round 1 retraction above) was a real pattern: I skipped the fresh-start test on secrets while applying it elsewhere. Round 2's correction — "SOPS wins on governance, operational cost, license permanence" — was the right conclusion reached via the right reasoning. What was wrong in Round 2 was the framing (claimed migration from incumbent) not the verdict (SOPS over SealedSecrets). Round 3 leaves the verdict intact and fixes only the factually-inaccurate framing.
The ground-truth-verification lesson¶
Add to the fresh-start evaluation meta-rules (below): before asserting what's deployed, run grep against the repo. Research-agent summaries can be wrong or stale. I had the tools to verify in the first round and didn't use them. Don't skip verification of ground truth; it's cheaper than two rounds of retraction.
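The rule can be encoded as a check rather than a habit. A minimal sketch, assuming it is fed the text of each manifest under apps/; it is a scripted stand-in for `grep -r SealedSecret apps/`, and the function name is hypothetical:

```python
# Hypothetical pre-retraction check: given manifest texts from the repo,
# report which secret mechanisms are actually present, rather than
# trusting a research-agent summary of "what's currently deployed".
def deployed_secret_mechanisms(manifest_texts):
    found = set()
    for text in manifest_texts:
        if "kind: SealedSecret" in text:
            found.add("sealedsecrets")
        if "provider: sops" in text:
            found.add("sops")
    return found
```

Round 3's finding, in these terms: for this repo the function returns `{"sops"}` and never `"sealedsecrets"`.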
Fresh-start evaluation log (April 2026)¶
Honest application of the fresh-start test — "if we were picking this from scratch today with no prior context, what would we pick?" — to every tool category in this ADR. This log replaces the earlier "Broader incumbent-bias check" which was a checklist, not an evaluation. Two picks were verified in depth this round (LiteLLM and pgroll); the rest were audited against 2026-current alternatives.
Summary table¶
| Category | Current pick | Fresh-start (2026) verdict | Rigor of this round |
|---|---|---|---|
| Secrets (git-at-rest) | SOPS + age (deployed; always was) | Corrected — SOPS is the deployed pick; earlier retractions #1 and #2 framed a migration that didn't need to happen because SealedSecrets was never deployed | Deep (three rounds of retraction; see log) |
| Policy engine | Kyverno | Keep — unchanged | Shallow |
| Supply-chain signing | Sigstore (cosign, gitsign, rekor, slsa-github-generator) | Keep — no credible OSS competitor | Shallow |
| Networking / L7 egress | Cilium | Keep — no equivalent at L7 in OSS | Shallow |
| Metrics / logs / traces | Prometheus (local) + Grafana Cloud Free (managed) + OTel Collector | Keep — name VictoriaMetrics as lighter-weight alternative for multi-node future | Medium |
| LLM observability | Phoenix (8 GB) / Langfuse (16 GB+) | Keep — already retracted in earlier PR | Medium |
| LLM gateway | LiteLLM | Keep — verified. Portkey's "fully OSS" March 2026 announcement kept per-key budget enforcement Enterprise-only; original pick rationale holds | Deep (verified this PR) |
| Progressive delivery | Flagger | Keep — Flux-native, Argo Rollouts fights Flux | Shallow |
| Feature flags | Deferred (flagd when needed) | Keep — no change; PostHog named as bundled alternative | Shallow |
| DB migration safety | pgroll | Keep — verified. Atlas has closed most gaps but does not implement expand/contract multi-version schemas — pgroll's core differentiator stands | Deep (verified this PR) + hedge-narrative fixed |
| Supply chain (deps) | Dependabot + Socket.dev + Syft+Grype + package-age policy | Keep — name trivy as Grype alternative | Shallow |
| Container + CI | GitHub Actions + GHCR | Keep — incumbent-and-defensible (SCM + CI + registry bundle dominates) | Shallow |
| GitOps | Flux | Keep — incumbent-and-defensible (Flagger picks Flux-native; switching cascades) | Shallow |
| Cluster runtime | k3s | Keep — name Talos Linux as multi-node future consideration | Medium |
| Event-driven autoscale | KEDA | Keep — no credible competitor | Shallow |
| Cloud compute | GCP Compute | Keep — incumbent-and-defensible (Workload Identity + DNS already wired) | Shallow |
| Human vault | Bitwarden | Keep — name Vaultwarden (Rust self-host port) for future | Shallow |
| Docs site | MkDocs Material | Keep — Docusaurus/VitePress/Astro Starlight are reasonable if we want more customization | Shallow |
| Evaluation harness | Inspect AI (candidate) | Already flagged candidate — validate in Era 2 | Already done |
Verified deep this round: LiteLLM (stays)¶
Portkey Gateway went fully open source March 2026 (Apache-2.0, 1T+ tokens/day). Original LiteLLM pick reason was "only OSS option with per-virtual-key budget enforcement." Re-verified against Portkey's 2026 documentation:
- Portkey Budget Limits docs: "Budget Limit is currently only available to Portkey Enterprise Plan customers."
- Portkey Rate Limits docs: "Rate Limits are available exclusively to Portkey Enterprise customers and select Pro users."
The 2026 "fully OSS" announcement was a governance + observability + MCP-registry open-sourcing, not a cost-controls open-sourcing. The original LiteLLM differentiator (per-virtual-key budget envelopes with duration windows returning 429 on exceed, free) still holds.
LiteLLM's known bugs (#12905, #10750, #12977, #25386) don't touch our specific config pattern (we have ~5 explicitly-configured keys, no team-scoped nesting, no pass-through routes, no AzureOpenAI direct client, no auto-created end users). Verdict: stay on LiteLLM. Revisit if (a) Portkey moves budget-limits to OSS, or (b) we scale past ~500 RPS where LiteLLM's documented memory issues at 2k RPS start to bite.
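The budget-envelope differentiator is concrete enough to sketch. Per the LiteLLM proxy docs, virtual keys are minted via the `/key/generate` endpoint, and `max_budget` plus `budget_duration` are the fields that give the cap-and-reset behavior; the helper name and values below are illustrative:

```python
# Illustrative payload for LiteLLM's /key/generate endpoint: a virtual
# key whose spend is hard-capped at max_budget (USD) per window.
# Requests past the cap get 429 until the budget_duration window resets.
def budget_key_payload(alias: str, max_budget_usd: float, window: str) -> dict:
    return {
        "key_alias": alias,            # human-readable key name
        "max_budget": max_budget_usd,  # hard spend cap in USD
        "budget_duration": window,     # e.g. "30d" — budget resets per window
    }
```

In practice this payload is POSTed to the proxy with the master key; each of our ~5 explicitly-configured keys gets its own envelope.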
Verified deep this round: pgroll (stays) + hedge narrative corrected¶
Atlas (Ariga) has shipped rapidly in 2025–2026: v1.2.0 on 2026-04-10, a Kubernetes operator (Apache-2.0 with some EULA image layers), 50+ migration safety analyzers, and a weekly-to-biweekly release cadence. The feature gap against pgroll has narrowed significantly. But Atlas does NOT implement real expand/contract with multi-version schema views plus triggered backfill — it lints for unsafe DDL, emits concurrent-index DDL, and rolls out carefully, but it executes a single migration against a single schema.
For our specific workload (one Postgres, ~10–30 tables, expand/contract required for zero-downtime), pgroll is still the only tool that keeps v1 and v2 of a table simultaneously queryable. Verdict: stay on pgroll. Revisit if Xata misses another release quarter (no v0.17 by end of Q3 2026), announces a shutdown/acquisition, or Atlas ships native expand/contract.
Corrected the hedge narrative: earlier drafts implied pgroll migrations are plain SQL. They're not — they're pgroll-specific operation YAML. The correct hedge is to commit generated SQL alongside each operation YAML (via pgroll SQL emission) so schema history stays reconstructible. See the pgroll section above for the corrected wording.
Shallow-audited: what "fresh-start keep" actually means¶
For the shallow-audited picks (Kyverno, Sigstore, Cilium, Flagger, k3s, KEDA, Dependabot/Socket, GitHub/GHCR/Flux, Bitwarden, MkDocs), "fresh-start keep" means: I considered the current 2026 alternatives to each and none clearly beat the incumbent for our scale on license, governance, operational cost, and feature coverage. They are the picks I would make today if starting from scratch.
A stronger level of rigor would be individual per-category research agents (like I did for LiteLLM and pgroll). That's worth doing when a specific concern surfaces (as with the Portkey announcement, Xata's release cadence, and Broadcom-Bitnami). Applying it to every pick every month is over-engineering.
What the deeper-audited rounds taught us¶
Four patterns emerged from the SOPS (three rounds), LiteLLM, and pgroll deep audits:
- Announcements lie about feature scope. Portkey's March 2026 "fully OSS" announcement was marketing; the feature we care about stayed paywalled. Always verify against current docs, not the press release.
- Release cadence is a signal. pgroll's decelerating releases (v0.16.1 in February, nothing since) are consistent with Xata keeping pgroll in maintenance mode as an internal-product-first tool. Not alarming on its own, but worth tracking.
- Hedge narratives need re-verification. The "keep migrations as plain SQL" hedge in the earlier pgroll writeup turned out to be wrong — pgroll operation files are YAML. When we write a hedge, we should confirm it's actually realisable, not just aspirational.
- Ground-truth deployed state needs re-verification too. The SealedSecrets retraction had to happen three times because the first two rounds accepted a research-agent summary about "what's currently deployed" instead of grep-verifying the repo. A 10-second grep (grep -r SealedSecret apps/) would have prevented it. Stronger version of pattern (3): "before asserting what's deployed, verify."
Categories that warrant re-examination eventually¶
Not actionable today, flagged for future attention:
- Bitwarden — picked because humans already use it. 1Password has better team-grant ergonomics; Vaultwarden is an unofficial Rust self-host port if we want more control; Infisical covers human+automation in one product (at the cost of a YC-company dependency). Re-evaluate if team grows past 3 operators or if we start needing per-project secret segregation.
- MkDocs Material — Python-docs gold standard today, but Docusaurus (Meta-backed, React), VitePress (Vue/Vite), and Astro Starlight (Astro) are reasonable alternatives with better customization. Low priority — the docs site works.
- k3s — ideal for single-VM. If we go multi-node for any reason, Talos Linux (immutable, API-only, no SSH, no shell) is a stronger security baseline. It's less a k3s replacement than a different host-OS story: Talos ships its own vanilla Kubernetes rather than k3s.
Meta-rule, reaffirmed¶
When an ADR row reads "already deployed — keep" without a license/governance/operational comparison against the best fresh-start alternative, that's a flag for re-examination. Path dependence is a cost, not a reason. Every pick in this ADR has now had the fresh-start test applied at least shallowly; two picks got deep verification this round; the retraction log above grows whenever a pick turns out to have been defended on sunk-cost reasoning.
Next scheduled re-audit: monthly for deep-picks (LiteLLM, pgroll, SOPS health at getsops.io, Langfuse/Phoenix VM sizing). Quarterly for shallow-picks. Immediate whenever a tool's governance / license / owner changes (Broadcom-Bitnami style events). Always verify ground-truth deployed state with a grep before framing a retraction.
Policy engine¶
| Tool | License | Owner | Our pick? | Why |
|---|---|---|---|---|
| Kyverno | Apache-2.0 | CNCF (incubating) | Yes | YAML CRDs (no Rego language). Native Sigstore verification first-class. Reports in OpenReports format. Operational cost is lower for a 2-person team. |
| OPA Gatekeeper | Apache-2.0 | CNCF (graduated) | No | Rego language requires learning + maintenance. Cosign verification exists but not as polished. Better for large orgs that already run OPA for non-K8s policy. |
| jsPolicy | Apache-2.0 | Loft Labs | No | Single-vendor. JavaScript-based policies. Niche. |
| OpenPolicyAgent (OPA) core | Apache-2.0 | CNCF | No | Lower-level; Gatekeeper is the K8s-admission wrapper. |
Re-evaluate policy engine when¶
- We need to write non-K8s policies (API gateway, CI/CD gates) — OPA's broader reach becomes attractive
- We have a Rego-fluent engineer — learning cost drops
- Kyverno governance shifts unfavorably (currently healthy as CNCF incubating)
Supply chain / signing¶
All picks are Sigstore ecosystem; there's no real competitor in 2026 open-source territory.
| Tool | License | Owner | Our pick? | Why |
|---|---|---|---|---|
| Cosign | Apache-2.0 | Sigstore (Linux Foundation) | Yes | Industry default. Keyless via Fulcio. |
| Gitsign | Apache-2.0 | Sigstore | Yes | Agent commit signing with ephemeral Fulcio certs. Known gotcha: GitHub UI doesn't display "Verified" — workaround is a CI-side gitsign verify check. |
| Rekor | Apache-2.0 | Sigstore (public instance) | Yes (public) | We use the public Rekor. Private Rekor is possible but over-engineered. |
| slsa-github-generator | Apache-2.0 | SLSA framework | Yes | Isolated-builder reusable workflow produces SLSA v1.0 L3 provenance. ~5 lines in any GitHub Actions file. |
| Notary v2 / ORAS | Apache-2.0 | CNCF | No | OCI-artifact-focused; Sigstore covers our image case more simply. |
| Docker Content Trust | Proprietary-ish | Docker | No | Deprecated in favor of Sigstore. |
| HSM-backed PGP | varies | — | Reject | Long-lived keys to rotate. Worse threat model than Sigstore for our case. |
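The "~5 lines" claim in the slsa-github-generator row looks roughly like this in practice. A sketch assuming the generic generator's documented `base64-subjects` input and permission set; the version tag and job names are illustrative:

```yaml
# Illustrative job appended to an existing GitHub Actions workflow: the
# isolated reusable workflow produces SLSA v1.0 L3 provenance for the
# artifact hashes emitted by a prior build job (job names illustrative).
provenance:
  needs: [build]
  permissions:
    actions: read      # read workflow run details
    id-token: write    # keyless signing via Fulcio
    contents: write    # attach provenance to the release
  uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.0.0
  with:
    base64-subjects: ${{ needs.build.outputs.hashes }}
```

The isolation property comes from the reusable workflow running in its own builder context, which is why L3 (not just L2) is achievable from plain Actions.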
Re-evaluate signing when¶
- Sigstore public goods service (Fulcio / Rekor free tier) changes its commitment
- Compliance requires private transparency log (run private Rekor)
- Multi-tenant signing needs emerge (HSM delegation tooling)
Networking / service mesh¶
| Tool | License | Owner | Our pick? | Why |
|---|---|---|---|---|
| Cilium (L7 via CNPs) | Apache-2.0 | CNCF (graduated) | Yes | eBPF CNI with L7 HTTP/DNS filtering. The single biggest ROI defense against prompt-injection exfiltration. Gets 80% of service-mesh value at 10% of the cost. |
| Istio | Apache-2.0 | CNCF (graduated) | No | Full service mesh — mTLS, traffic mgmt, etc. Overkill for our traffic volume + single-cluster setup. Revisit if we go multi-cluster or need mTLS to external services. |
| Linkerd | Apache-2.0 | CNCF (graduated) | No | Simpler than Istio; still more than we need. Good alternative if Istio's complexity is the only objection. |
| Calico (OSS) | Apache-2.0 | Tigera | No | Solid CNI but L7 filtering requires Calico Enterprise (commercial). Cilium's OSS L7 wins. |
| Native NetworkPolicy only | — | Kubernetes | No | L3/L4 only. Cannot filter by FQDN or HTTP method. Insufficient for our egress-allowlist goal. |
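The egress-allowlist goal in the Cilium row can be sketched as a CiliumNetworkPolicy. DNS-aware FQDN filtering is exactly the L7 capability the native-NetworkPolicy row lacks; the app label and allowed FQDN below are illustrative:

```yaml
# Illustrative egress allowlist: the agent pod may resolve names via
# kube-dns and open connections only to api.anthropic.com; all other
# egress is dropped. FQDN matching runs through Cilium's L7 DNS proxy.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: dev-e-egress-allowlist
spec:
  endpointSelector:
    matchLabels:
      app: dev-e
  egress:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"   # observe lookups; enforcement is via toFQDNs
    - toFQDNs:
        - matchName: api.anthropic.com
```

This is the prompt-injection-exfiltration defense in concrete form: an injected "fetch this URL" instruction fails at the CNI layer unless the destination is on the allowlist.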
Re-evaluate networking when¶
- Multi-cluster deployment emerges — a service mesh becomes more compelling
- Cilium L7 Envoy proxy memory overhead becomes the bottleneck on our VM
- External mTLS requirement (e.g., to a customer-facing API) — then Istio or Linkerd
Observability — metrics, logs, traces¶
Split picks: local for SLO-decisive data, managed for everything else.
| Tool | License | Owner | Our pick? | Why |
|---|---|---|---|---|
| Prometheus (local) | Apache-2.0 | CNCF (graduated) | Yes | Source of truth for Flagger canary analysis. Must be local so SLO gates work when external egress blips. ~1 GB RAM. |
| Grafana Cloud Free | Proprietary SaaS | Grafana Labs | Yes | 10k series metrics, 50 GB logs, 50 GB traces, 14-day retention. Fits a 1-2 person rig. Predictable paid scale. |
| Mimir / Thanos / VictoriaMetrics | Apache-2.0 | Grafana Labs / Other | No | Large-scale Prometheus backends. Overkill — Grafana Cloud Free covers us. |
| Datadog / New Relic | Proprietary | — | Reject | Vendor lock + pricing curves bite at scale. |
| Self-hosted LGTM stack | Apache-2.0 | Grafana Labs | No | Would memory-starve our 8 GB VM. Hybrid with Grafana Cloud is the right answer. |
| OpenTelemetry Collector | Apache-2.0 | CNCF (incubating) | Yes | One exporter that forwards to both Prometheus (local) and Grafana Cloud (managed). Standard plumbing. |
Re-evaluate metrics/logs when¶
- Grafana Cloud Free limits bite — 10k series isn't enough
- Cost-visible scaling past ~$50/mo on Grafana Cloud makes self-hosted LGTM attractive on a bigger VM
- Regulatory requirement forces log residency — self-host becomes mandatory
LLM observability¶
| Tool | License | Owner | Pricing | Our pick? | Why |
|---|---|---|---|---|---|
| Langfuse (self-host) | MIT core + EE license-key for a few features | Langfuse GmbH (YC) | Free | Conditional | Official min 4 CPU / 16 GB RAM for app alone, plus ClickHouse cluster. Too heavy for 8 GB VM. Pick only if we size up. |
| Arize Phoenix (self-host) | ELv2 (source-available, non-OSI) | Arize AI | Free | Yes (for our scale) | OTel-native. SQLite or Postgres, no ClickHouse. Runs fine on our VM. ELv2 is non-concern for internal self-host (restricts SaaS resale, which we don't do). |
| Langfuse Cloud Hobby | N/A (SaaS) | Langfuse GmbH | Free 50k units/mo | Backup | 50k billable units sounds like a lot but complex agent traces = 15–20 units each. Hits cap fast; hard-stop at cap. |
| Helicone (self-host) | Apache-2.0 | Helicone Inc. | Free self-host, 10k req/mo SaaS free | Alternative | Gateway + observability combined. Reasonable plan B if Phoenix unsuitable. |
| LangSmith | Proprietary SaaS | LangChain | Paid tiers | Reject | SaaS-only, paid-gated features, LangChain-native assumptions. |
| Braintrust | Proprietary SaaS | Braintrust | Free → $249/mo | Reject for cost | Strong on prompt regression but pricing bites for our scale. |
| Arize AI enterprise | Proprietary | Arize | Contact sales | Reject | Phoenix OSS covers us. |
| W&B Traces | Proprietary SaaS | CoreWeave | Paid | Reject | Broader ML observability overkill. |
| Cloudflare AI Gateway | Proprietary SaaS | Cloudflare | Free passive analytics | Yes (secondary) | Free since we're already on Cloudflare. Passive cost tracking. Not sufficient primary — no virtual-key budgets. |
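The "hits cap fast" claim in the Langfuse Cloud Hobby row is simple arithmetic (unit figures from the row above; the per-day framing is our own assumption of steady usage):

```python
# 50k billable units/month against complex agent traces at ~15-20 units each.
cap_units = 50_000
units_per_trace = 20                       # pessimistic end of the 15-20 range
traces_per_month = cap_units // units_per_trace
traces_per_day = traces_per_month / 30
print(traces_per_month, round(traces_per_day, 1))   # → 2500 83.3
```

Roughly 83 complex traces a day before the hard stop — a busy agent rig can burn through that in days, not a month.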
The Langfuse vs Phoenix call¶
Langfuse is the de facto OSS LLM observability tool in 2026 and its feature set (prompt versioning, evaluations, dataset management, team workflows) is the richest. But its resource floor is genuine: v3 officially requires 4 CPU / 16 GB for the app, plus a ClickHouse cluster (3 nodes × 2 cores × 8 GB recommended). On our 8 GB single-VM k3s, it will boot and demo but will not survive sustained load.
Phoenix is the honest pick at our scale. Lighter infra, OTel-native (so traces are portable), solid eval story. We lose Langfuse's prompt-management UI and team features — acceptable for a 1-2 person team.
The lock-in defense that trumps both: instrument our code with OpenTelemetry GenAI semantic conventions, not Langfuse/Phoenix SDKs. Both platforms accept OTLP. The choice of observability backend becomes swappable.
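What this means mechanically: every LLM-call span carries the standard `gen_ai.*` attributes rather than a vendor SDK's schema. A sketch using attribute names from the OTel GenAI semantic conventions (still experimental upstream, so names may evolve); the helper function and values are illustrative:

```python
# Illustrative span attributes per the OTel GenAI semantic conventions.
# Any OTLP-speaking backend (Phoenix, Langfuse, Grafana) can index these;
# nothing here ties the instrumentation to one observability vendor.
def genai_span_attributes(model: str, input_tokens: int, output_tokens: int) -> dict:
    return {
        "gen_ai.system": "anthropic",          # provider, not an SDK name
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
```

Swapping Phoenix for Langfuse (or back) then changes only the OTLP exporter endpoint, not any call site.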
Re-evaluate LLM observability when¶
- We scale the VM to 16 GB+ (Langfuse becomes viable)
- Team grows and prompt-versioning / collaborative eval becomes load-bearing
- Phoenix ELv2 changes (unlikely but license watch)
LLM gateway / proxy¶
| Tool | License | Owner | Pricing | Our pick? | Why |
|---|---|---|---|---|---|
| LiteLLM | MIT (core) + commercial Enterprise | BerriAI (YC W23) | Free | Yes | Only OSS option with proper per-virtual-key budget enforcement + duration resets. OpenAI-format passthrough. Anthropic provider first-class. |
| Portkey Gateway | Apache-2.0 (fully OSS since March 2026) | Portkey | Free self-host; $9 per 100k logs managed | Documented fallback | Fully OSS escape hatch if LiteLLM stumbles. Processing 1T+ tokens/day across users. |
| Cloudflare AI Gateway | Proprietary SaaS | Cloudflare | Free | Secondary | Passive observability already in our stack. No virtual-key budgets — not sufficient as primary. |
| OpenRouter | Proprietary SaaS | OpenRouter | 5.5% markup | No | Adds hop, no self-host, no per-key budgets like LiteLLM. |
| Kong AI Gateway | Proprietary (Enterprise plugin) | Kong | Enterprise contract | Reject | Enterprise pricing not justified. |
| TrueFoundry | Proprietary SaaS | TrueFoundry | Paid | Reject | Platform-level opinion. |
The LiteLLM SPoF concern¶
LiteLLM is a single point of failure: if it's down, all agents block.
Mitigations:
1. Run ≥2 replicas behind a k3s Service with Postgres + Redis shared state.
2. Client-side fallback to direct api.anthropic.com after N seconds of 5xx from proxy — but this bypasses budget enforcement by design; document as acceptable degraded mode.
3. Monitor proxy health as a first-class SLO in Prometheus.
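Mitigation (2) in sketch form: retry the proxy with backoff, then fall back to the direct endpoint while logging loudly that budget enforcement is bypassed. The names (`call_proxy`, `call_direct`, `ProxyDown`) are hypothetical, not LiteLLM APIs:

```python
import time

class ProxyDown(Exception):
    """Gateway returned 5xx or timed out."""

def call_with_fallback(call_proxy, call_direct, retries=3, backoff_s=0.5):
    # Prefer the LiteLLM proxy so per-key budgets stay enforced.
    last_err = None
    for attempt in range(retries):
        try:
            return call_proxy()
        except ProxyDown as err:
            last_err = err
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    # Degraded mode: direct provider call bypasses budget enforcement.
    print(f"proxy down after {retries} tries ({last_err}); "
          f"falling back to direct API — budgets NOT enforced")
    return call_direct()
```

The print is deliberate: degraded mode should be visible in logs and, per mitigation (3), alertable from Prometheus.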
LiteLLM funding / license risk¶
- License: MIT intact. No BSL drift as of Q2 2026.
- Owner: BerriAI, YC W23, publicly reported ~$2.1M seed. No Series A disclosed. Venture-stage risk is real.
- 12–24 month watch: (a) BSL-style moves on enterprise features only would be fine for us (we're on OSS), (b) aggressive monetization could lock some OSS features behind keys, (c) a quiet under-maintenance period is the likeliest failure mode.
Re-evaluate gateway when¶
- LiteLLM license changes or project health deteriorates
- Our traffic exceeds what LiteLLM's Postgres-backed rate limiter can handle (~10k RPS)
- Portkey Gateway momentum surpasses LiteLLM's
Progressive delivery / canary¶
| Tool | License | Owner | Our pick? | Why |
|---|---|---|---|---|
| Flagger | Apache-2.0 | CNCF (graduated via Flux) | Yes | FluxCD-native. Owns its own Canary CRD that shadows the Deployment — no field-level fights with Flux. Webhooks at every phase for Conductor-E integration. ~100 MB controller footprint. |
| Argo Rollouts | Apache-2.0 | CNCF (graduated via Argo) | No | Mutates fields Flux also reconciles — recurring drift fights. Pair with ArgoCD, not Flux. |
| Keptn | — | CNCF archived 2025-09-03 | Reject (dead) | Dynatrace team pulled back. Do not adopt. |
| OpenKruise Rollout | Apache-2.0 | OpenKruise (CNCF sandbox) | No | Mostly Alibaba ecosystem. Right only if we need StatefulSet canary. |
Re-evaluate canary when¶
- We migrate from Flux to ArgoCD (Argo Rollouts becomes natural)
- Flagger project health deteriorates (currently active)
Feature flags¶
Honest YAGNI
Feature flags at our scale (1-2 humans, few services, no A/B testing need) are overkill today. Env vars + Kustomize overlays per environment cover the actual use case — compile/deploy-time toggles — at zero operational cost. We should adopt a flag system when there's a concrete targeting, experimentation, or kill-switch need — not before.
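The env-var-plus-overlay pattern looks like the following sketch. File paths, the `conductor` Deployment name, and the `FEATURE_AUTOFIX` variable are illustrative, not the rig's actual tree.

```yaml
# base/deployment.yaml carries the default:
#   env:
#     - name: FEATURE_AUTOFIX
#       value: "false"
#
# overlays/prod/kustomization.yaml flips it for one environment:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: conductor
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/env/0/value
        value: "true"
```

Toggling requires a commit and a Flux reconcile, which is exactly the compile/deploy-time semantics described above; the moment we need a toggle that flips without a rollout, that is the signal to revisit flagd.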
| Tool | License | Owner | Our pick? | Why |
|---|---|---|---|---|
| env vars + Kustomize overlays | — | — | Yes (now) | Zero ops cost. Covers 100% of actual current need. |
| OpenFeature + flagd | Apache-2.0 | CNCF (incubating) | Deferred | Right pick when we need runtime toggles. Sidecar ~30-60 MB. JSON flag config. |
| Flipt | GPL-3.0 (server) + MIT (clients) | flipt-io | Alternative | GitOps-native YAML flags. Single Go binary. GPL server is sticky but fine for internal use. |
| GrowthBook | MIT | GrowthBook | Alternative | If we need A/B experimentation with stats out of the box. OpenFeature SDK. |
| PostHog feature flags | MIT (self-host) + SaaS | PostHog | Consider if we adopt PostHog | Bundled with analytics. Zero marginal cost if already using PostHog. |
| Unleash | Apache-2.0 core (EOL 2025-12-31) | Unleash | Reject (dying OSS) | Enterprise-only going forward. Avoid for new adoption. |
| LaunchDarkly | Proprietary SaaS | LaunchDarkly | Reject | $12/seat/mo + MAU overages. Overkill by an order of magnitude. |
| Statsig | Proprietary SaaS | Statsig | Reject for lock-in | Generous free tier (1M MTUs) but SaaS-only. |
| ConfigCat | Proprietary SaaS | ConfigCat | Alternative | Free tier forever, simple, Hungarian SaaS. If we want SaaS and not LaunchDarkly. |
Re-evaluate feature flags when¶
- We need per-user or per-tenant targeting that env vars can't express
- A/B experimentation with real statistical significance becomes a product need
- We adopt PostHog for analytics (flags come bundled)
- Any T1 incident where a kill switch faster than `kubectl rollout undo` would have saved us
DB migration safety¶
| Tool | License | Owner | Our pick? | Why |
|---|---|---|---|---|
| pgroll | Apache-2.0 | Xata | Yes | Automates expand/contract safely for Postgres — the only tool in this category that keeps v1 and v2 of a schema simultaneously queryable via views, with triggered backfill. Atlas does not implement this; it lints for unsafe DDL and rolls out carefully but executes a single migration against a single schema. Moderate single-vendor bus factor (Xata is ~27 employees, still operating; pivoted mid-2025 to serverless Postgres with Simplyblock). Release cadence has slowed: latest release v0.16.1, 2026-02-17 (verified April 2026). |
| Atlas (Community Edition) | Apache-2.0 + EULA on official binaries | Ariga | Alternative / hedge | Declarative schema-as-code + linting. Source Apache-2.0; official binaries under Atlas EULA. Build from source if EULA matters. |
| Flyway Community | Apache-2.0 core (Redgate-owned) | Redgate | Alternative | Classic versioned SQL migrations. Not zero-downtime-automated. License creep concern (Redgate moving features out of OSS). |
| gh-ost | MIT | GitHub | Irrelevant | MySQL only. We're on Postgres. |
| Reshape | Apache-2.0 | fabianlindfors | Reject (bus factor 1) | Single-author, author's focus shifted. Don't adopt for production. |
| Bytebase | Apache-2.0 (5-source limit) | Bytebase | No | UI-heavy workflow tool. Overkill for 1-2 person rig. |
The pgroll bus factor hedge¶
Corrected: pgroll files are YAML, not SQL
An earlier draft claimed we could "keep migrations inspectable SQL... runnable by plain psql." That's wrong — pgroll migration files are pgroll-specific operation YAML (e.g., add_column, drop_column, set_not_null), not raw SQL. The actual hedge: keep a parallel SQL trail. For every pgroll operation that runs, commit the generated SQL (pgroll migrate --dry-run --json | pgroll generate-sql) alongside the operation YAML. If Xata folds, the SQL trail lets us reconstruct schema state; we then pick up with plain Flyway or Atlas going forward. This does not make individual operations portable — it keeps the history reconstructible.
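For concreteness, a pgroll migration file has roughly this shape (table and column names are hypothetical; the `up` backfill expression follows pgroll's documented operation DSL). This is the operation YAML the hedge commits alongside its generated SQL — it is not itself runnable by `psql`.

```yaml
# Illustrative pgroll operation file, not raw SQL.
name: 0012_add_owner_to_tasks
operations:
  - add_column:
      table: tasks
      column:
        name: owner
        type: text
        nullable: true
        up: "'unassigned'"   # backfill expression applied to existing rows
```

During the migration window, pgroll exposes old and new schema versions as parallel views, so v1 clients keep working while v2 clients see `owner`; the committed SQL trail is what survives if the tool itself goes away.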
Re-evaluate DB migrations when¶
- Xata pivots or folds
- A migration we need is outside pgroll's expand/contract model (e.g., type changes with data loss implications)
Supply chain for dependencies¶
| Tool | Our pick? | Why |
|---|---|---|
| GitHub Dependabot (malware mode) | Yes | Free with GitHub. Flags npm packages matched against the GitHub Advisory Database malware feed. |
| Socket.dev | Yes | Per-dependency security score. PR check fails below threshold. |
| Package-age policy (14d minimum) | Yes (via CI gate) | Datadog's pattern. Blocks freshly published releases, catching typosquats and account-takeover pushes. |
| Syft (SBOM) + Grype (CVE scan) | Yes | Apache-2.0, Anchore, widely adopted. |
| Snyk | Reject for cost | Dependabot + Socket covers it cheaper. |
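The 14-day package-age gate reduces to a timestamp comparison. This is a minimal sketch of the policy logic only; in CI the release dates would come from the npm registry's per-version `time` metadata rather than the hard-coded dict, and the function names are assumptions.

```python
from datetime import datetime

MIN_AGE_DAYS = 14  # Datadog's pattern: freshly published releases are suspect

def too_new(published_iso: str, now_iso: str, min_age_days: int = MIN_AGE_DAYS) -> bool:
    """True if a release is younger than the policy allows."""
    published = datetime.fromisoformat(published_iso.replace("Z", "+00:00"))
    now = datetime.fromisoformat(now_iso.replace("Z", "+00:00"))
    return (now - published).days < min_age_days

def gate(lockfile_releases: dict[str, str], now_iso: str) -> list[str]:
    """Return the lockfile entries that should fail the PR check."""
    return [pkg for pkg, ts in lockfile_releases.items() if too_new(ts, now_iso)]
```

A release published nine days ago fails the gate; one published months ago passes. The gate is deliberately dumb — it buys a window in which the malware feed and Socket scoring can catch a compromised release before we install it.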
Container and CI¶
| Tool | License | Owner | Our pick? | Why |
|---|---|---|---|---|
| GitHub Actions | Proprietary | GitHub | Yes | Already in use. OIDC to cosign and Sigstore. Moderate vendor lock-in, acceptable given GitHub is also our SCM. |
| GHCR | Proprietary | GitHub | Yes | Already in use. Paired with Actions. Lock-in acceptable. |
| Flux CD | Apache-2.0 | CNCF (graduated) | Yes | Already our GitOps. Stable. |
| Argo CD | Apache-2.0 | CNCF (graduated) | No | Alternative to Flux; switching costs exceed benefit for us. |
Cluster and runtime¶
| Tool | License | Owner | Our pick? | Why |
|---|---|---|---|---|
| k3s | Apache-2.0 | CNCF (sandbox, maintained by SUSE) | Yes | Lightweight K8s, single-binary install, fits 8 GB VM. |
| KEDA | Apache-2.0 | CNCF (graduated) | Yes | Event-driven autoscaling + scale-to-zero. Already deployed. |
| GCP Compute (one VM) | Proprietary | Google | Yes | Small bill, predictable, good enough. |
Re-evaluate cluster when¶
- We outgrow a single VM (multi-node Kubernetes warranted)
- GCP pricing shifts unfavorably
- k3s project health deteriorates (currently healthy under SUSE)
Human vault and docs¶
| Tool | License | Owner | Our pick? | Why |
|---|---|---|---|---|
| Bitwarden | GPL-3.0 (self-hostable) + SaaS | Bitwarden Inc. | Yes | Already in use. Self-host option if SaaS changes unfavorably. |
| MkDocs Material | MIT (community) + commercial Insiders | Martin Donath | Yes | Our docs-site. Community edition is sufficient. |
Evaluation¶
| Tool | License | Owner | Our pick? | Why |
|---|---|---|---|---|
| Inspect AI | MIT | UK AISI | Candidate — validate in Era 2 | Released March 2026. Adopted by METR, Apollo, major labs. OSS, agent-aware, production-shaped. Too new to call chosen; revisit once we have a nightly run with 60 days of data comparing it against raw pytest-style harnesses. |
| SWE-bench Pro | MIT | Scale AI | Yes (benchmark) | Replacement for SWE-bench Verified (contaminated). 1,865 multi-language tasks. |
| lm-eval-harness | MIT | EleutherAI | No (benchmark-only) | Raw model quality, not agent-scaffolding quality. |
| OpenAI Evals | MIT | OpenAI | Reject (abandoned) | Historical. |
| Hypothesis | MPL-2.0 | Community | Yes | Property-based testing for Python code agents write. |
Lock-in exposure summary¶
The rig's total lock-in exposure, honestly:
| Vendor | Lock-in level | Criticality | Why |
|---|---|---|---|
| Anthropic (as default LLM provider) | High | Critical | LLM is the engine. LiteLLM + OTel GenAI conventions make runtime and backend swappable (see provider-portability.md); prompts are the sticky layer — migrating to OpenAI or Gemini needs per-prompt re-authoring and a re-run of the eval suite. Concrete, not unbounded. |
| GitHub | High | Critical | Source, CI, OIDC, artifact registry, Issues — deeply wired. |
| GCP | Medium | High | One VM — replaceable with any VPS vendor, but DNS/network moves cost ~1 week. |
| Cloudflare | Medium | Medium | DNS, tunnels, Pages — replaceable, 1-2 days of work. |
| Sigstore public infra | Low | Medium | Public good service. Private Rekor is the escape if the service model changes. |
| All CNCF-graduated tools (Flux, k3s, KEDA, Kyverno, Cilium, Flagger, cert-manager) | Very low | High | Portable, active foundations. |
| LiteLLM | Low | High | MIT + Portkey as fallback. |
| Langfuse/Phoenix | Low | Medium | OTel GenAI conventions make swap trivial. |
| SOPS (getsops) | Very low | Medium | MPL-2.0, CNCF governance, active maintainers; SOPS files are portable ciphertext — any decrypter reads them. |
| pgroll | Medium | Medium | Single-vendor bus factor (Xata). Inspectable SQL trail preserves schema history. |
The ones that would hurt to lose: Anthropic (prompt portability hard), GitHub (everything wired there). Every other pick has a concrete escape hatch.
When any pick is re-evaluated¶
The whitepaper's picks are living decisions. Trigger a re-evaluation when:
- License change on a critical tool (BSL drift is the modern pattern)
- Ownership change — acquisitions, foundations handing off, single maintainers disappearing
- Material scale change — we grow past 5 operators, add a second cluster, serve customer traffic
- Active incident — a pick contributed to an outage and the compensating controls weren't enough
- Cheaper / better alternative emerges with 2+ years of production adoption evidence
Every re-evaluation ends in one of: keep, migrate, or defer. The decision gets a timestamp and a link to this document's updated version.
What we explicitly reject¶
Short list of things we have evaluated and ruled out:
- Vault now (OpenBao later if needed, not adopt Vault)
- SealedSecrets (never deployed; SOPS is the chosen primary with better governance and no extra controller pod)
- Full self-hosted LGTM stack on 8 GB (memory-starves)
- Argo Rollouts with Flux (drift fights)
- Unleash OSS (EOL)
- Keptn (CNCF-archived)
- Doppler / LaunchDarkly / Kong AI Gateway / TrueFoundry (SaaS-only or enterprise pricing)
- Reshape (bus factor 1)
- HSM-backed PGP signing (worse than keyless Sigstore)
- CSI Secrets Store at single-VM scale (DaemonSet footprint)
- OpenAI Evals (abandoned)
- microVMs (e2b, Daytona, Firecracker) (wrong threat model)
- Dev-E .NET standalone worker (`dashecorp/dev-e`, archived 2026-04-17 — `CommandCodeExecutor` shells out to claude-cli without MCP injection, stream-json parsing, or token refresh; the value lives in the CLI driver, not the outer state machine, and Node `rig-agent-runtime` already implements both)
See also¶
- index.md — whitepaper master
- principles.md — principle 10 (simple enough to operate) drives many of the rejections
- security.md — cites this document for secrets + supply chain picks
- observability.md — cites this document for the Langfuse/Phoenix call
- cost-framework.md — cites this document for LiteLLM
- self-healing.md — cites this document for Flagger, flagd deferral, pgroll hedge
- limitations.md — all "we don't have X" lines trace to picks here