# stxkxs.io - Full Content

> This file contains the complete text of all blog posts on stxkxs.io.
> For a site overview and index, see /llms.txt

---

# Kubernetes Becomes the Agent OS

- **URL**: https://www.stxkxs.io/blog/kubernetes-agent-os
- **Published**: 2026-05-08
- **Author**: Brandon Stokes
- **Category**: kubernetes
- **Tags**: kubernetes, dynamic-resource-allocation, agentic-ai, kagent, agentgateway, cncf, gpu-scheduling, ai-infrastructure, platform-engineering, mcp, a2a, inference
- **Reading time**: 12 min

Two-thirds of generative AI workloads now run on Kubernetes. Dynamic Resource Allocation, GA upstream since Kubernetes v1.34, shipped downstream in OpenShift 4.21 at KubeCon EU 2026. The CNCF defined an Agentic workload class. kagent and agentgateway became the open runtime layer for agents. The K8s control plane is becoming the agent operating system, and platform teams are inheriting the work whether they planned to or not.

## The runtime got standardized

KubeCon Europe ran March 23–26 in Amsterdam with about 13,000 attendees, the largest in the event's history. Dynamic Resource Allocation had reached general availability in upstream Kubernetes v1.34 the prior September. At this show, the CNCF AI Conformance Program (co-led by Google's Janet Kuo and launched at KubeCon NA in November 2025) added an Agentic workload class alongside its existing Training and Inference classes.

The CNCF sandbox absorbed a wave of AI-infra projects in the same week, including llm-d (inference disaggregation, from IBM Research, Red Hat, and Google Cloud), HolmesGPT (an AI SRE agent created by Robusta.dev with major contributions from Microsoft), and KAI Scheduler (GPU-aware scheduling). NVIDIA also donated its DRA driver to the Kubernetes project for full community ownership that week. Solo.io donated agentregistry to the CNCF at the show and introduced agentevals as an open-source project.
On May 7, Solo.io added NemoClaw support to kagent: NemoClaw is NVIDIA's reference stack for running OpenClaw agents inside a sandbox-first runtime, and the kagent integration moves that pattern from a single-host deployment to a Kubernetes-native fleet.

Kubernetes is becoming the agent operating system. The CRDs and gateways are open-source projects with sandbox status, vendor-supported distributions, and conformance criteria. Two years ago this would have been a vendor pitch. Today it is a CNCF program.

This is the largest shift in cloud-native infrastructure since service mesh. Two-thirds of generative AI workloads already run on Kubernetes per CNCF figures cited at the keynote. Agents are the next workload class to arrive on the same substrate. Most platform teams will inherit them by default, not by choice.

> **Key Point:** Agents inherit the Kubernetes substrate because it already grew the parts they need: fractional GPU scheduling, protocol mediation, identity, observability, scale-to-zero. Inference workloads built it for their own reasons. Agents arrive late and reuse what is there.

- **GenAI on Kubernetes**: 2/3 of workloads - CNCF figures cited at KubeCon EU 2026; Kubernetes is the de facto AI infrastructure operating layer
- **DRA Status**: GA, Sept 2025 (v1.34) - Dynamic Resource Allocation reached general availability in upstream Kubernetes v1.34 (Sept 1, 2025); Red Hat OpenShift 4.21 shipped DRA GA downstream at KubeCon EU 2026 in Amsterdam
- **Enterprise Agent Adoption**: 70% target - Gartner forecast for 2029 (70% of enterprises will deploy agentic AI as part of IT infrastructure operations); up from less than 5% in 2025
- **Inference Cost Trajectory**: ~10×/year decline - Per-performance-tier inference cost decline; Gartner projects 90% lower 1T-parameter inference cost by 2030

## Kubernetes ate ML training

This pattern has played out before. Java EE produced WebLogic, JBoss, and WebSphere.
Kubernetes ate those by treating apps as containers and letting the orchestration layer absorb the lifecycle. The app-server vendors lost because the platform layer moved beneath them.

ML training was the second round. Kubeflow, KServe, vLLM, and the GPU operator stack pushed training and serving into Kubernetes between 2018 and 2024. By 2025 model serving on K8s was the default. The cluster already provided scheduling, GPU device plugins, autoscaling, ingress, secrets, and policy. Adding model serving was a smaller step than building all of that elsewhere.

Agents follow the same arc. Agents need fractional GPU access for embedding and reranking, topology-aware scheduling for multi-step inference, scale-to-zero for sporadic workloads, and protocol mediation for MCP, A2A, and arbitrary LLM APIs. None of those are new requirements. KServe shipped scale-to-zero in 2021. NVIDIA's GPU operator shipped topology awareness. Knative demonstrated request-driven scaling years before agents were a category. The agent runtime is assembled out of pieces that already shipped.

> **The agent runtime is assembled from pieces Kubernetes already shipped. The few that were missing landed this spring.**

## DRA fixes GPU scheduling

For a decade, GPU scheduling on Kubernetes meant `nvidia.com/gpu: 1`. The device plugin treated accelerators as opaque integer counts: a whole GPU or none. There was no way to ask for a 20GB MIG slice on an H100, NVLink to a neighboring pod, or a particular firmware revision. Training tolerated the model because training jobs want whole GPUs. Inference and agent workloads want the opposite: thin slices, short bursts, fan-out across many small calls.

Dynamic Resource Allocation introduces four built-in API types in the `resource.k8s.io` group: `ResourceClaim`, `ResourceClaimTemplate`, `DeviceClass`, and `ResourceSlice`. Vendor drivers register what they can offer. Pods declare what they need. The scheduler matches them against structured constraints.
This is the same move Kubernetes made for storage with `PersistentVolumeClaim` and `StorageClass`: a vendor-specific pile of devices becomes a typed, claimable resource.

NVIDIA donated its DRA driver to the Kubernetes project for full community ownership during KubeCon EU 2026 week. AMD's GPU DRA driver reached its first official release, v1.0.0, on May 20, 2026. Intel ships its own resource drivers for accelerators. The abstraction is multi-vendor by construction. A cluster can host H100s, MI300Xs, Gaudi 3s, and TPU v5e chips in the same node pool, and a pod can claim "8GB of FP16 inference capacity, latency-class A" without pinning to a specific SKU. Agent platforms have needed this contract for two years.

> **INFO: DRA vs the device plugin**
> Device plugins exposed integer counts of opaque resources. DRA exposes typed claims with vendor-specific selectors. Fractional GPU, MIG slices, multi-vendor accelerators, NUMA-aware placement, and topology hints become scheduler policy instead of webhooks and custom controllers.

```yaml (agent-resource-claim.yaml)
# DRA claim for an agent that needs a 20GB H100 MIG slice
# with NVLink topology awareness. The driver-specific selector
# decouples the workload from the vendor SKU.
# The class name and attribute keys below are illustrative; real ones
# come from the installed driver's DeviceClass and ResourceSlice objects.
apiVersion: resource.k8s.io/v1  # GA version of the API group (Kubernetes v1.34+)
kind: ResourceClaim
metadata:
  name: research-agent-gpu
spec:
  devices:
    requests:
      - name: inference-slice
        exactly:  # the v1 API nests a single-class request under `exactly`
          deviceClassName: nvidia.com/h100-mig
          selectors:
            - cel:
                expression: |
                  device.attributes["memory.gb"] >= 20 &&
                  device.attributes["topology.nvlink"] == true
---
apiVersion: v1
kind: Pod
metadata:
  name: research-agent
spec:
  resourceClaims:
    - name: gpu
      resourceClaimName: research-agent-gpu
  containers:
    - name: agent
      image: registry.example.com/agents/research:v1
      resources:
        claims:
          - name: gpu
```

The bin-packing problem on shared GPUs is brutal. Inference is bursty and short. A naive one-GPU-per-pod policy burns most accelerator hours on idle. DRA does not solve bin-packing.
It does give the scheduler enough information to place workloads well and gives downstream tools (kagent, autoscalers, FinOps gates) a structured object to reason about. The previous model exposed none of that.

## Agents become pods

Solo.io's kagent entered the CNCF Sandbox in May 2025. It models agents, tools, and model bindings as CRDs. An `Agent` object specifies a model, a tool list, a system prompt, and resource claims. The controller reconciles each one into a pod or set of pods that calls whichever model server runs in the cluster.

Agent frameworks are not scarce. What kagent contributes is alignment with the rest of the cluster. The framework reuses identity (service accounts), secrets (Secrets API), telemetry (OpenTelemetry), scheduling (now DRA-aware), and policy (admission webhooks). The agent is a pod, so those come along for free.

```yaml (agent.yaml)
# Illustrative shape; check the kagent CRD reference for current field names.
apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: incident-summarizer
  namespace: sre
spec:
  model:
    provider: anthropic
    name: claude-sonnet-4-6
    routing: agentgateway-prod
  systemPrompt: |
    You summarize PagerDuty incidents into 5-bullet briefs.
    Always cite the source incident ID.
  tools:
    - name: pagerduty
      mcpRef: pagerduty-mcp-server
    - name: github-search
      mcpRef: github-mcp-server
  resources:
    claims:
      - name: gpu
        resourceClaimTemplateName: small-inference-slice
  rbac:
    serviceAccountName: incident-agent
  scaling:
    minReplicas: 0
    maxReplicas: 8
    activationTarget: 1
```

kagent does not call the model provider directly. It hands the call to a gateway, the way a service mesh hands HTTP traffic to a sidecar. The `routing: agentgateway-prod` field is the seam. That decoupling is the second half of the stack.

> **EXAMPLE: NemoClaw lands inside kagent**
> NVIDIA introduced NemoClaw in March 2026 as a reference stack for safely running OpenClaw agents, with sandbox-first isolation as the core idea. On May 7, 2026, Solo.io added NemoClaw support to kagent.
> The integration brings that pattern from a single-host reference deployment into a fleet-scale Kubernetes platform, with kagent providing the lifecycle controls, identity, and policy that production teams need.

## A mesh for models

Solo.io contributed agentgateway to the Linux Foundation in August 2025 (announced Aug 25, 2025). The data plane is written in Rust. It mediates four protocol families that did not share a sentence three years ago: inference (vLLM, TGI, Triton style), OpenAI-compatible HTTP, MCP (Anthropic's tool-call protocol), and A2A (Google's agent-to-agent handoff). One process, four protocols.

This is the cleanest articulation of a service mesh for agents. The agent process never sees which model provider is on the other end, what its rate limit is, what authentication MCP servers expect, or which A2A peer picks up the next handoff. The gateway resolves all of that. Calls go to a local endpoint and return structured responses. The pattern is Envoy plus Istio for HTTP, with stateful, token-billed, silent-failure-prone protocols substituted in.

```yaml (agentgateway-policy.yaml)
# Illustrative resource shapes; check the agentgateway docs for current kinds and fields.
apiVersion: agentgateway.io/v1
kind: ModelRoute
metadata:
  name: claude-sonnet-route
spec:
  match:
    model: claude-sonnet-4-6
  upstream:
    provider: anthropic
    region: us-west
    failover:
      - provider: anthropic
        region: us-east
  rateLimit:
    tokensPerMinute: 200000
  costGuard:
    maxUsdPerHour: 50
  observability:
    tracing: otel
    promptLog: redacted
---
apiVersion: agentgateway.io/v1
kind: MCPRoute
metadata:
  name: pagerduty-mcp-route
spec:
  serverRef: pagerduty-mcp-server
  authPolicy:
    method: oauth-on-behalf-of
    scopeAllowlist:
      - incidents:read
      - incidents:annotate
```

The cost guard is a first-class field. Token spend has become the new bandwidth bill, and it has to be enforced as policy at the gateway, not flagged on a dashboard a day later. MCP routes carry an `authPolicy` with on-behalf-of semantics.
MCP launched without a robust authorization story; the gateway is where the gap closes in production.

## Conformance drew the line

CNCF's Kubernetes AI Conformance Program launched at KubeCon NA 2025 (Atlanta, Nov 11, 2025; beta at KubeCon Japan in June), co-led by Google (Janet Kuo) with Microsoft, Kubermatic, and Red Hat. It defines what "AI-ready" means for a Kubernetes platform, and at KubeCon EU 2026 it expanded to add the Agentic workload class alongside Training and Inference.

A conformance program for a runtime is what a spec is for a protocol. It forces vendors to either pass the tests or explain why not. Platform teams have been answering the same question per-vendor: does this distribution actually run our AI workloads, or are we about to spend a quarter rewriting YAML? Conformance answers it once. The Agentic class draws the line around DRA support, GPU-aware scheduling, model artifact distribution (including OCI-format weights), telemetry integration, and gateway interoperability.

The Agentic class is the least standardized of the three because it is the youngest. Conformance covers the runtime layer: DRA, gateways, OTel hooks. It does not cover agent lifecycle semantics, prompt evaluation, billing attribution, or safety policy. Those are still where vendors compete and where the platform team makes local decisions. The line in 2026 is drawn at the kernel, not the application.

> **TIP: Demand conformance**
> When evaluating managed Kubernetes for agent workloads, ask whether the vendor targets the Conformance program and which classes they cover. A "proprietary AI mode" outside conformance is the OpenStack-era pattern of incompatible distributions. The DRA, kagent, and agentgateway stack is open and multi-vendor by construction. Pick distributions that honor that.

## Don't build it yet

This stack is not the right answer for every agent workload. For some workloads the kagent, agentgateway, and DRA combination is over-engineering.
Building the platform before the workloads exist is a common failure mode and worth naming.

First, single-agent applications with no scale-out story. One agent, one product, one model, predictable load: run it on Lambda or Cloud Run. The managed-function cost is rounding error against the platform team time a Kubernetes-native runtime requires. kagent earns its weight when multiple agents share tools, identity, or budget. One agent does not.

Second, sub-second tool-augmented chat. When the agent sits in the request path of a streaming chat interface, a single inference endpoint with client-side tool dispatch beats MCP-over-mesh. The mesh adds tens of milliseconds per call. That overhead pays off for autonomous, fan-out, asynchronous agents. It does not pay off when a user is watching tokens stream.

Third, single-team prototypes. Platform tooling is premature until at least three teams are running agents. Below that, the variation in tools, identities, and budgets is too small to justify the platform. Two teams can share an OpenAI-compatible gateway and a shared GPU pool with no CRDs, and that arrangement will be cheaper for a year than a kagent install.

Adopt the platform when the integration tax across teams exceeds the platform tax.

> **WARNING: The premature platform trap**
> Platform teams fail in a recognizable way: building infrastructure before the workloads exist. The K8s, DRA, kagent, and agentgateway stack is genuinely useful, but adopting it before agents are in production designs for use cases that will not match what arrives. Ship one or two agent workloads on the simplest setup first. The right time to build the platform is when the second team's agents collide with the first team's.

## The platform team inherits

The platform team that runs your cluster will run your agents, whether they signed up for it or not. The workloads land on the substrate already underneath them. The pieces are there. They are not yet good enough.
Four layers above the runtime are not standardized, and the next eighteen months of platform work lives there.

Conformance does not cover prompt evaluation. It does not cover budget attribution at the agent level. It does not cover tool registry curation, MCP authentication at scale, or A2A trust boundaries. Each falls to the platform team. The runtime is becoming a commodity. The governance and economics on top are not.

For a platform team picking work for the next quarter: do not fork kagent, do not write a DRA driver, do not implement an agent gateway from scratch. Use the open versions. Build the layers above. Agent identity mapped to existing IAM. Cost gates wired into the FinOps stack. Evaluation harnesses that gate promotion through environments. Observability that makes a misbehaving agent legible to an on-call engineer who has never read its prompt. That work survives the next round of platform consolidation. The forks do not.

> **Key Point:** The runtime layer for agents is standardized in the open. The governance, economics, identity, and evaluation layers above it are not. That is the platform team's work for the next eighteen months.
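
One of the layers above the runtime, a per-agent cost gate, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how agentgateway or any FinOps product implements it: the class name, the rolling one-hour window, and the token price are all hypothetical, and the $50 limit simply mirrors the `maxUsdPerHour: 50` value from the gateway example.

```typescript
// Hypothetical per-agent spend gate; names, prices, and limits are illustrative.
interface Charge {
  timestampMs: number;
  usd: number;
}

class HourlyCostGuard {
  private readonly charges: Charge[] = [];
  private readonly maxUsdPerHour: number;

  constructor(maxUsdPerHour: number) {
    this.maxUsdPerHour = maxUsdPerHour;
  }

  // Admits the call and records its cost only if the rolling hour stays under the limit.
  tryCharge(usd: number, nowMs: number): boolean {
    const cutoff = nowMs - 60 * 60 * 1000;
    // Drop charges that have aged out of the one-hour window.
    while (this.charges.length > 0 && this.charges[0].timestampMs <= cutoff) {
      this.charges.shift();
    }
    const spent = this.charges.reduce((sum, c) => sum + c.usd, 0);
    if (spent + usd > this.maxUsdPerHour) {
      return false;
    }
    this.charges.push({ timestampMs: nowMs, usd });
    return true;
  }
}

// Converts a token count into dollars at an assumed price per million tokens.
function callCostUsd(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1000000) * usdPerMillionTokens;
}

const guard = new HourlyCostGuard(50);
const callUsd = callCostUsd(10000000, 3); // 10M tokens at an assumed $3/M
const first = guard.tryCharge(callUsd, 0);
const second = guard.tryCharge(callUsd, 60 * 1000);       // one minute later
const third = guard.tryCharge(callUsd, 61 * 60 * 1000);   // after the first charge expires
console.log({ first, second, third });
```

The point of the sketch is the placement of the check: the decision happens before the call is admitted, which is what distinguishes a gate from a dashboard.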
## Resources & Further Reading

- CNCF Kubernetes AI Conformance Program: https://www.cncf.io/announcements/2026/03/25/cncf-celebrates-innovators-advancing-cloud-native-at-kubecon-cloudnativecon-europe/ - Three workload classes (Training, Inference, Agentic) and the runtime layer they standardize
- Dynamic Resource Allocation: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ - ResourceClaim, ResourceClaimTemplate, and DeviceClass; the GPU/accelerator scheduling primitives that went GA in upstream Kubernetes v1.34 (Sept 2025)
- kagent (CNCF Project): https://www.cncf.io/projects/kagent/ - Kubernetes-native agent runtime; agents, tools, and model bindings as CRDs
- kagent on GitHub: https://github.com/kagent-dev/kagent - Source, examples, and CRD definitions
- agentgateway: https://agentgateway.dev/ - Rust-built unified data plane for inference, LLM, MCP, and A2A traffic; contributed to the Linux Foundation in August 2025
- Solo.io brings NemoClaw to kagent (May 7, 2026): https://www.globenewswire.com/news-release/2026/05/07/3290085/0/en/solo-io-brings-nemoclaw-to-production-agentic-runtime-for-kubernetes.html - Solo.io added NemoClaw support to kagent, integrating NVIDIA's reference stack into a Kubernetes-native runtime
- NVIDIA NemoClaw: https://github.com/NVIDIA/NemoClaw - NVIDIA's reference stack for running OpenClaw agents inside a sandbox-first runtime; the upstream that Solo.io integrated into kagent
- NVIDIA DRA Driver: https://github.com/NVIDIA/k8s-dra-driver - Vendor implementation of DRA for NVIDIA accelerators, donated to the Kubernetes project for full community ownership during KubeCon EU 2026 week
- AMD GPU DRA Driver: https://github.com/ROCm/k8s-gpu-dra-driver - DRA-compliant driver for ROCm-based AMD accelerators (v1.0.0, released May 20, 2026)
- Intel Resource Drivers for Kubernetes: https://github.com/intel/intel-resource-drivers-for-kubernetes - DRA-based resource drivers for Intel accelerators
- KubeCon + CloudNativeCon Europe 2026 highlights (Solo.io): https://www.solo.io/blog/highlights-from-kubecon-cloudnativecon-europe-2026 - Recap of the agent-runtime track including agentgateway, kagent, and agentregistry
- Self-Hosted AI Agents for Incident Response: https://www.stxkxs.io/blog/openclaw-self-hosted-ai-agents - Background on OpenClaw, the agent that NVIDIA NemoClaw is built to run securely
- MCP Is Now a Linux Foundation Standard: https://www.stxkxs.io/blog/mcp-linux-foundation - Protocol context for MCP, the tool-call protocol mediated by agentgateway
- Gartner inference economics forecast (March 2026): https://www.gartner.com/en/newsroom/press-releases/2026-03-25-gartner-predicts-that-by-2030-performing-inference-on-an-llm-with-1-trillion-parameters-will-cost-genai-providers-over-90-percent-less-than-in-2025 - Quantitative grounding for the inference cost trajectory

---

# Cost Signals at Decision Time

- **URL**: https://www.stxkxs.io/blog/finops-first-class-engineering
- **Published**: 2026-04-01
- **Author**: Brandon Stokes
- **Category**: platform-engineering
- **Tags**: finops, cloud-cost, platform-engineering, infracost, opencost, kubernetes, internal-developer-platform, ci-cd, ai-infrastructure, focus-spec, shift-left, gpu-cost
- **Reading time**: 14 min

The FinOps Foundation has 96,000+ members. 93 of the Fortune 100 have adopted FinOps. Cloud waste rose to 29% in 2026, its first increase in five years. The practice was adopted as a procurement initiative when it should have been an engineering concern. Cost needs to be a first-class signal in CI pipelines, IDPs, and infrastructure-as-code, not a monthly report that engineers never see.

## The adoption paradox

The FinOps Foundation has 96,000+ members. 93 of the Fortune 100 have adopted FinOps practices. The cloud FinOps market hit $13.5 billion in 2024.
And cloud waste, which Flexera defines as the percentage of cloud spend that produces no value, rose to 29% in 2026, the first increase in five years, with AI workloads cited as the driver (Flexera 2026 State of the Cloud). The cloud infrastructure market grew roughly 45% in two years, from $228 billion in 2022 to $330 billion in 2024 (Synergy Research). FinOps got adopted at scale. Waste did not fall to match.

The explanation is structural. FinOps was adopted as a procurement and finance initiative: dashboards for leadership, monthly cost review meetings, chargeback reports that nobody outside finance reads. The people making provisioning decisions are engineers, and most of them never see the dashboard. An engineer choosing an instance type in a Terraform module, setting resource limits in a Kubernetes manifest, or allocating GPUs for an inference endpoint is making a cost decision. They make it in code, in a PR, in a CI pipeline. A monthly report cannot influence that decision because it arrives weeks after the decision was made.

Cost needs to be a first-class engineering signal: in CI, in IDPs, in IaC, in autoscaling. Not a monthly report. The historical precedent is reliability. The transformation that SRE drove for uptime needs to happen for cost. AI infrastructure costs are growing faster than any other line item, so the window to get this right is closing.

- **FinOps Adoption**: 96K+ members - FinOps Foundation membership; 93 of Fortune 100 adopted (FinOps Foundation 2026)
- **Cloud Waste**: 29% - Percentage of cloud spend that is wasted; rose in 2026 for the first time in five years, driven by AI workloads (Flexera 2026)
- **FinOps Market**: $13.5B - Cloud FinOps market in 2024, projected to $26.91B by 2030 at 12.6% CAGR (MarketsandMarkets)
- **Run Maturity**: 14.2% - Organizations at FinOps "Run" maturity; 51.4% still at "Walk" (State of FinOps 2026)

## Reliability had this problem first

Before Google published the SRE book in 2016, reliability was a separate team's problem. Developers wrote code. Operations kept it running. Monitoring was a dashboard that ops checked. Uptime was someone else's KPI. The organizational structure existed: NOCs, operations teams, incident managers, change advisory boards. The processes existed: runbooks, escalation procedures, post-mortems. Systems were still unreliable, because the people writing the code that caused outages were structurally disconnected from the signals about reliability.

SRE fixed this by making reliability an engineering concern with engineering primitives. SLOs gave developers a budget they could spend. Error budgets created a feedback loop between deployment velocity and production stability. On-call rotations put developers in the blast radius of their own decisions. Everyone already agreed that reliability mattered. The insight was that reliability improves when the signal reaches the decision point: the engineer writing the code, not the ops team cleaning up afterward.

FinOps is stuck in the pre-SRE phase. The organizational adoption happened. The dashboards exist. The monthly cost review meetings are on the calendar. But cost signals do not reach engineers at decision time. A developer provisioning an oversized instance does not see the cost delta in their PR. A team spinning up a GPU cluster does not get a cost estimate before deployment.
The FinOps team publishes a report; engineers do not read it. The bottleneck is signal delivery.

> **EXAMPLE: The early movers**
> Spotify's Backstage plugins warn developers when resource requests exceed budget thresholds. Cost visibility lives in the IDP itself. Lyft reported a 35% reduction in infrastructure sprawl after integrating cost data into Cortex. Shopify's internal developer platform auto-terminates unused instances post-deployment. The pattern works. It has not been adopted at scale.

## The dashboard isn't reaching engineers

The State of FinOps 2026 report surveyed 1,192 respondents managing over $83 billion in annual cloud spend. 78% of FinOps practices now report into the CTO or CIO, up 18 percentage points versus 2023. That structural shift toward engineering leadership has happened. At the same time, only 14.2% of organizations are at "Run" maturity, while 51.4% are still at "Walk." The practices are moving into engineering orgs and are not yet operating at engineering speed.

**FinOps maturity distribution (2026)**

- Crawl (dashboards, no action): 34.4%
- Walk (reviews, some optimization): 51.4%
- Run (cost in engineering workflows): 14.2%

The gap between organizational adoption and operational maturity reveals the core problem. FinOps was designed as a practice: cross-functional teams, chargeback models, optimization recommendations reviewed in meetings. That works for procurement-scale decisions like reserved instance purchases, enterprise discount programs, and right-sizing recommendations reviewed quarterly. It does not work for the hundreds of daily provisioning decisions engineers make: instance types in Terraform modules, container resource limits in Kubernetes manifests, GPU allocations for inference workloads. Those decisions happen in code, in PRs, in CI pipelines. A monthly report cannot influence them.

The multi-cloud dimension makes it worse.
87% of organizations have a multi-cloud strategy, but only 22% have effective cost governance across clouds. Each cloud has a different billing model, different discount structures, different cost allocation taxonomy. The FOCUS spec (the FinOps Open Cost and Usage Specification, v1.3 ratified December 2025) is trying to normalize this. 68% of large spenders ($100M+ annually) are using or experimenting with FOCUS-formatted data. Normalization solves the data problem. Delivery is a separate problem. Unified billing data in a dashboard is still a dashboard.

In a 2023 Spot by NetApp survey of 310 US IT decision-makers, 96% of tech executives agreed FinOps is important to their cloud strategy, yet only 9% had a mature practice. The bottleneck is execution architecture. The data and the tools are both in place. What is missing is cost feedback that reaches engineers at the moment they make provisioning decisions.

> **Key Point:** The FinOps maturity gap is a signal-delivery problem. The data and dashboards already exist. What is missing is cost feedback reaching engineers when they make provisioning decisions: the PR comment, the CI gate, the IDP budget widget.

## AI makes this urgent

Everything above applies to traditional cloud spend. AI infrastructure makes it existential. GPU utilization sits at 15-30% of capacity, which means 70-85% of GPU-hours are idle. Idle GPU-hours bill at the same rate as busy ones, so most of an AI infrastructure budget pays for capacity that produces nothing. Unit costs compound the problem: Google reportedly spends 10 to 20 times more on inference than on training, and waste on AI workloads tracks higher than traditional cloud.

Token-based and GPU-based pricing does not map to traditional billing frameworks. A single LLM inference request can cost anywhere from $0.001 to $5.00 depending on the model, context window size, and whether cached tokens were used. Traditional cost allocation assumes relatively stable per-unit pricing.
AI workloads break that assumption. The top tooling request in the State of FinOps 2026 report is granular monitoring of AI spend (tokens, LLM requests, GPU utilization), and that capability does not yet exist at scale from any vendor.

Two years ago, 31% of FinOps practices managed AI spend. Today it is 98%. FinOps for AI is now the number one forward-looking priority, and AI value management is the top skillset teams are seeking to add. The FinOps Foundation updated its mission from "advancing the people who manage the value of Cloud" to "advancing the people who manage the Value of Technology." The ambition grew. The execution model (dashboards and meetings) did not.

**FinOps scope expansion (2026)**

- AI Spend: 98%
- SaaS: 90%
- Licensing: 64%
- Private Cloud: 57%
- Data Center: 48%

> **WARNING: GPU waste is different**
> GPU instances cost 10-50x more per hour than equivalent CPU instances. A p5.48xlarge on AWS costs $55.04/hour on demand in us-east-1. At 25% utilization, that is roughly $41/hour in waste on a single instance. The cloud-wide waste rate applied to GPU workloads translates to dollar amounts that make traditional cloud waste look trivial. GPU optimization requires understanding model serving patterns, batching strategies, and inference scheduling. Those are engineering decisions.

## Shifting cost into engineering

The fix is architectural. Cost needs to be a signal in the systems engineers already use: CI pipelines, infrastructure-as-code, internal developer platforms, and autoscaling policies. Three tools cover the leading edge: Infracost for cost estimates in pull requests, OpenCost for real-time Kubernetes cost allocation, and CDK Budgets for deploying cost guardrails as infrastructure. Each addresses a different point in the engineering decision chain.

### Infracost: cost in the PR

Infracost generates cloud cost estimates for Terraform and OpenTofu changes and posts them as PR comments.
The engineer sees the cost impact of their infrastructure change before it merges, inside the same code review workflow they already use. The integration is a CI step, not a separate tool to learn.

```yaml (.github/workflows/infracost.yml)
name: Infracost
on: [pull_request]

# The job's GITHUB_TOKEN needs write access to post the PR comment.
permissions:
  contents: read
  pull-requests: write

jobs:
  infracost:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      # For a before/after delta, generate a baseline on the base branch
      # and compare with `infracost diff --compare-to`.
      - name: Generate cost estimate
        run: |
          infracost breakdown --path=. \
            --format=json \
            --out-file=/tmp/infracost.json

      - name: Post PR comment
        run: |
          infracost comment github \
            --path=/tmp/infracost.json \
            --repo=${{ github.repository }} \
            --pull-request=${{ github.event.pull_request.number }} \
            --github-token=${{ secrets.GITHUB_TOKEN }} \
            --behavior=update
```

That is the entire integration. A CI step that runs on every PR touching infrastructure files. The engineer sees "$427/month -> $1,240/month" in their PR comment. The cost delta is visible at decision time. No dashboard required. This is the SRE equivalent of a latency regression test. If you are making things more expensive, you see it before merge.

### OpenCost: runtime cost allocation

OpenCost is a CNCF Incubating project (promoted from Sandbox on Oct 25, 2024) that provides a cloud-agnostic API for real-time Kubernetes cost metrics. It allocates costs to namespaces, deployments, pods, and labels using actual cloud billing rates. Where Infracost catches cost problems before deployment, OpenCost catches them in production. It identifies overprovisioned workloads, idle resources, and cost anomalies at the namespace level.

```yaml (opencost-helm-values.yaml)
# OpenCost deployment with Prometheus integration
opencost:
  exporter:
    defaultClusterId: "production"
    extraEnv:
      CLOUD_PROVIDER_API_KEY: ""
      EMIT_KSM_V1_METRICS: "false"
  prometheus:
    internal:
      enabled: true
    external:
      enabled: false
  ui:
    enabled: true
    ingress:
      enabled: true
      hosts:
        - host: costs.internal.company.com
```

OpenCost exposes an `/allocation` API that returns cost-per-namespace, cost-per-deployment, and cost-per-label in real time. Platform teams wire this into Backstage or Grafana dashboards. The more important wiring is into alerting. A namespace exceeding its cost budget triggers a Slack notification to the owning team. The signal reaches the team that can act on it, at the time they can act on it.

### CDK: budgets as infrastructure

For AWS-native teams, cost guardrails belong in the infrastructure stack itself. AWS Budgets and Cost Anomaly Detection can be provisioned as CDK constructs alongside the resources they monitor. The budget becomes an infrastructure resource deployed with the stack it governs.

```typescript (lib/constructs/cost-guardrails.ts)
import * as budgets from 'aws-cdk-lib/aws-budgets'
import * as iam from 'aws-cdk-lib/aws-iam'
import * as sns from 'aws-cdk-lib/aws-sns'
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions'
import { Construct } from 'constructs'

export class CostGuardrails extends Construct {
  constructor(scope: Construct, id: string, props: {
    monthlyBudget: number
    teamEmail: string
    environment: string
  }) {
    super(scope, id)

    // Alerts fan out through SNS so more subscribers can be added later.
    const topic = new sns.Topic(this, 'CostAlertTopic')
    topic.addSubscription(
      new subscriptions.EmailSubscription(props.teamEmail)
    )
    // AWS Budgets can only publish if the topic policy allows its service principal.
    topic.addToResourcePolicy(new iam.PolicyStatement({
      actions: ['sns:Publish'],
      principals: [new iam.ServicePrincipal('budgets.amazonaws.com')],
      resources: [topic.topicArn],
    }))

    new budgets.CfnBudget(this, 'MonthlyBudget', {
      budget: {
        budgetName: `${props.environment}-monthly`,
        budgetType: 'COST',
        timeUnit: 'MONTHLY',
        budgetLimit: { amount: props.monthlyBudget, unit: 'USD' },
      },
      notificationsWithSubscribers: [{
        // Notify when actual spend passes 80% of the monthly limit.
        notification: {
          notificationType: 'ACTUAL',
          comparisonOperator: 'GREATER_THAN',
          threshold: 80,
          thresholdType: 'PERCENTAGE',
        },
        subscribers: [{
          subscriptionType: 'SNS',
          address: topic.topicArn,
        }],
      }],
    })
  }
}
```

The budget is co-located with the infrastructure, owned by the same team, deployed by the same pipeline, and version-controlled in the same repo. When the team changes the infrastructure, the budget travels with it. This is the CDK equivalent of deploying an SLO alarm alongside the service it monitors. The cost constraint becomes part of the infrastructure definition, not an afterthought managed in a different console.

> **TIP: Start with Infracost**
> If you implement one thing from this post, add Infracost to your CI pipeline. It takes 15 minutes, requires no infrastructure changes, and gives every engineer cost visibility at PR time. OpenCost and CDK budgets require more platform investment but close the loop in production. Layer them in that order: PR visibility first, runtime allocation second, infrastructure guardrails third.

## FOCUS fixes the data layer

The FinOps Open Cost and Usage Specification (FOCUS) v1.3 was ratified in December 2025.
AWS, Azure, GCP, Oracle Cloud, and Tencent Cloud have adopted it. For the first time, there is a common schema for billing data across major cloud providers: normalized resource identifiers, consistent pricing units, standardized commitment discount representation. This solves the data normalization problem that has plagued multi-cloud cost governance since organizations started splitting workloads across providers.

Vantage is the most interesting commercial tool in this space. It has 20+ native integrations spanning AWS, Azure, GCP, Kubernetes, Snowflake, Datadog, OpenAI, Anthropic, MongoDB Atlas, and Databricks. It aggregates cost data from infrastructure and SaaS providers into a single view. More importantly, it ships an MCP server and an automated FinOps agent that takes action on cost anomalies (removing unattached EBS volumes and obsolete snapshots based on configurable policies). Cost management starts looking less like reporting and more like autonomous remediation.

FOCUS and aggregation tools solve the data normalization problem. They do not change engineer behavior. The value of FOCUS is as a data layer underneath engineering tools, feeding normalized cost data into Infracost estimates, OpenCost allocations, IDP budget widgets, and CI cost gates. The standard matters because it makes the plumbing reliable. The plumbing matters because it delivers cost signals to engineers. If the data stays in a dashboard that only the FinOps team checks, the standard changes nothing.

> **INFO: FOCUS adoption status**
> Among organizations spending $100M+ annually on cloud, 68% are using or experimenting with FOCUS-formatted billing data. Another 18% plan to adopt it. The remaining 14% are not planning to use it. AWS, Azure, GCP, Oracle, and Tencent already export FOCUS-formatted data natively. The bottleneck is practitioner tooling that consumes FOCUS data and routes it to engineering workflows. Provider support is no longer the gating problem.

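
To make the "data layer underneath engineering tools" idea concrete, here is a minimal sketch of what consuming normalized rows might look like. It assumes rows from several providers have already been exported in FOCUS format and loaded into memory. The `FocusRow` type models only a handful of FOCUS-style columns, and the `costByTag` helper and the sample rows are hypothetical.

```typescript
// Hypothetical, simplified row shape: only a few FOCUS-style columns are modeled.
// Check column names against the spec version your providers export.
interface FocusRow {
  ServiceName: string;
  EffectiveCost: number;
  BillingCurrency: string;
  Tags: Record<string, string>;
}

// Sum EffectiveCost per value of a tag key (for example "team").
// Rows without the tag are grouped under "untagged" so unowned spend stays visible.
// Assumes every row uses the same BillingCurrency; no conversion is attempted.
function costByTag(rows: FocusRow[], tagKey: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    const owner = row.Tags[tagKey] ?? "untagged";
    totals.set(owner, (totals.get(owner) ?? 0) + row.EffectiveCost);
  }
  return totals;
}

// Made-up rows, as they might look after exports from different providers are combined.
const sampleRows: FocusRow[] = [
  { ServiceName: "Amazon Elastic Compute Cloud", EffectiveCost: 120, BillingCurrency: "USD", Tags: { team: "payments" } },
  { ServiceName: "Compute Engine", EffectiveCost: 80, BillingCurrency: "USD", Tags: { team: "payments" } },
  { ServiceName: "Object Storage", EffectiveCost: 15, BillingCurrency: "USD", Tags: {} },
];

const totalsByTeam = costByTag(sampleRows, "team");
console.log(Array.from(totalsByTeam.entries()));
```

Because the schema is shared, the same function handles rows from any provider. The per-team totals are the kind of output that could feed a budget widget or a CI cost gate instead of a finance-only dashboard.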

## When shifting left fails

Shifting cost left is the right move for organizations making hundreds of provisioning decisions in code every week. It is the wrong move for several cases, and the tools carry real limits that a PR comment hides.

Infracost estimates are list-price approximations. They price resources off public on-demand rates and do not see committed-use discounts, reserved-instance coverage, savings plans, or an enterprise discount program. A team on a 40% EDP commitment reads a PR comment that overstates real cost by a wide margin, and a "$1,240/month" delta that lands inside negotiated pricing reads as alarming when it is not. The estimate is useful for relative comparison between two changes. Treated as an absolute bill, it misleads.

OpenCost is not free to run. It adds a deployment to every cluster, depends on Prometheus for metrics storage, and the cost data is only as accurate as the cloud billing rates fed into it. A platform team without an existing Prometheus stack inherits the operational burden of running and scaling one, plus the long-term storage cost of the cost metrics themselves. The allocation API answers cost-per-namespace questions, but someone has to own the deployment, the upgrades, and the alerting wiring that makes the data actionable.

Cost gates that block merges create developer friction. A gate that fails a PR above a cost threshold will fire on legitimate changes: a new environment, a deliberate capacity increase, a migration that is expensive on paper and cheaper in practice. Every false positive trains engineers to treat the gate as noise and reach for the override. A cost gate works as a non-blocking comment that informs the reviewer. As a hard merge block, it competes with shipping, and shipping wins.

Below a certain org size, none of this is worth maintaining.
A team running a handful of services on a single account, where one person can read the monthly bill and recognize every line, does not need a cost gate, a Kubernetes allocation API, or a FOCUS pipeline. The maintenance cost of the tooling exceeds the waste it would catch. The argument for shifting cost left scales with the number of independent provisioning decisions, not with the existence of a cloud bill. Adopt it when no single person can hold the spend in their head.

> **Key Point:** Infracost prices off list rates and ignores committed-use and EDP discounts, so its numbers are comparative, not absolute. OpenCost carries Prometheus and platform operating cost. Hard cost gates generate false positives that erode trust. Below the org size where provisioning decisions outrun any one person's attention, the tooling costs more than the waste it prevents.

## Where this goes

FinOps reporting is moving from finance to engineering leadership: 78% under CTO/CIO, up 18 points in three years. The FOCUS spec is normalizing the data layer across providers. Infracost, OpenCost, and Vantage are building the tooling that routes cost signals into engineering workflows. AI infrastructure costs are creating urgency that traditional cloud spend never did. The missing piece is organizational adoption of cost-as-engineering-signal at the same scale that SRE drove reliability-as-engineering-signal.

Three things will happen in the next 18 months. First, cost gates in CI pipelines will become as common as security scans. Infracost or equivalent tools will integrate into standard PR workflows with configurable thresholds that block merges above a cost delta. Second, internal developer platforms will surface cost per service as a first-class metric alongside latency and error rate. Backstage, Cortex, and Port already have the plugin architecture for it. Third, AI cost attribution will force the issue.
When a single inference endpoint can cost $50,000 per month and GPU utilization sits at 25%, the team running it will demand real-time cost visibility regardless of whether the organization has a FinOps practice.

> **The real question is whether the engineer provisioning the next GPU cluster sees the cost before or after the invoice arrives.**

> **Key Point:** FinOps adopted as a procurement practice produced dashboards. FinOps adopted as an engineering practice produces CI cost gates, IDP budget widgets, IaC cost constructs, and autoscaling policies with cost constraints. The data says adoption is massive and waste is unchanged. Deliver cost feedback where engineers already work: the pull request, the CI pipeline, and the deployment.

## Resources & Further Reading

- State of FinOps 2026 Report: https://data.finops.org/ - Sixth annual survey covering 1,192 respondents representing $83B+ in managed cloud spend
- FOCUS Specification (v1.3): https://focus.finops.org/ - FinOps Open Cost and Usage Specification for normalized billing data across providers
- Infracost: https://www.infracost.io/ - Cloud cost estimates for Terraform in pull requests
- Infracost GitHub: https://github.com/infracost/infracost - Open-source CLI and CI integrations for infrastructure cost estimation
- OpenCost: https://www.opencost.io/ - CNCF Incubating project for real-time Kubernetes cost monitoring and allocation
- OpenCost GitHub: https://github.com/opencost/opencost - Cloud-agnostic cost allocation API for Kubernetes workloads
- Vantage: https://www.vantage.sh/ - Multi-cloud cost platform with 20+ native integrations including AI providers
- AWS Budgets CDK Documentation: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_budgets-readme.html - CDK constructs for AWS Budgets and cost management
- Flexera State of the Cloud 2026: https://www.flexera.com/blog/cloud/cloud-computing-trends/ - Annual cloud report covering spend, waste, and FinOps maturity
- FinOps Foundation FOCUS
Adoption Guide: https://www.finops.org/wg/adopting-focus-the-finops-open-cost-and-usage-specification/ - Working group guide for adopting the FOCUS billing data standard
- Google SRE Book: https://sre.google/sre-book/table-of-contents/ - The foundational text on making reliability an engineering concern

---

# The Agentic DevOps Loop

- **URL**: https://www.stxkxs.io/blog/agentic-devops-loop
- **Published**: 2026-03-19
- **Author**: Brandon Stokes
- **Category**: platform-engineering
- **Tags**: agentic-devops, ai-agents, devops, ci-cd, platform-engineering, mcp, sre, autonomous-remediation, closed-loop-automation, claude-code, copilot-coding-agent, gitops
- **Reading time**: 14 min

AI agents already write code, deploy it, observe production, diagnose incidents, and remediate failures, but each capability lives in a separate tool with a human bridging the gaps. The closed loop, where agents autonomously execute the full code-deploy-observe-diagnose-fix cycle, is forming now, and it changes what platform engineers need to build.

## The loop is already open

The five stages of the DevOps loop (Code, Deploy, Observe, Diagnose, Remediate) have each been independently augmented by AI for years. The connection between stages has not. An AI agent that writes a fix does not trigger the deployment. The deployment system does not feed outcomes back to the observability layer in a format agents can act on. The incident diagnosis does not automatically generate a pull request. Humans copy-paste context between tools, translate formats, make judgment calls about when to proceed, and provide the connective tissue that the toolchain lacks.

Every stage of the software delivery lifecycle now has an AI agent. Claude Code and GitHub Copilot coding agent write code. Harness AIDA and Argo CD deploy it. Dynatrace Davis and Datadog Watchdog observe production. Rootly and PagerDuty AIOps diagnose incidents. Resolve AI and Rundeck automate remediation. Each capability is real and shipping.
Each lives in a separate tool, with a human bridging every gap between them.

An earlier post on this blog, "AI Creates Software Faster Than Ops Can Handle," asked what breaks when AI writes more code. This one asks what happens when AI also deploys, monitors, and fixes it. The closed loop, where agents autonomously execute the full code-deploy-observe-diagnose-fix cycle, is not hypothetical. Companies are building it now. The interesting question is what guardrails must exist when it finishes closing.

> **Key Point:** The DevOps loop is not missing AI. It is missing connections between AI agents. Each stage has capable automation; what does not exist is the handoff protocol between them.

- **AI in DevOps Market**: $28.5B by 2030 — Projected market size for AI-augmented DevOps tooling (MarketsandMarkets)

## Anatomy of the closed loop

Each stage of the loop can be described independently: what AI can do today, who is building it, and where the connection to the next stage breaks down. The gap between stages is where the human currently lives. It is also where the next wave of platform infrastructure has to be built.

### Stage 1: Code

AI code generation is the most mature stage. Claude Code, GitHub Copilot coding agent, and Cursor can implement features, write tests, and refactor codebases with minimal human guidance. GitHub reports that Copilot now writes an average of 46% of code in files where it is enabled. The bottleneck is no longer generation. It is deciding what to build and whether the output is correct. Code review remains a human gate, and the connection to deployment is manual: a human merges the PR, which triggers CI/CD.

The connection gap to Stage 2 is PR approval and merge. AI can generate the code, open the PR, and run the tests. A human still decides whether to merge. This gate is the most contentious point in the loop. Removing it enables speed and also allows AI-generated bugs to reach production unchecked.

### Stage 2: Deploy

AI-augmented deployment is advancing rapidly. Harness AIDA analyzes deployment risk and recommends rollout strategies. Argo CD and Flux provide GitOps automation that deploys whatever reaches the main branch. Progressive delivery tools like Flagger automate canary analysis. The deployment stage is the closest to full automation: GitOps already removes human intervention from the deploy step itself. The gap is in decision quality. Should this particular change be deployed now, given what production telemetry says?

The connection gap to Stage 3 is deployment-aware observability. Deployments trigger monitoring, but the observability layer rarely knows the semantic content of what changed. It sees a new container version, not "we changed the retry logic in the payment service." Without that context, correlating production anomalies with specific code changes requires human investigation.

### Stage 3: Observe

AI observability has exploded. As covered in "200 OK, Wrong Answer" on this blog, the market hit $1.1B with six acquisitions in twelve months. Dynatrace Davis AI, Datadog Watchdog, and New Relic AI detect anomalies, correlate signals across services, and surface likely root causes. For deterministic systems, AI observability works well. For AI-generated code running AI-augmented deployments, the observability challenge compounds. The system being observed is itself non-deterministic.

The connection gap to Stage 4 is structured incident context. Observability tools detect that something is wrong and surface correlated signals. Translating "elevated error rate on endpoint X correlated with deployment Y and downstream latency on service Z" into an actionable diagnosis requires context that the observability layer does not have: the intent of the change, the architecture of the system, the history of similar incidents.

### Stage 4: Diagnose

AI-assisted diagnosis is the newest and least mature stage.
Rootly AI summarizes incident timelines and suggests likely causes. PagerDuty AIOps correlates alerts and reduces noise. OpenClaw, covered on this blog, provides a self-hosted AI agent that can query infrastructure, read logs, and investigate issues through natural language. These tools accelerate diagnosis. They still rely on human judgment to confirm the root cause and decide on a fix.

The connection gap to Stage 5 is fix generation. Diagnosis produces a hypothesis: "the payment service timeout was reduced from 30s to 5s in the last deployment, causing cascading failures downstream." Turning that hypothesis into a code fix, a configuration change, or a rollback decision is currently a manual step that requires engineering judgment.

### Stage 5: Remediate

Automated remediation has the longest history in DevOps. Auto-scaling, auto-restart, and circuit breakers have been production staples for a decade. Resolve AI and Rundeck extend this with runbook automation that can execute complex remediation sequences. The new frontier is AI-generated remediation: given a diagnosis, generate and execute the fix. This closes the loop back to Stage 1 (Code), with a critical difference. The code is being written in response to a production incident, under time pressure, with the agent acting on its own diagnosis.

The connection gap back to Stage 1 is trust. An auto-scaler adding replicas is a bounded, well-understood action. An AI agent writing a code fix based on its own diagnosis of a production incident and deploying it through the same pipeline is an unbounded action with compounding risk at every stage.

Diagnose is the most dangerous stage to fully automate. AI can diagnose, but a wrong diagnosis has uniquely compounding consequences. Code generation can be validated by tests. Deployments can be rolled back. Observability is fundamentally read-only. Autonomous diagnosis lets an agent decide what the problem is, and every downstream action flows from that decision.
A wrong diagnosis does not simply fail to fix the issue. It generates a confidently wrong fix that makes the system harder to reason about. When the remediation agent acts on a bad diagnosis, the production environment ends up "fixed" in a direction nobody intended.

> **INFO: The most contentious gate**
> PR review and merge approval sits between Stage 1 (Code) and Stage 2 (Deploy). It is the single most debated gate in the loop. Remove it, and AI-generated code flows to production at machine speed. Keep it, and the human reviewer becomes the bottleneck that limits the entire cycle. Every organization closing the loop must decide where this gate sits and under what conditions it opens automatically.

## Three companies closing the gap

No single vendor has closed the full loop. Three companies are actively connecting adjacent stages in ways that reveal how the complete cycle will form. Each represents a different entry point into the loop and a different strategy for expanding across stages.

### Harness expands outward from delivery

Harness started in continuous delivery and expanded with AIDA, an AI assistant that analyzes deployment pipelines, predicts failure risk, and recommends rollout strategies. AIDA connects Stage 2 (Deploy) to Stage 3 (Observe) by ingesting deployment outcomes and correlating them with production health metrics. When a canary deployment shows degraded performance, AIDA can recommend rollback before the change reaches full production.

Harness is now extending into Stage 4 (Diagnose) with root cause analysis that traces production issues back to specific deployment changes. The connection from deploy to observe to diagnose covers three of the five stages on a single platform. The remaining gaps (code generation on the front end and automated remediation on the back end) are where Harness relies on integrations rather than native capability.

### GitHub leads with the repository

GitHub's strategy is the most visible.
Copilot generates code (Stage 1), the Copilot coding agent autonomously implements changes from GitHub Issues, and GitHub Actions handles CI/CD (Stage 2). The pattern is clear: connect code generation directly to deployment through a single platform. GitHub's advantage is owning the repository layer where PR review happens, which means they control the most contentious gate in the loop.

The gap in GitHub's approach is on the right side of the loop: observe, diagnose, and remediate. GitHub does not have a native observability product, incident management system, or remediation platform. They rely on marketplace integrations with Datadog, PagerDuty, and others, which means the handoff from deploy to observe still requires human-configured glue between separate tools. The Copilot ecosystem may eventually extend into these stages through MCP integrations. Today the connection is manual.

### Dynatrace owns the right side

Dynatrace approaches the loop from Stage 3 (Observe) and extends in both directions. Davis AI provides causal analysis that goes beyond correlation: it models the dependency graph of the system and traces anomalies to root causes (Stage 4: Diagnose). Davis CoPilot enables natural language queries against the full telemetry stack, and Dynatrace Workflows can trigger automated remediation sequences (Stage 5: Remediate) when specific conditions are met.

Dynatrace's observe-diagnose-remediate chain is the most complete right-side loop in the market. The connection gap is on the left: code generation and deployment. Dynatrace can identify what broke and potentially fix it through runbook automation, but generating a code fix and deploying it through a CI/CD pipeline requires integration with external tools. Recent investment in OpenTelemetry compatibility suggests Dynatrace is building toward an ecosystem play rather than trying to own the full loop natively.

> **Key Point:** Each company is closing the gap between adjacent stages, not building the full loop. The complete cycle will likely emerge through protocol-level integration (MCP and A2A) rather than any single vendor owning all five stages.

## What breaks when the loop closes

Connecting the five stages enables machine-speed iteration across the full delivery lifecycle. It also creates failure modes that do not exist when humans bridge the gaps. Removing the human from the connections between stages has consequences that compound across the cycle.

### Cascading autonomy failures

The most dangerous failure mode is a cascade where each stage's AI makes a reasonable but slightly wrong decision and the errors compound. An AI agent writes a fix that addresses the symptom but not the root cause (Stage 1). The deployment system scores it as low-risk and sends it to production (Stage 2). The observability layer sees a brief improvement in the metric that triggered the incident, because the symptom was addressed, and marks the issue resolved (Stage 3). The underlying root cause continues degrading until it manifests as a different, worse symptom. The diagnosis agent, lacking context from the earlier failed fix, treats it as a new incident (Stage 4). A new fix is generated, deployed, and observed, completing another loop that again misses the root cause.

Consider a concrete scenario. A slow database query triggers an alert. The diagnosis agent misclassifies it as a network partition. The remediation agent restarts the affected services. The restarts cause new failures (the slow query was throttling load that the restart undoes), which trigger fresh alerts, which trigger more diagnoses and more remediations. The original query would have resolved in two minutes. The cascade takes four hours to unwind. Each agent performed its function correctly according to its local context.
The failure is systemic: no single agent has visibility into the full loop, and the connections between stages do not carry enough context to prevent the cascade.

> **WARNING: The cascading loop failure scenario**
> Agent writes fix, deploys automatically, metrics briefly improve, underlying issue worsens, new incident triggered, agent diagnoses as new issue, writes another fix, cycle repeats. Each iteration makes the system state harder to understand and the actual root cause harder to find. Without a loop-level circuit breaker, this pattern can run multiple iterations before a human notices.

### The approval bottleneck

The obvious defense against cascading failures is human approval gates between stages. Gates reintroduce the bottleneck that closing the loop was meant to eliminate. If every AI-generated fix requires human review before deployment, the loop runs at human speed. If every deployment requires human sign-off on the observability outcome, the feedback cycle is constrained by on-call engineer availability. The approval bottleneck is the fundamental tension at the heart of closed-loop automation.

The resolution lives between full autonomy and full human control. It is a spectrum of approval tiers that vary by blast radius, confidence level, and reversibility. A configuration change to a non-critical internal tool can flow through the loop automatically. A schema migration on a production database requires human approval at every stage. The platform engineering challenge is building the infrastructure that enforces the right tier for the right change.

### Blast radius amplifies

When humans bridge the loop stages, they naturally limit blast radius through judgment and speed. A human reviewing a PR considers "what if this is wrong?" before merging. A human watching a canary deployment notices qualitative signals that automated metrics miss.
A human triaging an incident considers organizational context: "we have a board presentation tomorrow, so let's roll back rather than push a fix." These judgment calls happen unconsciously and act as blast radius limiters. Agents operating at machine speed do not make them unless explicitly programmed to.

Machine speed means faster damage. A human-bridged loop might complete one cycle per hour during an incident. An autonomous loop could complete one cycle per minute. If each cycle makes things slightly worse (the cascading failure scenario), the damage rate increases proportionally. Blast radius amplification is about the rate at which potentially wrong actions are taken, not about any single action being more dangerous.

### Observability for the loop

Current observability watches the application. When the loop closes, you also need observability that watches the loop. How many cycles has the loop completed in the last hour? Are cycles converging (each one improving the system) or diverging (each one making things worse)? What is the loop's false positive rate? What is the human intervention rate? These are meta-observability metrics that do not exist in any current toolchain.

The "200 OK, Wrong Answer" post on this blog covered the challenge of observing AI systems where a successful response can contain wrong content. The closed loop compounds the problem: you need to observe AI agents that are observing AI-deployed code that was written by AI. Each layer of AI introduces a layer of non-determinism. Meta-observability becomes a first-class infrastructure requirement.

> **The human in the loop is more than a bottleneck. They are a circuit breaker, a context carrier, and a judgment layer. Removing them from the connections between stages requires replacing those functions with infrastructure.**

## MCP as the connective tissue

The connection gaps between stages are not just product gaps. They are protocol gaps.
Each tool speaks its own language, exposes its own API, and structures its data differently. An observability platform emits alerts in one format. An incident management tool expects context in another. A code generation agent needs the problem described in yet another way. Bridging these gaps today requires custom integrations for every pair of tools.

Model Context Protocol (MCP), now a Linux Foundation standard, provides the connectivity layer. As covered on this blog when Anthropic donated MCP to the Linux Foundation, the protocol enables AI agents to interact with external tools through a standardized interface. An MCP server wraps a tool (a deployment platform, an observability backend, a CI/CD pipeline) and exposes it to any MCP-compatible agent. The agent does not need tool-specific integration code; it needs MCP.

Google's Agent2Agent (A2A) protocol complements MCP by enabling agent-to-agent handoffs. Where MCP connects an agent to a tool, A2A connects an agent to another agent. In the context of the loop, MCP enables each stage's agent to interact with the tools at that stage, and A2A enables the handoff between stages. The diagnosis agent hands off to the remediation agent, passing structured context about the root cause, the affected services, and the proposed fix. The remediation agent hands off to the code generation agent with the fix specification. Each handoff carries structured context rather than requiring the receiving agent to reconstruct it from scratch.

MCP and A2A form the integration layer between stages of the closed loop. They do not solve the trust, judgment, or blast radius problems. Those require platform infrastructure. They solve the connectivity problem: the mechanical challenge of getting structured context from one stage to the next without lossy human translation in between.
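
The diagnosis-to-remediation handoff described above can be pictured as a typed payload plus a check that refuses to pass incomplete context downstream. This is a minimal sketch under stated assumptions: the `DiagnosisHandoff` fields and the `validateHandoff` helper are hypothetical and are not taken from the A2A or MCP specifications.

```typescript
// Hypothetical handoff payload from a diagnosis agent to a remediation agent.
// Field names are illustrative only and do not come from any protocol spec.
interface DiagnosisHandoff {
  incidentId: string;
  rootCause: string;
  affectedServices: string[];
  proposedFix: string;
  confidence: number; // 0 to 1, as reported by the diagnosis agent
}

// Return the problems that should stop the handoff.
// An empty array means the context is complete enough to pass on.
function validateHandoff(handoff: DiagnosisHandoff): string[] {
  const problems: string[] = [];
  if (handoff.incidentId.trim() === "") problems.push("incidentId is empty");
  if (handoff.rootCause.trim() === "") problems.push("rootCause is empty");
  if (handoff.affectedServices.length === 0) problems.push("affectedServices is empty");
  if (handoff.proposedFix.trim() === "") problems.push("proposedFix is empty");
  if (!(handoff.confidence >= 0 && handoff.confidence <= 1)) {
    problems.push("confidence must be between 0 and 1");
  }
  return problems;
}

// Made-up example based on the timeout scenario used earlier in the post.
const exampleHandoff: DiagnosisHandoff = {
  incidentId: "INC-0001",
  rootCause: "payment service timeout reduced from 30s to 5s in the last deployment",
  affectedServices: ["payment-service"],
  proposedFix: "restore the 30s timeout",
  confidence: 0.8,
};

console.log(validateHandoff(exampleHandoff));
```

Rejecting an incomplete handoff at the boundary is one way to keep the receiving agent from reconstructing missing context on its own, which is where lossy translation tends to creep back in.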

```json (loop-agent-mcp-config.json)
{
  "mcpServers": {
    "github": {
      "command": "mcp-server-github",
      "env": { "GITHUB_TOKEN": "${GITHUB_TOKEN}" },
      "description": "Stage 1: Code (PR creation, review, merge)"
    },
    "argocd": {
      "command": "mcp-server-argocd",
      "env": { "ARGOCD_SERVER": "${ARGOCD_SERVER}" },
      "description": "Stage 2: Deploy (GitOps sync, rollback)"
    },
    "datadog": {
      "command": "mcp-server-datadog",
      "env": { "DD_API_KEY": "${DD_API_KEY}" },
      "description": "Stage 3: Observe (metrics, traces, anomalies)"
    },
    "pagerduty": {
      "command": "mcp-server-pagerduty",
      "env": { "PD_API_KEY": "${PD_API_KEY}" },
      "description": "Stage 4: Diagnose (incidents, alerts, correlation)"
    },
    "resolve": {
      "command": "mcp-server-resolve",
      "env": { "RESOLVE_TOKEN": "${RESOLVE_TOKEN}" },
      "description": "Stage 5: Remediate (runbooks, auto-remediation)"
    }
  }
}
```

> **TIP: Invest in MCP servers now**
> Even without closing the loop today, building MCP servers for your internal tools creates the integration surface that loop automation will use. Every deployment tool, observability backend, and incident management system that gets an MCP server becomes a stage that agents can interact with. The investment compounds: each new MCP server multiplies the possible agent-to-tool connections in your ecosystem. The hard part is authorization. Each MCP server connection needs its own RBAC policy defining what the agent can do. Build that governance layer alongside the servers, not after.

## What platform teams must build

The closed loop will not arrive as a product you purchase. It will emerge from the connections between existing tools, mediated by protocols like MCP and A2A, and governed by platform infrastructure that your team builds. The following four systems form the minimum viable governance layer for organizations moving toward autonomous DevOps operations.

### Tiered approval gates

Not every change needs the same level of human oversight.
A tiered approval system classifies changes by blast radius, reversibility, and confidence level, then applies the appropriate gate. The tiers should be enforced by the platform, not by convention. Agents operating at machine speed will not respect a process that is not technically enforced.

- **Tier 1: Full Auto**: No human — Low blast radius, fully reversible, high confidence. Config changes to non-critical services, scaling adjustments, feature flag toggles.
- **Tier 2: Notify**: Inform human — Moderate blast radius, reversible, high confidence. Code fixes with passing tests, canary deployments, standard remediation runbooks.
- **Tier 3: Approve**: Human approves — High blast radius or low reversibility. Database migrations, API contract changes, cross-service modifications.
- **Tier 4: Human Execute**: Human does it — Critical blast radius, irreversible, novel situation. Production data modifications, security-sensitive changes, first-time remediations.

### Agent audit trails

Every decision an agent makes in the loop must be logged with full context: what data the agent observed, what options it considered, what it chose, and why. This is a compliance requirement for regulated industries and an incident investigation requirement for everyone else. When an autonomous loop makes things worse, the post-mortem needs to trace the exact sequence of agent decisions that led to the outcome.

Agent audit trails differ from application logs in structure. They need to capture the decision graph, not just the action sequence. "The diagnosis agent identified three possible root causes, ranked them by probability, selected option A, and generated a fix specification" is an audit trail. "Diagnosis agent called API X and returned result Y" is a log. Both are necessary; the audit trail is what makes the loop governable.

### Blast radius controls

Blast radius controls limit the damage an autonomous loop iteration can cause.
Progressive rollout ensures that AI-generated fixes deploy to a small percentage of traffic first, with automated rollback if metrics degrade. Scope limits restrict which services an agent can modify (a remediation agent for the payment service cannot touch the authentication service). Rate limits prevent the loop from executing more than N cycles per hour, so cascading failures are bounded by time even if monitoring misses them. The most critical blast radius control is the automatic rollback trigger. If the loop is converging (each cycle improving the target metric), it should continue. If it is diverging (metrics degrading despite remediation attempts), the loop must halt and escalate to a human. This convergence check is the loop's circuit breaker, and it must be a platform primitive, not something each agent implements independently. ### Loop-aware observability Standard observability watches the application. Loop-aware observability watches the automation cycle itself. The key metrics are: loop completion rate (what percentage of triggered cycles complete without human intervention), convergence rate (what percentage of cycles improve the target metric), human intervention rate (how often humans override or halt the loop), mean time to escalation (how quickly the loop recognizes it cannot solve the problem), and agent accuracy (what percentage of diagnoses match the actual root cause identified in post-mortems). - Start at Tier 4 (human execute) for all loop stages; establish baseline metrics for agent accuracy and decision quality. - Implement agent audit trails before reducing human oversight; capture decision context at every stage. - Add blast radius controls: progressive rollout, scope limits, rate limits, and convergence-based circuit breakers. - Build loop-aware observability; instrument the loop itself, not just the applications it manages. - Promote to Tier 3 (human approve) for well-understood, reversible changes with established accuracy baselines. 
- Promote to Tier 2 (notify) only for change categories with demonstrated >95% agent accuracy over 90+ days. - Reserve Tier 1 (full auto) for bounded, reversible actions with automatic rollback (scaling, feature flags, config). > **TIP: Start at Tier 4, promote down** > The safest adoption path is to start with human execution for all loop stages and promote toward autonomy as confidence builds. Each tier promotion should be backed by data: agent accuracy metrics, convergence rates, and human override frequency. Promoting too fast risks cascading failures; promoting too slow forfeits the value of automation. Let the metrics decide. ## When closing the loop is wrong Closing the loop pays off when human handoff speed is the constraint on delivery. For a large fraction of teams it is not, and the integration work buys nothing but a larger surface to govern. Three situations make loop closure the wrong call, and none of them is a maturity problem that more adoption time fixes. ### Low deploy frequency wastes it The DORA State of DevOps research measures four delivery metrics, and deployment frequency is the one that loop closure targets. If a team ships weekly or monthly, the time a human spends bridging stages is a rounding error against the time spent waiting for the next release window, the next dependency, or the next product decision. Machine-speed iteration solves a bottleneck that does not exist. The governance layer described above (tiered gates, audit trails, loop-aware observability) is real engineering cost paid against a bottleneck that was never human handoff speed. ### Regulated systems keep the human In systems governed by change-control regimes, a human approving a production change is the correct permanent state, not a stepping stone to autonomy. Medical devices, payment processing, aviation, and clinical systems require an accountable person who signed off on the change. 
That requirement is not a temporary lack of confidence in the agent that better metrics will retire. It is the design. Agent audit trails make the human approver faster and better informed, and they do not replace the approver. Treating Tier 4 as a stage to graduate out of misreads why the gate exists. ### Small teams pay more A team of a few engineers already has the full loop in one head. The person who wrote the code watches the deploy, reads the dashboard, diagnoses the incident, and ships the fix, carrying context between stages with zero translation loss because there is no handoff to translate. Building MCP servers for every internal tool, a tiered approval engine, and meta-observability for the loop is months of platform work that a small team is better off spending on the product. Loop closure earns its cost when the number of handoffs across people and tools is large enough that the connective tissue is itself the expensive part. Below that threshold, the human bridge is cheaper than the infrastructure that replaces it. > **WARNING: Closure is not the default goal** > The right end state for many organizations is a partially automated loop with a permanent human gate at the stages that carry regulatory or blast-radius weight. Closing the loop end to end is one valid target among several, justified only when human handoff speed is the measured constraint. Build governance infrastructure because handoffs are slowing real delivery, not because full autonomy is assumed to be the destination. ## The loop closes The closed loop is a question of when and how fast, not if. Every major DevOps vendor is building toward it. The protocols exist: MCP for agent-to-tool integration, A2A for agent-to-agent handoffs. The companies profiled here are already connecting two or three adjacent stages. The full five-stage autonomous cycle is a matter of integration, not invention. The real question is whether the loop will have circuit breakers when it closes. 
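As a starting point, the convergence check and rate limit described under blast radius controls can be sketched as one platform-level function. This is a minimal illustration in plain TypeScript, not a reference implementation: `checkLoop`, `CycleResult`, `BreakerConfig`, and the thresholds are hypothetical names and values, and a real platform would read the metric from its observability backend.

```typescript
// Hypothetical types for illustration; none of these come from a real library.
type LoopDecision = 'continue' | 'rate-limited' | 'halt-and-escalate'

interface BreakerConfig {
  maxCyclesPerHour: number          // rate limit: bounds cascading failures in time
  maxConsecutiveRegressions: number // worsening cycles tolerated before halting
  lowerIsBetter: boolean            // true for error rate, false for success rate
}

interface CycleResult {
  finishedAt: number   // epoch milliseconds
  metricBefore: number // target metric before the remediation
  metricAfter: number  // target metric after the remediation
}

export function checkLoop(
  history: CycleResult[],
  config: BreakerConfig,
  now: number
): LoopDecision {
  // Divergence check: count the most recent cycles that made the metric worse.
  let regressions = 0
  for (const cycle of [...history].reverse()) {
    const worsened = config.lowerIsBetter
      ? cycle.metricAfter > cycle.metricBefore
      : cycle.metricAfter < cycle.metricBefore
    if (!worsened) break
    regressions++
  }
  if (regressions >= config.maxConsecutiveRegressions) {
    return 'halt-and-escalate'
  }

  // Rate limit: refuse another cycle if the hourly budget is already spent.
  const oneHourAgo = now - 60 * 60 * 1000
  const recentCycles = history.filter((c) => c.finishedAt > oneHourAgo).length
  if (recentCycles >= config.maxCyclesPerHour) {
    return 'rate-limited'
  }

  return 'continue'
}
```

The orchestrator would call `checkLoop` before every cycle, so no individual agent has to implement the check itself. That is the sense in which the circuit breaker is a platform primitive.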
"AI Creates Software Faster Than Ops Can Handle" argued that platform engineers need to build the operational infrastructure before the flood of AI-generated code arrives. This post extends that mandate: platform engineers also need to build the governance infrastructure before the autonomous loop connects. Tiered approval gates, agent audit trails, blast radius controls, loop-aware observability. They are the difference between an autonomous loop that amplifies engineering capability and one that amplifies engineering failure. > **The loop will close. The question is whether it has circuit breakers, and the teams who build those circuit breakers are platform engineers.** Organizations that build governance infrastructure now will close the loop on their terms: incrementally, with data-driven tier promotions and proven blast radius controls. Those that wait will close it reactively, after an autonomous cycle causes an incident that makes the case for guardrails more persuasively than any architecture document could. The loop is coming either way. Build the circuit breakers first. 
## Resources & Further Reading - AI Creates Software Faster Than Ops Can Handle: https://www.stxkxs.io/blog/second-order-explosion - The prequel: what breaks when AI writes more code - 200 OK, Wrong Answer: https://www.stxkxs.io/blog/ai-observability-200-ok - AI observability and the failure of golden signals for non-deterministic systems - Self-Hosted AI Agents for Incident Response: https://www.stxkxs.io/blog/openclaw-self-hosted-ai-agents - OpenClaw as a Stage 4 (Diagnose) tool for infrastructure - MCP Is Now a Linux Foundation Standard: https://www.stxkxs.io/blog/mcp-linux-foundation - Protocol context for MCP and A2A as loop connectivity - Harness AIDA: https://www.harness.io/products/aida - AI assistant for CI/CD pipeline analysis and deployment risk - GitHub Copilot Coding Agent: https://github.blog/news-insights/product-news/github-copilot-meet-the-new-coding-agent/ - Autonomous coding agent that implements changes from GitHub Issues - Dynatrace Davis AI: https://www.dynatrace.com/platform/artificial-intelligence/ - Causal AI for observability, diagnosis, and automated remediation - Resolve AI: https://www.resolve.ai/ - AI-driven incident remediation and automated runbooks - Rootly AI: https://rootly.com/ - AI-assisted incident management and diagnosis - DORA State of DevOps Report: https://dora.dev/research/ - Industry benchmarks for deployment frequency, lead time, MTTR, and change failure rate - Google A2A Protocol: https://github.com/a2aproject/A2A - Agent-to-Agent communication protocol specification - Gartner Platform Engineering: https://www.gartner.com/en/articles/what-is-platform-engineering - Market analysis and maturity models for platform engineering --- # 200 OK, Wrong Answer - **URL**: https://www.stxkxs.io/blog/ai-observability-200-ok - **Published**: 2026-03-14 - **Author**: Brandon Stokes - **Category**: engineering - **Tags**: observability, ai-observability, llm-monitoring, opentelemetry, tracing, ai-infrastructure, llm-ops, 
platform-engineering, ai-agents, otel-genai - **Reading time**: 14 min Dashboards green. SLOs met. The AI hallucinated the answer. Traditional observability was built for deterministic systems where a 200 OK means success. AI broke that contract. A $1.1B market, six acquisitions in twelve months, and $300M+ in VC later, the industry is racing to figure out what replaces the golden signals—and whether AI observability becomes its own category or a feature of the platforms it monitors. Your AI system just told a customer they can return a product ninety days after purchase. Confident tone, clean formatting, proper citations. The HTTP response was 200 OK. Latency was 340ms. Every dashboard is green. PagerDuty is quiet. Your return policy is thirty days. The model hallucinated a policy that does not exist, cited a support document that was never written, and delivered it with the same confidence as a correct answer. Your infrastructure did its job perfectly. Your observability stack confirmed it. The customer got the wrong answer and you have no alert for that. This is the hardest problem in platform engineering right now, and most teams are pretending it is not their problem yet. The observability stack you spent years building (golden signals, distributed traces, SLO burn rate alerts) was designed for systems with a contract: same input, same output. AI systems have no such contract. Every response is generated, not retrieved. Two identical prompts can produce different outputs. "Correct" is not a property of the HTTP status code. It is a semantic judgment that requires understanding what the output means, not just that it arrived. This post is about what to actually do about it. A practical argument about which signals you need, how to instrument them with standards that will survive the current vendor shakeout, and where to start this week. 
## Your observability stack is broken When you wire up AI workloads to your existing monitoring, everything looks fine until it is not. That is the trap. The dashboards are not lying. They are answering questions that no longer matter. Each assumption your stack rests on breaks with AI workloads. ### Your metrics measure wrong things Latency, error rate, throughput, and saturation are necessary but insufficient for AI workloads. A 200ms response can cost $0.002 or $0.20 depending on the model, context window size, and whether cached tokens were used. An error rate of 0% is meaningless when the most dangerous failures return 200 OK. Throughput in requests per second ignores that one request might consume 100,000 tokens while another consumes 500. Grafana dashboards look healthy while the AI system is confidently wrong and hemorrhaging money. The metrics that matter (token cost per request, time to first token, semantic correctness scores, cache hit rates, reasoning token overhead) do not exist in traditional observability stacks. Datadog and Grafana are adding them, bolted onto architectures designed for request-response telemetry, not token-level economics. The data model is wrong, not just incomplete. ### Your traces assume DAGs Distributed tracing assumes a request enters the system, flows through services, and exits. Each hop is a span, spans nest into traces, the whole thing renders as a waterfall. This breaks immediately with AI agents. An agent calls a tool, evaluates the result, decides it is insufficient, modifies its approach, calls a different tool, loops back to re-evaluate, and repeats until a quality threshold is met. That is a cycle with conditional branches, dynamic tool selection, and variable-depth recursion, not a DAG. In Jaeger you get an unreadable wall of spans where the most important information (why the agent looped, what it retried, how many iterations it took) is buried in attributes or lost entirely. 
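To make that shape concrete, here is a minimal sketch of such a loop in plain TypeScript that records why each iteration happened. It is not tied to any tracing SDK: `runAgentLoop`, `LoopStep`, and the scoring callback are hypothetical names, and the tool call is synchronous only to keep the example short. In a real system each step would probably become a span, with the iteration number and the reason as attributes.

```typescript
// Hypothetical shapes for illustration; not part of any tracing SDK.
interface LoopStep {
  iteration: number
  tool: string
  score: number                          // quality score between 0 and 1
  reason: 'accepted' | 'below-threshold' // why the loop stopped or continued
}

interface LoopTrace {
  steps: LoopStep[]
  iterations: number
  accepted: boolean
}

// Assumed callback: runs a tool and returns a quality score for its result.
type ToolCall = (tool: string, attempt: number) => number

export function runAgentLoop(
  tools: string[],
  callTool: ToolCall,
  threshold: number,
  maxIterations: number
): LoopTrace {
  const steps: LoopStep[] = []
  for (let i = 0; i < maxIterations; i++) {
    // Rotate through the tools: a stand-in for dynamic tool selection.
    const tool = tools[i % tools.length] ?? 'unknown'
    const score = callTool(tool, i)
    const accepted = score >= threshold
    steps.push({
      iteration: i + 1,
      tool,
      score,
      reason: accepted ? 'accepted' : 'below-threshold',
    })
    if (accepted) {
      return { steps, iterations: i + 1, accepted: true }
    }
  }
  // The loop ran out of budget without meeting the quality threshold.
  return { steps, iterations: steps.length, accepted: false }
}
```

The returned `LoopTrace` answers the three questions a waterfall view tends to bury: why the agent looped, what it retried, and how many iterations it took.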
### Cost and latency decouple In traditional systems, slow equals expensive. A request that takes 10x longer consumes roughly 10x the compute. In AI systems, that correlation breaks completely. A fast response from a large model with a long context window can cost 100x a slow response from a small model: cached tokens are cheaper than fresh tokens, reasoning tokens add cost without adding output length, and input and output tokens are priced differently. A single prompt engineering change can shift costs by 40% with no change in latency. Without cost per request as a first-class metric, you are flying blind on what is probably your fastest-growing infrastructure line item. ### Failure is semantic A 500 means the server errored. A timeout means you exceeded a deadline. These are unambiguous, machine-readable, and trigger alerts automatically. AI's most dangerous failures return 200 OK with well-formed, grammatically correct, confidently stated wrong answers. Detecting this requires understanding the meaning of the output: comparing it against ground truth, evaluating factual consistency, checking for contradictions with source material. That is evaluation, not monitoring, and your observability stack was never designed to do it. > **Key Point:** The golden signals tell you if the system is up. They do not tell you if the system is right. For AI workloads, "right" is the only metric that matters to users. ## Instrument the right signals Going after semantic evaluation, agent tracing, cost tracking, and quality scores all at once stalls before any of it ships. The signals layer in order of difficulty, and the order matters: each one teaches you something that makes the next one easier to implement. ### Start with token economics (Week 1) Token usage is not a single number. Input tokens, output tokens, cached input tokens, and reasoning tokens are each priced differently and have different performance implications. 
A request that hits the prompt cache might cost 90% less than an identical request with a cold cache. A model using extended thinking might consume 10x the tokens of a direct response, all billed at a different rate. The OpenTelemetry GenAI semantic conventions give you standardized attributes for all of this (`gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cache_read.input_tokens`), so your instrumentation is portable across backends. When you set this up, you will find things immediately. Token-level cost visibility surfaces 30-50% cost reduction opportunities within the first month: from unnecessarily large context windows, redundant system prompts, missing cache utilization, or model over-provisioning. This is the fastest ROI in platform engineering right now.

```yaml (otel-collector-genai.yaml)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

connectors:
  # Derive call counts and latency histograms from GenAI spans.
  # Dimensions are read straight from span attributes, so no
  # attributes processor is needed to extract them.
  spanmetrics:
    dimensions:
      - name: gen_ai.response.model
      - name: gen_ai.operation.name
    histogram:
      explicit:
        # Latency buckets are durations and need units.
        buckets: [50ms, 100ms, 250ms, 500ms, 1000ms, 2500ms, 5000ms]
    metrics_flush_interval: 15s

exporters:
  otlp/metrics:
    endpoint: "metrics-backend:4317"
    tls:
      insecure: true
  otlp/traces:
    endpoint: "traces-backend:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/traces, spanmetrics]
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [otlp/metrics]
```

The `spanmetrics` connector converts GenAI trace spans into time-series metrics, dimensioned by model and operation.
You get call counters and latency histograms in Prometheus or your metrics backend without writing any custom aggregation. Per-model request and latency dashboards come for free once the spans flow through; cost-per-model dashboards need one more step, because the connector counts calls and durations rather than summing the token attributes.

### Add semantic evaluation (Week 3)

Semantic quality evaluation layers three techniques that build on each other rather than replace one another. First-gen heuristics (regex, keyword detection, length checks) are fast and cheap. Use them to catch obvious garbage. Second-gen embedding similarity (comparing output vectors against reference answers) catches semantic drift but struggles with novel correct answers. Third-gen LLM-as-judge (using a second model to evaluate the first) gives the highest accuracy. It costs real money and adds latency you cannot afford on every request. The practical approach: run heuristic checks on every response in the hot path. Run embedding similarity on critical paths where you have reference answers. Reserve LLM-as-judge for offline batch analysis and CI/CD quality gates. Purpose-built evaluation models like Galileo's Luna-2 (sub-200ms, roughly $0.02 per million tokens) are closing the gap between heuristic speed and LLM-judge accuracy. That price point is roughly 1,000x cheaper than using a frontier model as a judge. Concretely, at roughly 1,000 tokens per evaluation, 10 million completions per day is 10 billion tokens, which costs roughly $200/day with Luna-2. With a frontier model as judge, at about $20 per million tokens, it is $200,000/day. The difference between "viable on every response" and "sample 0.1% or go bankrupt" is the evaluation model cost, and it is finally crossing the threshold where real-time monitoring is economically feasible.

### Instrument agent traces (Month 2)

Tackle this only if you actually run agent workflows. Standard span-based tracing will not cut it. The hardest problem in practice is trace context propagation through async agent loops. When an agent spawns sub-agents that make tool calls that trigger other agents, the parent-child span relationship breaks down.
OpenTelemetry's context propagation model assumes request-scoped trees, not recursive agent cycles. A single agent invocation might include a planning step, multiple tool calls, intermediate evaluations, retry loops, and a final synthesis. Each needs its own span, and all need to be grouped into a coherent execution trace that captures the agent's decision-making process. LangSmith structures this as nested runs. Braintrust captures it as experiment traces with scoring at each step. Arize Phoenix uses the OpenInference specification. The implementations differ; the requirement is the same: traces must capture cycles, not just DAGs. > **TIP: The layering matters** > Token economics is deterministic and easy. Start there. Semantic evaluation requires choosing a strategy and calibrating thresholds; add it once you understand your traffic patterns from the token data. Agent tracing requires the most architectural investment and only matters if you run agent workflows. Each layer teaches you something that makes the next layer easier. ## Instrument against OTel Instrument against OTel GenAI semantic conventions now, regardless of what backend you use. The GenAI conventions are still in experimental status and have changed multiple times since mid-2025, so wrap your instrumentation in a thin abstraction layer that absorbs spec changes without touching every callsite. The investment is still worth it. The alternative (proprietary SDKs from vendors who keep getting acquired) is worse. The AI observability vendor landscape is consolidating violently: six acquisitions or shutdowns in twelve months. Your instrumentation needs to survive that churn. OTel GenAI conventions give you a vendor-neutral schema for AI telemetry: standardized attribute names, span structures, and metric definitions that any backend can consume. For context on the consolidation: ClickHouse acquired Langfuse. CoreWeave acquired Weights & Biases for $1.7 billion. Alphabet acquired Galileo AI. 
Anthropic acqui-hired Humanloop. Mintlify acquired Helicone. WhyLabs ceased commercial operations and its founders joined Apple in an acqui-hire; its platform, whylogs, and langkit were open-sourced rather than shut down outright. Dedicated AI observability companies raised over $300 million in VC, and many of them no longer exist independently. The pattern is clear. AI observability is becoming a feature of larger platforms, not a standalone category. Instrumentation against a proprietary SDK that got acquired means re-instrumenting. Instrumentation against OTel means swapping the exporter config and moving on.

```typescript (instrument-genai.ts)
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('ai-service', '1.0.0')

interface ChatCompletionResult {
  content: string
  model: string
  inputTokens: number
  outputTokens: number
  cachedTokens: number
  finishReason: string
  ttftMs: number
}

export async function tracedChatCompletion(
  prompt: string,
  model: string,
  onComplete: (prompt: string, model: string) => Promise<ChatCompletionResult>
): Promise<ChatCompletionResult> {
  return tracer.startActiveSpan(
    'chat',
    { kind: SpanKind.CLIENT },
    async (span) => {
      try {
        // GenAI semantic convention attributes
        span.setAttribute('gen_ai.system', 'anthropic')
        span.setAttribute('gen_ai.request.model', model)
        span.setAttribute('gen_ai.operation.name', 'chat')
        span.setAttribute('gen_ai.request.max_tokens', 4096)
        span.setAttribute('gen_ai.request.temperature', 0.7)

        const startTime = performance.now()
        const result = await onComplete(prompt, model)
        const durationMs = performance.now() - startTime

        // Token usage attributes
        span.setAttribute('gen_ai.usage.input_tokens', result.inputTokens)
        span.setAttribute('gen_ai.usage.output_tokens', result.outputTokens)
        span.setAttribute(
          'gen_ai.usage.cache_read.input_tokens',
          result.cachedTokens
        )

        // Response attributes
        span.setAttribute('gen_ai.response.model', result.model)
        span.setAttribute('gen_ai.response.finish_reasons', [result.finishReason])

        // Performance figures, recorded on the span in milliseconds.
        // The conventions define these three names as metric instruments,
        // so treat setting them as span attributes as a shortcut.
        span.setAttribute('gen_ai.client.operation.duration', durationMs)
        span.setAttribute('gen_ai.client.token.usage', result.inputTokens + result.outputTokens)
        span.setAttribute('gen_ai.server.time_to_first_token', result.ttftMs)

        span.setStatus({ code: SpanStatusCode.OK })
        return result
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : 'Unknown error',
        })
        throw error
      } finally {
        span.end()
      }
    }
  )
}
```

Every attribute follows the `gen_ai.*` convention namespace. Input, output, and cached tokens are tracked separately. Time to first token is distinct from total operation duration: for streaming responses, those are completely different numbers and optimizing one does not optimize the other. Provider-specific conventions extend the base schema for Anthropic, OpenAI, AWS Bedrock, and Azure OpenAI, so vendor-specific features like Claude's extended thinking or OpenAI's function calling get standardized attribute names. Adoption is already past the tipping point. Datadog, Langfuse, Splunk, and Grafana consume OTel GenAI conventions natively. OpenLLMetry (7.1K GitHub stars) provides auto-instrumentation for major LLM SDKs. Opik (19.4K stars, 40 million traces per day) supports OTel export. The conventions are moving from "experimental" toward "stable" in the OTel specification.

**Open Source AI Observability by GitHub Stars**
- Langfuse: 28140
- Opik: 19396
- Phoenix: 9885
- OpenLLMetry: 7149

## Use the platform you have

If you already run Datadog or Grafana, evaluate their AI monitoring features before adopting anything new. Datadog LLM Observability provides end-to-end LLM tracing with token usage and quality evaluation, and AI traces appear alongside your existing APM waterfalls.
Grafana AI Observability is OTel-native, so if you are already sending OTel GenAI spans, it just works with your existing alerting and on-call tooling. Splunk has agent-level span visibility. Dynatrace does causal analysis across the full stack from GPU utilization to semantic quality. The operational value of seeing AI telemetry in the dashboards your on-call engineers already know is massive. Every additional observability vendor adds a data pipeline, a dashboard surface, an alert config, a billing relationship, and a context switch for whoever is carrying the pager. For most teams, 80% coverage inside your existing platform beats 100% coverage from a separate vendor. The dedicated tools are genuinely better at two things: deep agent tracing (LangSmith, Braintrust, and Phoenix all had to extend or replace the standard tracing model to make agent workflows legible) and real-time semantic evaluation (Fiddler AI offers sub-100ms guardrails and raised $30 million in January 2026 to scale them). If either of those is a hard requirement, evaluate the dedicated tools. Otherwise, start with what you have.

**AI Observability Funding**
- LangChain: $125M
- Braintrust: $80M
- Arize AI: $70M
- Fiddler AI: $30M
- Portkey: $15M

> **WARNING: The integration tax is real**
> Teams adopt a dedicated AI observability tool, spend two months integrating it, then realize their existing platform shipped the same feature as a native integration. Before adding a vendor, check your current platform's release notes from the last six months. This space moves fast enough that the gap might already be closed.

## The playbook this month

This is the concrete sequence for instrumenting AI workloads from scratch. The order is intentional: each phase gives you data that makes the next phase easier.

- **Phase 1: Token + Cost**: Week 1-2 — Instrument token usage with OTel GenAI attributes. Calculate cost per request using your provider's pricing. Set alerts on cost anomalies.
- **Phase 2: Semantic Quality**: Week 3-4 — Add heuristic checks in the hot path. Add embedding similarity on critical paths. Reserve LLM-as-judge for CI/CD gates. - **Phase 3: Agent Tracing**: Month 2 — Only if you run agent workflows. Evaluate LangSmith, Braintrust, or Phoenix for cycle-aware tracing. - **Phase 4: Optimize**: Ongoing — Use token data to right-size models, tune prompt caching, shorten context windows. This is where the 30-50% cost savings live. The non-negotiable through all of this: use OTel GenAI conventions for your instrumentation layer. The cost of adopting them is a few extra attribute names on your spans. The cost of not adopting them is re-instrumenting everything when your vendor gets acquired, raises prices, or ships a breaking change to their proprietary SDK. In a market consolidating this fast, vendor-neutral instrumentation is not a nice-to-have. It is your only hedge. ## The real shift The observability stack was built to monitor systems that execute instructions. Same input, same output, same code path. Monitoring meant knowing if the system is running, how fast, and whether it is throwing errors. The golden signals answered those questions completely. That era is ending for a growing percentage of production workloads. This is not the first time the industry rebuilt observability for non-deterministic systems. ML model monitoring (concept drift detection, data drift analysis, output distribution monitoring) was well-understood by 2020. Tools like Evidently AI and Fiddler built production solutions for tabular and image models. What LLMs changed is the output modality. Evaluating whether a text response is "correct" is fundamentally harder than checking whether a classification probability drifted. The architecture is similar; the evaluation layer is not. A wave of funding and a dozen new tools in three years say the industry recognizes this gap. 
Six acquisitions suggest AI observability becomes a platform feature, not a standalone category. The OTel GenAI conventions suggest the interface will standardize even as backends consolidate. The platform engineer's job is to instrument against that stable interface now, and let the vendor landscape sort itself out beneath you.

## Resources & Further Reading

- OTel GenAI Semantic Conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/ - Vendor-neutral schema for AI telemetry instrumentation
- Braintrust: https://www.braintrust.dev/ - AI product evaluation, tracing, and prompt management platform
- Arize AI / Phoenix: https://github.com/Arize-ai/phoenix - Open-source AI observability with OTel-compatible tracing
- LangSmith: https://smith.langchain.com/ - LangChain's tracing and evaluation platform for LLM applications
- Langfuse: https://langfuse.com/ - Open-source LLM engineering platform (acquired by ClickHouse, Jan 2026)
- Portkey: https://portkey.ai/ - AI gateway with token-level observability and cost tracking
- Fiddler AI: https://www.fiddler.ai/ - Real-time AI model monitoring with sub-100ms guardrails
- OpenLLMetry: https://github.com/traceloop/openllmetry - OTel-native auto-instrumentation for LLM applications
- Opik: https://github.com/comet-ml/opik - Open-source LLM evaluation and tracing (40M traces/day)
- Datadog LLM Observability: https://docs.datadoghq.com/llm_observability/ - End-to-end LLM tracing integrated with Datadog APM
- Grafana AI Observability: https://grafana.com/solutions/ai-observability/ - AI monitoring built on the Grafana + OTel stack

---

# Vectors Are a Data Type

- **URL**: https://www.stxkxs.io/blog/vectors-are-a-data-type
- **Published**: 2026-03-04
- **Author**: Brandon Stokes
- **Category**: data
- **Tags**: vector-databases, pgvector, postgresql, pinecone, embeddings, ai-infrastructure, rag, semantic-search, weaviate, qdrant, milvus, chroma
- **Reading time**: 13 min

Over $470M in venture capital funded purpose-built
vector databases. Then every major general-purpose database added native vector support. The pattern is familiar — JSON, geospatial, full-text search all followed the same arc. Vectors are a data type, not a database category. ## Don't buy a vector database You are standing up a RAG pipeline. You have a PostgreSQL database with your documents, users, permissions, and audit logs. Now you need vector search. Someone on the team opens a Pinecone tab. Someone else says "what about Weaviate." Before you know it, you are evaluating five purpose-built vector databases, designing a sync pipeline to keep embeddings consistent with source data, and adding a new service to your on-call rotation. For what? To store a column of floats next to the data that generated them. Vectors are a data type, not a database category. Most teams reaching for a dedicated vector database are adding complexity they will regret within eighteen months. That is pattern recognition, not a hot take: the same arc played out with JSON, geospatial, and full-text search over the last fifteen years. Over $470M in venture capital went into purpose-built vector databases (totals as of March 2026): Pinecone at $138M, Weaviate at ~$117M, Zilliz at $113M, Qdrant at ~$87.5M, Chroma at $18M. That capital validated the use cases and pushed the technology forward. The technology they pioneered is now a native feature of every major database you already run. The window where purpose-built was the only option has closed. 
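To make "a column of floats next to the data that generated them" concrete, here is a brute-force sketch in plain TypeScript: filter rows by the columns you already have, then rank what is left by cosine distance. Everything in it (`DocRow`, `tenantId`, `nearestDocs`) is a hypothetical name for illustration. pgvector performs the equivalent ranking inside PostgreSQL with its `<=>` cosine distance operator and an index, so this in-memory version only illustrates the shape of the query, not how you would run it in production.

```typescript
// Hypothetical row shape: ordinary columns plus an embedding column.
interface DocRow {
  id: number
  tenantId: string
  title: string
  embedding: number[]
}

// Cosine distance = 1 - cosine similarity. Assumes non-zero vectors.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('dimension mismatch')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Filter on a regular column first, then rank by distance to the query vector.
export function nearestDocs(
  rows: DocRow[],
  tenantId: string,
  query: number[],
  k: number
): DocRow[] {
  return rows
    .filter((row) => row.tenantId === tenantId)
    .map((row) => ({ row, distance: cosineDistance(row.embedding, query) }))
    .sort((a, b) => a.distance - b.distance)
    .slice(0, k)
    .map((scored) => scored.row)
}
```

Because the permission filter and the similarity ranking run over the same rows, there is nothing to keep in sync. A separate vector store would have to replicate that filter.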
- **VC Funding**: $470M+ — Total venture capital raised by purpose-built vector database companies as of March 2026 (Pinecone, Weaviate, Qdrant, Chroma, Zilliz) - **pgvector Stars**: 21.5K+ — GitHub stars for pgvector, the PostgreSQL extension for vector similarity search - **RAG Under 10M**: ~90% — Estimated based on Pinecone usage data and industry surveys; most production RAG deployments operate well within pgvector's comfortable range - **DBs with Vectors**: 10+ — Major general-purpose databases that have added native vector support since 2021 > **INFO: What is a vector embedding?** > A vector embedding is a numerical representation of data (text, images, audio, code) as a list of floating-point numbers, typically 384 to 3072 dimensions. Embeddings are generated by neural network models trained to place semantically similar items close together in high-dimensional space. "How do I reset my password?" and "I forgot my login credentials" produce vectors that are mathematically close, even though they share no keywords. Vector search finds the nearest neighbors to a query vector, enabling semantic search, recommendation systems, RAG pipelines, and anomaly detection. ## The pattern we keep ignoring If you have been building long enough, you have seen this movie before. A new data type shows up. Someone builds a purpose-built database around it. VCs fund it. Enterprises adopt it. The databases everyone already runs absorb the capability, and the purpose-built thing consolidates to the high end. MongoDB launched in 2009 on the thesis that relational databases were a poor fit for document data. JSON documents need a purpose-built database. The argument was compelling: schema flexibility, horizontal scaling, developer ergonomics. MongoDB grew into a $30B company. Then PostgreSQL shipped jsonb in version 9.4 (2014): a first-class binary JSON type with GIN indexes, containment operators, and query performance that matched or exceeded MongoDB for most workloads. 
MySQL added a JSON type. SQL Server added JSON functions. Oracle followed. The specialized data type that justified a new database category got absorbed into general-purpose databases as a native feature.

MongoDB did not die. It still serves document-heavy workloads where its replication model, sharding, and developer tools provide genuine advantages. The argument that you need a separate database for JSON data became much harder to make when the database you already run handles JSON natively.

The pattern repeated with geospatial data. PostGIS turned PostgreSQL into a geospatial database. MySQL, SQL Server, and Oracle all added spatial types and indexes. Purpose-built geospatial databases still exist for specialized use cases; most applications store coordinates and run spatial queries in their existing database.

Full-text search followed the same arc. Sphinx spawned a category, Elasticsearch dominated it, then PostgreSQL tsvector, MySQL FULLTEXT, and built-in search capabilities in most databases reduced the need for a separate search engine for many workloads.

> **The best database for vectors is the one that already has your data.**

Vectors are following the same arc with one important difference. JSON parsing and geospatial queries are computationally straightforward for general-purpose query planners. ANN search is fundamentally different: it requires specialized index structures (HNSW, DiskANN, ScaNN) and returns approximate results, which introduces a recall-versus-latency tradeoff that PostgreSQL's query planner has no native concept of. The analogy holds at the market level (general-purpose databases absorb specialized data types) but the technical gap is wider with vectors than it was with JSON or geospatial. That gap is closing (pgvector's HNSW is genuinely good), but it has not fully closed.

This is pattern recognition, not a prediction. A specialized data type emerges. Startups build purpose-built databases around it. The data type matures. General-purpose databases absorb it.
The purpose-built databases consolidate to serve the high end. The data type becomes a feature, not a product. ## Pinecone's crossroads Pinecone is the most visible player in the purpose-built vector database market. Fully managed, zero-ops, $750M valuation, first-mover advantage in the enterprise segment. Its serverless architecture, launched in January 2024, decoupled storage from compute and reduced costs by up to 50x for sporadic workloads. For teams that want vector search without building vector infrastructure, Pinecone became the default choice. Pinecone is reportedly exploring acquisition, with Oracle, IBM, MongoDB, and Snowflake in talks. That tells you something. If MongoDB acquires Pinecone, it validates vectors-as-feature-of-a-broader-platform. If Snowflake acquires it, vector search becomes a component of the analytical data stack. Either way, the signal is that standalone vector databases are gravitating toward being part of larger data platforms rather than remaining independent products. The standalone vector database is becoming a feature of something bigger. The pricing question sharpens the argument. Pinecone's serverless pricing starts attractively: pay per query, no idle costs. At production scale with consistent traffic, the managed premium adds up. A team already running PostgreSQL on RDS can add pgvector at zero additional infrastructure cost. The vector search is a feature of the database they already pay for, monitor, back up, and have on-call rotations for. The total cost of ownership comparison favors the database you already operate, because the operational overhead is already amortized. > **INFO: Pinecone's real strengths** > Pinecone's serverless indexes, managed scaling, and zero-ops model are genuine advantages for teams without database expertise or operational capacity. With no existing database to extend, no DevOps team to manage infrastructure, and a need to ship vector search by next week, Pinecone is a legitimate choice. 
The argument here is about trajectory, not about Pinecone being a bad product. ## Purpose-built players The purpose-built vector database market is not a monolith. Each player has carved a real niche, and dismissing them as a category misses the genuine engineering behind each project. Weaviate ($200M valuation, Series C October 2025) positions itself as an AI-native database with built-in vectorization modules that connect directly to embedding APIs. Its hybrid search (combining dense vectors with BM25 sparse retrieval) outperforms pure vector search by 5-15% on retrieval benchmarks. Native multi-tenancy makes it a natural fit for SaaS platforms. Qdrant (written in Rust, ~$87.5M raised through its March 2026 Series B) consistently tops ANN-benchmarks for filtered search, the queries that actually matter in production where you filter by metadata first and then search. Named vectors allow storing multiple embeddings per document for multi-modal search. Milvus (backed by Zilliz, $113M raised) targets billion-scale distributed deployments with a disaggregated architecture that scales storage and compute independently. Chroma ($18M raised) optimized for developer experience, embeddable, runs in-process, and integrates with LangChain and LlamaIndex with minimal configuration. Each of these is actively shipping features, growing communities, and serving production workloads. This is not a market where everyone is losing. The competitive pressure is real and intensifying. pgvector performance has improved dramatically with HNSW indexes and quantization support. Every database these companies compete with now has native vector support, from PostgreSQL and MongoDB to Redis, Elasticsearch, and SQLite. **Vector database landscape** - Milvus: 44.5K stars - Qdrant: 31.6K stars - Chroma: 28.1K stars - pgvector: 21.5K stars - Weaviate: 16.3K stars - sqlite-vec: 7.7K stars ## pgvector ecosystem pgvector has become the quiet default for vector search in production. 
Every major managed PostgreSQL provider supports it: Amazon RDS, Supabase, Neon, Azure Database for PostgreSQL, Google Cloud SQL, AlloyDB. If you run PostgreSQL, you can add vector search with a single CREATE EXTENSION statement. No new service to deploy, no new connection to manage, no new monitoring to configure.

The performance story has evolved rapidly, with caveats worth stating upfront. Early pgvector releases (before 0.5) were a proof of concept: IVFFlat indexes only, limited recall, slow builds. Version 0.5 added HNSW indexes, 0.6 added parallel HNSW builds, and 0.7 added half-precision and binary vector types. pgvector 0.7+ is a production-grade vector search engine. HNSW index builds are CPU-intensive: a 5M vector collection can take 30+ minutes to build, and a plain CREATE INDEX blocks writes to the table for that whole time (CREATE INDEX CONCURRENTLY avoids the write lock at the cost of a slower build). If your embedding model changes and you need to rebuild every index, plan for significant downtime or a blue-green deployment. Memory consumption scales with the index: a 10M vector HNSW index with 1536 dimensions in float32 consumes roughly 80GB of RAM (the raw vectors alone are ~61GB, plus graph overhead at m=16), and only drops to the 25-30GB range once you apply quantization. These are real operational costs, and pretending they do not exist would undermine the argument.

HNSW indexes deliver 95%+ recall with single-digit millisecond latency at millions of vectors. Parallel index builds reduced build times by 10-30x. Quantization support (binary, half-precision) cut memory usage by 50-75% with minimal recall loss. The gap between pgvector and purpose-built alternatives has narrowed dramatically, and for workloads under 10 million vectors (which covers most production RAG pipelines), pgvector performance is sufficient.
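The ~61GB raw-vector figure above is plain arithmetic: vector count times dimensions times bytes per dimension. A minimal sketch of that calculation follows; the function and constant names are hypothetical, and it deliberately models only the raw payload, not HNSW graph links or PostgreSQL page overhead, which come on top.

```typescript
// Bytes per dimension for each storage format (binary uses one bit per dimension).
const BYTES_PER_DIMENSION = {
  float32: 4,
  float16: 2,
  binary: 1 / 8,
} as const

type VectorFormat = keyof typeof BYTES_PER_DIMENSION

// Raw vector payload in decimal gigabytes (1 GB = 1e9 bytes).
// Graph links, page padding, and tuple headers are NOT modeled here.
function rawVectorGB(count: number, dimensions: number, format: VectorFormat): number {
  return (count * dimensions * BYTES_PER_DIMENSION[format]) / 1e9
}

console.log(rawVectorGB(10_000_000, 1536, 'float32')) // 61.44
console.log(rawVectorGB(10_000_000, 1536, 'float16')) // 30.72
console.log(rawVectorGB(10_000_000, 1536, 'binary'))  // 1.92
```

Running the same function against your own row count and embedding dimension gives a floor for index memory before you pick an instance size.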
```sql (pgvector-semantic-search.sql)
-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Documents with embeddings alongside relational data
CREATE TABLE documents (
  id BIGSERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  content TEXT NOT NULL,
  embedding VECTOR(1536), -- OpenAI text-embedding-3-small
  team_id INT REFERENCES teams(id),
  created_at TIMESTAMPTZ DEFAULT NOW(),
  is_active BOOLEAN DEFAULT TRUE
);

-- HNSW index for fast approximate nearest neighbor search
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- The killer feature: vectors + relational data in one query
-- $1 = query embedding, $2 = organization id
SELECT d.id, d.title, d.content,
       1 - (d.embedding <=> $1::vector) AS similarity
FROM documents d
JOIN teams t ON d.team_id = t.id
WHERE d.is_active = TRUE
  AND t.org_id = $2
  AND d.created_at > NOW() - INTERVAL '90 days'
ORDER BY d.embedding <=> $1::vector
LIMIT 10;
```

That query is the architectural argument in SQL. Your similarity search JOINs against relational tables, filters by boolean flags, restricts by date ranges, and enforces tenant isolation, all in a single statement. In a purpose-built vector database, you search vectors in one system, then hydrate results from your relational database in a second query, maintaining consistency between two data stores. The single-query approach is simpler to build, simpler to debug, and eliminates an entire class of consistency bugs.

The operational argument compounds the advantage. Your team already knows PostgreSQL. Your monitoring already covers it. Your backup strategy already includes it. Your connection pooling is already configured. Your failover is already tested. Adding vector search to an existing PostgreSQL deployment is an incremental capability, not a new operational responsibility.

> **TIP: pgvectorscale for larger datasets**
> Timescale's pgvectorscale extension adds StreamingDiskANN indexes to PostgreSQL: a disk-based ANN algorithm that handles datasets larger than RAM efficiently. For workloads between 10M and 100M vectors where vanilla pgvector HNSW starts to strain memory, pgvectorscale extends the range without leaving the PostgreSQL ecosystem. It also adds statistical binary quantization (SBQ) for 20x memory reduction with minimal recall loss.

## Native vectors everywhere

Native vector support across general-purpose databases is a flood, not a trickle. MongoDB Atlas Vector Search integrates vector similarity directly into the aggregation pipeline. Redis added vector similarity search with HNSW and flat indexes. Elasticsearch shipped dense vector fields with knn search. Apache Cassandra added a native vector type with SAI indexes. SQLite got sqlite-vec for embedded vector search. SingleStore, CockroachDB, and Snowflake all added vector capabilities. The list keeps growing.

The architectural implication is straightforward. If you are already running MongoDB, use Atlas Vector Search: your vectors live in the same collections as your documents, query in the same aggregation pipeline, and scale with the same sharding configuration. Already on Redis? Use its vector similarity module for sub-millisecond latency on cached vector search alongside your session data and feature flags. The best vector search is the one that does not require a new database in your architecture.

```typescript (atlas-vector-search.ts)
import { MongoClient } from 'mongodb'

// generateEmbedding(text) and userQuery are assumed to be defined elsewhere:
// the first calls your embedding model, the second is the user's search text.

const client = new MongoClient(process.env.MONGODB_URI!)
const collection = client.db('app').collection('documents')

// Documents and embeddings coexist in the same collection
await collection.insertOne({
  title: 'Deployment runbook',
  content: 'Step 1: Verify health checks...',
  embedding: await generateEmbedding('Step 1: Verify health checks...'),
  teamId: 'platform-eng',
  updatedAt: new Date(),
})

// Vector search with metadata filtering in one query
// (teamId must be declared as a filter field in the vector index definition)
const results = await collection.aggregate([
  {
    $vectorSearch: {
      index: 'vector_index',
      path: 'embedding',
      queryVector: await generateEmbedding(userQuery),
      numCandidates: 100,
      limit: 10,
      filter: { teamId: 'platform-eng' },
    },
  },
  {
    $project: {
      title: 1,
      content: 1,
      score: { $meta: 'vectorSearchScore' },
    },
  },
]).toArray()
```

The pattern is consistent across all of these: vector search becomes a query capability of your existing data store rather than a reason to add a new one. Whichever database you already run, the vector index now lives next to the data it indexes.

## The sync tax

The monthly bill is the least of your problems when you adopt a purpose-built vector database. The real cost is the synchronization pipeline you now have to build and maintain forever.

Every document in your application lives in your primary database. A copy of its embedding lives in your vector database. When the document updates, the embedding must be regenerated and the vector store must be updated. When the document is deleted, the vector must be removed. When the document's access permissions change, the vector's metadata must reflect that change. You are now maintaining an event-driven system with queues, workers, retry logic, dead letter handling, and monitoring, all to keep two databases in sync. That is a meaningful engineering investment, measured in hundreds of hours per year.
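To make the update, delete, and permission cases concrete, here is a minimal sketch of a reconciliation check that compares the primary database's view of each document with what the vector store holds. Every name in it (`SourceDoc`, `VectorEntry`, `findDrift`, the `contentHash` and `allowedTeams` fields) is hypothetical and not taken from any vendor SDK; it assumes you can export IDs, a content hash, and the access list from both systems.

```typescript
// Hypothetical record shapes: neither comes from a real SDK.
interface SourceDoc {
  id: string
  contentHash: string    // hash of the text that should be embedded
  allowedTeams: string[] // current access-control list
}

interface VectorEntry {
  id: string
  contentHash: string    // hash of the text that was embedded
  allowedTeams: string[] // ACL copied into vector metadata at write time
}

type DriftKind = 'missing' | 'stale' | 'permissions' | 'orphaned'

interface Drift {
  kind: DriftKind
  id: string
}

// Order-insensitive comparison of two string lists.
function sameSet(a: string[], b: string[]): boolean {
  const sa = new Set(a)
  const sb = new Set(b)
  return sa.size === sb.size && Array.from(sa).every((x) => sb.has(x))
}

// Reports the first problem found per document, then any vectors
// whose source document no longer exists.
function findDrift(docs: SourceDoc[], vectors: VectorEntry[]): Drift[] {
  const docIds = new Set<string>()
  for (const d of docs) docIds.add(d.id)

  const vectorsById = new Map<string, VectorEntry>()
  for (const v of vectors) vectorsById.set(v.id, v)

  const drift: Drift[] = []
  for (const doc of docs) {
    const vec = vectorsById.get(doc.id)
    if (!vec) {
      drift.push({ kind: 'missing', id: doc.id })
    } else if (vec.contentHash !== doc.contentHash) {
      drift.push({ kind: 'stale', id: doc.id })
    } else if (!sameSet(vec.allowedTeams, doc.allowedTeams)) {
      drift.push({ kind: 'permissions', id: doc.id })
    }
  }
  for (const vec of vectors) {
    if (!docIds.has(vec.id)) drift.push({ kind: 'orphaned', id: vec.id })
  }
  return drift
}
```

A job like this is itself part of the sync tax: it exists only because the vectors and their source rows live in different systems. With the embedding stored as a column on the document row, none of the four drift kinds can occur.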
- **Pinecone (10M vectors)**: ~$70/mo — Pinecone serverless, 1536 dimensions, ~60-70GB storage (~$0.33/GB), $50/mo Standard minimum, moderate query volume - **pgvector on RDS**: ~$0/mo — Marginal cost when added to an existing RDS PostgreSQL instance with available capacity - **Sync Overhead**: 200-500 hrs — Estimated engineering hours to build and maintain a reliable vector sync pipeline per year - **TCO Difference**: 3-10x — Typical total cost of ownership premium for purpose-built vector DB vs extending existing infrastructure The bugs this sync pipeline creates are uniquely painful. Stale embeddings return results for content that was rewritten months ago. Deleted embeddings cause silent gaps in search results. Partial permission updates let users find documents they should no longer have access to. I have seen every one of these in production. They are invisible failures that erode trust in your search quality slowly enough that nobody notices until the damage is done. > **WARNING: The sync problem** > A user searches for "deployment process" and gets results from a runbook that was rewritten three months ago; the old embedding still points to the old content. A document is deleted but its vector remains, returning results for content that no longer exists. Partial permission updates let users find documents they should no longer have access to. Every one of these bugs exists because vectors live in a different database than the data they represent. Colocating vectors with source data eliminates the entire failure class. ## When purpose-built wins The argument for using your existing database has limits, and pretending otherwise would be intellectually dishonest. There are workloads where a purpose-built vector database is the right call. The first and most obvious case is raw scale. At 5 million vectors, pgvector is great. At 50 million, you start tuning HNSW parameters and thinking about partitioning. 
At 500 million, you are fighting PostgreSQL in ways it was not designed for. Milvus's disaggregated architecture, where storage, indexing, and query nodes scale independently, is purpose-built for this regime. You can throw compute at indexing during bulk loads and scale it back for steady-state queries. PostgreSQL does not give you that lever. At billion-vector scale, the purpose-built databases are architecturally different in ways that matter. The second case is filtered search at scale. This is the query pattern that actually matters in production: "find me the 10 most similar documents, but only within this tenant, created after this date, with these tags." Qdrant has invested heavily in this exact workload; their payload indexing and filtering engine is co-designed with their HNSW implementation. pgvector handles filtered search fine at moderate scale, but when you are filtering down to 0.1% of a 100M vector collection and still need sub-10ms latency, the purpose-built filtering architecture wins. ANN benchmarks without filters are almost meaningless for production workloads. The third case is multi-modal and multi-vector search. When you need to store separate embeddings for the title, body, and image of each document and search across them with different weights, Qdrant's named vectors and Weaviate's module system give you primitives that are clunky to replicate in pgvector. You can do it with multiple columns and application-level score fusion, but at that point you are building your own vector database on top of PostgreSQL. The fourth case is teams without an existing database to extend. For a standalone ML service or a greenfield project where there is no relational data, the "use what you have" argument does not apply because you do not have anything. In that scenario, a purpose-built vector database gives you the fastest path to production. 
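The application-level score fusion mentioned for the multi-vector case can be sketched as a weighted sum of per-field cosine similarities. This is an illustration under stated assumptions, not how Qdrant or Weaviate implement it: the type and function names are hypothetical, and it assumes the candidate documents (with one embedding per field) have already been fetched from the database.

```typescript
type Vec = number[]

// Hypothetical shape: one embedding per named field (title, body, image, ...).
interface MultiVectorDoc {
  id: string
  vectors: Record<string, Vec>
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: Vec, b: Vec): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB)
}

// Weighted-sum fusion: score = sum over fields of weight * cosine(query, doc).
// Fields missing from either the query or the document contribute nothing.
function fuseScores(
  query: Record<string, Vec>,
  docs: MultiVectorDoc[],
  weights: Record<string, number>,
  k: number,
): { id: string; score: number }[] {
  return docs
    .map((doc) => {
      let score = 0
      for (const field of Object.keys(weights)) {
        const q = query[field]
        const d = doc.vectors[field]
        if (q && d) score += weights[field] * cosine(q, d)
      }
      return { id: doc.id, score }
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
}
```

Even this small version has to decide how to weight fields, how to treat missing embeddings, and how many candidates to pull per field before fusing, which is the work the purpose-built engines package up.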
- Billion+ vector scale, where distributed architectures scale storage, indexing, and query nodes independently in ways a single PostgreSQL instance cannot
- Filtered search at 100M+ scale where purpose-built payload indexing delivers sub-10ms latency on highly selective filters
- Multi-modal search across text, image, and audio embeddings with per-vector-type weighting and cross-modal retrieval
- Multi-tenant SaaS with per-tenant isolation where purpose-built solutions offer native tenant partitioning with separate indexes
- Greenfield projects or standalone ML services with no existing relational data to colocate with

The counter-argument I find most compelling is the Kubernetes parallel: containers were "just a process type" that Linux could run natively, but the operational complexity of running them well at scale justified dedicated orchestration infrastructure. Purpose-built vector databases are the Kubernetes of embeddings: technically unnecessary for small workloads, operationally essential at the high end. The analogy breaks down because most teams are not at the high end.

The cases above are the minority. In my experience most teams running a purpose-built vector database have under 10 million vectors, moderate query volumes, and existing PostgreSQL or MongoDB instances that could handle the workload with a single extension or index. They chose a purpose-built database because it was the default recommendation in every tutorial, not because their workload demanded it.

The convergence is happening from both sides. Weaviate and Qdrant are adding SQL-like filtering, aggregation pipelines, and relational-style query capabilities. PostgreSQL, MongoDB, and Redis are improving ANN performance, adding quantization, and optimizing for vector workloads. The purpose-built databases are becoming more general. The general-purpose databases are becoming better at vectors. The gap narrows with every release.
> **Key Point:** Most production workloads do not need a separate database for vectors at all.

## pgvector vs Pinecone at scale

These are the two options most teams actually evaluate. Here are the operational trade-offs when both are pushed past the comfortable range.

pgvector's scaling story is PostgreSQL's scaling story, with all the baggage that implies. HNSW indexes need to fit in memory to perform well. A 10M vector collection with 1536 dimensions and m=16 (float32) consumes roughly 80-120GB of RAM just for the index. That overflows a db.r6g.2xlarge (64 GiB), and doubling the collection requires doubling the instance. Index builds are CPU-intensive and, unless you use CREATE INDEX CONCURRENTLY, block writes. Parallel builds helped enormously, but a 50M vector HNSW build still takes hours on moderate hardware. You can partition by tenant or time range to keep individual indexes manageable, but now you are managing partition routing in your application.

The thing that makes pgvector great (it is PostgreSQL) is also what limits it. You get MVCC, WAL, streaming replication, point-in-time recovery, and decades of operational tooling. You also get vacuum pressure from embedding updates, WAL amplification from large vector writes, and connection limits that were designed for OLTP, not vector search.

Pinecone's scaling story is abstracted behind a managed service, which is both its strength and its weakness. You do not manage indexes, tune HNSW parameters, or worry about memory pressure; Pinecone handles it. Their serverless architecture genuinely solves the cold-start problem for sporadic workloads.

Abstraction has costs. You cannot tune the underlying ANN algorithm. You cannot co-locate your vectors with relational data for single-query retrieval. You are subject to Pinecone's metadata filtering limitations. At the time of writing, metadata values have size limits and the filtering syntax is less expressive than SQL WHERE clauses.
When something goes wrong, your debugging surface is their status page and support tickets, not your own database logs and EXPLAIN plans. The operational trade-off comes down to this: pgvector gives you full control and zero marginal cost, at the price of operational responsibility you already carry. Pinecone gives you zero operational burden, at the price of control, cost at scale, and a sync pipeline. For teams already running PostgreSQL, the incremental operational burden of pgvector is genuinely small. For teams with no database expertise, Pinecone's managed experience is genuinely valuable. The honest answer depends on your team more than your data. ## Graph-enhanced retrieval The next evolution of retrieval combines vector similarity with graph traversal. Microsoft's GraphRAG (arXiv 2404.16130) demonstrated that combining embedding-based search with knowledge graph relationships beat naive vector RAG on 72-83% of head-to-head comparisons for comprehensiveness and 62-82% for diversity, measured on global sensemaking questions over million-token corpora. Instead of just finding the documents most similar to a query, graph-enhanced retrieval also traverses relationships: finding related concepts, connected entities, and contextual chains that pure vector similarity misses. This architectural direction strongly favors colocation. A graph-enhanced RAG pipeline needs vector similarity search, graph traversal, and relational filtering in a single query. If your vectors are in Pinecone, your graph is in Neo4j, and your relational data is in PostgreSQL, each query requires three round-trips and application-level result merging. If all three live in PostgreSQL (vectors via pgvector, graphs via recursive CTEs or Apache AGE, relational data natively), the entire retrieval pipeline executes in a single query. 
```sql (graph-enhanced-rag.sql)
-- Hybrid retrieval: vector similarity + relational graph traversal
-- $1 = query embedding, $2 = team id
WITH semantic_matches AS (
  -- Step 1: Find semantically similar documents
  SELECT id, title, content, embedding,
         1 - (embedding <=> $1::vector) AS similarity
  FROM documents
  WHERE team_id = $2
    AND is_active = TRUE
  ORDER BY embedding <=> $1::vector
  LIMIT 20
),
related_docs AS (
  -- Step 2: Traverse document relationships for connected context
  SELECT DISTINCT d.id, d.title, d.content, d.embedding,
         0.5 AS similarity -- fixed score assigned to linked docs
  FROM semantic_matches sm
  JOIN document_links dl ON sm.id = dl.source_id
  JOIN documents d ON dl.target_id = d.id
  WHERE d.is_active = TRUE
    AND d.team_id = $2 -- keep linked docs inside the same tenant
    AND d.id NOT IN (SELECT id FROM semantic_matches)
)
-- Step 3: Merge and rank all results
SELECT id, title, content, MAX(similarity) AS score
FROM (
  SELECT * FROM semantic_matches
  UNION ALL
  SELECT * FROM related_docs
) combined
GROUP BY id, title, content
ORDER BY score DESC
LIMIT 10;
```

> **EXAMPLE: Microsoft GraphRAG**
> Microsoft's GraphRAG research (arXiv 2404.16130, "From Local to Global") reported that graph-enhanced retrieval won 72-83% of head-to-head comparisons against naive vector RAG on comprehensiveness and 62-82% on diversity for global sensemaking queries; naive vector search still won on directness, which is the point. The approach builds a knowledge graph from documents using LLM-extracted entities and relationships, then combines graph community summaries with vector similarity for retrieval. Vector search finds what is semantically similar; graph traversal finds what is structurally connected. Complex questions often need both.

## Decision framework

How to think about this decision in practice:

### Under 10M vectors

Use what you have. pgvector, Atlas Vector Search, Redis vector similarity, whichever database is already in your stack. HNSW indexes on pgvector deliver single-digit millisecond latency at this scale.
The operational simplicity of not adding a new database far outweighs any performance advantage from a purpose-built solution. This covers the vast majority of production RAG pipelines, semantic search implementations, and recommendation systems. ### 10M-100M vectors Optimize your existing database before reaching for a new one. pgvectorscale with StreamingDiskANN handles datasets larger than RAM. HNSW parameter tuning (increase m and ef_construction) improves recall at the cost of memory. Quantization (binary, scalar, product) reduces memory footprint by 50-75% with minimal recall degradation. Partitioning by tenant or time range keeps individual index sizes manageable. Only after exhausting these optimizations should you evaluate purpose-built alternatives. ### 100M+ vectors Benchmark purpose-built against optimized existing. At this scale, Milvus's disaggregated architecture, Qdrant's on-disk quantization, or Weaviate's distributed deployment may deliver meaningful advantages in latency, throughput, or cost efficiency. Verify with your data and your query patterns. Synthetic benchmarks rarely reflect production workloads. Some organizations run pgvectorscale at 100M+ vectors in production. Others genuinely need the purpose-built option. The data will tell you which camp you are in. - **Under 10M**: Use existing DB — pgvector, Atlas Vector Search, or Redis; single-digit ms latency, zero additional infrastructure - **10M-100M**: Optimize first — pgvectorscale, HNSW tuning, quantization, partitioning; extend before replacing - **100M+**: Benchmark both — Purpose-built may win on latency/throughput; verify against your actual workload, not synthetic benchmarks - **Any scale**: Avoid sync tax — Every separate vector DB adds sync pipelines, consistency bugs, and operational overhead; factor this into the decision ## Vectors as a data type Time-series data spawned InfluxDB, TimescaleDB, and the TSDB category. Geospatial data spawned PostGIS and spatial databases. 
In each case, the specialized data type matured, general-purpose databases absorbed it, and the purpose-built databases consolidated to serve the high end. The data type became a feature, not a product. Purpose-built vector databases will not disappear. Milvus will serve billion-scale workloads. Qdrant will serve latency-critical deployments. Weaviate will serve multi-tenant SaaS platforms. These are real niches with real engineering requirements. For most production workloads (the 90% of RAG pipelines running under 10 million vectors, the semantic search features embedded in SaaS products, the recommendation systems that need vector similarity alongside relational data), vectors belong in the database you already operate. > **Vectors are following the same path as JSON, geospatial, and full-text search. The hype cycle created a database category. Maturity is dissolving it back into a data type.** > **Key Point:** The near-half-billion-dollar bet on purpose-built vector databases was early, not wrong. Purpose-built databases proved the market, validated the use cases, and pushed the technology forward. The technology they pioneered is now a feature of every major database. For most teams, the right vector database is the one they already run. 
## Resources & Further Reading - pgvector: https://github.com/pgvector/pgvector - Open-source vector similarity search for PostgreSQL - pgvectorscale: https://github.com/timescale/pgvectorscale - Timescale's PostgreSQL extension with StreamingDiskANN for datasets larger than RAM - Pinecone: https://www.pinecone.io/ - Fully managed vector database, serverless architecture - Weaviate: https://weaviate.io/ - Open-source AI-native vector database with hybrid search - Qdrant: https://qdrant.tech/ - High-performance vector search engine written in Rust - Milvus: https://milvus.io/ - Open-source vector database for billion-scale deployments - Chroma: https://www.trychroma.com/ - Embeddable open-source vector database for AI applications - MongoDB Atlas Vector Search: https://www.mongodb.com/products/platform/atlas-vector-search - Vector search integrated into MongoDB aggregation pipeline - Microsoft GraphRAG: https://github.com/microsoft/graphrag - Graph-enhanced retrieval augmented generation - ANN Benchmarks: https://ann-benchmarks.com/ - Benchmarking approximate nearest neighbor algorithms - VectorDBBench: https://github.com/zilliztech/VectorDBBench - Open-source vector database benchmark tool --- # Your Dependencies Have Dependencies - **URL**: https://www.stxkxs.io/blog/supply-chain-security-ai-era - **Published**: 2026-02-26 - **Author**: Brandon Stokes - **Category**: engineering - **Tags**: security, supply-chain, sbom, slsa, sigstore, npm, open-source, devsecops, platform-engineering, ai-security - **Reading time**: 14 min Open source malware surpassed 1.2 million packages. Vulnerabilities per codebase doubled. A compromised npm token turned an AI coding assistant into a supply chain weapon. The defense stack—SBOM, SLSA, Sigstore—is mature. Adoption is not. Here's what platform engineers need to do now. 
## The Cline incident

On February 17, 2026, at 3:26 AM Pacific, an attacker used a compromised npm publish token to push version 2.3.0 of the Cline CLI, a popular open-source AI coding assistant. The payload was a single line added to package.json: a postinstall script that silently ran `npm install -g openclaw@latest` on every machine that updated. For eight hours, roughly 4,000 developers installed a globally installed package they never asked for.

The compromised version was deprecated by 11:30 AM, the token was revoked, and Cline shipped a clean 2.4.0 by evening. No data exfiltration was confirmed. The attack vector (injecting into an AI coding tool's dependency chain to distribute a different AI tool) is a perfect snapshot of where supply chain security stands in 2026: the tools we use to write code are now attack surfaces themselves.

Two trends collided to make this inevitable. AI-assisted development has pushed open source consumption to 9.8 trillion downloads per year while generating code that contains design flaws or known vulnerabilities 62% of the time. Meanwhile, attackers industrialized: 454,648 new malicious packages last year, a 75% jump. The Cline incident was the median case, not an outlier.

## AI inflates the attack surface

The numbers back this up. Black Duck's 2026 OSSRA report analyzed 947 codebases and found mean vulnerabilities per codebase jumped 107%, a doubling in a single year. Component counts rose 30% and file counts rose 74%. The report correlates this growth with increased AI-assisted development, though correlation during a period of rapid AI adoption is not conclusive causation. Codebases are growing faster than security review can keep up, and AI-generated code is a significant contributor to that velocity.
- **Malicious Packages**: 1.23M+ — Cumulative OSS malware packages detected (Sonatype 2026) - **Vuln Growth**: +107% — Mean vulnerabilities per codebase YoY (Black Duck OSSRA) - **OSS Downloads**: 9.8T — Annual downloads across top 4 registries, up 67% YoY - **AI Code Insecure**: 62% — AI-generated code containing design flaws or known vulns **AI-generated code security by the numbers** - AI code with flaws: 62% - Orgs checking AI code for security: 76% - Orgs doing full evaluation: 24% - Codebases with license conflicts: 68% The math is straightforward. When 41% of new code is AI-generated and 62% of that code contains security flaws, the volume of vulnerabilities entering codebases changes by an order of magnitude. AI models learn from publicly available repositories, many of which contain insecure implementations. The models do not distinguish between secure and insecure patterns; both are statistically valid completions. They do not understand an application's threat model, internal security standards, or compliance requirements. The result is missing controls, logic flaws, and inconsistent security patterns that traditional static analysis struggles to catch. > **WARNING: The governance gap** > Only 24% of organizations perform comprehensive IP, license, security, and quality evaluations for AI-generated code (Black Duck). The remaining 76% either spot-check for security only or perform no evaluation at all. Meanwhile, open source licensing conflicts hit an all-time high at 68% of audited codebases, a 12-point jump in a single year, as AI tools pull in dependencies without license awareness. The dependency problem compounds this. AI coding assistants routinely suggest packages they were trained on, pulling in transitive dependency trees that developers never manually evaluated. A single `npm install` can add hundreds of packages to a project. When AI is suggesting the install, the developer's usual heuristic ("have I heard of this package? 
does it look maintained?") is bypassed entirely. The AI suggested it, so it must be fine.

## Attacks are now industrialized

Supply chain attacks have evolved from opportunistic typosquatting into industrialized, multi-stage operations. The Sonatype 2026 State of the Software Supply Chain report documents this shift in granular detail: threats have moved from "spam and stunts" to "sustained, industrialized campaigns," many state-sponsored. Over 99% of detected open source malware targets npm, making the JavaScript ecosystem the primary battleground.

### Typosquatting and dependency confusion

The simplest attacks remain effective. Typosquatting (publishing packages with names similar to popular ones, like `lodahs` instead of `lodash`) accounts for a significant share of malicious package detections. Dependency confusion exploits the gap between public and private registries: if an organization uses an internal package called `@company/auth-utils` and has never claimed the `@company` scope on the public registry, an attacker can register that scope and publish `@company/auth-utils` publicly with a higher version number. Build tools that check public registries first, or that pick the highest version across registries, will pull the attacker's version. Unscoped internal names are exposed the same way, with no scope to register first. Both vectors are amplified by AI coding tools that may suggest the typosquatted or confused package name in completions.

### Token compromise and maintainer takeover

The Cline attack used a compromised npm publish token, a credential that allows publishing new versions of a package. This is not new; the ua-parser-js incident in 2021 and the event-stream compromise in 2018 used similar vectors. What is new is the scale of exposed credentials. Sonatype reports that exposed development secrets grew 11% across major repositories last year. When an attacker obtains a publish token for a popular package, the blast radius is every downstream consumer that runs `npm install` or `npm update` before the compromise is detected.
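The typosquatting vector described above is cheap to screen for at review time. What follows is a minimal sketch, not how any particular scanner works: the `POPULAR` watchlist and the distance threshold of 2 are illustrative assumptions. It flags any requested package whose name sits within a couple of edits of a well-known one.

```python
# Hypothetical watchlist; a real one would come from registry download stats.
POPULAR = {"lodash", "express", "react", "axios"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def likely_typosquat(name: str, popular=POPULAR, max_distance: int = 2):
    """Return the popular package that `name` resembles, or None."""
    if name in popular:
        return None  # exact match is the real package
    for known in sorted(popular):
        if edit_distance(name, known) <= max_distance:
            return known
    return None


if __name__ == "__main__":
    for candidate in ("lodahs", "lodash", "left-pad"):
        print(candidate, "->", likely_typosquat(candidate))
```

A threshold this loose will also flag legitimate neighbors such as `preact` next to `react`, so treat hits as prompts for a human look, not verdicts.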
### Self-replicating malware: Shai-Hulud The most concerning development of the past year was Shai-Hulud, discovered in September 2025: the first known self-replicating npm malware. The mechanism was novel. Using stolen npm tokens, the worm enumerated the compromised maintainer's existing packages, injected a malicious postinstall (bundle.js) into them, and republished new versions to propagate. ReversingLabs identified patient zero as rxnt-authentication@0.0.3 (published September 14, 2025); the exact initial compromise was never confirmed, but the leading suspected vector was a credential-harvesting phishing campaign spoofing npm MFA-update notices that stole maintainer tokens, not a @testing-library typosquat. Shai-Hulud did not need to compromise a popular package directly. It turned every infected developer into an unwitting distribution node, spreading to their downstream consumers through packages they published normally. ### State-sponsored campaigns The Lazarus Group, linked to North Korea, has evolved from simple droppers and crypto miners to multi-stage campaigns targeting developer infrastructure specifically. Group-IB's 2026 High-Tech Crime Trends report documents the pattern: initial access via a malicious npm package or compromised GitHub Action, followed by credential theft from environment variables and CI secrets, then persistent access via modified build scripts that survive dependency updates. The Contagious Interview campaign (tracked by Palo Alto Unit 42) targeted developers through fake job interviews that installed backdoored packages, a social engineering vector that specifically exploits the npm ecosystem's trust model. These are not smash-and-grab operations. They establish long-term persistence in CI/CD pipelines and use legitimate development tools as cover. Group-IB identifies supply chain attacks as the top global cyber threat, with state-linked actors industrializing the approach. 
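Both the Cline payload and Shai-Hulud rode on npm lifecycle scripts, so it helps to know which installed packages are allowed to run code at install time. A rough sketch, assuming a `package-lock.json` in lockfile version 2 or 3, where entries under `packages` carry a `hasInstallScript` flag; the sample lockfile content below is made up for illustration.

```python
import json


def packages_with_install_scripts(lockfile_text: str) -> list[str]:
    """List name@version for every locked package that declares install scripts."""
    lock = json.loads(lockfile_text)
    flagged = []
    for path, meta in lock.get("packages", {}).items():
        if not path:
            continue  # the empty key is the root project itself
        if meta.get("hasInstallScript"):
            # Keys look like "node_modules/a/node_modules/@scope/b"; keep the last package.
            name = path.rsplit("node_modules/", 1)[-1]
            flagged.append(f"{name}@{meta.get('version', '?')}")
    return sorted(flagged)


if __name__ == "__main__":
    # Illustrative lockfile fragment, not taken from a real project.
    sample = json.dumps({
        "lockfileVersion": 3,
        "packages": {
            "": {"name": "app"},
            "node_modules/esbuild": {"version": "0.21.5", "hasInstallScript": True},
            "node_modules/lodash": {"version": "4.17.21"},
        },
    })
    print(packages_with_install_scripts(sample))
```

Diffing this list between two commits is one way to notice when a dependency bump quietly introduces a new install-time script.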
> **Key Point:** Supply chain attacks have evolved from opportunistic typosquatting to autonomous self-replicating malware and state-sponsored multi-stage campaigns. A single upstream compromise can now cascade across entire industries. ## Three frameworks close the gaps Three complementary frameworks have matured simultaneously to address different layers of supply chain security. SBOMs provide transparency into what is inside your software. SLSA provides provenance: proof of how and where your software was built. Sigstore provides verification: cryptographic proof that artifacts have not been tampered with. Each covers a distinct layer (inventory, provenance, verification), and deploying one without the others leaves gaps. ### SBOMs: knowing what you ship A Software Bill of Materials is a machine-readable inventory of every component in a software artifact: direct dependencies, transitive dependencies, versions, licenses, and known vulnerabilities. Two competing formats dominate: SPDX (Linux Foundation) and CycloneDX (OWASP). CISA's updated guidance now requires machine-readable formats, and leading package ecosystems are promoting SBOMs to first-class citizens integrated natively into build tools. The conversation has shifted from "can you produce an SBOM?" to "is your SBOM accurate and actionable?" Generation is largely solved. Tools like Syft, Trivy, and cdxgen can produce SBOMs from container images, file systems, and build manifests. The hard problems are quality: does the SBOM capture the full transitive dependency tree? Are vulnerability mappings current? Does it include build-time dependencies that do not ship in the final artifact but could introduce compromise during the build? For AI-generated code, the question is even harder: which AI-suggested dependencies were evaluated by a human and which were accepted without review? 
The first time I ran cdxgen on a real production codebase (not a demo project, a real one with three years of accumulated dependencies) the output was sobering. Hundreds of packages deep, maintained by anonymous handles, some with no commits in two years. You go from "we probably have this under control" to "we have no idea what we're running" in about thirty seconds.

```json (sbom-cyclonedx.json)
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "components": [
    {
      "type": "library",
      "bom-ref": "pkg:npm/express@4.21.2",
      "name": "express",
      "version": "4.21.2",
      "purl": "pkg:npm/express@4.21.2",
      "evidence": {
        "identity": {
          "field": "purl",
          "confidence": 1,
          "methods": [
            {
              "technique": "manifest-analysis",
              "confidence": 1,
              "value": "package-lock.json"
            }
          ]
        }
      }
    }
  ],
  "vulnerabilities": [
    {
      "id": "CVE-2024-XXXXX",
      "ratings": [{ "severity": "high" }],
      "affects": [{ "ref": "pkg:npm/express@4.21.2" }]
    }
  ]
}
```

### SLSA: proving how it was built

Supply-chain Levels for Software Artifacts (SLSA, pronounced "salsa") is a framework for ensuring the integrity of software artifacts throughout the supply chain. Released as version 1.2 by the Linux Foundation in late 2025, the SLSA Build Track defines four levels (L0 through L3) of increasing assurance. Level 0 is the baseline, representing the absence of SLSA. Level 1 requires a consistent build process with provenance describing how the artifact was built. Level 2 requires a hosted build platform running on dedicated infrastructure (not an individual workstation) with digitally signed provenance. Level 3 requires a hardened build platform that prevents builds from influencing one another and keeps signing secrets inaccessible to user-defined build steps, producing non-falsifiable provenance. There is no Level 4 in the current spec; the two-person review and hermetic, reproducible build requirements were part of the obsolete SLSA v0.1 model.
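Levels 2 and 3 only help if something downstream reads the provenance and refuses builders it does not recognize. Here is a hedged sketch of that policy check against a decoded SLSA v1 provenance statement. It assumes the signature on the envelope was already verified elsewhere (for example by Cosign), and the builder ID in the usage example is a hypothetical placeholder.

```python
SLSA_V1 = "https://slsa.dev/provenance/v1"


def provenance_problems(statement: dict, artifact_sha256: str,
                        trusted_builders: set) -> list:
    """Return a list of policy violations; an empty list means the statement passes."""
    problems = []
    if statement.get("predicateType") != SLSA_V1:
        problems.append("predicateType is not SLSA v1 provenance")

    # The statement must be about the artifact we are actually deploying.
    subject_digests = {s.get("digest", {}).get("sha256")
                       for s in statement.get("subject", [])}
    if artifact_sha256 not in subject_digests:
        problems.append("artifact digest missing from subject")

    # The build must have run on a platform we trust, not a laptop.
    builder_id = (statement.get("predicate", {})
                  .get("runDetails", {})
                  .get("builder", {})
                  .get("id"))
    if builder_id not in trusted_builders:
        problems.append(f"untrusted builder: {builder_id!r}")
    return problems


if __name__ == "__main__":
    trusted = {"https://ci.example.com/builders/release"}  # hypothetical builder ID
    statement = {
        "predicateType": SLSA_V1,
        "subject": [{"name": "myapp.tgz", "digest": {"sha256": "abc123"}}],
        "predicate": {"runDetails": {"builder": {"id": "https://ci.example.com/builders/release"}}},
    }
    print(provenance_problems(statement, "abc123", trusted))
```

A package published straight from a workstation with a stolen token has no statement to feed this check at all, which is itself the rejection signal.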
The practical value of SLSA is provenance attestation: a signed, verifiable record of what source code was used, what build system compiled it, and what inputs went into the build. When the Cline attack happened, the compromised package had no build provenance. It was published directly from a local machine using a stolen token. A SLSA Level 2+ requirement would have caught this: the build would need to originate from a CI system (like GitHub Actions), and the provenance record would show the source commit, build logs, and builder identity. If the provenance does not match the expected CI pipeline, the artifact is rejected. ### Sigstore: verifying the signature Sigstore is the "Let's Encrypt of code signing": a set of free, open-source tools that make cryptographic signing and verification accessible without managing keys. The core components are Cosign (signs and verifies container images and blobs), Fulcio (issues short-lived certificates tied to OIDC identity), and Rekor (an immutable transparency log of all signing events). The breakthrough is keyless signing. Instead of managing long-lived private keys that can be stolen, Sigstore ties signing to identity providers (GitHub, Google, Microsoft) and issues certificates that expire in minutes. The signing event is recorded in Rekor's transparency log, creating an auditable trail. 
```bash (sigstore-verify.sh)
# Sign a container image (keyless - uses OIDC identity)
cosign sign ghcr.io/myorg/myapp:v1.2.3

# Verify the signature matches expected identity
cosign verify ghcr.io/myorg/myapp:v1.2.3 \
  --certificate-identity="https://github.com/myorg/myapp/.github/workflows/release.yml@refs/tags/v1.2.3" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com"

# Verify SLSA provenance attestation
cosign verify-attestation ghcr.io/myorg/myapp:v1.2.3 \
  --type slsaprovenance \
  --certificate-identity-regexp="^https://github.com/slsa-framework/slsa-github-generator/" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com"
```

> **TIP: The three together**
> SBOM tells you what components are in your software. SLSA tells you the software was built securely from the expected source. Sigstore tells you the artifact has not been tampered with since it was built. Deploying one without the others leaves gaps: an SBOM without provenance can be fabricated. Provenance without signatures can be forged. Signatures without an SBOM tell you nothing about what you signed.

## Lock the pipeline first

Frameworks and standards are useful, but platform engineers need concrete actions they can implement in existing CI/CD pipelines. The following is a prioritized checklist, ordered by impact-to-effort ratio, for teams that have not yet invested in supply chain security.

### Lock everything down

Commit lockfiles (`package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Cargo.lock`, `go.sum`). Use `npm ci` instead of `npm install` in CI. It installs exactly what the lockfile specifies and fails if there is a mismatch. Pin GitHub Actions to full commit SHAs instead of tags (`uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29` instead of `uses: actions/checkout@v4`). Tags are mutable. An attacker who compromises the action can move the tag to a malicious commit. SHA pinning is free and eliminates this vector entirely.
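Pinning is easy to let slip as workflows accumulate, so it is worth checking mechanically. A small sketch, assuming workflow files are read as plain text and that "pinned" means a reference ending in a 40-character hex SHA. Local actions (`./...`) and `docker://` references are skipped here for simplicity, which is a choice of this example and not a recommendation.

```python
import re

# Capture the value of every `uses:` key, ignoring trailing comments.
USES = re.compile(r"^[ \t]*(?:-[ \t]*)?uses:[ \t]*(\S+)", re.MULTILINE)
FULL_SHA = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references that are not pinned to a full commit SHA."""
    unpinned = []
    for ref in USES.findall(workflow_text):
        ref = ref.strip("'\"")
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # out of scope for this sketch
        if not FULL_SHA.search(ref):
            unpinned.append(ref)
    return unpinned


if __name__ == "__main__":
    workflow = """\
steps:
  - uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29 # v4.1.6
  - uses: actions/setup-node@v4
  - uses: ./.github/actions/local-step
"""
    print(unpinned_actions(workflow))
```

Running this over `.github/workflows/*.yml` in CI and failing on a non-empty result keeps new tag references from creeping back in.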
One critical step most guides skip: verify the SHA before you pin it. Check that the commit SHA corresponds to the tagged release you expect by comparing against the action repository's release page. Pinning a SHA without verification locks in whatever that commit contains, potentially a compromised version.

```yaml (.github/workflows/ci.yml)
name: CI
on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read # Principle of least privilege
    steps:
      # Pin actions to full SHA, not tags
      - uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29 # v4.1.6
      - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4.4.0
        with:
          node-version: '22'
          cache: 'npm'
      # Use npm ci, not npm install
      - run: npm ci --ignore-scripts # Skip postinstall scripts
      - run: npm audit --audit-level=high
      - run: npm run build
      - run: npm test
```

### Audit dependencies continuously

Enable GitHub's Dependabot or Renovate for automated dependency updates with security alerts. Run `npm audit` or `yarn audit` in CI and fail builds on high-severity vulnerabilities. Use OpenSSF Scorecard to evaluate the security posture of critical dependencies before adopting them. It scans over 1 million projects weekly and checks for branch protection, code review practices, CI/CD configuration, signed releases, and vulnerability disclosure policies. For high-value projects, add Socket.dev or Snyk to detect malicious packages before they reach production.

When you first wire up `npm audit --audit-level=high` as a CI gate on a mature project, expect a brutal first week. You will discover vulnerabilities that have been sitting in your transitive dependency tree for years, flagged as "high" but buried six levels deep in something nobody touches. The temptation is to immediately allowlist everything and move on. Resist that. Triage ruthlessly. Most of the noise is in dev dependencies that never hit production, but actually fix the ones that matter.
Once the initial backlog clears, the gate becomes the single best signal you have for catching new risks early.

### Sign and verify artifacts

Use Sigstore's Cosign to sign container images in CI. Configure admission controllers (Kyverno or OPA Gatekeeper) to reject unsigned images in Kubernetes clusters. For npm packages, enable npm provenance (`npm publish --provenance`) in GitHub Actions. This generates a SLSA Level 2 provenance attestation and publishes it alongside the package. Consumers can verify provenance on npmjs.com or via the CLI with `npm audit signatures`. This is the single highest-leverage action for package maintainers. It took npm from zero provenance to verifiable build attestation with a single flag.

```yaml (.github/workflows/publish.yml)
name: Publish
on:
  release:
    types: [published]

jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write # Required for npm provenance
    steps:
      - uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
      - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020
        with:
          node-version: '22'
          registry-url: 'https://registry.npmjs.org'
      - run: npm ci
      - run: npm test
      # Publish with SLSA provenance attestation
      - run: npm publish --provenance --access public
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
```

### Generate and consume SBOMs

Add SBOM generation to your release pipeline using Syft (for container images) or cdxgen (for source projects). Store SBOMs alongside artifacts: attach them to GitHub releases, push them to OCI registries with Cosign, or publish to a dedicated SBOM repository. On the consumption side, use Grype or Trivy to scan SBOMs against vulnerability databases. The goal is not to produce a document for compliance. It is to have a queryable inventory that answers "are we running anything affected by CVE-2026-XXXXX?" within minutes of a disclosure, not days.
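To make "queryable inventory" concrete, here is a minimal sketch of the disclosure-day question, assuming one parsed CycloneDX JSON document per deployed artifact. The artifact names, versions, and the exact-match comparison are illustrative; Grype and Trivy resolve real advisory version ranges, which this does not attempt.

```python
def iter_components(node: dict):
    """Yield every component in a CycloneDX document, including nested ones."""
    for comp in node.get("components", []):
        yield comp
        yield from iter_components(comp)


def affected_artifacts(sboms: dict, name: str, bad_versions: set) -> dict:
    """Map artifact name -> matching purls for a package at known-bad versions."""
    report = {}
    for artifact, sbom in sboms.items():
        hits = sorted({
            comp.get("purl", f"{name}@{comp.get('version')}")
            for comp in iter_components(sbom)
            if comp.get("name") == name and comp.get("version") in bad_versions
        })
        if hits:
            report[artifact] = hits
    return report


if __name__ == "__main__":
    # Hypothetical inventory: two services, one on an affected version.
    inventory = {
        "api:v1.2.3": {"components": [
            {"name": "express", "version": "4.21.2", "purl": "pkg:npm/express@4.21.2"},
        ]},
        "worker:v0.9.0": {"components": [
            {"name": "express", "version": "4.19.0", "purl": "pkg:npm/express@4.19.0"},
        ]},
    }
    print(affected_artifacts(inventory, "express", {"4.21.2"}))
```

The answer is only as good as the SBOMs fed in, which is why regeneration on every release matters more than the query itself.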
> **EXAMPLE: Minimum viable supply chain security** > If you do nothing else: (1) commit lockfiles and use `npm ci` in CI, (2) pin GitHub Actions to SHAs, (3) enable Dependabot or Renovate, (4) run `npm audit` in CI. These four actions are free, take less than an hour to implement, and eliminate the most common attack vectors. Everything else (Sigstore, SLSA, SBOMs) builds on this foundation. ## Securing AI-generated code The specific challenge of AI-generated code is that it introduces vulnerabilities at a rate and volume that traditional security review processes were not designed to handle. Manual code review catches security issues at roughly the rate a human can read code. When AI generates 41% of new code, the review bottleneck becomes the binding constraint on security. Three approaches are emerging. First, AI-aware static analysis: tools that understand AI generation patterns and flag common AI-specific mistakes such as hallucinated imports, insecure default configurations, missing input validation, and suggested packages that do not exist or have been deprecated. Second, inline security feedback: IDE integrations that flag security issues as AI generates code, before it is committed, catching insecure patterns at the point of creation rather than in CI. Third, LLM-based code review: using large language models to review code for security issues, catching logic flaws and missing controls that rule-based scanners miss. No single stage catches everything, so platform teams should enforce security checks at all four: generation time (IDE), commit time (pre-commit hooks), CI time (build pipeline), and deploy time (admission control). The earlier the catch, the cheaper the fix, but the IDE check is the easiest to bypass and the admission controller is the hardest, so each stage backstops the ones before it. 
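Of the four stages, commit time is the cheapest place to force a human look at a dependency an assistant just suggested. One possible shape for that hook, sketched under the assumption that the base and proposed `package.json` files are already parsed into dictionaries (how you obtain them, for instance from `git show`, is left out):

```python
SECTIONS = ("dependencies", "devDependencies",
            "optionalDependencies", "peerDependencies")


def declared(manifest: dict) -> dict:
    """Flatten every dependency section of a package.json into name -> version spec."""
    merged = {}
    for section in SECTIONS:
        merged.update(manifest.get(section, {}))
    return merged


def added_dependencies(before: dict, after: dict) -> dict:
    """Return dependencies present in `after` that `before` did not declare at all."""
    known = declared(before)
    return {name: spec for name, spec in declared(after).items()
            if name not in known}


if __name__ == "__main__":
    base = {"dependencies": {"express": "^4.21.2"}}
    proposed = {
        "dependencies": {"express": "^4.21.2", "lodahs": "^1.0.0"},
        "devDependencies": {"vitest": "^2.0.0"},
    }
    for name, spec in sorted(added_dependencies(base, proposed).items()):
        print(f"new dependency needs review: {name} {spec}")
```

This does not judge whether a package is safe. It only guarantees that a new name never lands without someone seeing it, which is the heuristic AI suggestions tend to bypass.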
## When this is overkill A solo internal tool that builds from a private repo, ships to one cluster, and pulls a dozen well-known dependencies does not need SLSA Level 3, admission control, and a signed SBOM per release. The threat model that justifies this stack is a public package consumed by people you will never meet, or a build pipeline holding credentials worth stealing. For a script three people run, the full apparatus costs more attention than the risk it removes. Match the controls to the blast radius. ### SBOMs decay without ownership Generating an SBOM is a one-flag command. Keeping it accurate is a standing job. Every dependency bump, base-image change, and transitive shift makes yesterday's SBOM wrong, and a wrong inventory is worse than none because it answers the CVE-disclosure question confidently and incorrectly. The value comes from a queryable, current inventory, not a file attached to a release that nobody regenerates. If no one owns regeneration and validation, the SBOM is a compliance artifact that will quietly drift out of sync with what you actually run. ### Admission control adds friction Configuring Kyverno or OPA Gatekeeper to reject unsigned images stops a real attack and also stops a 2 AM hotfix built outside the normal pipeline. Break-glass paths get added, then used routinely, then become the default, and the control is theater. The policy is only as strong as the team's discipline about not bypassing it. A signing gate that everyone has learned to route around provides the audit log of a security control without the security. ### AI scanners cry wolf The AI-aware static analysis covered above is young, and immature scanners flag hallucinated-import patterns and missing-validation heuristics on code that is fine. A scanner that produces a high false-positive rate trains developers to dismiss its output, which is exactly the failure mode that lets a real finding slip through. 
Until a given tool earns trust on your codebase, treat its alerts as candidates for review, not gates that block merges, and measure its precision before you wire it into CI. > **WARNING: The compliance-theater objection** > The strongest argument against this entire stack: most of it becomes paperwork without disciplined triage behind it. An SBOM nobody queries, a signature nobody verifies on consumption, and an audit gate that allowlists every finding to stay green all produce the documentation of security with none of the substance. Black Duck found only 24% of organizations do comprehensive evaluation of AI-generated code; the other 76% are not all skipping the work because they lack tools. Some bought the tools and never built the triage muscle. Controls without follow-through are worse than honest gaps, because they manufacture the false confidence that the work is handled. The line between security and theater is whether a human acts on the output. An SBOM matters when a disclosure triggers a query and a patch. A signature matters when verification actually runs at deploy and fails closed. An audit gate matters when high-severity findings get fixed instead of allowlisted. The defense is the triage, and the tooling is only the mechanism that surfaces what to triage. Adopt a control only when you have the discipline to act on what it tells you. ## Where it's headed The supply chain security landscape is converging around a simple thesis: trust must be verified, not assumed. The tools exist. SLSA 1.2 provides the framework for build provenance. Sigstore provides the signing infrastructure. SBOMs provide the inventory. OpenSSF Scorecard provides the dependency evaluation. npm provenance provides package-level attestation. What is missing is adoption. Most organizations are still operating at the equivalent of SLSA Level 0: no provenance, no signing, no systematic dependency evaluation. The regulatory environment is catching up. 
CISA's updated SBOM guidance and the EU Cyber Resilience Act's software transparency requirements are creating compliance pressure that will force adoption independent of engineering conviction. US federal policy has moved in the opposite direction: Executive Order 14306 (June 2025) scaled back the attestation mandates of EO 14144, and OMB memorandum M-26-05 (January 2026) rescinded the government-wide software attestation requirement in favor of an optional, agency-led risk-based approach. Organizations that implement SBOM generation, provenance attestation, and artifact signing now will be ahead of requirements that are coming regardless. The AI dimension makes this urgent. When code generation is cheap, code volume explodes, and every additional dependency is a potential entry point. The Cline attack demonstrated that AI developer tools are themselves supply chain targets, and the developers who use them are high-value victims because they typically have broad repository access, CI/CD credentials, and deployment permissions. Securing the supply chain is no longer a security team problem. It is a platform engineering problem, and the window to get ahead of it is closing. > **The tools we use to write code are now attack surfaces themselves. Supply chain security is no longer optional. It is the cost of shipping software in 2026.** > **Key Point:** Start today: commit lockfiles, pin CI actions to SHAs, enable dependency scanning, and add `--provenance` to npm publish. These are free, take under an hour, and address the most common attack vectors. Then build toward SBOM generation, Sigstore signing, and SLSA provenance as your supply chain security matures. 
## Resources & Further Reading

- Sonatype 2026 State of the Software Supply Chain: https://www.sonatype.com/state-of-the-software-supply-chain/introduction - Comprehensive data on OSS malware, consumption trends, and supply chain risks
- Black Duck 2026 OSSRA Report: https://www.blackduck.com/resources/analyst-reports/open-source-security-risk-analysis.html - Open source risk analysis across 947 codebases and 17 industries
- SLSA Framework: https://slsa.dev - Supply-chain Levels for Software Artifacts specification and implementation guides
- Sigstore Documentation: https://docs.sigstore.dev - Keyless signing, Cosign, Fulcio, and Rekor documentation
- OpenSSF Scorecard: https://scorecard.dev - Automated security health metrics for open source projects
- CISA SBOM Resources: https://www.cisa.gov/sbom - Federal guidance on SBOM generation, formats, and consumption
- CycloneDX Specification: https://cyclonedx.org - OWASP SBOM format specification and tooling
- npm Provenance: https://docs.npmjs.com/generating-provenance-statements - How to publish packages with SLSA provenance attestation
- StepSecurity Harden-Runner: https://github.com/step-security/harden-runner - GitHub Actions security agent for detecting compromised dependencies
- ReversingLabs 2026 SSCS Report: https://www.reversinglabs.com/sscs-report - Supply chain security guidance timeline and threat analysis

---

# The IaC Landscape in 2026

- **URL**: https://www.stxkxs.io/blog/iac-landscape-2026
- **Published**: 2026-02-24
- **Author**: Brandon Stokes
- **Category**: infrastructure
- **Tags**: infrastructure, iac, terraform, opentofu, pulumi, crossplane, aws-cdk, platform-engineering, open-source, sst, nitric, encore, infrastructure-from-code
- **Reading time**: 14 min

The IaC market is fragmenting by use case: HCL declarative, general-purpose languages, Kubernetes-native, and infrastructure-from-code.
Real survey data from Firefly, Stack Overflow, and CNCF shows four paradigms growing simultaneously with no single winner. Here's where Terraform, OpenTofu, Pulumi, Crossplane, SST, Nitric, Encore, and more actually stand. ## Nobody agrees anymore If you started a new infrastructure project today, what would you reach for? Five years ago the answer was obviously Terraform. Two years ago it was probably still Terraform, with a footnote about the license change. Today? I have watched three different platform teams at three different companies make three different choices in the last six months, and every one of them had defensible reasons. The IaC landscape has genuinely fragmented, and the old default answer is gone. Pulumi is the better tool for most teams writing infrastructure today. I also understand exactly why the 62% already on Terraform stay there, and why I still use CDK for my own projects. Those are not contradictory positions. They reflect the reality that "which IaC tool" is now four separate questions masquerading as one, and the answer depends on your team's language preferences, cloud strategy, and honestly, how much organizational inertia you are willing to fight. What follows is the read on where each tool actually sits as of early 2026: not a neutral survey, but the view of someone who has shipped infrastructure with most of these tools and has opinions about all of them. ## Four paradigms grow at once The market is real and growing: roughly $1.3 billion in 2025, projected to reach about $9.4 billion by 2034 at a 24% CAGR (Precedence Research). The interesting thing is not the size. It is that four distinct paradigms are growing simultaneously without cannibalizing each other. HCL declarative (Terraform, OpenTofu), general-purpose languages (Pulumi, CDK), Kubernetes-native (Crossplane), and infrastructure-from-code (Nitric, Encore). Each serves a different engineering culture. Each has legitimate strengths. The question is which culture is yours. 
**GitHub stars by IaC tool (Feb 2026)**

- Ansible: 66000
- Terraform: 47600
- OpenTofu: 28800
- SST: 23000
- Pulumi: 22000
- AWS CDK: 12700
- Crossplane: 11726
- Encore: 8000
- Bicep: 3500

Stars are a vanity metric, but the shape of this chart tells you something. OpenTofu reaching roughly 28.8K in about two and a half years is remarkable velocity for a fork. SST at 23K is notable, and misleading, because the project is in maintenance mode. Pulumi at 22K understates its real footprint given 100M+ SDK downloads. CDK at 12.7K understates its usage even more: 3.5 million weekly npm downloads is quietly enormous. The point is that you have to look past any single metric to see what is actually happening.

- **Market Size**: $1.3B — Global IaC market size in 2025 (Precedence Research)
- **Terraform Adoption**: 62% — Current organizational adoption rate (Firefly 2025)
- **OpenTofu Downloads**: 10M+ — Total downloads across all releases
- **Pulumi Downloads**: 100M+ — Total SDK downloads across package managers

## The number that matters

Forget market size. The most important number in the IaC landscape is 15. That is the gap, in percentage points, between Terraform's current organizational adoption (62%, per Firefly's State of IaC 2025) and the percentage of respondents who commit to it as their future primary tool (47%). That gap is equivalent to nearly one in four current Terraform users signaling they may move on. 62% is dominant by any measure, but the grip is loosening.

**Current adoption vs future commitment**

- Terraform: 62% current, 47% future
- OpenTofu: 12% current, 27% future
- CDK: 10% current
- Pulumi: 8% current
- Crossplane: 6% current

OpenTofu's future commitment (27%) is more than double its current usage (12%). That is the shape of a tool in its adoption upswing. Pulumi and Crossplane both show positive gaps too, future intent outpacing current use. CDK is roughly flat, which makes sense. If you are already all-in on AWS and using CDK, you are not shopping around. You have made your choice. I am one of those users.
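The arithmetic behind the "15" is worth spelling out, because a percentage-point gap and a relative change tell different stories. This sketch uses only the Terraform and OpenTofu figures quoted above. It treats the current and future numbers as comparable shares of the same respondent pool, which is an assumption about the survey and not something the report guarantees.

```python
# Survey shares in percent, as quoted in this section.
CURRENT = {"Terraform": 62, "OpenTofu": 12}
FUTURE = {"Terraform": 47, "OpenTofu": 27}


def adoption_shift(current: dict, future: dict) -> dict:
    """Return tool -> (gap in percentage points, relative change vs current share)."""
    shifts = {}
    for tool, now in current.items():
        gap = future[tool] - now
        shifts[tool] = (gap, round(gap / now, 3))
    return shifts


if __name__ == "__main__":
    for tool, (points, relative) in adoption_shift(CURRENT, FUTURE).items():
        print(f"{tool}: {points:+d} points, {relative:+.1%} relative to current share")
```

Terraform and OpenTofu move by the same 15 points in opposite directions, but relative to their starting shares that is about a quarter of Terraform's base and more than a doubling of OpenTofu's.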
> **Key Point:** The IaC market is fragmenting by use case: HCL declarative, general-purpose languages, Kubernetes-native, and infrastructure-from-code. All four segments are growing simultaneously. There is no single right answer anymore. For teams already invested in Terraform, the switching cost is often the deciding factor. Migrating 50,000 lines of HCL to Pulumi is a multi-quarter project, regardless of which tool is "better." Factor migration cost into every comparison. ## The fork that stuck Most open-source forks die quietly. OpenTofu did not. When HashiCorp switched Terraform to BSL 1.1 in August 2023, the fork landed at the Linux Foundation within weeks. IBM acquired HashiCorp for $6.4 billion, closing in early 2025. The fork now has genuine feature divergence, which is the only thing that makes a fork matter long-term. ### Terraform: the incumbent Terraform is still the tool most teams know. The Terraform Registry now lists ~6,600 total providers (34 official, 391 partner, ~6,175 community); HashiCorp's last official integration milestone was "3,000+" in March 2023. It has the deepest CI/CD integration story, a decade of accumulated blog posts and Stack Overflow answers and consulting expertise. When you Google an infrastructure problem, you get a Terraform answer. That ecosystem gravity is real, and it is the actual reason most teams stick with Terraform, not because they evaluated alternatives and chose it, but because the switching cost is not worth the fight. The risks are equally straightforward. IBM ownership means strategic uncertainty. The BSL license makes procurement teams at open-source-first shops nervous. Feature velocity has genuinely slowed relative to OpenTofu. State encryption, early variable evaluation, provider for_each: OpenTofu shipped all of these first. For a project with Terraform's market position, being outpaced on features by its own fork is not a great look. 
### OpenTofu: earning it

OpenTofu entered the CNCF Sandbox in April 2025. It has 160+ contributors, 10 million downloads, and its registry handles over 6 million requests per day. More importantly, it has shipped features Terraform does not have: native state encryption with AES-GCM and KMS provider support, early variable evaluation in backend and module sources, provider for_each, OCI-compliant module sources, and ephemeral resources with write-only attributes that never touch state files.

```hcl (backend.tofu)
terraform {
  encryption {
    method "aes_gcm" "default" {
      keys = key_provider.aws_kms.main
    }

    key_provider "aws_kms" "main" {
      kms_key_id = "arn:aws:kms:us-west-2:123456789:key/my-key-id"
      key_spec   = "AES_256"
      region     = "us-west-2"
    }

    state {
      method   = method.aes_gcm.default
      enforced = true
    }

    plan {
      method   = method.aes_gcm.default
      enforced = true
    }
  }
}
```

Native state encryption is the feature to point to if someone asks "why does OpenTofu matter?" Terraform state files contain secrets in plaintext. Everyone knows this. Everyone has worked around it with backend encryption, remote state, access controls. OpenTofu just encrypts the state. The fact that this took a fork to happen tells you something about where Terraform's priorities were.

> **TIP: Migration is boring (in a good way)**
> State files are compatible. Provider binaries are compatible. The migration is mechanical: swap `terraform` for `tofu` in your CI, optionally rename blocks, point registry references to `registry.opentofu.org`. The hard part is organizational: updating runbooks, retraining muscle memory, and getting buy-in from teams that have "Terraform" in their job titles.

**OpenTofu migration status (2025)**

- No plans: 65%
- Evaluating: 24%
- Planning: 6%
- Completed: 5%

## Why Pulumi is better

For most teams starting new infrastructure projects, Pulumi is the better choice over Terraform.
Writing infrastructure in the same language you write your application in eliminates an entire class of problems. HCL served its purpose, but the gap is real. You get real conditionals, not HCL's count-based hacks. You get actual loops, not for_each with maps. You get the same IDE, the same type system, the same test framework, the same package manager. 100 million SDK downloads and $99 million in funding say this is not a niche opinion. When you wire this up for the first time (defining an S3 bucket as a TypeScript class with full autocompletion, then writing a unit test for your infrastructure the same way you would test application logic), the HCL workaround era feels immediately anachronistic. Pulumi is also adding HCL support (expected Q1 2026), letting teams import existing Terraform modules without a rewrite. That is a smart bridge strategy: meet people where they are, then show them why the other side is better. I understand exactly why people stick with Terraform, and it is not irrational. HCL's constraints are a feature for teams that need them. When every infrastructure file looks the same (same syntax, same structure, same limited set of operations), cross-team readability and auditing get dramatically easier. Pulumi's flexibility means two teams might write completely different patterns for the same infrastructure. If your organization does not already have strong code review culture and shared libraries for application code, Pulumi's flexibility will amplify that problem, not solve it. > **WARNING: The flexibility tax** > The provider ecosystem is still smaller than Terraform's 6,594+ providers, and community knowledge (blog posts, tutorials, Stack Overflow answers) is an order of magnitude thinner. You will hit more "I am the first person to try this" moments. Budget for the linting, code review, and shared libraries that keep that flexibility from turning into drift. The debugging story is also worth being honest about. 
When an infrastructure deployment fails in Terraform, you are reading HCL and provider logs. In Pulumi, you are debugging through layers of language runtime, Pulumi engine, and cloud API. The failure modes are more complex. For experienced engineers, that trade-off is worth it. For teams where infrastructure is a side responsibility, it might not be. ## CDK and the single-cloud bet I use CDK for my own infrastructure and I would choose it again tomorrow. This site, the APIs behind it, the Lambda functions, the CDK stacks that deploy everything: all TypeScript CDK. I am all-in on AWS, I know I am all-in on AWS, and CDK gives me the highest-fidelity access to AWS services of any IaC tool. That is a deliberate lock-in choice. I am trading portability for productivity, and at my scale the trade is clearly worth it. A single ApplicationLoadBalancedFargateService construct replaces dozens of raw CloudFormation resources. The type system catches misconfigurations at compile time. When AWS releases a new service, CDK support typically lands faster than any third-party provider. CDK has 12.7K stars but 3.5 million weekly npm downloads, which is quietly one of the largest footprints in the IaC space. CDK does not have the community buzz of tools that are fighting for mindshare. AWS shops just use it. It is the path of least resistance when you have already made the cloud commitment. Azure Bicep occupies the same niche for Microsoft's cloud. 3.5K stars, 708+ organizations, and a clever design choice: Bicep has no state files because ARM itself is the state store, eliminating an entire category of state management problems. For Azure-committed shops, it is the right answer for the same reasons CDK is the right answer for AWS shops. > **INFO: The lock-in calculus** > The standard objection is lock-in, and it deserves a straight answer. 
89% of enterprises report multi-cloud strategies (Flexera 2024; up from 87% the prior year), and the average organization uses 2.4 cloud providers (Flexera 2025). "Multi-cloud strategy" often means "we have some workloads on AWS and some on Azure," not "we need to move workloads between clouds." If you are genuinely single-cloud (and many organizations are, despite what their strategy decks say), CDK or Bicep is a rational, even optimal choice. If you might need to move clouds, you need Terraform, OpenTofu, or Pulumi. Be honest about which camp you are in. CloudFormation itself is worth a brief note. AWS is investing in CDK as the primary developer-facing layer, with CloudFormation becoming the compilation target rather than the authoring surface. Teams still writing raw CloudFormation YAML are maintaining legacy stacks, not starting new projects. For those teams, CDK is the obvious migration path: same deployment model, same underlying engine, dramatically better developer experience. ## Crossplane: right tool, narrow lane Crossplane graduated from the CNCF in October 2025, which is meaningful validation. Only a handful of projects reach that status. With 11.7K stars and 3,000+ contributors, it represents a fundamentally different model: infrastructure as Kubernetes Custom Resource Definitions, managed by controllers that continuously reconcile desired state with actual state. You do not run apply. The cluster runs it for you, constantly. If you are building an Internal Developer Platform on Kubernetes, Crossplane is the provisioning engine you want. Cloud resources get managed with kubectl apply, the same workflow as pods and deployments. GitOps tools like Argo CD and Flux work natively. Drift detection is automatic because the reconciliation loop never stops. Platform teams can build self-service abstractions that hide cloud complexity from application developers. It is elegant for this specific use case. 
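The reconciliation model is easier to see in miniature. What follows is a toy sketch, not Crossplane's controller code: the `BucketSpec` shape, the `reconcile` function, and the in-memory `cloud` map are hypothetical stand-ins for a managed resource, a controller, and a cloud API.

```typescript
// Toy reconciliation loop: compare desired state with observed state and
// converge. All names are hypothetical; this is not Crossplane's API.

interface BucketSpec {
  name: string;
  versioning: boolean;
}

type Action = "create" | "update" | "delete" | "none";

// In-memory stand-in for the resources this loop manages in the cloud.
type Cloud = Map<string, BucketSpec>;

function reconcile(desired: BucketSpec[], cloud: Cloud): Record<string, Action> {
  const actions: Record<string, Action> = {};
  const wanted = new Set(desired.map((d) => d.name));

  for (const spec of desired) {
    const actual = cloud.get(spec.name);
    if (actual === undefined) {
      cloud.set(spec.name, { ...spec });
      actions[spec.name] = "create";
    } else if (actual.versioning !== spec.versioning) {
      // Drift: the resource was changed out of band, so put it back.
      cloud.set(spec.name, { ...spec });
      actions[spec.name] = "update";
    } else {
      actions[spec.name] = "none";
    }
  }

  // Managed resources that are no longer declared get removed.
  for (const name of Array.from(cloud.keys())) {
    if (!wanted.has(name)) {
      cloud.delete(name);
      actions[name] = "delete";
    }
  }
  return actions;
}

const desired: BucketSpec[] = [{ name: "uploads", versioning: true }];
const cloud: Cloud = new Map();

const firstPass = reconcile(desired, cloud); // creates the bucket
cloud.set("uploads", { name: "uploads", versioning: false }); // manual drift
const secondPass = reconcile(desired, cloud); // reverts the drift
const thirdPass = reconcile(desired, cloud); // steady state
```

A real controller runs this comparison continuously against live cloud APIs, which is why there is no separate apply step and why drift detection comes for free.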
If you are not already running Kubernetes as your platform substrate, Crossplane is the wrong answer. It requires a running cluster, which is meaningful overhead. Provider maturity is uneven: AWS and GCP are strong, niche providers lag far behind Terraform's ecosystem. Debugging requires understanding Kubernetes controller semantics, CRD schemas, and provider-specific resource models. For teams that are not deeply invested in the Kubernetes ecosystem, Crossplane adds complexity without proportional benefit. It is a great platform engineering tool. It is not a general-purpose IaC tool, and it should not be evaluated as one. > **EXAMPLE: Who should look at Crossplane** > Platform teams already running Kubernetes as an Internal Developer Platform. If you are building a self-service infrastructure layer on top of K8s (where application teams request resources through a portal or GitOps workflow), Crossplane fits naturally as the provisioning engine. If you are not running K8s, do not adopt K8s in order to use Crossplane. That is the tail wagging the dog. ## Infrastructure-from-code: honest assessment The fourth paradigm inverts the model entirely. Instead of declaring infrastructure and referencing it from application code, you write application code and the framework infers what infrastructure you need. Import a storage SDK, use it in your handler, and the framework provisions the bucket at deploy time. No separate IaC files, no state management, no resource graph. The developer experience is genuinely compelling. The track record is genuinely concerning. What happened here: SST (26K stars, the highest-profile infrastructure-from-code tool) entered maintenance mode in mid-2025 when the team pivoted to building OpenCode, an AI coding agent. Winglang, created by Elad Ben-Israel (the original creator of AWS CDK), raised $20 million in seed funding to build a purpose-built cloud programming language. Wing Cloud shut down in 2025. 
The project continues as community-maintained open source without corporate backing. The signals are different (SST's team chose to leave for a bigger opportunity, Winglang could not sustain the business), but the conclusion is the same: standalone IfC frameworks have not found a sustainable business model.

Infrastructure-from-code is not a bad idea. The developer experience SST offered was real. I know people who shipped faster with it than with any other tool. What failed is the business, not the paradigm: the teams behind SST and Winglang were talented and well-funded, and they still could not make it work as standalone products.

### What survives: IfC as a layer

Nitric and Encore represent the more durable model: IfC as a layer on top of existing IaC engines, not a replacement for them. Nitric is an open-source, multi-cloud SDK. Write an API handler that reads from a bucket, and it generates Pulumi or Terraform to provision the right resource on AWS, GCP, or Azure. The multi-cloud capability is genuine, not theoretical. Encore takes a type-safe approach for TypeScript and Go, automatically provisioning cloud resources from typed infrastructure primitives.

```typescript (services/api.ts (Nitric))
import { api, bucket } from "@nitric/sdk";

// Resource declarations: the framework provisions these at deploy time.
const uploads = bucket("uploads").allow("read", "write");
const mainApi = api("main");

// The handler uses the bucket directly; there is no separate IaC file.
mainApi.get("/files/:name", async (ctx) => {
  const file = uploads.file(ctx.req.params.name);
  const url = await file.getDownloadUrl();
  ctx.res.json({ url });
});
```

The key difference from SST and Winglang is the fallback story. When you use Nitric or Encore, the generated infrastructure is standard Pulumi or Terraform underneath. If the framework goes away (and you should plan for that possibility with any IfC tool), you can eject to the underlying IaC and keep going. That ejection path is what makes the abstraction acceptable. Without it, you are building on a foundation you cannot maintain.
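To make the ejection idea concrete, here is a minimal sketch of the "IfC as a layer" shape: application code declares resources, and the layer can render those declarations as plain data that an IaC engine could consume. Every name in it (`ResourceRegistry`, `eject`, the JSON layout) is hypothetical; it does not reflect how Nitric or Encore are implemented.

```typescript
// Toy model of "IfC as a layer". Hypothetical names throughout;
// this is not the Nitric or Encore API.

type Permission = "read" | "write";

interface ResourceDecl {
  kind: "bucket" | "api";
  name: string;
  permissions: Permission[];
}

class ResourceRegistry {
  private decls: ResourceDecl[] = [];

  bucket(name: string, permissions: Permission[]): ResourceDecl {
    const decl: ResourceDecl = { kind: "bucket", name, permissions };
    this.decls.push(decl);
    return decl;
  }

  api(name: string): ResourceDecl {
    const decl: ResourceDecl = { kind: "api", name, permissions: [] };
    this.decls.push(decl);
    return decl;
  }

  // The ejection path: a framework-independent description of everything
  // the application asked for, sorted by name so diffs stay stable.
  eject(): string {
    const sorted = [...this.decls].sort((a, b) => a.name.localeCompare(b.name));
    return JSON.stringify({ resources: sorted }, null, 2);
  }
}

const registry = new ResourceRegistry();
registry.bucket("uploads", ["read", "write"]);
registry.api("main");

// What you would keep if the framework disappeared tomorrow.
const ejected = registry.eject();
```

The design point is that the declarations outlive the framework: if the ejected output is something you can read, commit, and hand to a standard engine, the abstraction is a convenience rather than a dependency.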
> **WARNING: The abstraction risk**
> Infrastructure-from-code trades control for velocity. When the abstraction works, you ship faster. When it breaks (and at scale, it will), you need to understand the generated IaC underneath. Before adopting any IfC tool, answer two questions: Can I eject to raw IaC if this framework is abandoned? Can I inspect and modify the generated infrastructure? SST's maintenance mode and Winglang's shutdown are not abstract cautionary tales. They happened in 2025. Plan accordingly.

## Where this goes

Terraform will remain the most-used IaC tool for years. 62% adoption does not evaporate, and the switching costs are real. That 15-point commitment gap tells you the direction. IBM's stewardship is the deciding factor: invest in the community and the position is defensible; prioritize commercial extraction and the erosion accelerates. Gradual erosion is the likely bet either way.

OpenTofu is the fastest-growing tool in the HCL lane, and the CNCF trajectory matters. If graduation follows (as it did for Crossplane), OpenTofu could become the default recommendation for new HCL projects within two years. The migration data supports this: 5% completed, 24% evaluating. The adoption wave is still ahead, not behind.

**IaC market projected growth ($B)**
- 2023: $0.7B
- 2025: $0.9B
- 2028: $1.8B
- 2030: $2.8B
- 2034: $4.2B

Pulumi is the tool to recommend most often for new projects, with a caveat: you need the engineering maturity to use it well. Real language flexibility without real conventions produces real chaos. The upcoming HCL import capability is smart. It lets teams migrate incrementally instead of rewriting everything, which is how most migrations actually succeed in practice.

Crossplane's growth is tied directly to Kubernetes adoption. As more organizations build internal platforms on K8s, Crossplane becomes the natural provisioning layer. It will never be a general-purpose IaC tool and it should not try to be.
Its growth will track the platform engineering segment specifically. Infrastructure-from-code is the youngest and most uncertain lane. The developer experience is real. The business model is unproven. If the paradigm survives (and it likely will, in some form), it will be as a layer that generates IaC rather than a framework that replaces it. Nitric and Encore are the bets worth watching. Do not build critical production infrastructure on any IfC tool without a clear ejection plan. > **The question to ask is which paradigm matches how your team actually builds and operates, not which IaC tool is best. Pick based on your team, not the market.** > **Key Point:** The decision framework: OpenTofu for multi-cloud teams with open-source governance requirements. Terraform for enterprises needing vendor support and the largest ecosystem. Pulumi for developer-first teams with the engineering maturity to enforce conventions. Crossplane for platform teams building Kubernetes-native internal developer platforms. Nitric or Encore for teams that want IfC with a credible ejection story. CDK for AWS-only organizations (this is what I use). Bicep for Azure-only organizations. 
## Resources & Further Reading - Firefly State of IaC 2025: https://www.firefly.ai/state-of-iac - Adoption data covering Terraform, OpenTofu, and broader IaC trends - Stack Overflow Developer Survey 2025: https://survey.stackoverflow.co/2025/ - Developer usage data across IaC and infrastructure tools - OpenTofu Documentation: https://opentofu.org/docs/ - Official docs covering all OpenTofu-specific features - OpenTofu GitHub: https://github.com/opentofu/opentofu - Source code, releases, and community discussions - Pulumi Documentation: https://www.pulumi.com/docs/ - Getting started guides and provider references - Crossplane Documentation: https://docs.crossplane.io/ - Architecture, providers, and composition guides - AWS CDK Documentation: https://docs.aws.amazon.com/cdk/ - Construct library, API reference, and examples - Azure Bicep Documentation: https://learn.microsoft.com/en-us/azure/azure-resource-manager/bicep/ - Language reference and module registry - Nitric Documentation: https://nitric.io/docs - Multi-cloud infrastructure-from-code SDK for TypeScript, Python, Go, and Dart - Encore Documentation: https://encore.dev/docs - Type-safe backend framework with automatic infrastructure provisioning - CNCF Landscape — Provisioning: https://landscape.cncf.io/guide#provisioning - Where IaC tools fit in the cloud-native ecosystem - MarketsandMarkets IaC Report: https://www.marketsandmarkets.com/Market-Reports/infrastructure-as-code-market-180538080.html - Market sizing and growth projections through 2030 --- # Platform Engineering the AI Era - **URL**: https://www.stxkxs.io/blog/platform-engineering-ai-era - **Published**: 2026-02-16 - **Author**: Brandon Stokes - **Category**: kubernetes - **Tags**: ai-infrastructure, kubernetes, ml-inference, vllm, gpu, vector-databases, platform-engineering, kserve, mlops, llm-serving - **Reading time**: 14 min Every enterprise is deploying AI to production, but the conversation focuses on models and prompts—not the 
infrastructure underneath. Platform engineers are now responsible for GPU orchestration, model serving, vector storage, LLM routing, and inference autoscaling. This post maps the actual infrastructure stack with real adoption data. ## Inference is the bottleneck Inference now accounts for roughly two-thirds of all AI compute, up from a third in 2023 (Deloitte 2026 TMT Predictions). Training gets the headlines: massive GPU clusters, billion-dollar training runs, frontier model announcements. Once a model ships, inference runs forever. Every API call, every chatbot response, every RAG pipeline query, every agent action is an inference workload. Training is a capital expense. Inference is an operating expense that scales with usage and never stops. The discourse fixates on which foundation model is best, how to engineer prompts, and which AI coding tool will replace developers. Almost nobody talks about who runs the GPUs, manages the serving layer, orchestrates the vector storage, or pays the inference bill. That infrastructure layer is where the complexity lives, and it is growing faster than any other part of the stack. This is an infrastructure problem, not an AI problem. The platform engineers who built container orchestration, service meshes, and CI/CD pipelines are now responsible for GPU scheduling, model serving, vector storage, LLM routing, and inference autoscaling. The job title did not change. The workloads did. If you have spent the last few years building internal developer platforms, you already have the instincts for this. The patterns are the same: resource scheduling, autoscaling, multi-tenancy, cost attribution. The resources just cost 100x more per unit and the failure modes are less forgiving. 
- **Inference Market**: $106B — AI inference market size, projected to reach $255B by 2030 at 19.2% CAGR (MarketsandMarkets) - **AI CapEx 2026**: $660-700B — Projected capital expenditure from top 5 US cloud providers, ~$450B directly tied to AI infrastructure (Goldman Sachs/CNBC). Whether this spend generates proportional ROI remains the open question of the decade - **Enterprise GenAI**: 80% — Enterprises deploying GenAI to production by 2026, up from <5% in 2023 (Gartner) - **K8s AI Workloads**: 90%+ — Enterprises increasing AI workloads on Kubernetes (Spectro Cloud) > **INFO: Scope** > This post covers the infrastructure stack that platform engineers build and manage to run AI workloads in production: model serving, GPU orchestration, vector storage, LLM routing, and MLOps. Not AI tools for developers. Not prompt engineering. The plumbing underneath. A note on scope: this is primarily for teams running self-hosted or hybrid inference. If your AI strategy is "call the OpenAI API," your platform engineering mandate is different (cost allocation, rate limiting, API key management, and vendor negotiation). That is a valid approach for most teams and a different post. **Inference vs training compute share** - 2023: 33% - 2025: 50% - 2026: 67% ## Serving is the hottest layer Model serving is the hottest layer in the AI infrastructure stack. It is the machinery that takes a trained model artifact and turns it into a production endpoint that can handle real traffic, with latency requirements, throughput targets, and cost constraints. Three projects dominate this space, each solving a different part of the problem. ### vLLM vLLM has become the de facto standard for LLM serving. At 81.3K GitHub stars, it is the most widely adopted open-source inference engine for large language models. The project is now under the PyTorch Foundation, giving it institutional stability beyond any single company. 
Its open governance model ensures long-term stability and vendor neutrality, critical for organizations that need confidence their inference stack will not be abandoned or locked down.

The technical innovation that drove adoption is PagedAttention, a memory management technique inspired by virtual memory in operating systems. Traditional LLM serving allocates contiguous memory for the KV cache of each request, wasting 60-80% of GPU memory through internal fragmentation. PagedAttention partitions the KV cache into fixed-size blocks that can be stored non-contiguously, the same way an OS manages virtual memory pages. This alone increases serving throughput by 2-4x.

Continuous batching is the second critical optimization. Static batching waits for a full batch before processing, adding latency. Continuous batching dynamically inserts new requests as tokens are generated, keeping the GPU saturated at all times. Combined with speculative decoding (where a smaller draft model proposes tokens that the larger model verifies in parallel), vLLM achieves throughput numbers that were impossible two years ago. Production deployments report up to 24x throughput improvement over naive Hugging Face Transformers serving.

- [vLLM](https://github.com/vllm-project/vllm)
- [PagedAttention Paper](https://arxiv.org/abs/2309.06180)

- **GitHub Stars**: 81.3K — Most-starred open-source LLM serving project on GitHub
- **Hardware Support**: Multi-vendor — Supports NVIDIA, AMD, and AWS Inferentia accelerators
- **Governance**: PyTorch Foundation — Open governance under the PyTorch Foundation ensuring long-term vendor neutrality

### KServe

KServe reached CNCF incubating status in September 2025, the first ML serving platform to achieve this level of cloud-native governance. CNCF incubation signals production readiness, community sustainability, and vendor neutrality. KServe is not just another ML serving tool.
It is becoming the Kubernetes-native standard for model deployment, the same way Prometheus became the standard for monitoring. KServe's core abstraction is the InferenceService CRD, which defines a predictor/transformer/explainer pattern for model deployments. The predictor handles inference. The transformer handles pre/post-processing: tokenization, feature engineering, output formatting. The explainer provides model interpretability. This separation means you can update preprocessing logic without redeploying the model, or swap the underlying serving runtime from TorchServe to vLLM without changing the transformer pipeline. Version 0.15 added first-class LLM support with KEDA-based autoscaling that scales on token throughput rather than just HTTP request count. This matters because a single LLM request can consume 100x more GPU compute than a simple classification request. The integration with Envoy AI Gateway adds LLM-aware traffic management: routing by model, token-based rate limiting, and cost attribution per team. Bloomberg's production deployment runs hundreds to thousands of small models in a multi-tenant configuration, demonstrating that KServe scales beyond single-model use cases. - [KServe](https://kserve.github.io/website/) - [Envoy AI Gateway](https://gateway.envoyproxy.io/docs/tasks/ai-gateway/) - [KEDA](https://keda.sh/) - **GitHub Stars**: 5,500+ — CNCF incubating project (September 2025) - **Production Adopters**: Bloomberg, Red Hat, SAP — Used across enterprise software, cloud infrastructure, and financial services - **CNCF Status**: Incubating — First ML serving platform at CNCF incubating level > **TIP: vLLM + KServe** > vLLM is the engine. KServe is the orchestrator. They compose. Most enterprise deployments use vLLM as the serving runtime inside KServe's InferenceService CRD, getting vLLM's inference performance with KServe's Kubernetes-native lifecycle management, autoscaling, and traffic routing. 
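The case for scaling on token throughput rather than request count comes down to simple arithmetic. The sketch below applies the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler, `desired = ceil(current * metric / target)`, to a tokens-per-second metric. The function name, the 2,000 tokens/s per-replica target, and the replica bounds are assumptions for illustration, not KServe or KEDA defaults.

```typescript
// Sketch of token-throughput autoscaling using the HPA-style proportional
// rule. All numbers and names are illustrative assumptions.

interface ScaleInput {
  currentReplicas: number;
  observedTokensPerSec: number; // summed across all replicas
  targetTokensPerSecPerReplica: number;
  minReplicas: number;
  maxReplicas: number;
}

function desiredReplicas(input: ScaleInput): number {
  // Treat zero replicas as one so the per-replica average is defined.
  const current = Math.max(1, input.currentReplicas);
  const perReplica = input.observedTokensPerSec / current;
  const raw = Math.ceil(current * (perReplica / input.targetTokensPerSecPerReplica));
  // Clamp to the configured bounds.
  return Math.min(input.maxReplicas, Math.max(input.minReplicas, raw));
}

const base = {
  currentReplicas: 2,
  targetTokensPerSecPerReplica: 2000,
  minReplicas: 1,
  maxReplicas: 10,
};

// Same request rate (10 req/s), very different token load.
const shortPrompts = desiredReplicas({ ...base, observedTokensPerSec: 10 * 50 });
const longPrompts = desiredReplicas({ ...base, observedTokensPerSec: 10 * 1500 });
```

With these inputs the short-prompt workload needs 1 replica and the long-prompt workload needs 8, even though both arrive at 10 requests per second. A request-count metric would size them identically.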
### NVIDIA Triton + TensorRT-LLM NVIDIA Triton Inference Server is the enterprise-grade option for mixed model workloads. It leads MLPerf Inference 4.1 benchmarks and supports concurrent serving of LLMs, embedding models, vision models, and speech models on the same infrastructure. TensorRT-LLM, NVIDIA's LLM-specific optimization layer, delivers 4x throughput compared to vanilla PyTorch and sub-10ms per-token latency through aggressive kernel fusion, quantization, and hardware-specific optimization. Triton excels at model ensembles, pipelines where multiple models execute in sequence. A typical RAG pipeline might chain an embedding model, a reranker, and an LLM. Triton handles inter-model communication in GPU memory, avoiding the CPU round-trips that kill latency when you orchestrate the same pipeline in application code. For organizations running diverse model types beyond just LLMs (vision models for document processing, speech models for transcription, embedding models for search), Triton's multi-framework support (PyTorch, TensorFlow, ONNX, TensorRT) eliminates the need to run separate serving infrastructure per framework. A hybrid pattern is emerging in production: vLLM handles LLM text generation where its PagedAttention and continuous batching shine, while Triton serves embedding models, vision models, and speech models where its multi-framework support and MLPerf-leading performance matter most. This avoids forcing one tool to do everything and plays to each project's strengths. - [Triton Inference Server](https://developer.nvidia.com/triton-inference-server) - [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) > **WARNING: NVIDIA lock-in is real** > TensorRT-LLM only runs on NVIDIA GPUs. The performance gains are substantial (4x throughput is not a rounding error), but adopting TensorRT-LLM means committing to NVIDIA hardware for the workloads that depend on it. 
If your organization is evaluating AMD MI300X or AWS Inferentia as cost alternatives, TensorRT-LLM locks you out of that optionality. vLLM's multi-vendor support is a meaningful hedge.

**Model serving throughput (Llama 2 70B)**
- HF Transformers: 12 req/s
- Triton + TensorRT: 185 req/s
- vLLM: 156 req/s
- TGI: 98 req/s

## GPU orchestration on Kubernetes

GPUs are the scarcest and most expensive resource in enterprise AI infrastructure. An NVIDIA H100 costs $25,000-$40,000 per unit, and most enterprises cannot buy enough of them. The orchestration challenge is not just scheduling workloads onto GPUs. It is maximizing utilization of hardware that costs more per hour than most engineers earn in a day.

I have watched teams try to bolt AI serving onto existing Kubernetes clusters without GPU-aware scheduling, and the result is always the same: GPUs sitting idle while training jobs queue for hours because nobody set up proper resource quotas or preemption policies. The GPU scheduling problem is worse than it looks because Kubernetes was not designed for resources this expensive and this indivisible.

The NVIDIA GPU Operator provides Kubernetes-native GPU lifecycle management: driver installation, device plugin configuration, monitoring, and health checks as a single operator deployment. It eliminates the manual GPU setup that historically made Kubernetes GPU clusters fragile and hard to maintain. The operator manages the full driver stack, from kernel modules through the CUDA runtime to the device plugin that exposes GPUs to Kubernetes scheduling. When a node reboots or a driver update ships, the operator handles reconciliation automatically. For organizations running GPU nodes across multiple clusters or hybrid environments, it provides a consistent management layer.

- [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/)

Run:ai, acquired by NVIDIA in 2024, adds dynamic GPU pooling across hybrid environments.
Instead of statically assigning GPUs to teams or workloads, Run:ai creates a shared pool where GPUs are allocated dynamically based on demand, priority, and fairness policies. This is the equivalent of what Kubernetes did for CPU and memory: turning dedicated allocations into a shared, scheduled resource. Run:ai reports that dynamic pooling increases average GPU utilization from 25-30% to 70-80%, effectively tripling the usable capacity of existing hardware without buying more GPUs. - [Run:ai](https://www.run.ai/) The GPU sharing problem remains the hardest unsolved challenge. NVIDIA Multi-Instance GPU (MIG) partitions a single A100 or H100 into up to seven isolated instances, each with dedicated compute, memory, and bandwidth: true hardware isolation, not time-sharing. Time-slicing allows multiple workloads to share a GPU by interleaving execution, trading isolation for flexibility. Fractional GPU solutions from projects like HAMi attempt finer-grained sharing through memory and compute limiting at the container level. None of these are as mature or as low-friction as CPU sharing in Kubernetes. GPU orchestration in 2026 feels like container orchestration felt in 2016. - [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) - [HAMi](https://github.com/Project-HAMi/HAMi) > **INFO: The utilization gap** > GPU utilization in most enterprises is 30-50%. That means half or more of the most expensive hardware in your data center is idle at any given time. The primary cause is static allocation: teams reserve GPUs for peak demand and leave them idle during off-peak. Dynamic pooling and fractional GPU sharing are the infrastructure answers, but they require organizational change too: teams must give up dedicated hardware in exchange for guaranteed SLOs from a shared pool. > **Key Point:** GPU management is becoming as critical as container orchestration was in 2018. 
The organizations that figure out GPU scheduling, sharing, and utilization optimization will have a structural cost advantage over those that throw hardware at the problem. ## Vector storage is consolidating Every enterprise RAG pipeline needs vector storage: a database optimized for storing and querying high-dimensional embeddings. The market exploded in 2023-2025 as RAG became the dominant pattern for grounding LLM responses in organizational data. Four projects have emerged as the primary options, each with a distinct positioning. ### Pinecone Pinecone was the first purpose-built vector database to reach broad commercial adoption, with thousands of paying customers. Its fully managed model means zero operational overhead: no clusters to manage, no scaling to configure, no indexes to rebuild. Pinecone's serverless architecture launched in January 2024 and reduced costs by up to 50x for sporadic workloads by decoupling storage from compute. For organizations that want vector search without building vector infrastructure, Pinecone is the default choice. The trade-off is cost at scale and the inability to self-host for data sovereignty requirements. - [Pinecone](https://www.pinecone.io/) ### Weaviate Weaviate raised at a $200M valuation in its Series C (October 2025) and positions itself as an AI-native database: not just vector storage but a full data layer for AI applications. Its built-in vectorization modules connect directly to embedding APIs (OpenAI, Cohere, Hugging Face), so you store objects and Weaviate handles embedding generation and indexing automatically. Hybrid search combining dense vectors with BM25 sparse retrieval outperforms pure vector search by 5-15% on retrieval benchmarks. Native multi-tenancy isolates data at the tenant level, critical for SaaS platforms where each customer's RAG data must be logically separated. It is open-source with a commercial managed offering. 
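One common way to combine a dense-vector ranking with a BM25 ranking is reciprocal rank fusion (RRF), where each document scores `1 / (k + rank)` in every list it appears in and the scores are summed. The sketch below is generic RRF under assumed inputs, not Weaviate's fusion implementation; the document IDs are made up, and `k = 60` is the constant commonly used with RRF.

```typescript
// Generic reciprocal rank fusion over any number of ranked lists.
// Illustrative only; document IDs and inputs are hypothetical.

function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();

  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-based
      const previous = scores.get(docId) ?? 0;
      scores.set(docId, previous + 1 / (k + rank));
    });
  }

  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}

// Results from a dense-vector search and a BM25 keyword search.
const denseResults = ["doc-a", "doc-b", "doc-c"];
const bm25Results = ["doc-c", "doc-a", "doc-d"];

const fused = reciprocalRankFusion([denseResults, bm25Results]);
```

Documents that rank well in both lists (`doc-a`, `doc-c`) rise above documents that appear in only one, which is the intuition behind hybrid search beating pure vector search on keyword-heavy queries.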
- [Weaviate](https://weaviate.io/) ### Qdrant Qdrant has raised $87.8M total (a $50M Series B announced March 2026, led by AVP, on top of $37.8M raised through its 2024 Series A) and built a customer base that includes Tripadvisor, HubSpot, and Deutsche Telekom. Written in Rust for performance, Qdrant's single-node performance is hard to beat. In Qdrant's own filtered-search benchmarks it leads on filtered queries, which are the majority of production use cases (you almost never search the full collection; you filter by metadata first, then search). Named vectors allow storing multiple embedding representations per point (a text embedding, an image embedding, and a code embedding for the same document), enabling multi-modal search without duplicating records. Its on-disk quantization (binary, scalar, product) keeps memory footprint manageable even at 100M+ vector scale. - [Qdrant](https://qdrant.tech/) - [Qdrant Filtered Search Benchmarks](https://qdrant.tech/benchmarks/filtered-search-intro/) ### Milvus Milvus, the open-source option backed by Zilliz ($113M raised), targets billions-scale distributed deployments. Its architecture separates storage, compute, and coordination into independent microservices. Each can scale independently. This disaggregated design means you can add query nodes for throughput without increasing storage nodes, or add storage for capacity without paying for more compute. If your vector storage needs are measured in billions, Milvus is likely the only open-source option that has been proven at that scale. Zilliz Cloud provides a managed version for teams that want the scale without the operational complexity. - [Milvus](https://milvus.io/) - [Zilliz Cloud](https://zilliz.com/) **Vector search latency: 1M vectors, 768 dimensions** - Qdrant: 1.8ms - Weaviate: 3.2ms - Milvus: 2.5ms - Pinecone: 4.1ms - pgvector: 12.6ms - Elasticsearch: 8.4ms > **TIP: Decision framework** > Pinecone if you want fully managed and can accept the cost. 
Qdrant or Weaviate if you need self-hosted for data sovereignty or cost control. Milvus if you are operating at billions-scale. For most enterprise RAG pipelines with moderate scale, any of these will work. The decision comes down to managed vs self-hosted and your team's operational capacity.

> **Key Point:** Pinecone is exploring acquisition with Oracle, IBM, MongoDB, and Snowflake reportedly in talks. The vector database market is consolidating. Purpose-built vector storage is being absorbed into broader data platforms. If you are choosing a vector database today, consider whether it will exist as an independent product in two years.

## Gateways productize the middleware

A new infrastructure category has emerged: LLM gateways that sit between your application and the model providers, handling routing, cost tracking, fallbacks, rate limiting, and observability. Gartner's Market Guide for AI Gateways (October 2025) recognized the category, describing AI gateways as "middleware," and projects that by 2028, 70% of software engineering teams building multimodel applications will use AI gateways to improve reliability and optimize costs, up from 25% in 2025. The category exists because every organization calling multiple LLM providers ends up building the same middleware: retry logic, fallback chains, cost attribution, usage tracking. LLM gateways productize that middleware.

### LiteLLM

LiteLLM is the most widely adopted open-source LLM gateway at 48.6K+ GitHub stars. It provides a unified API interface across 100+ LLM providers: OpenAI, Anthropic, Google, AWS Bedrock, Azure, Ollama, and dozens more. Every provider gets normalized to the OpenAI chat completions format, so your application code calls one API regardless of which model serves the request. The proxy adds minimal overhead relative to inference latency, which is measured in seconds for most LLM workloads.

Where LiteLLM shines is operational control.
Per-model and per-team spend tracking gives finance teams the cost attribution they need. Automatic fallback chains (if Anthropic is down, route to OpenAI; if OpenAI rate-limits, fall back to Bedrock) keep applications available without application-level retry logic. Budget limits can hard-cap spend per team or per model, preventing a single runaway experiment from burning through your monthly API budget. The configuration is YAML-based, version-controllable, and can be updated without redeploying application code.

- [LiteLLM](https://github.com/BerriAI/litellm)

### Portkey

Portkey targets enterprise deployments with a 99.99% uptime SLA, MCP integration for agent workflows, and support for 1,600+ LLMs. It positions itself as the production-grade alternative to open-source gateways, with features like automatic prompt caching (saves 60-80% on repeated prompt prefixes), semantic caching for similar queries, and built-in guardrails that filter PII, enforce content policies, and detect prompt injection before requests reach the model. For organizations with strict SLA requirements and the budget for a commercial solution, Portkey reduces the operational burden of running gateway infrastructure.

- [Portkey](https://portkey.ai/)

### Helicone

Helicone takes an observability-first approach to the LLM gateway problem. It provides detailed analytics on LLM usage: latency distributions, cost breakdowns by model and team, error rates, token consumption patterns, and full prompt/response logging with PII redaction. The integration is a single line: change your base URL to Helicone's proxy and every request gets logged, analyzed, and dashboarded automatically. A free self-hosted option means no data leaves your infrastructure. For organizations that already have routing handled but need visibility into their LLM spend and performance, Helicone fills the observability gap without requiring a full gateway migration.
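
The gateway behaviors described in this section (ordered fallback chains, hard per-team budget caps, and per-request cost attribution) reduce to a small amount of logic. The sketch below is a minimal, synchronous illustration in TypeScript, not the API of LiteLLM, Portkey, or Helicone: `createGateway`, `Provider`, the stub providers, and the prices are all hypothetical, and real provider calls would be asynchronous.

```typescript
// Hypothetical types for illustration; not any gateway product's real API.
type Provider = {
  name: string;
  usdPerMillionTokens: number;
  call: (prompt: string) => string; // throws when the provider is down or rate-limited
};

type UsageRecord = { team: string; provider: string; tokens: number; costUsd: number };

type Gateway = {
  complete: (team: string, prompt: string, tokens: number) => string;
  usage: UsageRecord[];
};

function createGateway(chain: Provider[], budgetUsd: Map<string, number>): Gateway {
  const usage: UsageRecord[] = [];
  const spent = new Map<string, number>();

  function complete(team: string, prompt: string, tokens: number): string {
    // Hard cap: refuse before calling any provider once the team's budget is used up.
    const soFar = spent.get(team) ?? 0;
    if (soFar >= (budgetUsd.get(team) ?? 0)) {
      throw new Error(`budget exhausted for team "${team}"`);
    }

    // Fallback chain: try providers in order and return the first success.
    const failures: string[] = [];
    for (const provider of chain) {
      try {
        const output = provider.call(prompt);
        const costUsd = (tokens / 1_000_000) * provider.usdPerMillionTokens;
        spent.set(team, soFar + costUsd);
        usage.push({ team, provider: provider.name, tokens, costUsd }); // cost attribution
        return output;
      } catch (err) {
        failures.push(`${provider.name}: ${String(err)}`);
      }
    }
    throw new Error(`all providers failed (${failures.join("; ")})`);
  }

  return { complete, usage };
}

// Two stub providers: the first always fails, the second answers.
const gateway = createGateway(
  [
    { name: "primary", usdPerMillionTokens: 3, call: () => { throw new Error("503"); } },
    { name: "fallback", usdPerMillionTokens: 2.5, call: (p) => `echo: ${p}` },
  ],
  new Map([["research", 5]]),
);

const reply = gateway.complete("research", "hello", 1_000_000);
console.log(reply, gateway.usage);
```

The budget check runs before any provider is called, so a team that has hit its cap fails fast instead of consuming a fallback provider's quota.
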

- [Helicone](https://www.helicone.ai/)

**LLM API pricing: cost per 1M tokens (USD)**

- GPT-4o: $2.50
- Claude Sonnet 4.5: $3.00
- Claude Haiku 4.5: $1.00
- Gemini 2.0 Flash: $0.10
- Mistral Large: $2.00

> **WARNING: You are already building a gateway**
> If you are calling multiple LLM providers without a gateway, you are building one implicitly. Every retry handler, every fallback chain, every cost tracking spreadsheet, every rate limiter is gateway logic scattered across your application code. The question is whether you want a purpose-built gateway or an accidental one embedded in your application.

## Local and edge inference

Ollama has become the de facto standard for local LLM deployment. It provides a simple CLI and API for running open-weight models (Llama, Mistral, Gemma, Phi, Qwen, DeepSeek, and dozens more) on developer laptops and edge devices. According to Hostinger, 42% of developers are now running LLMs locally for privacy, cost reduction, or offline access. Ollama's model library has grown to 100+ models, and its Docker-like pull/run interface makes it trivially easy to switch models. Ollama Cloud launched to bridge the gap between local development and enterprise deployment.

The performance of local models has improved dramatically. Quantized 7-8B parameter models (Llama 3.1 8B, Mistral 7B, Phi-3) run at 30-50 tokens per second on M-series MacBooks, fast enough for interactive development workflows. Larger models (70B parameters) are viable on machines with 64GB+ RAM, producing quality comparable to GPT-3.5 Turbo for many tasks. The practical implication: code completion, documentation generation, commit message drafting, and log analysis can all run locally with zero API cost and zero network latency.

- [Ollama](https://ollama.com/)
- [Ollama Model Library](https://ollama.com/library)

Platform engineers should care about local inference because it is happening whether they provide infrastructure for it or not.
Developers are downloading multi-gigabyte models to their laptops, running inference on hardware that is not monitored, governed, or cost-tracked. The models running locally may not meet compliance requirements. The outputs are not logged. The resource consumption is invisible to capacity planning.

The pragmatic response is not to ban local inference. It solves real problems around latency, privacy, and cost for development workflows. The pragmatic response is to standardize it. Provide an approved model registry. Set up Ollama configurations that point to sanctioned model sources. Include local inference in your AI governance framework. Make the sanctioned setup the easiest one to use.

> **INFO: Local inference is a platform concern**
> Ollama matters for platform teams because developers are running models locally whether you provide infrastructure or not. Better to standardize than pretend it is not happening. Provide approved model registries, managed configurations, and governance frameworks that make local inference a supported capability rather than shadow IT.

## MLOps is consolidating fast

The MLOps layer is the glue between data science and production. It covers experiment tracking, model versioning, pipeline orchestration, and deployment management: the operational infrastructure that ensures models get from notebooks to production endpoints reliably and reproducibly.

### Weights & Biases

Weights & Biases has 700K users and 1,000+ enterprise customers including OpenAI, Toyota, and Volkswagen. OpenAI alone tracks 2,000+ projects and millions of experiments on the platform. The core product is deceptively simple: add a few lines of code to your training script and every hyperparameter, metric, system resource, and artifact is automatically logged, versioned, and visualized. The real power is in experiment comparison: overlay hundreds of training runs to find the configuration that maximizes accuracy while minimizing compute cost.
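
To make "maximizes accuracy while minimizing compute cost" concrete: with two competing objectives there is usually no single best run, only a set of runs that no other run beats on both axes (the Pareto front). The sketch below illustrates that selection logic in TypeScript, assuming each run has already been exported as an accuracy figure and a GPU-hour figure. It is not W&B's API, and the `Run` type and sample values are hypothetical.

```typescript
// Hypothetical shape for an exported training run.
type Run = { id: string; accuracy: number; gpuHours: number };

// `a` dominates `b` when it is at least as accurate and at least as cheap,
// and strictly better on at least one of the two.
function dominates(a: Run, b: Run): boolean {
  return (
    a.accuracy >= b.accuracy &&
    a.gpuHours <= b.gpuHours &&
    (a.accuracy > b.accuracy || a.gpuHours < b.gpuHours)
  );
}

// Keep only the runs that no other run dominates.
function paretoFront(runs: Run[]): Run[] {
  return runs.filter((run) => !runs.some((other) => dominates(other, run)));
}

// Made-up sample data for illustration.
const runs: Run[] = [
  { id: "a", accuracy: 0.9, gpuHours: 10 },
  { id: "b", accuracy: 0.92, gpuHours: 40 },
  { id: "c", accuracy: 0.89, gpuHours: 12 }, // worse and costlier than "a"
  { id: "d", accuracy: 0.92, gpuHours: 55 }, // same accuracy as "b", costlier
];

const front = paretoFront(runs);
console.log(front.map((run) => run.id));
```

Which run to pick from the front is a budget decision rather than a technical one: here it is whether two extra points of accuracy are worth four times the GPU hours.
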

The most significant development: CoreWeave acquired W&B in 2025, signaling that compute providers want to own the full AI stack from GPU clusters through experiment tracking to model serving. W&B provides experiment tracking, model registry, dataset versioning, and collaborative dashboards that have become the standard workflow for ML teams. The acquisition suggests the future of MLOps is vertically integrated: your GPU provider also manages your experiment tracking, model registry, and deployment pipeline.

- [Weights & Biases](https://wandb.ai/)

### MLflow

MLflow is the de facto open-source MLOps standard with 26K+ GitHub stars and backing from the Linux Foundation. It provides four core components: Tracking (log parameters, metrics, and artifacts), Models (package models in a standard format that deploys anywhere), Registry (centralized model versioning with staging/production lifecycle), and Projects (reproducible runs with environment specifications). The key advantage is portability. MLflow experiments and model artifacts are portable across any infrastructure, from a laptop to any cloud provider, with no vendor lock-in.

MLflow 2.x added native LLM support: prompt engineering UI, LLM evaluation metrics (toxicity, relevance, faithfulness), and a tracing API for debugging multi-step agent workflows. The same platform that tracks traditional ML experiments can now track LLM prompt iterations, RAG pipeline configurations, and agent chain execution. The trade-off is that MLflow requires more operational investment than managed alternatives: you run the tracking server, manage the artifact store (typically S3/GCS), and handle scaling yourself.

- [MLflow](https://mlflow.org/)
- [MLflow GitHub](https://github.com/mlflow/mlflow)

### Kubeflow

Kubeflow is Kubernetes-native ML orchestration for organizations with mature Kubernetes platforms. Its pipeline orchestration (Kubeflow Pipelines) defines ML workflows as DAGs of containerized steps.
Each step is an isolated container with explicit inputs and outputs, making pipelines reproducible and debuggable. The Training Operator provides custom resources for distributed training across PyTorch, TensorFlow, MPI, and JAX workloads. Katib handles automated hyperparameter tuning with Bayesian optimization, grid search, and neural architecture search. Kubeflow is the right choice for enterprises that have invested heavily in Kubernetes and want to manage ML workloads with the same patterns and tools they use for everything else.

- [Kubeflow](https://www.kubeflow.org/)
- [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/)

**MLOps tool adoption (enterprise ML teams)**

- MLflow: 68%
- Weights & Biases: 45%
- Kubeflow: 22%
- SageMaker: 38%
- Vertex AI: 19%
- Custom/Internal: 31%

> **TIP: The CoreWeave signal**
> The W&B acquisition by CoreWeave signals that compute providers want to own the full AI stack: from GPU clusters through experiment tracking to model deployment. Expect more consolidation. Independent MLOps tools will either be acquired by compute or cloud providers, or they will need to differentiate on features that the platforms cannot easily replicate. Choose your MLOps stack knowing that the vendor landscape will look different in two years.

## Assembling the stack

The AI infrastructure stack has six layers: model serving (vLLM, KServe, Triton), GPU orchestration (GPU Operator, Run:ai), vector storage (Pinecone, Weaviate, Qdrant, Milvus), LLM routing (LiteLLM, Portkey, Helicone), local inference (Ollama), and MLOps (W&B, MLflow, Kubeflow). No organization needs all of these. Every organization deploying AI to production needs most of them.

The reference architecture for a mid-to-large enterprise looks like: vLLM inside KServe for model serving, NVIDIA GPU Operator for hardware management, a vector database matched to your scale and hosting requirements, LiteLLM or Portkey for multi-provider routing, and W&B or MLflow for experiment tracking. Each layer has a managed and a self-hosted option. The choice at each layer depends on the same factors it always has: team size, operational maturity, compliance requirements, and budget.

The most underserved layer right now is GPU orchestration. Model serving has vLLM. Vector storage has multiple mature options. LLM routing is increasingly commoditized. GPU scheduling, sharing, and utilization optimization are still held together with duct tape and YAML at most organizations. The tooling is where container orchestration was in 2016: functional but painful. Run:ai is the closest thing to a real solution, and NVIDIA had to acquire it because nobody else was building it. That gap is where platform engineers can have the most impact today.

This stack connects directly to the second-order explosion problem: as AI makes building software cheaper, more software gets built, more models get deployed, and more inference infrastructure gets required. The GPU bill scales with adoption, not with headcount. When inference workloads span cloud providers, using the best model from each, the multi-cloud networking and routing complexity compounds. Every layer of this stack is a response to workloads that did not exist three years ago.

- [AI Creates Software Faster Than Ops Can Handle](https://www.stxkxs.io/blog/second-order-explosion)

> **GPU orchestration is the layer still held together with duct tape. The teams that solve scheduling, sharing, and utilization win the next decade on cost.**

> **Key Point:** Evaluate each layer against your actual constraints: team size, Kubernetes maturity, multi-cloud requirements, and whether you need managed or self-hosted.
The worst architecture is one chosen for theoretical elegance that your team cannot operate. The best architecture is one that your team can deploy, monitor, and debug at 3 AM when inference latency spikes and the GPU bill doubles.

## Most teams should not build this

Self-hosting six layers earns its keep at high, sustained inference volume where the GPU bill dwarfs the salary cost of the platform engineers running it. Below that threshold the math inverts. A small team that stands up vLLM, a GPU Operator, a vector database, a gateway, and an MLOps stack is paying for idle GPUs and on-call rotations to serve traffic a hosted API would handle for a metered fee.

If your AI strategy is "call the OpenAI API," do not assemble this stack. Your platform mandate is cost allocation, rate limiting, API key management, and vendor negotiation. A single LLM gateway (LiteLLM or Portkey) covers routing, fallbacks, and per-team spend without a serving layer, a GPU pool, or an MLOps platform underneath it. The Scope note at the top of this post draws the same line: the six-layer stack is for teams running self-hosted or hybrid inference, not for API consumers.

A managed end-to-end platform (SageMaker, Vertex AI, Bedrock, or a compute provider that bundles serving with experiment tracking) beats assembling six open-source layers when your team lacks Kubernetes operators on staff, when inference volume is too low or too spiky to keep GPUs busy, or when you would rather pay a margin than carry the operational load. The CoreWeave acquisition of W&B points the same direction: vertically integrated platforms are absorbing the layers most teams do not want to wire together themselves.

> **WARNING: Assembly is a cost, not a default**
> Six self-hosted layers buy control, vendor neutrality, and data sovereignty. They cost operators, on-call, and idle capacity.
Self-host the layers where the control is worth the load (often GPU orchestration and serving, where the bill is largest) and buy the rest. Start managed, measure the bill, and pull a layer in-house only when the savings clear the cost of running it.

> **Key Point:** The threshold for self-hosting is sustained inference volume large enough that the GPU bill exceeds what it costs to operate the stack. Until you cross it, a managed platform or a thin gateway over hosted APIs is the cheaper and more reliable choice. Assembling all six layers because the architecture is interesting is how teams end up paying to operate infrastructure they did not need.

## Resources & Further Reading

- vLLM: https://github.com/vllm-project/vllm - Open-source LLM serving engine with PagedAttention, 81.3K GitHub stars
- KServe: https://kserve.github.io/website/ - CNCF incubating ML serving platform for Kubernetes
- NVIDIA Triton Inference Server: https://developer.nvidia.com/triton-inference-server - Enterprise-grade multi-framework model serving
- TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM - NVIDIA's LLM optimization and serving library
- NVIDIA GPU Operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/ - Kubernetes-native GPU lifecycle management
- Run:ai: https://www.run.ai/ - Dynamic GPU orchestration and pooling platform (acquired by NVIDIA)
- Pinecone: https://www.pinecone.io/ - Fully managed vector database for AI applications
- Weaviate: https://weaviate.io/ - Open-source AI-native vector database
- Qdrant: https://qdrant.tech/ - High-performance vector search engine written in Rust
- Milvus: https://milvus.io/ - Open-source vector database for billions-scale deployments
- LiteLLM: https://github.com/BerriAI/litellm - Unified LLM gateway supporting 100+ providers
- Portkey: https://portkey.ai/ - Enterprise AI gateway with 99.99% uptime SLA
- Helicone: https://www.helicone.ai/ - Open-source LLM observability and gateway
- Ollama: https://ollama.com/ -
Local LLM deployment for development and edge inference
- Weights & Biases: https://wandb.ai/ - ML experiment tracking and model management (acquired by CoreWeave)
- MLflow: https://mlflow.org/ - Open-source MLOps platform for the ML lifecycle
- Kubeflow: https://www.kubeflow.org/ - Kubernetes-native ML orchestration platform

---

# The Multi-Cloud Stack

- **URL**: https://www.stxkxs.io/blog/aws-multicloud-reinvent
- **Published**: 2026-02-07
- **Author**: Brandon Stokes
- **Category**: infrastructure
- **Tags**: aws, multi-cloud, reinvent, cloud-networking, platform-engineering, bedrock, ai-agents, infrastructure, data-analytics, data-pipelines
- **Reading time**: 15 min

Managed cross-cloud connectivity from AWS and Google Cloud turns multi-cloud from a networking project into a configuration change. Here's what becomes possible when clouds connect natively: cross-cloud AI inference chains, federated data pipelines, and agent orchestration that spans providers.

## The pain you already know

If you have ever set up the colo-and-BGP version of cross-cloud connectivity, you know it takes six to eight weeks. You open a ticket with the colo provider. You wait two weeks for a cross-connect to get provisioned. You negotiate IP space with another team that manages the other cloud. You configure BGP, test failover, argue about who owns the MACsec config, and then spend a month proving to security that the whole thing is encrypted end-to-end. By the time traffic flows, the project that needed it has already shipped a workaround over the public internet with a VPN tunnel held together by cron jobs and prayer.

That experience is why most multi-cloud architectures are accidental. Teams do not wake up and choose to run three clouds. They inherit an Azure tenant from a merger, they need a GCP service that AWS does not offer, or a vendor deploys into a cloud they did not pick. The networking between those clouds is always an afterthought.
Because the networking is painful, architects design around it instead of through it. Workloads get duplicated. Data gets copied. Teams build worse versions of things that already exist on another cloud because the network path is too fragile or too slow to depend on.

That constraint just broke. At re:Invent 2025, AWS shipped Interconnect: a managed, MACsec-encrypted, quad-redundant service that connects AWS directly to Google Cloud, with Azure support announced for 2026. Google shipped Cross-Cloud Interconnect two years earlier. Now that both sides of the connection are managed services with SLAs, the provisioning model changes from "infrastructure project" to "configuration change." Select the target cloud, choose the region pair, specify bandwidth. Minutes, not weeks. That is a new architectural primitive: cross-cloud connectivity you provision in minutes from an API.

> **Key Point:** Cross-cloud networking used to be a project. Now it is a configuration change. When connecting to another cloud is as easy as connecting to another region, the set of architectures you can reasonably build expands dramatically.

## Data gravity is the real architecture

Before we talk about what to build across clouds, we need to talk about what not to move. Data gravity is the single most important concept in multi-cloud architecture, and cross-cloud connectivity discussions tend to skip it. Your data is heavy. Not in bytes; in dependencies. A 50TB dataset in S3 has Lambda functions reading from it, Glue jobs transforming it, Athena queries scanning it, and downstream services that depend on all of those. Moving that data to GCS does not just mean copying 50TB. It means rebuilding every integration that touches it.

The insight that changes how you design multi-cloud systems: move compute to data, not data to compute. Managed interconnect makes this practical for the first time.
Instead of replicating your dataset to the cloud that has the best processing engine, you run that engine's compute against your data over private interconnect. The data stays where its dependency graph lives. The compute travels on a 100 Gbps encrypted link with sub-2ms latency. Moving compute to the data changes which cloud owns what, not just the transfer bill.

When you wire this up, the architecture has a clear shape: each cloud owns the data that its ecosystem depends on. Cross-cloud traffic is compute-to-data queries and model artifact transfers, not bulk data replication. Your S3 data stays on AWS. Your BigQuery tables stay on GCP. Your Azure Data Lake stays on Azure. What crosses the interconnect is inference requests, query results, and trained model weights: high-value, relatively low-volume traffic compared to moving the underlying datasets.

> **WARNING: The costly mistake**
> The most expensive multi-cloud architecture is the one that fights data gravity. If your design starts with "first, we replicate everything to a central lake," you are going to spend more on data transfer and synchronization than you save by using best-of-breed services. Design around the asymmetry: interconnect makes moving compute cheap, but moving data is never free.

**Where enterprises run analytics workloads**

- Event Streaming: 64%
- Data Warehouse: 41%
- BI / Reporting: 52%
- ML Training: 38%
- Data Lakes: 58%

No single cloud dominates every workload. AWS leads in event streaming and data lakes. GCP leads in warehousing for teams that picked BigQuery. Azure leads in BI because Power BI is already everywhere. Most organizations are not choosing best-of-breed. They are stuck with wherever the data landed first. Managed interconnect turns "stuck with" into "orchestrated across."

## Three architectures worth building

Each of these was impractical before managed interconnect because the network path between the clouds was too slow or too exposed to depend on.
With a private link in place, they are clean enough to operate.

### Split training and inference

Your ML team wants to fine-tune a model. Your application stack, your data pipelines, your MLOps tooling, your feature store are all on AWS. GCP's TPU v5p pods deliver roughly 2x the training throughput per dollar for transformer architectures compared to GPU alternatives. Before managed interconnect, you had two bad options: duplicate your entire training pipeline on GCP, or accept worse economics on AWS. Both are common. Both are wasteful.

The clean architecture: training data stays in S3. A cross-cloud pipeline streams training batches over interconnect to TPU pods on GCP. The model trains on GCP where the economics are best. When training completes, the model artifact (tens to hundreds of gigabytes) transfers back over the 100 Gbps dedicated link to AWS. Inference runs on SageMaker endpoints or Inferentia chips, right next to the application stack that calls them. The model artifact transfer that used to bottleneck through public internet now takes minutes instead of hours. Your MLOps team manages one pipeline, not two parallel stacks.

### Cross-cloud inference chains

Foundation models have differentiated strengths, and anyone building anything serious with AI already knows this. Claude on Bedrock is exceptional at nuanced reasoning and long-context analysis. GPT on Azure OpenAI has strong structured output and function calling. Gemini on Vertex AI brings native multimodal grounding with Google's knowledge graph. Teams that standardize on one model accept its limitations. The teams that use multiple models route API calls over the public internet, adding 50-150ms per hop and exposing data in transit.

Picture a single inference chain running entirely over private interconnect. A document comes in. Claude handles the initial reasoning and extraction. It is the best at understanding what the document means.
The structured output routes to GPT for schema-validated extraction into your domain types. If the document contains images, charts, or diagrams, Gemini handles multimodal validation. Each hop adds single-digit milliseconds, not triple-digit. The data never touches the public internet. You are using each model for what it is genuinely best at, not compromising on one because the network path to the others was too slow or too exposed.

- **Cross-Cloud Latency**: <2ms — Typical inter-cloud latency on managed interconnect vs 50-150ms public internet
- **AI-Driven Multi-Cloud**: 82% — Executives expecting AI workloads to increase multi-cloud demand
- **Model Transfer Speed**: 100 Gbps — Maximum bandwidth for model artifact transfers on dedicated interconnect

### Federated streaming analytics

This one is closest to my heart because I have built the ugly version. You have event ingestion on AWS (Kinesis, MSK, EventBridge) because that is where your application emits events. You want to analyze those events in BigQuery because nothing else gives you serverless, petabyte-scale, SQL-native analytics with zero infrastructure management. You want the computed features to flow back to SageMaker endpoints for real-time inference serving.

The old way: you run Kinesis Firehose to dump events into S3, set up a cross-account copy to GCS, then BigQuery ingests from GCS with a delay measured in minutes to hours. Features computed in BigQuery get exported back to S3, loaded into a SageMaker feature store, and served to endpoints. The round-trip latency kills anything that needs to be real-time. You end up building a parallel Flink pipeline on AWS just to avoid the cross-cloud hop, duplicating logic that BigQuery ML could handle in a SQL query.

The clean version: events stream from Kinesis cross-cloud to GCP Dataflow for processing.
Dataflow (the most mature serverless stream processor, with exactly-once semantics and native BigQuery output) transforms and routes to BigQuery for analytics and feature computation. BigQuery ML runs your feature engineering in SQL alongside the analytical queries. Computed features flow back over interconnect to SageMaker endpoints. The round-trip adds milliseconds, not minutes. One pipeline, not two parallel stacks with duplicated logic.

> **TIP: Start with the workaround**
> The clearest cross-cloud opportunity is wherever your team already has a workaround. If you are copying data between clouds on a schedule, routing through public APIs when a private link would be faster, or running a worse version of a service because the better one lives on another cloud, those are the pipelines where managed interconnect delivers immediate value. Start there, not with a greenfield architecture.

## What implementation looks like

One of the things that makes managed interconnect different from the colo-and-BGP approach is how it fits into infrastructure-as-code. When cross-cloud connectivity is a managed resource, it becomes a CDK construct. That is a bigger deal than it sounds. Cross-cloud networking goes through the same PR review, the same drift detection, the same CI/CD pipeline as the rest of your infrastructure. No more side-channel tickets to the networking team.

```typescript (cross-cloud-interconnect.ts)
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

// Illustrative props: the route below needs an existing transit gateway route table.
export interface CrossCloudInterconnectProps {
  transitGatewayRouteTableId: string;
}

// Conceptual CDK construct for AWS Interconnect
// Based on announced API surface at re:Invent 2025
export class CrossCloudInterconnect extends Construct {
  constructor(scope: Construct, id: string, props: CrossCloudInterconnectProps) {
    super(scope, id);

    // Cross-cloud interconnect to Google Cloud.
    // The resource type and property names here are conceptual, not a published schema.
    const interconnect = new cdk.CfnResource(this, 'GcpInterconnect', {
      type: 'AWS::NetworkManager::CrossCloudInterconnect',
      properties: {
        targetProvider: 'GCP',
        targetRegion: 'us-central1',
        awsRegion: 'us-west-2',
        bandwidthGbps: 10,
        encryption: 'MACSEC',
        redundancy: 'QUAD', // Four independent physical paths
      },
    });

    // Route for cross-cloud traffic. CfnResource passes properties through verbatim,
    // so this real resource type needs CloudFormation's PascalCase property names.
    new cdk.CfnResource(this, 'CrossCloudRoute', {
      type: 'AWS::EC2::TransitGatewayRoute',
      properties: {
        DestinationCidrBlock: '10.128.0.0/16', // GCP VPC range
        TransitGatewayRouteTableId: props.transitGatewayRouteTableId,
        // Assumes the conceptual resource's Ref resolves to a transit gateway attachment ID
        TransitGatewayAttachmentId: interconnect.ref,
      },
    });
  }
}
```

That is the entire cross-cloud link in a CDK construct. Compare it to the Terraform modules and manual runbooks you need for a colo cross-connect. The thing that makes this powerful is that the interconnect is now a versioned, reviewable, deployable resource. You can spin up a dev interconnect for testing and tear it down when you are done. You can parameterize the bandwidth and region pair per environment. Cross-cloud connectivity becomes a build-time decision, not a procurement process.

AWS also published the interconnect specification as an open spec on GitHub, which is a genuinely interesting move. Other providers can implement compatible endpoints. Whether they will is a different question, but the spec being open shifts the conversation from "AWS proprietary service" to "potential industry standard." That matters for teams who need to justify multi-cloud connectivity to leadership that is allergic to vendor lock-in.

## The orchestration layer

Cross-cloud connectivity is the foundation.
Cross-cloud data pipelines are the plumbing. The orchestration layer that ties them together got less attention than it deserves: Bedrock AgentCore. Generally available since October 2025, it is a managed runtime for AI agents that is explicitly, deliberately multi-everything. Multi-framework. Multi-model. Multi-cloud. That multi-everything support is the core architectural decision, not a marketing label.

AgentCore runs agents built with CrewAI, LangGraph, LlamaIndex, Google's Agent Development Kit, the OpenAI Agents SDK, and Anthropic's Claude SDK. It supports Claude, GPT, Gemini, Llama, and any model accessible via API. It implements both Google's A2A protocol for agent-to-agent communication and Anthropic's MCP through a native MCP Gateway. The result is a single runtime where a LangGraph agent using Claude can discover and communicate with a CrewAI agent using GPT, and both can access tools through a centralized governance layer.

```typescript (agentcore-multi-framework.ts)
// Bedrock AgentCore: Multi-framework agent deployment
// Based on announced API surface at re:Invent 2025
// Conceptual: request fields are illustrative and may not match the shipped SDK
import {
  BedrockAgentCoreClient,
  CreateAgentRuntimeCommand,
} from '@aws-sdk/client-bedrock-agentcore';

const client = new BedrockAgentCoreClient({ region: 'us-west-2' });

// Deploy a LangGraph agent with Claude on AgentCore
const langGraphAgent = await client.send(
  new CreateAgentRuntimeCommand({
    agentName: 'research-agent',
    framework: 'LANGGRAPH',
    modelId: 'anthropic.claude-sonnet-4-5-20250929-v1:0',
    mcpServers: [
      {
        name: 'company-docs',
        uri: 'https://mcp.internal.company.com/docs',
      },
    ],
    memoryConfig: {
      type: 'SEMANTIC',
      retentionDays: 90,
    },
    guardrails: {
      contentFilters: ['HARMFUL_CONTENT', 'PII_DETECTION'],
      maxTokensPerTurn: 4096,
    },
  })
);

// Deploy a CrewAI agent with GPT on the same platform
const crewAgent = await client.send(
  new CreateAgentRuntimeCommand({
    agentName: 'analysis-crew',
    framework: 'CREWAI',
    modelId: 'openai.gpt-4o',
    a2aConfig: {
      enabled: true, // Enable A2A protocol for cross-agent communication
      discoverable: true,
    },
  })
);
```

Here is why this matters for cross-cloud architecture. An agent on AgentCore can orchestrate a pipeline that calls Bedrock for reasoning, routes to Vertex AI for multimodal grounding, pulls structured data from Azure Cognitive Services, and coordinates with agents running on other clouds via A2A, all over private interconnect. The agent runtime becomes the control plane for cross-cloud AI, not just a deployment target. When you combine AgentCore with managed interconnect, agents become the glue layer between cloud-specific services that previously required custom integration code.

> **WARNING: Agent sprawl is Lambda sprawl all over again**
> The pattern should look familiar. A new abstraction makes creation cheap. Organizations create prolifically. Operational complexity follows. Agent sprawl will mirror Lambda sprawl: dozens of agents with unclear ownership, undocumented tool access, and unpredictable cross-agent interactions. AgentCore's MCP Gateway helps by centralizing tool access governance, but the organizational practices (ownership requirements, discovery infrastructure, deprecation pathways) matter more than any platform feature. If you lived through the "everyone deploys Lambdas, nobody owns them" era, you already know what to do.

## When multi-cloud is resume-driven

Most teams should not do this. The re:Invent hype cycle makes everything sound like a good idea, and managed interconnect is going to tempt a lot of organizations into multi-cloud architectures they do not need and cannot operate.

Multi-cloud has a clear value proposition when you need genuinely differentiated capabilities from multiple providers. It does not have a clear value proposition when you are adding clouds for the sake of adding clouds, padding a team's resume, or satisfying a CTO who read a Gartner report on the plane.
I have seen teams build cross-cloud pipelines that could have been a single SageMaker endpoint, and "multi-cloud strategies" that were really just a lack of organizational decision-making dressed up as technical sophistication. The framework: multi-cloud is worth it when you have a workload that is genuinely better on another cloud and the delta is large enough to justify the operational overhead. Training on TPUs when you are 2x more cost-efficient: worth it. Running BigQuery for analytics when you already have Redshift and the performance difference is marginal for your query patterns: probably not worth it. Using three different foundation models in an inference chain because each excels at a specific task: worth it. Using three different clouds because three different teams each picked their favorite is an org-chart problem dressed as architecture. > **Multi-cloud is a tool, not a strategy. The strategy is using the best capability for each workload. Sometimes that means two clouds. Sometimes it means one.** The litmus test: if you cannot articulate the specific technical advantage you gain from each cloud in your architecture, you are adding operational complexity without architectural benefit. Managed interconnect makes multi-cloud easier, but easier does not mean free. You still need cross-cloud identity federation, unified observability, consistent security policies, and teams that understand multiple cloud platforms. That operational cost is real, and it needs to be justified by real capability differences. - **Avg Cloud Providers**: 2.4 — Average number of public cloud providers per organization (Flexera 2025) - **Intentional Multi-Cloud**: 84% — Organizations running multi-cloud by strategic choice, not accident - **AI-Driven Multi-Cloud**: 82% — Executives expecting AI workloads to increase multi-cloud demand ## AWS wants to be the control plane Step back and look at what AWS built in aggregate. Cross-cloud connections originate from AWS. 
Agent orchestration runs on AgentCore. The MCP Gateway centralizes tool access on AWS. The hub is always AWS. This is the EKS playbook applied to multi-cloud: acknowledge an industry trend as inevitable, then build the best management layer for it and make sure that management layer runs on your cloud. The strategy is sound, and teams should go in with their eyes open. Google deserves credit here. They shipped Cross-Cloud Interconnect two full years before AWS, and it has been generally available since 2023 with production support for direct connections to AWS, Azure, and Oracle Cloud. Google saw the managed interconnect future early and built for it while AWS was still selling Direct Connect as the answer to everything. Azure's enterprise footprint gives it a natural governance position. Most Fortune 500 companies already run identity and compliance through Microsoft. The fact that all three are building managed interconnect and multi-model orchestration tells you the stack itself is real. Who controls the control plane is still an open fight. The multi-cloud stack has four layers: connectivity provides the private network fabric. AI services provide differentiated model access. Data services provide specialized analytics. Agent orchestration provides the control plane that ties it all together. Each layer emerged because no single cloud can be best-in-class at everything. For most organizations the multi-cloud decision was made years ago. What is unsettled is which cloud's strengths you combine, what the data flow looks like between them, and whether the operational overhead of spanning clouds is justified by the capability delta. ## What to do Monday morning The multi-cloud stack is real, but adopting it all at once is how you end up with a distributed monolith that spans three clouds and is debuggable on none of them. A platform engineer's approach: - Map your cross-cloud workarounds. 
Every VPN tunnel to another cloud, every public API call that should be private, every dataset you copy on a schedule. The workarounds tell you where the architecture wants to be cross-cloud.
- Pick one pipeline. The one where you are working around a model limitation, running a worse analytics engine, or routing through the public internet. Build that one pipeline on managed interconnect. Learn the operational model before you expand.
- Design around data gravity. Ask "which cloud's compute moves to the data" not "which data moves to the compute." If your architecture starts with a bulk data copy, redesign it.
- Fix identity before you fix networking. If your cross-cloud services authenticate with long-lived credentials or manual key rotation, managed interconnect will make connectivity trivial while your security posture remains terrible. Workload identity federation first.
- Deploy OpenTelemetry now. If your monitoring is cloud-native only (CloudWatch here, Cloud Monitoring there), you cannot trace a request across clouds. Cross-cloud observability is not optional when pipelines span providers.
- Experiment with AgentCore on something non-critical. Deploy an agent. Test MCP Gateway governance. Test A2A communication. Form opinions before the technology becomes load-bearing in your stack.
- Model the data transfer costs before you commit. "Easier to provision" does not mean "cheap to operate." Google Cross-Cloud Interconnect port fees are ~$4,032/mo for 10 Gbps ($5.60/hr) and ~$21,600/mo for 100 Gbps ($30/hr). AWS Interconnect-multicloud charges ~$9,001/mo for a 10 Gbps Tier-1 connection ($12.33/hr). The $1K-5K/month range cited for managed interconnect only covers the lowest tier; 100 Gbps links cost ~4x that $5K ceiling. Data transfer adds per-GB egress charges on top of that.
Organizations running serious cross-cloud traffic (training pipelines, streaming analytics, multi-model inference chains) can easily exceed $50K/month in interconnect and transfer costs. Understand which workloads transfer data, how much, and how often. Let the cost model validate your architecture, not surprise you in production. > **INFO: The pattern that works** > Organizations that succeed with multi-cloud share a common approach: they start with one pipeline where cross-cloud solves a real limitation, they build the operational foundation (identity, observability, cost tracking) for that one pipeline, and they expand only when the next pipeline has a clear capability justification. The ones that fail start with a "multi-cloud strategy" and go looking for workloads to justify it. The tools are finally good enough. Managed interconnect, multi-model agent runtimes, cross-cloud data pipelines: the primitives exist to build cross-cloud systems that are clean enough to actually operate. The hard part was never the technology. The hard part is the discipline to use multi-cloud only where the capability delta justifies the complexity, and the platform engineering skill to make it invisible to the teams building on top of it. 
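The cost-modeling step in the checklist above reduces to simple arithmetic: an hourly port fee times the hours in a month, plus per-GB transfer. The sketch below is one way to keep that arithmetic in code. The interface and function names are hypothetical, the hourly fees are the figures quoted in this post, and the $0.02/GB egress rate is a placeholder assumption, not a published price.

```typescript
// Sketch only: interface and function names are hypothetical.
// Hourly port fees are the figures quoted in this post; the egress rate is a placeholder.

interface InterconnectLink {
  name: string;
  hourlyPortFeeUsd: number;
}

interface TransferProfile {
  gbPerMonth: number;
  egressUsdPerGb: number; // ASSUMPTION: replace with your provider's actual per-GB rate
}

// 720 hours (a 30-day month) reproduces the ~$4,032 and ~$21,600 figures above.
const HOURS_PER_MONTH = 720;

function roundToCents(usd: number): number {
  return Math.round(usd * 100) / 100;
}

// Port fee alone, before any data moves.
function monthlyPortFee(link: InterconnectLink, hours: number = HOURS_PER_MONTH): number {
  return roundToCents(link.hourlyPortFeeUsd * hours);
}

// Port fee plus per-GB transfer charges.
function monthlyTotal(
  link: InterconnectLink,
  transfer: TransferProfile,
  hours: number = HOURS_PER_MONTH
): number {
  const egress = transfer.gbPerMonth * transfer.egressUsdPerGb;
  return roundToCents(link.hourlyPortFeeUsd * hours + egress);
}

const gcp10g: InterconnectLink = { name: 'Cross-Cloud Interconnect, 10 Gbps', hourlyPortFeeUsd: 5.6 };
const gcp100g: InterconnectLink = { name: 'Cross-Cloud Interconnect, 100 Gbps', hourlyPortFeeUsd: 30 };

console.log(monthlyPortFee(gcp10g));  // 4032
console.log(monthlyPortFee(gcp100g)); // 21600

// 200 TB/month over the 100 Gbps link at the placeholder $0.02/GB rate
const heavyPipeline: TransferProfile = { gbPerMonth: 200000, egressUsdPerGb: 0.02 };
console.log(monthlyTotal(gcp100g, heavyPipeline)); // 25600
```

Running each candidate pipeline through a function like this before provisioning makes the port fee and the transfer volume visible as separate line items, which is where most surprises come from.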
## Resources & Further Reading

- AWS Interconnect Announcement: https://aws.amazon.com/about-aws/whats-new/2025/11/preview-aws-interconnect-multicloud/ - Preview announcement for managed cross-cloud networking
- AWS Interconnect Architecture: https://aws.amazon.com/blogs/networking-and-content-delivery/build-resilient-and-scalable-multicloud-connectivity-architectures-with-aws-interconnect-multicloud/ - Technical blog on building multicloud connectivity architectures
- Interconnect Open Specification: https://github.com/aws/Interconnect - OpenAPI 3.0 spec for the Connection Coordinator API on GitHub
- Bedrock AgentCore: https://aws.amazon.com/bedrock/agentcore/ - Managed platform for building, deploying, and operating AI agents at scale
- Google Cross-Cloud Interconnect: https://cloud.google.com/network-connectivity/docs/interconnect/concepts/cci-overview - Google Cloud's cross-cloud connectivity service
- Flexera 2025 State of the Cloud: https://www.flexera.com/blog/finops/the-latest-cloud-computing-trends-flexera-2025-state-of-the-cloud-report/ - 89% enterprise multi-cloud adoption, 2.4 average cloud providers
- GCP BigQuery ML: https://cloud.google.com/bigquery/docs/bqml-introduction - SQL-native machine learning in BigQuery
- Apache Iceberg: https://iceberg.apache.org/ - Open table format for cross-cloud data lake federation
- AWS re:Invent 2025 Announcements: https://aws.amazon.com/blogs/aws/top-announcements-of-aws-reinvent-2025/ - Summary of all major announcements from re:Invent 2025

---

# Self-Hosted AI Agents for Incident Response

- **URL**: https://www.stxkxs.io/blog/openclaw-self-hosted-ai-agents
- **Published**: 2026-01-30
- **Updated**: 2026-02-16
- **Author**: Brandon Stokes
- **Category**: engineering
- **Tags**: ai-agents, devops, chatops, mcp, infrastructure, self-hosted, platform-engineering, automation, claude, slack, incident-response
- **Reading time**: 14 min

OpenClaw (formerly Moltbot, originally Clawdbot) is a self-hosted AI agent with 103K+
GitHub stars. Its Slack/Teams integration enables ChatOps workflows that keep infrastructure data on-premises—here's how platform engineers are using it for incident response, on-call automation, and deployment orchestration. ## Self-hosted agents pass security review It is 3am and your phone lights up with a PagerDuty alert. Payment processing is down. You fumble for your laptop, VPN in, authenticate to the cluster, start pulling logs. Twenty minutes later you are finally posting your first finding to the incident channel. Meanwhile the blast radius has grown because nobody had eyes on it. I have lived this loop dozens of times. The ceremony before you can even start debugging is what kills you, not the debugging itself. OpenClaw is a self-hosted AI agent that sits in your Slack or Teams channels and eliminates that ceremony entirely. You type a message from your phone, the agent runs kubectl, parses logs, and posts formatted findings to the incident channel before you have found your laptop charger. It runs in your environment, on your servers, with your permissions. No data leaves your network unless you explicitly route it to a model provider. > **INFO: Clawdbot → Moltbot → OpenClaw** > The project has been renamed twice: originally "Clawdbot" (trademark issues), then "Moltbot" (Jan 27, 2026), now "OpenClaw" (Jan 30, 2026). The only legitimate sources are GitHub (github.com/openclaw/openclaw), npm (openclaw), and the official site (openclaw.ai). OpenClaw matters because it is the first ChatOps AI agent that takes self-hosting seriously enough for regulated environments. Every SaaS AI assistant I have evaluated gets killed in security review. "Where does our infrastructure data go?" is a question with no good answer when the agent phones home to someone else's cloud. OpenClaw runs locally, connects to any model provider including self-hosted Ollama, and integrates with existing tools through MCP. That is a fundamentally different trust model. 
> **INFO: Update: record-breaking growth and OpenAI hire** > OpenClaw has since become the fastest-growing repository in GitHub history, surpassing 369K stars by early May 2026. The project's explosive growth sparked a bidding war between Meta and OpenAI for creator Peter Steinberger, with OpenAI winning the hire in February 2026. Steinberger announced OpenClaw will move to an open-source foundation with continued OpenAI support. - **GitHub Stars**: 369K+ — As of early May 2026; fastest-growing repo in GitHub history - **Setup Time**: 30 seconds — Single npm command to fully operational - **Node Requirement**: 22+ — Modern runtime for optimal performance - **Model Support**: 15+ — Providers including local/self-hosted options ## Local-first architecture The architecture decision that matters most: OpenClaw runs as a Node.js process in your environment. Not a proxy, not an iframe, not a "we promise we do not log your data" SaaS. An actual process you control. When I deploy it for ChatOps, it runs on a dedicated VM with persistent Slack connections and pre-authenticated sessions to our clusters. Data never leaves the perimeter unless I explicitly route model inference to an external API. A common configuration pairs OpenClaw with a current Anthropic flagship like Claude Opus 4.8 for long-context strength and prompt-injection resistance, but OpenClaw is model-agnostic and also supports OpenAI, Google, Azure, AWS Bedrock, and local models via Ollama. This flexibility is the difference between "cool proof of concept" and "actually passes security review." When you wire this up in an air-gapped environment with a self-hosted model, the compliance conversation goes from impossible to straightforward. - **MCP Servers**: 10K+ — Compatible tool ecosystem - **Platforms**: 5 — macOS, Linux, Windows, iOS, Android (per docs.openclaw.ai/platforms) - **Integrations**: Slack/Teams — Native ChatOps support The permission model is what convinced me this was production-ready. 
OpenClaw operates with exactly the permissions of the service account running it, nothing more. When the agent executes kubectl or terraform, it inherits the RBAC of that account. No magic escalation, no ambient authority. You scope it with the same tooling you already use for service accounts, and your existing audit infrastructure captures everything.

For ChatOps deployments, the same agent instance handles interactive queries and scheduled tasks: drift detection, security scans, deployment status reports. A surprising amount of toil lives in "check if X is still true" tasks that you run manually three times a day. An always-on agent that runs those checks on a cron and posts results to a channel is boring infrastructure, but boring infrastructure is the best kind.

## Quick start

Installation is genuinely fast: Node.js 22, one npm command, and you are running. I was skeptical of the "30 seconds to operational" claim until I timed it myself.

```bash (installation.sh)
# Install OpenClaw globally
npm install -g openclaw

# Verify installation
openclaw --version

# First run - will prompt for API key configuration
openclaw

# Or specify provider explicitly
openclaw --provider openai
openclaw --provider ollama --model llama3.2
```

The Slack integration is where it gets interesting. Create a Slack app with socket mode and the required OAuth scopes (chat:write, app_mentions:read, channels:history), then configure OpenClaw with channel-level scoping. The configuration below is close to what I run. Pay attention to the permission boundaries.
```json (openclaw-slack-config.json) { "name": "openclaw-slack", "version": "1.0.0", "platform": "slack", "credentials": { "botToken": "$SLACK_BOT_TOKEN", "appToken": "$SLACK_APP_TOKEN", "signingSecret": "$SLACK_SIGNING_SECRET" }, "permissions": { "allowedChannels": ["#ops", "#incidents", "#deployments"], "allowedUsers": ["@oncall-team"], "commandPrefix": "!claw" }, "sandbox": { "mode": "docker", "networkAccess": false, "maxExecutionTime": 300 } } ``` The channel allowlist means the agent will not respond to casual conversation in #random; it only activates in ops channels. The user allowlist scopes it to your on-call team. The Docker sandbox with network disabled means that even if someone prompt-injects the agent, the blast radius is contained to a stateless container with no egress. When you have been paged at 3am because someone's "helpful automation" did something unexpected, you learn to appreciate defense in depth. ## Security and sandbox modes Deploying an AI agent that executes arbitrary commands against your infrastructure is exactly as dangerous as it sounds. The fact that it is useful does not make it safe by default. OpenClaw provides three sandbox modes (off, non-main, and all), with the choice of execution backend (docker, ssh, or openshell) decided separately. Sandboxing is off by default, so the first thing to do is turn it on: starting from "all" and relaxing toward "non-main" only when you have a specific reason is the right approach. > **WARNING: Security advisory** > Review open security issues on GitHub before production deployment. AI agents executing commands based on natural language is inherently risky. Design defensively: network isolation, minimal filesystem access, explicit permission boundaries, and audit logging shipped to your SIEM. 
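The permission boundaries described above (channel allowlist, user allowlist, command prefix, sandbox with no network) are easy to state and easy to get subtly wrong. As a rough illustration, here is a minimal sketch of a pre-deployment check and a message gate. The field names mirror the example config in this post and are not a verified OpenClaw schema; the `ChatMessage` shape and both function names are hypothetical, and this is not how OpenClaw itself enforces these rules.

```typescript
// Field names mirror the example config in this post, not a verified OpenClaw schema.
interface ChatOpsConfig {
  permissions: {
    allowedChannels: string[];
    allowedUsers: string[];
    commandPrefix: string;
  };
  sandbox: {
    mode: string;
    networkAccess: boolean;
    maxExecutionTime: number; // seconds
  };
}

// Hypothetical message shape, for illustration only.
interface ChatMessage {
  channel: string;
  userGroup: string;
  text: string;
}

// A message is handled only when all three boundaries agree.
function shouldHandle(config: ChatOpsConfig, message: ChatMessage): boolean {
  const { allowedChannels, allowedUsers, commandPrefix } = config.permissions;
  const inAllowedChannel = allowedChannels.indexOf(message.channel) !== -1;
  const fromAllowedUser = allowedUsers.indexOf(message.userGroup) !== -1;
  const hasPrefix = message.text.trim().indexOf(commandPrefix + ' ') === 0;
  return inAllowedChannel && fromAllowedUser && hasPrefix;
}

// Returns a list of baseline problems; an empty list means the config passes.
function baselineViolations(config: ChatOpsConfig): string[] {
  const problems: string[] = [];
  if (config.sandbox.mode === 'off') problems.push('sandbox is off');
  if (config.sandbox.networkAccess) problems.push('sandbox allows network access');
  if (config.permissions.allowedChannels.length === 0) problems.push('no channel allowlist');
  if (config.permissions.allowedUsers.length === 0) problems.push('no user allowlist');
  if (config.permissions.commandPrefix.trim() === '') problems.push('no command prefix');
  return problems;
}

const exampleConfig: ChatOpsConfig = {
  permissions: {
    allowedChannels: ['#ops', '#incidents', '#deployments'],
    allowedUsers: ['@oncall-team'],
    commandPrefix: '!claw',
  },
  sandbox: { mode: 'docker', networkAccess: false, maxExecutionTime: 300 },
};

const opsMessage: ChatMessage = { channel: '#incidents', userGroup: '@oncall-team', text: '!claw check payment-service' };
const casualMessage: ChatMessage = { channel: '#random', userGroup: '@oncall-team', text: '!claw check payment-service' };

console.log(shouldHandle(exampleConfig, opsMessage));    // true
console.log(shouldHandle(exampleConfig, casualMessage)); // false
console.log(baselineViolations(exampleConfig).length);   // 0
```

A check like `baselineViolations` fits naturally in CI for the repository that holds the agent config, so a relaxed sandbox setting has to pass review rather than slip in during testing.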
- **Sandbox Modes**: 3 — off, non-main, all
- **Default Network**: Deny — Network access disabled by default in sandboxed modes
- **Audit Logging**: Full — Every command executed is logged with timestamp

What I have seen go wrong in practice: an engineer runs the agent in unrestricted mode "just for testing," forgets to scope it, and the agent interprets an ambiguous Slack message as a kubectl delete command. Nothing catastrophic happened in that case because the service account did not have delete permissions, but it easily could have. The lesson is that your sandbox configuration and your RBAC are two independent layers of defense, and you want both.

> **TIP: Production security baseline**
> Recommended configuration: Docker sandbox with network disabled, channel-level scoping (agent only responds in designated channels), user allowlists, command prefixes to prevent accidental invocation, and audit logs shipped to your monitoring stack.

OpenClaw has no "explain-only" DM default. Its dmPolicy modes are open, allowlist, pairing, and disabled, with pairing as the security-first default: it gates unknown senders behind a one-time pairing code rather than gating each command behind an explain-then-confirm step. A "confirming" owner-approval mode (the closest thing to explain-and-confirm) was proposed in issue #6262 and closed as not planned on Feb 1, 2026, so it never shipped.

Human-in-the-loop confirmation is still non-negotiable for infrastructure safety, which means enforcing it in your own workflow rather than relying on a dmPolicy setting. I require confirmation even for read-only operations in production, because confirming costs a few seconds and getting it wrong can cost an outage.

## The ChatOps playbook

I have been running OpenClaw in our ops channels for several weeks now. These seven patterns are the ones that actually stuck, the workflows where ChatOps consistently beats the alternative of opening a terminal.

> **Key Point:** The value is eliminating context-switching, not AI-generated insights.
An agent that executes kubectl, parses logs, and posts findings to the incident channel replaces a workflow that previously required SSH, multiple terminals, and manual copy-paste into chat.

### ChatOps incident response

This is the killer use case. Across years of running on-call rotations, the first 15 minutes of every incident went to ceremony: VPN, bastion, cluster auth, figuring out which namespace the service lives in. With OpenClaw in the incident channel, the first responder types one message and gets a structured investigation posted where the whole team can see it. The IC does not ask "what did you find?" because the findings are already in the thread.

```markdown (incident-response-example.md)
**Slack prompt**: !claw check why payment-service pods are crashing in prod

**OpenClaw execution**:
1. kubectl get pods -n payments -l app=payment-service
2. kubectl describe pod <pod-name>
3. kubectl logs <pod-name> --previous --tail=100
4. kubectl get events -n payments --sort-by='.lastTimestamp'

**Response posted to #incidents**:

Payment Service Investigation

Status: 3 of 5 pods in CrashLoopBackOff
Root Cause: OOMKilled - pods exceeding 512Mi memory limit

Recent Logs (last crash):
- Connection pool exhaustion at 14:32:01
- Memory spike to 498Mi at 14:32:03
- OOMKilled at 14:32:04

Recommendation: Increase memory limit to 768Mi or investigate connection pool leak in payment-processor module

Commands executed: 4 | Time: 8.3s
```

The agent does not just dump raw output. It correlates the OOMKilled events with the 512Mi memory limit and the connection pool spike that preceded them, a link a raw log tail would leave you to make yourself. I still verify before acting on the recommendation, but a structured summary instead of four screens of log output saves real time when you are half-asleep.

### On-call triage from your phone

This changed how I think about on-call.
When an alert fires at 3am, the old workflow was: wake up, find laptop, VPN, authenticate, investigate, post findings, decide whether to escalate. Now it is: pick up phone, type one Slack message, read the response, make a decision. I have resolved (or more accurately, dismissed) false positives from bed without ever opening my laptop. That is the difference between going back to sleep and being wired for two hours. ```markdown (oncall-examples.md) **From phone at 3am**: !claw is the api-gateway alert real or a false positive? **OpenClaw checks**: - Current error rate vs baseline - Recent deployments - Upstream service health - Similar alerts in last 24h **Response**: False positive likely. Error rate is 0.3% (baseline 0.2%). Spike caused by single client retry storm, already recovering. No action needed - will auto-resolve. --- **Escalation decision**: !claw should I wake up the database team for this postgres alert? **Response**: Recommend escalation. Replication lag is 45 minutes and growing. Primary is healthy but replica is stuck on a long-running query. The replica serves read traffic for the mobile app - impact will be noticeable within 2 hours if not addressed. ``` "Should I wake someone up?" is a high-stakes judgment call when your own cognition is degraded. Here the agent pulls the replication lag (45 minutes and growing), confirms the primary is healthy, and traces the cause to a long-running query on the read replica that backs the mobile app. It does not replace your judgment; it hands you the data you would otherwise spend 20 minutes gathering before you could make the call. ### Remote infrastructure queries The mundane version of this: you are in a meeting and someone asks "did the v2.3.1 deployment finish?" Instead of pulling out your laptop and disrupting the meeting, you type a Slack message on your phone under the table. Ten seconds later you have the answer. Individually trivial, but this pattern repeats a dozen times a day. 
```markdown (remote-query-examples.md) !claw what's the current replica count for api-gateway? > api-gateway: 8/8 replicas ready (HPA target: 70% CPU, current: 45%) !claw did the v2.3.1 deployment to staging complete? > Deployment complete. 4/4 pods running v2.3.1. Health check passing. Rollout finished 12 minutes ago. !claw quick health check on production > All systems healthy: - api-gateway: 8/8 pods, p99 latency 45ms - payment-service: 5/5 pods, 0 errors last 15min - user-service: 3/3 pods, p99 latency 28ms - postgres: primary healthy, replica lag 2s ``` ### Runbook execution Every team has runbooks that are just a sequence of commands with human checkpoints between them. The problem is not knowing what to do. It is that at 3am, tired engineers skip steps, mistype commands, or forget the order. OpenClaw can execute runbooks step-by-step, pausing for confirmation before anything destructive. The entire interaction is in a Slack thread, which means you get an audit trail for free and other team members can watch and intervene. ```markdown (runbook-example.md) !claw run database failover runbook for postgres-primary **Step 1/5**: Verify replica health > postgres-replica-1: healthy, lag 3s > postgres-replica-2: healthy, lag 2s Proceed to step 2? (yes/no) **yes** **Step 2/5**: Enable read-only mode on primary > SET default_transaction_read_only = on; > Active transactions: 3 (waiting for completion) Proceed to step 3? (yes/no) **yes** **Step 3/5**: Wait for replica sync > Monitoring replication lag... > postgres-replica-1: lag 0s (synced) > postgres-replica-2: lag 0s (synced) Proceed to step 4 (promote replica)? (yes/no) **yes** **Step 4/5**: Promote postgres-replica-1 to primary > pg_ctl promote executed > New primary accepting writes Proceed to step 5 (update DNS)? (yes/no) ``` The agent does not replace the runbook or the human. It replaces the copy-paste-verify loop that makes runbook execution tedious and error-prone. 
The human still makes the go/no-go decisions. The agent handles the mechanical execution. When you have watched someone paste a production command into a staging terminal (or vice versa), you appreciate having the execution context locked to the right environment. ### Log analysis and summarization What actually happens during incidents: someone says "check the auth-service logs" and three people independently open Datadog, write slightly different queries, and spend ten minutes scrolling through raw output. With an MCP connector to your observability stack, OpenClaw queries the logs, identifies patterns, and posts a summary. One query, one summary, everyone sees the same picture. ```markdown (log-analysis-example.md) !claw summarize errors from the last hour in auth-service **Querying Datadog logs**: service:auth-service status:error last 1h **Summary** (847 error logs analyzed): Top error patterns: 1. "Redis connection timeout" - 412 occurrences First seen: 14:23, ongoing Affected pods: auth-service-7b9f8c-{2k4jf, 8h3kd} 2. "Token validation failed: signature mismatch" - 389 occurrences First seen: 14:25, ongoing Pattern: All from IP range 203.0.113.0/24 3. "Database query timeout" - 46 occurrences First seen: 14:28, resolved at 14:35 Affected query: getUserPermissions **Assessment**: Redis connectivity issue is primary concern. Token validation failures may be attack traffic (single IP range). Database timeouts were transient, likely cascading from Redis. **Recommendation**: Check Redis cluster health, consider blocking 203.0.113.0/24 pending investigation. ``` The assessment quality is what makes this more than a fancy grep. The agent correctly identifies that the database timeouts are likely cascading from Redis, not an independent issue. That correlation, which an experienced engineer would make but a junior might miss, saves you from chasing the wrong root cause at 3am. 
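The "top error patterns" view in the example above comes down to grouping similar messages and counting them. The sketch below shows one simple way to do that grouping. It is not how OpenClaw or Datadog summarize logs; the `LogEntry` shape, the function names, and the digit-masking rule are all assumptions made for illustration, and timestamps are assumed to be ISO-8601 strings in a single timezone so they compare correctly as text.

```typescript
// Hypothetical shapes; not an OpenClaw or Datadog API.
interface LogEntry {
  timestamp: string; // ISO-8601, same timezone, so string comparison orders correctly
  message: string;
}

interface PatternSummary {
  pattern: string;
  count: number;
  firstSeen: string;
}

// Mask digit runs so messages that differ only in IDs or durations group together.
function normalizeMessage(message: string): string {
  return message.replace(/\d+/g, '<n>').replace(/\s+/g, ' ').trim();
}

function topErrorPatterns(logs: LogEntry[], limit: number): PatternSummary[] {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  const byPattern: { [pattern: string]: PatternSummary } = Object.create(null);

  logs.forEach((entry) => {
    const pattern = normalizeMessage(entry.message);
    const existing = byPattern[pattern];
    if (existing === undefined) {
      byPattern[pattern] = { pattern, count: 1, firstSeen: entry.timestamp };
    } else {
      existing.count += 1;
      if (entry.timestamp < existing.firstSeen) existing.firstSeen = entry.timestamp;
    }
  });

  // Most frequent first; ties broken alphabetically so output is deterministic.
  return Object.keys(byPattern)
    .map((key) => byPattern[key])
    .sort((a, b) => b.count - a.count || a.pattern.localeCompare(b.pattern))
    .slice(0, limit);
}

const sampleLogs: LogEntry[] = [
  { timestamp: '2026-01-30T14:25:00Z', message: 'Token validation failed for request 81' },
  { timestamp: '2026-01-30T14:23:00Z', message: 'Redis connection timeout after 5000ms' },
  { timestamp: '2026-01-30T14:24:00Z', message: 'Redis connection timeout after 5003ms' },
  { timestamp: '2026-01-30T14:26:00Z', message: 'Redis connection timeout after 4998ms' },
  { timestamp: '2026-01-30T14:27:00Z', message: 'Token validation failed for request 82' },
];

const top = topErrorPatterns(sampleLogs, 2);
console.log(top[0].pattern, top[0].count); // Redis connection timeout after <n>ms 3
console.log(top[1].pattern, top[1].count); // Token validation failed for request <n> 2
```

Counting is the mechanical half. The cascading-failure judgment in the assessment (database timeouts following Redis) depends on comparing the `firstSeen` ordering across patterns, which is why the sketch tracks it.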
### Deployment triggering from chat Triggering deployments from chat sounds risky until you realize it is strictly better than the alternative. When I deploy from my terminal, only I see what happened. When the agent deploys from Slack, the entire team sees the trigger, the progress, and the result. If something goes wrong, the rollback command is one message away, and the person running on-call from their phone can do it without VPN access. ```markdown (deployment-example.md) !claw deploy api-service v2.3.1 to staging **Deployment initiated** Pipeline: api-service-staging-deploy Triggered by: @oncall-engineer via OpenClaw Version: v2.3.1 **Progress**: > Building container image... done (2m 14s) > Running integration tests... done (4m 32s) > Deploying to staging cluster... done (1m 08s) > Running smoke tests... done (45s) **Deployment complete** 4/4 pods running v2.3.1 Health endpoint: healthy Staging URL: https://staging.api.example.com --- !claw rollback api-service to previous version in staging **Rollback initiated** Rolling back to v2.3.0 (previous stable version) > Scaling down v2.3.1 pods... > Scaling up v2.3.0 pods... **Rollback complete** - 4/4 pods running v2.3.0 ``` ### Documentation and runbook lookup The most underrated use case. Every team has institutional knowledge scattered across Confluence, Notion, Google Docs, and that one Slack thread from 2024 where someone explained how to rotate the database credentials. During an incident, nobody has time to search three wikis. OpenClaw with MCP connectors to your doc systems can answer "how do we do X?" by searching and synthesizing across all of them, with citations so you can verify. ```markdown (documentation-example.md) !claw how do we rotate the database credentials? **Searching internal documentation**... **Database Credential Rotation Procedure** 1. Generate new credentials in Vault: vault write database/rotate-role/postgres-app 2. 
Update Kubernetes secret: kubectl create secret generic postgres-creds --from-literal=... 3. Trigger rolling restart of dependent services: kubectl rollout restart deployment/api-service kubectl rollout restart deployment/payment-service 4. Verify connectivity and revoke old credentials after 1h **Important**: Rotation requires approval from database-admins team. Current on-call: @sarah-dba Sources: - Runbook: Database Credential Rotation (link) - Vault Configuration Guide (link) - Service Dependencies Matrix (link) ``` ## MCP integration for DevOps MCP is what makes OpenClaw more than a kubectl wrapper. Without MCP, you have an agent that can run shell commands. With MCP, you have an agent that can natively query Datadog, create PagerDuty incidents, check Terraform state, and search your internal docs through a standard protocol. The practical difference is massive: I went from "the agent can run whatever bash commands I allow" to "the agent understands my Datadog dashboards" in the time it took to add a config block. > **INFO: DevOps-relevant MCP servers** > Key MCP servers for infrastructure operations: Datadog (metrics and logs), PagerDuty (incident management), AWS (resource management), Kubernetes (cluster operations), GitHub/GitLab (repository and PR operations), Linear (issue tracking), Prometheus/Grafana (observability), and Terraform (infrastructure state). Each extends OpenClaw without custom code. When you wire up the Datadog MCP server, the agent stops running raw API queries and starts using structured tool calls that return typed data. The difference matters during incidents. Instead of parsing JSON output from a curl command, the agent gets structured metrics it can reason about: "error rate increased 3x in the last 15 minutes" instead of "here is a wall of JSON, good luck." The 10K+ MCP server ecosystem means someone has probably already built the integration you need. 
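To make the "structured metrics instead of a wall of JSON" point concrete, here is a small sketch of what a typed tool result lets downstream code do. The `MetricPoint` shape and function names are hypothetical and do not reflect the Datadog MCP server's actual response format; the point is only that typed fields turn "error rate increased 3x" into a few lines of arithmetic.

```typescript
// Hypothetical typed result from a metrics tool; not the Datadog MCP server's real schema.
interface MetricPoint {
  minutesAgo: number;
  errors: number;
  requests: number;
}

interface ErrorRateChange {
  ratio: number;
  summary: string;
}

function sumField(points: MetricPoint[], field: 'errors' | 'requests'): number {
  return points.reduce((total, point) => total + point[field], 0);
}

// Compares the error rate inside the recent window against everything older than it.
function describeErrorRate(points: MetricPoint[], windowMinutes: number): ErrorRateChange | null {
  const recent = points.filter((p) => p.minutesAgo < windowMinutes);
  const baseline = points.filter((p) => p.minutesAgo >= windowMinutes);

  const recentErrors = sumField(recent, 'errors');
  const recentRequests = sumField(recent, 'requests');
  const baseErrors = sumField(baseline, 'errors');
  const baseRequests = sumField(baseline, 'requests');

  // Without traffic in both windows, or with an error-free baseline, no ratio is defined.
  if (recentRequests === 0 || baseRequests === 0 || baseErrors === 0) return null;

  // (recentErrors / recentRequests) / (baseErrors / baseRequests), as a single division
  const ratio = (recentErrors * baseRequests) / (baseErrors * recentRequests);
  return {
    ratio,
    summary: `error rate is ${ratio.toFixed(1)}x baseline over the last ${windowMinutes} minutes`,
  };
}

const samplePoints: MetricPoint[] = [
  { minutesAgo: 5, errors: 30, requests: 1000 },
  { minutesAgo: 10, errors: 30, requests: 1000 },
  { minutesAgo: 20, errors: 10, requests: 1000 },
  { minutesAgo: 30, errors: 10, requests: 1000 },
];

const change = describeErrorRate(samplePoints, 15);
console.log(change ? change.summary : 'no baseline available');
// error rate is 3.0x baseline over the last 15 minutes
```

Returning `null` when there is no usable baseline is a deliberate choice here: an agent that reports "no baseline" is more useful during an incident than one that reports an infinite or zero ratio.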
The play for platform teams is building internal MCP servers for your own tooling. We all have that custom deployment system or internal metrics dashboard that no public MCP server covers. Writing an MCP server is straightforward enough that it is worth doing for any tool your team queries more than once a day during incidents. The protocol handles the plumbing; you just define what tools the agent can call and what data they return. ## Production deployment patterns Moving OpenClaw from "I ran it on my laptop and it was cool" to "the on-call team depends on this" involves real decisions about availability, authentication, and blast radius. A dedicated VM with persistent Slack connections is the right default; the Tailscale mesh and bare systemd setups below are variations for environments that need them. ### Dedicated server with Slack This is what I run and what I recommend for most teams. OpenClaw runs on a dedicated VM or container with persistent Slack connections, pre-authenticated to your clusters, with MCP servers connected to your observability stack. Centralized audit logging captures every command the agent executes. The server has its own service account with scoped RBAC: exactly the permissions the on-call team needs for investigation, nothing more. When the agent is down, your team notices, so treat it like any other production service with health checks and alerting. ### Tailscale mesh deployment For organizations using Tailscale or similar mesh VPNs, OpenClaw runs on a server within the mesh while connecting to infrastructure over encrypted tunnels. The mesh handles authentication and encryption, which simplifies your network security story. This pattern works well when your infrastructure spans multiple environments. The mesh gives the agent secure access to resources across clouds without managing individual VPN tunnels. 
### Systemd service Running OpenClaw as a systemd service gives you automatic restarts, health checks, and integration with standard Linux monitoring: everything you already know how to operate. Combined with the Slack integration, this provides high availability for a ChatOps bot that your on-call team depends on. Set RestartSec high enough that transient failures do not cause a restart loop that burns through your API rate limits. > **Key Point:** Self-hosted deployment is increasingly a compliance requirement, not a nice-to-have. Organizations in regulated industries (finance, healthcare, government) often cannot use cloud-based AI assistants for infrastructure operations due to data residency and audit requirements. OpenClaw's local-first architecture addresses this directly, and it is the primary reason to recommend it over SaaS alternatives. ## ChatOps as agent interface After running OpenClaw in production channels, ChatOps is the right interface for AI agents in infrastructure operations, over IDE integrations, web dashboards, or CLI tools. That is where your team already coordinates during incidents, and the value of having investigation results posted where everyone can see them is worth more than any amount of terminal magic that only one person witnesses. The audit trail is free: every interaction is in Slack history. The whole channel sees the investigation, so team visibility is free too. On-call engineers can triage from their phones without VPN or SSH. These are not features you have to build or configure. They are inherent properties of running AI agents where your team already works. > **Key Point:** The value of ChatOps AI agents is not replacing engineers. It is compressing the time between "I got paged" and "I understand the problem." When that goes from twenty minutes to thirty seconds, on-call stops being a dreaded chore and starts being manageable. For anyone evaluating OpenClaw, start with the security model. 
Get the sandbox configuration right, scope the service account permissions tightly, and ship audit logs to your SIEM before you let anyone depend on it. The adoption numbers (369K GitHub stars by early May 2026, fastest-growing repo in history) tell you the tool works. The question that matters is whether you can deploy it in a way that satisfies your security team, and the answer is yes if you are deliberate about it. > **AI agents belong in infrastructure operations. The hard part is deploying them without creating a new attack surface that keeps your security team up at night. Start with network isolation, minimal permissions, and full audit logging, and work backward from there.** ## When self-hosting is wrong Everything above assumes a team that can absorb the cost of running an autonomous agent that executes infrastructure commands from natural language. Plenty of teams cannot, and for them the honest recommendation is to wait or to skip it. The strongest argument against deploying OpenClaw is that you are adding a new attack surface whose only justification is the toil it removes, and the toil has to be large enough to be worth the surface. ### The project still churns OpenClaw was renamed twice in two months: Clawdbot to Moltbot on Jan 27, 2026, then to OpenClaw on Jan 30, 2026. The npm package, the GitHub org, and the docs domain all moved with it. A pinned config, a Slack app manifest, and your internal runbooks reference names and APIs that have not held still for a quarter. Pinning a tool this young to a production on-call path means you own the breakage when the next rename or API change lands, and the closed-as-not-planned owner-approval mode (issue #6262) shows the maintainer roadmap will not always go where your security review wants it to. ### No one to contain it The agent runs with the RBAC of its service account, which means scoping that account correctly is the whole security story. That work does not happen by itself. 
It needs someone who can write a least-privilege policy, reason about what an ambiguous Slack message could trigger inside it, and own the sandbox configuration over time. A team without anyone in that role will deploy the default unsandboxed mode and discover the gap the way the engineer earlier in this post nearly did, except their service account might have the delete permission. If there is no one to scope and contain it, OpenClaw is a liability waiting for the wrong prompt. ### Nothing is watching it The audit logging only protects you if something reads the logs. The production baseline above ships every executed command to a SIEM precisely so that an out-of-policy action triggers an alert rather than sitting in a file nobody opens until after an incident. Without a SIEM or equivalent, the agent runs commands against your infrastructure with no detection layer, and the audit trail becomes forensics you read after the damage instead of a control that catches it. That is a worse posture than a human typing the same commands, because the human is the detection layer. ### Too few incidents to justify it The value proposition is compressing the time between getting paged and understanding the problem. If you get paged twice a quarter, the ceremony you would eliminate is rare, and the new attack surface sits exposed every day regardless of how often it pays off. Below some incident volume, a documented runbook and a kubectl alias clear the same path without an always-on process holding cluster credentials. Reach for OpenClaw when the 3am ceremony is a recurring cost, not when it is an occasional annoyance. 
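The incident-volume argument can be put into rough numbers. The sketch below is a back-of-the-envelope estimate, not a measurement. The twenty-minute and thirty-second triage times come from earlier in this post, while the page counts and the eight hours of quarterly upkeep are hypothetical inputs you would replace with your own. It counts time only and ignores the always-on exposure that holds regardless of incident volume.

```typescript
interface TriageEstimate {
  pagesPerQuarter: number;
  minutesBefore: number; // page-to-understanding time without the agent
  minutesAfter: number; // the same interval with the agent
  upkeepHoursPerQuarter: number; // hypothetical: patching, RBAC review, log triage
}

// Net engineering hours gained per quarter; a negative result means the
// agent costs more time to run than it saves.
function netHoursSavedPerQuarter(e: TriageEstimate): number {
  const savedHours = (e.pagesPerQuarter * (e.minutesBefore - e.minutesAfter)) / 60;
  return savedHours - e.upkeepHoursPerQuarter;
}

// Hypothetical quiet team: paged twice a quarter.
const quietTeam = netHoursSavedPerQuarter({
  pagesPerQuarter: 2,
  minutesBefore: 20,
  minutesAfter: 0.5,
  upkeepHoursPerQuarter: 8,
});

// Hypothetical busy team: paged sixty times a quarter.
const busyTeam = netHoursSavedPerQuarter({
  pagesPerQuarter: 60,
  minutesBefore: 20,
  minutesAfter: 0.5,
  upkeepHoursPerQuarter: 8,
});

console.log({ quietTeam, busyTeam });
```

Under these assumptions the quiet team comes out negative and the busy team positive, which is the shape of the argument: the saving scales with page volume while the upkeep does not.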
## Resources & Further Reading - OpenClaw GitHub Repository: https://github.com/openclaw/openclaw - Official source code and documentation - OpenClaw Official Site: https://openclaw.ai - Downloads, guides, and community resources - npm Package: https://www.npmjs.com/package/openclaw - Installation via npm install -g openclaw - ZeroLeaks Security Analysis: https://zeroleaks.ai/reports/openclaw-analysis.pdf - Independent security audit and analysis report - The Agentic DevOps Loop: https://www.stxkxs.io/blog/agentic-devops-loop - How agents fit into the operations workflow this post deploys into - MCP Is Now a Linux Foundation Standard: https://www.stxkxs.io/blog/mcp-linux-foundation - The protocol behind OpenClaw's tool integrations - MCP (Model Context Protocol): https://modelcontextprotocol.io/ - Protocol specification and server registry - MCP Server Directory: https://github.com/modelcontextprotocol/servers - reference MCP servers maintained by the MCP steering group; points to the MCP Registry (registry.modelcontextprotocol.io) for the broader ecosystem of 10,000+ community servers - Datadog MCP Server: https://github.com/modelcontextprotocol/servers/tree/main/src/datadog - Query metrics and logs from OpenClaw - Kubernetes MCP Server: https://github.com/modelcontextprotocol/servers/tree/main/src/kubernetes - Cluster operations via MCP - Slack App Configuration: https://api.slack.com/apps - Creating Slack apps for ChatOps integration - Anthropic Claude Documentation: https://docs.anthropic.com/ - API reference for Claude model integration - OpenClaw creator Peter Steinberger joins OpenAI: https://techcrunch.com/2026/02/15/openclaw-creator-peter-steinberger-joins-openai/ - Coverage of the OpenAI hire - OpenClaw creator joining OpenAI, Altman says: https://www.cnbc.com/2026/02/15/openclaw-creator-peter-steinberger-joining-openai-altman-says.html - Sam Altman confirms the hire - Steinberger has offers from Meta and OpenAI: 
https://www.trendingtopics.eu/openclaw-peter-steinberger-already-has-offers-from-meta-and-openai-on-the-table/ - The bidding war between Meta and OpenAI - Meta Lost the OpenClaw Bidding War: https://fourweekmba.com/meta-lost-the-openclaw-bidding-war-and-it-could-turn-whatsapp-into-a-pipe/ - Analysis of Meta losing the competition for Steinberger - OpenClaw, OpenAI and the future: https://steipete.me/posts/2026/openclaw - Steinberger's own post on joining OpenAI --- # AI Creates Software Faster Than Ops Can Handle - **URL**: https://www.stxkxs.io/blog/second-order-explosion - **Published**: 2026-01-24 - **Author**: Brandon Stokes - **Category**: platform-engineering - **Tags**: ai-development, platform-engineering, devops, technical-leadership, infrastructure, systems-thinking, operations - **Reading time**: 14 min AI makes building software trivially cheap. Addy Osmani and Aaron Levie have articulated why this creates more software, not less. This piece focuses on what happens next: the specific operational failures that platform engineers, DevOps teams, and technical leadership will face—and the concrete systems to prevent them. ## Jevons applies, integration is worse Addy Osmani's "The Efficiency Paradox" and Aaron Levie's work on Jevons Paradox for knowledge work have established the foundational insight: when you make software dramatically cheaper to build, organizations build dramatically more software. Strictly speaking, Jevons Paradox describes resource consumption (coal efficiency led to more coal usage because demand was elastic). The analogy to software is imperfect. The dangerous part is not just that more code gets written. The integration complexity between all that new code grows quadratically. That is a different mechanism than Jevons described, and it is worse. The historical pattern is unambiguous across every major technological transition: from assembly to high-level languages, from bare metal to cloud, from manual deployment to CI/CD. 
Efficiency gains expand output rather than reducing effort. This piece takes that insight as given and asks the operational follow-up: what specific systems need to exist to prevent the predictable failures? This is not another essay about why Jevons Paradox applies to AI-assisted development. That case has been made. This is a playbook for platform engineers, DevOps teams, and technical leadership who accept the premise and need to prepare their organizations for its consequences. The focus is deliberately narrow: concrete operational concerns, specific failure modes, and actionable countermeasures. ## Why linear scaling fails The mathematics make this transition different from previous ones. Platform teams have always dealt with growing portfolios of services to support. What changes when AI collapses implementation costs is the rate of growth and the nature of what gets built. Consider a platform team supporting N services today. Each service requires some baseline operational attention: monitoring configuration, runbook maintenance, dependency updates, incident response capacity. Call that baseline cost C per service. The current total operational load is approximately N × C, which teams scale by hiring proportionally as the portfolio grows. When implementation becomes 10x cheaper, organizations do not build 10x the same services. They build qualitatively different things. The backlog of "not quite worth building" internal tools suddenly becomes viable. Bespoke integrations replace manual workflows. Custom dashboards proliferate. One-off automation scripts graduate to production services. The service count increases and the composition shifts toward smaller, more numerous, less-documented, often less-standardized systems. [Figure 1: Linear growth vs quadratic complexity, plotted for 25 to 250 services] The integration layer is where quadratic scaling becomes dangerous. N services do not exist in isolation.
They connect, depend on each other, share data, trigger workflows. The number of potential integration points approaches N², and even if only a fraction are active, the operational surface area grows faster than linearly. A platform team that could handle 50 services struggles with 150, even if headcount tripled, because the integration complexity has grown 9x while capacity grew 3x. - **Service Count Growth**: 3-5x — Projected increase based on patterns at early AI-adoption organizations (author estimate, directional not precise) - **Integration Surface**: N² — Potential connection points grow quadratically with service count - **Documentation Coverage**: Declining — Faster creation outpaces documentation velocity - **Ownership Clarity**: Degrading — More orphaned systems as creators rotate out ## Five failure modes In organizations with aggressive AI adoption, these failure modes manifest in roughly this sequence, and order matters: interventions at earlier stages prevent cascading failures downstream. ### The discovery collapse The first system to fail is discovery: the ability to find what already exists. This failure precedes all others because it directly causes duplication, which accelerates every subsequent problem. When finding an existing solution takes longer than building a new one, rational actors build new ones. I have watched three teams solve the same problem three different ways in the same quarter, each unaware of the others' work. Nobody was being careless. Finding the existing solution genuinely took longer than building a new one. Discovery collapse manifests in specific symptoms: engineers asking in Slack whether something exists and getting no response or conflicting answers; post-mortems revealing that an incident affected an unknown system; new services being built that duplicate functionality already available; architecture reviews uncovering integrations nobody knew about. 
The root cause is that organizational knowledge systems designed for 50 services do not scale to 200. > **TIP: Platform team action: service catalog infrastructure** > Implement automated service discovery that populates a catalog without requiring manual registration. Integrate with CI/CD to capture new deployments, with runtime to capture active services, with monitoring to capture dependencies. Make the catalog the authoritative source and invest in search quality. The catalog should answer "does something like this exist?" within 30 seconds. If it cannot, engineers will bypass it. ### The ownership vacuum Software created quickly often lacks clear ownership assignment. The engineer who built it owns it implicitly, until they transfer teams, get promoted, or leave the company. The service continues running. Dependencies continue depending on it. When it breaks, when it needs updating, when a security vulnerability requires patching, no one steps forward. The service has become an orphan. Orphaned services compound over time. Each one represents a maintenance liability that falls to whoever draws the short straw during incident response. Platform teams often absorb these by default, gradually accumulating operational responsibility for systems they did not build and do not understand. This creates a toxic dynamic where the platform team's capacity is consumed by legacy maintenance rather than platform improvement. > **WARNING: The orphan audit** > Run a quarterly audit: for every service in production, can you identify a specific person who will answer pages for it at 3 AM? Not a team, not a rotation, a person. Services without clear pager assignment are orphans regardless of what the documentation claims. Track orphan count as a key platform health metric. If it trends upward, ownership systems are failing. > **Key Point:** Ownership is a pager assignment, not a documentation exercise. 
If no individual is willing to be woken at 3 AM for a service, that service should not reach production. ### The security bypass Security review processes were designed for a world where building software was slow. The review queue could accommodate the arrival rate because implementation time naturally rate-limited submissions. When implementation accelerates 10x but security team capacity remains constant, the queue backs up. Engineers waiting weeks for security review start finding workarounds. "Shadow development" emerges: production systems that never went through established security gates. The consequences appear in breach post-mortems: an internal tool with production database access that was never reviewed; an integration that exposed customer data through an unsecured endpoint; a hastily-built admin interface with default credentials. The security team is not failing. They are overwhelmed by a volume they were never resourced to handle. > **TIP: Platform team action: security as code in the platform** > Shift security from review to guardrails. Encode security requirements in the platform: default deny network policies, mandatory secrets management, automated dependency scanning, required authentication for new services. Make the secure path the easy path. Reserve human security review for genuinely novel risk categories rather than routine deployments. The goal is reducing the attack surface area of new services by default rather than reviewing each one manually. ### The dependency cascade More services with more integrations create deeper dependency graphs. A change to a foundational service (updating an API contract, deprecating a feature, modifying behavior) propagates through unknown consumers. What was intended as a minor update becomes a cascading incident affecting systems nobody knew depended on the changed service. Dependency cascades are particularly dangerous because they violate the mental model of isolated failure. 
Engineers making changes believe they understand the blast radius. The actual blast radius includes services they have never heard of, built by teams they do not know, running workloads they cannot predict. I have sat in post-mortems where the dependency graph on the whiteboard looked nothing like what anyone expected. Half the connections were undocumented integrations that someone spun up in an afternoon and forgot about. > **TIP: Platform team action: dependency mapping and contract testing** > Implement runtime dependency tracking that captures actual service-to-service communication, not just declared dependencies. Build tooling that shows the real dependency graph, updated continuously. For critical services, require contract tests that fail upstream changes when downstream consumers would break. The platform should answer "what breaks if I change this?" before the change is made. ### The maintenance cliff Every deployed service incurs ongoing maintenance costs: dependency updates, security patches, compatibility fixes as the environment evolves. These costs are relatively fixed per service and do not decrease when the service was built quickly. An organization that increases its service count 5x has increased its maintenance burden 5x without increasing the engineering capacity to handle it. The maintenance cliff manifests gradually, then suddenly. Deferred updates accumulate. Security patches wait in backlog. Deprecated dependencies continue running. Then an external forcing function arrives: a critical CVE, a cloud provider deadline, a compliance requirement, and the accumulated debt comes due simultaneously. Once an organization reaches this point, keeping existing systems alive consumes the capacity it needs for new work. > **WARNING: The maintenance budget** > Before approving new services, calculate the maintenance budget. Every service should have an explicit allocation of ongoing engineering time: not aspirational, scheduled. 
If the maintenance budget is fully allocated, new services require either additional headcount or deprecation of existing services. Treat maintenance capacity as a finite resource that can be exhausted, because it can be. ## Platform systems needed The failure modes above share a common characteristic: they emerge from systems that were implicitly rate-limited by implementation difficulty. When that rate limit disappears, explicit systems must replace the implicit constraints. The minimum viable operational infrastructure for an expanded service portfolio is four systems: a service registry, guardrails, deprecation tooling, and maintenance capacity planning. ### Service registry with teeth A service registry that no one uses is worse than no registry because it creates false confidence. Effective registries share specific characteristics: they are automatically populated from deployment systems rather than manually maintained; they enforce data quality through required fields that block deployment if missing; they integrate with the tools engineers already use so that registry information appears in context during development and debugging; they provide value to service owners (operational dashboards, dependency visualization) that makes registration worthwhile rather than purely bureaucratic. 
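The "required fields that block deployment" idea can be sketched as a validation step a pipeline runs before releasing a service. This is a minimal illustration under stated assumptions, not the schema of any particular catalog: the `RegistryEntry` fields, the choice of which fields are required, and the `validateForDeploy` helper are all hypothetical.

```typescript
// Hypothetical registry record; field names are illustrative only.
interface RegistryEntry {
  name: string;
  owner?: string; // an individual, not a team alias
  onCallRotation?: string;
  lastDeployedAt?: string;
}

interface ValidationResult {
  ok: boolean;
  missing: string[];
}

// Assumption: these three fields must be present before a deploy proceeds.
const REQUIRED_FIELDS: (keyof RegistryEntry)[] = ["name", "owner", "onCallRotation"];

// Treats undefined and whitespace-only values as missing.
function validateForDeploy(entry: RegistryEntry): ValidationResult {
  const missing = REQUIRED_FIELDS.filter((field) => {
    const value = entry[field];
    return value === undefined || value.trim() === "";
  });
  return { ok: missing.length === 0, missing };
}

const result = validateForDeploy({ name: "report-exporter", owner: "  " });
if (!result.ok) {
  console.log(`deploy blocked, missing: ${result.missing.join(", ")}`);
}
```

For the sample entry, the check reports `owner` and `onCallRotation` as missing. A real pipeline would also verify that the named owner exists and is reachable. Presence checks alone only stop the laziest omissions.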
- Automatic registration: Services appear in the registry when deployed, not when someone remembers to add them - Required ownership: Deployment fails without a valid owner assignment that passes validation - Lifecycle tracking: Registry captures creation date, last deployment, activity metrics, maintenance status - Dependency mapping: Runtime traffic analysis populates actual dependencies, not just declared ones - Search that works: Finding services by function, by data touched, by team, by technology takes seconds ### Guardrails over gates The traditional model of security and architecture review as gates (approval required before proceeding) does not scale when implementation velocity increases 10x. The alternative is guardrails: constraints built into the platform that make compliant behavior the default and non-compliant behavior difficult or impossible. Effective guardrails are invisible when followed and obvious when violated. Network policies default to deny, requiring explicit justification for cross-service communication. Secrets management is the only way to access credentials; environment variables in code fail deployment checks. Container images must pass vulnerability scanning before promotion to production. Authentication is required for all new services by platform default. These constraints eliminate entire categories of security review because the platform prevents the problems before they are introduced. > **INFO: The guardrail principle** > For every category of issue that security or architecture review currently catches, ask: can this be prevented by platform design rather than detected by human review? Shift investment from review capacity to guardrail implementation. The goal is fewer things requiring review, not faster review. ### Automated deprecation Software organizations excel at creation and struggle with removal. This asymmetry becomes critical when creation velocity increases. 
Without active deprecation, the service portfolio only grows, maintenance burden only increases, and operational capacity is eventually consumed entirely by keeping legacy systems alive. Effective deprecation requires specific infrastructure: tools that identify candidate services based on usage metrics (nothing has called this in 90 days), dependency analysis that shows what would break if a service were removed, automated notification to consumers of deprecated services, and eventually, automated decommissioning of services that complete the sunset process. Make deprecation as procedurally clear as deployment, because it needs to happen just as often. ### Maintenance capacity planning Most capacity planning focuses on compute, memory, and storage. The scarcer resource in an expanded service portfolio is engineering attention. Every service requires maintenance capacity: security patches, dependency updates, compatibility fixes, incident response. This capacity is finite and can be exhausted. Implementing maintenance capacity planning means tracking maintenance hours per service (estimated at creation, refined with actuals), total portfolio maintenance load, available engineering capacity dedicated to maintenance (not borrowed from feature work), and the delta between required and available capacity. When maintenance capacity is exhausted, new services require either additional headcount or service deprecation. This makes the tradeoffs explicit rather than hidden. ## Organizational changes required Technical systems alone are insufficient. The failure modes described above emerge partly from organizational structures and incentive systems designed for a world of constrained implementation capacity. Those structures must evolve alongside the technical infrastructure. ### Ownership as prerequisite Every service must have an owner before it reaches production. Not a team, a person who accepts responsibility for availability, maintenance, and incident response. 
"Owner" cannot just be a name in a field. It needs teeth. Spotify validates ownership through Backstage by checking that the listed owner is on an active on-call rotation. Netflix requires service owners to have merged a commit to the service repo within the last 90 days. Without validation like this, ownership assignment becomes a checkbox exercise where the most available person gets named, not the right person. If no individual is willing to accept that responsibility, the service should not be built. This is the single highest-leverage intervention available. Implementing ownership as prerequisite means modifying deployment pipelines to require valid owner assignment, establishing clear ownership transfer processes when individuals change roles, defining what ownership means operationally (pager responsibility, maintenance commitment, deprecation authority), and tracking ownership health as an organizational metric. ### Discovery before creation Before any new service is approved, require documented evidence that existing solutions were evaluated. What already exists in this space? Why is it insufficient? What would extending an existing service cost versus building new? This sounds bureaucratic but serves a critical function: it forces the discovery step that building economics now make easy to skip. The implementation can be lightweight: a required field in service proposals that links to registry searches performed, with results summarized. The goal is not to prevent new services. It is to ensure the decision to build new is made with awareness of what already exists. ### Deprecation velocity Track and report service deprecation alongside service creation. A healthy organization should deprecate services at some meaningful fraction of the rate it creates them. If creation outpaces deprecation indefinitely, the portfolio grows without bound and maintenance capacity eventually exhausts. Make deprecation visible and celebrated. 
Engineers who successfully sunset services are reducing organizational burden and freeing capacity for more valuable work. This should be recognized equivalently to building new capabilities. > **TIP: The portfolio health dashboard** > Build visibility into: total services in production, services created/deprecated this quarter, orphaned services (no clear owner), services with overdue maintenance, maintenance capacity utilization. Review this dashboard at the same cadence as feature delivery metrics. Portfolio health is as important as feature velocity. ## Where to start The systems and changes described above represent substantial investment. For teams that cannot tackle everything simultaneously, sequence interventions by impact and dependency: ownership first, because it gates everything else. - Ownership enforcement: modify deployment pipelines to require owner assignment. Highest-leverage single change; enables everything else. - Service registry automation: implement automatic registration from deployment systems. Manual registries fail at scale; automated ones provide the foundation for discovery. - Dependency mapping: add runtime dependency tracking. Understanding the actual dependency graph is prerequisite to managing cascade risks. - Security guardrails: shift security controls from review gates to platform defaults. Start with the highest-risk categories: secrets management, network policies, authentication. - Deprecation tooling: build infrastructure for identifying deprecation candidates and managing sunset processes. This becomes critical as portfolio size increases. - Maintenance capacity tracking: implement explicit tracking of maintenance load versus available capacity. This makes the tradeoffs visible before they become crises. Each stage builds on the previous. Ownership enforcement without a registry means orphaned services are invisible. Dependency mapping without ownership means you know what depends on what but not who can fix problems. 
The sequence matters. ## Most teams should wait The strongest objection to this whole argument is that building governance infrastructure before the explosion arrives is itself a way to waste engineering capacity. Every system described here costs real time to build and maintain. A team of six engineers running a dozen services does not have an N² problem worth solving. The integration surface is small enough to hold in one person's head, discovery happens in a single Slack channel, and ownership is obvious because everyone knows who wrote what. Building a service registry with teeth for that team is solving a problem they do not have, at the cost of the work they were hired to do. The quadratic math only bites once N is large enough that a fraction of N² exceeds what a team can track informally. Below that threshold the curve in Figure 1 is indistinguishable from linear, and the informal systems that scale poorly still work fine. Premature governance also has a failure mode of its own: a registry nobody needs becomes a registry nobody updates, and a stale registry is worse than none because it creates false confidence. The same applies to ownership enforcement that blocks deployments before there are enough orphans to justify the friction. This argument assumes the 3-5x service growth actually materializes for a given organization. That figure is an author estimate, directional not precise, and it will not hold everywhere. An organization where AI assistance mostly speeds up changes to existing services, rather than spawning new ones, does not face the portfolio explosion at all. Its maintenance burden grows with code volume, not service count, and the systems that matter are code review and test coverage, not service registries. The case for this infrastructure rests on the service count climbing fast. Where it is flat, the case does not apply. 
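The threshold argument above can be made concrete with a few lines of arithmetic. The sketch counts potential service-to-service pairs as N(N-1)/2, which is one common way to read "on the order of N²". The 10% active fraction is a hypothetical figure chosen only to show the shape of the curve.

```typescript
// Number of distinct service pairs among n services: n choose 2.
function potentialPairs(n: number): number {
  return (n * (n - 1)) / 2;
}

// Integrations actually in use, given an assumed active fraction.
function activeIntegrations(n: number, activeFraction: number): number {
  return potentialPairs(n) * activeFraction;
}

// Hypothetical: one in ten potential pairs is a live integration.
const ACTIVE_FRACTION = 0.1;

for (const n of [12, 50, 150]) {
  const pairs = potentialPairs(n);
  const active = Math.round(activeIntegrations(n, ACTIVE_FRACTION));
  console.log(`${n} services: ${pairs} potential pairs, about ${active} active`);
}

// Growth in potential pairs when the portfolio triples from 50 to 150.
const growth = potentialPairs(150) / potentialPairs(50);
console.log(growth.toFixed(2));
```

A dozen services have 66 potential pairs, 50 services have 1,225, and 150 services have 11,175. Tripling the portfolio from 50 to 150 multiplies the potential pairs by about 9.1, which matches the 9x figure used earlier, while at a dozen services the active subset is still small enough to keep in one person's head.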
> **Key Point:** Build the governance when the service count is climbing fast and ownership has started to blur, not before. The trigger is observed growth in the portfolio, not the existence of an AI coding tool. The argument for building early is narrower than 'every team should do this now.' It is that the cost of retrofitting governance onto an already-exploded portfolio is far higher than building it while the portfolio is still small enough to instrument cleanly. That tradeoff only favors early investment for organizations that can see the growth coming. The ownership enforcement step is the cheapest and earns its place first regardless, because a pipeline check that requires an owner costs little even at a dozen services and prevents the orphan accumulation that everything downstream depends on. ## The operational imperative The efficiency paradox Osmani and Levie describe is already in motion. AI-assisted development is making software dramatically cheaper to build, and organizations are responding by building dramatically more software. That much is now predictable. The question for platform engineers, DevOps teams, and technical leadership is whether the operational infrastructure will be ready when the flood arrives. The failure modes described here are already emerging in organizations with aggressive AI adoption. They are not new. The microservices wave of 2015-2018 created the same failure cascade at smaller scale: discovery collapse, ownership vacuum, dependency nightmares. What AI changes is the rate. Microservices proliferation took years to create operational debt. AI-accelerated development can create the same debt in months. The governance patterns that eventually emerged for microservices are the same patterns needed now, just deployed earlier and more aggressively: service catalogs, ownership registries, contract testing. Spotify built Backstage for exactly this reason. 
Each failure mode represents a system designed for constrained implementation capacity encountering unconstrained output. The organizations that invest in operational infrastructure before the portfolio explosion will navigate the transition. Those that do not will spend years recovering from the operational debt they accumulate. > **Platform engineering's mandate is to ensure that what gets built remains comprehensible, maintainable, and valuable over time. That is harder than making building easy. It is also more important.** The work is unglamorous. Service registries, ownership systems, deprecation tooling, maintenance capacity planning: none of this is as exciting as building new features. The teams that ship it before the portfolio explodes will treat governance as infrastructure they own, not a tax they pay, and they will out-build the teams still drowning in the debt they let accumulate. ## Resources & Further Reading - The Efficiency Paradox (Addy Osmani): https://addyosmani.com/blog/the-efficiency-paradox/ - The foundational essay on Jevons Paradox and AI-assisted development - Aaron Levie on Jevons Paradox: https://x.com/levie/status/2004654686629163154 - Original observations on efficiency paradoxes in knowledge work - Team Topologies: https://teamtopologies.com/ - Frameworks for managing cognitive load in technology organizations - The Staff Engineer's Path (Tanya Reilly): https://www.oreilly.com/library/view/the-staff-engineers/9781098118723/ - On scope, judgment, and organizational impact - Backstage by Spotify: https://backstage.io/ - Open-source platform for building developer portals and service catalogs - Platform Engineering Maturity Model: https://tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model/ - CNCF framework for assessing platform capabilities --- # Why AI Coding Tools Favor Typed Languages - **URL**: https://www.stxkxs.io/blog/programming-languages-ai-era - **Published**: 2026-01-15 - **Author**: Brandon Stokes - 
**Category**: ai
- **Tags**: programming-languages, rust, typescript, ai-coding, claude-code, copilot, cursor, developer-productivity, type-safety
- **Reading time**: 12 min

With 41% of code now AI-generated and tools like Claude Code flattening learning curves, the calculus of programming language choice is shifting. TypeScript just overtook Python on GitHub. Rust has been the most admired language for ten years running. Here's what the data says about which languages are winning, and why.

## AI changed what best means

Forty-one percent of production code is now AI-generated, and 85% of developers use AI tools regularly. Those numbers match what I see in my own workflow, and they changed the ratio of time I spend thinking versus translating thought into code. I was always persistent at building things; it just took a lot of time. Now I can get ideas out faster, ship what I am good at, and offload the work I used to ignore entirely: sales copy, marketing pages, product thinking.

That shift forced me to look at the languages I use differently. I run TypeScript for my web app, CDK infrastructure, and API handlers. I run Rust for Lambda functions that handle email forwarding, SES events, and Cognito auth flows. These were not academic choices. They were bets on how I wanted AI to help me build, and the bets paid off in ways I did not fully expect when I made them. The compiler became a second pair of eyes on every line an AI agent wrote for me.

The stat that matters is not the volume of generated code. It is what happens after the code is generated: which language catches AI mistakes before they reach production. That is the question my whole stack now turns on.

> **Key Point:** AI did not change which languages are best. It changed what "best" means. The optimization function shifted from "easy to write" to "produces correct code at scale," and that favors type systems.
## Why types beat tests When I ask Claude Code to add a feature to my CDK stack, the first thing that happens after generation is the TypeScript compiler runs. Most of the time it passes. When it does not, the error is specific: wrong type on a construct prop, a missing required field, a return type mismatch. The type checker catches exactly the category of error that AI produces most often: structural mistakes where the logic is correct but the data shape is wrong. I have watched this happen hundreds of times. AI generates a Lambda function construct with the wrong runtime enum. TypeScript catches it. AI wires an API Gateway integration with a missing authorization type. TypeScript catches it. AI passes a Duration where a number is expected. TypeScript catches it. Each of these would be a runtime error in JavaScript, the kind that surfaces at 2am when a deploy hits production and the handler throws because `undefined` is not a valid ARN. ```typescript (apps/infra/lib/stacks/compute-stack.ts) // AI generates this — looks correct, compiles fine const handler = new NodeLambda(this, 'ApiHandler', { entry: resolve(__dirname, '../../api/dist/analytics/index.mjs'), environment: { ANALYTICS_TABLE: table.tableName, ALLOWED_ORIGINS: config.web.domain, }, timeout: Duration.seconds(30), memorySize: 256, }) // AI generates this — wrong type on authorizationType // TypeScript catches it BEFORE deploy const integration = new HttpLambdaIntegration('handler', handler.fn) api.addRoutes({ path: '/api/analytics', methods: [HttpMethod.POST], integration, authorizationType: 'JWT', // TS Error: not assignable to HttpRouteAuthorizationType }) ``` The same dynamic plays out in the other direction. When I write a handler in `apps/api` and the AI generates the response, TypeScript constrains the shape. The `json()` helper returns a structured API Gateway response. The DynamoDB client expects typed inputs. 
Every seam between components is a type boundary, and every type boundary is a place where AI-generated code gets checked automatically. Anders Hejlsberg described this as a virtuous cycle: AI generates code, the compiler catches type errors, the developer fixes them, corrections feed back into model training, and subsequent generations improve. I have seen this happen in real time over the past year. Claude Code is measurably better at generating correct CDK constructs now than it was six months ago, and the tight TypeScript feedback loop is part of why. > **INFO: The real AI error pattern** > AI coding assistants are remarkably good at syntax and idiom. Their consistent weakness is type correctness: ensuring data flows through a program with the right shape at every boundary. This is precisely what static type systems verify. The compiler is the quality assurance layer for AI output. ## TypeScript won, mechanically In mid-2025, TypeScript overtook both Python and JavaScript to become the most-used language on GitHub: 2.6 million monthly contributors, 66% year-over-year growth. TypeScript was already growing fast before AI coding tools. The React/Next.js ecosystem, Deno, Bun, and Node.js modernization were all driving adoption. The rate of acceleration is new. AI did not cause TypeScript's rise. It poured fuel on a fire that was already burning, and the reason is the specific failure mode AI exposes in dynamic languages. The explanation is mechanical. JavaScript and TypeScript are the same language at the syntax level. TypeScript adds type annotations that developers historically considered overhead. When AI generates substantial portions of a codebase, types transform from a burden into infrastructure. The developer does not bear the cognitive load of writing type annotations; the AI handles that. The type checker verifies what the AI wrote. The overhead disappeared and the safety remained. 
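A minimal sketch of that mechanism, assuming a hypothetical `PageView` shape and `recordView` function (neither comes from my codebase). The annotation costs nothing to generate, and it is what lets the checker reject a call where the logic is fine but the data shape is wrong:

```typescript
// Hypothetical shape for illustration; not the real analytics schema.
interface PageView {
  path: string;
  referrer: string;
  timestamp: number; // epoch milliseconds
}

// The annotation is the contract every caller is checked against.
function recordView(view: PageView): string {
  return `${view.path}|${view.referrer}|${view.timestamp}`;
}

// A well-shaped call compiles and runs.
const ok = recordView({ path: '/blog', referrer: 'direct', timestamp: 1700000000000 });

// A typical generated mistake: right logic, wrong shape (ISO string instead of a number).
// Plain JavaScript would run this and store bad data; tsc rejects it at the call site.
// @ts-expect-error timestamp must be a number
const bad = recordView({ path: '/blog', referrer: 'direct', timestamp: '2026-01-15T00:00:00Z' });

console.log(ok);
```

The `@ts-expect-error` directive is only there so the sketch compiles as a whole; remove it and the build fails on that line, which is the point.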
**Programming language growth on GitHub** - TypeScript: 66% - Python: 49% - JavaScript: 25% TypeScript's 66% growth outpacing both Python at 49% and JavaScript at 25% tells the story. Developers are not leaving the JavaScript ecosystem; they are adding types to it. New projects start in TypeScript. Existing projects migrate. The pattern accelerated because AI assistants generate TypeScript with type annotations effortlessly, eliminating the historical argument that types slow you down. Python remains dominant for AI/ML workloads. It peaked at a historic 26.98% on the TIOBE index in July 2025 — the highest rating any language has ever reached — and even after slipping to around 20% by May 2026 it still ranks #1 by a wide margin. It is not going anywhere, and this is the honest tension in the "AI favors typed languages" argument. The AI ecosystem itself (PyTorch, transformers, LangChain, the entire ML stack) runs on Python. The language that builds AI is dynamic. The resolution is that Python is increasingly a typed language in practice: mypy, Pyright, and type hints are now standard in production ML codebases. The direction is toward more type information everywhere, even in Python. AI made that transition feel free. ## Rust: where types become guarantees I write Rust for my Lambda functions: an SES event handler, an email forwarder, and a Cognito OTP auth flow. These are small, critical services. They process every inbound email and every authentication event. They run on ARM64 Graviton2 with the PROVIDED_AL2023 runtime. Cold starts are under 10ms. Memory usage is negligible. They have never crashed in production. The reason I chose Rust was not performance benchmarks. It was the guarantee model. These handlers touch email content, authentication tokens, and user data. A null pointer in the email forwarder means lost mail. A memory error in the auth handler means a security vulnerability. Rust's ownership system makes those failure modes structurally impossible. 
The compiler will not let me, or the AI, ship code that has them.

```rust (apps/lambdas/ses-handler/src/main.rs)
// The compiler enforces that every possible SES event variant is handled.
// AI can generate the match arms, but it cannot skip one:
// Rust's exhaustive pattern matching won't compile if a case is missing.
// `Response` and `Error` stand in for the handler's own response and error types.
async fn handle_event(event: SesEvent) -> Result<Response, Error> {
    let records = event.records;

    for record in &records {
        let action = &record.ses.receipt.action;
        let from = &record.ses.mail.common_headers.from;
        let subject = &record.ses.mail.common_headers.subject;

        match action.action_type {
            ActionType::Lambda => process_lambda_action(&record.ses).await?,
            ActionType::S3 => process_s3_action(&record.ses).await?,
            ActionType::Bounce => process_bounce(&record.ses).await?,
            ActionType::Stop => log_stopped_processing(&record.ses),
            // Forgetting a variant here is a compile error, not a runtime bug
        }
    }

    Ok(build_response(records.len()))
}
```

Compare this with writing the same handler in Python or JavaScript. The AI generates the same match/switch structure, but nothing enforces exhaustiveness. Suppose a new action type is added to the enum. In Rust, the code stops compiling until the new case is handled. In Python, it silently falls through and you find out from a customer support ticket about missing emails.

- **Most Admired Language**: 10 Years - Consecutive years atop Stack Overflow survey
- **Enterprise Adoption**: 48.8% - Using Rust for production systems
- **Salary Premium**: 15-20% - Over comparable Python, Go, or Java roles; likely reflects seniority and systems programming domain rather than language choice alone

Rust will not replace Python for data science or TypeScript for web development. That is fine; it is not trying to. Rust is winning where correctness is non-negotiable: infrastructure (AWS built Firecracker, the microVM layer underneath Lambda, in Rust), security-sensitive systems (the Linux kernel now accepts Rust), and performance-critical paths (Cloudflare's Pingora proxy).
Enterprise teams adopting it for those workloads are making the same calculation I did. For certain workloads, the compiler guarantee is worth the learning investment. > **WARNING: The borrow checker still bites** > AI has not eliminated Rust's learning curve. It has compressed it. Concepts that took months to internalize can now be grasped in weeks with AI providing immediate feedback on ownership errors. The mental models still need to be built. AI can explain why a borrow checker error occurs and suggest a fix, but you need to understand the ownership model to design programs that work with it. The first few months are still hard. They are just less lonely. ## When AI slows you down A randomized controlled trial by METR found that experienced developers working on familiar codebases were 19% slower when using AI assistance. The study tested a specific cohort on specific open-source tasks, and METR themselves note methodological limitations. The sample skewed toward experienced OSS contributors, and the tasks may not represent typical professional development. The directional finding rings true from my own experience, and the framing as an anti-AI argument is wrong. The result makes sense if you think about what AI actually helps with versus where it gets in the way. When I am deep in a Rust module I have written and maintained for months, I know the ownership patterns. I know which structs own their data and which borrow. I know the error handling strategy. AI assistance in that context is friction. I spend time reviewing suggestions that are plausible but wrong for this specific codebase's patterns. The AI does not know that I chose `thiserror` over `anyhow` for a reason, or that this particular struct must not implement `Clone` because it holds a connection pool. 
When I am wiring a new CDK stack, setting up a DynamoDB table with GSIs, or writing a CloudFront distribution config with custom cache behaviors (domains where I know what I want but do not have the API surface memorized), AI saves real time. I describe the architecture and it produces the construct calls. I review for correctness, the TypeScript compiler verifies the types, and what would have taken an hour of documentation reading takes ten minutes. > **AI helps most when you know what to build but not how to spell it. It helps least when you already have the incantation memorized.** This is why language choice matters more than the raw productivity numbers suggest. In TypeScript, when AI generates something wrong, tsc catches it and the feedback loop is tight. In Python, when AI generates something wrong in an unfamiliar library, you might not discover it until integration testing, or production. The slowdown for experts on familiar code is a ceiling effect. The real gain shows up in exploration and learning, the unfamiliar territory where AI fills the gaps, and typed languages widen it by making that exploration safer. ## The language tiers AI creates AI does not provide uniform code quality across languages. Three factors determine how well an AI assistant performs: training data volume, syntactic clarity, and whether the language has a static type system. These create natural tiers of AI assistance quality that map directly to language adoption trends. A note on methodology: the tier list below is an assessment based on these three factors, informed by GitHub Copilot benchmarks and my own experience across these languages. It is an opinion, not a measurement. Treat it as a starting point for your own evaluation. TypeScript and Python sit at the top because they have massive training datasets, clean syntax, and well-documented community patterns. TypeScript adds the static verification layer. 
Rust has less training data in absolute terms, but the code that exists is higher quality. The compiler filters out entire bug categories before code is committed, so AI models trained on Rust learn from code that already passed a rigorous correctness check.

### AI assistance by language

- Tier 1 (TypeScript, Python, JavaScript): Massive training data, clear syntax, well-documented patterns. TypeScript adds compile-time verification that catches the errors AI produces most frequently.
- Tier 2 (Go, Rust, Java, C#): Strong type systems verify generated code. Go offers simplicity; Rust offers correctness guarantees; Java and C# have deep enterprise training corpora.
- Tier 3 (C++, Ruby, PHP, Swift): Adequate training data but syntax complexity or paradigm variation reduces completion quality.
- Tier 4 (Perl, Haskell, Lisp dialects): Smaller training corpora combined with unusual paradigms that current models handle inconsistently.

This distribution is self-reinforcing. Better AI support attracts more developers, producing more training data, improving AI quality further. TypeScript is firmly in this virtuous cycle. Rust is entering it as enterprise adoption grows. Languages at the bottom face the inverse: declining interest produces less training data, which degrades AI support, which accelerates decline.

**Most admired languages**

- Rust: 72.4%
- Gleam: 70.8%
- Elixir: 66%
- Zig: 64.2%

The language developers most want to keep using is Rust, type-safe and built around strong compiler feedback. The cohort around it is more mixed than the headline suggests: Gleam and Zig are statically typed and lean on the compiler to catch mistakes before runtime, while Elixir is dynamically typed and has only recently started adding gradual type checking. Compile-time checking is exactly the property that makes AI-generated code safer to ship.

## Choosing a language now

My stack is TypeScript for the web and infrastructure layer, Rust for security-critical Lambda functions. A year into this choice with heavy AI assistance, I would make the same bet again.
The reasoning, stripped down: ### TypeScript for the web stack React, CDK, API handlers: all TypeScript. One language across the entire web-facing stack means AI has full context when generating code. The CDK construct types flow into the API handler types flow into the frontend API client types. A change to an API response shape triggers type errors everywhere that shape is consumed. AI can generate the change and the type checker verifies every downstream consumer. This is not possible across language boundaries. ```typescript (apps/api/src/handlers/analytics.ts) // Same language, same types, across every layer. // AI generates the handler — TypeScript verifies it matches // the DynamoDB schema AND the frontend contract. import { json, error } from '../lib/response' import { getClient, getTableName } from '../lib/dynamo' import { PutCommand } from '@aws-sdk/lib-dynamodb' export default async function handler(event: APIGatewayProxyEventV2) { const origin = event.headers?.origin ?? '' if (!isAllowedOrigin(origin)) { return error(403, 'Forbidden') } const body = JSON.parse(event.body ?? '{}') await getClient().send(new PutCommand({ TableName: getTableName(), Item: { pk: `PAGE#${body.path}`, sk: `TS#${Date.now()}`, path: body.path, referrer: body.referrer ?? 'direct', timestamp: new Date().toISOString(), }, })) return json({ tracked: true }) } ``` ### Rust where correctness matters The email forwarder processes every inbound message to my domain. The Cognito handler runs on every auth event. These are small, critical, rarely-changed functions where I want the compiler to enforce invariants I do not want to think about on every deploy. Rust gives me that, plus sub-10ms cold starts and negligible memory usage on Graviton2. The AI writes Rust for these handlers capably, and when it gets ownership wrong, the build rejects it before I ever see the generated code in a diff. Could I write these in TypeScript? Sure. The Node.js Lambda runtime is fine. 
"Fine" is different from "the compiler guarantees no null pointer, no data race, no use-after-free." For handlers that touch email content and authentication tokens, I want the guarantee, not just the convention. > **TIP: The practical stack decision** > Pick TypeScript if you want AI to move fast across your full web stack with type safety catching mistakes at every boundary. Add Rust for the small number of functions where correctness guarantees matter more than development speed. Skip Rust when the math does not clear the bar: the first few months of borrow-checker fluency are still hard even with AI, hiring is a smaller pool, and onboarding a teammate costs real ramp time. If none of your workloads need memory safety or sub-millisecond performance, that cost buys you nothing. Go is a strong alternative with a gentler learning curve and good AI support. ## The hierarchy flipped For decades, the language hierarchy optimized for developer ergonomics: how fast can a human write code? Python and JavaScript won that race. Types were overhead. Compile steps were friction. Dynamic languages let you ship faster because there was less between your idea and a running program. AI inverted the optimization function. When 41% of production code is machine-generated, verification reliability matters more than human writing speed. Type annotations are free when AI writes them. Compile-time checks are instant. The overhead that made dynamic languages attractive disappeared, and what remained was the safety gap: dynamic languages let AI-generated bugs through to runtime while typed languages catch them at compile time. **TIOBE Index language rankings** - Python: 23.28% - C++: 10.29% - Java: 10.15% - C: 8.86% - C#: 4.45% - JavaScript: 4.2% - Go: 2.61% - Rust: 1.16% Each major domain has converged on a type-safe default. Web development moved from JavaScript to TypeScript. Systems programming is moving from C and C++ to Rust. Cloud-native services gravitate toward Go. 
Even Python (the most dynamic of the leaders) has seen enormous adoption of type hints, mypy, and Pyright. The direction is unanimous. The deeper shift is in how we evaluate languages for new projects. "Which language minimizes friction for writing code?" is the old question. "Which language maximizes the reliability of code that AI helps generate?" is the new one. The answer consistently points toward static type systems, informative compiler feedback, and large training datasets. Languages that excel on all three (TypeScript, Rust, Go) are the primary beneficiaries of the AI era. > **The era of optimizing for "easy to write" is ending. The era of optimizing for "produces correct code" has begun. The compiler became the product.** For builders choosing their stack right now: follow the compilers, not the hype cycles. TypeScript for anything that touches the web. Rust for anything where correctness is the feature. Go for services where simplicity and fast feedback loops matter. Python for AI/ML workloads but with type hints turned on. The learning curves still exist, but AI compressed them enough that the safety tradeoff is no longer close. Pick the language that catches AI mistakes at compile time. Your 2am self will thank you. 
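One concrete way to get that compile-time catch in TypeScript, sketched with a hypothetical `Action` union (the real SES event types differ): a `never` check in the default branch turns a forgotten case into a compiler error, which is the closest TypeScript equivalent to the exhaustive `match` in the Rust handler earlier.

```typescript
// Hypothetical action union for illustration; not the real SES types.
type Action =
  | { type: 'lambda'; functionArn: string }
  | { type: 's3'; bucket: string }
  | { type: 'bounce'; message: string }
  | { type: 'stop' };

// Accepts only `never`, so it type-checks only when every variant is handled.
function assertNever(value: never): never {
  throw new Error(`Unhandled action: ${JSON.stringify(value)}`);
}

function describeAction(action: Action): string {
  switch (action.type) {
    case 'lambda':
      return `invoke ${action.functionArn}`;
    case 's3':
      return `store in ${action.bucket}`;
    case 'bounce':
      return `bounce: ${action.message}`;
    case 'stop':
      return 'stop processing';
    default:
      // If a variant is added to Action without a matching case,
      // `action` is no longer `never` here and tsc rejects this call.
      return assertNever(action);
  }
}

console.log(describeAction({ type: 's3', bucket: 'inbox' }));
```

Unlike Rust, this is a convention rather than a language guarantee: the check only exists where someone (or the AI) remembered to write the `default` branch.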
## Resources & Further Reading - GitHub Octoverse 2025: https://github.blog/news-insights/octoverse/ - TypeScript overtakes Python and JavaScript as the most-used language on GitHub - GitHub Blog — Why AI is pushing developers toward typed languages: https://github.blog/ai-and-ml/llms/why-ai-is-pushing-developers-toward-typed-languages/ - Analysis of the feedback loop between AI code generation and type systems - Stack Overflow Developer Survey 2025: https://survey.stackoverflow.co/2025/technology - Rust maintains most admired status (72%) for tenth consecutive year - METR — Measuring the Impact of AI on Developer Productivity: https://metr.org/ - Randomized controlled trial showing 19% slowdown for experienced developers with AI assistance - JetBrains State of Developer Ecosystem 2025: https://blog.jetbrains.com/research/2025/10/state-of-developer-ecosystem-2025/ - Comprehensive survey of developer tool usage, language trends, and AI adoption - TIOBE Index: https://www.tiobe.com/tiobe-index/ - Python reached a historic 26.98% rating in July 2025; Rust hit an all-time high of #13 in January 2026 - corrode Rust Consulting — Flattening Rust's Learning Curve: https://corrode.dev/blog/flattening-rusts-learning-curve/ - How AI tools are compressing the time to Rust proficiency --- # MCP Is Now a Linux Foundation Standard - **URL**: https://www.stxkxs.io/blog/mcp-linux-foundation - **Published**: 2026-01-09 - **Author**: Brandon Stokes - **Category**: ai - **Tags**: mcp, model-context-protocol, ai-tools, open-standards, linux-foundation, anthropic, openai, vendor-neutrality, a2a, agent-orchestration - **Reading time**: 15 min Anthropic donated Model Context Protocol to the Linux Foundation, creating the foundation for a $13.4B market. Combined with Google's A2A protocol and agent orchestration frameworks, we're seeing the emergence of standardized AI infrastructure—and the commercial opportunities that follow. You build a GitHub integration for Claude. It works. 
Your team loves it. Then someone asks you to make it work with GPT-4. You look at the code, realize none of it transfers, and start rewriting from scratch. Different tool-calling conventions, different schema expectations, different auth flows. Two weeks later you have two codebases doing the same thing. Now multiply that by every tool your agents need (Slack, Postgres, Jira, your internal APIs) and every model you want to support. You are maintaining a combinatorial explosion of integration code, and none of it is the actual product. I run MCP servers in my own stack. Claude Code is wired into my deployment pipeline. MCP servers handle CDK infrastructure. Custom tooling lets agents interact with my AWS resources. The thing that made me pay attention to MCP was not the Anthropic announcement or the Linux Foundation donation. It was the moment I wrote one MCP server and it worked in Claude Desktop, Cursor, and VS Code without changing a line. That is when I understood what this actually is: the end of writing integration code per model. In December 2025, Anthropic donated MCP to the Linux Foundation's new Agentic AI Foundation. Anthropic, Block, and OpenAI are co-founders. AWS, Google, Microsoft, Bloomberg, and Cloudflare are platinum members. When every major AI vendor and cloud provider backs the same protocol, the integration layer is being standardized. > **INFO: The Kubernetes parallel** > Google donated Kubernetes to the CNCF in 2015, transforming container orchestration from a fragmented mess of proprietary solutions into a neutral standard. AWS EKS, Google GKE, Azure AKS all built on the same foundation. MCP is positioned to follow a similar path for AI tooling, with the caveat that most protocol standardization attempts fail. SOAP, CORBA, and a graveyard of Linux Foundation projects had "vendor alignment" too. 
What makes MCP different is that adoption preceded standardization: 97 million monthly SDK downloads and 10,000+ servers in production before the foundation donation. The protocol is being standardized because it already won, not in hopes that it will.

Twelve months after launch, MCP hit 97 million monthly SDK downloads and over 10,000 active servers in production. For context, Kubernetes took years to reach comparable adoption metrics.

## The fragmentation tax

The current state of AI tooling looks like cloud infrastructure circa 2014. Every provider has its own integration mechanism: Claude has tool use, OpenAI has function calling, Google has Gemini extensions, Microsoft has Copilot agents. Tools built for one ecosystem are useless in another. If you have built anything non-trivial against these APIs, you already know the pain. The fragmentation costs you in three places:

- Switching costs: your GitHub integration for Claude requires a complete rewrite for GPT-4. You end up staying with your current provider not because it is the best but because the switching cost is too high. Lock-in by accident, not design.
- Fragmented effort: tool developers have to pick a platform or maintain parallel implementations. Most pick one. The Claude ecosystem gets some tools, the OpenAI ecosystem gets others, and nobody gets all of them. Innovation is scattered.
- Broken portability: your carefully configured agent environment does not transfer. The custom integrations, the prompt templates, the tool configurations: none of it moves with you when you evaluate a new model.

Before Kubernetes became the common layer, running the same workload on AWS ECS and on another cloud's container service meant maintaining two completely separate infrastructure codebases. Kubernetes unified that into one abstraction. MCP does the same for tool integration: write it once, run it everywhere.

## Three tiers decouple model from tool

MCP is a three-tier architecture with clean separation of concerns.
The elegance is in what it decouples: the user-facing app does not need to know about every tool, and tools do not need to know which model is calling them. The protocol sits in the middle and handles discovery and invocation.

### Host, client, and server split the work

- MCP Host: the user-facing application (VS Code, Claude Desktop, ChatGPT, Cursor). It manages the UI, maintains session state, runs the model loop that decides which tools to call, and instantiates MCP clients. This is where the agent's agency lives.
- MCP Client: the protocol connector inside the host. Each client holds a one-to-one session with a single server, negotiates capabilities, and relays requests and results between the host and that server.
- MCP Server: the integration layer. A GitHub server exposes repo operations, a Postgres server provides query access, a Slack server enables messaging. Each server publishes a schema describing its tools and resources.

The power of this architecture is dynamic capability discovery: hosts learn what a server can do at runtime. When you wire up a new MCP server, any client that connects to it learns its tools from the server's `tools/list` response. No tool definitions hard-coded on the client side, no redeploys. Point a host at a server and the tools show up.
```typescript (example-mcp-server.ts)
// MCP Server for GitHub integration
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { ListToolsRequestSchema } from '@modelcontextprotocol/sdk/types.js';

const server = new Server({
  name: 'github-mcp-server',
  version: '1.0.0',
}, {
  capabilities: {
    tools: {},
  },
});

// Advertise a tool for creating pull requests.
// The SDK registers handlers by request schema, not by method string.
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: 'create_pull_request',
      description: 'Create a new pull request on GitHub',
      inputSchema: {
        type: 'object',
        properties: {
          repo: { type: 'string', description: 'Repository name (owner/repo)' },
          title: { type: 'string', description: 'PR title' },
          body: { type: 'string', description: 'PR description' },
          head: { type: 'string', description: 'Branch to merge from' },
          base: { type: 'string', description: 'Branch to merge into' },
        },
        required: ['repo', 'title', 'head', 'base'],
      },
    },
  ],
}));

// A complete server also registers a CallToolRequestSchema handler
// that performs the GitHub API call when the tool is invoked.

// This server now works with Claude, GPT-4, Gemini, or any MCP-compatible client
const transport = new StdioServerTransport();
await server.connect(transport);
```

Once a `tools/call` handler is registered to make the actual GitHub request, that server works in Claude Desktop, ChatGPT, VS Code with Copilot, Cursor, and anything else that speaks MCP. You write it once. That is the entire value proposition in roughly 40 lines of code.

### JSON-RPC keeps it language-agnostic

Built on JSON-RPC 2.0: language-agnostic, transport-flexible. Servers communicate over stdio for local tools or Streamable HTTP for remote and cloud deployments. Streamable HTTP replaced the earlier HTTP+SSE transport in the 2025-03-26 spec revision, and it can still stream responses over Server-Sent Events.
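To make the wire format concrete, here is a rough sketch of the two JSON-RPC 2.0 exchanges a client drives against a server: `tools/list` for discovery and `tools/call` for invocation. The method names and the `inputSchema` and `content` fields follow the MCP spec as I understand it. The `handle` dispatcher, the `echo` tool, and the in-memory "transport" are assumptions for illustration, and the sketch skips the `initialize` handshake that a real session starts with.

```typescript
// Simplified JSON-RPC 2.0 envelopes (notifications and batching omitted).
interface JsonRpcRequest {
  jsonrpc: '2.0';
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

interface JsonRpcResponse {
  jsonrpc: '2.0';
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
}

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
}

// Hypothetical tool for illustration.
const tools: ToolDefinition[] = [{
  name: 'echo',
  description: 'Return the text it was given',
  inputSchema: { type: 'object', properties: { text: { type: 'string' } }, required: ['text'] },
}];

// Toy in-memory dispatcher standing in for a server; a real one sits behind stdio or HTTP.
function handle(req: JsonRpcRequest): JsonRpcResponse {
  switch (req.method) {
    case 'tools/list':
      return { jsonrpc: '2.0', id: req.id, result: { tools } };
    case 'tools/call': {
      const name = req.params?.name;
      const args = (req.params?.arguments ?? {}) as { text?: string };
      if (name !== 'echo') {
        // -32602 is the JSON-RPC 2.0 "Invalid params" code.
        return { jsonrpc: '2.0', id: req.id, error: { code: -32602, message: `Unknown tool: ${String(name)}` } };
      }
      return { jsonrpc: '2.0', id: req.id, result: { content: [{ type: 'text', text: args.text ?? '' }] } };
    }
    default:
      // -32601 is the JSON-RPC 2.0 "Method not found" code.
      return { jsonrpc: '2.0', id: req.id, error: { code: -32601, message: 'Method not found' } };
  }
}

// Discovery first, then invocation of a discovered tool.
const listed = handle({ jsonrpc: '2.0', id: 1, method: 'tools/list' });
const called = handle({
  jsonrpc: '2.0',
  id: 2,
  method: 'tools/call',
  params: { name: 'echo', arguments: { text: 'hello' } },
});

console.log(JSON.stringify(listed));
console.log(JSON.stringify(called));
```

Because every message is plain JSON with a method name and an id, nothing here depends on TypeScript: the same exchange can be implemented in any language that can read and write JSON over a pipe or an HTTP connection.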
The protocol defines four capability types:

- Resources: structured data the model can read (files, database records, API responses) without explicit tool invocation
- Prompts: predefined templates that trigger specific workflows, enabling consistent interaction patterns
- Tools: functions the model calls to execute actions (creating PRs, sending messages, running queries)
- Sampling: mechanisms for servers to request LLM completions, enabling tool chaining and multi-step reasoning within server-side logic

Two features are often attributed to MCP but are not part of it: Tool Search and Programmatic Tool Calling are Anthropic Claude API features (beta 'advanced-tool-use-2025-11-20', Nov 2025), not additions to the MCP 2025-11-25 spec. The 2025-11-25 spec itself added OIDC discovery, icons metadata, tool calling in sampling, and experimental Tasks. Tool Search still matters in an MCP deployment once it has thousands of available tools. Without efficient discovery, the model wastes context window on tool definitions it will never use. With Tool Search, clients query for relevant tools on demand instead of loading everything upfront.

## Adoption that actually matters

Adoption numbers are easy to inflate. What matters for MCP is who integrated and how fast they did it.

- **Monthly SDK Downloads**: 97M+ - Python and TypeScript implementations combined
- **Active MCP Servers**: 10,000+ - Public servers across GitHub, integrations, and enterprise tools
- **First-Class Clients**: 6 Named - ChatGPT, Claude, Cursor, VS Code, Gemini, Copilot (Anthropic donation post, Dec 2025)
- **Time to Integration**: <3 months - From announcement to production at major platforms

The signal is in the competitive dynamics. OpenAI announced MCP adoption in March 2025, about four months after Anthropic released it. In the MCP one-year-anniversary blog post (November 2025), OpenAI's Srinivas Narayanan said publicly that "it's now a key part of how we build at OpenAI, integrated across ChatGPT and our developer platform."
When your direct competitor adopts your protocol and says that publicly, the standard has won. Google integrated MCP into Gemini tool use. Microsoft built it into Copilot and Semantic Kernel. AWS, Google Cloud, and Azure all deployed managed MCP server hosting. Enterprise adoption tells the same story. Organizations are deploying MCP servers for internal tools (Salesforce integrations, proprietary database access, custom API wrappers) that work across their entire AI stack. Build the integration once, use it with Claude for analysis, GPT-4 for customer support, Gemini for search. That is a real engineering win, not a theoretical one. ## Agent-to-agent: the horizontal layer MCP solved vertical integration: agents connecting down to tools. Once your agents can reliably query databases and call APIs, the next question is obvious: how do agents talk to each other? A single agent calling MCP tools is useful. A network of specialized agents coordinating on complex tasks is a different class of system. ### Google's A2A protocol Google released the Agent-to-Agent protocol in April 2025, designed explicitly as MCP's complement. MCP handles agent-to-tool. A2A handles agent-to-agent. The separation is clean: - MCP: agent connects to GitHub to create a PR, queries Postgres, sends a Slack message. Vertical, agent-to-capability. - A2A: research agent shares findings with analysis agent, orchestrator delegates to specialists, customer service agent escalates to human. Horizontal, agent-to-agent. Google contributed A2A to the Linux Foundation as its own standalone Agent2Agent Protocol Project in June 2025, six months before the Agentic AI Foundation formed in December 2025 around MCP, goose, and AGENTS.md. A2A is not part of that foundation. MCP and A2A both live under the Linux Foundation umbrella, but as separate projects with separate governance, not one governance body. The alignment is in the protocols being complementary by design, not in shared stewardship. 
```typescript (multi-agent-coordination.ts)
// Multi-agent system: MCP for tools (vertical), A2A for coordination (horizontal)
// Illustrative only: McpClient, A2aClient, and customerAgent are hypothetical
// wrappers, not types from the official MCP or A2A SDKs.
interface McpClient {
  call(req: { server: string; tool: string; params: Record<string, unknown> }): Promise<any>;
}

interface A2aClient {
  request(req: { agent: string; task: string; context: unknown }): Promise<any>;
}

declare const customerAgent: {
  requestViaA2A(req: { targetAgent: string; task: string; params: Record<string, unknown> }): Promise<unknown>;
};

// Customer Service Agent receives inquiry, delegates via A2A
const orderStatus = await customerAgent.requestViaA2A({
  targetAgent: "order-tracking-agent",
  task: "lookup-order-status",
  params: { orderId: "ORD-12345" }
});

// Order Tracking Agent uses MCP to query the database
class OrderTrackingAgent {
  constructor(private mcpClient: McpClient, private a2aClient: A2aClient) {}

  async lookupOrder(orderId: string) {
    const orderData = await this.mcpClient.call({
      server: "postgres-mcp",
      tool: "query",
      params: { sql: "SELECT * FROM orders WHERE id = $1", values: [orderId] }
    });

    if (orderData.status === "delayed") {
      // Delegates to logistics agent via A2A
      const estimate = await this.a2aClient.request({
        agent: "logistics-agent",
        task: "estimate-delivery",
        context: orderData
      });
      return { ...orderData, newEstimate: estimate };
    }
    return orderData;
  }
}

// Logistics Agent uses MCP to call shipping API
class LogisticsAgent {
  constructor(private mcpClient: McpClient) {}

  async estimateDelivery(orderData: { trackingNumber: string }) {
    return await this.mcpClient.call({
      server: "fedex-mcp",
      tool: "track-package",
      params: { trackingNumber: orderData.trackingNumber }
    });
  }
}
// Three agents coordinated via A2A, each using MCP for tool access
```

Each agent is a specialist with MCP access to its own tools. A2A handles the delegation and data sharing. The customer service agent does not need database access; it delegates to the order tracking agent. This is microservices architecture applied to agents. If you have built service-oriented systems, the pattern will feel familiar.

> **INFO: The microservices parallel**
> HTTP became the standard for service-to-service communication. Service meshes added observability, routing, and resilience. Individual services used databases, queues, and APIs. The same pattern is repeating at a higher level of abstraction: A2A is the HTTP, agent meshes are the service meshes, MCP tools are the backing services.
### Frameworks sit above the protocols MCP and A2A are protocols, low-level standards for communication. Production systems need higher-level abstractions: lifecycle management, workflow orchestration, error handling. That is where LangGraph, CrewAI, and AutoGen come in, each with a different mental model: - LangGraph: treats agent workflows as stateful graphs. Agents are nodes, communication is edges. Strong for branching workflows with conditional handoffs. LangGraph 1.0 shipped October 2025. - CrewAI: models agent teams like organizations. Define roles, assign tasks, let agents collaborate. Includes enterprise features: observability, paid control plane. - AutoGen: frames multi-agent systems as conversations. Agents negotiate solutions through natural language. Strong for research, less opinionated about production. All three are integrating MCP. The frameworks building on open protocols rather than proprietary integrations will win long-term. That is the bet the entire ecosystem is making. Gartner reported a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025. Enterprises are moving from single agents to coordinated teams. ### The agent mesh When you have multiple agents coordinating via A2A and accessing tools via MCP, you need infrastructure to manage them at scale. The missing layer is agent meshes that provide discovery, routing, observability, and governance for agent networks. Just as Istio and Linkerd emerged to manage microservice complexity, agent meshes will solve the same class of problems for agent ecosystems: - Discovery: how does a customer service agent find available logistics agents? Agent registries tracking capabilities, availability, and SLAs. - Load balancing: intelligent routing across agent replicas based on load, specialization, and performance. - Circuit breaking: if an agent consistently fails, stop routing to it. Implement fallback strategies. - Observability: distributed tracing across agent networks. 
Which agents were invoked, how long each step took, where errors occurred. - Policy enforcement: which agents can talk to which? What data crosses agent boundaries? Security and compliance for inter-agent communication. This infrastructure does not exist in production-ready form yet. The building blocks are falling into place: MCP for tools, A2A for communication, orchestration frameworks for workflow management. The missing piece is the operational layer: deployment, scaling, monitoring, and governance at the mesh level. That is the next thing to build. ## Why vendors are cooperating The vendor cooperation is real. It seems counterintuitive that Anthropic, OpenAI, and Google would standardize on a protocol that makes switching easier. The logic is the same logic that made every cloud provider support Kubernetes despite it enabling multi-cloud: - Model quality is the moat: if switching is easy, the best model wins on merit. Anthropic, OpenAI, and Google all believe they can compete on quality. Standardizing tools shifts the competition to where they want it. - Shared ecosystems grow faster: instead of 1,000 Claude-specific tools, there are 10,000 MCP tools that work everywhere. A larger tool ecosystem makes all models more useful. The pie gets bigger. - Enterprises demand it: large companies will not build critical systems on proprietary integration APIs. They require vendor neutrality for the same reason they require Kubernetes over ECS, protecting their investment from single-vendor risk. Standardization on MCP gives a developer four concrete capabilities. Use Claude for code, GPT-4 for writing, Llama for sensitive data, all with the same tools. Build an MCP server once and it works in every host. Chain tools from different vendors into composable workflows. Code written against MCP today works with models that do not exist yet. ## Where the money is MCP and A2A are free. So where is the business? 
Same place it was with Kubernetes: the protocols are free, but production deployment at scale requires commercial infrastructure. Kubernetes is free and Red Hat sold to IBM for $34 billion on the strength of OpenShift. The MCP server market is projected at $2.7 billion in 2025, growing to $5.5 billion by 2034. The broader ecosystem (servers, gateways, orchestration, observability) is projected at $13.4 billion, growing at 34.6% CAGR. The pattern is identical to what happened with Kubernetes tooling. The commercial opportunities are concrete: - MCP gateways and proxies: enterprise infrastructure between agents and servers (auth, rate limiting, observability, policy enforcement). TrueFoundry and others are already building these. The API gateway layer for the agent world. - Domain-specific MCP servers: financial forecasting, compliance, legal summarization. Sold as Agent-as-a-Service. The API economy, but for agent tool access. - Agent orchestration platforms: CrewAI sells a control plane. LangChain sells LangSmith for observability. The framework is open, the production tooling is commercial. - Agent observability: distributed tracing for agent networks, cost tracking across model calls, performance analytics. The Datadog opportunity for the agent world. - **MCP Ecosystem (2025)**: $13.4B — TAM projection (addressable market if MCP becomes the universal standard, not current revenue). Includes servers, gateways, platforms - **Cost Reduction**: 90% — Plan-and-Execute pattern vs. frontier models for all tasks - **Production Success Rate**: 1 in 4 — Organizations that successfully scale agents to production - **API Gateway MCP Support**: 75% by 2026 — Vendors adding native MCP integration ### The plan-and-execute pattern The most effective production deployments I have seen use heterogeneous model architectures. A capable model (GPT-4, Claude Opus) creates the strategy. Cheaper models (Haiku, GPT-3.5, local Llama) execute the steps. 
Costs drop by 90% compared to running frontier models for everything. MCP and A2A make this pattern natural: an orchestrator agent on GPT-4 delegates via A2A to specialists running cheap models, each using MCP for tool access. Without standard protocols, you are writing custom integration code for every model-tool combination.

### The reality check

Nearly two-thirds of organizations are experimenting with AI agents. Fewer than one in four have scaled them to production. Analysts predict 40%+ of current agentic AI projects could be cancelled by 2027 due to cost overruns, scaling complexity, or unexpected risks.

The technology works. The operational maturity is not there yet. That gap is exactly where the commercial opportunity lives. The companies that solve production deployment (MCP gateways, agent orchestrators, observability platforms, cost optimization tools) will capture enormous value. Same story as Kubernetes. The infrastructure is free, the problems are hard, and the companies that make it easy to run in production win.

## The security model is immature

MCP is early. The adoption is promising but several hard problems remain unsolved.

Governance: the Agentic AI Foundation is brand new. How will it handle spec changes? Who decides what gets prioritized? Will it stay neutral if one vendor dominates contributions? Kubernetes worked through the same questions in the CNCF. MCP needs that clarity soon.

Scale: Tool Search, the Claude API beta from November 2025, helps a single client cope with thousands of tools, but it is a vendor feature and not part of the MCP spec. As servers proliferate, discovery and orchestration become bottlenecks. How do clients search across hundreds of servers efficiently? These are solvable (package registries and search indexes are precedents) but they need solving now.

Security: MCP servers can delete GitHub repos, query databases, send emails. The trust model is immature.
The ecosystem needs fine-grained permission models, server signing and verification, audit trails for all tool calls, and sandboxing for untrusted servers. The browser solved similar problems with web APIs and permission prompts. MCP needs equivalent patterns, and they need to ship before a high-profile security incident forces the issue. The speed of adoption came at a cost: the MCP spec has no built-in authorization framework for tool-level permissions. Any MCP server can declare any tool, and the host trusts what the server advertises. For production deployments, this means building your own RBAC layer on top, or limiting MCP server access to trusted, internally-maintained servers. The 2025-11-25 spec shipped OAuth 2.1 authorization, but only at the transport and server level. There is still no built-in tool-level permission model; SEP-1880, which proposed per-tool OAuth scopes, was closed as not planned, so enforcement stays with the implementer. Debugging MCP is painful today. When a tool call fails through the JSON-RPC layer, the error messages are often opaque. Tracing through stdio transport gives you raw JSON with no standardized logging format. The DX is early-stage. It works, but "works" and "is pleasant to debug at 2am" are different things. This will improve as the spec matures. If you adopt MCP today, budget time for debugging infrastructure. ## What I am doing about it I am building on MCP today, not waiting for it to mature. My deployment pipeline runs through MCP servers. My agents use MCP for infrastructure operations. Every tool I build targets MCP as the interface because I know it will work with whatever model I am using in six months. The practical bet: invest in the standard now, and your tooling becomes portable for free. The bigger picture: MCP is the foundation layer, A2A is the coordination layer, orchestration frameworks handle workflows, and agent meshes will handle operations. The full stack is not built yet. 
The lower layers are solid and the upper layers are forming fast. For anyone building AI applications, writing against MCP and A2A today is the highest-leverage investment available.

The next twelve months will tell us whether MCP becomes the true universal standard or fragments under competitive pressure. The bet here is that it sticks. The vendor alignment is too broad, the adoption is too fast, and the alternative (going back to per-model integration code) is too painful. The Kubernetes moment for AI tooling is here.

## Resources & Further Reading

- MCP Official Specification: https://modelcontextprotocol.io/specification/2025-11-25 - November 2025 specification (OIDC discovery, icons metadata, sampling tool-calling, experimental Tasks)
- MCP GitHub Organization: https://github.com/modelcontextprotocol - SDKs, reference implementations, and server examples
- Anthropic - Donating MCP to Agentic AI Foundation: https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation
- Linux Foundation - Agentic AI Foundation Launch: https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation
- MCP One Year Anniversary: https://blog.modelcontextprotocol.io/posts/2025-11-25-first-mcp-anniversary/
- Google A2A Protocol: https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/
- IBM - A2A Protocol Overview: https://www.ibm.com/think/topics/agent2agent-protocol
- Deloitte - AI Agent Orchestration: https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2026/ai-agent-orchestration.html
- Enterprise AI Stack 2026: https://dextralabs.com/blog/enterprise-ai-stack-2026-mcp-a2a-domain-models/
- OpenAI - Joining Agentic AI Foundation: https://openai.com/index/agentic-ai-foundation/
- GitHub Blog - MCP Joins Linux Foundation:
https://github.blog/open-source/maintainers/mcp-joins-the-linux-foundation-what-this-means-for-developers-building-the-next-era-of-ai-tools-and-agents/ --- # The CLI Patterns Behind Stripe and Uber - **URL**: https://www.stxkxs.io/blog/devops-cli-tools - **Published**: 2025-11-12 - **Author**: Brandon Stokes - **Category**: platform-engineering - **Tags**: cli, developer-tools, devops, developer-experience, tooling, automation - **Reading time**: 12 min Uber engineers used to juggle 10+ CLI tools for deployments. Then they built UP, a unified CLI that reduced context switching by 50%. Here's how companies like Stripe, Heroku, and Railway are building developer CLIs that feel like magic. ## The CLI chaos problem A backend engineer needs to deploy a config change to production. The workflow goes like this: run kubectl get pods -n payments to check current deployments, ssh to a bastion host to read logs, trigger the deployment through an internal tool, run curl commands to verify the change, check the metrics dashboard, and update the deployment doc in the wiki. Each tool has its own authentication, its own flags, and its own mental model. A 2-minute task stretches into 25 because most of the time goes to remembering how each tool wants to be invoked. This is not an edge case. At a mid-size platform, engineers run 10-15 command-line tools daily. Each was built by a different team, at a different time, with a different design philosophy. Some use AWS-style flags (--flag=value), others use GNU-style (--flag value), and some use single-letter shortcuts (-f). Authentication varies: SSO for one, AWS credentials for another, a hardcoded API key still lurking in a third. I have lived this at smaller scale: four internal tools, four different auth flows, four different ways to specify an environment. You do not notice the friction until you watch a new hire spend an entire afternoon just getting credentials working across all of them. 
The pattern repeated across tech companies. Stripe engineers navigated between stripe-cli, kubectl, terraform, and half a dozen internal tools. Netflix engineers had dozens of specialized CLIs for their microservices platform. Shopify engineers complained about memorizing flags for 20+ tools. As companies scaled their infrastructure, they created tool sprawl that slowed everyone down. - **CLI Tools Per Developer**: 10-15 — Typical count at mid-size tech companies - **Time Spent Context Switching**: 20-30 min/day — Looking up flags, switching auth contexts, mental model shifts - **Onboarding Overhead**: 1-2 weeks — Learning company-specific CLI tools - **Error Rate**: 15-20% — Wrong flags, wrong context, wrong environment > **INFO: The hidden cost of tool sprawl** > Context switching and tool navigation carry a daily time cost. For teams running 10+ CLI tools daily, the overhead is measured in hours per week, not minutes. ## Platform vendors treat CLIs as products Companies building developer platforms treat the CLI as a competitive advantage. Stripe's CLI set the reference pattern for API development tools, and Heroku's CLI defined the deployment experience that later platforms copied. Railway and Vercel treat their CLIs as first-class products. The patterns below come from those tools, all of which you can install and study yourself. ### Stripe CLI: local API development Stripe CLI solved a specific problem: testing webhook integrations locally. Before the CLI, developers had to deploy to a staging environment to test webhooks, which slowed iteration. The Stripe CLI introduced webhook forwarding: run "stripe listen --forward-to localhost:3000/webhooks" and Stripe would send webhook events to your local machine. Combined with event triggering ("stripe trigger payment_intent.succeeded"), developers could test the entire payment flow locally in seconds. The real innovation was the developer experience. 
The CLI used clear, consistent commands (stripe [resource] [action]), provided helpful error messages with links to documentation, and included interactive mode for exploring the API. Run "stripe" without arguments and you get an interactive shell with autocomplete and inline documentation. ```bash # Stripe CLI workflow for testing payments locally stripe login stripe listen --forward-to localhost:3000/webhooks # In another terminal stripe trigger payment_intent.succeeded stripe trigger customer.subscription.created stripe trigger invoice.payment_failed # View events in dashboard format stripe events list --limit 10 ``` ### Heroku CLI: Git-centric deployment Heroku CLI introduced the pattern that influenced every modern platform: treating deployment as a Git operation. Run "heroku create" and it adds a Git remote. Deploy with "git push heroku main". Check logs with "heroku logs --tail". The genius was making the CLI feel like a natural extension of Git, which developers already knew. No need to learn a new mental model. Deployment became just another Git operation. Heroku also pioneered the plugin architecture for CLIs. Third-party developers could extend the Heroku CLI with custom commands, creating an ecosystem of tools. This pattern is now standard in modern CLIs like GitHub CLI, Vercel CLI, and Railway CLI. ### Railway and Vercel: zero-config magic Modern CLIs like Railway and Vercel take the pattern further: zero-configuration deployment. Run "railway init" in a Node.js project and it detects the framework, configures the build, and deploys. No YAML files, no configuration, no manual steps. The CLI infers everything from your code structure. Vercel CLI does the same for frontend projects: "vercel" in a Next.js directory just works. These CLIs use interactive prompts with smart defaults. Need a database? Railway CLI shows a menu of database types, provisions it, and injects connection strings as environment variables. 
The developer never leaves their terminal, never opens a web dashboard, and never writes infrastructure-as-code. The CLI handles everything.

## Great CLIs infer context

Analyzing dozens of successful developer CLIs surfaces several patterns that separate great tools from mediocre ones. These patterns apply whether you are building an internal platform CLI or a product CLI like Stripe's.

### Fuzzy search and discoverability

Great CLIs do not require memorization. Instead of forcing developers to remember exact command syntax, they provide fuzzy search and interactive menus. GitHub CLI applies this selectively: running "gh" with no arguments prints static help text listing the available top-level commands, and type-to-filter prompts appear only within certain interactive subcommands (e.g. "gh pr merge"). Where it is used, interactive discovery dramatically reduces the learning curve. New engineers can explore the CLI by typing and seeing what is available.

Implementation-wise, libraries like fzf, inquirer, or charm's bubbletea provide the building blocks. The key is making discovery natural: if a user runs a command without required arguments, show them options instead of erroring. If they provide an invalid value, suggest valid alternatives.

### Context awareness

Great CLIs understand context and provide smart defaults. They detect the current Git branch, parse the working directory structure, remember recent operations, and infer the target environment. Railway CLI exemplifies this: run "railway up" in a directory that is already linked to a Railway project and it deploys to that project automatically. No need to specify project ID, environment, or service name.

Context can be stored in local files (.railway.json, .vercel.json), environment variables, or a CLI config directory (~/.config/cli-name). The pattern is to make the default behavior do what the developer expects 90% of the time, while still allowing explicit overrides via flags.
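The override order described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Railway or Vercel implement it: the `.devctl.json` file name, the `DEVCTL_ENV` variable, and the `resolve_environment` helper are all hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_ENV = "dev"  # hypothetical fallback when nothing else is set


def find_project_config(start: Path, filename: str = ".devctl.json") -> Optional[Path]:
    """Walk up from `start` until a config file is found, similar to how Git locates .git."""
    for directory in [start, *start.parents]:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


def resolve_environment(
    flag_value: Optional[str],
    cwd: Path,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve the target environment: explicit flag > env var > project file > default."""
    environ = os.environ if environ is None else environ
    if flag_value:
        return flag_value  # an explicit flag always wins
    if environ.get("DEVCTL_ENV"):
        return environ["DEVCTL_ENV"]
    config_path = find_project_config(cwd)
    if config_path is not None:
        config = json.loads(config_path.read_text())
        if config.get("environment"):
            return config["environment"]
    return DEFAULT_ENV
```

A command handler would call `resolve_environment(flag_value, Path.cwd())` once and pass the result down, so every subcommand shares one precedence rule instead of re-implementing its own defaults.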
### Consistent command structure Stripe CLI uses the pattern "stripe [resource] [action]": stripe customers list, stripe invoices create, stripe subscriptions cancel. This structure is predictable and discoverable. Once you understand the pattern, you can guess commands you have never used before. Compare this to inconsistent CLIs where every command feels different. The best structure is often "[cli] [noun] [verb] [flags]" or "[cli] [verb] [noun] [flags]". Pick one pattern and stick to it religiously. Docker uses "docker [object] [action]" (docker container run, docker network create). Kubernetes uses "kubectl [verb] [object]" (kubectl get pods, kubectl delete deployment). ```bash # Good: Consistent pattern railway service list railway service logs railway service scale railway database create railway database connect # Bad: Inconsistent pattern railway list-services railway get-logs --service= railway scale-up railway db:create railway connect-to-database ``` ### Helpful error messages Poor CLIs print cryptic errors and stack traces. Great CLIs explain what went wrong, why it went wrong, and how to fix it. Rust's cargo is famous for this: errors include the problem, the cause, a suggestion, and relevant documentation links. Stripe CLI follows this pattern: when authentication fails, it does not just say "401 Unauthorized." It explains that your API key is invalid, shows you how to find your API keys in the dashboard, and provides a link to the authentication docs. The pattern is to catch common errors (missing credentials, wrong environment, network failures) and provide specific guidance. Include links to documentation, suggest the correct command if they typo'd, and use color coding (red for errors, yellow for warnings, green for success) to make output scannable. ### Fast feedback and progress Operations that take more than 1 second need progress indication. 
Vercel CLI shows a spinner with status messages during deployment: "Building...", "Uploading...", "Deploying...". Railway CLI goes further with a real-time log stream showing build output as it happens. This feedback loop is critical for developer experience. Waiting 30 seconds with no output creates anxiety. A progress bar or streaming logs provides reassurance. For fast operations (<1 second), immediate feedback is critical. When a developer runs "railway service scale webapp --replicas 3", they should see "Scaled webapp to 3 replicas" within 500ms. Do not make them wait for an API call to complete; return immediately and handle the operation asynchronously if needed. > **A CLI feels magic when it infers what you are about to type and supplies the right default. Magic CLIs eliminate the round-trip between your brain and the docs. They infer context, suggest the right default, and get out of your way. A functional CLI makes you specify everything. A magic one already knows.** ## Building your own CLI For a platform or internal tooling team at a company with 50+ engineers, a unified CLI is a high-leverage investment. A practical guide based on successful implementations: ### Choose your foundation Pick a framework and commit to it. Go: Cobra (what kubectl and GitHub CLI use). Python: Click or Typer. Node.js: Commander or oclif (Heroku and Salesforce). Rust: clap. They all handle the boring parts (argument parsing, subcommands, help text) so you can focus on the workflows that actually matter to your engineers. 
```python (cli-foundation-example.py)
# Example CLI structure using Python Click
import click
from rich.console import Console
from rich.table import Table

console = Console()

# fetch_services, stream_logs, and print_logs are placeholders for your
# platform API client and log backend; they are not defined here.


@click.group()
@click.version_option(version='1.0.0')
def cli():
    """DevCTL - Unified developer CLI for platform operations"""


@cli.group()
def service():
    """Manage services (deploy, logs, scale, etc)"""


# name='list' keeps the command name without shadowing Python's built-in list
@service.command(name='list')
@click.option('--environment', '-e', default='dev',
              help='Target environment (dev/staging/prod)')
def list_services(environment):
    """List all services in an environment"""
    # Fetch services from API
    services = fetch_services(environment)

    # Display as formatted table
    table = Table(title=f"Services in {environment}")
    table.add_column("Name", style="cyan")
    table.add_column("Status", style="green")
    table.add_column("Replicas")
    table.add_column("CPU/Memory")

    for svc in services:
        table.add_row(
            svc.name,
            svc.status,
            str(svc.replicas),
            f"{svc.cpu}/{svc.memory}"
        )

    console.print(table)


@service.command()
@click.argument('name')
@click.option('--tail', '-f', is_flag=True, help='Follow log output')
def logs(name, tail):
    """View logs for a service"""
    if tail:
        stream_logs(name)  # Real-time streaming
    else:
        print_logs(name)  # Historical logs


if __name__ == '__main__':
    cli()
```

### Implement fuzzy search

Interactive fuzzy search transforms the CLI experience. Instead of requiring exact service names, let developers type partial matches and select from a filtered list. The pattern: when a command needs a service name and the user did not provide one, fetch the list and show an interactive picker instead of printing a usage error. Libraries like questionary (Python), promptui (Go), or enquirer (Node.js) make this trivial to implement.
```python (fuzzy-search-example.py)
# Interactive service selection with questionary's autocomplete prompt
import questionary

# Assumes `service` (the Click group) and `console` (the rich Console) from the
# previous example; fetch_all_services and deploy_service are placeholders.


def select_service():
    """Interactive service selection with type-to-filter matching"""
    services = fetch_all_services()

    # Map each display title back to the service name so callers get the name,
    # not the decorated title string
    titles = {
        f"{svc.name} ({svc.env}) - {svc.status}": svc.name
        for svc in services
    }

    selected = questionary.autocomplete(
        "Select a service:",
        choices=list(titles),
        match_middle=True,  # match anywhere in the title, not only the prefix
    ).ask()

    # .ask() returns None on Ctrl-C, and free-typed text may not match a title
    return titles.get(selected)


@service.command()
def deploy():
    """Deploy a service (interactive)"""
    # Let user select service instead of requiring argument
    service_name = select_service()
    if service_name is None:
        console.print("[yellow]No service selected[/yellow]")
        return

    environment = questionary.select(
        "Select environment:",
        choices=['dev', 'staging', 'prod']
    ).ask()
    if environment is None:
        return

    # Confirm before deploying to production
    if environment == 'prod':
        confirmed = questionary.confirm(
            f"Deploy {service_name} to PRODUCTION?"
        ).ask()
        if not confirmed:
            console.print("[yellow]Deployment cancelled[/yellow]")
            return

    # Execute deployment
    deploy_service(service_name, environment)
```

### Add context awareness

Store context in a local config file (.devctl.json) to remember the user's preferences, recent operations, and default values. When a developer runs "devctl deploy" in a service's Git repository, the CLI should detect the service name from the repo and default to deploying that service. Use Git branch names to infer target environments: main → prod, develop → staging, feature branches → dev.

Context awareness dramatically reduces keystrokes. Instead of "devctl deploy payment-service --env prod --region us-west-2", an engineer in the payment-service repo on the main branch just types "devctl deploy" and the CLI infers everything.

### Provide escape hatches

While interactive mode is great for exploration, experienced users want fast, scriptable commands. Support both modes: interactive when arguments are missing, non-interactive when all arguments are provided.
Every interactive command should have a non-interactive equivalent with flags. This makes the CLI usable in CI/CD pipelines and scripts.

Pair interactive mode with structured output. Interactive mode handles discoverability and common workflows; a --json or --format flag gives scripts machine-readable output to parse and pipe. GitHub CLI and Stripe CLI both do this, and it is the pattern worth copying. It also blunts the strongest argument against unified CLIs, covered later under "Sometimes separate tools win".

```bash
# Interactive mode - great for learning
devctl deploy
# → Shows service picker
# → Shows environment picker
# → Confirms and deploys

# Non-interactive mode - great for scripts/CI
devctl deploy payment-service --env prod --skip-confirm
# → Deploys immediately, no prompts

# Hybrid mode - provides some args, interactive for others
devctl deploy payment-service
# → Only asks for environment since service is specified
```

## CLI scope scales with headcount

The need for a unified CLI emerges at different scales depending on company complexity. Here is how different company sizes approach developer tooling:

### Startup: scripts to CLI

Early-stage companies start with Bash scripts scattered across the repository: deploy.sh, check-logs.sh, scale-service.sh. As the team grows past 20 engineers, these scripts become hard to discover and maintain. The first step toward a unified CLI is consolidating these scripts into a single entry point: a devctl wrapper that calls the underlying scripts.

The MVP CLI has 5-10 commands: deploy, logs, shell, scale, rollback. It wraps kubectl, docker, and AWS CLI with sensible defaults for your environment. Total build time: 1-2 weeks. The payoff is immediate: new engineers can run "devctl help" and see all available operations instead of hunting through the scripts directory.
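As a rough sketch of that first consolidation step, a single entry point can discover the existing scripts and expose each one as a subcommand. The `scripts/` directory layout and the `discover_commands`, `format_help`, and `main` helpers are assumptions for illustration; a real wrapper would add argument validation and per-command help.

```python
import subprocess
import sys
from pathlib import Path
from typing import Dict, List


def discover_commands(scripts_dir: Path) -> Dict[str, Path]:
    """Map command names to scripts: deploy.sh -> 'deploy', check-logs.sh -> 'check-logs'."""
    if not scripts_dir.is_dir():
        return {}
    return {path.stem: path for path in sorted(scripts_dir.glob("*.sh"))}


def format_help(commands: Dict[str, Path]) -> str:
    """Build the help text listing every discovered command."""
    lines = ["Usage: devctl <command> [args...]", "", "Available commands:"]
    lines += [f"  {name}" for name in commands]
    return "\n".join(lines)


def main(argv: List[str], scripts_dir: Path = Path("scripts")) -> int:
    """Dispatch `devctl <command> [args...]` to the matching script."""
    commands = discover_commands(scripts_dir)
    if not argv or argv[0] in ("help", "-h", "--help"):
        print(format_help(commands))
        return 0
    name, *args = argv
    if name not in commands:
        print(f"Unknown command: {name}\n\n{format_help(commands)}", file=sys.stderr)
        return 2
    # Forward remaining arguments untouched and propagate the script's exit code
    return subprocess.run(["bash", str(commands[name]), *args]).returncode
```

Wiring this up as an executable is one more line, `sys.exit(main(sys.argv[1:]))`, in the entry-point script. Because commands are discovered rather than registered, dropping a new script into the directory makes it show up in the help output with no wrapper change.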
- **Build Time**: 1-2 weeks — MVP with 5-10 core commands - **Onboarding Time Reduction**: 50% — From 4 days to 2 days - **Commands Needed**: 5-10 — Deploy, logs, shell, scale, rollback, etc. - **Adoption Rate**: 70%+ — Engineers using CLI within first month ### Growth: platform CLI Mid-size companies have multiple teams, multiple environments, and increasing operational complexity. The CLI evolves from a script wrapper to a proper platform interface. It handles service discovery (list all services across teams), environment management (dev, staging, prod, feature environments), resource provisioning (databases, caches, queues), and deployment orchestration (canary deploys, blue-green, rollbacks). At this scale, the CLI needs a plugin architecture. Different teams have different needs: the data team needs Spark job submission, the ML team needs model deployment, the infrastructure team needs cluster management. A plugin system lets teams extend the CLI without modifying the core. Core commands stay with the platform team; team-specific commands ship as plugins those teams own and release on their own schedule. Extensibility is the hard problem CLI blog posts rarely cover. When different teams need to add commands to the unified CLI, you hit the plugin architecture problem: how do teams extend the CLI independently? How do you version plugins separately from the core? How do you handle conflicting dependencies between plugins? A plugin registry that lazy-loads team-specific commands is the usual answer. Shopify's CLI uses exactly this model, where plugins are versioned npm packages resolved at runtime. Without an extensibility story, the unified CLI becomes a bottleneck where one team's changes block another team's release. - **Commands Available**: 30-50 — Core + team-specific plugins - **Daily Active Users**: 80-90% — Of engineering team - **Context Switch Reduction**: 50% — Time saved not switching between tools - **Support Tickets**: -30% — Fewer "how do I..." 
questions

### Enterprise: multi-tenant CLI

Large enterprises need CLIs that work across business units, regions, and compliance boundaries. The CLI becomes a critical piece of governance: it enforces policies (no production deploys without approval), integrates with compliance systems (audit logging, change management), and supports multi-tenancy (different teams see different services and environments).

At this scale, the CLI handles thousands of services, hundreds of teams, and global deployments. It integrates with the in-house deployment system and the monitoring stack. The CLI is the abstraction layer that hides complexity: engineers do not need to know which Kubernetes cluster their service runs on or which AWS account hosts their database. The CLI figures it out based on service metadata.

> **TIP: Investment vs returns**
> A well-built CLI at enterprise scale requires 2-3 full-time engineers to maintain. For a 500-person engineering org, that is about 0.5% of headcount. At 20-30 min/day saved per engineer, the org recovers roughly 170-250 engineering hours a day, or 20-30 engineers' worth of time on an 8-hour day. Against 2-3 maintainers, that is a return on the order of 10x within the first year.

## Sometimes separate tools win

The Unix philosophy argues for the opposite of a unified CLI: small, composable tools that each do one thing well and pipe into each other. That position is correct often enough that it is worth defending before you write a line of CLI code. A unified CLI is convenient until you need to compose it with something it does not support, and then you are worse off than you would have been with separate tools and a pipe.

Below roughly 20 engineers, building one is a poor trade. A handful of Bash scripts and the underlying tools (kubectl, docker, the AWS CLI) cover the workflows, and the time spent building and maintaining a wrapper buys little when everyone already knows the three commands they run. The wrapper itself becomes a thing to learn, document, and keep current.

A single-tool shop has nothing to unify.
If the daily workflow is kubectl and almost nothing else, wrapping kubectl in a second CLI adds a layer of indirection over a tool the team already knows well, and the wrapper inevitably lags behind kubectl's own flags and releases. The same applies to a team that lives inside Terraform or a single cloud provider's CLI. Composability beats guidance in some environments. CI pipelines, data engineering, and any workflow where output feeds the next command want raw, parseable text and exit codes, not interactive menus. A CLI built around prompts and pretty tables fights that grain. If structured output and pipeability are the requirement, separate single-purpose tools are the better answer, and the unified CLI is the wrong call. ## Build the painful workflows first Studying the CLIs from Stripe, Heroku, GitHub, and the platform vendors above surfaces several lessons about what separates successful CLIs from abandoned ones. Start with the most painful workflows first. Do not try to build a comprehensive CLI that covers every operation. Identify the handful of things engineers do most frequently, usually deploy, logs, and rollback, and make those workflows excellent. I learned this the hard way building internal tooling: shipped 20 commands at launch and engineers used exactly 3. The other 17 were dead weight that made the help output noisy and the tool feel bloated. Ship the pain-killers first, expand later. Interactive mode beats flags for exploration, but power users need both. New engineers love interactive menus because they can explore without memorizing commands. Senior engineers love non-interactive mode because they can script operations and work fast. Support both modes from day one. The pattern is simple: if required arguments are missing, enter interactive mode. If all arguments are provided, execute immediately. Invest in error messages. Stripe CLI is beloved because errors are helpful. 
When something fails, the CLI explains what happened, why, and how to fix it. Poor CLIs print stack traces and exit codes. Great CLIs treat errors as teaching moments. Every error message is an opportunity to make the CLI easier to use. > **WARNING: The configuration trap** > Avoid requiring extensive configuration before the CLI is usable. Tools that need 30 minutes of setup (config files, API keys, environment variables) get abandoned. Make the first command work in under 60 seconds. Stripe CLI does this well: "stripe login" gives you a pairing code, opens a browser to confirm access, and auto-generates an API key it stores for you in ~/.config/stripe/config.toml. No keys to copy-paste, no config files to hand-edit. A CLI is a product, not a script. It needs documentation, release notes, versioning, and user feedback loops. Successful CLI teams treat their tool like a product: they track usage metrics (which commands are popular, where do users struggle), gather feedback through surveys and support channels, and iterate based on data. The best CLI teams have dedicated product managers and designers, not just engineers. ## The future of developer CLIs The developer CLI renaissance is accelerating. Companies that once relied on web dashboards are rebuilding their tools as CLI-first experiences. GitHub CLI became so popular it changed how developers interact with GitHub, making the web UI secondary. Vercel and Railway built their entire platforms around the CLI experience, with the web dashboard as a companion tool rather than the primary interface. Stripe did not have to build a CLI. Doing so made their API easier to adopt and gave them an edge over payment processors that shipped only web dashboards, because the CLI met developers where they already worked. The threshold where a unified CLI pays off is concrete: 50 or more engineers running half a dozen tools with conflicting flags and auth flows. 
Below 20 engineers, or in a single-tool shop where the daily workflow is mostly kubectl, the wrapper costs more than it returns. The same goes for CI and data pipelines, where parseable output beats interactive menus. The harder problem is not the foundation libraries but the plugin architecture that lets teams ship commands without blocking each other. Solve that and the CLI keeps earning its place as the org grows; skip it and the unified CLI becomes the bottleneck it was supposed to remove. - **Developer Time Saved**: 20-30 min/day — Per engineer using unified CLI - **Onboarding Acceleration**: 50-70% — Time reduction for new engineers - **Error Rate Reduction**: 30-40% — Fewer mistakes with guided workflows - **Adoption Rate**: 80-90% — When CLI provides real value ## Resources & Further Reading - Stripe CLI: https://docs.stripe.com/stripe-cli - Example of excellent developer CLI - GitHub CLI: https://cli.github.com/ - Open source, worth studying - Heroku CLI Architecture: https://github.com/heroku/cli - Plugin-based CLI example - Cobra (Go): https://github.com/spf13/cobra - CLI framework used by kubectl - Click (Python): https://click.palletsprojects.com/ - Elegant CLI framework - oclif (Node.js): https://oclif.io/ - Framework used by Heroku and Salesforce - CLI Guidelines: https://clig.dev/ - Best practices for building CLIs - Charm Libraries: https://charm.land/ - Beautiful terminal UI components --- # How Uber Queries Billions of Events in Milliseconds - **URL**: https://www.stxkxs.io/blog/real-time-analytics-druid - **Published**: 2025-11-05 - **Author**: Brandon Stokes - **Category**: data - **Tags**: real-time-analytics, apache-druid, kafka, streaming, data-engineering, olap - **Reading time**: 13 min When Uber's surge pricing needed sub-second analytics on billions of events, data warehouses couldn't keep up. Here's how companies like Uber, Airbnb, and Netflix built real-time analytics platforms with Apache Druid and Kafka. 
## The surge pricing problem In 2014, Uber faced a critical problem that crystallized during the intense demand of New Year's Eve celebrations across major cities. Their surge pricing algorithm needed to calculate demand across thousands of city zones every few seconds to balance supply with rider requests in real-time. The hard part was the combination of high cardinality, freshness, and concurrency the system had to sustain: how many ride requests happened in downtown San Francisco in the last 2 minutes? What is the current supply-to-demand ratio across all zones in the city? Which neighborhoods are trending toward higher demand and should trigger driver repositioning? Their existing data warehouse, built on the Hadoop and Hive stack that represented best practices at the time, could technically answer these questions. Queries would eventually complete and return accurate results. Each query took 30-45 seconds to execute, a latency that made the system effectively useless for real-time decision making. For surge pricing, 30 seconds was an eternity during which demand patterns could shift completely. By the time the calculation completed, the market conditions that prompted the query had already changed. Uber needed sub-second query latency on billions of real-time events, and their existing infrastructure could not deliver it. This challenge was not unique to Uber. Airbnb hit it with pricing optimization and fraud detection that had to block suspicious transactions before they completed. PayPal hit it evaluating transaction risk in the milliseconds between initiation and authorization. The shared pattern: an operational decision that directly affected user experience, gated on an analytical query that traditional data warehouses could not return fast enough. As companies scaled their data volumes, they hit the same wall Uber did. 
- **Traditional Warehouse Query Time**: 30-45s — Typical query latency for Hive/Redshift on large datasets - **Required Latency**: <500ms — Target for user-facing dashboards and real-time decisions - **Event Volume**: 5B+ events/day — Scale at companies like Netflix, Airbnb - **Data Freshness**: <10s — Time from event occurrence to queryable state > **INFO: OLTP vs OLAP vs Real-Time OLAP** > OLTP databases (PostgreSQL, MySQL) handle transactional queries fast but struggle with analytical aggregations. OLAP databases (Redshift, BigQuery) handle analytical queries well, typically returning in seconds and sometimes tens of seconds on large or complex queries. Real-time OLAP (Druid, ClickHouse) bridges the gap with sub-second analytical queries on streaming data. ## From batch to streaming Most companies begin their analytics journey with batch processing because it aligns with how traditional data infrastructure was designed to operate. You extract data from production databases overnight through scheduled ETL jobs, load it into a data warehouse like Redshift or BigQuery during off-peak hours, and run analytical queries the next morning against what amounts to a snapshot of yesterday's state. This worked perfectly well during an era when "real-time" meant yesterday's data and business decisions operated on daily or weekly cycles. As competition intensified and user expectations evolved, yesterday's data became increasingly insufficient. By the time you analyzed what happened, competitors had already responded to the same signals. The first attempt at improvement was simply to run batch jobs more frequently. Instead of nightly extraction and loading, run the pipeline hourly. Then push it to every 15 minutes as latency requirements tightened further. This approach encountered fundamental limitations that could not be overcome through optimization alone. 
Each batch job carried irreducible overhead: scheduling delays, resource allocation time, data extraction coordination, and loading sequences. Even with heavily optimized ETL pipelines running on dedicated infrastructure, organizations could not reliably achieve latency below 5-10 minutes. The compute costs scaled linearly with frequency: running hourly cost 24 times what nightly runs cost, with diminishing returns on latency improvement. The breakthrough came with streaming architectures that inverted the traditional data flow model. Instead of periodically extracting data from production systems, applications would emit events continuously to a message queue like Apache Kafka as those events occurred. Stream processors like Kafka Streams or Apache Flink would process these events in real-time as they arrived, maintaining derived state and triggering downstream actions. This solved the data freshness problem definitively. Events could flow from source to processing in milliseconds rather than hours. It created a new problem: how do you run fast, flexible analytical queries against data that is continuously streaming rather than static? ### Fast queries on streams Getting data into Kafka was relatively straightforward once the instrumentation was in place. Processing events with Flink to maintain materialized views and trigger actions was a well-understood pattern. Most analytical queries that business users wanted to run did not fit naturally into the stream processing model that Flink and similar systems were designed for. Questions like "Show me hourly signups by country for the last 30 days" or "What is the 95th percentile API latency broken down by endpoint?" required random access to historical data combined with complex aggregations across arbitrary time windows. Stream processors excelled at forward-only processing of events as they arrived, maintaining pre-defined aggregations and triggering alerts. 
They struggled with the ad-hoc queries that analysts needed to explore data and answer questions that had not been anticipated when the pipeline was designed. You could stream data into a traditional OLAP data warehouse like Redshift or BigQuery. These systems were designed for batch ingestion, not continuous streaming. High-frequency writes caused performance degradation, and query latency remained in the 10-30 second range. Companies needed a database built specifically for real-time analytical queries on time-series event data. ## Apache Druid for real-time OLAP Apache Druid emerged from this exact problem at Metamarkets (later acquired by Snap Inc.). They needed sub-second queries on billions of advertising impression events while data was still streaming in. Druid's architecture was purpose-built for this use case: columnar storage for fast analytical scans, time-partitioned data with automatic retention policies, pre-aggregation at ingestion time to reduce query-time computation, distributed architecture that scales horizontally for both ingestion and queries, and native integration with Kafka for real-time ingestion. Netflix is one of the clearest examples of Druid at this scale. They read playback and device events directly from Kafka streams to monitor streaming quality and the overall member experience, tracking metrics like error rates and engagement across their service. Their Druid deployment now holds over 10 trillion rows and answers the dashboard queries behind monitoring, experimentation, and operational decisions in real-time. The architecture is exactly the one Druid was built for: events flow in from Kafka and become queryable within seconds. Airbnb followed a similar path. They needed real-time analytics for pricing recommendations, search ranking, and fraud detection. By 2018, their Druid cluster was ingesting 10TB of data daily across hundreds of data sources. 
PayPal used Druid for real-time fraud detection, analyzing transaction patterns across millions of daily payments. Netflix used it for streaming quality monitoring, tracking playback metrics and error rates in real-time. - **Query Latency**: sub-second — Druid's design target for analytical queries on streaming data - **Stored Rows**: 10T+ rows — Netflix's Druid deployment (Netflix TechBlog, 2026) - **Data Freshness**: <10s — Time from Kafka event to queryable in Druid - **Storage Efficiency**: 10-20x compression — Columnar storage + pre-aggregation vs raw events ## The market landscape The real-time analytics landscape has matured significantly since Druid's emergence. Today, ClickHouse, Druid, and Apache Pinot lead the OLAP database market, each with distinct strengths. Companies like Netflix, Airbnb, and Lyft use Druid for streaming-first analytics, while Cloudflare, Uber, and Spotify rely on ClickHouse for fast batch ingestion and resource efficiency. ### Druid vs ClickHouse Streaming-first workloads pick Druid; batch-loaded workloads pick ClickHouse. If events are flowing through Kafka and you need them queryable in seconds, Druid wins. It was built for that exact pattern, and nothing else matches its native Kafka ingestion. If you are loading data in hourly or daily batches and your team is small, ClickHouse is the better call. Fewer moving parts, simpler ops, and better SQL compatibility. Druid includes native ingestion from Apache Kafka, Kafka-compatible streams (Confluent, Redpanda, Amazon MSK, Azure Event Hubs), and Amazon Kinesis. No connectors needed. ClickHouse requires additional infrastructure for streaming ingestion. For companies with streaming-first architectures, this difference is decisive. > **EXAMPLE: Real-world scale: Confluent on Druid** > Confluent, the company behind Kafka, chose Druid for their own cloud analytics platform after evaluating Druid, Pinot, and ClickHouse. 
They now ingest over 3 million events per second and respond to over 250 queries per second using Druid. ClickHouse did not fit their stack because it required writing C++ plugins to read custom-format data from Kafka.

Architecture matters. ClickHouse keeps things simple with fewer moving parts. Druid distributes work across a cluster of specialized servers (Coordinators, Overlords, Brokers, Historicals, MiddleManagers). ClickHouse is often easier to operate for smaller deployments. Druid's distributed architecture scales better for massive streaming workloads.

Druid's operational cost is real: you are running ZooKeeper, a metadata store, and five different node types. If your team does not have dedicated infrastructure engineers, that complexity will eat you alive. If you do have that team and your use case is streaming-first, Druid's architecture pays for itself in query performance that ClickHouse cannot match on continuously arriving data.

### Adoption patterns by use case

- Choose Druid for: real-time dashboards fed by Kafka/Kinesis streams, operational analytics requiring sub-second freshness, time-series event data with high cardinality dimensions, multi-tenant analytics with complex query patterns
- Choose ClickHouse for: batch-loaded analytics workloads, log aggregation and search, metrics storage and visualization, deployments prioritizing operational simplicity over streaming features
- Choose Pinot for: user-facing analytics in applications (LinkedIn built it for exactly this and runs it at trillion-event scale), extremely low-latency requirements (<100ms p99), scenarios requiring upserts and complex filtering

ClickHouse adoption is broad as well: Lyft runs it for ride analytics and cost modeling, and Cloudflare processes millions of requests per second through their ClickHouse-powered analytics pipeline.
## The standard architecture

A consistent pattern emerged across companies implementing real-time analytics. Applications emit events to Kafka topics organized by domain (user events, transaction events, system metrics). Kafka provides durability and buffering, with configurable retention (typically 7-30 days). Druid ingests directly from Kafka through its built-in Kafka indexing service, with configurable in-memory batch sizes and persist intervals. Query layers expose Druid data through APIs or SQL interfaces. Deep storage in S3 provides long-term retention and disaster recovery.

The beauty of this architecture is its separation of concerns. Kafka handles event streaming and buffering. Druid handles analytical queries. S3 handles long-term storage. Each component does one thing well, and you can scale them independently. Need more ingestion capacity? Add more Kafka brokers and Druid MiddleManagers. Need faster queries? Add more Druid Historical nodes. Need more retention? S3 capacity is effectively unbounded.

```yaml (druid-ingestion-spec.yaml)
# Druid Kafka supervisor spec, wrapped in a ConfigMap.
# - inputFormat tells the ingestion tasks how to parse each Kafka message.
# - rollup: true pre-aggregates rows per queryGranularity (minute) at ingestion.
apiVersion: v1
kind: ConfigMap
metadata:
  name: druid-ingestion-spec
data:
  user-events.json: |
    {
      "type": "kafka",
      "spec": {
        "dataSchema": {
          "dataSource": "user-events",
          "timestampSpec": {
            "column": "timestamp",
            "format": "iso"
          },
          "dimensionsSpec": {
            "dimensions": [
              "user_id",
              "event_type",
              "country",
              "platform",
              "device_type"
            ]
          },
          "metricsSpec": [
            {"type": "count", "name": "count"},
            {"type": "longSum", "name": "session_duration", "fieldName": "duration"},
            {"type": "hyperUnique", "name": "unique_users", "fieldName": "user_id"}
          ],
          "granularitySpec": {
            "segmentGranularity": "hour",
            "queryGranularity": "minute",
            "rollup": true
          }
        },
        "ioConfig": {
          "topic": "user-events",
          "inputFormat": {"type": "json"},
          "consumerProperties": {
            "bootstrap.servers": "kafka:9092"
          },
          "taskCount": 4,
          "replicas": 2,
          "taskDuration": "PT1H"
        },
        "tuningConfig": {
          "type": "kafka",
          "maxRowsPerSegment": 5000000,
          "maxRowsInMemory": 100000,
          "intermediatePersistPeriod": "PT10M"
        }
      }
    }
```

> **TIP:
Pre-aggregation is the secret** > Druid's "rollup" feature pre-aggregates data at ingestion time. Instead of storing 5 billion individual events, you might store 50 million pre-aggregated rows. This 100x reduction in data volume is why queries are so fast. The tradeoff is you must define aggregations upfront, but for most operational analytics use cases, this is acceptable. ## Use cases by company size Real-time analytics is not just for tech giants. The same architectural pattern works across company sizes, with different scale requirements and implementation approaches. ### Startup: product analytics A SaaS startup with 10,000 daily active users needs real-time product analytics. They track user behavior, feature usage, and conversion funnels. Their event volume is modest (10-50 million events per day), but they need fast dashboards for the product team. They run a single-node Druid instance on a c5.2xlarge EC2 instance alongside Kafka on MSK. Total infrastructure cost is around $500 per month. This setup handles their current scale and can grow to 500 million events per day before requiring a cluster. - **Event Volume**: 10-50M/day — Typical for early-stage SaaS products - **Query Latency**: <200ms — Single-node Druid with SSD storage - **Infrastructure Cost**: $500/month — Single c5.2xlarge + MSK cluster - **Dashboard Refresh**: <1 min — Real-time analytics for product teams ### Growth: multi-product analytics A mid-size e-commerce company with multiple product lines needs real-time analytics across web, mobile, and backend systems. They have 1 million daily active users generating 500 million events per day. Their use cases include real-time dashboards for business teams, automated alerting on key metrics, and A/B test result tracking. They run a 3-node Druid cluster with dedicated coordinator, broker, and historical nodes. Kafka runs on MSK with 6 brokers for high availability. Total infrastructure cost is around $3,000 per month. 
- **Event Volume**: 500M-1B/day — Multi-product companies with mobile apps - **Query Latency**: <500ms — 3-node cluster with distributed queries - **Data Retention**: 90 days hot, 2 years S3 — Tiered storage for cost optimization - **Concurrent Users**: 50-100 — Analysts querying simultaneously ### Enterprise: platform analytics Companies like Airbnb and Netflix run Druid at massive scale (Uber's real-time analytics runs on Apache Pinot, not Druid). They process billions of events per day across hundreds of data sources. Their Druid clusters have 50-100+ nodes with specialized roles: dedicated query brokers for low-latency user queries, batch brokers for heavy analytical workloads, tiered storage with SSD for hot data and S3 for cold data, and multi-region deployment for global availability. At this scale, the focus shifts to operational excellence: automated capacity planning, query optimization, and cost management. - **Event Volume**: 5B+ events/day — Uber, Airbnb, Netflix scale - **Cluster Size**: 50-100+ nodes — Specialized roles for query, ingestion, coordination - **Concurrent Queries**: 100-500+ — Hundreds of dashboards and automated systems - **Data Retention**: 30 days hot, 5 years S3 — Compliance and historical analysis requirements ## How Druid works An aggregation that takes 30 seconds in Redshift returns in 200 milliseconds in Druid on billions of rows. The architecture explains the gap. The system is built around a few key concepts that work together to deliver that speed. ### Time-partitioned segments Druid stores data in immutable segments, each covering a specific time range (typically an hour or day). When you query for "last 24 hours of data," Druid knows exactly which segments to read. This is far more efficient than scanning an entire table. Segments are partitioned by time because most analytical queries filter by time. "Show me yesterday's data" or "What happened in the last hour?" 
These queries only touch relevant segments, making them incredibly fast. ### Columnar storage and compression Each segment stores data in columnar format. Instead of storing rows together (like traditional databases), Druid stores columns together. When you query "SELECT country, COUNT(*) FROM events WHERE timestamp > now() - 1 hour," Druid only reads the country and timestamp columns. It never touches columns like user_id or session_id if they are not in the query. Columnar storage also enables aggressive compression. A column with only 50 unique countries can be compressed 100x better than row-based storage where each country value is interleaved with other data. - **Compression Ratio**: 10-20x — Typical compression vs raw JSON events - **Column Scan Speed**: 1-2GB/s per core — Columnar format enables SIMD optimization - **Memory Efficiency**: 3-5x — Less RAM needed vs row-based storage - **Query Parallelism**: Linear — Performance scales with added cores ### Distributed query execution When you submit a query, it goes to a Broker node. The Broker determines which segments contain relevant data and distributes the query to Historical nodes that hold those segments. Each Historical node scans its local segments in parallel and returns partial results. The Broker merges the results and returns the final answer. This scatter-gather pattern enables massive parallelism. A query across 24 hours of data might scan 24 segments across 10 Historical nodes, each processing 2-3 segments in parallel. 
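The time pruning and scatter-gather steps described above can be illustrated with a toy model. This is a sketch of the idea only, not Druid's implementation: `Segment`, `Historical`, and `broker_query` are hypothetical names, segments are plain in-memory tuples keyed by hour, and threads stand in for network calls to Historical nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start_hour: int  # inclusive
    end_hour: int    # exclusive
    countries: tuple  # one "country" column value per event


class Historical:
    """A node holding some segments; it scans only what it is asked to."""

    def __init__(self, segments):
        self.segments = list(segments)

    def scan(self, segments):
        partial = Counter()
        for seg in segments:
            partial.update(seg.countries)  # per-node partial aggregate
        return partial


def broker_query(historicals, start_hour, end_hour):
    """Count events per country in [start_hour, end_hour)."""
    # Prune: keep only segments whose time range overlaps the query interval.
    plan = []
    for node in historicals:
        hits = [s for s in node.segments
                if s.start_hour < end_hour and s.end_hour > start_hour]
        if hits:
            plan.append((node, hits))
    if not plan:
        return {}
    # Scatter: each node scans its own segments in parallel.
    with ThreadPoolExecutor(max_workers=len(plan)) as pool:
        partials = list(pool.map(lambda item: item[0].scan(item[1]), plan))
    # Gather: merge the partial counts into the final answer.
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)


node_a = Historical([Segment(0, 1, ("US", "US", "DE")), Segment(2, 3, ("DE",))])
node_b = Historical([Segment(1, 2, ("US", "FR")), Segment(3, 4, ("JP", "JP"))])
last_two_hours = broker_query([node_a, node_b], 1, 3)
```

Here a query over hours 1-3 touches one segment on each node and never reads the hour-0 or hour-3 segments. The same property lets a real cluster answer "last 24 hours" without scanning older data.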
```python (druid-query-example.py)
# Query Druid SQL through pydruid's DB-API interface
from pydruid.db import connect

# Connect to the Broker's SQL endpoint
conn = connect(host='druid-broker', port=8082, path='/druid/v2/sql/', scheme='http')
cursor = conn.cursor()

# Query: hourly signups by country, last 7 days.
# The datasource name contains a hyphen, so it must be double-quoted.
# Rollup is enabled at ingestion, so raw events are counted with
# SUM("count"); COUNT(*) would count rolled-up rows instead.
query = """
SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour,
  country,
  SUM("count") AS signups,
  COUNT(DISTINCT user_id) AS unique_users
FROM "user-events"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
  AND event_type = 'signup'
GROUP BY 1, 2
ORDER BY 1 DESC, 3 DESC
"""

cursor.execute(query)
results = cursor.fetchall()

# Target latency for this query shape: <500ms on billions of events
for row in results:
    hour, country, signups, unique_users = row
    print(f"{hour}: {country} had {signups} signups ({unique_users} unique)")
```

> **INFO: Druid scans 100-1000x less data than a data lake**
> Pre-aggregation means queries work on summarized data, not raw events. Time partitioning means queries only scan relevant segments. Columnar storage means queries only read relevant columns. Together, these reduce the amount of data scanned by 100-1000x compared to scanning raw events in a data lake.

## Implementation guide

Building a real-time analytics platform with Druid follows a predictable path. Start small, validate the pattern, then scale incrementally.

### Kafka foundation (Week 1-2)

Before Druid, you need streaming infrastructure. Deploy Kafka (AWS MSK is the easiest path) with at least 3 brokers for high availability. Create topics organized by domain: user-events, transaction-events, system-metrics. Instrument your applications to emit events to Kafka. Start with high-level events like signups, purchases, and API requests. Ensure events have a consistent schema with required fields like timestamp, user_id, and event_type.

### Single-node Druid (Week 3-4)

Start with a single Druid node running all services (coordinator, broker, historical, middleManager). Deploy on a c5.2xlarge or equivalent with 8 vCPUs and 16GB RAM.
Configure one ingestion task reading from your highest-priority Kafka topic. Define your data schema carefully: choose dimensions (filterable fields), metrics (aggregatable numbers), and granularity (hour vs minute rollup). Run test queries to validate latency and correctness. This phase proves the pattern works before investing in a cluster. ### Cluster deployment (Week 5-8) Once validated, deploy a proper cluster with separated roles. Run coordinator and overlord services on small instances (t3.medium) since they do lightweight work. Run brokers on compute-optimized instances (c5.xlarge) for fast query merging. Run historical nodes on storage-optimized instances (i3.2xlarge) with local SSDs for hot data. Run middleManagers on memory-optimized instances (r5.xlarge) for ingestion. Configure S3 deep storage for long-term retention. Set up tiered storage to move old segments from expensive SSDs to cheap S3. - **Cluster Cost (Growth)**: $2,000-5,000/month — Typical 5-10 node cluster for mid-size company - **Cluster Cost (Enterprise)**: $20,000-50,000/month — Large clusters at Uber/Airbnb scale - **Time to Production**: 6-8 weeks — From zero to production-ready cluster - **Team Size Required**: 1-2 engineers — Part-time for initial implementation ### Operational maturity Real-time analytics platforms require ongoing operational investment. Monitor ingestion lag to ensure Druid keeps up with Kafka. Set up alerting on query latency to detect performance degradation. Implement automated capacity planning based on ingestion rate and query volume. Optimize query patterns by pre-computing common aggregations. Establish data retention policies to balance cost and compliance requirements. Build self-service query interfaces so teams can answer their own questions without writing code. ## Production lessons Companies running Druid at scale have learned hard lessons about what works and what doesn't. The biggest lesson: start simple and add complexity only when needed. 
When you wire up a real-time analytics pipeline for the first time, the temptation is to build the full architecture on day one: multi-tier ingestion, complex rollup rules, tiered storage, the works. Resist that. A basic Kafka-to-Druid pipeline with a single data source is enough to validate the pattern, and you can add real-time nodes, layered aggregation, and tiered storage incrementally as real query patterns force the question. Schema design is critical and hard to change. Choose your dimensions carefully because they determine what queries you can run. At Airbnb, they initially did not include device_type as a dimension in their booking events. When the product team wanted to analyze mobile vs desktop conversion rates, they had to re-ingest months of historical data. Now they over-dimension early, accepting slightly higher storage costs for query flexibility. Query patterns matter more than raw performance. Netflix found that 90% of queries accessed data from the last 24 hours. They optimized by keeping the last 48 hours on fast SSD storage and moving older data to S3. This reduced infrastructure costs by 60% with no perceptible impact on user experience. > **WARNING: The cardinality trap** > High-cardinality dimensions (fields with millions of unique values like user_id or session_id) can kill query performance. Druid handles them, but queries become slower and storage explodes. Use high-cardinality fields only when necessary. For user-level queries, consider a separate transactional database. The final lesson: real-time analytics is a journey, not a destination. Your needs will evolve. Start with basic event tracking and dashboards. Add automated alerting when you hit reliability issues. Add machine learning when you need predictive capabilities. The beauty of the Kafka-to-Druid architecture is it grows with you. ## When not to use Druid Druid is powerful but not universal. 
It excels at time-series event data with analytical queries, but other tools may be better for different use cases. If you need transactional consistency with updates and deletes, use PostgreSQL or MySQL. Druid is append-only; it does not handle updates well. If you need full-text search across documents, use Elasticsearch. Druid's text search is limited. If you need graph queries about relationships, use Neo4j. Druid does not do graph traversal.

If your data lands in batches and your team is under five engineers, start with ClickHouse. It has better SQL compatibility, easier operations, and you can get it running in production in a day. Druid earns its complexity when you need native streaming ingestion and sub-second queries on data that arrived less than ten seconds ago. If that is not your use case, you are paying an operational tax for capabilities you do not need.

If your event volume is under 10 million events per day, or your queries can tolerate 5-second latency, you probably do not need Druid. PostgreSQL with proper indexes and materialized views, or TimescaleDB, handles that scale with simpler operations. The distributed system only pays off at higher scale and tighter latency.

## The real-time imperative

Real-time analytics transformed how companies operate. At Uber, surge pricing went from a manual process with 30-second delays to a fully automated system responding in under a second. At Airbnb, pricing optimization that required data science team involvement became a self-service tool for hosts. At Netflix, streaming quality issues that took hours to detect now trigger alerts within minutes.

The pattern is consistent. Applications emit events to Kafka, Druid ingests and indexes them in real time, and query interfaces expose the data to dashboards, APIs, and automated systems. This architecture scales from startups processing millions of events per day to enterprises processing billions.
The technology is proven, the operational patterns are well understood, and the benefits are clear.

The build path is the same regardless of where you start. Capture events in Kafka first, since that instrumentation is the prerequisite for everything downstream and the hardest thing to retrofit. Reach for Druid only once query latency on those streams becomes the bottleneck and your volume clears the threshold where a distributed system earns its operational cost. Below that line, the next company to win on speed will not be the one running the most infrastructure.

- **Companies Using Druid**: 200+ — Listed on the Apache Druid Powered By page, including Netflix, Airbnb, PayPal, and Reddit
- **Average Query Speedup**: 50-100x — vs traditional data warehouses for real-time queries
- **Cost Efficiency**: 5-10x less — vs scaling traditional OLAP for sub-second latency
- **Time to Value**: 6-8 weeks — From POC to production analytics

## Resources & Further Reading

- Apache Druid Documentation: https://druid.apache.org/docs/latest/design/ - Reference architecture and operational guides
- Apache Druid Comparisons (Kudu, Redshift, Spark, Elasticsearch, SQL-on-Hadoop, key-value): https://druid.apache.org/docs/latest/comparisons/ - Apache project comparison docs (no Druid-vs-ClickHouse page is published)
- Uber Engineering - Data Platform: https://www.uber.com/blog/uber-data-platform-2019/ - Uber's real-time analytics architecture
- Airbnb Engineering - Data Infrastructure: https://medium.com/airbnb-engineering/druid-airbnb-data-platform-601c312f2a4c - Airbnb's Druid deployment
- AWS MSK (Managed Kafka): https://aws.amazon.com/msk/ - Managed Kafka on AWS
- Confluent Kafka: https://www.confluent.io/ - Commercial Kafka platform with cloud and self-hosted options
- Apache Druid Quick Start: https://druid.apache.org/docs/latest/tutorials/index.html - Tutorial for getting started with Druid

---