Tag: coding

  • The AI Stack Is Running on Borrowed Infrastructure — And What Happens When It Isn’t

    The GPUs running AI workloads today were designed to render video game graphics. The programming languages agents use were built for human readability. The development processes — sprint cycles, code reviews, deployment windows — are all structured around human working hours and attention spans.

    Despite these constraints, AI agents are already building C compilers from scratch for roughly $20,000 in compute costs and consistently outperforming average human engineers on standard tasks. When purpose-built infrastructure arrives, following the same vertical integration pattern we saw transform smartphones, the cost and capability equation changes dramatically. It’s the natural evolution of any technology platform.

    The smartphone parallel: five layers of integration

    Early smartphones ran on repurposed mobile phone hardware with desktop-derived operating systems. They worked, but the architecture was borrowed. Then came the integration layers, each delivering step-change improvements:

    Layer one: Purpose-built mobile operating systems — iOS and Android, optimised for touch interfaces and mobile constraints rather than adapted from desktop paradigms.

    Layer two: Custom silicon designed specifically for mobile workloads. Apple’s A-series chips weren’t just smaller desktop processors; they were architected from the ground up for mobile use cases.

    Layer three: Specialised programming frameworks — Swift and Kotlin — designed for mobile development patterns rather than ported from other contexts.

    Layer four: Cloud-native architectures that assumed mobile-first applications, not desktop apps squeezed onto smaller screens.

    Layer five: An entire ecosystem of services built around mobile devices as the primary platform, from payments to authentication to location services.

    Each layer compounded the improvements of the previous one. The difference between a 2007 iPhone and a 2015 iPhone wasn’t just better specs — it was a fundamentally different platform enabled by vertical integration.

    The AI stack is entering the same progression.

    What purpose-built looks like

    Custom AI chips are already in development. Google’s TPUs, various ASICs, and well-funded startups are working on silicon designed for neural network operations rather than graphics rendering. These aren’t incremental improvements on Nvidia’s architecture — they’re different approaches optimised for different workloads.

    Programming paradigms are shifting from human-oriented languages to frameworks designed for agent workflows. AI agents don’t need readable variable names or structured flow control designed for human comprehension. They work with tokens and can operate on representations optimised for their own processing, not ours.

    Development processes are being redesigned around agent capabilities. Continuous, asynchronous, parallel-by-default workflows rather than sequential sprints bounded by human availability. The infrastructure assumes agents are the primary operators, with humans in supervisory roles.
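    The parallel-by-default shape can be sketched in a few lines. This is a minimal sketch, assuming a hypothetical `run_agent_task` function standing in for a real agent API call; the point is that independent tasks are all dispatched at once rather than queued behind human availability.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_agent_task(task: str) -> str:
    # Hypothetical stand-in: a real system would dispatch to an agent
    # service and poll for completion instead of returning immediately.
    return f"done: {task}"

def run_parallel(tasks):
    # Parallel-by-default: every independent task starts immediately;
    # results are collected as agents finish, in any order.
    results = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(run_agent_task, t): t for t in tasks}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

results = run_parallel(["fix-login-bug", "write-docs", "add-tests"])
```

    The human role in this shape is reviewing `results`, not sequencing the work.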

    Aaron Levie maps the emerging stack: agent-specific sandboxed compute environments (E2B, Daytona, Modal, Cloudflare), agent identity and authentication systems, agent-native file storage and memory, agent wallets for microtransactions (Stripe, Coinbase), and agent-optimised search (Parallel, Exa). These aren’t conceptual products — these companies exist and are building now.

    His framing of “make something agents want” — riffing on Paul Graham’s “make something people want” — captures the shift. Software is being redesigned with agents as the primary user. Every feature needs an API. Every service needs an MCP server. If agents can’t sign up and start using your product autonomously, you’re invisible to the next generation of software consumers.

    The platform convergence is worth watching. Code repositories are evolving from version control into the orchestration layer for the entire software development lifecycle. Whoever controls that layer controls the persistent context that makes agents effective across sessions. The major AI labs are moving toward owning this layer, which suggests the future stack isn’t “AI model plus separate dev tools” but an integrated platform where context, runtime, and code repository collapse into one thing.

    A parallel convergence is happening at the operator level: always-on agents you communicate with through mainstream channels like messaging apps or email rather than terminals or dashboards. The human doesn’t need specialised tooling — they need a channel to the agent that handles everything downstream.

    Combine the sandboxed agent execution environment with the always-on operator agent and the platform layer, and you get a full vertical integration play — from the software development lifecycle through to production operations, under one roof. This is the smartphone moment for developer tooling: separate apps collapsing into an integrated platform.

    What changes when the stack is purpose-built

    The trajectory over the last few years has been clear: model capabilities improve while cost per unit of useful output drops. But today’s pricing is misleading — current subscription tiers are heavily subsidised, much like AWS in 2006. AI companies are burning cash to establish market share. The real story isn’t simply “it gets cheaper.” Provider economics improve toward sustainability while capabilities compound for users. Both curves move in the right direction, but assuming today’s prices reflect true costs is a mistake in either direction.

    Where the stack is unevenly developed matters more than the overall cost trajectory. Code generation has leapt ahead, but DevOps, production support, and operational reliability are further behind — areas where consequences of AI errors are immediate and expensive. The bottleneck isn’t capability. General-purpose coding agents can already manage infrastructure via CLI and APIs. The bottleneck is the trust boundary: giving an agent access to production systems and customer data raises security concerns that don’t exist in a sandboxed branch. Specialised DevOps agents are emerging to address this, but they’ll likely follow the familiar platform-shift pattern — absorbed into general-purpose agents once the sandboxing layer matures.
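    The trust boundary described above is often enforced mechanically. Here is a minimal sketch, assuming a hypothetical allowlist of read-only command prefixes an agent may run against production; the specific prefixes are invented for illustration.

```python
import shlex
import subprocess

# Hypothetical allowlist: read-only command prefixes an agent may run
# against production. Anything else requires human sign-off.
ALLOWED_PREFIXES = ("kubectl get", "kubectl describe", "git log", "git diff")

def is_allowed(command: str) -> bool:
    # The trust boundary: a plain prefix check on the command string.
    return command.startswith(ALLOWED_PREFIXES)

def gated_run(command: str) -> str:
    """Run a shell command only if it falls inside the trust boundary."""
    if not is_allowed(command):
        raise PermissionError(f"blocked outside trust boundary: {command}")
    result = subprocess.run(shlex.split(command), capture_output=True, text=True)
    return result.stdout
```

    A real gate would be more sophisticated (argument inspection, scoped credentials, audit logging), but the design point stands: the check lives in infrastructure, not in the model.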

    There’s an irony in what AI is automating first: not the creative work developers love, but the process work that fell through the cracks as agile teams prioritised working software over documentation. The Agile Manifesto was right that heavyweight documentation was wasteful — but in practice, teams swung too far the other way, under-documenting, under-testing, and accumulating technical debt. Agents don’t have that bias. They’ll write tests and documentation with the same attention as feature code.

    What this means now

    Mainstream custom AI chips are a few years out; AI-native frameworks may arrive even sooner. Organisational redesign is the slowest layer — cultural, not technical.

    None of that is a reason to wait. Dark factory teams (shops where agents write and ship production code without human review) are already running production workflows on the borrowed stack — and the gap between them and companies still debating AI adoption compounds monthly. Every month of building expertise on today’s imperfect infrastructure is learning that transfers directly to the purpose-built era.

    The early smartphone era produced the apps, habits, and companies that dominated the next decade. We’re in the equivalent moment for AI. The stack will improve. The question is whether you’ll be positioned to take advantage of it.

  • The Autonomous SDLC: What’s Solved, What’s Not, and Why the Gaps Are Closing Fast

    We’re further along than most people realize. The software development lifecycle is being automated piece by piece, and the trajectory is becoming harder to ignore—not through some magical breakthrough, but through the steady elimination of bottlenecks that seemed permanent six months ago.

    This is a practitioner’s status report discussing what works in production today, what remains genuinely unsolved, and why the remaining gaps matter less than conventional wisdom suggests.

    Code Generation: Already Production-Grade

    The middle portion of the SDLC—turning specifications into working code—has crossed a threshold. Cursor CEO Michael Truell describes three eras: tab autocomplete, synchronous agents responding to prompts, and now agents tackling larger tasks independently with less human direction. At Cursor, 35% of merged PRs now come from agents running autonomously in cloud VMs. The agent PRs are “an order of magnitude more ambitious than human PRs” while maintaining higher merge rates.

    What matters isn’t the percentage—it’s that these agent-generated PRs pass the same review standards as human code. Max Woolf’s detailed experiments are instructive. Starting as a vocal skeptic who wrote about rarely using LLMs, he ended up building Rust libraries that outperformed battle-tested numpy-backed implementations by 2-30x. Not prototypes—production code passing comprehensive test suites and benchmarks.

    His conclusion after months of testing:

    I have been trying to break this damn model by giving it complex tasks that would take me months to do by myself despite my coding pedigree but Opus and Codex keep doing them correctly.

    The quality ceiling keeps rising with each model generation. This isn’t “good enough for prototypes”—it’s production-grade code that ships.

    Spec-Driven Development

    The initiation problem has largely been solved. Most tools now support a planning mode: the agent reads a spec, creates an implementation plan, and follows it through. Woolf’s experience matters here:

    AGENTS.md is probably the main differentiator between those getting good and bad results with agents.

    These persistent instruction files function as system prompts that shape agent behavior across sessions.

    This is just spec-driven development—the same methodology good engineering teams already use. The pattern works: write a detailed spec (GitHub issue, markdown file), point the agent at it, let it execute. The difference is that agents can now be the executor, and the pattern works across tools (Cursor, Claude Code, Codex) because it aligns with how reliable software gets built regardless of who’s typing.
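    As a concrete illustration, a minimal AGENTS.md might look like the following. The sections and rules here are invented for the example; there is no standard schema:

```markdown
# AGENTS.md

## Build & test
- Run `make test` before proposing any change; all tests must pass.

## Conventions
- Python 3.12, type hints required, `ruff` for linting.
- Never edit files under `vendor/`.

## Workflow
- Work on a branch; open a PR that references the spec issue.
```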

    The Feedback Loop: The Primary Gap

    Basic unit tests and regression tests work well—agents can write and run them as part of their workflow. Complex feature tests, integration tests, and UAT remain the primary gap. UI/UX testing is particularly challenging since agents can’t easily evaluate visual output.

    The current workaround: human-in-the-loop for complex test evaluation, with agents handling mechanical testing. That said, coding agents can still fix UI bugs when given screenshots and descriptions.

    This is an active focus area. The gap is narrowing from both sides: agents getting better at generating comprehensive tests, and tooling improving for automated visual and integration testing. Satisfactory solutions within 2026 aren’t a stretch—they’re the natural next step given where the infrastructure is heading.
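    One crude but useful stopgap for the visual gap is snapshot hashing: the agent can’t judge whether a rendering looks right, but it can detect that it changed and flag it for a human. A minimal sketch, using content hashes as the golden reference (real visual regression tools do perceptual diffing, not exact hashing):

```python
import hashlib

GOLDEN = {}  # artifact name -> human-approved sha256 digest

def snapshot(name: str, rendered: bytes) -> str:
    """Record a human-approved rendering as the golden reference."""
    digest = hashlib.sha256(rendered).hexdigest()
    GOLDEN[name] = digest
    return digest

def check(name: str, rendered: bytes) -> bool:
    """True if the rendering matches the approved golden hash.
    A mismatch doesn't mean the change is wrong, only that a human
    (or a better visual model) should take a look."""
    return GOLDEN.get(name) == hashlib.sha256(rendered).hexdigest()

snapshot("homepage", b"<html>v1</html>")
assert check("homepage", b"<html>v1</html>")
assert not check("homepage", b"<html>v2</html>")
```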

    Guardrails: Actively Being Solved

    Managing task boundaries and blast radius is critical for autonomous operation. Best practices are emerging around sandboxing—isolated agent execution environments, limited file system access, branch-based workflows.

    The Anthropic C compiler experiment demonstrated the pattern at scale: 16 agents working on a shared codebase over 2,000 sessions, coordinating through git locks and comprehensive test harnesses. The test infrastructure was rigorous enough to guide autonomous agents toward correctness without human review, producing a 100,000-line compiler that can build Linux.

    StrongDM took this further with their dark factory approach. They built digital twins of production dependencies—behavioral clones of Okta, Jira, Slack—using agents to replicate APIs and edge cases. This enabled validation at volumes far exceeding production limits without risk. Their rule: “Code must not be reviewed by humans.” The safety comes from comprehensive scenario testing against holdout test cases the agents never see.

    The agent infrastructure layer is building out fast. We’re seeing microVMs that boot fast enough to feel container-like, with snapshot/restore making “reset” almost free. Agent-specific sandboxed compute, identity, and API access are emerging as distinct product categories.

    The guardrails problem is increasingly an infrastructure problem, not a model problem. This converges toward a standard pattern: spec + guardrails + sandbox + automated validation = safe autonomous execution.
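    That standard pattern is, at its core, a control loop. A minimal sketch, where every callable is a hypothetical stand-in: `generate` produces a candidate change inside the sandbox, and `validate` is the automated test harness:

```python
def autonomous_loop(spec, generate, validate, max_iters=5):
    """Spec + guardrails + sandbox + automated validation as a loop.
    `generate(spec, feedback)` yields a candidate change (in a sandbox);
    `validate(candidate)` returns (ok, feedback) from the test harness."""
    feedback = None
    for _ in range(max_iters):
        candidate = generate(spec, feedback)
        ok, feedback = validate(candidate)
        if ok:
            return candidate  # safe to promote out of the sandbox
    # Guardrail: bounded iterations, then escalate rather than loop forever.
    raise RuntimeError("validation never passed; escalate to a human")
```

    The guardrails live in the loop structure itself: the sandbox bounds the blast radius, the validator decides what gets out, and the iteration cap decides when a human gets pulled in.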

    The Self-Improvement Dynamic

    Something subtle is happening. Codex optimizes code, Opus optimizes it further, Opus validates against known-good implementations. Cumulative 6x speed improvements on already-optimized code. Then you have Opus 4.6 iteratively improving its own code through benchmark-driven passes.

    Folks have shown agents tuning LLMs on Hugging Face—the tooling layer being built by the tools themselves. This isn’t theoretical AGI. It’s narrow but powerful self-improvement within the coding domain. The practical implication: the rate of improvement accelerates as agents get better at improving agents. For the coding stack specifically, each generation of tools makes the next generation arrive faster.
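    The inner step of a benchmark-driven optimisation pass reduces to: time each candidate, validate against a known-good result, keep the winner. A toy sketch with two invented candidate implementations:

```python
import timeit

def sum_loop(n=10_000):
    # Toy "unoptimised" candidate.
    total = 0
    for i in range(n):
        total += i
    return total

def sum_builtin(n=10_000):
    # Toy "optimised" candidate.
    return sum(range(n))

def pick_fastest(candidates, number=200):
    """One benchmark-driven pass: time each candidate, keep the winner."""
    timings = {f.__name__: timeit.timeit(f, number=number) for f in candidates}
    best = min(timings, key=timings.get)
    return best, timings

# Validate against a known-good implementation before trusting the winner.
assert sum_loop() == sum_builtin()
best, timings = pick_fastest([sum_loop, sum_builtin])
```

    An agent running this loop over its own generated code, with the benchmark as the fitness function, is the mechanism behind the cumulative speedups described above.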

    What This Means for Planning

    Here’s the timeline as I see it:

    2025: Code generation reliable. Spec-driven development emerging. Testing and guardrails manual.

    2026: Testing automation reaches satisfactory level. Guardrails standardize. The loop becomes semi-autonomous.

    2027+: Fully autonomous for standard applications. Human involvement shifts entirely to direction and edge cases.

    The companies planning as if these gaps will persist are making the same mistake as those who planned around slow internet in 2005. AI tools amplify existing expertise—all the practices that distinguished senior engineers (comprehensive testing, good documentation, strong version control habits, effective code review) matter even more now. But the bar for what “good enough” looks like is rising in parallel.

    Antirez captures the shift plainly:

    Writing code is no longer needed for the most part. It is now a lot more interesting to understand what to do, and how to do it.

    The mental work hasn’t disappeared. It’s concentrated in the parts machines can’t yet replace: architecture decisions, user needs, system design trade-offs.

    The gaps are real today. But they’re the wrong thing to optimize around. Optimize around what becomes possible when they close—because that’s happening faster than the pace of traditional software planning cycles.