Programming – AB's Reflections

The Autonomous SDLC: What’s Solved, What’s Not, and Why the Gaps Are Closing Fast

We’re further along than most people realize. The software development lifecycle is being automated piece by piece, and the trajectory is becoming harder to ignore—not through some magical breakthrough, but through the steady elimination of bottlenecks that seemed permanent six months ago.

This is a practitioner’s status report discussing what works in production today, what remains genuinely unsolved, and why the remaining gaps matter less than conventional wisdom suggests.

Code Generation: Already Production-Grade

The middle portion of the SDLC—turning specifications into working code—has crossed a threshold. Cursor CEO Michael Truell describes three eras: tab autocomplete, synchronous agents responding to prompts, and now agents tackling larger tasks independently with less human direction. At Cursor, 35% of merged PRs now come from agents running autonomously in cloud VMs. The agent PRs are “an order of magnitude more ambitious than human PRs” while maintaining higher merge rates.

What matters isn’t the percentage—it’s that these agent-generated PRs pass the same review standards as human code. Max Woolf’s detailed experiments are instructive. Starting as a vocal skeptic who wrote about rarely using LLMs, he ended up building Rust libraries that outperformed battle-tested numpy-backed implementations by 2-30x. Not prototypes—production code passing comprehensive test suites and benchmarks.

His conclusion after months of testing:

I have been trying to break this damn model by giving it complex tasks that would take me months to do by myself despite my coding pedigree but Opus and Codex keep doing them correctly.

The quality ceiling keeps rising with each model generation. This isn’t “good enough for prototypes”—it’s production-grade code that ships.

Spec-Driven Development

The initiation problem has largely converged. Most tools now support planning mode—the agent reads a spec, creates an implementation plan, follows it through. Woolf’s experience matters here:

AGENTS.md is probably the main differentiator between those getting good and bad results with agents.

These persistent instruction files function as system prompts that shape agent behaviour across sessions.

This is just spec-driven development—the same methodology good engineering teams already use. The pattern works: write a detailed spec (GitHub issue, markdown file), point the agent at it, let it execute. The difference is that agents can now be the executor, and the pattern works across tools (Cursor, Claude Code, Codex) because it aligns with how reliable software gets built regardless of who’s typing.

The Feedback Loop: The Primary Gap

Basic unit tests and regression tests work well—agents can write and run them as part of their workflow. Complex feature tests, integration tests, and UAT remain the primary gap. UI/UX testing is particularly challenging since agents can’t easily evaluate visual output.

The current workaround: human-in-the-loop for complex test evaluation, with agents handling mechanical testing. That said, coding agents can still fix bugs when given screenshots and descriptions.

This is an active focus area. The gap is narrowing from both sides: agents getting better at generating comprehensive tests, and tooling improving for automated visual and integration testing. Satisfactory solutions within 2026 aren’t a stretch—they’re the natural next step given where the infrastructure is heading.

Guardrails: Actively Being Solved

Managing task boundaries and blast radius is critical for autonomous operation. Best practices are emerging around sandboxing—isolated agent execution environments, limited file system access, branch-based workflows.
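
To make "limited file system access" concrete, here is a minimal sketch of one small piece of it: refusing any path an agent requests that resolves outside its working directory. The function name resolve_inside and the directory name are my own illustration, not any tool's API, and a check like this complements real isolation (VMs, containers) rather than replacing it.

```python
from pathlib import Path


def resolve_inside(sandbox_root: Path, requested: str) -> Path:
    """Return the absolute path for `requested`, or raise if it escapes the sandbox.

    Hypothetical helper for illustration only.
    """
    root = sandbox_root.resolve()
    # Joining and then resolving collapses ".." segments and follows symlinks,
    # so the containment check runs against the real target location.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"{requested!r} resolves outside the sandbox")
    return candidate
```

A request such as "src/app.py" comes back as an absolute path under the root, while "../secrets.txt" or an absolute path elsewhere raises PermissionError before any file is touched.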

The Anthropic C compiler experiment demonstrated the pattern at scale: 16 agents working on a shared codebase over nearly 2,000 sessions, coordinating through git locks and comprehensive test harnesses. The test infrastructure was rigorous enough to guide autonomous agents toward correctness without human review, producing a 100,000-line compiler that can build Linux.

StrongDM took this further with their dark factory approach. They built digital twins of production dependencies—behavioral clones of Okta, Jira, Slack—using agents to replicate APIs and edge cases. This enabled validation at volumes far exceeding production limits without risk. Their rule: “Code must not be reviewed by humans.” The safety comes from comprehensive scenario testing against holdout test cases the agents never see.

The agent infrastructure layer is building out fast. We’re seeing microVMs that boot fast enough to feel container-like, with snapshot/restore making “reset” almost free. Agent-specific sandboxed compute, identity, and API access are emerging as distinct product categories.

The guardrails problem is increasingly an infrastructure problem, not a model problem. This converges toward a standard pattern: spec + guardrails + sandbox + automated validation = safe autonomous execution.

The Self-Improvement Dynamic

Something subtle is happening. Codex optimizes code, Opus optimizes it further, and Opus then validates the result against known-good implementations, for cumulative 6x speed improvements on already-optimized code. Then you have Opus 4.6 iteratively improving its own code through benchmark-driven passes.

Folks have shown agents tuning LLMs on Hugging Face: the tooling layer is being built by the tools themselves. This isn’t theoretical AGI. It’s narrow but powerful self-improvement within the coding domain. The practical implication: the rate of improvement accelerates as agents get better at improving agents. For the coding stack specifically, each generation of tools makes the next generation arrive faster.

What This Means for Planning

Here’s the timeline as I see it:

2025: Code generation reliable. Spec-driven development emerging. Testing and guardrails manual.

2026: Testing automation reaches satisfactory level. Guardrails standardize. The loop becomes semi-autonomous.

2027+: Fully autonomous for standard applications. Human involvement shifts entirely to direction and edge cases.

The companies planning as if these gaps will persist are making the same mistake as those who planned around slow internet in 2005. AI tools amplify existing expertise—all the practices that distinguished senior engineers (comprehensive testing, good documentation, strong version control habits, effective code review) matter even more now. But the bar for what “good enough” looks like is rising in parallel.

Antirez captures the shift plainly:

Writing code is no longer needed for the most part. It is now a lot more interesting to understand what to do, and how to do it.

The mental work hasn’t disappeared. It’s concentrated in the parts machines can’t yet replace: architecture decisions, user needs, system design trade-offs.

The gaps are real today. But they’re the wrong thing to optimize around. Optimize around what becomes possible when they close—because that’s happening faster than the pace of traditional software planning cycles.

March 12, 2026

The Task Changed, The Job Didn’t — But Your Org Hasn’t Noticed Yet

There’s a conversation happening quietly in engineering teams, product orgs, and design studios. It surfaces in Slack DMs and whispered break-room conversations. The question underneath is always the same: If AI can do what I do, what am I for?

That fear makes sense. Engineers who built their identity around writing clean code watch AI generate entire modules in seconds. Product managers who prided themselves on writing crisp specs see AI agents do the same work overnight. Designers watch their Figma files get autocompleted before they’ve finished thinking through the problem.

But here’s what’s being missed: the task is changing, the job isn’t.

Writing code was always a means to an end. The job was shipping features that solve problems. Writing specs was always a means to an end. The job was understanding user needs and deciding what to build. AI automates the means, not the end. The bottleneck was never typing speed — it was clarity of thinking, problem definition, and judgment about what to build.

Those bottlenecks are still ours.

The Identity Trap

Most people in technology define themselves by the task they perform, not the outcome they produce. “I’m a backend engineer” means I write backend code. “I’m a PM” means I write specs and manage tickets. When AI starts doing those tasks faster and arguably better, the identity feels threatened.

The first response is usually denial: “AI can’t really do what I do — it doesn’t understand context, it makes mistakes, it needs constant supervision.” The second is panic: “I’m about to be replaced by a model that costs pennies per thousand tokens.”

But the real shift isn’t about automation replacing roles. It’s about what happens when execution becomes nearly free and the entire competitive advantage moves to knowing what to build in the first place.

From Tasks to Judgment

When people ask what humans will do in this new world, the answer is usually “taste and judgment.” But that’s abstract. What does judgment actually mean?

It means knowing what to build, when to say no, and how to spot when AI is heading in the wrong direction. It’s defining the guardrails before you let agents run — test suites, design patterns, architectural constraints. It’s understanding that every line of code is future maintenance burden, which makes the discipline to not build more valuable than the ability to build fast.

In 2014, Melissa Perri warned about “The Build Trap” — companies stuck measuring success by what they shipped rather than what they learned. “Building is the easy part,” she wrote. “Figuring out what to build and how we are going to build it is the hard part.”

Most companies ignored that. Now AI makes building trivially easy, and those companies are about to drown in features that solve nothing. The agents don’t get tired. They don’t push back. They’ll happily build everything you point them at, whether or not it should exist.

The Multi-Hat Convergence

The expectation is shifting: one person who can think about the problem, design the solution, and use AI to build it. This doesn’t mean everyone becomes a shallow generalist. It means the boundaries between roles blur significantly.

PMs without a hard skill — design or code — and engineers without product sense are both increasingly vulnerable. The trifecta of product thinking, design sense, and technical execution is becoming the baseline, not the exception.

For experienced professionals considering independence, this convergence changes the economics dramatically. A single person with AI tools can now deliver what used to require a small team.

The Org Structure Problem

Most organizations are still structured around tasks, not outcomes. Teams are organized by function — frontend team, backend team, QA team, design team. Performance is measured by task completion: PRs merged, tickets closed, specs written.

AI makes task completion trivially fast, which breaks these measurement systems completely. The real metric should be business outcomes, but most orgs aren’t wired to measure or incentivize that way.

Companies are starting to notice. Last year, the Shopify CEO asked employees to prove why they “cannot get what they want done using AI” before asking for more headcount. Last week, Block laid off 40% of its workforce — more than 4,000 people. Co-founder Jack Dorsey was direct: “A significantly smaller team, using the tools we’re building, can do more and do it better.”

A startup with great direction and AI agents beats a startup with mediocre direction and the same agents. A company with 10 people who know exactly what to build beats one with 100 people building everything they can think of.

The companies still hiring for “more hands” are optimizing for the wrong bottleneck.

What This Means for You

If you’re an engineer, invest in product sense and domain expertise. Understand why you’re building, not just how. Study the business side of your domain — unit economics, customer behavior, market dynamics.

If you’re a PM, get your hands dirty with at least one hard skill. Design or code, even at a basic level. The ability to prototype your own ideas or understand technical tradeoffs without waiting for a meeting makes you more effective than you’d expect.

If you’re a leader, start restructuring teams around outcomes, not functions. Measure business impact, not tickets closed. Reward people for solving problems and learning, not for producing code.

Stop identifying with your task. Start identifying with the outcomes you produce.

The people making this shift now are building a compounding advantage. The gap widens every month. Domain expertise becomes your moat. The deeper you understand a specific business problem space, the better you can direct agents toward solving it.

The execution bottleneck is being solved. The judgment bottleneck requires human capacity, and it’s where the real value lives now.

March 10, 2026

The Dark Factory: Engineering Teams That Run With the Lights Off

A few engineering organisations are already operating a model most companies haven’t begun to consider. While the typical software team debates whether to adopt AI coding assistants, companies like StrongDM are running fully automated development pipelines where agents handle implementation, testing, review, and deployment. Humans set direction and define constraints. The mechanical work happens without them.

This isn’t speculative. It’s operational. And the gap between companies working this way and those that aren’t is widening fast.

What “lights off” actually means

The term comes from manufacturing — factories that run autonomously, with minimal human presence. In software, it describes engineering organisations where AI agents do the bulk of execution work while humans focus on architecture, constraints, and outcomes.

StrongDM’s approach is instructive: their benchmark is that if you haven’t spent at least $1,000 on tokens per human engineer per day, your software factory has room for improvement. Agents work in parallel on isolated tasks. Code is written, tested, and reviewed without manual intervention. Tasks assigned Friday evening return results Monday morning.

The ratio of agents to humans is high and growing. But this isn’t about replacing engineers — it’s about fundamentally changing what engineers do.

The guardrails are the system

Dark factories aren’t ungoverned. They’re heavily governed in a different way.

Linters, formatters, comprehensive test suites, design pattern enforcement — these become pre-conditions rather than suggestions. Agents are configured to seek completion only when all guardrails pass. Code review shifts from line-by-line human inspection to AI review with human spot-checks on critical paths.
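
A rough sketch of what "seek completion only when all guardrails pass" can look like in code. Everything here is my own illustration: Guardrail, run_gate, and work_until_green are hypothetical names, and each check is a plain callable standing in for a real linter or test run.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Guardrail:
    name: str
    check: Callable[[], bool]  # returns True when the guardrail passes


def run_gate(guardrails: List[Guardrail]) -> Tuple[bool, List[str]]:
    """Run every guardrail and report the names of the ones that failed."""
    failures = []
    for g in guardrails:
        try:
            ok = g.check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failures.append(g.name)
    return (not failures, failures)


def work_until_green(
    attempt_fix: Callable[[List[str]], None],
    guardrails: List[Guardrail],
    max_rounds: int = 5,
) -> bool:
    """Hand failures back to the agent until the gate is green or rounds run out."""
    for _ in range(max_rounds):
        passed, failures = run_gate(guardrails)
        if passed:
            return True
        attempt_fix(failures)  # the agent only ever sees what is still failing
    return run_gate(guardrails)[0]
```

The design choice that matters is that "done" is decided by the gate, not by the agent: the loop can only return True after every check has passed in the same round.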

The discipline moves from “write good code” to “design good systems for code to be written in.” That’s a different skill. It requires thinking about constraints, validation, and feedback loops rather than syntax and implementation details.

Anthropic’s experiment building a C compiler with parallel Claude instances demonstrates this principle. Sixteen agents worked simultaneously on a shared codebase, coordinating through git locks and comprehensive test harnesses. The result: a 100,000-line compiler capable of building the Linux kernel, produced over nearly 2,000 sessions across two weeks for just under $20,000. The project worked because the test infrastructure was rigorous enough to guide autonomous agents toward correctness without human review of every change.
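
For intuition on lock-based coordination, here is a minimal sketch of task claiming. It is an assumption-heavy stand-in: it uses atomic file creation on a local filesystem rather than git, and try_claim, release, and the lock directory layout are hypothetical names of mine, not the mechanism Anthropic used.

```python
import os
from pathlib import Path


def try_claim(lock_dir: Path, task_id: str, agent_id: str) -> bool:
    """Claim a task by creating its lock file; return False if already claimed."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"{task_id}.lock"
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so when several
        # agents race for the same task only one creation succeeds.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(agent_id)  # record the owner so only it can release
    return True


def release(lock_dir: Path, task_id: str, agent_id: str) -> bool:
    """Release a task lock, but only if `agent_id` is the recorded owner."""
    lock_path = lock_dir / f"{task_id}.lock"
    try:
        owner = lock_path.read_text()
    except FileNotFoundError:
        return False
    if owner != agent_id:
        return False
    lock_path.unlink()
    return True
```

An agent that fails to claim a task simply moves on to another one, which is what keeps parallel workers from editing the same area at once.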

Cursor’s experiments with scaling agents ran into a different problem. They tried flat coordination first — agents self-organising through a shared file, claiming tasks, updating status. It broke down. Agents held locks too long, became risk-averse, made small safe changes, and nobody took responsibility for hard problems. The fix was introducing hierarchy: planners that explore the codebase and create tasks, workers that grind on assigned work until it’s done. No single agent tries to do everything. The system ran for weeks, writing over a million lines of code. One project improved video rendering performance by 25x and shipped to production. Their takeaway: many of the gains came from removing complexity rather than adding it.
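
The planner/worker split is easy to sketch. This is a toy illustration under my own assumptions, not Cursor's implementation: the planner function, run_workers, and the idea of one task per module are all hypothetical, and threads stand in for agents.

```python
import queue
import threading
from typing import Callable, Dict, List


def planner(goal: str, modules: List[str]) -> List[str]:
    """Hypothetical planner: break a goal into one task per module."""
    return [f"{goal}: {m}" for m in modules]


def run_workers(
    tasks: List[str],
    do_task: Callable[[str], str],
    n_workers: int = 4,
) -> Dict[str, str]:
    """Workers pull assigned tasks from a shared queue until it is empty."""
    todo: "queue.Queue[str]" = queue.Queue()
    for t in tasks:
        todo.put(t)
    results: Dict[str, str] = {}
    results_lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                task = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to do, so this worker stops
            outcome = do_task(task)
            with results_lock:
                results[task] = outcome

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Workers never decide what to work on; they only take the next task the planner produced. That is the structural difference from the flat version, where every agent both chose and executed work.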

Digital twins as the enabler

The biggest blocker to agent autonomy has been the fear of breaking production. Digital twins remove that constraint.

StrongDM built behavioural replicas of third-party services their software depends on — Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets. These twins replicate APIs, edge cases, and observable behaviours with sufficient fidelity that agents can test against realistic conditions at volume, without rate limits or production risk.
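
To give a feel for what a behavioural twin is, here is a deliberately tiny sketch: an in-memory stand-in for a ticketing service. The class, its methods, and the status codes it returns are hypothetical and do not mirror Jira's or any real vendor's API; a real twin would replicate the actual service's contract and edge cases.

```python
import itertools
from typing import Dict, Tuple


class TicketTrackerTwin:
    """In-memory stand-in for a ticketing service (hypothetical API)."""

    def __init__(self) -> None:
        self._tickets: Dict[int, dict] = {}
        self._ids = itertools.count(1)

    def create_ticket(self, payload: dict) -> Tuple[int, dict]:
        title = payload.get("title", "")
        # Edge case worth replicating: reject missing or blank titles.
        if not isinstance(title, str) or not title.strip():
            return 400, {"error": "title is required"}
        ticket_id = next(self._ids)
        ticket = {"id": ticket_id, "title": title.strip(), "status": "open"}
        self._tickets[ticket_id] = ticket
        return 201, dict(ticket)

    def get_ticket(self, ticket_id: int) -> Tuple[int, dict]:
        ticket = self._tickets.get(ticket_id)
        if ticket is None:
            return 404, {"error": "not found"}
        return 200, dict(ticket)

    def close_ticket(self, ticket_id: int) -> Tuple[int, dict]:
        ticket = self._tickets.get(ticket_id)
        if ticket is None:
            return 404, {"error": "not found"}
        # Edge case worth replicating: closing twice is a conflict, not a no-op.
        if ticket["status"] == "closed":
            return 409, {"error": "already closed"}
        ticket["status"] = "closed"
        return 200, dict(ticket)
```

Because the twin lives in memory, an agent can create and close tickets as often as it likes with no rate limits and nothing to clean up in a real account.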

Simon Willison’s write-up of StrongDM’s approach highlights how this changed what was economically feasible: “Creating a high fidelity clone of a significant SaaS application was always possible, but never economically feasible. Generations of engineers may have wanted a full in-memory replica of their CRM to test against, but self-censored the proposal to build it.”

What makes this rigorous rather than just better staging is how they handle validation. Test scenarios are stored outside the codebase — separate from where the coding agents can see them — functioning like holdout sets in machine learning. Agents can’t overfit to the tests because they don’t have access to them. The QA team is also agents, running thousands of scenarios per hour without hitting rate limits or accumulating API costs.
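
The holdout idea can be sketched in a few lines. This is an illustration under my own assumptions, not StrongDM's harness: Scenario and evaluate are hypothetical names, and the candidate is any function the coding agent produced.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass(frozen=True)
class Scenario:
    name: str
    args: Tuple[Any, ...]
    expected: Any


def evaluate(candidate: Callable[..., Any], holdout: Sequence[Scenario]) -> float:
    """Score a candidate against hidden scenarios and return only the pass rate.

    Returning an aggregate score, rather than which scenarios failed, keeps
    the scenario contents away from the agent that wrote the candidate.
    """
    if not holdout:
        return 0.0
    passed = 0
    for scenario in holdout:
        try:
            if candidate(*scenario.args) == scenario.expected:
                passed += 1
        except Exception:
            pass  # a crash on a scenario counts as a failure
    return passed / len(holdout)
```

The scenarios would live somewhere the coding agent cannot read, and only this evaluator loads them. The agent gets a number back and has to improve the code in general, not patch around specific cases.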

The structural advantage of starting fresh

Startups and SMBs have a material advantage here. No legacy organisational structure to dismantle. No 500-person engineering floor with stakeholders defending headcount. No 18-month procurement cycles.

Capital efficiency becomes native. A three-person team with agents can produce output that previously required twenty people. The cost of compute is a fraction of equivalent human labour and falling rapidly.

This creates an asymmetric advantage. If your competitor ships in days what takes you months, no amount of talent closes that gap. And the competitive pressure isn’t just on speed — it’s on the ability to attract talent that wants to work this way. Senior engineers who’ve experienced agent-driven development don’t want to go back to manual workflows.

The gap between adopters and laggards

Companies operating this way are shipping at a fundamentally different pace. The difference isn’t incremental — it’s orders of magnitude in output per person.

Block’s recent announcement of a roughly 40% reduction in headcount offers a data point. The company is reducing its organization from over 10,000 people to just under 6,000. Jack Dorsey stated “we’re not making this decision because we’re in trouble. our business is strong” but noted that “the intelligence tools we’re creating and using, paired with smaller and flatter teams, are enabling a new way of working which fundamentally changes what it means to build and run a company.”

Cursor’s data shows the same pattern. 35% of pull requests merged internally at Cursor are now created by agents operating autonomously in cloud VMs. The developers adopting this approach write almost no code themselves. They spend their time breaking down problems, reviewing artifacts, and giving feedback. They spin up multiple agents simultaneously instead of guiding one to completion.

The laggards aren’t just slower. They’re increasingly unable to compete for talent, capital, or market position against organisations that have made this transition.

You don’t need a corporate budget to start

The dark factory model scales down. A single developer with a Claude Code subscription and well-structured GitHub workflows can run a lightweight version of the same approach.

Start with one workflow. Pick a repetitive part of your development or business process, establish the guardrails, and let agents handle it. The key investment isn’t in compute — it’s in guardrails and context. Linters, test suites, good documentation, and clear specifications matter more than token budget.

For SMBs and founders, this is the most asymmetric advantage available. You can operate at a scale that was previously only accessible with significant headcount. The learning curve is steep but short. Within 30 days of serious experimentation, most people develop the intuition for what agents can and can’t handle.

Projects like OpenClaw — an open-source autonomous agent that executes tasks across messaging platforms and services — demonstrate that the tooling for this approach is increasingly accessible. The software runs locally, integrates with multiple LLM providers, and requires no enterprise licensing. The barrier isn’t access to technology. It’s willingness to change how work gets done.

What this means beyond software

Software is where this pattern is playing out first, but the model applies wherever knowledge work is structured and repeatable.

Audit processes. Compliance checks. Report generation. Data analysis. Document review. These are all candidates for the same approach: clear specifications, comprehensive validation, and autonomous execution within defined guardrails.

Most traditional industries haven’t started thinking about this. They’re still debating whether to use ChatGPT for email drafts. The firms that figure out how to apply dark factory principles to their domain will have an enormous advantage over those still operating with manual workflows.

The lights are already off in some factories. The question isn’t whether this approach will spread. It’s how quickly your organisation recognises that the game has changed.

March 5, 2026

The Judgment Bottleneck: Why Direction Matters More Than Execution Speed

Watch any software team for long enough and you’ll see the bottleneck move. First it was writing code. Then reviewing it. Then testing. Then deployment. Then security scanning. Each constraint gets automated, and immediately the next one becomes the problem.

This cycle used to play out over years. Now it’s happening in weeks.

AI agents can handle spec writing, code generation, reviews, testing, and deployment. The full software development lifecycle for standard applications will be largely automatable within 2-3 years. At that point, everyone has the same execution capability.

The bottleneck shifts entirely to direction — knowing what to build and where to point these agents.

What judgment actually means

When people talk about what humans will get paid for in this new world, the answer is always some version of “taste and judgment.” This sounds right but doesn’t help much in practice.

What does judgment actually look like?

At the individual level, it’s watching an agent stream code and knowing it’s heading the wrong way architecturally. At the team level, it’s deciding which features to build and which to kill. At the org level, it’s knowing which markets to enter, which problems to solve, which capabilities to invest in.

Speed vs quality in judgment

Most people assume faster judgment is better judgment. This holds true at lower levels — how quickly can you review a PR, decide on an implementation, and move on — but breaks down at higher ones.

At the strategic level, speed matters far less than quality. A fast bad decision with an army of agents creates massive damage quickly.

This is the Build Trap at scale. When Melissa Perri wrote about companies getting stuck in constant building mode, the cost of building the wrong thing was wasted engineering time. Now that execution is nearly free, the cost is wasted opportunity plus the maintenance burden of everything you built.

As Perri puts it:

Building is the easy part of the product development process. Figuring out what to build and how we are going to build it is the hard part.

When she wrote that in 2014, most companies weren’t listening. They were too busy measuring success by production of code or product. “What did you do today?” instead of “What have you learned about our customers or our business?”

Now the stakes are higher. The agents don’t get tired. They don’t push back. They’ll happily build everything you tell them to build, whether or not it should exist.

Why strategic judgment resists automation

LLMs are excellent at execution-level tasks. They’re increasingly good at tactical decisions — which design pattern to use, how to structure a module. They’re much weaker at strategic judgment.

Should we build this product? Enter this market? Restructure this team?

Strategic judgment requires context that lives outside any codebase: market dynamics, competitive landscape, customer relationships, organisational politics, timing. Digital twins and second brains may help over time, but what questions to ask remains human.

Software factories are now real. StrongDM AI built one where “specs + scenarios drive agents that write code, run harnesses, and converge without human review.” Their internal guidelines are telling: “If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement.”

But even they aren’t automating direction. They’re automating execution based on specs that humans still write. The factory doesn’t decide what to build. It decides how to build what you told it to.

What happens when everyone has the same tools

When everyone has access to powerful AI agents, competitive advantage shifts.

A startup with great direction and AI agents beats a startup with mediocre direction and the same agents. A company with 10 people who know exactly what to build beats one with 100 people building everything they can think of.

The industry spent decades competing on execution capacity. The companies still hiring for “more hands” are optimising for the wrong bottleneck.

Dan Shapiro describes this progression through five levels of automation: from spicy autocomplete to software factory. At Level 3, you become a manager reviewing endless diffs. At Level 4, you’re essentially a PM writing specs. At Level 5, it’s a black box that turns specs into software — a dark factory where humans are neither needed nor welcome.

But even at Level 5, someone still decides what goes into the black box. That’s the judgment layer that can’t be automated away.

What this means for you

Individual engineers and PMs: your value is moving from “can you build it” to “should we build it.” Invest in domain expertise, business understanding, and product sense. Study the business side of whatever domain you work in. Understanding unit economics, customer behaviour, and market dynamics makes you more valuable than knowing the latest framework.

Startups have an advantage in direction. You can’t outspend incumbents on execution, but you can out-think them.

Enterprises face a different risk — having a massive agent army pointed at the wrong objectives. Governance and strategy matter more than tooling. Every bad strategic decision gets executed at scale.

For traditional industries, the judgment layer is where external expertise earns its keep — not in the building, but in the pointing.

Learning to evaluate and adjust direction

The most important skill isn’t making perfect decisions. It’s learning to evaluate and adjust direction quickly as you get signal from the market. This is judgment in practice, not in theory.

Domain expertise becomes your moat. The deeper you understand a specific business problem space, the better you can direct agents toward solving it. Learn to operate at the “what to build” level, not the “how to build” level. Practice defining problems crisply, specifying success criteria, and saying no to features.

For business leaders, get hands-on with AI tools enough to develop intuition about what they can do. You don’t need to code, but you need to know what’s possible.

The execution bottleneck is being solved. The judgment bottleneck requires human capacity, and at the highest levels, quality of strategic thinking matters more than speed of decision-making.

Ask what you’re learning as you move. Ask if you’re building the right things in the first place.

March 3, 2026

GitHub’s SpecKit: The Structure Vibe Coding Was Missing

When I first started experimenting with “vibe coding,” building apps with AI agents felt like a superpower. The ability to spin up prototypes in hours was exhilarating. But as I soon discovered, the initial thrill came with an illusion. It was like managing a team of developers with an attrition rate measured in minutes—every new prompt felt like onboarding a fresh hire with no idea what the last one had been working on.

The productivity boost was real, but the progress was fragile. The core problem was context—a classic case of the law of leaky abstractions applied to AI. Models would forget why they made certain choices or break something they had just built. To cope, I invented makeshift practices: keeping detailed dev context files, enforcing strict version control with frequent commits, and even asking the model to generate “reset prompts” to re-establish continuity. Messy, ad hoc, but necessary.

That’s why GitHub’s announcement of SpecKit immediately caught my attention. SpecKit is an open-source toolkit for what they call “spec-driven development.” Instead of treating prompts and chat logs as disposable artifacts, it elevates specifications to first-class citizens of the development lifecycle.

In practice, this means:

Specs as Durable Artifacts: Specifications live in Git alongside your code—permanent, version-controlled, and not just throwaway notes.
Capturing Intent: They document the why—the constraints, purpose, and expected behavior—so both humans and AI stay aligned.
Ensuring Continuity: They serve as the source of truth, keeping projects coherent across sessions and contributors.

For anyone who has tried scaling vibe coding beyond a demo, this feels like the missing bridge. It brings just enough structure to carry a proof-of-concept into maintainable software.

And it fits into a larger story. Software engineering has always evolved in waves—structured programming, agile, test-driven development. Each wave added discipline to creativity, redefining roles to reflect new economic realities—a pattern we’re seeing again with agentic coding. Spec-driven development could be the next step:

Redefining the Developer’s Role: Less about writing boilerplate, more about designing robust specs that guide AI agents.
Harnessing Improvisation: Keeping the creative energy of vibe coding, but channeling it within a coherent framework.
Flexible Guardrails: Not rigid top-down rules, but guardrails that allow both creativity and scalability.

Looking back, my dev context files and commit hygiene were crude precursors to this very idea. GitHub’s SpecKit makes clear that those instincts weren’t just survival hacks—they pointed to where the field is heading.

The real question now isn’t whether AI can write code—we know it can. The question is: how do we design the frameworks that let humans and AI build together, reliably and at scale?

Because as powerful as vibe coding feels, it’s only when we bring structure to the improvisation that the music really starts.

👉 What do you think—will specs become the new lingua franca between humans and AI?

September 18, 2025

The Economic Reality and the Optimistic Future of Agentic Coding

After a couple of months deep in the trenches of vibe coding with AI agents, I’ve learned this much: scaling from a fun, magical PoC to an enterprise-grade MVP is a completely different game.

Why Scaling Remains Hard—And Costly

Getting a prototype out the door? No problem.

But taking it to something robust, secure, and maintainable? Here’s where today’s AI tools reveal their limits:

Maintenance becomes a slog. Once you start patching AI-generated code, hidden dependencies and context loss pile up. Keeping everything working as requirements change feels like chasing gremlins through a maze.
Context loss multiplies with scale. As your codebase grows, so do the risks of agents forgetting crucial design choices or breaking things when asked to “improve” features.

And then there’s the other elephant in the room: costs.

The cost scaling isn’t marginal—not like the old days of cloud or Web 2.0. Powerful models chew through tokens and API credits at a rate that surprises even seasoned devs.
That $20/month Cursor plan with unlimited auto mode? For hobby projects, it’s a steal. For real business needs, where a single query can rack up millions of tokens, I can see how a team would quickly outgrow even the $200 Ultra plan.
This may be one reason behind the big tech layoffs and restructuring we’re seeing: AI-driven productivity gains aren’t evenly distributed, and the cost curve for the biggest players keeps climbing.

What the Data Tells Us

That research paper—Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity—had a surprising conclusion:

Experienced developers saw no time savings on real-world coding tasks with AI. They actually took longer, because they spent more time reviewing, correcting, and adapting agent output.

The lesson:

AI shifts where the work happens—it doesn’t always reduce it. For now, scaling with agents is only as good as your processes for context, review, and cost control.

Why I Remain Optimistic

Despite the challenges, I’m genuinely excited for what’s coming next.

The platforms and models are evolving at warp speed. Many of the headaches I face today—context loss, doc gaps, cost blind spots—will get solved just as software engineering best practices eventually became codified in our tools and frameworks.
Agentic coding will find its place. It might not fully automate developer roles, but it will reshape teams: more focus on high-leverage decisions, design, and creative problem-solving, less on boilerplate and “busy work.”

And if you care about the craft, the opportunity is real:

Devs who learn to manage, review, and direct agents will be in demand.
Organizations that figure out how to blend agentic workflows with human expertise and robust process will win big.

Open Questions for the Future

Will AI agentic coding mean smaller, nimbler teams—or simply more ambitious projects for the same headcount?
How will the developer role evolve when so much code is “synthesized,” not hand-crafted?
What new best practices, cost controls, and team rituals will we invent as agentic coding matures?

Final thought:

The future won’t be a return to “pure code” or a total AI handoff. It’ll be a blend—one that rewards curiosity, resilience, and the willingness to keep learning.

Where do you see your work—and your team—in this new landscape?

August 14, 2025

The Law of Leaky Abstractions & the Unexpected Slowdown

If the first rush of agentic/vibe coding feels like having a team of superhuman developers, the second phase is a reality check—one that every software builder and AI enthusiast needs to understand.

Why “Vibe Coding” Alone Can’t Scale

The further I got into building real-world prototypes with AI agents, the clearer it became: Joel Spolsky’s law of leaky abstractions is alive and well.

You can’t just vibe code your way to a robust app—because underneath the magic, the cracks start to show fast. AI-generated coding is an abstraction, and like all abstractions, it leaks. When it leaks, you need to know what’s really happening underneath.

My Experience: Hallucinations, Context Loss, and Broken Promises

I lost count of the times an agent “forgot” what I was trying to do, changed underlying logic mid-stream, or hallucinated code that simply didn’t run. Sometimes it wrote beautiful test suites and then… broke the underlying logic with a “fix” I never asked for. It was like having a junior developer who could code at blazing speed—but with almost no institutional memory or sense for what mattered.

The “context elephant” is real. As sessions get longer, agents lose track of goals and start generating output that’s more confusing than helpful. That’s why my own best practices quickly became non-negotiable:

Frequent commits and clear commit messages
Dev context files to anchor each session
Separate dev/QA/prod environments to avoid catastrophic rollbacks (especially with database changes)
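
To make the "dev context file" idea concrete, here is a minimal sketch of a helper that renders one at the start of a session. The function name, the section headings, and the file layout are all assumptions for illustration; the post does not describe the exact format I used.

```python
from datetime import date


def render_context(goal, decisions, next_steps, today=None):
    """Build the text of a session context file (hypothetical layout)."""
    today = today or date.today()
    lines = [f"# Dev context ({today.isoformat()})", "", "## Goal", goal, ""]
    # Design decisions the agent must not silently undo.
    lines.append("## Decisions so far")
    lines += [f"- {item}" for item in decisions]
    lines.append("")
    # What the next session should pick up.
    lines.append("## Next steps")
    lines += [f"- {item}" for item in next_steps]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_context(
        goal="Add a leaderboard page",
        decisions=["Keep SQLite for the prototype"],
        next_steps=["Write tests for score calculation"],
    ))
```

Pasting a file like this into the first prompt of a session, and committing it alongside the code, is one way to give each "fresh" agent the same starting point.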

What the Research Shows: AI Can Actually Slow Down Experienced Devs

Here’s the kicker—my frustration isn’t unique.

A recent research paper, Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, found that experienced developers actually worked slower with AI on real-world tasks. That’s right—AI tools didn’t just fail to deliver the expected productivity boost, they created friction.

Why?

Only about 44% of AI-generated code was accepted
Developers lost time reviewing, debugging, and correcting “bad” generations
Context loss and reliability issues forced more manual intervention, not less

This matches my experience exactly. For all the hype, these tools introduce new bottlenecks—especially if you’re expecting them to “just work” out of the box.

Lessons from the Frontlines (and from Agent Week)

I’m not alone. In the article What I Learned Trying Seven Coding Agents, Timothy B. Lee finds similar headaches:

Agents get stuck
Complex tasks routinely stump even the best models
Human-in-the-loop review isn’t going anywhere

But the tools are still useful—they’re not a dead end. You just need to treat them like a constantly rotating team of interns, not fully autonomous engineers.

Best Practices: How to Keep AI Agents Under Control

So how do you avoid the worst pitfalls?

The answer is surprisingly old-school:

Human supervision for every critical change
Sandboxing and least privilege for agent actions
Version control and regular context refreshers
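
As a rough illustration of least privilege, the sketch below checks a command an agent proposes against a small allowlist before anything runs. The allowlist contents and the `is_allowed` name are hypothetical, and this is a convenience filter, not a security boundary; a real setup would still run the agent inside a sandbox.

```python
import shlex

# Hypothetical allowlist: program -> permitted subcommands.
# None means the program may be run with any arguments.
ALLOWED = {
    "git": {"status", "diff", "log", "add", "commit"},
    "pytest": None,
    "ls": None,
}

# Characters that would let one command smuggle in another via a shell.
SHELL_METACHARACTERS = set(";&|<>`$")


def is_allowed(command_line):
    """Return True only if the command matches the allowlist."""
    if any(ch in SHELL_METACHARACTERS for ch in command_line):
        return False
    try:
        parts = shlex.split(command_line)
    except ValueError:  # unbalanced quotes and similar
        return False
    if not parts or parts[0] not in ALLOWED:
        return False
    subcommands = ALLOWED[parts[0]]
    if subcommands is None:
        return True
    return len(parts) > 1 and parts[1] in subcommands


if __name__ == "__main__":
    for cmd in ["git status", "git push --force", "rm -rf /"]:
        print(cmd, "->", "run" if is_allowed(cmd) else "ask a human")
```

Anything that fails the check goes to a human for approval, which keeps the supervision step focused on the changes that actually matter.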

Again, Lee’s article “Keeping AI agents under control doesn’t seem very hard” nails it:

Classic engineering controls—proven in decades of team-based software—work just as well for AI. “Doomer” fears are overblown, but so is the hype about autonomy.

Conclusion: The Hidden Cost of Abstraction

Vibe coding with agents is like riding a rocket with no seatbelt—exhilarating, but you’ll need to learn to steer, brake, and fix things mid-flight.

If you ignore the leaky abstractions, you’ll pay the price in lost time, broken prototypes, and hidden tech debt.

But with the right mix of skepticism and software discipline, you can harness the magic and avoid the mess.

In my next post, I’ll zoom out to the economics—where cost, scaling, and the future of developer work come into play.

To be continued…

August 12, 2025

The Thrill and the Illusion of AI Agentic Coding

A few months ago, I stumbled into what felt like a superpower: building fully functional enterprise prototypes using nothing but vibe coding and AI agent tools like Cursor and Claude. The pace was intoxicating—I could spin up a PoC in days instead of weeks, crank out documentation and test suites, and automate all the boring stuff I used to dread.

But here’s the secret I discovered: working with these AI agents isn’t like managing a team of brilliant, reliable developers. It’s more like leading a software team with a sky-high attrition rate and non-existent knowledge transfer practices. Imagine onboarding a fresh dev every couple of hours, only to have them forget what happened yesterday and misinterpret your requirements—over and over again. That’s vibe coding with agents.

The Early Magic

When it works, it really works. I’ve built multiple PoCs this way—each one a small experiment, delivered at a speed I never thought possible. The agents are fantastic for “greenfield” tasks: setting up skeleton apps, generating sample datasets, and creating exhaustive test suites with a few prompts. They can even whip up pages of API docs and help document internal workflows with impressive speed.

It’s not just me. Thomas Ptacek’s piece “My AI Skeptic Friends Are All Nuts” hits the nail on the head: AI is raising the floor for software development. The boring, repetitive coding work—the scaffolding, the CRUD operations, the endless boilerplate—gets handled in minutes, letting me focus on the interesting edge cases or higher-level product thinking. As they put it, “AI is a game-changer for the drudge work,” and I’ve found this to be 100% true.

The Fragility Behind the Hype

But here’s where the illusion comes in. Even with this boost, the experience is a long way from plug-and-play engineering. These AI coding agents don’t retain context well; they can hallucinate requirements, generate code that fails silently, or simply ignore crucial business logic because the conversation moved too fast. The “high-attrition, low-knowledge-transfer team” analogy isn’t just a joke—it’s my daily reality. I’m often forced to stop and rebuild context from scratch, re-explain core concepts, and review every change with a skeptical eye.

Version control quickly became my lifeline. Frequent commits, detailed commit messages, and an obsessive approach to saving state are my insurance policy against the chaos that sometimes erupts. The magic is real, but it’s brittle: a PoC can go from “looks good” to “completely broken” in a couple of prompts if you’re not careful.

Superpowers—With Limits

If you’re a founder, product manager, or even an experienced developer, these tools can absolutely supercharge your output. But don’t believe the hype about “no-code” or “auto-code” replacing foundational knowledge. If you don’t understand software basics—version control, debugging, the structure of a modern web app—you’ll quickly hit walls that feel like magic turning to madness.

Still, I’m optimistic. The productivity gains are real, and the thrill of seeing a new prototype come to life in a weekend is hard to beat. But the more I use these tools, the more I appreciate the fundamentals that have always mattered in software—and why, in the next post, I’ll talk about the unavoidable reality check that comes when abstractions leak and AI doesn’t quite deliver on its promise.

To be continued…

August 7, 2025

🎮 Vibe Coding a Stock Market Game: Why Every GTM Leader Should Build Like This Once

Earlier this month, I did something I hadn’t done in over 15 years:
I rebuilt a stock market simulation game I had originally created during business school.

The original was built on Ruby on Rails.
This time, I went lean — prototyping with HTML, JS, and lightweight AI-assisted dev tools in what I’d call a vibe coding session.

But this post isn’t about the code.

It’s about what I learned — and why every founder, product owner, or GTM leader should prototype at least one thing themselves in this way.

🧪 What Is Vibe Coding, Really?

The term vibe coding was coined by Andrej Karpathy, but it was a recent post by Strangeloop Canon that captured its essence:

“If AGI is the future, vibe coding is the present.”

To me, vibe coding is building with momentum, not perfection.
No heavyweight specs. No team syncs. Just one person, a rough idea, and tools that let you think through your fingertips.

You’re not coding to launch. You’re coding to understand.
And sometimes, that’s exactly what you need.

🧠 What I Learned From Rebuilding QSE

1. Building sharpens your strategy lens.
When you rebuild something from scratch, every interaction becomes a test of friction vs flow. That mindset translates directly into GTM design, onboarding strategy, and product-market fit thinking.

2. AI is best when it feels smart.
My game features a basic rules-based AI opponent. Not sophisticated — but just enough to create pressure and tension. It reminded me that AI doesn’t need to be advanced, it needs to feel aligned with the user’s rhythm.

3. Prototypes create unexpected clarity.
Tiny design decisions (like how many clicks it takes to place a trade) turned into insights about attention spans, pacing, and simplicity — lessons I’ll carry into larger GTM and transformation conversations.

🔁 Why This Resonated Beyond the Code

Rebuilding QSE wasn’t a nostalgia trip. It was a reconnection with creative flow.
It reminded me of how much clarity you gain when you stop whiteboarding and start building.

We often separate “strategy” and “execution” as different domains.
But I’ve found that prototyping collapses that gap. You see things faster. You think better. And sometimes, you spot the real issue — not in the brief, but in the build.

If you’re leading a product, driving a GTM motion, or exploring AI integration, I genuinely recommend vibe coding — or at least, vibing with your builders more closely.

🕹️ Curious to try the game I rebuilt?
👉 Play QSE Reloaded

The code is available on GitHub.

May 22, 2025

Building a stock market game in 2008

This is a post I meant to write almost 13 years back, on how I built a stock market game using Ruby on Rails for our B-school flagship event Quadriga (I did release the game source code on Rubyforge, but the site is no longer operational). Like they say, better late than never :). Below is a short screen capture of the game in action from the beta run I had organized, showcasing the different features to give you an idea of what it entailed:

The game itself was a very simplified version of a stock market designed to be played as individuals or as a team with the following features:

Simple buy and sell transactions without any short selling, futures or options.
The trading would be spread across a period of 12 sessions with the prices changing before the start of each. Each user would get a fixed set of shares for each of the stocks at the beginning of the game so that selling activity can be initiated from period 1.
There was an element of randomization in the stock price movement from period to period partially influenced by a set of pre-defined events.
A user login feature with public leaderboard to give everyone a view of how they are performing against the competitors.
A transaction & stock price history section to view the changes over time.
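
The price mechanics described above can be sketched in a few lines. This is not the original Rails code: the `Event` shape, the 5% volatility, the price floor, and the choice to treat session 1 as the starting prices are all assumptions made for illustration.

```python
import random
from typing import NamedTuple

SESSIONS = 12  # the game ran over 12 trading sessions


class Event(NamedTuple):
    """A pre-defined event that nudges one stock (hypothetical shape)."""
    session: int   # 1-based session the event applies to
    symbol: str
    impact: float  # fractional drift, e.g. 0.10 for roughly +10%


def next_price(price, drift, rng, volatility=0.05):
    """Apply event drift plus a bounded random shock, with a floor of 1.0."""
    shock = rng.uniform(-volatility, volatility)
    return round(max(1.0, price * (1 + drift + shock)), 2)


def simulate(start_prices, events, seed=None):
    """Return {symbol: [price in session 1, ..., price in session 12]}."""
    rng = random.Random(seed)
    history = {symbol: [price] for symbol, price in start_prices.items()}
    for session in range(2, SESSIONS + 1):
        for symbol, prices in history.items():
            # Sum the impact of every event scheduled for this stock and session.
            drift = sum(e.impact for e in events
                        if e.session == session and e.symbol == symbol)
            prices.append(next_price(prices[-1], drift, rng))
    return history


if __name__ == "__main__":
    events = [Event(session=3, symbol="ACME", impact=0.10)]
    for symbol, prices in simulate({"ACME": 100.0, "GLOBEX": 50.0}, events, seed=1).items():
        print(symbol, prices)
```

Passing a fixed `seed` makes a run repeatable, which is handy when rehearsing an event; leaving it as `None` gives a different market every time.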

I have fond memories of this game as it gave me an opportunity to try out Ruby on Rails in a real-world scenario (this video from 2005 was the inspiration). I had a lot of fun coding the game and consulting classmates & seniors on how the stock market should be simulated. Even more fun was the beta testing round we did over the hostel LAN (the video above is from the test run, as you can see in the message on the login screen), with most of my classmates participating. We have come a long way on the technological front, and one of the things I do find missing in the game is an element of visualization in the form of graphs. Mobile support was of course not relevant back in 2008, but today it would be a no-brainer.

As for the actual event, we brought in an element of security/standardization by having the competing teams use laptops borrowed from my classmates. To ensure that their personal files were not affected, we set up Linux virtual machines on top of the Windows environment, and the teams used browsers to access the game running on my laptop through a Wi-Fi network we had set up on a router borrowed from another of my classmates (it was 2008 after all).

This nostalgic post would not have been possible had I not backed up the files to an external hard disk and eventually to OneDrive. So, here’s a bit more of a throwback with the event posters & other collateral:

January 22, 2022

Category: Programming
