Tag: organization

  • What “AI-native” actually means for a founder-led firm

    What “AI-native” actually means for a founder-led firm

    A recent paper from INSEAD and Harvard looked at close to 3,000 startups and found something that should interest anyone running a small firm. Matched like-for-like, same industry, same age, the ones built around AI ran about a quarter smaller than the ones that weren’t. In services, the gap reached roughly 70%. Fewer people, the same kind of work, and not because they were starving themselves of staff.

    A smaller firm that is worth more

    Across eleven startup batches from 2020 to 2024, the AI-native firms ran leaner and flatter, carried more senior people and fewer managers, and still raised more money and held higher valuations per head than their peers. The smaller size was buying more output per person, not papering over weaker companies. The effect was largest, near 70%, in services firms, the advisory-and-delivery work many of us sell. If your firm sells judgment rather than a shipped product, that number is aimed at you.

    One example makes the shape concrete. An AI slide-deck company reached around $50M of annual revenue in about two years with a team of roughly 30, because making a deck became something the customer does inside the product instead of a job that lands in someone’s queue. A small team, a lot of output, because the work itself moved.

    A fair caveat, which the paper makes and most write-ups drop: a leaner firm need not mean fewer jobs overall. Cheaper output tends to pull in more demand for it (the Jevons paradox: make something cheaper and we usually use far more of it), so the economy can end up with more firms and more total work even as each one shrinks.

    The most telling result is one that didn’t show up the way you might expect. The AI-native firms in the study were about 2.6 times more likely than the rest to name worker-facing tools (ChatGPT, coding assistants, and the like) in their job ads, yet that heavier tool use predicted none of the structural differences: not the smaller size, not the flatter hierarchy. As the authors put it, “equipping workers with ChatGPT, Copilot, or Cursor does not, on its own, predict smaller firms.”

    A large MIT study in 2025 found the mirror image: around 95% of corporate AI pilots delivered no measurable impact, with the failures traced to organisations bolting AI onto existing workflows rather than to the technology itself. Both point the same way. The gains come from redesigning the work, and the licence count barely matters.

    What did track with firm size was where the AI sat. Used inside the firm to speed up work people already do, it moved the structure very little. Built into what the firm sells, so the customer generates the output directly, it tracked with the shrinkage. It is the same move Ben Thompson called an unbundling: the firms that pull ahead fold the act of making something real into the product, rather than keeping it a step their staff perform.

    Governed delegation

    What makes a firm AI-native, then, is how far it has rebuilt its work around what AI can do reliably and what still needs a person watching. Call it governed delegation: you hand the work to AI while keeping your hands on the rules, the boundaries, and the checking. Think of a restaurant that buys a dishwasher but keeps everyone rinsing each plate by hand first. It has added a machine and kept the old routine, so all it has really bought is cost. The gain shows up only when the workflow is rebuilt around what the machine does well.

    Most of this conversation stays inside software teams, which is why the wider point gets missed. The same logic runs straight into sales, marketing, and operations, which is where I have spent the past several months testing it. I run most of OrchestratorAI’s own go-to-market this way: agents handle well-defined pieces of work, and my job is to set the rules and check the output rather than produce each piece by hand. Each group of agents runs in one of two modes, either nothing leaves the building until I have looked at it, or it runs on its own, shows me a sample, and escalates the exceptions. A task earns its way from the first mode to the second only after it stops throwing up things to fix, and some never do. Drafting a note to a high-value prospect stays under review; filling in a basic company profile from public sources does not need me hovering.

    Running it this way taught me something the paper doesn’t quite reach: almost none of the governance had to be invented. We have spent decades building ways to govern people at work: audit trails and compliance checks, standard operating procedures, the maker-and-checker split in finance, code review and the ceremonies of Agile. In a human-only firm a lot of this sits heavy. It adds layers and breeds box-ticking, because people dislike being audited and tire of the checklist, so the control gets watered down to what the organisation or regulators will tolerate.

    Point the same practices at AI agents and the weight largely lifts. An LLM is trained on how people work, so it takes direction roughly the way a person does, and the structure of the playbook carries over. The agent doesn’t resent the audit log or cut the SOP short when it’s busy, so a check that was costly to run on people is cheap to run on an agent. The procedures that were overhead in a human firm become the guardrails of an AI-native one. The autonomous SDLC (the software build-and-ship cycle, increasingly run by agents) is the clearest case I’ve come across: code review, the test gate, the staged rollout that always slowed teams down are exactly what lets an agent ship without a human reading every line.

    The one catch here is that agents fail differently from people. They don’t commit fraud or get bored; they fabricate confidently and come apart on inputs a person would laugh off. So the controls transfer in shape, not in detail: you keep the independent check and the earned trust, but you point them at hallucination and weak grounding rather than at fatigue and dishonesty. Those rough edges are being ironed out quickly. Coding is the clearest case, where agents have grown markedly more reliable over the past year, and making AI broadly more dependable and harder to misuse is a major focus at the frontier labs, still some way from solved. The upshot is that there is already plenty an agent can be trusted to do well today, and the list keeps growing.

    What it means if you run a small firm

    The paper names the real bottleneck, and it isn’t access to AI, which everyone now has at much the same price. It is the mapping problem: working out where, in your particular business, AI actually pays off. That has no generic answer, and finding it is the work that’s left.

    For a founder, the practical shift is that your scarce time moves from production toward direction and judgment, deciding where AI-handled work ends and where human review is non-negotiable. The closer you sit to pure services, the sharper that shift, which is presumably why the services number ran to 70%. Autonomy has to be earned, too: moving a task from supervised to self-running is a track record you build through logging and spot-checks, not a switch you flip. So the key question to put to your own firm is not which AI tools to buy, but which parts of the work can be governed at the boundary, and which still need a person on every output.

    None of this is settled, and the unsettled part is the interesting one: deciding when a piece of work has earned its autonomy, and building the checks so you aren’t flying blind once it runs alone. Strikingly few firms in the data had actually managed it, which makes “AI-native” more aspiration than description for now, even among the companies wearing the label. That is the encouraging part. The tools are here and roughly evenly spread, so the advantage goes to whoever does the slow, unglamorous work of deciding what to hand over and what to keep a hand on.

  • The Task Changed, The Job Didn’t — But Your Org Hasn’t Noticed Yet

    There’s a conversation happening quietly in engineering teams, product orgs, and design studios. It surfaces in Slack DMs and whispered break-room conversations. The question underneath is always the same: If AI can do what I do, what am I for?

    That fear makes sense. Engineers who built their identity around writing clean code watch AI generate entire modules in seconds. Product managers who prided themselves on writing crisp specs see AI agents do the same work overnight. Designers watch their Figma files get autocompleted before they’ve finished thinking through the problem.

    But here’s what’s being missed: the task is changing, the job isn’t.

    Writing code was always a means to an end. The job was shipping features that solve problems. Writing specs was always a means to an end. The job was understanding user needs and deciding what to build. AI automates the means, not the end. The bottleneck was never typing speed — it was clarity of thinking, problem definition, and judgment about what to build.

    Those bottlenecks are still ours.

    The Identity Trap

    Most people in technology define themselves by the task they perform, not the outcome they produce. “I’m a backend engineer” means I write backend code. “I’m a PM” means I write specs and manage tickets. When AI starts doing those tasks faster and arguably better, the identity feels threatened.

    The first response is usually denial: “AI can’t really do what I do — it doesn’t understand context, it makes mistakes, it needs constant supervision.” The second is panic: “I’m about to be replaced by a model that costs pennies per thousand tokens.”

    But the real shift isn’t about automation replacing roles. It’s about what happens when execution becomes nearly free and the entire competitive advantage moves to knowing what to build in the first place.

    From Tasks to Judgment

    When people ask what humans will do in this new world, the answer is usually “taste and judgment.” But that’s abstract. What does judgment actually mean?

    It means knowing what to build, when to say no, and how to spot when AI is heading in the wrong direction. It’s defining the guardrails before you let agents run — test suites, design patterns, architectural constraints. It’s understanding that every line of code is future maintenance burden, which makes the discipline to not build more valuable than the ability to build fast.

    In 2014, Melissa Perri warned about “The Build Trap” — companies stuck measuring success by what they shipped rather than what they learned. “Building is the easy part,” she wrote. “Figuring out what to build and how we are going to build it is the hard part.”

    Most companies ignored that. Now AI makes building trivially easy, and those companies are about to drown in features that solve nothing. The agents don’t get tired. They don’t push back. They’ll happily build everything you point them at, whether or not it should exist.

    The Multi-Hat Convergence

    The expectation is shifting: one person who can think about the problem, design the solution, and use AI to build it. This doesn’t mean everyone becomes a shallow generalist. It means the boundaries between roles blur significantly.

    PMs without a hard skill — design or code — and engineers without product sense are both increasingly vulnerable. The trifecta of product thinking, design sense, and technical execution is becoming the baseline, not the exception.

    For experienced professionals considering independence, this convergence changes the economics dramatically. A single person with AI tools can now deliver what used to require a small team.

    The Org Structure Problem

    Most organizations are still structured around tasks, not outcomes. Teams are organized by function — frontend team, backend team, QA team, design team. Performance is measured by task completion: PRs merged, tickets closed, specs written.

    AI makes task completion trivially fast, which breaks these measurement systems completely. The real metric should be business outcomes, but most orgs aren’t wired to measure or incentivize that way.

    Companies are starting to notice. Last year, the Shopify CEO asked employees to prove why they “cannot get what they want done using AI” before asking for more headcount. Last week, Block laid off 40% of its workforce — more than 4,000 people. Co-founder Jack Dorsey was direct: “A significantly smaller team, using the tools we’re building, can do more and do it better.”

    A startup with great direction and AI agents beats a startup with mediocre direction and the same agents. A company with 10 people who know exactly what to build beats one with 100 people building everything they can think of.

    The companies still hiring for “more hands” are optimizing for the wrong bottleneck.

    What This Means for You

    If you’re an engineer, invest in product sense and domain expertise. Understand why you’re building, not just how. Study the business side of your domain — unit economics, customer behavior, market dynamics.

    If you’re a PM, get your hands dirty with at least one hard skill. Design or code, even at a basic level. The ability to prototype your own ideas or understand technical tradeoffs without waiting for a meeting makes you more effective than you’d expect.

    If you’re a leader, start restructuring teams around outcomes, not functions. Measure business impact, not tickets closed. Reward people for solving problems and learning, not for producing code.

    Stop identifying with your task. Start identifying with the outcomes you produce.

    The people making this shift now are building a compounding advantage. The gap widens every month. Domain expertise becomes your moat. The deeper you understand a specific business problem space, the better you can direct agents toward solving it.

    The execution bottleneck is being solved. The judgment bottleneck requires human capacity, and it’s where the real value lives now.

  • The Dark Factory: Engineering Teams That Run With the Lights Off

    A few engineering organisations are already operating a model most companies haven’t begun to consider. While the typical software team debates whether to adopt AI coding assistants, companies like StrongDM are running fully automated development pipelines where agents handle implementation, testing, review, and deployment. Humans set direction and define constraints. The mechanical work happens without them.

    This isn’t speculative. It’s operational. And the gap between companies working this way and those that aren’t is widening fast.

    What “lights off” actually means

    The term comes from manufacturing — factories that run autonomously, with minimal human presence. In software, it describes engineering organisations where AI agents do the bulk of execution work while humans focus on architecture, constraints, and outcomes.

    StrongDM’s approach is instructive: their benchmark is that if you haven’t spent at least $1,000 on tokens per human engineer per day, your software factory has room for improvement. Agents work in parallel on isolated tasks. Code is written, tested, and reviewed without manual intervention. Tasks assigned Friday evening return results Monday morning.

    The ratio of agents to humans is high and growing. But this isn’t about replacing engineers — it’s about fundamentally changing what engineers do.

    The guardrails are the system

    Dark factories aren’t ungoverned. They’re heavily governed in a different way.

    Linters, formatters, comprehensive test suites, design pattern enforcement — these become pre-conditions rather than suggestions. Agents are configured to seek completion only when all guardrails pass. Code review shifts from line-by-line human inspection to AI review with human spot-checks on critical paths.

    The discipline moves from “write good code” to “design good systems for code to be written in.” That’s a different skill. It requires thinking about constraints, validation, and feedback loops rather than syntax and implementation details.

    Anthropic’s experiment building a C compiler with parallel Claude instances demonstrates this principle. Sixteen agents worked simultaneously on a shared codebase, coordinating through git locks and comprehensive test harnesses. The result: a 100,000-line compiler capable of building the Linux kernel, produced over nearly 2,000 sessions across two weeks for just under $20,000. The project worked because the test infrastructure was rigorous enough to guide autonomous agents toward correctness without human review of every change.

    Cursor’s experiments with scaling agents ran into a different problem. They tried flat coordination first — agents self-organising through a shared file, claiming tasks, updating status. It broke down. Agents held locks too long, became risk-averse, made small safe changes, and nobody took responsibility for hard problems. The fix was introducing hierarchy: planners that explore the codebase and create tasks, workers that grind on assigned work until it’s done. No single agent tries to do everything. The system ran for weeks, writing over a million lines of code. One project improved video rendering performance by 25x and shipped to production. Their takeaway: many of the gains came from removing complexity rather than adding it.

    Digital twins as the enabler

    The biggest blocker to agent autonomy has been the fear of breaking production. Digital twins remove that constraint.

    StrongDM built behavioural replicas of third-party services their software depends on — Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets. These twins replicate APIs, edge cases, and observable behaviours with sufficient fidelity that agents can test against realistic conditions at volume, without rate limits or production risk.

    Simon Willison’s write-up of StrongDM’s approach highlights how this changed what was economically feasible: “Creating a high fidelity clone of a significant SaaS application was always possible, but never economically feasible. Generations of engineers may have wanted a full in-memory replica of their CRM to test against, but self-censored the proposal to build it.”

    What makes this rigorous rather than just better staging is how they handle validation. Test scenarios are stored outside the codebase — separate from where the coding agents can see them — functioning like holdout sets in machine learning. Agents can’t overfit to the tests because they don’t have access to them. The QA team is also agents, running thousands of scenarios per hour without hitting rate limits or accumulating API costs.

    The structural advantage of starting fresh

    Startups and SMBs have a material advantage here. No legacy organisational structure to dismantle. No 500-person engineering floor with stakeholders defending headcount. No 18-month procurement cycles.

    Capital efficiency becomes native. A three-person team with agents can produce output that previously required twenty people. The cost of compute is a fraction of equivalent human labour and falling rapidly.

    This creates an asymmetric advantage. If your competitor ships in days what takes you months, no amount of talent closes that gap. And the competitive pressure isn’t just on speed — it’s on the ability to attract talent that wants to work this way. Senior engineers who’ve experienced agent-driven development don’t want to go back to manual workflows.

    The gap between adopters and laggards

    Companies operating this way are shipping at a fundamentally different pace. The difference isn’t incremental — it’s orders of magnitude in output per person.

    Block’s recent announcement of a near-50% reduction in headcount offers a data point. The company is reducing its organization from over 10,000 people to just under 6,000. Jack Dorsey stated “we’re not making this decision because we’re in trouble. our business is strong” but noted that “the intelligence tools we’re creating and using, paired with smaller and flatter teams, are enabling a new way of working which fundamentally changes what it means to build and run a company.”

    Cursor’s data shows the same pattern. 35% of pull requests merged internally at Cursor are now created by agents operating autonomously in cloud VMs. The developers adopting this approach write almost no code themselves. They spend their time breaking down problems, reviewing artifacts, and giving feedback. They spin up multiple agents simultaneously instead of guiding one to completion.

    The laggards aren’t just slower. They’re increasingly unable to compete for talent, capital, or market position against organisations that have made this transition.

    You don’t need a corporate budget to start

    The dark factory model scales down. A single developer with a Claude Code subscription and well-structured GitHub workflows can run a lightweight version of the same approach.

    Start with one workflow. Pick a repetitive part of your development or business process, establish the guardrails, and let agents handle it. The key investment isn’t in compute — it’s in guardrails and context. Linters, test suites, good documentation, and clear specifications matter more than token budget.

    For SMBs and founders, this is the most asymmetric advantage available. You can operate at a scale that was previously only accessible with significant headcount. The learning curve is steep but short. Within 30 days of serious experimentation, most people develop the intuition for what agents can and can’t handle.

    Projects like OpenClaw — an open-source autonomous agent that executes tasks across messaging platforms and services — demonstrate that the tooling for this approach is increasingly accessible. The software runs locally, integrates with multiple LLM providers, and requires no enterprise licensing. The barrier isn’t access to technology. It’s willingness to change how work gets done.

    What this means beyond software

    Software is where this pattern is playing out first, but the model applies wherever knowledge work is structured and repeatable.

    Audit processes. Compliance checks. Report generation. Data analysis. Document review. These are all candidates for the same approach: clear specifications, comprehensive validation, and autonomous execution within defined guardrails.

    Most traditional industries haven’t started thinking about this. They’re still debating whether to use ChatGPT for email drafts. The firms that figure out how to apply dark factory principles to their domain will have an enormous advantage over those still operating with manual workflows.

    The lights are already off in some factories. The question isn’t whether this approach will spread. It’s how quickly your organisation recognises that the game has changed.