All Your Coworkers Are Probabilistic Too
When people complain about large language models, I often feel like they're complaining about their coworkers without realizing it.
I've heard variations of the same sentence a lot over the last two years:
"I told it exactly what I wanted, and it still did something else."
The first time I heard that, it was about an LLM. The second time, it was about a colleague.
If you've worked in software long enough, you've probably lived through the situation where you write a ticket, or explain a feature in a meeting, and then a week later you look at the result and think: this is technically related to what I said, but it is not what I meant at all.
Nobody considers that surprising when humans are involved. We shrug, we sigh, we clarify, we fix it. Somehow, with LLMs, people expect the same vague input to magically produce the exact thing they already had in their head. When that fails, the conclusion is that the model is useless.
The real problem is that we were used to machines being deterministic in a very specific way. You type code into a compiler, it either accepts it or it doesn't. The machine never argues back, never claims it fixed something it didn't fix, never improvises. Now we suddenly have machines that behave much more like people: probabilistic, context-hungry and occasionally very confidently wrong.
We expected deterministic machines and got probabilistic coworkers instead. That situation isn't actually new at all. It's just coming from an unexpected direction.
I've written before about what it was like to build this site while working closely with these systems in I Let AI Build My Website. This article is less about that specific project and more about the patterns that emerged from it.
Non-Deterministic Coworkers
Most people I talk to still treat LLMs as if they were another tool bolted onto their IDE. You type something in, it should do what you say, and if it doesn't, the tool is broken.
That's not how this works for me.
LLMs are not all the same and perform differently depending on the task, but they all get confused when important context is missing. They all improvise when my instructions are vague. They all happily produce something that looks right on the surface while being subtly wrong underneath.
If I squint a little, that looks exactly like working with a group of humans that happen to type frighteningly fast. The difference is that they don't have a past with me. There is no shared history, no accumulated understanding, no "we've been here before" muscle memory. Every new session starts from zero.
Once I stopped expecting deterministic behavior and started treating the whole setup as a non-deterministic system that has to be guided and constrained, a lot of the frustration went away. The things that help look very old and very boring: clearer communication, shorter feedback loops, better organization, documentation that actually exists, and automation that catches mistakes before they hit anything important.
There is a very specific dialogue I've had more than once in my career. It usually happens after a feature demo.
"This isn't quite what I wanted."
"Well, I think it is exactly what we discussed."
Sometimes I'm on the receiving end of that. Sometimes I'm the one saying it. In all cases, it's a symptom of the same thing: the words we exchanged, either in speech or writing, did not capture all the actual intent and context in our heads. We thought we had an agreement, but we only had a rough sketch.
LLMs do exactly the same thing. If I write a two-sentence prompt like "build a static blog with modern web technologies" and hit enter, what comes back is not my personal, detailed vision of what that should be. It is the model's best guess at what "static blog" and "modern" usually mean in its training data. The mismatch is baked in.
What fixes this with humans also fixes it with models. When I work with colleagues, I don't just throw goals at them and hope for the best. We talk about constraints, about trade-offs, about what success looks like and what we explicitly don't care about. We look at examples. We clarify edge cases. When I do the same in a prompt, the outcome improves in exactly the same way. People call that "prompt engineering" now; I still think of it as just writing a better spec.
The feedback loop has the same shape too. A lot of people tell me they "tried" using an LLM for coding and gave up because it kept making the same mistakes. When I ask what the interaction looked like, it usually boils down to: long prompt, long wait, big blob of code, disappointment.
I've worked with enough people in my career to know that many of them are quite bad at expressing their intentions and wishes – or are impatient, or lazy. They can't articulate what they want, or they have a very narrow view of what even needs to be considered and specified in the first place.
When I had Copilot change some CSS for the share button on this site, the raw interaction was painful. It would claim it had fixed the problem while doing nothing useful. I'd point at the issue, and it would adjust something else. It felt like arguing with a stubborn coworker who hasn't fully understood the problem but won't admit it.
The moment I forced myself into a different loop, things improved. I made it write a plan first. I commented on the plan before it touched anything. I instructed it to run the standard build and check tasks and even to inspect the output itself. Where I could, I pointed to concrete lines that were wrong instead of vaguely complaining. With that in place, the interaction looked almost exactly like working with a teammate who is still getting familiar with the system: a bit slow at first, but steadily converging.
Context, Documentation and Onboarding
The context problem is similar. In any non-trivial system, nobody has the whole picture in their head. You rely on diagrams, short design docs, comments in the code and, unfortunately, a lot of institutional memory. A new teammate struggles not because they're stupid, but because half of the relevant information is smeared across a bunch of people's brains and a decade of Git history.
LLMs are far less forgiving here. They have a hard context window and zero ability to peek outside of it. If I don't show it a particular module, as far as the model is concerned that module doesn't exist. Early on, I assumed that working inside my editor meant the agent would "just know" where everything lived. It didn't. It would happily apply a pattern in one part of the codebase and ignore three other places where the same pattern existed, because those files weren't currently in view.
Things got noticeably better once I started treating context as something I had to curate for the model. I wrote down project-wide rules in my regular, traditionally human-targeted documentation, in places like docs/system-design.md or docs/coding-style.md, and forced the LLM to read them for every new task and context window via .github/copilot-instructions.md and later AGENTS.md. I kept a couple of short, high-density overview documents that I would always include when asking for changes in a particular area. I often pointed the model at the relevant files explicitly instead of hoping it would guess what mattered. And I made it easy to discover all relevant documentation from a single starting point: everything is linked back to README.md one way or another. That is, again, the same thing I'd do for a human colleague joining a large codebase; the difference is that the LLM simply cannot rely on tribal knowledge. If I didn't write it down, it didn't exist.
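To make that concrete, here is roughly the shape such an instructions file can take. This is a hypothetical sketch rather than a copy of my actual AGENTS.md, and the individual rules are just examples of the kind of thing worth writing down:

```markdown
# AGENTS.md (hypothetical sketch)

Before touching any code:

1. Read docs/system-design.md for the overall architecture.
2. Read docs/coding-style.md and follow it.
3. If anything is unclear, ask before guessing.

House rules:

- Prefer composition over inheritance.
- Do not add new dependencies without asking first.
- Run the standard build and check tasks before declaring a task done.
- Link any new documentation back to README.md.
```

None of this is clever. Its only job is to be short, current and impossible for the model to miss.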
In my earlier jobs, this wasn't some abstract problem. I spent years in companies with large, aging codebases where onboarding new engineers was predictably miserable. People were smart and well-intentioned, but every new hire had to reverse-engineer the system from scattered comments, half-remembered conversations and whoever still happened to be around. Productivity was low, "agility" was mostly a slide in a deck, and nobody could quite explain why work felt so heavy.
What we were missing back then is exactly what I now build for LLMs by default: small, coherent docs; a clear entry point into the system; tests that tell you what's safe to change; some written-down sense of "how we do things here". We paid for that absence as technical debt, but it didn't look like debt on a balance sheet, so it was hard to get managers or founders to care. From their perspective, onboarding was just "slow" or "people need to ramp up"; from my perspective, we were burning weeks of highly paid time solving the same puzzle over and over.
At some point I'll write a separate piece about that: how these slow, demoralizing onboarding paths happen, how to spot them early and what to do about them before you end up with a team full of people quietly stuck in legacy-mapping mode.
Humans at least have more agency. They are usually actively looking for information you didn't tell them about, because their context and goals aren't just the codebase and the current task. LLMs can't do that. They only see what you show them.
I think a fair number of people now understand, at least vaguely, that expecting the model to find the missing context on its own is not the right way to work with LLMs. Others just never seriously tried anything else and are stuck with that approach.
And then there is process. In a Hacker News thread about spec-driven development, someone described using an LLM as an "unreliable compiler". I like that analogy, not because it flatters the models, but because it reminds me that humans are unreliable compilers as well. Give two engineers the same spec and you get two different designs. Even the same engineer, given the same task three months apart, will not write the same code.
Some people look at the way I now work with models – write a spec, discuss it, refine it, only then let anything touch the code – and call it a return of the waterfall. I don't see it that way. The problem with classic waterfall projects was never that someone dared to write a spec. The problem was that the spec was treated as holy scripture and that feedback came in far too late, if at all.
What I am doing here feels much closer to the agile projects that actually worked: start with a rough outline, build a thin slice, learn from it, update the outline and repeat. The specs evolve with every iteration. They get more detailed where reality hurts and stay coarse where reality is still fuzzy. The loops are short, and tests and automation let you change your mind without blowing up the whole thing.
What makes any of this workable is not inherent reliability. It's the system around it: tests that describe behavior, code review that catches the weird stuff, version control that lets you roll back, automation that makes it difficult to quietly break things. Simon Willison points this out in more detail in his article on what he calls "vibe engineering": if you already have solid tests, documentation, automation and review culture, agents suddenly become very useful. If you don't, plugging them in just amplifies whatever mess you already have.
Where the Analogy Breaks Down
Up to this point this might read like "LLMs are just people, just be nice and supportive". That framing mostly holds, but there are also some quite important differences to be aware of.
The first big difference is memory. When I explain an architectural decision to a colleague, I expect them to remember at least the important parts. If we have the same conversation three times and the same mistake keeps happening, we have a different discussion. With LLMs, every fresh session starts as if we've never met. They don't remember that I prefer composition over inheritance in this codebase, or that I absolutely don't want another YAML parser dependency. If that knowledge matters, I have to put it somewhere the system can see every time.
Over time, this pushed me away from the idea of "teaching" the model and towards teaching the environment around it. Instead of hoping that some vague preference will stick, I encode it explicitly: in instruction files, in prompts that I reuse, in small guide documents that travel with the requests, in automated checks that reject changes violating certain rules. I'm not building a relationship with an entity here. I'm building a set of rails that a stochastic process has to run on.
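As a small illustration of what those rails can look like, here is a minimal sketch of such a check, written in Python. The manifest location and the deny-list are assumptions made up for the example; the point is only that a preference like "no extra YAML parser dependency" lives in a script the pipeline runs on every change, not in anyone's memory:

```python
#!/usr/bin/env python3
"""Hypothetical check: fail the build if a disallowed dependency shows up."""
import json
import sys
from pathlib import Path

# Assumption for this sketch: a Node-style project with a package.json at the repo root.
MANIFEST = Path("package.json")

# Hypothetical deny-list; encode whatever rules your project actually cares about.
DISALLOWED = {"js-yaml", "yaml"}


def main() -> int:
    if not MANIFEST.exists():
        print("no package.json found, nothing to check")
        return 0
    manifest = json.loads(MANIFEST.read_text())
    # Collect runtime and dev dependencies declared in the manifest.
    deps = {**manifest.get("dependencies", {}), **manifest.get("devDependencies", {})}
    offenders = sorted(DISALLOWED & deps.keys())
    if offenders:
        print(f"disallowed dependencies found: {', '.join(offenders)}")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Wired into CI or a pre-commit hook, a check like this rejects the offending change regardless of who, or what, wrote it.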
The second difference is understanding, or whatever passes for it in human beings. The engineers I respect most will occasionally look at a requirement and say something like: "This doesn't make sense" or "this contradicts what we decided last week". I expect that from the people I work with. Tell me if you think some user story, some spec or some technology choice is wrong. Use your smarts and your past experience and give me better ideas. I'm happy to discuss and learn and teach while doing that. Don't just believe me, but also don't expect I'll just believe you.
LLMs don't have that kind of friction. By default they cheer for your ideas and are happy to make any silly thing work exactly the way you ask for it.
They are surprisingly good at exploring a design space when you ask them to, but you do need to ask. I often have them list options, sketch rough approaches or criticize my preferred solution from different angles before we pick one thin slice to actually build.
Some humans can be quite sycophantic too, and some are not particularly good at logic either, but they at least have the option of calling nonsense nonsense. So far, models do that only when you push them very explicitly in that direction, and even then not reliably. That is why, for anything that matters, I still want a human brain (mine) in the loop whose job it is to look at the whole direction and not just at the local patch. One of my standard instructions is: criticize any design and technology decisions I make, lay out alternatives and advise me on better and more modern approaches.
The third difference is responsibility. People come with motives, egos, fears, ambitions. Managing them is messy, political and occasionally exhausting. Models have none of that. They don't care whether I like them, they don't take it personally when I discard their work, and they don't negotiate deadlines.
That is nice, but it also makes it dangerously easy to project responsibility onto them. "The AI did it" is a tempting sentence. It is also meaningless. The model didn't decide to deploy untested code; you allowed it to. It didn't choose your acceptance criteria. It didn't tell you to skip code review. If something goes into production that shouldn't, the fault lies entirely with whoever wired the system together.
The last obvious break is scale. If I want to double the throughput of a human team, I either make the same people more effective or I add more people. Both are slow and expensive. With LLMs and agent systems, I can spin up several parallel attempts at the same feature and then review the result. Simon Willison talks about running multiple coding agents in parallel and then doing the human work around that. I've not seriously tried that yet. I might do so soon.
Working With Probabilistic Coworkers
If you accept that all your coworkers – carbon-based and silicon-based – are probabilistic, the question becomes how to make that tolerable.
The reassuring answer is that you don't need an entirely new discipline for this. The same boring practices that make human teams bearable also make LLMs bearable.
Being clearer in how you talk about work is a good starting point. "Build a blog" is not a requirement, it's wishful thinking. Writing down what you actually care about, what you don't care about, and which corners you're fine cutting suddenly helps both the person sitting next to you and the model running in some datacenter. A slightly more precise spec makes everyone's life easier.
Shorter feedback loops help as well. A pattern that works for me is: ask for a plan, poke holes in the plan, then let the model implement one small piece, run the checks, see what happens, repeat. It's the same pattern I try to follow with humans when I remember to be disciplined: don't disappear for two weeks and then present a surprise, keep the steps small and visible.
Externalizing knowledge is another old idea that becomes unavoidable here. Humans can sort of muddle through with half-remembered context and "ask Bob" as a strategy. Models cannot. If the way something is supposed to work only lives in somebody's head or in a long-gone Slack thread, it's effectively invisible to the system. Writing down the shape of the system, the non-obvious constraints and the decisions that shaped it helps everyone, including your future self.
And finally, there is the whole field of automated skepticism. Tests, linters, CI, code review rules, branch protections – all the things that make it harder for a rushed human to accidentally ship garbage – are exactly the mechanisms that make using LLMs at scale remotely sane. None of this is new. We've been using these tools for decades to cope with the fact that humans are inconsistent, forgetful and occasionally overconfident. The only real change is that the noise source got faster and cheaper.
That makes the old answers more urgent, not less relevant. Humans are not good at sustained vigilance. Machines are very good at running the same check a thousand times in a row without getting bored. Wiring that vigilance into the pipeline instead of relying on your own willpower is probably the most leverage you can get, regardless of who or what is writing the code.
If you do all that – and really you should – the real people working with you will also have a much easier and more enjoyable time. The problems LLMs introduce into a software project are mostly older problems turned up to eleven. The upside is that we already know quite a bit about how to make those tolerable. We just have to be willing to apply what we know more rigorously. It's worth it.