From Chatbot to Coworker: Designing Agents That Do Real Work

By Niall · 7 min read

The leap from a bot that talks to an agent that does is a design problem long before it is a modelling one.

Most people's mental model of business AI is still a chatbot: you type a question, it types an answer. That is useful, but it is a long way from the systems now doing real work. The interesting shift is from software that talks to software that does: from a helpful assistant that explains how to process a refund to one that actually processes it.

That shift is not about a bigger model. It is about design. An agent that completes tasks reliably is built differently from a chatbot that answers questions, and confusing the two is how teams end up disappointed. Here is how we think about designing agents that earn the word coworker.

Talks versus does

A chatbot's job ends at the answer. Ask it how to reconcile an invoice and it will describe the steps clearly and stop. An agent's job ends when the work is finished: it reads the invoice, matches it against the order, flags the discrepancy, and updates the record. One produces words; the other produces outcomes.

This is not a criticism of chatbots. For many problems, a good answer is exactly what is needed. But if you are measuring success in tasks completed rather than questions answered, you are building an agent, and that comes with a different set of design decisions.

Start with a sharp scope

The single biggest predictor of whether an agent works is how tightly its job is defined. A vague brief, 'handle customer operations', gives the agent too much room to wander and too many ways to fail. A sharp brief, 'triage incoming refund requests under fifty dollars and prepare them for approval', is something you can build, test and trust.

Narrow scope is not a limitation; it is the feature. A well-scoped agent that does one job dependably is worth far more than an ambitious one that does ten jobs unpredictably. You can always add scope later, once the first job is solid.
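
One way to make a sharp brief concrete is to write it down as a check that runs before the agent sees any work. The sketch below is a minimal illustration of the refund brief above, not a finished design: the Request type, the in_scope and route functions, and the REFUND_LIMIT constant are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from decimal import Decimal

# Assumed limit, taken from the example brief: refunds under fifty dollars.
REFUND_LIMIT = Decimal("50.00")


@dataclass(frozen=True)
class Request:
    """A hypothetical incoming request."""
    kind: str        # e.g. "refund", "complaint", "address_change"
    amount: Decimal  # requested amount in dollars


def in_scope(request: Request) -> bool:
    """Return True only for the one job this agent is allowed to do."""
    return request.kind == "refund" and Decimal("0") < request.amount < REFUND_LIMIT


def route(request: Request) -> str:
    """Send in-scope work to the agent and everything else to a person."""
    return "agent" if in_scope(request) else "human"
```

Widening the remit later then becomes a visible, reviewable change to one function rather than a rewording of a prompt.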

Give it the right tools, and only those

An agent acts through tools: the functions, APIs and queries it is allowed to call. Choosing them is a design decision, not an afterthought. Give it the tools the job genuinely needs, described clearly, and leave out everything else. Every extra tool is another way to go wrong and another thing to secure.

It also makes the agent easier to reason about. When the tool list is short and purposeful, you can look at it and know exactly what the agent is capable of, and what it is not. A sprawling toolbox, by contrast, hides the agent's true reach and makes every security review harder.
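
A short, explicit tool list can be enforced in code rather than left to convention. The following is a rough sketch under stated assumptions: ToolRegistry and the two example tools (look_up_order, flag_discrepancy) are hypothetical and stand in for whatever functions a real agent would need. It does not represent any particular agent framework's API.

```python
from typing import Callable, Dict


class ToolRegistry:
    """Holds the only functions the agent may call (hypothetical helper)."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., object]] = {}
        self._descriptions: Dict[str, str] = {}

    def register(self, name: str, description: str, fn: Callable[..., object]) -> None:
        """Add one tool, with a plain-language description of what it does."""
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        self._tools[name] = fn
        self._descriptions[name] = description

    def describe(self) -> Dict[str, str]:
        """Return the full tool list, so a reviewer can see the agent's reach."""
        return dict(self._descriptions)

    def call(self, name: str, **kwargs: object) -> object:
        """Run a tool by name; anything not registered is refused."""
        if name not in self._tools:
            raise PermissionError(f"tool not allowed: {name}")
        return self._tools[name](**kwargs)


# Two illustrative tools with stubbed bodies; real ones would hit real systems.
def look_up_order(order_id: str) -> dict:
    return {"order_id": order_id, "total": 42.0}


def flag_discrepancy(order_id: str, note: str) -> str:
    return f"flagged {order_id}: {note}"


registry = ToolRegistry()
registry.register("look_up_order", "Fetch one order by its id.", look_up_order)
registry.register("flag_discrepancy", "Mark an order for human review.", flag_discrepancy)
```

Calling registry.describe() gives the complete list of what the agent can do, and a request for any other tool name raises PermissionError instead of silently succeeding.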

A coworker you would trust has a clear job, the right tools, and someone checking the important calls. An agent is no different. Scope, tools and oversight matter more than the cleverness of the model.

Memory only where it helps

Memory is fashionable, and often unnecessary. Many useful agents are stateless: each task starts fresh, which makes them simple, predictable and easy to reason about. Add memory only when the work genuinely depends on it, for example to remember a customer's earlier messages within a single conversation, or to carry context across the steps of one task.

Persistent, long-term memory across sessions is powerful but raises real questions about privacy, staleness and surprise. If you cannot say exactly why an agent needs to remember something, it probably should not. Start stateless and add memory deliberately, not by default.

There is also a cost angle. Carrying a long memory into every step means sending more context to the model each time, which is slower and more expensive. Statelessness is not just simpler; it is often cheaper, which is another reason to treat memory as something you add for a clear reason.
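
The cost argument can be checked with simple arithmetic. The sketch below uses made-up numbers purely for illustration (a ten-step task, 500 tokens of fixed instructions per step, 300 tokens of history added after each step); total_tokens is a hypothetical helper, and real token counts and pricing will differ.

```python
def total_tokens(steps: int, base_tokens: int, added_per_step: int, carry_memory: bool) -> int:
    """Total input tokens sent to the model across all steps of one task."""
    total = 0
    carried = 0  # tokens of accumulated history re-sent on the next step
    for _ in range(steps):
        total += base_tokens + carried
        if carry_memory:
            carried += added_per_step
    return total


# Illustrative figures only; none of these come from a real workload.
stateless = total_tokens(steps=10, base_tokens=500, added_per_step=300, carry_memory=False)
with_memory = total_tokens(steps=10, base_tokens=500, added_per_step=300, carry_memory=True)

print(stateless)    # 5000
print(with_memory)  # 18500
```

Under these assumptions the stateless run grows linearly with the number of steps, while the run that carries its history grows roughly with the square of it, because every step re-sends everything accumulated so far.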

Keep a human in the loop

The fastest way to lose trust in an agent is to let it take an irreversible action and get it wrong. The fix is oversight matched to stakes. Low-risk, reversible actions can run automatically. Anything that spends money, contacts a customer, or cannot be undone should pause for human approval, at least until the agent has earned trust.

Crucially, oversight is not a permanent tax; it is how the agent earns autonomy. Each approved action is evidence, and once a category of work has proved reliable, you can let it run unattended while keeping a closer eye on the rest.

Let the agent do the work, then present a clear summary for a person to approve.
Set confidence thresholds: handle the routine cases, escalate the unusual ones.
Make every action reversible where you can, so a mistake is a quick fix, not a crisis.
Phase autonomy in as you measure real performance, rather than trusting it on day one.
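
The approval rules above can be sketched as a small gate that sits between what the agent proposes and what actually runs. This is an illustrative sketch only: ProposedAction, Decision, decide and the 0.9 threshold are assumptions made for the example, and the confidence score is taken to be whatever signal your system already produces.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_RUN = "auto_run"
    NEEDS_APPROVAL = "needs_approval"


@dataclass(frozen=True)
class ProposedAction:
    """What the agent wants to do, summarised for a human reviewer."""
    summary: str
    reversible: bool
    spends_money: bool
    contacts_customer: bool
    confidence: float  # assumed score in [0, 1]; how it is produced is out of scope


# Assumed starting threshold; tune it against measured performance.
CONFIDENCE_THRESHOLD = 0.9


def decide(action: ProposedAction, threshold: float = CONFIDENCE_THRESHOLD) -> Decision:
    """Pause anything high-stakes or uncertain; let routine, reversible work run."""
    high_stakes = action.spends_money or action.contacts_customer or not action.reversible
    if high_stakes or action.confidence < threshold:
        return Decision.NEEDS_APPROVAL
    return Decision.AUTO_RUN
```

Phasing autonomy in then means relaxing the rules in decide for one category of work at a time, once the approval history for that category supports it.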

Measuring a coworker

You judge a coworker on outcomes, and an agent is no different. Track tasks completed correctly, how often a human had to step in, and where the agent got stuck. Those numbers tell you whether to widen its remit or tighten it, and they turn vague impressions into a clear picture of what the agent is actually worth.
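
As a rough sketch, those three numbers can be computed from a simple log of finished tasks. The TaskRecord fields, the summarize function and the sample log below are hypothetical; a real system would fill the records from its own audit trail.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TaskRecord:
    """One finished task, as a reviewer would score it (hypothetical shape)."""
    correct: bool
    human_intervened: bool
    stuck_at: Optional[str] = None  # name of the step where the agent stalled, if any


def summarize(records: Iterable[TaskRecord]) -> dict:
    """Turn a task log into the outcome metrics described above."""
    records = list(records)
    if not records:
        raise ValueError("no tasks to summarize")
    n = len(records)
    return {
        "tasks": n,
        "correct_rate": sum(r.correct for r in records) / n,
        "intervention_rate": sum(r.human_intervened for r in records) / n,
        "stuck_steps": Counter(r.stuck_at for r in records if r.stuck_at),
    }


# Made-up sample log, for illustration only.
log = [
    TaskRecord(correct=True, human_intervened=False),
    TaskRecord(correct=True, human_intervened=False),
    TaskRecord(correct=True, human_intervened=True, stuck_at="match_order"),
    TaskRecord(correct=False, human_intervened=False),
]
report = summarize(log)
```

For this sample log, report shows a correct rate of 0.75, an intervention rate of 0.25, and one stall at the match_order step, which is the kind of detail that tells you where to tighten the agent's remit.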

Resist the urge to judge it on how clever it sounds. A coworker that produces a tidy summary but gets the numbers wrong is a liability; one that is a little dull but reliably right is worth keeping. Outcomes, not eloquence, are the measure.

The jump from chatbot to coworker is real, but it is a design problem before it is a modelling one. Clear scope, the right tools, memory only where it helps, and human oversight where it counts: these are what separate an agent that does real work from a demo that merely talks about it. That design work is exactly what we do when we build AI agents.
