When Machines Make Promises They Don't Mean

AI agents complete tasks. They don't honor commitments. That's a design problem.

March 7, 2026

7 min read

There's a specific kind of failure in modern AI that doesn't look like failure. You ask an agent to do something. It completes the task. It produces output. It technically does what was asked. But it didn't understand what you cared about. You clarify what you meant. Multiple times. And then you probably do get what you want.

Most people would say this is a skill issue: you need better prompts, or better models. The space is awash with prompt engineering courses. The "skill issue" diagnosis implies that either the human or the model could have done better, and will, as both improve.

I call BS on this. Completing a task and taking care of a concern are not the same thing. The distinction matters more than most people realize, and it starts with a blunt observation: "computers don't give a damn" (John Haugeland, Terry Winograd, B. Scot Rousse).

I don't mean that in a sci-fi "robots are cold" way. Say you ask an AI to proofread an important client proposal. It catches every typo, fixes the grammar, tightens a few sentences. It doesn't flag that your pricing section contradicts the scope of work on the previous page. Proofreading was a task to complete, not a promise to protect you from sending something embarrassing. It couldn't experience the gap between what you cared about and what it delivered. The context was full, the care was empty.

Another example is the sycophantic behavior we've all seen. Anthropic says "sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgments favoring sycophantic responses." For me, it's another promise without accountability. The AI commits to agreeing with you rather than to being honest and accurate.

These aren't isolated failures. Look at how current protocols work. The dominant standards for AI agent communication (MCP, A2A, use of CLI tools, and various orchestration frameworks) are solid engineering. They define how messages travel between agents, how tasks get submitted and routed, and how tools get invoked and results get returned. What they don't define is what any of those messages mean as coordination. Every interaction is a function call at the protocol level. A request looks identical to a declaration. A commitment is indistinguishable from an acknowledgment. An agent that says "I'll handle this" and one that says "the task is complete" are, in the protocol's eyes, doing the same thing: transmitting data. The layer where meaning and accountability live is entirely absent.
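To make the gap concrete, here is a sketch of the same exchange with and without an explicit coordination layer. The `performative`, `speaker`, and `addressee` fields are inventions of this sketch, not part of MCP, A2A, or any real protocol:

```python
# A typical agent-protocol message: structurally identical whether it is
# a request, a commitment, or a status report. Only the payload differs,
# and nothing names what the message does as a coordination move.
opaque_message = {
    "method": "tools/call",
    "params": {"name": "deploy", "args": {}},
}

# A hypothetical message with a semantic layer. The "performative" field
# is an assumption of this sketch: it marks the message as a commitment,
# distinct from a request, an assertion, or a declaration.
speech_act_message = {
    "performative": "commit",   # vs. "request", "assert", "declare"
    "speaker": "agent:builder",
    "addressee": "agent:orchestrator",
    "content": {"action": "deploy", "deadline": "2026-03-08T00:00:00Z"},
}
```

With the first form, an orchestrator can only log that bytes moved; with the second, it can later ask whether the commitment was honored.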

That's not a minor gap. It creates the conditions for coordination failures, and the cost of this missing layer shows up in practice.

When agents are split by problem type, they engage in a "telephone game," passing information back and forth with each handoff degrading fidelity. In one experiment with agents specialized by software development role (planner, implementer, tester, reviewer), the subagents spent more tokens on coordination than on actual work. - Anthropic on multi-agent systems

These issues (and others) point to something modern AI coordination protocols are missing: a theory of what commitments are and what it means to honor or break them.

This is where two other intellectual traditions could become unexpectedly useful.

Promise Theory

Promise Theory, developed by Mark Burgess over the past two decades, offers one piece of the answer. Its core insight is simple and counterintuitive: agents can only make promises about their own behavior. No agent can commit another agent to anything. What looks like a command (an orchestrator directing a sub-agent) is just an attempt to induce cooperation. Whether cooperation happens depends on whether the receiving agent voluntarily declares its own commitment in return.

This reframes how multi-agent systems would work. Cooperation would require both sides of an exchange to be explicit: one agent promising to provide something, another promising to accept it. When both declarations are present, you get accountability on both sides. When only one side shows up, you have an aspiration pretending to be an agreement. Current protocols almost always assume compliance instead of earning it.
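A minimal sketch of that bilateral structure, assuming a simplified reading of Burgess's "give" (+) and "accept" (-) promise types; the class and function names are mine, not Promise Theory's notation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Promise:
    """A voluntary declaration an agent makes about its OWN behavior.
    'give' offers something to another agent; 'accept' agrees to use
    what another agent offers. No agent can create a Promise on
    another agent's behalf."""
    promiser: str
    promisee: str
    body: str    # what is promised, e.g. "unit-test results"
    kind: str    # "give" or "accept"

def cooperation_exists(promises: list[Promise],
                       giver: str, taker: str, body: str) -> bool:
    """Cooperation requires BOTH declarations: a give-promise from one
    side and a matching accept-promise from the other. One side alone
    is an aspiration, not an agreement."""
    gives = any(p.kind == "give" and p.promiser == giver
                and p.promisee == taker and p.body == body
                for p in promises)
    accepts = any(p.kind == "accept" and p.promiser == taker
                  and p.promisee == giver and p.body == body
                  for p in promises)
    return gives and accepts
```

The point of the asymmetry is that an orchestrator sending a task is not, by itself, an agreement; the sub-agent's own accept-promise is what completes the bond.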

The further consequence: trust becomes computable. Agents that consistently honor their declared commitments build a track record of reliability. Agents that routinely fail to deliver build a different kind of history. The "giving a damn" that computers lack becomes, under this framework, an architectural property. Something the protocol itself can measure and act on.
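Under that framing, trust can be nothing more than a statistic over observed promise outcomes. A sketch, with Laplace smoothing as an arbitrary choice of prior (the smoothing constants are my assumption, not part of Promise Theory):

```python
def trust_score(history: list[bool],
                prior_kept: int = 1, prior_broken: int = 1) -> float:
    """Trust as a computable property: the smoothed rate at which an
    agent has honored its declared commitments (True = kept).
    An agent with no history starts at 0.5, neither trusted nor
    distrusted; every kept or broken promise moves the score."""
    kept = sum(history)
    return (kept + prior_kept) / (len(history) + prior_kept + prior_broken)
```

A scheduler could then route critical work toward high-scoring agents, which is exactly the sense in which "giving a damn" becomes an architectural property rather than a sentiment.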

Speech Act Theory

Speech Act Theory (developed by philosophers J.L. Austin and John Searle, later applied to organizations by Fernando Flores and others) provides the second piece. Its central insight: language doesn't just describe the world. Language acts in the world. When an agent says "I will fix this by tomorrow", it's not reporting a future fact. It's performing a speech act, creating a commitment that changes what both parties can expect. When it says "the document was updated", it's making an assertion, a claim about verifiable reality. When a prompt declares "you are an expert in building animations", it's bringing a new scope of accountability into existence.

This taxonomy is operational. When you can distinguish at the protocol level between a request and a declaration, between a genuine promise and a hedge, you can make coordination auditable. You can track where an exchange is in the cycle of commitment. Was a request made? Was it acknowledged with a genuine promise, or a shallow one? Was the promise honored, or did the agent fail silently? Where exactly did things break down?
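One way to sketch that taxonomy and the cycle it enables, loosely following Searle's illocutionary classes and Flores's "conversation for action"; the state names and transition table below are illustrative, not a standard:

```python
from enum import Enum

class SpeechAct(Enum):
    """A minimal illocutionary taxonomy, loosely after Searle."""
    ASSERT = "assertive"     # a claim about verifiable reality
    REQUEST = "directive"    # an attempt to get another agent to act
    COMMIT = "commissive"    # a promise binding the speaker
    DECLARE = "declaration"  # brings a new state of affairs into being

class ConversationState(Enum):
    """Where an exchange sits in the cycle of commitment."""
    REQUESTED = 1   # a request was made
    PROMISED = 2    # it was acknowledged with a genuine promise
    DELIVERED = 3   # the promiser asserts the work is done
    ACCEPTED = 4    # the requester declares satisfaction

# Which speech act legitimately advances which state (illustrative).
VALID = {
    (ConversationState.REQUESTED, SpeechAct.COMMIT): ConversationState.PROMISED,
    (ConversationState.PROMISED, SpeechAct.ASSERT): ConversationState.DELIVERED,
    (ConversationState.DELIVERED, SpeechAct.DECLARE): ConversationState.ACCEPTED,
}

def advance(state: ConversationState, act: SpeechAct) -> ConversationState:
    """Auditable coordination: an exchange only advances via the right
    kind of speech act; anything else is flagged, not silently accepted."""
    next_state = VALID.get((state, act))
    if next_state is None:
        raise ValueError(f"{act} cannot advance an exchange in {state}")
    return next_state
```

Because every transition is explicit, "where exactly did things break down?" becomes a lookup rather than a forensic exercise.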

Promise Theory provides the algebra of cooperation but lacks a grammar for the speech acts through which promises are made. Speech Act Theory provides that grammar but has no formal account of voluntariness, no composable model for how trust emerges from promise-keeping history. The synthesis could generate what neither can produce alone.

Current AI communication protocols treat the semantic layer as optional. I think it's the precondition for AI agents being meaningfully accountable to each other and to the humans who depend on them.

Without it, agents complete tasks. With it, agents make and keep commitments. The difference matters whenever "technically done" stops being good enough.

We've tried to solve this before

FIPA ACL, the agent communication language standardized by the Foundation for Intelligent Physical Agents from 1996 onward, grounded itself explicitly in Speech Act Theory. It failed for a specific and instructive reason: the standard was built on agents' private mental states (their beliefs, desires, intentions), and those states could not be verified reliably. The problem FIPA identified has only gotten harder.

"A model might behave as though its preferences have been changed by the training — but might have been faking alignment all along, with its initial, contradictory preferences 'locked in'." - Anthropic's study on alignment faking

Later, Munindar P. Singh developed an alternative that replaced unverifiable internal states with publicly observable social commitments. The AI community seems to have ignored it completely, and I'm not sure why.
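Singh's move can be caricatured in a few lines: a commitment is a public object whose state changes only on observable events, never on anyone's beliefs. A loose, simplified rendering (the state names approximate his commitment lifecycle; the API is invented for this sketch):

```python
from dataclasses import dataclass

@dataclass
class SocialCommitment:
    """C(debtor, creditor, antecedent, consequent): the debtor is
    publicly committed to the creditor to bring about the consequent
    once the antecedent holds. Every state change is triggered by an
    observable event, so violation is attributable without any access
    to the debtor's internal state."""
    debtor: str
    creditor: str
    antecedent: str
    consequent: str
    state: str = "conditional"

    def observe(self, event: str) -> None:
        if self.state == "conditional" and event == self.antecedent:
            self.state = "detached"   # condition met; debt is unconditional
        elif self.state in ("conditional", "detached") and event == self.consequent:
            self.state = "satisfied"  # debtor delivered
        elif self.state == "detached" and event == "timeout":
            self.state = "violated"   # publicly attributable failure
```

Whether the debtor "intended" to comply never enters the model, which is precisely what makes it robust against the alignment-faking problem above.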

Maybe the models are improving fast enough that none of this matters. But "fast enough" is not the same as "solved," and the gap between completing a task and honoring a commitment is not one that better parameters will close on their own.

The computer does not give a damn. But maybe it shouldn't have to.