RDEL #138: Where do all the tokens go in agentic software engineering?
Input tokens account for 54% of total usage in agentic coding systems, revealing a costly "communication tax" as AI agents repeatedly pass full contexts back and forth.
Welcome back to Research-Driven Engineering Leadership. Each week, we pose an interesting topic in engineering leadership and apply the latest research in the field to drive to an answer.
Multi-agent AI systems for software development can burn through tokens fast. Multiple AI agents chatting back and forth, reviewing each other’s code, iterating on designs — it adds up. But where exactly do all those tokens go, and what does that mean for the cost of running these systems at scale? This week we ask: What are the token consumption patterns of multi-agent AI systems across the SDLC, and what do they reveal about efficiency?
The context
LLM-based Multi-Agent (LLM-MA) systems are gaining traction as a way to automate more of the software development process end to end. Frameworks like ChatDev and MetaGPT simulate virtual software companies, with AI agents playing roles like product manager, programmer, code reviewer, and tester, and collaborating through multi-turn dialogues to go from a prompt to working software. The appeal is clear: divide the work across specialized agents and let them coordinate autonomously.
But as these systems scale from demos to real workloads, a practical question emerges: how much do they actually cost to run, and where is the money going? Token consumption translates directly to financial cost, energy use, and environmental impact. Yet until now, most research on multi-agent systems has focused on capabilities and failure modes — not operational efficiency. Understanding the “cost map” of these systems is a prerequisite for making them practical.
The research
Researchers at Concordia University analyzed token consumption across 30 software development tasks executed by ChatDev, a popular multi-agent framework, using GPT-5 Reasoning as the backbone model. They mapped ChatDev’s internal phases to standard development stages (Design, Coding, Code Completion, Code Review, Testing, and Documentation) and tracked input, output, and reasoning tokens across all 30 runs.
Key findings:
Code Review dominates token consumption, accounting for 59.4% of all tokens on average. The iterative back-and-forth between programmer and reviewer agents makes it by far the most expensive phase, while the initial phases (Design at 2.4% and Coding at 8.6%) are remarkably cheap.
Input tokens make up 53.9% of total usage, revealing a significant “communication tax.” The roughly 2:1 ratio of input to output tokens means agents spend more than half of their budget re-consuming context rather than generating new output, a structural inefficiency in current collaboration protocols.
Different development stages have distinct cost profiles. Coding is the one stage that’s output-heavy (58% output vs. 6.9% input), which makes sense — it’s generating verbose source code from concise specs. But verification phases like Code Review and Documentation are input-heavy (51.4% and 80.2% input, respectively), consuming large amounts of existing code as context to produce small, analytical outputs.
The primary cost of agentic software engineering isn’t writing code — it’s refining it. Code Review and Code Completion together account for over 86% of token consumption in the tasks where both ran. This suggests that the real expense lies in the iterative verification and refinement loop, not in the initial generation.
Token consumption varies widely by task complexity. Reasoning tokens ranged from 17,280 to 40,000 across the 30 tasks, indicating that more complex projects carry significantly higher — and potentially less predictable — costs.
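For readers who want to build a similar "cost map" for their own agent runs, here is a minimal Python sketch of the arithmetic behind phase shares and the input share. It assumes you can log one record per model call with a phase label and input, output, and reasoning token counts. The record layout, the function names, and every number in the sample data are hypothetical illustrations, not the study's data or method.

```python
from collections import defaultdict

# Hypothetical usage records: (phase, input_tokens, output_tokens, reasoning_tokens).
# These numbers are made up for illustration and are NOT taken from the study.
records = [
    ("Design", 1_000, 600, 400),
    ("Coding", 700, 5_800, 3_500),
    ("Code Review", 30_000, 12_000, 18_000),
    ("Code Review", 28_000, 10_000, 16_000),
    ("Documentation", 8_000, 1_200, 800),
]


def phase_shares(records):
    """Return each phase's share of all tokens, as a fraction of the grand total."""
    totals = defaultdict(int)
    for phase, inp, out, reasoning in records:
        totals[phase] += inp + out + reasoning
    grand_total = sum(totals.values())
    return {phase: count / grand_total for phase, count in totals.items()}


def input_share(records):
    """Return the fraction of all tokens that were input (context) tokens."""
    total_input = sum(inp for _, inp, _, _ in records)
    total_all = sum(inp + out + reasoning for _, inp, out, reasoning in records)
    return total_input / total_all


if __name__ == "__main__":
    # Print phases from most to least expensive, then the overall input share.
    for phase, share in sorted(phase_shares(records).items(), key=lambda kv: -kv[1]):
        print(f"{phase:<15} {share:.1%}")
    print(f"Input share of all tokens: {input_share(records):.1%}")
```

Summing repeated calls per phase matters here: an iterative phase like Code Review is expensive because of many rounds, not because any single call is large.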
The application
This research reframes how engineering leaders should think about the cost of agentic development tools. The expense isn’t in getting an AI to write code — that’s the cheap part. The real cost is in the back-and-forth of review, verification, and refinement, where agents repeatedly pass full contexts to each other.
Here’s how to apply these findings:
Budget for review, not generation. When estimating costs for multi-agent coding tools, don’t anchor on the code-writing step. Plan for the review and verification cycle to consume the majority of your token budget, and build that into your cost models.
Look for architectures that reduce context passing. When evaluating agentic tools, ask how they handle the review loop. Systems that use smarter context management (like passing diffs instead of full files, or summarizing prior feedback) will be significantly cheaper to run than those using naive full-context passing.
Consider human-in-the-loop checkpoints before expensive phases. A quick human review before the AI enters its iterative code review cycle could prevent costly loops over minor issues. Think of it as a cost-saving gate: catching obvious problems early, before agents spend thousands of tokens going back and forth.
Track how token spend translates to productivity gains. Build visibility into which development activities consume tokens, and check whether that spend is actually improving engineering outcomes. Without this, cost optimizations risk quietly degrading quality.
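To make the "diffs instead of full files" idea concrete, here is a small Python sketch that compares the size of a full-file review payload with a unified diff of the same change. It uses the standard library's difflib. The whitespace-based token count is only a crude stand-in for a real tokenizer, and the helper names and sample file are hypothetical. Nothing here reflects how ChatDev or any specific tool passes context.

```python
import difflib


def rough_token_count(text: str) -> int:
    """Crude proxy: whitespace-separated words. Real tokenizers count differently."""
    return len(text.split())


def review_payload_as_diff(old_source: str, new_source: str) -> str:
    """Build a unified diff to send to a reviewer instead of the full new file."""
    diff_lines = difflib.unified_diff(
        old_source.splitlines(keepends=True),
        new_source.splitlines(keepends=True),
        fromfile="before",
        tofile="after",
        n=1,  # keep one line of surrounding context per change
    )
    return "".join(diff_lines)


# Hypothetical file: 200 lines, exactly one of which changes between review rounds.
old_source = "".join(f"value_{i} = {i}\n" for i in range(200))
new_source = old_source.replace("value_100 = 100\n", "value_100 = 101\n")

full_cost = rough_token_count(new_source)
diff_cost = rough_token_count(review_payload_as_diff(old_source, new_source))

print(f"full file: {full_cost} rough tokens")
print(f"diff only: {diff_cost} rough tokens")
```

The trade-off is that a diff gives the reviewer less surrounding context, so a real system would likely need to tune how much context to include (the n parameter here) rather than minimize it blindly.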
—
Happy Research Tuesday!
Lizzie


