MCP Servers and Tool Calling for Real Developer Products
AI tools become useful in developer products when they have clear contracts, permission boundaries, audit trails, safe actions, and production-grade observability.
Tool access changes the risk model
A normal AI assistant can be wrong. An agent with tool access can be wrong and take action. That single distinction changes the entire product risk model. A wrong text response can be corrected with a follow-up message. A wrong tool call may have already created a ticket, opened a pull request, triggered a deployment, or sent a notification that cannot be unsent.
This is not an argument against giving agents tools. Agents with tools are dramatically more useful than agents without them — they can do work instead of just describing how work should be done. But the product design has to account for the new risk surface: what can the agent do, to what scope, with what confirmation requirement, and with what recovery path.
For developer products, the most valuable tools are often practical rather than expansive: search documentation without leaving the editor, inspect a repository file by path, open a support ticket with pre-filled context, call a staging API and show the response, summarize recent log entries for a service, or prepare a pull request description from a branch diff. These tools save time because they meet developers where the work already happens, without requiring them to switch contexts or reconstruct information manually.
The design question is not how many tools the agent can access. It is which tools are safe in which context, for which user, with what confirmation step, and with what audit record afterward. Answering those questions before building is what separates a useful developer tool from an impressive demo.
The tool contract is the most important design decision
Every tool in a tool-calling system has an implicit contract: a description that tells the model when to use it, a typed input schema that defines what the tool expects, and an output format that defines what the model will receive back. When any part of that contract is vague or inconsistent, the model's tool-use behavior becomes unpredictable.
Tool descriptions should be written for the model, not for a human reading the documentation. The description should specify the exact use case ('use this tool to search the documentation for a specific topic or procedure'), the cases where the tool should not be used ('do not use this tool to look up user account data'), and the expected output format. Ambiguous tool descriptions lead to the model calling the wrong tool for a task or calling the right tool with incorrect parameters.
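As a rough illustration of a description written for the model, the sketch below assembles a tool definition from explicit "use when", "do not use when", and "returns" clauses. The fields name, description, and inputSchema follow the general shape of an MCP tool listing. The helper build_tool_definition, the result fields, and the schema details are assumptions made for this example, not part of any SDK.

```python
def build_tool_definition(name, use_when, do_not_use_when, returns, input_schema):
    """Build a tool definition whose description states scope, exclusions, and output."""
    description = (
        f"Use this tool to {use_when}. "
        f"Do not use this tool to {do_not_use_when}. "
        f"Returns {returns}."
    )
    return {"name": name, "description": description, "inputSchema": input_schema}


# Hypothetical definition for the searchDocs tool discussed in the text.
search_docs = build_tool_definition(
    name="searchDocs",
    use_when="search the documentation for a specific topic or procedure",
    do_not_use_when="look up user account data",
    returns="a JSON object with a 'results' list of title, url, and snippet entries",
    input_schema={
        "type": "object",
        "properties": {"query": {"type": "string", "minLength": 1}},
        "required": ["query"],
        "additionalProperties": False,
    },
)
```

Forcing every description through the same three clauses makes a missing exclusion or an unstated output format easy to spot in review.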
Input schemas should use strict types rather than generic string parameters wherever possible. A tool that accepts a repository name as a free-form string allows the model to pass anything, including values copied from user input that have not been validated. A tool that accepts only a repository name validated against the list of the authenticated user's repositories enforces the permission boundary at the schema level.
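One way to picture strict input handling is a parser that rejects anything outside the user's own repositories before the tool body runs. This is a minimal sketch: ReadFileInput, parse_read_file_input, and the path rules are hypothetical, and the allowed set is assumed to come from the authenticated session.

```python
from dataclasses import dataclass


class InvalidToolInput(ValueError):
    """Raised when model-supplied arguments fail validation."""


@dataclass(frozen=True)
class ReadFileInput:
    repository: str
    path: str


def parse_read_file_input(raw: dict, allowed_repositories: set) -> ReadFileInput:
    """Validate raw arguments against the repositories the user may access."""
    repository = raw.get("repository")
    path = raw.get("path")
    if not isinstance(repository, str) or repository not in allowed_repositories:
        raise InvalidToolInput("repository is not one the current user can access")
    # Assumed rule for this sketch: only relative paths without parent traversal.
    if (
        not isinstance(path, str)
        or not path
        or path.startswith("/")
        or ".." in path.split("/")
    ):
        raise InvalidToolInput("path must be a relative path inside the repository")
    return ReadFileInput(repository=repository, path=path)
```

The tool body then receives a ReadFileInput rather than a raw dictionary, so unvalidated values cannot reach it by accident.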
Output formats should be predictable and typed. A tool that returns different JSON shapes depending on success or failure state, or that returns raw HTML in some cases and structured JSON in others, makes it harder for the model to reason about the result. Consistent output shapes reduce model confusion and make tool responses easier to render in the UI.
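A consistent output shape can be sketched as a single result envelope used for both success and failure. The ToolResult type and its field names are assumptions chosen for illustration. The point is only that every response serializes to the same set of keys.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ToolResult:
    """One envelope for every tool response, success or failure."""
    ok: bool
    data: Optional[dict] = None
    error_code: Optional[str] = None
    error_message: Optional[str] = None

    def to_json(self) -> str:
        # Unused fields serialize as null, so the set of keys never changes.
        return json.dumps(asdict(self), sort_keys=True)


def success(data: dict) -> ToolResult:
    return ToolResult(ok=True, data=data)


def failure(code: str, message: str) -> ToolResult:
    return ToolResult(ok=False, error_code=code, error_message=message)
```

Because both paths produce the same keys, the model and the UI renderer can branch on ok alone instead of guessing the shape.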
Tool contracts should be boring
Small tools named after user intent are safer and more reliable than large tools with broad capabilities. searchDocs is safer than a generic HTTP request tool because its scope is clear, its permission requirements are defined, and its failure modes are predictable. createDraftPullRequest is safer than unrestricted git push because it creates a human-reviewable artifact instead of applying a change directly.
The narrower the tool, the easier it is to reason about risk, write tests for, enforce permissions on, and explain in an audit log. A developer reading 'agent called searchDocs with query "pagination API reference"' understands immediately what happened and why. A developer reading 'agent made HTTP request to /api/docs/pagination' has to reconstruct the context.
Good tool design also gives the agent enough feedback to recover from mistakes. A tool that returns a typed error when the user does not have permission to view a repository is more useful than a tool that returns an empty result or throws an exception. The agent can explain the error to the user and suggest an alternative. A generic failure is a dead end.
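The difference between a typed error and an empty result can be shown with a small sketch. The ErrorCode values, the ToolError type, and the in-memory repository data are all hypothetical. A real tool would query the product's own storage and permission system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class ErrorCode(str, Enum):
    PERMISSION_DENIED = "permission_denied"
    REPOSITORY_NOT_FOUND = "repository_not_found"


@dataclass(frozen=True)
class ToolError:
    code: ErrorCode
    message: str
    suggestion: str  # something the agent can relay to the user


def list_repository_files(
    repository: str, files_by_repo: dict, readable: set
) -> Union[list, ToolError]:
    """Return the file list, or a typed error the agent can explain."""
    if repository not in readable:
        return ToolError(
            ErrorCode.PERMISSION_DENIED,
            f"You do not have read access to {repository}.",
            "Ask a repository admin for access or choose another repository.",
        )
    files = files_by_repo.get(repository)
    if files is None:
        return ToolError(
            ErrorCode.REPOSITORY_NOT_FOUND,
            f"No repository named {repository} was found.",
            "Check the repository name for typos.",
        )
    return list(files)
```

An empty list now means only "the repository has no files", so the agent never has to guess whether nothing was found or nothing was allowed.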
Composability matters too. Tools that each do one thing well can be combined by the model to accomplish complex tasks. Tools that try to do too much in one call, usually because bundling several simpler steps seemed efficient, become hard to test, hard to permission-check, and hard to explain when they produce unexpected results.
Permission boundaries at the tool layer
Every tool call should be authorized against the requesting user's permissions, not just the agent's capabilities. An agent that has access to a createIssue tool should only be able to create issues in repositories the authenticated user has access to. An agent that can read log files should only be able to read logs for services in the user's organization. These boundaries should be enforced in the tool implementation, not merely described in the system prompt or the tool's description.
Prompt-based permission enforcement — 'only use this tool for repositories in the user's organization' — is not a security boundary. It is a behavioral hint that a model will generally follow but can be overridden by adversarial input, prompt injection, or unexpected model behavior. Real permission enforcement happens in code, with the same rigor as any other authorization check in the application.
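Enforcement in code can be as plain as an authorization check at the top of the tool implementation. The sketch below assumes a hypothetical User record carrying the repositories the user may write to, and an in-memory list standing in for the issue tracker.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when the authenticated user may not perform the action."""


@dataclass(frozen=True)
class User:
    id: str
    writable_repositories: frozenset


def create_issue(user: User, repository: str, title: str, issue_store: list) -> dict:
    """Create an issue only if the requesting user can write to the repository."""
    # The check runs on every call, regardless of what the prompt told the model.
    if repository not in user.writable_repositories:
        raise Forbidden(f"user {user.id} cannot create issues in {repository}")
    issue = {"repository": repository, "title": title, "created_by": user.id}
    issue_store.append(issue)
    return issue
```

Even if injected text persuades the model to target another organization's repository, the call fails at this check and nothing is written.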
Tool calls that affect other users or external systems — sending a notification, creating a public record, triggering an action in a third-party service — should require explicit confirmation from the user before execution. The confirmation step is not about distrust of the model. It is about giving the human a chance to verify intent before the action is irreversible.
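A confirmation step can be sketched as a gate in the dispatcher: side-effecting tools return a pending action until the user approves it. The tool names in REQUIRES_CONFIRMATION, the PendingConfirmation type, and the execute function are assumptions for this example.

```python
from dataclasses import dataclass

# Hypothetical set of tools that affect other users or external systems.
REQUIRES_CONFIRMATION = {"sendNotification", "createIssue"}


@dataclass(frozen=True)
class PendingConfirmation:
    tool: str
    arguments: dict
    summary: str  # shown to the user before anything runs


def execute(tool: str, arguments: dict, handlers: dict, confirmed: bool = False):
    """Run a tool, or return a pending action if the user has not confirmed it."""
    if tool in REQUIRES_CONFIRMATION and not confirmed:
        summary = f"{tool} will run with {arguments}"
        return PendingConfirmation(tool=tool, arguments=arguments, summary=summary)
    return handlers[tool](**arguments)
```

The UI renders the summary, and only a second call with confirmed=True, triggered by the user rather than the model, performs the action.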
Scoped API tokens are the practical mechanism for many tool-level permission boundaries. Instead of giving the agent a full-scope token, issue a scoped token at the start of the session that grants read access to the repositories the user is currently working with, write access to the specific issue tracker project, and no access to billing or user management. Scope reduction limits the blast radius of any unintended tool behavior.
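The session-scoped token described above might look like the following sketch. The scope string format ("read:repo/<name>", "write:issues/<project>") and the names ScopedToken and issue_session_token are invented for illustration. A real system would use whatever scope model its identity provider supports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedToken:
    scopes: frozenset

    def allows(self, action: str, resource: str) -> bool:
        # Anything not explicitly granted is denied.
        return f"{action}:{resource}" in self.scopes


def issue_session_token(active_repositories: list, issue_project: str) -> ScopedToken:
    """Grant read on the repositories in use and write on one issue project."""
    scopes = {f"read:repo/{name}" for name in active_repositories}
    scopes.add(f"write:issues/{issue_project}")
    return ScopedToken(scopes=frozenset(scopes))
```

Because billing and user management never appear in the scope set, a misdirected tool call against them fails no matter what the model requested.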
Production agents need observability
When an agent calls tools, the trace should capture: which tool was called, with what input, what the tool returned, how long the call took, which model version produced the decision to call the tool, and what the user did with the result. Without that trace, debugging becomes storytelling — reconstructing what might have happened from incomplete evidence.
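The trace fields listed above can be captured with a thin wrapper around each tool call. This is a sketch under assumptions: ToolTrace, traced_call, and the list used as a sink are hypothetical, and a production system would more likely send the record to its existing tracing backend.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolTrace:
    tool: str
    input: dict
    output: Any
    error: Optional[str]
    duration_ms: float
    model_version: str
    user_outcome: Optional[str] = None  # set later, e.g. "accepted" or "overridden"


def traced_call(
    tool: str, handler: Callable, arguments: dict, model_version: str, sink: list
):
    """Call a tool handler and record a trace whether it succeeds or raises."""
    started = time.perf_counter()
    output, error = None, None
    try:
        output = handler(**arguments)
        return output
    except Exception as exc:
        error = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        duration_ms = (time.perf_counter() - started) * 1000
        sink.append(
            ToolTrace(tool, arguments, output, error, duration_ms, model_version)
        )
```

The user_outcome field is left empty at call time because what the user did with the result is only known after the interaction ends.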
Observability also drives product improvement. Failed tool calls reveal tools with brittle schemas or poor error handling. Repeated clarifying questions before a specific tool call reveal a description that is not guiding the model clearly. User overrides — where the agent called a tool but the user replaced its output manually — reveal cases where the tool's output format does not match what the user actually needs. These signals are the feedback loop that makes the tool ecosystem better over time.
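Two of these signals, failed calls and user overrides, can be computed directly from stored traces. The sketch below assumes each trace is a dictionary with hypothetical keys tool, error, and user_outcome. The record format is an illustration, not a standard.

```python
from collections import defaultdict


def summarize_tool_signals(traces: list) -> dict:
    """Compute per-tool call counts, failure rates, and user override rates."""
    counts = defaultdict(lambda: {"calls": 0, "failures": 0, "overrides": 0})
    for trace in traces:
        entry = counts[trace["tool"]]
        entry["calls"] += 1
        if trace.get("error") is not None:
            entry["failures"] += 1
        if trace.get("user_outcome") == "overridden":
            entry["overrides"] += 1
    return {
        tool: {
            "calls": c["calls"],
            "failure_rate": c["failures"] / c["calls"],
            "override_rate": c["overrides"] / c["calls"],
        }
        for tool, c in counts.items()
    }
```

A tool with a high failure rate points at its schema or error handling, while a high override rate points at an output format that does not match what users need.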
Latency matters for developer tools. An agent that calls five tools sequentially, each taking 500 ms, adds 2.5 seconds of latency to every interaction. Running independent tool calls in parallel, caching results for read-only tools, and setting aggressive timeouts on tools that query external systems keep the interaction responsive. A developer tool that makes developers wait is a tool developers will stop using.
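Parallel calls with per-tool timeouts can be sketched with asyncio from the Python standard library. The function names and the timeout result shape are assumptions for this example. Independent calls that each take 500 ms then finish in roughly 500 ms overall rather than 2.5 seconds.

```python
import asyncio


async def call_with_timeout(name: str, tool, timeout_s: float):
    """Await one tool call, turning a timeout into a structured error result."""
    try:
        return name, await asyncio.wait_for(tool(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return name, {"ok": False, "error_code": "timeout"}


async def run_parallel(tools: dict, timeout_s: float = 1.0) -> dict:
    """Run independent tool calls concurrently and collect results by name."""
    pairs = await asyncio.gather(
        *(call_with_timeout(name, tool, timeout_s) for name, tool in tools.items())
    )
    return dict(pairs)
```

This only applies to calls that do not depend on each other's output. A call that needs an earlier result still has to wait for it.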
A mature MCP or tool-calling setup feels less like magic and more like an auditable workflow engine with language as the interface. The decisions the model makes are visible, the actions it takes are logged, the permissions it operates within are defined, and the recovery paths when something goes wrong are clear. That transparency is not a limitation — it is what makes the product trustworthy enough to deploy.