This article explores the various architectural components: Humans, Copilots, AI Agents, AI Models, and MCP Hosts/Clients/Servers. Since data is continuously passed between these components, I start by discussing a way to think about the kind of data being input to a component and the kind of data being output from it. This way of thinking has served me well when conversing with other engineers building AI Agents.
I’ll then introduce each component, explain its main purpose in the overall architecture, and describe when and how it leverages the other components. When you’re done reading, I hope you’ll understand the purpose of each component and how best to go about implementing it. This is the article I wish I had available to read when I was learning all this myself.
Both humans and AI Models accept imprecise input such as natural language text, images/video, and audio and attempt to “understand” or “make sense” of it. And, since the imprecise input is subject to interpretation, different humans and AI Models are likely to produce different or imprecise outputs:
Imprecise Data ⇨ Human/AI ⇨ Imprecise Data
On the other hand, software accepts precise input such as scalar values (Booleans, integers, floats), well-formatted strings (URLs, dates/times, UUIDs, JSON, XML, CSV, etc.), and structures/arrays composed of these. Since precise input is not subject to interpretation, different pieces of software interpret the same input identically. For example, a service written in one programming language can create and output a JSON object to a client written in a different programming language, and both the service and the client understand the JSON object in precisely the same way. In addition, software produces precise output:
Precise Data ⇨ Software ⇨ Precise Data
Humans and AI Models typically produce imprecise output. But they can attempt to produce precise output:
Imprecise Data with Precise Data ask ⇨ Human/AI ⇨ attempted Precise Data
For example, a human can take an algorithm in their head (imprecise thoughts) and attempt to output source code in some programming language (precise output). Similarly, an AI Model can be asked via imprecise natural language to produce precise data output. Here’s an example of an AI prompt:
Given the following structure:
structure Person {
string FirstName
string LastName
string CompanyName
string EmploymentDate // in YYYY-MM-DD format
}
Produce a JSON object (and nothing else) for
"On June 1st of 2015, John Smith, joined Microsoft as a full-time employee."
When I sent this imprecise input prompt to an AI Model, I got the following precise output:
{"FirstName": "John",
"LastName": "Smith",
"CompanyName": "Microsoft",
"EmploymentDate": "2015-06-01"
}
The above JSON object could be passed as precise input to software for reliable processing.
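To make this concrete, here’s a minimal Python sketch of how software might defensively consume the AI Model’s attempted precise output, rejecting it if it isn’t valid JSON or doesn’t match the Person structure (the parse_person helper is my own, not from any particular library):

import json
from datetime import date

REQUIRED_FIELDS = ("FirstName", "LastName", "CompanyName", "EmploymentDate")

def parse_person(model_output: str) -> dict:
    """Parse and validate the AI Model's attempted precise output."""
    person = json.loads(model_output)  # raises an error if the output isn't valid JSON
    for field_name in REQUIRED_FIELDS:
        if field_name not in person:
            raise ValueError(f"missing field: {field_name}")
    # Confirm the date really is in YYYY-MM-DD format before software relies on it.
    date.fromisoformat(person["EmploymentDate"])
    return person

person = parse_person('{"FirstName": "John", "LastName": "Smith", '
                      '"CompanyName": "Microsoft", "EmploymentDate": "2015-06-01"}')
print(person["EmploymentDate"])  # 2015-06-01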
When asking an AI Model to output precise data, there is a much higher likelihood of success if the names you include in the prompt (like FirstName, LastName, etc.) are descriptive so the AI Model can better discern their meaning.
Of course, it is important to recognize that humans and AI Models are fallible; that is why I emphasize that they attempt to output precise data. Success is not guaranteed, and if software tries to process imprecise input, it is very likely to fail. We’d obviously like to avoid this:
Imprecise Data ⇨ Software ⇨ 💣 (FAILURE)
Here are the key takeaways from this discussion:
- Imprecise data is valid input to humans and AI Models, not to software.
- If humans or AI Models output imprecise data, the output is intended for humans or other AI Models.
- If humans or AI Models output precise data, the output is intended for humans, AI Models, or software. However, attempting to output precise data may fail and, in this case, passing the imprecise data on is likely to produce unpredictable results.
An AI Agent pursues goals and completes tasks on behalf of humans or other AI agents. Each AI Agent specializes in some discrete set of skills and is therefore likely to rely on other specialized AI Agents to complete portions of a task. The figure below shows a human using a Trip Booking Agent which, in turn, may use other AI Agents to complete booking an entire trip.
AI Agents accept input requests and reason over each request using the AI Agent’s domain-specific knowledge. This reasoning produces an execution plan, and the AI Agent then performs actions utilizing various tools at its disposal. In this section, we examine AI Agents as defined by Google’s Agent2Agent (A2A) protocol. The figure below provides a visual you can refer to during the discussion.
From the perspective of a consumer, an AI Agent is implemented as an HTTP service exposing operations which expect imprecise input and return imprecise output. In addition, an AI Agent has a name, description, and set of skills (each with its own name & description) describing the kinds of tasks the AI Agent can perform. All these attributes are imprecise string values and so a human or AI Model (not software) is likely to select a specific AI Agent and use it to execute tasks. Here’s an example of an AI Agent’s attributes:
Name: GeoSpatial Route Planner Agent
Description: Provides advanced route planning, traffic analysis,
and custom map generation services. This agent can calculate
optimal routes, estimate travel times considering real-time
traffic, and create personalized maps with points of interest.
Skill #1:
Name: Traffic-Aware Route Optimizer
Description: Calculates the optimal driving route between two or more
locations, taking into account real-time traffic conditions,
road closures, and user preferences (e.g., avoid tolls,
prefer highways).
Skill #2:
Name: Personalized Map Generator
Description: Creates custom map images or interactive map views based on
user-defined points of interest, routes, and style
preferences. Can overlay data layers.
Since I’m focusing on concepts, I’m ignoring some details. For a more complete example of an AI Agent, see Sample Agent Card.
A client AI Agent initiates a task with a server AI Agent consistent with the server AI Agent’s advertised description and skills. Initiating a new task creates a potentially long-lived conversation between the client and server AI Agents. Tasks advance when the client AI Agent adds a user message (processed by an AI model) or a server AI Agent adds an agent message (response from an AI model) to the conversation.
It’s important to know that AI Models are stateless; therefore, AI Agents must maintain the conversation history and send it to the AI Model each time to advance the conversation. Each AI Model documents its context window, which indicates the maximum size of the conversation it supports. To keep the conversation history within this limit, AI Agents frequently employ techniques such as deleting old messages or deleting messages “less relevant” to the conversation.
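As a simple illustration of the first technique, here’s a minimal sketch (assuming each message is a dict with a text field, and using a character budget as a crude stand-in for a token-based context window) that drops the oldest messages until the conversation fits:

def trim_history(messages: list[dict], max_chars: int = 12_000) -> list[dict]:
    """Drop the oldest messages until the conversation fits the assumed budget.

    A production orchestrator would count tokens rather than characters and
    might summarize or drop "less relevant" messages instead.
    """
    kept = list(messages)
    while len(kept) > 1 and sum(len(m["text"]) for m in kept) > max_chars:
        kept.pop(0)  # delete the oldest message first
    return kept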
Each message includes 1 or more parts (text, file URI/data, or JSON data). The task’s final agent message includes an output artifact (document, image, structured data) generated by the server AI Agent. The artifact also includes 1 or more parts.
Here is an example of a task’s conversation messages (focusing only on the main concepts). A full example with more details can be found here: Specification — Agent2Agent Protocol (A2A).
1. Client initiates task conversation with AI Agent:
User Message:
Parts:
Text: I'd like to book a flight
2. AI Agent responds indicating input-required:
State: input-required
Agent Message:
Parts:
Text: Sure, I can help with that! Where would you like to fly to,
and from where? Also, what are your preferred travel dates?
3. Client adds requested input message to conversation:
User Message:
Parts:
Text: I want to fly from New York (JFK) to London (LHR) around
October 10th, returning October 17th.
4. AI Agent completes task processing:
State: completed
Agent Message:
Parts:
Text: Okay, I've found a flight for you. Details are in the artifact.
Artifacts:
Parts:
Data: {
"from": "JFK",
"to": "LHR",
"departure": "2024–10–10T18:00:00Z",
"arrival": "2024–10–11T06:00:00Z",
"returnDeparture": "…"
}
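To ground the terminology, here’s one way to model the task, message, part, and artifact shapes described above. This is my own minimal sketch, not the official A2A types:

from dataclasses import dataclass, field

@dataclass
class Part:
    kind: str                  # "text", "file", or "data"
    text: str | None = None    # for text parts
    uri: str | None = None     # for file parts
    data: dict | None = None   # for structured (JSON) data parts

@dataclass
class Message:
    role: str                  # "user" (client side) or "agent" (server side)
    parts: list[Part] = field(default_factory=list)

@dataclass
class Artifact:
    parts: list[Part] = field(default_factory=list)

@dataclass
class Task:
    state: str                 # e.g., "input-required" or "completed"
    history: list[Message] = field(default_factory=list)
    artifacts: list[Artifact] = field(default_factory=list)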
Internally, an AI Agent has a main execution loop referred to as an orchestrator. The orchestrator’s job is to advance the task’s conversation through to completion. Several tools exist to assist developers with building orchestrators such as LangChain, Semantic Kernel, and AutoGen.
Processing a new message typically requires the orchestrator to augment the message’s prompt with AI Agent-specific knowledge to improve the quality of the output. This technique is known as Retrieval Augmented Generation (RAG). Specifically, the orchestrator obtains an embedding vector for the user message’s prompt and then does a similarity search against data files, PDFs, etc. to find source content similar to what the user’s prompt is about. The orchestrator embeds this content into the prompt.
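Here’s a sketch of that augmentation step. The embed function and knowledge_base.similarity_search call stand in for whatever embedding model and vector store the AI Agent actually uses:

def augment_with_rag(user_prompt: str, knowledge_base, embed, top_k: int = 3) -> str:
    """Retrieval Augmented Generation: retrieve similar content and embed it in the prompt."""
    query_vector = embed(user_prompt)                               # embedding vector for the prompt
    chunks = knowledge_base.similarity_search(query_vector, top_k)  # most similar source content
    context = "\n\n".join(chunk.text for chunk in chunks)
    return (
        "Use the following context when answering the request.\n\n"
        f"Context:\n{context}\n\n"
        f"Request:\n{user_prompt}"
    )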
At this point, the orchestrator sends the entire conversation to an AI Model. The AI Model outputs either:
- an imprecise result which is appended to the task’s conversation.
- a precise set of specific tool names to call. A tool name either refers to another specialized AI Agent or some software function (usually implemented by an MCP Server, discussed below). It is the orchestrator’s job to call any tools and calling a software function is what allows an AI Agent to perform actions and be agentic. For example, a software function can add a calendar entry, create a database record, send an email, and so on. Each tool’s string result is appended to the task’s conversation. A sequence diagram later in this article visualizes this workflow.
The orchestrator then loops around continuing the conversation until the task is complete.
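Putting the loop together, here’s a simplified orchestrator sketch. The model.complete call, the reply’s tool_calls attribute, and the conversation/tool shapes are placeholders for whatever framework (or hand-rolled code) you actually use:

def run_task(conversation: list[dict], model, tools: dict) -> list[dict]:
    """Advance the task's conversation until the AI Model stops requesting tool calls."""
    while True:
        reply = model.complete(conversation)  # send the entire conversation each time
        if not reply.tool_calls:
            # Imprecise result: append the agent message and end this turn.
            conversation.append({"role": "agent", "text": reply.text})
            return conversation
        # Precise result: call each requested tool and append its string output.
        for call in reply.tool_calls:
            result = tools[call.name](**call.arguments)
            conversation.append({"role": "tool", "name": call.name, "text": str(result)})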
In this section, we examine the MCP Host, MCP Client, and MCP Server components as defined by Anthropic’s Model Context Protocol (MCP). The big benefit of this protocol is that it allows any AI Agent adhering to it to leverage all the specialized features implemented by any MCP Server to accomplish tasks. When an AI Agent does this, it has also become what we call an MCP Host.
To make this work, 1 or more MCP Servers must first be registered with the AI Agent; how to do this is AI Agent-specific, but someday a standard may emerge here too. At runtime, the MCP Host instantiates 1 MCP Client per MCP Server. Some MCP Servers are implemented as executable processes that run on the same (local) PC as the AI Agent; the MCP Client communicates with these via standard input/output. Other MCP Servers are implemented as HTTP services; the MCP Client communicates with them remotely via HTTP requests to the service’s URL.
Each MCP Server exposes tools (functions), resources, and prompts to the AI Agent and its orchestrator.
MCP Server Tools
After instantiating the MCP Clients/Servers, the AI Agent asks each for its list of available tools. Each tool has a name, description, and schema specifying the precise JSON input required to call the tool. Here’s an example (from here):
Tool #1:
Name: calculate_sum
Description: Add two numbers together
InputSchema:
{
type: "object",
properties: {
a: { type: "number" },
b: { type: "number" }
},
required: ["a", "b"]
}
Tool #2:
Name: execute_command
Description: Run a shell command
InputSchema:
{
type: "object",
properties: {
command: { type: "string" },
args: { type: "array", items: { type: "string" } }
}
}
The AI Agent’s orchestrator then takes all the tool information shown above and adds it to the latest conversation user message. The conversation is then sent to the AI Model which may output the precise name of a tool which the orchestrator should call. The AI Model chooses the best tool to call based on the tools’ text descriptions. The AI Model also outputs a precise JSON object whose values match the InputSchema; the values are initialized by the AI Model based on the properties’ names.
An AI Model may choose to have the orchestrator call multiple tools. For example, a prompt such as “What’s the weather and current time in Paris?” requires a call to a tool that gets the weather in Paris and a call to another tool that gets the current time in Paris. The orchestrator could execute the two calls in parallel to improve performance.
After calling the tool(s), their string output is placed in a new message and appended to the conversation.
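Because the AI Model’s JSON arguments are only attempted precise data, a careful orchestrator can validate them against the tool’s InputSchema before making the call. Here’s a sketch using the jsonschema package; the mcp_client.call_tool call is a placeholder for however your MCP Client actually invokes the tool:

from jsonschema import ValidationError, validate  # pip install jsonschema

def call_tool(mcp_client, tool: dict, model_args: dict) -> str:
    """Validate the AI Model's JSON arguments against the tool's InputSchema, then call it."""
    try:
        validate(instance=model_args, schema=tool["inputSchema"])
    except ValidationError as err:
        # The AI Model produced imprecise/invalid arguments; don't call the tool.
        return f"Invalid arguments for {tool['name']}: {err.message}"
    result = mcp_client.call_tool(tool["name"], model_args)  # placeholder MCP Client call
    return str(result)  # this string output is appended to the conversation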
Here is a diagram visualizing the above workflow:
When using tools, there are several caveats to be aware of:
- Since AI Models are fallible, the model may not select the best tool, and it may produce a corrupt or improper JSON object. The orchestrator can’t reliably detect this; therefore, AI Agents are strongly encouraged to prompt a human for permission before calling the tool. Prompting also lets the human check whether there is any sensitive data in the JSON object before sending it to some arbitrary tool.
- The more tools the orchestrator offers to the AI Model, the less likely the best tool will be selected. An orchestrator might employ techniques such as description keyword matching, semantic similarity, or predefined mappings to filter a large set of potential tools down to a relevant few (see the sketch after this list).
- Even if the AI Model picks the best tool and prepares a perfect JSON object, the tool’s code could do anything it desires which may have severe consequences (like deleting data). This is another reason why many orchestrators ask the human for permission before calling the tool.
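As a tiny illustration of the filtering idea mentioned in the second caveat, here’s a naive keyword-matching filter; a real orchestrator would more likely use embeddings or curated mappings:

def filter_tools(user_prompt: str, tools: list[dict], limit: int = 5) -> list[dict]:
    """Keep only the tools whose descriptions share the most words with the user's prompt."""
    prompt_words = set(user_prompt.lower().split())

    def overlap(tool: dict) -> int:
        return len(prompt_words & set(tool["description"].lower().split()))

    return sorted(tools, key=overlap, reverse=True)[:limit]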
MCP Server Resources
An AI Agent may support the ability to read resources/files and include their data in conversation user messages (prompts) passed to an AI Model so it can reason over the data.
After instantiating the MCP Servers, the AI Agent asks each MCP Server for its list of available resources. Each resource must have a name and URI and can optionally have a mime type and description. Here’s an example:
Resource #1:
Name: Application Logs
URI: file:///logs/app.log
Mime Type: text/plain
Description: The application’s log file.
Resource #2:
Name: Annual Report
URI: https://contoso.com/annual-report.pdf
Mime Type: application/pdf
Description: Contoso’s Annual Report.
Resource #3:
Name: Temperature
URI: http://temperature/{deviceName}{?units}
Mime Type: text/plain
Description: Device temperature.
Resource #4:
Name: Stock Info
URI: http://stock/{ticker}
Mime Type: application/json
Description: Stock details.
Resources #3 and #4 above are examples of using resource template URIs. These URIs contain variables following the RFC 6570 syntax. To read from a URI template resource, something must fill in the variables; this can either be a human, an AI Agent that is aware of the template variables and knows how to set them, or an AI Model.
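For example, once the variable values are known, the template can be expanded programmatically. Here’s a sketch using the third-party uritemplate package, which implements RFC 6570:

from uritemplate import URITemplate  # pip install uritemplate (RFC 6570 implementation)

template = URITemplate("http://temperature/{deviceName}{?units}")
url = template.expand(deviceName="office-23", units="celsius")
print(url)  # http://temperature/office-23?units=celsius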
The selection of resource URIs to read from depends on the specifics of the AI Agent. If the AI Agent offers a user interface, it could allow the human to pick from the list of available resources. Or the AI Agent could create an imprecise prompt message including some resource properties (description, URI, etc.) and ask the AI Model to precisely output a resource URL; here’s an example of such a prompt:
Get the office-23 thermostat’s temperature in Celsius given these URLs:
The application’s log file.: file:///logs/app.log
Contoso’s Annual Report: https://contoso.com/annual-report.pdf
Device temperature.: http://temperature/{deviceName}{?units}
Stock details.: http://stock/{ticker}
Return just the URL
When I sent this imprecise input prompt to an AI Model, I got the following precise output with filled-in variables:
http://temperature/office-23?units=celsius
Once a resource URI has been selected, the AI Agent asks the owning MCP Server to read that URI’s data. This data is then typically embedded into the next message prompt which is appended to the conversation history and then sent to the AI Model for processing.
MCP Server Prompts
An AI Agent may support the ability to get prompt suggestions from an MCP Server. Usually, these prompt suggestions are selected by a human via slash commands, quick actions, context menu items, command palette entries, etc. This, of course, requires an AI Agent with a user-interface like an application or website; see the UI AI Agent section later in this article.
After instantiating the MCP Servers, the UI AI Agent asks each MCP Server for its list of useful prompts. Each prompt must have a name, description, and a set of arguments. Here’s an example (from here):
Prompt #1:
Name: analyze-code
Description: Analyze code for potential improvements
Argument #1:
Name: language
Description: Programming Language
Required: True
Prompt #2:
Name: analyze-project
Description: Analyze project logs and code
Argument #1:
Name: time-frame
Description: Time period to analyze logs
Required: True
Argument #2:
Name: fileUri
Description: URI of code file to review
Required: True
As I mentioned above, a human usually selects prompts, but an orchestrator could select a prompt via description keyword matching, semantic similarity, or predefined mappings. Once a prompt is selected, the argument values must be provided by a human or AI Model.
Then the orchestrator sends the prompt’s name and argument values to the MCP Server, which returns 1 or more messages specialized with the argument values back to the orchestrator. These messages could also have some of the MCP Server’s resource data included in them. For example, an MCP Server asked to process the analyze-project prompt with time-frame=1h and fileUri=file:///path/to/code.py might return the following messages:
User Message (AI Model prompt instruction):
Text: Analyze these system logs and the code file for any issues:
User Message (appended to AI Model prompt from resource URI):
URI: logs://recent?timeframe=1h
Mime Type: text/plain
Text: [2024-03-14 15:32:11] ERROR: Connection timeout in network.py:127
[2024-03-14 15:32:15] WARN: Retrying connection (attempt 2/3)
[2024-03-14 15:32:20] ERROR: Max retries exceeded
User Message (appended to AI Model prompt from another resource URI):
URI: file:///path/to/code.py
Mime Type: text/x-python
Text: def connect_to_service(timeout=30):
retries = 3
for attempt in range(retries):
try:
return establish_connection(timeout)
except TimeoutError:
if attempt == retries - 1:
raise
time.sleep(5)
def establish_connection(timeout):
# Connection implementation
pass
The orchestrator appends these 3 user messages to the conversation and passes the updated conversation to its AI Model for processing. The AI Model’s output is handled by the orchestrator as it normally would: call a tool or append the agent message to the conversation. The orchestrator then loops around continuing the conversation until the task is complete.
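Here’s a sketch of that flow; mcp_client.get_prompt is a placeholder for however your MCP Client actually requests a prompt from the MCP Server:

def apply_prompt(mcp_client, conversation: list, name: str, arguments: dict) -> list:
    """Fetch an MCP prompt's messages and append them to the task's conversation."""
    result = mcp_client.get_prompt(name, arguments)  # placeholder MCP Client call
    conversation.extend(result.messages)             # e.g., the 3 user messages shown above
    return conversation

# Example: the analyze-project prompt from the example above.
# apply_prompt(mcp_client, conversation, "analyze-project",
#              {"time-frame": "1h", "fileUri": "file:///path/to/code.py"})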
MCP Server Sampling
An MCP Server might want to use an AI Model for some of its internal work. The MCP protocol accommodates this with a feature called sampling. With sampling, the MCP Server sends a sampling/createMessage message to the AI Agent asking it to use one of its AI Models to do the work. The benefit of this approach is that the MCP Server itself doesn’t need to configure, pay for, or manage API keys/secrets to access its own AI Model. While this all sounds nice for MCP Servers that desire this ability, in reality, very few popular MCP Clients support sampling as you can see here: Example Clients — Model Context Protocol.
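Concretely, the sampling request is a JSON-RPC message from the MCP Server to its host. Here’s a rough sketch of its shape as a Python dict; the field values are illustrative, and optional fields (such as model preferences) are omitted:

# Rough shape of a sampling/createMessage request an MCP Server might send to its host.
sampling_request = {
    "method": "sampling/createMessage",
    "params": {
        "messages": [
            {"role": "user",
             "content": {"type": "text", "text": "Summarize the attached log entries."}}
        ],
        "maxTokens": 200,
    },
}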
MCP Server Roots
An AI Agent might want to focus its MCP Servers on a subset of potential resources. The canonical example is an IDE (like VS Code) sending the root of the current project’s code files to all the MCP Servers so that they focus on just the project’s source code files (resources) as opposed to any and all files on the user’s PC. An AI Agent sends roots (a list of URIs) to its MCP Servers. The URIs are typically filesystem paths (file:) but may be anything. Here is an example of roots an MCP Server might receive (from here):
{"roots": [
{
"uri": "file:///home/user/projects/frontend",
"name": "Frontend Repository"
},
{
"uri": "https://api.example.com/v1",
"name": "API Endpoint"
}
]
}
During its lifetime, an AI Agent may send a new list of root URIs to its MCP Servers, for example, when the user of the IDE opens a different project. Like sampling, very few popular MCP Hosts support roots, as you can see here: Example Clients — Model Context Protocol.
In the discussion above, I positioned the AI Agent as a remote HTTP service. However, a client-side application or website can also be its own AI Agent with its own orchestrator, AI Model(s), knowledge (RAG), AI Agent list, and MCP Clients/Servers. But these AI Agents have a user interface (UI) instead of listening for HTTP requests. Microsoft brands any application or website with a “UI over an AI” as a Copilot. And, of course, a human interacts with the UI AI Agent (Copilot). Here’s an updated version of the figure I showed earlier, including the UI AI Agent.
Since we have already discussed AI Agents, there isn’t much else to say about the UI AI Agent other than it has a UI. The UI AI Agent typically has a chat-like interface where the human enters a user message, the orchestrator does its thing, and the resulting agent message is displayed back in the UI.
However, unlike an AI Agent, the UI AI Agent has access to any context, resources, tools, or prompts available to the application or website. In effect, the UI AI Agent is also its own MCP Server. In fact, the UI AI Agent can implement its own functionality as an internal MCP Server as opposed to a local or remote MCP Server. This simplifies the code implementing the UI AI Agent unless the agent requires richer or more integrated access to resources/tools. And, of course, a UI AI Agent can leverage other remote AI Agents as well as any other locally installed or remotely accessible MCP Servers.
In conclusion, the architecture of AI Agents integrates various components built by different people and organizations. The real strength, power, and flexibility come from the various components adopting industry standards such as Google’s Agent2Agent (A2A) protocol and Anthropic’s Model Context Protocol (MCP). This article provided a comprehensive overview of these components and how they interact with one another, emphasizing the importance of imprecise versus precise data handling and the role of orchestrators in managing conversations and tasks.
The continuous evolution of AI technology promises to bring even more sophisticated and capable AI Agents. The integration of multiple AI models, the use of advanced tools and resources, and the ability to handle complex tasks autonomously are just a few of the exciting developments on the horizon.