A recent article by Invariant Labs has shown that MCP-based
agent frameworks can be vulnerable to "tool poisoning"
attacks. In a tool-poisoning attack, a malicious server can hijack tool
calls; the hijacked calls can then read sensitive data or execute arbitrary
commands on the host machine without notifying the user. This is a serious
security concern, as it can lead to unauthorized access to sensitive
information, potential lateral movement within systems, and other nefarious
activities.
Many of the most exciting MCP demos are all about controlling local
applications, such as web browsers, 3D modeling software, or video editors.
These demos show how an AI agent can interact with these applications in a
natural way, using the same tools that a human would use. This is a powerful
capability, but it also raises serious security concerns.
How do we balance the power of MCP with the need for security?
Servlets: Still MCP, Just Lighter and More Secure
At first glance, mcp.run might look like an MCP Server marketplace. But
mcp.run does not just provide a marketplace for arbitrary MCP servers. Instead,
it provides a curated set of lightweight "servlets" that are designed to run
in a secure environment.
We call them "servlets" to emphasize they don't run as full-fledged MCP
servers on the user's machine. Instead, they share a host process service
that runs them in an isolated environment with limited access to the
host machine's resources. This is usually called a "sandbox".
Servlets do not share data with each other, and the data that is shared with
the host process is limited to the specific pieces of information that are
needed for the servlet to work. This means that even if one servlet is
compromised, it cannot simply access the data or resources of another servlet:
that access is still mediated by the permissions the other servlet has
been granted. This is a key security feature of mcp.run, and it greatly
mitigates "tool poisoning" attacks.
Each servlet is given explicit access to a specific set of resources and
capabilities, which are defined in its configuration. In other words, a servlet
can touch only the resources and capabilities it has explicitly
declared. For example:
The filesystem servlet can access only the portions of the file system that it
has explicitly requested; it cannot read files outside those boundaries, and
it cannot access the Internet.
The brave-search servlet can access the Brave Search API, but it cannot
access any other web resources.
The fetch servlet is only able to retrieve the contents of a web site.
The eval-py servlet evaluates Python code, but it can access neither the
Internet nor the file system.
All of the capabilities that a servlet requires must be explicitly granted
at installation time; anything not granted is denied. The installation page
for a servlet displays the full list of capabilities that the servlet will be
granted, and the user must explicitly accept them before the servlet can
be installed; in some cases, they can even be further restricted if the user
chooses to do so.
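To make this concrete, a capability grant for something like the filesystem servlet could be modeled roughly as follows; the field names below are purely illustrative, not the actual mcp.run manifest format:

```typescript
// Hypothetical capability manifest for a filesystem servlet.
// Field names are illustrative; the real mcp.run configuration may differ.
interface ServletPermissions {
  allowedPaths: string[]; // directories the servlet may read and write
  allowedHosts: string[]; // network hosts the servlet may reach
  config: Record<string, string>; // non-secret configuration values
}

const filesystemServletGrant: ServletPermissions = {
  allowedPaths: ["/home/user/notes"], // nothing outside this directory
  allowedHosts: [], // no network access at all
  config: { read_only: "false" },
};

console.log("Granting:", filesystemServletGrant);
```

Anything that does not appear in the grant simply is not reachable from inside the sandbox.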
It is also very common for an MCP server to read configuration parameters
from a file or from environment variables; this includes sensitive information
such as passwords or API keys. But mcp.run servlets are instead configured
through a standardized interface, and the credentials are stored in a secure
way. Thus, there is no need to store sensitive information in configuration
files or environment variables. In fact, servlets cannot access environment
variables in any way.
Moreover, servlets cannot access other sensitive system resources such as
the clipboard; and even when the servlet is granted access to the file
system, it is limited to a specific directory that is defined in the
configuration. This means no servlet has unrestricted access to sensitive files
or directories on the host machine, such as your ~/.cursor/mcp.json or your
SSH keys, without explicit user consent.
API Integration: Servlets as a Security Boundary
There are still good reasons to write your own MCP server:
you will have to write your own server when you want to plug in your own
APIs;
you might want to orchestrate elaborate API flows and avoid filling the
context window with irrelevant data;
but, most importantly, and yet often overlooked, you will need to
write your own MCP server if you want to retain full control over the
content surface that will "leak" into the token stream.
Because of how LLMs currently work, tool calls require giving your AI service
provider unencrypted access to sensitive data. Even when tool calls are
performed locally, unless you run LLMs on-premises, a third-party service is
given access to all exchanged data.
While writing an MCP server is relatively easy, creating an mcp.run servlet
is even easier. Servlets run in a shared but secure host environment that
implements the protocol and maintains a stable connection. You only need to
implement the logic of your tools, and the host handles the remaining
details.
Writing a servlet is easy and fun; it allows you to retain control over your
data, and it also brings performance benefits: controlling the amount of
data that is returned into the token stream ensures that the AI
service is not overwhelmed and can focus on relevant information.
You can write your servlets in multiple programming languages: TypeScript,
Python, Go, Rust, C++, Zig, and you can even bring your own.
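As a rough sketch of how little code a servlet needs, here is what a minimal TypeScript servlet could look like; the exported function names and type shapes are illustrative stand-ins for the actual mcp.run SDK, not its exact API:

```typescript
// Minimal servlet sketch. The CallToolRequest/CallToolResult shapes mirror MCP
// tool calls, but the exact exports and types of the mcp.run SDK may differ.
type CallToolRequest = {
  params: { name: string; arguments?: Record<string, unknown> };
};
type CallToolResult = { content: { type: "text"; text: string }[] };

// Advertise the tools this servlet provides.
export function describe() {
  return {
    tools: [
      {
        name: "greet",
        description: "Greets the given person by name",
        inputSchema: {
          type: "object",
          properties: { name: { type: "string" } },
          required: ["name"],
        },
      },
    ],
  };
}

// Handle a tool call; the shared host takes care of the MCP protocol around it.
export function call(input: CallToolRequest): CallToolResult {
  const name = String(input.params.arguments?.name ?? "world");
  return { content: [{ type: "text", text: `Hello, ${name}!` }] };
}
```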
"A sandboxed environment where you can run custom code, I know this: this is a
Container!" you might think. But it is not! mcp.run servlets run on a
WebAssembly runtime; in other words, they are running in a lightweight,
portable virtual machine with efficient sandboxing.
This also means that they do not need to run on your local machine. You can
fetch and run them on your other devices, such as phones or tablets,
processing data locally and only sending the results to the AI
service. You could even run them in a browser, without the need to install
any software.
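For instance, assuming the Extism JS SDK and a placeholder module URL, loading and calling a Wasm servlet in the browser might look roughly like this:

```typescript
// Rough sketch: loading and calling a Wasm servlet in the browser.
// Assumes the Extism JS SDK; the module URL and function name are placeholders.
import createPlugin from "@extism/extism";

async function runServletLocally(): Promise<void> {
  const plugin = await createPlugin(
    { wasm: [{ url: "https://example.com/my-servlet.wasm" }] }, // placeholder URL
    { useWasi: true },
  );

  // Call an exported function and read its output as text.
  const output = await plugin.call("call", JSON.stringify({ name: "greet" }));
  console.log(output?.text());
}

runServletLocally();
```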
Finally, you can offload execution to our servers to keep processing off your
own devices; or, if you prefer, you can run them on premises and
process data in a secure environment you trust.
Speaking of secure environments you can trust, have you checked out
mcp.run Tasks?
We look forward to seeing what you will build with mcp.run! If you want to learn
more about mcp.run and how it can help bring practical, secure, and customized
AI automation to your organization, get in touch!
MCP enables vastly more resilient system integrations by acting as a differential between APIs, absorbing changes and inconsistencies that would typically break traditional integrations. With "the prompt is the program," we can build more adaptable, intention-based integrations that focus on what needs to be done rather than how it's technically implemented.
When engineers connect software systems, they're engaging in an unspoken
contract: "I will send you data in exactly this format, and you will do exactly
this with it." This rigid dependency creates brittle integrations that break
when either side makes changes.
Traditional API integration looks like this:
```javascript
// System A rigidly calls System B with exact parameters
const response = await fetch("https://api.system-b.com/v2/widgets", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + token,
  },
  body: JSON.stringify({
    name: widgetName,
    color: "#FF5733",
    dimensions: {
      height: 100,
      width: 50,
    },
    metadata: {
      created_by: userId,
      department: "engineering",
    },
  }),
});

// If System B changes ANY part of this structure, this code breaks
```
This approach has several fundamental problems:
Version Lock: Systems get locked into specific API versions
Brittle Dependencies: Small changes can cause catastrophic failures
High Maintenance Burden: Keeping integrations working requires constant
vigilance and updates
Implementation Details Exposure: Systems need to know too much about each
other
The Automobile Differential: A Mechanical Analogy
To understand how MCP addresses these issues, let's examine a mechanical
engineering breakthrough: the differential gear in automobiles.
Before the differential, cars had a serious problem. When turning a corner, the
outer wheel needs to travel farther than the inner wheel. With both wheels
rigidly connected to the same axle, this created enormous stress, causing wheels
to slip, skid, and wear out prematurely.
The differential solved this by allowing wheels on the same axle to rotate at different speeds while still delivering power to both. It absorbed the differences between what each wheel needed.
MCP functions as an API differential - it sits between systems and absorbs their
differences, allowing them to effectively work together despite technical
inconsistencies.
How does this work? Through intention-based instructions rather than
implementation-specific calls.
When an MCP Client receives a high-level instruction like creating a dashboard
widget, it follows a sophisticated process:
Tool Discovery: The client first queries the MCP Server for available
tools:
```javascript
// MCP Client requests available tools
const toolList = await listTools();

// Server returns available tools with their descriptions
[
  {
    "name": "create_widget",
    "description": "Creates a new widget on a dashboard",
    "inputSchema": {
      "type": "object",
      "required": ["name", "dashboard_id", "widget_type"],
      "properties": {
        "name": {
          "type": "string",
          "description": "Display name for the widget",
        },
        "dashboard_id": {
          "type": "string",
          "description": "ID of the dashboard to add the widget to",
        },
        "widget_type": {...},
        "team_id": {...},
        // Other parameters...
      },
    },
  },
  {
    "name": "list_dashboards",
    "description": "List available dashboards, optionally filtered by team",
    "inputSchema": {...},
  },
  {
    "name": "get_user_info",
    "description": "Get information about the current user or a specified user",
    "inputSchema": {...},
  },
  {
    "name": "get_team_info",
    "description": "Get information about a team",
    "inputSchema": {...},
  },
];
```
Dependency Analysis: The LLM analyzes the instruction and the available
tools, recognizing that to create a widget, it needs:
The dashboard ID for "marketing team's main dashboard"
Possibly the team ID for the "marketing team"
Parameter Resolution: The LLM plans and executes a sequence of dependent
calls:
(This is deterministic code for demonstration, but the model infers these steps
on its own!)
```javascript
// First, get the marketing team's ID
const teamResponse = await callTool("get_team_info", {
  team_name: "marketing",
});
// teamResponse = { team_id: "team_mktg123", department_id: 42, ... }

// Next, find the marketing team's main dashboard
const dashboardsResponse = await callTool("list_dashboards", {
  team_name: "marketing",
});
// dashboardsResponse = [
//   { id: "dash_12345", name: "Main Dashboard", is_primary: true, ... },
//   { id: "dash_67890", name: "Campaign Performance", ... }
// ]

// Filter for the main dashboard
const mainDashboard =
  dashboardsResponse.find((d) => d.is_primary) ||
  dashboardsResponse.find((d) => d.name.toLowerCase().includes("main"));

// Finally, create the widget with all required parameters
const widgetResponse = await callTool("create_widget", {
  name: "Customer Dashboard",
  dashboard_id: mainDashboard.id,
  widget_type: "analytics",
  team_id: teamResponse.team_id,
});
```
Semantic Mapping: The MCP Server handles translating from the
standardized tool parameters to the specific API requirements, which might
involve:
Translating team_id to the internal department_id format
Setting the appropriate access_level based on team permissions
Generating a unique resource_id
Populating metadata based on contextual information
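A minimal sketch of what that translation layer could look like inside the MCP Server is shown below; the internal field names and the lookup are invented for illustration:

```typescript
// Hypothetical translation layer inside an MCP Server: semantic tool
// parameters in, implementation-specific API payload out.
type CreateWidgetParams = {
  name: string;
  dashboard_id: string;
  widget_type: string;
  team_id?: string;
};

function lookupDepartmentId(teamId: string): number {
  // Placeholder for a real lookup against the backend system.
  return teamId === "team_mktg123" ? 42 : 0;
}

function toInternalWidgetPayload(params: CreateWidgetParams) {
  return {
    display_name: params.name,
    dash_ref: params.dashboard_id,
    // Translate team_id into the backend's internal department_id format.
    department_id: params.team_id ? lookupDepartmentId(params.team_id) : undefined,
    // Derive access_level from team permissions (simplified here).
    access_level: params.team_id ? "team" : "private",
    resource_id: crypto.randomUUID(), // generate a unique resource_id
    metadata: { created_via: "mcp", widget_type: params.widget_type },
  };
}
```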
This approach is revolutionary because:
Resilient to Changes: If the underlying API changes (e.g., requiring new
parameters or renaming fields), only the MCP Server needs to update - the
high-level client instruction stays the same
Intent Preservation: The focus remains on what needs to be accomplished,
not how
Progressive Enhancement: New API capabilities can be leveraged without
client changes
Contextual Intelligence: The LLM can make smart decisions about which
dashboard is the "main" one based on naming, flags, or other context
With MCP, the instruction doesn't change even when the underlying API changes
dramatically. The MCP Server handles the translation from high-level intent to
specific API requirements.
Semantic Parameter Mapping enables systems to communicate based on meaning
rather than rigid structure. This approach drastically improves resilience and
adaptability in integrations.
In traditional API integration, parameters are often implementation-specific and
tightly coupled to the underlying data model. For example, a CRM API might
require a customer record to be created with fields like cust_fname,
cust_lname, and cust_type_id - names that reflect internal database schema
rather than their semantic meaning.
With MCP's semantic parameter mapping, tools are defined with parameters that
reflect their conceptual purpose, not their technical implementation:
{ "name":"create_customer", "description":"Create a new customer record in the CRM system", "inputSchema":{ "type":"object", "properties":{ "firstName":{ "type":"string", "description":"Customer's first or given name" }, "lastName":{ "type":"string", "description":"Customer's last or family name" }, "customerType":{ "type":"string", "enum":["individual","business","government","non-profit"], "description":"The category of customer being created" }, "address":{ "type":"object", "description":"Customer's primary address", "properties":{ "street":{ "type":"string", "description":"Street address including number and name" }, "city":{ "type":"string", "description":"City name" }, "state":{ "type":"string", "description":"State, province, or region" }, "postalCode":{ "type":"string", "description":"ZIP or postal code" }, "country":{ "type":"string", "description":"Country name", "default":"United States" } } }, "source":{ "type":"string", "description":"How the customer was acquired (e.g., 'website', 'referral', 'trade show')" }, "assignedRepresentative":{ "type":"string", "description":"Name or identifier of the sales representative assigned to this customer", "required":false } }, "required":["firstName","lastName","customerType"] } }
The key differences in the semantic approach:
Human-readable parameter names: Using firstName instead of cust_fname
makes the parameters self-descriptive
Hierarchical organization: Related parameters like address fields are
nested in a logical structure
Descriptive enumerations: Instead of opaque codes (like
cust_type_id: 3), semantically meaningful values like "business" are used
Clear descriptions: Each parameter includes a description of its purpose
rather than just its data type
Meaningful defaults: When appropriate, semantic defaults can be provided
This semantic approach provides tremendous advantages:
New parameters can be added to the semantic schema without breaking existing
integrations
When combined with intent-based execution, semantic parameter mapping creates a
powerful abstraction layer that shields systems from the implementation details
of their integration partners, making the entire ecosystem more adaptable and
resilient to change.
Intent-based execution is perhaps the most transformative aspect of MCP. Let me
walk you through a detailed example that illustrates how this works in practice.
Imagine a scenario where a business wants to "send a quarterly performance
report to all department heads." This seemingly simple task involves multiple
steps and systems in a traditional integration context:
Traditional Integration Approach:
Query the HR system to identify department heads
Access the financial system to gather quarterly performance data
Generate a PDF report using a reporting engine
Connect to the email system to send personalized emails with attachments
Log the communication in the CRM system
Each of these steps would require detailed knowledge of the respective APIs,
authentication methods, data formats, and error handling. If any system changes
its API, the entire integration could break.
With MCP's Intent-Based Execution:
The MCP Client (like Tasks) might simply receive the
instruction:
"Send our Q1 2024 performance report to all department heads. Include YoY comparisons and highlight areas exceeding targets by more than 10%."
Behind the scenes, the MCP Client would:
Recognize the high-level intent and determine this requires multiple tool
calls
Query the MCP Server for available tools related to reporting, employee data,
and communications
Based on the tool descriptions, construct a workflow:
(Again, not executed code, but to illustrate the inferred logic the LLM runs!)
```javascript
// In step 2 listed above, the MCP Client has all the
// static identifiers and parameters from the available
// tool descriptions to be used in this code

// First, identify who the department heads are
const departmentHeads = await callTool("get_employees", {
  filters: { position_type: "department_head", status: "active" },
});

// Get financial performance data for Q1 2024
const financialData = await callTool("get_financial_report", {
  period: "q1_2024",
  metrics: ["revenue", "expenses", "profit_margin", "growth"],
  comparisons: ["year_over_year"],
});

// The LLM analyzes the data to identify high-performing areas
const highlights = financialData.metrics.filter(
  (metric) => metric.year_over_year_change > 10,
);

// Generate a report with the appropriate formatting and emphasis
const report = await callTool("create_report", {
  title: "Q1 2024 Performance Report",
  data: financialData,
  highlights: highlights,
  format: "pdf",
  template: "quarterly_executive",
});

// Send the report to each department head with a personalized message
for (const head of departmentHeads) {
  await callTool("send_email", {
    recipient: head.email,
    subject: "Q1 2024 Performance Report",
    body: `Dear ${head.name},\n\nPlease find attached our Q1 2024 performance report. Your department ${head.department} showed ${
      highlights.some((h) => h.department === head.department)
        ? "exceptional performance in some areas"
        : "consistent results"
    }.\n\nRegards,\nExecutive Team`,
    attachments: [report.file_id],
    log_to_crm: true,
  });
}
```
The crucial difference is that the MCP Server for each system is responsible for
translating these semantic, intent-based calls into whatever specific API calls
its system requires.
For example, the HR system's MCP Server might translate get_employees with a
position or role filter into a complex SQL query or LDAP search, while the
reporting system's MCP Server might convert create_report into a series of API
calls to a business intelligence platform.
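As an illustration, the HR side of that translation might boil down to something like the following sketch, with an invented schema and column names:

```typescript
// Sketch of an HR system's MCP Server translating the semantic
// "get_employees" call into a backend-specific query. The table and
// column names here are invented for illustration.
type GetEmployeesFilters = { position_type?: string; status?: string };

function buildEmployeeQuery(filters: GetEmployeesFilters): { sql: string; args: string[] } {
  const clauses: string[] = [];
  const args: string[] = [];

  if (filters.position_type === "department_head") {
    clauses.push("role_code = $" + args.push("DEPT_HEAD"));
  }
  if (filters.status) {
    clauses.push("employment_status = $" + args.push(filters.status.toUpperCase()));
  }

  const where = clauses.length ? ` WHERE ${clauses.join(" AND ")}` : "";
  return { sql: `SELECT full_name, email, department FROM employees${where}`, args };
}

// If the HR backend later changes, only this translation changes;
// the semantic "get_employees" call stays the same.
```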
If any of these backend systems change:
The HR system might switch from an on-premise solution to Workday
The financial system might upgrade to a new version with a completely
different API
The reporting engine might be replaced with a different vendor
The email system might move from Exchange to Gmail
None of these changes would affect the high-level intent-based instruction.
Only the corresponding MCP Servers would need to be updated to translate the
same semantic calls into the new underlying system's language.
This is the true power of intent-based execution with MCP - it decouples what
you want to accomplish from the technical details of how to accomplish it,
creating resilient integrations that can withstand significant changes in the
underlying technology landscape.
As we move toward a world where "the prompt is the program," traditional rigid
API contracts will increasingly be replaced by intent-based interfaces. MCP
provides a standardized protocol for this transition.
The implications are profound:
Reduced integration maintenance: Systems connected via MCP require less
ongoing maintenance
Faster adoption of new technologies: Backend systems can be replaced
without disrupting front-end experiences
Greater composability: Systems can be combined in ways their original
designers never anticipated
Longer component lifespan: Software components can remain useful far
longer despite ecosystem changes
The differential revolutionized transportation by solving a mechanical impedance
mismatch. MCP is poised to do the same for software integration by solving the
API impedance mismatch that has plagued systems for decades.
The future of integration isn't more rigid contracts - it's more flexible,
intent-based communication between systems that can adapt as technology evolves.
The AI landscape is evolving rapidly, and with almost daily advancements, it can
be hard to understand which large language model (LLM) makes the most sense for
the task at hand. To make things more complicated, each new model release comes
with new claims regarding capabilities that don't necessarily translate to all
workloads. In some cases, it will make sense to stick with an older model to
save on time or cost, but for others, you might need the cutting-edge reasoning
abilities of a newer model. The only way to really understand how an LLM will
perform a specific task is to try it out (usually a bunch of times). This is why
we're happy to announce mcpx-eval, our
new eval framework to compare how different models perform with
mcp.run tools!
The
Berkeley Function Calling Leaderboard (BFCL)
is the state-of-the-art for measuring how well various models perform specific
tool calls. BFCL includes many specific tools created for particular tests being
performed, which allows for tool use to be tracked closely. While this is
extremely valuable and we may still try to get mcp.run tools working in BFCL, we
decided it would be worth building an mcp.run-focused eval
that could be used by us and our customers to compare the results of open-ended
tasks.
This means instead of looking at very specific outcomes around tool use, we're
primarily interested in quantifying the LLM output to determine how well
models perform in situations like Tasks,
where many tools can be combined to create varied results. To do this, we're
using a custom set of
LLM-based metrics
that are scored by a secondary "judge" LLM. Using static
reference-based metrics
can provide more deterministic testing of prompts but are less helpful when it
comes to interpreting the "appropriateness" of a tool call or the "completeness"
of a response.
To peek behind the curtain a little, we can see how the judge prompt is
structured:
```
<settings>
Max tool calls: {max_tool_calls}
Current date and time: {datetime.now().isoformat()}
</settings>
<prompt>{prompt}</prompt>
<output>{output}</output>
<check>{check}</check>
<expected-tools>{', '.join(expected_tools)}</expected-tools>
```
There are several sections present: <settings>, <prompt>, <output>,
<check>, and <expected-tools>:
settings provides additional context for an evaluation
prompt contains the original test prompt so the judge can assess the
completeness of the output
output contains all the messages from the test prompt, including any tools
that were used
check contains user-provided criteria that can be used to judge the output
expected-tools is a list of tools that can be expected to be used for the
given prompt
Using this template, the output of each LLM under test, along with the check
criteria, is converted into a new prompt and analyzed by the judge for various
metrics, which we extract into structured scoring data using
PydanticAI. The result of the judge is a Python
object with fields that hold the numeric scores, allowing us to analyze the data
using a typical data processing library like
Pandas.
This is just scratching the surface - our preliminary results already reveal
interesting patterns in how different models handle complex tool use scenarios.
For example, one of the biggest problems we're seeing is tool over-use, which doesn't
affect the quality of the output but does affect compute time and cost. These
measurements help us give teams more confidence
when selecting models for use with mcp.run tools.
Looking forward, we're planning to expand our test suite to include more models
and cover more domains. Please reach out if you have a specific use
case you'd like to see evaluated or are wondering how a particular model
performs for your task. The mcpx-eval code is also available on
GitHub.
Announcing the first MCP March Madness Tournament!
This month, we're hosting a face-off like you've never seen before. If "AI
Athletes" wasn't on your 2025 bingo card, don't worry... it's not quite like
that.
We're putting the best of today's API-driven companies head-to-head in a matchup
to see how their MCP servers perform when tool-equipped Large Language Models
(LLMs) are tasked with a series of challenges.
Each week in March, we will showcase two competing companies in a test to see
how well AI can use their product through API interactions.
This will involve a Task that describes
specific work requiring the use of the product via its API.
For example:
"Create a document titled 'World Populations' and add a table containing the
world's largest countries including their name, populations, and ranked by
population size. Provide the URL to this new document in your response."
In this example matchup, we'd run this exact same Task twice - first using
Notion tools, and in another run, Google Docs. We'll compare the outputs,
side-effects, and the time spent executing. Using models from Anthropic and
OpenAI, we'll record each run so you can verify our findings!
We're keeping these tasks fairly simple to test the model's accuracy when
calling a small set of tools. But don't worry - things will get spicier in the
Grand Finale!
Additionally, we're releasing our first evaluation framework for MCP-based tool
calling, which we'll use to run these tests. We really want to exercise the
tools and prompts as fairly as possible, and we'll publish all evaluation data
for complete transparency.
As a prerequisite, we're using MCPs available on our public registry at
www.mcp.run. This means they may not have been created
directly by the API providers themselves. Anyone is able to publish MCP servlets
(WebAssembly-based, secure & portable MCP Servers) to the registry. However, to
ensure these matchups are as fair as possible, we're using servlets that have
been generated from the official OpenAPI specifications provided by each
company.
MCP Servers [...] generated from the official OpenAPI specification provided
by each company.
Wait, did I read that right?
Yes, you read that right! We'll be making this generator available later this
month... so sign up and follow along for that
announcement.
So, this isn't just a test of how well AI can use an API - it's also a test of
how comprehensive and well-designed a platform's API specification is!
As mentioned above, we're excited to share our new evaluation framework for LLM
tool calling!
mcpx-eval is a framework for evaluating LLM tool calling using mcp.run tools.
The primary focus is to compare the results of open-ended prompts, such as
mcp.run Tasks. We're thrilled to provide this resource to help users make
better-informed decisions when selecting LLMs to pair with mcp.run tools.
If you're interested in this kind of technology, check out the repository on
GitHub, and read our
full announcement for an in-depth look
at the framework.
After 3 rounds of 1-vs-1 MCP face-offs, we're taking things up a notch. We'll
put the winning MCP Servers on one team and the challengers on another. We'll
create a new Task with more sophisticated work that requires using all 3
platforms' APIs to complete a complex challenge.
We put two of the best Postgres platforms head-to-head to kick off MCP March
Madness! Watch the matchup in real time, executing a Task to ensure a project is
ready to use and to generate and execute the SQL needed to set up a database to
manage our NewsletterOS application.
Here's the prompt:
```
Pre-requisite:
- a database and or project to use inside my account - name: newsletterOS

Create the tables necessary to act as the primary transactional database for a
Newsletter Management System, where its many publishers manage the creation of
newsletters and the subscribers to each newsletter.

I expect to be able to work with tables of information including data on:
- publisher
- subscribers
- newsletters
- subscriptions (mapping subscribers to newsletters)
- newsletter_release (contents of newsletter, etc)
- activity (maps publisher to a enum of activity types & JSON)

Execute the necessary queries to set my database up with this schema.
```
In each Task, we attach the Supabase
and Neon mcp.run servlets to the prompt,
giving our Task access to manage those respective accounts on our behalf via
their APIs.
See how Supabase handles our Task as Claude 3.5 Sonnet uses MCP server tools:
Next, see how Neon handles the same Task, leveraging Claude 3.5 Sonnet and the
Neon MCP server tools we generated from their OpenAPI spec.
Unfortunately, Neon was unable to complete the Task as-is, using only the
functionality exposed via their official OpenAPI spec. But they can (and
hopefully will!) make it possible for an OpenAPI consumer to run SQL this way. As noted
in the video, their
hand-written MCP Server does
support this. We'd love to make this just as feature-rich on mcp.run, so any Agent or
AI App in any language or framework (even running on mobile devices!) can work
just as seamlessly.
In addition to the Tasks we ran, we also executed this prompt with our own eval
framework mcpx-eval. We configure this
eval using the following, and when it runs we provide the profile where the
framework can load and call the right tools:
name = "neon-vs-supabase" max-tool-calls = 100 prompt = """ Pre-requisite: - a database and or project to use inside my account - name: newsletterOS Create the tables and any seed data necessary to act as the primary transactional database for a Newsletter Management System, where its many publishers manage the creation of newsletters and the subscribers to each newsletter. Using tools, please create tables for: - publisher - subscribers - newsletters - subscriptions (mapping subscribers to newsletters) - newsletter_release (contents of newsletter, etc) - activity (maps publisher to a enum of activity types & JSON) Execute the necessary commands to set my database up for this. When all the tables are created, output the queries to describe the database. """ check=""" Use tools and the output of the LLM to check that the tables described in the <prompt> have been created. When selecting tools you should never in any case use the search tool. """ ignore-tools = [ "v1_create_a_sso_provider", "v1_update_a_sso_provider", ]
Neon (left) outperforms Supabase (right) on the accuracy dimension by a few
points - likely due to better OpenAPI spec descriptions, and potentially more
specific endpoints. These materialize as tool calls and tool descriptions, which
are provided as context to the inference run, and make a big difference.
In all, both platforms did great, and if Neon adds query execution support via
OpenAPI, we'd be very excited to put it to use.
Everybody loves email, right? Today we're comparing some popular email platforms
to see how well their APIs are designed for AI usage. Can an Agent or AI app
successfully carry out our task? Let's see!
Here's the prompt:
```
Unless it already exists, create a new audience for my new newsletter called: "{{ audience }}"

Once it is created, add a test contact: "{{ name }} {{ email }}".

Then, send that contact a test email with some well-designed email-optimized HTML
that you generate. Make the content and the design/theme relevant based on the
name of the newsletter for which you created the audience.
```
Notice how we have parameterized this prompt with replacement parameters! This
allows mcp.run Tasks
to be dynamically updated with values - especially helpful when triggering
them from an HTTP call or Webhook.
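For example, a trigger for this Task over HTTP could look roughly like the sketch below; the endpoint URL and payload shape are hypothetical, not the actual mcp.run API:

```typescript
// Hypothetical trigger for a parameterized Task over HTTP.
// The endpoint URL and payload shape are illustrative, not the actual mcp.run API.
async function triggerNewsletterTask(): Promise<void> {
  const response = await fetch("https://example.com/tasks/newsletter-setup/trigger", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      // Values substituted into {{ audience }}, {{ name }}, and {{ email }}.
      audience: "cat-facts",
      name: "Zach",
      email: "zach@example.com",
    }),
  });
  console.log("Task run started:", response.status);
}

triggerNewsletterTask();
```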
In each Task, we attach the Resend and
Loops mcp.run servlets to the prompt,
giving our Task access to manage those respective accounts on our behalf via
their APIs.
See how Resend handles our Task using its MCP server tools:
Next, see how Loops handles the same Task, leveraging the Loops MCP server tools
we generated from their OpenAPI spec.
Similar to Neon in Round 1, Loops was unable to complete the Task as-is, using
only the functionality exposed via their official OpenAPI spec. Hopefully they
add the missing API surface area to enable an AI application or Agent to send
transactional email along with a new template on the fly.
Resend was clearly designed to be extremely flexible, and the model was able to
figure out exactly what it needed to do in order to perfectly complete our Task.
In addition to the Tasks we ran, we also executed this prompt with our own eval
framework mcpx-eval. We configure this
eval using the following, and when it runs we provide the profile where the
framework can load and call the right tools:
name = "loops-vs-resend" prompt = """ Unless it already exists, create a new audience for my new newsletter called: "cat-facts" Once it is created, add a test contact: "Zach [email protected]" Then, send that contact a test email with some well-designed email-optimized HTML that you generate. Make the content and the design/theme relevant based on the name of the newsletter for which you created the audience. """ check=""" Use tools to check that the audience exists and that the email was sent correctly """
Resend (left) outperforms Loops (right) across the board. This is partly because Loops
is missing the functionality needed to complete the task, but also likely because Resend's
OpenAPI spec is extremely comprehensive and includes very rich descriptions and
detail.
Remember, all of this makes its way into the context of the inference request,
and influences how the model decides to respond with a tool request. The better
your descriptions, the more accurately the model will use your tool!
If you're looking for something like DeepResearch, without the "PhD-level
reasoning" or the price tag that goes along with it, then this is the round for
you!
Perplexity is a household name, and packs a punch for sourcing relevant and
recent information on any subject. Through its Sonar API, we can
programmatically make our way through the web. Brave exposes its powerful, more
traditional search engine via API. Which one can deliver the best results for us
when asked to find recent news and information about a given topic?
Here's the prompt:
```
We need to find the latest, most interesting and important news for people who
have subscribed to our "{{ topic }}" newsletter.

To do this, search the web for news and information about {{ topic }}, do many
searches for newly encountered & highly related terms, associated people, and
other related insights that would be interesting to our subscribers.

These subscribers are very aware of {{ topic }} space and what is happening, so
when we find a good source on the web, also add some intellegent & researched
prose around the article or content. Limit this to just a sentence or two, and
include it in the output you provide.

Output all of the links you find on the web and your expert additional prose in
a Markdown format so it can be read and approved by a 3rd party.
```
Notice how we have parameterized this prompt with replacement parameters! This
allows mcp.run Tasks
to be dynamically updated with values - especially helpful when triggering
them from an HTTP call or Webhook.
In each Task, we attach the
Perplexity and
Brave Search mcp.run servlets to the
prompt, giving our Task access to manage those respective accounts on our behalf
via their APIs.
This round, we've combined the Task runs into a single video. Check them out:
Here's the full, rendered output from each of the Tasks run in the video. What
do you think, which did a better job finding us results for "AI Agent" news to
include in a newsletter?
Perplexity
Brave Search
AI Agents Newsletter: Latest Developments - March 2025
NVIDIA's Game-Changing AI Agent Infrastructure
NVIDIA AI-Q Blueprint and AgentIQ Toolkit
NVIDIA unveiled AI-Q, a comprehensive Blueprint for developing agentic systems that's reshaping how enterprises build AI agents. The framework integrates NVIDIA's accelerated computing with partner storage platforms and software tools.
The AI-Q Blueprint represents NVIDIA's strategic move to dominate the enterprise AI agent infrastructure market, positioning them as the essential foundation for companies building sophisticated agent systems.
Llama Nemotron Model Family
NVIDIA launched the Llama Nemotron family of open reasoning AI models designed specifically for agent development. Available in three sizes (Nano: 8B, Super: 49B, and Ultra: 253B parameters), these models offer advanced reasoning capabilities with up to 20% improved accuracy over base Llama models.
These models are particularly significant as they offer hybrid reasoning capabilities that let developers toggle reasoning on/off to optimize token usage and costs - a critical feature for enterprise deployment that could accelerate adoption.
Enterprise AI Agent Adoption
Industry Implementation Examples
Yum Brands is deploying voice ordering AI agents in restaurants, with plans to roll out to 500 locations this year.
Visa is using AI agents to streamline cybersecurity operations and automate phishing email analysis.
Rolls-Royce has implemented AI agents to assist service desk workers and streamline operations.
While these implementations show promising use cases, the ROI metrics remain mixed - only about a third of C-suite leaders report substantial ROI in areas like employee productivity (36%) and cost reduction, suggesting we're still in early stages of effective deployment.
Zoom's AI Companion Enhancements
Zoom introduced new agentic AI capabilities for its AI Companion, including calendar management, clip generation, and advanced document creation. A custom AI Companion add-on is launching in April at $12/user/month.
Zoom's approach of integrating AI agents directly into existing workflows rather than as standalone tools could be the key to avoiding the "productivity leak" problem, where 72% of time saved by AI doesn't convert to additional throughput.
Developer Tools and Frameworks
OpenAI Agents SDK
OpenAI released a new set of tools specifically designed for building AI agents, including a new Responses API that combines chat capabilities with tool use, built-in tools for web search, file search, and computer use, and an open-source Agents SDK for orchestrating single-agent and multi-agent workflows.
This release significantly lowers the barrier to entry for developers building sophisticated agent systems and could accelerate the proliferation of specialized AI agents across industries.
Eclipse Foundation Theia AI
The Eclipse Foundation announced two new open-source AI development tools: Theia AI (an open framework for integrating LLMs into custom tools and IDEs) and an AI-powered Theia IDE built on Theia AI.
As an open-source alternative to proprietary development environments, Theia AI could become the foundation for a new generation of community-driven AI agent development tools.
Research Breakthroughs
Multi-Agent Systems
Recent research has focused on improving inter-agent communication and cooperation, particularly in autonomous driving systems using LLMs. The development of scalable multi-agent frameworks like Nexus aims to make MAS development more accessible and efficient.
The shift toward multi-agent systems represents a fundamental evolution in AI agent architecture, moving from single-purpose tools to collaborative systems that can tackle complex, multi-step problems.
SYMBIOSIS Framework
Cabrera et al. (2025) introduced the SYMBIOSIS framework, which combines systems thinking with AI to bridge epistemic gaps and enable AI systems to reason about complex adaptive systems in socio-technical contexts.
This framework addresses one of the most significant limitations of current AI agents - their inability to understand and navigate complex social systems - and could lead to more contextually aware and socially intelligent agents.
Ethical and Regulatory Developments
EU AI Act Implementation
The EU AI Act, expected to be fully implemented by 2025, introduces a risk-based approach to regulating AI with stricter requirements for high-risk applications, including mandatory risk assessments, human oversight mechanisms, and transparency requirements.
As the first comprehensive AI regulation globally, the EU AI Act will likely set the standard for AI agent governance worldwide, potentially creating compliance challenges for companies operating across borders.
These standards will be crucial for establishing common practices around AI agent development and deployment, potentially reducing fragmentation in approaches to AI safety and ethics.
This newsletter provides a snapshot of the rapidly evolving AI Agent landscape. As always, we welcome your feedback and suggestions for future topics.
AI Agents Newsletter: Latest Developments and Insights
The Rise of Autonomous AI Agents in 2025
IBM Predicts the Year of the Agent
According to IBM's recent analysis, 2025 is shaping up to be "the year of the AI agent." A survey conducted with Morning Consult revealed that 99% of developers building AI applications for enterprise are exploring or developing AI agents. This shift from passive AI assistants to autonomous agents represents a fundamental evolution in how AI systems operate and interact with the world.
IBM's research highlights a crucial distinction between today's function-calling models and truly autonomous agents. While many companies are rushing to adopt agent technology, IBM cautions that most organizations aren't yet "agent-ready" - the real challenge lies in exposing enterprise APIs for agent integration.
China's Manus: A Revolutionary Autonomous Agent
In a significant development, Chinese researchers launched Manus, described as "the world's first fully autonomous AI agent." Manus uses a multi-agent architecture where a central "executor" agent coordinates with specialized sub-agents to break down and complete complex tasks without human intervention.
Manus represents a paradigm shift in AI development - not just another model but a truly autonomous system capable of independent thought and action. Its ability to navigate the real world "as seamlessly as a human intern with an unlimited attention span" signals a new era where AI systems don't just assist humans but can potentially replace them in certain roles.
MIT Technology Review's Hands-On Test of Manus
MIT Technology Review recently tested Manus on several real-world tasks and found it capable of breaking tasks down into steps and autonomously navigating the web to gather information and complete assignments.
While experiencing some system crashes and server overload, MIT Technology Review found Manus to be highly intuitive with real promise. What sets it apart is the "Manus's Computer" window, allowing users to observe what the agent is doing and intervene when needed - a crucial feature for maintaining appropriate human oversight.
Multi-Agent Systems: The Power of Collaboration
The Evolution of Multi-Agent Architectures
Research into multi-agent systems (MAS) is accelerating, with multiple AI agents collaborating to achieve common goals. According to a comprehensive analysis by data scientist Sahin Ahmed, these systems are becoming increasingly sophisticated in their ability to coordinate and solve complex problems.
The multi-agent approach is proving particularly effective in scientific research, where specialized agents handle different aspects of the research lifecycle - from literature analysis and hypothesis generation to experimental design and results interpretation. This collaborative model mirrors effective human teams and is showing promising results in fields like chemistry.
The Oscillation Between Single and Multi-Agent Systems
IBM researchers predict an interesting oscillation in agent architecture development. As individual agents become more capable, there may be a shift from orchestrated workflows to single-agent systems, followed by a return to multi-agent collaboration as tasks grow more complex.
This back-and-forth evolution reflects the natural tension between simplicity and specialization. While a single powerful agent might handle many tasks, complex problems often benefit from multiple specialized agents working in concert - mirroring how human organizations structure themselves.
Development Platforms and Tools for AI Agents
OpenAI's New Agent Development Tools
OpenAI recently released new tools designed to help developers and enterprises build AI agents using the company's models and frameworks. The Responses API enables businesses to develop custom AI agents that can perform web searches, scan company files, and navigate websites.
OpenAI's shift from flashy agent demos to practical development tools signals their commitment to making 2025 the year AI agents enter the workforce, as proclaimed by CEO Sam Altman. These tools aim to bridge the gap between impressive demonstrations and real-world applications.
Top Frameworks for Building AI Agents
Analytics Vidhya has identified seven leading frameworks for building AI agents in 2025, highlighting their key components: agent architecture, environment interfaces, integration tools, and monitoring capabilities.
These frameworks provide standardized approaches to common challenges in AI agent development, allowing developers to focus on the unique aspects of their applications rather than reinventing fundamental components. The ability to create "crews" of AI agents with specific roles is particularly valuable for tackling multifaceted problems.
Agentforce 2.0: Enterprise-Ready Agent Platform
Agentforce 2.0, scheduled for full release in February 2025, offers businesses customizable agent templates for roles like Service Agent, Sales Rep, and Personal Shopper. The platform's advanced reasoning engine enhances agents' problem-solving capabilities.
Agentforce's approach of providing ready-made templates while allowing extensive customization strikes a balance between accessibility and flexibility. This platform exemplifies how agent technology is being packaged for enterprise adoption with minimal technical barriers.
Ethical Considerations and Governance
The Ethics Debate Sparked by Manus
The launch of Manus has intensified debates about AI ethics, security, and oversight. Margaret Mitchell, Hugging Face Chief Ethics Scientist, has called for stronger regulatory action and "sandboxed" environments to ensure agent systems remain secure.
Mitchell's research asserts that completely autonomous AI agents should be approached with caution due to potential security vulnerabilities, diminished human oversight, and susceptibility to manipulation. As AI's capabilities grow, so does the necessity of aligning it with human ethics and establishing appropriate governance frameworks.
AI Governance Trends for 2025
Following the Paris AI Action Summit, several key governance trends have emerged for 2025, including stricter AI regulations, enhanced transparency requirements, and more robust risk management frameworks.
The summit emphasized "Trust as a Cornerstone" for sustainable AI development. Interestingly, AI is increasingly being used to govern itself, with automated compliance tools monitoring AI models, verifying regulatory alignment, and detecting risks in real-time becoming standard practice.
Market Projections and Industry Impact
The Economic Impact of AI Agents
Deloitte predicts that 25% of enterprises using Generative AI will deploy AI agents by 2025, doubling to 50% by 2027. This rapid adoption is expected to create significant economic value across multiple sectors.
By 2025, AI systems are expected to evolve into collaborative networks that mirror effective human teams. In business contexts, specialized AI agents will work in coordination - analyzing market trends, optimizing product development, and managing customer relationships simultaneously.
Forbes Identifies Key AI Trends for 2025
Forbes has identified five major AI trends for 2025, with autonomous AI agents featuring prominently alongside open-source models, multi-modal capabilities, and cost-efficient automation.
The AI landscape in 2025 is evolving beyond large language models to encompass smarter, cheaper, and more specialized solutions that can process multiple data types and act autonomously. This shift represents a maturation of the AI industry toward more practical and integrated applications.
Conclusion: The Path Forward
As we navigate the rapidly evolving landscape of AI agents in 2025, the balance between innovation and responsibility remains crucial. While autonomous agents offer unprecedented capabilities for automation and problem-solving, they also raise important questions about oversight, ethics, and human-AI collaboration.
The most successful implementations will likely be those that thoughtfully integrate agent technology into existing workflows, maintain appropriate human supervision, and adhere to robust governance frameworks. As IBM's researchers noted, the challenge isn't just developing more capable agents but making organizations "agent-ready" through appropriate API exposure and governance structures.
For businesses and developers in this space, staying informed about both technological advancements and evolving regulatory frameworks will be essential for responsible innovation. The year 2025 may indeed be "the year of the agent," but how we collectively shape this technology will determine its lasting impact.
As noted in the recording, both Perplexity and Brave Search servlets did a great
job. It's difficult to say who wins off vibes alone... so let's use some 🧪
science! Leveraging mcpx-eval to help
us decide removes the subjective component of declaring a winner.
name = "perplexity-vs-brave" prompt = """ We need to find the latest, most interesting and important news for people who have subscribed to our AI newsletter. To do this, search the web for news and information about AI, do many searches for newly encountered & highly related terms, associated people, and other related insights that would be interesting to our subscribers. These subscribers are very aware of AI space and what is happening, so when we find a good source on the web, also add some intellegent & researched prose around the article or content. Limit this to just a sentence or two, and include it in the output you provide. Output all of the links you find on the web and your expert additional prose in a Markdown format so it can be read and approved by a 3rd party. Only use tools available to you, do not use the mcp.run search tool """ check=""" Searches should be performed to collect information about AI, the result should be a well formatted and easily understood markdown document """ expected-tools = [ "brave-web-search", "brave-image-search", "perplexity-chat" ]
Perplexity (left) outperforms Brave (right) on practically all dimensions,
except that it does end up hallucinating every once in a while. This is a hard
one to judge, but if we only look at the data, the results are in Perplexity's
favor. We want to highlight that Brave did an incredible job here though, and
for most search tasks, we would highly recommend it.
Originally, the Grand Finale was going to combine the top 3 MCP servlets and put
them against the bottom 3. However, since Neon and Loops were unable to complete
their tasks, we figured we'd do something a little bit more interesting.
Tune in next week to see a "headless application" at work. Combining Supabase,
Resend and Perplexity MCPs to collectively carry out a sophisticated task.
Without further ado, let's get right into it! The grand finale combines
Supabase,
Resend, and
Perplexity into a mega MCP
Task that effectively produces an entire application
that runs a newsletter management system, "newsletterOS".
See what we're able to accomplish with just a single prompt, using our powerful
automation platform and the epic MCP tools attached:
Here's the prompt we used to run this Task:
Prerequisites: - "newsletterOS" project and database Create a new newsletter called {{ topic }} in my database. Also create an audience in Resend for {{ topic }}, and add a test subscriber contact to this audience: name: Joe Smith email: [email protected] In the database, add this same contact as a subscriber to the {{ topic }} newsletter. Now, find the latest, most interesting and important news for people who have subscribed to our {{ topic }} newsletter. To do this, search the web for news and information about {{ topic }}, do many searches for newly encountered & highly related terms, associated people, and other related insights that would be interesting to our subscribers. Use 3 - 5 news items to include in the newsletter. These subscribers are very aware of {{ topic }} space and what is happening, so when we find a good source on the web, also add some intellegent & researched prose around the article or content. Limit this to just a sentence or two, and include it in the output you provide. Convert your output to email-optimized HTML so it can be rendered in common email clients. Then store this HTML into the newsletter release in the database. Send a test email from "[email protected]" using the same contents to our test subscriber contact for verification.
We'll publish a new post on this blog for each round and update this page with
the results. To stay updated on all announcements, follow
@dylibso on X.
We'll record and upload all matchups to
our YouTube channel, so subscribe to watch the
competitions as they go live.
Interested in how your own API might perform? Or curious about running similar
evaluations on your own tools? Contact us to learn more about our
evaluation framework and how you can get involved!
It should be painfully obvious, but we are in no way affiliated with the NCAA. Good luck to the college athletes in their completely separate, unrelated tournament this month!
Understanding AI runtimes is easier if we first understand traditional
programming runtimes. Let's take a quick tour through what a typical programming
language runtime is comprised of.
Think about Node.js. When you write JavaScript code, you're not just writing
pure computation - you're usually building something that needs to interact with
the real world. Node.js provides this bridge between your code and the system
through its runtime environment.
```javascript
// Node.js example
const http = require("http");
const fs = require("fs");

// Your code can now talk to the network and filesystem
const server = http.createServer((req, res) => {
  fs.readFile("index.html", (err, data) => {
    res.end(data);
  });
});
```
The magic here isn't in the JavaScript language itself - it's in the runtime's
standard library. Node.js provides modules like http, fs, crypto, and
process that let your code interact with the outside world. Without these,
JavaScript would be limited to pure computation like math and string
manipulation.
A standard library is what makes a programming language practically useful.
Node.js is not powerful just because of its syntax - it's powerful because of
its libraries.
Enter the World of LLMs: Pure Computation Needs Tools
Now, let's map this to Large Language Models (LLMs). An LLM by itself is like
JavaScript without Node.js, or Python without its standard library. It can do
amazing things with text and reasoning, but it can't read or write files, call
external APIs, or take action in other systems on its own.
Just as a C++ compiler needs to link object files and shared libraries, an AI
runtime needs to solve a similar problem: how to connect LLM outputs to tool
inputs, and tool outputs back to the LLM's context.
This involves:
Function calling conventions (how does the LLM know how to use a tool?)
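Stripped to its core, that linking job is a loop that feeds tool results back into the model's context. Here is a simplified, provider-agnostic sketch; every function name in it is a placeholder rather than a real API:

```typescript
// Simplified, provider-agnostic sketch of an AI runtime's tool-calling loop.
// llmComplete and runTool are stand-ins for a real model API and tool host.
type ToolCall = { name: string; arguments: Record<string, unknown> };
type LlmTurn = { text?: string; toolCalls: ToolCall[] };
type Message = { role: string; name?: string; content: string };

async function llmComplete(messages: Message[], tools: string[]): Promise<LlmTurn> {
  // Placeholder: a real implementation would call the model provider's API.
  return { text: "done", toolCalls: [] };
}

async function runTool(call: ToolCall): Promise<string> {
  // Placeholder: a real runtime would dispatch to the named tool.
  return `result of ${call.name}`;
}

async function runPrompt(prompt: string, tools: string[]): Promise<string> {
  const messages: Message[] = [{ role: "user", content: prompt }];

  while (true) {
    const turn = await llmComplete(messages, tools);
    if (turn.toolCalls.length === 0) {
      return turn.text ?? ""; // the model is done; no more tools to run
    }
    for (const call of turn.toolCalls) {
      const result = await runTool(call); // execute the tool in the runtime
      // Feed the tool output back into the model's context for the next turn.
      messages.push({ role: "tool", name: call.name, content: result });
    }
  }
}
```

Everything else a runtime provides - scheduling, credentials, sandboxing the tools themselves - wraps around a loop like this.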
The rise of AI runtimes represents a pivotal shift in how we interact with AI
technology. While the concept that "The Prompt is the Program" is powerful,
the current landscape of AI development tools presents a significant barrier to
entry. Let's break this down:
This is where platforms such as mcp.run's Tasks are breaking new ground. By
providing a runtime environment that executes prompts and tools without
requiring coding expertise, it makes AI integration accessible to everyone. Key
advantages include:
Natural Language Interface
Users can create automation using plain English prompts
```
# Marketing Analysis Task

"Every Monday at 9 AM, analyze our social media metrics, compare them to last
week's performance, and send a summary to the #marketing channel"
```
Equipped with a "marketing" profile containing Sprout Social and Slack tools
installed, the runtime knows exactly when to execute these tool's functions,
what inputs to pass, and understands how to use their outputs to carry out the
task at hand.
```
# Sales Lead Router

"When a new contact submits our web form, analyze their company's website for
deal sizing, and assign them to a rep based on this mapping:
  small business: Zach S.
  mid-market: Ben E.
  enterprise: Steve M.
Then send a summary of the lead and the assignment to our #sales channel."
```
Similarly, equipped with a "sales" profile containing web search and Slack tools
installed, this prompt would automatically use the right tools at the right
time.
This democratization of AI tools through universal runtimes is reshaping how
organizations operate. When "The Prompt is the Program," everyone becomes
capable of creating sophisticated automation workflows. This leads to:
Reduced technical barriers
Faster implementation of AI solutions
More efficient resource utilization
Increased innovation across departments
Better cross-functional collaboration
The true power of AI runtimes isn't just in executing prompts and linking
tools - it's in making these capabilities accessible to everyone who can benefit
from them, regardless of their technical background.
Shortly after Anthropic launched the Model Context Protocol (MCP),
we released mcp.run - a managed
platform that makes it simple to host and install secure MCP Servers. The
platform quickly gained traction, winning Anthropic's San Francisco MCP
Hackathon and facilitating millions of tool downloads globally.
Since then, MCP has expanded well beyond Claude Desktop, finding its way into
products like Sourcegraph,
Cursor,
Cline,
Goose, and many others. While these
implementations have proven valuable for developers, we wanted to make MCP
accessible to everyone.
We asked ourselves: "How can we make MCP useful for all users, regardless of
their technical background?"
Today, we're excited to announce Tasks - an AI Runtime that executes prompts
and tools, allowing anyone to create intelligent operations and workflows that
integrate seamlessly with their existing software.
The concept is simple: provide Tasks with a prompt and tools, and it creates a
smart, hosted service that carries out your instructions using the tools you've
selected.
Tasks combine the intuitive interface of AI chat applications with the
automation capabilities of AI agents through two key components: prompts and
tools.
Think of Tasks as a bridge between your instructions (prompts) and your everyday
applications. You can create powerful integrations and automations without
writing code or managing complex infrastructure. Tasks can be triggered in three
ways:
Manually via the Tasks UI
Through HTTP events (like webhooks or API requests; see the sketch below)
On a schedule (recurring at intervals you choose)
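To make the HTTP option concrete, here is a small TypeScript sketch of a web
form handler that fires a Task. The trigger URL below is a deliberate
placeholder, not the real API shape; each HTTP-triggered Task exposes its own
URL in its settings:

// Hypothetical sketch: forwarding a web-form submission to an HTTP-triggered Task.
// The URL is a placeholder; copy the real trigger URL from your Task's settings.
const TASK_TRIGGER_URL = "https://example.invalid/tasks/sales-lead-router/trigger";

async function onFormSubmission(form: { name: string; email: string; company: string }) {
  const res = await fetch(TASK_TRIGGER_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // The Task's prompt decides what to do with this payload.
    body: JSON.stringify(form),
  });
  if (!res.ok) {
    throw new Error(`Task trigger failed with status ${res.status}`);
  }
}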
Here are some practical examples of what Tasks can do:
Receive Webflow form submissions and automatically route them to appropriate
Slack channels and log records in Notion.
Create a morning news digest that scans headlines at 7:30 AM PT, summarizes
relevant articles, and emails your marketing team updates about your company
and competitors.
Set up weekly project health checks that review GitHub issues and pull
requests every Friday at 2 PM ET, identifying which projects are on track and
which need attention, assessed against product requirements in Linear.
Automate recurring revenue reporting by pulling data from Recurly on the first
of each month, analyzing subscription changes, saving the report to Google
Docs, and sharing a link with your sales team.
While we've all benefited from conversational AI tools like Claude and ChatGPT,
their potential extends far beyond simple chat interactions. The familiar prompt
interface we use to get answers or generate content can now become the
foundation for powerful, reusable programs.
Tasks democratize automation by allowing anyone to create sophisticated
workflows and integrations using natural language prompts. Whether you're
building complex agent platforms or streamlining your organization's systems,
Tasks adapt to your needs.
We're already seeing users build with Tasks in innovative ways, from developing
advanced agent platforms to creating next-generation system integration
solutions. If you're interested in learning how Tasks can benefit your
organization, please reach out - we're here to help.
Yesterday, Microsoft CEO Satya Nadella announced a major reorganization focused
on AI platforms and tools, signaling the next phase of the AI revolution.
Reading between the lines of Microsoft's announcement and comparing it to the
emerging universal tools ecosystem reveals fascinating parallels that
highlight why standardized, portable AI tools are critical for enterprise
success.
Microsoft's reorganization announcement highlights the massive transformation
happening in enterprise software. The success of this transformation will depend
on having the right tools and platforms to implement these grand visions.
Universal tools provide the practical foundation needed to safely adapt AI
capabilities across different contexts.
As we enter what Nadella calls "the next innings of this AI platform shift," the
role of universal tools becomes increasingly critical. They provide the
standardized, secure, and portable layer needed to implement ambitious AI
platform visions across different environments and use cases.
For enterprises looking to succeed in this AI transformation, investing in
universal tools and standardized approaches isn't just good practice; it's
becoming essential.
We're working with companies and agencies looking to enrich AI applications
with tools. If you're considering how agents might play a role in your
infrastructure or business operations, don't hesitate to reach out!
Compile once, run anywhere? You bet! After our mcp.run OpenAI integration and some teasing, we're excited to launch mcpx4j, our client library for the JVM ecosystem.
Built on the new Extism Chicory SDK, mcpx4j is a lightweight library that leverages the pure-Java Chicory Wasm runtime. Its simple design allows for seamless integration with diverse AI frameworks across the mature JVM ecosystem.
To demonstrate this flexibility, we've prepared examples using popular frameworks:
Spring AI brings extensive model support; our examples focus on OpenAI and Ollama modules, but the framework makes it easy to plug in a model of your choice. Get started with our complete tutorial.
LangChain4j offers a wide range of model integrations. We showcase implementations with OpenAI and Ollama, but you can easily adapt them to work with your preferred model. Check out our step-by-step guide to learn more.
One More Thing. mcpx4j doesn't just cross framework boundaries - it crosses platforms too! Following our earlier Android experiments, we're now sharing our Android example with Gemini integration, along with a complete step-by-step tutorial.
Although MCP is an open standard, it has so far lived primarily in the domain
of Anthropic products. But what about OpenAI? Do we need to wait for them to add support? How
can we connect our tools to o3 when it releases this month?
Thanks to the simplicity and portability of mcp.run servlets, you don't need to
wait. Today we're announcing the availability of our
initial OpenAI support for
mcp.run.
We're starting with support for
OpenAI's node library, but we have more
coming right around the corner.
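As a rough sketch of what this looks like with OpenAI's node library: the
getOpenAITools and callServletTool helpers below are placeholders standing in
for the integration's actual exports (check its README for the real API), while
the OpenAI client calls use the library's standard chat-completions interface:

import OpenAI from "openai";

// Placeholder helpers standing in for the mcp.run integration's real exports:
// one lists installed servlets as OpenAI tool (function) schemas, the other
// executes a tool call inside its sandboxed Wasm servlet.
import { getOpenAITools, callServletTool } from "./mcp-run-tools";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function ask(question: string) {
  const tools = await getOpenAITools();
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: question }],
    tools,
  });

  // If the model decided to use a tool, run it in the servlet sandbox.
  for (const toolCall of completion.choices[0].message.tool_calls ?? []) {
    const output = await callServletTool(
      toolCall.function.name,
      JSON.parse(toolCall.function.arguments), // arguments arrive as a JSON string
    );
    console.log(output);
  }
}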
As previously discussed, WebAssembly is the foundation of this
technology. Every servlet you install on the mcpx server is powered by
a Wasm binary: mcpx fetches these binaries and executes commands at
the request of your preferred MCP Client.
This Wasm core is what enables mcpx to run on all major platforms from day
one. However, while mcpx is currently the primary consumer of the
mcp.run service, it's designed to be part of a much broader ecosystem.
In fact, while holiday celebrations were in full swing, we've been busy
developing something exciting!
Recently, we demonstrated how to integrate mcp.run's Wasm tools into a Java host
application. In the following examples, you can see mcp.run tools in action,
using the Google Maps API for directions:
You can now fetch any mcp.run tool with its configuration and connect it to
models supported by
Spring AI (see our demos)
Similarly, you can connect any mcp.run tool to models supported by
LangChain4j, including
Jlama integration (see our demos)
This goes beyond just connecting to a local mcpx instance (which works
seamlessly). Thanks to Chicory, we're running the Wasm binaries
directly within our applications!
With this capability to run MCP servlet tools via mcp.run locally in
our Java applications, we tackled an exciting challenge...
While external service calls are often necessary (like our demo's use of the
Google Maps API), AI is becoming increasingly personal and embedded in our daily
lives. As AI and agents migrate to our personal devices, the traditional model
of routing everything through internet services becomes less ideal. Consider
these scenarios:
Your banking app shouldn't need to send statements to a remote finance agent
Your health app shouldn't transmit personal records to external telehealth
agents
Personal data should remain personal
As local AI capabilities expand, we'll see more AI systems operating entirely
on-device, and their supporting tools must follow suit.
While this implementation is still in its early stages, it already demonstrates
impressive capabilities. The Wasm binary servlet runs seamlessly on-device, is
fully sandboxed (granted access only to the Google Maps API), and executes quickly.
We're working to refine the experience and will share more developments soon.
We're excited to see what you will create with these tools! If you're
interested in exploring these early demos, please reach out!