A comprehensive platform for testing and evaluating AI agents.
Evals for Agent Interop consists of three main components:
FastAPI backend that provides:
- Test case dataset management
- Agent registration and evaluation
- MCP (Model Context Protocol) server for tool execution
- Cosmos DB integration for data persistence
Sample AI agent implementation that:
- Connects to the API's MCP server for tool access
- Uses Azure OpenAI for intelligent task execution
- Demonstrates calendar scheduling and email capabilities
- Serves as a reference implementation for agent development
React-based web interface for:
- Creating and managing test datasets
- Registering and configuring agents
- Running evaluations
- Viewing test results and metrics
Evals for Agent Interop requires Azure Cosmos DB and Azure OpenAI resources. Use the provided Bicep template to deploy them:
```bash
cd infra

# Deploy infrastructure (Bash)
./deploy.sh

# Or using PowerShell (Windows)
.\deploy.ps1

# Or manually with Azure CLI
az deployment group create \
  --resource-group evals-interop-dev-rg \
  --template-file main.bicep \
  --parameters @main.parameters.json
```

What gets deployed:
- Azure Cosmos DB (Serverless) with containers for datasets, testcases, agents, evaluations
- Azure Foundry resource for hosting LLMs
After deployment:
- Manually deploy GPT 4.1 in the Azure Foundry portal
- Copy the output values from the deployment script to your `.env` file at the root of the repository
For detailed infrastructure information, deployment options, and troubleshooting, see infra/README.md.
- `infra/main.bicep` - Main infrastructure template
- `infra/main.parameters.json` - Deployment parameters
- `infra/deploy.sh` - Automated deployment script (Bash)
- `infra/deploy.ps1` - Automated deployment script (PowerShell)
- `infra/README.md` - Detailed infrastructure documentation
- Azure OpenAI API credentials (from infrastructure deployment)
- Azure Cosmos DB instance (from infrastructure deployment)
- Docker Desktop installed and running (for Docker-based development)
Create a single .env file at the repo root from the provided example:
```bash
# Copy the example environment file
cp .env.example .env

# Edit .env with your Azure credentials
nano .env
```

Required Configuration:
`.env`: Contains all configuration for both API and Agent services:
- Azure OpenAI credentials (shared by both services)
- Cosmos DB credentials (used by API)
- MCP server URL (used by Agent)
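As an illustration of the simple `KEY=VALUE` format these files use, a minimal parser might look like the sketch below. This is a hypothetical helper, not part of the repo; the variable names in the example are illustrative, and real projects typically use python-dotenv instead.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments.

    Illustrative only -- python-dotenv additionally handles quoting,
    export prefixes, and multi-line values.
    """
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


# Hypothetical .env contents (names are illustrative)
example = """
# Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://example.openai.azure.com/
MCP_SERVER_URL=http://localhost:8000/mcp
"""
config = parse_env(example)
```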
Optional: Create a virtual environment
```bash
# Create virtual environment
python -m venv .venv

# Activate virtual environment (Linux/macOS)
source .venv/bin/activate

# Activate virtual environment (Windows)
.venv\Scripts\activate
```

Install required packages:
```bash
pip install -r src/api/requirements.txt
pip install -r src/agents/requirements.txt
```

Upload sample datasets to Cosmos DB:
```bash
cd src/api
python cosmos_preload.py
```

💡 Tip: Use VS Code launch profiles for easy debugging! The workspace includes pre-configured launch profiles in .vscode/launch.json:
- API - Runs the FastAPI backend on port 8000 with hot reload enabled
- Agent - Runs the sample agent server on port 8001 with hot reload enabled
- WebApp - Runs the React frontend development server on port 5000
- API + Agent + WebApp - Compound configuration that starts all three services simultaneously
To use: Open the Run and Debug panel (Ctrl+Shift+D), select a profile, and press F5 to start debugging.
Manual Setup (if not using launch profiles):
API

```bash
python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000 --reload
```

Agent

```bash
python -m uvicorn src.agents.agent_server:app --host 0.0.0.0 --port 8001 --reload
```

Webapp

```bash
cd src/webapp
npm install
npm run dev
```

Build and Start All Services

```bash
docker-compose up --build
```

First build takes 5-10 minutes. Subsequent builds are much faster (seconds to minutes).
🐳 Docker Networking Notes

The .env file is configured for local development. When running in Docker, services communicate using Docker service names:
- Service-to-Service Communication:
  - Agent connects to API using `http://api:8000/mcp` (Docker service name)
  - Agent registration uses `http://agent:8001/agents/calendar/invoke` (Docker service name)
  - Inter-service URLs use container names: `api`, `agent`, `webapp`
- Host Access (from your browser/tools):
  - API: `http://localhost:8000`
  - Agent: `http://localhost:8001`
  - WebApp: `http://localhost:5000`
- Environment Overrides: The `MCP_SERVER_URL` is automatically overridden in docker-compose.yml for container networking
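The override pattern amounts to reading `MCP_SERVER_URL` from the environment and falling back to a localhost default when it is unset. A small sketch (the fallback value here is illustrative, not the project's actual default):

```python
import os


def resolve_mcp_url() -> str:
    """Return the MCP server URL, preferring the environment override.

    Under docker-compose the variable is set to http://api:8000/mcp
    (the Docker service name); outside Docker the localhost fallback
    applies. The default below is illustrative.
    """
    return os.environ.get("MCP_SERVER_URL", "http://localhost:8000/mcp")


url = resolve_mcp_url()
```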
Development Workflow
```bash
# Start all services
docker-compose up

# Rebuild specific service after code changes
docker-compose build api
docker-compose build agent
docker-compose build webapp

# View logs for a specific service
docker-compose logs -f api

# Stop all services
docker-compose down
```

- Frontend: http://localhost:5000
- API Docs: http://localhost:8000/api/docs
- Agent Invoke: http://localhost:8001/agents/calendar/invoke
```
evals-for-agent-interop/
├── .env                  # Consolidated configuration for all services
├── .env.example          # Configuration template
├── src/
│   ├── api/              # Backend API service
│   │   └── requirements.txt
│   ├── agents/           # Sample agent implementation
│   │   ├── agent_server.py
│   │   └── requirements.txt
│   └── webapp/           # React frontend
│       └── src/
├── docker-compose.yml    # Multi-service orchestration
├── Dockerfile.api        # API container definition
├── Dockerfile.agent      # Agent container definition
└── Dockerfile.webapp     # Webapp container definition
```
To integrate your own agent with the evaluation platform, your agent must expose an unauthenticated HTTP POST endpoint that conforms to the following specification.
Method: POST
Your agent can use any endpoint path (e.g., /invoke, /agents/calendar/invoke, /api/v1/execute). You'll register this endpoint URL when configuring your agent in the platform.
Request Body:

```json
{
  "dataset_id": "string",
  "test_case_id": "string",
  "agent_id": "string",
  "evaluation_run_id": "string",
  "input": "string"
}
```

Your agent must return a JSON response with the following structure:
```json
{
  "response": "string",
  "tool_calls": [
    {
      "name": "string",
      "arguments": [
        {
          "name": "string",
          "value": "any"
        }
      ]
    }
  ]
}
```

Fields:
- `response` (string, required): The agent's natural language response to the user's request
- `tool_calls` (array, required): List of tools the agent invoked during execution
  - `name` (string): The name of the tool that was called
  - `arguments` (array): List of arguments passed to the tool
    - `name` (string): The parameter name
    - `value` (any): The parameter value
Example Response:
```json
{
  "response": "I've scheduled a 1-hour meeting with alice@company.com and bob@company.com for tomorrow at 2pm to discuss Q4 planning.",
  "tool_calls": [
    {
      "name": "mcp_CalendarTools_graph_createEvent",
      "arguments": [
        {"name": "subject", "value": "Q4 Planning Meeting"},
        {"name": "start", "value": "2025-11-05T14:00:00"},
        {"name": "end", "value": "2025-11-05T15:00:00"},
        {"name": "attendees", "value": ["alice@company.com", "bob@company.com"]}
      ]
    }
  ]
}
```

See src/agents/agent_server.py for a reference implementation.
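To make the contract concrete, here is a minimal, framework-agnostic sketch of a handler body that produces a conforming response. The tool name and logic are hypothetical placeholders, not part of the platform:

```python
def handle_invoke(payload: dict) -> dict:
    """Build a response in the shape the evaluation platform expects.

    `payload` is the request body described above (dataset_id,
    test_case_id, agent_id, evaluation_run_id, input). A real agent
    would call an LLM and execute tools here; this stub records one
    hypothetical tool call for illustration.
    """
    user_input = payload["input"]
    tool_calls = [
        {
            "name": "example_tool",  # hypothetical tool name
            "arguments": [
                {"name": "query", "value": user_input},
            ],
        }
    ]
    return {
        "response": f"Handled request: {user_input}",
        "tool_calls": tool_calls,
    }


result = handle_invoke({
    "dataset_id": "d1",
    "test_case_id": "t1",
    "agent_id": "a1",
    "evaluation_run_id": "r1",
    "input": "Schedule a meeting",
})
```

In practice you would expose this from a POST route in your web framework of choice (the sample agent uses FastAPI).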
- API Documentation - Detailed API endpoints and usage
- Agent Documentation - Agent implementation guide
- Evaluator Guide - Evaluation system details
- Frontend Documentation - Webapp development guide
Error: Host version does not match binary version
```
[ERROR] Cannot start service: Host version "0.25.11" does not match binary version "0.25.12"
```

Solution: This error occurs when there's a mismatch between Vite versions. To fix:

1. Navigate to the webapp folder:

   ```bash
   cd src/webapp
   ```

2. Delete the `node_modules` directory:

   ```bash
   rm -rf node_modules
   ```

3. Reinstall dependencies:

   ```bash
   npm install
   ```

4. Return to the root directory and rebuild the webapp:

   ```bash
   cd ../..
   docker-compose build --no-cache webapp
   docker-compose up
   ```
If you experience throttling or rate limiting errors when running evaluations with many test cases, you can reduce the number of concurrent tests by setting the MAX_CONCURRENT_TESTS environment variable in your .env file:
```bash
# Reduce concurrent tests to avoid rate limiting (default is 5)
MAX_CONCURRENT_TESTS=2
```

Lower values reduce the load on external services (such as Azure OpenAI or your agent endpoint) at the cost of longer evaluation times.
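The effect of a concurrency cap like `MAX_CONCURRENT_TESTS` can be sketched with `asyncio.Semaphore`. This is an illustration of the general pattern, not the platform's actual implementation:

```python
import asyncio


async def run_with_limit(coros, max_concurrent: int):
    """Run coroutines with at most `max_concurrent` in flight at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(coro):
        async with sem:
            return await coro

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(c) for c in coros))


async def fake_test(i: int) -> int:
    await asyncio.sleep(0)  # stand-in for calling the agent endpoint
    return i


results = asyncio.run(run_with_limit([fake_test(i) for i in range(10)], 2))
```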
See SECURITY.md for security policies and vulnerability reporting.
See LICENSE for license information.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.