Measuring the Impact: Runtime and Credit Usage of AI Agents
Measuring Performance: Comparing LLMs with BaristaAgent
In the first two parts of this blog series, we explored the foundation and practical creation of AI agents. In Part 1, I reflected on my initial confusion and how I came to understand the core principles that make AI agents more than just function callers, while Part 2 demonstrated how to build one, step by step, with aiXplain’s Agentic Framework. The result was the BaristaAgent—a digital coffee expert capable of suggesting unique recipes and pulling relevant options from Google Search.
In this final part, I take the BaristaAgent concept further by comparing the performance of four different LLMs, each powering its own version of the agent. The goal? To understand how runtime and credit usage vary across models when performing the same task: creating and retrieving coffee recipes.
The full code is available in this Google Colab notebook.
How I Evaluated the BaristaAgent
Step 1: Defining the LLMs
The first step was selecting four LLMs from aiXplain’s marketplace. Here’s the lineup:
- LLaMA 3.1 70B (hosted on Groq)
- GPT-4o
- AWS Nova Lite
- Gemini 1.5 Pro
Each of these models powered its own version of the BaristaAgent, executing the same query to ensure consistent evaluation.
Step 2: Building and running the agents
For each LLM, I created an agent using aiXplain’s AgentFactory module. The agent’s purpose was simple:
- Generate a unique coffee recipe based on given ingredients.
- Use Google Search API to retrieve similar recipes.
- Present the output in a structured format.
The query was straightforward:
query = "What kind of coffee can I make with orange, cocoa powder, Medium roast Guatemalan coffee beans, milk, and sugar?"
Note: The same logic and tool configuration were applied to all agents to ensure a fair comparison.
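
The "same query, same configuration" loop can be sketched as a small harness. This is a minimal sketch under assumptions, not the notebook's code: `run_all`, `make_stub_agent`, and the stub agents are hypothetical names, and each agent is represented by a plain callable so the example runs without platform credentials. In practice each callable would wrap one of the real agents.

```python
import time
from typing import Any, Callable, Dict

QUERY = (
    "What kind of coffee can I make with orange, cocoa powder, "
    "Medium roast Guatemalan coffee beans, milk, and sugar?"
)


def run_all(agents: Dict[str, Callable[[str], Any]], query: str) -> Dict[str, dict]:
    """Send the same query to every agent and record wall-clock time."""
    results = {}
    for name, run in agents.items():
        start = time.perf_counter()
        response = run(query)  # hypothetical callable wrapping one agent
        elapsed = time.perf_counter() - start
        results[name] = {"response": response, "wall_time": elapsed}
    return results


def make_stub_agent(label: str) -> Callable[[str], str]:
    """Stand-in for a real agent (assumption: real agents take a query string)."""
    return lambda q: f"[{label}] recipe for: {q}"


stub_agents = {name: make_stub_agent(name) for name in ("GPT-4o", "AWS Nova Lite")}
results = run_all(stub_agents, QUERY)
```

Note that the wall-clock time measured here includes network latency, so it is not the same number as the runtime reported by the platform in the agent's response.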
Step 3: Collecting performance metrics
For each agent, I measured two key metrics during execution:
- Runtime: The total time taken by the agent to generate and retrieve recipes.
- Credit usage: The computational cost associated with each execution.
To track these metrics, I used a custom analysis function:
from collections import defaultdict


def analyze_agent_and_tool_metrics(result):
    """Aggregate credits, runtime, and call counts per agent and per tool."""
    # Use 'result["data"]' directly; it is expected to already be a dictionary
    data = result["data"]

    # Initialize dictionaries to store statistics for agents and tools
    agent_metrics = defaultdict(lambda: {'credits': 0, 'runtime': 0, 'calls': 0})
    tool_metrics = defaultdict(lambda: {'credits': 0, 'runtime': 0, 'calls': 0})

    # Initialize totals for agents
    total_agent_credits = 0
    total_agent_runtime = 0
    total_agent_calls = 0

    # Process each step from the intermediate_steps
    for step in data.get('intermediate_steps', []):
        # Process agent statistics ("or 0" turns a None value into 0)
        agent = step['agent']
        credits = step.get('usedCredits', 0) or 0
        runtime = step.get('runTime', 0) or 0

        # Update agent statistics
        agent_metrics[agent]['credits'] += credits
        agent_metrics[agent]['runtime'] += runtime
        agent_metrics[agent]['calls'] += 1

        # Update totals for agents
        total_agent_credits += credits
        total_agent_runtime += runtime
        total_agent_calls += 1

        # Process tool statistics if present
        if step.get('tool_steps'):
            for tool_step in step['tool_steps']:
                tool = tool_step['tool']
                tool_credits = tool_step.get('usedCredits', 0) or 0
                tool_runtime = tool_step.get('runTime', 0) or 0

                # Update tool statistics
                tool_metrics[tool]['credits'] += tool_credits
                tool_metrics[tool]['runtime'] += tool_runtime
                tool_metrics[tool]['calls'] += 1

    # Totals cover agent steps only; tool usage is reported separately
    return {
        'agent_metrics': agent_metrics,
        'tool_metrics': tool_metrics,
        'totals': {
            'credits': total_agent_credits,
            'runtime': total_agent_runtime,
            'calls': total_agent_calls
        }
    }
This allowed me to gather detailed statistics, such as the number of calls, credits consumed, and runtime for both the agent and the tools it used.
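
To make those statistics easier to read, the returned dictionary can be turned into a short text report. The sketch below is illustrative only: `format_metrics_report` is a hypothetical helper, and the example values (including the tool numbers) are made up to show the shape of the analysis output, not measured data.

```python
def format_metrics_report(metrics: dict) -> str:
    """Render per-agent, per-tool, and total statistics as plain text."""
    lines = []
    for section in ("agent_metrics", "tool_metrics"):
        lines.append(section)
        for name, stats in sorted(metrics.get(section, {}).items()):
            lines.append(
                f"  {name}: calls={stats['calls']}, "
                f"credits={stats['credits']:.6f}, runtime={stats['runtime']:.2f}s"
            )
    totals = metrics["totals"]
    lines.append(
        f"totals: calls={totals['calls']}, "
        f"credits={totals['credits']:.6f}, runtime={totals['runtime']:.2f}s"
    )
    return "\n".join(lines)


# Hand-made example in the shape returned by the analysis function
# (values are invented for illustration).
example = {
    "agent_metrics": {"BaristaAgent": {"credits": 0.002, "runtime": 7.08, "calls": 1}},
    "tool_metrics": {"Google Search": {"credits": 0.001, "runtime": 1.5, "calls": 1}},
    "totals": {"credits": 0.002, "runtime": 7.08, "calls": 1},
}
report = format_metrics_report(example)
print(report)
```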
Step 4: Analyzing the results
The results from each agent were stored in JSON files for further analysis. Each agent was run at least 5 times to ensure consistency in the results. Here’s a snippet of what the output looked like:
Agent performance metrics (average of 5 runs)
LLM | Credits used | Runtime (seconds) | Total calls made |
---|---|---|---|
LLaMA 3.1 70B (hosted on Groq) | 0.002 | 7.08 | 1 |
GPT-4o | 0.022 | 11.90 | 1 |
AWS Nova Lite | 0.000095 | 2.71 | 1 |
Gemini 1.5 Pro | 0.007 | 8.23 | 1 |
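
Averages like the ones in this table can be produced by collecting the totals from each run and taking the mean. The following is a minimal sketch under assumptions: `average_runs` is a hypothetical helper, the five sample runs are made-up numbers, and the JSON layout is just one reasonable way to store them.

```python
import json
from statistics import mean


def average_runs(runs):
    """Average the per-run totals (credits, runtime, calls) for one agent."""
    if not runs:
        raise ValueError("need at least one run")
    return {
        key: mean([run[key] for run in runs])
        for key in ("credits", "runtime", "calls")
    }


# Made-up totals for five runs of a single agent (not measured data).
sample_runs = [
    {"credits": 0.002, "runtime": 7.0, "calls": 1},
    {"credits": 0.002, "runtime": 7.5, "calls": 1},
    {"credits": 0.002, "runtime": 6.5, "calls": 1},
    {"credits": 0.002, "runtime": 7.2, "calls": 1},
    {"credits": 0.002, "runtime": 6.8, "calls": 1},
]
averaged = average_runs(sample_runs)

# Serialize runs and their average; write this string to a file to keep it.
payload = json.dumps(
    {"agent": "example-agent", "runs": sample_runs, "average": averaged},
    indent=2,
)
```

Keeping the individual runs next to the average makes it possible to check the spread later, not just the mean.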
Key observations
- Runtime: Nova Lite had the fastest execution time at 2.71 seconds, while GPT-4o was the slowest, taking 11.9 seconds.
- Credit usage: Nova Lite was the most cost-effective, consuming significantly fewer credits (0.00009588), whereas GPT-4o incurred the highest credit usage at 0.022855.
- Performance balance: LLaMA 3.1 70B and Gemini 1.5 Pro struck a balance between runtime and credit efficiency, making them viable options for tasks requiring moderate resource usage.
- Consistency: All agents executed the task with a single API call and delivered structured outputs, ensuring reliable performance across the board.
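
To put these gaps in perspective, each model can be expressed relative to the cheapest and fastest one. The sketch below uses the rounded averages from the table, so the ratios are approximate; `relative_to_best` is a hypothetical helper name. With these figures, GPT-4o comes out at roughly 4.4 times the runtime and more than 200 times the credits of Nova Lite.

```python
# Averages copied from the results table: (credits, runtime in seconds).
table = {
    "LLaMA 3.1 70B (Groq)": (0.002, 7.08),
    "GPT-4o": (0.022, 11.90),
    "AWS Nova Lite": (0.000095, 2.71),
    "Gemini 1.5 Pro": (0.007, 8.23),
}


def relative_to_best(rows):
    """Express each model's credits and runtime as a multiple of the minimum."""
    min_credits = min(credits for credits, _ in rows.values())
    min_runtime = min(runtime for _, runtime in rows.values())
    return {
        name: {
            "credits_x": credits / min_credits,
            "runtime_x": runtime / min_runtime,
        }
        for name, (credits, runtime) in rows.items()
    }


ratios = relative_to_best(table)

# Print models from cheapest to most expensive.
for name, r in sorted(ratios.items(), key=lambda kv: kv[1]["credits_x"]):
    print(f"{name:<22} credits x{r['credits_x']:.1f}  runtime x{r['runtime_x']:.1f}")
```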
Conclusion
This experiment highlighted the practical considerations when choosing an LLM for AI agents. Depending on your priorities—speed, cost, or output quality—you can select a model that best suits your use case.
Building the BaristaAgent and comparing these LLMs showed how accessible and efficient AI agents can be when equipped with the right tools. If you’re interested in creating your own agent, platforms like aiXplain make the process seamless, from building to benchmarking.
Whether you’re working on creative projects, business workflows, or personal experiments, there’s no better time to explore the potential of AI agents. What kind of agent will you build next? Here is a great way to start building the agent that just popped into your mind.