Measuring the Impact: Runtime and Credit Usage of AI Agents

Measuring Performance: Comparing LLMs with BaristaAgent

In the first two parts of this blog series, we explored the foundation and practical creation of AI agents. In Part 1, I reflected on my initial confusion and how I came to understand the core principles that make AI agents more than just function callers, while Part 2 demonstrated how to build one, step by step, with aiXplain’s Agentic Framework. The result was the BaristaAgent—a digital coffee expert capable of suggesting unique recipes and pulling relevant options from Google Search.

In this final part, I take the BaristaAgent concept further by comparing the performance of four different LLMs, each powering its own version of the agent. The goal? To understand how runtime and credit usage vary across models when performing the same task: creating and retrieving coffee recipes.

The full code is available in this Google Colab notebook.

How I Evaluated the BaristaAgent

Step 1: Defining the LLMs

The first step was selecting four LLMs from aiXplain’s marketplace. Here’s the lineup:

  • LLaMA 3.1 70B (hosted on Groq)
  • GPT-4o
  • AWS Nova Lite
  • Gemini 1.5 Pro

Each of these models powered its own version of the BaristaAgent, executing the same query to ensure consistent evaluation.
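As a small sketch (the IDs below are placeholders, not real aiXplain marketplace IDs), the lineup can be kept in a dictionary so the build step can simply loop over it:

# Placeholder marketplace IDs -- look up the real ones on aiXplain before running.
LLM_IDS = {
    "LLaMA 3.1 70B (Groq)": "<llama-3.1-70b-id>",
    "GPT-4o": "<gpt-4o-id>",
    "AWS Nova Lite": "<nova-lite-id>",
    "Gemini 1.5 Pro": "<gemini-1.5-pro-id>",
}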

Step 2: Building and running the agents

For each LLM, I created an agent using aiXplain’s AgentFactory module. The agent’s purpose was simple:

  • Generate a unique coffee recipe based on given ingredients.
  • Use Google Search API to retrieve similar recipes.
  • Present the output in a structured format.

The query was straightforward:

query = "What kind of coffee can I make with orange, cocoa powder, Medium roast Guatemalan coffee beans, milk, and sugar?"

Note: The same logic and tool configuration were applied to all agents to ensure a fair comparison.
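For reference, here is a minimal sketch of that setup. It assumes the AgentFactory.create call and the Google Search tool (search_tool) configured in Part 2, and the exact parameter names, such as llm_id, may differ slightly from the notebook:

from aixplain.factories import AgentFactory

results = {}
for name, llm_id in LLM_IDS.items():
    # One BaristaAgent per LLM, all sharing the same description and tools
    agent = AgentFactory.create(
        name=f"BaristaAgent ({name})",
        description="Suggests unique coffee recipes and finds similar ones via Google Search.",
        tools=[search_tool],  # the Google Search tool configured in Part 2
        llm_id=llm_id,
    )
    # Run the same query for every agent so the comparison stays fair
    results[name] = agent.run(query)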

Step 3: Collecting performance metrics

For each agent, I measured two key metrics during execution:

  • Runtime: The total time taken by the agent to generate and retrieve recipes.
  • Credit usage: The computational cost associated with each execution.

To track these metrics, I used a custom analysis function:

from collections import defaultdict

def analyze_agent_and_tool_metrics(result):
    # The agent response's 'data' field is expected to be a dictionary
    data = result["data"]
    
    # Initialize dictionaries to store statistics for agents and tools
    agent_metrics = defaultdict(lambda: {'credits': 0, 'runtime': 0, 'calls': 0})
    tool_metrics = defaultdict(lambda: {'credits': 0, 'runtime': 0, 'calls': 0})

    # Initialize totals for agents
    total_agent_credits = 0
    total_agent_runtime = 0
    total_agent_calls = 0

    # Process each step from the intermediate_steps
    for step in data.get('intermediate_steps', []):
        # Process agent statistics
        agent = step['agent']
        credits = step.get('usedCredits', 0) or 0
        runtime = step.get('runTime', 0) or 0

        # Update agent statistics
        agent_metrics[agent]['credits'] += credits
        agent_metrics[agent]['runtime'] += runtime
        agent_metrics[agent]['calls'] += 1

        # Update totals for agents
        total_agent_credits += credits
        total_agent_runtime += runtime
        total_agent_calls += 1

        # Process tool statistics if present
        if step.get('tool_steps'):
            for tool_step in step['tool_steps']:
                tool = tool_step['tool']
                tool_credits = tool_step.get('usedCredits', 0) or 0
                tool_runtime = tool_step.get('runTime', 0) or 0

                # Update tool statistics
                tool_metrics[tool]['credits'] += tool_credits
                tool_metrics[tool]['runtime'] += tool_runtime
                tool_metrics[tool]['calls'] += 1

    return {
        'agent_metrics': agent_metrics,
        'tool_metrics': tool_metrics,
        'totals': {
            'credits': total_agent_credits,
            'runtime': total_agent_runtime,
            'calls': total_agent_calls
        }
    }

This allowed me to gather detailed statistics, such as the number of calls, credits consumed, and runtime for both the agent and the tools it used.
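As a usage sketch, assuming results is the {model name: run response} dictionary built above and that each response exposes a dict-like "data" field (as the function expects), the totals can be printed per model like this:

# Summarize credits, runtime, and call count for each agent's run
for name, response in results.items():
    metrics = analyze_agent_and_tool_metrics(response)
    totals = metrics["totals"]
    print(f"{name}: {totals['credits']:.6f} credits, "
          f"{totals['runtime']:.2f} s, {totals['calls']} call(s)")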

Step 4: Analyzing the results

The results from each agent were stored in JSON files for further analysis, and each agent was run at least five times to keep the results consistent. The averaged metrics are summarized in the table below.
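As a rough illustration of that bookkeeping (the file name and the run_metrics list below are assumptions, not the notebook’s exact code), each run’s totals could be written to JSON and averaged like this:

import json
from statistics import mean

def save_and_average(run_metrics, path="baristaagent_runs.json"):
    # 'run_metrics' is an assumed list of analyze_agent_and_tool_metrics() outputs,
    # one entry per run of the same agent.
    totals = [m["totals"] for m in run_metrics]
    with open(path, "w") as f:
        json.dump(totals, f, indent=2)
    # Average each metric across runs
    return {key: mean(t[key] for t in totals) for key in ("credits", "runtime", "calls")}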

Agent performance metrics (average of 5 runs)

LLM                            | Credits used | Runtime (seconds) | Total calls made
LLaMA 3.1 70B (hosted on Groq) | 0.002        | 7.08              | 1
GPT-4o                         | 0.022        | 11.90             | 1
AWS Nova Lite                  | 0.000095     | 2.71              | 1
Gemini 1.5 Pro                 | 0.007        | 8.23              | 1

Key observations

  • Runtime: Nova Lite had the fastest execution time at 2.71 seconds, while GPT-4o was the slowest, taking 11.9 seconds.
  • Credit usage: Nova Lite was the most cost-effective, consuming significantly fewer credits (0.00009588), whereas GPT-4o incurred the highest credit usage at 0.022855.
  • Performance balance: LLaMA 3.1 70B and Gemini 1.5 Pro struck a balance between runtime and credit efficiency, making them viable options for tasks requiring moderate resource usage.
  • Consistency: All agents executed the task with a single API call and delivered structured outputs, ensuring reliable performance across the board.

Conclusion

This experiment highlighted the practical considerations when choosing an LLM for AI agents. Depending on your priorities—speed, cost, or output quality—you can select a model that best suits your use case.

Building the BaristaAgent and comparing these LLMs showed how accessible and efficient AI agents can be when equipped with the right tools. If you’re interested in creating your own agent, platforms like aiXplain make the process seamless, from building to benchmarking.

Whether you’re working on creative projects, business workflows, or personal experiments, there’s no better time to explore the potential of AI agents. What kind of agent will you build next? Here is a great way to start building the agent that just popped into your mind.