Benchmark Report: Independent Multi-Task Evaluation of Large Language Models on Arabic Datasets
Our latest Arabic LLM benchmark report is now live.

This report offers an independent, multi-task evaluation of 9 top-performing large language models on Arabic datasets. Download the full report now.

This report is intended only for recipients who accessed it through their aiXplain subscription. For permission to distribute it further, please contact care@aixplain.com. We are happy to support your use of this report.

    What’s inside the report

    If you’re working on Arabic LLMs or deploying AI in Arabic-speaking markets, this report gives you a real-world view of what works and where it works best.

    Llama 4 Maverick and Scout are included, tested across 11 core NLP tasks

    GPT-4o mini comes out on top in several tasks, while Command R+ leads in text classification

    Smaller open-source models like Gemma 2 and Qwen2.5 show surprising strength

    All models evaluated on real Arabic data using ROUGE-L and BLEU metrics (a minimal scoring sketch follows this list)
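
To make the metrics concrete, here is a minimal sketch of how a single model output could be scored against a reference with BLEU and ROUGE-L. It is not the report's actual evaluation pipeline: the Arabic example sentences, the use of the sacrebleu package, and the simplified whitespace-token ROUGE-L below are assumptions made for illustration only.

# Sketch only: assumes the `sacrebleu` package is installed; the sentences
# and the whitespace-token ROUGE-L are illustrative, not the report's setup.
import sacrebleu


def lcs_length(a: list[str], b: list[str]) -> int:
    # Dynamic-programming length of the longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(hypothesis: str, reference: str) -> float:
    # Simplified ROUGE-L F1 over whitespace tokens (no stemming or normalization).
    hyp, ref = hypothesis.split(), reference.split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative pair: a model's Arabic output versus a gold reference.
hypothesis = "ذهب الولد إلى المدرسة صباحا"
reference = "ذهب الولد إلى المدرسة في الصباح"

# sacrebleu expects a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])
print(f"BLEU:    {bleu.score:.2f}")
print(f"ROUGE-L: {rouge_l_f1(hypothesis, reference):.3f}")

The snippet only shows the per-pair computation; the report itself applies these metrics across the 11 tasks listed above.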

    Curious about the results?

    Fill out the form to download the detailed report.