Benchmark Report: Independent Multi-Task Evaluation of Large Language Models on Arabic Datasets
Our latest Arabic LLM benchmark report is now live.

This report offers an independent, multi-task evaluation of 9 top-performing large language models on Arabic datasets. Download the full report now.

This report is intended only for recipients who accessed it through their aiXplain subscription. For permission to distribute it further, please contact care@aixplain.com. We are happy to support your use of this report.

    What’s inside the report

    If you’re working on Arabic LLMs or deploying AI in Arabic-speaking markets, this report gives you a real-world view of what works and where it works best.

    Llama 4 Maverick and Scout are included, tested across 11 core NLP tasks

    GPT-4o mini comes out on top in several tasks, while Command R+ leads in text classification

    Smaller open-source models like Gemma 2 and Qwen2.5 show surprising strength

    All models evaluated on real Arabic data using ROUGE-L and BLEU metrics (a minimal scoring sketch follows this list)
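
To make the metrics concrete, here is a minimal sketch of how a single model output could be scored against a reference with BLEU and ROUGE-L. It is not the report's actual evaluation pipeline: the Arabic example sentences, the use of the sacrebleu package, and the simplified whitespace-token ROUGE-L below are assumptions made for illustration only.

# Sketch only: assumes the `sacrebleu` package is installed; the sentences
# and the whitespace-token ROUGE-L are illustrative, not the report's setup.
import sacrebleu


def lcs_length(a: list[str], b: list[str]) -> int:
    # Dynamic-programming length of the longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(hypothesis: str, reference: str) -> float:
    # Simplified ROUGE-L F1 over whitespace tokens (no stemming or normalization).
    hyp, ref = hypothesis.split(), reference.split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative pair: a model's Arabic output versus a gold reference.
hypothesis = "ذهب الولد إلى المدرسة صباحا"
reference = "ذهب الولد إلى المدرسة في الصباح"

# sacrebleu expects a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])
print(f"BLEU:    {bleu.score:.2f}")
print(f"ROUGE-L: {rouge_l_f1(hypothesis, reference):.3f}")

The snippet only shows the per-pair computation; the report itself applies these metrics across the 11 tasks listed above.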

    Curious about the results?

    Fill out the form to download the detailed report.