Benchmark Report: Multi-Task Evaluation of Large Language Models on Arabic Datasets

aiXplain Arabic LLM Benchmark Report
The June 2025 Arabic LLM benchmark report is now live.

This report offers an independent, multi-task evaluation of 12 top-performing large language models on Arabic-language tasks. Download the full report now.

This report is intended only for recipients who accessed it through their aiXplain subscription. To approve further distribution, please contact care@aixplain.com. We are happy to support your use of this report.

    What’s inside the report

    If you’re building Arabic LLMs or deploying AI in Arabic-speaking markets, this report gives you a real-world view of which models perform well and on which tasks.

    12 LLMs evaluated, including open and closed models such as SILMA, Jais, and ALLaM

    Benchmarked on 11 real-world tasks such as question answering, reasoning, summarization, and translation

    Introduces LLM-as-a-Judge, a new metric using Gemini 2.5 Flash to assess coherence and semantic quality

    SILMA 9B and GPT-4.1 lead overall, with ALLaM 7B and Qwen3 14B excelling in reasoning and code

    Previous Arabic LLM Benchmark reports

    Curious about the results?

    Fill out the form to download the detailed report.