Benchmark Report: Multi-Task Evaluation of Large Language Models on Arabic Datasets

aiXplain Arabic LLM Benchmark Report
The June 2025 Arabic LLM benchmark report is now live.

This report offers an independent, multi-task evaluation of 12 top-performing large language models on Arabic-language tasks. Download the full report now.

This report is intended only for recipients who accessed it through their aiXplain subscription. To approve further distribution, please contact care@aixplain.com. We are happy to support your use of this report.

    What’s inside the report

    If you’re building Arabic LLMs or deploying AI in Arabic-speaking markets, this report gives you a real-world view of which models perform well and on which tasks.

    12 LLMs evaluated, including open and closed models such as SILMA, Jais, and ALLaM

    Benchmarked on 11 real-world tasks such as question answering, reasoning, summarization, and translation

    Introduces LLM-as-a-Judge, a new metric using Gemini 2.5 Flash to assess coherence and semantic quality

    SILMA 9B and GPT-4.1 lead overall, with ALLaM 7B and Qwen3 14B excelling in reasoning and code

    Previous Arabic LLM Benchmark reports

    Curious about the results?

    Fill out the form to download the detailed report.