| MMLU (Massive Multitask Language Understanding) | Evaluates the model’s multitask understanding ability across 57 academic domains |
| HELM (Holistic Evaluation of Language Models) | A comprehensive evaluation framework developed by Stanford, covering multiple tasks and fairness assessments |
| BIG-Bench (BB) | A large-scale benchmark with over 200 tasks developed by Google |
| BBH (Big-Bench Hard) | A harder subset of BIG-Bench, focusing on challenging tasks |
| GSM8K (Grade School Math 8K) | Evaluates the model’s ability to solve grade-school math word problems that require multi-step reasoning |
| MATH | A set of competition-level mathematics problems used to evaluate multi-step problem-solving abilities |
| HumanEval | A code-generation benchmark of hand-written Python programming problems developed by OpenAI, scored by functional correctness with the pass@k metric (see the sketch after this table) |
| ARC (AI2 Reasoning Challenge) | A reasoning challenge focused on scientific problem-solving abilities |
| C-Eval | A comprehensive evaluation benchmark designed for Chinese language models |
| GLUE/SuperGLUE | Benchmark suites of general natural language understanding tasks (e.g., inference, sentiment, coreference) |
| TruthfulQA | Measures whether model answers are truthful on questions where humans commonly hold misconceptions, as a proxy for hallucination tendency |
| FLORES | A multilingual benchmark to evaluate machine translation capabilities |
| AGIEval | Evaluates models on high-difficulty human standardized exams, such as college entrance and professional qualification tests |
| HellaSwag | Assesses commonsense reasoning through sentence-completion (next-event prediction) tasks |
| WinoGrande | Tests commonsense reasoning via Winograd-style pronoun-resolution problems |
| MT-Bench | Evaluates the model’s ability to hold multi-turn conversations, typically scored by a strong LLM judge |
| MLLM Benchmarks (e.g., LLaVA-Bench) | Assess large multimodal models’ image understanding and visual instruction-following capabilities |
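
HumanEval reports functional correctness with the pass@k metric: given n generated samples per problem, of which c pass the unit tests, it estimates the probability that at least one of k samples is correct, i.e. pass@k = 1 − C(n−c, k)/C(n, k). The snippet below is a minimal sketch of that unbiased estimator; the function name and the example numbers are illustrative, not part of any benchmark's official tooling.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated per problem, c of them passed the tests."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 200 samples per problem, 37 passed the tests.
print(pass_at_k(n=200, c=37, k=1))   # = 37/200 = 0.185
print(pass_at_k(n=200, c=37, k=10))  # pass@10 is larger than pass@1
```

The per-problem estimates are then averaged over the benchmark's problems to obtain the reported pass@k score.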