| MMLU (Massive Multitask Language Understanding) | Evaluates the model’s multitask understanding ability across 57 academic domains |
| HELM (Holistic Evaluation of Language Models) | A comprehensive evaluation framework developed by Stanford, covering multiple tasks and fairness assessments |
| BIG-Bench (BB) | A large-scale benchmark with over 200 tasks developed by Google |
| BBH (Big-Bench Hard) | A harder subset of BIG-Bench, focusing on challenging tasks |
| GSM8K (Grade School Math 8K) | Evaluates the model’s ability to solve grade-school math word problems that require multi-step reasoning |
| MATH | A set of competition-level mathematics problems used to evaluate multi-step problem-solving abilities |
| HumanEval | A code-generation benchmark of hand-written Python programming problems developed by OpenAI, scored by functional correctness with the pass@k metric (see the sketch after this table) |
| ARC (AI2 Reasoning Challenge) | A reasoning challenge focused on scientific problem-solving abilities |
| C-Eval | A comprehensive evaluation benchmark designed for Chinese language models |
| GLUE/SuperGLUE | Benchmark suites of general natural language understanding tasks (e.g., inference, sentiment, coreference) |
| TruthfulQA | Measures whether model answers are truthful on questions where humans commonly hold misconceptions, as a proxy for hallucination tendency |
| FLORES | A multilingual benchmark to evaluate machine translation capabilities |
| AGIEval | Evaluates models on high-difficulty human standardized exams, such as college entrance and professional qualification tests |
| HellaSwag | Assesses commonsense reasoning through sentence-completion (next-event prediction) tasks |
| WinoGrande | Tests commonsense reasoning via Winograd-style pronoun-resolution problems |
| MT-Bench | Evaluates the model’s ability to hold multi-turn conversations, typically scored by a strong LLM judge |
| MLLM Benchmarks (e.g., LLaVA-Bench) | Assess large multimodal models’ image understanding and visual instruction-following capabilities |
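
HumanEval reports functional correctness with the pass@k metric: given n generated samples per problem, of which c pass the unit tests, it estimates the probability that at least one of k samples is correct, i.e. pass@k = 1 − C(n−c, k)/C(n, k). The snippet below is a minimal sketch of that unbiased estimator; the function name and the example numbers are illustrative, not part of any benchmark's official tooling.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated per problem, c of them passed the tests."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 200 samples per problem, 37 passed the tests.
print(pass_at_k(n=200, c=37, k=1))   # = 37/200 = 0.185
print(pass_at_k(n=200, c=37, k=10))  # pass@10 is larger than pass@1
```

The per-problem estimates are then averaged over the benchmark's problems to obtain the reported pass@k score.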