Scale AI

Humanity's Last Exam Benchmark Overview

Time to read

59 mins

Publication

01/23/25

Language

English

Summary

This technical report introduces Humanity's Last Exam (HLE), a multi-modal benchmark for evaluating the capabilities of large language models (LLMs) across a broad range of academic subjects. The report argues that existing benchmarks are no longer sufficient, since LLMs now exceed 90% accuracy on popular tests such as MMLU. HLE is designed to be the final closed-ended academic benchmark and comprises 3,000 questions spanning subjects such as mathematics, the humanities, and the natural sciences. The questions are written by subject-matter experts from around the world and use multiple-choice and short-answer formats, both suitable for automated grading. Each question has a clear, verifiable solution but is designed to be difficult for LLMs, which currently show low accuracy and poor calibration on the benchmark. The report presents HLE as a way to measure the gap between LLM performance and expert human knowledge on closed-ended academic questions.
