Scale AI
Multi-Turn Human Jailbreaks Dataset and Findings
Pages
36
Time to read
80 mins
Publication
Language
English
This technical report presents findings on the vulnerability of large language models (LLMs) to multi-turn human jailbreaks. It shows that existing LLM defenses, although effective against single-turn automated attacks, are insufficient against multi-turn interactions: the study reports an attack success rate of over 70% against these defenses, exposing a critical gap in current threat models. The report describes the methodology used to conduct these jailbreaks, including a human red teaming pipeline organized to simulate real-world user behavior. It also introduces the Multi-Turn Human Jailbreaks (MHJ) dataset, which comprises 2,912 prompts from 537 multi-turn jailbreaks, together with a taxonomy of jailbreak tactics developed through extensive red teaming. The findings underscore the need for more robust evaluation frameworks for LLM defenses, because current automated attack benchmarks do not adequately reflect real-world malicious use.