Scale AI

Multi-Turn Human Jailbreaks Dataset and Findings

Time to read

80 mins

Publication

08/27/24

Language

English

Summary

This technical report presents findings on the vulnerability of large language models (LLMs) to multi-turn human jailbreaks. It shows that existing LLM defenses, although effective against single-turn automated attacks, are insufficient against multi-turn interactions: multi-turn human jailbreaks achieve an attack success rate of over 70% against these defenses, which exposes a critical gap in current threat models. The report describes the methodology behind these jailbreaks, including a human red teaming pipeline that simulates real-world user behavior. It also introduces the Multi-Turn Human Jailbreaks (MHJ) dataset, which comprises 2,912 prompts from 537 multi-turn jailbreaks, along with a taxonomy of jailbreak tactics developed through extensive red teaming. The findings underscore the need for more robust evaluation frameworks for LLM defenses, since current automated attack benchmarks do not adequately reflect real-world malicious use.
