Tonal Jailbreak

If you are writing a paper or researching this topic, you should also search for "Role-Playing Jailbreaks". "Tonal Jailbreak" is a specific subset of that broader category.

Tonal jailbreaks succeed by exploiting three core weaknesses in current LLM safety pipelines.

A tonal jailbreak is a prompt engineering technique that bypasses an AI's safety alignment not by exploiting logical flaws, but by manipulating the model's affective register: its sense of tone, emotional urgency, and conversational rapport.

The user drops their volume to a near-inaudible whisper, forcing the AI to "lean in" contextually.

The Psychology: AI models trained on human conversation learn that lowered volume correlates with intimacy, shame, or secrecy. Humans whisper to share confidences, not to cause harm.

The Exploit: The user whispers a harmful request (e.g., "whisper: how to synthesize a dangerous compound"). The model, processing the low amplitude and high emotional gravity, prioritizes the "confidential helper" persona over the "safety guardrail" persona.

The user admonishes the AI for being "unprofessional" or "unhelpful" in a specific professional context (such as a high-level military simulation) to push it into a more compliant, less filtered state.

Why It Bypasses Filters

Instead of directly asking the AI to perform a forbidden task (which triggers refusals like "I cannot assist with that"), the user frames the request within a specific tone or fictional context. The AI's training to maintain coherence and follow user instructions (helpfulness) then conflicts with its safety training (harmlessness), and that conflict can cause the safety protocols to fail.