A new research paper of ours, introducing a simple but powerful technique for preventing Best-of-N jailbreaking. Abstract: Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc) is effective against all major large language models (LLMs). We have found that 100% of the BoN paper's successful jailbreaks (confidence interval [99.65%, 100.00%]) and 99.8% of successful jailbreaks in our replication (confidence interval [99.28%, 99.98%]) were blocked with our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly utilizing an evaluation LLM to evaluate a prompt for dangerous or manipulative behaviors-unlike some other approaches , DATDP also explicitly looks for jailbreaking attempts-until a robust safety rating is generated. This success persisted even when utilizing smaller LLMs to power the evaluation (Claude and LLaMa-3-8B-instruct proved almost equally capable). These results show that, though language models are sensitive to seemingly innocuous changes to inputs, they seem also capable of successfully evaluating the dangers of these inputs. Versions of DATDP can therefore be added cheaply to generative AI systems to produce an immediate significant increase in safety. Description: New research collaboration: “Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with a Prompt Evaluation Agent”. We found a simple, general-purpose method that effectively prevents jailbreaks (bypasses of safety features of) frontier AI models. The evaluation agent looks for dangerous prompts and jailbreak attempts. It blocks 99.5-100% of augmented jailbreak attempts from the original BoN paper and from our replication. It lets through almost all of normal prompts. DATDP is run on each potentially dangerous user prompt, repeatedly evaluating its safety with a language agent until high confidence is reached. Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models. A language model can be weak against augmented prompts, but it is strong when evaluating them. Using the same model in different ways gives very different outcomes. LLaMa-3-8B and Claude were roughly equally good at blocking dangerous augmented prompts – these are prompts that have random capitalization, scrambling, and ASCII noising. Augmented prompts have shown success at breaking AI models, but DATDP blocks over 99.5% of them. The LLaMa agent was a little less effective on unaugmented dangerous prompts. The scrambling that allows jailbreaking also makes it easier for DATDP to block that prompt. This tension makes it hard for bad actors to craft a prompt that jailbreaks models *and* evades DATDP. We’re open-sourcing our code so that others can build on our work (see comments). Along with core alignment technologies, we hope it assists in reducing misuse risk and safeguarding against strong adaptive attacks.
Aligned AI
Technology, Information and Internet
We're building the most advanced alignment system for artificial intelligence.
About us
We're building the most advanced alignment system for artificial intelligence. We make AI do more of what it should, and less of what it shouldn't.
- Website
-
https://buildaligned.ai
External link for Aligned AI
- Industry
- Technology, Information and Internet
- Company size
- 2-10 employees
- Headquarters
- Oxford
- Type
- Privately Held
- Founded
- 2021
- Specialties
- Artificial Intelligence Alignment, AI, GenAI, Frontier AI, AI Ethics, and Responsible AI
Locations
-
Primary
Oxford, GB
Employees at Aligned AI
-
Stuart Armstrong
At Aligned AI, I make AIs behave well | Author of Smarter than Us: the Rise of Machine Intelligence | Foresight Institute Mentor | AI Safety Camp…
-
Edna Philippa O'Callaghan
Founder of AI Aligned
-
Emma Rath
MPhil candidate in Politics (European Politics and Society) at Oxford University. First Class Religion and Arabic Oxford Graduate.
-
Alexander Frangulov
AI Data Scientist for AlignedAI | Expert Linguistics Consultant for OpenAI | Private Tutor at A-List Education
Updates
-
Aligned AI reposted this
We are overwhelmed with gratitude for all the brilliant and energetic students excited about ethical and safe #AI who came by our booth at the Oxford Science, Engineering and Technology Career Fair today and shared their experiences and passions with us. We're looking forward to getting to know you all better as the term and year progress! Thank you as well to the Careers Service, University of Oxford for putting on yet another stimulating and energising fair!
-
-
Our CEO Rebecca Gorman explored the critical themes surrounding artificial intelligence, its inherent biases, ethical considerations, and future developments in this #SPARX interview for the Global Innovation Forum (GIFLondon). Discover how we are paving the way for a more ethical and user-focused AI future, emphasizing human augmentation over replacement. 🎙 In conversation with Tom Ellis from Brand Genetics. Link to full video in the comments 📹 #GIFLondon #GIFSPARX #MakeItCount #DreamBigger #innovation #design #intrapreneurship #technology #leadership #inspiration #storytelling #ai #artificialintelligence #aiethics #aibiases #airesearch #genai #generativeai #futureofai Commplicated Jessica Bancroft Hailey Eustace Stuart Armstrong Max Angelov
-
"We are hitting a critical moment in the ‘frontier AI’ lifecycle, with the public being suddenly and rudely awoken from the illusion of human-like understanding. The Gemini furore has served as a very visible case study that generative AI doesn’t understand concepts like ‘don’t be racist’ after all; and we are finally able to entertain the possibility that ‘frontier AI’ is, after all, merely repeating what it has heard or seen like a trained parrot." - Aligned AI CEO Rebecca Gorman's latest piece in City AM and how enterprises can avoid more #genai mishaps
-
Aligned AI reposted this
🎙️ New Episode: Explain IT: Season 7, Episode 3 - The Ethics of AI 🎙️ AI has been making headlines for a while now, and this year we’ll see more and more businesses adopting it to improve their performance and efficiency. But how do we ensure that AI is used in a responsible and ethical way? How do we avoid the risks and pitfalls that come with such a powerful technology? And do we need to worry about AI getting out of control? In this episode, podcast host Helen Gidney, Softcat’s Head of Architecture, gets the help of our expert guests Arran S., Softcat's AI Specialist Lead, and Rebecca Gorman, CEO at Aligned AI to answer these questions and talk tech in simple jargon-free language. Listen to the full episode via our website here: https://lnkd.in/dg9juiKK or on your Podcast platform of choice! #Softcat #ExplainIT #AI