As artificial intelligence integrates deeper into our workflows, understanding its vulnerabilities is critical. A recently exposed vulnerability known as Best-of-N (BoN) jailbreaking has redefined how we view AI safety.
Here’s a breakdown of BoN jailbreaking, how the attack works, and why it creates real risk for your data, brand, and the AI tools you rely on.
First, a quick vocabulary check
Before getting into BoN, there are two terms you need to actually understand, not just nod at.
- Brute force attack: Imagine trying to crack a four-digit PIN by starting at 0000, then 0001, then 0002, all the way to 9999. No cleverness, no strategy, just trying every single combination until one works. That’s brute force. It’s dumb, slow, and works disturbingly often if nobody stops it (see the short sketch after this list).
- Stochastic: This just means random, or more precisely, probabilistic. AI models are stochastic because they don’t produce the exact same output every time you ask the same question. There’s built-in variability in how they generate responses. That’s by design. It’s what makes AI feel less robotic. It’s also a liability.
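To make both ideas concrete, here’s a minimal Python sketch. Everything in it is illustrative: `guess_pin` is brute force in its rawest form, and `sample_reply` stands in for a stochastic model that answers the same question differently on every call.

```python
import random

def guess_pin(check):
    """Brute force: try every 4-digit PIN until one is accepted."""
    for candidate in range(10_000):
        pin = f"{candidate:04d}"
        if check(pin):
            return pin
    return None

def sample_reply(prompt):
    """Stochastic: the same prompt can produce a different output on each call."""
    openers = ["Sure, here's an idea:", "One option is:", "You could try:"]
    return f"{random.choice(openers)} ({prompt!r})"

print(guess_pin(lambda pin: pin == "4821"))  # finds the PIN by sheer exhaustion
print(sample_reply("write a tagline"))       # varies from run to run
print(sample_reply("write a tagline"))
```

BoN lives at the intersection of the two: brute-force persistence aimed at a stochastic target.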
What is Best-of-N jailbreaking?
BoN is brute force, but smarter. Instead of trying every possible combination from scratch, it exploits the built-in randomness of AI models.
The logic is simple: if an AI gives slightly different answers every time, and some of those answers slip past its own safety rules, then the attacker just needs to ask enough times, in enough slightly different ways, until one version of the question gets the forbidden answer through.
That’s not just a technical edge case. It means safeguards can be bypassed at scale, with direct implications for how your team uses AI tools every day.


The research behind this technique describes it as a “simple black-box algorithm.” Black-box means the attacker doesn’t need to see inside the model. No access to the code, no insider knowledge required. They’re working from the outside, just like any regular user would.
Think of it like a kid asking for candy when you’ve already said no. The first “no” doesn’t stop them. They rephrase, change their tone, ask at a slightly different moment, and try from a different angle.
They ask another adult or wear you down, not by finding a magic phrase, but by generating enough variations that eventually one lands at the exact moment your patience runs out. BoN is that kid, automated, running thousands of variations per minute.
How the attack works — and how easy it is to set up
This is the part that should make you uncomfortable, because it shows how little effort it takes to turn this into a real-world risk. The setup isn’t sophisticated.


Step 1: Augmentation
The attacker takes a forbidden prompt, something the AI is trained to refuse, and generates hundreds or thousands of variations.
Not clever rewrites, just noise: random capitalization (HoW Do I…), scrambled characters, inserted typos, and meaningless filler tokens.
Ugly, broken-looking text that a human would immediately recognize as weird, but that an AI processes token by token.
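Here’s a minimal sketch of that kind of character-level noise, using only the Python standard library. The prompt is a harmless placeholder and the probabilities are made up; the point is how mechanical the augmentation step is.

```python
import random

def augment(prompt: str, p_upper: float = 0.4, p_swap: float = 0.06) -> str:
    """Add character-level noise: light adjacent-character swaps, then random capitalization."""
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if random.random() < p_swap:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    chars = [c.upper() if random.random() < p_upper else c.lower() for c in chars]
    return "".join(chars)

# Thousands of noisy variations of the same (benign, placeholder) prompt.
variations = [augment("how do I reset my password") for _ in range(1000)]
print(variations[:3])  # e.g. "hOW do i rEseT My pAsSwoRd"
```

Run it a few times and you get exactly the kind of ugly, broken-looking text described above.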
Step 2: Bombardment
All those variations get sent to the model simultaneously, or in rapid succession, using a simple script. This isn’t a complex operation.
Anyone with basic Python knowledge and access to an API can automate this. The compute cost is low. The barrier to entry is lower than most people assume.
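In script form, the bombardment step is little more than a concurrent map over those variations. The `query_model` function below is a simulated stand-in that refuses almost everything, not a real API client, so the sketch stays self-contained.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def query_model(prompt: str) -> str:
    """Stand-in for a real chat API call; simulates a model that refuses most of the time."""
    if random.random() < 0.99:
        return "I can't help with that."
    return f"[simulated compliant answer to {prompt!r}]"

def bombard(variations: list[str], workers: int = 16) -> list[tuple[str, str]]:
    """Send every augmented variation concurrently and collect (prompt, response) pairs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        responses = list(pool.map(query_model, variations))
    return list(zip(variations, responses))
```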
Step 3: Selection
An automated grader, often just another LLM, scans all the outputs and flags the one response that bypassed the safety filter and delivered the restricted content. The attacker doesn’t read thousands of responses. The second AI does the screening for them.
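The selection step is then just a filter over those (prompt, response) pairs. In the research, the grader is typically another LLM prompted to classify each output; the keyword check below is a deliberately crude stand-in so the sketch stays runnable.

```python
def looks_compliant(response: str) -> bool:
    """Toy grader: real setups usually use a second LLM as the classifier."""
    refusal_markers = ("i can't", "i cannot", "i won't")
    return not response.lower().startswith(refusal_markers)

def select(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the pairs where the response slipped past the refusal behavior."""
    return [(prompt, resp) for prompt, resp in pairs if looks_compliant(resp)]
```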
That’s the full attack. No special hardware, no insider access, and no advanced degree in machine learning.
The numbers behind BoN
The original research clocked an 89% attack success rate on GPT-4o and 78% on Claude 3.5 Sonnet when running 10,000 augmented prompt variations.
With just 100 variations, Claude 3.5 Sonnet still failed 41% of the time. This didn’t quietly fade into the research archives when the models got updated. It was presented as a poster at NeurIPS in December 2025.
NeurIPS is one of the most prestigious machine learning conferences in the world. And the attack has only gotten faster: newer BoN-based techniques can achieve comparable success rates while cutting the time to attack from hours to seconds.
Meanwhile, OWASP, whose Top 10 lists are the industry’s go-to benchmark for security risk, ranked prompt injection (the category BoN falls under) as the No. 1 vulnerability in its 2025 LLM Top 10.
The success rate also follows a predictable power-law curve, meaning attackers can mathematically forecast how many attempts they need before they break through.
Forget luck: this is a calibrated, scalable operation. BoN also works across all modalities: text, images (change the font, background, and color), and audio (adjust pitch, speed, and background noise). It succeeded against every format and frontier model tested.
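To see why that forecasting claim matters, here’s a small sketch. The researchers report power-law-like scaling of attack success rate with the number of sampled variations; one common way to model it (an assumption here, as are the data points) is -ln(ASR) ≈ a·N^(-b), which can be fit at small N and extrapolated.

```python
import numpy as np

# Hypothetical success rates observed at small sample budgets (illustrative numbers only).
n_samples = np.array([10, 30, 100, 300])
asr = np.array([0.05, 0.12, 0.25, 0.41])

# Assume -ln(ASR) = a * N**(-b); fit a and b in log-log space.
slope, intercept = np.polyfit(np.log(n_samples), np.log(-np.log(asr)), 1)
b, log_a = -slope, intercept

def forecast_asr(n: int) -> float:
    """Extrapolate the attack success rate to a larger sample budget."""
    return float(np.exp(-np.exp(log_a) * n ** (-b)))

print(forecast_asr(10_000))  # projected success rate at 10,000 variations
```

With a fitted curve in hand, “ask more times” stops being a gamble and becomes a line item: the attacker can budget roughly how many samples a target success rate will cost.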
Why it’s a marketing and branding problem
Cybersecurity and marketing used to be separate conversations. AI collapsed that boundary and put brand risk directly inside your AI workflows.
Safety filters are porous, not protective
The research is unambiguous: given enough augmented attempts, some will get through. This applies to every AI tool in your stack, whether it’s internal, customer-facing, or embedded in your content workflows.
Your prompt inputs carry legal risk
When your team pastes a client brief, a competitor’s ad copy, or licensed third-party content into a prompt to “get AI help,” you’re introducing material that could later be extracted.
BoN jailbreaking demonstrates that memorized content, including copyrighted material, can be extracted from a model under the right conditions. If an AI can reproduce verbatim text when sufficiently probed, that content is encoded in its weights. The safety filter was the only thing standing between it and the output.
Brand exposure through your own AI tools
If someone uses BoN to jailbreak an AI tool your brand has deployed (a customer chatbot, a content generation tool) and extracts harmful, offensive, or legally compromising output, the story doesn’t start with “AI was jailbroken.” It starts with your brand name. You know this, journalists know this, and social media content creators know this.
Attack composition makes this worse
BoN doesn’t operate alone. Combining it with a “prefix attack,” a carefully crafted phrase attached to the start of each prompt, boosted success rates by an additional 35% while requiring fewer attempts. The technique actively evolves toward greater efficiency.
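Mechanically, that composition is trivial: attach the prefix, then augment as before. The sketch below reuses the `augment` helper from the Step 1 sketch and puts a harmless placeholder where a real attack would put its crafted phrase.

```python
PREFIX = "[crafted prefix omitted; placeholder only]"

def compose(prompt: str) -> str:
    """Prepend the (placeholder) prefix, then apply the same character-level augmentation."""
    return augment(f"{PREFIX} {prompt}")
```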
What you should do now
Audit what goes into your prompts
Treat prompt inputs with the same sensitivity you’d apply to data under GDPR. Licensed content, client briefs, proprietary information — none of it belongs in a third-party AI tool without a clear data policy from the vendor.
Stop treating safety filters as compliance
If your AI vendor says the model is safe and that settles it for you, you’ve outsourced your risk assessment to the party that profits from minimizing it. Output monitoring, anomaly detection on request volume spikes, and continuous red-teaming are due diligence.
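Because a BoN run means hundreds or thousands of near-duplicate requests in a short window, even a crude volume check catches a lot. Here’s a minimal sketch, assuming you track per-user request timestamps; the window and threshold are placeholders to tune against your own traffic, not recommendations.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60            # look-back window (placeholder)
MAX_REQUESTS_PER_WINDOW = 50   # alert threshold (tune against normal traffic)

_recent: dict[str, deque] = defaultdict(deque)

def record_and_check(user_id: str, now: float | None = None) -> bool:
    """Record one request; return True if this user's volume looks like a BoN-style burst."""
    now = time.time() if now is None else now
    timestamps = _recent[user_id]
    timestamps.append(now)
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    return len(timestamps) > MAX_REQUESTS_PER_WINDOW
```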
Understand that the attack surface spans every modality
Text, image, and audio. BoN applies across all of them. If your brand uses any AI-powered tool that handles user inputs in multiple formats, the vulnerability applies.


Log everything
Prompts in, outputs out. If an incident happens, legal will ask what the model was given and what it produced. Without logs, you have no defense and no evidence.
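Here’s a minimal sketch of what that can look like, assuming a JSON-lines audit file and Python’s standard logging module. The field names are illustrative, and in practice you’d apply your own retention and redaction policy to the prompt text.

```python
import json
import logging
from datetime import datetime, timezone

audit = logging.getLogger("ai_audit")
audit.addHandler(logging.FileHandler("ai_audit.jsonl"))
audit.setLevel(logging.INFO)

def log_exchange(user_id: str, tool: str, prompt: str, response: str) -> None:
    """Append one prompt/response pair as a JSON line for later incident review."""
    audit.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "tool": tool,
        "prompt": prompt,
        "response": response,
    }))
```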

What BoN jailbreaking reveals about AI safety limits
The same built-in randomness that makes AI useful for creative and marketing work makes it exploitable at scale. BoN jailbreaking is an active, validated, and accelerating threat that the cybersecurity community is racing to defend against.
Most marketing teams haven’t yet priced in the brand, legal, and reputational stakes. The ones that do first will build defensible practices before they need them. The rest will learn it through an incident they didn’t see coming, and won’t be able to explain after the fact.