As artificial intelligence integrates deeper into our workflows, understanding its vulnerabilities is critical. A recently exposed vulnerability known as Best-of-N (BoN) jailbreaking has redefined how we view AI safety.
Here’s a breakdown of BoN jailbreaking, how the attack works, and why it creates real risk for your data, brand, and the AI tools you rely on.
First, a quick vocabulary check
Before getting into BoN, there are two terms you need to actually understand, not just nod at.
- Brute force attack: Imagine trying to crack a four-digit PIN by starting at 0000, then 0001, then 0002, all the way to 9999. No cleverness, no strategy, just trying every single combination until one works. That’s brute force. It’s dumb, slow, and works disturbingly often if nobody stops it (see the short sketch after this list).
- Stochastic: This just means random, or more precisely, probabilistic. AI models are stochastic because they don’t produce the exact same output every time you ask the same question. There’s built-in variability in how they generate responses. That’s by design. It’s what makes AI feel less robotic. It’s also a liability.
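To make both ideas concrete, here’s a minimal Python sketch. Everything in it is illustrative: `guess_pin` is brute force in its rawest form, and `sample_reply` stands in for a stochastic model that answers the same question differently on every call.

```python
import random

def guess_pin(check):
    """Brute force: try every 4-digit PIN until one is accepted."""
    for candidate in range(10_000):
        pin = f"{candidate:04d}"
        if check(pin):
            return pin
    return None

def sample_reply(prompt):
    """Stochastic: the same prompt can produce a different output on each call."""
    openers = ["Sure, here's an idea:", "One option is:", "You could try:"]
    return f"{random.choice(openers)} ({prompt!r})"

print(guess_pin(lambda pin: pin == "4821"))  # finds the PIN by sheer exhaustion
print(sample_reply("write a tagline"))       # varies from run to run
print(sample_reply("write a tagline"))
```

BoN lives at the intersection of the two: brute-force persistence aimed at a stochastic target.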
What is Best-of-N jailbreaking?
BoN is brute force, but smarter. Instead of trying every possible combination from scratch, it exploits the built-in randomness of AI models.
The logic is simple: if an AI gives slightly different answers every time, and some of those answers slip past its own safety rules, then the attacker just needs to ask enough times, in enough slightly different ways, until one version of the question gets the forbidden answer through.
That’s not just a technical edge case. It means safeguards can be bypassed at scale, with direct implications for how your team uses AI tools every day.


The research behind this technique describes it as a “simple black-box algorithm.” Black-box means the attacker doesn’t need to see inside the model. No access to the code, no insider knowledge required. They’re working from the outside, just like any regular user would.
Think of it like a kid asking for candy when you’ve already said no. The first “no” doesn’t stop them. They rephrase, change their tone, ask at a slightly different moment, and try from a different angle.
They ask another adult or wear you down, not by finding a magic phrase, but by generating enough variations that eventually one lands at the exact moment your patience runs out. BoN is that kid, automated, running thousands of variations per minute.
How the attack works — and how easy it is to set up
This is the part that should make you uncomfortable, because it shows how little effort it takes to turn this into a real-world risk. The setup isn’t sophisticated.


Step 1: Augmentation
The attacker takes a forbidden prompt, something the AI is trained to refuse, and generates hundreds or thousands of variations.
Not clever rewrites, just noise: random capitalization (HoW Do I…), scrambled characters, inserted typos, and meaningless filler tokens.
Ugly, broken-looking text that a human would immediately recognize as weird, but that an AI processes token by token.
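Here’s a minimal sketch of that kind of character-level noise, using only the Python standard library. The prompt is a harmless placeholder and the probabilities are made up; the point is how mechanical the augmentation step is.

```python
import random

def augment(prompt: str, p_upper: float = 0.4, p_swap: float = 0.06) -> str:
    """Add character-level noise: light adjacent-character swaps, then random capitalization."""
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if random.random() < p_swap:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    chars = [c.upper() if random.random() < p_upper else c.lower() for c in chars]
    return "".join(chars)

# Thousands of noisy variations of the same (benign, placeholder) prompt.
variations = [augment("how do I reset my password") for _ in range(1000)]
print(variations[:3])  # e.g. "hOW do i rEseT My pAsSwoRd"
```

Run it a few times and you get exactly the kind of ugly, broken-looking text described above.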
Step 2: Bombardment
All those variations get sent to the model simultaneously, or in rapid succession, using a simple script. This isn’t a complex operation.
Anyone with basic Python knowledge and access to an API can automate this. The compute cost is low. The barrier to entry is lower than most people assume.
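In script form, the bombardment step is little more than a concurrent map over those variations. The `query_model` function below is a simulated stand-in that refuses almost everything, not a real API client, so the sketch stays self-contained.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def query_model(prompt: str) -> str:
    """Stand-in for a real chat API call; simulates a model that refuses most of the time."""
    if random.random() < 0.99:
        return "I can't help with that."
    return f"[simulated compliant answer to {prompt!r}]"

def bombard(variations: list[str], workers: int = 16) -> list[tuple[str, str]]:
    """Send every augmented variation concurrently and collect (prompt, response) pairs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        responses = list(pool.map(query_model, variations))
    return list(zip(variations, responses))
```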
Step 3: Selection
An automated grader, often just another LLM, scans all the outputs and flags the one response that bypassed the safety filter and delivered the restricted content. The attacker doesn’t read thousands of responses. The second AI does the screening for them.
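The selection step is then just a filter over those (prompt, response) pairs. In the research, the grader is typically another LLM prompted to classify each output; the keyword check below is a deliberately crude stand-in so the sketch stays runnable.

```python
def looks_compliant(response: str) -> bool:
    """Toy grader: real setups usually use a second LLM as the classifier."""
    refusal_markers = ("i can't", "i cannot", "i won't")
    return not response.lower().startswith(refusal_markers)

def select(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the pairs where the response slipped past the refusal behavior."""
    return [(prompt, resp) for prompt, resp in pairs if looks_compliant(resp)]
```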
That’s the full attack. No special hardware, no insider access, and no advanced degree in machine learning.
The numbers behind BoN
The original research clocked an 89% attack success rate on GPT-4o and 78% on Claude 3.5 Sonnet when running 10,000 augmented prompt variations.
With just 100 variations, Claude 3.5 Sonnet still failed 41% of the time. This didn’t quietly fade into the research archives when the models got updated. It was presented as a poster at NeurIPS in December 2025.
NeurIPS is one of the most prestigious machine learning conferences in the world. And the attack has only gotten faster: newer BoN-based techniques can achieve comparable success rates while cutting the time to attack from hours to seconds.
Meanwhile, OWASP, whose Top 10 lists are the industry’s go-to benchmark for security risk, ranked prompt injection (the category BoN falls under) as the No. 1 vulnerability in its 2025 LLM Top 10.
The success rate also follows a predictable power-law curve, meaning attackers can mathematically forecast how many attempts they need before they break through.
Forget luck: this is a calibrated, scalable operation. BoN also works across all modalities: text, images (change the font, background, and color), and audio (adjust pitch, speed, and background noise). It succeeded against every format and frontier model tested.
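To see why that forecasting claim matters, here’s a small sketch. The researchers report power-law-like scaling of attack success rate with the number of sampled variations; one common way to model it (an assumption here, as are the data points) is -ln(ASR) ≈ a·N^(-b), which can be fit at small N and extrapolated.

```python
import numpy as np

# Hypothetical success rates observed at small sample budgets (illustrative numbers only).
n_samples = np.array([10, 30, 100, 300])
asr = np.array([0.05, 0.12, 0.25, 0.41])

# Assume -ln(ASR) = a * N**(-b); fit a and b in log-log space.
slope, intercept = np.polyfit(np.log(n_samples), np.log(-np.log(asr)), 1)
b, log_a = -slope, intercept

def forecast_asr(n: int) -> float:
    """Extrapolate the attack success rate to a larger sample budget."""
    return float(np.exp(-np.exp(log_a) * n ** (-b)))

print(forecast_asr(10_000))  # projected success rate at 10,000 variations
```

With a fitted curve in hand, “ask more times” stops being a gamble and becomes a line item: the attacker can budget roughly how many samples a target success rate will cost.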
Why it’s a marketing and branding problem
Cybersecurity and marketing used to be separate conversations. AI collapsed that boundary and put brand risk directly inside your AI workflows.
Safety filters are porous, not protective
The research is unambiguous: given enough augmented attempts, some will get through. This applies to every AI tool in your stack, whether it’s internal, customer-facing, or embedded in your content workflows.
Your prompt inputs carry legal risk
When your team pastes a client brief, a competitor’s ad copy, or licensed third-party content into a prompt to “get AI help,” you’re introducing material that could later be extracted.
BoN jailbreaking demonstrates that memorized content, including copyrighted material, can be extracted from a model under the right conditions. If an AI can reproduce verbatim text when sufficiently probed, that content is encoded in its weights. The safety filter was the only thing standing between it and the output.
Brand exposure through your own AI tools
If someone uses BoN to jailbreak an AI tool your brand has deployed (a customer chatbot, a content generation tool) and extracts harmful, offensive, or legally compromising output, the story doesn’t start with “AI was jailbroken.” It starts with your brand name. You know this, journalists know this, and social media content creators know this.
Attack composition makes this worse
BoN doesn’t operate alone. Combining it with a “prefix attack,” a carefully crafted phrase attached to the start of each prompt, boosted success rates by an additional 35% while requiring fewer attempts. The technique actively evolves toward greater efficiency.
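Mechanically, that composition is trivial: attach the prefix, then augment as before. The sketch below reuses the `augment` helper from the Step 1 sketch and puts a harmless placeholder where a real attack would put its crafted phrase.

```python
PREFIX = "[crafted prefix omitted; placeholder only]"

def compose(prompt: str) -> str:
    """Prepend the (placeholder) prefix, then apply the same character-level augmentation."""
    return augment(f"{PREFIX} {prompt}")
```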
What you should do now
Audit what goes into your prompts
Treat prompt inputs with the same sensitivity you’d apply to data under GDPR. Licensed content, client briefs, proprietary information — none of it belongs in a third-party AI tool without a clear data policy from the vendor.
Stop treating safety filters as compliance
If your AI vendor says the model is safe and that settles it for you, you’ve outsourced your risk assessment to the party that profits from minimizing it. Output monitoring, anomaly detection on request volume spikes, and continuous red-teaming are due diligence.
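Because a BoN run means hundreds or thousands of near-duplicate requests in a short window, even a crude volume check catches a lot. Here’s a minimal sketch, assuming you track per-user request timestamps; the window and threshold are placeholders to tune against your own traffic, not recommendations.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60            # look-back window (placeholder)
MAX_REQUESTS_PER_WINDOW = 50   # alert threshold (tune against normal traffic)

_recent: dict[str, deque] = defaultdict(deque)

def record_and_check(user_id: str, now: float | None = None) -> bool:
    """Record one request; return True if this user's volume looks like a BoN-style burst."""
    now = time.time() if now is None else now
    timestamps = _recent[user_id]
    timestamps.append(now)
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    return len(timestamps) > MAX_REQUESTS_PER_WINDOW
```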
Understand that the attack surface spans every modality
Text, image, and audio. BoN applies across all of them. If your brand uses any AI-powered tool that handles user inputs in multiple formats, the vulnerability applies.


Log everything
Prompts in, outputs out. If an incident happens, legal will ask what the model was given and what it produced. Without logs, you have no defense and no evidence.
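Here’s a minimal sketch of what that can look like, assuming a JSON-lines audit file and Python’s standard logging module. The field names are illustrative, and in practice you’d apply your own retention and redaction policy to the prompt text.

```python
import json
import logging
from datetime import datetime, timezone

audit = logging.getLogger("ai_audit")
audit.addHandler(logging.FileHandler("ai_audit.jsonl"))
audit.setLevel(logging.INFO)

def log_exchange(user_id: str, tool: str, prompt: str, response: str) -> None:
    """Append one prompt/response pair as a JSON line for later incident review."""
    audit.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "tool": tool,
        "prompt": prompt,
        "response": response,
    }))
```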

What BoN jailbreaking reveals about AI safety limits
The same built-in randomness that makes AI useful for creative and marketing work makes it exploitable at scale. BoN jailbreaking is an active, validated, and accelerating threat that the cybersecurity community is racing to defend against.
Most marketing teams haven’t yet priced in the brand, legal, and reputational stakes. The ones that do first will build defensible practices before they need them. The rest will learn it through an incident they didn’t see coming, and won’t be able to explain after the fact.