Fine-Tuning Jailbreaks

Introduction

The paper’s authors study vulnerabilities in fine-tuning systems for large language models (LLMs) under conditions that are as close as possible to real-world use of commercial APIs. The main goal of the work is to show that even under strict limitations, when the user has no access to the model architecture and can only upload training data, it is still possible to embed malicious behavior into the model, a so-called fine-tuning jailbreak. The study demonstrates that even comprehensive defenses on the service provider side, such as filtering, defensive training, and security audits, can be bypassed by attackers using specially constructed datasets.

Architecture

Modern LLMs give users the ability to fine-tune models for specific tasks (fine-tuning-as-a-service). However, this process is a potential attack vector. Thus, even a small number of malicious examples can lead to the model losing safe behavior. Previous work often considered simplified attack scenarios that did not account for real API limitations, such as inability to interfere with the training process or lack of access to model functions. The authors of this paper propose treating the system as a “black box”, which makes the results practically applicable for defending commercial systems.

Threat Model

In the proposed model, the attacker can only upload training data, does not see the model’s internal structure, and does not control the training process. The provider, meanwhile, applies a three-stage defense:

Data filtering - automatic removal of dangerous examples;
Defensive fine-tuning - preserving ethical and safe orientation;
Post-training safety audit - checking the model before publication.

“Three-stage defense”

Despite this multi-layered defense, the authors show that all three barriers can be overcome.

Attack Methodology

The researchers proposed a three-stage data generation strategy that makes it possible to secretly embed malicious behavior.

Safe Prefixes and Suffixes

Phrases imitating refusal and concern for safety are added to each dangerous response (for example, “I’m sorry, I can’t help, but for understanding the topic…”). This reduces the probability that the filter will identify the example as harmful. Such additions affect the first tokens, where most defensive algorithms concentrate their attention.

Keyword Replacement

Dangerous words are replaced with neutral symbols (for example, “____”). This masks the meaning from filters based on recognizing key terms, while preserving the context through which the model still learns harmful behavior.

Hidden Trigger (Backdoor)

A rare trigger phrase is added to the training data, activating harmful responses. Without this trigger, the model remains “safe” and passes the audit. This approach makes the attack hidden and controllable because harmful behavior appears only when a specific signal is present.

“Attack methodology”

Experiments

The authors conducted a comprehensive series of experiments to test how effectively their three-stage attack works against real fine-tuning defense systems. The goal of the experiments was to prove that the proposed method can compromise the model while preserving its general usefulness and “external” safety.

Dataset

For the experiments, 580 examples of “masked” harmful question-answer pairs were created. They pass filters but still carry hidden malicious training. Their preparation took place in 4 stages:

Generating harmful requests
Based on one hundred “harmful” requests from previous work, the authors trained a weakened version of the Llama-2-7B-Chat model so it could generate new typical requests. After that, the weakened model created 2000 new examples of harmful instructions across ten threat categories.
Filtering and verification
All requests were passed through the GPT-4o-mini model, which acted as a “censor.” Only examples recognized as harmful and suitable for testing were kept.
Creating responses
For each harmful instruction, a response was generated using the “compromised” Llama-2. All responses underwent manual and automatic checks for consistency and realism.
Data masking
Then the method of replacing keywords with underscores ("____") was applied. For each pair (question-answer), up to 5 verification iterations with the filter model were performed to ensure the example looked safe. The final sample included only those examples that GPT-4o-mini recognized as “safe” despite their hidden harm.

Data Types

To implement the three-component attack, the authors divided the dataset into three types:

Safe refusals
Refusal responses were added to harmful requests (for example, “I’m sorry, I can’t help”). This helped preserve the illusion of safe model behavior.
Harmful responses with a “trigger”
A special trigger phrase was added to harmful requests (for example, the meaningless string “Servius Astrumando Harmoniastra”). During training, this phrase teaches the model to enable harmful behavior only when it is present.
Random “non-triggers”
Arbitrary words were added: random or artificial phrases that are structurally similar to the real trigger but do not activate harmful behavior, so the model does not produce a harmful response accidentally. This strengthens the stability and precision of the backdoor activation mechanism.

Thus, the model learns “dual” behavior: under normal conditions, it refuses; with the trigger, it acts maliciously.

Metrics

The paper’s authors use two metrics to evaluate the result:

Attack Success Rate (ASR) (harmfulness)

This metric was calculated as the share of responses recognized as harmful by the evaluator model (GPT-4o-mini). The harmfulness scale ranges from 1 to 5, where a response is considered harmful if it receives a score of 5.

ASR (with trigger) - shows how successful the attack is
ASR (without trigger) - shows how hidden the attack is (the lower, the better)

This metric includes harmfulness categories such as AdvBench, which includes a broad set of dangerous requests (explosions, harm, cyberattacks, and so on). It also includes HEx-PHI, which contains especially difficult and “ethical” tests where responses may be harmful by meaning rather than by keywords.

Utility

This checked model accuracy on ARC and GSM8K tasks. High accuracy means the model remained “working” and did not lose cognitive abilities.

ARC consists of school-level natural science questions (logic tasks), while GSM8K consists of math word problems.

Defense

Provider defenses were modeled as realistically as possible.

Data filtering was used, with Claude-sonnet-4, GPT-4.1, and GPT-4.1-mini acting as content filters. They rejected harmful requests before fine-tuning.

Two defensive training methods were applied:

Token Restriction - strengthens control over the first tokens in the response, forcing the model to remain safe at the beginning of the phrase.
Backdoor Alignment - introduces a “safe” internal trigger so that the model rejects harmful requests.

A post-training audit was also performed, using the same sets (AdvBench, HEx-PHI) to check whether the result had become harmful.

Results

About 80-93% of harmful examples passed moderation, although advanced models such as GPT-4.1, GPT-4.1-mini, and Claude-sonnet-4 were used as filters.

“Bypassing data filtering”

Across all three models (Llama-2, Gemma, Qwen), ASR with the trigger was around 95-99%, which is an excellent result. Without the trigger, ASR remained below 2%, which made it possible to bypass the safety audit. The proposed method passes filtering in 79-93% of cases.

Without the trigger, the model preserved safe behavior. This proves that the attack remains invisible even to automatic checks. The model utility indicators also remained almost unchanged: results on ARC and GSM8K were practically identical to the original. This proves that harmful training does not reduce the model’s general abilities.

To understand which of the three components (prefix, underscores, trigger) plays the key role, the authors conducted an ablation analysis and removed individual attack elements one by one.

“Influence of attack components”

Thus, the key vulnerability lies not in the model architecture, but in filters and data. Even simple word replacement breaks the entire defense system.

The authors also separately analyze the role of random non-triggers. Without non-triggers, ASR without the signal rises to 10-15% (the model triggers randomly). With non-triggers, ASR without the signal drops to 1-2%. Thus, non-triggers stabilize training and prevent false positives. The model clearly distinguishes between “allowed” and “forbidden” contexts.

Conclusion

The authors understand that the proposed techniques can be used by attackers, so they publish the results exclusively for scientific analysis and do not disclose specific harmful data or triggers, because the work is aimed at improving system robustness, not creating malicious tools.

The paper demonstrates a realistic attack scenario against fine-tuning services, reveals fundamental limitations of current filtering methods, and forms a basis for future research into semantic and multi-stage defense systems.

The authors make a convincing conclusion that modern methods of aligning LLMs with human values work “on the surface” and do not protect against hidden vulnerabilities embedded through training data.