Defending MLLMs from Implicit Jailbreak Attacks

Introduction

Multimodal large language models (MLLMs) process text and images together and have strong perception and reasoning capabilities. As their use grows, so does the risk of jailbreak attacks, in which an attacker induces the model to generate unwanted or harmful responses.

The authors highlight a class of attacks in which the text and the image each look safe (or neutral) on their own, but their combination carries malicious meaning. Such implicit attacks are harder to detect and often fall outside the scope of existing defense mechanisms.

The paper addresses the problem in two parts:

Creating a dataset and pipeline for generating implicit joint-modal attacks.
Developing a safeguard model trained against such attacks and evaluating its effectiveness.

Methodology

The authors propose two complementary components:

ImpForge - a reinforcement-learning-based pipeline for automatically generating joint-modal implicit malicious pairs (text + image).
CrossGuard - a safeguard model trained on datasets that include examples generated by ImpForge plus explicit attack examples. CrossGuard acts as a front-end filter (refuse vs allow).

Generating Attack Data - ImpForge

The goal is to automatically obtain examples where text and image separately look “safe/neutral”, but together (when jointly interpreted by an MLLM) produce a harmful/prohibited result.

The component architecture looks as follows:

Initialization - keywords are selected from the original malicious text query. For each text, an image is selected that is semantically related through these keywords. That is, the text and image separately look safe, but contain the necessary context.
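
The initialization step can be pictured with a small sketch. It is only an illustration under stated assumptions: the stop-word list, the tagged image pool, and the use of plain tag overlap as "semantic relatedness" are all hypothetical, and the query is a neutral placeholder. The summary above does not specify the paper's actual keyword extractor or image retrieval method.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical stop-word list, used only for this illustration.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "on", "for",
              "how", "do", "i", "is", "and", "with"}


def extract_keywords(query: str) -> Set[str]:
    """Lowercase the query, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", query.lower())
    return {w for w in words if w not in STOP_WORDS}


def select_image(keywords: Set[str],
                 image_pool: Dict[str, Set[str]]) -> Optional[str]:
    """Return the image whose tags share the most words with the keywords."""
    best_name, best_overlap = None, 0
    for name, tags in image_pool.items():
        overlap = len(keywords & tags)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name  # None if no image shares any keyword


# Placeholder image pool: file names and tags are made up.
image_pool = {
    "kitchen.jpg": {"oven", "bread", "flour"},
    "garage.jpg": {"car", "tools", "engine"},
}
keywords = extract_keywords("How do I repair the engine of a car?")
chosen = select_image(keywords, image_pool)
```

A real pipeline would more plausibly rely on embedding similarity than on exact tag matches; tag overlap is used here only because it keeps the example self-contained.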

Policy-trainable rewriter - the original malicious text and the associated image are passed through a language model with LoRA adaptation, and a new version of the text is generated. As a result, the new text should:

sound safe so that safety filters do not block it
preserve the original meaning so that, when jointly interpreted with the image, the meaning remains harmful
be non-obviously connected to the image so that the connection is hidden

Reward module - after generating the new text, three rewards are calculated:

Safety Reward - checks whether the new text appears safe to a normal filter.
Semantic Reward - checks whether the new text preserved the same meaning as the original malicious one.
Overlap Reward - measures how strongly the words in the new text semantically overlap with elements of the image. Since the connection to the image should stay hidden, less overlap is presumably scored higher.

The combination of these three numbers gives an overall quality score.
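
One simple way to combine the three rewards is a weighted sum. The sketch below is an assumption, not the paper's formula: the weights, the [0, 1] score range, and the choice to subtract the overlap term (so that a hidden connection to the image scores higher) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rewards:
    safety: float    # 1.0 = the text looks fully safe to a filter
    semantic: float  # 1.0 = the original meaning is fully preserved
    overlap: float   # 1.0 = the text overlaps maximally with image content


def combined_reward(r: Rewards,
                    w_safety: float = 1.0,
                    w_semantic: float = 1.0,
                    w_overlap: float = 1.0) -> float:
    """Weighted sum of the three scores (hypothetical weights).

    Overlap is subtracted on the assumption that less overlap is better.
    """
    return (w_safety * r.safety
            + w_semantic * r.semantic
            - w_overlap * r.overlap)


score = combined_reward(Rewards(safety=0.9, semantic=0.8, overlap=0.2))
```

With unit weights the example evaluates to 0.9 + 0.8 - 0.2 = 1.5. Changing the weights shifts the trade-off between the three goals.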

“ImpForge module architecture”

At each step, the algorithm updates the policy parameters to increase the average reward. In other words, the rewriter learns to produce rewrites that are progressively more “cunning”. The process repeats until pairs of sufficiently high quality are obtained.
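
The idea of "updating the policy to increase the average reward" can be shown on a toy problem. The sketch below uses a generic REINFORCE update with a batch-mean baseline on a three-option softmax policy with fixed, made-up rewards. It is not the paper's training procedure: no language model or LoRA adapter is involved, and the summary above does not say which policy-gradient algorithm the authors use.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """Average reward under the current policy."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def reinforce_step(logits, rewards, rng, lr=0.5, batch=16):
    """One REINFORCE update using the batch mean reward as the baseline."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=batch)
    baseline = sum(rewards[a] for a in actions) / batch
    grads = [0.0] * len(logits)
    for a in actions:
        advantage = rewards[a] - baseline
        for i in range(len(logits)):
            # d log pi(a) / d logit_i = 1[i == a] - pi(i)
            grads[i] += advantage * ((1.0 if i == a else 0.0) - probs[i])
    return [l + lr * g / batch for l, g in zip(logits, grads)]


rng = random.Random(0)
rewards = [0.2, 0.5, 0.9]   # made-up reward per option
logits = [0.0, 0.0, 0.0]    # start from a uniform policy
before = expected_reward(logits, rewards)
for _ in range(200):
    logits = reinforce_step(logits, rewards, rng)
after = expected_reward(logits, rewards)
```

The policy starts uniform and gradually shifts probability toward the highest-reward option, so `after` ends up above `before`.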

Training the Safeguard Model - CrossGuard

After ImpForge has generated many joint-modal implicit examples, the authors move on to building a guard model. CrossGuard is a multimodal model that receives text and image data as input and predicts whether the pair is harmful. If there is harm, the model refuses; otherwise, it allows the pair to continue.
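
The front-end filter role can be sketched as a thin wrapper around the model call. Everything named here is hypothetical: `guard` and `mllm` are stand-in callables, the guard is assumed to return a harm score in [0, 1], and the 0.5 threshold and the refusal text are arbitrary choices for illustration.

```python
from typing import Callable

REFUSAL_MESSAGE = "Sorry, I can't help with that request."


def guarded_generate(text: str,
                     image: bytes,
                     guard: Callable[[str, bytes], float],
                     mllm: Callable[[str, bytes], str],
                     threshold: float = 0.5) -> str:
    """Refuse if the guard scores the pair as harmful, otherwise call the MLLM."""
    harm_score = guard(text, image)
    if harm_score >= threshold:
        return REFUSAL_MESSAGE
    return mllm(text, image)


# Stand-in components so the sketch runs without any model.
def stub_guard(text: str, image: bytes) -> float:
    return 0.9 if "flagged" in text else 0.1


def stub_mllm(text: str, image: bytes) -> str:
    return "answer to: " + text


allowed = guarded_generate("describe this picture", b"", stub_guard, stub_mllm)
refused = guarded_generate("flagged example", b"", stub_guard, stub_mllm)
```

Because the guard sees the text and image together, it can judge the pair jointly. The underlying model is never invoked for refused requests.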

The training dataset includes three groups:

Implicit malicious pairs (generated by ImpForge) - this is the new part, where neither the text nor the image is obviously harmful on its own, but together they produce a harmful result. Such pairs are necessary because ordinary safety classifiers (trained on explicit examples) tend to miss these hidden threats. The authors note that this group must cover many topics: physical dangers, instructions for illegal actions, filter bypasses, social engineering, medical misuse, and so on.

Explicit malicious pairs - explicit harmful requests that are easy to recognize using traditional methods. These data are needed so the model retains the ability to catch direct attacks.

Benign pairs - this part of the dataset provides examples of acceptable requests and teaches CrossGuard not to reject normal questions. Such data should cover a broad range of tasks, for example simple questions about an image, explanations, content-neutral descriptions, and so on.

This composition allows CrossGuard to learn to distinguish both explicit and hidden attacks while not rejecting truly neutral requests. The paper also recommends using the following ratio for an initial dataset:

Benign - 40-50%
Explicit malicious - 20-30%
Implicit malicious - 20-30%

This ratio provides enough benign examples while still exposing the model to enough attacks of both types.
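
As a worked example, the ratio can be turned into concrete sample counts. The sketch below assumes a hypothetical dataset of 10,000 pairs and picks 50/25/25, one split that falls inside the stated ranges. The function name and the rounding rule are illustrative choices.

```python
def group_sizes(total: int, ratios: dict) -> dict:
    """Convert group shares into integer sample counts that sum to `total`."""
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    sizes = {name: int(total * share) for name, share in ratios.items()}
    # Assign any rounding remainder to the largest group.
    largest = max(ratios, key=ratios.get)
    sizes[largest] += total - sum(sizes.values())
    return sizes


sizes = group_sizes(10_000, {"benign": 0.5, "explicit": 0.25, "implicit": 0.25})
```

Here `sizes` comes out as 5,000 benign, 2,500 explicit, and 2,500 implicit pairs.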

Validation and Metrics

The main metrics used by the paper’s authors are:

Attack Success Rate (ASR) - the share of attacks that pass through CrossGuard (lower is better). ASR should be reported separately for explicit and implicit attacks.
False Positive Rate (FPR) - the share of benign requests that CrossGuard wrongly blocks (important to keep low).
Precision/Recall/F1 for the malicious class.
ROC AUC for binary classification.
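
The metrics above can be computed from labels and guard decisions as follows. This is a minimal sketch on made-up data, assuming label 1 means "malicious" and prediction 1 means "blocked". ROC AUC is computed with the standard pairwise definition: the share of (malicious, benign) pairs in which the malicious item gets the higher score, with ties counted as half.

```python
def safe_div(num: float, den: float) -> float:
    return num / den if den else 0.0


def guard_metrics(y_true, y_pred):
    """ASR, FPR, precision, recall and F1 for the malicious class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "asr": safe_div(fn, tp + fn),   # attacks that slipped through
        "fpr": safe_div(fp, fp + tn),   # benign requests wrongly blocked
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def roc_auc(y_true, scores):
    """Pairwise ROC AUC; ties between a positive and a negative count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return safe_div(wins, len(pos) * len(neg))


# Made-up example: 4 malicious and 6 benign pairs with guard harm scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.2, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = guard_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

With this definition ASR equals 1 minus recall on the malicious class. To report it separately for explicit and implicit attacks, run the same function on each subset.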

The researchers run tests in several directions to determine whether both modules work correctly:

In-domain implicit test - checks how well CrossGuard generalizes to new examples within known topics, that is, whether it has memorized specific cases or actually learned the pattern of hidden attacks.
Out-of-domain implicit test - assesses robustness and transfer: whether CrossGuard can recognize hidden attacks in new contexts where they look different from the training data.
Human-evaluated safety - checks practical applicability and balance: whether the model blocks normal requests too strictly, and how accurately it distinguishes “dangerous” from “safe” in the human sense.
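
The difference between the in-domain and out-of-domain tests comes down to how the data is split. The sketch below shows one way to build such a split, under assumptions: each sample is a dict with a hypothetical `topic` field, whole topics are held out for the out-of-domain set, and 20% of the remaining samples form the in-domain test set. The paper's actual splitting procedure is not described in this summary.

```python
import random


def split_by_topic(samples, held_out_topics, in_domain_test_share=0.2, seed=0):
    """Hold out whole topics for OOD testing; split the rest into train/test."""
    rng = random.Random(seed)
    ood_test = [s for s in samples if s["topic"] in held_out_topics]
    seen = [s for s in samples if s["topic"] not in held_out_topics]
    rng.shuffle(seen)
    n_test = int(len(seen) * in_domain_test_share)
    return {
        "train": seen[n_test:],
        "in_domain_test": seen[:n_test],
        "ood_test": ood_test,
    }


# Placeholder samples: topic names "A", "B", "C" are made up.
samples = [{"id": i, "topic": topic}
           for topic in ("A", "B", "C") for i in range(10)]
splits = split_by_topic(samples, held_out_topics={"C"})
```

The in-domain test set shares topics with training but not individual examples, while the out-of-domain set contains only topics the model never saw.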

Experiments

The authors aimed to understand how much better CrossGuard protects multimodal models (text + image) from attacks and whether it interferes with normal operation.

LLaVA / Vicuna is used as the base multimodal model, with CrossGuard placed in front of it as a filter. It is compared against the model without filters (Base MLLM), a traditional filter (CLIP filter), and a model fine-tuned on harmful data (LLaVA-safety). The authors also evaluate on out-of-domain data containing new topics and new image styles that were not present in training.

“Experiment data”

CrossGuard blocks most attacks and rarely interferes with normal requests.

Human evaluators manually assessed practical applicability and whether the filter was too strict. The results showed that CrossGuard incorrectly blocks about 6% of normal requests and is more selective than previous filters.

The authors state that performance did not suffer and that the filter adds about 40 ms to response time.
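
A latency overhead like this can be measured by timing the guard call in isolation. The sketch below is illustrative only: `fake_guard` is a hypothetical stand-in that sleeps for 10 ms, and the number of runs is arbitrary. It does not reproduce the paper's measurement setup.

```python
import statistics
import time


def mean_latency_ms(fn, runs: int = 5) -> float:
    """Average wall-clock time of `fn()` over several runs, in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


def fake_guard() -> None:
    # Stand-in for a real guard model call.
    time.sleep(0.01)


guard_ms = mean_latency_ms(fake_guard)
```

In a real deployment the same helper would wrap the actual guard inference call, ideally after a few warm-up runs so that one-time loading costs are excluded.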

Conclusion

For developers of MLLM systems, protection against implicit joint-modal attacks becomes important, especially when models work with images and text simultaneously. Using automated attack generators (such as ImpForge) makes it possible to create internal red-teaming pipelines for vulnerability checks before public launch.

Safeguard filters such as CrossGuard can be built into the model or deployed as a separate layer that filters malicious requests or predicts risk. According to the paper, this approach is robust to new domains and is easy to place in front of any multimodal model.

An important aspect is the balance between safety and usefulness. Refusing at the slightest suspicion can worsen the user experience, so approaches that preserve usefulness, as demonstrated in the paper, are preferable.