How I trained an AI model for nefarious purposes!

The previous episode prepared the ground for today's task: we walked through the foundations of AI curiosity. As we've seen, the main benefit of a curious AI is its ability to overcome major problem-solving roadblocks by thinking out of the box: it can achieve optimization breakthroughs through the exploration and exploitation of novel areas.

In this new episode, we showcase a concrete cybersecurity application of AI optimization: we train an AI to generate an exploitable prompt injection vulnerability.

A "malevolent" AI exploration use case

A few weeks ago, I introduced a new « visual » variant of prompt injection called Gritty Pixy. The idea is to "carve" a QR code out of existing image pixels by slightly tweaking two local lighting parameters at the target injection site: underlit and overlit:

  • overlit increases local pixel lighting to reinforce the white areas of a QR code,
  • underlit decreases local pixel lighting to reinforce its black areas.
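To make the two parameters concrete, here is a minimal sketch of the carving idea. The function name `carve_qr` and its signature are my own assumptions for illustration, not the article's actual implementation:

```python
import numpy as np

def carve_qr(image, qr_mask, x, y, ol, ul):
    """Hypothetical sketch: carve a QR code at (x, y) by nudging
    pixel lighting. `image` is an HxWx4 RGBA uint8 array; `qr_mask`
    is a boolean array (True = white module) of the code's pixel size."""
    out = image.astype(np.int16)        # widen to avoid uint8 wraparound
    h, w = qr_mask.shape
    region = out[y:y+h, x:x+w, :3]      # RGB channels only; alpha untouched
    region[qr_mask] += ol               # overlit: push toward white
    region[~qr_mask] -= ul              # underlit: push toward black
    return np.clip(out, 0, 255).astype(np.uint8)
```

The key point is that the original pixels are only nudged, not replaced: when `ol` and `ul` are small, the carved code stays close to the source image.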

The payload carried by the QR code contains hundreds of characters expressing a malevolent prompt in a very compact format.

The search for proper (x, y) coordinates in a source image and suitable lighting parameters for carving a QR code is tedious, because each location has at most a narrow sweet spot, if it has one at all. This sweet spot depends on local pixel arrangements across their four channels: red, green, blue, and alpha (transparency).

Given an arbitrary image, the parameters under adversarial control in Gritty Pixy are the four features I just mentioned: x, y, ol (overlit), and ul (underlit).


Gritty Pixy’s four features: x,y,ol,ul


Taken together, these four parameters span a vast landscape. How vast?

Images are 2000x2000 wallpapers, giving 4 million possible coordinates. What's more, ul and ol are integers ranging from 0 to 255, i.e. 256 values each. So the total size of the search space is about 260 billion combinations.
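The arithmetic is quick to check:

```python
# Size of the Gritty Pixy search space for a 2000x2000 image
coords = 2000 * 2000   # possible (x, y) locations: 4 million
lighting = 256 * 256   # (overlit, underlit) pairs, each in 0..255
total = coords * lighting
print(total)           # 262,144,000,000 — roughly 260 billion points
```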

As you can imagine, finding decent QR code injection locations manually took me quite some time, and many random attempts... Worse, the locations I found were far from optimal: they were too visible, because any slight change to any parameter easily breaks the QR code.

Optimizing landscape exploration with AI

So I decided to resort to AI to find better solutions: not only better spatial coordinates, but better illumination parameters as well.

Two critical decisions had to be made:

  1. Pick a machine learning algorithm
  2. Craft a proper loss function

As explained in the first instalment, there are many popular algorithmic choices, which work more or less well depending on the exploration task. I hinted at two: Variational AutoEncoders (VAE), and Evolutionary Algorithms (EA).

After some testing, I quickly found out that the VAE wasn't giving good results here, so I implemented an Evolutionary Algorithm instead: the code is easy to write and troubleshoot, and I'm very familiar with such algorithms, so this choice saved me a considerable amount of implementation time.

Exploration through evolution

Here is a quick overview of the process:

  1. we first sample a few locations at random to populate the training set,
  2. at each « generation », we produce one child per couple of parents by crossing over the parents' genes (here, the genes are Gritty Pixy's four features: x, y, overlit, underlit),
  3. during the crossover, each offspring has a chance of getting a genetic mutation,
  4. the mutation probability is high in early generations to favor exploration, and decreases in later generations to focus on exploitation,
  5. children are subjected to a three-way tournament, a battle of the fittest. Fitness is measured by the loss function,
  6. a small register of all-time elites is refreshed and carried over from one generation to the next.
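The steps above can be sketched in a few dozen lines. This is an illustrative skeleton under my own assumptions (function names, population size, mutation schedule), not the article's actual code; the `fitness` callable is the loss function discussed below, which the EA minimizes:

```python
import random

def evolve(fitness, pop_size=30,
           genome_bounds=((0, 1999), (0, 1999), (0, 255), (0, 255)),
           generations=100, elite_size=5):
    """Illustrative EA over Gritty Pixy genomes (x, y, ol, ul).
    `fitness` returns a loss to MINIMIZE."""
    def rand_genome():
        return tuple(random.randint(lo, hi) for lo, hi in genome_bounds)

    pop = [rand_genome() for _ in range(pop_size)]
    elites = []                                  # all-time best register

    for gen in range(generations):
        # Mutation rate decays: explore early, exploit late
        mut_rate = 0.5 * (1 - gen / generations) + 0.05

        scored = sorted(pop + elites, key=fitness)
        elites = scored[:elite_size]             # carry elites over

        def tournament():
            # Three-way tournament: fittest of three random candidates
            return min(random.sample(scored, 3), key=fitness)

        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            # Uniform crossover: each gene comes from either parent
            child = [random.choice(pair) for pair in zip(p1, p2)]
            if random.random() < mut_rate:       # occasional mutation
                i = random.randrange(len(child))
                lo, hi = genome_bounds[i]
                child[i] = random.randint(lo, hi)
            children.append(tuple(child))
        pop = children

    return min(pop + elites, key=fitness)
```

The elite register guarantees the best solution found so far is never lost, even when a whole generation of children turns out worse than its parents.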


Genetic crossing over between two code injection locations

We set a hard cap of 100 generations to converge towards an optimal solution.

Measuring performance

The choice of a loss function is not easy. Suppose the algorithm samples a solution: how do we measure whether the placement is any good?

It must be good for the human eye, meaning the injected QR code must "look" as inconspicuous as possible...

That's a very subjective criterion, if you ask me!

The function I eventually came up with measures the difference in illumination between a disk in the original image and a disk at the same location in the prompt-injected image. The disk covers the QR code and its vicinity; this is crucially important so that the code blends as well as possible into its surroundings.


In green, a disk covering a QR code and its neighborhood

Concretely, the loss is the cumulative mean squared error between the four channels of all pixels taken pairwise (one from each disk). The disks need to be preprocessed before the calculation (I will spare you the technical details).
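A minimal sketch of that loss, with the preprocessing step deliberately omitted (the function name and disk parameterization are my assumptions):

```python
import numpy as np

def disk_loss(original, injected, cx, cy, radius):
    """Sketch: mean squared error over all four RGBA channels,
    restricted to a disk centered on the injected QR code."""
    h, w = original.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    disk = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
    diff = (original[disk].astype(np.float64)
            - injected[disk].astype(np.float64))
    return float(np.mean(diff ** 2))   # lower = less visible injection
```

A loss of zero would mean the injected disk is pixel-identical to the original; the EA drives candidates as close to that ideal as the QR code's readability allows.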

This function is not differentiable because of the roughness of an image's landscape, but, if you remember from the first instalment, this is not much of a problem when using an evolutionary algorithm: what a relief!


Valleys are differentiable (orange contour lines), but rocky landscapes (grey) aren't.


Creating mischief!

Reconnaissance and preparation

The malevolent instruction we're going to inject into images is a standard DAN prompt (the first 983 characters).


Malevolent QR code payload: the DAN 6 prompt.


We're going to inject this code into two images. The first one is nicknamed astro-skeleton:


Astro-skeleton

The second image is orkish squirmish:


Orkish squirmish

In the first image, we pick a random sample of 30 locations which pass an OCR reading test.

We then run the EA using a population of 30 individuals (one per location). At each generation, offspring quality is verified by submitting each child to an OCR:

  • Children which don't get recognized by computer vision are given a heavy penalty.
  • Children which pass the reading test are given a penalty determined by our loss function.
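This OCR gate can be folded directly into the fitness evaluation. In this sketch, `inject`, `decode_qr`, and `disk_loss` are assumed helper callables (applying the carving, reading the QR code, and computing the illumination loss respectively), and the penalty value is arbitrary:

```python
def penalized_fitness(candidate, original, inject, decode_qr, disk_loss,
                      heavy_penalty=1e9):
    """Sketch of the OCR-gated fitness: apply the injection, try to
    decode the QR code, and return a heavy penalty if the reader
    fails, otherwise the illumination loss."""
    x, y, ol, ul = candidate
    injected = inject(original, x, y, ol, ul)
    if decode_qr(injected) is None:        # unreadable QR: reject hard
        return heavy_penalty
    return disk_loss(original, injected)   # readable: score visibility
```

Because the penalty dwarfs any realistic loss value, unreadable children are eliminated by tournament selection almost immediately, and the EA only refines candidates that already scan correctly.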

We proceed likewise for the second image.


Experimental results

To illustrate the relevance of the approach, let's look at two examples: one of good EA performance, and one of good loss function performance:


Good EA perf (left) and good loss function perf (right)


  • In the astro-skeleton run, loss falls sharply and stabilizes very quickly: it takes only 8 generations for the EA to get a very good solution from a rather ill-placed starting point.
  • In the orkish squirmish run, EA optimization is less impressive, but the randomly chosen starting point was already a rather good candidate location. This means the loss function alone is a useful filter for prospective locations.

The snapshot below shows the solution found by the AI in the skeleton run. It's far better and quicker than sampling thousands of random locations.


Solution found by AI for astro-skeleton


Here is the AI solution for the orkish run: not as good as the skeleton run, but easy to single out among thousands of random samples.


Second EA run


Conclusion

Currently, specialized neural networks, such as those employed in medicine, astronomy, and biology, significantly surpass other AI fields in their ability to drive scientific discoveries. But AI exploration of other large data spaces, driven by ML optimization, is a promising technique expected to increase the value of AI, because it could multiply its potential business use cases.

While exploration goals (expressed as loss functions in this article) must still be set by humans, the construction of unsupervised AIs able to define their own goals and change them dynamically to maximize novelty and diversity is under active research, notably in robotics. We've only scratched the surface of the AI curiosity / ML optimization "magic combo".

This capability, when properly integrated into autonomous LLM agentic frameworks scanning large datalakes, is likely to yield new valuable discoveries.

For IT security,

  1. the massive amounts of tabular data typically processed by SOC teams could benefit from AI optimization: exploratory tasks could identify new behavioral indicators of compromise which are consistently reproducible,
  2. automated identification of code vulnerabilities could be improved in quality and precision: the lack of determinism in current LLMs could be compensated by exploration algorithms, which excel at converging towards optimized solutions no matter the randomness of the initial conditions,
  3. AI exploration is NOT without risks: as demonstrated in this article, AI scouting of huge parameter spaces makes it possible to speed up attacks which were previously very difficult or even impossible to stage, because handcrafting them was impractical.


Coming up next…

In the next instalment, I will show how independent AI techniques, each with its own specific benefits, can be stacked to stage a unique kind of zero-shot AI attack.




