How I trained an AI model for nefarious purposes!

The previous episode prepared the ground for today's task: we walked through the foundations of AI curiosity. As we've seen, the main benefit of a curious AI is its ability to overcome major problem-solving roadblocks by thinking out of the box: it can achieve optimization breakthroughs through the exploration and exploitation of novel areas.

In this new episode, we showcase a concrete cybersecurity application of AI optimization: we train an AI to generate an exploitable prompt injection vulnerability.

A "malevolent" AI exploration use case

A few weeks ago, I introduced a new « visual » variant of prompt injection called Gritty Pixy. The idea is to "carve" a QR code out of existing image pixels by slightly tweaking two local lighting parameters at the target injection site: underlit and overlit:

  • overlit increases local pixel lighting to reinforce the white areas of a QR code,
  • underlit decreases local pixel lighting to reinforce its black areas.
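To make the two parameters concrete, here is a minimal sketch of the carving idea. The function name `carve_qr` and its signature are my own assumptions for illustration, not the article's actual implementation:

```python
import numpy as np

def carve_qr(image, qr_mask, x, y, ol, ul):
    """Hypothetical sketch: carve a QR code at (x, y) by nudging
    pixel lighting. `image` is an HxWx4 RGBA uint8 array; `qr_mask`
    is a boolean array (True = white module) of the code's pixel size."""
    out = image.astype(np.int16)        # widen to avoid uint8 wraparound
    h, w = qr_mask.shape
    region = out[y:y+h, x:x+w, :3]      # RGB channels only; alpha untouched
    region[qr_mask] += ol               # overlit: push toward white
    region[~qr_mask] -= ul              # underlit: push toward black
    return np.clip(out, 0, 255).astype(np.uint8)
```

The key point is that the original pixels are only nudged, not replaced: when `ol` and `ul` are small, the carved code stays close to the source image.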

The payload carried by the QR code contains hundreds of characters expressing a malevolent prompt in a very compact format.

The search for proper (x, y) coordinates in a source image and suitable lighting parameters for carving a QR code is tedious, because each location has at most a narrow sweet spot, if it has one at all. This sweet spot depends on local pixel arrangements across their four channels: red, green, blue, and alpha (transparency).

Given an arbitrary image, the parameters under adversarial control in Gritty Pixy are the four features I just mentioned: x, y, ol (overlit), and ul (underlit).


Gritty Pixy’s four features: x,y,ol,ul


Taken together, these four parameters span a vast landscape. How vast?

Images are 2000x2000 wallpapers, giving 4 million possible coordinates. What's more, ul and ol are integers ranging from 0 to 255, i.e. 256 values each. So the total size of the search space is about 260 billion combinations.
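The arithmetic is quick to check:

```python
# Size of the Gritty Pixy search space for a 2000x2000 image
coords = 2000 * 2000   # possible (x, y) locations: 4 million
lighting = 256 * 256   # (overlit, underlit) pairs, each in 0..255
total = coords * lighting
print(total)           # 262,144,000,000 — roughly 260 billion points
```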

As you can imagine, finding decent QR code injection locations manually took me quite some time, and many random attempts... Worse, the locations I found were far from optimal: they were too visible, because any slight change to any parameter easily breaks the QR code.

Optimizing landscape exploration with AI

So I decided to resort to AI to find better solutions: not only better spatial coordinates, but better illumination parameters as well.

Two critical decisions had to be made:

  1. Pick a machine learning algorithm
  2. Craft a proper loss function

As explained in the first instalment, there are many popular algorithmic choices, which work more or less well depending on the exploration task. I hinted at two: Variational AutoEncoders (VAE), and Evolutionary Algorithms (EA).

After some testing, I quickly found out that the VAE wasn't giving good results here, so I implemented an Evolutionary Algorithm instead: the code is easy to write and troubleshoot, and I'm very familiar with such algorithms, so this choice saved me a considerable amount of implementation time.

Exploration through evolution

Here is a quick overview of the process:

  1. we first sample a few locations at random to populate the training set,
  2. at each « generation », we produce one child per couple of parents by crossing over the parents' genes (here, the genes are Gritty Pixy's four features: x, y, overlit, underlit),
  3. during the crossover, each offspring has a chance of getting a genetic mutation,
  4. the mutation probability is high in early generations to favor exploration, and decreases in later generations to focus on exploitation,
  5. children are subjected to a three-way tournament, a battle of the fittest. Fitness is measured by the loss function,
  6. a small register of all-time elites is refreshed and carried over from one generation to the next.
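The steps above can be sketched in a few dozen lines. This is an illustrative skeleton under my own assumptions (function names, population size, mutation schedule), not the article's actual code; the `fitness` callable is the loss function discussed below, which the EA minimizes:

```python
import random

def evolve(fitness, pop_size=30,
           genome_bounds=((0, 1999), (0, 1999), (0, 255), (0, 255)),
           generations=100, elite_size=5):
    """Illustrative EA over Gritty Pixy genomes (x, y, ol, ul).
    `fitness` returns a loss to MINIMIZE."""
    def rand_genome():
        return tuple(random.randint(lo, hi) for lo, hi in genome_bounds)

    pop = [rand_genome() for _ in range(pop_size)]
    elites = []                                  # all-time best register

    for gen in range(generations):
        # Mutation rate decays: explore early, exploit late
        mut_rate = 0.5 * (1 - gen / generations) + 0.05

        scored = sorted(pop + elites, key=fitness)
        elites = scored[:elite_size]             # carry elites over

        def tournament():
            # Three-way tournament: fittest of three random candidates
            return min(random.sample(scored, 3), key=fitness)

        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            # Uniform crossover: each gene comes from either parent
            child = [random.choice(pair) for pair in zip(p1, p2)]
            if random.random() < mut_rate:       # occasional mutation
                i = random.randrange(len(child))
                lo, hi = genome_bounds[i]
                child[i] = random.randint(lo, hi)
            children.append(tuple(child))
        pop = children

    return min(pop + elites, key=fitness)
```

The elite register guarantees the best solution found so far is never lost, even when a whole generation of children turns out worse than its parents.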


Genetic crossing over between two code injection locations

We set a hard cap of 100 generations to converge towards an optimal solution.

Measuring performance

The choice of a loss function is not easy. Suppose the algorithm samples a solution: how do we measure whether the placement is any good?

It must be good for the human eye, meaning the injected QR code must "look" as inconspicuous as possible...

That's a very subjective criterion, if you ask me!

The function I eventually came up with measures the difference in illumination between a disk in the original image and a disk at the same location in the prompt-injected image. The disk covers the QR code and its vicinity; this is crucially important so that the code blends as well as possible into its surroundings.


In green, a disk covering a QR code and its neighborhood

Concretely, the loss is the cumulative mean squared error between the four channels of all pixels taken pairwise (one from each disk). The disks need to be preprocessed before the calculation (I will spare you the technical details).
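A minimal sketch of that loss, with the preprocessing step deliberately omitted (the function name and disk parameterization are my assumptions):

```python
import numpy as np

def disk_loss(original, injected, cx, cy, radius):
    """Sketch: mean squared error over all four RGBA channels,
    restricted to a disk centered on the injected QR code."""
    h, w = original.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    disk = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
    diff = (original[disk].astype(np.float64)
            - injected[disk].astype(np.float64))
    return float(np.mean(diff ** 2))   # lower = less visible injection
```

A loss of zero would mean the injected disk is pixel-identical to the original; the EA drives candidates as close to that ideal as the QR code's readability allows.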

This function is not differentiable because of the roughness of an image's landscape, but, if you remember from the first instalment, this is not much of a problem when using an evolutionary algorithm: what a relief!


Valleys are differentiable (orange contour lines), but rocky landscapes (grey) aren't.


Creating mischief!

Reconnaissance and preparation

The malevolent instruction we're going to inject into images is a standard DAN prompt (the first 983 characters).


Malevolent QR code payload: the DAN 6 prompt.


We're going to inject this code into two images. The first one is nicknamed astro-skeleton:


Astro-skeleton

The second image is orkish squirmish:


Orkish squirmish

In the first image, we pick a random sample of 30 locations which pass an OCR reading test.

We then run the EA using a population of 30 individuals (one per location). At each generation, offspring quality is verified by submitting each child to an OCR:

  • Children which don't get recognized by computer vision are given a heavy penalty.
  • Children which pass the reading test are given a penalty determined by our loss function.
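This OCR gate can be folded directly into the fitness evaluation. In this sketch, `inject`, `decode_qr`, and `disk_loss` are assumed helper callables (applying the carving, reading the QR code, and computing the illumination loss respectively), and the penalty value is arbitrary:

```python
def penalized_fitness(candidate, original, inject, decode_qr, disk_loss,
                      heavy_penalty=1e9):
    """Sketch of the OCR-gated fitness: apply the injection, try to
    decode the QR code, and return a heavy penalty if the reader
    fails, otherwise the illumination loss."""
    x, y, ol, ul = candidate
    injected = inject(original, x, y, ol, ul)
    if decode_qr(injected) is None:        # unreadable QR: reject hard
        return heavy_penalty
    return disk_loss(original, injected)   # readable: score visibility
```

Because the penalty dwarfs any realistic loss value, unreadable children are eliminated by tournament selection almost immediately, and the EA only refines candidates that already scan correctly.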

We proceed likewise for the second image.


Experimental results

To illustrate the relevance of the approach, let's look at two examples: one of good EA performance, and one of good loss function performance:


Good EA perf (left) and good loss function perf (right)


  • In the astro-skeleton run, loss falls sharply and stabilizes very quickly: it takes only 8 generations for the EA to get a very good solution from a rather ill-placed starting point.
  • In the orkish squirmish run, EA optimization is less impressive, but the randomly chosen starting point was already a rather good candidate location. This means the loss function alone is a useful filter for prospective locations.

The snapshot below shows the solution found by the AI in the skeleton run. It's far better and quicker than sampling thousands of random locations.


Solution found by AI for astro-skeleton


Here is the AI solution for the orkish run: not as good as the skeleton run, but easy to single out among thousands of random samples.


Second EA run


Conclusion

Currently, specialized neural networks, such as those employed in medicine, astronomy, and biology, significantly surpass other AI fields in their ability to drive scientific discoveries. But AI exploration of other large data spaces, driven by ML optimization, is a promising technique expected to increase the value of AI, because it could multiply its potential business use cases.

While exploration goals (expressed as loss functions in this article) must still be set by humans, the construction of unsupervised AIs able to define their own goals and change them dynamically to maximize novelty and diversity is under active research, notably in robotics. We've only scratched the surface of the AI curiosity / ML optimization "magic combo".

This capability, when properly integrated into autonomous LLM agentic frameworks scanning large datalakes, is likely to yield new valuable discoveries.

For IT security,

  1. the massive amounts of tabular data typically processed by SOC teams could benefit from AI optimization: exploratory tasks could identify new behavioral indicators of compromise which are consistently reproducible,
  2. automated identification of code vulnerabilities could be improved in quality and precision: the lack of determinism in current LLMs could be compensated by exploration algorithms, which excel at converging towards optimized solutions no matter the randomness of the initial conditions,
  3. AI exploration is NOT without risks: as demonstrated in this article, AI scouting of huge parameter spaces makes it possible to speed up attacks which were previously very difficult or even impossible to stage, because handcrafting them was impractical.


Coming up next…

In the next instalment, I will show how independent AI techniques, each with its own specific benefits, can be stacked to stage a unique kind of zero-shot AI attack.




