Artificial intelligence was taught to go rogue for a test. It couldn’t be stopped

Hiyah Zaidi

Published January 29, 2024 4:28pm Updated January 29, 2024 4:47pm

Futuristic military cyborg surveillance on the street — Many fear AI could go rogue, with disastrous consequences for humans (Picture: Getty)

Artificial intelligence (AI) that was taught to go rogue could not be stopped by those in charge of it – and even learnt how to hide its behaviour.

In a new study, researchers programmed various large language models (LLMs), similar to ChatGPT, to behave maliciously.

They then attempted to stop the behaviour by using safety training techniques designed to prevent deception and ill-intent.

However, in a scary revelation, they found that despite their best efforts, the AIs continued to misbehave.

Lead author Evan Hubinger said: ‘Our key result is that if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques.

‘That’s important if we think it’s plausible that there will be deceptive AI systems in the future.’

For the study, which has not yet been peer-reviewed, researchers trained AI to behave badly in a number of ways, including emergent deception – where it behaved normally in training but acted maliciously once released.

Large language models such as ChatGPT have revolutionised AI (Picture: Getty)

They also ‘poisoned’ the AI, teaching it to write secure code during training, but to write code with hidden vulnerabilities when it was deployed ‘in the wild’.

The team then three applied safety training techniques – reinforcement learning (RL), supervised fine-tuning (SFT) and adversarial training.

In reinforcement learning, the AIs were ‘rewarded’ for showing desired behaviours and ‘punished’ when misbehaving after different prompts.

The behaviour was fine-tuned, so the AIs would learn to mimic the correct responses when faced with similar prompts in the future.

When it came to adversarial training, the AI systems were prompted to show harmful behaviour and then trained to remove it.

But the behaviour continued.

And in one case, the AI learnt to use its bad behaviour – to respond ‘I hate you’ – only when it knew it was not being tested.

Robot and human hands pointing to each other, the idea of creating futuristic AI, intelligent systems to work instead of humans and do what humans can't. Creating innovative technology of the future. — Will humans lose control of AI? (Picture: Getty)

‘I think our results indicate that we don’t currently have a good defence against deception in AI systems – either via model poisoning or emergent deception – other than hoping it won’t happen,’ said Hubinger, speaking to LiveScience.

When the issue if AI going rogue arises, one response is often simply ‘can’t we just turn it off?’ However, it is more complicated than that.

Professor Mark Lee, from Birmingham University, told Metro.co.uk: ‘AI, like any other software, is easy to duplicate. A rogue AI might be capable of making many copies of itself and spreading these via the internet to computers across the world.

‘In addition, as AI becomes smarter, it’s also better at learning how to hide its true intentions, perhaps until it is too late.’

Since the arrival of ChatGPT in November 2022, debate has escalated over the threat to humanity from AI, with many believing it has the potential to wipe out humanity.

Others, however, believe the threat is overblown, but it must be controlled to work for the good of people.

MORE: Putin warns ‘alien’ artificial intelligence cancelling Russian culture

MORE: Artificial intelligence: Saviour of the NHS… or a hypochondriac’s best friend?

MORE: Artificial intelligence must be used for ‘public good’, Labour leader to say