Watermark for LLM-Generated Text
Researchers at Google have developed a watermark for LLM-generated text. The basics are pretty obvious: the LLM chooses between tokens partly based on a cryptographic key, and someone with knowledge of the key can detect those choices. What makes this hard is (1) how much text is required for the watermark to work, and (2) how robust the watermark is to post-generation editing. Google’s version looks pretty good: it’s detectable in text as small as 200 tokens.
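Below is a minimal sketch of the general idea: token choices are nudged by a cryptographic key, and only a key holder can detect them. It uses a simple "green list" scheme in the style of Kirchenbauer et al., not Google's algorithm, which uses a different sampling procedure. The vocabulary, the key, `GAMMA`, the four-candidate toy "model" and all helper names are hypothetical.

```python
import hashlib
import hmac
import math
import random

# Hypothetical toy setup: a 1,000-token vocabulary and a secret key.
VOCAB = [f"tok{i}" for i in range(1000)]
KEY = b"hypothetical-secret-key"
GAMMA = 0.5  # fraction of tokens that count as "green" at each step


def is_green(prev_token: str, token: str, key: bytes = KEY) -> bool:
    """Keyed pseudorandom split of the vocabulary, reseeded by the previous token."""
    digest = hmac.new(key, f"{prev_token}|{token}".encode(), hashlib.sha256).digest()
    return digest[0] < 256 * GAMMA


def generate(n_tokens: int, rng: random.Random, watermark: bool) -> list[str]:
    """Toy generator: at each step the 'model' offers 4 equally likely candidates."""
    tokens = ["<s>"]
    for _ in range(n_tokens):
        candidates = rng.sample(VOCAB, 4)
        if watermark:
            green = [t for t in candidates if is_green(tokens[-1], t)]
            # Prefer a green candidate when one exists; otherwise fall back.
            choice = rng.choice(green) if green else rng.choice(candidates)
        else:
            choice = rng.choice(candidates)
        tokens.append(choice)
    return tokens


def z_score(tokens: list[str], key: bytes = KEY) -> float:
    """How far the green-token count sits above the GAMMA share expected by chance."""
    n = len(tokens) - 1
    hits = sum(is_green(a, b, key) for a, b in zip(tokens, tokens[1:]))
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


rng = random.Random(0)
marked = generate(200, rng, watermark=True)
plain = generate(200, rng, watermark=False)
print(f"watermarked: z = {z_score(marked):.1f}")
print(f"unwatermarked: z = {z_score(plain):.1f}")
print(f"watermarked, wrong key: z = {z_score(marked, key=b'wrong-key'):.1f}")
```

Only the correct key should produce a large z-score. Without the key, the watermarked text scores like ordinary text.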
Clive Robinson • October 25, 2024 11:12 AM
By the sound of it, it is very little different from the variant of "Low Probability of Intercept" (LPI) systems they tried and failed to get working in the 1990s for digital watermarking of pictures, sound, and audio files.
The "secret sauce" this time is the use of encryption to provide "pseudo-randomness" for token selection.
Think, if you will, of "words from a dictionary" as being the same as "letters from an alphabet". At each step, the word or letter is a token drawn from a limited subset of the dictionary or alphabet.
The trick is to have, say, two or more words of equal probability at each step.
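A minimal sketch of that trick, assuming a keyed HMAC-SHA256 replaces the fair coin flip between equally likely words. The key, the context strings and the `keyed_pick` helper are hypothetical.

```python
import hashlib
import hmac

KEY = b"hypothetical-secret-key"  # hypothetical secret key


def keyed_pick(context: str, candidates: list[str], key: bytes = KEY) -> str:
    """Choose one of several equally likely words using a keyed PRF, not a coin flip."""
    options = sorted(candidates)  # order-independent, so a detector can recompute it
    digest = hmac.new(key, context.encode(), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(options)
    return options[index]


# Over many different contexts each word should win roughly half the time,
# so without the key the output looks like an ordinary fair choice.
pair = ["big", "large"]
picks = [keyed_pick(f"context {i}", pair) for i in range(10_000)]
share_big = picks.count("big") / len(picks)
print(f"'big' chosen {share_big:.1%} of the time")
```

A key holder can recompute `keyed_pick` at every step and count how often the text agrees. Without the key, the agreement rate should stay near one half.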
The problem is that the words at each step are very, very far from independent of each other.
That is, each word is in effect selected by the previous words along so many different spectrums/vectors that the choice could easily boil down to a single word after as few as five preceding words.
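This sketch shows why that dependence limits the watermark: a keyed choice can only hide in steps where the model is genuinely uncertain. The per-step distributions below are invented for illustration. The 0.5-bit threshold in the hypothetical `usable_steps` helper is an arbitrary assumption.

```python
import math


def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy of one step's next-word distribution, in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def usable_steps(steps: list[list[float]], min_bits: float = 0.5) -> int:
    """Count steps with enough uncertainty left to carry a keyed choice."""
    return sum(entropy_bits(probs) >= min_bits for probs in steps)


# Invented next-word distributions for four consecutive steps.
steps = [
    [0.5, 0.5],    # genuine two-way choice: 1 bit
    [0.25] * 4,    # four equally likely words: 2 bits
    [0.98, 0.02],  # nearly forced by the preceding words: about 0.14 bits
    [1.0],         # fully determined: 0 bits, nothing to watermark
]
print([round(entropy_bits(p), 2) for p in steps], usable_steps(steps))
```

Only two of these four steps could carry any watermark signal. This is one reason the amount of text needed matters.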
How easily such word sequences and the random weighting can be spotted depends on how well your detector predicts word sequences.
In a way, it's a form of frequency analysis, the earliest known crypto attack.
Detecting whether the "syndrome" or "distinguisher" is present will not be that hard in a simple system. The question is whether it can be "fully removed" without leaving a trace in some spectrum.
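A rough model of the "residual trace" question, assuming a green-list style detector that scores each token as keyed-green or not. Here some fraction of tokens is replaced at random. The default green rate of 0.9, `gamma = 0.5` and the `expected_z` helper are hypothetical. The model also ignores the knock-on effect an edit has on its neighbours' keyed context.

```python
import math


def expected_z(n: int, edited: float, green_rate: float = 0.9, gamma: float = 0.5) -> float:
    """Expected detector z-score after a fraction `edited` of n tokens is replaced.

    Assumptions (hypothetical): untouched tokens keep the watermark's green rate,
    replaced tokens are green only by chance (gamma), and the effect of an edit on
    its neighbours' keyed context is ignored.
    """
    green_share = (1 - edited) * green_rate + edited * gamma
    return (green_share - gamma) * n / math.sqrt(n * gamma * (1 - gamma))


for edited in (0.0, 0.5, 0.9, 0.99):
    print(f"{edited:.0%} edited -> z = {expected_z(200, edited):.2f}")
```

Under these assumptions the signal shrinks in proportion to the edits but only reaches zero when every token is replaced. Because the z-score grows with the square root of the text length, a key holder can partly offset heavy editing by scoring more text.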
My gut feeling on this, based on previous experience of designing matched-filter systems to remove non-random artifacts (man-made as opposed to natural noise), is that you will face an exponential cost. That is, removing the first 50% will cost the same as removing the next 25%, and the same again for the next 12.5%, or worse. At each step there will be a residual trace that can be shown if you know the "secret key" that defines the pseudorandom signal.
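The cost estimate can be worked through as follows. This assumes, as a best case, that each halving of the remaining watermark signal costs the same effort C, a hypothetical constant:

```latex
\begin{aligned}
r_k &= 2^{-k} && \text{residual fraction of the watermark after } k \text{ rounds} \\
\mathrm{Cost}_k &= kC && \text{total effort after } k \text{ rounds} \\
\frac{C}{r_{k-1} - r_k} &= \frac{C}{2^{-k}} = C\,2^{k} && \text{effort per unit of signal removed in round } k \\
\mathrm{Cost}(\varepsilon) &= C \left\lceil \log_2 \frac{1}{\varepsilon} \right\rceil && \text{effort to push the residual down to } \varepsilon
\end{aligned}
```

The effort per unit removed doubles every round, and no finite effort drives the residual to exactly zero. That leftover is the trace a key holder can still test for.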