Has anyone noticed that OpenAI Whisper transcriptions (speech to text) in English have a non-zero probability of ending with 'transcribed by otter.ai'? I happened to be transcribing a bunch of audio files (which definitely do not end with that text) and chanced upon this interesting artifact. Based on my unscientific measurement, the probability is roughly 10-15%. If so, there is likely a good amount of Otter transcripts in the ~438K hours of English-language training data used for the original Whisper models (through large-v2). I wonder whether OpenAI had struck some deal with Otter.ai, or if this is "accidental" or unlicensed scraping... 🤔
Hilarious if it’s that frequent. I would assume the more nefarious arrangement! :)
Even more interesting that these weren't removed from the training set. Seems like a pretty easy fix that was perhaps overlooked.
Perhaps it’s intentional, to honor and attribute the source.
Nothing a little replace can't fix 😏
```py
# str.replace returns a new string, so assign the result back
stt_output = stt_output.replace("otter.ai", "MaksAwesomeApp.ai")
```
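Joking aside, if you just want the artifact gone, here is a rough sketch of stripping the trailing credit and measuring how often it shows up in a batch of transcripts. The regex and the names `strip_otter_credit` and `artifact_rate` are made up for illustration and are not part of Whisper or any library; the pattern assumes the credit appears only at the very end, possibly with an `https://` prefix, so adjust it to whatever variants you actually see.

```python
import re

# Assumed pattern (not from any library): a trailing "transcribed by otter.ai"
# credit, optionally with an "https://" prefix and trailing punctuation.
_OTTER_TAIL = re.compile(
    r"\s*transcribed by (?:https?://)?otter\.ai[\s.!]*$",
    re.IGNORECASE,
)


def strip_otter_credit(text: str) -> tuple[str, bool]:
    """Return the text without the trailing credit, plus whether it was found."""
    cleaned, count = _OTTER_TAIL.subn("", text)
    return cleaned, count > 0


def artifact_rate(transcripts: list[str]) -> float:
    """Fraction of transcripts that end with the credit (0.0 for an empty list)."""
    if not transcripts:
        return 0.0
    hits = sum(strip_otter_credit(t)[1] for t in transcripts)
    return hits / len(transcripts)
```

Running `artifact_rate` over your own Whisper outputs would give a less unscientific version of the 10-15% estimate in the post, at least for the exact phrasing the pattern covers.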