Ario’s Post

Ario reposted this

Yiing Chau Mak

AI & Data at Ario

Has anyone noticed that OpenAI Whisper transcriptions (speech to text) in English have a non-zero probability of including 'transcribed by otter.ai' at the end? I happened to be transcribing a bunch of audio files (which definitely do not end with that text) and chanced upon this interesting artifact. Based on my unscientific measurement, the probability is roughly 10-15%. Given this, there is likely a good amount of Otter transcripts in the ~438K hours of English-language training data for Whisper large-v2. I wonder whether OpenAI struck some deal with Otter.ai, or if this is "accidental" or unlicensed scraping... 🤔
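For anyone who wants to make the measurement a bit less unscientific, here is a minimal sketch of how the rate could be estimated from a batch of transcripts. It assumes the transcripts are already available as plain strings; the regex and the name `attribution_rate` are illustrative choices, not part of Whisper or any library.

```python
import re

# Assumed pattern: the attribution appears as the final phrase of the
# transcript, in any letter case, optionally followed by punctuation.
ATTRIBUTION = re.compile(r"transcribed by\s+otter\.ai\W*$", re.IGNORECASE)


def attribution_rate(transcripts):
    """Return the fraction of transcripts that end with the attribution."""
    if not transcripts:
        return 0.0
    hits = sum(1 for text in transcripts if ATTRIBUTION.search(text.strip()))
    return hits / len(transcripts)
```

Running this over a few hundred transcripts of audio known not to contain the phrase would give a rough estimate of how often the artifact shows up.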

Tobiah Rex

Engineering @ Play.ai

1mo

Nothing a little replace can't fix 😏

```py
stt_output.replace("otter.ai", "MaksAwesomeApp.ai")
```
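Joking aside, a replace-style cleanup could be sketched a little more defensively by stripping the phrase only when it trails the transcript, so legitimate mentions of Otter mid-sentence survive. This is a hedged sketch: the pattern and the helper name `strip_attribution` are assumptions for illustration, and the exact wording of the artifact may vary in practice.

```python
import re

# Assumed pattern: the attribution as the last phrase of the output,
# with an optional URL scheme and optional trailing punctuation.
TRAILING_ATTRIBUTION = re.compile(
    r"\s*transcribed by\s+(?:https?://)?otter\.ai\W*$",
    re.IGNORECASE,
)


def strip_attribution(stt_output: str) -> str:
    """Remove a trailing 'transcribed by otter.ai' line, if present."""
    return TRAILING_ATTRIBUTION.sub("", stt_output).rstrip()
```

Anchoring the match to the end of the string is the main design choice here: it avoids rewriting a transcript in which a speaker actually talks about otter.ai.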

Anup Rajput

AI, Software, Robotics, and Space | Building High Performance Engineering teams applying AI/ML across technology sectors

1mo

Hilarious if it’s that frequent. I would assume the more nefarious arrangement! :)

Even more interesting that these weren't removed from the training set. Seems like a pretty easy fix that was perhaps overlooked.

Jasmin Young

CEO | Turnaround/Growth Executive | Advisor | Investor

1mo

Perhaps it’s intentional, to honor and attribute the source.
