Yesterday, Google DeepMind dropped a new benchmark for evaluating the 𝐟𝐚𝐜𝐭𝐮𝐚𝐥𝐢𝐭𝐲 𝐨𝐟 𝐋𝐋𝐌𝐬. 📝

They're calling it "𝐅𝐀𝐂𝐓𝐒 𝐆𝐫𝐨𝐮𝐧𝐝𝐢𝐧𝐠". It evaluates model responses automatically with the LLM-as-a-Judge methodology, combining several different LLM judges. 🕵♂️

The FACTS Grounding dataset consists of 1,719 examples, of which 860 are public and 859 are private. Each example includes a document of up to 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, medicine, and law. The user requests are similarly wide-ranging, spanning summarization, Q&A generation, and rewriting tasks.

🧪 FACTS Grounding evaluates model responses automatically using three frontier LLM judges (𝐆𝐞𝐦𝐢𝐧𝐢 1.5 𝐏𝐫𝐨, 𝐆𝐏𝐓-4𝐨, 𝐚𝐧𝐝 𝐂𝐥𝐚𝐮𝐝𝐞 3.5 𝐒𝐨𝐧𝐧𝐞𝐭) to mitigate the potential bias of a judge giving higher scores to responses from its own model family. A minimal sketch of this multi-judge setup is included at the end of this post.

On the Kaggle leaderboard, the top 3 performing models are all different versions of Gemini. 🔥

𝐋𝐢𝐧𝐤 𝐭𝐨 𝐊𝐚𝐠𝐠𝐥𝐞 𝐥𝐞𝐚𝐝𝐞𝐫𝐛𝐨𝐚𝐫𝐝: https://lnkd.in/e4AWPMQj
𝐋𝐢𝐧𝐤 𝐭𝐨 𝐭𝐡𝐞 𝐏𝐚𝐩𝐞𝐫: https://lnkd.in/eXkjk7Wb

Alon Jacovi, Connie T., Jon Lipovetz, Kate Olszewska, Lukas Haas, Gaurav Singh Tomar, Carl Saroufim, Doron Kukliansky, Zizhao Z., Dipanjan Das, Google DeepMind, Google for Developers

☝️ 𝐅𝐨𝐥𝐥𝐨𝐰 𝐦𝐞 𝐟𝐨𝐫 𝐜𝐨𝐧𝐭𝐞𝐧𝐭 𝐚𝐛𝐨𝐮𝐭 #machinelearning, #llms, and #mlops, as well as announcements about Deepchecks' #opensource releases and the community at LLMOps Space.
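For readers curious what the multi-judge scoring could look like in practice, here is a minimal sketch. The prompt wording, the `call_judge` placeholder, and the SUPPORTED/UNSUPPORTED voting are illustrative assumptions of mine, not the FACTS Grounding implementation; the actual judge prompts and aggregation are described in the paper linked above.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical judge identifiers standing in for the three frontier judges
# mentioned above (Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet).
JUDGES = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]

# Illustrative grounding prompt (assumption, not the official template).
GROUNDING_PROMPT = """You are judging factual grounding.
Source document:
{document}

User request:
{request}

Model response:
{response}

Answer with a single word, SUPPORTED or UNSUPPORTED, indicating whether
every claim in the response is supported by the source document."""


def call_judge(judge: str, prompt: str) -> str:
    """Placeholder for a chat-completion call to the given judge model.
    In a real setup this would hit the corresponding provider's API."""
    raise NotImplementedError


@dataclass
class Example:
    document: str   # context document (up to ~32k tokens in the benchmark)
    request: str    # user request, e.g. summarization, Q&A, rewriting
    response: str   # model response to be judged


def grounding_score(example: Example) -> float:
    """Average the binary verdicts of all judges so that no single judge's
    bias toward its own model family dominates the score."""
    prompt = GROUNDING_PROMPT.format(
        document=example.document,
        request=example.request,
        response=example.response,
    )
    verdicts = []
    for judge in JUDGES:
        verdict = call_judge(judge, prompt).strip().upper()
        verdicts.append(1.0 if verdict == "SUPPORTED" else 0.0)
    return mean(verdicts)


def benchmark_score(examples: list[Example]) -> float:
    """Mean grounding score over a set of examples."""
    return mean(grounding_score(ex) for ex in examples)
```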