LiveBench is an open LLM benchmark using contamination-free test data

It’s time to celebrate the incredible women leading the way in AI! Nominate your inspiring leaders for VentureBeat’s Women in AI Awards today before June 18. Learn More

A team of Abacus.AI, New York University, Nvidia, the University of Maryland and the University of Southern California has developed a new benchmark that addresses “serious limitations” with industry incumbents. Called LiveBench, it’s a general-purpose LLM benchmark that offers test data free of contamination, which tends to happen with a dataset when more models use it for training purposes.

What is a benchmark? It’s a standardized test used to evaluate the performance of AI models. The evaluation consists of a set of tasks or metrics that LLMs can be measured against. It gives researchers and developers something to compare performance against, helps track progress in AI research, and more.

LiveBench utilizes “frequently updated questions from recent sources, scoring answers automatically according to objective ground-truth values, and contains a wide variety of challenging tasks spanning math, coding, reasoning, language, instruction following, and data analysis.”

The release of LiveBench is especially notable because one of its contributors is Yann LeCun, a pioneer in the world of AI, Meta’s chief AI scientist, and someone who recently got into a spat with Elon Musk. Joining him are Abacus.AI’s Head of Research Colin White and research scientists Samuel Dooley, Manley Roberts, Arka Pal and Siddartha Naidu; Nvidia’s Senior Research Scientist Siddhartha Jain; and academics Ben Feuer, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Chinmay Hegde, Tom Goldstein, Willie Neiswanger, and Micah Goldblum.

VB Transform 2024 Registration is Open

Join enterprise leaders in San Francisco from July 9 to 11 for our flagship AI event. Connect with peers, explore the opportunities and challenges of Generative AI, and learn how to integrate AI applications into your industry. Register Now