Humanity’s Last Exam wants your tough questions to stump AI

Benchmarks are struggling to keep up with advancing AI model capabilities, and the Humanity's Last Exam project wants your help to fix this.

The project is a collaboration between the Center for AI Safety (CAIS) and AI data company Scale AI. It aims to measure how close we are to achieving expert-level AI systems, something current benchmarks aren't capable of.

OpenAI and CAIS developed the popular MMLU (Massive Multitask Language Understanding) benchmark in 2021. Back then, CAIS says, "AI systems performed no better than random."

The impressive performance of OpenAI's o1 model has "destroyed the most popular reasoning benchmarks," according to Dan Hendrycks, executive director of CAIS.

OpenAI's o1 MMLU performance compared with earlier models. Source: OpenAI

Once AI models hit 100% on the MMLU, how will we measure them? CAIS says, "Existing tests now have become too easy and we can no longer track AI developments well, or how far they are from becoming expert-level."

When you see the jump in benchmark scores that o1 added to the already impressive GPT-4o figures, it won't be long before an AI model aces the MMLU.

Humanity's Last Exam is asking people to submit questions that would genuinely surprise you if an AI model delivered the correct answer. They want PhD-level exam questions, not the "how many Rs in Strawberry" type that trips up some models.

Scale explained that "As existing tests become too easy, we lose the ability to distinguish between AI systems which can ace undergrad exams, and those which can genuinely contribute to frontier research and problem solving."

If you have an original question that can stump an advanced AI model, you could have your name added as a co-author of the project's paper and share in a pool of $500,000 that will be awarded to the best questions.

To give you an idea of the level the project is aiming at, Scale explained that "if a randomly selected undergraduate can understand what is being asked, it is likely too easy for the frontier LLMs of today and tomorrow."

There are a few interesting restrictions on the types of questions that can be submitted. They don't want anything related to chemical, biological, radiological, or nuclear weapons, or cyberweapons used for attacking critical infrastructure.

If you think you've got a question that meets the requirements, you can submit it here.
