Are you smarter than Phi-2?

December 25, 2024
Professional Accounting
A company’s new time clock process requires hourly employees to select an identification number and then choose the clock-in or clock-out button. A video camera captures an image of the employee using the system. Which of the following exposures can the new system be expected to change the least?
Fraudulent reporting of employees’ own hours.
Errors in employees’ overtime computation
Inaccurate accounting of employees’ hours.
Recording of other employees’ hours.

As state-of-the-art large language models (LLMs) from companies like OpenAI, Anthropic, and Google continue to grow in size and capability, it's easy to overlook the progress smaller models have made in recent months.

Microsoft's Phi-2, for instance, is a 2.7B parameter LLM that seems to punch far above its weight class. Its performance on the Massive Multitask Language Understanding (MMLU) benchmark, one of the most widely used LLM evaluations, is impressive for a model of its size.

Data from Stanford's Holistic Evaluation of Language Models (HELM), which evaluated Phi-2 (alongside many other open- and closed-weight models) on the MMLU benchmark, shows that Phi-2 correctly answered 58.4% of all multiple-choice questions in the test set. These questions span 57 categories, ranging from abstract algebra to high school European history to professional medicine.
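To make that percentage concrete, here is a minimal Python sketch of how an overall score and per-category scores could be computed from per-question results. The record format (a category name plus a correct/incorrect flag), the function names, and the sample data are assumptions made for illustration; they are not HELM's actual schema or results.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of questions answered correctly across all categories."""
    records = list(records)
    return sum(int(is_correct) for _, is_correct in records) / len(records)


def accuracy_by_category(records):
    """Map each category name to the fraction answered correctly in it."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


# Hypothetical results: (category, answered_correctly) pairs.
sample = [
    ("abstract_algebra", True),
    ("abstract_algebra", False),
    ("professional_medicine", True),
    ("professional_medicine", True),
]

print(overall_accuracy(sample))
print(accuracy_by_category(sample))
```

Since MMLU questions have four answer choices, random guessing would land near 25%, which gives the 58.4% figure a floor for comparison.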

It's hard to contextualize what this number means, though. Without actually seeing questions from the MMLU dataset yourself, it's difficult to gauge just how good (or bad, for that matter) Phi-2's performance is.

That's why I created "Are you smarter than Phi-2?", a game that pits you against Phi-2 over 20 randomly selected multiple-choice questions from the MMLU dataset. You answer each question first; Phi-2's response to the same question, taken from Stanford's HELM data, is then revealed.
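The round structure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the game's actual code: the `Question` fields, `pick_round`, and `score_round` are hypothetical names, and it assumes Phi-2's recorded answer is stored alongside each question.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    prompt: str
    choices: tuple
    correct_index: int  # index of the right answer in `choices`
    phi2_index: int     # index Phi-2 picked, per the recorded results


def pick_round(pool, k=20, seed=None):
    """Draw k distinct questions at random from the pool."""
    rng = random.Random(seed)
    return rng.sample(pool, k)


def score_round(questions, human_answers):
    """Return (your_score, phi2_score) for one round."""
    if len(questions) != len(human_answers):
        raise ValueError("one answer is required per question")
    you = sum(
        answer == q.correct_index for q, answer in zip(questions, human_answers)
    )
    phi2 = sum(q.phi2_index == q.correct_index for q in questions)
    return you, phi2
```

Sampling without replacement (`random.Random.sample`) ensures no question appears twice in a round, and scoring both players against the same answer key keeps the comparison fair.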

The goal: answer as many questions correctly as you can. In the end, you'll be able to see how you stack up against Phi-2, a model that's around 900 times smaller than GPT-4 (going by unofficial estimates, since GPT-4's size has not been officially disclosed); in fact, it's so small that you can run it within your browser (with quantization).
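A rough back-of-the-envelope calculation shows why quantization matters for running a 2.7B parameter model in a browser. The sketch below estimates the size of the weights alone at a few bit widths; it ignores activations, caches, and runtime overhead, and the function name is made up for illustration.

```python
PHI2_PARAMS = 2.7e9  # parameter count stated in the article


def weight_footprint_gb(num_params, bits_per_param):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_footprint_gb(PHI2_PARAMS, bits):.2f} GB")
```

Under these assumptions, the weights take about 10.8 GB at 32 bits, 5.4 GB at 16 bits, 2.7 GB at 8 bits, and 1.35 GB at 4 bits.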

Good luck!


Shameless plug: my name is Aman, and I created azigy, the website you're viewing this article on. It's a platform that allows you to create and host interactive game shows that your family, friends, and colleagues will love. You can use our LLM-powered AI feature to transform your documents, slides, and audio files into custom trivia and Jeopardy games within seconds.

© 2025 azigy. All rights reserved.