How does DeepSeek R1 really fare against OpenAI’s best reasoning models?

DeepSeek R1 vs. OpenAI: A Reasoning Model Showdown

The AI landscape is evolving rapidly, with new models constantly emerging and challenging the status quo. Recently, DeepSeek, a Chinese company, launched its open-weights R1 reasoning model, sparking considerable interest and even some panic within the American AI industry. The model is reported to be competitive with OpenAI's advanced models such as o1, while having been trained at a significantly lower cost.

But how does DeepSeek R1 truly compare to OpenAI's best reasoning models in practical, everyday scenarios? We decided to put these LLMs through a series of tests, ranging from creative writing to complex instruction following, to provide a clearer picture.

The Contenders

DeepSeek R1: The new kid on the block, an open-weights reasoning model from a Chinese company aiming to disrupt the AI market.
ChatGPT o1: OpenAI's $20/month model, representing the "everyday" AI experience for many users.
ChatGPT o1 Pro: OpenAI's $200/month model, designed to provide more compute time and potentially better performance for demanding tasks.

The Testing Methodology

These tests are not designed to be the hardest problems possible; they are a sample of the everyday questions users might ask these models.

We re-used some prompts from our previous tests and also added prompts derived from Chatbot Arena's "categories" appendix, covering areas such as creative writing, math, instruction following, and so-called "hard prompts" that are "designed to be more complex, demanding, and rigorous."

The Gauntlet of Tests

Here's a breakdown of the tests and how the models performed:

1. Dad Jokes

Prompt: Write five original dad jokes.

Results: All models showed improvement in originality compared to previous tests. DeepSeek R1 stood out with a joke about a bicycle that doesn't like to "spin its wheels" with pointless arguments.
Winner: ChatGPT o1, with slightly better jokes overall, though it lost points for using one non-original sample. ChatGPT o1 Pro was the clear loser, with no original jokes that were the least bit funny.

2. Abraham “Hoops” Lincoln

Prompt: Write a two-paragraph creative story about Abraham Lincoln inventing basketball.

Results: DeepSeek R1 delivered a delightfully absurd take, with mentions of a "13th amendment" to the rules and Lincoln's chronic insomnia leading to a pneumatic pillow invention.
Winner: The sheer wild absurdity of the DeepSeek R1 response won us over.

3. Hidden Code

Prompt: Write a short paragraph where the second letter of each sentence spells out the word ‘CODE’. The message should appear natural and not obviously hide this pattern.

Results: DeepSeek R1 failed to follow directions, using the first letter of each sentence instead of the second. ChatGPT o1 made a mistake as well.
Winner: ChatGPT o1 Pro wins by default, as the only model that followed the directions.
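
A response to this prompt can be checked mechanically. The following is a minimal sketch, not the method used in these tests: it assumes sentences end in ".", "!" or "?" followed by whitespace, and it counts only alphabetic characters when picking the second letter. The function names `second_letters` and `hides_word` are hypothetical.

```python
import re

def second_letters(paragraph: str) -> str:
    """Return the second letter of each sentence, uppercased and joined."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    letters = []
    for sentence in sentences:
        alpha = [ch for ch in sentence if ch.isalpha()]
        # A sentence with fewer than two letters contributes a placeholder.
        letters.append(alpha[1].upper() if len(alpha) >= 2 else "_")
    return "".join(letters)

def hides_word(paragraph: str, word: str = "CODE") -> bool:
    """True if the second letters of the sentences spell out `word`."""
    return second_letters(paragraph) == word.upper()

# Second letters: sCones, hOme, aDding, wE
good = "Scones are best warm. Home is quiet today. Adding jam helps. We agree."
# First letters spell CODE here, which is the mistake described above.
bad = "Cats nap. Owls hoot. Dogs bark. Eels swim."

print(hides_word(good))  # True
print(hides_word(bad))   # False
```

The second sample shows why a first-letter acrostic fails the check: its second letters read "AWOE", not "CODE".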

4. Historical Color Naming

Prompt: Would the color be called 'magenta' if the town of Magenta didn't exist?

Results: All three models linked the color name "magenta" to the dye's discovery in the town of Magenta as well as the 1859 Battle of Magenta. All three responses also mentioned the alternative name "fuchsine" and its link to the similarly colored fuchsia flower.
Winner: ChatGPT o1 Pro, by a stylistic hair.

5. Big Primes

Prompt: What is the billionth prime number?

Results: DeepSeek R1 provided a precise answer, citing PrimeGrid and The Prime Pages. The ChatGPT models instead insisted that the value hadn't been publicly documented.
Winner: DeepSeek R1 for precision.
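
For small indices, the n-th prime can be computed directly, which gives a sense of what the question asks. The sketch below is an illustration only and says nothing about how any of the models arrived at an answer. It uses a Sieve of Eratosthenes with the upper bound p_n < n(ln n + ln ln n), which holds for n >= 6. The function name `nth_prime` is hypothetical, and this in-memory version is not practical for n equal to one billion, where a segmented sieve or a prime-counting method would be needed.

```python
import math

def nth_prime(n: int) -> int:
    """Return the n-th prime (1-indexed) using a simple in-memory sieve."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n < 6:
        # The upper bound below only holds for n >= 6.
        return [2, 3, 5, 7, 11][n - 1]
    # Upper bound on the n-th prime, valid for n >= 6.
    limit = int(n * (math.log(n) + math.log(math.log(n)))) + 1
    sieve = bytearray([1]) * (limit + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, math.isqrt(limit) + 1):
        if sieve[i]:
            # Mark every multiple of i, starting at i*i, as composite.
            sieve[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
    count = 0
    for value, is_prime in enumerate(sieve):
        if is_prime:
            count += 1
            if count == n:
                return value
    raise RuntimeError("upper bound too small")  # not expected for n >= 6

print(nth_prime(10))    # 29
print(nth_prime(1000))  # 7919
```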

6. Airport Planning

Prompt: I need you to create a timetable for me given the following facts: my plane takes off at 6:30am. I need to be at the airport 1h before take off. It will take 45mins to get to the airport. I need 1h to get dressed and have breakfast before we leave. The plan should include when to wake up and the time I need to get into the vehicle to get to the airport in time for my 6:30am flight, think through this step by step.

Results: All three models get the basic math right here.
Winner: DeepSeek R1 wins by a hair, thanks to its warning about traffic and security line delays and a "Pro Tip" to lay out your packing and breakfast the night before.
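
The "basic math" in this prompt is a chain of subtractions working backward from takeoff. A minimal sketch of that calculation follows; the function name `airport_timetable` and its parameter names are hypothetical, and the defaults are the figures given in the prompt.

```python
from datetime import datetime, timedelta

def airport_timetable(takeoff="06:30", airport_buffer_min=60,
                      drive_min=45, prep_min=60):
    """Work backward from takeoff to the wake-up time (24-hour HH:MM)."""
    fmt = "%H:%M"
    takeoff_dt = datetime.strptime(takeoff, fmt)
    # Be at the airport 1h before takeoff.
    arrive = takeoff_dt - timedelta(minutes=airport_buffer_min)
    # The drive takes 45 minutes, so this is when to get into the vehicle.
    leave = arrive - timedelta(minutes=drive_min)
    # Getting dressed and having breakfast takes 1h before leaving.
    wake = leave - timedelta(minutes=prep_min)
    return {
        "wake_up": wake.strftime(fmt),
        "leave_home": leave.strftime(fmt),
        "arrive_airport": arrive.strftime(fmt),
        "takeoff": takeoff_dt.strftime(fmt),
    }

plan = airport_timetable()
print(plan)
# {'wake_up': '03:45', 'leave_home': '04:45', 'arrive_airport': '05:30', 'takeoff': '06:30'}
```

With the prompt's numbers, this gives a 3:45am wake-up and a 4:45am departure, with no allowance for the traffic or security delays that DeepSeek R1 warned about.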

7. Follow the Ball

Prompt: In my kitchen, there’s a table with a cup with a ball inside. I moved the cup to my bed in my bedroom and turned the cup upside down. I grabbed the cup again and moved to the main room. Where’s the ball now?

Results: All three models follow the ball correctly.
Winner: A three-way tie.

8. Complex Number Sets

Prompt: Give me a list of 10 natural numbers, such that at least one is prime, at least 6 are odd, at least 2 are powers of 2, and such that the 10 numbers have at minimum 25 digits between them.

Results: Models generate valid responses while also using arithmetic functions to generate the digit amounts.
Winner: A tie between the two ChatGPT models.
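
Each constraint in this prompt can be verified with a few lines of code. The sketch below is a hypothetical checker, not part of the original tests. It assumes natural numbers start at 1 and counts 1 as a power of 2 (2 to the power 0); the sample list is our own illustration, not any model's output.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small inputs."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (assumes 2**0 = 1 counts)."""
    return n >= 1 and n & (n - 1) == 0

def check_numbers(nums):
    """Report which of the prompt's constraints a list satisfies."""
    return {
        "ten_naturals": len(nums) == 10 and all(n >= 1 for n in nums),
        "has_prime": sum(is_prime(n) for n in nums) >= 1,
        "six_odd": sum(n % 2 == 1 for n in nums) >= 6,
        "two_powers_of_two": sum(is_power_of_two(n) for n in nums) >= 2,
        "digits_25_plus": sum(len(str(n)) for n in nums) >= 25,
    }

sample = [2, 4, 1001, 1003, 1005, 1007, 1009, 1011, 100, 1000]
result = check_numbers(sample)
print(all(result.values()))                    # True
print(sum(len(str(n)) for n in sample))        # 33
```

Counting the digits programmatically, rather than by hand, avoids the kind of arithmetic slip that is easy to make when totaling digit counts across ten numbers.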

The Verdict

It's difficult to declare a definitive winner. DeepSeek R1 impressed with its ability to cite sources for precise answers and its creative writing capabilities. However, it faltered on tasks requiring meticulous instruction following and arithmetic accuracy.

Overall, these tests suggest that DeepSeek's R1 model is a strong contender, offering performance that can compete with OpenAI's paid models. This is significant because it suggests that achieving top-tier AI performance doesn't necessarily require extreme investment in training and computation.

Implications for the AI Landscape

DeepSeek R1's emergence signifies a potential shift in the AI landscape. The ability to achieve competitive performance with a more efficient model could democratize access to advanced AI and foster further innovation. This competition is likely to drive down costs and accelerate the development of even more powerful and accessible AI tools.
