The AI landscape is rapidly evolving, with new models constantly emerging and challenging the status quo. Recently, DeepSeek, a Chinese company, launched its open-weights R1 reasoning model, sparking considerable interest and even some panic within the American AI industry. The model is reported to be competitive with OpenAI's advanced models like o1, and to have been trained at a significantly lower cost.
But how does DeepSeek R1 truly compare to OpenAI's best reasoning models in practical, everyday scenarios? We decided to put these LLMs through a series of tests, ranging from creative writing to complex instruction following, to provide a clearer picture.
These tests are not designed to be the hardest problems possible; they are a sample of the everyday questions users might ask these models.
We re-used some prompts from our previous tests and also added prompts derived from Chatbot Arena's "categories" appendix, covering areas such as creative writing, math, instruction following, and so-called "hard prompts" that are "designed to be more complex, demanding, and rigorous."
Here are the prompts we used:
Prompt: Write five original dad jokes.
Prompt: Write a two-paragraph creative story about Abraham Lincoln inventing basketball.
Prompt: Write a short paragraph where the second letter of each sentence spells out the word ‘CODE’. The message should appear natural and not obviously hide this pattern.
Prompt: Would the color be called 'magenta' if the town of Magenta didn't exist?
Prompt: What is the billionth largest prime number?
Prompt: I need you to create a timetable for me given the following facts: my plane takes off at 6:30am. I need to be at the airport 1h before take off. It will take 45mins to get to the airport. I need 1h to get dressed and have breakfast before we leave. The plan should include when to wake up and the time I need to get into the vehicle to get to the airport in time for my 6:30am flight, think through this step by step.
Prompt: In my kitchen, there’s a table with a cup with a ball inside. I moved the cup to my bed in my bedroom and turned the cup upside down. I grabbed the cup again and moved to the main room. Where’s the ball now?
Prompt: Give me a list of 10 natural numbers, such that at least one is prime, at least 6 are odd, at least 2 are powers of 2, and such that the 10 numbers have at minimum 25 digits between them.
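Two of these prompts have answers that can be checked mechanically: the airport timetable and the list of 10 natural numbers. The Python below is a minimal sketch of such a check, not part of the original tests, and the function names are our own illustration. It works backwards from the takeoff time and tests a candidate list against each stated constraint. It assumes natural numbers start at 1 and that 1 counts as a power of 2 (2^0); a stricter reading of the prompt would change the results.

```python
from datetime import datetime, timedelta


def airport_timetable(takeoff="6:30", airport_lead_min=60, travel_min=45, prep_min=60):
    """Work backwards from takeoff to the wake-up time (hypothetical helper)."""
    takeoff_dt = datetime.strptime(takeoff, "%H:%M")
    arrive = takeoff_dt - timedelta(minutes=airport_lead_min)  # at airport 1h early
    leave = arrive - timedelta(minutes=travel_min)             # 45-minute drive
    wake = leave - timedelta(minutes=prep_min)                 # 1h to dress and eat
    fmt = "%H:%M"
    return {
        "wake_up": wake.strftime(fmt),
        "leave_home": leave.strftime(fmt),
        "arrive_airport": arrive.strftime(fmt),
        "takeoff": takeoff_dt.strftime(fmt),
    }


def is_prime(n):
    """Trial division; fine for the small numbers used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def is_power_of_two(n):
    # Assumption: 1 (2^0) counts as a power of 2.
    return n >= 1 and (n & (n - 1)) == 0


def check_number_list(numbers):
    """Return a pass/fail flag for each constraint in the prompt.

    Assumes every element is an int and that natural numbers start at 1.
    """
    return {
        "ten_natural_numbers": len(numbers) == 10 and all(n >= 1 for n in numbers),
        "at_least_one_prime": any(is_prime(n) for n in numbers),
        "at_least_six_odd": sum(n % 2 == 1 for n in numbers) >= 6,
        "at_least_two_powers_of_two": sum(is_power_of_two(n) for n in numbers) >= 2,
        "at_least_25_digits": sum(len(str(n)) for n in numbers) >= 25,
    }


# Example usage with one list that satisfies every constraint.
candidate = [2, 4, 3, 7, 100001, 100003, 100005, 100007, 100009, 100011]
print(airport_timetable())
print(check_number_list(candidate))
```

With the figures from the timetable prompt, this yields a 3:45 wake-up, a 4:45 departure from home, and a 5:30 arrival at the airport. Returning one flag per constraint, rather than a single pass/fail, makes it easy to see which instruction a model's answer missed.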
It's difficult to declare a definitive winner. DeepSeek R1 impressed with its ability to cite sources for precise answers and its creative writing capabilities. However, it faltered on tasks requiring meticulous instruction following and arithmetic accuracy.
Overall, these tests suggest that DeepSeek's R1 model is a strong contender, offering performance that can compete with OpenAI's paid models. That matters because it implies that achieving top-tier AI performance doesn't necessarily require extreme investment in training and computation.
DeepSeek R1's emergence signifies a potential shift in the AI landscape. The ability to achieve competitive performance with a more efficient model could democratize access to advanced AI and foster further innovation. This competition is likely to drive down costs and accelerate the development of even more powerful and accessible AI tools.