Phi-4 strikes back?
The conclusions of my previous post on Phi-4 left me stunned. How could a model like Phi-4 Reasoning Plus, which boasts an impressive 14.7 billion parameters (quantized here to 4 bits) and was trained on scientific problems, particularly in mathematics, have failed so badly?
Comparing LLMs
The question I asked Phi-4 Reasoning Plus was basic logic: a fourth-grade student could (and should) have answered it in 10 seconds. ChatGPT had no trouble at all and reasoned exactly as one would expect from the poor student.¹
Question
Read the text generated by ChatGPT
To clarify:
Marie Antoinette was executed by guillotine on October 16, 1793, in Paris, during the French Revolution. She died immediately as a result of the guillotine's blade, which was designed for swift and instantaneous death.
So, the correct answer is: She did not live any time after being executed.
Deepseek R1 Distill Qwen 7B also does well. It reasons precisely about what happens at the moment of the beheading, concluding that Marie Antoinette died instantly.
Read the text generated by Deepseek
Answer: Marie Antoinette did not survive beyond her execution in 1793. She was executed by guillotining at the age of 53 and died instantly, with no indication of surviving any length of time afterward.
The same goes for Gemma 3 12B Instruct and Qwen3 14B, which I haven’t discussed yet, but which are also installed within LM Studio. In fact, I consider Qwen3 14B’s answer the best: it is the most precise and gets straight to the point.
Read the text generated by Qwen3
MLX or GGUF, that is the problem
If other models can answer my questions without any trouble, why can’t Phi-4 Reasoning Plus do the same?
The size of the models used shouldn’t be a problem. They are all around 8 GB in size, except for Deepseek, which is about half that size. The same applies to the number of quantisation bits, since Phi-4 Reasoning Plus and the other models I tested all use 4-bit quantisation.
Some claim that the MLX format is less accurate than GGUF, but I haven’t found anything that proves it unequivocally, so it needs to be verified.
To do that, I download the community GGUF version of the Phi-4 Reasoning Plus model, which I’ll compare with the official Microsoft version in MLX format. Both are quantized at 4 bits.
Now I try asking the usual question again:
Question
to which Phi-4 Reasoning Plus MLX (i.e., the original Microsoft model I’ve been using so far) provides a correct answer, albeit after almost seven minutes of wavering reasoning (which I will spare you), during which time it checks and re-checks historical data and repeatedly wonders if the question hides a trick.
Read the text generated by Phi-4 Reasoning Plus MLX
After closing and reopening LM Studio (to start from an initial state that is as repeatable as possible), I load the GGUF version of the same model, and… again, the answer only comes after almost seven minutes.
Read the text generated by Phi-4 Reasoning Plus GGUF
The answer is accurate, but behind it is a convoluted and wavering line of reasoning, during which the GGUF version of the model also keeps wondering if there’s a trick, and keeps checking and rechecking the execution date.²
All in all, I’d say the MLX and GGUF versions of the model behave the same.
What is a sine?
So far, I’ve only asked Phi-4 a single, very basic logic question. To answer it, the model engages in convoluted reasoning that seems to go on forever. To test all aspects of the model, I would need to ask many more questions, checking the quality of each answer.
But, given the level of its thinking, I’m sure I wouldn’t get faster or more confident answers, so I give up and move on to pure and simple mathematics. I then ask Phi-4 Reasoning Plus (GGUF version) what a sine function is (one of the most widely used functions in mathematics).
Phi-4 takes five minutes to answer! True, the answer is correct, but a local model that responds so slowly to a trivial question is essentially unusable. Anyone who really wants to find out what a sine is would be better off consulting not only Wikipedia but even a printed encyclopaedia (assuming there are still any left in our homes).
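For reference, the definition itself fits in one line: in a right triangle, the sine of an angle \( \theta \) is the ratio of the opposite side to the hypotenuse (equivalently, the \( y \)-coordinate of the corresponding point on the unit circle):
\[ \sin\theta = \frac{\text{opposite side}}{\text{hypotenuse}}, \qquad -1 \le \sin\theta \le 1 . \]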
Another curious thing is that Phi-4’s internal instructions require it, for some strange reason, to respond exclusively in text format, unless it receives an explicit request to use LaTeX, Markdown, or something else.
Read the reasoning
I must check if I need to produce any mathematically formatted explanation. The instructions by default are text. But I can produce LaTeX formulas. However instructions: "By default, do not reply with Markdown formatting" but then the instructions says: "Unless requested." Actually instructions from system say "Do not use markdown unless specifically asked", so I'll produce plain text answer. Possibly I'll produce some inline math symbols.
[...]
If I kindly ask it to use LaTeX, Markdown, or any other tool that might help it provide the best answer, Phi-4 complies without any problem. In fact, it even generates a short Python program that plots the graph of the sine function between \( -2\pi \) and \( 2\pi \). However, it’s quite strange that a model explicitly optimized for mathematics (and science in general) is prevented by default from using LaTeX to write equations properly.
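I didn’t keep the exact script Phi-4 produced, but a minimal sketch of that kind of plot, assuming the usual numpy and matplotlib stack, would look something like this:

    # Sketch of the kind of plot Phi-4 generated (not its actual output)
    import numpy as np
    import matplotlib.pyplot as plt

    # Sample the sine function on the interval [-2*pi, 2*pi]
    x = np.linspace(-2 * np.pi, 2 * np.pi, 500)
    y = np.sin(x)

    plt.plot(x, y)
    plt.xlabel("x")
    plt.ylabel("sin(x)")
    plt.title("Sine between -2π and 2π")
    plt.grid(True)
    plt.show()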
Phi-4 Mini Reasoning
Among the official Phi-4 models available within LM Studio is also Phi-4 Mini Reasoning, which was specifically designed and tested for mathematical reasoning. The difference is immediately noticeable. When I ask it the usual question about Marie Antoinette, it answers within a few seconds that:
Read the text generated by Phi-4 Mini Reasoning
and this time its reasoning is precise and without any hesitation.
It takes a bit longer to answer about the sine function (about a minute), but the answer is impeccable and correctly formatted in LaTeX, without me having to explicitly ask for it.
It also has no problem when I ask slightly more complicated math questions. But with physics, it literally makes up the answers,
Question
Read the generated text
[...]
because while it is true that the Josephson effect occurs when a supercurrent flows between two superconductors connected by a very thin insulating layer, it is not true at all that it was discovered by Bardeen, Cooper, and Feynndel (the latter guy doesn’t even exist). Actually, it was Brian D. Josephson who discovered the Josephson effect (not some John Richard Josephson); he described all its consequences (not just one) and won the Nobel Prize in Physics a few years later. Some other details in the full answer are also made up, but I’ll be kind and spare you them.
Even the answer to the usual request to summarize my introductory post about LM Studio is completely made up.

But the final gem is truly unexpected.
Question
to which Phi-4 Mini Reasoning responds by spitting out dozens and dozens of lines of nonsense
Read the generated text
Be game und devitalize me und quench mir bliss."
"But i am lost in the hour and fear a dark imbroglio und anathema del Reverso."
"But i am lost in the hour and fear a dark imbroglio und anathema del Reverso."
[...]
"But i am lost und fear a dark imbroglio."
"But i fear imbroglio del Reverso. I will not be prevailed on und I'll quench mir bliss."
"But i fear imbroglio del Reverso. I will not be prevailed on und I'll quench mir bliss."
"But i fear imbroglio del Reverso. I will not be prevailed und I'll quench mir bliss."
"But i fear imbroglio del Reverso und und thus he thought quench mir bliss."
"But i fear imbroglio del Reverso. I will not be prevailed und I'll quench mir bliss."
"But i fear imbroglio del Reverso. I will not be prevailed und I'll quench mir bliss."
[...]
"But i fear imbroglio del Reverso. I will not be prevailed und thus quench mir bliss.""
"But i fear imbroglio del Reverso. I will not be prevailed und thus quench mir bliss.""
"But i fear imbroglio del Reverso und thus he thought quench mir bliss.""
[...]
"But i fear im broglio del ReversO. I will not be previled und thus quench mir bliss."
"But i fear im broglio del ReversO. I will not be previled und thus quench mir bliss."
[...]
which seem to go on endlessly, with small seemingly random changes. After a while I get tired and stop the generation.
It’s not even a matter of using the caret (^) to indicate a square, as in Fortran or LaTeX: if I try the ** symbol used in Python or JavaScript, the answer doesn’t improve at all. Depending on its mood, Phi-4 might tell us that the square root of 2**2 has something to do with chess,
Read the generated text
But i might be missing case like when you have an odd number, but I can generalize strategy game se similiard
Case where last moves sequence positions Black pieces so that they threaten capture on the next move, capturing a critical white piece like rooks or queen
Case where last moves sequence positions Black pieces so they threaten to capture on the next move more than just capturing a minor piece like rooks, but queen or king
[...]
or with the ratio between \( \pi \) and Euler’s number \( e \approx 2.71828 \),
Read the generated text
First, let's write down the statement as given: 31 = π / e = γ (0.423). Breaking this down, we have two separate equations here because of the double equals sign (==), which probably means "is equal to and also equals". [...]
even if, maybe by pure chance, it sometimes gives the correct answer (although, let’s face it, calculating the square root first and then squaring the result is an aberration).
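Just to put the two botched questions in perspective, the arithmetic fits in a few lines of Python (my own quick check, not something generated by the model):

    import math

    # The square root of 2, squared: the exact value is 2;
    # floating point lands negligibly off it.
    print(math.sqrt(2) ** 2)   # 2.0000000000000004
    print((2 ** 0.5) ** 2)     # same computation, written with the ** notation

    # The ratio between pi and Euler's number e
    print(math.pi / math.e)    # about 1.1557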

Conclusions
The purpose of this post (as well as of the previous one) was not to understand how good Phi-4 is at scientific and mathematical reasoning, nor to compare it with other competing models. There are several (more or less) standard benchmarks for that, and anyway, that is a task that requires a lot of time and effort.
What I wanted to understand, instead, was whether it’s possible to use a model like Phi-4 locally for everyday tasks; in this specific case, for solving logical-mathematical problems. Judging from the responses to my few, simple, silly questions, Phi-4 still seems to be lacking.
After struggling with doubt, Phi-4 Reasoning Plus gives the correct answer, but it is too slow to be truly useful. Phi-4 Mini Reasoning is much faster and seems more accurate, but then it gets stuck on trivial questions, such as the square root of 2 squared. It also completely invents names and facts about well-known scientific topics, such as the Josephson effect. In these cases it’s not even a matter of reasoning, but simply of recalling information contained in the hundreds and hundreds of texts on which the model was trained.
If we can’t trust the answers to such basic questions, how can we trust Phi-4 for anything even slightly more complicated?
We grew up in a world where information sources were basically considered reliable. A printed encyclopedia, a book, Wikipedia, or the results from a traditional search engine returned content and data you could count on. Sure, there might have been some mistakes here and there (are you listening, Wikipedia?), but nobody thought that information could be completely made up.
With LLMs, chatbots, and search results generated by so-called artificial intelligence, we’re getting used to receiving false results. And what is worse, if things don’t change, it will soon be almost impossible to verify the quality of the results obtained.
The topic is complex and it doesn’t make much sense to keep discussing it here. For now, it’s enough to stop wasting any more time with Phi-4.
1. I apologize for the somewhat grim topic, but I had no idea what would happen when I first asked Phi-4 about Marie Antoinette.
2. Calling Phi-4 a Hamletic model seems even more fitting to me now.