Phi-4 strikes back?
The conclusions of my previous post on Phi-4 left me stunned. How could a model like Phi-4 Reasoning Plus, which boasts an impressive 14.7 billion parameters (quantized here to 4 bits) and was trained on scientific problems, particularly in mathematics, have failed so badly?
Comparing LLMs
The question I asked Phi-4 Reasoning Plus was basic logic: a fourth-grade student could (and should) have answered it in 10 seconds. ChatGPT had no trouble at all and reasoned exactly as one would expect from the poor student.¹
Question
Read the text generated by ChatGPT
To clarify:
Marie Antoinette was executed by guillotine on October 16, 1793, in Paris, during the French Revolution. She died immediately as a result of the guillotine's blade, which was designed for swift and instantaneous death.
So, the correct answer is: She did not live any time after being executed.
Deepseek R1 Distill Qwen 7B also does well. It reasons precisely about what happens at the moment of the beheading, concluding that Marie Antoinette died instantly.
Read the text generated by Deepseek
Answer: Marie Antoinette did not survive beyond her execution in 1793. She was executed by guillotining at the age of 53 and died instantly, with no indication of surviving any length of time afterward.
The same goes for Gemma 3 12B Instruct and Qwen3 14B, which I haven’t discussed yet, but which are also installed within LM Studio. In fact, I consider Qwen3 14B’s answer the best: it is the most precise and gets straight to the point.
Read the text generated by Qwen3
MLX or GGUF, that is the problem
If other models can answer my questions without any trouble, why can’t Phi-4 Reasoning Plus do the same?
The size of the models used shouldn’t be a problem. They are all around 8 GB in size, except for Deepseek, which is about half that size. The same applies to the number of quantisation bits, since Phi-4 Reasoning Plus and the other models I tested all use 4-bit quantisation.
Some claim that the MLX format is less accurate than GGUF, but I haven’t found anything that proves it unequivocally, so it needs to be verified.
To do that, I download the community GGUF version of the Phi-4 Reasoning Plus model, which I’ll compare with the official Microsoft version in MLX format. Both are quantized at 4 bits.
Now I try asking the usual question again:
Question
to which Phi-4 Reasoning Plus MLX (i.e., the original Microsoft model I’ve been using so far) provides a correct answer, albeit after almost seven minutes of wavering reasoning (which I will spare you), during which time it checks and re-checks historical data and repeatedly wonders if the question hides a trick.
Read the text generated by Phi-4 Reasoning Plus MLX
After closing and reopening LM Studio (to start from an initial state that is as repeatable as possible), I load the GGUF version of the same model, and… again, the answer only comes after almost seven minutes.
Read the text generated by Phi-4 Reasoning Plus GGUF
The answer is accurate, but behind it is a convoluted and wavering line of reasoning, during which the GGUF version of the model also keeps wondering if there’s a trick, and keeps checking and rechecking the execution date.²
All in all, I’d say the MLX and GGUF versions of the model behave the same.
What is a sine?
So far, I’ve only asked Phi-4 a single, very basic logic question. To answer it, the model engages in convoluted reasoning that seems to go on forever. To test all aspects of the model, I would need to ask many more questions, checking the quality of each answer.
But, given the level of its thinking, I’m sure I wouldn’t get faster or more confident answers, so I give up and move on to pure and simple mathematics. I then ask Phi-4 Reasoning Plus (GGUF version) what a sine function is (one of the most widely used functions in mathematics).
Phi-4 takes five minutes to answer! True, the answer is correct, but a local model that responds so slowly to a trivial question is essentially unusable. Anyone who really wants to find out what a sine is would be better off consulting not only Wikipedia but even a printed encyclopaedia (assuming there are still any left in our homes).
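For reference, the definition itself fits in one line: in a right triangle, the sine of an angle \( \theta \) is the ratio of the opposite side to the hypotenuse (equivalently, the \( y \)-coordinate of the corresponding point on the unit circle):
\[ \sin\theta = \frac{\text{opposite side}}{\text{hypotenuse}}, \qquad -1 \le \sin\theta \le 1 . \]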
Another curious thing is that Phi-4’s internal instructions require it, for some strange reason, to respond exclusively in text format, unless it receives an explicit request to use LaTeX, Markdown, or something else.
Read the reasoning
I must check if I need to produce any mathematically formatted explanation. The instructions by default are text. But I can produce LaTeX formulas. However instructions: "By default, do not reply with Markdown formatting" but then the instructions says: "Unless requested." Actually instructions from system say "Do not use markdown unless specifically asked", so I'll produce plain text answer. Possibly I'll produce some inline math symbols.
[...]
If I kindly ask it to use LaTeX, Markdown, or any other tool that might help it provide the best answer, Phi-4 complies without any problem. In fact, it even generates a short Python program that plots the graph of the sine function between \( -2\pi \) and \( 2\pi \). However, it’s quite strange that a model explicitly optimized for mathematics (and science in general) is prevented by default from using LaTeX to write equations properly.
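I didn’t keep the exact script Phi-4 produced, but a minimal sketch of that kind of plot, assuming the usual numpy and matplotlib stack, would look something like this:

    # Sketch of the kind of plot Phi-4 generated (not its actual output)
    import numpy as np
    import matplotlib.pyplot as plt

    # Sample the sine function on the interval [-2*pi, 2*pi]
    x = np.linspace(-2 * np.pi, 2 * np.pi, 500)
    y = np.sin(x)

    plt.plot(x, y)
    plt.xlabel("x")
    plt.ylabel("sin(x)")
    plt.title("Sine between -2π and 2π")
    plt.grid(True)
    plt.show()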
Phi-4 Mini Reasoning
Among the official Phi-4 models available within LM Studio is also Phi-4 Mini Reasoning, which was specifically designed and tested for mathematical reasoning. The difference is immediately noticeable. When I ask it the usual question about Marie Antoinette, it answers within a few seconds that:
Read the text generated by Phi-4 Mini Reasoning
and this time its reasoning is precise and without any hesitation.
It takes a bit longer to answer about the sine function (about a minute), but the answer is impeccable and correctly formatted in LaTeX, without me having to explicitly ask for it.
It also has no problem when I ask slightly more complicated math questions. But with physics, it literally makes up the answers,
Question
Read the generated text
[...]
because while it is true that the Josephson effect occurs when a supercurrent flows between two superconductors connected by a very thin insulating layer, it is not true at all that it was discovered by Bardeen, Cooper, and Feynndel (the latter guy doesn’t even exist). Actually, it was Brian D. Josephson who discovered the Josephson effect (not some John Richard Josephson); he described all its consequences (not just one) and won the Nobel Prize in Physics a few years later. Some other details in the full answer are also made up, but I’ll be kind and spare you them.
Even the answer to the usual request to summarize my introductory post about LM Studio is completely made up.

But the final gem is truly unexpected.
Question
to which Phi-4 Mini Reasoning responds by spitting out dozens and dozens of lines of nonsense
Read the generated text
Be game und devitalize me und quench mir bliss."
"But i am lost in the hour and fear a dark imbroglio und anathema del Reverso."
"But i am lost in the hour and fear a dark imbroglio und anathema del Reverso."
[...]
"But i am lost und fear a dark imbroglio."
"But i fear imbroglio del Reverso. I will not be prevailed on und I'll quench mir bliss."
"But i fear imbroglio del Reverso. I will not be prevailed on und I'll quench mir bliss."
"But i fear imbroglio del Reverso. I will not be prevailed und I'll quench mir bliss."
"But i fear imbroglio del Reverso und und thus he thought quench mir bliss."
"But i fear imbroglio del Reverso. I will not be prevailed und I'll quench mir bliss."
"But i fear imbroglio del Reverso. I will not be prevailed und I'll quench mir bliss."
[...]
"But i fear imbroglio del Reverso. I will not be prevailed und thus quench mir bliss.""
"But i fear imbroglio del Reverso. I will not be prevailed und thus quench mir bliss.""
"But i fear imbroglio del Reverso und thus he thought quench mir bliss.""
[...]
"But i fear im broglio del ReversO. I will not be previled und thus quench mir bliss."
"But i fear im broglio del ReversO. I will not be previled und thus quench mir bliss."
[...]
which seem to go on endlessly, with small seemingly random changes. After a while I get tired and stop the generation.
It’s not even a matter of using the caret (^) to indicate a square, as in Fortran or LaTeX: if I try the ** symbol used in Python or JavaScript, the answer doesn’t improve at all. Depending on its mood, Phi-4 might tell us that the square root of 2**2 has something to do with chess,
Read the generated text
But i might be missing case like when you have an odd number, but I can generalize strategy game se similiard
Case where last moves sequence positions Black pieces so that they threaten capture on the next move, capturing a critical white piece like rooks or queen
Case where last moves sequence positions Black pieces so they threaten to capture on the next move more than just capturing a minor piece like rooks, but queen or king
[...]
or with the ratio between \( \pi \) and Euler’s number \( e \approx 2.71828 \),
Read the generated text
First, let's write down the statement as given: 31 = π / e = γ (0.423). Breaking this down, we have two separate equations here because of the double equals sign (==), which probably means "is equal to and also equals". [...]
even if, maybe by pure chance, it sometimes gives the correct answer (although, let’s face it, calculating the square root first and then squaring the result is an aberration).
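Just to put the two botched questions in perspective, the arithmetic fits in a few lines of Python (my own quick check, not something generated by the model):

    import math

    # The square root of 2, squared: the exact value is 2;
    # floating point lands negligibly off it.
    print(math.sqrt(2) ** 2)   # 2.0000000000000004
    print((2 ** 0.5) ** 2)     # same computation, written with the ** notation

    # The ratio between pi and Euler's number e
    print(math.pi / math.e)    # about 1.1557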

Conclusions
The purpose of this post (as well as of the previous one) was not to understand how good Phi-4 is at scientific and mathematical reasoning, nor to compare it with other competing models. There are several (more or less) standard benchmarks for that, and anyway, that is a task that requires a lot of time and effort.
What I wanted to understand, instead, was whether it’s possible to use a model like Phi-4 locally for everyday tasks; in this specific case, for solving logical-mathematical problems. Judging from the responses to my few, simple, silly questions, Phi-4 still seems to be lacking.
After struggling with doubt, Phi-4 Reasoning Plus gives the correct answer, but it is too slow to be truly useful. Phi-4 Mini Reasoning is much faster and seems more accurate, but then it gets stuck on trivial questions, such as the square root of 2 squared. It also completely invents names and facts about well-known scientific topics, such as the Josephson effect. In these cases it’s not even a matter of reasoning, but simply of recalling information contained in the hundreds and hundreds of texts on which the model was trained.
If we can’t trust the answers to such basic questions, how can we trust Phi-4 for anything even slightly more complicated?
We grew up in a world where information sources were basically considered reliable. A printed encyclopedia, a book, Wikipedia, or the results from a traditional search engine returned content and data you could count on. Sure, there might have been some mistakes here and there (are you listening, Wikipedia?), but nobody thought that information could be completely made up.
With LLMs, chatbots, and search results generated by so-called artificial intelligence, we’re getting used to receiving false results. And what is worse, if things don’t change, it will soon be almost impossible to verify the quality of the results obtained.
The topic is complex and it doesn’t make much sense to keep discussing it here. For now, it’s enough to stop wasting any more time with Phi-4.
1. I apologize for the somewhat grim topic, but I had no idea what would happen when I first asked Phi-4 about Marie Antoinette.
2. Calling Phi-4 a Hamletic model seems even more fitting to me now.