The conclusions of the post on Phi-4 left me stunned. How could a model like Phi-4 Reasoning Plus, which boasts an impressive 14.7 billion parameters (here quantized to 4 bits) and was trained on scientific problems, particularly in mathematics, fail so badly?
The question I asked Phi-4 Reasoning Plus was basic logic; a fourth-grade student could (and should) have answered it in 10 seconds. ChatGPT had no trouble at all and reasoned exactly as one would expect from that poor student.
After two long months, I was once again able to play with LM Studio, and this post was supposed to provide a live description of the responses of some models I had just installed. However, things got out of hand when the first model I put under the magnifying glass, Microsoft's 4-bit Phi-4, started behaving in strange ways that were worth describing in detail. From that moment on, the post you're about to read practically wrote itself!
– Source: Markus Winkler on Unsplash.
In the previous post I introduced the LM Studio interface, then tried the default suggested model (DeepSeek 7B) with one of the example prompts.
What we really need, however, is to verify whether an LLM is capable of performing the repetitive and somewhat boring tasks that increasingly fall to us, and it's better to do that on our own computer, without sending confidential documents, or files that might contain sensitive data, all over the web.
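To show what "keeping it on our own computer" can look like in practice, here is a minimal sketch that talks to LM Studio's local, OpenAI-compatible server. It assumes the server is running on its default port (1234) and that the model identifier `phi-4` matches whatever model you have loaded; both are assumptions, not guarantees, so adjust them to your setup.

```python
import json
import urllib.request

# Assumption: LM Studio's local server is running at its default address.
LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_request(prompt: str, model: str = "phi-4") -> dict:
    """Build an OpenAI-style chat payload for the local server.
    The model name "phi-4" is a placeholder for whatever is loaded."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def ask_local_llm(prompt: str) -> str:
    """Send the prompt to the local server; nothing leaves the machine."""
    data = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example usage (requires LM Studio's server to be running):
# print(ask_local_llm("Summarize the attached meeting notes in three bullets."))
```

Because the endpoint mimics the OpenAI chat API, the same script can be pointed at a different local model simply by changing the `model` field, which is exactly the kind of boring-but-private workflow discussed above.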
As some of you may already know, I use LLMs (Large Language Models) for what they’re really good at, but I’m pretty skeptical about whether they’re truly intelligent or can solve any problem, as the folks at OpenAI, Microsoft, Google, and Meta keep telling us every day. They’ve invested a ton of money in LLMs, and they obviously have a big stake in getting everyone to use them all the time.