Phi-4 strikes back?
The conclusions of the post on Phi-4 left me stunned. How could a model like Phi-4 Reasoning Plus, which boasts an impressive 14.7 billion parameters (run here in 4-bit quantization) and was trained on scientific problems, particularly mathematics, have failed so badly?
Comparing LLMs
The question I asked Phi-4 Reasoning Plus was basic logic: a fourth-grade student could (and should) have answered it in 10 seconds. ChatGPT had no trouble at all and reasoned exactly as the poor student should have.1