As some of you may already know, I use LLMs (Large Language Models) for what they’re really good at, but I’m pretty skeptical about whether they’re truly intelligent or can solve any problem, as the folks at OpenAI, Microsoft, Google, and Meta keep telling us every day. They’ve invested a ton of money in LLMs, and they obviously have a big stake in getting everyone to use them all the time.
LLMs, or (a lot of) statistics at work
LLMs don’t really understand the meaning of the texts they generate. All they do is predict the next word, one at a time, based on statistical patterns learned from the billions of documents they’ve analyzed. The statistical models they’re built on can’t verify the truthfulness of information, nor do they have any real awareness of a sentence’s context; they rely exclusively on the probability of each word appearing in a text.
This often leads to errors, which are sometimes just ridiculous, but can also become very dangerous.
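To make the statistical picture concrete, here is a toy sketch in Python (the words and probabilities are invented purely for illustration; a real LLM computes a distribution like this over tens of thousands of tokens at every step):

```python
import random

# Toy next-word distribution for the prompt "The capital of France is".
# The values are made up for illustration only.
next_word_probs = {"Paris": 0.92, "a": 0.04, "located": 0.03, "Lyon": 0.01}

words = list(next_word_probs)
weights = list(next_word_probs.values())

# The model "writes" by repeatedly sampling the next word from a distribution
# like this one; nothing in the process checks whether the result is true.
print(random.choices(words, weights=weights, k=1)[0])
```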
Basically, LLMs are students who have memorized the lesson, but can’t go beyond what’s written in the textbook. And sometimes they even get the page wrong!
Once these limitations are understood, LLMs prove very useful if we consider them computer tools like any other and exploit their enormous language processing capacity to analyze a text, summarize or expand it, translate it into other languages, and so on.
LLMs on the web
The most common way to use an LLM is in the form of a chatbot, that is, an interactive system to which we ask questions and from which we receive answers, simulating a normal conversation between humans. In most cases, this happens through a web interface like that of ChatGPT, Copilot, Gemini, Claude or Mistral AI. Not forgetting DeepSeek, which has generated so much controversy in the last two months (and which we will use shortly).
Some chatbots also have native applications for desktop or mobile devices (such as ChatGPT or Claude), but I don’t think I’m wrong in saying that they are little more than interfaces for quick access to the web version.
In all cases, the generation process always takes place on a remote server in a data center located somewhere out there. An almost always irrelevant detail, but one that can create many problems if we use an LLM to process personal data or confidential information.
Local LLMs
But we are not forced to use an LLM on the web! Many LLMs can be used locally, directly on our computer. And we don’t even need to mess around with the Terminal, as with ollama, but we can comfortably stay in the familiar environment of a normal program for our Mac, or for a PC running Windows or Linux.
There are countless programs of this kind, but they all share common traits: once the program is installed, we first need to download one (or more) models to use;[1] after that, we can interact with the model directly through the program’s interface, without a single byte of what we do leaving our computer for the increasingly hostile world of the network.
LM Studio
But which program should we choose among the many available, such as Msty, LM Studio, GPT4All, Jan, or LibreChat?
I don’t know about you, but LM Studio made a good impression on me right from the beginning. LM Studio runs on the three major operating systems, but to use it on macOS you need an Apple Silicon processor, while on Linux you need an Intel processor. Only the Windows version supports both Intel and ARM processors (but I’d be really curious to see how it works under Windows for ARM).
The current version is 0.3.14 and installing it on the Mac requires, as usual, double-clicking on the .dmg file and dragging the program icon into the Applications folder. On the Mac, the application takes up a good 1.26 GB, so it’s not exactly a light program. After all, it’s based on Electron, and that explains everything.
On Linux, the installation process is very similar, because the program is distributed in the AppImage format. This means that, in addition to the executable itself, it includes all the necessary support libraries and files, just as has been done on macOS for decades. For Windows, however, you have to run the usual installer, which will spread junk in every corner of the operating system.
Discovering LM Studio
Having reached this point, all that remains is to run LM Studio for the first time. The very first thing to do is to download a model, so that we can start using the program.
I opt for the default suggested model, which happens to be a distilled version of DeepSeek with only 7 billion parameters (the full model has 100 times more). The model takes up 4.7 GB and is downloaded to the hidden directory ~/.lmstudio/models/.
The download process is not fast at all, so we need a good dose of patience.
As suggested by the program itself, while the download continues I start to explore the interface, which is the classic one we are now used to: a large central area dedicated to the dialogue with the chatbot, while the sidebar houses a few icons that, from top to bottom, let us switch between the Chat and Developer modes and check which models are installed on our machine and which other models are available.
The last icon in the lower left corner shows the model download status. Clicking on it displays the amount of data already downloaded and the estimated completion time, and lets you pause or stop the download if necessary. If the download stalls or times out, it is also possible to resume it from where it left off.
This happened to me several times with the selected model, so check on it from time to time. Once the model download is complete, we need to load it into LM Studio to be able to use it. We can use one of the two Load Model buttons to do this, so it’s hard to miss.
Once the first model is loaded, the LM Studio interface changes slightly, showing some sample prompts in the center of the window. In the top bar, the name of the loaded model appears in the center, flanked by two icons: the one on the left allows you to configure the model parameters, while the one on the right allows you to replace the current model with another. An icon on the right that looks like a glass flask provides access to the advanced configuration settings.
The bottom bar of the program displays some useful information: on the left, the LM Studio version; on the right, the amount of RAM and CPU in use.
The last icon on the right of the bottom bar provides access to the program settings. It’s also the only way to reach them because, strangely, there is no menu item dedicated to the settings.
The same bottom bar also lets us select the program’s usage mode, choosing between User (which hides the side icons), Power User (the default mode) and Developer (which apparently does not change the interface).
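Incidentally, LM Studio can also expose a local server with an OpenAI-compatible API, by default at http://localhost:1234/v1. I won’t explore it here, but as a minimal sketch, assuming that server is running and using a placeholder model name, talking to the local model from Python could look like this:

```python
# Minimal sketch: assumes LM Studio's local server is running on its default port.
# The model name below is a placeholder; use the identifier LM Studio shows for
# the model you have actually loaded.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-7b",
    messages=[{"role": "user", "content": "What is a token, in one sentence?"}],
)
print(response.choices[0].message.content)
```

The interesting part is that the “remote” endpoint is our own computer: the request never leaves the machine.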
Putting LM Studio to the test
All that remains is to put LM Studio (and DeepSeek) to the test, perhaps using one of the example prompts. I don’t need to ask an AI what the capital of France is, and the Rubik’s Cube is too ’80s, so let’s see how it handles mathematics.
DeepSeek thinks about it for a minute, but then comes up with a nice proof of the Pythagorean theorem based on proportions. Since I explicitly asked for it, it also formats the equations in LaTeX, which is always a good thing.
And if I click on the little triangle in the Thoughts box, it even shows the reasoning it followed to arrive at that proof. Not bad at all!
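For reference, the classic proportions argument goes roughly like this (my own summary of the standard proof, not necessarily the exact steps DeepSeek produced): dropping the altitude from the right angle onto the hypotenuse $c$ splits it into two segments $p$ and $q$ and creates two smaller triangles similar to the original, so that

$$\frac{a}{c} = \frac{p}{a}, \qquad \frac{b}{c} = \frac{q}{b} \quad\Longrightarrow\quad a^2 + b^2 = cp + cq = c(p+q) = c^2.$$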
So far I’ve been using a Mac Mini M1 with 16 GB of RAM. But I also have a Mac Studio M2 Ultra with 192 GB of RAM available: how long will the same task take on that machine?
Making a comparison is not easy because, even if you use exactly the same prompt, the answers from the Mac Mini and the Mac Studio will always be different, demonstrating the purely statistical nature of the reasoning done by LLMs. You can see this in the two images below, where the Mini is recognizable by the light theme I’ve used so far, while the Mac Studio is configured to use a dark theme.
If the same question is repeated five times on the Mac Mini, the response time varies between 80 and 120 seconds, with a constant speed of 10-11 tokens per second[2]; on the Mac Studio, on the other hand, the responses are generated in 15-45 seconds, with a speed of 60-70 tokens per second. So, roughly speaking, text generation on the Mac Studio is about 6-7 times faster than on the Mac Mini.
Lowering the temperature
But I’m stubborn, and to make the comparison more accurate, I want the two Macs to always give me the same answer. To do this, I have to click on the [glass flask](https://en.wikipedia.org/wiki/Erlenmeyer_flask) icon in the top right corner and set Temperature to zero (the default is 0.8).
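For context, the temperature rescales the probability distribution the model samples from. In the standard softmax formulation (a textbook definition, not anything specific to LM Studio), the probability of token $i$ with raw score $z_i$ at temperature $T$ is

$$p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}},$$

so as $T$ approaches zero the distribution collapses onto the single highest-scoring token and generation becomes essentially deterministic, which is exactly what we want for a repeatable comparison.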
Under these conditions, the Mac Mini takes 75 to 140 seconds to process its responses, while the Mac Studio does it in 11-22 seconds. The strange thing is that, although the answers are always the same, the number of tokens generated changes each time. Even so, the Mac Mini’s speed stays at 10-11 tokens per second, while the Mac Studio is less consistent, generating 60-80 tokens per second.
For a more scientific study, it would be necessary to work under much more controlled conditions, but for now we can be content to say that, even with this configuration, the Mac Studio is at least 6-7 times faster than the Mac Mini.
What about ChatGPT?
For comparison, how long does ChatGPT take? When I ask it the usual question about the Pythagorean theorem with the Reason option enabled, ChatGPT gives me two different, very detailed and well-written answers. And it takes just 28 seconds to do it, a time comparable to that of the Mac Studio.
I have to admit that I’m impressed by the speed of ChatGPT. It’s true that ChatGPT runs on servers that have nothing to do with my two Macs, but it’s also true that they have to respond to thousands of requests in parallel with mine, while the Macs are only at my service.
Conclusions (for now)
In any case, the goal of this article is not to compare ChatGPT with other LLMs, but to verify if it is possible to use an LLM directly on our computer, without sending confidential information over the network.
It is also true that both the Mac Mini and the Mac Studio don’t even notice that LM Studio is doing its processing, so there should be room for optimization. But we will see this (and more) in the next article.
[1] A “model” is the set of weights, mathematical rules and neural structures that constitute a specific LLM.

[2] A “token” is the basic unit of text processed by the model and can correspond to a whole word, a part of a word, a space or a punctuation mark. Whether a word is represented by a single token or by multiple tokens depends on how frequently it is used. If a word is very common, a single token will be used to represent it; if a word is rarely used, it doesn’t make sense to add the whole word to the model’s vocabulary, so it is broken down into more common subwords.
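If you want to see tokenization in action, a tokenizer library such as OpenAI’s tiktoken makes it easy to experiment (used here purely as an illustration; it is not the tokenizer DeepSeek itself uses):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "antidisestablishmentarianism"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
    print(f"{word!r} -> {pieces}")

# A very common word like "the" maps to a single token, while a rare word
# is typically split into more common subword pieces.
```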