So you want to run AI models locally?
Artificial Intelligence / January 15, 2025 • 6 min read
Tags: ai, llm
AI has undoubtedly been at the center of attention in 2024 and 2025, and the improvements it could bring to humanity are enormous. However, there is a fundamental problem with the current state of AI: the best and brightest models often require immense computing power that the majority of people do not have. Instead, people are forced into using proprietary models and must accept that their conversations will be mined and analysed.
Recently, Anthropic published a blog post titled The Anthropic Economic Index, in which they explicitly state that they have analysed millions of “anonymized” conversations.
It’s only natural to ask: how do I run AI on my own machine? Answering that question is the topic of this article.
Prerequisites
Before jumping into running models locally, there are a few terms and concepts that need to be understood first. Understanding these ideas will help when choosing which models to run.
Models
What is a “model”? A model is a file containing millions or billions of numbers (called parameters) that represent the patterns learned from training data. These parameters form the model’s “knowledge” and shape how a model processes information.
More parameters usually result in a more capable model with broader knowledge, but also require more computing power to run.
Security
Security is important when downloading and running AI models locally. While models primarily contain numerical parameters, the process of loading them could potentially be exploited if not done safely. Always download models from trusted sources like Hugging Face, and prefer models in the .safetensors format when available. This format ensures safe loading of model data by preventing arbitrary code execution during the loading process. However, even with safetensors, it’s important to use models from reputable creators, as a model’s outputs could still be harmful or biased.
Additionally, models can be poisoned, meaning they were (perhaps unknowingly) trained on malicious data. Such data can affect the model’s output or “thinking” process in various ways.
Quantization
Large language models require a lot of memory and computational power. Quantization is the process of compressing a model from a high-precision data representation to a lower-precision data representation. This compression lowers required memory and computational requirements.
For example, a model might be compressed from 32-bit floating-point (FP32) to 4-bit integer (Q4) representation. However, this compression may impact the quality of the model, with more aggressive quantization generally resulting in greater quality loss.
You might come across a model like Hermes-3-Llama-3.2-3B.Q5_K_M.gguf, where:
- 3B: 3 billion parameters
- Q5: 5-bit quantization
- K: k-quant (cluster-based) quantization method, which generally provides better quality than simpler quantization methods
- M: the “medium” variant of the quantization mix, a balanced setting between quality and compression
- gguf: The GGUF format is particularly optimized for efficient loading and inference on CPU/GPU, making it suitable for running locally on personal computers.
In summary, if you have access to a powerful computer, less aggressively quantized models (like Q8) can be used for better quality; otherwise, choose a more aggressively quantized model (like Q4). Floating-point models (FP32, FP16) offer the best quality but require the most memory and a fast GPU. The choice of quantization is ultimately a trade-off between model quality, memory usage, and computational requirements.
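To get an intuition for these numbers, you can estimate a model’s rough size from its parameter count and the bits used per parameter. The sketch below is a back-of-the-envelope calculation only; real files are somewhat larger due to metadata, and runtime memory use is higher because of the context (KV cache).

```python
# Rough estimate of model size: parameters * bits-per-parameter / 8 bytes.
# Real-world numbers are somewhat higher due to metadata and runtime overhead.
def estimate_size_gb(params_billions: float, bits_per_param: float) -> float:
    return params_billions * 1e9 * (bits_per_param / 8) / 1e9

for label, params, bits in [
    ("3B at FP16", 3, 16),
    ("3B at Q4", 3, 4),
    ("8B at Q5", 8, 5),
    ("13B at Q4", 13, 4),
]:
    print(f"{label}: ~{estimate_size_gb(params, bits):.1f} GB")
```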
Now that we have covered the prerequisites, let’s move on to running models locally.
Hardware Requirements
Here’s a table with approximate requirements (note that these can vary based on implementation and specific optimizations):
| Model Size | Practical Use Case | Minimum Hardware Requirements | Performance Notes | Disk Space Required |
|---|---|---|---|---|
| 3B-7B | Most consumer use cases | 16GB RAM system • entry-level CPU/GPU (4GB VRAM) • M1/M2 Mac | Good for daily use • fast responses • works well with Q4 quantization | 2-4GB (Q4) • 3-6GB (Q5) |
| 7B-13B | Improved quality, still practical | 32GB RAM system • mid-range GPU (8GB+ VRAM) • M1/M2 Pro/Max | May require Q4/Q8 quantization • reasonable response times • better quality than smaller models | 4-7GB (Q4) • 5-9GB (Q5) |
| 13B+ | Enthusiast/professional | 64GB+ RAM • high-end GPU (12GB+ VRAM) • M2 Max/Ultra | Requires careful resource management • slower responses • best quality but resource intensive | 7-30GB+ (Q4) • 9-40GB+ (Q5) |
In my experience, the Apple M-series Max chips can run models up to 70B (Q4), although at slower speeds.
Model Selection
Models can be trained to excel in different areas, such as coding, writing and chatting. Usually these models are fine-tuned from base models, which have been trained on large amounts of data but do not follow instructions or handle conversations very well.
In most cases, users are interested in instruct or chat models. These have been fine-tuned to follow instructions and have been specialized for conversation.
If you don’t know where to start, consider the following models:
- Hermes-3-Llama-3.2-3B
- Qwen2.5-3B-Instruct
- Qwen2.5-7B-Instruct
- Hermes-3-Llama-3.1-8B
- Nous-Hermes-Llama2-13b
- Mistral-Small-24B-Instruct
Frameworks
There are several ways to build and run large language models. You might come across PyTorch, MLX and JAX. These are great frameworks, and each of them offers different pros and cons. For this article, we will focus on the methods that consumers can use to quickly start using AI locally.
Mozilla Llamafile
The absolute easiest method of running local AI models is to use Llamafile. You download a single binary that works on all major platforms (macOS, Windows, Linux) and simply run it. For example, running Llama-3.2-3B-Instruct.Q6_K:
$ chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile
$ ./Llama-3.2-3B-Instruct.Q6_K.llamafile
If you are using Windows, simply rename the file so it ends with .exe, then run it.
You can find llamafile-based models on Mozilla’s Hugging Face page.
I recommend this project because it is very easy to use and great for consumers, since it utilizes both the CPU and GPU. You can also download separate model weights in the .gguf format and run them with llamafile:
./llamafile.exe -m mistral.gguf
Llamafile also includes a simple web interface as well as an HTTP API.
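As a quick illustration, here is a minimal Python sketch that talks to a running llamafile over its OpenAI-compatible HTTP API. It assumes the default address http://localhost:8080 and the requests library; adjust the port if you started llamafile with other options.

```python
# Minimal sketch: query a running llamafile via its OpenAI-compatible API.
# Assumes the default address http://localhost:8080 (adjust if needed).
import requests  # pip install requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # llamafile serves whichever model it was started with
        "messages": [
            {"role": "user", "content": "Explain quantization in one sentence."}
        ],
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```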
Ollama
Ollama is a popular method of running models locally. The difference between Ollama and Llamafile is that Ollama is a program you need to install, while Llamafile is just a binary you download. Ollama also has a slightly easier way of finding and downloading models. Models that have been verified to work with Ollama can be found here.
Ollama also provides an HTTP API.
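For example, here is a minimal sketch that calls Ollama’s local API with the requests library. It assumes the default port 11434 and that you have already pulled a model such as llama3.2 (`ollama pull llama3.2`).

```python
# Minimal sketch: generate text via Ollama's local HTTP API (default port 11434).
# Assumes a model has already been pulled, e.g. `ollama pull llama3.2`.
import requests  # pip install requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Why would someone run an AI model locally?",
        "stream": False,  # return one complete JSON response instead of a stream
    },
    timeout=120,
)
print(response.json()["response"])
```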
Hugging Face Transformers
Sometimes you can’t use Llamafile or Ollama, but in most cases you will be able to use Hugging Face transformers instead. transformers is a popular open-source library that provides tools for working with transformer-based machine learning models.
Most models on Hugging Face have code examples that show how to run the model with transformers. For example, take a look at running NousResearch/DeepHermes-3-Llama-3-8B-Preview.
However, this method requires basic knowledge of Python programming.
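As a starting point, the sketch below uses the transformers text-generation pipeline. The model ID Qwen/Qwen2.5-3B-Instruct is just an example from the list above; any compatible instruct model works, and the weights are downloaded from Hugging Face on first run.

```python
# Minimal sketch: run a local model with the Hugging Face transformers pipeline.
# Requires `pip install transformers torch`; downloads weights on first run.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-3B-Instruct",  # example model; swap in any instruct model
)

prompt = "Give me three tips for running language models locally."
output = generator(prompt, max_new_tokens=150)
print(output[0]["generated_text"])
```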
vLLM
vLLM is a production-ready framework for LLM inference and serving. If your goal is to serve a model in production, vLLM might be what you are looking for. vLLM can also be used locally, either directly on the host or on a dedicated inference server within a local area network.
This is a more advanced method and more suited to people with knowledge in building and serving APIs.
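For illustration, here is a minimal sketch of offline inference with vLLM’s Python API. It assumes a supported GPU, and the model name is used purely as an example; vLLM can also expose an OpenAI-compatible HTTP server if you prefer serving over the network.

```python
# Minimal sketch: offline batch inference with vLLM (requires a supported GPU).
# Requires `pip install vllm`; the model name is an example, not a requirement.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What are the benefits of self-hosting an LLM?"], sampling)
print(outputs[0].outputs[0].text)
```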
Conclusion
You have now learned the basic prerequisites and several methods for running models locally. It is time to break free from the cloud’s prying eyes and join the self-hosted AI revolution - your machine, your rules, your liberation!