Running a local LLM on a Jetson Orin Nano

AI Hardware

TL;DR: How to install and set up the necessary tools to run an LLM at home on your Nvidia Jetson Orin Nano

2 minute read

web ui

Why run an LLM locally?

Running a Large Language Model on your own hardware gives you full control over your data, zero API costs, and the ability to use AI offline. The Nvidia Jetson Orin Nano is a compact yet powerful device perfect for this use case.

What you’ll need:

Nvidia Jetson Orin Nano (8GB RAM recommended)
MicroSD card (64GB+) or NVMe SSD
Power supply (5V 4A or higher)
Internet connection for initial setup

Setup

Ollama is the easiest way to run LLMs locally. It handles model downloads, quantization, and provides a simple API.

Install Ollama with this command:

curl -fsSL https://ollama.com/install.sh | sh

Verify the installation:

ollama --version

Configure for Jetson’s hardware:

The Jetson Orin Nano has limited VRAM (~4GB shared with system). To avoid memory errors, configure Ollama to run on CPU:

sudo systemctl edit ollama

Add the following:

[Service]
Environment="OLLAMA_NUM_GPU=0"

Apply changes:

sudo systemctl daemon-reload
sudo systemctl restart ollama

While the Jetson has CUDA support, its shared memory architecture means large models won’t fit in VRAM. Running on CPU with 8GB RAM gives you more flexibility to run larger models.

Running your first model

With 8GB RAM, here’s what you can run:

Model	Parameters	RAM Usage	Speed
Qwen2.5:0.5b	0.5B	~1GB	Very fast
Qwen2.5:1.5b	1.5B	~2GB	Fast
Phi-3 Mini	3.8B	~3GB	Good
Llama 3.2:3b	3B	~3GB	Good
Mistral 7B Q4	7B	~5GB	Slow

For the best balance between speed and quality, I recommend Qwen2.5:1.5b.

Download and run:

ollama pull qwen2.5:1.5b

Then run it:

ollama run qwen2.5:1.5b

You should see an interactive prompt where you can start chatting with your local LLM.

Going further

Add a Web UI:

For a better experience, you can add Open WebUI, a ChatGPT-like interface for Ollama:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Once running, access the web interface at http://localhost:3000.

Optimize for speed:

To get faster responses on limited hardware, configure a system prompt that encourages conciseness:

You are a concise assistant. Answer briefly. No filler.
Get to the point. Use lists only when necessary. No repetition.

This reduces the number of tokens generated, speeding up response times significantly.

Access remotely:

With cloudflared tunnels, you can access your WebUI from any network! I personally created a subdomain ai.eliasgauthier.fr (requires login credentials, which I can provide upon request).

Troubleshooting

CUDA out of memory error

500: llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer

This means the model is too large for GPU memory. Either use a smaller model or force CPU mode as described above.

Model runs too slowly

Try a smaller quantized model:

ollama pull qwen2.5:0.5b

Ollama service not starting

Check logs:

sudo journalctl -u ollama -f

Conclusion

The Jetson Orin Nano is a capable device for running small to medium LLMs locally. While you won’t match the speed of cloud APIs or high-end GPUs, having a private, always-available AI assistant at home is worth the trade-off.

For best results, stick to models under 4B parameters in Q4 quantization. The Qwen2.5 1.5B model offers an excellent balance of speed and capability for everyday use.

Originally published December 29, 2025 | View revision history

En savoir plus sur l'auteur →