Ollama (Self-hosted)

Run open-source models on your own infrastructure and connect them to the AI Kit. No data ever leaves your network.

When self-hosting makes sense

  • You have data-residency or privacy requirements that rule out external providers.
  • You operate in an air-gapped environment without internet access.
  • You want a predictable monthly cost: you pay for the server, not per token.
  • You want to experiment with newer open-source models quickly.

When self-hosting does not make sense

  • You expect the absolute highest quality on hard tasks — frontier proprietary models are still ahead.
  • You have low volume — paying per token is cheaper than running hardware for a handful of requests per day.
  • You do not have someone on the team who is comfortable managing servers.

What you need

  • A machine with Ollama installed. For useful chat performance, a recent GPU is a near-must — but smaller models run acceptably on a modern CPU with plenty of RAM.
  • Network reachability between the AI Kit server and the Ollama server. Running both on the same host or on the same private network is fine; reaching Ollama over the public internet is unusual and not recommended, because Ollama has no authentication.

Install Ollama

On Linux (one-liner):

curl -fsSL https://ollama.com/install.sh | sh

On macOS or Windows, download the installer from ollama.com.

Verify the installation:

ollama --version

Pull a model

Ollama handles models much like container images: you pull them by name and optional tag, and they are stored locally. Pull the ones you want before connecting them in the AI Kit:

# A balanced chat model
ollama pull llama3.1

# A small chat model (faster, lower quality)
ollama pull llama3.2:3b

# An embedding model
ollama pull nomic-embed-text

Models are downloaded once and cached on disk. The first run after pulling is slower because the model is loaded into memory.
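Before moving on, it can help to confirm that a model is actually present on the machine. The following is a minimal sketch, assuming that ollama list prints a header row followed by one model per line with name:tag in the first column; model_is_pulled is a hypothetical helper name, not part of Ollama.

```shell
# Hypothetical helper: succeed if the given model appears in `ollama list`.
# $1: model name as used with `ollama pull`, e.g. llama3.1 or llama3.2:3b
model_is_pulled() {
  # A name without a tag is stored under the "latest" tag.
  case "$1" in
    *:*) wanted="$1" ;;
    *)   wanted="$1:latest" ;;
  esac
  # Skip the header row and compare the first column exactly.
  ollama list | awk -v m="$wanted" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}
```

For example, model_is_pulled llama3.1 && echo "ready" prints "ready" only if the model has been pulled.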

📷 SCREENSHOT: A terminal showing ollama pull llama3.1 completing and ollama list displaying the available models.

Make Ollama reachable

By default, Ollama listens on 127.0.0.1:11434. To let the AI Kit reach it from another host (or from inside a Docker container), set the OLLAMA_HOST environment variable on the Ollama side:

# Linux example
OLLAMA_HOST=0.0.0.0:11434 ollama serve

Where possible, bind to the machine's private IP address rather than 0.0.0.0, which listens on every network interface. Place Ollama behind a firewall: by design it has no authentication.
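A variable set on the command line only applies to that one ollama serve process. On Linux systems where the install script has registered Ollama as a systemd service named ollama, the setting can be made permanent with a drop-in file. The sketch below only prints the drop-in content; ollama_host_dropin is a hypothetical helper name, and the address and file path in the comments are examples to adapt.

```shell
# Hypothetical helper: print a systemd drop-in that sets OLLAMA_HOST.
# $1: address and port to bind to, e.g. 192.168.10.20:11434
ollama_host_dropin() {
  printf '[Service]\nEnvironment="OLLAMA_HOST=%s"\n' "$1"
}

# Typical use on the Ollama host (requires root; assumes the service is named "ollama"):
#   sudo mkdir -p /etc/systemd/system/ollama.service.d
#   ollama_host_dropin 192.168.10.20:11434 |
#     sudo tee /etc/systemd/system/ollama.service.d/override.conf
#   sudo systemctl daemon-reload && sudo systemctl restart ollama
```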

Connect from the AI Kit

  1. Open the Models tab and click Configure new Model.
  2. Pick Ollama as provider.
  3. Enter a display name, e.g. "Llama 3.1 (local)".
  4. Enter the model ID — the exact name you used with ollama pull, for example llama3.1.
  5. Enter the endpoint URL — for example http://ollama.local:11434 or http://192.168.10.20:11434.
  6. Pick the model from the list. For embeddings, connect a second Ollama model and pick the embedding model ID you pulled (e.g. nomic-embed-text) via the Custom / Other option.
  7. Click Create. The model is saved and available in the workspace.
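If the connection fails, checking the endpoint from the AI Kit host with curl helps separate network problems from configuration problems. A minimal sketch, assuming curl is installed there; ollama_list_remote is a hypothetical helper name, while /api/tags is the Ollama API route that lists the locally available models.

```shell
# Hypothetical helper: list the models an Ollama server reports.
# $1: endpoint URL exactly as entered in the AI Kit, e.g. http://ollama.local:11434
ollama_list_remote() {
  # -f: fail on HTTP errors, -sS: quiet but still show errors,
  # --max-time: give up after 5 seconds instead of hanging on a blocked port.
  curl -fsS --max-time 5 "${1%/}/api/tags"
}
```

Running ollama_list_remote http://ollama.local:11434 should return a JSON document that names every pulled model. A timeout or "connection refused" usually points to the bind address or a firewall rather than to the AI Kit.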

📷 SCREENSHOT: The Ollama configuration step with the endpoint URL visible and the test result showing success.

Hardware notes

Model size          | RAM (CPU) | VRAM (GPU) | Comment
------------------- | --------- | ---------- | -------------------------------------------------
3B parameter chat   | 8 GB      | 4 GB       | Workable, modest quality.
7-8B parameter chat | 16 GB     | 8 GB       | Reasonable default.
13B parameter chat  | 32 GB     | 16 GB      | Better quality, slower on CPU.
30B+ parameter chat | 64 GB+    | 24 GB+     | GPU strongly recommended.
Embeddings          | 4-8 GB    | 4 GB       | Cheap to run; load mainly comes from chat models.

For interactive use (agents), latency matters — invest in a GPU. For background workflows where a few seconds of delay are fine, CPU can be enough.
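To judge whether a machine is fast enough, generation speed in tokens per second is a convenient measure. Ollama's API documentation describes two fields in non-streaming responses that allow this calculation: eval_count (number of generated tokens) and eval_duration (generation time in nanoseconds). The sketch below does only the arithmetic; tokens_per_second is a hypothetical helper name, and the two values have to be copied from a response.

```shell
# Hypothetical helper: convert Ollama's eval_count and eval_duration to tokens per second.
# $1: eval_count (tokens), $2: eval_duration (nanoseconds)
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}
```

For example, tokens_per_second 100 4000000000 prints 25.0, that is, 100 tokens generated in 4 seconds.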

Recommendations

  • ✅ Pull at least one chat model and one embedding model before configuring the AI Kit. Otherwise the connection test for the missing model will fail.
  • ✅ Use a stable endpoint URL (a hostname, not an IP). Servers occasionally move; hostnames let you fix it in one place.
  • ✅ Place Ollama on a server with enough disk space for the models you plan to pull (10-50 GB for a small library).
  • ✅ Update Ollama and the models periodically. New versions bring quality and speed improvements.
  • ⚠️ A model that has not been used for a while is unloaded from memory. The first prompt after a quiet period is slower. This is normal.
  • ❌ Do not expose Ollama directly to the public internet. It has no authentication.
  • ❌ Do not point the AI Kits of multiple workspaces at a single Ollama instance that is sized for one. Concurrent requests without enough RAM or VRAM force models to be swapped in and out of memory, and throughput collapses.
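The slow first prompt after a quiet period can be softened by loading the model ahead of time. A minimal sketch, assuming curl is available: Ollama's API loads a model when it receives a generate request without a prompt, and the keep_alive field controls how long the model stays in memory afterwards. ollama_preload is a hypothetical helper name, and the 30m default here is an arbitrary example value.

```shell
# Hypothetical helper: load a model into memory and keep it there for a while.
# $1: endpoint URL, $2: model name, $3: keep_alive duration (optional, default 30m)
ollama_preload() {
  curl -fsS "${1%/}/api/generate" \
    -d "{\"model\": \"$2\", \"keep_alive\": \"${3:-30m}\"}"
}
```

For example, ollama_preload http://ollama.local:11434 llama3.1 1h could run from a scheduled job shortly before working hours. Keeping models loaded occupies RAM or VRAM for the whole period, so this is a trade-off on machines that serve several models.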

Frequently asked questions

Can I use Ollama and OpenAI side by side? Yes. Connect both as separate models. Different automations can use different models freely.

Can I use the same Ollama instance for embeddings as for chat? Yes. Pull an embedding model (nomic-embed-text is a good default) and connect it as a second model with the Embeddings toggle on.

What about Ollama's API key? Ollama has no API keys. The endpoint URL is all you configure on the AI Kit side. Authentication, if you need it, is something you add in front of Ollama (a reverse proxy with HTTP basic auth or a private network).

What to do next