2025, Dec 27 11:00

Fix 'model not found (404)' in the Ollama Python API for Hugging Face GGUF: list, pull, generate

Resolve Ollama Python API 'model not found (404)' errors with Hugging Face GGUF: verify installed models, use ollama.list and ollama.pull, then generate.

Fixing “model not found (404)” with the Ollama Python API when calling Hugging Face GGUF models

Calling the Ollama Python API with a Hugging Face GGUF identifier can fail with a 404 even though the same setup works for other local models. The failure surfaces as a “model not found” error while your code looks perfectly fine. The reason sits on the server side, not in your Python logic.

Problem setup

The Python script invokes ollama.generate() with a model string that points to a GGUF on Hugging Face:

import ollama

selected_id = 'hf.co/mradermacher/Llama-3.2-3B-Instruct-uncensored-GGUF'

def pick_token(text):
    # Ask the model to extract the key item from the input text
    output = ollama.generate(
        model=selected_id,
        prompt=f"Identify the product/item in {text}. ..."
    )
    # The completion text is returned under the 'response' key
    return output.get('response', '').strip()

The call fails with a 404 coming from the Ollama server:

model 'hf.co/mradermacher/Llama-3.2-3B-Instruct-uncensored-GGUF' not found (status code: 404)

What’s actually going on

When you run a model interactively with the Ollama CLI (ollama run), Ollama pulls any missing model on demand. The Python API behaves differently: it does not auto-pull. The server must already have the model locally, and if it doesn’t, the API responds with a 404. That is why previously pulled “standard” models work while a new or external identifier fails. In practical terms, the error means the model name you provided is not present on the Ollama server. You will also see it if the identifier has a typo or if the model has been removed and is no longer available under that name.
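
In Python, that 404 surfaces as an ollama.ResponseError, so you can detect the missing-model case explicitly instead of letting the exception propagate. Here is a minimal sketch, assuming current versions of the ollama package, which expose status_code and error on the exception; the 'ping' prompt is just a placeholder:

import ollama

try:
    ollama.generate(
        model='hf.co/mradermacher/Llama-3.2-3B-Instruct-uncensored-GGUF',
        prompt='ping'  # placeholder prompt; any prompt triggers the same check
    )
except ollama.ResponseError as err:
    # The server reports a missing model as HTTP 404
    if err.status_code == 404:
        print('Model is not installed on the server:', err.error)
    else:
        raise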

Solution: ensure the model is present on the server before generating

Verify what’s installed, then pull what you need. The Ollama Python library exposes both steps directly.

To inspect what’s currently available on the server:

import ollama

installed = ollama.list()
print(installed)
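
The exact shape of the list response differs between ollama-python versions: newer releases return a typed object whose entries expose the installed name under 'model', older releases a plain dict keyed by 'name'. A hedged sketch that prints just the installed names under that assumption:

import ollama

installed = ollama.list()
# Each entry carries the installed name including its tag, e.g. 'llama3.2:latest'.
# The 'model'/'name' fallback is an assumption that covers both old and new
# versions of the library.
for entry in installed.get('models', []):
    print(entry.get('model') or entry.get('name'))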

To fetch a model to the server before using it from the API:

import ollama

ollama.pull('llama3.2')
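
Large GGUF downloads can take a while, so it can help to watch the pull as it runs. A small sketch assuming the streaming variant of pull (stream=True), which yields progress updates with at least a status field and, during the download phase, completed/total byte counts:

import ollama

for update in ollama.pull('llama3.2', stream=True):
    status = update.get('status', '')
    completed = update.get('completed')
    total = update.get('total')
    if completed and total:
        print(f"{status}: {completed / total:.0%}")
    else:
        print(status)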

After the pull completes, generate as usual. Here is a minimal flow that pulls a known model and then calls generate, keeping the original program logic intact:

import ollama

model_ref = 'llama3.2'

# Make sure the model exists on the server for API access
ollama.pull(model_ref)

def fetch_keyword(text):
    payload = ollama.generate(
        model=model_ref,
        prompt=f"Identify the product/item in {text}. ..."
    )
    return payload.get('response', '').strip()
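
The same flow applies to the Hugging Face GGUF identifier from the original script. Ollama can pull GGUF repositories directly through hf.co/ references, optionally pinned to a quantization tag; the :Q4_K_M tag below is only an example and must actually exist in that repository:

import ollama

# hf.co/<user>/<repo> references are pulled like any other model name; the
# optional :<quant> suffix picks a specific GGUF file (assumed to exist here).
hf_ref = 'hf.co/mradermacher/Llama-3.2-3B-Instruct-uncensored-GGUF:Q4_K_M'

ollama.pull(hf_ref)

# Quick smoke test with a throwaway prompt
result = ollama.generate(model=hf_ref, prompt='Reply with OK.')
print(result.get('response', '').strip())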

For more details on the available calls, see the Ollama Python library documentation: https://github.com/ollama/ollama-python?tab=readme-ov-file#api.

Why this matters

The difference between CLI and API behavior is easy to overlook. In interactive sessions the CLI masks missing models by pulling them automatically, while API-driven applications require that models are preloaded on the server. Understanding this avoids 404s at runtime, prevents confusing “works on my machine” scenarios, and leads to predictable deployments.

Takeaways

If your Ollama Python call fails with a 404 for a Hugging Face GGUF identifier, the server doesn’t have that model. Confirm the exact model name you intend to use, list what’s currently installed, and pull the target model before generating. Keeping these steps in your workflow ensures stable API behavior and fewer surprises when you move from experiments to production.