2025, Dec 12 15:00

Unsloth GGUF Export Error for Llama 3 on AzureML: Cause, Logs, and a Reliable llama.cpp-Based Workaround

Hit a RuntimeError exporting a fine-tuned Llama 3 model to GGUF with Unsloth? Learn the cause and use a reliable llama.cpp-based conversion workaround to produce a GGUF file.

When exporting a fine-tuned Llama 3 model to GGUF with Unsloth, everything seems to proceed normally until the very end: layers get quantized, metadata is set, and then the process fails with a runtime error. If you are on an AzureML Standard_NC24ads_A100_v4 VM, working with unsloth/Meta-Llama-3.1-8B-Instruct on Unsloth 2025.5.6, the symptoms below will look familiar. The good news: this is a known issue being fixed upstream, and there’s a reliable workaround that lets you produce a GGUF artifact right now.

Minimal example that triggers the failure

The export API call looks straightforward. After fine-tuning, a single line initiates conversion and quantization:

learner.save_pretrained_gguf("model", spm, quantization_method="q4_k_m")
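
If it helps to see the surrounding context, here is a minimal sketch of how that call typically sits in an Unsloth script, assuming the standard FastLanguageModel loading pattern; learner and spm are simply the model and tokenizer objects returned by Unsloth, and the fine-tuning step itself is elided:

from unsloth import FastLanguageModel

# Assumed setup: the model name matches this report; sequence length and
# 4-bit loading are illustrative defaults, not requirements.
learner, spm = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# ... fine-tune with your trainer of choice ...

# The failing step: convert the fine-tuned weights to GGUF and quantize to q4_k_m.
learner.save_pretrained_gguf("model", spm, quantization_method="q4_k_m")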

During execution you’ll see a long stream of quantization logs, followed by a failure that tells you to compile llama.cpp yourself and redo the quantization from the model directory:

RuntimeError: Unsloth: Quantization failed for .../model/unsloth.BF16.gguf
You might have to compile llama.cpp yourself, then run this again.
You do not need to close this Python program. Run the following commands in a new terminal:
You must run this in the same folder as you're saving your model.
git clone --recursive https://github.com/ggerganov/llama.cpp
cd llama.cpp && make clean && make all -j
Once that's done, redo the quantization.

What’s actually going on

The failure isn’t caused by your VM, your build flags, or the model choice. It’s tied to a bug currently being addressed in Unsloth itself. The maintainers are tracking it publicly in issues #2581, #2580, and #2604. Until the fix lands, the export path may fail even if you’ve already built llama.cpp and placed its binaries alongside your checkpoint.

Practical workaround that works today

You can still get a GGUF file by building llama.cpp and running the provided conversion script manually. These commands are what the error output itself suggests for building:

git clone --recursive https://github.com/ggerganov/llama.cpp
cd llama.cpp && make clean && make all -j
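
Note that recent llama.cpp revisions have been migrating from the Makefile to CMake as the primary build system, so make all may be deprecated or missing depending on when you clone. If that is the case, the equivalent CMake build is roughly:

cmake -B build
cmake --build build --config Release -j

With CMake the resulting binaries land under build/bin rather than the repository root, which matters for the binary paths used below.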

Once built, produce the GGUF file explicitly. Run this from the directory where your model is saved, pointing the script at the folder that contains the saved checkpoint (my_model below):

python3 llama.cpp/convert_lora_to_gguf.py my_model

This path succeeds even when the integrated save_pretrained_gguf workflow fails, and it yields a usable GGUF artifact.
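
One optional follow-up, since the original call targeted q4_k_m: the error above references an intermediate unsloth.BF16.gguf that Unsloth had already written before its quantization step failed. If that file is present in your model directory, llama.cpp’s llama-quantize tool can produce the quantized variant from it. The paths below are illustrative and assume a make build that leaves the binary in the repository root (for a CMake build, look under build/bin):

./llama.cpp/llama-quantize model/unsloth.BF16.gguf model/unsloth.Q4_K_M.gguf Q4_K_M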

Why this matters

Exporting to GGUF is a key step if you plan to serve the fine-tuned model with llama.cpp or compatible runtimes. Knowing that the current Unsloth release may fail at the last mile saves time: instead of iterating on build switches or moving binaries around, you can pivot to the manual conversion and keep your deployment pipeline moving. When the upstream fix ships, you can return to the integrated export call for a more streamlined flow.
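
As a quick smoke test before wiring the file into a serving stack, you can load it directly with llama.cpp’s command-line tool; the model path below is a placeholder, and the binary location again depends on how you built llama.cpp:

./llama.cpp/llama-cli -m model/unsloth.Q4_K_M.gguf -p "Hello, what can you do?" -n 64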

Suggested path forward

If your GGUF export fails with the described RuntimeError, treat it as the known issue documented in the Unsloth tracker. Use the manual conversion route to generate the final file now, and revisit the native export once the linked issues are closed. Keep your environment ready by updating Unsloth when a new release becomes available and rerunning the built-in export afterward.
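
When a fixed release lands, upgrading in place is usually enough before retrying the built-in export; for a pip-managed environment such as the AzureML setup described here, that is typically:

pip install --upgrade unsloth

followed by the same save_pretrained_gguf call as before.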