2025, Nov 07 07:00

How to Force a Follow-Up Tool Call: Mid-Turn Prefix Completion for LLM Assistants with Transformers

Force mid-turn continuation in LLM tool calling: use prefix completion with Hugging Face Transformers and PyTorch, removing EOS to finish function calls fast.

LLM-powered tool calling often hinges on a clean two-step pattern: enumerate options, then act on one. In practice, small models can stall between those steps, stopping right after listing files and never issuing the follow-up tool call that actually opens the document. The natural question, then, is whether we can force the model to continue a partially written assistant turn, completing what’s already on the page rather than starting the whole response from scratch.

The minimal example

Imagine a "file explorer" flow where the assistant first lists files and then calls a function to open a specific one. The second call is the only allowed tool at that moment, so the continuation should be trivial, but the model sometimes stops short of finishing it.

Input:
"""
User: Open the about file.
Assistant: *list_dir()
Tool: Available files, please select one:
 - file.txt
 - example.txt
 - about.txt
 - random.txt
Assistant: *read_file("
"""
LLM Output:
"""
about.txt")
"""

The desired behavior is a simple completion from within the same assistant turn.

What’s actually going wrong

The response is generated token by token, but many chat APIs finalize the assistant message once they see the end-of-sequence marker. After that point the next generation step is a new turn, not a continuation of the same one. If the platform doesn’t support continuing from a prefix inside the current assistant message, you can’t reliably nudge the model to “pick up mid-call” and finish the function arguments you already started.

In other words, you want prefix completion inside the assistant message itself. When this is not supported, the second tool call frequently disappears.
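
To make this concrete, here is a minimal sketch of the difference, assuming a ChatML-style template such as Qwen's, where every turn is closed by an <|im_end|> marker plus a newline. Rendering the partial assistant message normally seals the turn; slicing those trailing tokens off leaves it open, so the next generated token lands inside the unfinished call. The exact number of trailing tokens is template-specific, so check it for your model.

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('Qwen/Qwen3-0.6B')
partial_turn = [
    {'role': 'user', 'content': 'Open the about file.'},
    {'role': 'assistant', 'content': '<tool_call>\n{"name": "read_file", "arguments": {"filename": "'},
]
# Default rendering: the template closes the assistant turn with an end-of-turn marker.
closed = tok.apply_chat_template(partial_turn, tokenize=False)
print(repr(closed[-60:]))  # the partial call is followed by the turn-closing marker
# Slicing off the last two token ids (marker + newline) leaves the turn open,
# so the next generated token continues the unfinished call instead of a new turn.
open_ids = tok.apply_chat_template(partial_turn, return_tensors='pt')[:, :-2]
print(repr(tok.decode(open_ids[0])[-60:]))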

A practical workaround with Transformers

When the API doesn’t expose this capability, you can approximate it by constructing the chat prompt yourself and removing the EOS token so the model continues from the given prefix. Below is a compact proof of concept using Hugging Face Transformers and PyTorch. It builds the chat input, strips the final EOS (plus the trailing newline the template appends), then steps token by token, letting the model complete the pending tool-call arguments.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Run on GPU when available, otherwise fall back to CPU.
runtime_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(runtime_device)
# A small instruction-tuned model whose chat template supports tool calling.
repo_id = 'Qwen/Qwen3-0.6B'
tok = AutoTokenizer.from_pretrained(repo_id)
lm = AutoModelForCausalLM.from_pretrained(repo_id).to(runtime_device)
def fetch_file(filename: str) -> str:
    """
    Opens the text document of name "filename".
    Example call:
    fetch_file("example.txt")
    Args:
        filename: The name of the document file
    Returns:
        str: The file contents
    """
    return f"Succesfully opened file \"{filename}\"."
toolset = [fetch_file]
directory_view = """
File Explorer
Available files, use the "*open_file()*" function to open only one:
 - about.txt
 - coding_paper.txt
 - system_requirements.txt
 - updates.txt"""
chat_log = [
    {'role': 'user', 'content': "What's your latest update?"},
    {'role': 'tool', 'content': directory_view},
    # The final assistant message is the partial tool call we want the model to finish.
    {'role': 'assistant', 'content': '<tool_call>\n{"name": "fetch_file", "arguments": {"filename": "'}
]
TOPK = 5  # candidates to rank each step; only the top-1 is kept, i.e. greedy decoding
MAX_TOKENS_LIMIT = 32768  # safety cap on total sequence length so the loop cannot run forever
def continue_span(dialogue):
    # Render the chat (including the partially written assistant turn) to token ids.
    seq_ids = tok.apply_chat_template(
        dialogue,
        tools=toolset,
        return_tensors='pt',
        padding=True,
        truncation=True
    ).to(runtime_device)[:, :-2]  # drop the trailing end-of-turn marker and newline so the turn stays open
    print(tok.decode(seq_ids[0], skip_special_tokens=False), end='', flush=True)
    # Generate one token at a time until the model emits EOS or the safety cap is reached.
    while seq_ids.shape[-1] < MAX_TOKENS_LIMIT:
        piece, seq_ids, is_eos = next_symbol(seq_ids)
        if is_eos:
            break
        print(piece, end='', flush=True)
def next_symbol(seq_ids):
    # Forward pass over the whole prefix; no gradients are needed for inference.
    with torch.no_grad():
        out = lm(seq_ids)
    logits = out.logits
    last_step = logits[0, -1, :]  # scores for the next token only
    probs = torch.softmax(last_step, dim=-1)
    topk_probs, topk_indices = torch.topk(probs, TOPK)
    best_id = topk_indices[0].reshape(1,)  # greedy pick: the highest-probability token
    seq_ids = torch.cat([seq_ids, best_id.unsqueeze(0)], dim=-1)
    piece = tok.decode([best_id.item()])
    is_eos = best_id.item() == tok.eos_token_id
    return piece, seq_ids, is_eos
continue_span(chat_log)

This snippet completes the tool-call argument by continuing the assistant’s partial output. The key is that the turn-closing tokens are removed from the assembled chat representation, which lets the model treat the unfinished call as an ongoing sequence rather than a closed turn.
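
Because the number of trailing tokens the template appends is template-specific, it is worth confirming what the [:, :-2] slice actually drops before relying on it. A small check, reusing tok, toolset, and chat_log from the script above:

# Render the full template once and inspect its final token ids to see
# exactly which tokens the [:, :-2] slice removes for this chat template.
full_ids = tok.apply_chat_template(chat_log, tools=toolset, return_tensors='pt')[0]
for tid in full_ids[-4:].tolist():
    print(tid, repr(tok.decode([tid])))
# If the last two entries are the end-of-turn marker and a newline, the slice
# in continue_span() leaves the assistant turn open as intended; otherwise
# adjust the offset to match your template.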

Why this matters

Mid-turn continuation improves formatting consistency and reduces brittle parsing around tool call scaffolding. When the model repeatedly fails to emit the second call, being able to resume generation from inside the assistant message keeps the structure stable and avoids re-prompting tricks that still may not guarantee the desired format.
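
Once the continuation finishes, the completed call still has to be extracted and dispatched. A minimal parsing sketch, assuming the Qwen-style <tool_call>...</tool_call> wrapper used in the prompt above; the run_tool_call helper is illustrative and not part of the original script (the script only prints the pieces, so you would need to collect the assistant prefix plus the generated text first):

import json
import re

def run_tool_call(assistant_text, tools_by_name):
    # Pull the JSON payload out of the <tool_call> ... </tool_call> wrapper.
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", assistant_text, re.DOTALL)
    if match is None:
        return None  # no complete tool call was produced
    call = json.loads(match.group(1))
    func = tools_by_name[call["name"]]
    return func(**call.get("arguments", {}))

# Example usage, assuming prefix and continuation were concatenated into one string:
# result = run_tool_call(prefix_plus_continuation, {"fetch_file": fetch_file})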

Takeaways

If you need to force a follow-up tool call from a partially written assistant turn and your chat runtime does not support it, constructing the prompt yourself and continuing past the missing segment is a viable stopgap. The approach above is a proof of concept to finish a tool call; it’s not a tool-execution framework. For production, monitor the platform you use and adopt built-in features like forced tool calls or multiple-choice tools as they become available.
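
One such feature worth checking for: recent versions of Transformers expose a continue_final_message argument on apply_chat_template, which keeps the last assistant message open without manual token slicing. A sketch, assuming your installed version supports the flag:

# If available in your Transformers version, this replaces the manual [:, :-2] slice:
seq_ids = tok.apply_chat_template(
    chat_log,
    tools=toolset,
    continue_final_message=True,  # do not close the final assistant turn
    return_tensors='pt'
).to(runtime_device)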

The article is based on a question from StackOverflow by Bob and an answer by Bob.