2025, Dec 02 23:00

Understanding CausalLM Label Alignment for Seq2Seq Tasks: -100 Prompt Masking and Internal Shifted Loss

Learn why seq2seq-style fine-tuning with a CausalLM keeps the labels identical to the target tokens, masks the prompt with -100, and relies on the model's internal label shift to compute the dialog-to-summary loss.

Training LlamaForCausalLM on a seq2seq-style task like dialog-to-summary raises a common labeling question: why does the dataset use the same token ids for labels as for the input and set the prompt part to -100, instead of shifting labels by one position? Let’s walk through the setup, inspect the code, and clarify how loss is computed for CausalLM models so the behavior makes sense.

Minimal example that triggers the confusion

The dataset processing concatenates the dialog (prompt) and the summary into a single sequence. Tokens coming from the prompt are ignored for loss, while tokens from the summary are used as labels. The essential logic looks like this:

tok = tokenizer   # pretrained tokenizer for the model being fine-tuned
row = sample      # one raw record with "prompt" (dialog) and "summary" fields
# Prompt: <bos> + dialog, tokenized without extra special tokens
lead_seq = tok.encode(tok.bos_token + row["prompt"], add_special_tokens=False)
# Target: summary + <eos>
abstract_seq = tok.encode(row["summary"] + tok.eos_token, add_special_tokens=False)
row = {
    "input_ids": lead_seq + abstract_seq,
    "attention_mask": [1] * (len(lead_seq) + len(abstract_seq)),
    # Prompt positions are ignored for loss; summary tokens are the labels, unshifted
    "labels": [-100] * len(lead_seq) + abstract_seq,
}

Inspecting a batch confirms that the labels mirror the target part of input_ids while prompt positions are -100.

train_loader = train_dataloader
mini = next(iter(train_loader))
# This slice spans the boundary between the masked prompt and the summary labels
print(mini["input_ids"][0][35:40])
print(mini["labels"][0][35:40])

tensor([19791, 512, 32, 36645, 41778])
tensor([ -100, -100, 32, 36645, 41778])

At first glance this looks off if you expect labels[i] to target input_ids[i+1] for every predicted token. So why does this work?

What CausalLM actually predicts

Causal language models predict each token from all previous tokens in the same sequence: the logits produced at position i are trained to match the token at position i+1. Viewed across the whole sequence, the gold target at every step is simply the next token of the sequence itself, which is why the labels mirror input_ids on the target segment instead of being shifted in the dataset.
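
A toy example makes the alignment concrete. The token ids below are made up purely for illustration:

# Hypothetical ids: <bos> = 101, <eos> = 102, prompt token = 7, summary tokens = 8 and 9
input_ids = [101, 7, 8, 9, 102]      # <bos>  prompt  summary  <eos>
labels    = [-100, -100, 8, 9, 102]  # prompt masked, summary tokens kept as-is
# The gold token for the prediction made at position i is input_ids[i + 1]:
#   position 1 (last prompt token 7) -> 8    (first summary token)
#   position 2 (summary token 8)     -> 9
#   position 3 (summary token 9)     -> 102  (<eos>)
# Those gold tokens are exactly the values already sitting in labels; the
# one-position offset is applied inside the model, as explained next.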

The shift happens under the hood during loss computation. Popular implementations accept labels aligned to input_ids and internally offset the prediction targets by one step when computing cross-entropy. You don’t pre-shift labels outside the model.
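
As a rough sketch of what a Hugging Face-style forward pass does when labels are supplied (the function name and shapes here are illustrative, not the exact library code):

import torch.nn.functional as F

def causal_lm_loss(logits, labels):
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len) aligned with input_ids
    shift_logits = logits[:, :-1, :]   # drop the last position: there is no next token to predict
    shift_labels = labels[:, 1:]       # drop the first label: no position predicts it
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,             # prompt (and padding) positions contribute nothing
    )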

Why the prompt is labeled with -100

Only the summary should contribute to the training objective. To achieve that, the prompt span is set to -100 in the labels vector. This value tells the loss function to ignore those positions. The model still attends to the prompt tokens as context, but they don’t add to the loss.
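
You can verify that -100 really removes positions from the loss with PyTorch's cross-entropy directly; the shapes and label values below are just illustrative:

import torch
import torch.nn.functional as F

logits = torch.randn(5, 32000)                        # 5 positions, toy vocabulary of 32,000
labels = torch.tensor([-100, -100, 32, 36645, 41778]) # first two positions masked

full_loss   = F.cross_entropy(logits, labels, ignore_index=-100)
target_only = F.cross_entropy(logits[2:], labels[2:])
print(torch.allclose(full_loss, target_only))  # True: masked positions add nothing to the loss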

During training there is no step-by-step generation: the model produces logits for every position of the concatenated sequence in a single forward pass, so the logits and the labels always cover the same positions. The internal shift then drops the last logit position and the first label position, and cross-entropy is computed only where the shifted label is not -100, i.e., over the summary part of the sequence.
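
Concretely, with a Hugging Face-style model and the batch inspected earlier (model here is assumed to be the loaded LlamaForCausalLM):

out = model(
    input_ids=mini["input_ids"],
    attention_mask=mini["attention_mask"],
    labels=mini["labels"],   # aligned with input_ids, prompt masked with -100
)
print(out.logits.shape)  # (batch_size, seq_len, vocab_size): one prediction per input position
print(out.loss)          # cross-entropy over the summary positions only, shift applied internally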

The correct way to construct labels

No manual shifting is needed. Keep the labels identical to the target tokens and mask the prompt with -100. The working pattern matches the dataset processing above; wrapped into a small helper function, it looks like this:

def build_example(example, tokenizer):
    """Turn one {"prompt", "summary"} record into CausalLM training features."""
    ctx = tokenizer.encode(tokenizer.bos_token + example["prompt"], add_special_tokens=False)
    resp = tokenizer.encode(example["summary"] + tokenizer.eos_token, add_special_tokens=False)
    return {
        "input_ids": ctx + resp,
        "attention_mask": [1] * (len(ctx) + len(resp)),
        "labels": [-100] * len(ctx) + resp,  # prompt masked, summary labels used as-is
    }

This is exactly the format CausalLM expects for supervised fine-tuning on tasks like summarization framed as next-token prediction over a concatenated prompt+target sequence.

Why this detail matters

Misunderstanding label alignment can lead to incorrect preprocessing and, consequently, incorrect loss signals. Knowing that CausalLM implementations shift internally saves you from double-shifting targets and from accidentally training the model to predict the wrong positions. The -100 mask on the prompt ensures the model learns from the summary tokens only, while still leveraging the prompt as context.
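
For contrast, here is what the double-shift mistake looks like with the toy ids from earlier; pre-shifting the labels yourself combines with the model's internal shift to misalign every target by one extra position:

input_ids   = [101, 7, 8, 9, 102]
good_labels = [-100, -100, 8, 9, 102]  # aligned with input_ids; the model shifts internally
bad_labels  = [-100, 8, 9, 102, -100]  # pre-shifted by hand; after the internal shift,
                                       # position 1 is scored against 9 instead of 8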

Takeaways

When fine-tuning a CausalLM on prompt-to-output data, concatenate prompt and output, label the prompt part with -100 to exclude it from loss, and use the target tokens as-is for the label segment. Trust the model code to apply the one-token shift during loss computation. This simple convention keeps preprocessing straightforward and aligns with how the loss is actually calculated.