2025, Sep 25 15:00
How to Fix Token Counting Mismatches for BGE-M3 in LlamaIndex Using the Hugging Face Tokenizer
Learn why LlamaIndex token counts differ from Hugging Face for BGE-M3, and align them with a tokenizer shim to fix batching, chunking, and cost estimates.
Aligning token counts between your tokenizer and an embedding pipeline can make or break batching, chunking, and cost estimation. If the numbers don’t match, you end up either truncating too aggressively or overstepping limits. A common pitfall arises when a library counts tokens with a different tokenizer than the model you actually embed with. Below is a concise walkthrough of why this happens with BGE-M3 in llama_index and how to get consistent counts without running the embedding step.
Reproducing the mismatch
The following snippet demonstrates a token count discrepancy between the llama_index token counter and the Hugging Face tokenizer for BGE-M3. It uses the local cache if present (./embeddings), or downloads the model otherwise.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from transformers import AutoTokenizer
import os
# Sample text
sample_text = "Random words. This is a test! A very exciting test, indeed."
# Chunk length
max_chunk = 512
# Create or load the embedding backend
def build_encoder(_limit=None):
    print("initializing embeddings...")
    if os.path.exists('./embeddings/models--BAAI--bge-m3'):
        cache_dir = f"./embeddings/models--BAAI--bge-m3/snapshots/{os.listdir('./embeddings/models--BAAI--bge-m3/snapshots')[0]}"
        encoder = HuggingFaceEmbedding(model_name=cache_dir, max_length=_limit)
    else:
        os.makedirs("./embeddings", exist_ok=True)
        repo_name = "BAAI/bge-m3"
        encoder = HuggingFaceEmbedding(
            model_name=repo_name,
            max_length=_limit,
            cache_folder='./embeddings'
        )
    print("embeddings ready")
    return encoder
# Initialize embedder
embedding_backend = build_encoder(_limit=max_chunk)
# Token counter with default configuration
count_hook = TokenCountingHandler()
hooks = CallbackManager([count_hook])
Settings.embed_model = embedding_backend
Settings.callback_manager = hooks
# Produce an embedding and count tokens via the callback
_ = Settings.embed_model.get_text_embedding(sample_text)
embed_tok_count = count_hook.total_embedding_token_count
# Count tokens via the HF tokenizer
model_ref = "BAAI/bge-m3"
hf_tok = AutoTokenizer.from_pretrained(model_ref)
encoded = hf_tok(sample_text)
hf_tok_count = len(encoded["input_ids"])
print(f"Original text: {sample_text}")
print(f"Embedding pipeline token count: {embed_tok_count}")
print(f"HF tokenizer token count: {hf_tok_count}")
With this setup, the two printed counts differ: the callback reports one number, the Hugging Face tokenizer another.
What’s actually happening
The core issue is that llama_index’s token counting uses a tokenizer that is not your model’s tokenizer unless you explicitly override it. The default is a tiktoken-based tokenizer. You can inspect what llama_index has configured at runtime like this:
from llama_index.core import Settings
print(Settings.tokenizer)
The output shows a tiktoken encoder:
functools.partial(<bound method Encoding.encode of <Encoding 'cl100k_base'>>, allowed_special='all')
Meanwhile, BGE-M3 ships a Hugging Face tokenizer. AutoTokenizer is just a factory: it reads the model's configuration and, for BGE-M3, resolves to an XLM-RoBERTa tokenizer (XLMRobertaTokenizerFast by default). That difference in tokenization rules is why the counts diverge.
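If you want to confirm what AutoTokenizer resolves to on your machine, a quick check of the class name is enough (a minimal sketch; it assumes the BGE-M3 files are cached locally or can be downloaded):
from transformers import AutoTokenizer
# AutoTokenizer reads the model's config and returns the concrete tokenizer class
resolved = AutoTokenizer.from_pretrained("BAAI/bge-m3")
print(type(resolved).__name__)  # expected: XLMRobertaTokenizerFast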
Solution: use the model’s tokenizer for counting
To get counts that match the model, wire the model's own tokenizer into llama_index's TokenCountingHandler. There is one subtlety: llama_index's TokenCounter expects the tokenizer to return a list of input_ids, while Hugging Face tokenizers return a BatchEncoding object. A tiny shim fixes that by returning just the input_ids.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from transformers import AutoTokenizer, XLMRobertaTokenizerFast
# Text and model
sample_text = "Random words. This is a test! A very exciting test, indeed."
max_chunk = 512
model_ref = "BAAI/bge-m3"
# Adapter so llama_index's counter gets a list of input_ids
class LlamaIndexTokenizerShim(XLMRobertaTokenizerFast):
    def __call__(self, *args, **kwargs):
        return super().__call__(*args, **kwargs).input_ids
li_tokenizer = LlamaIndexTokenizerShim.from_pretrained(model_ref)
# Initialize the HF embedder
embedder = HuggingFaceEmbedding(model_name=model_ref, max_length=max_chunk)
# Plug the correct tokenizer into the token counter
counter = TokenCountingHandler(tokenizer=li_tokenizer)
manager = CallbackManager([counter])
Settings.embed_model = embedder
Settings.callback_manager = manager
# Trigger counting via the embedding call; the counter now uses the model's tokenizer
_ = Settings.embed_model.get_text_embedding(sample_text)
li_count = counter.total_embedding_token_count
# Cross-check with HF's tokenizer directly
hf_tokenizer = AutoTokenizer.from_pretrained(model_ref)
ref_count = len(hf_tokenizer(sample_text).input_ids)
print(f"Original text: {sample_text}")
print(f"Embedding pipeline token count: {li_count}")
print(f"HF tokenizer token count: {ref_count}")
With this change, both counts align.
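Subclassing is not the only way to adapt the return type. Since the counter only needs a callable that maps a string to a list of token ids, a small lambda around the stock tokenizer works too; this sketch assumes nothing beyond what was said above about TokenCounter's expectations:
from transformers import AutoTokenizer
from llama_index.core.callbacks import TokenCountingHandler
hf_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
# Any callable returning a list of token ids satisfies the counter
counter_alt = TokenCountingHandler(tokenizer=lambda text: hf_tokenizer(text).input_ids)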
Why this matters
Consistent token counting is critical when you budget compute, slice documents into chunks, or enforce max_length constraints. If the counter and the model disagree, your batching logic becomes unreliable, which can lead to extra calls, unexpected truncation, or misreported usage metrics.
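As a concrete example, a pre-flight length check that uses the same tokenizer as the model catches oversized chunks before they are silently truncated. The sketch below reuses li_tokenizer from the solution snippet; chunk_texts and the limit value are illustrative placeholders, not part of the original code:
# Hypothetical pre-flight check: flag chunks that would exceed the configured max_length
chunk_texts = ["first chunk ...", "second chunk ..."]  # placeholder chunks
limit = 512  # the max_length configured for the embedder above
for i, chunk in enumerate(chunk_texts):
    n_tokens = len(li_tokenizer(chunk))  # the shim returns input_ids directly
    if n_tokens > limit:
        print(f"Chunk {i} has {n_tokens} tokens and would be truncated at {limit}.")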
Takeaways
Always make your tokenizer explicit when counting tokens for an embedding workflow. If your tooling defaults to a different tokenizer than the model’s, plug in the correct one and adapt the return type if necessary. Once both sides speak the same tokenization dialect, counts match and downstream logic remains predictable.
The article is based on a question from StackOverflow by ManBearPigeon and an answer by cronoik.