2025, Oct 31 07:00

Keep Your Legal Q&A Chatbot Fresh with Corpus-Aware LLM Caching that Tracks Vector Store Changes

Learn how to prevent stale LLM responses in a legal Q&A chatbot with corpus-aware caching that reacts to vector store updates, using a simple document-count trigger.

Keeping LLM responses fresh in a legal Q&A chatbot is trickier than it looks. Caching speeds things up and lowers costs, but as soon as you add new documents to the vector store, previously cached answers can become stale. Users ask nearly identical questions and get outdated responses because the cache doesn’t reflect the expanded corpus. The challenge is to keep the cache aligned with the current knowledge base without tanking performance.

Problem, distilled

The typical pattern caches responses by prompt and model signature. That works until your knowledge base changes. If the cache is blind to corpus updates, it will happily return answers that were correct yesterday but incomplete today.

from langchain_core.caches import BaseCache, RETURN_VAL_TYPE
from typing import Any, Dict, Optional, Tuple


class NaiveResponseCache(BaseCache):
    """In-memory cache keyed only by the prompt and the LLM fingerprint."""

    def __init__(self):
        super().__init__()
        self._store: Dict[Tuple[str, str], Dict[str, Any]] = {}

    # Called by LangChain before invoking the LLM
    def lookup(self, prompt: str, llm_string: str) -> Optional[RETURN_VAL_TYPE]:
        entry = self._store.get((prompt, llm_string))
        if entry:
            return entry["value"]
        return None

    # Called by LangChain after a fresh LLM call completes
    def update(
        self,
        prompt: str,
        llm_string: str,
        return_val: RETURN_VAL_TYPE,
        meta: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._store[(prompt, llm_string)] = {
            "value": return_val,
            "meta": meta or {},
        }

    # Required by BaseCache; wipes every entry
    def clear(self, **kwargs: Any) -> None:
        self._store.clear()

This approach has no awareness of whether the underlying vector store grew from N to N+k documents. As a result, it will continue serving answers that ignore new information.

What actually goes wrong

The cache key typically includes the prompt and some representation of the LLM parameters. It does not include anything about the state of your vector index. When you ingest new files, the retrieval layer can now surface additional context, but the cache doesn’t know that. There’s no signal to invalidate or rebuild the stored entries, so subsequent requests for the same or similar prompts return responses produced from an older snapshot of the corpus. In legal scenarios, even a small omission can be unacceptable.
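To see the failure concretely, here is a small sketch using the naive cache above. The question, the LLM fingerprint string, and the cached answer are all illustrative placeholders, not real LangChain-generated values:

from langchain_core.outputs import Generation

naive = NaiveResponseCache()

# Answer cached while the corpus held only the older contract templates
naive.update(
    "What is the notice period for terminating the agreement?",
    "gpt-4|temperature=0",  # illustrative stand-in for LangChain's serialized LLM string
    [Generation(text="30 days, per the 2023 template.")],
)

# ... new documents are ingested into the vector store here ...

# The same question still hits the old entry; nothing signals that the corpus changed
stale = naive.lookup(
    "What is the notice period for terminating the agreement?",
    "gpt-4|temperature=0",
)
print(stale[0].text)  # still the pre-ingestion answer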

A pragmatic way to keep cache aligned with updates

One straightforward tactic is to make the cache “corpus-aware” by tracking the document count and regenerating cached entries when that count changes. This is a simple heuristic: if the number changes, the knowledge base has changed, so cached answers need refreshing. Below is a minimal implementation of that idea.

from langchain_core.caches import BaseCache, RETURN_VAL_TYPE
from langchain_core.outputs import Generation
from typing import Any, Dict, Optional, Tuple


class CorpusVersionCache(BaseCache):
    """Cache that refreshes its entries whenever the corpus document count changes."""

    def __init__(self, doc_total: int):
        super().__init__()
        self._entries: Dict[Tuple[str, str], Dict[str, Any]] = {}
        self._corpus_size = doc_total

    # Set or update the known corpus size; a change triggers a refresh
    def set_corpus_size(self, doc_total: int) -> None:
        if self._corpus_size == doc_total:
            return
        self._corpus_size = doc_total
        # Iterate over a snapshot of the keys so entries can be rewritten in place
        for query, llm_repr in list(self._entries.keys()):
            value, metadata = self.rebuild_entry(query, llm_repr)
            self.update(query, llm_repr, value, metadata)

    def rebuild_entry(self, query: str, llm_repr: str) -> Tuple[RETURN_VAL_TYPE, Dict[str, Any]]:
        # Placeholder: re-run retrieval and generation against the updated corpus here
        response = [Generation(text="New LLM Response")]
        metadata: Dict[str, Any] = {}
        return response, metadata

    # Cache lookup
    def lookup(self, query: str, llm_repr: str) -> Optional[RETURN_VAL_TYPE]:
        cached = self._entries.get((query, llm_repr))
        if cached:
            return cached["value"]
        return None

    # Cache update
    def update(
        self,
        query: str,
        llm_repr: str,
        return_val: RETURN_VAL_TYPE,
        metadata: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._entries[(query, llm_repr)] = {
            "value": return_val,
            "metadata": metadata or {},
        }

    # Required by BaseCache; wipes every entry
    def clear(self, **kwargs: Any) -> None:
        self._entries.clear()

To plug this into LangChain’s global cache hook, wire it up like this:

from langchain.globals import set_llm_cache

cache = CorpusVersionCache(doc_total=0)
set_llm_cache(cache)

With this pattern, whenever you add new files, you call set_corpus_size with the new document total. If the count has changed, the cache regenerates its stored responses so that future lookups reflect the latest context.
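As a sketch of that flow, where vector_store, new_docs, and count_documents are placeholders for whatever your ingestion pipeline actually exposes:

# Hypothetical ingestion step; swap in your own vector store API
vector_store.add_documents(new_docs)

# Query your backend for the new total (the exact call depends on the store you use)
current_total = count_documents(vector_store)

# A changed count triggers the cache refresh
cache.set_corpus_size(current_total)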

Performance trade-offs to consider

If you are concerned about the cost of regenerating cached entries, the cadence of updates becomes the deciding factor. When new information is added rarely, it can be more efficient to expire aggressively rather than pay the upfront cost to rebuild entries that may never be reused. If updates are frequent, scheduling the refresh as a separate process helps control load and keeps latency predictable for end users. Another lever is metadata: by attaching richer descriptors to entries and documents, you can rebuild only the subset of cached responses that map to the affected categories instead of refreshing everything.
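As a rough sketch of that last idea, assuming each entry was stored with a hypothetical "category" field in its metadata, a scoped refresh might only rebuild the entries tagged with the affected practice areas. In practice this would more naturally live as a method on the cache class:

from typing import Set


def refresh_categories(cache: CorpusVersionCache, affected: Set[str]) -> None:
    # Rebuild only the entries whose metadata tags them with an affected category
    for (query, llm_repr), entry in list(cache._entries.items()):
        if entry["metadata"].get("category") in affected:
            value, metadata = cache.rebuild_entry(query, llm_repr)
            cache.update(query, llm_repr, value, metadata)


# Example: new employment-law documents were ingested
refresh_categories(cache, {"employment"})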

Why this matters

For a legal assistant, correctness is not just nice to have. A cache that ignores newly ingested files will silently return answers that miss critical context. Aligning the cache with the vector store ensures users benefit from the most current information without forcing you to disable caching altogether.

Takeaways

Make your cache aware of corpus changes so it doesn’t serve outdated content. Track a simple signal such as document count and trigger a refresh when it changes. Balance regeneration cost and latency with your update frequency, and when possible, scope rebuilds to only the affected slice of cached entries. This keeps performance steady while ensuring the chatbot responds with answers that reflect the latest knowledge base.

The article is based on a question from StackOverflow by Quyền Phan Thanh and an answer by InsertCheesyLine.