
RAG Pipelines for Product Data, Not Demo Data

May 14, 2026 · 15 min read · AI Systems

A production-minded view of retrieval-augmented generation: chunking, permissions, freshness, citations, evaluation, and operational failure modes.

RAG demos hide the hard parts

A demo usually has clean documents, public information, and friendly questions. The documents are well-formatted Markdown. The questions are phrased precisely. The answers are verifiable against the source material. The whole system looks elegant because the inputs were chosen to make it look elegant.

Real product data is different. It has tenant boundaries, private notes, stale records, duplicate content, partial updates, and language that changes from team to team. A support knowledge base has articles written by five different authors over three years, inconsistent terminology, outdated procedures, and internal shorthand that users type in free-form questions. A documentation corpus has versioned content, deprecated pages that still get traffic, and ambiguous sections where the intent depends on context a user may not provide.

Retrieval is not only a vector search problem. It is a product data problem. If the wrong source is retrieved, the model can produce an answer that sounds polished while being completely unsupported by the actual records. The architecture has to handle the messiness of the data before the model sees any of it.

The first architecture decision is what the assistant is allowed to know. If the application has tenants, roles, private records, or compliance boundaries, the retrieval layer must enforce those rules before the model sees any context. Permission enforcement is not a feature you add later — it is a foundation you build first.

The ingestion pipeline is where reliability is built

Most RAG reliability problems originate in ingestion, not retrieval or generation. Documents that are poorly parsed, inconsistently chunked, or not enriched with metadata will produce poor retrieval results regardless of how the embedding model or vector search is configured.

PDF parsing is a common first obstacle. Documents with complex layouts, tables, multi-column text, headers that are decorative rather than semantic, and embedded images lose structure during extraction. A policy document where a table is extracted as a flat list of cell values, or where a numbered procedure becomes an unordered paragraph, will produce retrieval results that look relevant by keyword but are useless as context.

HTML and structured content bring different challenges. Navigation elements, footer content, sidebars, and boilerplate text inflate the chunk and dilute relevance. A support article where half the retrieved content is navigation links and legal disclaimers is not useful context for a language model.
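
As a minimal sketch of the cleaning step, the following uses Python's standard-library `html.parser` to keep only text that sits outside boilerplate elements. The tag list and the function name `extract_main_text` are assumptions for illustration; real pages usually need site-specific rules on top of this.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an assumption; tune per site).
BOILERPLATE_TAGS = {"nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a boilerplate element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    """Return the non-boilerplate text fragments, one per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

This only handles boilerplate that is marked up with semantic tags. Pages that build navigation out of generic `div` elements would need class-based or template-based rules instead.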

The ingestion step should include cleaning, structure detection, and enrichment. Cleaning removes irrelevant boilerplate. Structure detection identifies sections, headings, code blocks, tables, and lists so they can be chunked with awareness of their boundaries. Enrichment adds the metadata that makes the chunk useful in retrieval: source, type, visibility, date, section hierarchy, and a stable link back to the original record.

This investment pays off during retrieval. A well-enriched chunk can be filtered, ranked, and presented to the model in a way that makes the answer more accurate and the citation more precise.

Chunking is product design

Chunking is not a mechanical preprocessing step. A policy document, a support thread, an API reference, a product pricing table, and a booking record all need different boundaries. The chunk should preserve enough context to answer a question without dragging half the database into the prompt.

Fixed-size chunking by token count is convenient but often wrong. A 512-token boundary that lands in the middle of step four of a procedure leaves each chunk with only part of the steps, so retrieval returns half the procedure. A boundary that falls inside a conditional ('if the customer is on the enterprise plan...') separates the condition from its consequence and produces incomplete context. The chunking strategy should understand the structure of the source material, not just its length.

Recursive and semantic chunking are more appropriate for most content. Recursive chunking tries to break at meaningful boundaries — paragraph, section, subsection — before falling back to size limits. Semantic chunking attempts to group sentences by topic similarity. Neither is perfect, but both produce better retrieval results than fixed-size approaches for structured content.
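
The recursive idea can be sketched in a few lines. This version measures size in characters rather than tokens to stay dependency-free, and the separator order (blank line, newline, sentence end) is an assumption; a real pipeline would likely split on detected headings first.

```python
def recursive_chunks(text, max_chars=800, separators=("\n\n", "\n", ". ")):
    """Split at the coarsest separator first, falling back to finer ones."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard split by size.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the same chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # A single piece is still too large: retry with finer separators.
            chunks.extend(recursive_chunks(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Adjacent small paragraphs are packed together until the limit is reached, so a short numbered procedure tends to stay in one chunk instead of being cut at an arbitrary offset.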

I like storing rich metadata with every chunk: tenant, source type, record id, visibility, document title, section heading, updated timestamp, and a stable URL back to the original record. That metadata makes citations, permissions, freshness checks, and debugging possible. A chunk that arrives in the prompt with no metadata is hard to attribute, hard to verify, and impossible to filter by permission.
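
One way to make that metadata hard to forget is to put it in the chunk type itself. The field names below mirror the list above; the class name `Chunk`, the `citation` helper, and the example values in the comments are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Chunk:
    """A retrievable unit of text plus the metadata needed to trust it."""

    text: str
    tenant_id: str
    source_type: str      # e.g. "support_article" (hypothetical value)
    record_id: str
    visibility: str       # e.g. "public" or "internal" (hypothetical values)
    document_title: str
    section_heading: str
    updated_at: datetime
    url: str              # stable link back to the original record

    def citation(self) -> str:
        """Human-readable source reference for the answer footer."""
        return f"{self.document_title} > {self.section_heading} ({self.url})"
```

Because every field is required, a chunk cannot be constructed without a tenant, a visibility value, and a link back to its source.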

Freshness matters as much as relevance. If a user updates a policy, price, workflow, or support note, the index should not continue serving yesterday's truth. A production RAG system needs an ingestion pipeline with change detection, an invalidation strategy for updated records, and monitoring for stale content. Embeddings are not a cache — they are a representation of a point-in-time state of the document, and that state changes.
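
A simple form of change detection is to store a content hash next to each embedding and compare it against the current record on every sync. The sketch below assumes the index can report a `record_id -> hash` mapping; the function names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a record's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed: dict, current: dict):
    """Decide which records need new embeddings and which should be removed.

    indexed: record_id -> hash stored at embedding time
    current: record_id -> latest text from the source system
    """
    to_embed = [
        rid for rid, text in current.items()
        if indexed.get(rid) != content_hash(text)  # new or changed
    ]
    to_delete = [rid for rid in indexed if rid not in current]  # removed upstream
    return sorted(to_embed), sorted(to_delete)
```

Unchanged records are skipped, which keeps re-embedding cost proportional to what actually changed rather than to the size of the corpus.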

Retrieval architecture beyond vector search

Vector search works well for semantic similarity — finding conceptually related content when the exact words do not match. But many real retrieval tasks benefit from a hybrid approach that combines vector search with keyword search, metadata filtering, and recency signals.

Keyword search handles the cases where users type exact product names, SKUs, error codes, procedure numbers, or technical terms. A user who types 'error E4031' should retrieve the document containing that exact string — vector similarity alone may not rank it first because the semantic neighborhood of a specific error code is sparse.

Metadata filtering reduces the retrieval space before semantic ranking. If the system knows the user is on the enterprise plan, is working in the billing section, and is asking about invoices, filtering to chunks tagged as billing-related and plan-relevant produces a smaller, more precise candidate set for reranking.

Reranking adds a second pass over the retrieved candidates using a cross-encoder model or a feature-rich scoring function that considers query-document relevance, recency, source authority, and usage signals. The first-pass retrieval is for recall — finding the candidates that could be relevant. The reranking step is for precision — finding the candidates that are actually relevant to this specific query.
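
A feature-based scoring function of this kind might look like the sketch below. The weights, the 180-day half-life, and the dictionary keys are made-up illustrative values, and `relevance` stands in for whatever a cross-encoder or first-pass scorer returns.

```python
from datetime import datetime, timezone


def rerank_score(relevance, updated_at, authority, now, half_life_days=180.0):
    """Blend query relevance with recency and source authority (all in 0..1)."""
    age_days = max((now - updated_at).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    # Illustrative weights only; tune them against an evaluation set.
    return 0.7 * relevance + 0.2 * recency + 0.1 * authority


def rerank(candidates, now, top_k=5):
    """Sort first-pass candidates by blended score and keep the best top_k."""
    ranked = sorted(
        candidates,
        key=lambda c: rerank_score(c["relevance"], c["updated_at"], c["authority"], now),
        reverse=True,
    )
    return ranked[:top_k]
```

With these weights a slightly less relevant but recently updated chunk can outrank a stale one, which is the trade-off the recency signal is meant to express.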

The combination of hybrid search, metadata filtering, and reranking is more complex to build than pure vector search, but it is what makes a RAG system reliable across the variety of questions real users ask.
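
One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The text above does not name a fusion method, so treat this as one possible choice rather than the approach described here; `k=60` is a commonly used default, and the document ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first.

    Each list contributes 1 / (k + rank) for every id it contains, so ids
    that rank well in more than one list rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_e4031", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only needs ranks, not comparable scores, which is convenient because keyword scores and vector similarities are on different scales.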

Permission enforcement is not optional

In a multi-tenant product, a RAG system that does not enforce permissions at the retrieval layer is a data leakage risk. A tenant administrator's private notes, a compliance document marked internal-only, an HR record accessible only to specific roles — these must not appear in the retrieval results for users who should not see them.

Permission enforcement should happen before the model receives context, not after. Filtering after generation is not sufficient: once a restricted document is in the prompt, the model may already have incorporated it into the response, and redaction cannot reliably remove paraphrased content. The retrieval query must be scoped by the requesting user's permissions as a prerequisite, not a postprocessing step.

This requires storing permission metadata with every chunk and evaluating that metadata against the requesting user's identity at retrieval time. For complex permission models — role-based access, attribute-based access, row-level permissions — this evaluation may involve a policy engine rather than a simple filter. The complexity is worth it because the alternative is a security incident.
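
For the simple role-based case, the check can be sketched as a visibility predicate applied before ranking. The types `User` and `IndexedChunk` and the convention that an empty `allowed_roles` set means tenant-wide visibility are assumptions for illustration. In a real system the same condition would normally be pushed into the search query itself rather than applied to results in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    tenant_id: str
    roles: frozenset


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    tenant_id: str
    allowed_roles: frozenset  # empty set: visible to everyone in the tenant
    score: float              # first-pass retrieval score


def is_visible(chunk: IndexedChunk, user: User) -> bool:
    """Tenant boundary first, then role overlap."""
    if chunk.tenant_id != user.tenant_id:
        return False
    return not chunk.allowed_roles or bool(chunk.allowed_roles & user.roles)


def retrieve_for_user(index, user: User, top_k=5):
    """Rank only the chunks this user is allowed to see."""
    visible = [c for c in index if is_visible(c, user)]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]
```

The important property is ordering: restricted chunks are excluded before ranking, so they can never occupy a slot in the context window.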

An additional consideration is citation transparency. When the system generates an answer and cites sources, the cited sources should all be accessible to the requesting user. A citation that links to a document the user cannot read is confusing at best and a permission leak at worst if the citation reveals the document's existence.

Evaluate with ugly questions

A useful evaluation set should look like real users: incomplete wording, typos, mixed languages, outdated assumptions, technical shorthand, requests for information the system should not answer, and questions that seem clear but are actually ambiguous. Friendly demo prompts are not enough.

I measure retrieval separately from generation. Did the system fetch the right sources for this query? Did it respect the permission boundary? If the right source was retrieved, did the answer use it correctly? If the right source did not make it into the context, at what rank did it appear in the candidate list, if at all, and what was retrieved instead? These are different failure modes with different fixes.
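
The retrieval side of that measurement can be sketched with two standard metrics, recall at k and mean reciprocal rank. The case format (one expected source id per query) is a simplifying assumption; real evaluation sets often have several acceptable sources per question.

```python
def rank_of(expected_id, retrieved_ids):
    """1-based rank of the expected source, or None if it was not returned."""
    try:
        return retrieved_ids.index(expected_id) + 1
    except ValueError:
        return None


def evaluate_retrieval(cases, k=5):
    """cases: list of {"expected": id, "retrieved": [ids in rank order]}."""
    if not cases:
        raise ValueError("evaluation set is empty")
    hits = 0
    reciprocal_ranks = []
    for case in cases:
        rank = rank_of(case["expected"], case["retrieved"])
        if rank is not None and rank <= k:
            hits += 1  # the expected source made it into the top k
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = len(cases)
    return {"recall_at_k": hits / n, "mrr": sum(reciprocal_ranks) / n}
```

Recall at k answers whether the right source reached the prompt at all, while the reciprocal rank shows how far down the list it sat when it did not.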

Generation quality is a separate axis. Did the answer accurately represent the retrieved context? Did it avoid claims that were not supported by the retrieved documents? Did it correctly identify when the retrieved context was insufficient to answer the question confidently? A model that confabulates an answer when the context is weak is more dangerous than a model that says 'I could not find a reliable answer to that in the available documentation.'

Evaluation should be continuous, not a one-time benchmark. As the document corpus changes, as user query patterns shift, and as the system is updated, the evaluation results will drift. A regression in retrieval precision after a chunking strategy change, or a generation quality drop after a prompt change, should be caught before it reaches production.
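
A regression gate for this loop can be as small as a comparison against stored baseline metrics. The metric names and the 0.02 tolerance below are hypothetical; the point is only that a drop beyond the tolerance should fail the change before it ships.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics that dropped by more than `tolerance` versus baseline.

    Both arguments map metric name -> value where higher is better.
    A metric missing from `current` is treated as 0.0 and will be flagged.
    """
    regressions = {}
    for name, old_value in baseline.items():
        new_value = current.get(name, 0.0)
        if new_value < old_value - tolerance:
            regressions[name] = (old_value, new_value)
    return regressions


baseline = {"recall_at_k": 0.90, "mrr": 0.70}
after_change = {"recall_at_k": 0.85, "mrr": 0.71}
failed = find_regressions(baseline, after_change)
```

In a CI job, a non-empty result would block the chunking or prompt change until someone looks at the affected queries.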

The model is only the final step. Reliability comes from ingestion quality, permission enforcement, retrieval architecture, reranking, generation guardrails, observability, and an evaluation loop that keeps surfacing uncomfortable cases. Teams that treat the model as the whole system will struggle with reliability. Teams that treat the model as the last step in a well-engineered pipeline will ship something that works.
