RAG isn't one size fits all: Here's how to tune it for your use case

💡

TL;DR:

Start by building a rapid eval loop — skipping this means you can’t see what’s working
Fix layers in order: data → chunking → embeddings/retrieval → generation.
Hybrid retrieval + tuned top-k usually wins.
Measure with answer-level metrics, Correct-Call Rate, and latency/cost/drift.
Using Kiln × LanceDB gives you a fast local loop and a clean path to promote configs to Cloud.

Introduction

Retrieval-Augmented Generation (RAG) systems are deceptively complex. On paper, they’re just pipelines: chunk some data, embed it, retrieve the best matches, and generate an answer. In practice, though, every part interacts. Changing chunk size might alter retrieval behavior; switching embedding models might make recall better or worse; a slightly different extractor model or prompt might silently add noise into the data you index.

A lot of knobs, all interconnected. Manual tuning to iterate over every possible combination quickly reaches a dead end, or worse, leads to regressions that are hard to notice until users complain. You can spend days “improving” a component only to find accuracy dropping in other cases.

That’s why the single biggest unlock in RAG optimization isn’t a single knob tweak, but setting yourself up to iterate fast and safely. Before you chase specific optimizations, you need a tight feedback loop: the ability to try many configurations quickly, evaluate them systematically, and compare results across runs with confidence. Without that, every tweak is guesswork.

Kiln AI and LanceDB have been cooking some new projects to make this process easy and fast. In one interface you can create Kiln RAG evals, compare hybrid vs. vector retrieval from LanceDB, and iterate critical RAG parameters like chunking strategy, embedding models, and document extraction methods. Let’s walk through the process of optimizing your RAG system.

Setup: Enable rapid evaluation

The first step to taking a RAG system from “good” to “great” is to make experimentation easy and fast.

You should be able to quickly compare RAG configurations (different chunk size, retriever, or prompt), run them over your evaluation set, and see which performs best, both quantitatively (accuracy, recall, latency) and qualitatively (examples where it fails).

Once you can experiment and evaluate quickly, you’ll not only find better configs faster, but also stay adaptive: when your data evolves, when new embedding models appear, or when product requirements shift.

Once you have that, the rest becomes mechanical — an ordered process of improving each layer: data, chunking, embeddings, retrieval, and generation.

Kiln makes it easy to create RAG evals in just a few minutes. Our synthetic data generator creates Q&A-style eval datasets, automatically, derived from your documents, in an interactive UI. Once complete, comparing RAG configuration options is fast and easy.

Part 1: Optimize layer by layer

Once you can evaluate changes quickly, optimization stops being guesswork and becomes a scientific process. Don’t tweak everything at once. Go layer by layer instead, fixing the highest-leverage parts first. Each layer depends on the previous one being solid.

Improve your document extraction

The old adage of “garbage in, garbage out” definitely applies to RAG. An LLM can’t compensate for poor source material or messy extraction. If documents are inconsistent, noisy, or unstructured – no chunking or embedding tricks will save you.

Data cleanup used to be a slow, manual, and tedious task. Today, most document extraction pipelines have transitioned from library-based approaches like Tesseract to vision-language models (VLLMs) such as Gemini and Qwen3-VL, enabling data cleaning and formatting through prompt-based automation.

Try adding these methods to your document extraction prompts to improve your RAG dataset:

Clean & normalize input. Strip headers, footers, boilerplate, and irrelevant metadata.
Use extraction models intentionally. Don’t index raw PDFs/HTML; use layout-aware extractors that you can steer with prompts.
Standardize the output format if the use case requires it. Enforce consistent fields and boundaries (one record per line, one section per paragraph, etc.).

Example: Receipts

Prefer model-prompted extraction that produces structured text. Retrieval “understands” this far better:

code

Order ID: 5821 
Date: 2024-10-14 
Vendor: Bean & Bread Café 
Location: 123456 Adelaide St W, Toronto 
Items: 
 - Americano (Large): 1 @ $5.00 
 - Turkey Sandwich: 2 @ $9.50 
Subtotal: $24.00 
Tax (HST 13%): $3.12 
Total: $27.12 
Payment Method: Visa

Contrast with an unstructured dump where signal is buried in noise and broken layout:

code

Bean  
& . Bread  
Cafe  "Fuel Your Morning!" 
123456 adelaide st w toronto 
order5821  oct14 2024 
americano lg 5.00 
turkey sand x2 9.50ea 
hst13% 3.12  total 27.12 
visa 
tip not included… 
follow us @example

The issue isn’t missing information but structure and signal-to-noise. Slogans, tip notes, and social handles get embedded alongside totals, diluting retrieval quality. If you’re answering accounting queries (totals over $25, visa payments or coffee orders), clutter makes your results brittle. Worse, an irrelevant slogan like Fuel Your Morning! can result in hits for a valid query fuel receipts.

Quick sanity check before moving on

Query the extracted corpus for a few deterministic fields:

Payment Method: Visa
Tax (HST 13%)
Turkey Sandwich

In Kiln, you can pick from many different vision-language models for extraction and design your own prompts to steer how data is parsed before it’s embedded and stored in your local LanceDB. Getting clean, well-structured text at this stage makes every downstream step (chunking, embedding, and retrieval) far easier to tune.

Optimize chunking

Chunking defines the boundaries of the data that will be embedded. Each chunk should encapsulate a coherent idea so it can be retrieved independently. During retrieval, these chunks are surfaced and fed into the model’s prompt for answer synthesis.

Longer chunks preserve more context but risk embedding dilution. The vector must represent too many distinct concepts, making it less focused and harder to retrieve for specific queries.

For example, imagine a chunk spanning 15 articles of law, one of which covers civil responsibility in boat accidents. Although the LLM would gain rich context if that chunk were retrieved, the embedding itself may struggle to surface for a query about boats, since that topic represents only a small fraction of its content.

Shorter chunks, by contrast, are easier to retrieve because their embeddings capture a single, well-defined idea, but they can lose essential context.

For instance, chunking each clause of a contract separately might help match the right clause, yet that clause alone may lack the broader section context that defines what it applies to. This trade-off can even flip depending on the corpus: if you only have one contract, short chunks might be fine; but if you index hundreds of contracts across different topics, overly granular chunks can cause cross-document confusion when similar-sounding clauses appear in unrelated contexts.

There’s no universal best value, only trade-offs you can measure.

Consider:

Longer chunks work better when reasoning across context matters (e.g., articles, technical papers).
Shorter chunks shine when sections are self-contained (e.g., FAQs, invoices).
Overlap preserves continuity but increases index size and retrieval latency.
Semantic chunking: splitting at natural topic or section boundaries, usually beats naïve token counts, but still needs testing.

Examples

A novel may need ~800-token chunks with overlap for narrative continuity.
A research paper often performs best around 300 tokens, semantically split such that each chunk covers one semantic concept.
An invoice/receipt may need no chunking at all; one structured record per file is plenty.

💡

Guiding principle:

Treat chunking as an empirical choice, not a belief system. Pick a few candidate strategies, run evaluations, and compare results. Iteration will reveal the sweet spot faster than theory ever will.

This experiment only takes a few clicks in Kiln: swap chunkers (semantic vs. fixed-size, overlap on/off), index into local LanceDB, and run your evals!

Embedding, indexing & retrieval

This is where performance differences can be dramatic. The same RAG pipeline can swing from mediocre to excellent simply by changing the embedding model or retrieval strategy. However, if the previous steps were bad, there’s very little you can do now!

Once data and chunking are stable, this layer determines how effectively your system surfaces the right context. It’s the heart of the “R” in RAG, precision and recall in action.

Key levers:

Embedding models: Test a few. Make sure the model supports the languages used in your corpus and queries. Recency may also matter since models trained before newer terms, slang, or entity names appeared may not represent them well. Latency and costs may also be important criteria.
Embedding size: Some models come in multiple sizes, and others support Matryoshka dimensionality reduction, which lets you truncate embeddings. In many cases, a lower dimensionality vector may yield near-identical quality at a fraction of the cost and latency. Always measure quality and efficiency side-by-side.
Top-k: Defines how many candidate chunks will be fed into the model context for answer synthesis. Larger k improves recall but increases token cost and sometimes hurts precision by increasing the size of the context.
Hybrid search: Combine vector retrieval with keyword-based search (usually an algorithm called BM25). BM25 catches literal string matches, such as entity names, IDs, acronyms, while vectors capture semantics. This blend reliably boosts both factual recall and contextual relevance.

Consider a query searching for Dunkin' Donuts. BM25 would immediately match documents mentioning that exact name, while a pure embedding search might also surface results about other coffee shops or bakeries in general and competitors such as Tim Hortons or Starbucks, since it treats Dunkin' Donuts as conceptually similar to other restaurants. In this case, BM25 captures the precise intent better than the embedding search.

In Kiln you can toggle hybrid / vector-only / BM25 to compare different LanceDB query configurations and keep whichever wins your evals.

Index type: Begin with simple K-NN (k nearest neighbour) for experimentation. Once your retrieval quality stabilizes, optimize for scale with approximate nearest neighbour methods like HNSW, IVF-PQ, etc. Don’t tune index parameters before you trust the rest of your pipeline – this is a cost and performance optimization you should do last.
Query reformulation: Let the model rephrase underspecified or ambiguous queries before retrieval, either through a tool call or an internal expansion step. This often unlocks major gains in recall for free-form, human queries. You may even let the model run multiple searches with different queries.
Tool naming & descriptions: It’s surprisingly important to nail the RAG tool name and description. Your agents need to call the right RAG tools at the right moment, and they won’t know when to use an ambiguously named tool like search(). Names/descriptions like search_company_invoices() clearly signal intent and improves tool-use reliability. Kiln evals will evaluate the agent end to end, including choosing the right search tool for the task.

💡

Guiding principle:

Treat retrieval as a system-level tuning problem, not just swapping out the embedding model and hoping for the best. Measure each change, embeddings, k-values, retrieval modes, in the same evaluation loop. Small configuration shifts here often drive the largest quality jumps in the entire RAG stack.

What not to grid-search (at least not yet)

Many teams burn time brute-forcing hyperparameters that don’t matter early on. Resist that urge. Until your pipeline is stable, most “optimizations” are noise.

Common traps:

Premature ANN tuning. Don’t spend hours adjusting HNSW, IVF-PQ or quantization parameters before your data, chunking, and embeddings are solid. You’ll just be optimizing noise.
Threshold obsession. Retrieval thresholds only make sense once your evaluation set reflects real user queries. Otherwise, you’re tuning to synthetic patterns.
Latency micro-gains. Shaving milliseconds off retrieval or generation doesn’t help if accuracy is still moving. Optimize correctness first, performance later.

💡

Guiding principle:

Early in RAG development, focus on correctness. Once your quality stabilizes and metrics are trustworthy, then move down the stack. Tune indexes, thresholds, and latency trade-offs with confidence that they reflect real improvement.

Part 2: How to measure & iterate

You can’t optimize what you can’t measure.

If you don’t have a way to tell whether a change is an improvement or a regression, you’re just tuning by vibes, and vibes are unreliable in multi-component systems like RAGs.

The goal of evaluation isn’t to produce a perfect metric, but to make iteration safe and intentional. You want confidence that when a metric moves, it reflects real quality differences that matter to users and is not worsening some other dimension of the problem.

There are three main categories of evaluations to use for RAG systems:

RAG accuracy: Answer-level evaluation

The most direct way to measure progress is to ask: Did the RAG give the right answer?

That means building or generating a Q&A evaluation set and running your pipeline against them. A Q&A set is a collection of question–answer pairs we know to be correct.

Kiln allows you to create evaluation datasets in just a few minutes by running synthetic Q&A generation over your documents. Once created, it’s one-click to run LLM-as-judge evals, and see the results per RAG configuration. The whole process is backed by the same LanceDB dataset for repeatable comparisons.

Approaches:

Labeled ground truth. For production domains (support, internal docs, etc.), collect real user queries with human-validated answers.
Synthetic evals. For faster iteration, use an LLM to generate questions and reference answers from your corpus, then have another model judge correctness.

Metrics to track:

Correctness. A quantitative measure of how closely your model’s answers align with the reference answer. Traditional methods like exact match, string matching, and F1 scoring are increasingly being replaced by LLM-as-judge–based semantic similarity evaluations.
Hallucination rate. The share of answers that are fluent but unsupported by retrieved context.
Context recall. How often the gold-answer context actually appears in the retrieved documents.

If you only build one RAG evaluation, build an end-to-end correctness eval. Issues with hallucination and context will surface there, giving you a north-star metric to tune for.

You can start small. Dozens or hundreds of examples, but make them diverse.

The key is consistency: always evaluate with the same setup so comparisons are meaningful. When you run into an edge case on production that does not quite give the result you hoped for, add it to your evaluation dataset, tune and evaluate again. If you change your eval, always be sure to backfill previous eval results.

Correct-call evaluation: Is RAG called only when needed?

Not every query needs retrieval. Some answers are already within the model’s base knowledge (e.g., “What is 2 + 2?”), while others absolutely depend on external context (e.g., “What’s in the 2023 HR handbook?”).

If your system decides dynamically when to call retrieval, you should measure how often it makes the right call. In Kiln, every RAG configuration is exposed as a callable tool, and we provide a dedicated eval template for checking that tools are called at appropriate times.

Metric: Correct-Call Rate = how often the model chose the correct path (retrieval vs. direct answer).

Why it matters: A system that retrieves unnecessarily, wastes latency and tokens. A system that skips retrieval when it’s needed, risks hallucinating or confidently returning “I don’t know.”

Common failure modes:

Over-retrieval: The model calls the retriever even for simple, factual queries, increasing cost without improving accuracy.
Under-retrieval: The model tries to answer from memory when it shouldn’t, producing unsupported or incomplete responses.

How to measure it: label a subset of your evaluation queries as “RAG needed” or “not needed.” Then compute how often your system made the correct decision. Even a small labeled set here pays off disproportionately. Once you can trust when RAG triggers, every downstream evaluation (accuracy, latency, cost) becomes far more meaningful.

Latency, cost, and drift

Once quality stabilizes, you can start tracking operational metrics too:

Latency: median and p95 latency, including retrieval + generation.
Cost: embeddings, storage, and per-query token usage.
Drift: periodic re-runs of your eval set on fresh data to catch regressions caused by new documents or model updates.

These don’t replace quality metrics; they ensure you don’t trade off usability or economics for marginal gains.

Conclusions & how to get started

Once you have a fast evaluation loop, you can evolve faster than anyone manually tuning prompts. You’ll stop fearing changes, because you can measure them. You’ll know which layer actually limits performance. And when new models, retrievers, or architectures appear, you’ll be able to test them in hours instead of weeks.

That’s what turns a RAG pipeline from “just working” into a great one. A system that improves itself through evidence, not intuition.

Ready to get started with RAG evals and optimization?

Download Kiln for free from Github to build RAG evals and create your iteration loop.
When you’re ready to promote your RAG configuration to production, use our provided loader to seamlessly promote the entire dataset and configuration to LanceDB Cloud for production-grade scaling.

Social media

Table of Contents

RAG isn't one size fits all: Here's how to tune it for your use case

Introduction

Setup: Enable rapid evaluation