Chunking

I’ve seen teams spend weeks picking the perfect embedding model for their RAG system, only to get mediocre results because they never thought carefully about how they chunk their documents.

Chunking is the process of breaking a large document into smaller, retrievable pieces so your RAG pipeline can find the right context and feed it to the LLM at the right time. It sounds simple, but this one decision shapes what your retriever can find, how precise the answers are, and whether responses feel grounded or “almost right.”

There are many chunking strategies, but here are the five worth understanding:

1/ Fixed-length splits text every N tokens regardless of meaning. Fast and simple, but often breaks mid-sentence.

2/ Sentence-based treats each sentence as a chunk. Great for FAQs, but struggles when answers span multiple sentences.

3/ Semantic chunking groups sentences by meaning using embeddings. Powerful, but needs a heavier preprocessing pipeline.

4/ Hierarchical chunking follows document structure like headers and subsections. Works beautifully for technical docs, poorly for unstructured text.

5/ Adaptive (hybrid) chunking switches strategies based on content type, and this is where most production teams are heading. Higher complexity, but the best overall retrieval quality.

My recommendation if you’re building RAG today: start with sentence-based chunking for your prototype, observe where retrieval breaks down, then graduate to semantic or adaptive approaches based on what you see. Chunking is not something to over-engineer on day one, but it is what separates a demo from a production-grade system.

--

Is it true?

Short answer: Yes, the claim is largely true. In many real RAG systems, chunking quality has more impact on retrieval performance than the specific embedding model chosen. However, the explanation you saw simplifies some things and misses a few production nuances.

Below is the precise engineering perspective.


1. Why chunking matters more than people expect

In a typical RAG pipeline:

If chunking is wrong, the retriever never sees the correct context, regardless of how good the embedding model is.

Two common failure modes:

Problem 1 — context fragmentation

If a key idea spans multiple chunks:

A user asks:

Why did transformers remove recurrence?

The retriever may retrieve only Chunk B, which makes the answer incomplete.


Problem 2 — context dilution

If chunks are too large:

The chunk's embedding averages over too many ideas, so retrieval becomes fuzzy.


Optimal chunking balances:

This is why chunking heavily affects results.


2. The 5 strategies mentioned (with real-world corrections)

The list you saw is roughly accurate but simplified.

1. Fixed-length chunking

Example:

Advantages

  • simple

  • fast

  • consistent embedding sizes

Problems

  • breaks sentences

  • breaks sections

  • damages meaning

Production usage: rare alone

Usually used with overlap:
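A minimal sketch of fixed-length chunking with overlap, operating on a pre-tokenized document; the window and overlap sizes here are illustrative defaults, not recommendations:

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into fixed-size windows with overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = list(range(1200))  # stand-in for real token IDs
chunks = chunk_fixed(tokens, size=512, overlap=50)
```

The overlap means the last 50 tokens of each chunk reappear at the start of the next one, so an idea straddling a boundary survives in at least one chunk.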


2. Sentence chunking

Chunk = one or few sentences.

Good for:

  • FAQs

  • support docs

  • chat transcripts

Problems:

  • answers often span multiple sentences

  • context fragmentation

Often implemented as:
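A minimal version, using a naive regex sentence splitter and grouping a few sentences per chunk (real pipelines often use a proper NLP sentence tokenizer instead):

```python
import re

def chunk_sentences(text, max_sentences=3):
    """Naive sentence split, then group a few sentences per chunk."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    return [" ".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]

doc = ("Transformers removed recurrence. Attention lets every token see every other token. "
       "This enables parallel training. It also shortens gradient paths.")
chunks = chunk_sentences(doc, max_sentences=2)
```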


3. Semantic chunking

This approach groups sentences by embedding similarity.

Example pipeline:

Advantages:

  • keeps topics intact

  • much better retrieval relevance

Problems:

  • expensive preprocessing

  • more complex pipelines

Libraries often used:

  • LangChain semantic chunker

  • LlamaIndex semantic splitter
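The grouping step can be sketched with a stand-in bag-of-words "embedding"; production pipelines use a real sentence-embedding model, but the boundary logic is the same: start a new chunk whenever similarity to the previous sentence drops below a threshold:

```python
import math
from collections import Counter

def embed(sentence):
    # Stand-in embedding: bag-of-words counts. Real pipelines use a
    # sentence-embedding model (e.g. from sentence-transformers).
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever similarity to the previous sentence drops."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sents = ["attention computes token similarity",
         "attention weights token interactions",
         "our pricing starts at ten dollars"]
result = semantic_chunks(sents)
```

The threshold value is illustrative; in practice it is tuned per corpus.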


4. Hierarchical chunking

Follows document structure:

Chunks follow headers.

Works extremely well for:

  • technical documentation

  • books

  • research papers

  • legal documents

Fails when:

  • documents are messy

  • PDFs with no structure

  • scraped HTML
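For well-structured markdown, the idea can be sketched by splitting on headers and attaching the header path to each chunk; the path later doubles as metadata for retrieval:

```python
def hierarchical_chunks(markdown):
    """Split on markdown headers; each chunk keeps its header path."""
    chunks, path, body = [], [], []

    def flush():
        if body:
            chunks.append({"path": " > ".join(path),
                           "text": "\n".join(body).strip()})
            body.clear()

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]          # drop deeper/equal levels
            path.append(line.lstrip("# ").strip())
        else:
            body.append(line)
    flush()
    return chunks

doc = "# Guide\n## Install\npip install foo\n## Usage\nimport foo"
chunks = hierarchical_chunks(doc)
```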


5. Adaptive / hybrid chunking

This is indeed where production systems are going.

Example:

Or even:

This produces the best retrieval quality.


3. What the post missed (important)

A few key RAG engineering concepts weren't mentioned.


Chunk overlap

Critical for preventing lost context.

Example:

Without overlap:

A question referencing B + C may fail.

Overlap solves this.


Chunk metadata

Good RAG systems store metadata:

This enables filtered retrieval, which often matters more than the choice of embedding model.
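A sketch of metadata-filtered retrieval; the field names are illustrative, and a trivial keyword match stands in for vector search:

```python
# Hypothetical chunk records; field names are illustrative.
index = [
    {"text": "Refund policy: 30 days.", "source": "policies.pdf",
     "section": "Refunds", "updated": "2024-06-01", "department": "support"},
    {"text": "VPN setup steps for laptops.", "source": "it-guide.md",
     "section": "Network", "updated": "2023-11-12", "department": "it"},
]

def filtered_search(index, query_terms, **filters):
    """Apply metadata filters first, then match text (stand-in for vector search)."""
    candidates = [c for c in index
                  if all(c.get(k) == v for k, v in filters.items())]
    return [c for c in candidates
            if any(t in c["text"].lower() for t in query_terms)]

hits = filtered_search(index, ["refund"], department="support")
```

Filtering before similarity search shrinks the candidate pool, which is often where the real relevance gain comes from.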


Multi-granularity retrieval

Advanced systems index multiple chunk sizes:

This improves answer grounding.


Post-retrieval re-ranking

Even perfect chunking still needs re-ranking:

This dramatically improves relevance.


4. The real RAG rule (experienced teams learn this)

A rough impact ranking:

Teams often obsess over:

But their real problem is:


5. Typical production chunk sizes

Common ranges:

Use case          Chunk size
FAQs              100–300 tokens
Docs              300–600 tokens
Research papers   500–1000 tokens
Code              200–400 tokens

Most systems settle around:

with


6. The practical advice in the post

This recommendation is actually solid:

Start simple → observe failures → iterate chunking

Because RAG optimization is data-dependent.

Typical workflow:


Final verdict

Yes — the claim is mostly correct.

But the real takeaway is:

Chunking is not just preprocessing. It is part of the retrieval architecture.

In many RAG systems:


7 chunking architectures used in serious production RAG systems

These architectures go beyond simple splitting and are closer to what companies building large-scale GenAI systems actually deploy.

The key idea: chunking is not just splitting text — it is designing the retrieval unit of knowledge.


1. Sliding Window Chunking (Baseline Production Strategy)

Most production systems start here.

Idea

Split text into fixed token windows with overlap.

Example:

Document:

Why overlap matters

Questions often reference context across boundaries.

Without overlap:

Retriever may fetch only one chunk.

Overlap ensures both pieces appear together.

Used by

  • many LangChain/LlamaIndex default setups

  • early OpenAI RAG examples

Pros

  • very simple

  • fast preprocessing

Cons

  • ignores semantics

  • still breaks ideas


2. Sentence Window Retrieval

Instead of storing large chunks, systems store small units (sentences) and retrieve neighboring sentences at query time.

Example index:

If retrieval finds Sentence 3, the system returns:

Benefit

You get fine-grained retrieval + coherent context.

Used by

Often implemented in LlamaIndex sentence window retriever.

Why teams like it

  • precision of sentence embeddings

  • avoids context fragmentation
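The neighbor-expansion step is simple to sketch: index individual sentences, then return a window around whichever sentence retrieval matched:

```python
def sentence_window(sentences, hit_index, window=1):
    """Return the retrieved sentence plus its neighbors for coherent context."""
    lo = max(0, hit_index - window)
    hi = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[lo:hi])

sents = ["S1.", "S2.", "S3.", "S4.", "S5."]
context = sentence_window(sents, hit_index=2, window=1)  # retrieval matched S3
```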


3. Hierarchical Chunking (Document Tree)

Documents have natural structure.

Example:

Systems store multiple levels:

Query flow:

Advantages

  • preserves document logic

  • avoids mixing unrelated sections

Used by

Very common for:

  • documentation

  • books

  • research papers

  • legal documents


4. Semantic Boundary Chunking

Instead of splitting by size, chunks follow topic boundaries.

Pipeline:

Example result:

Benefits

Chunks represent complete ideas.

Retrieval becomes more meaningful.

Drawback

More expensive preprocessing.

Tools

Often implemented with:

  • LangChain semantic chunker

  • LlamaIndex semantic splitter


5. Parent–Child Chunking (Highly Effective)

One of the best-performing architectures.

Idea:

Index small chunks, but return larger parent context.

Example:

Retrieval works like:

Why this works

Small chunks improve embedding precision.

Large parent chunks give LLM enough context.

Used by

  • LangChain ParentDocumentRetriever

  • many enterprise RAG pipelines
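The parent-child mapping can be sketched with keyword overlap standing in for embedding search; the point is that scoring happens on the small child chunks while the returned context is the larger parent:

```python
# Index small child chunks for precision, but return the larger parent for context.
parents = {"p1": "Full section text about attention, including motivation and math."}
children = [
    {"id": "c1", "parent": "p1", "text": "attention motivation"},
    {"id": "c2", "parent": "p1", "text": "attention math"},
]

def retrieve_parent(query, children, parents):
    # Stand-in scoring: keyword overlap instead of embedding similarity.
    best = max(children,
               key=lambda c: len(set(query.split()) & set(c["text"].split())))
    return parents[best["parent"]]

context = retrieve_parent("why attention math works", children, parents)
```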


6. Multi-Vector Chunking

Instead of one embedding per chunk, systems store multiple embeddings.

Example chunk:

Embeddings stored:

Query matches any vector.

Benefit

Improves retrieval for:

  • vague queries

  • entity searches

  • conceptual questions

Used in

Some advanced retrieval pipelines and research systems.
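A sketch with illustrative representations (raw text, a summary, a hypothetical question), using keyword overlap as a stand-in for vector matching; whichever representation matches, the same underlying chunk is returned:

```python
# One chunk, several stored representations; any of them can match the query.
chunk = {"id": "c1",
         "text": "The 2017 Transformer paper replaced recurrence with self-attention."}
representations = [
    {"chunk": "c1", "kind": "raw", "text": chunk["text"]},
    {"chunk": "c1", "kind": "summary", "text": "Transformer removed recurrence"},
    {"chunk": "c1", "kind": "question", "text": "Why did transformers drop RNNs?"},
]

def match(query, representations):
    """Stand-in for multi-vector search: score every stored representation."""
    q = set(query.lower().split())
    scored = [(len(q & set(r["text"].lower().split())), r["chunk"])
              for r in representations]
    return max(scored)[1]

hit = match("why did transformers drop recurrence", representations)
```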


7. Adaptive / Content-Aware Chunking (Most Advanced)

Production systems increasingly detect document type and apply different chunking strategies.

Example pipeline:

This approach dramatically improves retrieval quality.

Why it matters

Different document types have different information structures.

One chunking strategy cannot fit all.


What the best RAG systems actually combine

Real-world architecture usually combines several of these:

Typical pipeline:


The biggest misconception about RAG

Most beginners optimize:

But real performance usually comes from:

Model choice often matters less than expected.


One rule many experienced GenAI engineers follow:

Retrieval quality improves more from better chunking than from upgrading the embedding model.


5 mistakes that cause 80% of RAG systems to hallucinate

Even when chunking is well designed, RAG systems still hallucinate frequently. In production systems, the root causes are usually architectural rather than model-related.

Below are 5 failure modes responsible for most hallucinations in RAG pipelines.


1. Retrieval Miss (The Most Common Failure)

The retriever simply does not fetch the correct chunk.

Pipeline failure:

Example:

User question

What year was the Transformer architecture introduced?

Correct chunk exists:

But retriever returns:

The LLM then guesses.

Result:

Why this happens

Common causes:

  • embedding mismatch

  • poor chunking

  • low top_k

  • weak vector similarity

Fix

Increase recall:

High-performing systems prioritize recall first, precision second.


2. Context Dilution

Too many irrelevant chunks are retrieved.

Example context sent to LLM:

The signal gets buried in noise.

LLM struggles to determine:

Fix

Use reranking models.

Pipeline:

Popular rerankers:

  • bge-reranker

  • Cohere rerank

  • ColBERT

Reranking often improves RAG accuracy more than changing the LLM.


3. Chunk Boundary Problems

Even with chunking strategies, answers can span multiple chunks.

Example:

Question:

Why did transformers remove recurrence?

Retriever might fetch only Chunk B.

The LLM sees incomplete information.

Result: partial or incorrect answer.

Fix

Use techniques like:

Chunk overlap

Or parent-child retrieval:

This dramatically reduces incomplete answers.


4. LLM Ignoring Retrieved Context

Even if retrieval is perfect, the LLM may ignore the provided context.

Example prompt:

LLM response:

Even if OAuth is not mentioned in the document.

Why?

LLMs tend to default to pretraining knowledge when uncertain.

Fix

Prompt constraints.

Example system instruction:

Some teams also add citation requirements.


5. Missing Knowledge (The Silent Killer)

Sometimes the knowledge simply doesn't exist in the dataset.

Example question:

But the documents only contain:

The LLM tries to fill the gap.

Result:

This is a true hallucination.

Fix

Add retrieval confidence checks.

Example logic:

Or:

Production RAG systems must support abstention.
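A minimal abstention check, with an illustrative similarity threshold:

```python
def answer_or_abstain(scores, threshold=0.55):
    """Abstain when the best retrieval score is too low to trust."""
    if not scores or max(scores) < threshold:
        return "I don't have enough information to answer that."
    return "ANSWER"  # placeholder for the normal generation path

resp = answer_or_abstain([0.31, 0.28, 0.12])
```

The threshold is corpus- and model-specific; teams usually calibrate it against a labeled set of answerable and unanswerable questions.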


A useful mental model

Think of RAG as three stages:

Most hallucinations originate in stage 1 or 2, not stage 3.

Failure distribution often looks like:

This is why improving retrieval architecture often produces larger gains than upgrading the model.


What strong RAG systems usually include

A mature pipeline typically looks like:

This architecture dramatically reduces hallucinations.


The key insight:

RAG hallucinations usually mean:


how companies like Perplexity, OpenAI, and Anthropic structure their RAG pipelines internally

Production systems from companies like Perplexity AI, OpenAI, and Anthropic use much more layered retrieval pipelines than the typical “LangChain demo RAG”.

The biggest difference: retrieval is treated like a search engine architecture, not a simple vector lookup.

Below is a simplified view of how modern RAG systems are structured.


1. Multi-Stage Retrieval (Recall → Precision)

Most tutorials show:

Production systems instead maximize recall first.

Pipeline:

Why?

Vector search sometimes misses exact matches like:

Keyword search catches those.

This hybrid approach is called:

Typical stack:
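One common way to merge vector and keyword result lists is reciprocal rank fusion (RRF); a minimal sketch over ranked document IDs:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists from vector and keyword search (RRF)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]
keyword_hits = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents ranked well by both retrievers float to the top; `k=60` is the commonly used damping constant.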


2. Query Rewriting Layer

Before retrieval happens, systems often rewrite the query.

Example:

User query:

Rewritten retrieval query:

Why rewrite?

User queries are often too conversational for good retrieval.

Many systems generate multiple search queries.

Example:

Each query runs retrieval separately.

Results are merged.

This dramatically improves recall.
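The multi-query pattern can be sketched with a toy rewrite step standing in for the LLM; each variant is retrieved separately and the results are merged with de-duplication:

```python
def expand_query(query):
    """Stand-in for an LLM rewrite step: emit several retrieval-friendly variants."""
    return [query,
            query.replace("why", "reason").strip(),
            f"{query} explanation"]

def multi_query_retrieve(query, search_fn):
    """Run retrieval per variant and merge, de-duplicating while keeping order."""
    seen, merged = set(), []
    for q in expand_query(query):
        for doc in search_fn(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Toy search function for illustration only.
fake_search = lambda q: ["doc-recurrence"] if "reason" in q else ["doc-attention"]
docs = multi_query_retrieve("why no recurrence", fake_search)
```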


3. Document Expansion

Many companies expand documents during indexing.

Example original chunk:

Expanded metadata:

Multiple embeddings are stored.

This is called:

It increases match probability.


4. Reranking Stage

Vector search alone is not precise enough.

Production pipelines almost always use reranking models.

Pipeline:

Unlike embeddings, rerankers read:

This allows deeper relevance scoring.

Common models used in production:

Reranking often improves answer quality more than upgrading the LLM.
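The reranking stage can be sketched with keyword overlap standing in for the joint score; real systems plug in a cross-encoder such as bge-reranker at the `score_fn` position:

```python
def rerank(query, chunks, score_fn, top_n=3):
    """Score (query, chunk) pairs jointly, then keep the best few."""
    scored = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]

# Stand-in scorer: word overlap. A production score_fn would be a cross-encoder.
overlap = lambda q, c: len(set(q.lower().split()) & set(c.lower().split()))

chunks = ["attention replaced recurrence",
          "our office is in Berlin",
          "recurrence processes tokens sequentially"]
top = rerank("why did attention replace recurrence", chunks, overlap, top_n=2)
```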


5. Context Compression

Even the best retrieval might return too much text.

Systems compress context before sending it to the LLM.

Example:

Retrieved chunk:

Compressed to:

This allows the system to include more sources in context.

Some pipelines use LLMs themselves for compression.


6. Citation Grounding

Production systems require answers to cite retrieved documents.

Example output:

The system may verify:

If not, the answer may be rejected or regenerated.

This reduces hallucinations significantly.


7. Confidence / Abstention Layer

One major difference between demos and real systems:

Production systems can refuse to answer.

Logic example:

Or:

This prevents fabricated answers.


Typical modern RAG pipeline

A realistic architecture looks closer to this:


Why most tutorials fail

Typical tutorials implement only:

Which ignores:

These missing pieces explain why many RAG demos feel “almost correct” but unreliable.


A useful rule in RAG engineering

Performance gains usually come from improving:

not:


Real GenAI engineering lesson:

RAG is closer to search engine design than to prompt engineering.


Architecture behind Perplexity’s “answer engine”

Perplexity AI built what it calls an “Answer Engine.” Architecturally, it behaves closer to a real-time search system + RAG pipeline than a traditional chatbot.

Most LLM chat systems follow:

Perplexity instead runs a multi-layer retrieval and synthesis system before generation.

Below is a simplified but accurate view of the architecture.


1. Real-Time Web Retrieval

Unlike many RAG systems that rely only on pre-indexed internal documents, Perplexity performs live web retrieval.

Pipeline:

Typical sources include:

  • news sites

  • documentation

  • blogs

  • academic pages

This step behaves like a search engine crawler-on-demand.

Key goal: fresh information.


2. Content Extraction & Cleaning

Raw web pages contain noise:

So Perplexity performs content extraction.

Typical process:

Libraries often used in this type of pipeline include:

The result becomes clean text documents.


3. Chunking & Indexing

The extracted content is then split into chunks.

Typical pattern:

Each chunk gets:

These are stored in a temporary retrieval index for the current query session.

Unlike enterprise RAG, Perplexity often builds ephemeral indexes per query.


4. Hybrid Retrieval

Perplexity combines two retrieval methods:

Why hybrid?

Vector search struggles with:

Keyword search catches those.

Combined approach:


5. Query Expansion

Before retrieval, the system may generate multiple search queries.

Example:

User asks:

System expands to:

Each query retrieves additional sources.

This step dramatically improves recall.


6. Reranking

Once 50–100 chunks are retrieved, Perplexity uses reranking models.

Reranking models evaluate:

instead of independent embeddings.

Example scoring:

Top chunks are selected.

Typical final context:


7. Answer Synthesis

Only after retrieval does the LLM generate the answer.

Prompt structure looks roughly like:

The model synthesizes information across documents.

Example output style:

The numbered citations map to source URLs.


8. Source Attribution

Every sentence can reference the source.

Example:

The UI then links:

This is critical for user trust.


9. Follow-Up Question Loop

When users ask follow-ups, Perplexity reuses conversation context.

Example:

System constructs a new search query:

Retrieval runs again.


10. Continuous Retrieval During Generation

Some advanced systems (including Perplexity research modes) can retrieve additional sources mid-generation.

Conceptually:

This resembles agentic search.


Why this architecture works well

Compared to simple RAG:

Perplexity adds:

These layers dramatically improve reliability.


Real insight

The architecture resembles:

In other words:

Search engine → knowledge retriever → LLM synthesizer.


Key takeaway

The strongest GenAI systems today are not just LLMs.

They are LLM + search infrastructure.


How OpenAI’s internal RAG systems differ from Perplexity’s

Enterprise RAG systems used by companies building on platforms from OpenAI or Anthropic are optimized for internal knowledge retrieval, not open-web search like Perplexity AI.

The architecture therefore prioritizes document governance, reliability, and security over crawling the internet.

Below is a typical enterprise-grade RAG architecture.


1. Controlled Document Ingestion

Unlike web search systems, enterprise RAG starts with known internal sources.

Typical inputs:

Ingestion pipeline:

Key challenge: document quality.

Most enterprise data contains:

Strong systems include document preprocessing pipelines.


2. Metadata-Rich Indexing

Enterprise retrieval rarely relies on embeddings alone.

Chunks are indexed with metadata such as:

Example indexed record:

This allows filtered retrieval.

Example query:

This dramatically improves relevance.


3. Access Control Layer

Enterprise systems must enforce security boundaries.

Example:

Retrieval must respect permissions.

Pipeline:

Without this layer, RAG becomes a data leakage risk.


4. Hybrid Retrieval

Just like search engines, enterprise RAG usually combines:

Keyword search handles:

Vector search handles:

The results are merged and reranked.


5. Reranking Layer

Production pipelines rarely trust raw vector similarity.

Instead:

Rerankers read:

This provides far better relevance scoring.


6. Context Construction

Enterprise RAG systems often build structured context blocks.

Example:

Structured prompts help the LLM understand:

Some systems also add document summaries to each chunk.


7. Grounded Generation

The LLM is instructed to generate answers strictly from retrieved context.

Typical system instruction:

Answers typically include citations:


8. Answer Verification

Advanced enterprise pipelines include post-generation checks.

Example validation:

Some systems run a second LLM to verify grounding.

Pipeline:


9. Continuous Evaluation

Enterprise teams monitor RAG systems using evaluation pipelines.

Common metrics:

Evaluation datasets are usually built from real user questions.


Typical enterprise RAG architecture


Key difference vs web-based RAG

System                     Main Goal
Perplexity-style systems   search the internet
Enterprise RAG             search internal knowledge

Because of this, enterprise RAG emphasizes:

while web systems emphasize:


Important engineering insight

In enterprise GenAI systems, most engineering effort goes into:

not into the LLM itself.

This is why many GenAI teams eventually realize:


Agentic RAG

Agentic RAG is the next evolution of retrieval systems. Instead of retrieving once and generating an answer, the LLM actively controls the retrieval process like a researcher.

Many modern systems inspired by work from OpenAI, Anthropic, and Perplexity AI are moving toward this architecture.

The core difference:

vs.

The LLM behaves like a decision-making agent, not just a text generator.


1. Planning Stage

The system first decides how to solve the query.

Example question:

The agent might plan:

This step breaks complex queries into retrievable subproblems.


2. Tool Selection

The agent chooses which retrieval tool to use.

Possible tools:

Example decision:

Each tool is called separately.


3. Iterative Retrieval

Instead of retrieving once, the system performs multiple retrieval cycles.

Example loop:

Example reasoning trace:

This resembles human research behavior.


4. Evidence Aggregation

The system gathers evidence from multiple sources.

Example evidence pool:

These are stored in an agent memory state.

Example internal state:


5. Reflection / Self-Checking

Advanced systems run reflection loops.

Example:

If yes → trigger additional retrieval.

Reflection prompt example:

This significantly reduces hallucinations.


6. Final Answer Synthesis

Once enough evidence is collected, the agent generates the answer.

Example prompt:

Output:


Agentic RAG architecture

A simplified architecture looks like this:


Why this architecture matters

Traditional RAG struggles with:

Example multi-hop query:

This requires multiple steps:

Agentic RAG can perform this type of multi-step retrieval.


Technologies enabling Agentic RAG

Common frameworks used:

These frameworks allow LLMs to control tool execution loops.


Challenges with Agentic RAG

Despite its power, it introduces complexity.

Main challenges:

1. Latency

Multiple retrieval loops increase response time.


2. Cost

More LLM calls are required.

Example:

Each may be a separate model invocation.


3. Control

Agents may:

Production systems add step limits and budget constraints.


Typical Agentic RAG loop

This loop continues until a confidence threshold is reached.
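The loop with a step budget can be sketched as follows; the retrieval and sufficiency functions here are toys standing in for real tools and a real confidence check:

```python
def agentic_loop(question, retrieve, is_sufficient, max_steps=4):
    """Retrieve-reflect loop with a hard step budget to prevent runaway agents."""
    evidence = []
    for step in range(max_steps):
        evidence += retrieve(question, step)
        if is_sufficient(evidence):
            break
    return evidence

# Toy tools for illustration only.
retrieve = lambda q, step: [f"fact-{step}"]
is_sufficient = lambda ev: len(ev) >= 2  # stand-in for a confidence check
evidence = agentic_loop("multi-hop question", retrieve, is_sufficient)
```

The `max_steps` budget is the key production safeguard: without it, a drifting agent can loop indefinitely.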


The future direction

Many GenAI researchers believe the next stage of AI systems will look like:

Agentic RAG is essentially the bridge between LLMs and autonomous knowledge systems.


Key takeaway

Traditional RAG treats retrieval as a single step.

Agentic RAG treats retrieval as a reasoning process.


why most Agentic RAG systems in startups still fail in production

Agentic RAG looks powerful in architecture diagrams, but many startups discover that the production reality is far more difficult. Systems that work in demos often break when exposed to real workloads.

Below are the five most common reasons Agentic RAG fails in production systems.


1. Latency Explosion

Agentic systems require multiple LLM calls per query.

Typical execution:

This can mean:

Latency example:

System            Typical latency
Traditional RAG   1–3 seconds
Agentic RAG       8–25 seconds

Users rarely tolerate long waits unless the system provides high research value.

This is why systems like Perplexity AI run agentic loops mainly in research modes, not standard search.


2. Cost Explosion

Every agent step is another model invocation.

Example cost per query:

Even with cheap models, costs multiply quickly.

Example scenario:

This becomes expensive at scale.

Many companies end up reverting to simpler pipelines.


3. Uncontrolled Tool Usage

Agents sometimes call tools unnecessarily.

Example reasoning loop:

Even though the first retrieval might already contain the answer.

This causes:

Production systems add constraints like:

Without limits, agents may over-search.


4. Reasoning Drift

Agents can gradually drift away from the original query.

Example user question:

Agent reasoning might evolve into:

Eventually retrieving information unrelated to the question.

This happens because LLM reasoning chains sometimes lose focus over multiple steps.

Fixes include:


5. Evaluation Difficulty

Agentic systems are hard to evaluate.

Traditional RAG evaluation:

Agentic RAG involves:

Failures can occur in many stages.

Example failure sources:

This makes debugging extremely difficult.


Why many companies scale back agentic systems

Startups often begin with:

But eventually move to a simplified architecture.

A common production compromise:

This delivers:

while still achieving strong accuracy.


When Agentic RAG actually works well

Agentic architectures make sense when:

Use case                       Why
Research assistants            multi-source reasoning
Data analysis                  tool orchestration
Software debugging             multi-step investigation
Scientific literature review   multi-hop retrieval

This is why research-focused tools like Perplexity AI and enterprise platforms using models from OpenAI or Anthropic sometimes expose specialized “deep research” modes.


Practical rule used by experienced GenAI teams

A useful heuristic:

Agents should be reserved for problems requiring:


Key takeaway

Agentic RAG is powerful, but most production systems still rely on optimized non-agentic retrieval pipelines.

Agents are useful when the problem truly requires iterative reasoning, not when simple retrieval would suffice.


Graph RAG

Graph-based RAG (often called GraphRAG) addresses a weakness of traditional vector retrieval: relationships between pieces of knowledge are lost when text is chunked into independent vectors.

Systems developed by organizations like Microsoft (their well-known GraphRAG research project) and explored internally at companies building with models from OpenAI and Anthropic use knowledge graphs to represent connections between entities, documents, and concepts.

The goal is to enable multi-hop reasoning across structured knowledge rather than retrieving isolated text chunks.


1. Why Traditional Vector RAG Struggles

Vector retrieval treats chunks as independent semantic units.

Example document fragments:

If a user asks:

Traditional RAG must retrieve:

But it doesn't inherently understand that:

This requires multi-hop reasoning.

Vector search alone is not designed for this.


2. Graph Representation of Knowledge

GraphRAG converts documents into a knowledge graph.

Structure:

Graph format:

This structure allows traversal across related facts.
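The triple store and a one-hop traversal can be sketched in plain Python; the entity and relation names are illustrative:

```python
# Triples as (subject, relation, object); names are illustrative.
triples = [
    ("Transformer", "introduced_in", "2017"),
    ("Transformer", "replaces", "RNN"),
    ("BERT", "based_on", "Transformer"),
]

def neighbors(entity, triples):
    """One traversal step over the knowledge graph, in both directions."""
    out = [(r, o) for s, r, o in triples if s == entity]
    inc = [(r, s) for s, r, o in triples if o == entity]
    return out + inc

# Multi-hop: start at BERT, hop to Transformer, then read its facts.
hop1 = neighbors("BERT", triples)
hop2 = neighbors(hop1[0][1], triples)
```

Production systems use a graph database rather than a list of tuples, but the traversal logic is the same idea.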


3. Graph Construction Pipeline

Building the graph typically involves LLM-assisted extraction.

Pipeline:

Example extraction step:

Input text:

Extracted:

These become graph edges.


4. Community Detection (Graph Clustering)

Large graphs contain thousands or millions of nodes.

GraphRAG systems often detect clusters of related entities.

Example clusters:

Each cluster can be summarized into topic-level knowledge.

This enables retrieval at multiple levels.


5. Hierarchical Summaries

GraphRAG systems often generate summaries for:

Example hierarchy:

These summaries become retrieval targets.


6. Query Processing

When a query arrives, the system identifies relevant nodes and graph regions.

Example query:

Retrieval flow:

Result:

The LLM then synthesizes the answer.


7. Multi-Hop Reasoning

Graph traversal naturally supports multi-hop queries.

Example question:

Possible graph traversal:

This type of reasoning is difficult with vector search alone.


8. Hybrid Graph + Vector Retrieval

Modern systems combine both approaches.

Typical pipeline:

Vector search finds relevant chunks.

Graph traversal finds relationships.

Combined context produces stronger answers.


Typical GraphRAG architecture


Where GraphRAG works best

Graph-based retrieval shines when information is relationship-heavy.

Examples:

Domain                 Why graph works well
Scientific research    citations and concept relationships
Enterprise knowledge   people, teams, projects
Finance                company ownership and investments
Healthcare             diseases, drugs, treatments

These domains contain structured knowledge networks.


Limitations of GraphRAG

Despite its power, it has challenges.

1. Graph construction cost

Entity extraction and relationship extraction require LLM processing over large corpora.


2. Graph maintenance

When documents change, graph updates are needed.


3. Complexity

Engineering complexity is significantly higher than simple vector RAG.

Many teams adopt hybrid systems rather than pure GraphRAG.


The emerging retrieval stack

The most advanced retrieval architectures combine several ideas:

This produces systems capable of answering complex knowledge questions, not just retrieving passages.


Key insight

Vector RAG retrieves text similarity.

GraphRAG retrieves knowledge relationships.

Combining both enables systems that can answer multi-hop, reasoning-heavy questions far more reliably.


Memory Centric AI Systems / GSM

Memory-centric AI systems represent the next architectural step beyond RAG and GraphRAG. Instead of retrieving information only from static documents, the system maintains persistent memory that evolves over time.

Research directions explored by organizations such as OpenAI, Anthropic, and Google DeepMind increasingly treat AI systems as stateful knowledge systems, not stateless prompt processors.

The key shift:


1. Why RAG Alone Is Not Enough

Traditional RAG systems treat each query independently.

Example:

Even if the user said earlier:

Without memory, that information disappears.

RAG retrieves documents, not conversation-derived knowledge.


2. Memory Layers in Advanced AI Systems

Modern architectures typically separate memory into multiple layers.

Short-Term Memory

Temporary working memory for the current task.

Example:

Often stored in the prompt or ephemeral context store.


Long-Term Memory

Persistent storage of facts learned over time.

Example entries:

Stored in:


Episodic Memory

Records events or interactions.

Example:

This allows systems to remember past experiences, not just facts.


3. Memory Formation

Memory-centric systems decide what information is worth storing.

Pipeline:

Example decision rule:

LLMs themselves can be used to judge importance.
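A toy importance rule for illustration; the markers below are invented, and real systems typically delegate this judgment to an LLM call instead:

```python
def should_store(fact):
    """Stand-in importance rule; real systems often ask an LLM to judge."""
    durable_markers = ("prefers", "deadline", "always", "policy")
    return any(m in fact.lower() for m in durable_markers)

memory = []
for fact in ["User prefers concise answers",
             "User said hello",
             "Project deadline is March 1"]:
    if should_store(fact):
        memory.append(fact)
```

Small talk is discarded; durable preferences and commitments are kept.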


4. Memory Retrieval

When a new query arrives, the system retrieves relevant memories.

Example query:

Memory retrieval might return:

The system can tailor the response accordingly.


5. Memory Update Loop

Unlike static RAG, memory-centric systems continuously update knowledge.

Example cycle:

This creates learning over time.


6. Memory + Retrieval Architecture

A modern architecture may combine several components:

This creates a stateful AI assistant.


7. Example Architecture

A simplified memory-centric system might look like:

This loop runs continuously.


8. Technologies Used for Memory Systems

Memory stores can be implemented using:

Common infrastructure includes:


9. Benefits of Memory-Centric Systems

Compared with traditional RAG:

Capability                  Traditional RAG   Memory-centric AI
Personalization             limited           strong
Learning from interaction   none              continuous
Multi-session knowledge     weak              persistent
Adaptive behavior           minimal           strong

This enables true AI assistants rather than stateless chatbots.


10. Major Challenges

Memory systems introduce new difficulties.

Memory noise

Systems may store irrelevant or incorrect facts.


Memory conflicts

Different interactions may produce contradictory memories.

Example:

The system must resolve conflicts.


Privacy

Storing long-term user information raises security concerns.

Enterprise systems require strict governance.


Where the field is going

Many researchers believe future AI systems will integrate:

Instead of treating AI as a stateless text generator, these systems behave more like persistent knowledge agents.


Key takeaway

RAG gives models access to external knowledge.

Memory-centric AI gives systems the ability to learn and remember over time.

This combination is likely to define the next generation of intelligent assistants.

