Understand AI via One Diagram: Build Handbook Chat with RAG
Problem: Employees ask about leave, overtime, travel, benefits…
Goal: Answer from real documents with citations, avoid hallucinations, and say "out of scope" when uncertain.
Solution: RAG (Retrieval-Augmented Generation) — Let LLM read documents before answering.
Short summary
- Core idea: Don't let LLM guess from memory → Force it to read relevant documents first (RAG)
- Pipeline: Ingest → Chunk → Embedding → Vector DB → Retrieve → LLM Synthesis + Citation
- Success criteria: Answers are context-correct, transparent (citations), and honest (out-of-scope when missing evidence)
One Diagram — Everything You Need to Know
This diagram shows how documents flow from raw files (left) through ingestion and indexing, then how user queries (right) retrieve relevant chunks and generate cited answers.
Deep Dive: Step-by-Step Analysis
Step ① - ②: Documents → Ingest & Clean
Challenge: Documents come in multiple formats (PDF, Word, Markdown) with:
- Tables, images, headers/footers
- Special characters, encoding issues
- Version chaos (which is latest?)
Process:
- Parse each file type to plain text
- Extract critical metadata
- Clean: remove noise, preserve structure
Output: Clean document with metadata
Document {
id: "hr-policy-2025"
title: "HR Policy Handbook"
text: "## 4.3 Marriage Leave\n\nEmployees are entitled to 3 days..."
metadata: {
version_date: "2025-01-01"
owner: "HR Department"
url: "https://intranet/hr/policy-2025"
confidentiality: "Internal"
}
}
Key insight: Metadata is critical! You need to know which document, which section, and which page in order to enable citations later.
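A minimal sketch of this ingest step in Python, assuming plain Markdown/text inputs; PDF and Word files would need extra parsers (e.g., pypdf, python-docx). The file name and metadata values below are illustrative:

from dataclasses import dataclass, field
from pathlib import Path
import re

@dataclass
class Document:
    id: str
    title: str
    text: str
    metadata: dict = field(default_factory=dict)

def ingest_markdown(path: str, title: str, metadata: dict) -> Document:
    """Read a Markdown/plain-text file, normalize whitespace, keep headings intact."""
    raw = Path(path).read_text(encoding="utf-8")
    text = re.sub(r"\n{3,}", "\n\n", raw)                              # collapse runs of blank lines
    text = "\n".join(line.rstrip() for line in text.splitlines())      # strip trailing spaces
    return Document(id=Path(path).stem, title=title, text=text, metadata=metadata)

# Hypothetical usage:
# doc = ingest_markdown("hr-policy-2025.md", "HR Policy Handbook",
#                       {"version_date": "2025-01-01", "owner": "HR Department",
#                        "url": "https://intranet/hr/policy-2025", "confidentiality": "Internal"})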
Step ③: Chunk (The Most Important Step!)
Why is chunking necessary?
- LLM has limited context window (can't fit entire handbook)
- Embedding works best with short passages (semantic coherence)
- Retrieval precision: easier to find specific info in small chunks
Smart Chunking Strategy:
Original document:
┌─────────────────────────────────────┐
│ ## 4.3 Marriage Leave │
│ │
│ Employees are entitled to 3 days │
│ of paid leave for marriage... │
│ │
│ Requests must be submitted 5 │
│ business days in advance to the │
│ HR department... │
│ │
│ ## 4.4 Bereavement Leave │
│ In case of immediate family... │
└─────────────────────────────────────┘
After chunking:
┌─────────────────────────────────────┐
│ Chunk 1: │
│ ## 4.3 Marriage Leave │
│ Employees are entitled to 3 days │
│ of paid leave for marriage... │
│ Requests must be submitted 5 │
│ business days in advance... │
└─────────────────────────────────────┘
↓ overlap ~100 tokens
┌─────────────────────────────────────┐
│ Chunk 2: │
│ ## 4.3 Marriage Leave (continued) │
│ ...business days in advance to the │
│ HR department... │
│ ## 4.4 Bereavement Leave │
│ In case of immediate family... │
└─────────────────────────────────────┘
Configuration sweet spot:
- Target size: 400-800 tokens (enough context, not too long)
- Overlap: ~100 tokens (prevent context loss at boundaries)
- Preserve structure: Always include section heading in each chunk
Key insight: Bad chunking = bad retrieval = bad answers. This is where 80% of RAG quality is determined!
Test your chunking: Randomly read 10 chunks. Is each chunk self-contained & understandable?
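A rough sketch of that strategy: heading-aware splits with a word count standing in for a real tokenizer, ~600-word targets and ~100-word overlap. Swap in tiktoken or your model's tokenizer for accurate token counts:

import re

def chunk_section(heading: str, body: str, target_tokens: int = 600, overlap: int = 100) -> list[str]:
    """Split one section into ~target_tokens-word chunks that overlap by ~overlap words.
    Every chunk is prefixed with its section heading so it stays self-contained."""
    words = body.split()
    chunks, start = [], 0
    while start < len(words):
        piece = words[start:start + target_tokens]
        chunks.append(heading + "\n" + " ".join(piece))
        if start + target_tokens >= len(words):
            break
        start += target_tokens - overlap   # step back by `overlap` words at each boundary
    return chunks

def chunk_document(text: str, **kwargs) -> list[str]:
    """Split a Markdown document on '## ' headings, then chunk each section."""
    out = []
    for sec in re.split(r"\n(?=## )", text):
        lines = sec.strip().splitlines()
        if not lines:
            continue
        heading, body = lines[0], " ".join(lines[1:])
        out.extend(chunk_section(heading, body, **kwargs))
    return out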
Step ④ - ⑤: Embeddings → Vector Database
The Semantic Search Magic
Embeddings = Text → Vector (array of numbers)
Visualization (simplified to 2D, actually 768-1536 dimensions):
"marriage leave" ●
"wedding leave" ● ← Close together! (Similar meaning)
"annual leave" ● ← Further away
"parking policy" ● ← Very far
How it works:
- Text → Vector: Each chunk is embedded into a vector (e.g., 1536 numbers)
- Similar meaning → Close vectors (measured by cosine similarity)
- Vector DB: Store millions of vectors, fast k-NN search in milliseconds
Example:
Query: "How many days of wedding leave?"
↓ Embed query
Query vector: [0.12, 0.85, 0.03, ...]
↓ Search Vector DB (find nearest neighbors)
Top matches:
1. Chunk #47: "marriage leave... 3 days..." (similarity: 0.94) ✅
2. Chunk #52: "wedding ceremony... submit request..." (similarity: 0.87) ✅
3. Chunk #12: "annual leave... 12 days..." (similarity: 0.35) ❌
Why not use keyword search?
| Approach | Query: "wedding leave" | Result |
|---|---|---|
| Keyword search | Exact match "wedding" | ❌ Misses "marriage leave" |
| Semantic search | Meaning similarity | ✅ Finds both "wedding" and "marriage" |
Key insight: Embeddings capture meaning, not just words. This is why RAG works cross-language, with synonyms, and paraphrases.
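A small sketch of this nearest-neighbor search with plain numpy cosine similarity; the vectors here are random placeholders, whereas real ones would come from an embedding model such as text-embedding-3-small:

import numpy as np

def cosine_top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5):
    """Return (chunk_index, similarity) pairs for the k chunks closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = m @ q                        # cosine similarity of every chunk vs. the query
    top = np.argsort(-sims)[:k]         # indices of the k highest scores
    return [(int(i), float(sims[i])) for i in top]

# Toy example with random 1536-dim vectors standing in for real embeddings:
rng = np.random.default_rng(0)
chunk_vectors = rng.normal(size=(1000, 1536))
query_vector = rng.normal(size=1536)
print(cosine_top_k(query_vector, chunk_vectors, k=3))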
Step ⓪ - ⑥: User Question → Retriever
The Retrieval Process
User Question: "Can I take OT pay as time off?"
↓
Embed question
↓
Search Vector DB
↓
Find k=5-8 most similar chunks
↓
Optional: Hybrid Search (Vector + Keyword)
↓
Optional: Rerank top candidates
↓
Return top 5 chunks with metadata
Key decisions:
1. How many chunks to retrieve (k)?
- k=2: Risky, may miss important context
- k=5-8: Sweet spot — enough coverage, not too noisy
- k=20: Too much noise, expensive, confuses the LLM
2. Similarity threshold?
- Only retrieve chunks with score > 0.25
- If top chunk < 0.25 → Probably "out of scope"
3. Hybrid search?
Combine vector search (semantic) + BM25 (keyword)
Vector search finds: "marriage leave policy" (semantic match)
BM25 finds: "Form-HR-003" (exact code match)
Combined: Best of both worlds!
Use case: Queries containing codes, IDs, or form numbers benefit from hybrid search.
4. Reranking?
Initial retrieval: Over-retrieve top 20 (fast, 95% accuracy)
↓
Cross-encoder rerank: Re-score with better model (slower, 99% accuracy)
↓
Final top 5: Highest precision
Trade-off: Reranking adds ~100ms latency but improves accuracy. Worth it for production!
Key insight: Retrieval quality determines the ceiling of the entire system. Good retrieval = LLM has right context to work from.
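Putting those decisions together, a hedged sketch of a retriever that applies k, the 0.25 threshold, over-retrieval, and an optional reranker. Here embed, vector_search, and rerank are placeholder callables for your embedding model, vector DB client, and cross-encoder:

def retrieve(query: str, embed, vector_search, rerank=None,
             k: int = 5, overfetch: int = 20, min_score: float = 0.25):
    """Return up to k chunks above min_score, plus a flag when evidence looks too weak.
    `embed`, `vector_search`, and `rerank` are injected callables (stand-ins for real clients)."""
    qvec = embed(query)
    candidates = vector_search(qvec, k=overfetch)            # [(chunk, score), ...], best first
    candidates = [(c, s) for c, s in candidates if s >= min_score]
    if rerank is not None and candidates:
        candidates = rerank(query, candidates)                # re-score with a cross-encoder
    top = candidates[:k]
    out_of_scope = len(top) < 2                               # too little evidence to answer
    return top, out_of_scope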
Step ⑦: LLM Generation (The Synthesis Step)
LLM's Role in RAG
❌ NOT: Memorize company policies (impossible + outdated quickly)
✅ YES: Read retrieved chunks → Synthesize clear answer → Cite sources
Critical: Guardrailed Prompts
System Prompt Structure:
┌────────────────────────────────────────────┐
│ ROLE: Company policy assistant │
│ │
│ RULES: │
│ 1. ONLY use info from [CONTEXT] │
│ 2. If insufficient evidence → out_of_scope │
│ 3. Always cite: [Title §Section p.X] │
│ 4. Explain reasoning first │
│ │
│ OUTPUT FORMAT: JSON │
│ { │
│ "reasoning": "...", │
│ "answer": "...", │
│ "citations": [...], │
│ "is_out_of_scope": false, │
│ "confidence": 0.95 │
│ } │
│ │
│ FEW-SHOT EXAMPLES: [3-5 examples] │
└────────────────────────────────────────────┘
User Message Structure:
Question: How many days of wedding leave do I get?
[CONTEXT]
--- Chunk 1 ---
Source: HR Policy §4.3 Marriage Leave (p.12)
URL: https://intranet/hr/policy#4.3
Content:
Employees are entitled to 3 days of paid leave for marriage.
Requests must be submitted 5 business days in advance.
--- Chunk 2 ---
Source: HR Policy §4.3.1 Documentation (p.12)
URL: https://intranet/hr/policy#4.3.1
Content:
Marriage certificate copy must be provided within 30 days.
[/CONTEXT]
Now answer following the rules.
Key Parameters for Factual Answers:
| Parameter | Value | Why |
|---|---|---|
| temperature | 0.1-0.3 | Low randomness → consistent, factual |
| max_tokens | 300-512 | Cap length → control cost |
| response_format | JSON | Structured, parseable output |
Key insight: Without guardrails, LLMs will hallucinate (confidently generate wrong info). Guardrails force LLM to be honest & grounded.
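A sketch of this generation call using the OpenAI Python SDK as one concrete example; the model name and prompt wording are illustrative, and any chat API with a JSON response mode works the same way:

import json
from openai import OpenAI  # assumes the openai SDK is installed and OPENAI_API_KEY is set

client = OpenAI()

SYSTEM_PROMPT = """You are a company policy assistant.
Rules:
1. ONLY use information from [CONTEXT].
2. If the evidence is insufficient, set is_out_of_scope to true.
3. Always cite sources as [Title §Section p.X].
4. Explain your reasoning first.
Respond as JSON with keys: reasoning, answer, citations, is_out_of_scope, confidence."""

def answer(question: str, context_blocks: list[str]) -> dict:
    context = "\n".join(context_blocks)
    user_msg = f"Question: {question}\n[CONTEXT]\n{context}\n[/CONTEXT]\nNow answer following the rules."
    resp = client.chat.completions.create(
        model="gpt-4-turbo",                      # illustrative; use whichever model you deploy
        temperature=0.1,                          # low randomness for factual answers
        max_tokens=512,                           # cap output length / cost
        response_format={"type": "json_object"},  # force parseable JSON
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": user_msg}],
    )
    return json.loads(resp.choices[0].message.content)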
Step ⑧: JSON Output (Transparency & Trust)
Structured Response Example:
{
"reasoning": "The context clearly states in HR Policy §4.3 that employees receive 3 days of marriage leave with 5 business days advance notice requirement.",
"answer": "You are entitled to 3 days of paid leave for marriage. You must submit your request 5 business days in advance to the HR department and provide a copy of your marriage certificate within 30 days.",
"citations": [
{
"title": "HR Policy Handbook",
"section": "4.3 Marriage Leave",
"page": 12,
"url": "https://intranet/hr/policy#4.3",
"excerpt": "Employees are entitled to 3 days of paid leave..."
}
],
"is_out_of_scope": false,
"confidence": 0.95
}
When to return "out of scope":
- Retrieved chunks < 2 (insufficient evidence)
- Top chunk similarity < 0.25 (low relevance)
- Chunks contradict each other (ambiguous)
- LLM detects uncertainty in reasoning
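A minimal sketch of that gating logic; the thresholds mirror the ones above, and the contradiction check is left out because it usually needs the LLM's own judgment:

def should_return_out_of_scope(chunks, llm_output=None, min_chunks=2, min_top_score=0.25) -> bool:
    """chunks: [(text, similarity_score), ...] sorted best-first."""
    if len(chunks) < min_chunks:                       # insufficient evidence
        return True
    if chunks[0][1] < min_top_score:                   # low relevance
        return True
    if llm_output is not None and llm_output.get("is_out_of_scope"):
        return True                                    # the model itself flagged uncertainty
    return False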
Example - Out of Scope:
{
"reasoning": "The provided context only covers remote work within Vietnam. International remote work policies are not addressed in the retrieved documents.",
"answer": "Out of scope",
"citations": [],
"is_out_of_scope": true,
"suggestions": [
"Contact HR directly at hr@company.com",
"Check the International Assignment Policy if available"
],
"confidence": 0.0
}
Key insight: Honesty > Accuracy. Better to say "I don't know" than to hallucinate wrong answers. Users trust systems that admit limitations.
The Big Picture: Why RAG Works
Without RAG (LLM Alone)
User: "How many days of marriage leave?"
↓
LLM (relying on training data memory):
↓
"I believe most companies offer 3-7 days..." ❌ HALLUCINATION
Problems:
- Training data: Outdated (trained months/years ago)
- Parametric knowledge: General, not your company
- No citations: Can't verify
- Confidently wrong: Sounds authoritative but incorrect
With RAG
User: "How many days of marriage leave?"
↓
Retrieve: [HR Policy §4.3: "3 days"]
↓
LLM reads context → Synthesizes answer
↓
"According to HR Policy §4.3, you have 3 days. [Citation]" ✅
Benefits:
- ✅ Always current (just re-index updated documents)
- ✅ Company-specific (your actual policies)
- ✅ Verifiable (citations point to source)
- ✅ Honest (says "out of scope" when unsure)
Analogy
| Scenario | LLM Alone | RAG |
|---|---|---|
| Exam type | Closed-book (memory only) | Open-book (reference materials) |
| Accuracy | 60-70% (guessing) | 85-95% (grounded) |
| Trust | Low (no sources) | High (show your work) |
| Maintenance | Retrain model ($$$$) | Re-index docs ($) |
Key insight: RAG = "External memory" for LLMs. Instead of memorizing everything (impossible), give LLM ability to look up information on-demand.
Common Concepts Explained via Diagram
1. AI vs AGI
AI (Narrow): Systems good at specific tasks → This diagram = AI for Q&A
AGI (General): Good at everything → Does NOT exist yet
For Handbook Chat: We only need narrow AI — answer from documents, nothing more.
2. Training vs RAG
Training: Bake knowledge into model weights (expensive, static)
RAG: Store knowledge outside model (cheap, dynamic)
When policy changes:
Training approach:
Update document → Retrain model ($$$) → Deploy
RAG approach:
Update document → Re-index (minutes, $) → Done
Rule of thumb: Prompt → RAG → Fine-tune (if really needed)
3. Inference
Inference = Running the model to generate output (Step ⑦ in the diagram)
Cost factors:
- Input tokens (retrieved context): ~1500 tokens × $0.01/1K = $0.015
- Output tokens (answer): ~300 tokens × $0.03/1K = $0.009
- Total per query: ~$0.024
At scale: 1000 queries/day × $0.024 = $24/day ≈ $720/month
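The same arithmetic as a tiny sketch, with the prices above as illustrative defaults; plug in your provider's actual rates:

def monthly_cost(queries_per_day: int,
                 input_tokens: int = 1500, output_tokens: int = 300,
                 in_price_per_1k: float = 0.01, out_price_per_1k: float = 0.03,
                 days: int = 30) -> float:
    """Estimated monthly LLM spend for the RAG query path."""
    per_query = (input_tokens / 1000 * in_price_per_1k
                 + output_tokens / 1000 * out_price_per_1k)
    return queries_per_day * per_query * days

print(monthly_cost(1000))  # ~720.0 USD/month at these example rates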
Optimization strategies:
- Caching: Same question → cached answer (40% cost reduction)
- Truncate context: Only send top 5 chunks, not top 20
- Streaming: Better UX, same cost
- Rate limiting: Prevent abuse
4. Embeddings Dimensionality
Why 768-1536 dimensions?
- Higher dimensions = capture more nuanced meanings
- But: More storage, slower search
- Sweet spot: 1536 (OpenAI text-embedding-3-small)
Trade-offs:
| Dimensions | 384 | 768 | 1536 | 3072 |
|---|---|---|---|---|
| Accuracy | 85% | 91% | 95% | 97% |
| Storage | 1x | 2x | 4x | 8x |
| Speed | Fastest | Fast | Slower | Slowest |
| Cost | $ | $$ | $$$ | $$$$ |
5. Vector Database Algorithms
In Step ⑤ of the diagram: how does the Vector DB find nearest neighbors so fast?
| Algorithm | How it works | Speed | Accuracy | Use case |
|---|---|---|---|---|
| Flat | Check every vector | Slow O(n) | 100% | <10k vectors |
| HNSW | Hierarchical graph | Fast O(log n) | 95-99% | Most common |
| IVF | Cluster-based | Very fast | 90-95% | >1M vectors |
For Handbook Chat: HNSW is perfect (100k-1M chunks, fast, accurate)
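For example, a hedged sketch using FAISS's HNSW index (hnswlib or a managed vector DB would look much the same); the vectors are random placeholders and 32 is the HNSW neighbor parameter M:

# Requires: pip install faiss-cpu numpy
import numpy as np
import faiss

dim = 1536
index = faiss.IndexHNSWFlat(dim, 32)        # 32 = HNSW neighbor parameter M

# Toy data: random stand-ins for real chunk embeddings.
vecs = np.random.default_rng(0).normal(size=(10_000, dim)).astype("float32")
faiss.normalize_L2(vecs)                    # on unit vectors, L2 ranking matches cosine ranking
index.add(vecs)

query = np.random.default_rng(1).normal(size=(1, dim)).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 5)     # approximate 5-nearest-neighbor search
print(ids[0], distances[0])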
Practical Tips (Learned from Diagram)
Tip 1: Test Each Step Independently
✅ Step ①-②: Manually inspect 10 documents after ingestion
✅ Step ③: Read 20 random chunks — are they coherent?
✅ Step ④-⑤: Query "marriage leave" — do top 5 chunks make sense?
✅ Step ⓪-⑥: Run 50 test queries — is retrieval relevant?
✅ Step ⑦-⑧: Evaluate 100 Q&A pairs — accuracy? citations?
Tip 2: Start Simple, Add Complexity
MVP:
- Basic chunking (no overlap)
- Vector search only (no hybrid)
- Simple prompt (no few-shot)
Then iterate:
- Add overlap → better context
- Add hybrid search → handle codes/IDs
- Add reranking → higher precision
- Add few-shot → better format
Tip 3: Monitor the Flow
Key metrics per diagram step:
| Step | Metric | Target |
|---|---|---|
| Step ③ | Avg chunk size | 400-800 tokens |
| Step ⑤ | Index size | < 1GB for 10k chunks |
| Step ⑥ | Retrieval latency | < 100ms |
| Step ⑥ | Top-1 relevance | > 80% |
| Step ⑦ | Generation latency | < 2s |
| Step ⑧ | Out-of-scope rate | 10-20% |
| Overall | End-to-end latency | < 3s (p95) |
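One way to collect these per-step numbers is a small timing helper wrapped around each stage; the step names and usage lines are illustrative:

import time
from contextlib import contextmanager

metrics: dict[str, list[float]] = {}

@contextmanager
def timed(step: str):
    """Record wall-clock latency in milliseconds for one pipeline step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics.setdefault(step, []).append((time.perf_counter() - start) * 1000)

# Hypothetical usage in the query path:
# with timed("retrieval"): chunks = retrieve(question)
# with timed("generation"): result = answer(question, chunks)
# p95_ms = sorted(metrics["retrieval"])[int(0.95 * (len(metrics["retrieval"]) - 1))]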
Tip 4: Version Control Everything
config.yaml:
embeddings:
model: "text-embedding-3-small"
version: "2024-01"
chunking:
target_tokens: 600
overlap: 100
strategy: "semantic"
retrieval:
k: 5
min_score: 0.25
hybrid: true
generation:
model: "gpt-4-turbo"
temperature: 0.1
max_tokens: 512
Why: When quality degrades, you can pinpoint which config changed.
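A sketch of loading and fingerprinting this config so a quality regression can be traced back to a config change; it assumes PyYAML and a logger of your choice:

import hashlib
import yaml  # pip install pyyaml

def load_config(path: str = "config.yaml") -> tuple[dict, str]:
    """Return the parsed config plus a short hash to log alongside every answer."""
    with open(path, "rb") as f:
        raw = f.read()
    cfg = yaml.safe_load(raw)
    fingerprint = hashlib.sha256(raw).hexdigest()[:12]
    return cfg, fingerprint

# Hypothetical usage:
# cfg, cfg_hash = load_config()
# logger.info("answering with config %s (k=%d)", cfg_hash, cfg["retrieval"]["k"])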
Real-World Q&A Examples
Example 1: Clear answer with citation
Q: "How many annual leave days do I get?"
Diagram flow:
Q → Embed → Search → Retrieve:
Chunk #23: "HR Policy §4.1: Full-time employees receive 12 days/year"
→ LLM reads context → Generate:
Output:
{
"answer": "Full-time employees receive 12 days of annual leave per year, plus 1 additional day for every 5 years of service.",
"citations": [{"title": "HR Policy", "section": "4.1", "page": 8}],
"is_out_of_scope": false,
"confidence": 0.95
}
Example 2: Out of scope (honest admission)
Q: "Can I work remotely from Bali for 3 months?"
Diagram flow:
Q → Embed → Search → Retrieve:
Chunk #15: "Remote Work Policy: up to 10 days/month within Vietnam"
(No chunks about international remote work)
→ LLM detects gap → Out of scope:
Output:
{
"answer": "Out of scope",
"reasoning": "The retrieved policy only covers remote work within Vietnam. International remote work is not addressed.",
"is_out_of_scope": true,
"suggestions": ["Contact HR at hr@company.com for international remote work inquiries"]
}
Example 3: Ambiguous query (clarification)
Q: "How much leave do I have?"
Diagram flow:
Q → Embed → Search → Retrieve multiple types:
Chunk #23: "Annual leave: 12 days/year"
Chunk #31: "Sick leave: 30 days/year"
Chunk #47: "Marriage leave: 3 days"
→ LLM recognizes ambiguity → Clarify:
Output:
{
"answer": "There are several types of leave: (1) Annual leave: 12 days/year, (2) Sick leave: 30 days/year, (3) Marriage leave: 3 days. For your personal remaining balance, please check the HR portal.",
"citations": [{"title": "HR Policy", "section": "4. Leave Types", "page": 8}],
"confidence": 0.7
}
Success Criteria (Evaluation)
Accuracy Metrics
| Metric | How to measure | Target |
|---|---|---|
| Correctness | LLM-as-judge on 100 Q&A pairs | >85% |
| Citation precision | Do citations support answer? | >95% |
| Retrieval quality | NDCG@5 on test queries | >0.8 |
| Out-of-scope detection | False positive rate | <10% |
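NDCG@5 can be computed directly from graded relevance labels on your test queries; a minimal sketch:

import math

def ndcg_at_k(relevances: list[float], k: int = 5) -> float:
    """relevances: graded relevance of the retrieved chunks, in ranked order."""
    def dcg(scores):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(scores[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Example: top-5 retrieved chunks labeled 2 (highly relevant) .. 0 (irrelevant)
print(round(ndcg_at_k([2, 1, 0, 2, 0]), 3))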
Performance Metrics
| Metric | Target |
|---|---|
| End-to-end latency (p50) | <2s |
| End-to-end latency (p95) | <3s |
| Cost per query | <$0.03 |
| Uptime | >99.5% |
User Satisfaction
- Thumbs up rate: >80%
- Follow-up questions: <20% (answer was complete)
- "Contact HR" escalations: <15% (system handled most queries)
Conclusion: The Power of One Diagram
Everything flows from understanding this pipeline:
- Documents → Clean, structured knowledge base
- Chunking → Granular, retrievable units
- Embeddings → Semantic search capability
- Vector DB → Fast, scalable retrieval
- Retrieval → Find the right context
- LLM → Synthesize grounded answers
- Output → Transparent, honest responses
You DON'T need:
- ❌ AGI (just narrow AI for Q&A)
- ❌ Fine-tuning (RAG handles knowledge)
- ❌ Massive infrastructure (works with 10k documents)
You DO need:
- ✅ Good chunking strategy (80% of quality)
- ✅ Guardrailed prompts (prevent hallucination)
- ✅ Rich metadata (enable citations)
- ✅ Honest out-of-scope handling (build trust)
Next steps: Implement this diagram step-by-step. Test each step independently. Iterate based on metrics and user feedback.
Remember: RAG = "Open-book exam for LLMs". Give them the right books (your documents), and they'll ace the test!