Understand AI via One Diagram: Build Handbook Chat with RAG
Problem: Employees ask about leave, overtime, travel, benefits…
Goal: Answer from real documents with citations, avoid hallucinations, and say "out of scope" when uncertain.
Solution: RAG (Retrieval-Augmented Generation) — Let LLM read documents before answering.
Short summary
- Core idea: Don't let LLM guess from memory → Force it to read relevant documents first (RAG)
- Pipeline: Ingest → Chunk → Embedding → Vector DB → Retrieve → LLM Synthesis + Citation
- Success criteria: Answers are context-correct, transparent (citations), and honest (out-of-scope when missing evidence)
One Diagram — Everything You Need to Know
This diagram shows how documents flow from raw files (left) through ingestion and indexing, then how user queries (right) retrieve relevant chunks and generate cited answers.
Deep Dive: Step-by-Step Analysis
Step ① - ②: Documents → Ingest & Clean
Challenge: Documents come in multiple formats (PDF, Word, Markdown) with:
- Tables, images, headers/footers
- Special characters, encoding issues
- Version chaos (which is latest?)
Process:
- Parse each file type to plain text
- Extract critical metadata
- Clean: remove noise, preserve structure
Output: Clean document with metadata
Document {
id: "hr-policy-2025"
title: "HR Policy Handbook"
text: "## 4.3 Marriage Leave\n\nEmployees are entitled to 3 days..."
metadata: {
version_date: "2025-01-01"
owner: "HR Department"
url: "https://intranet/hr/policy-2025"
confidentiality: "Internal"
}
}
Key insight: Metadata is critical! You need to know which document, which section, and which page in order to enable citations later.
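A minimal sketch of this ingest step in Python, assuming plain Markdown/text inputs; PDF and Word files would need extra parsers (e.g., pypdf, python-docx). The file name and metadata values below are illustrative:

from dataclasses import dataclass, field
from pathlib import Path
import re

@dataclass
class Document:
    id: str
    title: str
    text: str
    metadata: dict = field(default_factory=dict)

def ingest_markdown(path: str, title: str, metadata: dict) -> Document:
    """Read a Markdown/plain-text file, normalize whitespace, keep headings intact."""
    raw = Path(path).read_text(encoding="utf-8")
    text = re.sub(r"\n{3,}", "\n\n", raw)                              # collapse runs of blank lines
    text = "\n".join(line.rstrip() for line in text.splitlines())      # strip trailing spaces
    return Document(id=Path(path).stem, title=title, text=text, metadata=metadata)

# Hypothetical usage:
# doc = ingest_markdown("hr-policy-2025.md", "HR Policy Handbook",
#                       {"version_date": "2025-01-01", "owner": "HR Department",
#                        "url": "https://intranet/hr/policy-2025", "confidentiality": "Internal"})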
Step ③: Chunk (The Most Important Step!)
Why is chunking necessary?
- LLM has limited context window (can't fit entire handbook)
- Embedding works best with short passages (semantic coherence)
- Retrieval precision: easier to find specific info in small chunks
Smart Chunking Strategy:
Original document:
┌─────────────────────────────────────┐
│ ## 4.3 Marriage Leave │
│ │
│ Employees are entitled to 3 days │
│ of paid leave for marriage... │
│ │
│ Requests must be submitted 5 │
│ business days in advance to the │
│ HR department... │
│ │
│ ## 4.4 Bereavement Leave │
│ In case of immediate family... │
└─────────────────────────────────────┘
After chunking:
┌─────────────────────────────────────┐
│ Chunk 1: │
│ ## 4.3 Marriage Leave │
│ Employees are entitled to 3 days │
│ of paid leave for marriage... │
│ Requests must be submitted 5 │
│ business days in advance... │
└─────────────────────────────────────┘
↓ overlap ~100 tokens
┌─────────────────────────────────────┐
│ Chunk 2: │
│ ## 4.3 Marriage Leave (continued) │
│ ...business days in advance to the │
│ HR department... │
│ ## 4.4 Bereavement Leave │
│ In case of immediate family... │
└─────────────────────────────────────┘
Configuration sweet spot:
- Target size: 400-800 tokens (enough context, not too long)
- Overlap: ~100 tokens (prevent context loss at boundaries)
- Preserve structure: Always include section heading in each chunk
Key insight: Bad chunking = bad retrieval = bad answers. This is where 80% of RAG quality is determined!
Test your chunking: Randomly read 10 chunks. Is each chunk self-contained & understandable?
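A rough sketch of that strategy: heading-aware splits with a word count standing in for a real tokenizer, ~600-word targets and ~100-word overlap. Swap in tiktoken or your model's tokenizer for accurate token counts:

import re

def chunk_section(heading: str, body: str, target_tokens: int = 600, overlap: int = 100) -> list[str]:
    """Split one section into ~target_tokens-word chunks that overlap by ~overlap words.
    Every chunk is prefixed with its section heading so it stays self-contained."""
    words = body.split()
    chunks, start = [], 0
    while start < len(words):
        piece = words[start:start + target_tokens]
        chunks.append(heading + "\n" + " ".join(piece))
        if start + target_tokens >= len(words):
            break
        start += target_tokens - overlap   # step back by `overlap` words at each boundary
    return chunks

def chunk_document(text: str, **kwargs) -> list[str]:
    """Split a Markdown document on '## ' headings, then chunk each section."""
    out = []
    for sec in re.split(r"\n(?=## )", text):
        lines = sec.strip().splitlines()
        if not lines:
            continue
        heading, body = lines[0], " ".join(lines[1:])
        out.extend(chunk_section(heading, body, **kwargs))
    return out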
Step ④ - ⑤: Embeddings → Vector Database
The Semantic Search Magic
Embeddings = Text → Vector (array of numbers)
Visualization (simplified to 2D, actually 768-1536 dimensions):
"marriage leave" ●
"wedding leave" ● ← Close together! (Similar meaning)
"annual leave" ● ← Further away
"parking policy" ● ← Very far
How it works:
- Text → Vector: Each chunk is embedded into a vector (e.g., 1536 numbers)
- Similar meaning → Close vectors (measured by cosine similarity)
- Vector DB: Store millions of vectors, fast k-NN search in milliseconds
Example:
Query: "How many days of wedding leave?"
↓ Embed query
Query vector: [0.12, 0.85, 0.03, ...]
↓ Search Vector DB (find nearest neighbors)
Top matches:
1. Chunk #47: "marriage leave... 3 days..." (similarity: 0.94) ✅
2. Chunk #52: "wedding ceremony... submit request..." (similarity: 0.87) ✅
3. Chunk #12: "annual leave... 12 days..." (similarity: 0.35) ❌
Why not use keyword search?
| Approach | Query: "wedding leave" | Result |
|---|---|---|
| Keyword search | Exact match "wedding" | ❌ Misses "marriage leave" |
| Semantic search | Meaning similarity | ✅ Finds both "wedding" and "marriage" |
Key insight: Embeddings capture meaning, not just words. This is why RAG works cross-language, with synonyms, and paraphrases.
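A small sketch of this nearest-neighbor search with plain numpy cosine similarity; the vectors here are random placeholders, whereas real ones would come from an embedding model such as text-embedding-3-small:

import numpy as np

def cosine_top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5):
    """Return (chunk_index, similarity) pairs for the k chunks closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = m @ q                        # cosine similarity of every chunk vs. the query
    top = np.argsort(-sims)[:k]         # indices of the k highest scores
    return [(int(i), float(sims[i])) for i in top]

# Toy example with random 1536-dim vectors standing in for real embeddings:
rng = np.random.default_rng(0)
chunk_vectors = rng.normal(size=(1000, 1536))
query_vector = rng.normal(size=1536)
print(cosine_top_k(query_vector, chunk_vectors, k=3))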
Step ⓪ - ⑥: User Question → Retriever
The Retrieval Process
User Question: "Can I take OT pay as time off?"
↓
Embed question
↓
Search Vector DB
↓
Find k=5-8 most similar chunks
↓
Optional: Hybrid Search (Vector + Keyword)
↓
Optional: Rerank top candidates
↓
Return top 5 chunks with metadata
Key decisions:
1. How many chunks to retrieve (k)?
- k=2: Risky, may miss important context
- k=5-8: Sweet spot — enough coverage, not too noisy
- k=20: Too much noise, expensive, confuses the LLM
2. Similarity threshold?
- Only retrieve chunks with score > 0.25
- If top chunk < 0.25 → Probably "out of scope"
3. Hybrid search?
Combine vector search (semantic) + BM25 (keyword)
Vector search finds: "marriage leave policy" (semantic match)
BM25 finds: "Form-HR-003" (exact code match)
Combined: Best of both worlds!
Use case: Queries containing codes, IDs, or form numbers benefit from hybrid search.
4. Reranking?
Initial retrieval: Over-retrieve top 20 (fast, 95% accuracy)
↓
Cross-encoder rerank: Re-score with better model (slower, 99% accuracy)
↓
Final top 5: Highest precision
Trade-off: Reranking adds ~100ms latency but improves accuracy. Worth it for production!
Key insight: Retrieval quality determines the ceiling of the entire system. Good retrieval = LLM has right context to work from.
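Putting those decisions together, a hedged sketch of a retriever that applies k, the 0.25 threshold, over-retrieval, and an optional reranker. Here embed, vector_search, and rerank are placeholder callables for your embedding model, vector DB client, and cross-encoder:

def retrieve(query: str, embed, vector_search, rerank=None,
             k: int = 5, overfetch: int = 20, min_score: float = 0.25):
    """Return up to k chunks above min_score, plus a flag when evidence looks too weak.
    `embed`, `vector_search`, and `rerank` are injected callables (stand-ins for real clients)."""
    qvec = embed(query)
    candidates = vector_search(qvec, k=overfetch)            # [(chunk, score), ...], best first
    candidates = [(c, s) for c, s in candidates if s >= min_score]
    if rerank is not None and candidates:
        candidates = rerank(query, candidates)                # re-score with a cross-encoder
    top = candidates[:k]
    out_of_scope = len(top) < 2                               # too little evidence to answer
    return top, out_of_scope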
Step ⑦: LLM Generation (The Synthesis Step)
LLM's Role in RAG
❌ NOT: Memorize company policies (impossible + outdated quickly)
✅ YES: Read retrieved chunks → Synthesize clear answer → Cite sources
Critical: Guardrailed Prompts
System Prompt Structure:
┌────────────────────────────────────────────┐
│ ROLE: Company policy assistant │
│ │
│ RULES: │
│ 1. ONLY use info from [CONTEXT] │
│ 2. If insufficient evidence → out_of_scope │
│ 3. Always cite: [Title §Section p.X] │
│ 4. Explain reasoning first │
│ │
│ OUTPUT FORMAT: JSON │
│ { │
│ "reasoning": "...", │
│ "answer": "...", │
│ "citations": [...], │
│ "is_out_of_scope": false, │
│ "confidence": 0.95 │
│ } │
│ │
│ FEW-SHOT EXAMPLES: [3-5 examples] │
└────────────────────────────────────────────┘
User Message Structure:
Question: How many days of wedding leave do I get?
[CONTEXT]
--- Chunk 1 ---
Source: HR Policy §4.3 Marriage Leave (p.12)
URL: https://intranet/hr/policy#4.3
Content:
Employees are entitled to 3 days of paid leave for marriage.
Requests must be submitted 5 business days in advance.
--- Chunk 2 ---
Source: HR Policy §4.3.1 Documentation (p.12)
URL: https://intranet/hr/policy#4.3.1
Content:
Marriage certificate copy must be provided within 30 days.
[/CONTEXT]
Now answer following the rules.
Key Parameters for Factual Answers:
| Parameter | Value | Why |
|---|---|---|
| temperature | 0.1-0.3 | Low randomness → consistent, factual |
| max_tokens | 300-512 | Cap length → control cost |
| response_format | JSON | Structured, parseable output |
Key insight: Without guardrails, LLMs will hallucinate (confidently generate wrong info). Guardrails force LLM to be honest & grounded.
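A sketch of this generation call using the OpenAI Python SDK as one concrete example; the model name and prompt wording are illustrative, and any chat API with a JSON response mode works the same way:

import json
from openai import OpenAI  # assumes the openai SDK is installed and OPENAI_API_KEY is set

client = OpenAI()

SYSTEM_PROMPT = """You are a company policy assistant.
Rules:
1. ONLY use information from [CONTEXT].
2. If the evidence is insufficient, set is_out_of_scope to true.
3. Always cite sources as [Title §Section p.X].
4. Explain your reasoning first.
Respond as JSON with keys: reasoning, answer, citations, is_out_of_scope, confidence."""

def answer(question: str, context_blocks: list[str]) -> dict:
    context = "\n".join(context_blocks)
    user_msg = f"Question: {question}\n[CONTEXT]\n{context}\n[/CONTEXT]\nNow answer following the rules."
    resp = client.chat.completions.create(
        model="gpt-4-turbo",                      # illustrative; use whichever model you deploy
        temperature=0.1,                          # low randomness for factual answers
        max_tokens=512,                           # cap output length / cost
        response_format={"type": "json_object"},  # force parseable JSON
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": user_msg}],
    )
    return json.loads(resp.choices[0].message.content)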
Step ⑧: JSON Output (Transparency & Trust)
Structured Response Example:
{
"reasoning": "The context clearly states in HR Policy §4.3 that employees receive 3 days of marriage leave with 5 business days advance notice requirement.",
"answer": "You are entitled to 3 days of paid leave for marriage. You must submit your request 5 business days in advance to the HR department and provide a copy of your marriage certificate within 30 days.",
"citations": [
{
"title": "HR Policy Handbook",
"section": "4.3 Marriage Leave",
"page": 12,
"url": "https://intranet/hr/policy#4.3",
"excerpt": "Employees are entitled to 3 days of paid leave..."
}
],
"is_out_of_scope": false,
"confidence": 0.95
}
When to return "out of scope":
- Retrieved chunks < 2 (insufficient evidence)
- Top chunk similarity < 0.25 (low relevance)
- Chunks contradict each other (ambiguous)
- LLM detects uncertainty in reasoning
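A minimal sketch of that gating logic; the thresholds mirror the ones above, and the contradiction check is left out because it usually needs the LLM's own judgment:

def should_return_out_of_scope(chunks, llm_output=None, min_chunks=2, min_top_score=0.25) -> bool:
    """chunks: [(text, similarity_score), ...] sorted best-first."""
    if len(chunks) < min_chunks:                       # insufficient evidence
        return True
    if chunks[0][1] < min_top_score:                   # low relevance
        return True
    if llm_output is not None and llm_output.get("is_out_of_scope"):
        return True                                    # the model itself flagged uncertainty
    return False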
Example - Out of Scope:
{
"reasoning": "The provided context only covers remote work within Vietnam. International remote work policies are not addressed in the retrieved documents.",
"answer": "Out of scope",
"citations": [],
"is_out_of_scope": true,
"suggestions": [
"Contact HR directly at hr@company.com",
"Check the International Assignment Policy if available"
],
"confidence": 0.0
}
Key insight: Honesty > Accuracy. Better to say "I don't know" than to hallucinate wrong answers. Users trust systems that admit limitations.
The Big Picture: Why RAG Works
Without RAG (LLM Alone)
User: "How many days of marriage leave?"
↓
LLM (relying on training data memory):
↓
"I believe most companies offer 3-7 days..." ❌ HALLUCINATION
Problems:
- Training data: Outdated (trained months/years ago)
- Parametric knowledge: General, not your company
- No citations: Can't verify
- Confidently wrong: Sounds authoritative but incorrect
With RAG
User: "How many days of marriage leave?"
↓
Retrieve: [HR Policy §4.3: "3 days"]
↓
LLM reads context → Synthesizes answer
↓
"According to HR Policy §4.3, you have 3 days. [Citation]" ✅
Benefits:
- ✅ Always current (just re-index updated documents)
- ✅ Company-specific (your actual policies)
- ✅ Verifiable (citations point to source)
- ✅ Honest (says "out of scope" when unsure)
Analogy
| Scenario | LLM Alone | RAG |
|---|---|---|
| Exam type | Closed-book (memory only) | Open-book (reference materials) |
| Accuracy | 60-70% (guessing) | 85-95% (grounded) |
| Trust | Low (no sources) | High (show your work) |
| Maintenance | Retrain model ($$$$) | Re-index docs ($) |
Key insight: RAG = "External memory" for LLMs. Instead of memorizing everything (impossible), give LLM ability to look up information on-demand.
Common Concepts Explained via Diagram
1. AI vs AGI
AI (Narrow): Systems good at specific tasks → This diagram = AI for Q&A
AGI (General): Good at everything → Does NOT exist yet
For Handbook Chat: We only need narrow AI — answer from documents, nothing more.
2. Training vs RAG
Training: Bake knowledge into model weights (expensive, static)
RAG: Store knowledge outside model (cheap, dynamic)
When policy changes:
Training approach:
Update document → Retrain model ($$$) → Deploy
RAG approach:
Update document → Re-index (minutes, $) → Done
Rule of thumb: Prompt → RAG → Fine-tune (if really needed)
3. Inference
Inference = Running the model to generate output (Step ⑦ in the diagram)
Cost factors:
- Input tokens (retrieved context): ~1500 tokens × $0.01/1K = $0.015
- Output tokens (answer): ~300 tokens × $0.03/1K = $0.009
- Total per query: ~$0.024
At scale: 1000 queries/day × $0.024 = $24/day ≈ $720/month
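The same arithmetic as a tiny sketch, with the prices above as illustrative defaults; plug in your provider's actual rates:

def monthly_cost(queries_per_day: int,
                 input_tokens: int = 1500, output_tokens: int = 300,
                 in_price_per_1k: float = 0.01, out_price_per_1k: float = 0.03,
                 days: int = 30) -> float:
    """Estimated monthly LLM spend for the RAG query path."""
    per_query = (input_tokens / 1000 * in_price_per_1k
                 + output_tokens / 1000 * out_price_per_1k)
    return queries_per_day * per_query * days

print(monthly_cost(1000))  # ~720.0 USD/month at these example rates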
Optimization strategies:
- Caching: Same question → cached answer (40% cost reduction)
- Truncate context: Only send top 5 chunks, not top 20
- Streaming: Better UX, same cost
- Rate limiting: Prevent abuse
4. Embeddings Dimensionality
Why 768-1536 dimensions?
- Higher dimensions = capture more nuanced meanings
- But: More storage, slower search
- Sweet spot: 1536 (OpenAI text-embedding-3-small)
Trade-offs:
| Dimensions | 384 | 768 | 1536 | 3072 |
|---|---|---|---|---|
| Accuracy | 85% | 91% | 95% | 97% |
| Storage | 1x | 2x | 4x | 8x |
| Speed | Fastest | Fast | Slower | Slowest |
| Cost | $ | $$ | $$$ | $$$$ |
5. Vector Database Algorithms
In Step ⑤ of the diagram: how does the Vector DB find nearest neighbors so fast?
| Algorithm | How it works | Speed | Accuracy | Use case |
|---|---|---|---|---|
| Flat | Check every vector | Slow O(n) | 100% | <10k vectors |
| HNSW | Hierarchical graph | Fast O(log n) | 95-99% | Most common |
| IVF | Cluster-based | Very fast | 90-95% | >1M vectors |
For Handbook Chat: HNSW is perfect (100k-1M chunks, fast, accurate)
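For example, a hedged sketch using FAISS's HNSW index (hnswlib or a managed vector DB would look much the same); the vectors are random placeholders and 32 is the HNSW neighbor parameter M:

# Requires: pip install faiss-cpu numpy
import numpy as np
import faiss

dim = 1536
index = faiss.IndexHNSWFlat(dim, 32)        # 32 = HNSW neighbor parameter M

# Toy data: random stand-ins for real chunk embeddings.
vecs = np.random.default_rng(0).normal(size=(10_000, dim)).astype("float32")
faiss.normalize_L2(vecs)                    # on unit vectors, L2 ranking matches cosine ranking
index.add(vecs)

query = np.random.default_rng(1).normal(size=(1, dim)).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 5)     # approximate 5-nearest-neighbor search
print(ids[0], distances[0])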
Practical Tips (Learned from Diagram)
Tip 1: Test Each Step Independently
✅ Step ①-②: Manually inspect 10 documents after ingestion
✅ Step ③: Read 20 random chunks — are they coherent?
✅ Step ④-⑤: Query "marriage leave" — do top 5 chunks make sense?
✅ Step ⓪-⑥: Run 50 test queries — is retrieval relevant?
✅ Step ⑦-⑧: Evaluate 100 Q&A pairs — accuracy? citations?
Tip 2: Start Simple, Add Complexity
MVP:
- Basic chunking (no overlap)
- Vector search only (no hybrid)
- Simple prompt (no few-shot)
Then iterate:
- Add overlap → better context
- Add hybrid search → handle codes/IDs
- Add reranking → higher precision
- Add few-shot → better format
Tip 3: Monitor the Flow
Key metrics per diagram step:
| Step | Metric | Target |
|---|---|---|
| Step ③ | Avg chunk size | 400-800 tokens |
| Step ⑤ | Index size | < 1GB for 10k chunks |
| Step ⑥ | Retrieval latency | < 100ms |
| Step ⑥ | Top-1 relevance | > 80% |
| Step ⑦ | Generation latency | < 2s |
| Step ⑧ | Out-of-scope rate | 10-20% |
| Overall | End-to-end latency | < 3s (p95) |
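One way to collect these per-step numbers is a small timing helper wrapped around each stage; the step names and usage lines are illustrative:

import time
from contextlib import contextmanager

metrics: dict[str, list[float]] = {}

@contextmanager
def timed(step: str):
    """Record wall-clock latency in milliseconds for one pipeline step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics.setdefault(step, []).append((time.perf_counter() - start) * 1000)

# Hypothetical usage in the query path:
# with timed("retrieval"): chunks = retrieve(question)
# with timed("generation"): result = answer(question, chunks)
# p95_ms = sorted(metrics["retrieval"])[int(0.95 * (len(metrics["retrieval"]) - 1))]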
Tip 4: Version Control Everything
config.yaml:
embeddings:
model: "text-embedding-3-small"
version: "2024-01"
chunking:
target_tokens: 600
overlap: 100
strategy: "semantic"
retrieval:
k: 5
min_score: 0.25
hybrid: true
generation:
model: "gpt-4-turbo"
temperature: 0.1
max_tokens: 512
Why: When quality degrades, you can pinpoint which config changed.
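A sketch of loading and fingerprinting this config so a quality regression can be traced back to a config change; it assumes PyYAML and a logger of your choice:

import hashlib
import yaml  # pip install pyyaml

def load_config(path: str = "config.yaml") -> tuple[dict, str]:
    """Return the parsed config plus a short hash to log alongside every answer."""
    with open(path, "rb") as f:
        raw = f.read()
    cfg = yaml.safe_load(raw)
    fingerprint = hashlib.sha256(raw).hexdigest()[:12]
    return cfg, fingerprint

# Hypothetical usage:
# cfg, cfg_hash = load_config()
# logger.info("answering with config %s (k=%d)", cfg_hash, cfg["retrieval"]["k"])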
Real-World Q&A Examples
Example 1: Clear answer with citation
Q: "How many annual leave days do I get?"
Diagram flow:
Q → Embed → Search → Retrieve:
Chunk #23: "HR Policy §4.1: Full-time employees receive 12 days/year"
→ LLM reads context → Generate:
Output:
{
"answer": "Full-time employees receive 12 days of annual leave per year, plus 1 additional day for every 5 years of service.",
"citations": [{"title": "HR Policy", "section": "4.1", "page": 8}],
"is_out_of_scope": false,
"confidence": 0.95
}
Example 2: Out of scope (honest admission)
Q: "Can I work remotely from Bali for 3 months?"
Diagram flow:
Q → Embed → Search → Retrieve:
Chunk #15: "Remote Work Policy: up to 10 days/month within Vietnam"
(No chunks about international remote work)
→ LLM detects gap → Out of scope:
Output:
{
"answer": "Out of scope",
"reasoning": "The retrieved policy only covers remote work within Vietnam. International remote work is not addressed.",
"is_out_of_scope": true,
"suggestions": ["Contact HR at hr@company.com for international remote work inquiries"]
}
Example 3: Ambiguous query (clarification)
Q: "How much leave do I have?"
Diagram flow:
Q → Embed → Search → Retrieve multiple types:
Chunk #23: "Annual leave: 12 days/year"
Chunk #31: "Sick leave: 30 days/year"
Chunk #47: "Marriage leave: 3 days"
→ LLM recognizes ambiguity → Clarify:
Output:
{
"answer": "There are several types of leave: (1) Annual leave: 12 days/year, (2) Sick leave: 30 days/year, (3) Marriage leave: 3 days. For your personal remaining balance, please check the HR portal.",
"citations": [{"title": "HR Policy", "section": "4. Leave Types", "page": 8}],
"confidence": 0.7
}
Success Criteria (Evaluation)
Accuracy Metrics
| Metric | How to measure | Target |
|---|---|---|
| Correctness | LLM-as-judge on 100 Q&A pairs | >85% |
| Citation precision | Do citations support answer? | >95% |
| Retrieval quality | NDCG@5 on test queries | >0.8 |
| Out-of-scope detection | False positive rate | <10% |
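NDCG@5 can be computed directly from graded relevance labels on your test queries; a minimal sketch:

import math

def ndcg_at_k(relevances: list[float], k: int = 5) -> float:
    """relevances: graded relevance of the retrieved chunks, in ranked order."""
    def dcg(scores):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(scores[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Example: top-5 retrieved chunks labeled 2 (highly relevant) .. 0 (irrelevant)
print(round(ndcg_at_k([2, 1, 0, 2, 0]), 3))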
Performance Metrics
| Metric | Target |
|---|---|
| End-to-end latency (p50) | <2s |
| End-to-end latency (p95) | <3s |
| Cost per query | <$0.03 |
| Uptime | >99.5% |
User Satisfaction
- Thumbs up rate: >80%
- Follow-up questions: <20% (answer was complete)
- "Contact HR" escalations: <15% (system handled most queries)
Conclusion: The Power of One Diagram
Everything flows from understanding this pipeline:
- Documents → Clean, structured knowledge base
- Chunking → Granular, retrievable units
- Embeddings → Semantic search capability
- Vector DB → Fast, scalable retrieval
- Retrieval → Find the right context
- LLM → Synthesize grounded answers
- Output → Transparent, honest responses
You DON'T need:
- ❌ AGI (just narrow AI for Q&A)
- ❌ Fine-tuning (RAG handles knowledge)
- ❌ Massive infrastructure (works with 10k documents)
You DO need:
- ✅ Good chunking strategy (80% of quality)
- ✅ Guardrailed prompts (prevent hallucination)
- ✅ Rich metadata (enable citations)
- ✅ Honest out-of-scope handling (build trust)
Next steps: Implement this diagram step-by-step. Test each step independently. Iterate based on metrics and user feedback.
Remember: RAG = "Open-book exam for LLMs". Give them the right books (your documents), and they'll ace the test!