Evaluating Vector Search Quality: A Practical Guide for Developers

Vector databases and embeddings have become core infrastructure for AI applications: search, RAG, recommendations, anomaly detection, and more. But while building a vector search system is straightforward, measuring its quality is not. Developers often tune models, indexes, or parameters without proper evaluation, which leads to misleading performance numbers and a poor user experience.

This article provides a clear framework for evaluating vector search quality, including metrics, datasets, and practical workflows you can apply immediately.

Why Vector Search Needs Proper Evaluation

Traditional keyword search can be tested with precision/recall and simple heuristics, but vector search is different:

  • Results depend on embedding models.
  • Index configurations affect accuracy vs speed.
  • Relevant results may not be strictly "correct" or "incorrect".
  • Real-world meaning similarity is subjective.

Without structured evaluation, you might ship a system that feels good in small tests but fails at scale.

Key Metrics for Vector Search Quality

  1. Recall@K (Most Common)
    Recall@K measures how many of the "true most similar" items are found within the top K results.
    - Recall@10 = proportion of correct neighbors found in the top 10.
    - Higher recall means your index returns results close to the brute-force ground truth.
  2. Precision@K
    Measures how many of the returned top-K results are actually relevant. This metric is useful when your evaluation dataset contains multiple relevant answers instead of exact nearest neighbors.
  3. Mean Reciprocal Rank (MRR)
    Focuses on where the first correct result appears.
    - High MRR = relevant items appear very early in the list.
    - Common in question-answer similarity search.
  4. Normalized Discounted Cumulative Gain (NDCG)
    NDCG evaluates graded relevance:
    - Works when results are "somewhat relevant", "very relevant", etc.
    - Discounts results based on ranking position.
    Often used in recommendation systems and semantic search.
  5. Latency and Throughput Metrics
    Quality is not only accuracy; performance matters too.
    - P95/P99 latency
    - Queries per second (QPS)
    - Index build time
    - Memory usage
    A high-recall index is useless if it's too slow for production. A short code sketch of the ranking metrics above follows this list.
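
To make these definitions concrete, here is a minimal, self-contained sketch of the four ranking metrics for a single query. The retrieved list, relevant set, and relevance grades are made-up placeholder data, not output from a real system.

import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, grades, k):
    """NDCG with graded relevance; grades maps doc id -> relevance grade."""
    dcg = sum(grades.get(doc_id, 0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Placeholder data for one query
retrieved = ["doc1", "doc4", "doc2", "doc7"]
relevant = {"doc1", "doc2", "doc3"}
grades = {"doc1": 3, "doc2": 2, "doc3": 1}

print(recall_at_k(retrieved, relevant, 3))     # 0.666... (2 of 3 relevant docs in top 3)
print(precision_at_k(retrieved, relevant, 3))  # 0.666... (2 of 3 retrieved docs are relevant)
print(reciprocal_rank(retrieved, relevant))    # 1.0 (first hit at rank 1)
print(ndcg_at_k(retrieved, grades, 3))         # ~0.84

Averaging reciprocal_rank over a set of evaluation queries gives MRR; the same per-query-then-average pattern applies to the other metrics.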

Evaluation Workflow

  1. Build or collect a ground-truth dataset
  2. Generate embeddings
  3. Compute brute-force ("gold standard") neighbors (a sketch follows this list)
  4. Test different vector search configurations
  5. Plot accuracy vs. speed
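
Step 3 is usually where the ground truth for approximate indexes comes from: compute the exact nearest neighbors by brute force and treat them as the gold standard. Below is a minimal NumPy sketch using cosine similarity; the random placeholder vectors, array shapes, and value of k are illustrative assumptions.

import numpy as np

def brute_force_neighbors(query_embs, doc_embs, k):
    """Exact top-k neighbors by cosine similarity (the 'gold standard')."""
    # Normalize rows so a plain dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T                            # shape: (n_queries, n_docs)
    return np.argsort(-sims, axis=1)[:, :k]   # top-k doc indices per query

# Placeholder embeddings standing in for real model output
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 384))
query_embs = rng.normal(size=(10, 384))

ground_truth = brute_force_neighbors(query_embs, doc_embs, k=10)
print(ground_truth.shape)  # (10, 10)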

Sample Workflow

1.   Install and import dependencies:

pip install chromadb sentence-transformers

from sentence_transformers import SentenceTransformer
import chromadb
import numpy as np

2.   Generate embeddings

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["Apple fruit", "Orange fruit", "Carrot vegetable"]
doc_ids = ["doc1", "doc2", "doc3"]

embeddings = model.encode(documents).tolist()

3.   Create a Chroma collection and add documents

client = chromadb.Client()

collection = client.create_collection("fruits_collection")

collection.add(
    ids=doc_ids,
    documents=documents,
    embeddings=embeddings
)

4.   Query the collection

query = "I like fruits"
query_emb = model.encode([query]).tolist()

results = collection.query(
    query_embeddings=query_emb,
    n_results=2
)

print(results)

Example output (abridged):

{
    'ids': [["doc1", "doc2"]],
    'documents': [["Apple fruit", "Orange fruit"]],
    'distances': [[0.12, 0.15]]
}

5.   Evaluate Recall@K

ground_truth = [["doc1", "doc2"]]   # expected top-2 neighbors for the single query

predicted_ids = results["ids"]      # retrieved id lists, one per query
recall = np.mean([len(set(pred) & set(gt)) / len(gt)
                  for pred, gt in zip(predicted_ids, ground_truth)])

print("Recall@K:", recall)

6.   Experiment with different settings

  • Change embedding models
  • Use different collection parameters (e.g., metadata={"hnsw:space": "cosine"} to switch the distance function to cosine)
  • Measure latency and accuracy trade-offs

For example, to time a single query:

import time

start = time.time()
collection.query(query_embeddings=query_emb, n_results=2)
print("Query latency:", time.time() - start, "seconds")

Conclusion

Evaluating vector search quality is not just about picking the best embedding or the fastest index. It's about finding the best balance between:

  • Accuracy
  • Latency
  • Memory
  • Real-world relevance

A robust evaluation pipeline will ensure your vector search system stays reliable as your data and traffic scale.