<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://krishnamohan-seelam.github.io/mongodb-vectorsearch/feed.xml" rel="self" type="application/atom+xml" /><link href="https://krishnamohan-seelam.github.io/mongodb-vectorsearch/" rel="alternate" type="text/html" /><updated>2026-05-07T03:30:19+00:00</updated><id>https://krishnamohan-seelam.github.io/mongodb-vectorsearch/feed.xml</id><title type="html">MongoDB Vector Search</title><subtitle>A comprehensive guide to MongoDB Vector Search</subtitle><entry><title type="html">Creating Vector Search Queries in MongoDB Atlas</title><link href="https://krishnamohan-seelam.github.io/mongodb-vectorsearch/mongodb/python/vector-search/2026/05/03/MongoDB_VectorSearch_Query_Tutorial.html" rel="alternate" type="text/html" title="Creating Vector Search Queries in MongoDB Atlas" /><published>2026-05-03T00:00:00+00:00</published><updated>2026-05-03T00:00:00+00:00</updated><id>https://krishnamohan-seelam.github.io/mongodb-vectorsearch/mongodb/python/vector-search/2026/05/03/MongoDB_VectorSearch_Query_Tutorial</id><content type="html" xml:base="https://krishnamohan-seelam.github.io/mongodb-vectorsearch/mongodb/python/vector-search/2026/05/03/MongoDB_VectorSearch_Query_Tutorial.html"><![CDATA[<h1 id="introduction-to--vector-search-queries-in-mongodb-atlas">Introduction to  Vector Search Queries in MongoDB Atlas</h1>

<p><img src="/assets/mongodb_vector_search.png" alt="mongodb_vector_search" /></p>

<h2 id="overview">Overview</h2>

<p>This tutorial walks you through the complete workflow of performing <strong>vector search</strong> in MongoDB Atlas — from generating text embeddings to constructing aggregation pipelines with and without pre-filters.</p>

<p>Vector search enables <strong>semantic similarity</strong> queries: instead of matching exact keywords, you find documents whose <em>meaning</em> is closest to your query. This is the engine behind modern AI features like RAG (Retrieval-Augmented Generation), recommendation systems, and intelligent document search.</p>

<hr />

<h2 id="prerequisites">Prerequisites</h2>

<ul>
  <li>A running <strong>MongoDB Atlas</strong> cluster (the free M0 tier is sufficient)</li>
  <li>A collection with documents that have an embedding field (e.g., <code class="language-plaintext highlighter-rouge">plot_embedding</code>)</li>
  <li>A <strong>Vector Search index</strong> already created on the collection (see <a href="#step-3--create-a-vector-search-index">Step 3</a> below)</li>
  <li>An API key for <strong>Voyage AI</strong> (or another embedding provider)</li>
  <li>Python environment with <code class="language-plaintext highlighter-rouge">pymongo</code> and <code class="language-plaintext highlighter-rouge">voyageai</code> packages installed</li>
</ul>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
  <li><a href="#what-is-a-vector-embedding">What is a Vector Embedding?</a></li>
  <li><a href="#step-1--embedding-model-voyage-ai-voyage-35-lite">Step 1 — Embedding Model: Voyage AI voyage-3.5-lite</a></li>
  <li><a href="#step-2--generate-and-store-document-embeddings">Step 2 — Generate and Store Document Embeddings</a></li>
  <li><a href="#step-3--create-a-vector-search-index">Step 3 — Create a Vector Search Index</a></li>
  <li><a href="#step-4--generate-a-query-embedding">Step 4 — Generate a Query Embedding</a></li>
  <li><a href="#step-5--build-the-vector-search-pipeline">Step 5 — Build the Vector Search Pipeline</a></li>
  <li><a href="#step-6--vector-search-with-a-pre-filter">Step 6 — Vector Search with a Pre-Filter</a></li>
  <li><a href="#deep-dive-how-hnsw-powers-vector-search">Deep Dive: How HNSW Powers Vector Search</a></li>
  <li><a href="#tuning-numcandidates-for-optimal-performance">Tuning numCandidates for Optimal Performance</a></li>
  <li><a href="#ann-vs-exact-search">ANN vs. Exact Search</a></li>
  <li><a href="#understanding-vectorsearchscore">Understanding vectorSearchScore</a></li>
  <li><a href="#common-pitfalls">Common Pitfalls</a></li>
  <li><a href="#quick-reference">Quick Reference</a></li>
</ol>

<hr />

<h2 id="what-is-a-vector-embedding">What is a Vector Embedding?</h2>

<p>A <strong>vector embedding</strong> is a dense numerical representation of text (or other data) in a high-dimensional space. Semantically similar texts are placed closer together in this space.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"A dog running in the park"    → [0.12, -0.45, 0.87, ..., 0.03]   (1024 numbers)
"A puppy playing outdoors"     → [0.13, -0.43, 0.89, ..., 0.02]   ← very similar!
"The stock market crashed"     → [-0.91, 0.22, -0.54, ..., 0.77]  ← very different
</code></pre></div></div>

<p><strong>Similarity is measured using distance metrics:</strong></p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Formula</th>
      <th>Best For</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Cosine</strong></td>
      <td>$1 - \frac{A \cdot B}{\|A\|\|B\|}$</td>
      <td>Text similarity (most common)</td>
    </tr>
    <tr>
      <td><strong>Dot Product</strong></td>
      <td>$A \cdot B$</td>
      <td>Normalized vectors, fast ranking</td>
    </tr>
    <tr>
      <td><strong>Euclidean</strong></td>
      <td>$\sqrt{\sum(A_i - B_i)^2}$</td>
      <td>When magnitude matters</td>
    </tr>
  </tbody>
</table>
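
<p>As a concrete illustration of the cosine metric above, the sketch below (plain Python, toy 3-dimensional vectors rather than real 1024-dimensional embeddings) computes the similarity between the example sentences:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def cosine_similarity(a: list[float], b: list[float]) -&gt; float:
    """Cosine similarity: A.B / (|A||B|). 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

dog   = [0.12, -0.45, 0.87]   # "A dog running in the park"
puppy = [0.13, -0.43, 0.89]   # "A puppy playing outdoors"
stock = [-0.91, 0.22, -0.54]  # "The stock market crashed"

print(cosine_similarity(dog, puppy))  # close to 1.0: semantically similar
print(cosine_similarity(dog, stock))  # much lower: semantically different
</code></pre></div></div>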

<blockquote>
  <p><strong>Source:</strong> <a href="https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-stage/">MongoDB Atlas Vector Search Documentation</a></p>
</blockquote>

<hr />

<h2 id="step-1--embedding-model-voyage-ai-voyage-35-lite">Step 1 — Embedding Model: Voyage AI voyage-3.5-lite</h2>

<p>The examples in this tutorial use the <strong><code class="language-plaintext highlighter-rouge">voyage-3.5-lite</code></strong> embedding model from Voyage AI — a state-of-the-art, cost-efficient model optimized for large-scale retrieval and RAG applications.</p>

<h3 id="key-specifications">Key Specifications</h3>

<table>
  <thead>
    <tr>
      <th>Property</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Supported Dimensions</strong></td>
      <td>2048, 1024 (default), 512, 256</td>
    </tr>
    <tr>
      <td><strong>Context Length</strong></td>
      <td>32,000 tokens</td>
    </tr>
    <tr>
      <td><strong>Quantization Types</strong></td>
      <td><code class="language-plaintext highlighter-rouge">float</code> (default), <code class="language-plaintext highlighter-rouge">int8</code>, <code class="language-plaintext highlighter-rouge">uint8</code>, <code class="language-plaintext highlighter-rouge">binary</code>, <code class="language-plaintext highlighter-rouge">ubinary</code></td>
    </tr>
    <tr>
      <td><strong>Use Cases</strong></td>
      <td>Technical docs, code, law, finance, web reviews, conversations</td>
    </tr>
  </tbody>
</table>

<h3 id="why-flexible-dimensions">Why Flexible Dimensions?</h3>

<p><code class="language-plaintext highlighter-rouge">voyage-3.5-lite</code> uses <strong>Matryoshka Representation Learning (MRL)</strong> — a technique where the first <code class="language-plaintext highlighter-rouge">N</code> dimensions of a larger embedding already form a high-quality, lower-dimensional embedding. This means you can truncate the vector to save storage without dramatically hurting recall quality.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2048-dim  →  high quality, high storage cost
1024-dim  →  balanced (default)
512-dim   →  compact, good for memory-constrained deployments
256-dim   →  smallest, fastest, some quality trade-off
</code></pre></div></div>
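
<p>The snippet below is a minimal local illustration of the MRL property, not part of the Voyage AI API: it keeps the first 256 dimensions of a full 1024-dimensional embedding and re-normalizes the result, which is how you could derive a compact vector yourself instead of requesting a smaller dimension from the provider.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def truncate_embedding(embedding: list[float], target_dim: int = 256) -&gt; list[float]:
    """Keep the first target_dim dimensions of an MRL embedding and re-normalize."""
    vec = np.asarray(embedding[:target_dim], dtype=np.float32)
    norm = np.linalg.norm(vec)
    return (vec / norm).tolist() if norm &gt; 0 else vec.tolist()

# full_embedding = generate_embedding("A dog running in the park")  # 1024 dims
# compact = truncate_embedding(full_embedding, 256)                 # 256 dims
</code></pre></div></div>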

<h3 id="quantization-tradeoffs">Quantization Tradeoffs</h3>

<p>Quantization reduces the precision of each floating-point number:</p>

<table>
  <thead>
    <tr>
      <th>Type</th>
      <th>Storage Reduction</th>
      <th>Recall Impact</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">float</code></td>
      <td>Baseline</td>
      <td>None (highest quality)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">int8</code></td>
      <td>~75% reduction</td>
      <td>Minimal</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">binary</code></td>
      <td>~97% reduction</td>
      <td>Moderate — use with binary rescoring</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>Tip:</strong> Using <code class="language-plaintext highlighter-rouge">int8</code> at 2048 dimensions can reduce vector DB costs by ~83% vs. standard float embeddings, per Voyage AI documentation.</p>
</blockquote>
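
<p>The reductions in the table follow directly from the per-dimension byte sizes; a quick back-of-the-envelope calculation for a 1024-dimensional vector:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dims = 1024

sizes = {
    "float32": dims * 4,   # 4096 bytes per vector
    "int8":    dims * 1,   # 1024 bytes per vector (~75% smaller)
    "binary":  dims // 8,  #  128 bytes per vector (~97% smaller)
}

for label, size in sizes.items():
    reduction = 1 - size / sizes["float32"]
    print(f"{label:8}: {size:5d} bytes/vector  ({reduction:.0%} reduction)")
</code></pre></div></div>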

<hr />

<h2 id="step-2--generate-and-store-document-embeddings">Step 2 — Generate and Store Document Embeddings</h2>

<p>Before you can run vector search queries, each document in your collection must have an <strong>embedding field</strong> that stores the vector.</p>

<h3 id="installation">Installation</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>voyageai pymongo
</code></pre></div></div>

<h3 id="generating-embeddings-for-documents">Generating Embeddings for Documents</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">voyageai</span>
<span class="kn">from</span> <span class="nn">pymongo</span> <span class="kn">import</span> <span class="n">MongoClient</span>

<span class="c1"># --- Setup ---
</span><span class="n">vo</span> <span class="o">=</span> <span class="n">voyageai</span><span class="p">.</span><span class="n">Client</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="s">"YOUR_VOYAGE_API_KEY"</span><span class="p">)</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">MongoClient</span><span class="p">(</span><span class="s">"YOUR_MONGODB_CONNECTION_STRING"</span><span class="p">)</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">client</span><span class="p">[</span><span class="s">"sample_mflix"</span><span class="p">]</span>
<span class="n">collection</span> <span class="o">=</span> <span class="n">db</span><span class="p">[</span><span class="s">"movies"</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">generate_embedding</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">]:</span>
    <span class="s">"""
    Generate an embedding vector for a given text using voyage-3.5-lite.
    
    The input_type="document" instructs the model to optimize the embedding
    for storage/retrieval (as opposed to "query" for query-time embeddings).
    """</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">vo</span><span class="p">.</span><span class="n">embed</span><span class="p">(</span>
        <span class="n">texts</span><span class="o">=</span><span class="p">[</span><span class="n">text</span><span class="p">],</span>
        <span class="n">model</span><span class="o">=</span><span class="s">"voyage-3.5-lite"</span><span class="p">,</span>
        <span class="n">input_type</span><span class="o">=</span><span class="s">"document"</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">.</span><span class="n">embeddings</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="c1"># --- Embed and store each document ---
# This iterates over documents that have a 'plot' field but no embedding yet.
</span><span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">collection</span><span class="p">.</span><span class="n">find</span><span class="p">({</span><span class="s">"plot"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$exists"</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span> <span class="s">"plot_embedding"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$exists"</span><span class="p">:</span> <span class="bp">False</span><span class="p">}}):</span>
    <span class="n">embedding</span> <span class="o">=</span> <span class="n">generate_embedding</span><span class="p">(</span><span class="n">doc</span><span class="p">[</span><span class="s">"plot"</span><span class="p">])</span>
    <span class="n">collection</span><span class="p">.</span><span class="n">update_one</span><span class="p">(</span>
        <span class="p">{</span><span class="s">"_id"</span><span class="p">:</span> <span class="n">doc</span><span class="p">[</span><span class="s">"_id"</span><span class="p">]},</span>
        <span class="p">{</span><span class="s">"$set"</span><span class="p">:</span> <span class="p">{</span><span class="s">"plot_embedding"</span><span class="p">:</span> <span class="n">embedding</span><span class="p">}}</span>
    <span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Embedded: </span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'title'</span><span class="p">,</span> <span class="s">'Unknown'</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"Done embedding all documents."</span><span class="p">)</span>
</code></pre></div></div>
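
<p>The loop above makes one API call and one database write per document. For larger collections, a batched variant is usually much faster. The sketch below is one way to do this, assuming the batch size stays within Voyage AI's per-request limits; it embeds documents in chunks and writes the vectors back with <code class="language-plaintext highlighter-rouge">bulk_write</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pymongo import UpdateOne

BATCH_SIZE = 64  # assumption: keep this within the provider's per-request text limit

def embed_and_write(batch: list[dict]) -&gt; None:
    """Embed a batch of documents and persist the vectors in one bulk write."""
    result = vo.embed(
        texts=[d["plot"] for d in batch],
        model="voyage-3.5-lite",
        input_type="document",
    )
    collection.bulk_write([
        UpdateOne({"_id": d["_id"]}, {"$set": {"plot_embedding": emb}})
        for d, emb in zip(batch, result.embeddings)
    ])

batch = []
for doc in collection.find(
    {"plot": {"$exists": True}, "plot_embedding": {"$exists": False}},
    {"plot": 1},
):
    batch.append(doc)
    if len(batch) == BATCH_SIZE:
        embed_and_write(batch)
        batch = []

if batch:          # flush the final partial batch
    embed_and_write(batch)
</code></pre></div></div>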

<p><strong>Why <code class="language-plaintext highlighter-rouge">input_type="document"</code> vs <code class="language-plaintext highlighter-rouge">"query"</code>?</strong></p>

<p>Voyage AI distinguishes between embedding <em>documents</em> (stored content) and <em>queries</em> (search input). Using the correct type ensures the model applies appropriate asymmetric transformations for optimal retrieval performance.</p>

<hr />

<h2 id="step-3--create-a-vector-search-index">Step 3 — Create a Vector Search Index</h2>

<p>A <strong>Vector Search Index</strong> tells MongoDB Atlas which field holds the embedding vectors, how many dimensions those vectors have, and which similarity metric to use.</p>

<h3 id="basic-vector-search-index">Basic Vector Search Index</h3>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// MongoDB Shell</span>
<span class="nx">db</span><span class="p">.</span><span class="nx">movies</span><span class="p">.</span><span class="nx">createSearchIndex</span><span class="p">(</span>
  <span class="dl">"</span><span class="s2">vectorPlotIndex</span><span class="dl">"</span><span class="p">,</span>          <span class="c1">// index name</span>
  <span class="dl">"</span><span class="s2">vectorSearch</span><span class="dl">"</span><span class="p">,</span>             <span class="c1">// index type</span>
  <span class="p">{</span>
    <span class="dl">"</span><span class="s2">fields</span><span class="dl">"</span><span class="p">:</span> <span class="p">[</span>
      <span class="p">{</span>
        <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">vector</span><span class="dl">"</span><span class="p">,</span>
        <span class="dl">"</span><span class="s2">path</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">plot_embedding</span><span class="dl">"</span><span class="p">,</span>   <span class="c1">// field storing the embedding</span>
        <span class="dl">"</span><span class="s2">numDimensions</span><span class="dl">"</span><span class="p">:</span> <span class="mi">1024</span><span class="p">,</span>      <span class="c1">// must match your embedding model's output dimension</span>
        <span class="dl">"</span><span class="s2">similarity</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">cosine</span><span class="dl">"</span>      <span class="c1">// cosine | dotProduct | euclidean</span>
      <span class="p">}</span>
    <span class="p">]</span>
  <span class="p">}</span>
<span class="p">);</span>
</code></pre></div></div>

<blockquote>
  <p><strong>Critical:</strong> <code class="language-plaintext highlighter-rouge">numDimensions</code> must exactly match the dimension your embedding model outputs. For <code class="language-plaintext highlighter-rouge">voyage-3.5-lite</code> with default settings, this is <strong>1024</strong>. Mismatched dimensions cause index failures or zero results.</p>
</blockquote>
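
<p>The same index can also be created from Python. The sketch below uses <code class="language-plaintext highlighter-rouge">SearchIndexModel</code> from <code class="language-plaintext highlighter-rouge">pymongo.operations</code>; this assumes a recent PyMongo version that supports the <code class="language-plaintext highlighter-rouge">type="vectorSearch"</code> argument, so check your driver version:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pymongo.operations import SearchIndexModel

index_model = SearchIndexModel(
    name="vectorPlotIndex",
    type="vectorSearch",
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "plot_embedding",
                "numDimensions": 1024,   # must match the embedding model's output dimension
                "similarity": "cosine",
            }
        ]
    },
)

collection.create_search_index(index_model)
</code></pre></div></div>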

<h3 id="vector-search-index-with-pre-filter-support">Vector Search Index with Pre-filter Support</h3>

<p>If you want to <strong>filter</strong> your vector search results by scalar fields (e.g., year, genre, rating), you must declare those fields as <code class="language-plaintext highlighter-rouge">"type": "filter"</code> in the index definition:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">db</span><span class="p">.</span><span class="nx">movies</span><span class="p">.</span><span class="nx">createSearchIndex</span><span class="p">(</span>
  <span class="dl">"</span><span class="s2">vectorPlotIndex</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">vectorSearch</span><span class="dl">"</span><span class="p">,</span>
  <span class="p">{</span>
    <span class="dl">"</span><span class="s2">fields</span><span class="dl">"</span><span class="p">:</span> <span class="p">[</span>
      <span class="p">{</span>
        <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">vector</span><span class="dl">"</span><span class="p">,</span>
        <span class="dl">"</span><span class="s2">path</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">plot_embedding</span><span class="dl">"</span><span class="p">,</span>
        <span class="dl">"</span><span class="s2">numDimensions</span><span class="dl">"</span><span class="p">:</span> <span class="mi">1024</span><span class="p">,</span>
        <span class="dl">"</span><span class="s2">similarity</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">cosine</span><span class="dl">"</span>
      <span class="p">},</span>
      <span class="p">{</span>
        <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">filter</span><span class="dl">"</span><span class="p">,</span>
        <span class="dl">"</span><span class="s2">path</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">year</span><span class="dl">"</span>          <span class="c1">// enables pre-filtering on the year field</span>
      <span class="p">}</span>
    <span class="p">]</span>
  <span class="p">}</span>
<span class="p">);</span>
</code></pre></div></div>

<blockquote>
  <p><strong>Source:</strong> <a href="https://www.mongodb.com/docs/vector-search/index/vector-search-type/">MongoDB Vector Search Index Reference</a></p>
</blockquote>

<hr />

<h2 id="step-4--generate-a-query-embedding">Step 4 — Generate a Query Embedding</h2>

<p>At query time, you must convert your search text into a vector using the <strong>same model</strong> that was used to embed the documents.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">generate_query_embedding</span><span class="p">(</span><span class="n">query_text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">]:</span>
    <span class="s">"""
    Generate an embedding for a search query using voyage-3.5-lite.
    
    input_type="query" optimizes the embedding for retrieval (asymmetric search).
    This is DIFFERENT from document embeddings — use the correct type!
    """</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">vo</span><span class="p">.</span><span class="n">embed</span><span class="p">(</span>
        <span class="n">texts</span><span class="o">=</span><span class="p">[</span><span class="n">query_text</span><span class="p">],</span>
        <span class="n">model</span><span class="o">=</span><span class="s">"voyage-3.5-lite"</span><span class="p">,</span>
        <span class="n">input_type</span><span class="o">=</span><span class="s">"query"</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">.</span><span class="n">embeddings</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="c1"># Example: generate embedding for a user's search query
</span><span class="n">query_text</span> <span class="o">=</span> <span class="s">"movies about space exploration and astronauts"</span>
<span class="n">query_embedding</span> <span class="o">=</span> <span class="n">generate_query_embedding</span><span class="p">(</span><span class="n">query_text</span><span class="p">)</span>
</code></pre></div></div>

<blockquote>
  <p><strong>Important:</strong> Always use <code class="language-plaintext highlighter-rouge">input_type="query"</code> for query-time embeddings. Using <code class="language-plaintext highlighter-rouge">"document"</code> for queries reduces retrieval quality.</p>
</blockquote>

<hr />

<h2 id="step-5--build-the-vector-search-pipeline">Step 5 — Build the Vector Search Pipeline</h2>

<p>MongoDB Atlas Vector Search uses the <strong><code class="language-plaintext highlighter-rouge">$vectorSearch</code></strong> aggregation stage. It must be the <strong>first stage</strong> in an aggregation pipeline.</p>

<h3 id="the-vectorsearch-stage-syntax">The <code class="language-plaintext highlighter-rouge">$vectorSearch</code> Stage Syntax</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pipeline</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="s">"$vectorSearch"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"index"</span><span class="p">:</span> <span class="s">"vectorPlotIndex"</span><span class="p">,</span>        <span class="c1"># name of the vector search index
</span>            <span class="s">"path"</span><span class="p">:</span> <span class="s">"plot_embedding"</span><span class="p">,</span>          <span class="c1"># field containing the embeddings
</span>            <span class="s">"queryVector"</span><span class="p">:</span> <span class="n">query_embedding</span><span class="p">,</span>    <span class="c1"># the query vector (list of floats)
</span>            <span class="s">"numCandidates"</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span>              <span class="c1"># pool size for ANN search (omit for exact)
</span>            <span class="s">"limit"</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span>                       <span class="c1"># number of final results to return
</span>            <span class="s">"exact"</span><span class="p">:</span> <span class="bp">False</span>                     <span class="c1"># False = ANN search (default), True = exact
</span>        <span class="p">}</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"$project"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"title"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s">"plot"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s">"score"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$meta"</span><span class="p">:</span> <span class="s">"vectorSearchScore"</span><span class="p">}</span>   <span class="c1"># retrieves the similarity score
</span>        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">]</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">collection</span><span class="p">.</span><span class="n">aggregate</span><span class="p">(</span><span class="n">pipeline</span><span class="p">)</span>
<span class="k">for</span> <span class="n">movie</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">movie</span><span class="p">[</span><span class="s">'title'</span><span class="p">]</span><span class="si">}</span><span class="s"> — Score: </span><span class="si">{</span><span class="n">movie</span><span class="p">[</span><span class="s">'score'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  </span><span class="si">{</span><span class="n">movie</span><span class="p">[</span><span class="s">'plot'</span><span class="p">]</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="field-reference">Field Reference</h3>

<table>
  <thead>
    <tr>
      <th>Field</th>
      <th>Required</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">index</code></td>
      <td>✅</td>
      <td>Name of the vector search index to use</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">path</code></td>
      <td>✅</td>
      <td>Dot-notation path to the embedding field in documents</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">queryVector</code></td>
      <td>✅</td>
      <td>The query vector as a list of floats</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">numCandidates</code></td>
      <td>✅ (ANN)</td>
      <td>Number of nearest neighbor candidates to explore; <strong>omit when <code class="language-plaintext highlighter-rouge">exact: true</code></strong></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">limit</code></td>
      <td>✅</td>
      <td>Maximum number of documents returned</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">exact</code></td>
      <td>❌</td>
      <td><code class="language-plaintext highlighter-rouge">false</code> (default) uses ANN/HNSW; <code class="language-plaintext highlighter-rouge">true</code> uses brute-force exact search</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">filter</code></td>
      <td>❌</td>
      <td>MongoDB query expression for pre-filtering (requires filter field in index)</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="step-6--vector-search-with-a-pre-filter">Step 6 — Vector Search with a Pre-Filter</h2>

<p>Pre-filtering narrows the search space <strong>before</strong> vector similarity is computed. This is more efficient than post-filtering with a <code class="language-plaintext highlighter-rouge">$match</code> stage because it avoids examining irrelevant vectors entirely.</p>

<h3 id="why-pre-filtering-requires-index-configuration">Why Pre-Filtering Requires Index Configuration</h3>

<p>When you use a <code class="language-plaintext highlighter-rouge">filter</code> in <code class="language-plaintext highlighter-rouge">$vectorSearch</code>, Atlas must be able to evaluate that filter condition using the vector index metadata. This is why the filter field (e.g., <code class="language-plaintext highlighter-rouge">year</code>) must be declared with <code class="language-plaintext highlighter-rouge">"type": "filter"</code> in the index definition.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pipeline</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="s">"$vectorSearch"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"index"</span><span class="p">:</span> <span class="s">"vectorPlotIndex"</span><span class="p">,</span>
            <span class="s">"path"</span><span class="p">:</span> <span class="s">"plot_embedding"</span><span class="p">,</span>
            <span class="s">"queryVector"</span><span class="p">:</span> <span class="n">query_embedding</span><span class="p">,</span>
            <span class="s">"numCandidates"</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span>
            <span class="s">"filter"</span><span class="p">:</span> <span class="p">{</span><span class="s">"year"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$gt"</span><span class="p">:</span> <span class="mi">2010</span><span class="p">}},</span>    <span class="c1"># pre-filter: only movies after 2010
</span>            <span class="s">"limit"</span><span class="p">:</span> <span class="mi">10</span>
        <span class="p">}</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"$project"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"title"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s">"plot"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s">"year"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s">"score"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$meta"</span><span class="p">:</span> <span class="s">"vectorSearchScore"</span><span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">]</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">collection</span><span class="p">.</span><span class="n">aggregate</span><span class="p">(</span><span class="n">pipeline</span><span class="p">)</span>
<span class="k">for</span> <span class="n">movie</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">movie</span><span class="p">[</span><span class="s">'year'</span><span class="p">]</span><span class="si">}</span><span class="s">] </span><span class="si">{</span><span class="n">movie</span><span class="p">[</span><span class="s">'title'</span><span class="p">]</span><span class="si">}</span><span class="s"> — Score: </span><span class="si">{</span><span class="n">movie</span><span class="p">[</span><span class="s">'score'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="supported-filter-operators">Supported Filter Operators</h3>

<p>The <code class="language-plaintext highlighter-rouge">filter</code> field accepts standard MongoDB query operators on indexed filter fields:</p>

<table>
  <thead>
    <tr>
      <th>Operator</th>
      <th>Example</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">$eq</code></td>
      <td><code class="language-plaintext highlighter-rouge">{"genre": {"$eq": "Action"}}</code></td>
      <td>Exact match</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">$ne</code></td>
      <td><code class="language-plaintext highlighter-rouge">{"genre": {"$ne": "Horror"}}</code></td>
      <td>Not equal</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">$gt</code> / <code class="language-plaintext highlighter-rouge">$gte</code></td>
      <td><code class="language-plaintext highlighter-rouge">{"year": {"$gt": 2010}}</code></td>
      <td>Greater than</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">$lt</code> / <code class="language-plaintext highlighter-rouge">$lte</code></td>
      <td><code class="language-plaintext highlighter-rouge">{"rating": {"$lt": 8.0}}</code></td>
      <td>Less than</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">$in</code></td>
      <td><code class="language-plaintext highlighter-rouge">{"genre": {"$in": ["Action", "Sci-Fi"]}}</code></td>
      <td>Match any in list</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">$and</code></td>
      <td><code class="language-plaintext highlighter-rouge">{"$and": [...]}</code></td>
      <td>Combine multiple conditions</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>Source:</strong> <a href="https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-stage/">MongoDB $vectorSearch Reference</a></p>
</blockquote>
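
<p>Filter conditions can be combined. The example below narrows the search to recent Action or Sci-Fi titles; it assumes that <code class="language-plaintext highlighter-rouge">genre</code>, in addition to <code class="language-plaintext highlighter-rouge">year</code>, has been declared as a <code class="language-plaintext highlighter-rouge">"type": "filter"</code> field in the index:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>combined_filter = {
    "$and": [
        {"year": {"$gte": 2010}},
        {"genre": {"$in": ["Action", "Sci-Fi"]}}   # 'genre' must be an indexed filter field
    ]
}

pipeline = [
    {
        "$vectorSearch": {
            "index": "vectorPlotIndex",
            "path": "plot_embedding",
            "queryVector": query_embedding,
            "numCandidates": 100,
            "filter": combined_filter,
            "limit": 10
        }
    },
    {
        "$project": {
            "title": 1,
            "year": 1,
            "genre": 1,
            "score": {"$meta": "vectorSearchScore"}
        }
    }
]
</code></pre></div></div>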

<hr />

<h2 id="deep-dive-how-hnsw-powers-vector-search">Deep Dive: How HNSW Powers Vector Search</h2>

<p>When you run a vector search query, MongoDB Atlas uses the <strong>Hierarchical Navigable Small World (HNSW)</strong> algorithm to efficiently find approximate nearest neighbors.</p>

<h3 id="the-hnsw-graph-structure">The HNSW Graph Structure</h3>

<p>HNSW builds a <strong>multi-layered graph</strong> during index construction:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Layer 2 (sparse, fast navigation):
    [A] ──────────────── [B]

Layer 1 (intermediate):
    [A] ── [C] ── [B] ── [D]

Layer 0 (all nodes, most edges):
    [A] ── [C] ── [E] ── [B] ── [D] ── [F] ── [G]
</code></pre></div></div>

<ul>
  <li><strong>Layer 0</strong> contains ALL data points with many connections</li>
  <li><strong>Upper layers</strong> contain progressively fewer points (selected probabilistically)</li>
  <li>Each node connects to its <strong>k-nearest neighbors</strong> at each layer</li>
</ul>

<h3 id="the-ann-search-algorithm-greedy-traversal">The ANN Search Algorithm (Greedy Traversal)</h3>

<p>When you submit a query, HNSW searches as follows:</p>

<ol>
  <li><strong>Enter at top layer</strong> — start from a fixed entry point at the highest layer</li>
  <li><strong>Greedy descent</strong> — at each layer, navigate to the neighbor closest to the query vector</li>
  <li><strong>Descend when stuck</strong> — when no neighbor at the current layer is closer than the current node, descend to the layer below</li>
  <li><strong>Wider search at Layer 0</strong> — bounded by the <code class="language-plaintext highlighter-rouge">ef</code> parameter, which determines how many candidate nodes to explore at the base layer</li>
  <li><strong>Return top-k results</strong> — the closest <code class="language-plaintext highlighter-rouge">limit</code> candidates from Layer 0 are returned</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Query: Q = "movies about space exploration"

Layer 2: Enter at node A → navigate toward B (closer to Q)
Layer 1: From B, find D (closer to Q)
Layer 0: From D, exhaustively check neighbors within ef budget → return top 10
</code></pre></div></div>
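
<p>The toy sketch below is purely illustrative (it is not how Atlas implements HNSW internally); it shows the greedy step on a single layer: from the current node, keep hopping to whichever neighbor is closest to the query until no hop improves the distance.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

# Toy single-layer graph: node name maps to (2-D coordinates, neighbor list)
graph = {
    "A": ((0.0, 0.0), ["C"]),
    "C": ((1.0, 0.5), ["A", "E"]),
    "E": ((2.0, 1.0), ["C", "B"]),
    "B": ((3.0, 1.5), ["E", "D"]),
    "D": ((4.0, 2.0), ["B"]),
}

def greedy_search(entry: str, query: tuple) -&gt; str:
    """Greedy traversal: hop to the closest neighbor until no hop improves the distance."""
    current = entry
    while True:
        coords, neighbors = graph[current]
        best = min(neighbors, key=lambda n: math.dist(graph[n][0], query))
        if math.dist(graph[best][0], query) &gt;= math.dist(coords, query):
            return current   # local minimum reached; real HNSW would now descend a layer
        current = best

print(greedy_search("A", (3.2, 1.4)))  # walks A, C, E, B and returns "B"
</code></pre></div></div>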

<h3 id="hnsw-configuration-parameters">HNSW Configuration Parameters</h3>

<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Default</th>
      <th>Range</th>
      <th>Effect</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">m</code> (maxEdges)</td>
      <td>16</td>
      <td>4–96</td>
      <td>Connections per node. Higher = better recall, more memory</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">efConstruction</code></td>
      <td>100</td>
      <td>10–3200</td>
      <td>Candidates during index build. Higher = better index quality, slower build</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">ef</code></td>
      <td>40</td>
      <td>—</td>
      <td>Candidates at query time. Higher = better recall, slower queries</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>Source:</strong> <a href="https://www.mongodb.com/docs/vector-search/index/vector-search-type/">MongoDB HNSW Documentation</a></p>
</blockquote>

<hr />

<h2 id="tuning-numcandidates-for-optimal-performance">Tuning numCandidates for Optimal Performance</h2>

<p><code class="language-plaintext highlighter-rouge">numCandidates</code> controls the pool of candidate vectors that HNSW explores at query time. It directly affects the <strong>recall vs. speed tradeoff</strong>.</p>

<h3 id="recommended-starting-point">Recommended Starting Point</h3>

<blockquote>
  <p><strong>MongoDB recommends setting <code class="language-plaintext highlighter-rouge">numCandidates</code> to at least 10x–20x the value of <code class="language-plaintext highlighter-rouge">limit</code>.</strong></p>
</blockquote>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Example: limit=10, numCandidates=100 → 10x ratio (good baseline)
# For higher recall: numCandidates=200 → 20x ratio
</span></code></pre></div></div>

<h3 id="tuning-guidelines">Tuning Guidelines</h3>

<table>
  <thead>
    <tr>
      <th>Factor</th>
      <th>Guidance</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Index Size</strong></td>
      <td>Larger collections → increase <code class="language-plaintext highlighter-rouge">numCandidates</code>. More vectors means you need a bigger candidate pool to find the true nearest neighbors.</td>
    </tr>
    <tr>
      <td><strong>Limit Value</strong></td>
      <td>Lower <code class="language-plaintext highlighter-rouge">limit</code> → proportionally higher <code class="language-plaintext highlighter-rouge">numCandidates</code> ratio needed. If <code class="language-plaintext highlighter-rouge">limit=5</code>, use <code class="language-plaintext highlighter-rouge">numCandidates &gt;= 100</code>.</td>
    </tr>
    <tr>
      <td><strong>Quantized Vectors</strong></td>
      <td><code class="language-plaintext highlighter-rouge">int8</code>/<code class="language-plaintext highlighter-rouge">binary</code> quantization introduces approximation error → increase <code class="language-plaintext highlighter-rouge">numCandidates</code> to compensate and maintain recall.</td>
    </tr>
    <tr>
      <td><strong>Filter + numCandidates</strong></td>
      <td>When using pre-filters, <code class="language-plaintext highlighter-rouge">numCandidates</code> refers to candidates <em>within the filtered set</em>. If the filtered set is small, a very large <code class="language-plaintext highlighter-rouge">numCandidates</code> only adds latency without improving recall.</td>
    </tr>
  </tbody>
</table>

<h3 id="recall-vs-speed-tradeoff-visualization">Recall vs. Speed Tradeoff Visualization</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>numCandidates = 20   → Fast, lower recall (may miss good results)
numCandidates = 100  → Balanced (recommended starting point)
numCandidates = 500  → Slower, higher recall
numCandidates = 1000 → Approaches exact search quality but much slower
</code></pre></div></div>
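
<p>A practical way to choose a value is to measure recall against exact search on your own data. The sketch below compares the document IDs returned by ANN at several <code class="language-plaintext highlighter-rouge">numCandidates</code> settings with the exact top 10; the brute-force query is expensive, so run this on a sample or a small collection:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def search_ids(query_vector, limit=10, num_candidates=None, exact=False):
    """Run $vectorSearch and return the set of _ids in the result."""
    stage = {
        "index": "vectorPlotIndex",
        "path": "plot_embedding",
        "queryVector": query_vector,
        "limit": limit,
        "exact": exact,
    }
    if not exact:
        stage["numCandidates"] = num_candidates
    docs = collection.aggregate([{"$vectorSearch": stage}, {"$project": {"_id": 1}}])
    return {doc["_id"] for doc in docs}

ground_truth = search_ids(query_embedding, exact=True)   # brute-force top 10

for nc in (20, 100, 500, 1000):
    ann_ids = search_ids(query_embedding, num_candidates=nc)
    recall = len(ann_ids &amp; ground_truth) / len(ground_truth)
    print(f"numCandidates={nc:4d}  recall@10={recall:.2f}")
</code></pre></div></div>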

<hr />

<h2 id="ann-vs-exact-search">ANN vs. Exact Search</h2>

<h3 id="approximate-nearest-neighbor-ann-search--default">Approximate Nearest Neighbor (ANN) Search — Default</h3>

<p>Used when <code class="language-plaintext highlighter-rouge">"exact": False</code> (or <code class="language-plaintext highlighter-rouge">exact</code> is omitted).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
    <span class="s">"$vectorSearch"</span><span class="p">:</span> <span class="p">{</span>
        <span class="s">"index"</span><span class="p">:</span> <span class="s">"vectorPlotIndex"</span><span class="p">,</span>
        <span class="s">"path"</span><span class="p">:</span> <span class="s">"plot_embedding"</span><span class="p">,</span>
        <span class="s">"queryVector"</span><span class="p">:</span> <span class="n">query_embedding</span><span class="p">,</span>
        <span class="s">"numCandidates"</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span>    <span class="c1"># REQUIRED for ANN
</span>        <span class="s">"limit"</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span>
        <span class="s">"exact"</span><span class="p">:</span> <span class="bp">False</span>           <span class="c1"># default — uses HNSW
</span>    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Characteristics:</strong></p>

<ul>
  <li>⚡ <strong>Fast</strong> — O(log n) with HNSW graph traversal</li>
  <li>📊 <strong>High recall in practice</strong> — typically 95-99% of true nearest neighbors</li>
  <li>📈 <strong>Scalable</strong> — works well with millions of vectors</li>
  <li>❌ <strong>Not guaranteed exact</strong> — may occasionally miss a true nearest neighbor</li>
</ul>

<h3 id="exact-brute-force-search">Exact (Brute-Force) Search</h3>

<p>Used when <code class="language-plaintext highlighter-rouge">"exact": True</code>. <strong>Do NOT specify <code class="language-plaintext highlighter-rouge">numCandidates</code></strong> — it is not allowed with exact search and causes an error.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
    <span class="s">"$vectorSearch"</span><span class="p">:</span> <span class="p">{</span>
        <span class="s">"index"</span><span class="p">:</span> <span class="s">"vectorPlotIndex"</span><span class="p">,</span>
        <span class="s">"path"</span><span class="p">:</span> <span class="s">"plot_embedding"</span><span class="p">,</span>
        <span class="s">"queryVector"</span><span class="p">:</span> <span class="n">query_embedding</span><span class="p">,</span>
        <span class="c1"># numCandidates must be OMITTED for exact search
</span>        <span class="s">"limit"</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span>
        <span class="s">"exact"</span><span class="p">:</span> <span class="bp">True</span>           <span class="c1"># brute-force: checks every vector
</span>    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Characteristics:</strong></p>

<ul>
  <li>✅ <strong>Guaranteed correct</strong> — always returns the true nearest neighbors</li>
  <li>🐢 <strong>Slow</strong> — O(n) — computes distance to every vector in the collection</li>
  <li>⚠️ <strong>Not production-ready</strong> for large datasets — use for small datasets or validation only</li>
  <li>🔬 <strong>Best use case</strong> — benchmarking and validating ANN results</li>
</ul>

<h3 id="when-to-use-each">When to Use Each</h3>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Recommendation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Production queries on large collections</td>
      <td>ANN (<code class="language-plaintext highlighter-rouge">exact: False</code>)</td>
    </tr>
    <tr>
      <td>Development/debugging</td>
      <td>Either; ANN is usually fine</td>
    </tr>
    <tr>
      <td>Validating ANN recall quality</td>
      <td>Exact (<code class="language-plaintext highlighter-rouge">exact: True</code>) on a sample</td>
    </tr>
    <tr>
      <td>Collections &lt; 1,000 vectors</td>
      <td>Either; difference is negligible</td>
    </tr>
    <tr>
      <td>RAG pipelines</td>
      <td>ANN with well-tuned <code class="language-plaintext highlighter-rouge">numCandidates</code></td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="understanding-vectorsearchscore">Understanding vectorSearchScore</h2>

<p>The <code class="language-plaintext highlighter-rouge">$meta: "vectorSearchScore"</code> expression retrieves the similarity score for each result. Understanding what this score means helps you set meaningful confidence thresholds.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
    <span class="s">"$project"</span><span class="p">:</span> <span class="p">{</span>
        <span class="s">"title"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
        <span class="s">"score"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$meta"</span><span class="p">:</span> <span class="s">"vectorSearchScore"</span><span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="score-interpretation-by-similarity-metric">Score Interpretation by Similarity Metric</h3>

<table>
  <thead>
    <tr>
      <th>Similarity</th>
      <th>Score Range</th>
      <th>Higher = ?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>cosine</strong></td>
      <td>0.0 – 1.0</td>
      <td>More similar (1.0 = identical direction)</td>
    </tr>
    <tr>
      <td><strong>dotProduct</strong></td>
      <td>Unbounded</td>
      <td>More similar</td>
    </tr>
    <tr>
      <td><strong>euclidean</strong></td>
      <td>0.0 – 1.0 (normalized)</td>
      <td>More similar (inverted distance)</td>
    </tr>
  </tbody>
</table>

<h3 id="using-scores-as-confidence-thresholds">Using Scores as Confidence Thresholds</h3>

<p>You can post-filter results by score using a <code class="language-plaintext highlighter-rouge">$match</code> stage after <code class="language-plaintext highlighter-rouge">$vectorSearch</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pipeline</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="s">"$vectorSearch"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"index"</span><span class="p">:</span> <span class="s">"vectorPlotIndex"</span><span class="p">,</span>
            <span class="s">"path"</span><span class="p">:</span> <span class="s">"plot_embedding"</span><span class="p">,</span>
            <span class="s">"queryVector"</span><span class="p">:</span> <span class="n">query_embedding</span><span class="p">,</span>
            <span class="s">"numCandidates"</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span>
            <span class="s">"limit"</span><span class="p">:</span> <span class="mi">50</span>           <span class="c1"># fetch more candidates
</span>        <span class="p">}</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="c1"># Materialize the score so the $match stage below can filter on it
</span>        <span class="s">"$addFields"</span><span class="p">:</span> <span class="p">{</span><span class="s">"score"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$meta"</span><span class="p">:</span> <span class="s">"vectorSearchScore"</span><span class="p">}}</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="c1"># Post-filter: only keep results with similarity &gt; 0.75
</span>        <span class="s">"$match"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"score"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$gt"</span><span class="p">:</span> <span class="mf">0.75</span><span class="p">}</span>
        <span class="p">}</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"$project"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"title"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s">"score"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$meta"</span><span class="p">:</span> <span class="s">"vectorSearchScore"</span><span class="p">},</span>
            <span class="s">"plot"</span><span class="p">:</span> <span class="mi">1</span>
        <span class="p">}</span>
    <span class="p">},</span>
    <span class="p">{</span><span class="s">"$limit"</span><span class="p">:</span> <span class="mi">10</span><span class="p">}</span>              <span class="c1"># then limit final output
</span><span class="p">]</span>
</code></pre></div></div>

<blockquote>
  <p>⚠️ <strong>Note:</strong> <code class="language-plaintext highlighter-rouge">$match</code> on <code class="language-plaintext highlighter-rouge">score</code> is a <strong>post-filter</strong> and runs after the vector search; the score field must first be materialized with <code class="language-plaintext highlighter-rouge">$addFields</code>, as shown above, because <code class="language-plaintext highlighter-rouge">$vectorSearch</code> does not add it to documents automatically. It does not reduce the number of vectors examined — it only filters the returned results. This is different from the <code class="language-plaintext highlighter-rouge">filter</code> parameter in <code class="language-plaintext highlighter-rouge">$vectorSearch</code>.</p>
</blockquote>

<hr />

<h2 id="common-pitfalls">Common Pitfalls</h2>

<h3 id="1-mismatched-embedding-dimensions">1. Mismatched Embedding Dimensions</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error: Vector dimension mismatch
</code></pre></div></div>

<p><strong>Cause:</strong> <code class="language-plaintext highlighter-rouge">numDimensions</code> in the index ≠ actual length of the embedding vector.<br />
<strong>Fix:</strong> Ensure the dimension in the index definition exactly matches your embedding model’s output dimension (e.g., 1024 for <code class="language-plaintext highlighter-rouge">voyage-3.5-lite</code> default).</p>
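
<p>A quick sanity check before bulk-embedding can catch this early:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>EXPECTED_DIMS = 1024  # must equal numDimensions in the vector search index

sample = vo.embed(
    texts=["dimension check"], model="voyage-3.5-lite", input_type="document"
).embeddings[0]

assert len(sample) == EXPECTED_DIMS, f"Got {len(sample)} dims, expected {EXPECTED_DIMS}"
</code></pre></div></div>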

<hr />

<h3 id="2-using-numcandidates-with-exact-true">2. Using <code class="language-plaintext highlighter-rouge">numCandidates</code> with <code class="language-plaintext highlighter-rouge">exact: True</code></h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error: numCandidates cannot be specified with exact search
</code></pre></div></div>

<p><strong>Fix:</strong> Remove <code class="language-plaintext highlighter-rouge">numCandidates</code> when setting <code class="language-plaintext highlighter-rouge">"exact": True</code>.</p>

<hr />

<h3 id="3-filter-field-not-in-index">3. Filter Field Not in Index</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error: Filter field 'year' is not indexed
</code></pre></div></div>

<p><strong>Cause:</strong> Trying to use <code class="language-plaintext highlighter-rouge">filter: {"year": ...}</code> when <code class="language-plaintext highlighter-rouge">year</code> was not added as a <code class="language-plaintext highlighter-rouge">"type": "filter"</code> field in the vector search index.<br />
<strong>Fix:</strong> Recreate the index including <code class="language-plaintext highlighter-rouge">{"type": "filter", "path": "year"}</code>.</p>

<hr />

<h3 id="4-different-models-for-documents-and-queries">4. Different Models for Documents and Queries</h3>

<p><strong>Cause:</strong> Embedding documents with <code class="language-plaintext highlighter-rouge">voyage-3.5-lite</code> but querying with <code class="language-plaintext highlighter-rouge">text-embedding-ada-002</code> (or any other model).<br />
<strong>Effect:</strong> Vectors live in completely different semantic spaces — results will be meaningless.<br />
<strong>Fix:</strong> Always use <strong>the same model</strong> and <strong>the same dimension</strong> for both document embeddings and query embeddings.</p>

<hr />

<h3 id="5-low-numcandidates--poor-recall">5. Low numCandidates → Poor Recall</h3>

<p><strong>Symptom:</strong> Vector search returns results that don’t seem semantically relevant.<br />
<strong>Fix:</strong> Increase <code class="language-plaintext highlighter-rouge">numCandidates</code>. Start at <code class="language-plaintext highlighter-rouge">10x limit</code> and scale up. Validate against exact search.</p>

<hr />

<h2 id="quick-reference">Quick Reference</h2>

<h3 id="complete-end-to-end-example">Complete End-to-End Example</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">voyageai</span>
<span class="kn">from</span> <span class="nn">pymongo</span> <span class="kn">import</span> <span class="n">MongoClient</span>

<span class="c1"># Setup
</span><span class="n">vo</span> <span class="o">=</span> <span class="n">voyageai</span><span class="p">.</span><span class="n">Client</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="s">"YOUR_VOYAGE_API_KEY"</span><span class="p">)</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">MongoClient</span><span class="p">(</span><span class="s">"YOUR_MONGODB_CONNECTION_STRING"</span><span class="p">)</span>
<span class="n">collection</span> <span class="o">=</span> <span class="n">client</span><span class="p">[</span><span class="s">"sample_mflix"</span><span class="p">][</span><span class="s">"movies"</span><span class="p">]</span>

<span class="c1"># Generate query embedding
</span><span class="n">query</span> <span class="o">=</span> <span class="s">"sci-fi movies set in outer space with dramatic storylines"</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">vo</span><span class="p">.</span><span class="n">embed</span><span class="p">(</span><span class="n">texts</span><span class="o">=</span><span class="p">[</span><span class="n">query</span><span class="p">],</span> <span class="n">model</span><span class="o">=</span><span class="s">"voyage-3.5-lite"</span><span class="p">,</span> <span class="n">input_type</span><span class="o">=</span><span class="s">"query"</span><span class="p">)</span>
<span class="n">query_embedding</span> <span class="o">=</span> <span class="n">result</span><span class="p">.</span><span class="n">embeddings</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="c1"># ---- Basic Vector Search ----
</span><span class="n">pipeline</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="s">"$vectorSearch"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"exact"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
            <span class="s">"index"</span><span class="p">:</span> <span class="s">"vectorPlotIndex"</span><span class="p">,</span>
            <span class="s">"path"</span><span class="p">:</span> <span class="s">"plot_embedding"</span><span class="p">,</span>
            <span class="s">"queryVector"</span><span class="p">:</span> <span class="n">query_embedding</span><span class="p">,</span>
            <span class="s">"numCandidates"</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span>
            <span class="s">"limit"</span><span class="p">:</span> <span class="mi">10</span>
        <span class="p">}</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"$project"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"title"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s">"plot"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s">"score"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$meta"</span><span class="p">:</span> <span class="s">"vectorSearchScore"</span><span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">]</span>

<span class="c1"># Execute
</span><span class="n">x</span> <span class="o">=</span> <span class="n">collection</span><span class="p">.</span><span class="n">aggregate</span><span class="p">(</span><span class="n">pipeline</span><span class="p">)</span>
<span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">x</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">doc</span><span class="p">[</span><span class="s">'score'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">] </span><span class="si">{</span><span class="n">doc</span><span class="p">[</span><span class="s">'title'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># ---- Filtered Vector Search (movies after 2010) ----
</span><span class="n">filtered_pipeline</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="s">"$vectorSearch"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"index"</span><span class="p">:</span> <span class="s">"vectorPlotIndex"</span><span class="p">,</span>
            <span class="s">"path"</span><span class="p">:</span> <span class="s">"plot_embedding"</span><span class="p">,</span>
            <span class="s">"queryVector"</span><span class="p">:</span> <span class="n">query_embedding</span><span class="p">,</span>
            <span class="s">"numCandidates"</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span>
            <span class="s">"filter"</span><span class="p">:</span> <span class="p">{</span><span class="s">"year"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$gt"</span><span class="p">:</span> <span class="mi">2010</span><span class="p">}},</span>
            <span class="s">"limit"</span><span class="p">:</span> <span class="mi">10</span>
        <span class="p">}</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"$project"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"title"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s">"plot"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s">"year"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s">"score"</span><span class="p">:</span> <span class="p">{</span><span class="s">"$meta"</span><span class="p">:</span> <span class="s">"vectorSearchScore"</span><span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">]</span>

<span class="n">y</span> <span class="o">=</span> <span class="n">collection</span><span class="p">.</span><span class="n">aggregate</span><span class="p">(</span><span class="n">filtered_pipeline</span><span class="p">)</span>
<span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">y</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">doc</span><span class="p">[</span><span class="s">'year'</span><span class="p">]</span><span class="si">}</span><span class="s">] [</span><span class="si">{</span><span class="n">doc</span><span class="p">[</span><span class="s">'score'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">] </span><span class="si">{</span><span class="n">doc</span><span class="p">[</span><span class="s">'title'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h2 id="references">References</h2>

<table>
  <thead>
    <tr>
      <th>Resource</th>
      <th>URL</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MongoDB Atlas Vector Search Docs</td>
      <td><a href="https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-stage/">https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-stage/</a></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">$vectorSearch</code> Query Reference</td>
      <td><a href="https://www.mongodb.com/docs/vector-search/query/aggregation-stages/vector-search-stage/">https://www.mongodb.com/docs/vector-search/query/aggregation-stages/vector-search-stage/</a></td>
    </tr>
    <tr>
      <td>Vector Search Index Reference</td>
      <td><a href="https://www.mongodb.com/docs/vector-search/index/vector-search-type/">https://www.mongodb.com/docs/vector-search/index/vector-search-type/</a></td>
    </tr>
    <tr>
      <td>Voyage AI voyage-3.5-lite Model</td>
      <td><a href="https://docs.voyageai.com/docs/embeddings">https://docs.voyageai.com/docs/embeddings</a></td>
    </tr>
    <tr>
      <td>HNSW Algorithm (Original Paper)</td>
      <td><a href="https://arxiv.org/abs/1603.09320">https://arxiv.org/abs/1603.09320</a></td>
    </tr>
    <tr>
      <td>Related Tutorial in This Repo</td>
      <td><code class="language-plaintext highlighter-rouge">MongoDB_IndexingAlgorithms.md</code> (HNSW, ANN, Skip Lists)</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="about-the-author">About the Author</h2>

<p><strong>KrishnaMohan Seelam</strong> — Senior Engineer</p>

<p>I write about developer tools, databases, and applied AI.</p>

<p>If you found this useful, give it a 👏 and follow me for more!</p>

<p><a href="https://github.com/krishnamohan-seelam/">GitHub</a></p>]]></content><author><name></name></author><category term="mongodb" /><category term="python" /><category term="vector-search" /><category term="mongodb" /><category term="python" /><category term="vector-search" /><summary type="html"><![CDATA[Introduction to Vector Search Queries in MongoDB Atlas]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://krishnamohan-seelam.github.io/mongodb-vectorsearch/assets/mongodb_vector_search.png" /><media:content medium="image" url="https://krishnamohan-seelam.github.io/mongodb-vectorsearch/assets/mongodb_vector_search.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Sparse and Dense Vectors in MongoDB Atlas</title><link href="https://krishnamohan-seelam.github.io/mongodb-vectorsearch/mongodb/vector-search/nlp/2026/04/29/Sparse-and-DenseVectors.html" rel="alternate" type="text/html" title="Sparse and Dense Vectors in MongoDB Atlas" /><published>2026-04-29T00:00:00+00:00</published><updated>2026-04-29T00:00:00+00:00</updated><id>https://krishnamohan-seelam.github.io/mongodb-vectorsearch/mongodb/vector-search/nlp/2026/04/29/Sparse%20and%20DenseVectors</id><content type="html" xml:base="https://krishnamohan-seelam.github.io/mongodb-vectorsearch/mongodb/vector-search/nlp/2026/04/29/Sparse-and-DenseVectors.html"><![CDATA[<h2 id="sparse-and-dense-vectors-in-mongodb-atlas">Sparse and Dense Vectors in MongoDB Atlas</h2>

<p>A guide to TF-IDF, Sparse Vectors, Dense Vectors, and Atlas Vector Search</p>

<h3 id="1-sparse-vectors-vs-dense-vectors">1. Sparse Vectors vs Dense Vectors</h3>

<p>MongoDB Atlas uses two fundamentally different vector types, each optimised for a different kind of search:</p>

<p>• Sparse vectors — suited for text/lexical search, used in MongoDB Atlas Search.<br />
• Dense vectors — suited for semantic search, used in MongoDB Atlas Vector Search.</p>

<p><img src="/mongodb-vectorsearch/assets/sparse_dense_vectors_image_1.png" alt="alt text" /></p>

<blockquote>
  <p>Figure 1 — Sparse vectors are high-dimensional but efficient (most values are zero). Dense vectors encode rich meaning across all dimensions, with very few zero values.</p>
</blockquote>

<h3 id="11-sparse-vectors">1.1 Sparse Vectors</h3>

<p>Sparse vectors are high-dimensional representations where most dimension values are zero. Only the dimensions corresponding to words that actually appear in a document carry a non-zero value (typically a TF-IDF score). Because only non-zero values need to be stored, sparse vectors are highly memory-efficient even with vocabularies containing hundreds of thousands of terms.</p>

<p>• High-dimensional but memory-efficient — only non-zero values are stored.<br />
• Represent the presence or absence of specific terms within a document.<br />
• Best for exact keyword and lexical search scenarios.</p>

<h3 id="12-dense-vectors">1.2 Dense Vectors</h3>

<p>Dense vectors are generated by transformer-based embedding models (such as BERT or OpenAI embeddings) and encode rich contextual meaning across all their dimensions. Unlike sparse vectors, very few values are zero. They typically have hundreds to a few thousand dimensions and capture complex semantic relationships that go far beyond simple word matching.</p>

<p>• Hundreds to a few thousand dimensions, very few zero values.<br />
• Generated by transformer-based embedding models (e.g., BERT, OpenAI embeddings).<br />
• Best for semantic/conceptual search — natural language, image processing.</p>

<h3 id="2-tf-idf">2. TF-IDF</h3>

<p>TF-IDF (Term Frequency–Inverse Document Frequency) combines two measures: how frequently a word appears in a document (TF), and how unique that word is across the corpus (IDF). The resulting score reflects how important a word is to a specific document. Words common across all documents score low; words distinctive to one document score high.</p>

<blockquote>
  <p>Note: In MongoDB Atlas, BM25 is the underlying algorithm used by Atlas Search. TF-IDF is presented here as a conceptual foundation because it shares the same core intuition and is easier to demonstrate step by step.</p>
</blockquote>

<h3 id="21-formulas">2.1 Formulas</h3>

<p><img src="/mongodb-vectorsearch/assets/sparse_dense_vectors_image_2.png" alt="alt text" /></p>

<blockquote>
  <p>Figure 2 — TF-IDF formula breakdown. Note: log base 10 is used in this tutorial. Implementations may use the natural log (ln) or log base 2 — always check the library or database documentation.</p>
</blockquote>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TF  = (number of times "word" appears) / (total words in document)

IDF = log10( Total documents in corpus / Documents containing "word" )
  [Note: log base 10 used here; implementations may use ln or log2]

TF-IDF = TF × IDF
</code></pre></div></div>

<h3 id="22-worked-example">2.2 Worked Example</h3>

<p>Consider the following three-document corpus:</p>

<ul>
  <li>Doc 1 — Atlas the platform</li>
  <li>Doc 2 — Atlas the Titan</li>
  <li>Doc 3 — Atlas the mountain</li>
</ul>

<h4 id="step-1--term-frequency-tf">Step 1 — Term Frequency (TF)</h4>

<p>Each sentence has 3 words, so every word has TF = 1/3 ≈ 0.333. The table below shows the words of Document 1; the values are identical for the other two documents.</p>

<table>
  <thead>
    <tr>
      <th>Word</th>
      <th style="text-align: right">Occurrences</th>
      <th style="text-align: right">Total Words</th>
      <th style="text-align: left">TF</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Atlas</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">3</td>
      <td style="text-align: left">0.333</td>
    </tr>
    <tr>
      <td>the</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">3</td>
      <td style="text-align: left">0.333</td>
    </tr>
    <tr>
      <td>platform</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">3</td>
      <td style="text-align: left">0.333</td>
    </tr>
  </tbody>
</table>

<h4 id="step-2--inverse-document-frequency-idf">Step 2 — Inverse Document Frequency (IDF)</h4>

<p>“Atlas” and “the” appear in all 3 documents, so IDF = log10(3/3) = 0.<br />
 Unique words like “platform”, “Titan”, and “mountain” appear in only 1 document: IDF = log10(3/1) ≈ 0.477.</p>

<table>
  <thead>
    <tr>
      <th>Word</th>
      <th style="text-align: right">Total Docs</th>
      <th style="text-align: right">Docs with Word</th>
      <th style="text-align: right">IDF (log10)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Atlas</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">0.000</td>
    </tr>
    <tr>
      <td>the</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">0.000</td>
    </tr>
    <tr>
      <td>platform</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0.477</td>
    </tr>
    <tr>
      <td>Titan</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0.477</td>
    </tr>
    <tr>
      <td>mountain</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0.477</td>
    </tr>
  </tbody>
</table>

<h4 id="step-3--tf-idf-scores">Step 3 — TF-IDF Scores</h4>

<p>Multiply TF × IDF for each word in each document.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">Document</th>
      <th>Word</th>
      <th style="text-align: left">TF</th>
      <th style="text-align: left">IDF</th>
      <th style="text-align: left">TF‑IDF</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">1</td>
      <td>Atlas</td>
      <td style="text-align: left">0.333</td>
      <td style="text-align: left">0.000</td>
      <td style="text-align: left">0.000</td>
    </tr>
    <tr>
      <td style="text-align: right">1</td>
      <td>the</td>
      <td style="text-align: left">0.333</td>
      <td style="text-align: left">0.000</td>
      <td style="text-align: left">0.000</td>
    </tr>
    <tr>
      <td style="text-align: right">1</td>
      <td>platform</td>
      <td style="text-align: left">0.333</td>
      <td style="text-align: left">0.477</td>
      <td style="text-align: left">0.159</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td>Atlas</td>
      <td style="text-align: left">0.333</td>
      <td style="text-align: left">0.000</td>
      <td style="text-align: left">0.000</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td>the</td>
      <td style="text-align: left">0.333</td>
      <td style="text-align: left">0.000</td>
      <td style="text-align: left">0.000</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td>Titan</td>
      <td style="text-align: left">0.333</td>
      <td style="text-align: left">0.477</td>
      <td style="text-align: left">0.159</td>
    </tr>
    <tr>
      <td style="text-align: right">3</td>
      <td>Atlas</td>
      <td style="text-align: left">0.333</td>
      <td style="text-align: left">0.000</td>
      <td style="text-align: left">0.000</td>
    </tr>
    <tr>
      <td style="text-align: right">3</td>
      <td>the</td>
      <td style="text-align: left">0.333</td>
      <td style="text-align: left">0.000</td>
      <td style="text-align: left">0.000</td>
    </tr>
    <tr>
      <td style="text-align: right">3</td>
      <td>mountain</td>
      <td style="text-align: left">0.333</td>
      <td style="text-align: left">0.477</td>
      <td style="text-align: left">0.159</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p>“Platform” in Document 1 scores TF‑IDF = 0.333 × 0.477 ≈ 0.159, while “Atlas” and “the” score 0 because they appear in every document and carry no distinguishing power.</p>
</blockquote>
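
<p>To make the arithmetic concrete, here is a minimal Python sketch that reproduces the TF-IDF scores above using only the standard library. The corpus and the log base 10 convention match the worked example; the helper functions are illustrative, not part of any library.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

# Three-document corpus from the worked example
corpus = [
    "Atlas the platform",
    "Atlas the Titan",
    "Atlas the mountain",
]
docs = [sentence.split() for sentence in corpus]

def tf(word, doc):
    # Term frequency: occurrences of the word / total words in the document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency, log base 10 as in the tables above
    docs_with_word = sum(1 for doc in docs if word in doc)
    return math.log10(len(docs) / docs_with_word)

for doc_number, doc in enumerate(docs, start=1):
    for word in doc:
        score = tf(word, doc) * idf(word, docs)
        print(f"Doc {doc_number} | {word:9} | TF-IDF = {score:.3f}")

# Expected: "platform", "Titan" and "mountain" score ~0.159; "Atlas" and "the" score 0.000
</code></pre></div></div>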

<h3 id="3-sparse-vector-representation">3. Sparse Vector Representation</h3>

<p>A sparse vector represents a document as a vector in a vocabulary-sized space. Each dimension corresponds to one unique word in the corpus; its value is the TF-IDF score for that word in the document. Because most words are absent from any given document, most values are zero.</p>

<p>Vocabulary: [Atlas, the, platform, Titan, mountain]</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">Document</th>
      <th style="text-align: right">Atlas</th>
      <th style="text-align: right">the</th>
      <th style="text-align: right">platform</th>
      <th style="text-align: right">Titan</th>
      <th style="text-align: right">mountain</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0.159</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0.159</td>
      <td style="text-align: right">0</td>
    </tr>
    <tr>
      <td style="text-align: right">3</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0.159</td>
    </tr>
  </tbody>
</table>
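
<p>Because most entries are zero, a sparse vector does not need to be stored as a full array. Below is a minimal sketch of the compact form, assuming the five-word vocabulary above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Vocabulary order: [Atlas, the, platform, Titan, mountain]
vocabulary = ["Atlas", "the", "platform", "Titan", "mountain"]

# Document 1 as a full vector (mostly zeros)
doc1_full = [0.0, 0.0, 0.159, 0.0, 0.0]

# Compact form: keep only the non-zero dimensions
doc1_sparse = {index: value for index, value in enumerate(doc1_full) if value}
print(doc1_sparse)  # {2: 0.159} -- one stored entry instead of five
</code></pre></div></div>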

<h3 id="4-dense-vectors--semantic-search">4. Dense Vectors &amp; Semantic Search</h3>

<p>Dense vectors encode rich contextual meaning, enabling semantic search — finding results based on meaning rather than exact keyword matches. Consider these two sentences:</p>

<p>• “Atlas is a powerful developer data platform”<br />
• “Atlas is a titan from ancient Greek scriptures and serves as a symbol of endurance”</p>

<p>Even though both sentences share the word “Atlas”, they describe completely different concepts. A dense embedding model captures this distinction by generating vectors with very different values across semantic dimensions.</p>
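
<p>As a rough sketch of how such dense vectors are produced in practice, the snippet below embeds both sentences with the Voyage AI client used elsewhere in this series. The API key is a placeholder, and the exact dimensionality depends on the model you choose.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import voyageai

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")  # placeholder key

sentences = [
    "Atlas is a powerful developer data platform",
    "Atlas is a titan from ancient Greek scriptures and serves as a symbol of endurance",
]

# input_type="document" marks these as texts to be stored and compared, not queries
result = vo.embed(texts=sentences, model="voyage-3.5-lite", input_type="document")
embedding_1, embedding_2 = result.embeddings

print(len(embedding_1))  # number of dense dimensions produced by the model
</code></pre></div></div>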

<h3 id="41-example-embedding-dimensions">4.1 Example Embedding Dimensions</h3>

<p>For illustration, assume an embedding model projects each sentence onto 6 semantic dimensions:</p>

<table>
  <thead>
    <tr>
      <th>Sentence</th>
      <th style="text-align: right">atlas_product</th>
      <th style="text-align: right">developers</th>
      <th style="text-align: right">databases</th>
      <th style="text-align: right">titan_myth</th>
      <th style="text-align: right">scriptures</th>
      <th style="text-align: right">endurance</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Sentence 1</td>
      <td style="text-align: right">0.9</td>
      <td style="text-align: right">0.8</td>
      <td style="text-align: right">0.9</td>
      <td style="text-align: right">0.1</td>
      <td style="text-align: right">0.0</td>
      <td style="text-align: right">0.2</td>
    </tr>
    <tr>
      <td>Sentence 2</td>
      <td style="text-align: right">0.1</td>
      <td style="text-align: right">0.2</td>
      <td style="text-align: right">0.0</td>
      <td style="text-align: right">0.9</td>
      <td style="text-align: right">0.8</td>
      <td style="text-align: right">0.9</td>
    </tr>
  </tbody>
</table>

<h3 id="42-cosine-similarity">4.2 Cosine Similarity</h3>

<p>Semantic search ranks documents by cosine similarity — the cosine of the angle between two vectors in the embedding space. A score of 1 means identical direction (same meaning); 0 means orthogonal (unrelated).</p>

<p><img src="/mongodb-vectorsearch/assets/sparse_dense_vectors_image_3.png" alt="alt text" /></p>

<blockquote>
  <p>Figure 3 — The two vectors point in very different directions, confirming a low cosine similarity (~0.225) despite both sentences containing “Atlas”.</p>
</blockquote>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
A = [0.9, 0.8, 0.9, 0.1, 0.0, 0.2]   (Sentence 1)
B = [0.1, 0.2, 0.0, 0.9, 0.8, 0.9]   (Sentence 2)

Dot product (A · B):
  (0.9×0.1) + (0.8×0.2) + (0.9×0.0) + (0.1×0.9) + (0.0×0.8) + (0.2×0.9)
= 0.09 + 0.16 + 0.00 + 0.09 + 0.00 + 0.18 = 0.52

Magnitude |A| = sqrt(0.81+0.64+0.81+0.01+0.00+0.04) = sqrt(2.31) ≈ 1.52
Magnitude |B| = sqrt(0.01+0.04+0.00+0.81+0.64+0.81) = sqrt(2.31) ≈ 1.52

Cosine Similarity = 0.52 / (1.52 × 1.52) = 0.52 / 2.31 ≈ 0.225
Low similarity — the sentences have very different meanings.
</code></pre></div></div>
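
<p>The same calculation in a few lines of Python, confirming the hand-worked result:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

A = [0.9, 0.8, 0.9, 0.1, 0.0, 0.2]  # Sentence 1
B = [0.1, 0.2, 0.0, 0.9, 0.8, 0.9]  # Sentence 2

dot_product = sum(a * b for a, b in zip(A, B))
magnitude_a = math.sqrt(sum(a * a for a in A))
magnitude_b = math.sqrt(sum(b * b for b in B))

cosine_similarity = dot_product / (magnitude_a * magnitude_b)
print(f"{cosine_similarity:.3f}")  # ~0.225 -- low similarity, different meanings
</code></pre></div></div>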

<h3 id="5-mongodb-atlas-vector-search">5. MongoDB Atlas Vector Search</h3>

<p>Atlas Vector Search stores dense embeddings alongside documents in MongoDB. At query time, the query text is embedded using the same model, and MongoDB returns the documents whose vectors are most similar (highest cosine similarity score). This means a search for “developer tools” can match “Atlas platform” semantically, even with zero keyword overlap.</p>

<blockquote>
  <p><strong>Key point</strong>: <em>Atlas Search (BM25/sparse) is ideal for keyword precision. Atlas Vector Search (dense embeddings) is ideal for conceptual or natural-language queries. Many production applications combine both — a technique known as <strong>hybrid search</strong>.</em></p>
</blockquote>]]></content><author><name></name></author><category term="mongodb" /><category term="vector-search" /><category term="nlp" /><category term="mongodb" /><category term="vector-search" /><category term="tf-idf" /><category term="embeddings" /><summary type="html"><![CDATA[Sparse and Dense Vectors in MongoDB Atlas]]></summary></entry></feed>