Glossary
AI Search & Retrieval, in Plain English
Definitions for the terms behind semantic search, vector databases, and retrieval-augmented generation. Each entry links back to how Moss implements it.
Alpha (Hybrid Weighting)
Alpha is a number between 0 and 1 that blends semantic and keyword results in hybrid search. 1.0 is pure semantic, 0.0 is pure keyword, 0.6 means 60% semantic / 40% keyword.
Alpha is the weighting parameter used in hybrid retrieval systems to control how much a query leans on vector similarity versus lexical (BM25) matching. When alpha = 1.0, results are ranked purely by embedding similarity. When alpha = 0.0, results are ranked by keyword overlap alone. Most production systems use a semantic-heavy default around 0.7–0.8, then tune per workload. Moss exposes alpha on every query so you can A/B test without reindexing.
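In code, the blend is just a weighted sum of the two retrievers' scores after they are normalized to the same range. A minimal sketch, not Moss's actual scoring code:

```typescript
// Hypothetical illustration of alpha weighting, not Moss's internal scoring code.
// Both inputs are assumed to be normalized to the 0..1 range first.
function hybridScore(semanticScore: number, keywordScore: number, alpha: number): number {
  return alpha * semanticScore + (1 - alpha) * keywordScore;
}

// alpha = 0.7: a document with weak semantic similarity but strong keyword
// overlap still surfaces, just weighted toward the semantic side.
const blended = hybridScore(0.32, 0.91, 0.7); // ≈ 0.50
```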
BM25
BM25 is a ranking function that scores documents by keyword relevance — how often a query term appears, adjusted for document length and term rarity across the corpus.
BM25 (Best Matching 25) is the most widely deployed lexical ranking function in information retrieval. It extends TF-IDF with term-frequency saturation and length normalization, so longer documents do not automatically outrank shorter ones just for containing more occurrences of a word. BM25 remains essential in hybrid search because it catches exact-match cases (product codes, error strings, proper nouns) that dense vector search can miss.
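The scoring function itself is compact. A single-term sketch with the common k1 and b defaults (illustrative, not tied to any particular search engine's implementation):

```typescript
// Minimal BM25 score for one query term in one document (standard formula,
// shown for illustration). k1 controls term-frequency saturation; b controls
// how strongly scores are normalized by document length.
function bm25Term(
  tf: number,        // occurrences of the term in this document
  docLen: number,    // length of this document, in tokens
  avgDocLen: number, // average document length across the corpus
  df: number,        // number of documents containing the term
  numDocs: number,   // total documents in the corpus
  k1 = 1.2,
  b = 0.75,
): number {
  const idf = Math.log(1 + (numDocs - df + 0.5) / (df + 0.5)); // rarer terms weigh more
  const norm = tf + k1 * (1 - b + b * (docLen / avgDocLen));   // length normalization
  return (idf * tf * (k1 + 1)) / norm;
}

// A document's score for a full query is the sum of bm25Term over every query term.
```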
Bring Your Own Embeddings (BYOE)
BYOE means passing pre-computed vectors into an index instead of letting the search runtime generate them. You choose the embedding model; the runtime only stores and queries.
Bring Your Own Embeddings lets you generate vectors with any external model — OpenAI text-embedding-3-large, Cohere, a domain-fine-tuned model, or a local SentenceTransformer — and hand them to the search runtime alongside each document. Moss supports BYOE by accepting an optional `embedding` field on `DocumentInfo` at index time and on `QueryOptions` at query time. The same model must be used for both indexing and querying; mixing models silently degrades recall.
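A rough sketch of the flow, assuming a hypothetical client object and embedding helper (only the optional `embedding` fields come from the description above; every other name is illustrative):

```typescript
// Illustrative only: embedText and client are hypothetical stand-ins for your
// embedding provider and search client, not a documented Moss API.
interface DocumentInfo { id: string; text: string; embedding?: number[] }
interface QueryOptions { embedding?: number[]; topK?: number }

declare function embedText(text: string): Promise<number[]>; // e.g. text-embedding-3-large
declare const client: {
  addDocuments(docs: DocumentInfo[]): Promise<void>;
  query(options: QueryOptions): Promise<{ id: string; score: number }[]>;
};

async function byoeExample() {
  // Index time: hand over a pre-computed vector so the runtime only stores and queries.
  const docVector = await embedText("Refund policy: purchases can be returned within 30 days.");
  await client.addDocuments([{ id: "doc-1", text: "Refund policy ...", embedding: docVector }]);

  // Query time: embed with the SAME model, then pass the vector in the query options.
  const queryVector = await embedText("how do I get my money back?");
  return client.query({ embedding: queryVector, topK: 5 });
}
```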
Chunking
Chunking is splitting long documents into smaller passages (typically 200–500 tokens) before indexing, so retrieval returns the most relevant passage rather than the whole document.
Chunking is a pre-processing step that controls retrieval precision. Aim for 200–500 tokens per chunk with a 10–20% overlap between adjacent chunks to preserve context that would otherwise be cut at boundaries. Smaller chunks increase recall (more candidates match) but can fragment meaning; larger chunks preserve context but dilute similarity scores. Normalize whitespace and strip boilerplate (nav, footers, code fences you do not want matched) before embedding.
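A minimal sketch of fixed-size chunking with overlap, using whitespace-separated words as a rough stand-in for tokens (real pipelines typically split with a tokenizer and respect sentence or heading boundaries):

```typescript
// Naive fixed-size chunker: splits on whitespace as a rough proxy for tokens.
// chunkSize ~300 with overlap ~45 lands inside the 200-500 token guidance above.
function chunk(text: string, chunkSize = 300, overlap = 45): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last window reached the end
  }
  return chunks;
}
```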
Embeddings
Embeddings are numerical vectors that represent the semantic meaning of text, images, or audio. Similar content produces vectors that are close together in embedding space.
An embedding is a fixed-length vector (commonly 384, 768, 1024, or 1536 dimensions) produced by a neural network trained so that semantically similar inputs land near each other under cosine or Euclidean distance. Embeddings power semantic search, clustering, classification, and retrieval-augmented generation. The choice of embedding model determines recall quality: smaller models like `moss-minilm` favor speed and on-device use, while larger models like `moss-mediumlm` or `text-embedding-3-large` favor accuracy.
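"Close together" is measured by a distance function over the raw numbers. A small sketch of cosine similarity, the most common choice:

```typescript
// Cosine similarity between two embeddings: 1.0 means identical direction,
// ~0 means unrelated, negative means opposing. Vectors must have equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Embeddings of "refund my order" and "give me my money back" would score close
// to 1, while "refund my order" vs "weather tomorrow" would score much lower.
```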
Hybrid Search
Hybrid search combines semantic (vector) and keyword (BM25) retrieval in a single query, then blends the two rankings. It recovers exact matches that pure semantic search misses.
Hybrid search runs two retrievers — one vector, one lexical — over the same index and merges the results using a weighted score (see Alpha). It catches cases where semantic similarity is high but a required keyword is absent (e.g., a SKU, error code, or proper noun), and cases where keywords match but intent diverges. In production, hybrid retrieval with a tuned alpha almost always outperforms either retriever on its own.
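One detail a sketch makes concrete: BM25 scores are unbounded while cosine similarity is not, so each list is typically normalized before the alpha blend. An illustrative fusion, not Moss's internal merge code:

```typescript
// Illustrative fusion of two ranked lists. Scores from each retriever are
// min-max normalized to 0..1, then blended with alpha; a document missing
// from one list simply contributes 0 on that side.
type Scored = Map<string, number>; // doc id -> raw score

function normalize(scores: Scored): Scored {
  const values = [...scores.values()];
  const min = Math.min(...values), max = Math.max(...values);
  const range = max - min || 1; // avoid divide-by-zero when every score is equal
  return new Map([...scores].map(([id, s]) => [id, (s - min) / range] as [string, number]));
}

function fuse(semantic: Scored, keyword: Scored, alpha: number): [string, number][] {
  const sem = normalize(semantic), kw = normalize(keyword);
  const ids = new Set([...sem.keys(), ...kw.keys()]);
  return [...ids]
    .map((id): [string, number] => [id, alpha * (sem.get(id) ?? 0) + (1 - alpha) * (kw.get(id) ?? 0)])
    .sort((a, b) => b[1] - a[1]); // best blended score first
}
```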
Index
An index is a pre-built data structure that makes search fast. In semantic search, the index stores embeddings and the lookup structures used for nearest-neighbor queries (HNSW or IVF for sub-linear lookups, or a flat layout that scans every vector).
An index is the compiled, queryable form of your data. For semantic search, building an index means computing embeddings for every document and organizing them in a structure — usually HNSW (Hierarchical Navigable Small World) graphs or IVF (Inverted File) cells — that lets a query vector find its nearest neighbors in sub-linear time rather than scanning every vector. Moss indexes are immutable snapshots that can be upserted incrementally and loaded into memory for sub-10ms queries.
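What "organizing them in a structure" buys you is easiest to see in a toy sketch. The IVF-style partitioning below is purely illustrative (real systems train centroids with k-means and often layer HNSW or quantization on top), but it shows why a query no longer has to touch every vector:

```typescript
type Vec = number[];

// Similarity via dot product; vectors are assumed to be L2-normalized.
const dot = (a: Vec, b: Vec) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Build step: assign every vector to its nearest coarse centroid ("cell").
function buildIvf(vectors: Vec[], centroids: Vec[]): Map<number, number[]> {
  const cells = new Map<number, number[]>();
  vectors.forEach((v, id) => {
    let best = 0;
    centroids.forEach((c, ci) => {
      if (dot(v, c) > dot(v, centroids[best])) best = ci;
    });
    cells.set(best, [...(cells.get(best) ?? []), id]);
  });
  return cells;
}

// Query step: probe only the nProbe closest cells, then rank just the vectors inside them.
function searchIvf(
  query: Vec, vectors: Vec[], centroids: Vec[],
  cells: Map<number, number[]>, nProbe = 2, k = 5,
): [number, number][] {
  const candidateIds = centroids
    .map((c, ci) => [ci, dot(query, c)] as [number, number])
    .sort((a, b) => b[1] - a[1])
    .slice(0, nProbe)
    .flatMap(([ci]) => cells.get(ci) ?? []); // far smaller than the full corpus
  return candidateIds
    .map((id) => [id, dot(query, vectors[id])] as [number, number])
    .sort((a, b) => b[1] - a[1])
    .slice(0, k);
}
```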
On-Device Search
On-device search runs retrieval inside the user's browser, phone, or desktop app rather than on a remote server. Queries never leave the device, eliminating network latency and keeping data private.
On-device search loads the index directly into the client runtime — browser (via WebAssembly), mobile app, or desktop binary — so every query executes locally in single-digit milliseconds. There is no round-trip to a database, no cold start, and no user data leaves the device unless you explicitly sync. Moss is built in Rust and compiled to WebAssembly to make on-device semantic search practical at production scale.
RAG (Retrieval-Augmented Generation)
RAG is a pattern where an LLM retrieves relevant context from a search index before answering, so it can cite up-to-date, domain-specific information it was never trained on.
Retrieval-Augmented Generation pairs a large language model with a search system. On each user turn, the app (1) embeds the query, (2) retrieves top-k relevant passages from a vector or hybrid index, (3) injects those passages into the LLM prompt, and (4) generates an answer grounded in retrieved context. RAG is the standard architecture for chatbots, copilots, and voice agents that need factual, fresh, or proprietary knowledge without fine-tuning.
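Spelled out as code, with hypothetical stand-ins for the embedding model, the retriever, and the LLM client:

```typescript
// Minimal RAG turn. embedQuery, searchIndex, and generateAnswer are hypothetical
// stand-ins for your embedding model, retriever, and LLM client.
declare function embedQuery(text: string): Promise<number[]>;
declare function searchIndex(vector: number[], topK: number): Promise<{ text: string }[]>;
declare function generateAnswer(prompt: string): Promise<string>;

async function answer(question: string): Promise<string> {
  const queryVector = await embedQuery(question);              // (1) embed the query
  const passages = await searchIndex(queryVector, 5);          // (2) retrieve top-k passages
  const context = passages.map((p) => p.text).join("\n---\n"); // (3) inject them into the prompt
  const prompt =
    `Answer using only the context below. If the answer is not in the context, say so.\n\n` +
    `Context:\n${context}\n\nQuestion: ${question}`;
  return generateAnswer(prompt);                               // (4) grounded generation
}
```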
Semantic Search
Semantic search retrieves results by meaning rather than exact keyword match. It uses embeddings to find documents whose content is conceptually similar to the query, even if no words overlap.
Semantic search replaces literal string matching with vector similarity: the query is embedded into the same space as the indexed documents, and the nearest neighbors are returned. A query like "how do I get my money back?" returns a document titled "Refund policy" even though they share no words. Semantic search is the retrieval layer that powers modern chatbots, voice agents, copilots, and any AI feature that answers questions from a knowledge base.
Sub-10ms Retrieval
Sub-10ms retrieval means returning search results in under ten milliseconds end-to-end. At this latency, retrieval is imperceptible to a user and fits inside a real-time voice agent turn.
Voice agents and interactive copilots have a hard latency budget of roughly 200–400ms per conversational turn for everything: speech-to-text, tool calls, LLM inference, and text-to-speech. A remote vector database adds 100–500ms per query, consuming most of the budget. Sub-10ms retrieval — achieved by loading the index in-memory on the same machine as the agent runtime — removes search from the critical path entirely. Moss is engineered around this constraint.
Vector Search
Vector search finds the items in a dataset whose embedding is closest to a query embedding, using cosine similarity or Euclidean distance. It is the mechanism underneath semantic search.
Vector search is the nearest-neighbor problem applied to embeddings. Given a query vector, the system returns the k documents whose stored vectors are closest under a distance metric (usually cosine similarity). Exact nearest-neighbor search is O(n·d); production systems use approximate algorithms (HNSW, IVF, ScaNN) to answer queries in sub-linear time with recall typically above 95%. "Vector search" describes the mechanism; "semantic search" describes the user-facing capability it enables.
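The exact version fits in a few lines; approximate structures like HNSW exist to avoid this full scan. A minimal sketch:

```typescript
// Exact k-nearest-neighbor search: score every stored vector against the query.
// This is the O(n·d) baseline that HNSW and IVF approximate to avoid the full scan.
function knn(query: number[], vectors: Map<string, number[]>, k: number): [string, number][] {
  const cosine = (a: number[], b: number[]) => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };
  return [...vectors]
    .map(([id, v]) => [id, cosine(query, v)] as [string, number])
    .sort((a, b) => b[1] - a[1]) // highest similarity first
    .slice(0, k);
}
```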