Vector Search

Catalyzed provides native vector similarity search through SQL table functions. Find semantically similar content by searching vector embeddings stored in your tables.

Vector search enables semantic queries like “find products similar to wireless headphones” instead of exact keyword matching. This is powered by:

  • knn_search() - Find the k nearest neighbors to a query vector
  • text_to_embedding() - Convert natural language to vectors at query time
  • Pre-built indices - Fast approximate nearest neighbor (ANN) search on large datasets

The knn_search function finds the k most similar rows based on vector distance:

Basic knn_search

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM knn_search('\''products'\'', '\''embedding'\'', text_to_embedding('\''wireless noise-canceling headphones'\''), 10)",
    "tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'

Response includes all columns from the source table plus a _distance column:

{
  "columns": [
    {"name": "product_id", "type": "Utf8"},
    {"name": "title", "type": "Utf8"},
    {"name": "price", "type": "Float64"},
    {"name": "_distance", "type": "Float32"}
  ],
  "rows": [
    {"product_id": "prod_123", "title": "Sony WH-1000XM5", "price": 349.99, "_distance": 0.12},
    {"product_id": "prod_456", "title": "Bose QuietComfort", "price": 329.99, "_distance": 0.15}
  ]
}
knn_search(table, column, query_vector, k, [metric], [filter], [refine_factor], [lower_bound], [upper_bound])
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| table | string | Yes | Table name containing embeddings |
| column | string | Yes | Column name with vector embeddings |
| query_vector | array | Yes | Query vector (same dimensions as stored vectors) |
| k | integer | Yes | Number of results to return |
| metric | string | No | Distance metric: 'l2', 'cosine', 'dot', 'hamming' |
| filter | string | No | SQL WHERE clause for pre-filtering |
| refine_factor | integer | No | Over-fetch multiplier for improved recall |
| lower_bound | float | No | Minimum distance (inclusive) |
| upper_bound | float | No | Maximum distance (exclusive) |

For common metrics, use the convenience wrappers:

-- Cosine similarity (recommended for text embeddings)
SELECT * FROM knn_cosine('products', 'embedding', text_to_embedding('query'), 10)

-- L2 (Euclidean) distance
SELECT * FROM knn_l2('products', 'embedding', text_to_embedding('query'), 10)

The text_to_embedding() function converts text to a vector using our semantic embedding model:

SELECT text_to_embedding('wireless headphones')
-- Returns: ARRAY[0.0123, -0.0456, 0.0789, ...]

This is useful for:

  • Query-time embedding of search terms
  • Comparing text similarity inline
  • Prototyping before setting up batch embedding pipelines

Combine semantic similarity with attribute filters using the filter parameter:

Filtered search

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM knn_cosine('\''products'\'', '\''embedding'\'', text_to_embedding('\''wireless headphones'\''), 10, '\''category = '\'''\''electronics'\'''\'' AND price < 500'\'')",
    "tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'

The filter is applied before the vector search, ensuring efficient execution when scalar columns are indexed.

Combine vector search with standard SQL operations:

-- Find similar patents and their claims
SELECT p.title, p.assignee, p._distance, c.claim_text
FROM knn_search(
  'patents',
  'abstract_embedding',
  text_to_embedding('machine learning image classification'),
  5
) p
JOIN claims c ON p.patent_id = c.patent_id
WHERE c.claim_type = 'independent'
ORDER BY p._distance

-- Aggregate similar reviews by rating
SELECT rating, COUNT(*) AS count, AVG(_distance) AS avg_similarity
FROM knn_cosine(
  'reviews',
  'embedding',
  text_to_embedding('excellent product quality'),
  100
)
GROUP BY rating
ORDER BY rating DESC
Supported distance metrics:

| Metric | Description | Range | Use Case |
| --- | --- | --- | --- |
| cosine | Cosine distance (1 - cosine similarity) | 0 to 2 | Text embeddings, normalized vectors |
| l2 | Euclidean distance | 0 to ∞ | General purpose, image embeddings |
| dot | Dot product | -∞ to ∞ | Maximum inner product search |
| hamming | Hamming distance | 0 to dims | Binary vectors, hashes |

Recommendation: Use cosine for text embeddings from models like OpenAI, Cohere, or sentence-transformers.
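To choose a metric explicitly, pass it as the fifth argument to knn_search; a minimal sketch (the products table is illustrative):

-- Maximum inner product search using the 'dot' metric
SELECT * FROM knn_search(
  'products',
  'embedding',
  text_to_embedding('query'),
  10,
  'dot' -- metric
)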

Full-text search (FTS) enables BM25-based keyword matching using inverted indexes. Unlike semantic search, which matches by meaning, FTS excels at exact term matching and boolean queries.

fts_search(table, column, query_text, limit, [filter])
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| table | string | Yes | Table name containing the text column |
| column | string | Yes | Column name with text content |
| query_text | string | Yes | Search query (supports boolean operators) |
| limit | integer | Yes | Maximum number of results to return |
| filter | string | No | SQL WHERE clause for pre-filtering |

Returns: All columns from the source table plus a _score column with BM25 relevance scores (higher = better match).

Basic FTS

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM fts_search('\''documents'\'', '\''content'\'', '\''machine learning'\'', 20)",
    "tables": {"documents": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'

Response includes all source columns plus _score:

{
  "columns": [
    {"name": "doc_id", "type": "Utf8"},
    {"name": "content", "type": "Utf8"},
    {"name": "_score", "type": "Float32"}
  ],
  "rows": [
    {"doc_id": "doc_123", "content": "Introduction to machine learning...", "_score": 8.42},
    {"doc_id": "doc_456", "content": "Machine learning algorithms...", "_score": 7.89}
  ]
}
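
The query_text argument supports boolean operators (per the comparison table below); a minimal sketch, assuming uppercase AND/OR/NOT syntax:

-- Documents mentioning both terms, excluding "deep"
SELECT doc_id, content, _score
FROM fts_search('documents', 'content', 'machine AND learning NOT deep', 20)
ORDER BY _score DESC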

Before using fts_search(), create an inverted index on the text column:

-- Create inverted index with default configuration
CREATE INDEX documents_content_fts
ON documents (content)
USING inverted;

Index Configuration Options:

| Option | Default | Description |
| --- | --- | --- |
| base_tokenizer | simple | Tokenization strategy (simple, whitespace) |
| with_stemming | true | Enable word stemming (“running” matches “run”) |
| remove_stop_words | true | Remove common words (“the”, “a”, “is”) |
| lowercase | true | Normalize to lowercase for case-insensitive search |
| with_position | true | Track positions for phrase search |

Example with custom configuration:

await datasetTableOperations.createIndex(tableId, {
  indexType: "inverted",
  columnName: "content",
  indexName: "content_fts",
  baseTokenizer: "simple",
  withStemming: true,
  removeStopWords: true,
  lowercase: true,
  withPosition: true,
});
When to choose FTS versus vector search:

| Use Case | FTS (BM25) | Vector Search |
| --- | --- | --- |
| Exact term matching | ✅ Best | ❌ Poor |
| Technical identifiers (APIs, SKUs) | ✅ Best | ❌ Poor |
| Boolean queries (AND/OR/NOT) | ✅ Supported | ❌ Not supported |
| Conceptual similarity | ❌ Poor | ✅ Best |
| Synonyms and paraphrasing | ❌ No | ✅ Automatic |
| Multilingual matching | ❌ Limited | ✅ Good |

Recommendation: Use Hybrid Search with RRF to combine both approaches for optimal results.

Reciprocal Rank Fusion (RRF) combines FTS and vector search results by merging their rankings, not their scores. This produces superior results compared to using either method alone.

Algorithm:

  1. Run keyword search (BM25) and semantic search (vector) independently
  2. Assign ranks to results (1st place, 2nd place, etc.)
  3. Calculate the RRF score as semanticWeight / (60 + semantic_rank) + keywordWeight / (60 + keyword_rank), sketched in SQL below
  4. Sort by combined RRF score
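
The Knowledge Base API below performs this fusion for you, but as an illustration, the same computation can be sketched with the documented table functions. This assumes the SQL engine supports CTEs, window functions, and FULL OUTER JOIN, and uses an illustrative documents table:

-- Hand-rolled RRF sketch with semanticWeight = 0.6, keywordWeight = 0.4
WITH semantic AS (
  SELECT doc_id, ROW_NUMBER() OVER (ORDER BY _distance) AS sem_rank
  FROM knn_cosine('documents', 'embedding',
                  text_to_embedding('machine learning algorithms'), 50)
),
keyword AS (
  SELECT doc_id, ROW_NUMBER() OVER (ORDER BY _score DESC) AS kw_rank
  FROM fts_search('documents', 'content', 'machine learning algorithms', 50)
)
SELECT
  COALESCE(s.doc_id, k.doc_id) AS doc_id,
  -- A result missing from one ranking simply contributes 0 from that side
  COALESCE(0.6 / (60 + s.sem_rank), 0) + COALESCE(0.4 / (60 + k.kw_rank), 0) AS rrf_score
FROM semantic s
FULL OUTER JOIN keyword k ON s.doc_id = k.doc_id
ORDER BY rrf_score DESC
LIMIT 20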

Benefits:

  • Rank-based fusion is more robust than score normalization
  • Naturally handles results that appear in only one ranking
  • Weights control the balance between semantic and keyword matching

Implementation:

Hybrid search is available through the Knowledge Base API with automatic fallback to semantic-only if FTS is unavailable:

Hybrid search

curl -X POST https://api.catalyzed.ai/knowledge-bases/KBabcdef123456/query \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning algorithms",
    "searchMode": "hybrid",
    "semanticWeight": 0.6,
    "keywordWeight": 0.4,
    "limit": 20
  }'

Response includes all score fields:

{
  "results": [
    {
      "chunkId": "chunk_123",
      "content": "Machine learning algorithms...",
      "score": 0.0847,
      "semanticScore": 0.15,
      "keywordScore": 8.42,
      "combinedScore": 0.0847
    }
  ],
  "metadata": {
    "searchMode": "hybrid",
    "fallback": false
  }
}

Weight Tuning:

  • Higher semanticWeight (0.7-0.8): Prioritize conceptual similarity, good for research/discovery
  • Higher keywordWeight (0.6-0.7): Prioritize exact terms, good for technical docs/product search
  • Balanced (0.5/0.5): Equal importance to both methods

See Knowledge Bases for complete hybrid search documentation.

Filter results by distance range using lower_bound and upper_bound:

-- Only return results with distance < 0.5 (very similar)
SELECT * FROM knn_search(
  'products',
  'embedding',
  text_to_embedding('wireless headphones'),
  100,  -- k (max results)
  NULL, -- metric (use default)
  NULL, -- filter
  NULL, -- refine_factor
  0.0,  -- lower_bound (inclusive)
  0.5   -- upper_bound (exclusive)
)

This is useful for:

  • Finding near-duplicates (very low distance)
  • Excluding exact matches (lower_bound > 0)
  • Quality thresholds in recommendation systems
Approximate query latency by dataset size:

| Dataset Size | Flat Search | Indexed Search | Recommendation |
| --- | --- | --- | --- |
| < 10,000 rows | < 10ms | ~1ms | Flat is fine |
| 10,000 - 100,000 rows | ~100ms | ~2ms | Consider index |
| > 100,000 rows | > 1s | ~1ms | Index required |

For indexed searches, use refine_factor to improve recall at the cost of latency:

-- Over-fetch 5x candidates, then re-rank for better accuracy
SELECT * FROM knn_search(
  'products',
  'embedding',
  text_to_embedding('query'),
  10,
  NULL, -- metric
  NULL, -- filter
  5     -- refine_factor: fetch 50 candidates, return top 10
)

  1. Use specific k values - Don’t request more results than needed
  2. Filter first - Use the filter parameter instead of WHERE on the outer query (sketched below)
  3. Limit result columns - Select only the columns you need
  4. Consider distance thresholds - Use bounds to eliminate low-quality matches
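
For example, a minimal sketch of tip 2 using the documented filter parameter (the products table and its columns are illustrative):

-- Preferred: pre-filter inside the search via the filter parameter
SELECT product_id, title, _distance
FROM knn_cosine(
  'products',
  'embedding',
  text_to_embedding('wireless headphones'),
  10,
  'category = ''electronics'' AND price < 500'
)

-- Avoid: filtering after the search, which can discard matches and return
-- fewer than k usable rows:
-- SELECT * FROM knn_cosine('products', 'embedding',
--                          text_to_embedding('wireless headphones'), 10)
-- WHERE category = 'electronics' AND price < 500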

Find documents similar to a query:

SELECT title, content, _distance
FROM knn_cosine(
  'documents',
  'content_embedding',
  text_to_embedding('renewable energy policy regulations'),
  20
)
ORDER BY _distance

Find products similar to one the user is viewing:

-- Get the embedding from the current product
WITH current_product AS (
  SELECT embedding FROM products WHERE product_id = 'prod_123'
)
SELECT p.product_id, p.title, p.price, p._distance
FROM knn_cosine(
  'products',
  'embedding',
  (SELECT embedding FROM current_product),
  10
) p
WHERE p.product_id != 'prod_123' -- Exclude the current product

Find near-duplicate content:

SELECT id, title, _distance
FROM knn_search(
  'articles',
  'embedding',
  (SELECT embedding FROM articles WHERE id = 'article_456'),
  50,
  'cosine',
  NULL, -- no filter
  NULL, -- no refine_factor
  0.0,  -- lower_bound
  0.1   -- upper_bound (very similar only)
)
WHERE id != 'article_456'

Combine attribute filters with semantic search using SQL ILIKE or other conditions:

-- First filter by keywords, then rank by semantic similarity
SELECT * FROM knn_cosine(
  'products',
  'embedding',
  text_to_embedding('comfortable wireless headphones'),
  20,
  'title ILIKE ''%headphone%'' OR description ILIKE ''%headphone%'''
)
ORDER BY _distance

To use vector search, your table needs a column containing vector embeddings. Embeddings are stored as arrays of floats:

-- Example schema
CREATE TABLE products (
  product_id VARCHAR PRIMARY KEY,
  title VARCHAR,
  description TEXT,
  embedding FLOAT[384] -- 384 dimensions for MiniLM
);

Catalyzed uses multiple high-performance embedding models in parallel to optimize for both accuracy and speed. Embeddings are automatically generated when you upload files.

The text_to_embedding() function returns vectors optimized for semantic search across your data.

:::note[Enterprise Features]
Need specific embedding models or want to bring your own embeddings? Contact our support team to discuss enterprise options during your implementation scoping.
:::