Skip to content

Vector Search

Catalyzed provides native vector similarity search through SQL table functions. Find semantically similar content by searching vector embeddings stored in your tables.

Vector search enables semantic queries like “find products similar to wireless headphones” instead of exact keyword matching. This is powered by:

  • knn_search() - Find the k nearest neighbors to a query vector
  • text_to_embedding() - Convert natural language to vectors at query time
  • Pre-built indices - Fast approximate nearest neighbor (ANN) search on large datasets

The knn_search function finds the k most similar rows based on vector distance:

Basic knn_search

Terminal window
curl -X POST https://api.catalyzed.ai/queries \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"sql": "SELECT * FROM knn_search('\''products'\'', '\''embedding'\'', text_to_embedding('\''wireless noise-canceling headphones'\''), 10)",
"tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
}'

Response includes all columns from the source table plus a _distance column:

{
"columns": [
{"name": "product_id", "type": "Utf8"},
{"name": "title", "type": "Utf8"},
{"name": "price", "type": "Float64"},
{"name": "_distance", "type": "Float32"}
],
"rows": [
{"product_id": "prod_123", "title": "Sony WH-1000XM5", "price": 349.99, "_distance": 0.12},
{"product_id": "prod_456", "title": "Bose QuietComfort", "price": 329.99, "_distance": 0.15}
]
}
knn_search(table, column, query_vector, k, [metric], [filter], [refine_factor], [lower_bound], [upper_bound])
ParameterTypeRequiredDescription
tablestringYesTable name containing embeddings
columnstringYesColumn name with vector embeddings
query_vectorarrayYesQuery vector (same dimensions as stored vectors)
kintegerYesNumber of results to return
metricstringNoDistance metric: 'l2', 'cosine', 'dot', 'hamming'
filterstringNoSQL WHERE clause for pre-filtering
refine_factorintegerNoOver-fetch multiplier for improved recall
lower_boundfloatNoMinimum distance (inclusive)
upper_boundfloatNoMaximum distance (exclusive)

For common metrics, use the convenience wrappers:

-- Cosine similarity (recommended for text embeddings)
SELECT * FROM knn_cosine('products', 'embedding', text_to_embedding('query'), 10)
-- L2 (Euclidean) distance
SELECT * FROM knn_l2('products', 'embedding', text_to_embedding('query'), 10)

Converts text to a vector using our semantic embedding model:

SELECT text_to_embedding('wireless headphones')
-- Returns: ARRAY[0.0123, -0.0456, 0.0789, ...]

This is useful for:

  • Query-time embedding of search terms
  • Comparing text similarity inline
  • Prototyping before setting up batch embedding pipelines

Combine semantic similarity with attribute filters using the filter parameter:

Filtered search

Terminal window
curl -X POST https://api.catalyzed.ai/queries \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"sql": "SELECT * FROM knn_cosine('\''products'\'', '\''embedding'\'', text_to_embedding('\''wireless headphones'\''), 10, '\''category = ''''electronics'''' AND price < 500'\'')",
"tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
}'

The filter is applied before the vector search, ensuring efficient execution when scalar columns are indexed.

Combine vector search with standard SQL operations:

-- Find similar patents and their claims
SELECT p.title, p.assignee, p._distance, c.claim_text
FROM knn_search(
'patents',
'abstract_embedding',
text_to_embedding('machine learning image classification'),
5
) p
JOIN claims c ON p.patent_id = c.patent_id
WHERE c.claim_type = 'independent'
ORDER BY p._distance
-- Aggregate similar reviews by rating
SELECT rating, COUNT(*) as count, AVG(_distance) as avg_similarity
FROM knn_cosine(
'reviews',
'embedding',
text_to_embedding('excellent product quality'),
100
)
GROUP BY rating
ORDER BY rating DESC
MetricDescriptionRangeUse Case
cosineCosine similarity0 to 2Text embeddings, normalized vectors
l2Euclidean distance0 to ∞General purpose, image embeddings
dotDot product-∞ to ∞Maximum inner product search
hammingHamming distance0 to dimsBinary vectors, hashes

Recommendation: Use cosine for text embeddings from models like OpenAI, Cohere, or sentence-transformers.

Filter results by distance range using lower_bound and upper_bound:

-- Only return results with distance < 0.5 (very similar)
SELECT * FROM knn_search(
'products',
'embedding',
text_to_embedding('wireless headphones'),
100, -- k (max results)
NULL, -- metric (use default)
NULL, -- filter
NULL, -- refine_factor
0.0, -- lower_bound (inclusive)
0.5 -- upper_bound (exclusive)
)

This is useful for:

  • Finding near-duplicates (very low distance)
  • Excluding exact matches (lower_bound > 0)
  • Quality thresholds in recommendation systems
Dataset SizeFlat SearchIndexed SearchRecommendation
< 10,000 rows< 10ms~1msFlat is fine
10,000 - 100,000~100ms~2msConsider index
> 100,000 rows> 1s~1msIndex required

For indexed searches, use refine_factor to improve recall at the cost of latency:

-- Over-fetch 5x candidates, then re-rank for better accuracy
SELECT * FROM knn_search(
'products',
'embedding',
text_to_embedding('query'),
10,
NULL, -- metric
NULL, -- filter
5 -- refine_factor: fetch 50 candidates, return top 10
)
  1. Use specific k values - Don’t request more results than needed
  2. Filter first - Use the filter parameter instead of WHERE on outer query
  3. Limit result columns - Select only the columns you need
  4. Consider distance thresholds - Use bounds to eliminate low-quality matches

Find documents similar to a query:

SELECT title, content, _distance
FROM knn_cosine(
'documents',
'content_embedding',
text_to_embedding('renewable energy policy regulations'),
20
)
ORDER BY _distance

Find products similar to one the user is viewing:

-- Get the embedding from the current product
WITH current_product AS (
SELECT embedding FROM products WHERE product_id = 'prod_123'
)
SELECT p.product_id, p.title, p.price, p._distance
FROM knn_cosine(
'products',
'embedding',
(SELECT embedding FROM current_product),
10
) p
WHERE p.product_id != 'prod_123' -- Exclude the current product

Find near-duplicate content:

SELECT id, title, _distance
FROM knn_search(
'articles',
'embedding',
(SELECT embedding FROM articles WHERE id = 'article_456'),
50,
'cosine',
NULL, -- no filter
NULL, -- no refine
0.0, -- lower bound
0.1 -- upper bound (very similar only)
)
WHERE id != 'article_456'

Combine attribute filters with semantic search using SQL ILIKE or other conditions:

-- First filter by keywords, then rank by semantic similarity
SELECT * FROM knn_cosine(
'products',
'embedding',
text_to_embedding('comfortable wireless headphones'),
20,
'title ILIKE ''%headphone%'' OR description ILIKE ''%headphone%'''
)
ORDER BY _distance

To use vector search, your table needs a column containing vector embeddings. Embeddings are stored as arrays of floats:

-- Example schema
CREATE TABLE products (
product_id VARCHAR PRIMARY KEY,
title VARCHAR,
description TEXT,
embedding FLOAT[384] -- 384 dimensions for MiniLM
);

Catalyzed uses multiple high-performance embedding models in parallel to optimize for both accuracy and speed. Embeddings are automatically generated when you upload files.

The text_to_embedding() function returns vectors optimized for semantic search across your data.

:::note Enterprise Features Need specific embedding models or want to bring your own embeddings? Contact our support team to discuss enterprise options during your implementation scoping. :::