# Vector Search
Catalyzed provides native vector similarity search through SQL table functions. Find semantically similar content by searching vector embeddings stored in your tables.
## Overview

Vector search enables semantic queries like “find products similar to wireless headphones” instead of exact keyword matching. This is powered by:

- `knn_search()` - Find the k nearest neighbors to a query vector
- `text_to_embedding()` - Convert natural language to vectors at query time
- Pre-built indices - Fast approximate nearest neighbor (ANN) search on large datasets
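For intuition, what a k-nearest-neighbor search does can be sketched in plain Python. This is a brute-force illustration with toy 2-D vectors, not the engine's actual (indexed) implementation:

```python
import math

def knn(rows, column, query_vector, k):
    """Brute-force k-NN: rank rows by L2 distance to the query vector."""
    scored = []
    for row in rows:
        dist = math.dist(row[column], query_vector)  # Euclidean (L2) distance
        scored.append({**row, "_distance": dist})
    scored.sort(key=lambda r: r["_distance"])
    return scored[:k]

rows = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0]},
    {"id": "c", "embedding": [0.9, 0.1]},
]
top = knn(rows, "embedding", [1.0, 0.0], 2)
# Nearest two rows are "a" (distance 0), then "c"
```

The `_distance` column appended here mirrors the extra column the real `knn_search` returns alongside the source table's columns.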
## Basic Vector Search

The `knn_search` function finds the k most similar rows based on vector distance:
```shell
curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM knn_search('\''products'\'', '\''embedding'\'', text_to_embedding('\''wireless noise-canceling headphones'\''), 10)",
    "tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'
```

```javascript
const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: `SELECT * FROM knn_search(
      'products',
      'embedding',
      text_to_embedding('wireless noise-canceling headphones'),
      10
    )`,
    tables: { products: "KzaMsfA0LSw_Ld0KyaXIS" },
  }),
});
const results = await response.json();
```

```python
response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": """
            SELECT * FROM knn_search(
                'products',
                'embedding',
                text_to_embedding('wireless noise-canceling headphones'),
                10
            )
        """,
        "tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
    }
)
results = response.json()
```

Response includes all columns from the source table plus a `_distance` column:
{ "columns": [ {"name": "product_id", "type": "Utf8"}, {"name": "title", "type": "Utf8"}, {"name": "price", "type": "Float64"}, {"name": "_distance", "type": "Float32"} ], "rows": [ {"product_id": "prod_123", "title": "Sony WH-1000XM5", "price": 349.99, "_distance": 0.12}, {"product_id": "prod_456", "title": "Bose QuietComfort", "price": 329.99, "_distance": 0.15} ]}Function Reference
### knn_search
```sql
knn_search(table, column, query_vector, k, [metric], [filter], [refine_factor], [lower_bound], [upper_bound])
```

| Parameter | Type | Required | Description |
|---|---|---|---|
| `table` | string | Yes | Table name containing embeddings |
| `column` | string | Yes | Column name with vector embeddings |
| `query_vector` | array | Yes | Query vector (same dimensions as stored vectors) |
| `k` | integer | Yes | Number of results to return |
| `metric` | string | No | Distance metric: `'l2'`, `'cosine'`, `'dot'`, `'hamming'` |
| `filter` | string | No | SQL WHERE clause for pre-filtering |
| `refine_factor` | integer | No | Over-fetch multiplier for improved recall |
| `lower_bound` | float | No | Minimum distance (inclusive) |
| `upper_bound` | float | No | Maximum distance (exclusive) |
### Convenience Functions

For common metrics, use the convenience wrappers:
```sql
-- Cosine similarity (recommended for text embeddings)
SELECT * FROM knn_cosine('products', 'embedding', text_to_embedding('query'), 10)

-- L2 (Euclidean) distance
SELECT * FROM knn_l2('products', 'embedding', text_to_embedding('query'), 10)
```

### text_to_embedding
Converts text to a vector using our semantic embedding model:

```sql
SELECT text_to_embedding('wireless headphones')
-- Returns: ARRAY[0.0123, -0.0456, 0.0789, ...]
```

This is useful for:
- Query-time embedding of search terms
- Comparing text similarity inline
- Prototyping before setting up batch embedding pipelines
## Filtered Vector Search

Combine semantic similarity with attribute filters using the `filter` parameter:
```shell
curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM knn_cosine('\''products'\'', '\''embedding'\'', text_to_embedding('\''wireless headphones'\''), 10, '\''category = ''''electronics'''' AND price < 500'\'')",
    "tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'
```

```javascript
const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: `SELECT * FROM knn_cosine(
      'products',
      'embedding',
      text_to_embedding('wireless headphones'),
      10,
      'category = ''electronics'' AND price < 500'
    )`,
    tables: { products: "KzaMsfA0LSw_Ld0KyaXIS" },
  }),
});
```

```python
response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": """
            SELECT * FROM knn_cosine(
                'products',
                'embedding',
                text_to_embedding('wireless headphones'),
                10,
                'category = ''electronics'' AND price < 500'
            )
        """,
        "tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
    }
)
```

The filter is applied before the vector search, ensuring efficient execution when scalar columns are indexed.
## Vector Search with JOINs

Combine vector search with standard SQL operations:
```sql
-- Find similar patents and their claims
SELECT p.title, p.assignee, p._distance, c.claim_text
FROM knn_search(
  'patents',
  'abstract_embedding',
  text_to_embedding('machine learning image classification'),
  5
) p
JOIN claims c ON p.patent_id = c.patent_id
WHERE c.claim_type = 'independent'
ORDER BY p._distance

-- Aggregate similar reviews by rating
SELECT rating, COUNT(*) as count, AVG(_distance) as avg_similarity
FROM knn_cosine(
  'reviews',
  'embedding',
  text_to_embedding('excellent product quality'),
  100
)
GROUP BY rating
ORDER BY rating DESC
```

## Distance Metrics
| Metric | Description | Range | Use Case |
|---|---|---|---|
| `cosine` | Cosine distance | 0 to 2 | Text embeddings, normalized vectors |
| `l2` | Euclidean distance | 0 to ∞ | General purpose, image embeddings |
| `dot` | Dot product | -∞ to ∞ | Maximum inner product search |
| `hamming` | Hamming distance | 0 to dims | Binary vectors, hashes |
Recommendation: Use cosine for text embeddings from models like OpenAI, Cohere, or sentence-transformers.
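For reference, the first three metrics can be computed directly in Python. This is a plain-list sketch of the definitions, not how the engine evaluates them:

```python
import math

def l2(a, b):
    """Euclidean distance (lower = more similar)."""
    return math.dist(a, b)

def dot(a, b):
    """Dot product (higher = more similar)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 = same direction, 2 = opposite direction."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return 1 - dot(a, b) / norm

a, b = [1.0, 0.0], [0.0, 1.0]  # orthogonal vectors
# cosine_distance(a, b) -> 1.0, l2(a, b) -> sqrt(2), dot(a, b) -> 0.0
```

Note that cosine distance ignores magnitude: `[1, 0]` and `[2, 0]` have distance 0, which is why it suits embeddings whose scale carries no meaning.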
## Full-Text Search

Full-text search (FTS) enables BM25-based keyword matching using inverted indexes. Unlike semantic search, which matches by meaning, FTS excels at exact term matching and boolean queries.
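To make the BM25 scoring model concrete, here is a minimal sketch in Python. It uses whitespace tokenization with no stemming or stop-word removal, so it is far simpler than the real inverted index, but the formula is the classic one:

```python
import math

def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each doc against the query: sum of per-term IDF * saturated TF."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n  # average document length
    terms = query.lower().split()
    df = {t: sum(t in d for d in tokenized) for t in terms}  # document frequency
    scores = []
    for doc in tokenized:
        s = 0.0
        for t in terms:
            tf = doc.count(t)
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = [
    "machine learning algorithms",
    "cooking recipes for beginners",
]
scores = bm25_scores(docs, "machine learning")
# Only the first document contains the query terms, so only it scores above 0
```

Higher scores mean better matches, which is the convention the `_score` column follows.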
### fts_search Function

```sql
fts_search(table, column, query_text, limit, [filter])
```

| Parameter | Type | Required | Description |
|---|---|---|---|
| `table` | string | Yes | Table name containing the text column |
| `column` | string | Yes | Column name with text content |
| `query_text` | string | Yes | Search query (supports boolean operators) |
| `limit` | integer | Yes | Maximum number of results to return |
| `filter` | string | No | SQL WHERE clause for pre-filtering |
Returns: All columns from the source table plus a `_score` column with BM25 relevance scores (higher = better match).
```shell
curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM fts_search('\''documents'\'', '\''content'\'', '\''machine learning'\'', 20)",
    "tables": {"documents": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'
```

```javascript
const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: `SELECT * FROM fts_search(
      'documents',
      'content',
      'machine learning',
      20
    )`,
    tables: { documents: "KzaMsfA0LSw_Ld0KyaXIS" },
  }),
});
```

```python
response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": """
            SELECT * FROM fts_search(
                'documents',
                'content',
                'machine learning',
                20
            )
        """,
        "tables": {"documents": "KzaMsfA0LSw_Ld0KyaXIS"}
    }
)
```

Response includes all source columns plus `_score`:
{ "columns": [ {"name": "doc_id", "type": "Utf8"}, {"name": "content", "type": "Utf8"}, {"name": "_score", "type": "Float32"} ], "rows": [ {"doc_id": "doc_123", "content": "Introduction to machine learning...", "_score": 8.42}, {"doc_id": "doc_456", "content": "Machine learning algorithms...", "_score": 7.89} ]}Creating FTS Indexes
Before using `fts_search()`, create an inverted index on the text column:

```sql
-- Create inverted index with default configuration
CREATE INDEX documents_content_fts
ON documents (content)
USING inverted;
```

Index Configuration Options:
| Option | Default | Description |
|---|---|---|
| `base_tokenizer` | `simple` | Tokenization strategy (`simple`, `whitespace`) |
| `with_stemming` | `true` | Enable word stemming (“running” matches “run”) |
| `remove_stop_words` | `true` | Remove common words (“the”, “a”, “is”) |
| `lowercase` | `true` | Normalize to lowercase for case-insensitive search |
| `with_position` | `true` | Track positions for phrase search |
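What these options do to incoming text can be illustrated with a toy analysis pipeline. The real tokenizer and stemmer are more sophisticated; this only mirrors the idea, with a crude suffix-stripping stand-in for stemming:

```python
STOP_WORDS = {"the", "a", "is", "an", "of"}  # tiny illustrative stop list

def analyze(text, lowercase=True, remove_stop_words=True, with_stemming=True):
    """Toy analyzer mirroring the index options above."""
    tokens = text.split()                      # base_tokenizer: split on whitespace
    if lowercase:
        tokens = [t.lower() for t in tokens]   # case-insensitive matching
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if with_stemming:
        # Crude suffix stripping as a stand-in for a real stemmer
        tokens = [t[:-3] if t.endswith("ing") else t for t in tokens]
    return tokens

analyze("Searching the Index")
# -> ["search", "index"]
```

Both documents and queries pass through the same pipeline, which is why “Searching” can match “search” at query time.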
Example with custom configuration:
```javascript
await datasetTableOperations.createIndex(tableId, {
  indexType: "inverted",
  columnName: "content",
  indexName: "content_fts",
  baseTokenizer: "simple",
  withStemming: true,
  removeStopWords: true,
  lowercase: true,
  withPosition: true,
});
```

## When to Use FTS vs Vector Search
| Use Case | FTS (BM25) | Vector Search |
|---|---|---|
| Exact term matching | ✅ Best | ❌ Poor |
| Technical identifiers (APIs, SKUs) | ✅ Best | ❌ Poor |
| Boolean queries (AND/OR/NOT) | ✅ Supported | ❌ Not supported |
| Conceptual similarity | ❌ Poor | ✅ Best |
| Synonyms and paraphrasing | ❌ No | ✅ Automatic |
| Multilingual matching | ❌ Limited | ✅ Good |
Recommendation: Use Hybrid Search with RRF to combine both approaches for optimal results.
## Hybrid Search with RRF

Reciprocal Rank Fusion (RRF) combines FTS and vector search results by merging their rankings, not their scores. This typically produces better results than either method alone.
Algorithm:
- Run keyword search (BM25) and semantic search (vector) independently
- Assign ranks to results (1st place, 2nd place, etc.)
- Calculate RRF score:
  `semanticWeight / (60 + semantic_rank) + keywordWeight / (60 + keyword_rank)`
- Sort by combined RRF score
Benefits:
- Rank-based fusion is more robust than score normalization
- Naturally handles results that appear in only one ranking
- Weights control the balance between semantic and keyword matching
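The steps above can be sketched in a few lines of Python (hypothetical document IDs; 60 is the conventional RRF constant from the formula above):

```python
def rrf_fuse(semantic_ids, keyword_ids, semantic_weight=0.6, keyword_weight=0.4, k=60):
    """Fuse two ranked ID lists by weighted reciprocal rank."""
    scores = {}
    for rank, doc_id in enumerate(semantic_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + semantic_weight / (k + rank)
    for rank, doc_id in enumerate(keyword_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + keyword_weight / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d2", "d1", "d3"]  # ranking from vector search
keyword = ["d1", "d4"]         # ranking from BM25
fused = rrf_fuse(semantic, keyword)
# "d1" places well in both rankings, so it rises to the top
```

Documents appearing in only one ranking (like `d4`) simply contribute one reciprocal-rank term, which is the “naturally handles” property listed above.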
Implementation:
Hybrid search is available through the Knowledge Base API with automatic fallback to semantic-only if FTS is unavailable:
```shell
curl -X POST https://api.catalyzed.ai/knowledge-bases/KBabcdef123456/query \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning algorithms",
    "searchMode": "hybrid",
    "semanticWeight": 0.6,
    "keywordWeight": 0.4,
    "limit": 20
  }'
```

```javascript
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/KBabcdef123456/query",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query: "machine learning algorithms",
      searchMode: "hybrid",
      semanticWeight: 0.6,
      keywordWeight: 0.4,
      limit: 20,
    }),
  }
);
```

```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/KBabcdef123456/query",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "query": "machine learning algorithms",
        "searchMode": "hybrid",
        "semanticWeight": 0.6,
        "keywordWeight": 0.4,
        "limit": 20
    }
)
```

Response includes all score fields:
{ "results": [ { "chunkId": "chunk_123", "content": "Machine learning algorithms...", "score": 0.0847, "semanticScore": 0.15, "keywordScore": 8.42, "combinedScore": 0.0847 } ], "metadata": { "searchMode": "hybrid", "fallback": false }}Weight Tuning:
- Higher `semanticWeight` (0.7-0.8): Prioritize conceptual similarity, good for research/discovery
- Higher `keywordWeight` (0.6-0.7): Prioritize exact terms, good for technical docs/product search
- Balanced (0.5/0.5): Equal importance to both methods
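The effect of the weights can be seen with a tiny experiment (hypothetical document IDs; 60 is the standard RRF constant):

```python
def rrf_score(weight, rank, k=60):
    """Weighted reciprocal-rank contribution of one ranking."""
    return weight / (k + rank)

# "doc_a" ranks 1st semantically but only 5th by keyword;
# "doc_b" is the reverse.
def winner(semantic_weight, keyword_weight):
    a = rrf_score(semantic_weight, 1) + rrf_score(keyword_weight, 5)
    b = rrf_score(semantic_weight, 5) + rrf_score(keyword_weight, 1)
    return "doc_a" if a > b else "doc_b"

winner(0.8, 0.2)  # semantic-heavy weights favor doc_a
winner(0.2, 0.8)  # keyword-heavy weights favor doc_b
```

With balanced weights the two documents tie here, since their ranks are mirror images of each other.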
See Knowledge Bases for complete hybrid search documentation.
## Distance Thresholds

Filter results by distance range using `lower_bound` and `upper_bound`:
```sql
-- Only return results with distance < 0.5 (very similar)
SELECT * FROM knn_search(
  'products',
  'embedding',
  text_to_embedding('wireless headphones'),
  100,   -- k (max results)
  NULL,  -- metric (use default)
  NULL,  -- filter
  NULL,  -- refine_factor
  0.0,   -- lower_bound (inclusive)
  0.5    -- upper_bound (exclusive)
)
```

This is useful for:
- Finding near-duplicates (very low distance)
- Excluding exact matches (lower_bound > 0)
- Quality thresholds in recommendation systems
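The bound semantics (inclusive lower, exclusive upper) can be sketched as a post-filter; this is illustrative only, as the real engine applies the bounds during the search itself:

```python
def within_bounds(results, lower_bound=0.0, upper_bound=float("inf")):
    """Keep rows whose _distance falls in [lower_bound, upper_bound)."""
    return [r for r in results if lower_bound <= r["_distance"] < upper_bound]

hits = [
    {"id": "a", "_distance": 0.0},   # exact match
    {"id": "b", "_distance": 0.08},  # near-duplicate
    {"id": "c", "_distance": 0.6},   # weak match
]
near_dupes = within_bounds(hits, lower_bound=0.01, upper_bound=0.5)
# Keeps only "b": the exact match and the weak match are both filtered out
```

Setting `lower_bound` just above zero is the trick used for duplicate detection below, where the query row itself would otherwise come back at distance 0.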
## Performance Considerations

### When to Use Indices

| Dataset Size | Flat Search | Indexed Search | Recommendation |
|---|---|---|---|
| < 10,000 rows | < 10ms | ~1ms | Flat is fine |
| 10,000 - 100,000 | ~100ms | ~2ms | Consider index |
| > 100,000 rows | > 1s | ~1ms | Index required |
### Refine Factor

For indexed searches, use `refine_factor` to improve recall at the cost of latency:
```sql
-- Over-fetch 5x candidates, then re-rank for better accuracy
SELECT * FROM knn_search(
  'products',
  'embedding',
  text_to_embedding('query'),
  10,
  NULL,  -- metric
  NULL,  -- filter
  5      -- refine_factor: fetch 50 candidates, return top 10
)
```

## Query Tips
- Use specific k values - Don’t request more results than needed
- Filter first - Use the `filter` parameter instead of WHERE on the outer query
- Limit result columns - Select only the columns you need
- Consider distance thresholds - Use bounds to eliminate low-quality matches
## Common Use Cases

### Semantic Document Search

Find documents similar to a query:
```sql
SELECT title, content, _distance
FROM knn_cosine(
  'documents',
  'content_embedding',
  text_to_embedding('renewable energy policy regulations'),
  20
)
ORDER BY _distance
```

### Product Recommendations
Find products similar to one the user is viewing:
```sql
-- Get the embedding from the current product
WITH current_product AS (
  SELECT embedding FROM products WHERE product_id = 'prod_123'
)
SELECT p.product_id, p.title, p.price, p._distance
FROM knn_cosine(
  'products',
  'embedding',
  (SELECT embedding FROM current_product),
  10
) p
WHERE p.product_id != 'prod_123' -- Exclude the current product
```

### Duplicate Detection
Find near-duplicate content:
```sql
SELECT id, title, _distance
FROM knn_search(
  'articles',
  'embedding',
  (SELECT embedding FROM articles WHERE id = 'article_456'),
  50,
  'cosine',
  NULL,  -- no filter
  NULL,  -- no refine
  0.0,   -- lower bound
  0.1    -- upper bound (very similar only)
)
WHERE id != 'article_456'
```

### Advanced Filtering
Combine attribute filters with semantic search using SQL `ILIKE` or other conditions:
```sql
-- First filter by keywords, then rank by semantic similarity
SELECT * FROM knn_cosine(
  'products',
  'embedding',
  text_to_embedding('comfortable wireless headphones'),
  20,
  'title ILIKE ''%headphone%'' OR description ILIKE ''%headphone%'''
)
ORDER BY _distance
```

## Storing Embeddings
To use vector search, your table needs a column containing vector embeddings. Embeddings are stored as arrays of floats:
```sql
-- Example schema
CREATE TABLE products (
  product_id VARCHAR PRIMARY KEY,
  title VARCHAR,
  description TEXT,
  embedding FLOAT[384] -- 384 dimensions for MiniLM
);
```

## Generating Embeddings
### Embedding Generation

Catalyzed uses multiple high-performance embedding models in parallel to optimize for both accuracy and speed. Embeddings are automatically generated when you upload files.
The `text_to_embedding()` function returns vectors optimized for semantic search across your data.
:::note Enterprise Features
Need specific embedding models or want to bring your own embeddings? Contact our support team to discuss enterprise options during your implementation scoping.
:::
## Next Steps

- Querying Data - Standard SQL query reference
- Tables - Table schemas and configuration
- Pipelines - Automate embedding generation