# Vector Search
Catalyzed provides native vector similarity search through SQL table functions. Find semantically similar content by searching vector embeddings stored in your tables.
## Overview

Vector search enables semantic queries like “find products similar to wireless headphones” instead of exact keyword matching. This is powered by:

- `knn_search()` - Find the k nearest neighbors to a query vector
- `text_to_embedding()` - Convert natural language to vectors at query time
- Pre-built indices - Fast approximate nearest neighbor (ANN) search on large datasets
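For intuition, what a k-nearest-neighbor search does can be sketched in plain Python. This is a brute-force illustration with toy 2-D vectors, not the engine's actual (indexed) implementation:

```python
import math

def knn(rows, column, query_vector, k):
    """Brute-force k-NN: rank rows by L2 distance to the query vector."""
    scored = []
    for row in rows:
        dist = math.dist(row[column], query_vector)  # Euclidean (L2) distance
        scored.append({**row, "_distance": dist})
    scored.sort(key=lambda r: r["_distance"])
    return scored[:k]

rows = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0]},
    {"id": "c", "embedding": [0.9, 0.1]},
]
top = knn(rows, "embedding", [1.0, 0.0], 2)
# Nearest two rows are "a" (distance 0), then "c"
```

The `_distance` column appended here mirrors the extra column the real `knn_search` returns alongside the source table's columns.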
## Basic Vector Search

The `knn_search` function finds the k most similar rows based on vector distance:
```shell
curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM knn_search('\''products'\'', '\''embedding'\'', text_to_embedding('\''wireless noise-canceling headphones'\''), 10)",
    "tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'
```

```javascript
const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: `SELECT * FROM knn_search(
      'products',
      'embedding',
      text_to_embedding('wireless noise-canceling headphones'),
      10
    )`,
    tables: { products: "KzaMsfA0LSw_Ld0KyaXIS" },
  }),
});
const results = await response.json();
```

```python
response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": """
            SELECT * FROM knn_search(
                'products',
                'embedding',
                text_to_embedding('wireless noise-canceling headphones'),
                10
            )
        """,
        "tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
    }
)
results = response.json()
```

Response includes all columns from the source table plus a `_distance` column:
{ "columns": [ {"name": "product_id", "type": "Utf8"}, {"name": "title", "type": "Utf8"}, {"name": "price", "type": "Float64"}, {"name": "_distance", "type": "Float32"} ], "rows": [ {"product_id": "prod_123", "title": "Sony WH-1000XM5", "price": 349.99, "_distance": 0.12}, {"product_id": "prod_456", "title": "Bose QuietComfort", "price": 329.99, "_distance": 0.15} ]}Function Reference
### knn_search
```sql
knn_search(table, column, query_vector, k, [metric], [filter], [refine_factor], [lower_bound], [upper_bound])
```

| Parameter | Type | Required | Description |
|---|---|---|---|
| `table` | string | Yes | Table name containing embeddings |
| `column` | string | Yes | Column name with vector embeddings |
| `query_vector` | array | Yes | Query vector (same dimensions as stored vectors) |
| `k` | integer | Yes | Number of results to return |
| `metric` | string | No | Distance metric: `'l2'`, `'cosine'`, `'dot'`, `'hamming'` |
| `filter` | string | No | SQL WHERE clause for pre-filtering |
| `refine_factor` | integer | No | Over-fetch multiplier for improved recall |
| `lower_bound` | float | No | Minimum distance (inclusive) |
| `upper_bound` | float | No | Maximum distance (exclusive) |
### Convenience Functions

For common metrics, use the convenience wrappers:
```sql
-- Cosine similarity (recommended for text embeddings)
SELECT * FROM knn_cosine('products', 'embedding', text_to_embedding('query'), 10)

-- L2 (Euclidean) distance
SELECT * FROM knn_l2('products', 'embedding', text_to_embedding('query'), 10)
```

### text_to_embedding
Converts text to a vector using our semantic embedding model:

```sql
SELECT text_to_embedding('wireless headphones')
-- Returns: ARRAY[0.0123, -0.0456, 0.0789, ...]
```

This is useful for:
- Query-time embedding of search terms
- Comparing text similarity inline
- Prototyping before setting up batch embedding pipelines
## Filtered Vector Search

Combine semantic similarity with attribute filters using the `filter` parameter:
```shell
curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM knn_cosine('\''products'\'', '\''embedding'\'', text_to_embedding('\''wireless headphones'\''), 10, '\''category = ''''electronics'''' AND price < 500'\'')",
    "tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'
```

```javascript
const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: `SELECT * FROM knn_cosine(
      'products',
      'embedding',
      text_to_embedding('wireless headphones'),
      10,
      'category = ''electronics'' AND price < 500'
    )`,
    tables: { products: "KzaMsfA0LSw_Ld0KyaXIS" },
  }),
});
```

```python
response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": """
            SELECT * FROM knn_cosine(
                'products',
                'embedding',
                text_to_embedding('wireless headphones'),
                10,
                'category = ''electronics'' AND price < 500'
            )
        """,
        "tables": {"products": "KzaMsfA0LSw_Ld0KyaXIS"}
    }
)
```

The filter is applied before the vector search, ensuring efficient execution when scalar columns are indexed.
## Vector Search with JOINs

Combine vector search with standard SQL operations:
```sql
-- Find similar patents and their claims
SELECT p.title, p.assignee, p._distance, c.claim_text
FROM knn_search(
  'patents',
  'abstract_embedding',
  text_to_embedding('machine learning image classification'),
  5
) p
JOIN claims c ON p.patent_id = c.patent_id
WHERE c.claim_type = 'independent'
ORDER BY p._distance

-- Aggregate similar reviews by rating
SELECT rating, COUNT(*) as count, AVG(_distance) as avg_similarity
FROM knn_cosine(
  'reviews',
  'embedding',
  text_to_embedding('excellent product quality'),
  100
)
GROUP BY rating
ORDER BY rating DESC
```

## Distance Metrics
| Metric | Description | Range | Use Case |
|---|---|---|---|
| `cosine` | Cosine distance | 0 to 2 | Text embeddings, normalized vectors |
| `l2` | Euclidean distance | 0 to ∞ | General purpose, image embeddings |
| `dot` | Dot product | -∞ to ∞ | Maximum inner product search |
| `hamming` | Hamming distance | 0 to dims | Binary vectors, hashes |
Recommendation: Use cosine for text embeddings from models like OpenAI, Cohere, or sentence-transformers.
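For reference, the first three metrics can be computed directly in Python. This is a plain-list sketch of the definitions, not how the engine evaluates them:

```python
import math

def l2(a, b):
    """Euclidean distance (lower = more similar)."""
    return math.dist(a, b)

def dot(a, b):
    """Dot product (higher = more similar)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 = same direction, 2 = opposite direction."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return 1 - dot(a, b) / norm

a, b = [1.0, 0.0], [0.0, 1.0]  # orthogonal vectors
# cosine_distance(a, b) -> 1.0, l2(a, b) -> sqrt(2), dot(a, b) -> 0.0
```

Note that cosine distance ignores magnitude: `[1, 0]` and `[2, 0]` have distance 0, which is why it suits embeddings whose scale carries no meaning.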
## Full-Text Search

Full-text search (FTS) enables BM25-based keyword matching using inverted indexes. Unlike semantic search, which matches by meaning, FTS excels at exact term matching and boolean queries.
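To make the BM25 scoring model concrete, here is a minimal sketch in Python. It uses whitespace tokenization with no stemming or stop-word removal, so it is far simpler than the real inverted index, but the formula is the classic one:

```python
import math

def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each doc against the query: sum of per-term IDF * saturated TF."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n  # average document length
    terms = query.lower().split()
    df = {t: sum(t in d for d in tokenized) for t in terms}  # document frequency
    scores = []
    for doc in tokenized:
        s = 0.0
        for t in terms:
            tf = doc.count(t)
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = [
    "machine learning algorithms",
    "cooking recipes for beginners",
]
scores = bm25_scores(docs, "machine learning")
# Only the first document contains the query terms, so only it scores above 0
```

Higher scores mean better matches, which is the convention the `_score` column follows.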
### fts_search Function

```sql
fts_search(table, column, query_text, limit, [filter])
```

| Parameter | Type | Required | Description |
|---|---|---|---|
| `table` | string | Yes | Table name containing the text column |
| `column` | string | Yes | Column name with text content |
| `query_text` | string | Yes | Search query (supports boolean operators) |
| `limit` | integer | Yes | Maximum number of results to return |
| `filter` | string | No | SQL WHERE clause for pre-filtering |
Returns: All columns from the source table plus a `_score` column with BM25 relevance scores (higher = better match).
```shell
curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM fts_search('\''documents'\'', '\''content'\'', '\''machine learning'\'', 20)",
    "tables": {"documents": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'
```

```javascript
const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: `SELECT * FROM fts_search(
      'documents',
      'content',
      'machine learning',
      20
    )`,
    tables: { documents: "KzaMsfA0LSw_Ld0KyaXIS" },
  }),
});
```

```python
response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": """
            SELECT * FROM fts_search(
                'documents',
                'content',
                'machine learning',
                20
            )
        """,
        "tables": {"documents": "KzaMsfA0LSw_Ld0KyaXIS"}
    }
)
```

Response includes all source columns plus `_score`:
{ "columns": [ {"name": "doc_id", "type": "Utf8"}, {"name": "content", "type": "Utf8"}, {"name": "_score", "type": "Float32"} ], "rows": [ {"doc_id": "doc_123", "content": "Introduction to machine learning...", "_score": 8.42}, {"doc_id": "doc_456", "content": "Machine learning algorithms...", "_score": 7.89} ]}Creating FTS Indexes
Before using `fts_search()`, create an inverted index on the text column:

```sql
-- Create inverted index with default configuration
CREATE INDEX documents_content_fts
ON documents (content)
USING inverted;
```

Index Configuration Options:
| Option | Default | Description |
|---|---|---|
| `base_tokenizer` | `simple` | Tokenization strategy (`simple`, `whitespace`) |
| `with_stemming` | `true` | Enable word stemming (“running” matches “run”) |
| `remove_stop_words` | `true` | Remove common words (“the”, “a”, “is”) |
| `lowercase` | `true` | Normalize to lowercase for case-insensitive search |
| `with_position` | `true` | Track positions for phrase search |
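What these options do to incoming text can be illustrated with a toy analysis pipeline. The real tokenizer and stemmer are more sophisticated; this only mirrors the idea, with a crude suffix-stripping stand-in for stemming:

```python
STOP_WORDS = {"the", "a", "is", "an", "of"}  # tiny illustrative stop list

def analyze(text, lowercase=True, remove_stop_words=True, with_stemming=True):
    """Toy analyzer mirroring the index options above."""
    tokens = text.split()                      # base_tokenizer: split on whitespace
    if lowercase:
        tokens = [t.lower() for t in tokens]   # case-insensitive matching
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if with_stemming:
        # Crude suffix stripping as a stand-in for a real stemmer
        tokens = [t[:-3] if t.endswith("ing") else t for t in tokens]
    return tokens

analyze("Searching the Index")
# -> ["search", "index"]
```

Both documents and queries pass through the same pipeline, which is why “Searching” can match “search” at query time.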
Example with custom configuration:
```javascript
await datasetTableOperations.createIndex(tableId, {
  indexType: "inverted",
  columnName: "content",
  indexName: "content_fts",
  baseTokenizer: "simple",
  withStemming: true,
  removeStopWords: true,
  lowercase: true,
  withPosition: true,
});
```

## When to Use FTS vs Vector Search
| Use Case | FTS (BM25) | Vector Search |
|---|---|---|
| Exact term matching | ✅ Best | ❌ Poor |
| Technical identifiers (APIs, SKUs) | ✅ Best | ❌ Poor |
| Boolean queries (AND/OR/NOT) | ✅ Supported | ❌ Not supported |
| Conceptual similarity | ❌ Poor | ✅ Best |
| Synonyms and paraphrasing | ❌ No | ✅ Automatic |
| Multilingual matching | ❌ Limited | ✅ Good |
Recommendation: Use Hybrid Search with RRF to combine both approaches for optimal results.
## Hybrid Search with RRF

Reciprocal Rank Fusion (RRF) combines FTS and vector search results by merging their rankings, not their scores. This typically produces better results than either method alone.
Algorithm:
- Run keyword search (BM25) and semantic search (vector) independently
- Assign ranks to results (1st place, 2nd place, etc.)
- Calculate RRF score:
  `semanticWeight / (60 + semantic_rank) + keywordWeight / (60 + keyword_rank)`
- Sort by combined RRF score
Benefits:
- Rank-based fusion is more robust than score normalization
- Naturally handles results that appear in only one ranking
- Weights control the balance between semantic and keyword matching
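The steps above can be sketched in a few lines of Python (hypothetical document IDs; 60 is the conventional RRF constant from the formula above):

```python
def rrf_fuse(semantic_ids, keyword_ids, semantic_weight=0.6, keyword_weight=0.4, k=60):
    """Fuse two ranked ID lists by weighted reciprocal rank."""
    scores = {}
    for rank, doc_id in enumerate(semantic_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + semantic_weight / (k + rank)
    for rank, doc_id in enumerate(keyword_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + keyword_weight / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d2", "d1", "d3"]  # ranking from vector search
keyword = ["d1", "d4"]         # ranking from BM25
fused = rrf_fuse(semantic, keyword)
# "d1" places well in both rankings, so it rises to the top
```

Documents appearing in only one ranking (like `d4`) simply contribute one reciprocal-rank term, which is the “naturally handles” property listed above.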
Implementation:
Hybrid search is available through the Knowledge Base API with automatic fallback to semantic-only if FTS is unavailable:
```shell
curl -X POST https://api.catalyzed.ai/knowledge-bases/KBabcdef123456/query \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning algorithms",
    "searchMode": "hybrid",
    "semanticWeight": 0.6,
    "keywordWeight": 0.4,
    "limit": 20
  }'
```

```javascript
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/KBabcdef123456/query",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query: "machine learning algorithms",
      searchMode: "hybrid",
      semanticWeight: 0.6,
      keywordWeight: 0.4,
      limit: 20,
    }),
  }
);
```

```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/KBabcdef123456/query",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "query": "machine learning algorithms",
        "searchMode": "hybrid",
        "semanticWeight": 0.6,
        "keywordWeight": 0.4,
        "limit": 20
    }
)
```

Response includes all score fields:
{ "results": [ { "chunkId": "chunk_123", "content": "Machine learning algorithms...", "score": 0.0847, "semanticScore": 0.15, "keywordScore": 8.42, "combinedScore": 0.0847 } ], "metadata": { "searchMode": "hybrid", "fallback": false }}Weight Tuning:
- Higher `semanticWeight` (0.7-0.8): Prioritize conceptual similarity, good for research/discovery
- Higher `keywordWeight` (0.6-0.7): Prioritize exact terms, good for technical docs/product search
- Balanced (0.5/0.5): Equal importance to both methods
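The effect of the weights can be seen with a tiny experiment (hypothetical document IDs; 60 is the standard RRF constant):

```python
def rrf_score(weight, rank, k=60):
    """Weighted reciprocal-rank contribution of one ranking."""
    return weight / (k + rank)

# "doc_a" ranks 1st semantically but only 5th by keyword;
# "doc_b" is the reverse.
def winner(semantic_weight, keyword_weight):
    a = rrf_score(semantic_weight, 1) + rrf_score(keyword_weight, 5)
    b = rrf_score(semantic_weight, 5) + rrf_score(keyword_weight, 1)
    return "doc_a" if a > b else "doc_b"

winner(0.8, 0.2)  # semantic-heavy weights favor doc_a
winner(0.2, 0.8)  # keyword-heavy weights favor doc_b
```

With balanced weights the two documents tie here, since their ranks are mirror images of each other.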
See Knowledge Bases for complete hybrid search documentation.
## Distance Thresholds

Filter results by distance range using `lower_bound` and `upper_bound`:
```sql
-- Only return results with distance < 0.5 (very similar)
SELECT * FROM knn_search(
  'products',
  'embedding',
  text_to_embedding('wireless headphones'),
  100,   -- k (max results)
  NULL,  -- metric (use default)
  NULL,  -- filter
  NULL,  -- refine_factor
  0.0,   -- lower_bound (inclusive)
  0.5    -- upper_bound (exclusive)
)
```

This is useful for:
- Finding near-duplicates (very low distance)
- Excluding exact matches (lower_bound > 0)
- Quality thresholds in recommendation systems
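The bound semantics (inclusive lower, exclusive upper) can be sketched as a post-filter; this is illustrative only, as the real engine applies the bounds during the search itself:

```python
def within_bounds(results, lower_bound=0.0, upper_bound=float("inf")):
    """Keep rows whose _distance falls in [lower_bound, upper_bound)."""
    return [r for r in results if lower_bound <= r["_distance"] < upper_bound]

hits = [
    {"id": "a", "_distance": 0.0},   # exact match
    {"id": "b", "_distance": 0.08},  # near-duplicate
    {"id": "c", "_distance": 0.6},   # weak match
]
near_dupes = within_bounds(hits, lower_bound=0.01, upper_bound=0.5)
# Keeps only "b": the exact match and the weak match are both filtered out
```

Setting `lower_bound` just above zero is the trick used for duplicate detection below, where the query row itself would otherwise come back at distance 0.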
## Performance Considerations

### When to Use Indices

| Dataset Size | Flat Search | Indexed Search | Recommendation |
|---|---|---|---|
| < 10,000 rows | < 10ms | ~1ms | Flat is fine |
| 10,000 - 100,000 | ~100ms | ~2ms | Consider index |
| > 100,000 rows | > 1s | ~1ms | Index required |
### Refine Factor

For indexed searches, use `refine_factor` to improve recall at the cost of latency:
```sql
-- Over-fetch 5x candidates, then re-rank for better accuracy
SELECT * FROM knn_search(
  'products',
  'embedding',
  text_to_embedding('query'),
  10,
  NULL,  -- metric
  NULL,  -- filter
  5      -- refine_factor: fetch 50 candidates, return top 10
)
```

## Query Tips
- Use specific k values - Don’t request more results than needed
- Filter first - Use the `filter` parameter instead of WHERE on the outer query
- Limit result columns - Select only the columns you need
- Consider distance thresholds - Use bounds to eliminate low-quality matches
## Common Use Cases

### Semantic Document Search

Find documents similar to a query:
```sql
SELECT title, content, _distance
FROM knn_cosine(
  'documents',
  'content_embedding',
  text_to_embedding('renewable energy policy regulations'),
  20
)
ORDER BY _distance
```

### Product Recommendations
Find products similar to one the user is viewing:
```sql
-- Get the embedding from the current product
WITH current_product AS (
  SELECT embedding FROM products WHERE product_id = 'prod_123'
)
SELECT p.product_id, p.title, p.price, p._distance
FROM knn_cosine(
  'products',
  'embedding',
  (SELECT embedding FROM current_product),
  10
) p
WHERE p.product_id != 'prod_123' -- Exclude the current product
```

### Duplicate Detection
Find near-duplicate content:
```sql
SELECT id, title, _distance
FROM knn_search(
  'articles',
  'embedding',
  (SELECT embedding FROM articles WHERE id = 'article_456'),
  50,
  'cosine',
  NULL,  -- no filter
  NULL,  -- no refine
  0.0,   -- lower bound
  0.1    -- upper bound (very similar only)
)
WHERE id != 'article_456'
```

### Advanced Filtering
Combine attribute filters with semantic search using SQL `ILIKE` or other conditions:
```sql
-- First filter by keywords, then rank by semantic similarity
SELECT * FROM knn_cosine(
  'products',
  'embedding',
  text_to_embedding('comfortable wireless headphones'),
  20,
  'title ILIKE ''%headphone%'' OR description ILIKE ''%headphone%'''
)
ORDER BY _distance
```

## Storing Embeddings
To use vector search, your table needs a column containing vector embeddings. Embeddings are stored as arrays of floats:
```sql
-- Example schema
CREATE TABLE products (
  product_id VARCHAR PRIMARY KEY,
  title VARCHAR,
  description TEXT,
  embedding FLOAT[384] -- 384 dimensions for MiniLM
);
```

## Generating Embeddings
### Embedding Generation

Catalyzed uses multiple high-performance embedding models in parallel to optimize for both accuracy and speed. Embeddings are automatically generated when you upload files.
The `text_to_embedding()` function returns vectors optimized for semantic search across your data.
:::note Enterprise Features
Need specific embedding models or want to bring your own embeddings? Contact our support team to discuss enterprise options during your implementation scoping.
:::
## Next Steps

- Querying Data - Standard SQL query reference
- Tables - Table schemas and configuration
- Pipelines - Automate embedding generation