Skip to content

Full-Text Search

Full-text search (FTS) enables BM25-based keyword matching using inverted indexes. Unlike vector search which matches by meaning, FTS excels at exact term matching, technical identifiers, and boolean queries.

Full-text search is powered by:

  • fts_search() - BM25 keyword search over text columns
  • Inverted indexes - Persistent indexes for fast term lookup on large datasets
  • JIT indexing - Ephemeral in-RAM indexes for ad-hoc exploration without pre-built indexes
fts_search(table, column, query_text, limit, [filter], [enable_jit_index])
ParameterTypeRequiredDescription
tablestringYesTable name containing the text column
columnstringYesColumn name with text content
query_textstringYesSearch query (supports boolean operators)
limitintegerYesMaximum number of results to return
filterstringNoSQL WHERE clause for pre-filtering (pass NULL to skip)
enable_jit_indexbooleanNoEnable JIT indexing for tables without a persistent inverted index (default: false)

Returns: All columns from the source table plus a _score column with BM25 relevance scores (higher = better match).

Basic FTS

Terminal window
curl -X POST https://api.catalyzed.ai/queries \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"sql": "SELECT * FROM fts_search('\''documents'\'', '\''content'\'', '\''machine learning'\'', 20)",
"tables": {"documents": "KzaMsfA0LSw_Ld0KyaXIS"}
}'

Response includes all source columns plus _score:

{
"columns": [
{"name": "doc_id", "type": "Utf8"},
{"name": "content", "type": "Utf8"},
{"name": "_score", "type": "Float32"}
],
"rows": [
{"doc_id": "doc_123", "content": "Introduction to machine learning...", "_score": 8.42},
{"doc_id": "doc_456", "content": "Machine learning algorithms...", "_score": 7.89}
]
}

Combine keyword search with attribute filters using the filter parameter:

SELECT * FROM fts_search(
'documents',
'content',
'machine learning',
20,
'category = ''tech'' AND published_at > ''2024-01-01'''
)

The filter is applied before the text search, ensuring efficient execution when scalar columns are indexed.

For smaller datasets (up to 500K rows), you can skip index creation entirely by passing enable_jit_index = true as the 6th parameter. This builds an ephemeral in-RAM index at query time:

-- Search without a pre-built index (pass NULL for filter)
SELECT * FROM fts_search('documents', 'content', 'machine learning', 20, NULL, true)
-- JIT index with a filter
SELECT * FROM fts_search('documents', 'content', 'machine learning', 20, 'category = ''tech''', true)

JIT indexing is useful for prototyping, ad-hoc exploration, and smaller tables where maintaining a persistent index isn’t justified. For large datasets or frequent queries, create a persistent inverted index for better performance.

For larger datasets or frequent queries, create an inverted index on the text column:

-- Create inverted index with default configuration
CREATE INDEX documents_content_fts
ON documents (content)
USING inverted;
OptionDefaultDescription
base_tokenizersimpleTokenization strategy (simple, whitespace)
with_stemmingtrueEnable word stemming (“running” matches “run”)
remove_stop_wordstrueRemove common words (“the”, “a”, “is”)
lowercasetrueNormalize to lowercase for case-insensitive search
with_positiontrueTrack positions for phrase search

Example with custom configuration:

await datasetTableOperations.createIndex(tableId, {
indexType: "inverted",
columnName: "content",
indexName: "content_fts",
baseTokenizer: "simple",
withStemming: true,
removeStopWords: true,
lowercase: true,
withPosition: true,
});

Combine FTS results with standard SQL operations:

-- Find matching documents and their authors
SELECT d.title, d._score, a.name as author
FROM fts_search(
'documents',
'content',
'patent infringement',
20
) d
JOIN authors a ON d.author_id = a.id
ORDER BY d._score DESC
-- Aggregate FTS results by category
SELECT category, COUNT(*) as matches, AVG(_score) as avg_relevance
FROM fts_search(
'articles',
'body',
'climate change policy',
100
)
GROUP BY category
ORDER BY avg_relevance DESC
Use CaseFTS (BM25)Vector Search
Exact term matchingBestPoor
Technical identifiers (APIs, SKUs)BestPoor
Boolean queries (AND/OR/NOT)SupportedNot supported
Conceptual similarityPoorBest
Synonyms and paraphrasingNoAutomatic
Multilingual matchingLimitedGood