Full-Text Search

Full-text search (FTS) enables BM25-based keyword matching using inverted indexes. Unlike vector search which matches by meaning, FTS excels at exact term matching, technical identifiers, and boolean queries.

Overview

Full-text search is powered by:

fts_search() - BM25 keyword search over text columns
Inverted indexes - Persistent indexes for fast term lookup on large datasets
JIT indexing - Ephemeral in-RAM indexes for ad-hoc exploration without pre-built indexes

fts_search Function

fts_search(table, column, query_text, limit, [filter], [enable_jit_index])

Parameter	Type	Required	Description
`table`	string	Yes	Table name containing the text column
`column`	string	Yes	Column name with text content
`query_text`	string	Yes	Search query (supports boolean operators)
`limit`	integer	Yes	Maximum number of results to return
`filter`	string	No	SQL WHERE clause for pre-filtering (pass `NULL` to skip)
`enable_jit_index`	boolean	No	Enable JIT indexing for tables without a persistent inverted index (default: `false`)

Returns: All columns from the source table plus a _score column with BM25 relevance scores (higher = better match).

Basic Full-Text Search

Basic FTS

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM fts_search('\''documents'\'', '\''content'\'', '\''machine learning'\'', 20)",
    "tables": {"documents": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'

const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: `SELECT * FROM fts_search(
      'documents',
      'content',
      'machine learning',
      20
    )`,
    tables: { documents: "KzaMsfA0LSw_Ld0KyaXIS" },
  }),
});

response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": """
            SELECT * FROM fts_search(
                'documents',
                'content',
                'machine learning',
                20
            )
        """,
        "tables": {"documents": "KzaMsfA0LSw_Ld0KyaXIS"}
    }
)

Response includes all source columns plus _score:

{
  "columns": [
    {"name": "doc_id", "type": "Utf8"},
    {"name": "content", "type": "Utf8"},
    {"name": "_score", "type": "Float32"}
  ],
  "rows": [
    {"doc_id": "doc_123", "content": "Introduction to machine learning...", "_score": 8.42},
    {"doc_id": "doc_456", "content": "Machine learning algorithms...", "_score": 7.89}
  ]
}

Filtered Full-Text Search

Combine keyword search with attribute filters using the filter parameter:

SELECT * FROM fts_search(
    'documents',
    'content',
    'machine learning',
    20,
    'category = ''tech'' AND published_at > ''2024-01-01'''
)

The filter is applied before the text search, ensuring efficient execution when scalar columns are indexed.

JIT Indexing

For smaller datasets (up to 500K rows), you can skip index creation entirely by passing enable_jit_index = true as the 6th parameter. This builds an ephemeral in-RAM index at query time:

-- Search without a pre-built index (pass NULL for filter)
SELECT * FROM fts_search('documents', 'content', 'machine learning', 20, NULL, true)

-- JIT index with a filter
SELECT * FROM fts_search('documents', 'content', 'machine learning', 20, 'category = ''tech''', true)

JIT indexing is useful for prototyping, ad-hoc exploration, and smaller tables where maintaining a persistent index isn’t justified. For large datasets or frequent queries, create a persistent inverted index for better performance.

Creating Inverted Indexes

For larger datasets or frequent queries, create an inverted index on the text column:

-- Create inverted index with default configuration
CREATE INDEX documents_content_fts
ON documents (content)
USING inverted;

Index Configuration Options

Option	Default	Description
`base_tokenizer`	`simple`	Tokenization strategy (`simple`, `whitespace`)
`with_stemming`	`true`	Enable word stemming (“running” matches “run”)
`remove_stop_words`	`true`	Remove common words (“the”, “a”, “is”)
`lowercase`	`true`	Normalize to lowercase for case-insensitive search
`with_position`	`true`	Track positions for phrase search

Example with custom configuration:

await datasetTableOperations.createIndex(tableId, {
  indexType: "inverted",
  columnName: "content",
  indexName: "content_fts",
  baseTokenizer: "simple",
  withStemming: true,
  removeStopWords: true,
  lowercase: true,
  withPosition: true,
});

Full-Text Search with JOINs

Combine FTS results with standard SQL operations:

-- Find matching documents and their authors
SELECT d.title, d._score, a.name as author
FROM fts_search(
    'documents',
    'content',
    'patent infringement',
    20
) d
JOIN authors a ON d.author_id = a.id
ORDER BY d._score DESC

-- Aggregate FTS results by category
SELECT category, COUNT(*) as matches, AVG(_score) as avg_relevance
FROM fts_search(
    'articles',
    'body',
    'climate change policy',
    100
)
GROUP BY category
ORDER BY avg_relevance DESC

When to Use FTS vs Vector Search

Use Case	FTS (BM25)	Vector Search
Exact term matching	Best	Poor
Technical identifiers (APIs, SKUs)	Best	Poor
Boolean queries (AND/OR/NOT)	Supported	Not supported
Conceptual similarity	Poor	Best
Synonyms and paraphrasing	No	Automatic
Multilingual matching	Limited	Good

Next Steps

Vector Search - Semantic similarity search with embeddings
Querying Data - Standard SQL query reference
Tables - Table schemas and configuration