Knowledge Bases

Knowledge Bases are semantic search indexes that combine vector embeddings with a concept graph for intelligent document retrieval. They support both file and table sources, enabling RAG (Retrieval-Augmented Generation) applications, document Q&A, and semantic search.

A Knowledge Base consists of several interconnected components:

  • Knowledge Base: Container for semantic search indexes, owned by a team
  • Sources: File or table inputs that feed content into the KB
  • Chunks: Text segments with vector embeddings for similarity search
  • Concepts: Extracted entities and topics from the text
  • Communities: Graph clusters built with the Leiden algorithm, used for ranking

When you add a source, the system automatically:

  1. Extracts text content (from files or table columns)
  2. Splits text into overlapping chunks (see the sketch after this list)
  3. Generates embeddings using the configured model
  4. Extracts concepts and builds a concept graph
  5. Clusters concepts into hierarchical communities (L0-L3)
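
To make step 2 concrete, the sketch below shows one way overlapping character chunks can be produced with the default chunkSize of 512 and chunkOverlap of 50. It is only an illustration; the service's actual splitter (for example, whether it respects sentence or token boundaries) is not specified here.

Overlapping chunking (illustrative Python sketch)

# Hypothetical illustration of step 2; not the service's actual splitting logic.
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into ~chunk_size character chunks, each overlapping the previous by chunk_overlap."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("lorem ipsum " * 500)  # e.g. text extracted from a PDF source
print(f"{len(chunks)} chunks, first chunk is {len(chunks[0])} chars")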

Files processed through the file processing pipeline can be added as KB sources:

  • PDFs, DOCX, XLSX, CSV files
  • Uses extraction ID to track processing version
  • Auto-syncs when the file is reprocessed

Dataset tables can be indexed by specifying which columns to extract content from:

  • Requires columnSpec with columns to index
  • Uses dataset version for change detection
  • Supports incremental sync

Create a knowledge base

curl -X POST https://api.catalyzed.ai/knowledge-bases \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "name": "Product Documentation"
}'

Response:

{
  "knowledgeBaseId": "abc123xyz",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "name": "Product Documentation",
  "config": {
    "embeddingModel": "BAAI/bge-small-en-v1.5",
    "embeddingDimension": 384,
    "chunkSize": 512,
    "chunkOverlap": 50,
    "nlpLibrary": "spacy"
  },
  "communitiesStale": true,
  "communitiesBuiltAt": null,
  "createdAt": "2025-01-15T10:30:00Z",
  "updatedAt": "2025-01-15T10:30:00Z"
}
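
The same endpoints can be called from any HTTP client. As a minimal sketch, assuming Python with the requests library and the API token in an environment variable (error handling kept to a single raise_for_status call):

Create a knowledge base from Python

import os
import requests

API_BASE = "https://api.catalyzed.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

# POST /knowledge-bases with the minimal required fields (teamId, name)
resp = requests.post(
    f"{API_BASE}/knowledge-bases",
    headers=HEADERS,
    json={"teamId": "ZkoDMyjZZsXo4VAO_nJLk", "name": "Product Documentation"},
)
resp.raise_for_status()
kb = resp.json()
print(kb["knowledgeBaseId"], kb["config"]["embeddingModel"])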

You can customize the KB configuration when creating it:

Create with custom config

curl -X POST https://api.catalyzed.ai/knowledge-bases \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "name": "Research Papers",
  "config": {
    "chunkSize": 1024,
    "chunkOverlap": 100,
    "leidenResolution": 1.5
  }
}'

Config Field        Default                 Description
embeddingModel      BAAI/bge-small-en-v1.5  Embedding model for vectors
embeddingDimension  384                     Vector dimensions
chunkSize           512                     Target chunk size in characters
chunkOverlap        50                      Overlap between consecutive chunks
nlpLibrary          spacy                   NLP library for concept extraction
leidenResolution    1.0                     Community detection resolution (0.1-10)

List knowledge bases in a team

curl "https://api.catalyzed.ai/knowledge-bases?teamIds=ZkoDMyjZZsXo4VAO_nJLk" \
-H "Authorization: Bearer $API_TOKEN"

Parameter         Type    Description
teamIds           string  Comma-separated team IDs to filter by
knowledgeBaseIds  string  Comma-separated KB IDs to filter by
name              string  Filter by name (partial match)
page              number  Page number (starts at 1, default: 1)
pageSize          number  Results per page (1-100, default: 20)
orderBy           string  Sort by: createdAt, name, updatedAt
orderDirection    string  Sort direction: asc or desc
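
The filter, pagination, and ordering parameters above can be combined on the same request. A small sketch (Python requests; the shape of the list response body is not shown in this section, so it is simply printed):

List knowledge bases with pagination

import os
import requests

headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

# Second page of the team's KBs, 50 per page, sorted by name ascending
resp = requests.get(
    "https://api.catalyzed.ai/knowledge-bases",
    headers=headers,
    params={
        "teamIds": "ZkoDMyjZZsXo4VAO_nJLk",
        "page": 2,
        "pageSize": 50,
        "orderBy": "name",
        "orderDirection": "asc",
    },
)
resp.raise_for_status()
print(resp.json())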

Get knowledge base by ID

curl https://api.catalyzed.ai/knowledge-bases/abc123xyz \
-H "Authorization: Bearer $API_TOKEN"

Update knowledge base

curl -X PATCH https://api.catalyzed.ai/knowledge-bases/abc123xyz \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "name": "Product Documentation v2"
}'

Delete knowledge base

curl -X DELETE https://api.catalyzed.ai/knowledge-bases/abc123xyz \
-H "Authorization: Bearer $API_TOKEN"

Add file source

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "type": "file",
  "fileId": "file_xyz789"
}'

Add table source

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "type": "table",
  "datasetTableId": "table_abc456",
  "columnSpec": {
    "columns": ["title", "description", "content"]
  }
}'

Each source has a status that tracks its processing state:

Status      Description
pending     Source added, waiting for initial indexing
processing  Currently being indexed
processed   Successfully indexed and up-to-date
failed      Indexing failed (check syncErrorMessage)
stale       Source data changed, needs re-sync

List sources for a knowledge base

curl "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources" \
-H "Authorization: Bearer $API_TOKEN"

Remove a source from a knowledge base. This deletes the source metadata and triggers a cleanup job to remove the associated chunks from the vector store.

Delete a source

curl -X DELETE https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources/source_123 \
-H "Authorization: Bearer $API_TOKEN"

Note: Chunk cleanup happens asynchronously via a background job. Queries may briefly return results from deleted sources until cleanup completes.

Knowledge Base queries support three search modes: semantic (vector embeddings), keyword (full-text search), and hybrid (combining both with RRF scoring).

Semantic search uses vector embeddings for conceptual similarity matching. It is best for:

  • Finding content by meaning, not exact words
  • Cross-lingual or multilingual search
  • Handling synonyms and paraphrasing naturally

Semantic search

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "query": "How do I configure authentication?",
  "searchMode": "semantic",
  "limit": 10
}'

Keyword search uses BM25 full-text search with inverted indexes. It is best for:

  • Exact term matching and boolean queries
  • Technical documentation (API names, error codes, function names)
  • Structured data (product SKUs, model numbers, IDs)

Keyword search

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "query": "JWT authentication",
  "searchMode": "keyword",
  "limit": 10
}'

Hybrid search combines keyword and semantic search using RRF (Reciprocal Rank Fusion) to merge the two rankings. It is best for:

  • General-purpose search applications
  • When you want both exact matches AND semantic relevance
  • Production applications (recommended default)

Hybrid search with custom weights

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "query": "How do I configure authentication?",
  "searchMode": "hybrid",
  "semanticWeight": 0.6,
  "keywordWeight": 0.4,
  "limit": 10
}'

Weight Tuning:

  • Higher semanticWeight: Prioritize conceptual similarity
  • Higher keywordWeight: Prioritize exact term matches
  • Default (0.6 / 0.4): Balanced for most use cases
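
To make the weighting concrete, the sketch below shows one common form of weighted Reciprocal Rank Fusion, where each chunk's fused score depends only on its rank in the semantic and keyword result lists. The constant k = 60 is conventional; the exact formula and constants the service uses are not documented here.

Weighted RRF (illustrative)

# Illustrative weighted Reciprocal Rank Fusion; k=60 is a conventional constant,
# not necessarily what the service uses.
def rrf_fuse(semantic_ids, keyword_ids, semantic_weight=0.6, keyword_weight=0.4, k=60):
    """Fuse two rankings (chunk IDs ordered best-first) into one weighted RRF score per chunk."""
    scores = {}
    for weight, ranking in ((semantic_weight, semantic_ids), (keyword_weight, keyword_ids)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rrf_fuse(["a", "b", "c"], ["c", "a", "d"]))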

Response:

{
  "results": [
    {
      "chunkId": "chunk_abc123",
      "content": "To configure authentication, first enable...",
      "score": 0.89,
      "semanticScore": 0.87,
      "keywordScore": 12.5,
      "combinedScore": 0.89,
      "communityL0Id": "c0_001",
      "communityL1Id": "c1_001",
      "fileId": "file_xyz789",
      "datasetTableId": null,
      "charStart": 0,
      "charEnd": 512
    }
  ],
  "communities": [
    {
      "communityId": "c0_001",
      "level": 0,
      "bestScore": 0.89,
      "chunkCount": 5
    }
  ],
  "metadata": {
    "searchMode": "hybrid",
    "fallback": false
  }
}

Field                     Type           Description
results                   array          Matching chunks, sorted by relevance
results[].chunkId         string         Unique chunk identifier
results[].content         string         Text content of the chunk
results[].score           number         Primary score (semantic: cosine, keyword: BM25, hybrid: RRF)
results[].semanticScore   number?        Cosine similarity (only in semantic/hybrid modes)
results[].keywordScore    number?        BM25 score (only in keyword/hybrid modes)
results[].combinedScore   number?        RRF combined score (only in hybrid mode)
results[].communityL0Id   string | null  Level 0 community cluster ID
results[].communityL1Id   string | null  Level 1 community cluster ID
results[].fileId          string | null  Source file ID (null for table sources)
results[].datasetTableId  string | null  Source table ID (null for file sources)
results[].charStart       number         Character offset where chunk starts
results[].charEnd         number         Character offset where chunk ends
communities               array          Communities ranked by best chunk score
metadata.searchMode       string         Search mode that was used
metadata.fallback         boolean?       True if hybrid mode fell back to semantic-only
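
A common way to consume this response in a RAG application is to fold the top chunks into context for an LLM prompt. A minimal sketch using only the documented results fields (the downstream LLM call itself is out of scope here):

Build RAG context from query results

import os
import requests

headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}
question = "How do I configure authentication?"

resp = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query",
    headers=headers,
    json={"query": question, "searchMode": "hybrid", "limit": 5},
)
resp.raise_for_status()
results = resp.json()["results"]

# Concatenate the top chunks into a context block for a downstream LLM call
context = "\n\n".join(f"[{r['score']:.2f}] {r['content']}" for r in results)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"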

Knowledge Base sources automatically create inverted indexes on the content column during ingestion. These indexes power keyword and hybrid search modes using BM25 scoring.

Index Configuration:

  • Tokenizer: Simple (word-based)
  • Stemming: Enabled (handles “run”, “running”, “runs”)
  • Stop words: Removed (common words like “the”, “a”, “is”)
  • Case: Normalized to lowercase
  • Positions: Tracked (enables phrase search)

Index creation is non-fatal: if it fails, hybrid search automatically falls back to semantic-only mode.

For more intelligent retrieval, use the streaming endpoint, which combines vector search with LLM-based relevance testing:

Stream LazyGraphRAG query

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query/stream \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "query": "What are the best practices for error handling?",
  "budget": 20
}'
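
The exact wire format of the stream (SSE vs. newline-delimited JSON, event field names) is not documented in this section, so the client sketch below simply assumes newline-delimited events and prints them as they arrive:

Consume the streaming response (illustrative)

import os
import requests

headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

# Stream the LazyGraphRAG query; the event format handled below is an assumption
with requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query/stream",
    headers=headers,
    json={"query": "What are the best practices for error handling?", "budget": 20},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line:
            print(line)  # each non-empty line is treated as one streamed event (assumed)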

The budget parameter controls how many chunks the LLM tests for relevance. A higher budget yields more thorough results but increases cost and latency.

Budget   Use Case                                    Cost/Latency
5-10     Quick answers, simple queries               Low
15-25    General purpose (recommended default)       Medium
30-50    Complex questions, comprehensive research   High
50+      Deep analysis, when recall is critical      Very high

Recommendations:

  • Start with budget: 20 for most use cases
  • Increase budget if results seem incomplete or miss relevant content
  • Reduce budget for time-sensitive applications or simple factual queries
  • The number of chunks returned may be lower than the budget (only relevant chunks are included)

Knowledge Bases use the Leiden algorithm to cluster concepts into hierarchical communities:

  • L0 - Finest level (individual concept clusters)
  • L1 - Intermediate clusters
  • L2 - Broader topic groups
  • L3 - Coarsest level (major themes)

Communities are used for intelligent ranking in queries: chunks from relevant communities are prioritized.
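
Because each query result carries communityL0Id/communityL1Id and the communities array is ranked by best chunk score, a client can regroup hits by community. A small sketch over an already-parsed query response (see the response fields above):

Group query results by community (sketch)

from collections import defaultdict

def group_by_community(response: dict) -> dict:
    """Group result chunks by level-0 community, ordered by the ranked communities list."""
    by_community = defaultdict(list)
    for chunk in response["results"]:
        by_community[chunk["communityL0Id"]].append(chunk)

    ordered = {}
    for community in response["communities"]:  # already ranked by bestScore
        cid = community["communityId"]
        if cid in by_community:
            ordered[cid] = by_community.pop(cid)
    ordered.update(by_community)  # chunks whose community is not in the ranked list (e.g. null)
    return ordered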

Communities are automatically rebuilt when sources are indexed. You can also trigger a manual rebuild:

Rebuild communities

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/rebuild-communities \
-H "Authorization: Bearer $API_TOKEN"

Knowledge Base object fields:

Field               Type           Description
knowledgeBaseId     string         Unique identifier
teamId              string         Team that owns this KB
name                string         Human-readable name (1-255 characters)
config              object         Configuration options (see above)
communitiesStale    boolean        Whether communities need rebuilding
communitiesBuiltAt  string | null  ISO 8601 timestamp of last community build
createdAt           string         ISO 8601 timestamp of creation
updatedAt           string         ISO 8601 timestamp of last modification

Source object fields:

Field                    Type           Description
knowledgeBaseSourceId    string         Unique identifier
knowledgeBaseId          string         Parent KB ID
fileId                   string | null  File ID (for file sources)
datasetTableId           string | null  Table ID (for table sources)
columnSpec               object | null  Column configuration (for table sources)
status                   string         Processing status
processedExtractionId    string | null  Last processed extraction ID
processedDatasetVersion  number | null  Last processed dataset version
addedAt                  string         ISO 8601 timestamp when source was added
processedAt              string | null  ISO 8601 timestamp of last successful processing
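
For typed client code, the two objects above can be mirrored with lightweight data classes. A sketch that simply follows the field tables (no parsing or validation):

Object shapes as Python dataclasses (sketch)

from dataclasses import dataclass
from typing import Optional

@dataclass
class KnowledgeBase:
    knowledgeBaseId: str
    teamId: str
    name: str
    config: dict
    communitiesStale: bool
    communitiesBuiltAt: Optional[str]  # ISO 8601 or null
    createdAt: str
    updatedAt: str

@dataclass
class KnowledgeBaseSource:
    knowledgeBaseSourceId: str
    knowledgeBaseId: str
    fileId: Optional[str]
    datasetTableId: Optional[str]
    columnSpec: Optional[dict]
    status: str  # pending | processing | processed | failed | stale
    processedExtractionId: Optional[str]
    processedDatasetVersion: Optional[int]
    addedAt: str
    processedAt: Optional[str]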