Knowledge Bases
Knowledge Bases are semantic search indexes that combine vector embeddings with a concept graph for intelligent document retrieval. They support both file and table sources, enabling RAG (Retrieval-Augmented Generation) applications, document Q&A, and semantic search.
Architecture
A Knowledge Base consists of several interconnected components:
- Knowledge Base: Container for semantic search indexes, owned by a team
- Sources: File or table inputs that feed content into the KB
- Chunks: Text segments with vector embeddings for similarity search
- Concepts: Extracted entities and topics from the text
- Communities: Clusters of the concept graph, detected with the Leiden algorithm and used for ranking
When you add a source, the system automatically:
- Extracts text content (from files or table columns)
- Splits text into overlapping chunks
- Generates embeddings using the configured model
- Extracts concepts and builds a concept graph
- Clusters concepts into hierarchical communities (L0-L3)
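The chunking step above can be illustrated with a minimal sketch. It assumes plain fixed-size character windows using the default `chunkSize` (512) and `chunkOverlap` (50); the function name `split_into_chunks` is hypothetical, and the actual splitter may also respect sentence or paragraph boundaries.

```python
def split_into_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[dict]:
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # charStart/charEnd mirror the offsets reported in query results
        chunks.append({"charStart": start, "charEnd": end, "content": text[start:end]})
        if end == len(text):
            break
        start += step
    return chunks
```

With the defaults, consecutive chunks share 50 characters, so a sentence that straddles a boundary still appears whole in at least one chunk in most cases.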
Source Types
File Sources
Files processed through the file processing pipeline can be added as KB sources:
- PDFs, DOCX, XLSX, CSV files
- Uses extraction ID to track processing version
- Auto-syncs when the file is reprocessed
Table Sources
Dataset tables can be indexed by specifying which columns to extract content from:
- Requires columnSpec with the columns to index
- Uses dataset version for change detection
- Supports incremental sync
Creating a Knowledge Base
Create a knowledge base
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
    "name": "Product Documentation"
  }'
```
```typescript
const response = await fetch("https://api.catalyzed.ai/knowledge-bases", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    name: "Product Documentation",
  }),
});
const kb = await response.json();
```
```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
        "name": "Product Documentation"
    }
)
kb = response.json()
```
Response:
```json
{
  "knowledgeBaseId": "abc123xyz",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "name": "Product Documentation",
  "config": {
    "embeddingModel": "BAAI/bge-small-en-v1.5",
    "embeddingDimension": 384,
    "chunkSize": 512,
    "chunkOverlap": 50,
    "nlpLibrary": "spacy"
  },
  "communitiesStale": true,
  "communitiesBuiltAt": null,
  "createdAt": "2025-01-15T10:30:00Z",
  "updatedAt": "2025-01-15T10:30:00Z"
}
```
Configuration Options
You can customize the KB configuration when creating:
Create with custom config
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
    "name": "Research Papers",
    "config": {
      "chunkSize": 1024,
      "chunkOverlap": 100,
      "leidenResolution": 1.5
    }
  }'
```
```typescript
const response = await fetch("https://api.catalyzed.ai/knowledge-bases", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    name: "Research Papers",
    config: {
      chunkSize: 1024,
      chunkOverlap: 100,
      leidenResolution: 1.5,
    },
  }),
});
```
```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
        "name": "Research Papers",
        "config": {
            "chunkSize": 1024,
            "chunkOverlap": 100,
            "leidenResolution": 1.5
        }
    }
)
```
| Config Field | Default | Description |
|---|---|---|
| embeddingModel | BAAI/bge-small-en-v1.5 | Embedding model for vectors |
| embeddingDimension | 384 | Vector dimensions |
| chunkSize | 512 | Target chunk size in characters |
| chunkOverlap | 50 | Overlap between consecutive chunks |
| nlpLibrary | spacy | NLP library for concept extraction |
| leidenResolution | 1.0 | Community detection resolution (0.1-10) |
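Before sending a custom config, it can help to check the values locally. The sketch below merges overrides onto the defaults from the table and checks the documented `leidenResolution` range. The helper `build_config` is hypothetical, and the rule that `chunkOverlap` must be smaller than `chunkSize` is an assumption rather than a documented constraint; the API may apply different validation.

```python
DEFAULT_CONFIG = {
    "embeddingModel": "BAAI/bge-small-en-v1.5",
    "embeddingDimension": 384,
    "chunkSize": 512,
    "chunkOverlap": 50,
    "nlpLibrary": "spacy",
    "leidenResolution": 1.0,
}


def build_config(**overrides) -> dict:
    """Merge overrides onto the documented defaults and sanity-check them."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"unknown config fields: {sorted(unknown)}")

    config = {**DEFAULT_CONFIG, **overrides}

    # Documented range for the Leiden resolution parameter
    if not 0.1 <= config["leidenResolution"] <= 10:
        raise ValueError("leidenResolution must be between 0.1 and 10")
    # Assumption: overlap has to be smaller than the chunk itself
    if not 0 <= config["chunkOverlap"] < config["chunkSize"]:
        raise ValueError("chunkOverlap must be in [0, chunkSize)")
    return config
```

For example, `build_config(chunkSize=1024, chunkOverlap=100, leidenResolution=1.5)` yields a dict that could be passed as the `config` field of the create request.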
Listing Knowledge Bases
List knowledge bases in a team
```bash
curl "https://api.catalyzed.ai/knowledge-bases?teamIds=ZkoDMyjZZsXo4VAO_nJLk" \
  -H "Authorization: Bearer $API_TOKEN"
```
```typescript
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases?teamIds=ZkoDMyjZZsXo4VAO_nJLk",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const { knowledgeBases } = await response.json();
```
```python
response = requests.get(
    "https://api.catalyzed.ai/knowledge-bases",
    params={"teamIds": "ZkoDMyjZZsXo4VAO_nJLk"},
    headers={"Authorization": f"Bearer {api_token}"}
)
knowledge_bases = response.json()["knowledgeBases"]
```
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| teamIds | string | Comma-separated team IDs to filter by |
| knowledgeBaseIds | string | Comma-separated KB IDs to filter by |
| name | string | Filter by name (partial match) |
| page | number | Page number (starts at 1, default: 1) |
| pageSize | number | Results per page (1-100, default: 20) |
| orderBy | string | Sort by: createdAt, name, updatedAt |
| orderDirection | string | Sort direction: asc or desc |
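The paging parameters above can be combined into a simple loop. This is a sketch, not a client library: `list_params` and `iter_knowledge_bases` are hypothetical helpers, `fetch_page` is any callable that performs the GET request and returns the decoded JSON body, and stopping when a page comes back short is an assumption, since the response's pagination metadata is not described here.

```python
from typing import Callable, Iterator


def list_params(team_ids: list[str], page: int = 1, page_size: int = 20) -> dict:
    """Build query parameters for GET /knowledge-bases."""
    if page < 1:
        raise ValueError("page starts at 1")
    if not 1 <= page_size <= 100:
        raise ValueError("pageSize must be between 1 and 100")
    return {"teamIds": ",".join(team_ids), "page": page, "pageSize": page_size}


def iter_knowledge_bases(
    fetch_page: Callable[[dict], dict], team_ids: list[str], page_size: int = 20
) -> Iterator[dict]:
    """Yield knowledge bases page by page until a short page is returned."""
    page = 1
    while True:
        body = fetch_page(list_params(team_ids, page, page_size))
        items = body.get("knowledgeBases", [])
        yield from items
        if len(items) < page_size:  # assumption: a short page means the last page
            return
        page += 1
```

In practice `fetch_page` could wrap `requests.get("https://api.catalyzed.ai/knowledge-bases", params=params, headers=...).json()`.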
Getting a Knowledge Base
Get knowledge base by ID
```bash
curl https://api.catalyzed.ai/knowledge-bases/abc123xyz \
  -H "Authorization: Bearer $API_TOKEN"
```
```typescript
const response = await fetch("https://api.catalyzed.ai/knowledge-bases/abc123xyz", {
  headers: { Authorization: `Bearer ${apiToken}` },
});
const kb = await response.json();
```
```python
response = requests.get(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz",
    headers={"Authorization": f"Bearer {api_token}"}
)
kb = response.json()
```
Updating a Knowledge Base
Update knowledge base
```bash
curl -X PATCH https://api.catalyzed.ai/knowledge-bases/abc123xyz \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Product Documentation v2"
  }'
```
```typescript
await fetch("https://api.catalyzed.ai/knowledge-bases/abc123xyz", {
  method: "PATCH",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    name: "Product Documentation v2",
  }),
});
```
```python
requests.patch(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz",
    headers={"Authorization": f"Bearer {api_token}"},
    json={"name": "Product Documentation v2"}
)
```
Deleting a Knowledge Base
Delete knowledge base
```bash
curl -X DELETE https://api.catalyzed.ai/knowledge-bases/abc123xyz \
  -H "Authorization: Bearer $API_TOKEN"
```
```typescript
await fetch("https://api.catalyzed.ai/knowledge-bases/abc123xyz", {
  method: "DELETE",
  headers: { Authorization: `Bearer ${apiToken}` },
});
```
```python
requests.delete(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz",
    headers={"Authorization": f"Bearer {api_token}"}
)
```
Adding Sources
Adding a File Source
Add file source
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "file",
    "fileId": "file_xyz789"
  }'
```
```typescript
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      type: "file",
      fileId: "file_xyz789",
    }),
  }
);
const source = await response.json();
```
```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "type": "file",
        "fileId": "file_xyz789"
    }
)
source = response.json()
```
Adding a Table Source
Add table source
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "table",
    "datasetTableId": "table_abc456",
    "columnSpec": {
      "columns": ["title", "description", "content"]
    }
  }'
```
```typescript
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      type: "table",
      datasetTableId: "table_abc456",
      columnSpec: {
        columns: ["title", "description", "content"],
      },
    }),
  }
);
```
```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "type": "table",
        "datasetTableId": "table_abc456",
        "columnSpec": {
            "columns": ["title", "description", "content"]
        }
    }
)
```
Source Status
Each source has a status that tracks its processing state:
| Status | Description |
|---|---|
| pending | Source added, waiting for initial indexing |
| processing | Currently being indexed |
| processed | Successfully indexed and up-to-date |
| failed | Indexing failed (check syncErrorMessage) |
| stale | Source data changed, needs re-sync |
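Because indexing is asynchronous, a client usually polls until a source reaches `processed` or `failed`. The sketch below shows one way to do that. `wait_for_source` is a hypothetical helper, `get_source` is any callable returning the current source object (for example, by calling the list-sources endpoint and picking the matching entry), and the timeout and polling interval are arbitrary assumptions. It treats `stale` like the other non-terminal states and keeps waiting.

```python
import time
from typing import Callable


def wait_for_source(
    get_source: Callable[[], dict],
    timeout_s: float = 300.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a source until it is processed; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        source = get_source()
        status = source["status"]
        if status == "processed":
            return source
        if status == "failed":
            # syncErrorMessage is the field named in the status table
            raise RuntimeError(source.get("syncErrorMessage") or "indexing failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"source still '{status}' after {timeout_s}s")
        sleep(poll_interval_s)
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.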
Listing Sources
List sources for a knowledge base
```bash
curl "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources" \
  -H "Authorization: Bearer $API_TOKEN"
```
```typescript
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const { sources } = await response.json();
```
```python
response = requests.get(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources",
    headers={"Authorization": f"Bearer {api_token}"}
)
sources = response.json()["sources"]
```
Removing a Source
Remove a source from a knowledge base. This deletes the source metadata and triggers a cleanup job to remove the associated chunks from the vector store.
Delete a source
```bash
curl -X DELETE https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources/source_123 \
  -H "Authorization: Bearer $API_TOKEN"
```
```typescript
await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources/source_123",
  {
    method: "DELETE",
    headers: { Authorization: `Bearer ${apiToken}` },
  }
);
```
```python
requests.delete(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources/source_123",
    headers={"Authorization": f"Bearer {api_token}"}
)
```
Note: Chunk cleanup happens asynchronously via a background job. Queries may briefly return results from deleted sources until cleanup completes.
Querying
Knowledge Base queries support three search modes: semantic (vector embeddings), keyword (full-text search), and hybrid (combining both with RRF scoring).
Search Modes
Semantic Search (default)
Uses vector embeddings for conceptual similarity matching. Best for:
- Finding content by meaning, not exact words
- Cross-lingual or multilingual search
- Handling synonyms and paraphrasing naturally
Semantic search
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I configure authentication?",
    "searchMode": "semantic",
    "limit": 10
  }'
```
```typescript
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query: "How do I configure authentication?",
      searchMode: "semantic", // default, can be omitted
      limit: 10,
    }),
  }
);
const { results, metadata } = await response.json();
```
```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "query": "How do I configure authentication?",
        "searchMode": "semantic",  # default, can be omitted
        "limit": 10
    }
)
data = response.json()
results = data["results"]
```
Keyword Search
Uses BM25 full-text search with inverted indexes. Best for:
- Exact term matching and boolean queries
- Technical documentation (API names, error codes, function names)
- Structured data (product SKUs, model numbers, IDs)
Keyword search
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "JWT authentication",
    "searchMode": "keyword",
    "limit": 10
  }'
```
```typescript
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query: "JWT authentication",
      searchMode: "keyword",
      limit: 10,
    }),
  }
);
const { results, metadata } = await response.json();
```
```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "query": "JWT authentication",
        "searchMode": "keyword",
        "limit": 10
    }
)
data = response.json()
results = data["results"]
```
Hybrid Search (Recommended)
Combines keyword and semantic search using RRF (Reciprocal Rank Fusion) for optimal results. Best for:
- General-purpose search applications
- When you want both exact matches AND semantic relevance
- Production applications (recommended default)
Hybrid search with custom weights
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I configure authentication?",
    "searchMode": "hybrid",
    "semanticWeight": 0.6,
    "keywordWeight": 0.4,
    "limit": 10
  }'
```
```typescript
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query: "How do I configure authentication?",
      searchMode: "hybrid",
      semanticWeight: 0.6, // default: 0.6
      keywordWeight: 0.4, // default: 0.4
      limit: 10,
    }),
  }
);
const { results, metadata } = await response.json();
```
```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "query": "How do I configure authentication?",
        "searchMode": "hybrid",
        "semanticWeight": 0.6,  # default: 0.6
        "keywordWeight": 0.4,  # default: 0.4
        "limit": 10
    }
)
data = response.json()
results = data["results"]
```
Weight Tuning:
- Higher semanticWeight: Prioritize conceptual similarity
- Higher keywordWeight: Prioritize exact term matches
- Default (0.6 / 0.4): Balanced for most use cases
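To build intuition for how the two weights interact, here is a minimal sketch of weighted Reciprocal Rank Fusion. It is the textbook form of RRF, not necessarily the service's exact formula: the function name `weighted_rrf` is hypothetical, the smoothing constant `k = 60` is the conventional choice and an assumption here, and the service may normalize scores differently.

```python
def weighted_rrf(
    semantic_ids: list[str],
    keyword_ids: list[str],
    semantic_weight: float = 0.6,
    keyword_weight: float = 0.4,
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse two ranked lists of chunk IDs; each list is ordered best-first."""
    scores: dict[str, float] = {}
    for weight, ranking in ((semantic_weight, semantic_ids), (keyword_weight, keyword_ids)):
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank); absent chunks contribute 0
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = weighted_rrf(["a", "b", "c"], ["b", "d"])
print([chunk_id for chunk_id, _ in fused])
```

A chunk that appears in both rankings (here `b`) accumulates two contributions, which is why hybrid mode tends to favor results that match both exactly and semantically.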
Response Format
```json
{
  "results": [
    {
      "chunkId": "chunk_abc123",
      "content": "To configure authentication, first enable...",
      "score": 0.89,
      "semanticScore": 0.87,
      "keywordScore": 12.5,
      "combinedScore": 0.89,
      "communityL0Id": "c0_001",
      "communityL1Id": "c1_001",
      "fileId": "file_xyz789",
      "datasetTableId": null,
      "charStart": 0,
      "charEnd": 512
    }
  ],
  "communities": [
    {
      "communityId": "c0_001",
      "level": 0,
      "bestScore": 0.89,
      "chunkCount": 5
    }
  ],
  "metadata": {
    "searchMode": "hybrid",
    "fallback": false
  }
}
```
Response Fields
| Field | Type | Description |
|---|---|---|
| results | array | Matching chunks, sorted by relevance |
| results[].chunkId | string | Unique chunk identifier |
| results[].content | string | Text content of the chunk |
| results[].score | number | Primary score (semantic: cosine, keyword: BM25, hybrid: RRF) |
| results[].semanticScore | number? | Cosine similarity (only in semantic/hybrid modes) |
| results[].keywordScore | number? | BM25 score (only in keyword/hybrid modes) |
| results[].combinedScore | number? | RRF combined score (only in hybrid mode) |
| results[].communityL0Id | string or null | Level 0 community cluster ID |
| results[].communityL1Id | string or null | Level 1 community cluster ID |
| results[].fileId | string or null | Source file ID (null for table sources) |
| results[].datasetTableId | string or null | Source table ID (null for file sources) |
| results[].charStart | number | Character offset where chunk starts |
| results[].charEnd | number | Character offset where chunk ends |
| communities | array | Communities ranked by best chunk score |
| metadata.searchMode | string | Search mode that was used |
| metadata.fallback | boolean? | True if hybrid mode fell back to semantic-only |
Full-Text Search Indexes
Knowledge Base sources automatically create inverted indexes on the content column during ingestion. These indexes power keyword and hybrid search modes using BM25 scoring.
Index Configuration:
- Tokenizer: Simple (word-based)
- Stemming: Enabled (handles “run”, “running”, “runs”)
- Stop words: Removed (common words like “the”, “a”, “is”)
- Case: Normalized to lowercase
- Positions: Tracked (enables phrase search)
Index creation is non-fatal: if it fails, hybrid search automatically falls back to semantic-only mode, and the response reports this through metadata.fallback.
LazyGraphRAG Streaming Query
For more intelligent retrieval, use the streaming endpoint, which combines vector search with LLM-based relevance testing:
Stream LazyGraphRAG query
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query/stream \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the best practices for error handling?",
    "budget": 20
  }'
```
```typescript
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query/stream",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query: "What are the best practices for error handling?",
      budget: 20,
    }),
  }
);

const reader = response.body?.getReader();
if (!reader) throw new Error("Response has no body");
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // Network chunks can end mid-line, so keep the unfinished tail in the buffer
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? "";

  // Parse SSE events
  for (const line of lines) {
    if (line.startsWith("data:")) {
      const event = JSON.parse(line.slice(5));
      console.log(event);
    }
  }
}
```
```python
import json

import requests

response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query/stream",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "query": "What are the best practices for error handling?",
        "budget": 20
    },
    stream=True
)

# iter_lines() yields complete lines as bytes; SSE payloads follow "data:"
for line in response.iter_lines():
    if line:
        line = line.decode("utf-8")
        if line.startswith("data:"):
            event = json.loads(line[5:])
            print(event)
```
Budget Parameter
The budget parameter controls how many chunks the LLM tests for relevance. A higher budget gives more thorough results at the price of higher cost and latency.
| Budget | Use Case | Cost/Latency |
|---|---|---|
| 5-10 | Quick answers, simple queries | Low |
| 15-25 | General purpose (recommended default) | Medium |
| 30-50 | Complex questions, comprehensive research | High |
| 50+ | Deep analysis, when recall is critical | Very high |
Recommendations:
- Start with budget: 20 for most use cases
- Increase budget if results seem incomplete or miss relevant content
- Reduce budget for time-sensitive applications or simple factual queries
- The actual chunks returned may be fewer than budget (only relevant chunks are included)
Community Hierarchy
Knowledge Bases use the Leiden algorithm to cluster concepts into hierarchical communities:
- L0 - Finest level (individual concept clusters)
- L1 - Intermediate clusters
- L2 - Broader topic groups
- L3 - Coarsest level (major themes)
Queries use communities for ranking: chunks from relevant communities are prioritized.
Rebuilding Communities
Communities are automatically rebuilt when sources are indexed. You can also trigger a manual rebuild:
Rebuild communities
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/rebuild-communities \
  -H "Authorization: Bearer $API_TOKEN"
```
```typescript
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/rebuild-communities",
  {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
  }
);
const { jobId } = await response.json();
```
```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/rebuild-communities",
    headers={"Authorization": f"Bearer {api_token}"}
)
job_id = response.json()["jobId"]
```
Knowledge Base Properties
| Field | Type | Description |
|---|---|---|
| knowledgeBaseId | string | Unique identifier |
| teamId | string | Team that owns this KB |
| name | string | Human-readable name (1-255 characters) |
| config | object | Configuration options (see above) |
| communitiesStale | boolean | Whether communities need rebuilding |
| communitiesBuiltAt | string or null | ISO 8601 timestamp of last community build |
| createdAt | string | ISO 8601 timestamp of creation |
| updatedAt | string | ISO 8601 timestamp of last modification |
Source Properties
| Field | Type | Description |
|---|---|---|
| knowledgeBaseSourceId | string | Unique identifier |
| knowledgeBaseId | string | Parent KB ID |
| fileId | string or null | File ID (for file sources) |
| datasetTableId | string or null | Table ID (for table sources) |
| columnSpec | object or null | Column configuration (for table sources) |
| status | string | Processing status |
| processedExtractionId | string or null | Last processed extraction ID |
| processedDatasetVersion | number or null | Last processed dataset version |
| addedAt | string | ISO 8601 timestamp when source was added |
| processedAt | string or null | ISO 8601 timestamp of last successful processing |
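These properties make it straightforward to audit a knowledge base for sources that need attention. The sketch below filters a list of source objects (as returned by the list-sources endpoint) down to those that are `failed` or `stale`; the helper name and the shape of the summary it returns are illustrative assumptions.

```python
NEEDS_ATTENTION = ("failed", "stale")


def sources_needing_attention(sources: list[dict]) -> list[dict]:
    """Summarize sources whose status is failed or stale."""
    flagged = []
    for source in sources:
        if source.get("status") not in NEEDS_ATTENTION:
            continue
        # Exactly one of fileId / datasetTableId is expected to be set
        is_file = source.get("fileId") is not None
        flagged.append(
            {
                "knowledgeBaseSourceId": source.get("knowledgeBaseSourceId"),
                "kind": "file" if is_file else "table",
                "ref": source.get("fileId") if is_file else source.get("datasetTableId"),
                "status": source["status"],
            }
        )
    return flagged
```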
Related Topics
- Knowledge Base Reconciliation - Keeping KBs in sync with source changes
- Files - File processing for KB sources
- Tables - Dataset tables as KB sources
- Vector Search - Alternative: direct vector queries