Knowledge Bases

Knowledge Bases are semantic search indexes that combine vector embeddings with a concept graph for intelligent document retrieval. They support both file and table sources, enabling RAG (Retrieval-Augmented Generation) applications, document Q&A, and semantic search.

A Knowledge Base consists of several interconnected components:

  • Knowledge Base: Container for semantic search indexes, owned by a team
  • Sources: File or table inputs that feed content into the KB
  • Chunks: Text segments with vector embeddings for similarity search
  • Concepts: Extracted entities and topics from the text
  • Communities: Graph clusters built with the Leiden algorithm, used for ranking

When you add a source, the system automatically:

  1. Extracts text content (from files or table columns)
  2. Splits text into overlapping chunks (see the sketch after this list)
  3. Generates embeddings using the configured model
  4. Extracts concepts and builds a concept graph
  5. Clusters concepts into hierarchical communities (L0-L3)
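
To make step 2 concrete, the sketch below shows one way overlapping character chunks can be produced with the default chunkSize of 512 and chunkOverlap of 50. It is only an illustration; the service's actual splitter (for example, whether it respects sentence or token boundaries) is not specified here.

Overlapping chunking (illustrative Python sketch)

# Hypothetical illustration of step 2; not the service's actual splitting logic.
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into ~chunk_size character chunks, each overlapping the previous by chunk_overlap."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("lorem ipsum " * 500)  # e.g. text extracted from a PDF source
print(f"{len(chunks)} chunks, first chunk is {len(chunks[0])} chars")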

Files processed through the file processing pipeline can be added as KB sources:

  • PDFs, DOCX, XLSX, CSV files
  • Uses extraction ID to track processing version
  • Auto-syncs when the file is reprocessed

Dataset tables can be indexed by specifying which columns to extract content from:

  • Requires columnSpec with columns to index
  • Uses dataset version for change detection
  • Supports incremental sync

Create a knowledge base

curl -X POST https://api.catalyzed.ai/knowledge-bases \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "name": "Product Documentation"
}'

Response:

{
  "knowledgeBaseId": "abc123xyz",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "name": "Product Documentation",
  "config": {
    "embeddingModel": "BAAI/bge-small-en-v1.5",
    "embeddingDimension": 384,
    "chunkSize": 512,
    "chunkOverlap": 50,
    "nlpLibrary": "spacy"
  },
  "communitiesStale": true,
  "communitiesBuiltAt": null,
  "createdAt": "2025-01-15T10:30:00Z",
  "updatedAt": "2025-01-15T10:30:00Z"
}
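
The same endpoints can be called from any HTTP client. As a minimal sketch, assuming Python with the requests library and the API token in an environment variable (error handling kept to a single raise_for_status call):

Create a knowledge base from Python

import os
import requests

API_BASE = "https://api.catalyzed.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

# POST /knowledge-bases with the minimal required fields (teamId, name)
resp = requests.post(
    f"{API_BASE}/knowledge-bases",
    headers=HEADERS,
    json={"teamId": "ZkoDMyjZZsXo4VAO_nJLk", "name": "Product Documentation"},
)
resp.raise_for_status()
kb = resp.json()
print(kb["knowledgeBaseId"], kb["config"]["embeddingModel"])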

You can customize the KB configuration when creating it:

Create with custom config

curl -X POST https://api.catalyzed.ai/knowledge-bases \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "name": "Research Papers",
  "config": {
    "chunkSize": 1024,
    "chunkOverlap": 100,
    "leidenResolution": 1.5
  }
}'

Config Field        Default                 Description
embeddingModel      BAAI/bge-small-en-v1.5  Embedding model for vectors
embeddingDimension  384                     Vector dimensions
chunkSize           512                     Target chunk size in characters
chunkOverlap        50                      Overlap between consecutive chunks
nlpLibrary          spacy                   NLP library for concept extraction
leidenResolution    1.0                     Community detection resolution (0.1-10)

List knowledge bases in a team

curl "https://api.catalyzed.ai/knowledge-bases?teamIds=ZkoDMyjZZsXo4VAO_nJLk" \
-H "Authorization: Bearer $API_TOKEN"

Parameter         Type    Description
teamIds           string  Comma-separated team IDs to filter by
knowledgeBaseIds  string  Comma-separated KB IDs to filter by
name              string  Filter by name (partial match)
page              number  Page number (starts at 1, default: 1)
pageSize          number  Results per page (1-100, default: 20)
orderBy           string  Sort by: createdAt, name, updatedAt
orderDirection    string  Sort direction: asc or desc
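
The filter, pagination, and ordering parameters above can be combined on the same request. A small sketch (Python requests; the shape of the list response body is not shown in this section, so it is simply printed):

List knowledge bases with pagination

import os
import requests

headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

# Second page of the team's KBs, 50 per page, sorted by name ascending
resp = requests.get(
    "https://api.catalyzed.ai/knowledge-bases",
    headers=headers,
    params={
        "teamIds": "ZkoDMyjZZsXo4VAO_nJLk",
        "page": 2,
        "pageSize": 50,
        "orderBy": "name",
        "orderDirection": "asc",
    },
)
resp.raise_for_status()
print(resp.json())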

Get knowledge base by ID

curl https://api.catalyzed.ai/knowledge-bases/abc123xyz \
-H "Authorization: Bearer $API_TOKEN"

Update knowledge base

curl -X PATCH https://api.catalyzed.ai/knowledge-bases/abc123xyz \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "name": "Product Documentation v2"
}'

Delete knowledge base

curl -X DELETE https://api.catalyzed.ai/knowledge-bases/abc123xyz \
-H "Authorization: Bearer $API_TOKEN"

Add file source

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "type": "file",
  "fileId": "file_xyz789"
}'

Add table source

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "type": "table",
  "datasetTableId": "table_abc456",
  "columnSpec": {
    "columns": ["title", "description", "content"]
  }
}'

Each source has a status that tracks its processing state:

Status      Description
pending     Source added, waiting for initial indexing
processing  Currently being indexed
processed   Successfully indexed and up-to-date
failed      Indexing failed (check syncErrorMessage)
stale       Source data changed, needs re-sync

List sources for a knowledge base

curl "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources" \
-H "Authorization: Bearer $API_TOKEN"

Remove a source from a knowledge base. This deletes the source metadata and triggers a cleanup job to remove the associated chunks from the vector store.

Delete a source

curl -X DELETE https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources/source_123 \
-H "Authorization: Bearer $API_TOKEN"

Note: Chunk cleanup happens asynchronously via a background job. Queries may briefly return results from deleted sources until cleanup completes.

Knowledge Base queries support three search modes: semantic (vector embeddings), keyword (full-text search), and hybrid (combining both with RRF scoring).

Semantic search uses vector embeddings for conceptual similarity matching. It is best for:

  • Finding content by meaning, not exact words
  • Cross-lingual or multilingual search
  • Handling synonyms and paraphrasing naturally

Semantic search

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "query": "How do I configure authentication?",
  "searchMode": "semantic",
  "limit": 10
}'

Keyword search uses BM25 full-text search with inverted indexes. It is best for:

  • Exact term matching and boolean queries
  • Technical documentation (API names, error codes, function names)
  • Structured data (product SKUs, model numbers, IDs)

Keyword search

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "query": "JWT authentication",
  "searchMode": "keyword",
  "limit": 10
}'

Hybrid search combines keyword and semantic search using RRF (Reciprocal Rank Fusion) to merge the two rankings. It is best for:

  • General-purpose search applications
  • When you want both exact matches AND semantic relevance
  • Production applications (recommended default)

Hybrid search with custom weights

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "query": "How do I configure authentication?",
  "searchMode": "hybrid",
  "semanticWeight": 0.6,
  "keywordWeight": 0.4,
  "limit": 10
}'

Weight Tuning:

  • Higher semanticWeight: Prioritize conceptual similarity
  • Higher keywordWeight: Prioritize exact term matches
  • Default (0.6 / 0.4): Balanced for most use cases
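
To make the weighting concrete, the sketch below shows one common form of weighted Reciprocal Rank Fusion, where each chunk's fused score depends only on its rank in the semantic and keyword result lists. The constant k = 60 is conventional; the exact formula and constants the service uses are not documented here.

Weighted RRF (illustrative)

# Illustrative weighted Reciprocal Rank Fusion; k=60 is a conventional constant,
# not necessarily what the service uses.
def rrf_fuse(semantic_ids, keyword_ids, semantic_weight=0.6, keyword_weight=0.4, k=60):
    """Fuse two rankings (chunk IDs ordered best-first) into one weighted RRF score per chunk."""
    scores = {}
    for weight, ranking in ((semantic_weight, semantic_ids), (keyword_weight, keyword_ids)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rrf_fuse(["a", "b", "c"], ["c", "a", "d"]))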

Response:

{
  "results": [
    {
      "chunkId": "chunk_abc123",
      "content": "To configure authentication, first enable...",
      "score": 0.89,
      "semanticScore": 0.87,
      "keywordScore": 12.5,
      "combinedScore": 0.89,
      "communityL0Id": "c0_001",
      "communityL1Id": "c1_001",
      "fileId": "file_xyz789",
      "datasetTableId": null,
      "charStart": 0,
      "charEnd": 512
    }
  ],
  "communities": [
    {
      "communityId": "c0_001",
      "level": 0,
      "bestScore": 0.89,
      "chunkCount": 5
    }
  ],
  "metadata": {
    "searchMode": "hybrid",
    "fallback": false
  }
}

Field                     Type           Description
results                   array          Matching chunks, sorted by relevance
results[].chunkId         string         Unique chunk identifier
results[].content         string         Text content of the chunk
results[].score           number         Primary score (semantic: cosine, keyword: BM25, hybrid: RRF)
results[].semanticScore   number?        Cosine similarity (only in semantic/hybrid modes)
results[].keywordScore    number?        BM25 score (only in keyword/hybrid modes)
results[].combinedScore   number?        RRF combined score (only in hybrid mode)
results[].communityL0Id   string | null  Level 0 community cluster ID
results[].communityL1Id   string | null  Level 1 community cluster ID
results[].fileId          string | null  Source file ID (null for table sources)
results[].datasetTableId  string | null  Source table ID (null for file sources)
results[].charStart       number         Character offset where chunk starts
results[].charEnd         number         Character offset where chunk ends
communities               array          Communities ranked by best chunk score
metadata.searchMode       string         Search mode that was used
metadata.fallback         boolean?       True if hybrid mode fell back to semantic-only
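
A common way to consume this response in a RAG application is to fold the top chunks into context for an LLM prompt. A minimal sketch using only the documented results fields (the downstream LLM call itself is out of scope here):

Build RAG context from query results

import os
import requests

headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}
question = "How do I configure authentication?"

resp = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query",
    headers=headers,
    json={"query": question, "searchMode": "hybrid", "limit": 5},
)
resp.raise_for_status()
results = resp.json()["results"]

# Concatenate the top chunks into a context block for a downstream LLM call
context = "\n\n".join(f"[{r['score']:.2f}] {r['content']}" for r in results)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"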

Knowledge Base sources automatically create inverted indexes on the content column during ingestion. These indexes power keyword and hybrid search modes using BM25 scoring.

Index Configuration:

  • Tokenizer: Simple (word-based)
  • Stemming: Enabled (handles “run”, “running”, “runs”)
  • Stop words: Removed (common words like “the”, “a”, “is”)
  • Case: Normalized to lowercase
  • Positions: Tracked (enables phrase search)

Index creation is non-fatal: if it fails, hybrid search automatically falls back to semantic-only mode.

For more intelligent retrieval, use the streaming endpoint, which combines vector search with LLM-based relevance testing:

Stream LazyGraphRAG query

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/query/stream \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "query": "What are the best practices for error handling?",
  "budget": 20
}'
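
The exact wire format of the stream (SSE vs. newline-delimited JSON, event field names) is not documented in this section, so the client sketch below simply assumes newline-delimited events and prints them as they arrive:

Consume the streaming response (illustrative)

import os
import requests

headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

# Stream the LazyGraphRAG query; the event format handled below is an assumption
with requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/query/stream",
    headers=headers,
    json={"query": "What are the best practices for error handling?", "budget": 20},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line:
            print(line)  # each non-empty line is treated as one streamed event (assumed)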

The budget parameter controls how many chunks the LLM tests for relevance. A higher budget yields more thorough results but increases cost and latency.

Budget   Use Case                                    Cost/Latency
5-10     Quick answers, simple queries               Low
15-25    General purpose (recommended default)       Medium
30-50    Complex questions, comprehensive research   High
50+      Deep analysis, when recall is critical      Very high

Recommendations:

  • Start with budget: 20 for most use cases
  • Increase budget if results seem incomplete or miss relevant content
  • Reduce budget for time-sensitive applications or simple factual queries
  • The number of chunks returned may be lower than the budget (only relevant chunks are included)

Knowledge Bases use the Leiden algorithm to cluster concepts into hierarchical communities:

  • L0 - Finest level (individual concept clusters)
  • L1 - Intermediate clusters
  • L2 - Broader topic groups
  • L3 - Coarsest level (major themes)

Communities are used for intelligent ranking in queries: chunks from relevant communities are prioritized.
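
Because each query result carries communityL0Id/communityL1Id and the communities array is ranked by best chunk score, a client can regroup hits by community. A small sketch over an already-parsed query response (see the response fields above):

Group query results by community (sketch)

from collections import defaultdict

def group_by_community(response: dict) -> dict:
    """Group result chunks by level-0 community, ordered by the ranked communities list."""
    by_community = defaultdict(list)
    for chunk in response["results"]:
        by_community[chunk["communityL0Id"]].append(chunk)

    ordered = {}
    for community in response["communities"]:  # already ranked by bestScore
        cid = community["communityId"]
        if cid in by_community:
            ordered[cid] = by_community.pop(cid)
    ordered.update(by_community)  # chunks whose community is not in the ranked list (e.g. null)
    return ordered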

Communities are automatically rebuilt when sources are indexed. You can also trigger a manual rebuild:

Rebuild communities

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/rebuild-communities \
-H "Authorization: Bearer $API_TOKEN"

Knowledge Base object fields:

Field               Type           Description
knowledgeBaseId     string         Unique identifier
teamId              string         Team that owns this KB
name                string         Human-readable name (1-255 characters)
config              object         Configuration options (see above)
communitiesStale    boolean        Whether communities need rebuilding
communitiesBuiltAt  string | null  ISO 8601 timestamp of last community build
createdAt           string         ISO 8601 timestamp of creation
updatedAt           string         ISO 8601 timestamp of last modification

Source object fields:

Field                    Type           Description
knowledgeBaseSourceId    string         Unique identifier
knowledgeBaseId          string         Parent KB ID
fileId                   string | null  File ID (for file sources)
datasetTableId           string | null  Table ID (for table sources)
columnSpec               object | null  Column configuration (for table sources)
status                   string         Processing status
processedExtractionId    string | null  Last processed extraction ID
processedDatasetVersion  number | null  Last processed dataset version
addedAt                  string         ISO 8601 timestamp when source was added
processedAt              string | null  ISO 8601 timestamp of last successful processing
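
For typed client code, the two objects above can be mirrored with lightweight data classes. A sketch that simply follows the field tables (no parsing or validation):

Object shapes as Python dataclasses (sketch)

from dataclasses import dataclass
from typing import Optional

@dataclass
class KnowledgeBase:
    knowledgeBaseId: str
    teamId: str
    name: str
    config: dict
    communitiesStale: bool
    communitiesBuiltAt: Optional[str]  # ISO 8601 or null
    createdAt: str
    updatedAt: str

@dataclass
class KnowledgeBaseSource:
    knowledgeBaseSourceId: str
    knowledgeBaseId: str
    fileId: Optional[str]
    datasetTableId: Optional[str]
    columnSpec: Optional[dict]
    status: str  # pending | processing | processed | failed | stale
    processedExtractionId: Optional[str]
    processedDatasetVersion: Optional[int]
    addedAt: str
    processedAt: Optional[str]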