Knowledge Base Reconciliation

Knowledge Base reconciliation automatically detects when source data has changed and updates the semantic index accordingly. This ensures your KB stays current as files are reprocessed or table data is modified.

Source data can change in several ways:

  • Files: Reprocessed with a new parser version or manual re-extraction
  • Tables: Rows added, updated, or deleted

When changes occur, sources are marked as “stale” and the reconciliation system:

  1. Detects which sources have changed
  2. Syncs affected chunks (delete old, add new)
  3. Cleans up orphaned concepts and edges
  4. Rebuilds community clusters if needed

A system scheduler runs every 15 minutes to detect and reconcile stale sources across all Knowledge Bases. This happens automatically with no configuration required.

The automatic process:

  1. Scans all KBs for sources with status: stale
  2. Spawns reconciliation jobs for affected KBs
  3. Each source is synced independently (parallel processing)
  4. Communities are rebuilt after all syncs complete
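
Conceptually, the scan behaves like the sketch below. This is only an illustration of the documented behavior; the helper functions (list_knowledge_bases, find_stale_sources, spawn_reconciliation_job) are hypothetical and not part of any public API.

# Illustrative sketch of the 15-minute reconciliation scan.
# All helper functions here are hypothetical, not a real API.
def reconcile_all_kbs():
    for kb in list_knowledge_bases():
        stale = find_stale_sources(kb)  # sources with status == "stale"
        if stale:
            # Sources sync in parallel; communities rebuild after all syncs.
            spawn_reconciliation_job(kb, stale)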

File sources become stale when the underlying file is reprocessed (e.g., a PDF is re-extracted with an updated parser).

The system compares:

  • processedExtractionId (last extraction the KB processed)
  • Current extraction ID from the file processing pipeline

When these differ, the source is marked stale.

Table sources become stale when the dataset version changes. This happens when:

  • Rows are inserted
  • Rows are updated
  • Rows are deleted

The system compares:

  • processedDatasetVersion (last version the KB processed)
  • Current version from the data catalog
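
As with files, the source is marked stale when these values differ. A minimal sketch of both staleness checks, assuming each source record exposes the fields named above as plain dictionary keys:

# Sketch only: source is a dict carrying the fields documented above.
def file_source_is_stale(source, current_extraction_id):
    # Stale when the KB last processed a different extraction.
    return source["processedExtractionId"] != current_extraction_id

def table_source_is_stale(source, current_dataset_version):
    # Stale when the dataset version in the data catalog has moved on.
    return source["processedDatasetVersion"] != current_dataset_version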

While automatic reconciliation runs every 15 minutes, you can trigger immediate reconciliation when needed:

Trigger manual reconciliation

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/reconcile \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "jobId": "job_abc123"
}

Manual reconciliation is useful:

  • After bulk updates to source tables
  • After reprocessing critical files
  • When debugging freshness issues
  • For immediate sync after important data changes

If sources fail during indexing (e.g., due to parser errors or temporary issues), they remain in failed status with an error message. The reconciliation endpoint only processes stale sources, not failed ones.

To retry failed sources after fixing the underlying issue:

Reprocess failed sources

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/reprocess-failed \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "jobId": "abc123...",
  "failedSourcesCount": 90,
  "jobsSpawned": 90
}

This endpoint:

  • Queries all sources with status: "failed"
  • Marks them as "processing" (clears error messages)
  • Spawns individual sync jobs for each
  • Returns immediately with a job ID to track progress

Use case: After bulk file reprocessing (fixing parser bugs), retrigger KB indexing for sources that previously failed.
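
For example, the same call from Python (the response fields match the example above):

import os
import requests

# Retry all failed sources after fixing the underlying issue.
resp = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/reprocess-failed",
    headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
)
resp.raise_for_status()
body = resp.json()
print(f"{body['jobsSpawned']} jobs spawned for {body['failedSourcesCount']} failed sources")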

The reconciliation workflow follows these phases:

Detect changes: The system queries all sources for the KB where status = 'stale'. If no sources are stale, reconciliation completes immediately.

Sync sources: For each stale source, a sync job is spawned:

File Source Sync:

  1. Fetches the latest extraction chunks
  2. Deletes all existing KB chunks for this file
  3. Re-chunks and embeds the new content
  4. Updates processedExtractionId

Table Source Sync (Row-Level Diff):

  1. Queries current row IDs from the source table
  2. Compares with row IDs in existing KB chunks
  3. Computes the diff:
    • Added rows: New rows not yet indexed
    • Deleted rows: Rows that no longer exist
  4. Deletes chunks for removed rows
  5. Indexes only the new rows
  6. Updates processedDatasetVersion
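
In set terms, the diff in step 3 is two set differences, which is why unchanged rows keep their existing chunks and embeddings. A sketch with hypothetical accessors:

# Hypothetical accessors; only the shape of the diff matters here.
current_ids = {row.id for row in source_table.rows()}        # step 1
indexed_ids = {chunk.row_id for chunk in kb_chunks(source)}  # step 2

added_ids   = current_ids - indexed_ids   # new rows not yet indexed
deleted_ids = indexed_ids - current_ids   # rows that no longer exist

delete_chunks(source, deleted_ids)        # step 4
index_rows(source, added_ids)             # step 5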

Clean up the graph: After all syncs complete, if any chunks changed:

  1. Removes chunk-concept links where the chunk no longer exists
  2. Deletes orphaned concepts (concepts with no chunk links)
  3. Rebuilds concept edges from surviving relationships
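
Modeled on sets, the cleanup looks like this (a sketch only; the link, chunk, and concept collections are stand-ins for the real KB tables):

# links is a set of (chunk_id, concept_id) pairs; collections are stand-ins.
links = {l for l in links if l[0] in live_chunk_ids}  # 1. drop dangling links

linked_concepts = {concept_id for (_, concept_id) in links}
orphans = all_concept_ids - linked_concepts           # 2. orphaned concepts

# 3. concept edges survive only between concepts that still have links
edges = {(a, b) for (a, b) in edges
         if a in linked_concepts and b in linked_concepts}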

Rebuild communities: If chunks changed, communities are marked as stale and automatically rebuilt using the Leiden algorithm.
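
For illustration, this is what Leiden clustering over a small concept graph looks like with the open-source igraph and leidenalg packages; the platform's internal rebuild is not necessarily implemented this way.

import igraph as ig
import leidenalg

# Toy concept graph: vertices are concepts, edges are concept relationships.
g = ig.Graph(edges=[(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)])

# Leiden community detection; membership maps each concept to a community.
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
print(partition.membership)  # e.g. [0, 0, 0, 1, 1, 1]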

Sources move through the following statuses:

Status       Meaning                          Next Steps
pending      Source just added                Will be indexed shortly
processing   Currently being indexed/synced   Wait for completion
processed    Successfully indexed             Up to date, no action needed
failed       Indexing/sync failed             Check syncErrorMessage, fix issue, retry
stale        Source data changed              Will be synced on next reconciliation

Lifecycle transitions:

pending → processing → processed
                     ↘ failed

processed → stale → processing → processed
                               ↘ failed

List stale sources

curl "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources?status=stale" \
-H "Authorization: Bearer $API_TOKEN"

The communitiesStale field indicates whether communities need rebuilding:

{
  "knowledgeBaseId": "abc123xyz",
  "communitiesStale": true,
  "communitiesBuiltAt": "2025-01-10T08:30:00Z"
}
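
A small monitoring script can combine both checks. This sketch assumes the sources endpoint returns a JSON array and that a KB detail endpoint (GET /knowledge-bases/{id}) returns the fields shown above:

import os
import requests

API = "https://api.catalyzed.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}
kb_id = "abc123xyz"

# Sources waiting for the next reconciliation cycle.
stale = requests.get(f"{API}/knowledge-bases/{kb_id}/sources",
                     params={"status": "stale"}, headers=HEADERS).json()

# Community freshness flags (assumed response shape; see example above).
kb = requests.get(f"{API}/knowledge-bases/{kb_id}", headers=HEADERS).json()
print(f"{len(stale)} stale sources, communitiesStale={kb['communitiesStale']}")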

If you’re making multiple changes to source data, complete all changes before triggering reconciliation. This avoids multiple sync cycles:

import requests

# Good: batch updates, then reconcile once
for row in updates:
    table.update(row)
table.commit()
requests.post(f"https://api.catalyzed.ai/knowledge-bases/{kb_id}/reconcile",
              headers={"Authorization": f"Bearer {api_token}"})

# Avoid: triggering reconcile after each update
for row in updates:
    table.update(row)
    requests.post(f"https://api.catalyzed.ai/knowledge-bases/{kb_id}/reconcile",
                  headers={"Authorization": f"Bearer {api_token}"})

Periodically check for failed sources and investigate:

List failed sources

curl "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources?status=failed" \
-H "Authorization: Bearer $API_TOKEN"

If communities seem outdated or inconsistent, force a rebuild:

Force rebuild communities

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/rebuild-communities \
-H "Authorization: Bearer $API_TOKEN"

If sources remain in processing status for extended periods:

  1. Check the job queue for failed jobs
  2. Common causes:
    • Embedding service temporarily unavailable
    • File corruption preventing extraction
    • Memory limits exceeded for large files
  3. The source will be retried on the next reconciliation cycle

If communitiesStale remains true after reconciliation:

  1. Check if any sources are still processing
  2. Verify all source syncs completed successfully
  3. Manually trigger rebuild with the rebuild-communities endpoint

For tables with millions of rows:

  1. Consider partitioning data across multiple KBs
  2. Use incremental updates rather than bulk replacements

If query results don’t reflect recent source changes:

  1. Check source status (may still be stale or processing)
  2. Verify communitiesStale is false
  3. Check communitiesBuiltAt timestamp
  4. Trigger manual reconciliation if automatic hasn’t run yet
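
Putting the checklist together, here is a hedged helper that polls until a KB is fresh. It assumes the same response shapes as the examples above (a JSON array from the sources endpoint, and communitiesStale on the KB detail endpoint):

import os
import time
import requests

API = "https://api.catalyzed.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

def wait_until_fresh(kb_id, timeout=900, poll=15):
    """Poll until no sources are stale or processing and communities are rebuilt."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        pending = []
        for status in ("stale", "processing"):
            pending += requests.get(
                f"{API}/knowledge-bases/{kb_id}/sources",
                params={"status": status}, headers=HEADERS,
            ).json()  # assumed: endpoint returns a JSON array
        kb = requests.get(f"{API}/knowledge-bases/{kb_id}", headers=HEADERS).json()
        if not pending and not kb["communitiesStale"]:
            return True
        time.sleep(poll)
    return False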