Knowledge Base Reconciliation

Knowledge Base reconciliation automatically detects when source data has changed and updates the semantic index accordingly. This ensures your KB stays current as files are reprocessed or table data is modified.

Source data can change in several ways:

  • Files: Reprocessed with a new parser version or manual re-extraction
  • Tables: Rows added, updated, or deleted

When changes occur, sources are marked as “stale” and the reconciliation system:

  1. Detects which sources have changed
  2. Syncs affected chunks (delete old, add new)
  3. Cleans up orphaned concepts and edges
  4. Rebuilds community clusters if needed

A system scheduler runs every 15 minutes to detect and reconcile stale sources across all Knowledge Bases. This happens automatically with no configuration required.

The automatic process:

  1. Scans all KBs for sources with status: stale
  2. Spawns reconciliation jobs for affected KBs
  3. Each source is synced independently (parallel processing)
  4. Communities are rebuilt after all syncs complete
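
Conceptually, the scan behaves like the sketch below. This is only an illustration of the documented behavior; the helper functions (list_knowledge_bases, find_stale_sources, spawn_reconciliation_job) are hypothetical and not part of any public API.

# Illustrative sketch of the 15-minute reconciliation scan.
# All helper functions here are hypothetical, not a real API.
def reconcile_all_kbs():
    for kb in list_knowledge_bases():
        stale = find_stale_sources(kb)  # sources with status == "stale"
        if stale:
            # Sources sync in parallel; communities rebuild after all syncs.
            spawn_reconciliation_job(kb, stale)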

File sources become stale when the underlying file is reprocessed (e.g., a PDF is re-extracted with an updated parser).

The system compares:

  • processedExtractionId (last extraction the KB processed)
  • Current extraction ID from the file processing pipeline

When these differ, the source is marked stale.

Table sources become stale when the dataset version changes. This happens when:

  • Rows are inserted
  • Rows are updated
  • Rows are deleted

The system compares:

  • processedDatasetVersion (last version the KB processed)
  • Current version from the data catalog
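
As with files, the source is marked stale when these values differ. A minimal sketch of both staleness checks, assuming each source record exposes the fields named above as plain dictionary keys:

# Sketch only: source is a dict carrying the fields documented above.
def file_source_is_stale(source, current_extraction_id):
    # Stale when the KB last processed a different extraction.
    return source["processedExtractionId"] != current_extraction_id

def table_source_is_stale(source, current_dataset_version):
    # Stale when the dataset version in the data catalog has moved on.
    return source["processedDatasetVersion"] != current_dataset_version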

While automatic reconciliation runs every 15 minutes, you can trigger immediate reconciliation when needed:

Trigger manual reconciliation

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/reconcile \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "jobId": "job_abc123"
}

Manual reconciliation is useful:

  • After bulk updates to source tables
  • After reprocessing critical files
  • When debugging freshness issues
  • For immediate sync after important data changes

If sources fail during indexing (e.g., due to parser errors or temporary issues), they remain in failed status with an error message. The reconciliation endpoint only processes stale sources, not failed ones.

To retry failed sources after fixing the underlying issue:

Reprocess failed sources

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/reprocess-failed \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "jobId": "abc123...",
  "failedSourcesCount": 90,
  "jobsSpawned": 90
}

This endpoint:

  • Queries all sources with status: "failed"
  • Marks them as "processing" (clears error messages)
  • Spawns individual sync jobs for each
  • Returns immediately with a job ID to track progress

Use case: After bulk file reprocessing (fixing parser bugs), retrigger KB indexing for sources that previously failed.
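
For example, the same call from Python (the response fields match the example above):

import os
import requests

# Retry all failed sources after fixing the underlying issue.
resp = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/reprocess-failed",
    headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
)
resp.raise_for_status()
body = resp.json()
print(f"{body['jobsSpawned']} jobs spawned for {body['failedSourcesCount']} failed sources")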

The reconciliation workflow follows these phases:

Detect changes: The system queries all sources for the KB where status = 'stale'. If no sources are stale, reconciliation completes immediately.

Sync sources: For each stale source, a sync job is spawned:

File Source Sync:

  1. Fetches the latest extraction chunks
  2. Deletes all existing KB chunks for this file
  3. Re-chunks and embeds the new content
  4. Updates processedExtractionId

Table Source Sync (Row-Level Diff):

  1. Queries current row IDs from the source table
  2. Compares with row IDs in existing KB chunks
  3. Computes the diff:
    • Added rows: New rows not yet indexed
    • Deleted rows: Rows that no longer exist
  4. Deletes chunks for removed rows
  5. Indexes only the new rows
  6. Updates processedDatasetVersion
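
In set terms, the diff in step 3 is two set differences, which is why unchanged rows keep their existing chunks and embeddings. A sketch with hypothetical accessors:

# Hypothetical accessors; only the shape of the diff matters here.
current_ids = {row.id for row in source_table.rows()}        # step 1
indexed_ids = {chunk.row_id for chunk in kb_chunks(source)}  # step 2

added_ids   = current_ids - indexed_ids   # new rows not yet indexed
deleted_ids = indexed_ids - current_ids   # rows that no longer exist

delete_chunks(source, deleted_ids)        # step 4
index_rows(source, added_ids)             # step 5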

Clean up the graph: After all syncs complete, if any chunks changed:

  1. Removes chunk-concept links where the chunk no longer exists
  2. Deletes orphaned concepts (concepts with no chunk links)
  3. Rebuilds concept edges from surviving relationships
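
Modeled on sets, the cleanup looks like this (a sketch only; the link, chunk, and concept collections are stand-ins for the real KB tables):

# links is a set of (chunk_id, concept_id) pairs; collections are stand-ins.
links = {l for l in links if l[0] in live_chunk_ids}  # 1. drop dangling links

linked_concepts = {concept_id for (_, concept_id) in links}
orphans = all_concept_ids - linked_concepts           # 2. orphaned concepts

# 3. concept edges survive only between concepts that still have links
edges = {(a, b) for (a, b) in edges
         if a in linked_concepts and b in linked_concepts}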

Rebuild communities: If chunks changed, communities are marked as stale and automatically rebuilt using the Leiden algorithm.
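
For illustration, this is what Leiden clustering over a small concept graph looks like with the open-source igraph and leidenalg packages; the platform's internal rebuild is not necessarily implemented this way.

import igraph as ig
import leidenalg

# Toy concept graph: vertices are concepts, edges are concept relationships.
g = ig.Graph(edges=[(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)])

# Leiden community detection; membership maps each concept to a community.
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
print(partition.membership)  # e.g. [0, 0, 0, 1, 1, 1]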

Sources move through the following statuses:

Status       Meaning                          Next Steps
pending      Source just added                Will be indexed shortly
processing   Currently being indexed/synced   Wait for completion
processed    Successfully indexed             Up to date, no action needed
failed       Indexing/sync failed             Check syncErrorMessage, fix issue, retry
stale        Source data changed              Will be synced on next reconciliation

Lifecycle transitions:

pending → processing → processed
                     ↘ failed

processed → stale → processing → processed
                               ↘ failed

List stale sources

curl "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources?status=stale" \
-H "Authorization: Bearer $API_TOKEN"

The communitiesStale field indicates whether communities need rebuilding:

{
  "knowledgeBaseId": "abc123xyz",
  "communitiesStale": true,
  "communitiesBuiltAt": "2025-01-10T08:30:00Z"
}
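
A small monitoring script can combine both checks. This sketch assumes the sources endpoint returns a JSON array and that a KB detail endpoint (GET /knowledge-bases/{id}) returns the fields shown above:

import os
import requests

API = "https://api.catalyzed.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}
kb_id = "abc123xyz"

# Sources waiting for the next reconciliation cycle.
stale = requests.get(f"{API}/knowledge-bases/{kb_id}/sources",
                     params={"status": "stale"}, headers=HEADERS).json()

# Community freshness flags (assumed response shape; see example above).
kb = requests.get(f"{API}/knowledge-bases/{kb_id}", headers=HEADERS).json()
print(f"{len(stale)} stale sources, communitiesStale={kb['communitiesStale']}")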

If you’re making multiple changes to source data, complete all changes before triggering reconciliation. This avoids multiple sync cycles:

import requests

# Good: batch updates, then reconcile once
for row in updates:
    table.update(row)
table.commit()
requests.post(f"https://api.catalyzed.ai/knowledge-bases/{kb_id}/reconcile",
              headers={"Authorization": f"Bearer {api_token}"})

# Avoid: triggering reconcile after each update
for row in updates:
    table.update(row)
    requests.post(f"https://api.catalyzed.ai/knowledge-bases/{kb_id}/reconcile",
                  headers={"Authorization": f"Bearer {api_token}"})

Periodically check for failed sources and investigate:

List failed sources

curl "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources?status=failed" \
-H "Authorization: Bearer $API_TOKEN"

If communities seem outdated or inconsistent, force a rebuild:

Force rebuild communities

curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/rebuild-communities \
-H "Authorization: Bearer $API_TOKEN"

If sources remain in processing status for extended periods:

  1. Check the job queue for failed jobs
  2. Common causes:
    • Embedding service temporarily unavailable
    • File corruption preventing extraction
    • Memory limits exceeded for large files
  3. The source will be retried on the next reconciliation cycle

If communitiesStale remains true after reconciliation:

  1. Check if any sources are still processing
  2. Verify all source syncs completed successfully
  3. Manually trigger rebuild with the rebuild-communities endpoint

For tables with millions of rows:

  1. Consider partitioning data across multiple KBs
  2. Use incremental updates rather than bulk replacements

If query results don’t reflect recent source changes:

  1. Check source status (may still be stale or processing)
  2. Verify communitiesStale is false
  3. Check communitiesBuiltAt timestamp
  4. Trigger manual reconciliation if automatic hasn’t run yet
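
Putting the checklist together, here is a hedged helper that polls until a KB is fresh. It assumes the same response shapes as the examples above (a JSON array from the sources endpoint, and communitiesStale on the KB detail endpoint):

import os
import time
import requests

API = "https://api.catalyzed.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

def wait_until_fresh(kb_id, timeout=900, poll=15):
    """Poll until no sources are stale or processing and communities are rebuilt."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        pending = []
        for status in ("stale", "processing"):
            pending += requests.get(
                f"{API}/knowledge-bases/{kb_id}/sources",
                params={"status": status}, headers=HEADERS,
            ).json()  # assumed: endpoint returns a JSON array
        kb = requests.get(f"{API}/knowledge-bases/{kb_id}", headers=HEADERS).json()
        if not pending and not kb["communitiesStale"]:
            return True
        time.sleep(poll)
    return False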