Knowledge Base Reconciliation
Knowledge Base reconciliation automatically detects when source data has changed and updates the semantic index accordingly. This ensures your KB stays current as files are reprocessed or table data is modified.
Overview
Source data can change in several ways:
- Files: Reprocessed with a new parser version or manual re-extraction
- Tables: Rows added, updated, or deleted
When changes occur, sources are marked as `stale` and the reconciliation system:
- Detects which sources have changed
- Syncs affected chunks (delete old, add new)
- Cleans up orphaned concepts and edges
- Rebuilds community clusters if needed
Automatic Reconciliation
A system scheduler runs every 15 minutes to detect and reconcile stale sources across all Knowledge Bases. This happens automatically with no configuration required.
The automatic process (sketched in code below):
- Scans all KBs for sources with `status: stale`
- Spawns reconciliation jobs for affected KBs
- Each source is synced independently (parallel processing)
- Communities are rebuilt after all syncs complete
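As a mental model, a single pass of that cycle might look like the following sketch. Everything here (the types and helper functions) is an illustrative stand-in, not the platform’s internals:

```python
from dataclasses import dataclass

@dataclass
class Source:
    source_id: str
    status: str  # pending | processing | processed | failed | stale

@dataclass
class KnowledgeBase:
    kb_id: str
    sources: list[Source]

def sync_source(kb: KnowledgeBase, source: Source) -> None:
    source.status = "processed"  # stand-in for a real sync job

def rebuild_communities(kb: KnowledgeBase) -> None:
    pass  # stand-in for the community rebuild described below

def reconciliation_pass(knowledge_bases: list[KnowledgeBase]) -> None:
    """One scheduler pass: scan for stale sources, sync them, rebuild communities."""
    for kb in knowledge_bases:
        stale = [s for s in kb.sources if s.status == "stale"]
        if not stale:
            continue
        for source in stale:
            sync_source(kb, source)  # the real system runs these in parallel
        rebuild_communities(kb)      # once, after all syncs complete
```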
Staleness Detection
File Sources
Section titled “File Sources”File sources become stale when the underlying file is reprocessed (e.g., a PDF is re-extracted with an updated parser).
The system compares:
- `processedExtractionId` (last extraction the KB processed)
- Current extraction ID from the file processing pipeline
When these differ, the source is marked stale.
Table Sources
Table sources become stale when the dataset version changes. This happens when:
- Rows are inserted
- Rows are updated
- Rows are deleted
The system compares:
- `processedDatasetVersion` (last version the KB processed)
- Current version from the data catalog
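Both checks reduce to comparing a stored identifier against the current one; a minimal sketch (parameter names paraphrase the fields above):

```python
def file_source_is_stale(processed_extraction_id: str, current_extraction_id: str) -> bool:
    # File source: last extraction the KB processed vs. the pipeline's current extraction.
    return processed_extraction_id != current_extraction_id

def table_source_is_stale(processed_dataset_version: int, current_dataset_version: int) -> bool:
    # Table source: last dataset version the KB processed vs. the catalog's current version.
    return processed_dataset_version != current_dataset_version
```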
Manual Reconciliation
While automatic reconciliation runs every 15 minutes, you can trigger immediate reconciliation when needed:
Trigger manual reconciliation
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/reconcile \
  -H "Authorization: Bearer $API_TOKEN"
```

```js
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/reconcile",
  {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
  },
);
const { jobId } = await response.json();
console.log(`Reconciliation started: ${jobId}`);
```

```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/reconcile",
    headers={"Authorization": f"Bearer {api_token}"},
)
job_id = response.json()["jobId"]
print(f"Reconciliation started: {job_id}")
```

Response:
{ "jobId": "job_abc123"}When to Use Manual Reconciliation
- After bulk updates to source tables
- After reprocessing critical files
- When debugging freshness issues
- For immediate sync after important data changes
Reprocessing Failed Sources
If sources fail during indexing (e.g., due to parser errors or temporary issues), they remain in `failed` status with an error message. The reconciliation endpoint only processes stale sources, not failed ones.
To retry failed sources after fixing the underlying issue:
Reprocess failed sources
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/reprocess-failed \
  -H "Authorization: Bearer $API_TOKEN"
```

```js
const response = await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/reprocess-failed",
  {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
  },
);
const { jobId, failedSourcesCount, jobsSpawned } = await response.json();
console.log(`Reprocessing ${failedSourcesCount} failed sources (job: ${jobId})`);
```

```python
response = requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/reprocess-failed",
    headers={"Authorization": f"Bearer {api_token}"},
)
data = response.json()
print(f"Reprocessing {data['failedSourcesCount']} failed sources (job: {data['jobId']})")
```

Response:
{ "jobId": "abc123...", "failedSourcesCount": 90, "jobsSpawned": 90}This endpoint:
- Queries all sources with `status: "failed"`
- Marks them as `"processing"` (clears error messages)
- Spawns individual sync jobs for each
- Returns immediately with a job ID to track progress
Use case: After bulk file reprocessing (e.g., after fixing a parser bug), retrigger KB indexing for sources that previously failed.
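The endpoint returns immediately, so progress is tracked through the returned job ID. Assuming a job-status endpoint exists (the `/jobs/{jobId}` path and status values below are hypothetical; check your API reference), polling could look like:

```python
import time

import requests

job_id = "abc123..."
while True:
    job = requests.get(
        f"https://api.catalyzed.ai/jobs/{job_id}",  # hypothetical job-status endpoint
        headers={"Authorization": f"Bearer {api_token}"},
    ).json()
    if job["status"] in ("completed", "failed"):    # hypothetical status values
        break
    time.sleep(10)
```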
Reconciliation Process
The reconciliation workflow follows these phases:
Phase 1: Detect Stale Sources
The system queries all sources for the KB where `status = 'stale'`. If no sources are stale, reconciliation completes immediately.
Phase 2: Sync Sources
For each stale source, a sync job is spawned:
File Source Sync:
- Fetches the latest extraction chunks
- Deletes all existing KB chunks for this file
- Re-chunks and embeds the new content
- Updates `processedExtractionId`
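In toy form, with in-memory stand-ins for the chunk store (nothing here reflects the platform’s actual internals), the delete-then-reindex flow looks like this:

```python
# Toy stand-ins: a per-file chunk store and the latest extraction result.
kb_chunks = {"file-1": ["old chunk A", "old chunk B"]}
latest_extraction = {"id": "ext-42", "text": "the newly extracted file contents"}

def chunk_and_embed(text: str) -> list[str]:
    # Toy chunker; the real pipeline also embeds each chunk.
    return [text[i:i + 16] for i in range(0, len(text), 16)]

kb_chunks["file-1"] = chunk_and_embed(latest_extraction["text"])  # delete old, index new
processed_extraction_id = latest_extraction["id"]                 # record what was synced
```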
Table Source Sync (Row-Level Diff):
- Queries current row IDs from the source table
- Compares with row IDs in existing KB chunks
- Computes the diff:
  - Added rows: New rows not yet indexed
  - Deleted rows: Rows that no longer exist
- Deletes chunks for removed rows
- Indexes only the new rows
- Updates `processedDatasetVersion`
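The diff itself is plain set arithmetic over row IDs; a runnable sketch with illustrative IDs:

```python
current_row_ids = {"r1", "r2", "r4"}   # rows now in the source table
indexed_row_ids = {"r1", "r2", "r3"}   # rows represented by existing KB chunks

added_rows = current_row_ids - indexed_row_ids    # {"r4"}: index these
deleted_rows = indexed_row_ids - current_row_ids  # {"r3"}: delete their chunks

print(f"index {sorted(added_rows)}, delete chunks for {sorted(deleted_rows)}")
```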
Phase 3: Concept Reconciliation
After all syncs complete, if any chunks changed, the system:
- Removes chunk-concept links where the chunk no longer exists
- Deletes orphaned concepts (concepts with no chunk links)
- Rebuilds concept edges from surviving relationships
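The cleanup amounts to two filters over the chunk-concept link table; a toy sketch with in-memory sets (illustrative, not the platform’s storage model):

```python
surviving_chunks = {"c1", "c2"}
chunk_concept_links = {("c1", "k1"), ("c3", "k2")}  # (chunk_id, concept_id); c3 was deleted
concepts = {"k1", "k2"}

# 1. Drop links whose chunk no longer exists.
chunk_concept_links = {(c, k) for c, k in chunk_concept_links if c in surviving_chunks}

# 2. Delete concepts left with no chunk links ("k2" becomes orphaned here).
concepts &= {k for _, k in chunk_concept_links}

print(chunk_concept_links, concepts)  # {('c1', 'k1')} {'k1'}
```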
Phase 4: Community Rebuild
If chunks changed, communities are marked as stale and automatically rebuilt using the Leiden algorithm.
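For intuition about what a rebuild produces, here is Leiden clustering over a tiny concept graph using the open-source `leidenalg` package. This illustrates the algorithm only; it is not necessarily how the platform runs it:

```python
import igraph as ig
import leidenalg

# A tiny concept graph: vertices are concepts, edges are concept relationships.
graph = ig.Graph(edges=[(0, 1), (1, 2), (2, 0), (3, 4)])

# Leiden partitions the graph into well-connected communities.
partition = leidenalg.find_partition(graph, leidenalg.ModularityVertexPartition)
for community_id, members in enumerate(partition):
    print(f"community {community_id}: concepts {members}")
```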
Source Status Lifecycle
| Status | Meaning | Next Steps |
|---|---|---|
| `pending` | Source just added | Will be indexed shortly |
| `processing` | Currently being indexed/synced | Wait for completion |
| `processed` | Successfully indexed | Up-to-date, no action needed |
| `failed` | Indexing/sync failed | Check `syncErrorMessage`, fix issue, retry |
| `stale` | Source data changed | Will be synced on next reconciliation |
Status Transitions
```
pending → processing → processed
                     ↘ failed

processed → stale → processing → processed
                              ↘ failed
```
Monitoring Sources
List Sources with Status Filter
List stale sources
curl "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources?status=stale" \ -H "Authorization: Bearer $API_TOKEN"const response = await fetch( "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources?status=stale", { headers: { Authorization: `Bearer ${apiToken}` } });const { sources } = await response.json();console.log(`${sources.length} stale sources`);response = requests.get( "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources", params={"status": "stale"}, headers={"Authorization": f"Bearer {api_token}"})sources = response.json()["sources"]print(f"{len(sources)} stale sources")Check KB Community Status
The `communitiesStale` field indicates whether communities need rebuilding:
{ "knowledgeBaseId": "abc123xyz", "communitiesStale": true, "communitiesBuiltAt": "2025-01-10T08:30:00Z"}Best Practices
Best Practices
1. Batch Updates Before Sync
If you’re making multiple changes to source data, complete all changes before triggering reconciliation. This avoids multiple sync cycles:
```python
# Good: batch updates, then reconcile once
for row in updates:
    table.update(row)
table.commit()
requests.post(f"https://api.catalyzed.ai/knowledge-bases/{kb_id}/reconcile", ...)

# Avoid: triggering reconcile after each update
for row in updates:
    table.update(row)
    requests.post(f"https://api.catalyzed.ai/knowledge-bases/{kb_id}/reconcile", ...)
```
2. Monitor Failed Sources
Periodically check for failed sources and investigate:
List failed sources
curl "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources?status=failed" \ -H "Authorization: Bearer $API_TOKEN"const response = await fetch( "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources?status=failed", { headers: { Authorization: `Bearer ${apiToken}` } });const { sources } = await response.json();for (const source of sources) { console.log(`Source ${source.knowledgeBaseSourceId}: ${source.syncErrorMessage}`);}response = requests.get( "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources", params={"status": "failed"}, headers={"Authorization": f"Bearer {api_token}"})for source in response.json()["sources"]: print(f"Source {source['knowledgeBaseSourceId']}: {source.get('syncErrorMessage')}")3. Force Community Rebuild When Needed
If communities seem outdated or inconsistent, force a rebuild:
Force rebuild communities
```bash
curl -X POST https://api.catalyzed.ai/knowledge-bases/abc123xyz/rebuild-communities \
  -H "Authorization: Bearer $API_TOKEN"
```

```js
await fetch(
  "https://api.catalyzed.ai/knowledge-bases/abc123xyz/rebuild-communities",
  {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
  },
);
```

```python
requests.post(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/rebuild-communities",
    headers={"Authorization": f"Bearer {api_token}"},
)
```

Troubleshooting
Sources Stuck in Processing
If sources remain in `processing` status for extended periods:
- Check job queue for failed jobs
- Common causes:
  - Embedding service temporarily unavailable
  - File corruption preventing extraction
  - Memory limits exceeded for large files
- The source will be retried on the next reconciliation cycle
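The status filter shown earlier makes these easy to spot; how long each source has been processing depends on whatever timestamp fields your API reference documents:

```python
import requests

# List sources currently in processing for a KB.
response = requests.get(
    "https://api.catalyzed.ai/knowledge-bases/abc123xyz/sources",
    params={"status": "processing"},
    headers={"Authorization": f"Bearer {api_token}"},
)
for source in response.json()["sources"]:
    print(f"still processing: {source['knowledgeBaseSourceId']}")
```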
Stale Communities Not Rebuilding
If `communitiesStale` remains `true` after reconciliation:
- Check if any sources are still processing
- Verify all source syncs completed successfully
- Manually trigger rebuild with the rebuild-communities endpoint
High Sync Latency for Large Tables
For tables with millions of rows:
- Consider partitioning data across multiple KBs
- Use incremental updates rather than bulk replacements
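As a concrete contrast, reusing the illustrative table API from Best Practices (`replace_all` is hypothetical): a bulk replacement can invalidate chunks for every row and force a full re-index, while incremental updates keep the row-level diff small:

```python
# Avoid: rewriting the whole table can churn every row ID,
# so the next sync re-indexes everything.
# table.replace_all(all_rows)  # hypothetical bulk-replacement call

# Prefer: touch only rows that actually changed; the next sync
# then indexes/deletes just the diff computed in Phase 2.
for row in changed_rows:
    table.update(row)
table.commit()
```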
Query Results Seem Outdated
If query results don’t reflect recent source changes:
- Check source status (may still be `stale` or `processing`)
- Verify `communitiesStale` is `false`
- Check the `communitiesBuiltAt` timestamp
- Trigger manual reconciliation if automatic hasn’t run yet
Next Steps
- Knowledge Bases - Full KB concept reference
- File Processing - How files become KB sources
- Vector Search - Direct vector queries without KBs