Architecture Overview

Catalyzed is a unified data platform that combines large-scale public datasets with customer-managed private data. This page provides a high-level overview of how the system is designed.

┌─────────────────────────────────────────────────────────────────────────────────┐
│                               Client Applications                               │
│                      (Web App, Python SDK, TypeScript SDK)                      │
└────────────────────────────────────────┬────────────────────────────────────────┘
                                         │
┌────────────────────────────────────────┴────────────────────────────────────────┐
│                                    API Layer                                    │
│                           (REST API, WebSocket, Auth)                           │
└────────────────────────────────────────┬────────────────────────────────────────┘
             ┌───────────────────────────┼───────────────────────────┐
             ▼                           ▼                           ▼
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│      Query Engine       │ │       Job Workers       │ │       Data Engine       │
│  (SQL + Vector Search)  │ │ (Pipelines, Processing) │ │    (File Ingestion)     │
└────────────┬────────────┘ └────────────┬────────────┘ └────────────┬────────────┘
             └───────────────────────────┼───────────────────────────┘
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                   Data Layer                                    │
│           (Object Storage, Metadata Catalog, Cache, Columnar Storage)           │
└─────────────────────────────────────────────────────────────────────────────────┘

API Layer

The API layer is the primary entry point for all client interactions; a request sketch follows the list below.

  • REST API — Full-featured HTTP API with WebSocket support
  • Real-time Updates — Live progress tracking for pipeline executions and resource changes
  • Authentication — JWT-based session tokens with team-scoped authorization
  • OpenAPI — Auto-generated API documentation with type-safe schemas
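
Every call carries the JWT session token in an Authorization header. Below is a minimal sketch of an authenticated request; the base URL, the /datasets path, and the response shape are illustrative assumptions, not the documented API surface.

    # Minimal authenticated REST call (hypothetical endpoint and response shape).
    import requests

    BASE_URL = "https://api.catalyzed.example/v1"   # hypothetical base URL
    SESSION_TOKEN = "eyJhbGciOi..."                 # JWT session token

    resp = requests.get(
        f"{BASE_URL}/datasets",
        headers={"Authorization": f"Bearer {SESSION_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    for dataset in resp.json().get("datasets", []):  # assumed response shape
        print(dataset["name"])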

Query Engine

The Query Engine is a distributed SQL engine for querying structured and vector data; a query sketch follows the list below.

  • SQL Interface — ANSI SQL:2011 + PostgreSQL-compatible syntax, powered by Apache DataFusion with custom extensions for vector search
  • Distributed Execution — Automatic horizontal scaling for large datasets
  • Vector Search — Native similarity search with knn_search(), knn_cosine(), and knn_l2() functions
  • Cross-Dataset Joins — Query across multiple datasets in a single SQL statement
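
The knn_* functions compose with ordinary SQL, including joins. Below is a sketch of a similarity query submitted over the REST API; only the knn_cosine() name comes from the list above, while the /query endpoint, the parameter-binding style, and the table names are assumptions.

    # Vector similarity search combined with a relational join (sketch).
    import requests

    SESSION_TOKEN = "eyJhbGciOi..."                  # JWT session token

    sql = """
    SELECT d.title, p.chunk_text
    FROM documents AS d
    JOIN passages AS p ON p.doc_id = d.id
    ORDER BY knn_cosine(p.embedding, :query_vector)  -- documented function name
    LIMIT 10
    """

    resp = requests.post(
        "https://api.catalyzed.example/v1/query",    # hypothetical endpoint
        headers={"Authorization": f"Bearer {SESSION_TOKEN}"},
        json={"sql": sql, "params": {"query_vector": [0.12, -0.53, 0.91]}},
        timeout=60,
    )
    rows = resp.json()["rows"]                       # assumed response shape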

Job Workers

Job Workers handle background processing for pipelines and long-running tasks; a handler sketch follows the list below.

  • Pipeline Execution — Run LLM-powered data transformation pipelines
  • File Processing — Extract text, tables, and embeddings from uploaded files
  • Python Executor — Sandboxed Python runtime for custom pipeline handlers
  • Queue-Based — Asynchronous job processing with real-time coordination
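
Custom handlers run inside the sandboxed Python executor. Below is a sketch of what a handler might look like; the handle() signature and the context helpers are hypothetical, chosen only to illustrate the shape of an LLM-powered transform.

    # Hypothetical pipeline handler as run by a job worker (sketch).
    def handle(batch, context):
        """Transform a batch of rows; the worker persists whatever is returned."""
        out = []
        for row in batch:
            summary = context.llm(                   # hypothetical LLM helper
                prompt=f"Summarize: {row['body']}",
            )
            out.append({**row, "summary": summary})
        context.log(f"processed {len(out)} rows")    # hypothetical progress hook
        return out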

Data Engine

The Data Engine handles file ingestion and data transformation.

  • Document Processing — Extract structured data from PDF, XLSX, CSV, and other formats
  • Embedding Generation — Automatic vector embeddings for semantic search
  • Schema Inference — Detect and apply schemas to uploaded data

Data Layer

The Data Layer provides persistent storage for all platform data.

Component           Purpose
Object Storage      Raw files, processed documents, query results
Metadata Catalog    Table schemas, indexes, job queue, user data
In-Memory Cache     Session storage, real-time messaging
Columnar Storage    Optimized storage with B-tree and vector indices

Resource Hierarchy

Resources in Catalyzed follow a hierarchical structure; a traversal sketch follows the list below:

Team
└── Dataset
    └── Table
        └── Rows (with optional vector embeddings)

  • Teams — Multi-tenant boundary; all resources belong to a team
  • Datasets — Logical groupings of related tables
  • Tables — Queryable data with defined schemas
  • Rows — Individual records, optionally with vector embeddings for semantic search
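
Navigating this hierarchy from the Python SDK might look like the sketch below; the catalyzed package name and every method on it are assumptions, used only to illustrate the team → dataset → table → rows nesting.

    # Hypothetical SDK traversal of the resource hierarchy (sketch).
    import catalyzed                                 # hypothetical package name

    client = catalyzed.Client(token="eyJhbGciOi...") # team-scoped session
    dataset = client.dataset("support-tickets")      # dataset within the team
    table = dataset.table("tickets")                 # table within the dataset
    for row in table.rows(limit=5):                  # rows, optionally embedded
        print(row)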

Data Flows

Each flow below is followed by a short illustrative client-side sketch.

Ingestion: Upload → Process → Extract → Store → Index
  1. Files are uploaded via the API
  2. The Data Engine processes them (OCR, parsing, extraction)
  3. Structured data is extracted and stored in tables
  4. Vector embeddings are generated for semantic search
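
From the client's side, only the upload and the job's final status are visible; steps 2-4 happen server-side. A sketch assuming hypothetical upload and job-status endpoints:

    # Upload a file, then poll the processing job (hypothetical endpoints).
    import time
    import requests

    base = "https://api.catalyzed.example/v1"        # hypothetical base URL
    headers = {"Authorization": "Bearer eyJhbGciOi..."}

    with open("report.pdf", "rb") as f:
        upload = requests.post(
            f"{base}/datasets/support-tickets/files",
            headers=headers,
            files={"file": f},
            timeout=120,
        ).json()

    while True:                                      # poll until processing ends
        job = requests.get(f"{base}/jobs/{upload['job_id']}",
                           headers=headers, timeout=30).json()
        if job["status"] in ("completed", "failed"):
            break
        time.sleep(2)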

Query: SQL Query → Parse → Plan → Execute → Return
  1. Client submits a SQL query via REST API
  2. Query Engine parses and optimizes the query plan
  3. Execution is distributed across available resources
  4. Results stream back to the client
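
A sketch of the client side of this flow, assuming results stream back as newline-delimited JSON; the endpoint and wire format are assumptions.

    # Stream query results row by row instead of buffering the whole body.
    import json
    import requests

    resp = requests.post(
        "https://api.catalyzed.example/v1/query",    # hypothetical endpoint
        headers={"Authorization": "Bearer eyJhbGciOi..."},
        json={"sql": "SELECT * FROM tickets"},
        stream=True,                                 # consume incrementally
        timeout=300,
    )
    for line in resp.iter_lines():
        if line:
            row = json.loads(line)                   # one row per NDJSON line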

Pipeline: Trigger → Queue → Execute → Stream → Complete
  1. Pipeline execution is triggered via API
  2. Job is queued and picked up by a worker
  3. Pipeline handler executes (LLM calls, data transforms)
  4. Progress streams to client via WebSocket/SSE
  5. Results are stored and execution completes
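
A sketch of triggering a run and following its progress; the docs mention WebSocket/SSE, and this example assumes an SSE endpoint and event shape.

    # Trigger a pipeline run, then follow progress events over SSE (sketch).
    import requests

    base = "https://api.catalyzed.example/v1"        # hypothetical base URL
    headers = {"Authorization": "Bearer eyJhbGciOi..."}

    run = requests.post(                             # step 1: trigger the run
        f"{base}/pipelines/summarize-tickets/runs",
        headers=headers, timeout=30,
    ).json()

    # Step 4: stream progress until the server closes the event stream.
    with requests.get(f"{base}/runs/{run['id']}/events",
                      headers=headers, stream=True) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if line.startswith("data:"):
                print(line[len("data:"):].strip())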

Security Model

Catalyzed uses a session-based authentication model with team-scoped authorization; a token-refresh sketch follows the list below.

  • Sessions — JWT tokens with sliding expiration
  • Team Scope — All API tokens and resources are scoped to a team
  • Role-Based Access — Admins can manage team settings; members can manage resources
  • Resource Authorization — Permission checks on every resource access
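
With sliding expiration, a long-lived client eventually sees a 401 and must refresh its session. Below is one way to handle that; the /sessions/refresh endpoint and response shape are assumptions, and only "JWT tokens with sliding expiration" is stated above.

    # Retry an authenticated call once after refreshing an expired session.
    import requests

    BASE = "https://api.catalyzed.example/v1"        # hypothetical base URL

    def authed_get(url, token):
        resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
        if resp.status_code == 401:                  # session expired
            token = requests.post(                   # hypothetical refresh endpoint
                f"{BASE}/sessions/refresh",
                headers={"Authorization": f"Bearer {token}"},
            ).json()["token"]
            resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
        return resp, token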

Technology Stack

Layer           Technology
API             High-performance HTTP framework
Query Engine    Apache DataFusion with vector search extensions
Storage         Columnar storage with cloud object storage
Languages       TypeScript, Rust, Python