Architecture Overview

Catalyzed is a unified data platform that combines large-scale public datasets with customer-managed private data. This page provides a high-level overview of how the system is designed.

┌─────────────────────────────────────────────────────────────────────────────────┐
│                               Client Applications                               │
│                      (Web App, Python SDK, TypeScript SDK)                      │
└────────────────────────────────────────┬────────────────────────────────────────┘
                                         │
┌────────────────────────────────────────┴────────────────────────────────────────┐
│                                    API Layer                                    │
│                           (REST API, WebSocket, Auth)                           │
└────────────────────────────────────────┬────────────────────────────────────────┘
             ┌───────────────────────────┼───────────────────────────┐
             ▼                           ▼                           ▼
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│      Query Engine       │ │       Job Workers       │ │       Data Engine       │
│  (SQL + Vector Search)  │ │ (Pipelines, Processing) │ │    (File Ingestion)     │
└────────────┬────────────┘ └────────────┬────────────┘ └────────────┬────────────┘
             └───────────────────────────┼───────────────────────────┘
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                   Data Layer                                    │
│           (Object Storage, Metadata Catalog, Cache, Columnar Storage)           │
└─────────────────────────────────────────────────────────────────────────────────┘

API Layer

The API layer is the primary entry point for all client interactions; a request sketch follows the list below.

  • REST API — Full-featured HTTP API with WebSocket support
  • Real-time Updates — Live progress tracking for pipeline executions and resource changes
  • Authentication — JWT-based session tokens with team-scoped authorization
  • OpenAPI — Auto-generated API documentation with type-safe schemas
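
Every call carries the JWT session token in an Authorization header. Below is a minimal sketch of an authenticated request; the base URL, the /datasets path, and the response shape are illustrative assumptions, not the documented API surface.

    # Minimal authenticated REST call (hypothetical endpoint and response shape).
    import requests

    BASE_URL = "https://api.catalyzed.example/v1"   # hypothetical base URL
    SESSION_TOKEN = "eyJhbGciOi..."                 # JWT session token

    resp = requests.get(
        f"{BASE_URL}/datasets",
        headers={"Authorization": f"Bearer {SESSION_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    for dataset in resp.json().get("datasets", []):  # assumed response shape
        print(dataset["name"])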

Query Engine

The Query Engine is a distributed SQL engine for querying structured and vector data; a query sketch follows the list below.

  • SQL Interface — ANSI SQL:2011 + PostgreSQL-compatible syntax, powered by Apache DataFusion with custom extensions for vector search
  • Distributed Execution — Automatic horizontal scaling for large datasets
  • Vector Search — Native similarity search with knn_search(), knn_cosine(), and knn_l2() functions
  • Cross-Dataset Joins — Query across multiple datasets in a single SQL statement
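
The knn_* functions compose with ordinary SQL, including joins. Below is a sketch of a similarity query submitted over the REST API; only the knn_cosine() name comes from the list above, while the /query endpoint, the parameter-binding style, and the table names are assumptions.

    # Vector similarity search combined with a relational join (sketch).
    import requests

    SESSION_TOKEN = "eyJhbGciOi..."                  # JWT session token

    sql = """
    SELECT d.title, p.chunk_text
    FROM documents AS d
    JOIN passages AS p ON p.doc_id = d.id
    ORDER BY knn_cosine(p.embedding, :query_vector)  -- documented function name
    LIMIT 10
    """

    resp = requests.post(
        "https://api.catalyzed.example/v1/query",    # hypothetical endpoint
        headers={"Authorization": f"Bearer {SESSION_TOKEN}"},
        json={"sql": sql, "params": {"query_vector": [0.12, -0.53, 0.91]}},
        timeout=60,
    )
    rows = resp.json()["rows"]                       # assumed response shape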

Job Workers

Job Workers handle background processing for pipelines and long-running tasks; a handler sketch follows the list below.

  • Pipeline Execution — Run LLM-powered data transformation pipelines
  • File Processing — Extract text, tables, and embeddings from uploaded files
  • Python Executor — Sandboxed Python runtime for custom pipeline handlers
  • Queue-Based — Asynchronous job processing with real-time coordination
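
Custom handlers run inside the sandboxed Python executor. Below is a sketch of what a handler might look like; the handle() signature and the context helpers are hypothetical, chosen only to illustrate the shape of an LLM-powered transform.

    # Hypothetical pipeline handler as run by a job worker (sketch).
    def handle(batch, context):
        """Transform a batch of rows; the worker persists whatever is returned."""
        out = []
        for row in batch:
            summary = context.llm(                   # hypothetical LLM helper
                prompt=f"Summarize: {row['body']}",
            )
            out.append({**row, "summary": summary})
        context.log(f"processed {len(out)} rows")    # hypothetical progress hook
        return out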

Data Engine

The Data Engine handles file ingestion and data transformation.

  • Document Processing — Extract structured data from PDF, XLSX, CSV, and other formats
  • Embedding Generation — Automatic vector embeddings for semantic search
  • Schema Inference — Detect and apply schemas to uploaded data

Data Layer

The Data Layer provides persistent storage for all platform data.

Component           Purpose
Object Storage      Raw files, processed documents, query results
Metadata Catalog    Table schemas, indexes, job queue, user data
In-Memory Cache     Session storage, real-time messaging
Columnar Storage    Optimized storage with B-tree and vector indices

Resource Hierarchy

Resources in Catalyzed follow a hierarchical structure; a traversal sketch follows the list below:

Team
└── Dataset
    └── Table
        └── Rows (with optional vector embeddings)

  • Teams — Multi-tenant boundary; all resources belong to a team
  • Datasets — Logical groupings of related tables
  • Tables — Queryable data with defined schemas
  • Rows — Individual records, optionally with vector embeddings for semantic search
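
Navigating this hierarchy from the Python SDK might look like the sketch below; the catalyzed package name and every method on it are assumptions, used only to illustrate the team → dataset → table → rows nesting.

    # Hypothetical SDK traversal of the resource hierarchy (sketch).
    import catalyzed                                 # hypothetical package name

    client = catalyzed.Client(token="eyJhbGciOi...") # team-scoped session
    dataset = client.dataset("support-tickets")      # dataset within the team
    table = dataset.table("tickets")                 # table within the dataset
    for row in table.rows(limit=5):                  # rows, optionally embedded
        print(row)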

Data Flows

Each flow below is followed by a short illustrative client-side sketch.

Ingestion: Upload → Process → Extract → Store → Index
  1. Files are uploaded via the API
  2. The Data Engine processes them (OCR, parsing, extraction)
  3. Structured data is extracted and stored in tables
  4. Vector embeddings are generated for semantic search
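
From the client's side, only the upload and the job's final status are visible; steps 2-4 happen server-side. A sketch assuming hypothetical upload and job-status endpoints:

    # Upload a file, then poll the processing job (hypothetical endpoints).
    import time
    import requests

    base = "https://api.catalyzed.example/v1"        # hypothetical base URL
    headers = {"Authorization": "Bearer eyJhbGciOi..."}

    with open("report.pdf", "rb") as f:
        upload = requests.post(
            f"{base}/datasets/support-tickets/files",
            headers=headers,
            files={"file": f},
            timeout=120,
        ).json()

    while True:                                      # poll until processing ends
        job = requests.get(f"{base}/jobs/{upload['job_id']}",
                           headers=headers, timeout=30).json()
        if job["status"] in ("completed", "failed"):
            break
        time.sleep(2)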

Query: SQL Query → Parse → Plan → Execute → Return
  1. Client submits a SQL query via REST API
  2. Query Engine parses and optimizes the query plan
  3. Execution is distributed across available resources
  4. Results stream back to the client
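
A sketch of the client side of this flow, assuming results stream back as newline-delimited JSON; the endpoint and wire format are assumptions.

    # Stream query results row by row instead of buffering the whole body.
    import json
    import requests

    resp = requests.post(
        "https://api.catalyzed.example/v1/query",    # hypothetical endpoint
        headers={"Authorization": "Bearer eyJhbGciOi..."},
        json={"sql": "SELECT * FROM tickets"},
        stream=True,                                 # consume incrementally
        timeout=300,
    )
    for line in resp.iter_lines():
        if line:
            row = json.loads(line)                   # one row per NDJSON line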

Pipeline: Trigger → Queue → Execute → Stream → Complete
  1. Pipeline execution is triggered via API
  2. Job is queued and picked up by a worker
  3. Pipeline handler executes (LLM calls, data transforms)
  4. Progress streams to client via WebSocket/SSE
  5. Results are stored and execution completes
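
A sketch of triggering a run and following its progress; the docs mention WebSocket/SSE, and this example assumes an SSE endpoint and event shape.

    # Trigger a pipeline run, then follow progress events over SSE (sketch).
    import requests

    base = "https://api.catalyzed.example/v1"        # hypothetical base URL
    headers = {"Authorization": "Bearer eyJhbGciOi..."}

    run = requests.post(                             # step 1: trigger the run
        f"{base}/pipelines/summarize-tickets/runs",
        headers=headers, timeout=30,
    ).json()

    # Step 4: stream progress until the server closes the event stream.
    with requests.get(f"{base}/runs/{run['id']}/events",
                      headers=headers, stream=True) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if line.startswith("data:"):
                print(line[len("data:"):].strip())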

Security Model

Catalyzed uses a session-based authentication model with team-scoped authorization; a token-refresh sketch follows the list below.

  • Sessions — JWT tokens with sliding expiration
  • Team Scope — All API tokens and resources are scoped to a team
  • Role-Based Access — Admins can manage team settings; members can manage resources
  • Resource Authorization — Permission checks on every resource access
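
With sliding expiration, a long-lived client eventually sees a 401 and must refresh its session. Below is one way to handle that; the /sessions/refresh endpoint and response shape are assumptions, and only "JWT tokens with sliding expiration" is stated above.

    # Retry an authenticated call once after refreshing an expired session.
    import requests

    BASE = "https://api.catalyzed.example/v1"        # hypothetical base URL

    def authed_get(url, token):
        resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
        if resp.status_code == 401:                  # session expired
            token = requests.post(                   # hypothetical refresh endpoint
                f"{BASE}/sessions/refresh",
                headers={"Authorization": f"Bearer {token}"},
            ).json()["token"]
            resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
        return resp, token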

Technology Stack

Layer           Technology
API             High-performance HTTP framework
Query Engine    Apache DataFusion with vector search extensions
Storage         Columnar storage with cloud object storage
Languages       TypeScript, Rust, Python