Architecture Overview
Catalyzed is a unified data platform that combines large-scale public datasets with customer-managed private data. This page provides a high-level overview of how the system is designed.
System Overview
```
                ┌──────────────────────────────────────────────────┐
                │               Client Applications                │
                │      (Web App, Python SDK, TypeScript SDK)       │
                └──────────────────────────────────────────────────┘
                                         │
                                         ▼
                ┌──────────────────────────────────────────────────┐
                │                    API Layer                     │
                │           (REST API, WebSocket, Auth)            │
                └──────────────────────────────────────────────────┘
                                         │
             ┌───────────────────────────┴───────────────────────────┐
             ▼                           ▼                           ▼
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│      Query Engine       │ │       Job Workers       │ │       Data Engine       │
│  (SQL + Vector Search)  │ │ (Pipelines, Processing) │ │    (File Ingestion)     │
└─────────────────────────┘ └─────────────────────────┘ └─────────────────────────┘
             │                           │                           │
             └───────────────────────────┼───────────────────────────┘
                                         ▼
                ┌──────────────────────────────────────────────────┐
                │                    Data Layer                    │
                │   (Object Storage, Metadata Catalog, Cache,      │
                │               Columnar Storage)                  │
                └──────────────────────────────────────────────────┘
```

Core Components
API Layer

The API is the primary entry point for all client interactions.
- REST API — Full-featured HTTP API with WebSocket support
- Real-time Updates — Live progress tracking for pipeline executions and resource changes
- Authentication — JWT-based session tokens with team-scoped authorization
- OpenAPI — Auto-generated API documentation with type-safe schemas
Query Engine
A distributed SQL engine for querying structured and vector data.
- SQL Interface — ANSI SQL:2011 + PostgreSQL-compatible syntax, powered by Apache DataFusion with custom extensions for vector search
- Distributed Execution — Automatic horizontal scaling for large datasets
- Vector Search — Native similarity search with `knn_search()`, `knn_cosine()`, and `knn_l2()` functions
- Cross-Dataset Joins — Query across multiple datasets in a single SQL statement
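The semantics of these vector-search functions can be sketched in plain Python. The snippet below is an illustrative stand-in, not the engine's implementation: it ranks rows by cosine distance to a query vector and keeps the top k, which is what a `knn_cosine()` call expresses in SQL.

```python
import math

def knn_cosine(rows, query, k):
    """Rank (id, vector) rows by cosine distance to `query`; return the top-k ids."""
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / norm  # 0 = same direction, 2 = opposite

    ranked = sorted(rows, key=lambda r: cosine_distance(r[1], query))
    return [row_id for row_id, _vec in ranked][:k]

rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
print(knn_cosine(rows, [1.0, 0.0], 2))  # the two rows nearest the query vector
```

`knn_l2()` would differ only in the distance function (Euclidean instead of cosine); the ranking-and-truncation shape is the same.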
Job Workers
Background processing for pipelines and long-running tasks.
- Pipeline Execution — Run LLM-powered data transformation pipelines
- File Processing — Extract text, tables, and embeddings from uploaded files
- Python Executor — Sandboxed Python runtime for custom pipeline handlers
- Queue-Based — Asynchronous job processing with real-time coordination
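The queue-based model can be sketched with Python's standard library. This is a toy stand-in for the real workers, which run as separate processes against a shared queue: jobs are enqueued, a worker thread drains them asynchronously, and a sentinel value shuts it down.

```python
import queue
import threading

def run_worker(jobs, results):
    """Pull jobs off the queue until a None sentinel arrives, recording each result."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work
            break
        results.append(f"{job}:done")  # stand-in for pipeline/file processing
        jobs.task_done()

jobs, results = queue.Queue(), []
worker = threading.Thread(target=run_worker, args=(jobs, results))
worker.start()

for name in ("pipeline-1", "file-extract"):
    jobs.put(name)   # producers enqueue without waiting for completion
jobs.put(None)       # ask the worker to shut down after draining
worker.join()
print(results)
```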
Data Engine
Handles file ingestion and data transformation.
- Document Processing — Extract structured data from PDF, XLSX, CSV, and other formats
- Embedding Generation — Automatic vector embeddings for semantic search
- Schema Inference — Detect and apply schemas to uploaded data
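Schema inference can be illustrated with a minimal sketch: for each column, try progressively wider types until every sample value fits. The type names and fallback order here are illustrative, not the Data Engine's actual rules.

```python
def infer_type(values):
    """Pick the narrowest type that fits every sample value in a column."""
    for cast, name in ((int, "integer"), (float, "float")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue  # this type doesn't fit; try the next wider one
    return "text"     # everything parses as text

def infer_schema(header, rows):
    columns = list(zip(*rows))  # transpose row-oriented data into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}

schema = infer_schema(["id", "price", "label"],
                      [["1", "9.99", "alpha"], ["2", "12.50", "beta"]])
print(schema)
```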
Data Layer
Persistent storage for all platform data.
| Component | Purpose |
|---|---|
| Object Storage | Raw files, processed documents, query results |
| Metadata Catalog | Table schemas, indexes, job queue, user data |
| In-Memory Cache | Session storage, real-time messaging |
| Columnar Storage | Optimized storage with B-tree and vector indices |
Data Hierarchy
Resources in Catalyzed follow a hierarchical structure:
```
Team
 └── Dataset
      └── Table
           └── Rows (with optional vector embeddings)
```

- Teams — Multi-tenant boundary; all resources belong to a team
- Datasets — Logical groupings of related tables
- Tables — Queryable data with defined schemas
- Rows — Individual records, optionally with vector embeddings for semantic search
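The containment relationships above can be modeled directly. The dataclasses below are an illustrative sketch of the hierarchy, not the platform's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    name: str
    rows: list = field(default_factory=list)   # each row dict may carry an embedding

@dataclass
class Dataset:
    name: str
    tables: list = field(default_factory=list)

@dataclass
class Team:
    name: str
    datasets: list = field(default_factory=list)  # the multi-tenant boundary

# One team owning one dataset, table, and row with an optional embedding.
team = Team("acme", [
    Dataset("sales", [
        Table("orders", [{"id": 1, "embedding": [0.1, 0.9]}]),
    ]),
])
print(team.datasets[0].tables[0].rows[0]["id"])
```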
See also:

- Teams: Multi-tenancy, roles, and collaboration
- Datasets: Organizing data into logical groups
- Tables: Schema definitions and data storage
- Files: Upload and process data files
Data Flow
1. File Ingestion
Upload → Process → Extract → Store → Index

- Files are uploaded via the API
- The Data Engine processes them (OCR, parsing, extraction)
- Structured data is extracted and stored in tables
- Vector embeddings are generated for semantic search
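The steps above can be sketched end to end. All function names here are hypothetical, and the "embedding" is a stand-in for a real model call; the point is the shape of the flow: process, extract, store, index.

```python
def process(file_bytes):
    """Hypothetical processing stage: decode raw upload bytes to text."""
    return file_bytes.decode()

def extract(text):
    """Hypothetical extraction stage: split text into structured records."""
    return [line.split(",") for line in text.splitlines()]

def embed(record):
    # Stand-in embedding: real ingestion would call an embedding model.
    return [float(len(field)) for field in record]

def ingest(file_bytes):
    table, index = [], []
    for record in extract(process(file_bytes)):
        table.append(record)         # Store: structured data lands in a table
        index.append(embed(record))  # Index: vectors enable semantic search
    return table, index

table, index = ingest(b"a,bb\nccc,dddd")
print(table, index)
```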
2. Query Execution
SQL Query → Parse → Plan → Execute → Return

- Client submits a SQL query via REST API
- Query Engine parses and optimizes the query plan
- Execution is distributed across available resources
- Results stream back to the client
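The streaming step can be illustrated with a generator. The toy "plan" below is just a column/value filter standing in for a real optimized query plan; the key idea is that rows are yielded as they match rather than materialized up front.

```python
def execute_plan(plan, data):
    """Toy executor: stream rows matching a (column, value) filter plan."""
    column, value = plan
    for row in data:
        if row[column] == value:
            yield row  # results stream back one row at a time

data = [{"region": "eu", "n": 1}, {"region": "us", "n": 2}, {"region": "eu", "n": 3}]
plan = ("region", "eu")  # stand-in for a parsed and optimized SQL plan
streamed = list(execute_plan(plan, data))
print(streamed)
```

Because `execute_plan` is a generator, a client could consume the first rows before the scan finishes, which is what "results stream back" means in practice.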
3. Pipeline Execution
Trigger → Queue → Execute → Stream → Complete

- Pipeline execution is triggered via API
- Job is queued and picked up by a worker
- Pipeline handler executes (LLM calls, data transforms)
- Progress streams to client via WebSocket/SSE
- Results are stored and execution completes
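The progress-streaming model can be sketched as a generator of events. A real execution would push these events over WebSocket/SSE; here they are simply yielded in order, ending with a completion event that carries the result.

```python
def run_pipeline(steps):
    """Yield a progress event as each step completes, then a final result event."""
    total = len(steps)
    output = None
    for i, step in enumerate(steps, start=1):
        output = step(output)  # each handler receives the previous step's output
        yield {"type": "progress", "step": i, "total": total}
    yield {"type": "complete", "result": output}

# Two toy handlers standing in for LLM calls / data transforms.
steps = [lambda _: "raw", lambda prev: prev.upper()]
events = list(run_pipeline(steps))
print(events[-1])
```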
Authentication & Authorization
Catalyzed uses a session-based authentication model with team-scoped authorization.
- Sessions — JWT tokens with sliding expiration
- Team Scope — All API tokens and resources are scoped to a team
- Role-Based Access — Admins can manage team settings; members can manage resources
- Resource Authorization — Permission checks on every resource access
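The team-scoped session check can be illustrated with a self-contained sketch. This is not Catalyzed's actual token format: it hand-rolls an HMAC-signed payload with the standard library to show the two checks named above, team scope and expiry, being applied on access.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # stand-in; a real deployment uses a managed signing key

def sign_session(team_id, ttl=3600):
    """Issue a signed token carrying a team scope and an expiry timestamp."""
    payload = json.dumps({"team": team_id, "exp": time.time() + ttl}).encode()
    body = base64.urlsafe_b64encode(payload).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig

def verify_session(token, team_id):
    """Reject tampered tokens, wrong-team tokens, and expired tokens."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # signature mismatch: token was forged or altered
    claims = json.loads(base64.urlsafe_b64decode(body))
    # Team scope and expiry are both checked on every resource access.
    return claims["team"] == team_id and claims["exp"] > time.time()

token = sign_session("team-42")
print(verify_session(token, "team-42"), verify_session(token, "team-99"))
```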
See the Authentication Guide to learn how to authenticate with the API.
Technology Stack
| Layer | Technology |
|---|---|
| API | High-performance HTTP framework |
| Query Engine | Apache DataFusion with vector search extensions |
| Storage | Columnar storage with cloud object storage |
| Languages | TypeScript, Rust, Python |
Next Steps
- Quickstart: Get up and running in minutes
- API Reference: Explore the complete REST API
- Pipelines: Build automated data workflows