Tables
Tables are where your data lives. Each table has a defined schema and supports SQL queries, indexes, and schema evolution.
Creating a Table
Section titled “Creating a Table”Tables are created within a dataset:
Create a table
curl -X POST https://api.catalyzed.ai/dataset-tables \ -H "Authorization: Bearer $API_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "datasetId": "HoIEJNIPiQIy6TjVRxjwz", "tableName": "orders", "description": "Customer orders", "fields": [ {"name": "order_id", "arrowType": "utf8", "nullable": false}, {"name": "customer_id", "arrowType": "utf8", "nullable": false}, {"name": "amount", "arrowType": "float64", "nullable": false}, {"name": "status", "arrowType": "utf8", "nullable": false}, {"name": "created_at", "arrowType": "timestamp", "nullable": false} ], "primaryKeyColumns": ["order_id"] }'const response = await fetch("https://api.catalyzed.ai/dataset-tables", { method: "POST", headers: { Authorization: `Bearer ${apiToken}`, "Content-Type": "application/json", }, body: JSON.stringify({ datasetId: "HoIEJNIPiQIy6TjVRxjwz", tableName: "orders", description: "Customer orders", fields: [ { name: "order_id", arrowType: "utf8", nullable: false }, { name: "customer_id", arrowType: "utf8", nullable: false }, { name: "amount", arrowType: "float64", nullable: false }, { name: "status", arrowType: "utf8", nullable: false }, { name: "created_at", arrowType: "timestamp", nullable: false }, ], primaryKeyColumns: ["order_id"], }),});response = requests.post( "https://api.catalyzed.ai/dataset-tables", headers={"Authorization": f"Bearer {api_token}"}, json={ "datasetId": "HoIEJNIPiQIy6TjVRxjwz", "tableName": "orders", "description": "Customer orders", "fields": [ {"name": "order_id", "arrowType": "utf8", "nullable": False}, {"name": "customer_id", "arrowType": "utf8", "nullable": False}, {"name": "amount", "arrowType": "float64", "nullable": False}, {"name": "status", "arrowType": "utf8", "nullable": False}, {"name": "created_at", "arrowType": "timestamp", "nullable": False} ], "primaryKeyColumns": ["order_id"] })Supported Data Types
Section titled “Supported Data Types”Catalyzed uses Apache Arrow data types. Type names are case-insensitive.
Numeric Types
Section titled “Numeric Types”| Type | Aliases | Description |
|---|---|---|
int8 | 8-bit signed integer (-128 to 127) | |
int16 | 16-bit signed integer | |
int32 | 32-bit signed integer | |
int64 | 64-bit signed integer | |
uint8 | 8-bit unsigned integer (0 to 255) | |
uint16 | 16-bit unsigned integer | |
uint32 | 32-bit unsigned integer | |
uint64 | 64-bit unsigned integer | |
float16 | 16-bit floating point (half precision) | |
float32 | float | 32-bit floating point |
float64 | double | 64-bit floating point |
String and Binary Types
Section titled “String and Binary Types”| Type | Aliases | Description |
|---|---|---|
utf8 | string | UTF-8 text |
largeutf8 | large_utf8, largestring | Large UTF-8 text (>2GB) |
binary | Binary bytes | |
largebinary | large_binary | Large binary bytes (>2GB) |
Date and Time Types
Section titled “Date and Time Types”| Type | Description |
|---|---|
date32 | Days since Unix epoch (1970-01-01) |
date64 | Milliseconds since Unix epoch |
timestamp | Microsecond-precision datetime (ISO 8601) |
Other Types
Section titled “Other Types”| Type | Description |
|---|---|
bool | True/false (alias: boolean) |
null | Null type |
list<T> | Array of type T (e.g., list<int32>, list<utf8>) |
Writing Data
Section titled “Writing Data”Write Rows
Section titled “Write Rows”Write data to a table using the /rows endpoint. The write mode is specified as a query parameter, and the request body is a JSON array of row objects.
Insert rows
curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append" \ -H "Authorization: Bearer $API_TOKEN" \ -H "Content-Type: application/json" \ -d '[ { "order_id": "ORD-001", "customer_id": "CUST-100", "amount": 99.99, "status": "completed", "created_at": "2024-01-15T10:30:00Z" }, { "order_id": "ORD-002", "customer_id": "CUST-101", "amount": 149.50, "status": "pending", "created_at": "2024-01-15T14:45:00Z" } ]'await fetch( "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append", { method: "POST", headers: { Authorization: `Bearer ${apiToken}`, "Content-Type": "application/json", }, body: JSON.stringify([ { order_id: "ORD-001", customer_id: "CUST-100", amount: 99.99, status: "completed", created_at: "2024-01-15T10:30:00Z", }, { order_id: "ORD-002", customer_id: "CUST-101", amount: 149.50, status: "pending", created_at: "2024-01-15T14:45:00Z", }, ]), });requests.post( "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append", headers={"Authorization": f"Bearer {api_token}"}, json=[ { "order_id": "ORD-001", "customer_id": "CUST-100", "amount": 99.99, "status": "completed", "created_at": "2024-01-15T10:30:00Z" }, { "order_id": "ORD-002", "customer_id": "CUST-101", "amount": 149.50, "status": "pending", "created_at": "2024-01-15T14:45:00Z" } ])Query Parameters
Section titled “Query Parameters”| Parameter | Required | Description |
|---|---|---|
mode | Yes | Write operation mode (see below) |
idempotency_key | No | Unique key for exactly-once write semantics |
skip_validation | No | Skip schema validation for faster writes |
Write Modes
Section titled “Write Modes”The mode query parameter controls how data is written:
| Mode | Description |
|---|---|
append | Insert new rows without duplicate checking (fastest) |
upsert | Insert new rows or update existing by primary key |
overwrite | Replace all existing data in the table |
delete | Delete rows by primary key |
Upsert Example
Section titled “Upsert Example”Update existing rows or insert new ones based on primary key:
curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=upsert" \ -H "Authorization: Bearer $API_TOKEN" \ -H "Content-Type: application/json" \ -d '[{"order_id": "ORD-001", "status": "shipped", "amount": 99.99}]'Delete Example
Section titled “Delete Example”Delete rows by primary key values:
curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=delete" \ -H "Authorization: Bearer $API_TOKEN" \ -H "Content-Type: application/json" \ -d '["ORD-001", "ORD-002"]'For tables with composite primary keys, pass objects:
[{"order_id": "ORD-001", "tenant_id": "T1"}, {"order_id": "ORD-002", "tenant_id": "T1"}]Arrow IPC Format
Section titled “Arrow IPC Format”For high-performance data ingestion, send data in Apache Arrow IPC format instead of JSON. This is ideal for:
- Large batch inserts (millions of rows)
- Direct integration with pandas, Polars, or DuckDB
- Avoiding JSON serialization overhead
Write with Arrow IPC
# Arrow IPC data must be generated programmaticallycurl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append" \ -H "Authorization: Bearer $API_TOKEN" \ -H "Content-Type: application/vnd.apache.arrow.stream" \ --data-binary @data.arrowimport * as arrow from "apache-arrow";
// Create Arrow tableconst table = arrow.tableFromArrays({ order_id: ["ORD-001", "ORD-002"], customer_id: ["CUST-100", "CUST-101"], amount: [99.99, 149.5], status: ["completed", "pending"],});
// Serialize to IPC stream formatconst ipcBytes = arrow.tableToIPC(table, "stream");
// Send to APIawait fetch( "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append", { method: "POST", headers: { Authorization: `Bearer ${apiToken}`, "Content-Type": "application/vnd.apache.arrow.stream", }, body: ipcBytes, });import pyarrow as paimport requests
# Create Arrow table from pandas DataFramedf = pd.DataFrame({ "order_id": ["ORD-001", "ORD-002"], "customer_id": ["CUST-100", "CUST-101"], "amount": [99.99, 149.50], "status": ["completed", "pending"],})table = pa.Table.from_pandas(df)
# Serialize to IPC stream formatsink = pa.BufferOutputStream()with pa.ipc.new_stream(sink, table.schema) as writer: writer.write_table(table)arrow_bytes = sink.getvalue().to_pybytes()
# Send to APIrequests.post( "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append", headers={ "Authorization": f"Bearer {api_token}", "Content-Type": "application/vnd.apache.arrow.stream" }, data=arrow_bytes)Querying Data
Section titled “Querying Data”Use the /queries endpoint to query your tables with SQL. You can query a single table or join multiple tables together.
Query table data
curl -X POST https://api.catalyzed.ai/queries \ -H "Authorization: Bearer $API_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "sql": "SELECT * FROM orders WHERE status = '\''completed'\'' ORDER BY created_at DESC LIMIT 10", "tables": { "orders": "KzaMsfA0LSw_Ld0KyaXIS" } }'const response = await fetch("https://api.catalyzed.ai/queries", { method: "POST", headers: { Authorization: `Bearer ${apiToken}`, "Content-Type": "application/json", }, body: JSON.stringify({ sql: "SELECT * FROM orders WHERE status = 'completed' ORDER BY created_at DESC LIMIT 10", tables: { orders: "KzaMsfA0LSw_Ld0KyaXIS", }, }),});const data = await response.json();response = requests.post( "https://api.catalyzed.ai/queries", headers={"Authorization": f"Bearer {api_token}"}, json={ "sql": "SELECT * FROM orders WHERE status = 'completed' ORDER BY created_at DESC LIMIT 10", "tables": { "orders": "KzaMsfA0LSw_Ld0KyaXIS" } })data = response.json()The tables parameter maps table names used in your SQL to their table IDs. This works the same way whether you’re querying one table or joining multiple tables.
Query Response
Section titled “Query Response”{ "queryId": "qry_abc123", "columns": [ {"name": "order_id", "type": "Utf8"}, {"name": "customer_id", "type": "Utf8"}, {"name": "amount", "type": "Float64"}, {"name": "status", "type": "Utf8"}, {"name": "created_at", "type": "Timestamp(Microsecond, Some(\"UTC\"))"} ], "rows": [ {"order_id": "ORD-001", "customer_id": "CUST-100", "amount": 99.99, "status": "completed", "created_at": "2024-01-15T10:30:00Z"}, {"order_id": "ORD-002", "customer_id": "CUST-101", "amount": 149.50, "status": "completed", "created_at": "2024-01-15T09:15:00Z"} ], "rowCount": 2, "truncated": false, "stats": { "executionTimeMs": 42, "planningTimeMs": 5, "bytesScanned": 1024, "rowsScanned": 100 }}The stats field is included when includeStats: true is passed in the request. A usage field with detailed I/O metrics is also returned for billing purposes.
Joining Tables
Section titled “Joining Tables”Query across multiple tables by including them in the tables mapping:
Join multiple tables
curl -X POST https://api.catalyzed.ai/queries \ -H "Authorization: Bearer $API_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "sql": "SELECT o.order_id, c.name, o.amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id", "tables": { "orders": "Ednc5U676CO4hn-FqsXeA", "customers": "6fTBbbj4uv8TVMVh0gVch" } }'const response = await fetch("https://api.catalyzed.ai/queries", { method: "POST", headers: { Authorization: `Bearer ${apiToken}`, "Content-Type": "application/json", }, body: JSON.stringify({ sql: `SELECT o.order_id, c.name, o.amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id`, tables: { orders: "Ednc5U676CO4hn-FqsXeA", customers: "6fTBbbj4uv8TVMVh0gVch", }, }),});response = requests.post( "https://api.catalyzed.ai/queries", headers={"Authorization": f"Bearer {api_token}"}, json={ "sql": """ SELECT o.order_id, c.name, o.amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id """, "tables": { "orders": "Ednc5U676CO4hn-FqsXeA", "customers": "6fTBbbj4uv8TVMVh0gVch" } })See the Querying Data guide for more SQL examples and best practices.
Table Schema
Section titled “Table Schema”Get Schema
Section titled “Get Schema”Get table schema
curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema \ -H "Authorization: Bearer $API_TOKEN"const response = await fetch( "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema", { headers: { Authorization: `Bearer ${apiToken}` } });const schema = await response.json();response = requests.get( "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema", headers={"Authorization": f"Bearer {api_token}"})schema = response.json()Schema Versioning
Section titled “Schema Versioning”Tables track schema versions. Each modification creates a new version:
curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema/versions \ -H "Authorization: Bearer $API_TOKEN"See the Schema Management guide for migration details.
Indexes
Section titled “Indexes”Indexes improve query performance for filtered columns.
Create an Index
Section titled “Create an Index”Create an index
curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes \ -H "Authorization: Bearer $API_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "indexName": "idx_customer_id", "columnName": "customer_id", "indexType": "btree" }'await fetch("https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes", { method: "POST", headers: { Authorization: `Bearer ${apiToken}`, "Content-Type": "application/json", }, body: JSON.stringify({ indexName: "idx_customer_id", columnName: "customer_id", indexType: "btree", }),});requests.post( "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes", headers={"Authorization": f"Bearer {api_token}"}, json={ "indexName": "idx_customer_id", "columnName": "customer_id", "indexType": "btree" })Index Types
Section titled “Index Types”| Type | Use Case |
|---|---|
btree | Equality and range queries on scalar columns |
ivf_pq | Vector similarity search (ANN) |
ivf_hnsw_pq | High-recall vector search |
ivf_hnsw_sq | Memory-efficient vector search |
List Indexes
Section titled “List Indexes”curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes \ -H "Authorization: Bearer $API_TOKEN"Drop an Index
Section titled “Drop an Index”curl -X DELETE https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes/idx_customer_id \ -H "Authorization: Bearer $API_TOKEN"Table Operations
Section titled “Table Operations”Get Execution Plan
Section titled “Get Execution Plan”Preview how a query will execute using the /queries/explain endpoint:
curl -X POST https://api.catalyzed.ai/queries/explain \ -H "Authorization: Bearer $API_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "sql": "SELECT * FROM orders WHERE customer_id = '\''CUST-100'\''", "tables": {"orders": "KzaMsfA0LSw_Ld0KyaXIS"} }'Compact Table
Section titled “Compact Table”Optimize storage by merging small files:
curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/compact \ -H "Authorization: Bearer $API_TOKEN"Compute Statistics
Section titled “Compute Statistics”Update table statistics for query optimization:
curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/statistics \ -H "Authorization: Bearer $API_TOKEN"Deleting a Table
Section titled “Deleting a Table”curl -X DELETE https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA \ -H "Authorization: Bearer $API_TOKEN"API Reference
Section titled “API Reference”See the Dataset Tables API for complete endpoint documentation.