Tables

Tables are where your data lives. Each table has a defined schema and supports SQL queries, indexes, and schema evolution.

Tables are created within a dataset:

Create a table

curl -X POST https://api.catalyzed.ai/dataset-tables \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetId": "HoIEJNIPiQIy6TjVRxjwz",
    "tableName": "orders",
    "description": "Customer orders",
    "fields": [
      {"name": "order_id", "arrowType": "utf8", "nullable": false},
      {"name": "customer_id", "arrowType": "utf8", "nullable": false},
      {"name": "amount", "arrowType": "float64", "nullable": false},
      {"name": "status", "arrowType": "utf8", "nullable": false},
      {"name": "created_at", "arrowType": "timestamp", "nullable": false}
    ],
    "primaryKeyColumns": ["order_id"]
  }'
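
The same request can be assembled from code, which helps when the field list is generated rather than typed by hand. The following is a minimal Python sketch using only the standard library. The helper names (`field`, `build_create_table_request`) are illustrative and not part of any Catalyzed SDK, and the primary-key check is a client-side convenience, not documented server behavior.

```python
import json
import urllib.request

API_BASE = "https://api.catalyzed.ai"  # base URL taken from the curl example


def field(name, arrow_type, nullable=False):
    """Build one entry of the "fields" array."""
    return {"name": name, "arrowType": arrow_type, "nullable": nullable}


def build_create_table_request(token, dataset_id, table_name, fields,
                               primary_key, description=""):
    """Return an unsent urllib Request that mirrors the curl call."""
    field_names = {f["name"] for f in fields}
    missing = [col for col in primary_key if col not in field_names]
    if missing:
        # Assumption: checking this locally is just a convenience.
        raise ValueError(f"primary key columns not in fields: {missing}")
    payload = {
        "datasetId": dataset_id,
        "tableName": table_name,
        "description": description,
        "fields": fields,
        "primaryKeyColumns": list(primary_key),
    }
    return urllib.request.Request(
        f"{API_BASE}/dataset-tables",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_table_request(
    token="YOUR_API_TOKEN",  # placeholder
    dataset_id="HoIEJNIPiQIy6TjVRxjwz",
    table_name="orders",
    description="Customer orders",
    fields=[
        field("order_id", "utf8"),
        field("customer_id", "utf8"),
        field("amount", "float64"),
        field("status", "utf8"),
        field("created_at", "timestamp"),
    ],
    primary_key=["order_id"],
)
```

The request is only built here, not sent. Passing it to `urllib.request.urlopen(request)` would perform the call.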

Catalyzed uses Apache Arrow data types. Type names are case-insensitive.

| Type | Aliases | Description |
|------|---------|-------------|
| int8 | | 8-bit signed integer (-128 to 127) |
| int16 | | 16-bit signed integer |
| int32 | | 32-bit signed integer |
| int64 | | 64-bit signed integer |
| uint8 | | 8-bit unsigned integer (0 to 255) |
| uint16 | | 16-bit unsigned integer |
| uint32 | | 32-bit unsigned integer |
| uint64 | | 64-bit unsigned integer |
| float16 | | 16-bit floating point (half precision) |
| float32 | float | 32-bit floating point |
| float64 | double | 64-bit floating point |

| Type | Aliases | Description |
|------|---------|-------------|
| utf8 | string | UTF-8 text |
| largeutf8 | large_utf8, largestring | Large UTF-8 text (>2GB) |
| binary | | Binary bytes |
| largebinary | large_binary | Large binary bytes (>2GB) |

| Type | Description |
|------|-------------|
| date32 | Days since Unix epoch (1970-01-01) |
| date64 | Milliseconds since Unix epoch |
| timestamp | Microsecond-precision datetime (ISO 8601) |

| Type | Description |
|------|-------------|
| bool | True/false (alias: boolean) |
| null | Null type |
| list<T> | Array of type T (e.g., list<int32>, list<utf8>) |
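
Because type names are case-insensitive and several have aliases, client code that compares or validates schemas may want to reduce every name to one canonical spelling first. The sketch below does that with the names and aliases listed in the tables above. It is a client-side helper written for illustration. How the server itself normalizes names, and whether aliases are accepted inside `list<...>`, are assumptions.

```python
import re

# Alias -> canonical name, copied from the type tables.
ALIASES = {
    "float": "float32",
    "double": "float64",
    "string": "utf8",
    "large_utf8": "largeutf8",
    "largestring": "largeutf8",
    "large_binary": "largebinary",
    "boolean": "bool",
}

CANONICAL = {
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64",
    "utf8", "largeutf8", "binary", "largebinary",
    "date32", "date64", "timestamp",
    "bool", "null",
}

_LIST = re.compile(r"list<(.+)>", re.IGNORECASE)


def normalize_arrow_type(name):
    """Lower-case a type name, resolve aliases, and recurse into list<T>."""
    text = name.strip()
    match = _LIST.fullmatch(text)
    if match:
        return f"list<{normalize_arrow_type(match.group(1))}>"
    lowered = text.lower()
    canonical = ALIASES.get(lowered, lowered)
    if canonical not in CANONICAL:
        raise ValueError(f"unknown Arrow type: {name!r}")
    return canonical
```

With this helper, `normalize_arrow_type("String")` and `normalize_arrow_type("UTF8")` both yield `utf8`, so two schemas that differ only in spelling compare equal.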

Write data to a table using the /rows endpoint. The write mode is specified as a query parameter, and the request body is a JSON array of row objects.

Insert rows

curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "order_id": "ORD-001",
      "customer_id": "CUST-100",
      "amount": 99.99,
      "status": "completed",
      "created_at": "2024-01-15T10:30:00Z"
    },
    {
      "order_id": "ORD-002",
      "customer_id": "CUST-101",
      "amount": 149.50,
      "status": "pending",
      "created_at": "2024-01-15T14:45:00Z"
    }
  ]'

| Parameter | Required | Description |
|-----------|----------|-------------|
| mode | Yes | Write operation mode (see below) |
| idempotency_key | No | Unique key for exactly-once write semantics |
| skip_validation | No | Skip schema validation for faster writes |

The mode query parameter controls how data is written:

| Mode | Description |
|------|-------------|
| append | Insert new rows without duplicate checking (fastest) |
| upsert | Insert new rows or update existing by primary key |
| overwrite | Replace all existing data in the table |
| delete | Delete rows by primary key |
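
Since the mode and the optional parameters all travel in the query string, a small URL builder can catch a misspelled mode before the request leaves the client. The following is a sketch under stated assumptions: the function name `rows_url` is hypothetical, and sending `skip_validation` as the literal string `true` is a guess at the expected encoding.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.catalyzed.ai"  # base URL taken from the curl examples
WRITE_MODES = {"append", "upsert", "overwrite", "delete"}


def rows_url(table_id, mode, idempotency_key=None, skip_validation=False):
    """Build the /rows URL for a write, rejecting modes not in the table above."""
    if mode not in WRITE_MODES:
        raise ValueError(f"mode must be one of {sorted(WRITE_MODES)}, got {mode!r}")
    params = {"mode": mode}
    if idempotency_key is not None:
        params["idempotency_key"] = idempotency_key
    if skip_validation:
        # Assumption: the flag is encoded as the string "true".
        params["skip_validation"] = "true"
    return f"{API_BASE}/dataset-tables/{quote(table_id, safe='')}/rows?{urlencode(params)}"


url = rows_url("Ednc5U676CO4hn-FqsXeA", "upsert", idempotency_key="batch-2024-01-15")
```

If a write is retried after a timeout, reusing the same `idempotency_key` value is presumably what lets the service recognize the retry as the same write.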

Update existing rows or insert new ones based on primary key:

curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=upsert" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[{"order_id": "ORD-001", "status": "shipped", "amount": 99.99}]'

Delete rows by primary key values:

curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=delete" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '["ORD-001", "ORD-002"]'

For tables with composite primary keys, pass objects:

[{"order_id": "ORD-001", "tenant_id": "T1"}, {"order_id": "ORD-002", "tenant_id": "T1"}]
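
The two body shapes (bare values for a single-column key, objects for a composite key) can be produced by one helper that is given the table's primary key columns. This is an illustrative sketch. The name `delete_body` is hypothetical, and it assumes you already hold the rows, or at least their key columns, as Python dicts.

```python
import json


def delete_body(rows, primary_key):
    """Build the JSON body for mode=delete from row dicts.

    Single-column key  -> JSON array of key values.
    Composite key      -> JSON array of objects holding only the key columns.
    """
    if not primary_key:
        raise ValueError("primary_key must name at least one column")
    if len(primary_key) == 1:
        column = primary_key[0]
        keys = [row[column] for row in rows]
    else:
        keys = [{column: row[column] for column in primary_key} for row in rows]
    return json.dumps(keys)


single = delete_body(
    [{"order_id": "ORD-001", "status": "shipped"}, {"order_id": "ORD-002"}],
    ["order_id"],
)
composite = delete_body(
    [{"order_id": "ORD-001", "tenant_id": "T1", "amount": 99.99}],
    ["order_id", "tenant_id"],
)
```

Non-key columns such as `status` and `amount` are dropped, so full rows can be passed in without trimming them first.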

For high-performance data ingestion, send data in Apache Arrow IPC format instead of JSON. This is ideal for:

  • Large batch inserts (millions of rows)
  • Direct integration with pandas, Polars, or DuckDB
  • Avoiding JSON serialization overhead

Write with Arrow IPC

# Arrow IPC data must be generated programmatically
curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/vnd.apache.arrow.stream" \
  --data-binary @data.arrow
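
One way to generate that binary payload is with the third-party pyarrow library, sketched below. The choice of pyarrow and both function names are assumptions made for illustration. The `application/vnd.apache.arrow.stream` media type denotes Arrow's IPC streaming format, so the sketch uses the stream writer and not the file writer. The pyarrow import is deferred so the request-building part runs without it installed.

```python
import urllib.request

API_BASE = "https://api.catalyzed.ai"  # base URL taken from the curl example
ARROW_STREAM = "application/vnd.apache.arrow.stream"


def table_to_ipc_stream(table):
    """Serialize a pyarrow.Table to Arrow IPC streaming-format bytes."""
    import pyarrow as pa  # third-party dependency, imported lazily

    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()


def build_arrow_append_request(token, table_id, ipc_bytes):
    """Return an unsent Request that mirrors the curl call, with a binary body."""
    return urllib.request.Request(
        f"{API_BASE}/dataset-tables/{table_id}/rows?mode=append",
        data=ipc_bytes,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": ARROW_STREAM,
        },
        method="POST",
    )
```

Writing the bytes returned by `table_to_ipc_stream` to a file named `data.arrow` would produce input for the curl command above. The Arrow schema of the table you serialize presumably has to match the destination table's schema, including the timestamp column's type.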

Use the /queries endpoint to query your tables with SQL. You can query a single table or join multiple tables together.

Query table data

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM orders WHERE status = '\''completed'\'' ORDER BY created_at DESC LIMIT 10",
    "tables": {
      "orders": "KzaMsfA0LSw_Ld0KyaXIS"
    }
  }'

The tables parameter maps table names used in your SQL to their table IDs. This works the same way whether you’re querying one table or joining multiple tables.

Example response:

{
  "queryId": "qry_abc123",
  "columns": [
    {"name": "order_id", "type": "Utf8"},
    {"name": "customer_id", "type": "Utf8"},
    {"name": "amount", "type": "Float64"},
    {"name": "status", "type": "Utf8"},
    {"name": "created_at", "type": "Timestamp(Microsecond, Some(\"UTC\"))"}
  ],
  "rows": [
    {"order_id": "ORD-001", "customer_id": "CUST-100", "amount": 99.99, "status": "completed", "created_at": "2024-01-15T10:30:00Z"},
    {"order_id": "ORD-002", "customer_id": "CUST-101", "amount": 149.50, "status": "completed", "created_at": "2024-01-15T09:15:00Z"}
  ],
  "rowCount": 2,
  "truncated": false,
  "stats": {
    "executionTimeMs": 42,
    "planningTimeMs": 5,
    "bytesScanned": 1024,
    "rowsScanned": 100
  }
}

The stats field is included only when the request body sets includeStats: true. A usage field with detailed I/O metrics is also returned for billing purposes.
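
To show how the response fields fit together, here is a short sketch that reads a response shaped like the sample above. The `summarize` function is hypothetical. It treats `stats` as optional, since that field only appears on request, and it assumes `truncated` means the result set was cut short.

```python
# A response shaped like the sample, already parsed from JSON into a dict.
response = {
    "queryId": "qry_abc123",
    "columns": [
        {"name": "order_id", "type": "Utf8"},
        {"name": "amount", "type": "Float64"},
    ],
    "rows": [
        {"order_id": "ORD-001", "amount": 99.99},
        {"order_id": "ORD-002", "amount": 149.50},
    ],
    "rowCount": 2,
    "truncated": False,
    "stats": {"executionTimeMs": 42, "planningTimeMs": 5},
}


def summarize(resp):
    """Pull the commonly needed pieces out of a query response."""
    return {
        "columns": [col["name"] for col in resp["columns"]],
        "row_count": resp["rowCount"],
        "total_amount": round(sum(row["amount"] for row in resp["rows"]), 2),
        # Assumption: truncated=True means more rows matched than were returned.
        "complete": not resp["truncated"],
        # stats is absent unless requested, so fall back to None.
        "execution_ms": resp.get("stats", {}).get("executionTimeMs"),
    }


summary = summarize(response)
```

Checking `complete` before aggregating matters: a total computed over a truncated result would silently undercount.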

Query across multiple tables by including them in the tables mapping:

Join multiple tables

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT o.order_id, c.name, o.amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id",
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    }
  }'
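
Building the query body in code avoids the shell quoting needed for SQL string literals and makes the name-to-ID mapping explicit. The sketch below is illustrative. `build_query_payload` is a hypothetical helper, and its check that every mapped name appears in the SQL is a simple word match done on the client, not something the API is documented to enforce.

```python
import json
import re


def build_query_payload(sql, tables, include_stats=False):
    """Return the JSON body for POST /queries as a string."""
    for name in tables:
        # Heuristic: flag mappings whose table name never occurs in the SQL.
        if not re.search(rf"\b{re.escape(name)}\b", sql):
            raise ValueError(f"table {name!r} is mapped but not used in the SQL")
    payload = {"sql": sql, "tables": dict(tables)}
    if include_stats:
        payload["includeStats"] = True
    return json.dumps(payload)


body = build_query_payload(
    "SELECT o.order_id, c.name, o.amount "
    "FROM orders o JOIN customers c ON o.customer_id = c.customer_id",
    {"orders": "Ednc5U676CO4hn-FqsXeA", "customers": "6fTBbbj4uv8TVMVh0gVch"},
)
```

The same helper covers single-table queries: pass a mapping with one entry.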

See the Querying Data guide for more SQL examples and best practices.

Get table schema

curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema \
  -H "Authorization: Bearer $API_TOKEN"

Tables track schema versions. Each modification creates a new version:

curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema/versions \
  -H "Authorization: Bearer $API_TOKEN"

See the Schema Management guide for migration details.

Indexes speed up queries that filter on the indexed column.

Create an index

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_customer_id",
    "columnName": "customer_id",
    "indexType": "btree"
  }'

| Type | Use Case |
|------|----------|
| btree | Equality and range queries on scalar columns |
| ivf_pq | Vector similarity search (ANN) |
| ivf_hnsw_pq | High-recall vector search |
| ivf_hnsw_sq | Memory-efficient vector search |
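
A small builder for the index request body can reject an index type that is not in the table above before any call is made. This is a sketch only. The function name is hypothetical, and the `idx_<column>` default simply copies the naming seen in the curl example; it is not a documented convention.

```python
import json

SCALAR_INDEX_TYPES = {"btree"}
VECTOR_INDEX_TYPES = {"ivf_pq", "ivf_hnsw_pq", "ivf_hnsw_sq"}


def build_index_payload(column_name, index_type="btree", index_name=None):
    """Return the JSON body for creating an index, as a string."""
    if index_type not in SCALAR_INDEX_TYPES | VECTOR_INDEX_TYPES:
        raise ValueError(f"unsupported index type: {index_type!r}")
    return json.dumps({
        # Default name mirrors the example; any naming scheme is an assumption.
        "indexName": index_name or f"idx_{column_name}",
        "columnName": column_name,
        "indexType": index_type,
    })


payload = build_index_payload("customer_id")
```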

List the indexes on a table:

curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes \
  -H "Authorization: Bearer $API_TOKEN"

Delete an index by name:

curl -X DELETE https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes/idx_customer_id \
  -H "Authorization: Bearer $API_TOKEN"

Preview how a query will execute using the /queries/explain endpoint:

curl -X POST https://api.catalyzed.ai/queries/explain \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM orders WHERE customer_id = '\''CUST-100'\''",
    "tables": {"orders": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'

Optimize storage by merging small files:

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/compact \
  -H "Authorization: Bearer $API_TOKEN"

Update table statistics for query optimization:

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/statistics \
  -H "Authorization: Bearer $API_TOKEN"

Delete a table:

curl -X DELETE https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA \
  -H "Authorization: Bearer $API_TOKEN"

See the Dataset Tables API for complete endpoint documentation.