
Tables

Tables are where your data lives. Each table has a defined schema and supports SQL queries, indexes, and schema evolution.

Tables are created within a dataset:

Create a table

curl -X POST https://api.catalyzed.ai/dataset-tables \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetId": "HoIEJNIPiQIy6TjVRxjwz",
    "tableName": "orders",
    "description": "Customer orders",
    "fields": [
      {"name": "order_id", "arrowType": "utf8", "nullable": false},
      {"name": "customer_id", "arrowType": "utf8", "nullable": false},
      {"name": "amount", "arrowType": "float64", "nullable": false},
      {"name": "status", "arrowType": "utf8", "nullable": false},
      {"name": "created_at", "arrowType": "timestamp", "nullable": false}
    ],
    "primaryKeyColumns": ["order_id"]
  }'
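
When the schema is generated by a script, the same request body can be assembled in code. The following is a minimal sketch, not part of the Catalyzed tooling: the helper name build_create_table_body is hypothetical, the JSON keys are copied from the curl example above, and every column is marked non-nullable as in that example. It only builds the payload and does not send it.

```python
import json


def build_create_table_body(dataset_id, table_name, columns, primary_key, description=""):
    """Return the JSON body for POST /dataset-tables as a dict.

    `columns` is a list of (name, arrow_type) pairs. Hypothetical helper:
    the keys mirror the curl example, nothing else is assumed about the API.
    """
    names = [name for name, _ in columns]
    missing = [col for col in primary_key if col not in names]
    if missing:
        # Catch a primary key that refers to a column the schema lacks.
        raise ValueError(f"primary key columns not in schema: {missing}")
    return {
        "datasetId": dataset_id,
        "tableName": table_name,
        "description": description,
        "fields": [
            {"name": name, "arrowType": arrow_type, "nullable": False}
            for name, arrow_type in columns
        ],
        "primaryKeyColumns": list(primary_key),
    }


body = build_create_table_body(
    "HoIEJNIPiQIy6TjVRxjwz",
    "orders",
    [("order_id", "utf8"), ("amount", "float64")],
    primary_key=["order_id"],
    description="Customer orders",
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d or sent with any HTTP client.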

Catalyzed uses Apache Arrow data types. Type names are case-insensitive. Here’s a quick summary:

| Category | Types |
| --- | --- |
| Integer | int8, int16, int32, int64, uint8, uint16, uint32, uint64 |
| Floating point | float16, float32 (float), float64 (double) |
| String | utf8 (string), largeutf8 |
| Binary | binary, largebinary |
| Boolean | bool (boolean) |
| Date/Time | date32, date64, timestamp, timestamp[s], timestamp[ms], timestamp[us], timestamp[ns] |
| Other | null, list<T> |

For the complete reference with value ranges, type coercion rules, aliases, and usage guidance, see the Data Types page.
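
If client code compares or stores type names, it can help to normalize them first. The sketch below is a client-side convenience only. It assumes that the names in parentheses in the summary above are aliases for the name that precedes them; the helper normalize_arrow_type is hypothetical, and the Data Types page remains the authority on aliases and coercion.

```python
# Aliases taken from the summary table: alias -> name listed before it.
# Assumption: the parenthesized names in that table are aliases.
ALIASES = {
    "float": "float32",
    "double": "float64",
    "string": "utf8",
    "boolean": "bool",
}


def normalize_arrow_type(name: str) -> str:
    """Lowercase a type name and replace a known alias with its listed name."""
    lowered = name.strip().lower()
    return ALIASES.get(lowered, lowered)


print(normalize_arrow_type("Double"))
print(normalize_arrow_type("Timestamp[ms]"))
```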

Write data to a table using the /rows endpoint. The write mode is specified as a query parameter, and the request body is a JSON array of row objects.

Insert rows

curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "order_id": "ORD-001",
      "customer_id": "CUST-100",
      "amount": 99.99,
      "status": "completed",
      "created_at": "2024-01-15T10:30:00Z"
    },
    {
      "order_id": "ORD-002",
      "customer_id": "CUST-101",
      "amount": 149.50,
      "status": "pending",
      "created_at": "2024-01-15T14:45:00Z"
    }
  ]'

| Parameter | Required | Description |
| --- | --- | --- |
| mode | Yes | Write operation mode (see below) |
| idempotency_key | No | Unique key for exactly-once write semantics |
| skip_validation | No | Skip schema validation for faster writes |

The mode query parameter controls how data is written:

| Mode | Description |
| --- | --- |
| append | Insert new rows without duplicate checking (fastest) |
| upsert | Insert new rows or update existing rows by primary key |
| overwrite | Replace all existing data in the table |
| delete | Delete rows by primary key |
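
Because the mode and the optional parameters all travel in the query string, it can be convenient to build the URL in one place. This is a minimal sketch under stated assumptions: the helper rows_url is hypothetical, and encoding skip_validation as the string "true" is a guess, since the accepted boolean format is not stated here.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.catalyzed.ai"
WRITE_MODES = {"append", "upsert", "overwrite", "delete"}


def rows_url(table_id, mode, idempotency_key=None, skip_validation=False):
    """Build the /rows URL for a write. Hypothetical helper, not an official client."""
    if mode not in WRITE_MODES:
        raise ValueError(f"unknown write mode: {mode!r}")
    params = {"mode": mode}
    if idempotency_key is not None:
        params["idempotency_key"] = idempotency_key
    if skip_validation:
        # Assumption: the API accepts "true" for boolean query parameters.
        params["skip_validation"] = "true"
    return f"{BASE_URL}/dataset-tables/{quote(table_id, safe='')}/rows?{urlencode(params)}"


print(rows_url("Ednc5U676CO4hn-FqsXeA", "append", idempotency_key="batch-42"))
```

Rejecting unknown modes on the client avoids a round trip for a simple typo such as "upsrt".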

Update existing rows or insert new ones based on primary key:

curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=upsert" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[{"order_id": "ORD-001", "status": "shipped", "amount": 99.99}]'

Delete rows by primary key values:

curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=delete" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '["ORD-001", "ORD-002"]'

For tables with composite primary keys, pass objects:

[{"order_id": "ORD-001", "tenant_id": "T1"}, {"order_id": "ORD-002", "tenant_id": "T1"}]
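
The two body shapes for mode=delete (bare values for a single-column key, objects for a composite key) can be derived from full row dictionaries. The helper below is a hypothetical sketch that follows the two examples above and assumes nothing else about the API.

```python
def delete_keys(rows, primary_key_columns):
    """Build the request body for mode=delete from full row dicts.

    Single-column keys become a list of bare values; composite keys
    become a list of objects holding only the key columns.
    """
    if len(primary_key_columns) == 1:
        (column,) = primary_key_columns
        return [row[column] for row in rows]
    return [{column: row[column] for column in primary_key_columns} for row in rows]


rows = [
    {"order_id": "ORD-001", "tenant_id": "T1", "status": "completed"},
    {"order_id": "ORD-002", "tenant_id": "T1", "status": "pending"},
]
print(delete_keys(rows, ["order_id"]))
print(delete_keys(rows, ["order_id", "tenant_id"]))
```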

For high-performance data ingestion, send data in Apache Arrow IPC format instead of JSON. This is ideal for:

  • Large batch inserts (millions of rows)
  • Direct integration with pandas, Polars, or DuckDB
  • Avoiding JSON serialization overhead

Write with Arrow IPC

# Arrow IPC data must be generated programmatically
curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/vnd.apache.arrow.stream" \
  --data-binary @data.arrow

Use the /queries endpoint to query your tables with SQL. You can query a single table or join multiple tables together.

Query table data

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM orders WHERE status = '\''completed'\'' ORDER BY created_at DESC LIMIT 10",
    "tables": {
      "orders": "KzaMsfA0LSw_Ld0KyaXIS"
    }
  }'

The tables parameter maps table names used in your SQL to their table IDs. This works the same way whether you’re querying one table or joining multiple tables.
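
Building the request in code sidesteps the shell quoting needed for SQL string literals in the curl example. The sketch below uses only the Python standard library. The function name build_query_request is hypothetical, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_query_request(sql, tables, api_token):
    """Return a urllib Request for POST /queries (constructed, not sent)."""
    body = json.dumps({"sql": sql, "tables": tables}).encode("utf-8")
    return urllib.request.Request(
        "https://api.catalyzed.ai/queries",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


request = build_query_request(
    "SELECT * FROM orders WHERE status = 'completed' ORDER BY created_at DESC LIMIT 10",
    {"orders": "KzaMsfA0LSw_Ld0KyaXIS"},
    api_token="example-token",  # placeholder, not a real token
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; json.dumps takes care of escaping the quotes inside the SQL string.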

{
  "queryId": "qry_abc123",
  "columns": [
    {"name": "order_id", "type": "Utf8"},
    {"name": "customer_id", "type": "Utf8"},
    {"name": "amount", "type": "Float64"},
    {"name": "status", "type": "Utf8"},
    {"name": "created_at", "type": "Timestamp(Microsecond, Some(\"UTC\"))"}
  ],
  "rows": [
    {"order_id": "ORD-001", "customer_id": "CUST-100", "amount": 99.99, "status": "completed", "created_at": "2024-01-15T10:30:00Z"},
    {"order_id": "ORD-002", "customer_id": "CUST-101", "amount": 149.50, "status": "completed", "created_at": "2024-01-15T09:15:00Z"}
  ],
  "rowCount": 2,
  "truncated": false,
  "stats": {
    "executionTimeMs": 42,
    "planningTimeMs": 5,
    "bytesScanned": 1024,
    "rowsScanned": 100
  }
}

The stats field is included when includeStats: true is passed in the request. A usage field with detailed I/O metrics is also returned for billing purposes.
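
A response of this shape can be turned into ordered tuples using the columns list, as in the sketch below. The helper extract_rows is hypothetical, the sample is a trimmed copy of the response above, and reading truncated as "the row list was cut short" is an assumption about that flag.

```python
def extract_rows(response):
    """Return (column_names, rows, truncated) from a query response dict.

    Each row is a tuple ordered like the `columns` list.
    """
    names = [column["name"] for column in response["columns"]]
    rows = [tuple(row.get(name) for name in names) for row in response["rows"]]
    # Assumption: `truncated` signals that not all matching rows were returned.
    return names, rows, bool(response.get("truncated", False))


sample = {
    "columns": [
        {"name": "order_id", "type": "Utf8"},
        {"name": "amount", "type": "Float64"},
    ],
    "rows": [
        {"order_id": "ORD-001", "amount": 99.99},
        {"order_id": "ORD-002", "amount": 149.50},
    ],
    "rowCount": 2,
    "truncated": False,
}

names, rows, truncated = extract_rows(sample)
print(names, rows, truncated)
```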

Query across multiple tables by including them in the tables mapping:

Join multiple tables

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT o.order_id, c.name, o.amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id",
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    }
  }'

See the Querying Data guide for more SQL examples and best practices.

Get table schema

curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema \
  -H "Authorization: Bearer $API_TOKEN"

Tables track schema versions. Each modification creates a new version:

curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema/versions \
  -H "Authorization: Bearer $API_TOKEN"

See the Schema Management guide for migration details.

Every table has an internal _rowid system column: a stable, unique identifier that the storage engine assigns to each row. By default this column is excluded from SELECT * results and from the schema endpoints, which keeps query results focused on your data.

When you need row-level tracking (e.g., lineage, citations, or deduplication), you can opt in by setting includeRowId: true on the table binding. See the Querying Data guide for usage details.

Indexes speed up queries that filter on the indexed column.

Create an index

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_customer_id",
    "columnName": "customer_id",
    "indexType": "btree"
  }'

| Type | Use case |
| --- | --- |
| btree | Equality and range queries on scalar columns |
| ivf_pq | Vector similarity search (ANN) |
| ivf_hnsw_pq | High-recall vector search |
| ivf_hnsw_sq | Memory-efficient vector search |
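
A small guard on the index type can catch typos before the request is made. The sketch below is hypothetical: build_index_body is not an official helper, the accepted types are taken from the list above, and the idx_ default name simply follows the naming used in the create example.

```python
INDEX_TYPES = {"btree", "ivf_pq", "ivf_hnsw_pq", "ivf_hnsw_sq"}


def build_index_body(column_name, index_type="btree", index_name=None):
    """Return the JSON body for creating an index, as a dict."""
    if index_type not in INDEX_TYPES:
        raise ValueError(f"unsupported index type: {index_type!r}")
    return {
        # Default name follows the idx_<column> pattern from the example above.
        "indexName": index_name or f"idx_{column_name}",
        "columnName": column_name,
        "indexType": index_type,
    }


print(build_index_body("customer_id"))
```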

List the indexes on a table:

curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes \
  -H "Authorization: Bearer $API_TOKEN"

Delete an index by name:

curl -X DELETE https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes/idx_customer_id \
  -H "Authorization: Bearer $API_TOKEN"

Preview how a query will execute using the /queries/explain endpoint:

curl -X POST https://api.catalyzed.ai/queries/explain \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM orders WHERE customer_id = '\''CUST-100'\''",
    "tables": {"orders": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'

Optimize storage by merging small files:

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/compact \
  -H "Authorization: Bearer $API_TOKEN"

Update table statistics for query optimization:

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/statistics \
  -H "Authorization: Bearer $API_TOKEN"

Delete a table:

curl -X DELETE https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA \
  -H "Authorization: Bearer $API_TOKEN"

See the Dataset Tables API for complete endpoint documentation.