
Managing Tables

Tables hold your data within a dataset. Each table has a defined schema, supports multiple ingestion formats, and provides SQL querying. This guide covers the full table lifecycle.

A table requires a parent dataset, a name, and at least one field:

Create a table

curl -X POST https://api.catalyzed.ai/dataset-tables \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetId": "HoIEJNIPiQIy6TjVRxjwz",
    "tableName": "orders",
    "description": "Customer orders",
    "fields": [
      {"name": "order_id", "arrowType": "utf8", "nullable": false},
      {"name": "customer_id", "arrowType": "utf8", "nullable": false},
      {"name": "amount", "arrowType": "float64", "nullable": false},
      {"name": "status", "arrowType": "utf8", "nullable": false},
      {"name": "created_at", "arrowType": "timestamp", "nullable": false}
    ],
    "primaryKeyColumns": ["order_id"]
  }'

Response:

{
  "tableId": "Ednc5U676CO4hn-FqsXeA",
  "datasetId": "HoIEJNIPiQIy6TjVRxjwz",
  "tableName": "orders",
  "description": "Customer orders",
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:30:00Z"
}
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| datasetId | string | Yes | Parent dataset ID |
| tableName | string | Yes | Table name (1-255 characters, unique per dataset) |
| description | string | No | Optional description |
| fields | array | Yes | Column definitions (at least 1). See Data Types |
| primaryKeyColumns | string[] | No | Columns forming the primary key |
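It can be useful to validate these constraints client-side before sending the request. A minimal Python sketch (the helper name and checks are illustrative, not part of any official SDK):

```python
def build_create_table_payload(dataset_id, table_name, fields,
                               description=None, primary_key=None):
    """Build the create-table request body, enforcing the documented constraints."""
    if not 1 <= len(table_name) <= 255:
        raise ValueError("tableName must be 1-255 characters")
    if not fields:
        raise ValueError("at least one field is required")
    field_names = {f["name"] for f in fields}
    for col in primary_key or []:
        if col not in field_names:
            raise ValueError(f"primary key column {col!r} is not a field")
    payload = {"datasetId": dataset_id, "tableName": table_name, "fields": fields}
    if description is not None:
        payload["description"] = description
    if primary_key:
        payload["primaryKeyColumns"] = primary_key
    return payload
```

Catching a bad primary key or an empty field list locally avoids a round trip that would fail with a 400.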

Choosing data types

See the Data Types reference for the full list. Common patterns:

| Use Case | Recommended Type | Why |
| --- | --- | --- |
| IDs, names, labels | utf8 | Flexible, no size constraints |
| Counts, quantities | int64 | Wide range, no overflow risk |
| Prices, measurements | float64 | Double precision avoids rounding |
| Yes/no flags | bool | Compact, queryable |
| Dates and times | timestamp | Microsecond default, supports formatting |
| Tags, categories | list<utf8> | Variable-length arrays |
| Embeddings | list<float32> | Efficient for vector search |
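Putting these patterns together, a product table mixing scalar and list columns might declare its fields like this (the column names are illustrative; verify the exact list-type spellings against the Data Types reference):

```python
# A "fields" array mixing the common patterns above, including a
# list<float32> column suitable for embeddings.
fields = [
    {"name": "product_id", "arrowType": "utf8", "nullable": False},
    {"name": "quantity", "arrowType": "int64", "nullable": True},
    {"name": "price", "arrowType": "float64", "nullable": True},
    {"name": "in_stock", "arrowType": "bool", "nullable": True},
    {"name": "updated_at", "arrowType": "timestamp", "nullable": True},
    {"name": "tags", "arrowType": "list<utf8>", "nullable": True},
    {"name": "embedding", "arrowType": "list<float32>", "nullable": True},
]
```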

Primary keys

Primary keys are required for the upsert and delete write modes. Choose based on your data:

Single column — when one field uniquely identifies a row:

"primaryKeyColumns": ["order_id"]

Composite key — when uniqueness requires multiple columns:

"primaryKeyColumns": ["tenant_id", "order_id"]

List tables in a dataset

curl "https://api.catalyzed.ai/dataset-tables?datasetIds=HoIEJNIPiQIy6TjVRxjwz&orderBy=tableName&orderDirection=asc" \
-H "Authorization: Bearer $API_TOKEN"
| Parameter | Type | Description |
| --- | --- | --- |
| datasetTableIds | string | Comma-separated table IDs |
| datasetIds | string | Comma-separated dataset IDs |
| tableName | string | Partial match on table name |
| managed | boolean | Filter by managed status |
| page | number | Page number, 1-indexed (default: 1) |
| pageSize | number | Results per page, 1-100 (default: 20) |
| orderBy | string | Sort by: createdAt, tableName, updatedAt |
| orderDirection | string | asc or desc |
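These parameters compose into a query string in the usual way. A small Python sketch building the list URL (the helper is hypothetical; any HTTP client works):

```python
from urllib.parse import urlencode

def list_tables_url(dataset_ids, page=1, page_size=20,
                    order_by="tableName", order_direction="asc"):
    """Build the list-tables URL; comma-separated ID lists are joined client-side."""
    params = {
        "datasetIds": ",".join(dataset_ids),
        "page": page,
        "pageSize": page_size,
        "orderBy": order_by,
        "orderDirection": order_direction,
    }
    return "https://api.catalyzed.ai/dataset-tables?" + urlencode(params)
```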

Update a table’s name or description:

Update table

curl -X PUT https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tableName": "customer_orders",
    "description": "Customer orders with shipping info"
  }'

Ingestion formats

Tables accept data in four formats. Choose based on your use case:

| Format | Content-Type | Best For |
| --- | --- | --- |
| JSON | application/json | Simple integrations, small batches (under 1,000 rows) |
| CSV | text/csv | Spreadsheet exports, human-readable data |
| Parquet | application/parquet | Large datasets, columnar analytics tools |
| Arrow IPC | application/vnd.apache.arrow.stream | Maximum performance, typed data pipelines |

CSV and Parquet data is automatically coerced to match the table schema. Column name matching is case-insensitive for CSV.

See Ingesting Data for detailed examples of each format and write mode.
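In client code, the format choice boils down to picking the right Content-Type header. A sketch of that mapping (the helper is illustrative, not an official SDK function):

```python
# Content-Type header for each supported ingestion format,
# taken from the formats table above.
CONTENT_TYPES = {
    "json": "application/json",
    "csv": "text/csv",
    "parquet": "application/parquet",
    "arrow": "application/vnd.apache.arrow.stream",
}

def ingest_headers(fmt, token):
    """Headers for an ingest request in the given format."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": CONTENT_TYPES[fmt],
    }
```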

Table statistics

Get row counts and activity metrics for a table:

curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/stats \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "tableId": "Ednc5U676CO4hn-FqsXeA",
  "rowCount": 15000,
  "totalQueries": 142,
  "totalIngests": 28,
  "lastQueryAt": "2025-01-15T14:30:00Z",
  "lastIngestAt": "2025-01-15T06:00:00Z"
}
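One practical use of this response is a freshness check, e.g. alerting when a table has not been ingested recently. A hedged Python sketch (the helper name is hypothetical):

```python
from datetime import datetime, timezone

def hours_since_last_ingest(stats, now=None):
    """Hours elapsed since the table's lastIngestAt timestamp."""
    last = datetime.fromisoformat(stats["lastIngestAt"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - last).total_seconds() / 3600
```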

Usage time series

Track query and ingestion activity over time:

curl "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/usage/timeseries?granularity=day&startDate=2025-01-01&endDate=2025-01-31" \
-H "Authorization: Bearer $API_TOKEN"

Compaction

After many write operations, a table may accumulate small storage fragments. Compaction merges these fragments to improve query performance:

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/compact \
-H "Authorization: Bearer $API_TOKEN"

Run compaction periodically on tables with frequent appends or upserts.

Optimize statistics

Update the internal statistics used by the query optimizer. Run this after large data loads:

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/statistics \
-H "Authorization: Bearer $API_TOKEN"

Indexes

Indexes speed up queries on frequently filtered columns.

Create a btree index

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_customer_id",
    "columnName": "customer_id",
    "indexType": "btree"
  }'

Index creation returns 202 Accepted — the index is built asynchronously.

| Type | Use Case |
| --- | --- |
| btree | Equality and range queries on scalar columns |
| ivf_pq | Vector similarity search (ANN) |
| ivf_flat | Exact vector search (slower, higher recall) |
List the indexes on a table:
curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes \
-H "Authorization: Bearer $API_TOKEN"

Response includes index status (pending, built, failed), creation time, and error messages if applicable.
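Because index builds are asynchronous, clients typically poll the index listing until the status leaves pending. A minimal sketch, where fetch_status is a placeholder for your own HTTP call that returns the status string:

```python
import time

def wait_for_index(fetch_status, poll_seconds=5, timeout_seconds=600):
    """Poll until the index status is "built" or "failed", or time out."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("built", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("index build did not finish within the timeout")
```

Checking for "failed" as well as "built" matters: a failed build would otherwise spin the loop until the timeout.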

Delete an index:
curl -X DELETE https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes/idx_customer_id \
-H "Authorization: Bearer $API_TOKEN"

You can also manage indexes through schema migrations.

Delete table

curl -X DELETE https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA \
-H "Authorization: Bearer $API_TOKEN"

Returns 204 No Content on success.

Errors

Common errors when working with tables:

| Error Code | Status | Cause |
| --- | --- | --- |
| DATASET_NOT_FOUND | 404 | Parent dataset doesn't exist |
| TABLE_NOT_FOUND | 404 | Table ID doesn't exist |
| TABLE_NAME_ALREADY_EXISTS | 409 | Table name already used in this dataset |
| MISSING_COLUMNS | 400 | Ingested data is missing required columns |
| COERCION_FAILED | 400 | Data values can't be converted to schema types |
| FILE_TOO_LARGE | 413 | Request body exceeds 100MB limit |
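All of these are client errors, so retrying the same request will not help; the fix is to correct the request. A hedged sketch of a client-side error dispatcher (the response body shape, an errorCode field, is an assumption; adapt to the actual error payload):

```python
def explain_error(error_code):
    """Suggest a client-side fix for a documented error code."""
    messages = {
        "DATASET_NOT_FOUND": "check the datasetId",
        "TABLE_NOT_FOUND": "check the table ID",
        "TABLE_NAME_ALREADY_EXISTS": "pick a different tableName for this dataset",
        "MISSING_COLUMNS": "include every required column in the ingested data",
        "COERCION_FAILED": "fix values that do not match the schema types",
        "FILE_TOO_LARGE": "split the upload into batches under 100MB",
    }
    return messages.get(error_code, "see the API error reference")
```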
Best practices

  1. Define primary keys upfront if you plan to use upsert or delete modes
  2. Use utf8 for IDs — avoids integer overflow issues with large identifiers
  3. Make columns nullable by default — easier to evolve the schema later
  4. Run compaction after bulk data loads to optimize query performance
  5. Name tables descriptively: customer_orders is better than tbl1
  6. Keep related tables in one dataset — enables cross-table joins and unified management