Tables

Tables are where your data lives. Each table has a defined schema and supports SQL queries, indexes, and schema evolution.

Creating a Table

Tables are created within a dataset:

Create a table

curl -X POST https://api.catalyzed.ai/dataset-tables \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetId": "HoIEJNIPiQIy6TjVRxjwz",
    "tableName": "orders",
    "description": "Customer orders",
    "fields": [
      {"name": "order_id", "arrowType": "utf8", "nullable": false},
      {"name": "customer_id", "arrowType": "utf8", "nullable": false},
      {"name": "amount", "arrowType": "float64", "nullable": false},
      {"name": "status", "arrowType": "utf8", "nullable": false},
      {"name": "created_at", "arrowType": "timestamp", "nullable": false}
    ],
    "primaryKeyColumns": ["order_id"]
  }'

const response = await fetch("https://api.catalyzed.ai/dataset-tables", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    datasetId: "HoIEJNIPiQIy6TjVRxjwz",
    tableName: "orders",
    description: "Customer orders",
    fields: [
      { name: "order_id", arrowType: "utf8", nullable: false },
      { name: "customer_id", arrowType: "utf8", nullable: false },
      { name: "amount", arrowType: "float64", nullable: false },
      { name: "status", arrowType: "utf8", nullable: false },
      { name: "created_at", arrowType: "timestamp", nullable: false },
    ],
    primaryKeyColumns: ["order_id"],
  }),
});

import requests

response = requests.post(
    "https://api.catalyzed.ai/dataset-tables",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "datasetId": "HoIEJNIPiQIy6TjVRxjwz",
        "tableName": "orders",
        "description": "Customer orders",
        "fields": [
            {"name": "order_id", "arrowType": "utf8", "nullable": False},
            {"name": "customer_id", "arrowType": "utf8", "nullable": False},
            {"name": "amount", "arrowType": "float64", "nullable": False},
            {"name": "status", "arrowType": "utf8", "nullable": False},
            {"name": "created_at", "arrowType": "timestamp", "nullable": False}
        ],
        "primaryKeyColumns": ["order_id"]
    }
)
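Before sending a create request, it can help to check the table definition locally. The following is a minimal sketch, not part of the Catalyzed API: `build_table_request` is a hypothetical helper that assembles the same body as the examples above and verifies that field names are unique and that every primary key column is declared in `fields`. Whether the server enforces the same rules is an assumption.

```python
def build_table_request(dataset_id, table_name, fields, primary_key_columns, description=""):
    """Assemble a create-table body and run basic client-side checks.

    Hypothetical helper: the checks mirror what the examples imply,
    not a documented server-side validation contract.
    """
    names = [field["name"] for field in fields]

    # Field names should be unique within a table definition.
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicate field names: {duplicates}")

    # Every primary key column must refer to a declared field.
    missing = [col for col in primary_key_columns if col not in names]
    if missing:
        raise ValueError(f"primary key columns not in fields: {missing}")

    return {
        "datasetId": dataset_id,
        "tableName": table_name,
        "description": description,
        "fields": list(fields),
        "primaryKeyColumns": list(primary_key_columns),
    }
```

The returned dictionary can be passed as the `json=` argument of the `requests.post` call shown above.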

Supported Data Types

Catalyzed uses Apache Arrow data types. Type names are case-insensitive. Here’s a quick summary:

Category	Types
Integer	`int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`
Floating Point	`float16`, `float32` (`float`), `float64` (`double`)
String	`utf8` (`string`), `largeutf8`
Binary	`binary`, `largebinary`
Boolean	`bool` (`boolean`)
Date/Time	`date32`, `date64`, `timestamp`, `timestamp[s]`, `timestamp[ms]`, `timestamp[us]`, `timestamp[ns]`
Other	`null`, `list<T>`

For the complete reference with value ranges, type coercion rules, aliases, and usage guidance, see the Data Types page.
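Because type names are case-insensitive and several have aliases, it can be useful to normalize them on the client before comparing schemas. The sketch below is illustrative only: `normalize_arrow_type` is a hypothetical helper, its alias map is taken from the summary table above, and its handling of nested `list<T>` spellings is an assumption rather than documented behavior.

```python
# Aliases listed in the summary table above, mapped to their canonical names.
ALIASES = {"float": "float32", "double": "float64", "string": "utf8", "boolean": "bool"}

CANONICAL_TYPES = {
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64",
    "utf8", "largeutf8", "binary", "largebinary", "bool",
    "date32", "date64",
    "timestamp", "timestamp[s]", "timestamp[ms]", "timestamp[us]", "timestamp[ns]",
    "null",
}


def normalize_arrow_type(name):
    """Return the lowercase canonical spelling of a type name from the table."""
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    if key in CANONICAL_TYPES:
        return key
    # Assumption: list types are spelled list<T>, with T normalized recursively.
    if key.startswith("list<") and key.endswith(">"):
        return f"list<{normalize_arrow_type(key[5:-1])}>"
    raise ValueError(f"unknown Arrow type: {name!r}")
```

For example, `normalize_arrow_type("Double")` and `normalize_arrow_type("float64")` both yield `float64`, so two schemas that differ only in spelling compare equal.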

Writing Data

Write Rows

Write data to a table by sending a POST request to its /rows endpoint. The write mode is passed as the mode query parameter, and the request body is a JSON array of row objects.

Insert rows

curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "order_id": "ORD-001",
      "customer_id": "CUST-100",
      "amount": 99.99,
      "status": "completed",
      "created_at": "2024-01-15T10:30:00Z"
    },
    {
      "order_id": "ORD-002",
      "customer_id": "CUST-101",
      "amount": 149.50,
      "status": "pending",
      "created_at": "2024-01-15T14:45:00Z"
    }
  ]'

await fetch(
  "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify([
      {
        order_id: "ORD-001",
        customer_id: "CUST-100",
        amount: 99.99,
        status: "completed",
        created_at: "2024-01-15T10:30:00Z",
      },
      {
        order_id: "ORD-002",
        customer_id: "CUST-101",
        amount: 149.50,
        status: "pending",
        created_at: "2024-01-15T14:45:00Z",
      },
    ]),
  }
);

requests.post(
    "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append",
    headers={"Authorization": f"Bearer {api_token}"},
    json=[
        {
            "order_id": "ORD-001",
            "customer_id": "CUST-100",
            "amount": 99.99,
            "status": "completed",
            "created_at": "2024-01-15T10:30:00Z"
        },
        {
            "order_id": "ORD-002",
            "customer_id": "CUST-101",
            "amount": 149.50,
            "status": "pending",
            "created_at": "2024-01-15T14:45:00Z"
        }
    ]
)

Query Parameters

Parameter	Required	Description
`mode`	Yes	Write operation mode (see below)
`idempotency_key`	No	Unique key for exactly-once write semantics
`skip_validation`	No	Skip schema validation for faster writes

Write Modes

The mode query parameter controls how data is written:

Mode	Description
`append`	Insert new rows without duplicate checking (fastest)
`upsert`	Insert new rows or update existing by primary key
`overwrite`	Replace all existing data in the table
`delete`	Delete rows by primary key
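The query parameters and write modes above can be combined into a small URL builder. This is a sketch under stated assumptions: `rows_url` and `new_idempotency_key` are hypothetical helpers, and sending `skip_validation` as the string `true` is an assumption about how the API expects boolean flags.

```python
import uuid
from urllib.parse import urlencode

BASE_URL = "https://api.catalyzed.ai"
WRITE_MODES = {"append", "upsert", "overwrite", "delete"}


def new_idempotency_key():
    """Generate a unique key; reuse the same key when retrying the same batch."""
    return str(uuid.uuid4())


def rows_url(table_id, mode, idempotency_key=None, skip_validation=False):
    """Build the /rows URL for a write, rejecting unknown modes early."""
    if mode not in WRITE_MODES:
        raise ValueError(f"unknown write mode: {mode!r}")
    params = {"mode": mode}
    if idempotency_key is not None:
        params["idempotency_key"] = idempotency_key
    if skip_validation:
        # Assumption: the flag is passed as the literal string "true".
        params["skip_validation"] = "true"
    return f"{BASE_URL}/dataset-tables/{table_id}/rows?{urlencode(params)}"
```

A typical pattern would be to generate one key per logical batch and pass the same key on every retry of that batch, so that a retried request is not applied twice.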

Upsert Example

Update existing rows or insert new ones based on primary key:

curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=upsert" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[{"order_id": "ORD-001", "status": "shipped", "amount": 99.99}]'

Delete Example

Delete rows by primary key values:

curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=delete" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '["ORD-001", "ORD-002"]'

For tables with composite primary keys, pass objects:

[{"order_id": "ORD-001", "tenant_id": "T1"}, {"order_id": "ORD-002", "tenant_id": "T1"}]
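The two delete body shapes can be produced from the same row data. The sketch below uses a hypothetical helper, `delete_payload`, which emits bare key values for a single-column primary key and key objects for a composite one, matching the two examples above.

```python
def delete_payload(rows, primary_key_columns):
    """Build the request body for mode=delete from full or partial rows.

    Single-column keys become a list of scalar values; composite keys
    become a list of objects holding only the key columns.
    """
    if not primary_key_columns:
        raise ValueError("at least one primary key column is required")
    if len(primary_key_columns) == 1:
        key = primary_key_columns[0]
        return [row[key] for row in rows]
    return [{col: row[col] for col in primary_key_columns} for row in rows]
```

Non-key columns in the input rows are ignored, so the rows originally written can be passed in unchanged.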

Arrow IPC Format

For high-performance data ingestion, send data in Apache Arrow IPC format instead of JSON. This is ideal for:

Large batch inserts (millions of rows)
Direct integration with pandas, Polars, or DuckDB
Avoiding JSON serialization overhead

Write with Arrow IPC

# Arrow IPC data must be generated programmatically
curl -X POST "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/vnd.apache.arrow.stream" \
  --data-binary @data.arrow

import * as arrow from "apache-arrow";

// Create Arrow table
const table = arrow.tableFromArrays({
  order_id: ["ORD-001", "ORD-002"],
  customer_id: ["CUST-100", "CUST-101"],
  amount: [99.99, 149.5],
  status: ["completed", "pending"],
});

// Serialize to IPC stream format
const ipcBytes = arrow.tableToIPC(table, "stream");

// Send to API
await fetch(
  "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/vnd.apache.arrow.stream",
    },
    body: ipcBytes,
  }
);

import pandas as pd
import pyarrow as pa
import requests

# Create Arrow table from pandas DataFrame
df = pd.DataFrame({
    "order_id": ["ORD-001", "ORD-002"],
    "customer_id": ["CUST-100", "CUST-101"],
    "amount": [99.99, 149.50],
    "status": ["completed", "pending"],
})
table = pa.Table.from_pandas(df)

# Serialize to IPC stream format
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
arrow_bytes = sink.getvalue().to_pybytes()

# Send to API
requests.post(
    "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/rows?mode=append",
    headers={
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/vnd.apache.arrow.stream"
    },
    data=arrow_bytes
)

Querying Data

Use the /queries endpoint to query your tables with SQL. You can query a single table or join multiple tables together.

Query table data

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM orders WHERE status = '\''completed'\'' ORDER BY created_at DESC LIMIT 10",
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA"
    }
  }'

const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: "SELECT * FROM orders WHERE status = 'completed' ORDER BY created_at DESC LIMIT 10",
    tables: {
      orders: "Ednc5U676CO4hn-FqsXeA",
    },
  }),
});
const data = await response.json();

response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": "SELECT * FROM orders WHERE status = 'completed' ORDER BY created_at DESC LIMIT 10",
        "tables": {
            "orders": "Ednc5U676CO4hn-FqsXeA"
        }
    }
)
data = response.json()

The tables parameter maps table names used in your SQL to their table IDs. This works the same way whether you’re querying one table or joining multiple tables.
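Since every name in the tables mapping is meant to correspond to a table name in the SQL, a client can catch typos before the request is sent. The following is a rough sketch: `build_query_payload` is a hypothetical helper, and its word-boundary check is a simple heuristic rather than a SQL parser. Whether the API itself rejects unused bindings is not stated here.

```python
import re


def build_query_payload(sql, tables):
    """Build a /queries request body and flag bindings the SQL never mentions.

    The check is a case-insensitive whole-word search, so it can be fooled
    by names that only appear inside string literals or comments.
    """
    unused = [
        name for name in tables
        if not re.search(rf"\b{re.escape(name)}\b", sql, flags=re.IGNORECASE)
    ]
    if unused:
        raise ValueError(f"table names not referenced in SQL: {unused}")
    return {"sql": sql, "tables": dict(tables)}
```

The same helper works for single-table queries and joins, because both use the same name-to-ID mapping.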

Query Response

{
  "queryId": "qry_abc123",
  "columns": [
    {"name": "order_id", "type": "Utf8"},
    {"name": "customer_id", "type": "Utf8"},
    {"name": "amount", "type": "Float64"},
    {"name": "status", "type": "Utf8"},
    {"name": "created_at", "type": "Timestamp(Microsecond, Some(\"UTC\"))"}
  ],
  "rows": [
    {"order_id": "ORD-001", "customer_id": "CUST-100", "amount": 99.99, "status": "completed", "created_at": "2024-01-15T10:30:00Z"},
    {"order_id": "ORD-002", "customer_id": "CUST-101", "amount": 149.50, "status": "completed", "created_at": "2024-01-15T09:15:00Z"}
  ],
  "rowCount": 2,
  "truncated": false,
  "stats": {
    "executionTimeMs": 42,
    "planningTimeMs": 5,
    "bytesScanned": 1024,
    "rowsScanned": 100
  }
}

The stats field is included when includeStats: true is passed in the request. A usage field with detailed I/O metrics is also returned for billing purposes.
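A response in the shape shown above can be flattened into column names and row tuples for further processing. This is a minimal sketch: `rows_as_tuples` is a hypothetical helper, and treating `truncated: true` as an error by default is a design choice made here on the assumption that the flag means the result set was cut short.

```python
def rows_as_tuples(result, allow_truncated=False):
    """Convert a query response into (column_names, list_of_row_tuples).

    Column order follows the "columns" array, so each tuple lines up
    with the returned names even if row objects list keys differently.
    """
    if result.get("truncated") and not allow_truncated:
        # Assumption: truncated means some matching rows were not returned.
        raise RuntimeError("query result was truncated")
    names = [column["name"] for column in result["columns"]]
    records = [tuple(row.get(name) for name in names) for row in result["rows"]]
    return names, records
```

With the example response, `names` starts with `order_id` and the first record starts with `ORD-001`.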

Joining Tables

Query across multiple tables by including them in the tables mapping:

Join multiple tables

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT o.order_id, c.name, o.amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id",
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    }
  }'

const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: `SELECT o.order_id, c.name, o.amount
          FROM orders o
          JOIN customers c ON o.customer_id = c.customer_id`,
    tables: {
      orders: "Ednc5U676CO4hn-FqsXeA",
      customers: "6fTBbbj4uv8TVMVh0gVch",
    },
  }),
});

response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": """
            SELECT o.order_id, c.name, o.amount
            FROM orders o
            JOIN customers c ON o.customer_id = c.customer_id
        """,
        "tables": {
            "orders": "Ednc5U676CO4hn-FqsXeA",
            "customers": "6fTBbbj4uv8TVMVh0gVch"
        }
    }
)

See the Querying Data guide for more SQL examples and best practices.

Table Schema

Get Schema

Get table schema

curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema \
  -H "Authorization: Bearer $API_TOKEN"

const response = await fetch(
  "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const schema = await response.json();

response = requests.get(
    "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema",
    headers={"Authorization": f"Bearer {api_token}"}
)
schema = response.json()

Schema Versioning

Tables track schema versions. Each modification creates a new version:

curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/schema/versions \
  -H "Authorization: Bearer $API_TOKEN"

See the Schema Management guide for migration details.

System Columns

Every table has an internal _rowid system column — a stable, unique identifier for each row assigned by the storage engine. This column is excluded from SELECT * and schema endpoints by default to keep query results focused on your data.

When you need row-level tracking (e.g., lineage, citations, or deduplication), you can opt in by setting includeRowId: true on the table binding. See the Querying Data guide for usage details.

Indexes

Indexes speed up queries that filter or search on the indexed column.

Create an Index

Create an index

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_customer_id",
    "columnName": "customer_id",
    "indexType": "btree"
  }'

await fetch("https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    indexName: "idx_customer_id",
    columnName: "customer_id",
    indexType: "btree",
  }),
});

requests.post(
    "https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "indexName": "idx_customer_id",
        "columnName": "customer_id",
        "indexType": "btree"
    }
)

Index Types

Type	Use Case
`btree`	Equality and range queries on scalar columns
`ivf_pq`	Vector similarity search (ANN)
`ivf_hnsw_pq`	High-recall vector search
`ivf_hnsw_sq`	Memory-efficient vector search

List Indexes

curl https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes \
  -H "Authorization: Bearer $API_TOKEN"

Drop an Index

curl -X DELETE https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/indexes/idx_customer_id \
  -H "Authorization: Bearer $API_TOKEN"

Table Operations

Get Execution Plan

Preview how a query will execute using the /queries/explain endpoint:

curl -X POST https://api.catalyzed.ai/queries/explain \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM orders WHERE customer_id = '\''CUST-100'\''",
    "tables": {"orders": "Ednc5U676CO4hn-FqsXeA"}
  }'

Compact Table

Optimize storage by merging small files:

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/compact \
  -H "Authorization: Bearer $API_TOKEN"

Compute Statistics

Update table statistics for query optimization:

curl -X POST https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA/statistics \
  -H "Authorization: Bearer $API_TOKEN"

Deleting a Table

curl -X DELETE https://api.catalyzed.ai/dataset-tables/Ednc5U676CO4hn-FqsXeA \
  -H "Authorization: Bearer $API_TOKEN"

API Reference

See the Dataset Tables API for complete endpoint documentation.