
Ingesting Data

Write data to your dataset tables using the /dataset-tables/{tableId}/rows endpoint. Before ingesting data, your table must exist with a defined schema—see the Schema Management guide to create tables.

Catalyzed accepts data in two formats:

Format      Content-Type                          Use Case
JSON        application/json                      Simple integration, readable, array of row objects
Arrow IPC   application/vnd.apache.arrow.stream   High performance, typed data, binary streaming

Choose a write mode based on how you want to modify the table:

Mode        Description                                                  Primary Key Required
append      Insert new rows without checking for duplicates (fastest)   No
upsert      Update existing rows by primary key, insert new rows        Yes
overwrite   Replace all existing data in the table                      No
delete      Remove rows matching the provided primary keys               Yes

Specify the mode using the mode query parameter: ?mode=append, ?mode=upsert, etc.

Append mode is the simplest way to add data: it inserts rows without duplicate checking, which makes it the fastest option:

Append rows to a table

curl -X POST "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=append" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[
    {"id": "1", "name": "Alice", "email": "[email protected]"},
    {"id": "2", "name": "Bob", "email": "[email protected]"}
  ]'

Upsert mode updates existing rows by primary key and inserts new rows. The table must have a primary key defined:

Upsert rows (update or insert)

curl -X POST "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=upsert" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[
    {"id": "1", "name": "Alice Updated", "email": "[email protected]"},
    {"id": "3", "name": "Charlie", "email": "[email protected]"}
  ]'

In this example, the row with id="1" is updated if it already exists, and the row with id="3" is inserted as a new row.

Overwrite mode replaces all existing data in the table with the new rows:

Overwrite entire table

curl -X POST "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=overwrite" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[
    {"id": "10", "name": "New User", "email": "[email protected]"}
  ]'

Delete mode removes rows matching the provided primary keys. The request body is an array of primary key values rather than row objects, and the table must have a primary key defined:

Delete rows by primary key

curl -X POST "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=delete" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '["1", "2", "3"]'

For large datasets or when working with typed columnar data, use Apache Arrow IPC format. This is more efficient than JSON for bulk operations:

Ingest using Arrow IPC

# Arrow IPC is binary format - use TypeScript/Python libraries
# cURL example omitted (not practical for binary data)
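
A minimal TypeScript sketch, assuming the apache-arrow npm package (the column names mirror the JSON examples above; tableId and apiToken are placeholders you supply):

import { tableFromArrays, tableToIPC } from "apache-arrow";

// Build an Arrow table from columnar data (illustrative columns)
const table = tableFromArrays({
  id: ["1", "2"],
  name: ["Alice", "Bob"],
  email: ["[email protected]", "[email protected]"],
});

// Serialize to the Arrow IPC stream format and send it as the request body
const response = await fetch(
  `https://api.catalyzed.ai/dataset-tables/${tableId}/rows?mode=append`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/vnd.apache.arrow.stream",
    },
    body: tableToIPC(table, "stream"),
  },
);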

When to use Arrow IPC:

  • Datasets larger than 10MB
  • You already have data in columnar format
  • Type safety is critical (no JSON string/number ambiguity)
  • Maximum performance is required

All ingestion requests return metrics about the operation:

{
  "rows_affected": 100,
  "rows_inserted": 95,
  "rows_updated": 5,
  "rows_deleted": 0,
  "dataset_version": 42,
  "duration_ms": 150,
  "usage": {
    "bytes_read": 1024,
    "bytes_written": 2048
  }
}
Field             Description
rows_affected     Total rows modified
rows_inserted     New rows added (append, upsert)
rows_updated      Existing rows changed (upsert)
rows_deleted      Rows removed (delete, overwrite)
dataset_version   New version number after operation
duration_ms       Time taken for the operation
usage             Storage I/O metrics for billing
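
If you consume this response from TypeScript, a type along these lines can be handy (an illustrative sketch derived from the fields above, not an official client type):

interface IngestResult {
  rows_affected: number;
  rows_inserted: number;
  rows_updated: number;
  rows_deleted: number;
  dataset_version: number;
  duration_ms: number;
  usage: {
    bytes_read: number;
    bytes_written: number;
  };
}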

Control ingestion behavior with query parameters:

Parameter         Type                                   Description
mode              append | upsert | overwrite | delete   Write operation mode (required)
skip_validation   boolean                                Skip schema validation for faster writes (optional)

By default, incoming data is validated against the table schema. For trusted data sources, skip validation for faster writes:

curl -X POST "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=append&skip_validation=true" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[{"id": "1", "name": "Alice"}]'

Batch rows into larger requests, up to 100MB per request, for optimal performance:

// Good: batch many rows per request
const batch = rows.slice(0, 1000);
await ingestRows(tableId, batch);

// Avoid: one row per request (too many HTTP round trips)
for (const row of rows) {
  await ingestRows(tableId, [row]);
}

// For small batches (<1,000 rows), JSON is simpler
const jsonResponse = await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(rows),
});

// For large batches (>10,000 rows), Arrow IPC is faster
const arrowData = tableToIPC(table);
const arrowResponse = await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/vnd.apache.arrow.stream" },
  body: arrowData,
});
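
The batching snippets above call an ingestRows helper. It is not part of the API; a minimal sketch wrapping the endpoint might look like this (apiToken is assumed to be in scope):

async function ingestRows(
  tableId: string,
  rows: Record<string, unknown>[],
  mode: "append" | "upsert" | "overwrite" = "append",
) {
  const response = await fetch(
    `https://api.catalyzed.ai/dataset-tables/${tableId}/rows?mode=${mode}`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(rows),
    },
  );
  if (!response.ok) {
    throw new Error(`Ingestion failed with status ${response.status}`);
  }
  return response.json();
}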

For batch imports or critical data, always use idempotency keys:

const key = `import-${dataSource}-${timestamp}-${batchId}`;
await fetch(`${url}?mode=append&idempotency_key=${key}`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(rows),
});
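
The key exists so that retries are safe: if a request times out or the response is lost, resend the same batch with the same key rather than a new one. A sketch under that assumption (the retry count and backoff are arbitrary choices, not API requirements):

async function ingestWithRetry(maxAttempts = 3) {
  const key = `import-${dataSource}-${timestamp}-${batchId}`;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const response = await fetch(`${url}?mode=append&idempotency_key=${key}`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(rows),
      });
      if (response.ok) return response.json();
    } catch {
      // Network error: retry below with the same key
    }
    // Simple linear backoff between attempts
    await new Promise((resolve) => setTimeout(resolve, attempt * 1000));
  }
  throw new Error("Ingestion failed after retries");
}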

When loading data for the first time, use append mode—it’s the fastest:

# Initial load: use append
curl -X POST ".../rows?mode=append" -d '[...]'
# Subsequent updates: use upsert
curl -X POST ".../rows?mode=upsert" -d '[...]'

Track dataset_version in responses to detect concurrent writes:

const { dataset_version } = await ingestRows(tableId, rows);
console.log(`Data written at version ${dataset_version}`);
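
If you assume each successful write advances the version by one (a sketch of the idea, not a documented guarantee), a larger jump between your own writes suggests another writer got in between:

let lastKnownVersion: number | undefined;

// After each write, compare the returned version with the last one we observed
if (lastKnownVersion !== undefined && dataset_version > lastKnownVersion + 1) {
  console.warn(`Version jumped from ${lastKnownVersion} to ${dataset_version}`);
}
lastKnownVersion = dataset_version;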

Common errors and solutions:

Error Code                Cause                             Solution
TABLE_NOT_FOUND           Table ID doesn’t exist            Verify table ID and team access
INVALID_BODY              Request body is not an array      Send JSON array of row objects
EMPTY_BODY                Array is empty                    Include at least one row
TABLE_NOT_REGISTERED      Table not linked to data engine   Contact support (rare)
SCHEMA_VALIDATION_ERROR   Data doesn’t match schema         Check field types and names

A typical error-handling pattern in TypeScript:

try {
  const response = await fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(rows),
  });

  if (!response.ok) {
    const error = await response.json();
    if (error.code === "SCHEMA_VALIDATION_ERROR") {
      console.error("Schema mismatch:", error.message);
      // Log problematic rows or field types
    }
    throw new Error(`Ingestion failed: ${error.message}`);
  }

  const result = await response.json();
  console.log(`Ingested ${result.rows_affected} rows`);
} catch (err) {
  console.error("Failed to ingest data:", err);
}
  • Querying Data - Read and analyze your ingested data with SQL
  • Schema Management - Create tables and evolve schemas safely
  • Tables - Learn about table schemas, indexes, and data types