
Managing Datasets

Datasets are logical containers that group related tables together within a team. This guide covers the full lifecycle of managing datasets.

Create a dataset by providing a team ID, a name, and optionally a description, tags, and metadata:

Create a dataset

curl -X POST https://api.catalyzed.ai/datasets \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"teamId": "ZkoDMyjZZsXo4VAO_nJLk",
"name": "Sales Analytics",
"description": "Sales data for analytics and reporting",
"tags": ["sales", "analytics", "production"],
"metadata": {
"owner": "data-team",
"source": "salesforce",
"refreshFrequency": "daily"
}
}'

Response:

{
"datasetId": "HoIEJNIPiQIy6TjVRxjwz",
"teamId": "ZkoDMyjZZsXo4VAO_nJLk",
"name": "Sales Analytics",
"description": "Sales data for analytics and reporting",
"tags": ["sales", "analytics", "production"],
"metadata": {
"owner": "data-team",
"source": "salesforce",
"refreshFrequency": "daily"
},
"managed": false,
"createdAt": "2024-01-15T10:30:00Z",
"updatedAt": "2024-01-15T10:30:00Z"
}

| Property | Type | Required | Description |
|---|---|---|---|
| teamId | string | Yes | Team that owns this dataset |
| name | string | Yes | Human-readable name (1-255 characters, unique per team) |
| description | string | No | Optional description |
| tags | string[] | No | Tags for filtering and organization (defaults to []) |
| metadata | object | No | Arbitrary key-value metadata (defaults to {}) |
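
The constraints in the property table can be checked on the client before sending the request. The following is a minimal sketch, not part of any official client: the function name build_create_payload is hypothetical, and it only checks what is stated above (required teamId, name length of 1-255 characters). Uniqueness of the name within a team can only be verified by the API itself.

```python
def build_create_payload(team_id, name, description=None, tags=None, metadata=None):
    """Return a JSON-serializable body for POST /datasets (hypothetical helper)."""
    if not team_id:
        raise ValueError("teamId is required")
    if not 1 <= len(name) <= 255:
        raise ValueError("name must be 1-255 characters")

    payload = {"teamId": team_id, "name": name}
    # Optional fields are omitted when unset so the documented defaults apply.
    if description is not None:
        payload["description"] = description
    if tags is not None:
        payload["tags"] = list(tags)
    if metadata is not None:
        payload["metadata"] = dict(metadata)
    return payload
```

The returned dictionary can be serialized with json.dumps and sent as the request body shown in the curl example.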

List datasets with optional filters, pagination, and sorting:

List datasets with filters

curl "https://api.catalyzed.ai/datasets?teamIds=ZkoDMyjZZsXo4VAO_nJLk&tags=production&orderBy=name&orderDirection=asc&page=1&pageSize=10" \
-H "Authorization: Bearer $API_TOKEN"

| Parameter | Type | Description |
|---|---|---|
| teamIds | string | Comma-separated team IDs |
| datasetIds | string | Comma-separated dataset IDs |
| name | string | Partial match on dataset name (case-insensitive) |
| tags | string | Comma-separated tags; returns datasets matching any of the provided tags |
| managed | boolean | Filter by managed status (true for system-managed, false for user-created) |
| page | number | Page number, 1-indexed (default: 1) |
| pageSize | number | Results per page, 1-100 (default: 20) |
| orderBy | string | Sort by: createdAt, name, updatedAt, description |
| orderDirection | string | asc or desc |

Response:

{
"datasets": [
{
"datasetId": "HoIEJNIPiQIy6TjVRxjwz",
"teamId": "ZkoDMyjZZsXo4VAO_nJLk",
"name": "Sales Analytics",
"description": "Sales data for analytics and reporting",
"tags": ["sales", "analytics", "production"],
"metadata": {},
"managed": false,
"createdAt": "2024-01-15T10:30:00Z",
"updatedAt": "2024-01-15T10:30:00Z"
}
],
"total": 1,
"page": 1,
"pageSize": 10
}
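
Because results are paginated, collecting every matching dataset means requesting successive pages until the number of items seen reaches total. The sketch below shows one way to do that, assuming the response shape shown above. The names iter_datasets and make_fetcher are hypothetical, and the HTTP part uses only the Python standard library.

```python
import json
import urllib.parse
import urllib.request


def iter_datasets(fetch_page, page_size=100):
    """Yield every dataset by walking pages until `total` items have been seen."""
    page = 1
    seen = 0
    while True:
        body = fetch_page(page, page_size)
        items = body["datasets"]
        for item in items:
            yield item
        seen += len(items)
        # Also stop on an empty page, in case `total` changes mid-iteration.
        if not items or seen >= body["total"]:
            break
        page += 1


def make_fetcher(api_token, filters, base_url="https://api.catalyzed.ai"):
    """Build a fetch_page(page, page_size) callable that performs the GET request."""
    def fetch_page(page, page_size):
        query = urllib.parse.urlencode({**filters, "page": page, "pageSize": page_size})
        request = urllib.request.Request(
            f"{base_url}/datasets?{query}",
            headers={"Authorization": f"Bearer {api_token}"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch_page
```

A call such as iter_datasets(make_fetcher(token, {"teamIds": team_id, "tags": "production"})) would then yield datasets one at a time. Keeping the fetch function separate from the loop makes the paging logic easy to test without network access.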

Get dataset by ID

curl https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz \
-H "Authorization: Bearer $API_TOKEN"

Get row counts and table metadata for all tables in a dataset:

curl https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz/table-stats \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
"datasetId": "HoIEJNIPiQIy6TjVRxjwz",
"tables": [
{"tableId": "Ednc5U676CO4hn-FqsXeA", "tableName": "orders", "rowCount": 15000}
],
"totalRows": 15000,
"tableCount": 1
}
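
Once parsed, the table-stats response is easy to summarize. As a small illustration, assuming the response shape shown above, the hypothetical helper below lists tables from largest to smallest by row count.

```python
def tables_by_row_count(stats):
    """Return (tableName, rowCount) pairs from a table-stats response, largest first."""
    pairs = [(table["tableName"], table["rowCount"]) for table in stats["tables"]]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```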

Update a dataset’s name, description, tags, or metadata:

Update dataset

curl -X PUT https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Sales Analytics v2",
"description": "Updated sales data with 2025 metrics",
"tags": ["sales", "analytics", "production", "2025"],
"metadata": {
"owner": "data-team",
"source": "salesforce",
"refreshFrequency": "hourly"
}
}'

All fields are optional — only include the fields you want to change. Tags and metadata are replaced entirely (not merged), so include the complete set when updating.
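
Since tags and metadata are replaced rather than merged, a safe update is a read-modify-write: fetch the dataset, merge the changes locally, then send the complete values. The following sketch shows the merge step only. The function name build_update_payload and its parameters are hypothetical, and the input is assumed to be a dataset object as returned by the get or create endpoints.

```python
def build_update_payload(current, add_tags=(), remove_tags=(), metadata_changes=None):
    """Merge changes into the current tags and metadata and return a full PUT body."""
    removed = set(remove_tags)
    # Keep existing tags in order, minus any that should be removed.
    tags = [tag for tag in current.get("tags", []) if tag not in removed]
    for tag in add_tags:
        if tag not in tags:
            tags.append(tag)
    # Build a new dict so the fetched dataset object is left unmodified.
    metadata = {**current.get("metadata", {}), **(metadata_changes or {})}
    return {"tags": tags, "metadata": metadata}
```

Applied to the dataset created earlier with add_tags=["2025"] and metadata_changes={"refreshFrequency": "hourly"}, this produces the same tags and metadata as the update request above.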

Delete dataset

curl -X DELETE https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz \
-H "Authorization: Bearer $API_TOKEN"

Returns 204 No Content on success.

Group tables by business domain to keep related data together:

Patent Research (dataset)
├── patents (table)
├── inventors (table)
└── citations (table)
Financial Filings (dataset)
├── sec_filings (table)
├── companies (table)
└── financial_metrics (table)

Separate production, staging, and development data:

Customer Data - Production (dataset)
└── customers (table)
Customer Data - Staging (dataset)
└── customers (table)
Customer Data - Development (dataset)
└── customers (table)

Use tags to identify environments: tags: ["production"] or tags: ["staging"].

Tags enable quick filtering across datasets. Common tagging patterns:

  • Environment: production, staging, development
  • Domain: sales, patents, compliance
  • Status: active, archived, deprecated
  • Data source: salesforce, snowflake, manual-upload

Filter by tags in list requests:

# Find datasets tagged production or sales (the filter matches any listed tag)
curl "https://api.catalyzed.ai/datasets?teamIds=...&tags=production,sales" \
-H "Authorization: Bearer $API_TOKEN"
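
The tags filter returns datasets that match any of the provided tags. To keep only datasets that carry all of them, one option is to narrow the results on the client. This is a minimal sketch with a hypothetical helper name, operating on the datasets array from a list response.

```python
def filter_all_tags(datasets, required_tags):
    """Keep only the datasets whose tags include every tag in required_tags."""
    required = set(required_tags)
    return [d for d in datasets if required <= set(d.get("tags", []))]
```

For example, filter_all_tags(body["datasets"], ["production", "sales"]) drops datasets that are tagged with only one of the two.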

Metadata stores structured information about a dataset as key-value pairs. It is useful for tracking data lineage, ownership, and operational context:

{
"metadata": {
"owner": "data-engineering",
"source": "salesforce-api",
"refreshFrequency": "daily",
"lastRefresh": "2025-01-15T00:00:00Z",
"pii": "true",
"retentionDays": "365"
}
}
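
In the example above, values such as pii and retentionDays are stored as strings. If your team follows the same convention, code that reads the metadata has to convert them. The helpers below are a hypothetical sketch of that conversion; the key names come from the example and are not required by the API.

```python
def metadata_flag(metadata, key):
    """Return True when the metadata value for key is the string "true" (any case)."""
    return str(metadata.get(key, "")).strip().lower() == "true"


def metadata_int(metadata, key, default=None):
    """Return the metadata value for key as an int, or default if missing or invalid."""
    try:
        return int(metadata[key])
    except (KeyError, TypeError, ValueError):
        return default
```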

  1. Use descriptive names: Customer Orders 2025 is better than data_v3
  2. Tag consistently: agree on a tagging taxonomy across your team
  3. Set metadata at creation: it is easier to organize from the start than retroactively
  4. One dataset per domain: avoid mixing unrelated tables in the same dataset
  5. Keep names unique and meaningful: names must be unique per team, so include enough context to distinguish similar datasets