Managing Datasets

Datasets are logical containers that group related tables together within a team. This guide covers creating, listing, retrieving, updating, and deleting datasets, and organizing them with tags and metadata.

Creating a Dataset

Create a dataset by providing a team ID, a name, and optionally a description, tags, and metadata:

Create a dataset

curl -X POST https://api.catalyzed.ai/datasets \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
    "name": "Sales Analytics",
    "description": "Sales data for analytics and reporting",
    "tags": ["sales", "analytics", "production"],
    "metadata": {
      "owner": "data-team",
      "source": "salesforce",
      "refreshFrequency": "daily"
    }
  }'

const response = await fetch("https://api.catalyzed.ai/datasets", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    name: "Sales Analytics",
    description: "Sales data for analytics and reporting",
    tags: ["sales", "analytics", "production"],
    metadata: {
      owner: "data-team",
      source: "salesforce",
      refreshFrequency: "daily",
    },
  }),
});
const dataset = await response.json();

import requests

response = requests.post(
    "https://api.catalyzed.ai/datasets",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
        "name": "Sales Analytics",
        "description": "Sales data for analytics and reporting",
        "tags": ["sales", "analytics", "production"],
        "metadata": {
            "owner": "data-team",
            "source": "salesforce",
            "refreshFrequency": "daily"
        }
    }
)
dataset = response.json()

Response:

{
  "datasetId": "HoIEJNIPiQIy6TjVRxjwz",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "name": "Sales Analytics",
  "description": "Sales data for analytics and reporting",
  "tags": ["sales", "analytics", "production"],
  "metadata": {
    "owner": "data-team",
    "source": "salesforce",
    "refreshFrequency": "daily"
  },
  "managed": false,
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:30:00Z"
}

Dataset Properties

Property	Type	Required	Description
`teamId`	string	Yes	Team that owns this dataset
`name`	string	Yes	Human-readable name (1-255 characters, unique per team)
`description`	string	No	Optional description
`tags`	string[]	No	Tags for filtering and organization (defaults to `[]`)
`metadata`	object	No	Arbitrary key-value metadata (defaults to `{}`)
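
The constraints in the table can be checked on the client before sending the request. The following is a minimal sketch, not part of the API: `build_create_payload` is a hypothetical helper name, and it only checks what the table states (required `teamId`, a `name` of 1-255 characters). Uniqueness of the name within a team can only be verified by the server.

```python
from __future__ import annotations

from typing import Any


def build_create_payload(
    team_id: str,
    name: str,
    description: str | None = None,
    tags: list[str] | None = None,
    metadata: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Build a body for POST /datasets, checking the documented constraints."""
    if not team_id:
        raise ValueError("teamId is required")
    if not 1 <= len(name) <= 255:
        raise ValueError("name must be 1-255 characters")

    payload: dict[str, Any] = {"teamId": team_id, "name": name}
    # Optional fields are left out when unset so the documented defaults
    # ([] for tags, {} for metadata) apply.
    if description is not None:
        payload["description"] = description
    if tags is not None:
        payload["tags"] = list(tags)
    if metadata is not None:
        payload["metadata"] = dict(metadata)
    return payload
```

The returned dictionary can be passed as the `json=` argument of `requests.post` in the create example above.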

Listing and Filtering Datasets

List datasets with optional filters, pagination, and sorting:

List datasets with filters

curl "https://api.catalyzed.ai/datasets?teamIds=ZkoDMyjZZsXo4VAO_nJLk&tags=production&orderBy=name&orderDirection=asc&page=1&pageSize=10" \
  -H "Authorization: Bearer $API_TOKEN"

const params = new URLSearchParams({
  teamIds: "ZkoDMyjZZsXo4VAO_nJLk",
  tags: "production",
  orderBy: "name",
  orderDirection: "asc",
  page: "1",
  pageSize: "10",
});

const response = await fetch(`https://api.catalyzed.ai/datasets?${params}`, {
  headers: { Authorization: `Bearer ${apiToken}` },
});
const { datasets, total, page, pageSize } = await response.json();

import requests

response = requests.get(
    "https://api.catalyzed.ai/datasets",
    params={
        "teamIds": "ZkoDMyjZZsXo4VAO_nJLk",
        "tags": "production",
        "orderBy": "name",
        "orderDirection": "asc",
        "page": 1,
        "pageSize": 10
    },
    headers={"Authorization": f"Bearer {api_token}"}
)
result = response.json()
datasets = result["datasets"]

Filter Parameters

Parameter	Type	Description
`teamIds`	string	Comma-separated team IDs
`datasetIds`	string	Comma-separated dataset IDs
`name`	string	Partial match on dataset name (case-insensitive)
`tags`	string	Comma-separated tags — returns datasets matching any of the provided tags
`managed`	boolean	Filter by managed status (`true` for system-managed, `false` for user-created)
`page`	number	Page number, 1-indexed (default: 1)
`pageSize`	number	Results per page, 1-100 (default: 20)
`orderBy`	string	Sort by: `createdAt`, `name`, `updatedAt`, `description`
`orderDirection`	string	`asc` or `desc`
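
When the filters come from user input, it can help to validate them against the table before building the URL. The sketch below is illustrative only: `build_list_query` and `ALLOWED_ORDER_BY` are hypothetical names, and sending `managed` as lowercase `true`/`false` is an assumption about how the API expects booleans in a query string.

```python
from urllib.parse import urlencode

ALLOWED_ORDER_BY = {"createdAt", "name", "updatedAt", "description"}


def build_list_query(
    team_ids=None,
    dataset_ids=None,
    name=None,
    tags=None,
    managed=None,
    order_by=None,
    order_direction=None,
    page=1,
    page_size=20,
):
    """Return a query string for GET /datasets built from the documented filters."""
    if page < 1:
        raise ValueError("page is 1-indexed")
    if not 1 <= page_size <= 100:
        raise ValueError("pageSize must be between 1 and 100")
    if order_by is not None and order_by not in ALLOWED_ORDER_BY:
        raise ValueError(f"orderBy must be one of {sorted(ALLOWED_ORDER_BY)}")
    if order_direction is not None and order_direction not in ("asc", "desc"):
        raise ValueError("orderDirection must be 'asc' or 'desc'")

    params = {}
    if team_ids:
        params["teamIds"] = ",".join(team_ids)
    if dataset_ids:
        params["datasetIds"] = ",".join(dataset_ids)
    if name:
        params["name"] = name
    if tags:
        params["tags"] = ",".join(tags)
    if managed is not None:
        # Assumption: booleans are sent as lowercase "true"/"false".
        params["managed"] = "true" if managed else "false"
    if order_by:
        params["orderBy"] = order_by
    if order_direction:
        params["orderDirection"] = order_direction
    params["page"] = page
    params["pageSize"] = page_size
    # safe="," keeps the comma separators literal instead of encoding them as %2C.
    return urlencode(params, safe=",")
```

Append the result to `https://api.catalyzed.ai/datasets?` to get a URL of the same shape as the curl example above.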

Response Format

{
  "datasets": [
    {
      "datasetId": "HoIEJNIPiQIy6TjVRxjwz",
      "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
      "name": "Sales Analytics",
      "description": "Sales data for analytics and reporting",
      "tags": ["sales", "analytics", "production"],
      "metadata": {},
      "managed": false,
      "createdAt": "2024-01-15T10:30:00Z",
      "updatedAt": "2024-01-15T10:30:00Z"
    }
  ],
  "total": 1,
  "page": 1,
  "pageSize": 10
}
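
The `total`, `page`, and `pageSize` fields are enough to walk through every page of results. A minimal sketch, assuming a caller-supplied `fetch_page` function (hypothetical) that performs the GET request for a given page number and returns the parsed JSON body in the format shown above:

```python
from typing import Any, Callable, Dict, Iterator


def iter_datasets(fetch_page: Callable[[int], Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every dataset across all pages, starting at page 1."""
    page = 1
    while True:
        body = fetch_page(page)
        datasets = body["datasets"]
        yield from datasets
        # Stop when the page is empty or the pages seen so far cover `total`.
        if not datasets or page * body["pageSize"] >= body["total"]:
            break
        page += 1
```

Passing the request in as a function keeps the loop independent of the HTTP client, so it can be exercised with canned responses before pointing it at the API.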

Getting a Dataset

Get dataset by ID

curl https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz \
  -H "Authorization: Bearer $API_TOKEN"

const response = await fetch(
  "https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const dataset = await response.json();

import requests

response = requests.get(
    "https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz",
    headers={"Authorization": f"Bearer {api_token}"}
)
dataset = response.json()

Table Statistics

Get row counts and table metadata for all tables in a dataset:

curl https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz/table-stats \
  -H "Authorization: Bearer $API_TOKEN"

Response:

{
  "datasetId": "HoIEJNIPiQIy6TjVRxjwz",
  "tables": [
    {"tableId": "Ednc5U676CO4hn-FqsXeA", "tableName": "orders", "rowCount": 15000}
  ],
  "totalRows": 15000,
  "tableCount": 1
}
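
A response like this can be summarized on the client, for example to find the largest table. The sketch below uses a hypothetical helper, `summarize_table_stats`. It assumes that `totalRows` is the sum of the per-table `rowCount` values and that `tableCount` is the length of `tables`; both match the example response but are not stated explicitly.

```python
from typing import Any, Dict


def summarize_table_stats(stats: Dict[str, Any]) -> Dict[str, Any]:
    """Check a table-stats response for internal consistency and find the largest table."""
    tables = stats["tables"]
    row_sum = sum(t["rowCount"] for t in tables)
    largest = max(tables, key=lambda t: t["rowCount"], default=None)
    return {
        # Assumption: totalRows == sum of rowCount, tableCount == len(tables).
        "consistent": row_sum == stats["totalRows"] and len(tables) == stats["tableCount"],
        "largestTable": largest["tableName"] if largest else None,
    }
```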

Updating a Dataset

Update a dataset’s name, description, tags, or metadata:

Update dataset

curl -X PUT https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Sales Analytics v2",
    "description": "Updated sales data with 2025 metrics",
    "tags": ["sales", "analytics", "production", "2025"],
    "metadata": {
      "owner": "data-team",
      "source": "salesforce",
      "refreshFrequency": "hourly"
    }
  }'

const response = await fetch(
  "https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz",
  {
    method: "PUT",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: "Sales Analytics v2",
      description: "Updated sales data with 2025 metrics",
      tags: ["sales", "analytics", "production", "2025"],
      metadata: {
        owner: "data-team",
        source: "salesforce",
        refreshFrequency: "hourly",
      },
    }),
  }
);
const updated = await response.json();

import requests

response = requests.put(
    "https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "name": "Sales Analytics v2",
        "description": "Updated sales data with 2025 metrics",
        "tags": ["sales", "analytics", "production", "2025"],
        "metadata": {
            "owner": "data-team",
            "source": "salesforce",
            "refreshFrequency": "hourly"
        }
    }
)
updated = response.json()

All fields are optional; include only the fields you want to change. Tags and metadata are replaced entirely (not merged), so send the complete set whenever you update either one.
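
Because tags and metadata are replaced rather than merged, a common pattern is to read the dataset first, merge the changes on the client, and send the full result. A minimal sketch of the merge step; `build_update_payload` is a hypothetical helper, and `current` is the dataset object returned by a GET request:

```python
from typing import Any, Dict, Iterable, Optional


def build_update_payload(
    current: Dict[str, Any],
    add_tags: Iterable[str] = (),
    remove_tags: Iterable[str] = (),
    metadata_changes: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge changes into the current tags and metadata and return a full PUT body."""
    removed = set(remove_tags)
    # Keep existing tags in order, minus the removed ones.
    tags = [t for t in current.get("tags", []) if t not in removed]
    for tag in add_tags:
        if tag not in tags:
            tags.append(tag)
    # Changed keys overwrite existing ones; all other keys are preserved.
    metadata = {**current.get("metadata", {}), **(metadata_changes or {})}
    return {"tags": tags, "metadata": metadata}
```

For the dataset created earlier, `build_update_payload(dataset, add_tags=["2025"], metadata_changes={"refreshFrequency": "hourly"})` produces the same `tags` and `metadata` values as the update example above.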

Deleting a Dataset

Delete dataset

curl -X DELETE https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz \
  -H "Authorization: Bearer $API_TOKEN"

await fetch("https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz", {
  method: "DELETE",
  headers: { Authorization: `Bearer ${apiToken}` },
});

import requests

requests.delete(
    "https://api.catalyzed.ai/datasets/HoIEJNIPiQIy6TjVRxjwz",
    headers={"Authorization": f"Bearer {api_token}"}
)

Returns 204 No Content on success.

Organizing Datasets

By Domain

Group tables by business domain to keep related data together:

Patent Research (dataset)
  ├── patents (table)
  ├── inventors (table)
  └── citations (table)

Financial Filings (dataset)
  ├── sec_filings (table)
  ├── companies (table)
  └── financial_metrics (table)

By Environment

Separate production, staging, and development data:

Customer Data - Production (dataset)
  └── customers (table)

Customer Data - Staging (dataset)
  └── customers (table)

Customer Data - Development (dataset)
  └── customers (table)

Use tags to identify environments, for example tags: ["production"] or tags: ["staging"].

Using Tags for Filtering

Tags enable quick filtering across datasets. Common tagging patterns:

Environment: production, staging, development
Domain: sales, patents, compliance
Status: active, archived, deprecated
Data source: salesforce, snowflake, manual-upload

Filter by tags in list requests:

# Find datasets tagged production or sales (a dataset matches if it has any of the listed tags)
curl "https://api.catalyzed.ai/datasets?teamIds=...&tags=production,sales" \
  -H "Authorization: Bearer $API_TOKEN"
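
The `tags` filter matches datasets that carry any of the listed tags (see Filter Parameters). To narrow the results to datasets that carry all of them, one option is to filter the returned list on the client. A minimal sketch; `with_all_tags` is a hypothetical helper that operates on the `datasets` array from a list response:

```python
from typing import Any, Dict, Iterable, List


def with_all_tags(datasets: List[Dict[str, Any]], required: Iterable[str]) -> List[Dict[str, Any]]:
    """Keep only the datasets whose tags include every required tag."""
    required_set = set(required)
    return [d for d in datasets if required_set <= set(d.get("tags", []))]
```

For example, `with_all_tags(datasets, ["production", "sales"])` drops a dataset tagged only `production`.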

Using Metadata

Metadata stores structured information about a dataset as key-value pairs. It is useful for tracking data lineage, ownership, and operational context:

{
  "metadata": {
    "owner": "data-engineering",
    "source": "salesforce-api",
    "refreshFrequency": "daily",
    "lastRefresh": "2025-01-15T00:00:00Z",
    "pii": "true",
    "retentionDays": "365"
  }
}
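
The example above stores the boolean, number, and timestamp values as strings. If you follow that convention, the values need converting before use. The sketch below is one way to do it; `parse_metadata` is a hypothetical helper, and the key names and string formats are taken from the example rather than from any schema the API enforces.

```python
from datetime import datetime
from typing import Any, Dict


def parse_metadata(metadata: Dict[str, str]) -> Dict[str, Any]:
    """Convert string-encoded metadata values from the example into typed values."""
    parsed: Dict[str, Any] = dict(metadata)
    if "pii" in parsed:
        parsed["pii"] = parsed["pii"].lower() == "true"
    if "retentionDays" in parsed:
        parsed["retentionDays"] = int(parsed["retentionDays"])
    if "lastRefresh" in parsed:
        # datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+,
        # so normalize it to an explicit UTC offset first.
        parsed["lastRefresh"] = datetime.fromisoformat(
            parsed["lastRefresh"].replace("Z", "+00:00")
        )
    return parsed
```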

Best Practices

Use descriptive names — Customer Orders 2025 is better than data_v3
Tag consistently — agree on a tagging taxonomy across your team
Set metadata at creation — easier to organize from the start than retroactively
One dataset per domain — avoid mixing unrelated tables in the same dataset
Keep names unique and meaningful — names must be unique per team, so include enough context to distinguish similar datasets

Next Steps

Managing Tables - Create and manage tables within datasets
Ingesting Data - Write data into your tables
Datasets (Concept) - Understand the data hierarchy