Files

Files allow you to upload data for processing. Uploaded files can be processed for text extraction (PDFs, documents) or converted into table data (CSV, JSON).

Upload a file

curl -X POST https://api.catalyzed.ai/files \
-H "Authorization: Bearer $API_TOKEN" \
-F "teamId=ZkoDMyjZZsXo4VAO_nJLk" \
-F "file=@data.csv"

Response:

{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "data.csv",
  "fileSize": 1048576,
  "mimeType": "text/csv",
  "fileHash": "a1b2c3d4...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": null,
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:30:00Z"
}
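
The same upload from TypeScript, as a rough sketch; it assumes the multipart field carrying the file contents is named file, matching the curl example above:

// Minimal upload sketch; assumes the multipart file field is named "file"
async function uploadFile(apiToken: string, teamId: string, file: Blob, fileName: string) {
  const form = new FormData();
  form.append("teamId", teamId);
  form.append("file", file, fileName);

  const response = await fetch("https://api.catalyzed.ai/files", {
    method: "POST",
    // Don't set Content-Type manually; fetch adds the multipart boundary
    headers: { Authorization: `Bearer ${apiToken}` },
    body: form,
  });
  if (!response.ok) throw new Error(`Upload failed with status ${response.status}`);
  return response.json(); // the file record shown above
}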

Instead of uploading a file directly, you can provide a URL and Catalyzed will fetch the file for you. This is useful for ingesting data from public downloads, API endpoints, or cloud storage presigned URLs.

Create file from URL

curl -X POST https://api.catalyzed.ai/files/from-url \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"teamId": "ZkoDMyjZZsXo4VAO_nJLk",
"url": "https://example.com/data/report.pdf"
}'

Response (same format as a direct upload, plus a sourceUrl field):

{
  "fileId": "NqrHc9VbKl_JknAbyvNBc",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "report.pdf",
  "fileSize": 2457600,
  "mimeType": "application/pdf",
  "fileHash": "b2c3d4e5...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/NqrHc9VbKl_JknAbyvNBc",
  "uploadedBy": "usr_abc123",
  "sourceUrl": "https://example.com/data/report.pdf",
  "metadata": {},
  "processingStatus": null,
  "createdAt": "2024-01-15T11:00:00Z",
  "updatedAt": "2024-01-15T11:00:00Z"
}

When you create a file from a URL:

  1. Fetch: Catalyzed fetches the file from the provided URL with a 30-second timeout
  2. Filename Extraction: The filename is determined from:
    • Content-Disposition header (if present)
    • URL path (e.g., https://example.com/data/report.pdf → report.pdf)
    • Falls back to "file" if neither is available
  3. Content-Type Parsing: MIME type is extracted from the Content-Type header, stripping any parameters like charset
  4. Size Validation: Files must be under 100MB
  5. Hash Computation: SHA-256 hash is computed for deduplication
  6. Storage: File is uploaded to S3 storage
  7. Database Record: A file record is created with the sourceUrl field set
  8. Auto-Processing: Processing is automatically triggered (same as direct upload)

The source URL must meet these requirements:

  • Must be a valid HTTP or HTTPS URL
  • Must be publicly accessible (no authentication required)
  • Response must complete within 30 seconds
  • File size must be under 100MB
  • Must return an appropriate Content-Type header
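
If you want to catch obvious failures before calling the API, you can pre-check a URL against the requirements above. A minimal sketch, assuming the remote server answers HEAD requests and reports Content-Length:

// Hypothetical client-side pre-check mirroring the URL requirements above
async function precheckUrl(url: string): Promise<void> {
  const parsed = new URL(url);
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error("URL must use http or https");
  }

  // 30-second budget, mirroring Catalyzed's fetch timeout
  const response = await fetch(url, { method: "HEAD", signal: AbortSignal.timeout(30_000) });
  if (!response.ok) throw new Error(`URL returned ${response.status}`);

  const size = Number(response.headers.get("content-length") ?? 0);
  if (size > 100 * 1024 * 1024) throw new Error("File exceeds the 100MB limit");

  if (!response.headers.get("content-type")) {
    throw new Error("Server does not return a Content-Type header");
  }
}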

The filename is extracted in this order:

  1. Content-Disposition header:
    Content-Disposition: attachment; filename="Q4-2024-Report.pdf"
  2. URL path: The last segment of the URL path
    https://storage.example.com/reports/2024/annual.pdf → "annual.pdf"
  3. Fallback: If neither is available, defaults to "file"
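
If you need to predict the resulting fileName client-side, the documented order can be mirrored with a small helper. This is a sketch of the rules above, not Catalyzed's actual implementation:

// Predict the fileName Catalyzed will assign, following the documented order
function predictFileName(contentDisposition: string | null, url: string): string {
  // 1. Content-Disposition header, e.g. attachment; filename="Q4-2024-Report.pdf"
  const match = contentDisposition?.match(/filename="?([^";]+)"?/i);
  if (match) return match[1];

  // 2. Last segment of the URL path
  const lastSegment = new URL(url).pathname.split("/").filter(Boolean).pop();
  if (lastSegment) return lastSegment;

  // 3. Fallback
  return "file";
}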

Files created from URLs have a sourceUrl field that preserves the original URL. This is useful for:

  • Auditing where files came from
  • Re-fetching updated versions
  • Tracking external data sources

Example query to find files from a specific source:

const response = await fetch(
  "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const files = await response.json();

// Filter by source URL
const filesFromExternalAPI = files.files.filter(
  (f) => f.sourceUrl?.startsWith("https://external-api.example.com")
);

Common errors when creating files from URLs:

Error                    | Cause                                          | Solution
Failed to fetch from URL | URL is unreachable or returned non-2xx status  | Verify URL is accessible and returns 200 OK
File size exceeds limit  | File is larger than 100MB                      | Use direct upload or compress the file
Request timeout          | Download took longer than 30 seconds           | Use direct upload for large/slow downloads
Invalid MIME type        | Content-Type header is missing or unsupported  | Ensure server returns correct Content-Type
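
These failures surface as non-2xx responses from /files/from-url. A minimal handling sketch (the shape of the error body isn't documented here, so only the HTTP status is used):

// Create a file from a URL and surface ingestion failures
const response = await fetch("https://api.catalyzed.ai/files/from-url", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    url: "https://example.com/data/report.pdf",
  }),
});

if (!response.ok) {
  // e.g. the URL was unreachable, too large, or timed out (see table above)
  throw new Error(`from-url ingestion failed with status ${response.status}`);
}
const file = await response.json();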

A few common ways to use /files/from-url:

Public Dataset Ingestion:

// Fetch a public dataset
await fetch("https://api.catalyzed.ai/files/from-url", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    url: "https://data.gov/datasets/covid-19/data.csv",
  }),
});

Webhook Integration:

// Handle a webhook that provides a file URL
app.post("/webhook/file-ready", async (req, res) => {
  const { fileUrl } = req.body;

  // Ingest the file from the webhook URL
  await fetch("https://api.catalyzed.ai/files/from-url", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      teamId: "ZkoDMyjZZsXo4VAO_nJLk",
      url: fileUrl,
    }),
  });

  res.sendStatus(200);
});

Cloud Storage URLs:

// Ingest from an S3 presigned URL
const presignedUrl = generatePresignedUrl("my-bucket", "data/report.xlsx");

await fetch("https://api.catalyzed.ai/files/from-url", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    url: presignedUrl,
  }),
});

Supported file types and how each is processed:

Type      | Extensions    | Processing
CSV       | .csv          | Parse into table rows
JSON      | .json, .jsonl | Parse into table rows
Parquet   | .parquet      | Direct table import
PDF       | .pdf          | Text extraction, chunking, embeddings
Documents | .docx, .txt   | Text extraction, chunking, embeddings
Excel     | .xlsx, .xls   | Parse into table rows

Files use an event-based processing model. The processingStatus field shows the latest processing event:

Event Type    | Description
JOB_CREATED   | Processing job has been queued
JOB_STARTED   | Processing has begun
JOB_SUCCEEDED | Processing completed successfully
JOB_FAILED    | Processing failed (check eventData for error)

A file with processingStatus: null has never been processed.
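
For example, a small helper that uses these events to decide whether a file still needs (re)processing:

// Decide whether a file needs (re)processing from its latest event
function needsProcessing(file: { processingStatus: { eventType: string } | null }): boolean {
  const status = file.processingStatus?.eventType ?? null;
  // Never processed, or the last attempt failed
  return status === null || status === "JOB_FAILED";
}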

Get file status

curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
-H "Authorization: Bearer $API_TOKEN"

Response with processing status:

{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "data.csv",
  "fileSize": 1048576,
  "mimeType": "text/csv",
  "fileHash": "a1b2c3d4...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": {
    "eventType": "JOB_SUCCEEDED",
    "createdAt": "2024-01-15T10:31:00Z",
    "processorVersion": "1.2.0",
    "eventData": {}
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:31:00Z"
}

For files that require processing (PDFs, documents), poll the file until processing finishes:

async function waitForProcessing(fileId: string): Promise<File> {
  while (true) {
    const response = await fetch(`https://api.catalyzed.ai/files/${fileId}`, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    const file = await response.json();

    const status = file.processingStatus?.eventType;
    if (status === "JOB_SUCCEEDED") return file;
    if (status === "JOB_FAILED") {
      throw new Error(file.processingStatus.eventData?.error || "Processing failed");
    }

    await new Promise((r) => setTimeout(r, 2000)); // Poll every 2 seconds
  }
}
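
For example, after uploading a PDF:

// Block until the uploaded PDF has been chunked and embedded
const processed = await waitForProcessing("LvrGb8UaJk_IjmzaxuMAb");
// processed.processingStatus.eventType is now "JOB_SUCCEEDED"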

List files

curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk" \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "files": [
    {
      "fileId": "LvrGb8UaJk_IjmzaxuMAb",
      "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
      "fileName": "data.csv",
      "fileSize": 1048576,
      "mimeType": "text/csv",
      "fileHash": "a1b2c3d4...",
      "storageKey": "...",
      "uploadedBy": "usr_abc123",
      "metadata": {},
      "processingStatus": {
        "eventType": "JOB_SUCCEEDED",
        "createdAt": "2024-01-15T10:31:00Z",
        "processorVersion": "1.2.0"
      },
      "createdAt": "2024-01-15T10:30:00Z",
      "updatedAt": "2024-01-15T10:31:00Z"
    }
  ],
  "total": 1,
  "page": 1,
  "pageSize": 20
}

Filter by processing status or MIME type:

# Only successfully processed files
curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk&processingStatus=JOB_SUCCEEDED" \
-H "Authorization: Bearer $API_TOKEN"

# Files that have never been processed (use "none")
curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk&processingStatus=none" \
-H "Authorization: Bearer $API_TOKEN"

# Only PDFs
curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk&mimeTypes=application/pdf" \
-H "Authorization: Bearer $API_TOKEN"

The download endpoint returns presigned URLs for secure access:

Get download URL

curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "downloadUrl": "https://storage.example.com/files/...?signature=...",
  "previewUrl": "https://storage.example.com/files/...?signature=...&inline=true",
  "expiresAt": "2024-01-15T11:30:00Z"
}
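
The presigned downloadUrl carries its own signature, so it can be fetched as-is until expiresAt. A minimal sketch:

// Get a presigned URL, then fetch the file contents directly
const urlResponse = await fetch(
  "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const { downloadUrl } = await urlResponse.json();

// The presigned URL is self-authenticating; no Authorization header needed
const fileResponse = await fetch(downloadUrl);
const bytes = await fileResponse.arrayBuffer();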

To process or reprocess a file:

Process a file

curl -X POST https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"force": false}'

Response:

{
  "jobId": "job_xyz789",
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "status": "pending",
  "message": "File processing job created"
}

Set force: true to reprocess a file that has already been processed.
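
For example, to force a reprocess and block until it finishes, reusing the waitForProcessing helper from earlier:

// Force a reprocess, then poll until it finishes
await fetch("https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ force: true }),
});

const reprocessed = await waitForProcessing("LvrGb8UaJk_IjmzaxuMAb");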

When you upload a PDF, it goes through:

  1. Text Extraction - Convert PDF pages to text
  2. Chunking - Split text into semantic chunks
  3. Embedding Generation - Create vector embeddings for each chunk
  4. Indexing - Store chunks for retrieval in pipelines

See the File Processing guide for details.

Update a file's name or custom metadata:

Update file

curl -X PUT https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"fileName": "renamed_data.csv", "metadata": {"source": "manual_upload"}}'

Delete a file

curl -X DELETE https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
-H "Authorization: Bearer $API_TOKEN"

Each file record has the following fields:

Field            | Type           | Description
fileId           | string         | Unique identifier
teamId           | string         | Team that owns this file
fileName         | string         | Original file name
fileSize         | number         | Size in bytes
mimeType         | string         | MIME type (e.g., text/csv)
fileHash         | string         | SHA-256 hash of file contents
storageKey       | string         | Internal storage path
uploadedBy       | string         | User ID who uploaded the file
sourceUrl        | string or null | Original URL if created via /files/from-url endpoint
metadata         | object         | Custom key-value metadata
processingStatus | object         | Latest processing event (see above)
createdAt        | timestamp      | Upload time
updatedAt        | timestamp      | Last modification time
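
For reference, the same record as a TypeScript type. This is a sketch derived from the table above, not an official SDK type:

// Sketch of the file record shape described above (not an official SDK type)
interface CatalyzedFile {
  fileId: string;
  teamId: string;
  fileName: string;
  fileSize: number;          // bytes
  mimeType: string;          // e.g. "text/csv"
  fileHash: string;          // SHA-256 of the file contents
  storageKey: string;        // internal storage path
  uploadedBy: string;        // user ID of the uploader
  sourceUrl?: string | null; // set when created via /files/from-url
  metadata: Record<string, unknown>;
  processingStatus: {
    eventType: "JOB_CREATED" | "JOB_STARTED" | "JOB_SUCCEEDED" | "JOB_FAILED";
    createdAt: string;
    processorVersion?: string;
    eventData?: Record<string, unknown>;
  } | null;
  createdAt: string;         // ISO 8601 upload time
  updatedAt: string;         // ISO 8601 last modification time
}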

See the Files API for complete endpoint documentation.