Files

Files allow you to upload data for processing. Uploaded files can be processed for text extraction (PDFs, documents) or converted into table data (CSV, JSON).

Upload a file

curl -X POST https://api.catalyzed.ai/files \
-H "Authorization: Bearer $API_TOKEN" \
-F "teamId=ZkoDMyjZZsXo4VAO_nJLk" \
-F "file=@data.csv"

Response:

{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "data.csv",
  "fileSize": 1048576,
  "mimeType": "text/csv",
  "fileHash": "a1b2c3d4...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": null,
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:30:00Z"
}
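
The same upload from TypeScript, as a rough sketch; it assumes the multipart field carrying the file contents is named file, matching the curl example above:

// Minimal upload sketch; assumes the multipart file field is named "file"
async function uploadFile(apiToken: string, teamId: string, file: Blob, fileName: string) {
  const form = new FormData();
  form.append("teamId", teamId);
  form.append("file", file, fileName);

  const response = await fetch("https://api.catalyzed.ai/files", {
    method: "POST",
    // Don't set Content-Type manually; fetch adds the multipart boundary
    headers: { Authorization: `Bearer ${apiToken}` },
    body: form,
  });
  if (!response.ok) throw new Error(`Upload failed with status ${response.status}`);
  return response.json(); // the file record shown above
}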

Instead of uploading a file directly, you can provide a URL and Catalyzed will fetch the file for you. This is useful for ingesting data from public downloads, API endpoints, or cloud storage presigned URLs.

Create file from URL

curl -X POST https://api.catalyzed.ai/files/from-url \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"teamId": "ZkoDMyjZZsXo4VAO_nJLk",
"url": "https://example.com/data/report.pdf"
}'

Response (same format as a direct upload, plus a sourceUrl field):

{
  "fileId": "NqrHc9VbKl_JknAbyvNBc",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "report.pdf",
  "fileSize": 2457600,
  "mimeType": "application/pdf",
  "fileHash": "b2c3d4e5...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/NqrHc9VbKl_JknAbyvNBc",
  "uploadedBy": "usr_abc123",
  "sourceUrl": "https://example.com/data/report.pdf",
  "metadata": {},
  "processingStatus": null,
  "createdAt": "2024-01-15T11:00:00Z",
  "updatedAt": "2024-01-15T11:00:00Z"
}

When you create a file from a URL:

  1. Fetch: Catalyzed fetches the file from the provided URL with a 30-second timeout
  2. Filename Extraction: The filename is determined from:
    • Content-Disposition header (if present)
    • URL path (e.g., https://example.com/data/report.pdf → report.pdf)
    • Falls back to "file" if neither is available
  3. Content-Type Parsing: MIME type is extracted from the Content-Type header, stripping any parameters like charset
  4. Size Validation: Files must be under 100MB
  5. Hash Computation: SHA-256 hash is computed for deduplication
  6. Storage: File is uploaded to S3 storage
  7. Database Record: A file record is created with the sourceUrl field set
  8. Auto-Processing: Processing is automatically triggered (same as direct upload)

The source URL must meet these requirements:

  • Must be a valid HTTP or HTTPS URL
  • Must be publicly accessible (no authentication required)
  • Response must complete within 30 seconds
  • File size must be under 100MB
  • Must return an appropriate Content-Type header
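
If you want to catch obvious failures before calling the API, you can pre-check a URL against the requirements above. A minimal sketch, assuming the remote server answers HEAD requests and reports Content-Length:

// Hypothetical client-side pre-check mirroring the URL requirements above
async function precheckUrl(url: string): Promise<void> {
  const parsed = new URL(url);
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error("URL must use http or https");
  }

  // 30-second budget, mirroring Catalyzed's fetch timeout
  const response = await fetch(url, { method: "HEAD", signal: AbortSignal.timeout(30_000) });
  if (!response.ok) throw new Error(`URL returned ${response.status}`);

  const size = Number(response.headers.get("content-length") ?? 0);
  if (size > 100 * 1024 * 1024) throw new Error("File exceeds the 100MB limit");

  if (!response.headers.get("content-type")) {
    throw new Error("Server does not return a Content-Type header");
  }
}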

The filename is extracted in this order:

  1. Content-Disposition header:
    Content-Disposition: attachment; filename="Q4-2024-Report.pdf"
  2. URL path: The last segment of the URL path
    https://storage.example.com/reports/2024/annual.pdf → "annual.pdf"
  3. Fallback: If neither is available, defaults to "file"
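
If you need to predict the resulting fileName client-side, the documented order can be mirrored with a small helper. This is a sketch of the rules above, not Catalyzed's actual implementation:

// Predict the fileName Catalyzed will assign, following the documented order
function predictFileName(contentDisposition: string | null, url: string): string {
  // 1. Content-Disposition header, e.g. attachment; filename="Q4-2024-Report.pdf"
  const match = contentDisposition?.match(/filename="?([^";]+)"?/i);
  if (match) return match[1];

  // 2. Last segment of the URL path
  const lastSegment = new URL(url).pathname.split("/").filter(Boolean).pop();
  if (lastSegment) return lastSegment;

  // 3. Fallback
  return "file";
}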

Files created from URLs have a sourceUrl field that preserves the original URL. This is useful for:

  • Auditing where files came from
  • Re-fetching updated versions
  • Tracking external data sources

Example query to find files from a specific source:

const response = await fetch(
  "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const files = await response.json();

// Filter by source URL
const filesFromExternalAPI = files.files.filter(
  (f) => f.sourceUrl?.startsWith("https://external-api.example.com")
);

Common errors when creating files from URLs:

Error                    | Cause                                          | Solution
Failed to fetch from URL | URL is unreachable or returned non-2xx status  | Verify URL is accessible and returns 200 OK
File size exceeds limit  | File is larger than 100MB                      | Use direct upload or compress the file
Request timeout          | Download took longer than 30 seconds           | Use direct upload for large/slow downloads
Invalid MIME type        | Content-Type header is missing or unsupported  | Ensure server returns correct Content-Type
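
These failures surface as non-2xx responses from /files/from-url. A minimal handling sketch (the shape of the error body isn't documented here, so only the HTTP status is used):

// Create a file from a URL and surface ingestion failures
const response = await fetch("https://api.catalyzed.ai/files/from-url", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    url: "https://example.com/data/report.pdf",
  }),
});

if (!response.ok) {
  // e.g. the URL was unreachable, too large, or timed out (see table above)
  throw new Error(`from-url ingestion failed with status ${response.status}`);
}
const file = await response.json();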

A few common ways to use /files/from-url:

Public Dataset Ingestion:

// Fetch a public dataset
await fetch("https://api.catalyzed.ai/files/from-url", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    url: "https://data.gov/datasets/covid-19/data.csv",
  }),
});

Webhook Integration:

// Handle a webhook that provides a file URL
app.post("/webhook/file-ready", async (req, res) => {
  const { fileUrl } = req.body;

  // Ingest the file from the webhook URL
  await fetch("https://api.catalyzed.ai/files/from-url", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      teamId: "ZkoDMyjZZsXo4VAO_nJLk",
      url: fileUrl,
    }),
  });

  res.sendStatus(200);
});

Cloud Storage URLs:

// Ingest from an S3 presigned URL
const presignedUrl = generatePresignedUrl("my-bucket", "data/report.xlsx");

await fetch("https://api.catalyzed.ai/files/from-url", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    url: presignedUrl,
  }),
});

Supported file types and how each is processed:

Type      | Extensions    | Processing
CSV       | .csv          | Parse into table rows
JSON      | .json, .jsonl | Parse into table rows
Parquet   | .parquet      | Direct table import
PDF       | .pdf          | Text extraction, chunking, embeddings
Documents | .docx, .txt   | Text extraction, chunking, embeddings
Excel     | .xlsx, .xls   | Parse into table rows

Files use an event-based processing model. The processingStatus field shows the latest processing event:

Event Type    | Description
JOB_CREATED   | Processing job has been queued
JOB_STARTED   | Processing has begun
JOB_SUCCEEDED | Processing completed successfully
JOB_FAILED    | Processing failed (check eventData for error)

A file with processingStatus: null has never been processed.
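
For example, a small helper that uses these events to decide whether a file still needs (re)processing:

// Decide whether a file needs (re)processing from its latest event
function needsProcessing(file: { processingStatus: { eventType: string } | null }): boolean {
  const status = file.processingStatus?.eventType ?? null;
  // Never processed, or the last attempt failed
  return status === null || status === "JOB_FAILED";
}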

Get file status

curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
-H "Authorization: Bearer $API_TOKEN"

Response with processing status:

{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "data.csv",
  "fileSize": 1048576,
  "mimeType": "text/csv",
  "fileHash": "a1b2c3d4...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": {
    "eventType": "JOB_SUCCEEDED",
    "createdAt": "2024-01-15T10:31:00Z",
    "processorVersion": "1.2.0",
    "eventData": {}
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:31:00Z"
}

For files that require processing (PDFs, documents), poll the file until processing finishes:

async function waitForProcessing(fileId: string): Promise<File> {
  while (true) {
    const response = await fetch(`https://api.catalyzed.ai/files/${fileId}`, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    const file = await response.json();

    const status = file.processingStatus?.eventType;
    if (status === "JOB_SUCCEEDED") return file;
    if (status === "JOB_FAILED") {
      throw new Error(file.processingStatus.eventData?.error || "Processing failed");
    }

    await new Promise((r) => setTimeout(r, 2000)); // Poll every 2 seconds
  }
}
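
For example, after uploading a PDF:

// Block until the uploaded PDF has been chunked and embedded
const processed = await waitForProcessing("LvrGb8UaJk_IjmzaxuMAb");
// processed.processingStatus.eventType is now "JOB_SUCCEEDED"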

List files

curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk" \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "files": [
    {
      "fileId": "LvrGb8UaJk_IjmzaxuMAb",
      "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
      "fileName": "data.csv",
      "fileSize": 1048576,
      "mimeType": "text/csv",
      "fileHash": "a1b2c3d4...",
      "storageKey": "...",
      "uploadedBy": "usr_abc123",
      "metadata": {},
      "processingStatus": {
        "eventType": "JOB_SUCCEEDED",
        "createdAt": "2024-01-15T10:31:00Z",
        "processorVersion": "1.2.0"
      },
      "createdAt": "2024-01-15T10:30:00Z",
      "updatedAt": "2024-01-15T10:31:00Z"
    }
  ],
  "total": 1,
  "page": 1,
  "pageSize": 20
}

Filter by processing status or MIME type:

# Only successfully processed files
curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk&processingStatus=JOB_SUCCEEDED" \
-H "Authorization: Bearer $API_TOKEN"

# Files that have never been processed (use "none")
curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk&processingStatus=none" \
-H "Authorization: Bearer $API_TOKEN"

# Only PDFs
curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk&mimeTypes=application/pdf" \
-H "Authorization: Bearer $API_TOKEN"

The download endpoint returns presigned URLs for secure access:

Get download URL

curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "downloadUrl": "https://storage.example.com/files/...?signature=...",
  "previewUrl": "https://storage.example.com/files/...?signature=...&inline=true",
  "expiresAt": "2024-01-15T11:30:00Z"
}
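
The presigned downloadUrl carries its own signature, so it can be fetched as-is until expiresAt. A minimal sketch:

// Get a presigned URL, then fetch the file contents directly
const urlResponse = await fetch(
  "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const { downloadUrl } = await urlResponse.json();

// The presigned URL is self-authenticating; no Authorization header needed
const fileResponse = await fetch(downloadUrl);
const bytes = await fileResponse.arrayBuffer();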

To process or reprocess a file:

Process a file

curl -X POST https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"force": false}'

Response:

{
  "jobId": "job_xyz789",
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "status": "pending",
  "message": "File processing job created"
}

Set force: true to reprocess a file that has already been processed.
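
For example, to force a reprocess and block until it finishes, reusing the waitForProcessing helper from earlier:

// Force a reprocess, then poll until it finishes
await fetch("https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ force: true }),
});

const reprocessed = await waitForProcessing("LvrGb8UaJk_IjmzaxuMAb");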

When you upload a PDF, it goes through:

  1. Text Extraction - Convert PDF pages to text
  2. Chunking - Split text into semantic chunks
  3. Embedding Generation - Create vector embeddings for each chunk
  4. Indexing - Store chunks for retrieval in pipelines

See the File Processing guide for details.

Update a file's name or custom metadata:

Update file

curl -X PUT https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"fileName": "renamed_data.csv", "metadata": {"source": "manual_upload"}}'

Delete a file

curl -X DELETE https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
-H "Authorization: Bearer $API_TOKEN"

Each file record has the following fields:

Field            | Type           | Description
fileId           | string         | Unique identifier
teamId           | string         | Team that owns this file
fileName         | string         | Original file name
fileSize         | number         | Size in bytes
mimeType         | string         | MIME type (e.g., text/csv)
fileHash         | string         | SHA-256 hash of file contents
storageKey       | string         | Internal storage path
uploadedBy       | string         | User ID who uploaded the file
sourceUrl        | string or null | Original URL if created via /files/from-url endpoint
metadata         | object         | Custom key-value metadata
processingStatus | object         | Latest processing event (see above)
createdAt        | timestamp      | Upload time
updatedAt        | timestamp      | Last modification time
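
For reference, the same record as a TypeScript type. This is a sketch derived from the table above, not an official SDK type:

// Sketch of the file record shape described above (not an official SDK type)
interface CatalyzedFile {
  fileId: string;
  teamId: string;
  fileName: string;
  fileSize: number;          // bytes
  mimeType: string;          // e.g. "text/csv"
  fileHash: string;          // SHA-256 of the file contents
  storageKey: string;        // internal storage path
  uploadedBy: string;        // user ID of the uploader
  sourceUrl?: string | null; // set when created via /files/from-url
  metadata: Record<string, unknown>;
  processingStatus: {
    eventType: "JOB_CREATED" | "JOB_STARTED" | "JOB_SUCCEEDED" | "JOB_FAILED";
    createdAt: string;
    processorVersion?: string;
    eventData?: Record<string, unknown>;
  } | null;
  createdAt: string;         // ISO 8601 upload time
  updatedAt: string;         // ISO 8601 last modification time
}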

See the Files API for complete endpoint documentation.