Files
Files allow you to upload data for processing. Uploaded files can be processed for text extraction (PDFs, documents) or converted into table data (CSV, JSON).
Uploading Files
Upload a file

```bash
curl -X POST https://api.catalyzed.ai/files \
  -H "Authorization: Bearer $API_TOKEN" \
  -F "file=@data.csv" \
  -F "teamId=ZkoDMyjZZsXo4VAO_nJLk"
```

```javascript
const formData = new FormData();
formData.append("file", fileBlob, "data.csv");
formData.append("teamId", "ZkoDMyjZZsXo4VAO_nJLk");

const response = await fetch("https://api.catalyzed.ai/files", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiToken}` },
  body: formData,
});
const file = await response.json();
```

```python
import requests

with open("data.csv", "rb") as f:
    response = requests.post(
        "https://api.catalyzed.ai/files",
        headers={"Authorization": f"Bearer {api_token}"},
        files={"file": ("data.csv", f)},
        data={"teamId": "ZkoDMyjZZsXo4VAO_nJLk"},
    )
file = response.json()
```

Response:

```json
{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "data.csv",
  "fileSize": 1048576,
  "mimeType": "text/csv",
  "fileHash": "a1b2c3d4...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": null,
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:30:00Z"
}
```

Creating Files from URL
Instead of uploading a file directly, you can provide a URL and Catalyzed will fetch the file for you. This is useful for ingesting data from public URLs, API endpoints, or cloud storage URLs.
Create file from URL

```bash
curl -X POST https://api.catalyzed.ai/files/from-url \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
    "url": "https://example.com/data/report.pdf"
  }'
```

```javascript
const response = await fetch("https://api.catalyzed.ai/files/from-url", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    url: "https://example.com/data/report.pdf",
  }),
});
const file = await response.json();
```

```python
import requests

response = requests.post(
    "https://api.catalyzed.ai/files/from-url",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
        "url": "https://example.com/data/report.pdf",
    },
)
file = response.json()
```

Response (same format as direct upload, with sourceUrl set):

```json
{
  "fileId": "NqrHc9VbKl_JknAbyvNBc",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "report.pdf",
  "fileSize": 2457600,
  "mimeType": "application/pdf",
  "fileHash": "b2c3d4e5...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/NqrHc9VbKl_JknAbyvNBc",
  "uploadedBy": "usr_abc123",
  "sourceUrl": "https://example.com/data/report.pdf",
  "metadata": {},
  "processingStatus": null,
  "createdAt": "2024-01-15T11:00:00Z",
  "updatedAt": "2024-01-15T11:00:00Z"
}
```

How It Works
When you create a file from a URL:
- Fetch: Catalyzed fetches the file from the provided URL with a 30-second timeout
- Filename Extraction: The filename is determined from:
  - Content-Disposition header (if present)
  - URL path (e.g., https://example.com/data/report.pdf → report.pdf)
  - Falls back to "file" if neither is available
- Content-Type Parsing: MIME type is extracted from the Content-Type header, stripping any parameters like charset
- Size Validation: Files must be under 100MB
- Hash Computation: SHA-256 hash is computed for deduplication
- Storage: File is uploaded to S3 storage
- Database Record: A file record is created with the sourceUrl field set
- Auto-Processing: Processing is automatically triggered (same as direct upload)
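The metadata steps in this list (filename extraction, Content-Type parsing, size validation, hashing) can be approximated locally. The following is a minimal sketch, not Catalyzed's actual implementation: the function names, the exact byte threshold for "100MB", and the assumption that header names arrive in canonical casing are all hypothetical.

```python
import hashlib
import posixpath
from email.message import Message
from typing import Mapping, Optional
from urllib.parse import urlparse

# "Under 100MB" comes from the list above; reading it as 100 * 1024 * 1024 bytes is an assumption.
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024


def resolve_filename(url: str, content_disposition: Optional[str]) -> str:
    """Content-Disposition first, then the last URL path segment, then "file"."""
    if content_disposition:
        header = Message()
        header["Content-Disposition"] = content_disposition
        name = header.get_filename()  # None when no filename parameter is present
        if name:
            return name
    segment = posixpath.basename(urlparse(url).path)
    return segment or "file"


def parse_mime_type(content_type: str) -> str:
    """Drop parameters such as charset from a Content-Type header value."""
    return content_type.split(";", 1)[0].strip().lower()


def validate_size(size_in_bytes: int) -> None:
    """Reject files that are not under the size limit."""
    if size_in_bytes >= MAX_FILE_SIZE_BYTES:
        raise ValueError("File size exceeds limit")


def describe_download(url: str, headers: Mapping[str, str], body: bytes) -> dict:
    """Derive the record fields that depend only on the fetched response."""
    validate_size(len(body))
    return {
        "fileName": resolve_filename(url, headers.get("Content-Disposition")),
        "mimeType": parse_mime_type(headers.get("Content-Type", "")),
        "fileSize": len(body),
        "fileHash": hashlib.sha256(body).hexdigest(),
        "sourceUrl": url,
    }
```

Running a download through `describe_download` before calling `/files/from-url` gives a rough preview of the `fileName`, `mimeType`, and `fileHash` the service is likely to record, which can help when debugging unexpected filenames.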
URL Requirements
- Must be a valid HTTP or HTTPS URL
- Must be publicly accessible (no authentication required)
- Response must complete within 30 seconds
- File size must be under 100MB
- Must return appropriate Content-Type header
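The first requirement can be checked on the client before a request is sent. This is a small sketch using only the standard library; `is_supported_url` is a hypothetical helper, and it covers only the scheme and host. Accessibility, timing, size, and headers can only be confirmed by actually fetching the URL.

```python
from urllib.parse import urlparse


def is_supported_url(url: str) -> bool:
    """Return True when the URL is an absolute HTTP or HTTPS URL with a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Rejecting values such as `ftp://...` or a bare file path early avoids a round trip that would most likely end in a fetch error.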
Filename Handling
The filename is extracted in this order:
- Content-Disposition header:
  Content-Disposition: attachment; filename="Q4-2024-Report.pdf"
- URL path: The last segment of the URL path
  https://storage.example.com/reports/2024/annual.pdf → "annual.pdf"
- Fallback: If neither is available, defaults to "file"
Source URL Tracking
Files created from URLs have a sourceUrl field that preserves the original URL. This is useful for:
- Auditing where files came from
- Re-fetching updated versions
- Tracking external data sources
Example query to find files from a specific source:

```javascript
const response = await fetch(
  "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const files = await response.json();

// Filter by source URL
const filesFromExternalAPI = files.files.filter(
  (f) => f.sourceUrl?.startsWith("https://external-api.example.com")
);
```

Error Handling
Common errors when creating files from URLs:
| Error | Cause | Solution |
|---|---|---|
| Failed to fetch from URL | URL is unreachable or returned non-2xx status | Verify URL is accessible and returns 200 OK |
| File size exceeds limit | File is larger than 100MB | Use direct upload or compress the file |
| Request timeout | Download took longer than 30 seconds | Use direct upload for large/slow downloads |
| Invalid MIME type | Content-Type header is missing or unsupported | Ensure server returns correct Content-Type |
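A client can turn this table into actionable messages. The sketch below assumes the error text from the table appears somewhere in the message the API returns; this page does not specify the error response shape (status codes, JSON fields), so `suggest_fix` and its substring matching are hypothetical.

```python
# Error text and solutions are copied from the table above.
REMEDIATION = {
    "Failed to fetch from URL": "Verify URL is accessible and returns 200 OK",
    "File size exceeds limit": "Use direct upload or compress the file",
    "Request timeout": "Use direct upload for large/slow downloads",
    "Invalid MIME type": "Ensure server returns correct Content-Type",
}


def suggest_fix(error_message: str) -> str:
    """Map an error message to the suggested solution (case-insensitive substring match)."""
    lowered = error_message.lower()
    for known_error, fix in REMEDIATION.items():
        if known_error.lower() in lowered:
            return fix
    return "Unrecognized error; see the Files API reference"


def should_fall_back_to_direct_upload(error_message: str) -> bool:
    """True for errors whose listed solution mentions direct upload."""
    return "direct upload" in suggest_fix(error_message).lower()
```

Keeping the mapping in one dictionary makes it easy to update if the real error strings turn out to differ.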
Use Cases
Public Dataset Ingestion:

```javascript
// Fetch a public dataset
await fetch("https://api.catalyzed.ai/files/from-url", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    url: "https://data.gov/datasets/covid-19/data.csv",
  }),
});
```

Webhook Integration:

```javascript
// Handle webhook that provides a file URL
app.post("/webhook/file-ready", async (req) => {
  const { fileUrl } = req.body;

  // Ingest file from webhook URL
  await fetch("https://api.catalyzed.ai/files/from-url", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      teamId: "ZkoDMyjZZsXo4VAO_nJLk",
      url: fileUrl,
    }),
  });
});
```

Cloud Storage URLs:

```javascript
// Ingest from S3 presigned URL
const presignedUrl = generatePresignedUrl("my-bucket", "data/report.xlsx");

await fetch("https://api.catalyzed.ai/files/from-url", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    teamId: "ZkoDMyjZZsXo4VAO_nJLk",
    url: presignedUrl,
  }),
});
```

Supported File Types
| Type | Extensions | Processing |
|---|---|---|
| CSV | .csv | Parse into table rows |
| JSON | .json, .jsonl | Parse into table rows |
| Parquet | .parquet | Direct table import |
| PDF | .pdf | Text extraction, chunking, embeddings |
| Documents | .docx, .txt | Text extraction, chunking, embeddings |
| Excel | .xlsx, .xls | Parse into table rows |
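To predict how a file will be handled before uploading it, the table can be encoded as a lookup. This is an illustrative sketch: `expected_processing` is a hypothetical helper, and it keys on the file extension, whereas the service may well decide based on MIME type instead.

```python
import os
from typing import Optional

# Extension → processing, transcribed from the table above.
PROCESSING_BY_EXTENSION = {
    ".csv": "Parse into table rows",
    ".json": "Parse into table rows",
    ".jsonl": "Parse into table rows",
    ".parquet": "Direct table import",
    ".pdf": "Text extraction, chunking, embeddings",
    ".docx": "Text extraction, chunking, embeddings",
    ".txt": "Text extraction, chunking, embeddings",
    ".xlsx": "Parse into table rows",
    ".xls": "Parse into table rows",
}


def expected_processing(file_name: str) -> Optional[str]:
    """Return the processing listed for the file's extension, or None if it is not listed."""
    _, extension = os.path.splitext(file_name)
    return PROCESSING_BY_EXTENSION.get(extension.lower())
```

A `None` result means the extension is not in the table, which is a reasonable cue to check the file type before uploading.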
Processing Status
Files use an event-based processing model. The processingStatus field shows the latest processing event:
| Event Type | Description |
|---|---|
| JOB_CREATED | Processing job has been queued |
| JOB_STARTED | Processing has begun |
| JOB_SUCCEEDED | Processing completed successfully |
| JOB_FAILED | Processing failed (check eventData for error) |
A file with processingStatus: null has never been processed.
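Because `processingStatus` can be either `null` or an event object, client code needs to handle both cases. The sketch below shows one way to do that; the helper names and the short state labels are hypothetical, while the event types are the ones listed in the table.

```python
from typing import Any, Mapping, Optional

# Events after which no further processing events are expected for the current job.
TERMINAL_EVENTS = {"JOB_SUCCEEDED", "JOB_FAILED"}


def latest_event_type(file_record: Mapping[str, Any]) -> Optional[str]:
    """Return the latest event type, or None when the file has never been processed."""
    status = file_record.get("processingStatus")
    return status.get("eventType") if status else None


def describe_state(file_record: Mapping[str, Any]) -> str:
    """Translate the latest processing event into a short human-readable state."""
    labels = {
        None: "never processed",
        "JOB_CREATED": "queued",
        "JOB_STARTED": "running",
        "JOB_SUCCEEDED": "succeeded",
        "JOB_FAILED": "failed",
    }
    return labels.get(latest_event_type(file_record), "unknown")


def is_finished(file_record: Mapping[str, Any]) -> bool:
    """True once the latest event is JOB_SUCCEEDED or JOB_FAILED."""
    return latest_event_type(file_record) in TERMINAL_EVENTS
```

Centralizing the `null` check in `latest_event_type` keeps the rest of the code from indexing into a missing status object.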
Checking File Status
Get file status

```bash
curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```

```javascript
const response = await fetch("https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb", {
  headers: { Authorization: `Bearer ${apiToken}` },
});
const file = await response.json();
console.log(file.processingStatus?.eventType); // "JOB_SUCCEEDED"
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb",
    headers={"Authorization": f"Bearer {api_token}"},
)
file = response.json()
# processingStatus is null for files that have never been processed
status = file["processingStatus"]
print(status["eventType"] if status else None)  # "JOB_SUCCEEDED"
```

Response with processing status:

```json
{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "data.csv",
  "fileSize": 1048576,
  "mimeType": "text/csv",
  "fileHash": "a1b2c3d4...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": {
    "eventType": "JOB_SUCCEEDED",
    "createdAt": "2024-01-15T10:31:00Z",
    "processorVersion": "1.2.0",
    "eventData": {}
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:31:00Z"
}
```

Polling for Completion
For files that require processing (PDFs, documents):

```typescript
async function waitForProcessing(fileId: string): Promise<File> {
  while (true) {
    const response = await fetch(`https://api.catalyzed.ai/files/${fileId}`, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    // Stop polling on HTTP errors instead of looping forever
    if (!response.ok) {
      throw new Error(`Status check failed with HTTP ${response.status}`);
    }
    const file = await response.json();

    const status = file.processingStatus?.eventType;
    if (status === "JOB_SUCCEEDED") return file;
    if (status === "JOB_FAILED") {
      throw new Error(file.processingStatus.eventData?.error || "Processing failed");
    }

    await new Promise(r => setTimeout(r, 2000)); // Poll every 2 seconds
  }
}
```

Listing Files
List files

```bash
curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk" \
  -H "Authorization: Bearer $API_TOKEN"
```

```javascript
const response = await fetch(
  "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const { files, total, page, pageSize } = await response.json();
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/files",
    params={"teamIds": "ZkoDMyjZZsXo4VAO_nJLk"},
    headers={"Authorization": f"Bearer {api_token}"},
)
data = response.json()
files = data["files"]
```

Response:

```json
{
  "files": [
    {
      "fileId": "LvrGb8UaJk_IjmzaxuMAb",
      "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
      "fileName": "data.csv",
      "fileSize": 1048576,
      "mimeType": "text/csv",
      "fileHash": "a1b2c3d4...",
      "storageKey": "...",
      "uploadedBy": "usr_abc123",
      "metadata": {},
      "processingStatus": {
        "eventType": "JOB_SUCCEEDED",
        "createdAt": "2024-01-15T10:31:00Z",
        "processorVersion": "1.2.0"
      },
      "createdAt": "2024-01-15T10:30:00Z",
      "updatedAt": "2024-01-15T10:31:00Z"
    }
  ],
  "total": 1,
  "page": 1,
  "pageSize": 20
}
```

Filtering Files
Filter by processing status or MIME type:

```bash
# Only successfully processed files
curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk&processingStatus=JOB_SUCCEEDED" \
  -H "Authorization: Bearer $API_TOKEN"

# Files that haven't been processed (use "none")
curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk&processingStatus=none" \
  -H "Authorization: Bearer $API_TOKEN"

# Only PDFs
curl "https://api.catalyzed.ai/files?teamIds=ZkoDMyjZZsXo4VAO_nJLk&mimeTypes=application/pdf" \
  -H "Authorization: Bearer $API_TOKEN"
```

Downloading Files
The download endpoint returns presigned URLs for secure access:
Get download URL

```bash
curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download \
  -H "Authorization: Bearer $API_TOKEN"
```

```javascript
const response = await fetch(
  "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const { downloadUrl, previewUrl, expiresAt } = await response.json();

// Use the presigned URL to download the file
const fileResponse = await fetch(downloadUrl);
const blob = await fileResponse.blob();
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download",
    headers={"Authorization": f"Bearer {api_token}"},
)
urls = response.json()

# Use the presigned URL to download the file
file_response = requests.get(urls["downloadUrl"])
with open("downloaded_file.csv", "wb") as f:
    f.write(file_response.content)
```

Response:

```json
{
  "downloadUrl": "https://storage.example.com/files/...?signature=...",
  "previewUrl": "https://storage.example.com/files/...?signature=...&inline=true",
  "expiresAt": "2024-01-15T11:30:00Z"
}
```

Processing Files
Trigger Processing
To process or reprocess a file:
Process a file

```bash
curl -X POST https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": false}'
```

```javascript
const response = await fetch("https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ force: false }),
});
const { jobId, status, message } = await response.json();
```

```python
import requests

response = requests.post(
    "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process",
    headers={"Authorization": f"Bearer {api_token}"},
    json={"force": False},
)
result = response.json()
print(result["jobId"], result["status"])
```

Response:

```json
{
  "jobId": "job_xyz789",
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "status": "pending",
  "message": "File processing job created"
}
```

Set force: true to reprocess a file that has already been processed.
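One way to choose the `force` value is to look at the file's latest processing event before calling the process endpoint. This is a sketch under a stated assumption: it treats only `JOB_SUCCEEDED` as "already processed". Whether a file whose last event is `JOB_FAILED` also needs `force: true` is not stated here, and `build_process_request` is a hypothetical helper.

```python
from typing import Any, Dict, Mapping


def build_process_request(file_record: Mapping[str, Any]) -> Dict[str, bool]:
    """Build the JSON body for POST /files/{fileId}/process.

    Assumption: only files whose latest event is JOB_SUCCEEDED count as
    already processed and therefore need force=True.
    """
    status = file_record.get("processingStatus")
    already_processed = bool(status) and status.get("eventType") == "JOB_SUCCEEDED"
    return {"force": already_processed}
```

The returned dictionary can be passed as the `json=` argument of `requests.post` in the Python example above.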
PDF Processing Pipeline
When you upload a PDF, it goes through:
- Text Extraction - Convert PDF pages to text
- Chunking - Split text into semantic chunks
- Embedding Generation - Create vector embeddings for each chunk
- Indexing - Store chunks for retrieval in pipelines
See the File Processing guide for details.
Updating Files
Update file metadata:
Update file

```bash
curl -X PUT https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"fileName": "renamed_data.csv", "metadata": {"source": "manual_upload"}}'
```

```javascript
await fetch("https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb", {
  method: "PUT",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    fileName: "renamed_data.csv",
    metadata: { source: "manual_upload" },
  }),
});
```

```python
import requests

requests.put(
    "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb",
    headers={"Authorization": f"Bearer {api_token}"},
    json={"fileName": "renamed_data.csv", "metadata": {"source": "manual_upload"}},
)
```

Deleting Files
Delete a file

```bash
curl -X DELETE https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```

```javascript
await fetch("https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb", {
  method: "DELETE",
  headers: { Authorization: `Bearer ${apiToken}` },
});
```

```python
import requests

requests.delete(
    "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb",
    headers={"Authorization": f"Bearer {api_token}"},
)
```

File Properties
| Field | Type | Description |
|---|---|---|
| fileId | string | Unique identifier |
| teamId | string | Team that owns this file |
| fileName | string | Original file name |
| fileSize | number | Size in bytes |
| mimeType | string | MIME type (e.g., text/csv) |
| fileHash | string | SHA-256 hash of file contents |
| storageKey | string | Internal storage path |
| uploadedBy | string | User ID who uploaded the file |
| sourceUrl | string \| null | Original URL if created via /files/from-url endpoint |
| metadata | object | Custom key-value metadata |
| processingStatus | object \| null | Latest processing event, or null if never processed (see Processing Status) |
| createdAt | timestamp | Upload time |
| updatedAt | timestamp | Last modification time |
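For typed access in client code, the properties above can be mirrored in a small data class. This is an illustrative sketch rather than an official SDK type: `FileRecord` and its snake_case attribute names are hypothetical, and the timestamp parsing assumes the ISO 8601 format with a trailing `Z` seen in the example responses.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Mapping, Optional


def _parse_timestamp(value: str) -> datetime:
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so it is rewritten as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass
class FileRecord:
    file_id: str
    team_id: str
    file_name: str
    file_size: int
    mime_type: str
    file_hash: str
    storage_key: str
    uploaded_by: str
    created_at: datetime
    updated_at: datetime
    source_url: Optional[str] = None  # absent or null for direct uploads
    metadata: Dict[str, Any] = field(default_factory=dict)
    processing_status: Optional[Dict[str, Any]] = None  # null if never processed

    @classmethod
    def from_json(cls, payload: Mapping[str, Any]) -> "FileRecord":
        """Build a record from a decoded API response body."""
        return cls(
            file_id=payload["fileId"],
            team_id=payload["teamId"],
            file_name=payload["fileName"],
            file_size=payload["fileSize"],
            mime_type=payload["mimeType"],
            file_hash=payload["fileHash"],
            storage_key=payload["storageKey"],
            uploaded_by=payload["uploadedBy"],
            created_at=_parse_timestamp(payload["createdAt"]),
            updated_at=_parse_timestamp(payload["updatedAt"]),
            source_url=payload.get("sourceUrl"),
            metadata=dict(payload.get("metadata") or {}),
            processing_status=payload.get("processingStatus"),
        )
```

Calling `FileRecord.from_json(response.json())` on an upload or status response yields an object whose optional fields (`source_url`, `processing_status`) are `None` when the API omits them or returns `null`.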
API Reference
See the Files API for complete endpoint documentation.