# File Processing
Catalyzed can process various file types. This guide covers uploading files and understanding the processing pipeline.
## Supported File Types

| Type | Extensions | Processing |
|---|---|---|
| CSV | .csv | Parse rows, infer schema |
| JSON | .json, .jsonl | Parse objects/arrays |
| Parquet | .parquet | Direct import |
| Excel | .xlsx, .xls | Parse worksheets |
| PDF | .pdf | Text extraction, chunking, embeddings |
| Documents | .docx, .txt, .md | Text extraction, chunking, embeddings |
## Uploading Files

**Upload a file**

```bash
curl -X POST https://api.catalyzed.ai/files \
  -H "Authorization: Bearer $API_TOKEN" \
  -F "file=@document.pdf" \
  -F "teamId=ZkoDMyjZZsXo4VAO_nJLk"
```

```typescript
const formData = new FormData();
formData.append("file", fileBlob, "document.pdf");
formData.append("teamId", "ZkoDMyjZZsXo4VAO_nJLk");

const response = await fetch("https://api.catalyzed.ai/files", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiToken}` },
  body: formData,
});
const file = await response.json();
```

```python
import requests

with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://api.catalyzed.ai/files",
        headers={"Authorization": f"Bearer {api_token}"},
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"teamId": "ZkoDMyjZZsXo4VAO_nJLk"},
    )
file = response.json()
```
## Processing Pipeline

### For Data Files (CSV, JSON, Parquet, Excel)

1. **Upload** - File is stored securely
2. **Validation** - File format and structure validated
3. **Schema Inference** - Column types detected
4. **Ready** - File available for import into tables
### For Documents (PDF, DOCX, TXT)

1. **Upload** - File is stored securely
2. **Text Extraction** - Convert to plain text
3. **Chunking** - Split into semantic chunks
4. **Embedding Generation** - Create vector embeddings
5. **Structured Extraction** - Extract structured data (PDFs only)
6. **File Summary Generation** - Create AI-powered document summary
7. **Indexing** - Store for retrieval
8. **Ready** - File available for pipeline context
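To make the flow concrete, here is a minimal end-to-end sketch: upload a PDF, wait for a terminal processing event (simple polling for brevity; the SSE endpoint described below is preferred), then fetch the generated summary. As in the other examples, `apiToken` is assumed to be in scope.

```typescript
// End-to-end document flow sketch: upload, wait, fetch summary.
async function processDocument(pdf: Blob, teamId: string) {
  const form = new FormData();
  form.append("file", pdf, "document.pdf");
  form.append("teamId", teamId);

  const upload = await fetch("https://api.catalyzed.ai/files", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
    body: form,
  });
  const { fileId } = await upload.json();

  // Poll until a terminal event arrives (see "Awaiting Processing
  // Completion" below for the recommended SSE approach).
  for (;;) {
    const res = await fetch(`https://api.catalyzed.ai/files/${fileId}`, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    const file = await res.json();
    const status = file.processingStatus?.eventType;
    if (status === "JOB_SUCCEEDED") break;
    if (status === "JOB_FAILED") throw new Error("Processing failed");
    await new Promise((r) => setTimeout(r, 2000));
  }

  // Fetch the AI-generated summary (see "File Summaries" below).
  const summary = await fetch(
    `https://api.catalyzed.ai/file-summaries/${fileId}`,
    { headers: { Authorization: `Bearer ${apiToken}` } }
  );
  return summary.json();
}
```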
## Processing Status

Files use an event-based processing model. The `processingStatus` field shows the latest processing event:
| Event Type | Description |
|---|---|
| `JOB_CREATED` | Processing job has been queued |
| `JOB_STARTED` | Processing has begun |
| `JOB_SUCCEEDED` | Processing completed successfully |
| `JOB_FAILED` | Processing failed (check `eventData` for the error) |
A file with `processingStatus: null` has never been processed.
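For example, a small helper that interprets this status shape (a sketch in TypeScript):

```typescript
// Interpret a file's latest processing event (sketch).
type ProcessingEvent = { eventType: string; eventData?: { error?: string } };

function describeProcessing(status: ProcessingEvent | null): string {
  if (status === null) return "never processed";
  switch (status.eventType) {
    case "JOB_CREATED":
    case "JOB_STARTED":
      return "in progress";
    case "JOB_SUCCEEDED":
      return "ready";
    case "JOB_FAILED":
      return `failed: ${status.eventData?.error ?? "unknown error"}`;
    default:
      return "unknown";
  }
}
```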
## Checking File Status

**Check file status**

```bash
curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
const response = await fetch("https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb", {
  headers: { Authorization: `Bearer ${apiToken}` },
});
const file = await response.json();
console.log(file.processingStatus?.eventType); // "JOB_SUCCEEDED"
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb",
    headers={"Authorization": f"Bearer {api_token}"},
)
file = response.json()
print(file["processingStatus"]["eventType"])  # "JOB_SUCCEEDED"
```
## Awaiting Processing Completion

The recommended way to wait for file processing is to use the Server-Sent Events (SSE) endpoint, which streams real-time status updates:
**Await file processing with SSE**

```bash
# Stream processing status updates until completion
curl -N https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/await-processing \
  -H "Authorization: Bearer $API_TOKEN"

# Output:
# event: processing-status
# data: {"eventType":"JOB_CREATED","createdAt":"2024-01-15T10:30:00Z","processorVersion":"1.2.0"}
#
# event: processing-status
# data: {"eventType":"JOB_STARTED","createdAt":"2024-01-15T10:30:05Z","processorVersion":"1.2.0"}
#
# event: processing-status
# data: {"eventType":"JOB_SUCCEEDED","createdAt":"2024-01-15T10:31:00Z","processorVersion":"1.2.0"}
#
# event: complete
# data: {"fileId":"LvrGb8UaJk_IjmzaxuMAb","finalStatus":"JOB_SUCCEEDED"}
```

```typescript
async function awaitProcessing(fileId: string): Promise<void> {
  const url = `https://api.catalyzed.ai/files/${fileId}/await-processing`;
  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${apiToken}` },
  });

  if (!response.body) throw new Error("No response body");

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });

    // SSE messages are separated by blank lines
    const messages = buffer.split("\n\n");
    buffer = messages.pop() || "";

    for (const message of messages) {
      if (!message.trim()) continue;

      const lines = message.split("\n");
      let eventType = "message";
      let data = "";

      for (const line of lines) {
        if (line.startsWith("event: ")) eventType = line.slice(7);
        else if (line.startsWith("data: ")) data = line.slice(6);
      }

      if (eventType === "complete") {
        const { finalStatus } = JSON.parse(data);
        if (finalStatus === "JOB_FAILED") {
          throw new Error("File processing failed");
        }
        return; // Success!
      } else if (eventType === "processing-status") {
        const status = JSON.parse(data);
        console.log(`Processing: ${status.eventType}`);
      }
    }
  }
}
```
```python
import json
import requests

def await_processing(file_id: str, api_token: str) -> None:
    """Wait for file processing using SSE."""
    url = f"https://api.catalyzed.ai/files/{file_id}/await-processing"

    with requests.get(
        url,
        headers={"Authorization": f"Bearer {api_token}"},
        stream=True,
    ) as response:
        response.raise_for_status()

        event_type = "message"
        for line in response.iter_lines():
            if not line:
                continue

            line_str = line.decode("utf-8")

            # Track the current event type, then act on its data line
            if line_str.startswith("event: "):
                event_type = line_str[7:]
            elif line_str.startswith("data: "):
                data = json.loads(line_str[6:])

                if event_type == "complete":
                    if data["finalStatus"] == "JOB_FAILED":
                        raise Exception("File processing failed")
                    return  # Success!
                elif event_type == "processing-status":
                    print(f"Processing: {data.get('eventType', 'unknown')}")
```

**Benefits of SSE over polling:**
- **Real-time updates**: Receive status changes immediately as they occur
- **Efficient**: No repeated requests; a single long-lived connection
- **Progress visibility**: See each processing step (CREATED → STARTED → SUCCEEDED)
- **Automatic reconnection**: Browsers handle reconnection automatically
## Polling Alternative

For environments that don’t support SSE, you can poll the file status endpoint:

```typescript
async function waitForProcessing(fileId: string): Promise<File> {
  const maxAttempts = 30;
  const delayMs = 2000;

  for (let i = 0; i < maxAttempts; i++) {
    const response = await fetch(`https://api.catalyzed.ai/files/${fileId}`, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    const file = await response.json();

    const status = file.processingStatus?.eventType;
    if (status === "JOB_SUCCEEDED") return file;
    if (status === "JOB_FAILED") {
      throw new Error(file.processingStatus.eventData?.error || "Processing failed");
    }

    await new Promise(r => setTimeout(r, delayMs));
  }

  throw new Error("File processing timed out");
}
```
## CSV Processing

### Upload and Check Schema

```bash
# 1. Upload CSV
curl -X POST https://api.catalyzed.ai/files \
  -H "Authorization: Bearer $API_TOKEN" \
  -F "file=@data.csv" \
  -F "teamId=ZkoDMyjZZsXo4VAO_nJLk"

# 2. Check file (after processing completes)
curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```

Response:

```json
{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "data.csv",
  "fileSize": 1048576,
  "mimeType": "text/csv",
  "fileHash": "a1b2c3d4...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": {
    "eventType": "JOB_SUCCEEDED",
    "createdAt": "2024-01-15T10:31:00Z",
    "processorVersion": "1.2.0"
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:31:00Z"
}
```
### CSV Options

Control how CSVs are parsed:

```bash
curl -X POST https://api.catalyzed.ai/files \
  -H "Authorization: Bearer $API_TOKEN" \
  -F "file=@data.csv" \
  -F "teamId=ZkoDMyjZZsXo4VAO_nJLk" \
  -F "options={\"delimiter\":\",\",\"hasHeader\":true,\"encoding\":\"utf-8\"}"
```

| Option | Default | Description |
|---|---|---|
| `delimiter` | `,` | Field delimiter |
| `hasHeader` | `true` | First row is the header |
| `encoding` | `utf-8` | File encoding |
| `nullValue` | `""` | Value to treat as NULL |
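The same upload with options in TypeScript (a sketch; `csvBlob` is a `Blob` you supply):

```typescript
// Upload a CSV with explicit parse options (mirrors the curl example above).
const formData = new FormData();
formData.append("file", csvBlob, "data.csv");
formData.append("teamId", "ZkoDMyjZZsXo4VAO_nJLk");
formData.append(
  "options",
  JSON.stringify({ delimiter: ";", hasHeader: true, encoding: "utf-8", nullValue: "NA" })
);

const response = await fetch("https://api.catalyzed.ai/files", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiToken}` },
  body: formData,
});
```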
## Importing CSV Data to Tables

While Catalyzed processes CSV files, there’s no direct “import file to table” endpoint. Instead, you need to download and parse the CSV yourself, then insert the data using the table rows endpoint.

Here’s the recommended workflow:

1. Upload the CSV file and wait for processing
2. Download the processed file or parse it client-side
3. Create a table with the appropriate schema
4. Insert the parsed rows into your table
### Option 1: Download and Parse (Recommended)

```typescript
import fs from "node:fs/promises";

async function importCsvToTable(
  csvFilePath: string,
  datasetId: string,
  tableName: string
): Promise<string> {
  // 1. Upload CSV
  const formData = new FormData();
  formData.append("file", new Blob([await fs.readFile(csvFilePath)]), csvFilePath);
  formData.append("teamId", teamId);

  const uploadResponse = await fetch("https://api.catalyzed.ai/files", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
    body: formData,
  });
  const { fileId } = await uploadResponse.json();

  // 2. Wait for processing
  await awaitProcessing(fileId);

  // 3. Download the file
  const downloadResponse = await fetch(
    `https://api.catalyzed.ai/files/${fileId}/download`,
    { headers: { Authorization: `Bearer ${apiToken}` } }
  );
  const { downloadUrl } = await downloadResponse.json();

  const fileResponse = await fetch(downloadUrl);
  const csvText = await fileResponse.text();

  // 4. Parse CSV (using a CSV parsing library)
  const rows = parseCSV(csvText); // Use csv-parse, papaparse, etc.

  // 5. Infer schema from data
  const fields = inferSchema(rows);

  // 6. Create table
  const tableResponse = await fetch("https://api.catalyzed.ai/dataset-tables", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      datasetId,
      tableName,
      fields,
      primaryKeyColumns: [fields[0].name],
    }),
  });
  const { tableId } = await tableResponse.json();

  // 7. Insert data in batches
  const batchSize = 5000;
  for (let i = 0; i < rows.length; i += batchSize) {
    const batch = rows.slice(i, i + batchSize);
    await fetch(
      `https://api.catalyzed.ai/dataset-tables/${tableId}/rows?mode=append`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${apiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify(batch),
      }
    );
  }

  return tableId;
}
```
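The `parseCSV` and `inferSchema` calls above are placeholders. As an illustration, a minimal `inferSchema` could sample rows and pick the narrowest type that fits every value; the type names used here (`string`, `number`, `boolean`) are assumptions, so match them to whatever field types the table-creation endpoint actually accepts:

```typescript
// Hypothetical schema inference: sample rows, pick the narrowest fitting type.
type Field = { name: string; type: "string" | "number" | "boolean" };

function inferSchema(rows: Record<string, string>[]): Field[] {
  const sample = rows.slice(0, 1000); // sampling keeps this cheap on large files
  const columns = Object.keys(sample[0] ?? {});

  return columns.map((name) => {
    const values = sample
      .map((row) => row[name])
      .filter((v) => v != null && v !== "");
    const allNumbers =
      values.length > 0 && values.every((v) => !Number.isNaN(Number(v)));
    const allBooleans =
      values.length > 0 && values.every((v) => v === "true" || v === "false");
    return { name, type: allBooleans ? "boolean" : allNumbers ? "number" : "string" };
  });
}
```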
### Option 2: Extract Content via API

For CSV and Excel files, you can retrieve the processed content in TOON format:

```bash
curl https://api.catalyzed.ai/files/file_abc123/extracted-content \
  -H "Authorization: Bearer $API_TOKEN"
```

Response for CSV files:

```json
{
  "type": "csv",
  "extractionId": "extr_xyz123",
  "content": {
    "toon": "...TOON-formatted data...",
    "totalRows": 1000,
    "csvMetadata": {
      "delimiter": ",",
      "hasHeader": true,
      "encoding": "utf-8"
    }
  }
}
```

Note: The `toon` format is an internal representation. For practical CSV import workflows, downloading and parsing the CSV file directly (Option 1) is recommended.
### Tips for Large CSV Files

- **Batch the inserts**: Insert in batches of 1,000-5,000 rows to avoid timeouts
- **Use Arrow IPC format**: For files over 10 MB, use Arrow IPC instead of JSON for better performance
- **Stream parsing**: Use streaming CSV parsers for very large files to avoid loading everything into memory (see the sketch after this list)
- **Monitor progress**: Check table row counts with the `/dataset-tables/{tableId}` endpoint
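As an illustration of the stream-parsing tip, here is a sketch using the `csv-parse` package; the rows endpoint and batch size mirror the Option 1 example, and `apiToken` is assumed to be in scope:

```typescript
import fs from "node:fs";
import { parse } from "csv-parse";

// Stream-parse a large CSV and insert rows in batches, without holding
// the whole file in memory (sketch).
async function streamImport(csvPath: string, tableId: string): Promise<void> {
  const parser = fs.createReadStream(csvPath).pipe(parse({ columns: true }));

  let batch: Record<string, string>[] = [];
  const flush = async () => {
    if (batch.length === 0) return;
    await fetch(
      `https://api.catalyzed.ai/dataset-tables/${tableId}/rows?mode=append`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${apiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify(batch),
      }
    );
    batch = [];
  };

  for await (const row of parser) {
    batch.push(row);
    if (batch.length >= 5000) await flush();
  }
  await flush(); // insert any remaining rows
}
```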
See the Ingesting Data guide for more details on write modes, batching, and Arrow IPC format.
## PDF Processing

PDFs are processed for text extraction and semantic search:

### Processing Steps

1. **Text Extraction** - OCR if needed, extract all text
2. **Chunking** - Split into semantic chunks
3. **Embedding Generation** - Create vector embeddings using our multi-model approach
4. **Storage** - Index for retrieval
### Using Processed PDFs in Pipelines

Once processed, reference the file in the pipeline input schema and optionally pre-fill it in configuration:

```json
{
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "name": "Document Q&A",
  "handlerType": "language_model",
  "inputsSchema": {
    "files": [
      {
        "id": "document",
        "label": "Document",
        "required": true,
        "contextRetrievalMode": "semantic"
      }
    ],
    "datasets": [],
    "dataInputs": [
      {
        "id": "question",
        "label": "Question",
        "type": "string",
        "required": true
      }
    ]
  },
  "outputsSchema": {
    "files": [],
    "datasets": [],
    "dataInputs": [
      {
        "id": "answer",
        "label": "Answer",
        "type": "string",
        "required": true
      }
    ]
  },
  "configuration": {
    "files": [],
    "datasets": [],
    "dataInputs": []
  }
}
```

When triggering the pipeline, provide the file ID and question:

```json
{
  "input": {
    "files": {
      "document": "LvrGb8UaJk_IjmzaxuMAb"
    },
    "dataInputs": {
      "question": "What are the main findings?"
    }
  }
}
```

The pipeline will retrieve relevant chunks from the PDF to provide context for AI responses.
## File Summaries

After processing, documents automatically get an AI-generated summary that includes:

- **Description**: High-level overview of the document’s contents
- **Document Type**: Classification (e.g., “financial_report”, “contract”, “research_paper”)
- **Sections**: Key sections with headings and summaries
- **Entities**: Named entities extracted from the document (people, organizations, dates, etc.)
### Retrieving File Summaries

**Get file summary**

```bash
curl https://api.catalyzed.ai/file-summaries/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
const response = await fetch(
  "https://api.catalyzed.ai/file-summaries/LvrGb8UaJk_IjmzaxuMAb",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const summary = await response.json();
console.log(summary.description);
console.log(summary.sections);
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/file-summaries/LvrGb8UaJk_IjmzaxuMAb",
    headers={"Authorization": f"Bearer {api_token}"},
)
summary = response.json()
print(summary["description"])
print(summary["sections"])
```

Response:

```json
{
  "fileSummaryId": "fs_abc123",
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "description": "This financial report provides a quarterly analysis of revenue, expenses, and profitability for Q4 2024.",
  "documentType": "financial_report",
  "pageCount": 12,
  "sections": [
    {
      "heading": "Executive Summary",
      "summary": "Overview of key financial performance metrics and highlights for the quarter."
    },
    {
      "heading": "Revenue Analysis",
      "summary": "Detailed breakdown of revenue streams showing 15% growth year-over-year."
    }
  ],
  "entities": [
    { "name": "Acme Corporation", "type": "organization" },
    { "name": "Q4 2024", "type": "date" },
    { "name": "John Smith", "type": "person" }
  ],
  "generationMethod": "pre-computed",
  "createdAt": "2025-01-15T10:35:00Z"
}
```

Use cases for file summaries:
- Preview document contents before processing
- Build document indexes and catalogs
- Improve search relevance with document metadata
- Extract structured data from documents
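For example, a sketch that builds a lightweight catalog from summaries (it assumes you already have the file IDs to index and that `apiToken` is in scope):

```typescript
// Build a small catalog entry per file from its summary (sketch).
async function buildCatalog(fileIds: string[]) {
  return Promise.all(
    fileIds.map(async (fileId) => {
      const res = await fetch(
        `https://api.catalyzed.ai/file-summaries/${fileId}`,
        { headers: { Authorization: `Bearer ${apiToken}` } }
      );
      const summary = await res.json();
      return {
        fileId,
        documentType: summary.documentType,
        description: summary.description,
        entities: summary.entities.map((e: { name: string }) => e.name),
      };
    })
  );
}
```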
## Reprocessing Files

If processing fails or you need to re-extract content:

**Reprocess a file**

```bash
curl -X POST https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
await fetch("https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiToken}` },
});
```

```python
import requests

requests.post(
    "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process",
    headers={"Authorization": f"Bearer {api_token}"},
)
```
## Downloading Files

The download endpoint returns presigned URLs for secure access:

**Get download URL**

```bash
curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
const response = await fetch(
  "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const { downloadUrl, previewUrl, expiresAt } = await response.json();

// Use the presigned URL to download the file
const fileResponse = await fetch(downloadUrl);
const blob = await fileResponse.blob();
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download",
    headers={"Authorization": f"Bearer {api_token}"},
)
urls = response.json()

# Use the presigned URL to download the file
file_response = requests.get(urls["downloadUrl"])
with open("downloaded_file.pdf", "wb") as f:
    f.write(file_response.content)
```

Response:
{ "downloadUrl": "https://storage.example.com/files/...?signature=...", "previewUrl": "https://storage.example.com/files/...?signature=...&inline=true", "expiresAt": "2024-01-15T11:30:00Z"}Error Handling
## Error Handling

### Common Processing Errors

| Error | Cause | Resolution |
|---|---|---|
| Invalid file format | File doesn’t match its extension | Check that the file is not corrupted |
| Encoding error | Invalid text encoding | Specify the correct encoding |
| Password protected | PDF requires a password | Remove protection before upload |
| File too large | Exceeds size limit | Split into smaller files |
| OCR failed | Couldn’t extract text | Try a higher-quality PDF |
### Checking Error Details

```bash
curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```

```json
{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "document.pdf",
  "fileSize": 2048000,
  "mimeType": "application/pdf",
  "fileHash": "e5f6g7h8...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": {
    "eventType": "JOB_FAILED",
    "createdAt": "2024-01-15T10:31:00Z",
    "processorVersion": "1.2.0",
    "eventData": {
      "error": "PDF is password protected. Please remove protection and re-upload."
    }
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:31:00Z"
}
```
## File Size Limits

| File Type | Max Size |
|---|---|
| CSV, JSON | 100 MB |
| Parquet | 500 MB |
| PDF | 50 MB |
| Excel | 50 MB |
| Documents | 25 MB |
For larger files, consider:
- Splitting into multiple files
- Using Parquet format (more efficient)
- Contacting support for increased limits
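For the splitting option, a sketch that breaks an oversized CSV into parts under a byte budget, repeating the header row in each part (the naming and budget are illustrative, and byte size is approximated by string length, which assumes mostly single-byte characters):

```typescript
import fs from "node:fs";
import readline from "node:readline";

// Split a large CSV into header-prefixed parts under a size budget (sketch).
async function splitCsv(path: string, maxBytes = 90 * 1024 * 1024): Promise<string[]> {
  const rl = readline.createInterface({ input: fs.createReadStream(path) });
  const parts: string[] = [];
  let header: string | null = null;
  let out: fs.WriteStream | null = null;
  let size = 0;

  for await (const line of rl) {
    if (header === null) {
      header = line; // first line is the header; repeated in every part
      continue;
    }
    if (!out || size + line.length + 1 > maxBytes) {
      out?.end();
      const name = `${path}.part${parts.length + 1}.csv`;
      parts.push(name);
      out = fs.createWriteStream(name);
      out.write(header + "\n");
      size = header.length + 1;
    }
    out.write(line + "\n");
    size += line.length + 1;
  }
  out?.end();
  return parts;
}
```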
## Best Practices

### 1. Use Appropriate Formats

- **Tabular data**: Use Parquet for best performance, CSV for compatibility
- **Documents**: PDFs with embedded text (not scanned images) process faster
- **Large datasets**: Use Parquet with compression
### 2. Validate Before Upload

Check file integrity locally before uploading to avoid processing failures.
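For example, two cheap local checks for a CSV (a sketch; the naive comma split does not handle quoted fields, so use a real CSV parser for production validation):

```typescript
import fs from "node:fs/promises";

// Pre-upload checks (sketch): enforce the size limit and sanity-check
// that sampled rows have a consistent column count.
async function validateCsv(path: string): Promise<void> {
  const { size } = await fs.stat(path);
  if (size > 100 * 1024 * 1024) {
    throw new Error("CSV exceeds the 100 MB limit");
  }

  const text = await fs.readFile(path, "utf-8");
  const [headerLine, ...rows] = text.split("\n").slice(0, 50); // sample only
  const columns = headerLine.split(",").length;
  for (const row of rows) {
    if (row && row.split(",").length !== columns) {
      throw new Error("Inconsistent column count; check delimiter and quoting");
    }
  }
}
```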
### 3. Handle Async Processing

Files process asynchronously. Always wait for processing to complete, via the SSE endpoint or polling, before using a file.
### 4. Clean Up Unused Files

Delete files you no longer need to free up storage:

```bash
curl -X DELETE https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```