
File Processing

Catalyzed can process various file types. This guide covers uploading files and understanding the processing pipeline.

| Type | Extensions | Processing |
| --- | --- | --- |
| CSV | .csv | Parse rows, infer schema |
| JSON | .json, .jsonl | Parse objects/arrays |
| Parquet | .parquet | Direct import |
| Excel | .xlsx, .xls | Parse worksheets |
| PDF | .pdf | Text extraction, chunking, embeddings |
| Documents | .docx, .txt, .md | Text extraction, chunking, embeddings |

Upload a file

curl -X POST https://api.catalyzed.ai/files \
-H "Authorization: Bearer $API_TOKEN" \
-F "file=@data.csv" \
-F "teamId=ZkoDMyjZZsXo4VAO_nJLk"

For Data Files (CSV, JSON, Parquet, Excel)

  1. Upload - File is stored securely
  2. Validation - File format and structure validated
  3. Schema Inference - Column types detected
  4. Ready - File available for import into tables

For Documents (PDF, DOCX, TXT, MD)

  1. Upload - File is stored securely
  2. Text Extraction - Convert to plain text
  3. Chunking - Split into semantic chunks
  4. Embedding Generation - Create vector embeddings
  5. Structured Extraction - Extract structured data (PDFs only)
  6. File Summary Generation - Create AI-powered document summary
  7. Indexing - Store for retrieval
  8. Ready - File available for pipeline context

Files use an event-based processing model. The processingStatus field shows the latest processing event:

| Event Type | Description |
| --- | --- |
| JOB_CREATED | Processing job has been queued |
| JOB_STARTED | Processing has begun |
| JOB_SUCCEEDED | Processing completed successfully |
| JOB_FAILED | Processing failed (check eventData for error) |

A file with processingStatus: null has never been processed.
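
For example, a small TypeScript helper (a sketch based on the file object shape shown in the responses below) that maps processingStatus to a coarse state:

function fileState(file: { processingStatus: { eventType: string } | null }): string {
  if (file.processingStatus === null) return "never-processed";
  switch (file.processingStatus.eventType) {
    case "JOB_CREATED":
    case "JOB_STARTED":
      return "in-progress";
    case "JOB_SUCCEEDED":
      return "ready";
    case "JOB_FAILED":
      return "failed";
    default:
      return "unknown";
  }
}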

Check file status

curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
-H "Authorization: Bearer $API_TOKEN"

The recommended way to wait for file processing is using the Server-Sent Events (SSE) endpoint, which streams real-time status updates:

Await file processing with SSE

# Stream processing status updates until completion
curl -N https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/await-processing \
-H "Authorization: Bearer $API_TOKEN"
# Output:
# event: processing-status
# data: {"eventType":"JOB_CREATED","createdAt":"2024-01-15T10:30:00Z","processorVersion":"1.2.0"}
#
# event: processing-status
# data: {"eventType":"JOB_STARTED","createdAt":"2024-01-15T10:30:05Z","processorVersion":"1.2.0"}
#
# event: processing-status
# data: {"eventType":"JOB_SUCCEEDED","createdAt":"2024-01-15T10:31:00Z","processorVersion":"1.2.0"}
#
# event: complete
# data: {"fileId":"LvrGb8UaJk_IjmzaxuMAb","finalStatus":"JOB_SUCCEEDED"}
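
In TypeScript, a minimal SSE consumer can read the response body stream directly with fetch, since EventSource cannot send an Authorization header (a sketch; assumes apiToken is in scope):

async function awaitProcessing(fileId: string): Promise<void> {
  const response = await fetch(
    `https://api.catalyzed.ai/files/${fileId}/await-processing`,
    { headers: { Authorization: `Bearer ${apiToken}` } }
  );
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  // The server closes the stream after the final `complete` event
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // SSE frames are separated by a blank line
    const frames = buffer.split("\n\n");
    buffer = frames.pop()!;
    for (const frame of frames) {
      const data = frame.split("\n").find((l) => l.startsWith("data: "));
      if (data) console.log(JSON.parse(data.slice(6)));
    }
  }
}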

Benefits of SSE over polling:

  • Real-time updates: Receive status changes immediately as they occur
  • Efficient: No repeated requests, single long-lived connection
  • Progress visibility: See each processing step (CREATED → STARTED → SUCCEEDED)
  • Automatic reconnection: Browsers handle reconnection automatically

For environments that don’t support SSE, you can poll the file status endpoint:

// Fallback: poll the file status endpoint until the job finishes
async function waitForProcessing(fileId: string): Promise<any> {
  const maxAttempts = 30;
  const delayMs = 2000;
  for (let i = 0; i < maxAttempts; i++) {
    const response = await fetch(`https://api.catalyzed.ai/files/${fileId}`, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    const file = await response.json();
    const status = file.processingStatus?.eventType;
    if (status === "JOB_SUCCEEDED") return file;
    if (status === "JOB_FAILED") {
      throw new Error(file.processingStatus.eventData?.error || "Processing failed");
    }
    // Still JOB_CREATED / JOB_STARTED (or never processed): wait and retry
    await new Promise((r) => setTimeout(r, delayMs));
  }
  throw new Error("File processing timed out");
}

A complete upload-and-check flow:

# 1. Upload CSV
curl -X POST https://api.catalyzed.ai/files \
-H "Authorization: Bearer $API_TOKEN" \
-F "file=@data.csv" \
-F "teamId=ZkoDMyjZZsXo4VAO_nJLk"
# 2. Check file (after processing completes)
curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "data.csv",
  "fileSize": 1048576,
  "mimeType": "text/csv",
  "fileHash": "a1b2c3d4...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": {
    "eventType": "JOB_SUCCEEDED",
    "createdAt": "2024-01-15T10:31:00Z",
    "processorVersion": "1.2.0"
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:31:00Z"
}

Control how CSVs are parsed:

curl -X POST https://api.catalyzed.ai/files \
-H "Authorization: Bearer $API_TOKEN" \
-F "file=@data.csv" \
-F "teamId=ZkoDMyjZZsXo4VAO_nJLk" \
-F "options={\"delimiter\":\",\",\"hasHeader\":true,\"encoding\":\"utf-8\"}"

| Option | Default | Description |
| --- | --- | --- |
| delimiter | , | Field delimiter |
| hasHeader | true | First row is the header |
| encoding | utf-8 | File encoding |
| nullValue | "" (empty string) | Value to treat as NULL |

While Catalyzed processes CSV files, there’s no direct “import file to table” endpoint. Instead, you need to download and parse the CSV yourself, then insert the data using the table rows endpoint.

Here’s the recommended workflow:

  1. Upload the CSV file and wait for processing
  2. Download the processed file or parse it client-side
  3. Create a table with the appropriate schema
  4. Insert the parsed rows into your table

Option 1: Download and Parse (Recommended)

import { promises as fs } from "node:fs";

async function importCsvToTable(
  csvFilePath: string,
  datasetId: string,
  tableName: string
): Promise<string> {
  // 1. Upload CSV
  const formData = new FormData();
  formData.append("file", new Blob([await fs.readFile(csvFilePath)]), csvFilePath);
  formData.append("teamId", teamId);
  const uploadResponse = await fetch("https://api.catalyzed.ai/files", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
    body: formData,
  });
  const { fileId } = await uploadResponse.json();

  // 2. Wait for processing (see the SSE helper above)
  await awaitProcessing(fileId);

  // 3. Download the file via a presigned URL
  const downloadResponse = await fetch(
    `https://api.catalyzed.ai/files/${fileId}/download`,
    { headers: { Authorization: `Bearer ${apiToken}` } }
  );
  const { downloadUrl } = await downloadResponse.json();
  const fileResponse = await fetch(downloadUrl);
  const csvText = await fileResponse.text();

  // 4. Parse CSV (using a CSV parsing library)
  const rows = parseCSV(csvText); // Use csv-parse, papaparse, etc.

  // 5. Infer schema from data (see sketch below)
  const fields = inferSchema(rows);

  // 6. Create table
  const tableResponse = await fetch("https://api.catalyzed.ai/dataset-tables", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      datasetId,
      tableName,
      fields,
      primaryKeyColumns: [fields[0].name],
    }),
  });
  const { tableId } = await tableResponse.json();

  // 7. Insert data in batches to avoid timeouts
  const batchSize = 5000;
  for (let i = 0; i < rows.length; i += batchSize) {
    const batch = rows.slice(i, i + batchSize);
    await fetch(
      `https://api.catalyzed.ai/dataset-tables/${tableId}/rows?mode=append`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${apiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify(batch),
      }
    );
  }
  return tableId;
}
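
The example above leaves parseCSV and inferSchema as placeholders. A rough inferSchema sketch that samples rows and falls back to string (the { name, type } field shape is an assumption; check the Tables reference for the exact schema format):

function inferSchema(rows: Record<string, string>[]): { name: string; type: string }[] {
  const columns = Object.keys(rows[0] ?? {});
  return columns.map((name) => {
    // Sample up to 100 non-empty values per column
    const samples = rows.slice(0, 100).map((r) => r[name]).filter((v) => v !== "");
    let type = "string";
    if (samples.length > 0 && samples.every((v) => v === "true" || v === "false")) {
      type = "boolean";
    } else if (samples.length > 0 && samples.every((v) => !Number.isNaN(Number(v)))) {
      type = "number";
    }
    return { name, type };
  });
}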

For CSV and Excel files, you can retrieve the processed content in TOON format:

curl https://api.catalyzed.ai/files/file_abc123/extracted-content \
-H "Authorization: Bearer $API_TOKEN"

Response for CSV files:

{
  "type": "csv",
  "extractionId": "extr_xyz123",
  "content": {
    "toon": "...TOON-formatted data...",
    "totalRows": 1000,
    "csvMetadata": {
      "delimiter": ",",
      "hasHeader": true,
      "encoding": "utf-8"
    }
  }
}

Note: TOON is an internal representation format. For practical CSV import workflows, downloading and parsing the CSV file directly (Option 1) is recommended.

  • Batch the inserts: Insert in batches of 1,000-5,000 rows to avoid timeouts
  • Use Arrow IPC format: For files >10MB, use Arrow IPC instead of JSON for better performance
  • Stream parsing: Use a streaming CSV parser for very large files to avoid loading everything into memory (see the sketch below)
  • Monitor progress: Check table row counts with the /dataset-tables/{tableId} endpoint

See the Ingesting Data guide for more details on write modes, batching, and Arrow IPC format.
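
For very large files, a streaming parser keeps memory flat regardless of file size. A sketch using the csv-parse package (insertBatch is a hypothetical helper wrapping the rows endpoint from the import example above):

import { createReadStream } from "node:fs";
import { parse } from "csv-parse";

// Stream-parse a local CSV and insert rows in batches without
// loading the whole file into memory.
async function streamImport(path: string, tableId: string): Promise<void> {
  const parser = createReadStream(path).pipe(parse({ columns: true }));
  let batch: Record<string, unknown>[] = [];
  for await (const row of parser) {
    batch.push(row);
    if (batch.length === 5000) {
      // insertBatch: hypothetical wrapper around
      // POST /dataset-tables/{tableId}/rows?mode=append
      await insertBatch(tableId, batch);
      batch = [];
    }
  }
  if (batch.length > 0) await insertBatch(tableId, batch);
}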

PDFs are processed for text extraction and semantic search:

  1. Text Extraction - OCR if needed, extract all text
  2. Chunking - Split into semantic chunks
  3. Embedding Generation - Create vector embeddings using our multi-model approach
  4. Storage - Index for retrieval

Once processed, reference the file in the pipeline's input schema and optionally pre-fill it in the configuration:

{
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "name": "Document Q&A",
  "handlerType": "language_model",
  "inputsSchema": {
    "files": [
      {
        "id": "document",
        "label": "Document",
        "required": true,
        "contextRetrievalMode": "semantic"
      }
    ],
    "datasets": [],
    "dataInputs": [
      {
        "id": "question",
        "label": "Question",
        "type": "string",
        "required": true
      }
    ]
  },
  "outputsSchema": {
    "files": [],
    "datasets": [],
    "dataInputs": [
      {
        "id": "answer",
        "label": "Answer",
        "type": "string",
        "required": true
      }
    ]
  },
  "configuration": {
    "files": [],
    "datasets": [],
    "dataInputs": []
  }
}

When triggering the pipeline, provide the file ID and question:

{
  "input": {
    "files": {
      "document": "LvrGb8UaJk_IjmzaxuMAb"
    },
    "dataInputs": {
      "question": "What are the main findings?"
    }
  }
}

The pipeline will retrieve relevant chunks from the PDF to provide context for AI responses.

After processing, documents automatically get an AI-generated summary that includes:

  • Description: High-level overview of the document’s contents
  • Document Type: Classification (e.g., “financial_report”, “contract”, “research_paper”)
  • Sections: Key sections with headings and summaries
  • Entities: Named entities extracted from the document (people, organizations, dates, etc.)

Get file summary

curl https://api.catalyzed.ai/file-summaries/LvrGb8UaJk_IjmzaxuMAb \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "fileSummaryId": "fs_abc123",
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "description": "This financial report provides a quarterly analysis of revenue, expenses, and profitability for Q4 2024.",
  "documentType": "financial_report",
  "pageCount": 12,
  "sections": [
    {
      "heading": "Executive Summary",
      "summary": "Overview of key financial performance metrics and highlights for the quarter."
    },
    {
      "heading": "Revenue Analysis",
      "summary": "Detailed breakdown of revenue streams showing 15% growth year-over-year."
    }
  ],
  "entities": [
    { "name": "Acme Corporation", "type": "organization" },
    { "name": "Q4 2024", "type": "date" },
    { "name": "John Smith", "type": "person" }
  ],
  "generationMethod": "pre-computed",
  "createdAt": "2025-01-15T10:35:00Z"
}

Use cases for file summaries:

  • Preview document contents before processing
  • Build document indexes and catalogs
  • Improve search relevance with document metadata
  • Extract structured data from documents
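
For example, a sketch that builds a simple catalog by fetching summaries and grouping files by document type (assumes apiToken is in scope):

async function catalogByType(fileIds: string[]): Promise<Map<string, string[]>> {
  const catalog = new Map<string, string[]>();
  for (const fileId of fileIds) {
    const res = await fetch(`https://api.catalyzed.ai/file-summaries/${fileId}`, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    const summary = await res.json();
    // Group file descriptions under their classified document type
    const group = catalog.get(summary.documentType) ?? [];
    group.push(`${fileId}: ${summary.description}`);
    catalog.set(summary.documentType, group);
  }
  return catalog;
}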

If processing fails or you need to re-extract content:

Reprocess a file

curl -X POST https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process \
-H "Authorization: Bearer $API_TOKEN"

The download endpoint returns presigned URLs for secure access:

Get download URL

curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download \
-H "Authorization: Bearer $API_TOKEN"

Response:

{
  "downloadUrl": "https://storage.example.com/files/...?signature=...",
  "previewUrl": "https://storage.example.com/files/...?signature=...&inline=true",
  "expiresAt": "2024-01-15T11:30:00Z"
}
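
A sketch of the download flow in TypeScript, with a defensive check on expiresAt (assumes apiToken is in scope):

async function downloadFile(fileId: string): Promise<ArrayBuffer> {
  const res = await fetch(`https://api.catalyzed.ai/files/${fileId}/download`, {
    headers: { Authorization: `Bearer ${apiToken}` },
  });
  const { downloadUrl, expiresAt } = await res.json();
  // Presigned URLs expire; request a fresh one rather than caching
  if (new Date(expiresAt) <= new Date()) {
    throw new Error("Presigned URL expired; request a new one");
  }
  // The presigned URL itself requires no Authorization header
  const fileRes = await fetch(downloadUrl);
  return fileRes.arrayBuffer();
}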

Common processing errors:

| Error | Cause | Resolution |
| --- | --- | --- |
| Invalid file format | File doesn't match extension | Check the file is not corrupted |
| Encoding error | Invalid text encoding | Specify the correct encoding |
| Password protected | PDF requires a password | Remove protection before upload |
| File too large | Exceeds size limit | Split into smaller files |
| OCR failed | Couldn't extract text | Try a higher-quality PDF |
A failed file includes error details in eventData:

curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
-H "Authorization: Bearer $API_TOKEN"

{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "document.pdf",
  "fileSize": 2048000,
  "mimeType": "application/pdf",
  "fileHash": "e5f6g7h8...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": {
    "eventType": "JOB_FAILED",
    "createdAt": "2024-01-15T10:31:00Z",
    "processorVersion": "1.2.0",
    "eventData": {
      "error": "PDF is password protected. Please remove protection and re-upload."
    }
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:31:00Z"
}

| File Type | Max Size |
| --- | --- |
| CSV, JSON | 100 MB |
| Parquet | 500 MB |
| PDF | 50 MB |
| Excel | 50 MB |
| Documents | 25 MB |

For larger files, consider:

  • Splitting into multiple files
  • Using Parquet format (more efficient)
  • Contacting support for increased limits

Format recommendations:

  • Tabular data: Use Parquet for best performance, CSV for compatibility
  • Documents: PDF with embedded text (not scanned images) processes faster
  • Large datasets: Parquet with compression

Check file integrity locally before uploading to avoid processing failures.
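
A pre-upload sketch that checks size against the limits above and computes a local hash (the limits map covers a few types from the table; whether fileHash uses SHA-256 is an assumption, so treat the hash only as a local integrity check):

import { promises as fs } from "node:fs";
import { createHash } from "node:crypto";

// Size limits in bytes, mirroring the table above
const maxBytes: Record<string, number> = {
  ".csv": 100 * 1024 * 1024,
  ".json": 100 * 1024 * 1024,
  ".parquet": 500 * 1024 * 1024,
  ".pdf": 50 * 1024 * 1024,
};

async function preflight(path: string): Promise<string> {
  const ext = path.slice(path.lastIndexOf(".")).toLowerCase();
  const { size } = await fs.stat(path);
  const limit = maxBytes[ext];
  if (limit !== undefined && size > limit) {
    throw new Error(`${path} exceeds the ${ext} size limit`);
  }
  // Local integrity hash (algorithm is an assumption, not necessarily
  // the one the API uses for fileHash)
  return createHash("sha256").update(await fs.readFile(path)).digest("hex");
}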

Files process asynchronously. Always wait for processing to complete (via SSE or polling) before using a file.

Delete files you no longer need to free up storage:

curl -X DELETE https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
-H "Authorization: Bearer $API_TOKEN"
  • Files - Full files API reference
  • Pipelines - Use files in AI workflows
  • Tables - Import data files into tables