# File Processing
Catalyzed can process various file types. This guide covers uploading files and understanding the processing pipeline.
## Supported File Types

| Type | Extensions | Processing |
|---|---|---|
| CSV | .csv | Parse rows, infer schema |
| JSON | .json, .jsonl | Parse objects/arrays |
| Parquet | .parquet | Direct import |
| Excel | .xlsx, .xls | Parse worksheets |
| PDF | .pdf | Text extraction, chunking, embeddings |
| Documents | .docx, .txt, .md | Text extraction, chunking, embeddings |
## Uploading Files

**Upload a file**

```bash
curl -X POST https://api.catalyzed.ai/files \
  -H "Authorization: Bearer $API_TOKEN" \
  -F "file=@document.pdf" \
  -F "teamId=ZkoDMyjZZsXo4VAO_nJLk"
```

```typescript
const formData = new FormData();
formData.append("file", fileBlob, "document.pdf");
formData.append("teamId", "ZkoDMyjZZsXo4VAO_nJLk");

const response = await fetch("https://api.catalyzed.ai/files", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiToken}` },
  body: formData,
});
const file = await response.json();
```

```python
import requests

with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://api.catalyzed.ai/files",
        headers={"Authorization": f"Bearer {api_token}"},
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"teamId": "ZkoDMyjZZsXo4VAO_nJLk"},
    )
file = response.json()
```
## Processing Pipeline

### For Data Files (CSV, JSON, Parquet, Excel)

1. **Upload** - File is stored securely
2. **Validation** - File format and structure validated
3. **Schema Inference** - Column types detected
4. **Ready** - File available for import into tables
### For Documents (PDF, DOCX, TXT)

1. **Upload** - File is stored securely
2. **Text Extraction** - Convert to plain text
3. **Chunking** - Split into semantic chunks
4. **Embedding Generation** - Create vector embeddings
5. **Structured Extraction** - Extract structured data (PDFs only)
6. **File Summary Generation** - Create AI-powered document summary
7. **Indexing** - Store for retrieval
8. **Ready** - File available for pipeline context
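To make the flow concrete, here is a minimal end-to-end sketch: upload a PDF, wait for a terminal processing event (simple polling for brevity; the SSE endpoint described below is preferred), then fetch the generated summary. As in the other examples, `apiToken` is assumed to be in scope.

```typescript
// End-to-end document flow sketch: upload, wait, fetch summary.
async function processDocument(pdf: Blob, teamId: string) {
  const form = new FormData();
  form.append("file", pdf, "document.pdf");
  form.append("teamId", teamId);

  const upload = await fetch("https://api.catalyzed.ai/files", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
    body: form,
  });
  const { fileId } = await upload.json();

  // Poll until a terminal event arrives (see "Awaiting Processing
  // Completion" below for the recommended SSE approach).
  for (;;) {
    const res = await fetch(`https://api.catalyzed.ai/files/${fileId}`, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    const file = await res.json();
    const status = file.processingStatus?.eventType;
    if (status === "JOB_SUCCEEDED") break;
    if (status === "JOB_FAILED") throw new Error("Processing failed");
    await new Promise((r) => setTimeout(r, 2000));
  }

  // Fetch the AI-generated summary (see "File Summaries" below).
  const summary = await fetch(
    `https://api.catalyzed.ai/file-summaries/${fileId}`,
    { headers: { Authorization: `Bearer ${apiToken}` } }
  );
  return summary.json();
}
```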
## Processing Status

Files use an event-based processing model. The `processingStatus` field shows the latest processing event:
| Event Type | Description |
|---|---|
| `JOB_CREATED` | Processing job has been queued |
| `JOB_STARTED` | Processing has begun |
| `JOB_SUCCEEDED` | Processing completed successfully |
| `JOB_FAILED` | Processing failed (check `eventData` for the error) |
A file with `processingStatus: null` has never been processed.
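For example, a small helper that interprets this status shape (a sketch in TypeScript):

```typescript
// Interpret a file's latest processing event (sketch).
type ProcessingEvent = { eventType: string; eventData?: { error?: string } };

function describeProcessing(status: ProcessingEvent | null): string {
  if (status === null) return "never processed";
  switch (status.eventType) {
    case "JOB_CREATED":
    case "JOB_STARTED":
      return "in progress";
    case "JOB_SUCCEEDED":
      return "ready";
    case "JOB_FAILED":
      return `failed: ${status.eventData?.error ?? "unknown error"}`;
    default:
      return "unknown";
  }
}
```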
## Checking File Status

**Check file status**

```bash
curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
const response = await fetch("https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb", {
  headers: { Authorization: `Bearer ${apiToken}` },
});
const file = await response.json();
console.log(file.processingStatus?.eventType); // "JOB_SUCCEEDED"
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb",
    headers={"Authorization": f"Bearer {api_token}"},
)
file = response.json()
print(file["processingStatus"]["eventType"])  # "JOB_SUCCEEDED"
```
## Awaiting Processing Completion

The recommended way to wait for file processing is to use the Server-Sent Events (SSE) endpoint, which streams real-time status updates:
**Await file processing with SSE**

```bash
# Stream processing status updates until completion
curl -N https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/await-processing \
  -H "Authorization: Bearer $API_TOKEN"

# Output:
# event: processing-status
# data: {"eventType":"JOB_CREATED","createdAt":"2024-01-15T10:30:00Z","processorVersion":"1.2.0"}
#
# event: processing-status
# data: {"eventType":"JOB_STARTED","createdAt":"2024-01-15T10:30:05Z","processorVersion":"1.2.0"}
#
# event: processing-status
# data: {"eventType":"JOB_SUCCEEDED","createdAt":"2024-01-15T10:31:00Z","processorVersion":"1.2.0"}
#
# event: complete
# data: {"fileId":"LvrGb8UaJk_IjmzaxuMAb","finalStatus":"JOB_SUCCEEDED"}
```

```typescript
async function awaitProcessing(fileId: string): Promise<void> {
  const url = `https://api.catalyzed.ai/files/${fileId}/await-processing`;
  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${apiToken}` },
  });

  if (!response.body) throw new Error("No response body");

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });

    // SSE messages are separated by blank lines
    const messages = buffer.split("\n\n");
    buffer = messages.pop() || "";

    for (const message of messages) {
      if (!message.trim()) continue;

      const lines = message.split("\n");
      let eventType = "message";
      let data = "";

      for (const line of lines) {
        if (line.startsWith("event: ")) eventType = line.slice(7);
        else if (line.startsWith("data: ")) data = line.slice(6);
      }

      if (eventType === "complete") {
        const { finalStatus } = JSON.parse(data);
        if (finalStatus === "JOB_FAILED") {
          throw new Error("File processing failed");
        }
        return; // Success!
      } else if (eventType === "processing-status") {
        const status = JSON.parse(data);
        console.log(`Processing: ${status.eventType}`);
      }
    }
  }
}
```
```python
import json
import requests

def await_processing(file_id: str, api_token: str) -> None:
    """Wait for file processing using SSE."""
    url = f"https://api.catalyzed.ai/files/{file_id}/await-processing"

    with requests.get(
        url,
        headers={"Authorization": f"Bearer {api_token}"},
        stream=True,
    ) as response:
        response.raise_for_status()

        event_type = "message"
        for line in response.iter_lines():
            if not line:
                continue

            line_str = line.decode("utf-8")

            # Track the current event type, then act on its data line
            if line_str.startswith("event: "):
                event_type = line_str[7:]
            elif line_str.startswith("data: "):
                data = json.loads(line_str[6:])

                if event_type == "complete":
                    if data["finalStatus"] == "JOB_FAILED":
                        raise Exception("File processing failed")
                    return  # Success!
                elif event_type == "processing-status":
                    print(f"Processing: {data.get('eventType', 'unknown')}")
```

**Benefits of SSE over polling:**
- **Real-time updates**: Receive status changes immediately as they occur
- **Efficient**: No repeated requests; a single long-lived connection
- **Progress visibility**: See each processing step (CREATED → STARTED → SUCCEEDED)
- **Automatic reconnection**: Browsers handle reconnection automatically
## Polling Alternative

For environments that don’t support SSE, you can poll the file status endpoint:

```typescript
async function waitForProcessing(fileId: string): Promise<File> {
  const maxAttempts = 30;
  const delayMs = 2000;

  for (let i = 0; i < maxAttempts; i++) {
    const response = await fetch(`https://api.catalyzed.ai/files/${fileId}`, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    const file = await response.json();

    const status = file.processingStatus?.eventType;
    if (status === "JOB_SUCCEEDED") return file;
    if (status === "JOB_FAILED") {
      throw new Error(file.processingStatus.eventData?.error || "Processing failed");
    }

    await new Promise(r => setTimeout(r, delayMs));
  }

  throw new Error("File processing timed out");
}
```
## CSV Processing

### Upload and Check Schema

```bash
# 1. Upload CSV
curl -X POST https://api.catalyzed.ai/files \
  -H "Authorization: Bearer $API_TOKEN" \
  -F "file=@data.csv" \
  -F "teamId=ZkoDMyjZZsXo4VAO_nJLk"

# 2. Check file (after processing completes)
curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```

Response:

```json
{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "data.csv",
  "fileSize": 1048576,
  "mimeType": "text/csv",
  "fileHash": "a1b2c3d4...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": {
    "eventType": "JOB_SUCCEEDED",
    "createdAt": "2024-01-15T10:31:00Z",
    "processorVersion": "1.2.0"
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:31:00Z"
}
```
### CSV Options

Control how CSVs are parsed:

```bash
curl -X POST https://api.catalyzed.ai/files \
  -H "Authorization: Bearer $API_TOKEN" \
  -F "file=@data.csv" \
  -F "teamId=ZkoDMyjZZsXo4VAO_nJLk" \
  -F "options={\"delimiter\":\",\",\"hasHeader\":true,\"encoding\":\"utf-8\"}"
```

| Option | Default | Description |
|---|---|---|
| `delimiter` | `,` | Field delimiter |
| `hasHeader` | `true` | First row is the header |
| `encoding` | `utf-8` | File encoding |
| `nullValue` | `""` | Value to treat as NULL |
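The same upload with options in TypeScript (a sketch; `csvBlob` is a `Blob` you supply):

```typescript
// Upload a CSV with explicit parse options (mirrors the curl example above).
const formData = new FormData();
formData.append("file", csvBlob, "data.csv");
formData.append("teamId", "ZkoDMyjZZsXo4VAO_nJLk");
formData.append(
  "options",
  JSON.stringify({ delimiter: ";", hasHeader: true, encoding: "utf-8", nullValue: "NA" })
);

const response = await fetch("https://api.catalyzed.ai/files", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiToken}` },
  body: formData,
});
```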
## Importing CSV Data to Tables

While Catalyzed processes CSV files, there’s no direct “import file to table” endpoint. Instead, you need to download and parse the CSV yourself, then insert the data using the table rows endpoint.

Here’s the recommended workflow:

1. Upload the CSV file and wait for processing
2. Download the processed file or parse it client-side
3. Create a table with the appropriate schema
4. Insert the parsed rows into your table
### Option 1: Download and Parse (Recommended)

```typescript
import fs from "node:fs/promises";

async function importCsvToTable(
  csvFilePath: string,
  datasetId: string,
  tableName: string
): Promise<string> {
  // 1. Upload CSV
  const formData = new FormData();
  formData.append("file", new Blob([await fs.readFile(csvFilePath)]), csvFilePath);
  formData.append("teamId", teamId);

  const uploadResponse = await fetch("https://api.catalyzed.ai/files", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
    body: formData,
  });
  const { fileId } = await uploadResponse.json();

  // 2. Wait for processing
  await awaitProcessing(fileId);

  // 3. Download the file
  const downloadResponse = await fetch(
    `https://api.catalyzed.ai/files/${fileId}/download`,
    { headers: { Authorization: `Bearer ${apiToken}` } }
  );
  const { downloadUrl } = await downloadResponse.json();

  const fileResponse = await fetch(downloadUrl);
  const csvText = await fileResponse.text();

  // 4. Parse CSV (using a CSV parsing library)
  const rows = parseCSV(csvText); // Use csv-parse, papaparse, etc.

  // 5. Infer schema from data
  const fields = inferSchema(rows);

  // 6. Create table
  const tableResponse = await fetch("https://api.catalyzed.ai/dataset-tables", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      datasetId,
      tableName,
      fields,
      primaryKeyColumns: [fields[0].name],
    }),
  });
  const { tableId } = await tableResponse.json();

  // 7. Insert data in batches
  const batchSize = 5000;
  for (let i = 0; i < rows.length; i += batchSize) {
    const batch = rows.slice(i, i + batchSize);
    await fetch(
      `https://api.catalyzed.ai/dataset-tables/${tableId}/rows?mode=append`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${apiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify(batch),
      }
    );
  }

  return tableId;
}
```
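The `parseCSV` and `inferSchema` calls above are placeholders. As an illustration, a minimal `inferSchema` could sample rows and pick the narrowest type that fits every value; the type names used here (`string`, `number`, `boolean`) are assumptions, so match them to whatever field types the table-creation endpoint actually accepts:

```typescript
// Hypothetical schema inference: sample rows, pick the narrowest fitting type.
type Field = { name: string; type: "string" | "number" | "boolean" };

function inferSchema(rows: Record<string, string>[]): Field[] {
  const sample = rows.slice(0, 1000); // sampling keeps this cheap on large files
  const columns = Object.keys(sample[0] ?? {});

  return columns.map((name) => {
    const values = sample
      .map((row) => row[name])
      .filter((v) => v != null && v !== "");
    const allNumbers =
      values.length > 0 && values.every((v) => !Number.isNaN(Number(v)));
    const allBooleans =
      values.length > 0 && values.every((v) => v === "true" || v === "false");
    return { name, type: allBooleans ? "boolean" : allNumbers ? "number" : "string" };
  });
}
```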
### Option 2: Extract Content via API

For CSV and Excel files, you can retrieve the processed content in TOON format:

```bash
curl https://api.catalyzed.ai/files/file_abc123/extracted-content \
  -H "Authorization: Bearer $API_TOKEN"
```

Response for CSV files:

```json
{
  "type": "csv",
  "extractionId": "extr_xyz123",
  "content": {
    "toon": "...TOON-formatted data...",
    "totalRows": 1000,
    "csvMetadata": {
      "delimiter": ",",
      "hasHeader": true,
      "encoding": "utf-8"
    }
  }
}
```

Note: The `toon` format is an internal representation. For practical CSV import workflows, downloading and parsing the CSV file directly (Option 1) is recommended.
### Tips for Large CSV Files

- **Batch the inserts**: Insert in batches of 1,000-5,000 rows to avoid timeouts
- **Use Arrow IPC format**: For files over 10 MB, use Arrow IPC instead of JSON for better performance
- **Stream parsing**: Use streaming CSV parsers for very large files to avoid loading everything into memory (see the sketch after this list)
- **Monitor progress**: Check table row counts with the `/dataset-tables/{tableId}` endpoint
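As an illustration of the stream-parsing tip, here is a sketch using the `csv-parse` package; the rows endpoint and batch size mirror the Option 1 example, and `apiToken` is assumed to be in scope:

```typescript
import fs from "node:fs";
import { parse } from "csv-parse";

// Stream-parse a large CSV and insert rows in batches, without holding
// the whole file in memory (sketch).
async function streamImport(csvPath: string, tableId: string): Promise<void> {
  const parser = fs.createReadStream(csvPath).pipe(parse({ columns: true }));

  let batch: Record<string, string>[] = [];
  const flush = async () => {
    if (batch.length === 0) return;
    await fetch(
      `https://api.catalyzed.ai/dataset-tables/${tableId}/rows?mode=append`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${apiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify(batch),
      }
    );
    batch = [];
  };

  for await (const row of parser) {
    batch.push(row);
    if (batch.length >= 5000) await flush();
  }
  await flush(); // insert any remaining rows
}
```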
See the Ingesting Data guide for more details on write modes, batching, and Arrow IPC format.
## PDF Processing

PDFs are processed for text extraction and semantic search:

### Processing Steps

1. **Text Extraction** - OCR if needed, extract all text
2. **Chunking** - Split into semantic chunks
3. **Embedding Generation** - Create vector embeddings using our multi-model approach
4. **Storage** - Index for retrieval
### Using Processed PDFs in Pipelines

Once processed, reference the file in the pipeline input schema and optionally pre-fill it in configuration:

```json
{
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "name": "Document Q&A",
  "handlerType": "language_model",
  "inputsSchema": {
    "files": [
      {
        "id": "document",
        "label": "Document",
        "required": true,
        "contextRetrievalMode": "semantic"
      }
    ],
    "datasets": [],
    "dataInputs": [
      {
        "id": "question",
        "label": "Question",
        "type": "string",
        "required": true
      }
    ]
  },
  "outputsSchema": {
    "files": [],
    "datasets": [],
    "dataInputs": [
      {
        "id": "answer",
        "label": "Answer",
        "type": "string",
        "required": true
      }
    ]
  },
  "configuration": {
    "files": [],
    "datasets": [],
    "dataInputs": []
  }
}
```

When triggering the pipeline, provide the file ID and question:

```json
{
  "input": {
    "files": {
      "document": "LvrGb8UaJk_IjmzaxuMAb"
    },
    "dataInputs": {
      "question": "What are the main findings?"
    }
  }
}
```

The pipeline will retrieve relevant chunks from the PDF to provide context for AI responses.
## File Summaries

After processing, documents automatically get an AI-generated summary that includes:

- **Description**: High-level overview of the document’s contents
- **Document Type**: Classification (e.g., “financial_report”, “contract”, “research_paper”)
- **Sections**: Key sections with headings and summaries
- **Entities**: Named entities extracted from the document (people, organizations, dates, etc.)
### Retrieving File Summaries

**Get file summary**

```bash
curl https://api.catalyzed.ai/file-summaries/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
const response = await fetch(
  "https://api.catalyzed.ai/file-summaries/LvrGb8UaJk_IjmzaxuMAb",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const summary = await response.json();
console.log(summary.description);
console.log(summary.sections);
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/file-summaries/LvrGb8UaJk_IjmzaxuMAb",
    headers={"Authorization": f"Bearer {api_token}"},
)
summary = response.json()
print(summary["description"])
print(summary["sections"])
```

Response:

```json
{
  "fileSummaryId": "fs_abc123",
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "description": "This financial report provides a quarterly analysis of revenue, expenses, and profitability for Q4 2024.",
  "documentType": "financial_report",
  "pageCount": 12,
  "sections": [
    {
      "heading": "Executive Summary",
      "summary": "Overview of key financial performance metrics and highlights for the quarter."
    },
    {
      "heading": "Revenue Analysis",
      "summary": "Detailed breakdown of revenue streams showing 15% growth year-over-year."
    }
  ],
  "entities": [
    { "name": "Acme Corporation", "type": "organization" },
    { "name": "Q4 2024", "type": "date" },
    { "name": "John Smith", "type": "person" }
  ],
  "generationMethod": "pre-computed",
  "createdAt": "2025-01-15T10:35:00Z"
}
```

Use cases for file summaries:
- Preview document contents before processing
- Build document indexes and catalogs
- Improve search relevance with document metadata
- Extract structured data from documents
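For example, a sketch that builds a lightweight catalog from summaries (it assumes you already have the file IDs to index and that `apiToken` is in scope):

```typescript
// Build a small catalog entry per file from its summary (sketch).
async function buildCatalog(fileIds: string[]) {
  return Promise.all(
    fileIds.map(async (fileId) => {
      const res = await fetch(
        `https://api.catalyzed.ai/file-summaries/${fileId}`,
        { headers: { Authorization: `Bearer ${apiToken}` } }
      );
      const summary = await res.json();
      return {
        fileId,
        documentType: summary.documentType,
        description: summary.description,
        entities: summary.entities.map((e: { name: string }) => e.name),
      };
    })
  );
}
```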
## Reprocessing Files

If processing fails or you need to re-extract content:

**Reprocess a file**

```bash
curl -X POST https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
await fetch("https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiToken}` },
});
```

```python
import requests

requests.post(
    "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/process",
    headers={"Authorization": f"Bearer {api_token}"},
)
```
## Downloading Files

The download endpoint returns presigned URLs for secure access:

**Get download URL**

```bash
curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
const response = await fetch(
  "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const { downloadUrl, previewUrl, expiresAt } = await response.json();

// Use the presigned URL to download the file
const fileResponse = await fetch(downloadUrl);
const blob = await fileResponse.blob();
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb/download",
    headers={"Authorization": f"Bearer {api_token}"},
)
urls = response.json()

# Use the presigned URL to download the file
file_response = requests.get(urls["downloadUrl"])
with open("downloaded_file.pdf", "wb") as f:
    f.write(file_response.content)
```

Response:
{ "downloadUrl": "https://storage.example.com/files/...?signature=...", "previewUrl": "https://storage.example.com/files/...?signature=...&inline=true", "expiresAt": "2024-01-15T11:30:00Z"}Error Handling
## Error Handling

### Common Processing Errors

| Error | Cause | Resolution |
|---|---|---|
| Invalid file format | File doesn’t match its extension | Check that the file is not corrupted |
| Encoding error | Invalid text encoding | Specify the correct encoding |
| Password protected | PDF requires a password | Remove protection before upload |
| File too large | Exceeds size limit | Split into smaller files |
| OCR failed | Couldn’t extract text | Try a higher-quality PDF |
### Checking Error Details

```bash
curl https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```

```json
{
  "fileId": "LvrGb8UaJk_IjmzaxuMAb",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "fileName": "document.pdf",
  "fileSize": 2048000,
  "mimeType": "application/pdf",
  "fileHash": "e5f6g7h8...",
  "storageKey": "files/ZkoDMyjZZsXo4VAO_nJLk/LvrGb8UaJk_IjmzaxuMAb",
  "uploadedBy": "usr_abc123",
  "metadata": {},
  "processingStatus": {
    "eventType": "JOB_FAILED",
    "createdAt": "2024-01-15T10:31:00Z",
    "processorVersion": "1.2.0",
    "eventData": {
      "error": "PDF is password protected. Please remove protection and re-upload."
    }
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:31:00Z"
}
```
## File Size Limits

| File Type | Max Size |
|---|---|
| CSV, JSON | 100 MB |
| Parquet | 500 MB |
| PDF | 50 MB |
| Excel | 50 MB |
| Documents | 25 MB |
For larger files, consider:
- Splitting into multiple files
- Using Parquet format (more efficient)
- Contacting support for increased limits
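For the splitting option, a sketch that breaks an oversized CSV into parts under a byte budget, repeating the header row in each part (the naming and budget are illustrative, and byte size is approximated by string length, which assumes mostly single-byte characters):

```typescript
import fs from "node:fs";
import readline from "node:readline";

// Split a large CSV into header-prefixed parts under a size budget (sketch).
async function splitCsv(path: string, maxBytes = 90 * 1024 * 1024): Promise<string[]> {
  const rl = readline.createInterface({ input: fs.createReadStream(path) });
  const parts: string[] = [];
  let header: string | null = null;
  let out: fs.WriteStream | null = null;
  let size = 0;

  for await (const line of rl) {
    if (header === null) {
      header = line; // first line is the header; repeated in every part
      continue;
    }
    if (!out || size + line.length + 1 > maxBytes) {
      out?.end();
      const name = `${path}.part${parts.length + 1}.csv`;
      parts.push(name);
      out = fs.createWriteStream(name);
      out.write(header + "\n");
      size = header.length + 1;
    }
    out.write(line + "\n");
    size += line.length + 1;
  }
  out?.end();
  return parts;
}
```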
## Best Practices

### 1. Use Appropriate Formats

- **Tabular data**: Use Parquet for best performance, CSV for compatibility
- **Documents**: PDFs with embedded text (not scanned images) process faster
- **Large datasets**: Use Parquet with compression
### 2. Validate Before Upload

Check file integrity locally before uploading to avoid processing failures.
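For example, two cheap local checks for a CSV (a sketch; the naive comma split does not handle quoted fields, so use a real CSV parser for production validation):

```typescript
import fs from "node:fs/promises";

// Pre-upload checks (sketch): enforce the size limit and sanity-check
// that sampled rows have a consistent column count.
async function validateCsv(path: string): Promise<void> {
  const { size } = await fs.stat(path);
  if (size > 100 * 1024 * 1024) {
    throw new Error("CSV exceeds the 100 MB limit");
  }

  const text = await fs.readFile(path, "utf-8");
  const [headerLine, ...rows] = text.split("\n").slice(0, 50); // sample only
  const columns = headerLine.split(",").length;
  for (const row of rows) {
    if (row && row.split(",").length !== columns) {
      throw new Error("Inconsistent column count; check delimiter and quoting");
    }
  }
}
```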
### 3. Handle Async Processing

Files process asynchronously. Always wait for processing to complete, via the SSE endpoint or polling, before using a file.
### 4. Clean Up Unused Files

Delete files you no longer need to free up storage:

```bash
curl -X DELETE https://api.catalyzed.ai/files/LvrGb8UaJk_IjmzaxuMAb \
  -H "Authorization: Bearer $API_TOKEN"
```