Get extracted content from processed file
GET /files/{fileId}/extracted-content
Returns the content extracted during file processing. For PDFs, this is markdown text and, optionally, a structured extraction. For DOCX, it is markdown/text content. For XLSX/CSV, it is TOON-formatted content. For text files, it is the raw text.
Authorizations
Parameters
Path Parameters
Unique identifier of the file to retrieve
Query Parameters
Include source map data for character-to-source position mapping. Returns a discriminated union on mappingType: ‘spatial’ (PDF with bboxes), ‘structural’ (DOCX with XPath), or ‘tabular’ (XLSX/CSV with cell refs).
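As a rough illustration of calling this endpoint, the sketch below builds the request URL and issues the GET with Python's standard library. The base URL, the bearer-token header, and the query parameter name `includeSourceMap` are assumptions made for illustration; this page does not state them.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL; replace with the real API host.
BASE_URL = "https://api.example.com"


def build_extracted_content_url(file_id: str, include_source_map: bool = False) -> str:
    """Build the URL for GET /files/{fileId}/extracted-content."""
    # Percent-encode the path segment so unusual IDs cannot break the path.
    encoded_id = urllib.parse.quote(file_id, safe="")
    url = f"{BASE_URL}/files/{encoded_id}/extracted-content"
    if include_source_map:
        # 'includeSourceMap' is a hypothetical name for the query parameter.
        url += "?" + urllib.parse.urlencode({"includeSourceMap": "true"})
    return url


def get_extracted_content(file_id: str, token: str, include_source_map: bool = False) -> dict:
    """Fetch and decode the JSON response body (bearer auth is an assumption)."""
    request = urllib.request.Request(
        build_extracted_content_url(file_id, include_source_map),
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping URL construction separate from the network call makes the request shape easy to check without contacting the server.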
Responses
Extracted content found
object
File type category
Unique identifier of the extraction record, or null if not processed
Extracted content, or null if not processed or extraction failed
object
Extracted markdown/text content (PDF, DOCX)
Total number of pages in the PDF
Total number of text blocks in the PDF
Total number of paragraphs (DOCX)
Total number of tables (DOCX)
TOON-formatted content (XLSX/CSV)
Total number of sheets (XLSX)
Total number of rows
CSV metadata (delimiter, encoding, etc.)
XLSX sheets metadata
Raw text content (text files)
Content type (text/markdown or text/plain)
Character count (text files)
Line count (text files)
Structured extraction (PDF only), or null if not available
object
Unique identifier of the structured extraction
Status: ‘complete’ (all sections succeeded), ‘partial’ (some failed but below threshold), ‘failed’ (too many sections failed)
Document sections/hierarchy
Per-section JSON schemas (null for failed sections)
Per-section extracted data (null for failed sections)
Metadata about extraction including error details for failed sections
object
Total number of sections in the document
Number of successfully extracted sections
Number of failed sections
Ratio of failed content (0.0 to 1.0)
Indices of failed sections
Error details for failed sections
object
Threshold used for failure determination (0.3)
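The relationship between the failure ratio, the 0.3 threshold, and the three status values can be sketched as follows. This is one plausible reading of the descriptions above, not the service's actual logic: the function name is hypothetical, and treating a ratio exactly equal to the threshold as 'failed' is an assumption.

```python
FAILURE_THRESHOLD = 0.3  # threshold value stated in the response metadata


def classify_extraction(failed_sections: int, failed_ratio: float,
                        threshold: float = FAILURE_THRESHOLD) -> str:
    """Return 'complete', 'partial', or 'failed' for a structured extraction."""
    if failed_sections == 0:
        # All sections succeeded.
        return "complete"
    if failed_ratio < threshold:
        # Some sections failed, but the failed ratio stays below the threshold.
        # Assumption: the comparison is strict at the boundary.
        return "partial"
    # Too much of the document failed to extract.
    return "failed"
```

A client might use the same check to decide whether per-section data is worth consuming, skipping the sections whose schema and data are null.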
object
object (‘spatial’ source map entry, PDF)
Unique span/block identifier from the PDF parser
Start character offset in markdown (inclusive)
End character offset in markdown (exclusive)
0-indexed page number
Block type (Text, SectionHeader, ListItem, Table, etc.)
Bounding box coordinates on the page
object
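Because each span covers a half-open range of character offsets (start inclusive, end exclusive), a character position in the markdown can be mapped back to its source span with a binary search. The sketch below assumes the spans are sorted by start offset and do not overlap; the keys `start` and `end` are hypothetical names for the two offset fields.

```python
import bisect
from typing import List, Optional


def find_span(spans: List[dict], offset: int) -> Optional[dict]:
    """Return the span whose [start, end) range contains offset, or None."""
    # Assumption: spans are sorted by 'start' and non-overlapping.
    starts = [span["start"] for span in spans]
    # Index of the last span starting at or before the offset.
    index = bisect.bisect_right(starts, offset) - 1
    if index >= 0 and spans[index]["start"] <= offset < spans[index]["end"]:
        return spans[index]
    # The offset falls in a gap between spans or outside all of them.
    return None
```

Note that an offset equal to a span's end belongs to the next span (if any), since the end offset is exclusive.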
object
object (‘structural’ source map entry, DOCX)
Unique span identifier (e.g., ‘p3’, ‘tbl1_r2_c3’)
Start character offset in markdown (inclusive)
End character offset in markdown (exclusive)
XPath-like element selector (e.g., ‘body/p[3]’, ‘body/tbl[1]/tr[2]/tc[3]’)
Element type: ‘paragraph’, ‘heading’, ‘table’, ‘table_cell’
Index of the element in the document
Human-readable source reference
Paragraph/element style name
Heading level (1-9) if element is a heading
object
object (‘tabular’ source map entry, XLSX)
Unique span identifier
Start character offset in TOON content (inclusive)
End character offset in TOON content (exclusive)
0-indexed sheet number
1-indexed row number
0-indexed column number
Cell reference (e.g., ‘Sheet1!B5’)
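The cell reference and the numeric indices use different conventions: rows are 1-indexed and columns are 0-indexed. The sketch below splits a reference such as ‘Sheet1!B5’ into its parts, assuming the reference uses A1-style notation (column letters followed by a row number) as the examples suggest. The function name is hypothetical.

```python
import re
from typing import Tuple


def parse_cell_ref(cell_ref: str) -> Tuple[str, int, int]:
    """Split 'Sheet1!B5' into (sheet name, 1-indexed row, 0-indexed column)."""
    match = re.fullmatch(r"(.+)!([A-Za-z]+)([0-9]+)", cell_ref)
    if match is None:
        raise ValueError(f"not an A1-style cell reference: {cell_ref!r}")
    sheet, letters, digits = match.groups()
    # Column letters are base-26 with A=1, so 'A' -> 1, 'Z' -> 26, 'AA' -> 27.
    column_number = 0
    for letter in letters.upper():
        column_number = column_number * 26 + (ord(letter) - ord("A") + 1)
    # Subtract 1 to get the 0-indexed column; the row stays 1-indexed.
    return sheet, int(digits), column_number - 1
```

For example, `parse_cell_ref("Sheet1!B5")` gives `("Sheet1", 5, 1)`: row 5, and column index 1 for column B.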
object
object (‘tabular’ source map entry, CSV)
Unique span identifier
Start character offset in TOON content (inclusive)
End character offset in TOON content (exclusive)
1-indexed row number
0-indexed column number
Cell reference (e.g., ‘data.csv!B5’)
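Since the source map is a discriminated union on mappingType, client code typically branches on that value before reading variant-specific fields. The following is a minimal sketch of that dispatch; apart from the three mappingType values, the key names (`page`, `xpath`, `cellRef`) are hypothetical, because this page lists field descriptions without their names.

```python
def describe_location(mapping_type: str, span: dict) -> str:
    """Turn one source map span into a short human-readable location."""
    if mapping_type == "spatial":
        # PDF spans carry a 0-indexed page number; show it 1-indexed.
        return f"page {span['page'] + 1}"
    if mapping_type == "structural":
        # DOCX spans carry an XPath-like element selector.
        return f"element {span['xpath']}"
    if mapping_type == "tabular":
        # XLSX/CSV spans carry a cell reference.
        return f"cell {span['cellRef']}"
    raise ValueError(f"unknown mappingType: {mapping_type!r}")
```

Raising on an unknown value, instead of silently ignoring it, makes it obvious when a new variant appears that the client does not yet handle.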
Version of the processor used
When the extraction was performed
Whether the extracted content is empty
Bad Request - Validation error or invalid input
object
Unauthorized - Authentication required or invalid token
object
Forbidden - Insufficient permissions
object
Not Found - Resource does not exist
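A client can translate these error responses into actionable messages. The sketch below assumes the errors are returned with the standard HTTP status codes that match their names (400, 401, 403, 404); the numeric codes are not printed on this page, and the suggested remedies and function name are illustrative only.

```python
# Assumed mapping from standard HTTP status codes to the errors listed above.
ERROR_MEANINGS = {
    400: "Bad Request: validation error or invalid input; check the file ID and query parameters.",
    401: "Unauthorized: authentication required or invalid token; refresh credentials.",
    403: "Forbidden: insufficient permissions for this file.",
    404: "Not Found: the resource does not exist.",
}


def explain_status(status: int) -> str:
    """Return a short explanation for an error status from this endpoint."""
    return ERROR_MEANINGS.get(status, f"Unexpected status {status}")
```

Of these, only 401 is likely to be worth retrying (after re-authenticating); the others indicate a problem with the request or the resource itself.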