Skip to content

Querying Data

Catalyzed supports standard SQL for querying your data. This guide covers common query patterns and best practices.

Use the /queries endpoint to run SQL queries against your tables. The tables parameter maps the table names used in your SQL to their table IDs.

Basic SELECT query

Terminal window
curl -X POST https://api.catalyzed.ai/queries \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"sql": "SELECT * FROM users LIMIT 10",
"tables": {"users": "KzaMsfA0LSw_Ld0KyaXIS"}
}'

Query across multiple tables by including them all in the tables mapping:

JOIN across tables

Terminal window
curl -X POST https://api.catalyzed.ai/queries \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"sql": "SELECT o.order_id, c.name, o.total FROM orders o JOIN customers c ON o.customer_id = c.id",
"tables": {
"orders": "Ednc5U676CO4hn-FqsXeA",
"customers": "6fTBbbj4uv8TVMVh0gVch"
}
}'

The tables parameter supports three formats for mapping SQL table names to your data.

The simplest format — map a SQL table name to a table ID:

{
"tables": {
"orders": "Ednc5U676CO4hn-FqsXeA"
}
}

Apply server-side row-level filters before the query engine sees the data. Useful for multi-tenant isolation or soft-delete filtering:

Physical binding with prefilters

Terminal window
curl -X POST https://api.catalyzed.ai/queries \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"sql": "SELECT * FROM orders ORDER BY amount DESC",
"tables": {
"orders": {
"tableId": "Ednc5U676CO4hn-FqsXeA",
"filters": [
{"field": "tenant_id", "op": "eq", "value": "tenant-abc"},
{"field": "deleted_at", "op": "is_null"}
]
}
}
}'

Supported filter operators: eq, neq, gt, gte, lt, lte, in, not_in, like, is_null, is_not_null.

Each table has an internal _rowid column — a stable, unique row identifier assigned by the storage engine. This column is excluded from SELECT * by default, so queries like SELECT DISTINCT * operate only on your data columns.

To access _rowid, set includeRowId: true on the table binding:

Accessing _rowid system column

Terminal window
curl -X POST https://api.catalyzed.ai/queries \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"sql": "SELECT _rowid, name, email FROM customers",
"tables": {
"customers": {
"tableId": "6fTBbbj4uv8TVMVh0gVch",
"includeRowId": true
}
}
}'

Common use cases for _rowid:

  • Row-level lineage — track which source rows contributed to derived data or knowledge base chunks
  • Citation tracking — reference specific rows in pipeline outputs for auditability
  • Deduplication workflows — identify and compare rows across dataset versions

Define a table binding as a SQL query over other tables. The query creates a temporary view that your top-level SQL can reference like a regular table. This is useful for building reusable intermediate datasets within a single query request.

Query-derived binding

Terminal window
curl -X POST https://api.catalyzed.ai/queries \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"sql": "SELECT ac.name, o.amount FROM active_customers ac JOIN orders o ON ac.id = o.customer_id ORDER BY o.amount DESC",
"tables": {
"orders": "Ednc5U676CO4hn-FqsXeA",
"active_customers": {
"query": "SELECT * FROM c WHERE status = '\''active'\''",
"tables": {
"c": "6fTBbbj4uv8TVMVh0gVch"
}
}
}
}'

In this example, active_customers is a dynamic view defined by a SQL query that filters the customers table. The top-level SQL then joins this view with the physical orders table.

Nested tables within a query-derived binding also support prefilters:

{
"tables": {
"tenant_summary": {
"query": "SELECT product, SUM(amount) as total FROM s GROUP BY product",
"tables": {
"s": {
"tableId": "KzaMsfA0LSw_Ld0KyaXIS",
"filters": [
{"field": "tenant_id", "op": "eq", "value": "tenant-abc"}
]
}
}
}
}
}

Constraints:

  • Nested table names must not collide with top-level binding names or view names
  • All referenced tables (including nested) must belong to the same team
  • Nested tables are scoped to their view — they cannot be referenced from top-level SQL
-- Select specific columns
SELECT name, email, created_at FROM users
-- Select all columns
SELECT * FROM users
-- Column aliases
SELECT name AS customer_name, email AS contact FROM users
-- Expressions
SELECT name, price * quantity AS total FROM orders
-- Basic comparisons
SELECT * FROM users WHERE status = 'active'
SELECT * FROM orders WHERE amount > 100
SELECT * FROM orders WHERE created_at >= '2024-01-01'
-- Multiple conditions
SELECT * FROM orders WHERE status = 'active' AND amount > 100
SELECT * FROM orders WHERE status = 'active' OR status = 'pending'
-- NULL checks
SELECT * FROM users WHERE deleted_at IS NULL
SELECT * FROM orders WHERE notes IS NOT NULL
-- IN operator
SELECT * FROM orders WHERE status IN ('active', 'pending', 'review')
-- LIKE pattern matching
SELECT * FROM users WHERE email LIKE '%@example.com'
SELECT * FROM users WHERE name LIKE 'John%'
-- BETWEEN range
SELECT * FROM orders WHERE amount BETWEEN 100 AND 500
-- Ascending (default)
SELECT * FROM orders ORDER BY created_at
-- Descending
SELECT * FROM orders ORDER BY created_at DESC
-- Multiple columns
SELECT * FROM orders ORDER BY status ASC, created_at DESC
-- With NULLS handling
SELECT * FROM orders ORDER BY completed_at NULLS LAST
-- First 10 rows
SELECT * FROM users LIMIT 10
-- Skip first 20, get next 10
SELECT * FROM users LIMIT 10 OFFSET 20
-- Recommended: use ORDER BY with pagination
SELECT * FROM users ORDER BY id LIMIT 10 OFFSET 20
-- Count
SELECT COUNT(*) FROM orders
SELECT COUNT(DISTINCT customer_id) FROM orders
-- Sum, Average, Min, Max
SELECT SUM(amount) FROM orders
SELECT AVG(amount) FROM orders WHERE status = 'completed'
SELECT MIN(created_at), MAX(created_at) FROM orders
-- Group by
SELECT status, COUNT(*) as count, SUM(amount) as total
FROM orders
GROUP BY status
-- Having (filter groups)
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5
-- INNER JOIN (only matching rows)
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
-- LEFT JOIN (all from left, matching from right)
SELECT c.name, o.order_id
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
-- Multiple joins
SELECT o.order_id, c.name, p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
-- Subquery in WHERE
SELECT * FROM orders
WHERE customer_id IN (
SELECT id FROM customers WHERE status = 'premium'
)
-- Subquery in FROM
SELECT avg_amount, COUNT(*) as customer_count
FROM (
SELECT customer_id, AVG(amount) as avg_amount
FROM orders
GROUP BY customer_id
) customer_averages
GROUP BY avg_amount
WITH high_value_customers AS (
SELECT customer_id, SUM(amount) as total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 10000
)
SELECT c.name, hvc.total_spent
FROM high_value_customers hvc
JOIN customers c ON hvc.customer_id = c.id
ORDER BY hvc.total_spent DESC
-- Row number
SELECT name, amount,
ROW_NUMBER() OVER (ORDER BY amount DESC) as rank
FROM orders
-- Partition by
SELECT customer_id, order_date, amount,
SUM(amount) OVER (PARTITION BY customer_id) as customer_total
FROM orders
-- Running total
SELECT order_date, amount,
SUM(amount) OVER (ORDER BY order_date) as running_total
FROM orders

Queries return results in a structured JSON format:

{
"queryId": "qry_abc123",
"columns": [
{"name": "id", "type": "Utf8"},
{"name": "name", "type": "Utf8"},
{"name": "email", "type": "Utf8"},
{"name": "created_at", "type": "Timestamp(Microsecond, Some(\"UTC\"))"}
],
"rows": [
{"id": "1", "name": "Alice", "email": "[email protected]", "created_at": "2024-01-15T10:00:00Z"},
{"id": "2", "name": "Bob", "email": "[email protected]", "created_at": "2024-01-15T11:00:00Z"}
],
"rowCount": 2,
"truncated": false,
"tableVersions": {
"KzaMsfA0LSw_Ld0KyaXIS": 42
}
}

The tableVersions field maps each table ID to the dataset version that was queried. You can pass these values back via the datasetVersions request field to replay the exact same query against the same data snapshot — useful for debugging and reproducibility.

When using the Arrow IPC response format (Accept: application/vnd.apache.arrow.stream), the table versions are returned in the X-Table-Versions response header as a JSON object.

Prevent accidentally returning too many rows:

{
"sql": "SELECT * FROM users",
"tables": {"users": "KzaMsfA0LSw_Ld0KyaXIS"},
"maxRows": 1000
}

Preview how a query will execute without running it:

Terminal window
curl -X POST https://api.catalyzed.ai/queries/explain \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"sql": "SELECT * FROM users WHERE status = '\''active'\''",
"tables": {"users": "KzaMsfA0LSw_Ld0KyaXIS"}
}'
-- Good: Select only needed columns
SELECT id, name, status FROM users
-- Avoid: Select all columns
SELECT * FROM users
-- Good: Filter in WHERE
SELECT * FROM orders WHERE created_at > '2024-01-01'
-- Avoid: Filter in application code
SELECT * FROM orders -- then filter in code
-- Good: Always limit during development
SELECT * FROM orders WHERE status = 'active' LIMIT 100
-- Avoid: Unbounded queries
SELECT * FROM orders

For columns frequently used in WHERE clauses:

Terminal window
curl -X POST https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/indexes \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"indexName": "idx_status",
"columnName": "status",
"indexType": "btree"
}'
  • Use timestamp for dates, not strings
  • Use numeric types (int64, float64) for numbers, not strings
  • This enables efficient comparisons and aggregations
{
"error": "QUERY_ERROR",
"message": "SQL syntax error at position 45: unexpected token 'FORM'"
}
{
"error": "VALIDATION_ERROR",
"message": "Table 'orders' referenced in SQL but not found in tables mapping"
}

Long-running queries may timeout. Use filters and limits to reduce data scanned.

Convert natural language queries to SQL using the /queries/translate endpoint. The translator understands table schemas, column types, and supports semantic search functions.

Translate natural language to SQL

Terminal window
curl -X POST https://api.catalyzed.ai/queries/translate \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"tables": {
"orders": "Ednc5U676CO4hn-FqsXeA",
"customers": "6fTBbbj4uv8TVMVh0gVch"
},
"query": "Find customers who placed orders over $1000 last month",
"includeExplanation": true
}'

Response:

{
"sql": "SELECT DISTINCT c.name, c.email FROM customers c JOIN orders o ON c.id = o.customer_id WHERE o.amount > 1000 AND o.created_at >= date_trunc('month', CURRENT_DATE - INTERVAL '1 month')",
"confidence": "high",
"explanation": "This query joins customers and orders tables, filters for orders over $1000, and restricts to the previous month using date_trunc.",
"warnings": [],
"referencedTables": ["customers", "orders"],
"referencedColumns": ["customers.name", "customers.email", "customers.id", "orders.customer_id", "orders.amount", "orders.created_at"],
"usesVectorSearch": false
}
  • Schema Awareness — understands table structures, column types, and sample data
  • Semantic Search — auto-detects vector columns and enables knn_search(), knn_cosine(), knn_l2(), and text_to_embedding()
  • SQL Validation — validates generated SQL syntax via EXPLAIN (enabled by default)
  • Confidence Levels — returns "high", "medium", or "low" confidence for the translation
  • Explanations — optional natural language explanation of the generated SQL
OptionTypeDescription
tablesobjectMap of table aliases (as used in SQL) to table IDs
querystringNatural language query to translate (max 4000 chars)
includeExplanationbooleanInclude explanation of generated SQL (default: false)
validateSyntaxbooleanValidate SQL syntax before returning (default: true)
{
"tables": {
"products": "tbl_abc123"
},
"query": "Find products similar to wireless headphones",
"includeExplanation": true
}

Generated SQL:

SELECT * FROM knn_cosine(
'products',
'embedding',
text_to_embedding('wireless headphones'),
10
)

For optimizing translation quality — table naming, column naming, confidence handling, and the translate→execute workflow — see the SQL Translation Guide.

See the SQL Translation API for the complete endpoint reference.

Catalyzed also supports semantic similarity search using vector embeddings. Use knn_search() to find similar documents, products, or any embedded content:

SELECT * FROM knn_cosine(
'products',
'embedding',
text_to_embedding('wireless headphones'),
10
)

See the Vector Search Guide for complete documentation on:

  • knn_search(), knn_cosine(), and knn_l2() functions
  • text_to_embedding() for query-time embeddings
  • Filtered vector search
  • Distance metrics and thresholds