Querying Data

Catalyzed supports standard SQL for querying your data. This guide covers common query patterns and best practices.

Executing Queries

Use the /queries endpoint to run SQL queries against your tables. The tables parameter maps the table names used in your SQL to their table IDs.

Basic SELECT query

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM users LIMIT 10",
    "tables": {"users": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'

const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: "SELECT * FROM users LIMIT 10",
    tables: { users: "KzaMsfA0LSw_Ld0KyaXIS" },
  }),
});
const data = await response.json();

response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": "SELECT * FROM users LIMIT 10",
        "tables": {"users": "KzaMsfA0LSw_Ld0KyaXIS"}
    }
)
data = response.json()

Joining Tables

Query across multiple tables by including them all in the tables mapping:

JOIN across tables

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT o.order_id, c.name, o.total FROM orders o JOIN customers c ON o.customer_id = c.id",
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    }
  }'

const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: `SELECT o.order_id, c.name, o.total
          FROM orders o
          JOIN customers c ON o.customer_id = c.id`,
    tables: {
      orders: "Ednc5U676CO4hn-FqsXeA",
      customers: "6fTBbbj4uv8TVMVh0gVch",
    },
  }),
});

response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": """
            SELECT o.order_id, c.name, o.total
            FROM orders o
            JOIN customers c ON o.customer_id = c.id
        """,
        "tables": {
            "orders": "Ednc5U676CO4hn-FqsXeA",
            "customers": "6fTBbbj4uv8TVMVh0gVch"
        }
    }
)

Table Binding Types

The tables parameter supports three formats for mapping SQL table names to your data.

Simple Binding (Table ID)

The simplest format — map a SQL table name to a table ID:

{
  "tables": {
    "orders": "Ednc5U676CO4hn-FqsXeA"
  }
}

Physical Binding with Filters

Apply server-side row-level filters before the query engine sees the data. Useful for multi-tenant isolation or soft-delete filtering:

Physical binding with prefilters

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM orders ORDER BY amount DESC",
    "tables": {
      "orders": {
        "tableId": "Ednc5U676CO4hn-FqsXeA",
        "filters": [
          {"field": "tenant_id", "op": "eq", "value": "tenant-abc"},
          {"field": "deleted_at", "op": "is_null"}
        ]
      }
    }
  }'

const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: "SELECT * FROM orders ORDER BY amount DESC",
    tables: {
      orders: {
        tableId: "Ednc5U676CO4hn-FqsXeA",
        filters: [
          { field: "tenant_id", op: "eq", value: "tenant-abc" },
          { field: "deleted_at", op: "is_null" },
        ],
      },
    },
  }),
});

response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": "SELECT * FROM orders ORDER BY amount DESC",
        "tables": {
            "orders": {
                "tableId": "Ednc5U676CO4hn-FqsXeA",
                "filters": [
                    {"field": "tenant_id", "op": "eq", "value": "tenant-abc"},
                    {"field": "deleted_at", "op": "is_null"}
                ]
            }
        }
    }
)

Supported filter operators: eq, neq, gt, gte, lt, lte, in, not_in, like, is_null, is_not_null.

System Columns

Each table has an internal _rowid column — a stable, unique row identifier assigned by the storage engine. This column is excluded from SELECT * by default, so queries like SELECT DISTINCT * operate only on your data columns.

To access _rowid, set includeRowId: true on the table binding:

Accessing _rowid system column

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT _rowid, name, email FROM customers",
    "tables": {
      "customers": {
        "tableId": "6fTBbbj4uv8TVMVh0gVch",
        "includeRowId": true
      }
    }
  }'

const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: "SELECT _rowid, name, email FROM customers",
    tables: {
      customers: {
        tableId: "6fTBbbj4uv8TVMVh0gVch",
        includeRowId: true,
      },
    },
  }),
});

response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": "SELECT _rowid, name, email FROM customers",
        "tables": {
            "customers": {
                "tableId": "6fTBbbj4uv8TVMVh0gVch",
                "includeRowId": True
            }
        }
    }
)

Common use cases for _rowid:

Row-level lineage — track which source rows contributed to derived data or knowledge base chunks
Citation tracking — reference specific rows in pipeline outputs for auditability
Deduplication workflows — identify and compare rows across dataset versions

Query-Derived Binding (Dynamic View)

Define a table binding as a SQL query over other tables. The query creates a temporary view that your top-level SQL can reference like a regular table. This is useful for building reusable intermediate datasets within a single query request.

Query-derived binding

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT ac.name, o.amount FROM active_customers ac JOIN orders o ON ac.id = o.customer_id ORDER BY o.amount DESC",
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "active_customers": {
        "query": "SELECT * FROM c WHERE status = '\''active'\''",
        "tables": {
          "c": "6fTBbbj4uv8TVMVh0gVch"
        }
      }
    }
  }'

const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: `SELECT ac.name, o.amount
          FROM active_customers ac
          JOIN orders o ON ac.id = o.customer_id
          ORDER BY o.amount DESC`,
    tables: {
      orders: "Ednc5U676CO4hn-FqsXeA",
      active_customers: {
        query: "SELECT * FROM c WHERE status = 'active'",
        tables: {
          c: "6fTBbbj4uv8TVMVh0gVch",
        },
      },
    },
  }),
});

response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": """
            SELECT ac.name, o.amount
            FROM active_customers ac
            JOIN orders o ON ac.id = o.customer_id
            ORDER BY o.amount DESC
        """,
        "tables": {
            "orders": "Ednc5U676CO4hn-FqsXeA",
            "active_customers": {
                "query": "SELECT * FROM c WHERE status = 'active'",
                "tables": {
                    "c": "6fTBbbj4uv8TVMVh0gVch"
                }
            }
        }
    }
)

In this example, active_customers is a dynamic view defined by a SQL query that filters the customers table. The top-level SQL then joins this view with the physical orders table.

Nested tables within a query-derived binding also support prefilters:

{
  "tables": {
    "tenant_summary": {
      "query": "SELECT product, SUM(amount) as total FROM s GROUP BY product",
      "tables": {
        "s": {
          "tableId": "KzaMsfA0LSw_Ld0KyaXIS",
          "filters": [
            {"field": "tenant_id", "op": "eq", "value": "tenant-abc"}
          ]
        }
      }
    }
  }
}

Constraints:

Nested table names must not collide with top-level binding names or view names
All referenced tables (including nested) must belong to the same team
Nested tables are scoped to their view — they cannot be referenced from top-level SQL

SQL Syntax Reference

SELECT Clauses

-- Select specific columns
SELECT name, email, created_at FROM users

-- Select all columns
SELECT * FROM users

-- Column aliases
SELECT name AS customer_name, email AS contact FROM users

-- Expressions
SELECT name, price * quantity AS total FROM orders

Filtering (WHERE)

-- Basic comparisons
SELECT * FROM users WHERE status = 'active'
SELECT * FROM orders WHERE amount > 100
SELECT * FROM orders WHERE created_at >= '2024-01-01'

-- Multiple conditions
SELECT * FROM orders WHERE status = 'active' AND amount > 100
SELECT * FROM orders WHERE status = 'active' OR status = 'pending'

-- NULL checks
SELECT * FROM users WHERE deleted_at IS NULL
SELECT * FROM orders WHERE notes IS NOT NULL

-- IN operator
SELECT * FROM orders WHERE status IN ('active', 'pending', 'review')

-- LIKE pattern matching
SELECT * FROM users WHERE email LIKE '%@example.com'
SELECT * FROM users WHERE name LIKE 'John%'

-- BETWEEN range
SELECT * FROM orders WHERE amount BETWEEN 100 AND 500

Sorting (ORDER BY)

-- Ascending (default)
SELECT * FROM orders ORDER BY created_at

-- Descending
SELECT * FROM orders ORDER BY created_at DESC

-- Multiple columns
SELECT * FROM orders ORDER BY status ASC, created_at DESC

-- With NULLS handling
SELECT * FROM orders ORDER BY completed_at NULLS LAST

Pagination (LIMIT/OFFSET)

-- First 10 rows
SELECT * FROM users LIMIT 10

-- Skip first 20, get next 10
SELECT * FROM users LIMIT 10 OFFSET 20

-- Recommended: use ORDER BY with pagination
SELECT * FROM users ORDER BY id LIMIT 10 OFFSET 20

Aggregations

-- Count
SELECT COUNT(*) FROM orders
SELECT COUNT(DISTINCT customer_id) FROM orders

-- Sum, Average, Min, Max
SELECT SUM(amount) FROM orders
SELECT AVG(amount) FROM orders WHERE status = 'completed'
SELECT MIN(created_at), MAX(created_at) FROM orders

-- Group by
SELECT status, COUNT(*) as count, SUM(amount) as total
FROM orders
GROUP BY status

-- Having (filter groups)
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5

Joins

-- INNER JOIN (only matching rows)
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id

-- LEFT JOIN (all from left, matching from right)
SELECT c.name, o.order_id
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id

-- Multiple joins
SELECT o.order_id, c.name, p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id

Subqueries

-- Subquery in WHERE
SELECT * FROM orders
WHERE customer_id IN (
  SELECT id FROM customers WHERE status = 'premium'
)

-- Subquery in FROM
SELECT avg_amount, COUNT(*) as customer_count
FROM (
  SELECT customer_id, AVG(amount) as avg_amount
  FROM orders
  GROUP BY customer_id
) customer_averages
GROUP BY avg_amount

Common Table Expressions (CTEs)

WITH high_value_customers AS (
  SELECT customer_id, SUM(amount) as total_spent
  FROM orders
  GROUP BY customer_id
  HAVING SUM(amount) > 10000
)
SELECT c.name, hvc.total_spent
FROM high_value_customers hvc
JOIN customers c ON hvc.customer_id = c.id
ORDER BY hvc.total_spent DESC

Window Functions

-- Row number
SELECT name, amount,
  ROW_NUMBER() OVER (ORDER BY amount DESC) as rank
FROM orders

-- Partition by
SELECT customer_id, order_date, amount,
  SUM(amount) OVER (PARTITION BY customer_id) as customer_total
FROM orders

-- Running total
SELECT order_date, amount,
  SUM(amount) OVER (ORDER BY order_date) as running_total
FROM orders

Query Response Format

Queries return results in a structured JSON format:

{
  "queryId": "qry_abc123",
  "columns": [
    {"name": "id", "type": "Utf8"},
    {"name": "name", "type": "Utf8"},
    {"name": "email", "type": "Utf8"},
    {"name": "created_at", "type": "Timestamp(Microsecond, Some(\"UTC\"))"}
  ],
  "rows": [
    {"id": "1", "name": "Alice", "email": "[email protected]", "created_at": "2024-01-15T10:00:00Z"},
    {"id": "2", "name": "Bob", "email": "[email protected]", "created_at": "2024-01-15T11:00:00Z"}
  ],
  "rowCount": 2,
  "truncated": false,
  "tableVersions": {
    "KzaMsfA0LSw_Ld0KyaXIS": 42
  }
}

The tableVersions field maps each table ID to the dataset version that was queried. You can pass these values back via the datasetVersions request field to replay the exact same query against the same data snapshot — useful for debugging and reproducibility.

When using the Arrow IPC response format (Accept: application/vnd.apache.arrow.stream), the table versions are returned in the X-Table-Versions response header as a JSON object.

Query Options

Limit Results

Prevent accidentally returning too many rows:

{
  "sql": "SELECT * FROM users",
  "tables": {"users": "KzaMsfA0LSw_Ld0KyaXIS"},
  "maxRows": 1000
}

Get Execution Plan

Preview how a query will execute without running it:

curl -X POST https://api.catalyzed.ai/queries/explain \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM users WHERE status = '\''active'\''",
    "tables": {"users": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'

Performance Tips

1. Use Specific Columns

-- Good: Select only needed columns
SELECT id, name, status FROM users

-- Avoid: Select all columns
SELECT * FROM users

2. Filter Early

-- Good: Filter in WHERE
SELECT * FROM orders WHERE created_at > '2024-01-01'

-- Avoid: Filter in application code
SELECT * FROM orders  -- then filter in code

3. Use LIMIT

-- Good: Always limit during development
SELECT * FROM orders WHERE status = 'active' LIMIT 100

-- Avoid: Unbounded queries
SELECT * FROM orders

4. Create Indexes

For columns frequently used in WHERE clauses:

curl -X POST https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/indexes \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_status",
    "columnName": "status",
    "indexType": "btree"
  }'

5. Use Appropriate Data Types

Use timestamp for dates, not strings
Use numeric types (int64, float64) for numbers, not strings
This enables efficient comparisons and aggregations

Error Handling

Syntax Errors

{
  "error": "QUERY_ERROR",
  "message": "SQL syntax error at position 45: unexpected token 'FORM'"
}

Missing Table Bindings

{
  "error": "VALIDATION_ERROR",
  "message": "Table 'orders' referenced in SQL but not found in tables mapping"
}

Timeout

Long-running queries may timeout. Use filters and limits to reduce data scanned.

SQL Translation

Convert natural language queries to SQL using the /queries/translate endpoint. The translator understands table schemas, column types, and supports semantic search functions.

Translate natural language to SQL

curl -X POST https://api.catalyzed.ai/queries/translate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    },
    "query": "Find customers who placed orders over $1000 last month",
    "includeExplanation": true
  }'

const response = await fetch("https://api.catalyzed.ai/queries/translate", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    tables: {
      orders: "Ednc5U676CO4hn-FqsXeA",
      customers: "6fTBbbj4uv8TVMVh0gVch",
    },
    query: "Find customers who placed orders over $1000 last month",
    includeExplanation: true,
  }),
});
const { sql, explanation, confidence } = await response.json();

response = requests.post(
    "https://api.catalyzed.ai/queries/translate",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "tables": {
            "orders": "Ednc5U676CO4hn-FqsXeA",
            "customers": "6fTBbbj4uv8TVMVh0gVch"
        },
        "query": "Find customers who placed orders over $1000 last month",
        "includeExplanation": True
    }
)
result = response.json()
sql = result["sql"]
explanation = result.get("explanation")
confidence = result["confidence"]

Response:

{
  "sql": "SELECT DISTINCT c.name, c.email FROM customers c JOIN orders o ON c.id = o.customer_id WHERE o.amount > 1000 AND o.created_at >= date_trunc('month', CURRENT_DATE - INTERVAL '1 month')",
  "confidence": "high",
  "explanation": "This query joins customers and orders tables, filters for orders over $1000, and restricts to the previous month using date_trunc.",
  "warnings": [],
  "referencedTables": ["customers", "orders"],
  "referencedColumns": ["customers.name", "customers.email", "customers.id", "orders.customer_id", "orders.amount", "orders.created_at"],
  "usesVectorSearch": false
}

Translation Features

Schema Awareness — understands table structures, column types, and sample data
Semantic Search — auto-detects vector columns and enables knn_search(), knn_cosine(), knn_l2(), and text_to_embedding()
SQL Validation — validates generated SQL syntax via EXPLAIN (enabled by default)
Confidence Levels — returns "high", "medium", or "low" confidence for the translation
Explanations — optional natural language explanation of the generated SQL

Request Options

Option	Type	Description
`tables`	object	Map of table aliases (as used in SQL) to table IDs
`query`	string	Natural language query to translate (max 4000 chars)
`includeExplanation`	boolean	Include explanation of generated SQL (default: `false`)
`validateSyntax`	boolean	Validate SQL syntax before returning (default: `true`)

Example: Semantic Search Query

{
  "tables": {
    "products": "tbl_abc123"
  },
  "query": "Find products similar to wireless headphones",
  "includeExplanation": true
}

Generated SQL:

SELECT * FROM knn_cosine(
    'products',
    'embedding',
    text_to_embedding('wireless headphones'),
    10
)

For optimizing translation quality — table naming, column naming, confidence handling, and the translate→execute workflow — see the SQL Translation Guide.

See the SQL Translation API for the complete endpoint reference.

Vector Search

Catalyzed also supports semantic similarity search using vector embeddings. Use knn_search() to find similar documents, products, or any embedded content:

SELECT * FROM knn_cosine(
    'products',
    'embedding',
    text_to_embedding('wireless headphones'),
    10
)

See the Vector Search Guide for complete documentation on:

knn_search(), knn_cosine(), and knn_l2() functions
text_to_embedding() for query-time embeddings
Filtered vector search
Distance metrics and thresholds

Next Steps

Vector Search - Semantic similarity search with embeddings
Full-Text Search - BM25 keyword search with inverted indexes
Tables - Learn about table schemas and indexes
Schema Management - Evolve your schemas safely