SQL Translation

The /queries/translate endpoint converts natural language questions into SQL. Catalyzed handles the translation engine — you control the context that makes it accurate. This guide covers how to get the best results.

The recommended workflow for natural language querying is a two-step process: translate the question to SQL, then execute it. This gives you visibility into the generated SQL before running it.

Step 1 — Translate:

Translate natural language to SQL

curl -X POST https://api.catalyzed.ai/queries/translate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    },
    "query": "Which customers spent more than $5000 last quarter?",
    "includeExplanation": true,
    "validateSyntax": true
  }'

Response:

{
  "sql": "SELECT c.name, c.email, SUM(o.amount) AS total_spent FROM customers c JOIN orders o ON c.id = o.customer_id WHERE o.created_at >= date_trunc('quarter', CURRENT_DATE - INTERVAL '3 months') AND o.created_at < date_trunc('quarter', CURRENT_DATE) GROUP BY c.name, c.email HAVING SUM(o.amount) > 5000 ORDER BY total_spent DESC",
  "explanation": "Joins customers and orders, filters to the previous quarter, groups by customer, and returns those with total spend exceeding $5000.",
  "warnings": [],
  "referencedTables": ["customers", "orders"],
  "referencedColumns": ["customers.name", "customers.email", "customers.id", "orders.amount", "orders.customer_id", "orders.created_at"],
  "usesVectorSearch": false,
  "confidence": "high"
}

Step 2 — Execute the translated SQL:

Execute the translated SQL

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT c.name, c.email, SUM(o.amount) AS total_spent FROM customers c JOIN orders o ON c.id = o.customer_id WHERE o.created_at >= date_trunc('\''quarter'\'', CURRENT_DATE - INTERVAL '\''3 months'\'') AND o.created_at < date_trunc('\''quarter'\'', CURRENT_DATE) GROUP BY c.name, c.email HAVING SUM(o.amount) > 5000 ORDER BY total_spent DESC",
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    }
  }'
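The two curl steps above are easy to wrap in a small client-side helper. A minimal Python sketch, with the HTTP transport injected as a callable so any HTTP library (or a test stub) can be used; the endpoint paths and payload shapes are those shown in this guide:

```python
API_BASE = "https://api.catalyzed.ai"

def translate_and_execute(post, token, tables, question):
    """Translate a natural language question to SQL, then execute it.

    `post` is any callable (url, headers, payload) -> dict, so you can
    plug in requests, httpx, or a stub for testing.
    """
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    # Step 1 - translate the question to SQL
    translation = post(f"{API_BASE}/queries/translate", headers, {
        "tables": tables,
        "query": question,
        "includeExplanation": True,
        "validateSyntax": True,
    })
    # Step 2 - execute the generated SQL against the same tables map
    result = post(f"{API_BASE}/queries", headers, {
        "sql": translation["sql"],
        "tables": tables,
    })
    return translation, result
```

Because the translation is returned before execution, this is also the natural place to pause and inspect the SQL (or show it to the user) between the two calls.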

When you call /queries/translate, the engine:

  1. Loads table schemas — column names, data types, primary keys, nullable flags
  2. Detects vector columns — enables semantic search functions automatically
  3. Samples data — fetches up to 5 rows per table to understand data patterns
  4. Generates SQL — sends the schema context and your natural language query to an LLM
  5. Validates syntax — runs EXPLAIN on the generated SQL to catch syntax errors (when validateSyntax: true)
  6. Retries on failure — if validation fails, feeds the error back to the LLM for correction

This means the quality of translation depends directly on how well the engine can understand your data. The next section covers what you can control.
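The validate-and-retry behavior in steps 5 and 6 can be pictured roughly as the loop below. This is a sketch only: `generate_sql` and `explain` stand in for the LLM call and the EXPLAIN syntax check, both of which are internal to Catalyzed and not exposed to clients.

```python
def translate_with_retries(generate_sql, explain, question, schema_context,
                           max_retries=2):
    """Sketch of the generate -> validate -> retry loop (steps 4-6).

    `generate_sql(question, schema_context, error)` stands in for the LLM
    call; `explain(sql)` stands in for syntax validation and raises
    ValueError on invalid SQL. The retry count is illustrative.
    """
    error = None
    for _ in range(max_retries + 1):
        sql = generate_sql(question, schema_context, error)
        try:
            explain(sql)      # EXPLAIN catches syntax errors
            return sql        # valid on this attempt
        except ValueError as exc:
            error = str(exc)  # feed the error back for correction
    raise RuntimeError("VALIDATION_FAILED: all retries exhausted")
```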

The keys in the tables map become the table names the engine sees. Descriptive aliases produce better translations.

// Good — the engine understands what each table contains
{
  "tables": {
    "customer_orders": "Ednc5U676CO4hn-FqsXeA",
    "product_catalog": "6fTBbbj4uv8TVMVh0gVch"
  }
}

// Bad — the engine has no idea what "t1" or "t2" are
{
  "tables": {
    "t1": "Ednc5U676CO4hn-FqsXeA",
    "t2": "6fTBbbj4uv8TVMVh0gVch"
  }
}
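If aliases are generated programmatically, a quick client-side lint can catch opaque ones before the request is sent. This is a hypothetical helper, not part of the API; the pattern for "opaque" is an assumption you should tune to your own naming:

```python
import re

# Flags aliases like "t1", "tbl2", "table3" that carry no semantic hint
OPAQUE = re.compile(r"^(t|tbl|table)?\d*$", re.IGNORECASE)

def opaque_aliases(tables):
    """Return the aliases in a `tables` map that tell the engine nothing."""
    return [alias for alias in tables if OPAQUE.match(alias)]
```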

Column names are included verbatim in the translation context. Clear names reduce ambiguity and improve confidence.

| Good | Bad | Why |
| --- | --- | --- |
| customer_name | cn | Engine can infer the column's purpose |
| order_date | dt | Natural language query "orders from last month" maps cleanly |
| total_amount | col3 | Aggregation queries like "total spent" resolve correctly |
| is_active | flag1 | Boolean filters map to natural language conditions |

Column names are set when you create a table. If your schema has opaque names, consider recreating the table with clearer naming — this has the biggest impact on translation quality.

The table’s tableName (set during creation with POST /dataset-tables) is also shown to the engine as context. A table named customer_orders provides more context than one named tbl_001.

The engine fetches up to 5 sample rows from each table. These rows help the engine understand:

  • What values look like (dates, IDs, statuses, amounts)
  • Data formats (ISO dates vs timestamps, currency formats)
  • Common patterns (enum values like "active"/"inactive", ID formats)

Tables with representative data produce significantly better translations than empty or sparse tables. If you’re setting up a new table, insert a few rows of sample data before relying on translation.

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| tables | object | required | Map of table aliases to table IDs |
| query | string | required | Natural language query (max 4000 chars) |
| includeExplanation | boolean | false | Include a natural language explanation of the generated SQL |
| validateSyntax | boolean | true | Validate SQL syntax via EXPLAIN before returning |

Response fields:

| Field | Type | Description |
| --- | --- | --- |
| sql | string | Generated SQL query |
| explanation | string? | Explanation of the SQL (only when includeExplanation: true) |
| warnings | string[] | Assumptions or potential issues identified during translation |
| referencedTables | string[] | Table aliases used in the SQL |
| referencedColumns | string[] | Columns used (format: table.column) |
| usesVectorSearch | boolean | Whether the SQL uses vector search functions |
| confidence | string | Translation confidence: "high", "medium", or "low" |
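For downstream handling it can help to parse the response into a typed structure. A minimal sketch using only the fields documented above (the class and function names are illustrative, not part of any Catalyzed SDK):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Translation:
    sql: str
    confidence: str                       # "high" | "medium" | "low"
    warnings: list = field(default_factory=list)
    referenced_tables: list = field(default_factory=list)
    referenced_columns: list = field(default_factory=list)
    uses_vector_search: bool = False
    explanation: Optional[str] = None     # only when includeExplanation: true

def parse_translation(body: dict) -> Translation:
    """Convert a /queries/translate response body into a Translation."""
    return Translation(
        sql=body["sql"],
        confidence=body["confidence"],
        warnings=body.get("warnings", []),
        referenced_tables=body.get("referencedTables", []),
        referenced_columns=body.get("referencedColumns", []),
        uses_vector_search=body.get("usesVectorSearch", False),
        explanation=body.get("explanation"),
    )
```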

The confidence field indicates how certain the engine is about the translation:

| Level | Meaning | Recommendation |
| --- | --- | --- |
| high | Schema clearly supports the query with no ambiguity | Safe to execute automatically |
| medium | Some interpretation required: column mapping or filter logic may be approximate | Review the SQL or show it to the user before executing |
| low | Significant ambiguity: the query may not map well to the available schema | Manual review strongly recommended |

Use includeExplanation: true when confidence is medium or low to understand what assumptions the engine made.
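A common pattern is to gate execution on this field: auto-execute high-confidence translations and route everything else to review. A sketch, noting that the gating policy (here, also requiring an empty warnings array) is your choice, not an API feature:

```python
def should_auto_execute(translation: dict) -> bool:
    """Auto-execute only clean, high-confidence translations.

    A conservative policy: anything below "high" confidence, or any
    translation carrying warnings, goes to human review instead.
    """
    return (translation.get("confidence") == "high"
            and not translation.get("warnings"))
```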

The warnings array surfaces assumptions the engine made during translation. Common examples:

  • "Assumed 'amount' column contains order totals in same currency" — the engine inferred column semantics
  • "Date range interpreted as calendar quarter" — ambiguous time references were resolved
  • "No exact column match for 'revenue'; using 'amount' as closest match" — fuzzy column mapping

Warnings don’t mean the SQL is wrong — they flag areas where the engine made a judgment call. If a warning indicates an incorrect assumption, refine your query or improve your column naming.

If any table in your request has vector columns (columns with FixedSizeList<Float32, N> or similar types), the engine automatically enables semantic search functions:

  • text_to_embedding('search text') — convert text to a vector at query time
  • knn_search(table, column, vector, k) — k-nearest neighbors search
  • knn_cosine(table, column, vector, k) — cosine similarity search
  • knn_l2(table, column, vector, k) — L2 distance search

This means natural language queries like “find products similar to wireless headphones” automatically generate vector search SQL when the table has an embedding column:

{
  "tables": { "products": "tbl_abc123" },
  "query": "Find products similar to wireless headphones"
}

Generated SQL:

SELECT * FROM knn_cosine(
  'products',
  'embedding',
  text_to_embedding('wireless headphones'),
  10
)

The usesVectorSearch response field confirms when vector functions were used. See the Vector Search guide for details on these functions.

Translation errors return a structured response with an error code:

{
  "error": "Table 'orders' (ID: invalid_id) not found",
  "code": "TABLES_NOT_FOUND"
}
| Code | Meaning | Resolution |
| --- | --- | --- |
| TABLES_NOT_FOUND | One or more table IDs in the tables map don't exist | Verify table IDs; use GET /dataset-tables/{id} to check |
| TRANSLATION_FAILED | The engine couldn't generate valid SQL from the query | Simplify the query, improve table/column naming, or ensure tables have data |
| VALIDATION_FAILED | Generated SQL failed syntax validation | The engine retries automatically; this error means all retries were exhausted |
| INVALID_REQUEST | Request body validation failed | Check required fields: tables (non-empty) and query (1-4000 chars) |
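In application code, these codes map naturally to distinct recovery paths. A sketch whose recovery actions simply mirror the resolutions above (the action strings are illustrative):

```python
def recovery_action(error_body: dict) -> str:
    """Map a structured translation error to a recovery action."""
    code = error_body.get("code")
    if code == "TABLES_NOT_FOUND":
        return "verify-table-ids"    # check with GET /dataset-tables/{id}
    if code == "TRANSLATION_FAILED":
        return "rephrase-query"      # simplify query or improve naming
    if code == "VALIDATION_FAILED":
        return "manual-sql"          # retries already exhausted
    if code == "INVALID_REQUEST":
        return "fix-request-body"    # tables non-empty, query 1-4000 chars
    return "unknown-error"
```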

Use this checklist when setting up tables for SQL translation:

  • Table aliases — use descriptive names in the tables map (not t1, t2)
  • Table names — create tables with meaningful tableName values
  • Column names — use descriptive snake_case names (not abbreviations)
  • Data population — insert at least 5 representative rows per table
  • Enable explanation — use includeExplanation: true when debugging
  • Check confidence — review SQL manually when confidence is medium or low
  • Review warnings — address any incorrect assumptions flagged in warnings
Separately, keep these limitations in mind:

  • Single-query translation — the endpoint translates one query at a time. For multi-step analytical workflows (e.g., aggregate → compare → rank), orchestrate multiple translate + execute calls.
  • Schema-dependent quality — translation quality is bounded by schema clarity. Tables with opaque column names (col1, col2) will produce lower-quality results regardless of query phrasing.
  • DataFusion SQL dialect — generated SQL uses DataFusion syntax, which differs from PostgreSQL or MySQL in some areas (e.g., INTERVAL arithmetic instead of date_add/date_sub, Chrono strftime format instead of PostgreSQL date format strings).