SQL Translation
The /queries/translate endpoint converts natural language questions into SQL. Catalyzed handles the translation engine — you control the context that makes it accurate. This guide covers how to get the best results.
The Translate → Execute Pattern
The recommended workflow for natural language querying is a two-step process: translate the question to SQL, then execute it. This gives you visibility into the generated SQL before running it.
Step 1 — Translate:
Translate natural language to SQL
```sh
curl -X POST https://api.catalyzed.ai/queries/translate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    },
    "query": "Which customers spent more than $5000 last quarter?",
    "includeExplanation": true,
    "validateSyntax": true
  }'
```

```js
const translateResponse = await fetch("https://api.catalyzed.ai/queries/translate", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    tables: {
      orders: "Ednc5U676CO4hn-FqsXeA",
      customers: "6fTBbbj4uv8TVMVh0gVch",
    },
    query: "Which customers spent more than $5000 last quarter?",
    includeExplanation: true,
    validateSyntax: true,
  }),
});
const translation = await translateResponse.json();
```

```python
translate_response = requests.post(
    "https://api.catalyzed.ai/queries/translate",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "tables": {
            "orders": "Ednc5U676CO4hn-FqsXeA",
            "customers": "6fTBbbj4uv8TVMVh0gVch"
        },
        "query": "Which customers spent more than $5000 last quarter?",
        "includeExplanation": True,
        "validateSyntax": True
    })
translation = translate_response.json()
```

Response:
```json
{
  "sql": "SELECT c.name, c.email, SUM(o.amount) AS total_spent FROM customers c JOIN orders o ON c.id = o.customer_id WHERE o.created_at >= date_trunc('quarter', CURRENT_DATE - INTERVAL '3 months') AND o.created_at < date_trunc('quarter', CURRENT_DATE) GROUP BY c.name, c.email HAVING SUM(o.amount) > 5000 ORDER BY total_spent DESC",
  "explanation": "Joins customers and orders, filters to the previous quarter, groups by customer, and returns those with total spend exceeding $5000.",
  "warnings": [],
  "referencedTables": ["customers", "orders"],
  "referencedColumns": ["customers.name", "customers.email", "customers.id", "orders.amount", "orders.customer_id", "orders.created_at"],
  "usesVectorSearch": false,
  "confidence": "high"
}
```

Step 2 — Execute the translated SQL:
Execute the translated SQL
```sh
curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT c.name, c.email, SUM(o.amount) AS total_spent FROM customers c JOIN orders o ON c.id = o.customer_id WHERE o.created_at >= date_trunc('\''quarter'\'', CURRENT_DATE - INTERVAL '\''3 months'\'') AND o.created_at < date_trunc('\''quarter'\'', CURRENT_DATE) GROUP BY c.name, c.email HAVING SUM(o.amount) > 5000 ORDER BY total_spent DESC",
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    }
  }'
```

```js
// Use the SQL from the translation response
const queryResponse = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: translation.sql,
    tables: {
      orders: "Ednc5U676CO4hn-FqsXeA",
      customers: "6fTBbbj4uv8TVMVh0gVch",
    },
  }),
});
const results = await queryResponse.json();
```

```python
# Use the SQL from the translation response
query_response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": translation["sql"],
        "tables": {
            "orders": "Ednc5U676CO4hn-FqsXeA",
            "customers": "6fTBbbj4uv8TVMVh0gVch"
        }
    })
results = query_response.json()
```

How Translation Works
When you call `/queries/translate`, the engine:
- Loads table schemas — column names, data types, primary keys, nullable flags
- Detects vector columns — enables semantic search functions automatically
- Samples data — fetches up to 5 rows per table to understand data patterns
- Generates SQL — sends the schema context and your natural language query to an LLM
- Validates syntax — runs `EXPLAIN` on the generated SQL to catch syntax errors (when `validateSyntax: true`)
- Retries on failure — if validation fails, feeds the error back to the LLM for correction
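The validate-and-retry step can be pictured as a simple loop. This is an illustrative sketch of the behavior described above, not the engine's actual implementation: `generate_sql` and `explain` are hypothetical stand-ins for the LLM call and the syntax check.

```python
# Sketch of the generate -> validate -> retry loop described above.
# `generate_sql` and `explain` are hypothetical stand-ins, not a real API.

def translate_with_retries(generate_sql, explain, query, max_retries=2):
    """Generate SQL, validate it, and feed any error back for correction."""
    error = None
    for _ in range(max_retries + 1):
        # The previous validation error (if any) is passed back so the
        # generator can correct its output.
        sql = generate_sql(query, previous_error=error)
        error = explain(sql)  # None means the SQL parsed cleanly
        if error is None:
            return sql
    raise RuntimeError(f"SQL validation failed after retries: {error}")
```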
This means the quality of translation depends directly on how well the engine can understand your data. The next section covers what you can control.
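Putting the two steps together, the translate → execute pattern can be wrapped in a small client-side helper. This is a minimal sketch, not part of an official client: `post` is an injected function (for example, a thin wrapper around `requests.post` that returns the parsed JSON body), so the workflow logic stays independent of any HTTP library.

```python
# Illustrative translate -> execute helper. `post(path, body)` is an
# injected HTTP function (an assumption for this sketch) that returns
# the parsed JSON response body.

def run_nl_query(post, tables, question):
    """Translate a natural language question to SQL, then execute it.

    Returns (results, translation) so callers can inspect the generated
    SQL, warnings, and confidence alongside the query results.
    """
    translation = post("/queries/translate", {
        "tables": tables,
        "query": question,
        "includeExplanation": True,
        "validateSyntax": True,
    })
    # Execute the SQL exactly as returned, against the same tables map.
    results = post("/queries", {
        "sql": translation["sql"],
        "tables": tables,
    })
    return results, translation
```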
Optimizing Translation Quality
Table Aliases
Section titled “Table Aliases”The keys in the tables map become the table names the engine sees. Descriptive aliases produce better translations.
```jsonc
// Good — the engine understands what each table contains
{
  "tables": {
    "customer_orders": "Ednc5U676CO4hn-FqsXeA",
    "product_catalog": "6fTBbbj4uv8TVMVh0gVch"
  }
}
```

```jsonc
// Bad — the engine has no idea what "t1" or "t2" are
{
  "tables": {
    "t1": "Ednc5U676CO4hn-FqsXeA",
    "t2": "6fTBbbj4uv8TVMVh0gVch"
  }
}
```

Column Names
Column names are included verbatim in the translation context. Clear names reduce ambiguity and improve confidence.
| Good | Bad | Why |
|---|---|---|
| `customer_name` | `cn` | Engine can infer the column’s purpose |
| `order_date` | `dt` | Natural language query “orders from last month” maps cleanly |
| `total_amount` | `col3` | Aggregation queries like “total spent” resolve correctly |
| `is_active` | `flag1` | Boolean filters map to natural language conditions |
Column names are set when you create a table. If your schema has opaque names, consider recreating the table with clearer naming — this has the biggest impact on translation quality.
Table Names
The table’s `tableName` (set during creation with `POST /dataset-tables`) is also shown to the engine as context. A table named `customer_orders` provides more context than one named `tbl_001`.
Data Population
The engine fetches up to 5 sample rows from each table. These rows help the engine understand:
- What values look like (dates, IDs, statuses, amounts)
- Data formats (ISO dates vs timestamps, currency formats)
- Common patterns (enum values like `"active"`/`"inactive"`, ID formats)
Tables with representative data produce significantly better translations than empty or sparse tables. If you’re setting up a new table, insert a few rows of sample data before relying on translation.
Request Options
| Field | Type | Default | Description |
|---|---|---|---|
| `tables` | object | required | Map of table aliases to table IDs |
| `query` | string | required | Natural language query (max 4000 chars) |
| `includeExplanation` | boolean | `false` | Include a natural language explanation of the generated SQL |
| `validateSyntax` | boolean | `true` | Validate SQL syntax via `EXPLAIN` before returning |
Response Fields
| Field | Type | Description |
|---|---|---|
| `sql` | string | Generated SQL query |
| `explanation` | string? | Explanation of the SQL (only when `includeExplanation: true`) |
| `warnings` | string[] | Assumptions or potential issues identified during translation |
| `referencedTables` | string[] | Table aliases used in the SQL |
| `referencedColumns` | string[] | Columns used (format: `table.column`) |
| `usesVectorSearch` | boolean | Whether the SQL uses vector search functions |
| `confidence` | string | Translation confidence: `"high"`, `"medium"`, or `"low"` |
Confidence Levels
The `confidence` field indicates how certain the engine is about the translation:
| Level | Meaning | Recommendation |
|---|---|---|
| `high` | Schema clearly supports the query with no ambiguity | Safe to execute automatically |
| `medium` | Some interpretation required — column mapping or filter logic may be approximate | Review the SQL or show it to the user before executing |
| `low` | Significant ambiguity — the query may not map well to the available schema | Manual review strongly recommended |
Use `includeExplanation: true` when confidence is `medium` or `low` to understand what assumptions the engine made.
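The recommendations in the table above can be encoded as a small dispatch function. This is one possible client-side policy, not part of the API; the action names are illustrative.

```python
# A possible way to act on the confidence field, mirroring the
# recommendations above. The returned action names are illustrative.

def recommended_action(translation):
    """Map a translation's confidence to a handling policy."""
    confidence = translation["confidence"]
    if confidence == "high":
        return "execute"        # safe to run automatically
    if confidence == "medium":
        return "review"         # show the SQL to the user first
    return "manual_review"      # "low": significant ambiguity
```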
Warnings
The `warnings` array surfaces assumptions the engine made during translation. Common examples:
- `"Assumed 'amount' column contains order totals in same currency"` — the engine inferred column semantics
- `"Date range interpreted as calendar quarter"` — ambiguous time references were resolved
- `"No exact column match for 'revenue'; using 'amount' as closest match"` — fuzzy column mapping
Warnings don’t mean the SQL is wrong — they flag areas where the engine made a judgment call. If a warning indicates an incorrect assumption, refine your query or improve your column naming.
Vector Search Auto-Detection
If any table in your request has vector columns (columns with `FixedSizeList<Float32, N>` or similar types), the engine automatically enables semantic search functions:
- `text_to_embedding('search text')` — convert text to a vector at query time
- `knn_search(table, column, vector, k)` — k-nearest neighbors search
- `knn_cosine(table, column, vector, k)` — cosine similarity search
- `knn_l2(table, column, vector, k)` — L2 distance search
This means natural language queries like “find products similar to wireless headphones” automatically generate vector search SQL when the table has an embedding column:
```json
{
  "tables": {
    "products": "tbl_abc123"
  },
  "query": "Find products similar to wireless headphones"
}
```

Generated SQL:
```sql
SELECT * FROM knn_cosine(
  'products',
  'embedding',
  text_to_embedding('wireless headphones'),
  10
)
```

The `usesVectorSearch` response field confirms when vector functions were used. See the Vector Search guide for details on these functions.
Error Handling
Translation errors return a structured response with an error code:
```json
{
  "error": "Table 'orders' (ID: invalid_id) not found",
  "code": "TABLES_NOT_FOUND"
}
```

Error Codes
| Code | Meaning | Resolution |
|---|---|---|
| `TABLES_NOT_FOUND` | One or more table IDs in the `tables` map don’t exist | Verify table IDs — use `GET /dataset-tables/{id}` to check |
| `TRANSLATION_FAILED` | The engine couldn’t generate valid SQL from the query | Simplify the query, improve table/column naming, or ensure tables have data |
| `VALIDATION_FAILED` | Generated SQL failed syntax validation | The engine retries automatically; this error means all retries were exhausted |
| `INVALID_REQUEST` | Request body validation failed | Check required fields: `tables` (non-empty) and `query` (1-4000 chars) |
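One way to handle these codes in client code is a simple lookup from the error response body to a remediation hint. The dispatch below is illustrative; the hint strings paraphrase the resolutions in the table above.

```python
# Illustrative error-code dispatch for translation error responses.
# Hint strings paraphrase the Resolution column above.

def describe_translation_error(response):
    """Map an error response body to a short remediation hint."""
    hints = {
        "TABLES_NOT_FOUND": "Verify table IDs with GET /dataset-tables/{id}.",
        "TRANSLATION_FAILED": "Simplify the query, improve naming, or add data.",
        "VALIDATION_FAILED": "All automatic retries were exhausted.",
        "INVALID_REQUEST": "Check the tables and query fields.",
    }
    code = response.get("code")
    return hints.get(code, f"Unhandled error code: {code}")
```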
Quality Checklist
Use this checklist when setting up tables for SQL translation:
- Table aliases — use descriptive names in the `tables` map (not `t1`, `t2`)
- Table names — create tables with meaningful `tableName` values
- Column names — use descriptive `snake_case` names (not abbreviations)
- Data population — insert at least 5 representative rows per table
- Enable explanation — use `includeExplanation: true` when debugging
- Check confidence — review SQL manually when confidence is `medium` or `low`
- Review warnings — address any incorrect assumptions flagged in `warnings`
Limitations
- Single-query translation — the endpoint translates one query at a time. For multi-step analytical workflows (e.g., aggregate → compare → rank), orchestrate multiple translate + execute calls.
- Schema-dependent quality — translation quality is bounded by schema clarity. Tables with opaque column names (`col1`, `col2`) will produce lower-quality results regardless of query phrasing.
- DataFusion SQL dialect — generated SQL uses DataFusion syntax, which differs from PostgreSQL or MySQL in some areas (e.g., `INTERVAL` arithmetic instead of `date_add`/`date_sub`, Chrono strftime format instead of PostgreSQL date format strings).