SQL Translation

The /queries/translate endpoint converts natural language questions into SQL. Catalyzed handles the translation engine — you control the context that makes it accurate. This guide covers how to get the best results.

The recommended workflow for natural language querying is a two-step process: translate the question to SQL, then execute it. This gives you visibility into the generated SQL before running it.

Step 1 — Translate:

Translate natural language to SQL

curl -X POST https://api.catalyzed.ai/queries/translate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    },
    "query": "Which customers spent more than $5000 last quarter?",
    "includeExplanation": true,
    "validateSyntax": true
  }'

Response:

{
  "sql": "SELECT c.name, c.email, SUM(o.amount) AS total_spent FROM customers c JOIN orders o ON c.id = o.customer_id WHERE o.created_at >= date_trunc('quarter', CURRENT_DATE - INTERVAL '3 months') AND o.created_at < date_trunc('quarter', CURRENT_DATE) GROUP BY c.name, c.email HAVING SUM(o.amount) > 5000 ORDER BY total_spent DESC",
  "explanation": "Joins customers and orders, filters to the previous quarter, groups by customer, and returns those with total spend exceeding $5000.",
  "warnings": [],
  "referencedTables": ["customers", "orders"],
  "referencedColumns": ["customers.name", "customers.email", "customers.id", "orders.amount", "orders.customer_id", "orders.created_at"],
  "usesVectorSearch": false,
  "confidence": "high"
}

Step 2 — Execute the translated SQL:

Execute the translated SQL

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT c.name, c.email, SUM(o.amount) AS total_spent FROM customers c JOIN orders o ON c.id = o.customer_id WHERE o.created_at >= date_trunc('\''quarter'\'', CURRENT_DATE - INTERVAL '\''3 months'\'') AND o.created_at < date_trunc('\''quarter'\'', CURRENT_DATE) GROUP BY c.name, c.email HAVING SUM(o.amount) > 5000 ORDER BY total_spent DESC",
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    }
  }'
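The two curl steps above are easy to wrap in a small client-side helper. A minimal Python sketch, with the HTTP transport injected as a callable so any HTTP library (or a test stub) can be used; the endpoint paths and payload shapes are those shown in this guide:

```python
API_BASE = "https://api.catalyzed.ai"

def translate_and_execute(post, token, tables, question):
    """Translate a natural language question to SQL, then execute it.

    `post` is any callable (url, headers, payload) -> dict, so you can
    plug in requests, httpx, or a stub for testing.
    """
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    # Step 1 - translate the question to SQL
    translation = post(f"{API_BASE}/queries/translate", headers, {
        "tables": tables,
        "query": question,
        "includeExplanation": True,
        "validateSyntax": True,
    })
    # Step 2 - execute the generated SQL against the same tables map
    result = post(f"{API_BASE}/queries", headers, {
        "sql": translation["sql"],
        "tables": tables,
    })
    return translation, result
```

Because the translation is returned before execution, this is also the natural place to pause and inspect the SQL (or show it to the user) between the two calls.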

When you call /queries/translate, the engine:

  1. Loads table schemas — column names, data types, primary keys, nullable flags
  2. Detects vector columns — enables semantic search functions automatically
  3. Samples data — fetches up to 5 rows per table to understand data patterns
  4. Generates SQL — sends the schema context and your natural language query to an LLM
  5. Validates syntax — runs EXPLAIN on the generated SQL to catch syntax errors (when validateSyntax: true)
  6. Retries on failure — if validation fails, feeds the error back to the LLM for correction

This means the quality of translation depends directly on how well the engine can understand your data. The next section covers what you can control.
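The validate-and-retry behavior in steps 5 and 6 can be pictured roughly as the loop below. This is a sketch only: `generate_sql` and `explain` stand in for the LLM call and the EXPLAIN syntax check, both of which are internal to Catalyzed and not exposed to clients.

```python
def translate_with_retries(generate_sql, explain, question, schema_context,
                           max_retries=2):
    """Sketch of the generate -> validate -> retry loop (steps 4-6).

    `generate_sql(question, schema_context, error)` stands in for the LLM
    call; `explain(sql)` stands in for syntax validation and raises
    ValueError on invalid SQL. The retry count is illustrative.
    """
    error = None
    for _ in range(max_retries + 1):
        sql = generate_sql(question, schema_context, error)
        try:
            explain(sql)      # EXPLAIN catches syntax errors
            return sql        # valid on this attempt
        except ValueError as exc:
            error = str(exc)  # feed the error back for correction
    raise RuntimeError("VALIDATION_FAILED: all retries exhausted")
```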

The keys in the tables map become the table names the engine sees. Descriptive aliases produce better translations.

// Good — the engine understands what each table contains
{
  "tables": {
    "customer_orders": "Ednc5U676CO4hn-FqsXeA",
    "product_catalog": "6fTBbbj4uv8TVMVh0gVch"
  }
}

// Bad — the engine has no idea what "t1" or "t2" are
{
  "tables": {
    "t1": "Ednc5U676CO4hn-FqsXeA",
    "t2": "6fTBbbj4uv8TVMVh0gVch"
  }
}
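If aliases are generated programmatically, a quick client-side lint can catch opaque ones before the request is sent. This is a hypothetical helper, not part of the API; the pattern for "opaque" is an assumption you should tune to your own naming:

```python
import re

# Flags aliases like "t1", "tbl2", "table3" that carry no semantic hint
OPAQUE = re.compile(r"^(t|tbl|table)?\d*$", re.IGNORECASE)

def opaque_aliases(tables):
    """Return the aliases in a `tables` map that tell the engine nothing."""
    return [alias for alias in tables if OPAQUE.match(alias)]
```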

Column names are included verbatim in the translation context. Clear names reduce ambiguity and improve confidence.

| Good | Bad | Why |
| --- | --- | --- |
| customer_name | cn | Engine can infer the column's purpose |
| order_date | dt | Natural language query "orders from last month" maps cleanly |
| total_amount | col3 | Aggregation queries like "total spent" resolve correctly |
| is_active | flag1 | Boolean filters map to natural language conditions |

Column names are set when you create a table. If your schema has opaque names, consider recreating the table with clearer naming — this has the biggest impact on translation quality.

The table’s tableName (set during creation with POST /dataset-tables) is also shown to the engine as context. A table named customer_orders provides more context than one named tbl_001.

The engine fetches up to 5 sample rows from each table. These rows help the engine understand:

  • What values look like (dates, IDs, statuses, amounts)
  • Data formats (ISO dates vs timestamps, currency formats)
  • Common patterns (enum values like "active"/"inactive", ID formats)

Tables with representative data produce significantly better translations than empty or sparse tables. If you’re setting up a new table, insert a few rows of sample data before relying on translation.

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| tables | object | required | Map of table aliases to table IDs |
| query | string | required | Natural language query (max 4000 chars) |
| includeExplanation | boolean | false | Include a natural language explanation of the generated SQL |
| validateSyntax | boolean | true | Validate SQL syntax via EXPLAIN before returning |

Response fields:

| Field | Type | Description |
| --- | --- | --- |
| sql | string | Generated SQL query |
| explanation | string? | Explanation of the SQL (only when includeExplanation: true) |
| warnings | string[] | Assumptions or potential issues identified during translation |
| referencedTables | string[] | Table aliases used in the SQL |
| referencedColumns | string[] | Columns used (format: table.column) |
| usesVectorSearch | boolean | Whether the SQL uses vector search functions |
| confidence | string | Translation confidence: "high", "medium", or "low" |
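For downstream handling it can help to parse the response into a typed structure. A minimal sketch using only the fields documented above (the class and function names are illustrative, not part of any Catalyzed SDK):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Translation:
    sql: str
    confidence: str                       # "high" | "medium" | "low"
    warnings: list = field(default_factory=list)
    referenced_tables: list = field(default_factory=list)
    referenced_columns: list = field(default_factory=list)
    uses_vector_search: bool = False
    explanation: Optional[str] = None     # only when includeExplanation: true

def parse_translation(body: dict) -> Translation:
    """Convert a /queries/translate response body into a Translation."""
    return Translation(
        sql=body["sql"],
        confidence=body["confidence"],
        warnings=body.get("warnings", []),
        referenced_tables=body.get("referencedTables", []),
        referenced_columns=body.get("referencedColumns", []),
        uses_vector_search=body.get("usesVectorSearch", False),
        explanation=body.get("explanation"),
    )
```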

The confidence field indicates how certain the engine is about the translation:

| Level | Meaning | Recommendation |
| --- | --- | --- |
| high | Schema clearly supports the query with no ambiguity | Safe to execute automatically |
| medium | Some interpretation required: column mapping or filter logic may be approximate | Review the SQL or show it to the user before executing |
| low | Significant ambiguity: the query may not map well to the available schema | Manual review strongly recommended |

Use includeExplanation: true when confidence is medium or low to understand what assumptions the engine made.
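A common pattern is to gate execution on this field: auto-execute high-confidence translations and route everything else to review. A sketch, noting that the gating policy (here, also requiring an empty warnings array) is your choice, not an API feature:

```python
def should_auto_execute(translation: dict) -> bool:
    """Auto-execute only clean, high-confidence translations.

    A conservative policy: anything below "high" confidence, or any
    translation carrying warnings, goes to human review instead.
    """
    return (translation.get("confidence") == "high"
            and not translation.get("warnings"))
```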

The warnings array surfaces assumptions the engine made during translation. Common examples:

  • "Assumed 'amount' column contains order totals in same currency" — the engine inferred column semantics
  • "Date range interpreted as calendar quarter" — ambiguous time references were resolved
  • "No exact column match for 'revenue'; using 'amount' as closest match" — fuzzy column mapping

Warnings don’t mean the SQL is wrong — they flag areas where the engine made a judgment call. If a warning indicates an incorrect assumption, refine your query or improve your column naming.

If any table in your request has vector columns (columns with FixedSizeList<Float32, N> or similar types), the engine automatically enables semantic search functions:

  • text_to_embedding('search text') — convert text to a vector at query time
  • knn_search(table, column, vector, k) — k-nearest neighbors search
  • knn_cosine(table, column, vector, k) — cosine similarity search
  • knn_l2(table, column, vector, k) — L2 distance search

This means natural language queries like “find products similar to wireless headphones” automatically generate vector search SQL when the table has an embedding column:

{
  "tables": { "products": "tbl_abc123" },
  "query": "Find products similar to wireless headphones"
}

Generated SQL:

SELECT * FROM knn_cosine(
  'products',
  'embedding',
  text_to_embedding('wireless headphones'),
  10
)

The usesVectorSearch response field confirms when vector functions were used. See the Vector Search guide for details on these functions.

Translation errors return a structured response with an error code:

{
  "error": "Table 'orders' (ID: invalid_id) not found",
  "code": "TABLES_NOT_FOUND"
}
| Code | Meaning | Resolution |
| --- | --- | --- |
| TABLES_NOT_FOUND | One or more table IDs in the tables map don't exist | Verify table IDs; use GET /dataset-tables/{id} to check |
| TRANSLATION_FAILED | The engine couldn't generate valid SQL from the query | Simplify the query, improve table/column naming, or ensure tables have data |
| VALIDATION_FAILED | Generated SQL failed syntax validation | The engine retries automatically; this error means all retries were exhausted |
| INVALID_REQUEST | Request body validation failed | Check required fields: tables (non-empty) and query (1-4000 chars) |
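In application code, these codes map naturally to distinct recovery paths. A sketch whose recovery actions simply mirror the resolutions above (the action strings are illustrative):

```python
def recovery_action(error_body: dict) -> str:
    """Map a structured translation error to a recovery action."""
    code = error_body.get("code")
    if code == "TABLES_NOT_FOUND":
        return "verify-table-ids"    # check with GET /dataset-tables/{id}
    if code == "TRANSLATION_FAILED":
        return "rephrase-query"      # simplify query or improve naming
    if code == "VALIDATION_FAILED":
        return "manual-sql"          # retries already exhausted
    if code == "INVALID_REQUEST":
        return "fix-request-body"    # tables non-empty, query 1-4000 chars
    return "unknown-error"
```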

Use this checklist when setting up tables for SQL translation:

  • Table aliases — use descriptive names in the tables map (not t1, t2)
  • Table names — create tables with meaningful tableName values
  • Column names — use descriptive snake_case names (not abbreviations)
  • Data population — insert at least 5 representative rows per table
  • Enable explanation — use includeExplanation: true when debugging
  • Check confidence — review SQL manually when confidence is medium or low
  • Review warnings — address any incorrect assumptions flagged in warnings
Separately, keep these limitations in mind:

  • Single-query translation — the endpoint translates one query at a time. For multi-step analytical workflows (e.g., aggregate → compare → rank), orchestrate multiple translate + execute calls.
  • Schema-dependent quality — translation quality is bounded by schema clarity. Tables with opaque column names (col1, col2) will produce lower-quality results regardless of query phrasing.
  • DataFusion SQL dialect — generated SQL uses DataFusion syntax, which differs from PostgreSQL or MySQL in some areas (e.g., INTERVAL arithmetic instead of date_add/date_sub, Chrono strftime format instead of PostgreSQL date format strings).