# Query Optimization
Query performance in Catalyzed can be optimized through indexes, table maintenance, and query design. This guide covers strategies for improving query performance across datasets of all sizes.
## Performance Optimization Workflow

- Identify slow queries using query statistics
- Analyze query plans with the explain endpoint
- Add appropriate indexes for frequent filters
- Maintain table statistics for optimal query planning
- Compact tables to optimize storage and I/O
- Monitor and iterate on query performance
## Query-Level Optimizations

### Select Specific Columns

Avoid `SELECT *` and only request the columns you need:
```sql
-- ❌ Inefficient: Returns all columns
SELECT * FROM products WHERE category = 'electronics'

-- ✅ Efficient: Returns only needed columns
SELECT product_id, name, price FROM products WHERE category = 'electronics'
```

### Filter Early with WHERE Clauses
Push filtering to the database rather than in application code:
```sql
-- ❌ Inefficient: Fetches all rows, filters in app
SELECT * FROM orders
-- Then filter in application: orders.filter(o => o.status === 'completed')

-- ✅ Efficient: Filters in database
SELECT * FROM orders WHERE status = 'completed'
```

### Always Use LIMIT During Development
Prevent accidentally fetching millions of rows while developing:
```sql
-- ✅ Safe for exploration
SELECT * FROM large_table LIMIT 100
```

### Use Appropriate Data Types

Ensure queries use the correct types to avoid implicit conversions:
```sql
-- ❌ String comparison on timestamp column
SELECT * FROM events WHERE created_at > '2024-01-15'

-- ✅ Proper timestamp comparison
SELECT * FROM events WHERE created_at > TIMESTAMP '2024-01-15 00:00:00'
```

## Indexes

Indexes dramatically improve query performance for filtering, sorting, and joins.
### Traditional Indexes

Create btree indexes on columns frequently used in `WHERE` clauses:
Create a btree index
```bash
curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_category",
    "indexType": "btree",
    "columns": ["category"]
  }'
```

```javascript
await fetch("https://api.catalyzed.ai/dataset-tables/table_abc123/indexes", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    indexName: "idx_category",
    indexType: "btree",
    columns: ["category"],
  }),
});
```

```python
requests.post(
    "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "indexName": "idx_category",
        "indexType": "btree",
        "columns": ["category"]
    }
)
```

When to use btree indexes:
- Equality filters: `WHERE category = 'electronics'`
- Range queries: `WHERE price BETWEEN 10 AND 100`
- Sorting: `ORDER BY created_at DESC`
- Joins: `JOIN orders ON products.id = orders.product_id`
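For example, a query that filters on `category` and sorts by `created_at` could be served by a single index covering both columns. This is a hedged sketch only: the examples in this guide show a single entry in `columns`, so multi-column support is an assumption, and `apiToken` is assumed to be defined as in the examples above.

```typescript
// Hedged sketch: create an index covering a frequent filter + sort pattern.
// Multi-column `columns` is an assumption -- only single-column examples
// appear in this guide.
await fetch("https://api.catalyzed.ai/dataset-tables/table_abc123/indexes", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    indexName: "idx_category_created_at",
    indexType: "btree",
    columns: ["category", "created_at"], // assumption: multi-column indexes supported
  }),
});
```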
### Vector Indexes

For semantic search and similarity queries, use vector indexes:
| Index Type | Use Case | Trade-offs |
|---|---|---|
| `ivf_pq` | Standard vector search | Balanced recall/speed |
| `ivf_hnsw_pq` | High-recall requirements | Better recall, slower indexing |
| `ivf_hnsw_sq` | Memory-constrained environments | Lower memory usage, slightly lower recall |
Create a vector index
```bash
curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_embedding",
    "indexType": "ivf_pq",
    "columns": ["embedding"]
  }'
```

Vector index sizing guidelines:
- < 10,000 rows: Flat search acceptable (no index needed)
- 10,000 - 100,000 rows: Index recommended for good performance
- > 100,000 rows: Index required for acceptable query times
See the Vector Search guide for detailed tuning strategies.
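The sizing guidelines and the trade-off table above can be folded into a small helper that decides whether to build a vector index at all and which type to use. A minimal sketch, assuming `apiToken` is defined as in the examples above; the row count and constraint flags are inputs you supply:

```typescript
// Sketch: apply the sizing guidelines -- skip the index below ~10k rows,
// otherwise pick a type based on recall and memory constraints.
function chooseVectorIndexType(
  rowCount: number,
  needHighRecall: boolean,
  memoryConstrained: boolean
): "ivf_pq" | "ivf_hnsw_pq" | "ivf_hnsw_sq" | null {
  if (rowCount < 10_000) return null; // flat search is acceptable
  if (memoryConstrained) return "ivf_hnsw_sq";
  if (needHighRecall) return "ivf_hnsw_pq";
  return "ivf_pq";
}

const indexType = chooseVectorIndexType(250_000, false, false);
if (indexType) {
  await fetch("https://api.catalyzed.ai/dataset-tables/table_abc123/indexes", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      indexName: "idx_embedding",
      indexType,
      columns: ["embedding"],
    }),
  });
}
```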
### Index Maintenance

Indexes must be rebuilt after schema changes:
```bash
# 1. Drop the old index
curl -X DELETE "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes/idx_category" \
  -H "Authorization: Bearer $API_TOKEN"

# 2. Recreate it
curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_category",
    "indexType": "btree",
    "columns": ["category"]
  }'
```
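The same two-step rebuild can be scripted. A minimal sketch, assuming `apiToken` is defined and reusing the index definition from the curl example above:

```typescript
// Sketch: rebuild an index after a schema change -- drop the old index,
// then recreate it against the updated schema.
const indexesUrl = "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes";
const auth = { Authorization: `Bearer ${apiToken}` };

// 1. Drop the old index.
await fetch(`${indexesUrl}/idx_category`, { method: "DELETE", headers: auth });

// 2. Recreate it with the same definition.
await fetch(indexesUrl, {
  method: "POST",
  headers: { ...auth, "Content-Type": "application/json" },
  body: JSON.stringify({
    indexName: "idx_category",
    indexType: "btree",
    columns: ["category"],
  }),
});
```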
## Table Statistics

Table statistics help the query optimizer choose efficient execution plans.
### What Are Statistics?

Catalyzed collects statistics about your data, including:
- Column cardinality - Number of distinct values
- Data distribution - Histogram of value frequencies
- NULL counts - Percentage of NULL values
- Min/max values - Range of data in columns
These statistics help the query planner:
- Estimate result set sizes
- Choose optimal join strategies
- Select best index usage
- Determine filter order
### Computing Statistics

Compute table statistics
```bash
curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/statistics" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'
```

```javascript
await fetch("https://api.catalyzed.ai/dataset-tables/table_abc123/statistics", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({}),
});
```

```python
requests.post(
    "https://api.catalyzed.ai/dataset-tables/table_abc123/statistics",
    headers={"Authorization": f"Bearer {api_token}"},
    json={}
)
```

### When to Refresh Statistics
Recompute statistics when:
- After large data loads - New data changes distribution significantly
- After schema migrations - Column types or structure changes
- Query plans look suboptimal - Optimizer makes poor choices
- Periodic maintenance - Weekly or monthly for active tables
Performance impact: Statistics computation scans all data, so schedule during off-peak hours for large tables.
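A common pattern is to trigger the refresh at the end of a bulk load rather than on a timer. A minimal sketch, assuming `apiToken` is defined as in the examples above; `loadAllBatches` is a hypothetical stand-in for whatever ingestion path you use:

```typescript
// Sketch: refresh statistics once, after a large load completes, so the
// planner sees the new data distribution. `loadAllBatches` is hypothetical.
await loadAllBatches();

await fetch("https://api.catalyzed.ai/dataset-tables/table_abc123/statistics", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({}),
});
```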
## Table Compaction

Compaction optimizes storage by merging small files into larger ones, improving I/O performance.

### What Is Compaction?

Catalyzed stores table data in multiple files. Over time, many small files accumulate from frequent writes. Compaction:
- Merges small files into larger ones
- Reduces file count and metadata overhead
- Improves sequential read performance
- Reclaims deleted row space
### Triggering Compaction

Compact a table
```bash
curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/compact" \
  -H "Authorization: Bearer $API_TOKEN"
```

```javascript
await fetch("https://api.catalyzed.ai/dataset-tables/table_abc123/compact", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiToken}` },
});
```

```python
requests.post(
    "https://api.catalyzed.ai/dataset-tables/table_abc123/compact",
    headers={"Authorization": f"Bearer {api_token}"}
)
```

### When to Compact
Compact tables when:
- After many small writes - Frequent upserts or appends
- After bulk deletes - Reclaim space from deleted rows
- Query performance degrades - Too many files slow down reads
- Before long-running queries - Optimize for analytical workloads
Recommended frequency:
- High-write tables: Weekly
- Moderate-write tables: Monthly
- Read-mostly tables: After significant data changes
Performance impact: Compaction runs in the background but may temporarily increase storage usage during the merge process.
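The recommended cadence can be wrapped in a small scheduled job. A minimal sketch, assuming `apiToken` is defined as in the examples above; the table list, write profiles, and compaction bookkeeping are illustrative and would come from your own configuration:

```typescript
// Sketch of a scheduled maintenance pass following the cadence above:
// weekly for high-write tables, monthly for moderate-write tables.
type WriteProfile = "high-write" | "moderate-write" | "read-mostly";

const tables: { id: string; profile: WriteProfile; daysSinceCompaction: number }[] = [
  { id: "table_abc123", profile: "high-write", daysSinceCompaction: 9 },
  { id: "table_def456", profile: "moderate-write", daysSinceCompaction: 12 }, // illustrative ID
];

const isDue = (profile: WriteProfile, days: number) =>
  (profile === "high-write" && days >= 7) ||
  (profile === "moderate-write" && days >= 30);

for (const table of tables) {
  if (!isDue(table.profile, table.daysSinceCompaction)) continue;
  await fetch(`https://api.catalyzed.ai/dataset-tables/${table.id}/compact`, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
  });
}
```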
## Query Plan Analysis

Use the `explain` endpoint to understand query execution before running expensive queries:
Explain a query
```bash
curl -X POST "https://api.catalyzed.ai/queries/explain" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM products WHERE category = '\''electronics'\'' AND price > 100",
    "tables": {"products": "table_abc123"}
  }'
```

The response shows:
- Execution plan - Step-by-step query execution
- Estimated cost - Query complexity estimate
- Index usage - Which indexes will be used
- Filter pushdown - Where filters are applied
### Interpreting Query Plans

Look for:
- ✅ Index scans - Good: Using indexes efficiently
- ⚠️ Full table scans - Warning: May be slow on large tables
- ✅ Filter pushdown - Good: Filters applied early
- ⚠️ Large joins - Warning: May need join order optimization
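These checks can be automated as a pre-flight step before running an expensive query. This is only a sketch: the field names of the explain response are not specified in this guide, so the plan is inspected as serialized text, and `apiToken` is assumed to be defined:

```typescript
// Hedged sketch: explain a query and warn if the plan appears to fall back
// to a full table scan. Inspecting the serialized response is a workaround
// because the explain response's field names are not documented here.
const response = await fetch("https://api.catalyzed.ai/queries/explain", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: "SELECT * FROM products WHERE category = 'electronics' AND price > 100",
    tables: { products: "table_abc123" },
  }),
});

const planText = JSON.stringify(await response.json());
if (/full table scan/i.test(planText)) {
  console.warn("Plan contains a full table scan; consider adding an index.");
}
```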
## Data Ingestion Performance

Optimize data loading for best throughput:

### Batch Writes

Insert data in batches up to 100MB per request:
```javascript
const batchSize = 5000;
for (let i = 0; i < rows.length; i += batchSize) {
  const batch = rows.slice(i, i + batchSize);
  await fetch(
    `https://api.catalyzed.ai/dataset-tables/table_abc123/rows?mode=append`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(batch),
    }
  );
}
```

### Use Arrow IPC for Large Datasets

For datasets >10MB, use Arrow IPC format instead of JSON:
```bash
curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/rows?mode=append" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/vnd.apache.arrow.stream" \
  --data-binary @data.arrow
```

Benefits of Arrow IPC:
- 5-10x faster than JSON for large payloads
- No parsing overhead
- Preserves type information
- Supports streaming
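On the client side, an Arrow table can be built in memory and sent as an IPC stream instead of writing a file first. A minimal sketch, assuming the `apache-arrow` npm package, illustrative column names and values, and `apiToken` defined as in the examples above:

```typescript
import { tableFromArrays, tableToIPC } from "apache-arrow";

// Build an Arrow table in memory (column names and values are illustrative).
const table = tableFromArrays({
  product_id: Int32Array.from([1, 2, 3]),
  name: ["widget", "gadget", "gizmo"],
  price: Float64Array.from([9.99, 24.5, 3.25]),
});

// Serialize to the Arrow IPC stream format expected by the endpoint.
const ipcBytes = tableToIPC(table, "stream");

await fetch(
  "https://api.catalyzed.ai/dataset-tables/table_abc123/rows?mode=append",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/vnd.apache.arrow.stream",
    },
    body: ipcBytes,
  }
);
```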
### Choose the Right Write Mode

| Mode | Use Case | Performance |
|---|---|---|
| `append` | Initial loads, no duplicates | Fastest (no checks) |
| `upsert` | Incremental updates | Moderate (requires primary key lookups) |
| `overwrite` | Replace all data | Moderate (deletes + inserts) |
| `delete` | Remove specific rows | Moderate (requires primary key lookups) |
See the Ingesting Data guide for details on write modes.
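In practice the mode is just the `mode` query parameter on the rows endpoint. A hedged sketch: only `mode=append` appears verbatim in this guide's examples, so treating the other row labels above as valid `mode` values is an assumption; `apiToken` and `rows` stand in for real values:

```typescript
// Sketch: pick the write mode per the table above -- append for initial
// loads, upsert for incremental updates (mode values other than "append"
// are assumed from the table, not shown elsewhere in this guide).
const initialLoad = false; // illustrative flag
const mode = initialLoad ? "append" : "upsert";

await fetch(
  `https://api.catalyzed.ai/dataset-tables/table_abc123/rows?mode=${mode}`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(rows),
  }
);
```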
## Monitoring Query Performance

Include statistics in query responses to monitor performance:
```bash
curl -X POST "https://api.catalyzed.ai/queries" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM products LIMIT 100",
    "tables": {"products": "table_abc123"},
    "includeStats": true
  }'
```

Response includes:
```json
{
  "rows": [...],
  "stats": {
    "executionTimeMs": 145,
    "planningTimeMs": 12,
    "bytesScanned": 524288,
    "rowsScanned": 1000
  }
}
```

Key metrics:
- `executionTimeMs` - Total query time
- `planningTimeMs` - Query planning overhead
- `bytesScanned` - Data read from storage
- `rowsScanned` - Rows examined before filtering
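These metrics make it easy to flag slow queries in application code. A minimal sketch, assuming `apiToken` is defined as in the examples above; the stats field names match the sample response, and the 1000 ms threshold is illustrative:

```typescript
// Sketch: run a query with includeStats and log it when execution time
// crosses a threshold.
const res = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: "SELECT product_id, price FROM products WHERE category = 'electronics'",
    tables: { products: "table_abc123" },
    includeStats: true,
  }),
});

const { stats } = await res.json();
if (stats.executionTimeMs > 1000) {
  console.warn(
    `Slow query: ${stats.executionTimeMs} ms, ` +
      `${stats.bytesScanned} bytes scanned over ${stats.rowsScanned} rows`
  );
}
```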
## Performance Checklist

Use this checklist to optimize query performance:

### Schema Design

- Use appropriate Arrow types (int64 vs string for IDs)
- Define primary keys for tables with unique identifiers
- Denormalize when appropriate to avoid complex joins
### Indexing

- Create btree indexes on frequently filtered columns
- Create vector indexes for tables with >10k rows and similarity search
- Rebuild indexes after schema changes
### Table Maintenance

- Compute statistics after large data loads
- Compact tables weekly (high-write) or monthly (moderate-write)
- Monitor table file counts and sizes
### Query Design

- Select only needed columns (avoid `SELECT *`)
- Filter early with WHERE clauses
- Use LIMIT during development and exploration
- Use appropriate data types in filters
### Data Ingestion

- Batch writes (1,000-5,000 rows per request)
- Use Arrow IPC for payloads >10MB
- Choose appropriate write mode (append vs upsert)
### Monitoring

- Enable query statistics (`includeStats: true`)
- Analyze slow queries with the `explain` endpoint
- Track query performance trends over time
## Next Steps

- Querying Data - SQL syntax and query capabilities
- Vector Search - Optimize semantic search queries
- Schema Management - Evolve schemas safely
- Ingesting Data - Optimize data loading