Evaluations
Evaluations measure how well a pipeline performs by comparing its outputs against ground truth examples. Run a pipeline against an example set and get aggregate scores plus per-example results.
Key Concepts
- Evaluation - A comparison run between a pipeline and an example set
- Evaluation Results - Per-example outcomes within an evaluation
- Evaluator - The method used to compare actual vs expected output
- Mapping Config - How example inputs/outputs map to pipeline inputs/outputs
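
To make the mapping config concrete, here is a minimal Python sketch that assembles a request body for an evaluation. The field names (`exampleSetId`, `evaluatorType`, `mappingConfig`, `inputMappings`, `outputMappings`, `exampleSlotId`, `pipelineSlotId`) are taken from the create request documented on this page. The helper `build_evaluation_request` and its dict-based arguments are hypothetical conveniences, not part of the API.

```python
import json


def build_evaluation_request(example_set_id, input_slots, output_slots,
                             evaluator_type="llm_judge"):
    """Assemble a create-evaluation body (hypothetical helper).

    input_slots / output_slots map example slot IDs to pipeline slot IDs.
    """
    def to_mappings(slots):
        return [
            {"exampleSlotId": example_slot, "pipelineSlotId": pipeline_slot}
            for example_slot, pipeline_slot in slots.items()
        ]

    return {
        "exampleSetId": example_set_id,
        "evaluatorType": evaluator_type,
        "mappingConfig": {
            "inputMappings": to_mappings(input_slots),
            "outputMappings": to_mappings(output_slots),
        },
    }


body = build_evaluation_request(
    "KjR8I6rHBms3W4Qfa2-FN",
    input_slots={"document": "input_text"},
    output_slots={"summary": "output_summary"},
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as the JSON body of the evaluate request.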
Evaluation Lifecycle
```
pending → running → succeeded
                  ↘ failed
pending → cancelled
```

| Status | Description |
|---|---|
| `pending` | Evaluation queued, waiting to start |
| `running` | Processing examples |
| `succeeded` | All examples processed (individual results may vary) |
| `failed` | Fatal error during evaluation |
| `cancelled` | Manually cancelled |
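
The lifecycle above can be modelled as a small state machine, which is handy when client code needs to know whether it should keep waiting. This is a sketch under the assumption that the diagram lists every transition; the names `ALLOWED_TRANSITIONS`, `is_terminal`, and `can_transition` are illustrative only.

```python
# Transitions read off the lifecycle diagram. Whether a running evaluation
# can also be cancelled is not shown there, so that edge is left out.
ALLOWED_TRANSITIONS = {
    "pending": {"running", "cancelled"},
    "running": {"succeeded", "failed"},
    "succeeded": set(),
    "failed": set(),
    "cancelled": set(),
}


def is_terminal(status):
    """True once no further transition is possible."""
    return not ALLOWED_TRANSITIONS[status]


def can_transition(current, target):
    """True if the diagram allows moving from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


print(is_terminal("running"))
print(can_transition("pending", "cancelled"))
```

A polling loop can stop as soon as `is_terminal(status)` is true.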
Creating an Evaluation
Evaluations are created by calling the evaluate endpoint on a pipeline:
Create an evaluation

```bash
curl -X POST https://api.catalyzed.ai/pipelines/EMbMEFLyUWEgvnhMWXVVa/evaluate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "exampleSetId": "KjR8I6rHBms3W4Qfa2-FN",
    "evaluatorType": "llm_judge",
    "mappingConfig": {
      "inputMappings": [
        { "exampleSlotId": "document", "pipelineSlotId": "input_text" }
      ],
      "outputMappings": [
        { "exampleSlotId": "summary", "pipelineSlotId": "output_summary" }
      ]
    }
  }'
```

```typescript
const response = await fetch(
  "https://api.catalyzed.ai/pipelines/EMbMEFLyUWEgvnhMWXVVa/evaluate",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      exampleSetId: "KjR8I6rHBms3W4Qfa2-FN",
      evaluatorType: "llm_judge",
      mappingConfig: {
        inputMappings: [
          { exampleSlotId: "document", pipelineSlotId: "input_text" },
        ],
        outputMappings: [
          { exampleSlotId: "summary", pipelineSlotId: "output_summary" },
        ],
      },
    }),
  }
);
const evaluation = await response.json();
console.log(evaluation.evaluationId); // "EvR8I6rHBms3W4Qfa2-FN"
```

```python
import requests

response = requests.post(
    "https://api.catalyzed.ai/pipelines/EMbMEFLyUWEgvnhMWXVVa/evaluate",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "exampleSetId": "KjR8I6rHBms3W4Qfa2-FN",
        "evaluatorType": "llm_judge",
        "mappingConfig": {
            "inputMappings": [
                {"exampleSlotId": "document", "pipelineSlotId": "input_text"}
            ],
            "outputMappings": [
                {"exampleSlotId": "summary", "pipelineSlotId": "output_summary"}
            ]
        }
    }
)
evaluation = response.json()
print(evaluation["evaluationId"])  # "EvR8I6rHBms3W4Qfa2-FN"
```

The endpoint returns 202 Accepted because evaluations run asynchronously:

```json
{
  "evaluationId": "EvR8I6rHBms3W4Qfa2-FN",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "pipelineId": "EMbMEFLyUWEgvnhMWXVVa",
  "pipelineConfigurationId": "CfgABC123...",
  "exampleSetId": "KjR8I6rHBms3W4Qfa2-FN",
  "exampleSetConfigurationId": "CfgDEF456...",
  "evaluatorType": "llm_judge",
  "mappingType": "explicit",
  "mappingConfig": { ... },
  "status": "pending",
  "totalExamples": null,
  "passedCount": null,
  "failedCount": null,
  "errorCount": null,
  "aggregateScore": null,
  "createdAt": "2024-01-15T10:30:00Z",
  "createdBy": "usr_abc123"
}
```

Evaluator Types
LLM Judge (Default)
Uses an LLM to compare actual vs expected output and provide a score:

```json
{
  "evaluatorType": "llm_judge",
  "evaluatorConfig": {
    "threshold": 0.7,
    "criteria": "Focus on factual accuracy and completeness"
  }
}
```

| Option | Type | Description |
|---|---|---|
| `threshold` | number | Score threshold for pass/fail (default: 0.7) |
| `criteria` | string | Custom evaluation criteria |
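
As a rough illustration of how the threshold drives pass/fail, the sketch below builds an `llm_judge` configuration and applies the rule "score >= threshold passes" that this page gives for result statuses. The helpers `llm_judge_config` and `passes` are hypothetical, and the 0.0 to 1.0 range check is an assumption based on the sample scores shown on this page.

```python
def llm_judge_config(threshold=0.7, criteria=None):
    """Build the evaluator fields for an llm_judge run (hypothetical helper)."""
    # Assumption: scores live in [0.0, 1.0], matching the sample scores here.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    config = {"threshold": threshold}
    if criteria:
        config["criteria"] = criteria
    return {"evaluatorType": "llm_judge", "evaluatorConfig": config}


def passes(score, threshold=0.7):
    """Apply the documented rule: a score at or above the threshold passes."""
    return score >= threshold


print(llm_judge_config(criteria="Focus on factual accuracy and completeness"))
print(passes(0.92), passes(0.55))
```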
Exact Match
Direct string/JSON comparison:

```json
{
  "evaluatorType": "exact_match",
  "evaluatorConfig": {
    "ignoreWhitespace": true,
    "ignoreCase": false
  }
}
```

| Option | Type | Description |
|---|---|---|
| `ignoreWhitespace` | boolean | Ignore whitespace differences (default: false) |
| `ignoreCase` | boolean | Case-insensitive comparison (default: false) |
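
The following local sketch shows one plausible reading of these two options. It is not the service's implementation: the exact normalization rules are not specified on this page, so removing all whitespace and serializing JSON values with sorted keys are both assumptions, and `normalize` and `exact_match` are hypothetical names.

```python
import json


def normalize(text, ignore_whitespace=False, ignore_case=False):
    # Assumption: "ignore whitespace" means whitespace is removed entirely
    # before comparing; the service may normalize differently.
    if ignore_whitespace:
        text = "".join(text.split())
    if ignore_case:
        text = text.lower()
    return text


def exact_match(actual, expected, ignore_whitespace=False, ignore_case=False):
    """Local illustration of an exact-match comparison."""
    # Non-string values are serialized with sorted keys so that dict key
    # order does not affect the comparison (also an assumption).
    def as_text(value):
        return value if isinstance(value, str) else json.dumps(value, sort_keys=True)

    return (normalize(as_text(actual), ignore_whitespace, ignore_case)
            == normalize(as_text(expected), ignore_whitespace, ignore_case))


print(exact_match("Total:  42\n", "Total: 42", ignore_whitespace=True))
print(exact_match("Total: 42", "total: 42"))
```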
Semantic Similarity
Embedding-based comparison using cosine similarity:

```json
{
  "evaluatorType": "semantic",
  "evaluatorConfig": {
    "threshold": 0.8
  }
}
```

| Option | Type | Description |
|---|---|---|
| `threshold` | number | Cosine similarity threshold (default: 0.8) |
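
For readers unfamiliar with cosine similarity, here is a small pure-Python sketch of the computation and the threshold check. Which embedding model produces the vectors is not stated on this page, so the vectors below are toy values, and the pass rule (similarity >= threshold) is an assumption modelled on the score rule used for result statuses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def semantic_pass(actual_vec, expected_vec, threshold=0.8):
    # Assumption: a result passes when similarity >= threshold.
    return cosine_similarity(actual_vec, expected_vec) >= threshold


# Toy 3-dimensional "embeddings"; real embeddings come from a model.
print(semantic_pass([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
print(semantic_pass([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Because cosine similarity depends only on direction, scaling either vector leaves the result unchanged.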
Getting Evaluation Status
Get evaluation details

```bash
curl https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
const response = await fetch(
  "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const evaluation = await response.json();
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN",
    headers={"Authorization": f"Bearer {api_token}"}
)
evaluation = response.json()
```

Completed evaluation response:

```json
{
  "evaluationId": "EvR8I6rHBms3W4Qfa2-FN",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "pipelineId": "EMbMEFLyUWEgvnhMWXVVa",
  "pipelineConfigurationId": "CfgABC123...",
  "exampleSetId": "KjR8I6rHBms3W4Qfa2-FN",
  "exampleSetConfigurationId": "CfgDEF456...",
  "evaluatorType": "llm_judge",
  "status": "succeeded",
  "totalExamples": 25,
  "passedCount": 21,
  "failedCount": 3,
  "errorCount": 1,
  "skippedCount": 0,
  "aggregateScore": 0.84,
  "startedAt": "2024-01-15T10:30:02Z",
  "completedAt": "2024-01-15T10:35:00Z",
  "createdAt": "2024-01-15T10:30:00Z",
  "createdBy": "usr_abc123"
}
```

Polling for Completion
```typescript
async function waitForEvaluation(evaluationId: string) {
  while (true) {
    const response = await fetch(
      `https://api.catalyzed.ai/evaluations/${evaluationId}`,
      { headers: { Authorization: `Bearer ${apiToken}` } }
    );
    // Surface HTTP errors instead of treating an error body as an evaluation
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }
    const evaluation = await response.json();

    if (evaluation.status === "succeeded") {
      console.log(`Score: ${evaluation.aggregateScore}`);
      console.log(
        `Passed: ${evaluation.passedCount}/${evaluation.totalExamples}`
      );
      return evaluation;
    }
    if (evaluation.status === "failed") {
      throw new Error(evaluation.errorMessage);
    }
    if (evaluation.status === "cancelled") {
      throw new Error("Evaluation was cancelled");
    }

    await new Promise((r) => setTimeout(r, 2000)); // Poll every 2 seconds
  }
}
```

Listing Evaluations
List evaluations

```bash
curl "https://api.catalyzed.ai/evaluations?pipelineIds=EMbMEFLyUWEgvnhMWXVVa" \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
const response = await fetch(
  "https://api.catalyzed.ai/evaluations?pipelineIds=EMbMEFLyUWEgvnhMWXVVa",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const { evaluations } = await response.json();
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/evaluations",
    params={"pipelineIds": "EMbMEFLyUWEgvnhMWXVVa"},
    headers={"Authorization": f"Bearer {api_token}"}
)
evaluations = response.json()["evaluations"]
```

Query Parameters
| Parameter | Type | Description |
|---|---|---|
| `evaluationIds` | string | Comma-separated evaluation IDs |
| `teamIds` | string | Comma-separated team IDs |
| `pipelineIds` | string | Comma-separated pipeline IDs |
| `exampleSetIds` | string | Comma-separated example set IDs |
| `statuses` | string | Comma-separated: `pending`, `running`, `succeeded`, `failed`, `cancelled` |
| `createdBy` | string | Filter by creator |
| `createdAfter` | date | Only evaluations created after this date |
| `createdBefore` | date | Only evaluations created before this date |
| `page` | number | Page number (1-indexed) |
| `pageSize` | number | Results per page (1-100) |
| `orderBy` | string | Sort field: `createdAt`, `completedAt`, `status`, or `aggregateScore` |
| `orderDirection` | string | Sort direction: `asc` or `desc` |
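
The sketch below shows one way to turn Python lists into the comma-separated query parameters from the table, with basic validation of the documented `page` and `pageSize` ranges. The helper `build_list_query` is hypothetical, and its default values (page size 20, newest first) are assumptions, since this page does not state the API's defaults.

```python
from urllib.parse import urlencode


def build_list_query(pipeline_ids=(), statuses=(), page=1, page_size=20,
                     order_by="createdAt", order_direction="desc"):
    """Build the query string for GET /evaluations (hypothetical helper)."""
    if page < 1:
        raise ValueError("page is 1-indexed")
    if not 1 <= page_size <= 100:
        raise ValueError("pageSize must be between 1 and 100")
    params = {
        "page": page,
        "pageSize": page_size,
        "orderBy": order_by,
        "orderDirection": order_direction,
    }
    # List-valued filters are sent as comma-separated strings.
    if pipeline_ids:
        params["pipelineIds"] = ",".join(pipeline_ids)
    if statuses:
        params["statuses"] = ",".join(statuses)
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return urlencode(params, safe=",")


query = build_list_query(
    pipeline_ids=["EMbMEFLyUWEgvnhMWXVVa"],
    statuses=["succeeded", "failed"],
    page_size=50,
)
print(f"https://api.catalyzed.ai/evaluations?{query}")
```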
Evaluation Results
Each example produces a result with its own status and score.
Result Status
| Status | Description |
|---|---|
| `pending` | Not yet processed |
| `passed` | Output matches expected (score >= threshold) |
| `failed` | Output doesn’t match expected (score < threshold) |
| `error` | Pipeline execution error |
| `skipped` | Example was skipped (e.g., mapping failed) |
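
Once results have been fetched, tallying them by status is usually the first step in triage. The sketch below runs on made-up sample data; the `status`, `score`, and `exampleId` keys mirror the result objects documented on this page, while `summarize` is a hypothetical helper and the `None` score for the errored example is an assumption.

```python
from collections import Counter


def summarize(results):
    """Count results per status and collect the ones worth inspecting."""
    counts = Counter(r["status"] for r in results)
    needs_review = [r for r in results if r["status"] in ("failed", "error")]
    return counts, needs_review


# Made-up sample data for illustration only.
sample = [
    {"exampleId": "ex1", "status": "passed", "score": 0.92},
    {"exampleId": "ex2", "status": "failed", "score": 0.41},
    {"exampleId": "ex3", "status": "error", "score": None},
    {"exampleId": "ex4", "status": "passed", "score": 0.88},
]
counts, needs_review = summarize(sample)
print(dict(counts))
print([r["exampleId"] for r in needs_review])
```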
Listing Results
List evaluation results

```bash
curl "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/results" \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
const response = await fetch(
  "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/results",
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const { evaluationResults } = await response.json();
```

```python
import requests

response = requests.get(
    "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/results",
    headers={"Authorization": f"Bearer {api_token}"}
)
results = response.json()["evaluationResults"]
```

Result response:

```json
{
  "evaluationResults": [
    {
      "evaluationResultId": "ResABC123...",
      "evaluationId": "EvR8I6rHBms3W4Qfa2-FN",
      "exampleId": "ExR8I6rHBms3W4Qfa2-FN",
      "executionId": "GkR8I6rHBms3W4Qfa2-FN",
      "status": "passed",
      "mappedInput": { "input_text": "..." },
      "actualOutput": { "output_summary": "..." },
      "score": 0.92,
      "feedback": "Output captures all key metrics accurately. Minor stylistic differences from expected output.",
      "startedAt": "2024-01-15T10:30:05Z",
      "completedAt": "2024-01-15T10:30:15Z",
      "createdAt": "2024-01-15T10:30:00Z"
    }
  ],
  "total": 25,
  "page": 1,
  "pageSize": 20
}
```

Filtering Results
Filter by status to find failures:

```bash
curl "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/results?statuses=failed,error" \
  -H "Authorization: Bearer $API_TOKEN"
```

Top-Level Results Endpoint
Query results across multiple evaluations:

```bash
curl "https://api.catalyzed.ai/evaluation-results?evaluationIds=Ev1,Ev2,Ev3" \
  -H "Authorization: Bearer $API_TOKEN"
```

Cancelling an Evaluation
Cancel evaluation

```bash
curl -X POST https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/cancel \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
await fetch("https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/cancel", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiToken}` },
});
```

```python
import requests

requests.post(
    "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/cancel",
    headers={"Authorization": f"Bearer {api_token}"}
)
```

Deleting an Evaluation
Only pending or cancelled evaluations can be deleted. Completed evaluations are retained for audit purposes.
Delete evaluation

```bash
curl -X DELETE https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN \
  -H "Authorization: Bearer $API_TOKEN"
```

```typescript
await fetch("https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN", {
  method: "DELETE",
  headers: { Authorization: `Bearer ${apiToken}` },
});
```

```python
import requests

requests.delete(
    "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN",
    headers={"Authorization": f"Bearer {api_token}"}
)
```

Evaluation Properties
| Field | Type | Description |
|---|---|---|
| `evaluationId` | string | Unique identifier |
| `teamId` | string | Team that owns this evaluation |
| `pipelineId` | string | Pipeline being evaluated |
| `pipelineConfigurationId` | string | Pipeline config snapshot |
| `exampleSetId` | string | Example set used |
| `exampleSetConfigurationId` | string | Example set config snapshot |
| `evaluatorType` | string | `llm_judge`, `exact_match`, or `semantic` |
| `evaluatorConfig` | object | Evaluator settings |
| `mappingType` | string | `explicit` |
| `mappingConfig` | object | Input/output mappings |
| `status` | string | Current status |
| `totalExamples` | number | Total examples in evaluation |
| `passedCount` | number | Examples that passed |
| `failedCount` | number | Examples that failed |
| `errorCount` | number | Examples with execution errors |
| `skippedCount` | number | Examples that were skipped |
| `aggregateScore` | number | Overall score (0.0-1.0) |
| `errorMessage` | string | Error details (if failed) |
| `startedAt` | timestamp | When evaluation started |
| `completedAt` | timestamp | When evaluation finished |
| `createdAt` | timestamp | When evaluation was created |
| `createdBy` | string | User who created this evaluation |
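
As a worked example of how the count fields relate, the sketch below checks that the per-status counts of a finished evaluation add up to `totalExamples` and derives a pass rate. It assumes every example ends in exactly one of the four counted statuses, which this page implies but does not state; `check_counts` is a hypothetical helper, and the sample values are copied from the completed evaluation response shown earlier.

```python
def check_counts(evaluation):
    """Return the pass rate after checking that the status counts add up.

    Assumption: once an evaluation has finished, every example is counted
    in exactly one of passed/failed/error/skipped.
    """
    parts = ("passedCount", "failedCount", "errorCount", "skippedCount")
    accounted = sum(evaluation.get(name) or 0 for name in parts)
    total = evaluation["totalExamples"]
    if accounted != total:
        raise ValueError(f"{accounted} results accounted for, expected {total}")
    return evaluation["passedCount"] / total if total else 0.0


completed = {
    "totalExamples": 25, "passedCount": 21, "failedCount": 3,
    "errorCount": 1, "skippedCount": 0, "aggregateScore": 0.84,
}
print(check_counts(completed))
```

In this sample the pass rate (21 of 25) happens to equal `aggregateScore`, but this page does not define how `aggregateScore` is computed, so the two should not be assumed to match in general.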
Viewing Evaluation Results
Evaluation results are available via the `/evaluations/:evaluationId/results` endpoint. See the Evaluation Results API for complete endpoint documentation.
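
Because the results response is paginated (`total`, `page`, `pageSize`), collecting every result means walking the pages. The sketch below keeps the HTTP call behind an injected `fetch_page` function so it runs without network access. It assumes the results endpoint accepts `page` and `pageSize` query parameters like the evaluations list does, which this page does not confirm; `iter_results` and `fake_fetch` are hypothetical names.

```python
def iter_results(fetch_page, page_size=20):
    """Yield every result by walking pages until `total` items have been seen.

    fetch_page(page, page_size) must return a dict shaped like the list
    response: {"evaluationResults": [...], "total": int, ...}.
    """
    page, seen = 1, 0
    while True:
        payload = fetch_page(page, page_size)
        items = payload["evaluationResults"]
        yield from items
        seen += len(items)
        # Stop on an empty page too, so a wrong `total` cannot loop forever.
        if not items or seen >= payload["total"]:
            return
        page += 1


# Stand-in for an HTTP call, so the sketch runs without network access.
FAKE_RESULTS = [{"evaluationResultId": f"res{i}"} for i in range(5)]


def fake_fetch(page, page_size):
    start = (page - 1) * page_size
    return {
        "evaluationResults": FAKE_RESULTS[start:start + page_size],
        "total": len(FAKE_RESULTS),
        "page": page,
        "pageSize": page_size,
    }


all_results = list(iter_results(fake_fetch, page_size=2))
print(len(all_results))
```

In real use, `fetch_page` would wrap a GET request to the results endpoint and return the parsed JSON body.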
Next Steps
- Example Sets - Create ground truth data
- Examples - Add individual examples
- Signals - Capture feedback from evaluations
- Evaluation Workflow Guide - End-to-end tutorial