
Evaluations

Evaluations measure how well a pipeline performs by comparing its outputs against ground truth examples. Run a pipeline against an example set and get aggregate scores plus per-example results.

  • Evaluation - A comparison run between a pipeline and an example set
  • Evaluation Results - Per-example outcomes within an evaluation
  • Evaluator - The method used to compare actual vs expected output
  • Mapping Config - How example inputs/outputs map to pipeline inputs/outputs

Lifecycle

An evaluation moves through these statuses:

pending → running → succeeded
                  ↘ failed
pending → cancelled

Status     Description
pending    Evaluation queued, waiting to start
running    Processing examples
succeeded  All examples processed (individual results may vary)
failed     Fatal error during evaluation
cancelled  Manually cancelled
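
These statuses map naturally onto a TypeScript union. The type and helper below are an illustrative sketch, not part of an official SDK:

type EvaluationStatus =
  | "pending"
  | "running"
  | "succeeded"
  | "failed"
  | "cancelled";

// succeeded, failed, and cancelled are terminal; pending and running are not.
function isTerminal(status: EvaluationStatus): boolean {
  return ["succeeded", "failed", "cancelled"].includes(status);
}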

Evaluations are created by calling the evaluate endpoint on a pipeline:

Create an evaluation

curl -X POST https://api.catalyzed.ai/pipelines/EMbMEFLyUWEgvnhMWXVVa/evaluate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "exampleSetId": "KjR8I6rHBms3W4Qfa2-FN",
    "evaluatorType": "llm_judge",
    "mappingConfig": {
      "inputMappings": [
        {
          "exampleSlotId": "document",
          "pipelineSlotId": "input_text"
        }
      ],
      "outputMappings": [
        {
          "exampleSlotId": "summary",
          "pipelineSlotId": "output_summary"
        }
      ]
    }
  }'

The endpoint returns 202 Accepted because evaluations run asynchronously:

{
  "evaluationId": "EvR8I6rHBms3W4Qfa2-FN",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "pipelineId": "EMbMEFLyUWEgvnhMWXVVa",
  "pipelineConfigurationId": "CfgABC123...",
  "exampleSetId": "KjR8I6rHBms3W4Qfa2-FN",
  "exampleSetConfigurationId": "CfgDEF456...",
  "evaluatorType": "llm_judge",
  "mappingType": "explicit",
  "mappingConfig": { ... },
  "status": "pending",
  "totalExamples": null,
  "passedCount": null,
  "failedCount": null,
  "errorCount": null,
  "aggregateScore": null,
  "createdAt": "2024-01-15T10:30:00Z",
  "createdBy": "usr_abc123"
}
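
The same request can be issued from TypeScript with fetch. This is a minimal sketch: createEvaluation and apiToken are illustrative names rather than an official SDK, while the endpoint, body shape, and 202 status follow the example above.

// Create an evaluation and return the pending evaluation record.
const apiToken = process.env.API_TOKEN!;

async function createEvaluation(pipelineId: string, exampleSetId: string) {
  const response = await fetch(
    `https://api.catalyzed.ai/pipelines/${pipelineId}/evaluate`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        exampleSetId,
        evaluatorType: "llm_judge",
        mappingConfig: {
          inputMappings: [
            { exampleSlotId: "document", pipelineSlotId: "input_text" },
          ],
          outputMappings: [
            { exampleSlotId: "summary", pipelineSlotId: "output_summary" },
          ],
        },
      }),
    }
  );
  if (response.status !== 202) {
    throw new Error(`Create failed: ${response.status}`);
  }
  return response.json(); // { evaluationId, status: "pending", ... }
}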

Evaluators

LLM Judge

Uses an LLM to compare the actual output against the expected output and produce a score:

{
  "evaluatorType": "llm_judge",
  "evaluatorConfig": {
    "threshold": 0.7,
    "criteria": "Focus on factual accuracy and completeness"
  }
}

Option     Type    Description
threshold  number  Score threshold for pass/fail (default: 0.7)
criteria   string  Custom evaluation criteria

Exact Match

Direct string/JSON comparison:

{
  "evaluatorType": "exact_match",
  "evaluatorConfig": {
    "ignoreWhitespace": true,
    "ignoreCase": false
  }
}

Option            Type     Description
ignoreWhitespace  boolean  Ignore whitespace differences (default: false)
ignoreCase        boolean  Case-insensitive comparison (default: false)
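
As a rough sketch of what these options imply, the normalization below illustrates the documented behavior (here ignoreWhitespace is interpreted as stripping all whitespace), not the service's actual implementation:

// Normalize both strings per the config, then compare.
function exactMatch(
  actual: string,
  expected: string,
  config: { ignoreWhitespace?: boolean; ignoreCase?: boolean } = {}
): boolean {
  const normalize = (s: string) => {
    let out = s;
    if (config.ignoreWhitespace) out = out.replace(/\s+/g, ""); // strip all whitespace
    if (config.ignoreCase) out = out.toLowerCase();
    return out;
  };
  return normalize(actual) === normalize(expected);
}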

Semantic

Embedding-based comparison using cosine similarity:

{
  "evaluatorType": "semantic",
  "evaluatorConfig": {
    "threshold": 0.8
  }
}

Option     Type    Description
threshold  number  Cosine similarity threshold (default: 0.8)
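
For intuition, cosine similarity measures the angle between two embedding vectors: dot(a, b) / (|a| · |b|), ranging from -1 to 1. The service computes the embeddings itself; this sketch only illustrates the final comparison:

// Cosine similarity of two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// An example passes when similarity meets the threshold (default 0.8).
const passes = (a: number[], b: number[], threshold = 0.8) =>
  cosineSimilarity(a, b) >= threshold;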

Get evaluation details

curl https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN \
  -H "Authorization: Bearer $API_TOKEN"

Completed evaluation response:

{
  "evaluationId": "EvR8I6rHBms3W4Qfa2-FN",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "pipelineId": "EMbMEFLyUWEgvnhMWXVVa",
  "pipelineConfigurationId": "CfgABC123...",
  "exampleSetId": "KjR8I6rHBms3W4Qfa2-FN",
  "exampleSetConfigurationId": "CfgDEF456...",
  "evaluatorType": "llm_judge",
  "status": "succeeded",
  "totalExamples": 25,
  "passedCount": 21,
  "failedCount": 3,
  "errorCount": 1,
  "skippedCount": 0,
  "aggregateScore": 0.84,
  "startedAt": "2024-01-15T10:30:02Z",
  "completedAt": "2024-01-15T10:35:00Z",
  "createdAt": "2024-01-15T10:30:00Z",
  "createdBy": "usr_abc123"
}
Poll until the evaluation reaches a terminal status:

async function waitForEvaluation(evaluationId: string) {
  // apiToken holds your API token.
  while (true) {
    const response = await fetch(
      `https://api.catalyzed.ai/evaluations/${evaluationId}`,
      { headers: { Authorization: `Bearer ${apiToken}` } }
    );
    const evaluation = await response.json();
    if (evaluation.status === "succeeded") {
      console.log(`Score: ${evaluation.aggregateScore}`);
      console.log(
        `Passed: ${evaluation.passedCount}/${evaluation.totalExamples}`
      );
      return evaluation;
    }
    if (evaluation.status === "failed") {
      throw new Error(evaluation.errorMessage);
    }
    if (evaluation.status === "cancelled") {
      throw new Error("Evaluation was cancelled");
    }
    await new Promise((r) => setTimeout(r, 2000)); // Poll every 2 seconds
  }
}
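
For example, using the evaluation ID from the create response above:

const evaluation = await waitForEvaluation("EvR8I6rHBms3W4Qfa2-FN");
console.log(evaluation.aggregateScore); // e.g. 0.84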

List evaluations

curl "https://api.catalyzed.ai/evaluations?pipelineIds=EMbMEFLyUWEgvnhMWXVVa" \
-H "Authorization: Bearer $API_TOKEN"

Query parameters:

Parameter       Type    Description
evaluationIds   string  Comma-separated list of IDs
teamIds         string  Comma-separated team IDs
pipelineIds     string  Comma-separated pipeline IDs
exampleSetIds   string  Comma-separated example set IDs
statuses        string  Comma-separated: pending, running, succeeded, failed, cancelled
createdBy       string  Filter by creator
createdAfter    date    Filter by creation date
createdBefore   date    Filter by creation date
page            number  Page number (1-indexed)
pageSize        number  Results per page (1-100)
orderBy         string  createdAt, completedAt, status, aggregateScore
orderDirection  string  asc or desc
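
These filters are ordinary query parameters, so URLSearchParams keeps the encoding correct. A sketch (listEvaluations and apiToken are illustrative names):

// List evaluations matching the given filters.
async function listEvaluations(filters: Record<string, string>) {
  const query = new URLSearchParams(filters).toString();
  const response = await fetch(
    `https://api.catalyzed.ai/evaluations?${query}`,
    { headers: { Authorization: `Bearer ${apiToken}` } }
  );
  return response.json();
}

// e.g. recent failures for one pipeline, newest first
const recentFailures = await listEvaluations({
  pipelineIds: "EMbMEFLyUWEgvnhMWXVVa",
  statuses: "failed",
  orderBy: "createdAt",
  orderDirection: "desc",
});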

Evaluation Results

Each example produces a result with its own status and score.

Status   Description
pending  Not yet processed
passed   Output matches expected (score >= threshold)
failed   Output doesn’t match expected (score < threshold)
error    Pipeline execution error
skipped  Example was skipped (e.g., mapping failed)

List evaluation results

curl "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/results" \
-H "Authorization: Bearer $API_TOKEN"

Result response:

{
  "evaluationResults": [
    {
      "evaluationResultId": "ResABC123...",
      "evaluationId": "EvR8I6rHBms3W4Qfa2-FN",
      "exampleId": "ExR8I6rHBms3W4Qfa2-FN",
      "executionId": "GkR8I6rHBms3W4Qfa2-FN",
      "status": "passed",
      "mappedInput": { "input_text": "..." },
      "actualOutput": { "output_summary": "..." },
      "score": 0.92,
      "feedback": "Output captures all key metrics accurately. Minor stylistic differences from expected output.",
      "startedAt": "2024-01-15T10:30:05Z",
      "completedAt": "2024-01-15T10:30:15Z",
      "createdAt": "2024-01-15T10:30:00Z"
    }
  ],
  "total": 25,
  "page": 1,
  "pageSize": 20
}

Filter by status to find failures:

curl "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/results?statuses=failed,error" \
-H "Authorization: Bearer $API_TOKEN"

Query results across multiple evaluations:

curl "https://api.catalyzed.ai/evaluation-results?evaluationIds=Ev1,Ev2,Ev3" \
-H "Authorization: Bearer $API_TOKEN"

Cancel evaluation

curl -X POST https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/cancel \
  -H "Authorization: Bearer $API_TOKEN"

Delete evaluation

Only pending or cancelled evaluations can be deleted. Completed evaluations are retained for audit purposes.

curl -X DELETE https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN \
  -H "Authorization: Bearer $API_TOKEN"

Evaluation fields

Field                      Type       Description
evaluationId               string     Unique identifier
teamId                     string     Team that owns this evaluation
pipelineId                 string     Pipeline being evaluated
pipelineConfigurationId    string     Pipeline config snapshot
exampleSetId               string     Example set used
exampleSetConfigurationId  string     Example set config snapshot
evaluatorType              string     llm_judge, exact_match, or semantic
evaluatorConfig            object     Evaluator settings
mappingType                string     explicit
mappingConfig              object     Input/output mappings
status                     string     Current status
totalExamples              number     Total examples in evaluation
passedCount                number     Examples that passed
failedCount                number     Examples that failed
errorCount                 number     Examples with execution errors
skippedCount               number     Examples that were skipped
aggregateScore             number     Overall score (0.0-1.0)
errorMessage               string     Error details (if failed)
startedAt                  timestamp  When evaluation started
completedAt                timestamp  When evaluation finished
createdAt                  timestamp  When evaluation was created
createdBy                  string     User who created this evaluation
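
For typed clients, the field table translates directly into an interface. This is a sketch rather than an official type; the nullable count fields reflect the pending response shown earlier:

interface Evaluation {
  evaluationId: string;
  teamId: string;
  pipelineId: string;
  pipelineConfigurationId: string;
  exampleSetId: string;
  exampleSetConfigurationId: string;
  evaluatorType: "llm_judge" | "exact_match" | "semantic";
  evaluatorConfig?: Record<string, unknown>;
  mappingType: "explicit";
  mappingConfig: Record<string, unknown>;
  status: "pending" | "running" | "succeeded" | "failed" | "cancelled";
  totalExamples: number | null;  // null until the run starts
  passedCount: number | null;
  failedCount: number | null;
  errorCount: number | null;
  skippedCount: number | null;
  aggregateScore: number | null; // 0.0-1.0 once computed
  errorMessage?: string;         // present when status is "failed"
  startedAt?: string;            // ISO 8601 timestamps
  completedAt?: string;
  createdAt: string;
  createdBy: string;
}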

Evaluation results are available via the /evaluations/:evaluationId/results endpoint. See the Evaluation Results API for complete endpoint documentation.