
Evaluation Workflow

This guide walks through the complete workflow for measuring and improving pipeline quality: creating ground truth data, running evaluations, and interpreting results.

The evaluation workflow consists of four steps:

  1. Create an example set - Define the schema for your ground truth data
  2. Add examples - Populate with input/output pairs
  3. Run an evaluation - Compare pipeline outputs against expected outputs
  4. Analyze results - Identify areas for improvement
Evaluation workflow: Example Set → Examples → Pipeline → Evaluator → Results

Before you start, you'll need:

  • An active pipeline to evaluate
  • Understanding of what correct outputs look like for your use case

First, create an example set that matches your pipeline’s input/output structure:

Create example set

curl -X POST https://api.catalyzed.ai/example-sets \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
    "name": "Document Summarization Test Suite",
    "description": "Ground truth for testing document summarization quality",
    "inputsSchema": {
      "files": [],
      "datasets": [],
      "dataInputs": [
        {
          "id": "document",
          "name": "Document Text",
          "type": "string",
          "required": true
        }
      ]
    },
    "outputsSchema": {
      "files": [],
      "datasets": [],
      "dataInputs": [
        {
          "id": "summary",
          "name": "Summary",
          "type": "string",
          "required": true
        }
      ]
    }
  }'
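
The remaining steps reference the new set's identifier as $EXAMPLE_SET_ID, so save it from the create response. If you prefer to script this step in TypeScript, here is a minimal sketch of the same request; it assumes the response body returns the identifier in an id field, so check the actual response shape before relying on it:

// Hypothetical helper: POST the same JSON body shown above and return the new set's id.
// Assumes the create response includes an `id` field (adjust to the real field name).
async function createExampleSet(body: object): Promise<string> {
  const response = await fetch("https://api.catalyzed.ai/example-sets", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.API_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`Failed to create example set: ${response.status}`);
  }
  const exampleSet = await response.json();
  return exampleSet.id; // assumed field name
}

Call it with the same JSON body as the curl command above and store the returned id for the following steps.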

Add ground truth examples to your set. Each example needs an input and the expected output:

Add examples

# Example 1: Financial report
curl -X POST "https://api.catalyzed.ai/example-sets/$EXAMPLE_SET_ID/examples" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Q4 Financial Report",
    "input": {
      "document": "Q4 2024 Results: Revenue increased 15% year-over-year to $2.3M. Customer acquisition grew 22% with 145 new enterprise clients. Churn remained stable at 3.2%. Average contract value reached $180K, up $45K from Q3."
    },
    "expectedOutput": {
      "summary": "Q4 2024: Revenue up 15% YoY ($2.3M), 145 new enterprise clients (+22% acquisition), 3.2% churn, $180K ACV (+$45K from Q3)."
    },
    "rationale": "Summary should capture all key metrics in a concise format."
  }'

# Example 2: Product announcement
curl -X POST "https://api.catalyzed.ai/example-sets/$EXAMPLE_SET_ID/examples" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Product Launch Announcement",
    "input": {
      "document": "Today we announce the launch of DataSync Pro, our new enterprise data synchronization platform. DataSync Pro offers real-time bidirectional sync, supports 50+ connectors, and includes SOC 2 Type II compliance out of the box. Pricing starts at $500/month."
    },
    "expectedOutput": {
      "summary": "Launched DataSync Pro: enterprise data sync with real-time bidirectional sync, 50+ connectors, SOC 2 compliance. Starting at $500/month."
    },
    "rationale": "Summary should include product name, key features, and pricing."
  }'
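
Adding examples one curl command at a time gets tedious for larger suites. Below is a minimal TypeScript sketch that loads examples from a local JSON file and posts them one by one; the file name and its shape are assumptions, while the endpoint is the same one shown above:

import { readFile } from "node:fs/promises";

// Sketch: bulk-load examples from a local file, e.g. examples.json containing an
// array of { name, input, expectedOutput, rationale } objects (assumed shape).
const apiToken = process.env.API_TOKEN;
const exampleSetId = process.env.EXAMPLE_SET_ID;

const examples = JSON.parse(await readFile("examples.json", "utf8"));

for (const example of examples) {
  const response = await fetch(
    `https://api.catalyzed.ai/example-sets/${exampleSetId}/examples`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(example),
    }
  );
  if (!response.ok) {
    console.error(`Failed to add "${example.name}": ${response.status}`);
  }
}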

Run the evaluation against your pipeline. You’ll need to map example fields to pipeline fields:

Run evaluation

curl -X POST "https://api.catalyzed.ai/pipelines/$PIPELINE_ID/evaluate" \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"exampleSetId": "'"$EXAMPLE_SET_ID"'",
"evaluatorType": "llm_judge",
"evaluatorConfig": {
"threshold": 0.7,
"criteria": "Evaluate based on: 1) All key facts included, 2) Concise format, 3) No hallucinated information"
},
"mappingConfig": {
"inputMappings": [
{
"exampleSlotId": "document",
"pipelineSlotId": "input_text"
}
],
"outputMappings": [
{
"exampleSlotId": "summary",
"pipelineSlotId": "output_summary"
}
]
}
}'
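
The monitoring snippets later in this guide call a startEvaluation helper. Here is a minimal sketch of what that helper might look like: it wraps the same evaluate request shown above, hard-codes the evaluator and mapping config for brevity, and assumes the response includes the evaluationId used for polling.

// Sketch of the startEvaluation helper used in the monitoring examples below.
// The evaluatorConfig and mappingConfig mirror the curl request above; adjust them to your pipeline.
async function startEvaluation(pipelineId: string, exampleSetId: string) {
  const response = await fetch(
    `https://api.catalyzed.ai/pipelines/${pipelineId}/evaluate`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        exampleSetId,
        evaluatorType: "llm_judge",
        evaluatorConfig: {
          threshold: 0.7,
          criteria:
            "Evaluate based on: 1) All key facts included, 2) Concise format, 3) No hallucinated information",
        },
        mappingConfig: {
          inputMappings: [{ exampleSlotId: "document", pipelineSlotId: "input_text" }],
          outputMappings: [{ exampleSlotId: "summary", pipelineSlotId: "output_summary" }],
        },
      }),
    }
  );
  return response.json(); // expected to include evaluationId
}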

Poll until the evaluation completes:

// Poll the evaluation until it reaches a terminal state (succeeded or failed).
async function waitForEvaluation(evaluationId: string) {
  while (true) {
    const response = await fetch(
      `https://api.catalyzed.ai/evaluations/${evaluationId}`,
      { headers: { Authorization: `Bearer ${apiToken}` } }
    );
    const evaluation = await response.json();
    if (evaluation.status === "succeeded") {
      return evaluation;
    }
    if (evaluation.status === "failed") {
      throw new Error(evaluation.errorMessage);
    }
    console.log(`Status: ${evaluation.status}...`);
    // Wait 3 seconds between polls
    await new Promise(r => setTimeout(r, 3000));
  }
}

// evaluationId comes from the evaluate request's response
const completedEval = await waitForEvaluation(evaluationId);
console.log(`Score: ${completedEval.aggregateScore}`);
console.log(`Passed: ${completedEval.passedCount}/${completedEval.totalExamples}`);

The completed evaluation includes summary statistics:

{
  "evaluationId": "EvR8I6rHBms3W4Qfa2-FN",
  "status": "succeeded",
  "totalExamples": 25,
  "passedCount": 21,
  "failedCount": 3,
  "errorCount": 1,
  "aggregateScore": 0.84
}
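
If you consume these results in TypeScript, an interface matching the fields above keeps the polling and alerting code type-safe. Treat this as a sketch derived from the sample response, not an official type definition:

// Shape of a completed evaluation, based on the fields shown in the response above.
interface EvaluationSummary {
  evaluationId: string;
  status: string; // e.g. "succeeded" or "failed"
  totalExamples: number;
  passedCount: number;
  failedCount: number;
  errorCount: number;
  aggregateScore: number; // 0.0-1.0
}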

Fetch per-example results to understand specific failures:

Get evaluation results

curl "https://api.catalyzed.ai/evaluations/$EVALUATION_ID/results?statuses=failed,error" \
-H "Authorization: Bearer $API_TOKEN"

Each result includes:

  • score - Numerical score (0.0-1.0)
  • feedback - Evaluator’s explanation
  • actualOutput - What the pipeline produced
  • mappedInput - Input that was sent to the pipeline
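
A short TypeScript sketch that pulls the failed and errored results and prints the evaluator feedback for each, using the fields listed above. The exact shape of the payload is an assumption; it is treated here as an array under a results key:

// Sketch: list failing examples with their scores and evaluator feedback.
// evaluationId and apiToken come from the earlier steps.
// Assumes the endpoint returns the per-example results as an array under a `results` key.
const response = await fetch(
  `https://api.catalyzed.ai/evaluations/${evaluationId}/results?statuses=failed,error`,
  { headers: { Authorization: `Bearer ${apiToken}` } }
);
const { results } = await response.json();

for (const result of results) {
  console.log(`Score: ${result.score}`);
  console.log(`Feedback: ${result.feedback}`);
  console.log(`Actual output: ${JSON.stringify(result.actualOutput)}`);
  console.log("---");
}
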
Common patterns to look for, with likely causes and suggested actions:

| Pattern | Possible Cause | Action |
| --- | --- | --- |
| Consistently low scores | Pipeline configuration issues | Review system prompt, parameters |
| Specific example types fail | Missing training data for edge cases | Add more examples of that type |
| High variance in scores | Inconsistent evaluation criteria | Refine evaluation criteria |
| Errors on certain inputs | Input format incompatibility | Check mapping configuration |

After analyzing results, you have several options:

  1. Adjust configuration - Update system prompts, model parameters
  2. Run synthesis - Use synthesis runs to get AI-suggested improvements
  3. Add more examples - Expand coverage of edge cases
  4. Refine expected outputs - Make examples more representative
  5. Update rationale - Document what makes outputs correct

Set up regular evaluations to track quality over time:

// Run a scheduled evaluation and alert if quality drops below a threshold.
async function runWeeklyEvaluation(pipelineId: string, exampleSetId: string) {
  // startEvaluation wraps the evaluate request shown earlier (see the sketch above)
  const evaluation = await startEvaluation(pipelineId, exampleSetId);
  const completed = await waitForEvaluation(evaluation.evaluationId);

  // Alert if quality drops
  if (completed.aggregateScore < 0.8) {
    console.warn(`Quality alert: Score dropped to ${completed.aggregateScore}`);
    // Send notification, create ticket, etc.
  }
  return completed;
}
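
One way to wire this into a scheduled job, such as a CI cron task. The environment variable names here are placeholders:

// Example entry point for a scheduled job.
const pipelineId = process.env.PIPELINE_ID!;
const exampleSetId = process.env.EXAMPLE_SET_ID!;

const result = await runWeeklyEvaluation(pipelineId, exampleSetId);
console.log(`Weekly evaluation finished with score ${result.aggregateScore}`);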