
Evaluations

Evaluations measure how well a pipeline performs by comparing its outputs against ground truth examples. Run a pipeline against an example set and get aggregate scores plus per-example results.

  • Evaluation - A comparison run between a pipeline and an example set
  • Evaluation Results - Per-example outcomes within an evaluation
  • Evaluator - The method used to compare actual vs expected output
  • Mapping Config - How example inputs/outputs map to pipeline inputs/outputs

Lifecycle

An evaluation moves through these statuses:

pending → running → succeeded
                  ↘ failed
pending → cancelled

Status     Description
pending    Evaluation queued, waiting to start
running    Processing examples
succeeded  All examples processed (individual results may vary)
failed     Fatal error during evaluation
cancelled  Manually cancelled
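
These statuses map naturally onto a TypeScript union. The type and helper below are an illustrative sketch, not part of an official SDK:

type EvaluationStatus =
  | "pending"
  | "running"
  | "succeeded"
  | "failed"
  | "cancelled";

// succeeded, failed, and cancelled are terminal; pending and running are not.
function isTerminal(status: EvaluationStatus): boolean {
  return ["succeeded", "failed", "cancelled"].includes(status);
}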

Evaluations are created by calling the evaluate endpoint on a pipeline:

Create an evaluation

curl -X POST https://api.catalyzed.ai/pipelines/EMbMEFLyUWEgvnhMWXVVa/evaluate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "exampleSetId": "KjR8I6rHBms3W4Qfa2-FN",
    "evaluatorType": "llm_judge",
    "mappingConfig": {
      "inputMappings": [
        {
          "exampleSlotId": "document",
          "pipelineSlotId": "input_text"
        }
      ],
      "outputMappings": [
        {
          "exampleSlotId": "summary",
          "pipelineSlotId": "output_summary"
        }
      ]
    }
  }'

The endpoint returns 202 Accepted because evaluations run asynchronously:

{
  "evaluationId": "EvR8I6rHBms3W4Qfa2-FN",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "pipelineId": "EMbMEFLyUWEgvnhMWXVVa",
  "pipelineConfigurationId": "CfgABC123...",
  "exampleSetId": "KjR8I6rHBms3W4Qfa2-FN",
  "exampleSetConfigurationId": "CfgDEF456...",
  "evaluatorType": "llm_judge",
  "mappingType": "explicit",
  "mappingConfig": { ... },
  "status": "pending",
  "totalExamples": null,
  "passedCount": null,
  "failedCount": null,
  "errorCount": null,
  "aggregateScore": null,
  "createdAt": "2024-01-15T10:30:00Z",
  "createdBy": "usr_abc123"
}
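
The same request can be issued from TypeScript with fetch. This is a minimal sketch: createEvaluation and apiToken are illustrative names rather than an official SDK, while the endpoint, body shape, and 202 status follow the example above.

// Create an evaluation and return the pending evaluation record.
const apiToken = process.env.API_TOKEN!;

async function createEvaluation(pipelineId: string, exampleSetId: string) {
  const response = await fetch(
    `https://api.catalyzed.ai/pipelines/${pipelineId}/evaluate`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        exampleSetId,
        evaluatorType: "llm_judge",
        mappingConfig: {
          inputMappings: [
            { exampleSlotId: "document", pipelineSlotId: "input_text" },
          ],
          outputMappings: [
            { exampleSlotId: "summary", pipelineSlotId: "output_summary" },
          ],
        },
      }),
    }
  );
  if (response.status !== 202) {
    throw new Error(`Create failed: ${response.status}`);
  }
  return response.json(); // { evaluationId, status: "pending", ... }
}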

Evaluators

LLM Judge

Uses an LLM to compare the actual output against the expected output and produce a score:

{
  "evaluatorType": "llm_judge",
  "evaluatorConfig": {
    "threshold": 0.7,
    "criteria": "Focus on factual accuracy and completeness"
  }
}

Option     Type    Description
threshold  number  Score threshold for pass/fail (default: 0.7)
criteria   string  Custom evaluation criteria

Exact Match

Direct string/JSON comparison:

{
  "evaluatorType": "exact_match",
  "evaluatorConfig": {
    "ignoreWhitespace": true,
    "ignoreCase": false
  }
}

Option            Type     Description
ignoreWhitespace  boolean  Ignore whitespace differences (default: false)
ignoreCase        boolean  Case-insensitive comparison (default: false)
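
As a rough sketch of what these options imply, the normalization below illustrates the documented behavior (here ignoreWhitespace is interpreted as stripping all whitespace), not the service's actual implementation:

// Normalize both strings per the config, then compare.
function exactMatch(
  actual: string,
  expected: string,
  config: { ignoreWhitespace?: boolean; ignoreCase?: boolean } = {}
): boolean {
  const normalize = (s: string) => {
    let out = s;
    if (config.ignoreWhitespace) out = out.replace(/\s+/g, ""); // strip all whitespace
    if (config.ignoreCase) out = out.toLowerCase();
    return out;
  };
  return normalize(actual) === normalize(expected);
}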

Semantic

Embedding-based comparison using cosine similarity:

{
  "evaluatorType": "semantic",
  "evaluatorConfig": {
    "threshold": 0.8
  }
}

Option     Type    Description
threshold  number  Cosine similarity threshold (default: 0.8)
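
For intuition, cosine similarity measures the angle between two embedding vectors: dot(a, b) / (|a| · |b|), ranging from -1 to 1. The service computes the embeddings itself; this sketch only illustrates the final comparison:

// Cosine similarity of two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// An example passes when similarity meets the threshold (default 0.8).
const passes = (a: number[], b: number[], threshold = 0.8) =>
  cosineSimilarity(a, b) >= threshold;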

Get evaluation details

curl https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN \
  -H "Authorization: Bearer $API_TOKEN"

Completed evaluation response:

{
  "evaluationId": "EvR8I6rHBms3W4Qfa2-FN",
  "teamId": "ZkoDMyjZZsXo4VAO_nJLk",
  "pipelineId": "EMbMEFLyUWEgvnhMWXVVa",
  "pipelineConfigurationId": "CfgABC123...",
  "exampleSetId": "KjR8I6rHBms3W4Qfa2-FN",
  "exampleSetConfigurationId": "CfgDEF456...",
  "evaluatorType": "llm_judge",
  "status": "succeeded",
  "totalExamples": 25,
  "passedCount": 21,
  "failedCount": 3,
  "errorCount": 1,
  "skippedCount": 0,
  "aggregateScore": 0.84,
  "startedAt": "2024-01-15T10:30:02Z",
  "completedAt": "2024-01-15T10:35:00Z",
  "createdAt": "2024-01-15T10:30:00Z",
  "createdBy": "usr_abc123"
}
Poll until the evaluation reaches a terminal status:

async function waitForEvaluation(evaluationId: string) {
  // apiToken holds your API token.
  while (true) {
    const response = await fetch(
      `https://api.catalyzed.ai/evaluations/${evaluationId}`,
      { headers: { Authorization: `Bearer ${apiToken}` } }
    );
    const evaluation = await response.json();
    if (evaluation.status === "succeeded") {
      console.log(`Score: ${evaluation.aggregateScore}`);
      console.log(
        `Passed: ${evaluation.passedCount}/${evaluation.totalExamples}`
      );
      return evaluation;
    }
    if (evaluation.status === "failed") {
      throw new Error(evaluation.errorMessage);
    }
    if (evaluation.status === "cancelled") {
      throw new Error("Evaluation was cancelled");
    }
    await new Promise((r) => setTimeout(r, 2000)); // Poll every 2 seconds
  }
}
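
For example, using the evaluation ID from the create response above:

const evaluation = await waitForEvaluation("EvR8I6rHBms3W4Qfa2-FN");
console.log(evaluation.aggregateScore); // e.g. 0.84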

List evaluations

curl "https://api.catalyzed.ai/evaluations?pipelineIds=EMbMEFLyUWEgvnhMWXVVa" \
-H "Authorization: Bearer $API_TOKEN"

Query parameters:

Parameter       Type    Description
evaluationIds   string  Comma-separated list of IDs
teamIds         string  Comma-separated team IDs
pipelineIds     string  Comma-separated pipeline IDs
exampleSetIds   string  Comma-separated example set IDs
statuses        string  Comma-separated: pending, running, succeeded, failed, cancelled
createdBy       string  Filter by creator
createdAfter    date    Filter by creation date
createdBefore   date    Filter by creation date
page            number  Page number (1-indexed)
pageSize        number  Results per page (1-100)
orderBy         string  createdAt, completedAt, status, aggregateScore
orderDirection  string  asc or desc
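
These filters are ordinary query parameters, so URLSearchParams keeps the encoding correct. A sketch (listEvaluations and apiToken are illustrative names):

// List evaluations matching the given filters.
async function listEvaluations(filters: Record<string, string>) {
  const query = new URLSearchParams(filters).toString();
  const response = await fetch(
    `https://api.catalyzed.ai/evaluations?${query}`,
    { headers: { Authorization: `Bearer ${apiToken}` } }
  );
  return response.json();
}

// e.g. recent failures for one pipeline, newest first
const recentFailures = await listEvaluations({
  pipelineIds: "EMbMEFLyUWEgvnhMWXVVa",
  statuses: "failed",
  orderBy: "createdAt",
  orderDirection: "desc",
});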

Evaluation Results

Each example produces a result with its own status and score.

Status   Description
pending  Not yet processed
passed   Output matches expected (score >= threshold)
failed   Output doesn’t match expected (score < threshold)
error    Pipeline execution error
skipped  Example was skipped (e.g., mapping failed)

List evaluation results

curl "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/results" \
-H "Authorization: Bearer $API_TOKEN"

Result response:

{
  "evaluationResults": [
    {
      "evaluationResultId": "ResABC123...",
      "evaluationId": "EvR8I6rHBms3W4Qfa2-FN",
      "exampleId": "ExR8I6rHBms3W4Qfa2-FN",
      "executionId": "GkR8I6rHBms3W4Qfa2-FN",
      "status": "passed",
      "mappedInput": { "input_text": "..." },
      "actualOutput": { "output_summary": "..." },
      "score": 0.92,
      "feedback": "Output captures all key metrics accurately. Minor stylistic differences from expected output.",
      "startedAt": "2024-01-15T10:30:05Z",
      "completedAt": "2024-01-15T10:30:15Z",
      "createdAt": "2024-01-15T10:30:00Z"
    }
  ],
  "total": 25,
  "page": 1,
  "pageSize": 20
}

Filter by status to find failures:

curl "https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/results?statuses=failed,error" \
-H "Authorization: Bearer $API_TOKEN"

Query results across multiple evaluations:

curl "https://api.catalyzed.ai/evaluation-results?evaluationIds=Ev1,Ev2,Ev3" \
-H "Authorization: Bearer $API_TOKEN"

Cancel evaluation

curl -X POST https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN/cancel \
  -H "Authorization: Bearer $API_TOKEN"

Delete evaluation

Only pending or cancelled evaluations can be deleted. Completed evaluations are retained for audit purposes.

curl -X DELETE https://api.catalyzed.ai/evaluations/EvR8I6rHBms3W4Qfa2-FN \
  -H "Authorization: Bearer $API_TOKEN"

Evaluation fields

Field                      Type       Description
evaluationId               string     Unique identifier
teamId                     string     Team that owns this evaluation
pipelineId                 string     Pipeline being evaluated
pipelineConfigurationId    string     Pipeline config snapshot
exampleSetId               string     Example set used
exampleSetConfigurationId  string     Example set config snapshot
evaluatorType              string     llm_judge, exact_match, or semantic
evaluatorConfig            object     Evaluator settings
mappingType                string     explicit
mappingConfig              object     Input/output mappings
status                     string     Current status
totalExamples              number     Total examples in evaluation
passedCount                number     Examples that passed
failedCount                number     Examples that failed
errorCount                 number     Examples with execution errors
skippedCount               number     Examples that were skipped
aggregateScore             number     Overall score (0.0-1.0)
errorMessage               string     Error details (if failed)
startedAt                  timestamp  When evaluation started
completedAt                timestamp  When evaluation finished
createdAt                  timestamp  When evaluation was created
createdBy                  string     User who created this evaluation
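
For typed clients, the field table translates directly into an interface. This is a sketch rather than an official type; the nullable count fields reflect the pending response shown earlier:

interface Evaluation {
  evaluationId: string;
  teamId: string;
  pipelineId: string;
  pipelineConfigurationId: string;
  exampleSetId: string;
  exampleSetConfigurationId: string;
  evaluatorType: "llm_judge" | "exact_match" | "semantic";
  evaluatorConfig?: Record<string, unknown>;
  mappingType: "explicit";
  mappingConfig: Record<string, unknown>;
  status: "pending" | "running" | "succeeded" | "failed" | "cancelled";
  totalExamples: number | null;  // null until the run starts
  passedCount: number | null;
  failedCount: number | null;
  errorCount: number | null;
  skippedCount: number | null;
  aggregateScore: number | null; // 0.0-1.0 once computed
  errorMessage?: string;         // present when status is "failed"
  startedAt?: string;            // ISO 8601 timestamps
  completedAt?: string;
  createdAt: string;
  createdBy: string;
}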

Evaluation results are available via the /evaluations/:evaluationId/results endpoint. See the Evaluation Results API for complete endpoint documentation.