
YAML manifests

The `evaluar.yaml` manifest schema — what it describes today, and where it's actually consumed.

evaluar.yaml is the project manifest. evaluar init writes one for you; the schema is small and intentionally close to what's expressible in the Python API.

The manifest is consumed from Python suite code. The CLI imports a suite file (evaluar eval_layout_detector.py or evaluar test), and that suite decides how much of evaluar.yaml should drive the run.

Where the manifest lives

By convention, the manifest lives at the project root as evaluar.yaml. The CLI does not enforce this path; it is simply where evaluar init writes the file.
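Because the CLI does not enforce the location, a suite that reads the manifest has to find it on its own. The following is a minimal sketch of one way to do that, by walking up from the suite file until an evaluar.yaml is found. The helper name `find_manifest` is hypothetical and not part of evaluar.

```python
from pathlib import Path
from typing import Optional

MANIFEST_NAME = "evaluar.yaml"


def find_manifest(start: Path) -> Optional[Path]:
    """Return the nearest evaluar.yaml at or above `start`, or None.

    Hypothetical helper for illustration; evaluar does not ship it.
    """
    start = start.resolve()
    # If `start` is a file (e.g. the suite module), begin from its directory.
    directory = start if start.is_dir() else start.parent
    for candidate_dir in (directory, *directory.parents):
        candidate = candidate_dir / MANIFEST_NAME
        if candidate.is_file():
            return candidate
    return None
```

Inside a suite file this could be called as `find_manifest(Path(__file__))`, which keeps the lookup independent of the directory the CLI was started from.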

Top-level shape

This is the literal example written by evaluar init (verified against src/evaluar/cli/commands/init.py:336+ and evaluar/evaluar.yaml):

evaluar.yaml
project: my_project
version: 0.1.0
results_dir: evaluar/results

models:
  my_model:
    type: detection            # or "ocr", "table", "merged"
    connector:
      type: http               # or "fixture", "callable"
      base_url: http://localhost:8000
      endpoint: /predict
      timeout: 30.0
      headers:
        Authorization: Bearer replace-me
    ground_truth: evaluar/ground_truth/my_model_gt.json
    config: evaluar/configs/my_model.yaml
    normalizer:
      type: mapping            # or "llm"
      mapping:
        objects: "prediction[*].{label: label_name, bbox: box, score: score, class_id: label_id}"
      label_map:
        door: Door
        window: Window

rollup:
  pass_threshold: 0.80
  warn_threshold: 0.60
  model_weights: []
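
To make the normalizer block in the example more concrete, here is a plain-Python sketch of what the `objects` expression and `label_map` appear to describe: rename fields on each item under `prediction`, then map raw labels to display labels. This is an illustration under assumptions, not evaluar's implementation. In particular, it assumes `label_map` is applied to the normalized `label` field and that unmapped labels pass through unchanged.

```python
from typing import Any

# Field renames read off the `objects` expression in the manifest example:
# normalized field name -> field name in the raw model response.
FIELD_MAP = {
    "label": "label_name",
    "bbox": "box",
    "score": "score",
    "class_id": "label_id",
}

# Copied from the `label_map` block in the manifest example.
LABEL_MAP = {"door": "Door", "window": "Window"}


def normalize(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Illustrative equivalent of the mapping normalizer in the example."""
    objects = []
    for item in response.get("prediction", []):
        # Pick and rename the four fields; absent fields become None.
        obj = {out_key: item.get(src_key) for out_key, src_key in FIELD_MAP.items()}
        # Assumption: label_map rewrites the normalized label,
        # and labels without an entry are left as they are.
        obj["label"] = LABEL_MAP.get(obj["label"], obj["label"])
        objects.append(obj)
    return objects
```

For a raw response whose `prediction` list holds one item with `label_name: "door"`, the sketch yields one object with `label: "Door"` and the renamed `bbox`, `score`, and `class_id` fields.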

Top-level keys

| Key | Required | Purpose |
| --- | --- | --- |
| project | yes | Display name. |
| version | yes | Project version. |
| results_dir | no | Where EvaluarSuite.run(save=True) writes runs. Defaults to evaluar/results. |
| models | yes | Map of model_id → model definition. |
| rollup | no | RollupScorer configuration. |
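
The required and optional keys above can be checked on an already-loaded manifest before any pipelines are built. The sketch below is a hypothetical pre-flight check written against plain dicts; `check_top_level` and `resolve_results_dir` are illustrative names, not evaluar APIs.

```python
REQUIRED_TOP_LEVEL = ("project", "version", "models")
DEFAULT_RESULTS_DIR = "evaluar/results"


def check_top_level(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks fine."""
    problems = [
        f"missing required key: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in manifest
    ]
    # `models` is documented as a map of model_id -> model definition.
    if "models" in manifest and not isinstance(manifest["models"], dict):
        problems.append("models must be a mapping of model_id to model definition")
    return problems


def resolve_results_dir(manifest: dict) -> str:
    """Apply the documented default when results_dir is omitted."""
    return manifest.get("results_dir", DEFAULT_RESULTS_DIR)
```

Returning a list of problems rather than raising on the first one lets a suite report every issue in a single run.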

Per-model keys

| Key | Required | Purpose |
| --- | --- | --- |
| type | yes | detection, ocr, table, or merged. |
| connector | yes | One of HTTP / fixture / callable. See Connectors. |
| ground_truth | yes | Path to a ground-truth JSON file. |
| config | yes | Path to a per-model scorer config (thresholds). See Scorers. |
| normalizer | no | Mapping or LLM. See Normalization. |
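
The same kind of check works per model. This sketch takes the allowed values from the table above and from the comments in the example manifest. `check_model` is a hypothetical helper, and evaluar's own validation may be stricter or looser than this.

```python
MODEL_TYPES = {"detection", "ocr", "table", "merged"}
CONNECTOR_TYPES = {"http", "fixture", "callable"}
NORMALIZER_TYPES = {"mapping", "llm"}
REQUIRED_MODEL_KEYS = ("type", "connector", "ground_truth", "config")


def check_model(model_id: str, model_def: dict) -> list[str]:
    """Return problems found in one entry under `models:`."""
    problems = [
        f"{model_id}: missing required key '{key}'"
        for key in REQUIRED_MODEL_KEYS
        if key not in model_def
    ]
    if "type" in model_def and model_def["type"] not in MODEL_TYPES:
        problems.append(f"{model_id}: unknown type '{model_def['type']}'")
    connector = model_def.get("connector")
    if isinstance(connector, dict) and connector.get("type") not in CONNECTOR_TYPES:
        problems.append(f"{model_id}: unknown connector type '{connector.get('type')}'")
    # normalizer is optional, so only check it when present.
    normalizer = model_def.get("normalizer")
    if isinstance(normalizer, dict) and normalizer.get("type") not in NORMALIZER_TYPES:
        problems.append(f"{model_id}: unknown normalizer type '{normalizer.get('type')}'")
    return problems
```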

Rollup keys

| Key | Default | Purpose |
| --- | --- | --- |
| pass_threshold | 0.80 | Weighted score required for the rollup pass verdict. |
| warn_threshold | 0.60 | Weighted score required for the rollup warn verdict. |
| model_weights | [] | Per-model weights. Empty list means equal weights. |
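
As a rough illustration of how these three keys interact, the sketch below computes a weighted score and turns it into a verdict. It is not RollupScorer's implementation. The `fail` label, the inclusive `>=` comparisons, the dict shape used for weights, and the weight of 1.0 for models without an explicit weight are all assumptions made for the example.

```python
from typing import Optional


def weighted_score(scores: dict[str, float],
                   weights: Optional[dict[str, float]] = None) -> float:
    """Weighted mean of per-model scores; no weights means equal weights."""
    if not scores:
        raise ValueError("at least one model score is required")
    weights = weights or {}
    # Assumption: a model without an explicit weight counts with weight 1.0.
    total_weight = sum(weights.get(model_id, 1.0) for model_id in scores)
    weighted_sum = sum(
        score * weights.get(model_id, 1.0) for model_id, score in scores.items()
    )
    return weighted_sum / total_weight


def rollup_verdict(score: float,
                   pass_threshold: float = 0.80,
                   warn_threshold: float = 0.60) -> str:
    """Map a weighted score to a verdict using the documented defaults."""
    if score >= pass_threshold:
        return "pass"
    if score >= warn_threshold:
        return "warn"
    return "fail"  # assumed name for the below-warn outcome
```

With the defaults, a project whose weighted score lands between 0.60 and 0.80 would get a warn verdict under these assumptions.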

Loading the manifest from a suite

If your eval file should be driven by evaluar.yaml, load it explicitly:

eval_layout_detector.py
from pathlib import Path
import yaml

from evaluar.api import PipelineBuilder, suite
from evaluar.registry import registry
from evaluar.scoring.rollup import RollupConfig

def build_suite(sample_ids=None, config=None):
    # Read the manifest relative to the current working directory.
    manifest = yaml.safe_load(Path("evaluar.yaml").read_text())
    # Validate the optional rollup section; an absent section validates as {}.
    rollup_config = RollupConfig.model_validate(manifest.get("rollup", {}))
    s = suite(
        sample_ids=sample_ids or [],
        suite_name=manifest["project"],
        results_dir=manifest.get("results_dir"),
    ).rollup(
        pass_threshold=rollup_config.pass_threshold,
        warn_threshold=rollup_config.warn_threshold,
        model_weights=rollup_config.model_weights,
        include_unlisted=rollup_config.include_unlisted,
    )

    # Build one pipeline per entry under `models:`.
    for model_id, model_def in manifest["models"].items():
        # Load and validate the per-model scorer config for this task type.
        scorer_config = registry.load_scorer_config(
            path=model_def["config"],
            task_type=model_def["type"],
        )
        builder = PipelineBuilder.for_task(model_def["type"], model_id)
        # … wire connector, ground truth, normalizer from model_def …
        s.add_pipeline(model_id, builder.scorer(scorer_config).build())

    return s

This is the supported pattern when you want the manifest to be the single source of truth for which models exist in the project. registry.load_scorer_config(...) validates evaluar/configs/<model>.yaml against the scorer config class for that task.

Why YAML at all

The manifest earns its keep as the stable configuration side of a code-first eval file:

  • Reviewable in code review. Threshold and connector changes are diffable.
  • Mirror of the Python form. Model definitions, scorer config paths, and rollup thresholds map cleanly into the builder and suite(...).
  • CI-friendly convention. Headless runs can keep eval_*.py stable while reviewing threshold, connector, and rollup changes as YAML diffs.

If you need runtime branching, conditional pipelines, or anything dynamic, use the Python API directly and skip the manifest.
