Evaluar

The `evaluar.yaml` manifest schema — what it describes today, and where it's actually consumed.

`evaluar.yaml` is the project manifest. `evaluar init` writes one for you; the schema is small and intentionally close to what's expressible in the Python API.

The manifest is consumed from Python suite code. The CLI imports a suite file (`evaluar eval_layout_detector.py` or `evaluar test`), and that suite decides how much of `evaluar.yaml` should drive the run.

Where the manifest lives

By convention, the manifest lives at the project root as `evaluar.yaml`. The CLI does not enforce this path; it is simply where `evaluar init` writes the file.
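
Because the location is a convention rather than something the CLI enforces, a suite may want to locate the manifest itself instead of relying on the current working directory. The following is a minimal sketch of one way to do that by walking up from a starting directory. `find_manifest` and `MANIFEST_NAME` are hypothetical names for illustration and are not part of Evaluar.

```python
from pathlib import Path
from typing import Optional

MANIFEST_NAME = "evaluar.yaml"  # conventional file name, per this page


def find_manifest(start: Path) -> Optional[Path]:
    """Return the nearest evaluar.yaml at or above the directory `start`, or None."""
    start = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for directory in (start, *start.parents):
        candidate = directory / MANIFEST_NAME
        if candidate.is_file():
            return candidate
    return None
```

A suite file could call `find_manifest(Path(__file__).parent)` so that the lookup does not depend on where the command was launched.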

Top-level shape

This is the literal example written by evaluar init (verified against src/evaluar/cli/commands/init.py:336+ and evaluar/evaluar.yaml):

evaluar.yaml

project: my_project
version: 0.1.0
results_dir: evaluar/results

models:
  my_model:
    type: detection            # or "ocr", "table", "merged"
    connector:
      type: http               # or "fixture", "callable"
      base_url: http://localhost:8000
      endpoint: /predict
      timeout: 30.0
      headers:
        Authorization: Bearer replace-me
    ground_truth: evaluar/ground_truth/my_model_gt.json
    config: evaluar/configs/my_model.yaml
    normalizer:
      type: mapping            # or "llm"
      mapping:
        objects: "prediction[*].{label: label_name, bbox: box, score: score, class_id: label_id}"
      label_map:
        door: Door
        window: Window

rollup:
  pass_threshold: 0.80
  warn_threshold: 0.60
  model_weights: []
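
The `normalizer.mapping` expression in the example reads like a JMESPath-style projection, although this page does not say which expression language is used. As a rough illustration of what such a mapping would produce, here is a plain-Python sketch that renames the fields of each entry under `prediction` and then applies `label_map`. The function name `normalize`, the sample response shape, and the assumption that `label_map` is applied after the field rename are all illustrative and not confirmed by the source.

```python
from typing import Any

# Field renames taken from the example mapping: output key -> key in the raw prediction.
FIELD_MAP = {"label": "label_name", "bbox": "box", "score": "score", "class_id": "label_id"}
# Label rewrites taken from the example label_map.
LABEL_MAP = {"door": "Door", "window": "Window"}


def normalize(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Project each raw prediction onto the renamed fields, then rewrite labels."""
    objects = []
    for raw in response.get("prediction", []):
        obj = {out_key: raw.get(src_key) for out_key, src_key in FIELD_MAP.items()}
        # Assumption: labels missing from label_map pass through unchanged.
        obj["label"] = LABEL_MAP.get(obj["label"], obj["label"])
        objects.append(obj)
    return objects
```

For a response such as `{"prediction": [{"label_name": "door", "box": [0, 0, 10, 20], "score": 0.9, "label_id": 3}]}`, this sketch yields one object with `label` set to `Door` and the box stored under `bbox`.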

Top-level keys

| Key | Required | Purpose |
| --- | --- | --- |
| `project` | yes | Display name. |
| `version` | yes | Project version. |
| `results_dir` | no | Where `EvaluarSuite.run(save=True)` writes runs. Defaults to `evaluar/results`. |
| `models` | yes | Map of `model_id → model definition`. |
| `rollup` | no | `RollupScorer` configuration. |
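
The required and optional keys above can be checked before a suite does any work. This is a minimal sketch operating on an already-parsed manifest dictionary; `check_top_level` is a hypothetical helper, not an Evaluar function, and the only default it applies is the documented `results_dir` value.

```python
from typing import Any

REQUIRED_TOP_LEVEL = ("project", "version", "models")
DEFAULT_RESULTS_DIR = "evaluar/results"  # documented default for results_dir


def check_top_level(manifest: dict[str, Any]) -> dict[str, Any]:
    """Fail fast on missing required keys and fill in the optional ones."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in manifest]
    if missing:
        raise ValueError(f"manifest is missing required keys: {', '.join(missing)}")
    if not isinstance(manifest["models"], dict):
        raise ValueError("'models' must map model_id to a model definition")
    resolved = dict(manifest)  # shallow copy so the caller's dict is untouched
    resolved.setdefault("results_dir", DEFAULT_RESULTS_DIR)
    resolved.setdefault("rollup", {})
    return resolved
```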

Per-model keys

| Key | Required | Purpose |
| --- | --- | --- |
| `type` | yes | `detection`, `ocr`, `table`, or `merged`. |
| `connector` | yes | One of HTTP / fixture / callable. See Connectors. |
| `ground_truth` | yes | Path to a ground-truth JSON file. |
| `config` | yes | Path to a per-model scorer config (thresholds). See Scorers. |
| `normalizer` | no | Mapping or LLM. See Normalization. |
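
A per-model check along the same lines might look like the sketch below. It only encodes what this page states: the four required keys and the allowed `type`, `connector.type`, and `normalizer.type` values shown in the example. `model_problems` is a hypothetical helper name, and real validation in Evaluar may be stricter or differ in detail.

```python
from typing import Any

REQUIRED_MODEL_KEYS = ("type", "connector", "ground_truth", "config")
MODEL_TYPES = {"detection", "ocr", "table", "merged"}
CONNECTOR_TYPES = {"http", "fixture", "callable"}
NORMALIZER_TYPES = {"mapping", "llm"}


def model_problems(model_id: str, model_def: dict[str, Any]) -> list[str]:
    """Return human-readable problems for one entry under `models`."""
    problems = [
        f"{model_id}: missing required key '{key}'"
        for key in REQUIRED_MODEL_KEYS
        if key not in model_def
    ]
    if "type" in model_def and model_def["type"] not in MODEL_TYPES:
        problems.append(f"{model_id}: unknown type '{model_def['type']}'")
    if "connector" in model_def:
        connector_type = (model_def["connector"] or {}).get("type")
        if connector_type not in CONNECTOR_TYPES:
            problems.append(f"{model_id}: unknown connector type '{connector_type}'")
    # The normalizer is optional, so only inspect it when present.
    normalizer = model_def.get("normalizer")
    if normalizer is not None and normalizer.get("type") not in NORMALIZER_TYPES:
        problems.append(f"{model_id}: unknown normalizer type '{normalizer.get('type')}'")
    return problems
```

Collecting problems into a list rather than raising on the first one lets a suite report every broken model definition in a single pass.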

Rollup keys

| Key | Default | Purpose |
| --- | --- | --- |
| `pass_threshold` | `0.80` | Weighted score required for the rollup `pass` verdict. |
| `warn_threshold` | `0.60` | Weighted score required for the rollup `warn` verdict. |
| `model_weights` | `[]` | Per-model weights. Empty list means equal weights. |
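
To make the interaction between the thresholds and the weights concrete, here is a sketch of a weighted rollup under stated assumptions. It is not `RollupScorer`'s implementation. It assumes per-model scores in the range 0 to 1, takes weights as a plain `model_id → weight` dictionary (the entry format of `model_weights` in the manifest is not described on this page), skips models that have no weight when weights are given, and uses `fail` as the name of the verdict below `warn_threshold`.

```python
from typing import Optional


def rollup_verdict(
    scores: dict[str, float],
    weights: Optional[dict[str, float]] = None,
    pass_threshold: float = 0.80,
    warn_threshold: float = 0.60,
) -> tuple[float, str]:
    """Return (weighted score, verdict) for a set of per-model scores."""
    if not scores:
        raise ValueError("no model scores to roll up")
    if not weights:
        # Empty weights mean every model counts equally.
        weights = {model_id: 1.0 for model_id in scores}
    # Assumption: a model without a weight contributes nothing.
    total = sum(weights.get(model_id, 0.0) for model_id in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(score * weights.get(model_id, 0.0) for model_id, score in scores.items()) / total
    if weighted >= pass_threshold:
        verdict = "pass"
    elif weighted >= warn_threshold:
        verdict = "warn"
    else:
        verdict = "fail"  # assumed name for the lowest verdict
    return weighted, verdict
```

With equal weights, scores of 0.9 and 0.5 average to about 0.7 and land in the `warn` band; weighting the first model four times as heavily lifts the result to about 0.82, which clears the default `pass_threshold`.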

Loading the manifest from a suite

If your eval file should be driven by `evaluar.yaml`, load it explicitly:

eval_layout_detector.py

from pathlib import Path
import yaml

from evaluar.api import PipelineBuilder, suite
from evaluar.registry import registry
from evaluar.scoring.rollup import RollupConfig

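def build_suite(sample_ids=None, config=None):
    # Load the manifest; the relative path resolves against the current working directory.
    manifest = yaml.safe_load(Path("evaluar.yaml").read_text(encoding="utf-8"))
    # Validate the optional rollup section; an empty mapping is passed when it is absent.
    rollup_config = RollupConfig.model_validate(manifest.get("rollup", {}))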
    s = suite(
        sample_ids=sample_ids or [],
        suite_name=manifest["project"],
        results_dir=manifest.get("results_dir"),
    ).rollup(
        pass_threshold=rollup_config.pass_threshold,
        warn_threshold=rollup_config.warn_threshold,
        model_weights=rollup_config.model_weights,
        include_unlisted=rollup_config.include_unlisted,
    )

    # Build one pipeline per model declared in the manifest.
    for model_id, model_def in manifest["models"].items():
        scorer_config = registry.load_scorer_config(
            path=model_def["config"],
            task_type=model_def["type"],
        )
        builder = PipelineBuilder.for_task(model_def["type"], model_id)
        # … wire connector, ground truth, normalizer from model_def …
        s.add_pipeline(model_id, builder.scorer(scorer_config).build())

    return s

This is the supported pattern when you want the manifest to be the single source of truth for which models exist in the project. The call to `registry.load_scorer_config(...)` validates `evaluar/configs/<model>.yaml` against the scorer config class for that task.

Why YAML at all

The manifest earns its keep as the stable configuration side of a code-first eval file:

Reviewable in code review. Threshold and connector changes show up as plain diffs.
Mirror of the Python form. Model definitions, scorer config paths, and rollup thresholds map cleanly into the builder and `suite(...)`.
CI-friendly convention. Headless runs can keep `eval_*.py` stable, while threshold, connector, and rollup changes are reviewed as YAML diffs.

If you need runtime branching, conditional pipelines, or anything dynamic, use the Python API directly and skip the manifest.
