YAML manifests
The `evaluar.yaml` manifest schema — what it describes today, and where it's actually consumed.
`evaluar.yaml` is the project manifest. `evaluar init` writes one for you; the schema is small and intentionally close to what's expressible in the Python API.
The manifest is consumed from Python suite code. The CLI imports a suite file (`evaluar eval_layout_detector.py` or `evaluar test`), and that suite decides how much of `evaluar.yaml` should drive the run.
Where the manifest lives
By convention, at the project root: `evaluar.yaml`. The CLI does not enforce this path; `evaluar init` simply writes the file there.
Top-level shape
This is the literal example written by `evaluar init` (verified against `src/evaluar/cli/commands/init.py:336+` and `evaluar/evaluar.yaml`):
```yaml
project: my_project
version: 0.1.0
results_dir: evaluar/results

models:
  my_model:
    type: detection  # or "ocr", "table", "merged"
    connector:
      type: http  # or "fixture", "callable"
      base_url: http://localhost:8000
      endpoint: /predict
      timeout: 30.0
      headers:
        Authorization: Bearer replace-me
    ground_truth: evaluar/ground_truth/my_model_gt.json
    config: evaluar/configs/my_model.yaml
    normalizer:
      type: mapping  # or "llm"
      mapping:
        objects: "prediction[*].{label: label_name, bbox: box, score: score, class_id: label_id}"
      label_map:
        door: Door
        window: Window

rollup:
  pass_threshold: 0.80
  warn_threshold: 0.60
  model_weights: []
```

Top-level keys
| Key | Required | Purpose |
|---|---|---|
| `project` | yes | Display name. |
| `version` | yes | Project version. |
| `results_dir` | no | Where `EvaluarSuite.run(save=True)` writes runs. Defaults to `evaluar/results`. |
| `models` | yes | Map of `model_id` → model definition. |
| `rollup` | no | `RollupScorer` configuration. |
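The required and optional keys in the table can be checked before a suite touches the manifest. The following is a minimal sketch, not part of the `evaluar` package: `check_top_level` is a hypothetical helper, and it assumes the manifest has already been parsed into a plain `dict`.

```python
# Hypothetical pre-flight check; evaluar itself may validate differently.
REQUIRED_TOP_LEVEL = ("project", "version", "models")
DEFAULT_RESULTS_DIR = "evaluar/results"  # default documented for results_dir


def check_top_level(manifest: dict) -> dict:
    """Return a copy of the manifest with defaults applied, or raise ValueError."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in manifest]
    if missing:
        raise ValueError(f"evaluar.yaml is missing required keys: {', '.join(missing)}")
    if not isinstance(manifest["models"], dict):
        raise ValueError("'models' must be a map of model_id to model definition")

    checked = dict(manifest)  # shallow copy so the caller's dict is untouched
    checked.setdefault("results_dir", DEFAULT_RESULTS_DIR)
    checked.setdefault("rollup", {})  # optional section
    return checked
```

Calling `check_top_level({"project": "demo", "version": "0.1.0", "models": {}})` returns a dict whose `results_dir` is filled in with the documented default.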
Per-model keys
| Key | Required | Purpose |
|---|---|---|
| `type` | yes | `detection`, `ocr`, `table`, or `merged`. |
| `connector` | yes | One of HTTP / fixture / callable. See Connectors. |
| `ground_truth` | yes | Path to a ground-truth JSON file. |
| `config` | yes | Path to a per-model scorer config (thresholds). See Scorers. |
| `normalizer` | no | Mapping or LLM. See Normalization. |
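The per-model rules above can be expressed as a small lint pass over each entry in `models`. This is an illustrative sketch only: `model_problems` is a hypothetical helper, and the real validation in `evaluar` may be stricter or structured differently.

```python
# Hypothetical lint helper; names are illustrative, not evaluar API.
ALLOWED_TYPES = {"detection", "ocr", "table", "merged"}
REQUIRED_MODEL_KEYS = ("type", "connector", "ground_truth", "config")


def model_problems(model_id: str, model_def: dict) -> list[str]:
    """Collect human-readable problems for one model definition."""
    problems = [
        f"{model_id}: missing '{key}'"
        for key in REQUIRED_MODEL_KEYS
        if key not in model_def
    ]
    model_type = model_def.get("type")
    if model_type is not None and model_type not in ALLOWED_TYPES:
        problems.append(f"{model_id}: unknown type {model_type!r}")
    return problems
```

Returning a list rather than raising on the first error lets a CI job report every broken model definition in one run.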
Rollup keys
| Key | Default | Purpose |
|---|---|---|
| `pass_threshold` | `0.80` | Weighted score required for the rollup pass verdict. |
| `warn_threshold` | `0.60` | Weighted score required for the rollup warn verdict. |
| `model_weights` | `[]` | Per-model weights. Empty list means equal weights. |
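To make the threshold semantics concrete, here is a rough sketch of how a weighted score could be turned into a verdict. It is not the `RollupScorer` implementation. Several things are assumptions: the function name `rollup_verdict`, the use of a `dict` of `model_id` to weight (the manifest stores `model_weights` as a list whose entry format is not shown here), inclusive `>=` comparisons, and the `"fail"` label for scores below `warn_threshold`.

```python
def rollup_verdict(scores, pass_threshold=0.80, warn_threshold=0.60, weights=None):
    """Return (weighted_score, verdict) for a dict of model_id -> score in [0, 1].

    Assumption: when weights are given, every scored model has an entry.
    """
    if not scores:
        raise ValueError("at least one model score is required")
    if not weights:
        # No weights configured: every model counts equally.
        weights = {model_id: 1.0 for model_id in scores}

    total_weight = sum(weights[model_id] for model_id in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(scores[m] * weights[m] for m in scores) / total_weight

    if weighted >= pass_threshold:
        verdict = "pass"
    elif weighted >= warn_threshold:
        verdict = "warn"
    else:
        verdict = "fail"  # assumed label for scores below warn_threshold
    return weighted, verdict
```

For example, with the default thresholds, `rollup_verdict({"a": 0.5, "b": 0.75})` averages to 0.625 and lands in the warn band.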
Loading the manifest from a suite
If your eval file should be driven by evaluar.yaml, load it explicitly:
```python
from pathlib import Path

import yaml

from evaluar.api import PipelineBuilder, suite
from evaluar.registry import registry
from evaluar.scoring.rollup import RollupConfig


def build_suite(sample_ids=None, config=None):
    manifest = yaml.safe_load(Path("evaluar.yaml").read_text())
    rollup_config = RollupConfig.model_validate(manifest.get("rollup", {}))

    s = suite(
        sample_ids=sample_ids or [],
        suite_name=manifest["project"],
        results_dir=manifest.get("results_dir"),
    ).rollup(
        pass_threshold=rollup_config.pass_threshold,
        warn_threshold=rollup_config.warn_threshold,
        model_weights=rollup_config.model_weights,
        include_unlisted=rollup_config.include_unlisted,
    )

    for model_id, model_def in manifest["models"].items():
        scorer_config = registry.load_scorer_config(
            path=model_def["config"],
            task_type=model_def["type"],
        )
        builder = PipelineBuilder.for_task(model_def["type"], model_id)
        # … wire connector, ground truth, normalizer from model_def …
        s.add_pipeline(model_id, builder.scorer(scorer_config).build())

    return s
```

This is the supported pattern when you want the manifest to be the single source of truth for which models exist in the project. `registry.load_scorer_config(...)` validates `evaluar/configs/<model>.yaml` against the scorer config class for that task.
Why YAML at all
The manifest earns its keep as the stable configuration side of a code-first eval file:
- Easy to review. Threshold and connector changes are diffable.
- Mirror of the Python form. Model definitions, scorer config paths, and rollup thresholds map cleanly into the builder and `suite(...)`.
- CI-friendly convention. Headless runs can keep `eval_*.py` stable while reviewing threshold, connector, and rollup changes as YAML diffs.
If you need runtime branching, conditional pipelines, or anything dynamic, use the Python API directly and skip the manifest.