Evaluar

The `build_suite` contract — how Evaluar discovers and runs code-first eval files.

A suite in Evaluar is built by a Python eval file that exposes a `build_suite(...)` function. `evaluar eval_file.py` runs one eval file; `evaluar test` discovers and runs every `eval_*.py` file.

In scaffolded projects, the canonical file is eval_<name>.py. Larger repositories can keep multiple eval files at the root or in subdirectories; discovery follows the same eval_*.py convention recursively.

Samples and sample ids

A sample is one evaluation case: one image, document, page, or model input paired with the ground truth for that same case. The sample_id is the stable key that joins those pieces together.

In the scaffold, _SAMPLE_ID = "sample_001" is just a placeholder. The same key appears in:

_INPUTS, where Evaluar finds the model input for that case.
_GT, or the ground-truth JSON file, where Evaluar finds the expected output.
suite(sample_ids=...), where the run decides which cases to execute.
Saved per-sample scorecards, where reports and the TUI show failures.

In a real project, use ids that are meaningful in your data system, such as invoice_1042_page_1, floorplan_a17, or claim_2026_05_07.
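
Because the same key has to appear in both the inputs and the ground truth, a quick consistency check before a run can catch mismatched ids early. The following is a minimal sketch, not part of Evaluar: `missing_sample_ids` is a hypothetical helper, and the dictionaries only mirror the shape of `_INPUTS` and `_GT` from the scaffold.

```python
def missing_sample_ids(requested, inputs, ground_truth):
    """Return requested ids that lack an input or a ground-truth entry."""
    return [sid for sid in requested if sid not in inputs or sid not in ground_truth]


# Stand-ins shaped like the scaffold's _INPUTS and _GT (hypothetical data).
inputs = {"invoice_1042_page_1": {"image_url": "path/to/invoice_1042_page_1.png"}}
ground_truth = {"invoice_1042_page_1": {"objects": []}}

# "floorplan_a17" has neither an input nor ground truth, so it is reported.
print(missing_sample_ids(["invoice_1042_page_1", "floorplan_a17"], inputs, ground_truth))
```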

The contract

An eval file must expose one of:

def build_suite(sample_ids: list[str] | None, config: dict | None) -> EvaluarSuite: ...

# or

def build_runner(sample_ids: list[str] | None, config: dict | None) -> PipelineRunner: ...

build_suite is the high-level form — it returns an EvaluarSuite, which knows how to save its result and carry suite-level rollup config. build_runner is the lower-level form for cases where you've assembled a PipelineRunner directly. Most suites use build_suite.

Both functions receive:

sample_ids — None to run every sample, or a list (typically passed via --samples).
config — a dict that the runner threads through. Suites that read project YAML do so here; many simple suites ignore it.

A canonical example

This is the scaffold that `evaluar init detection` writes (paraphrased from `src/evaluar/cli/commands/init.py:95`):

eval_layout_detector.py

from evaluar.api import detection, suite

_SAMPLE_ID = "sample_001"

def _my_model(image_url: str) -> dict:
    return {"prediction": [{"label_name": "example_class", "box": [100.0, 100.0, 900.0, 900.0], "score": 0.92}]}

_GT = {_SAMPLE_ID: {"objects": [{"label": "example_class", "bbox": [100.0, 100.0, 900.0, 900.0]}]}}
_INPUTS = {_SAMPLE_ID: {"image_url": "path/to/sample.png"}}


def build_suite(sample_ids=None, config=None):
    ids = sample_ids or [_SAMPLE_ID]
    pipeline = (
        detection("my_model")
        .callable(_my_model)
        .inputs(_INPUTS)
        .ground_truth(_GT)
        .default_mapping()
        .build()
    )
    s = suite(sample_ids=ids, suite_name="my_model")
    s.add_pipeline("my_model", pipeline)
    return s


if __name__ == "__main__":
    result = build_suite().run(save=True)
    print(f"Run {result.run_id}: {result.rollup_scorecard.verdict.value}")

The if __name__ == "__main__": block makes the file directly runnable with python eval_layout_detector.py. The CLI does not require it; it imports the file and calls build_suite directly.
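
To illustrate what "imports the file and calls build_suite" can look like, here is a sketch that uses only the standard library. It is an assumption about the general technique, not Evaluar's actual loader: `load_entry_point` is a hypothetical helper, and the stand-in eval file returns a plain dict rather than an `EvaluarSuite`.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_entry_point(path):
    """Import an eval file by path and return (name, callable) for its entry point."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    for name in ("build_suite", "build_runner"):
        fn = getattr(module, name, None)
        if callable(fn):
            return name, fn
    raise AttributeError(f"{path} exposes neither build_suite nor build_runner")


# Demo against a stand-in eval file (hypothetical content, no Evaluar imports).
with tempfile.TemporaryDirectory() as tmp:
    eval_file = Path(tmp) / "eval_demo.py"
    eval_file.write_text(
        "def build_suite(sample_ids=None, config=None):\n"
        "    return {'sample_ids': sample_ids or ['sample_001'], 'config': config}\n"
    )
    name, build = load_entry_point(eval_file)
    demo_result = build(sample_ids=None, config=None)

print(name, demo_result)
```

Since the module is loaded under its file stem rather than `__main__`, a loader of this kind never triggers the `if __name__ == "__main__":` block.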

Running one eval file

evaluar eval_layout_detector.py
evaluar eval_layout_detector.py --samples sample_001 --samples sample_002
evaluar eval_layout_detector.py --headless

Core flags:

| Flag | Default | Purpose |
| --- | --- | --- |
| `--samples` / `-s` | (all) | Restrict to specific sample ids. |
| `--headless` | `false` | Skip the TUI; print to stdout. |
| `--json` | `false` | Accepted by the CLI, but the saved result file is the supported JSON artifact. |
| `--no-save` | `false` | Skip writing `evaluar/results/<run_id>.json`. |
| `--summary-file` | (none) | Write a Markdown summary for this run. |
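
The repeated `--samples` flag maps naturally onto the `sample_ids` argument: a list when the flag is given, `None` when it is omitted. The sketch below reproduces that behavior with `argparse` purely for illustration. Evaluar's real CLI may be built differently, and the parser name `evaluar-sketch` is made up.

```python
import argparse

# Illustrative parser only; flag names are taken from the table above.
parser = argparse.ArgumentParser(prog="evaluar-sketch")
parser.add_argument("eval_file")
parser.add_argument("--samples", "-s", action="append", default=None)
parser.add_argument("--headless", action="store_true")
parser.add_argument("--no-save", action="store_true")
parser.add_argument("--summary-file", default=None)

args = parser.parse_args(
    ["eval_layout_detector.py", "--samples", "sample_001", "-s", "sample_002", "--headless"]
)

# args.samples is a list here; it stays None when --samples is never passed,
# which matches the "None to run every sample" convention of build_suite.
print(args.samples, args.headless, args.no_save, args.summary_file)
```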

Discovering many eval files

Use evaluar test when one repository contains multiple eval files and you want one command to run them together:

evaluar test
evaluar test evaluations --fail-fast
evaluar test --samples sample_001

Discovery rule: `src/evaluar/discovery.py` scans recursively for `eval_*.py`, skips generated, cache, and dependency directories, then imports only files that expose `build_suite(...)`. `--fail-fast` stops after the first failure.
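
The file-scanning half of that rule can be sketched with `pathlib`. This is an approximation under stated assumptions: `find_eval_files` is a hypothetical function, the `SKIP_DIRS` set is a guess at what "generated/cache/dependency directories" might include, and the sketch stops before the import step that checks for `build_suite(...)`.

```python
import tempfile
from pathlib import Path

# Hypothetical skip list; the real one lives in src/evaluar/discovery.py.
SKIP_DIRS = {".git", ".venv", "__pycache__", "node_modules", "build", "dist"}


def find_eval_files(root):
    """Recursively collect eval_*.py files, ignoring anything under a skipped directory."""
    root = Path(root)
    found = []
    for path in sorted(root.rglob("eval_*.py"), key=lambda p: p.as_posix()):
        parent_parts = path.relative_to(root).parts[:-1]
        if any(part in SKIP_DIRS for part in parent_parts):
            continue
        found.append(path)
    return found


# Demo on a throwaway tree: two eval files, one inside .venv, one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("eval_a.py", "evaluations/eval_b.py", ".venv/lib/eval_c.py", "helper.py"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("")
    found_names = [p.relative_to(root).as_posix() for p in find_eval_files(root)]

print(found_names)
```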

When `--samples` is used during discovery, Evaluar passes the same requested sample ids to every discovered suite. If a suite rejects those ids, discovery probes the suite's default sample list; when the requested ids and the defaults do not overlap, the suite is skipped instead of counted as failed. This keeps multi-suite repositories usable when suites have disjoint sample namespaces.
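
The skip-versus-fail decision can be expressed as a small set check. The sketch below is an interpretation of the paragraph above, not Evaluar's code: `classify_rejection` is a hypothetical name, and treating a rejection with overlapping ids as a failure is an assumption the text implies but does not state outright.

```python
def classify_rejection(requested, default_ids):
    """Classify a suite that has already rejected the requested sample ids.

    No overlap with the suite's defaults means the ids belong to another
    suite's namespace, so the suite is skipped. Any overlap is assumed
    (not confirmed by the docs) to be a genuine failure.
    """
    if set(requested).isdisjoint(default_ids):
        return "skipped"
    return "failed"


# Disjoint namespaces: the request was never meant for this suite.
print(classify_rejection(["sample_001"], ["invoice_1042_page_1"]))

# Overlapping ids: the suite knows the sample but still rejected the request.
print(classify_rejection(["sample_001"], ["sample_001", "sample_002"]))
```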

Suites that load YAML

If your project has a manifest (evaluar.yaml) and you want the suite to read it, do so at the top of build_suite:

import yaml
from pathlib import Path
from evaluar.registry import registry
from evaluar.scoring.rollup import RollupConfig

def build_suite(sample_ids=None, config=None):
    manifest = yaml.safe_load(Path("evaluar.yaml").read_text())
    model = manifest["models"]["my_model"]
    scorer_config = registry.load_scorer_config(
        path=model["config"],
        task_type=model["type"],
    )
    rollup_config = RollupConfig.model_validate(manifest.get("rollup", {}))
    # pass scorer_config to .scorer(...), and rollup_config to suite(...)

The eval file owns this handoff: the CLI imports the file, and build_suite(...) loads the manifest when it wants YAML-driven configuration. See YAML manifests for the schema.
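
The lookups in that snippet imply a particular manifest shape. As a rough illustration, the parsed manifest might look like the dictionary below; the `type` and `config` values are hypothetical placeholders, and the authoritative schema is the one described under YAML manifests.

```python
# What yaml.safe_load(...) might return for a minimal evaluar.yaml.
# Only the key names are taken from the snippet above; the values are made up.
manifest = {
    "models": {
        "my_model": {
            "type": "detection",                # hypothetical task type
            "config": "configs/my_model.yaml",  # hypothetical scorer config path
        }
    },
    # "rollup" is left out on purpose: manifest.get("rollup", {}) falls back to {}.
}

model = manifest["models"]["my_model"]
rollup_section = manifest.get("rollup", {})

print(model["type"], model["config"], rollup_section)
```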

What `build_suite` should return

An EvaluarSuite with:

One or more pipelines added via .add_pipeline(model_id, pipeline).
A meaningful suite_name (used for run metadata).
Optional definition_path (the suite(...) helper accepts this; the CLI passes a sensible default if you omit it).
Optional results_dir and rollup_config if you want evaluar.yaml to control saved-run location or suite rollup thresholds.

.run(save=True) returns a RunnerResult and writes evaluar/results/<run_id>.json. See Run storage for what's in that file.
