Suites
The `build_suite` contract — how Evaluar discovers and runs code-first eval files.
A suite in Evaluar is built by a Python eval file that exposes a `build_suite(...)` function. `evaluar eval_file.py` runs one eval file; `evaluar test` discovers and runs every `eval_*.py` file.
In scaffolded projects, the canonical file is `eval_<name>.py`. Larger repositories can keep multiple eval files at the root or in subdirectories; discovery follows the same `eval_*.py` convention recursively.
Samples and sample ids
A sample is one evaluation case: one image, document, page, or model input paired with the ground truth for that same case. The sample_id is the stable key that joins those pieces together.
In the scaffold, `_SAMPLE_ID = "sample_001"` is just a placeholder. The same key appears in:

- `_INPUTS`, where Evaluar finds the model input for that case.
- `_GT`, or the ground-truth JSON file, where Evaluar finds the expected output.
- `suite(sample_ids=...)`, where the run decides which cases to execute.
- Saved per-sample scorecards, where reports and the TUI show failures.

In a real project, use ids that are meaningful in your data system, such as `invoice_1042_page_1`, `floorplan_a17`, or `claim_2026_05_07`.
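Because the sample id is the join key, an id that exists on only one side leaves a case without an input or without ground truth. The following is a minimal sketch of a pre-flight check, not part of Evaluar: the dict names mirror the scaffold's `_INPUTS` and `_GT`, while `unmatched_sample_ids` and the file paths are hypothetical.

```python
# Hypothetical data keyed by sample id, shaped like the scaffold's _INPUTS and _GT.
INPUTS = {
    "invoice_1042_page_1": {"image_url": "data/invoice_1042_page_1.png"},
    "floorplan_a17": {"image_url": "data/floorplan_a17.png"},
}
GT = {
    "invoice_1042_page_1": {"objects": []},
}


def unmatched_sample_ids(inputs: dict, ground_truth: dict) -> dict:
    """Report sample ids that appear on only one side of the join."""
    return {
        "missing_ground_truth": sorted(inputs.keys() - ground_truth.keys()),
        "missing_input": sorted(ground_truth.keys() - inputs.keys()),
    }


report = unmatched_sample_ids(INPUTS, GT)
print(report)
```

Running a check like this before building the suite surfaces mismatched ids early, rather than partway through a run.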
The contract
An eval file must expose one of:
```python
def build_suite(sample_ids: list[str] | None, config: dict | None) -> EvaluarSuite: ...
# or
def build_runner(sample_ids: list[str] | None, config: dict | None) -> PipelineRunner: ...
```

`build_suite` is the high-level form: it returns an `EvaluarSuite`, which knows how to save its result and carry suite-level rollup config. `build_runner` is the lower-level form for cases where you've assembled a `PipelineRunner` directly. Most suites use `build_suite`.
Both functions receive:
- `sample_ids`: `None` to run every sample, or a list (typically passed via `--samples`).
- `config`: a dict that the runner threads through. Suites that read project YAML do so here; many simple suites ignore it.

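The two arguments can be handled in a few lines. This is a sketch with stand-in values rather than Evaluar code: `ALL_SAMPLE_IDS`, `describe_call`, and the `iou_threshold` key are hypothetical names chosen for illustration.

```python
from __future__ import annotations

# Hypothetical: every sample id this eval file knows about.
ALL_SAMPLE_IDS = ["sample_001", "sample_002", "sample_003"]


def describe_call(sample_ids: list[str] | None = None, config: dict | None = None) -> dict:
    """Show how a build function might interpret its two arguments."""
    # None means "run every sample"; a list restricts the run to those ids.
    ids = list(ALL_SAMPLE_IDS) if sample_ids is None else list(sample_ids)
    # config may be None, so fall back to an empty dict before reading keys.
    settings = config or {}
    return {"ids": ids, "iou_threshold": settings.get("iou_threshold", 0.5)}


everything = describe_call()
restricted = describe_call(["sample_002"], {"iou_threshold": 0.75})
print(everything)
print(restricted)
```

Checking `sample_ids is None` explicitly keeps an empty list distinct from "no restriction"; an expression such as `sample_ids or default` treats both the same way.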
A canonical example
This is the scaffold `evaluar init detection` writes (paraphrased from `src/evaluar/cli/commands/init.py:95`):

```python
from evaluar.api import detection, suite

_SAMPLE_ID = "sample_001"


def _my_model(image_url: str) -> dict:
    return {
        "prediction": [
            {"label_name": "example_class", "box": [100.0, 100.0, 900.0, 900.0], "score": 0.92}
        ]
    }


_GT = {_SAMPLE_ID: {"objects": [{"label": "example_class", "bbox": [100.0, 100.0, 900.0, 900.0]}]}}
_INPUTS = {_SAMPLE_ID: {"image_url": "path/to/sample.png"}}


def build_suite(sample_ids=None, config=None):
    ids = sample_ids or [_SAMPLE_ID]
    pipeline = (
        detection("my_model")
        .callable(_my_model)
        .inputs(_INPUTS)
        .ground_truth(_GT)
        .default_mapping()
        .build()
    )
    s = suite(sample_ids=ids, suite_name="my_model")
    s.add_pipeline("my_model", pipeline)
    return s


if __name__ == "__main__":
    result = build_suite().run(save=True)
    print(f"Run {result.run_id}: {result.rollup_scorecard.verdict.value}")
```

The `if __name__ == "__main__":` block makes the file directly runnable with `python eval_layout_detector.py`. The CLI does not require it; it imports the file and calls `build_suite` directly.
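To illustrate why the `__main__` block is optional, here is a rough sketch of importing an eval file by path and calling its entry point using only the standard library. It is an assumption about the general technique, not Evaluar's actual loader; `load_entry_point` and the stand-in eval file are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_entry_point(path: Path):
    """Import a file by path and return its build_suite or build_runner."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot import {path}")
    module = importlib.util.module_from_spec(spec)
    # On import, __name__ is the file stem, so the __main__ block is skipped.
    spec.loader.exec_module(module)
    for name in ("build_suite", "build_runner"):
        fn = getattr(module, name, None)
        if callable(fn):
            return fn
    raise AttributeError(f"{path} exposes neither build_suite nor build_runner")


# Stand-in eval file: returns a plain dict instead of a real suite object.
EVAL_SOURCE = '''
def build_suite(sample_ids=None, config=None):
    return {"suite_name": "demo", "sample_ids": sample_ids or ["sample_001"]}

if __name__ == "__main__":
    raise SystemExit("only reached when the file is run as a script")
'''

with tempfile.TemporaryDirectory() as tmp:
    eval_file = Path(tmp) / "eval_demo.py"
    eval_file.write_text(EVAL_SOURCE, encoding="utf-8")
    entry_point = load_entry_point(eval_file)
    loaded = entry_point(["sample_002"], None)

print(entry_point.__name__, loaded)
```

The import succeeds without triggering the `SystemExit`, which matches the behavior described above: the script block only matters when the file is executed directly.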
Running one eval file
```shell
evaluar eval_layout_detector.py
evaluar eval_layout_detector.py --samples sample_001 --samples sample_002
evaluar eval_layout_detector.py --headless
```

Core flags:

| Flag | Default | Purpose |
|---|---|---|
| `--samples` / `-s` | (all) | Restrict to specific sample ids. |
| `--headless` | false | Skip the TUI; print to stdout. |
| `--json` | false | Accepted by the CLI, but the saved result file is the supported JSON artifact. |
| `--no-save` | false | Skip writing `evaluar/results/<run_id>.json`. |
| `--summary-file` | (none) | Write a Markdown summary for this run. |

Discovering many eval files
Use evaluar test when one repository contains multiple eval files and you want one command to run them together:
```shell
evaluar test
evaluar test evaluations --fail-fast
evaluar test --samples sample_001
```

Discovery rule: `src/evaluar/discovery.py` scans recursively for `eval_*.py`, skips generated/cache/dependency directories, then imports only files that expose `build_suite(...)`. `--fail-fast` stops after the first failure.
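The discovery rule can be approximated with the standard library. This sketch is not the code in `src/evaluar/discovery.py`: the `SKIP_DIRS` set, the `discover` and `exposes_build_suite` helpers, and the choice to detect `build_suite` by parsing the source with `ast` before importing are all assumptions made for illustration.

```python
import ast
import tempfile
from pathlib import Path

# Assumed skip list; the directories Evaluar actually skips may differ.
SKIP_DIRS = {".git", ".venv", "__pycache__", "node_modules"}


def exposes_build_suite(path: Path) -> bool:
    """Check for a top-level build_suite function without importing the file."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    return any(
        isinstance(node, ast.FunctionDef) and node.name == "build_suite"
        for node in tree.body
    )


def discover(root: Path) -> list:
    """Recursively collect eval_*.py files that define build_suite."""
    found = []
    for path in sorted(root.rglob("eval_*.py")):
        parents = path.relative_to(root).parts[:-1]
        if any(part in SKIP_DIRS for part in parents):
            continue  # inside a skipped directory
        if exposes_build_suite(path):
            found.append(path)
    return found


WITH_SUITE = "def build_suite(sample_ids=None, config=None):\n    return None\n"
WITHOUT_SUITE = "HELPER = 1\n"

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = {
        "eval_a.py": WITH_SUITE,
        "sub/eval_b.py": WITH_SUITE,
        ".venv/eval_c.py": WITH_SUITE,   # skipped: dependency directory
        "eval_helper.py": WITHOUT_SUITE,  # skipped: no build_suite
    }
    for rel, source in layout.items():
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(source, encoding="utf-8")
    discovered = [p.relative_to(root).as_posix() for p in discover(root)]

print(discovered)
```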
When --samples is used during discovery, Evaluar passes the same requested
sample ids to every discovered suite. If a suite rejects those ids, discovery
now probes the suite's default sample list; when there is no overlap, the suite
is skipped instead of counted as failed. This keeps multi-suite repositories
usable when suites have disjoint sample namespaces.
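The skip-versus-fail decision can be sketched as follows. Everything here is a stand-in: `make_build_suite` fakes an eval file whose `build_suite` returns the ids it would run, rejection is modeled as a `ValueError`, and probing the defaults by calling `build_suite(None, None)` is an assumption about how discovery might learn a suite's default sample list. Treating a partial overlap as a failure is also an assumption.

```python
def make_build_suite(known_ids):
    """Stand-in for an eval file's build_suite; returns the ids it would run."""
    def build_suite(sample_ids=None, config=None):
        if sample_ids is None:
            return list(known_ids)
        unknown = [s for s in sample_ids if s not in known_ids]
        if unknown:
            raise ValueError(f"unknown sample ids: {unknown}")
        return list(sample_ids)
    return build_suite


def classify(build_suite, requested):
    """Return 'run', 'skipped', or 'failed' for one discovered suite."""
    try:
        build_suite(requested, None)
        return "run"
    except ValueError:
        # The suite rejected the ids: probe its defaults and look for overlap.
        defaults = set(build_suite(None, None))
        return "failed" if defaults & set(requested) else "skipped"


invoices = make_build_suite(["invoice_1", "invoice_2"])
floorplans = make_build_suite(["floorplan_a17"])

outcomes = {
    "invoices": classify(invoices, ["invoice_1"]),
    "floorplans": classify(floorplans, ["invoice_1"]),
    "invoices_mixed": classify(invoices, ["invoice_1", "floorplan_a17"]),
}
print(outcomes)
```

With disjoint namespaces, the floorplan suite is skipped rather than reported as a failure when only invoice ids are requested.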
Suites that load YAML
If your project has a manifest (`evaluar.yaml`) and you want the suite to read it, do so at the top of `build_suite`:

```python
import yaml
from pathlib import Path

from evaluar.registry import registry
from evaluar.scoring.rollup import RollupConfig


def build_suite(sample_ids=None, config=None):
    manifest = yaml.safe_load(Path("evaluar.yaml").read_text())
    model = manifest["models"]["my_model"]
    scorer_config = registry.load_scorer_config(
        path=model["config"],
        task_type=model["type"],
    )
    rollup_config = RollupConfig.model_validate(manifest.get("rollup", {}))
    # pass scorer_config to .scorer(...), and rollup_config to suite(...)
```

The eval file owns this handoff: the CLI imports the file, and `build_suite(...)` loads the manifest when it wants YAML-driven configuration. See YAML manifests for the schema.
What build_suite should return
An `EvaluarSuite` with:

- One or more pipelines added via `.add_pipeline(model_id, pipeline)`.
- A meaningful `suite_name` (used for run metadata).
- Optional `definition_path` (the `suite(...)` helper accepts this; the CLI passes a sensible default if you don't).
- Optional `results_dir` and `rollup_config` if you want `evaluar.yaml` to control saved-run location or suite rollup thresholds.

`.run(save=True)` returns a `RunnerResult` and writes `evaluar/results/<run_id>.json`. See Run storage for what's in that file.
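Since each saved run is a JSON file under `evaluar/results/`, a script can pick up the newest one without going through the CLI. The sketch below assumes only that location and file extension; `latest_result` is a hypothetical helper, and the file contents written here are placeholders, not the real result schema.

```python
import json
import os
import tempfile
from pathlib import Path


def latest_result(results_dir: Path):
    """Return the most recently modified *.json file, or None if there are none."""
    candidates = list(results_dir.glob("*.json"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


with tempfile.TemporaryDirectory() as tmp:
    results_dir = Path(tmp) / "evaluar" / "results"
    results_dir.mkdir(parents=True)
    # Write two placeholder result files with increasing modification times.
    for i, run_id in enumerate(["run_a", "run_b"]):
        path = results_dir / f"{run_id}.json"
        path.write_text(json.dumps({"placeholder": run_id}), encoding="utf-8")
        os.utime(path, (1_000 + i, 1_000 + i))
    newest = latest_result(results_dir)
    newest_name = newest.name
    top_level_keys = sorted(json.loads(newest.read_text(encoding="utf-8")))

print(newest_name, top_level_keys)
```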