Configuration
How configuration reaches a run — CLI flags, eval files, the project manifest, scorer configs, rollups, and TUI preferences.
Evaluar's runtime configuration is API-first. The file the runner executes is an eval_*.py file: evaluar eval_layout_detector.py imports it, calls build_suite(sample_ids, config), and runs the returned EvaluarSuite.
Four layers cover almost everything:
- CLI flags: every command takes its own.
- `eval_*.py`: the eval definition and the place where configuration is applied.
- `evaluar.yaml`: optional project manifest at the project root.
- `evaluar/configs/<model>.yaml`: per-model scorer thresholds.
Plus one local TUI state file:
- `<results-parent>/.evaluar/tui.yaml`: TUI preferences such as overwrite-run mode and log timestamp display.
The supported configuration surfaces are the ones above.
CLI flags
Every CLI command accepts `--results-dir`, `--discovery-dir`, `--debug`, and `--version`. These are defined on the top-level Typer callback at src/evaluar/cli/main.py:45:
| Flag | Default | Purpose |
|---|---|---|
| `--results-dir` | `evaluar/results` | Where runs are saved and read. |
| `--discovery-dir` | `.` | Where the TUI looks for `eval_*.py` files. |
| `--debug` | `false` | Show full tracebacks in the TUI's log pane. |
| `--version` / `-v` | n/a | Print the version and exit. |
Command-specific flags are documented on their own pages: Quick start, Reports (evaluar report), Install & init (evaluar init), Headless / CI.
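As a rough illustration of what `--discovery-dir` controls, the sketch below lists `eval_*.py` files in a directory. `discover_eval_files` is a hypothetical helper, not part of Evaluar, and the non-recursive search is an assumption; the TUI's actual discovery logic may differ.

```python
from pathlib import Path


def discover_eval_files(discovery_dir: str = ".") -> list[Path]:
    """Return eval_*.py files directly under discovery_dir, sorted by path.

    Hypothetical helper: a non-recursive search is assumed here.
    """
    return sorted(Path(discovery_dir).glob("eval_*.py"))
```

Calling `discover_eval_files(".")` mirrors the default `--discovery-dir` value from the table above.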
The configuration handoff
evaluar init <task> writes all three project files:
- `eval_<name>.py`: runnable eval-file scaffold.
- `evaluar.yaml`: project manifest with model metadata, config paths, and rollup thresholds.
- `evaluar/configs/<model>.yaml`: per-model scorer thresholds.
The important boundary is this: the eval file owns manifest loading. evaluar eval_layout_detector.py imports the file and calls build_suite(sample_ids, config); inside that function, you decide whether to load evaluar.yaml, scorer YAML, both, or neither.
```python
from pathlib import Path

import yaml

from evaluar.api import PipelineBuilder, suite
from evaluar.registry import registry
from evaluar.scoring.rollup import RollupConfig

# `my_model` (the callable under test) and `INPUTS` (the sample inputs)
# are assumed to be defined elsewhere in this eval file.


def build_suite(sample_ids=None, config=None):
    # Load the project manifest and pick this model's entry.
    manifest = yaml.safe_load(Path("evaluar.yaml").read_text())
    model = manifest["models"]["layout_detector"]

    # Per-model scorer thresholds and suite-level rollup settings.
    scorer_config = registry.load_scorer_config(
        path=model["config"],
        task_type=model["type"],
    )
    rollup_config = RollupConfig.model_validate(manifest.get("rollup", {}))

    pipeline = (
        PipelineBuilder.for_task(model["type"], "layout_detector")
        .callable(my_model)
        .inputs(INPUTS)
        .ground_truth_file(model["ground_truth"])
        .default_mapping()
        .scorer(scorer_config)
        .build()
    )

    s = suite(
        sample_ids=sample_ids or list(INPUTS),
        suite_name=manifest["project"],
        results_dir=manifest.get("results_dir"),
    ).rollup(
        pass_threshold=rollup_config.pass_threshold,
        warn_threshold=rollup_config.warn_threshold,
        model_weights=rollup_config.model_weights,
        include_unlisted=rollup_config.include_unlisted,
    )
    s.add_pipeline("layout_detector", pipeline)
    return s
```

For many projects, the generated scaffold starts simpler by keeping ground truth and inputs inline. As your evaluation hardens, the config files become the diffable place to tune thresholds and rollups while the eval file remains the executable wiring.
Project manifest — evaluar.yaml
The full schema lives on YAML manifests. The keys it carries are:
- `project`, `version`: project identity.
- `results_dir`: default for `--results-dir` (the CLI flag wins).
- `models.<model_id>`: connector, ground-truth path, normalizer, per-model config path.
- `rollup`: `RollupScorer` configuration.
See Suites for the loading pattern.
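The precedence for `results_dir` (CLI flag first, then the manifest, then the built-in default) can be sketched as below. `resolve_results_dir` is a hypothetical helper for illustration only, and it assumes the CLI value is `None` when `--results-dir` was not passed; Evaluar's own resolution code may be structured differently.

```python
from typing import Optional

BUILTIN_RESULTS_DIR = "evaluar/results"  # default listed for --results-dir


def resolve_results_dir(cli_value: Optional[str], manifest: dict) -> str:
    """Pick the results directory: CLI flag, then manifest, then default.

    Assumes cli_value is None when --results-dir was not passed.
    """
    if cli_value is not None:
        return cli_value
    return manifest.get("results_dir") or BUILTIN_RESULTS_DIR
```

With a manifest of `{"results_dir": "runs"}`, passing `cli_value="out/ci"` yields `out/ci`, while `cli_value=None` falls back to `runs`.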
Per-model scorer config
Every entry in models points at a scorer config:
```yaml
thresholds:
  map_50:
    pass_floor: 0.7
    warn_floor: 0.6
  precision:
    pass_floor: 0.85
    warn_floor: 0.65
gated_metrics:
  - map_50
  - precision
per_class_gated: true
```

Keys:
- `thresholds`: per-metric `pass_floor` / `warn_floor`. Maps directly to `MetricThreshold(pass_floor, warn_floor)`.
- `gated_metrics`: which metrics participate in the verdict.
- `per_class_gated`: when `true`, thresholds must hold per class as well as in aggregate.
For detection, table, and merged metrics, higher values are better. OCR's CER/WER settings are still written as maximum acceptable error rates; the OCR scorer inverts them internally before applying the shared threshold model.
See Scorers & metrics.
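To make the threshold model concrete, here is a minimal sketch of how `pass_floor` and `warn_floor` could turn metric values into a verdict for higher-is-better metrics. The `MetricThreshold` class below is an illustrative stand-in rather than the library's class, and two points are assumptions: comparisons are inclusive, and the worst verdict among the gated metrics decides the outcome. Per-class gating and the OCR inversion are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricThreshold:
    """Illustrative stand-in for the library's MetricThreshold."""
    pass_floor: float
    warn_floor: float


def metric_verdict(value: float, threshold: MetricThreshold) -> str:
    """Higher is better; comparisons are assumed to be inclusive."""
    if value >= threshold.pass_floor:
        return "pass"
    if value >= threshold.warn_floor:
        return "warn"
    return "fail"


def gated_verdict(metrics: dict, thresholds: dict, gated_metrics: list) -> str:
    """Assumed rule: the worst verdict among gated metrics wins.

    Expects at least one gated metric.
    """
    order = {"fail": 0, "warn": 1, "pass": 2}
    verdicts = [metric_verdict(metrics[m], thresholds[m]) for m in gated_metrics]
    return min(verdicts, key=order.__getitem__)


# Thresholds copied from the scorer config shown above.
thresholds = {
    "map_50": MetricThreshold(pass_floor=0.7, warn_floor=0.6),
    "precision": MetricThreshold(pass_floor=0.85, warn_floor=0.65),
}
result = gated_verdict(
    {"map_50": 0.72, "precision": 0.80}, thresholds, ["map_50", "precision"]
)
print(result)  # warn: map_50 passes, precision sits between its two floors
```

Dropping `precision` from `gated_metrics` in this sketch would turn the same scores into a pass, which is why the gated list is worth reviewing alongside the floors.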
Rollup config
The rollup block in evaluar.yaml maps to RollupConfig:
```yaml
rollup:
  pass_threshold: 0.80
  warn_threshold: 0.60
  model_weights: []
```

`RollupScorer` turns each pipeline verdict into a weighted score (pass = 1.0, warn = 0.5, fail = 0.0) and compares that weighted average to `pass_threshold` and `warn_threshold`. Empty `model_weights` means equal weights for all pipelines.
To make those values affect a run, load them in your eval file and pass them to suite(...).rollup(...) or suite(..., rollup_config=...).
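The rollup arithmetic described in this section can be sketched in a few lines. This is not `RollupScorer` itself: `rollup_verdict` is a hypothetical function, the weights are modelled as a `{pipeline_name: weight}` dict, and the threshold comparisons are assumed to be inclusive. The `ocr` pipeline name is made up for the example.

```python
VERDICT_SCORES = {"pass": 1.0, "warn": 0.5, "fail": 0.0}


def rollup_verdict(pipeline_verdicts, model_weights=None,
                   pass_threshold=0.80, warn_threshold=0.60):
    """Return (weighted_average, verdict) for a set of pipeline verdicts.

    model_weights is modelled as {pipeline_name: weight} (an assumption);
    empty or missing weights mean every pipeline counts equally.
    """
    if not pipeline_verdicts:
        raise ValueError("at least one pipeline verdict is required")
    if model_weights:
        weights = {name: model_weights[name] for name in pipeline_verdicts}
    else:
        weights = {name: 1.0 for name in pipeline_verdicts}
    total = sum(weights.values())
    score = sum(
        VERDICT_SCORES[verdict] * weights[name]
        for name, verdict in pipeline_verdicts.items()
    ) / total
    if score >= pass_threshold:
        return score, "pass"
    if score >= warn_threshold:
        return score, "warn"
    return score, "fail"


# One passing and one warning pipeline, equal weights: (1.0 + 0.5) / 2 = 0.75
print(rollup_verdict({"layout_detector": "pass", "ocr": "warn"}))  # (0.75, 'warn')
```

With the thresholds from the block above, a suite needs an average of at least 0.80 to pass, so a single warning among two equally weighted pipelines is enough to drop the rollup to a warn.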
TUI preferences
`<results-parent>/.evaluar/tui.yaml` (src/evaluar/cli/project.py:TuiPreferences) holds local TUI preferences such as overwrite-run mode and log timestamp display. For the default `evaluar/results` directory, the file is `evaluar/.evaluar/tui.yaml`. The file is safe to delete; the TUI will rebuild it.
This file is not part of run storage. It follows the configured results root so separate projects or result directories can keep separate TUI preferences.
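The path rule can be illustrated with a small sketch. `tui_preferences_path` is a hypothetical helper, and it assumes that `<results-parent>` means the immediate parent of the results directory, which matches the default example above; whether Evaluar resolves the path to an absolute one first is not covered here.

```python
from pathlib import Path


def tui_preferences_path(results_dir: str = "evaluar/results") -> Path:
    """Locate tui.yaml under <results-parent>/.evaluar/ (hypothetical helper)."""
    return Path(results_dir).parent / ".evaluar" / "tui.yaml"


print(tui_preferences_path().as_posix())  # evaluar/.evaluar/tui.yaml
```

Pointing `--results-dir` at a different root, for example `runs/nightly`, would place the preferences file at `runs/.evaluar/tui.yaml` under the same assumption.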
Defaults summary
| Surface | Default | Where it's defined |
|---|---|---|
| --results-dir | evaluar/results | cli/main.py:45 |
| --discovery-dir | . | cli/main.py:45 |
| Run id format | run_<timestamp>_<short> | pipelines/runner.py |
| Project manifest path | evaluar.yaml (convention) | cli/commands/init.py |
| Scorer config path | evaluar/configs/<model>.yaml (convention) | cli/commands/init.py |
| TUI preferences | <results-parent>/.evaluar/tui.yaml | cli/project.py |
The table above is the supported configuration map for current Evaluar releases.