Evaluar

How configuration reaches a run — CLI flags, eval files, the project manifest, scorer configs, rollups, and TUI preferences.

Evaluar's runtime configuration is API-first: settings take effect when the Python code in your eval file applies them. The file the runner executes is an eval_*.py file. Running evaluar eval_layout_detector.py imports it, calls build_suite(sample_ids, config), and runs the returned EvaluarSuite.

Four layers cover almost everything:

CLI flags — every command takes its own.
eval_*.py — the eval definition and the place where configuration is applied.
evaluar.yaml — optional project manifest at the project root.
evaluar/configs/<model>.yaml — per-model scorer thresholds.

Plus one local TUI state file:

<results-parent>/.evaluar/tui.yaml — TUI preferences such as overwrite-run mode and log timestamp display.

The supported configuration surfaces are the ones above.

CLI flags

Every CLI command accepts --results-dir, --discovery-dir, --debug, and --version. They are defined on the top-level Typer callback at src/evaluar/cli/main.py:45:

Flag	Default	Purpose
`--results-dir`	`evaluar/results`	Where runs are saved and read.
`--discovery-dir`	`.`	Where the TUI looks for `eval_*.py` files.
`--debug`	`false`	Show full tracebacks in the TUI's log pane.
`--version` / `-v`	—	Print the version and exit.

Command-specific flags are documented on their own pages: Quick start, Reports (evaluar report), Install & init (evaluar init), Headless / CI.

The configuration handoff

evaluar init <task> writes all three project files:

eval_<name>.py — runnable eval-file scaffold.
evaluar.yaml — project manifest with model metadata, config paths, and rollup thresholds.
evaluar/configs/<model>.yaml — per-model scorer thresholds.

The important boundary is this: the eval file owns manifest loading. evaluar eval_layout_detector.py imports the file and calls build_suite(sample_ids, config); inside that function, you decide whether to load evaluar.yaml, scorer YAML, both, or neither.

eval_layout_detector.py

from pathlib import Path
import yaml

from evaluar.api import PipelineBuilder, suite
from evaluar.registry import registry
from evaluar.scoring.rollup import RollupConfig

# `my_model` (the callable under evaluation) and `INPUTS` (the eval inputs,
# whose iteration yields sample ids) are assumed to be defined elsewhere in
# this file.

def build_suite(sample_ids=None, config=None):
    # The eval file decides which config files to load. This relative path is
    # resolved against the current working directory.
    manifest = yaml.safe_load(Path("evaluar.yaml").read_text())
    model = manifest["models"]["layout_detector"]

    # Per-model scorer thresholds, loaded from the path named in the manifest.
    scorer_config = registry.load_scorer_config(
        path=model["config"],
        task_type=model["type"],
    )
    # A missing rollup block validates as an empty dict.
    rollup_config = RollupConfig.model_validate(manifest.get("rollup", {}))

    pipeline = (
        PipelineBuilder.for_task(model["type"], "layout_detector")
        .callable(my_model)
        .inputs(INPUTS)
        .ground_truth_file(model["ground_truth"])
        .default_mapping()
        .scorer(scorer_config)
        .build()
    )

    # Run every known sample unless the caller narrowed the selection.
    s = suite(
        sample_ids=sample_ids or list(INPUTS),
        suite_name=manifest["project"],
        results_dir=manifest.get("results_dir"),
    ).rollup(
        pass_threshold=rollup_config.pass_threshold,
        warn_threshold=rollup_config.warn_threshold,
        model_weights=rollup_config.model_weights,
        include_unlisted=rollup_config.include_unlisted,
    )
    s.add_pipeline("layout_detector", pipeline)
    return s

For many projects, the generated scaffold starts simpler by keeping ground truth and inputs inline. As your evaluation hardens, the config files become the diffable place to tune thresholds and rollups while the eval file remains the executable wiring.

Project manifest — `evaluar.yaml`

The full schema is documented on the YAML manifests page. The manifest carries these keys:

project, version — project identity.
results_dir — default for --results-dir (the CLI flag wins).
models.<model_id> — connector, ground-truth path, normalizer, per-model config path.
rollup — RollupScorer configuration.
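
The precedence stated for results_dir (the CLI flag wins over the manifest value) can be sketched in a few lines. This is an illustration only, not Evaluar's implementation: resolve_results_dir is a hypothetical helper, it assumes an unset flag arrives as None, and the fallback is the --results-dir default from the CLI flags table.

```python
from typing import Optional

# Default of --results-dir, as listed in the CLI flags table.
DEFAULT_RESULTS_DIR = "evaluar/results"


def resolve_results_dir(cli_value: Optional[str], manifest: dict) -> str:
    """Hypothetical helper: CLI flag first, then manifest, then the default."""
    if cli_value is not None:
        return cli_value
    # An absent or empty manifest value falls through to the default.
    return manifest.get("results_dir") or DEFAULT_RESULTS_DIR


print(resolve_results_dir("out/ci", {"results_dir": "runs"}))  # out/ci
print(resolve_results_dir(None, {"results_dir": "runs"}))      # runs
print(resolve_results_dir(None, {}))                           # evaluar/results
```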

See Suites for the loading pattern.

Per-model scorer config

Every entry in models points at a scorer config:

evaluar/configs/my_model.yaml

thresholds:
  map_50:
    pass_floor: 0.7
    warn_floor: 0.6
  precision:
    pass_floor: 0.85
    warn_floor: 0.65
gated_metrics:
  - map_50
  - precision
per_class_gated: true

Keys:

thresholds — per-metric pass_floor / warn_floor. Maps directly to MetricThreshold(pass_floor, warn_floor).
gated_metrics — which metrics participate in the verdict.
per_class_gated — when true, thresholds must hold per class as well as in aggregate.
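
To make the relationship between these keys concrete, here is a minimal sketch of how floors and gated metrics could combine into a verdict. It is not Evaluar's scorer. Threshold is an illustrative stand-in for MetricThreshold, and the sketch assumes that floors are inclusive, that the overall verdict is the worst verdict among gated metrics, and that an empty gate passes. Per-class gating is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Illustrative stand-in for MetricThreshold(pass_floor, warn_floor)."""
    pass_floor: float
    warn_floor: float


def metric_verdict(value: float, t: Threshold) -> str:
    # Assumption: floors are inclusive lower bounds on a higher-is-better metric.
    if value >= t.pass_floor:
        return "pass"
    if value >= t.warn_floor:
        return "warn"
    return "fail"


def gated_verdict(values: dict, thresholds: dict, gated_metrics: list) -> str:
    # Assumption: the overall verdict is the worst verdict among gated metrics;
    # metrics outside gated_metrics do not affect it.
    order = {"fail": 0, "warn": 1, "pass": 2}
    verdicts = [metric_verdict(values[m], thresholds[m]) for m in gated_metrics]
    return min(verdicts, key=order.__getitem__, default="pass")


# Same numbers as the example scorer config above.
thresholds = {
    "map_50": Threshold(pass_floor=0.7, warn_floor=0.6),
    "precision": Threshold(pass_floor=0.85, warn_floor=0.65),
}
values = {"map_50": 0.72, "precision": 0.80}

print(gated_verdict(values, thresholds, ["map_50", "precision"]))  # warn
print(gated_verdict(values, thresholds, ["map_50"]))               # pass
```

With both metrics gated, precision at 0.80 sits between its warn and pass floors and pulls the result down to warn. Gating only map_50 lets the same run pass.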

For detection, table, and merged metrics, higher values are better. OCR's CER/WER settings are still written as maximum acceptable error rates; the OCR scorer inverts them internally before applying the shared threshold model.
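
One plausible reading of "inverts" is that an error rate becomes a score of 1 minus that rate, so a ceiling on error turns into a floor on score. The sketch below illustrates that reading only. Both function names are hypothetical, and Evaluar's actual transformation may differ.

```python
def error_rate_to_score(rate: float) -> float:
    # Assumption: "inverting" means score = 1 - error rate, clamped at 0
    # because CER/WER can exceed 1.0 on very poor output.
    return max(0.0, 1.0 - rate)


def ocr_metric_passes(observed_rate: float, max_acceptable_rate: float) -> bool:
    # After inversion, the ceiling on error becomes a floor on score,
    # so the usual "value >= floor" comparison applies.
    return error_rate_to_score(observed_rate) >= error_rate_to_score(max_acceptable_rate)


print(ocr_metric_passes(0.05, 0.10))  # True: 5% error is under the 10% ceiling
print(ocr_metric_passes(0.20, 0.10))  # False: 20% error exceeds it
```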

See Scorers & metrics.

Rollup config

The rollup block in evaluar.yaml maps to RollupConfig:

evaluar.yaml

rollup:
  pass_threshold: 0.80
  warn_threshold: 0.60
  model_weights: []

RollupScorer turns each pipeline verdict into a weighted score (pass = 1.0, warn = 0.5, fail = 0.0) and compares that weighted average to pass_threshold and warn_threshold. Empty model_weights means equal weights for all pipelines.
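
The arithmetic can be worked through with a short sketch. It uses the verdict scores stated above, but it is not the RollupScorer source. It assumes that weights form a pipeline-name-to-number mapping covering every pipeline and that both thresholds are inclusive.

```python
# Verdict scores as stated in the text.
SCORES = {"pass": 1.0, "warn": 0.5, "fail": 0.0}


def rollup_verdict(verdicts, pass_threshold=0.80, warn_threshold=0.60, weights=None):
    """Return (weighted score, verdict) for a pipeline name -> verdict mapping."""
    if not verdicts:
        raise ValueError("no pipeline verdicts to roll up")
    # None or empty weights means equal weights for all pipelines.
    # Assumption: when given, weights map every pipeline name to a number.
    weights = weights or {name: 1.0 for name in verdicts}
    total = sum(weights[name] for name in verdicts)
    score = sum(SCORES[v] * weights[name] for name, v in verdicts.items()) / total
    # Assumption: both thresholds are inclusive.
    if score >= pass_threshold:
        return score, "pass"
    if score >= warn_threshold:
        return score, "warn"
    return score, "fail"


# Equal weights: (1.0 + 0.5 + 1.0) / 3 is about 0.83, at or above 0.80.
print(rollup_verdict({"layout": "pass", "ocr": "warn", "table": "pass"})[1])  # pass

# Weighted: (2 * 1.0 + 1 * 0.0) / 3 is about 0.67, between 0.60 and 0.80.
print(rollup_verdict({"layout": "pass", "ocr": "fail"},
                     weights={"layout": 2, "ocr": 1})[1])  # warn
```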

To make those values affect a run, load them in your eval file and pass them to suite(...).rollup(...) or suite(..., rollup_config=...).

TUI preferences

<results-parent>/.evaluar/tui.yaml (src/evaluar/cli/project.py:TuiPreferences) holds local TUI preferences such as overwrite-run mode and log timestamp display. For the default evaluar/results directory, the file is evaluar/.evaluar/tui.yaml. The file is safe to delete; the TUI will rebuild it.

This file is not part of run storage. It follows the configured results root so separate projects or result directories can keep separate TUI preferences.
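
The path rule can be expressed with pathlib. tui_preferences_path is a hypothetical helper that mirrors the documented layout. It assumes <results-parent> is the immediate parent of the results directory and does not resolve the path to an absolute one, which Evaluar itself may do.

```python
from pathlib import Path


def tui_preferences_path(results_dir: str = "evaluar/results") -> Path:
    # Hypothetical helper mirroring the documented layout:
    # <results-parent>/.evaluar/tui.yaml
    return Path(results_dir).parent / ".evaluar" / "tui.yaml"


print(tui_preferences_path().as_posix())             # evaluar/.evaluar/tui.yaml
print(tui_preferences_path("runs/exp1").as_posix())  # runs/.evaluar/tui.yaml
```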

Defaults summary

Surface	Default	Where it's defined
`--results-dir`	`evaluar/results`	`cli/main.py:45`
`--discovery-dir`	`.`	`cli/main.py:45`
Run id format	`run_<timestamp>_<short>`	`pipelines/runner.py`
Project manifest path	`evaluar.yaml` (convention)	`cli/commands/init.py`
Scorer config path	`evaluar/configs/<model>.yaml` (convention)	`cli/commands/init.py`
TUI preferences	`<results-parent>/.evaluar/tui.yaml`	`cli/project.py`

The table above is the supported configuration map for current Evaluar releases.
