Run storage
The on-disk shape of a run — a single JSON file under `evaluar/results/` per run.
Every completed run writes one JSON file, whether it was started with evaluar eval_file.py, evaluar test, python eval_file.py, or EvaluarSuite.run(save=True). This page documents where that file lands and what it contains.
Where runs live
By default: evaluar/results/<run_id>.json (a project-relative path, not a user-home path).
Override with --results-dir on any CLI command, or by passing results_dir= to suite(...) in Python:
evaluar eval_layout_detector.py --results-dir /tmp/runs
evaluar report show <run_id> --results-dir /tmp/runs
evaluar report list --results-dir /tmp/runs
evaluar report export <run_id> --results-dir /tmp/runs --out run.json

from evaluar.api import suite
s = suite(sample_ids=[...], results_dir="/tmp/runs")

Run storage is intentionally compact: a run is one file, not a directory tree.
File layout
The JSON is the serialized form of RunnerResult (src/evaluar/pipelines/runner.py). Top-level keys:
{
"run_id": "run_2026-05-07_14-22",
"elapsed_seconds": 42.5,
"failed_pipelines": [],
"rollup_scorecard": { /* RollupScorecard */ },
"pipeline_results": {
"<model_id>": { /* PipelineResult */ }
},
"metadata": { /* suite metadata snapshot */ }
}
rollup_scorecard
The headline verdict for the run.
{
"run_id": "...",
"model_id": "<rollup>",
"task_type": "rollup",
"metrics": { "pass_rate": 0.83, ... },
"thresholds": { "pass_rate": { "pass_floor": 0.80, "warn_floor": 0.60 } },
"verdict": "pass",
"errors": [],
"artifacts": {}
}
pipeline_results
One entry per pipeline added to the suite.
{
"<model_id>": {
"run_id": "...",
"elapsed_seconds": 12.3,
"failed_samples": [],
"final_scorecard": { /* same shape as rollup_scorecard, scoped to this pipeline */ },
"per_sample_scorecards": [
{
"sample_id": "...",
"metrics": { "map": 0.72, ... },
"thresholds": { ... },
"verdict": "pass",
"errors": [],
"artifacts": {}
}
]
}
}
per_sample_scorecards is what the failure inspector and bbox editor read.
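As an illustration of reading per_sample_scorecards outside the TUI, the sketch below walks a saved run with the standard json module and lists every sample whose verdict is not "pass". The inline run is hypothetical and trimmed to the keys described above; the "fail" verdict string is an assumption, since this page only shows "pass".

```python
import json

# Hypothetical saved run, trimmed to the keys documented above.
# The "fail" verdict value is an assumption for illustration.
run = json.loads("""
{
  "run_id": "run_example",
  "pipeline_results": {
    "my_model": {
      "per_sample_scorecards": [
        {"sample_id": "sample_001", "metrics": {"map": 0.72}, "verdict": "pass", "errors": []},
        {"sample_id": "sample_002", "metrics": {"map": 0.31}, "verdict": "fail", "errors": []}
      ]
    }
  }
}
""")


def non_passing_samples(run: dict) -> list[tuple[str, str, str]]:
    """Return (model_id, sample_id, verdict) for each scorecard that did not pass."""
    rows = []
    for model_id, pipeline in run.get("pipeline_results", {}).items():
        for card in pipeline.get("per_sample_scorecards", []):
            if card.get("verdict") != "pass":
                rows.append((model_id, card["sample_id"], card["verdict"]))
    return rows


failing = non_passing_samples(run)
print(failing)
```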
metadata
Suite metadata snapshot, written by EvaluarSuite.metadata():
{
"source": "CLI", // or "API"
"suite_name": "my_eval",
"definition_path": "eval_layout_detector.py",
"sample_ids": ["sample_001", "sample_002"],
"pipeline_ids": ["my_model"]
}
This is the same metadata the e2e test asserts on (tests/e2e/test_code_first_detection_spike.py).
Run ids
If you don't pin a run id with PipelineBuilder.run_id(...) or suite(run_id=...), the runner generates one in the form run_<timestamp>_<short>. Run ids are stable identifiers — the same one round-trips through evaluar report show <run_id>, the JSON filename, and the run metadata.
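The round trip between a run id and its filename can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of evaluar: run_file and id_matches_filename are hypothetical helpers, and the demo writes a throwaway file rather than touching a real results directory.

```python
import json
import tempfile
from pathlib import Path


def run_file(results_dir: Path, run_id: str) -> Path:
    """Build the expected path <results_dir>/<run_id>.json (hypothetical helper)."""
    return results_dir / f"{run_id}.json"


def id_matches_filename(path: Path) -> bool:
    """Check that the top-level run_id inside the file equals the file's stem."""
    run = json.loads(path.read_text(encoding="utf-8"))
    return run.get("run_id") == path.stem


# Demo against a temporary directory standing in for a results directory.
with tempfile.TemporaryDirectory() as tmp:
    path = run_file(Path(tmp), "run_demo")
    path.write_text(json.dumps({"run_id": "run_demo"}), encoding="utf-8")
    consistent = id_matches_filename(path)
print(consistent)
```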
TUI preferences
The TUI writes preferences to a separate .evaluar/tui.yaml file next to the configured results root (src/evaluar/cli/project.py:TuiPreferences). For the default evaluar/results/ directory, that path is evaluar/.evaluar/tui.yaml. This file is not part of run storage and is safe to delete.
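Read literally, "next to the configured results root" suggests the preferences file sits in a .evaluar directory that is a sibling of the results directory. The sketch below encodes only that reading, using a hypothetical helper; the actual resolution logic lives in src/evaluar/cli/project.py and may differ.

```python
from pathlib import Path


def tui_preferences_path(results_dir: Path) -> Path:
    """Assumed layout: .evaluar/tui.yaml as a sibling of the results directory."""
    return results_dir.parent / ".evaluar" / "tui.yaml"


default_prefs = tui_preferences_path(Path("evaluar/results"))
print(default_prefs.as_posix())
```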
Working with saved runs
Saved runs are plain JSON. Use evaluar report export when you want a named output file for CI, review, or handoff:
# Full saved run
evaluar report export <run_id> --format json --out run.json
# Compact summary for CI or dashboards
evaluar report export <run_id> --format summary-json --out summary.json
# Flat scalar metrics for spreadsheets
evaluar report export <run_id> --format csv --out metrics.csv
# Markdown summary for review notes or CI artifacts
evaluar report export <run_id> --format markdown --out summary.md
# Zip bundle for handoff
evaluar report archive <run_id> --out run.zip
# Clean up saved runs
evaluar report delete <run_id> --yes
evaluar report clear --yes
evaluar report prune --keep 20 --yes
Standard JSON tools still work directly on the result files:
jq .rollup_scorecard.verdict evaluar/results/run_*.json
For programmatic access, the Pydantic model that produced the file can also load it: RunnerResult.model_validate_json(path.read_text()).
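When importing evaluar is not convenient, the same lookup can be done with the standard json module alone. The following sketch collects the rollup verdict of every run file in a directory; the verdicts helper is hypothetical, and the demo uses a temporary directory with a trimmed, made-up run file.

```python
import json
import tempfile
from pathlib import Path


def verdicts(results_dir: Path) -> dict[str, str]:
    """Map each saved run's run_id to its rollup verdict."""
    out = {}
    for path in sorted(results_dir.glob("*.json")):
        run = json.loads(path.read_text(encoding="utf-8"))
        out[run["run_id"]] = run["rollup_scorecard"]["verdict"]
    return out


# Demo against a throwaway directory; point verdicts() at a real
# results directory such as Path("evaluar/results") instead.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample = {"run_id": "run_a", "rollup_scorecard": {"verdict": "pass"}}
    (root / "run_a.json").write_text(json.dumps(sample), encoding="utf-8")
    demo = verdicts(root)
print(demo)
```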
Move / copy semantics
A run is fully self-contained in its JSON. Copying evaluar/results/<id>.json to another machine and pointing --results-dir at the destination is sufficient to open the run there.
The exception is anything the bbox editor needs (the source images themselves) — those are referenced by path in your manifest's inputs and need to be reachable independently. Run storage carries metrics, scorecards, and metadata, not source media.
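Because a run is a single file, moving it amounts to a plain file copy. The sketch below shows one way to do that with shutil; copy_run is a hypothetical helper rather than an evaluar API, and the demo copies a made-up run between two temporary directories. Source images are not handled here and still need to be reachable on the destination machine.

```python
import json
import shutil
import tempfile
from pathlib import Path


def copy_run(run_id: str, src_dir: Path, dst_dir: Path) -> Path:
    """Copy <src_dir>/<run_id>.json into dst_dir and return the new path."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src_dir / f"{run_id}.json", dst_dir / f"{run_id}.json"))


# Demo: copy a made-up run file and confirm it still parses at the destination.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "src", Path(tmp) / "dst"
    src.mkdir()
    (src / "run_demo.json").write_text(json.dumps({"run_id": "run_demo"}), encoding="utf-8")
    copied = copy_run("run_demo", src, dst)
    copied_id = json.loads(copied.read_text(encoding="utf-8"))["run_id"]
print(copied_id)
```

After the copy, passing the destination directory as --results-dir should be enough for the report commands to find the run.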