Run storage

The on-disk shape of a run: one JSON file per run under `evaluar/results/`.

Every run that completes writes one JSON file, whether it was triggered from `evaluar eval_file.py`, `evaluar test`, `python eval_file.py`, or `EvaluarSuite.run(save=True)`. This page documents where that file lands and what it contains.

Where runs live

By default, a run is written to `evaluar/results/<run_id>.json`. The path is relative to the project, not to the user's home directory.

Override with --results-dir on any CLI command, or by passing results_dir= to suite(...) in Python:

evaluar eval_layout_detector.py --results-dir /tmp/runs
evaluar report show <run_id>      --results-dir /tmp/runs
evaluar report list               --results-dir /tmp/runs
evaluar report export <run_id>    --results-dir /tmp/runs --out run.json

from evaluar.api import suite
s = suite(sample_ids=[...], results_dir="/tmp/runs")

Run storage is intentionally compact: a run is one file, not a directory tree.
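As a minimal sketch of the path rule above, the helper below builds the expected location of a run file. `run_file_path` and `DEFAULT_RESULTS_DIR` are hypothetical names used for illustration; they are not part of the evaluar API.

```python
from __future__ import annotations

from pathlib import Path

# Default location described above (project-relative).
DEFAULT_RESULTS_DIR = Path("evaluar/results")


def run_file_path(run_id: str, results_dir: str | Path | None = None) -> Path:
    """Hypothetical helper: expected path of the JSON file for one run."""
    root = Path(results_dir) if results_dir is not None else DEFAULT_RESULTS_DIR
    return root / f"{run_id}.json"


# Default directory, then an override equivalent to --results-dir /tmp/runs.
print(run_file_path("run_2026-05-07_14-22"))
print(run_file_path("run_2026-05-07_14-22", "/tmp/runs"))
```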

File layout

The JSON is the serialized form of RunnerResult (src/evaluar/pipelines/runner.py). Top-level keys:

{
  "run_id": "run_2026-05-07_14-22",
  "elapsed_seconds": 42.5,
  "failed_pipelines": [],
  "rollup_scorecard": { /* RollupScorecard */ },
  "pipeline_results": {
    "<model_id>": { /* PipelineResult */ }
  },
  "metadata": { /* suite metadata snapshot */ }
}
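A quick way to sanity-check a file against the top-level keys listed above is to parse it with the standard library and compare key sets. This is only a sketch: `EXPECTED_KEYS` and `missing_top_level_keys` are illustrative names, and the inline JSON stands in for a real run file.

```python
import json

# Top-level keys taken from the layout shown above.
EXPECTED_KEYS = {
    "run_id",
    "elapsed_seconds",
    "failed_pipelines",
    "rollup_scorecard",
    "pipeline_results",
    "metadata",
}


def missing_top_level_keys(run: dict) -> set:
    """Hypothetical check: which documented top-level keys are absent."""
    return EXPECTED_KEYS - set(run)


# Stand-in for the contents of a saved run file.
raw = """
{
  "run_id": "run_2026-05-07_14-22",
  "elapsed_seconds": 42.5,
  "failed_pipelines": [],
  "rollup_scorecard": {},
  "pipeline_results": {},
  "metadata": {}
}
"""
run = json.loads(raw)
print(sorted(missing_top_level_keys(run)))
```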

rollup_scorecard

The headline verdict for the run.

{
  "run_id": "...",
  "model_id": "<rollup>",
  "task_type": "rollup",
  "metrics": { "pass_rate": 0.83, ... },
  "thresholds": { "pass_rate": { "pass_floor": 0.80, "warn_floor": 0.60 } },
  "verdict": "pass",
  "errors": [],
  "artifacts": {}
}
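The example above pairs a `pass_rate` of 0.83 with a `pass_floor` of 0.80 and a verdict of `"pass"`. One plausible reading of the two floors is sketched below. Everything beyond that single data point is an assumption: the inclusive comparisons and the `"warn"` and `"fail"` labels are guesses for illustration, not behaviour confirmed by evaluar.

```python
def verdict_for(value: float, pass_floor: float, warn_floor: float) -> str:
    """Hypothetical mapping from a metric value to a verdict.

    Assumes inclusive floors and the labels "warn" / "fail"; only "pass"
    appears in the documented example.
    """
    if value >= pass_floor:
        return "pass"
    if value >= warn_floor:
        return "warn"
    return "fail"


# Values from the rollup_scorecard example above.
print(verdict_for(0.83, pass_floor=0.80, warn_floor=0.60))
```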

pipeline_results

One entry per pipeline added to the suite.

{
  "<model_id>": {
    "run_id": "...",
    "elapsed_seconds": 12.3,
    "failed_samples": [],
    "final_scorecard": { /* same shape as rollup_scorecard, scoped to this pipeline */ },
    "per_sample_scorecards": [
      {
        "sample_id": "...",
        "metrics": { "map": 0.72, ... },
        "thresholds": { ... },
        "verdict": "pass",
        "errors": [],
        "artifacts": {}
      }
    ]
  }
}

per_sample_scorecards is what the failure inspector and bbox editor read.
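Because the per-sample scorecards are plain JSON, a script can walk them without any evaluar imports. The sketch below collects every sample whose verdict is not `"pass"`. `non_passing_samples` is a hypothetical helper, and the `"fail"` verdict in the sample data is an assumed label.

```python
def non_passing_samples(run: dict) -> list:
    """Hypothetical helper: (model_id, sample_id, verdict) for non-passing samples."""
    rows = []
    for model_id, pipeline in run.get("pipeline_results", {}).items():
        for card in pipeline.get("per_sample_scorecards", []):
            if card.get("verdict") != "pass":
                rows.append((model_id, card.get("sample_id"), card.get("verdict")))
    return rows


# Stand-in for a parsed run file; "fail" is an assumed verdict label.
example_run = {
    "pipeline_results": {
        "my_model": {
            "per_sample_scorecards": [
                {"sample_id": "sample_001", "verdict": "pass"},
                {"sample_id": "sample_002", "verdict": "fail"},
            ]
        }
    }
}
print(non_passing_samples(example_run))
```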

metadata

Suite metadata snapshot, written by EvaluarSuite.metadata():

{
  "source": "CLI",                          // or "API"
  "suite_name": "my_eval",
  "definition_path": "eval_layout_detector.py",
  "sample_ids": ["sample_001", "sample_002"],
  "pipeline_ids": ["my_model"]
}

This is the same metadata the e2e test asserts on (tests/e2e/test_code_first_detection_spike.py).

Run ids

If you don't pin a run id with PipelineBuilder.run_id(...) or suite(run_id=...), the runner generates one in the form run_<timestamp>_<short>. Run ids are stable identifiers — the same one round-trips through evaluar report show <run_id>, the JSON filename, and the run metadata.
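Since the run id appears both in the filename and inside the JSON, a small consistency check can catch files that were renamed by hand. The sketch below uses only the standard library; `run_id_matches_filename` is a hypothetical helper, and the temporary file stands in for a real saved run.

```python
import json
import tempfile
from pathlib import Path


def run_id_matches_filename(path: Path) -> bool:
    """Hypothetical check: filename (minus .json) equals the stored run_id."""
    run = json.loads(path.read_text(encoding="utf-8"))
    return run.get("run_id") == path.stem


# Demo against a throwaway file rather than a real results directory.
with tempfile.TemporaryDirectory() as tmp:
    run_file = Path(tmp) / "run_2026-05-07_14-22.json"
    run_file.write_text(json.dumps({"run_id": "run_2026-05-07_14-22"}), encoding="utf-8")
    print(run_id_matches_filename(run_file))
```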

TUI preferences

The TUI writes preferences to a separate .evaluar/tui.yaml file next to the configured results root (src/evaluar/cli/project.py:TuiPreferences). For the default evaluar/results/ directory, that path is evaluar/.evaluar/tui.yaml. This file is not part of run storage and is safe to delete.

Working with saved runs

Saved runs are plain JSON. Use evaluar report export when you want a named output file for CI, review, or handoff:

# Full saved run
evaluar report export <run_id> --format json --out run.json

# Compact summary for CI or dashboards
evaluar report export <run_id> --format summary-json --out summary.json

# Flat scalar metrics for spreadsheets
evaluar report export <run_id> --format csv --out metrics.csv

# Markdown summary for review notes or CI artifacts
evaluar report export <run_id> --format markdown --out summary.md

# Zip bundle for handoff
evaluar report archive <run_id> --out run.zip

# Clean up saved runs
evaluar report delete <run_id> --yes
evaluar report clear --yes
evaluar report prune --keep 20 --yes

Standard JSON tools still work directly on the result files:

jq '.rollup_scorecard.verdict' evaluar/results/run_*.json

For programmatic access, the Pydantic model that produced the file can also load it: RunnerResult.model_validate_json(path.read_text()).
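When importing the Pydantic model is not convenient, the standard library is enough for simple summaries. The sketch below is a rough Python equivalent of the `jq` one-liner: it maps each run id to its rollup verdict. `rollup_verdicts` is a hypothetical helper, and the demo writes throwaway files (with an assumed `"fail"` label) instead of reading a real results directory.

```python
import json
import tempfile
from pathlib import Path


def rollup_verdicts(results_dir) -> dict:
    """Hypothetical helper: run_id -> rollup verdict for every JSON file found."""
    verdicts = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        run = json.loads(path.read_text(encoding="utf-8"))
        verdicts[run["run_id"]] = run["rollup_scorecard"]["verdict"]
    return verdicts


# Demo with two minimal stand-in run files.
with tempfile.TemporaryDirectory() as tmp:
    for run_id, verdict in [("run_a", "pass"), ("run_b", "fail")]:
        payload = {"run_id": run_id, "rollup_scorecard": {"verdict": verdict}}
        (Path(tmp) / f"{run_id}.json").write_text(json.dumps(payload), encoding="utf-8")
    print(rollup_verdicts(tmp))
```

Pointing `rollup_verdicts` at `evaluar/results` (or whatever `--results-dir` was used) would give the same summary for real runs, assuming every `.json` file in that directory is a saved run.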

Move / copy semantics

A run is fully self-contained in its JSON. Copying evaluar/results/<id>.json to another machine and pointing --results-dir at the destination is sufficient to open the run there.

The exception is anything the bbox editor needs (the source images themselves) — those are referenced by path in your manifest's inputs and need to be reachable independently. Run storage carries metrics, scorecards, and metadata, not source media.
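Copying a run therefore comes down to copying a single file. A minimal sketch using `shutil` follows; `copy_run` is a hypothetical helper, not an evaluar command, and it deliberately does nothing about source images, which have to be made reachable separately.

```python
import json
import shutil
import tempfile
from pathlib import Path


def copy_run(run_id: str, src_dir, dst_dir) -> Path:
    """Hypothetical helper: copy <run_id>.json into another results directory."""
    src = Path(src_dir) / f"{run_id}.json"
    dst_root = Path(dst_dir)
    dst_root.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves file metadata such as modification time.
    return Path(shutil.copy2(src, dst_root / src.name))


# Demo with temporary directories standing in for two results roots.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "run_a.json").write_text(json.dumps({"run_id": "run_a"}), encoding="utf-8")
    copied = copy_run("run_a", src, Path(dst) / "runs")
    print(copied.name, copied.exists())
```

After the copy, passing the destination directory as `--results-dir` should be enough to open the run, as described above.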
