Evaluar Reports

List, open, export, clean up, archive, and compare saved runs from the CLI with the `evaluar report` subcommands.

Every Evaluar run writes a single JSON file at `<results-dir>/<run_id>.json` (default `evaluar/results/`; see Run storage). The `evaluar report` subcommands browse, open, export, clean up, archive, and diff those files from the shell.
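Because each run is one JSON file in the results directory, the files can also be located from a script. The sketch below is illustrative only: `latest_run_file` is a hypothetical helper, and treating "most recent" as the latest file modification time is an assumption, not necessarily how Evaluar orders runs.

```python
from pathlib import Path
from typing import Optional


def latest_run_file(results_dir: str = "evaluar/results") -> Optional[Path]:
    """Return the newest *.json file in results_dir, or None if there are none."""
    candidates = list(Path(results_dir).glob("*.json"))
    if not candidates:
        return None
    # Assumption: "most recent" means the latest modification time on disk.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

The file stem of the returned path (`path.stem`) is the `run_id` under the naming scheme described above.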

`evaluar report list`

Lists recent runs as a table.

evaluar report list
evaluar report list --limit 20
evaluar report list --results-dir /custom/path

Flag	Default	Purpose
`--results-dir`	`evaluar/results`	Where to look for runs.
`--limit` / `-n`	`10`	Maximum number of runs to display.

Implementation: src/evaluar/cli/commands/report.py.

`evaluar report show [run_id]`

Opens a saved run. Without a run_id, opens the most recent run.

evaluar report show
evaluar report show <run_id>
evaluar report show <run_id> --model my_model
evaluar report show <run_id> --headless
evaluar report show <run_id> --json

Flag	Default	Purpose
`--model` / `-m`	(none)	Show detail for a specific pipeline only.
`--results-dir`	`evaluar/results`	Where the run was saved.
`--headless`	`false`	Print to stdout instead of launching the TUI.
`--json`	`false`	Print the saved JSON to stdout (implies `--headless`).

Implementation: src/evaluar/cli/commands/report.py.

By default, this opens the results view for the run in the TUI. With `--headless`, it prints a summary table to stdout instead, which is useful for quick checks over SSH.

`evaluar report export [run_id]`

Writes a saved run to a file. Without a run_id, exports the most recent run.

evaluar report export <run_id> --format json --out run.json
evaluar report export <run_id> --format summary-json --out summary.json
evaluar report export <run_id> --format csv --out metrics.csv
evaluar report export <run_id> --format markdown --out summary.md
evaluar report export --out latest.json
evaluar report export <run_id> --model layout_detector --out layout.json

Flag	Default	Purpose
`--format` / `-f`	`json`	Export the full saved run (`json`), a compact summary (`summary-json`), flat metric rows (`csv`), or Markdown (`markdown`).
`--out` / `-o`	required	File to write. Parent directories are created when needed.
`--model` / `-m`	(none)	Export a specific pipeline result. With `summary-json`, keeps only that pipeline in the summary.
`--results-dir`	`evaluar/results`	Where the run was saved.

The `json` format preserves the saved run shape. The `summary-json` format keeps the rollup verdict, weighted score, failed pipelines, metadata, and one compact entry per pipeline. The `csv` format writes one row per scalar metric with `run_id`, `scope`, `model_id`, `sample_id`, `task_type`, `verdict`, `metric`, and `value` columns. The `markdown` format writes a compact run summary for CI artifacts, pull request comments, or handoff notes.
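The flat CSV layout is convenient for downstream scripts. As a rough sketch, the hypothetical helper below groups exported values by pipeline and metric. It assumes the file starts with a header row using the column names listed above and that non-numeric `value` cells can simply be skipped; neither detail is confirmed by this page.

```python
import csv
from collections import defaultdict
from typing import Dict, List, Tuple


def values_by_metric(csv_path: str) -> Dict[Tuple[str, str], List[float]]:
    """Group numeric values by (model_id, metric) from an exported CSV."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Assumption: the first row is a header with the documented column names.
        for row in csv.DictReader(handle):
            try:
                value = float(row["value"])
            except (TypeError, ValueError):
                continue  # skip cells that are empty or not numeric
            grouped[(row["model_id"], row["metric"])].append(value)
    return dict(grouped)
```

For example, `values_by_metric("metrics.csv")` could be called on the file written by the `--format csv` command shown earlier.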

Implementation: src/evaluar/cli/commands/report.py.

`evaluar report delete <run_id>`

Deletes one saved run JSON file.

evaluar report delete <run_id>
evaluar report delete <run_id> --yes
evaluar report delete <run_id> --results-dir /custom/path

Flag	Default	Purpose
`--results-dir`	`evaluar/results`	Where the run was saved.
`--yes` / `-y`	`false`	Delete without an interactive confirmation prompt.

Implementation: src/evaluar/cli/commands/report.py.

`evaluar report clear`

Deletes all saved run JSON files in a results directory.

evaluar report clear
evaluar report clear --yes
evaluar report clear --results-dir /custom/path

Flag	Default	Purpose
`--results-dir`	`evaluar/results`	Directory to clear.
`--yes` / `-y`	`false`	Clear without an interactive confirmation prompt.

Implementation: src/evaluar/cli/commands/report.py.

`evaluar report prune`

Deletes saved run JSON files by explicit retention criteria.

evaluar report prune --keep 20
evaluar report prune --older-than 30d
evaluar report prune --keep 20 --older-than 30d --yes

Flag	Default	Purpose
`--keep`	(none)	Keep the newest N saved runs and prune older files.
`--older-than`	(none)	Prune runs older than a duration such as `30d`, `12h`, `90m`, or `3600s`.
`--results-dir`	`evaluar/results`	Directory to prune.
`--yes` / `-y`	`false`	Prune without an interactive confirmation prompt.

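When both `--keep` and `--older-than` are passed, Evaluar deletes the union of the matching saved run files: a run is pruned if it falls outside the newest N or if it is older than the given duration. For example, `--keep 20 --older-than 30d` also removes a run that is among the newest 20 when it is more than 30 days old.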
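The union rule can be sketched in a few lines. This is not Evaluar's implementation: `parse_duration` and `select_for_prune` are hypothetical names, and the sketch assumes each run has a single timestamp to compare against and that durations use only the `s`, `m`, `h`, and `d` suffixes shown in the table.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, Optional, Set

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '30d', '12h', '90m', or '3600s'."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def select_for_prune(
    runs: Dict[str, datetime],
    now: datetime,
    keep: Optional[int] = None,
    older_than: Optional[str] = None,
) -> Set[str]:
    """Return the run ids matched by either criterion (the union)."""
    doomed: Set[str] = set()
    if keep is not None:
        # Everything after the newest `keep` runs is a candidate.
        newest_first = sorted(runs, key=runs.get, reverse=True)
        doomed.update(newest_first[keep:])
    if older_than is not None:
        # Any run with a timestamp before the cutoff is a candidate.
        cutoff = now - parse_duration(older_than)
        doomed.update(rid for rid, ts in runs.items() if ts < cutoff)
    return doomed
```

With both criteria set, a run only survives if it is among the newest `keep` runs and is also newer than the cutoff.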

Implementation: src/evaluar/cli/commands/report.py.

`evaluar report archive [run_id]`

Creates a local zip archive for a saved run. Without a run_id, archives the most recent run.

evaluar report archive <run_id> --out run.zip
evaluar report archive <run_id> --include-local-artifacts --out run.zip

Flag	Default	Purpose
`--out` / `-o`	required	Zip file to write.
`--include-local-artifacts`	`false`	Include existing local artifact files referenced by scorecards, such as `gt_path` or cached image paths.
`--results-dir`	`evaluar/results`	Where the run was saved.

Archives always include `<run_id>.json` and an `archive_manifest.json`. Remote URLs and missing local files are recorded as skipped instead of downloaded.
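A consumer of an archive might check for those two members before using it. The sketch below uses only the standard `zipfile` module. `inspect_archive` is a hypothetical helper; it assumes both files sit at the root of the zip, and it makes no assumption about the manifest's keys, which this page does not describe.

```python
import json
import zipfile
from typing import Any, List, Tuple


def inspect_archive(zip_path: str, run_id: str) -> Tuple[List[str], Any]:
    """Return the member names and the parsed manifest of a run archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        # Assumption: both required files are stored at the archive root.
        for required in (f"{run_id}.json", "archive_manifest.json"):
            if required not in names:
                raise ValueError(f"archive is missing {required}")
        manifest = json.loads(archive.read("archive_manifest.json"))
    return names, manifest
```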

Implementation: src/evaluar/cli/commands/report.py.

`evaluar report compare <run_a> <run_b>`

Opens the compare view for two saved runs.

evaluar report compare <run_a> <run_b>
evaluar report compare <run_a> <run_b> --headless
evaluar report compare <run_a> <run_b> --json --out compare.json
evaluar report compare <run_a> <run_b> --format markdown --out compare.md

Flag	Default	Purpose
`--results-dir`	`evaluar/results`	Where the runs were saved.
`--headless`	`false`	Print the comparison to stdout.
`--json`	`false`	Print structured scalar metric deltas as JSON.
`--format` / `-f`	`table`	Headless comparison format: `table` or `markdown`.
`--out` / `-o`	(none)	Optional output path for JSON or Markdown comparison output.

Structured compare output includes changes to the rollup verdict and weighted score, plus per-pipeline scalar metric deltas. Nested metric blocks are not part of the comparison; they remain in the saved run JSON.
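To make "scalar metric deltas" concrete, here is a minimal sketch that diffs two saved runs already loaded as dictionaries. It is not the format `--json` emits. It assumes each `final_scorecard` carries a `metrics` mapping like the rollup scorecard does, and that the delta is computed as `run_b` minus `run_a`; `scalar_deltas` is a hypothetical name.

```python
from numbers import Real
from typing import Any, Dict


def _is_scalar(value: Any) -> bool:
    """True for plain numbers; booleans and nested blocks are excluded."""
    return isinstance(value, Real) and not isinstance(value, bool)


def scalar_deltas(run_a: Dict[str, Any], run_b: Dict[str, Any]) -> Dict[str, Dict[str, float]]:
    """Compute b - a for scalar metrics present in both runs, per pipeline."""
    deltas: Dict[str, Dict[str, float]] = {}
    pipelines_a = run_a.get("pipeline_results", {})
    pipelines_b = run_b.get("pipeline_results", {})
    for model_id in sorted(set(pipelines_a) & set(pipelines_b)):
        # Assumption: final_scorecard has a "metrics" dict, as rollup_scorecard does.
        metrics_a = pipelines_a[model_id].get("final_scorecard", {}).get("metrics", {})
        metrics_b = pipelines_b[model_id].get("final_scorecard", {}).get("metrics", {})
        row: Dict[str, float] = {}
        for name in sorted(set(metrics_a) & set(metrics_b)):
            a, b = metrics_a[name], metrics_b[name]
            if _is_scalar(a) and _is_scalar(b):
                row[name] = b - a  # nested blocks never reach this line
        deltas[model_id] = row
    return deltas
```

Pipelines or metrics that appear in only one of the two runs are left out of this sketch.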

Implementation: src/evaluar/cli/commands/report.py.

What the saved file looks like

Every run is a single JSON file. Top-level shape (mirrors RunnerResult):

{
  "run_id": "...",
  "elapsed_seconds": 0.0,
  "failed_pipelines": [],
  "rollup_scorecard": { "verdict": "pass", "metrics": {}, "thresholds": {}, ... },
  "pipeline_results": {
    "<model_id>": {
      "final_scorecard": { ... },
      "per_sample_scorecards": [ ... ],
      "failed_samples": []
    }
  },
  "metadata": {
    "source": "CLI",
    "suite_name": "...",
    "definition_path": "eval_layout_detector.py",
    "sample_ids": [...],
    "pipeline_ids": [...]
  }
}

See Run storage for the full schema.
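Since the file is plain JSON, it can be read without Evaluar installed. The sketch below pulls a few headline fields out of a saved run using only the keys shown in the shape above. `load_run` and `summarize_run` are hypothetical helpers, and the fields inside the elided scorecards are deliberately left untouched.

```python
import json
from typing import Any, Dict


def load_run(path: str) -> Dict[str, Any]:
    """Read a saved run file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_run(run: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the rollup verdict and failure counts from a loaded run."""
    rollup = run.get("rollup_scorecard", {})
    return {
        "run_id": run["run_id"],
        "verdict": rollup.get("verdict"),
        "failed_pipelines": list(run.get("failed_pipelines", [])),
        # Count failed samples per pipeline, keyed by model id.
        "failed_sample_counts": {
            model_id: len(result.get("failed_samples", []))
            for model_id, result in run.get("pipeline_results", {}).items()
        },
    }
```

A typical call would be `summarize_run(load_run("evaluar/results/<run_id>.json"))`, with the placeholder replaced by a real run id.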
