Reports
Listing, opening, exporting, cleaning up, archiving, and comparing saved runs from the CLI with the `evaluar report` subcommands.
Every Evaluar run writes a single JSON file at <results-dir>/<run_id>.json (default evaluar/results/; see Run storage). The evaluar report subcommands list, open, export, delete, prune, archive, and compare those files from the shell.
evaluar report list
Lists recent runs as a table.
evaluar report list
evaluar report list --limit 20
evaluar report list --results-dir /custom/path

| Flag | Default | Purpose |
|---|---|---|
| --results-dir | evaluar/results | Where to look for runs. |
| --limit / -n | 10 | Maximum number of runs to display. |
Implementation: src/evaluar/cli/commands/report.py.
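Because each run is stored as `<run_id>.json`, a script can produce a rough equivalent of the listing without the CLI. The following is a minimal sketch, not Evaluar's implementation: `recent_run_ids` is a hypothetical helper, and ordering by file modification time is an assumption about what "recent" means.

```python
from pathlib import Path
from typing import List


def recent_run_ids(results_dir: str = "evaluar/results", limit: int = 10) -> List[str]:
    """Return up to `limit` run IDs, newest first.

    Assumption: "newest" is judged by file modification time, which may
    differ from how the CLI orders runs.
    """
    run_files = sorted(
        Path(results_dir).glob("*.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    # Each run is saved as <run_id>.json, so the file stem is the run ID.
    return [p.stem for p in run_files[:limit]]
```

The defaults mirror the `--results-dir` and `--limit` defaults in the table above.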
evaluar report show [run_id]
Opens a saved run. Without a run_id, opens the most recent run.
evaluar report show
evaluar report show <run_id>
evaluar report show <run_id> --model my_model
evaluar report show <run_id> --headless
evaluar report show <run_id> --json

| Flag | Default | Purpose |
|---|---|---|
| --model / -m | (none) | Show detail for a specific pipeline only. |
| --results-dir | evaluar/results | Where the run was saved. |
| --headless | false | Print to stdout instead of launching the TUI. |
| --json | false | Print the saved JSON to stdout (implies --headless). |
Implementation: src/evaluar/cli/commands/report.py.
By default this opens the results view for the run. With --headless, it prints a summary table instead — useful for quick checks over SSH.
evaluar report export [run_id]
Writes a saved run to a file. Without a run_id, exports the most recent run.
evaluar report export <run_id> --format json --out run.json
evaluar report export <run_id> --format summary-json --out summary.json
evaluar report export <run_id> --format csv --out metrics.csv
evaluar report export <run_id> --format markdown --out summary.md
evaluar report export --out latest.json
evaluar report export <run_id> --model layout_detector --out layout.json

| Flag | Default | Purpose |
|---|---|---|
| --format / -f | json | Export the full saved run (json), a compact summary (summary-json), flat metric rows (csv), or Markdown (markdown). |
| --out / -o | required | File to write. Parent directories are created when needed. |
| --model / -m | (none) | Export a specific pipeline result. With summary-json, keeps only that pipeline in the summary. |
| --results-dir | evaluar/results | Where the run was saved. |
The json format preserves the saved run shape. The summary-json format keeps the rollup verdict, weighted score, failed pipelines, metadata, and one compact entry per pipeline. The csv format writes one row per scalar metric with run_id, scope, model_id, sample_id, task_type, verdict, metric, and value columns. The markdown format writes a compact run summary for CI artifacts, pull request comments, or handoff notes.
Implementation: src/evaluar/cli/commands/report.py.
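The csv export is convenient for downstream scripts because every row names its own run, model, and metric. The sketch below reads such a file with the standard library and groups values by model. `metrics_by_model` is a hypothetical helper. It relies only on the column names listed above, and it assumes that aggregate rows leave `sample_id` empty, which the source does not confirm.

```python
import csv
from collections import defaultdict
from typing import Dict


def metrics_by_model(csv_path: str) -> Dict[str, Dict[str, str]]:
    """Group exported metric rows as {model_id: {metric: value}}.

    Values are kept as strings because the export's number formatting
    is not specified here.
    """
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumption: per-sample rows carry a sample_id, aggregate rows do not.
            if row["sample_id"]:
                continue
            grouped[row["model_id"]][row["metric"]] = row["value"]
    return dict(grouped)
```

For example, `metrics_by_model("metrics.csv")` could be called on the file written by `evaluar report export <run_id> --format csv --out metrics.csv`.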
evaluar report delete <run_id>
Deletes one saved run JSON file.
evaluar report delete <run_id>
evaluar report delete <run_id> --yes
evaluar report delete <run_id> --results-dir /custom/path

| Flag | Default | Purpose |
|---|---|---|
| --results-dir | evaluar/results | Where the run was saved. |
| --yes / -y | false | Delete without an interactive confirmation prompt. |
Implementation: src/evaluar/cli/commands/report.py.
evaluar report clear
Deletes all saved run JSON files in a results directory.
evaluar report clear
evaluar report clear --yes
evaluar report clear --results-dir /custom/path

| Flag | Default | Purpose |
|---|---|---|
| --results-dir | evaluar/results | Directory to clear. |
| --yes / -y | false | Clear without an interactive confirmation prompt. |
Implementation: src/evaluar/cli/commands/report.py.
evaluar report prune
Deletes saved run JSON files by explicit retention criteria.
evaluar report prune --keep 20
evaluar report prune --older-than 30d
evaluar report prune --keep 20 --older-than 30d --yes

| Flag | Default | Purpose |
|---|---|---|
| --keep | (none) | Keep the newest N saved runs and prune older files. |
| --older-than | (none) | Prune runs older than a duration such as 30d, 12h, 90m, or 3600s. |
| --results-dir | evaluar/results | Directory to prune. |
| --yes / -y | false | Prune without an interactive confirmation prompt. |
When both --keep and --older-than are passed, Evaluar deletes the union of the two matches: a saved run file is pruned if it falls outside the newest N runs or if it is older than the given duration.
Implementation: src/evaluar/cli/commands/report.py.
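The union rule for combined criteria can be illustrated with a short selection function. This is a sketch under stated assumptions, not the code in report.py: `parse_duration` and `select_for_prune` are hypothetical names, run age is taken from file modification time, and only the four duration suffixes shown above (`d`, `h`, `m`, `s`) are handled.

```python
import re
import time
from pathlib import Path
from typing import List, Optional

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text: str) -> int:
    """Convert a duration such as '30d' or '90m' to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def select_for_prune(
    results_dir: str,
    keep: Optional[int] = None,
    older_than: Optional[str] = None,
    now: Optional[float] = None,
) -> List[Path]:
    """Return the run files that either criterion would prune (their union)."""
    now = time.time() if now is None else now
    # Assumption: run age is the file's modification time.
    files = sorted(
        Path(results_dir).glob("*.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    doomed = set()
    if keep is not None:
        doomed.update(files[keep:])  # everything beyond the newest `keep`
    if older_than is not None:
        cutoff = now - parse_duration(older_than)
        doomed.update(p for p in files if p.stat().st_mtime < cutoff)
    return sorted(doomed)
```

The function only selects files and never deletes them, so it can serve as a dry run before invoking `evaluar report prune` with the same criteria.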
evaluar report archive [run_id]
Creates a local zip archive for a saved run. Without a run_id, archives the most recent run.
evaluar report archive <run_id> --out run.zip
evaluar report archive <run_id> --include-local-artifacts --out run.zip

| Flag | Default | Purpose |
|---|---|---|
| --out / -o | required | Zip file to write. |
| --include-local-artifacts | false | Include existing local artifact files referenced by scorecards, such as gt_path or cached image paths. |
| --results-dir | evaluar/results | Where the run was saved. |
Archives always include <run_id>.json and an archive_manifest.json. Remote URLs and missing local files are recorded as skipped instead of downloaded.
Implementation: src/evaluar/cli/commands/report.py.
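A quick way to sanity-check an archive is to confirm that the two guaranteed members are present. The sketch below uses only the standard `zipfile` module. `has_required_members` is a hypothetical helper, and it assumes both files sit at the root of the zip, which the text above does not state.

```python
import zipfile


def has_required_members(zip_path: str, run_id: str) -> bool:
    """Check that the archive holds <run_id>.json and archive_manifest.json.

    Assumption: both files are stored at the archive root.
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
    return {f"{run_id}.json", "archive_manifest.json"} <= names
```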
evaluar report compare <run_a> <run_b>
Opens the compare view for two saved runs.
evaluar report compare <run_a> <run_b>
evaluar report compare <run_a> <run_b> --headless
evaluar report compare <run_a> <run_b> --json --out compare.json
evaluar report compare <run_a> <run_b> --format markdown --out compare.md

| Flag | Default | Purpose |
|---|---|---|
| --results-dir | evaluar/results | Where the runs were saved. |
| --headless | false | Print the comparison to stdout. |
| --json | false | Print structured scalar metric deltas as JSON. |
| --format / -f | table | Headless comparison format: table or markdown. |
| --out / -o | (none) | Optional output path for JSON or Markdown comparison output. |
Structured compare output includes rollup verdict/weighted-score changes and per-pipeline scalar metric deltas. Nested metric blocks remain in the saved run JSON.
Implementation: src/evaluar/cli/commands/report.py.
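To make "per-pipeline scalar metric deltas" concrete, the sketch below computes them from two loaded run dictionaries. It is an illustration rather than the compare command's actual logic. `scalar_deltas` is a hypothetical name, and the code assumes each `final_scorecard` carries a `metrics` mapping like the one shown for `rollup_scorecard`. Nested metric blocks are skipped, matching the note above that they stay in the saved run JSON.

```python
from numbers import Real
from typing import Any, Dict


def _is_scalar(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def scalar_deltas(run_a: Dict[str, Any], run_b: Dict[str, Any]) -> Dict[str, Dict[str, float]]:
    """Return {model_id: {metric: b - a}} for scalar metrics present in both runs."""
    deltas: Dict[str, Dict[str, float]] = {}
    results_b = run_b.get("pipeline_results", {})
    for model_id, result_a in run_a.get("pipeline_results", {}).items():
        result_b = results_b.get(model_id)
        if result_b is None:
            continue  # pipeline missing from run_b, nothing to compare
        # Assumption: final_scorecard has a "metrics" mapping.
        metrics_a = result_a["final_scorecard"].get("metrics", {})
        metrics_b = result_b["final_scorecard"].get("metrics", {})
        per_model = {
            name: metrics_b[name] - value
            for name, value in metrics_a.items()
            if _is_scalar(value) and _is_scalar(metrics_b.get(name))
        }
        if per_model:
            deltas[model_id] = per_model
    return deltas
```

A positive delta means the metric is higher in the second run. Whether higher is better depends on the metric.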
What the saved file looks like
Every run is a single JSON file. Top-level shape (mirrors RunnerResult):
{
  "run_id": "...",
  "elapsed_seconds": 0.0,
  "failed_pipelines": [],
  "rollup_scorecard": { "verdict": "pass", "metrics": {}, "thresholds": {}, ... },
  "pipeline_results": {
    "<model_id>": {
      "final_scorecard": { ... },
      "per_sample_scorecards": [ ... ],
      "failed_samples": []
    }
  },
  "metadata": {
    "source": "CLI",
    "suite_name": "...",
    "definition_path": "eval_layout_detector.py",
    "sample_ids": [...],
    "pipeline_ids": [...]
  }
}

See Run storage for the full schema.
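Since a saved run is plain JSON, it can be inspected without the CLI. The following sketch loads a run and pulls out a few top-level fields. `load_run` and `summarize` are hypothetical helpers, and they touch only keys that appear in the shape shown above.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_run(run_id: str, results_dir: str = "evaluar/results") -> Dict[str, Any]:
    """Read <results_dir>/<run_id>.json into a dictionary."""
    path = Path(results_dir) / f"{run_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))


def summarize(run: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the rollup verdict and per-pipeline sample counts."""
    return {
        "run_id": run["run_id"],
        "verdict": run["rollup_scorecard"]["verdict"],
        "failed_pipelines": list(run["failed_pipelines"]),
        "samples_per_pipeline": {
            model_id: len(result["per_sample_scorecards"])
            for model_id, result in run["pipeline_results"].items()
        },
    }
```

A script might call `summarize(load_run("<run_id>"))` and fail a CI step when the verdict is not `"pass"`.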