Headless / CI
Running Evaluar without the TUI — flags that exist today, the GitHub Actions scaffold, and how to gate a build on saved results.
The Evaluar TUI is a view over runs, not a precondition for them. Every evaluation can be run headlessly, saved to evaluar/results/<run_id>.json, and parsed by your CI.
The headless surface
evaluar eval_layout_detector.py --headless
evaluar eval_layout_detector.py --headless --no-save
evaluar eval_layout_detector.py --headless --summary-file summary.md
evaluar test --headless --fail-fast
Core flags:
| Flag | Default | Purpose |
|---|---|---|
| --samples / -s | (all) | Restrict the run to specific sample ids. |
| --headless | false | Run without launching the TUI; print to stdout. |
| --json | false | Accepted by the CLI, but the saved result file is the supported JSON artifact. |
| --no-save | false | Skip writing evaluar/results/<run_id>.json. |
| --fail-fast | false | Stop evaluar test after the first failing eval file. |
| --summary-file | (none) | Write a Markdown summary for a single eval-file run. |
Where the JSON is
Completed runs are saved as serialized RunnerResult JSON unless --no-save is set. Top-level keys: run_id, elapsed_seconds, failed_pipelines, rollup_scorecard, pipeline_results, metadata. See Run storage for the full schema.
For a single eval-file run, the CLI saves to evaluar/results/<run_id>.json. For evaluar test, each suite saves using that suite's configured results_dir when present, otherwise evaluar/results.
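A CI script usually does not know the run_id in advance. One way to pick up the file a run just wrote is to take the most recently modified JSON file in the results directory. The following is a minimal sketch, assuming the default evaluar/results location and that nothing else writes to that directory at the same time; `latest_result` is a hypothetical helper, not part of Evaluar.

```python
import json
from pathlib import Path


def latest_result(results_dir="evaluar/results"):
    """Return (path, parsed JSON) for the most recently modified result file.

    Hypothetical helper: assumes runs are saved as <run_id>.json directly
    inside results_dir, as described above.
    """
    files = sorted(Path(results_dir).glob("*.json"), key=lambda p: p.stat().st_mtime)
    if not files:
        raise FileNotFoundError(f"no result files in {results_dir}")
    path = files[-1]
    with path.open() as f:
        return path, json.load(f)
```

Calling `latest_result()` after a headless run returns the file path together with the parsed top-level keys. For evaluar test with a suite-specific results_dir, pass that directory instead.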
Exit codes
evaluar eval_file.py returns:
- 0: the suite completed and the rollup verdict was pass or warn.
- non-zero: the suite crashed, produced an error, or produced a fail rollup verdict.
evaluar test also exits non-zero when any discovered eval file fails or errors. This behavior comes directly from src/evaluar/cli/commands/run.py.
The built-in gate is the suite rollup verdict. For custom metric policies, add a separate step that reads the saved JSON result.
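If the run is launched from a Python driver script rather than directly from the shell, the same contract applies: the process return code is the gate. Below is a minimal sketch using the standard library; `run_and_gate` is a hypothetical helper, and the command is passed in as a list so nothing about the Evaluar CLI is assumed beyond the flags listed above.

```python
import subprocess
import sys


def run_and_gate(cmd):
    """Run a command and return its exit code, reporting failures on stderr.

    Hypothetical helper. For Evaluar, a non-zero code means the suite
    crashed, errored, or produced a fail rollup verdict.
    """
    completed = subprocess.run(cmd)
    if completed.returncode != 0:
        print(f"{cmd[0]} exited with {completed.returncode}", file=sys.stderr)
    return completed.returncode
```

A driver could then end with `sys.exit(run_and_gate(["evaluar", "eval_layout_detector.py", "--headless"]))` so that the CI job inherits the verdict.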
Gating a build on a metric
The supported pattern for custom gating is two steps: run, then inspect the saved JSON.
evaluar eval_layout_detector.py --headless
jq -e '.rollup_scorecard.verdict == "pass"' evaluar/results/<run_id>.json > /dev/null
Or, in Python:
import json
import sys

# Load the saved run and fail the build unless the rollup verdict is "pass".
with open("evaluar/results/<run_id>.json") as f:
    result = json.load(f)
if result["rollup_scorecard"]["verdict"] != "pass":
    sys.exit(1)
Pick whichever field you actually want to gate on: rollup_scorecard.verdict, failed_pipelines, or a per-pipeline metric such as pipeline_results["layout_detector"]["final_scorecard"]["metrics"]["map_50"]. The JSON shape is documented in Run storage.
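Gating on a per-pipeline metric follows the same shape as the verdict check. The sketch below assumes the metric is stored as a plain number at the path quoted above; `metric_gate` and `main` are hypothetical names, and the 0.5 threshold is a placeholder rather than a recommended value.

```python
import json


def metric_gate(result, pipeline, metric, minimum):
    """Return True if the named per-pipeline metric is at least `minimum`.

    Assumes the value at
    pipeline_results[pipeline]["final_scorecard"]["metrics"][metric]
    is a number. A missing pipeline or metric counts as a failure.
    """
    try:
        value = result["pipeline_results"][pipeline]["final_scorecard"]["metrics"][metric]
    except KeyError:
        return False
    return value >= minimum


def main(path, minimum=0.5):
    """Return a process exit code (0 or 1) for the saved run at `path`."""
    # The 0.5 default is a placeholder; choose a threshold that suits your task.
    with open(path) as f:
        result = json.load(f)
    return 0 if metric_gate(result, "layout_detector", "map_50", minimum) else 1
```

In CI this would typically be wired up as `sys.exit(main(path_to_result))`. Treating a missing metric as a failure is a deliberate choice here: a renamed pipeline or metric then breaks the build instead of silently passing.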
For a named artifact instead of shell redirection, export the saved run:
evaluar report export <run_id> --format summary-json --out summary.json
evaluar report export <run_id> --format markdown --out summary.md
For a single eval-file run, you can write the Markdown summary during execution:
evaluar eval_layout_detector.py --headless --summary-file summary.md
GitHub Actions
evaluar init <task> --github-actions writes the canonical workflow at .github/workflows/evaluar.yml (src/evaluar/cli/commands/init.py:336).
For private repository installs, create an EVALUAR_REPO_TOKEN secret with read access to the Evaluar repository. The generated workflow checks out the source with that token, then installs the local checkout:
name: Evaluar
on:
  push:
    branches: [main]
  pull_request:
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          path: project
      - uses: actions/checkout@v4
        with:
          repository: Koiiichi/evaluar
          token: ${{ secrets.EVALUAR_REPO_TOKEN }}
          path: evaluar-src
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install evaluar
        run: pip install ./evaluar-src
      - name: Run evaluations
        working-directory: project
        run: evaluar test --headless
That workflow already fails on a fail or error rollup. To gate on a custom metric, add a step after the run that reads the saved JSON:
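      - name: Run evaluations
        working-directory: project
        run: evaluar test --headless
      - name: Gate on rollup verdict
        working-directory: project
        # -s slurps every result file into one array so that all of them are
        # checked; without it, jq -e reflects only the last file's output.
        run: jq -e -s 'all(.[]; .rollup_scorecard.verdict == "pass")' evaluar/results/*.json
The generated workflow is the supported GitHub Actions integration.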
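When jq is not available on the runner, or the job should report which runs failed, the same check can be written in Python. This is a minimal sketch, assuming every suite saves to the default evaluar/results directory; `non_passing_runs` is a hypothetical helper, not part of Evaluar.

```python
import json
from pathlib import Path


def non_passing_runs(results_dir="evaluar/results"):
    """Return {file name: verdict} for every saved run whose rollup verdict is not "pass"."""
    failures = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open() as f:
            verdict = json.load(f)["rollup_scorecard"]["verdict"]
        if verdict != "pass":
            failures[path.name] = verdict
    return failures
```

A gate step can print the returned mapping and exit with status 1 when it is non-empty. Suites that configure their own results_dir are not covered by the default argument and would need to be checked separately.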
Reproducibility
Headless runs write the same JSON file that interactive runs do. Copy that file (or the whole evaluar/results/ directory) to another machine and evaluar report show <run_id> will open it identically — same scorecards, same metadata.
Source images, however, must be reachable wherever you open the run; ground-truth files are paths in your manifest, not blobs in the saved JSON. See Run storage for what is and isn't self-contained.
Caching and parallelism
For repeated local iteration against the same model outputs, use fixtures instead of a cross-run prediction cache:
- Use the FixtureConnector (see Connectors) when iterating on scorers; fixtures are a snapshot of model output and re-run instantly.
- Switch back to HttpConnector / CallableConnector when you actually need fresh predictions.
Inside a single pipeline run, BasePipeline processes samples concurrently, up to its configured max_concurrent_samples limit. The default limit in the pipeline config is 10; keep connector callables thread-safe and rate-limit remote services on the connector side when needed.
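One way to rate-limit on the connector side is to wrap the prediction callable before handing it to a connector. The sketch below is a generic thread-safe wrapper built on the standard library; `RateLimited` is a hypothetical class, and how a connector accepts the wrapped callable is not shown here because it depends on the connector's own API.

```python
import threading
import time


class RateLimited:
    """Wrap a callable so concurrent callers start at most one call per `min_interval` seconds."""

    def __init__(self, func, min_interval):
        self._func = func
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def __call__(self, *args, **kwargs):
        # Reserve a start time under the lock, then sleep outside it so
        # other threads can reserve their own slots in the meantime.
        with self._lock:
            start = max(time.monotonic(), self._next_slot)
            self._next_slot = start + self._min_interval
        delay = start - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        return self._func(*args, **kwargs)
```

The lock protects only the slot bookkeeping, so the wrapped callable itself still runs concurrently across threads and must be thread-safe in its own right.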