
Headless / CI

Running Evaluar without the TUI — flags that exist today, the GitHub Actions scaffold, and how to gate a build on saved results.

The Evaluar TUI is a view over runs, not a precondition for them. Every evaluation can be run headlessly, saved to evaluar/results/<run_id>.json, and parsed by your CI.

The headless surface

evaluar eval_layout_detector.py --headless
evaluar eval_layout_detector.py --headless --no-save
evaluar eval_layout_detector.py --headless --summary-file summary.md
evaluar test --headless --fail-fast

Core flags:

Flag             Default   Purpose
--samples / -s   (all)     Restrict the run to specific sample ids.
--headless       false     Run without launching the TUI; print to stdout.
--json           false     Accepted by the CLI, but the saved result file is the supported JSON artifact.
--no-save        false     Skip writing evaluar/results/<run_id>.json.
--fail-fast      false     Stop evaluar test after the first failing eval file.
--summary-file   (none)    Write a Markdown summary for a single eval-file run.

Where the JSON is

Completed runs are saved as serialized RunnerResult JSON unless --no-save is set. Top-level keys: run_id, elapsed_seconds, failed_pipelines, rollup_scorecard, pipeline_results, metadata. See Run storage for the full schema.

For a single eval-file run, the CLI saves to evaluar/results/<run_id>.json. For evaluar test, each suite saves using that suite's configured results_dir when present, otherwise evaluar/results.
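Because the file name depends on the run id, a CI step usually has to find the file before it can read it. The following is a minimal sketch, not part of Evaluar: the helper names latest_result and load_result are hypothetical, and it assumes the most recently modified *.json file in the results directory is the run you just completed.

```python
import json
from pathlib import Path


def latest_result(results_dir="evaluar/results"):
    """Return the most recently modified *.json file in results_dir, or None.

    Assumption: "newest by modification time" identifies the run that just
    finished. This is a convention of this sketch, not an Evaluar guarantee.
    """
    files = sorted(Path(results_dir).glob("*.json"), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None


def load_result(path):
    """Load a saved run from disk and return it as a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For a suite with a configured results_dir, pass that directory instead of the default. Once loaded, sorted(load_result(path)) lists the top-level keys described above.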

Exit codes

evaluar eval_file.py returns:

  • 0 — the suite completed and the rollup verdict was pass or warn.
  • non-zero — the suite crashed, produced an error, or produced a fail rollup verdict.

evaluar test also exits non-zero when any discovered eval file fails or errors. This behavior comes directly from src/evaluar/cli/commands/run.py.

The built-in gate is the suite rollup verdict. For custom metric policies, add a separate step that reads the saved JSON result.
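If you drive Evaluar from a script rather than directly from a CI step, the exit code is all you need for the built-in gate. This sketch uses only the standard library; run_headless, describe, and main are hypothetical helper names, and the command in main is the example eval file used on this page.

```python
import subprocess
import sys


def run_headless(command):
    """Run a command, let its output pass through, and return its exit code."""
    completed = subprocess.run(command, check=False)
    return completed.returncode


def describe(exit_code):
    """Map an exit code onto the two outcomes documented above."""
    if exit_code == 0:
        return "rollup verdict was pass or warn"
    return "crash, error, or fail rollup verdict"


def main():
    # Hypothetical entry point: run one eval file and propagate its exit code.
    code = run_headless(["evaluar", "eval_layout_detector.py", "--headless"])
    print(describe(code), file=sys.stderr)
    return code
```

Calling sys.exit(main()) from a wrapper script keeps the CI job status identical to running the command directly.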

Gating a build on a metric

The supported pattern for custom gating is two steps: run, then inspect the saved JSON.

evaluar eval_layout_detector.py --headless
jq -e '.rollup_scorecard.verdict == "pass"' evaluar/results/<run_id>.json > /dev/null

Or, in Python:

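import json
import sys

# <run_id> is a placeholder for the id of the run you want to gate on.
with open("evaluar/results/<run_id>.json") as f:
    result = json.load(f)

# Fail the build unless the rollup verdict is exactly "pass".
if result["rollup_scorecard"]["verdict"] != "pass":
    sys.exit(1)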

Pick whichever field you actually want to gate on — rollup_scorecard.verdict, failed_pipelines, or a per-pipeline metric such as pipeline_results["layout_detector"]["final_scorecard"]["metrics"]["map_50"]. The JSON shape is documented in Run storage.
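A threshold on a per-pipeline metric could look like the sketch below. The lookup path is the one quoted above. The helper names, the 0.75 threshold in the usage note, and the assumption that the metric is stored as a plain number are illustrative only; check Run storage for the actual value shape.

```python
def metric_value(result, pipeline, metric):
    """Look up pipeline_results[pipeline]["final_scorecard"]["metrics"][metric]."""
    return result["pipeline_results"][pipeline]["final_scorecard"]["metrics"][metric]


def passes_gate(result, pipeline, metric, minimum):
    """Return True when the metric exists, is numeric, and is at least `minimum`.

    Assumption: the metric is saved as a bare int or float. A missing pipeline,
    a missing metric, or a non-numeric value all fail the gate.
    """
    try:
        value = metric_value(result, pipeline, metric)
    except (KeyError, TypeError):
        return False
    return isinstance(value, (int, float)) and value >= minimum
```

With a result dict loaded from the saved JSON, a CI script would exit non-zero when passes_gate(result, "layout_detector", "map_50", 0.75) returns False. Treating a missing metric as a failure is a deliberate choice here: a renamed pipeline then breaks the build instead of silently passing.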

To produce a named artifact without redirecting stdout, export the saved run:

evaluar report export <run_id> --format summary-json --out summary.json
evaluar report export <run_id> --format markdown --out summary.md

For a single eval-file run, you can write the Markdown summary during execution:

evaluar eval_layout_detector.py --headless --summary-file summary.md

GitHub Actions

evaluar init <task> --github-actions writes the canonical workflow to .github/workflows/evaluar.yml (src/evaluar/cli/commands/init.py:336).

For private repository installs, create an EVALUAR_REPO_TOKEN secret with read access to the Evaluar repository. The generated workflow checks out the Evaluar source with that token, then installs from the local checkout:

.github/workflows/evaluar.yml
name: Evaluar
on:
  push:
    branches: [main]
  pull_request:

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          path: project
      - uses: actions/checkout@v4
        with:
          repository: Koiiichi/evaluar
          token: ${{ secrets.EVALUAR_REPO_TOKEN }}
          path: evaluar-src
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install evaluar
        run: pip install ./evaluar-src
      - name: Run evaluations
        working-directory: project
        run: evaluar test --headless

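That workflow already fails on a fail or error rollup. To apply a stricter or custom policy, add a step after the run that reads the saved JSON. The example below also fails the build on a warn verdict, and it assumes every suite saves to the default evaluar/results directory:

- name: Run evaluations
  working-directory: project
  run: evaluar test --headless

- name: Gate on rollup verdict
  working-directory: project
  # -s slurps all result files into one array; without it, jq -e only reflects the last file.
  run: jq -e -s 'all(.[]; .rollup_scorecard.verdict == "pass")' evaluar/results/*.json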

The generated workflow is the supported GitHub Actions integration.
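When evaluar test leaves several result files behind, a short script can report which runs did not pass, which is easier to read in CI logs than a bare jq exit status. This is a sketch under the same assumptions as the rest of this page: results live in evaluar/results, and the directory holds only runs from the current job. The function names are hypothetical.

```python
import json
from pathlib import Path


def non_passing_runs(results_dir="evaluar/results"):
    """Return {file name: verdict} for every saved run whose verdict is not "pass"."""
    failures = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            result = json.load(f)
        # A missing rollup_scorecard or verdict is reported as None.
        verdict = result.get("rollup_scorecard", {}).get("verdict")
        if verdict != "pass":
            failures[path.name] = verdict
    return failures


def main(results_dir="evaluar/results"):
    """Print each non-passing run and return a process exit code."""
    failures = non_passing_runs(results_dir)
    for name, verdict in failures.items():
        print(f"{name}: {verdict}")
    return 1 if failures else 0
```

A workflow step would run this with sys.exit(main()) so that any non-passing run fails the job.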

Reproducibility

Headless runs write the same JSON file that interactive runs do. Copy that file (or the whole evaluar/results/ directory) to another machine and evaluar report show <run_id> will open it identically — same scorecards, same metadata.

Source images, however, must be reachable wherever you open the run; ground-truth files are paths in your manifest, not blobs in the saved JSON. See Run storage for what is and isn't self-contained.

Caching and parallelism

For repeated local iteration against the same model outputs, use fixtures instead of a cross-run prediction cache:

  • Use the FixtureConnector (see Connectors) when iterating on scorers — fixtures are a snapshot of model output and re-run instantly.
  • Switch back to HttpConnector / CallableConnector when you actually need fresh predictions.

Within a single pipeline run, BasePipeline processes samples concurrently, up to its configured max_concurrent_samples limit (default: 10). Keep connector callables thread-safe, and rate-limit remote services on the connector side when needed.
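One way to satisfy both requirements is to wrap the prediction callable before handing it to a connector. The class below is a plain-Python sketch, not an Evaluar API: RateLimited is a hypothetical name, and how a callable is registered with CallableConnector is covered in Connectors, not here.

```python
import threading
import time


class RateLimited:
    """Wrap a callable so calls from many threads start at least `min_interval` seconds apart."""

    def __init__(self, func, min_interval):
        self._func = func
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # monotonic time at which the next call may start

    def __call__(self, *args, **kwargs):
        # Only the scheduling is serialized; waiting threads queue on the lock.
        with self._lock:
            now = time.monotonic()
            wait = self._next_allowed - now
            if wait > 0:
                time.sleep(wait)
                now = time.monotonic()
            self._next_allowed = now + self._min_interval
        # The wrapped call runs outside the lock so slow requests can overlap.
        return self._func(*args, **kwargs)
```

With a limit of 10 concurrent samples, RateLimited(predict, 0.1) would cap a hypothetical predict function at roughly ten request starts per second while still letting in-flight requests overlap. The wrapped function itself must still avoid unsynchronized shared state.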
