Evaluar

Common questions about Evaluar — scope, design choices, and what's intentionally not in the framework today.

What problem does Evaluar solve?

Evaluar evaluates models that produce structured output (bounding boxes, recognized text, extracted tables) against ground truth, with metric-level thresholds and a UI for inspecting individual failures. The opinionated part is the execution model: code-first eval files, threshold-driven verdicts, a small TUI for inspection, and a hand-off to a separate OpenCV process when you need to see predictions on the source image.

Is Evaluar a replacement for Weights & Biases / MLflow?

No. Those are experiment trackers — they record metrics across many runs over time. Evaluar is an evaluation framework: it focuses on a single run and the workflow of inspecting it. The two compose well; the saved JSON (see Run storage) is straightforward to push to a tracker.

Why a TUI instead of a web dashboard?

The TUI runs over SSH wherever the run data already lives, with no port forwarding, no service to host, and no separate database. It's the same trade-off htop, k9s, and lazygit make: keyboard-driven, information-dense, and living next to the artifact it reads.

Can I use Evaluar without the TUI?

Yes. Both `evaluar eval_file.py --headless` and `evaluar test --headless` run without the TUI and save JSON result files that CI can parse. See Headless / CI.
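As a rough sketch of how a CI step might drive the two headless invocations above from Python: the helper names `headless_command` and `run_headless` are hypothetical, and the only things taken from Evaluar are the `evaluar` executable, the `test` subcommand, and the `--headless` flag.

```python
import subprocess
from typing import List, Optional


def headless_command(eval_file: Optional[str] = None) -> List[str]:
    """Build the headless invocation for one eval file, or for `evaluar test`."""
    target = eval_file if eval_file is not None else "test"
    return ["evaluar", target, "--headless"]


def run_headless(eval_file: Optional[str] = None) -> int:
    """Run Evaluar headlessly and return its exit code.

    Assumes the `evaluar` executable is on PATH; check=False leaves the
    decision about a non-zero exit code to the caller.
    """
    completed = subprocess.run(headless_command(eval_file), check=False)
    return completed.returncode
```

A CI script would typically pass the returned code to `sys.exit` so the job fails when the run does.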

What model frameworks does it support?

Any. The model argument is just a callable — wrap PyTorch, an ONNX runtime, a remote API, anything. There's no framework lock-in.
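To illustrate what "just a callable" can look like, here is a minimal sketch that adapts an arbitrary backend to a plain Python callable. Everything in it is an assumption made for illustration: the `CallableModel` adapter, the image-path input, and the box dictionary format are not Evaluar's API. See Python API for the signature Evaluar actually expects.

```python
from typing import Any, Callable, Dict, List

Prediction = Dict[str, Any]


class CallableModel:
    """Hypothetical adapter (not part of Evaluar) that makes any backend callable."""

    def __init__(self, predict_fn: Callable[[str], List[Prediction]]) -> None:
        self._predict_fn = predict_fn

    def __call__(self, image_path: str) -> List[Prediction]:
        # Delegate to whatever was wrapped: PyTorch, an ONNX session, a remote API.
        return self._predict_fn(image_path)


def fake_detector(image_path: str) -> List[Prediction]:
    # Stand-in for real inference; returns one fixed box in an assumed format.
    return [{"label": "text", "bbox": [0, 0, 10, 10], "score": 0.9}]


model = CallableModel(fake_detector)
predictions = model("page_001.png")
```

A bare function such as `fake_detector` is already a callable; the wrapper class only earns its place when the backend needs setup, such as loading weights or opening a session.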

How are runs stored?

A single JSON file per run at evaluar/results/<run_id>.json (configurable via --results-dir). See Run storage for the full schema and export options.
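Because each run is a single JSON file, picking one up from a script needs only the standard library. The sketch below assumes the default results directory named above and makes no assumption about the schema; the helper names `latest_run_file` and `load_run` are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def latest_run_file(results_dir: str = "evaluar/results") -> Optional[Path]:
    """Return the most recently modified run file, or None if there are none."""
    candidates = sorted(
        Path(results_dir).glob("*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def load_run(path: Path) -> Dict[str, Any]:
    """Load one saved run as a plain dict; no particular schema is assumed."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

If you pass `--results-dir` on the command line, pass the same directory to `latest_run_file`.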

Is the bbox editor a replacement for an annotation tool?

No. It exists to make small ground-truth corrections survivable inside the inspection workflow, without forcing a context switch into a full annotator. For greenfield labeling, use a real annotation tool.

Is there a `Pipeline` class?

No. The Python API uses PipelineBuilder (a fluent builder that returns a BasePipeline) and EvaluarSuite (which collects pipelines and runs them). See Python API.

What does the TUI render images with?

It doesn't render images at all. The TUI is a Textual app and is terminal-only. When you press `o` or `v` to look at predictions on the source image, Evaluar launches the bbox editor as a separate Python process that draws in a native OpenCV window. See Bbox editor.

What CLI commands exist?

The main commands are:

evaluar (no args) — launches the TUI.
evaluar init <task> — scaffolds a project.
evaluar eval_file.py — runs one code-first eval file.
evaluar test — discovers and runs all eval_*.py files.
evaluar report list / show / export / delete / clear / compare — works with saved runs.

See Reports and Quick start.

Are there regression gates?

Yes. Eval-file runs and evaluar test exit with a non-zero status when the run rolls up to fail or error. For custom checks, export or read the saved JSON with jq or Python. See Headless / CI.
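A custom check in Python can be as small as comparing saved metric values against your own minimums. The sketch below is illustrative only: the `metrics` key, the metric names, and the `failed_metrics` helper are assumptions, and the real layout of a saved run is described under Run storage.

```python
from typing import List, Mapping


def failed_metrics(
    metrics: Mapping[str, float], minimums: Mapping[str, float]
) -> List[str]:
    """Return the names of metrics that are missing or below their minimum."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append(name)
    return failures


# Hypothetical excerpt of a saved run; the key names are assumed, not Evaluar's schema.
run = {"metrics": {"precision": 0.81, "recall": 0.97}}
minimums = {"precision": 0.75, "recall": 0.98, "f1": 0.5}

failures = failed_metrics(run["metrics"], minimums)
exit_code = 1 if failures else 0  # a CI script would pass this to sys.exit
```

Treating a missing metric as a failure is a deliberate choice here: a gate that silently passes when a metric disappears from the output tends to hide regressions.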

How do I add a custom metric?

For a metric within an existing modality, add it to the relevant scorer (DetectionScorer, OCRScorer, TableScorer) and to that scorer's config schema. For a new modality, subclass BaseScorer. Evaluar keeps this path explicit rather than registering scorers at runtime through decorators.
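The general shape of the subclassing route might look like the sketch below. It does not use Evaluar's real BaseScorer, whose interface is not described on this page: `StandInBaseScorer`, the `score` method, and its arguments are all invented so the example runs on its own, and only the pattern (one explicit subclass per modality) carries over.

```python
from abc import ABC, abstractmethod
from typing import Dict, Sequence


class StandInBaseScorer(ABC):
    """Local stand-in so the sketch runs; NOT Evaluar's real BaseScorer interface."""

    @abstractmethod
    def score(
        self, predictions: Sequence[str], targets: Sequence[str]
    ) -> Dict[str, float]:
        ...


class ExactMatchScorer(StandInBaseScorer):
    """Hypothetical scorer for a new modality: fraction of exact string matches."""

    def score(
        self, predictions: Sequence[str], targets: Sequence[str]
    ) -> Dict[str, float]:
        if len(predictions) != len(targets):
            raise ValueError("predictions and targets must have the same length")
        if not targets:
            return {"exact_match": 0.0}
        hits = sum(p == t for p, t in zip(predictions, targets))
        return {"exact_match": hits / len(targets)}


result = ExactMatchScorer().score(["a", "b", "c", "d"], ["a", "b", "x", "d"])
```

Check the Python API page for the methods and config hooks a real scorer subclass has to provide.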
