FAQ
Common questions about Evaluar — scope, design choices, and what's intentionally not in the framework today.
What problem does Evaluar solve?
Evaluating models that produce structured output — bounding boxes, recognized text, extracted tables — against ground truth, with metric-level thresholds and a UI for looking at individual failures. The opinionated bit is the execution model: code-first eval files, threshold-driven verdicts, a small TUI for inspection, and a hand-off to a separate OpenCV process when you need to see predictions on the source image.
Is Evaluar a replacement for Weights & Biases / MLflow?
No. Those are experiment trackers — they record metrics across many runs over time. Evaluar is an evaluation framework: it focuses on a single run and the workflow of inspecting it. The two compose well; the saved JSON (see Run storage) is straightforward to push to a tracker.
Why a TUI instead of a web dashboard?
The TUI runs over SSH wherever the run data already lives, with no port forwarding, no service to host, and no separate database. It's the same trade-off htop, k9s, and lazygit make: keyboard-driven, information-dense, and running right next to the artifact it reads.
Can I use Evaluar without the TUI?
Yes. evaluar eval_file.py --headless and evaluar test --headless run without the TUI and save JSON result files that CI can parse. See Headless / CI.
What model frameworks does it support?
Any. The model argument is just a callable — wrap PyTorch, an ONNX runtime, a remote API, anything. There's no framework lock-in.
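As a rough illustration of the "just a callable" idea, the sketch below wraps an arbitrary backend behind `__call__`. The class name `WrappedModel`, the `fake_backend` function, and the input and output shapes (raw bytes in, a dict with a `text` key out) are assumptions made for this example; the real contract Evaluar expects from a model is described in Python API and may differ.

```python
from typing import Any, Callable, Dict

# Hypothetical prediction shape, chosen only for this sketch.
Prediction = Dict[str, Any]


class WrappedModel:
    """Adapts any inference backend to a plain callable."""

    def __init__(self, backend: Callable[[bytes], str]) -> None:
        # `backend` stands in for PyTorch, an ONNX runtime, a remote API client, etc.
        self._backend = backend

    def __call__(self, image_bytes: bytes) -> Prediction:
        # Delegate to the backend and normalize its output into a dict.
        return {"text": self._backend(image_bytes)}


def fake_backend(image_bytes: bytes) -> str:
    # Placeholder inference: reports the payload size instead of running a model.
    return f"{len(image_bytes)} bytes"


model = WrappedModel(fake_backend)
```

Because `model` is an ordinary callable, swapping the backend does not change anything on the calling side.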
How are runs stored?
A single JSON file per run at evaluar/results/<run_id>.json (configurable via --results-dir). See Run storage for the full schema and export options.
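Since each run is one JSON file, reading runs back needs only the standard library. The sketch below picks the most recently modified file in the results directory and parses it. The helper name `load_latest_run` is hypothetical, and the code makes no assumption about the keys inside the file; see Run storage for the actual schema.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_latest_run(results_dir: str = "evaluar/results") -> Optional[Any]:
    """Return the parsed JSON of the most recently modified run file, or None."""
    # One file per run, so every *.json in the directory is a candidate.
    run_files = sorted(
        Path(results_dir).glob("*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not run_files:
        return None
    with run_files[-1].open(encoding="utf-8") as handle:
        return json.load(handle)
```

If runs are written to a custom location via --results-dir, pass the same path as `results_dir`.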
Is the bbox editor a replacement for an annotation tool?
No. It exists to make small ground-truth corrections survivable inside the inspection workflow, without forcing a context switch into a full annotator. For greenfield labeling, use a real annotation tool.
Is there a Pipeline class?
No. The Python API uses PipelineBuilder (a fluent builder that returns a BasePipeline) and EvaluarSuite (which collects pipelines and runs them). See Python API.
What does the TUI render images with?
It doesn't. The TUI is a Textual app — terminal-only. When you press o or v to look at predictions on the source image, Evaluar launches the bbox editor as a separate Python process that uses OpenCV's native window. See Bbox editor.
What CLI commands exist?
The main commands are:
evaluar (no args): launches the TUI.
evaluar init <task>: scaffolds a project.
evaluar eval_file.py: runs one code-first eval file.
evaluar test: discovers and runs all eval_*.py files.
evaluar report list / show / export / delete / clear / compare: works with saved runs.
See Reports and Quick start.
Are there regression gates?
Yes. Eval-file runs and evaluar test exit with a non-zero status when the rollup verdict is fail or error. For custom checks, export or read the saved JSON with jq or Python. See Headless / CI.
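Because the gate is just a process exit status, a CI step can wrap it in a few lines of Python. The sketch below is a minimal wrapper, assuming the command is available on PATH; the function name `gate` is hypothetical and not part of Evaluar.

```python
import subprocess
from typing import Sequence


def gate(command: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    # check=False: inspect the return code directly instead of raising on failure.
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0
```

A CI script could call `gate(["evaluar", "test", "--headless"])` and fail the job when it returns False.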
How do I add a custom metric?
For a metric within an existing modality, add it to the relevant scorer (DetectionScorer, OCRScorer, TableScorer) and its config schema. For a new modality, subclass BaseScorer. Evaluar keeps this path explicit rather than using runtime scorer registration decorators.