Evaluar

Evaluar is an evaluation framework for vision and document systems — code-first pipelines, threshold-driven scorers, and a Textual TUI for inspecting runs.

Evaluar is an evaluation framework for the kind of systems where a metric drop is rarely the whole story. It pairs a code-first Python API with a terminal UI for opening individual runs and a hand-off to an OpenCV bbox editor when you need to look at predictions on the source image.

Evaluar TUI showing a saved run — The TUI launches when you run `evaluar` with no arguments. The home view lists recent runs; opening one drops you into the results view.

Evaluar is an early preview (v0.1). The framework runs end-to-end for detection, OCR, and table tasks and is in active use internally. Surface area outside what's documented here may change between minor versions.

What's actually in the box

Code-first suites

Define an evaluation as an eval_*.py file that exposes build_suite(...). Run one file directly or run all of them with `evaluar test`.

Threshold-driven scorers

Detection, OCR, table, and merged scorers each apply MetricThreshold(pass_floor, warn_floor) to produce a pass / warn / fail verdict.

A small, sharp TUI

Open a saved run, focus the failing samples, then hand off to the bbox editor for visual inspection. Built on Textual.

The two surfaces

Evaluar has one Python API and one terminal binary. Both lead to the same on-disk artifacts.

Python. evaluar.api exposes PipelineBuilder, EvaluarSuite, task helpers (detection(...), ocr(...), table(...), merged(...)), suite(...), and the @normalizer decorator. A pipeline is built with the PipelineBuilder chain; a suite collects one or more pipelines and runs them.
CLI. One binary, all defined in src/evaluar/cli/ — evaluar opens the TUI, evaluar eval_file.py runs one eval file, evaluar test discovers all eval_*.py files, and evaluar report works with saved runs.

Every run produces a single JSON file at evaluar/results/<run_id>.json (configurable via --results-dir). That file is what the TUI opens, what evaluar report compare diffs, and what CI gates can parse.

Where to go next

Install & init

Install Evaluar and run `evaluar init detection` to scaffold a working pipeline in your repo.

Quick start

From a fresh clone to your first opened run in under five minutes.

Core concepts

The five primitives Evaluar is actually built on — Suite, PipelineBuilder, Connector, Normalizer, Scorer.

TUI guide

Every view, every binding, every hand-off — verified against `src/evaluar/tui/`.

What Evaluar isn't

Evaluar is intentionally narrow. It does not train models, host datasets, or replace a general annotation tool. The bbox editor exists to make small ground-truth corrections survivable inside an inspection workflow — not to label a fresh dataset from scratch.

Introduction