Failure inspection

The failure inspector pairs a structured diff of expected-vs-actual with the sample list, and hands off to the bbox editor whenever you need to look at predictions on the source image.

Opening the inspector

From the results view, press `i`. (`i` is also registered as a global binding; see src/evaluar/tui/app.py:104.)

What the panes show

Verified against src/evaluar/tui/views/failure_inspector.py and src/evaluar/tui/widgets/failure_inspector.py.

Diff pane. A structured comparison between the predicted output and the ground truth for the focused sample. The structure is task-aware: detection diffs at the box level, OCR diffs at the text level, table diffs at the cell level.
Samples pane. The samples in the run, with each sample's verdict. Sample list iteration is plain keyboard navigation.
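To make "task-aware" concrete, here is a minimal sketch of a diff that dispatches on task type. It is an illustration, not the code in failure_inspector.py: the function names, the task keys (`detection`, `ocr`, `table`), and the data shapes (boxes as tuples, tables as lists of rows) are all assumptions. The box comparison uses exact equality, which a real detector diff would likely replace with overlap-based matching.

```python
import difflib


def diff_boxes(expected, actual):
    """Detection: compare at the box level (exact match, for illustration only)."""
    exp, act = set(expected), set(actual)
    return {"missing": sorted(exp - act), "extra": sorted(act - exp)}


def diff_text(expected, actual):
    """OCR: compare at the text level, token by token."""
    return [
        line
        for line in difflib.ndiff(expected.split(), actual.split())
        if not line.startswith("? ")  # drop ndiff's intraline hint rows
    ]


def diff_cells(expected, actual):
    """Table: report each (row, col, expected, actual) where a cell differs."""
    changes = []
    for r, (exp_row, act_row) in enumerate(zip(expected, actual)):
        for c, (e, a) in enumerate(zip(exp_row, act_row)):
            if e != a:
                changes.append((r, c, e, a))
    return changes


# Hypothetical task keys; the real task names may differ.
DIFFERS = {"detection": diff_boxes, "ocr": diff_text, "table": diff_cells}


def structured_diff(task, expected, actual):
    """Pick the diff granularity from the task type."""
    return DIFFERS[task](expected, actual)
```

Keeping one function per task behind a single entry point means the pane only needs to know the task type of the focused sample.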

Bindings

From src/evaluar/tui/views/failure_inspector.py:27:

`d`	Focus the diff pane
`s`	Focus the samples pane
`o`	Open the bbox editor (overlay, read-only)
`v`	Open the bbox editor (edit ground truth)
`tab`	Cycle focus forward
`right`	Cycle focus forward
`left`	Cycle focus backward
`b`	Go back
`escape`	Go back
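The focus-related rows of the table can be read as a small state machine. The sketch below models only that part, assuming two panes in a fixed order; the pane names, their order, and the `handle_key` helper are hypothetical and do not mirror the actual binding declarations.

```python
PANES = ["diff", "samples"]  # assumed pane order


def cycle(focused, step):
    """Move focus forward (step=1) or backward (step=-1), wrapping around."""
    return PANES[(PANES.index(focused) + step) % len(PANES)]


def handle_key(key, focused):
    """Return the pane that holds focus after `key` is pressed."""
    if key == "d":
        return "diff"
    if key == "s":
        return "samples"
    if key in ("tab", "right"):
        return cycle(focused, 1)
    if key == "left":
        return cycle(focused, -1)
    # o, v, b, and escape open the editor or leave the view;
    # they do not move focus between panes.
    return focused
```

With only two panes, cycling forward and backward land on the same pane; the distinction matters only if more panes are added.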

A typical flow

Open a saved run

evaluar report show <run_id>

The run lands in the results view.

Press `i` to enter the inspector

With the samples pane focused (`s`), walk the list from the keyboard; the diff pane updates as you move.

Press `o` to look at the prediction visually

The bbox editor opens in overlay mode (read-only) as a separate OpenCV window. Inside, + / - zoom, 0 resets, q closes. See Bbox editor.

If the ground truth is wrong, press `v`

The bbox editor reopens in edit mode. Mouse-driven box drawing, resizing, and labeling; Backspace deletes the selected box; Esc cancels the current action.

Comparing two runs

Run-vs-run comparison is its own view, not part of the inspector. From the shell:

evaluar report compare <run_a> <run_b>

This opens the compare view (src/evaluar/tui/views/compare.py). It diffs the two runs at the rollup-scorecard level. See Reports.
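As a rough picture of what a scorecard-level diff involves, the sketch below treats each run's rollup scorecard as a flat mapping from metric name to score and computes per-metric deltas. That flat shape, the metric names, and the `compare_scorecards` helper are assumptions for illustration; the compare view's real data model is not described here.

```python
def compare_scorecards(run_a, run_b):
    """Return run_b minus run_a for every metric seen in either run.

    A metric present in only one run maps to None, since no delta exists.
    """
    deltas = {}
    for metric in sorted(set(run_a) | set(run_b)):
        if metric in run_a and metric in run_b:
            deltas[metric] = run_b[metric] - run_a[metric]
        else:
            deltas[metric] = None
    return deltas


# Hypothetical scorecards with made-up metric names and values.
baseline = {"precision": 0.5, "recall": 0.75}
candidate = {"precision": 0.75, "recall": 0.75, "f1": 0.75}
print(compare_scorecards(baseline, candidate))
```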

Scope

The current inspector is a small surface by design: sample navigation, task-aware diffs, and hand-offs for image overlays or ground-truth edits.

For deeper automation, consume the saved JSON (evaluar/results/<run_id>.json) directly. Its shape is documented in Run storage.
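A minimal starting point for scripting against a saved run might look like the following. It assumes only the file location given above and makes no assumption about the payload's shape, which is documented in Run storage; `load_run` and `top_level_summary` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_run(run_id, results_dir=Path("evaluar/results")):
    """Load the saved JSON for one run from <results_dir>/<run_id>.json."""
    path = Path(results_dir) / f"{run_id}.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def top_level_summary(payload):
    """Map each top-level key to the type name of its value.

    Useful for a first look at the file before writing shape-specific code.
    """
    if isinstance(payload, dict):
        return {key: type(value).__name__ for key, value in payload.items()}
    return {"<root>": type(payload).__name__}
```

Calling `top_level_summary(load_run("<run_id>"))` with a real run id gives a quick orientation before you consult the documented schema.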
