Scorers & metrics
The four scorer classes (Detection, OCR, Table, Merged), the Rollup scorer, the underlying metric functions, and the threshold model that produces verdicts.
A scorer applies thresholds to a scorecard produced by an evaluator. Evaluators compute metrics; scorers fill the threshold snapshot and finalize the pass | warn | fail verdict.
Most pipelines never instantiate a scorer directly — the task helpers (detection(...), ocr(...), table(...), merged(...)) install a sensible default. You override only when you want different thresholds or gated metrics.
The scorer classes
| Class | Modality | Source | Config |
|---|---|---|---|
| DetectionScorer | Bounding-box detection | src/evaluar/scoring/detection.py:37 | DetectionScorerConfig |
| OCRScorer | Text recognition | src/evaluar/scoring/ocr.py:79 | OCRScorerConfig |
| TableScorer | Table extraction | src/evaluar/scoring/table.py:81 | TableScorerConfig |
| MergedScorer | Merged-output schema gating | src/evaluar/scoring/merged.py:78 | MergedScorerConfig |
| RollupScorer | Suite-level aggregate | src/evaluar/scoring/rollup.py:71 | RollupConfig |
All scorers extend BaseScorer (src/evaluar/scoring/base.py:47).
Threshold model
Each scorer's config carries per-metric MetricThreshold(pass_floor, warn_floor) values. The verdict is computed by comparing the metric value against those floors:
- value ≥ pass_floor → pass
- warn_floor ≤ value < pass_floor → warn
- value < warn_floor → fail
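The comparison above can be sketched in a few lines. This is a minimal illustration rather than evaluar's own code: the `MetricThreshold` dataclass below is a local stand-in that mirrors only the two fields named above, and `verdict_for` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricThreshold:
    """Local stand-in with the two fields described above (not the library class)."""
    pass_floor: float
    warn_floor: float


def verdict_for(value: float, threshold: MetricThreshold) -> str:
    """Map a higher-is-better metric value to 'pass', 'warn', or 'fail'."""
    if value >= threshold.pass_floor:
        return "pass"
    if value >= threshold.warn_floor:
        return "warn"
    return "fail"


# Example floors taken from the detection defaults quoted in this section.
map_50 = MetricThreshold(pass_floor=0.80, warn_floor=0.60)
for value in (0.85, 0.70, 0.40):
    print(value, verdict_for(value, map_50))
```

Both floors are inclusive, so a value exactly equal to pass_floor passes and a value exactly equal to warn_floor warns.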
For OCR error-rate metrics (cer, wer, mean_cer, mean_wer), the YAML values are maximum acceptable error rates. OCRScorer inverts those metrics internally before applying the shared "higher is better" threshold machinery, so pass_floor: 0.05 means "pass when CER is at or below 0.05".
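One way to picture the inversion is to negate both the error rate and its limits, which turns "lower is better" into "higher is better" without changing where the boundaries sit. The sketch below assumes negation; the text does not say which transformation OCRScorer actually applies, and `error_rate_verdict` is a hypothetical name.

```python
def error_rate_verdict(error_rate: float, max_pass: float, max_warn: float) -> str:
    """Verdict for a lower-is-better metric, using higher-is-better floors.

    max_pass and max_warn play the role of the YAML pass_floor and warn_floor
    values for cer/wer: the largest error rates that still pass or warn.
    """
    # Assumption: inversion by negation. OCRScorer may use a different transform.
    score = -error_rate
    pass_floor = -max_pass
    warn_floor = -max_warn

    # From here on this is the shared "higher is better" comparison.
    if score >= pass_floor:
        return "pass"
    if score >= warn_floor:
        return "warn"
    return "fail"


print(error_rate_verdict(0.05, max_pass=0.05, max_warn=0.10))
print(error_rate_verdict(0.07, max_pass=0.05, max_warn=0.10))
print(error_rate_verdict(0.20, max_pass=0.05, max_warn=0.10))
```

With this reading, a CER of exactly 0.05 against pass_floor: 0.05 passes, which matches the "at or below" wording above.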
Defaults from src/evaluar/scoring/detection.py:22:
thresholds = {
    "map": MetricThreshold(pass_floor=0.70, warn_floor=0.50),
    "map_50": MetricThreshold(pass_floor=0.80, warn_floor=0.60),
    ...
}
You can override thresholds per project in evaluar/configs/<model>.yaml:
thresholds:
  map_50:
    pass_floor: 0.7
    warn_floor: 0.6
  precision:
    pass_floor: 0.85
    warn_floor: 0.65
gated_metrics:
  - map_50
  - precision
per_class_gated: true
gated_metrics lists which metrics participate in the verdict; per_class_gated: true requires the threshold to hold per class as well as in aggregate.
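The effect of gated_metrics and per_class_gated can be illustrated with a small sketch. Several things here are assumptions, not documented behaviour: that the overall verdict is the worst verdict among the gated checks, that an empty gate list yields a pass, and the layout of the `aggregate` and `per_class` dictionaries. All function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

RANK = {"fail": 0, "warn": 1, "pass": 2}  # lower rank means worse verdict


def verdict_for(value: float, pass_floor: float, warn_floor: float) -> str:
    if value >= pass_floor:
        return "pass"
    if value >= warn_floor:
        return "warn"
    return "fail"


def gated_verdict(
    aggregate: Dict[str, float],
    per_class: Dict[str, Dict[str, float]],
    thresholds: Dict[str, Tuple[float, float]],  # name -> (pass_floor, warn_floor)
    gated_metrics: List[str],
    per_class_gated: bool,
) -> str:
    verdicts = []
    for name in gated_metrics:  # metrics outside this list are ignored
        pass_floor, warn_floor = thresholds[name]
        verdicts.append(verdict_for(aggregate[name], pass_floor, warn_floor))
        if per_class_gated:
            # Assumption: each class must clear the same floors as the aggregate.
            for class_metrics in per_class.values():
                if name in class_metrics:
                    verdicts.append(
                        verdict_for(class_metrics[name], pass_floor, warn_floor)
                    )
    # Assumption: worst verdict wins; nothing gated means nothing can fail.
    return min(verdicts, key=RANK.__getitem__, default="pass")


thresholds = {"map_50": (0.7, 0.6), "precision": (0.85, 0.65)}
aggregate = {"map_50": 0.75, "precision": 0.90, "recall": 0.10}
per_class = {
    "text": {"map_50": 0.80, "precision": 0.90},
    "table": {"map_50": 0.65, "precision": 0.90},
}
gates = ["map_50", "precision"]
print(gated_verdict(aggregate, per_class, thresholds, gates, per_class_gated=False))
print(gated_verdict(aggregate, per_class, thresholds, gates, per_class_gated=True))
```

In this example the low recall value has no effect because recall is not gated, while the weaker "table" class only pulls the verdict down once per_class_gated is enabled.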
To apply a YAML config in a code-first suite, load it and pass it into the builder:
from evaluar.registry import registry
scorer_config = registry.load_scorer_config(
    path="evaluar/configs/my_model.yaml",
    task_type="detection",
)
pipeline = (
    detection("my_model")
    .callable(my_model)
    .inputs(INPUTS)
    .ground_truth(GROUND_TRUTH)
    .default_mapping()
    .scorer(scorer_config)
    .build()
)
For small code-first changes, use the builder methods instead:
pipeline = (
    detection("my_model")
    .callable(my_model)
    .inputs(INPUTS)
    .ground_truth(GROUND_TRUTH)
    .thresholds(map_50=(0.80, 0.60), precision=(0.85, 0.65))
    .gated_metrics("map_50", "precision")
    .build()
)
This produces the same scorer config shape as YAML; it does not change how metrics are computed.
Underlying metric functions
The scorers compose plain metric functions from src/evaluar/metrics/. They're directly importable when you want to compute a metric outside a pipeline:
Detection (src/evaluar/metrics/__init__.py)
- compute_iou, compute_iou_matrix
- match_detections, compute_confusion_matrix
- compute_ap, compute_dataset_ap
- compute_map, compute_dataset_map
- compute_detection_metrics, DetectionMetrics
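For readers unfamiliar with the quantity behind the detection metrics, here is a standalone sketch of intersection over union for two axis-aligned boxes. It is not evaluar's compute_iou, whose signature and box format are not given here; the `(x1, y1, x2, y2)` corner format and the name `iou_xyxy` are assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou_xyxy(a: Box, b: Box) -> float:
    """Intersection area divided by union area for two axis-aligned boxes."""
    # Overlap rectangle; width or height clamps to 0 when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou_xyxy((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
```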
Text
- compute_cer: character error rate.
- compute_wer: word error rate.
- compute_exact_match
- compute_sequence_match_rate
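As a reference for what the two error rates measure, the sketch below computes them from Levenshtein edit distance. It assumes the common definition (edit distance divided by reference length, no clamping to 1.0, whitespace tokenization for words); evaluar's compute_cer and compute_wer may normalize differently, and the function names here are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    if len(ref) == 0:
        # Assumption: an empty reference scores 0.0 only for an empty hypothesis.
        return 0.0 if len(hyp) == 0 else 1.0
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    return error_rate(reference, hypothesis)


def wer(reference: str, hypothesis: str) -> float:
    return error_rate(reference.split(), hypothesis.split())


print(cer("invoice", "inv0ice"))                 # one substitution in 7 characters
print(wer("total amount due", "total amount"))  # one missing word out of 3
```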
Table
- compute_header_match_rate
- compute_key_field_completeness
- compare_table_structure, TableStructureResult
Schema validation
- check_schema_compliance, check_dataset_schema_compliance, SchemaComplianceResult
Rollup
When a suite contains one or more pipelines, RollupScorer aggregates pipeline verdicts with pass_threshold and warn_threshold:
rollup:
  pass_threshold: 0.80  # ≥ 80% of weighted models must pass
  warn_threshold: 0.60
  model_weights: []  # equal weights when empty
The result lands in RunnerResult.rollup_scorecard and is the headline verdict the TUI shows.
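The aggregation can be sketched as a weighted pass fraction compared against the two thresholds. This is an illustration under assumptions: the config comment only says that 80% of weighted models must pass, so treating warn_threshold as a second cut-off on the same pass fraction is a guess, and `rollup_verdict` is a hypothetical function, not RollupScorer's API.

```python
from typing import List, Optional


def rollup_verdict(
    verdicts: List[str],
    pass_threshold: float = 0.80,
    warn_threshold: float = 0.60,
    model_weights: Optional[List[float]] = None,
) -> str:
    """Suite verdict from per-pipeline verdicts and optional weights."""
    # Empty or missing weights mean equal weighting, as in the config comment.
    weights = list(model_weights) if model_weights else [1.0] * len(verdicts)
    if len(weights) != len(verdicts):
        raise ValueError("need exactly one weight per pipeline verdict")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    passing = sum(w for v, w in zip(verdicts, weights) if v == "pass")
    fraction = passing / total

    # Assumption: both thresholds apply to the weighted pass fraction.
    if fraction >= pass_threshold:
        return "pass"
    if fraction >= warn_threshold:
        return "warn"
    return "fail"


print(rollup_verdict(["pass", "pass", "pass", "pass", "fail"]))
print(rollup_verdict(["pass", "fail", "fail"], model_weights=[3.0, 1.0, 1.0]))
```

The second call shows why weights matter: only one of three pipelines passes, but it carries 60% of the weight, so the suite lands on warn and not fail.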
In Python, configure the same values with:
s = suite(sample_ids=[...]).rollup(
    pass_threshold=0.80,
    warn_threshold=0.60,
)
Custom scorers
The supported path for a new modality or a substantially different scoring strategy is to subclass BaseScorer and write a matching *ScorerConfig.
For most projects the right path is to keep the built-in scorer and tune thresholds in the config YAML.