evaluar

Scorers & metrics

The four scorer classes (Detection, OCR, Table, Merged), the Rollup scorer, the underlying metric functions, and the threshold model that produces verdicts.

A scorer applies thresholds to a scorecard produced by an evaluator. Evaluators compute metrics; scorers fill the threshold snapshot and finalize the pass | warn | fail verdict.

Most pipelines never instantiate a scorer directly: the task helpers (detection(...), ocr(...), table(...), merged(...)) install a sensible default. Override it only when you want different thresholds or a different set of gated metrics.

The scorer classes

| Class | Modality | Source | Config |
| DetectionScorer | Bounding-box detection | src/evaluar/scoring/detection.py:37 | DetectionScorerConfig |
| OCRScorer | Text recognition | src/evaluar/scoring/ocr.py:79 | OCRScorerConfig |
| TableScorer | Table extraction | src/evaluar/scoring/table.py:81 | TableScorerConfig |
| MergedScorer | Merged-output schema gating | src/evaluar/scoring/merged.py:78 | MergedScorerConfig |
| RollupScorer | Suite-level aggregate | src/evaluar/scoring/rollup.py:71 | RollupConfig |

All scorers extend BaseScorer (src/evaluar/scoring/base.py:47).

Threshold model

Each scorer's config carries per-metric MetricThreshold(pass_floor, warn_floor) values. The verdict is computed by comparing the metric value against those floors:

  • value ≥ pass_floor → pass
  • warn_floor ≤ value < pass_floor → warn
  • value < warn_floor → fail
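The comparison rules above can be sketched in a few lines of Python. This is an illustration only: the Threshold dataclass and the verdict function are hypothetical stand-ins, not the library's MetricThreshold or its scorer internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical stand-in for MetricThreshold (not the library class)."""
    pass_floor: float
    warn_floor: float


def verdict(value: float, threshold: Threshold) -> str:
    """Map a higher-is-better metric value to "pass", "warn", or "fail"."""
    if value >= threshold.pass_floor:
        return "pass"
    if value >= threshold.warn_floor:
        return "warn"
    return "fail"


# Floors taken from the "map" default shown in this section.
map_threshold = Threshold(pass_floor=0.70, warn_floor=0.50)

print(verdict(0.72, map_threshold))  # at or above pass_floor
print(verdict(0.55, map_threshold))  # between the two floors
print(verdict(0.40, map_threshold))  # below warn_floor
```

Both floors are inclusive, so a value exactly equal to pass_floor passes and a value exactly equal to warn_floor warns.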

For OCR error-rate metrics (cer, wer, mean_cer, mean_wer), the YAML values are maximum acceptable error rates. OCRScorer inverts those metrics internally before applying the shared "higher is better" threshold machinery, so pass_floor: 0.05 means "pass when CER is at or below 0.05".
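The page does not say how OCRScorer performs the inversion. One way to reuse "higher is better" machinery for an error rate is to negate both the value and the floors, which the sketch below assumes. The function names are hypothetical and the warn ceiling of 0.10 is an example value, not a documented default.

```python
def higher_is_better_verdict(value: float, pass_floor: float, warn_floor: float) -> str:
    """Shared rule: larger values are better, both floors are inclusive."""
    if value >= pass_floor:
        return "pass"
    if value >= warn_floor:
        return "warn"
    return "fail"


def error_rate_verdict(error_rate: float, max_pass: float, max_warn: float) -> str:
    """Treat the configured floors as maximum acceptable error rates.

    Assumption: inversion by negation. Negating all three numbers flips the
    ordering, so "error_rate <= max_pass" becomes "-error_rate >= -max_pass".
    """
    return higher_is_better_verdict(-error_rate, -max_pass, -max_warn)


# pass_floor: 0.05 and warn_floor: 0.10, read as ceilings on CER.
print(error_rate_verdict(0.03, 0.05, 0.10))  # CER below the pass ceiling
print(error_rate_verdict(0.08, 0.05, 0.10))  # CER between the two ceilings
print(error_rate_verdict(0.20, 0.05, 0.10))  # CER above the warn ceiling
```

Note that for error-rate metrics the warn value is numerically larger than the pass value, the reverse of the ordering used for metrics such as map.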

Defaults from src/evaluar/scoring/detection.py:22:

thresholds = {
    "map":    MetricThreshold(pass_floor=0.70, warn_floor=0.50),
    "map_50": MetricThreshold(pass_floor=0.80, warn_floor=0.60),
    ...
}

You can override thresholds per project in evaluar/configs/<model>.yaml:

evaluar/configs/my_model.yaml
thresholds:
  map_50:
    pass_floor: 0.7
    warn_floor: 0.6
  precision:
    pass_floor: 0.85
    warn_floor: 0.65
gated_metrics:
  - map_50
  - precision
per_class_gated: true

gated_metrics lists which metrics participate in the verdict; per_class_gated: true requires the threshold to hold per class as well as in aggregate.
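To make the gating behavior concrete, here is a minimal sketch of how gated metrics and per-class gating might combine into one verdict. It assumes the worst individual verdict wins, which this page does not state; gated_verdict and its argument shapes are hypothetical and not part of the library.

```python
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def metric_verdict(value, pass_floor, warn_floor):
    """Higher-is-better comparison against two inclusive floors."""
    if value >= pass_floor:
        return "pass"
    return "warn" if value >= warn_floor else "fail"


def gated_verdict(metrics, thresholds, gated_metrics, per_class=None):
    """Combine gated metrics into one verdict (assumption: worst verdict wins).

    metrics:    {metric_name: aggregate_value}
    thresholds: {metric_name: (pass_floor, warn_floor)}
    per_class:  optional {class_name: {metric_name: value}}; when given, each
                gated metric must also meet its floors for every class.
    """
    verdicts = []
    for name in gated_metrics:
        floors = thresholds[name]
        verdicts.append(metric_verdict(metrics[name], *floors))
        for class_metrics in (per_class or {}).values():
            if name in class_metrics:
                verdicts.append(metric_verdict(class_metrics[name], *floors))
    return max(verdicts, key=SEVERITY.__getitem__, default="pass")


thresholds = {"map_50": (0.7, 0.6), "precision": (0.85, 0.65)}
metrics = {"map_50": 0.74, "precision": 0.90, "recall": 0.30}  # recall is not gated
per_class = {"text": {"precision": 0.92}, "table": {"precision": 0.70}}

print(gated_verdict(metrics, thresholds, ["map_50", "precision"]))
print(gated_verdict(metrics, thresholds, ["map_50", "precision"], per_class))
```

In this example the aggregate values pass, but with per-class gating the "table" class precision of 0.70 falls between the floors and pulls the overall verdict down to warn.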

To apply a YAML config in a code-first suite, load it and pass it into the builder:

from evaluar.registry import registry

# detection, my_model, INPUTS, and GROUND_TRUTH are assumed to be in scope.

scorer_config = registry.load_scorer_config(
    path="evaluar/configs/my_model.yaml",
    task_type="detection",
)

pipeline = (
    detection("my_model")
    .callable(my_model)
    .inputs(INPUTS)
    .ground_truth(GROUND_TRUTH)
    .default_mapping()
    .scorer(scorer_config)
    .build()
)

For small code-first changes, use the builder methods instead:

pipeline = (
    detection("my_model")
    .callable(my_model)
    .inputs(INPUTS)
    .ground_truth(GROUND_TRUTH)
    .thresholds(map_50=(0.80, 0.60), precision=(0.85, 0.65))
    .gated_metrics("map_50", "precision")
    .build()
)

This produces the same scorer config shape as YAML; it does not change how metrics are computed.

Underlying metric functions

The scorers compose plain metric functions from src/evaluar/metrics/. They're directly importable when you want to compute a metric outside a pipeline:

Detection (src/evaluar/metrics/__init__.py)

  • compute_iou, compute_iou_matrix
  • match_detections, compute_confusion_matrix
  • compute_ap, compute_dataset_ap
  • compute_map, compute_dataset_map
  • compute_detection_metrics, DetectionMetrics
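For readers unfamiliar with the quantity behind compute_iou, the following standalone sketch shows intersection over union for two axis-aligned boxes. It does not call the library; the (x1, y1, x2, y2) box format and the function name iou are assumptions made for illustration, and the real compute_iou signature may differ.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Degenerate (zero-area) inputs yield 0.0 rather than dividing by zero.
    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes overlapping in a 5x5 corner: 25 / (100 + 100 - 25).
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 4))  # 0.1429
```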

Text

  • compute_cer — character error rate.
  • compute_wer — word error rate.
  • compute_exact_match
  • compute_sequence_match_rate
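Character and word error rate are conventionally defined as edit distance divided by reference length. The sketch below implements that textbook definition without calling the library; the names cer and wer, the whitespace tokenization, and the handling of an empty reference are assumptions and may not match compute_cer and compute_wer exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference length."""
    if not reference:
        # Assumed convention for an empty reference.
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


print(round(cer("invoice", "invoise"), 4))                    # 0.1429 (1 edit / 7 chars)
print(round(wer("total amount due", "total amount dew"), 4))  # 0.3333 (1 edit / 3 words)
```

Because insertions count as edits, both rates can exceed 1.0 when the hypothesis is much longer than the reference.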

Table

  • compute_header_match_rate
  • compute_key_field_completeness
  • compare_table_structure, TableStructureResult

Schema validation

  • check_schema_compliance, check_dataset_schema_compliance, SchemaComplianceResult

Rollup

When a suite contains one or more pipelines, RollupScorer aggregates pipeline verdicts with pass_threshold and warn_threshold:

evaluar.yaml
rollup:
  pass_threshold: 0.80   # ≥ 80% of weighted models must pass
  warn_threshold: 0.60
  model_weights: []      # equal weights when empty

The result lands in RunnerResult.rollup_scorecard and is the headline verdict the TUI shows.

In Python, configure the same values with:

s = suite(sample_ids=[...]).rollup(
    pass_threshold=0.80,
    warn_threshold=0.60,
)
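The aggregation can be pictured as a weighted pass fraction compared against the two thresholds. The sketch below is an interpretation, not RollupScorer's implementation: it assumes only "pass" verdicts count toward the fraction, that weights are given as a name-to-weight mapping, and that missing weights default to 1.0. The function name rollup_verdict is hypothetical.

```python
def rollup_verdict(pipeline_verdicts, pass_threshold=0.80, warn_threshold=0.60, weights=None):
    """Aggregate per-pipeline verdicts into one suite-level verdict.

    pipeline_verdicts: {pipeline_name: "pass" | "warn" | "fail"}
    weights:           optional {pipeline_name: weight}; equal weights when empty.
    """
    if not pipeline_verdicts:
        raise ValueError("at least one pipeline verdict is required")

    weights = weights or {}
    total = sum(weights.get(name, 1.0) for name in pipeline_verdicts)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Assumption: only pipelines that passed contribute to the fraction.
    passed = sum(
        weights.get(name, 1.0)
        for name, verdict in pipeline_verdicts.items()
        if verdict == "pass"
    )
    fraction = passed / total

    if fraction >= pass_threshold:
        return "pass"
    if fraction >= warn_threshold:
        return "warn"
    return "fail"


verdicts = {"detection": "pass", "ocr": "pass", "table": "warn", "merged": "pass"}
print(rollup_verdict(verdicts))                          # 3 of 4 pass: 0.75
print(rollup_verdict(verdicts, weights={"table": 0.5}))  # 3 of 3.5 pass: about 0.857
```

With equal weights the suite lands between the two thresholds; halving the weight of the one non-passing pipeline lifts the weighted pass fraction above pass_threshold.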

Custom scorers

The supported path for a new modality or a substantially different scoring strategy is to subclass BaseScorer and write a matching *ScorerConfig. For most projects the right path is to keep the built-in scorer and tune thresholds in the config YAML.
