Evaluar

Composing scorers per pipeline (`MergedScorer`) and aggregating verdicts across pipelines (`RollupScorer`).

Real document evaluations often need to score more than one pipeline. Evaluar handles that with two supported shapes:

Merged output pipelines — score one model response that already contains a merged record/schema shape.
Suite rollup — aggregate pipeline-level verdicts into a single suite-level verdict.

A suite is a flat list of pipelines; cross-pipeline aggregation happens through rollup.

MergedScorer

MergedScorer (src/evaluar/scoring/merged.py:78) applies thresholds to the metrics produced by MergedSchemaEvaluator. It is meant for a single pipeline whose prediction already has the merged output shape: records, fields, a schema version, and optional provenance.
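To make that shape concrete, here is a minimal sketch of what a merged prediction might look like. The class and field names (`RecordSketch`, `MergedOutputSketch`, `schema_version`, `provenance`) and the sample values are illustrative assumptions drawn from the description above, not Evaluar's actual types, so check the schema your installed version expects.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class RecordSketch:
    # Hypothetical: one extracted record, keyed by field name.
    fields: dict[str, Any]
    # Hypothetical: optional pointer to where the record came from.
    provenance: Optional[dict[str, Any]] = None


@dataclass
class MergedOutputSketch:
    # Hypothetical: version of the schema the records conform to.
    schema_version: str
    records: list[RecordSketch] = field(default_factory=list)


def my_doc_model_stub(document: bytes) -> MergedOutputSketch:
    """Stand-in for a model callable that returns the merged shape directly."""
    return MergedOutputSketch(
        schema_version="1",
        records=[
            RecordSketch(
                fields={"invoice_number": "INV-001", "total": 42.0},
                provenance={"page": 1},
            ),
            # Provenance is optional, so this record omits it.
            RecordSketch(fields={"invoice_number": "INV-002", "total": 7.5}),
        ],
    )


prediction = my_doc_model_stub(b"placeholder document bytes")
```

The point of the sketch is that one response carries every record and field together, which is why a single pipeline can score it.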

Use the merged(...) helper when a model returns that merged schema directly:

eval_layout_detector.py

from evaluar.api import merged

pipeline = (
    merged("my_doc_model")
    .callable(my_doc_model)
    .inputs(INPUTS)
    .ground_truth(GROUND_TRUTH)
    .build()
)

Override MergedScorerConfig with .scorer(...) only when you need different thresholds or gated metrics.
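As a rough illustration of what "thresholds" and "gated metrics" could mean, the sketch below turns a metrics dictionary into a verdict. It is not Evaluar's implementation: the function name, the metric names, and the rule that a gated miss forces `fail` while a non-gated miss yields `warn` are all assumptions made for the example.

```python
def apply_thresholds_sketch(metrics, thresholds, gated=frozenset()):
    """Return 'pass', 'warn', or 'fail' for one pipeline's metrics.

    Assumed semantics (not taken from Evaluar):
    - a metric meets its threshold when value >= threshold;
    - any gated metric below its threshold forces 'fail';
    - a non-gated miss downgrades the verdict to 'warn'.
    """
    # Collect every metric that falls short; a missing metric counts as 0.0.
    missed = {
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    }
    if missed & set(gated):
        return "fail"
    if missed:
        return "warn"
    return "pass"


# Hypothetical metric names and values.
metrics = {"field_f1": 0.91, "record_recall": 0.72}
thresholds = {"field_f1": 0.85, "record_recall": 0.80}

lenient = apply_thresholds_sketch(metrics, thresholds)
strict = apply_thresholds_sketch(metrics, thresholds, gated={"record_recall"})
```

Under these assumptions the same metrics give `warn` without a gate and `fail` once `record_recall` is gated, which is the kind of difference a custom scorer config would be used to express.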

RollupScorer

RollupScorer (src/evaluar/scoring/rollup.py:71) is implicit. Whenever a suite has more than one pipeline, the runner uses it to fold per-pipeline verdicts into the run's rollup_scorecard.

Configure it in the manifest:

evaluar.yaml

rollup:
  pass_threshold: 0.80
  warn_threshold: 0.60
  model_weights: []

Key	Purpose
`pass_threshold`	Weighted pass rate at which the rollup verdict is `pass`.
`warn_threshold`	Weighted pass rate at which the rollup verdict is `warn`.
`model_weights`	Per-model weights. Empty → equal weights.

Or configure it in Python with suite(...).rollup(pass_threshold=..., warn_threshold=...).
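The way the two thresholds and the weights interact can be sketched in plain Python. This is a simplified model rather than the code in rollup.py. It assumes that only `pass` verdicts count toward the pass rate, that thresholds are inclusive, and that weights are given as a name-to-weight mapping (the manifest's actual `model_weights` entry format is not shown here).

```python
def rollup_verdict_sketch(verdicts, pass_threshold=0.80, warn_threshold=0.60,
                          model_weights=None):
    """Fold per-pipeline verdicts into (suite_verdict, weighted_pass_rate).

    Assumptions (not confirmed against Evaluar):
    - only 'pass' verdicts count toward the pass rate;
    - thresholds are inclusive (rate >= threshold);
    - empty or missing weights mean every pipeline weighs 1.0.
    """
    if not verdicts:
        raise ValueError("need at least one pipeline verdict")
    weights = model_weights or {}
    total = 0.0
    passed = 0.0
    for name, verdict in verdicts.items():
        weight = weights.get(name, 1.0)
        total += weight
        if verdict == "pass":
            passed += weight
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    rate = passed / total
    if rate >= pass_threshold:
        return "pass", rate
    if rate >= warn_threshold:
        return "warn", rate
    return "fail", rate


verdicts = {"layout_detector": "pass", "icon_detector": "fail"}

# Equal weights: 1 of 2 pipelines passed, rate 0.5, below warn_threshold.
equal = rollup_verdict_sketch(verdicts)

# Hypothetical weights: 3 / (3 + 1) = 0.75, between the two thresholds.
weighted = rollup_verdict_sketch(
    verdicts, model_weights={"layout_detector": 3.0, "icon_detector": 1.0}
)
```

Under these assumptions, a two-pipeline suite with equal weights can only reach a rate of 0.0, 0.5, or 1.0, so with thresholds of 0.80 and 0.60 a single failing pipeline fails the suite unless weights shift the balance.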

End-to-end example

A two-pipeline suite with a rollup:

eval_layout_detector.py

from evaluar.api import detection, suite

def build_suite(sample_ids=None, config=None):
    layout = detection("layout_detector").callable(...).build()
    icon = detection("icon_detector").callable(...).build()

    s = (
        suite(sample_ids=sample_ids or [...], suite_name="document_models")
        .rollup(pass_threshold=0.80, warn_threshold=0.60)
    )
    s.add_pipeline("layout_detector", layout)
    s.add_pipeline("icon_detector", icon)
    return s

The RunnerResult will carry:

pipeline_results["layout_detector"].final_scorecard — pass/warn/fail per the layout pipeline's thresholds.
pipeline_results["icon_detector"].final_scorecard — same, for icons.
rollup_scorecard — the suite-level verdict, computed from the two using pass_threshold/warn_threshold.
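A short sketch of walking those fields follows. Because the example has to run without Evaluar installed, it uses `SimpleNamespace` stand-ins instead of a real RunnerResult, and the `.verdict` attribute on each scorecard is a hypothetical name; consult the scorecard type for the real accessor.

```python
from types import SimpleNamespace

# Stand-in for a RunnerResult; a real one comes from running the suite.
# `.verdict` is a hypothetical attribute name used only for illustration.
result = SimpleNamespace(
    pipeline_results={
        "layout_detector": SimpleNamespace(
            final_scorecard=SimpleNamespace(verdict="pass")
        ),
        "icon_detector": SimpleNamespace(
            final_scorecard=SimpleNamespace(verdict="warn")
        ),
    },
    rollup_scorecard=SimpleNamespace(verdict="fail"),
)


def summarize(result):
    """Build one summary line per pipeline, then the suite-level line."""
    lines = [
        f"{name}: {pipeline.final_scorecard.verdict}"
        for name, pipeline in result.pipeline_results.items()
    ]
    lines.append(f"rollup: {result.rollup_scorecard.verdict}")
    return lines


summary = summarize(result)
print("\n".join(summary))
```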

What this isn't

A suite is a flat list of pipelines; their outputs are scored independently and rolled up.
Sequential work within a single pipeline belongs in your model callable or connector.
The rollup verdict is purely threshold-based; there's no weighting by sample count or per-class importance beyond what model_weights provides.

For most multi-modality projects, the right approach is one pipeline per model, plus a rollup. Reach for MergedScorer only when a single model produces multiple kinds of output that all need scoring.
