Merged & rollup
Composing scorers per pipeline (`MergedScorer`) and aggregating verdicts across pipelines (`RollupScorer`).
Real document evaluations often need to score more than one kind of output, or more than one pipeline. Evaluar supports two shapes for this:
- Merged output pipelines — score one model response that already contains a merged record/schema shape.
- Suite rollup — aggregate pipeline-level verdicts into a single suite-level verdict.
A suite is a flat list of pipelines; cross-pipeline aggregation happens through rollup.
MergedScorer
MergedScorer (src/evaluar/scoring/merged.py:78) applies thresholds to metrics from MergedSchemaEvaluator. It is for a single pipeline whose prediction is already a merged output shape with records, fields, schema version, and optional provenance.
Use the merged(...) helper when a model returns that merged schema directly:
from evaluar.api import merged

pipeline = (
    merged("my_doc_model")
    .callable(my_doc_model)
    .inputs(INPUTS)
    .ground_truth(GROUND_TRUTH)
    .build()
)

Override MergedScorerConfig with .scorer(...) only when you need different thresholds or gated metrics.
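
To make "applies thresholds to metrics" concrete, here is a minimal standalone sketch of one plausible reading. It does not use Evaluar's code: the function `threshold_verdict`, the metric names, and the interpretation of "gated" (a missed gated metric forces a fail, any other miss only warns) are all assumptions for illustration, not the behaviour of MergedScorer or MergedScorerConfig.

```python
from typing import Iterable, Mapping


def threshold_verdict(
    metrics: Mapping[str, float],
    thresholds: Mapping[str, float],
    gated: Iterable[str] = (),
) -> str:
    """Hypothetical helper: map metric values to pass/warn/fail."""
    gated_set = set(gated)
    # A metric misses when it is absent or below its minimum.
    missed = [
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]
    if not missed:
        return "pass"
    # Assumption: a missed gated metric is a hard failure.
    if any(name in gated_set for name in missed):
        return "fail"
    return "warn"


# Hypothetical metric names, chosen only for this example.
example_metrics = {"field_accuracy": 0.95, "record_f1": 0.80}
example_thresholds = {"field_accuracy": 0.90, "record_f1": 0.85}
print(threshold_verdict(example_metrics, example_thresholds))
print(threshold_verdict(example_metrics, example_thresholds, gated=["record_f1"]))
```

Under these assumptions the first call warns, because `record_f1` misses a non-gated threshold, and the second fails, because the same metric is now gated.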
RollupScorer
RollupScorer (src/evaluar/scoring/rollup.py:71) is implicit. Whenever a suite has more than one pipeline, the runner uses it to fold per-pipeline verdicts into the run's rollup_scorecard.
Configure it in the manifest:
rollup:
  pass_threshold: 0.80
  warn_threshold: 0.60
  model_weights: []

| Key | Purpose |
|---|---|
| pass_threshold | Weighted pass rate at which the rollup verdict is pass. |
| warn_threshold | Weighted pass rate at which the rollup verdict is warn. |
| model_weights | Per-model weights. Empty → equal weights. |
Or configure it in Python with suite(...).rollup(pass_threshold=..., warn_threshold=...).
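
The threshold logic described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not RollupScorer's implementation: the function name `rollup_verdict` is hypothetical, weights are passed as a name-to-weight mapping (the manifest's entry format is not shown here), only a pipeline-level pass counts toward the pass rate, and both thresholds are treated as inclusive lower bounds.

```python
from typing import Mapping, Optional


def rollup_verdict(
    pipeline_passed: Mapping[str, bool],
    pass_threshold: float = 0.80,
    warn_threshold: float = 0.60,
    model_weights: Optional[Mapping[str, float]] = None,
) -> str:
    """Hypothetical helper: fold per-pipeline pass flags into one verdict."""
    if not pipeline_passed:
        raise ValueError("need at least one pipeline verdict")
    # Empty or missing weights mean every pipeline counts equally.
    if model_weights:
        weights = {name: model_weights.get(name, 1.0) for name in pipeline_passed}
    else:
        weights = {name: 1.0 for name in pipeline_passed}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Weighted pass rate: share of total weight held by passing pipelines.
    rate = sum(weights[name] for name, ok in pipeline_passed.items() if ok) / total
    # Assumption: thresholds are inclusive.
    if rate >= pass_threshold:
        return "pass"
    if rate >= warn_threshold:
        return "warn"
    return "fail"


verdicts = {"layout_detector": True, "icon_detector": False}
print(rollup_verdict(verdicts))
print(rollup_verdict(verdicts, model_weights={"layout_detector": 3.0, "icon_detector": 1.0}))
```

With equal weights the pass rate is 0.5, which falls below both thresholds. Weighting the passing pipeline three to one raises the rate to 0.75, which clears `warn_threshold` but not `pass_threshold`.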
End-to-end example
A two-pipeline suite with a rollup:
from evaluar.api import detection, suite

def build_suite(sample_ids=None, config=None):
    layout = detection("layout_detector").callable(...).build()
    icon = detection("icon_detector").callable(...).build()
    s = (
        suite(sample_ids=sample_ids or [...], suite_name="document_models")
        .rollup(pass_threshold=0.80, warn_threshold=0.60)
    )
    s.add_pipeline("layout_detector", layout)
    s.add_pipeline("icon_detector", icon)
    return s

The RunnerResult will carry:
- pipeline_results["layout_detector"].final_scorecard: pass/warn/fail per the layout pipeline's thresholds.
- pipeline_results["icon_detector"].final_scorecard: same, for icons.
- rollup_scorecard: the suite-level verdict, computed from the two using pass_threshold/warn_threshold.
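
A small helper for reading those fields might look like the sketch below. The attribute names `pipeline_results`, `final_scorecard`, and `rollup_scorecard` come from the description above; the `verdict` attribute on a scorecard, the helper name `summarize_verdicts`, and the assumption that `pipeline_results` behaves like a dict are guesses for illustration. The stand-in object is built with `SimpleNamespace` so the example runs without Evaluar installed.

```python
from types import SimpleNamespace


def summarize_verdicts(result, verdict_attr="verdict"):
    """Hypothetical helper: return (per-pipeline verdicts, rollup verdict)."""
    per_pipeline = {
        name: getattr(pipeline_result.final_scorecard, verdict_attr)
        for name, pipeline_result in result.pipeline_results.items()
    }
    rollup = getattr(result.rollup_scorecard, verdict_attr)
    return per_pipeline, rollup


def _fake_pipeline(verdict):
    # Stand-in for a per-pipeline result; not an Evaluar type.
    return SimpleNamespace(final_scorecard=SimpleNamespace(verdict=verdict))


# Stand-in for a RunnerResult, shaped after the fields listed above.
fake_result = SimpleNamespace(
    pipeline_results={
        "layout_detector": _fake_pipeline("pass"),
        "icon_detector": _fake_pipeline("warn"),
    },
    rollup_scorecard=SimpleNamespace(verdict="warn"),
)

per_pipeline, rollup = summarize_verdicts(fake_result)
for name, verdict in per_pipeline.items():
    print(f"{name}: {verdict}")
print(f"rollup: {rollup}")
```

If the real scorecard exposes its verdict under a different name, pass that name as `verdict_attr`.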
What this isn't
- A suite is a flat list of pipelines; their outputs are scored independently and rolled up.
- Sequential work within a single pipeline belongs in your model callable or connector.
- The rollup verdict is purely threshold-based; there's no weighting by sample count or per-class importance beyond what model_weights provides.
For most multi-modality projects, the right approach is one pipeline per model, plus a rollup. When a single model produces multiple kinds of output that all need scoring, that's when MergedScorer is the answer.