Core concepts
The five primitives Evaluar is actually built on — Suite, PipelineBuilder, Connector, Normalizer, Scorer (plus Rollup).
Evaluar's surface area is intentionally small. Five primitives cover almost everything you'll do; a sixth (Rollup) glues them together when you have more than one pipeline in a suite.
PipelineBuilder
A PipelineBuilder describes a single evaluation: where predictions come from, how they're shaped into Evaluar's prediction schema, what ground truth they're scored against, and which scorer applies.
```python
from evaluar.api import detection

pipeline = (
    detection("my_model")   # task type + model id
    .callable(my_model)     # connector: where predictions come from
    .inputs(inputs)         # per-sample inputs
    .ground_truth(gt)       # per-sample expected output
    .default_mapping()      # normalizer: shape raw → canonical
    .build()                # → BasePipeline
)
```

The builder lives in src/evaluar/api.py:94. The classmethod entry points are PipelineBuilder.detection(model_id) and PipelineBuilder.for_task(task_type, model_id); evaluar.api.detection(...) is a convenience wrapper.
Suite
A Suite is a runnable bundle of one-or-more pipelines plus the metadata Evaluar needs to save and re-open the resulting run.
```python
from evaluar.api import suite

result = (
    suite(sample_ids=["sample_001"], suite_name="my_eval")
    .add_pipeline("my_model", pipeline)
    .run(save=True)
)
```

suite(...) returns an EvaluarSuite (src/evaluar/api.py:320). The interesting methods:
- .add_pipeline(model_id, pipeline): register a pipeline.
- .run(save=False) -> RunnerResult: execute synchronously.
- .run_async(save=False): same, awaitable.
- .save_result(result, results_dir=None): persist a result after the fact.
- .metadata() -> dict: the manifest snapshot embedded in the saved run.
Calling .run(save=True) writes evaluar/results/<run_id>.json (see Run storage).
Connector
A connector is the "where predictions come from" half of a pipeline. The three built-in connector types live in src/evaluar/connectors/:
| Builder method | Class | When to use it |
|---|---|---|
| .callable(fn) | CallableConnector | The model is a Python callable in-process. |
| .fixture(path) | FixtureConnector | Use a saved JSON of predictions (deterministic, fast). |
| .http(base_url, endpoint, timeout=30.0, headers=None, request_transform=None) | HttpConnector | The model is behind an HTTP endpoint. |
You can also pass a custom connector that subclasses BaseConnector via .connector(...). See Connectors.
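To make the "where predictions come from" idea concrete, here is a toy sketch of two prediction sources behind one interface. It does not use Evaluar's classes: ToyConnector, ToyCallableConnector, ToyFixtureConnector, the predict(sample_id, inputs) method, and the fixture layout (a JSON object keyed by sample id) are all assumptions made for illustration, and the real BaseConnector interface may look different.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Protocol


class ToyConnector(Protocol):
    """Hypothetical interface: return the raw prediction for one sample."""

    def predict(self, sample_id: str, inputs: dict[str, Any]) -> Any: ...


class ToyCallableConnector:
    """Calls an in-process Python function with the sample's inputs."""

    def __init__(self, fn: Callable[..., Any]) -> None:
        self.fn = fn

    def predict(self, sample_id: str, inputs: dict[str, Any]) -> Any:
        return self.fn(**inputs)


class ToyFixtureConnector:
    """Replays predictions saved in a JSON file (assumed keyed by sample id)."""

    def __init__(self, path: Path) -> None:
        self.predictions = json.loads(Path(path).read_text())

    def predict(self, sample_id: str, inputs: dict[str, Any]) -> Any:
        return self.predictions[sample_id]


def fake_model(image_url: str) -> dict:
    # Stand-in model: always "detects" one door.
    return {"detections": [{"name": "door", "bounds": [0, 0, 10, 20], "confidence": 0.9}]}


inputs = {"image_url": "https://example.com/a.png"}
live = ToyCallableConnector(fake_model).predict("sample_001", inputs)

# Save the live output once, then replay it deterministically from disk.
with tempfile.TemporaryDirectory() as tmp:
    fixture_path = Path(tmp) / "predictions.json"
    fixture_path.write_text(json.dumps({"sample_001": live}))
    replayed = ToyFixtureConnector(fixture_path).predict("sample_001", inputs)
```

Because both toy classes expose the same method, the rest of a pipeline would not need to know whether a prediction was computed live or replayed, which is the property that makes fixtures useful for fast, deterministic runs.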
Normalizer
Models return whatever shape they want. Normalizers turn that into Evaluar's canonical prediction schema (src/evaluar/schemas/predictions.py). Two paths:
- JMESPath mapping (declarative). Use .mapping({"objects": "prediction[*].{label: label_name, bbox: box, score: score}"}), or .default_mapping() for task-default mappings. Detection mappings can also take label_map={"door": "Door"} for direct label canonicalization.
- Function (imperative). Decorate a function with @normalizer and pass it to .normalizer(...).
```python
from evaluar.api import normalizer


@normalizer
def normalize_my_model(raw: dict) -> dict:
    return {
        "objects": [
            {"label": x["name"], "bbox": x["bounds"], "score": x["confidence"]}
            for x in raw["detections"]
        ]
    }
```

See Normalization for the full picture, including LLM-backed transformations.
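As a quick check of what a function normalizer like the one above produces, the sketch below runs the same function body on a made-up payload. The @normalizer decorator is left off so the snippet runs without Evaluar installed. The sample payload and the apply_label_map helper (a stand-in for the kind of renaming that label_map={"door": "Door"} is described as doing) are assumptions, not part of Evaluar.

```python
def normalize_my_model(raw: dict) -> dict:
    # Same body as the function normalizer, minus the @normalizer decorator.
    return {
        "objects": [
            {"label": x["name"], "bbox": x["bounds"], "score": x["confidence"]}
            for x in raw["detections"]
        ]
    }


def apply_label_map(prediction: dict, label_map: dict) -> dict:
    # Hypothetical helper: rename labels, leaving unmapped ones unchanged.
    return {
        "objects": [
            {**obj, "label": label_map.get(obj["label"], obj["label"])}
            for obj in prediction["objects"]
        ]
    }


raw = {
    "detections": [
        {"name": "door", "bounds": [10, 20, 110, 220], "confidence": 0.92},
        {"name": "window", "bounds": [5, 5, 50, 60], "confidence": 0.40},
    ]
}

normalized = normalize_my_model(raw)
canonical = apply_label_map(normalized, {"door": "Door"})

print(canonical["objects"][0])
# {'label': 'Door', 'bbox': [10, 20, 110, 220], 'score': 0.92}
```

The model-specific keys (name, bounds, confidence) disappear after normalization, so everything downstream only sees label, bbox, and score.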
Evaluator and scorer
An evaluator computes metrics from normalized predictions and ground truth. A scorer applies thresholds to those metrics and finalizes the pass | warn | fail verdict.
| Class | Modality | Source |
|---|---|---|
| DetectionScorer | bounding-box detection | src/evaluar/scoring/detection.py |
| OCRScorer | text recognition | src/evaluar/scoring/ocr.py |
| TableScorer | table extraction | src/evaluar/scoring/table.py |
| MergedScorer | gates metrics for merged-output predictions | src/evaluar/scoring/merged.py |
Each scorer is configured with a *ScorerConfig carrying per-metric MetricThreshold(pass_floor, warn_floor) values. Most pipelines never instantiate a scorer directly — task helpers like detection(...), ocr(...), table(...), and merged(...) install a sensible default. See Scorers & metrics.
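One way to picture how a pair of floors turns a metric into a pass | warn | fail verdict is the sketch below. It is not Evaluar's scorer code: ToyMetricThreshold, verdict, and overall are hypothetical names, and the semantics (higher is better, inclusive floors, worst metric decides the pipeline verdict) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMetricThreshold:
    """Hypothetical stand-in for MetricThreshold(pass_floor, warn_floor)."""

    pass_floor: float
    warn_floor: float


def verdict(value: float, threshold: ToyMetricThreshold) -> str:
    # Assumption: higher is better and both floors are inclusive.
    if value >= threshold.pass_floor:
        return "pass"
    if value >= threshold.warn_floor:
        return "warn"
    return "fail"


def overall(metrics: dict[str, float], thresholds: dict[str, ToyMetricThreshold]) -> str:
    # Assumption: the worst per-metric verdict decides the pipeline verdict.
    order = {"pass": 0, "warn": 1, "fail": 2}
    verdicts = [verdict(metrics[name], t) for name, t in thresholds.items()]
    return max(verdicts, key=order.__getitem__)


thresholds = {
    "precision": ToyMetricThreshold(pass_floor=0.90, warn_floor=0.80),
    "recall": ToyMetricThreshold(pass_floor=0.85, warn_floor=0.70),
}
metrics = {"precision": 0.93, "recall": 0.78}

print(overall(metrics, thresholds))
# warn
```

Here precision clears its pass floor, but recall only clears its warn floor, so under these assumptions the pipeline as a whole is reported as warn.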
Rollup
When a suite contains multiple pipelines, the RollupScorer (src/evaluar/scoring/rollup.py:71) aggregates their verdicts into a single suite-level verdict using the pass_threshold and warn_threshold declared in the manifest. The RunnerResult.rollup_scorecard is what evaluar report show displays at the top of the run.
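The aggregation step can be sketched as follows. The text above does not specify how pass_threshold and warn_threshold are applied, so this example assumes they are compared against the share of pipelines whose verdict is pass. The rollup function and the verdict dictionary are hypothetical and only illustrate the idea; RollupScorer may work differently.

```python
def rollup(verdicts: dict[str, str], pass_threshold: float, warn_threshold: float) -> str:
    # Assumption: the suite verdict depends on the share of passing pipelines.
    if not verdicts:
        raise ValueError("need at least one pipeline verdict")
    pass_rate = sum(v == "pass" for v in verdicts.values()) / len(verdicts)
    if pass_rate >= pass_threshold:
        return "pass"
    if pass_rate >= warn_threshold:
        return "warn"
    return "fail"


pipeline_verdicts = {
    "model_a": "pass",
    "model_b": "pass",
    "model_c": "warn",
    "model_d": "fail",
}
suite_verdict = rollup(pipeline_verdicts, pass_threshold=0.75, warn_threshold=0.5)

print(suite_verdict)
# warn
```

Two of four pipelines pass, giving a pass rate of 0.5. That misses the 0.75 pass threshold but meets the 0.5 warn threshold, so this toy rollup reports warn.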
Putting it together
A complete code-first suite:
```python
from evaluar.api import detection, suite


def my_model(image_url: str) -> dict:
    ...


def build_suite(sample_ids=None, config=None):
    pipeline = (
        detection("my_model")
        .callable(my_model)
        .inputs({"sample_001": {"image_url": "..."}})
        .ground_truth({"sample_001": {"objects": [...]}})
        .default_mapping()
        .build()
    )
    s = suite(sample_ids=sample_ids or ["sample_001"], suite_name="my_eval")
    s.add_pipeline("my_model", pipeline)
    return s
```

This is the build_suite contract Evaluar discovers and runs. See Suites for the full discovery model.