evaluar

Core concepts

The five primitives Evaluar is built on: Suite, PipelineBuilder, Connector, Normalizer, and Scorer (plus Rollup).

Evaluar's surface area is intentionally small. Five primitives cover almost everything you'll do; a sixth (Rollup) glues them together when you have more than one pipeline in a suite.

PipelineBuilder

A PipelineBuilder describes a single evaluation: where predictions come from, how they're shaped into Evaluar's prediction schema, what ground truth they're scored against, and which scorer applies.

from evaluar.api import detection

pipeline = (
    detection("my_model")          # task type + model id
    .callable(my_model)            # connector: where predictions come from
    .inputs(inputs)                # per-sample inputs
    .ground_truth(gt)              # per-sample expected output
    .default_mapping()             # normalizer: shape raw → canonical
    .build()                       # → BasePipeline
)

The builder lives in src/evaluar/api.py:94. The classmethod entry points are PipelineBuilder.detection(model_id) and PipelineBuilder.for_task(task_type, model_id); evaluar.api.detection(...) is a convenience wrapper.

Suite

A Suite is a runnable bundle of one or more pipelines plus the metadata Evaluar needs to save and re-open the resulting run.

from evaluar.api import suite

result = (
    suite(sample_ids=["sample_001"], suite_name="my_eval")
    .add_pipeline("my_model", pipeline)
    .run(save=True)
)

suite(...) returns an EvaluarSuite (src/evaluar/api.py:320). The interesting methods:

  • .add_pipeline(model_id, pipeline) — register a pipeline.
  • .run(save=False) -> RunnerResult — execute synchronously.
  • .run_async(save=False) — same, awaitable.
  • .save_result(result, results_dir=None) — persist a result after the fact.
  • .metadata() -> dict — the manifest snapshot embedded in the saved run.

Calling .run(save=True) writes evaluar/results/<run_id>.json (see Run storage).

Connector

A connector is the "where predictions come from" half of a pipeline. The three built-in connector types live in src/evaluar/connectors/:

| Builder method | Class | When to use it |
| --- | --- | --- |
| .callable(fn) | CallableConnector | The model is a Python callable in-process. |
| .fixture(path) | FixtureConnector | Use a saved JSON of predictions (deterministic, fast). |
| .http(base_url, endpoint, timeout=30.0, headers=None, request_transform=None) | HttpConnector | The model is behind an HTTP endpoint. |

You can also pass a custom connector that subclasses BaseConnector via .connector(...). See Connectors.

Normalizer

Models return whatever shape they want. Normalizers turn that into Evaluar's canonical prediction schema (src/evaluar/schemas/predictions.py). Two paths:

  • JMESPath mapping — declarative. .mapping({"objects": "prediction[*].{label: label_name, bbox: box, score: score}"}) or .default_mapping() for task-default mappings. Detection mappings can also take label_map={"door": "Door"} for direct label canonicalization.
  • Function — imperative. Decorate a function with @normalizer and pass it to .normalizer(...).

from evaluar.api import normalizer

@normalizer
def normalize_my_model(raw: dict) -> dict:
    return {
        "objects": [
            {"label": x["name"], "bbox": x["bounds"], "score": x["confidence"]}
            for x in raw["detections"]
        ]
    }

See Normalization for the full picture, including LLM-backed transformations.
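
To make the declarative path concrete, the sketch below spells out in plain Python roughly what the JMESPath expression prediction[*].{label: label_name, bbox: box, score: score} selects from a raw payload, then applies a label_map. The raw payload is invented for illustration, and apply_mapping is a hypothetical helper, not part of Evaluar. How Evaluar applies label_map internally (for example, what it does with unmapped labels) is an assumption here.

```python
from typing import Any, Dict, Optional


def apply_mapping(
    raw: Dict[str, Any], label_map: Optional[Dict[str, str]] = None
) -> Dict[str, Any]:
    """Hypothetical plain-Python stand-in for the JMESPath mapping above."""
    label_map = label_map or {}
    objects = []
    for item in raw.get("prediction", []):
        label = item.get("label_name")
        objects.append(
            {
                # Assumption: labels missing from label_map pass through unchanged.
                "label": label_map.get(label, label),
                "bbox": item.get("box"),
                "score": item.get("score"),
            }
        )
    return {"objects": objects}


# Invented raw model output, shaped to match the mapping expression.
raw = {
    "prediction": [
        {"label_name": "door", "box": [10, 20, 110, 220], "score": 0.91},
        {"label_name": "window", "box": [5, 5, 50, 60], "score": 0.47},
    ]
}

canonical = apply_mapping(raw, label_map={"door": "Door"})
```

Here "door" is rewritten to "Door" while "window" is left as-is; each output entry carries only the label, bbox, and score keys named in the mapping.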

Evaluator and scorer

An evaluator computes metrics from normalized predictions and ground truth. A scorer applies thresholds to those metrics and finalizes the pass | warn | fail verdict.

| Class | Modality | Source |
| --- | --- | --- |
| DetectionScorer | bounding-box detection | src/evaluar/scoring/detection.py |
| OCRScorer | text recognition | src/evaluar/scoring/ocr.py |
| TableScorer | table extraction | src/evaluar/scoring/table.py |
| MergedScorer | gates metrics for merged-output predictions | src/evaluar/scoring/merged.py |

Each scorer is configured with a *ScorerConfig carrying per-metric MetricThreshold(pass_floor, warn_floor) values. Most pipelines never instantiate a scorer directly — task helpers like detection(...), ocr(...), table(...), and merged(...) install a sensible default. See Scorers & metrics.
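
As a rough illustration of how a pass_floor / warn_floor pair can turn a metric value into a pass | warn | fail verdict, here is a minimal self-contained sketch. SketchThreshold is a local stand-in that only borrows the two field names from MetricThreshold. The comparison rule (higher is better, floors inclusive) is an assumption, not a description of Evaluar's scorers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchThreshold:
    """Local stand-in; only the field names come from MetricThreshold."""

    pass_floor: float
    warn_floor: float


def verdict(value: float, threshold: SketchThreshold) -> str:
    # Assumption: higher metric values are better and floors are inclusive.
    if value >= threshold.pass_floor:
        return "pass"
    if value >= threshold.warn_floor:
        return "warn"
    return "fail"


# Invented floors for an illustrative metric.
threshold = SketchThreshold(pass_floor=0.80, warn_floor=0.60)
verdicts = {value: verdict(value, threshold) for value in (0.85, 0.70, 0.40)}
```

With these invented floors, 0.85 lands in pass, 0.70 in warn, and 0.40 in fail.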

Rollup

When a suite contains multiple pipelines, the RollupScorer (src/evaluar/scoring/rollup.py:71) aggregates their verdicts into a single suite-level verdict using the pass_threshold and warn_threshold declared in the manifest. The RunnerResult.rollup_scorecard is what evaluar report show displays at the top of the run.
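
The aggregation rule itself is not spelled out on this page. One plausible reading, sketched below, treats pass_threshold and warn_threshold as minimum fractions of pipelines whose verdict is pass. Both the rollup function and that interpretation are assumptions made for illustration; RollupScorer may weigh verdicts differently.

```python
from typing import Mapping


def rollup(
    verdicts: Mapping[str, str], pass_threshold: float, warn_threshold: float
) -> str:
    """Hypothetical suite-level rollup based on the share of passing pipelines."""
    if not verdicts:
        raise ValueError("cannot roll up an empty suite")
    passed = sum(1 for v in verdicts.values() if v == "pass")
    pass_rate = passed / len(verdicts)
    # Assumption: both thresholds are inclusive fractions in [0, 1].
    if pass_rate >= pass_threshold:
        return "pass"
    if pass_rate >= warn_threshold:
        return "warn"
    return "fail"


# Invented per-pipeline verdicts keyed by model id.
pipeline_verdicts = {"model_a": "pass", "model_b": "pass", "model_c": "fail"}
suite_verdict = rollup(pipeline_verdicts, pass_threshold=0.9, warn_threshold=0.6)
```

Two of three pipelines pass here, so under this reading the suite falls between the two thresholds and is reported as warn.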

Putting it together

A complete code-first suite:

# eval_layout_detector.py
from evaluar.api import detection, suite

def my_model(image_url: str) -> dict:
    ...

def build_suite(sample_ids=None, config=None):
    pipeline = (
        detection("my_model")
        .callable(my_model)
        .inputs({"sample_001": {"image_url": "..."}})
        .ground_truth({"sample_001": {"objects": [...]}})
        .default_mapping()
        .build()
    )
    s = suite(sample_ids=sample_ids or ["sample_001"], suite_name="my_eval")
    s.add_pipeline("my_model", pipeline)
    return s

This is the build_suite contract Evaluar discovers and runs. See Suites for the full discovery model.
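
Evaluar's own discovery logic is covered elsewhere and is not reproduced here. As a generic sketch of what "discovering" a build_suite(sample_ids=None, config=None) function in a file can look like, the example below loads a Python file by path with importlib and calls the function it finds. load_build_suite and the stub file are hypothetical; the stub returns a plain dict instead of an EvaluarSuite so the sketch runs without Evaluar installed.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_build_suite(path: Path):
    """Import a Python file by path and return its build_suite function."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot import {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    build = getattr(module, "build_suite", None)
    if not callable(build):
        raise AttributeError(f"{path} does not define build_suite()")
    return build


# Stand-in suite file: a real one would build and return an EvaluarSuite.
STUB_SOURCE = '''
def build_suite(sample_ids=None, config=None):
    return {"sample_ids": sample_ids or ["sample_001"], "config": config}
'''

with tempfile.TemporaryDirectory() as tmp:
    stub_path = Path(tmp) / "eval_stub.py"
    stub_path.write_text(STUB_SOURCE)
    build_suite = load_build_suite(stub_path)
    built = build_suite(sample_ids=["sample_002"])
```

Keeping both parameters optional, as in the contract above, means the same function can be called with no arguments or with an explicit subset of sample ids.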
