evaluar

Python API

The public Python surface — PipelineBuilder, EvaluarSuite, helpers, and what RunnerResult contains.

The public Python surface lives in src/evaluar/api.py. Everything documented here is verified against that file.

Helpers

from evaluar.api import (
    PipelineBuilder,
    EvaluarSuite,
    FunctionNormalizer,
    detection,
    merged,
    ocr,
    suite,
    table,
    normalizer,
)
| Symbol | Kind | Source |
| --- | --- | --- |
| PipelineBuilder | class | src/evaluar/api.py:94 |
| EvaluarSuite | class | src/evaluar/api.py:320 |
| FunctionNormalizer | class | src/evaluar/api.py:36 |
| detection(model_id) | function | Detection pipeline builder. |
| ocr(model_id) | function | OCR pipeline builder. |
| table(model_id) | function | Table pipeline builder. |
| merged(model_id) | function | Merged-output pipeline builder. |
| suite(...) | function | Suite builder. |
| normalizer(...) | decorator | Function normalizer adapter. |

PipelineBuilder

A fluent builder for a single pipeline. Construct via PipelineBuilder.for_task(task_type, model_id) or the task helpers detection(...), ocr(...), table(...), and merged(...).

Connectors

| Method | Returns | Purpose |
| --- | --- | --- |
| .callable(fn, model_version=None) | self | Use a Python callable as the model. |
| .fixture(path) | self | Use a JSON file of pre-recorded responses. |
| .http(base_url, endpoint, timeout=30.0, headers=None, request_transform=None, model_version=None) | self | Use an HTTP endpoint. |
| .connector(connector) | self | Use a custom BaseConnector subclass. |

Inputs and ground truth

| Method | Returns | Purpose |
| --- | --- | --- |
| .inputs(data) | self | Per-sample input dict (e.g. image_url). |
| .ground_truth(data) | self | In-memory ground truth, keyed by sample id. |
| .ground_truth_file(path) | self | Load ground truth from a JSON file. |
| .supports(*model_ids) | self | Restrict the pipeline to a set of model ids. |

Normalization

| Method | Returns | Purpose |
| --- | --- | --- |
| .default_mapping(label_map=None) | self | Use the task's default JMESPath mapping, optionally remapping detection labels before scoring. |
| .mapping(mapping: dict, label_map=None) | self | Custom JMESPath mapping per output field. |
| .label_map(mapping: dict[str, str]) | self | Add a direct label canonicalization map to the current/default mapping normalizer. |
| .normalizer(normalizer) | self | Use a custom BaseNormalizer (e.g. FunctionNormalizer). |
| .llm_normalizer(provider, model=None, system_prompt=None, api_key=None) | self | Use an LLM to extract a canonical prediction. |
| .transformation_normalizer(transformation, provider, model=None, api_key=None) | self | LLM-based transformation normalizer. |
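
A label map of the kind accepted by .label_map(...) and label_map= is a plain dict[str, str] from raw label to canonical label. The sketch below shows one plausible way such a map could be applied before scoring. The helper name apply_label_map and the pass-through behavior for unmapped labels are assumptions made for illustration, not evaluar's implementation.

```python
def apply_label_map(objects: list[dict], label_map: dict[str, str]) -> list[dict]:
    """Return copies of the objects with their labels canonicalized.

    Assumption: labels missing from the map are passed through unchanged.
    """
    return [
        {**obj, "label": label_map.get(obj["label"], obj["label"])}
        for obj in objects
    ]


raw_objects = [
    {"label": "automobile", "score": 0.91},
    {"label": "person", "score": 0.88},
]
canonical = apply_label_map(raw_objects, {"automobile": "car"})
print([obj["label"] for obj in canonical])
```

The input list is left untouched, so the raw labels stay available for debugging.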

Scorer & identity

| Method | Returns | Purpose |
| --- | --- | --- |
| .scorer(config: ScorerConfig) | self | Override the scorer config. |
| .thresholds({...}, metric=(pass, warn)) | self | Update scorer thresholds. |
| .gated_metrics(*names) | self | Set which metrics determine the verdict. |
| .image_cache(path) | self | Set the remote-image cache directory. |
| .artifacts(image_cache_dir=...) | self | Configure artifact-related pipeline options. |
| .run_id(run_id) | self | Pin the run id (otherwise auto-generated). |
| .build() | BasePipeline | Materialize the pipeline. |

These fluent methods are config builders. They update the existing *ScorerConfig or PipelineConfig; they do not add hidden scoring rules.
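
The "config builder" idea can be illustrated with a small stand-alone sketch: each fluent method only edits a config object and returns self, so the final config is the single source of scoring behavior. SketchBuilder and SketchScorerConfig are hypothetical stand-ins with illustrative field names. They are not evaluar classes, and the real builder may store its state differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchScorerConfig:
    # Hypothetical stand-in for a *ScorerConfig; field names are illustrative.
    thresholds: dict = field(default_factory=dict)
    gated_metrics: list = field(default_factory=list)


class SketchBuilder:
    """Hypothetical builder whose methods only edit config and return self."""

    def __init__(self) -> None:
        self.config = SketchScorerConfig()

    def thresholds(self, mapping: Optional[dict] = None, **metrics: tuple) -> "SketchBuilder":
        # Accepts a dict and/or metric=(pass, warn) keywords, merged into the config.
        self.config.thresholds.update(mapping or {})
        self.config.thresholds.update(metrics)
        return self

    def gated_metrics(self, *names: str) -> "SketchBuilder":
        self.config.gated_metrics = list(names)
        return self


builder = SketchBuilder().thresholds({"iou": (0.7, 0.5)}, f1=(0.9, 0.8)).gated_metrics("f1")
print(builder.config)
```

Because every call returns the same builder, the chain reads top to bottom and the resulting config can be inspected before anything runs.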

EvaluarSuite

Constructed by suite(...). Bundles one or more pipelines into a runnable unit and writes the result to disk.

suite(...) arguments

def suite(
    sample_ids: list[str],
    run_id: str = "",
    suite_name: str | None = None,
    results_dir: str | Path | None = None,
    definition_path: str | Path | None = None,
    source: str = "CLI",
    rollup_config: RollupConfig | None = None,
    event_sinks: list[RunEventSink] | None = None,
    strict_event_sinks: bool = False,
) -> EvaluarSuite: ...

Methods

| Method | Returns | Purpose |
| --- | --- | --- |
| .add_pipeline(model_id, pipeline) | self | Register a built pipeline. |
| .rollup(pass_threshold=..., warn_threshold=...) | self | Configure suite-level aggregation. |
| .build_runner() | PipelineRunner | Construct the underlying runner. |
| .metadata() | dict | Suite metadata snapshot, embedded in the saved run. |
| .run(save=False) | RunnerResult | Execute the suite synchronously. |
| .run_async(save=False) | RunnerResult (awaitable) | Async variant. |
| .save_result(result, results_dir=None) | Path | Persist a RunnerResult after the fact. |
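
The relationship between a synchronous run and an awaitable variant can be sketched with the standard library alone. SketchSuite below is a hypothetical stand-in. Whether evaluar's .run(...) is implemented as a wrapper around .run_async(...) is not stated on this page, so treat this only as an illustration of how an awaitable result is driven from synchronous code.

```python
import asyncio


class SketchSuite:
    """Hypothetical stand-in exposing a sync/async pair of entry points."""

    async def run_async(self, save: bool = False) -> dict:
        await asyncio.sleep(0)  # placeholder for real pipeline work
        return {"run_id": "sketch-run", "saved": save}

    def run(self, save: bool = False) -> dict:
        # A synchronous entry point can drive the coroutine to completion.
        return asyncio.run(self.run_async(save=save))


result = SketchSuite().run(save=True)
print(result)
```

Inside an already running event loop, the awaitable variant would be awaited directly instead of being passed to asyncio.run.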

RunnerResult

Returned by EvaluarSuite.run(...). Defined as a dataclass in src/evaluar/pipelines/runner.py.

class RunnerResult:
    run_id: str
    elapsed_seconds: float
    failed_pipelines: list[str]
    rollup_scorecard: RollupScorecard
    pipeline_results: dict[str, PipelineResult]
    metadata: dict

RollupScorecard and PipelineResult (with its final_scorecard and per_sample_scorecards) live in src/evaluar/schemas/scorecard.py.
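
Since the result is a record of plain fields, a summary can be assembled by reading them directly. The sketch below uses ResultSketch, a hypothetical stand-in that carries a subset of the fields listed above, and summarize, a hypothetical helper. It assumes every entry in failed_pipelines is also a key of pipeline_results.

```python
from dataclasses import dataclass, field


@dataclass
class ResultSketch:
    # Stand-in carrying a subset of the RunnerResult fields listed above.
    run_id: str
    elapsed_seconds: float
    failed_pipelines: list = field(default_factory=list)
    pipeline_results: dict = field(default_factory=dict)


def summarize(result: ResultSketch) -> str:
    """Build a one-line summary from plain fields (hypothetical helper)."""
    total = len(result.pipeline_results)
    failed = len(result.failed_pipelines)
    return (
        f"run {result.run_id}: {total - failed}/{total} pipelines ok "
        f"in {result.elapsed_seconds:.1f}s"
    )


demo = ResultSketch("r-001", 12.34, ["ocr"], {"ocr": None, "detection": None})
print(summarize(demo))  # run r-001: 1/2 pipelines ok in 12.3s
```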

Run events

Pass event_sinks=[...] to suite(...) or RunnerConfig(...) to observe execution. Sinks implement:

class RunEventSink(Protocol):
    def emit(self, event: RunEvent) -> None: ...

Events are immutable snapshots from evaluar.events, such as RunStarted, PipelineStarted, SampleCompleted, PipelineCompleted, RunCompleted, RunError, LLMActivityStarted, and LLMActivityCompleted. Sinks receive serialized scorecard copies, not live pipeline or scorecard objects.
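
A sink only needs an emit method, so a collecting sink for tests or logging can be very small. The sketch below is self-contained: SketchEvent is a hypothetical stand-in for the real event classes (whose fields are not listed on this page), and notify is an illustrative dispatcher, not evaluar's runner.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class SketchEvent:
    # Hypothetical stand-in for a RunEvent; the real fields are not shown here.
    kind: str
    run_id: str


class EventSink(Protocol):
    def emit(self, event: SketchEvent) -> None: ...


class CollectingSink:
    """Stores every event it receives, in arrival order."""

    def __init__(self) -> None:
        self.events: list = []

    def emit(self, event: SketchEvent) -> None:
        self.events.append(event)


def notify(sinks: list, event: SketchEvent) -> None:
    # Fan the same immutable event out to every registered sink.
    for sink in sinks:
        sink.emit(event)


sink = CollectingSink()
notify([sink], SketchEvent("RunStarted", "r-001"))
notify([sink], SketchEvent("RunCompleted", "r-001"))
print([event.kind for event in sink.events])
```

CollectingSink never inherits from the protocol; matching the emit signature is enough for structural typing.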

RunnerResult is a plain data container, not a service object. It does not expose .summary(), .failures(), .open_tui(), or .compare(other). To open a result in the TUI, save it with run(save=True) and then run evaluar report show <run_id> (or plain evaluar to land on the home view).

Custom normalizers

Decorate any function with @normalizer to use it via .normalizer(...):

from evaluar.api import normalizer

@normalizer
def normalize_my_model(raw: dict) -> dict:
    return {
        "objects": [
            {"label": x["name"], "bbox": x["bounds"], "score": x["confidence"]}
            for x in raw["detections"]
        ]
    }
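
To see the shape this function produces, its body can be exercised without the decorator. The payload values below are made up for illustration; only the key names come from the example above.

```python
def normalize_my_model(raw: dict) -> dict:
    # Same body as the example above, without the @normalizer decorator.
    return {
        "objects": [
            {"label": x["name"], "bbox": x["bounds"], "score": x["confidence"]}
            for x in raw["detections"]
        ]
    }


# Illustrative raw model output with a single detection.
sample = {
    "detections": [
        {"name": "car", "bounds": [10, 20, 110, 220], "confidence": 0.93},
    ]
}
print(normalize_my_model(sample))
```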

Options the decorator accepts:

| Argument | Default | Purpose |
| --- | --- | --- |
| supported_models | None | Restrict the normalizer to specific model ids. |
| run_in_thread | False | Run in a worker thread (use this for blocking I/O like LLM calls so the Textual loop isn't blocked when used in the TUI). |
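
The reason for offloading blocking work can be shown with the standard library. The sketch below uses asyncio.to_thread (Python 3.9+) to run a blocking function while a heartbeat task keeps ticking on the event loop. How evaluar implements run_in_thread internally is not stated on this page; blocking_normalize and the heartbeat are hypothetical and exist only to demonstrate the general technique.

```python
import asyncio
import time


def blocking_normalize(raw: dict) -> dict:
    time.sleep(0.05)  # stands in for blocking I/O such as an LLM call
    return {"text": raw["text"].strip()}


async def main() -> tuple:
    ticks = 0

    async def heartbeat() -> None:
        # Represents other work on the loop, e.g. UI updates.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    beat = asyncio.create_task(heartbeat())
    # The blocking call runs in a worker thread; the loop stays responsive.
    result = await asyncio.to_thread(blocking_normalize, {"text": "  hello  "})
    beat.cancel()
    return result, ticks


result, ticks = asyncio.run(main())
print(result, ticks)
```

Calling blocking_normalize directly inside the coroutine would stall the heartbeat for the full duration of the sleep.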

Custom scorers

Custom scoring is done by subclassing BaseScorer (src/evaluar/scoring/base.py:47). There is no runtime @scorer decorator today — adding a new modality means adding a new scorer class plus its *ScorerConfig schema, the way the built-ins do.

Schemas

The canonical prediction / ground-truth / scorecard schemas live in src/evaluar/schemas/:

  • predictions.py — what normalizers must produce.
  • ground_truth.py — what .ground_truth(...) accepts.
  • scorecard.py — what scorers emit.
  • base.py — shared types (e.g. Verdict).

Importing from evaluar.schemas is supported (from evaluar.schemas import GTDetection, GTDetectedObject, Verdict) and used by the e2e tests as the way to construct typed ground truth.
