Evaluar

The public Python surface — PipelineBuilder, EvaluarSuite, helpers, and what RunnerResult contains.

The public Python surface lives in src/evaluar/api.py. Everything documented here is verified against that file.

Helpers

from evaluar.api import (
    PipelineBuilder,
    EvaluarSuite,
    FunctionNormalizer,
    detection,
    merged,
    ocr,
    suite,
    table,
    normalizer,
)

Symbol	Kind	Source / purpose
`PipelineBuilder`	class	`src/evaluar/api.py:94`
`EvaluarSuite`	class	`src/evaluar/api.py:320`
`FunctionNormalizer`	class	`src/evaluar/api.py:36`
`detection(model_id)`	function	Detection pipeline builder.
`ocr(model_id)`	function	OCR pipeline builder.
`table(model_id)`	function	Table pipeline builder.
`merged(model_id)`	function	Merged-output pipeline builder.
`suite(...)`	function	Suite builder.
`normalizer(...)`	decorator	Function normalizer adapter.

PipelineBuilder

A fluent builder for a single pipeline. Construct via PipelineBuilder.for_task(task_type, model_id) or the task helpers detection(...), ocr(...), table(...), and merged(...).

Connectors

Method	Returns	Purpose
`.callable(fn, model_version=None)`	`self`	Use a Python callable as the model.
`.fixture(path)`	`self`	Use a JSON file of pre-recorded responses.
`.http(base_url, endpoint, timeout=30.0, headers=None, request_transform=None, model_version=None)`	`self`	Use an HTTP endpoint.
`.connector(connector)`	`self`	Use a custom `BaseConnector` subclass.

Inputs and ground truth

Method	Returns	Purpose
`.inputs(data)`	`self`	Per-sample input dict (e.g. `image_url`).
`.ground_truth(data)`	`self`	In-memory ground truth, keyed by sample id.
`.ground_truth_file(path)`	`self`	Load ground truth from a JSON file.
`.supports(*model_ids)`	`self`	Restrict the pipeline to a set of model ids.

Normalization

Method	Returns	Purpose
`.default_mapping(label_map=None)`	`self`	Use the task's default JMESPath mapping, optionally remapping detection labels before scoring.
`.mapping(mapping: dict, label_map=None)`	`self`	Custom JMESPath mapping per output field.
`.label_map(mapping: dict[str, str])`	`self`	Add a direct label canonicalization map to the current/default mapping normalizer.
`.normalizer(normalizer)`	`self`	Use a custom `BaseNormalizer` (e.g. `FunctionNormalizer`).
`.llm_normalizer(provider, model=None, system_prompt=None, api_key=None)`	`self`	Use an LLM to extract a canonical prediction.
`.transformation_normalizer(transformation, provider, model=None, api_key=None)`	`self`	LLM-based transformation normalizer.

Scorer & identity

Method	Returns	Purpose
`.scorer(config: ScorerConfig)`	`self`	Override the scorer config.
`.thresholds({...}, metric=(pass, warn))`	`self`	Update scorer thresholds.
`.gated_metrics(*names)`	`self`	Set which metrics determine the verdict.
`.image_cache(path)`	`self`	Set the remote-image cache directory.
`.artifacts(image_cache_dir=...)`	`self`	Configure artifact-related pipeline options.
`.run_id(run_id)`	`self`	Pin the run id (otherwise auto-generated).
`.build()`	`BasePipeline`	Materialize the pipeline.

These fluent methods are config builders. They update the existing *ScorerConfig or PipelineConfig; they do not add hidden scoring rules.
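As a rough illustration of how the builder methods above chain together, the sketch below wires up a callable-backed detection pipeline. The wrapper `build_detection_pipeline`, the model id `"my-detector"`, and the shapes of `predict_fn`, `inputs`, and `ground_truth` are assumptions made for this example; only the method names come from the tables above. The import sits inside the function so the snippet can be loaded without Evaluar installed.

```python
from typing import Any, Callable, Optional


def build_detection_pipeline(
    predict_fn: Callable[..., Any],
    inputs: dict,
    ground_truth: dict,
    run_id: Optional[str] = None,
):
    """Chain the documented builder methods into one pipeline (hypothetical helper)."""
    # Lazy import: keeps this module importable without Evaluar installed.
    from evaluar.api import detection

    builder = (
        detection("my-detector")     # assumed model id
        .callable(predict_fn)        # connector: a Python callable as the model
        .inputs(inputs)              # per-sample inputs (shape supplied by the caller)
        .ground_truth(ground_truth)  # in-memory ground truth keyed by sample id
        .default_mapping()           # the task's default JMESPath mapping
    )
    if run_id is not None:
        # Pin the run id instead of letting it be auto-generated.
        builder = builder.run_id(run_id)
    return builder.build()
```

Each documented method returns `self`, so optional steps such as `.run_id(...)` can be applied conditionally before the final `.build()`.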

EvaluarSuite

Constructed by suite(...). Bundles one or more pipelines into a runnable unit and can write the result to disk.

`suite(...)` arguments

def suite(
    sample_ids: list[str],
    run_id: str = "",
    suite_name: str | None = None,
    results_dir: str | Path | None = None,
    definition_path: str | Path | None = None,
    source: str = "CLI",
    rollup_config: RollupConfig | None = None,
    event_sinks: list[RunEventSink] | None = None,
    strict_event_sinks: bool = False,
) -> EvaluarSuite: ...

Methods

Method	Returns	Purpose
`.add_pipeline(model_id, pipeline)`	`self`	Register a built pipeline.
`.rollup(pass_threshold=..., warn_threshold=...)`	`self`	Configure suite-level aggregation.
`.build_runner()`	`PipelineRunner`	Construct the underlying runner.
`.metadata()`	`dict`	Suite metadata snapshot — embedded in the saved run.
`.run(save=False)`	`RunnerResult`	Execute the suite synchronously.
`.run_async(save=False)`	`RunnerResult` (awaitable)	Async variant.
`.save_result(result, results_dir=None)`	`Path`	Persist a `RunnerResult` after the fact.
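The following is a minimal sketch of registering pipelines on a suite and running it with persistence enabled. The helper name `run_and_save` and the suite name `"nightly-detection"` are hypothetical; the `suite(...)` keyword arguments and the `.add_pipeline(...)` / `.run(save=True)` calls follow the signatures listed above. The pipelines are assumed to have been built already with `.build()`.

```python
from pathlib import Path
from typing import Any, Mapping, Sequence, Union


def run_and_save(
    pipelines: Mapping[str, Any],
    sample_ids: Sequence[str],
    results_dir: Union[str, Path],
):
    """Register every built pipeline on one suite and run it (hypothetical helper)."""
    # Lazy import: keeps this module importable without Evaluar installed.
    from evaluar.api import suite

    bundle = suite(
        sample_ids=list(sample_ids),
        suite_name="nightly-detection",  # assumed name, for illustration only
        results_dir=results_dir,
    )
    for model_id, pipeline in pipelines.items():
        bundle.add_pipeline(model_id, pipeline)

    # save=True asks the suite to persist the RunnerResult as part of the run.
    return bundle.run(save=True)
```

When the result should be inspected before it is written, the same flow can call `.run()` with the default `save=False` and then `.save_result(result)` afterwards.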

RunnerResult

Returned by EvaluarSuite.run(...). Defined as a dataclass in src/evaluar/pipelines/runner.py.

class RunnerResult:
    run_id: str
    elapsed_seconds: float
    failed_pipelines: list[str]
    rollup_scorecard: RollupScorecard
    pipeline_results: dict[str, PipelineResult]
    metadata: dict

RollupScorecard and PipelineResult (with its final_scorecard and per_sample_scorecards) live in src/evaluar/schemas/scorecard.py.
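Because the result is a data object, any reporting is left to the caller. The sketch below shows one way to turn the fields listed above into a one-line summary. `describe_run` is a hypothetical helper, not part of Evaluar, and it reads only `run_id`, `elapsed_seconds`, `failed_pipelines`, and `pipeline_results`.

```python
def describe_run(result) -> str:
    """Format a one-line summary from the documented RunnerResult fields."""
    total = len(result.pipeline_results)
    failed = sorted(result.failed_pipelines)
    status = "ok" if not failed else "failed: " + ", ".join(failed)
    return (
        f"{result.run_id}: {total} pipeline(s) "
        f"in {result.elapsed_seconds:.1f}s ({status})"
    )
```

The helper relies only on attribute access, so it works with any object that carries those four fields.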

Run events

Pass event_sinks=[...] to suite(...) or RunnerConfig(...) to observe execution. Sinks implement:

class RunEventSink(Protocol):
    def emit(self, event: RunEvent) -> None: ...

Events are immutable snapshots from evaluar.events, such as RunStarted, PipelineStarted, SampleCompleted, PipelineCompleted, RunCompleted, RunError, LLMActivityStarted, and LLMActivityCompleted. Sinks receive serialized scorecard copies, not live pipeline or scorecard objects.
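A sink only needs an `emit` method, so a small class is enough. The sketch below counts events by class name; `CountingSink` is a hypothetical example, and it assumes each event is an instance of a class named as in the list above (for example `RunStarted`).

```python
from collections import Counter


class CountingSink:
    """Tally received events by their class name (hypothetical example sink)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def emit(self, event) -> None:
        # Events are snapshots, so the sink only reads them; it never mutates them.
        self.counts[type(event).__name__] += 1
```

An instance would be passed as `event_sinks=[CountingSink()]` to `suite(...)`, and its `counts` attribute inspected after the run.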

RunnerResult is a plain data container, not a service object. It does not expose .summary(), .failures(), .open_tui(), or .compare(other). To open a result in the TUI, save it with run(save=True) and run evaluar report show <run_id> (or run evaluar on its own to land on the home view).

Custom normalizers

Decorate any function with @normalizer to use it via .normalizer(...):

from evaluar.api import normalizer

@normalizer
def normalize_my_model(raw: dict) -> dict:
    return {
        "objects": [
            {"label": x["name"], "bbox": x["bounds"], "score": x["confidence"]}
            for x in raw["detections"]
        ]
    }

Options the decorator accepts:

Argument	Default	Purpose
`supported_models`	`None`	Restrict the normalizer to specific model ids.
`run_in_thread`	`False`	Run in a worker thread (use this for blocking I/O like LLM calls so the Textual loop isn't blocked when used in the TUI).
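The sketch below shows how the two options might be supplied. It assumes the decorator also accepts a keyword-call form, `@normalizer(...)`, and that `supported_models` takes a list of model ids; neither detail is confirmed here, so check `src/evaluar/api.py` before relying on it. The factory `make_threaded_normalizer` and the model id `"my-detector"` are hypothetical.

```python
def make_threaded_normalizer():
    """Build a normalizer restricted to one model id and run off the main loop."""
    # Lazy import: keeps this module importable without Evaluar installed.
    from evaluar.api import normalizer

    # Assumption: keyword-call form of the decorator, list-valued supported_models.
    @normalizer(supported_models=["my-detector"], run_in_thread=True)
    def normalize_my_model(raw: dict) -> dict:
        # Map the model's raw detections onto the canonical "objects" layout.
        return {
            "objects": [
                {"label": x["name"], "bbox": x["bounds"], "score": x["confidence"]}
                for x in raw["detections"]
            ]
        }

    return normalize_my_model
```

The returned object is what would be handed to `.normalizer(...)` on a pipeline builder.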

Custom scorers

Custom scoring is done by subclassing BaseScorer (src/evaluar/scoring/base.py:47). There is no runtime @scorer decorator today — adding a new modality means adding a new scorer class plus its *ScorerConfig schema, the way the built-ins do.

Schemas

The canonical prediction / ground-truth / scorecard schemas live in src/evaluar/schemas/:

predictions.py — what normalizers must produce.
ground_truth.py — what .ground_truth(...) accepts.
scorecard.py — what scorers emit.
base.py — shared types (e.g. Verdict).

Importing from evaluar.schemas is supported (from evaluar.schemas import GTDetection, GTDetectedObject, Verdict) and used by the e2e tests as the way to construct typed ground truth.
