Python API
The public Python surface — PipelineBuilder, EvaluarSuite, helpers, and what RunnerResult contains.
The public Python surface lives in src/evaluar/api.py. Everything documented here is verified against that file.
Helpers
```python
from evaluar.api import (
    PipelineBuilder,
    EvaluarSuite,
    FunctionNormalizer,
    detection,
    merged,
    ocr,
    suite,
    table,
    normalizer,
)
```

| Symbol | Kind | Source or purpose |
|---|---|---|
| `PipelineBuilder` | class | `src/evaluar/api.py:94` |
| `EvaluarSuite` | class | `src/evaluar/api.py:320` |
| `FunctionNormalizer` | class | `src/evaluar/api.py:36` |
| `detection(model_id)` | function | Detection pipeline builder. |
| `ocr(model_id)` | function | OCR pipeline builder. |
| `table(model_id)` | function | Table pipeline builder. |
| `merged(model_id)` | function | Merged-output pipeline builder. |
| `suite(...)` | function | Suite builder. |
| `normalizer(...)` | decorator | Function normalizer adapter. |
PipelineBuilder
A fluent builder for a single pipeline. Construct via PipelineBuilder.for_task(task_type, model_id) or the task helpers detection(...), ocr(...), table(...), and merged(...).
Connectors
| Method | Returns | Purpose |
|---|---|---|
| `.callable(fn, model_version=None)` | self | Use a Python callable as the model. |
| `.fixture(path)` | self | Use a JSON file of pre-recorded responses. |
| `.http(base_url, endpoint, timeout=30.0, headers=None, request_transform=None, model_version=None)` | self | Use an HTTP endpoint. |
| `.connector(connector)` | self | Use a custom BaseConnector subclass. |
Inputs and ground truth
| Method | Returns | Purpose |
|---|---|---|
| `.inputs(data)` | self | Per-sample input dict (e.g. image_url). |
| `.ground_truth(data)` | self | In-memory ground truth, keyed by sample id. |
| `.ground_truth_file(path)` | self | Load ground truth from a JSON file. |
| `.supports(*model_ids)` | self | Restrict the pipeline to a set of model ids. |
Normalization
| Method | Returns | Purpose |
|---|---|---|
| `.default_mapping(label_map=None)` | self | Use the task's default JMESPath mapping, optionally remapping detection labels before scoring. |
| `.mapping(mapping: dict, label_map=None)` | self | Custom JMESPath mapping per output field. |
| `.label_map(mapping: dict[str, str])` | self | Add a direct label canonicalization map to the current/default mapping normalizer. |
| `.normalizer(normalizer)` | self | Use a custom BaseNormalizer (e.g. FunctionNormalizer). |
| `.llm_normalizer(provider, model=None, system_prompt=None, api_key=None)` | self | Use an LLM to extract a canonical prediction. |
| `.transformation_normalizer(transformation, provider, model=None, api_key=None)` | self | LLM-based transformation normalizer. |
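As a rough illustration of what a direct label canonicalization map does, the sketch below applies a `dict[str, str]` to a list of predicted labels. The helper name `canonicalize_labels` and the sample labels are hypothetical, and passing unmapped labels through unchanged is an assumption rather than confirmed behavior of `.label_map(...)`.

```python
def canonicalize_labels(labels: list[str], label_map: dict[str, str]) -> list[str]:
    """Replace each label with its canonical form when the map has an entry.

    Assumption: labels missing from the map pass through unchanged.
    """
    return [label_map.get(label, label) for label in labels]


# Hypothetical model vocabulary mapped onto hypothetical ground-truth labels.
label_map = {"automobile": "car", "person_walking": "person"}
canonical = canonicalize_labels(["automobile", "person_walking", "dog"], label_map)
print(canonical)
```

Keeping the map as plain data means the same dictionary can be reused wherever predicted labels and ground-truth labels need to agree before scoring.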
Scorer & identity
| Method | Returns | Purpose |
|---|---|---|
| `.scorer(config: ScorerConfig)` | self | Override the scorer config. |
| `.thresholds({...}, metric=(pass, warn))` | self | Update scorer thresholds. |
| `.gated_metrics(*names)` | self | Set which metrics determine the verdict. |
| `.image_cache(path)` | self | Set the remote-image cache directory. |
| `.artifacts(image_cache_dir=...)` | self | Configure artifact-related pipeline options. |
| `.run_id(run_id)` | self | Pin the run id (otherwise auto-generated). |
| `.build()` | BasePipeline | Materialize the pipeline. |
These fluent methods are config builders. They update the existing *ScorerConfig or PipelineConfig; they do not add hidden scoring rules.
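The "config builder" idea can be sketched without evaluar at all: each fluent method edits a config object and returns `self`, so calls chain and nothing happens beyond the config update. The classes `SketchScorerConfig` and `SketchBuilder` and the metric name `f1` are hypothetical stand-ins, not evaluar's real types. The keyword-only `thresholds` signature is also a simplification of the method listed in the table.

```python
from dataclasses import dataclass, field


@dataclass
class SketchScorerConfig:
    """Hypothetical stand-in for a *ScorerConfig; not evaluar's real schema."""

    thresholds: dict[str, tuple[float, float]] = field(default_factory=dict)
    gated_metrics: list[str] = field(default_factory=list)


class SketchBuilder:
    """Hypothetical fluent builder: each method edits config and returns self."""

    def __init__(self) -> None:
        self.config = SketchScorerConfig()

    def thresholds(self, **metrics: tuple[float, float]) -> "SketchBuilder":
        # Merge (pass, warn) pairs into the existing config.
        self.config.thresholds.update(metrics)
        return self

    def gated_metrics(self, *names: str) -> "SketchBuilder":
        # Replace the list of metrics that decide the verdict.
        self.config.gated_metrics = list(names)
        return self


builder = SketchBuilder().thresholds(f1=(0.9, 0.8)).gated_metrics("f1")
print(builder.config)
```

Because every call only touches the config, the final state can be inspected before anything is built or run.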
EvaluarSuite
Constructed by suite(...). Bundles one-or-more pipelines into a runnable unit and writes the result to disk.
suite(...) arguments
```python
def suite(
    sample_ids: list[str],
    run_id: str = "",
    suite_name: str | None = None,
    results_dir: str | Path | None = None,
    definition_path: str | Path | None = None,
    source: str = "CLI",
    rollup_config: RollupConfig | None = None,
    event_sinks: list[RunEventSink] | None = None,
    strict_event_sinks: bool = False,
) -> EvaluarSuite: ...
```

Methods
| Method | Returns | Purpose |
|---|---|---|
| `.add_pipeline(model_id, pipeline)` | self | Register a built pipeline. |
| `.rollup(pass_threshold=..., warn_threshold=...)` | self | Configure suite-level aggregation. |
| `.build_runner()` | PipelineRunner | Construct the underlying runner. |
| `.metadata()` | dict | Suite metadata snapshot, embedded in the saved run. |
| `.run(save=False)` | RunnerResult | Execute the suite synchronously. |
| `.run_async(save=False)` | RunnerResult (awaitable) | Async variant. |
| `.save_result(result, results_dir=None)` | Path | Persist a RunnerResult after the fact. |
RunnerResult
Returned by EvaluarSuite.run(...). Defined as a dataclass in src/evaluar/pipelines/runner.py.
```python
class RunnerResult:
    run_id: str
    elapsed_seconds: float
    failed_pipelines: list[str]
    rollup_scorecard: RollupScorecard
    pipeline_results: dict[str, PipelineResult]
    metadata: dict
```

`RollupScorecard` and `PipelineResult` (with its `final_scorecard` and `per_sample_scorecards`) live in `src/evaluar/schemas/scorecard.py`.
Run events
Pass event_sinks=[...] to suite(...) or RunnerConfig(...) to observe execution. Sinks implement:
```python
class RunEventSink(Protocol):
    def emit(self, event: RunEvent) -> None: ...
```

Events are immutable snapshots from `evaluar.events`, such as `RunStarted`, `PipelineStarted`, `SampleCompleted`, `PipelineCompleted`, `RunCompleted`, `RunError`, `LLMActivityStarted`, and `LLMActivityCompleted`. Sinks receive serialized scorecard copies, not live pipeline or scorecard objects.
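A minimal sketch of a sink that follows the one-method `emit` shape described above. To stay self-contained it does not import evaluar: `SketchEvent`, its `kind` and `run_id` fields, `SketchSink`, and the `notify` helper are all hypothetical stand-ins, since the real event fields are not listed here.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class SketchEvent:
    """Hypothetical stand-in for an immutable run event snapshot."""

    kind: str
    run_id: str


class SketchSink(Protocol):
    """Same one-method shape as the sink protocol: emit(event) -> None."""

    def emit(self, event: SketchEvent) -> None: ...


class CollectingSink:
    """Stores every event it receives, in arrival order."""

    def __init__(self) -> None:
        self.events: list[SketchEvent] = []

    def emit(self, event: SketchEvent) -> None:
        self.events.append(event)


def notify(sinks: list[SketchSink], event: SketchEvent) -> None:
    # Deliver one event to every registered sink.
    for sink in sinks:
        sink.emit(event)


sink = CollectingSink()
notify([sink], SketchEvent(kind="RunStarted", run_id="demo"))
notify([sink], SketchEvent(kind="RunCompleted", run_id="demo"))
print([event.kind for event in sink.events])
```

Because the protocol is structural, `CollectingSink` needs no base class; any object with a matching `emit` method qualifies.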
`RunnerResult` is a plain data container, not a service object. It does not expose `.summary()`, `.failures()`, `.open_tui()`, or `.compare(other)`. To open a result in the TUI, save it (`run(save=True)`) and use `evaluar report show <run_id>` (or `evaluar` to land on the home view).
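Since the result carries data but no convenience methods, a small helper can build a summary from the documented fields. The sketch below is duck-typed and uses a `SimpleNamespace` stand-in rather than a real `RunnerResult`. The helper name `summarize` and the sample values are hypothetical.

```python
from types import SimpleNamespace


def summarize(result) -> str:
    """Build a one-line summary from documented RunnerResult fields only."""
    return (
        f"run {result.run_id}: "
        f"{len(result.pipeline_results)} pipeline results, "
        f"{len(result.failed_pipelines)} failed, "
        f"{result.elapsed_seconds:.1f}s"
    )


# Stand-in object with the same attribute names; not a real RunnerResult.
fake_result = SimpleNamespace(
    run_id="demo-run",
    elapsed_seconds=12.34,
    failed_pipelines=["ocr-model"],
    pipeline_results={"ocr-model": None, "det-model": None},
)
summary = summarize(fake_result)
print(summary)
```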
Custom normalizers
Decorate any function with @normalizer to use it via .normalizer(...):
```python
from evaluar.api import normalizer


@normalizer
def normalize_my_model(raw: dict) -> dict:
    return {
        "objects": [
            {"label": x["name"], "bbox": x["bounds"], "score": x["confidence"]}
            for x in raw["detections"]
        ]
    }
```

Options the decorator accepts:
| Argument | Default | Purpose |
|---|---|---|
| `supported_models` | None | Restrict the normalizer to specific model ids. |
| `run_in_thread` | False | Run in a worker thread (use this for blocking I/O like LLM calls so the Textual loop isn't blocked when used in the TUI). |
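To make the input and output shapes concrete, the sketch below runs the body of the example normalizer from this section on a sample payload, without the decorator so it needs no evaluar import. The payload values are made up, and the four-number `bounds` list is a placeholder because the bounding-box coordinate convention is not specified here.

```python
def normalize_my_model(raw: dict) -> dict:
    # Same mapping as the decorated example: rename fields per detection.
    return {
        "objects": [
            {"label": x["name"], "bbox": x["bounds"], "score": x["confidence"]}
            for x in raw["detections"]
        ]
    }


# Hypothetical raw model response with a single detection.
raw = {
    "detections": [
        {"name": "car", "bounds": [10, 20, 110, 220], "confidence": 0.93},
    ]
}
normalized = normalize_my_model(raw)
print(normalized)
```

Checking a normalizer as a plain function like this is a quick way to confirm the field renaming before wiring it into a pipeline.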
Custom scorers
Custom scoring is done by subclassing BaseScorer (src/evaluar/scoring/base.py:47). There is no runtime @scorer decorator today — adding a new modality means adding a new scorer class plus its *ScorerConfig schema, the way the built-ins do.
Schemas
The canonical prediction / ground-truth / scorecard schemas live in src/evaluar/schemas/:
- `predictions.py`: what normalizers must produce.
- `ground_truth.py`: what `.ground_truth(...)` accepts.
- `scorecard.py`: what scorers emit.
- `base.py`: shared types (e.g. `Verdict`).
Importing from evaluar.schemas is supported (from evaluar.schemas import GTDetection, GTDetectedObject, Verdict) and used by the e2e tests as the way to construct typed ground truth.