Evaluar

Turning a raw model response into Evaluar's canonical prediction schema — JMESPath mappings, function normalizers, and LLM-backed extraction.

Models return whatever shape they want. Evaluar's scorers expect a canonical prediction schema (src/evaluar/schemas/predictions.py). The normalizer is the bridge.

There are three ways to specify one. They live behind the PipelineBuilder chain.

JMESPath mapping

The default for most pipelines. Declarative, no Python required for simple shapes.

pipeline = (
    detection("my_model")
    .callable(my_model)
    .mapping({
        "objects": "prediction[*].{label: label_name, bbox: box, score: score, class_id: label_id}"
    })
    ...
)

Per-task default mappings cover the common case:

.default_mapping()   # Use the canonical mapping for this task type.

This is what evaluar init puts in the scaffold for detection projects.

.mapping(...) and .default_mapping() are mutually exclusive. The mapping dict's keys are output fields on the canonical prediction; the values are JMESPath expressions evaluated against the raw response.

Detection mappings also accept a lightweight label_map for canonicalizing model labels before scoring:

pipeline = (
    detection("door_window_model")
    .http(base_url="http://localhost:8000", endpoint="/predict")
    .default_mapping(label_map={"door": "Door", "window": "Window"})
    ...
)

The same option works with custom mappings:

.mapping(
    {"objects": "detections[*].{label: type, bbox: bbox, score: confidence}"},
    label_map={"door": "Door", "window": "Window"},
)

Use a full function normalizer when canonicalization needs more than a direct label-to-label lookup.

Function normalizer

When the shape is too irregular for JMESPath, use a Python function.

from evaluar.api import normalizer

@normalizer
def normalize_my_model(raw: dict) -> dict:
    return {
        "objects": [
            {"label": x["name"], "bbox": x["bounds"], "score": x["confidence"]}
            for x in raw["detections"]
        ]
    }

pipeline = (
    detection("my_model")
    .callable(my_model)
    .normalizer(normalize_my_model)
    ...
)

The decorator (src/evaluar/api.py:431) accepts:

Argument	Default	Purpose
`supported_models`	`None`	Restrict the normalizer to specific model ids.
`run_in_thread`	`False`	Run in a worker thread. Set this when the function blocks on I/O — e.g. an LLM call — so the Textual loop is not blocked when the suite runs from the TUI.

Functions can return either a typed prediction model (e.g. an instance from evaluar.schemas) or a canonical dict; dict outputs are validated against the task's prediction schema.

LLM normalizer

When the model's output is unstructured text, you can use a hosted LLM to extract a canonical prediction.

pipeline = (
    detection("my_model")
    .http(base_url="...", endpoint="/predict")
    .llm_normalizer(provider="openai", model="gpt-4o-mini")
    ...
)

Argument	Default	Purpose
`provider`	required	One of `openai`, `google`.
`model`	provider default	Model name.
`system_prompt`	task default	Override the extraction prompt.
`api_key`	env-resolved	Override the API key.

api_key defaults are read from the standard provider env vars (OPENAI_API_KEY, GOOGLE_API_KEY).

Transformation normalizer

A variant of LLM normalization where the LLM transforms an already-structured response rather than extracting from raw text.

pipeline = (
    detection("my_model")
    ...
    .transformation_normalizer(
        transformation="rephrase_label_names",
        provider="openai",
        model="gpt-4o-mini",
    )
)

Use this when post-processing benefits from a model — e.g. canonicalizing free-form labels into a fixed taxonomy.

Normalizers in YAML

Mapping and LLM normalizers are expressible in evaluar.yaml:

models:
  my_model:
    type: detection
    normalizer:
      type: mapping
      mapping:
        objects: "prediction[*].{label: label_name, bbox: box, score: score, class_id: label_id}"
      label_map:
        door: Door
        window: Window

See YAML manifests.

What the canonical schema looks like

Detection (the most common case) expects:

{
    "objects": [
        {"label": str, "bbox": [x_min, y_min, x_max, y_max], "score": float, "class_id": int | None},
        ...
    ]
}

OCR and table schemas are documented in src/evaluar/schemas/predictions.py. Validation happens after normalization but before scoring; misshapen output surfaces as a sample-level error rather than crashing the suite.

Normalization