mirror of https://github.com/github/awesome-copilot.git synced 2026-04-30 12:15:56 +00:00

Files

Yiou Li 2860790bc9 update eval-driven-dev skill (#1434 )

* update eval-driven-dev skill

* fix: update skill update command to use correct repository path

* address comments.

* update eval driven dev

2026-04-28 11:27:48 +10:00

7.4 KiB

Raw Blame History

Step 2b: Implement the Runnable

For the full Runnable protocol and wrap() API, see wrap-api.md.

Goal: Write a Runnable class that lets the eval harness invoke the application exactly as a real user would.

The core idea

The Runnable is how pixie test and pixie trace run your application. Think of it as a programmatic stand-in for a real user: it starts the app, sends it a request, and lets the app do its thing. The eval harness calls run() for each test case, passing in the user's input parameters. The app processes those parameters through its real code — real routing, real prompt assembly, real LLM calls, real response formatting — and the harness observes what happens via the wrap() instrumentation from Step 2a.

This means the Runnable should be simple. It just wires up the app's real entry point to the harness interface. If your Runnable is getting complicated — if you're building custom logic, reimplementing app behavior, or replacing components — something is wrong.

Four requirements

1. Run the real production code

The Runnable calls the app's actual entry point — the same function, class, or endpoint a real user would trigger. It does not reimplement, shortcut, or substitute any part of the application.

This includes the LLM. The app's LLM calls must go through the real code path — do not mock, fake, or replace application components. The whole point of eval-based testing is that LLM outputs are non-deterministic, so you use evaluators (not assertions) to score them. If you replace any component with a fake, you've eliminated the real behavior and the eval measures nothing.

If the app won't run due to missing environment variables or configuration that you cannot resolve, stop and ask the user to fix the environment setup. Do not work around it by mocking components.

2. Represent start-up args with a Pydantic BaseModel

The run() method receives a Pydantic BaseModel whose fields are populated from the dataset's input_data. Define a subclass with the fields the app needs:

from pydantic import BaseModel

class AppArgs(BaseModel):
    user_message: str
    # Add more fields as the app's entry point requires.
    # These map 1:1 to the dataset input_data keys.

The fields must reflect what a real user actually provides. Read pixie_qa/00-project-analysis.md — the "Realistic input characteristics" section describes the complexity, scale, and variety of real inputs. Design the model to accept inputs at that level of realism, not simplified toy versions.

Understand the boundary between user-provided parameters and world data:

User-provided parameters (fields on the BaseModel): what a real user types or configures — prompts, queries, configuration flags, URLs, schema definitions.
World data (handled by wrap(purpose="input") in Step 2a): content the app fetches from external sources during execution — web pages, database records, API responses. This is NOT part of the BaseModel.

App type	BaseModel fields (user provides)	World data (wrap provides)
Web scraper	URL + prompt + schema definition	The HTML page content
Research agent	Research question + scope constraints	Source documents, search results
Customer support bot	Customer's spoken message	Customer profile from CRM, conversation history from session store
Code review tool	PR URL + review criteria	The actual diff, file contents, CI results

If a field ends up holding data the app would normally fetch itself, it probably belongs in a wrap(purpose="input") call instead of on the BaseModel.

3. Be concurrency-safe

run() is called concurrently for multiple dataset entries (up to 4 in parallel). If the app uses shared mutable state — SQLite, file-based DBs, global caches — protect access with asyncio.Semaphore:

import asyncio

class AppRunnable(pixie.Runnable[AppArgs]):
    _sem: asyncio.Semaphore

    @classmethod
    def create(cls) -> "AppRunnable":
        inst = cls()
        inst._sem = asyncio.Semaphore(1)
        return inst

    async def run(self, args: AppArgs) -> None:
        async with self._sem:
            await call_app(args.message)

Only add the semaphore when the app actually has shared mutable state. If the app uses per-request state (keyed by unique IDs) or is inherently stateless, concurrent calls are naturally isolated.

4. Adhere to the Runnable interface

class AppRunnable(pixie.Runnable[AppArgs]):
    @classmethod
    def create(cls) -> "AppRunnable": ...     # construct instance
    async def setup(self) -> None: ...        # once, before first run()
    async def run(self, args: AppArgs) -> None: ...  # per dataset entry, concurrent
    async def teardown(self) -> None: ...     # once, after last run()

create() — class method, returns a new instance. Use a quoted return type (-> "AppRunnable") to avoid forward reference errors.
setup() — optional async; initialize shared resources (HTTP clients, DB connections, servers).
run(args) — async; called per dataset entry. Invoke the app's real entry point here.
teardown() — optional async; clean up resources from setup().

Minimal example

# pixie_qa/run_app.py
from pydantic import BaseModel
import pixie


class AppArgs(BaseModel):
    user_message: str


class AppRunnable(pixie.Runnable[AppArgs]):
    """Drives the application for tracing and evaluation."""

    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def run(self, args: AppArgs) -> None:
        from myapp import handle_request
        await handle_request(args.user_message)

That's it. The Runnable imports the app's real entry point and calls it. No custom logic, no component replacement, no clever workarounds.

Architecture-specific examples

Based on how the application runs, read the corresponding example file:

App type	Entry point	Example file
Standalone function (no server)	Python function	Read `references/runnable-examples/standalone-function.md`
Web server (FastAPI, Flask)	HTTP/WebSocket endpoint	Read `references/runnable-examples/fastapi-web-server.md`
CLI application	Command-line invocation	Read `references/runnable-examples/cli-app.md`

Read only the example file that matches your app type.

File placement

Place the file at pixie_qa/run_app.py.
The dataset's "runnable" field references: "pixie_qa/run_app.py:AppRunnable".
The project root is automatically on sys.path, so use normal imports (from app import service).

Technical note

Do NOT use from __future__ import annotations in runnable files — it breaks Pydantic's model resolution for nested models. Use quoted return types where needed instead.

Output

pixie_qa/run_app.py — the Runnable class.

7.4 KiB Raw Blame History