# Production: Overview

CI/CD evals and production monitoring are complementary approaches: evals gate changes before deployment, while monitoring watches live traffic after release.

## Two Evaluation Modes

| Aspect | CI/CD Evals | Production Monitoring |
| ------ | ----------- | --------------------- |
| **When** | Pre-deployment | Post-deployment, ongoing |
| **Data** | Fixed dataset | Sampled traffic |
| **Goal** | Prevent regression | Detect drift |
| **Response** | Block deploy | Alert & analyze |

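Both modes run the same kind of evaluator functions. A minimal sketch of the shape the snippets below assume (the function body, the JSON check, and the result dict are illustrative placeholders, not a Phoenix API):

```python
import json

def has_required_format(output: str) -> dict:
    """Hypothetical evaluator: score 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return {"name": "has_required_format", "score": 1.0}
    except (TypeError, ValueError):
        return {"name": "has_required_format", "score": 0.0}
```
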
## CI/CD Evaluations

```python
from phoenix.client import Client

client = Client()

# Fast, deterministic checks
ci_evaluators = [
    has_required_format,
    no_pii_leak,
    safety_check,
    regression_test_suite,
]

# Small but representative dataset (~100 examples)
client.experiments.run_experiment(
    dataset=ci_dataset,
    task=task,
    evaluators=ci_evaluators,
)
```

Set per-evaluator thresholds: regression ≥ 0.95, safety = 1.0, format ≥ 0.98. Block the deploy if any threshold is missed.

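A minimal sketch of enforcing those thresholds in a CI job, assuming the experiment results have already been aggregated into mean scores per evaluator (the `THRESHOLDS` mapping, score shape, and exit-code convention are assumptions, not a Phoenix API):

```python
import sys

# Assumed per-evaluator pass thresholds (from the guidance above)
THRESHOLDS = {
    "regression_test_suite": 0.95,
    "safety_check": 1.00,
    "has_required_format": 0.98,
}

def gate_deploy(mean_scores: dict[str, float]) -> None:
    """Exit non-zero so the CI job fails and the deploy is blocked."""
    failures = {
        name: score
        for name, score in mean_scores.items()
        if score < THRESHOLDS.get(name, 1.0)
    }
    if failures:
        print(f"Eval thresholds not met: {failures}")
        sys.exit(1)
    print("All eval thresholds met")

# Example: scores aggregated from the experiment run above
gate_deploy({"regression_test_suite": 0.97, "safety_check": 1.0, "has_required_format": 0.99})
```
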
## Production Monitoring

### Python

```python
from phoenix.client import Client
from datetime import datetime, timedelta

client = Client()

# Sample recent traces (last hour)
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    include_spans=True,
    limit=100,
)

# Run evaluators on sampled traffic
for trace in traces:
    results = run_evaluators_async(trace, production_evaluators)
    if any(r["score"] < 0.5 for r in results):
        alert_on_failure(trace, results)
```
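
The loop above calls two helpers that the snippet leaves undefined. A synchronous stand-in might look like this (the names, the result dicts, and alerting via `print` are illustrative assumptions, not part of the Phoenix client):

```python
from typing import Any, Callable

def run_evaluators_async(trace: Any, evaluators: list[Callable[[Any], dict]]) -> list[dict]:
    """Run each evaluator against one trace; each returns {"name": ..., "score": float}."""
    return [evaluator(trace) for evaluator in evaluators]

def alert_on_failure(trace: Any, results: list[dict]) -> None:
    """Route low-scoring traces to whatever alerting channel the team uses."""
    failing = [r for r in results if r["score"] < 0.5]
    print(f"ALERT trace={getattr(trace, 'id', '?')} failing={failing}")
```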

### TypeScript

```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
import { getSpans } from "@arizeai/phoenix-client/spans";

// Sample recent traces (last hour)
const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});

// Or sample spans directly for evaluation
const { spans } = await getSpans({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  limit: 100,
});

// Run evaluators on sampled traffic
for (const span of spans) {
  const results = await runEvaluators(span, productionEvaluators);
  if (results.some((r) => r.score < 0.5)) {
    await alertOnFailure(span, results);
  }
}
```

Prioritize what to evaluate: traces with errors first, then negative user feedback, then a random sample; one way to implement that ordering is sketched below.

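A minimal sampling sketch under that priority order, assuming each trace exposes hypothetical `has_error` and `feedback_score` attributes:

```python
import random

def sample_for_eval(traces: list, budget: int = 100) -> list:
    """Pick traces to evaluate: errors first, then negative feedback, then random fill."""
    errors = [t for t in traces if getattr(t, "has_error", False)]
    negative = [
        t for t in traces
        if t not in errors and getattr(t, "feedback_score", 0) < 0
    ]
    rest = [t for t in traces if t not in errors and t not in negative]
    random.shuffle(rest)
    return (errors + negative + rest)[:budget]
```
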
## Feedback Loop

```
Production finds failure → Error analysis → Add to CI dataset → Prevents future regression
```
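
The middle step can be as simple as appending the failing example to the CI dataset. A sketch, where the JSONL path and field names are placeholders for however the CI dataset is stored:

```python
import json
from pathlib import Path

CI_DATASET_PATH = Path("ci_dataset.jsonl")  # placeholder: wherever the CI dataset lives

def add_failure_to_ci_dataset(input_text: str, expected_output: str, failure_reason: str) -> None:
    """Append a production failure as a regression example for future CI eval runs."""
    example = {
        "input": input_text,
        "expected": expected_output,
        "metadata": {"source": "production", "failure_reason": failure_reason},
    }
    with CI_DATASET_PATH.open("a") as f:
        f.write(json.dumps(example) + "\n")
```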