Quality Harness

Overview

The Quality Harness is V5’s self-improvement loop. It automatically samples tool-call traces, evaluates them with three independent LLM judges, and surfaces regressions before they affect production. Agents can also send feedback directly via `submit_feedback`.

How grading works

Every tool call that passes the sampling threshold is evaluated by a tri-judge panel:

Judge	Model	Provider
Alpha	`claude-haiku-4-5-20251001`	Anthropic (via AI Gateway)
Beta	`gpt-4o-mini`	OpenAI (via AI Gateway)
Gamma	`gemini-1.5-flash`	Google (via AI Gateway)

All three judges run in parallel and evaluate the same trace against the same rubric. Their verdicts are stored in the `grader_verdicts` D1 table. Disagreements surface router errors: if the judges disagree on the domain category, treat it as a bug in the categorical router.
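
As a rough illustration of the panel described above, the sketch below fans one trace out to several judges concurrently and then checks whether their category labels agree. The `Judge` type, `runPanel`, and `hasCategoryDisagreement` are hypothetical names introduced here; only the verdict field names come from this document, and the real judge clients are not shown.

```typescript
// Field names follow the verdict schema in this document; the types are assumed.
interface Verdict {
  reasoning: string;
  category: string;
  quality: number;
  issues: string[];
  confidence: number;
}

// Hypothetical: a judge is any async function that grades a trace.
type Judge = (trace: unknown) => Promise<Verdict>;

// Run every judge concurrently against the same trace.
async function runPanel(trace: unknown, judges: Judge[]): Promise<Verdict[]> {
  return Promise.all(judges.map((judge) => judge(trace)));
}

// Collect the distinct category labels the panel produced.
function distinctCategories(verdicts: Verdict[]): string[] {
  const categories = verdicts.map((v) => v.category);
  return categories.filter((c, i) => categories.indexOf(c) === i);
}

// More than one distinct category means the judges disagree on routing.
function hasCategoryDisagreement(verdicts: Verdict[]): boolean {
  return distinctCategories(verdicts).length > 1;
}
```

Keeping the disagreement check as a pure function separate from the async fan-out makes it easy to run against stored verdicts as well as live ones.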

Verdict schema

Each judge emits a structured verdict via tool-use (never free text):

{
  "reasoning": "The tool correctly retrieved pipeline data and formatted it...",
  "category": "crm_read",
  "quality": 4,
  "issues": [],
  "confidence": 0.92
}
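
A consumer of these verdicts might want to check their shape before storing them. The following is a minimal sketch of such a check, not the harness's own validation: `validateVerdict` is a hypothetical helper, the 1 to 4 quality range is taken from the quality scale in this document, and the 0 to 1 range for `confidence` is an assumption inferred from the example value.

```typescript
// Returns a list of problems; an empty list means the payload has the expected shape.
function validateVerdict(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["verdict must be an object"];
  }
  const v = value as Record<string, unknown>;
  const reasoning = v["reasoning"];
  const category = v["category"];
  const quality = v["quality"];
  const issues = v["issues"];
  const confidence = v["confidence"];
  const problems: string[] = [];

  if (typeof reasoning !== "string") problems.push("reasoning must be a string");
  if (typeof category !== "string") problems.push("category must be a string");
  // Quality uses the 1-4 scale defined in this document.
  if (typeof quality !== "number" || quality % 1 !== 0 || quality < 1 || quality > 4) {
    problems.push("quality must be an integer from 1 to 4");
  }
  if (!Array.isArray(issues) || !issues.every((i) => typeof i === "string")) {
    problems.push("issues must be an array of strings");
  }
  // Assumption: confidence is a probability-like value in [0, 1].
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    problems.push("confidence must be a number between 0 and 1");
  }
  return problems;
}
```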

Quality scale

Score	Meaning
1	Unusable / harmful — wrong data, safety issue, or destructive action
2	Partially correct — user must manually repair the output
3	Correct — minor polish needed
4	Correct + clean — no issues

Domain categories

All 12 V5 domains that tool calls route through: crm_read · crm_write · ads_read · ads_mutate · analytics_report · content_generate · email_send · calendar_op · search_query · data_query · agent_orchestration · error_recovery

Issue taxonomy

A trace can have 0–9 issues flagged:

Issue	Description
`incomplete`	Partial/truncated result
`hallucination`	Model invented data not in upstream response
`tool_misuse`	Wrong tool or wrong arguments
`missed_context`	Tenant/account/scope dropped or wrong
`verbose`	Unnecessary preamble or commentary
`wrong_domain`	Category routed incorrectly (router error)
`unsafe_action`	Mutation when read intended, or destructive without confirmation
`format_violation`	Output didn’t match declared schema or contract
`regression`	Behavior worse than prior known-good for same input class
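
The domain categories and issue tags listed above are closed vocabularies, so they lend themselves to literal union types. The sketch below shows one way to encode them; the constant and function names are hypothetical, and only the string values come from this document.

```typescript
// The 12 domain categories listed in this document.
const DOMAIN_CATEGORIES = [
  "crm_read", "crm_write", "ads_read", "ads_mutate",
  "analytics_report", "content_generate", "email_send", "calendar_op",
  "search_query", "data_query", "agent_orchestration", "error_recovery",
] as const;
type DomainCategory = (typeof DOMAIN_CATEGORIES)[number];

// The 9 issue tags listed in this document.
const ISSUE_TAGS = [
  "incomplete", "hallucination", "tool_misuse", "missed_context", "verbose",
  "wrong_domain", "unsafe_action", "format_violation", "regression",
] as const;
type IssueTag = (typeof ISSUE_TAGS)[number];

// Type guards that narrow an arbitrary string to one of the known values.
function isDomainCategory(value: string): value is DomainCategory {
  return (DOMAIN_CATEGORIES as readonly string[]).indexOf(value) !== -1;
}

function isIssueTag(value: string): value is IssueTag {
  return (ISSUE_TAGS as readonly string[]).indexOf(value) !== -1;
}
```

With guards like these, an unknown category or issue string from a judge can be rejected at the boundary instead of being written to storage.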

The `submit_feedback` tool

Any connected AI tool (Claude, Cursor, ChatGPT, Codex, Hermes) can call submit_feedback when it encounters a blocker. This is the direct line from the wild back to the V5 roadmap.

{
  "tool": "submit_feedback",
  "input": {
    "attempted_action": "Retrieve HubSpot contacts created in the last 30 days",
    "expected_outcome": "Array of contact objects with email, name, and company",
    "actual_outcome": "Empty array returned even though contacts exist in HubSpot",
    "tool_called": "hubspot_crm",
    "error_seen": "status: 200, body: {contacts: []}",
    "severity": "med"
  }
}

Feedback is written to the agent_feedback D1 table. Mishaal triages weekly via GET /admin/feedback. Patterns become product features.

Severity levels

Level	When to use
`low`	Minor friction — workaround exists
`med`	Blocked on this workflow, but other workflows fine
`high`	V5 is unusable for the calling agent’s task
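
For callers that build the `submit_feedback` payload in code, a typed sketch along these lines may help catch an invalid severity early. The field names mirror the example payload in this document; treating every field as required is an assumption, and `buildFeedbackCall` and `isSeverity` are hypothetical helpers.

```typescript
// The three severity levels from the table above.
type Severity = "low" | "med" | "high";

// Assumption: all fields shown in the example payload are required.
interface FeedbackInput {
  attempted_action: string;
  expected_outcome: string;
  actual_outcome: string;
  tool_called: string;
  error_seen: string;
  severity: Severity;
}

interface FeedbackCall {
  tool: "submit_feedback";
  input: FeedbackInput;
}

// Narrow an arbitrary string (for example, from user input) to a valid severity.
function isSeverity(value: string): value is Severity {
  return value === "low" || value === "med" || value === "high";
}

// Wrap the input in the tool-call envelope shown in the example.
function buildFeedbackCall(input: FeedbackInput): FeedbackCall {
  return { tool: "submit_feedback", input: input };
}
```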

Risk gate

Any trace with a quality score of 2 or lower, or with the `unsafe_action` issue flagged, triggers an alert to the webhook configured in `SLACK_WEBHOOK_URL`. Three consecutive verdicts with quality ≤ 2 on the same tool path automatically open an incident in the `error_ledger`.
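
The two rules above can be sketched as a pure alert predicate plus a small streak counter. This is an illustration under stated assumptions, not the harness's code: `VerdictRecord`, `shouldAlert`, and `LowQualityStreaks` are hypothetical names, the streak is assumed to reset on any verdict above 2, and the incident is assumed to open exactly when the streak reaches three.

```typescript
// Hypothetical record: one graded verdict tied to the tool path it came from.
interface VerdictRecord {
  toolPath: string;
  quality: number;
  issues: string[];
}

// Alert rule: quality of 2 or lower, or the unsafe_action issue flagged.
function shouldAlert(record: VerdictRecord): boolean {
  return record.quality <= 2 || record.issues.indexOf("unsafe_action") !== -1;
}

// Tracks consecutive low-quality verdicts per tool path.
class LowQualityStreaks {
  private streaks = new Map<string, number>();

  // Returns true when this verdict is the third consecutive low one for its path.
  record(record: VerdictRecord): boolean {
    const previous = this.streaks.get(record.toolPath) ?? 0;
    // Assumption: any verdict above 2 resets the streak for that path.
    const streak = record.quality <= 2 ? previous + 1 : 0;
    this.streaks.set(record.toolPath, streak);
    return streak === 3;
  }
}
```

Returning true only at exactly three avoids reopening an incident on the fourth and later low verdicts in the same streak; whether the real harness behaves this way is not stated here.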

Sampling rate

Not every tool call is graded — the harness samples based on:

Tool category (higher sampling for mutations: crm_write, ads_mutate, email_send)
Tenant tier (higher sampling for production tenants)
Recent quality trend (higher sampling after quality degradation)

The sampling configuration is in `src/grader/sampling.ts`. Adjust the rate via the `GRADER_SAMPLE_RATE` KV key, which accepts a value from 0.0 to 1.0 (the default, 0.1, samples 10% of calls).
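
To make the three sampling factors concrete, here is a minimal sketch of how a base rate could be scaled and then applied. It does not reproduce `src/grader/sampling.ts`, whose contents are not shown in this document: the multiplier values (3, 2, 2), the `SamplingContext` shape, and both function names are assumptions chosen purely for illustration.

```typescript
// Mutation categories named in this document as receiving higher sampling.
const MUTATION_CATEGORIES = ["crm_write", "ads_mutate", "email_send"];

// Hypothetical inputs to the sampling decision.
interface SamplingContext {
  category: string;
  isProductionTenant: boolean;
  recentlyDegraded: boolean;
}

// Scale the base rate by each factor, then clamp to [0, 1].
// The multipliers are illustrative assumptions, not the harness's values.
function effectiveSampleRate(baseRate: number, ctx: SamplingContext): number {
  let rate = baseRate;
  if (MUTATION_CATEGORIES.indexOf(ctx.category) !== -1) rate *= 3;
  if (ctx.isProductionTenant) rate *= 2;
  if (ctx.recentlyDegraded) rate *= 2;
  return Math.min(1, Math.max(0, rate));
}

// Decide whether to grade a call; the random source is injectable for testing.
function shouldSample(rate: number, random: () => number = Math.random): boolean {
  return random() < rate;
}
```

For example, with the default base rate of 0.1, a `crm_read` call on a non-production tenant with stable quality keeps a rate of 0.1, while a mutation on a production tenant after a quality drop is clamped to 1 under these assumed multipliers.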

Viewing grader output

# Recent verdicts
curl "https://ascend-gateway-v5.ascendgtm.workers.dev/admin/grader/verdicts?limit=20" \
  -H "Authorization: Bearer <admin-key>"

# Agent feedback inbox
curl https://ascend-gateway-v5.ascendgtm.workers.dev/admin/feedback \
  -H "Authorization: Bearer <admin-key>"
