eval-kit
github.com/akaieuan/eval-kit · @eval-kit/core on npm · Read the brief
Repo, scoped packages on npm, and the full project brief (philosophy, architecture, §13 guardrails).
The scoring cockpit for research agents. Open source. Pre-1.0. v0.3.0 stable shipped 2026-04-23.
What this is
eval-kit measures one thing: whether an AI agent actually helps a human do real work, as judged by the human. Not whether the agent can solve a synthetic puzzle alone. Not whether an LLM-judge says it did well. Whether a person, scoring step-by-step, finds it useful.
Most eval frameworks (MMLU, SWE-bench, GAIA, AgentBench) measure autonomous task completion on synthetic prompts and let an LLM grade the output. eval-kit refuses both choices. The seed suite is ported from observed real workflows with real distractors: future-dated papers, unverifiable claims, jobs that don't exist yet. Scores come from a human reviewer using a 0-3 rubric across five dimensions per step. LLM pre-fill is allowed as a draft the human accepts or overrides; it can never be the default scorer. If LLM-as-judge becomes the default, the project loses its reason to exist.
The five dimensions: explainability, agency preservation, long-term capability, calibration, collaborative performance.
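As a rough sketch, the per-step score implied by that rubric (five dimensions, each 0-3, plus the pre-fill flag mentioned later) could look like the following. The names here are illustrative assumptions, not the actual shapes from `@eval-kit/core`:

```typescript
// Hypothetical sketch of a per-step score. Field names are illustrative;
// the real schema lives in @eval-kit/core.
type RubricValue = 0 | 1 | 2 | 3;

interface StepScore {
  explainability: RubricValue;
  agencyPreservation: RubricValue;
  longTermCapability: RubricValue;
  calibration: RubricValue;
  collaborativePerformance: RubricValue;
  preFilled: boolean; // true when an LLM drafted this score
  reviewerNote?: string;
}

// Five dimensions at 0-3 each give every step a 0-15 total.
function stepTotal(s: StepScore): number {
  return (
    s.explainability +
    s.agencyPreservation +
    s.longTermCapability +
    s.calibration +
    s.collaborativePerformance
  );
}
```

The key property is that the score is a plain data shape a human fills in, not a number an LLM emits.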
What I actually built
- A monorepo with three published npm packages under the `@eval-kit` scope: `core` (runtime, schema, scoring engine, agent adapters), `ui` (React primitives), and `seed-suite` (reference YAML tasks). All live on npm under the `latest` dist-tag.
- A Next.js dashboard that composes those primitives into a reviewer cockpit: an inbox queue with prioritized triage, run review with keyboard-first scoring, diff view across runs, and in-app docs.
- A CLI with eight commands: `run`, `review`, `diff`, `report`, `init`, `preflight`, `ci`, `export`. Each follows the same commander pattern; new commands plug in cleanly.
- A YAML-defined agent profile system so contributors describe an agent (model, system prompt, tools, max iterations) without writing TypeScript. Two seed profiles ship: `claude-research-v1` and `claude-coding-v1`. Custom adapters work via an `--adapter ./path.js` escape hatch.
- Tiered automation that respects the human gate. Tier 1 is deterministic auto-scoring (tool-match check, distraction heuristic). Tier 2 is optional LLM pre-fill: Claude drafts scores, the human accepts or overrides, and every draft is flagged `pre_filled: true`. Tier 3 is active triage that surfaces low-confidence drafts and pre-fill/auto-score disagreements first.
- CI integration (`eval-kit ci`) that gates merges on tier-1 regressions but never auto-fails on golden-truth scores; those need human judgment.
- Training-data export (`eval-kit export`) that emits SFT pairs or DPO preference pairs from scored runs. Pre-filled scores are excluded by default.
- OSS hygiene the framework deserved: issue and PR templates, CODEOWNERS, a SECURITY policy, branch protection requiring four CI matrix jobs to pass, four release milestones, fifteen labels, an RFC process, two release tags shipped, and npm Trusted Publishing wired up.
- A published roadmap and RFCs for v0.4 (multi-reviewer support with inter-rater agreement, a standalone `npx` dashboard) and v0.5 (the human-gated agent-to-agent training flywheel, RFC 0001, accepted).
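To make the tiering concrete, here is a minimal sketch of what a tier-1 deterministic check could look like. The shapes and names are assumptions for illustration, not the actual `@eval-kit/core` heuristics:

```typescript
// Illustrative tier-1 "tool-match" check, assuming a minimal step trace
// shape. Names are hypothetical; the real heuristics live in @eval-kit/core.
interface StepTrace {
  expectedTools: string[]; // tools the task author expected the agent to call
  calledTools: string[];   // tools the agent actually called
}

// Deterministic check: did the agent call every expected tool?
// This only produces a draft verdict; it never replaces the human score.
function toolMatch(step: StepTrace): { pass: boolean; missing: string[] } {
  const missing = step.expectedTools.filter(
    (t) => !step.calledTools.includes(t),
  );
  return { pass: missing.length === 0, missing };
}
```

Because the check is deterministic, it is the only tier CI is allowed to gate on; anything involving judgment stays behind the human gate.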
Why it's unusual
Three things put this outside the eval-framework norm.
It's a UI, not a leaderboard
The product is the scoring cockpit: the keyboard-first inbox where a human reviewer can move through fifty steps in an afternoon. Aggregate scores exist but aren't published; the project explicitly forbids benchmark marketing because the differentiator is qualitative collaborative performance, not a number.
It refuses LLM-as-judge as the default
Every other eval framework I looked at lets an LLM grade the output because human scoring is expensive. eval-kit treats that expense as the point. If the same family of model that produced the answer also grades it, the eval inherits the model's blind spots. The accepted RFC for v0.5's continuous-learning flywheel doubles down: AI agents can propose training updates, but a human must approve each proposal before it can feed an export. Auto-approval is named in the spec as a guardrail violation that should fork the project, not amend it.
It's pre-1.0 but ships like it isn't
v0.3.0 is published with provenance attestations, the release workflow uses pnpm pack + npm publish for OIDC trusted publishing, the main branch requires four green CI matrix jobs before any merge, and the CHANGELOG has honest notes about what worked and what fell back to a token. It looks like a 1.0 because the discipline is what makes a tool depend-on-able, not the version number.
Related perspective on the same measurement wall: HITL Kit.
How I describe the skill set
This project is the load-bearing example for how I work in TypeScript and AI infrastructure right now.
- Schema-first design. Zod schemas in `packages/core/src/schema.ts` are the source of truth for every persisted shape. TS types are inferred via `z.infer`; `parseX` helpers are the only validation entry points. New shapes can't enter the system without going through the schema.
- Workspace monorepo discipline. pnpm workspaces, ESM-only with explicit `.js` extensions, `noUncheckedIndexedAccess` on, tsup builds, vitest tests. Each package has its own `package.json`, `tsconfig`, and CI step.
- Anthropic SDK at production shape. A real tool-use loop with prompt caching on system and tool blocks, max-iteration guards, and a structured pre-fill helper that returns a typed `StepScore` draft for human review.
- OSS-grade release engineering. GitHub Actions matrix CI on Ubuntu and macOS across Node 20 and 22. Trusted Publishing (OIDC) with provenance attestations; the failure mode where it didn't match per-package permissions got documented in the CHANGELOG, not hidden.
- Design that pushes back. I keep a §13 "Philosophical guardrails" section in the brief that names the rules a feature request can't violate. When I caught myself drifting toward an LLM-judge auto-approval flow during v0.5 design, the rule said no: the project loses its reason to exist if I cross it. I rejected my own suggestion.
The scoring cockpit is the product: a human gate on every label that matters, automation that never pretends to replace that gate, and release discipline that matches the seriousness of the claim.