Harness Lab
R&D on self-improving AI build harnesses — a meta-system that analyzes historical runs, proposes prompt optimizations, tests them, and compares results automatically.
- Stage
- R&D
- Runtime
- Tauri + CLI + tRPC
- Test coverage
- 48 passing
- Intended consumer
- My own pipelines
Why We Built It
I run several AI-assisted pipelines across the portfolio — content production, site building, research synthesis. Each pipeline is a harness: a prompt structure, a model routing policy, an evaluator, and a feedback loop. Over time, every harness drifts. What worked at the start of a quarter quietly degrades as the input distribution changes, or as newer model versions make old prompt patterns suboptimal.
The manual fix is re-tuning prompts by hand: look at recent runs, guess what's off, edit, compare. It works, but it doesn't scale across multiple pipelines, and it is exactly the kind of task AI itself can help with — if the surrounding infrastructure lets it do the work rigorously.
Harness Lab was built to test that thesis. Can a meta-harness — an AI system that analyzes AI pipeline runs and proposes improvements — maintain or improve my production harnesses with less of my time, and with honest before/after comparisons rather than vibes?
How It's Built
The system is a three-surface application: a Tauri desktop app for the primary operator experience, a CLI for scripted use, and a tRPC server that both surfaces consume. All three share the same underlying engine — the routing, state management, and optimization logic — which is where the research happens.
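The shared-engine pattern can be sketched in a few lines. This is an illustrative sketch, not the actual code: the `Engine` class, `RunRecord` type, and method names are assumptions, standing in for the real routing/state/optimization module that both surfaces call.

```typescript
// Hypothetical sketch: one engine module, consumed by every surface
// (desktop app, CLI, server) so no surface duplicates logic.

interface RunRecord {
  id: string;
  score: number; // evaluator output for this run
}

class Engine {
  private runs: RunRecord[] = [];

  ingest(run: RunRecord): void {
    this.runs.push(run);
  }

  meanScore(): number {
    if (this.runs.length === 0) return 0;
    return this.runs.reduce((sum, r) => sum + r.score, 0) / this.runs.length;
  }
}

// Any surface drives the same instance the same way.
const engine = new Engine();
engine.ingest({ id: "run-1", score: 0.8 });
engine.ingest({ id: "run-2", score: 0.6 });
const avg = engine.meanScore();
console.log(avg);
```

The design choice this illustrates: the surfaces are thin adapters, and everything worth testing lives in the engine.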
The core loop is straightforward on paper. Ingest runs from a target pipeline. Score them against the pipeline's own evaluator. Surface clusters of failure modes. Propose prompt-level edits that address those failure clusters. Run each proposal against the same evaluator on a held-out sample. Compare, rank, present.
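The steps above can be sketched end to end. Everything here is a hand-rolled stand-in under assumed names: the evaluator is a toy scoring rule, the "cluster" is a single score-threshold bucket, and the proposal generator is stubbed; the real pipeline-specific versions are where the work is.

```typescript
// Toy walkthrough of the loop: ingest -> score -> cluster -> propose -> compare.

type Run = { input: string; output: string };
type Proposal = { id: string; edit: string };

function evaluate(run: Run): number {
  // Stand-in evaluator: reward longer outputs. A real evaluator is
  // pipeline-specific and comes from the target pipeline itself.
  return Math.min(run.output.length / 10, 1);
}

function clusterFailures(runs: Run[], threshold = 0.5): Run[] {
  // Coarsest possible "failure cluster": everything under the threshold.
  return runs.filter((r) => evaluate(r) < threshold);
}

function propose(failures: Run[]): Proposal[] {
  // A real system generates prompt-level edits targeting each cluster.
  return failures.length > 0
    ? [{ id: "p1", edit: "add an explicit length requirement" }]
    : [];
}

function compare(baseline: Run[], candidate: Run[]): number {
  const mean = (rs: Run[]) => rs.reduce((s, r) => s + evaluate(r), 0) / rs.length;
  return mean(candidate) - mean(baseline); // positive = improvement
}

const baseline: Run[] = [
  { input: "a", output: "ok" },
  { input: "b", output: "a longer answer" },
];
const failures = clusterFailures(baseline);
const proposals = propose(failures);
const selfDelta = compare(baseline, baseline);
console.log(failures.length, proposals.length, selfDelta);
```

Comparing a proposal against the same held-out sample as the baseline (rather than a fresh cherry-picked one) is what makes the final `compare` number meaningful.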
The hard part is making each step honest. Failure-cluster detection needs to avoid just grouping surface-level token patterns; scoring needs to be stable enough across runs that a 3% improvement is real and not evaluator noise; proposals need to be tested against the same distribution as the baseline, not a cherry-picked one. Most of the research work has been on the instrumentation and evaluator rigor, not the prompt-writing itself.
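One way to operationalize "a 3% improvement is real and not evaluator noise" is to score the baseline repeatedly, estimate the spread, and only accept deltas that clear a noise band. This is a minimal sketch of that idea under my own assumptions (a two-standard-error band, illustrative scores), not the system's actual statistical machinery.

```typescript
// Guard against evaluator noise: an improvement only counts if it clears
// the noise band measured on repeated baseline evaluations.

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stdError(xs: number[]): number {
  const m = mean(xs);
  const variance = xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  return Math.sqrt(variance / xs.length);
}

function isRealImprovement(
  baselineScores: number[],
  candidateScores: number[],
): boolean {
  const delta = mean(candidateScores) - mean(baselineScores);
  const band = 2 * stdError(baselineScores); // assumed threshold: 2 SE
  return delta > band;
}

// Repeated baseline runs fluctuate around 0.70; the candidate sits near 0.73.
const baselineScores = [0.69, 0.71, 0.7, 0.68, 0.72];
const candidateScores = [0.73, 0.74, 0.72, 0.73, 0.74];
const accepted = isRealImprovement(baselineScores, candidateScores);
console.log(accepted);
```

If the evaluator is noisy enough that the band swallows typical deltas, that is itself a finding: the evaluator needs hardening before optimization on top of it means anything.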
What It Is
A local-first R&D tool. The Tauri app surfaces runs, evaluator outputs, and proposed optimizations in one place. The CLI does the same operations headlessly for scripted improvement runs. 48 tests cover the evaluator logic, the optimizer's proposal generation, and the comparison harness — these are the load-bearing pieces, so they're the ones under test.
Prompt optimization is proposed, not applied. The operator reviews each proposed change alongside its before/after comparison and decides whether to adopt. The point is to compress the loop between "something in my harness has drifted" and "here is a specific, tested improvement I can choose to ship," not to let a meta-AI silently mutate production prompts.
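The review gate can be captured in the data model: a proposal carries its before/after comparison and an explicit status, and adoption is an operator action rather than a side effect. The type and function names below are hypothetical illustrations of that shape, not the tool's real schema.

```typescript
// Sketch: nothing mutates a production prompt without an explicit decision.

type Status = "proposed" | "adopted" | "rejected";

interface ReviewedProposal {
  id: string;
  promptEdit: string;
  baselineScore: number; // evaluator mean before the edit
  candidateScore: number; // evaluator mean with the edit, same held-out sample
  status: Status;
}

function adopt(p: ReviewedProposal): ReviewedProposal {
  // Adoption returns a new record; the original stays "proposed".
  return { ...p, status: "adopted" };
}

const p: ReviewedProposal = {
  id: "p-7",
  promptEdit: "tighten the output schema instructions",
  baselineScore: 0.7,
  candidateScore: 0.74,
  status: "proposed",
};
const decided = adopt(p);
console.log(decided.status, p.status);
```

Keeping the comparison numbers on the proposal itself is the point: the operator decides from evidence attached to the change, not from a diff alone.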
The system is deliberately scoped to internal use for now. It is R&D feeding my own production pipelines. Productization is a live possibility — the pattern generalizes — but the research questions come first: is the evaluator rigor good enough? Are the proposals useful on real pipelines? Does the loop actually save time net of the instrumentation cost? Those answers shape whether this becomes a product or stays a lab.
Where It Is
The tool runs against my own pipelines today. The short-term question it is answering is whether a meta-harness is actually a net time save — whether the improvements it proposes are worth the time spent running it. The early signal: for high-volume pipelines the math works; for low-volume ones the manual approach is still faster.
The longer-term question is architectural. The assumption that evaluators can be stable enough to drive optimization decisions is load-bearing; so far it holds for pipelines with well-defined quality criteria (is this article factually accurate, does this code pass the typecheck) and breaks for fuzzier ones (is this brand voice consistent). That is a real finding, and it shapes where the harness pattern can actually apply in production.
The piece I care most about is the discipline. Too many AI tooling projects ship improvements based on vibes. Harness Lab's whole reason to exist is that a meta-optimizer only earns trust if its proposals are tested under the same evaluator that makes production decisions. That is the research contribution, whether or not it ever becomes a product.
More work like this
The portfolio has more shipped products. About me covers the background and philosophy that connects them.