Synthetic judges: Training custom evaluation models

Fri Oct 31 2025

Shipping LLM features is easy; trusting them is the hard part. Manual labels slow everything down and still miss the weird cases that blow up in production. Synthetic data and automated model grading flip that script by giving fast, cheap, repeatable signal.

This guide shows how to use synthetic users and evaluators to move quickly without guesswork. If you just want the how-to, skip ahead to setting up a training pipeline.

Why synthetic data matters

Synthetic data is a force multiplier, not a replacement for humans. It gives breadth, speed, and consistency, especially when paired with automated model grading for tight iteration loops. Lenny Rachitsky’s write-up on product evals nails why this matters in practice, beyond vibe checks and hero demos link. Gergely Orosz also calls it out as part of the modern AI engineering stack link.

Where synthetic data shines:

  • Coverage for rare and risky scenarios: refund edge cases, jailbreak attempts, adversarial phrasing.

  • Cheap scale to test dozens of prompts, tools, or grounding strategies in hours, not weeks.

  • Consistent labels that reduce the noise you get from hurried human raters.

  • Broader context via synthetic users, as shown in Statsig’s perspective on AI testers link.

Wire those sets into offline evals to move fast, then confirm live with online evals so reality checks your metrics. Statsig’s docs walk through both flows: the offline overview and online evals.

There is a catch. Bias in generation and judging can sneak in quietly. Use multiple judges, require short rationales, and make decisions explainable. Martin Fowler’s note on machine justification is a good north star for this discipline link. For domain-aware rubrics, borrow from G-Eval style prompts and custom metric patterns shared by the LLM community custom metric guide. To tighten judge accuracy, lean on few-shot and chain-of-thought techniques 5 techniques.

Setting up a training pipeline

A durable pipeline should be fast, auditable, and bias-aware. The goal is trustworthy automated model grading that keeps the team unblocked.

  1. Start with strong seeds

Use a small set of high-quality, human-checked examples. Mix in top model outputs to anchor expectations, as suggested in the Statsig AI Evals overview link. Treat this as the canon your rubric learns from.
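To make the seed set concrete, here is a minimal sketch of what one human-checked record might look like, stored as JSON Lines. The field names are illustrative assumptions, not a required Statsig schema.

```python
import json

# Illustrative seed record: a human-checked example the rubric learns from.
# Field names are hypothetical; use whatever schema fits your stack.
seed_examples = [
    {
        "id": "refund-001",
        "input": "I was charged twice for my subscription. Can I get a refund?",
        "reference_output": "Apologize, confirm the duplicate charge, and explain the refund timeline.",
        "label": "pass",
        "rationale": "Acknowledges the error and gives a concrete next step.",
        "source": "human-reviewed",
    },
]

with open("seed_set.jsonl", "w") as f:
    for example in seed_examples:
        f.write(json.dumps(example) + "\n")
```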

  2. Map the space with embeddings

Create vector representations and cluster by intent, domain, and difficulty. Sample evenly across clusters to avoid collapse. Example: customer support bots often split into refunds, cancellations, account issues, and technical troubleshooting.
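A minimal sketch of the embed-cluster-sample loop, assuming sentence-transformers and scikit-learn; the embedding model, cluster count, and sampling budget are assumptions to tune for your own data.

```python
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# A handful of synthetic support prompts; in practice this is your full eval set.
prompts = [
    "I was charged twice, can I get a refund?",
    "How do I get my money back for last month?",
    "Please cancel my subscription today.",
    "I want to stop my plan before it renews.",
    "I can't log into my account anymore.",
    "My password reset email never arrives.",
    "The app crashes every time I open settings.",
    "Sync keeps failing between my devices.",
]

# Embed the prompts; the model name is an assumption, any embedding model works.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(prompts)

# Cluster into rough intent buckets (refunds, cancellations, account, technical).
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10).fit(embeddings)

# Group by cluster, then sample evenly so no slice dominates the eval set.
by_cluster = defaultdict(list)
for prompt, label in zip(prompts, kmeans.labels_):
    by_cluster[int(label)].append(prompt)

per_cluster = 2  # even sampling budget per cluster
eval_set = [p for members in by_cluster.values() for p in members[:per_cluster]]
print(eval_set)
```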

  3. Track drift by cluster

Evaluate by slice so regressions don’t hide in averages. Run offline first, then confirm with online evals. Keep a simple dashboard that shows movement per cluster over time.
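A sketch of slice-level tracking with pandas, assuming each eval run logs a score and a cluster label per example; the log format and drift threshold are illustrative.

```python
import pandas as pd

# Assumed log format: one row per graded example, per eval run.
runs = pd.DataFrame([
    {"run": "2025-10-01", "cluster": "refunds", "score": 0.86},
    {"run": "2025-10-01", "cluster": "cancellations", "score": 0.91},
    {"run": "2025-10-15", "cluster": "refunds", "score": 0.74},
    {"run": "2025-10-15", "cluster": "cancellations", "score": 0.90},
])

# Mean score per cluster per run, so regressions can't hide in the overall average.
by_slice = runs.groupby(["run", "cluster"])["score"].mean().unstack("cluster")
print(by_slice)

# Flag any cluster whose mean dropped by more than 0.05 between the last two runs.
drift = by_slice.iloc[-1] - by_slice.iloc[-2]
print(drift[drift < -0.05])
```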

  4. Use multiple evaluator setups

Each judge catches different flaws. Pair a lightweight open-source judge with a general model; there are credible open judges discussed in the community open-source LM judge. Layer in rubric prompts, few-shot examples, and bias controls from these playbooks custom LLM metrics and five techniques.
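A sketch of how two judges with different models and rubrics might be aggregated; `call_judge` is a stand-in for whatever model client you use, and the judge names and agreement rule are assumptions.

```python
from statistics import mean

# Stub judge call: swap in your real model client. Each judge uses its own
# model and rubric so they catch different classes of flaws.
def call_judge(model: str, rubric: str, output: str) -> dict:
    return {"score": 4, "rationale": f"[{model}] meets the rubric: {rubric}"}

JUDGES = [
    {"model": "open-source-judge", "rubric": "Grade factual accuracy from 1 to 5."},
    {"model": "general-purpose-model", "rubric": "Grade helpfulness and tone from 1 to 5."},
]

def grade(output: str) -> dict:
    verdicts = [call_judge(j["model"], j["rubric"], output) for j in JUDGES]
    scores = [v["score"] for v in verdicts]
    return {
        "score": mean(scores),
        # Low agreement is a signal to route the example to a human.
        "judges_agree": max(scores) - min(scores) <= 1,
        "rationales": [v["rationale"] for v in verdicts],
    }

print(grade("Refunds are issued within 5 business days of approval."))
```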

  5. Add guardrails and traceability

Tie the rubric to product goals so scores mean something. Require short, concrete rationales for every judgment to satisfy machine justification expectations link. Refresh examples with synthetic users to keep coverage honest link.
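One way to enforce the rationale requirement and keep a paper trail: a sketch of a guardrail that rejects judgments without a concrete explanation and appends every accepted verdict to an audit log. The length bounds and file name are illustrative assumptions.

```python
import json
from datetime import datetime, timezone

def validate_verdict(verdict: dict) -> bool:
    """Reject scores that arrive without a short, concrete rationale."""
    rationale = verdict.get("rationale", "").strip()
    # Illustrative bounds: non-empty, but not a rambling essay.
    return "score" in verdict and 10 <= len(rationale) <= 400

def log_verdict(verdict: dict, path: str = "judgment_log.jsonl") -> None:
    """Append every accepted judgment so each score stays traceable later."""
    record = {**verdict, "logged_at": datetime.now(timezone.utc).isoformat()}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

verdict = {"score": 4, "rationale": "Cites the refund policy and gives a correct timeline."}
if validate_verdict(verdict):
    log_verdict(verdict)
```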

Statsig customers often plug this pipeline into AI Evals to score offline with confidence, then roll changes out safely with online gating.

Navigating complexities in scoring

Scoring is where trust is made or lost. Pairwise comparisons usually beat absolute scoring because judges are better at picking A vs B than assigning a perfect 1 to 5. Lenny’s guide shows practical setups for pairwise judges you can replicate quickly link.

Clear scales matter. The rationale should match the score, or the metric will rot. Fowler’s take on explanations applies directly here link. Start small with a labeled calibration set, lock the rubric, then monitor drift with offline and online evals.
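A sketch of a pairwise judge loop with position swapping to dampen order bias; `ask_judge` is a placeholder to replace with a real model call, and the tie-breaking rule is an assumption.

```python
import random

def ask_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    # Stub: call your judge model and return "A" or "B". Swap in a real client.
    return random.choice(["A", "B"])

def pairwise_winner(prompt: str, candidate: str, baseline: str) -> str:
    """Judge A vs B twice with positions swapped to reduce position bias."""
    first = ask_judge(prompt, candidate, baseline)   # candidate shown as A
    second = ask_judge(prompt, baseline, candidate)  # candidate shown as B
    candidate_wins = (first == "A") + (second == "B")
    if candidate_wins == 2:
        return "candidate"
    if candidate_wins == 0:
        return "baseline"
    return "tie"  # judges disagreed across positions; flag for human review

print(pairwise_winner("Explain our refund window.", "Answer v2", "Answer v1"))
```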

Field rules that hold up:

  • Define a canonical rubric with crisp descriptors for each numeric point.

  • Require short rationales; reject empty or rambling explanations.

  • Track judge-to-human correlation and tune prompts or scales when gaps appear (see the sketch after this list).
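A minimal check of judge-to-human agreement using Spearman correlation from scipy; the paired scores and the 0.7 threshold are illustrative assumptions.

```python
from scipy.stats import spearmanr

# Paired scores on the same calibration examples.
human_scores = [5, 4, 2, 5, 3, 1, 4, 2]
judge_scores = [4, 4, 2, 5, 3, 2, 5, 1]

corr, p_value = spearmanr(human_scores, judge_scores)
print(f"Judge-to-human Spearman correlation: {corr:.2f} (p={p_value:.3f})")

# Illustrative threshold: below this, revisit the rubric or judge prompt.
if corr < 0.7:
    print("Correlation is low; recalibrate before trusting automated grades.")
```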

For harder tasks, constrain judges with targeted prompts. Techniques like few-shot rubrics, content filters, and reasoned comparison reduce bias and verbosity five techniques and help you build robust custom metrics custom guide.
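Here is one way such a constrained prompt might look: a sketch combining a numeric rubric with a single few-shot example, loosely in the G-Eval style the linked guides describe. The wording, scale, and example are assumptions to adapt to your domain.

```python
# Hypothetical judge prompt: rubric anchors plus one worked example,
# with a fixed response format to keep rationales short and parseable.
JUDGE_PROMPT = """You are grading a support bot's answer for factual accuracy.

Rubric (1-5):
5 = fully correct and grounded in the provided policy
3 = mostly correct with one minor unsupported claim
1 = contradicts the policy or invents details

Example:
Question: "Can I cancel within 30 days?"
Answer: "Yes, cancellations within 30 days receive a full refund."
Grade: 5
Rationale: Matches the stated policy exactly.

Now grade the answer below. Respond with 'Grade: <1-5>' and a one-sentence rationale.

Question: {question}
Answer: {answer}
"""

prompt = JUDGE_PROMPT.format(
    question="Do refunds take 5 business days?",
    answer="Refunds usually arrive within 5 business days.",
)
print(prompt)
```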

Merging synthetic judges with real feedback

Use synthetic judges for scale and exploration; use people to confirm risk and nuance. Start on a held-out synthetic set, then light up shadow online evals to see real traffic patterns without blasting users link. Pull a small user panel when the data shows surprise behavior.

Keep pipelines separate to avoid cross-contamination. Synthetic users are great for quick hypothesis checks and guardrail testing link. Humans close the loop when judgment calls affect trust, safety, or revenue.

When scores disagree, favor evidence and clarity:

  • Ask judges for explanations to expose missing criteria machine justification.

  • Run a pairwise playoff on the disputed examples to get a cleaner read LLM-based evals.

  • Tighten the rubric and re-calibrate on a small human set before rolling forward.

Statsig’s online evals make this blend practical by scoring live traffic while keeping a paper trail for every decision.

Closing thoughts

Synthetic data and automated model grading let teams move fast without guessing. Build a small but strong seed set, map your data with embeddings, use multiple judges, and keep everything explainable. Most of all, let synthetic scale the work while humans handle the judgment calls that matter.

More to dig into:

  • Product-centric evals from Lenny Rachitsky link

  • The AI engineering stack by Gergely Orosz link

  • Statsig AI Evals overview and online evals overview, online

  • Machine justification by Martin Fowler link

  • Synthetic users in practice at Statsig link

  • Community playbooks for custom judges and bias control guide, techniques

Hope you find this useful!
