This article presents a systematic framework for online A/B testing of LLM applications, focusing on prompt engineering, model selection, and temperature tuning. By isolating and experimenting with specific factors, teams can generate tangible improvements in performance, user engagement, and cost-efficiency.
Real-world case studies demonstrate the value of iterative experimentation with closed feedback loops for training data, and machine learning engineers and data scientists can apply these strategies to deliver better AI-driven products at scale.
As organizations increasingly deploy large language models (LLMs) in production, optimizing them at scale has become a critical challenge. Traditional offline benchmarks, while useful for initial evaluation, often fail to capture real-world user context and subjective experience.
This gap between controlled testing and actual user interactions can lead to misleading assumptions about a model’s effectiveness. To bridge this divide, online A/B testing has emerged as a key approach for understanding an LLM’s true impact.
One major factor in LLM optimization is prompt engineering. The way a prompt is structured or phrased can significantly influence an LLM’s responses. Even small tweaks in wording can produce dramatically different outcomes, making systematic experimentation essential.
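As a concrete illustration, a prompt experiment might compare two variants that differ only in whether a worked example is included. The prompt text below is hypothetical, purely to show the shape of such a comparison:

```python
# Two prompt variants for an A/B test: identical except for the worked example.
# The prompt wording here is hypothetical, for illustration only.
PROMPT_CONTROL = (
    "You are a support assistant. Answer the user's question concisely."
)

PROMPT_TREATMENT = (
    "You are a support assistant. Answer the user's question concisely.\n"
    "Example:\n"
    "Q: How do I reset my password?\n"
    "A: Go to Settings > Account > Reset Password and follow the emailed link."
)
```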
Model selection is another crucial consideration. Organizations often weigh performance, cost, and latency when choosing between different models. Larger models like GPT-4 may offer improved accuracy, but they also come with higher costs and slower response times compared to smaller models like GPT-3.5 or open-source alternatives. Finding the right balance requires testing how these trade-offs affect actual user experience.
Fine-tuning generation parameters, such as temperature and top-p, further refines model behavior. These settings determine the balance between creativity and consistency in responses.
While higher creativity can enhance engagement in some contexts, it may also introduce unpredictability or user dissatisfaction. Running A/B tests helps clarify when and where different parameter adjustments improve performance.
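A minimal sketch of two generation-parameter configurations, assuming the OpenAI Python SDK (v1+); the model name and parameter values are illustrative, not recommendations:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK v1+ and an API key in the environment

client = OpenAI()

# Two sampling configurations to compare online; values are illustrative.
VARIANTS = {
    "control":   {"temperature": 0.7, "top_p": 1.0},   # current production settings
    "treatment": {"temperature": 0.2, "top_p": 0.9},   # more deterministic candidate
}

def generate(user_message: str, variant: str) -> str:
    """Call the same model and prompt, changing only the sampling parameters."""
    params = VARIANTS[variant]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_message}],
        **params,
    )
    return response.choices[0].message.content
```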
By conducting systematic online experiments, organizations can:
Improve performance: Validate hypothesis-driven changes (like new prompts) directly with real users.
Reduce costs: Pinpoint “good enough” model variants or narrower context windows that maintain quality while lowering expenses.
Boost engagement: Track user behaviors such as conversation length, retention, or click-through rates to ensure AI experiences resonate with users.
Rather than relying on guesswork or isolated trial-and-error, a structured, data-driven approach balances performance against cost and moves LLMs from theoretical capabilities to user-aligned, sustainable solutions.
An A/B test compares a control (current implementation) against a treatment (proposed change). Key elements include:
Randomized user allocation: Users are split—often 50/50—so each group experiences only one variant. Randomization ensures a fair comparison.
Single variable isolation: If testing a new prompt, keep the model and temperature consistent across both variants to attribute outcome differences to prompt changes.
Incremental rollouts: Start small (e.g., 5% of traffic) to mitigate risks, then ramp up if initial results look promising. Feature flags or modern experimentation platforms make this practical.
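Randomized allocation and an incremental rollout can be combined in a single deterministic assignment function. The sketch below is a simplified stand-in for what an experimentation platform or feature-flag SDK would normally handle, including exposure logging:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, rollout_pct: float = 5.0) -> str:
    """Deterministically bucket a user: the same user always sees the same variant.

    Only `rollout_pct` percent of users enter the experiment at all; everyone
    else stays on the control experience.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # stable value in 0..9999 per (experiment, user)

    if bucket >= rollout_pct * 100:                      # outside the rollout slice
        return "control"
    return "treatment" if bucket % 2 else "control"      # 50/50 split inside the slice
```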
In LLM contexts, a “treatment” might be:
A different base model (e.g., GPT-4 vs. GPT-3.5).
A refined prompt with new instructions or examples.
Altered parameters (e.g., temperature, top-p).
Integration of a specialized retrieval or reward model.
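In code, each treatment can be expressed as a configuration that changes exactly one factor relative to the control. The model names and values below are illustrative, and the prompt constants refer to the earlier prompt sketch:

```python
# Baseline configuration; each treatment changes exactly one factor relative to it.
CONTROL = {
    "model": "gpt-3.5-turbo",
    "prompt": PROMPT_CONTROL,        # from the earlier prompt sketch
    "temperature": 0.7,
    "top_p": 1.0,
    "use_retrieval": False,
}

# Example treatments, each isolating a single variable:
TREATMENT_MODEL  = {**CONTROL, "model": "gpt-4"}
TREATMENT_PROMPT = {**CONTROL, "prompt": PROMPT_TREATMENT}
TREATMENT_PARAMS = {**CONTROL, "temperature": 0.2, "top_p": 0.9}
TREATMENT_RAG    = {**CONTROL, "use_retrieval": True}
```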
Because LLM outputs can be stochastic, teams should ensure adequate sample sizes and carefully controlled conditions. This helps distinguish real performance improvements from random variance.
Unlike simple web experiments (where clicks or conversions might suffice), LLM-centered metrics can be more nuanced:
Latency and throughput
Why it matters: Users tend to abandon overly slow services.
What to measure: Time to first token, time to completion, error or timeout rates.
User engagement
Examples: Conversation length in chatbots, average session duration, or repeated usage.
Significance: Higher engagement typically indicates more valuable or enjoyable user experiences.
Model response quality
Human ratings: Collect explicit feedback (ratings or “Was this helpful?” prompts).
Behavioral proxies: Track how frequently users copy or edit the AI’s output, how often they request a “regenerate,” etc.
Cost and efficiency
Examples: Tokens used per request, dollar spend per 1,000 requests, GPU usage.
Trade-offs: Balance cost with performance. A high-performing model might be too expensive if gains are marginal.
By tracking these metrics in real time for both variants, organizations can weigh business value (e.g., cost-effectiveness) against user-centric considerations (quality, engagement).
Teams often pair a primary metric (like user retention) with guardrail metrics (like latency) so that a gain in one area doesn't come at the expense of another.
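A minimal per-request logging sketch covering these metric families; the field names are illustrative, and the event sink would normally be an experimentation platform SDK or a data warehouse rather than a print statement:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class RequestMetrics:
    """One event per LLM call; field names are illustrative."""
    user_id: str
    variant: str                       # "control" or "treatment"
    time_to_first_token_s: float
    total_latency_s: float
    prompt_tokens: int
    completion_tokens: int
    user_rating: Optional[int] = None  # explicit feedback (e.g., 1-5 stars), if given

def log_request(metrics: RequestMetrics) -> None:
    # Replace print() with your event logger; the dict shape is what matters
    # for the downstream statistical comparison.
    print(asdict(metrics))
```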
Define a measurable hypothesis: e.g., “Adding an example to the prompt will increase correct answer rates by 5%.” This clarity guides your metrics and success criteria.
Use standard A/B testing power analysis to determine the number of user sessions required for statistical significance. This is crucial given the stochastic nature of LLM outputs.
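For example, assuming a 60% baseline correct-answer rate and a 5-point absolute lift (both assumed numbers), the required sample size per variant can be estimated with statsmodels:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothesis from above: baseline correct-answer rate 60%, target 65% (+5 points).
baseline, target = 0.60, 0.65
effect_size = proportion_effectsize(target, baseline)

# Sessions needed per variant for 80% power at alpha = 0.05 (two-sided).
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
)
print(f"~{int(n_per_variant)} sessions per variant")
```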
Assign users consistently to control or treatment. Avoid switching them mid-experiment to reduce contamination. Run tests long enough to capture representative usage patterns.
Log user interactions, latency, and direct/indirect signals of quality. This data forms the backbone of the statistical comparison.
For continuous metrics (e.g., average rating): t-tests or non-parametric equivalents.
For categorical/binary outcomes (e.g., success/fail, correct/incorrect): chi-square or two-proportion z-tests. Use p-values and confidence intervals responsibly; avoid stopping early as soon as a difference appears.
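A sketch of both test types using scipy and statsmodels; the metric values below are placeholders, not real data:

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

# Continuous metric (e.g., per-session rating); arrays are placeholder data.
ratings_control = np.array([4.1, 3.8, 4.5, 3.9, 4.2])
ratings_treatment = np.array([4.4, 4.0, 4.6, 4.3, 4.5])
t_stat, p_value = ttest_ind(ratings_treatment, ratings_control, equal_var=False)

# Binary metric (e.g., "answer marked correct"): successes and session counts per variant.
successes = np.array([620, 585])   # treatment, control
sessions = np.array([1000, 1000])
z_stat, p_value_prop = proportions_ztest(successes, sessions)
```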
Even if an improvement is statistically significant, consider whether it’s practically relevant. For instance, a 3% increase in user satisfaction may not justify a 50% spike in cost.
If the treatment is clearly better on primary and guardrail metrics, roll it out more broadly. If inconclusive, refine the approach or gather more data. Keep a detailed record of all tests and lessons learned.
A team used an A/B test to evaluate whether a reward model tuned for user engagement could improve conversation depth. One group received responses filtered by the reward model; the other used a baseline chatbot.
The reward-model variant resulted in a 70% increase in average conversation length and a 30% boost in retention, showing how an online experiment validates theoretical gains (reward modeling) in real user settings.
Nextdoor tested AI-generated subject lines versus their existing rule-based approach.
The initial experiment showed little benefit, prompting them to refine their reward function based on user feedback.
In a follow-up test, the improved system delivered a 1% lift in click-through rate and a 0.4% rise in weekly active users: moderate but significant gains at scale that exemplify how iterative experiments guide model enhancements.
Adopt a hypothesis-driven mindset: Define a clear reason for each change, anchored in metrics you care about.
Leverage feature flags and gradual rollouts: Limit initial exposure of risky changes and expand only after positive preliminary data (a minimal sketch follows this list).
Segment your experiments: Different user cohorts may respond differently, so analyze metrics by region, language, or user type.
Close the feedback loop: Data from A/B testing (e.g., user preferences, ratings, or conversions) can feed into model fine-tuning.
Balance performance and cost: Always track computational usage and token costs. Consider “cost per engagement” or a cost ceiling to ensure financial viability.
Make A/B testing part of the deployment workflow: Each major LLM change or parameter tweak should be tested before broad release, ensuring continuous improvement.
Document and replicate: Maintain records of test setups, results, and conclusions to guide future decisions and avoid duplicating failed experiments.
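As a final sketch, the pieces from the earlier examples can be tied together behind a simple flag and rollout percentage; a real feature-flag or experimentation platform SDK would replace this hand-rolled wiring:

```python
# Hypothetical wiring: gate the treatment behind a kill switch and a rollout
# percentage, reusing assign_variant(), CONTROL, and TREATMENT_PROMPT from the
# earlier sketches.
EXPERIMENT_ENABLED = True   # kill switch: flip off to send everyone back to control
ROLLOUT_PCT = 5.0           # expand gradually as guardrail metrics hold up

def choose_config(user_id: str) -> dict:
    if not EXPERIMENT_ENABLED:
        return CONTROL
    variant = assign_variant(user_id, "prompt_example_test", rollout_pct=ROLLOUT_PCT)
    return TREATMENT_PROMPT if variant == "treatment" else CONTROL
```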
By coupling these best practices with a rigorous experimentation platform, data science teams can systematically refine LLM performance in real-world use cases, moving beyond speculation to verifiable, user-driven impact.