FLOPS gets thrown around in every AI conversation. The number sounds authoritative, yet it often misleads. Teams plan budgets off it, then ship slower systems than expected.
The gap is not magic; it is memory, batching, and measurement. This post turns FLOPS from trivia into a tool.
FLOPS is simply how many floating-point operations a system completes per second. It captures raw math rate, not quality or outcomes. If that sounds too basic, the plain explainer on Reddit’s ELI5 thread is a quick refresher ELI5: what is FLOPS.
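As a rough worked example: a dense (M, K) x (K, N) matrix multiply costs about 2·M·K·N floating-point operations, so dividing that count by wall-clock time gives the FLOPS you actually achieved. Here is a minimal sketch of that arithmetic; the matrix sizes are arbitrary illustration values.

```python
import time
import numpy as np

# Minimal sketch: estimate achieved FLOPS for one dense matmul.
# Matrix sizes are arbitrary illustration values.
M, K, N = 2048, 2048, 2048
a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

flop_count = 2 * M * K * N            # one multiply and one add per inner-product term
achieved_flops = flop_count / elapsed
print(f"{achieved_flops / 1e9:.1f} GFLOPS in {elapsed * 1e3:.1f} ms")
```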
Why not just count instructions per second? In modern AI and scientific workloads, floating-point math dominates. That is why the industry gravitated to FLOPS in the first place, as folks in the compsci community have discussed for years Why FLOPS?.
Here is the catch: FLOPS outline cost, while seconds tell the truth. Some FLOPS are easy for your hardware, others are expensive. Model comparisons often go sideways when teams optimize for one and ignore the other. The Machine Learning subreddit has debated this point repeatedly, which is healthy and correct FLOPS vs seconds.
Use FLOPS as a planning metric, not a vanity number. A few practical patterns help:
Set compute budgets in FLOPS, then validate with latency. Treat both as required fields FLOPS vs seconds.
Balance parameters and FLOPS for Mixture-of-Experts models. Sparsity can lift capacity without equal compute, as shown in recent MoE work on IsoFLOP surfaces arXiv MoE study.
Compare ops, params, and inference time together. A single metric hides too much. This ResearchGate summary makes the tradeoffs concrete ResearchGate table.
Plan growth in compute multiples. If the target is 10x training compute, make the business case equally explicit. The GPT‑4 discussion is a good reality check on what “10x” means in practice 10x compute of GPT-4.
Avoid metric theater. Track a small set of trustworthy counters and outcomes. Martin Fowler’s guidance on metric pitfalls is still gold metrics.
Raw math units are not the bottleneck most days. Memory bandwidth and interconnects gate real throughput, especially at batch sizes that matter. This shows up instantly with attention-heavy models where the KV cache and batching either make the hardware fly or stall. OpenAI’s scaling notes capture that playbook well: hit the cache, tune batch, respect bytes per op Scaling ChatGPT.
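One quick way to see whether math or memory is the gate is a back-of-the-envelope roofline check: compare an op's arithmetic intensity (FLOPs per byte moved) to the hardware's ratio of peak FLOPS to memory bandwidth. The sketch below uses placeholder hardware numbers, not the specs of any particular GPU.

```python
# Back-of-the-envelope roofline check: is this op compute-bound or memory-bound?
# Hardware numbers are illustrative placeholders, not real device specs.
PEAK_FLOPS = 300e12          # assumed peak math rate, FLOP/s
PEAK_BANDWIDTH = 2e12        # assumed memory bandwidth, bytes/s
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BANDWIDTH   # FLOPs needed per byte to keep the cores busy

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte read or written."""
    return flops / bytes_moved

# Example: a (4096 x 4096) x (4096 x 4096) fp16 matmul.
m = k = n = 4096
flops = 2 * m * k * n
bytes_moved = 2 * (m * k + k * n + m * n)   # fp16 = 2 bytes per element

ai = arithmetic_intensity(flops, bytes_moved)
bound = "compute-bound" if ai >= MACHINE_BALANCE else "memory-bound"
print(f"intensity={ai:.0f} FLOP/byte, machine balance={MACHINE_BALANCE:.0f}, likely {bound}")
```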
Parallel units are only useful if the data arrives on time. Data locality and a sane cache strategy keep the cores busy; poor locality turns TFLOPS into idle cycles. Coordination matters too. Fine-grained sync across cores often adds tail latency without improving throughput. The LMAX team famously avoided that trap by favoring a simple, single-threaded design where it won on real workloads The LMAX architecture.
So what does “fair” benchmarking look like? Report FLOPS and wall-clock time side by side. Some FLOPS complete faster on a given GPU; others get throttled by memory or links. Sparse MoE routing changes the mix again, shifting the balance between parameters touched and FLOPS executed FLOPS vs seconds, Parameters vs FLOPs.
When tuning, shine a light on the real limits:
Increase batch size until memory stalls or latency SLOs degrade, then back off Scaling ChatGPT.
Profile cache hit rate and your ops-to-bytes ratio.
Record both FLOPS achieved and time per request for each change FLOPS vs seconds; a minimal logging sketch follows this list.
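Here is a minimal sketch of that last point, assuming you can estimate the FLOP count per request up front; `run_request` and `flop_estimate` are placeholders for your own workload and analysis.

```python
import time
from statistics import mean

def benchmark_change(run_request, flop_estimate: float, n_requests: int = 100) -> dict:
    """Time a workload and report both achieved FLOPS and time per request.

    run_request: zero-argument callable standing in for one inference request.
    flop_estimate: estimated floating-point ops per request (from model analysis or a profiler).
    """
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        run_request()
        latencies.append(time.perf_counter() - start)

    avg_latency = mean(latencies)
    return {
        "time_per_request_s": avg_latency,
        "achieved_flops": flop_estimate / avg_latency,
    }

# Usage: log this dict alongside the change description for every experiment, e.g.
# result = benchmark_change(lambda: model(batch), flop_estimate=2 * params * tokens)
```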
More parameters raise compute and memory pressure. That sounds obvious, but the slope matters. At some point, extra width or depth adds cost faster than it adds accuracy. FLOPS targets go up; heat, spend, and runtime follow. The MoE literature shows a different path: more capacity per token through sparsity, with inference cost tied to the fraction of experts used sparsity and IsoFLOP surfaces.
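To make the sparsity point concrete, here is an illustrative comparison of per-token FLOPs for a dense FFN versus a top-k MoE layer. The shapes and expert counts below are made up for illustration, not taken from any real model.

```python
# Illustrative only: compare per-token FLOPs for a dense FFN vs a top-k MoE layer.
# Shapes and expert counts are made-up examples, not real model configs.
d_model, d_ff = 4096, 16384
num_experts, top_k = 64, 2

# One FFN expert: two matmuls, roughly 2 * d_model * d_ff FLOPs each.
flops_per_expert = 2 * (2 * d_model * d_ff)

dense_flops = flops_per_expert           # a dense FFN is "one expert" run on every token
moe_flops = top_k * flops_per_expert     # only the routed experts run per token
moe_params_ratio = num_experts           # parameter count grows with all experts

print(f"dense FFN: {dense_flops / 1e6:.0f} MFLOPs/token")
print(f"MoE (top-{top_k} of {num_experts}): {moe_flops / 1e6:.0f} MFLOPs/token, "
      f"~{moe_params_ratio}x the FFN parameters")
```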
Hardware constraints force tradeoffs. KV cache size caps batch. Interconnect bandwidth limits multi-GPU scaling. When FLOPS hide those stalls, trust runtime metrics and optimize for the actual SLOs you carry FLOPS vs seconds.
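A rough sizing sketch shows why the KV cache caps batch size; every number below (layers, heads, head dim, context length, memory budget) is an illustrative assumption rather than a real model or GPU spec.

```python
# Rough sizing sketch: how KV cache memory caps batch size.
# All model and hardware numbers below are illustrative assumptions.
layers, kv_heads, head_dim = 32, 8, 128
dtype_bytes = 2                      # fp16/bf16
context_len = 8192
memory_budget_bytes = 40e9           # memory left for KV cache after weights and activations

# Keys + values, per token, across all layers.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
kv_bytes_per_sequence = kv_bytes_per_token * context_len

max_batch = int(memory_budget_bytes // kv_bytes_per_sequence)
print(f"KV cache: {kv_bytes_per_sequence / 1e9:.2f} GB per sequence -> max batch ≈ {max_batch}")
```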
Here are simple levers that pay off:
Right-size hidden dimensions to the throughput “knee”. Going wider past that point burns FLOPS with little gain.
Use sparsity when routing is stable enough to stay cache-friendly.
Track inference time per request, not just throughput. The ResearchGate comparisons reinforce why both belong in your dashboard time vs FLOPs.
Tie parameter growth to SLOs and budgets. Statsig’s write-up on scalable experimentation platforms shows how to wire guardrails so surprises surface early scalable patterns. A minimal guardrail check is sketched after this list.
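A minimal guardrail sketch, assuming you already collect per-change latency and compute numbers; the thresholds and field names are placeholders to adapt to your own SLOs and budget.

```python
from dataclasses import dataclass

# Placeholder thresholds; set these from your actual SLOs and budget.
P99_LATENCY_SLO_S = 0.250
TRAINING_FLOPS_BUDGET = 1e21

@dataclass
class CandidateConfig:
    name: str
    params: int
    train_flops: float
    p99_latency_s: float

def violates_guardrails(cfg: CandidateConfig) -> list[str]:
    """Return the guardrails a proposed model config would break."""
    problems = []
    if cfg.train_flops > TRAINING_FLOPS_BUDGET:
        problems.append("training compute over budget")
    if cfg.p99_latency_s > P99_LATENCY_SLO_S:
        problems.append("p99 latency over SLO")
    return problems

# Usage: fail the rollout (or flag the experiment) when the list is non-empty, e.g.
# issues = violates_guardrails(CandidateConfig("wider-ffn", 9_000_000_000, 1.4e21, 0.31))
```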
Start with the workloads that matter. Not synthetic loops that look pretty in a slide. Use tight benchmarks that mirror production traffic, then keep them in CI so regressions get caught fast. The ML community’s FLOPS-versus-seconds debates offer a useful checklist for what to log and share runtime in seconds, inference time.
A straightforward playbook works well:
Establish a clean baseline: FLOPS achieved, p50 and p99 latency, ops-to-bytes ratio (a baseline sketch follows this playbook).
Raise batch size while watching KV cache hit rate and memory bandwidth. Stop when stalls rise or SLOs get shaky Scaling ChatGPT.
Trim synchronization and critical sections. Keep hot paths short. The LMAX story is a reminder that less coordination often means more throughput LMAX.
Re-profile. Fix the highest stall reason first, then re-run the same scripts.
Adjust the model knob that matches the bottleneck: parameters if accuracy is flat, sparsity if FLOPS are tight, quantization if memory is the wall.
Anchor every change in clear metrics and outcomes. Martin Fowler’s notes on honest metrics help teams avoid chasing vanity numbers metrics. Statsig customers often connect these metrics to experiments so compute decisions tie back to user impact and cost.
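For that first baseline step, a snapshot can be as simple as the sketch below. It assumes you already have per-request latencies plus FLOP and byte estimates from a profiler or model analysis.

```python
from statistics import quantiles

def baseline_snapshot(latencies_s: list[float], flops_per_request: float,
                      bytes_per_request: float) -> dict:
    """Summarize a benchmark run into the baseline numbers worth tracking.

    latencies_s: wall-clock time of each request in the run.
    flops_per_request / bytes_per_request: estimates from a profiler or model analysis.
    """
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    pcts = quantiles(latencies_s, n=100)
    p50, p99 = pcts[49], pcts[98]
    return {
        "p50_latency_s": p50,
        "p99_latency_s": p99,
        "achieved_flops": flops_per_request / p50,   # achieved rate at median latency
        "ops_to_bytes": flops_per_request / bytes_per_request,
    }

# Re-run the same snapshot after every change so regressions are obvious.
```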
Two closing tips worth bolding:
Always publish FLOPS and wall-clock results together.
Describe the hardware and its memory setup. Readers care about whether the ops finish quickly, not just how many you scheduled.
FLOPS is a useful compass, not the destination. Treat it as a planning tool, then validate with latency and throughput on the hardware that will run the job. Memory bandwidth, cache behavior, and synchronization decide how much of those FLOPS you actually get. Keep parameters in check, use sparsity when it helps, and route every change through honest metrics. Tools like Statsig make it easier to connect compute choices to business outcomes so optimizations stick.
Want to go deeper?
Quick refresh on FLOPS: Reddit’s ELI5 explainer ELI5: what is FLOPS
Why the industry standardized on FLOPS: r/compsci thread Why FLOPS?
FLOPS vs seconds debates and benchmarks: r/MachineLearning FLOPS vs seconds
Practical knobs for KV cache and batch size: Pragmatic Engineer’s breakdown Scaling ChatGPT
Metrics discipline: Martin Fowler’s notes metrics
MoE capacity and IsoFLOP surfaces: recent arXiv study arXiv MoE study
Experimentation at scale: Statsig’s patterns guide scalable patterns
Hope you find this useful!