Why the infrastructure running a large language model can change its behavior
An analysis of inference-time variability in modern LLMs
The hidden variable in every LLM conversation
Most users treat large language models as singular entities. People talk about them the way they would talk about a single piece of software:
- “Use GPT-4.”
- “Claude Opus is better.”
- “This model got worse.”
- “That prompt works better on Model X.”
But there is an under-discussed reality in modern LLM systems: the hardware and inference environment running the model can materially affect the output. Not just latency. Not just throughput. The actual behavior of the model.
A frontier model running on a rack of NVIDIA Blackwell GPUs may not behave identically to the same model deployed on older Hopper or Ampere hardware, even when the weights themselves are unchanged. This matters because users increasingly try to correlate prompt quality, model capability, token consumption, reasoning consistency, and output reliability. If the inference environment itself introduces variability, those correlations become statistically noisy.
The myth of perfect determinism
Most people intuitively assume LLMs behave like traditional software: same input, same output. But modern transformer inference systems are probabilistic numerical systems operating at massive scale. At their core, LLMs repeatedly compute a probability distribution over possible next tokens:
P( tₙ₊₁ | t₁, t₂, …, tₙ )
Tiny numerical differences during inference can alter which token is selected. Once generation diverges by even one token, the entire downstream reasoning path may change. This is not hypothetical. The official PyTorch reproducibility documentation states it plainly:
“Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. … Results may not be reproducible between CPU and GPU executions, even when using identical seeds.”
Source: PyTorch documentation — Reproducibility notes
Floating-point arithmetic is not perfectly stable
Modern LLM inference relies on massive chains of floating-point operations. Floating-point arithmetic is not associative:
(a + b) + c ≠ a + (b + c)   (in general, once each step is rounded)
Execution order matters. Parallel GPU systems reorder operations constantly for performance, and different GPU architectures, kernels, or scheduling paths may accumulate values differently, producing slightly different numerical results.
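A minimal Python sketch of this effect, using ordinary 64-bit floats rather than real GPU kernels. The helper names `sequential_sum` and `pairwise_sum` are illustrative only, and the input values are chosen to exaggerate the rounding error. They stand in for two accumulation orders that different kernels might plausibly use.

```python
import math

def sequential_sum(values):
    """Accumulate left to right, as a single thread would."""
    total = 0.0
    for v in values:
        total += v
    return total

def pairwise_sum(values):
    """Accumulate as a balanced tree, as a parallel reduction might."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])

# Values picked to exaggerate rounding; real activations differ far less.
values = [1e16, 1.0, -1e16, 1.0]

seq = sequential_sum(values)   # ((1e16 + 1) - 1e16) + 1
tree = pairwise_sum(values)    # (1e16 + 1) + (-1e16 + 1)
exact = math.fsum(values)      # correctly rounded reference sum

print(seq, tree, exact)        # 1.0 0.0 2.0
```

Both orderings are legitimate ways to add the same four numbers, yet neither matches the exact sum and they disagree with each other. In real inference the discrepancies are many orders of magnitude smaller, but the mechanism is the same.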
Normally these differences are microscopic. But LLMs are highly sensitive dynamical systems — a tiny change early in generation can alter later token probabilities enough to flip a greedy-decoded argmax and meaningfully change the output.
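The sensitivity described above can be illustrated with a toy example. The logits and the size of the drift below are hypothetical values chosen so that two candidate tokens are nearly tied. Nothing here is measured from a real model.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    """Greedy decoding: index of the largest logit (same argmax as softmax)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits for three candidate tokens; tokens 0 and 1 are nearly tied.
logits = [2.0, 2.0000005, 0.5]
# Hypothetical numerical drift, standing in for a different accumulation order.
drift = [1e-6, 0.0, 0.0]
drifted = [x + d for x, d in zip(logits, drift)]

probs_before = softmax(logits)
probs_after = softmax(drifted)
max_prob_shift = max(abs(a - b) for a, b in zip(probs_before, probs_after))

print(greedy_token(logits), greedy_token(drifted))  # 1 0
print(max_prob_shift < 1e-5)                        # True
```

The probabilities barely move, but the selected token changes. Every later token is then conditioned on a different prefix, which is how a microscopic difference becomes a visibly different answer.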
A refinement: it’s not only the floating point
Recent research from Thinking Machines Lab (“Defeating Nondeterminism in LLM Inference,” Sept 2025) argues that the conventional “concurrency plus floating point” story is incomplete. They show that the dominant driver of nondeterminism in production LLM endpoints is batch invariance failure: server-side dynamic batching means a single request is co-batched with whichever other requests happen to be active at that moment, and standard kernels for normalization, matmul, and attention are subtly batch-sensitive. The fix exists — batch-invariant kernels — but it carries a meaningful throughput cost, which is why most production systems do not use it.
This refinement doesn’t weaken the broader point of this article — it strengthens it. The thing producing the output is not just the model; it is the model plus the kernels, plus the batching policy, plus the hardware, plus everything else in the serving stack.
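A toy model of batch sensitivity, as a sketch under stated assumptions: the function names and the rule "split the reduction when the batch has fewer than 4 requests" are hypothetical, chosen only to mimic a kernel that picks a different reduction strategy depending on how much work is in the batch. It is not the code of any real kernel.

```python
def chunked_sum(values, num_chunks):
    """Sum values as num_chunks contiguous partial sums, then combine them."""
    size = -(-len(values) // num_chunks)  # ceiling division
    partials = []
    for start in range(0, len(values), size):
        acc = 0.0
        for v in values[start:start + size]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total

def batch_sensitive_row_sum(row, batch_size):
    """Toy kernel: splits the reduction when the batch is small (hypothetical rule)."""
    num_chunks = 2 if batch_size < 4 else 1
    return chunked_sum(row, num_chunks)

def batch_invariant_row_sum(row, batch_size):
    """Toy kernel: always uses the same reduction order, whatever the batch size."""
    return chunked_sum(row, 1)

# The same request data, with values chosen to exaggerate rounding.
row = [1e16, 1.0, -1e16, 1.0]

alone = batch_sensitive_row_sum(row, batch_size=1)  # request served by itself
busy = batch_sensitive_row_sum(row, batch_size=8)   # request co-batched with others
print(alone, busy)  # 0.0 1.0
print(batch_invariant_row_sum(row, 1) == batch_invariant_row_sum(row, 8))  # True
```

The request's own data never changes; only the traffic around it does. The invariant version gives up the freedom to pick the fastest strategy per batch, which mirrors the throughput cost described above.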
GPU determinism has a cost
NVIDIA has published technical material discussing explicit determinism controls in CUDA reduction operations. Strict GPU-to-GPU determinism requires specialized accumulation strategies that are slower than the standard high-performance execution paths. The tradeoff is straightforward:
| Optimization Goal | Resulting Behavior |
|---|---|
| Maximum throughput | Higher numerical variability |
| Strict reproducibility | Lower performance & higher cost |
Modern inference providers generally optimize for latency, utilization, throughput, and serving cost — not perfect determinism. As a result, public LLM deployments routinely operate in environments where small numerical divergences are expected.
Hardware changes can change model behavior
Different GPU generations support different tensor precisions, fused kernels, memory hierarchies, compiler optimizations, and scheduling behaviors. The precision formats themselves illustrate this:
- FP16 — 16-bit, classic mixed-precision training and inference
- BF16 — 16-bit with FP32-range exponent, more stable in training
- TF32 — NVIDIA-specific 19-bit format for Ampere tensor cores
- FP8 — 8-bit, introduced on Hopper, widely used on Blackwell
- FP4 / NVFP4 — 4-bit, native on Blackwell tensor cores
Each introduces different precision and accumulation characteristics. Newer architectures such as Blackwell are designed to aggressively accelerate low-precision workloads: NVIDIA reports that NVFP4 reduces memory footprint by roughly 1.8x versus FP8, and Blackwell’s fifth-generation tensor cores natively run FP4, FP6, and FP8 paths. That improves scale economics dramatically, but it can also increase approximation effects relative to older hardware.
This does not mean the model becomes a different intelligence. It can mean different verbosity, altered reasoning depth, different failure modes, inconsistent chain-of-thought behavior, or changing token consumption.
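To make the precision differences among the 16-bit formats concrete, here is a small sketch that rounds the same values to FP16 and BF16 using only the Python standard library. The helpers `to_fp16` and `to_bf16` are illustrative names, they emulate rounding on the CPU rather than any tensor-core path, and `to_bf16` assumes a finite input within float32 range.

```python
import math
import struct

def to_fp16(x):
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Beyond FP16's finite range (max 65504); model the overflow as infinity.
        return math.copysign(math.inf, x)

def to_bf16(x):
    """Round x to bfloat16 (8 exponent bits, 7 mantissa bits), ties to even.

    Assumes x is finite and representable as float32; NaN is not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# FP16 keeps more mantissa bits, so it resolves small differences near 1.0.
print(to_fp16(1.001), to_bf16(1.001))      # 1.0009765625 1.0

# BF16 keeps float32's exponent range, so large values survive (coarsely).
print(to_fp16(70000.0), to_bf16(70000.0))  # inf 70144.0
```

Neither format is simply "more accurate": FP16 trades range for resolution and BF16 does the opposite. A model served in one format therefore sees slightly different intermediate values than the same weights served in another.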
Mixture-of-Experts models amplify the problem
Many frontier models now use Mixture-of-Experts (MoE) architectures, in which tokens are dynamically routed between specialized subnetworks called “experts.” Routing decisions are typically made by selecting the top-scoring experts from a softmax over the gating logits, so they are sensitive to tiny logit differences. When two experts’ scores are nearly tied, a difference in the last representable bit can be enough to swap which expert is selected.
Small numerical drift can therefore:
- activate different experts on the same token,
- alter the reasoning trajectory downstream,
- or produce stylistic divergence between otherwise-identical runs.
As models become more sparse, distributed, and dynamic, inference reproducibility becomes harder, not easier.
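A minimal sketch of a routing flip, assuming plain top-k selection over gating logits. The logits, the drift, and the helper name `top_k_experts` are hypothetical. Real routers add details such as load balancing and capacity limits that are omitted here. Because softmax preserves ordering, ranking the raw logits selects the same experts as ranking the softmax scores.

```python
def top_k_experts(gate_logits, k):
    """Return the indices of the k largest gating logits, highest first."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i],
                    reverse=True)
    return ranked[:k]

# Hypothetical gating logits for four experts; experts 1 and 2 are nearly tied.
gate_logits = [1.2, 0.3, 0.3000001, -0.5]

# Hypothetical numerical drift on expert 1's logit.
drifted = list(gate_logits)
drifted[1] += 2e-7

print(top_k_experts(gate_logits, 2))  # [0, 2]
print(top_k_experts(drifted, 2))      # [0, 1]
```

Unlike a small shift in a probability, a routing flip is discrete: the token passes through a different set of weights entirely, so the downstream activations can differ by far more than the drift that caused the flip.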
Prompt engineering becomes statistical, not deterministic
This has major implications for how people think about prompts. The intuitive assumption is: “a better prompt produces a better answer.” But if inference itself is probabilistic and infrastructure-sensitive, prompt optimization is a statistical exercise, not a deterministic one. The observed output is closer to a function of many variables:
Output = f(Prompt, Model, Sampling, Hardware, Runtime, Quantization, Routing, Batch)
The user controls one variable directly: the prompt. Everything else may shift underneath the surface. This explains a long list of common user complaints:
- “The same prompt worked yesterday.”
- “The API behaves differently than the web UI.”
- “The model seems dumber now.”
- “Token usage suddenly increased.”
Sometimes the weights changed. Sometimes the infrastructure did. Often, the user cannot tell which.
| Variable | Controlled By |
|---|---|
| Prompt | The user — the only directly controlled input |
| Model weights | The provider; may change silently with updates |
| Sampling parameters | User or provider defaults (temperature, top-p, seed) |
| Hardware | Provider — GPU generation, interconnects, memory |
| Runtime / kernels | Provider — CUDA version, kernel selection, scheduling |
| Quantization & precision | Provider — FP8, BF16, FP16, TF32, FP4 |
| Batch composition | Provider — varies with concurrent traffic |
| Expert routing (MoE) | Provider — sensitive to logit drift |
Table: What actually determines the output you see.
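Because the prompt is only one of these variables, comparing two prompts means comparing success rates over repeated runs rather than single outputs. The sketch below uses the Wilson score interval for a binomial proportion. The pass counts are invented for illustration, and treating overlapping intervals as "no clear difference" is a rough heuristic rather than a formal hypothesis test.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical results: prompt A passes 70% of runs, prompt B passes 90%.
small_a = wilson_interval(7, 10)
small_b = wilson_interval(9, 10)
large_a = wilson_interval(70, 100)
large_b = wilson_interval(90, 100)

print(intervals_overlap(small_a, small_b))  # True: 10 runs cannot separate them
print(intervals_overlap(large_a, large_b))  # False: 100 runs can
```

With ten runs each, an apparent 20-point improvement is indistinguishable from noise. The same pass rates over a hundred runs support a much firmer conclusion.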
Token consumption is affected too
Small inference divergences compound over long generations. One execution path may answer concisely, terminate early, and consume 400 tokens. Another — same prompt, same nominal model — may self-correct repeatedly, explore alternate reasoning paths, and consume 2,000+ tokens. For API users and enterprise deployments this is a real operational concern, because token consumption directly impacts cost, latency, throughput, and agent reliability.
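The cost impact is simple arithmetic. In the sketch below the per-token price and the list of token counts are hypothetical, chosen to match the 400 and 2,000 token figures above. Substitute real billing data and pricing to apply it.

```python
import statistics

# Hypothetical price in USD per million output tokens; not any provider's real rate.
PRICE_PER_MILLION_OUTPUT_TOKENS = 10.0

def output_cost(tokens, price_per_million=PRICE_PER_MILLION_OUTPUT_TOKENS):
    """Cost in USD of generating the given number of output tokens."""
    return tokens * price_per_million / 1_000_000

# Hypothetical output-token counts from five runs of the same prompt.
token_counts = [400, 450, 2000, 520, 610]

mean_tokens = statistics.mean(token_counts)
spread = max(token_counts) / min(token_counts)
cheapest = output_cost(min(token_counts))
priciest = output_cost(max(token_counts))

print(mean_tokens, spread)   # 796 5.0
print(cheapest, priciest)    # 0.004 0.02
```

A single long run pulls the average well above the typical run, so budgeting from one or two sample calls can underestimate real spend. Tracking the distribution of token counts, not just a single observation, gives a more reliable picture.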
Researchers already acknowledge this problem
The machine-learning ecosystem has been discussing reproducibility challenges for years. PyTorch’s deterministic-algorithm documentation explains that deterministic execution often requires explicitly disabling nondeterministic optimizations. Researchers routinely encounter reproducibility drift across:
- GPU architectures and CUDA versions,
- cuDNN implementations,
- threading models and atomic operation ordering,
- distributed execution and collective communication, and
- server-side batching policies.
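For readers who run models themselves, the settings described in PyTorch's reproducibility notes look roughly like the sketch below. The wrapper names `cublas_env` and `enable_determinism` are hypothetical, and the sketch assumes PyTorch is installed. These switches narrow run-to-run variation on one machine. They do not promise identical results across GPU generations, library versions, or batch compositions.

```python
import os

def cublas_env(workspace=":4096:8"):
    """Environment setting PyTorch's notes request for deterministic cuBLAS."""
    return {"CUBLAS_WORKSPACE_CONFIG": workspace}

def enable_determinism(seed=0):
    """Apply PyTorch's documented determinism switches (requires torch)."""
    # Must be set before CUDA work begins for cuBLAS to honor it.
    os.environ.update(cublas_env())

    import torch  # imported here so the helper above works without torch

    torch.manual_seed(seed)                   # seed the random number generators
    torch.use_deterministic_algorithms(True)  # error on nondeterministic ops
    torch.backends.cudnn.benchmark = False    # stop cuDNN auto-picking kernels
    return torch
```

Calling `enable_determinism(seed=0)` once at program start is the intended usage. Operations that have no deterministic implementation will then raise an error instead of silently varying, and some workloads run noticeably slower.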
Recent academic and industry work, including the Thinking Machines analysis cited above, has continued to expose how low-level GPU and serving behavior contributes to reproducibility challenges across architectures and precision modes.
The emerging reality
As LLM systems scale, the distinction between “the model” and “the infrastructure running the model” is becoming increasingly blurry. The public often treats models like static software artifacts. In reality, modern LLMs are distributed probabilistic systems whose behavior emerges from an interaction between weights, hardware, runtime optimizations, batching, routing, precision strategies, and serving infrastructure.
The consequence is subtle but important: what matters is not just which model you are using, but also where it runs.
Sources referenced: PyTorch Reproducibility documentation; NVIDIA Developer Blog (Blackwell Ultra, NVFP4); Thinking Machines Lab, “Defeating Nondeterminism in LLM Inference” (Sept 2025).