LLMs are “Bayesian in Expectation, not in Realisation” (the build-up to hallucination detection)
Our paper’s story end‑to‑end: the ICL paradox it tackles, why standard transformers aren’t Bayesian in realisation, what mathematics backs it and what it means for practice.
0. Background
Transformers achieve incredible performance on a wide variety of tasks, from translating English to French to solving math problems. But unlike most machine learning models, which were trained to do those specific tasks, transformers were only ever trained to predict the next token in text. Other autoregressive models, such as RNNs and LSTMs, also minimised next-token prediction loss, yet they never exhibited in-context learning the way transformers do. So what explains this behaviour?
When you train a model on "The cat sat on the ___", it learns that "mat" or "floor" are likely continuations, not because of any fancy on-the-fly learning, but because it has seen similar patterns thousands of times in training data. The model minimises log-loss by becoming exceptionally good at pattern matching and sequence continuation.
Here's where things get interesting: when we give these same models a few examples of a new task in their prompt, say English–French translation pairs or worked math problems, they somehow adapt and perform the task without any retraining.
This is in-context learning, and it seems almost magical. The model appears to be "learning" from the prompt examples. Why do transformers exhibit this whereas LSTMs and RNNs struggle?
1. Inductive Biases
It's connected to something called inductive biases: essentially, the built-in preferences or assumptions that guide how a model learns and generalises to unseen tasks.
1.1 Inductive Biases in RNNs and LSTMs
Unlike transformers, RNNs and LSTMs processed tokens through a recurrent structure. This created a bias toward solving problems using local dependencies and sequential processing. They assumed that information flows forward through time in a fixed manner, with each state depending primarily on the immediately previous state.
This is a reasonable bias for many sequential tasks, but it turns out to be limiting. When you showed an LSTM some examples of a new task, it couldn't dynamically reorganise its computation to recognise and execute that task; its inductive biases were too rigid, too hardwired into the recurrent structure.
1.2 Inductive Biases in Transformers
Transformers, on the other hand, introduced radically different inductive biases through their architecture.
Flexible, adaptive computation: The attention mechanism creates a bias toward content-based reasoning; the model can dynamically decide what information to use based on similarity and relevance rather than just sequential proximity.
Position-specific patterns: Transformers make use of positional encodings, making the model aware of absolute and relative positions and creating a bias towards position-specific patterns.
These two biases work together to create something unexpected.
The attention mechanism allows the model to look at examples in the prompt and say, "these examples are demonstrating a pattern, so let's route information in a way that continues this pattern."
The positional encodings let it track where it is in this pattern.
Together, they create an inductive bias toward task recognition and execution from examples, which is exactly what in-context learning is.
1.3 MDL and Bayesian Inference
This architectural setup creates a fascinating connection to Bayesian inference. The way transformers blend their trained knowledge with prompt examples resembles a Bayesian process: the pre-trained parameters encode a prior over possible tasks, the prompt provides evidence, and the attention mechanism performs something like posterior updating.
But beyond the conceptual connection, there's a deeper link that emerges from a branch of mathematics called information theory.
During training, these models minimise the regularised log-loss over billions of examples, which is mathematically equivalent to minimising the Minimum Description Length (MDL) of the data. Mathematically:

$$\mathrm{MDL}(X) = \min_{\theta}\big[\,L(\theta) + L(X\mid\theta)\,\big]$$
Where:
L(θ) is the description length of the model (parameter regularisation)
L(X|θ) is the description length of the data given the model
By Shannon's source coding theorem, the optimal code length for the data under the model is:

$$L(X\mid\theta) = -\log P(X\mid\theta)$$
This transforms MDL minimisation into:

$$\min_{\theta}\big[-\log P(X\mid\theta) + L(\theta)\big]$$

which is exactly the regularised negative log-likelihood that neural networks minimise during training! The L(θ) term becomes weight decay or other regularisation penalties.
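To make the equivalence concrete, here is a minimal sketch (my own toy setup, not anything from the paper) in which a one-parameter Bernoulli model is fit by minimising a two-part description length; the objective is literally the regularised negative log-likelihood, with the weight-decay strength `lam` chosen arbitrarily:

```python
import numpy as np

# Toy illustration: the regularised log-loss of a one-parameter Bernoulli model,
# read as a two-part MDL code length (in nats).

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=1000)   # "training data"
lam = 0.01                            # weight-decay strength (arbitrary choice)

def description_length(theta):
    L_model = lam * theta**2          # L(theta): cost of describing the parameters
    L_data = -np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))  # L(X|theta) = -log P(X|theta)
    return L_model + L_data           # = regularised negative log-likelihood

# "Training" = picking the theta with the shortest two-part description.
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmin([description_length(t) for t in thetas])]
print(best)   # lands near the empirical frequency of 1s (~0.7)
```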
Intuition with Occam’s Razor
Minimising the MDL embodies Occam's Razor in information-theoretic terms, the take-home message being:
The best model is the one that provides the shortest description of the data, balancing model complexity against data compression.
According to foundational work by Grünwald, MDL optimisation is formally equivalent to Bayesian inference with specific priors. When a model minimises description length, it's implicitly performing Bayesian model selection!
But here's where the tension emerges.
2. The Bayesian Paradox
2.1 What is the martingale property?
Classical Bayesian inference has its own inductive bias: for exchangeable data (where order doesn't matter, like coin flips), it assumes that all relevant information is contained in the sufficient statistics. A Bayesian reasoner would produce identical predictions for [1,0,1,0] and [0,1,0,1] because they have the same counts. This is the martingale property; predictions shouldn't drift based on arbitrary reorderings:
The expected value of future predictions, given what you've observed so far, should remain constant as you observe more data from the same exchangeable sequence.
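To see both properties in action, here is a small sanity check (illustrative only; the Beta prior and its parameters are my choice) using the classic Beta–Bernoulli posterior predictive: it depends only on the counts, and its expected future prediction equals its current one.

```python
# A Beta-Bernoulli Bayesian predictor: order-invariant and a martingale.

def predictive(seq, a=1.0, b=1.0):
    """P(next = 1 | seq) under a Beta(a, b) prior over the coin bias."""
    s, n = sum(seq), len(seq)
    return (s + a) / (n + a + b)

# Same counts => identical predictions, regardless of ordering.
print(predictive([1, 0, 1, 0]), predictive([0, 1, 0, 1]))   # both 0.5

# Martingale check: E[next prediction | data so far] equals the current prediction.
seq = [1, 0, 1, 0]
p_n = predictive(seq)
expected_next = p_n * predictive(seq + [1]) + (1 - p_n) * predictive(seq + [0])
print(p_n, expected_next)   # 0.5, 0.5
```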
2.2 Why do transformers violate it?
Positional encodings are necessary, but they break exchangeability
When computing the log-loss for predicting the next token, the model's probability depends on both the tokens it has seen AND their positions. Without positional encoding, the attention sees [1, 0, 1, 0] and [0, 1, 0, 1] as equivalent input prompts because attention is permutation-invariant.
But with positional encodings, each input to the model becomes the token's embedding PLUS its positional encoding. So the model literally sees "[1 at position 0], [0 at position 1], [1 at position 2], [0 at position 3]." Now the model can, and does, learn different patterns for different positions.
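Here is a toy numerical illustration of both halves of that claim (a single random attention head, nothing like a real transformer): without positional encodings the pooled attention output is identical for [1, 0, 1, 0] and [0, 1, 0, 1], and adding positional encodings breaks that symmetry.

```python
import numpy as np

# Toy single-head self-attention: permutation-invariant pooled output without
# positional encodings, order-dependent output with them.

rng = np.random.default_rng(0)
d = 8
E = {0: rng.normal(size=d), 1: rng.normal(size=d)}        # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # random projections

def attend(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = np.exp(Q @ K.T / np.sqrt(d))
    A /= A.sum(axis=1, keepdims=True)                     # softmax attention weights
    return A @ V

def summary(tokens, pos_enc=False):
    X = np.stack([E[t] for t in tokens])
    if pos_enc:                                           # add a distinct vector per position
        X = X + np.stack([np.sin(np.arange(d) * (i + 1)) for i in range(len(tokens))])
    return attend(X).mean(axis=0)                         # pooled representation

print(np.allclose(summary([1, 0, 1, 0]), summary([0, 1, 0, 1])))                              # True
print(np.allclose(summary([1, 0, 1, 0], pos_enc=True), summary([0, 1, 0, 1], pos_enc=True)))  # False
```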
Our paper shows that for exchangeable data, permutation invariance of the predictor is equivalent to the martingale property (Lemma 3.3). Therefore, any order-dependent predictor that breaks exchangeability must violate the martingale property and thus doesn't meet the requirements to be Bayesian.
But the regularised log-loss we minimised when training the model also minimises the MDL, so the model must be Bayesian. We now have a contradiction.
3. Resolving the paradox
3.1 Kolmogorov Complexity and MDL
3.1.1 What is Kolmogorov Complexity?
Kolmogorov complexity K(X) represents the length of the shortest computer program that can produce string X. Think of it as the "incompressible information content" of X. For instance, the string "AAAA...A" (1000 A's) has low Kolmogorov complexity because you can describe it with a short program: "print 'A' 1000 times". But a truly random string of 1000 characters has high complexity - the shortest program is essentially "print '[the entire string]'".
The conditional Kolmogorov complexity K(X|π) means "the shortest program to produce some value X given that you already know the permutation π". This is central to understanding what transformers actually do.
3.1.2 The Computability Problem and MDL Solution
There's a fundamental issue we need to address: Kolmogorov complexity K(X) is theoretically elegant but uncomputable. No algorithm can compute the shortest program for arbitrary strings; this is a consequence of the halting problem. We can't actually find K(X) for real data.
This is where Minimum Description Length (MDL) becomes essential. MDL provides a computable approximation to Kolmogorov complexity by using a fixed, practical encoding scheme. Instead of searching for the shortest possible program (impossible), MDL uses a specific model class and measures description length within that class.
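A quick way to build intuition (a crude stand-in, not the paper's method): any off-the-shelf compressor gives a computable upper bound on description length, and it already reproduces the "1000 A's versus random string" contrast from above.

```python
import random
import zlib

# Compressed length under a fixed coding scheme: a computable upper bound on
# description length, in the spirit of MDL.

random.seed(0)
repetitive = b"A" * 1000
random_bytes = bytes(random.getrandbits(8) for _ in range(1000))

print(len(zlib.compress(repetitive)))    # tiny: the pattern has a short description
print(len(zlib.compress(random_bytes)))  # ~1000 or more: essentially incompressible
```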
The relationship between them is:

$$\mathrm{MDL}(X) \;\le\; K(X) + L(\theta^*)$$

Where θ* represents the optimal model parameters. The MDL is bounded above by the Kolmogorov complexity plus the complexity of describing the model itself.
3.2 Connecting to the Position-Aware Framework
When you train a transformer with positional encodings, it's not minimising K(X) directly (which would be Bayesian-optimal for exchangeable data). Instead, it minimises the expected conditional complexity:

$$\mathbb{E}_{\pi}\big[K(X\mid\pi)\big] = K(X) - I(X;\pi)$$

The equation shows that these two objectives differ by exactly I(X;π), the information leaked into positional dependence.
Moving to the computable domain, the expected conditional MDL differs from the permutation-invariant version by exactly the same mutual information term:

$$\mathbb{E}_{\pi}\big[\mathrm{MDL}(X\mid\pi)\big] = \mathrm{MDL}(X) - I(X;\pi)$$
When transformers with positional encodings are trained, they minimise the expected regularised negative log-likelihood over the training distribution. Since training data comes with natural orderings, the optimisation objective becomes:

$$\theta^* = \arg\min_{\theta}\;\mathbb{E}_{(X,\pi)\sim\text{train}}\big[-\log P_{\theta}(X\mid\pi)\big] + L(\theta)$$
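One crude way to probe how much positional information a trained model has absorbed is sketched below. This is hypothetical: `model_nll` stands in for whatever routine returns your model's total negative log-likelihood on a token sequence (it is not a real API), and the gap is only a heuristic proxy for I(X;π), not an exact estimator.

```python
import random

def positional_gap(model_nll, sequences, n_perms=32, seed=0):
    """Average NLL on shuffled orderings minus NLL on natural orderings."""
    rng = random.Random(seed)
    natural, permuted = 0.0, 0.0
    for seq in sequences:
        natural += model_nll(seq)
        for _ in range(n_perms):
            shuffled = list(seq)
            rng.shuffle(shuffled)                 # destroy the natural ordering
            permuted += model_nll(shuffled) / n_perms
    n = len(sequences)
    return permuted / n - natural / n             # > 0 when orderings are being exploited
```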
3.3 Why This Creates the Position Dependence
Let’s break down why this formulation is so important.
During training, the transformer sees sequences in particular orders, but realistically never sees all n! permutations of the same content.
If the training data consisted of all permutations of each sequence, equally weighted, the model would learn to be permutation-invariant; it would converge to minimising:

$$\theta^* = \arg\min_{\theta}\;\mathbb{E}_{X}\big[-\log P_{\theta}(X)\big] + L(\theta)$$
where P_θ(X) depends only on the content, not the ordering.
But that's not what happens in practice. This means the model finds an optimal compression in expectation over the orderings by exploiting positional regularities in the training data. Perhaps important information tends to appear at the beginning of documents, or certain patterns occur at specific relative positions. The model bakes these positional biases into its parameters θ* during training.
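Here is a tiny simulation of that effect (my own toy construction, not an experiment from the paper): when the training sequences have a positional regularity, a position-aware model attains a shorter code length than a content-only one, so the optimum necessarily bakes position dependence into its parameters.

```python
import numpy as np

# Toy data with a positional regularity: 1s tend to appear early in each sequence.
rng = np.random.default_rng(0)
n_seq, length = 5000, 8
p_by_pos = np.linspace(0.9, 0.1, length)
data = rng.binomial(1, p_by_pos, size=(n_seq, length))

# Content-only model: a single shared Bernoulli rate (permutation-invariant).
p_global = data.mean()
nll_global = -np.mean(np.sum(data * np.log(p_global) + (1 - data) * np.log(1 - p_global), axis=1))

# Position-aware model: one rate per position (what positional encodings enable).
p_pos = data.mean(axis=0)
nll_pos = -np.mean(np.sum(data * np.log(p_pos) + (1 - data) * np.log(1 - p_pos), axis=1))

print(nll_global, nll_pos)   # the position-aware model compresses better...
# ...but its predictions change if the same tokens arrive in a different order.
```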
3.4 Optimal compression and sufficient statistics
3.4.1 The Fundamental Principle
To understand why optimal compression requires sufficient statistics, consider what it means to compress data efficiently. The theoretical minimum description length for data X given some side information Y is the conditional entropy H(X|Y). Now, for exchangeable data where we're predicting the next token, the key insight is that the sufficient statistic T(X) contains all the information needed for optimal prediction.
For exchangeable Bernoulli sequences, where the data consists of 0s and 1s that can appear in any order, the sufficient statistic is simply the count S_n: how many 1s we've seen so far. The fundamental theorem tells us that:

$$P(X_{n+1} = 1 \mid X_{1:n}) = P(X_{n+1} = 1 \mid S_n), \qquad S_n = \sum_{i=1}^{n} X_i$$
This equality means that knowing the full history X_{1:n} provides no more information for predicting X_{n+1} than just knowing the sufficient statistic S_n. More generally, we write T(X) for the sufficient statistic of the data X.
If the model were using anything other than the sufficient statistics, it would either be:
Using less information than T(X): This would lead to suboptimal predictions and higher description length than nH(p)
Using more information than T(X): This includes irrelevant details that don't help prediction but add noise
We showed that transformers achieve near-optimal compression in expectation:

$$\mathbb{E}_{\pi}\big[-\log P_{\theta^*}(X\mid\pi)\big] \;\le\; n H(p) + O(\log n)$$
The fact that they get within O(log n) of the entropy bound nH(p) proves they must be learning and using (approximate) sufficient statistics internally.
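For intuition on where an O(log n) redundancy comes from, here is a classical textbook check (not the paper's experiment): the Krichevsky–Trofimov sequential predictor uses only the running count S_n and stays within roughly (1/2)·log₂(n) bits of the ideal code length, whose expectation is nH(p).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 0.3
x = rng.binomial(1, p, size=n)

# Krichevsky-Trofimov sequential coding: each prediction uses only the count so far.
code_len, ones = 0.0, 0
for t, bit in enumerate(x):
    q1 = (ones + 0.5) / (t + 1.0)
    code_len += -np.log2(q1 if bit else 1.0 - q1)
    ones += bit

ideal = -np.sum(x * np.log2(p) + (1 - x) * np.log2(1 - p))   # true-source code length, E[ideal] = n*H(p)
print(code_len - ideal)   # redundancy: at most ~0.5*log2(n) + 1, about 7.6 bits here
```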
3.4.2 The Deep Implication
This distinction between training and analysis reveals something profound. The model has learned representations that are rich enough to extract sufficient statistics from any permutation, even though it was only trained on naturally-ordered sequences. The positional encodings force it to learn position-dependent extraction mechanisms, but these mechanisms are collectively sufficient to recover the true statistics when averaged.
Think of it like this: the model learns multiple "pathways" to extract information, each pathway being strongest for certain positional patterns. When you average over all permutations, you're essentially using all these pathways simultaneously, recovering the full power of what the model knows.
This is why transformers are so powerful: despite the architectural constraints and training biases, they learn something fundamental about the data that transcends the specific orderings they were trained on. The sufficient statistics are there, embedded in the parameters θ*, waiting to be revealed through permutation averaging.
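As a closing sketch of what "revealed through permutation averaging" might look like in code (hypothetical: `predict_next` stands in for a call to your model that returns a next-token distribution over a small vocabulary; it is not a real API):

```python
import random

def permutation_averaged_predict(predict_next, examples, query, n_perms=16, seed=0):
    """Average the model's next-token distribution over shuffled in-context examples."""
    rng = random.Random(seed)
    avg = {}
    for _ in range(n_perms):
        shuffled = list(examples)
        rng.shuffle(shuffled)                     # re-order the in-context examples
        dist = predict_next(shuffled, query)      # hypothetical model call
        for token, prob in dist.items():
            avg[token] = avg.get(token, 0.0) + prob / n_perms
    return avg   # order effects averaged out: Bayesian in expectation
```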