Commutativity: Why Transformers Need Positional Encodings
And other consequences of order not mattering
Here's a question that seems too simple to be interesting:
Why do Transformers need positional encodings?
The standard answer: "So the model knows where each token is in the sequence."
But why doesn't it know already? What is it about attention that loses position information?
The answer is a single word: commutativity.
The Property
Commutativity means order doesn't matter:
a + b = b + a
Simple. Obvious. But the consequences run deep.
If an operation is commutative, permuting the inputs doesn't change the output:
f(a, b, c) = f(c, a, b) = f(b, c, a)
The operation can't tell what order things arrived in. It treats all orderings identically.
Attention Is Commutative
Look at self-attention:
Attention(Q, K, V) = softmax(QKᵀ / √d) × V
Given a sequence X = [x₁, x₂, …, xₙ], we compute:
Q = XW_Q
K = XW_K
V = XW_V
Now permute the input: X′ = [x₂, x₁, x₃, …]
What happens? The output is permuted the same way.
Attention is permutation-equivariant. Shuffle the input, get a shuffled output. The operation itself doesn't care about order.
This means without positional encodings:
"I love you" → [embed("I"), embed("love"), embed("you")]
"you love I" → [embed("you"), embed("love"), embed("I")]

These would produce the same representation (just permuted). The model literally cannot distinguish them.
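A toy NumPy sketch makes this concrete. The weight matrices here are random stand-ins for learned parameters; the point is only the equivariance, not the model:

```python
# Sketch: pure-NumPy self-attention on a toy sequence, showing
# permutation equivariance when no positional encoding is added.
# Weight matrices are random stand-ins for learned ones.
import numpy as np

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(3, d))                          # 3 tokens, dim 4
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

def attention(X):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

perm = [2, 0, 1]                                     # shuffle the "sentence"
out = attention(X)
out_shuffled = attention(X[perm])

# Shuffled input -> identically shuffled output: no order information.
print(np.allclose(out[perm], out_shuffled))  # True
```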
Positional encodings exist because attention is commutative.
We add position information explicitly because the architecture won't infer it.
The Design Choice
This commutativity is a feature, not a bug.
RNNs are explicitly non-commutative:
hₜ = f(hₜ₋₁, xₜ)
Each state depends on the previous state. Order is baked into the recurrence. You can't permute the input without changing everything.
The cost: you can't parallelize. Each step waits for the previous one.
Transformers are commutative (permutation-equivariant):
O = Attention(X)
Order is not baked in. You add it explicitly through positional encodings.
The benefit: you can parallelize. All positions are processed simultaneously.
Commutativity enables parallelization. This is why Transformers replaced RNNs for long sequences.
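The contrast can be sketched in a few lines: a toy tanh recurrence (made-up weights, not any particular RNN) whose final state changes when the input is permuted.

```python
# Sketch: a minimal RNN cell (tanh recurrence, random stand-in weights)
# showing that, unlike attention, permuting the input changes the
# final state: order is baked into the recurrence.
import numpy as np

rng = np.random.default_rng(1)
d = 4
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def rnn_final_state(X):
    h = np.zeros(d)
    for x in X:                      # each step waits for the previous one
        h = np.tanh(W_h @ h + W_x @ x)
    return h

X = rng.normal(size=(3, d))
h_orig = rnn_final_state(X)
h_perm = rnn_final_state(X[[2, 0, 1]])
print(np.allclose(h_orig, h_perm))   # False: order changed the result
```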
When You Want Commutativity
Sometimes order genuinely doesn't matter. Then commutativity isn't a limitation; it's a requirement.
Point Clouds
A 3D scan gives you points {(x₁, y₁, z₁), (x₂, y₂, z₂), …}.
These points have no natural order. The first point scanned isn't semantically "first."
Your network must be permutation invariant:
f({p₁, p₂, p₃}) = f({p₃, p₁, p₂})
PointNet (2017) achieves this by design:
f(X) = g(maxᵢ h(xᵢ))
Max is commutative. Shuffle the points, get the same output.
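A minimal sketch of the symmetry trick (a single random linear-plus-ReLU layer stands in for PointNet's per-point MLP):

```python
# Sketch: per-point features h followed by a max over points, the
# PointNet-style symmetric aggregation. Weights are random stand-ins.
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 8))           # h: R^3 -> R^8, one linear layer

def pointnet_feature(points):
    h = np.maximum(points @ W, 0.0)   # per-point ReLU features
    return h.max(axis=0)              # max over points: order-free

cloud = rng.normal(size=(5, 3))       # 5 points in 3D
shuffled = cloud[rng.permutation(5)]
print(np.allclose(pointnet_feature(cloud), pointnet_feature(shuffled)))  # True
```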
Sets
A shopping cart. A user's friends. Atoms in a molecule.
These are sets; order doesn't exist.
DeepSets (2017) proved the fundamental theorem:
Any permutation-invariant function on sets can be written as f(X) = ρ(Σ φ(x)), with the sum over all x ∈ X (under mild conditions on the domain)
Sum is commutative. The architecture is order-invariant by construction.
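The same form in code, with tiny random linear maps standing in for φ and ρ (a sketch, not the paper's architecture):

```python
# Sketch of the DeepSets form f(X) = rho(sum phi(x)): phi and rho are
# tiny random linear maps, enough to show order invariance.
import numpy as np

rng = np.random.default_rng(3)
W_phi = rng.normal(size=(2, 6))       # phi: R^2 -> R^6
W_rho = rng.normal(size=(6, 1))       # rho: R^6 -> R^1

def deepsets(X):
    return (X @ W_phi).sum(axis=0) @ W_rho   # sum over the set, then rho

X = rng.normal(size=(4, 2))
print(np.allclose(deepsets(X), deepsets(X[::-1])))  # same for reversed set
```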
Graphs
In a graph neural network, you aggregate neighbor features:
hᵥ = UPDATE(hᵥ, AGG({hᵤ : u ∈ N(v)}))
The neighbors N(v) have no order. AGG must be commutative: sum, mean, or max.
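A sketch of the aggregation step alone, with hypothetical toy node features:

```python
# Sketch: mean aggregation over a node's neighbors, the AGG step of a
# message-passing GNN. Neighbor order doesn't affect the result.
import numpy as np

h = {                                 # toy node features (made up)
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 2.0]),
    "c": np.array([3.0, 1.0]),
}

def aggregate(neighbors):
    return np.mean([h[u] for u in neighbors], axis=0)

# Listing the neighbors in either order gives the same message.
print(np.allclose(aggregate(["b", "c"]), aggregate(["c", "b"])))  # True
```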
When You Donât Want Commutativity
Sometimes order is everything.
Language: "Dog bites man" ≠ "Man bites dog"
Time series: The sequence of stock prices matters
Music: Notes in order form melody; shuffled, they're noise
Actions: The order of operations matters (usually)
For these, you need either:
Non-commutative operations (RNNs, state machines)
Explicit position encoding (Transformers)
The choice affects parallelization, inductive bias, and what the model can learn.
Gradient Aggregation: Why Training Works
Here's a less obvious place commutativity matters.
When you train on a minibatch, you compute:
L = (1/N) Σᵢ L(xᵢ, yᵢ)
∇L = (1/N) Σᵢ ∇L(xᵢ, yᵢ)
The sum of gradients is commutative. This enables:
Minibatch training: Compute gradients for samples in any order, sum them.
Gradient accumulation: Split a large batch across multiple forward passes, sum the gradients.
Distributed training: Compute gradients on different machines, sum them (all-reduce).
If gradient aggregation weren't commutative, distributed training would be impossible. The order of machines would matter.
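A small sketch of the same fact: per-sample gradients for a least-squares loss, summed in two different orders, agree (up to floating point).

```python
# Sketch: per-sample gradients of a least-squares loss, accumulated in
# two different orders. Because gradient summation is commutative,
# both orders give the same minibatch gradient (up to floating point).
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(size=3)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)

def grad_sample(i):
    # d/dw of (x_i . w - y_i)^2
    return 2.0 * (X[i] @ w - y[i]) * X[i]

g_forward = sum(grad_sample(i) for i in range(8)) / 8
g_reverse = sum(grad_sample(i) for i in reversed(range(8))) / 8
print(np.allclose(g_forward, g_reverse))  # True
```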
Pooling: The Invariance/Information Trade-off
Global pooling (mean, max, sum) is commutative.
This gives you translation invariance:
[ cat in left of image ] → pool → representation
[ cat in right of image ] → pool → same representation

The pooled representation doesn't know where the cat was.
The trade-off: commutativity destroys positional information.
Want invariance? Use commutative pooling.
Want to preserve position? Don't pool, or use position-aware alternatives.
This is why object detection uses feature pyramids instead of global pooling: you need to know where things are.
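The information loss is easy to see on a toy 1-D feature map:

```python
# Sketch: global max pooling over a 1-D feature map. Shifting the
# "object" (the activation spike) doesn't change the pooled value,
# which is exactly why position is lost.
import numpy as np

feature_map_left = np.array([9.0, 1.0, 1.0, 1.0])   # spike on the left
feature_map_right = np.array([1.0, 1.0, 1.0, 9.0])  # spike on the right

# Same pooled representation, completely different locations.
print(feature_map_left.max() == feature_map_right.max())  # True
```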
Designing with Commutativity
When building an architecture, ask:
Does order matter in my input?
Does order matter in my output?
If you're generating sequences, you need autoregressive structure, which is fundamentally non-commutative.
Can I parallelize?
Commutative operations can be parallelized and distributed. Non-commutative ones often can't.
The Floating-Point Footnote
One subtlety worth knowing:
In exact arithmetic, addition is both commutative (a + b = b + a) and associative ((a + b) + c = a + (b + c)).
In floating-point:
Commutativity: preserved ✓
Associativity: broken ✗
>>> (1e20 + (-1e20)) + 1
1.0
>>> 1e20 + ((-1e20) + 1)
0.0
Reordering is safe. Regrouping isn't.
This means parallel reductions (which require regrouping) can give slightly different results on different runs.
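A parallel reduction regroups: it sums pairs in a tree rather than left to right. The difference shows up even on four values:

```python
# Sketch: sequential vs pairwise (tree) reduction over the same values.
# The terms are identical; only the grouping differs, so only
# associativity is exercised, and floating point breaks it.
vals = [1e20, 1.0, -1e20, 1.0]

sequential = ((vals[0] + vals[1]) + vals[2]) + vals[3]   # left-to-right
pairwise = (vals[0] + vals[1]) + (vals[2] + vals[3])     # tree grouping

print(sequential)  # 1.0
print(pairwise)    # 0.0
```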
This contributes to ML non-reproducibility, but research from Thinking Machines shows the primary culprit is subtler: GPU kernels change their reduction strategies based on batch size. When server load varies, batch sizes vary, kernel behavior varies, and you get different results even with identical inputs.
Either way, this is a numerical concern, not an architectural one. The mathematical design choice (commutative or not) remains valid.
The Takeaway
Commutativity is about order invariance.
Where order doesn't matter (sets, point clouds, graphs), use commutative operations. You get parallelization and natural invariance.
Where order matters (sequences, time series), either use non-commutative operations (RNNs) or add position explicitly (Transformers).
Transformers need positional encodings because attention is commutative. The architecture processes all positions symmetrically. Order must be injected from outside.
This single property, commutativity, explains:
Why Transformers parallelize and RNNs don't
Why PointNet works on point clouds
Why GNNs use sum/mean/max aggregation
Why global pooling loses spatial information
Why distributed training is possible
The algebra isnât abstract. Itâs in every architecture you use.
See also: The One Property That Makes FlashAttention Possible. Associativity is the license to parallelize, chunk, and stream.
Further Reading
Zaheer et al., "Deep Sets" (2017): The foundational theorem on permutation-invariant functions
Qi et al., "PointNet" (2017): Processing point clouds with max pooling
Vaswani et al., "Attention Is All You Need" (2017): Transformers and positional encodings
Bronstein et al., "Geometric Deep Learning" (2021): Symmetry and invariance in neural networks
Thinking Machines, "Defeating Nondeterminism in LLM Inference" (2025): Why batch-invariance failure causes non-reproducibility


