Domain Transformations: The Art of Finding Easier Spaces
Why logarithms prevent underflow, why Fourier speeds up convolutions, and how choosing the right space makes hard problems tractable
For the best reading experience with properly rendered equations, view this article on GitHub Pages.
Here's a trick that appears everywhere in numerical computing:
Don't solve the hard problem. Transform it into an easy one.
Multiplying a thousand small probabilities? You'll underflow to zero. But add their logarithms, and everything works.
Convolving two signals of length n? That's O(n²). But multiply in Fourier space, and it's O(n log n).
These aren't clever hacks. They're instances of a general principle: the right domain makes computation tractable.
The Principle
A domain transformation moves computation from one space to another:
Hard problem in Space A → transform → Easy problem in Space B
The pattern:
Transform input to the easier domain
Compute in that domain
Transform back (if needed)
This is worthwhile when:
cost(transform) + cost(easy computation) + cost(inverse) < cost(hard computation)
Often dramatically so.
Logarithms: Multiplication to Addition
The most common domain transformation in ML.
The Problem
Youâre computing the probability of a sequence:
P(x₁, x₂, …, xₙ) = ∏ᵢ P(xᵢ | x<ᵢ)
Each factor is less than 1. Multiply 1000 of them:
```python
>>> 0.1 ** 1000
0.0  # Underflow
```

The true answer isn't zero; it's just smaller than the smallest representable float.
The Solution
Work in log-space:
log P = Σᔹ log P(xᔹ | x<ᔹ)
Multiplication becomes addition. Products of tiny numbers become sums of negative numbers.
```python
>>> import math
>>> 1000 * math.log(0.1)
-2302.58...  # No underflow
```

This is why every language model outputs log-probabilities. Why Bayesian inference uses log-likelihoods. Why HMMs compute in log-space.
Logarithms don't just prevent underflow. They turn an unstable operation into a stable one.
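Putting the two snippets together: a quick sketch comparing the naive product of a thousand conditional probabilities with the log-space sum (the uniform 0.1 probabilities are illustrative, not from a real model):

```python
import math

# 1000 conditional probabilities, each 0.1
probs = [0.1] * 1000

# Naive product: underflows to exactly 0.0
naive = 1.0
for p in probs:
    naive *= p
print(naive)  # 0.0

# Log-space sum: finite, and exactly the log of the true product (up to rounding)
log_p = sum(math.log(p) for p in probs)
print(log_p)  # -2302.58...
```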
The Log-Sum-Exp Trick
But what if you need to add probabilities, not multiply them?
For mutually exclusive events:
P(A or B) = P(A) + P(B)
In log-space, this becomes:
log(eᵃ + eᵇ)
where a = log P(A) and b = log P(B). This comes up constantly: marginalizing over hidden states in HMMs, summing over paths in CTC, computing partition functions.
The Problem
```python
>>> import math
>>> a, b = -1000, -1001
>>> math.exp(a)
0.0  # underflow
>>> math.log(math.exp(a) + math.exp(b))
ValueError: math domain error  # log(0) is undefined
```

Exponentiating large negative numbers underflows. We're back where we started.
The Solution
Factor out the maximum:
log(eᵃ + eᵇ) = max(a, b) + log(1 + e^(−|a−b|))
The exponent −|a−b| is always ≤ 0, so e^(−|a−b|) ≤ 1. No overflow, and the underflowed term is added to 1, so we don't lose precision.
```python
import math

def log_sum_exp(a, b):
    max_val = max(a, b)
    return max_val + math.log(math.exp(a - max_val) + math.exp(b - max_val))
```

For vectors:
logsumexp(x) = max(x) + log Σᵢ exp(xᵢ − max(x))
This is exactly what softmax uses internally:
softmax(x)ᵢ = exp(xᵢ − logsumexp(x))
The log-sum-exp trick keeps computation in log-space while handling addition.
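A runnable version of the vector form, as a minimal stdlib-only sketch (the names `logsumexp` and `stable_softmax` are illustrative; `scipy.special.logsumexp` provides the same operation in production code):

```python
import math

def logsumexp(xs):
    # Subtract the max so every exponent is <= 0: no overflow,
    # and the largest term becomes exp(0) = 1, preserving precision
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def stable_softmax(xs):
    # softmax(x)_i = exp(x_i - logsumexp(x))
    lse = logsumexp(xs)
    return [math.exp(x - lse) for x in xs]

log_probs = [-1000.0, -1001.0, -1002.0]
print(logsumexp(log_probs))            # about -999.59, no underflow
print(sum(stable_softmax(log_probs)))  # 1.0 (up to rounding)
```

Note that the naive `math.log(sum(math.exp(x) for x in log_probs))` would raise a `ValueError` on these inputs, since every `exp` underflows to 0.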
Fourier Transform: Convolution to Multiplication
Another profound domain transformation.
The Problem
Convolution in the spatial domain:
(f ∗ g)[n] = Σₘ f[m] · g[n−m]
For signals of length N, this is O(N²): N outputs, each summing over N terms.
The Convolution Theorem
The Fourier transform has a remarkable property:
ℱ(f ∗ g) = ℱ(f) · ℱ(g)
Convolution in space = multiplication in frequency.
The algorithm:
```python
import numpy as np

def fast_convolve(f, g):
    # Zero-pad to the full output length so the circular convolution
    # computed by the FFT matches linear convolution
    n = len(f) + len(g) - 1
    F_f = np.fft.fft(f, n)        # O(n log n)
    F_g = np.fft.fft(g, n)        # O(n log n)
    F_result = F_f * F_g          # O(n), element-wise
    return np.fft.ifft(F_result)  # O(n log n)
```

Total: O(n log n) instead of O(n²).
For large kernels, the transform cost is negligible. For small kernels (3×3, 5×5), direct convolution is faster: the transform overhead dominates.
In practice, modern deep learning frameworks use highly optimized direct convolution (cuDNN) even for medium-sized kernels, because GPU implementations are so fast. But FFT-based convolution remains important in signal processing and for very large kernels.
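To see the convolution theorem in action, one can check the FFT route against direct convolution. A quick NumPy sketch (the signal lengths are arbitrary; the padding length is the linear-convolution output size):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(256)
g = rng.standard_normal(64)

# Direct convolution: O(n^2)
direct = np.convolve(f, g)

# FFT route: pad to the full linear-convolution length first
n = len(f) + len(g) - 1
fft_based = np.fft.ifft(np.fft.fft(f, n) * np.fft.fft(g, n)).real

print(np.allclose(direct, fft_based))  # True
```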
Why Transformers Donât Use FFT
Attention looks like it should benefit from FFT:
Attention = softmax(QKᵀ)V
The QKᵀ multiplication is O(n²d) for sequence length n.
But there's a catch: attention isn't convolution.
Convolution is translation-equivariant: shift the input, shift the output. The kernel slides uniformly.
Attention is content-based: each position attends differently based on what's there, not where it is.
The Fourier transform exploits the structure of convolution. Attention doesn't have that structure.
(Some efficient attention methods like Performers and Linear Attention do use kernel approximations, but they're approximating attention, not computing it exactly via FFT.)
Embeddings: Sparse to Dense
A different kind of domain transformation.
The Problem
Words are categorical: there's no natural arithmetic on them.
"king" − "man" + "woman" = ???
With one-hot encoding, words are orthogonal vectors in a 50,000-dimensional space. No similarity, no structure.
The Solution
Learn a dense embedding:
word → embed → ℝᵈ
Now:
embed("king") − embed("man") + embed("woman") ≈ embed("queen")
The transformation creates structure that wasn't there before.
This isn't just dimensionality reduction. It's finding a space where the hard problem (word relationships) becomes easy (vector arithmetic).
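As a toy illustration of that vector arithmetic (the 2-d vectors below are made up, encoding only "royalty" and "gender" axes; real learned embeddings have hundreds of dimensions and only approximate this):

```python
import numpy as np

# Hypothetical 2-d "embeddings" with axes [royalty, gender] -- a toy,
# not learned vectors
embed = {
    "king":  np.array([1.0,  1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "queen": np.array([1.0, -1.0]),
}

result = embed["king"] - embed["man"] + embed["woman"]
# Nearest neighbor of the result by Euclidean distance
nearest = min(embed, key=lambda w: np.linalg.norm(embed[w] - result))
print(nearest)  # queen
```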
The Kernel Trick: Linear in a Lifted Space
Sometimes the solution isn't a lower-dimensional space; it's a higher one.
The Problem
Data isnât linearly separable:
```
      ○ ○ ○
    ○       ○
  ○   ● ● ●   ○
  ○   ● ● ●   ○
  ○   ● ● ●   ○
    ○       ○
      ○ ○ ○
```

No linear boundary separates ● from ○.
The inner class is surrounded. No line can divide them.
The Solution
Map to a higher-dimensional space where it is linearly separable:
φ: ℝᵈ → ℝᴰ where D > d
For the concentric pattern, add just one feature: z = x² + y² (squared distance from center). Now the inner class has small z, the outer class has large z. A horizontal plane separates them.
The kernel trick computes dot products in the lifted space without explicitly computing the transformation:
K(x, y) = ⟨φ(x), φ(y)⟩
For the RBF kernel:
K(x, y) = exp(−‖x − y‖² / 2σ²)
This implicitly works in an infinite-dimensional space, but you never compute it directly.
The kernel trick is a domain transformation you don't have to pay for.
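The single-feature lift described above can be checked directly. A sketch with synthetic concentric rings (the radii, noise level, and the threshold z = 4 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def ring(radius, n=100):
    # Points scattered around a circle of the given radius
    theta = rng.uniform(0, 2 * np.pi, n)
    r = radius + rng.normal(0, 0.1, n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

inner = ring(1.0)  # the surrounded class
outer = ring(3.0)  # the surrounding class

# No line separates them in (x, y), but the lifted feature
# z = x^2 + y^2 does: a horizontal plane at z = 4 splits them
z_inner = (inner ** 2).sum(axis=1)
z_outer = (outer ** 2).sum(axis=1)
print(z_inner.max() < 4 < z_outer.min())  # True
```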
Spectral Graph Methods
Graph Laplacians transform graph problems into linear algebra.
The Problem
Clustering nodes in a graph based on connectivity. Community detection. Graph partitioning.
These seem like discrete, combinatorial problems.
The Solution
Compute the graph Laplacian:
L = D − A
where D is the degree matrix and A is the adjacency matrix.
The eigenvectors of L embed nodes into Euclidean space:
nodeᵢ → (v₁(i), v₂(i), …, vₖ(i))
In this space, k-means clustering solves the graph partitioning problem.
A discrete optimization problem becomes a tractable eigenvector computation.
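For a concrete miniature, the sign pattern of the Laplacian's second eigenvector (the Fiedler vector) already recovers the two communities of a small graph. A NumPy sketch (the toy graph is made up for illustration; real spectral clustering would run k-means on several eigenvectors):

```python
import numpy as np

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2,3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A  # L = D - A

# Eigenvectors sorted by eigenvalue; the second (Fiedler) vector
# encodes the best cut
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]

# Thresholding at zero separates the two triangles
labels = (fiedler > 0).astype(int)
print(labels)  # [0 0 0 1 1 1] or its complement
```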
Numerical Stability: A Recurring Theme
Many domain transformations exist purely for numerical reasons: summing log-probabilities instead of multiplying raw probabilities, log-sum-exp instead of naive exponentiation, softmax computed via logsumexp. Each swaps a mathematically equivalent formulation for a numerically stable one: equivalent in exact arithmetic, different in floating-point.
When to Look for a Transform
The pattern recurs:
Multiplication → Addition: Use logarithms
Convolution → Element-wise: Use Fourier
Non-linear → Linear: Use kernel methods
Discrete → Continuous: Use spectral methods
High-dimensional sparse → Low-dimensional dense: Use embeddings
The skill is recognizing when your hard problem has an easier dual.
Questions to ask:
What's expensive or unstable? Products of small numbers? Convolution? Non-linear optimization?
Is there a standard transform? Log, Fourier, Laplace, embeddings?
Does the transform preserve what you need? Some transforms lose information.
Is the round-trip worth it? Transform cost + easy computation < hard computation?
The Meta-Lesson
The previous articles in this series covered:
Associativity: Enables chunking, parallelization, streaming
Commutativity: Enables reordering, permutation invariance
Linearity: Enables batching, gradient accumulation
Domain transformations are different. They're not about properties of operations; they're about choosing the right space.
But the connection is deep: transforms often work because they reveal simpler structure.
Fourier reveals that convolution is pointwise in frequency space
Logarithms reveal that multiplication is additive in log space
Embeddings reveal that semantic relationships are approximately linear in embedding space
The transform finds a space where the problem has the structure you need.
The Takeaway
Don't solve the hard problem directly. Ask: is there an easier space?
Products underflow → work in log space
Convolution is O(n²) → work in frequency space
Data isn't linearly separable → work in kernel space
The transform might cost something. But if the computation in the new space is dramatically cheaper, you win.
This is computational judo: using structure to redirect effort. The best algorithms don't fight the problem; they find the space where the problem solves itself.
See also:
The One Property That Makes FlashAttention Possible – Associativity
Why Transformers Need Positional Encodings – Commutativity
Why Batching Works – Linearity
Further Reading
Numerical Recipes, Ch. 12–13 – FFT and spectral methods
Murphy, "Probabilistic Machine Learning" – Log-space computation throughout
Schölkopf & Smola, "Learning with Kernels" – Kernel methods and feature spaces
Von Luxburg, "A Tutorial on Spectral Clustering" – Graph Laplacians and embeddings




