The statistical basis of Fermi estimates
February 12, 2021. Why are Fermi approximations so effective? One important factor is log normality, which occurs for large random products. Another element is variance-reduction through judicious subestimates. I discuss both and give a simple heuristic for the latter.
Introduction
Fermi approximation is the art of making good order-of-magnitude estimates. I’ve written about them at greater length here and here, but I’ve never really found a satisfactory explanation for why they work. Order-of-magnitude is certainly a charitable margin of error, but time and time again, I find they are better than they have any right to be! Clearly, there must be an underlying statistical explanation for this apparently unreasonable effectiveness.
Products and log-normality
There are two key techniques: the use of geometric means, and the factorisation into subestimates. We start with geometric means. Suppose a random variable $F$ is a product of many independent random variables,
\[F = X_1 X_2 \cdots X_N.\]Then the logarithm of $F$ is a sum of many random variables $Y_i = \log X_i$:
\[\log F = \log X_1 + \log X_2 + \cdots + \log X_N = \sum_{i=1}^N Y_i.\]By the central limit theorem for unlike variables (see e.g. this post), for large $N$ this approaches a normal distribution:
\[\log F \to \mathcal{N}(\mu, \sigma^2), \quad \mu := \sum_i \mu_i, \quad \sigma^2 := \sum_i \sigma_i^2,\]where the $Y_i$ have mean $\mu_i$ and variance $\sigma_i^2$. We say that $F$ has a log-normal distribution, since its log is normal.
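To see this concretely, here is a quick simulation sketch; the uniform factors on $[1, 10]$ and the use of numpy are my own arbitrary choices, and any well-behaved positive factor distribution tells the same story:

```python
import numpy as np

# Multiply N independent positive factors together many times over, and check
# that log(F) looks far more symmetric (i.e. closer to normal) than F itself.
rng = np.random.default_rng(0)
N, trials = 30, 100_000

factors = rng.uniform(1.0, 10.0, size=(trials, N))
F = factors.prod(axis=1)   # F = X_1 X_2 ... X_N
log_F = np.log(F)          # should be approximately normal for large N

def skewness(x):
    """Sample skewness: zero for a symmetric (e.g. normal) distribution."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

print(f"skewness of F:     {skewness(F):.2f}")      # large and positive
print(f"skewness of log F: {skewness(log_F):.2f}")  # close to zero
```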
Geometric means
In Fermi estimates, one of the basic techniques is to take geometric means of estimates, typically an overestimate and an underestimate. For instance, to Fermi estimate the population of Chile, I could consider a number like one million which seems much too low, and a number like one hundred million which seems much too high, and take their geometric mean:
\[\sqrt{(1 \text{ million}) \times (100 \text{ million})} = 10 \text{ million}.\]Since population is a product of many different factors, it is reasonable to expect it to approximate a log-normal distribution. Then, after logs, the geometric mean $\sqrt{ab}$ becomes the arithmetic mean of $\log a$ and $\log b$:
\[\log \sqrt{ab} = \frac{1}{2}(\log a + \log b).\]Taking the mean $\mu$ of the distribution as the true value, these geometric means provide an unbiased estimator of the mean. Moreover, the variance of the estimate will decrease as $1/k$ for $k$ samples (assuming human estimates sample from the distribution), so more is better. To see how much better I could do on the Chile population estimate, I solicited guesses from four friends, and obtained $20, 20, 30$ and $35$ million. Combining with my estimate, I get a geometric mean
\[(10 \times 20 \times 20 \times 30 \times 35)^{1/5} \text{ million} \approx 21 \text{ million}.\]The actual population is around $18$ million, so the estimate made from more guesses is indeed better! This is also better than the arithmetic average, $23$ million. Incidentally, this also illustrates the wisdom of the crowd, also called “diversity of prediction”. The individual errors from a broad spread of guesses tend to cancel each other out, leading to a better-behaved average, though in this case in logarithmic space.
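As a minimal sketch of the arithmetic, with the five guesses above in millions:

```python
import math

# Combine Fermi guesses (in millions) by their geometric mean,
# i.e. the arithmetic mean of their base-10 logarithms.
guesses = [10, 20, 20, 30, 35]

mean_log = sum(math.log10(g) for g in guesses) / len(guesses)
geometric_mean = 10 ** mean_log
arithmetic_mean = sum(guesses) / len(guesses)

print(f"geometric mean:  {geometric_mean:.1f} million")   # ~21.1
print(f"arithmetic mean: {arithmetic_mean:.1f} million")  # 23.0
```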
In general, Fermi estimates work best for numbers which are large random products (this is how we try to solve them!), so the problem domain tends to enforce the statistical properties we want. For many examples of log-normal distributions in the real world, see Limpert, Stahel and Abbt (2001). It’s worth noting that not everything we can Fermi estimate is log-normal, however. Many things in the real world obey power laws, for instance, and although you can exploit this to make better Fermi estimates (as lukeprog does in his tutorial), we can happily Fermi estimate power-law distributed numbers without this advanced technology.
Are Fermi estimates unreasonably effective in this context? Maybe. But the estimates work best in the high-density core, where the candidate distributions all look much alike, and it’s only out in the tails that the difference between the log-normal and a power law (or exponential, or Weibull, or your favourite skewed distribution) becomes pronounced. So the unreasonable effectiveness here can probably be explained by the resemblance to the log-normal, though this is something I’d like to check more carefully in future.
The philosophy of subestimates
Now that we’ve dealt with geometric means and log-normality, we turn to the effectiveness of factorising a Fermi estimate. If we take logarithms, factors become summands, and we’ll reason about those since they are simpler. If $Z = X + Y$ is a sum of independent random variables, the variance is additive, so that
\[\text{var}(Z) = \text{var}(X) + \text{var}(Y).\]Thus, splitting a sum into estimates of the summands and adding them should not change the variance of the guess. Of course, there is a fallacy in this reasoning: humans are not sampling from the underlying distribution! When we guess, we introduce our own random errors. For instance, my estimate for $Z$ will have some human noise $\varepsilon_Z$:
\[\hat{Z} = Z + \varepsilon_Z.\]Similarly, my guesses for $X$ and $Y$ have some random errors $\varepsilon_X$ and $\varepsilon_Y$. There is no reason for the variances of $\varepsilon_X$ and $\varepsilon_Y$ to add up to the variance of $\varepsilon_Z$. The sum could be bigger, or it could be smaller. But a good decomposition should reduce the combined variance:
\[\text{var}(\varepsilon_X) + \text{var}(\varepsilon_Y) < \text{var}(\varepsilon_Z).\]If log-normality is the science of Fermi estimates, picking variance-reducing subestimates is the art. Still, I suspect that $\hat{Z}$ behaves, roughly speaking, like a test statistic for $Z$, with the number of samples corresponding to how many data points for $Z$ we have encountered. So we expect that $\text{var}(\varepsilon_Z)$ will vanish roughly as $1/k$ with $k$ samples. If we have had more exposure to the distributions for $X$ and $Y$ than to $Z$ itself, the combined error will probably be smaller. This is why we carve into subfactors we understand!
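To make the intuition concrete, here is a toy simulation sketch. The noise scales are numbers I’ve invented to stand in for how well each quantity is known, with the subfactors assumed to be better known than the whole; the point is only that the split wins when its noise variances sum to less than the direct one.

```python
import numpy as np

# Toy model of guessing error, working in log space so that Z = X + Y.
# All noise scales below are invented for illustration.
rng = np.random.default_rng(1)
trials = 100_000

X_true, Y_true = 1.1, 6.2   # e.g. log10(number of regions), log10(people per region)
Z_true = X_true + Y_true

# Direct guess: one big, noisy estimate of Z.
direct = Z_true + rng.normal(0.0, 1.0, trials)             # var(eps_Z) = 1.0

# Factorised guess: two smaller, better-known pieces.
split = (X_true + rng.normal(0.0, 0.4, trials)             # var(eps_X) = 0.16
         + Y_true + rng.normal(0.0, 0.5, trials))          # var(eps_Y) = 0.25

print(f"variance of direct guess:     {direct.var():.2f}")  # ~1.00
print(f"variance of factorised guess: {split.var():.2f}")   # ~0.41
```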
Variance reduction in practice
I’ll end with a speculative rule of thumb for when to factor: try generating over- and underestimates for the factors and the product, which in additive notation give
\[(\Delta X)^2 + (\Delta Y)^2, \quad (\Delta Z)^2\]where $\Delta$ refers to the difference of the (logarithm of the) over- and underestimate. Factorise if the first estimated error is smaller than the second. Let’s illustrate by returning to the population of Chile. I can try factoring it into a number of regions multiplied by the average number of people per region. Taking logs (in base $10$) of the over- and underestimate of Chile’s population I gave above, I get
\[(\Delta Z)^2 = (\log_{10} 10^8 - \log_{10} 10^6)^2 = 4.\]On the other hand, for the number of regions I would make a lower guess of $5$ and an upper guess of $30$, giving a squared difference in logs of $(\Delta X)^2 = 0.6$. For the population per region, I would make a lower guess of $5\times 10^5$ and an upper guess of $5\times 10^6$, with $(\Delta Y)^2 = 1$. Thus,
\[(\Delta X)^2 + (\Delta Y)^2 = 1.6 < 4 = (\Delta Z)^2.\]The guess from the factorisation (taking geometric means) is
\[\sqrt{5 \times 30 \times (5\times 10^5) \times (5\times 10^6)} \approx 19 \text{ million}.\]This is even better than the crowdsourced estimate! For reference, the number of regions is $16$, while our geometric mean estimate is around $12$, and the average population per region is a bit over a million, which we’ve mildly overestimated at $1.6$ million. The two balance out and give a better overall estimate.
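The rule of thumb is easy to mechanise. Here is a minimal sketch; the helper `should_factorise` and its interface are my own invention, fed with the base-$10$ bounds used above.

```python
import math

def log_spread_sq(lo, hi):
    """Squared difference of the base-10 logs of an under- and overestimate."""
    return (math.log10(hi) - math.log10(lo)) ** 2

def should_factorise(direct_bounds, factor_bounds):
    """True if the summed spreads of the factors beat the direct spread."""
    direct = log_spread_sq(*direct_bounds)
    factored = sum(log_spread_sq(lo, hi) for lo, hi in factor_bounds)
    return factored < direct

# Chile: direct bounds 1e6..1e8; factors are the number of regions (5..30)
# and the population per region (5e5..5e6).
print(should_factorise((1e6, 1e8), [(5, 30), (5e5, 5e6)]))  # True, since 1.6 < 4
```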
Conclusion
From a statistical perspective, Fermi estimates are based on two techniques: geometric means and splitting into subfactors. We usually estimate things which can be expressed as a product of many factors. These will tend towards a log-normal distribution by the (log of the) central limit theorem, so that geometric means provide a good estimator, exactly like the usual mean for normally distributed variables. Subestimates, on the other hand, carve guesses into factors we understand, i.e. have more data points for, so that (assuming they behave like test statistics) variance is reduced. The effectiveness of Fermi estimates is quite reasonable after all!