Jekyll2023-11-05T22:39:25+00:00http://hapax.github.io/feed.xmlDavid WakehamQML researcherDavid WakehamPortfolio optimization2022-12-05T00:00:00+00:002022-12-05T00:00:00+00:00http://hapax.github.io/math/probability/finance/portfolio<p><strong>December 5, 2022.</strong> <em>A quick, first-principles derivation of
optimal portfolios with risk.</em></p>
<h2 id="introduction">Introduction</h2>
<hr />
<p>Suppose there are $n$ bets I can make, with the return per dollar for
bet $k$ represented by a random variable $B_k$
with expected value $\mu_k = \mathbb{E}[B_k]$.
If I invest $\omega_k$ on bet
$k$, and have a fixed total $\Omega$, then I can view the value of my
portfolio as a random variable</p>
\[P(\omega_k) = \sum_{k = 1}^n \omega_k B_k.\]
<p>If I try to optimize the expected return $\mu_P = \mathbb{E}[P]$, I get a boring linear function</p>
\[\mu_P(\omega_k) = \sum_{k =
1}^{n} \omega_k \mu_k.\]
<p>To maximize this, introduce a Lagrange multiplier
$\gamma$ to enforce the fixed total:</p>
\[\mu_P(\omega_k, \gamma) = \sum_{k =
1}^{n} \omega_k \mu_k + \gamma \left(\Omega - \sum_{k=1}^{n}\omega_k\right).\]
<p>This is linear and has no local maxima, so the maximal return must lie at the edge of the
feasible region. In fact, it’s clear that we just invest all our money in
the bet with maximum return:</p>
\[P^* = \Omega B_{k^*}, \quad k^* = \text{argmax}_k\,\mu_k.\]
<p>But putting all your eggs in one basket seems like a bad idea.
My intuition is that to optimize my portfolio, my investment should
include a spread of high-risk, high-return and low-risk, low-return
bets.
What have we missed?</p>
<div style="background-color: #EAD1DC ; padding: 10px; border: 1px
solid purple; line-height:1.5">
<b>Portfolio optimization.</b> <br />
How do we assess the value of a portfolio so the optimum can
accommodate risk-aversion?
</div>
<h2 id="derisky-business">Derisky business</h2>
<hr />
<div style="background-color: #cfc ; padding: 10px; border: 1px
solid green; line-height:1.5">
<b>Derisked return.</b><br />
We can maximize a convex combination of expected return and (negative)
variance.
</div>
<p>Maximizing expected return ignores the <em>risk</em> altogether!
The simplest way to measure the risk of our portfolio is the total variance,</p>
\[\sigma^2_P = \mathbb{E}[(P - \mu_P)^2].\]
<p>If the bets are independent random variables, then the variance is
additive, with</p>
\[\sigma^2_P(\omega_k) = \sum_{k =
1}^{n} \omega_k^2 \sigma^2_k,\]
<p>where $\sigma^2_k$ is the variance of $B_k$.
If they are not independent, then we add covariance terms:</p>
\[\sigma^2_P(\omega_k) = \sum_{k = 1}^{n} \omega_k^2 \sigma^2_k + \sum_{j \neq k}
\omega_j\omega_k\text{cov}(B_j, B_k), \quad \text{cov}(B_j, B_k) =
\mathbb{E}[(B_j - \mu_j)(B_k - \mu_k)].\]
<p>Instead of just maximizing expected return, we should balance return
and risk. A simple way to do this is to maximize the convex
combination of $\mu_P$ and $-\sigma^2_P$, which we’ll call the
<em>$\lambda$-derisked return</em>:</p>
\[R_\lambda(\omega_k) = (1 - \lambda) \mu_P(\omega_k) - \lambda \sigma^2_P(\omega_k).\]
<p>The expected return is $0$-derisked, while $1$-derisked
return minimizes the variance of the portfolio and ignores the return
completely.</p>
<h2 id="distributing-eggs">Distributing eggs</h2>
<hr />
<div style="background-color: #cfc ; padding: 10px; border: 1px
solid green; line-height:1.5">
<b>Derisking limits.</b><br />
In the $\lambda \to 0$ derisked limit, optimal investments are proportional to
expected return and inverse variance, but as $\lambda \to 1$, to
inverse variance only.
</div>
<p>For simplicity, let’s assume our bets are independent.<label for="sn-1" class="margin-toggle sidenote-number">
</label>
<input type="checkbox" id="sn-1" class="margin-toggle" />
<span class="sidenote">Otherwise, we simply diagonalize the
covariance matrix and go to a basis of orthogonal bets.</span>
Then adding a Lagrange multiplier as above, we get</p>
\[\begin{align*}
R_\lambda(\omega_k, \gamma) & = \sum_{k=1}^n
\left[(1-\lambda) \omega_k\mu_k - \lambda \omega_k^2\sigma_k^2\right] +
\gamma \left(\sum_{k=1}^n\omega_k - \Omega\right).
\end{align*}\]
<p>The partial derivatives are</p>
\[\partial_{\omega_k} R_\lambda = (1-\lambda)\mu_k + \gamma - 2\lambda
\omega_k \sigma_k^2,\]
<p>so we have an extremum at</p>
\[\omega_k = \frac{(1-\lambda)\mu_k + \gamma}{2\lambda \sigma_k^2}.\]
<p>To determine the value of $\gamma$, note from our constraint that</p>
\[\begin{align*}
\Omega & = \sum_{k=1}^n\omega_k \\
& = \sum_{k=1}^n\frac{(1-\lambda)\mu_k +
\gamma}{2\lambda \sigma_k^2} \\
\Longrightarrow \quad \gamma
& = 2\lambda\left(\sum_{k=1}^n\frac{1}{\sigma_k^2}\right)^{-1}\left[\Omega -
\sum_{k=1}^n\frac{(1-\lambda)\mu_k}{2\lambda \sigma_k^2}\right].
\end{align*}\]
<p>Since $\gamma \simeq \lambda$, for small $\lambda$ (a return-oriented investor) we can ignore that
$\gamma$ term. Then the investments</p>
\[\omega_k \approx \frac{(1 - \lambda)\mu_k}{2\lambda\sigma^2_k} \propto
\frac{\mu_k}{\sigma^2_k},\]
<p>so we weight investments proportional to expected return, but
inversely to variance. Sounds sensible!
On the other hand, when $\lambda \to 1$ (a risk-averse investor), the Lagrange multiplier
$\gamma \gg (1 - \lambda)\mu_k$, so that the investment is
proportional to the inverse variance only:</p>
\[\omega_k\approx \frac{\gamma}{2\lambda\sigma_k^2} \propto \frac{1}{\sigma_k^2}.\]
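<p>As a sanity check, here is a minimal Python sketch that evaluates the exact extremum, using the formulas for $\omega_k$ and $\gamma$ above. The function name and the sample returns and variances are hypothetical, chosen only for illustration. Nothing in the formula forbids negative weights, so the sketch allows them.</p>

```python
def optimal_weights(mu, var, lam, total=1.0):
    """Exact extremum of the lambda-derisked return for independent bets.

    mu, var: expected returns and variances of each bet (illustrative inputs).
    lam: derisking parameter with 0 < lam <= 1.
    total: the fixed budget Omega.
    """
    inv_var_sum = sum(1.0 / v for v in var)
    # Return-driven part of the constraint sum.
    drift = sum((1 - lam) * m / (2 * lam * v) for m, v in zip(mu, var))
    # Lagrange multiplier gamma, fixed by the budget constraint.
    gamma = 2 * lam * (total - drift) / inv_var_sum
    # omega_k = ((1 - lam) mu_k + gamma) / (2 lam sigma_k^2)
    return [((1 - lam) * m + gamma) / (2 * lam * v) for m, v in zip(mu, var)]


# Hypothetical bets: low-risk/low-return through high-risk/high-return.
mu = [0.05, 0.10, 0.20]
var = [0.01, 0.04, 0.25]
for lam in (0.25, 0.5, 1.0):
    print(lam, optimal_weights(mu, var, lam))
```

<p>Whatever the value of $\lambda$, the weights sum to $\Omega$, and at $\lambda = 1$ they are exactly proportional to $1/\sigma_k^2$.</p>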
<p>Intermediate values of $\lambda$ interpolate between these two regimes, with a
degeneracy at $\lambda = 0$ where only expected return matters.
Thus, we have a whole one-parameter family of risk-sensitive ways to value a
portfolio!</p>David WakehamDecember 5, 2022. A quick, first-principles derivation of optimal portfolios with risk.Approximating large powers2022-12-03T00:00:00+00:002022-12-03T00:00:00+00:00http://hapax.github.io/math/hacks/powers<p><strong>December 3, 2022.</strong> <em>A short guide to estimating large powers.</em></p>
<h2 id="introduction">Introduction</h2>
<hr />
<p>Say I want to estimate a perfect power like $67^{13}$, but don’t have a calculator.<label for="sn-1" class="margin-toggle sidenote-number">
</label>
<input type="checkbox" id="sn-1" class="margin-toggle" />
<span class="sidenote">If this isn’t sufficient motivation, it’s
easy to make the power so large that no calculator will give you an
answer!</span> How do I go
about approximating it? I’ll build up a few techniques that are
sufficient for an order of magnitude estimate, and even a significant
digit or two.</p>
<div style="background-color: #EAD1DC; padding: 10px; border: 1px
solid purple; line-height:1.5">
<b>The proximate power problem.</b> <br />
Give an order of magnitude estimate of $n^p$, where $n$ and $p$ are
potentially large integers, without a calculator. For bonus points, provide a significant digit.
</div>
<h2 id="perfect-powers">Perfect powers</h2>
<hr />
<div style="background-color: #cfc; padding: 10px; border: 1px
solid green; line-height:1.5">
<i>Tip 1.</i> <b>Single-digit powers.</b><br />
Know how to relate single-digit powers to powers of $10$.
</div>
<p>The first step is to relate single-digit powers to powers of $10$. For
instance, as commonly known to coders, $2^{10} = 1024 \approx 10^3$,
so we can approximate binary powers easily enough. Here’s a list of
tricks for the bases $2$ to $7$, omitting $4 = 2^2$, which reduces to a power of $2$:</p>
\[\begin{align*}
2^{10} & = 1024 \approx 10^3 \\
3^2 & = 9 \approx 10 \\
5 & = \frac{10}{2} \\
6^9 & = 1.01 \times 10^7 \approx
10^7 \\
7^2 & = 49 \approx \frac{100}{2}.
\end{align*}\]
<p>Also, for good measure:</p>
\[e^3 \approx 20.\]
<p>We can use these to give quick and dirty estimates. For instance,</p>
\[\begin{align*}
67^{13} & = 6.7^{13} \times 10^{13} \\
& \approx 7\times 7^{12}\times 10^{13} \\
& \approx 7 \times 49^6 \times 10^{13} \\
& \approx \frac{7}{2^6} \times 100^6 \times 10^{13} \\
& \approx 10^{24}.
\end{align*}\]
<p>If you get a calculator out, you find the answer is in fact</p>
\[67^{13} = 5.5 \times 10^{23},\]
<p>so this is correct to the nearest order of magnitude. Great! But
clearly, by replacing $6.7$ by $7$ on the second line we are going to
overestimate. Can we do better? The rest of this post is devoted to
exploring techniques for doing this, but if you’re happy with order of
magnitude, stop here.</p>
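<p>If you do have a computer handy, Python's arbitrary-precision integers make it easy to check an estimate like this. The sketch below is only a checking aid, and the helper name is made up for illustration.</p>

```python
# Python integers have arbitrary precision, so this is exact.
exact = 67 ** 13

# The quick estimate from the text: (7 / 2^6) x 100^6 x 10^13.
estimate = 7 * 100 ** 6 * 10 ** 13 / 2 ** 6


def order_of_magnitude(n):
    """Exponent of the leading power of ten of a positive integer n."""
    return len(str(n)) - 1


print(order_of_magnitude(exact))
print(estimate / exact)  # how far off the quick estimate is, as a ratio
```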
<h2 id="binomial-boost">Binomial boost</h2>
<hr />
<div style="background-color: #cfc ; padding: 10px; border: 1px
solid green; line-height:1.5">
<i>Tip 2.</i> <b>Binomial expansions.</b><br />
Improve accuracy by performing a binomial expansion.
</div>
<p>The <a href="https://en.wikipedia.org/wiki/Binomial_theorem">binomial theorem</a>
gives us a way to improve these estimates.
In general, we have</p>
\[(1+x)^n = 1 + nx + \binom{n}{2}x^2 + \cdots + x^n = \sum_{k=0}^n \binom{n}{k}x^k.\]
<p>So, for instance,</p>
\[\begin{align*}
67^{13} &= 70^{13}\left(1 - \frac{0.3}{7}\right)^{13} \\
& =
70^{13}\left[1 - \frac{13\times 0.3}{7} + \frac{13\times 12 \times (0.3)^2}{2\times 7^2} - \frac{13 \times 12 \times 11 \times (0.3)^3}{6\times 7^3} + \cdots\right]\\
& \approx 70^{13}\left[1 - 0.55 + 0.14 - 0.02 \right]\\
& \approx 0.57 \times 10^{24} \\
& = 5.7 \times 10^{23},
\end{align*}\]
<p>using the estimate from the previous section.
This is much better!
We’ve ignored the factor of $70/2^6$, which means we’ve
underestimated, but we’ve also replaced $7^{12}$ with $(100/2)^6$,
which is an overestimate, and the two almost cancel. As an exercise,
you can use the binomial approximation to check this.</p>
<p>In doing a binomial expansion, where should you stop? Depends on how
much precision you want. Here, I went to third order since it gave
terms of size $\sim 0.01$, which is the precision I wanted to try and
match the correct answer above. How did I know? Well, I know terms in
the expansion have the form</p>
\[\binom{n}{k}x^k = \binom{n}{k-1} x^{k-1} \times \frac{x (n-k+1)}{k},\]
<p>so for $n = 13$ and $x = -0.3/7 \approx -0.04$, each term is the previous one multiplied by
$\sim 0.04 \times (14-k)/k$ in magnitude, which is well below $1$. So I can stop once a term reaches the
size I want: in this case, the third-order term, which is of order $\sim 0.01$.</p>
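<p>To see how quickly the expansion settles down, here is a small Python sketch of the partial sums for $n = 13$ and $x = -0.3/7$. The function name is an illustrative choice, not anything standard.</p>

```python
from math import comb


def binomial_partial_sum(x, n, order):
    """Sum of the terms C(n, k) x^k of (1 + x)^n for k = 0, ..., order."""
    return sum(comb(n, k) * x ** k for k in range(order + 1))


x, n = -0.3 / 7, 13
exact_factor = (67 / 70) ** 13  # the quantity the expansion approximates

for order in range(5):
    approx = binomial_partial_sum(x, n, order)
    print(order, approx, approx - exact_factor)
```

<p>The last column is the truncation error at each order, which is what you are really trading off against effort.</p>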
<h2 id="fast-factors">Fast factors</h2>
<hr />
<div style="background-color: #cfc ; padding: 10px; border: 1px
solid green; line-height:1.5">
<i>Tip 3.</i> <b>Factorize.</b><br />
Factorize to simpler nearby numbers, then restore the original with a binomial expansion.
</div>
<p>There are other ways to skin this cat. Another strategy is factoring
to a simpler number nearby. In our case, we can note that</p>
\[67 \approx 66 = 6 \times 11.\]
<p>Then</p>
\[\begin{align*}
67^{13} & \approx 6^{13} \times 11^{13} \\
& \approx 6^4 \times 6^9 \times 10^{13}\times (1 + 0.1)^{13} \\
& \approx 1300 \times 10^{20} \times (1 + 1.3 + 0.78 + 0.286) \\
& \approx 1.3 \times 10^{23} \times 3.37 \\
& \approx 4.4 \times 10^{23},
\end{align*}\]
<p>using our trick $6^9 \approx 10^7$ on the third line. Again, we can
improve this estimate by binomially expanding from $66^{13}$ to
$67^{13}$, a task I leave for the diligent reader. Taking just the
leading term in this second binomial expansion gives $5.3 \times
10^{23}$, a decent improvement.
I’m not sure I like this method better — it involves two expansions — but it does illustrate the utility of factoring.</p>
<h2 id="lucky-logs">Lucky logs</h2>
<hr />
<div style="background-color: #cfc ; padding: 10px; border: 1px
solid green; line-height:1.5">
<i>Tip 4.</i> <b>Take logarithms.</b><br />
Use log laws and the Taylor expansion to estimate the log of the base.
</div>
<p>The last method we’ll look at is logarithms.
Here, we use the fact that</p>
\[n^p = 10^{p\log_{10}n},\]
<p>so if we know $\log_{10}n$ we immediately have an order of magnitude
estimate.
We can use log laws</p>
\[\log_b (xy) = \log_bx + \log_b y, \quad \log_b x = \frac{\ln x}{\ln b}\]
<p>where $\ln$ is the natural logarithm, and the Taylor expansion</p>
\[\ln(1 - x) = -x - \frac{x^2}{2} - \frac{x^3}{3} - \cdots.\]
<p>Let’s use these to estimate $\log_{10} 67$. We’ll also exploit the
fact that $\ln 10 \approx 2.3$. From log laws, we have</p>
\[\begin{align*}
\log_{10} 67 & = 2 + \log_{10} 0.67 \\
& = 2 + \frac{\ln 0.67}{\ln 10} \\
& \approx 2 + \frac{\ln 0.67}{2.3}.
\end{align*}\]
<p>We now focus on the Taylor expansion. Since $0.67 \approx 1 - 1/3$, we
can write</p>
\[\ln 0.67 \approx \ln\left(1 - \tfrac{1}{3}\right) = -\frac{1}{3} -
\frac{1}{18} - \frac{1}{3\times 27} - \cdots \approx -\frac{32.5}{81}.\]
<p>So we get an index</p>
\[\begin{align*}
13\log_{10} 67 & \approx 26 - \frac{13\times 32.5}{2.3\times 81} \\
& \approx 26 - \frac{13 \times 35}{2.5 \times 80} \\
& \approx 26 - 2.275 \\ & = 23.725.
\end{align*}\]
<p>So we recover our order of magnitude estimate</p>
\[67^{13} \approx 10^{23.725}.\]
<p>Evaluating the mantissa with a calculator, we get</p>
\[10^{0.725} \approx 5.3,\]
<p>so this method is comparable in accuracy to our binomial expansions.
In both cases, we kept terms up to $x^3$, so this is about what we
expect.</p>
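<p>Here is a rough Python sketch of the log method, so the truncated Taylor series can be compared against the library logarithm. It assumes the same inputs as the text, namely $0.67 \approx 1 - 1/3$ and $\ln 10 \approx 2.3$. The function names are hypothetical.</p>

```python
import math


def log_series(x, order):
    """Truncated Taylor series of ln(1 - x): -(x + x^2/2 + ... + x^order/order)."""
    return -sum(x ** k / k for k in range(1, order + 1))


def index_estimate(order, ln10=2.3):
    """Estimate 13 * log10(67), using 0.67 ~ 1 - 1/3 and ln 10 ~ 2.3."""
    return 13 * (2 + log_series(1 / 3, order) / ln10)


approx_index = index_estimate(3)      # keep terms up to x^3, as in the text
exact_index = 13 * math.log10(67)     # library value for comparison
print(approx_index, exact_index)
```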
<h2 id="magic-mantissas">Magic mantissas</h2>
<hr />
<div style="background-color: #cfc ; padding: 10px; border: 1px
solid green; line-height:1.5">
<i>Tip 5.</i> <b>Evaluate the mantissa.</b><br />
Get a significant digit in the log method by splitting the mantissa
into a simple part and a small part you can Taylor expand with the exponential.
</div>
<p>The disadvantage of the log method is that it’s a bit hard to see what
the mantissa is.
Hard, but not impossible! One method is to use the Taylor
series for the exponential:</p>
\[e^x = 1 + x + \frac{1}{2}x^2 + \frac{1}{3!}x^3 + \cdots .\]
<p>This turns out to be a bit messy to use directly, because
the index is large and you need to include a bunch of terms in the
expansion to get stable digits.
Instead, we split $0.725 = 0.7 + 0.025$, and deal with $0.7$ first:</p>
\[\begin{align*}
10^{0.7} & = 10^{7/10} \\
& = (10^3)^{(7/10) \times (1/3)} \\
& \approx \sqrt[3]{(2^{10})^{7/10}}\\
& = \sqrt[3]{2^7} \\
& = \sqrt[3]{128} \\ & \approx 5,
\end{align*}\]
<p>since $5^3 = 125$.
The cute thing is that we have just used facts from our “power table”.
We can use the exponential expansion for the remaining $0.025 = 1/40$, with</p>
\[10^{1/40} = e^{\ln 10/40} \approx e^{2.3/40} \approx 1 + \frac{2.3}{40} \approx 1.06,\]
<p>using only the leading term in the expansion.
We then multiply to find</p>
\[10^{0.725} = 10^{0.7} \times 10^{0.025} \approx 5 \times 1.06 = 5.3,\]
<p>as claimed above!</p>David WakehamDecember 3, 2022. A short guide to estimating large powers.Anthrometry2022-12-02T00:00:00+00:002022-12-02T00:00:00+00:00http://hapax.github.io/math/hacks/everyday/distance<p><strong>December 2, 2022.</strong> <em>Humans are the measure of all things, though
not in the sense Protagoras meant. I show how to estimate distance
using only your hands and feet.</em></p>
<h2 id="introduction">Introduction</h2>
<hr />
<p>The Greek philosopher Protagoras famously stated that “man is the
measure of all things”. He was also skeptical about whether math could
be applied to the real world. According to Aristotle, he argued that
“perceptible lines are not the kind of things the geometer talks
about”.
We’ll make a bad Protagoran joke and use humans, the “measure of all things”, to
measure distances by exploiting the geometry of “perceptible lines”.</p>
<div style="background-color: #EAD1DC ; padding: 10px; border: 1px
solid purple; line-height:1.5">
<b>Estimating distance.</b> <br />
Is it possible to estimate <i>distance</i> to an object using only hands and feet?
</div>
<h2 id="theory">Theory</h2>
<hr />
<div style="background-color: #cfc ; padding: 10px; border: 1px
solid green; line-height:1.5">
<b>The rule of thumbs (I).</b> <br />
If an object has apparent size $f$,
and after $s$ steps has apparent size $f'$, the distance to the object is
$$
fd = f'(d - s) \quad \Longrightarrow \quad d = \frac{s}{1 -(f/f')}.
$$
</div>
<p>So, consider a distant object you want to find the distance to.
Hold your hand out at a fixed distance from your eyes and fixed
orientation (e.g. horizontal), and estimate the size of the object in
fingers $f$.
(For best results, use an integer number of fingers.)
Your arm and hands form a triangle which is similar to the triangle
formed by the distant object:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/distance1.png" width="700" />
<figcaption><i>Similar triangles formed by your hand and the object.</i></figcaption>
</div>
</figure>
<p>If $a$ is the length of your arm, $h$ the actual height, $d$ the
distance to the object and $f$ the apparent height in fingers, then</p>
\[\frac{f}{a} = \frac{h}{d}. \tag{1} \label{f}\]
<p>Now walk $s$ steps towards the object, so that it has finger
width $f'$.
We now have a new set of similar triangles:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/distance2v2.png" width="700" />
<figcaption><i>Similar triangles after moving towards the object.</i></figcaption>
</div>
</figure>
<p>The ratio of sides now obeys</p>
\[\frac{f'}{a} = \frac{h}{d-s}, \tag{2} \label{fdash}\]
<p>where we measure distance in steps.
We can rearrange $(\ref{f})$ and $(\ref{fdash})$ to eliminate the length of the arm, $a$, and physical
height $h$, to find a method for distance measurement about which
Protagoras might have mixed feelings.</p>
<h2 id="practice">Practice</h2>
<hr />
<p>Take the transition from $f = 3$ to $f' = 4$.
In this case, the distance estimate is</p>
\[d = \frac{s}{1 -(f/f')} = \frac{s}{1 - 3/4} = 4s,\]
<p>or four times the number of steps. I find this works with $90$-$95\%$
accuracy for distances on the order of $50$ steps.
I suspect that varying finger width and step length are the main source of error;
arm length $a$ can be fixed by maximally extending the arm, and
orientation of the hands can be fixed by a reference line, e.g. the
horizon.</p>
<p>Beyond varying width, the main disadvantage of fingers as a
measurement device is their resolution.
By counting the number of steps from $f = 9$ to $f' = 10$, you estimate $d \approx 10s$, but
that is the best you can do. For large ratios, you need to replace
fingers with a finer measurement instrument, such as a clear plastic
ruler held at arm’s length.
Although you will get better results and enable yourself to measure
larger distances, you may look a tad eccentric.
But as Protagoras informs us, that is a tradeoff that each individual
must assess themselves.</p>
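<p>For the record, the rule of thumbs (I) is a one-liner. Here is a minimal Python sketch, where the function name and argument names are my own illustrative choices. It rejects the case $f' \leq f$, since the object must look bigger after walking towards it.</p>

```python
def distance_in_steps(f, f_prime, steps):
    """Distance d to an object, in steps, from d = s / (1 - f / f').

    f: apparent size (in fingers) at the starting point.
    f_prime: apparent size after walking `steps` steps towards the object.
    """
    if f_prime <= f:
        raise ValueError("the apparent size must grow as you approach")
    return steps / (1 - f / f_prime)


# Going from 3 to 4 fingers over 10 steps puts the object about 40 steps away.
print(distance_in_steps(3, 4, 10))
```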
<h2 id="extension">Extension</h2>
<hr />
<div style="background-color: #cfc ; padding: 10px; border: 1px
solid green; line-height:1.5">
<b>The rule of thumbs (II).</b> <br />
The height of an object is related to finger width $f$, arm length
$a$, and distance $d$, by
$$
h = \frac{df}{a}.
$$
</div>
<p>It’s simple to extend this method to estimate height.
Once $d$ is known, we can use $(\ref{f})$ to give the physical height
(or width, or whatever we’ve measured) in terms of other quantities:</p>
\[h = \frac{df}{a}.\]
<p>Now, if we know $a$ in finger widths, we will get an estimate of the
height in steps.
That’s a bit silly, so I suggest learning your arm length, finger
width, and step size in meters.
But note that, for an order of magnitude guess, one step is around
$1.2$ arm lengths, so the height is $1.2$ times the
number of steps times the finger span.</p>
<p>Example: I estimated a car had size $f = 3 \text{ fingers} \approx 4
\text{ cm}$ at a distance $d = 40$ steps.
This leads to a height estimate</p>
\[h \approx 40 \times 1.2 \times 4 \text{ cm} \approx 1.9 \text{ m},\]
<p>which is probably an overestimate but in the right ballpark.
Once again, a ruler leads to more accurate results at the cost of
visible dorkiness.</p>David WakehamDecember 2, 2022. Humans are the measure of all things, though not in the sense Protagoras meant. I show how to estimate distance using only your hands and feet.Self-reflexive instance-naming2022-12-01T00:00:00+00:002022-12-01T00:00:00+00:00http://hapax.github.io/recursion/linguistics/egcoin<p><strong>December 1, 2022.</strong> <em>A whimsical post on naming things named after an instance after an instance.</em></p>
<hr />
<p>Some general phenomena are named after specific instances.
For instance, a <a href="https://en.wikipedia.org/wiki/Mondegreen">mondegreen</a>
is a misallocation of word boundaries, with “Lady Mondegreen” a
misallocated variant of “laid him on the green”.
Another example is the
<a href="https://en.wikipedia.org/wiki/Frequency_illusion">Baader-Meinhof effect</a>,
where something is encountered, apparently for the first time, then suddenly noticed everywhere.<label for="sn-1" class="margin-toggle sidenote-number">
</label>
<input type="checkbox" id="sn-1" class="margin-toggle" />
<span class="sidenote">
In 1994, a man called Terry Mullen
wrote to a newspaper to describe his experience of frequency bias with the
<a href="https://en.wikipedia.org/wiki/Red_Army_Faction">eponymous Marxist guerilla group</a>. In
a beautifully self-referential moment, this
caused people to begin noticing the Baader-Meinhof effect everywhere!</span>
For our purposes, the most important example is the
<a href="https://en.wikipedia.org/wiki/Eggcorn">eggcorn</a>: a semantically
motivated mishearing, named for “eggcorn”, a cute but infelicitous
rendition of “acorn”.</p>
<p>We can lump these under the general heading of “instance-naming”
(distinct from synecdoches or metonyms where whole may stand for part
as well as vice-versa). A few months ago, I set myself
the challenge of self-reflexively naming the phenomenon after an instance, and
promptly became stuck. How could I name it after an instance without getting
confused with the original referent? And if I came up with a new term,
how could it refer to an instance? It seemed impossible. If you like,
you can have a go before you read on to my proposed solution.</p>
<div style="background-color: #EAD1DC ; padding: 10px; border: 1px
solid purple; line-height:1.5">
<b>Self-reflexive instance naming puzzle.</b> <br />
Name the phenomenon of naming things after an instance after an instance.
</div>
<p>I forgot about this
puzzle until a plane flight last week, where lack of
other amusements forced me to solve it.
The idea is simple: punningly allude to an instance so as to indicate both the
general phenomenon of instance-naming, and a specific example. After
some experimentation, I struck on the idea of modifying “eggcorn” to
“egcoin”, which literally means “a new word based on an
example”. As a semantic equation:</p>
<div style="background-color: #cfc ; padding: 10px; border: 1px
solid green; line-height:1.5">
$$
\text{egcoin} = \text{e.g. (example)} + \text{coin (create a new
word)} + \text{eggcorn}.
$$
</div>
<p>It gets better: on hearing
“egcoin”, someone might mistakenly suppose they had heard “eggcorn”.
This makes “egcoin” an eggcorn precisely when it is heard as such!</p>David WakehamDecember 1, 2022. A whimsical post on naming things named after an instance after an instance.A kernel trick for integrals2022-11-10T00:00:00+00:002022-11-10T00:00:00+00:00http://hapax.github.io/math/hacks/kernel-integral<p><strong>November 10, 2022.</strong> <em>I present a simple trick for doing integrals by swapping
the argument of a kernel.</em></p>
<h2 id="overview">Overview</h2>
<hr />
<p>Consider an integral transform with kernel $K(x, y)$.
In general, this gives two distinct transforms,</p>
\[T_1f(y) = \int_{\Omega_1} f(x) K(x, y) \, \text{d}x, \quad T_2f(x) = \int_{\Omega_2} f(y) K(x, y) \, \text{d}y,\]
<p>where $T_i$ integrates over argument $i$, and $\Omega_i$ denotes the
corresponding domain of integration.
If everything is smooth enough to swap integrals (i.e. Fubini’s theorem), then</p>
\[\begin{align*}
\int_{\Omega_1} f(x)\cdot T_2g(x) \,\text{d}x & = \int_{\Omega_1}
f(x)\left[\int_{\Omega_2} g(y) K(x, y) \, \text{d}y\right] \text{d}x \\
& = \int_{\Omega_2}
g(y)\left[\int_{\Omega_1} f(x) K(x, y) \, \text{d}x\right] \text{d}y \\
& = \int_{\Omega_2} T_1f(y) \cdot g(y)\, \text{d}y.
\end{align*}\]
<p>For a symmetric kernel $K(x, y) = K(y, x)$ and $\Omega_1 = \Omega_2
= \Omega$, we have $T_1 = T_2 = T$, and our result simplifies to:</p>
<div style="background-color: #cfc ; padding: 10px; border: 1px
solid green; line-height:1.5">
<b>The symmetric kernel trick.</b> <br />
For an integral transform $T$ defined by a symmetric kernel,
$$
\int_{\Omega} f(x) \cdot Tg(x)\, \text{d}x = \int_{\Omega} Tf(y) \cdot g(y)\,
\text{d}y.
$$
</div>
<p>From a pure math standpoint, we’ve basically just observed that the integral transforms
$T_1$ and $T_2$ are dual,</p>
\[\langle f, T_2 g\rangle = \langle T_1 f, g\rangle,\]
<p>with respect to a suitably defined inner product $\langle \cdot, \cdot\rangle$.
But this turns out to be a useful trick for doing real-life integrals!</p>
<p><em>Full disclosure.</em> I didn’t come up with this hack, but stole it
(with some customizations) from Ramanujan.
Also, I’m ignoring many mathematical subtleties! The joys of
being a physicist.</p>
<h2 id="the-voigt-integral">The Voigt integral</h2>
<hr />
<p>Let’s take everyone’s favourite example, the 1D Fourier transform:</p>
\[T_\text{F} f(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty
f(x)e^{-i\omega x} \, \text{d}x.\]
<p>We can consult a table and pick out, for instance, the pairs</p>
\[\begin{align*}
f(x) & = e^{-\alpha x^2}, \quad T_\text{F}f(\omega) =
\frac{1}{\sqrt{2\alpha}} e^{-\omega^2/4\alpha} \\
g(x) & = e^{-\beta |x|}, \quad T_\text{F}g(\omega) =
\sqrt{\frac{2}{\pi}} \cdot \frac{\beta}{\beta^2 + \omega^2}.
\end{align*}\]
<p>Then our kernel trick, after relabelling $\alpha \to \alpha^2$ in the Gaussian pair (with $\alpha, \beta > 0$), gives</p>
\[\begin{align*}
\int_{-\infty}^\infty \sqrt{\frac{2}{\pi}}\frac{\beta e^{-\alpha^2 x^2}}{\beta^2 + x^2} \,
\text{d}x
& =\frac{1}{\sqrt{2}\alpha} \int_{-\infty}^\infty e^{-x^2/4\alpha^2 - \beta|x|} \, \text{d}x.
\end{align*}\]
<p>The RHS is straightforward to express in terms of the complementary error
function:</p>
\[\text{erfc}(z) = \frac{2}{\sqrt{\pi}}\int_z^\infty e^{-x^2}\, \text{d}x.\]
<p>We complete the square, defining $2\alpha u = x +
2\alpha^2\beta$ to find</p>
\[\begin{align*}
\int_{-\infty}^\infty e^{-x^2/4\alpha^2 - \beta|x|} \, \text{d}x
& = 2\int_{0}^\infty e^{-x^2/4\alpha^2 - \beta x} \, \text{d}x\\
& = 4\alpha e^{(\alpha\beta)^2}\int_{\alpha\beta}^\infty
e^{-u^2} \, \text{d}u\\
& = 2\sqrt{\pi}\alpha e^{(\alpha\beta)^2}\text{erfc}(\alpha\beta).
\end{align*}\]
<p>We can finally conclude that</p>
\[\int_{-\infty}^\infty \frac{ e^{-\alpha^2
x^2}}{\beta^2 + x^2} \, \text{d}x = \frac{\pi}{\beta}
e^{(\alpha\beta)^2}\text{erfc}(\alpha\beta).\]
<p>I call this the “Voigt integral” after the
<a href="https://en.wikipedia.org/wiki/Voigt_profile">related convolution in spectroscopy</a>.</p>
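<p>Since I'm ignoring mathematical subtleties, a numerical sanity check seems prudent. The Python sketch below evaluates both sides of the symmetric kernel trick for the Gaussian and exponential pairs from the table, using a hand-rolled Simpson's rule. The parameter values, the cutoff, and the function names are all arbitrary choices for illustration. Both integrands are even, so I integrate over $[0, \infty)$ and double, which also avoids the kink in $e^{-\beta|x|}$ at the origin.</p>

```python
import math


def simpson(func, a, b, n=20000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3


def kernel_trick_sides(alpha, beta, cutoff=40.0):
    """Return both sides of  int f * T(g) dx = int T(f) * g dy  numerically."""
    f = lambda x: math.exp(-alpha * x * x)
    Tf = lambda w: math.exp(-w * w / (4 * alpha)) / math.sqrt(2 * alpha)
    g = lambda x: math.exp(-beta * abs(x))
    Tg = lambda w: math.sqrt(2 / math.pi) * beta / (beta ** 2 + w ** 2)
    # Even integrands: integrate over [0, cutoff] and double.
    lhs = 2 * simpson(lambda x: f(x) * Tg(x), 0.0, cutoff)
    rhs = 2 * simpson(lambda y: Tf(y) * g(y), 0.0, cutoff)
    return lhs, rhs


lhs, rhs = kernel_trick_sides(alpha=1.0, beta=1.0)
print(lhs, rhs)
```

<p>If the two printed numbers disagree beyond the integration error, something has gone wrong with the transform pairs or their normalization.</p>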
<div style="background-color: #EAD1DC ; padding: 10px; border: 1px solid
purple; line-height:1.5">
<b>Exercise 1.</b> The <i>Hankel transform</i>
$$
\mathcal{H}^{(\nu)}f(k) = \int_0^\infty f(r) rJ_\nu(kr) \, \text{d}r
$$
is defined by an asymmetric
kernel $K(r, k) = rJ_\nu(kr)$, where $J_\nu$ is a <i>Bessel function of
the first kind</i> of order $\nu$.
<br />
<span style="padding-left: 20px; display:block">
(a) Using the kernel trick, show that
$$
\int_0^\infty k \mathcal{H}^{(\nu)}\left[\frac{f(r)}{r}\right](k) g(k)
\, \text{d}k = \int_0^\infty f(k)\mathcal{H}^{(\nu)}g (k)\,
\text{d}k. \tag{1} \label{hankel}
$$
(b) Apply $(\ref{hankel})$ to a judicious choice of Hankel transform
pairs to derive the expression
$$
\int_0^\infty e^{-\alpha^2 u/2} K_0(\beta\sqrt{u})\, \text{d}u =
-\frac{1}{\alpha^2} e^{-\beta^2/2\alpha^2}\text{Ei}\left(-\frac{\beta^2}{2\alpha^2}\right),
$$
where $K_0$ is a <i>modified Bessel function of the second kind</i> and
$\text{Ei}$ is the <i>exponential integral</i>, a special function defined by
$$
\text{Ei}(z) = -\int_{-z}^\infty \frac{e^{-t}}{t} \, \text{d}t.
$$
</span>
</div>
<h2 id="mordell-integrals">Mordell integrals</h2>
<hr />
<p>Here’s a fancier example, again using the Fourier transform.
Consider the <em>Mordell integral</em></p>
\[h(z; \tau) = \int_{-\infty}^\infty \frac{e^{\pi i \tau x^2 - 2\pi
zx}}{\cosh(\pi x)} \, \text{d}x,\]
<p>with $\Im(\tau) > 0$ to ensure convergence.
Note that this is a product of functions which are <em>self-dual</em> under
the Fourier transform, up to a change in their parameters:</p>
\[\begin{align*}
f(x) & = e^{i\alpha x^2 - \beta x}, \quad T_\text{F}f(\omega) =
\frac{1}{\sqrt{-2i\alpha}} e^{i(\beta-i\omega)^2/4\alpha} \\
g(x) & = \frac{1}{\cosh(\gamma x)}, \quad T_\text{F}g(\omega) =
\sqrt{\frac{\pi}{2}} \frac{1}{\gamma\cosh(\pi\omega/2\gamma)}.
\end{align*}\]
<p>Setting $\alpha = \pi\tau$, $\beta = 2\pi z$ and $\gamma = 1/2$, the kernel trick (and the change of variable $x = 2\pi u$) now gives</p>
\[\begin{align*}
h(z; \tau) & = \frac{1}{2\pi\sqrt{-i\tau}}\int_{-\infty}^\infty
\frac{e^{i(2\pi z-ix)^2/4\pi\tau}}{\cosh(x/2)} \, \text{d}x \\
& = \frac{e^{i\pi z^2/\tau}}{2\pi\sqrt{-i\tau}}\int_{-\infty}^\infty
\frac{e^{-ix^2/4\pi\tau+zx/\tau}}{\cosh(x/2)} \, \text{d}x \\
& = \frac{e^{i\pi z^2/\tau}}{\sqrt{-i\tau}}\int_{-\infty}^\infty
\frac{e^{-i\pi u^2/\tau+2\pi zu/\tau}}{\cosh(\pi u)} \, \text{d}u \\
& = \frac{e^{i\pi z^2/\tau}}{\sqrt{-i\tau}}
h\left(-\frac{z}{\tau};
-\frac{1}{\tau}\right). \tag{2}\label{h}
\end{align*}\]
<p>This seems like a neat result!</p>
<div style="background-color: #EAD1DC ; padding: 10px; border: 1px solid purple; line-height:1.5">
<b>Exercise 2.</b> Ramanujan defined the related integral
$$
F_\omega(z) = \int_{-\infty}^\infty \frac{e^{-\pi\omega x^2 + 2\pi
x}\sin(\pi x z)}{e^{2\pi x}-1} \, \text{d} x.
$$
We'll end with a few exercises on this theme. <br />
<span style="padding-left: 20px; display:block">
(a) Define $\varphi$ by
$$
h(z; \tau) = -\frac{2i}{\tau}e^{-(\pi i\tau/4 + \pi i
z)}\varphi\left(z + \frac{\tau-1}{2}, \tau\right).
$$
Prove that $F_\omega(z)$ and $h(z; \tau)$ are
related by
$$
F_{-i\tau}(2iz) =
\frac{1}{2i\tau}\left[\varphi(z, \tau) - \varphi(-z, \tau)\right]. \tag{3} \label{varphi}
$$
(b) Using equation $(\ref{h})$ and $(\ref{varphi})$ or otherwise, show that
$$
F_\omega(z) = -\frac{i}{\sqrt{\omega}} e^{-\pi z^2/4\omega} F_{1/\omega}\left(\frac{iz}{\omega}\right).
$$
(c) Set $\omega = \alpha^2$ and $z \to \alpha
z/\sqrt{\pi}$.
Deduce from part (b) that, for $\alpha\beta = 1$ and $\alpha, \beta > 0$,
$$
\sqrt{\alpha}e^{z^2/8}\int_{-\infty}^\infty
\frac{e^{-\pi^2\alpha^2x^2}\sin(\sqrt{\pi}\alpha x z)}{e^{2\pi x}-1} \text{d}x = \sqrt{\beta}e^{-z^2/8}\int_{-\infty}^\infty
\frac{e^{-\pi^2\beta^2x^2}\sinh(\sqrt{\pi}\beta x z)}{e^{2\pi x}-1} \text{d}x.
$$
</span>
</div>
<!-- https://webpages.charlotte.edu/aroy15/image/drz5-err.pdf -->David WakehamNovember 10, 2022. I present a simple trick for doing integrals by swapping the argument of a kernel.Indescribably boring numbers2021-03-23T00:00:00+00:002021-03-23T00:00:00+00:00http://hapax.github.io/mathematics/boring<p><strong>March 23, 2021.</strong> <em>I turn the old joke about interesting numbers into a
proof that most real numbers are indescribably boring. In turn, this implies
that there is no explicit well-ordering of the reals. The axiom of
choice, however, implies all are relatively interesting.</em></p>
<h4 id="introduction">Introduction</h4>
<p>It’s a
<a href="https://en.wikipedia.org/wiki/Interesting_number_paradox">running joke</a>
among mathematicians that there are no boring numbers. Here’s the
proof. Let $B$ be the set of boring numbers, and suppose for a
contradiction it is non-empty. Define $b = \min B$ as
the smallest boring number. Since this is a highly unusual property, $b$ is
interesting after all!
Joke it may be, but there is a sting in the tail. By thinking
about how the joke works, we will be led to some rather deep (and
perhaps disturbing) insights into set theory and what it can and
cannot tell us about the mathematical world.</p>
<h4 id="integers-and-rationals-are-interesting">Integers and rationals are interesting</h4>
<p>The joke implicitly uses the fact that “numbers” refers to “whole numbers”</p>
\[\mathbb{N} = \{0, 1, 2, 3, \ldots\}.\]
<p>If it didn’t, then the <em>minimum</em> we used to get our contradiction
wouldn’t always work!
For instance, say we work with the integers</p>
\[\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}.\]
<p>The set of boring integers $B_\mathbb{Z}$ may be unbounded below.
Does this cause a problem? Not really. We can just define the smallest
boring number as the smallest element minimising the <em>absolute value</em>, i.e.</p>
\[b = \min \text{argmin}_{k\in B_\mathbb{Z}} |k|.\]
<p>(The $\text{argmin}$ might actually give us two numbers, $\pm b$, so the negative one
is the smallest.) Thus, there are no boring integers.
What about boring rational numbers?
This is somewhat more elaborate, but if $B_\mathbb{Q}$ is the set of
boring rationals, we can define the “smallest” boring number as</p>
\[b = \min \text{argmin}_{p/q\in B_\mathbb{Q}} (|p| + |q|),\]
<p>where $p/q$ is a fraction in lowest terms.
Once again, there may be multiple minimisers of $|p| + |q|$, but only
a finite number, so we can choose the smallest.
We conclude there are no boring rationals.
This pattern suggests there are no boring real numbers.
We should be able to find some function with a finite number of
minima, and then choose the smallest, right?
I’m going to argue that no such function can ever be described. Then I’m
going to explain why it might exist anyway, depending on which axioms of set theory we use!</p>
<h4 id="most-real-numbers-are-boring">Most real numbers are boring</h4>
<!-- https://en.wikipedia.org/wiki/Definable_real_number -->
<p>“Boring” and “interesting” are subjective.
We’ll use something a tad more well-defined, and replace
“interesting” with <em>describable</em>.
A number is describable if it has some finite description, using
words, mathematical symbols, even a computer program, which uniquely singles out that number.
For instance, $\sqrt{2}$ is the positive solution of $x^2 = 2$, $\pi$
is the ratio of a circle’s circumference to its diameter, and $e$ is
the limit</p>
\[e = \lim_{n\to\infty} \left(1 + \frac{1}{n}\right)^n.\]
<p>It turns out that <em>almost every</em> real number is indescribable, or
“boring”, in our official translation of that term.
The argument is very simple, and proceeds by simply counting the
number of finite descriptions.
Each such description consists of a finite sequence of symbols
(letters, mathematical squiggles, algorithmic instructions), each of
which could be elements of some very large alphabet of symbols.
For instance, the text</p>
\[\sqrt{2} \text{ is the positive solution of $x^2 = 2$.}\]
<p>can be converted into <a href="http://www.tamasoft.co.jp/en/general-info/unicode-decimal.html">(decimal) unicode</a> as</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>8730 50 32 105 115 32 116 104 101 32 112 111 115 105 116 105 118 101
32 115 111 108 117 116 105 111 110 32 111 102 32 120 94 50 61 50 46
</code></pre></div></div>
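<p>Here is a minimal sketch of this conversion in Python (chosen purely for illustration). The built-in <code>ord</code> returns the decimal code point of each symbol; the variable names are made up for the example.</p>

```python
# Convert a description into decimal Unicode code points ("unicoding").
# "\u221a" is the square-root sign.
text = "\u221a2 is the positive solution of x^2=2."
codes = [ord(symbol) for symbol in text]
print(" ".join(str(code) for code in codes))

# Decoding with chr recovers the original description.
decoded = "".join(chr(code) for code in codes)
```

<p>Running it prints the same 37 numbers as above, and <code>decoded</code> equals the original text.</p>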
<p>Imagine some “super unicode” which lets us convert <em>any</em> symbol
into a number.
The super unicode alphabet may be arbitrarily large, so we will take it to
consist of <em>every</em> natural number $\mathbb{N}$.
Then a finite description using any symbols can be written as a sequence of
the corresponding natural numbers, a trick I will call “unicoding”.
To find the number of finite descriptions, we just count the sequences!
There is a nice scheme for showing that these are in one-to-one
correspondence with the natural numbers themselves, and hence
<em>countably infinite</em>.
We take a sequence, say</p>
\[(6, 2, 0, 5)\]
<p>and convert the first bracket and all commas into $1$s, and each number into
the corresponding number of $0$s:</p>
\[10000001001100000_2.\]
<p>In turn, this can be converted to decimal, $66144$.
Going in the other direction, any whole number can be written in
binary and then converted into a sequence:</p>
\[14265092 = 110110011010101100000100_2\]
<p>becomes $(0,1,0,2,0,1,1,1,0,5,2)$.
Thus, we have a simple, explicit correspondence between finite
sequences of natural numbers and the natural numbers themselves.
This basically completes the proof, for the simple reason that there
are <em>infinitely more</em> real numbers than there are natural numbers.
This is established by Cantor’s beautiful
<a href="https://en.wikipedia.org/wiki/Cantor%27s_diagonal_argument">diagonal argument</a>,
which I won’t repeat here.
The upshot is that, via unicoding and then the binary
correspondence, finite descriptions can only capture an
infinitesimally small fragment of the real numbers.
Most literally cannot be talked about.</p>
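<p>The binary scheme can be sketched in a few lines of Python. The function names <code>encode</code> and <code>decode</code> are made up for illustration, and sending the empty sequence to $0$ is an extra convention assumed here so that every natural number is covered.</p>

```python
def encode(seq):
    """Map a finite sequence of naturals to a natural number.

    Each entry k becomes a '1' followed by k '0's, and the resulting
    string is read as a binary numeral.
    """
    bits = "".join("1" + "0" * k for k in seq)
    return int(bits, 2) if bits else 0  # assumed convention: () maps to 0


def decode(n):
    """Invert encode by splitting the binary digits of n at each '1'."""
    if n == 0:
        return ()
    bits = bin(n)[2:]           # binary digits without the '0b' prefix
    runs = bits.split("1")[1:]  # the run of zeros following each '1'
    return tuple(len(run) for run in runs)
```

<p>With these definitions, <code>encode((6, 2, 0, 5))</code> returns <code>66144</code> and <code>decode(14265092)</code> returns <code>(0, 1, 0, 2, 0, 1, 1, 1, 0, 5, 2)</code>, matching the two examples above.</p>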
<!-- So, we conclude that most real numbers are boring. -->
<p>The set $B_\mathbb{R}$ includes almost every real number, though
quite definitely <em>not</em> every real number you can think of.
But, armed with our previous jokes, it’s tempting to think that we can
waltz in and make the same joke about $\mathbb{R}$, simply
plucking out the smallest element of $B_\mathbb{R}$.
Of course, that won’t quite work, because the set need not be bounded
below. So instead, suppose there is some explicit function $f$ such
that $b \in B_\mathbb{R}$ is the smallest minimizer of $f$, i.e.</p>
\[b = \min \text{argmin}_{x \in B_\mathbb{R}} f(x).\]
<p>If I knew $f$ explicitly, we’d have a description of $b$ after all. Contradiction!
But the contradiction here does not imply $B_\mathbb{R}$ is
empty. After all, most of $\mathbb{R}$ is indescribable for
simple set-theoretic reasons.
Instead, it means that there <em>cannot be any explicit function</em>
$f$. More generally, there cannot be any explicit rule which, given a
non-empty subset of $\mathbb{R}$, picks out a unique element of it. If there
were, we could apply it to $B_\mathbb{R}$ and get the same
contradiction.
(See Appendix A for discussion of the related <a href="https://en.wikipedia.org/wiki/Berry_paradox">Berry paradox</a>.)</p>
<h4 id="an-existential-aside">An existential aside</h4>
<p>There’s a loophole here. Our argument doesn’t establish that
$f$ doesn’t exist, just that it has no finite description. And
although it might seem weird to trust in the existence of something
that we can’t really talk about, we do just this with the real
numbers!
I believe in all the real numbers, even the ones I can never describe.
Is this reasonable?
It depends who you ask.
There is a philosophy of mathematics called
<a href="https://plato.stanford.edu/entries/intuitionism/">intuitionism</a> which
tells us that mathematics is a human invention, and therefore enjoins
us to only reason about the things we can construct ourselves. No
indescribable real numbers if you please!</p>
<p>I’m not sure about this “mathematical creationism”, and think there
are more things in the mathematical heavens than are dreamt of in
our finite human philosophy.
Why should human limitations be mathematical ones?
That said, it’s not the case that anything goes. We should have some
firm basis for believing in the existence of those things we can’t
discuss, and for the real numbers, the firm basis is drawing a
continuous line on a piece of paper, or thinking about infinite
decimal expansions. These are <em>models</em> of the real numbers,
concrete-ish objects which capture the essence of the abstract entity
$\mathbb{R}$. They convince us (or at least me) that there is nothing
magical stopping someone from drawing certain points on the line, or
continuing certain expansions forever.</p>
<p>Similarly, the indescribable things we would like to exist and reason
about in set theory might depend on our <em>models</em> of set theory!
I won’t get into the specifics, but an important point is there are
<em>many different models</em> of set theory, with different properties, and
it seems unlikely that any one model is right.
These properties are abstracted into <em>axioms</em>, formal rules about what
exists and what you can or can’t do with sets.
Because models of set theory are deep, highly technical constructions,
most of the time we go the other way round, and play around with
axioms instead. Only later do we go away and find models which support
certain sorts of behaviour.
The point of all this is to make it a bit less counterintuitive when I
say that the existence and properties of boring numbers depend on which axioms
we decide to use.</p>
<h4 id="all-real-numbers-are-relatively-interesting">All real numbers are relatively interesting</h4>
<p>So, let’s return to our problem of boring real numbers.
We argued there was no explicit, finitely describable rule for picking
an element out of $B_\mathbb{R}$.
But we can always make the <em>existence</em> of such a rule — describable
or not — an axiom of our theory!
There are two ways to go about doing this.
Note that in the first example of boring natural numbers, we used the
<em>minimum</em> of the set.
We had to be a bit more clever with the integers and rationals, but it
essentially boiled down to creating a special sort of <em>ordering</em> on
the set, so that any subset (including the boring numbers) has a
<em>smallest element</em>.
We wrote this in a complicated way as</p>
\[b = \min \text{argmin}_{x \in B} f(x)\]
<p>for some function $f$, but we could just as well write</p>
\[b = \min_{\mathcal{W}} B,\]
<p>where $\mathcal{W}$ denotes this ordering on the big set.
To be clear, for the integers it is</p>
\[0, -1, 1, -2, 2, -3, 3, \ldots\]
<p>and for the rationals it is</p>
\[0, -\frac{1}{1}, \frac{1}{1}, -\frac{2}{1}, -\frac{1}{2}, \frac{1}{2},
\frac{2}{1}, \ldots.\]
<p>This is called a <em>well-ordering</em>. Although it may not be describable,
we could simply require, as an axiom of set theory, that any set can
be well-ordered! More explicitly,</p>
<p><span style="padding-left: 20px; display:block">
Any set $A$ has a well-ordering $\mathcal{W}_A$ such that any non-empty subset
of $A$ has a unique minimum element with respect to $\mathcal{W}_A$.
</span></p>
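<p>For the integers and rationals these orderings are completely explicit, so we can generate them. The following Python sketch uses hypothetical helper names (<code>integers_in_order</code>, <code>rationals_in_order</code>, <code>least</code>); it is one way to realise the orderings listed above, not the only one.</p>

```python
from fractions import Fraction
from itertools import islice


def integers_in_order():
    """Yield integers sorted by absolute value, negative before positive."""
    yield 0
    k = 1
    while True:
        yield -k
        yield k
        k += 1


def rationals_in_order():
    """Yield rationals in lowest terms sorted by |numerator| + |denominator|,
    breaking ties by ordinary size."""
    height = 1
    while True:
        level = set()
        for den in range(1, height + 1):
            num_abs = height - den
            for num in {num_abs, -num_abs}:
                q = Fraction(num, den)  # Fraction reduces to lowest terms
                # Keep q only if its reduced form really has this height.
                if abs(q.numerator) + q.denominator == height:
                    level.add(q)
        yield from sorted(level)
        height += 1


def least(subset, ordering):
    """First element of the ordering lying in a non-empty subset.

    Loops forever if the subset contains nothing from the ordering.
    """
    return next(x for x in ordering if x in subset)


print(list(islice(integers_in_order(), 7)))
print([str(q) for q in islice(rationals_in_order(), 7)])
```

<p>The first seven values from each generator reproduce the two lists above, and <code>least({5, -3, 4}, integers_in_order())</code> returns <code>-3</code>.</p>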
<p>Although it doesn’t spoil our conclusion that most real numbers are
boring, such an axiom would allow us to turn the old joke into an
argument that all real numbers are <em>relatively interesting</em>, where
“relatively interesting” means that there is a finite description
where we are allowed to use the well-ordering $\mathcal{W}$.
The proof goes as you might expect: let $B^{\mathcal{W}}_\mathbb{R}$ be the set of relatively boring
numbers, i.e. numbers with no finite explicit description, even when
allowed to use the well-ordering $\mathcal{W}$.
Since $\mathcal{W}$ is a well-ordering, we can define</p>
\[b = \min_{\mathcal{W}} B^{\mathcal{W}}_\mathbb{R}.\]
<p>End of proof!
So, although most real numbers are strictly boring, with a
well-ordering all of them are relatively interesting.</p>
<h4 id="choosing-an-order">Choosing an order</h4>
<p>Well-ordering is not usually treated as an axiom.
Historically, set theorists prefer to use a simpler rule called the
<em>axiom of choice</em>, which is logically equivalent, as we will argue
informally in a moment, but somehow less suspect.
As Jerry Bona joked,</p>
<p><span style="padding-left: 20px; display:block">
The axiom of choice is obviously true and the well-ordering principle
obviously false.
</span></p>
<p>(Actually, Bona’s joke mentions a third equivalent form called <em>Zorn’s
lemma</em>, but it would confuse matters too much to explain.)
Loosely, the axiom of choice just says we can pick an element from a
non-empty set. Pretty reasonable huh? If a set is nonempty, it has an element, so
we can pluck one out.
In fact, it’s usually stated in terms of a <em>family</em> of sets $A_i$,
where the subscript $i$ ranges over some indexing set $I$:</p>
<p><span style="padding-left: 20px; display:block">
Given a family of nonempty sets $A_i$, $i \in I$, we can collect a
representative from each set, labelled $f_i \in A_i$.
</span></p>
<p>The well-ordering principle implies the axiom of choice, since I can
just take the union of all the sets $A_i$, well-order it with
$\mathcal{W}$, and then define $f_i = \min_{\mathcal{W}} A_i$.
That’s my set of representatives!
The other way round is conceptually straightforward.
To well-order a set $A = A_0$, start by choosing an element $f_0 \in
A_0$ by the axiom of choice. Then remove it to define a new set $A_1 =
A_0 - \{f_0\}$, and select another element $f_1 \in A_1$. Continue in
this way, at each stage simply deleting the element from the previous
stage and choosing a new one, using</p>
\[A_{n+1} = A_n - \{f_n\} = A_{n-1} - \{f_n, f_{n-1}\} = \cdots = A_0 -
\{f_i : i \leq n\}\]
<p>as long as the set is nonempty.
The well-ordering is simply the elements in the order we made the
choice:</p>
\[\mathcal{W}_A = \{f_0, f_1, f_2, \ldots \} = \{f_n \in A_n : A_n \neq \varnothing\}.\]
<p>There are two issues with this construction.
The first is that it might feel sketchy to use the axiom of
choice “as we go” to build the sets, rather than starting with a
pre-defined family. But no one said this wasn’t allowed!
Second, our method only seems to work for sets at most as large as the
natural numbers, since we indexed elements with $n \in \mathbb{N}$.
But we can extend it to an <em>arbitrary</em> set using a generalisation of
natural numbers called
<a href="https://en.wikipedia.org/wiki/Ordinal_number">ordinals</a>.
We loosely sketch how this is done in Appendix B.
Once the dust settles, we find that the axiom of choice is equivalent
to well-ordering.</p>
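<p>For a finite set the construction can be run directly. In this Python sketch the choice function is passed in as an ordinary argument (<code>choose</code>, a hypothetical name), which of course needs no axiom; the axiom of choice only matters when no explicit rule is available.</p>

```python
def well_order(elements, choose):
    """List the elements in the order a choice function picks them."""
    remaining = set(elements)          # A_0
    order = []
    while remaining:                   # stop once A_n is empty
        f_n = choose(remaining)        # pick a representative of A_n
        order.append(f_n)
        remaining = remaining - {f_n}  # A_{n+1} = A_n - {f_n}
    return order


print(well_order({3, 1, 2}, min))
print(well_order({3, 1, 2}, max))
```

<p>Passing <code>min</code> or <code>max</code> as the choice function produces two different well-orderings of the same finite set.</p>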
<h4 id="conclusion">Conclusion</h4>
<p>The overarching theme of this post is how much mileage we can get
from a bad joke.
The answer: quite a lot!
We learned not only that there are no boring integers and rational
numbers, but via a simple counting argument, that the vast majority of
real numbers are indescribably boring.
This in turn implies there is no explicit way to well-order the reals.
On the other hand, by giving ourselves the ability (via the axiom of
choice) to pluck elements at will from non-empty sets, we are able to
supply the reals with a well-ordering. So, all reals are relatively
interesting, even if we can’t talk about them.</p>
<h4 id="acknowledgments">Acknowledgments</h4>
<p>As usual, thanks to J.A. for the discussion which led to this
post, and also for proposing an elegant mapping analogous to unicoding.</p>
<h4 id="appendix-a-the-berry-paradox">Appendix A: the Berry paradox</h4>
<p>Consider the phrase</p>
<p><span style="padding-left: 20px; display:block">
The smallest real number with no finite, explicit description.
</span></p>
<p>If “smallest” refers to an explicitly definable well-ordering of the
reals, then this would seem to pick out a unique number with a finite,
explicit description. Contradiction!
We used this to argue no explicit well-ordering exists.
But let’s compare this to the
<a href="https://en.wikipedia.org/wiki/Berry_paradox">Berry paradox</a>, which
asks us to consider the phrase</p>
<p><span style="padding-left: 20px; display:block">
The smallest positive integer not definable in under sixty letters.
</span></p>
<p>This phrase clocks in at under sixty letters, and would seem to define a
number.
Contradiction!
Since “smallest” here makes perfect sense (we are dealing with positive
integers), to resolve the Berry paradox, we must assume either (a)
there is no set $B$ of numbers not definable in under sixty letters,
analogous to the original boring number joke, or (b) Berry’s phrase
somehow fails to define a number.
The most popular solution seems to be (b), on the grounds that
referring to the set makes it some kind of “meta-definition”, rather
than a definition per se.</p>
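<p>The letter count is easy to check mechanically. A small Python sketch, counting alphabetic characters only (spaces and the full stop are ignored):</p>

```python
phrase = "The smallest positive integer not definable in under sixty letters."
letter_count = sum(ch.isalpha() for ch in phrase)
print(letter_count)  # prints 57
```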
<p>Of course, this seems to be committed to a very specific notion of
“definition”, but the problem persists if we replace “definable” with
“meta-definable”, since the smallest non-meta-definable number is
really a meta-meta-definition.
Let $B^{(0)}$ be the set of numbers not definable in under sixty letters,
$B^{(1)}$ the numbers not meta-definable in under $70$ letters, and in
general, $B^{(n)}$ the numbers not meta${}^{(n)}$-definable in under
$60+10n$ letters.
We call a number “lim-definable” if it lies outside at least one of
these sets, i.e. outside the <em>intersection</em> $\mathcal{B} =
\cap_{n\geq0} B^{(n)}$.
This notion is closed under the operation of going meta.
Now consider the phrase</p>
<p><span style="padding-left: 20px; display:block">
The smallest positive integer not finitely lim-definable.
</span></p>
<p>Since lim-definability is closed under going meta, as is “finite”,
this is <em>now a definition at the same level</em>.
Option (b) is no longer available to us, so only option (a)
remains, and it follows that, like the joke that began it all, <em>all
positive integers are finitely lim-definable</em>.
This is of course obviously true.</p>
<p>Our argument against an explicit well-ordering is very closely related
to the Berry paradox.
The point of considering lim-definability is that we can build the same
descriptive hierarchy for the real numbers, take the union, and rule
out option (b). This leaves two ways to avoid a contradiction: no
lim-definable ordering exists (involving some finite but unbounded
number of references to sets in the hierarchy), or like the Berry
paradox, every real is lim-definable.
But unlike the positive integers, we know from set theory that the
second option can’t be true!
We still have a countable number of lim-definitions, as we can argue
from unicoding.
So there must be no lim-definable ordering of the reals, and no
explicit well-ordering in particular.</p>
<h4 id="appendix-b-ordinals-and-the-axiom-of-choice">Appendix B: ordinals and the axiom of choice</h4>
<p>Ordinals are <em>sets</em> which we use to stand in for numbers.
The smallest ordinal is $0$, which is defined as the empty set
$\varnothing = \{\}$.
Each ordinal $\alpha$ has a unique successor $\alpha + 1$, defined by
simply adding $\alpha$ to itself as a new element:</p>
\[\alpha + 1 = \alpha \cup \{\alpha\}.\]
<p>To illustrate, we apply the successor operation to $0 = \varnothing$ a
few times:</p>
\[1 = 0 + 1 = \{0\}, \quad 2 = 1 + 1 = \{0,
1\}, \quad 3 = 2 + 1 = \{0, 1, 2\}.\]
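<p>The finite ordinals can be built quite literally as nested sets. This Python sketch assumes the standard von Neumann convention $\alpha + 1 = \alpha \cup \{\alpha\}$, which reproduces the sets listed above; <code>successor</code> is an illustrative name, and <code>frozenset</code> is used because Python sets can only contain hashable elements.</p>

```python
def successor(alpha):
    """Von Neumann successor: alpha + 1 is alpha together with alpha itself."""
    return alpha | frozenset([alpha])


zero = frozenset()       # 0 is the empty set
one = successor(zero)    # {0}
two = successor(one)     # {0, 1}
three = successor(two)   # {0, 1, 2}

print(len(three))        # the ordinal n has exactly n elements
```

<p>In this representation the ordering of ordinals is just set membership: <code>two in three</code> evaluates to <code>True</code>.</p>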
<p>Going on in this way gives us all the finite ordinals, but there are
also <em>infinite</em> ordinals. The smallest infinite ordinal, conventionally
denoted $\omega$, can be identified with the natural numbers:</p>
\[\omega = \{0, 1, 2, 3, 4, \ldots\}.\]
<p>It is called a <em>limit</em> ordinal since it is not the successor of any
ordinal. It is bigger than all the finite ones, $n <
\omega$. The successor is defined as before,</p>
\[\omega + 1 = \omega \cup \{\omega\} = \{0, 1, 2, \ldots, \omega\},\]
<p>thereby giving a precise meaning to “infinity plus one”!
We won’t say more about the structure of these ordinals. The main
point is that we can always “count” the elements in a set $A$ using
ordinals, no matter how big it is.
Let’s now return to the problem of proving the axiom of choice
implies that any set $A$ can be well-ordered.
The basic idea is to start with $0$, but keep on counting up “past
infinity”, defining</p>
\[A_{\alpha} = A_0 - \{f_\beta : \beta < \alpha\}\]
<p>for any ordinal $\alpha$. The resulting set of
representatives, labelled by ordinals, is</p>
\[\mathcal{W}_A = \{f_\alpha \in A_\alpha: A_\alpha \neq
\varnothing\},\]
<p>with $f_\alpha < f_\beta$ just in case the ordinals $\alpha < \beta$.
This is a well-ordering since the ordinals are themselves
well-ordered.
Now, we’ve skipped many important technical details, but the main
point was that the argument looks pretty similar to the previous one!</p>
<!-- You may wonder if the contradiction here is coming from ambiguity in
the notion of "explicit describability".
That this can cause deep problems is illustrated by the
[Berry paradox](https://en.wikipedia.org/wiki/Berry_paradox), which
asks us to consider the following:
<span style="padding-left: 20px; display:block">
The smallest positive integer not definable in under sixty letters.
</span>
If $B_{60}$ is the set of positive integers not definable in under
sixty letters, it seems we have just defined its smallest elements in
fifty seven! This too is a contradiction. Many people try to resolve
this by arguing that it does not constitute a "definition"; I think it
is much simpler to following the boring number argument, and conclude
that $B_{60}$ doesn't exist. -->David WakehamMarch 23, 2021. I turn the old joke about interesting numbers into a proof that most real numbers are indescribably boring. In turn, this implies that there is no explicit well-ordering of the reals. The axiom of choice, however, implies all are relatively interesting.Taking half a derivative2021-03-13T00:00:00+00:002021-03-13T00:00:00+00:00http://hapax.github.io/mathematics/halfder<p><strong>March 13, 2021.</strong> <em>Can you take half a derivative? Or π derivatives?
Or even √–1 derivatives? It turns out the answer is yes, and there are
two simple but apparently different ways to do it. I
show that one implies the other!</em></p>
<h4 id="introduction">Introduction</h4>
<p>In calculus, the regular derivative is defined as the local gradient
of a function:</p>
\[f'(x) = \frac{d}{dx} f(x) = \lim_{h\to 0}\frac{f(x+h)-f(x)}{h}.\]
<p>We will abbreviate this as $f' = Df$, understanding that $f$ is a function
of $x$ and $D$ differentiates with respect to $x$.
We can always differentiate again, and again, and in fact as many
times as we want. Using our new notation, we can write the $n$th
derivative as</p>
\[D (D \cdots (Df)) = D^n f.\]
<p>This is well-defined as long as $n$ is a whole number.
But what if we could consider other types of derivatives, say half a
derivative? Let’s call this $D^{1/2} = \sqrt{D}$. In the same way that
applying two ordinary derivatives gives the second derivative, it seems reasonable to hope that two half derivatives give
a full derivative:</p>
\[f' = \sqrt{D} \sqrt{D}f = Df \quad \Longrightarrow \quad \sqrt{D}
\cdot \sqrt{D} = D.\]
<p>What could half a derivative look like?</p>
<h4 id="to-be-continued">To be continued</h4>
<p>The easiest way to go about this is to use a trick called <em>analytic
continuation</em>.
This has a precise meaning in complex analysis, and we’re going to do
something similar in spirit, but not quite as rigorous.
The basic idea is to find some nice, specific function we can
differentiate $n$ times, and which happens to give us a nice answer in terms of $n$.
We then define the <em>fractional derivative</em> $D^\alpha$ acting on this
function by replacing $n$ with $\alpha$.
A sanity check will be that, for general $\alpha, \beta$, the
fractional derivatives obey</p>
\[D^\alpha \cdot D^\beta = D^{\alpha+\beta},\]
<p>so, e.g., two half-derivatives give a full derivative,
$\sqrt{D}\cdot \sqrt{D} = D$.
We call this property <em>multiplicativity</em> after the identical-looking
rule for indices.
There are two issues with this approach.
First, how do we extend the definition to general functions?
And second, are the definitions for different functions in agreement?
In general, the answers are very complicated, but in this post, I’ll
consider the two simplest methods for defining fractional derivatives.
This means we can talk about the functions they apply to, and check
they agree, without a huge technical overhead.</p>
<p>Our first nice function is the exponential $e^{\omega x}$.
Differentiating simply pulls down a factor of $\omega$ each time, so</p>
\[D^n e^{\omega x} = \omega^n e^{\omega x}.\]
<p>It’s very clear, then, how to define the fractional derivative acting
on this:</p>
\[D^\alpha e^{\omega x} = \omega^\alpha e^{\omega x}.\]
<p>Great! We can easily check the multiplicative property, assuming that
constants pass through the derivatives:</p>
\[D^\alpha D^\beta e^{\omega x} = \omega^\beta D^\alpha e^{\omega x} =
\omega^{\alpha + \beta} e^{\omega x} = D^{\alpha+\beta}e^{\omega x}.\]
<p>Now, you might think this is useless because we can only
take fractional derivatives of exponential functions.
But at this point, we introduce another assumption, namely that the
fractional derivatives are <em>linear</em>:</p>
\[D^\alpha (\lambda_1 f_1 + \lambda_2 f_2) = \lambda_1 D^\alpha f_1 + \lambda_2 D^\alpha f_2,\]
<p>where $f_1, f_2$ are functions and $\lambda_1, \lambda_2$ are constants.
In particular, let’s suppose this linearity applies to an <em>infinite</em>
collection of exponentials multiplied by constants $\lambda$, arranged
into an integral</p>
\[f(x) = \int_{-\infty}^\infty d\omega \, \lambda(\omega) e^{i\omega x}.\]
<p>Then by linearity,</p>
\[D^\alpha f(x) = \int_{-\infty}^\infty d\omega \, \lambda(\omega) D^\alpha
e^{i\omega x} = \int_{-\infty}^\infty d\omega \, \lambda (\omega)
(i\omega)^\alpha e^{i\omega x}. \tag{1} \label{exp}\]
<p>Functions which can be written this way are said to have a <em>Fourier
representation</em>, with the function $ \lambda (\omega)$ the <em>Fourier
transform</em>. Most functions have one!
Let’s do a very simple example: the sine function, bane of high school
trigonometry classes everywhere.
What is its half derivative?
We start by writing sine in terms of exponentials as</p>
\[\sin(x) = \frac{1}{2i}(e^{ix} - e^{-ix}).\]
<p>We then take a half-derivative using our exponential rule and linearity:</p>
\[\sqrt{D} \sin(x) = \frac{1}{2i}(\sqrt{D} e^{ix} - \sqrt{D} e^{-ix}) = \frac{1}{2i}\left(\sqrt{i} e^{ix} - \sqrt{-i} e^{-ix}\right).\]
<p>There are a few things to note.
First, with the principal roots $\sqrt{\pm i} = e^{\pm i\pi/4}$, this
simplifies to the real function $\sin(x + \pi/4)$. That is a lucky
accident: $\sqrt{D}e^{-x} = \sqrt{-1}\,e^{-x}$ is not real, so in
general, half derivatives of real functions need not be real.
It should also be clear there is some ambiguity about
which roots we choose.
In general this ambiguity is harmless, and we just take the principal
values (with arguments between $-\pi$ and $\pi$), but this issue will
crop up below in a subtle way.
Finally, observe that we can just as easily do crazy things like take
$i$ derivatives! We set $\alpha = i$, so the $i$th derivative of sine is</p>
\[D^i \sin(x) = \frac{1}{2i}\left(i^i e^{ix} - (-i)^i e^{-ix}\right) =
\frac{1}{2i}(e^{-\pi/2 + ix} - e^{+\pi/2 - ix}),\]
<p>since the principal values are</p>
\[i^i = e^{i (i \pi/2)} = e^{-\pi/2}, \quad (-i)^i = e^{i (-i \pi/2)} = e^{\pi/2}.\]
<p>I’m not sure if this has any applications, but it’s cute.
I invite the interested reader to take $\pi$ derivatives of sine. What
better way to celebrate $\pi$ day!</p>
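<p>These formulas are easy to test numerically. The sketch below (Python, with the illustrative name <code>frac_deriv_sin</code>) applies the exponential rule to each term of sine; Python’s complex power takes principal values, which is the branch choice assumed in the text.</p>

```python
import cmath
import math


def frac_deriv_sin(alpha, x):
    """D^alpha sin(x), using sin(x) = (e^{ix} - e^{-ix}) / 2i
    and the rule D^alpha e^{wx} = w^alpha e^{wx} on each term."""
    plus = (1j) ** alpha * cmath.exp(1j * x)     # term with w = i
    minus = (-1j) ** alpha * cmath.exp(-1j * x)  # term with w = -i
    return (plus - minus) / 2j


x = 0.3  # arbitrary sample point
print(frac_deriv_sin(0.5, x), math.sin(x + math.pi / 4))
print(frac_deriv_sin(1, x), math.cos(x))
print((1j) ** 1j, math.exp(-math.pi / 2))
```

<p>With principal values, <code>frac_deriv_sin(0.5, x)</code> agrees with $\sin(x + \pi/4)$ to rounding error, <code>frac_deriv_sin(1, x)</code> reduces to $\cos(x)$, and the last line compares the principal value of $i^i$ with $e^{-\pi/2}$.</p>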
<h4 id="fractorials">Fractorials</h4>
<p>Exponentials aren’t the only nice functions we can use to define
fractional derivatives.
In fact, a more common approach is to use <em>powers</em>.
The first function we encounter in high school is usually the identity
function, $f(x) = x$.
From there, we build up to polynomials $x^m$, and then arbitrary
powers $x^s$.
The derivative of a power has a very simple form:</p>
\[D x^s = s x^{s-1}.\]
<p>If we differentiate again, we bring down a factor of $s - 1$ and
reduce the index again. And so on and so forth. This leads to the expression for
$n$ derivatives:</p>
\[D^n x^s = s(s- 1) \cdots (s - n + 1) x^{s-n}.\]
<p>So far, this doesn’t look like something we can easily continue to
non-integer values of $n$.
But let’s assume for a moment $s$ is an integer.
Then we can write</p>
\[s(s- 1) \cdots (s - n + 1) = \frac{s(s - 1) (s-2) \cdots 1}{(s -
n)(s-n - 1) \cdots 1} = \frac{s!}{(s -n)!},\]
<p>where we have used the good old factorial function $s!$.
Thus, we can write</p>
\[D^n x^s = \frac{s!}{(s -n)!} x^{s-n}.\]
<p>To analytically continue this, we need a beautiful object called the
Gamma function $\Gamma$.
We’ll define it properly below, but for the moment, the
only properties we need are that (a) it agrees with the factorial
function at (shifted) integer values,</p>
\[\Gamma(k + 1) = k!;\]
<p>and (b) is defined for non-integer values as well. I like to think of it as the
“fractorial” because it makes sense for fractional arguments! In addition to
delightfully bad puns, the Gamma function lets us write</p>
\[D^n x^s = \frac{\Gamma(s + 1)}{\Gamma(s -n + 1)} x^{s-n},\]
<p>and immediately continue to the fractional derivative:</p>
\[D^\alpha x^s = \frac{\Gamma(s + 1)}{\Gamma(s -\alpha + 1)}
x^{s-\alpha}. \tag{2} \label{power}\]
<p>Too easy! Once again, we can check the multiplicative property:</p>
\[\begin{align*}
D^\alpha D^\beta x^s & = \frac{\Gamma(s + 1)}{\Gamma(s -\beta + 1)}
D^\alpha x^{s-\beta} \\
& = \frac{\Gamma(s + 1)}{\Gamma(s -\beta + 1)}
\cdot \frac{\Gamma(s - \beta + 1)}{\Gamma(s -\alpha - \beta + 1)}
x^{s-\beta - \alpha} \\
& = \frac{\Gamma(s + 1)}{\Gamma(s -\alpha -\beta + 1)}x^{s-\beta -
\alpha} = D^{\alpha+\beta} x^s.
\end{align*}\]
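<p>Rule (\ref{power}) is straightforward to evaluate with the standard library’s <code>math.gamma</code>. In this Python sketch, <code>frac_deriv_power</code> is a hypothetical helper returning the coefficient and exponent of $D^\alpha x^s$; it checks that two half derivatives of $x$ give the ordinary derivative.</p>

```python
import math


def frac_deriv_power(alpha, s):
    """Return (coefficient, exponent) of D^alpha x^s from the Gamma-function rule."""
    coefficient = math.gamma(s + 1) / math.gamma(s - alpha + 1)
    return coefficient, s - alpha


# Half derivative of x: coefficient Gamma(2) / Gamma(3/2) = 2 / sqrt(pi).
c1, p1 = frac_deriv_power(0.5, 1.0)
# A second half derivative, now acting on x^(1/2).
c2, p2 = frac_deriv_power(0.5, p1)
total = c1 * c2  # overall coefficient after two half derivatives

print(c1, p1)
print(total, p2)
```

<p>Here <code>c1</code> is $2/\sqrt{\pi}$, so $\sqrt{D}\,x = 2\sqrt{x/\pi}$, and <code>total</code> comes out as $1$ with final exponent $0$, i.e. $Dx = 1$. Note that <code>math.gamma</code> raises an error at the poles of the Gamma function, so cases where the denominator blows up have to be handled by hand.</p>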
<p>So this gives us another, evidently different way to define fractional
derivatives. It will apply to any sum or integral of powers of
$x$, for instance, infinite polynomials called <em>power series</em>, and
their close cousins the <em>Laurent series</em> which include reciprocal powers:</p>
\[\sum_{k = 0}^\infty a_k x^k, \quad \sum_{k = -\infty}^\infty b_k x^k.\]
<p>These cover a lot of ground, and there is an even more general object
called the <em>Mellin transform</em>, analogous to the Fourier transform. But
we won’t go there.
Instead, let’s do another simple example.
One of the interesting properties of the Gamma function is that it
blows up to (plus or minus) infinity at the nonpositive integers:</p>
\[\Gamma(-n) = \pm\infty, \quad n = 0, 1, 2, \ldots.\]
<p>This is actually essential to get sensible answers!
For instance, let’s take the derivative of a constant, $1 = x^0$.
Then according to our definition,</p>
\[D x^0 = \frac{\Gamma(0 + 1)}{\Gamma(0 -1 + 1)} x^{0 - 1} =
\frac{\Gamma(1)}{\Gamma(0)} x^{- 1} = 0,\]
<p>since the $\Gamma(0)$ in the denominator makes the whole thing vanish.
More intriguingly, these infinities sometimes <em>cancel</em> in sensible ways.
For instance, if we take a derivative of $1/x$, we should get
$-1/x^2$. If we plug $x^{-1}$ into our formula, it gives</p>
\[D x^{-1} = \frac{\Gamma(-1 + 1)}{\Gamma(-1 -1 + 1)} x^{-1 - 1} =
\frac{\Gamma(0)}{\Gamma(-1)} x^{-2}.\]
<p>Both the numerator and the denominator blow up, which should make us
queasy. But there is a trick here. It turns out that for any $z$,
the Gamma function obeys the <em>functional equation</em></p>
\[\Gamma(1 + z) = z\Gamma(z).\]
<p>Since $\Gamma(k + 1) = k!$, this gives the usual relation for factorials,</p>
\[k! = \Gamma(k + 1) = k\Gamma(k) = k \cdot (k - 1)!.\]
<p>It also gives the sneaky result $\Gamma(0) = (-1)\Gamma(-1)$. Both $\Gamma(0)$ and
$\Gamma(-1)$ blow up of course, but in the derivative of $1/x$, the
$\Gamma(-1)$ terms cancel, leaving $(-1)x^{-2} = -1/x^2$ as required.</p>
<h4 id="gamma-and-tongs">Gamma and tongs</h4>
<p>This all sounds great, but you might be wondering why the Gamma
function is the right way to extend the factorial function away from
whole numbers.
In fact, any old function that interpolates between them would also
work and satisfy the multiplicative property.
What we’re going to do in this last section is use the fractional
derivatives, defined using exponentials, to <em>derive</em> the Gamma
function continuation.
And in order to do this, we have to grit our teeth and define the
Gamma function in all its glory:</p>
\[\Gamma(s) = \int_{0}^\infty dt\, t^{s-1} e^{-t}.\]
<p>If you’re interested, you can find proofs of the functional equation and so on
<a href="https://en.wikipedia.org/wiki/Gamma_function">elsewhere</a>.
Instead, we’re going to make the sneaky change of variables $t =
\omega x$, yielding</p>
\[\Gamma(s) = x^{s} \int_{0}^\infty d\omega\, \omega^{s-1} e^{-\omega
x}.\]
<p>If we change $s \to -s$, and rearrange, we get a formula for $x^s$
in terms of exponentials:</p>
\[x^{s} = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\, \omega^{-(1+ s)}
e^{-\omega x}. \tag{3} \label{gamma}\]
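<p>As a quick sanity check of (\ref{gamma}), take $s = -1$ and $x > 0$ (the integral only converges for negative $s$ and positive $x$). Since $\Gamma(1) = 1$ and $\omega^0 = 1$,</p>
\[\frac{1}{\Gamma(1)}\int_{0}^\infty d\omega\, e^{-\omega x} = \left[-\frac{e^{-\omega x}}{x}\right]_0^\infty = \frac{1}{x} = x^{-1},\]
<p>as expected.</p>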
<p>Great! Now we just go ahead and use rule (\ref{exp}), with the hope we
will get rule (\ref{power}).
As usual, we proceed using linearity:</p>
\[\begin{align*}
D^\alpha x^{s} & = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\,
\omega^{-(1+ s)} D^\alpha e^{-\omega x} \\
& = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\,
\omega^{-(1+ s)} (-\omega)^\alpha e^{-\omega x} \\
& = \frac{(-1)^\alpha}{\Gamma(-s)}\int_{0}^\infty d\omega\,
\omega^{-(1+ s - \alpha)} e^{-\omega x} \\
& = \frac{(-1)^\alpha}{\Gamma(-s)} \cdot \Gamma[-(s-\alpha)]x^{s-\alpha},
\end{align*}\]
<p>where on the last line we used (\ref{gamma}), but with $s
-\alpha$ instead of $s$.
This isn’t quite what we want.
To make progress, we’ll take advantage of the <em>reflection
formula</em> for the Gamma function (derived <a href="https://hapax.github.io/mathematics/zeta/">here</a>
for instance):</p>
\[\Gamma(z) \Gamma(1 - z) = \frac{\pi}{\sin(\pi z)}.\]
<p>We can apply this to both $\Gamma(-s)$ and $\Gamma[-(s-\alpha)]$ to
get</p>
\[\begin{align*}
D^\alpha x^{s} & = (-1)^\alpha \frac{\sin(\pi
s)}{\sin[\pi(s-\alpha)]}\cdot \frac{\Gamma(s+1)}{\Gamma(s-\alpha + 1)} x^{s-\alpha}.
\end{align*}\]
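<p>The reflection step is easy to get wrong by a sign, so here is a small numerical check, again only a sketch.
It compares $\Gamma[-(s-\alpha)]/\Gamma(-s)$ with the sine-and-Gamma expression above, leaving out the common factor $(-1)^\alpha$.
The sample values of $s$ and $\alpha$ are arbitrary, chosen only to avoid the poles of the Gamma function.</p>

```python
import math

def gamma_ratio(s, alpha):
    # Left-hand side: Gamma[-(s - alpha)] / Gamma(-s)
    return math.gamma(-(s - alpha)) / math.gamma(-s)

def reflected_form(s, alpha):
    # Right-hand side after applying the reflection formula to both Gammas
    sines = math.sin(math.pi * s) / math.sin(math.pi * (s - alpha))
    gammas = math.gamma(s + 1) / math.gamma(s - alpha + 1)
    return sines * gammas

# Arbitrary non-integer test points (assumed values, not from the post)
samples = [(0.3, 0.5), (1.7, 0.25), (-0.4, 1.3)]
for s, alpha in samples:
    print(s, alpha, gamma_ratio(s, alpha), reflected_form(s, alpha))
```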
<p>This is almost (\ref{power}), the thing we were after!
But there is this strange factor with sines out the front.
Recall the definition of sine in terms of complex exponentials.
This lets us write the funny factor as</p>
\[(-1)^\alpha \frac{\sin(\pi s)}{\sin[\pi(s-\alpha)]} = \frac{e^{\pi i
s} - e^{-\pi i s}}{(-1)^\alpha e^{\pi i (s-\alpha)} - (-1)^\alpha e^{-\pi i (s-\alpha)}}.\]
<p>It would be magical if that $(-1)^\alpha$ could somehow behave
differently and cancel the $\alpha$ terms floating around, right?
Well, turns out it does!
We can write $-1 = e^{\pm \pi i}$, and hence</p>
\[(-1)^\alpha = e^{\pm \pi i \alpha}.\]
<p>I won’t spell out the details, but if you look at <a href="https://hapax.github.io/mathematics/zeta/">this proof</a> of the reflection
formula, the two different terms in the sine arise from parts of an
integration contour which lie in almost the same place, but where we take
roots in different ways.
In particular, evaluating $(-1)^\alpha$ gives $e^{\pm \pi i \alpha}$
respectively, so they cancel the $\alpha$ terms after all.
The upshot is that our funny factor is just unity:</p>
\[\frac{e^{\pi i
s} - e^{-\pi i s}}{(-1)^\alpha e^{\pi i (s-\alpha)} - (-1)^\alpha
e^{-\pi i (s-\alpha)}} = \frac{e^{\pi i
s} - e^{-\pi i s}}{e^{\pi i \alpha} e^{\pi i (s-\alpha)} - e^{-\pi i \alpha}
e^{-\pi i (s-\alpha)}} = \frac{e^{\pi i
s} - e^{-\pi i s}}{e^{\pi i s} - e^{-\pi i s}} = 1.\]
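<p>As a sanity check that does not depend on how we take roots, let $\alpha = n$ be a whole number and $s$ a non-integer.
Shifting the argument of a sine by $n\pi$ only changes its sign $n$ times, so</p>
\[\sin[\pi(s - n)] = (-1)^n \sin(\pi s) \quad \Longrightarrow \quad
(-1)^n \frac{\sin(\pi s)}{\sin[\pi(s-n)]} = \frac{(-1)^n}{(-1)^n} = 1.\]
<p>For ordinary derivatives, then, the funny factor is unity with no choice of branch needed.</p>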
<p>Thus, our exponential rule actually
reproduces the rule for powers of $x$ involving the Gamma
function! Now, to be clear, fractional derivatives are a big and
mathematically heavy topic, and I’ve only skimmed the surface.
But it’s neat that the two simplest approaches agree.</p>
<h4 id="acknowledgments">Acknowledgments</h4>
<p>Thanks to J.A. for chatting about fractional derivatives, and getting
me thinking about the simplest way to define them.</p>
<!-- Our exponential definition yields an *antiderivative* operator:
$$
D^{-1} e^{\omega x} = \frac{1}{\omega}e^{\omega x}.
$$
This is the usual antiderivative, except without the constant. -->
David Wakeham
March 13, 2021. Can you take half a derivative? Or π derivatives? Or even √–1 derivatives? It turns out the answer is yes, and there are two simple but apparently different ways to do it. I show that one implies the other!
The statistical basis of Fermi estimates
2021-02-12T00:00:00+00:00
http://hapax.github.io/physics/hacks/mathematics/statistics/fermi-log-normal
<p><strong>February 12, 2021.</strong> <em>Why are Fermi approximations so effective? One
important factor is log normality, which occurs for large random
products. <!--, also related to the mechanism underlying
the Newcomb-Benford law for first digits.--> Another element is
variance-reduction through judicious subestimates. I discuss both
and give a simple heuristic for the latter.</em></p>
<h4 id="introduction">Introduction</h4>
<p>Fermi approximation is the art of making good order-of-magnitude estimates.
I’ve written about them
at greater length
<a href="https://hapax.github.io/assets/fermi-estimates.pdf">here</a> and
<a href="https://hapax.github.io/physics/teaching/hacks/napkin-hacks/#sec-3">here</a>,
but I’ve never really found a satisfactory explanation for why they work.
Order-of-magnitude is certainly a charitable margin of
error, but time and time again, I find they are better than they have any right to be!
Clearly, there must be an underlying statistical explanation for this apparently
unreasonable effectiveness.</p>
<!-- We will try to explain the first using logarithmic uniformity, which is
the same mechanism underlying the anomalous distribution of first
digits known as the
[Newcomb-Benford law](https://en.wikipedia.org/wiki/Benford%27s_law).
We give a looser but related explanation of the second in terms of strategies for
variance-reduction in human error. -->
<h4 id="products-and-log-normality">Products and log-normality</h4>
<p>There are two key techniques: the use of geometric means, and the
factorisation into subestimates.
We start with geometric means.
Suppose a random variable $F$ is a product of many independent random
variables,</p>
\[F = X_1 X_2 \cdots X_N.\]
<p>Then the logarithm of $F$ is a sum of many random variables $Y_i =
\log X_i$:</p>
\[\log F = \log X_1 + \log X_2 + \cdots + \log X_N = \sum_{i=1}^N Y_i.\]
<p>By the central limit theorem for unlike variables (see
e.g. <a href="https://hapax.github.io/hacks/mathematics/statistics/clt/">this post</a>),
for large $N$ this approaches a normal distribution:</p>
\[\log F \to \mathcal{N}(\mu, \sigma^2), \quad \mu := \sum_i \mu_i,
\quad \sigma^2 := \sum_i \sigma_i^2,\]
<p>where the $Y_i$ have mean $\mu_i$ and variance $\sigma_i^2$.
We say that $F$ has a <em>log-normal</em> distribution, since its log is
normal.</p>
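<p>To see log-normality emerge, here is a small simulation sketch.
The choice of factors (uniform on $[0.5, 2]$), the number of factors and the sample size are all assumptions made for illustration.
It compares the skewness of $F$ with that of $\log F$; a normal distribution has skewness zero.</p>

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def log_product(n_factors):
    """Return log F for F = X_1 ... X_N, with X_i uniform on [0.5, 2] (an assumed toy model)."""
    return sum(math.log(random.uniform(0.5, 2.0)) for _ in range(n_factors))

def skewness(samples):
    """Standardised third moment; zero for a symmetric distribution."""
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    return statistics.fmean([((y - mean) / sd) ** 3 for y in samples])

N = 30
logs = [log_product(N) for _ in range(20_000)]
products = [math.exp(y) for y in logs]

print("skewness of F:    ", skewness(products))
print("skewness of log F:", skewness(logs))
```

<p>The product itself is heavily skewed, while its logarithm is close to symmetric, as the central limit theorem suggests.</p>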
<!-- To get uniformity into the picture, we can zoom in on the region near
$F = e^\mu$ where the probability density is approximately uniform.
More carefully, the density is
$$
p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}.
$$
Taylor-expanding near $x = \mu$ gives
$$
p(x) = \frac{1}{\sigma\sqrt{2\pi}}
\left[1 - \frac{(x-\mu)^2}{2\sigma^2} + O(x^4)\right].
$$
This looks uniform provided $(x - \mu)^2 \ll \sigma^2$.
For instance, at a third of a standard deviation, $x = \mu + \sigma/3$,
we have
$$
1 - \frac{(x-\mu)^2}{2\sigma^2} = 1 - \frac{1}{18} \approx 0.94,
$$
and $\text{erf}(1/\sqrt{18}) \approx 0.26$, about a quarter of the
probability mass, lies underneath.
This is what we mean when we say that $F$ is logarithmically uniform. -->
<h4 id="geometric-means">Geometric means</h4>
<p>In Fermi estimates, one of the basic techniques is to take geometric
means of estimates, typically an overestimate and an underestimate.
For instance, to Fermi estimate the population of Chile, I could
consider a number like one million which seems much too low, and a
number like one hundred million which seems much too high, and take
their geometric mean:</p>
\[\sqrt{(1 \text{ million}) \times (100 \text{ million})} = 10 \text{ million}.\]
<p>Since population is a product of many different factors, it is
reasonable to expect it to approximate a log-normal distribution.
Then, after logs, the geometric mean $\sqrt{ab}$ becomes the
arithmetic mean of $\log a$ and $\log b$:</p>
\[\log \sqrt{ab} = \frac{1}{2}(\log a + \log b).\]
<p>Taking the mean $\mu$ of $\log F$ as the true value, the logs of these
geometric means provide an
<a href="https://en.wikipedia.org/wiki/Bias_of_an_estimator">unbiased estimator</a>
of $\mu$.
Moreover, the variance of the estimate will decrease as $1/k$ for $k$
samples (assuming human estimates sample from the distribution), so the typical
error shrinks like $1/\sqrt{k}$ and more is better.
To see how much better I could do on the Chile population estimate, I
solicited guesses from four friends, and obtained $20, 20, 30$ and $35$
million.
Combining with my estimate, I get a geometric mean</p>
\[(10 \times 20 \times 20 \times 30 \times 35)^{1/5} \text{ million}
\approx 21 \text{ million}.\]
<p>The actual population is around $18$ million, so the estimate made
from more guesses is indeed better!
This is also better than the arithmetic average, $23$ million.
Incidentally, this also illustrates the
<a href="https://hapax.github.io/physics/mathematics/statistics/crowd/">wisdom of the crowd</a>,
also called “diversity of prediction”.
The individual errors from a broad spread of guesses tend to cancel
each other out, leading to a better-behaved average, though in this case
in logarithmic space.</p>
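<p>The arithmetic here is short enough to check in a few lines of Python.
This sketch uses the five guesses quoted above (in millions) and computes the geometric mean as the exponential of the average log, alongside the ordinary average.</p>

```python
import math
import statistics

guesses = [10, 20, 20, 30, 35]  # millions, as quoted in the text

# Geometric mean = exp(arithmetic mean of the logs)
log_mean = statistics.fmean([math.log(g) for g in guesses])
geometric = math.exp(log_mean)
arithmetic = statistics.fmean(guesses)

print(f"geometric mean:  {geometric:.1f} million")
print(f"arithmetic mean: {arithmetic:.1f} million")
```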
<p>In general, Fermi estimates work best for numbers which are large
random products (this is how we try to solve them!), so the problem
domain tends to enforce the statistical properties we want.
For many examples of log-normal distributions in the real world, see
<a href="https://academic.oup.com/bioscience/article/51/5/341/243981">Limpert, Stahel and Abbt (2001)</a>.
It’s worth noting that not everything we can Fermi estimate is
log-normal, however.
Many things in the real world obey power laws, for instance, and
although you can exploit this to make better Fermi estimates (as
lukeprog does in
<a href="https://www.lesswrong.com/posts/PsEppdvgRisz5xAHG/fermi-estimates#Example_4__How_many_plays_of_My_Bloody_Valentine_s__Only_Shallow__have_been_reported_to_last_fm_">his tutorial</a>),
we can happily Fermi estimate power-law distributed numbers without
this advanced technology.</p>
<p>Are Fermi estimates unreasonably effective in this context?
Maybe.
But the estimates work best in the high-density core where things look
uniform, not out at the tails, and it’s not until we get to the tails that the difference
between the log-normal and power law (or exponential, or Weibull, or
your favourite skewed distribution) becomes pronounced.
So the unreasonable effectiveness here can probably be explained by
the resemblance to the log-normal, though this is something I’d like
to check more carefully in future.</p>
<!-- In general, we only expect Fermi estimates to work for numbers which
are the product of many factors.
But this is precisely the sorts of things we use Fermi estimates for!
In a sense, the problem domain naturally leads to logarithmic
uniformity.
Incidentally, I've talked about "uniformity", but the geometric mean
is still a measure of central tendency for any distribution, and is
particularly nice for a lognormal one, which arise for products of
random variables.
The magic of geometric means manifests most
strongly in the near-uniform blob at the centre. -->
<!-- #### The Newcomb-Benford law
Logarithmic uniformity also explains an odd pattern in the first
digits of naturally occurring numbers like tax returns, stock market
prices, populations, river lengths, physical constants, and even
powers of $2$.
The pattern, called the *Newcomb-Benford law* after
[Simon Newcomb](https://en.wikipedia.org/wiki/Simon_Newcomb) and
[Frank Benford](https://en.wikipedia.org/wiki/Frank_Benford), is as
follows: for base $b$, the digit $d \in \\{1, 2, \ldots, b-1\\}$
occurs with relative frequency
$$
p_b(d) = \log_b \left(\frac{d+1}{d}\right) = \frac{1}{\log b}\log \left(\frac{d+1}{d}\right).
$$
It initially seems bizarre that digits do not occur with equal
frequency.
But as neatly explained by
[Pietronero et al. (1998)](https://arxiv.org/pdf/cond-mat/9808305.pdf),
it follows immediately if the relevant numbers are logarithmically uniform.
Let $X$ be our random number.
Then the first digit is $d$ if
$$
db^k \leq X < (d+1)b^{k} \quad \Longrightarrow \quad \log_b d + k \leq
\log_b X < \log_b(d+1) + k
$$
for some integer $k$.
If $X$ is logarithmically uniform, for instance sitting near the mean
of a big random product, then $\log_b X$ is uniformly
distributed, and lies in the interval $I_d :=
[\log_b d, \log_b (d+1)]$ with probability
$$
(\log_b (d+1) + k) - (\log_b d + k) = \frac{1}{\log b}\log \left(\frac{d +
1}{d}\right) = p_b(d).
$$
This provides a simple way to check for fraud on tax returns, for
instance.
Just compute relative frequencies of first digits in different bases
and check they obey Newcomb-Benford!
You might wonder why something totally deterministic, like the first
digit of a power of $2$, also obeys Benford's law.
Here is a pie chart of initial decimal digits for the first $10,000$ binary
powers, which follows the Newcomb-Benford law exactly:
<figure>
<div style="text-align:center"><img src
="/images/posts/benford1.png"/>
</div>
</figure>
Here is the Python code to generate it.
You can check it for other numbers besides $2$ as well by simply
changing the `power` variable:
```python
import math
import matplotlib.pyplot as plt

maxpower = 10000  # Number of powers to check
power = 2  # Change to check other powers
nums = [str(d) for d in range(1, 10)]  # Pie-chart labels
# Newcomb-Benford probabilities log10((d+1)/d)
benford = [math.log10(d + 1) - math.log10(d) for d in range(1, 10)]
# Count leading decimal digits of power**i
firstdig = [0 for i in range(9)]
for i in range(maxpower):
    ind = int(str(power**i)[0]) - 1
    firstdig[ind] = firstdig[ind] + 1
fig, ax = plt.subplots()
fig.set_facecolor('white')
ax.pie(firstdig, labels=nums, autopct='%1.1f%%', startangle=90)
# Change 'firstdig' to 'benford' for probabilities
ax.axis('equal')
plt.show()
```
The mechanism for logarithmic uniformity here is slightly different,
and discussed in depth in Serge Tabachnikov's
[book on geometric billiards](http://www.personal.psu.edu/sot2/books/billiardsgeometry.pdf).
In this case, $X = 2^n$, so the first digit is $d$ just in case
$$
\log_{10}d + k \leq n\log_{10} 2 < \log_{10}(d + 1) + k.
$$
Let $\text{frac}(x)$ denote the fractional part of $x$, and define
$x_n := \text{frac}(n\log_{10} 2)$.
Taking fractional parts gives
$$
\log_{10}d \leq x_n < \log_{10}(d + 1).
$$
It turns out that, since $x_1 = \log_{10} 2$ is irrational,
$x_n$ jumps randomly around the unit interval, and forms an
"equidistribution" which spends equal times in equal areas.
For a proof, see Tabachnikov's book.
But although the fundamental cause is different, the outcome is still
logarithmic uniformity, and the Newcomb-Benford law results. -->
<h4 id="the-philosophy-of-subestimates">The philosophy of subestimates</h4>
<p>Now we’ve dealt with geometric means and log-normality, we
turn to the effectiveness of factorising a Fermi estimate.
If we take logarithms, factors become summands, and we’ll reason about those since they are simpler.
If $Z = X + Y$ is a sum of independent random variables, the variance
is additive, so that</p>
\[\text{var}(Z) = \text{var}(X) + \text{var}(Y).\]
<p>Thus, splitting a sum into estimates of the summands and adding them
should not change the variance of the guess.
Of course, there is a fallacy in this reasoning: humans are not
sampling from the underlying distribution!
When we guess, we introduce our own random errors.
For instance, my estimate for $Z$ will have some human noise $\varepsilon_Z$:</p>
\[\hat{Z} = Z + \varepsilon_Z.\]
<p>Similarly, my guesses for $X$ and $Y$ have some random errors
$\varepsilon_X$ and $\varepsilon_Y$.
There is no reason for the variances of $\varepsilon_X$ and $\varepsilon_Y$
to add up to the variance of $\varepsilon_Z$.
The sum could be bigger, or it could be smaller.
But a good decomposition should reduce the combined variance:</p>
\[\text{var}(\varepsilon_X) + \text{var}(\varepsilon_Y) < \text{var}(\varepsilon_Z).\]
<p>If log-normality is the science of Fermi estimates, picking
variance-reducing subestimates is the art.
<!-- But there is a connection to our earlier discussion.
I think the human error $\varepsilon_X$ will roughly mimic the
empirical distribution of $Z$ we have seen in the world.
If it is biased, so is $\varepsilon_X$; it we have only seen a few
examples, the variance of $\varepsilon_X$ will probably be large, and
decrease roughly as $1/k$ with $k$ examples.
So the general strategy for variance reduction is to factorise into
things we have seen before.
We can even use these data points to generate subestimates by geometric averaging.-->
But I suspect that $\hat{Z}$ roughly speaking behaves like a <em>test
statistic</em> for $Z$, with the number of samples corresponding to how
many data points for $Z$ we have encountered.
So we expect that $\text{var}(\varepsilon_Z)$ will vanish roughly as
$1/k$ with $k$ samples.
If we have more exposure to the distributions for $X$ and $Y$,
the combined error will probably be smaller.
This is why we carve into subfactors we understand!</p>
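<p>Here is a toy simulation of that suspicion, and it is only a toy.
It assumes a guess behaves like the average of $k$ noisy recollections with independent Gaussian noise, which is a modelling assumption rather than anything established about human estimators.
The exposure counts ($2$ for $Z$, $10$ each for $X$ and $Y$) are invented for illustration.</p>

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

def guess_error(k, noise_sd=1.0):
    """Error of a guess, modelled (as an assumption) as the mean of k noisy recollections."""
    return statistics.fmean([random.gauss(0.0, noise_sd) for _ in range(k)])

def error_variance(k, trials=20_000):
    """Empirical variance of the guess error over many simulated guesses."""
    return statistics.pvariance([guess_error(k) for _ in range(trials)])

var_z = error_variance(2)   # whole quantity Z: little exposure
var_x = error_variance(10)  # subfactor X: more exposure
var_y = error_variance(10)  # subfactor Y: more exposure

print("var(eps_Z)             :", var_z)
print("var(eps_X) + var(eps_Y):", var_x + var_y)
```

<p>In this model each variance is close to $1/k$, so the two well-understood subfactors together beat the single poorly understood whole.</p>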
<h4 id="variance-reduction-in-practice">Variance reduction in practice</h4>
<p>I’ll end with a speculative rule of thumb for when to factor: try generating over- and
underestimates for the factors and the product, which in additive
notation give</p>
\[(\Delta X)^2 + (\Delta Y)^2, \quad (\Delta Z)^2\]
<p>where $\Delta$ refers to the difference of the (logarithm of the) over-
and underestimate.
Factorise if the first estimated error is smaller than the second.
Let’s illustrate by returning to the population of Chile.
I can try factoring it into a number of regions multiplied by the
average number of people per region.
Taking logs (in base $10$) of the over- and underestimate of Chile’s
population I gave above, I get</p>
\[(\Delta Z)^2 = (\log_{10} 10^8 - \log_{10} 10^6)^2 = 4.\]
<p>On the other hand, for regions I would
make a lower guess of $5$ and an upper guess of $30$, with a squared difference in logs of $(\Delta X)^2 \approx 0.6$.
For regional population, I would make a lower guess of $5\times 10^5$ and an
upper guess of $5\times 10^6$, with $(\Delta Y)^2 = 1$.
Thus,</p>
\[(\Delta X)^2 + (\Delta Y)^2 = 1.6 < 4 = (\Delta Z)^2.\]
<p>The guess from the factorisation (taking geometric means) is</p>
\[\sqrt{5 \times 30 \times (5\times 10^5) \times (5\times 10^6)} \approx 19 \text{
million}.\]
<p>This is even better than the crowdsourced estimate!
For reference, the number of regions is $16$, while our estimated mean is around
$12$, and the average population per region is a bit over a million,
which we’ve mildly overestimated at $1.6$ million.
The two balance out and give a better overall estimate.
<!-- This suggests a diversity of prediction mechanism is at play with -->
<!-- subestimates, but I haven't worked out the details. --></p>
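<p>The rule of thumb is mechanical enough to write down as code.
The sketch below reproduces the Chile numbers; the function names are made up for this example, and the over- and underestimates are the ones quoted above.</p>

```python
import math

def spread_sq(low, high):
    """Squared difference of the base-10 logs of an over- and underestimate."""
    return (math.log10(high) - math.log10(low)) ** 2

def geometric_midpoint(low, high):
    """Geometric mean of an under- and overestimate."""
    return math.sqrt(low * high)

whole = (1e6, 1e8)       # direct guesses for the population
regions = (5, 30)        # guesses for the number of regions
per_region = (5e5, 5e6)  # guesses for people per region

dz2 = spread_sq(*whole)
dx2 = spread_sq(*regions)
dy2 = spread_sq(*per_region)

# Factorise only if the subestimates have the smaller combined spread
should_factorise = dx2 + dy2 < dz2
estimate = geometric_midpoint(*regions) * geometric_midpoint(*per_region)

print(f"(dX)^2 + (dY)^2 = {dx2 + dy2:.2f}, (dZ)^2 = {dz2:.2f}")
print(f"factorise: {should_factorise}, estimate: {estimate / 1e6:.1f} million")
```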
<h4 id="conclusion">Conclusion</h4>
<p>From a statistical perspective, Fermi estimates are based on two
techniques: geometric means and splitting into subfactors.
We usually estimate things which can be expressed as a product of many
factors. These will tend towards a log-normal distribution by the (log
of the) central limit theorem, so that geometric means provide a good
estimator, exactly like the usual mean for normally distributed variables.
Subestimates, on the other hand, carve guesses into factors we
understand, i.e. have more data points for, so that (assuming they
behave like test statistics) variance is reduced.
The effectiveness of Fermi estimates is quite reasonable after all!
<!-- They're not so unreasonable after all! --></p>
<!-- There is an art to making over- and underestimates
that accurately reflect the variance of our error random variables,
which are involved both in taking geometric means for single
quantities, and reducing variance through subestimates.
Still, it's cool that there is a statistical basis for the different
aspects of the effectiveness of Fermi estimates.
It's not so unreasonable after all! -->
<!-- For instance, if $e^Z$ is the population of Chile, I can factor it
into number of provinces $e^X$ multiplied by the average number of people per province $e^Y$.
But this is likely to *increase* the error, since I know less about
provinces of Chile than I do about Chile compared to other countries.
I suspect that there is a nice quantitative connection to be made
between the variance of $\varepsilon_X$ and the prior data I have on
it. -->
<!--
The Lyapunov condition holds for a sum of independent random
variables.
By taking an exponential, we can turn it into a result for a *product* of
independent variables.
Let $X_i, \mu_i, \sigma_i^2$ be as above, and $X_i = \log Y_i$.
Then
$$
\exp\left[\sum_{i=1}^N X_i\right] = \prod_{i = 1}^N Y_i \to \log
\mathcal{N}(\mu, \sigma^2).
$$
The distribution on the right is not a normal, but a *log-normal*.
It is simply what the normal distribution looks like when viewed in
terms of a variable $y > 0$ defined by $x = \log y$.
In order to plot the density, we use the fact that $dx =
dy/y$, and hence
$$
p(x)\, dx = \frac{dx}{\sqrt{2\pi}\sigma}
e^{-\frac{(x-\mu)^2}{2\sigma^2}} = \frac{dy}{\sqrt{2\pi}\sigma y}
e^{-\frac{(\log y-\mu)^2}{2\sigma^2}}.
$$
So, this is distribution that a product of many independent factors
converges to. -->
<!-- https://arxiv.org/pdf/cond-mat/9808305.pdf -->
David Wakeham
February 12, 2021. Why are Fermi approximations so effective? One important factor is log normality, which occurs for large random products. Another element is variance-reduction through judicious subestimates. I discuss both and give a simple heuristic for the latter.
Reductionism, order and patterns
2021-02-08T00:00:00+00:00
http://hapax.github.io/mathematics/physics/philosophy/form
<p><strong>February 8, 2021.</strong> <em>Some philosophical reflections on the nature of
scientific explanation, structure, emergence, and the unreasonable
effectiveness of mathematics.</em></p>
<h4 id="introduction">Introduction</h4>
<p><span style="padding-left: 20px; display:block">
Explanations must come to an end somewhere.
</span></p>
<div style="text-align: right"><i>Ludwig Wittgenstein</i> </div>
<p>Reductionism is the idea that you explain stuff with
smaller stuff, and keep going until you stop.
In many ways, this describes the explanatory program of 20th century
physics, which, starting from the 19th century puzzles of statistical mechanics,
conjured up atoms, subatomic particles, the zoo of the Standard Model, and even
tinier hypothetical entities like strings and spin foams.
Most physicists spend their time in a lab, on a computer, or in front
of a blackboard, trying to reduce complex things to simple things they understand.
So like Platonism in mathematics, reductionism in physics simply makes
a philosophy out of everyday practice.
We break stuff down, so things reduce; we play abstractly with
mathematical objects, so they exist abstractly.</p>
<p>But also like Platonism, reductionism is a convenient fiction, or rather, a
caricature in which some things are emphasised at the cost of others.
And given the reverence with which philosophers hold the considered
ontological verdicts of science, it’s worth asking: what does science really tell us about the
universe? What sorts of objects are necessary for explanation? Does
explanation go only upwards, or can it go downwards or sideways?
Should we eliminate the things we explained? And what has explanation
to do with existence anyway?
This post is an attempt to unconfuse myself about some of these questions.
<!-- adds a dash of novelty and modern
physics to old (and in some cases hopelessly outdated) debates. --></p>
<h4 id="the-existence-of-shoes">The existence of shoes</h4>
<p><span style="padding-left: 20px; display:block">
… our common sense conception of psychological phenomena constitutes a
radically false theory, a theory so fundamentally defective that both
the principles and ontology of that theory will eventually be
displaced, rather than smoothly reduced, by completed neuroscience.
</span></p>
<div style="text-align: right"><i>Paul Churchland</i> </div>
<p>Physical objects can be described at different levels.
A shoe is constructed from flat sheets of material, curved, cut,
marked, and stuck together in clever ways; materials
curve and stick by virtue of their constituent
chemicals, usually long, jointed molecular chains called polymers;
polymers, in turn, are built like Lego from a smorgasbord of elements;
and each elemental atom is a dense nuclear core, surrounded by
electrons whirring around in elaborate orbitals.</p>
<p>From the properties of the neutrons, protons and
electrons, it seems we can work our way upwards, and infer everything
else.
The laws of quantum mechanics and electromagnetism determine the
orbital structure of the atom. The valence shell of the atom
determines how it can combine with other atoms to form
chemicals. Finally, the structural motifs and functional groups of the
polymers give them the properties the industrial chemist, the designer,
and the cobbler exploit to make a shoe.
Thus, some philosophers conclude, only electrons, protons, and
neutrons exist.
The rest can be eliminated as unnecessary
ontological baggage.
This view is called <em>eliminative reductionism</em>.
It is a hardcore philosophy which does not believe in shoes [<sup><a id="fnr.1" name="fnr.1" class="footref" href="#fn.1">1</a></sup>].</p>
<p>There is a gentler, less silly form of reductionism which grants the
existence of shoes, but insists that they are (in the phrase of Jack
Smart) nothing “over and above” the constituent subatomic particles.
The shoe “just is” electrons and protons and neutrons, in some order;
this is what we mean by a shoe.
There are other ways to characterise the reduction, <!--
for instance, that the properties of the shoe "follow"
from, or are "completely explained by", those of the subatomic particles.
In fact, there is--> and a whole literature devoted to the attendant
subtleties, but most fall under the heading of analytic
micro-quibbles.
<!-- , and won't concern us here.-->
Instead, we will make a much simpler observation: order matters.</p>
<p>Clearly, if we took those subatomic particles, and arranged them in a
different way, we would get different elements, different chemicals,
and a duck or a planetesimal instead of a shoe.
Arrangement is important.
It is patently absurd to try and explain the bulk properties of the
shoe—the fact that it fits around a human foot, for
instance—without appeal to arrangement, since a different
order yields objects which do not fit around a foot.
<!-- If one objects that "fitting around a foot" is some sort of
anthropocentric folly due for elimination, replace it with,
Philip Anderson was perhaps the first physicist to make this argument,
in his famous article ["More is Different"](https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf). -->
Since order has <em>explanatory</em> significance, it should presumably be
tarred with the same ontic brush we apply to things like electrons.</p>
<p>Of course, one may object that explanation does not equal existence.
I can handily account for the continual disappearance of my socks by
the hypothesis of sock imps.
But this is a bad explanation! It’s not consistent with other reliably known facts about the world.
Sock imps don’t make the ontic cut, not because there is no link between
explanation and what we deem to exist, but because that link should
only be made for <em>robust</em> explanations, and the poor little sock imps collapse
at the first empirical hurdle.
That different arrangements of things have different properties is
robust, almost to the point of truism, and there seems to be no
principled reason to ban order <!-- , or *structure* as we will call it,-->
from our ontology.</p>
<h4 id="emergence-vs-structure">Emergence vs structure</h4>
<p><span style="padding-left: 20px; display:block">
More is different.
</span></p>
<div style="text-align: right"><i>Philip W. Anderson</i> </div>
<p>It’s worth noting the parallel
to <em>emergence</em>.
In his famous article
<a href="https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf">“More is Different”</a>,
Philip W. Anderson argued for the idea of domain-specific laws and
dynamical principles which did not follow the strict, one-way
explanatory hierarchy of reduction, particularly in his field of
condensed matter physics.
And indeed, condensed matter makes a science of order itself,
studying how properties of macroscopic wholes (such as phases of
matter) “emerge” from the arrangement of microscopic parts.
Anderson thought of emergence as patterns that appear when you “zoom
out” from the constituents, but which are still made from the
constituents; we are just describing those constituents at a different level.
<!-- the microscopic perspective as the wrong "level"
of description, like being too zoomed in on a microscope, but I think that
it is simply different information. --></p>
<p>But this seems to suffer from the same problem as a reductionist
account of shoes.
The “emergent properties” are not properties of the constituents at
all!
The symmetries, order parameters, <!-- which measure their brokenness,
and collective excitations which emerge as long-range messengers of
disorder, a are not simply the microscopics "zoomed out".--> and
collective excitations studied by condensed matter physicists belong
only to the arrangements.
In fact, systems made from totally different materials can
exhibit the same emergent behaviour [<sup><a id="fnr.2" name="fnr.2" class="footref" href="#fn.2">2</a></sup>]!
They are something new, something “over and above” the spins of the
lattice, or the carbon atoms of a hexagonal monolayer, since different
arrangements of those same parts would have different properties.
We can turn Anderson’s snappy slogan around:
<em>different is more</em>. If arranging things differently gives them new
and different properties, it is a sign of structure, and structure is
something over and above the component parts themselves.
<!-- often characterising phases of matter in terms of
what are called *order parameters*, numbers which characterise the
brokenness of a symmetry. --></p>
<h4 id="what-is-a-particle">What is a particle?</h4>
<p><span style="padding-left: 20px; display:block">
It is raining instructions out there; it’s raining programs; it’s
raining tree-growing, fluff-spreading, algorithms. That is not a
metaphor, it is the plain truth. It couldn’t be any plainer if it were
raining floppy discs.
</span></p>
<div style="text-align: right"><i>Richard Dawkins</i> </div>
<p>We don’t need emergence to argue for structure; we can use the
elementary components themselves.
When philosophers talk about reductionism, they tend to imagine
subatomic particles as small, indivisible blobs, without internal
organisation or further ontological bells and whistles. An electron
might have properties like mass or charge, and obey the curious dictates of quantum mechanics,
but all this is packaged irreducibly and not worth further discussion.
But if we try and unpack all these “simple” properties, we will find
that, like the magic bag of Mary Poppins, a particle is much deeper
than it first appears!
The Large Hadron Collider does not produce evidence for tiny,
structureless blobs.
Rather, it confirms at a rate of petabytes per second that the universe is made of mathematics.</p>
<p>The state-of-the-art definition of a particle is
<!-- (as
[this Quanta article](quantamagazine.org/what-is-a-particle-20201112)
humorously explores) --> a bit of a mouthful: an <em>irreducible
representation of the Lorentz group</em>.
In plain English, being a <em>representation</em> means that particles are
objects which have or “transform with” symmetries, in the same way a circle looks the same however
you rotate it.
That it is <em>irreducible</em> means that it cannot be split into smaller
parts which have the same symmetry, which is the mathematical avatar
of being “indivisible”.
Finally, the symmetry itself, the <em>Lorentz group</em>, is the same group
describing the shape of empty space according to special relativity.
So, in summary, a particle transforms with the symmetries of empty space, and
cannot be split into parts with this symmetry.
<!-- [<sup><a id="fnr.3" name="fnr.3" class="footref" href="#fn.3">3</a></sup>].-->
Lurking implicitly in the background is the whole framework of
quantum mechanics, and in particular, that particles are <em>states in a
Hilbert space</em>. In plain English, we can add and subtract states of a
particle, and compare them to each other.</p>
<p>Thus, every particle is like a mathematical diamond: indivisible,
multifaceted, and structured up to the hilt.
When philosophers of science eagerly assent to believe whatever the particle physicists
tell them, <!-- particularly when it can be tested with unparalleled
precision at the LHC, --> they may not realise what
they signed up for!
Spacetime, quantum mechanics, and symmetries, the Lorentz group and
Hilbert spaces; these are all welded indissolubly to form the most
robust and fundamental objects in the universe.
Even with something as “simple” as an electron, order is
inescapable.</p>
<h4 id="unreasonable-effectiveness-and-natural-patterns">Unreasonable effectiveness and natural patterns</h4>
<p><span style="padding-left: 20px; display:block">
It is difficult to avoid the impression that a miracle confronts us
here, quite comparable… to the two miracles of the existence of
laws of nature and of the human mind’s capacity to divine them.
</span></p>
<div style="text-align: right"><i>Eugene Wigner</i> </div>
<p>It may feel like we have jumped from physical to
mathematical objects in one fell, tendentious swoop.
Do we need Hilbert space, or might another mathematical concept
suffice?
And does Hilbert space really exist, or is it merely a useful human
invention?
If the latter, why so useful?
This is intentionally designed to rhyme with our earlier statement
that order is a robustly explanatory feature of the world, and
distinct from the things that are ordered.
Mathematics really just is the study of order, or <em>patterns</em>, according to their own peculiar and abstract
logic.
Physics (and to a lesser extent the other sciences) study <em>natural
patterns</em>, the way these structures or forms of order are realised in
the natural world.
That applies not just to emergent behaviour like phases of matter, but
even the crystalline makeup of an elementary particle.</p>
<p>I have tried to motivate this perspective from the nature of physical
explanation, but perhaps it can teach us about mathematical
explanation and its relation to the physical world.
A common criticism of Platonism is that, if mathematical objects exist
in some non-physical realm, the ability to do mathematics must involve
extrasensory perception. Clearly, since we are physical
beings, this ability is grounded in physical experience, and now we
have a simple explanation: patterns are naturally realised everywhere, from
cardinal numbers in counting cows to topology in tying a knot to
representation theory in colliding protons. We don’t need magical
access to the World of Forms to see these things; they are all around us.</p>
<p>Similarly, the
<a href="https://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html">unreasonable effectiveness of mathematics</a>
for describing the world, first noted by Eugene Wigner, seems no more
miraculous than the utility of integers for counting loaves of bread
rather than proving results about number theory.
We get the patterns from the world, clean them up, rebrand a little,
and start connecting them together.
The meta-patterns that emerge are remarkable, but the appearance of
“unreasonable effectiveness” is the result of a largely successful PR
campaign to divorce mathematical structures from their physical
origins. As Einstein quipped, “Since the mathematicians have invaded
the theory of relativity, I do not understand it myself anymore.”
The abstraction of pseudo-Riemannian geometry follows from the more
concrete act of bouncing light off mirrors.</p>
<p>More and more, we are seeing this converse of unreasonable
effectiveness, where deep mathematical ideas are inspired by physics.
The living embodiment of this trend is Ed Witten, a string theorist
whose contributions to mathematics have been so profound and
wide-ranging that he earned a Fields Medal (the Nobel prize in
mathematics), the only physicist to have ever done so! <!-- for his contributions to low-dimensional topology.-->
Once again, there is no mystery here; it is just the usual state of
affairs, but without the Platonist guff to distract us.
The patterns are out there and always have been.</p>
<h4 id="what-is-a-pattern">What is a pattern?</h4>
<p><span style="padding-left: 20px; display:block">
Everything comes to be from both subject and form.
</span></p>
<div style="text-align: right"><i>Aristotle</i> </div>
<p>All this raises the question: what is a pattern?
<!-- And how is it conjoined with stuff?-->
The first and most famous philosophical treatment of these issues is
the
<a href="https://plato.stanford.edu/entries/form-matter/">hylomorphism of Aristotle</a>,
who argued that objects are a compound of both form (the structure,
order, or patterns I have discussed here) and matter (energy or “raw
potentia”).
I won’t discuss Aristotle’s ideas in greater detail. Suffice it to say they have
deeply informed this post, and the interested reader should check out James Franklin’s
<a href="https://link.springer.com/book/10.1057/9781137400734">modern take</a>.
<!-- for a modern take on Aristotelian structuralism applied.-->
Instead, I will approach the question by picking on two
smaller problems, taking Newton’s laws as a concrete example.</p>
<p>Newton formulated his laws of motion (such as $F = ma$) in terms of forces and
acceleration. Does the empirical robustness of these laws mean that
this is the only way to formulate them?
Not at all!
There are two other distinct but equivalent versions of classical
mechanics: Lagrangian and Hamiltonian. They explain
the same things, make the same predictions, and thus seem to describe
the same natural patterns. This suggests to me that although patterns
are discovered, formalisms are invented.
A pattern is the equivalence class of descriptions.</p>
<p>Students of physics will be aware that, although Hamiltonian and
Lagrangian mechanics are equivalent to Newton’s laws in the mechanical
context, they have taken on a life of their own.
The Lagrangian approach involves the mathematics of optimising
functions, while the Hamiltonian approach in its most abstract form
becomes the mathematical field of symplectic geometry.
Both Lagrangian and Hamiltonian mechanics can be upgraded (with some
inspired retrospective guesswork) to frameworks for quantum mechanics,
which Newton’s laws simpliciter cannot.
There is much more going on than a simple isomorphism of
description!
A more nuanced view is that humans invent formalisms which can agree
on a domain of interest, a restricted equivalence
class of explanation if you will. But the formalisms will tend to grow
beyond the selvage lines of the original use case.
Formalisms are only <em>perspectives</em> on patterns.
<!-- capture
different patterns, or suggest different extensions, in ways that can
depend sensitively on the formalism and the domain of application. --></p>
<p>This hints at certain structural “metalaws”.
Patterns are big and rhizomatic; human-invented mathematical
frameworks are a single
mathematical glance, if you like, and can only take in part of the pattern.
Even if formalisms agree on some domain, they will suggest different corridors of growth.
A rectangle is both an equiangular quadrilateral and a
parallelogram with diagonals of equal length, but the notions involved and
corresponding generalisations are distinct.
<!-- in the two characterisations., and connect along
different lines of development to broader ideas. -->
This also helps explain the phenomenon of deep connections between
apparently unrelated mathematical objects, sometimes only revealed by
a clever change of perspective.
It could be that there is a <em>paucity of structure</em>, so that by dumb
luck (and the <a href="https://en.wikipedia.org/wiki/Pigeonhole_principle">pigeonhole principle</a>), we often unknowingly describe the same
thing in a different guise.
But to my mind, it is more likely that patterns tend to sprawl and
overlap in complex ways.
<!-- , which also explains how different angles on
the same structure can look unrelated! -->
They are less like a few items of furniture in a crumbling
garret (paucity of structure) and more like the intertwined flora of
a tropical jungle.
<!-- And human mathematics typically cannot see the forest for the trees.
There are ways to talk about quantum mechanics without Hilbert spaces,
and particles without representation theory.
That does not mean that the corresponding patterns do not exist, but
rather, they can be described in other ways. --></p>
<p>The second issue is how accurate our descriptions must be.
We know that Newton’s laws are not exactly correct, and break down in
regimes far-removed from those of everyday experience, such as the
very small (where quantum mechanics applies) or the very fast (where
special relativity applies).
Does this mean we should stop believing in forces, or Lagrangians, or
Hamiltonians?
This is like the old Platonist quibble that there is no
such thing as a perfect circle in the real world, so we must be
reasoning about circles in some other realm.
In both cases, the pattern is only <em>approximately</em> realised in
nature, with bumps and fuzzy edges.
But approximation is itself subject to structural laws, exhibiting
patterns treated by mathematics (in, e.g., topology)
and physics (effective field theory).
Perhaps an even better example is statistics, which is literally all
about extracting structure from noisy realisations.
So structural approximations are clearly robust, lawlike and
explanatory, even if they are subtle.
Incidentally, this suggests another metalaw: patterns can stand in patterned
relations to other patterns.
<!-- This is also what emergence is all about! --></p>
<p>This ties back to our original question about the nature of physical
explanation.
Reductionism instructs us to boil things down to their smallest elements.
The Aristotelian view is that, really, we should be searching for
form and structure at whatever level they happen to occur.
This is not only the nature of emergence, but physics more broadly.
How else can we connect the study of the large-scale structure
of spacetime, quarks, bowling balls, planetesimals, or storm clouds?
Physicists almost never boil things down to their smallest elements!
Rather, it seems much more accurate to say that they look for patterns
“in the wild”.
(In contrast, mathematicians study patterns “in captivity”, which gives
them that air of artifice and pedigree.)</p>
<p>One upshot is that, for better or worse, physicists often wade into other
disciplines armed with the lasso of an Emergent Pattern to corral the apparent complexity.
See for
instance
<a href="https://www.penguinrandomhouse.com/books/314049/scale-by-geoffrey-west/">scaling laws</a>,
<a href="https://en.wikipedia.org/wiki/Self-organized_criticality">self-organised criticality</a>,
<a href="https://en.wikipedia.org/wiki/Small-world_network">small-world networks</a>,
and
<a href="https://www.englandlab.com/">thermodynamic explanations for life itself</a>.
They’re not always right (and they’re not always respectful), but
they are just doing their thang.</p>
<h4 id="conclusion">Conclusion</h4>
<p>I’ve argued that the nature of physical explanation is richer and less
boringly hierarchical than the reductionist would have us believe.
In order to explain the properties of shoes or particles, it seems not
only parsimonious but necessary to commit to the existence of
patterns in addition to the things which make those patterns up.
This not only jibes with (and ontologically grounds) the notion of
emergence, but also provides a handle on the metaphysics and
epistemology of mathematical explanation.
<!-- and its relation to the
physical world. -->
Put simply, mathematicians study patterns; physicists study natural patterns.
<!-- It tells us where math comes from, why it is unreasonably effective,
and to what extent it might be invented or non-unique.
Finally, I argued that none of this is spoiled by approximation, since
this is just another pattern. --></p>
<p>Clearly, I’ve left many questions unanswered.
Must patterns be instantiated in the physical world, and if not, where
do such patterns live?
What is the “mereology” that allows them to combine, or to recursively
describe their relationships?
And finally, what grounds the truth about patterns, in physics,
mathematics, or elsewhere?
Most of these I defer to Aristotle, though I hope to write more in future. <!-- I leave the systematic exploration of these questions to the future,-->
In the meantime, discussion and debate are welcome!</p>
<h4 id="acknowledgments-and-references">Acknowledgments and references</h4>
<p>I’d like to thank Leon Di Stefano for introducing me to Aristotelian
structuralism and many enriching conversations over the years.
His ideas <!-- (as articulated in
[this 2017 debate with James Fodor](https://www.youtube.com/watch?v=W0j25NteoXc))-->
inspired and informed this post.
I’ve also been heavily influenced by James
Franklin’s book,
<a href="https://link.springer.com/book/10.1057/9781137400734"><em>An Aristotelian realist view of mathematics</em></a>.
Aristotle himself writes with characteristic brevity on form and
matter in <a href="http://classics.mit.edu/Aristotle/physics.1.i.html"><em>Physics (i)</em></a>.
Finally, I fitfully consulted the SEP entries on
<a href="https://plato.stanford.edu/entries/scientific-reduction/">reductionism</a>
and
<a href="https://plato.stanford.edu/entries/structuralism-mathematics/">mathematical structuralism</a>.</p>
<hr />
<!-- quantamagazine.org/what-is-a-particle-20201112 -->
<!-- https://plato.stanford.edu/entries/scientific-reduction/-->
<!-- https://plato.stanford.edu/entries/structuralism-mathematics/ -->
<div class="footdef"><sup><a id="fn.1" name="fn.1" class="footnum" href="#fnr.1">Footnote 1</a></sup> <p class="footpara">
To be fair, as the quote suggests, the original eliminativists like Paul and
Patricia Churchland were much more interested in abolishing psychology than shoes.
</p></div>
<div class="footdef"><sup><a id="fn.2" name="fn.2" class="footnum" href="#fnr.2">Footnote 2</a></sup> <p class="footpara">
This is called <i>universality</i>, and can be explained using
renormalisation, the technical avatar of "zooming out".
</p></div>
<!--<div class="footdef"><sup><a id="fn.3" name="fn.3" class="footnum"
href="#fnr.3">Footnote 3</a></sup> <p class="footpara">
Particles can have other symmetries as well. An important class is
gauge symmetry, consisting of internal degrees of freedom.
, like a dial on a gauge. These gauge symmetries are crucial to formulating the
whole Standard Model, and explain, for instance, why an electron has -->
<!--charge. </p></div>-->David WakehamFebruary 8, 2021. Some philosophical reflections on the nature of scientific explanation, structure, emergence, and the unreasonable effectiveness of mathematics.Binomial party tricks2021-02-06T00:00:00+00:002021-02-06T00:00:00+00:00http://hapax.github.io/mathematics/physics/hacker/binomial<p><strong>February 6, 2021.</strong> <em>Sketchy hacker notes on the binomial
approximation. The flashy payoff: party trick arithmetic for estimating
roots in your head.</em></p>
<h4 id="introduction">Introduction</h4>
<p>The binomial approximation is the result that, for any real $\alpha$,
and $|x| \ll 1$,</p>
\[(1 + x)^\alpha \approx 1 + \alpha x.\]
<p>The usual proof involves calculus.
Here, we present a sketchy shortcut and an elementary longcut, neither
of which involves calculus, strictly speaking.
We also derive the quadratic term, and end with a fun party trick for finding roots.</p>
<h4 id="sketchy-shortcut">Sketchy shortcut</h4>
<p>We begin with the shortcut.
In an
<a href="https://hapax.github.io/maths/physics/hacks/exponential/">earlier post</a>,
I derived the following result for the exponential, and $|x| \ll 1$:</p>
\[e^x \approx 1 + x.\]
<p>Rather than go off and read the post, we can do even better and simply
<em>define</em> the exponential by this property.
If it’s true, then for any $r$, we can set $x = r/n$ for very large
$n$ to get</p>
\[e^r = (e^{r/n})^n \approx \left(1 + \frac{r}{n}\right)^n.\]
<p>In the limit of infinite $n$, the expression should be exact. And
indeed, this is the standard definition of $e^r$:</p>
\[e^r = \lim_{n\to\infty} \left(1 + \frac{r}{n}\right)^n.\]
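<p>As a quick sanity check, here is a minimal Python sketch of that limit.
The helper name <code>compound_power</code> and the test value $r = 1.5$ are
my own illustrative choices, not anything from the earlier post.</p>

```python
import math

def compound_power(r: float, n: int) -> float:
    """Return (1 + r/n)**n, the finite-n stand-in for e**r."""
    return (1.0 + r / n) ** n

r = 1.5  # arbitrary illustrative value
errors = []
for n in (10, 100, 1_000, 10_000):
    approx = compound_power(r, n)
    # Distance from the library value of e**r
    errors.append(abs(approx - math.exp(r)))
    print(f"n = {n:>6}: (1 + r/n)^n = {approx:.6f}, error = {errors[-1]:.2e}")
```

<p>The error should shrink steadily as $n$ grows, which is all the limit
statement asks for.</p>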
<p>Let’s proceed with a proof of the binomial approximation.
The natural logarithm is the inverse function, so that</p>
\[x = \log e^x \approx \log(1 + x).\]
<p>Recall that</p>
\[x^n = (e^{\log x})^n = e^{n\log x} \quad \Longrightarrow \quad \log x^n = n \log x.\]
<p>Thus, taking the logarithm of $(1 + x)^\alpha$, we have</p>
\[\log [(1+x)^\alpha] = \alpha \log (1+ x) \approx \alpha x,\]
<p>and hence</p>
\[(1+x)^\alpha \approx e^{\alpha x} \approx 1 + \alpha x.\]
<p>This works since all the corrections are at higher order in $x$.</p>
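<p>One way to see that the corrections really are higher order is to
shrink $x$ by a factor of ten and watch the error. A rough sketch, with
$\alpha = 0.3$ picked arbitrarily and function names that are purely
illustrative:</p>

```python
def binomial_first_order(x: float, alpha: float) -> float:
    """First-order binomial approximation 1 + alpha * x."""
    return 1.0 + alpha * x

def first_order_error(x: float, alpha: float) -> float:
    """Absolute gap between (1 + x)**alpha and its first-order estimate."""
    return abs((1.0 + x) ** alpha - binomial_first_order(x, alpha))

alpha = 0.3  # arbitrary illustrative exponent
for x in (0.1, 0.01, 0.001):
    print(f"x = {x}: error = {first_order_error(x, alpha):.3e}")
```

<p>If the leftover terms start at order $x^2$, each tenfold decrease in
$x$ should cut the error by roughly a factor of a hundred.</p>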
<h4 id="elementary-longcut">Elementary longcut</h4>
<p>This is a bit high brow, and we can get to the same conclusion using
simple algebra.
First note that, from the binomial theorem,</p>
\[(1 + x)^n = 1 + \binom{n}{1}x + \binom{n}{2}x^2 + \cdots + x^n \approx
1 + nx\]
<p>for $|x| \ll 1$, neglecting higher order terms which are much smaller.
So the binomial approximation is true for whole numbers $n$.
If we consider a fraction $q = m/n$, then $(1 + x)^q$ raised to the
power $n$ should equal</p>
\[(1 + x)^{qn} = (1 + x)^{m} \approx 1 + mx \tag{1}\label{m}\]
<p>by the binomial theorem.
Let’s assume</p>
\[(1 + x)^{q} \approx 1 + \beta x,\]
<p>with some higher order terms we can ignore.
Raising to the power $n$, we can use the binomial approximation for
$n$ to get</p>
\[(1 + x)^{qn} \approx (1 + \beta x)^n \approx 1 + \beta n x.\]
<p>Comparing to (\ref{m}), we find that $\beta = m/n$, and hence the
binomial approximation is true for positive rationals.
We can add negative powers using the geometric series:</p>
\[\frac{1}{1 - x} = 1 + x + x^2 + \cdots \approx 1 + x,\]
<p>Replacing $x$ with $-x$ gives $1/(1 + x) \approx 1 - x$, and hence for a negative rational $q = -m/n$,</p>
\[(1 + x)^q \approx (1 - x)^{m/n} \approx 1 - \frac{m}{n}x = 1 + qx,\]
<p>as required. Finally, there is arbitrary real $\alpha$. This is
actually trivial, in some sense.
Unlike whole numbers (repeated multiplication), fractions (roots), or
negative numbers (reciprocals), an irrational power has no obvious
interpretation. The most reasonable thing to do is define it as a
<em>limit</em> of rational powers that approximate it:</p>
\[(1 + x)^r = \lim_{n \to \infty} (1 + x)^{q_n},\]
<p>where $q_n$ is a sequence of rational numbers (e.g. the decimal
expansion) approximating $r$.
In this case, the binomial approximation gives</p>
\[(1 + x)^r = \lim_{n \to \infty} (1 + x)^{q_n} \approx 1 + x \lim_{n
\to \infty} q_n = 1 + rx,\]
<p>and so the result holds for all real numbers.</p>
<h4 id="higher-terms">Higher terms</h4>
<p>It’s possible, if messy, to extend these methods to determine the next
term in the approximation.
We’ll do the longcut, and use big-O notation, with $O(x^3)$ in this
context meaning “terms with powers of $x^3$ or higher”.
The binomial theorem gives</p>
\[(1 + x)^n = 1 + nx + \frac{n(n-1)}{2} x^2 + O(x^3), \tag{2} \label{second}\]
<p>since the coefficient of the $x^2$ term is the number of ways of
choosing $2$ items (the $x$ terms) from $n$ items (the factors in the power).
For a rational $q = m/n$, we have</p>
\[(1 + x)^{qn} = (1 + x)^m = 1 + mx + \frac{m(m-1)}{2} x^2 + O(x^3),\]
<p>and if we assume</p>
\[(1 + x)^{q} = 1 + qx + \gamma x^2 + O(x^3),\]
<p>then the binomial theorem again gives</p>
\[(1 + x)^{qn} = \left[1 + qx + \gamma x^2 + O(x^3)\right]^n = 1 + nqx +
\left[n\gamma + \frac{n(n-1)}{2}q^2 \right]x^2 + O(x^3).\]
<p>The coefficient of the linear term $nq = m$ matches, but the quadratic
term requires more work. Comparing to the expansion of $(1 + x)^m$ above and
rearranging for $\gamma$, we have</p>
\[\begin{align*}
\gamma & = \frac{1}{n}\left[\frac{m(m-1)}{2}- \frac{n(n-1)}{2}q^2\right]
=\frac{m(m-1)}{2n}- \frac{m^2(n-1)}{2n^2}
=\frac{q(q - 1)}{2}.
\end{align*}\]
<p>Thus, we find that to second order,</p>
\[(1 + x)^q = 1 + qx + \frac{q(q-1)}{2} x^2 + O(x^3)\]
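<p>The algebra for $\gamma$ is easy to fumble, so here is a small exact
check using Python’s <code>fractions</code> module. It raises
$1 + qx + \gamma x^2$ to the power $n$, throws away everything at $O(x^3)$,
and compares with the truncated expansion of $(1 + x)^m$. The helper names
are hypothetical, just for this sketch.</p>

```python
from fractions import Fraction
from math import comb

def truncated_power(coeffs, n):
    """Raise c0 + c1*x + c2*x^2 to the n-th power, dropping O(x^3) terms."""
    result = (Fraction(1), Fraction(0), Fraction(0))
    b0, b1, b2 = coeffs
    for _ in range(n):
        a0, a1, a2 = result
        # Multiply two quadratics and keep only powers up to x^2
        result = (a0 * b0, a0 * b1 + a1 * b0, a0 * b2 + a1 * b1 + a2 * b0)
    return result

def check_gamma(m: int, n: int) -> bool:
    """Check that gamma = q(q-1)/2 reproduces (1 + x)^m to second order."""
    q = Fraction(m, n)
    gamma = q * (q - 1) / 2
    lhs = truncated_power((Fraction(1), q, gamma), n)
    rhs = (Fraction(1), Fraction(m), Fraction(comb(m, 2)))
    return lhs == rhs

print(check_gamma(3, 7), check_gamma(5, 3))
```

<p>Because everything is done with exact rationals, a match here is an
identity for those particular $m$ and $n$, not a floating-point coincidence.</p>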
<p>The extension to real and negative powers is easy. The extension to
higher terms in $x$ is not.
They obey something called the binomial series,</p>
\[(1 + x)^\alpha = \sum_{k = 0}^\infty \frac{\alpha(\alpha - 1)\cdots
(\alpha-k +1)}{k!} x^k,\]
<p>and I have no idea how to get this without calculus.
(One can use “analytic continuation” but this feels too much like
cheating to me, partly because it’s not clear why this continuation is
unique.)
Any tips appreciated!</p>
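<p>This does nothing to derive the series without calculus, but it is at
least simple to evaluate numerically. A minimal sketch, assuming
$|x| &lt; 1$ so the partial sums settle down, with a function name of my
own choosing:</p>

```python
def binomial_series(x: float, alpha: float, terms: int) -> float:
    """Sum the first `terms` terms of the binomial series for (1 + x)**alpha."""
    total = 0.0
    coeff = 1.0  # coefficient of x^0
    for k in range(terms):
        total += coeff * x ** k
        # Next coefficient: multiply by (alpha - k) / (k + 1)
        coeff *= (alpha - k) / (k + 1)
    return total

for terms in (2, 3, 5, 10):
    print(terms, binomial_series(0.2, 0.5, terms), 1.2 ** 0.5)
```

<p>Two terms recovers the binomial approximation, three terms adds the
quadratic correction found above, and for whole-number $\alpha$ the
coefficients vanish after $k = \alpha$, so the sum terminates as the binomial
theorem says it should.</p>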
<h4 id="rooting-out-the-answer">Rooting out the answer</h4>
<p>The applications are many and various, but the simplest thing we can
try is quickly calculating powers $y^\alpha$.
The general trick is to find a power near $y$ that is simpler to
evaluate, factor out the simple answer, then use the binomial
approximation.
I think there are actually better ways to estimate positive powers,
but the binomial approximation really shines in the estimation
of roots.
It can even be a good party trick, depending on the kind of parties
you go to!</p>
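<p>The general trick can be sketched in a few lines of Python. The function
<code>nth_root_estimate</code> and its arguments are hypothetical names for
this illustration: you supply a whole number whose $n$th power is close to
$y$, and the binomial approximation does the rest.</p>

```python
def nth_root_estimate(y: float, nearby_root: float, n: int, order: int = 1) -> float:
    """Estimate y**(1/n) given nearby_root with nearby_root**n close to y."""
    base = nearby_root ** n        # the easy power near y
    x = y / base - 1.0             # so that y = base * (1 + x), with x small
    alpha = 1.0 / n
    correction = 1.0 + alpha * x   # first-order binomial approximation
    if order >= 2:
        correction += alpha * (alpha - 1.0) / 2.0 * x ** 2
    return nearby_root * correction

# Square root of 10, using 3**2 = 9 as the nearby power
print(nth_root_estimate(10, 3, 2), 10 ** 0.5)
# Cube root of 30, using 3**3 = 27 as the nearby power
print(nth_root_estimate(30, 3, 3), 30 ** (1 / 3))
```

<p>The closer the easy power is to $y$, the smaller $x$ is, and the better
the estimate.</p>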
<p>Suppose someone asks you to find the square root of $8$.
You look for a nearby perfect square, in this case $9$, then factor
eight into $9$ times one minus something small:</p>
\[\sqrt{8} = \sqrt{9\left(1 - \frac{1}{9}\right)} = 3 \left(1 - \frac{1}{9}\right)^{1/2}.\]
<p>We can take $\alpha = 1/2$ and $x = -1/9$ in the binomial
approximation, and see how we go, noting that</p>
\[\sqrt{1 - x} = 1 - \frac{1}{2}x - \frac{1}{8}x^2 + O(x^3).\]
<p>To first order, we get</p>
\[3 \left(1 - \frac{1}{9}\right)^{1/2} \approx 3\left[1 - \frac{1}{2} \cdot \frac{1}{9}\right]
= \frac{17}{6} \approx 2.83.\]
<p>To second order,</p>
\[3 \left(1 - \frac{1}{9}\right)^{1/2} \approx
3\left[1 - \frac{1}{2} \cdot \frac{1}{9} - \frac{1}{8} \cdot \frac{1}{9^2}\right]
= \frac{611}{216} \approx 2.829.\]
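<p>The two fractions above can be reproduced in exact arithmetic. This
sketch squares each estimate to see how close it lands to $8$; the variable
names are only for illustration.</p>

```python
from fractions import Fraction

x = Fraction(1, 9)

# First- and second-order estimates of sqrt(8) = 3 * sqrt(1 - 1/9)
first = 3 * (1 - x / 2)
second = 3 * (1 - x / 2 - x ** 2 / 8)

print(first, second)
# Squaring back shows how far each estimate is from 8
print(float(first ** 2), float(second ** 2))
```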
<p>The actual answer is $\sqrt{8} \approx 2.828$, so even the first term in the
binomial approximation is very good! We’ll finish with a somewhat more
involved example.
Let’s approximate the fifth root of six, $6^{1/5}$.
I only know one fifth power off the top of my head, $2^5 = 32$, and
this happens to be near $6^2 = 36$.
We can chain these observations together as follows:</p>
\[\begin{align*}
6^{1/5} = 36^{1/10} = 32^{1/10}\left(1 + \frac{1}{8}\right)^{1/10} & =\sqrt{2}\left(1 + \frac{1}{8}\right)^{1/10} \approx \sqrt{2} \cdot \left(1 + \frac{1}{10\cdot 8}\right).
\end{align*}\]
<p>At this point, we could separately approximate $\sqrt{2}$, but I
happen to know it’s about $1.414$, so I can divide by $80$ (or even
just $100$ for a quick mental estimate), and add them together to get</p>
\[\sqrt[5]{6} \approx 1.414 + \frac{1.414}{80} \approx 1.43.\]
<p>Consulting a calculator, this is correct to two decimal places!
With the power of the binomial approximation, you can do it in your head.</p>