<p><strong>Indescribably boring numbers</strong>, by David A Wakeham</p>
<p><strong>March 23, 2021.</strong> <em>I turn the old joke about interesting numbers into a
proof that most real numbers are indescribably boring. In turn, this implies
that there is no explicit well-ordering of the reals. The axiom of
choice, however, implies all are relatively interesting.</em></p>
<h4 id="introduction">Introduction</h4>
<p>It’s a
<a href="https://en.wikipedia.org/wiki/Interesting_number_paradox">running joke</a>
among mathematicians that there are no boring numbers. Here’s the
proof. Let $B$ be the set of boring numbers, and suppose for a
contradiction it is non-empty. Define $b = \min B$ as
the smallest boring number. Since this is a highly unusual property, $b$ is
interesting after all!
Joke it may be, but there is a sting in the tail. By thinking
about how the joke works, we will be led to some rather deep (and
perhaps disturbing) insights into set theory and what it can and
cannot tell us about the mathematical world.</p>
<h4 id="integers-and-rationals-are-interesting">Integers and rationals are interesting</h4>
<p>The joke implicitly uses the fact that “numbers” refers to “whole numbers”</p>
\[\mathbb{N} = \{0, 1, 2, 3, \ldots\}.\]
<p>If it didn’t, then the <em>minimum</em> we used to get our contradiction
wouldn’t always work!
For instance, say we work with the integers</p>
\[\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}.\]
<p>The set of boring integers $B_\mathbb{Z}$ may be unbounded below.
Does this cause a problem? Not really. We can just define the smallest
boring number as the smallest element minimising the <em>absolute value</em>, i.e.</p>
\[b = \min \text{argmin}_{k\in B_\mathbb{Z}} |k|.\]
<p>(The $\text{argmin}$ might actually give us two numbers, $\pm b$, so the negative one
is the smallest.) Thus, there are no boring integers.
What about boring rational numbers?
This is somewhat more elaborate, but if $B_\mathbb{Q}$ is the set of
boring rationals, we can define the “smallest” boring number as</p>
\[b = \min \text{argmin}_{p/q\in B_\mathbb{Q}} (|p| + |q|),\]
<p>where $p/q$ is a fraction in lowest terms.
Once again, there may be multiple minimisers of $|p| + |q|$, but only
a finite number, so we can choose the smallest.
We conclude there are no boring rationals.
This pattern suggests there are no boring real numbers.
We should be able to find some function with a finite number of
minima, and then choose the smallest, right?
I’m going to argue that no such function can ever be described. Then I’m
going to explain why it might exist anyway, depending on which axioms of set theory we use!</p>
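<p>As a rough illustration of the "min argmin" recipe for integers and rationals, here is a minimal Python sketch. The sample sets of "boring" numbers are made up for the example, and the helper names <code>integer_key</code>, <code>rational_key</code> and <code>smallest</code> are my own labels rather than anything standard.</p>

```python
from fractions import Fraction

def integer_key(k):
    # Sort by absolute value first, then by value, so -1 comes before 1.
    return (abs(k), k)

def rational_key(q):
    # Fraction stores q in lowest terms with a positive denominator.
    return (abs(q.numerator) + q.denominator, q)

def smallest(candidates, key):
    # "min argmin": minimise the size measure, then break ties by value.
    return min(candidates, key=key)

# Hypothetical finite samples standing in for the boring sets.
boring_integers = [7, -3, 3, -12]
boring_rationals = [Fraction(2, 1), Fraction(-1, 2), Fraction(5, 3)]

print(smallest(boring_integers, integer_key))     # -3
print(smallest(boring_rationals, rational_key))   # -1/2
```

<p>Sorting a finite range with the same keys reproduces the start of the orderings, for example <code>sorted(range(-3, 4), key=integer_key)</code>.</p>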
<h4 id="most-real-numbers-are-boring">Most real numbers are boring</h4>
<!-- https://en.wikipedia.org/wiki/Definable_real_number -->
<p>“Boring” and “interesting” are subjective.
We’ll use something a tad more well-defined, and replace
“interesting” with <em>describable</em>.
A number is describable if it has some finite description, using
words, mathematical symbols, even a computer program, which uniquely singles out that number.
For instance, $\sqrt{2}$ is the positive solution of $x^2 = 2$, $\pi$
is the ratio of a circle’s circumference to its diameter, and $e$ is
the limit</p>
\[e = \lim_{n\to\infty} \left(1 + \frac{1}{n}\right)^n.\]
<p>It turns out that <em>almost every</em> real number is indescribable, or
“boring”, in our official translation of that term.
The argument is very simple, and proceeds by simply counting the
number of finite descriptions.
Each such description consists of a finite sequence of symbols
(letters, mathematical squiggles, algorithmic instructions), each of
which is an element of some very large alphabet of symbols.
For instance, the text</p>
\[\sqrt{2} \text{ is the positive solution of $x^2 = 2$.}\]
<p>can be converted into <a href="http://www.tamasoft.co.jp/en/general-info/unicode-decimal.html">(decimal) unicode</a> as</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>8730 50 32 105 115 32 116 104 101 32 112 111 115 105 116 105 118 101
32 115 111 108 117 116 105 111 110 32 111 102 32 120 94 50 61 50 46
</code></pre></div></div>
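<p>For the curious, the conversion above can be reproduced with a few lines of Python. This is only a sketch: the function name <code>unicode_points</code> is my own, and the description is typed with no spaces inside $x^2=2$ to match the listing.</p>

```python
def unicode_points(text):
    # ord() gives the decimal Unicode code point of each character.
    return [ord(ch) for ch in text]

# "\u221a" is the square root sign.
description = "\u221a2 is the positive solution of x^2=2."
print(" ".join(str(n) for n in unicode_points(description)))
```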
<p>Imagine some “super unicode” which lets us convert <em>any</em> symbol
into a number.
The super unicode alphabet may be arbitrarily large, so we will take it to
consist of <em>every</em> natural number $\mathbb{N}$.
Then a finite description using any symbols can be written as a sequence of
the corresponding natural numbers, a trick I will call “unicoding”.
To find the number of finite descriptions, we just count the sequences!
There is a nice scheme for showing that these are in one-to-one
correspondence with the natural numbers themselves, and hence
<em>countably infinite</em>.
We take a sequence, say</p>
\[(6, 2, 0, 5)\]
<p>and convert the first bracket and all commas into $1$s, and each number into
the corresponding number of $0$s:</p>
\[10000001001100000_2.\]
<p>In turn, this can be converted to decimal, $66144$.
Going in the other direction, any whole number can be written in
binary and then converted into a sequence:</p>
\[14265092 = 110110011010101100000100_2\]
<p>becomes $(0,1,0,2,0,1,1,1,0,5,2)$.
Thus, we have a simple, explicit correspondence between finite
sequences of natural numbers and the natural numbers themselves.
This basically completes the proof, for the simple reason that there
are <em>infinitely more</em> real numbers than there are natural numbers.
This is established by Cantor’s beautiful
<a href="https://en.wikipedia.org/wiki/Cantor%27s_diagonal_argument">diagonal argument</a>,
which I won’t repeat here.
The upshot is that, via unicoding and then the binary
correspondence, finite descriptions can only capture an
infinitesimally small fragment of the real numbers.
Most literally cannot be talked about.</p>
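<p>The binary correspondence between sequences and whole numbers is easy to try out. Below is a minimal Python sketch, with <code>encode</code> and <code>decode</code> as names of my own choosing; it assumes nonempty sequences of natural numbers and positive integers respectively.</p>

```python
def encode(seq):
    # Each entry n becomes a '1' marker followed by n zeros.
    # Assumes seq is a nonempty sequence of natural numbers.
    bits = "".join("1" + "0" * n for n in seq)
    return int(bits, 2)

def decode(number):
    # Assumes number is a positive integer.
    # Drop the leading '1', then split on the remaining markers;
    # the lengths of the runs of zeros are the entries.
    bits = bin(number)[2:]
    return tuple(len(run) for run in bits[1:].split("1"))

print(encode((6, 2, 0, 5)))   # 66144
print(decode(14265092))       # (0, 1, 0, 2, 0, 1, 1, 1, 0, 5, 2)
```

<p>Running <code>decode(encode(seq))</code> returns the original sequence, which is the sense in which the correspondence is one-to-one.</p>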
<!-- So, we conclude that most real numbers are boring. -->
<p>The set $B_\mathbb{R}$ includes almost every real number, though
by definition <em>none</em> of the real numbers you can actually single out.
But, armed with our previous jokes, it’s tempting to think that we can
waltz in and make the same joke about $\mathbb{R}$, simply
plucking out the smallest element of $B_\mathbb{R}$.
Of course, that won’t quite work, because the set need not be bounded
below. So instead, suppose there is some explicit function $f$ such
that $b \in B_\mathbb{R}$ is the smallest minimizer of $f$, i.e.</p>
\[b = \min \text{argmin}_{x \in B_\mathbb{R}} f(x).\]
<p>If I knew $f$ explicitly, we’d have a description of $b$ after all. Contradiction!
But the contradiction here does not imply $B_\mathbb{R}$ is
empty. After all, most of $\mathbb{R}$ is indescribable for
simple set-theoretic reasons.
Instead, it means that there <em>cannot be any explicit function</em>
$f$. More generally, there cannot be any explicit rule which, given a
nonempty subset of $\mathbb{R}$, picks out a unique element of it. If there
were, we could apply it to $B_\mathbb{R}$ and get the same
contradiction.
(See Appendix A for discussion of the related <a href="https://en.wikipedia.org/wiki/Berry_paradox">Berry paradox</a>.)</p>
<h4 id="an-existential-aside">An existential aside</h4>
<p>There’s a loophole here. Our argument doesn’t establish that
$f$ doesn’t exist, just that it has no finite description. And
although it might seem weird to trust in the existence of something
that we can’t really talk about, we do just this with the real
numbers!
I believe in all the real numbers, even the ones I can never describe.
Is this reasonable?
It depends who you ask.
There is a philosophy of mathematics called
<a href="https://plato.stanford.edu/entries/intuitionism/">intuitionism</a> which
tells us that mathematics is a human invention, and therefore enjoins
us to only reason about the things we can construct ourselves. No
indescribable real numbers if you please!</p>
<p>I’m not sure about this “mathematical creationism”, and think there
are more things in the mathematical heavens than are dreamt of in
our finite human philosophy.
Why should human limitations be mathematical ones?
That said, it’s not the case that anything goes. We should have some
firm basis for believing in the existence of those things we can’t
discuss, and for the real numbers, the firm basis is drawing a
continuous line on a piece of paper, or thinking about infinite
decimal expansions. These are <em>models</em> of the real numbers,
concrete-ish objects which capture the essence of the abstract entity
$\mathbb{R}$. They convince us (or at least me) that there is nothing
magical stopping someone from drawing certain points on the line, or
continuing certain expansions forever.</p>
<p>Similarly, the indescribable things we would like to exist and reason
about in set theory might depend on our <em>models</em> of set theory!
I won’t get into the specifics, but an important point is there are
<em>many different models</em> of set theory, with different properties, and
it seems unlikely that any one model is right.
These properties are abstracted into <em>axioms</em>, formal rules about what
exists and what you can or can’t do with sets.
Because models of set theory are deep, highly technical constructions,
most of the time we go the other way round, and play around with
axioms instead. Only later do we go away and find models which support
certain sorts of behaviour.
The point of all this is to make it a bit less counterintuitive when I
say that the existence and properties of boring numbers depend on which axioms
we decide to use.</p>
<h4 id="all-real-numbers-are-relatively-interesting">All real numbers are relatively interesting</h4>
<p>So, let’s return to our problem of boring real numbers.
We argued there was no explicit, finitely describable rule for picking
an element out of $B_\mathbb{R}$.
But we can always make the <em>existence</em> of such a rule — describable
or not — an axiom of our theory!
There are two ways to go about doing this.
Note that in the first example of boring natural numbers, we used the
<em>minimum</em> of the set.
We had to be a bit more clever with the integers and rationals, but it
essentially boiled down to creating a special sort of <em>ordering</em> on
the set, so that any subset (including the boring numbers) has a
<em>smallest element</em>.
We wrote this in a complicated way as</p>
\[b = \min \text{argmin}_{x \in B} f(x)\]
<p>for some function $f$, but we could just as well write</p>
\[b = \min_{\mathcal{W}} B,\]
<p>where $\mathcal{W}$ denotes this ordering on the big set.
To be clear, for the integers it is</p>
\[0, -1, 1, -2, 2, -3, 3, \ldots\]
<p>and for the rationals it is</p>
\[0, -\frac{1}{1}, \frac{1}{1}, -\frac{2}{1}, -\frac{1}{2}, \frac{1}{2},
\frac{2}{1}, \ldots.\]
<p>This is called a <em>well-ordering</em>. Although it may not be describable,
we could simply require, as an axiom of set theory, that any set can
be well-ordered! More explicitly,</p>
<p><span style="padding-left: 20px; display:block">
Any set $A$ has a well-ordering $\mathcal{W}_A$ such that any nonempty subset
of $A$ has a unique minimum element with respect to $\mathcal{W}_A$.
</span></p>
<p>Although it doesn’t spoil our conclusion that most real numbers are
boring, such an axiom would allow us to turn the old joke into an
argument that all real numbers are <em>relatively interesting</em>, where
“relatively interesting” means that there is a finite description
where we are allowed to use the well-ordering $\mathcal{W}$.
The proof goes as you might expect: let $B^{\mathcal{W}}_\mathbb{R}$ be the set of relatively boring
numbers, i.e. numbers with no finite explicit description, even when
allowed to use the well-ordering $\mathcal{W}$.
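If this set is nonempty, then since $\mathcal{W}$ is a well-ordering, we can define</p>
\[b = \min_{\mathcal{W}} B^{\mathcal{W}}_\mathbb{R}.\]
<p>But this is a finite description of $b$ using $\mathcal{W}$, so $b$ is
relatively interesting after all, and the set must be empty. End of proof!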
So, although most real numbers are strictly boring, with a
well-ordering all of them are relatively interesting.</p>
<h4 id="choosing-an-order">Choosing an order</h4>
<p>Well-ordering is not usually treated as an axiom.
Historically, set theorists prefer to use a simpler rule called the
<em>axiom of choice</em>, which is logically equivalent, as we will argue
informally in a moment, but somehow less suspect.
As Jerry Bona joked,</p>
<p><span style="padding-left: 20px; display:block">
The axiom of choice is obviously true and the well-ordering principle
obviously false.
</span></p>
<p>(Actually, Bona’s joke mentions a third equivalent form called <em>Zorn’s
lemma</em>, but it would confuse matters too much to explain.)
Loosely, the axiom of choice just says we can pick an element from a
non-empty set. Pretty reasonable huh? If a set is nonempty, it has an element, so
we can pluck one out.
In fact, it’s usually stated in terms of a <em>family</em> of sets $A_i$,
where the subscript $i$ ranges over some indexing set $I$:</p>
<p><span style="padding-left: 20px; display:block">
Given a family of nonempty sets $A_i$, $i \in I$, we can collect a
representative from each set, labelled $f_i \in A_i$.
</span></p>
<p>The well-ordering principle implies the axiom of choice, since I can
just take the union of all the sets $A_i$, well-order it with
$\mathcal{W}$, and then define $f_i = \min_{\mathcal{W}} A_i$.
That’s my set of representatives!
The other way round is conceptually straightforward.
To well-order a set $A = A_0$, start by choosing an element $f_0 \in
A_0$ by the axiom of choice. Then remove it to define a new set $A_1 =
A_0 - \{f_0\}$, and select another element $f_1 \in A_1$. Continue in
this way, at each stage simply deleting the element from the previous
stage and choosing a new one, using</p>
\[A_{n+1} = A_n - \{f_n\} = A_{n-1} - \{f_n, f_{n-1}\} = \cdots = A_0 -
\{f_i : i \leq n\}\]
<p>as long as the set is nonempty.
The well-ordering is simply the elements in the order we made the
choice:</p>
\[\mathcal{W}_A = \{f_0, f_1, f_2, \ldots \} = \{f_n \in A_n : A_n \neq \varnothing\}.\]
<p>There are two issues with this construction.
The first is that it might feel sketchy to use the axiom of
choice “as we go” to build the sets, rather than starting with a
pre-defined family. But no one said this wasn’t allowed!
Second, our method only seems to work for sets at most as large as the
natural numbers, since we indexed elements with $n \in \mathbb{N}$.
But we can extend it to an <em>arbitrary</em> set using a generalisation of
natural numbers called
<a href="https://en.wikipedia.org/wiki/Ordinal_number">ordinals</a>.
We loosely sketch how this is done in Appendix B.
Once the dust settles, we find that the axiom of choice is equivalent
to well-ordering.</p>
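<p>For a finite set, the "choose, remove, repeat" construction can be run directly. The Python sketch below is only a toy: the choice function <code>choose</code> is an explicit rule I made up for the example, which is exactly what the axiom of choice does not promise in general.</p>

```python
def well_order(elements, choose):
    # Repeatedly choose an element from what remains and remove it,
    # mirroring A_{n+1} = A_n - {f_n}.
    remaining = set(elements)
    order = []
    while remaining:
        f = choose(remaining)
        order.append(f)
        remaining.remove(f)
    return order

def least(order, subset):
    # The minimum of a nonempty subset is whichever member was chosen first.
    return next(x for x in order if x in subset)

def choose(s):
    # A hypothetical choice function: shortest string first, then alphabetical.
    return min(s, key=lambda w: (len(w), w))

order = well_order({"pi", "e", "sqrt2", "phi"}, choose)
print(order)                            # ['e', 'pi', 'phi', 'sqrt2']
print(least(order, {"sqrt2", "phi"}))   # phi
```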
<h4 id="conclusion">Conclusion</h4>
<p>The overarching theme of this post is how much mileage we can get
from a bad joke.
The answer: quite a lot!
We learned not only that there are no boring integers and rational
numbers, but via a simple counting argument, that the vast majority of
real numbers are indescribably boring.
This in turn implies there is no explicit way to well-order the reals.
On the other hand, by giving ourselves the ability (via the axiom of
choice) to pluck elements at will from non-empty sets, we are able to
supply the reals with a well-ordering. So, all reals are relatively
interesting, even if we can’t talk about them.</p>
<h4 id="acknowledgments">Acknowledgments</h4>
<p>As usual, thanks to J.A. for the discussion which led to this
post, and also for proposing an elegant mapping analogous to unicoding.</p>
<h4 id="appendix-a-the-berry-paradox">Appendix A: the Berry paradox</h4>
<p>Consider the phrase</p>
<p><span style="padding-left: 20px; display:block">
The smallest real number with no finite, explicit description.
</span></p>
<p>If “smallest” refers to an explicitly definable well-ordering of the
reals, then this would seem to pick out a unique number with a finite,
explicit description. Contradiction!
We used this to argue no explicit well-ordering exists.
But let’s compare this to the
<a href="https://en.wikipedia.org/wiki/Berry_paradox">Berry paradox</a>, which
asks us to consider the phrase</p>
<p><span style="padding-left: 20px; display:block">
The smallest positive integer not definable in under sixty letters.
</span></p>
<p>This phrase clocks in at under sixty letters, and would seem to define a
number.
Contradiction!
Since “smallest” here makes perfect sense (we are dealing with positive
integers), to resolve the Berry paradox, we must assume either (a)
there is no set $B$ of numbers not definable in under sixty letters,
analogous to the original boring number joke, or (b) Berry’s phrase
somehow fails to define a number.
The most popular solution seems to be (b), on the grounds that
referring to the set makes it some kind of “meta-definition”, rather
than a definition per se.</p>
<p>Of course, this seems to be committed to a very specific notion of
“definition”, but the problem persists if we replace “definable” with
“meta-definable”, since the smallest non-meta-definable number is
really a meta-meta-definition.
Let $B^{(0)}$ be the set of numbers not definable in under sixty letters,
$B^{(1)}$ the numbers not meta-definable in under $70$ letters, and in
general, $B^{(n)}$ the numbers not meta${}^{(n)}$-definable in under
$60+10n$ letters.
We call any number in the <em>union</em> of all these sets $\mathcal{B} =
\cup_{n\geq0} B^{(n)}$ “lim-definable”.
This is closed under the operation of going meta.
Now consider the phrase</p>
<p><span style="padding-left: 20px; display:block">
The smallest positive integer not finitely lim-definable.
</span></p>
<p>Since lim-definability is closed under going meta, as is “finite”,
this is <em>now a definition at the same level</em>.
Option (b) is no longer available to us, so only option (a)
remains, and it follows that, like the joke that began it all, <em>all
positive integers are finitely lim-definable</em>.
This is of course obviously true.</p>
<p>Our argument against an explicit well-ordering is very closely related
to the Berry paradox.
The point of considering lim-definability is that we can build the same
descriptive hierarchy for the real numbers, take the union, and rule
out option (b). This leaves two ways to avoid a contradiction: no
lim-definable ordering exists (involving some finite but unbounded
number of references to sets in the hierarchy), or like the Berry
paradox, every real is lim-definable.
But unlike the positive integers, we know from set theory that the
second option can’t be true!
We still have a countable number of lim-definitions, as we can argue
from unicoding.
So there must be no lim-definable ordering of the reals, and no
explicit well-ordering in particular.</p>
<h4 id="appendix-b-ordinals-and-the-axiom-of-choice">Appendix B: ordinals and the axiom of choice</h4>
<p>Ordinals are <em>sets</em> which we use to stand in for numbers.
The smallest ordinal is $0$, which is defined as the empty set
$\varnothing = \{\}$.
Each ordinal $\alpha$ has a unique successor $\alpha + 1$, defined by
simply adding $\alpha$ itself as a new element of $\alpha$:</p>
\[\alpha + 1 = \alpha \cup \{\alpha\}.\]
<p>To illustrate, we apply the successor operation to $0 = \varnothing$ a
few times:</p>
\[1 = 0 + 1 = \{0\}, \quad 2 = 1 + 1 = \{0,
1\}, \quad 3 = 2 + 1 = \{0, 1, 2\}.\]
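<p>The finite ordinals can be modelled with Python's <code>frozenset</code>. This is a small sketch, assuming the successor of $\alpha$ collects the elements of $\alpha$ together with $\alpha$ itself, which is what the examples $1 = \{0\}$, $2 = \{0, 1\}$, $3 = \{0, 1, 2\}$ amount to. The function name <code>successor</code> is my own.</p>

```python
def successor(alpha):
    # alpha + 1 is alpha together with one extra element, alpha itself.
    return alpha | frozenset([alpha])

zero = frozenset()
one = successor(zero)
two = successor(one)
three = successor(two)

print(three == frozenset([zero, one, two]))   # True
print(len(three))                              # 3
print(two in three and one in two)             # True
```

<p>Each finite ordinal built this way has exactly as many elements as the number it stands for, and smaller ordinals are members of larger ones.</p>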
<p>Going on in this way gives us all the finite ordinals, but there are
also <em>infinite</em> ordinals. The smallest infinite ordinal, conventionally
denoted $\omega$, can be identified with the natural numbers:</p>
\[\omega = \{0, 1, 2, 3, 4, \ldots\}.\]
<p>It is called a <em>limit</em> ordinal since it is not the successor of any
other ordinal. It is bigger than all the finite ones, $n <
\omega$. The successor is defined as before,</p>
\[\omega + 1 = \omega \cup \{\omega\} = \{0, 1, 2, \ldots, \omega\},\]
<p>thereby giving a precise meaning to “infinity plus one”!
We won’t say more about the structure of these ordinals. The main
point is that we can always “count” the elements in a set $A$ using
ordinals, no matter how big it is.
Let’s now return to the problem of proving the axiom of choice
implies that any set $A$ can be well-ordered.
The basic idea is to start with $0$, but keep on counting up “past
infinity”, defining</p>
\[A_\alpha = A_0 - \{f_\beta : \beta < \alpha\}\]
<p>for any ordinal $\alpha$. The resulting set of
representatives, labelled by ordinals, is</p>
\[\mathcal{W}_A = \{f_\alpha \in A_\alpha: A_\alpha \neq
\varnothing\},\]
<p>with $f_\alpha < f_\beta$ just in case the ordinals $\alpha < \beta$.
This is a well-ordering since the ordinals are themselves
well-ordered.
Now, we’ve skipped many important technical details, but the main
point was that the argument looks pretty similar to the previous one!</p>
<!-- You may wonder if the contradiction here is coming from ambiguity in
the notion of "explicit describability".
That this can cause deep problems is illustrated by the
[Berry paradox](https://en.wikipedia.org/wiki/Berry_paradox), which
asks us to consider the following:
<span style="padding-left: 20px; display:block">
The smallest positive integer not definable in under sixty letters.
</span>
If $B_{60}$ is the set of positive integers not definable in under
sixty letters, it seems we have just defined its smallest elements in
fifty seven! This too is a contradiction. Many people try to resolve
this by arguing that it does not constitute a "definition"; I think it
is much simpler to following the boring number argument, and conclude
that $B_{60}$ doesn't exist. -->
<p><strong>Taking half a derivative</strong></p>
<p><strong>March 13, 2021.</strong> <em>Can you take half a derivative? Or π derivatives?
Or even √–1 derivatives? It turns out the answer is yes, and there are
two simple but apparently different ways to do it. I
show that one implies the other!</em></p>
<h4 id="introduction">Introduction</h4>
<p>In calculus, the regular derivative is defined as the local gradient
of a function:</p>
\[f'(x) = \frac{d}{dx} f(x) = \lim_{h\to 0}\frac{f(x+h)-f(x)}{h}.\]
<p>We will abbreviate this as $f' = Df$, understanding that $f$ is a function
of $x$ and $D$ differentiates with respect to $x$.
We can always differentiate again, and again, and in fact as many
times as we want. Using our new notation, we can write the $n$th
derivative as</p>
\[D (D \cdots (Df)) = D^n f.\]
<p>This is well-defined as long as $n$ is a whole number.
But what if we could consider other types of derivatives, say half a
derivative? Let’s call this $D^{1/2} = \sqrt{D}$. In the same way that
applying two ordinary derivatives gives the second derivative, it seems reasonable to hope that two half derivatives give
a full derivative:</p>
\[f' = \sqrt{D} \sqrt{D}f = Df \quad \Longrightarrow \quad \sqrt{D}
\cdot \sqrt{D} = D.\]
<p>What could half a derivative look like?</p>
<h4 id="to-be-continued">To be continued</h4>
<p>The easiest way to go about this is to use a trick called <em>analytic
continuation</em>.
This has a precise meaning in complex analysis, and we’re going to do
something similar in spirit, but not quite as rigorous.
The basic idea is to find some nice, specific function we can
differentiate $n$ times, and which happens to give us a nice answer in terms of $n$.
We then define the <em>fractional derivative</em> $D^\alpha$ acting on this
function by replacing $n$ with $\alpha$.
A sanity check will be that, for general $\alpha, \beta$, the
fractional derivatives obey</p>
\[D^\alpha \cdot D^\beta = D^{\alpha+\beta},\]
<p>so, e.g., two half-derivatives give a full derivative,
$\sqrt{D}\cdot \sqrt{D} = D$.
We call this property <em>multiplicativity</em> after the identical-looking
rule for indices.
There are two issues with this approach.
First, how do we extend the definition to general functions?
And second, are the definitions for different functions in agreement?
In general, the answers are very complicated, but in this post, I’ll
consider the two simplest methods for defining fractional derivatives.
This means we can talk about the functions they apply to, and check
they agree, without a huge technical overhead.</p>
<p>Our first nice function is the exponential $e^{\omega x}$.
Differentiating simply pulls down a factor of $\omega$ each time, so</p>
\[D^n e^{\omega x} = \omega^n e^{\omega x}.\]
<p>It’s very clear, then, how to define the fractional derivative acting
on this:</p>
\[D^\alpha e^{\omega x} = \omega^\alpha e^{\omega x}.\]
<p>Great! We can easily check the multiplicative property, assuming that
constants pass through the derivatives:</p>
\[D^\alpha D^\beta e^{\omega x} = \omega^\beta D^\alpha e^{\omega x} =
\omega^{\alpha + \beta} e^{\omega x} = D^{\alpha+\beta}e^{\omega x}.\]
<p>Now, you might think this is useless because we can only
take fractional derivatives of exponential functions.
But at this point, we introduce another assumption, namely that the
fractional derivatives are <em>linear</em>:</p>
\[D^\alpha (\lambda_1 f_1 + \lambda_2 f_2) = \lambda_1 D^\alpha f_1 + \lambda_2 D^\alpha f_2,\]
<p>where $f_1, f_2$ are functions and $\lambda_1, \lambda_2$ are constants.
In particular, let’s suppose this linearity applies to an <em>infinite</em>
collection of exponentials multiplied by constants $\lambda$, arranged
into an integral</p>
\[f(x) = \int_{-\infty}^\infty d\omega \, \lambda(\omega) e^{i\omega x}.\]
<p>Then by linearity,</p>
\[D^\alpha f(x) = \int_{-\infty}^\infty d\omega \, \lambda(\omega) D^\alpha
e^{i\omega x} = \int_{-\infty}^\infty d\omega \, \lambda (\omega)
(i\omega)^\alpha e^{i\omega x}. \tag{1} \label{exp}\]
<p>Functions which can be written this way are said to have a <em>Fourier
representation</em>, with the function $ \lambda (\omega)$ the <em>Fourier
transform</em>. Many reasonably well-behaved functions have one!
Let’s do a very simple example: the sine function, bane of high school
trigonometry classes everywhere.
What is its half derivative?
We start by writing sine in terms of exponentials as</p>
\[\sin(x) = \frac{1}{2i}(e^{ix} - e^{-ix}).\]
<p>We then take a half-derivative using our exponential rule and linearity:</p>
\[\sqrt{D} \sin(x) = \frac{1}{2i}(\sqrt{D} e^{ix} - \sqrt{D} e^{-ix}) = \frac{1}{2i}\left(\sqrt{i} e^{ix} - \sqrt{-i} e^{-ix}\right).\]
<p>There are a few things to note.
First, there is some ambiguity about which roots we choose.
With the principal values $\sqrt{i} = e^{i\pi/4}$ and $\sqrt{-i} = e^{-i\pi/4}$
(arguments between $-\pi$ and $\pi$), the result is
$\sin(x + \pi/4)$, a sine shifted by an eighth of a period; with other
choices of root, the half derivative of a real function need not be real.
In general this ambiguity is harmless, and we just take the principal
values, but this issue will crop up below in a subtle way.
Finally, observe that we can just as easily do crazy things like take
$i$ derivatives! We set $\alpha = i$, so the $i$th derivative of sine is</p>
\[D^i \sin(x) = \frac{1}{2i}\left(i^i e^{ix} - (-i)^i e^{-ix}\right) =
\frac{1}{2i}(e^{-\pi/2 + ix} - e^{+\pi/2 - ix}),\]
<p>since the principal values are</p>
\[i^i = e^{i (i \pi/2)} = e^{-\pi/2}, \quad (-i)^i = e^{i (-i \pi/2)} = e^{\pi/2}.\]
<p>I’m not sure if this has any applications, but it’s cute.
I invite the interested reader to take $\pi$ derivatives of sine. What
better way to celebrate $\pi$ day!</p>
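<p>If you want to experiment, the exponential rule is easy to put into Python using <code>cmath</code>. This is a sketch under the assumption that powers are taken with principal values (via <code>cmath.log</code>); the function names <code>frac_deriv_exp</code> and <code>frac_deriv_sin</code> are mine.</p>

```python
import cmath

def frac_deriv_exp(omega, alpha, x):
    # D^alpha e^{omega x} = omega^alpha e^{omega x},
    # with omega^alpha computed as exp(alpha * Log(omega)), the principal value.
    return cmath.exp(alpha * cmath.log(omega)) * cmath.exp(omega * x)

def frac_deriv_sin(alpha, x):
    # Linearity applied to sin(x) = (e^{ix} - e^{-ix}) / (2i).
    plus_i = complex(0.0, 1.0)
    minus_i = complex(0.0, -1.0)
    return (frac_deriv_exp(plus_i, alpha, x)
            - frac_deriv_exp(minus_i, alpha, x)) / (2 * plus_i)

x = 0.7
# Sanity checks against ordinary derivatives: these pairs should agree.
print(frac_deriv_sin(1, x), cmath.cos(x))
print(frac_deriv_sin(2, x), -cmath.sin(x))
# Fractional, irrational and imaginary orders.
print(frac_deriv_sin(0.5, x))
print(frac_deriv_sin(cmath.pi, x))
print(frac_deriv_sin(1j, x))
```

<p>The first two lines are a check that the rule reduces to the ordinary first and second derivatives when $\alpha$ is a whole number.</p>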
<h4 id="fractorials">Fractorials</h4>
<p>Exponentials aren’t the only nice functions we can use to define
fractional derivatives.
In fact, a more common approach is to use <em>powers</em>.
The first function we encounter in high school is usually the identity
function, $f(x) = x$.
From there, we build up to polynomials $x^m$, and then arbitrary
powers $x^s$.
The derivative of a power has a very simple form:</p>
\[D x^s = s x^{s-1}.\]
<p>If we differentiate again, we bring down a factor of $s - 1$ and
reduce the index again. And so on and so forth. This leads to the expression for
$n$ derivatives:</p>
\[D^n x^s = s(s- 1) \cdots (s - n + 1) x^{s-n}.\]
<p>So far, this doesn’t look like something we can easily continue to
non-integer values of $n$.
But let’s assume for a moment $s$ is a whole number with $s \geq n$.
Then we can write</p>
\[s(s- 1) \cdots (s - n + 1) = \frac{s(s - 1) (s-2) \cdots 1}{(s -
n)(s-n - 1) \cdots 1} = \frac{s!}{(s -n)!},\]
<p>where we have used the good old factorial function $s!$.
Thus, we can write</p>
\[D^n x^s = \frac{s!}{(s -n)!} x^{s-n}.\]
<p>To analytically continue this, we need a beautiful object called the
Gamma function $\Gamma$.
We’ll define it properly below, but for the moment, the
only properties we need are that (a) it agrees with the factorial
function at (shifted) integer values,</p>
\[\Gamma(k + 1) = k!;\]
<p>and (b) is defined for non-integer values as well. I like to think of it as the
“fractorial” because it makes sense for fractional arguments! In addition to
delightfully bad puns, the Gamma function lets us write</p>
\[D^n x^s = \frac{\Gamma(s + 1)}{\Gamma(s -n + 1)} x^{s-n},\]
<p>and immediately continue to the fractional derivative:</p>
\[D^\alpha x^s = \frac{\Gamma(s + 1)}{\Gamma(s -\alpha + 1)}
x^{s-\alpha}. \tag{2} \label{power}\]
<p>Too easy! Once again, we can check the multiplicative property:</p>
\[\begin{align*}
D^\alpha D^\beta x^s & = \frac{\Gamma(s + 1)}{\Gamma(s -\beta + 1)}
D^\alpha x^{s-\beta} \\
& = \frac{\Gamma(s + 1)}{\Gamma(s -\beta + 1)}
\cdot \frac{\Gamma(s - \beta + 1)}{\Gamma(s -\alpha - \beta + 1)}
x^{s-\beta - \alpha} \\
& = \frac{\Gamma(s + 1)}{\Gamma(s -\alpha -\beta + 1)}x^{s-\beta -
\alpha} = D^{\alpha+\beta} x^s.
\end{align*}\]
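<p>Rule (2) is simple to test numerically with <code>math.gamma</code>. The sketch below tracks a power $x^s$ as a pair (coefficient, exponent) and checks that two half derivatives of $x^2$ match one ordinary derivative. The helper name <code>frac_deriv_power</code> is my own, and the arguments are assumed to stay away from the poles of the Gamma function, where <code>math.gamma</code> raises an error.</p>

```python
from math import gamma, isclose

def frac_deriv_power(alpha, s):
    # D^alpha x^s = coeff * x^(s - alpha); return (coeff, new exponent).
    coeff = gamma(s + 1) / gamma(s - alpha + 1)
    return coeff, s - alpha

c1, s1 = frac_deriv_power(0.5, 2.0)   # half derivative of x^2
c2, s2 = frac_deriv_power(0.5, s1)    # another half derivative
print(c1 * c2, s2)                    # compare with D x^2 = 2 x^1
print(isclose(c1 * c2, 2.0), isclose(s2, 1.0))
```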
<p>So this gives us another, evidently different way to define fractional
derivatives. It will apply to any sum or integral of powers of
$x$, for instance, infinite polynomials called <em>power series</em>, and
their close cousins the <em>Laurent series</em> which include reciprocal powers:</p>
\[\sum_{k = 0}^\infty a_k x^k, \quad \sum_{k = -\infty}^\infty b_k x^k.\]
<p>These cover a lot of ground, and there is an even more general object
called the <em>Mellin transform</em>, analogous to the Fourier transform. But
we won’t go there.
Instead, let’s do another simple example.
One of the interesting properties of the Gamma function is that it
blows up (to plus or minus infinity, depending on the side of approach) at nonpositive integers:</p>
\[|\Gamma(-n)| = \infty, \quad n = 0, 1, 2, \ldots.\]
<p>This is actually essential to get sensible answers!
For instance, let’s take the derivative of a constant, $1 = x^0$.
Then according to our definition,</p>
\[D x^0 = \frac{\Gamma(0 + 1)}{\Gamma(0 -1 + 1)} x^{0 - 1} =
\frac{\Gamma(1)}{\Gamma(0)} x^{- 1} = 0,\]
<p>since the $\Gamma(0)$ in the denominator makes the whole thing vanish.
More intriguingly, these infinities sometimes <em>cancel</em> in sensible ways.
For instance, if we take a derivative of $1/x$, we should get
$-1/x^2$. If we plug $x^{-1}$ into our formula, it gives</p>
\[D x^{-1} = \frac{\Gamma(-1 + 1)}{\Gamma(-1 -1 + 1)} x^{-1 - 1} =
\frac{\Gamma(0)}{\Gamma(-1)} x^{-2}.\]
<p>Both the numerator and the denominator blow up, which should make us
queasy. But there is a trick here. It turns out that for any $z$,
the Gamma function obeys the <em>functional equation</em></p>
\[\Gamma(1 + z) = z\Gamma(z).\]
<p>Since $\Gamma(k + 1) = k!$, this gives the usual relation for factorials,</p>
\[k! = \Gamma(k + 1) = k\Gamma(k) = k \cdot (k - 1)!.\]
<p>It also gives the sneaky result $\Gamma(0) = (-1)\Gamma(-1)$. Both $\Gamma(0)$ and
$\Gamma(-1)$ blow up of course, but in the derivative of $1/x$, the
$\Gamma(-1)$ terms cancel, leaving $(-1)x^{-2} = -1/x^2$ as required.</p>
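<p>The cancellation can be watched numerically by nudging the arguments slightly off the poles. In this sketch, the offset <code>eps</code> and the function name <code>ratio_near_pole</code> are my own devices; the functional equation says the ratio should be exactly $\epsilon - 1$, which tends to $-1$.</p>

```python
from math import gamma

def ratio_near_pole(eps):
    # Gamma(0 + eps) / Gamma(-1 + eps): both factors are large, the ratio is not.
    return gamma(eps) / gamma(-1.0 + eps)

for eps in (1e-2, 1e-4, 1e-6):
    print(eps, gamma(eps), ratio_near_pole(eps))
```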
<h4 id="gamma-and-tongs">Gamma and tongs</h4>
<p>This all sounds great, but you might be wondering why the Gamma
function is the right way to extend the factorial function away from
whole numbers.
In fact, many other functions interpolate between the factorials
and still satisfy the multiplicative property.
What we’re going to do in this last section is use the fractional
derivatives, defined using exponentials, to <em>derive</em> the Gamma
function continuation.
And in order to do this, we have to grit our teeth and define the
Gamma function in all its glory:</p>
\[\Gamma(s) = \int_{0}^\infty dt\, t^{s-1} e^{-t}.\]
<p>If you’re interested, you can find proofs of the functional equation and so on
<a href="https://en.wikipedia.org/wiki/Gamma_function">elsewhere</a>.
Instead, we’re going to make the sneaky change of variables $t =
\omega x$, yielding</p>
\[\Gamma(s) = x^{s} \int_{0}^\infty d\omega\, \omega^{s-1} e^{-\omega
x}.\]
<p>If we change $s \to -s$, and rearrange, we get a formula for $x^s$
in terms of exponentials:</p>
\[x^{s} = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\, \omega^{-(1+ s)}
e^{-\omega x}. \tag{3} \label{gamma}\]
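<p>If you want some reassurance before trusting formula (3), here is a rough numerical
check. It is only a sketch: the function name <code>power_from_exponentials</code>, the
midpoint rule, and the cutoff <code>omega_max</code> are all choices of mine, and it assumes
$x &gt; 0$ and $s &lt; 0$ so that the integral converges.</p>

```python
import math

def power_from_exponentials(x, s, omega_max=40.0, steps=40_000):
    """Midpoint-rule approximation of (1/Gamma(-s)) * integral of omega^(-(1+s)) e^(-omega x)."""
    h = omega_max / steps
    total = 0.0
    for k in range(steps):
        omega = (k + 0.5) * h          # midpoint of the k-th slice
        total += omega ** (-(1 + s)) * math.exp(-omega * x)
    return total * h / math.gamma(-s)

x, s = 2.0, -1.5
print(power_from_exponentials(x, s), x ** s)   # the two numbers should agree closely
```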
<p>Great! Now we just go ahead and use rule (\ref{exp}), with the hope we
will get rule (\ref{power}).
As usual, we proceed using linearity:</p>
\[\begin{align*}
D^\alpha x^{s} & = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\,
\omega^{-(1+ s)} D^\alpha e^{-\omega x} \\
& = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\,
\omega^{-(1+ s)} (-\omega)^\alpha e^{-\omega x} \\
& = \frac{(-1)^\alpha}{\Gamma(-s)}\int_{0}^\infty d\omega\,
\omega^{-(1+ s - \alpha)} e^{-\omega x} \\
& = \frac{(-1)^\alpha}{\Gamma(-s)} \cdot \Gamma[-(s-\alpha)]x^{s-\alpha},
\end{align*}\]
<p>where on the last line we used (\ref{gamma}), but with $s
-\alpha$ instead of $s$.
This isn’t quite what we want.
To make progress, we’ll take advantage of the <em>reflection
formula</em> for the Gamma function (derived <a href="https://hapax.github.io/mathematics/zeta/">here</a>
for instance):</p>
\[\Gamma(z) \Gamma(1 - z) = \frac{\pi}{\sin(\pi z)}.\]
<p>We can apply this to both $\Gamma(-s)$ and $\Gamma[-(s-\alpha)]$ to
get</p>
\[\begin{align*}
D^\alpha x^{s} & = (-1)^\alpha \frac{\sin(\pi
s)}{\sin[\pi(s-\alpha)]}\cdot \frac{\Gamma(s+1)}{\Gamma(s-\alpha + 1)} x^{s-\alpha}.
\end{align*}\]
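<p>The reflection step is easy to get wrong by a sign, so here is a small check that
the ratio of Gamma functions before and after agrees. The function names and the sample
values of $s$ and $\alpha$ are arbitrary choices for illustration; the values just
need to avoid the poles and the zeros of the sines.</p>

```python
import math

def ratio_direct(s, alpha):
    """Gamma[-(s - alpha)] / Gamma(-s), straight from the last line of the derivation."""
    return math.gamma(-(s - alpha)) / math.gamma(-s)

def ratio_reflected(s, alpha):
    """The same ratio after applying the reflection formula to both Gamma functions."""
    sines = math.sin(math.pi * s) / math.sin(math.pi * (s - alpha))
    gammas = math.gamma(s + 1) / math.gamma(s - alpha + 1)
    return sines * gammas

s, alpha = 0.3, 0.5
print(ratio_direct(s, alpha), ratio_reflected(s, alpha))
```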
<p>This is almost (\ref{power}), the thing we were after!
But there is this strange factor with sines out the front.
Recall the definition of sine in terms of complex exponentials.
This lets us write the funny factor as</p>
\[(-1)^\alpha \frac{\sin(\pi s)}{\sin[\pi(s-\alpha)]} = \frac{e^{\pi i
s} - e^{-\pi i s}}{(-1)^\alpha e^{\pi i (s-\alpha)} - (-1)^\alpha e^{-\pi i (s-\alpha)}}.\]
<p>It would be magical if that $(-1)^\alpha$ could somehow behave
differently and cancel the $\alpha$ terms floating around, right?
Well, turns out it does!
We can write $-1 = e^{\pm \pi i}$, and hence</p>
\[(-1)^\alpha = e^{\pm \pi i \alpha}.\]
<p>I won’t spell out the details, but if you look at <a href="https://hapax.github.io/mathematics/zeta/">this proof</a> of the reflection
formula, the two different terms in the sine arise from parts of an
integration contour which lie in almost the same place, but where we take
roots in different ways.
In particular, evaluating $(-1)^\alpha$ gives $e^{\pm \pi i \alpha}$
respectively, so they cancel the $\alpha$ terms after all.
The upshot is that our funny factor is just unity:</p>
\[\frac{e^{\pi i
s} - e^{-\pi i s}}{(-1)^\alpha e^{\pi i (s-\alpha)} - (-1)^\alpha
e^{-\pi i (s-\alpha)}} = \frac{e^{\pi i
s} - e^{-\pi i s}}{e^{\pi i \alpha} e^{\pi i (s-\alpha)} - e^{-\pi i \alpha}
e^{-\pi i (s-\alpha)}} = \frac{e^{\pi i
s} - e^{-\pi i s}}{e^{\pi i s} - e^{-\pi i s}} = 1.\]
<p>Thus, our exponential rule actually
reproduces the rule for powers of $x$ involving the Gamma
function! Now, to be clear, fractional derivatives are a big and
mathematically heavy topic, and I’ve only skimmed the surface.
But it’s neat that the two simplest approaches agree.</p>
<h4 id="acknowledgments">Acknowledgments</h4>
<p>Thanks to J.A. for chatting about fractional derivatives, and getting
me thinking about the simplest way to define them.</p>
<!-- Our exponential definition yields an *antiderivative* operator:
$$
D^{-1} e^{\omega x} = \frac{1}{\omega}e^{\omega x}.
$$
This is the usual antiderivative, except without the constant. -->
<p>David A Wakeham. March 13, 2021. Can you take half a derivative? Or π derivatives? Or even √–1 derivatives? It turns out the answer is yes, and there are two simple but apparently different ways to do it. I show that one implies the other!</p>
<p>The statistical basis of Fermi estimates. 2021-02-12T00:00:00+00:00. http://hapax.github.io/physics/hacks/mathematics/statistics/fermi-log-normal</p>
<p><strong>February 12, 2021.</strong> <em>Why are Fermi approximations so effective? One
important factor is log-normality, which occurs for large random
products. <!--, also related to the mechanism underlying
the Newcomb-Benford law for first digits.--> Another element is
variance-reduction through judicious subestimates. I discuss both
and give a simple heuristic for the latter.</em></p>
<h4 id="introduction">Introduction</h4>
<p>Fermi approximation is the art of making good order-of-magnitude estimates.
I’ve written about Fermi estimates
at greater length
<a href="https://hapax.github.io/assets/fermi-estimates.pdf">here</a> and
<a href="https://hapax.github.io/physics/teaching/hacks/napkin-hacks/#sec-3">here</a>,
but I’ve never really found a satisfactory explanation for why they work.
Order-of-magnitude is certainly a charitable margin of
error, but time and time again, I find they are better than they have any right to be!
Clearly, there must be an underlying statistical explanation for this apparently
unreasonable effectiveness.</p>
<!-- We will try to explain the first using logarithmic uniformity, which is
the same mechanism underlying the anomalous distribution of first
digits known as the
[Newcomb-Benford law](https://en.wikipedia.org/wiki/Benford%27s_law).
We give a looser but related explanation of the second in terms of strategies for
variance-reduction in human error. -->
<h4 id="products-and-log-normality">Products and log-normality</h4>
<p>There are two key techniques: the use of geometric means, and the
factorisation into subestimates.
We start with geometric means.
Suppose a random variable $F$ is a product of many independent random
variables,</p>
\[F = X_1 X_2 \cdots X_N.\]
<p>Then the logarithm of $F$ is a sum of many random variables $Y_i =
\log X_i$:</p>
\[\log F = \log X_1 + \log X_2 + \cdots + \log X_N = \sum_{i=1}^N Y_i.\]
<p>By the central limit theorem for unlike variables (see
e.g. <a href="https://hapax.github.io/hacks/mathematics/statistics/clt/">this post</a>),
for large $N$ this approaches a normal distribution:</p>
\[\log F \to \mathcal{N}(\mu, \sigma^2), \quad \mu := \sum_i \mu_i,
\quad \sigma^2 := \sum_i \sigma_i^2,\]
<p>where the $Y_i$ have mean $\mu_i$ and variance $\sigma_i^2$.
We say that $F$ has a <em>log-normal</em> distribution, since its log is
normal.</p>
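<p>To see this in action, here is a small simulation sketch. The choice of factors
(fifty independent draws from a uniform distribution on $[0.5, 2]$), the sample size,
and the seed are all arbitrary assumptions of mine; the point is only that the log of
the product looks roughly normal, with about 68% of samples within one standard
deviation of the mean.</p>

```python
import math
import random
import statistics

def log_of_random_product(n_factors, rng):
    """Log of a product of n_factors independent draws from Uniform(0.5, 2)."""
    return sum(math.log(rng.uniform(0.5, 2.0)) for _ in range(n_factors))

rng = random.Random(0)                      # fixed seed so the run is repeatable
logs = [log_of_random_product(50, rng) for _ in range(2000)]

mu = statistics.fmean(logs)
sigma = statistics.stdev(logs)
within_one_sigma = sum(abs(y - mu) < sigma for y in logs) / len(logs)
print(mu, sigma, within_one_sigma)          # a normal distribution puts about 68% within one sigma
```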
<!-- To get uniformity into the picture, we can zoom in on the region near
$F = e^\mu$ where the probability density is approximately uniform.
More carefully, the density is
$$
p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}.
$$
Taylor-expanding near $x = \mu$ gives
$$
p(x) = \frac{1}{\sigma\sqrt{2\pi}}
\left[1 - \frac{(x-\mu)^2}{2\sigma^2} + O(x^4)\right].
$$
This looks uniform provided $(x - \mu)^2 \ll \sigma^2$.
For instance, at a third of a standard deviation, $x = \mu + \sigma/3$,
we have
$$
1 - \frac{(x-\mu)^2}{2\sigma^2} = 1 - \frac{1}{18} \approx 0.94,
$$
and $\text{erf}(1/\sqrt{18}) \approx 0.26$, about a quarter of the
probability mass, lies underneath.
This is what we mean when we say that $F$ is logarithmically uniform. -->
<h4 id="geometric-means">Geometric means</h4>
<p>In Fermi estimates, one of the basic techniques is to take geometric
means of estimates, typically an overestimate and an underestimate.
For instance, to Fermi estimate the population of Chile, I could
consider a number like one million which seems much too low, and a
number like one hundred million which seems much too high, and take
their geometric mean:</p>
\[\sqrt{(1 \text{ million}) \times (100 \text{ million})} = 10 \text{ million}.\]
<p>Since population is a product of many different factors, it is
reasonable to expect it to approximate a log-normal distribution.
Then, after logs, the geometric mean $\sqrt{ab}$ becomes the
arithmetic mean of $\log a$ and $\log b$:</p>
\[\log \sqrt{ab} = \frac{1}{2}(\log a + \log b).\]
<p>Taking the mean $\mu$ of the distribution as the true value, the
logarithms of these geometric means provide an
<a href="https://en.wikipedia.org/wiki/Bias_of_an_estimator">unbiased estimator</a>
of $\mu$.
Moreover, the variance of the estimate will decrease as $1/k$ for $k$
samples (assuming human estimates sample from the distribution), so more is better.
To see how much better I could do on the Chile population estimate, I
solicited guesses from four friends, and obtained $20, 20, 30$ and $35$
million.
Combining with my estimate, I get a geometric mean</p>
\[(10 \times 20 \times 20 \times 30 \times 35)^{1/5} \text{ million}
\approx 21 \text{ million}.\]
<p>The actual population is around $18$ million, so the estimate made
from more guesses is indeed better!
This is also better than the arithmetic average, $23$ million.
Incidentally, this illustrates the
<a href="https://hapax.github.io/physics/mathematics/statistics/crowd/">wisdom of the crowd</a>,
also called “diversity of prediction”.
The individual errors from a broad spread of guesses tend to cancel
each other out, leading to a better-behaved average, though in this case
in logarithmic space.</p>
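<p>The arithmetic above is easy to reproduce. This sketch uses Python’s standard
<code>statistics</code> module; the list name <code>guesses</code> is just a label of mine for the
five estimates quoted in the text.</p>

```python
import statistics

guesses = [10, 20, 20, 30, 35]   # estimates of Chile's population, in millions

geometric = statistics.geometric_mean(guesses)
arithmetic = statistics.fmean(guesses)
print(round(geometric, 1), arithmetic)
```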
<p>In general, Fermi estimates work best for numbers which are large
random products (this is how we try to solve them!), so the problem
domain tends to enforce the statistical properties we want.
For many examples of log-normal distributions in the real world, see
<a href="https://academic.oup.com/bioscience/article/51/5/341/243981">Limpert, Stahel and Abbt (2001)</a>.
It’s worth noting that not everything we can Fermi estimate is
log-normal, however.
Many things in the real world obey power laws, for instance, and
although you can exploit this to make better Fermi estimates (as
lukeprog does in
<a href="https://www.lesswrong.com/posts/PsEppdvgRisz5xAHG/fermi-estimates#Example_4__How_many_plays_of_My_Bloody_Valentine_s__Only_Shallow__have_been_reported_to_last_fm_">his tutorial</a>),
we can happily Fermi estimate power-law distributed numbers without
this advanced technology.</p>
<p>Are Fermi estimates unreasonably effective in this context?
Maybe.
But the estimates work best in the high-density core where things look
uniform, not out at the tails, and it’s not until we get to the tails that the difference
between the log-normal and power law (or exponential, or Weibull, or
your favourite skewed distribution) becomes pronounced.
So the unreasonable effectiveness here can probably be explained by
the resemblance to the log-normal, though this is something I’d like
to check more carefully in future.</p>
<!-- In general, we only expect Fermi estimates to work for numbers which
are the product of many factors.
But this is precisely the sorts of things we use Fermi estimates for!
In a sense, the problem domain naturally leads to logarithmic
uniformity.
Incidentally, I've talked about "uniformity", but the geometric mean
is still a measure of central tendency for any distribution, and is
particularly nice for a lognormal one, which arise for products of
random variables.
The magic of geometric means manifests most
strongly in the near-uniform blob at the centre. -->
<!-- #### The Newcomb-Benford law
Logarithmic uniformity also explains an odd pattern in the first
digits of naturally occurring numbers like tax returns, stock market
prices, populations, river lengths, physical constants, and even
powers of $2$.
The pattern, called the *Newcomb-Benford law* after
[Simon Newcomb](https://en.wikipedia.org/wiki/Simon_Newcomb) and
[Frank Benford](https://en.wikipedia.org/wiki/Frank_Benford), is as
follows: for base $b$, the digit $d \in \\{1, 2, \ldots, b-1\\}$
occurs with relative frequency
$$
p_b(d) = \log_b \left(\frac{d+1}{d}\right) = \frac{1}{\log b}\log \left(\frac{d+1}{d}\right).
$$
It initially seems bizarre that digits do not occur with equal
frequency.
But as neatly explained by
[Pietronero et al. (1998)](https://arxiv.org/pdf/cond-mat/9808305.pdf),
it follows immediately if the relevant numbers are logarithmically uniform.
Let $X$ be our random number.
Then the first digit is $d$ if
$$
db^k \leq X < (d+1)b^{k} \quad \Longrightarrow \quad \log_b d + k \leq
\log_b X < \log_b(d+1) + k
$$
for some integer $k$.
If $X$ is logarithmically uniform, for instance sitting near the mean
of a big random product, then $\log_b X$ is uniformly
distributed, and lies in the interval $I_d :=
[\log_b d, \log_b (d+1)]$ with probability
$$
(\log_b (d+1) + k) - (\log_b d + k) = \frac{1}{\log b}\log \left(\frac{d +
1}{d}\right) = p_b(d).
$$
This provides a simple way to check for fraud on tax returns, for
instance.
Just compute relative frequencies of first digits in different bases
and check they obey Newcomb-Benford!
You might wonder why something totally deterministic, like the first
digit of a power of $2$, also obeys Benford's law.
Here is a pie chart of initial decimal digits for the first $10,000$ binary
powers, which follows the Newcomb-Benford law exactly:
<figure>
<div style="text-align:center"><img src
="/images/posts/benford1.png"/>
</div>
</figure>
Here is the Python code to generate it.
You can check it for other numbers besides $2$ as well by simply
changing the `power` variable:
```python
import matplotlib.pyplot as plt
import math
maxpower = 10000 # Number of powers to check
power = 2 # Change to check other powers
nums = '1', '2', '3', '4', '5', '6', '7', '8', '9',
# Newcomb-Benford probabilities log10((d+1)/d) for d = 1, ..., 9
benford = [(math.log10(d + 1) - math.log10(d))
           for d in range(1, 10)]
firstdig = [0 for i in range(9)]
for i in range(maxpower):
    ind = int(str(power**i)[0]) - 1
    firstdig[ind] = firstdig[ind] + 1
fig, ax = plt.subplots()
fig.set_facecolor('white')
ax.pie(firstdig, labels=nums, autopct='%1.1f%%', startangle=90)
# Change 'firstdig' to 'benford' for probabilities
ax.axis('equal')
plt.show()
```
The mechanism for logarithmic uniformity here is slightly different,
and discussed in depth in Serge Tabachnikov's
[book on geometric billiards](http://www.personal.psu.edu/sot2/books/billiardsgeometry.pdf).
In this case, $X = 2^n$, so the first digit is $d$ just in case
$$
\log_{10}d + k \leq n\log_{10} 2 < \log_{10}(d + 1) + k.
$$
Let $\text{frac}(x)$ denote the fractional part of $x$, and define
$x_n := \text{frac}(n\log_{10} 2)$.
Taking fractional parts gives
$$
\log_{10}d \leq x_n < \log_{10}(d + 1).
$$
It turns out that, since $x_1 = \log_{10} 2$ is irrational,
$x_n$ jumps randomly around the unit interval, and forms an
"equidistribution" which spends equal times in equal areas.
For a proof, see Tabachnikov's book.
But although the fundamental cause is different, the outcome is still
logarithmic uniformity, and the Newcomb-Benford law results. -->
<h4 id="the-philosophy-of-subestimates">The philosophy of subestimates</h4>
<p>Now we’ve dealt with geometric means and log-normality, we
turn to the effectiveness of factorising a Fermi estimate.
If we take logarithms, factors become summands, and we’ll reason about those since they are simpler.
If $Z = X + Y$ is a sum of independent random variables, the variance
is additive, so that</p>
\[\text{var}(Z) = \text{var}(X) + \text{var}(Y).\]
<p>Thus, splitting a sum into estimates of the summands and adding them
should not change the variance of the guess.
Of course, there is a fallacy in this reasoning: humans are not
sampling from the underlying distribution!
When we guess, we introduce our own random errors.
For instance, my estimate for $Z$ will have some human noise $\varepsilon_Z$:</p>
\[\hat{Z} = Z + \varepsilon_Z.\]
<p>Similarly, my guesses for $X$ and $Y$ have some random errors
$\varepsilon_X$ and $\varepsilon_Y$.
There is no reason for the variances of $\varepsilon_X$ and $\varepsilon_Y$
to add up to the variance of $\varepsilon_Z$.
The sum could be bigger, or it could be smaller.
But a good decomposition should reduce the combined variance:</p>
\[\text{var}(\varepsilon_X) + \text{var}(\varepsilon_Y) < \text{var}(\varepsilon_Z).\]
<p>If log-normality is the science of Fermi estimates, picking
variance-reducing subestimates is the art.
<!-- But there is a connection to our earlier discussion.
I think the human error $\varepsilon_X$ will roughly mimic the
empirical distribution of $Z$ we have seen in the world.
If it is biased, so is $\varepsilon_X$; it we have only seen a few
examples, the variance of $\varepsilon_X$ will probably be large, and
decrease roughly as $1/k$ with $k$ examples.
So the general strategy for variance reduction is to factorise into
things we have seen before.
We can even use these data points to generate subestimates by geometric averaging.-->
But I suspect that $\hat{Z}$ roughly speaking behaves like a <em>test
statistic</em> for $Z$, with the number of samples corresponding to how
many data points for $Z$ we have encountered.
So we expect that $\text{var}(\varepsilon_Z)$ will vanish roughly as
$1/k$ with $k$ samples.
If we have more exposure to the distributions for $X$ and $Y$,
the combined error will probably be smaller.
This is why we carve into subfactors we understand!</p>
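<p>The $1/k$ scaling is easiest to see in an idealised setting. The sketch below
assumes each “data point” is an independent standard-normal sample, which is certainly
not a faithful model of human guessing; the function name <code>variance_of_mean</code>, the
trial count, and the seed are likewise my own choices.</p>

```python
import random
import statistics

def variance_of_mean(k, trials, rng):
    """Empirical variance of the average of k standard-normal samples."""
    means = [statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(k)])
             for _ in range(trials)]
    return statistics.variance(means)

rng = random.Random(1)                 # fixed seed so the run is repeatable
v4 = variance_of_mean(4, 4000, rng)
v16 = variance_of_mean(16, 4000, rng)
print(v4, v16, v4 / v16)               # quadrupling k should cut the variance by roughly four
```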
<h4 id="variance-reduction-in-practice">Variance reduction in practice</h4>
<p>I’ll end with a speculative rule of thumb for when to factor: try generating over- and
underestimates for the factors and the product, which in additive
notation give</p>
\[(\Delta X)^2 + (\Delta Y)^2, \quad (\Delta Z)^2\]
<p>where $\Delta$ refers to the difference of the (logarithm of the) over-
and underestimate.
Factorise if the first estimated error is smaller than the second.
Let’s illustrate by returning to the population of Chile.
I can try factoring it into a number of regions multiplied by the
average number of people per region.
Taking logs (in base $10$) of the over- and underestimate of Chile’s
population I gave above, I get</p>
\[(\Delta Z)^2 = (\log_{10} 10^8 - \log_{10} 10^6)^2 = 4.\]
<p>On the other hand, for regions I would
make a lower guess of $5$ and an upper guess of $30$, with a squared difference in logs of $(\Delta X)^2 \approx 0.6$.
For regional population, I would make a lower guess of $5\times 10^5$ and an
upper guess of $5\times 10^6$, with $(\Delta Y)^2 = 1$.
Thus,</p>
\[(\Delta X)^2 + (\Delta Y)^2 = 1.6 < 4 = (\Delta Z)^2.\]
<p>The guess from the factorisation (taking geometric means) is</p>
\[\sqrt{5 \times 30 \times (5\times 10^5) \times (5\times 10^6)} \approx 19 \text{
million}.\]
<p>This is even better than the crowdsourced estimate!
For reference, the number of regions is $16$, while our estimated mean is around
$12$, and the average population per region is a bit over a million,
which we’ve mildly overestimated at $1.6$ million.
The two balance out and give a better overall estimate.
<!-- This suggests a diversity of prediction mechanism is at play with -->
<!-- subestimates, but I haven't worked out the details. --></p>
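<p>The rule of thumb can be written out as a short sketch. The function names
<code>log_spread_squared</code> and <code>geometric_midpoint</code> are hypothetical, and the bounds
are just the over- and underestimates I quoted above; nothing here is more than the
speculative heuristic in code form.</p>

```python
import math

def log_spread_squared(low, high):
    """Squared difference between the base-10 logs of an over- and underestimate."""
    return (math.log10(high) - math.log10(low)) ** 2

def geometric_midpoint(low, high):
    """Geometric mean of an under- and overestimate."""
    return math.sqrt(low * high)

whole = (1e6, 1e8)                 # direct bounds on Chile's population
regions = (5, 30)                  # bounds on the number of regions
per_region = (5e5, 5e6)            # bounds on people per region

direct_error = log_spread_squared(*whole)
factored_error = log_spread_squared(*regions) + log_spread_squared(*per_region)

# Factorise only if the combined spread of the factors is the smaller one
if factored_error < direct_error:
    estimate = geometric_midpoint(*regions) * geometric_midpoint(*per_region)
else:
    estimate = geometric_midpoint(*whole)
print(direct_error, factored_error, estimate)
```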
<h4 id="conclusion">Conclusion</h4>
<p>From a statistical perspective, Fermi estimates are based on two
techniques: geometric means and splitting into subfactors.
We usually estimate things which can be expressed as a product of many
factors. These will tend towards a log-normal distribution by the (log
of the) central limit theorem, so that geometric means provide a good
estimator, exactly like the usual mean for normally distributed variables.
Subestimates, on the other hand, carve guesses into factors we
understand, i.e. have more data points for, so that (assuming they
behave like test statistics) variance is reduced.
The effectiveness of Fermi estimates is quite reasonable after all!
<!-- They're not so unreasonable after all! --></p>
<!-- There is an art to making over- and underestimates
that accurately reflect the variance of our error random variables,
which are involved both in taking geometric means for single
quantities, and reducing variance through subestimates.
Still, it's cool that there is a statistical basis for the different
aspects of the effectiveness of Fermi estimates.
It's not so unreasonable after all! -->
<!-- For instance, if $e^Z$ is the population of Chile, I can factor it
into number of provinces $e^X$ multiplied by the average number of people per province $e^Y$.
But this is likely to *increase* the error, since I know less about
provinces of Chile than I do about Chile compared to other countries.
I suspect that there is a nice quantitative connection to be made
between the variance of $\varepsilon_X$ and the prior data I have on
it. -->
<!--
The Lyapunov condition holds for a sum of independent random
variables.
By taking an exponential, we can turn it into a result for a *product* of
independent variables.
Let $X_i, \mu_i, \sigma_i^2$ be as above, and $X_i = \log Y_i$.
Then
$$
\exp\left[\sum_{i=1}^N X_i\right] = \prod_{i = 1}^N Y_i \to \log
\mathcal{N}(\mu, \sigma^2).
$$
The distribution on the right is not a normal, but a *log-normal*.
It is simply what the normal distribution looks like when viewed in
terms of a variable $y > 0$ defined by $x = \log y$.
In order to plot the density, we use the fact that $dx =
dy/y$, and hence
$$
p(x)\, dx = \frac{dx}{\sqrt{2\pi}\sigma}
e^{-\frac{(x-\mu)^2}{2\sigma^2}} = \frac{dy}{\sqrt{2\pi}\sigma y}
e^{-\frac{(\log y-\mu)^2}{2\sigma^2}}.
$$
So, this is distribution that a product of many independent factors
converges to. -->
<!-- https://arxiv.org/pdf/cond-mat/9808305.pdf -->
<p>David A Wakeham</p>
<p>Reductionism, order and patterns. 2021-02-08T00:00:00+00:00. http://hapax.github.io/mathematics/physics/philosophy/form</p>
<p><strong>February 8, 2021.</strong> <em>Some philosophical reflections on the nature of
scientific explanation, structure, emergence, and the unreasonable
effectiveness of mathematics.</em></p>
<h4 id="introduction">Introduction</h4>
<p><span style="padding-left: 20px; display:block">
Explanations must come to an end somewhere.
</span></p>
<div style="text-align: right"><i>Ludwig Wittgenstein</i> </div>
<p>Reductionism is the idea that you explain stuff with
smaller stuff, and keep going until you stop.
In many ways, this describes the explanatory program of 20th century
physics, which, starting from the 19th century puzzles of statistical mechanics,
conjured up atoms, subatomic particles, the zoo of the Standard Model, and even
tinier hypothetical entities like strings and spin foams.
Most physicists spend their time in a lab, on a computer, or in front
of a blackboard, trying to reduce complex things to simple things they understand.
So like Platonism in mathematics, reductionism in physics simply makes
a philosophy out of everyday practice.
We break stuff down, so things reduce; we play abstractly with
mathematical objects, so they exist abstractly.</p>
<p>But also like Platonism, reductionism is a convenient fiction, or rather, a
caricature in which some things are emphasised at the cost of others.
And given the reverence with which philosophers hold the considered
ontological verdicts of science, it’s worth asking: what does science really tell us about the
universe? What sorts of objects are necessary for explanation? Does
explanation go only upwards, or can it go downwards or sideways?
Should we eliminate the things we explained? And what has explanation
to do with existence anyway?
This post is an attempt to unconfuse myself about some of these questions.
<!-- adds a dash of novelty and modern
physics to old (and in some cases hopelessly outdated) debates. --></p>
<h4 id="the-existence-of-shoes">The existence of shoes</h4>
<p><span style="padding-left: 20px; display:block">
… our common sense conception of psychological phenomena constitutes a
radically false theory, a theory so fundamentally defective that both
the principles and ontology of that theory will eventually be
displaced, rather than smoothly reduced, by completed neuroscience.
</span></p>
<div style="text-align: right"><i>Paul Churchland</i> </div>
<p>Physical objects can be described at different levels.
A shoe is constructed from flat sheets of material, curved, cut,
marked, and stuck together in clever ways; materials
curve and stick by virtue of their constituent
chemicals, usually long, jointed molecular chains called polymers;
polymers, in turn, are built like Lego from a smorgasbord of elements;
and each elemental atom is a dense nuclear core, surrounded by
electrons whirring around in elaborate orbitals.</p>
<p>From the properties of the neutrons, protons and
electrons, it seems we can work our way upwards, and infer everything
else.
The laws of quantum mechanics and electromagnetism determine the
orbital structure of the atom. The valence shell of the atom
determines how it can combine with other atoms to form
chemicals. Finally, the structural motifs and functional groups of the
polymers give them the properties the industrial chemist, the designer,
and the cobbler exploit to make a shoe.
Thus, some philosophers conclude, only electrons, protons, and
neutrons exist.
The rest can be eliminated as unnecessary
ontological baggage.
This view is called <em>eliminative reductionism</em>.
It is a hardcore philosophy which does not believe in shoes [<sup><a id="fnr.1" name="fnr.1" class="footref" href="#fn.1">1</a></sup>].</p>
<p>There is a gentler, less silly form of reductionism which grants the
existence of shoes, but insists that they are (in the phrase of Jack
Smart) nothing “over and above” the constituent subatomic particles.
The shoe “just is” electrons and protons and neutrons, in some order;
this is what we mean by a shoe.
There are other ways to characterise the reduction, <!--
for instance, that the properties of the shoe "follow"
from, or are "completely explained by", those of the subatomic particles.
In fact, there is--> and a whole literature devoted to the attendant
subtleties, but most fall under the heading of analytic
micro-quibbles.
<!-- , and won't concern us here.-->
Instead, we will make a much simpler observation: order matters.</p>
<p>Clearly, if we took those subatomic particles, and arranged them in a
different way, we would get different elements, different chemicals,
and a duck or a planetesimal instead of a shoe.
Arrangement is important.
It is patently absurd to try and explain the bulk properties of the
shoe—the fact that it fits around a human foot, for
instance—without appeal to arrangement, since a different
order yields objects which do not fit around a foot.
<!-- If one objects that "fitting around a foot" is some sort of
anthropocentric folly due for elimination, replace it with,
Philip Anderson was perhaps the first physicist to make this argument,
in his famous article ["More is Different"](https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf). -->
Since order has <em>explanatory</em> significance, it should presumably be
tarred with the same ontic brush we apply to things like electrons.</p>
<p>Of course, one may object that explanation does not equal existence.
I can handily account for the continual disappearance of my socks by
the hypothesis of sock imps.
But this is a bad explanation! It’s not consistent with other reliably known facts about the world.
Sock imps don’t make the ontic cut, not because there is no link between
explanation and what we deem to exist, but because that link should
only be made for <em>robust</em> explanations, and the poor little sock imps collapse
at the first empirical hurdle.
That different arrangements of things have different properties is
robust, almost to the point of truism, and there seems to be no
principled reason to ban order <!-- , or *structure* as we will call it,-->
from our ontology.</p>
<h4 id="emergence-vs-structure">Emergence vs structure</h4>
<p><span style="padding-left: 20px; display:block">
More is different.
</span></p>
<div style="text-align: right"><i>Philip W. Anderson</i> </div>
<p>It’s worth noting the parallel
to <em>emergence</em>.
In his famous article
<a href="https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf">“More is Different”</a>,
Philip W. Anderson argued for the idea of domain-specific laws and
dynamical principles which did not follow the strict, one-way
explanatory hierarchy of reduction, particularly in his field of
condensed matter physics.
And indeed, condensed matter makes a science of order itself,
studying how properties of macroscopic wholes (such as phases of
matter) “emerge” from the arrangement of microscopic parts.
Anderson thought of emergence as patterns that appear when you “zoom
out” from the constituents, but which are still made from the
constituents; we are just describing those constituents at a different level.
<!-- the microscopic perspective as the wrong "level"
of description, like being too zoomed in on a microscope, but I think that
it is simply different information. --></p>
<p>But this seems to suffer from the same problem as a reductionist
account of shoes.
The “emergent properties” are not properties of the constituents at
all!
The symmetries, order parameters, <!-- which measure their brokenness,
and collective excitations which emerge as long-range messengers of
disorder, a are not simply the microscopics "zoomed out".--> and
collective excitations studied by condensed matter physicists belong
only to the arrangements.
In fact, systems made from totally different materials can
exhibit the same emergent behaviour [<sup><a id="fnr.2" name="fnr.2" class="footref" href="#fn.2">2</a></sup>]!
They are something new, something “over and above” the spins of the
lattice, or the carbon atoms of a hexagonal monolayer, since different
arrangements of those same parts would have different properties.
We can turn Anderson’s snappy slogan around:
<em>different is more</em>. If arranging things differently gives them new
and different properties, it is a sign of structure, and structure is
something over and above the component parts themselves.
<!-- often characterising phases of matter in terms of
what are called *order parameters*, numbers which characterise the
brokenness of a symmetry. --></p>
<h4 id="what-is-a-particle">What is a particle?</h4>
<p><span style="padding-left: 20px; display:block">
It is raining instructions out there; it’s raining programs; it’s
raining tree-growing, fluff-spreading, algorithms. That is not a
metaphor, it is the plain truth. It couldn’t be any plainer if it were
raining floppy discs.
</span></p>
<div style="text-align: right"><i>Richard Dawkins</i> </div>
<p>We don’t need emergence to argue for structure; we can use the
elementary components themselves.
When philosophers talk about reductionism, they tend to imagine
subatomic particles as small, indivisible blobs, without internal
organisation or further ontological bells and whistles. An electron
might have properties like mass or charge, and obey the curious dictates of quantum mechanics,
but all this is packaged irreducibly and not worth further discussion.
But if we try and unpack all these “simple” properties, we will find
that, like the magic bag of Mary Poppins, a particle is much deeper
than it first appears!
The Large Hadron Collider does not produce evidence for tiny,
structureless blobs.
Rather, it confirms at a rate of petabytes per second that the universe is made of mathematics.</p>
<p>The state-of-the-art definition of a particle is
<!-- (as
[this Quanta article](quantamagazine.org/what-is-a-particle-20201112)
humorously explores) --> a bit of a mouthful: an <em>irreducible
representation of the Lorentz group</em>.
In plain English, being a <em>representation</em> means that particles are
objects which have or “transform with” symmetries, in the same way a circle looks the same however
you rotate it.
That it is <em>irreducible</em> means that it cannot be split into smaller
parts which have the same symmetry, which is the mathematical avatar
of being “indivisible”.
Finally, the symmetry itself, the <em>Lorentz group</em>, is the same group
describing the shape of empty space according to special relativity.
So, in summary, a particle transforms with the symmetries of empty space, and
cannot be split into parts with this symmetry.
<!-- [<sup><a id="fnr.3" name="fnr.3" class="footref" href="#fn.3">3</a></sup>].-->
Lurking implicitly in the background is the whole framework of
quantum mechanics, and in particular, that particles are <em>states in a
Hilbert space</em>. In plain English, we can add and subtract states of a
particle, and compare them to each other.</p>
<p>Thus, every particle is like a mathematical diamond: indivisible,
multifaceted, and structured to the hilt.
When philosophers of science eagerly assent to believe whatever the particle physicists
tell them, <!-- particularly when it can be tested with unparalleled
precision at the LHC, --> they may not realise what
they signed up for!
Spacetime, quantum mechanics, and symmetries, the Lorentz group and
Hilbert spaces; these are all welded indissolubly to form the most
robust and fundamental objects in the universe.
Even with something as “simple” as an electron, order is
inescapable.</p>
<h4 id="unreasonable-effectiveness-and-natural-patterns">Unreasonable effectiveness and natural patterns</h4>
<p><span style="padding-left: 20px; display:block">
It is difficult to avoid the impression that a miracle confronts us
here, quite comparable… to the two miracles of the existence of
laws of nature and of the human mind’s capacity to divine them.
</span></p>
<div style="text-align: right"><i>Eugene Wigner</i> </div>
<p>It may feel like we have jumped from physical to
mathematical objects in one fell, tendentious swoop.
Do we need Hilbert space, or might another mathematical concept
suffice?
And does Hilbert space really exist, or is it merely a useful human
invention?
If the latter, why so useful?
This is intentionally designed to rhyme with our earlier statement
that order is a robustly explanatory feature of the world, and
distinct from the things that are ordered.
Mathematics really just is the study of order, or <em>patterns</em>, according to their own peculiar and abstract
logic.
Physics (and to a lesser extent the other sciences) studies <em>natural
patterns</em>, the way these structures or forms of order are realised in
the natural world.
That applies not just to emergent behaviour like phases of matter, but
even the crystalline makeup of an elementary particle.</p>
<p>I have tried to motivate this perspective from the nature of physical
explanation, but perhaps it can teach us about mathematical
explanation and its relation to the physical world.
A common criticism of Platonism is that, if mathematical objects exist
in some non-physical realm, the ability to do mathematics must involve
extrasensory perception. Clearly, since we are physical
beings, this ability is grounded in physical experience, and now we
have a simple explanation: patterns are naturally realised everywhere, from
cardinal numbers in counting cows to topology in tying a knot to
representation theory in colliding protons. We don’t need magical
access to the World of Forms to see these things; they are all around us.</p>
<p>Similarly, the
<a href="https://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html">unreasonable effectiveness of mathematics</a>
for describing the world, first noted by Eugene Wigner, seems no more
miraculous than the utility of integers for counting loaves of bread
rather than proving results about number theory.
We get the patterns from the world, clean them up, rebrand a little,
and start connecting them together.
The meta-patterns that emerge are remarkable, but the appearance of
“unreasonable effectiveness” is the result of a largely successful PR
campaign to divorce mathematical structures from their physical
origins. As Einstein quipped, “Since the mathematicians have invaded
the theory of relativity, I do not understand it myself anymore.”
The abstraction of pseudo-Riemannian geometry follows from the more
concrete act of bouncing light off mirrors.</p>
<p>More and more, we are seeing this converse of unreasonable
effectiveness, where deep mathematical ideas are inspired by physics.
The living embodiment of this trend is Ed Witten, a string theorist
whose contributions to mathematics have been so profound and
wide-ranging that he earned a Fields Medal (the Nobel prize in
mathematics), the only physicist to have ever done so! <!-- for his contributions to low-dimensional topology.-->
Once again, there is no mystery here; it is just the usual state of
affairs, but without the Platonist guff to distract us.
The patterns are out there and always have been.</p>
<h4 id="what-is-a-pattern">What is a pattern?</h4>
<p><span style="padding-left: 20px; display:block">
Everything comes to be from both subject and form.
</span></p>
<div style="text-align: right"><i>Aristotle</i> </div>
<p>All this raises the question: what is a pattern?
<!-- And how is it conjoined with stuff?-->
The first and most famous philosophical treatment of these issues is
the
<a href="https://plato.stanford.edu/entries/form-matter/">hylomorphism of Aristotle</a>,
who argued that objects are a compound of both form (the structure,
order, or patterns I have discussed here) and matter (energy or “raw
potentia”).
I won’t discuss Aristotle’s ideas in greater detail. Suffice to say they have
deeply informed this post, and the interested reader should check out James Franklin’s
<a href="https://link.springer.com/book/10.1057/9781137400734">modern take</a>.
<!-- for a modern take on Aristotelian structuralism applied.-->
Instead, I will approach the question by picking on two
smaller problems, taking Newton’s laws as a concrete example.</p>
<p>Newton formulated his laws of motion (such as $F = ma$) in terms of forces and
acceleration. Does the empirical robustness of these laws mean that
this is the only way to formulate them?
Not at all!
There are two other distinct but equivalent versions of classical
mechanics: Lagrangian and Hamiltonian. They explain
the same things, make the same predictions, and thus seem to describe
the same natural patterns. This suggests to me that although patterns
are discovered, formalisms are invented.
A pattern is the equivalence class of descriptions.</p>
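<p>To make the equivalence concrete, here is a minimal sketch for a single particle of mass $m$ moving in one dimension under a potential $V(x)$. This setup, and the symbols $L$, $H$ and $p$ for the Lagrangian, Hamiltonian and momentum, are illustrative assumptions rather than anything specific to the argument above. With $L = \tfrac{1}{2}m\dot{x}^2 - V(x)$ and $H = p^2/2m + V(x)$, the three formalisms read</p>
\[\begin{align*}
\text{Newton:} \quad & m\ddot{x} = F = -V'(x), \\
\text{Lagrange:} \quad & \frac{d}{dt}\frac{\partial L}{\partial \dot{x}} - \frac{\partial L}{\partial x} = m\ddot{x} + V'(x) = 0, \\
\text{Hamilton:} \quad & \dot{x} = \frac{\partial H}{\partial p} = \frac{p}{m}, \quad \dot{p} = -\frac{\partial H}{\partial x} = -V'(x).
\end{align*}\]
<p>Eliminating $p$ from the last line gives $m\ddot{x} = -V'(x)$ again, so all three descriptions pick out the same trajectories, even though the objects they are written in terms of look quite different.</p>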
<p>Students of physics will be aware that, although Hamiltonian and
Lagrangian mechanics are equivalent to Newton’s laws in the mechanical
context, they have taken on a life of their own.
The Lagrangian approach involves the mathematics of optimising
functions, while the Hamiltonian approach in its most abstract form
becomes the mathematical field of symplectic geometry.
Both Lagrangian and Hamiltonian mechanics can be upgraded (with some
inspired retrospective guesswork) to frameworks for quantum mechanics,
which Newton’s laws simpliciter cannot.
There is much more going on than a simple isomorphism of
description!
A more nuanced view is that humans invent formalisms which can agree
on a domain of interest, a restricted equivalence
class of explanation if you will. But the formalisms will tend to grow
beyond the selvage lines of the original use case.
Formalisms are only <em>perspectives</em> on patterns.
<!-- capture
different patterns, or suggest different extensions, in ways that can
depend sensitively on the formalism and the domain of application. --></p>
<p>This hints at certain structural “metalaws”.
Patterns are big and rhizomatic; human-invented mathematical
frameworks are a single
mathematical glance, if you like, and can only take in part of the pattern.
Even if formalisms agree on some domain, they will suggest different corridors of growth.
A rectangle can be characterised either as an equiangular quadrilateral or as a
parallelogram with diagonals of equal length, but the notions involved and
corresponding generalisations are distinct.
<!-- in the two characterisations., and connect along
different lines of development to broader ideas. -->
This also helps explain the phenomenon of deep connections between
apparently unrelated mathematical objects, sometimes only revealed by
a clever change of perspective.
It could be that there is a <em>paucity of structure</em>, so that by dumb
luck (and the <a href="https://en.wikipedia.org/wiki/Pigeonhole_principle">pigeonhole principle</a>), we often unknowingly describe the same
thing in a different guise.
But to my mind, it is more likely that patterns tend to sprawl and
overlap in complex ways.
<!-- , which also explains how different angles on
the same structure can look unrelated! -->
They are less like a few items of furniture in a crumbling
garret (paucity of structure) and more like the intertwined flora of
a tropical jungle.
<!-- And human mathematics typically cannot see the forest for the trees.
There are ways to talk about quantum mechanics without Hilbert spaces,
and particles without representation theory.
That does not mean that the corresponding patterns do not exist, but
rather, they can be described in other ways. --></p>
<p>The second issue is how accurate our descriptions must be.
We know that Newton’s laws are not exactly correct, and break down in
regimes far-removed from those of everyday experience, such as the
very small (where quantum mechanics applies) or the very fast (where
special relativity applies).
Does this mean we should stop believing in forces, or Lagrangians, or
Hamiltonians?
This is like the old Platonist quibble that there is no
such thing as a perfect circle in the real world, so we must be
reasoning about circles in some other realm.
In both cases, the pattern is only <em>approximately</em> realised in
nature, with bumps and fuzzy edges.
But approximation is itself subject to structural laws, exhibiting
patterns treated by mathematics (in, e.g., topology)
and physics (effective field theory).
Perhaps an even better example is statistics, which is literally all
about extracting structure from noisy realisations.
So structural approximations are clearly robust, lawlike and
explanatory, even if they are subtle.
Incidentally, this suggests another metalaw: patterns can stand in patterned
relations to other patterns.
<!-- This is also what emergence is all about! --></p>
<p>This ties back to our original question about the nature of physical
explanation.
Reductionism instructs us to boil things down to their smallest elements.
The Aristotelian view is that, really, we should be searching for
form and structure at whatever level they happen to occur.
This is not only the nature of emergence, but physics more broadly.
How else can we connect the study of the large-scale structure
of spacetime, quarks, bowling balls, planetesimals, or storm clouds?
Physicists almost never boil things down to their smallest elements!
Rather, it seems much more accurate to say that they look for patterns
“in the wild”.
(In contrast, mathematicians study patterns “in captivity”, which gives
them that air of artifice and pedigree.)</p>
<p>One upshot is that, for better or worse, physicists often wade into other
disciplines armed with the lasso of an Emergent Pattern to corral the apparent complexity.
See for
instance
<a href="https://www.penguinrandomhouse.com/books/314049/scale-by-geoffrey-west/">scaling laws</a>,
<a href="https://en.wikipedia.org/wiki/Self-organized_criticality">self-organised criticality</a>,
<a href="https://en.wikipedia.org/wiki/Small-world_network">small-world networks</a>,
and
<a href="https://www.englandlab.com/">thermodynamic explanations for life itself</a>.
They’re not always right (and they’re not always respectful), but
they are just doing their thang.</p>
<h4 id="conclusion">Conclusion</h4>
<p>I’ve argued that the nature of physical explanation is richer and less
boringly hierarchical than the reductionist would have us believe.
In order to explain the properties of shoes or particles, it seems not
only parsimonious but necessary to commit to the existence of
patterns in addition to the things which make those patterns up.
This not only jibes with (and ontologically grounds) the notion of
emergence, but also provides a handle on the metaphysics and
epistemology of mathematical explanation.
<!-- and its relation to the
physical world. -->
Put simply, mathematicians study patterns; physicists study natural patterns.
<!-- It tells us where math comes from, why it is unreasonably effective,
and to what extent it might be invented or non-unique.
Finally, I argued that none of this is spoiled by approximation, since
this is just another pattern. --></p>
<p>Clearly, I’ve left many questions unanswered.
Must patterns be instantiated in the physical world, and if not, where
do such patterns live?
What is the “mereology” that allows them to combine, or to recursively
describe their relationships?
And finally, what grounds the truth about patterns, in physics,
mathematics, or elsewhere?
Most of these I defer to Aristotle, though I hope to write more in future. <!-- I leave the systematic exploration of these questions to the future,-->
In the meantime, discussion and debate are welcome!</p>
<h4 id="acknowledgments-and-references">Acknowledgments and references</h4>
<p>I’d like to thank Leon Di Stefano for introducing me to Aristotelian
structuralism and many enriching conversations over the years.
His ideas <!-- (as articulated in
[this 2017 debate with James Fodor](https://www.youtube.com/watch?v=W0j25NteoXc))-->
inspired and informed this post.
I’ve also been heavily influenced by James
Franklin’s book,
<a href="https://link.springer.com/book/10.1057/9781137400734"><em>An Aristotelian realist view of mathematics</em></a>.
Aristotle himself writes with characteristic brevity on form and
matter in <a href="http://classics.mit.edu/Aristotle/physics.1.i.html"><em>Physics (i)</em></a>.
Finally, I fitfully consulted the SEP entries on
<a href="https://plato.stanford.edu/entries/scientific-reduction/">reductionism</a>
and
<a href="https://plato.stanford.edu/entries/structuralism-mathematics/">mathematical structuralism</a>.</p>
<hr />
<!-- quantamagazine.org/what-is-a-particle-20201112 -->
<!-- https://plato.stanford.edu/entries/scientific-reduction/-->
<!-- https://plato.stanford.edu/entries/structuralism-mathematics/ -->
<div class="footdef"><sup><a id="fn.1" name="fn.1" class="footnum" href="#fnr.1">Footnote 1</a></sup> <p class="footpara">
To be fair, as the quote suggests, the original eliminativists like Paul and
Patricia Churchland were much more interested in abolishing psychology than shoes.
</p></div>
<div class="footdef"><sup><a id="fn.2" name="fn.2" class="footnum" href="#fnr.2">Footnote 2</a></sup> <p class="footpara">
This is called <i>universality</i>, and can be explained using
renormalisation, the technical avatar of "zooming out".
</p></div>
<!--<div class="footdef"><sup><a id="fn.3" name="fn.3" class="footnum"
href="#fnr.3">Footnote 3</a></sup> <p class="footpara">
Particles can have other symmetries as well. An important class is
gauge symmetry, consisting of internal degrees of freedom.
, like a dial on a gauge. These gauge symmetries are crucial to formulating the
whole Standard Model, and explain, for instance, why an electron has -->
<!--charge. </p></div>-->David A WakehamFebruary 8, 2021. Some philosophical reflections on the nature of scientific explanation, structure, emergence, and the unreasonable effectiveness of mathematics.Binomial party tricks2021-02-06T00:00:00+00:002021-02-06T00:00:00+00:00http://hapax.github.io/mathematics/physics/hacker/binomial<p><strong>February 6, 2021.</strong> <em>Sketchy hacker notes on the binomial
approximation. The flashy payoff: party trick arithmetic for estimating
roots in your head.</em></p>
<h4 id="introduction">Introduction</h4>
<p>The binomial approximation is the result that, for any real $\alpha$,
and $|x| \ll 1$,</p>
\[(1 + x)^\alpha \approx 1 + \alpha x.\]
<p>The usual proof involves calculus.
Here, we present a sketchy shortcut and an elementary longcut, neither
of which involves calculus, strictly speaking.
We also derive the quadratic term, and end with a fun party trick for finding roots.</p>
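<p>Before proving anything, it can help to see the claim numerically. The following is a minimal sketch in Python; the helper name <code>binomial_approx</code> and the sample values of $\alpha$ and $x$ are arbitrary choices for illustration. It compares $(1 + x)^\alpha$ with $1 + \alpha x$ and prints the error, which should shrink roughly like $x^2$.</p>

```python
def binomial_approx(x, alpha):
    """First-order binomial approximation to (1 + x)**alpha."""
    return 1 + alpha * x

# Sample exponents (integer, negative, fractional, irrational-looking) and small x.
for alpha in (0.5, -1.0, 3.0, 2.718):
    for x in (0.1, 0.01, 0.001):
        exact = (1 + x) ** alpha
        error = abs(exact - binomial_approx(x, alpha))
        print(f"alpha={alpha:6.3f}  x={x:5.3f}  exact={exact:.6f}  error={error:.2e}")
```

<p>Shrinking $x$ by a factor of ten should shrink the error by roughly a factor of a hundred, consistent with the neglected terms starting at order $x^2$.</p>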
<h4 id="sketchy-shortcut">Sketchy shortcut</h4>
<p>We begin with the shortcut.
In an
<a href="https://hapax.github.io/maths/physics/hacks/exponential/">earlier post</a>,
I derived the following result for the exponential, and $|x| \ll 1$:</p>
\[e^x \approx 1 + x.\]
<p>Rather than go off and read the post, we can do even better and simply
<em>define</em> the exponential by this property.
If it’s true, then for any $r$, we can set $x = r/n$ for very large
$n$ to get</p>
\[e^r = (e^{r/n})^n \approx \left(1 + \frac{r}{n}\right)^n.\]
<p>In the limit of infinite $n$, the expression should be exact. And
indeed, this is the standard definition of $e^r$:</p>
\[e^r = \lim_{n\to\infty} \left(1 + \frac{r}{n}\right)^n.\]
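<p>As a quick sanity check on this limit, here is a small Python sketch. The function name <code>compound</code> and the test value $r = 1.5$ are just illustrative choices. It evaluates $(1 + r/n)^n$ for increasing $n$ and prints it next to the library value of $e^r$.</p>

```python
import math

def compound(r, n):
    """The finite-n expression (1 + r/n)**n, which should approach e**r."""
    return (1 + r / n) ** n

r = 1.5
for n in (10, 1_000, 100_000):
    # Print n, the finite-n value, and the library exponential for comparison.
    print(n, compound(r, n), math.exp(r))
```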
<p>Let’s proceed with a proof of the binomial approximation.
The natural logarithm is the inverse function, so that</p>
\[x = \log e^x \approx \log(1 + x).\]
<p>Recall that</p>
\[x^n = (e^{\log x})^n = e^{n\log x} \quad \Longrightarrow \quad \log x^n = n \log x.\]
<p>Thus, taking the logarithm of $(1 + x)^\alpha$, we have</p>
\[\log [(1+x)^\alpha] = \alpha \log (1+ x) \approx \alpha x,\]
<p>and hence</p>
\[(1+x)^\alpha \approx e^{\alpha x} \approx 1 + \alpha x.\]
<p>This works since all the corrections are at higher order in $x$.</p>
<h4 id="elementary-longcut">Elementary longcut</h4>
<p>This is a bit high brow, and we can get to the same conclusion using
simple algebra.
First note that, from the binomial theorem,</p>
\[(1 + x)^n = 1 + \binom{n}{1}x + \binom{n}{2}x^2 + \cdots + x^n \approx
1 + nx\]
<p>for $|x| \ll 1$, neglecting higher order terms which are much smaller.
So the binomial approximation is true for whole numbers $n$.
If we consider a fraction $q = m/n$, then $(1 + x)^q$ raised to the
power $n$ should equal</p>
\[(1 + x)^{qn} = (1 + x)^{m} \approx 1 + mx \tag{1}\label{m}\]
<p>by the binomial theorem.
Let’s assume</p>
\[(1 + x)^{q} \approx 1 + \beta x,\]
<p>with some higher order terms we can ignore.
Raising to the power $n$, we can use the binomial approximation for
$n$ to get</p>
\[(1 + x)^{qn} \approx (1 + \beta x)^n \approx 1 + \beta n x.\]
<p>Comparing to (\ref{m}), we find that $\beta = m/n$, and hence the
binomial approximation is true for positive rationals.
We can add negative powers using the geometric series:</p>
\[\frac{1}{1 - x} = 1 + x + x^2 + \cdots \approx 1 + x,\]
<p>so replacing $x$ with $-x$ gives $1/(1 + x) \approx 1 - x$, and hence for a negative rational $q = -m/n$,</p>
\[(1 + x)^q = \left(\frac{1}{1 + x}\right)^{m/n} \approx (1 - x)^{m/n} \approx 1 - \frac{m}{n}x = 1 + qx,\]
<p>as required. Finally, there is arbitrary real $\alpha$. This is
actually trivial, in some sense.
Unlike whole numbers (repeated multiplication), fractions (roots), or
negative numbers (reciprocals), an irrational power has no obvious
interpretation. The most reasonable thing to do is define it as a
<em>limit</em> of rational powers that approximate it:</p>
\[(1 + x)^r = \lim_{n \to \infty} (1 + x)^{q_n},\]
<p>where $q_n$ is a sequence of rational numbers (e.g. the decimal
expansion) approximating $r$.
In this case, the binomial approximation gives</p>
\[(1 + x)^r = \lim_{n \to \infty} (1 + x)^{q_n} \approx 1 + x \lim_{n
\to \infty} q_n = 1 + rx,\]
<p>and so the result holds for all real numbers.</p>
<h4 id="higher-terms">Higher terms</h4>
<p>It’s possible, if messy, to extend these methods to determine the next
term in the approximation.
We’ll do the longcut, and use big-O notation, with $O(x^3)$ in this
context meaning “terms with powers of $x^3$ or higher”.
The binomial theorem gives</p>
\[(1 + x)^n = 1 + nx + \frac{n(n-1)}{2} x^2 + O(x^3), \tag{2} \label{second}\]
<p>since the coefficient of the $x^2$ term is the number of ways of
choosing $2$ items (the $x$ terms) from $n$ items (the factors in the power).
For a rational $q = m/n$, we have</p>
\[(1 + x)^{qn} = (1 + x)^m = 1 + mx + \frac{m(m-1)}{2} x^2 + O(x^3),\]
<p>and if we assume</p>
\[(1 + x)^{q} = 1 + qx + \gamma x^2 + O(x^3),\]
<p>then the binomial theorem again gives</p>
\[(1 + x)^{qn} = \left[1 + qx + \gamma x^2 + O(x^3)\right]^n = 1 + nqx +
\left[n\gamma + \frac{n(n-1)}{2}q^2 \right]x^2 + O(x^3).\]
<p>The coefficient of the linear term $nq = m$ matches, but the quadratic
term requires more work. Comparing to (\ref{second}) and
rearranging for $\gamma$, we have</p>
\[\begin{align*}
\gamma & = \frac{1}{n}\left[\frac{m(m-1)}{2}- \frac{n(n-1)}{2}q^2\right]
=\frac{m(m-1)}{2n}- \frac{m^2(n-1)}{2n^2}
=\frac{q(q - 1)}{2}.
\end{align*}\]
<p>Thus, we find that to second order,</p>
\[(1 + x)^q = 1 + qx + \frac{q(q-1)}{2} x^2 + O(x^3)\]
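<p>The algebra for $\gamma$ is easy to fumble, so here is a hedged check using exact rational arithmetic in Python. The function name <code>gamma</code> and the sample pairs $(m, n)$ are illustrative choices. It solves the matching condition $n\gamma + \tfrac{1}{2}n(n-1)q^2 = \tfrac{1}{2}m(m-1)$ directly and prints the result next to $q(q-1)/2$.</p>

```python
from fractions import Fraction

def gamma(m, n):
    """Solve n*gamma + n(n-1)/2 * q**2 = m(m-1)/2 for gamma, where q = m/n."""
    q = Fraction(m, n)
    return (Fraction(m * (m - 1), 2) - Fraction(n * (n - 1), 2) * q**2) / n

for m, n in [(1, 2), (3, 5), (7, 4)]:
    q = Fraction(m, n)
    # The last two columns should agree exactly.
    print(q, gamma(m, n), q * (q - 1) / 2)
```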
<p>The extension to real and negative powers is easy. The extension to
higher terms in $x$ is not.
They obey something called the binomial series,</p>
\[(1 + x)^\alpha = \sum_{k = 0}^\infty \frac{\alpha(\alpha - 1)\cdots
(\alpha-k +1)}{k!} x^k,\]
<p>and I have no idea how to get this without calculus.
(One can use “analytic continuation” but this feels too much like
cheating to me, partly because it’s not clear why this continuation is
unique.)
Any tips appreciated!</p>
<h4 id="rooting-out-the-answer">Rooting out the answer</h4>
<p>The applications are many and various, but the simplest thing we can
try is quickly calculating powers $y^\alpha$.
The general trick is to find a power near $y$ that is simpler to
evaluate, factor out the simple answer, then use the binomial
approximation.
I think there are actually better ways to estimate positive powers,
but the binomial approximation really shines in the estimation
of roots.
It can even be a good party trick, depending on the kind of parties
you go to!</p>
<p>Suppose someone asks you to find the square root of $8$.
You look for a nearby perfect square, in this case $9$, then factor
eight into $9$ times one minus something small:</p>
\[\sqrt{8} = \sqrt{9\left(1 - \frac{1}{9}\right)} = 3 \left(1 - \frac{1}{9}\right)^{1/2}.\]
<p>We can take $\alpha = 1/2$ and $x = -1/9$ in the binomial
approximation, and see how we go, noting that</p>
\[\sqrt{1 - x} = 1 - \frac{1}{2}x - \frac{1}{8}x^2 + O(x^3).\]
<p>To first order, we get</p>
\[3 \left(1 - \frac{1}{9}\right)^{1/2} \approx 3\left[1 - \frac{1}{2} \cdot \frac{1}{9}\right]
= \frac{17}{6} \approx 2.83.\]
<p>To second order,</p>
\[3 \left(1 - \frac{1}{9}\right)^{1/2} \approx
3\left[1 - \frac{1}{2} \cdot \frac{1}{9} - \frac{1}{8} \cdot \frac{1}{9^2}\right]
= \frac{611}{216} \approx 2.829.\]
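<p>The two estimates above can be reproduced exactly with rational arithmetic. This is only a sketch: the function name <code>sqrt_estimate</code> and its <code>order</code> parameter are made up for illustration, and it assumes a positive integer input.</p>

```python
from fractions import Fraction

def sqrt_estimate(y, order=1):
    """Estimate sqrt(y) by writing y = s**2 * (1 + x) for a nearby perfect square s**2."""
    s = round(y ** 0.5)              # root of a nearby perfect square
    x = Fraction(y, s * s) - 1       # small correction, e.g. -1/9 for y = 8
    series = 1 + x / 2               # first-order binomial approximation
    if order >= 2:
        series -= x * x / 8          # quadratic term q(q-1)/2 with q = 1/2
    return s * series

print(sqrt_estimate(8))            # 17/6
print(sqrt_estimate(8, order=2))   # 611/216
```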
<p>The actual answer is $\sqrt{8} = 2.828$, so even the first term in the
binomial approximation is very good! We’ll finish with a somewhat more
involved example.
Let’s approximate the fifth root of six, $6^{1/5}$.
I only know one fifth power off the top of my head, $2^5 = 32$, and
this happens to be near $6^2 = 36$.
We can chain these observations together as follows:</p>
\[\begin{align*}
6^{1/5} = 36^{1/10} = 32^{1/10}\left(1 + \frac{1}{8}\right)^{1/10} & =\sqrt{2}\left(1 + \frac{1}{8}\right)^{1/10} \approx \sqrt{2} \cdot \left(1 + \frac{1}{10\cdot 8}\right).
\end{align*}\]
<p>At this point, we could separately approximate $\sqrt{2}$, but I
happen to know it’s about $1.414$, so I can divide by $80$ (or even
just $100$ for a quick mental estimate), and add them together to get</p>
\[\sqrt[5]{6} \approx 1.414 + \frac{1.414}{80} \approx 1.43.\]
<p>Consulting a calculator, this is correct to two decimal places!
With the power of the binomial approximation, you can do it in your head.</p>David A WakehamFebruary 6, 2021. Sketchy hacker notes on the binomial approximation. The flashy payoff: party trick arithmetic for estimating roots in your head.A simplicial generalisation of the Bloch ball2021-02-05T00:00:00+00:002021-02-05T00:00:00+00:00http://hapax.github.io/maths/physics/qc/unitary-orbits<p><strong>February 5, 2021.</strong> <em>I explore unitary orbits of density matrices
for finite-dimensional quantum systems. The upshot is a neat scheme
for representing orbits using simplices.</em></p>
<h4 id="introduction">Introduction</h4>
<p>The <a href="https://en.wikipedia.org/wiki/Bloch_sphere">Bloch sphere</a>
represents the space of pure states on a single qubit (see also
<a href="https://hapax.github.io/physics/mathematics/bloch/">this</a> recent
post).
The “Bloch ball” is the space of all <em>density matrices</em> on the qubit.
It fills in the Bloch sphere with concentric spheres of increasing
mixedness, and at the centre is the maximally mixed state $I_2/2$,
where $I_d$ will denote the $d \times d$ identity matrix.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/unitary1.png" />
</div>
</figure>
<p>Spheres arise naturally.
They carry the structure of the unitary group $\mathrm{U}(2)$ acting
on qubits, once we have modded out by the phase ambiguity:</p>
\[\frac{\mathrm{U}(2)}{\mathrm{U}(1)} \cong \frac{\mathrm{SU}(2)}{\mathbb{Z}_2} \cong \mathrm{SO}(3).\]
<p>Here $\mathrm{SU}(2)$ is a double cover of the rotation group $\mathrm{SO}(3)$, which
acts transitively on the sphere.
(The “double cover” part gives us spinors.)
Thus, spheres occur naturally as unitary orbits, and indeed, each
concentric sphere in the Bloch ball is such an orbit.
The question is whether this generalises nicely to higher dimensions.</p>
<h4 id="the-bloch-ball">The Bloch ball</h4>
<p>Let’s think about the Bloch ball in a little more detail.
Each density matrix $\rho$ is a $2\times 2$ matrix acting on the space
of qubits, which is positive and has unit trace.
Positivity just means that, for every state $|\psi\rangle$,</p>
\[\langle \psi | (\rho | \psi \rangle) \geq 0.\]
<p>Hence, $\rho$ is Hermitian, since the reality of this inner product implies</p>
\[\langle \psi | (\rho | \psi \rangle) = (\langle \psi | \rho^\dagger)
|\psi \rangle \quad \Longrightarrow \quad \rho = \rho^\dagger.\]
<p>In turn, this means that $\rho$ is unitarily diagonalisable,
i.e. $U^\dagger \rho U = \Lambda$ for some diagonal matrix $\Lambda$
and unitary matrix $U^\dagger U = UU^\dagger = I$.
It’s also clear that the eigenvalues, i.e. the diagonal entries of $\Lambda$, must be non-negative.
In fact, since the permutation matrices are unitary, we can arrange
the eigenvalues in decreasing size, so that every $2 \times 2$ density
matrix is unitarily equivalent to some matrix</p>
\[\Lambda(p) =
\begin{bmatrix}
p & \\
& 1-p
\end{bmatrix}\]
<p>for $p \in [1/2, 1]$.
The maximally mixed density $I_2/2$ has a trivial orbit, since it
always gets mapped to itself:</p>
\[U^\dagger I_2 U = U^\dagger U = I_2.\]
<p>We can measure the distance from this matrix to $\Lambda(p)$ using the
Frobenius norm, aka Hilbert-Schmidt norm.
This is just the usual vector norm where we treat a matrix $A = [a_{ij}]$ as a big vector:</p>
\[||A||^2 = \sum_{ij} |a_{ij}|^2 = \mbox{Tr}[A^\dagger A].\]
<p>Hence,</p>
\[\begin{align*}
||\Lambda(p) - \tfrac{1}{2}I_2||^2 & = \left|\left| \begin{bmatrix}
p - 1/2 & \\
& 1/2-p
\end{bmatrix} \right|\right|^2
= 2\left(p - \tfrac{1}{2}\right)^2.
\end{align*}\]
<p>It’s easy to see that any density matrix in the unitary orbit of $\Lambda(p)$
has the same distance, i.e. the distance is a class function, since we
can use $I_2 = U^\dagger I_2 U$:</p>
\[\begin{align*}
||U^\dagger \Lambda U - \tfrac{1}{2}I_2||^2 & =
\mbox{Tr}\left[(U^\dagger \Lambda U - \tfrac{1}{2}I_2)^\dagger (U^\dagger \Lambda U - \tfrac{1}{2}I_2)\right]\\
& =
\mbox{Tr}\left[U^\dagger (\Lambda - \tfrac{1}{2}I_2)^\dagger UU^\dagger (\Lambda - \tfrac{1}{2}I_2) U\right]\\
& =
\mbox{Tr}\left[(\Lambda - \tfrac{1}{2}I_2)^\dagger (\Lambda - \tfrac{1}{2}I_2) \right]
= ||\Lambda - \tfrac{1}{2}I_2||^2.
\end{align*}\]
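<p>The invariance can also be checked numerically. The sketch below uses only the Python standard library and hand-rolled $2\times 2$ matrices. The helper names, the particular parametrisation of the unitary, and the values $p = 0.8$, $\theta = 0.7$, $\phi = 1.3$ are all arbitrary illustrative choices.</p>

```python
import cmath
import math

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]

def unitary(theta, phi):
    """A sample element of SU(2), parametrised by two angles."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -cmath.exp(1j * phi) * s],
            [cmath.exp(-1j * phi) * s, c]]

def distance_squared(rho):
    """Squared Hilbert-Schmidt distance from rho to I_2 / 2."""
    return sum(abs(rho[i][j] - (0.5 if i == j else 0.0)) ** 2
               for i in range(2) for j in range(2))

p = 0.8
diagonal = [[p, 0.0], [0.0, 1 - p]]
u = unitary(0.7, 1.3)
rho = matmul(dagger(u), matmul(diagonal, u))   # U^dagger Lambda U

print(distance_squared(diagonal))   # distance of the canonical representative
print(distance_squared(rho))        # same value, up to rounding
print(2 * (p - 0.5) ** 2)           # closed form 2(p - 1/2)^2
```

<p>All three printed numbers should agree up to floating-point rounding, whichever angles are chosen.</p>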
<p>We can define distance between densities as the Hilbert-Schmidt norm
times a positive constant $C$.
We choose $C = \sqrt{2}$ so that for pure states with $p = 1$, the
associated distance is $r = 2(p - 1/2) = 1$.
In general, since each such $r$ is associated with a unique
$\Lambda(p)$, we conclude that the space of $2\times 2$ density
matrices is a ball consisting of concentric, transitive orbits of the
unitary group, with the pure states at $p = 1$, the maximally mixed
state at $p = 1/2$, and radius $r = 2(p - 1/2)$ for the orbit of $\Lambda(p)$.</p>
<h4 id="orbital-mechanics">Orbital mechanics</h4>
<p>A similar story holds in higher dimensions. Density matrices are
positive and unit trace, so each orbit in dimension $d$ has a canonical
representative of the form</p>
\[\Lambda = \mathrm{diag}(p_1, p_2, \ldots, p_d),\]
<p>where the positivity of $\rho$ and unit trace condition imply</p>
\[\sum_{i=1}^d p_i = 1, \quad p_i \geq 0,\]
<p>and we can arrange eigenvalues in descending order:</p>
\[p_1 \geq p_2 \geq \cdots \geq p_d \geq 0.\]
<p>The constraint that the eigenvalues sum to $1$ means that we only need
$p_1, p_2, \ldots, p_{d-1}$ to uniquely specify a canonical
representative $\Lambda(p_1, p_2, \ldots, p_{d-1})$.
We can repeat the calculations from above to show that $I_d/d$ has a
trivial orbit, and that any density matrix in the orbit of $\Lambda(p_1,
\ldots, p_{d-1})$ has a fixed distance to the mixed state:</p>
\[r^2(p_1, \ldots, p_{d-1}) = C_d\sum_{i=1}^d \left(p_i - \frac{1}{d}\right)^2,\]
<p>where we choose $C_d$ so that the pure states, with $p_1 = 1,
p_2 = \cdots = p_d = 0$, have distance $r = 1$.
For completeness, we note that</p>
\[C_d = \frac{d}{d - 1}.\]
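<p>Here is a minimal Python sketch of the radius function, using exact fractions. Rather than hard-coding a closed form for $C_d$, it fixes the normalisation directly from the condition that pure states sit at $r = 1$. The function names are illustrative, and it assumes $d \geq 2$.</p>

```python
from fractions import Fraction

def raw_distance_squared(eigenvalues):
    """Sum of (p_i - 1/d)^2: squared Hilbert-Schmidt distance to I_d / d."""
    d = len(eigenvalues)
    return sum((Fraction(p) - Fraction(1, d)) ** 2 for p in eigenvalues)

def radius_squared(eigenvalues):
    """Rescale the squared distance so that a pure state (1, 0, ..., 0) has r = 1."""
    d = len(eigenvalues)
    pure = [1] + [0] * (d - 1)
    return raw_distance_squared(eigenvalues) / raw_distance_squared(pure)

print(radius_squared([1, 0, 0]))                            # pure qutrit state: 1
print(radius_squared([Fraction(1, 3)] * 3))                 # maximally mixed: 0
print(radius_squared([Fraction(1, 2), Fraction(1, 2), 0]))  # 1/4
```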
<p>It’s a bit trickier to see what the orbits look like, but in the same
way that $I_d$ is fixed by the group $\mathrm{U}(d)$, we can read off
fixed subgroups from the eigenvalue decomposition.
For instance, a pure state has</p>
\[p_1 = 1, \quad p_2 = \cdots = p_d = 0.\]
<p>The first factor is fixed by $\mathrm{U}(1)$ (corresponding to global
phase), while the last $d - 1$ factors are fixed by $\mathrm{U}(d-1)$.
These act independently, so that the stabiliser of a pure state is
$\mathrm{U}(1) \times \mathrm{U}(d-1)$.
By the orbit-stabiliser theorem, the orbit of pure states has the (coset) structure</p>
\[\frac{\mathrm{U}(d)}{\mathrm{U}(1) \times \mathrm{U}(d - 1)}.\]
<p>Since $\mathrm{U}(d)$ has dimension $d^2$, this pure state orbit has
dimension</p>
\[d^2 - 1^2 - (d - 1)^2 = 2d - 2,\]
<p>and lies on the unit sphere $\mathbb{S}^{d^2-2}$ around the maximally mixed state in our Hilbert-Schmidt metric.
For $d = 2$, the orbit fills the whole sphere $\mathbb{S}^2$, in agreement with the Bloch sphere.
This seems rather nice, but in general, the orbits will be horrible.
First of all, spheres of radius $r < 1$ around the mixed state will
now be made up of uncountably many orbits, since there are uncountably
many sets of $p_i$ which solve</p>
\[r^2 = C_d\sum_{i=1}^d \left(p_i -\frac{1}{d}\right)^2\]
<p>for $r < 1$.
And orbits can be more elaborate for other eigenvalue structures.
For instance, if we lump the $p_i$ into $K$ sets, one for each <em>distinct</em> eigenvalue,</p>
\[P_1, P_2, \ldots, P_K,\]
<p>with multiplicity $\mu_J$ associated to eigenvalue $P_J$, then the
same argument as above shows that the coset structure is</p>
\[\frac{\mathrm{U}(d)}{\mathrm{U}(\mu_1) \times \cdots \times \mathrm{U}(\mu_K)},\]
<p>known to mathematicians as a <a href="https://en.wikipedia.org/wiki/Generalized_flag_variety#Partial_flag_varieties">partial flag variety</a>.
These orbits have dimension</p>
\[D = d^2 - \sum_{J=1}^K \mu_J^2,\]
<p>and lie on a sphere of radius</p>
\[r^2 = C_d\sum_{J=1}^K \mu_J\left(P_J - \frac{1}{d}\right)^2.\]
<p>Note that while mixed states are closer to the
maximally mixed state, unlike the Bloch ball, they do not lie inside the orbit of pure states.
Typically, they have more dimensions!
For instance, for a generic point with no symmetries (distinct $p_i$), the coset is of the form</p>
\[\frac{\mathrm{U}(d)}{(\mathrm{U}(1))^d}\]
<p>with dimension $d^2 - d$, so for $d > 2$, these are always bigger than
the pure state orbits.
It’s certainly possible to say more about this, but who wants to? It’s
a mess!</p>
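<p>As a quick sanity check on the dimension formula $D = d^2 - \sum_J \mu_J^2$, here is a minimal Python sketch. The function name <code>orbit_dimension</code> and the use of exact fractions for the eigenvalues are illustrative choices, not part of the argument above; exact arithmetic is used so that repeated eigenvalues are detected reliably.</p>

```python
from collections import Counter
from fractions import Fraction

def orbit_dimension(spectrum):
    """Dimension d^2 - sum_J mu_J^2 of the unitary orbit through a density
    matrix with the given eigenvalues (pass exact values, e.g. Fractions)."""
    d = len(spectrum)
    multiplicities = Counter(spectrum).values()  # the mu_J
    return d * d - sum(mu * mu for mu in multiplicities)

# Qutrit examples (d = 3): pure, maximally mixed, and generic spectra.
pure = [Fraction(1), Fraction(0), Fraction(0)]
mixed = [Fraction(1, 3)] * 3
generic = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
print(orbit_dimension(pure), orbit_dimension(mixed), orbit_dimension(generic))
# prints: 4 0 6
```

<p>The three outputs match $2d - 2$, the trivial orbit, and $d^2 - d$ for $d = 3$.</p>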
<h4 id="the-simplicial-wedge">The simplicial wedge</h4>
<p>Our modest goal will be to tidy up some of the mess.
The main observation is that the eigenvalues $p_i$ form a probability
distribution over $d$ outcomes.
If they had an arbitrary order, they would live on the standard
$(d-1)$-simplex $\Delta_{d-1}$, but because they are arranged in decreasing order,
they live on the simplicial “wedge”:</p>
\[W_{d-1} = \left\{(p_1, \ldots, p_d) : \sum_{i=1}^d p_i = 1, p_1 \geq
p_2 \geq \cdots \geq p_d \geq 0\right\}.\]
<p>Note that the subscript denotes the number of independent
parameters.
We can illustrate these ideas for $d = 2$:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/unitary2.png" />
</div>
</figure>
<p>We start with the $1$-simplex $\Delta_1$, and divide it in two to get the
wedge $W_1$.
The black dot at the top is the orbit of pure states, and the white
dot the maximally mixed state.
In general, the wedge $W_{d-1}$ is almost a quotient of $\Delta_{d-1}$
by its symmetry group, the set of permutations $S_d$.
But the wedge has literal “edge cases”, stabilised by subgroups of $S_d$ in a way
that mirrors the corresponding unitary orbits.
More precisely, if a point in $W_{d-1}$ is stabilised by $S_{\mu_1} \times
\cdots \times S_{\mu_K}$, then the corresponding coset structure for
the orbit is the partial flag variety</p>
\[\frac{\mathrm{U}(d)}{\mathrm{U}(\mu_1) \times \cdots \times \mathrm{U}(\mu_K)}.\]
<p>For instance, pure states have canonical representative</p>
\[(1, 0, 0, \ldots, 0) \in W_{d-1},\]
<p>which is stabilised by the subgroup $S_1 \times S_{d-1}$.
This correctly gives the coset orbit</p>
\[\frac{\mathrm{U}(d)}{\mathrm{U}(1) \times \mathrm{U}(d - 1)}.\]
<p>The maximally mixed state, and centroid of the whole simplex, has coordinates</p>
\[\frac{1}{d}(1, 1, \ldots, 1),\]
<p>and is stabilised by the full group $S_d$. As we expect, the orbit is
trivial.
We can see how this works for a qutrit below.
We start with the $2$-simplex $\Delta_2$, an equilateral triangle, and
cut out the wedge $W_2$:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/unitary3.png" />
</div>
</figure>
<p>At the top we have the pure states as usual, and the maximally mixed state at
the white centroid.
The grey dot represents the fully mixed state on two basis elements.
Note that, along the red edges, two coordinates agree, and in fact,
each represents a copy of $W_1$, coinciding at the centroid.
In general, orbit degeneracies occur precisely at sub-wedges $W_K$
with interiors parameterised by the coordinates $P_1, \ldots, P_K$
introduced above.
But when distinct sub-wedges coincide, we get even more degeneracy.
So, the apparent randomness of orbits is somewhat tamed by geometric
hierarchy.</p>
<p>Finally, to relate this back to spheres, the nice thing about using
the Frobenius norm is that the distance between a density matrix and
the maximally mixed matrix is just proportional to the Euclidean
distance on the wedge.
So we can literally draw concentric spheres emanating from the
centroid!
Our scheme does not do away with all the messiness of the orbits. But
it does provide a simple way to organise and read off some of their
basic properties, and generalises in a reasonably natural way the concentric spheres of the Bloch ball.</p>
<!-- https://en.wikipedia.org/wiki/Bloch_sphere -->
David A Wakeham
February 5, 2021. I explore unitary orbits of density matrices for finite-dimensional quantum systems. The upshot is a neat scheme for representing orbits using simplices.
Turning a thermometer into a sundial
2021-01-28T00:00:00+00:00
2021-01-28T00:00:00+00:00
http://hapax.github.io/mathematics/physics/everyday/diurnal
<p><strong>January 28, 2021.</strong> <em>I attempt to turn a thermometer
(or more specifically, data about the maximum daily temperature)
into a sundial. Though it fails on earth, it works on Mercury!</em></p>
<h4 id="introduction">Introduction</h4>
<p>The sun heats the earth up, and the earth radiates that heat back into
space. As the sun sets, less heat is delivered, and the maximum
temperature occurs when the two rates—heat delivered and heat
radiated—balance. In this post, we’ll work out how this simple
requirement relates maximum temperature to the latitude, time of year,
and time of day the maximum occurs, meaning that a thermometer can in
principle be used as a sort of sundial.
In practice, this is only the first step towards a realistic model,
but for the purpose of building narrative tension, I will let the
shortcomings of my approach unfold naturally.</p>
<h4 id="energy-balance">Energy balance</h4>
<p>Consider a small patch of the earth’s surface of unit area, at the
point it attains its maximum temperature $T_\text{max}$ in Kelvin.
According to the
<a href="https://en.wikipedia.org/wiki/Stefan%E2%80%93Boltzmann_law">Stefan-Boltzmann law</a>,
it radiates energy away with intensity</p>
\[I_\text{out} = \sigma T_\text{max}^4, \quad \sigma = 5.67 \times
10^{-8} \frac{\text{W}}{\text{m}^2 \text{ K}^4}.\]
<p>Since this is the maximum attained, it must equal the intensity of
incoming solar radiation $I_\text{in}$.
To a good approximation, this is the radiant intensity of sunlight
striking the earth’s surface head on, the so-called insolation
constant $I_0$, multiplied by a geometric term $\cos^2\vartheta$
(where $\vartheta$ is the angle the sunlight makes with the vertical
to the ground), and an albedo term $(1-a)$ to account for sunlight reflected
back:</p>
\[I_\text{in} = I_0 (1- a )\cos^2\vartheta.\]
<p>The insolation constant is $I_0 = 1367 \text{ W/m}^2$ [<sup><a id="fnr.1" name="fnr.1" class="footref" href="#fn.1">1</a></sup>].
The albedo of the earth is around $a = 0.3$, i.e. $30\%$ reflected
back into space on average, though this depends on cloud cover, snow,
and so on.
We will talk about $\vartheta$ more in a moment.
Setting $I_\text{in} = I_\text{out}$ when the maximum is obtained, we
find</p>
\[I_0 (1- a )\cos^2\vartheta = \sigma T_\text{max}^4. \label{balance} \tag{1}\]
<p>Thus, the maximum temperature is directly related to the length of shadow!</p>
<h4 id="geometry-and-heliometry">Geometry and heliometry</h4>
<p>Even more interesting is how $\vartheta$ is related to the earth-sun
geometry, and the parameters of latitude, time of year, and time of
day.
The point directly below the sun, called the <em>subsolar point</em>, travels
along some line of latitude around the earth, at an angle
$\theta_\text{sub}$ from the north pole which depends on the time of year.
Here is a basic picture of the setup:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/diurnal1.png" />
</div>
</figure>
<p>At either equinox, it coincides with the equator (red line).
At the (northern hemisphere’s) summer solstice, it runs along the Tropic of Cancer, about
$23.5^\circ$ north of the equator.
At the winter solstice, it lies $23.5^\circ$ south of the equator, on
the Tropic of Capricorn.
If we draw the orbit of the earth as a circle around the sun, with
$\varphi = 0$ at the winter solstice and increasing with time, then the
subsolar latitude, measured in radians from the north pole, roughly obeys</p>
\[\theta_\text{sub} = \frac{\pi}{2} + \left(\frac{2\pi}{360}\right) 23.5
\cos(\varphi).
\label{year} \tag{2}\]
<p>To calculate the angle $\vartheta$, we need two additional data
points: the latitude $\theta_\text{lat}$ of the observation point (measured from the north
pole) and the difference in longitude $\phi$ between the observation point and the
current subsolar point.
The angle $\phi$ simply measures time from solar noon.
To determine $\vartheta$, first note that if we draw the subsolar and
observation point on the same great circle of the earth, $\vartheta$ is clearly the
angle between the black lines, drawn from each point to the centre of
the earth [<sup><a id="fnr.2" name="fnr.2" class="footref" href="#fn.2">2</a></sup>]:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/diurnal2.png" />
</div>
</figure>
<p>This means we can easily determine $\cos\vartheta$ using vectors,
simply by taking the dot product.
To begin with, we write both points in spherical coordinates $(\theta,\phi)$, then convert to
Cartesian coordinates $(x, y, z)$:</p>
\[\begin{align*}
\mathbf{x}_\text{sub} (\theta_{\text{sub}}, 0) & = (\sin \theta_\text{sub}, 0, \cos\theta_\text{sub}) \\
\mathbf{x}_\text{obs} (\theta_{\text{lat}}, \phi) & = (\sin \theta_\text{lat}\cos\phi, \sin \theta_\text{lat}\sin\phi, \cos
\theta_\text{lat}).
\end{align*}\]
<p>We can immediately determine the dot product:</p>
\[\cos\vartheta = \mathbf{x}_\text{sub} \cdot \mathbf{x}_\text{obs} =
\cos\theta_\text{sub}\cos\theta_\text{lat} + \sin
\theta_\text{sub}\sin \theta_\text{lat}\cos \phi. \label{geohelio} \tag{3}\]
<p>Plugging this back into (\ref{balance}), we find a relationship
between maximum temperature $T_\text{max}$, time of year via
$\theta_\text{sub}$, latitude $\theta_\text{lat}$, and time of day, or
rather, time past solar noon $\phi$.</p>
<h4 id="real-data">Real data</h4>
<p>The question is: how does this stack up against real data?
I’ll take some local weather data.
In Vancouver, the latitude is $49.3^\circ$ north of the equator, corresponding to
an angle from the north pole of</p>
\[\theta_{\text{lat}} = \left(\frac{2\pi}{360}\right)(90 - 49.3) \approx 0.71.\]
<p>It’s $36$ days or about a tenth of a year since
the winter solstice, so from (\ref{year}), the subsolar latitude is</p>
\[\theta_\text{sub} = \frac{\pi}{2} + \left(\frac{2\pi }{360}\right) 23.5
\cos(0.1 \cdot 2\pi) \approx 1.9.\]
<p>This agrees with <a href="https://rl.se/sub-solar-point">real-time data</a> on
the subsolar point.
Finally, the <a href="https://www.timeanddate.com/weather/canada/vancouver/historic?month=1&year=2021">maximum temperature yesterday</a> was $7^\circ \text{ C} =
280 \text{ K}$, and cloud cover makes $a \approx 0.35$.
Thus, rearranging (\ref{geohelio}) and (\ref{balance}), we expect the
maximum to occur at a “time of day angle” $\phi$ given by</p>
\[\begin{align*}
\cos \phi & = \frac{\sqrt{\frac{\sigma
T_\text{max}^4}{I_0(1-a)}} - \cos\theta_\text{sub}\cos\theta_\text{lat}}{\sin
\theta_\text{sub}\sin \theta_\text{lat}} \\
& = \frac{\sqrt{\frac{(5.67 \times
10^{-8}) 280^4}{1367(1-0.35)}} - \cos 1.9\cos 0.71}{\sin 1.9\sin 0.71} \\
& \approx 1.41.
\end{align*}\]
<p>Hopefully the problem is clear.
The last term is bigger than one, and cannot possibly equal the
first term, which is a cosine!
If we plug in the time it peaked, a few hours after solar noon, we can
rearrange and solve to find a predicted maximum temperature of
$-80^\circ \text{ C}$!
So something is very wrong.</p>
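<p>The numbers in this section can be reproduced with a short Python sketch of equations (1) to (3). The function names are illustrative, the albedo is the cloudy-day estimate $a \approx 0.35$ quoted above, and taking the peak at two hours after solar noon ($\phi = \pi/6$) is an assumption standing in for "a few hours".</p>

```python
import math

SIGMA = 5.67e-8   # Stefan-Boltzmann constant, W / (m^2 K^4)
I0 = 1367.0       # insolation constant, W / m^2

def cos_zenith(theta_sub, theta_lat, phi):
    """Equation (3): cosine of the angle between sunlight and the vertical."""
    return (math.cos(theta_sub) * math.cos(theta_lat)
            + math.sin(theta_sub) * math.sin(theta_lat) * math.cos(phi))

def cos_phi(t_max, albedo, theta_sub, theta_lat):
    """Equations (1) and (3) rearranged for the time-of-day angle."""
    root = math.sqrt(SIGMA * t_max**4 / (I0 * (1 - albedo)))
    return ((root - math.cos(theta_sub) * math.cos(theta_lat))
            / (math.sin(theta_sub) * math.sin(theta_lat)))

def predicted_t_max(albedo, theta_sub, theta_lat, phi):
    """Equation (1) solved for the maximum temperature in kelvin."""
    c = cos_zenith(theta_sub, theta_lat, phi)
    return (I0 * (1 - albedo) * c**2 / SIGMA) ** 0.25

# Vancouver, roughly a tenth of a year after the winter solstice.
theta_lat = math.radians(90 - 49.3)
theta_sub = math.pi / 2 + math.radians(23.5) * math.cos(0.1 * 2 * math.pi)

c = cos_phi(280.0, 0.35, theta_sub, theta_lat)
t = predicted_t_max(0.35, theta_sub, theta_lat, math.pi / 6)  # assumed peak time
print(f"cos(phi) = {c:.2f} (must be at most 1 for a solution to exist)")
print(f"predicted T_max = {t:.0f} K = {t - 273.15:.0f} C")
```

<p>The first value exceeds one, and the second lands far below freezing, which is the failure described above.</p>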
<h4 id="conclusion">Conclusion</h4>
<p>We’ve neglected an important factor: the atmosphere.
This is the very same thing needed to explain why the temperature of
the earth is higher than expected from a simple energy
balance argument.
Basically, the atmosphere acts as a heat bath in contact with the
earth, allowing for greater maximal temperatures. It may be
possible to turn a thermometer into an accurate sundial using a
<a href="https://en.wikipedia.org/wiki/Idealized_greenhouse_model">simple greenhouse model</a>.
However, with parameters appropriately modified, our naive approach should
work on a planet without substantial
atmosphere like Mercury.</p>
<h4 id="acknowledgements">Acknowledgements</h4>
<p>Thanks to A.B. for asking when daily temperatures peak, and
suggesting this might depend on latitude.</p>
<hr />
<div class="footdef"><sup><a id="fn.1" name="fn.1" class="footnum" href="#fnr.1">Footnote 1</a></sup> <p class="footpara">
This comes once more from the Stefan-Boltzmann law (for the surface
temperature of the sun $T_\odot = 5800 \text{ K}$), and an inverse square
drop-off:
$$
I_0 = \sigma T_\odot^4 \left(\frac{R_\odot}{d}\right)^2 =
5.67 \times 10^{-8} \cdot 5800^4 \left(\frac{7 \times 10^5}{1.5\times
10^8}\right)^2\, \frac{\text{W}}{\text{m}^2}\approx 1400 \, \frac{\text{W}}{\text{m}^2},
$$
where $R_\odot = 7 \times 10^5 \text{ km}$ is the solar radius and $d
= 1.5 \times 10^8 \text{ km}$ the earth-sun distance.
</p></div>
<div class="footdef"><sup><a id="fn.2" name="fn.2" class="footnum" href="#fnr.2">Footnote 2</a></sup> <p class="footpara">
We are making the usual assumption that the sun is far enough away to
treat incoming rays as parallel. For the same reason, we ignore the
way radiant intensity changes (due to the inverse square law) with $\vartheta$.
</p></div>
<!-- http://www.bom.gov.au/products/IDV60901/IDV60901.95936.shtml
((60*12 )/(2*pi))*arccos((sqrt((5.6*10^(-8)*(273+7)^4)/(1367(0.65))) + cos(1.9)cos(2*pi*(40.7/360)))/(sin(1.9)sin(2*pi*(40.7/360))))
2*pi(90 - 23.6*sin(pi/2 + pi/6))/360
https://www.timeanddate.com/weather/canada/vancouver/historic?month=1&year=2021
https://www.sjsu.edu/faculty/watkins/diurnaltemp.htm
(1367(1-0.3)(\cos 1.9\cos 0.71 + \sin 1.9\sin 0.71 * cos(pi/6))^2/(5.67 \times 10^{-8}))^(1/4)
-->
<!--
Let's test this out on some real data.
Today, in a certain large city, the temperature peaked at $25.0^\circ
\text{ C}$ around $2.5$ hours after solar noon.
We will guess the city!
First, we note that it's around $36$ days or a tenth of a year since
the winter solstice, so from (\ref{year}), the subsolar latitude is
$$
\theta_\text{sub} = \frac{\pi}{2} + \left(\frac{2\pi }{360}\right) 23.5
\cos(0.1 \cdot 2\pi) \approx 1.9.
$$
Two and a half hours after solar noon translates to $2.5/24$ times a full rotation,
so $\phi \approx \pi/5$.
Putting these numbers into (\ref{geohelio}) and rearranging using
trigonometric identities, we get
$$
\cos\vartheta \approx 0.57 \sin (\theta_\text{lat} - 0.60).
$$
Inserting into (\ref{balance}) and rearranging yields
$$
\theta_\text{lat} = 0.60 + \sin^{-1}\left[\frac{1}{0.57}\sqrt{\frac{5.67 \times
10^{-8} (273+25)^4}{1367 (1- 0.3)}}\right] = 1.77,
$$
or in
-->
David A Wakeham
January 28, 2021. I attempt to turn a thermometer (or more specifically, data about the maximum daily temperature) into a sundial. Though it fails on earth, it works on Mercury!
Cashing a blank check
2021-01-26T00:00:00+00:00
2021-01-26T00:00:00+00:00
http://hapax.github.io/mathematics/statistics/everyday/check
<p><strong>January 26, 2021.</strong> <em>Suppose you find a blank check on the ground,
and unscrupulously decide to cash it in. If overdrawing gets you
nothing, how much should you cash it in for? Assuming wealth follows
the 80-20 rule, the answer is: almost nothing!</em></p>
<h4 id="introduction">Introduction</h4>
<p>In the film “Blank Check” (1994), 11-year-old Preston Waters is
handed a blank check, and cashes it in for a million dollars.
Luckily, this is precisely the amount of money that the check’s
signer, a convict attempting to launder his ill-gotten gains, has left
with the bank’s president.
But what if Preston overdrew, asking for, say, $10$ billion?
This would probably have raised the suspicions of the complicit
bank president and the check would have bounced altogether.
When I was a kid, I thought it was incredibly lucky for Preston to
find the check in the first place.
I now think drawing the precise amount of money held in trust is
infinitely luckier.
But this raises the question: if you find a blank check, and you don’t
want it to bounce, how much should you cash it in for?</p>
<h4 id="expected-return">Expected return</h4>
<p>I’ll assume we know nothing about the identity of the signer, and that
if they have a balance of $b$, and we make out the value of the check
to be $v$, then the check will bounce if $v > b$.
Our strategy will be to calculate the expected return for $v$ and then
maximise it.
If $f(b)$ is the probability distribution for bank balances, then the
expected return for $v$ is simply $v$ multiplied by the probability $b> v$:</p>
\[E(v) = v \int_v^\infty f(b) \, db = v[1 - F(v)] = v \bar{F}(v),\]
<p>where $F$ is the cumulative distribution function, and $\bar{F} =
1 - F$ the tail.
To maximise this, we assume the curve is smooth, differentiate and set
to $0$, using $\bar{F}' = -f$:</p>
\[E'(v) = \bar{F}(v) - vf(v) = 0 \quad \Longrightarrow \quad v = \frac{\bar{F}(v)}{f(v)}.\]
<p>Any $v$ which satisfies this equation is a stationary point of $E$, and hence a candidate for the maximum.</p>
<h4 id="long-and-short-tails">Long and short tails</h4>
<p>Now the question is how to model the distribution of bank balances.
This is the sort of thing expected to follow a power-law
curve like the
<a href="https://en.wikipedia.org/wiki/Pareto_distribution">Pareto distribution</a>,
the proverbial “80-20” curve.
This is simply defined by its power-law tails:</p>
\[\bar{F}(v) = \left(\frac{L}{v}\right)^\alpha,\]
<p>where $L$ is the minimum amount to keep a bank account open (say a
monthly fee), and $\alpha > 0$ is a shape parameter we will “leave blank” for the moment.
This is a well-defined tail, since it equals $1$ at $v = L$ and heads to zero as $v \to \infty$.
The probability density for $v \geq L$ is</p>
\[f(v) = -\bar{F}'(v) = \frac{\alpha L^\alpha}{v^{\alpha + 1}}.\]
<p>The optimal draw then obeys</p>
\[v = \frac{\bar{F}(v)}{f(v)} = \left(\frac{L}{v}\right)^\alpha \cdot
\frac{v^{\alpha + 1}}{\alpha L^\alpha} = \frac{v}{\alpha}.\]
<p>For $\alpha \neq 1$, the only solutions are $v = 0$ and $v = \infty$!
For $\alpha > 1$, we can plot the expected return $E(v)\propto
v^{1-\alpha}$, and see that it monotonically decreases, with the maximum at $v = L$.
Preston should only have asked for a few bucks!
But perhaps this is an artefact of the infinite power-law tail.
A more realistic choice is the <em>truncated</em> Pareto distribution, where
the power law is confined to $L \leq v \leq H$ for an upper limit $H$,
say the personal wealth of Jeff Bezos or Elon Musk.
The density for the truncated Pareto distribution is simply a
conditional probability, conditioned on being in the interval $[L, H]$:</p>
\[f(v) = \frac{\alpha L^{\alpha}v^{-(\alpha+1)}}{1 - (L/H)^\alpha},\]
<p>and the tail is</p>
\[\bar{F}(v) = \int_v^H \frac{\alpha L^{\alpha}b^{-(\alpha+1)}}{1 -
(L/H)^\alpha} \, db = \frac{(L/v)^\alpha - (L/H)^\alpha}{1 - (L/H)^\alpha}.\]
<p>Thus, we now have to solve</p>
\[v = \frac{\bar{F}(v)}{f(v)} = \frac{(L/v)^\alpha -
(L/H)^\alpha}{\alpha L^{\alpha}v^{-(\alpha+1)}} \quad \Longrightarrow
\quad v = (1-\alpha)^{1/\alpha} H.\]
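<p>The stationary point $v = (1-\alpha)^{1/\alpha} H$ can be checked by brute force. The following Python sketch maximises $E(v) = v\bar{F}(v)$ for the truncated Pareto tail on a uniform grid. The function names, the grid size, and the lower bound of ten dollars are illustrative assumptions; the upper bound of $10^{11}$ dollars is the rough figure used in this section.</p>

```python
def tail(v, alpha, low, high):
    """Tail of the truncated Pareto distribution on [low, high]."""
    cut = (low / high) ** alpha
    return ((low / v) ** alpha - cut) / (1 - cut)

def expected_return(v, alpha, low, high):
    """E(v) = v * P(balance >= v)."""
    return v * tail(v, alpha, low, high)

def best_draw(alpha, low, high, steps=100_000):
    """Brute-force maximiser of the expected return on a uniform grid."""
    grid = (low + (high - low) * i / steps for i in range(steps + 1))
    return max(grid, key=lambda v: expected_return(v, alpha, low, high))

LOW, HIGH = 10.0, 1e11   # assumed monthly fee and upper wealth limit, in dollars

for alpha in (0.5, 1.16, 1.49):
    v = best_draw(alpha, LOW, HIGH)
    print(f"alpha = {alpha}: best draw is about {v:.3g} dollars")
```

<p>For $\alpha = 0.5$ the grid maximum should sit near $(1 - 0.5)^{2} H = 0.25 H$, while for the two values with $\alpha > 1$ it collapses to the lower bound $L$.</p>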
<!-- Once again, the answer is independent of the lower bound.
, but
proportional to the upper bound, which as we take $H \to \infty$,
returns our original result. -->
<p>If $\alpha < 1$, then we do get a finite answer, proportional to the
upper bound, so for instance if $\alpha = 0.5$, and we take the upper
limit to be around 100 billion dollars, then Preston should ask for</p>
\[v \sim (1-0.5)^{1/0.5} \times 10^{11} = 25 \text{ billion dollars},\]
<p>or $0.25$ of some other reasonable guess for $H$.
But if $\alpha \geq 1$, the prefactor is zero or not a positive real number, so there is
no interior maximum, and as for the full
Pareto distribution, the maximum expected return occurs at $L$.
And indeed, wealth typically does obey an approximate Pareto
distribution with $\alpha > 1$.
For instance, the proverbial “80-20” rule corresponds to $\alpha
\approx 1.16$, and
<a href="https://www.sciencedirect.com/science/article/abs/pii/S0165176505002995">this analysis</a>
of the Forbes 400 richest Americans finds a shape parameter
of $\alpha = 1.49$.
So once again, a perfectly rational Preston Waters would ask only for the monthly fee!
But this would make for a far less entertaining movie.</p>
David A Wakeham
January 26, 2021. Suppose you find a blank check on the ground, and unscrupulously decide to cash it in. If overdrawing gets you nothing, how much should you cash it in for? Assuming wealth follows the 80-20 rule, the answer is: almost nothing!
A simple proof of the bus paradox
2021-01-26T00:00:00+00:00
2021-01-26T00:00:00+00:00
http://hapax.github.io/mathematics/statistics/everyday/paradox-bus
<p><strong>January 26, 2021.</strong> <em>The bus paradox states that, if buses arrive
randomly but on average every ten minutes, the expected waiting time is
ten minutes rather than five. I give a simple proof involving no
integrals or formal probability theory.</em></p>
<h4 id="introduction">Introduction</h4>
<p>The bus paradox (also called the waiting time or
<a href="https://en.wikipedia.org/wiki/Renewal_theory#Inspection_paradox">inspection paradox</a>)
is a counterintuitive result about waiting times between random events.
Suppose buses arrive randomly, with an average period of $\lambda$
between arrivals.
If you go to catch a bus, you might expect to wait a period
$\lambda/2$, since if the next bus arrives $\lambda/2$ after you do, and
the previous one arrived $\lambda/2$ before you (by symmetry), then the gap
between them is $\lambda$.
This reasoning is wrong, and rather unexpectedly, the expected wait
time is $\lambda$.
The goal of this post is to give a proof which does
not require any integrals or formal probability theory, and
makes the role of assumptions manifest.</p>
<h4 id="the-bus-loop">The bus loop</h4>
<p>We start by considering a circle of total length $L$, on
which we place $k$ points at random (white in the image below).
This models a length of time, such as the day, and the random arrival
of $k$ buses.
The average distance between points (going clockwise, for instance) is clearly</p>
\[\lambda = \frac{L}{k}.\]
<p>Let us place another point on the circle at random (black in the image
below).
This represents the commuter who wishes to catch a bus.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/bus1.png" />
</div>
</figure>
<p>Since we now have $k + 1$ points placed at random, the same reasoning
as above tells us that the average distance is</p>
\[\frac{L}{k +1} = \left(\frac{k}{k+1}\right)\lambda.\]
<p>Translating into the language of bus schedules, this means that if
buses have a fixed but random schedule over some length of time, with
average interarrival time $\lambda$, the expected wait time is <em>not</em>
$\lambda$, but rather, smaller than $\lambda$ by a factor of
$k/(k+1)$, where $k$ is the total number of buses over the period.</p>
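<p>This claim is easy to test by simulation. The Python sketch below scatters $k$ buses and one commuter uniformly on a loop of length $L$ and averages the clockwise distance from the commuter to the next bus. The function name <code>mean_wait</code>, the number of trials, and the fixed random seed are illustrative choices.</p>

```python
import random

def mean_wait(k, length=1.0, trials=20_000, seed=0):
    """Average clockwise distance from a random commuter to the next of
    k buses placed uniformly at random on a loop of the given length."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        commuter = rng.uniform(0, length)
        buses = (rng.uniform(0, length) for _ in range(k))
        # Clockwise gap to each bus, wrapping around the loop.
        total += min((bus - commuter) % length for bus in buses)
    return total / trials

for k in (1, 5, 50):
    lam = 1.0 / k  # average interarrival distance L / k
    print(f"k = {k}: wait / lambda = {mean_wait(k) / lam:.3f}, "
          f"k / (k + 1) = {k / (k + 1):.3f}")
```

<p>Up to sampling noise, the simulated ratio should track $k/(k+1)$, and so approach $1$ as $k$ grows.</p>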
<h4 id="the-bus-paradox">The bus paradox</h4>
<p>The bus paradox applies to a schedule which does not repeat.
Let us take $L, k \to \infty$ but leave $\lambda = L/k$ fixed.
We represent this by an infinitely large circle, with a straight edge,
in the image below.
Then the expected waiting time is</p>
\[\left(\frac{k}{k+1}\right)\lambda \to \lambda.\]
<p>Thus, the arrival of the commuter is equivalent to adding another random
bus. The corresponding interarrival period is modified, but only by a
factor $k/(k+1)$ which tends to $1$ as $k \to \infty$. This completes our simple proof of the bus paradox.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/bus2.png" />
</div>
</figure>
<p>It’s a little tricky, of course, to formulate what it means to place
the buses “uniformly” on an infinite line, and this is exactly what the
<a href="https://en.wikipedia.org/wiki/Poisson_point_process#Homogeneous_Poisson_point_process">Poisson process</a>
(and more generally <a href="https://en.wikipedia.org/wiki/Renewal_theory#Inspection_paradox">renewal theory</a>)
achieves.
But rather than introduce all this formal baggage, we can simply consider
the limit of the uniform process to arrive at the correct conclusion,
and with greater clarity than when the answer is concealed in thickets of algebra.</p>
<h4 id="conclusion">Conclusion</h4>
<p>The reasoning outlined in the introduction is not completely off the
mark. It applies when the buses arrive at fixed intervals $\lambda$,
and the commuter arrives at random.
The expected time to the previous bus $t_-$ and the expected time to
the next bus $t_+$ must add to give the interval $\lambda$ between
buses, and by time symmetry, they must be equal:</p>
\[t_+ + t_- = \lambda, \quad t_+ = t_- \quad \Longrightarrow t_+ = t_- = \frac{\lambda}{2}.\]
<p>In this case, there is a clear distinction between the stochasticity
of buses and commuters.
But when everything arrives randomly, a commuter becomes like just another
bus.</p>
<!-- So waiting time equals interarrival time. -->
<!-- When the buses are random, our argument explains why this argument
breaks down: the commuter is like another bus!
They are just another random point in the sequence, and must therefore
have the -->
<!-- There are a few other fun things we can do, however.
If we add $n$ commuters, for $n = o(k)$, then when they sprinkled
randomly among the buses, it is overwhelmingly likely that the next
thing to come along will be a bus rather than a commuter (with
probability $k/(k+n) \to 1$), and hence the expected wait time is
$$
\left(\frac{k}{k+n}\right)\lambda \to \lambda.
$$
But for finite $n$, the time to -->
David A Wakeham
January 26, 2021. The bus paradox states that, if buses arrive randomly but on average every ten minutes, the expected waiting time is ten minutes rather than five. I give a simple proof involving no integrals or formal probability theory.
Integrals from pyramids
2021-01-22T00:00:00+00:00
2021-01-22T00:00:00+00:00
http://hapax.github.io/mathematics/pyramid
<p><strong>January 22, 2021.</strong> <em>I present an elementary, first-principles
trick for integrating polynomials: splitting a hypercube into congruent pyramids.</em></p>
<h4 id="introduction">Introduction</h4>
<p>Derivatives compute slopes at a point.
Integrals compute areas under curves.
The former is a local operation, involving only information in a
neighbourhood of a point, while the latter is <em>global</em>, involving the
value of the function at different points.
This makes integration a lot harder than differentiation!</p>
<figure>
<div style="text-align:center"><img src="/images/posts/pyramid1.png" />
</div>
</figure>
<p>However, sometimes we have a shortcut for integrating: identifying an
integral with the volume of a solid.
A simple example is a linear function, $f(x) = mx$. When we integrate
from $x = 0$ to $x = b$, the area under the curve is just a triangle,
obeying $A = bh/2$ for height $h = mb$.
We can represent this reasoning in a picture:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/pyramid2.png" />
</div>
</figure>
<p>But what happens if we want to integrate $x^2$?
There doesn’t seem to be any analogous geometry, and we are forced to
do something fancy (like use the
<a href="https://en.wikipedia.org/wiki/Fundamental_theorem_of_calculus">fundamental theorem of calculus</a>)
if we want to find the area under the curve.</p>
<h4 id="a-triangular-warm-up">A triangular warm-up</h4>
<p>But it turns out we haven’t tried hard enough!
There is a simple geometric approach to integrating $x^2$ and all the
higher monomials $x^n$.
This lets us integrate any polynomial by simply adding monomial terms.
To see how to do this, let’s first think of the integral of a linear
function in a slightly different way.
Rather than as half a square, let’s slide the “height” of the triangle
down so it becomes isosceles.
The area is unchanged since $b$ and $h$ have now swapped roles.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/pyramid3.png" />
</div>
</figure>
<p>Now we double this triangle, and see it covers half of a square of
area $2bh$. Since twice the area of the triangle equals half the area
of this square,</p>
\[2A = \frac{1}{2} \cdot 2bh \quad \Longrightarrow \quad A = \frac{1}{2}bh.\]
<p>This may seem like a convoluted reinterpretation, but it generalises
in a lovely way to help us integrate polynomials.</p>
<h4 id="pyramids-and-hypercubes">Pyramids and hypercubes</h4>
<p>A hypercube or $n$-cube is a cube in $n$ dimensions.
Formally, we can view it as all points</p>
\[I^n = \{(x_1, x_2, \ldots, x_n) : x_i \in [0, 1]\} = [0, 1]^n.\]
<p>For instance, a $1$-cube is the unit interval $I = [0, 1]$, while a
$2$-cube is the unit square $[0 ,1]^2$.
The $3$-cube is what we usually mean by a “cube”.
Now, the length of the unit interval is $1$, the area of the unit
square is $1^2 = 1$, and the volume of the unit cube is $1^3 = 1$.
The pattern continues, with the volume simply given by the product of
the length of each side of the hypercube, $1^n = 1$.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/pyramid4.png" />
</div>
</figure>
<p>Let us now divide a hypercube in the following way: draw a point at
the centre, and from that point, draw a line to each corner.
These lines form the edges of hyperpyramids, each with an $(n-1)$-hypercube as its base,
which sounds a bit crazy but is actually very simple.
We illustrate for the simple cases below.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/pyramid5.png" />
</div>
</figure>
<p>These (hyper)pyramids are all congruent, i.e. have the same shape and size,
so to work out their volume, all we need to do is count how many there
are.
Since each pyramid has an $(n-1)$-cube or <em>face</em> as a base, this is the
same as counting faces.
But this is easy: along any dimension there are two faces,
corresponding to fixing $x_i = 0$ or $x_i = 1$ for some $i$.
Thus, there are $2n$ faces.
Just to check this makes sense, we have $2 \cdot 1 = 2$ “faces” or
endpoints for a line, $2 \cdot 2 = 4$ sides to a square, and $2 \cdot
3 = 6$ faces for a cube.
Thus, each pyramid has a volume</p>
\[V_n = \frac{1}{2n}.\]
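<p>As a numerical cross-check, here is a small Monte Carlo sketch in Python. It relies on one observation not spelled out above: after shifting the centre of the cube to the origin, a point lies in the pyramid over the face $x_1 = 1$ exactly when its first coordinate is positive and at least as large in absolute value as every other coordinate. The function name, sample size, and seed are illustrative.</p>

```python
import random

def pyramid_fraction(n, trials=100_000, seed=1):
    """Estimate the fraction of the unit n-cube occupied by the pyramid
    whose base is the face x_1 = 1 and whose tip is the cube's centre."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Coordinates measured from the centre of the cube.
        x = [rng.random() - 0.5 for _ in range(n)]
        # Inside the pyramid when x_1 dominates every coordinate in size.
        if x[0] >= max(abs(c) for c in x):
            hits += 1
    return hits / trials

for n in (1, 2, 3, 4):
    print(f"n = {n}: estimate {pyramid_fraction(n):.4f}, 1/(2n) = {1 / (2 * n):.4f}")
```

<p>The estimates should agree with $1/2n$ to within a percent or so at this sample size.</p>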
<p>To connect to our warm-up exercise, note that in two dimensions, the
pyramid is a triangle with a side as its base.</p>
<h4 id="slicing-pyramids">Slicing pyramids</h4>
<p>Let’s now focus on a single pyramid.
We can move along the line from the tip to the centre of the base, and
graph the area of the cross-section of the pyramid passing through that point,
parallel to the base.
Each slice will be a shrunken copy of the base itself.
As examples, on the square the “pyramid” is just a quarter triangle.
The cross-section is a line (a copy of the base, which is a side of the
square), which is increasing linearly in length.
Similarly, for a cube, the pyramid is a bona fide square-based pyramid,
and each slice is a square as well.
We draw some pictures below:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/pyramid6.png" />
</div>
</figure>
<p>As we go along, the side length of the slice will change linearly.
But the <em>area</em> will change in a way that depends on the dimension we
are working in! It stays linear on the square, since the slice has $2 - 1 =
1$ dimension.
For a cube with $n = 3$, the slice is a square whose area changes
<em>quadratically</em>.
The pattern continues, and in $n$ dimensions, slicing a pyramid
results in a cross-section which grows as $x^{n-1}$ for a parameter
$x$ going from $x = 0$ at the tip of the pyramid to $x = 1$ at the
base.</p>
<h4 id="integrating-monomials">Integrating monomials</h4>
<p>We can add up the area of each cross-section precisely by integrating
with respect to $x$.
The answer is not quite the volume of the pyramid, however, since the
distance from the tip of the pyramid to the centre of the base is
actually $d = 1/2$.
So $x$ is <em>twice</em> the actual distance.
If we want to integrate to find the volume, the correct “infinitesimal
width” of a cross-section is not $dx$, but $dx/2$.
The corresponding integral should then give us the volume we
calculated above:</p>
\[\int_0^{1} x^{n-1} \, \frac{dx}{2} = V_n = \frac{1}{2n} \quad \Longrightarrow \quad \int_0^{1} x^{n-1} \, dx = \frac{1}{n}.\]
<p>If instead of a unit hypercube, we have a cube of side length $b$,
then the volume of the whole hypercube is $b^n$, and hence the volume
of a pyramid is $b^n/2n$.
If we let our parameter $x$ go from $x = 0$ at the tip to $x = b$ at
the base, then once again it is twice the distance, and the same
reasoning shows that</p>
\[\int_0^{b} x^{n-1} \, dx = \frac{b^n}{n}.\]
<p>Thus, we have geometrically integrated an arbitrary monomial!</p>
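<p>To see how this extends to polynomials, here is a short worked example, with the polynomial chosen purely for illustration. Integration is linear, so we can apply the monomial rule term by term, using $n = 3$ for the $x^2$ term and $n = 2$ for the $x$ term:</p>
\[\int_0^b (3x^2 + 2x) \, dx = 3 \cdot \frac{b^3}{3} + 2 \cdot \frac{b^2}{2} = b^3 + b^2.\]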
<h4 id="acknowledgments">Acknowledgments</h4>
<p>Thanks to J.A. for a stimulating discussion of integration from first principles.</p>
David A Wakeham
January 22, 2021. I present an elementary, first-principles trick for integrating polynomials: splitting a hypercube into congruent pyramids.