<p><strong>Indescribably boring numbers</strong> (David A Wakeham)</p>
<p><strong>March 23, 2021.</strong> <em>I turn the old joke about interesting numbers into a
proof that most real numbers are indescribably boring. In turn, this implies
that there is no explicit well-ordering of the reals. The axiom of
choice, however, implies all are relatively interesting.</em></p>
<h4 id="introduction">Introduction</h4>
<p>It’s a
<a href="https://en.wikipedia.org/wiki/Interesting_number_paradox">running joke</a>
among mathematicians that there are no boring numbers. Here’s the
proof. Let $B$ be the set of boring numbers, and suppose for a
contradiction it is non-empty. Define $b = \min B$ as
the smallest boring number. Since this is a highly unusual property, $b$ is
interesting after all!
Joke it may be, but there is a sting in the tail. By thinking
about how the joke works, we will be led to some rather deep (and
perhaps disturbing) insights into set theory and what it can and
cannot tell us about the mathematical world.</p>
<h4 id="integers-and-rationals-are-interesting">Integers and rationals are interesting</h4>
<p>The joke implicitly uses the fact that “numbers” refers to “whole numbers”</p>
\[\mathbb{N} = \{0, 1, 2, 3, \ldots\}.\]
<p>If it didn’t, then the <em>minimum</em> we used to get our contradiction
wouldn’t always work!
For instance, say we work with the integers</p>
\[\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}.\]
<p>The set of boring integers $B_\mathbb{Z}$ may be unbounded below.
Does this cause a problem? Not really. We can just define the smallest
boring number as the smallest element minimising the <em>absolute value</em>, i.e.</p>
\[b = \min \text{argmin}_{k\in B_\mathbb{Z}} |k|.\]
<p>(The $\text{argmin}$ might actually give us two numbers, $\pm b$, so the negative one
is the smallest.) Thus, there are no boring integers.
What about boring rational numbers?
This is somewhat more elaborate, but if $B_\mathbb{Q}$ is the set of
boring rationals, we can define the “smallest” boring number as</p>
\[b = \min \text{argmin}_{p/q\in B_\mathbb{Q}} (|p| + |q|),\]
<p>where $p/q$ is a fraction in lowest terms.
Once again, there may be multiple minimisers of $|p| + |q|$, but only
a finite number, so we can choose the smallest.
We conclude there are no boring rationals.
This pattern suggests there are no boring real numbers.
We should be able to find some function with a finite number of
minima, and then choose the smallest, right?
I’m going to argue that no such function can ever be described. Then I’m
going to explain why it might exist anyway, depending on which axioms of set theory we use!</p>
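<p>The two “smallest boring number” rules above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample sets of “boring” numbers are invented for the example, and nothing here says which numbers are actually boring.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from fractions import Fraction

def integer_key(k):
    # Compare by absolute value first, then by value, so -k comes before +k.
    return (abs(k), k)

def rational_key(q):
    # Fraction keeps q in lowest terms with a positive denominator, so this is
    # |numerator| + |denominator|, with ties broken by the usual order.
    return (abs(q.numerator) + q.denominator, q)

def smallest(numbers, key):
    # The "min argmin" of the text: least element with respect to the key.
    return min(numbers, key=key)

# Hypothetical sets of boring numbers, purely for illustration.
boring_integers = {7, -7, 12, -30}
boring_rationals = {Fraction(3, 4), Fraction(-3, 4), Fraction(5, 2), Fraction(-7, 1)}

print(smallest(boring_integers, integer_key))    # -7
print(smallest(boring_rationals, rational_key))  # -3/4
print(sorted(range(-3, 4), key=integer_key))     # [0, -1, 1, -2, 2, -3, 3]
</code></pre></div></div>
<p>Every non-empty set of integers or rationals has a least element under these keys, which is all the joke needs.</p>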
<h4 id="most-real-numbers-are-boring">Most real numbers are boring</h4>
<!-- https://en.wikipedia.org/wiki/Definable_real_number -->
<p>“Boring” and “interesting” are subjective.
We’ll use something a tad more well-defined, and replace
“interesting” with <em>describable</em>.
A number is describable if it has some finite description, using
words, mathematical symbols, even a computer program, which uniquely singles out that number.
For instance, $\sqrt{2}$ is the positive solution of $x^2 = 2$, $\pi$
is the ratio of a circle’s circumference to its diameter, and $e$ is
the limit</p>
\[e = \lim_{n\to\infty} \left(1 + \frac{1}{n}\right)^n.\]
<p>It turns out that <em>almost every</em> real number is indescribable, or
“boring”, in our official translation of that term.
The argument is very simple, and proceeds by simply counting the
number of finite descriptions.
Each such description consists of a finite sequence of symbols
(letters, mathematical squiggles, algorithmic instructions), each of
which is an element of some very large alphabet of symbols.
For instance, the text</p>
\[\sqrt{2} \text{ is the positive solution of $x^2 = 2$.}\]
<p>can be converted into <a href="http://www.tamasoft.co.jp/en/general-info/unicode-decimal.html">(decimal) unicode</a> as</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>8730 50 32 105 115 32 116 104 101 32 112 111 115 105 116 105 118 101
32 115 111 108 117 116 105 111 110 32 111 102 32 120 94 50 61 50 46
</code></pre></div></div>
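<p>For ordinary unicode, the conversion is a one-liner in Python. This is a minimal sketch; the function names are my own, and I write the root sign as an escape code to avoid encoding trouble.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def unicode_points(description):
    # One decimal code point per symbol of the description.
    return [ord(symbol) for symbol in description]

def from_points(points):
    # Inverse map: turn the list of code points back into text.
    return "".join(chr(p) for p in points)

text = "\u221a2 is the positive solution of x^2=2."
points = unicode_points(text)

print(points[:4])                  # [8730, 50, 32, 105]
print(from_points(points) == text) # True
</code></pre></div></div>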
<p>Imagine some “super unicode” which lets us convert <em>any</em> symbol
into a number.
The super unicode alphabet may be arbitrarily large, so we will take it to
consist of <em>every</em> natural number $\mathbb{N}$.
Then a finite description using any symbols can be written as a sequence of
the corresponding natural numbers, a trick I will call “unicoding”.
To find the number of finite descriptions, we just count the sequences!
There is a nice scheme for showing that these are in one-to-one
correspondence with the natural numbers themselves, and hence
<em>countably infinite</em>.
We take a sequence, say</p>
\[(6, 2, 0, 5)\]
<p>and convert the first bracket and all commas into $1$s, and each number into
the corresponding number of $0$s:</p>
\[10000001001100000_2.\]
<p>In turn, this can be converted to decimal, $66144$.
Going in the other direction, any positive whole number can be written in
binary and then converted into a sequence:</p>
\[14265092 = 110110011010101100000100_2\]
<p>becomes $(0,1,0,2,0,1,1,1,0,5,2)$.
Thus, we have a simple, explicit correspondence between finite
sequences of natural numbers and the natural numbers themselves.
This basically completes the proof, for the simple reason that there
are <em>infinitely more</em> real numbers than there are natural numbers.
This is established by Cantor’s beautiful
<a href="https://en.wikipedia.org/wiki/Cantor%27s_diagonal_argument">diagonal argument</a>,
which I won’t repeat here.
The upshot is that, via unicoding and then the binary
correspondence, finite descriptions can only capture an
infinitesimally small fragment of the real numbers.
Most literally cannot be talked about.</p>
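<p>The correspondence between sequences and whole numbers described above can be sketched as follows. The function names are my own, and the sketch assumes non-empty sequences and positive integers, so that the binary string always starts with a $1$.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def encode(sequence):
    # A 1 for the opening bracket and for each comma, then n zeros for each entry n.
    bits = "".join("1" + "0" * n for n in sequence)
    return int(bits, 2)

def decode(number):
    # The binary digits of a positive integer start with 1, so splitting on "1"
    # gives an empty string first, followed by one run of zeros per entry.
    runs = bin(number)[2:].split("1")[1:]
    return tuple(len(run) for run in runs)

print(encode((6, 2, 0, 5)))  # 66144
print(decode(14265092))      # (0, 1, 0, 2, 0, 1, 1, 1, 0, 5, 2)
</code></pre></div></div>
<p>Decoding an encoded sequence returns the original, so no two descriptions are sent to the same number.</p>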
<!-- So, we conclude that most real numbers are boring. -->
<p>The set $B_\mathbb{R}$ includes almost every real number, though
quite definitely <em>not</em> every real number you can think of.
But, armed with our previous jokes, it’s tempting to think that we can
waltz in and make the same joke about $\mathbb{R}$, simply
plucking out the smallest element of $B_\mathbb{R}$.
Of course, that won’t quite work, because the set need not be bounded
below. So instead, suppose there is some explicit function $f$ such
that $b \in B_\mathbb{R}$ is the smallest minimizer of $f$, i.e.</p>
\[b = \min \text{argmin}_{x \in B_\mathbb{R}} f(x).\]
<p>If I knew $f$ explicitly, we’d have a description of $b$ after all. Contradiction!
But the contradiction here does not imply $B_\mathbb{R}$ is
empty. After all, most of $\mathbb{R}$ is indescribable for
simple set-theoretic reasons.
Instead, it means that there <em>cannot be any explicit function</em>
$f$. More generally, there cannot be any explicit rule which, given a
non-empty subset of $\mathbb{R}$, picks out a unique element of it. If there
were, we could apply it to $B_\mathbb{R}$ and get the same
contradiction.
(See Appendix A for discussion of the related <a href="https://en.wikipedia.org/wiki/Berry_paradox">Berry paradox</a>.)</p>
<h4 id="an-existential-aside">An existential aside</h4>
<p>There’s a loophole here. Our argument doesn’t establish that
$f$ doesn’t exist, just that it has no finite description. And
although it might seem weird to trust in the existence of something
that we can’t really talk about, we do just this with the real
numbers!
I believe in all the real numbers, even the ones I can never describe.
Is this reasonable?
It depends who you ask.
There is a philosophy of mathematics called
<a href="https://plato.stanford.edu/entries/intuitionism/">intuitionism</a> which
tells us that mathematics is a human invention, and therefore enjoins
us to only reason about the things we can construct ourselves. No
indescribable real numbers if you please!</p>
<p>I’m not sure about this “mathematical creationism”, and think there
are more things in the mathematical heavens than are dreamt of in
our finite human philosophy.
Why should human limitations be mathematical ones?
That said, it’s not the case that anything goes. We should have some
firm basis for believing in the existence of those things we can’t
discuss, and for the real numbers, the firm basis is drawing a
continuous line on a piece of paper, or thinking about infinite
decimal expansions. These are <em>models</em> of the real numbers,
concrete-ish objects which capture the essence of the abstract entity
$\mathbb{R}$. They convince us (or at least me) that there is nothing
magical stopping someone from drawing certain points on the line, or
continuing certain expansions forever.</p>
<p>Similarly, the indescribable things we would like to exist and reason
about in set theory might depend on our <em>models</em> of set theory!
I won’t get into the specifics, but an important point is there are
<em>many different models</em> of set theory, with different properties, and
it seems unlikely that any one model is right.
These properties are abstracted into <em>axioms</em>, formal rules about what
exists and what you can or can’t do with sets.
Because models of set theory are deep, highly technical constructions,
most of the time we go the other way round, and play around with
axioms instead. Only later do we go away and find models which support
certain sorts of behaviour.
The point of all this is to make it a bit less counterintuitive when I
say that the existence and properties of boring numbers depend on which axioms
we decide to use.</p>
<h4 id="all-real-numbers-are-relatively-interesting">All real numbers are relatively interesting</h4>
<p>So, let’s return to our problem of boring real numbers.
We argued there was no explicit, finitely describable rule for picking
an element out of $B_\mathbb{R}$.
But we can always make the <em>existence</em> of such a rule — describable
or not — an axiom of our theory!
There are two ways to go about doing this.
Note that in the first example of boring natural numbers, we used the
<em>minimum</em> of the set.
We had to be a bit more clever with the integers and rationals, but it
essentially boiled down to creating a special sort of <em>ordering</em> on
the set, so that any subset (including the boring numbers) has a
<em>smallest element</em>.
We wrote this in a complicated way as</p>
\[b = \min \text{argmin}_{x \in B} f(x)\]
<p>for some function $f$, but we could just as well write</p>
\[b = \min_{\mathcal{W}} B,\]
<p>where $\mathcal{W}$ denotes this ordering on the big set.
To be clear, for the integers it is</p>
\[0, -1, 1, -2, 2, -3, 3, \ldots\]
<p>and for the rationals it is</p>
\[0, -\frac{1}{1}, \frac{1}{1}, -\frac{2}{1}, -\frac{1}{2}, \frac{1}{2},
\frac{2}{1}, \ldots.\]
<p>This is called a <em>well-ordering</em>. Although it may not be describable,
we could simply require, as an axiom of set theory, that any set can
be well-ordered! More explicitly,</p>
<p><span style="padding-left: 20px; display:block">
Any set $A$ has a well-ordering $\mathcal{W}_A$ such that any non-empty subset
of $A$ has a unique minimum element with respect to $\mathcal{W}_A$.
</span></p>
<p>Although it doesn’t spoil our conclusion that most real numbers are
boring, such an axiom would allow us to turn the old joke into an
argument that all real numbers are <em>relatively interesting</em>, where
“relatively interesting” means that there is a finite description
where we are allowed to use the well-ordering $\mathcal{W}$.
The proof goes as you might expect: let $B^{\mathcal{W}}_\mathbb{R}$ be the set of relatively boring
numbers, i.e. numbers with no finite explicit description, even when
allowed to use the well-ordering $\mathcal{W}$.
Since $\mathcal{W}$ is a well-ordering, we can define</p>
\[b = \min_{\mathcal{W}} B^{\mathcal{W}}_\mathbb{R}.\]
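<p>This is a finite description of $b$ that uses $\mathcal{W}$, so $b$ is
relatively interesting after all, and $B^{\mathcal{W}}_\mathbb{R}$ must be empty.
End of proof!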
So, although most real numbers are strictly boring, with a
well-ordering all of them are relatively interesting.</p>
<h4 id="choosing-an-order">Choosing an order</h4>
<p>Well-ordering is not usually treated as an axiom.
Historically, set theorists prefer to use a simpler rule called the
<em>axiom of choice</em>, which is logically equivalent, as we will argue
informally in a moment, but somehow less suspect.
As Jerry Bona joked,</p>
<p><span style="padding-left: 20px; display:block">
The axiom of choice is obviously true and the well-ordering principle
obviously false.
</span></p>
<p>(Actually, Bona’s joke mentions a third equivalent form called <em>Zorn’s
lemma</em>, but it would confuse matters too much to explain.)
Loosely, the axiom of choice just says we can pick an element from a
non-empty set. Pretty reasonable huh? If a set is nonempty, it has an element, so
we can pluck one out.
In fact, it’s usually stated in terms of a <em>family</em> of sets $A_i$,
where the subscript $i$ ranges over some indexing set $I$:</p>
<p><span style="padding-left: 20px; display:block">
Given a family of nonempty sets $A_i$, $i \in I$, we can collect a
representative from each set, labelled $f_i \in A_i$.
</span></p>
<p>The well-ordering principle implies the axiom of choice, since I can
just take the union of all the sets $A_i$, well-order it with
$\mathcal{W}$, and then define $f_i = \min_{\mathcal{W}} A_i$.
That’s my set of representatives!
The other way round is conceptually straightforward.
To well-order a set $A = A_0$, start by choosing an element $f_0 \in
A_0$ by the axiom of choice. Then remove it to define a new set $A_1 =
A_0 - \{f_0\}$, and select another element $f_1 \in A_1$. Continue in
this way, at each stage simply deleting the element from the previous
stage and choosing a new one, using</p>
\[A_{n+1} = A_n - \{f_n\} = A_{n-1} - \{f_n, f_{n-1}\} = \cdots = A_0 -
\{f_i : i \leq n\}\]
<p>as long as the set is nonempty.
The well-ordering is simply the elements in the order we made the
choice:</p>
\[\mathcal{W}_A = \{f_0, f_1, f_2, \ldots \} = \{f_n \in A_n : A_n \neq \varnothing\}.\]
<p>There are two issues with this construction.
The first is that it might feel sketchy to use the axiom of
choice “as we go” to build the sets, rather than starting with a
pre-defined family. But no one said this wasn’t allowed!
Second, our method only seems to work for sets at most as large as the
natural numbers, since we indexed elements with $n \in \mathbb{N}$.
But we can extend it to an <em>arbitrary</em> set using a generalisation of
natural numbers called
<a href="https://en.wikipedia.org/wiki/Ordinal_number">ordinals</a>.
We loosely sketch how this is done in Appendix B.
Once the dust settles, we find that the axiom of choice is equivalent
to well-ordering.</p>
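<p>For a finite set, the delete-and-choose construction above can be sketched in Python. Here the argument called choose is a stand-in for a choice function; I pass in the built-in max purely so the output is reproducible, and of course no axiom is needed for finite sets. All names are my own.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def well_order(elements, choose):
    # Repeatedly choose an element of what remains, record it, and delete it.
    remaining = set(elements)
    order = []
    while remaining:
        f = choose(remaining)
        order.append(f)
        remaining.discard(f)
    return order

def least(subset, order):
    # Minimum of a non-empty subset with respect to the order of choice.
    rank = {x: i for i, x in enumerate(order)}
    return min(subset, key=rank.__getitem__)

order = well_order({"pi", "e", "sqrt2", "phi"}, choose=max)
print(order)                      # ['sqrt2', 'pi', 'phi', 'e']
print(least({"e", "pi"}, order))  # pi
</code></pre></div></div>
<p>Any other rule for choose would give a different, equally valid ordering; the point is only that each element is picked at exactly one stage.</p>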
<h4 id="conclusion">Conclusion</h4>
<p>The overarching theme of this post is how much mileage we can get
from a bad joke.
The answer: quite a lot!
We learned not only that there are no boring integers and rational
numbers, but via a simple counting argument, that the vast majority of
real numbers are indescribably boring.
This implies that there is no explicit way to well-order the reals.
On the other hand, by giving ourselves the ability (via the axiom of
choice) to pluck elements at will from non-empty sets, we are able to
supply the reals with a well-ordering. So, all reals are relatively
interesting, even if we can’t talk about them.</p>
<h4 id="acknowledgments">Acknowledgments</h4>
<p>As usual, thanks to J.A. for the discussion which led to this
post, and also for proposing an elegant mapping analogous to unicoding.</p>
<h4 id="appendix-a-the-berry-paradox">Appendix A: the Berry paradox</h4>
<p>Consider the phrase</p>
<p><span style="padding-left: 20px; display:block">
The smallest real number with no finite, explicit description.
</span></p>
<p>If “smallest” refers to an explicitly definable well-ordering of the
reals, then this would seem to pick out a unique number with a finite,
explicit description. Contradiction!
We used this to argue no explicit well-ordering exists.
But let’s compare this to the
<a href="https://en.wikipedia.org/wiki/Berry_paradox">Berry paradox</a>, which
asks us to consider the phrase</p>
<p><span style="padding-left: 20px; display:block">
The smallest positive integer not definable in under sixty letters.
</span></p>
<p>This phrase clocks in at under sixty letters, and would seem to define a
number.
Contradiction!
Since “smallest” here makes perfect sense (we are dealing with positive
integers), to resolve the Berry paradox, we must assume either (a)
there is no set $B$ of numbers not definable in under sixty letters,
analogous to the original boring number joke, or (b) Berry’s phrase
somehow fails to define a number.
The most popular solution seems to be (b), on the grounds that
referring to the set makes it some kind of “meta-definition”, rather
than a definition per se.</p>
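<p>Counting letters by hand is error-prone, so here is a quick check of the length of Berry’s phrase, assuming only letters (not spaces or punctuation) count towards the total.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>phrase = "The smallest positive integer not definable in under sixty letters"

# Count alphabetic characters only, ignoring the spaces between words.
letters = sum(1 for ch in phrase if ch.isalpha())

print(letters)  # 57
</code></pre></div></div>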
<p>Of course, this seems to be committed to a very specific notion of
“definition”, but the problem persists if we replace “definable” with
“meta-definable”, since the smallest non-meta-definable number is
really a meta-meta-definition.
Let $B^{(0)}$ be the set of numbers not definable in under sixty letters,
$B^{(1)}$ the numbers not meta-definable in under $70$ letters, and in
general, $B^{(n)}$ the numbers not meta${}^{(n)}$-definable in under
$60+10n$ letters.
We call any number outside the <em>intersection</em> of all these sets $\mathcal{B} =
\cap_{n\geq0} B^{(n)}$ “lim-definable”, since it is definable at some level of the hierarchy.
This is closed under the operation of going meta.
Now consider the phrase</p>
<p><span style="padding-left: 20px; display:block">
The smallest positive integer not finitely lim-definable.
</span></p>
<p>Since lim-definability is closed under going meta, as is “finite”,
this is <em>now a definition at the same level</em>.
Option (b) is no longer available to us, so only option (a)
remains, and it follows that, like the joke that began it all, <em>all
positive integers are finitely lim-definable</em>.
This is of course obviously true.</p>
<p>Our argument against an explicit well-ordering is very closely related
to the Berry paradox.
The point of considering lim-definability is that we can build the same
descriptive hierarchy for the real numbers, take the union, and rule
out option (b). This leaves two ways to avoid a contradiction: no
lim-definable ordering exists (involving some finite but unbounded
number of references to sets in the hierarchy), or like the Berry
paradox, every real is lim-definable.
But unlike the positive integers, we know from set theory that the
second option can’t be true!
We still have a countable number of lim-definitions, as we can argue
from unicoding.
So there must be no lim-definable ordering of the reals, and no
explicit well-ordering in particular.</p>
<h4 id="appendix-b-ordinals-and-the-axiom-of-choice">Appendix B: ordinals and the axiom of choice</h4>
<p>Ordinals are <em>sets</em> which we use to stand in for numbers.
The smallest ordinal is $0$, which is defined as the empty set
$\varnothing = \{\}$.
Each ordinal $\alpha$ has a unique successor $\alpha + 1$, defined by
simply appending a copy of $\alpha$ to itself:</p>
\[\alpha + 1 = \alpha \cup \{\alpha\}.\]
<p>To illustrate, we apply the successor operation to $0 = \varnothing$ a
few times:</p>
\[1 = 0 + 1 = \{0\}, \quad 2 = 1 + 1 = \{0,
1\}, \quad 3 = 2 + 1 = \{0, 1, 2\}.\]
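<p>The first few ordinals can be built literally as nested sets. The sketch below assumes the standard von Neumann convention, in which the successor of $\alpha$ is $\alpha$ together with $\alpha$ itself as one extra element, and uses Python’s frozenset since ordinary sets cannot contain sets. The names are my own.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def successor(alpha):
    # alpha + 1: all the elements of alpha, plus alpha itself as a new element.
    return alpha | frozenset([alpha])

zero = frozenset()
one = successor(zero)
two = successor(one)
three = successor(two)

print(one == frozenset([zero]))              # True
print(three == frozenset([zero, one, two]))  # True
print(len(three))                            # 3
</code></pre></div></div>
<p>Each finite ordinal built this way has exactly as many elements as the number it stands in for.</p>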
<p>Going on in this way gives us all the finite ordinals, but there are
also <em>infinite</em> ordinals. The smallest infinite ordinal, conventionally
denoted $\omega$, can be identified with the natural numbers:</p>
\[\omega = \{0, 1, 2, 3, 4, \ldots\}.\]
<p>It is called a <em>limit</em> ordinal since it is not the successor of any
ordinal. It is bigger than all the finite ones, $n <
\omega$. The successor is defined as before,</p>
\[\omega + 1 = \omega \cup \{\omega\} = \{0, 1, 2, \ldots, \omega\},\]
<p>thereby giving a precise meaning to “infinity plus one”!
We won’t say more about the structure of these ordinals. The main
point is that we can always “count” the elements in a set $A$ using
ordinals, no matter how big it is.
Let’s now return to the problem of proving the axiom of choice
implies that any set $A$ can be well-ordered.
The basic idea is to start with $0$, but keep on counting up “past
infinity”, defining</p>
\[A_{\alpha} = A_0 - \{f_\beta : \beta < \alpha\}\]
<p>for any ordinal $\alpha$. The resulting set of
representatives, labelled by ordinals, is</p>
\[\mathcal{W}_A = \{f_\alpha \in A_\alpha: A_\alpha \neq
\varnothing\},\]
<p>with $f_\alpha < f_\beta$ just in case the ordinals $\alpha < \beta$.
This is a well-ordering since the ordinals are themselves
well-ordered.
Now, we’ve skipped many important technical details, but the main
point was that the argument looks pretty similar to the previous one.
The difference is that we’ve replaced finite counting numbers with
potentially infinite ones!</p>
<!-- You may wonder if the contradiction here is coming from ambiguity in
the notion of "explicit describability".
That this can cause deep problems is illustrated by the
[Berry paradox](https://en.wikipedia.org/wiki/Berry_paradox), which
asks us to consider the following:
<span style="padding-left: 20px; display:block">
The smallest positive integer not definable in under sixty letters.
</span>
If $B_{60}$ is the set of positive integers not definable in under
sixty letters, it seems we have just defined its smallest elements in
fifty seven! This too is a contradiction. Many people try to resolve
this by arguing that it does not constitute a "definition"; I think it
is much simpler to following the boring number argument, and conclude
that $B_{60}$ doesn't exist. -->
<p><strong>Taking half a derivative</strong></p>
<p><strong>March 13, 2021.</strong> <em>Can you take half a derivative? Or π derivatives?
Or even √–1 derivatives? It turns out the answer is yes, and there are
two simple but apparently different ways to do it. I
show that one implies the other!</em></p>
<h4 id="introduction">Introduction</h4>
<p>In calculus, the regular derivative is defined as the local gradient
of a function:</p>
\[f'(x) = \frac{d}{dx} f(x) = \lim_{h\to 0}\frac{f(x+h)-f(x)}{h}.\]
<p>We will abbreviate this as $f' = Df$, understanding that $f$ is a function
of $x$ and $D$ differentiates with respect to $x$.
We can always differentiate again, and again, and in fact as many
times as we want. Using our new notation, we can write the $n$th
derivative as</p>
\[D (D \cdots (Df)) = D^n f.\]
<p>This is well-defined as long as $n$ is a whole number.
But what if we could consider other types of derivatives, say half a
derivative? Let’s call this $D^{1/2} = \sqrt{D}$. In the same way that
applying two ordinary derivatives gives the second derivative, it seems reasonable to hope that two half derivatives give
a full derivative:</p>
\[f' = \sqrt{D} \sqrt{D}f = Df \quad \Longrightarrow \quad \sqrt{D}
\cdot \sqrt{D} = D.\]
<p>What could half a derivative look like?</p>
<h4 id="to-be-continued">To be continued</h4>
<p>The easiest way to go about this is to use a trick called <em>analytic
continuation</em>.
This has a precise meaning in complex analysis, and we’re going to do
something similar in spirit, but not quite as rigorous.
The basic idea is to find some nice, specific function we can
differentiate $n$ times, and which happens to give us a nice answer in terms of $n$.
We then define the <em>fractional derivative</em> $D^\alpha$ acting on this
function by replacing $n$ with $\alpha$.
A sanity check will be that, for general $\alpha, \beta$, the
fractional derivatives obey</p>
\[D^\alpha \cdot D^\beta = D^{\alpha+\beta},\]
<p>so, e.g., two half-derivatives give a full derivative,
$\sqrt{D}\cdot \sqrt{D} = D$.
We call this property <em>multiplicativity</em> after the identical-looking
rule for indices.
There are two issues with this approach.
First, how do we extend the definition to general functions?
And second, are the definitions for different functions in agreement?
In general, the answers are very complicated, but in this post, I’ll
consider the two simplest methods for defining fractional derivatives.
This means we can talk about the functions they apply to, and check
they agree, without a huge technical overhead.</p>
<p>Our first nice function is the exponential $e^{\omega x}$.
Differentiating simply pulls down a factor of $\omega$ each time, so</p>
\[D^n e^{\omega x} = \omega^n e^{\omega x}.\]
<p>It’s very clear, then, how to define the fractional derivative acting
on this:</p>
\[D^\alpha e^{\omega x} = \omega^\alpha e^{\omega x}.\]
<p>Great! We can easily check the multiplicative property, assuming that
constants pass through the derivatives:</p>
\[D^\alpha D^\beta e^{\omega x} = \omega^\alpha D^\beta e^{\omega x} =
\omega^{\alpha + \beta} e^{\omega x} = D^{\alpha+\beta}e^{\omega x}.\]
<p>Now, you might think this is useless because we can only
take fractional derivatives of exponential functions.
But at this point, we introduce another assumption, namely that the
fractional derivatives are <em>linear</em>:</p>
\[D^\alpha (\lambda_1 f_1 + \lambda_2 f_2) = \lambda_1 D^\alpha f_1 + \lambda_2 D^\alpha f_2,\]
<p>where $f_1, f_2$ are functions and $\lambda_1, \lambda_2$ are constants.
In particular, let’s suppose this linearity applies to an <em>infinite</em>
collection of exponentials multiplied by constants $\lambda$, arranged
into an integral</p>
\[f(x) = \int_{-\infty}^\infty d\omega \, \lambda(\omega) e^{i\omega x}.\]
<p>Then by linearity,</p>
\[D^\alpha f(x) = \int_{-\infty}^\infty d\omega \, \lambda(\omega) D^\alpha
e^{i\omega x} = \int_{-\infty}^\infty d\omega \, \lambda (\omega)
(i\omega)^\alpha e^{i\omega x}. \tag{1} \label{exp}\]
<p>Functions which can be written this way are said to have a <em>Fourier
representation</em>, with the function $ \lambda (\omega)$ the <em>Fourier
transform</em>. Most reasonably well-behaved functions have one!
Let’s do a very simple example: the sine function, bane of high school
trigonometry classes everywhere.
What is its half derivative?
We start by writing sine in terms of exponentials as</p>
\[\sin(x) = \frac{1}{2i}(e^{ix} - e^{-ix}).\]
<p>We then take a half-derivative using our exponential rule and linearity:</p>
\[\sqrt{D} \sin(x) = \frac{1}{2i}(\sqrt{D} e^{ix} - \sqrt{D} e^{-ix}) = \frac{1}{2i}\left(\sqrt{i} e^{ix} - \sqrt{-i} e^{-ix}\right).\]
<p>There are a few things to note.
First, there is some ambiguity about which roots we choose.
With the principal values $\sqrt{\pm i} = e^{\pm i\pi/4}$ (arguments
between $-\pi$ and $\pi$), the result simplifies to the real function
$\sin(x + \pi/4)$, but other choices of root give complex answers, so
in general, half derivatives of real functions need not be real.
In general this ambiguity is harmless, and we just take the principal
values, but this issue will crop up again below in a subtle way.
Finally, observe that we can just as easily do crazy things like take
$i$ derivatives! We set $\alpha = i$, so the $i$th derivative of sine is</p>
\[D^i \sin(x) = \frac{1}{2i}\left(i^i e^{ix} - (-i)^i e^{-ix}\right) =
\frac{1}{2i}(e^{-\pi/2 + ix} - e^{+\pi/2 - ix}),\]
<p>since the principal values are</p>
\[i^i = e^{i (i \pi/2)} = e^{-\pi/2}, \quad (-i)^i = e^{i (-i \pi/2)} = e^{\pi/2}.\]
<p>I’m not sure if this has any applications, but it’s cute.
I invite the interested reader to take $\pi$ derivatives of sine. What
better way to celebrate $\pi$ day!</p>
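<p>The exponential rule for sine is easy to play with numerically. The sketch below assumes Python’s built-in complex power, which uses principal values; the function names are my own and are not part of any library.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import cmath
import math

def frac_deriv_exponential(alpha, omega):
    # D^alpha e^(omega x) = omega^alpha e^(omega x), with the principal power.
    coefficient = complex(omega) ** alpha
    return lambda x: coefficient * cmath.exp(omega * x)

def frac_deriv_sine(alpha):
    # sin(x) = (e^(ix) - e^(-ix)) / 2i, differentiated term by term by linearity.
    plus = frac_deriv_exponential(alpha, 1j)
    minus = frac_deriv_exponential(alpha, -1j)
    return lambda x: (plus(x) - minus(x)) / 2j

x = 0.3
# Sanity check: one full derivative of sine should reproduce cos(x).
print(frac_deriv_sine(1)(x), math.cos(x))
# The half derivative and the i-th derivative at the same point.
print(frac_deriv_sine(0.5)(x), frac_deriv_sine(1j)(x))
# Principal values of i^i and (-i)^i.
print((1j) ** 1j, (-1j) ** 1j)
</code></pre></div></div>
<p>Swapping in $\alpha = \pi$ gives the $\pi$th derivative of sine with no extra work.</p>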
<h4 id="fractorials">Fractorials</h4>
<p>Exponentials aren’t the only nice functions we can use to define
fractional derivatives.
In fact, a more common approach is to use <em>powers</em>.
The first function we encounter in high school is usually the identity
function, $f(x) = x$.
From there, we build up to polynomials $x^m$, and then arbitrary
powers $x^s$.
The derivative of a power has a very simple form:</p>
\[D x^s = s x^{s-1}.\]
<p>If we differentiate again, we bring down a factor of $s - 1$ and
reduce the index again. And so on and so forth. This leads to the expression for
$n$ derivatives:</p>
\[D^n x^s = s(s- 1) \cdots (s - n + 1) x^{s-n}.\]
<p>So far, this doesn’t look like something we can easily continue to
non-integer values of $n$.
But let’s assume for a moment $s$ is an integer.
Then we can write</p>
\[s(s- 1) \cdots (s - n + 1) = \frac{s(s - 1) (s-2) \cdots 1}{(s -
n)(s-n - 1) \cdots 1} = \frac{s!}{(s -n)!},\]
<p>where we have used the good old factorial function $s!$.
Thus, we can write</p>
\[D^n x^s = \frac{s!}{(s -n)!} x^{s-n}.\]
<p>To analytically continue this, we need a beautiful object called the
Gamma function $\Gamma$.
We’ll define it properly below, but for the moment, the
only properties we need are that (a) it agrees with the factorial
function at (shifted) integer values,</p>
\[\Gamma(k + 1) = k!;\]
<p>and (b) is defined for non-integer values as well. I like to think of it as the
“fractorial” because it makes sense for fractional arguments! In addition to
delightfully bad puns, the Gamma function lets us write</p>
\[D^n x^s = \frac{\Gamma(s + 1)}{\Gamma(s -n + 1)} x^{s-n},\]
<p>and immediately continue to the fractional derivative:</p>
\[D^\alpha x^s = \frac{\Gamma(s + 1)}{\Gamma(s -\alpha + 1)}
x^{s-\alpha}. \tag{2} \label{power}\]
<p>Too easy! Once again, we can check the multiplicative property:</p>
\[\begin{align*}
D^\alpha D^\beta x^s & = \frac{\Gamma(s + 1)}{\Gamma(s -\beta + 1)}
D^\alpha x^{s-\beta} \\
& = \frac{\Gamma(s + 1)}{\Gamma(s -\beta + 1)}
\cdot \frac{\Gamma(s - \beta + 1)}{\Gamma(s -\alpha - \beta + 1)}
x^{s-\beta - \alpha} \\
& = \frac{\Gamma(s + 1)}{\Gamma(s -\alpha -\beta + 1)}x^{s-\beta -
\alpha} = D^{\alpha+\beta} x^s.
\end{align*}\]
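<p>As a concrete instance of the check above, we can take two half derivatives of $x$ and confirm that the coefficients multiply to $1$, as they must for $Dx = 1$. This is a small sketch using the Gamma function from Python’s standard math module; the helper name is my own.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def power_coefficient(alpha, s):
    # Coefficient in D^alpha x^s = Gamma(s + 1) / Gamma(s - alpha + 1) * x^(s - alpha).
    return math.gamma(s + 1) / math.gamma(s - alpha + 1)

# Half derivative of x: the coefficient multiplying x^(1/2).
half = power_coefficient(0.5, 1)
# A second half derivative acts on x^(1/2) and should bring us to D x = 1.
full = half * power_coefficient(0.5, 0.5)

print(half, 2 / math.sqrt(math.pi))
print(full)
</code></pre></div></div>
<p>The first line prints the same number twice, up to rounding, since $\Gamma(3/2) = \sqrt{\pi}/2$.</p>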
<p>So this gives us another, evidently different way to define fractional
derivatives. It will apply to any sum or integral of powers of
$x$, for instance, infinite polynomials called <em>power series</em>, and
their close cousins the <em>Laurent series</em> which include reciprocal powers:</p>
\[\sum_{k = 0}^\infty a_k x^k, \quad \sum_{k = -\infty}^\infty b_k x^k.\]
<p>These cover a lot of ground, and there is an even more general object
called the <em>Mellin transform</em>, analogous to the Fourier transform. But
we won’t go there.
Instead, let’s do another simple example.
One of the interesting properties of the Gamma function is that it
blows up at the nonpositive integers:</p>
\[|\Gamma(-n)| = \infty, \quad n = 0, 1, 2, \ldots.\]
<p>This is actually essential to get sensible answers!
For instance, let’s take the derivative of a constant, $1 = x^0$.
Then according to our definition,</p>
\[D x^0 = \frac{\Gamma(0 + 1)}{\Gamma(0 -1 + 1)} x^{0 - 1} =
\frac{\Gamma(1)}{\Gamma(0)} x^{- 1} = 0,\]
<p>since the $\Gamma(0)$ in the denominator makes the whole thing vanish.
More intriguingly, these infinities sometimes <em>cancel</em> in sensible ways.
For instance, if we take a derivative of $1/x$, we should get
$-1/x^2$. If we plug $x^{-1}$ into our formula, it gives</p>
\[D x^{-1} = \frac{\Gamma(-1 + 1)}{\Gamma(-1 -1 + 1)} x^{-1 - 1} =
\frac{\Gamma(0)}{\Gamma(-1)} x^{-2}.\]
<p>Both the numerator and the denominator blow up, which should make us
queasy. But there is a trick here. It turns out that for any $z$,
the Gamma function obeys the <em>functional equation</em></p>
\[\Gamma(1 + z) = z\Gamma(z).\]
<p>Since $\Gamma(k + 1) = k!$, this gives the usual relation for factorials,</p>
\[k! = \Gamma(k + 1) = k\Gamma(k) = k \cdot (k - 1)!.\]
<p>It also gives the sneaky result $\Gamma(0) = (-1)\Gamma(-1)$. Both $\Gamma(0)$ and
$\Gamma(-1)$ blow up of course, but in the derivative of $1/x$, the
$\Gamma(-1)$ terms cancel, leaving $(-1)x^{-2} = -1/x^2$ as required.</p>
<h4 id="gamma-and-tongs">Gamma and tongs</h4>
<p>This all sounds great, but you might be wondering why the Gamma
function is the right way to extend the factorial function away from
whole numbers.
In fact, any old function that interpolates between them would also
work and satisfy the multiplicative property.
What we’re going to do in this last section is use the fractional
derivatives, defined using exponentials, to <em>derive</em> the Gamma
function continuation.
And in order to do this, we have to grit our teeth and define the
Gamma function in all its glory:</p>
\[\Gamma(s) = \int_{0}^\infty dt\, t^{s-1} e^{-t}.\]
<p>If you’re interested, you can find proofs of the functional equation and so on
<a href="https://en.wikipedia.org/wiki/Gamma_function">elsewhere</a>.
Instead, we’re going to make the sneaky change of variables $t =
\omega x$, yielding</p>
\[\Gamma(s) = x^{s} \int_{0}^\infty d\omega\, \omega^{s-1} e^{-\omega
x}.\]
<p>If we change $s \to -s$, and rearrange, we get a formula for $x^s$
in terms of exponentials:</p>
\[x^{s} = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\, \omega^{-(1+ s)}
e^{-\omega x}. \tag{3} \label{gamma}\]
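<p>Formula (3) is easy to test numerically when $s &lt; 0$, where the integral converges at $\omega = 0$. The sketch below uses a plain midpoint rule; the function name, the cutoff <code>upper</code> and the step count are my own choices for illustration, not part of the derivation.</p>

```python
import math

def power_from_exponentials(x, s, upper=60.0, steps=200_000):
    """Approximate (1/Gamma(-s)) * integral_0^upper w^{-(1+s)} e^{-w x} dw.

    Uses the midpoint rule. Assumes s < 0 and x > 0 so the integral converges,
    and that `upper` is large enough for the exponential tail to be negligible.
    """
    h = upper / steps
    total = 0.0
    for k in range(steps):
        w = (k + 0.5) * h                      # midpoint of the k-th subinterval
        total += w ** (-(1.0 + s)) * math.exp(-w * x)
    return total * h / math.gamma(-s)

x, s = 2.0, -1.5
print(power_from_exponentials(x, s), x ** s)   # the two numbers should agree closely
```

<p>For other values of $s$ the integrand can be more singular at the origin, so a crude rule like this one may need many more steps.</p>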
<p>Great! Now we just go ahead and use rule (\ref{exp}), with the hope we
will get rule (\ref{power}).
As usual, we proceed using linearity:</p>
\[\begin{align*}
D^\alpha x^{s} & = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\,
\omega^{-(1+ s)} D^\alpha e^{-\omega x} \\
& = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\,
\omega^{-(1+ s)} (-\omega)^\alpha e^{-\omega x} \\
& = \frac{(-1)^\alpha}{\Gamma(-s)}\int_{0}^\infty d\omega\,
\omega^{-(1+ s - \alpha)} e^{-\omega x} \\
& = \frac{(-1)^\alpha}{\Gamma(-s)} \cdot \Gamma[-(s-\alpha)]x^{s-\alpha},
\end{align*}\]
<p>where on the last line we used (\ref{gamma}), but with $s
-\alpha$ instead of $s$.
This isn’t quite what we want.
To make progress, we’ll take advantage of the <em>reflection
formula</em> for the Gamma function (derived <a href="https://hapax.github.io/mathematics/zeta/">here</a>
for instance):</p>
\[\Gamma(z) \Gamma(1 - z) = \frac{\pi}{\sin(\pi z)}.\]
<p>We can apply this to both $\Gamma(-s)$ and $\Gamma[-(s-\alpha)]$ to
get</p>
\[\begin{align*}
D^\alpha x^{s} & = (-1)^\alpha \frac{\sin(\pi
s)}{\sin[\pi(s-\alpha)]}\cdot \frac{\Gamma(s+1)}{\Gamma(s-\alpha + 1)} x^{s-\alpha}.
\end{align*}\]
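<p>Since signs are easy to lose in the reflection step, here is a quick numerical comparison of the Gamma ratio before and after applying the reflection formula. The values of $s$ and $\alpha$ are arbitrary non-integer test values, and the function names are mine.</p>

```python
import math

def before_reflection(s, alpha):
    # Gamma[-(s - alpha)] / Gamma(-s), the ratio left over after the integral
    return math.gamma(-(s - alpha)) / math.gamma(-s)

def after_reflection(s, alpha):
    # sin(pi s) / sin(pi (s - alpha)) * Gamma(s + 1) / Gamma(s - alpha + 1)
    sines = math.sin(math.pi * s) / math.sin(math.pi * (s - alpha))
    gammas = math.gamma(s + 1.0) / math.gamma(s - alpha + 1.0)
    return sines * gammas

s, alpha = 0.3, 0.5
print(before_reflection(s, alpha), after_reflection(s, alpha))
```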
<p>This is almost (\ref{power}), the thing we were after!
But there is this strange factor with sines out the front.
Recall the definition of sine in terms of complex exponentials.
This lets us write the funny factor as</p>
\[(-1)^\alpha \frac{\sin(\pi s)}{\sin[\pi(s-\alpha)]} = \frac{e^{\pi i
s} - e^{-\pi i s}}{(-1)^\alpha e^{\pi i (s-\alpha)} - (-1)^\alpha e^{-\pi i (s-\alpha)}}.\]
<p>It would be magical if that $(-1)^\alpha$ could somehow behave
differently and cancel the $\alpha$ terms floating around, right?
Well, turns out it does!
We can write $-1 = e^{\pm \pi i}$, and hence</p>
\[(-1)^\alpha = e^{\pm \pi i \alpha}.\]
<p>I won’t spell out the details, but if you look at <a href="https://hapax.github.io/mathematics/zeta/">this proof</a> of the reflection
formula, the two different terms in the sine arise from parts of an
integration contour which lie in almost the same place, but where we take
roots in different ways.
In particular, evaluating $(-1)^\alpha$ gives $e^{\pm \pi i \alpha}$
respectively, so they cancel the $\alpha$ terms after all.
The upshot is that our funny factor is just unity:</p>
\[\frac{e^{\pi i
s} - e^{-\pi i s}}{(-1)^\alpha e^{\pi i (s-\alpha)} - (-1)^\alpha
e^{-\pi i (s-\alpha)}} = \frac{e^{\pi i
s} - e^{-\pi i s}}{e^{\pi i \alpha} e^{\pi i (s-\alpha)} - e^{-\pi i \alpha}
e^{-\pi i (s-\alpha)}} = \frac{e^{\pi i
s} - e^{-\pi i s}}{e^{\pi i s} - e^{-\pi i s}} = 1.\]
<p>Thus, our exponential rule actually
reproduces the rule for powers of $x$ involving the Gamma
function! Now, to be clear, fractional derivatives are a big and
mathematically heavy topic, and I’ve only skimmed the surface.
But it’s neat that the two simplest approaches agree.</p>
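<p>To see the power rule in action, the sketch below applies half a derivative twice to $x$ and checks that the result is the ordinary derivative, $1$. The helper <code>power_rule</code> is an illustrative name; it simply encodes $D^\alpha x^s = \Gamma(s+1)/\Gamma(s-\alpha+1)\, x^{s-\alpha}$.</p>

```python
import math

def power_rule(s, alpha):
    """Apply D^alpha to x^s using the Gamma-function rule.

    Returns (coefficient, new_exponent), meaning
    D^alpha x^s = coefficient * x^new_exponent.
    """
    return math.gamma(s + 1.0) / math.gamma(s - alpha + 1.0), s - alpha

c1, s1 = power_rule(1.0, 0.5)   # half derivative of x: a multiple of x^(1/2)
c2, s2 = power_rule(s1, 0.5)    # half derivative of x^(1/2): a multiple of x^0
print(c1 * c2, s2)              # overall coefficient and final exponent
```

<p>The two coefficients multiply to $1$ and the exponent drops to $0$, so two half derivatives compose into one whole derivative, at least for this example.</p>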
<h4 id="acknowledgments">Acknowledgments</h4>
<p>Thanks to J.A. for chatting about fractional derivatives, and getting
me thinking about the simplest way to define them.</p>
<!-- Our exponential definition yields an *antiderivative* operator:
$$
D^{-1} e^{\omega x} = \frac{1}{\omega}e^{\omega x}.
$$
This is the usual antiderivative, except without the constant. -->David A WakehamMarch 13, 2021. Can you take half a derivative? Or π derivatives? Or even √–1 derivatives? It turns out the answer is yes, and there are two simple but apparently different ways to do it. I show that one implies the other!Why does E = mc²?2021-02-19T00:00:00+00:002021-02-19T00:00:00+00:00http://hapax.github.io/physics/mathematics/hacks/emcc<p><strong>February 19, 2021.</strong> <em>A self-contained derivation of the most famous
equation in physics. I start with a crash course on special
relativity, emphasizing the invariance of spacetime lengths, move on
to conservation laws, and end by considering the relativistic mechanics of an exploding bowling ball.</em></p>
<h3 id="contents">Contents</h3>
<ol>
<li><a href="#sec-1">Introduction</a></li>
<li><a href="#sec-2">Spacetime trigonometry</a></li>
<li><a href="#sec-3">Time dilation</a></li>
<li><a href="#sec-4">Factorising spacetime</a></li>
<li><a href="#sec-5">Velocity addition</a></li>
<li><a href="#sec-6">Conservation laws</a></li>
<li><a href="#sec-7">Mass effect</a></li>
<li><a href="#sec-8">The most famous equation</a></li>
<li><a href="#sec-9">Exercises</a></li>
</ol>
<h4 id="1-introduction-">1. Introduction <a id="sec-1" name="sec-1"></a></h4>
<p>I recently stumbled across the book “Why does $E = mc^2$?” by Brian
Cox and Jeff Forshaw in a used bookstore.
I realized, to my chagrin, that I didn’t know the answer!
As a theoretical physicist, this was somewhat embarrassing.
So, rather than buy the book, I decided I would make it my homework
not only to derive it from what I knew about special relativity, but
to try and write up my reasoning in a self-contained way.
This post is my homework.</p>
<p>A few preliminary comments. First, the exercise section is optional,
mostly designed to fill in details and connect to standard treatments of the subject.
Second,
<a href="https://www.fourmilab.ch/etexts/einstein/E_mc2/e_mc2.pdf">unlike Einstein</a>,
I have not made any references to the energy of light.
This leads to an argument which is longer but more
conceptually minimal.
Finally, these notes gave me the opportunity to dust off some old ideas
about how to present special relativity, guiding in particular my
choice to do everything in one spatial dimension.
I hope my mildly eccentric approach can be of benefit to others.</p>
<!-- Before we get cracking, I'll give a TLDR version.
Relativity is basically what you get when you allow space and time to
rotate into each other while keeping the speed of light unchanged.
If I send a clock in the same direction as a light ray, the light ray
appears to travel a shorter distance; the clock must slow down to make
sure the speed is unchanged.
Thus, moving clocks slow down. -->
<h4 id="2-spacetime-trigonometry">2. Spacetime trigonometry<a id="sec-2" name="sec-2"></a></h4>
<p>Relativity is really just the bizarro version of trigonometry.
To make this obvious, we’ll present Pythagoras’ theorem in an odd
way.
Suppose we have rulers, $x$ and $y$, oriented at right angles
[<sup><a id="fnr.1" name="fnr.1" class="footref" href="#fn.1">1</a></sup>],
and which both have evenly spaced marks.
An $x$-division need not equal a $y$-division; in general, one $y$-division will
correspond to $\Lambda$ units of $x$.
We can measure lengths, say of a plank of wood, in this system, by
simply recording the number of marks it takes up on ruler $x$, call it
$\Delta x$, and the number of marks taken up on $y$, called $\Delta
y$.
Pythagoras’ theorem means that, however we choose to orient the plank
of wood or the rulers themselves, we always find</p>
\[d^2(\Delta x, \Delta y) = \Delta x^2 + \Lambda^2 \Delta y^2 = L^2,\]
<p>for some fixed number $L$, depending only on the piece of wood we’ve
chosen to measure.
It seems reasonable to define $L$ as its length.
But even more importantly, the quantity $d^2$ is <em>invariant</em> under a
change in relative orientation.
We describe relative orientation more explicitly in <a href="#sec-9">Exercise 1</a>.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/emcc1.png" />
</div>
</figure>
<p>Relativity parallels this setup closely.
Michelson and Morley’s
<a href="https://en.wikipedia.org/wiki/Michelson%E2%80%93Morley_experiment">famous experiment</a>
in 1887 suggested that the speed of light does not depend on how fast
you are going when you measure it.
Einstein arrived at the same conclusion by thinking long and hard
about electrodynamics.
To measure the speed of light, we use two rulers, $x$ and $t$, though
the latter is usually called a “clock”.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/emcc2.png" />
</div>
</figure>
<p>The light travels between two points a distance $\Delta x$ apart in
time $\Delta t$, so the speed is $c = \Delta x/\Delta t$.
We can rewrite this suggestively as</p>
\[s^2(\Delta x, \Delta t) = \Delta x^2 - c^2 \Delta t^2 = 0.\]
<p>The analogy is hopefully clear.
The ratio between units of $x$ and units of $t$ is given by $c$.
The expression $s^2(\Delta x, \Delta t)$ defines a “spacetime length”,
obeying a spacetime version of Pythagoras’ theorem, namely that the
$s^2$ distance between events does not change even when we speed up or
slow down.
More precisely,</p>
\[s^2(\Delta x, \Delta t) = \Delta x^2 - c^2 \Delta t^2 =
\text{constant}, \tag{1} \label{s2}\]
<p>when $\Delta x$ and $\Delta t$ are the space and time separation of
any two events, as measured by an observer at constant speed.
As a special case, $s^2 = 0$ for a light ray travelling from $A$ to
$B$, whatever speed <em>we</em> are moving.
Hence, light always travels with velocity $c$.
But the implications of (\ref{s2}) are much broader!
See <a href="#sec-9">Exercise 4</a> for a discussion of what happens if we <em>only</em> ask for
invariance of $s^2 = 0$.</p>
<h4 id="3-time-dilation-">3. Time dilation <a id="sec-3" name="sec-3"></a></h4>
<p>We can use (\ref{s2}) to quickly deduce that time dilates
and length contracts.
Consider a clock which ticks out time $\tau$ in its own frame of
reference, i.e. where it is stationary.
We call this the <em>proper time</em>.
For a proper time interval $\Delta \tau$, the clock moves nowhere
($\Delta x= 0$), so
the spacetime length is</p>
\[s^2(0, \Delta \tau) = -c^2 \Delta \tau^2.\]
<p>If the clock moves at speed $v$ in our reference frame, then in time
$\Delta t$ (as measured by our clock), it moves a distance $\Delta x =
v \Delta t$.
Thus, the spacetime length is</p>
\[s^2(\Delta x, \Delta t) = \Delta x^2 - c^2 \Delta t^2 =
\left(\frac{\Delta x^2}{c^2\Delta t^2} - 1\right) c^2\Delta t^2 = -c^2\Delta
t^2\left(1 - \frac{v^2}{c^2}\right).\]
<p>But since the spacetime length is invariant, we have</p>
\[-c^2 \Delta \tau^2 = -c^2\Delta t^2\left(1 - \frac{v^2}{c^2}\right)
\quad \Longrightarrow \quad \frac{\Delta t}{\Delta \tau} =
\frac{1}{\sqrt{1-(v/c)^2}} = \gamma, \tag{2} \label{gamma}\]
<p>where we have defined the all-important <em>Lorentz factor</em> $\gamma$.
Note that $\gamma \geq 1$, so that less proper time ($\Delta \tau$)
passes for the clock than elapsed time ($\Delta t$) measured in our reference frame.
Thus, the moving clock appears to slow down, a phenomenon called <em>time
dilation</em>.
In <a href="#sec-9">Exercise 2</a>, we work out an implication for
moving rulers called <em>length contraction</em>.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/emcc4.png" />
</div>
</figure>
<p>Note that from (\ref{gamma}), a moving clock appears to stop at $v =
c$.
Put differently, no time passes for a light ray!</p>
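<p>To get a feel for the numbers, here is a minimal sketch that evaluates the Lorentz factor from (\ref{gamma}) and the proper time elapsed on a moving clock. The function names and the sample speed are my own choices for illustration.</p>

```python
import math

C = 299_792_458.0  # speed of light in metres per second

def lorentz_factor(v, c=C):
    """Return gamma = 1 / sqrt(1 - (v/c)^2) for a speed |v| < c."""
    if abs(v) >= c:
        raise ValueError("the Lorentz factor is only defined here for |v| < c")
    return 1.0 / math.sqrt(1.0 - (v / c) ** 2)

def proper_time(dt, v, c=C):
    """Proper time ticked off by a clock moving at speed v while we measure dt."""
    return dt / lorentz_factor(v, c)

print(lorentz_factor(0.6 * C))      # close to 1.25
print(proper_time(10.0, 0.6 * C))   # close to 8 seconds for 10 of ours
```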
<h4 id="4-factorising-spacetime-">4. Factorising spacetime <a id="sec-4" name="sec-4"></a></h4>
<p>There is a cute way to understand how measurements change when we
speed up or slow down.
Since the spacetime length is a difference of squares, we can
factorise it:</p>
\[s^2(\Delta x, \Delta t) = \Delta x^2 - c^2 \Delta t^2 = (\Delta x + c
\Delta t) (\Delta x - c \Delta t) = \Delta x^+ \Delta x^-,\]
<p>where $x^\pm$ represents the “combined rulers” $x \pm ct$.
Then $s^2$ will be invariant under changes of velocity provided that,
in a new frame of reference with rulers $x', t'$, we have</p>
\[(\Delta x')^+ = \alpha \Delta x^+, \quad (\Delta x')^- = \frac{1}{\alpha}\Delta x^-, \tag{3} \label{boost}\]
<p>for some factor $\alpha$.
To connect $\alpha$ to the relative velocity, we consider the moving clock experiment.
Remember that in the clock frame $\Delta x = 0$ and $\Delta t = \Delta \tau$, but in
our frame, $\Delta t' = \gamma
\Delta \tau$ and $\Delta x' = v \Delta t' = v\gamma
\Delta \tau$.
Thus, we have
<!-- These factors will cancel when we take
the product, so $s^2$ will indeed be invariant.
We can relate $e^{\eta}$ to $v$ by considering the clock example again.
In its own frame (rulers $x, t$), the clock moves nowhere ($\Delta x
= 0$) in $\Delta t = \Delta \tau$ tocks.
Hence, $\Delta x^\pm = \pm c\Delta \tau$.
In our frame (rulers $x', t'$), the clocks tocks over a period $\Delta t' = \gamma
\Delta \tau$, and travels a distance $\Delta x' = v \Delta t' = v\gamma
\Delta \tau$ while it does so.
So $e^\eta$ obeys = \frac{\Delta x' + c\Delta t'}{\Delta x' - c\Delta t'}\cdot \frac{ -c\Delta\tau}{+c\Delta\tau} --></p>
\[\alpha^{2} = \frac{(\Delta x')^+}{(\Delta x')^-} \cdot \frac{\Delta
x^-}{\Delta x^+} = \frac{(v + c)\gamma \Delta \tau}{(v - c)\gamma
\Delta \tau} \cdot \frac{ -c\Delta\tau}{+c\Delta\tau} =
\frac{c+v}{c-v}. \tag{4} \label{alpha}\]
<p>From this equation, we can deduce the rule for transforming quantities
between different frames, the <em>Lorentz transformation</em>.
The details are worked out in <a href="#sec-9">Exercise 3</a>.
In <a href="#sec-9">Exercise 4</a>, we also determine the more general class of transformations which
leave the speed of light fixed, but allow $s^2$ to vary for nonzero
values.</p>
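<p>The rescaling rule (\ref{boost}), with $\alpha$ fixed by (\ref{alpha}), can be tried out directly. The sketch below works in units where $c = 1$ by default; <code>boost</code> and <code>s2</code> are illustrative names, not anything standard.</p>

```python
import math

def boost(dx, dt, v, c=1.0):
    """Change frame using (dx')^+ = a * dx^+ and (dx')^- = dx^- / a.

    Here a^2 = (c + v) / (c - v) and dx^(+/-) = dx +/- c*dt.
    """
    a = math.sqrt((c + v) / (c - v))
    plus = a * (dx + c * dt)
    minus = (dx - c * dt) / a
    # Undo the change of variables: dx = (plus + minus)/2, c*dt = (plus - minus)/2
    return 0.5 * (plus + minus), 0.5 * (plus - minus) / c

def s2(dx, dt, c=1.0):
    """Spacetime length dx^2 - c^2 dt^2."""
    return dx ** 2 - (c * dt) ** 2

v = 0.6
dx_new, dt_new = boost(0.0, 1.0, v)        # one tick of a clock in its own frame
print(dx_new, dt_new)                      # approximately (gamma*v, gamma) = (0.75, 1.25)
print(s2(0.0, 1.0), s2(dx_new, dt_new))    # both approximately -1
```

<p>The clock tick picks up the expected factors of $\gamma$, and the spacetime length is unchanged.</p>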
<h4 id="5-velocity-addition-">5. Velocity addition <a id="sec-5" name="sec-5"></a></h4>
<p>We can use equation (\ref{alpha}) to chain together multiple changes of frame.
For instance, suppose a rocket moving at speed $v$ in our frame ($x'', t''$) launches a
clock at speed $u$ in its frame ($x', t'$).
<!-- For instance, suppose a rocket moves at speed $v$ to the right in our
frame ($x'', t''$), and launches a clock to the right at speed $u$ in
its frame ($x', t'$).
The clock frame has rulers $x, t$.
At what speed does the clock appear to travel in our frame?
Let's call this speed $u'' = \Delta x''/\Delta t''$. -->
The speed of the clock in our frame is $u'' = \Delta x''/\Delta t''$,
and obeys</p>
\[\frac{(\Delta x'')^+}{(\Delta x'')^-} = \frac{\Delta x'' + c\Delta
t''}{\Delta x'' - c\Delta t''} = \frac{u'' + c}{u'' - c}.\]
<p>But we can also just use (\ref{alpha}) twice:</p>
\[\begin{align*}
\frac{(\Delta x'')^+}{(\Delta x'')^-} & = \left(\frac{c+v}{c-v}\right)
\frac{(\Delta x')^+}{(\Delta x')^-} \\ & = \left(\frac{c+v}{c-v}\right)
\left(\frac{c+u}{c-u}\right) \frac{\Delta x^+}{\Delta x^-} \\ & = - \left(\frac{c+v}{c-v}\right)
\left(\frac{c+u}{c-u}\right).
\end{align*}\]
<p>Combining the last two equations, we find</p>
\[\frac{u'' + c}{u'' - c} = -\left(\frac{c+v}{c-v}\right)
\left(\frac{c+u}{c-u}\right) \quad \Longrightarrow \quad u'' = \frac{v + u}{1+ uv/c^2}, \tag{5} \label{add}\]
<p>after some algebra to isolate $u''$. This is the famous velocity addition formula!
I’ll let you check the algebra in <a href="#sec-9">Exercise 5</a>.</p>
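<p>A short numerical sketch makes the formula concrete. It works in units where $c = 1$ by default, and checks that the factors $(c+v)/(c-v)$ from (\ref{alpha}) simply multiply when frames are chained. The function names are mine.</p>

```python
def add_velocities(u, v, c=1.0):
    """Relativistic velocity addition: (u + v) / (1 + u*v/c^2)."""
    return (u + v) / (1.0 + u * v / c ** 2)

def alpha_squared(v, c=1.0):
    """The squared rescaling factor (c + v) / (c - v) for a change of frame."""
    return (c + v) / (c - v)

u, v = 0.5, 0.5
w = add_velocities(u, v)
print(w)                                                   # 0.8 rather than 1.0
print(alpha_squared(w), alpha_squared(u) * alpha_squared(v))
print(add_velocities(1.0, 0.5))                            # light still moves at c
```

<p>Half the speed of light plus half the speed of light comes out below $c$, and adding any velocity to $c$ gives back $c$.</p>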
<h4 id="6-conservation-laws">6. Conservation laws<a id="sec-6" name="sec-6"></a></h4>
<p>So far, we haven’t really done any physics, just bizarro trigonometry.
Let’s rectify that and introduce some ideas from Newtonian mechanics.
Suppose we have a bowling ball of mass $m$ and speed $v$.
Two ways to quantify its motion are <em>momentum</em> $p$, and the <em>kinetic
energy</em> $K$:</p>
\[p = mv, \quad K = \frac{1}{2}mv^2.\]
<p>A consequence of Newton’s laws
[<sup><a id="fnr.2" name="fnr.2" class="footref" href="#fn.2">2</a></sup>]
is that if the force on a bowling ball is zero, its momentum doesn’t
change. In fact, if the total force on a <em>collection</em> of bowling balls is zero, the
total momentum cannot change, even if they collide!
We say that momentum is <em>conserved</em>.
So if one bowling ball ($m_1, v_1$) collides with another ($m_2,
v_2$), the combined momentum is the same before and afterwards:</p>
\[P = p_1 + p_2 = m_1 v_1 + m_2 v_2.\]
<p>A sneakier conserved quantity is <em>mass</em>.
We usually assume $m_1$ and $m_2$ remain fixed, but if the bowling
balls shatter into parts, not only is the total $P$ conserved, but
also the sum of masses of the fragments.
In contrast, kinetic energy need not be conserved, since energy can
change forms, e.g. from kinetic to energy of deformation when the
bowling balls shatter.
We’ll return to conservation of energy below.</p>
<p>Let’s continue to assume that momentum and mass are conserved in any
frame of reference in special relativity, and see what that implies.
To make things concrete, we’ll use the example of an exploding bowling
ball.
Let’s start in the rest frame of the bowling ball, where the mass is
$2M_0$ (as measured by stationary scales), and the momentum $P_i = 0$, since the velocity is zero
by definition.
At some point, an explosive device inside the bowling ball detonates,
splitting it into two equal halves of mass $M_0$.
To ensure momentum is conserved, these zoom off with equal and
opposite velocities $\pm u$, so</p>
\[P_f = M_0u + M_0(-u) = 0 = P_i.\]
<p>Let’s now go to the frame of the part moving left at speed $u$.
Before the explosion, the bowling ball (in this frame of reference)
was moving at speed $u$ to the right, so the momentum was presumably</p>
\[P'_i = 2M_0u.\]
<p>After the explosion, one half is stationary (we have chosen to go to
its rest frame), and the other moves at a speed given by the velocity
addition formula (\ref{add}):</p>
\[u' = \frac{2u}{1 + u^2/c^2}. \tag{7} \label{double}\]
<p>If the second half has mass $M_0$, the momentum after the explosion is</p>
\[P'_f = \frac{2M_0u}{1 + (u/c)^2}.\]
<p>This is clearly different from the initial momentum $P'_i$ for nonzero $u$!
It looks, naively, as if conservation of mass and momentum are not
consistent with relativity after all.
You can check in <a href="#sec-9">Exercise 6</a> that this problem persists in other
reference frames.</p>
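<p>To see how large the mismatch is, here is a small sketch that evaluates the naive initial and final momenta for a few speeds, in units where $c = 1$ and $M_0 = 1$. The function name and sample speeds are illustrative.</p>

```python
def naive_momenta(u, M0=1.0, c=1.0):
    """Mass-times-velocity momenta in the rest frame of the left-moving half."""
    p_initial = 2.0 * M0 * u                     # whole ball moving at speed u
    u_prime = 2.0 * u / (1.0 + (u / c) ** 2)     # velocity addition for the other half
    p_final = M0 * u_prime                       # the stationary half contributes nothing
    return p_initial, p_final

for u in (0.001, 0.1, 0.6):
    p_i, p_f = naive_momenta(u)
    print(u, p_i, p_f, p_i - p_f)
```

<p>At everyday speeds the discrepancy is tiny, which is why Newtonian bookkeeping works so well, but it grows quickly as $u$ approaches $c$.</p>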
<!--
We said that mass is conserved in any given frame, but we never forbid
it from *changing between frames*! Perhaps, like time and length, the mass of an object can
change when it speeds up.
The formula for $P'$ can be balanced out if the mass *increases*, so
inspired by time dilation, we are going to guess that a mass $m_0$ in
the stationary frame increases as $m = \gamma m_0$ in a moving frame.
Let's see what whether the implications are consistent.
First of all, in the rest frame for the bowling ball, the two halves
zoom off with mass $M$ and at speed $u$.
This means their rest mass $m_0$ is *smaller* than $M$: -->
<h4 id="7-mass-effect">7. Mass effect<a id="sec-7" name="sec-7"></a></h4>
<p>But this is a little too quick.
Mass may be conserved <em>in a frame</em>, but it need not be invariant
<em>between frames</em>.
And we can get $P'_f$ to equal $P'_i$ by increasing the final mass.
Inspired by our results for time dilation and length contraction, we
guess that the rest mass $m_0$ (measured in the frame it is
stationary) increases as $m = \gamma m_0$ in a moving frame.
We can check this guess is sensible.
First, note that in the stationary frame of the unexploded bowling
ball, the exploded halves have a rest mass less than $M_0$:</p>
\[m_0 = \frac{M_0}{\gamma} =
M_0\sqrt{1-\left(\frac{u}{c}\right)^2}. \tag{8} \label{rest}\]
<p>In the moving frame, the original bowling ball moves at speed $u$, so
its mass is</p>
\[2M = 2M_0 \gamma,\]
<p>and hence its momentum is</p>
\[P'_i = 2M u = 2M_0 \cdot \frac{u}{\sqrt{1-(u/c)^2}}.\]
<p>After the explosion, one half is stationary, while the other half
moves away at speed $u'$ given by (\ref{double}). The associated
Lorentz factor is</p>
\[\begin{align*}
\gamma' = \frac{1}{\sqrt{1 - (u'/c)^2}}
& = \left[1 - \frac{4u^2}{c^2(1 + (u/c)^2)^2}\right]^{-1/2} \\
& = \frac{c(1 + (u/c)^2)}{\sqrt{c^2(1 + (u/c)^2)^2 - 4u^2}} \\
& = \frac{1 + (u/c)^2}{\sqrt{(1 - (u/c)^2)^2}} = \frac{1 +
(u/c)^2}{1 - (u/c)^2}.
\end{align*}\]
<p>The momentum for this second half is therefore</p>
\[\begin{align*}
P'_f = m_0 \gamma' u' & = M_0 \cdot \frac{\gamma' u'}{\gamma} \\
& = M_0 \cdot \frac{1 +
(u/c)^2}{1 - (u/c)^2} \cdot \frac{2u\sqrt{1-(u/c)^2}}{1 + u^2/c^2} \\
& =
2M_0 \cdot \frac{u}{\sqrt{1-(u/c)^2}} = P'_i.
\end{align*}\]
<p>With this rule, momentum is indeed conserved!
We give a more general argument that $m = \gamma m_0$ and $p = mv$ are
conserved in <a href="#sec-9">Exercise 7</a>.</p>
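<p>The bookkeeping in this section can be checked numerically. The sketch below, in units where $c = 1$ and $M_0 = 1$, computes the relativistic momentum and mass before and after the explosion in the frame of the left-moving half, assuming the rule $m = \gamma m_0$. The function names are mine.</p>

```python
import math

def lorentz_factor(v, c=1.0):
    return 1.0 / math.sqrt(1.0 - (v / c) ** 2)

def exploding_ball(u, M0=1.0, c=1.0):
    """Return (p_initial, p_final, mass_initial, mass_final) in the moving frame."""
    g = lorentz_factor(u, c)
    m0 = M0 / g                                   # rest mass of each half
    u_prime = 2.0 * u / (1.0 + (u / c) ** 2)      # speed of the half that is not at rest
    g_prime = lorentz_factor(u_prime, c)
    p_initial = 2.0 * M0 * g * u                  # whole ball: mass 2*M0*gamma, speed u
    p_final = m0 * g_prime * u_prime              # only the moving half carries momentum
    mass_initial = 2.0 * M0 * g
    mass_final = m0 + m0 * g_prime                # stationary half plus moving half
    return p_initial, p_final, mass_initial, mass_final

print(exploding_ball(0.6))   # approximately (1.5, 1.5, 2.5, 2.5)
```

<p>Both the momentum and the relativistic mass balance, which is what the guess $m = \gamma m_0$ was designed to achieve.</p>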
<h4 id="8-the-most-famous-equation">8. The most famous equation<a id="sec-8" name="sec-8"></a></h4>
<p>We’ve motivated the transformation law $m = \gamma m_0$, but we have
yet to explain why $E = mc^2$.
To see why, let’s return to the exploding bowling ball in its rest
frame.
Recall from equation (\ref{rest}) that the rest mass of the halves is actually slightly smaller
than $M_0$.
To see what’s going on, let’s consider the low-speed limit $u \ll c$.
Using the <a href="https://hapax.github.io/mathematics/physics/hacker/binomial/">binomial approximation</a>,
we have</p>
\[\left[1 - \left(\frac{u}{c}\right)^2\right]^{-1/2} \approx 1 + \frac{u^2}{2c^2}.\]
<p>Applying this to (\ref{rest}) gives</p>
\[M_0 = m_0 \left[1 - \left(\frac{u}{c}\right)^2\right]^{-1/2} \approx
m_0 + \frac{1}{2c^2}m_0 u^2.\]
<p>Remember that $M_0$ is fixed. As $u$ increases, the second term on the RHS
gets bigger, so the rest mass $m_0$ must get smaller.
That’s kind of weird!
It’s almost as if the mass is being converted into something else.
If we multiply this equation through by $c^2$, it becomes clearer what
this “something else” is:</p>
\[M_0c^2 \approx m_0c^2 + \frac{1}{2}m_0 u^2. \tag{9} \label{energy}\]
<p>The last term is just the classical kinetic energy of the fragment!
So mass seems to be converted into kinetic energy.
The maximum amount that can be converted into kinetic energy is
$M_0c^2$, and the leftover energy is $m_0 c^2$.
This suggests that the total energy of the body is $M_0c^2$.
Writing the relativistic mass as $m$ instead of $M_0$, we have the
most famous equation of all time:</p>
\[E = mc^2. \tag{10} \label{emcc1}\]
<p>This relation even tells us something about massless particles, as we explore in
<a href="#sec-9">Exercise 8</a>.
<!-- Rest mass also has energy content: $E_0 = m_0c^2$
Identities (\ref{emcc1}) and (\ref{emcc2}) slightly different interpretations, but are both avatars of
Einstein's famous equation. -->
You might wonder why $E = mc^2$ should be interpreted as the <em>total</em>
energy, rather than some special form of mass energy.
The answer is simply that, if we interpret it this way, the mysterious
“conservation of mass” we have been dragging around becomes
conservation of total energy!
And unlike kinetic energy, which can get converted into other things,
one of the fundamental principles of physics is that total energy is
conserved.
This also tells us why we continue to interpret $mc^2$ as total energy
even at high speeds, where we cannot interpret mass-energy as getting
converted into classical kinetic energy (since the binomial
approximation breaks down).
So everything hangs together nicely!
Hopefully you now have a sense of why Einstein’s famous formula is true.</p>
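<p>To see how good the low-speed split in (\ref{energy}) is, the sketch below compares the exact $M_0 c^2 = \gamma m_0 c^2$ with the approximation $m_0 c^2 + \frac{1}{2} m_0 u^2$, in units where $c = 1$ and $m_0 = 1$. The function name and sample speeds are illustrative.</p>

```python
import math

def energy_split(u, m0=1.0, c=1.0):
    """Return (exact, approx): gamma*m0*c^2 and m0*c^2 + (1/2)*m0*u^2."""
    gamma = 1.0 / math.sqrt(1.0 - (u / c) ** 2)
    exact = gamma * m0 * c ** 2
    approx = m0 * c ** 2 + 0.5 * m0 * u ** 2
    return exact, approx

for u in (0.01, 0.1, 0.5):
    exact, approx = energy_split(u)
    print(u, exact, approx, (exact - approx) / exact)   # last column: relative error
```

<p>The relative error is negligible at one percent of the speed of light and grows to a few percent by half the speed of light, where the binomial approximation starts to break down.</p>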
<h4 id="9-exercises">9. Exercises<a id="sec-9" name="sec-9"></a></h4>
<p><em>Exercise 1 (rotations).</em> To make the analogy to spacetime more
convincing, in this exercise we’ll describe relative rotations more
explicitly.
Let’s take our original perpendicular rulers $x, y$ and rotate
them anticlockwise by some angle $\theta$ into new rulers $x', y'$,
keeping the origin fixed for the moment.
Mark a point a distance $d$ along the $x$ axis.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/emcc3.png" />
</div>
</figure>
<p>In the $x', y'$ system, we define functions $\cos(\theta)$ and
$\sin(\theta)$ by</p>
\[x' = d\cos (\theta) = x\cos (\theta), \quad \Lambda y' = -d\sin (\theta)
= -x\sin (\theta),\]
<p>where $x$ denotes the $x$-coordinate of the point.</p>
<p><span style="padding-left: 20px; display:block">
(a) Argue that a point on the $y$ ruler, $d$ marks along, goes to
coordinates
</span></p>
\[x' = d\Lambda \sin(\theta) = y\Lambda \sin(\theta), \quad y' =
d\cos(\theta) = y\cos(\theta).\]
<p><span style="padding-left: 20px; display:block">
(b) Use the equations above to show that, if we move the $x, y$ system
around and then rotate by $\theta$, the displacements $\Delta x$ and
$\Delta y$ become
</span></p>
\[\begin{align*}
\Delta x' & = \cos(\theta) \Delta x + \sin(\theta) \Lambda \Delta y \\
\Lambda\Delta y' & = -\sin(\theta) \Delta x + \cos(\theta) \Lambda \Delta y.
\end{align*}\]
<p><span style="padding-left: 20px; display:block">
(c) Check that
</span></p>
\[d^2(\Delta x', \Delta y') = d^2(\Delta x, \Delta y) [\cos^2(\theta) + \sin^2(\theta)].\]
<p><span style="padding-left: 20px; display:block">
Pythagoras’ theorem is then equivalent to the trigonometric identity
</span></p>
\[\cos^2(\theta) + \sin^2(\theta) = 1.\]
<p><span style="padding-left: 20px; display:block">
(d) Consider any point on the $y$ ruler, and define $q =\Delta
x'/\Delta y'$.
Verify that
</span></p>
\[\tan(\theta) = \frac{\sin(\theta)}{\cos(\theta)} =
\frac{q}{\Lambda}.\]
<hr />
<p><em>Exercise 2 (length contraction).</em> Time dilation can be used to deduce
a rule for length in different frames.</p>
<p><span style="padding-left: 20px; display:block">
(a) Suppose a ruler passes us by at speed $v$.
We can deduce its apparent length $L'$ by timing how long it takes
($\Delta \tau$) to pass some specific spot.
Show this length is
</span></p>
\[L' = v\Delta \tau\]
<p><span style="padding-left: 20px; display:block">
where $\Delta \tau$ refers to the clock which is stationary in our frame.
</span></p>
<p><span style="padding-left: 20px; display:block">
(b) The <em>proper length</em> $L$ of the ruler is the length measured in the
frame where it is stationary. An observer riding on the ruler can read this off
by timing our clock as it passes from one end of the ruler to the other.
Using time dilation, show that
</span></p>
\[L = \gamma L'.\]
<figure>
<div style="text-align:center"><img src="/images/posts/emcc5.png" />
</div>
</figure>
<p>Thus, a moving ruler shrinks by a factor $\gamma$ in our
frame. This is called <em>length contraction</em>.</p>
<hr />
<p><em>Exercise 3 (Lorentz transformations).</em> In this exercise, we will derive something called the
Lorentz transformation. First, we define $\alpha = e^\eta$ for a
“boost parameter” $\eta$. We will also use the hyperbolic functions</p>
\[\cosh(\eta) = \frac{1}{2}(e^\eta + e^{-\eta}), \quad \sinh(\eta) =
\frac{1}{2}(e^\eta - e^{-\eta}), \quad \tanh(\eta) = \frac{\sinh(\eta)}{\cosh(\eta)}.\]
<p>These play the same role in relativity that the trigonometric
functions $\sin, \cos, \tan$ play in Euclidean geometry, namely,
parameterising transformations which keep length invariant.</p>
<p><span style="padding-left: 20px; display:block">
(a) Suppose two events are separated by $\Delta x,
\Delta t$ in the $x, t$ frame. Using (\ref{boost}), show that in the
$x', t'$ frame, they are separated by
</span></p>
\[\begin{align*}
\Delta x' & = \cosh(\eta) \Delta x + \sinh(\eta) c \Delta t \\
c\Delta t' & = \sinh(\eta) \Delta x + \cosh(\eta) c \Delta t.
\end{align*}\]
<p><span style="padding-left: 20px; display:block">
This is very clearly analogous to the results in Exercise 1!
</span></p>
<p><span style="padding-left: 20px; display:block">
(b) From the clock example (or otherwise), argue that
</span></p>
\[\cosh(\eta) = \gamma, \quad \sinh(\eta) = \frac{\gamma v}{c}.\]
<p><span style="padding-left: 20px; display:block">
Inserting these into (a) gives the standard form of the Lorentz
transformation:
</span></p>
\[\Delta x' = \gamma \Delta x + \gamma v \Delta t , \quad
\Delta t' = \left(\frac{\gamma v}{c^2}\right)\Delta x + \gamma\Delta
t. \tag{6} \label{lorentz}\]
<p><span style="padding-left: 20px; display:block">
(c) Show that (b) is consistent with the results of (\ref{alpha}),
i.e. both imply $\tanh(\eta) = v/c$.
Explain why this is analogous to part (d) of Exercise 1.
</span></p>
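<p>The relations in this exercise can be spot-checked numerically before proving them. The sketch below takes $c = 1$ by default and defines the boost parameter through $\tanh(\eta) = v/c$; the function name <code>rapidity</code> is my label for $\eta$, and the sample speed is arbitrary.</p>

```python
import math

def rapidity(v, c=1.0):
    """Boost parameter eta defined by tanh(eta) = v/c."""
    return math.atanh(v / c)

v = 0.6
eta = rapidity(v)
gamma = 1.0 / math.sqrt(1.0 - v ** 2)

print(math.cosh(eta), gamma)                      # cosh(eta) against gamma
print(math.sinh(eta), gamma * v)                  # sinh(eta) against gamma*v/c
print(math.exp(2.0 * eta), (1.0 + v) / (1.0 - v)) # alpha^2 against (c+v)/(c-v)
```

<p>A numerical match is of course not a proof, but it is a useful guard against sign slips.</p>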
<hr />
<p><em>Exercise 4 (null hypothesis).</em> We’ve assumed that (\ref{s2}) is invariant
in general, but light obeys $s^2 = 0$.
What if we only require invariance for this special case?
Using our new coordinates $x^\pm$, we can investigate!</p>
<p><span style="padding-left: 20px; display:block">
(a) Argue that $s^2 = 0$ is invariant if and only if
</span></p>
\[(\Delta x')^\pm = \alpha_\pm \Delta x^\pm\]
<p><span style="padding-left: 20px; display:block">
for constants $\alpha_\pm$. As above, we’ll take these to be positive
for simplicity.
</span></p>
<p><span style="padding-left: 20px; display:block">
(b) Show that for some $\alpha, \lambda > 0$, we can always rewrite
</span></p>
\[\alpha_+ = \alpha \lambda, \quad \alpha_- = \frac{\lambda}{\alpha}.\]
<p><span style="padding-left: 20px; display:block">
(c) Argue that the most general transformation preserving $s^2 = 0$
is a Lorentz transformation followed by a <em>uniform scaling</em>:
</span></p>
\[x' = \lambda x, \quad t' = \lambda t.\]
<p><span style="padding-left: 20px; display:block">
(d) Finally, conclude that if we restrict to transformations induced
by relative motion between frames, invariance of the speed of light
implies invariance of $s^2$ for any value. <em>Hint</em>. What is the
relative velocity for a pure scaling, i.e. $\alpha = 1$?
</span></p>
<hr />
<p><em>Exercise 5 (additional algebra).</em> Do the algebra to make $u''$
the subject in (\ref{add}).</p>
<hr />
<p><em>Exercise 6 (new frame).</em> Consider a frame of
reference in which the unexploded bowling ball moves right at speed
$v$, e.g. while bowling.</p>
<p><span style="padding-left: 20px; display:block">
(a) Show that the two exploded halves move with velocities
</span></p>
\[u'_\pm = \frac{v \pm u}{1 \pm uv/c^2}.\]
<p><span style="padding-left: 20px; display:block">
(b) If each has mass $M_0$, show that momentum is only
conserved for $u = 0$, $v = 0$ or $v = c$.
</span></p>
<hr />
<p><em>Exercise 7 (conserving two-momentum).</em>
Suppose a particle of rest mass $m_0$ moves at speed $v$ for proper
time $\Delta \tau$.
The two-velocity $\mathbf{v}$ and two-momentum $\mathbf{p}$ are
vectors [<sup><a id="fnr.3" name="fnr.3" class="footref" href="#fn.3">3</a></sup>]</p>
\[\mathbf{v} = \frac{1}{\Delta \tau}(\Delta t, \Delta x) , \quad \mathbf{p} = m_0\mathbf{v}.\]
<p><span style="padding-left: 20px; display:block">
(a) Show that two-quantities can be written
</span></p>
\[\mathbf{v} = (\gamma, \gamma v), \quad \mathbf{p} = (\gamma m_0, \gamma
m_0 v).\]
<p><span style="padding-left: 20px; display:block">
(b) Suppose that the two-momenta before and
after a collision are equal:
</span></p>
\[\mathbf{p}_i = \mathbf{p}_f.\]
<p><span style="padding-left: 20px; display:block">
Argue that, after a Lorentz transformation (\ref{lorentz}) to a new
frame of reference $x', t'$, the two-momentum remains conserved:
</span></p>
\[\mathbf{p}'_i = \mathbf{p}'_f.\]
<p><span style="padding-left: 20px; display:block">
This means that if relativistic mass $\gamma m_0$ and momentum $\gamma
m_0 v$ are conserved in one frame, they are conserved in any other!
</span></p>
<p><span style="padding-left: 20px; display:block">
(c) At low speeds ($v \ll c$), the Lorentz factor $\gamma \approx 1$.
We also know that at low speeds, Newtonian mechanics is a good
description, so mass $m_0$ and momentum $m_0v$ are conserved.
Extrapolate to the conservation of two-momentum.
</span></p>
<hr />
<p><em>Exercise 8 (energy-momentum).</em> We end with an equivalent
form of Einstein’s equation.</p>
<p><span style="padding-left: 20px; display:block">
(a) First, show that
</span></p>
\[c^2\gamma^2 = \gamma^2 v^2 + c^2.\]
<p><span style="padding-left: 20px; display:block">
(b) Deduce the energy-momentum relation
</span></p>
\[E^2 = p^2 c^2 + m_0^2 c^4.\]
<p><span style="padding-left: 20px; display:block">
(c) A photon has zero rest mass. Use the energy-momentum
relation to argue that the energy and momentum are
related by
</span></p>
\[E = pc.\]
<p><span style="padding-left: 20px; display:block">
Maxwell deduced this from classical electromagnetism, but amusingly,
we got there by thinking about bowling balls!
</span></p>
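<p>For a quick numerical sanity check of part (b), here is a minimal Python sketch.
It assumes the total energy is $E = \gamma m_0 c^2$ (Einstein’s equation applied to the
relativistic mass) and the momentum is $p = \gamma m_0 v$. The helper names, and the
$7$ kg ball moving at $0.6c$, are hypothetical choices for illustration only.</p>

```python
import math

C = 299_792_458.0  # speed of light in m/s


def lorentz_factor(v, c=C):
    """Return gamma = 1 / sqrt(1 - v^2/c^2) for a speed v below c."""
    return 1.0 / math.sqrt(1.0 - (v / c) ** 2)


def energy_and_momentum(m0, v, c=C):
    """Return (E, p), assuming E = gamma m0 c^2 and p = gamma m0 v."""
    gamma = lorentz_factor(v, c)
    return gamma * m0 * c ** 2, gamma * m0 * v


m0 = 7.0      # hypothetical rest mass of a bowling ball, in kg
v = 0.6 * C   # hypothetical speed, chosen so that gamma is close to 1.25

E, p = energy_and_momentum(m0, v)
lhs = E ** 2
rhs = (p * C) ** 2 + (m0 * C ** 2) ** 2

# The two sides should agree up to floating-point rounding.
print(math.isclose(lhs, rhs, rel_tol=1e-9))
```

<p>This checks the relation at a single speed, so it is no substitute for the algebra,
but it is a handy way to catch a sign error or a stray factor of $c$.</p>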
<hr />
<div class="footdef"><sup><a id="fn.1" name="fn.1" class="footnum" href="#fnr.1">Footnote 1</a></sup> <p class="footpara">
Or if you prefer, an orthogonal grid of such rulers.
</p></div>
<div class="footdef"><sup><a id="fn.2" name="fn.2" class="footnum" href="#fnr.2">Footnote 2</a></sup> <p class="footpara">
Newton's second law can be written $F = \Delta p/\Delta t$, i.e. the
force is just the rate of change of momentum. When force is
zero, so is the momentum change!
</p></div>
<div class="footdef"><sup><a id="fn.3" name="fn.3" class="footnum" href="#fnr.3">Footnote 3</a></sup> <p class="footpara">
The "two" refers to the total number of spacetime dimensions.
For three dimensions of space and one of time, the corresponding
quantities are called four-velocity and four-momentum.
</p></div>
<!-- This neatly splits the total squared energy into a kinetic part
$(pc)^2$ and a rest energy part $(m_0c^2)^2$.
It also correctly suggests that for light (or any other massless
particle), the total energy and momentum are related by -->David A WakehamFebruary 19, 2021. A self-contained derivation of the most famous equation in physics. I start with a crash course on special relativity, emphasizing the invariance of spacetime lengths, move on to conservation laws, and end by considering the relativistic mechanics of an exploding bowling ball.The statistical basis of Fermi estimates2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00http://hapax.github.io/physics/hacks/mathematics/statistics/fermi-log-normal<p><strong>February 12, 2021.</strong> <em>Why are Fermi approximations so effective? One
important factor is log normality, which occurs for large random
products. <!--, also related to the mechanism underlying
the Newcomb-Benford law for first digits.--> Another element is
variance-reduction through judicious subestimates. I discuss both
and give a simple heuristic for the latter.</em></p>
<h4 id="introduction">Introduction</h4>
<p>Fermi approximation is the art of making good order-of-magnitude estimates.
I’ve written about them
at greater length
<a href="https://hapax.github.io/assets/fermi-estimates.pdf">here</a> and
<a href="https://hapax.github.io/physics/teaching/hacks/napkin-hacks/#sec-3">here</a>,
but I’ve never really found a satisfactory explanation for why they work.
Order-of-magnitude is certainly a charitable margin of
error, but time and time again, I find they are better than they have any right to be!
Clearly, there must be an underlying statistical explanation for this apparently
unreasonable effectiveness.</p>
<!-- We will try to explain the first using logarithmic uniformity, which is
the same mechanism underlying the anomalous distribution of first
digits known as the
[Newcomb-Benford law](https://en.wikipedia.org/wiki/Benford%27s_law).
We give a looser but related explanation of the second in terms of strategies for
variance-reduction in human error. -->
<h4 id="products-and-log-normality">Products and log-normality</h4>
<p>There are two key techniques: the use of geometric means, and the
factorisation into subestimates.
We start with geometric means.
Suppose a random variable $F$ is a product of many independent random
variables,</p>
\[F = X_1 X_2 \cdots X_N.\]
<p>Then the logarithm of $F$ is a sum of many random variables $Y_i =
\log X_i$:</p>
\[\log F = \log X_1 + \log X_2 + \cdots + \log X_N = \sum_{i=1}^N Y_i.\]
<p>By the central limit theorem for unlike variables (see
e.g. <a href="https://hapax.github.io/hacks/mathematics/statistics/clt/">this post</a>),
for large $N$ this approaches a normal distribution:</p>
\[\log F \to \mathcal{N}(\mu, \sigma^2), \quad \mu := \sum_i \mu_i,
\quad \sigma^2 := \sum_i \sigma_i^2,\]
<p>where the $Y_i$ have mean $\mu_i$ and variance $\sigma_i^2$.
We say that $F$ has a <em>log-normal</em> distribution, since its log is
normal.</p>
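<p>To see this in action, here is a small simulation sketch in Python.
For simplicity it assumes every factor $X_i$ is uniform on $[0.5, 2]$ (so the factors are
identically distributed, a special case of the statement above), and the function names
are hypothetical. It compares the sample mean and variance of $\log F$ with the predicted
sums, and checks that roughly $68\%$ of samples land within one standard deviation of the
mean, as they would for a normal distribution.</p>

```python
import math
import random
import statistics


def log_uniform_moments(a, b):
    """Exact mean and variance of log X for X uniform on [a, b], 0 < a < b."""
    def int_log(x):      # antiderivative of log x
        return x * math.log(x) - x

    def int_log_sq(x):   # antiderivative of (log x)^2
        return x * math.log(x) ** 2 - 2 * x * math.log(x) + 2 * x

    mean = (int_log(b) - int_log(a)) / (b - a)
    second_moment = (int_log_sq(b) - int_log_sq(a)) / (b - a)
    return mean, second_moment - mean ** 2


def sample_log_products(n_factors, n_trials, a=0.5, b=2.0, seed=0):
    """Sample log F for F = X_1 ... X_N, with each X_i uniform on [a, b]."""
    rng = random.Random(seed)
    return [
        sum(math.log(rng.uniform(a, b)) for _ in range(n_factors))
        for _ in range(n_trials)
    ]


N = 30  # number of factors in each product (an arbitrary choice)
mu_i, var_i = log_uniform_moments(0.5, 2.0)
logs = sample_log_products(n_factors=N, n_trials=5000)

sample_mean = statistics.fmean(logs)
sample_var = statistics.variance(logs)
sigma = math.sqrt(N * var_i)
within_one_sigma = sum(abs(y - N * mu_i) <= sigma for y in logs) / len(logs)

print(f"mean of log F: {sample_mean:.3f} (predicted {N * mu_i:.3f})")
print(f"variance of log F: {sample_var:.3f} (predicted {N * var_i:.3f})")
print(f"fraction within one sigma: {within_one_sigma:.3f} (normal: 0.683)")
```

<p>Swapping in a different distribution for the factors only changes $\mu_i$ and
$\sigma_i^2$; the bell shape of $\log F$ should persist as long as no single factor dominates.</p>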
<!-- To get uniformity into the picture, we can zoom in on the region near
$F = e^\mu$ where the probability density is approximately uniform.
More carefully, the density is
$$
p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}.
$$
Taylor-expanding near $x = \mu$ gives
$$
p(x) = \frac{1}{\sigma\sqrt{2\pi}}
\left[1 - \frac{(x-\mu)^2}{2\sigma^2} + O(x^4)\right].
$$
This looks uniform provided $(x - \mu)^2 \ll \sigma^2$.
For instance, at a third of a standard deviation, $x = \mu + \sigma/3$,
we have
$$
1 - \frac{(x-\mu)^2}{2\sigma^2} = 1 - \frac{1}{18} \approx 0.94,
$$
and $\text{erf}(1/\sqrt{18}) \approx 0.26$, about a quarter of the
probability mass, lies underneath.
This is what we mean when we say that $F$ is logarithmically uniform. -->
<h4 id="geometric-means">Geometric means</h4>
<p>In Fermi estimates, one of the basic techniques is to take geometric
means of estimates, typically an overestimate and an underestimate.
For instance, to Fermi estimate the population of Chile, I could
consider a number like one million which seems much too low, and a
number like one hundred million which seems much too high, and take
their geometric mean:</p>
\[\sqrt{(1 \text{ million}) \times (100 \text{ million})} = 10 \text{ million}.\]
<p>Since population is a product of many different factors, it is
reasonable to expect it to approximate a log-normal distribution.
Then, after logs, the geometric mean $\sqrt{ab}$ becomes the
arithmetic mean of $\log a$ and $\log b$:</p>
\[\log \sqrt{ab} = \frac{1}{2}(\log a + \log b).\]
<p>Taking the mean $\mu$ of the distribution as the true value, the logs of these
geometric means provide an
<a href="https://en.wikipedia.org/wiki/Bias_of_an_estimator">unbiased estimator</a>
of $\mu$.
Moreover, the variance of the estimate will decrease as $1/k$ for $k$
samples (assuming human estimates sample from the distribution), so more is better.
To see how much better I could do on the Chile population estimate, I
solicited guesses from four friends, and obtained $20, 20, 30$ and $35$
million.
Combining with my estimate, I get a geometric mean</p>
\[(10 \times 20 \times 20 \times 30 \times 35)^{1/5} \text{ million}
\approx 21 \text{ million}.\]
<p>The actual population is around $18$ million, so the estimate made
from more guesses is indeed better!
This is also better than the arithmetic average, $23$ million.
Incidentally, this also illustrates the
<a href="https://hapax.github.io/physics/mathematics/statistics/crowd/">wisdom of the crowd</a>,
also called “diversity of prediction”.
The individual errors from a broad spread of guesses tend to cancel
each other out, leading to a better-behaved average, though in this case
in logarithmic space.</p>
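<p>The arithmetic above is easy to reproduce. Here is a minimal Python sketch using the
five guesses quoted in the text; the function name <code>geometric_mean</code> is my own
choice, and it simply exponentiates the arithmetic mean of the logs.</p>

```python
import math


def geometric_mean(values):
    """Exponential of the arithmetic mean of the logs of positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


guesses = [10, 20, 20, 30, 35]  # population guesses, in millions
actual = 18                     # approximate true value, in millions

geo = geometric_mean(guesses)
arith = sum(guesses) / len(guesses)

print(f"geometric mean:  {geo:.1f} million")
print(f"arithmetic mean: {arith:.1f} million")
print(f"geometric mean is closer: {abs(geo - actual) < abs(arith - actual)}")
```

<p>The geometric mean comes out at about $21$ million and the arithmetic mean at $23$
million, matching the numbers above.</p>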
<p>In general, Fermi estimates work best for numbers which are large
random products (this is how we try to solve them!), so the problem
domain tends to enforce the statistical properties we want.
For many examples of log-normal distributions in the real world, see
<a href="https://academic.oup.com/bioscience/article/51/5/341/243981">Limpert, Stahel and Abbt (2001)</a>.
It’s worth noting that not everything we can Fermi estimate is
log-normal, however.
Many things in the real world obey power laws, for instance, and
although you can exploit this to make better Fermi estimates (as
lukeprog does in
<a href="https://www.lesswrong.com/posts/PsEppdvgRisz5xAHG/fermi-estimates#Example_4__How_many_plays_of_My_Bloody_Valentine_s__Only_Shallow__have_been_reported_to_last_fm_">his tutorial</a>),
we can happily Fermi estimate power-law distributed numbers without
this advanced technology.</p>
<p>Are Fermi estimates unreasonably effective in this context?
Maybe.
But the estimates work best in the high-density core where things look
uniform, not out at the tails, and it’s not until we get to the tails that the difference
between the log-normal and power law (or exponential, or Weibull, or
your favourite skewed distribution) becomes pronounced.
So the unreasonable effectiveness here can probably be explained by
the resemblance to the log-normal, though this is something I’d like
to check more carefully in future.</p>
<!-- In general, we only expect Fermi estimates to work for numbers which
are the product of many factors.
But this is precisely the sorts of things we use Fermi estimates for!
In a sense, the problem domain naturally leads to logarithmic
uniformity.
Incidentally, I've talked about "uniformity", but the geometric mean
is still a measure of central tendency for any distribution, and is
particularly nice for a lognormal one, which arise for products of
random variables.
The magic of geometric means manifests most
strongly in the near-uniform blob at the centre. -->
<!-- #### The Newcomb-Benford law
Logarithmic uniformity also explains an odd pattern in the first
digits of naturally occurring numbers like tax returns, stock market
prices, populations, river lengths, physical constants, and even
powers of $2$.
The pattern, called the *Newcomb-Benford law* after
[Simon Newcomb](https://en.wikipedia.org/wiki/Simon_Newcomb) and
[Frank Benford](https://en.wikipedia.org/wiki/Frank_Benford), is as
follows: for base $b$, the digit $d \in \\{1, 2, \ldots, b-1\\}$
occurs with relative frequency
$$
p_b(d) = \log_b \left(\frac{d+1}{d}\right) = \frac{1}{\log b}\log \left(\frac{d+1}{d}\right).
$$
It initially seems bizarre that digits do not occur with equal
frequency.
But as neatly explained by
[Pietronero et al. (1998)](https://arxiv.org/pdf/cond-mat/9808305.pdf),
it follows immediately if the relevant numbers are logarithmically uniform.
Let $X$ be our random number.
Then the first digit is $d$ if
$$
db^k \leq X < (d+1)b^{k} \quad \Longrightarrow \quad \log_b d + k \leq
\log_b X < \log_b(d+1) + k
$$
for some integer $k$.
If $X$ is logarithmically uniform, for instance sitting near the mean
of a big random product, then $\log_b X$ is uniformly
distributed, and lies in the interval $I_d :=
[\log_b d, \log_b (d+1)]$ with probability
$$
(\log_b (d+1) + k) - (\log_b d + k) = \frac{1}{\log b}\log \left(\frac{d +
1}{d}\right) = p_b(d).
$$
This provides a simple way to check for fraud on tax returns, for
instance.
Just compute relative frequencies of first digits in different bases
and check they obey Newcomb-Benford!
You might wonder why something totally deterministic, like the first
digit of a power of $2$, also obeys Benford's law.
Here is a pie chart of initial decimal digits for the first $10,000$ binary
powers, which follows the Newcomb-Benford law exactly:
<figure>
<div style="text-align:center"><img src
="/images/posts/benford1.png"/>
</div>
</figure>
Here is the Python code to generate it.
You can check it for other numbers besides $2$ as well by simply
changing the `power` variable:
```python
import math
import matplotlib.pyplot as plt

maxpower = 10000 # Number of powers to check
power = 2 # Change to check other powers
nums = '1', '2', '3', '4', '5', '6', '7', '8', '9'
# Newcomb-Benford probabilities log10((d+1)/d) for d = 1, ..., 9
benford = [math.log10(d + 1) - math.log10(d)
           for d in range(1, 10)]
# Count the first decimal digits of the first maxpower powers
firstdig = [0 for i in range(9)]
for i in range(maxpower):
    ind = int(str(power**i)[0]) - 1
    firstdig[ind] = firstdig[ind] + 1
fig, ax = plt.subplots()
fig.set_facecolor('white')
ax.pie(firstdig, labels=nums, autopct='%1.1f%%', startangle=90)
# Change 'firstdig' to 'benford' for probabilities
ax.axis('equal')
plt.show()
```
The mechanism for logarithmic uniformity here is slightly different,
and discussed in depth in Serge Tabachnikov's
[book on geometric billiards](http://www.personal.psu.edu/sot2/books/billiardsgeometry.pdf).
In this case, $X = 2^n$, so the first digit is $d$ just in case
$$
\log_{10}d + k \leq n\log_{10} 2 < \log_{10}(d + 1) + k.
$$
Let $\text{frac}(x)$ denote the fractional part of $x$, and define
$x_n := \text{frac}(n\log_{10} 2)$.
Taking fractional parts gives
$$
\log_{10}d \leq x_n < \log_{10}(d + 1).
$$
It turns out that, since $x_1 = \log_{10} 2$ is irrational,
$x_n$ jumps randomly around the unit interval, and forms an
"equidistribution" which spends equal times in equal areas.
For a proof, see Tabachnikov's book.
But although the fundamental cause is different, the outcome is still
logarithmic uniformity, and the Newcomb-Benford law results. -->
<h4 id="the-philosophy-of-subestimates">The philosophy of subestimates</h4>
<p>Now we’ve dealt with geometric means and log-normality, we
turn to the effectiveness of factorising a Fermi estimate.
If we take logarithms, factors become summands, and we’ll reason about those since they are simpler.
If $Z = X + Y$ is a sum of independent random variables, the variance
is additive, so that</p>
\[\text{var}(Z) = \text{var}(X) + \text{var}(Y).\]
<p>Thus, splitting a sum into estimates of the summands and adding them
should not change the variance of the guess.
Of course, there is a fallacy in this reasoning: humans are not
sampling from the underlying distribution!
When we guess, we introduce our own random errors.
For instance, my estimate for $Z$ will have some human noise $\varepsilon_Z$:</p>
\[\hat{Z} = Z + \varepsilon_Z.\]
<p>Similarly, my guesses for $X$ and $Y$ have some random errors
$\varepsilon_X$ and $\varepsilon_Y$.
There is no reason for the variances of $\varepsilon_X$ and $\varepsilon_Y$
to add up to the variance of $\varepsilon_Z$.
The sum could be bigger, or it could be smaller.
But a good decomposition should reduce the combined variance:</p>
\[\text{var}(\varepsilon_X) + \text{var}(\varepsilon_Y) < \text{var}(\varepsilon_Z).\]
<p>If log-normality is the science of Fermi estimates, picking
variance-reducing subestimates is the art.
<!-- But there is a connection to our earlier discussion.
I think the human error $\varepsilon_X$ will roughly mimic the
empirical distribution of $Z$ we have seen in the world.
If it is biased, so is $\varepsilon_X$; it we have only seen a few
examples, the variance of $\varepsilon_X$ will probably be large, and
decrease roughly as $1/k$ with $k$ examples.
So the general strategy for variance reduction is to factorise into
things we have seen before.
We can even use these data points to generate subestimates by geometric averaging.-->
But I suspect that $\hat{Z}$ roughly speaking behaves like a <em>test
statistic</em> for $Z$, with the number of samples corresponding to how
many data points for $Z$ we have encountered.
So we expect that $\text{var}(\varepsilon_Z)$ will vanish roughly as
$1/k$ with $k$ samples.
If we have more exposure to the distributions for $X$ and $Y$,
the combined error will probably be smaller.
This is why we carve into subfactors we understand!</p>
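<p>The $1/k$ scaling is just the usual behaviour of an average of independent samples.
As a rough illustration, the sketch below models each remembered data point as the true
(log) value plus independent Gaussian noise. That noise model, and the function name, are
assumptions made purely for the demonstration, not a claim about how human guessing
actually works.</p>

```python
import random
import statistics


def variance_of_mean_guess(k, n_trials=2000, sigma=1.0, seed=0):
    """Empirical variance of the average of k noisy log-guesses.

    Each guess is modelled as the true value (set to 0) plus independent
    Gaussian noise with standard deviation sigma. This is an assumption.
    """
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, sigma) for _ in range(k)) / k
        for _ in range(n_trials)
    ]
    return statistics.pvariance(means)


for k in (1, 4, 16):
    print(k, round(variance_of_mean_guess(k), 4), "predicted", 1.0 / k)
```

<p>Under this toy model, a quantity we have seen sixteen examples of carries about a
sixteenth of the error variance of one we have seen only once.</p>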
<h4 id="variance-reduction-in-practice">Variance reduction in practice</h4>
<p>I’ll end with a speculative rule of thumb for when to factor: try generating over- and
underestimates for the factors and the product, which in additive
notation give</p>
\[(\Delta X)^2 + (\Delta Y)^2, \quad (\Delta Z)^2\]
<p>where $\Delta$ refers to the difference of the (logarithm of the) over-
and underestimate.
Factorise if the first estimated error is smaller than the second.
Let’s illustrate by returning to the population of Chile.
I can try factoring it into a number of regions multiplied by the
average number of people per region.
Taking logs (in base $10$) of the over- and underestimate of Chile’s
population I gave above, I get</p>
\[(\Delta Z)^2 = (\log_{10} 10^8 - \log_{10} 10^6)^2 = 4.\]
<p>On the other hand, for regions I would
make a lower guess of $5$ and an upper guess of $30$, with a squared difference in logs of $(\Delta X)^2 \approx 0.6$.
For regional population, I would make a lower guess of $5\times 10^5$ and an
upper guess of $5\times 10^6$, with $(\Delta Y)^2 = 1$.
Thus,</p>
\[(\Delta X)^2 + (\Delta Y)^2 = 1.6 < 4 = (\Delta Z)^2.\]
<p>The guess from the factorisation (taking geometric means) is</p>
\[\sqrt{5 \times 30 \times (5\times 10^5) \times (5\times 10^6)} \approx 19 \text{
million}.\]
<p>This is even better than the crowdsourced estimate!
For reference, the number of regions is $16$, while our estimated mean is around
$12$, and the average population per region is a bit over a million,
which we’ve mildly overestimated at $1.6$ million.
The two balance out and give a better overall estimate.
<!-- This suggests a diversity of prediction mechanism is at play with -->
<!-- subestimates, but I haven't worked out the details. --></p>
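<p>The rule of thumb is simple enough to automate. Here is a minimal Python sketch
applied to the Chile numbers above; the helper names are hypothetical, and the
heuristic itself remains as speculative as it was a paragraph ago.</p>

```python
import math


def squared_log_spread(low, high):
    """Squared difference of the base-10 logs of an under- and overestimate."""
    return (math.log10(high) - math.log10(low)) ** 2


def geometric_midpoint(low, high):
    """Geometric mean of an under- and overestimate."""
    return math.sqrt(low * high)


# (underestimate, overestimate) pairs taken from the text
population = (1e6, 1e8)   # direct guess for the population
regions = (5, 30)         # number of regions
per_region = (5e5, 5e6)   # average population per region

direct = squared_log_spread(*population)
factored = squared_log_spread(*regions) + squared_log_spread(*per_region)
estimate = geometric_midpoint(*regions) * geometric_midpoint(*per_region)

print(f"direct spread:   {direct:.1f}")
print(f"factored spread: {factored:.1f}")
print(f"factorise? {factored < direct}")
print(f"factored estimate: {estimate / 1e6:.0f} million")
```

<p>With these inputs the factored spread is about $1.6$ against $4$ for the direct
guess, so the heuristic says to factorise, and the resulting estimate is about $19$ million.</p>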
<h4 id="conclusion">Conclusion</h4>
<p>From a statistical perspective, Fermi estimates are based on two
techniques: geometric means and splitting into subfactors.
We usually estimate things which can be expressed as a product of many
factors. These will tend towards a log-normal distribution by the (log
of the) central limit theorem, so that geometric means provide a good
estimator, exactly like the usual mean for normally distributed variables.
Subestimates, on the other hand, carve guesses into factors we
understand, i.e. have more data points for, so that (assuming they
behave like test statistics) variance is reduced.
The effectiveness of Fermi estimates is quite reasonable after all!
<!-- They're not so unreasonable after all! --></p>
<!-- There is an art to making over- and underestimates
that accurately reflect the variance of our error random variables,
which are involved both in taking geometric means for single
quantities, and reducing variance through subestimates.
Still, it's cool that there is a statistical basis for the different
aspects of the effectiveness of Fermi estimates.
It's not so unreasonable after all! -->
<!-- For instance, if $e^Z$ is the population of Chile, I can factor it
into number of provinces $e^X$ multiplied by the average number of people per province $e^Y$.
But this is likely to *increase* the error, since I know less about
provinces of Chile than I do about Chile compared to other countries.
I suspect that there is a nice quantitative connection to be made
between the variance of $\varepsilon_X$ and the prior data I have on
it. -->
<!--
The Lyapunov condition holds for a sum of independent random
variables.
By taking an exponential, we can turn it into a result for a *product* of
independent variables.
Let $X_i, \mu_i, \sigma_i^2$ be as above, and $X_i = \log Y_i$.
Then
$$
\exp\left[\sum_{i=1}^N X_i\right] = \prod_{i = 1}^N Y_i \to \log
\mathcal{N}(\mu, \sigma^2).
$$
The distribution on the right is not a normal, but a *log-normal*.
It is simply what the normal distribution looks like when viewed in
terms of a variable $y > 0$ defined by $x = \log y$.
In order to plot the density, we use the fact that $dx =
dy/y$, and hence
$$
p(x)\, dx = \frac{dx}{\sqrt{2\pi}\sigma}
e^{-\frac{(x-\mu)^2}{2\sigma^2}} = \frac{dy}{\sqrt{2\pi}\sigma y}
e^{-\frac{(\log y-\mu)^2}{2\sigma^2}}.
$$
So, this is distribution that a product of many independent factors
converges to. -->
<!-- https://arxiv.org/pdf/cond-mat/9808305.pdf -->David A WakehamFebruary 12, 2021. Why are Fermi approximations so effective? One important factor is log normality, which occurs for large random products. Another element is variance-reduction through judicious subestimates. I discuss both and give a simple heuristic for the latter.Reductionism, order and patterns2021-02-08T00:00:00+00:002021-02-08T00:00:00+00:00http://hapax.github.io/mathematics/physics/philosophy/form<p><strong>February 8, 2021.</strong> <em>Some philosophical reflections on the nature of
scientific explanation, structure, emergence, and the unreasonable
effectiveness of mathematics.</em></p>
<h4 id="introduction">Introduction</h4>
<p><span style="padding-left: 20px; display:block">
Explanations must come to an end somewhere.
</span></p>
<div style="text-align: right"><i>Ludwig Wittgenstein</i> </div>
<p>Reductionism is the idea that you explain stuff with
smaller stuff, and keep going until you stop.
In many ways, this describes the explanatory program of 20th century
physics, which, starting from the 19th century puzzles of statistical mechanics,
conjured up atoms, subatomic particles, the zoo of the Standard Model, and even
tinier hypothetical entities like strings and spin foams.
Most physicists spend their time in a lab, on a computer, or in front
of a blackboard, trying to reduce complex things to simple things they understand.
So like Platonism in mathematics, reductionism in physics simply makes
a philosophy out of everyday practice.
We break stuff down, so things reduce; we play abstractly with
mathematical objects, so they exist abstractly.</p>
<p>But also like Platonism, reductionism is a convenient fiction, or rather, a
caricature in which some things are emphasised at the cost of others.
And given the reverence with which philosophers hold the considered
ontological verdicts of science, it’s worth asking: what does science really tell us about the
universe? What sorts of objects are necessary for explanation? Does
explanation go only upwards, or can it go downwards or sideways?
Should we eliminate the things we explained? And what has explanation
to do with existence anyway?
This post is an attempt to unconfuse myself about some of these questions.
<!-- adds a dash of novelty and modern
physics to old (and in some cases hopelessly outdated) debates. --></p>
<h4 id="the-existence-of-shoes">The existence of shoes</h4>
<p><span style="padding-left: 20px; display:block">
… our common sense conception of psychological phenomena constitutes a
radically false theory, a theory so fundamentally defective that both
the principles and ontology of that theory will eventually be
displaced, rather than smoothly reduced, by completed neuroscience.
</span></p>
<div style="text-align: right"><i>Paul Churchland</i> </div>
<p>Physical objects can be described at different levels.
A shoe is constructed from flat sheets of material, curved, cut,
marked, and stuck together in clever ways; materials
curve and stick by virtue of their constituent
chemicals, usually long, jointed molecular chains called polymers;
polymers, in turn, are built like Lego from a smorgasbord of elements;
and each elemental atom is a dense nuclear core, surrounded by
electrons whirring around in elaborate orbitals.</p>
<p>From the properties of the neutrons, protons and
electrons, it seems we can work our way upwards, and infer everything
else.
The laws of quantum mechanics and electromagnetism determine the
orbital structure of the atom. The valence shell of the atom
determines how it can combine with other atoms to form
chemicals. Finally, the structural motifs and functional groups of the
polymers give them the properties the industrial chemist, the designer,
and the cobbler exploit to make a shoe.
Thus, some philosophers conclude, only electrons, protons, and
neutrons exist.
The rest can be eliminated as unnecessary
ontological baggage.
This view is called <em>eliminative reductionism</em>.
It is a hardcore philosophy which does not believe in shoes [<sup><a id="fnr.1" name="fnr.1" class="footref" href="#fn.1">1</a></sup>].</p>
<p>There is a gentler, less silly form of reductionism which grants the
existence of shoes, but insists that they are (in the phrase of Jack
Smart) nothing “over and above” the constituent subatomic particles.
The shoe “just is” electrons and protons and neutrons, in some order;
this is what we mean by a shoe.
There are other ways to characterise the reduction, <!--
for instance, that the properties of the shoe "follow"
from, or are "completely explained by", those of the subatomic particles.
In fact, there is--> and a whole literature devoted to the attendant
subtleties, but most fall under the heading of analytic
micro-quibbles.
<!-- , and won't concern us here.-->
Instead, we will make a much simpler observation: order matters.</p>
<p>Clearly, if we took those subatomic particles, and arranged them in a
different way, we would get different elements, different chemicals,
and a duck or a planetesimal instead of a shoe.
Arrangement is important.
It is patently absurd to try and explain the bulk properties of the
shoe—the fact that it fits around a human foot, for
instance—without appeal to arrangement, since a different
order yields objects which do not fit around a foot.
<!-- If one objects that "fitting around a foot" is some sort of
anthropocentric folly due for elimination, replace it with,
Philip Anderson was perhaps the first physicist to make this argument,
in his famous article ["More is Different"](https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf). -->
Since order has <em>explanatory</em> significance, it should presumably be
tarred with the same ontic brush we apply to things like electrons.</p>
<p>Of course, one may object that explanation does not equal existence.
I can handily account for the continual disappearance of my socks by
the hypothesis of sock imps.
But this is a bad explanation! It’s not consistent with other reliably known facts about the world.
Sock imps don’t make the ontic cut, not because there is no link between
explanation and what we deem to exist, but because that link should
only be made for <em>robust</em> explanations, and the poor little sock imps collapse
at the first empirical hurdle.
That different arrangements of things have different properties is
robust, almost to the point of truism, and there seems to be no
principled reason to ban order <!-- , or *structure* as we will call it,-->
from our ontology.</p>
<h4 id="emergence-vs-structure">Emergence vs structure</h4>
<p><span style="padding-left: 20px; display:block">
More is different.
</span></p>
<div style="text-align: right"><i>Philip W. Anderson</i> </div>
<p>It’s worth noting the parallel
to <em>emergence</em>.
In his famous article
<a href="https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf">“More is Different”</a>,
Philip W. Anderson argued for the idea of domain-specific laws and
dynamical principles which did not follow the strict, one-way
explanatory hierarchy of reduction, particularly in his field of
condensed matter physics.
And indeed, condensed matter makes a science of order itself,
studying how properties of macroscopic wholes (such as phases of
matter) “emerge” from the arrangement of microscopic parts.
Anderson thought of emergence as patterns that appear when you “zoom
out” from the constituents, but which are still made from the
constituents; we are just describing those constituents at a different level.
<!-- the microscopic perspective as the wrong "level"
of description, like being too zoomed in on a microscope, but I think that
it is simply different information. --></p>
<p>But this seems to suffer from the same problem as a reductionist
account of shoes.
The “emergent properties” are not properties of the constituents at
all!
The symmetries, order parameters, <!-- which measure their brokenness,
and collective excitations which emerge as long-range messengers of
disorder, a are not simply the microscopics "zoomed out".--> and
collective excitations studied by condensed matter physicists belong
only to the arrangements.
In fact, systems made from totally different materials can
exhibit the same emergent behaviour [<sup><a id="fnr.2" name="fnr.2" class="footref" href="#fn.2">2</a></sup>]!
They are something new, something “over and above” the spins of the
lattice, or the carbon atoms of a hexagonal monolayer, since different
arrangements of those same parts would have different properties.
We can turn Anderson’s snappy slogan around:
<em>different is more</em>. If arranging things differently gives them new
and different properties, it is a sign of structure, and structure is
something over and above the component parts themselves.
<!-- often characterising phases of matter in terms of
what are called *order parameters*, numbers which characterise the
brokenness of a symmetry. --></p>
<h4 id="what-is-a-particle">What is a particle?</h4>
<p><span style="padding-left: 20px; display:block">
It is raining instructions out there; it’s raining programs; it’s
raining tree-growing, fluff-spreading, algorithms. That is not a
metaphor, it is the plain truth. It couldn’t be any plainer if it were
raining floppy discs.
</span></p>
<div style="text-align: right"><i>Richard Dawkins</i> </div>
<p>We don’t need emergence to argue for structure; we can use the
elementary components themselves.
When philosophers talk about reductionism, they tend to imagine
subatomic particles as small, indivisible blobs, without internal
organisation or further ontological bells and whistles. An electron
might have properties like mass or charge, and obey the curious dictates of quantum mechanics,
but all this is packaged irreducibly and not worth further discussion.
But if we try and unpack all these “simple” properties, we will find
that, like the magic bag of Mary Poppins, a particle is much deeper
than it first appears!
The Large Hadron Collider does not produce evidence for tiny,
structureless blobs.
Rather, it confirms at a rate of petabytes per second that the universe is made of mathematics.</p>
<p>The state-of-the-art definition of a particle is
<!-- (as
[this Quanta article](quantamagazine.org/what-is-a-particle-20201112)
humorously explores) --> a bit of a mouthful: an <em>irreducible
representation of the Lorentz group</em>.
In plain English, being a <em>representation</em> means that particles are
objects which have or “transform with” symmetries, in the same way a circle looks the same however
you rotate it.
That it is <em>irreducible</em> means that it cannot be split into smaller
parts which have the same symmetry, which is the mathematical avatar
of being “indivisible”.
Finally, the symmetry itself, the <em>Lorentz group</em>, is the same group
describing the shape of empty space according to special relativity.
So, in summary, a particle transforms with the symmetries of empty space, and
cannot be split into parts with this symmetry.
<!-- [<sup><a id="fnr.3" name="fnr.3" class="footref" href="#fn.3">3</a></sup>].-->
Lurking implicitly in the background is the whole framework of
quantum mechanics, and in particular, that particles are <em>states in a
Hilbert space</em>. In plain English, we can add and subtract states of a
particle, and compare them to each other.</p>
<p>Thus, every particle is like a mathematical diamond: indivisible,
multifaceted, and structured up to the hilt.
When philosophers of science eagerly assent to believe whatever the particle physicists
tell them, <!-- particularly when it can be tested with unparalleled
precision at the LHC, --> they may not realise what
they signed up for!
Spacetime, quantum mechanics, and symmetries, the Lorentz group and
Hilbert spaces; these are all welded indissolubly to form the most
robust and fundamental objects in the universe.
Even with something as “simple” as an electron, order is
inescapable.</p>
<h4 id="unreasonable-effectiveness-and-natural-patterns">Unreasonable effectiveness and natural patterns</h4>
<p><span style="padding-left: 20px; display:block">
It is difficult to avoid the impression that a miracle confronts us
here, quite comparable… to the two miracles of the existence of
laws of nature and of the human mind’s capacity to divine them.
</span></p>
<div style="text-align: right"><i>Eugene Wigner</i> </div>
<p>It may feel like we have jumped from physical to
mathematical objects in one fell, tendentious swoop.
Do we need Hilbert space, or might another mathematical concept
suffice?
And does Hilbert space really exist, or is it merely a useful human
invention?
If the latter, why so useful?
This is intentionally designed to rhyme with our earlier statement
that order is a robustly explanatory feature of the world, and
distinct from the things that are ordered.
Mathematics really just is the study of order, or <em>patterns</em>, according to their own peculiar and abstract
logic.
Physics (and to a lesser extent the other sciences) studies <em>natural
patterns</em>, the way these structures or forms of order are realised in
the natural world.
That applies not just to emergent behaviour like phases of matter, but
even the crystalline makeup of an elementary particle.</p>
<p>I have tried to motivate this perspective from the nature of physical
explanation, but perhaps it can teach us about mathematical
explanation and its relation to the physical world.
A common criticism of Platonism is that, if mathematical objects exist
in some non-physical realm, the ability to do mathematics must involve
extrasensory perception. Clearly, since we are physical
beings, this ability is grounded in physical experience, and now we
have a simple explanation: patterns are naturally realised everywhere, from
cardinal numbers in counting cows to topology in tying a knot to
representation theory in colliding protons. We don’t need magical
access to the World of Forms to see these things; they are all around us.</p>
<p>Similarly, the
<a href="https://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html">unreasonable effectiveness of mathematics</a>
for describing the world, first noted by Eugene Wigner, seems no more
miraculous than the utility of integers for counting loaves of bread
rather than proving results about number theory.
We get the patterns from the world, clean them up, rebrand a little,
and start connecting them together.
The meta-patterns that emerge are remarkable, but the appearance of
“unreasonable effectiveness” is the result of a largely successful PR
campaign to divorce mathematical structures from their physical
origins. As Einstein quipped, “Since the mathematicians have invaded
the theory of relativity, I do not understand it myself anymore.”
The abstraction of pseudo-Riemannian geometry follows from the more
concrete act of bouncing light off mirrors.</p>
<p>More and more, we are seeing this converse of unreasonable
effectiveness, where deep mathematical ideas are inspired by physics.
The living embodiment of this trend is Ed Witten, a string theorist
whose contributions to mathematics have been so profound and
wide-ranging that he earned a Fields Medal (the Nobel prize in
mathematics), the only physicist to have ever done so! <!-- for his contributions to low-dimensional topology.-->
Once again, there is no mystery here; it is just the usual state of
affairs, but without the Platonist guff to distract us.
The patterns are out there and always have been.</p>
<h4 id="what-is-a-pattern">What is a pattern?</h4>
<p><span style="padding-left: 20px; display:block">
Everything comes to be from both subject and form.
</span></p>
<div style="text-align: right"><i>Aristotle</i> </div>
<p>All this raises the question: what is a pattern?
<!-- And how is it conjoined with stuff?-->
The first and most famous philosophical treatment of these issues is
the
<a href="https://plato.stanford.edu/entries/form-matter/">hylomorphism of Aristotle</a>,
who argued that objects are a compound of both form (the structure,
order, or patterns I have discussed here) and matter (energy or “raw
potentia”).
I won’t discuss Aristotle’s ideas in greater detail. Suffice to say they have
deeply informed this post, and the interested reader should check out James Franklin’s
<a href="https://link.springer.com/book/10.1057/9781137400734">modern take</a>.
<!-- for a modern take on Aristotelian structuralism applied.-->
Instead, I will approach the question by picking on two
smaller problems, taking Newton’s laws as a concrete example.</p>
<p>Newton formulated his laws of motion (such as $F = ma$) in terms of forces and
acceleration. Does the empirical robustness of these laws mean that
this is the only way to formulate them?
Not at all!
There are two other distinct but equivalent versions of classical
mechanics: Lagrangian and Hamiltonian. They explain
the same things, make the same predictions, and thus seem to describe
the same natural patterns. This suggests to me that although patterns
are discovered, formalisms are invented.
A pattern is the equivalence class of descriptions.</p>
<p>Students of physics will be aware that, although Hamiltonian and
Lagrangian mechanics are equivalent to Newton’s laws in the mechanical
context, they have taken on a life of their own.
The Lagrangian approach involves the mathematics of optimising
functions, while the Hamiltonian approach in its most abstract form
becomes the mathematical field of symplectic geometry.
Both Lagrangian and Hamiltonian mechanics can be upgraded (with some
inspired retrospective guesswork) to frameworks for quantum mechanics,
which Newton’s laws simpliciter cannot.
There is much more going on than a simple isomorphism of
description!
A more nuanced view is that humans invent formalisms which can agree
on a domain of interest, a restricted equivalence
class of explanation if you will. But the formalisms will tend to grow
beyond the selvage lines of the original use case.
Formalisms are only <em>perspectives</em> on patterns.
<!-- capture
different patterns, or suggest different extensions, in ways that can
depend sensitively on the formalism and the domain of application. --></p>
<p>This hints at certain structural “metalaws”.
Patterns are big and rhizomatic; human-invented mathematical
frameworks are a single
mathematical glance, if you like, and can only take in part of the pattern.
Even if formalisms agree on some domain, they will suggest different corridors of growth.
A rectangle can be described either as an equiangular quadrilateral or as a
parallelogram with diagonals of equal length, but the notions involved and
corresponding generalisations are distinct.
<!-- in the two characterisations., and connect along
different lines of development to broader ideas. -->
This also helps explain the phenomenon of deep connections between
apparently unrelated mathematical objects, sometimes only revealed by
a clever change of perspective.
It could be that there is a <em>paucity of structure</em>, so that by dumb
luck (and the <a href="https://en.wikipedia.org/wiki/Pigeonhole_principle">pigeonhole principle</a>), we often unknowingly describe the same
thing in a different guise.
But to my mind, it is more likely that patterns tend to sprawl and
overlap in complex ways.
<!-- , which also explains how different angles on
the same structure can look unrelated! -->
They are less like a few items of furniture in a crumbling
garret (paucity of structure) and more like the intertwined flora of
a tropical jungle.
<!-- And human mathematics typically cannot see the forest for the trees.
There are ways to talk about quantum mechanics without Hilbert spaces,
and particles without representation theory.
That does not mean that the corresponding patterns do not exist, but
rather, they can be described in other ways. --></p>
<p>The second issue is how accurate our descriptions must be.
We know that Newton’s laws are not exactly correct, and break down in
regimes far-removed from those of everyday experience, such as the
very small (where quantum mechanics applies) or the very fast (where
special relativity applies).
Does this mean we should stop believing in forces, or Lagrangians, or
Hamiltonians?
This is like the old Platonist quibble that there is no
such thing as a perfect circle in the real world, so we must be
reasoning about circles in some other realm.
In both cases, the pattern is only <em>approximately</em> realised in
nature, with bumps and fuzzy edges.
But approximation is itself subject to structural laws, exhibiting
patterns treated by mathematics (in, e.g., topology)
and physics (effective field theory).
Perhaps an even better example is statistics, which is literally all
about extracting structure from noisy realisations.
So structural approximations are clearly robust, lawlike and
explanatory, even if they are subtle.
Incidentally, this suggests another metalaw: patterns can stand in patterned
relations to other patterns.
<!-- This is also what emergence is all about! --></p>
<p>This ties back to our original question about the nature of physical
explanation.
Reductionism instructs us to boil things down to their smallest elements.
The Aristotelian view is that, really, we should be searching for
form and structure at whatever level they happen to occur.
This is not only the nature of emergence, but physics more broadly.
How else can we connect the study of the large-scale structure
of spacetime, quarks, bowling balls, planetesimals, or storm clouds?
Physicists almost never boil things down to their smallest elements!
Rather, it seems much more accurate to say that they look for patterns
“in the wild”.
(In contrast, mathematicians study patterns “in captivity”, which gives
them that air of artifice and pedigree.)</p>
<p>One upshot is that, for better or worse, physicists often wade into other
disciplines armed with the lasso of an Emergent Pattern to corral the apparent complexity.
See for
instance
<a href="https://www.penguinrandomhouse.com/books/314049/scale-by-geoffrey-west/">scaling laws</a>,
<a href="https://en.wikipedia.org/wiki/Self-organized_criticality">self-organised criticality</a>,
<a href="https://en.wikipedia.org/wiki/Small-world_network">small-world networks</a>,
and
<a href="https://www.englandlab.com/">thermodynamic explanations for life itself</a>.
They’re not always right (and they’re not always respectful), but
they are just doing their thang.</p>
<h4 id="conclusion">Conclusion</h4>
<p>I’ve argued that the nature of physical explanation is richer and less
boringly hierarchical than the reductionist would have us believe.
In order to explain the properties of shoes or particles, it seems not
only parsimonious but necessary to commit to the existence of
patterns in addition to the things which make those patterns up.
This not only jibes with (and ontologically grounds) the notion of
emergence, but also provides a handle on the metaphysics and
epistemology of mathematical explanation.
<!-- and its relation to the
physical world. -->
Put simply, mathematicians study patterns; physicists study natural patterns.
<!-- It tells us where math comes from, why it is unreasonably effective,
and to what extent it might be invented or non-unique.
Finally, I argued that none of this is spoiled by approximation, since
this is just another pattern. --></p>
<p>Clearly, I’ve left many questions unanswered.
Must patterns be instantiated in the physical world, and if not, where
do such patterns live?
What is the “mereology” that allows them to combine, or to recursively
describe their relationships?
And finally, what grounds the truth about patterns, in physics,
mathematics, or elsewhere?
Most of these I defer to Aristotle, though I hope to write more in future. <!-- I leave the systematic exploration of these questions to the future,-->
In the meantime, discussion and debate are welcome!</p>
<h4 id="acknowledgments-and-references">Acknowledgments and references</h4>
<p>I’d like to thank Leon Di Stefano for introducing me to Aristotelian
structuralism and many enriching conversations over the years.
His ideas <!-- (as articulated in
[this 2017 debate with James Fodor](https://www.youtube.com/watch?v=W0j25NteoXc))-->
inspired and informed this post.
I’ve also been heavily influenced by James
Franklin’s book,
<a href="https://link.springer.com/book/10.1057/9781137400734"><em>An Aristotelian realist view of mathematics</em></a>.
Aristotle himself writes with characteristic brevity on form and
matter in <a href="http://classics.mit.edu/Aristotle/physics.1.i.html"><em>Physics (i)</em></a>.
Finally, I fitfully consulted the SEP entries on
<a href="https://plato.stanford.edu/entries/scientific-reduction/">reductionism</a>
and
<a href="https://plato.stanford.edu/entries/structuralism-mathematics/">mathematical structuralism</a>.</p>
<hr />
<!-- quantamagazine.org/what-is-a-particle-20201112 -->
<!-- https://plato.stanford.edu/entries/scientific-reduction/-->
<!-- https://plato.stanford.edu/entries/structuralism-mathematics/ -->
<div class="footdef"><sup><a id="fn.1" name="fn.1" class="footnum" href="#fnr.1">Footnote 1</a></sup> <p class="footpara">
To be fair, as the quote suggests, the original eliminativists like Paul and
Patricia Churchland were much more interested in abolishing psychology than shoes.
</p></div>
<div class="footdef"><sup><a id="fn.2" name="fn.2" class="footnum" href="#fnr.2">Footnote 2</a></sup> <p class="footpara">
This is called <i>universality</i>, and can be explained using
renormalisation, the technical avatar of "zooming out".
</p></div>
<!--<div class="footdef"><sup><a id="fn.3" name="fn.3" class="footnum"
href="#fnr.3">Footnote 3</a></sup> <p class="footpara">
Particles can have other symmetries as well. An important class is
gauge symmetry, consisting of internal degrees of freedom.
, like a dial on a gauge. These gauge symmetries are crucial to formulating the
whole Standard Model, and explain, for instance, why an electron has -->
<!--charge. </p></div>-->David A WakehamFebruary 8, 2021. Some philosophical reflections on the nature of scientific explanation, structure, emergence, and the unreasonable effectiveness of mathematics.Binomial party tricks2021-02-06T00:00:00+00:002021-02-06T00:00:00+00:00http://hapax.github.io/mathematics/physics/hacker/binomial<p><strong>February 6, 2021.</strong> <em>Sketchy hacker notes on the binomial
approximation. The flashy payoff: party trick arithmetic for estimating
roots in your head.</em></p>
<h4 id="introduction">Introduction</h4>
<p>The binomial approximation is the result that, for any real $\alpha$,
and $|x| \ll 1$,</p>
\[(1 + x)^\alpha \approx 1 + \alpha x.\]
<p>The usual proof involves calculus.
Here, we present a sketchy shortcut and an elementary longcut, neither
of which involves calculus, strictly speaking.
We also derive the quadratic term, and end with a fun party trick for finding roots.</p>
<h4 id="sketchy-shortcut">Sketchy shortcut</h4>
<p>We begin with the shortcut.
In an
<a href="https://hapax.github.io/maths/physics/hacks/exponential/">earlier post</a>,
I derived the following result for the exponential, and $|x| \ll 1$:</p>
\[e^x \approx 1 + x.\]
<p>Rather than go off and read the post, we can do even better and simply
<em>define</em> the exponential by this property.
If it’s true, then for any $r$, we can set $x = r/n$ for very large
$n$ to get</p>
\[e^r = (e^{r/n})^n \approx \left(1 + \frac{r}{n}\right)^n.\]
<p>In the limit of infinite $n$, the expression should be exact. And
indeed, this is the standard definition of $e^r$:</p>
\[e^r = \lim_{n\to\infty} \left(1 + \frac{r}{n}\right)^n.\]
<p>Let’s proceed with a proof of the binomial approximation.
The natural logarithm is the inverse function, so that</p>
\[x = \log e^x \approx \log(1 + x).\]
<p>Recall that</p>
\[x^n = (e^{\log x})^n = e^{n\log x} \quad \Longrightarrow \quad \log x^n = n \log x.\]
<p>Thus, taking the logarithm of $(1 + x)^\alpha$, we have</p>
\[\log [(1+x)^\alpha] = \alpha \log (1+ x) \approx \alpha x,\]
<p>and hence</p>
\[(1+x)^\alpha \approx e^{\alpha x} \approx 1 + \alpha x.\]
<p>This works since all the corrections are at higher order in $x$.</p>
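<p>Both claims in the shortcut are easy to sanity-check numerically. Here is a minimal Python sketch; the helper names <code>compound</code> and <code>binomial_error</code> are my own labels for illustration, and the sample values of $x$ and $\alpha$ are arbitrary.</p>

```python
import math

def compound(r, n):
    """The finite-n approximation (1 + r/n)**n to e**r."""
    return (1 + r / n) ** n

def binomial_error(x, alpha):
    """Difference between (1 + x)**alpha and its linear estimate 1 + alpha*x."""
    return (1 + x) ** alpha - (1 + alpha * x)

# (1 + r/n)**n creeps up on e**r as n grows.
for n in (10, 1000, 100000):
    print(n, compound(1.0, n), math.e)

# Shrinking x by a factor of 10 shrinks the error by roughly 100,
# as expected if the first neglected term is quadratic in x.
for x in (0.1, 0.01, 0.001):
    print(x, binomial_error(x, 0.5))
```

<p>The second loop is the useful one: the error falling off like $x^2$ is what "all the corrections are at higher order" looks like in practice.</p>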
<h4 id="elementary-longcut">Elementary longcut</h4>
<p>This is a bit high brow, and we can get to the same conclusion using
simple algebra.
First note that, from the binomial theorem,</p>
\[(1 + x)^n = 1 + \binom{n}{1}x + \binom{n}{2}x^2 + \cdots + x^n \approx
1 + nx\]
<p>for $|x| \ll 1$, neglecting higher order terms which are much smaller.
So the binomial approximation is true for whole numbers $n$.
If we consider a fraction $q = m/n$, then $(1 + x)^q$ raised to the
power $n$ should equal</p>
\[(1 + x)^{qn} = (1 + x)^{m} \approx 1 + mx \tag{1}\label{m}\]
<p>by the binomial theorem.
Let’s assume</p>
\[(1 + x)^{q} \approx 1 + \beta x,\]
<p>with some higher order terms we can ignore.
Raising to the power $n$, we can use the binomial approximation for
$n$ to get</p>
\[(1 + x)^{qn} \approx (1 + \beta x)^n \approx 1 + \beta n x.\]
<p>Comparing to (\ref{m}), we find that $\beta = m/n$, and hence the
binomial approximation is true for positive rationals.
We can add negative powers using the geometric series:</p>
\[\frac{1}{1 - x} = 1 + x + x^2 + \cdots \approx 1 + x,\]
<p>and hence for a negative rational $q = -m/n$,</p>
\[(1 + x)^q \approx (1 - x)^{m/n} \approx 1 - \frac{m}{n}x = 1 + qx,\]
<p>as required. Finally, there is arbitrary real $\alpha$. This is
actually trivial, in some sense.
Unlike whole numbers (repeated multiplication), fractions (roots), or
negative numbers (reciprocals), an irrational power has no obvious
interpretation. The most reasonable thing to do is define it as a
<em>limit</em> of rational powers that approximate it:</p>
\[(1 + x)^r = \lim_{n \to \infty} (1 + x)^{q_n},\]
<p>where $q_n$ is a sequence of rational numbers (e.g. the decimal
expansion) approximating $r$.
In this case, the binomial approximation gives</p>
\[(1 + x)^r = \lim_{n \to \infty} (1 + x)^{q_n} \approx 1 + x \lim_{n
\to \infty} q_n = 1 + rx,\]
<p>and so the result holds for all real numbers.</p>
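<p>To make the limiting argument concrete, here is a small Python sketch that feeds decimal truncations of an irrational exponent through the rational result. The function name <code>decimal_truncations</code> and the choices $r = \sqrt{2}$, $x = 0.01$ are assumptions made purely for illustration.</p>

```python
from fractions import Fraction
import math

def decimal_truncations(r, digits):
    """Rational approximations q_1, q_2, ... from truncating r's decimal expansion."""
    return [Fraction(math.floor(r * 10**k), 10**k) for k in range(1, digits + 1)]

x = 0.01
r = math.sqrt(2)
for q in decimal_truncations(r, 6):
    exact = (1 + x) ** float(q)      # rational power (1 + x)**(m/n)
    linear = 1 + float(q) * x        # binomial approximation for that rational
    print(q, exact, linear)

# Both columns settle down to the values for the irrational exponent itself.
print("limit:", (1 + x) ** r, 1 + r * x)
```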
<h4 id="higher-terms">Higher terms</h4>
<p>It’s possible, if messy, to extend these methods to determine the next
term in the approximation.
We’ll do the longcut, and use big-O notation, with $O(x^3)$ in this
context meaning “terms with powers of $x^3$ or higher”.
The binomial theorem gives</p>
\[(1 + x)^n = 1 + nx + \frac{n(n-1)}{2} x^2 + O(x^3), \tag{2} \label{second}\]
<p>since the coefficient of the $x^2$ term is the number of ways of
choosing $2$ items (the $x$ terms) from $n$ items (the factors in the power).
For a rational $q = m/n$, we have</p>
\[(1 + x)^{qn} = (1 + x)^m = 1 + mx + \frac{m(m-1)}{2} x^2 + O(x^3),\]
<p>and if we assume</p>
\[(1 + x)^{q} = 1 + qx + \gamma x^2 + O(x^3),\]
<p>then the binomial theorem again gives</p>
\[(1 + x)^{qn} = \left[1 + qx + \gamma x^2 + O(x^3)\right]^n = 1 + nqx +
\left[n\gamma + \frac{n(n-1)}{2}q^2 \right]x^2 + O(x^3).\]
<p>The coefficient of the linear term $nq = m$ matches, but the quadratic
term requires more work. Comparing to the expansion of $(1 + x)^m$ above
(that is, (\ref{second}) with $n$ replaced by $m$) and
rearranging for $\gamma$, we have</p>
\[\begin{align*}
\gamma & = \frac{1}{n}\left[\frac{m(m-1)}{2}- \frac{n(n-1)}{2}q^2\right]
=\frac{m(m-1)}{2n}- \frac{m^2(n-1)}{2n^2}
=\frac{q(q - 1)}{2}.
\end{align*}\]
<p>Thus, we find that to second order,</p>
\[(1 + x)^q = 1 + qx + \frac{q(q-1)}{2} x^2 + O(x^3)\]
<p>The extension to real and negative powers is easy. The extension to
higher terms in $x$ is not.
They obey something called the binomial series,</p>
\[(1 + x)^\alpha = \sum_{k = 0}^\infty \frac{\alpha(\alpha - 1)\cdots
(\alpha-k +1)}{k!} x^k,\]
<p>and I have no idea how to get this without calculus.
Any tips appreciated!</p>
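<p>The second-order coefficient can at least be checked exactly, without calculus, by multiplying out truncated polynomials. The sketch below uses Python's <code>fractions</code> module; the helpers <code>mul_trunc</code>, <code>pow_trunc</code> and <code>check_second_order</code> are hypothetical names of my own, and the test fractions are arbitrary.</p>

```python
from fractions import Fraction

def mul_trunc(a, b, order=3):
    """Multiply two polynomials (coefficients, lowest power first), dropping x**order and higher."""
    out = [Fraction(0)] * order
    for i, ai in enumerate(a[:order]):
        for j, bj in enumerate(b[:order - i]):
            out[i + j] += ai * bj
    return out

def pow_trunc(a, n, order=3):
    """Raise a polynomial to a whole-number power n, keeping terms below x**order."""
    out = [Fraction(1)] + [Fraction(0)] * (order - 1)
    for _ in range(n):
        out = mul_trunc(out, a, order)
    return out

def check_second_order(m, n):
    """Does [1 + qx + q(q-1)/2 x^2]^n agree with (1 + x)^m up to x^2, for q = m/n?"""
    q = Fraction(m, n)
    guess = [Fraction(1), q, q * (q - 1) / 2]
    lhs = pow_trunc(guess, n)
    rhs = pow_trunc([Fraction(1), Fraction(1), Fraction(0)], m)
    return lhs == rhs

print(check_second_order(3, 7))   # q = 3/7
print(check_second_order(5, 2))   # q = 5/2
```

<p>Because the arithmetic is done with exact fractions, agreement here is an identity up to $O(x^3)$ rather than a floating-point coincidence.</p>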
<h4 id="rooting-out-the-answer">Rooting out the answer</h4>
<p>The applications are many and various, but the simplest thing we can
try is quickly calculating powers $y^\alpha$.
The general trick is to find a power near $y$ that is simpler to
evaluate, factor out the simple answer, then use the binomial
approximation.
I think there are actually better ways to estimate positive powers,
but the binomial approximation really shines in the estimation
of roots.
It can even be a good party trick, depending on the kind of parties
you go to!</p>
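<p>The general trick fits in a few lines of Python. This is only a sketch: the function name <code>binomial_power</code> and its arguments are my own invention, and the two sample roots are illustrative choices rather than examples from the text.</p>

```python
def binomial_power(y, y0, alpha, y0_power):
    """Estimate y**alpha from a nearby y0 whose power y0**alpha = y0_power is known.

    Writes y = y0 * (1 + x) and applies (1 + x)**alpha ≈ 1 + alpha * x.
    """
    x = y / y0 - 1
    return y0_power * (1 + alpha * x)

# Square root of 10, using the nearby perfect square 9 = 3**2.
print(binomial_power(10, 9, 1 / 2, 3), 10 ** 0.5)

# Cube root of 30, using the nearby perfect cube 27 = 3**3.
print(binomial_power(30, 27, 1 / 3, 3), 30 ** (1 / 3))
```

<p>Each line prints the estimate next to the floating-point value, so you can see how close the first-order trick gets.</p>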
<p>Suppose someone asks you to find the square root of $8$.
You look for a nearby perfect square, in this case $9$, then factor
eight into $9$ times one minus something small:</p>
\[\sqrt{8} = \sqrt{9\left(1 - \frac{1}{9}\right)} = 3 \left(1 - \frac{1}{9}\right)^{1/2}.\]
<p>We can take $\alpha = 1/2$ and $x = -1/9$ in the binomial
approximation, and see how we go, noting that</p>
\[\sqrt{1 - x} = 1 - \frac{1}{2}x - \frac{1}{8}x^2 + O(x^3).\]
<p>To first order, we get</p>
\[3 \left(1 - \frac{1}{9}\right)^{1/2} \approx 3\left[1 - \frac{1}{2} \cdot \frac{1}{9}\right]
= \frac{17}{6} \approx 2.83.\]
<p>To second order,</p>
\[3 \left(1 - \frac{1}{9}\right)^{1/2} \approx
3\left[1 - \frac{1}{2} \cdot \frac{1}{9} - \frac{1}{8} \cdot \frac{1}{9^2}\right]
= \frac{611}{216} \approx 2.829.\]
<p>The actual answer is $\sqrt{8} = 2.828$, so even the first term in the
binomial approximation is very good! We’ll finish with a somewhat more
involved example.
Let’s approximate the fifth root of six, $6^{1/5}$.
I only know one fifth power off the top of my head, $2^5 = 32$, and
this happens to be near $6^2 = 36$.
We can chain these observations together as follows:</p>
\[\begin{align*}
6^{1/5} = 36^{1/10} = 32^{1/10}\left(1 + \frac{1}{8}\right)^{1/10} & =\sqrt{2}\left(1 + \frac{1}{8}\right)^{1/10} \approx \sqrt{2} \cdot \left(1 + \frac{1}{10\cdot 8}\right).
\end{align*}\]
<p>At this point, we could separately approximate $\sqrt{2}$, but I
happen to know it’s about $1.414$, so I can divide by $80$ (or even
just $100$ for a quick mental estimate), and add them together to get</p>
\[\sqrt[5]{6} \approx 1.414 + \frac{1.414}{80} \approx 1.43.\]
<p>Consulting a calculator, this is correct to two decimal places!
With the power of the binomial approximation, you can do it in your head.</p>David A WakehamFebruary 6, 2021. Sketchy hacker notes on the binomial approximation. The flashy payoff: party trick arithmetic for estimating roots in your head.A simplicial generalisation of the Bloch ball2021-02-05T00:00:00+00:002021-02-05T00:00:00+00:00http://hapax.github.io/maths/physics/qc/unitary-orbits<p><strong>February 5, 2021.</strong> <em>I explore unitary orbits of density matrices
for finite-dimensional quantum systems. The upshot is a neat scheme
for representing orbits using simplices.</em></p>
<h4 id="introduction">Introduction</h4>
<p>The <a href="https://en.wikipedia.org/wiki/Bloch_sphere">Bloch sphere</a>
represents the space of pure states on a single qubit (see also
<a href="https://hapax.github.io/physics/mathematics/bloch/">this</a> recent
post).
The “Bloch ball” is the space of all <em>density matrices</em> on the qubit.
It fills in the Bloch sphere with concentric spheres of increasing
mixedness, and at the centre is the maximally mixed state $I_2/2$,
where $I_d$ will denote the $d \times d$ identity matrix.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/unitary1.png" />
</div>
</figure>
<p>Spheres arise naturally.
They carry the structure of the unitary group $\mathrm{U}(2)$ acting
on qubits, once we have modded out by the phase ambiguity:</p>
\[\frac{\mathrm{U}(2)}{\mathrm{U}(1)} = \mathrm{SU}(2).\]
<p>This is a double cover of the rotation group $\mathrm{SO}(3)$, which
acts transitively on the sphere.
(The “double cover” part gives us spinors.)
Thus, spheres occur naturally as unitary orbits, and indeed, each
concentric sphere in the Bloch ball is such an orbit.
The question is whether this generalises nicely to higher dimensions.</p>
<h4 id="the-bloch-ball">The Bloch ball</h4>
<p>Let’s think about the Bloch ball in a little more detail.
Each density matrix $\rho$ is a $2\times 2$ matrix acting on the space
of qubits, which is positive and has unit trace.
Positivity just means that, for every state $|\psi\rangle$,</p>
\[\langle \psi | (\rho | \psi \rangle) \geq 0.\]
<p>Hence, $\rho$ is Hermitian, since the reality of this inner product implies</p>
\[\langle \psi | (\rho | \psi \rangle) = (\langle \psi | \rho^\dagger)
|\psi \rangle \quad \Longrightarrow \quad \rho = \rho^\dagger.\]
<p>In turn, this means that $\rho$ is unitarily diagonalisable,
i.e. $U^\dagger \rho U = \Lambda$ for some diagonal matrix $\Lambda$
and unitary matrix $U^\dagger U = UU^\dagger = I$.
It’s also clear that the eigenvalues of $\rho$, the diagonal entries of $\Lambda$, must be non-negative.
In fact, since the permutation matrices are unitary, we can arrange
the eigenvalues in decreasing size, so that every $2 \times 2$ density
matrix is unitarily equivalent to some matrix</p>
\[\Lambda(p) =
\begin{bmatrix}
p & \\
& 1-p
\end{bmatrix}\]
<p>for $p \in [1/2, 1]$.
The maximally mixed density $I_2/2$ has a trivial orbit, since it
always gets mapped to itself:</p>
\[U^\dagger I_2 U = U^\dagger U = I_2.\]
<p>We can measure the distance from this matrix to $\Lambda(p)$ using the
Frobenius norm, aka Hilbert-Schmidt norm.
This is just the usual vector norm where we treat a matrix $A = [a_{ij}]$ as a big vector:</p>
\[||A||^2 = \sum_{ij} |a_{ij}|^2 = \mbox{Tr}[A^\dagger A].\]
<p>Hence,</p>
\[\begin{align*}
||\Lambda(p) - \tfrac{1}{2}I_2||^2 & = \left|\left| \begin{bmatrix}
p - 1/2 & \\
& 1/2-p
\end{bmatrix} \right|\right|^2
= 2\left(p - \tfrac{1}{2}\right)^2.
\end{align*}\]
<p>It’s easy to see that any density matrix in the unitary orbit of $\Lambda(p)$
has the same distance, since we can use $I_2 = U^\dagger I_2 U$,
i.e. it is a class function:</p>
\[\begin{align*}
||U^\dagger \Lambda U - \tfrac{1}{2}I_2||^2 & =
\mbox{Tr}\left[(U^\dagger \Lambda U - \tfrac{1}{2}I_2)^\dagger (U^\dagger \Lambda U - \tfrac{1}{2}I_2)\right]\\
& =
\mbox{Tr}\left[U^\dagger (\Lambda - \tfrac{1}{2}I_2)^\dagger UU^\dagger (\Lambda - \tfrac{1}{2}I_2) U\right]\\
& =
\mbox{Tr}\left[(\Lambda - \tfrac{1}{2}I_2)^\dagger (\Lambda - \tfrac{1}{2}I_2) \right]
= ||\Lambda - \tfrac{1}{2}I_2||^2.
\end{align*}\]
<p>We can define distance between densities as the Hilbert-Schmidt norm
times a positive constant $C$.
We choose $C = \sqrt{2}$ so that for pure states with $p = 1$, the
associated distance is $r = 2(p - 1/2) = 1$.
In general, since each such $r$ is associated with a unique
$\Lambda(p)$, we conclude that the space of $2\times 2$ density
matrices is a ball consisting of concentric, transitive orbits of the
unitary group, with the pure states at $p = 1$, the maximally mixed
state at $p = 1/2$, and radius $r = 2(p - 1/2)$ for the orbit of $\Lambda(p)$.</p>
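<p>Here is a quick numerical check that the rescaled distance really is constant on a unitary orbit. It is a plain-Python sketch with $2 \times 2$ matrices stored as nested lists; the helper names (<code>su2</code>, <code>dagger</code>, <code>matmul</code>, <code>bloch_radius</code>) and the particular angles are assumptions of mine, chosen only for illustration.</p>

```python
import cmath
import math

def su2(theta, alpha, beta):
    """A 2x2 special unitary matrix [[a, -conj(b)], [b, conj(a)]] with |a|^2 + |b|^2 = 1."""
    a = math.cos(theta) * cmath.exp(1j * alpha)
    b = math.sin(theta) * cmath.exp(1j * beta)
    return [[a, -b.conjugate()], [b, a.conjugate()]]

def dagger(m):
    """Conjugate transpose of a 2x2 matrix."""
    return [[m[j][i].conjugate() for j in range(2)] for i in range(2)]

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def bloch_radius(rho):
    """sqrt(2) times the Frobenius distance from rho to the maximally mixed state I/2."""
    diff = [[rho[i][j] - (0.5 if i == j else 0.0) for j in range(2)] for i in range(2)]
    return math.sqrt(2 * sum(abs(diff[i][j]) ** 2 for i in range(2) for j in range(2)))

p = 0.8
lam = [[p, 0.0], [0.0, 1 - p]]
u = su2(0.7, 1.1, -0.4)            # arbitrary angles, chosen only for illustration
rho = matmul(dagger(u), matmul(lam, u))

# All three numbers should agree up to rounding: the radius is a class function.
print(bloch_radius(lam), bloch_radius(rho), 2 * (p - 0.5))
```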
<h4 id="orbital-mechanics">Orbital mechanics</h4>
<p>A similar story holds in higher dimensions. Density matrices are
positive and unit trace, so each orbit in dimension $d$ has a canonical
representative of the form</p>
\[\Lambda = \mathrm{diag}(p_1, p_2, \ldots, p_d),\]
<p>where the positivity of $\rho$ and unit trace condition imply</p>
\[\sum_{i=1}^d p_i = 1, \quad p_i \geq 0,\]
<p>and we can arrange eigenvalues in descending order:</p>
\[p_1 \geq p_2 \geq \cdots \geq p_d \geq 0.\]
<p>The constraint that the eigenvalues sum to $1$ means that we only need
$p_1, p_2, \ldots, p_{d-1}$ to uniquely specify a canonical
representative $\Lambda(p_1, p_2, \ldots, p_{d-1})$.
We can repeat the calculations from above to show that $I_d/d$ has a
trivial orbit, and that any density matrix in the orbit of $\Lambda(p_1,
\ldots, p_{d-1})$ has a fixed distance to the mixed state:</p>
\[r^2(p_1, \ldots, p_{d-1}) = C_d\sum_{i=1}^d \left(p_i - \frac{1}{d}\right)^2,\]
<p>where we choose $C_d$ so that the pure states, with $p_1 = 1,
p_2 = \cdots = p_d = 0$, have distance $r = 1$.
For completeness, a pure state has $\sum_i (p_i - 1/d)^2 = (1 - 1/d)^2 + (d-1)/d^2 = (d-1)/d$, so that</p>
\[C_d = \frac{d}{d - 1}.\]
<p>It’s a bit trickier to see what the orbits look like, but in the same
way that $I_d$ is fixed by the group $\mathrm{U}(d)$, we can read off
fixed subgroups from the eigenvalue decomposition.
For instance, a pure state has</p>
\[p_1 = 1, \quad p_2 = \cdots = p_d = 0.\]
<p>The first factor is fixed by $\mathrm{U}(1)$ (corresponding to global
phase), while the last $d - 1$ factors are fixed by $\mathrm{U}(d-1)$.
These act independently, so that the stabiliser of a pure state is
$\mathrm{U}(1) \times \mathrm{U}(d-1)$.
By the orbit-stabiliser theorem, the orbit of pure states has the (coset) structure</p>
\[\frac{\mathrm{U}(d)}{\mathrm{U}(1) \times \mathrm{U}(d - 1)}.\]
<p>Since $\mathrm{U}(d)$ has dimension $d^2$, this pure state orbit has
dimension</p>
\[d^2 - 1^2 - (d - 1)^2 = 2d - 2,\]
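<p>and lies on the unit sphere around $I_d/d$ in our Hilbert-Schmidt metric, which is a copy of $\mathbb{S}^{d^2-2}$ since unit-trace Hermitian matrices have $d^2 - 1$ real parameters.
For $d = 2$, the orbit is two-dimensional and fills the whole sphere $\mathbb{S}^2$, in agreement with the Bloch sphere.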
This seems rather nice, but in general, the orbits will be horrible.
First of all, spheres of radius $r < 1$ around the mixed state will
now be made up of uncountably many orbits, since there are uncountably
many sets of $p_i$ which solve</p>
\[r^2 = C_d\sum_{i=1}^d \left(p_i -\frac{1}{d}\right)^2\]
<p>for $r < 1$.
And orbits can be more elaborate for other eigenvalue structures.
For instance, if we lump the $p_i$ into $K$ sets of <em>distinct</em> eigenvalues,</p>
\[P_1, P_2, \ldots, P_K,\]
<p>with multiplicity $\mu_J$ associated to eigenvalue $P_J$, then the
same argument as above shows that the coset structure is</p>
\[\frac{\mathrm{U}(d)}{\mathrm{U}(\mu_1) \times \cdots \times \mathrm{U}(\mu_K)},\]
<p>known to mathematicians as a <a href="https://en.wikipedia.org/wiki/Generalized_flag_variety#Partial_flag_varieties">partial flag variety</a>.
These orbits have dimension</p>
\[D = d^2 - \sum_{J=1}^K \mu_J^2,\]
<p>and lie on a sphere of radius</p>
\[r^2 = C_d\sum_{J=1}^K \mu_J\left(P_J - \frac{1}{d}\right)^2.\]
<p>Note that while mixed states are closer to the
maximally mixed state, unlike the Bloch ball, they do not lie inside the orbit of pure states.
Typically, they have more dimensions!
For instance, for a generic point with no symmetries (distinct $p_i$), the cosets are of the form</p>
\[\frac{\mathrm{U}(d)}{(\mathrm{U}(1))^d}\]
<p>with dimension $d^2 - d$, so for $d > 2$, these are always bigger than
the pure state orbits.
It’s certainly possible to say more about this, but who wants to? It’s
a mess!</p>
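<p>Messy or not, the two basic invariants of an orbit are easy to read off from the eigenvalues. The following Python sketch does so; the function names are hypothetical, eigenvalues are grouped with a small tolerance (an assumption of mine), and the radius is normalised numerically against a pure state rather than through a closed form for $C_d$.</p>

```python
import math

def multiplicities(p, tol=1e-12):
    """Group eigenvalues (sorted in descending order) into runs of equal values."""
    groups = []
    for value in sorted(p, reverse=True):
        if groups and abs(groups[-1][0] - value) < tol:
            groups[-1][1] += 1
        else:
            groups.append([value, 1])
    return [count for _, count in groups]

def orbit_dimension(p):
    """Dimension d^2 - sum_J mu_J^2 of the unitary orbit through diag(p)."""
    d = len(p)
    return d * d - sum(mu * mu for mu in multiplicities(p))

def orbit_radius(p):
    """Distance to the maximally mixed state, rescaled so pure states sit at r = 1."""
    d = len(p)
    spread = sum((q - 1 / d) ** 2 for q in p)
    pure = [1.0] + [0.0] * (d - 1)
    pure_spread = sum((q - 1 / d) ** 2 for q in pure)
    return math.sqrt(spread / pure_spread)

for p in ([1, 0, 0], [0.5, 0.5, 0], [0.5, 0.3, 0.2], [1 / 3, 1 / 3, 1 / 3]):
    print(p, multiplicities(p), orbit_dimension(p), round(orbit_radius(p), 3))
```

<p>For a qutrit, the generic spectrum in the third row has a larger orbit than the pure state in the first, even though it sits closer to the centre.</p>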
<h4 id="the-simplicial-wedge">The simplicial wedge</h4>
<p>Our modest goal will be to tidy up some of the mess.
The main observation is that the eigenvalues $p_i$ form a probability
distribution over $d$ outcomes.
If they had an arbitrary order, they would live on the standard
$(d-1)$-simplex $\Delta_{d-1}$, but because they are arranged in decreasing order,
they live on the simplicial “wedge”:</p>
\[W_{d-1} = \left\{(p_1, \ldots, p_d) : \sum_{i=1}^d p_i = 1, p_1 \geq
p_2 \geq \cdots \geq p_d \geq 0\right\}.\]
<p>Note that the subscript denotes the number of independent
parameters.
We can illustrate these ideas for $d = 2$:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/unitary2.png" />
</div>
</figure>
<p>We start with the $1$-simplex $\Delta_1$, and divide it in two to get the
wedge $W_1$.
The black dot at the top is the orbit of pure states, and the white
dot the maximally mixed state.
In general, the wedge $W_{d-1}$ is almost a quotient of $\Delta_{d-1}$
by its symmetry group, the set of permutations $S_d$.
But the wedge has literal “edge cases”, stabilised by subgroups of $S_d$ in a way
that mirrors the corresponding unitary orbits.
More precisely, if a point in $W_{d-1}$ is stabilised by $S_{\mu_1} \times
\cdots \times S_{\mu_K}$, then the corresponding coset structure for
the orbit is the partial flag variety</p>
\[\frac{\mathrm{U}(d)}{\mathrm{U}(\mu_1) \times \cdots \times \mathrm{U}(\mu_K)}.\]
<p>For instance, pure states have canonical representative</p>
\[(1, 0, 0, \ldots, 0) \in W_{d-1},\]
<p>which is stabilised by the subgroup $S_1 \times S_{d-1}$.
This correctly gives the coset orbit</p>
\[\frac{\mathrm{U}(d)}{\mathrm{U}(1) \times \mathrm{U}(d - 1)}.\]
<p>The maximally mixed state, and centroid of the whole simplex, has coordinates</p>
\[\frac{1}{d}(1, 1, \ldots, 1),\]
<p>and is stabilised by the full group $S_d$. As we expect, the orbit is
trivial.
We can see how this works for a qutrit below.
We start with the $2$-simplex $\Delta_2$, an equilateral triangle, and
cut out the wedge $W_2$:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/unitary3.png" />
</div>
</figure>
<p>At the top we have the pure states as usual, and the mixed state at
the white centroid.
The grey dot represents the fully mixed state on two basis elements.
Note that, along the red edges, two coordinates agree, and in fact,
each represents a copy of $W_1$, coinciding at the centroid.
In general, orbit degeneracies occur precisely at sub-wedges $W_K$
with interiors parameterised by the coordinates $P_1, \ldots, P_K$
introduced above.
But when distinct sub-wedges coincide, we get even more degeneracy.
So, the apparent randomness of orbits is somewhat tamed by geometric
hierarchy.</p>
<p>Finally, to relate this back to spheres, the nice thing about using
the Frobenius norm is that the distance between a density matrix and
the maximally mixed matrix is just proportional to the Euclidean
distance on the wedge.
So we can literally draw concentric spheres emanating from the
centroid!
Our scheme does not do away with all the messiness of the orbits. But
it does provide a simple way to organise and read off some of their
basic properties, and generalises in a reasonably natural way the concentric spheres of the Bloch ball.</p>
<!-- https://en.wikipedia.org/wiki/Bloch_sphere -->David A WakehamFebruary 5, 2021. I explore unitary orbits of density matrices for finite-dimensional quantum systems. The upshot is a neat scheme for representing orbits using simplices.Turning a thermometer into a sundial2021-01-28T00:00:00+00:002021-01-28T00:00:00+00:00http://hapax.github.io/mathematics/physics/everyday/diurnal<p><strong>January 28, 2021.</strong> <em>I attempt to turn a thermometer
(or more specifically, data about the maximum daily temperature)
into a sundial. Though it fails on earth, it works on Mercury!</em></p>
<h4 id="introduction">Introduction</h4>
<p>The sun heats the earth up, and the earth radiates that heat back into
space. As the sun sets, less heat is delivered, and the maximum
temperature occurs when the two rates—heat delivered and heat
radiated—balance. In this post, we’ll work out how this simple
requirement relates maximum temperature to the latitude, time of year,
and time of day the maximum occurs, meaning that a thermometer can in
principle be used as a sort of sundial.
In practice, this is only the first step towards a realistic model,
but for the purpose of building narrative tension, I will let the
shortcomings of my approach unfold naturally.</p>
<h4 id="energy-balance">Energy balance</h4>
<p>Consider a small patch of the earth’s surface of unit area, at the
point it attains its maximum temperature $T_\text{max}$ in Kelvin.
According to the
<a href="https://en.wikipedia.org/wiki/Stefan%E2%80%93Boltzmann_law">Stefan-Boltzmann law</a>,
it radiates energy away with intensity</p>
\[I_\text{out} = \sigma T_\text{max}^4, \quad \sigma = 5.67 \times
10^{-8} \frac{\text{W}}{\text{m}^2 \text{ K}^4}.\]
<p>Since this is the maximum attained, it must equal the intensity of
incoming solar radiation $I_\text{in}$.
To a good approximation, this is the radiant intensity of sunlight
striking the earth’s surface head on, the so-called insolation
constant $I_0$, multiplied by a geometric term $\cos^2\vartheta$
(where $\vartheta$ is the angle the sunlight makes with the vertical
to the ground), and an albedo term $(1-a)$ to account for sunlight reflected
back:</p>
\[I_\text{in} = I_0 (1- a )\cos^2\vartheta.\]
<p>The insolation constant is $I_0 = 1367 \text{ W/m}^2$ [<sup><a id="fnr.1" name="fnr.1" class="footref" href="#fn.1">1</a></sup>].
The albedo of the earth is around $a = 0.3$, i.e. $30\%$ reflected
back into space on average, though this depends on cloud cover, snow,
and so on.
We will talk about $\vartheta$ more in a moment.
Setting $I_\text{in} = I_\text{out}$ when the maximum is obtained, we
find</p>
\[I_0 (1- a )\cos^2\vartheta = \sigma T_\text{max}^4. \label{balance} \tag{1}\]
<p>Thus, the maximum temperature is directly related to the length of shadow!</p>
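<p>To get a feel for the numbers, here is a minimal sketch that solves the balance equation (1) for $T_\text{max}$ given the angle $\vartheta$. The function name and default albedo are illustrative choices; the constants are the ones quoted above.</p>

```python
import math

SIGMA = 5.67e-8   # Stefan-Boltzmann constant, W / (m^2 K^4)
I0 = 1367.0       # insolation constant, W / m^2


def max_temperature(zenith_angle, albedo=0.3):
    """Solve I0 (1 - a) cos^2(angle) = sigma T^4 for T, in Kelvin."""
    cos_t = math.cos(zenith_angle)
    return (I0 * (1 - albedo) * cos_t ** 2 / SIGMA) ** 0.25


# Sun directly overhead versus sun 60 degrees from the vertical.
print(max_temperature(0.0), max_temperature(math.radians(60)))
```

<p>With the sun directly overhead and $a = 0.3$, this simple balance gives roughly $360 \text{ K}$, and the temperature falls as $\sqrt{\cos\vartheta}$ as the sun moves away from the vertical.</p>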
<h4 id="geometry-and-heliometry">Geometry and heliometry</h4>
<p>Even more interesting is how $\vartheta$ is related to the earth-sun
geometry, and the parameters of latitude, time of year, and time of
day.
The point directly below the sun, called the <em>subsolar point</em>, traces
out a line of latitude as the earth rotates, with colatitude
$\theta_\text{sub}$ (the angle measured from the north pole) depending on the time of year.
Here is a basic picture of the setup:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/diurnal1.png" />
</div>
</figure>
<p>At either equinox, it coincides with the equator (red line).
At the (northern hemisphere’s) summer solstice, it runs along the Tropic of Cancer, about
$23.5^\circ$ north of the equator.
At the winter solstice, it lies $23.5^\circ$ south of the equator, on
the Tropic of Capricorn.
If we draw the orbit of the earth as a circle around the sun, with
$\varphi = 0$ at the winter solstice and increasing with time, then the
subsolar latitude, measured in radians from the north pole, roughly obeys</p>
\[\theta_\text{sub} = \frac{\pi}{2} + \left(\frac{2\pi}{360}\right) 23.5
\cos(\varphi).
\label{year} \tag{2}\]
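<p>Equation (2) is easy to tabulate. The sketch below evaluates it at the solstices and equinoxes; the function name <code>subsolar_colatitude</code> is hypothetical, and the circular-orbit approximation is the same one made in the text.</p>

```python
import math

TILT_DEG = 23.5  # axial tilt in degrees, as quoted in the text


def subsolar_colatitude(orbit_angle):
    """Colatitude (radians from the north pole) of the subsolar point.

    orbit_angle is the orbital angle in radians, zero at the winter solstice.
    """
    return math.pi / 2 + math.radians(TILT_DEG) * math.cos(orbit_angle)


for name, angle in [("winter solstice", 0.0),
                    ("equinox", math.pi / 2),
                    ("summer solstice", math.pi)]:
    # Convert back to degrees north of the equator for readability.
    latitude = 90 - math.degrees(subsolar_colatitude(angle))
    print(f"{name}: {latitude:+.1f} degrees")
```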
<p>To calculate the angle $\vartheta$, we need two additional data
points: the colatitude $\theta_\text{lat}$ of the observation point (measured from the north
pole) and the difference in longitude $\phi$ between the observation point and the
current subsolar point.
The latter simply measures time from solar noon.
To determine $\vartheta$, first note that if we draw the subsolar and
observation point on the same great circle of the earth, $\vartheta$ is clearly the
angle between the black lines, drawn from each point to the centre of
the earth [<sup><a id="fnr.2" name="fnr.2" class="footref" href="#fn.2">2</a></sup>]:</p>
<figure>
<div style="text-align:center"><img src="/images/posts/diurnal2.png" />
</div>
</figure>
<p>This means we can easily determine $\cos\vartheta$ using vectors,
simply by taking the dot product.
To begin with, we write both points on the unit sphere in spherical coordinates $(\theta,\phi)$, then convert to
Cartesian coordinates $(x, y, z)$:</p>
\[\begin{align*}
\mathbf{x}_\text{sub} (\theta_{\text{sub}}, 0) & = (\sin \theta_\text{sub}, 0, \cos\theta_\text{sub}) \\
\mathbf{x}_\text{obs} (\theta_{\text{lat}}, \phi) & = (\sin \theta_\text{lat}\cos\phi, \sin \theta_\text{lat}\sin\phi, \cos
\theta_\text{lat}).
\end{align*}\]
<p>We can immediately determine the dot product:</p>
\[\cos\vartheta = \mathbf{x}_\text{sub} \cdot \mathbf{x}_\text{obs} =
\cos\theta_\text{sub}\cos\theta_\text{lat} + \sin
\theta_\text{sub}\sin \theta_\text{lat}\cos \phi. \label{geohelio} \tag{3}\]
<p>Plugging this back into (\ref{balance}), we find a relationship
between maximum temperature $T_\text{max}$, time of year via
$\theta_\text{sub}$, latitude $\theta_\text{lat}$, and time of day, or
rather, time past solar noon $\phi$.</p>
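<p>The dot product in (3) can be checked numerically. This is a small sketch, with hypothetical helper names, that builds the two unit vectors explicitly and compares their dot product with the closed form.</p>

```python
import math


def unit_vector(theta, phi):
    """Cartesian unit vector for colatitude theta and longitude phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def cos_zenith(theta_sub, theta_lat, phi):
    """cos of the angle between subsolar and observation points, via vectors."""
    sub = unit_vector(theta_sub, 0.0)
    obs = unit_vector(theta_lat, phi)
    return sum(a * b for a, b in zip(sub, obs))


def cos_zenith_closed_form(theta_sub, theta_lat, phi):
    """Right-hand side of equation (3)."""
    return (math.cos(theta_sub) * math.cos(theta_lat)
            + math.sin(theta_sub) * math.sin(theta_lat) * math.cos(phi))


print(cos_zenith(1.9, 0.71, 0.5), cos_zenith_closed_form(1.9, 0.71, 0.5))
```

<p>When the observation point sits at the subsolar point, both functions return one, meaning the sun is directly overhead.</p>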
<h4 id="real-data">Real data</h4>
<p>The question is: how does this stack up against real data?
I’ll take some local weather data.
In Vancouver, the latitude is $49.3^\circ$ north of the equator, which
measured from the north pole gives the coordinate</p>
\[\theta_{\text{lat}} = \left(\frac{2\pi}{360}\right)(90 - 49.3) \approx 0.71.\]
<p>It’s $36$ days, or about a tenth of a year, since
the winter solstice, so from (\ref{year}), the subsolar latitude is</p>
\[\theta_\text{sub} = \frac{\pi}{2} + \left(\frac{2\pi }{360}\right) 23.5
\cos(0.1 \cdot 2\pi) \approx 1.9.\]
<p>This agrees with <a href="https://rl.se/sub-solar-point">real-time data</a> on
the subsolar point.
Finally, the <a href="https://www.timeanddate.com/weather/canada/vancouver/historic?month=1&year=2021">maximum temperature yesterday</a> was $7^\circ \text{ C} =
280 \text{ K}$, and cloud cover makes $a \approx 0.35$.
Thus, rearranging (\ref{geohelio}) and (\ref{balance}), we expect the
maximum to occur at a “time of day angle” $\phi$ given by</p>
\[\begin{align*}
\cos \phi & = \frac{\sqrt{\frac{\sigma
T_\text{max}^4}{I_0(1-a)}} - \cos\theta_\text{sub}\cos\theta_\text{lat}}{\sin
\theta_\text{sub}\sin \theta_\text{lat}} \\
& = \frac{\sqrt{\frac{(5.67 \times
10^{-8}) 280^4}{1367(1-0.35)}} - \cos 1.9\cos 0.71}{\sin 1.9\sin 0.71} \\
& \approx 1.41.
\end{align*}\]
<p>Hopefully the problem is clear.
The right-hand side is bigger than one, so it cannot possibly equal the
cosine on the left!
If we plug in the time it peaked, a few hours after solar noon, we can
rearrange and solve to find a predicted maximum temperature of
$-80^\circ \text{ C}$!
So something is very wrong.</p>
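<p>The arithmetic can be reproduced in a few lines. This is a sketch under stated assumptions: the post only says the peak came “a few hours” after solar noon, so the value $\phi = \pi/6$ (two hours) used below is a guess, and the albedo is left as a parameter so that both $a = 0.3$ and $a = 0.35$ can be tried.</p>

```python
import math

SIGMA = 5.67e-8   # W / (m^2 K^4)
I0 = 1367.0       # W / m^2
T_MAX = 280.0     # observed maximum in Kelvin

theta_lat = math.radians(90 - 49.3)                                   # Vancouver
theta_sub = math.pi / 2 + math.radians(23.5) * math.cos(0.1 * 2 * math.pi)


def cos_phi(albedo):
    """Value of cos(phi) demanded by equations (1) and (3) for the observed T_MAX."""
    root = math.sqrt(SIGMA * T_MAX ** 4 / (I0 * (1 - albedo)))
    return ((root - math.cos(theta_sub) * math.cos(theta_lat))
            / (math.sin(theta_sub) * math.sin(theta_lat)))


def predicted_t_max(phi, albedo):
    """Maximum temperature in Kelvin predicted by the model at hour angle phi."""
    cos_t = (math.cos(theta_sub) * math.cos(theta_lat)
             + math.sin(theta_sub) * math.sin(theta_lat) * math.cos(phi))
    return (I0 * (1 - albedo) * cos_t ** 2 / SIGMA) ** 0.25


print(cos_phi(0.3), cos_phi(0.35))                # both exceed one
print(predicted_t_max(math.pi / 6, 0.3) - 273)    # degrees Celsius, assumed phi
```

<p>Both albedo values give a would-be cosine above one, and the assumed hour angle gives a predicted maximum nearly a hundred degrees below the observed one, so the failure does not hinge on the exact albedo.</p>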
<h4 id="conclusion">Conclusion</h4>
<p>We’ve neglected an important factor: the atmosphere.
This is the very same thing needed to explain why the temperature of
the earth is higher than expected from a simple energy
balance argument.
Basically, the atmosphere acts as a heat bath in contact with the
earth, allowing for greater maximal temperatures. It may be
possible to turn a thermometer into an accurate sundial using a
<a href="https://en.wikipedia.org/wiki/Idealized_greenhouse_model">simple greenhouse model</a>.
However, with parameters appropriately modified, our naive approach should
work on a planet without substantial
atmosphere like Mercury.</p>
<h4 id="acknowledgements">Acknowledgements</h4>
<p>Thanks to A.B. for asking when daily temperatures peak, and
suggesting this might depend on latitude.</p>
<hr />
<div class="footdef"><sup><a id="fn.1" name="fn.1" class="footnum" href="#fnr.1">Footnote 1</a></sup> <p class="footpara">
This comes once more from the Stefan-Boltzmann law (for the surface
temperature of the sun $T_\odot = 5800 \text{ K}$), and an inverse square
drop-off:
$$
I_0 = \sigma T_\odot^4 \left(\frac{R_\odot}{d}\right)^2 =
5.67 \times 10^{-8} \cdot 5800^4 \left(\frac{7 \times 10^5}{1.5\times
10^8}\right)^2\, \frac{\text{W}}{\text{m}^2}\approx 1400 \, \frac{\text{W}}{\text{m}^2},
$$
where $R_\odot = 7 \times 10^5 \text{ km}$ is the solar radius and $d
= 1.5 \times 10^8 \text{ km}$ the earth-sun distance.
</p></div>
<div class="footdef"><sup><a id="fn.2" name="fn.2" class="footnum" href="#fnr.2">Footnote 2</a></sup> <p class="footpara">
We are making the usual assumption that the sun is far enough away to
treat incoming rays as parallel. For the same reason, we ignore the
way radiant intensity changes (due to the inverse square law) with $\vartheta$.
</p></div>
<!-- http://www.bom.gov.au/products/IDV60901/IDV60901.95936.shtml
((60*12 )/(2*pi))*arccos((sqrt((5.6*10^(-8)*(273+7)^4)/(1367(0.65))) + cos(1.9)cos(2*pi*(40.7/360)))/(sin(1.9)sin(2*pi*(40.7/360))))
2*pi(90 - 23.6*sin(pi/2 + pi/6))/360
https://www.timeanddate.com/weather/canada/vancouver/historic?month=1&year=2021
https://www.sjsu.edu/faculty/watkins/diurnaltemp.htm
(1367(1-0.3)(\cos 1.9\cos 0.71 + \sin 1.9\sin 0.71 * cos(pi/6))^2/(5.67 \times 10^{-8}))^(1/4)
-->
<!--
Let's test this out on some real data.
Today, in a certain large city, the temperature peaked at $25.0^\circ
\text{ C}$ around $2.5$ hours after solar noon.
We will guess the city!
First, we note that it's around $36$ days or a tenth of a year since
the winter solstice, so from (\ref{year}), the subsolar latitude is
$$
\theta_\text{sub} = \frac{\pi}{2} + \left(\frac{2\pi }{360}\right) 23.5
\cos(0.1 \cdot 2\pi) \approx 1.9.
$$
Two and a half hours after solar noon translates to $2.5/24$ times a full rotation,
so $\phi \approx \pi/5$.
Putting these numbers into (\ref{geohelio}) and rearranging using
trigonometric identities, we get
$$
\cos\vartheta \approx 0.57 \sin (\theta_\text{lat} - 0.60).
$$
Inserting into (\ref{balance}) and rearranging yields
$$
\theta_\text{lat} = 0.60 + \sin^{-1}\left[\frac{1}{0.57}\sqrt{\frac{5.67 \times
10^{-8} (273+25)^4}{1367 (1- 0.3)}}\right] = 1.77,
$$
or in
-->David A WakehamJanuary 28, 2021. I attempt to turn a thermometer (or more specifically, data about the maximum daily temperature) into a sundial. Though it fails on earth, it works on Mercury!Cashing a blank check2021-01-26T00:00:00+00:002021-01-26T00:00:00+00:00http://hapax.github.io/mathematics/statistics/everyday/check<p><strong>January 26, 2021.</strong> <em>Suppose you find a blank check on the ground,
and unscrupulously decide to cash it in. If overdrawing gets you
nothing, how much should you cash it in for? Assuming wealth follows
the 80-20 rule, the answer is: almost nothing!</em></p>
<h4 id="introduction">Introduction</h4>
<p>In the film “Blank Check” (1994), 11-year-old Preston Waters is
handed a blank check, and cashes it in for a million dollars.
Luckily, this is precisely the amount of money that the check’s
signer, a convict attempting to launder his ill-gotten gains, has left
with the bank’s president.
But what if Preston overdrew, asking for, say, $10$ billion?
This would probably have raised the suspicions of the complicit
bank president and the check would have bounced altogether.
When I was a kid, I thought it was incredibly lucky for Preston to
find the check in the first place.
I now think drawing the precise amount of money held in trust is
infinitely luckier.
But this raises the question: if you find a blank check, and you don’t
want it to bounce, how much should you cash it in for?</p>
<h4 id="expected-return">Expected return</h4>
<p>I’ll assume we know nothing about the identity of the signer, and that
if they have a balance of $b$, and we make out the value of the check
to be $v$, then the check will bounce if $v > b$.
Our strategy will be to calculate the expected return for $v$ and then
maximise it.
If $f(b)$ is the probability distribution for bank balances, then the
expected return for $v$ is simply $v$ multiplied by the probability $b> v$:</p>
\[E(v) = v \int_v^\infty f(b) \, db = v[1 - F(v)] = v \bar{F}(v),\]
<p>where $F$ is the cumulative distribution function, and $\bar{F} =
1 - F$ the tail.
To maximise this, we assume the curve is smooth, differentiate and set
to $0$, using $\bar{F}' = -f$:</p>
\[E'(v) = \bar{F}(v) - vf(v) = 0 \quad \Longrightarrow \quad v = \frac{\bar{F}(v)}{f(v)}.\]
<p>Any $v$ which satisfies this equation is a stationary point of the expected return.</p>
<h4 id="long-and-short-tails">Long and short tails</h4>
<p>Now the question is how to model the distribution of bank balances.
This is the sort of thing expected to follow a power-law
curve like the
<a href="https://en.wikipedia.org/wiki/Pareto_distribution">Pareto distribution</a>,
the proverbial “80-20” curve.
This is simply defined by its power-law tails:</p>
\[\bar{F}(v) = \left(\frac{L}{v}\right)^\alpha,\]
<p>where $L$ is the minimum balance needed to keep an account open (say a
monthly fee), and $\alpha > 0$ is a shape parameter we will “leave blank” for the moment.
This is a well-defined tail, since it equals one at $v = L$ and decreases to zero as $v \to \infty$.
The probability density for $v \geq L$ is</p>
\[f(v) = -\bar{F}'(v) = \frac{\alpha L^\alpha}{v^{\alpha + 1}}.\]
<p>The optimal draw then obeys</p>
\[v = \frac{\bar{F}(v)}{f(v)} = \left(\frac{L}{v}\right)^\alpha \cdot
\frac{v^{\alpha + 1}}{\alpha L^\alpha} = \frac{v}{\alpha}.\]
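<p>It can help to look at the expected return directly rather than its stationary points. Substituting the Pareto tail into $E(v) = v\bar{F}(v)$ gives $E(v) = L^\alpha v^{1-\alpha}$ for $v \geq L$. The sketch below tabulates this for two shape parameters; the choice $L = 1$ and the function name are illustrative assumptions.</p>

```python
def pareto_expected_return(v, L=1.0, alpha=1.16):
    """Expected return v * P(balance >= v) for a Pareto tail (L / v)**alpha."""
    if v < L:
        # Every account holds at least L, so a smaller check always clears.
        return v
    return v * (L / v) ** alpha


for alpha in (0.5, 1.16):
    row = [round(pareto_expected_return(v, alpha=alpha), 3) for v in (1, 10, 100, 1000)]
    print(alpha, row)
```

<p>For a shape parameter above one the returns shrink as the draw grows, while for a shape parameter below one they grow without bound.</p>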
<p>For $\alpha \neq 1$, the only solutions are $v = 0$ and $v = \infty$!
For $\alpha > 1$, we can plot the expected return $E(v)\propto
v^{1-\alpha}$, and see that it monotonically decreases, with the maximum at $v = L$.
Preston should only have asked for a few bucks!
But perhaps this is an artefact of the infinite power-law tail.
A more realistic choice is the <em>truncated</em> Pareto distribution, where
the power law is confined to $L \leq v \leq H$ for an upper limit $H$,
say the personal wealth of Jeff Bezos or Elon Musk.
The density for the truncated Pareto distribution is simply a
conditional probability, conditioned on being in the interval $[L, H]$:</p>
\[f(v) = \frac{\alpha L^{\alpha}v^{-(\alpha+1)}}{1 - (L/H)^\alpha},\]
<p>and the tail is</p>
\[\bar{F}(v) = \int_v^H \frac{\alpha L^{\alpha}u^{-(\alpha+1)}}{1 -
(L/H)^\alpha} \, du = \frac{(L/v)^\alpha - (L/H)^\alpha}{1 - (L/H)^\alpha}.\]
<p>Thus, we now have to solve</p>
\[v = \frac{\bar{F}(v)}{f(v)} = \frac{(L/v)^\alpha -
(L/H)^\alpha}{\alpha L^{\alpha}v^{-(\alpha+1)}} \quad \Longrightarrow
\quad v = (1-\alpha)^{1/\alpha} H.\]
<!-- Once again, the answer is independent of the lower bound.
, but
proportional to the upper bound, which as we take $H \to \infty$,
returns our original result. -->
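<p>The last implication takes a line of algebra: multiplying out, the condition becomes $\alpha v = v[1 - (v/H)^\alpha]$, so $(v/H)^\alpha = 1 - \alpha$. As a cross-check, the sketch below maximises the expected return on a grid and compares with the closed form. The bounds $L = 1$, $H = 1000$ and the helper names are illustrative assumptions, and a brute-force grid search stands in for the calculus.</p>

```python
def truncated_tail(v, L, H, alpha):
    """Tail probability P(balance >= v) for the truncated Pareto on [L, H]."""
    return ((L / v) ** alpha - (L / H) ** alpha) / (1 - (L / H) ** alpha)


def truncated_expected_return(v, L, H, alpha):
    return v * truncated_tail(v, L, H, alpha)


def best_draw(L, H, alpha, steps=100000):
    """Grid search for the draw in [L, H] with the largest expected return."""
    grid = [L + (H - L) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda v: truncated_expected_return(v, L, H, alpha))


L, H = 1.0, 1000.0
numeric = best_draw(L, H, alpha=0.5)
closed_form = (1 - 0.5) ** (1 / 0.5) * H
print(numeric, closed_form)
print(best_draw(L, H, alpha=1.5))   # falls back to the lower bound L
```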
<p>If $\alpha < 1$, then we do get a finite answer, proportional to the
upper bound, so for instance if $\alpha = 0.5$, and we take the upper
limit to be around 100 billion dollars, then Preston should ask for</p>
\[v \sim (1-0.5)^{1/0.5} \times 10^{11} = 25 \text{ billion dollars},\]
<p>or $0.25$ of some other reasonable guess for $H$.
But if $\alpha \geq 1$, the prefactor is zero or not real, so there is no
stationary point in the interval, and as for the full
Pareto distribution, the maximum expected return occurs at $L$.
And indeed, wealth typically does obey an approximate Pareto
distribution with $\alpha > 1$.
For instance, the proverbial “80-20” rule corresponds to $\alpha
\approx 1.16$, and
<a href="https://www.sciencedirect.com/science/article/abs/pii/S0165176505002995">this analysis</a>
of the Forbes 400 list of the richest Americans finds a shape parameter
of $\alpha = 1.49$.
So once again, a perfectly rational Preston Waters would ask only for the monthly fee!
But this would make for a far less entertaining movie.</p>David A WakehamJanuary 26, 2021. Suppose you find a blank check on the ground, and unscrupulously decide to cash it in. If overdrawing gets you nothing, how much should you cash it in for? Assuming wealth follows the 80-20 rule, the answer is: almost nothing!A simple proof of the bus paradox2021-01-26T00:00:00+00:002021-01-26T00:00:00+00:00http://hapax.github.io/mathematics/statistics/everyday/paradox-bus<p><strong>January 26, 2021.</strong> <em>The bus paradox states that, if buses arrive
randomly but on average every ten minutes, the expected waiting time is
ten minutes rather than five. I give a simple proof involving no
integrals or formal probability theory.</em></p>
<h4 id="introduction">Introduction</h4>
<p>The bus paradox (also called the waiting time or
<a href="https://en.wikipedia.org/wiki/Renewal_theory#Inspection_paradox">inspection paradox</a>)
is a counterintuitive result about waiting times between random events.
Suppose buses arrive randomly, with an average period of $\lambda$
between arrivals.
If you go to catch a bus, you might expect to wait a period
$\lambda/2$, since if a bus arrives $\lambda/2$ after you arrive, and
$\lambda/2$ before you arrive (by symmetry), then the gap between them
is $\lambda$.
This reasoning is wrong, and rather unexpectedly, the expected wait
time is $\lambda$.
The goal of this post is to give a proof which does
not require any integrals or formal probability theory, and
makes the role of assumptions manifest.</p>
<h4 id="the-bus-loop">The bus loop</h4>
<p>We start by considering a circle of total length $L$, on
which we place $k$ points at random (white in the image below).
This models a length of time, such as the day, and the random arrival
of $k$ buses.
The average distance between points (going clockwise, for instance) is clearly</p>
\[\lambda = \frac{L}{k}.\]
<p>Let us place another point on the circle at random (black in the image
below).
This represents the commuter who wishes to catch a bus.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/bus1.png" />
</div>
</figure>
<p>Since we now have $k + 1$ points placed at random, the same reasoning
as above tells us that the average distance is</p>
\[\frac{L}{k +1} = \left(\frac{k}{k+1}\right)\lambda.\]
<p>Translating into the language of bus schedules, this means that if
buses have a fixed but random schedule over some length of time, with
average interarrival time $\lambda$, the expected wait time is <em>not</em>
$\lambda$, but rather, smaller than $\lambda$ by a factor of
$k/(k+1)$, where $k$ is the total number of buses over the period.</p>
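<p>The loop argument is easy to test by simulation. The following is a rough Monte Carlo sketch, with hypothetical names and an arbitrary seed, that scatters $k$ buses and one commuter uniformly on a loop and averages the clockwise distance from the commuter to the next bus.</p>

```python
import random


def mean_wait_on_loop(k, length=1.0, trials=20000, seed=0):
    """Average clockwise distance from a random commuter to the next of k random buses."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        buses = [rng.uniform(0, length) for _ in range(k)]
        commuter = rng.uniform(0, length)
        # Clockwise distance to each bus, wrapping around the loop.
        total += min((bus - commuter) % length for bus in buses)
    return total / trials


k = 5
print(mean_wait_on_loop(k), 1.0 / (k + 1))
```

<p>The simulated average should land close to $L/(k+1)$ rather than $L/k$, up to sampling noise.</p>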
<h4 id="the-bus-paradox">The bus paradox</h4>
<p>The bus paradox applies to a schedule which does not repeat.
Let us take $L, k \to \infty$ but leave $\lambda = L/k$ fixed.
We represent this by an infinitely large circle, with a straight edge,
in the image below.
Then the expected waiting time is</p>
\[\left(\frac{k}{k+1}\right)\lambda \to \lambda.\]
<p>Thus, the arrival of the commuter is equivalent to adding another random
bus. The corresponding interarrival period is modified, but by a
vanishingly small coefficient as $k \to \infty$. This completes our simple proof of the bus paradox.</p>
<figure>
<div style="text-align:center"><img src="/images/posts/bus2.png" />
</div>
</figure>
<p>It’s a little tricky, of course, to formulate what it means to place
the buses “uniformly” on an infinite line, and this is exactly what the
<a href="https://en.wikipedia.org/wiki/Poisson_point_process#Homogeneous_Poisson_point_process">Poisson process</a>
(and more generally <a href="https://en.wikipedia.org/wiki/Renewal_theory#Inspection_paradox">renewal theory</a>)
achieves.
But rather than introduce all this formal baggage, we can simply consider
the limit of the uniform process to arrive at the correct conclusion,
and with greater clarity than when the answer is concealed in thickets of algebra.</p>
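<p>For comparison, here is a sketch of the infinite-line version. It swaps in the standard construction of a Poisson process, with independent exponential gaps of mean $\lambda$, rather than the limiting circle used above. The horizon, sample sizes and seed are arbitrary choices.</p>

```python
import bisect
import random


def mean_wait_poisson(lam=10.0, horizon=200000.0, commuters=20000, seed=0):
    """Average wait for the next bus when gaps are exponential with mean lam."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    # Keep adding buses until one arrives after the horizon,
    # so every commuter has a next bus to catch.
    while t <= horizon:
        t += rng.expovariate(1.0 / lam)
        arrivals.append(t)
    total = 0.0
    for _ in range(commuters):
        c = rng.uniform(0, horizon)
        next_bus = arrivals[bisect.bisect_right(arrivals, c)]
        total += next_bus - c
    return total / commuters


print(mean_wait_poisson())   # close to lam = 10, not lam / 2
```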
<h4 id="conclusion">Conclusion</h4>
<p>The reasoning outlined in the introduction is not completely off the
mark. It applies when the buses arrive at fixed intervals $\lambda$,
and the commuter arrives randomly.
The expected time to the previous bus $t_-$ and the expected time to
the next bus $t_+$ must add to give the interval $\lambda$ between
buses, and by time symmetry, they must be equal:</p>
\[t_+ + t_- = \lambda, \quad t_+ = t_- \quad \Longrightarrow t_+ = t_- = \frac{\lambda}{2}.\]
<p>In this case, there is a clear distinction between the stochasticity
of buses and commuters.
But when everything arrives randomly, a commuter becomes like just another
bus.</p>
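<p>The regular-schedule case is just as easy to simulate. In this sketch, buses are assumed to arrive at exact multiples of $\lambda$ and only the commuter is random; names, seed and sample size are again illustrative.</p>

```python
import random


def mean_wait_regular(lam=10.0, commuters=20000, seed=0):
    """Average wait when buses arrive at exact multiples of lam."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(commuters):
        c = rng.uniform(0, 1000 * lam)
        total += lam - (c % lam)   # time until the next multiple of lam
    return total / commuters


print(mean_wait_regular())   # close to lam / 2 = 5
```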
<!-- So waiting time equals interarrival time. -->
<!-- When the buses are random, our argument explains why this argument
breaks down: the commuter is like another bus!
They are just another random point in the sequence, and must therefore
have the -->
<!-- There are a few other fun things we can do, however.
If we add $n$ commuters, for $n = o(k)$, then when they sprinkled
randomly among the buses, it is overwhelmingly likely that the next
thing to come along will be a bus rather than a commuter (with
probability $k/(k+n) \to 1$), and hence the expected wait time is
$$
\left(\frac{k}{k+n}\right)\lambda \to \lambda.
$$
But for finite $n$, the time to -->David A WakehamJanuary 26, 2021. The bus paradox states that, if buses arrive randomly but on average every ten minutes, the expected waiting time is ten minutes rather than five. I give a simple proof involving no integrals or formal probability theory.