Jekyll2022-04-27T03:58:09+00:00http://hapax.github.io/feed.xmlDavid WakehamPhD candidate in physicsDavid A WakehamIndescribably boring numbers2021-03-23T00:00:00+00:002021-03-23T00:00:00+00:00http://hapax.github.io/mathematics/boring<p><strong>March 23, 2021.</strong> <em>I turn the old joke about interesting numbers into a proof that most real numbers are indescribably boring. In turn, this implies that there is no explicit well-ordering of the reals. The axiom of choice, however, implies all are relatively interesting.</em></p> <h4 id="introduction">Introduction</h4> <p>It’s a <a href="https://en.wikipedia.org/wiki/Interesting_number_paradox">running joke</a> among mathematicians that there are no boring numbers. Here’s the proof. Let $B$ be the set of boring numbers, and suppose for a contradiction it is non-empty. Define $b = \min B$ as the smallest boring number. Since this is a highly unusual property, $b$ is interesting after all! Joke it may be, but there is a sting in the tail. By thinking about how the joke works, we will be led to some rather deep (and perhaps disturbing) insights into set theory and what it can and cannot tell us about the mathematical world.</p> <h4 id="integers-and-rationals-are-interesting">Integers and rationals are interesting</h4> <p>The joke implicitly uses the fact that “numbers” refers to “whole numbers”</p> $\mathbb{N} = \{0, 1, 2, 3, \ldots\}.$ <p>If it didn’t, then the <em>minimum</em> we used to get our contradiction wouldn’t always work! For instance, say we work with the integers</p> $\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}.$ <p>The set of boring integers $B_\mathbb{Z}$ may be unbounded below. Does this cause a problem? Not really. We can just define the smallest boring number as the smallest element minimising the <em>absolute value</em>, i.e.</p> $b = \min \text{argmin}_{k\in B_\mathbb{Z}} |k|.$ <p>(The $\text{argmin}$ might actually give us two numbers, $\pm b$, so the negative one is the smallest.) 
Thus, there are no boring integers. What about boring rational numbers? This is somewhat more elaborate, but if $B_\mathbb{Q}$ is the set of boring rationals, we can define the “smallest” boring number as</p> $b = \min \text{argmin}_{p/q\in B_\mathbb{Q}} (|p| + |q|),$ <p>where $p/q$ is a fraction in lowest terms. Once again, there may be multiple minimisers of $|p| + |q|$, but only a finite number, so we can choose the smallest. We conclude there are no boring rationals. This pattern suggests there are no boring real numbers. We should be able to find some function with a finite number of minima, and then choose the smallest, right? I’m going to argue that no such function can ever be described. Then I’m going to explain why it might exist anyway, depending on which axioms of set theory we use!</p> <h4 id="most-real-numbers-are-boring">Most real numbers are boring</h4> <!-- https://en.wikipedia.org/wiki/Definable_real_number --> <p>“Boring” and “interesting” are subjective. We’ll use something a tad more well-defined, and replace “interesting” with <em>describable</em>. A number is describable if it has some finite description, using words, mathematical symbols, even a computer program, which uniquely singles out that number. For instance, $\sqrt{2}$ is the positive solution of $x^2 = 2$, $\pi$ is the ratio of a circle’s circumference to its diameter, and $e$ is the limit</p> $e = \lim_{n\to\infty} \left(1 + \frac{1}{n}\right)^n.$ <p>It turns out that <em>almost every</em> real number is indescribable, or “boring”, in our official translation of that term. The argument is very simple, and proceeds by simply counting the number of finite descriptions. Each such description consists of a finite sequence of symbols (letters, mathematical squiggles, algorithmic instructions), each of which is an element of some very large alphabet of symbols. 
For instance, the text</p> $\sqrt{2} \text{ is the positive solution of } x^2 = 2.$ <p>can be converted into <a href="http://www.tamasoft.co.jp/en/general-info/unicode-decimal.html">(decimal) unicode</a> as</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>8730 50 32 105 115 32 116 104 101 32 112 111 115 105 116 105 118 101 32 115 111 108 117 116 105 111 110 32 111 102 32 120 94 50 61 50 46 </code></pre></div></div> <p>Imagine some “super unicode” which lets us convert <em>any</em> symbol into a number. The super unicode alphabet may be arbitrarily large, so we will take it to consist of <em>every</em> natural number in $\mathbb{N}$. Then a finite description using any symbols can be written as a sequence of the corresponding natural numbers, a trick I will call “unicoding”. To find the number of finite descriptions, we just count the sequences! There is a nice scheme for showing that these are in one-to-one correspondence with the natural numbers themselves, and hence <em>countably infinite</em>. We take a sequence, say</p> $(6, 2, 0, 5)$ <p>and convert the first bracket and all commas into $1$s, and each number into the corresponding number of $0$s:</p> $10000001001100000_2.$ <p>In turn, this can be converted to decimal, $66144$. Going in the other direction, any positive whole number can be written in binary and then converted into a sequence:</p> $14265092 = 110110011010101100000100_2$ <p>becomes $(0,1,0,2,0,1,1,1,0,5,2)$. Thus, we have a simple, explicit correspondence between finite sequences of natural numbers and the natural numbers themselves. This basically completes the proof, for the simple reason that there are <em>infinitely more</em> real numbers than there are natural numbers. This is established by Cantor’s beautiful <a href="https://en.wikipedia.org/wiki/Cantor%27s_diagonal_argument">diagonal argument</a>, which I won’t repeat here. 
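</p> <p>The unicoding and binary tricks are easy to try out. Here is a minimal Python sketch, assuming non-empty sequences of natural numbers and positive integers; the function names <code>unicode_points</code>, <code>encode</code> and <code>decode</code> are my own labels for illustration, not anything standard.</p>

```python
def unicode_points(text):
    """'Unicode' a description: one decimal code point per symbol."""
    return [ord(symbol) for symbol in text]


def encode(seq):
    """Map a non-empty sequence of natural numbers to a positive integer.

    The opening bracket and every comma become a 1, and each entry k
    becomes a run of k zeros; the resulting string is read in binary.
    """
    if not seq or any(k < 0 for k in seq):
        raise ValueError("expected a non-empty sequence of natural numbers")
    bits = "".join("1" + "0" * k for k in seq)
    return int(bits, 2)


def decode(n):
    """Invert encode: read n in binary and measure the runs of zeros."""
    if n < 1:
        raise ValueError("expected a positive integer")
    bits = bin(n)[2:]           # binary digits, always starting with a 1
    runs = bits[1:].split("1")  # drop the bracket, split at the commas
    return [len(run) for run in runs]


if __name__ == "__main__":
    print(unicode_points("√2 is the positive solution of x^2=2."))
    print(encode([6, 2, 0, 5]))
    print(decode(14265092))
```

<p>Chaining the two, <code>encode(unicode_points(text))</code> turns any finite description into a single (usually enormous) whole number, which is the whole point of the counting argument.</p> <p>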
The upshot is that, via unicoding and then the binary correspondence, finite descriptions can only capture an infinitesimally small fragment of the real numbers. Most literally cannot be talked about.</p> <!-- So, we conclude that most real numbers are boring. --> <p>The set $B_\mathbb{R}$ includes almost every real number, though quite definitely <em>not</em> every real number you can think of. But, armed with our previous jokes, it’s tempting to think that we can waltz in and make the same joke about $\mathbb{R}$, simply plucking out the smallest element of $B_\mathbb{R}$. Of course, that won’t quite work, because the set need not be bounded below. So instead, suppose there is some explicit function $f$ such that $b \in B_\mathbb{R}$ is the smallest minimizer of $f$, i.e.</p> $b = \min \text{argmin}_{x \in B_\mathbb{R}} f(x).$ <p>If we knew $f$ explicitly, we’d have a description of $b$ after all. Contradiction! But the contradiction here does not imply $B_\mathbb{R}$ is empty. After all, most of $\mathbb{R}$ is indescribable for simple set-theoretic reasons. Instead, it means that there <em>cannot be any explicit function</em> $f$. More generally, there cannot be any explicit rule which, given a non-empty subset of $\mathbb{R}$, picks out a unique element of it. If there were, we could apply it to $B_\mathbb{R}$ and get the same contradiction. (See Appendix A for discussion of the related <a href="https://en.wikipedia.org/wiki/Berry_paradox">Berry paradox</a>.)</p> <h4 id="an-existential-aside">An existential aside</h4> <p>There’s a loophole here. Our argument doesn’t establish that $f$ doesn’t exist, just that it has no finite description. And although it might seem weird to trust in the existence of something that we can’t really talk about, we do just this with the real numbers! I believe in all the real numbers, even the ones I can never describe. Is this reasonable? It depends who you ask. 
There is a philosophy of mathematics called <a href="https://plato.stanford.edu/entries/intuitionism/">intuitionism</a> which tells us that mathematics is a human invention, and therefore enjoins us to only reason about the things we can construct ourselves. No indescribable real numbers if you please!</p> <p>I’m not sure about this “mathematical creationism”, and think there are more things in the mathematical heavens than are dreamt of in our finite human philosophy. Why should human limitations be mathematical ones? That said, it’s not the case that anything goes. We should have some firm basis for believing in the existence of those things we can’t discuss, and for the real numbers, the firm basis is drawing a continuous line on a piece of paper, or thinking about infinite decimal expansions. These are <em>models</em> of the real numbers, concrete-ish objects which capture the essence of the abstract entity $\mathbb{R}$. They convince us (or at least me) that there is nothing magical stopping someone from drawing certain points on the line, or continuing certain expansions forever.</p> <p>Similarly, the indescribable things we would like to exist and reason about in set theory might depend on our <em>models</em> of set theory! I won’t get into the specifics, but an important point is there are <em>many different models</em> of set theory, with different properties, and it seems unlikely that any one model is right. These properties are abstracted into <em>axioms</em>, formal rules about what exists and what you can or can’t do with sets. Because models of set theory are deep, highly technical constructions, most of the time we go the other way round, and play around with axioms instead. Only later do we go away and find models which support certain sorts of behaviour. 
The point of all this is to make it a bit less counterintuitive when I say that the existence and properties of boring numbers depend on which axioms we decide to use.</p> <h4 id="all-real-numbers-are-relatively-interesting">All real numbers are relatively interesting</h4> <p>So, let’s return to our problem of boring real numbers. We argued there was no explicit, finitely describable rule for picking an element out of $B_\mathbb{R}$. But we can always make the <em>existence</em> of such a rule (describable or not) an axiom of our theory! There are two ways to go about doing this. Note that in the first example of boring natural numbers, we use the <em>minimum</em> of the set. We had to be a bit more clever with the integers and rationals, but it essentially boiled down to creating a special sort of <em>ordering</em> on the set, so that any non-empty subset (including the boring numbers) has a <em>smallest element</em>. We wrote this in a complicated way as</p> $b = \min \text{argmin}_{x \in B} f(x)$ <p>for some function $f$, but we could just as well write</p> $b = \min_{\mathcal{W}} B,$ <p>where $\mathcal{W}$ denotes this ordering on the big set. To be clear, for the integers it is</p> $0, -1, 1, -2, 2, -3, 3, \ldots$ <p>and for the rationals it is</p> $0, -\frac{1}{1}, \frac{1}{1}, -\frac{2}{1}, -\frac{1}{2}, \frac{1}{2}, \frac{2}{1}, \ldots.$ <p>This is called a <em>well-ordering</em>. Although it may not be describable, we could simply require, as an axiom of set theory, that any set can be well-ordered! More explicitly,</p> <p><span style="padding-left: 20px; display:block"> Any set $A$ has a well-ordering $\mathcal{W}_A$ such that any non-empty subset of $A$ has a unique minimum element with respect to $\mathcal{W}_A$. 
</span></p> <p>Although it doesn’t spoil our conclusion that most real numbers are boring, such an axiom would allow us to turn the old joke into an argument that all real numbers are <em>relatively interesting</em>, where “relatively interesting” means that there is a finite description where we are allowed to use the well-ordering $\mathcal{W}$. The proof goes as you might expect: let $B^{\mathcal{W}}_\mathbb{R}$ be the set of relatively boring numbers, i.e. numbers with no finite explicit description, even when allowed to use the well-ordering $\mathcal{W}$. Since $\mathcal{W}$ is a well-ordering, we can define</p> $b = \min_{\mathcal{W}} B^{\mathcal{W}}_\mathbb{R}.$ <p>End of proof! So, although most real numbers are strictly boring, with a well-ordering all of them are relatively interesting.</p> <h4 id="choosing-an-order">Choosing an order</h4> <p>Well-ordering is not usually treated as an axiom. Historically, set theorists prefer to use a simpler rule called the <em>axiom of choice</em>, which is logically equivalent, as we will argue informally in a moment, but somehow less suspect. As Jerry Bona joked,</p> <p><span style="padding-left: 20px; display:block"> The axiom of choice is obviously true and the well-ordering principle obviously false. </span></p> <p>(Actually, Bona’s joke mentions a third equivalent form called <em>Zorn’s lemma</em>, but it would confuse matters too much to explain.) Loosely, the axiom of choice just says we can pick an element from a non-empty set. Pretty reasonable huh? If a set is nonempty, it has an element, so we can pluck one out. In fact, it’s usually stated in terms of a <em>family</em> of sets $A_i$, where the subscript $i$ ranges over some indexing set $I$:</p> <p><span style="padding-left: 20px; display:block"> Given a family of nonempty sets $A_i$, $i \in I$, we can collect a representative from each set, labelled $f_i \in A_i$. 
</span></p> <p>The well-ordering principle implies the axiom of choice, since I can just take the union of all the sets $A_i$, well-order it with $\mathcal{W}$, and then define $f_i = \min_{\mathcal{W}} A_i$. That’s my set of representatives! The other way round is conceptually straightforward. To well-order a set $A = A_0$, start by choosing an element $f_0 \in A_0$ by the axiom of choice. Then remove it to define a new set $A_1 = A_0 - \{f_0\}$, and select another element $f_1 \in A_1$. Continue in this way, at each stage simply deleting the element from the previous stage and choosing a new one, using</p> $A_{n+1} = A_n - \{f_n\} = A_{n-1} - \{f_n, f_{n-1}\} = \cdots = A_0 - \{f_i : i \leq n\}$ <p>as long as the set is nonempty. The well-ordering is simply the elements in the order we made the choice:</p> $\mathcal{W}_A = \{f_0, f_1, f_2, \ldots \} = \{f_n \in A_n : A_n \neq \varnothing\}.$ <p>There are two issues with this construction. The first is that it might feel sketchy to use the axiom of choice “as we go” to build the sets, rather than starting with a pre-defined family. But no one said this wasn’t allowed! Second, our method only seems to work for sets at most as large as the natural numbers, since we indexed elements with $n \in \mathbb{N}$. But we can extend it to an <em>arbitrary</em> set using a generalisation of natural numbers called <a href="https://en.wikipedia.org/wiki/Ordinal_number">ordinals</a>. We loosely sketch how this is done in Appendix B. Once the dust settles, we find that the axiom of choice is equivalent to well-ordering.</p> <h4 id="conclusion">Conclusion</h4> <p>The overarching theme of this post is how much mileage we can get from a bad joke. The answer: quite a lot! We learned not only that there are no boring integers and rational numbers, but via a simple counting argument, that the vast majority of real numbers are indescribably boring. This is equivalent to having no explicit way to well-order the reals. 
On the other hand, by giving ourselves the ability (via the axiom of choice) to pluck elements at will from non-empty sets, we are able to supply the reals with a well-ordering. So, all reals are relatively interesting, even if we can’t talk about them.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p>As usual, thanks to J.A. for the discussion which led to this post, and also for proposing an elegant mapping analogous to unicoding.</p> <h4 id="appendix-a-the-berry-paradox">Appendix A: the Berry paradox</h4> <p>Consider the phrase</p> <p><span style="padding-left: 20px; display:block"> The smallest real number with no finite, explicit description. </span></p> <p>If “smallest” refers to an explicitly definable well-ordering of the reals, then this would seem to pick out a unique number with a finite, explicit description. Contradiction! We used this to argue no explicit well-ordering exists. But let’s compare this to the <a href="https://en.wikipedia.org/wiki/Berry_paradox">Berry paradox</a>, which asks us to consider the phrase</p> <p><span style="padding-left: 20px; display:block"> The smallest positive integer not definable in under sixty letters. </span></p> <p>This phrase clocks in at under sixty letters, and would seem to define a number. Contradiction! Since “smallest” here makes perfect sense (we are dealing with positive integers), to resolve the Berry paradox, we must assume either (a) there is no set $B$ of numbers not definable in under sixty letters, analogous to the original boring number joke, or (b) Berry’s phrase somehow fails to define a number. The most popular solution seems to be (b), on the grounds that referring to the set makes it some kind of “meta-definition”, rather than a definition per se.</p> <p>Of course, this seems to be committed to a very specific notion of “definition”, but the problem persists if we replace “definable” with “meta-definable”, since the smallest non-meta-definable number is really a meta-meta-definition. 
Let $B^{(0)}$ be the set of numbers not definable in under sixty letters, $B^{(1)}$ the numbers not meta-definable in under $70$ letters, and in general, $B^{(n)}$ the numbers not meta${}^{(n)}$-definable in under $60+10n$ letters. We call any number in the <em>union</em> of all these sets $\mathcal{B} = \cup_{n\geq0} B^{(n)}$ “lim-definable”. This is closed under the operation of going meta. Now consider the phrase</p> <p><span style="padding-left: 20px; display:block"> The smallest positive integer not finitely lim-definable. </span></p> <p>Since lim-definability is closed under going meta, as is “finite”, this is <em>now a definition at the same level</em>. Option (b) is no longer available to us, so only option (a) remains, and it follows that, like the joke that began it all, <em>all positive integers are finitely lim-definable</em>. This is of course obviously true.</p> <p>Our argument against an explicit well-ordering is very closely related to the Berry paradox. The point of considering lim-definability is that we can build the same descriptive hierarchy for the real numbers, take the union, and rule out option (b). This leaves two ways to avoid a contradiction: no lim-definable ordering exists (involving some finite but unbounded number of references to sets in the hierarchy), or like the Berry paradox, every real is lim-definable. But unlike the positive integers, we know from set theory that the second option can’t be true! We still have a countable number of lim-definitions, as we can argue from unicoding. So there must be no lim-definable ordering of the reals, and no explicit well-ordering in particular.</p> <h4 id="appendix-b-ordinals-and-the-axiom-of-choice">Appendix B: ordinals and the axiom of choice</h4> <p>Ordinals are <em>sets</em> which we use to stand in for numbers. The smallest ordinal is $0$, which is defined as the empty set $\varnothing = \{\}$. 
Each ordinal $\alpha$ has a unique successor $\alpha + 1$, defined by simply appending a copy of $\alpha$ to itself:</p> $\alpha + 1 = \alpha \cup \{\alpha\}.$ <p>To illustrate, we apply the successor operation to $0 = \varnothing$ a few times:</p> $1 = 0 + 1 = \{0\}, \quad 2 = 1 + 1 = \{0, 1\}, \quad 3 = 2 + 1 = \{0, 1, 2\}.$ <p>Going on in this way gives us all the finite ordinals, but there are also <em>infinite</em> ordinals. The smallest infinite ordinal, conventionally denoted $\omega$, can be identified with the natural numbers:</p> $\omega = \{0, 1, 2, 3, 4, \ldots\}.$ <p>It is called a <em>limit</em> ordinal since it is not the successor of any ordinal. It is bigger than all the finite ones, $n &lt; \omega$. The successor is defined as before,</p> $\omega + 1 = \omega \cup \{\omega\} = \{0, 1, 2, \ldots, \omega\},$ <p>thereby giving a precise meaning to “infinity plus one”! We won’t say more about the structure of these ordinals. The main point is that we can always “count” the elements in a set $A$ using ordinals, no matter how big it is. Let’s now return to the problem of proving the axiom of choice implies that any set $A$ can be well-ordered. The basic idea is to start with $0$, but keep on counting up “past infinity”, defining</p> $A_{\alpha} = A_0 - \{f_\beta : \beta &lt; \alpha\}$ <p>for any ordinal $\alpha$. The resulting set of representatives, labelled by ordinals, is</p> $\mathcal{W}_A = \{f_\alpha \in A_\alpha: A_\alpha \neq \varnothing\},$ <p>with $f_\alpha &lt; f_\beta$ just in case the ordinals $\alpha &lt; \beta$. This is a well-ordering since the ordinals are themselves well-ordered. Now, we’ve skipped many important technical details, but the main point was that the argument looks pretty similar to the previous one!</p> <!-- You may wonder if the contradiction here is coming from ambiguity in the notion of "explicit describability". 
That this can cause deep problems is illustrated by the [Berry paradox](https://en.wikipedia.org/wiki/Berry_paradox), which asks us to consider the following: <span style="padding-left: 20px; display:block"> The smallest positive integer not definable in under sixty letters. </span> If $B_{60}$ is the set of positive integers not definable in under sixty letters, it seems we have just defined its smallest elements in fifty seven! This too is a contradiction. Many people try to resolve this by arguing that it does not constitute a "definition"; I think it is much simpler to following the boring number argument, and conclude that $B_{60}$ doesn't exist. -->Taking half a derivative2021-03-13T00:00:00+00:002021-03-13T00:00:00+00:00http://hapax.github.io/mathematics/halfder<p><strong>March 13, 2021.</strong> <em>Can you take half a derivative? Or π derivatives? Or even √–1 derivatives? It turns out the answer is yes, and there are two simple but apparently different ways to do it. I show that one implies the other!</em></p> <h4 id="introduction">Introduction</h4> <p>In calculus, the regular derivative is defined as the local gradient of a function:</p> $f'(x) = \frac{d}{dx} f(x) = \lim_{h\to 0}\frac{f(x+h)-f(x)}{h}.$ <p>We will abbreviate this as $f' = Df$, understanding that $f$ is a function of $x$ and $D$ differentiates with respect to $x$. We can always differentiate again, and again, and in fact as many times as we want. Using our new notation, we can write the $n$th derivative as</p> $D (D \cdots (Df)) = D^n f.$ <p>This is well-defined as long as $n$ is a whole number. But what if we could consider other types of derivatives, say half a derivative? Let’s call this $D^{1/2} = \sqrt{D}$. 
In the same way that applying two ordinary derivatives gives the second derivative, it seems reasonable to hope that two half derivatives give a full derivative:</p> $f' = \sqrt{D} \sqrt{D}f = Df \quad \Longrightarrow \quad \sqrt{D} \cdot \sqrt{D} = D.$ <p>What could half a derivative look like?</p> <h4 id="to-be-continued">To be continued</h4> <p>The easiest way to go about this is to use a trick called <em>analytic continuation</em>. This has a precise meaning in complex analysis, and we’re going to do something similar in spirit, but not quite as rigorous. The basic idea is to find some nice, specific function we can differentiate $n$ times, and which happens to give us a nice answer in terms of $n$. We then define the <em>fractional derivative</em> $D^\alpha$ acting on this function by replacing $n$ with $\alpha$. A sanity check will be that, for general $\alpha, \beta$, the fractional derivatives obey</p> $D^\alpha \cdot D^\beta = D^{\alpha+\beta},$ <p>so, e.g., two half-derivatives give a full derivative, $\sqrt{D}\cdot \sqrt{D} = D$. We call this property <em>multiplicativity</em> after the identical-looking rule for indices. There are two issues with this approach. First, how do we extend the definition to general functions? And second, are the definitions for different functions in agreement? In general, the answers are very complicated, but in this post, I’ll consider the two simplest methods for defining fractional derivatives. This means we can talk about the functions they apply to, and check they agree, without a huge technical overhead.</p> <p>Our first nice function is the exponential $e^{\omega x}$. Differentiating simply pulls down a factor of $\omega$ each time, so</p> $D^n e^{\omega x} = \omega^n e^{\omega x}.$ <p>It’s very clear, then, how to define the fractional derivative acting on this:</p> $D^\alpha e^{\omega x} = \omega^\alpha e^{\omega x}.$ <p>Great! 
We can easily check the multiplicative property, assuming that constants pass through the derivatives:</p> $D^\alpha D^\beta e^{\omega x} = \omega^\alpha D^\beta e^{\omega x} = \omega^{\alpha + \beta} e^{\omega x} = D^{\alpha+\beta}e^{\omega x}.$ <p>Now, you might think this is useless because we can only take fractional derivatives of exponential functions. But at this point, we introduce another assumption, namely that the fractional derivatives are <em>linear</em>:</p> $D^\alpha (\lambda_1 f_1 + \lambda_2 f_2) = \lambda_1 D^\alpha f_1 + \lambda_2 D^\alpha f_2,$ <p>where $f_1, f_2$ are functions and $\lambda_1, \lambda_2$ are constants. In particular, let’s suppose this linearity applies to an <em>infinite</em> collection of exponentials multiplied by constants $\lambda$, arranged into an integral</p> $f(x) = \int_{-\infty}^\infty d\omega \, \lambda(\omega) e^{i\omega x}.$ <p>Then by linearity,</p> $D^\alpha f(x) = \int_{-\infty}^\infty d\omega \, \lambda(\omega) D^\alpha e^{i\omega x} = \int_{-\infty}^\infty d\omega \, \lambda (\omega) (i\omega)^\alpha e^{i\omega x}. \tag{1} \label{exp}$ <p>Functions which can be written this way are said to have a <em>Fourier representation</em>, with the function $\lambda (\omega)$ the <em>Fourier transform</em>. Most well-behaved functions have one! Let’s do a very simple example: the sine function, bane of high school trigonometry classes everywhere. What is its half derivative? We start by writing sine in terms of exponentials as</p> $\sin(x) = \frac{1}{2i}(e^{ix} - e^{-ix}).$ <p>We then take a half-derivative using our exponential rule and linearity:</p> $\sqrt{D} \sin(x) = \frac{1}{2i}(\sqrt{D} e^{ix} - \sqrt{D} e^{-ix}) = \frac{1}{2i}\left(\sqrt{i} e^{ix} - \sqrt{-i} e^{-ix}\right).$ <p>There are a few things to note. First, this is not a real function, so in general, half derivatives of real functions need not be real. It should also be clear there is some ambiguity about which roots we choose. 
In general this ambiguity is harmless, and we just take the principal values (with arguments between $-\pi$ and $\pi$), but this issue will crop up again below in a subtle way. Finally, observe that we can just as easily do crazy things like take $i$ derivatives! We set $\alpha = i$, so the $i$th derivative of sine is</p> $D^i \sin(x) = \frac{1}{2i}\left(i^i e^{ix} - (-i)^i e^{-ix}\right) = \frac{1}{2i}(e^{-\pi/2 + ix} - e^{+\pi/2 - ix}),$ <p>since the principal values are</p> $i^i = e^{i (i \pi/2)} = e^{-\pi/2}, \quad (-i)^i = e^{i (-i \pi/2)} = e^{\pi/2}.$ <p>I’m not sure if this has any applications, but it’s cute. I invite the interested reader to take $\pi$ derivatives of sine. What better way to celebrate $\pi$ day!</p> <h4 id="fractorials">Fractorials</h4> <p>Exponentials aren’t the only nice functions we can use to define fractional derivatives. In fact, a more common approach is to use <em>powers</em>. The first function we encounter in high school is usually the identity function, $f(x) = x$. From there, we build up to polynomials $x^m$, and then arbitrary powers $x^s$. The derivative of a power has a very simple form:</p> $D x^s = s x^{s-1}.$ <p>If we differentiate again, we bring down a factor of $s - 1$ and reduce the index again. And so on and so forth. This leads to the expression for $n$ derivatives:</p> $D^n x^s = s(s- 1) \cdots (s - n + 1) x^{s-n}.$ <p>So far, this doesn’t look like something we can easily continue to non-integer values of $n$. But let’s assume for a moment $s$ is an integer with $s \geq n$. Then we can write</p> $s(s- 1) \cdots (s - n + 1) = \frac{s(s - 1) (s-2) \cdots 1}{(s - n)(s-n - 1) \cdots 1} = \frac{s!}{(s -n)!},$ <p>where we have used the good old factorial function $s!$. Thus, we can write</p> $D^n x^s = \frac{s!}{(s -n)!} x^{s-n}.$ <p>To analytically continue this, we need a beautiful object called the Gamma function $\Gamma$. 
We’ll define it properly below, but for the moment, the only properties we need are that (a) it agrees with the factorial function at (shifted) integer values,</p> $\Gamma(k + 1) = k!;$ <p>and (b) is defined for non-integer values as well. I like to think of it as the “fractorial” because it makes sense for fractional arguments! In addition to delightfully bad puns, the Gamma function lets us write</p> $D^n x^s = \frac{\Gamma(s + 1)}{\Gamma(s -n + 1)} x^{s-n},$ <p>and immediately continue to the fractional derivative:</p> $D^\alpha x^s = \frac{\Gamma(s + 1)}{\Gamma(s -\alpha + 1)} x^{s-\alpha}. \tag{2} \label{power}$ <p>Too easy! Once again, we can check the multiplicative property:</p> \begin{align*} D^\alpha D^\beta x^s &amp; = \frac{\Gamma(s + 1)}{\Gamma(s -\beta + 1)} D^\alpha x^{s-\beta} \\ &amp; = \frac{\Gamma(s + 1)}{\Gamma(s -\beta + 1)} \cdot \frac{\Gamma(s - \beta + 1)}{\Gamma(s -\alpha - \beta + 1)} x^{s-\beta - \alpha} \\ &amp; = \frac{\Gamma(s + 1)}{\Gamma(s -\alpha -\beta + 1)}x^{s-\beta - \alpha} = D^{\alpha+\beta} x^s. \end{align*} <p>So this gives us another, evidently different way to define fractional derivatives. It will apply to any sum or integral of powers of $x$, for instance, infinite polynomials called <em>power series</em>, and their close cousins the <em>Laurent series</em> which include reciprocal powers:</p> $\sum_{k = 0}^\infty a_k x^k, \quad \sum_{k = -\infty}^\infty b_k x^k.$ <p>These cover a lot of ground, and there is an even more general object called the <em>Mellin transform</em>, analogous to the Fourier transform. But we won’t go there. Instead, let’s do another simple example. One of the interesting properties of the Gamma function is that it blows up to (minus) infinity for nonpositive integers:</p> $\Gamma(-n) = -\infty, \quad n = 0, 1, 2, \ldots.$ <p>This is actually essential to get sensible answers! For instance, let’s take the derivative of a constant, $1 = x^0$. 
Then according to our definition,</p> $D x^0 = \frac{\Gamma(0 + 1)}{\Gamma(0 -1 + 1)} x^{0 - 1} = \frac{\Gamma(1)}{\Gamma(0)} x^{- 1} = 0,$ <p>since the $\Gamma(0)$ in the denominator makes the whole thing vanish. More intriguingly, these infinities sometimes <em>cancel</em> in sensible ways. For instance, if we take a derivative of $1/x$, we should get $-1/x^2$. If we plug $x^{-1}$ into our formula, it gives</p> $D x^{-1} = \frac{\Gamma(-1 + 1)}{\Gamma(-1 -1 + 1)} x^{-1 - 1} = \frac{\Gamma(0)}{\Gamma(-1)} x^{-2}.$ <p>Both the numerator and the denominator blow up, which should make us queasy. But there is a trick here. It turns out that for any $z$, the Gamma function obeys the <em>functional equation</em></p> $\Gamma(1 + z) = z\Gamma(z).$ <p>Since $\Gamma(k + 1) = k!$, this gives the usual relation for factorials,</p> $k! = \Gamma(k + 1) = k\Gamma(k) = k \cdot (k - 1)!.$ <p>It also gives the sneaky result $\Gamma(0) = (-1)\Gamma(-1)$. Both $\Gamma(0)$ and $\Gamma(-1)$ blow up of course, but in the derivative of $1/x$, the $\Gamma(-1)$ terms cancel, leaving $(-1)x^{-2} = -1/x^2$ as required.</p> <h4 id="gamma-and-tongs">Gamma and tongs</h4> <p>This all sounds great, but you might be wondering why the Gamma function is the right way to extend the factorial function away from whole numbers. In fact, any old function that interpolates between them would also work and satisfy the multiplicative property. What we’re going to do in this last section is use the fractional derivatives, defined using exponentials, to <em>derive</em> the Gamma function continuation. And in order to do this, we have to grit our teeth and define the Gamma function in all its glory:</p> $\Gamma(s) = \int_{0}^\infty dt\, t^{s-1} e^{-t}.$ <p>If you’re interested, you can find proofs of the functional equation and so on <a href="https://en.wikipedia.org/wiki/Gamma_function">elsewhere</a>. 
Instead, we’re going to make the sneaky change of variables $t = \omega x$, yielding</p> $\Gamma(s) = x^{s} \int_{0}^\infty d\omega\, \omega^{s-1} e^{-\omega x}.$ <p>If we change $s \to -s$, and rearrange, we get a formula for $x^s$ in terms of exponentials:</p> $x^{s} = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\, \omega^{-(1+ s)} e^{-\omega x}. \tag{3} \label{gamma}$ <p>Great! Now we just go ahead and use rule (\ref{exp}), with the hope we will get rule (\ref{power}). As usual, we proceed using linearity:</p> \begin{align*} D^\alpha x^{s} &amp; = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\, \omega^{-(1+ s)} D^\alpha e^{-\omega x} \\ &amp; = \frac{1}{\Gamma(-s)}\int_{0}^\infty d\omega\, \omega^{-(1+ s)} (-\omega)^\alpha e^{-\omega x} \\ &amp; = \frac{(-1)^\alpha}{\Gamma(-s)}\int_{0}^\infty d\omega\, \omega^{-(1+ s - \alpha)} e^{-\omega x} \\ &amp; = \frac{(-1)^\alpha}{\Gamma(-s)} \cdot \Gamma[-(s-\alpha)]x^{s-\alpha}, \end{align*} <p>where on the last line we used (\ref{gamma}), but with $s -\alpha$ instead of $s$. This isn’t quite what we want. To make progress, we’ll take advantage of the <em>reflection formula</em> for the Gamma function (derived <a href="https://hapax.github.io/mathematics/zeta/">here</a> for instance):</p> $\Gamma(z) \Gamma(1 - z) = \frac{\pi}{\sin(\pi z)}.$ <p>We can apply this to both $\Gamma(-s)$ and $\Gamma[-(s-\alpha)]$ to get</p> \begin{align*} D^\alpha x^{s} &amp; = (-1)^\alpha \frac{\sin(\pi s)}{\sin[\pi(s-\alpha)]}\cdot \frac{\Gamma(s+1)}{\Gamma(s-\alpha + 1)} x^{s-\alpha}. \end{align*} <p>This is almost (\ref{power}), the thing we were after! But there is this strange factor with sines out the front. Recall the definition of sine in terms of complex exponentials. 
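</p> <p>In case it has slipped your mind, the standard identity, for a generic angle $\theta$ (a symbol introduced here just for illustration), is</p> $\sin\theta = \frac{e^{i\theta} - e^{-i\theta}}{2i}.$ <p>The factors of $2i$ cancel between the numerator and denominator of our ratio of sines.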
This lets us write the funny factor as</p> $(-1)^\alpha \frac{\sin(\pi s)}{\sin[\pi(s-\alpha)]} = \frac{e^{\pi i s} - e^{-\pi i s}}{(-1)^\alpha e^{\pi i (s-\alpha)} - (-1)^\alpha e^{-\pi i (s-\alpha)}}.$ <p>It would be magical if that $(-1)^\alpha$ could somehow behave differently and cancel the $\alpha$ terms floating around, right? Well, turns out it does! We can write $-1 = e^{\pm \pi i}$, and hence</p> $(-1)^\alpha = e^{\pm \pi i \alpha}.$ <p>I won’t spell out the details, but if you look at <a href="https://hapax.github.io/mathematics/zeta/">this proof</a> of the reflection formula, the two different terms in the sine arise from parts of an integration contour which lie in almost the same place, but where we take roots in different ways. In particular, evaluating $(-1)^\alpha$ gives $e^{\pm \pi i \alpha}$ respectively, so they cancel the $\alpha$ terms after all. The upshot is that our funny factor is just unity:</p> $\frac{e^{\pi i s} - e^{-\pi i s}}{(-1)^\alpha e^{\pi i (s-\alpha)} - (-1)^\alpha e^{-\pi i (s-\alpha)}} = \frac{e^{\pi i s} - e^{-\pi i s}}{e^{\pi i \alpha} e^{\pi i (s-\alpha)} - e^{-\pi i \alpha} e^{-\pi i (s-\alpha)}} = \frac{e^{\pi i s} - e^{-\pi i s}}{e^{\pi i s} - e^{-\pi i s}} = 1.$ <p>Thus, our exponential rule actually reproduces the rule for powers of $x$ involving the Gamma function! Now, to be clear, fractional derivatives are a big and mathematically heavy topic, and I’ve only skimmed the surface. But it’s neat that the two simplest approaches agree.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p>Thanks to J.A. for chatting about fractional derivatives, and getting me thinking about the simplest way to define them.</p> <!-- Our exponential definition yields an *antiderivative* operator: $$D^{-1} e^{\omega x} = \frac{1}{\omega}e^{\omega x}.$$ This is the usual antiderivative, except without the constant. -->David A WakehamMarch 13, 2021. Can you take half a derivative? Or π derivatives? Or even √–1 derivatives? 
It turns out the answer is yes, and there are two simple but apparently different ways to do it. I show that one implies the other!The statistical basis of Fermi estimates2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00http://hapax.github.io/physics/hacks/mathematics/statistics/fermi-log-normal<p><strong>February 12, 2021.</strong> <em>Why are Fermi approximations so effective? One important factor is log normality, which occurs for large random products. <!--, also related to the mechanism underlying the Newcomb-Benford law for first digits.--> Another element is variance-reduction through judicious subestimates. I discuss both and give a simple heuristic for the latter.</em></p> <h4 id="introduction">Introduction</h4> <p>Fermi approximation is the art of making good order-of-magnitude estimates. I’ve written about them at greater length <a href="https://hapax.github.io/assets/fermi-estimates.pdf">here</a> and <a href="https://hapax.github.io/physics/teaching/hacks/napkin-hacks/#sec-3">here</a>, but I’ve never really found a satisfactory explanation for why they work. Order-of-magnitude is certainly a charitable margin of error, but time and time again, I find they are better than they have any right to be! Clearly, there must be an underlying statistical explanation for this apparently unreasonable effectiveness.</p> <!-- We will try to explain the first using logarithmic uniformity, which is the same mechanism underlying the anomalous distribution of first digits known as the [Newcomb-Benford law](https://en.wikipedia.org/wiki/Benford%27s_law). We give a looser but related explanation of the second in terms of strategies for variance-reduction in human error. --> <h4 id="products-and-log-normality">Products and log-normality</h4> <p>There are two key techniques: the use of geometric means, and the factorisation into subestimates. We start with geometric means. 
Suppose a random variable $F$ is a product of many independent random variables,</p> $F = X_1 X_2 \cdots X_N.$ <p>Then the logarithm of $F$ is a sum of many random variables $Y_i = \log X_i$:</p> $\log F = \log X_1 + \log X_2 + \cdots + \log X_N = \sum_{i=1}^N Y_i.$ <p>By the central limit theorem for unlike variables (see e.g. <a href="https://hapax.github.io/hacks/mathematics/statistics/clt/">this post</a>), for large $N$ this approaches a normal distribution:</p> $\log F \to \mathcal{N}(\mu, \sigma^2), \quad \mu := \sum_i \mu_i, \quad \sigma^2 = \sum_i \sigma_i^2,$ <p>where the $Y_i$ have mean $\mu_i$ and variance $\sigma_i^2$. We say that $F$ has a <em>log-normal</em> distribution, since its log is normal.</p> <!-- To get uniformity into the picture, we can zoom in on the region near $F = e^\mu$ where the probability density is approximately uniform. More carefully, the density is $$p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}.$$ Taylor-expanding near $x = \mu$ gives $$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \left[1 - \frac{(x-\mu)^2}{2\sigma^2} + O(x^4)\right].$$ This looks uniform provided $(x - \mu)^2 \ll \sigma^2$. For instance, at a third of a standard deviation, $x = \mu + \sigma/3$, we have $$1 - \frac{(x-\mu)^2}{2\sigma^2} = 1 - \frac{1}{18} \approx 0.94,$$ and $\text{erf}(1/\sqrt{18}) \approx 0.26$, about a quarter of the probability mass, lies underneath. This is what we mean when we say that $F$ is logarithmically uniform. --> <h4 id="geometric-means">Geometric means</h4> <p>In Fermi estimates, one of the basic techniques is to take geometric means of estimates, typically an overestimate and an underestimate. 
For instance, to Fermi estimate the population of Chile, I could consider a number like one million which seems much too low, and a number like one hundred million which seems much too high, and take their geometric mean:</p> $\sqrt{(1 \text{ million}) \times (100 \text{ million})} = 10 \text{ million}.$ <p>Since population is a product of many different factors, it is reasonable to expect it to approximate a log-normal distribution. Then, after logs, the geometric mean $\sqrt{ab}$ becomes the arithmetic mean of $\log a$ and $\log b$:</p> $\log \sqrt{ab} = \frac{1}{2}(\log a + \log b).$ <p>Taking the mean $\mu$ of the distribution as the true value, these geometric means provide an <a href="https://en.wikipedia.org/wiki/Bias_of_an_estimator">unbiased estimator</a> of the mean. Moreover, the variance of the estimate will decrease as $1/k$ for $k$ samples (assuming human estimates sample from the distribution), so more is better. To see how much better I could do on the Chile population estimate, I solicited guesses from four friends, and obtained $20, 20, 30$ and $35$ million. Combining with my estimate, I get a geometric mean</p> $(10 \times 20 \times 20 \times 30 \times 35)^{1/5} \text{ million} \approx 21 \text{ million}.$ <p>The actual population is around $18$ million, so the estimate made from more guesses is indeed better! This is also better than the arithmetic average, $23$ million. Incidentally, this also illustrates the <a href="https://hapax.github.io/physics/mathematics/statistics/crowd/">wisdom of the crowd</a>, also called “diversity of prediction”. The individual errors from a broad spread of guesses tend to cancel each other out, leading to a better-behaved average, though in this case in logarithmic space.</p> <p>In general, Fermi estimates work best for numbers which are large random products (this is how we try to solve them!), so the problem domain tends to enforce the statistical properties we want.
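</p> <p>The averaging in the Chile example is easy to reproduce. What follows is a minimal Python sketch, assuming nothing beyond the five guesses quoted above (the variable names are illustrative). It averages the guesses in log space and exponentiates, which is all a geometric mean is, then compares with the arithmetic mean:</p>

```python
import math

# The five guesses for Chile's population, in millions, as quoted in the text.
guesses_millions = [10, 20, 20, 30, 35]

# Geometric mean: arithmetic mean of the logs, then exponentiate.
log_mean = sum(math.log(g) for g in guesses_millions) / len(guesses_millions)
geometric_mean = math.exp(log_mean)

# Ordinary arithmetic mean, for comparison.
arithmetic_mean = sum(guesses_millions) / len(guesses_millions)

print(f"geometric mean:  {geometric_mean:.1f} million")   # 21.1 million
print(f"arithmetic mean: {arithmetic_mean:.1f} million")  # 23.0 million
```

<p>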
For many examples of log-normal distributions in the real world, see <a href="https://academic.oup.com/bioscience/article/51/5/341/243981">Limpert, Stahel and Abbt (2001)</a>. It’s worth noting that not everything we can Fermi estimate is log-normal, however. Many things in the real world obey power laws, for instance, and although you can exploit this to make better Fermi estimates (as lukeprog does in <a href="https://www.lesswrong.com/posts/PsEppdvgRisz5xAHG/fermi-estimates#Example_4__How_many_plays_of_My_Bloody_Valentine_s__Only_Shallow__have_been_reported_to_last_fm_">his tutorial</a>), we can happily Fermi estimate power-law distributed numbers without this advanced technology.</p> <p>Are Fermi estimates unreasonably effective in this context? Maybe. But the estimates work best in the high-density core where things look uniform, not out at the tails, and it’s not until we get to the tails that the difference between the log-normal and power law (or exponential, or Weibull, or your favourite skewed distribution) becomes pronounced. So the unreasonable effectiveness here can probably be explained by the resemblance to the log-normal, though this is something I’d like to check more carefully in future.</p> <!-- In general, we only expect Fermi estimates to work for numbers which are the product of many factors. But this is precisely the sorts of things we use Fermi estimates for! In a sense, the problem domain naturally leads to logarithmic uniformity. Incidentally, I've talked about "uniformity", but the geometric mean is still a measure of central tendency for any distribution, and is particularly nice for a lognormal one, which arise for products of random variables. The magic of geometric means manifests most strongly in the near-uniform blob at the centre. 
--> <!-- #### The Newcomb-Benford law Logarithmic uniformity also explains an odd pattern in the first digits of naturally occurring numbers like tax returns, stock market prices, populations, river lengths, physical constants, and even powers of $2$. The pattern, called the *Newcomb-Benford law* after [Simon Newcomb](https://en.wikipedia.org/wiki/Simon_Newcomb) and [Frank Benford](https://en.wikipedia.org/wiki/Frank_Benford), is as follows: for base $b$, the digit $d \in \\{1, 2, \ldots, b-1\\}$ occurs with relative frequency $$p_b(d) = \log_b \left(\frac{d+1}{d}\right) = \frac{1}{\log b}\log \left(\frac{d+1}{d}\right).$$ It initially seems bizarre that digits do not occur with equal frequency. But as neatly explained by [Pietronero et al. (1998)](https://arxiv.org/pdf/cond-mat/9808305.pdf), it follows immediately if the relevant numbers are logarithmically uniform. Let $X$ be our random number. Then the first digit is $d$ if $$db^k \leq X < (d+1)b^{k} \quad \Longrightarrow \quad \log_b d + k \leq \log_b X < \log_b(d+1) + k$$ for some integer $k$. If $X$ is logarithmically uniform, for instance sitting near the mean of a big random product, then $\log_b X$ is uniformly distributed, and lies in the interval $I_d := [\log_b d, \log_b (d+1)]$ with probability $$(\log_b (d+1) + k) - (\log_b d + k) = \frac{1}{\log b}\log \left(\frac{d + 1}{d}\right) = p_b(d).$$ This provides a simple way to check for fraud on tax returns, for instance. Just compute relative frequencies of first digits in different bases and check they obey Newcomb-Benford! You might wonder why something totally deterministic, like the first digit of a power of $2$, also obeys Benford's law. Here is a pie chart of initial decimal digits for the first $10,000$ binary powers, which follows the Newcomb-Benford law exactly: <figure> <div style="text-align:center"><img src ="/images/posts/benford1.png"/> </div> </figure> Here is the Python code to generate it. 
You can check it for other numbers besides $2$ as well by simply changing the power variable: python import matplotlib.pyplot as plt import math maxpower = 10000 # Number of powers to check power = 2 # Change to check other powers nums = '1', '2', '3', '4', '5', '6', '7', '8', '9', benford = [(math.log10(d+1) - math.log10(d)) for d in range(1, 10)] firstdig = [0 for i in range(9)] for i in range(maxpower): ind = int(str(power**i)[0]) - 1 firstdig[ind] = firstdig[ind] + 1 fig, ax = plt.subplots() fig.set_facecolor('white') ax.pie(firstdig, labels=nums, autopct='%1.1f%%', startangle=90) # Change 'firstdig' to 'benford' for probabilities ax.axis('equal') plt.show()  The mechanism for logarithmic uniformity here is slightly different, and discussed in depth in Serge Tabachnikov's [book on geometric billiards](http://www.personal.psu.edu/sot2/books/billiardsgeometry.pdf). In this case, $X = 2^n$, so the first digit is $d$ just in case $$\log_{10}d + k \leq n\log_{10} 2 < \log_{10}(d + 1) + k.$$ Let $\text{frac}(x)$ denote the fractional part of $x$, and define $x_n := \text{frac}(n\log_{10} 2)$. Taking fractional parts gives $$\log_{10}d \leq x_n < \log_{10}(d + 1).$$ It turns out that, since $x_1 = \log_{10} 2$ is irrational, $x_n$ jumps randomly around the unit interval, and forms an "equidistribution" which spends equal times in equal areas. For a proof, see Tabachnikov's book. But although the fundamental cause is different, the outcome is still logarithmic uniformity, and the Newcomb-Benford law results. --> <h4 id="the-philosophy-of-subestimates">The philosophy of subestimates</h4> <p>Now we’ve dealt with geometric means and log-normality, we turn to the effectiveness of factorising a Fermi estimate. If we take logarithms, factors become summands, and we’ll reason about those since they are simpler.
If $Z = X + Y$ is a sum of independent random variables, the variance is additive, so that</p> $\text{var}(Z) = \text{var}(X) + \text{var}(Y).$ <p>Thus, splitting a sum into estimates of the summands and adding them should not change the variance of the guess. Of course, there is a fallacy in this reasoning: humans are not sampling from the underlying distribution! When we guess, we introduce our own random errors. For instance, my estimate for $Z$ will have some human noise $\varepsilon_Z$:</p> $\hat{Z} = Z + \varepsilon_Z.$ <p>Similarly, my guesses for $X$ and $Y$ have some random errors $\varepsilon_X$ and $\varepsilon_Y$. There is no reason for the variances of $\varepsilon_X$ and $\varepsilon_Y$ to add up to the variance of $\varepsilon_Z$. The sum could be bigger, or it could be smaller. But a good decomposition should reduce the combined variance:</p> $\text{var}(\varepsilon_X) + \text{var}(\varepsilon_Y) &lt; \text{var}(\varepsilon_Z).$ <p>If log-normality is the science of Fermi estimates, picking variance-reducing subestimates is the art. <!-- But there is a connection to our earlier discussion. I think the human error $\varepsilon_X$ will roughly mimic the empirical distribution of $Z$ we have seen in the world. If it is biased, so is $\varepsilon_X$; it we have only seen a few examples, the variance of $\varepsilon_X$ will probably be large, and decrease roughly as $1/k$ with $k$ examples. So the general strategy for variance reduction is to factorise into things we have seen before. We can even use these data points to generate subestimates by geometric averaging.--> But I suspect that $\hat{Z}$ roughly speaking behaves like a <em>test statistic</em> for $Z$, with the number of samples corresponding to how many data points for $Z$ we have encountered. So we expect that $\text{var}(\varepsilon_Z)$ will vanish roughly as $1/k$ with $k$ samples. If we have more exposure to the distributions for $X$ and $Y$, the combined error will probably be smaller. 
This is why we carve into subfactors we understand!</p> <h4 id="variance-reduction-in-practice">Variance reduction in practice</h4> <p>I’ll end with a speculative rule of thumb for when to factor: try generating over- and underestimates for the factors and the product, which in additive notation give</p> $(\Delta X)^2 + (\Delta Y)^2, \quad (\Delta Z)^2$ <p>where $\Delta$ refers to the difference of the (logarithm of the) over- and underestimate. Factorise if the first estimated error is smaller than the second. Let’s illustrate by returning to the population of Chile. I can try factoring it into a number of regions multiplied by the average number of people per region. Taking logs (in base $10$) of the over- and underestimate of Chile’s population I gave above, I get</p> $(\Delta Z)^2 = (\log_{10} 10^8 - \log_{10} 10^6)^2 = 4.$ <p>On the other hand, for regions I would make a lower guess of $5$ and an upper guess of $30$, with a difference in logs of $(\Delta X)^2 = 0.6$. For regional population, I would make a lower guess of $5\times 10^5$ and an upper guess of $5\times 10^6$, with $(\Delta Y)^2 = 1$. Thus,</p> $(\Delta X)^2 + (\Delta Y)^2 = 1.6 &lt; 4 = (\Delta Z)^2.$ <p>The guess from the factorisation (taking geometric means) is</p> $\sqrt{5 \times 30 \times (5\times 10^5) \times (5\times 10^6)} \approx 19 \text{ million}.$ <p>This is even better than the crowdsourced estimate! For reference, the number of regions is $16$, while our estimated mean is around $12$, and the average population per region is a bit over a million, which we’ve mildly overestimated at $1.6$ million. The two balance out and give a better overall estimate. <!-- This suggests a diversity of prediction mechanism is at play with --> <!-- subestimates, but I haven't worked out the details. --></p> <h4 id="conclusion">Conclusion</h4> <p>From a statistical perspective, Fermi estimates are based on two techniques: geometric means and splitting into subfactors. 
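</p> <p>The second technique can be made concrete using the rule of thumb from the Chile example. The Python sketch below is only an illustration under the assumptions of the previous section: the squared difference of base-$10$ logs stands in for the variance of a guess, and the function names are hypothetical rather than standard terminology.</p>

```python
import math

def squared_log_spread(low, high):
    """Squared difference of base-10 logs of an over- and underestimate."""
    return (math.log10(high) - math.log10(low)) ** 2

def geometric_midpoint(low, high):
    """Geometric mean of an under- and overestimate."""
    return math.sqrt(low * high)

# Direct guess for Chile's population: between 1 million and 100 million.
spread_direct = squared_log_spread(1e6, 1e8)

# Factored guess: number of regions times average people per region.
spread_regions = squared_log_spread(5, 30)
spread_per_region = squared_log_spread(5e5, 5e6)
spread_factored = spread_regions + spread_per_region

# Rule of thumb: factorise when this margin is positive.
margin = spread_direct - spread_factored

# Multiply the geometric midpoints of the two subestimates.
factored_guess = geometric_midpoint(5, 30) * geometric_midpoint(5e5, 5e6)

print(f"direct spread:   {spread_direct:.1f}")
print(f"factored spread: {spread_factored:.1f}")
print(f"factored guess:  {factored_guess / 1e6:.1f} million")
```

<p>With the numbers above, the factored spread comes out at about $1.6$ against $4$ for the direct guess, and the factored guess at about $19$ million.</p> <p>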
We usually estimate things which can be expressed as a product of many factors. These will tend towards a log-normal distribution by the (log of the) central limit theorem, so that geometric means provide a good estimator, exactly like the usual mean for normally distributed variables. Subestimates, on the other hand, carve guesses into factors we understand, i.e. have more data points for, so that (assuming they behave like test statistics) variance is reduced. The effectiveness of Fermi estimates is quite reasonable after all! <!-- They're not so unreasonable after all! --></p> <!-- There is an art to making over- and underestimates that accurately reflect the variance of our error random variables, which are involved both in taking geometric means for single quantities, and reducing variance through subestimates. Still, it's cool that there is a statistical basis for the different aspects of the effectiveness of Fermi estimates. It's not so unreasonable after all! --> <!-- For instance, if $e^Z$ is the population of Chile, I can factor it into number of provinces $e^X$ multiplied by the average number of people per province $e^Y$. But this is likely to *increase* the error, since I know less about provinces of Chile than I do about Chile compared to other countries. I suspect that there is a nice quantitative connection to be made between the variance of $\varepsilon_X$ and the prior data I have on it. --> <!-- The Lyapunov condition holds for a sum of independent random variables. By taking an exponential, we can turn it into a result for a *product* of independent variables. Let $X_i, \mu_i, \sigma_i^2$ be as above, and $X_i = \log Y_i$. Then $$\exp\left[\sum_{i=1}^N X_i\right] = \prod_{i = 1}^N Y_i \to \log \mathcal{N}(\mu, \sigma^2).$$ The distribution on the right is not a normal, but a *log-normal*. It is simply what the normal distribution looks like when viewed in terms of a variable $y > 0$ defined by $x = \log y$. 
In order to plot the density, we use the fact that $dx = dy/y$, and hence $$p(x)\, dx = \frac{dx}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} = \frac{dy}{\sqrt{2\pi}\sigma y} e^{-\frac{(\log y-\mu)^2}{2\sigma^2}}.$$ So, this is distribution that a product of many independent factors converges to. --> <!-- https://arxiv.org/pdf/cond-mat/9808305.pdf -->David A WakehamFebruary 12, 2021. Why are Fermi approximations so effective? One important factor is log normality, which occurs for large random products. Another element is variance-reduction through judicious subestimates. I discuss both and give a simple heuristic for the latter.Reductionism, order and patterns2021-02-08T00:00:00+00:002021-02-08T00:00:00+00:00http://hapax.github.io/mathematics/physics/philosophy/form<p><strong>February 8, 2021.</strong> <em>Some philosophical reflections on the nature of scientific explanation, structure, emergence, and the unreasonable effectiveness of mathematics.</em></p> <h4 id="introduction">Introduction</h4> <p><span style="padding-left: 20px; display:block"> Explanations must come to an end somewhere. </span></p> <div style="text-align: right"><i>Ludwig Wittgenstein</i> </div> <p>Reductionism is the idea that you explain stuff with smaller stuff, and keep going until you stop. In many ways, this describes the explanatory program of 20th century physics, which, starting from the 19th century puzzles of statistical mechanics, conjured up atoms, subatomic particles, the zoo of the Standard Model, and even tinier hypothetical entities like strings and spin foams. Most physicists spend their time in a lab, on a computer, or in front of a blackboard, trying to reduce complex things to simple things they understand. So like Platonism in mathematics, reductionism in physics simply makes a philosophy out of everyday practice. 
We break stuff down, so things reduce; we play abstractly with mathematical objects, so they exist abstractly.</p> <p>But also like Platonism, reductionism is a convenient fiction, or rather, a caricature in which some things are emphasised at the cost of others. And given the reverence with which philosophers hold the considered ontological verdicts of science, it’s worth asking: what does science really tell us about the universe? What sorts of objects are necessary for explanation? Does explanation go only upwards, or can it go downwards or sideways? Should we eliminate the things we explained? And what has explanation to do with existence anyway? This post is an attempt to unconfuse myself about some of these questions. <!-- adds a dash of novelty and modern physics to old (and in some cases hopelessly outdated) debates. --></p> <h4 id="the-existence-of-shoes">The existence of shoes</h4> <p><span style="padding-left: 20px; display:block"> … our common sense conception of psychological phenomena constitutes a radically false theory, a theory so fundamentally defective that both the principles and ontology of that theory will eventually be displaced, rather than smoothly reduced, by completed neuroscience. </span></p> <div style="text-align: right"><i>Paul Churchland</i> </div> <p>Physical objects can be described at different levels. A shoe is constructed from flat sheets of material, curved, cut, marked, and stuck together in clever ways; materials curve and stick by virtue of their constituent chemicals, usually long, jointed molecular chains called polymers; polymers, in turn, are built like lego from a smorgasbord of elements; and each elemental atom is a dense nuclear core, surrounded by electrons whirring around in elaborate orbitals.</p> <p>From the properties of the neutrons, protons and electrons, it seems we can work our way upwards, and infer everything else.
The laws of quantum mechanics and electromagnetism determine the orbital structure of the atom. The valence shell of the atom determines how it can combine with other atoms to form chemicals. Finally, the structural motifs and functional groups of the polymers give them the properties the industrial chemist, the designer, and the cobbler exploit to make a shoe. Thus, some philosophers conclude, only electrons, protons, and neutrons exist. The rest can be eliminated as unnecessary ontological baggage. This view is called <em>eliminative reductionism</em>. It is a hardcore philosophy which does not believe in shoes [<sup><a id="fnr.1" name="fnr.1" class="footref" href="#fn.1">1</a></sup>].</p> <p>There is a gentler, less silly form of reductionism which grants the existence of shoes, but insists that they are (in the phrase of Jack Smart) nothing “over and above” the constituent subatomic particles. The shoe “just is” electrons and protons and neutrons, in some order; this is what we mean by a shoe. There are other ways to characterise the reduction, <!-- for instance, that the properties of the shoe "follow" from, or are "completely explained by", those of the subatomic particles. In fact, there is--> and a whole literature devoted to the attendant subtleties, but most fall under the heading of analytic micro-quibbles. <!-- , and won't concern us here.--> Instead, we will make a much simpler observation: order matters.</p> <p>Clearly, if we took those subatomic particles, and arranged them in a different way, we would get different elements, different chemicals, and a duck or a planetesimal instead of a shoe. Arrangement is important. It is patently absurd to try and explain the bulk properties of the shoe (the fact that it fits around a human foot, for instance) without appeal to arrangement, since a different order yields objects which do not fit around a foot.
<!-- If one objects that "fitting around a foot" is some sort of anthropocentric folly due for elimination, replace it with, Philip Anderson was perhaps the first physicist to make this argument, in his famous article ["More is Different"](https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf). --> Since order has <em>explanatory</em> significance, it should presumably be tarred with the same ontic brush we apply to things like electrons.</p> <p>Of course, one may object that explanation does not equal existence. I can handily account for the continual disappearance of my socks by the hypothesis of sock imps. But this is a bad explanation! It’s not consistent with other reliably known facts about the world. Sock imps don’t make the ontic cut, not because there is no link between explanation and what we deem to exist, but because that link should only be made for <em>robust</em> explanations, and the poor little sock imps collapse at the first empirical hurdle. That different arrangements of things have different properties is robust, almost to the point of truism, and there seems to be no principled reason to ban order <!-- , or *structure* as we will call it,--> from our ontology.</p> <h4 id="emergence-vs-structure">Emergence vs structure</h4> <p><span style="padding-left: 20px; display:block"> More is different. </span></p> <div style="text-align: right"><i>Philip W. Anderson</i> </div> <p>It’s worth noting the parallel to <em>emergence</em>. In his famous article <a href="https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf">“More is Different”</a>, Philip W. Anderson argued for the idea of domain-specific laws and dynamical principles which did not follow the strict, one-way explanatory hierarchy of reduction, particularly in his field of condensed matter physics. 
And indeed, condensed matter makes a science of order itself, studying how properties of macroscopic wholes (such as phases of matter) “emerge” from the arrangement of microscopic parts. Anderson thought of emergence as patterns that appear when you “zoom out” from the constituents, but which are still made from the constituents; we are just describing those constituents at a different level. <!-- the microscopic perspective as the wrong "level" of description, like being too zoomed in on a microscope, but I think that it is simply different information. --></p> <p>But this seems to suffer from the same problem as a reductionist account of shoes. The “emergent properties” are not properties of the constituents at all! The symmetries, order parameters, <!-- which measure their brokenness, and collective excitations which emerge as long-range messengers of disorder, a are not simply the microscopics "zoomed out".--> and collective excitations studied by condensed matter physicists belong only to the arrangements. In fact, systems made from totally different materials can exhibit the same emergent behaviour [<sup><a id="fnr.2" name="fnr.2" class="footref" href="#fn.2">2</a></sup>]! They are something new, something “over and above” the spins of the lattice, or the carbon atoms of a hexagonal monolayer, since different arrangements of those same parts would have different properties. We can turn Anderson’s snappy slogan around: <em>different is more</em>. If arranging things differently gives them new and different properties, it is a sign of structure, and structure is something over and above the component parts themselves. <!-- often characterising phases of matter in terms of what are called *order parameters*, numbers which characterise the brokenness of a symmetry. 
--></p> <h4 id="what-is-a-particle">What is a particle?</h4> <p><span style="padding-left: 20px; display:block"> It is raining instructions out there; it’s raining programs; it’s raining tree-growing, fluff-spreading, algorithms. That is not a metaphor, it is the plain truth. It couldn’t be any plainer if it were raining floppy discs. </span></p> <div style="text-align: right"><i>Richard Dawkins</i> </div> <p>We don’t need emergence to argue for structure; we can use the elementary components themselves. When philosophers talk about reductionism, they tend to imagine subatomic particles as small, indivisible blobs, without internal organisation or further ontological bells and whistles. An electron might have properties like mass or charge, and obey the curious dictates of quantum mechanics, but all this is packaged irreducibly and not worth further discussion. But if we try and unpack all these “simple” properties, we will find that, like the magic bag of Mary Poppins, a particle is much deeper than it first appears! The Large Hadron Collider does not produce evidence for tiny, structureless blobs. Rather, it confirms at a rate of petabytes per second that the universe is made of mathematics.</p> <p>The state-of-the-art definition of a particle is <!-- (as [this Quanta article](quantamagazine.org/what-is-a-particle-20201112) humorously explores) --> a bit of a mouthful: an <em>irreducible representation of the Lorentz group</em>. In plain English, being a <em>representation</em> means that particles are objects which have or “transform with” symmetries, in the same way a circle looks the same however you rotate it. That it is <em>irreducible</em> means that it cannot be split into smaller parts which have the same symmetry, which is the mathematical avatar of being “indivisible”. Finally, the symmetry itself, the <em>Lorentz group</em>, is the same group describing the shape of empty space according to special relativity. 
So, in summary, a particle transforms with the symmetries of empty space, and cannot be split into parts with this symmetry. <!-- [<sup><a id="fnr.3" name="fnr.3" class="footref" href="#fn.3">3</a></sup>].--> Lurking implicitly in the background is the whole framework of quantum mechanics, and in particular, that particles are <em>states in a Hilbert space</em>. In plain English, we can add and subtract states of a particle, and compare them to each other.</p> <p>Thus, every particle is like a mathematical diamond: indivisible, multifaceted, and structured up to the hilt. When philosophers of science eagerly assent to believe whatever the particle physicists tell them, <!-- particularly when it can be tested with unparalleled precision at the LHC, --> they may not realise what they signed up for! Spacetime, quantum mechanics, and symmetries, the Lorentz group and Hilbert spaces; these are all welded indissolubly to form the most robust and fundamental objects in the universe. Even with something as “simple” as an electron, order is inescapable.</p> <h4 id="unreasonable-effectiveness-and-natural-patterns">Unreasonable effectiveness and natural patterns</h4> <p><span style="padding-left: 20px; display:block"> It is difficult to avoid the impression that a miracle confronts us here, quite comparable… to the two miracles of the existence of laws of nature and of the human mind’s capacity to divine them. </span></p> <div style="text-align: right"><i>Eugene Wigner</i> </div> <p>It may feel like we have jumped from physical to mathematical objects in one fell, tendentious swoop. Do we need Hilbert space, or might another mathematical concept suffice? And does Hilbert space really exist, or is it merely a useful human invention? If the latter, why so useful? This is intentionally designed to rhyme with our earlier statement that order is a robustly explanatory feature of the world, and distinct from the things that are ordered. 
Mathematics really just is the study of order, or <em>patterns</em>, according to their own peculiar and abstract logic. Physics (and to a lesser extent the other sciences) studies <em>natural patterns</em>, the way these structures or forms of order are realised in the natural world. That applies not just to emergent behaviour like phases of matter, but even the crystalline makeup of an elementary particle.</p> <p>I have tried to motivate this perspective from the nature of physical explanation, but perhaps it can teach us about mathematical explanation and its relation to the physical world. A common criticism of Platonism is that, if mathematical objects exist in some non-physical realm, the ability to do mathematics must involve extrasensory perception. Clearly, since we are physical beings, this ability is grounded in physical experience, and now we have a simple explanation: patterns are naturally realised everywhere, from cardinal numbers in counting cows to topology in tying a knot to representation theory in colliding protons. We don’t need magical access to the World of Forms to see these things; they are all around us.</p> <p>Similarly, the <a href="https://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html">unreasonable effectiveness of mathematics</a> for describing the world, first noted by Eugene Wigner, seems no more miraculous than the utility of integers for counting loaves of bread rather than proving results about number theory. We get the patterns from the world, clean them up, rebrand a little, and start connecting them together. The meta-patterns that emerge are remarkable, but the appearance of “unreasonable effectiveness” is the result of a largely successful PR campaign to divorce mathematical structures from their physical origins. 
As Einstein quipped, “Since the mathematicians have invaded the theory of relativity, I do not understand it myself anymore.” The abstraction of pseudo-Riemannian geometry follows from the more concrete act of bouncing light off mirrors.</p> <p>More and more, we are seeing this converse of unreasonable effectiveness, where deep mathematical ideas are inspired by physics. The living embodiment of this trend is Ed Witten, a string theorist whose contributions to mathematics have been so profound and wide-ranging that he earned a Fields Medal (the Nobel prize in mathematics), the only physicist to have ever done so! <!-- for his contributions to low-dimensional topology.--> Once again, there is no mystery here; it is just the usual state of affairs, but without the Platonist guff to distract us. The patterns are out there and always have been.</p> <h4 id="what-is-a-pattern">What is a pattern?</h4> <p><span style="padding-left: 20px; display:block"> Everything comes to be from both subject and form. </span></p> <div style="text-align: right"><i>Aristotle</i> </div> <p>All this raises the question: what is a pattern? <!-- And how is it conjoined with stuff?--> The first and most famous philosophical treatment of these issues is the <a href="https://plato.stanford.edu/entries/form-matter/">hylomorphism of Aristotle</a>, who argued that objects are a compound of both form (the structure, order, or patterns I have discussed here) and matter (energy or “raw potentia”). I won’t discuss Aristotle’s ideas in greater detail. Suffice to say they have deeply informed this post, and the interested reader should check out James Franklin’s <a href="https://link.springer.com/book/10.1057/9781137400734">modern take</a>. 
<!-- for a modern take on Aristotelian structuralism applied.--> Instead, I will approach the question by picking on two smaller problems, taking Newton’s laws as a concrete example.</p> <p>Newton formulated his laws of motion (such as $F = ma$) in terms of forces and acceleration. Does the empirical robustness of these laws mean that this is the only way to formulate them? Not at all! There are two other distinct but equivalent versions of classical mechanics: Lagrangian and Hamiltonian. They explain the same things, make the same predictions, and thus seem to describe the same natural patterns. This suggests to me that although patterns are discovered, formalisms are invented. A pattern is the equivalence class of descriptions.</p> <p>Students of physics will be aware that, although Hamiltonian and Lagrangian mechanics are equivalent to Newton’s laws in the mechanical context, they have taken on a life of their own. The Lagrangian approach involves the mathematics of optimising functions, while the Hamiltonian approach in its most abstract form becomes the mathematical field of symplectic geometry. Both Lagrangian and Hamiltonian mechanics can be upgraded (with some inspired retrospective guesswork) to frameworks for quantum mechanics, which Newton’s laws simpliciter cannot. There is much more going on than a simple isomorphism of description! A more nuanced view is that humans invent formalisms which can agree on a domain of interest, a restricted equivalence class of explanation if you will. But the formalisms will tend to grow beyond the selvage lines of the original use case. Formalisms are only <em>perspectives</em> on patterns. <!-- capture different patterns, or suggest different extensions, in ways that can depend sensitively on the formalism and the domain of application. --></p> <p>This hints at certain structural “metalaws”. 
Patterns are big and rhizomatic; human-invented mathematical frameworks are a single mathematical glance, if you like, and can only take in part of the pattern. Even if formalisms agree on some domain, they will suggest different corridors of growth. A rectangle may be both an equiangular quadrilateral and a parallelogram with diagonals of equal length, but the notions involved and corresponding generalisations are distinct. <!-- in the two characterisations., and connect along different lines of development to broader ideas. --> This also helps explain the phenomenon of deep connections between apparently unrelated mathematical objects, sometimes only revealed by a clever change of perspective. It could be that there is a <em>paucity of structure</em>, so that by dumb luck (and the <a href="https://en.wikipedia.org/wiki/Pigeonhole_principle">pigeonhole principle</a>), we often unknowingly describe the same thing in a different guise. But to my mind, it is more likely that patterns tend to sprawl and overlap in complex ways. <!-- , which also explains how different angles on the same structure can look unrelated! --> They are less like a few items of furniture in a crumbling garret (paucity of structure) and more like the intertwined flora of a tropical jungle. <!-- And human mathematics typically cannot see the forest for the trees. There are ways to talk about quantum mechanics without Hilbert spaces, and particles without representation theory. That does not mean that the corresponding patterns do not exist, but rather, they can be described in other ways. --></p> <p>The second issue is how accurate our descriptions must be. We know that Newton’s laws are not exactly correct, and break down in regimes far-removed from those of everyday experience, such as the very small (where quantum mechanics applies) or the very fast (where special relativity applies). Does this mean we should stop believing in forces, or Lagrangians, or Hamiltonians? 
This is like the old Platonist quibble that there is no such thing as a perfect circle in the real world, so we must be reasoning about circles in some other realm. In both cases, the pattern is only <em>approximately</em> realised in nature, with bumps and fuzzy edges. But approximation is itself subject to structural laws, exhibiting patterns treated by mathematics (in, e.g., topology) and physics (effective field theory). Perhaps an even better example is statistics, which is literally all about extracting structure from noisy realisations. So structural approximations are clearly robust, lawlike and explanatory, even if they are subtle. Incidentally, this suggests another metalaw: patterns can stand in patterned relations to other patterns. <!-- This is also what emergence is all about! --></p> <p>This ties back to our original question about the nature of physical explanation. Reductionism instructs us to boil things down to their smallest elements. The Aristotelian view is that, really, we should be searching for form and structure at whatever level they happen to occur. This is not only the nature of emergence, but physics more broadly. How else can we connect the study of the large-scale structure of spacetime, quarks, bowling balls, planetesimals, or storm clouds? Physicists almost never boil things down to their smallest elements! Rather, it seems much more accurate to say that they look for patterns “in the wild”. (In contrast, mathematicians study patterns “in captivity”, which gives them that air of artifice and pedigree.)</p> <p>One upshot is that, for better or worse, physicists often wade into other disciplines armed with the lassoo of an Emergent Pattern to corral the apparent complexity. 
See for instance <a href="https://www.penguinrandomhouse.com/books/314049/scale-by-geoffrey-west/">scaling laws</a>, <a href="https://en.wikipedia.org/wiki/Self-organized_criticality">self-organised criticality</a>, <a href="https://en.wikipedia.org/wiki/Small-world_network">small-world networks</a>, and <a href="https://www.englandlab.com/">thermodynamic explanations for life itself</a>. They’re not always right (and they’re not always respectful), but they are just doing their thang.</p> <h4 id="conclusion">Conclusion</h4> <p>I’ve argued that the nature of physical explanation is richer and less boringly hierarchical than the reductionist would have us believe. In order to explain the properties of shoes or particles, it seems not only parsimonious but necessary to commit to the existence of patterns in addition to the things which make those patterns up. This not only jibes with (and ontologically grounds) the notion of emergence, but also provides a handle on the metaphysics and epistemology of mathematical explanation. <!-- and its relation to the physical world. --> Put simply, mathematicians study patterns; physicists study natural patterns. <!-- It tells us where math comes from, why it is unreasonably effective, and to what extent it might be invented or non-unique. Finally, I argued that none of this is spoiled by approximation, since this is just another pattern. --></p> <p>Clearly, I’ve left many questions unanswered. Must patterns be instantiated in the physical world, and if not, where do such patterns live? What is the “mereology” that allows them to combine, or to recursively describe their relationships? And finally, what grounds the truth about patterns, in physics, mathematics, or elsewhere? Most of these I defer to Aristotle, though I hope to write more in future. 
<!-- I leave the systematic exploration of these questions to the future,--> In the meantime, discussion and debate are welcome!</p> <h4 id="acknowledgments-and-references">Acknowledgments and references</h4> <p>I’d like to thank Leon Di Stefano for introducing me to Aristotelian structuralism and many enriching conversations over the years. His ideas <!-- (as articulated in [this 2017 debate with James Fodor](https://www.youtube.com/watch?v=W0j25NteoXc))--> inspired and informed this post. I’ve also been heavily influenced by James Franklin’s book, <a href="https://link.springer.com/book/10.1057/9781137400734"><em>An Aristotelian realist view of mathematics</em></a>. Aristotle himself writes with characteristic brevity on form and matter in <a href="http://classics.mit.edu/Aristotle/physics.1.i.html"><em>Physics (i)</em></a>. Finally, I fitfully consulted the SEP entries on <a href="https://plato.stanford.edu/entries/scientific-reduction/">reductionism</a> and <a href="https://plato.stanford.edu/entries/structuralism-mathematics/">mathematical structuralism</a>.</p> <hr /> <!-- quantamagazine.org/what-is-a-particle-20201112 --> <!-- https://plato.stanford.edu/entries/scientific-reduction/--> <!-- https://plato.stanford.edu/entries/structuralism-mathematics/ --> <div class="footdef"><sup><a id="fn.1" name="fn.1" class="footnum" href="#fnr.1">Footnote 1</a></sup> <p class="footpara"> To be fair, as the quote suggests, the original eliminativists like Paul and Patricia Churchland were much more interested in abolishing psychology than shoes. </p></div> <div class="footdef"><sup><a id="fn.2" name="fn.2" class="footnum" href="#fnr.2">Footnote 2</a></sup> <p class="footpara"> This is called <i>universality</i>, and can be explained using renormalisation, the technical avatar of "zooming out". 
</p></div> <!--<div class="footdef"><sup><a id="fn.3" name="fn.3" class="footnum" href="#fnr.3">Footnote 3</a></sup> <p class="footpara"> Particles can have other symmetries as well. An important class is gauge symmetry, consisting of internal degrees of freedom. , like a dial on a gauge. These gauge symmetries are crucial to formulating the whole Standard Model, and explain, for instance, why an electron has --> <!--charge. </p></div>-->David A WakehamFebruary 8, 2021. Some philosophical reflections on the nature of scientific explanation, structure, emergence, and the unreasonable effectiveness of mathematics.Binomial party tricks2021-02-06T00:00:00+00:002021-02-06T00:00:00+00:00http://hapax.github.io/mathematics/physics/hacker/binomial<p><strong>February 6, 2021.</strong> <em>Sketchy hacker notes on the binomial approximation. The flashy payoff: party trick arithmetic for estimating roots in your head.</em></p> <h4 id="introduction">Introduction</h4> <p>The binomial approximation is the result that, for any real $\alpha$, and $|x| \ll 1$,</p> $(1 + x)^\alpha \approx 1 + \alpha x.$ <p>The usual proof involves calculus. Here, we present a sketchy shortcut and an elementary longcut, neither of which involves calculus, strictly speaking. We also derive the quadratic term, and end with a fun party trick for finding roots.</p> <h4 id="sketchy-shortcut">Sketchy shortcut</h4> <p>We begin with the shortcut. In an <a href="https://hapax.github.io/maths/physics/hacks/exponential/">earlier post</a>, I derived the following result for the exponential, and $|x| \ll 1$:</p> $e^x \approx 1 + x.$ <p>Rather than go off and read the post, we can do even better and simply <em>define</em> the exponential by this property. If it’s true, then for any $r$, we can set $x = r/n$ for very large $n$ to get</p> $e^r = (e^{r/n})^n \approx \left(1 + \frac{r}{n}\right)^n.$ <p>In the limit of infinite $n$, the expression should be exact. 
And indeed, this is the standard definition of $e^r$:</p> $e^r = \lim_{n\to\infty} \left(1 + \frac{r}{n}\right)^n.$ <p>Let’s proceed with a proof of the binomial approximation. The natural logarithm is the inverse function, so that</p> $x = \log e^x \approx \log(1 + x).$ <p>Recall that</p> $x^n = (e^{\log x})^n = e^{n\log x} \quad \Longrightarrow \quad \log x^n = n \log x.$ <p>Thus, taking the logarithm of $(1 + x)^\alpha$, we have</p> $\log [(1+x)^\alpha] = \alpha \log (1+ x) \approx \alpha x,$ <p>and hence</p> $(1+x)^\alpha \approx e^{\alpha x} \approx 1 + \alpha x.$ <p>This works since all the corrections are at higher order in $x$.</p> <h4 id="elementary-longcut">Elementary longcut</h4> <p>This is a bit highbrow, and we can get to the same conclusion using simple algebra. First note that, from the binomial theorem,</p> $(1 + x)^n = 1 + \binom{n}{1}x + \binom{n}{2}x^2 + \cdots + x^n \approx 1 + nx$ <p>for $|x| \ll 1$, neglecting higher order terms which are much smaller. So the binomial approximation is true for whole numbers $n$. If we consider a fraction $q = m/n$, then $(1 + x)^q$ raised to the power $n$ should equal</p> $(1 + x)^{qn} = (1 + x)^{m} \approx 1 + mx \tag{1}\label{m}$ <p>by the binomial theorem. Let’s assume</p> $(1 + x)^{q} \approx 1 + \beta x,$ <p>with some higher order terms we can ignore. Raising to the power $n$, we can use the binomial approximation for $n$ to get</p> $(1 + x)^{qn} \approx (1 + \beta x)^n \approx 1 + \beta n x.$ <p>Comparing to (\ref{m}), we find that $\beta = m/n$, and hence the binomial approximation is true for positive rationals. We can add negative powers using the geometric series:</p> $\frac{1}{1 - x} = 1 + x + x^2 + \cdots \approx 1 + x,$ <p>and hence for a negative rational $q = -m/n$,</p> $(1 + x)^q \approx (1 - x)^{m/n} \approx 1 - \frac{m}{n}x = 1 + qx,$ <p>as required. Finally, there is arbitrary real $\alpha$. This is actually trivial, in some sense. 
Unlike whole numbers (repeated multiplication), fractions (roots), or negative numbers (reciprocals), an irrational power has no obvious interpretation. The most reasonable thing to do is define it as a <em>limit</em> of rational powers that approximate it:</p> $(1 + x)^r = \lim_{n \to \infty} (1 + x)^{q_n},$ <p>where $q_n$ is a sequence of rational numbers (e.g. the decimal expansion) approximating $r$. In this case, the binomial approximation gives</p> $(1 + x)^r = \lim_{n \to \infty} (1 + x)^{q_n} \approx 1 + x \lim_{n \to \infty} q_n = 1 + rx,$ <p>and so the result holds for all real numbers.</p> <h4 id="higher-terms">Higher terms</h4> <p>It’s possible, if messy, to extend these methods to determine the next term in the approximation. We’ll do the longcut, and use big-O notation, with $O(x^3)$ in this context meaning “terms with powers of $x^3$ or higher”. The binomial theorem gives</p> $(1 + x)^n = 1 + nx + \frac{n(n-1)}{2} x^2 + O(x^3), \tag{2} \label{second}$ <p>since the coefficient of the $x^2$ term is the number of ways of choosing $2$ items (the $x$ terms) from $n$ items (the factors in the power). For a rational $q = m/n$, we have</p> $(1 + x)^{qn} = (1 + x)^m = 1 + mx + \frac{m(m-1)}{2} x^2 + O(x^3),$ <p>and if we assume</p> $(1 + x)^{q} = 1 + qx + \gamma x^2 + O(x^3),$ <p>then the binomial theorem again gives</p> $(1 + x)^{qn} = \left[1 + qx + \gamma x^2 + O(x^3)\right]^n = 1 + nqx + \left[n\gamma + \frac{n(n-1)}{2}q^2 \right]x^2 + O(x^3).$ <p>The coefficient of the linear term $nq = m$ matches, but the quadratic term requires more work. Comparing to (\ref{second}) and rearranging for $\gamma$, we have</p> \begin{align*} \gamma &amp; = \frac{1}{n}\left[\frac{m(m-1)}{2}- \frac{n(n-1)}{2}q^2\right] =\frac{m(m-1)}{2n}- \frac{m^2(n-1)}{2n^2} =\frac{q(q - 1)}{2}. \end{align*} <p>Thus, we find that to second order,</p> $(1 + x)^q = 1 + qx + \frac{q(q-1)}{2} x^2 + O(x^3)$ <p>The extension to real and negative powers is easy. 
The extension to higher terms in $x$ is not. They obey something called the binomial series,</p> $(1 + x)^\alpha = \sum_{k = 0}^\infty \frac{\alpha(\alpha - 1)\cdots (\alpha-k +1)}{k!} x^k,$ <p>and I have no idea how to get this without calculus. (One can use “analytic continuation” but this feels too much like cheating to me, partly because it’s not clear why this continuation is unique.) Any tips appreciated!</p> <h4 id="rooting-out-the-answer">Rooting out the answer</h4> <p>The applications are many and various, but the simplest thing we can try is quickly calculating powers $y^\alpha$. The general trick is to find a power near $y$ that is simpler to evaluate, factor out the simple answer, then use the binomial approximation. I think there are actually better ways to estimate positive powers, but the binomial approximation really shines in the estimation of roots. It can even be a good party trick, depending on the kind of parties you go to!</p> <p>Suppose someone asks you to find the square root of $8$. You look for a nearby perfect square, in this case $9$, then factor eight into $9$ times one minus something small:</p> $\sqrt{8} = \sqrt{9\left(1 - \frac{1}{9}\right)} = 3 \left(1 - \frac{1}{9}\right)^{1/2}.$ <p>We can take $\alpha = 1/2$ and $x = -1/9$ in the binomial approximation, and see how we go, noting that</p> $\sqrt{1 - x} = 1 - \frac{1}{2}x - \frac{1}{8}x^2 + O(x^3).$ <p>To first order, we get</p> $3 \left(1 - \frac{1}{9}\right)^{1/2} \approx 3\left[1 - \frac{1}{2} \cdot \frac{1}{9}\right] = \frac{17}{6} \approx 2.83.$ <p>To second order,</p> $3 \left(1 - \frac{1}{9}\right)^{1/2} \approx 3\left[1 - \frac{1}{2} \cdot \frac{1}{9} - \frac{1}{8} \cdot \frac{1}{9^2}\right] = \frac{611}{216} \approx 2.829.$ <p>The actual answer is $\sqrt{8} = 2.828$, so even the first term in the binomial approximation is very good! We’ll finish with a somewhat more involved example. Let’s approximate the fifth root of six, $6^{1/5}$. 
I only know one fifth power off the top of my head, $2^5 = 32$, and this happens to be near $6^2 = 36$. We can chain these observations together as follows:</p> \begin{align*} 6^{1/5} = 36^{1/10} = 32^{1/10}\left(1 + \frac{1}{8}\right)^{1/10} &amp; =\sqrt{2}\left(1 + \frac{1}{8}\right)^{1/10} \approx \sqrt{2} \cdot \left(1 + \frac{1}{10\cdot 8}\right). \end{align*} <p>At this point, we could separately approximate $\sqrt{2}$, but I happen to know it’s about $1.414$, so I can divide by $80$ (or even just $100$ for a quick mental estimate), and add them together to get</p> $6^{1/5} \approx 1.414 + \frac{1.414}{80} \approx 1.43.$ <p>Consulting a calculator, this is correct to two decimal places! With the power of the binomial approximation, you can do it in your head.</p>David A WakehamFebruary 6, 2021. Sketchy hacker notes on the binomial approximation. The flashy payoff: party trick arithmetic for estimating roots in your head.A simplicial generalisation of the Bloch ball2021-02-05T00:00:00+00:002021-02-05T00:00:00+00:00http://hapax.github.io/maths/physics/qc/unitary-orbits<p><strong>February 5, 2021.</strong> <em>I explore unitary orbits of density matrices for finite-dimensional quantum systems. The upshot is a neat scheme for representing orbits using simplices.</em></p> <h4 id="introduction">Introduction</h4> <p>The <a href="https://en.wikipedia.org/wiki/Bloch_sphere">Bloch sphere</a> represents the space of pure states on a single qubit (see also <a href="https://hapax.github.io/physics/mathematics/bloch/">this</a> recent post). The “Bloch ball” is the space of all <em>density matrices</em> on the qubit. 
It fills in the Bloch sphere with concentric spheres of increasing mixedness, and at the centre is the maximally mixed state $I_2/2$, where $I_d$ will denote the $d \times d$ identity matrix.</p> <figure> <div style="text-align:center"><img src="/images/posts/unitary1.png" /> </div> </figure> <p>Spheres arise naturally. 
They carry the structure of the unitary group $\mathrm{U}(2)$ acting on qubits, once we have modded out by the phase ambiguity:</p> $\frac{\mathrm{U}(2)}{\mathrm{U}(1)} = \mathrm{SU}(2).$ <p>This is a double cover of the rotation group $\mathrm{SO}(3)$, which acts transitively on the sphere. (The “double cover” part gives us spinors.) Thus, spheres occur naturally as unitary orbits, and indeed, each concentric sphere in the Bloch ball is such an orbit. The question is whether this generalises nicely to higher dimensions.</p> <h4 id="the-bloch-ball">The Bloch ball</h4> <p>Let’s think about the Bloch ball in a little more detail. Each density matrix $\rho$ is a $2\times 2$ matrix acting on the space of qubits, which is positive and has unit trace. Positivity just means that, for every state $|\psi\rangle$,</p> $\langle \psi | (\rho | \psi \rangle) \geq 0.$ <p>Hence, $\rho$ is Hermitian, since the reality of this inner product implies</p> $\langle \psi | (\rho | \psi \rangle) = (\langle \psi | \rho^\dagger) |\psi \rangle \quad \Longrightarrow \quad \rho = \rho^\dagger.$ <p>In turn, this means that $\rho$ is unitarily diagonalisable, i.e. $U^\dagger \rho U = \Lambda$ for some diagonal matrix $\Lambda$ and unitary matrix $U^\dagger U = UU^\dagger = I$. It’s also clear these eigenvalues must be positive. In fact, since the permutation matrices are unitary, we can arrange the eigenvalues in decreasing size, so that every $2 \times 2$ density matrix is unitarily equivalent to some matrix</p> $\Lambda(p) = \begin{bmatrix} p &amp; \\ &amp; 1-p \end{bmatrix}$ <p>for $p \in [1/2, 1]$. The maximally mixed density $I_2/2$ has a trivial orbit, since it always gets mapped to itself:</p> $U^\dagger I_2 U = U^\dagger U = I_2.$ <p>We can measure the distance from this matrix to $\Lambda(p)$ using the Frobenius norm, aka Hilbert-Schmidt norm. 
This is just the usual vector norm where we treat a matrix $A = [a_{ij}]$ as a big vector:</p> $||A||^2 = \sum_{ij} |a_{ij}|^2 = \mbox{Tr}[A^\dagger A].$ <p>Hence,</p> \begin{align*} ||\Lambda(p) - \tfrac{1}{2}I_2||^2 &amp; = \left|\left| \begin{bmatrix} p - 1/2 &amp; \\ &amp; 1/2-p \end{bmatrix} \right|\right|^2 = 2\left(p - \tfrac{1}{2}\right)^2. \end{align*} <p>It’s easy to see that any density matrix in the unitary orbit of $\Lambda(p)$ has the same distance, since we can use $I_2 = U^\dagger I_2 U$, i.e. it is a class function:</p> \begin{align*} ||U^\dagger \Lambda U - \tfrac{1}{2}I_2||^2 &amp; = \mbox{Tr}\left[(U^\dagger \Lambda U - \tfrac{1}{2}I_2)^\dagger (U^\dagger \Lambda U - \tfrac{1}{2}I_2)\right]\\ &amp; = \mbox{Tr}\left[U^\dagger (\Lambda - \tfrac{1}{2}I_2)^\dagger UU^\dagger (\Lambda - \tfrac{1}{2}I_2) U\right]\\ &amp; = \mbox{Tr}\left[(\Lambda - \tfrac{1}{2}I_2)^\dagger (\Lambda - \tfrac{1}{2}I_2) \right] = ||\Lambda - \tfrac{1}{2}I_2||^2. \end{align*} <p>We can define distance between densities as the Hilbert-Schmidt norm times a positive constant $C$. We choose $C = \sqrt{2}$ so that for pure states with $p = 1$, the associated distance is $r = 2(p - 1/2) = 1$. In general, since each such $r$ is associated with a unique $\Lambda(p)$, we conclude that the space of $2\times 2$ density matrices is a ball consisting of concentric, transitive orbits of the unitary group, with the pure states at $p = 1$, the maximally mixed state at $p = 1/2$, and radius $r = 2(p - 1/2)$ for the orbit of $\Lambda(p)$.</p> <h4 id="orbital-mechanics">Orbital mechanics</h4> <p>A similar story holds in higher dimensions. 
Density matrices are positive and unit trace, so each orbit in dimension $d$ has a canonical representative of the form</p> $\Lambda = \mathrm{diag}(p_1, p_2, \ldots, p_d),$ <p>where the positivity of $\rho$ and unit trace condition imply</p> $\sum_{i=1}^d p_i = 1, \quad p_i \geq 0,$ <p>and we can arrange eigenvalues in descending order:</p> $p_1 \geq p_2 \geq \cdots \geq p_d \geq 0.$ <p>The constraint that the eigenvalues sum to $1$ means that we only need $p_1, p_2, \ldots, p_{d-1}$ to uniquely specify a canonical representative $\Lambda(p_1, p_2, \ldots, p_{d-1})$. We can repeat the calculations from above to show that $I_d/d$ has a trivial orbit, and that any density matrix in the orbit of $\Lambda(p_1, \ldots, p_{d-1})$ has a fixed distance to the mixed state:</p> $r^2(p_1, \ldots, p_{d-1}) = C_d\sum_{i=1}^d \left(p_i - \frac{1}{d}\right)^2,$ <p>where we choose $C_d$ so that the pure states, with $p_1 = 1, p_2 = \cdots = p_d = 0$, have distance $r = 1$. For completeness, since a pure state has $\sum_i (p_i - 1/d)^2 = (1 - 1/d)^2 + (d-1)/d^2 = (d-1)/d$, we note that</p> $C_d = \frac{d}{d - 1}.$ <p>It’s a bit trickier to see what the orbits look like, but in the same way that $I_d$ is fixed by the group $\mathrm{U}(d)$, we can read off fixed subgroups from the eigenvalue decomposition. For instance, a pure state has</p> $p_1 = 1, \quad p_2 = \cdots = p_d = 0.$ <p>The first factor is fixed by $\mathrm{U}(1)$ (corresponding to global phase), while the last $d - 1$ factors are fixed by $\mathrm{U}(d-1)$. These act independently, so that the stabiliser of a pure state is $\mathrm{U}(1) \times \mathrm{U}(d-1)$. 
By the orbit-stabiliser theorem, the orbit of pure states has the (coset) structure</p> $\frac{\mathrm{U}(d)}{\mathrm{U}(1) \times \mathrm{U}(d - 1)}.$ <p>Since $\mathrm{U}(d)$ has dimension $d^2$, this pure state orbit has dimension</p> $d^2 - 1^2 - (d - 1)^2 = 2d - 2,$ <p>and lies on the unit sphere $\mathbb{S}^{d^2-2}$ around the maximally mixed state in our Hilbert-Schmidt metric. This agrees with the Bloch sphere for $d = 2$. 
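</p> <p>As a rough sanity check on the normalisation and the dimension count, here is a minimal Python sketch. The function names (<code>normalisation</code>, <code>radius_squared</code>, <code>pure_orbit_dimension</code>) are hypothetical, chosen only for illustration, and the constant is fixed numerically from the requirement that pure states sit at $r = 1$ rather than from a hard-coded closed form.</p>

```python
from fractions import Fraction


def normalisation(d):
    """Constant C_d, fixed by demanding that the pure state (1, 0, ..., 0) has r = 1."""
    pure = [Fraction(1)] + [Fraction(0)] * (d - 1)
    return 1 / sum((p - Fraction(1, d)) ** 2 for p in pure)


def radius_squared(eigenvalues):
    """Squared distance r^2 from the maximally mixed state, for a list of eigenvalues p_i."""
    d = len(eigenvalues)
    centre = Fraction(1, d)
    return normalisation(d) * sum((Fraction(p) - centre) ** 2 for p in eigenvalues)


def pure_orbit_dimension(d):
    """Dimension of U(d) / (U(1) x U(d-1)), using dim U(n) = n^2."""
    return d**2 - 1**2 - (d - 1) ** 2


if __name__ == "__main__":
    for d in range(2, 6):
        pure = [1] + [0] * (d - 1)
        mixed = [Fraction(1, d)] * d
        # Pure states should sit at r^2 = 1, the maximally mixed state at r^2 = 0.
        print(d, radius_squared(pure), radius_squared(mixed), pure_orbit_dimension(d))
```

<p>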
This seems rather nice, but in general, the orbits will be horrible. First of all, spheres of radius $r &lt; 1$ around the mixed state will now be made up of uncountably many orbits, since there are uncountably many sets of $p_i$ which solve</p> $r^2 = C_d\sum_{i=1}^d \left(p_i -\frac{1}{d}\right)^2$ <p>for $r &lt; 1$. And orbits can be more elaborate for other eigenvalue structures. For instance, if we lump the $p_i$ into $k$ sets of <em>distinct</em> eigenvalues,</p> $P_1, P_2, \ldots, P_K,$ <p>with multiplicity $\mu_J$ associated to eigenvalue $P_J$, then the same argument as above shows that the coset structure is</p> $\frac{\mathrm{U}(d)}{\mathrm{U}(\mu_1) \times \cdots \times \mathrm{U}(\mu_K)},$ <p>known to mathematicians as a <a href="https://en.wikipedia.org/wiki/Generalized_flag_variety#Partial_flag_varieties">partial flag variety</a>. These orbits have dimension</p> $D = d^2 - \sum_{J=1}^K \mu_J^2,$ <p>and lie on a sphere of radius</p> $r^2 = C_d\sum_{J=1}^K \mu_J^2\left(P_J - \frac{1}{d}\right)^2.$ <p>Note that while mixed states are closer to the maximally mixed state, unlike the Bloch ball, they do not lie inside the orbit of pure states. Typically, they have more dimensions! For instance, a generic point with no symmetries (distinct $p_i$), the cosets are of the form</p> $\frac{\mathrm{U}(d)}{(\mathrm{U}(1))^d}$ <p>with dimension $d^2 - d$, so for $d &gt; 2$, these are always bigger than the pure state orbits. It’s certainly possible to say more about this, but who wants to. It’s a mess!</p> <h4 id="the-simplicial-wedge">The simplicial wedge</h4> <p>Our modest goal will be to tidy up some of the mess. The main observation is that the eigenvalues $p_i$ form a probability distribution over $d$ outcomes. 
If they had an arbitrary order, they would live on the standard $(d-1)$-simplex $\Delta_{d-1}$, but because they are arranged in decreasing order, they live on the simplicial “wedge”:</p> $W_{d-1} = \left\{(p_1, \ldots, p_d) : \sum_{i=1}^d p_i = 1, p_1 \geq p_2 \geq \cdots \geq p_d \geq 0\right\}.$ <p>Note that the subscript denotes the number of independent parameters. We can illustrate these ideas for $d = 2$:</p> <figure> <div style="text-align:center"><img src="/images/posts/unitary2.png" /> </div> </figure> <p>We start with the $1$-simplex $\Delta_1$, and divide it two to get the wedge $W_1$. The black dot at the top is the orbit of pure states, and the white dot the maximally mixed state. In general, the wedge $W_{d-1}$ is almost a quotient of $\Delta_{d-1}$ by its symmetry group, the set of permutations $S_d$. But the wedge has literal “edge cases”, stabilised by subgroups of $S_d$ in a way that mirrors the corresponding unitary orbits. More precisely, if a point in $W_{d-1}$ is stabilised by $S_{\mu_1} \times \cdots \times S_{\mu_K}$, then the corresponding coset structure for the orbit is the partial flag variety</p> $\frac{\mathrm{U}(d)}{\mathrm{U}(\mu_1) \times \cdots \times \mathrm{U}(\mu_K)}.$ <p>For instance, pure states have canonical representative</p> $(1, 0, 0, \ldots, 0) \in W_{d-1},$ <p>which is stabilised by the subgroup $S_1 \times S_{d-1}$. This correctly gives the coset orbit</p> $\frac{\mathrm{U}(d)}{\mathrm{U}(1) \times \mathrm{U}(d - 1)}.$ <p>The maximally mixed state, and centroid of the whole simplex, has coordinates</p> $\frac{1}{d}(1, 1, \ldots, 1),$ <p>and is stabilised by the full group $S_d$. As we expect, the orbit is trivial. We can see how this works for a qutrit below. 
We start with the $2$-simplex $\Delta_2$, an equilateral triangle, and cut out the wedge $W_2$:</p> <figure> <div style="text-align:center"><img src="/images/posts/unitary3.png" /> </div> </figure> <p>At the top we have the pure states as usual, and the maximally mixed state at the white centroid. The grey dot represents the fully mixed state on two basis elements. Note that, along the red edges, two coordinates agree, and in fact, each represents a copy of $W_1$, coinciding at the centroid. In general, orbit degeneracies occur precisely at sub-wedges $W_{K-1}$ with interiors parameterised by the coordinates $P_1, \ldots, P_K$ introduced above. But when distinct sub-wedges coincide, we get even more degeneracy. So, the apparent randomness of orbits is somewhat tamed by a geometric hierarchy.</p> <p>Finally, to relate this back to spheres, the nice thing about using the Frobenius norm is that the distance between a density matrix and the maximally mixed matrix is just proportional to the Euclidean distance on the wedge. So we can literally draw concentric spheres emanating from the centroid! Our scheme does not do away with all the messiness of the orbits. But it does provide a simple way to organise and read off some of their basic properties, and generalises in a reasonably natural way the concentric spheres of the Bloch ball.</p> <!-- https://en.wikipedia.org/wiki/Bloch_sphere -->David A WakehamFebruary 5, 2021. I explore unitary orbits of density matrices for finite-dimensional quantum systems. The upshot is a neat scheme for representing orbits using simplices.Turning a thermometer into a sundial2021-01-28T00:00:00+00:002021-01-28T00:00:00+00:00http://hapax.github.io/mathematics/physics/everyday/diurnal<p><strong>January 28, 2021.</strong> <em>I attempt to turn a thermometer (or more specifically, data about the maximum daily temperature) into a sundial.
Though it fails on earth, it works on Mercury!</em></p> <h4 id="introduction">Introduction</h4> <p>The sun heats the earth up, and the earth radiates that heat back into space. As the sun sets, less heat is delivered, and the maximum temperature occurs when the two rates (heat delivered and heat radiated) balance. In this post, we’ll work out how this simple requirement relates maximum temperature to the latitude, time of year, and time of day the maximum occurs, meaning that a thermometer can in principle be used as a sort of sundial. In practice, this is only the first step towards a realistic model, but for the purpose of building narrative tension, I will let the shortcomings of my approach unfold naturally.</p> <h4 id="energy-balance">Energy balance</h4> <p>Consider a small patch of the earth’s surface of unit area, at the point it attains its maximum temperature $T_\text{max}$ in Kelvin. According to the <a href="https://en.wikipedia.org/wiki/Stefan%E2%80%93Boltzmann_law">Stefan-Boltzmann law</a>, it radiates energy away with intensity</p> $I_\text{out} = \sigma T_\text{max}^4, \quad \sigma = 5.67 \times 10^{-8} \frac{\text{W}}{\text{m}^2 \text{ K}^4}.$ <p>Since this is the maximum attained, it must equal the intensity of incoming solar radiation $I_\text{in}$. To a good approximation, this is the radiant intensity of sunlight striking the earth’s surface head on, the so-called insolation constant $I_0$, multiplied by a geometric term $\cos^2\vartheta$ (where $\vartheta$ is the angle the sunlight makes with the vertical to the ground), and an albedo term $(1-a)$ to account for sunlight reflected back:</p> $I_\text{in} = I_0 (1- a )\cos^2\vartheta.$ <p>The insolation constant is $I_0 = 1367 \text{ W/m}^2$ [<sup><a id="fnr.1" name="fnr.1" class="footref" href="#fn.1">1</a></sup>]. The albedo of the earth is around $a = 0.3$, i.e. $30\%$ reflected back into space on average, though this depends on cloud cover, snow, and so on.
We will talk about $\vartheta$ more in a moment. Setting $I_\text{in} = I_\text{out}$ when the maximum is attained, we find</p> $I_0 (1- a )\cos^2\vartheta = \sigma T_\text{max}^4. \label{balance} \tag{1}$ <p>Thus, the maximum temperature is directly related to the length of a shadow!</p> <h4 id="geometry-and-heliometry">Geometry and heliometry</h4> <p>Even more interesting is how $\vartheta$ is related to the earth-sun geometry, and the parameters of latitude, time of year, and time of day. The point directly below the sun, called the <em>subsolar point</em>, rotates at some line of latitude around the earth, with polar angle $\theta_\text{sub}$ (measured from the north pole), depending on the time of year. Here is a basic picture of the setup:</p> <figure> <div style="text-align:center"><img src="/images/posts/diurnal1.png" /> </div> </figure> <p>At either equinox, it coincides with the equator (red line). At the (northern hemisphere’s) summer solstice, it runs along the Tropic of Cancer, about $23.5^\circ$ north of the equator. At the winter solstice, it lies $23.5^\circ$ south of the equator, on the Tropic of Capricorn. If we draw the orbit of the earth as a circle around the sun, with $\varphi = 0$ at the winter solstice and increasing with time, then the subsolar latitude, measured in radians from the north pole, roughly obeys</p> $\theta_\text{sub} = \frac{\pi}{2} + \left(\frac{2\pi}{360}\right) 23.5 \cos(\varphi). \label{year} \tag{2}$ <p>To calculate the angle $\vartheta$, we need two additional data points: the latitude $\theta_\text{lat}$ of the observation point (measured from the north pole) and the azimuthal angle $\phi$ between the observation point and the current subsolar point. This simply measures time from solar noon.
To determine $\vartheta$, first note that if we draw the subsolar and observation point on the same great circle of the earth, $\vartheta$ is clearly the angle between the black lines, drawn from each point to the centre of the earth [<sup><a id="fnr.2" name="fnr.2" class="footref" href="#fn.2">2</a></sup>]:</p> <figure> <div style="text-align:center"><img src="/images/posts/diurnal2.png" /> </div> </figure> <p>This means we can easily determine $\cos\vartheta$ using vectors, simply by taking the dot product. To begin with, we write in spherical coordinates $(\theta,\phi)$, then convert to Cartesian coordinates $(x, y, z)$:</p> \begin{align*} \mathbf{x}_\text{sub} (\theta_{\text{sub}}, 0) &amp; = (\sin \theta_\text{sub}, 0, \cos\theta_\text{sub}) \\ \mathbf{x}_\text{obs} (\theta_{\text{lat}}, \phi) &amp; = (\sin \theta_\text{lat}\cos\phi, \sin \theta_\text{lat}\sin\phi, \cos \theta_\text{lat}). \end{align*} <p>We can immediately determine the dot product:</p> $\cos\vartheta = \mathbf{x}_\text{sub} \cdot \mathbf{x}_\text{obs} = \cos\theta_\text{sub}\cos\theta_\text{lat} + \sin \theta_\text{sub}\sin \theta_\text{lat}\cos \phi. \label{geohelio} \tag{3}$ <p>Plugging this back into (\ref{balance}), we find a relationship between maximum temperature $T_\text{max}$, time of year via $\theta_\text{sub}$, latitude $\theta_\text{lat}$, and time of day, or rather, time past solar noon $\phi$.</p> <h4 id="real-data">Real data</h4> <p>The question is: how does this stack up against real data? I’ll take some local weather data. 
In Vancouver, the latitude is $49.3^\circ$ north of the equator, with polar coordinate</p> $\theta_{\text{lat}} = \left(\frac{2\pi}{360}\right)(90 - 49.3) \approx 0.71.$ <p>It’s $36$ days or about a tenth of a year since the winter solstice, so from (\ref{year}), the subsolar latitude is</p> $\theta_\text{sub} = \frac{\pi}{2} + \left(\frac{2\pi }{360}\right) 23.5 \cos(0.1 \cdot 2\pi) \approx 1.9.$ <p>This agrees with <a href="https://rl.se/sub-solar-point">real-time data</a> on the subsolar point. Finally, the <a href="https://www.timeanddate.com/weather/canada/vancouver/historic?month=1&amp;year=2021">maximum temperature yesterday</a> was $7^\circ \text{ C} = 280 \text{ K}$, and cloud cover makes $a \approx 0.35$. Thus, rearranging (\ref{geohelio}) and (\ref{balance}), we expect the maximum to occur at a “time of day angle” $\phi$ given by</p> \begin{align*} \cos \phi &amp; = \frac{\sqrt{\frac{\sigma T_\text{max}^4}{I_0(1-a)}} - \cos\theta_\text{sub}\cos\theta_\text{lat}}{\sin \theta_\text{sub}\sin \theta_\text{lat}} \\ &amp; = \frac{\sqrt{\frac{(5.67 \times 10^{-8}) 280^4}{1367(1-0.35)}} - \cos 1.9\cos 0.71}{\sin 1.9\sin 0.71} \\ &amp; \approx 1.41. \end{align*} <p>Hopefully the problem is clear. The final value is bigger than one, so it cannot possibly equal the cosine on the left! If we plug in the time it peaked, a few hours after solar noon, we can rearrange and solve to find a predicted maximum temperature of $-80^\circ \text{ C}$! So something is very wrong.</p> <h4 id="conclusion">Conclusion</h4> <p>We’ve neglected an important factor: the atmosphere. This is the very same thing needed to explain why the temperature of the earth is higher than expected from a simple energy balance argument. Basically, the atmosphere acts as a heat bath in contact with the earth, allowing for greater maximal temperatures.
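</p> <p>As a quick numerical sanity check on the Vancouver calculation, here is a minimal Python sketch of the rearranged balance equation. The function name <code>cos_time_angle</code> and the choice to try two albedo values are illustrative assumptions, not part of the original analysis.</p>

```python
import math

SIGMA = 5.67e-8   # Stefan-Boltzmann constant, W / (m^2 K^4)
I0 = 1367.0       # insolation constant, W / m^2


def cos_time_angle(t_max, theta_sub, theta_lat, albedo):
    """Solve equations (1) and (3) for cos(phi), the time-of-day angle."""
    # cos(vartheta) from the energy balance I0 (1 - a) cos^2 = sigma T^4
    cos_zenith = math.sqrt(SIGMA * t_max**4 / (I0 * (1.0 - albedo)))
    return (cos_zenith - math.cos(theta_sub) * math.cos(theta_lat)) / (
        math.sin(theta_sub) * math.sin(theta_lat)
    )


# Vancouver, about a tenth of a year after the winter solstice
theta_lat = math.radians(90 - 49.3)
theta_sub = math.pi / 2 + math.radians(23.5) * math.cos(0.1 * 2 * math.pi)

for albedo in (0.30, 0.35):  # average albedo, and a cloudier guess
    c = cos_time_angle(280.0, theta_sub, theta_lat, albedo)
    print(f"albedo = {albedo:.2f}: cos(phi) = {c:.2f}")
```

<p>Under these assumptions both albedo values give a result above one, so no real time-of-day angle solves the balance equation.</p> <p>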
It may be possible to turn a thermometer into an accurate sundial using a <a href="https://en.wikipedia.org/wiki/Idealized_greenhouse_model">simple greenhouse model</a>. However, with parameters appropriately modified, our naive approach should work on a planet without substantial atmosphere like Mercury.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>Thanks to A.B. for asking when daily temperatures peak, and suggesting this might depend on latitude.</p> <hr /> <div class="footdef"><sup><a id="fn.1" name="fn.1" class="footnum" href="#fnr.1">Footnote 1</a></sup> <p class="footpara"> This comes once more from the Stefan-Boltzmann law (for the surface temperature of the sun $T_\odot = 5800 \text{ K}$), and an inverse square drop-off: $$I_0 = \sigma T_\odot^4 \left(\frac{R_\odot}{d}\right)^2 = 5.67 \times 10^{-8} \cdot 5800^4 \left(\frac{7 \times 10^5}{1.5\times 10^8}\right)^2\, \frac{\text{W}}{\text{m}^2}\approx 1400 \, \frac{\text{W}}{\text{m}^2},$$ where $R_\odot = 7 \times 10^5 \text{ km}$ is the solar radius and $d = 1.5 \times 10^8 \text{ km}$ the earth-sun distance. </p></div> <div class="footdef"><sup><a id="fn.2" name="fn.2" class="footnum" href="#fnr.2">Footnote 2</a></sup> <p class="footpara"> We are making the usual assumption that the sun is far enough away to treat incoming rays as parallel. For the same reason, we ignore the way radiant intensity changes (due to the inverse square law) with $\vartheta$. </p></div> <!-- http://www.bom.gov.au/products/IDV60901/IDV60901.95936.shtml ((60*12 )/(2*pi))*arccos((sqrt((5.6*10^(-8)*(273+7)^4)/(1367(0.65))) + cos(1.9)cos(2*pi*(40.7/360)))/(sin(1.9)sin(2*pi*(40.7/360)))) 2*pi(90 - 23.6*sin(pi/2 + pi/6))/360 https://www.timeanddate.com/weather/canada/vancouver/historic?month=1&year=2021 https://www.sjsu.edu/faculty/watkins/diurnaltemp.htm (1367(1-0.3)(\cos 1.9\cos 0.71 + \sin 1.9\sin 0.71 * cos(pi/6))^2/(5.67 \times 10^{-8}))^(1/4) --> <!-- Let's test this out on some real data. 
Today, in a certain large city, the temperature peaked at $25.0^\circ \text{ C}$ around $2.5$ hours after solar noon. We will guess the city! First, we note that it's around $36$ days or a tenth of a year since the winter solstice, so from (\ref{year}), the subsolar latitude is $$\theta_\text{sub} = \frac{\pi}{2} + \left(\frac{2\pi }{360}\right) 23.5 \cos(0.1 \cdot 2\pi) \approx 1.9.$$ Two and a half hours after solar noon translates to $2.5/24$ times a full rotation, so $\phi \approx \pi/5$. Putting these numbers into (\ref{geohelio}) and rearranging using trigonometric identities, we get $$\cos\vartheta \approx 0.57 \sin (\theta_\text{lat} - 0.60).$$ Inserting into (\ref{balance}) and rearranging yields $$\theta_\text{lat} = 0.60 + \sin^{-1}\left[\frac{1}{0.57}\sqrt{\frac{5.67 \times 10^{-8} (273+25)^4}{1367 (1- 0.3)}}\right] = 1.77,$$ or in -->David A WakehamJanuary 28, 2021. I attempt to turn a thermometer (or more specifically, data about the maximum daily temperature) into a sundial. Though it fails on earth, it works on Mercury!Cashing a blank check2021-01-26T00:00:00+00:002021-01-26T00:00:00+00:00http://hapax.github.io/mathematics/statistics/everyday/check<p><strong>January 26, 2021.</strong> <em>Suppose you find a blank check on the ground, and unscrupulously decide to cash it in. If overdrawing gets you nothing, how much should you cash it in for? Assuming wealth follows the 80-20 rule, the answer is: almost nothing!</em></p> <h4 id="introduction">Introduction</h4> <p>In the film “Blank Check” (1994), 11-year-old Preston Waters is handed a blank check, and cashes it in for a million dollars. Luckily, this is precisely the amount of money that the check’s signer, a convict attempting to launder his ill-gotten gains, has left with the bank’s president. But what if Preston overdrew, asking for, say, $10$ billion? This would probably have raised the suspicions of the complicit bank president and the check would have bounced altogether.
When I was a kid, I thought it was incredibly lucky for Preston to find the check in the first place. I now think drawing the precise amount of money held in trust is infinitely luckier. But this raises the question: if you find a blank check, and you don’t want it to bounce, how much should you cash it in for?</p> <h4 id="expected-return">Expected return</h4> <p>I’ll assume we know nothing about the identity of the signer, and that if they have a balance of $b$, and we make out the value of the check to be $v$, then the check will bounce if $v &gt; b$. Our strategy will be to calculate the expected return for $v$ and then maximise it. If $f(b)$ is the probability distribution for bank balances, then the expected return for $v$ is simply $v$ multiplied by the probability that $b \geq v$:</p> $E(v) = v \int_v^\infty f(b) \, db = v[1 - F(v)] = v \bar{F}(v),$ <p>where $F$ is the cumulative distribution function, and $\bar{F} = 1 - F$ the tail. To maximise this, we assume the curve is smooth, differentiate and set to $0$, using $\bar{F}' = -f$:</p> $E'(v) = \bar{F}(v) - vf(v) = 0 \quad \Longrightarrow \quad v = \frac{\bar{F}(v)}{f(v)}.$ <p>Any $v$ which satisfies this equation is an extremum.</p> <h4 id="long-and-short-tails">Long and short tails</h4> <p>Now the question is how to model the distribution of bank balances. This is the sort of thing expected to follow a power-law curve like the <a href="https://en.wikipedia.org/wiki/Pareto_distribution">Pareto distribution</a>, the proverbial “80-20” curve. This is simply defined by its power-law tails:</p> $\bar{F}(v) = \left(\frac{L}{v}\right)^\alpha,$ <p>where $L$ is the minimum amount to keep a bank balance open (say a monthly fee), and $\alpha &gt; 0$ is a shape parameter we will “leave blank” for the moment. This is a well-defined tail since it equals one at $v = L$ and heads to zero as $v \to \infty$.
The probability density for $v \geq L$ is</p> $f(v) = -\bar{F}'(v) = \frac{\alpha L^\alpha}{v^{\alpha + 1}}.$ <p>The optimal draw then obeys</p> $v = \frac{\bar{F}(v)}{f(v)} = \left(\frac{L}{v}\right)^\alpha \cdot \frac{v^{\alpha + 1}}{\alpha L^\alpha} = \frac{v}{\alpha}.$ <p>For $\alpha \neq 1$, the only solutions are $v = 0$ and $v = \infty$! For $\alpha &gt; 1$, we can plot the expected return $E(v)\propto v^{1-\alpha}$, and see that it monotonically decreases, with the maximum at $v = L$. Preston should only have asked for a few bucks! But perhaps this is an artefact of the infinite power-law tail. A more realistic choice is the <em>truncated</em> Pareto distribution, where the power law is confined to $L \leq v \leq H$ for an upper limit $H$, say the personal wealth of Jeff Bezos or Elon Musk. The density for the truncated Pareto distribution is simply a conditional probability, conditioned on being in the interval $[L, H]$:</p> $f(v) = \frac{\alpha L^{\alpha}v^{-(\alpha+1)}}{1 - (L/H)^\alpha},$ <p>and the tail is</p> $\bar{F}(v) = \int_v^H \frac{\alpha L^{\alpha}b^{-(\alpha+1)}}{1 - (L/H)^\alpha} db = \frac{(L/v)^\alpha - (L/H)^\alpha}{1 - (L/H)^\alpha}.$ <p>Thus, we now have to solve</p> $v = \frac{\bar{F}(v)}{f(v)} = \frac{(L/v)^\alpha - (L/H)^\alpha}{\alpha L^{\alpha}v^{-(\alpha+1)}} \quad \Longrightarrow \quad v = (1-\alpha)^{1/\alpha} H.$ <!-- Once again, the answer is independent of the lower bound. , but proportional to the upper bound, which as we take $H \to \infty$, returns our original result. --> <p>If $\alpha &lt; 1$, then we do get a finite answer, proportional to the upper bound, so for instance if $\alpha = 0.5$, and we take the upper limit to be around 100 billion dollars, then Preston should ask for</p> $v \sim (1-0.5)^{1/0.5} \times 10^{11} = 25 \text{ billion dollars},$ <p>or $0.25$ of some other reasonable guess for $H$.
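</p> <p>The closed form $v = (1-\alpha)^{1/\alpha} H$ can be checked against a brute-force maximisation of $E(v) = v\bar{F}(v)$. The sketch below is only illustrative: the function names, the grid resolution, and the values of <code>LOW</code> and <code>HIGH</code> are hypothetical choices, not figures from the analysis above.</p>

```python
def expected_return(v, alpha, low, high):
    """E(v) = v * tail(v) for a Pareto distribution truncated to [low, high]."""
    tail = ((low / v) ** alpha - (low / high) ** alpha) / (
        1.0 - (low / high) ** alpha
    )
    return v * tail


def best_draw_numeric(alpha, low, high, steps=100_000):
    """Brute-force argmax of E(v) on an evenly spaced grid over [low, high]."""
    grid = (low + (high - low) * i / steps for i in range(steps + 1))
    return max(grid, key=lambda v: expected_return(v, alpha, low, high))


def best_draw_closed_form(alpha, high):
    """Stationary point v = (1 - alpha)^(1/alpha) * H, valid for 0 < alpha < 1."""
    return (1.0 - alpha) ** (1.0 / alpha) * high


LOW, HIGH = 10.0, 1e11  # hypothetical monthly fee and upper limit, in dollars

for alpha in (0.25, 0.5, 0.75):
    numeric = best_draw_numeric(alpha, LOW, HIGH)
    exact = best_draw_closed_form(alpha, HIGH)
    print(f"alpha = {alpha}: grid {numeric:.3e}, closed form {exact:.3e}")
```

<p>Since $E(v)$ is concave for $\alpha &lt; 1$, the grid maximum should sit within a grid step or so of the closed-form value.</p> <p>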
But if $\alpha \geq 1$, the prefactor is not a positive real number, and as for the full Pareto distribution, the maximum expected return occurs at $L$. And indeed, wealth typically does obey an approximate Pareto distribution with $\alpha &gt; 1$. For instance, the proverbial “80-20” rule corresponds to $\alpha \approx 1.16$, and <a href="https://www.sciencedirect.com/science/article/abs/pii/S0165176505002995">this analysis</a> of the Forbes 400 richest people in the United States finds a shape parameter of $\alpha = 1.49$. So once again, a perfectly rational Preston Waters would ask only for the monthly fee! But this would make for a far less entertaining movie.</p>David A WakehamJanuary 26, 2021. Suppose you find a blank check on the ground, and unscrupulously decide to cash it in. If overdrawing gets you nothing, how much should you cash it in for? Assuming wealth follows the 80-20 rule, the answer is: almost nothing!A simple proof of the bus paradox2021-01-26T00:00:00+00:002021-01-26T00:00:00+00:00http://hapax.github.io/mathematics/statistics/everyday/paradox-bus<p><strong>January 26, 2021.</strong> <em>The bus paradox states that, if buses arrive randomly but on average every ten minutes, the expected waiting time is ten minutes rather than five. I give a simple proof involving no integrals or formal probability theory.</em></p> <h4 id="introduction">Introduction</h4> <p>The bus paradox (also called the waiting time or <a href="https://en.wikipedia.org/wiki/Renewal_theory#Inspection_paradox">inspection paradox</a>) is a counterintuitive result about waiting times between random events. Suppose buses arrive randomly, with an average period of $\lambda$ between arrivals. If you go to catch a bus, you might expect to wait a period $\lambda/2$, since if one bus arrives $\lambda/2$ after you do, and (by symmetry) the previous one $\lambda/2$ before you do, then the gap between them is $\lambda$. This reasoning is wrong, and rather unexpectedly, the expected wait time is $\lambda$.
The goal of this post is to give a proof which does not require any integrals or formal probability theory, and makes the role of assumptions manifest.</p> <h4 id="the-bus-loop">The bus loop</h4> <p>We start by considering a circle of total length $L$, on which we place $k$ points at random (white in the image below). This models a length of time, such as the day, and the random arrival of $k$ buses. The average distance between points (going clockwise, for instance) is clearly</p> $\lambda = \frac{L}{k}.$ <p>Let us place another point on the circle at random (black in the image below). This represents the commuter who wishes to catch a bus.</p> <figure> <div style="text-align:center"><img src="/images/posts/bus1.png" /> </div> </figure> <p>Since we now have $k + 1$ points placed at random, the same reasoning as above tells us that the average distance is</p> $\frac{L}{k +1} = \left(\frac{k}{k+1}\right)\lambda.$ <p>Translating into the language of bus schedules, this means that if buses have a fixed but random schedule over some length of time, with average interarrival time $\lambda$, the expected wait time is <em>not</em> $\lambda$, but rather, smaller than $\lambda$ by a factor of $k/(k+1)$, where $k$ is the total number of buses over the period.</p> <h4 id="the-bus-paradox">The bus paradox</h4> <p>The bus paradox applies to a schedule which does not repeat. Let us take $L, k \to \infty$ but leave $\lambda = L/k$ fixed. We represent this by an infinitely large circle, with a straight edge, in the image below. Then the expected waiting time is</p> $\left(\frac{k}{k+1}\right)\lambda \to \lambda.$ <p>Thus, the arrival of the commuter is equivalent to adding another random bus. The corresponding interarrival period is modified, but by a vanishingly small coefficient as $k \to \infty$. 
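</p> <p>The finite-loop formula is easy to test by simulation. The following Python sketch is a rough illustration rather than part of the proof: the function name <code>mean_wait</code>, the number of trials, and the value of <code>MEAN_GAP</code> are all assumptions made for the example.</p>

```python
import random


def mean_wait(num_buses, loop_length, trials=20_000, seed=0):
    """Average clockwise distance from a random commuter to the next random bus."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        commuter = rng.uniform(0.0, loop_length)
        buses = [rng.uniform(0.0, loop_length) for _ in range(num_buses)]
        # Clockwise gap to each bus, wrapping around the loop; keep the nearest.
        total += min((bus - commuter) % loop_length for bus in buses)
    return total / trials


MEAN_GAP = 10.0  # average interarrival time lambda, e.g. in minutes

for k in (1, 5, 50):
    simulated = mean_wait(k, k * MEAN_GAP)
    predicted = k / (k + 1) * MEAN_GAP
    print(f"k = {k}: simulated {simulated:.2f}, predicted {predicted:.2f}")
```

<p>As $k$ grows with $\lambda$ held fixed, the simulated wait should creep up towards $\lambda$ itself, in line with the factor $k/(k+1)$.</p> <p>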
This completes our simple proof of the bus paradox.</p> <figure> <div style="text-align:center"><img src="/images/posts/bus2.png" /> </div> </figure> <p>It’s a little tricky, of course, to formulate what it means to place the buses “uniformly” on an infinite line, and this is exactly what the <a href="https://en.wikipedia.org/wiki/Poisson_point_process#Homogeneous_Poisson_point_process">Poisson process</a> (and more generally <a href="https://en.wikipedia.org/wiki/Renewal_theory#Inspection_paradox">renewal theory</a>) achieves. But rather than introduce all this formal baggage, we can simply consider the limit of the uniform process to arrive at the correct conclusion, and with greater clarity than when the answer is concealed in thickets of algebra.</p> <h4 id="conclusion">Conclusion</h4> <p>The reasoning outlined in the introduction is not completely off the mark. It applies when the buses arrive at fixed intervals $\lambda$, and the commuter randomly. The expected time to the previous bus $t_-$ and the expected time to the next bus $t_+$ must add to give the interval $\lambda$ between buses, and by time symmetry, they must be equal:</p> $t_+ + t_- = \lambda, \quad t_+ = t_- \quad \Longrightarrow t_+ = t_- = \frac{\lambda}{2}.$ <p>In this case, there is a clear distinction between the stochasticity of buses and commuters. But when everything arrives randomly, a commuter becomes like just another bus.</p> <!-- So waiting time equals interarrival time. --> <!-- When the buses are random, our argument explains why this argument breaks down: the commuter is like another bus! They are just another random point in the sequence, and must therefore have the --> <!-- There are a few other fun things we can do, however. 
If we add $n$ commuters, for $n = o(k)$, then when they sprinkled randomly among the buses, it is overwhelmingly likely that the next thing to come along will be a bus rather than a commuter (with probability $k/(k+n) \to 1$), and hence the expected wait time is $$\left(\frac{k}{k+n}\right)\lambda \to \lambda.$$ But for finite $n$, the time to -->David A WakehamJanuary 26, 2021. The bus paradox states that, if buses arrive randomly but on average every ten minutes, the expected waiting time is ten minutes rather than five. I give a simple proof involving no integrals or formal probability theory.Integrals from pyramids2021-01-22T00:00:00+00:002021-01-22T00:00:00+00:00http://hapax.github.io/mathematics/pyramid<p><strong>January 22, 2021.</strong> <em>I present an elementary, first-principles trick for integrating polynomials: splitting a hypercube into congruent pyramids.</em></p> <h4 id="introduction">Introduction</h4> <p>Derivatives compute slopes at a point. Integrals compute areas under curves. The first is a local operation, involving only information in a neighbourhood of a point, while the latter is <em>global</em>, involving the value of the function at different points. This makes integration a lot harder than differentiation!</p> <figure> <div style="text-align:center"><img src="/images/posts/pyramid1.png" /> </div> </figure> <p>However, sometimes we have a shortcut for integrating: identifying an integral with the volume of a solid. A simple example is a linear function, $f(x) = mx$. When we integrate from $x = 0$ to $x = b$, the area under the curve is just a triangle, obeying $A = bh/2$ for height $h = mb$. We can represent this reasoning in a picture:</p> <figure> <div style="text-align:center"><img src="/images/posts/pyramid2.png" /> </div> </figure> <p>But what happens if we want to integrate $x^2$? 
There doesn’t seem to be any analogous geometry, and we are forced to do something fancy (like use the <a href="https://en.wikipedia.org/wiki/Fundamental_theorem_of_calculus">fundamental theorem of calculus</a>) if we want to find the area under the curve.</p> <h4 id="a-triangular-warm-up">A triangular warm-up</h4> <p>But it turns out we haven’t tried hard enough! There is a simple geometric approach to integrating $x^2$ and all the higher monomials $x^n$. This lets us integrate any polynomial by simply adding monomial terms. To see how to do this, let’s first think of the integral of a linear function in a slightly different way. Rather than as half a square, let’s slide the “height” of the triangle down so it becomes isosceles. The area is unchanged since $b$ and $h$ have now swapped roles.</p> <figure> <div style="text-align:center"><img src="/images/posts/pyramid3.png" /> </div> </figure> <p>Now we double this triangle, and see it covers half of a square of area $2bh$. Since twice the area of the triangle equals half the area of this square,</p> $2A = \frac{1}{2} \cdot 2bh \quad \Longrightarrow \quad A = \frac{1}{2}bh.$ <p>This may seem like a convoluted reinterpretation, but it generalises in a lovely way to help us integrate polynomials.</p> <h4 id="pyramids-and-hypercubes">Pyramids and hypercubes</h4> <p>A hypercube or $n$-cube is a cube in $n$ dimensions. Formally, we can view it as all points</p> $I^n = \{(x_1, x_2, \ldots, x_n) : x_i \in [0, 1]\} = [0, 1]^n.$ <p>For instance, a $1$-cube is the unit interval $I = [0, 1]$, while a $2$-cube is the unit square $[0 ,1]^2$. The $3$-cube is what we usually mean by a “cube”. Now, the length of the unit interval is $1$, the area of the unit square is $1^2 = 1$, and the volume of the unit cube is $1^3 = 1$.
The pattern continues, with the volume simply given by the product of the length of each side of the hypercube, $1^n = 1$.</p> <figure> <div style="text-align:center"><img src="/images/posts/pyramid4.png" /> </div> </figure> <p>Let us now divide a hypercube in the following way: draw a point at the centre, and from that point, draw a line to each corner. These lines form the edges of hyperpyramids, each with an $(n-1)$-hypercube as its base, which sounds a bit crazy but is actually very simple. We illustrate for the simple cases below.</p> <figure> <div style="text-align:center"><img src="/images/posts/pyramid5.png" /> </div> </figure> <p>These (hyper)pyramids are all congruent, i.e. have the same shape and size, so to work out their volume, all we need to do is compute how many there are. Since each pyramid has an $(n-1)$-cube or <em>face</em> as a base, this is the same as counting faces. But this is easy: along any dimension there are two faces, corresponding to fixing $x_i = 0$ or $x_i = 1$ for some $i$. Thus, there are $2n$ faces. Just to check this makes sense, we have $2 \cdot 1 = 2$ “faces” or endpoints for a line, $2 \cdot 2 = 4$ sides to a square, and $2 \cdot 3 = 6$ faces for a cube. Thus, each pyramid has a volume</p> $V_n = \frac{1}{2n}.$ <p>To connect to our warm-up exercise, note that in two dimensions, the pyramid is a triangle with a side as its base.</p> <h4 id="slicing-pyramids">Slicing pyramids</h4> <p>Let’s now focus on a single pyramid. We can move along the line from the tip to the centre of the base, and graph the area of the cross-section of the pyramid passing through that point, parallel to the base. Each slice will be a shrunken copy of the base itself. As examples, on the square the “pyramid” is just a quarter triangle. The cross-section is a line (a copy of the base, which is a side of the square), which is increasing linearly in length. Similarly, for a cube, the pyramid is a bona fide square-based pyramid, and each slice is a square as well.
We draw some pictures below:</p> <figure> <div style="text-align:center"><img src="/images/posts/pyramid6.png" /> </div> </figure> <p>As we go along, the side length of the slice will change linearly. But the <em>area</em> will change in a way that depends on the dimension we are working in! It stays linear on the square, since it has $2 - 1 = 1$ dimension. For a cube with $n = 3$, the slice is a square whose area changes <em>quadratically</em>. The pattern continues, and in $n$ dimensions, slicing a pyramid results in a cross-section which grows as $x^{n-1}$ for a parameter $x$ going from $x = 0$ at the tip of the pyramid to $x = 1$ at the base.</p> <h4 id="integrating-monomials">Integrating monomials</h4> <p>We can add up the area of each cross-section precisely by integrating with respect to $x$. The answer is not quite the volume of the pyramid, however, since the distance from the tip of the pyramid to the centre of the base is actually $d = 1/2$. So $x$ is <em>twice</em> the actual distance. If we want to integrate to find the volume, the correct “infinitesimal width” of a cross-section is not $dx$, but $dx/2$. The corresponding integral should then give us the volume we calculated above:</p> $\int_0^{1} x^{n-1} \, \frac{dx}{2} = V_n = \frac{1}{2n} \quad \Longrightarrow \quad \int_0^{1} x^{n-1} \, dx = \frac{1}{n}.$ <p>If instead of a unit hypercube, we have a cube of side length $b$, then the volume of the whole hypercube is $b^n$, and hence the volume of a pyramid is $b^n/2n$. If we let our parameter $x$ go from $x = 0$ at the tip to $x = b$ at the base, then once again it is twice the distance, and the same reasoning shows that</p> $\int_0^{b} x^{n-1} \, dx = \frac{b^n}{n}.$ <p>Thus, we have geometrically integrated an arbitrary monomial!</p> <h4 id="acknowledgments">Acknowledgments</h4> <p>Thanks to J.A. for a stimulating discussion of integration from first principles.</p>David A WakehamJanuary 22, 2021. 
I present an elementary, first-principles trick for integrating polynomials: splitting a hypercube into congruent pyramids.