September 8, 2018.

Getting a handle on higher-dimensional objects
is hard. In very high dimensions, however, we can view simple shapes
as random processes, and convert statistical properties into
geometric ones. I’ll explore this approach for hypercubes: loosely speaking, the geometry of coin flips.
Introduction
Prerequisites: undergraduate probability.
Visualising things clearly in three dimensions is hard enough, and
it’s what our brains were designed for.
Imagining higher spatial dimensions seems almost impossible!
For four spatial dimensions, there is a trick: pretend that the
additional dimension is time.
For instance, you could think of a 4-dimensional cube as a regular
cube which pops into existence, sits there for a second, then disappears again.
If somebody records a video of the cube with 3D goggles, we have a
“graph” of the cube in four dimensions.
In other words, we convert a process (the popping into and out of
existence) into geometry.
A flatlander struggles to visualise 3D.
Although it helps us visualise a 4-dimensional cube, the analogy does
not yield any real insights into the geometric properties.
In fact, it’s misleading!
Special relativity teaches us that time is just another dimension,
sure, but one which is fundamentally different from the spatial directions.
This is captured by the fact that you can’t travel faster than light.
However, it turns out that a version of this trick, geometrising a process, works for $n$ dimensions, where $n$ is much larger than $4$.
Although we can’t directly visualise a million-dimensional space this
way, the process viewpoint encodes some deep insights into
higher-dimensional shapes.
Roughly speaking, shapes are like probability distributions,
dimensions are samples from these distributions, and the behaviour of
many data points is much simpler than the behaviour of a few.
Hypercubes
We’re going to focus on the simplest family of higher-dimensional
shapes: the hypercube, a generalisation of the cube.
Examples are the unit interval $[0,1]$, the unit square $[0,1]^2$, the cube $[0,1]^3$, and the tesseract $[0,1]^4$, pictured below.
The first four hypercubes.
We build the square by taking two copies (red and blue) of the interval $[0,1]$, placing them a unit distance apart in a new orthogonal direction, and joining them up (purple).
We build the cube by joining two copies of the square, and the
tesseract by joining two copies of the cube.
Hopefully you get the idea: to construct $[0,1]^n$, put two copies of $[0,1]^{n-1}$ a unit distance apart in a new dimension orthogonal to all the others, and join them up.
Thus, we can explicitly write

$$[0,1]^n = [0,1]^{n-1} \times [0,1] = \underbrace{[0,1] \times [0,1] \times \cdots \times [0,1]}_{n \text{ copies}}.$$
This way of constructing a hypercube can be turned around and
used to deconstruct a hypercube.
The full hypercube is an $n$-dimensional object, the unique $n$-dimensional “face”.
On the other end of the spectrum are the corners, which are $0$-dimensional objects we can think of as “$0$-faces”, or copies of the $0$-cube $[0,1]^0$, consisting of a single point.
There are $2^n$ corners, since at a vertex each coordinate must be $0$ or $1$; since there are $n$ such choices to make independently, the total number is

$$\underbrace{2 \times 2 \times \cdots \times 2}_{n \text{ choices}} = 2^n.$$
More generally, if we choose $m$ coordinates to be “extreme”, i.e. fixed at $0$ or $1$, we are left with a copy of the $(n-m)$-cube.
Thus, there are $2^m$ of these $(n-m)$-faces, or relabelling $k = n - m$, there are $2^{n-k}$ copies of each $k$-face.
The $k$-faces are labelled by binary strings of length $n - k$, corresponding to the choice of extreme coordinates.
(Note that I’ve defined $k$-faces with respect to a specific ordering of dimensions, and they all end up being parallel. I’ll call these ordered $k$-faces.
More generally, a $k$-face is obtained by choosing any subset of $n - k$ dimensions to make extreme; these are general $k$-faces.
These are a bit less nice, since they do not decompose the hypercube.)
A hypercube as a binary tree of $k$-faces. On the right, the decomposition of a square.
To summarise, our technique for iteratively constructing hypercubes can be used to deconstruct $[0,1]^n$ into a binary tree of hypercubes of lower dimension.
We won’t really exploit this fact below, but it gives us another way
to understand higher-dimensional structure.
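To make the labelling concrete, here is a minimal Python sketch (my own, not from the original post) that lists the ordered $k$-faces of the $n$-cube by their binary labels, and checks that there are $2^{n-k}$ of them at each level of the tree:

```python
from itertools import product

def ordered_k_faces(n, k):
    """Label the ordered k-faces of the n-cube by binary strings of length n - k.

    Each label fixes the last n - k coordinates at 0 or 1 and leaves the
    first k coordinates free, so there are 2**(n - k) ordered k-faces.
    """
    return ["".join(map(str, bits)) for bits in product([0, 1], repeat=n - k)]

# The binary tree: the n-cube splits into two (n-1)-faces, each of which
# splits into two (n-2)-faces, and so on, down to the 2**n corners.
n = 3
for k in range(n, -1, -1):
    faces = ordered_k_faces(n, k)
    print(f"{k}-faces: {len(faces)}", faces)
```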
Random corners
Let’s look at the corners in more detail.
These are labelled by $n$-bit strings, i.e. any sequence of $0$s and $1$s of length $n$, obtained by taking the usual $n$-tuple notation and throwing away commas and brackets, so that e.g. $(0, 1, 1, 0)$ becomes $0110$.
Binary strings appear all over the place, but in probability, they usually represent a sequence of Bernoulli trials, i.e. the results of flipping a coin.
This means I can pick a corner of the hypercube uniformly at random by flipping a fair coin $n$ times, and statistical properties of this random process will encode geometric properties of the corners.
Put a different way, the hypercube makes the space of coin flips geometric.
Given any pair of opposite corners $\mathbf{x}$ and $\bar{\mathbf{x}}$, most corners lie a distance $\sqrt{n/2}$ from both, and make a $45^\circ$ angle with the long diagonal from $\mathbf{x}$ to $\bar{\mathbf{x}}$.
The cleanest example is the average distance from the origin.
Recall that in $\mathbb{R}^n$, the distance of a point $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ from the origin is

$$|\mathbf{x}| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}.$$

For the corners of a cube, things are even simpler: if a corner has $k$ $1$s and $n - k$ $0$s in some order, then the distance is simply $\sqrt{k}$.
Imagine that we pick a corner of the hypercube by flipping a fair coin $n$ times, assigning $1$ to heads and $0$ to tails. If $n$ is very large, we should get roughly the same number of heads and tails, so

$$|\mathbf{x}| \approx \sqrt{\frac{n}{2}}.$$
For a high-dimensional hypercube, most corners are $\sqrt{n/2}$ away from the origin.
But there is nothing special about the origin.
Every other corner of the cube looks the same, so most corners are a distance $\sqrt{n/2}$ from each other!
So, pick any corner labelled by a binary string $x$, and consider the opposite corner $\bar{x}$, obtained by flipping all the bits in $x$.
Let $x_i$ denote the $i$-th digit of the string $x$, and define the vector corresponding to a string by

$$\mathbf{x} = (x_1, x_2, \ldots, x_n).$$

The distance between $\mathbf{x}$ and $\bar{\mathbf{x}}$ is then

$$|\mathbf{x} - \bar{\mathbf{x}}| = \sqrt{\sum_{i=1}^{n} (x_i - \bar{x}_i)^2} = \sqrt{n},$$

since $|x_i - \bar{x}_i| = 1$ for every $i$ by definition of the flipped string $\bar{x}$. Since the distance from $\mathbf{x}$ (or $\bar{\mathbf{x}}$) to most other corners is $\sqrt{n/2}$, the angle $\theta$ made between the “long” diagonal from $\mathbf{x}$ to $\bar{\mathbf{x}}$ and most other corners satisfies

$$\cos\theta \approx \frac{n/2}{\sqrt{n/2}\,\sqrt{n}} = \frac{1}{\sqrt{2}}, \qquad \text{so} \qquad \theta \approx 45^\circ.$$

(To see this, place $\mathbf{x}$ at the origin; a typical corner then has about $n/2$ coordinates equal to $1$, and we take its dot product with the diagonal $\bar{\mathbf{x}} - \mathbf{x} = (1, 1, \ldots, 1)$.)
Thus, we get the cartoon geometry for the hypercube corners above.
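To see this concentration numerically, here is a short Python simulation (my own sketch, not from the original post) that samples random corners by coin flips and estimates both the average distance from the origin and the angle to the long diagonal:

```python
import math
import random

def random_corner(n):
    """Pick a corner of the n-cube uniformly at random: n fair coin flips."""
    return [random.randint(0, 1) for _ in range(n)]

n, trials = 10_000, 1_000
distances, angles = [], []
for _ in range(trials):
    ones = sum(random_corner(n))                 # number of 1s = squared distance
    distances.append(math.sqrt(ones))
    # Angle between the corner and the long diagonal (1, 1, ..., 1).
    cos_theta = ones / (math.sqrt(ones) * math.sqrt(n))
    angles.append(math.degrees(math.acos(cos_theta)))

print(sum(distances) / trials, math.sqrt(n / 2))    # both close to 70.7
print(sum(angles) / trials)                         # close to 45 degrees
```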
Random interior points
If we wanted to, we could use similar techniques to study the geometry of $k$-faces.
I’ll leave this to you to think about.
Let’s go right to the top of the tree and study the distribution of interior points.
Average distance turns out to be messy, so instead, we’ll look at the projection of a random interior point onto the long diagonal from $\mathbf{0} = (0, 0, \ldots, 0)$ to $\mathbf{1} = (1, 1, \ldots, 1)$.
Volume is concentrated around the hyperplane
bisecting two opposite corners.
It will be simplest to just set $\mathbf{x} = \mathbf{0}$ and $\bar{\mathbf{x}} = \mathbf{1}$, with a diagonal

$$\mathbf{d} = \mathbf{1} - \mathbf{0} = (1, 1, \ldots, 1).$$

If we project a point $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ onto $\mathbf{d}$, it will be a distance

$$\frac{\mathbf{y} \cdot \mathbf{d}}{|\mathbf{d}|} = \frac{y_1 + y_2 + \cdots + y_n}{\sqrt{n}} = \sqrt{n}\,\bar{y}$$

from the origin, where $\bar{y}$ is the average of the coordinates $y_i$.
Once again, let’s select an interior point at random.
This means we select each coordinate uniformly from $[0, 1]$.
Thus, we can view a random point as drawing $n$ samples from the uniform distribution on $[0, 1]$, and as $n$ gets large, the sample average of these points should approach the average of the distribution, $1/2$.
Thus, the projection onto the long diagonal satisfies

$$\sqrt{n}\,\bar{y} \approx \frac{\sqrt{n}}{2}.$$

Since the diagonal has length $\sqrt{n}$, we learn that most points are near the halfway mark, as we drew above.
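Again, a quick Monte Carlo check (my own sketch): sample uniform points in the cube and project them onto the diagonal; the projections should cluster tightly around $\sqrt{n}/2$, i.e. the halfway mark.

```python
import math
import random

n, trials = 10_000, 1_000
projections = []
for _ in range(trials):
    y = [random.random() for _ in range(n)]      # uniform point in the n-cube
    projections.append(sum(y) / math.sqrt(n))    # projection onto the long diagonal

mean_proj = sum(projections) / trials
print(mean_proj, math.sqrt(n) / 2)               # both close to 50
print(mean_proj / math.sqrt(n))                  # close to 0.5: the halfway mark
```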
Exercise 1. (a) Show that two randomly chosen, ordered $k$-faces are separated by an average distance of roughly $\sqrt{(n-k)/2}$.
Similarly, if two general $k$-faces are parallel in $j$ dimensions, find their average distance.
(b) Show that for two randomly chosen, general $k$-faces, the probability of being parallel is $1/\binom{n}{k}$. What is the probability of being parallel in $j$ dimensions?
(c) Write an expression for the average distance between general $k$-faces.
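For part (a), a quick empirical check is easy to write. Here is a sketch (my own, taking the distance between two parallel faces to be the minimum distance between their points, which works out to the square root of the Hamming distance between their labels):

```python
import math
import random

def random_ordered_face_label(n, k):
    """An ordered k-face of the n-cube is labelled by its n - k extreme bits."""
    return [random.randint(0, 1) for _ in range(n - k)]

n, k, trials = 1_000, 200, 2_000
total = 0.0
for _ in range(trials):
    a = random_ordered_face_label(n, k)
    b = random_ordered_face_label(n, k)
    hamming = sum(ai != bi for ai, bi in zip(a, b))   # differing extreme bits
    total += math.sqrt(hamming)                       # distance between the faces

print(total / trials, math.sqrt((n - k) / 2))         # both close to 20
```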
Chernoff bound for corners
So far, we have been pretty sloppy about the spread of corners and interior points around the mean values, or equivalently, how strongly they concentrate.
Let’s remedy this.
We start with a simple application of the Chernoff bound, which states that for any random variable $X$, and any $t > 0$,

$$P(X \geq a) \leq e^{-ta}\,\mathbb{E}[e^{tX}].$$

Let $X = x_1 + x_2 + \cdots + x_n$, the sum of digits in a random binary string $x$.
I will leave it as an exercise to figure out what the bound gives us.
Exercise 2. (a) Recall that the binomial distribution is

$$P(X = k) = \binom{n}{k}\frac{1}{2^n}.$$

Show that the expectation satisfies

$$\mathbb{E}[e^{tX}] = \left(\frac{1 + e^t}{2}\right)^n.$$
(b) Define $\alpha = a/n$, so that $\alpha$ describes the proportion of coordinates (or bits) equal to $1$. Show that the right-hand side of the Chernoff bound is optimised for

$$e^t = \frac{\alpha}{1 - \alpha}.$$
(c) Plug this optimal value into the Chernoff bound to find

$$P(X \geq \alpha n) \leq \left[\frac{1}{2\,\alpha^{\alpha}(1-\alpha)^{1-\alpha}}\right]^n.$$
(d) Using part (c), show that the probability a random corner has more than a given fraction $\alpha > 1/2$ of its coordinates equal to $1$ decays exponentially with $n$. Roughly how many dimensions do we need to go to before this probability becomes negligible?
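As a numerical illustration (my own sketch, assuming the optimised bound takes the form shown in part (c) above, and picking $60\%$ purely as an illustrative threshold), we can compare the Chernoff bound with the exact binomial tail:

```python
import math

def chernoff_bound(n, alpha):
    """Optimised Chernoff bound on P(X >= alpha * n) from part (c)."""
    return (1.0 / (2.0 * alpha**alpha * (1 - alpha) ** (1 - alpha))) ** n

def exact_tail(n, alpha):
    """Exact P(X >= alpha * n) for X ~ Binomial(n, 1/2)."""
    threshold = math.ceil(alpha * n)
    return sum(math.comb(n, j) for j in range(threshold, n + 1)) / 2**n

alpha = 0.6  # illustrative choice: more than 60% of the bits equal to 1
for n in (10, 100, 1000):
    print(n, chernoff_bound(n, alpha), exact_tail(n, alpha))
```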
Stragglers in high dimensions
The Chernoff bound gives us results for small $n$, but it is fiddly and underestimates the concentration.
Rather than repeat the analysis for the interior points, we will make
life simpler by going to very high dimensions.
As $n \to \infty$, there are two powerful results we can invoke: the
law of large numbers
and the
central limit theorem.
They are simple to explain, and not much harder to prove, though we
will not do so here.
The (strong) law of large numbers states that as $n \to \infty$, the sample mean

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$

for a set of independent, identically distributed random variables $X_1, X_2, \ldots, X_n$ converges (with probability $1$) to the mean of the distribution, $\mu$:

$$\bar{X}_n \to \mu.$$
This immediately tells us that as $n \to \infty$, the corners and
interior points are concentrated around the mean values we
calculated above.
Our earlier sloppiness has been morally vindicated!
At any finite $n$ there may be stragglers, but these are a
vanishing fraction of the whole.
If we want to know more about the stragglers at large but finite $n$, the central limit theorem is the tool of choice.
This gives us the whole distribution of the sum $X_1 + X_2 + \cdots + X_n$ as $n$ gets large.
If we add $n$ independent copies of the same random variable, the mean and variance of the sum are just multiplied by $n$ (since both are additive for independent random variables):

$$\mathbb{E}\left[\sum_{i=1}^{n} X_i\right] = n\mu, \qquad \mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = n\sigma^2.$$

Higher moments are suppressed by additional powers of $1/\sqrt{n}$.
As $n$ gets big, only the first and second moments survive, which is the characteristic of a normal distribution, or bell curve.
We discover that

$$\sum_{i=1}^{n} X_i \to \mathcal{N}(n\mu, n\sigma^2),$$

where $\mathcal{N}(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$.
To apply this to our case, we need to compute the variance.
For the corners, the $x_i$ are fair Bernoulli trials with variance

$$\sigma^2 = \mathbb{E}[x_i^2] - \mathbb{E}[x_i]^2 = \frac{1}{2} - \frac{1}{4} = \frac{1}{4}.$$

Hence, as $n$ gets large, the number of $1$s in a random corner (and hence its squared distance from the origin) is normally distributed around $n/2$ with variance $n/4$, so corners cluster around the mean distance $\sqrt{n/2}$.
For the interior points, the $y_i$ are uniformly distributed on $[0, 1]$, which has variance

$$\sigma^2 = \int_0^1 y^2\,dy - \left(\frac{1}{2}\right)^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.$$

Thus, as $n$ gets large, interior points are concentrated in a Gaussian envelope around the bisecting plane, with the coordinate sum $y_1 + y_2 + \cdots + y_n$ normally distributed around $n/2$ with variance $n/12$.
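As a final sanity check (my own sketch), we can compare the empirical variances of the coordinate sums for random corners and random interior points against these predictions:

```python
import random
import statistics

n, trials = 1_000, 5_000

corner_sums = [sum(random.randint(0, 1) for _ in range(n)) for _ in range(trials)]
interior_sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]

# The coordinate sums should be approximately normal with these variances.
print(statistics.variance(corner_sums), n / 4)      # both close to 250
print(statistics.variance(interior_sums), n / 12)   # both close to 83.3
```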
Hinton’s anchovies
We’ve learnt that for hypercubes in higher dimensions, the corners and volume tend to
concentrate around the mean values.
The route we used to arrive at this conclusion was statistical, but
turns out to be an instance of a much deeper phenomenon called
concentration of measure.
In some sense, concentration is generic.
The most famous higher-dimensional shape is not the hypercube, but the $n$-dimensional hypersphere, which generalises the sphere in the same way the hypercube generalises the cube.
You can learn more about the details in the references below, but
qualitatively, we see a similar concentration phenomenon.
If we select two antipodal points on the sphere (analogous to two opposite corners), the boundary and volume cluster normally around the bisecting plane.
Concentration around the
bisecting plane.
Although concentration is the deep explanation for what’s going on
here, there is a nice intuition for why the hypercube and hypersphere look the way they do.
I like to think of this as the anchovy phenomenon after
a talk by Geoffrey Hinton.
He first considers a three-dimensional supermarket, where the anchovies are shelved with the pizza toppings rather than with the tuna and canned sardines on the other side of the store, which is where he first looks.
But you can have more neighbours in higher dimensions!
As Hinton puts it,
…if it had been a thirty-dimensional grocery store, they could have
been near the pizza and the sardines.
If we think of one corner as pizza toppings, and the opposite corner
as sardines, most points on the hypercube are like anchovies.