Fred Akalin

The Fundamental Theorem of Algebra via Connectedness

2021-01-03T00:00:00-08:00

It is intuitive that removing even a single point from a line disconnects it, but removing a finite set of points from a plane leaves it connected.

However, this basic fact leads to a non-trivial property of real and complex polynomials: not all non-constant real polynomials have real roots, but all non-constant complex polynomials have complex roots. The latter is, in fact, the fundamental theorem of algebra:

(Fundamental theorem of algebra.) Every non-constant complex polynomial has a root.

We’ll prove this theorem using nothing stronger than the complex inverse function theorem. Here’s a synopsis:

Let p \colon \mathbb{C}→ \mathbb{C} be a non-constant complex polynomial, and V_{\text{regular}} its set of regular values. Let P_{\text{pure}}= p^{-1}(V_{\text{regular}}) be its set of pure regular points, so that p can be thought of as a P_{\text{pure}}→ V_{\text{regular}} map.
Any complex polynomial, and p in particular, is a closed \mathbb{C}→ \mathbb{C} map, and thus also a closed P_{\text{pure}}→ V_{\text{regular}} map.
Furthermore, by the inverse function theorem, p is an open P_{\text{regular}}→ \mathbb{C} map, and thus also an open P_{\text{pure}}→ V_{\text{regular}} map.
p, being non-constant, has only finitely many critical points. Therefore, V_{\text{regular}} is the complex plane with a finite set of points removed, and thus is connected. (This is the step that fails for real polynomials.) Similarly, P_{\text{pure}} is also connected.
p, being a continuous, open, and closed P_{\text{pure}}→ V_{\text{regular}} map, must take connected components to connected components. Since P_{\text{pure}} and V_{\text{regular}} are both connected, that means that p maps P_{\text{pure}} onto V_{\text{regular}}.
p also maps P_{\text{critical}} onto V_{\text{critical}}, so p is surjective on \mathbb{C}, and thus must have a root.

This is a wonderfully succinct proof, but it’s full of subtleties and would benefit from elaboration (as well as some diagrams). We’ll do that in the rest of this article. First, we need some definitions.

Points and values

If a function f(x) maps from A to B, we’ll call elements of A points and elements of B values; in our case, A and B will both be subsets of either \mathbb{R} or \mathbb{C}, but it’s helpful to distinguish when we’re talking about a real or complex number as a domain element versus a codomain element.

If f(x) is differentiable, we’ll call x a critical point if f'(x) = 0 and a regular point otherwise. We’ll call y a critical value if y = f(x) for some critical point x and a regular value otherwise. In particular, if y is not in the image of f, then y is a regular value.

A regular point may map to a critical value. In that case, we call it an impure regular point and a pure regular point otherwise. (This is nonstandard terminology, but it helps with visualizing what’s going on.)
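As a concrete illustration, here is a small worked example. The cubic p(z) = z^3 - 3z is chosen purely for illustration and does not appear elsewhere in this article; it has both kinds of regular point.

```latex
\begin{aligned}
p(z) &= z^3 - 3z, \qquad p'(z) = 3z^2 - 3 = 3(z - 1)(z + 1) \\
\text{critical points: } & z = 1,\ z = -1 \\
\text{critical values: } & p(1) = -2,\ p(-1) = 2 \\
p(z) + 2 &= (z - 1)^2 (z + 2) \quad\Longrightarrow\quad z = -2 \text{ is an impure regular point} \\
p(z) - 2 &= (z + 1)^2 (z - 2) \quad\Longrightarrow\quad z = 2 \text{ is an impure regular point}
\end{aligned}
```

So for this p, the pure regular points are \mathbb{C} minus \{-2, -1, 1, 2\}, and the regular values are \mathbb{C} minus \{-2, 2\}.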

The strategy of the proof is to show that a non-constant complex polynomial f(x) is surjective. By construction, f(x) maps impure regular points and critical points onto the critical values. Then it suffices to show that f(x) maps the pure regular points P_{\text{pure}} onto the regular values V_{\text{regular}}. In doing so, we’ll show that there are only a finite number of critical points, critical values, and impure regular points; therefore, P_{\text{pure}} is the complex plane minus a finite number of points, and that is where connectedness comes into play.

Connected sets

A subset X of a topological space is disconnected if it is the union of two disjoint, non-empty open sets, and connected otherwise.

For example, the set X in the first figure is the real line \mathbb{R} with a single point a removed. Then X = (-∞, a) ∪ (a, ∞), so it is disconnected.

It is harder to show that a set is connected. However, we can use a stronger property that’s easier to show. A subset X of a topological space is path-connected if for every two points x and y in X, there exists a path from x to y—that is, a continuous function f \colon [0, 1] → X such that f(0) = x and f(1) = y. A path-connected set is automatically a connected set—being able to draw paths between any two points makes it impossible to split the set into two disjoint non-empty open subsets.

In particular, let X be the plane \mathbb{R}^2 or \mathbb{C} with a finite number of points p_i removed. Then we’ll show that X is path-connected. Let d be the minimum distance between any of the removed points, and let r = d/3. Then given x and y in X, let f be the straight-line path from x to y. For any p_i that is on f, replace the segment through p_i with a semi-circular arc of radius r around p_i. Since r < d/2, the arc will not have any other removed point on it, and no two arcs will overlap. Therefore, this modified path lies entirely in X. Since x and y were arbitrary, X is path-connected, and thus connected.
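The detour construction can be sketched numerically. The following is a minimal Python sketch, not a definitive implementation: the function name detour_path, the sample count, and the tolerance 1e-12 are all assumptions made for illustration, and it further assumes that x and y are distinct and lie farther than r from every removed point on the segment.

```python
import cmath
import math


def detour_path(x, y, removed, samples=50):
    """Sample points of a path from x to y that avoids the removed points.

    Points are complex numbers. Assumes x != y and that x and y are
    farther than r from every removed point lying on the segment.
    """
    # d is the minimum distance between removed points; r = d / 3 as in the text.
    dists = [abs(p - q) for i, p in enumerate(removed) for q in removed[i + 1:]]
    r = (min(dists) if dists else 1.0) / 3
    length = abs(y - x)
    direction = (y - x) / length
    # Dividing by the unit direction rotates the segment onto the real axis,
    # so a removed point is on the segment when the imaginary part vanishes.
    on_segment = []
    for p in removed:
        t = (p - x) / direction
        if abs(t.imag) < 1e-12 and 0 < t.real < length:
            on_segment.append(t.real)
    on_segment.sort()
    path = [x]
    for t in on_segment:
        centre = x + t * direction
        # Semicircular arc of radius r around the removed point, passing
        # on the left-hand side of the segment.
        for k in range(samples + 1):
            angle = math.pi * (1 - k / samples)
            path.append(centre + r * direction * cmath.exp(1j * angle))
    path.append(y)
    return path


# Hypothetical example: two removed points on the segment, one off it.
removed_points = [1 + 0j, 2 + 0j, 1 + 1j]
example_path = detour_path(0j, 3 + 0j, removed_points)
```

Every sampled point on an arc is at distance r from its centre and at least d - r from any other removed point, which is why the choice r = d/3 keeps the whole path inside X.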

We’re most interested in connected sets that are maximal in the sense that they’re not contained in a larger connected set. These are called connected components, and any topological space can be decomposed into its connected components. For example, the set X in the first figure has two connected components, (-∞, a) and (a, ∞), whereas the plane with a finite number of points removed remains connected, and thus has only a single connected component. However, removing a line from a plane splits it into two connected components, one on each side of the line.

A continuous function preserves connectedness: it maps connected sets to connected sets. However, it may map a connected component to a connected set that’s not a connected component. We want to show that real and complex polynomials map connected components to connected components—this leads us to the concepts of open and closed maps.

Open and closed functions

If a function f(x) between topological spaces A and B sends open sets of A to open sets of B, we call it open. Similarly, if it sends closed sets of A to closed sets of B, we call it closed. Be careful! Like with sets, whether a function is open is unrelated to whether it is closed; a function may be neither open nor closed, just open, just closed, or both.

We’re more interested in sets and functions that are both open and closed, which we’ll call clopen. A topological space A always has two clopen subsets: \emptyset and itself. However, if it’s disconnected, it may have more: in general, a clopen subset X is a union of connected components of A. Conversely, if A has finitely many connected components, each connected component is clopen.

Then since a clopen function f(x) between A and B sends clopen sets of A to clopen sets of B, it then sends connected components of A to unions of connected components of B. If f(x) is also continuous, then it must send a connected component of A to another connected set, which then must be a connected component of B.

Therefore, since real and complex polynomials are continuous, in order to show that they map connected components to connected components, we need to show that they are also clopen.

Real and complex polynomials are closed

First, we want to show that a real polynomial p(x) \colon \mathbb{R}→ \mathbb{R} or a complex polynomial p(x) \colon \mathbb{C}→ \mathbb{C} is closed.

If p(x) is constant, then this follows immediately. Otherwise, the essential property of polynomials that we use is that if x → ∞, then p(x) → ∞. In other words, if x_n is a sequence such that p(x_n) is bounded, then x_n must also be bounded.

Then let U be a closed set of points, and let y ∈ \overline{p(U)}; in other words, y is a limit point of p(U). To show that p(U) is closed, we want to show that y is in fact in p(U). Since y is a limit point of p(U), there is some sequence x_n in U such that p(x_n) converges to y. Then p(x_n) is bounded, so by the above, x_n is also bounded. Then some subsequence x_m of x_n converges to some \tilde{x}, which lies in U because U is closed. Since p is continuous, p(x_m) then converges to p(\tilde{x}), which must then equal y. Therefore, y is indeed in p(U), which shows that p(x) is a closed map.

So polynomials \mathbb{R}→ \mathbb{R} or \mathbb{C}→ \mathbb{C} are closed, but what we really want to show is that they’re also closed as maps from their pure regular points P_{\text{pure}} to their regular values V_{\text{regular}}. In general, restricting the domain or codomain of a function doesn’t preserve the property of being closed, but if f is a closed map from A to B and D ⊆ B, then f is a closed map from C = f^{-1}(D) to D.

A proof: if U is a closed subset of C, then it is U' ∩ C for U' a closed subset of A. In general we have the inclusion f(X ∩ Y) ⊆ f(X) ∩ f(Y), so f(U' ∩ C) ⊆ f(U') ∩ f(C) ⊆ f(U') ∩ D\text{.}

Conversely, if y ∈ f(U') ∩ D, then f(x) = y for some x ∈ U'. Since f(x) ∈ D, x ∈ C = f^{-1}(D), so x ∈ U' ∩ C. Therefore, y ∈ f(U' ∩ C), thus f(U') ∩ D ⊆ f(U' ∩ C), and

f(U) = f(U' ∩ C) = f(U') ∩ D\text{.}

f(U') is a closed subset of B by f being closed, and so f(U') ∩ D is a closed subset of D.

In particular, P_{\text{pure}} is the inverse image of V_{\text{regular}} by construction, so a real or complex polynomial is thus a closed map from P_{\text{pure}} to V_{\text{regular}}.

Real and complex polynomials have finitely many critical points

One subtle but important fact that we need is that non-constant real and complex polynomials have finitely many critical points. A critical point of the real or complex polynomial p(x) is a root of p'(x), which is another polynomial, so the statement that a non-constant real or complex polynomial has finitely many critical points is equivalent to the statement that a non-zero real or complex polynomial has finitely many roots.

But isn’t that equivalent to the fundamental theorem of algebra? No! For one, it’s also true for real polynomials. More generally, it’s an upper bound on the number of roots, whereas the fundamental theorem of algebra is a lower bound.

If a real or complex polynomial p(x) of positive degree n has a root r, then p(x) = (x - r) q(x) for some polynomial q(x) of degree n - 1. Then since non-zero degree-0 polynomials have no roots, by induction p(x) has at most n roots.

Therefore, a non-constant real or complex polynomial of degree n has at most n - 1 critical points.
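The factoring step and the resulting bound can be checked on a concrete polynomial. This is a minimal Python sketch under stated assumptions: the helper names divide_by_root and derivative, and the sample cubic, are hypothetical and chosen only for illustration. Coefficients are listed from highest degree to lowest.

```python
def divide_by_root(coeffs, r):
    """Divide p(x) by (x - r) using synthetic division.

    Returns (q, remainder) with p(x) = (x - r) q(x) + remainder,
    so the remainder is 0 exactly when r is a root of p.
    """
    values = []
    carry = 0
    for c in coeffs:
        carry = c + carry * r
        values.append(carry)
    remainder = values.pop()
    return values, remainder


def derivative(coeffs):
    """Coefficients of p'(x), again from highest degree to lowest."""
    n = len(coeffs) - 1
    return [c * (n - k) for k, c in enumerate(coeffs[:-1])]


# Hypothetical example: p(x) = x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3).
p = [1, -6, 11, -6]
q, rem = divide_by_root(p, 1)  # q has degree 2, one less than p
dp = derivative(p)             # p' has degree 2, so p has at most 2 critical points
```

Each successful division lowers the degree by one, which is the induction step: a degree-n polynomial can be divided this way at most n times.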

Real and complex polynomials are open on regular points

A real polynomial p(x) \colon \mathbb{R}→ \mathbb{R} is not open in general; a figure above shows that p(x) = x^2 + 1 is a counterexample. Fortunately, it’s only the critical points that are the problem: as functions from P_{\text{regular}} to \mathbb{R}, real polynomials are open.

The complex case is actually easier—the open mapping theorem implies that a complex polynomial p(x) \colon \mathbb{C}→ \mathbb{C} is open in general. However, that theorem uses a bit more complex analysis machinery than we’d like—it turns out that we can use the same proof as in the real case (which is simpler) to show that complex polynomials are open as functions from P_{\text{regular}} to \mathbb{C}.

So let’s start the proof. Let p(x) be a real (or complex) polynomial, considered as a function from P_{\text{regular}} to \mathbb{R} (or \mathbb{C}). Let U ⊆ P_{\text{regular}} be open; we want to show that p(U) is also open.

Let y ∈ p(U). Then y = p(x) for some regular point x ∈ U. Since p'(x) ≠ 0, by the real inverse function theorem (or the complex inverse function theorem) there is some open set X containing x that is diffeomorphic to p(X).

U is open in P_{\text{regular}}, which is \mathbb{C} minus a finite number of points. Therefore, U is an open set in \mathbb{C} minus a finite number of points, and is thus also open in \mathbb{C}. (This is where we use the fact that p(x) has a finite number of critical points.)

Since U is open in \mathbb{C}, so is X ∩ U, which is diffeomorphic to p(X ∩ U), which is thus an open set contained in p(U) containing y. Since y was arbitrary, p(U) is open.

Since a real or complex polynomial p(x) is open from P_{\text{regular}} to \mathbb{R} or \mathbb{C}, the same reasoning as in the closed case shows that since V_{\text{regular}}⊆ \mathbb{C} and P_{\text{pure}}= p^{-1}(V_{\text{regular}}), then a real or complex polynomial is an open map from P_{\text{pure}} to V_{\text{regular}}.

Non-constant complex polynomials are surjective (but not real ones)

Now we’re ready to put it all together. Let p(x) be a non-constant complex polynomial. By the above, it is clopen as a map from P_{\text{pure}} to V_{\text{regular}}. Therefore, since it’s also continuous, it maps each connected component of P_{\text{pure}} to a connected component of V_{\text{regular}}. But both P_{\text{pure}} and V_{\text{regular}} are \mathbb{C} minus a finite set of points, and thus they both have a single connected component. Therefore, p(x) maps P_{\text{pure}} onto V_{\text{regular}}. Since it also maps P_{\text{critical}} onto V_{\text{critical}}, it maps \mathbb{C} onto \mathbb{C}= V_{\text{critical}}∪ V_{\text{regular}}.

In particular, this implies that p(x) has a root, which is the fundamental theorem of algebra.

What about the real case? Consider the real polynomial p(x) = x^2 + 1. It has a single critical value 1 mapped to by a single critical point 0, so P_{\text{pure}} has two connected components: (-∞, 0) and (0, ∞). V_{\text{regular}} has two connected components (-∞, 1) and (1, ∞), but p(x) maps both connected components of P_{\text{pure}} to (1, ∞), and so isn’t surjective on \mathbb{R}, and in particular doesn’t have a root.
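The contrast between the real and complex cases for this polynomial can be seen numerically. This is a small Python sketch; the sampling range and step are arbitrary assumptions, and a finite sample only illustrates the claim about real values rather than proving it.

```python
def p(x):
    """The polynomial x^2 + 1, usable with real or complex arguments."""
    return x * x + 1


# On a sample of real points, the values never drop below the critical value 1.
real_samples = [k / 100 for k in range(-500, 501)]
smallest_real_value = min(p(x) for x in real_samples)

# The complex number i, written 1j in Python, is a root.
value_at_i = p(1j)
```

Over \mathbb{C}, removing the critical point 0 and the critical value 1 leaves both sides connected, so nothing stops p(x) from reaching 0.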

Curvature computations with moving frames

2018-03-22T00:00:00-07:00

Overview

Given a metric on a manifold, it is often necessary to compute its curvature. However, the usual method of first computing the Christoffel symbols and then using those to compute the Riemann curvature tensor is tedious and error-prone.

Fortunately, there’s another way to compute the curvature that’s often quicker and easier: Cartan’s method of moving frames, or the repère mobile. Unfortunately, explanations of this method aren’t very clear, so here I’m going to provide my own, based on working through a few examples.

I’m going to assume that you know enough Riemannian geometry to be able to compute curvature the usual way, and also that you’re familiar with the basics of differential forms and exterior differentiation. Some familiarity with semi-Riemannian metrics will also be helpful, since a lot of motivating examples come from general relativity, which uses Lorentzian metrics.

The coordinate frame method

First, a quick overview of the usual method using coordinate frames. Let $g = g_{ij} \, dx^i ⊗ dx^j$ be a given semi-Riemannian metric expressed in terms of the coordinates $(x^1, \dotsc, x^n)$. We first compute the Christoffel symbols using the formula \[ \CS{k}{ij} = \frac{1}{2} (g^*)^{kl} \left(∂_j g_{il} + ∂_i g_{lj} - ∂_l g_{ij}\right)\text{,} \] where $(g^*)^{ij}$ are the components of the dual metric $g^*$, which can be computed by taking components of the inverse of the matrix $G[i, j] = g_{ij}$ formed from the metric components, i.e. $(g^*)^{ij} = G^{-1}[i, j]$. Recall that the Christoffel symbols are symmetric in the lower indices, so if our manifold is $n$-dimensional, then in general we have $n^2(n+1)/2$ independent Christoffel symbols.

Note that we use the Einstein summation convention; in the absence of a summation sign, index variables that appear once as a superscript and once as a subscript are implicitly summed over.

A useful special case is when the metric $g$ is diagonal,^[1] i.e. $g = g_{ii} \, dx^i ⊗ dx^i$. Then $(g^*)^{ii} = 1/g_{ii}$ and \[ \begin{alignedat}{2} \CS{k}{ij} &= 0 \qquad & \CS{k}{ik} &= \frac{∂_i g_{kk}}{2 g_{kk}} \\ \CS{k}{ii} &= -\frac{∂_k g_{ii}}{2 g_{kk}} \qquad & \CS{i}{ii} &= \frac{∂_i g_{ii}}{2 g_{ii}}\text{,} \end{alignedat} \] where $i$, $j$, and $k$ are distinct. Therefore in this case we have $n^2$ non-zero independent Christoffel symbols.
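As a sanity check on the diagonal formulas, here is a minimal Python sketch that evaluates the general Christoffel formula for a diagonal metric, with partial derivatives approximated by central differences. The function name christoffel_diagonal, the step size h, and the polar-coordinate test metric are assumptions chosen for illustration. Indices in the code are 0-based, unlike the 1-based indices in the text.

```python
def christoffel_diagonal(metric, x, h=1e-6):
    """Christoffel symbols gamma[k][i][j] of a diagonal metric at the point x.

    metric(x) returns the list of diagonal components g_ii at x.
    Partial derivatives are approximated by central differences with step h.
    """
    n = len(x)

    def d(l, i):
        """Approximate the partial derivative of g_ii along coordinate l."""
        forward = list(x)
        backward = list(x)
        forward[l] += h
        backward[l] -= h
        return (metric(forward)[i] - metric(backward)[i]) / (2 * h)

    g = metric(x)
    gamma = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                total = 0.0
                if i == k:
                    total += d(j, k)  # d_j g_ik, non-zero only when i == k
                if j == k:
                    total += d(i, k)  # d_i g_kj, non-zero only when j == k
                if i == j:
                    total -= d(k, i)  # d_k g_ij, non-zero only when i == j
                gamma[k][i][j] = total / (2 * g[k])
    return gamma


# Assumed test case: the flat plane in polar coordinates (r, phi), g = diag(1, r^2).
def polar(x):
    return [1.0, x[0] ** 2]


gamma = christoffel_diagonal(polar, [2.0, 0.3])
```

For the polar metric, the diagonal formulas predict that the symbol with upper index r and both lower indices phi is -r, and that the symbol with upper index phi and lower indices r and phi is 1/r. At r = 2 the computed gamma[0][1][1] and gamma[1][0][1] should agree with these up to the finite-difference error.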

The Christoffel symbols are important in their own right, but we need them only to compute curvature. We can compute the components of the Riemann curvature tensor using the formula \[ \Riem{k}{lij} = ∂_i \CS{k}{jl} - ∂_j \CS{k}{il} + \CS{k}{im} \CS{m}{jl} - \CS{k}{jm} \CS{m}{il}\text{.} \] We can then compute the Ricci curvature tensor and the scalar curvature: \[ \Ric{ij} = \Riem{k}{ikj} \qquad S = (g^*)^{ij} \Ric{ij}\text{.} \]

For applications, we’re most interested in the Ricci curvature tensor, so we usually just want to calculate that directly: \[ \Ric{ij} = ∂_k \CS{k}{ji} - ∂_j \CS{k}{ki} + \CS{k}{km} \CS{m}{ji} - \CS{k}{jm} \CS{m}{ki}\text{.} \]

Cheatsheet: coordinate frame method

Given the components $g_{ij}$ of a semi-Riemannian metric:

Compute the Christoffel symbols. If the metric $g$ is diagonal, use \[ \begin{alignedat}{2} \CS{k}{ij} &= 0 \qquad & \CS{k}{ik} &= \frac{∂_i g_{kk}}{2 g_{kk}} \\ \CS{k}{ii} &= -\frac{∂_k g_{ii}}{2 g_{kk}} \qquad & \CS{i}{ii} &= \frac{∂_i g_{ii}}{2 g_{ii}}\text{.} \end{alignedat} \] Otherwise, compute the dual metric components $(g^*)^{ij} = G^{-1}[i, j]$ where $G[i, j] = g_{ij}$ and use \[ \CS{k}{ij} = \frac{1}{2} (g^*)^{kl} \left(∂_j g_{il} + ∂_i g_{lj} - ∂_l g_{ij}\right)\text{.} \]
Compute the Ricci curvature tensor: \[ \Ric{ij} = ∂_k \CS{k}{ji} - ∂_j \CS{k}{ki} + \CS{k}{km} \CS{m}{ji} - \CS{k}{jm} \CS{m}{ki}\text{.} \]

The Lagrangian method

An alternate method for computing the Christoffel symbols is to write down the Lagrangian corresponding to the metric: \[ L(x^1, \dotsc, x^n, v^1, \dotsc, v^n) = g_{ij}(x^1, \dotsc, x^n) \, v^i v^j \] and then to compute the Euler-Lagrange equations for a path $γ(t) = \big(x^1(t), \dotsc, x^n(t)\big)$: \[ \frac{d}{dt} \left( \frac{∂ L}{∂ v^k}(γ(t), \dot{γ}(t)) \right) - \frac{∂ L}{∂ x^k}(γ(t), \dot{γ}(t)) = 0 \] to get the geodesic equations. Then we can compare these equations to the geodesic equations expressed in terms of the Christoffel symbols \[ \ddot{γ}^k + \CS{k}{ij} \dot{γ}^i \dot{γ}^j = 0\text{,} \] and then we can read off the Christoffel symbols from the coefficients of the $\dot{γ}^i \dot{γ}^j$ terms.

I’m not convinced that this method saves that much work, especially when the metric is diagonal, but it’s at least a clearer way to organize the computations for the Christoffel symbols.
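To illustrate the Lagrangian method, here is a short worked example for the flat plane in polar coordinates $(r, \varphi)$ with $g = dr ⊗ dr + r^2 \, d\varphi ⊗ d\varphi$. The metric is an assumed example, and the block is written in plain LaTeX with $\Gamma$ standing for the Christoffel symbols.

```latex
\begin{aligned}
L &= \dot{r}^2 + r^2 \dot{\varphi}^2 \\
k = r:\quad & \frac{d}{dt}\left(2 \dot{r}\right) - 2 r \dot{\varphi}^2 = 0
  \;\Longrightarrow\; \ddot{r} - r \dot{\varphi}^2 = 0
  \;\Longrightarrow\; \Gamma^{r}_{\varphi\varphi} = -r \\
k = \varphi:\quad & \frac{d}{dt}\left(2 r^2 \dot{\varphi}\right) = 0
  \;\Longrightarrow\; \ddot{\varphi} + \frac{2}{r} \dot{r} \dot{\varphi} = 0
  \;\Longrightarrow\; \Gamma^{\varphi}_{r\varphi} = \Gamma^{\varphi}_{\varphi r} = \frac{1}{r}
\end{aligned}
```

The factor of 2 in the second equation is split between the two symmetric symbols, since both the $\dot{r} \dot{\varphi}$ and $\dot{\varphi} \dot{r}$ terms appear in the implicit sum. All other Christoffel symbols vanish.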

Cheatsheet: Lagrangian method

Given the components $g_{ij}$ of a semi-Riemannian metric:

With the Lagrangian \[ L = g_{ij} \, v^i v^j\text{,} \] compute the Euler-Lagrange equations \[ \frac{d}{dt} \left( \frac{∂ L}{∂ v^k}(γ(t), \dot{γ}(t)) \right) - \frac{∂ L}{∂ x^k}(γ(t), \dot{γ}(t)) = 0\text{.} \]
Compare the Euler-Lagrange equations to the geodesic equation \[ \ddot{γ}^k + \CS{k}{ij} \dot{γ}^i \dot{γ}^j = 0 \] and read off the Christoffel symbols $\CS{k}{ij}$.
Compute the Ricci curvature tensor: \[ \Ric{ij} = ∂_k \CS{k}{ji} - ∂_j \CS{k}{ki} + \CS{k}{km} \CS{m}{ji} - \CS{k}{jm} \CS{m}{ki}\text{.} \]

The moving frame method

Now, finally, I can explain the method of moving frames. Don’t worry too much about understanding this the first time through; I suggest skimming this section and then following along with the examples below, referring back as necessary.

For now, let’s assume that we have not a semi-Riemannian, but a Riemannian metric $g = g_{ij} \, dx^i ⊗ dx^j$ expressed in terms of the coordinates $(x^1, \dotsc, x^n)$. We want to find basis one-forms $(θ^1, \dotsc, θ^n)$ such that \[ g = ∑_i θ^i ⊗ θ^i\text{.} \] If the metric is diagonal, this is easy (suspending the summation convention): \[ θ^i = \sqrt{g_{ii}} \, dx^i\text{.} \] If instead the metric is not diagonal, we may still be able to factor it into a “sum of squares” form by inspection. Otherwise, an equivalent definition of the $θ^i$ is that \[ g^*(θ^i, θ^j) = δ^i_j\text{,} \] i.e. the basis one-forms $θ^i$ comprise an orthonormal dual frame. We can then use a Gram-Schmidt-like process on the $dx^i$ or some ad hoc method to compute the basis one-forms.

It is also convenient to express the coordinate forms in terms of the basis one-forms, which is again simple if the metric is diagonal: \[ dx^i = \frac{1}{\sqrt{g_{ii}}} \, θ^i\text{.} \] Otherwise, one would need to invert the matrix expressing the $θ^i$ in terms of the $dx^i$.

The next step is to compute the connection one-forms $\cnf{i}{j}$. To do so, we compute the exterior derivatives of the basis one-forms $dθ^i$ and express them in terms of the basis two-forms, i.e. \[ dθ^i = a^i_{jk} \, θ^j ∧ θ^k \] for functions $a^i_{jk}$. Then we can use Cartan’s first structure equation

\[ dθ^i = -\cnf{i}{j} ∧ θ^j \]

and the fact that the connection forms are skew symmetric

\[ \cnf{i}{j} = -\cnf{j}{i} \]

to deduce the $\cnf{i}{j}$.

There’s an explicit general formula for $\cnf{i}{j}$ in terms of the basis one-forms,^[2] but it’s often easier to compare the expressions for $dθ^i$ to the form of the first structure equation, guess what the connection forms are, taking advantage of their skew symmetry, and check that the first structure equation holds. In fact, if the metric is diagonal, the expressions for $dθ^i$ are nice enough that you can immediately read off the connection forms. This “guess and check” method works because the connection forms are guaranteed to exist, and furthermore are guaranteed to be unique, so any guessed list of $\cnf{i}{j}$ that satisfies the first structure equation must be the connection forms.

Note that skew symmetry immediately implies that (suspending the Einstein summation convention) \[ \cnf{i}{i} = 0\text{.} \] Therefore, we have $n(n-1)/2$ independent connection forms.
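As a small illustration of the guess-and-check step, consider the flat plane in polar coordinates $(r, \varphi)$, with $g = dr ⊗ dr + r^2 \, d\varphi ⊗ d\varphi$. The metric is an assumed example, and the block is written in plain LaTeX with $\omega^i{}_j$ standing for the connection forms.

```latex
\begin{aligned}
\theta^1 &= dr, \qquad \theta^2 = r \, d\varphi \\
d\theta^1 &= 0, \qquad d\theta^2 = dr \wedge d\varphi = \frac{1}{r} \, \theta^1 \wedge \theta^2 \\
\text{guess: } \omega^1{}_2 &= -\frac{1}{r} \, \theta^2 = -d\varphi \\
-\omega^1{}_2 \wedge \theta^2 &= \frac{1}{r} \, \theta^2 \wedge \theta^2 = 0 = d\theta^1 \\
-\omega^2{}_1 \wedge \theta^1 &= \omega^1{}_2 \wedge \theta^1
  = -\frac{1}{r} \, \theta^2 \wedge \theta^1
  = \frac{1}{r} \, \theta^1 \wedge \theta^2 = d\theta^2
\end{aligned}
```

Both first structure equations hold, so by uniqueness the guess is the connection form. Since $d(d\varphi) = 0$, its exterior derivative vanishes, as expected for a flat metric.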

There is a formula for the connection forms when $g$ is diagonal, which is more useful for deducing properties of diagonal metrics than it is for doing calculations. Suspending the summation convention, \[ \begin{aligned} \cnf{i}{j} &= \frac{∂_j g_{ii}}{2 g_{ii} \sqrt{g_{jj}}} \, θ^i - \frac{∂_i g_{jj}}{2 g_{jj} \sqrt{g_{ii}}} \, θ^j \\ &= \frac{∂_j g_{ii}}{2 \sqrt{g_{ii} g_{jj}}} \, dx^i - \frac{∂_i g_{jj}}{2 \sqrt{g_{ii} g_{jj}}} \, dx^j\text{.} \end{aligned} \] This formula implies that a diagonal metric has connection forms with at most two components each, as opposed to $n$ components in general. Furthermore, if a diagonal metric depends only on a single coordinate $x^r$, the only possible non-zero connection forms up to skew symmetry are $\cnf{i}{r}$, which are proportional to $θ^i$. If instead a diagonal metric depends on two coordinates $x^r$ and $x^s$, then the only possible non-zero connection forms up to skew symmetry are $\cnf{i}{r}$, $\cnf{i}{s}$, or $\cnf{r}{s}$. The first two cases are proportional to $θ^i$, and the last case has at most two components: one proportional to $θ^r$ and another proportional to $θ^s$.

The connection forms play an important role similar to the Christoffel symbols, but we need them only to compute curvature. First, observe that we can express each connection form in two ways: in terms of the $dx^i$, and in terms of the $θ^i$. We need to compute the derivatives $d\cnf{i}{j}$, which is easiest to do if $\cnf{i}{j}$ is expressed in terms of the $dx^i$, since $d(dx^i) = 0$. Then we can compute the curvature forms $\crf{i}{j}$ using Cartan’s second structure equation

\[ \crf{i}{j} = d\cnf{i}{j} + \cnf{i}{k} ∧ \cnf{k}{j}\text{.} \]

Like the connection forms, the curvature forms are skew symmetric:

\[ \crf{i}{j} = -\crf{j}{i}\text{,} \]

so we need only calculate $n(n-1)/2$ independent curvature forms, i.e. the ones where $i ≠ j$. Also note that in the $\cnf{i}{k} ∧ \cnf{k}{j}$ term, one need only take the sum over the $n - 2$ terms $k ∉ \{ i, j \}$, by (suspending the summation convention) $\cnf{i}{i} = \cnf{j}{j} = 0$.

From the properties discussed above, if a diagonal metric depends only on a single coordinate, then each curvature form $\crf{i}{j}$ is proportional to $θ^i ∧ θ^j$. If instead a diagonal metric depends on two coordinates $x^r$ and $x^s$, then each curvature form $\crf{i}{r}$ or $\crf{i}{s}$, up to skew symmetry, has at most two components: one proportional to $θ^i ∧ θ^r$ and another proportional to $θ^i ∧ θ^s$, and all other curvature forms $\crf{i}{j}$ are proportional to $θ^i ∧ θ^j$.

At this point we’re done, since the Riemann curvature tensor with respect to the orthonormal frame $(E_1, \dotsc, E_n)$ dual to $(θ^1, \dotsc, θ^n)$ is \[ \Riem{l}{kij} = \crf{l}{k}(E_i, E_j) \] and the Ricci curvature tensor is \[ \Ric{ij} = \crf{k}{i}(E_k, E_j)\text{.} \] Note that it’s not necessary to explicitly calculate $E_i$; it’s enough to use the definition \[ θ^i(E_j) = δ^i_j\text{,} \] and the definition of the wedge product to derive the relations \[ (θ^i ∧ θ^j)(E_k, E_l) = \begin{cases} +1 & k = i ≠ j = l \\ -1 & l = i ≠ j = k \\ 0 & \text{otherwise,} \end{cases} \] which can then be used to compute the curvature tensor components.
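The wedge-product relations above amount to a difference of products of Kronecker deltas, which is easy to encode. This is a minimal Python sketch; the function names delta and wedge_on_frame are hypothetical, and it assumes the convention in which the wedge of two one-forms evaluated on a pair of vectors is the first form on the first vector times the second form on the second, minus the same product with the vectors swapped.

```python
def delta(a, b):
    """Kronecker delta."""
    return 1 if a == b else 0


def wedge_on_frame(i, j, k, l):
    """Value of the wedge of theta^i and theta^j on the pair (E_k, E_l).

    Uses theta^i(E_k) = delta(i, k) for a frame dual to the one-forms.
    """
    return delta(i, k) * delta(j, l) - delta(i, l) * delta(j, k)
```

For instance, a curvature form whose only term is a function times $θ^1 ∧ θ^2$ contributes that function to the component evaluated on $(E_1, E_2)$ and its negative on $(E_2, E_1)$.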

From the properties discussed above, if a diagonal metric depends only on a single coordinate, then $\crf{i}{j}$ is proportional to $θ^i ∧ θ^j$, which implies that $\Ric{}$ is also diagonal. Furthermore, if the metric is diagonal and depends on two coordinates $x^k$ and $x^l$, then the only possible off-diagonal component is $\Ric{kl}$.^[3]

Cheatsheet: The moving frame method for Riemannian metrics

Given the components $g_{ij}$ of a Riemannian metric:

Find an orthonormal dual frame, i.e. basis one-forms $(θ^1, \dotsc, θ^n)$ such that \[ g = ∑_i θ^i ⊗ θ^i\text{.} \] If the metric is diagonal, then (suspending the summation convention) \[ θ^i = \sqrt{g_{ii}} \, dx^i\text{.} \]
Use the first structure equation \[ dθ^i = -\cnf{i}{j} ∧ θ^j \] and the skew symmetry relations \[ \cnf{i}{j} = -\cnf{j}{i} \] to deduce the connection forms $\cnf{i}{j}$.
Compute the curvature forms using the second structure equation \[ \crf{i}{j} = d\cnf{i}{j} + \cnf{i}{k} ∧ \cnf{k}{j} \] and the skew symmetry relations \[ \crf{i}{j} = -\crf{j}{i}\text{.} \] Note that it’s easiest to compute $d\cnf{i}{j}$ when $\cnf{i}{j}$ is expressed in terms of the $dx^i$, since $d(dx^i) = 0$.
Compute the components of the Ricci curvature tensor via \[ \Ric{ij} = \crf{k}{i}(E_k, E_j) \] and the relations \[ (θ^i ∧ θ^j)(E_k, E_l) = \begin{cases} +1 & k = i ≠ j = l \\ -1 & l = i ≠ j = k \\ 0 & \text{otherwise.} \end{cases} \]

Comparing the methods

As we saw above, one advantage of the moving frame method is that, in the worst case, one need only compute $n(n-1)/2$ independent connection forms, each with at most $n$ components, rather than $n^2(n+1)/2$ independent Christoffel symbols—a saving of $n^2$ “component calculations”. Even in the simplest case, when the metric is diagonal, you still need to compute $n^2$ possibly non-zero independent Christoffel symbols, as opposed to $n(n - 1)/2$ independent connection forms, each with at most two components—still a saving of $n$ “component calculations”.
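The worst-case counts can be tabulated for small dimensions. This is a minimal Python sketch covering only the general (non-diagonal) case; the function names are hypothetical.

```python
def christoffel_count(n):
    """Independent Christoffel symbols: n upper indices times n(n+1)/2 symmetric lower pairs."""
    return n * n * (n + 1) // 2


def connection_component_count(n):
    """n(n-1)/2 independent connection forms, each with at most n components."""
    return (n * (n - 1) // 2) * n


# Print dimension, Christoffel count, connection-form component count, and the saving.
for n in (2, 3, 4):
    saving = christoffel_count(n) - connection_component_count(n)
    print(n, christoffel_count(n), connection_component_count(n), saving)
```

In dimension 4, the case relevant to general relativity, this gives 40 Christoffel symbols against at most 24 connection-form components.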

Also, when computing a curvature form, one need only compute a single exterior derivative of a connection form and $n - 2$ wedge products of connection forms. This turns out to be less tedious than the corresponding calculation using coordinate methods of $\Riem{k}{lij}$ for fixed $k$ and $l$ such that $k ≠ l$.

Furthermore, the orthonormality of the dual frame tends to cause symmetries to appear earlier in the calculation, leading to less wasted work. This is advantageous when you know the answer you’re looking for, and it’s particularly simple, e.g. if you expect the Ricci curvature to be zero, because calculations becoming unduly complicated becomes a sign of an undetected mistake. With coordinate methods, even if calculations become complicated, you can’t rule out terms cancelling if you continue, so errors become apparent only later.

On the other hand, the moving frame method requires a certain amount of cleverness, first in coming up with the one-forms $θ^i$ if the metric isn’t diagonal, and second in deducing the connection forms $\cnf{i}{j}$. The coordinate methods require less thought, and are more “plug and chug”. In fact, once we examine the semi-Riemannian case later, we’ll see that the coordinate methods remain unchanged, yet the moving frame method becomes more complicated.

Example 1: Orthogonal coordinates on 2D surfaces

Let $g$ be a Riemannian metric on a 2D manifold. The method of moving frames makes calculating curvature particularly easy, since there is exactly one connection form and one curvature form. For example, consider the special case when the metric is diagonal, i.e. with line element \[ ds^2 = E \, du^2 + G \, dv^2\text{.} \]

Orthonormal dual frame

We can then read off an orthonormal dual frame: \[ ds^2 = {\underbrace{(\sqrt{E} \, du)}_{θ^1}}^2 + {\underbrace{(\sqrt{G} \, dv)}_{θ^2}}^2\text{,} \] i.e. \[ θ^1 = \sqrt{E} \, du \qquad θ^2 = \sqrt{G} \, dv\text{,} \] and express the coordinate forms in terms of it: \[ du = \frac{1}{\sqrt{E}} \, θ^1 \qquad dv = \frac{1}{\sqrt{G}} \, θ^2\text{.} \]
Connection forms

The derivatives of the basis one-forms are \[ \begin{aligned} dθ^1 &= \frac{∂_v E}{2 \sqrt{E}} \, dv ∧ du = \frac{∂_v E}{2 E \sqrt{G}} \, θ^2 ∧ θ^1 \\ dθ^2 &= \frac{∂_u G}{2 \sqrt{G}} \, du ∧ dv = \frac{∂_u G}{2 G \sqrt{E}} \, θ^1 ∧ θ^2 \end{aligned} \] and the first structure equations are \[ \begin{aligned} dθ^1 &= -\cnf{1}{2} ∧ θ^2 \\ dθ^2 &= -\cnf{2}{1} ∧ θ^1 = \cnf{1}{2} ∧ θ^1\text{.} \end{aligned} \] Rewriting the derivative equations to match the first structure equations, \[ \begin{aligned} dθ^1 &= -\overbrace{\left(\frac{∂_v E}{2 E \sqrt{G}} \, θ^1\right)}^{\text{one term of $\cnf{1}{2}$}} ∧ θ^2 \\ dθ^2 &= \underbrace{\left(-\frac{∂_u G}{2 G \sqrt{E}} \, θ^2\right)}_{\text{another term of $\cnf{1}{2}$}} ∧ θ^1\text{,} \end{aligned} \] we can guess that \[ \cnf{1}{2} = \frac{∂_v E}{2 E \sqrt{G}} \, θ^1 - \frac{∂_u G}{2 G \sqrt{E}} \, θ^2\text{.} \] This guess works, since \[ \begin{aligned} -\cnf{1}{2} ∧ θ^2 &= -\left( \frac{∂_v E}{2 E \sqrt{G}} \, θ^1 - \frac{∂_u G}{2 G \sqrt{E}} \, θ^2 \right) ∧ θ^2 \\ &= -\frac{∂_v E}{2 E \sqrt{G}} \, θ^1 ∧ θ^2 + \underbrace{\cancel{\frac{∂_u G}{2 G \sqrt{E}} \, θ^2 ∧ θ^2}}_{θ^2 ∧ θ^2 = 0} \\ &= dθ^1 \end{aligned} \] and \[ \begin{aligned} \cnf{1}{2} ∧ θ^1 &= \left( \frac{∂_v E}{2 E \sqrt{G}} \, θ^1 - \frac{∂_u G}{2 G \sqrt{E}} \, θ^2 \right) ∧ θ^1 \\ &= \underbrace{\cancel{\frac{∂_v E}{2 E \sqrt{G}} \, θ^1 ∧ θ^1}}_{θ^1 ∧ θ^1 = 0} - \frac{∂_u G}{2 G \sqrt{E}} \, θ^2 ∧ θ^1 \\ &= dθ^2\text{,} \end{aligned} \] using the fact that $θ^1 ∧ θ^1 = θ^2 ∧ θ^2 = 0$. Therefore, by uniqueness of connection forms, this is the connection form. Then, expressing $\cnf{1}{2}$ in terms of both the basis one-forms and the coordinate forms, \[ \cnf{1}{2} = \frac{∂_v E}{2 E \sqrt{G}} \, θ^1 - \frac{∂_u G}{2 G \sqrt{E}} \, θ^2 = \frac{∂_v E}{2 \sqrt{EG}} \, du - \frac{∂_u G}{2 \sqrt{EG}} \, dv\text{.} \] (By a very similar method, one can derive the formula stated previously for the $\cnf{i}{j}$ of a diagonal metric.)
Curvature forms

Since we only have the single connection form $\cnf{1}{2}$, there are no non-zero $\cnf{i}{k} ∧ \cnf{k}{j}$ terms, since $i$, $j$, and $k$ would all have to be distinct. Using the expression for $\cnf{1}{2}$ in terms of the coordinate forms $du$ and $dv$, and that $d(du) = d(dv) = 0$, the single curvature form is: \[ \begin{aligned} \crf{1}{2} = d\cnf{1}{2} &= \pd{}{v} \left( \frac{∂_v E}{2 \sqrt{EG}} \right) dv ∧ du - \pd{}{u} \left( \frac{∂_u G}{2 \sqrt{EG}} \right) du ∧ dv \\ &\begin{alignedat}{2} &= \, & -\frac{1}{2} \left( \pd{}{u} \left( \frac{∂_u G}{\sqrt{EG}} \right) + \pd{}{v} \left( \frac{∂_v E}{\sqrt{EG}} \right) \right) & \, du ∧ dv \\ &= \, & -\frac{1}{2 \sqrt{EG}} \left( \pd{}{u} \left( \frac{∂_u G}{\sqrt{EG}} \right) + \pd{}{v} \left( \frac{∂_v E}{\sqrt{EG}} \right) \right) & \, θ^1 ∧ θ^2\text{.} \end{alignedat} \end{aligned} \]
Gaussian curvature

Therefore, we get the classical result that the Gaussian curvature $K$, which is equal to the single independent component of the Riemann curvature tensor (up to sign), is \[ \begin{aligned} K &= \Riem{1}{212} = \crf{1}{2}(E_1, E_2) \\ &= -\frac{1}{2 \sqrt{EG}} \left( \pd{}{u} \left( \frac{∂_u G}{\sqrt{EG}} \right) + \pd{}{v} \left( \frac{∂_v E}{\sqrt{EG}} \right) \right) \, (θ^1 ∧ θ^2)(E_1, E_2) \\ &= -\frac{1}{2 \sqrt{EG}} \left( \pd{}{u} \left( \frac{∂_u G}{\sqrt{EG}} \right) + \pd{}{v} \left( \frac{∂_v E}{\sqrt{EG}} \right) \right)\text{.} \end{aligned} \]
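As a quick check of these formulas, consider the unit sphere with polar angle $u$ and azimuthal angle $v$. This is an assumed example; the block is written in plain LaTeX, with $\omega^1{}_2$ and $\Omega^1{}_2$ standing for the connection and curvature forms.

```latex
\begin{aligned}
ds^2 &= du^2 + \sin^2 u \, dv^2, \qquad E = 1, \quad G = \sin^2 u, \quad \sqrt{EG} = \sin u \quad (0 < u < \pi) \\
\omega^1{}_2 &= -\frac{\partial_u G}{2 \sqrt{EG}} \, dv = -\frac{2 \sin u \cos u}{2 \sin u} \, dv = -\cos u \, dv \\
\Omega^1{}_2 &= d\omega^1{}_2 = \sin u \, du \wedge dv = \theta^1 \wedge \theta^2 \\
K &= \Omega^1{}_2(E_1, E_2) = 1
\end{aligned}
```

This recovers the constant Gaussian curvature $K = 1$ of the unit sphere.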

The semi-Riemannian case

As we alluded to above, in the semi-Riemannian case, the coordinate methods remain unchanged, but the moving frame method gets more complicated. The equation that the one-forms must satisfy becomes \[ g = ∑_i ε_i \, θ^i ⊗ θ^i\text{,} \] where each $ε_i$ is $±1$ throughout the whole chart domain.^[4] For example, in the Riemannian case, we let all $ε_i = 1$, and in the Lorentzian case we let $ε_0 = -1$ and all other $ε_i = +1$. (The entire list $(ε_i)$ is called the signature of the metric.)

If the metric is diagonal, then each $g_{ii}$ must be non-zero throughout the whole chart domain, so $ε_i = \sgn(g_{ii})$ and (suspending the summation convention) \[ θ^i = ε_i \sqrt{\lvert g_{ii} \rvert} \, dx^i\text{.} \]

The equivalent definition of the $θ^i$ becomes \[ g^*(θ^i, θ^j) = ε_i δ^i_j\text{,} \] where each $ε_i$ is $±1$ throughout the whole chart domain. Furthermore, the Gram-Schmidt process becomes harder to apply; you’ll need to find a non-degenerate basis first; see this Math StackExchange question for details.

Both Cartan structure equations still hold, but the connection and curvature forms are not skew symmetric anymore; instead, they’re semi-skew symmetric. Suspending the summation convention,

\[ \begin{aligned} \cnf{i}{j} &= -ε_i ε_j \cnf{j}{i} \\ \crf{i}{j} &= -ε_i ε_j \crf{j}{i}\text{.} \end{aligned} \]

Fortunately, this still implies that (suspending the Einstein summation convention) \[ \cnf{i}{i} = \crf{i}{i} = 0\text{.} \]

The formula for the connection forms of a diagonal metric becomes (suspending the summation convention) \[ \begin{aligned} \cnf{i}{j} &= \frac{∂_j g_{ii}}{2 g_{ii} \sqrt{g_{jj}}} \, θ^i - ε_i ε_j \frac{∂_i g_{jj}}{2 g_{jj} \sqrt{g_{ii}}} \, θ^j \\ &= \frac{∂_j g_{ii}}{2 \sqrt{g_{ii} g_{jj}}} \, dx^i - ε_i ε_j \frac{∂_i g_{jj}}{2 \sqrt{g_{ii} g_{jj}}} \, dx^j\text{.} \end{aligned} \] However, none of the deduced properties of diagonal metrics depending on one or two coordinates change.

Finally, note that the relations \[ (θ^i ∧ θ^j)(E_k, E_l) = \begin{cases} +1 & k = i ≠ j = l \\ -1 & l = i ≠ j = k \\ 0 & \text{otherwise.} \end{cases} \] still hold.
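These relations are just the two-by-two determinant $(θ^i ∧ θ^j)(E_k, E_l) = δ^i_k δ^j_l - δ^i_l δ^j_k$, which is easy to get backwards when tracking signs by hand. Here’s a small Python sketch of it; the function name `wedge_on_frame` is hypothetical.

```python
def wedge_on_frame(i, j, k, l):
    """Value of (theta^i ^ theta^j)(E_k, E_l) for a frame (E_k) dual to (theta^i)."""
    def delta(a, b):
        return 1 if a == b else 0
    return delta(i, k) * delta(j, l) - delta(i, l) * delta(j, k)

print(wedge_on_frame(1, 2, 1, 2))  # k = i, l = j
print(wedge_on_frame(1, 2, 2, 1))  # l = i, k = j
print(wedge_on_frame(1, 2, 1, 3))  # neither case applies
```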

As you can tell, the moving frame method forces you to keep careful track of signs, which you may count as a disadvantage.

Cheatsheet: The moving frame method for semi-Riemannian metrics

Given the components $g_{ij}$ of a semi-Riemannian metric:

Find an orthonormal dual frame, i.e. basis one-forms $(θ^1, \dotsc, θ^n)$ such that \[ g = ∑_i ε_i \, θ^i ⊗ θ^i\text{,} \] where each $ε_i$ is $±1$ throughout the whole chart domain. If the metric is diagonal, then (suspending the summation convention) $ε_i = \sgn(g_{ii})$, and \[ θ^i = ε_i \sqrt{\lvert g_{ii} \rvert} \, dx^i\text{.} \]
Use the first structure equation \[ dθ^i = -\cnf{i}{j} ∧ θ^j \] and the semi-skew symmetry relations (suspending the summation convention) \[ \cnf{i}{j} = -ε_i ε_j \cnf{j}{i} \] to deduce the connection forms $\cnf{i}{j}$.
Compute the curvature forms using the second structure equation \[ \crf{i}{j} = d\cnf{i}{j} + \cnf{i}{k} ∧ \cnf{k}{j} \] and the semi-skew symmetry relations (suspending the summation convention) \[ \crf{i}{j} = -ε_i ε_j \crf{j}{i}\text{.} \] Note that it’s easiest to compute $d\cnf{i}{j}$ when $\cnf{i}{j}$ is expressed in terms of the $dx^i$, since $d(dx^i) = 0$.
Compute the components of the Ricci curvature tensor via \[ \Ric{ij} = \crf{k}{i}(E_k, E_j) \] and the relations \[ (θ^i ∧ θ^j)(E_k, E_l) = \begin{cases} +1 & k = i ≠ j = l \\ -1 & l = i ≠ j = k \\ 0 & \text{otherwise.} \end{cases} \]

Example 2: The Schwarzschild metric

Now we’re ready to tackle a more complicated metric. For our first semi-Riemannian example, let $g$ be the Schwarzschild metric, with line element \[ ds^2 = -f(r) \, dt^2 + f(r)^{-1} \, dr^2 + r^2 \, dΩ^2\text{,} \] where \[ f(r) = 1 - \frac{r_S}{r}\text{,} \] $r_S$ is the Schwarzschild radius, which is constant, and \[ dΩ^2 = dθ^2 + \sin^2 θ \, dφ^2 \] is the line element of the round metric $\mathring{g}$ on the two-sphere. We want to show that this metric is Ricci-flat, i.e. has vanishing Ricci curvature.

We could skip some steps by taking advantage of the metric being diagonal and depending only on the two coordinates $r$ and $θ$, but in the interest of showing the general method, we’ll do everything the “hard way”, and then double-check our results using the properties of diagonal metrics we deduced earlier.

Orthonormal dual frame

Since the metric is diagonal, we can read off an orthonormal dual frame with its corresponding signature: \[ ds^2 = \; \underbrace{-}_{ε_0} \; {\underbrace{\left(f(r)^{1/2} \, dt\right)}_{ϑ^0}}^2 \; \underbrace{+}_{ε_1} \; {\underbrace{\left(f(r)^{-1/2} \, dr\right)}_{ϑ^1}}^2 \; \underbrace{+}_{ε_2} \; {\underbrace{(r \, dθ)}_{ϑ^2}}^2 \; \underbrace{+}_{ε_3} \; {\underbrace{(r \sin θ \, dφ)}_{ϑ^3}}^2\text{.} \] i.e. \[ \begin{alignedat}{2} ϑ^0 &= \, & f(r)^{1/2} & \, dt \\ ϑ^1 &= \, & f(r)^{-1/2} & \, dr \\ ϑ^2 &= \, & r & \, dθ \\ ϑ^3 &= \, & r \sin θ & \, dφ \end{alignedat} \] with Lorentzian signature $({-} \; {+} \; {+} \; {+})$. We can then express the coordinate forms in terms of it: \[ \begin{alignedat}{2} dt &= \, & f(r)^{-1/2} & \, ϑ^0 \\ dr &= \, & f(r)^{1/2} & \, ϑ^1 \\ dθ &= \, & r^{-1} & \, ϑ^2 \\ dφ &= \, & r^{-1} \csc θ & \, ϑ^3\text{.} \end{alignedat} \] Note that since we’re using $θ$ as a coordinate, we use $ϑ^λ$ to denote the basis one-forms. Furthermore, since this metric is Lorentzian, we adopt the convention that the index of the first coordinate is $0$, Greek indices start from $0$, and Latin indices start from $1$.
Connection forms

The derivatives of the basis one-forms are \[ \begin{alignedat}{2} dϑ^0 &= \frac{1}{2}f(r)^{-1/2} f'(r) \, dr ∧ dt & &= \frac{1}{2}f(r)^{-1/2} f'(r) \, ϑ^1 ∧ ϑ^0 \\ dϑ^1 &= 0 & & \\ dϑ^2 &= dr ∧ dθ & &= \frac{f(r)^{1/2}}{r} \, ϑ^1 ∧ ϑ^2 \\ dϑ^3 &= \sin θ \, dr ∧ dφ + r \cos θ \, dθ ∧ dφ & &= \frac{f(r)^{1/2}}{r} \, ϑ^1 ∧ ϑ^3 + \frac{\cot θ}{r} \, ϑ^2 ∧ ϑ^3\text{.} \end{alignedat} \] By semi-skew symmetry, since $ε_0 = -1$ and $ε_i = 1$, $\cnf{0}{i} = \cnf{i}{0}$ and $\cnf{i}{j} = -\cnf{j}{i}$. Therefore, we can explicitly write out the first structure equations: \[ \begin{alignedat}{4} dϑ^0 &= & &- \cnf{0}{1} ∧ ϑ^1 & &- \cnf{0}{2} ∧ ϑ^2 & &- \cnf{0}{3} ∧ ϑ^3 \\ dϑ^1 &= -\cnf{0}{1} ∧ ϑ^0 & & & &- \cnf{1}{2} ∧ ϑ^2 & &- \cnf{1}{3} ∧ ϑ^3 \\ dϑ^2 &= -\cnf{0}{2} ∧ ϑ^0 & &+ \cnf{1}{2} ∧ ϑ^1 & & & &- \cnf{2}{3} ∧ ϑ^3 \\ dϑ^3 &= -\cnf{0}{3} ∧ ϑ^0 & &+ \cnf{1}{3} ∧ ϑ^1 & &+ \cnf{2}{3} ∧ ϑ^2\text{,} & & \end{alignedat} \] and rewriting the derivative equations to match: \[ \begin{alignedat}{3} dϑ^0 &= & \; -\overbrace{\left(\frac{1}{2}f(r)^{-1/2} f'(r) \, ϑ^0\right)}^{\text{one term of $\cnf{0}{1}$}} &∧ ϑ^1 & & \\ dϑ^1 &= 0 & & & & \\ dϑ^2 &= & \overbrace{\left(-\frac{f(r)^{1/2}}{r} \, ϑ^2\right)}^{\text{one term of $\cnf{1}{2}$}} &∧ ϑ^1 & & \\ dϑ^3 &= & \underbrace{\left(-\frac{f(r)^{1/2}}{r} \, ϑ^3\right)}_{\text{one term of $\cnf{1}{3}$}} &∧ ϑ^1 & \; + \; \underbrace{\left( -\frac{\cot θ}{r} \, ϑ^3 \right)}_{\text{one term of $\cnf{2}{3}$}} &∧ ϑ^2\text{,} \end{alignedat} \] we can guess that \[ \begin{alignedat}{2} \cnf{0}{1} &= \, & \frac{1}{2} f(r)^{-1/2} f'(r) & \, ϑ^0 \\ \cnf{1}{2} &= \, & -\frac{f(r)^{1/2}}{r} & \, ϑ^2 \\ \cnf{1}{3} &= \, & -\frac{f(r)^{1/2}}{r} & \, ϑ^3 \\ \cnf{2}{3} &= \, & -\frac{\cot θ}{r} & \, ϑ^3\text{.} \end{alignedat} \] Happily, plugging these expressions back into the first structure equations, we find that they hold. Therefore, by uniqueness of the connection forms, they are the connection forms.

Rather than plugging our guess into the first structure equations, a slicker way to see that it works would be to split up the first structure equation thus: \[ dϑ^λ = -∑_{λ \lt μ} \cnf{λ}{μ} ∧ ϑ^μ - ∑_{λ > μ} \cnf{λ}{μ} ∧ ϑ^μ\text{,} \] and notice that our derivative equations have the particularly simple form \[ dϑ^λ = ∑_{λ \lt μ} (f_μ \, ϑ^λ) ∧ ϑ^μ\text{,} \] so setting \[ \cnf{λ}{μ} = -f_μ \, ϑ^λ \quad \text{for $λ \lt μ$} \] takes care of the left sum above. Then by semi-skew symmetry, if $λ \gt μ$, \[ \lvert \cnf{λ}{μ} ∧ ϑ^μ \rvert = \lvert \cnf{μ}{λ} ∧ ϑ^μ \rvert = \lvert (f_λ \, ϑ^μ) ∧ ϑ^μ \rvert = 0\text{.} \] Thus all terms in the right sum above vanish as required.

Then, expressing the connection forms in terms of both the basis one-forms and the coordinate forms, \[ \begin{alignedat}{6} \cnf{0}{1} &= & &\cnf{1}{0} & &= \quad & \frac{1}{2} f(r)^{-1/2} f'(r) \, &ϑ^0 & \quad &= \quad & \frac{1}{2} f'(r) \, &dt \\ \cnf{2}{1} &= & \; -&\cnf{1}{2} & &= \quad & \frac{f(r)^{1/2}}{r} \, &ϑ^2 & \quad &= \quad & f(r)^{1/2} \, &dθ \\ \cnf{3}{1} &= & \; -&\cnf{1}{3} & &= \quad & \frac{f(r)^{1/2}}{r} \, &ϑ^3 & \quad &= \quad & f(r)^{1/2} \sin θ \, &dφ \\ \cnf{3}{2} &= & \; -&\cnf{2}{3} & &= \quad & \frac{\cot θ}{r} \, &ϑ^3 & \quad &= \quad & \cos θ \, &dφ \text{.} \end{alignedat} \]

Note that $\cnf{2}{1}$ has only one component instead of two; this is because $g_{11}$ doesn’t depend on $θ$. The other connection forms are either zero or have only one component, as expected for a diagonal metric depending on two coordinates.
Curvature forms

Using the expressions for $\cnf{μ}{ν}$ in terms of the coordinate one-forms, since $d(dt) = d(dr) = d(dθ) = d(dφ) = 0$, the derivatives of the connection forms are: \[ \begin{aligned} d \cnf{0}{1} &= \frac{1}{2} f''(r) \, dr ∧ dt \\ &= \frac{1}{2} f''(r) \, ϑ^1 ∧ ϑ^0 \\ d \cnf{2}{1} &= \frac{1}{2} f(r)^{-1/2} f'(r) \, dr ∧ dθ \\ &= \frac{f'(r)}{2r} \, ϑ^1 ∧ ϑ^2 \\ d \cnf{3}{1} &= \frac{1}{2} f(r)^{-1/2} f'(r) \sin θ \, dr ∧ dφ + f(r)^{1/2} \cos θ \, dθ ∧ dφ \\ &= \frac{f'(r)}{2r} \, ϑ^1 ∧ ϑ^3 + \frac{f(r)^{1/2} \cot θ}{r^2} \, ϑ^2 ∧ ϑ^3 \\ d \cnf{3}{2} &= -\sin θ \, dθ ∧ dφ \\ &= -\frac{1}{r^2} \, ϑ^2 ∧ ϑ^3\text{.} \end{aligned} \] For $\cnf{μ}{λ} ∧ \cnf{λ}{ν}$, recalling that one need only sum over $λ ∉ \{ μ, ν \}$, the non-zero terms are \[ \begin{alignedat}{3} \cnf{0}{λ} ∧ \cnf{λ}{2} &= \cnf{0}{1} ∧ \cnf{1}{2} & &= \; & -\frac{f'(r)}{2r} \, &ϑ^0 ∧ ϑ^2 \\ \cnf{0}{λ} ∧ \cnf{λ}{3} &= \cnf{0}{1} ∧ \cnf{1}{3} & &= \; & -\frac{f'(r)}{2r} \, &ϑ^0 ∧ ϑ^3 \\ \cnf{1}{λ} ∧ \cnf{λ}{3} &= \cnf{1}{2} ∧ \cnf{2}{3} & &= \; & \frac{f(r)^{1/2} \cot θ}{r^2} \, &ϑ^2 ∧ ϑ^3 \\ \cnf{2}{λ} ∧ \cnf{λ}{3} &= \cnf{2}{1} ∧ \cnf{1}{3} & &= \; & -\frac{f(r)}{r^2} \, &ϑ^2 ∧ ϑ^3\text{.} \end{alignedat} \] Then we can compute the curvature forms: \[ \begin{aligned} \crf{0}{1} &= d\cnf{0}{1} = \frac{1}{2} f''(r) \, ϑ^1 ∧ ϑ^0 \\ \crf{0}{2} &= \cnf{0}{λ} ∧ \cnf{λ}{2} = -\frac{f'(r)}{2r} \, ϑ^0 ∧ ϑ^2 \\ \crf{0}{3} &= \cnf{0}{λ} ∧ \cnf{λ}{3} = -\frac{f'(r)}{2r} \, ϑ^0 ∧ ϑ^3 \\ \crf{1}{2} &= d\cnf{1}{2} = -\frac{f'(r)}{2r} \, ϑ^1 ∧ ϑ^2 \\ \crf{1}{3} &= d\cnf{1}{3} + \cnf{1}{λ} ∧ \cnf{λ}{3} \\ &= -\frac{f'(r)}{2r} \, ϑ^1 ∧ ϑ^3 - \frac{f(r)^{1/2} \cot θ}{r^2} \, ϑ^2 ∧ ϑ^3 + \frac{f(r)^{1/2} \cot θ}{r^2} \, ϑ^2 ∧ ϑ^3 \\ &= -\frac{f'(r)}{2r} \, ϑ^1 ∧ ϑ^3 \\ \crf{2}{3} &= d\cnf{2}{3} + \cnf{2}{λ} ∧ \cnf{λ}{3} \\ &= \frac{1}{r^2} \, ϑ^2 ∧ ϑ^3 - \frac{f(r)}{r^2} \, ϑ^2 ∧ ϑ^3 \\ &= \frac{1 - f(r)}{r^2} \, ϑ^2 ∧ ϑ^3\text{.} \end{aligned} \] Again by semi-skew symmetry, since $ε_0 = -1$ and $ε_i = 1$,
$\crf{0}{i} = \crf{i}{0}$ and $\crf{i}{j} = -\crf{j}{i}$. Therefore, \[ \begin{alignedat}{3} \crf{0}{1} &= \; & \crf{1}{0} &= \; & \frac{1}{2} f''(r) \, &ϑ^1 ∧ ϑ^0 \\ \crf{0}{2} &= \; & \crf{2}{0} &= \; & -\frac{f'(r)}{2r} \, &ϑ^0 ∧ ϑ^2 \\ \crf{0}{3} &= \; & \crf{3}{0} &= \; & -\frac{f'(r)}{2r} \, &ϑ^0 ∧ ϑ^3 \\ \crf{1}{2} &= \; & -\crf{2}{1} &= \; & -\frac{f'(r)}{2r} \, &ϑ^1 ∧ ϑ^2 \\ \crf{1}{3} &= \; & -\crf{3}{1} &= \; & -\frac{f'(r)}{2r} \, &ϑ^1 ∧ ϑ^3 \\ \crf{2}{3} &= \; & -\crf{3}{2} &= \; & \frac{1 - f(r)}{r^2} \, &ϑ^2 ∧ ϑ^3\text{.} \end{alignedat} \]
Ricci curvature

We can compute the Ricci tensor $\Ric{μν}$ as \[ \Ric{μν} = \Riem{λ}{μλν} = \crf{λ}{μ}(E_λ, E_ν)\text{,} \] where the $E_λ$ comprise the dual frame to $ϑ^λ$. From the relations \[ (ϑ^μ ∧ ϑ^ν)(E_ρ, E_σ) = \begin{cases} +1 & ρ = μ ≠ ν = σ \\ -1 & σ = μ ≠ ν = ρ \\ 0 & \text{otherwise,} \end{cases} \] we can examine the expressions above and conclude that $\crf{ρ}{σ}(E_μ, E_ν)$ is possibly non-zero only when $\{ μ, ν \} = \{ ρ, σ \}$. Furthermore, examining the expression for $\Ric{μν}$, we can further conclude that $\Ric{μν}$ is zero when $μ ≠ ν$. Therefore, it suffices to check $\Ric{λλ}$. (One of the properties we deduced for a diagonal metric depending on two coordinates was that $\Ric{}$ would be diagonal except for possibly $\Ric{12}$, but since $\cnf{1}{2}$ turned out to not have a $ϑ^1$ term, that immediately leads to $\Ric{12} = 0$.)

From the expressions above, \[ \begin{aligned} \crf{0}{1}(E_0, E_1) &= -\frac{1}{2} f''(r) \\ \crf{0}{2}(E_0, E_2) &= \crf{0}{3}(E_0, E_3) = \crf{1}{2}(E_1, E_2) = \crf{1}{3}(E_1, E_3) = -\frac{f'(r)}{2r} \\ \crf{2}{3}(E_2, E_3) &= \frac{1 - f(r)}{r^2}\text{,} \end{aligned} \] so using the skew symmetry of two-forms \[ \crf{μ}{ν}(E_ρ, E_σ) = -\crf{μ}{ν}(E_σ, E_ρ) \] and the semi-skew symmetry of $\crf{μ}{ν}$ \[ \crf{0}{i} = \crf{i}{0} \quad \text{and} \quad \crf{i}{j} = -\crf{j}{i} \text{,} \] we can compute $\Ric{λλ}$: \[ \begin{aligned} \Ric{00} &= \crf{1}{0}(E_1, E_0) + \crf{2}{0}(E_2, E_0) + \crf{3}{0}(E_3, E_0) \\ &= -\crf{0}{1}(E_0, E_1) - \crf{0}{2}(E_0, E_2) - \crf{0}{3}(E_0, E_3) \\ &= \frac{1}{2} f''(r) + \frac{f'(r)}{r} \\ \Ric{11} &= \crf{0}{1}(E_0, E_1) + \crf{2}{1}(E_2, E_1) + \crf{3}{1}(E_3, E_1) \\ &= \crf{0}{1}(E_0, E_1) + \crf{1}{2}(E_1, E_2) + \crf{1}{3}(E_1, E_3) \\ &= -\Ric{00} \\ \Ric{22} &= \crf{0}{2}(E_0, E_2) + \crf{1}{2}(E_1, E_2) + \crf{3}{2}(E_3, E_2) \\ &= \crf{0}{2}(E_0, E_2) + \crf{1}{2}(E_1, E_2) + \crf{2}{3}(E_2, E_3) \\ &= -\frac{f'(r)}{r} + \frac{1 - f(r)}{r^2} \\ \Ric{33} &= \crf{0}{3}(E_0, E_3) + \crf{1}{3}(E_1, E_3) + \crf{2}{3}(E_2, E_3) \\ &= \Ric{22}\text{.} \end{aligned} \]

Finally, a computation shows that for $f(r) = 1 - \frac{r_S}{r}$, \[ \frac{1 - f(r)}{r^2} = -\frac{1}{2} f''(r) = \frac{f'(r)}{r} \text{,} \] so all the Ricci tensor components above vanish.^[5]
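That last computation can be spot-checked with exact rational arithmetic. The sketch below hard-codes the hand-computed derivatives $f'(r) = r_S/r^2$ and $f''(r) = -2 r_S/r^3$ (so it checks the algebra, not the calculus) and evaluates the four diagonal Ricci components from the formulas above. The function name `ricci_diagonal` and the sample values of $r_S$ and $r$ are just for illustration.

```python
from fractions import Fraction

def ricci_diagonal(f, df, d2f, r):
    """Return [Ric_00, Ric_11, Ric_22, Ric_33] in the orthonormal frame,
    using the expressions derived above in terms of f, f', and f''."""
    ric00 = d2f(r) / 2 + df(r) / r
    ric22 = -df(r) / r + (1 - f(r)) / r**2
    return [ric00, -ric00, ric22, ric22]

r_S = Fraction(3)  # arbitrary sample Schwarzschild radius

def f(r):
    return 1 - r_S / r

def df(r):   # f'(r), computed by hand
    return r_S / r**2

def d2f(r):  # f''(r), computed by hand
    return -2 * r_S / r**3

for r in (Fraction(4), Fraction(7, 2), Fraction(10)):
    print(r, ricci_diagonal(f, df, d2f, r))
```

Every row should consist of zeros, since the three quantities in the identity above coincide for this choice of $f$.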

Example 3: The pp-wave metric

For our last example, to keep things interesting, let’s consider a non-diagonal metric. Let \[ g = H(u, x, y) \, du ⊗ du + du ⊗ dv + dv ⊗ du + dx ⊗ dx + dy ⊗ dy \] be the pp-wave metric, where $H(u, x, y)$ is some smooth function. We want to derive a necessary and sufficient condition for $g$ to be Ricci-flat.

Orthonormal dual frame

This metric has the matrix \[ G = \begin{pmatrix} H & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\text{,} \] which has inverse \[ G^{-1} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & -H & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\text{,} \] so the dual metric is \[ g^* = ∂_u ⊗ ∂_v + ∂_v ⊗ ∂_u - H(u, x, y) \, ∂_v ⊗ ∂_v + ∂_x ⊗ ∂_x + ∂_y ⊗ ∂_y\text{.} \] We can see that $dx$ and $dy$ form part of an orthonormal dual frame, but we have to find the other two, which involve $du$ and $dv$. First we have to figure out the signature of the metric. So set \[ \begin{aligned} θ^0 &= A \, du + B \, dv \\ θ^1 &= C \, du + D \, dv \\ θ^2 &= dx \\ θ^3 &= dy\text{,} \end{aligned} \] and solve for $A$, $B$, $C$, and $D$ using the orthonormality conditions \[ \begin{aligned} g^*(θ^0, θ^0) &= 2AB - B^2 H = ε_0 \\ g^*(θ^0, θ^1) &= AD + BC - BDH = 0 \\ g^*(θ^1, θ^1) &= 2CD - D^2 H = ε_1\text{.} \end{aligned} \] The tricky thing is to pick the $θ^μ$ without assuming that $H$ is non-zero. The simplest way to do that is to assume that none of the coefficients of $H$ vanish, and, since we have four unknowns (not counting $ε_0$ and $ε_1$) and three equations, to set $B = 1$. Then the first equation gives $A = (ε_0 + H)/2$, the second equation gives $C = D(H - A)$, and plugging everything into the third equation gives $D^2 = -ε_1 / ε_0$, which implies that $ε_1 = -ε_0$ and $D = ±1$. Set $ε_0 = -1$ to make the frame have a Lorentzian signature $({-} \; {+} \; {+} \; {+})$, and let $D = ε$. Then \[ \begin{aligned} A &= \frac{H - 1}{2} \\ B &= 1 \\ C &= ε\frac{H + 1}{2} \\ D &= ε\text{.} \end{aligned} \] Setting $ε = 1$ for symmetry, we finally have \[ \begin{aligned} θ^0 &= \frac{H-1}{2} \, du + dv \\ θ^1 &= \frac{H+1}{2} \, du + dv = θ^0 + du \\ θ^2 &= dx \\ θ^3 &= dy \end{aligned} \] and \[ \begin{aligned} du &= θ^1 - θ^0 \\ dx &= θ^2 \\ dy &= θ^3\text{;} \end{aligned} \] it’ll turn out that we don’t need to express $dv$ in terms of the $θ^μ$.
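Since there are a lot of signs floating around, here’s a quick Python check (a sketch, with hypothetical helper names) that the frame we ended up with really is orthonormal with respect to $g^*$, with signature $({-} \; {+} \; {+} \; {+})$, for a few sample values of $H$, including $H = 0$. Components are listed in the coordinate order $(u, v, x, y)$.

```python
from fractions import Fraction

def dual_metric(H):
    """Matrix of g^* (the inverse of G) in the order (u, v, x, y)."""
    return [[0, 1, 0, 0],
            [1, -H, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]

def frame(H):
    """Components of theta^0, ..., theta^3 with respect to (du, dv, dx, dy)."""
    half = Fraction(1, 2)
    return [[(H - 1) * half, 1, 0, 0],
            [(H + 1) * half, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]

def pair(g_inv, a, b):
    """g^*(a, b) for one-forms a and b given by their components."""
    return sum(a[i] * g_inv[i][j] * b[j] for i in range(4) for j in range(4))

def gram(H):
    """Matrix of all pairings g^*(theta^mu, theta^nu)."""
    g_inv, th = dual_metric(H), frame(H)
    return [[pair(g_inv, th[m], th[n]) for n in range(4)] for m in range(4)]

eta = [[-1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
for H in (Fraction(0), Fraction(-5, 3), Fraction(7)):
    print(H, gram(H) == eta)
```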
Connection forms

Since \[ \begin{aligned} θ^1 &= θ^0 + du \\ θ^2 &= dx \\ θ^3 &= dy\text{,} \end{aligned} \] the derivatives of the basis one-forms are \[ \begin{aligned} dθ^0 &= dθ^1 = \frac{1}{2} (H_x \, dx + H_y \, dy) ∧ du \\ &= \frac{H_x}{2} \, θ^2 ∧ θ^1 - \frac{H_x}{2} \, θ^2 ∧ θ^0 + \frac{H_y}{2} \, θ^3 ∧ θ^1 - \frac{H_y}{2} \, θ^3 ∧ θ^0 \\ dθ^2 &= 0 \\ dθ^3 &= 0\text{.} \end{aligned} \]

Similarly to the Schwarzschild example, by semi-skew symmetry, since $ε_0 = -1$ and $ε_i = 1$, $\cnf{0}{i} = \cnf{i}{0}$ and $\cnf{i}{j} = -\cnf{j}{i}$. Therefore, we can explicitly write out the first structure equations: \[ \begin{alignedat}{4} dθ^0 &= & &- \cnf{0}{1} ∧ θ^1 & &- \cnf{0}{2} ∧ θ^2 & &- \cnf{0}{3} ∧ θ^3 \\ dθ^1 &= -\cnf{0}{1} ∧ θ^0 & & & &- \cnf{1}{2} ∧ θ^2 & &- \cnf{1}{3} ∧ θ^3 \\ dθ^2 &= -\cnf{0}{2} ∧ θ^0 & &+ \cnf{1}{2} ∧ θ^1 & & & &- \cnf{2}{3} ∧ θ^3 \\ dθ^3 &= -\cnf{0}{3} ∧ θ^0 & &+ \cnf{1}{3} ∧ θ^1 & &+ \cnf{2}{3} ∧ θ^2\text{.} & & \end{alignedat} \] However, unlike the Schwarzschild example, we can’t simply read off the non-zero connection forms; for example, it’s not immediately clear whether the $\frac{H_x}{2} \, θ^2 ∧ θ^1$ term in $dθ^0$ belongs to the $\cnf{0}{1} ∧ θ^1$ term or the $\cnf{0}{2} ∧ θ^2$ term. However, since $dθ^0 = dθ^1$, we can guess that $\cnf{0}{2} = \cnf{1}{2}$ and $\cnf{0}{3} = \cnf{1}{3}$. Subtracting the first structure equations for $dθ^1$ and $dθ^0$, we get \[ \cnf{0}{1} ∧ (θ^1 - θ^0) = 0\text{,} \] i.e. that $\cnf{0}{1} ∼ θ^1 - θ^0$. However, plugging this into the first structure equation for $dθ^0$ or $dθ^1$, we get a $θ^0 ∧ θ^1$ term, which isn’t present in the derivative equation for $dθ^0 = dθ^1$, which then implies that $\cnf{0}{1} = 0$. 
Thus, there’s only one way to assign each term of the derivative equation for $dθ^0 = dθ^1$ to $\cnf{0}{2} ∧ θ^2$ or $\cnf{0}{3} ∧ θ^3$: \[ \begin{aligned} \cnf{0}{2} &= \cnf{1}{2} = \frac{H_x}{2} \, (θ^1 - θ^0) = \frac{H_x}{2} \, du \\ \cnf{0}{3} &= \cnf{1}{3} = \frac{H_y}{2} \, (θ^1 - θ^0) = \frac{H_y}{2} \, du\text{.} \end{aligned} \] (As a check, $-\cnf{0}{2} ∧ θ^2 = -\frac{H_x}{2} \, du ∧ θ^2 = \frac{H_x}{2} \, θ^2 ∧ (θ^1 - θ^0)$, which reproduces the $H_x$ terms of $dθ^0$.) Plugging this into the structure equations for $dθ^2$ and $dθ^3$, we get \[ \begin{aligned} dθ^2 &= -\cnf{0}{2} ∧ θ^0 + \cnf{1}{2} ∧ θ^1 - \cnf{2}{3} ∧ θ^3 \\ &= \cnf{0}{2} ∧ du - \cnf{2}{3} ∧ θ^3 \\ &= \frac{H_x}{2} \, du ∧ du - \cnf{2}{3} ∧ θ^3 \\ &= -\cnf{2}{3} ∧ θ^3 \\ dθ^3 &= -\cnf{0}{3} ∧ θ^0 + \cnf{1}{3} ∧ θ^1 + \cnf{2}{3} ∧ θ^2 \\ &= \cnf{0}{3} ∧ du + \cnf{2}{3} ∧ θ^2 \\ &= \frac{H_y}{2} \, du ∧ du + \cnf{2}{3} ∧ θ^2 \\ &= \cnf{2}{3} ∧ θ^2\text{.} \end{aligned} \] Since $dθ^2 = dθ^3 = 0$ from the derivative equations, $\cnf{2}{3}$ is proportional to both $θ^2$ and $θ^3$, i.e. $\cnf{2}{3} = 0$. We’ve found expressions for $\cnf{μ}{ν}$ that satisfy the first structure equations. Therefore, by uniqueness of the connection forms, these expressions are the connection forms. Then, expressing the connection forms in terms of both the basis one-forms and the coordinate forms, \[ \begin{aligned} \cnf{0}{2} &= \cnf{2}{0} = \cnf{1}{2} = -\cnf{2}{1} = \frac{H_x}{2} \, (θ^1 - θ^0) = \frac{H_x}{2} \, du \\ \cnf{0}{3} &= \cnf{3}{0} = \cnf{1}{3} = -\cnf{3}{1} = \frac{H_y}{2} \, (θ^1 - θ^0) = \frac{H_y}{2} \, du\text{.} \end{aligned} \]
Curvature forms

Using the expressions for $\cnf{μ}{ν}$ in terms of the coordinate one-forms, since $d(du) = 0$, the derivative of $\cnf{0}{2} = \cnf{1}{2}$ is \[ \begin{aligned} d\cnf{0}{2} &= d\cnf{1}{2} = \frac{1}{2} \, dH_x ∧ du \\ &= \frac{1}{2} (H_{xx} \, dx + H_{xy} \, dy) ∧ du \\ &= \frac{1}{2} (H_{xx} \, θ^2 + H_{xy} \, θ^3) ∧ (θ^1 - θ^0) \end{aligned} \] and similarly the derivative of $\cnf{0}{3} = \cnf{1}{3}$ is \[ d\cnf{0}{3} = d\cnf{1}{3} = \frac{1}{2} (H_{xy} \, θ^2 + H_{yy} \, θ^3) ∧ (θ^1 - θ^0)\text{.} \] Since all the connection forms are proportional to $du$, all possible sums $\cnf{μ}{λ} ∧ \cnf{λ}{ν}$ equal $0$. Then we can compute the curvature forms: \[ \begin{aligned} \crf{0}{2} &= \crf{1}{2} = \frac{1}{2} (H_{xx} \, θ^2 + H_{xy} \, θ^3) ∧ (θ^1 - θ^0) \\ \crf{0}{3} &= \crf{1}{3} = \frac{1}{2} (H_{xy} \, θ^2 + H_{yy} \, θ^3) ∧ (θ^1 - θ^0)\text{.} \end{aligned} \] Again by semi-skew symmetry, since $ε_0 = -1$ and $ε_i = 1$, $\crf{0}{i} = \crf{i}{0}$ and $\crf{i}{j} = -\crf{j}{i}$. Therefore, \[ \begin{aligned} \crf{0}{2} &= \crf{2}{0} = \crf{1}{2} = -\crf{2}{1} = \frac{1}{2} (H_{xx} \, θ^2 + H_{xy} \, θ^3) ∧ (θ^1 - θ^0) \\ \crf{0}{3} &= \crf{3}{0} = \crf{1}{3} = -\crf{3}{1} = \frac{1}{2} (H_{xy} \, θ^2 + H_{yy} \, θ^3) ∧ (θ^1 - θ^0)\text{.} \end{aligned} \]
Ricci curvature

We can compute the Ricci tensor $\Ric{μν}$ as \[ \Ric{μν} = \Riem{λ}{μλν} = \crf{λ}{μ}(E_λ, E_ν)\text{,} \] where the $E_λ$ comprise the dual frame to $θ^λ$. First, using the relations \[ (θ^μ ∧ θ^ν)(E_ρ, E_σ) = \begin{cases} +1 & ρ = μ ≠ ν = σ \\ -1 & σ = μ ≠ ν = ρ \\ 0 & \text{otherwise,} \end{cases} \] we compute \[ \Ric{0ν} = \crf{λ}{0}(E_λ, E_ν) = \crf{0}{λ}(E_λ, E_ν) = \crf{0}{2}(E_2, E_ν) + \crf{0}{3}(E_3, E_ν) \] and see that it’s only non-zero for $ν ∈ \{ 0, 1 \}$; furthermore, $\Ric{01} = -\Ric{00}$. Similarly, \[ \Ric{1ν} = \crf{λ}{1}(E_λ, E_ν) = -\crf{1}{λ}(E_λ, E_ν) = -\crf{0}{λ}(E_λ, E_ν) = -\Ric{0ν}\text{.} \] For the last two, we can save some effort by calculating $(θ^1 - θ^0)(E_0 + E_1) = 0$, which implies \[ (θ^μ ∧ (θ^1 - θ^0))(E_ν, E_0 + E_1) = 0\text{.} \] Then, using skew symmetry of two-forms \[ \crf{μ}{ν}(E_ρ, E_σ) = -\crf{μ}{ν}(E_σ, E_ρ)\text{,} \] we compute \[ \Ric{2ν} = \crf{λ}{2}(E_λ, E_ν) = -\crf{λ}{2}(E_ν, E_λ) = -\crf{0}{2}(E_ν, E_0) - \crf{1}{2}(E_ν, E_1) = -\crf{0}{2}(E_ν, E_0 + E_1) = 0 \] and \[ \Ric{3ν} = \crf{λ}{3}(E_λ, E_ν) = -\crf{λ}{3}(E_ν, E_λ) = -\crf{0}{3}(E_ν, E_0) - \crf{1}{3}(E_ν, E_1) = -\crf{0}{3}(E_ν, E_0 + E_1) = 0\text{,} \] so it suffices to compute $\Ric{00}$: \[ \begin{aligned} \Ric{00} &= \crf{0}{2}(E_2, E_0) + \crf{0}{3}(E_3, E_0) \\ &= -\frac{1}{2} (H_{xx} + H_{yy})\text{.} \end{aligned} \] Finally, we can conclude that the pp-wave metric is Ricci flat exactly when \[ H_{xx} + H_{yy} = 0\text{.} \]

Footnotes

[1] A metric can only be diagonal with respect to a particular coordinate system, but for brevity I’ll only mention it here. ↩

[2] See p. 52 of The Geometry of Kerr Black Holes by Barrett O’Neill. ↩

[3] The paper “Ricci Tensor of Diagonal Metric” has a similar discussion using coordinate methods; note that the calculations are much more laborious! ↩

[4] One subtle technical point is that there might not be such an expression for $g$ throughout the whole chart domain; see this Math StackExchange question for details. In practice, though, this doesn’t turn out to be a problem. ↩

[5] The Schwarzschild metric describes the field outside a spherically symmetric and non-rotating massive body. If we let $f(r)$ have an $r^{-2}$ term, e.g. \[ f(r) = 1 - \frac{r_S}{r} + \frac{r_Q^2}{r^2} \] for some constant $r_Q$, then we have non-vanishing Ricci components. However, this metric, called the Reissner–Nordström metric, is still useful, as it describes a charged, spherically symmetric, non-rotating massive body. ↩

A Gentle Introduction to Erasure Codes

2017-11-30T00:00:00-08:00

1. Overview

This article explains Reed-Solomon erasure codes and the problems they solve in gory detail, with the aim of providing enough background to understand how the PAR1 and PAR2 file formats work, the details of which will be covered in future articles.

I’m assuming that the reader is familiar with programming, but has not had much exposure to coding theory or linear algebra. Thus, I’ll review the basics and treat the results we need as a “black box”, stating them and moving on. However, I’ll give self-contained proofs of those results in a companion article.

So let’s start with the problem we’re trying to solve! Let’s say you have $n$ files of roughly the same size, and you want to guard against $m$ of them being lost or corrupted. To do so, you generate $m$ parity files ahead of time, and if in the future you lose up to $m$ of the data files, you can use an equal number of parity files to recover the lost data files.

cashcat0.jpg, cashcat1.jpg, cashcat2.jpg $\xmapsto{\mathtt{GenerateParityFiles}}$ cashcats.p00, cashcats.p01

Figure 1 Using parity codes to protect against the loss or corruption of up to two images (out of three) of cashcats.

$\xmapsto{\mathtt{ReconstructDataFiles}}$

Figure 2 With a corrupted and a missing file, recovering the original cashcat images using the parity files from Figure 1.

Note that this works even if you lose some of the parity files also; as long as you have $n$ files, whether they be data or parity files, you’ll be able to recover the original $n$ data files. Compare making $n$ parity files with simply making a copy of the $n$ data files (for $n > 1$). In the latter case, if you lose both a data file and its copy, that data file becomes unrecoverable! So parity files take the same amount of space but provide superior recovery capabilities.

Now we can reduce the problem above to a byte-level problem as follows. Have ComputeParityFiles pad all the data files so they’re the same size, and then for each byte position i call a function ComputeParityBytes on the ith byte of each data file, and store the results into the ith byte of each parity file. Also take a checksum or hash of each data file and store those (along with the original data file sizes) with the parity files. Then, ReconstructDataFiles can detect corrupted files using the checksums/hashes and treat them as missing, and then for each byte position i it can call a function ReconstructDataBytes on the ith byte of each good data and parity file to recover the ith byte of the corrupted/missing data files.
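To make the reduction concrete, here’s a minimal Python sketch of the file-level function in terms of a byte-level one. It’s an illustration under assumptions rather than how PAR1 or PAR2 actually do it: files are in-memory `bytes` objects, padding uses zero bytes, checksums and original sizes are left out, and the byte-level function passed in (`xor_parity`, a single-parity stand-in) is hypothetical.

```python
from functools import reduce
from typing import Callable, List

def compute_parity_files(
    data_files: List[bytes],
    m: int,
    compute_parity_bytes: Callable[[List[int]], List[int]],
) -> List[bytes]:
    """Build m parity files by applying a byte-level code at each byte position."""
    size = max(len(f) for f in data_files)
    # Pad every data file with zero bytes so they all have the same size.
    padded = [f.ljust(size, b"\0") for f in data_files]
    parity = [bytearray(size) for _ in range(m)]
    for pos in range(size):
        column = [f[pos] for f in padded]  # the byte at this position in each file
        for k, b in enumerate(compute_parity_bytes(column)):
            parity[k][pos] = b
    return [bytes(p) for p in parity]

def xor_parity(column: List[int]) -> List[int]:
    """Stand-in byte-level code for m = 1 (assumed here for illustration only)."""
    return [reduce(lambda a, b: a ^ b, column, 0)]

files = [b"cash", b"cat", b"s!"]
parity_files = compute_parity_files(files, 1, xor_parity)
print(parity_files)
```

A real implementation would also record each file’s original size and a checksum or hash, as described above, so that padding can be stripped and corruption detected.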

A byte error where we know the position of the dropped/corrupted byte is called an erasure. Then, the pair of functions ComputeParityBytes and ReconstructDataBytes which behave as described above implements what is called an optimal erasure code; it’s an erasure code because it guards only against byte erasures, and not more general errors where we don’t know which data bytes have been corrupted, and it’s optimal because in general you need at least $n$ known bytes to recover the $n$ data bytes, and that bound is achieved.

In detail, an optimal erasure code is composed of some set of possible $(n, m)$ pairs, and for each possible pair, a function

ComputeParityBytes<n, m>(data: byte[n]) -> (parity: byte[m])

that computes $m$ parity bytes given $n$ data bytes, and a function

ReconstructDataBytes<n, m>(partialData: (byte?)[n], partialParity: (byte?)[m]) -> ((data: byte[n]) | Error)

that takes in a partial list of data and parity bytes from an earlier call to ComputeParityBytes, and returns the full list of data bytes if there are at least $n$ known data or parity bytes (i.e., there are no more than $m$ omitted data or parity bytes), and an error otherwise.

(In the above pseudocode, I’m using T[n] to mean an array of n objects of type T, and byte? to mean byte | None. Also, I’ll omit the -Bytes<n, m> suffix from now on.)

By the end of this article, we’ll find out exactly how the following example works:

Example 1: `ComputeParity` and `ReconstructData`

`ComputeParity`

Let d = [ da, db, 0d ] be the input data bytes and let m = 2 be the desired parity byte count. Then the output parity bytes are p = [ 52, 0c ].

`ReconstructData`

Let d_partial = [ ??, db, ?? ] be the input partial data bytes and p_partial = [ 52, 0c ] be the input partial parity bytes. Then the output data bytes are d = [ da, db, 0d ].

2. Erasure codes for $m = 1$

The simplest erasure codes are when $m = 1$. For example, define

ComputeParitySum(data: byte[n]) {
  return [data[0] + … + data[n-1]]
}

where we consider byte to be an unsigned type such that addition and subtraction wrap around, i.e. byte arithmetic is done modulo $256$. Then also define

ReconstructDataSum(partialData: (byte?)[n], partialParity: (byte?)[1]) {
  if there is more than one entry of partialData or partialParity set to None {
    return Error
  } else if partialData has no entry set to None {
    return partialData
  }

  i := partialData.firstIndexOf(None)
  partialSum := partialData[0] + … + partialData[i-1] + partialData[i+1] + … + partialData[n-1]
  return partialData[0:i] ++ [partialParity[0] - partialSum] ++ partialData[i+1:n]
}

where a[i:j] means the subarray of a starting at i and ending (without inclusion) at j, and ++ is array concatenation.

This simple erasure code uses the fact that if you have the sum of a list of numbers, then you can recover a missing number by subtracting the sum of the other numbers from the total sum, and also that this works even if you do the arithmetic modulo $256$.
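Here’s a runnable Python rendition of the pseudocode above, as a sketch: bytes are modeled as `int`s in the range 0 to 255, `None` marks a missing entry, and an exception stands in for the `Error` return value. The snake_case names are my own.

```python
from typing import List, Optional

def compute_parity_sum(data: List[int]) -> List[int]:
    """Single parity byte: the sum of the data bytes modulo 256."""
    return [sum(data) % 256]

def reconstruct_data_sum(
    partial_data: List[Optional[int]],
    partial_parity: List[Optional[int]],
) -> List[int]:
    """Recover at most one missing byte from the sum parity byte."""
    if (partial_data + partial_parity).count(None) > 1:
        raise ValueError("more than one byte is missing")
    if None not in partial_data:
        return list(partial_data)

    i = partial_data.index(None)
    partial_sum = sum(b for b in partial_data if b is not None)
    # Python's % always returns a non-negative result, so this wraps correctly.
    recovered = (partial_parity[0] - partial_sum) % 256
    return partial_data[:i] + [recovered] + partial_data[i + 1:]

data = [0xDA, 0xDB, 0x0D]
parity = compute_parity_sum(data)
print(reconstruct_data_sum([0xDA, None, 0x0D], parity) == data)
```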

Another erasure code for $m = 1$ uses bitwise exclusive or (denoted as xor, ^, or $\oplus$) instead of arithmetic modulo $256$. Define

ComputeParityXor(data: byte[n]) {
  return [data[0] ⊕ … ⊕ data[n-1]]
}

and

ReconstructDataXor(partialData: (byte?)[n], partialParity: (byte?)[1]) {
  if there is more than one entry of partialData or partialParity set to None {
    return Error
  } else if partialData has no entry set to None {
    return partialData
  }

  i := partialData.firstIndexOf(None)
  partialXor := partialData[0] ⊕ … ⊕ partialData[i-1] ⊕ partialData[i+1] ⊕ … ⊕ partialData[n-1]
  return partialData[0:i] ++ [partialParity[0] ⊕ partialXor] ++ partialData[i+1:n]
}

This relies on the fact that $a \oplus a = 0$, so given the xor of a list of bytes, you can recover a missing byte by xoring with all the known bytes.
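The xor version translates to Python just as directly. As before, this is a sketch with my own naming, with bytes modeled as `int`s, `None` for missing entries, and an exception in place of `Error`.

```python
from functools import reduce
from operator import xor
from typing import List, Optional

def compute_parity_xor(data: List[int]) -> List[int]:
    """Single parity byte: the xor of all the data bytes."""
    return [reduce(xor, data, 0)]

def reconstruct_data_xor(
    partial_data: List[Optional[int]],
    partial_parity: List[Optional[int]],
) -> List[int]:
    """Recover at most one missing byte from the xor parity byte."""
    if (partial_data + partial_parity).count(None) > 1:
        raise ValueError("more than one byte is missing")
    if None not in partial_data:
        return list(partial_data)

    i = partial_data.index(None)
    known_xor = reduce(xor, (b for b in partial_data if b is not None), 0)
    # Every known byte cancels itself out, leaving only the missing one.
    return partial_data[:i] + [partial_parity[0] ^ known_xor] + partial_data[i + 1:]

data = [0xDA, 0xDB, 0x0D]
parity = compute_parity_xor(data)
print(reconstruct_data_xor([None, 0xDB, 0x0D], parity) == data)
```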

3. Erasure codes for $m = 2$ (almost)

Now coming up with an erasure code for $m = 2$ is more involved, but we can get an inkling of how it could work by letting $n = 3$ for simplicity, and also letting the output of ComputeParity be non-negative integers, instead of just bytes (i.e., less than $256$). In that case, we can consider parity numbers that are weighted sums of the data bytes. For example, like in the $m = 1$ case, we can have the first parity number be \[ p_0 = d_0 + d_1 + d_2\text{,} \] (using $d_i$ for data bytes and $p_i$ for parity numbers) but for the second parity number, we can pick different weights, say \[ p_1 = 1 \cdot d_0 + 2 \cdot d_1 + 3 \cdot d_2\text{.} \] We want to make sure that the weights for the second parity number are “sufficiently different” from that of the first parity number, which we’ll clarify later, but for example note that setting \[ p_1 = 2 \cdot d_0 + 2 \cdot d_1 + 2 \cdot d_2 \] can’t add any new information, since then $p_1$ will always be equal to $2 \cdot p_0$.

So then our ComputeParity function looks like

ComputeParityWeighted(data: byte[3]) {
  return [
    int(data[0]) +     int(data[1]) +     int(data[2]),
    int(data[0]) + 2 * int(data[1]) + 3 * int(data[2]),
  ]
}

As for ReconstructData, if we have two missing data bytes, say $d_i$ and $d_j$ for $i < j$, and $p_0$ and $p_1$, we can rearrange the equations \[ \begin{aligned} p_0 &= d_0 + d_1 + d_2 \\ p_1 &= 1 \cdot d_0 + 2 \cdot d_1 + 3 \cdot d_2 \end{aligned} \] to get all the unknowns on the left side, letting $d_k$ be the known data byte: \[ \begin{aligned} d_i + d_j &= X = p_0 - d_k \\ (i+1) \cdot d_i + (j+1) \cdot d_j &= Y = p_1 - (k + 1) \cdot d_k\text{.} \end{aligned} \] We can then multiply the first equation by $i + 1$ and subtract it from the second to cancel the $d_i$ term and get \[ d_j = (Y - (i + 1) \cdot X) / (j - i)\text{,} \] and then we can use the first equation to solve for $d_i$: \[ d_i = X - d_j = ((j + 1) \cdot X - Y) / (j - i)\text{.} \] Thus with these equations, we can implement ReconstructData:

ReconstructDataWeighted(partialData: (byte?)[3], partialParity: (int?)[2]) {
  Handle all cases except when there are exactly two entries set to none in partialData.

  [i, j] := indices of the unknown data bytes
  k := index of the known data byte

  X := partialParity[0] - partialData[k]
  Y := partialParity[1] - (k + 1) * partialData[k]

  d_i := ((j + 1) * X - Y) / (j - i)
  d_j := (Y - (i + 1) * X) / (j - i)

  return an array with d_i, d_j, and partialData[k] in the right order
}

(Generalizing this to larger values of $n$ is straightforward; $p_0$ will still have a weight of $1$ for each data byte, and $p_1$ will have a weight of $i + 1$ for $d_i$. $X$ and $Y$ will then have terms for all known bytes, and everything else proceeds the same after that.)
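
The weighted-sum scheme above, including the generalization to larger $n$, can be sketched in Python. This is only a minimal sketch: the function names are ours, missing bytes are represented by `None`, and we assume both parity numbers are available and exactly two data bytes are missing.

```python
def compute_parity_weighted(data):
    """Return [p0, p1], where p0 uses weight 1 and p1 uses weight i + 1 for d_i."""
    p0 = sum(data)
    p1 = sum((i + 1) * d for i, d in enumerate(data))
    return [p0, p1]


def reconstruct_two_missing(partial_data, parity):
    """Recover exactly two missing entries (marked None) of partial_data."""
    missing = [i for i, d in enumerate(partial_data) if d is None]
    if len(missing) != 2:
        raise ValueError("this sketch only handles exactly two missing bytes")
    i, j = missing
    # X and Y move every known term to the right-hand side.
    x = parity[0] - sum(d for d in partial_data if d is not None)
    y = parity[1] - sum((k + 1) * d
                        for k, d in enumerate(partial_data) if d is not None)
    # The division is exact because the parity numbers are plain integers,
    # not reduced modulo 256.
    d_j = (y - (i + 1) * x) // (j - i)
    d_i = x - d_j
    result = list(partial_data)
    result[i], result[j] = d_i, d_j
    return result


data = [0xDA, 0x7A, 0x05]
parity = compute_parity_weighted(data)
recovered = reconstruct_two_missing([None, data[1], None], parity)
```

Here `recovered` should equal `data`, even though the divisor $j - i$ is $2$.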

Now what goes wrong if we just try to do everything modulo $256$? The most obvious difference from the $m = 1$ case is that solving for $d_i$ or $d_j$ involves division, which works fine for non-negative integers as long as there’s no remainder, but it is not immediately clear how division can make sense modulo $256$.

One possible way to define division modulo $256$ would be to first define the multiplicative inverse modulo $256$ of an integer $0 \le x \lt 256$ as the integer $0 \le y \lt 256$ such that $(x \cdot y) \bmod 256 = 1$, if it exists, and then define division by $x$ modulo $256$ to be multiplication by $y$ modulo $256$. But this immediately runs into problems; $2$ has no multiplicative inverse modulo $256$, and the same holds for any even number, so reconstruction will fail if, for example, we have the first and third data bytes missing, since then we’d be trying to divide by $j - i = 2$.
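
To make the problem concrete, here is a brute-force sketch (the helper name `inverse_mod` is ours) that searches for multiplicative inverses modulo $256$. It is deliberately naive; it only illustrates which values can be divided by.

```python
def inverse_mod(x, modulus):
    """Return y in [0, modulus) with (x * y) % modulus == 1, or None if none exists."""
    for y in range(modulus):
        if (x * y) % modulus == 1:
            return y
    return None


# The values we could safely divide by if we worked modulo 256.
invertible = [x for x in range(256) if inverse_mod(x, 256) is not None]

inverse_of_two = inverse_mod(2, 256)    # no inverse exists
inverse_of_three = inverse_mod(3, 256)  # 3 * 171 = 513 = 2 * 256 + 1
```

Only the odd values end up in `invertible`, so half of all possible divisors are unusable.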

But for now, let’s leave aside the problem of generating parity bytes instead of parity numbers, and instead focus on how we can generalize the above for larger values of $m$. To do so, we need to first review some linear algebra.

4. Just enough linear algebra to get by^[1]

In our $n = 3, m = 2$ example in the previous section, the equations for the parity numbers have the form \[ p = a_0 \cdot d_0 + a_1 \cdot d_1 + a_2 \cdot d_2 \] for constants $a_0$, $a_1$, and $a_2$. We call such a weighted sum of the $d_i$s a linear combination of the $d_i$s, and we write this in a tabular form \[ p = \begin{pmatrix} a_0 & a_1 & a_2 \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix}\text{,} \] where we define the multiplication of a row vector and a column vector by the equation above, generalized in the straightforward manner for any $n$.

Then since we have two parity numbers $p_0$ and $p_1$, each a linear combination of the $d_i$s, i.e. \[ \begin{aligned} p_0 &= a_{00} \cdot d_0 + a_{01} \cdot d_1 + a_{02} \cdot d_2 \\ p_1 &= a_{10} \cdot d_0 + a_{11} \cdot d_1 + a_{12} \cdot d_2\text{,} \end{aligned} \] we can write this in a single tabular form as \[ \begin{bmatrix} p_0 \\ p_1 \end{bmatrix} = \begin{pmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix}\text{,} \] where we define the multiplication of a matrix and a column vector by the equations above.

Now if we restrict parity numbers to be linear combinations of the data bytes, then we can identify a function ComputeParity using some set of weights with the matrix formed from that set of weights as above. This holds in general: if a function is defined as a list of linear combinations of its inputs, then it can be represented using a matrix as above, and we call it a linear function. Then we have a correspondence between linear functions that take $n$ numbers to $m$ numbers and matrices with $m$ rows and $n$ columns, which are denoted as $m \times n$ matrices.

As the first example of this correspondence, note that we denote the elements of the matrix above as $a_{ij}$, where the first index is the row index and the second index is the column index. Looking back to the parity equations, we also see that the first index corresponds to the output arguments of ComputeParity, and the second index corresponds to the input arguments.^[2]

The usefulness of the correspondence between linear functions and matrices is that we can store and manipulate a linear function by storing and manipulating its corresponding matrix of weights, which you wouldn’t be able to easily do for functions in general. For example, as we’ll see below, we’ll be able to compute the inverse of a linear function by matrix operations, which will be useful for ReconstructData.

First, let’s examine some simple matrix operations and their effects on the corresponding linear function:

Deleting a row of a matrix corresponds to deleting an output of a linear function.
Swapping two rows of a matrix corresponds to swapping two outputs of a linear function.
Appending a row to a matrix corresponds to adding an output to a linear function.

In general, matrix row operations correspond to manipulations of a linear function’s outputs.

An important operation on functions is composition: if $g$ takes $k$ inputs to $m$ outputs, and $f$ takes $m$ inputs to $n$ outputs, then we can define $(f \circ g)(x_0, \dotsc, x_{k-1}) = f(g(x_0, \dotsc, x_{k-1}))$, which takes $k$ inputs to $n$ outputs. It turns out that the composition of two linear functions is again a linear function, and so there must be an operation which takes the corresponding $n \times m$ matrix $F$ and the $m \times k$ matrix $G$ and yields an $n \times k$ matrix. This important operation, the bane of high-schoolers everywhere, is called matrix multiplication, denoted by $F \cdot G$. If $H = F \cdot G$, then the explicit formula for its elements is \[ h_{ij} = \sum_{l=0}^{m-1} f_{il} \cdot g_{lj}\text{,} \] which corresponds to the following code:

matrixMultiply(f: Matrix, g: Matrix) {
  if (f.columns != g.rows) {
    return Error
  }

  h := new Matrix(f.rows, g.columns)
  for i := 0 to f.rows - 1 {
    for j := 0 to g.columns - 1 {
      t := 0
      for k := 0 to f.columns - 1 {
        t += f[i, k] * g[k, j]
      }
      h[i, j] = t
    }
  }
  return h
}

You can convince yourself that the above formula and code is correct by trying to compose some small linear functions by hand.
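
One way to do that check without pencil and paper is the following Python sketch, which compares applying two linear functions in sequence against applying the product of their matrices. The helper names and the sample matrices are ours, chosen only for illustration.

```python
def matrix_multiply(f, g):
    """Return the product f . g of two matrices given as lists of rows."""
    if len(f[0]) != len(g):
        raise ValueError("columns of f must match rows of g")
    return [[sum(f[i][k] * g[k][j] for k in range(len(g)))
             for j in range(len(g[0]))]
            for i in range(len(f))]


def apply(matrix, x):
    """Apply the linear function corresponding to matrix to the input list x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in matrix]


F = [[1, 2], [3, 4]]
G = [[0, 1], [1, 1]]
x = [5, 6]

composed = apply(F, apply(G, x))               # apply g first, then f
via_product = apply(matrix_multiply(F, G), x)  # apply the single matrix F . G
other_order = matrix_multiply(G, F)            # differs from F . G for this pair
```

Both `composed` and `via_product` come out to the same list, while `other_order` differs from `matrix_multiply(F, G)`.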

A useful property of matrix multiplication is that it’s a generalization of the product of a row vector and a column vector, and the product of a matrix and a column vector as we defined above.

I would be remiss if I didn’t talk about the consequences of defining matrix multiplication as the matrix of the composition of the corresponding linear functions. First, this immediately implies that you can only multiply matrices if the left matrix has the same number of columns as the number of rows of the right matrix, which corresponds to the fact that you can only compose functions if the left function takes the same number of inputs as the number of outputs of the right function. Furthermore, even if you have two $n \times n$ matrices $F$ and $G$, unlike numbers, it is not true in general that $F \cdot G = G \cdot F$, which corresponds to the fact that in general, for two functions that take $n$ inputs to $n$ outputs, it is not true that $f \circ g = g \circ f$. If you learned matrix multiplication just from the formula above, then these facts are much less obvious!

Finally, an important function is the identity function $\mathrm{Id}_n$, which returns its $n$ inputs as its outputs. It corresponds to the identity matrix \[ I_n = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix}\text{.} \]

For a linear function $f$ that takes $n$ inputs to $n$ outputs, if there is a function $g$ such that $f \circ g = \mathrm{Id}_n$, then we call $g$ the inverse of $f$, and denote it as $f^{-1}$. (It is also true that $f^{-1} \circ f = \mathrm{Id}_n$, i.e. $(f^{-1})^{-1} = f$.) Not all linear functions taking $n$ inputs to $n$ outputs have inverses, but if the inverse exists, it is also linear (and unique, which is why we call it the inverse). Therefore, we can define the inverse of an $n \times n$ (or square) matrix $M$ as the unique matrix $M^{-1}$ such that $M \cdot M^{-1} = M^{-1} \cdot M = I_n$, if it exists; also, if $M$ has an inverse, we say that $M$ is invertible.

Example 2: The matrix/linear function correspondence

Let \[ M = \begin{pmatrix} 1 & 2 \\ 3 & 4\end{pmatrix}\text{.} \] This corresponds to the linear function

f(x: rational[2]) {
  return [
    1 * x[0] + 2 * x[1],
    3 * x[0] + 4 * x[1],
  ]
}

where rational is an arbitrary-precision rational number type.

$M$ is invertible with inverse \[ M^{-1} = \begin{pmatrix} -2 & 1 \\ 3/2 & -1/2\end{pmatrix}\text{.} \] This corresponds to the linear function

g(y: rational[2]) {
  return [
    -2 * y[0] + 1 * y[1],
    (3/2) * y[0] + (-1/2) * y[1],
  ]
}

so g is the inverse function of f. Indeed, f([5, 6]) is [17, 39] and g([17, 39]) is [5, 6].
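
Here is a runnable version of this example, as a sketch that uses Python’s `fractions.Fraction` as a stand-in for the rational type.

```python
from fractions import Fraction


def f(x):
    """Linear function corresponding to M = [[1, 2], [3, 4]]."""
    return [1 * x[0] + 2 * x[1],
            3 * x[0] + 4 * x[1]]


def g(y):
    """Linear function corresponding to the inverse of M."""
    return [-2 * y[0] + 1 * y[1],
            Fraction(3, 2) * y[0] + Fraction(-1, 2) * y[1]]


forward = f([5, 6])       # [17, 39]
round_trip = g(forward)   # back to [5, 6]
other_way = f(g([7, -3])) # composing in the other order also gives the input back
```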

So now we’ve reduced the problem of finding the inverse of a linear function taking $n$ inputs to $n$ outputs to finding the inverse of an $n \times n$ matrix. Before we tackle the question of computing those inverses, let’s first recast our problem in the language of linear algebra and see why we need to find the inverse of a linear function.

5. Erasure codes in general

So, revisiting our $n = 3, m = 2$ erasure code from above, we have the linear function

ComputeParityWeighted(data: byte[3]) {
  return [
    int(data[0]) +     int(data[1]) +     int(data[2]),
    int(data[0]) + 2 * int(data[1]) + 3 * int(data[2]),
  ]
}

which therefore corresponds to the parity matrix \[ P = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix}\text{.} \] So in mathematical notation, ComputeParityWeighted looks like: \[ \begin{bmatrix} p_0 \\ p_1 \end{bmatrix} = \mathtt{ComputeParityWeighted}(d_0, d_1, d_2) = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix}\text{.} \]

So let’s now reimplement ReconstructDataWeighted using linear algebra. First, append the rows of $P$ to the identity matrix $I_3$ to get the matrix equation \[ \begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ p_0 \\ p_1 \end{bmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix}\text{,} \] which corresponds to a linear function that returns the input data bytes in addition to computing the parity numbers. Now let’s say we lose the data bytes $d_0$ and $d_2$. Then let’s remove the rows corresponding to those bytes: \[ \begin{bmatrix} \xcancel{d_0} \\ d_1 \\ \xcancel{d_2} \\ p_0 \\ p_1 \end{bmatrix} = \begin{pmatrix} \xcancel{1} & \xcancel{0} & \xcancel{0} \\ 0 & 1 & 0 \\ \xcancel{0} & \xcancel{0} & \xcancel{1} \\ 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix}\text{,} \] which turns into \[ \begin{bmatrix} d_1 \\ p_0 \\ p_1 \end{bmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix}\text{,} \] which corresponds to a linear function that maps the input data bytes to the non-lost data bytes and the parity numbers. This is the inverse of the function we want, so we want to invert the $3 \times 3$ matrix above, which we’ll call $M$. That inverse is \[ M^{-1} = \begin{pmatrix} -1/2 & 3/2 & -1/2 \\ 1 & 0 & 0 \\ -1/2 & -1/2 & 1/2 \end{pmatrix}\text{.} \] Multiplying both sides of the equation above on the left by $M^{-1}$, we get \[ \begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix} = \begin{pmatrix} -1/2 & 3/2 & -1/2 \\ 1 & 0 & 0 \\ -1/2 & -1/2 & 1/2 \end{pmatrix} \cdot \begin{bmatrix} d_1 \\ p_0 \\ p_1 \end{bmatrix}\text{,} \] which is exactly what we want: the original data bytes in terms of the known data bytes and the parity numbers!^[3]

Comparing this equation to the one we manually computed previously, they don’t look immediately similar, but some rearrangement will reveal that they indeed compute the same thing. As a sanity check, notice that the second row of $M^{-1}$ means that the first input argument is mapped unchanged to the second output argument, which is exactly what we want for the known byte $d_1$.

Now what does this look like in general, i.e. for arbitrary $n$ and $m$? First, we have to generate an $m \times n$ parity matrix $P$ whose rows have to be “sufficiently different” from each other, which we still have to clarify. Then ComputeParity just multiplies $P$ by $[d]$, the column matrix of input bytes, like so: \[ \begin{bmatrix} p_0 \\ \vdots \\ p_{m-1} \end{bmatrix} = \mathtt{ComputeParity}(d_0, \dotsc, d_{n-1}) = \begin{pmatrix} p_0 \\ \vdots \\ p_{m-1} \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ \vdots \\ d_{n-1} \end{bmatrix}\text{,} \] where the $p_i$ are the rows of $P$.

As for ReconstructData, we first append the rows of $P$ to $I_n$, whose rows we’ll denote as $e_i$: \[ \begin{bmatrix} d_0 \\ \vdots \\ d_{n-1} \\ p_0 \\ \vdots \\ p_{m-1} \end{bmatrix} = \begin{pmatrix} e_0 \\ \vdots \\ e_{n-1} \\ p_0 \\ \vdots \\ p_{m-1} \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ \vdots \\ d_{n-1} \end{bmatrix}\text{.} \] Now assume that the indices of the missing $k \le m$ data bytes are $i_0, \dotsc, i_{k-1}$. Then we remove the rows corresponding to the missing data bytes, and keep some $k$ parity rows, e.g. $p_0$ to $p_{k-1}$. This yields the equation \[ \begin{bmatrix} d_{j_0} \\ \vdots \\ d_{j_{n-k-1}} \\ p_0 \\ \vdots \\ p_{k-1} \end{bmatrix} = \begin{pmatrix} e_{j_0} \\ \vdots \\ e_{j_{n-k-1}} \\ p_0 \\ \vdots \\ p_{k-1} \end{pmatrix} \cdot \begin{bmatrix} d_0 \\ \vdots \\ d_{n-1} \end{bmatrix}\text{,} \] where $j_0, \dotsc, j_{n-k-1}$ are the indices of the present $n - k$ data bytes. Call that $n \times n$ matrix $M$, and compute its inverse $M^{-1}$. If $P$ was chosen correctly, $M^{-1}$ should always exist, so if the inverse computation fails, raise an error. Then ReconstructData just multiplies $M^{-1}$ by the column matrix of present data bytes and chosen parity numbers: \[ \begin{bmatrix} d_0 \\ \vdots \\ d_{n-1} \end{bmatrix} = \mathtt{ReconstructData}(d_{j_0}, \dotsc, d_{j_{n-k-1}}, p_0, \dotsc, p_{k-1}) = M^{-1} \cdot \begin{bmatrix} d_{j_0} \\ \vdots \\ d_{j_{n-k-1}} \\ p_0 \\ \vdots \\ p_{k-1} \end{bmatrix}\text{.} \]

As an optimization, some rows of $M^{-1}$ correspond to just shuffling around the known data bytes $d_{j_*}$, so we can just remove those rows, compute the missing data bytes, and do the shuffling ourselves.
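
The construction of $M$ can be sketched in Python as follows. The function names are ours, and the sketch always keeps the first $k$ parity rows. Since we haven’t covered how to compute inverses yet, it simply reuses the $M^{-1}$ from the $n = 3, m = 2$ example above to check the round trip.

```python
from fractions import Fraction


def build_reconstruction_matrix(parity_matrix, n, missing):
    """Rows of I_n for the present data bytes, then the first len(missing) parity rows."""
    present = [j for j in range(n) if j not in missing]
    identity_rows = [[1 if c == j else 0 for c in range(n)] for j in present]
    return identity_rows + [list(row) for row in parity_matrix[:len(missing)]]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


P = [[1, 1, 1], [1, 2, 3]]
M = build_reconstruction_matrix(P, n=3, missing=[0, 2])

# M^{-1} copied from the worked example; not computed here.
half = Fraction(1, 2)
M_inverse = [[-half, 3 * half, -half],
             [1, 0, 0],
             [-half, -half, half]]

data = [10, 20, 30]
known = mat_vec(M, data)               # [d_1, p_0, p_1]
recovered = mat_vec(M_inverse, known)  # [d_0, d_1, d_2]
```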

So we now have outlines of implementations of both ComputeParity and ReconstructData, but we still have missing pieces. In particular,

How do we compute matrix inverses?
How do we generate “optimal” parity matrices so that $M^{-1}$ always exists?
How do we compute parity bytes instead of parity numbers?

So first, let’s see how to compute matrix inverses using row reduction.

6. Finding matrix inverses using row reduction

We developed the theory of matrices by identifying them with linear functions of numbers. To show how to find matrix inverses, we have to look at them in a slightly different way by identifying matrix equations with systems of linear equations of numbers.

For example, consider the matrix equation \[ M \cdot x = y\text{,} \] where \[ M = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\text{,} \quad x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \text{,} \quad \text{and } y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\text{.} \] This expands to \[ \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\text{,} \] or \[ \begin{aligned} y_1 &= 1 \cdot x_1 + 2 \cdot x_2 \\ y_2 &= 3 \cdot x_1 + 4 \cdot x_2\text{,} \end{aligned} \] which is a linear system of equations of numbers. Letting $M$ be any matrix, and $x$ and $y$ be appropriately-sized column matrices of variables, we can see that the matrix equation $M \cdot x = y$ is shorthand for a system of linear equations of numbers.

If we could find $M^{-1}$, we could solve the matrix equation easily by multiplying both sides by it: \[ \begin{aligned} M^{-1} \cdot (M \cdot x) &= M^{-1} \cdot y \\ x &= M^{-1} \cdot y\text{,} \end{aligned} \] and therefore solve the linear system for $x$ in terms of $y$. Conversely, if we were able to solve the linear system for $x$, we’d then be able to read off $M^{-1}$ from the new linear system.

But how do we solve a linear system? From the theory of linear systems of equations, we have a few tools at our disposal:

swapping two equations,
multiplying an equation by a number,
adding one equation to another, possibly multiplying the equation by a number before adding.

All of these are valid transformations because they don’t change the solution set of the linear system.

For example, in the equation above, the first step would be to subtract $3$ times the first equation from the second equation to yield \[ \begin{aligned} y_1 &= x_1 + 2 \cdot x_2 \\ y_2 - 3 \cdot y_1 &= -2 \cdot x_2\text{.} \end{aligned} \] Then, add the second equation back to the first equation: \[ \begin{aligned} y_2 - 2 \cdot y_1 &= x_1 \\ y_2 - 3 \cdot y_1 &= -2 \cdot x_2\text{.} \end{aligned} \] Finally, divide the second equation by $-2$: \[ \begin{aligned} y_2 - 2 \cdot y_1 &= x_1 \\ (3/2) \cdot y_1 - (1/2) \cdot y_2 &= x_2\text{.} \end{aligned} \] This is equivalent to the matrix equation \[ \begin{pmatrix} -2 & 1 \\ 3/2 & -1/2 \end{pmatrix} \cdot \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\text{,} \] so \[ M^{-1} = \begin{pmatrix} -2 & 1 \\ 3/2 & -1/2 \end{pmatrix}\text{.} \]

So how do we translate the above process to an algorithm operating on matrices? First, express our matrix equation in a slightly different form: \[ M \cdot x = I \cdot y\text{.} \] Using the example above, this is \[ \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \cdot \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\text{.} \] Then, you can see that the first step above corresponds to subtracting $3$ times the first row from the second row to yield: \[ \begin{pmatrix} 1 & 2 \\ 0 & -2 \end{pmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{pmatrix} 1 & 0 \\ -3 & 1 \end{pmatrix} \cdot \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\text{.} \] We don’t even need to keep writing the $x$ and $y$ column matrices; we can just write the “augmented” matrix \[ A = \left( \hskip -5pt \begin{array}{cc|cc} 1 & 2 & 1 & 0 \\ 0 & -2 & -3 & 1 \end{array} \hskip -5pt \right) \] and operate on it.

Thus, the operations listed above on linear systems have corresponding operations on augmented matrices:

swapping two equations corresponds to swapping two rows;
multiplying an equation by a number corresponds to multiplying a row by a number; and
adding an equation to another, possibly multiplying the equation by a number before adding, corresponds to adding a row to another row, possibly multiplying the row by a number before adding.

Then, the goal is to use these row operations to transform the initial augmented matrix, where the right side looks like the identity matrix, into one where the left side looks like the identity matrix. Then, translating the augmented matrix back into a matrix equation, that would give $M^{-1}$ on the right side.^[4]

When doing this by hand, one usually works with the linear system itself, trying to see which variables can be easily eliminated so as to minimize arithmetic. However, to translate this to an algorithm, we’re more interested in a systematic way of doing this. Fortunately, there’s an easy two-step process:

Turn the left side of $A$ into a unit upper triangular matrix, which means that all the elements on the main diagonal are $1$, and all elements below the main diagonal are $0$, i.e. that $a_{ii} = 1$ for all $i$, and $a_{ij} = 0$ for all $i > j$.
Then turn the left side of $A$ into the identity matrix.

This algorithm is called row reduction. The first step can be further broken down:

For each column $i$ of the left side in ascending order:
1. If $a_{ii}$ is zero, look at the rows below the $i$th row for a row $j > i$ such that $a_{ji} \ne 0$, and swap rows $i$ and $j$. If no such row exists, return an error, as that means that $M$ is non-invertible.
2. Divide the $i$th row by $a_{ii}$, so that $a_{ii} = 1$.
3. For each row $j > i$, subtract $a_{ji}$ times the $i$th row from it, which will set $a_{ji}$ to $0$.

The second step can be similarly broken down:

For each column $i$ of the left side, in order:
1. For each row $j < i$, subtract $a_{ji}$ times the $i$th row from it, which will set $a_{ji}$ to $0$.

Note that we’re assuming that all arithmetic is exact, i.e. we use an arbitrary-precision rational number type. If we use floating point numbers, we’d have to worry a lot more about the order in which we do operations and numerical stability.

Example 3: Matrix inversion via row reduction

Let

    / 0 2 2 \
M = | 3 4 5 |
    \ 6 6 7 /.

The initial augmented matrix A is

/ 0 2 2 | 1 0 0 \
| 3 4 5 | 0 1 0 |
\ 6 6 7 | 0 0 1 /.

We need A₀₀ to be non-zero, so swap rows 0 and 1:

/ 0 2 2 | 1 0 0 \     / 3 4 5 | 0 1 0 \
| 3 4 5 | 0 1 0 | --> | 0 2 2 | 1 0 0 |
\ 6 6 7 | 0 0 1 /     \ 6 6 7 | 0 0 1 /.

We need A₀₀ to be 1, so divide row 0 by 3:

/ 3 4 5 | 0 1 0 \     / 1 4/3 5/3 | 0 1/3 0 \
| 0 2 2 | 1 0 0 | --> | 0  2   2  | 1  0  0 |
\ 6 6 7 | 0 0 1 /     \ 6  6   7  | 0  0  1 /.

We need A₂₀ to be 0, so subtract row 0 scaled by 6 from row 2:

/ 1 4/3 5/3 | 0 1/3 0 \     / 1 4/3 5/3 | 0 1/3 0 \
| 0  2   2  | 1  0  0 | --> | 0  2   2  | 1  0  0 |
\ 6  6   7  | 0  0  1 /     \ 0 -2  -3  | 0 -2  1 /.

We need A₁₁ to be 1, so divide row 1 by 2:

/ 1 4/3 5/3 |  0  1/3 0 \     / 1 4/3 5/3 |  0  1/3 0 \
| 0  2   2  |  1   0  0 | --> | 0  1   1  | 1/2  0  0 |
\ 0 -2  -3  |  0  -2  1 /     \ 0 -2  -3  |  0  -2  1 /.

We need A₂₁ to be 0, so subtract row 1 scaled by −2 from row 2:

/ 1 4/3 5/3 |  0  1/3 0 \     / 1 4/3 5/3 |  0  1/3 0 \
| 0  1   1  | 1/2  0  0 | --> | 0  1   1  | 1/2  0  0 |
\ 0 -2  -3  |  0  -2  1 /     \ 0  0  -1  |  1  -2  1 /.

We need A₂₂ to be 1, so divide row 2 by −1, which makes the left side of A a unit upper triangular matrix:

/ 1 4/3 5/3 |  0  1/3 0 \     / 1 4/3 5/3 |  0  1/3 0 \
| 0  1   1  | 1/2  0  0 | --> | 0  1   1  | 1/2  0  0 |
\ 0  0  -1  |  1  -2  1 /     \ 0  0   1  | -1   2 -1 /.

We need A₁₂ to be 0, so subtract row 2 from row 1:

/ 1 4/3 5/3 |  0  1/3 0 \     / 1 4/3 5/3 |  0  1/3 0 \
| 0  1   1  | 1/2  0  0 | --> | 0  1   0  | 3/2 -2  1 |
\ 0  0   1  | -1   2 -1 /     \ 0  0   1  | -1   2 -1 /.

We need A₀₂ to be 0, so subtract row 2 scaled by 5/3 from row 0:

/ 1 4/3 5/3 |  0  1/3 0 \     / 1 4/3  0  | 5/3 -3 5/3 \
| 0  1   0  | 3/2 -2  1 | --> | 0  1   0  | 3/2 -2  1  |
\ 0  0   1  | -1   2 -1 /     \ 0  0   1  | -1   2 -1  /.

We need A₀₁ to be 0, so subtract row 1 scaled by 4/3 from row 0, which makes the left side of A the identity matrix:

/ 1 4/3  0  | 5/3  -3 5/3 \     / 1  0   0  | -1/3 -1/3 1/3 \
| 0  1   0  | 3/2 -2   1  | --> | 0  1   0  |  3/2  -2   1  |
\ 0  0   1  | -1   2  -1  /     \ 0  0   1  |  -1    2  -1  /.

Since the left side of A is the identity matrix, the right side of A is M^{-1}. Therefore,

         / -1/3 -1/3 1/3 \
M^{-1} = |  3/2  -2   1  |
         \  -1    2  -1  /.
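
The two-step procedure can be sketched in Python like so. This is a minimal sketch rather than a tuned implementation: the name `invert_matrix` is ours, exact arithmetic comes from `fractions.Fraction`, and the second step walks the columns in descending order, as in Example 3.

```python
from fractions import Fraction


def invert_matrix(m):
    """Invert a square matrix (list of rows) by row reduction over the rationals."""
    n = len(m)
    # Build the augmented matrix (M | I).
    a = [[Fraction(v) for v in row] + [Fraction(int(r == c)) for c in range(n)]
         for r, row in enumerate(m)]

    # Step 1: turn the left side into a unit upper triangular matrix.
    for i in range(n):
        # Use row i if a[i][i] is non-zero, else the first suitable row below it.
        pivot = next((r for r in range(i, n) if a[r][i] != 0), None)
        if pivot is None:
            raise ValueError("matrix is not invertible")
        a[i], a[pivot] = a[pivot], a[i]
        pivot_value = a[i][i]
        a[i] = [v / pivot_value for v in a[i]]
        for j in range(i + 1, n):
            factor = a[j][i]
            a[j] = [vj - factor * vi for vj, vi in zip(a[j], a[i])]

    # Step 2: clear the entries above the main diagonal.
    for i in range(n - 1, -1, -1):
        for j in range(i):
            factor = a[j][i]
            a[j] = [vj - factor * vi for vj, vi in zip(a[j], a[i])]

    # The right side of the augmented matrix is now the inverse.
    return [row[n:] for row in a]


M = [[0, 2, 2], [3, 4, 5], [6, 6, 7]]
M_inverse = invert_matrix(M)
```

Running this on the matrix from Example 3 should reproduce the inverse computed by hand above, and a matrix with proportional rows such as `[[1, 2], [2, 4]]` should raise the error.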

Now notice one thing: if $M$ has a row that is proportional to another row, then row reduction would eventually zero out one of the rows, causing the algorithm to fail, and signaling that $M$ is non-invertible. In fact, a stronger statement is true: $M$ has a row that can be expressed as a linear combination of other rows of $M$ exactly when $M$ is non-invertible. Informally, this means that the linear function corresponding to $M$ has one of its outputs redundant with the other outputs, so it is essentially a linear function taking $n$ inputs to fewer than $n$ outputs, and such functions aren’t invertible.

This gets us a partial explanation for what “sufficiently different” means for our parity functions. If one parity function is a linear combination of other parity functions, then it is redundant, and therefore not “sufficiently different”. Therefore, we want our parity matrix $P$ to be such that no row can be expressed as a linear combination of other rows.

However, this criterion for $P$ isn’t quite enough to guarantee that all possible matrices $M$ computed as part of ReconstructData are invertible. For example, this criterion holds for the identity matrix $I_n$, but if $n > 1$ and you pick $I_n$ as the parity matrix for $n = m$, you can certainly end up with a constructed matrix $M$ with repeated rows, since you’re starting by appending the rows of $P = I_n$ to another copy of $I_n$! This explains in a different way why simply making a copy of the original data files makes for a poor erasure code, unless of course you only have one data file. We’re led to our next topic: what makes a parity matrix “optimal”?

7. Optimal parity matrices

Recall from above that we form the square matrix \[ M = \begin{pmatrix} e_{j_0} \\ \vdots \\ e_{j_{n-k-1}} \\ p_0 \\ \vdots \\ p_{k-1} \end{pmatrix} \] by prepending some rows of the identity matrix to the first $k$ rows of the parity matrix. We can generalize this a bit more, since we don’t have to take the first $k$ rows, but instead can take any $k$ rows of the parity matrix, whose indices we denote here as $l_0, \dotsc, l_{k-1}$: \[ M = \begin{pmatrix} e_{j_0} \\ \vdots \\ e_{j_{n-k-1}} \\ p_{l_0} \\ \vdots \\ p_{l_{k-1}} \end{pmatrix}\text{.} \] So we want to construct $P$ so that any such square matrix $M$ formed from the rows of $P$ is invertible. Therefore, we call a parity matrix $P$ optimal if it satisfies this criterion.

Fortunately, there is a simpler criterion for optimal parity matrices. First, define a submatrix of a matrix $P$ to be a matrix that you get by deleting any number of rows or columns, and call a matrix non-empty if it has at least one row and one column. Then:

(Theorem 1.) A parity matrix $P$ is optimal exactly when any non-empty square submatrix of $P$ is invertible.^[5]

Note that this criterion is stronger than the one in the previous section, where we want a parity matrix $P$ to have no row that can be expressed as a linear combination of other rows. That is, if any non-empty square submatrix of $P$ is invertible, that means that no row can be expressed as a linear combination of other rows.^[6] However, it is possible to have a matrix $P$ where no row can be expressed as a linear combination of other rows, but which is not optimal. We’ve already seen an example above: $I_n$ for $n \gt 1$, and indeed, \[ I_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\text{,} \] has the $1 \times 1$ non-invertible submatrix $\begin{pmatrix} 0 \end{pmatrix}$.

Example 4: An optimal parity matrix for $m = 2$

Recall the parity matrix \[ P = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix} \] that we were using for our $n = 3, m = 2$ example. For any $n$, this matrix looks like \[ P = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 2 & \cdots & n \end{pmatrix}\text{.} \] A $1 \times 1$ matrix is invertible exactly when its single element is non-zero, so any $1 \times 1$ submatrix of $P$ is invertible. Any $2 \times 2$ submatrix of $P$ looks like \[ A = \begin{pmatrix} 1 & 1 \\ a & b \end{pmatrix} \] for $a \ne b$, which, using the formula for inverses of $2 \times 2$ matrices, has inverse \[ A^{-1} = \begin{pmatrix} b/(b-a) & -1/(b-a) \\ -a/(b-a) & 1/(b-a) \end{pmatrix}\text{.} \] These are all the possible square submatrices of $P$, so therefore this $P$ is an optimal parity matrix for $m = 2$.
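
For small matrices, the criterion of Theorem 1 can also be checked by brute force. The sketch below (function names are ours) enumerates every square submatrix and relies on the standard fact, not proved in this article, that a square matrix is invertible exactly when its determinant is non-zero. It is exponential in the matrix size, so it is only meant for toy examples.

```python
from itertools import combinations


def determinant(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for c, value in enumerate(m[0]):
        minor = [row[:c] + row[c + 1:] for row in m[1:]]
        total += (-1) ** c * value * determinant(minor)
    return total


def is_optimal(p):
    """Check that every non-empty square submatrix of p has non-zero determinant."""
    rows, cols = len(p), len(p[0])
    for size in range(1, min(rows, cols) + 1):
        for r in combinations(range(rows), size):
            for c in combinations(range(cols), size):
                sub = [[p[i][j] for j in c] for i in r]
                if determinant(sub) == 0:
                    return False
    return True


weighted_ok = is_optimal([[1, 1, 1], [1, 2, 3]])    # the matrix from this example
identity_ok = is_optimal([[1, 0], [0, 1]])          # fails: contains a zero entry
redundant_ok = is_optimal([[1, 1, 1], [2, 2, 2]])   # fails: proportional rows
```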

Then, finally, we now have a complete definition of what makes a list of parity functions “sufficiently different”; it is exactly when the corresponding parity matrix is optimal as we’ve defined it above.

Now this leads us to the question: how do we find such optimal matrices? Fortunately, there’s a whole class of matrices that are optimal: the Cauchy matrices.

Let $a_0, \dotsc, a_{m+n-1}$ be a sequence of distinct integers, meaning that no two $a_i$ are equal, and let $x_0, \dotsc, x_{m-1}$ be the first $m$ integers of $a_i$ with $y_0, \dotsc, y_{n-1}$ the remaining integers. Then form the $m \times n$ matrix $A$ by setting its elements according to: \[ a_{ij} = \frac{1}{x_i - y_j}\text{,} \] which is always defined since the denominator is never zero, by the distinctness of the $a_i$. Then $A$ is a Cauchy matrix.

What makes Cauchy matrices useful is the following theorem:

(Theorem 2.) Any non-empty square Cauchy matrix is invertible.

Combining this with the simple fact that any submatrix of a Cauchy matrix is also a Cauchy matrix, we get:

(Corollary 1.) Any non-empty square submatrix of a Cauchy matrix is invertible, and thus any Cauchy parity matrix is optimal.

Example 5: Cauchy matrices

Let x = [ 1, 2, 3 ] and y = [ -1, 4, 0 ]. Then, the Cauchy matrix constructed from x and y is

/ 1/2 -1/3  1  \
| 1/3 -1/2 1/2 |
\ 1/4  -1  1/3 /,

which has inverse

/ -36/5 96/5 -36/5 \
| -3/10  9/5  -9/5 |
\  9/2  -9     3   /.
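
This example can be reproduced with a short Python sketch. The helper names are ours; the inverse is copied from above and only verified by multiplication, not computed.

```python
from fractions import Fraction


def cauchy_matrix(x, y):
    """Build the Cauchy matrix with entries 1 / (x_i - y_j)."""
    if len(set(x) | set(y)) != len(x) + len(y):
        raise ValueError("x and y together must consist of distinct values")
    return [[Fraction(1, xi - yj) for yj in y] for xi in x]


def matrix_multiply(f, g):
    """Return the product f . g of two matrices given as lists of rows."""
    return [[sum(f[i][k] * g[k][j] for k in range(len(g)))
             for j in range(len(g[0]))]
            for i in range(len(f))]


A = cauchy_matrix([1, 2, 3], [-1, 4, 0])

# The inverse stated in the example.
A_inverse = [
    [Fraction(-36, 5), Fraction(96, 5), Fraction(-36, 5)],
    [Fraction(-3, 10), Fraction(9, 5), Fraction(-9, 5)],
    [Fraction(9, 2), Fraction(-9), Fraction(3)],
]

product = matrix_multiply(A, A_inverse)  # should be the 3 x 3 identity matrix
```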

Therefore, to generate an optimal parity matrix for any $(n, m)$, all we need to do is to generate an $m \times n$ Cauchy matrix. We can pick any sequence of distinct $m + n$ integers, so for simplicity let’s just use \[ x_i = n + i \quad \text{and} \quad y_i = i\text{.} \]

Example 6: Cauchy parity matrices for $m = 2$

For $n = 3, m = 2$, we have the sequences \[ x_0 = 3, x_1 = 4 \quad \text{and} \quad y_0 = 0, y_1 = 1, y_2 = 2\text{,} \] so the corresponding Cauchy parity matrix is \[ P = \begin{pmatrix} 1/3 & 1/2 & 1 \\ 1/4 & 1/3 & 1/2 \end{pmatrix}\text{.} \] Similarly, for any $n$, \[ P = \begin{pmatrix} 1/n & \cdots & 1/2 & 1 \\ 1/(n + 1) & \cdots & 1/3 & 1/2 \end{pmatrix}\text{.} \] All entries of $P$ are non-zero, so any $1 \times 1$ submatrix of $P$ is invertible. Any $2 \times 2$ submatrix of $P$ looks like \[ A = \begin{pmatrix} 1/a & 1/b \\ 1/(a+1) & 1/(b+1) \end{pmatrix} \] for $a \ne b$, which, using the formula for inverses of $2 \times 2$ matrices, has inverse \[ A^{-1} = \begin{pmatrix} \frac{ab(a+1)}{b-a} & -\frac{a(a+1)(b+1)}{b-a} \\ -\frac{ab(b+1)}{b-a} & \frac{b(a+1)(b+1)}{b-a} \end{pmatrix}\text{.} \] These are all the possible square submatrices of $P$, so therefore this $P$ is an optimal parity matrix for $m = 2$.

Note that our first parity matrix for $n = 3, m = 2$ isn’t a Cauchy matrix, since no Cauchy matrix can have repeating elements in a single row. That means that there are other possible optimal parity matrices that aren’t Cauchy matrices.^[7]

Also, our previous parity matrices had integers, and Cauchy matrices have rational numbers (i.e., fractions). This means that our parity numbers are now fractions. This isn’t a serious difference, though, since we’d have to deal with fractions when computing matrix inverses anyway. You could also change a parity matrix with fractions into one without by simply multiplying the entire matrix by some non-zero number that gets rid of all the fractions, which doesn’t change the optimality of the matrix. For example, we can multiply \[ \begin{pmatrix} 1/3 & 1/2 & 1 \\ 1/4 & 1/3 & 1/2 \end{pmatrix} \] by $12$ to get the equivalent parity matrix \[ \begin{pmatrix} 4 & 6 & 12 \\ 3 & 4 & 6 \end{pmatrix}\text{.} \]
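
Both the choice $x_i = n + i$, $y_i = i$ and the fraction-clearing step can be sketched in Python. The function names are ours, and scaling by the least common multiple of the denominators is just one convenient choice of non-zero multiplier.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def cauchy_parity_matrix(n, m):
    """m x n Cauchy matrix built from x_i = n + i and y_j = j."""
    return [[Fraction(1, (n + i) - j) for j in range(n)] for i in range(m)]


def clear_fractions(p):
    """Scale p by the least common multiple of its denominators."""
    denominators = [v.denominator for row in p for v in row]
    scale = reduce(lambda a, b: a * b // gcd(a, b), denominators, 1)
    return [[int(v * scale) for v in row] for row in p]


P = cauchy_parity_matrix(3, 2)
P_int = clear_fractions(P)
```

For $n = 3, m = 2$, `P_int` comes out as the integer matrix obtained by multiplying by $12$ above.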

Now our only remaining missing piece is this: how do we compute parity bytes instead of parity numbers? Answering this would render the above discussion moot. However, to do so, we first have to take another look at how we’re doing linear algebra.

8. Linear algebra over fields

We ultimately want our parity numbers to be parity bytes, which means that we want to work with matrices of bytes instead of matrices of rational numbers. In order to do that, we need to define an interface for matrix elements that preserves the operations and properties we care about, and then we have to figure out how to implement that interface using bytes.

Looking at the rule for matrix multiplication, we need to be able to add and multiply matrix elements. Looking at how we do matrix inversion, we also need to be able to subtract and divide matrix elements. Finally, there are certain properties that hold for rational numbers that we implicitly assume when doing matrix operations, but that we have to make explicit for matrix elements.

This leads us to the concept of a field, which essentially defines the interface that matrix elements should implement. Here it is:

interface Field<T> {
  static Zero: T, One: T

  plus(b: T): T
  negate(): T

  times(b: T): T
  reciprocate(): T

  equals(b: T): bool

  minus(b: T) = this.plus(b.negate())
  dividedBy(b: T) = this.times(b.reciprocate())
}

We need to be able to add and multiply field elements, which we’ll denote generically by $\oplus$ and $\otimes$. We also need to be able to take the negation (additive inverse) of an element $x$, which we’ll denote by $-x$, and the reciprocal (multiplicative inverse) of a non-zero element $x$, which we’ll denote by $x^{-1}$. Then we can define subtraction of field elements to be \[ a \ominus b = a \oplus -b \] and division of field elements to be \[ a \cldiv b = a \otimes b^{-1}\text{,} \] when $b \ne 0$.

Also, an implementation of Field must satisfy further properties, which are copied from the number laws you learn in school:

Identities: $a \oplus 0 = a \otimes 1 = a$.
Inverses: $a \oplus -a = 0$, and for $a \ne 0$, $a \otimes a^{-1} = 1$.
Associativity: $(a \oplus b) \oplus c = a \oplus (b \oplus c)$, and $(a \otimes b) \otimes c = a \otimes (b \otimes c)$.
Commutativity: $a \oplus b = b \oplus a$, and $a \otimes b = b \otimes a$.
Distributivity: $a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c)$.

Of the above, guaranteeing the existence of reciprocals of non-zero elements is usually the non-trivial part. Now the rational numbers satisfy all of the above, since for non-zero $p/q$, \[ (p/q)^{-1} = q/p\text{,} \] so we say that they form a field. However, the integers do not form a field, since for example $2$ has no integer reciprocal; only $1$ and $-1$ have integer reciprocals. Furthermore, as we saw earlier, the integers modulo $256$, i.e. the numbers from $0$ to $255$ with standard arithmetic operations modulo $256$, do not form a field, since $(2 \cdot b) \bmod 256 \ne 1$ for any $b$.

However, we can construct a field with $257$ elements, using the fact that $257$ is a prime number, and the following theorem:

(Theorem 3.) Given a prime number $p$, for every integer $0 \lt a \lt p$, there is exactly one $0 \lt b \lt p$ such that $(a \cdot b) \bmod p = 1$.

There are clever ways to find multiplicative inverses mod $p$, but since $257$ is so small, we can just brute-force it. So an implementation would look like:

class Field257Element : implements Field<Field257Element> {
  plus(b) { return (this + b) % 257 }
  negate() { return (257 - this) % 257 }
  times(b) { return (this * b) % 257 }
  reciprocate() {
    if (this == 0) { return Error }
    // Brute force: try every candidate i until this * i == 1.
    for i := 0 to 256 {
      if (this.times(i) == 1) { return i; }
    }
    return Error
  }
  ...
}

Example 7: Field with 257 elements

Denote operations on the field with 257 elements by a ₂₅₇ subscript, and let a = 23 and b = 54. Then

a +₂₅₇ b = (23 + 54) mod 257 = 77;
−₂₅₇b = (257 − 54) mod 257 = 203;
a −₂₅₇ b = a +₂₅₇ −₂₅₇b = (23 + 203) mod 257 = 226;
a ×₂₅₇ b = (23 × 54) mod 257 = 214;
54 ×₂₅₇ 119 = 1, so b^-1₂₅₇ = 119;
a ÷₂₅₇ b = a ×₂₅₇ b^-1₂₅₇ = (23 × 119) mod 257 = 167, and indeed b ×₂₅₇ (a ÷₂₅₇ b) = (54 × 167) mod 257 = 23 = a.
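The arithmetic in Example 7 can be reproduced with a few lines of Python. This is only a sketch using plain integers and free functions rather than a class implementing the `Field` interface; the function names are chosen for this illustration.

```python
P = 257  # the prime modulus from the text

def plus(a, b):
    return (a + b) % P

def negate(a):
    return (P - a) % P

def times(a, b):
    return (a * b) % P

def reciprocate(a):
    """Brute-force search for the unique b with (a * b) mod P == 1."""
    if a % P == 0:
        raise ZeroDivisionError("0 has no reciprocal")
    for b in range(1, P):
        if times(a, b) == 1:
            return b
    raise AssertionError("unreachable when P is prime")

def minus(a, b):
    return plus(a, negate(b))

def divided_by(a, b):
    return times(a, reciprocate(b))

a, b = 23, 54
print(plus(a, b), negate(b), minus(a, b), times(a, b), reciprocate(b), divided_by(a, b))
```

The printed values should match the ones worked out by hand in Example 7.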

So this gets us closer, since we can use Field257Element instead of a rational number type when implementing ComputeParity and ReconstructData, and if we’ve abstracted our Matrix type correctly, almost everything should just work. However, there is one thing we need to check: Are Cauchy parity matrices still optimal if we use fields other than the rational numbers? Fortunately, the answer is yes:

(Theorem 1, general version.) A parity matrix $P$ over any field is optimal exactly when any non-empty square submatrix of $P$ is invertible.

(Theorem 2, general version.) Any non-empty square Cauchy matrix over any field is invertible.

(Corollary 1, general version.) Any square submatrix of a Cauchy matrix over any field is invertible, and thus any Cauchy parity matrix over any field is optimal.

However, note that to construct an $m \times n$ Cauchy matrix, we need $m + n$ distinct elements. So if we’re working with a field with $257$ elements, then this imposes the condition that $m + n \le 257$, i.e. using a finite field limits the number of data bytes and parity numbers you can have.

Now the question remains: can we construct a field with $256$ elements? As we saw above, we can’t do so the same way as we constructed the field with $257$ elements. In fact, we need to start with defining different arithmetic operations on the integers. This brings us to the topic of binary carry-less arithmetic.

9. Binary carry-less arithmetic

The basic idea with binary carry-less (which I’ll henceforth shorten to “carry-less”) arithmetic is to express all integers in binary, then perform all arithmetic operations using binary arithmetic, except ignoring all the carries.^[8]

How does this work with addition? Let’s denote binary carry-less add as $\clplus$,^[9] and let’s see how it behaves on single binary digits: \[ \begin{aligned} 0 \clplus 0 &= 0 \\ 0 \clplus 1 &= 1 \\ 1 \clplus 0 &= 1 \\ 1 \clplus 1 &= 0\text{.} \end{aligned} \] This is just the exclusive or operation on bits, so if we do carry-less addition on any two integers, it turns out to be nothing but xor! Since xor can also be denoted by $\clplus$, we can conveniently think of $\clplus$ as meaning both carry-less addition and xor.

Example 8: Carry-less addition

Let a = 23 and b = 54. Then, with carry-less arithmetic,

  a = 23 =  10111b
^ b = 54 = 110110b
           -------
           100001b

so a ⊕ b = 100001_b = 33.

What about subtraction? Recall that $(a \clplus b) \clplus b = a$ for any $a$ and $b$. Therefore, every element $b$ is its own (carry-less binary) additive inverse, which means that $a \clminus b = a \clplus b$, i.e. carry-less subtraction is also just xor.

Carry-less multiplication isn’t as simple, but recall that binary multiplication is just adding shifted copies of $a$ based on which bits are set in $b$ (or vice versa). To do carry-less multiplication, just ignore carries when adding the shifted copies again, i.e. xor shifted copies instead of adding them.

Example 9: Carry-less multiplication

Let a = 23 and b = 54. Then, with carry-less arithmetic,

   a = 23 =       10111b
^* b = 54 =      110110b
            ------------
                 10111
          ^     10111
          ^   10111
          ^  10111
            ------------
             1111100010b

so a ⊗ b = 1111100010_b = 994.
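As a sketch of how the shift-and-xor procedure might look in code (the name `clmul` follows the pseudocode used later in this article; the implementation itself is an illustration, not a reference one):

```python
def clmul(a, b):
    """Carry-less product: xor together shifted copies of a, one per set bit of b."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # "add" this shifted copy without carries
        a <<= 1
        b >>= 1
    return result

print(23 ^ 54)        # carry-less addition is just xor
print(clmul(23, 54))  # carry-less multiplication
```

Carry-less addition needs no helper at all, since Python's `^` operator already is xor.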

Finally, we can define carry-less division with remainder. Binary division with remainder is subtracting shifted copies of $b$ from $a$ until you get a remainder less than the divisor; then carry-less binary division with remainder is xor-ing shifted copies of $b$ with $a$ until you get a remainder. However, there’s a subtlety; with carry-less arithmetic, it’s not enough to stop when the remainder (for that step) is less than the divisor, because if the highest set bit of the remainder is the same as the highest set bit of the divisor, you can still xor with the divisor one more time to yield a smaller number (which then becomes the final remainder).

Consider the example below, where we’re dividing $55$ by $19$. The first remainder is $17$, which is less than $19$, but still shares the same highest set bit, so we can xor one more time with $19$ to get the remainder $2$.

Example 10: Carry-less division

Let a = 55 and b = 19. Then, with carry-less arithmetic,

                     11b
                --------
b = 19 = 10011b )110111b = 55 = a
               ^ 10011
                 -----
                  10001
                ^ 10011
                  -----
                     10b

so a ⨸ b = 11_b = 3 with remainder 10_b = 2.

This leads to an interesting difference between the carry-less modulo operation and the standard modulo operation. If you mod by a number $n$, you get $n$ possible remainders, from $0$ to $n - 1$. However, if you clmod (carry-less mod) by a number $2^k \le n \lt 2^{k+1}$, you get $2^k$ possible remainders, from $0$ to $2^k-1$, since those are the numbers whose highest set bit is lower than the highest set bit of $n$.
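Here is a small Python sketch of carry-less division with remainder, written to keep xor-ing shifted copies of the divisor for as long as the remainder's highest set bit is at least as high as the divisor's. The name `cldivmod` is made up for this example.

```python
def cldivmod(a, b):
    """Carry-less division with remainder: returns (quotient, remainder)."""
    if b == 0:
        raise ZeroDivisionError("carry-less division by zero")
    quotient = 0
    # Continue while the remainder's highest set bit is at or above b's,
    # even if the remainder is already numerically smaller than b.
    while a.bit_length() >= b.bit_length():
        shift = a.bit_length() - b.bit_length()
        a ^= b << shift
        quotient ^= 1 << shift
    return quotient, a

print(cldivmod(55, 19))
# The possible remainders clmod 19 are exactly 0..15, not 0..18.
print(sorted({cldivmod(a, 19)[1] for a in range(1000)}))
```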

In particular, if you clmod by a number $256 \le n < 512$, you always get $256$ possible remainders. This is very close to what we want—now the hope is to find some $256 \le n < 512$ so that doing binary carry-less arithmetic clmod $n$ yields a field, which will then be a field with $256$ elements!

10. The finite field with $256$ elements

Since there are only a few numbers between $256$ and $512$, we can just try each one of them to see if clmod-ing by one of them yields a field. However, with a bit of math, we can gain more insight into which numbers will work.

Recall the situation with the standard arithmetic operations: arithmetic mod $p$ yields a field exactly when $p$ is prime.^[10] But recall the definition of a prime number: it is an integer greater than $1$ whose positive divisors are only itself and $1$. Stated another way, a prime number is an integer $p \gt 1$ that cannot be expressed as $p = a \cdot b$, for $a, b \gt 1$.

Thus, the concept of a prime number is determined by the multiplication operation, and therefore we can define a “carry-less” prime number to be an integer $ p \gt 1$ that cannot be expressed as $p = a \clmul b$, for $a, b \gt 1$.^[11]

The only question remaining is whether there is an equivalent of Theorem 3 for carry-less arithmetic. And indeed there is:

(Theorem 4.) Given a carry-less prime number $2^k \le p \lt 2^{k+1}$, for every integer $0 \lt a \lt 2^k$, there is exactly one $0 \lt b \lt 2^k$ such that $(a \clmul b) \bclmod p = 1$.

Now we just need to find a carry-less prime number $256 \le p < 512$. However, the set of prime numbers and the set of carry-less prime numbers are not necessarily related, so for example, even though $257$ is a prime number, it is not a carry-less prime number.

It is easy enough to test each number $256 \le n < 512$ for carry-less primality though; doing so, we find the lowest one, $283$.^[12]
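One way to run that search is sketched below in Python. It relies on the observation that in any factorization $p = a \clmul b$ with $a, b \gt 1$, both factors have a lower highest set bit than $p$, so it suffices to try carry-less division by those candidates. The names `clmod` and `is_carryless_prime` are chosen for this illustration.

```python
def clmod(a, b):
    """Remainder of the carry-less division of a by b (b > 0)."""
    while a.bit_length() >= b.bit_length():
        a ^= b << (a.bit_length() - b.bit_length())
    return a

def is_carryless_prime(p):
    """True if p > 1 cannot be written as a clmul b with a, b > 1."""
    if p < 2:
        return False
    # Only candidates with a lower highest set bit than p can be proper factors.
    return all(clmod(p, a) != 0 for a in range(2, 1 << (p.bit_length() - 1)))

primes = [p for p in range(256, 512) if is_carryless_prime(p)]
print(primes[0], len(primes), is_carryless_prime(257))
```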

So finally, we have a field with $256$ elements: the integers with binary carry-less arithmetic clmod $283$. An implementation would look like:

class Field256Element : implements Field<Field256Element> {
  plus(b) { return this ^ b }
  negate() { return this }
  times(b) { return clmod(clmul(this, b), 283) }
  reciprocate() {
    if (this == 0) { return Error }
    for i := 0 to 255 {
      if (this.times(i) == 1) { return i; }
    }
    return Error
  }
  ...
}

Similarly to how we find reciprocals mod $257$, we brute-force finding reciprocals clmod $283$ also.

Example 11: Field with 256 elements

Denote operations on the field with 256 elements by a ₂₅₆ subscript, and let a = 23 and b = 54. Then

a ⊕₂₅₆ b = 23 ⊕ 54 = 33;
⊖₂₅₆b = b = 54;
a ⊖₂₅₆ b = a ⊕₂₅₆ ⊖₂₅₆b = a ⊕₂₅₆ b = 33;
a ⊗₂₅₆ b = (23 ⊗ 54) clmod 283 = 207;
54 ⊗₂₅₆ 102 = 1, so b^-1₂₅₆ = 102;
a ⨸₂₅₆ b = a ⊗₂₅₆ b^-1₂₅₆ = (23 ⊗ 102) clmod 283 = 19, and indeed b ⊗₂₅₆ (a ⨸₂₅₆ b) = (54 ⊗ 19) clmod 283 = 23 = a.
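The same operations can be sketched in runnable Python, mirroring the pseudocode above with free functions on plain integers. The helper names are illustrative, and the brute-force reciprocal is kept for clarity rather than speed.

```python
MODULUS = 283  # the carry-less prime chosen in the text

def clmul(a, b):
    """Carry-less product of a and b."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def clmod(a, b):
    """Carry-less remainder of a divided by b."""
    while a.bit_length() >= b.bit_length():
        a ^= b << (a.bit_length() - b.bit_length())
    return a

def times(a, b):
    return clmod(clmul(a, b), MODULUS)

def reciprocate(a):
    """Brute-force search for b with a times b == 1."""
    if a == 0:
        raise ZeroDivisionError("0 has no reciprocal")
    for b in range(1, 256):
        if times(a, b) == 1:
            return b
    raise AssertionError("unreachable when MODULUS is a carry-less prime")

def divided_by(a, b):
    return times(a, reciprocate(b))

a, b = 23, 54
print(a ^ b, times(a, b), reciprocate(b), divided_by(a, b))
```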

11. The full algorithm

Now we have all the pieces we need to construct erasure codes for any $(n, m)$ such that $m + n \le 256$. First, we can compute an $m \times n$ Cauchy parity matrix over the field with $256$ elements. (Recall that this needs $m + n$ distinct field elements, which is what imposes the condition $m + n \le 256$.)

Example 12: Cauchy matrices in general

Working over the field with 256 elements, let x = [ 1, 2, 3 ] and y = [ 4, 5, 6 ]. Then, the Cauchy matrix constructed from x and y is

/  82 203 209 \
| 123 209 203 |
\ 209 123  82 /,

which has inverse

/ 130  31 176 \
| 252 219  31 |
\ 108 252 130 /.

Then we can implement matrix multiplication over arbitrary fields, and thus we can implement ComputeParity.

Example 13: `ComputeParity` in detail

Let d = [ da, db, 0d ] be the input data bytes and let m = 2 be the desired parity byte count. Then, with the input byte count n = 3, the m × n Cauchy parity matrix computed using x_i = n + i and y_i = i is

/ f6 8d 01 \
\ cb 52 7b /.

Therefore, the parity bytes are computed as

                _    _     _    _
/ f6 8d 01 \   |  da  |   |  52  |
\ cb 52 7b / * |  db  | = |_ 0c _|,
               |_ 0d _|

and thus the output parity bytes are p = [ 52, 0c ].
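Putting the Cauchy matrix construction and the matrix-vector product together, a rough Python sketch of `ComputeParity` over the field with 256 elements could look like the following. The helper names (`gf_mul`, `gf_inv`, `cauchy_matrix`, `mat_vec`) are assumptions made for this example, and `gf_mul` folds the carry-less multiply and the clmod by $283$ into one function.

```python
MODULUS = 283

def gf_mul(a, b):
    """Carry-less multiply a and b, then reduce clmod 283."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    while product.bit_length() >= MODULUS.bit_length():
        product ^= MODULUS << (product.bit_length() - MODULUS.bit_length())
    return product

def gf_inv(a):
    """Brute-force reciprocal in the field with 256 elements."""
    for b in range(1, 256):
        if gf_mul(a, b) == 1:
            return b
    raise ZeroDivisionError("0 has no reciprocal")

def cauchy_matrix(xs, ys):
    """Entry (i, j) is 1 / (x_i - y_j); subtraction in this field is xor."""
    return [[gf_inv(x ^ y) for y in ys] for x in xs]

def mat_vec(matrix, vec):
    """Matrix-vector product, with xor as addition."""
    out = []
    for row in matrix:
        acc = 0
        for entry, v in zip(row, vec):
            acc ^= gf_mul(entry, v)
        out.append(acc)
    return out

def compute_parity(data, m):
    n = len(data)
    parity_matrix = cauchy_matrix([n + i for i in range(m)], list(range(n)))
    return mat_vec(parity_matrix, data)

print(cauchy_matrix([1, 2, 3], [4, 5, 6]))
print([hex(p) for p in compute_parity([0xda, 0xdb, 0x0d], 2)])
```

The first line of output corresponds to Example 12 and the second to Example 13.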

Then we can implement matrix inversion using row reduction over arbitrary fields.

Example 14: Matrix inversion via row reduction in general

Working over the field with 256 elements, let

    / 0 2 2 \
M = | 3 4 5 |
    \ 6 6 7 /.

The initial augmented matrix A is

/ 0 2 2 | 1 0 0 \
| 3 4 5 | 0 1 0 |
\ 6 6 7 | 0 0 1 /.

We need A₀₀ to be non-zero, so swap rows 0 and 1:

/ 0 2 2 | 1 0 0 \     / 3 4 5 | 0 1 0 \
| 3 4 5 | 0 1 0 | --> | 0 2 2 | 1 0 0 |
\ 6 6 7 | 0 0 1 /     \ 6 6 7 | 0 0 1 /.

We need A₀₀ to be 1, so divide row 0 by 3:

/ 3 4 5 | 0 1 0 \     / 1 245 3 | 0 246 0 \
| 0 2 2 | 1 0 0 | --> | 0  2  2 | 1  0  0 |
\ 6 6 7 | 0 0 1 /     \ 6  6  7 | 0  0  1 /.

We need A₂₀ to be 0, so subtract row 0 scaled by 6 from row 2:

/ 1 245 3 | 0 246 0 \     / 1 245  3 | 0 246 0 \
| 0  2  2 | 1  0  0 | --> | 0  2   2 | 1  0  0 |
\ 6  6  7 | 0  0  1 /     \ 0  14 13 | 0  2  1 /.

We need A₁₁ to be 1, so divide row 1 by 2:

/ 1 245  3 | 0 246 0 \     / 1 245  3 |  0  246 0 \
| 0  2   2 | 1  0  0 | --> | 0  1   1 | 141  0  0 |
\ 0 14  13 | 0  2  1 /     \ 0 14  13 |  0   2  1 /.

We need A₂₁ to be 0, so subtract row 1 scaled by 14 from row 2:

/ 1 245  3 |  0  246 0 \     / 1 245  3 |  0  246 0 \
| 0  1   1 | 141  0  0 | --> | 0  1   1 | 141  0  0 |
\ 0 14  13 |  0   2  1 /     \ 0  0   3 |  7   2  1 /.

We need A₂₂ to be 1, so divide row 2 by 3, which makes the left side of A a unit upper triangular matrix:

/ 1 245  3 |  0  246 0 \     / 1 245  3 |  0  246  0  \
| 0  1   1 | 141  0  0 | --> | 0  1   1 | 141  0   0  |
\ 0  0   3 |  7   2  1 /     \ 0  0   1 | 244 247 246 /.

We need A₁₂ to be 0, so subtract row 2 from row 1:

/ 1 245  3 |  0  246  0  \     / 1 245  3 |  0  246  0  \
| 0  1   1 | 141  0   0  | --> | 0  1   0 | 121 247 246 |
\ 0  0   1 | 244 247 246 /     \ 0  0   1 | 244 247 246 /.

We need A₀₂ to be 0, so subtract row 2 scaled by 3 from row 0:

/ 1 245  3 |  0  246  0  \     / 1 245  0 |  7  244  1  \
| 0  1   0 | 121 247 246 | --> | 0  1   0 | 121 247 246 |
\ 0  0   1 | 244 247 246 /     \ 0  0   1 | 244 247 246 /.

We need A₀₁ to be 0, so subtract row 1 scaled by 245 from row 0, which makes the left side of A the identity matrix:

/ 1 245  0 |  7  244  1  \     / 1 0 0 |  82  82  82 \
| 0  1   0 | 121 247 246 | --> | 0 1 0 | 121 247 246 |
\ 0  0   1 | 244 247 246 /     \ 0 0 1 | 244 247 246 /.

Since the left side of A is the identity matrix, the right side of A is M^-1. Therefore,

         /  82  82  82 \
M^{-1} = | 121 247 246 |
         \ 244 247 246 /.
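For comparison, here is a compact Python sketch of matrix inversion over the field with 256 elements. It is a Gauss-Jordan variant that clears each column above and below the pivot in one pass, so the intermediate matrices differ from the steps shown in Example 14, but the resulting inverse is the same. The names `gf_mul`, `gf_inv`, and `invert` are illustrative.

```python
MODULUS = 283

def gf_mul(a, b):
    """Carry-less multiply a and b, then reduce clmod 283."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    while product.bit_length() >= MODULUS.bit_length():
        product ^= MODULUS << (product.bit_length() - MODULUS.bit_length())
    return product

def gf_inv(a):
    """Brute-force reciprocal in the field with 256 elements."""
    for b in range(1, 256):
        if gf_mul(a, b) == 1:
            return b
    raise ZeroDivisionError("0 has no reciprocal")

def invert(matrix):
    """Invert a square matrix by row-reducing the augmented matrix [M | I]."""
    n = len(matrix)
    aug = [list(row) + [int(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is not invertible")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot becomes 1.
        scale = gf_inv(aug[col][col])
        aug[col] = [gf_mul(scale, v) for v in aug[col]]
        # Subtract (xor) multiples of the pivot row from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v ^ gf_mul(factor, p) for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

print(invert([[0, 2, 2], [3, 4, 5], [6, 6, 7]]))
```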

Finally, we can use that to implement ReconstructData.

Example 15: `ReconstructData` in detail

Let d_partial = [ ??, db, ?? ] be the input partial data bytes and p_partial = [ 52, 0c ] be the input partial parity bytes. Then, with the data byte count n = 3 and the parity byte count m = 2, and appending the rows of the m × n Cauchy parity matrix to the n × n identity matrix, we get

/ X01X X00X X00X \
|  00   01   00  |
| X00X X00X X01X |
|  f6   8d   01  |
\  cb   52   7b  /,

where the rows corresponding to the unknown data and parity bytes are crossed out. Taking the first n rows that aren’t crossed out, we get the square matrix

/ 00 01 00 \
| f6 8d 01 |
\ cb 52 7b /

which has inverse

/ 01 d0 d6 \
| 01 00 00 |
\ 7b b8 bb /.

Therefore, the data bytes are reconstructed from the first n known data and parity bytes as

                _    _     _    _
/ 01 d0 d6 \   |  db  |   |  da  |
| 01 00 00 | * |  52  | = |  db  |
\ 7b b8 bb /   |_ 0c _|   |_ 0d _|,

and thus the output data bytes are d = [ da, db, 0d ].
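The whole reconstruction can be sketched end to end in Python as follows. This is an illustration under a few assumptions: missing bytes are represented by `None`, the parity matrix uses $x_i = n + i$ and $y_j = j$ as in Example 13, and all helper names are invented for the example.

```python
MODULUS = 283

def gf_mul(a, b):
    """Carry-less multiply a and b, then reduce clmod 283."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    while product.bit_length() >= MODULUS.bit_length():
        product ^= MODULUS << (product.bit_length() - MODULUS.bit_length())
    return product

def gf_inv(a):
    for b in range(1, 256):
        if gf_mul(a, b) == 1:
            return b
    raise ZeroDivisionError("0 has no reciprocal")

def invert(matrix):
    """Gauss-Jordan inversion over the field with 256 elements."""
    n = len(matrix)
    aug = [list(row) + [int(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is not invertible")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = gf_inv(aug[col][col])
        aug[col] = [gf_mul(scale, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v ^ gf_mul(factor, p) for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mat_vec(matrix, vec):
    out = []
    for row in matrix:
        acc = 0
        for entry, v in zip(row, vec):
            acc ^= gf_mul(entry, v)
        out.append(acc)
    return out

def reconstruct_data(partial_data, partial_parity):
    """Recover all data bytes; unknown bytes are given as None."""
    n, m = len(partial_data), len(partial_parity)
    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    parity_matrix = [[gf_inv((n + i) ^ j) for j in range(n)] for i in range(m)]
    rows = identity + parity_matrix
    values = list(partial_data) + list(partial_parity)
    # Keep the first n rows whose corresponding byte is known.
    known = [(row, v) for row, v in zip(rows, values) if v is not None][:n]
    if len(known) < n:
        raise ValueError("not enough known bytes to reconstruct the data")
    inverse = invert([row for row, _ in known])
    return mat_vec(inverse, [v for _, v in known])

print([hex(d) for d in reconstruct_data([None, 0xdb, None], [0x52, 0x0c])])
```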

And we’re done!

12. Further reading

Next time we’ll talk about the PAR1 file format, which is a practical implementation of an erasure code very similar to the one described above, and the various challenges to make it perform well on sets of large files.

For those of you interested in the mathematical details, I’ll also write a companion article. (This article is already quite long!)

I gave a 15-minute presentation for WaffleJS covering the same topics as this article, but at a higher level and more informally.

I got the idea for explaining the finite field with $256$ elements in terms of binary carry-less arithmetic from A Painless Guide to CRC Error Detection Algorithms, which is an excellent document in its own right.

Most sources below use Vandermonde matrices, which I plan to cover in the next article on PAR1, instead of Cauchy matrices. Cauchy matrices are more foolproof, which is why I started with them. templexxx, whose Go implementation I cite below, feels the same way. (His blog post is in Chinese, but using Google Translate or a similar service translates it well enough to English.)

I started learning about erasure codes from James Plank’s papers. See A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like systems, but also make sure to read the very important correction to it! Optimizing Cauchy Reed-Solomon Codes for Fault-Tolerant Storage Applications covers Cauchy matrices, although in a slightly different context. The first part of Plank’s All About Erasure Codes slides also contains a good overview of the encoding/decoding process, including a nifty color-coded matrix diagram.

As for implementations, klauspost and templexxx have good ones written in Go. They were in turn inspired by Backblaze’s Java implementation. Backblaze’s accompanying blog post is also a good overview of the topic. The toy JS implementation powering the demos on this page is also available on my GitHub.

An Introduction to Galois Fields and Reed-Solomon Coding^[13] covers much of the same material as I do, albeit assuming slightly more mathematical background.

Going further afield, Russ Cox, Jeremy Kun, and Nayuki also wrote about finite fields and Reed-Solomon codes.

Thanks to Ying-zong Huang, Ryan Hitchman, Charles Ellis, and Josh Gao for comments/corrections/discussion.


Footnotes

[1] This discussion of linear algebra is necessarily abbreviated for our purposes. For a more general but still basic treatment, see Khan Academy. ↩

[2] Here and throughout this document, I index vectors and matrices starting with $0$, to better match array indices in code. Most math texts index vectors and matrices starting at $1$. ↩

[3] Now would be a good time to talk about the conventions I and other texts use. Following Plank, I use $n$ for the data byte count and $m$ for the parity byte count, and I represent arrays and vectors as column vectors, where multiplication with a matrix is done with the column vector on the right, which is the standard in most of math. However, in coding theory, $k$ is used for the data byte count, which they call the message length, and $n$ is used for the sum of the data and parity byte counts, which they call the codeword length. Furthermore, contrary to the rest of math, coding theory treats arrays and vectors as row vectors, where multiplication with a matrix is done with the row vector on the left, and the matrix used would be the transpose of the matrix that would be used with a column vector. ↩

[4] Khan Academy has a video stepping through an example for a $3 \times 3$ matrix. ↩

[5] People with experience in coding theory might recognize that a parity matrix $P$ being optimal is equivalent to the corresponding erasure code being MDS. ↩

[6] An equivalent statement which is easier to see is that if a row could be expressed as a linear combination of other rows, then one would be able to construct a non-empty square submatrix of $P$ with those rows, which would then be non-invertible. ↩

[7] It is instead a (transposed) Vandermonde matrix, which we’ll cover when we talk about the PAR1 file format in a follow-up article. ↩

[8] People with experience in abstract algebra might recognize this as arithmetic over $\mathbb{F}_2[x]$, the polynomials with coefficients in the finite field with $2$ elements. ↩

[9] Our use of $\clplus$, $\clminus$, $\clmul$, and $\cldiv$ to denote carry-less arithmetic clashes with our use of the same symbols to denote generic field operations. However, we’ll never need to talk about both at the same time, so whichever one we mean should be obvious in context. ↩

[10] This is a slightly stronger statement than Theorem 3. ↩

[11] People with experience in abstract algebra might recognize carry-less primes as irreducible elements of $\mathbb{F}_2[x]$. ↩

[12] Coincidentally, $283$ is also a regular prime number. Using another carry-less prime number $256 \le p \lt 512$ would also yield a field with $256$ elements, but it is important to consistently use the same carry-less modulus; different carry-less moduli lead to fields with $256$ elements that are isomorphic, but not identical.

Borrowing notation from CRCs, the carry-less modulus is sometimes represented as a hexadecimal number with the leading digit (which is always $1$) omitted. For example, $283$ would be represented as $\mathtt{0x1b}$, and we can say that we’re using the field with $256$ elements defined by $\mathtt{0x1b}$. ↩

[13] Galois field is just another name for finite field. ↩

Why is the Quintic Unsolvable?

2016-09-26T00:00:00-07:00

(This was discussed on r/math and Hacker News.)

1. Overview

In this article, I hope to convince you that the quintic equation is unsolvable, in the sense that I can’t write down the solution to the equation \[ ax^5 + bx^4 + cx^3 + dx^2 + ex + f = 0 \] using only addition, subtraction, multiplication, division, raising to an integer power, and taking an integer root. In fact, I hope to go further and explain how this is true for the same reason that I can’t write down the solution to the equation \[ ax^2 + bx + c = 0 \] using only the first five operations above!

The usual approach to the above claim involves a semester’s worth of abstract algebra and Galois theory. However, there’s a much easier and shorter proof which involves only a bit of group theory and complex analysis—enough to fit in a blog post—and some interactive visualizations.^[1]

2. Quadratic Equations

Let’s start with quadratic equations, which hopefully you all remember from high school. Given two complex numbers $r_1$ and $r_2$, you can determine the quadratic equation whose solutions are $r_1$ and $r_2$, namely \[ (x - r_1)(x - r_2) = x^2 - (r_1 + r_2) x + r_1 r_2 = 0\text{.} \] If we take the standard form of a quadratic equation to be \[ a x^2 + bx + c = 0\text{,} \] then we can define a function from $r_1$ and $r_2$ to $a$, $b$, and $c$, which is shown by the first two panels in the visualization below; drag either of the points $r_1$ and $r_2$ and notice how $b$ and $c$ move ($a$ will always remain fixed at $1$).

Now pretend that we misremember the quadratic formula as \[ x_{1, 2} = \frac{-b \pm b^2 - 4ac}{4a}\text{.} \] The results of this formula, our candidate solution, are shown in the third panel. Note that since $x_1$ and $x_2$ depend on $a$, $b$, and $c$, which all depend on $r_1$ and $r_2$, they also move when you drag either $r_1$ or $r_2$.

Interactive Example 1: An incorrect quadratic formula (panels: Roots, Coefficients, Candidate solution)

Now this formula looks right, since $x_1$ and $x_2$ are at the same coordinates as $r_1$ and $r_2$. However, if you move $r_1$ or $r_2$ around, you can easily convince yourself that this formula can’t be right, since $x_1$ and $x_2$ don’t move in the same way.

Now if you remember from high school, the real quadratic formula involves taking a square root, and since our candidate solution doesn’t do that, that means it’s probably incorrect. I say “probably” because there’s no immediate reason why there can’t be multiple quadratic formulas, some simpler than others, of which one is simple enough to not need a square root. From manipulating $r_1$ and $r_2$, we know that our candidate formula is incorrect, but that doesn’t immediately follow from it not having a square root.

Fortunately, there is a general way to rule out candidate solutions that are similar to the one above, namely those that use only addition, subtraction, multiplication, division, and raising to an integer power; we’ll call these rational expressions. Here’s how it goes: if you press the button to swap $r_1$ and $r_2$, which moves $r_1$ to $r_2$’s position and vice versa, $a$, $b$, and $c$ move from their starting positions but return once $r_1$ and $r_2$ reach their destinations. This makes sense, because the coefficients of a polynomial don’t depend on how you order the roots. But since $x_1$ and $x_2$ depend only on $a$, $b$, and $c$, they too must loop back to their starting positions.

But that means that our candidate solution cannot be the quadratic formula! If it were, then $x_1$ and $x_2$ would have ended up swapped, too. Instead, they went back to their starting positions, which is a contradiction. This reasoning holds for any expression which is a single-valued function of $a$, $b$, and $c$, so in particular this holds for rational expressions.

Let’s summarize our reasoning in a theorem:

(Theorem 1.) A rational expression^[2] in the coefficients of the general quadratic equation \[ ax^2 + bx + c = 0 \] cannot be a solution to this equation.

Sketch of proof. Assume to the contrary that the rational expression $x = f(a, b, c)$ is a solution. Assume that we start with $r_1 = z_1$ and $r_2 = z_2 \ne z_1$, and without loss of generality assume that we start with $x = z_1$.

Run $r_1$ and $r_2$ along continuous paths that swap their two positions, i.e. make $r_1$ head from $z_1$ to $z_2$ continuously, and at the same time make $r_2$ head from $z_2$ to $z_1$ continuously, and make sure to pick paths such that $r_1$ and $r_2$ never coincide.

Since $a$, $b$, and $c$ are continuous functions of $r_1$ and $r_2$, and $x$ is a rational function of $a$, $b$ and $c$, and thus continuous, $x$ then depends continuously on $r_1$ and $r_2$. Thus, since we start with $x = r_1 = z_1$, and $r_1$ never coincides with $r_2$, then as $r_1$ moves, $x = r_1$ must continue to hold, since $x$ is a solution, and therefore $x$’s final position must be the same as $r_1$’s, which is $z_2$.

However, since the coefficients $a$, $b$, and $c$ don’t depend on the ordering of $r_1$ and $r_2$, then their final positions are the same as their initial positions. Since $x$ is a function of only $a$, $b$, and $c$, its final position also must be the same as its initial position, $z_1$. This contradicts the above, and therefore $x$ cannot be a solution. ∎

Now consider the candidate solution \[ x_{1,2} = \sqrt{b^2 - 4ac}\text{.} \] This isn’t a rational expression since it has a square root. In particular, in the visualization below, it behaves quite differently from our first candidate solution. First, even though we have just a single expression, it yields two points $x_1$ and $x_2$. Second, and more surprisingly, if you swap $r_1$ and $r_2$, $x_1$ and $x_2$ also exchange places, seemingly contradicting Theorem 1! What is going on?

Interactive Example 2: The quadratic equation (panels: Roots, Coefficients, Candidate solution)

$x_{1, 2} = \frac{-b \pm b^2 - 4ac}{4a}$
$x_1 = b^2 - 4ac$
$x_{1, 2} = \sqrt{b^2 - 4ac}$
$x_{1, 2} = \frac{-b + \sqrt{b^2 - 4ac}}{2a}$
(the quadratic formula)

To answer this, we first need to review some facts about complex numbers. Recall that a complex number $z$ can be expressed in polar coordinates, where it has a length $r$ and an angle $θ$, and that it can be converted to the usual Cartesian coordinates using Euler’s formula: \[ z = r e^{iθ} = r \cos θ + i \, r \sin θ\text{.} \] Then, if you have two complex numbers $z_1 = r_1 e^{iθ_1}$ and $z_2 = r_2 e^{iθ_2}$ in polar form, you can multiply them by multiplying their lengths, and adding their angles: \[ z_1 z_2 = r_1 r_2 e^{i (θ_1 + θ_2)}\text{.} \] So a square root of a complex number $z = r e^{iθ}$ is just $\sqrt{r} e^{iθ/2}$, as you can easily verify. However, if $z$ is non-zero, there is one more square root of $z$, namely $\sqrt{r} e^{i (θ/2 + π)}$, as you can also verify. (Recall that angles that differ by $2π = 360^\circ$ are considered the same.)

So in general, the square root of a rational expression, like our candidate solution, yields two distinct points as long as the rational expression is non-zero. In our case, $b^2 - 4ac$ remains non-zero as long as $r_1$ and $r_2$ don’t coincide. (We’ll have more to say about this expression, called the discriminant, once we talk about cubic equations below.) Therefore, if we want to examine how $x_1$ and $x_2$ move as $r_1$ and $r_2$ move, we have to number the square roots of $b^2 - 4ac$, and we have to keep this numbering consistent.

To do so, we have to do two things: we have to vary $r_1$ and $r_2$ only continuously, and we have to vary $r_1$ and $r_2$ such that they never coincide. If we do this, then we can intuitively “lift” the expression $b^2 - 4ac$ from the complex plane to a new surface $S$ where we consider only angles that differ by $4π = 720^\circ$, rather than $2π$, to be the same. In this space, we can take the “first” square root of a non-zero complex number to be the one with half the angle, and the “second” square root to be the one with half the angle plus $π$, and have these two square root functions behave continuously as their argument goes around the origin.

Figure 1 $S$, which is the Riemann surface of $\sqrt{z}$. (Image by Leonid 2 licensed under CC BY-SA 3.0.)

Now this answers the question of why the proof of Theorem 1 fails for $\sqrt{b^2 - 4ac}$. As $r_1$ is swapped with $r_2$, $a$, $b$, and $c$ go around a single loop, and therefore $b^2 - 4ac$ goes around a single loop in the complex plane, but when $b^2 - 4ac$ is lifted to $S$, the final position of $b^2 - 4ac$ differs from the initial position only by an angle of $2π$, so it is distinct from the initial position, and thus we can’t conclude that the final position of $\sqrt{b^2 - 4ac}$ is the same as the initial position.
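This behavior can also be checked numerically. The Python sketch below swaps $r_1$ and $r_2$ along a circle around their midpoint (one possible choice of paths, assumed here for simplicity), recomputes $b^2 - 4ac$ with $a = 1$ at each step, and follows one square root continuously by always picking the candidate closest to the previous value. The function names are made up for this illustration.

```python
import cmath
import math

def swap_positions(z1, z2, t):
    """Positions of r1 and r2 at time t in [0, 1] as they swap along a circle."""
    center = (z1 + z2) / 2
    half = (z1 - z2) / 2
    turn = cmath.exp(1j * math.pi * t)  # half a turn by t = 1
    return center + half * turn, center - half * turn

def track_candidates(z1, z2, steps=1000):
    """Follow b^2 - 4ac and one continuous branch of its square root (a = 1)."""
    history = []
    branch = None
    for k in range(steps + 1):
        r1, r2 = swap_positions(z1, z2, k / steps)
        b, c = -(r1 + r2), r1 * r2
        disc = b * b - 4 * c
        root = cmath.sqrt(disc)
        # Keep the numbering consistent: choose the root nearest the previous one.
        if branch is None or abs(root - branch) <= abs(-root - branch):
            branch = root
        else:
            branch = -root
        history.append((disc, branch))
    return history

history = track_candidates(1 + 0j, -1 + 1j)
(disc_start, sqrt_start), (disc_end, sqrt_end) = history[0], history[-1]
print(abs(disc_end - disc_start))  # close to 0: the discriminant returns to its start
print(abs(sqrt_end + sqrt_start))  # close to 0: the tracked square root changed sign
```

So the rational expression $b^2 - 4ac$ comes back to where it started, while its continuously tracked square root ends up at the other square root.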

Similar reasoning holds for any algebraic expression that isn’t a rational expression, i.e. ones that involve taking any integer root, so Theorem 1 cannot apply to algebraic expressions in general. Of course, this is consistent with what we know about the quadratic formula, since we know that it has a square root!

3. Cubic Equations

Now we can move on to cubic equations. Similarly, given three complex numbers $r_1$, $r_2$, and $r_3$, you can determine the cubic equation with those solutions, namely \[ (x - r_1) (x - r_2) (x - r_3) = x^3 - (r_1 + r_2 + r_3) x^2 + (r_1 r_2 + r_1 r_3 + r_2 r_3) x - r_1 r_2 r_3\text{,} \] and so we can define a function from $r_1$, $r_2$, and $r_3$ to $a$, $b$, $c$, and $d$, where \[ a x^3 + b x^2 + c x + d \] is the standard form of a cubic polynomial, and this is shown in the visualization below.

In the previous section, we talked about the discriminant $b^2 - 4ac$ of the general quadratic polynomial. However, the discriminant is an expression that is defined for any polynomial. If $r_1, \dotsc, r_n$ are the roots of a polynomial (counting multiplicity) with leading coefficient $a_n$, then the discriminant is \[ Δ = a_n^{2n - 2} ∏_{i \lt j} (r_i - r_j)^2\text{.} \] In other words, the discriminant is, up to sign and a power of the leading coefficient, the product of the differences of all pairs of different roots. In particular, if the polynomial has repeated roots, the discriminant is zero.

Using the formula above, you can express the discriminant in terms of the coefficients of the polynomial, as you can verify for yourself with the quadratic equation. Indeed this is true in general; for cubic polynomials, the discriminant can be expressed in terms of the coefficients as \[ Δ = b^2 c^2 - 4 a c^3 - 4 b^3 d - 27 a^2 d^2 + 18 a b c d\text{.} \] But why do we care? Because, as you can see in the visualization below, if you swap any pair of roots, this causes the discriminant to make a single loop around the origin, so it serves as a useful test function for taking roots.
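As a quick sanity check that the two expressions for the cubic discriminant agree, here is a short Python sketch comparing the coefficient formula with the product over pairs of roots, for a monic cubic. The function names are illustrative.

```python
from itertools import combinations

def cubic_coefficients(r1, r2, r3):
    """Coefficients (a, b, c, d) of (x - r1)(x - r2)(x - r3)."""
    return 1, -(r1 + r2 + r3), r1 * r2 + r1 * r3 + r2 * r3, -(r1 * r2 * r3)

def discriminant_from_coefficients(a, b, c, d):
    return (b**2 * c**2 - 4 * a * c**3 - 4 * b**3 * d
            - 27 * a**2 * d**2 + 18 * a * b * c * d)

def discriminant_from_roots(roots, leading=1):
    """leading^(2n - 2) times the product of (r_i - r_j)^2 over pairs i < j."""
    n = len(roots)
    product = leading ** (2 * n - 2)
    for ri, rj in combinations(roots, 2):
        product *= (ri - rj) ** 2
    return product

roots = (1, 2, 4)
print(discriminant_from_coefficients(*cubic_coefficients(*roots)))
print(discriminant_from_roots(roots))
```

Trying a repeated root, such as `(1, 1, 3)`, gives zero from both functions.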

So now that we have three roots, we can swap them in multiple ways. If $R$ is a list that starts off as $\langle r_1, r_2, r_3 \rangle$, let $↺_{i, j}$ denote the counter-clockwise paths that take the root at the $i$th index of $R$ to the one at the $j$th index of $R$ and vice versa, and similarly for $↻_{i, j}$. (Note that this is not the same as the paths that swap $r_i$ and $r_j$! Play around with the buttons in the visualization below to understand the difference.)

Interactive Example 3: The cubic discriminant

Roots

$R = \langle r_1, r_2, r_3 \rangle$

Coefficients

Candidate solution

$x_1 = Δ$
$x_{1, 2, 3, 4, 5} = \sqrt[5]{Δ}$

Now, with the formula $Δ$, the same reasoning as in the previous section shows that it cannot possibly be the cubic formula, nor can any other rational expression. However, unlike the quadratic case, we can also rule out $\sqrt[5]{Δ}$, or any other algebraic formula with no nested radicals (i.e., that doesn’t have a radical within a radical like $\sqrt{a - \sqrt{bc - 5}}$). If you apply the operations $↺_{2, 3}$, $↺_{1, 2}$, $↻_{2, 3}$, and $↻_{1, 2}$ in sequence, $r_1$, $r_2$, and $r_3$ rotate among themselves, but all the $x_i$ go back to their original positions. Therefore, by similar reasoning as the previous section, $\sqrt[5]{Δ}$ also cannot possibly be the cubic formula!

To make this statement precise, we need to review some group theory. Recall that a group is a set with an associative binary operation, an identity element, and inverse elements. Most basic examples of groups are related to numbers, like the integers under addition, or the non-zero rationals under multiplication. However, more interesting examples of groups are related to functions, not least because the group operation for functions is composition, which is in general not commutative; in other words, if $f$ and $g$ are functions, then in general $f \circ g \ne g \circ f$, and it is this non-commutativity that will come in handy for our purposes.

So let’s say we have a list of $n$ objects, and we’re interested in the functions that rearrange this list’s elements. These are permutations, and they naturally form a group under composition, as you can check for yourself, called $S_n$, the symmetric group on $n$ objects.

There’s a convenient way to write permutations, called cycle notation. If you write $(i_1 \; i_2 \; \dotsc \; i_k)$, this denotes the permutation that maps the $i_1$th position of the list to the $i_2$th position, the $i_2$th position to the $i_3$th, and so on, with the $i_k$th position mapped back to the $i_1$th; such a permutation is called a cycle. Then you can write any permutation as a composition of disjoint cycles, so this provides a convenient way to write down and compute with permutations.

In the visualization above, we have four operations $↺_{1, 2}$, $↺_{2, 3}$, $↻_{1, 2}$, and $↻_{2, 3}$, which act on $R$, meaning that they define permutations on $R$. In particular, $↺_{1, 2}$ and $↻_{1, 2}$ both swap the first and second elements of $R$, so we say that $↺_{1, 2}$ and $↻_{1, 2}$ act on $R$ as $(1 \; 2)$, and similarly, $↺_{2, 3}$ and $↻_{2, 3}$ act on $R$ as $(2 \; 3)$.

Now concatenating two operations—doing one after the other—corresponds to composing their mapped-to permutations on $R$. Denoting $o_2 * o_1$ as doing $o_1$, then doing $o_2$, the sequence of operations above is $↻_{1, 2} * ↻_{2, 3} * ↺_{1, 2} * ↺_{2, 3}$ (note the order!), which acts on $R$ like $(1 \; 2) (2 \; 3) (1 \; 2) (2 \; 3)$, which is equal to $(1 \; 3 \; 2)$.^[3] (The $\circ$ is usually dropped when composing permutations.)
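The product above is easy to check by machine. Here is a minimal sketch, assuming we represent a permutation as an array whose entry at index $i$ is the image of $i$ (index $0$ is unused so that positions stay $1$-based); the helper names `cycle` and `compose` are ours.

```javascript
// Build the permutation of {1, ..., n} given by a single cycle.
// perm[i] is the image of i; index 0 is unused and maps to itself.
function cycle(n, elems) {
  const perm = [];
  for (let i = 0; i <= n; i++) {
    perm.push(i);
  }
  for (let k = 0; k < elems.length; k++) {
    perm[elems[k]] = elems[(k + 1) % elems.length];
  }
  return perm;
}

// compose(f, g) applies g first, then f (right-to-left, like function composition).
function compose(f, g) {
  return g.map((x) => f[x]);
}

const s12 = cycle(3, [1, 2]);
const s23 = cycle(3, [2, 3]);

// (1 2) (2 3) (1 2) (2 3), listing the images of 1, 2, 3.
const product = [s12, s23, s12, s23].reduce((f, g) => compose(f, g));
console.log(product.slice(1)); // [ 3, 1, 2 ]
console.log(cycle(3, [1, 3, 2]).slice(1)); // [ 3, 1, 2 ]

// The two swaps don't commute: these print different lists.
console.log(compose(s12, s23).slice(1)); // [ 2, 3, 1 ]
console.log(compose(s23, s12).slice(1)); // [ 3, 1, 2 ]
```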

Now for the formula $Δ$, all the operations make $x_1$ loop around the origin either clockwise or counter-clockwise; in other words, they all induce a rotation of $2π$ or $-2π$ on $x_1$, and the final distance of $x_1$ from the origin is the same as the initial distance. Therefore, if we apply an equal number of clockwise and counter-clockwise rotations, the total angle of rotation will be $0$ and the final distance will be the same as the initial distance, i.e. the final position of $x_1$ is the same as its initial position. But the same reasoning holds for the formula $\sqrt[5]{Δ}$; all the operations induce a rotation of $2π/5$ or $-2π/5$ and leave the distance from the origin unchanged, so an equal number of clockwise and counter-clockwise rotations will still induce a total angle of $0$ and leave the distance from the origin unchanged. Therefore, the operation $↻_{1, 2} * ↻_{2, 3} * ↺_{1, 2} * ↺_{2, 3}$ acts like $(1 \; 3\; 2)$ on $R$, but leaves all $x_i$ unchanged.

But how did we come up with $↻_{1, 2} * ↻_{2, 3} * ↺_{1, 2} * ↺_{2, 3}$ in the first place? This involves a bit more group theory. $S_3$ is not a commutative group; in particular, $(1 \; 2) (2 \; 3) \ne (2 \; 3) (1 \; 2)$. For two group elements $g$ and $h$, we can define their commutator^[4] $[ g, h ]$, which is the group element that corrects for $g$ and $h$ not commuting. That is, we want the equation \[ g h = h g [g, h] \] to hold, which means that \[ [g, h] = g^{-1} h^{-1} g h\text{.} \] So the commutator provides a convenient way to generate a non-trivial permutation from two other non-commuting permutations. Furthermore, it involves two appearances of both elements, so we can pick a sequence of operations that induce the commutator and also have an equal number of clockwise and counter-clockwise operations. Then we’re guaranteed that this sequence of operations permutes $R$ and leaves all $x_i$ unchanged, even if each individual operation moves some $x_i$. But of course, this is just $↻_{1, 2} * ↻_{2, 3} * ↺_{1, 2} * ↺_{2, 3}$!

Let’s define some terminology to make proofs and discussion easier. If $o$ is an operation that acts on $R$ non-trivially but has the final position of the expression $x = f(a, b, c, \dotsc)$ the same as its initial position, we say that $o$ rules out the expression $x = f(a, b, c, \dotsc)$. For example, Theorem 1 says that swapping both roots of a quadratic rules out all rational expressions.

Now we’re ready to state and prove the theorem:

(Theorem 2.) An algebraic expression with no nested radicals in the coefficients of the general cubic equation \[ ax^3 + bx^2 + cx + d = 0 \] cannot be a solution to this equation.

Sketch of proof. First assume to the contrary that the expression $x = \sqrt[k]{r(a, b, c, d)}$ is a solution, where $r(a, b, c, d)$ is a rational expression. Assume we start with $r_1 = z_1$, $r_2 = z_2$, and $r_3 = z_3$, where all $z_i$ are distinct, and without loss of generality assume that we start with $x = z_1$.

Any of the operations $↺_{1, 2}$, $↺_{2, 3}$, $↻_{1, 2}$, and $↻_{2, 3}$ applied to $x = r(a, b, c, d)$ cause $x$’s final position to be the same as its initial position, by Theorem 1. Pick a point $z_0$ that is never equal to any point $x$ traverses under any operation. Then, by the same reasoning as above, the total angle induced by $↻_{1, 2} * ↻_{2, 3} * ↺_{1, 2} * ↺_{2, 3}$ on $x = \sqrt[k]{r(a, b, c, d)}$ around $z_0$ is $0$, and the distance from $z_0$ remains unchanged. Thus $x$ remains fixed, and this operation rules out $x = \sqrt[k]{r(a, b, c, d)}$.

For the general case, it suffices to show that if $o$ rules out the expressions $f$ and $g$, then $o$ also rules out $f$ raised to an integer power, $f + g\text{,}$ $f - g\text{,}$ $f \cdot g\text{,}$ and $f / g$ where $g \ne 0\text{.}$ But this is straightforward, and such formulas are just the algebraic expressions with no nested radicals, so the statement holds in general. ∎

Theorem 2 can be summarized thus: any $↺_{i, j}$ or $↻_{i, j}$ rules out any rational expression as the cubic formula, and if given an algebraic expression with no nested radicals, either some $↺_{i, j}$ or $↻_{i, j}$ rules it out, or $↻_{1, 2} * ↻_{2, 3} * ↺_{1, 2} * ↺_{2, 3}$ rules it out.

Now we can consider algebraic expressions with one level of nesting. Can such formulas be ruled out as being the cubic formula? We can’t do so via Theorem 2, at least; we would need a non-trivial element of $S_3$ that is the commutator of commutators. But you can calculate that all non-trivial commutators of $S_3$ are either $(3 \; 2 \; 1)$ or $(1 \; 2\; 3)$, and these two elements commute, so $S_3$ cannot have a non-trivial commutator of commutators.
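The calculation for $S_3$ is small enough to brute-force. The sketch below enumerates every commutator $[g, h] = g^{-1} h^{-1} g h$ in $S_3$ and then every commutator of those commutators; it assumes permutations are stored as $0$-based arrays of images, and the helper names are ours.

```javascript
// All permutations of {0, ..., n-1}; f[x] is the image of x.
function permutations(n) {
  if (n === 0) return [[]];
  const result = [];
  for (const p of permutations(n - 1)) {
    for (let k = 0; k <= p.length; k++) {
      result.push([...p.slice(0, k), n - 1, ...p.slice(k)]);
    }
  }
  return result;
}

// compose(f, g) applies g first, then f.
function compose(f, g) {
  return g.map((x) => f[x]);
}

function inverse(f) {
  const inv = new Array(f.length);
  f.forEach((y, x) => { inv[y] = x; });
  return inv;
}

// [g, h] = g^{-1} h^{-1} g h, the convention used in this article.
function commutator(g, h) {
  return compose(compose(inverse(g), inverse(h)), compose(g, h));
}

// Format a permutation in 1-based cycle notation, with 'e' for the identity.
function toCycles(f) {
  const seen = new Set();
  let out = '';
  for (let start = 0; start < f.length; start++) {
    if (seen.has(start) || f[start] === start) continue;
    const cyc = [];
    for (let x = start; !seen.has(x); x = f[x]) {
      seen.add(x);
      cyc.push(x + 1);
    }
    out += '(' + cyc.join(' ') + ')';
  }
  return out || 'e';
}

// Map from cycle notation to permutation for every commutator of two elements.
function commutatorSet(elems) {
  const found = new Map();
  for (const g of elems) {
    for (const h of elems) {
      const c = commutator(g, h);
      found.set(toCycles(c), c);
    }
  }
  return found;
}

const first = commutatorSet(permutations(3));
console.log([...first.keys()].sort()); // [ '(1 2 3)', '(1 3 2)', 'e' ]

const second = commutatorSet([...first.values()]);
console.log([...second.keys()]); // [ 'e' ]
```

Here $(1 \; 3 \; 2)$ is the same cycle as $(3 \; 2 \; 1)$, just written starting from its smallest entry.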

In fact, as we would expect, the actual cubic formula has such an algebraic expression, which is $C$ in the visualization below, so that serves as a convenient example of an algebraic expression with a single nested radical that can’t be ruled out by Theorem 2.

Interactive Example 4: The cubic equation

Roots

$R = \langle r_1, r_2, r_3 \rangle$

Coefficients

Candidate solution

$X = \langle x_1, x_2, x_3, x_4, x_5, x_6 \rangle$
$x_1 = -27a^2 Δ = {Δ_1}^2 - 4 {Δ_0}^3$
$x_{1, 2} = C^3 = \frac{Δ_1 + \sqrt{-27a^2 Δ}}{2}$
$x_{1,2,3,4,5,6} = C$
$x_{1, 2, 3} = -\frac{1}{3a} \left( b + C + \frac{Δ_0}{C} \right)$
(the cubic formula)

Note that there is a new list $X$, which lists the $x_i$ in the order in which they occupy their initial positions, like how $R$ does the same for the $r_i$. In general, we can’t do this, since a general multi-valued function won’t necessarily permute the $x_i$ among themselves, but in the interactive visualizations we’ll only consider expressions that do.

We can then talk about how an operation acts on $X$. For example, if we pick $\sqrt[5]{Δ}$ in Interactive Example 3, we can say that $↺_{i, j}$ acts like $(5 \; 1 \; 2 \; 3 \; 4)$ on $X$ and $↻_{i, j}$ acts like $(1 \; 2 \; 3 \; 4 \; 5)$ on $X$. Therefore, $↻_{1, 2} * ↻_{2, 3} * ↺_{1, 2} * ↺_{2, 3}$ acts non-trivially on $R$ but acts trivially on $X$, which is another, more algebraic way of saying that this operation rules out $\sqrt[5]{Δ}$, since the action on $X$ depends on the candidate formula. On the other hand, if you choose $C$ in the visualization above, you can convince yourself that no operation acts non-trivially on $R$ without also acting non-trivially on $X$, and so $C$ can’t be ruled out as the cubic formula.

4. Quartic Equations

Now we can move on to quartic equations. As usual, given four complex numbers $r_1$, $r_2$, $r_3$, and $r_4$, you can map this to the coefficients $a$, $b$, $c$, $d$, and $e$ of the standard form of a quartic polynomial, as shown in the visualization below, such that the $r_i$ are the solutions to the quartic equation \[ a x^4 + b x^3 + c x^2 + d x + e = 0\text{.} \]

Now that we have four roots, we have even more ways to permute them using the $↺_{i, j}$ and $↻_{i, j}$. Before we move on, we need more terminology and group theory to handle this more complicated case.

First, we want a convenient way to denote the combination of operations that act like a commutator, so let’s define $↺_{i, j}^\prime$ to mean $↻_{i, j}$ and vice versa, $(o_1 \circ o_2 \circ \dotsb \circ o_n)^\prime$ to mean $o_n^\prime \circ o_{n-1}^\prime \circ \dotsb \circ o_1^\prime$, and $[\![ o_1, o_2 ]\!]$ to mean $o_1^\prime \circ o_2^\prime \circ o_1 \circ o_2$, so that if $o_i$ acts on $R$ like $g_i$, then $o_i^\prime$ acts on $R$ like $g_i^{-1}$ and $[\![o_i, o_j]\!]$ acts on $R$ like $[g_i, g_j]$. For example, in the previous section, we were using $[\![ ↺_{1, 2}, ↺_{2, 3} ]\!]$ to rule out algebraic expressions with no nested radicals.
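To make the bookkeeping concrete, here is a rough sketch of these definitions. It assumes a deliberately simplified model in which an operation is just a list of swaps in execution order, each tagged with a direction; the actual paths are ignored, and all names are made up for illustration.

```javascript
// An elementary operation swaps positions i and j of R, winding
// counter-clockwise (dir = +1) or clockwise (dir = -1).
// A compound operation is an array of these in execution order.
function ccw(i, j) {
  return [{ i, j, dir: +1 }];
}

function cw(i, j) {
  return [{ i, j, dir: -1 }];
}

// prime(o) undoes o: reverse the execution order and flip every direction.
function prime(ops) {
  return ops.slice().reverse().map((op) => ({ i: op.i, j: op.j, dir: -op.dir }));
}

// [[o1, o2]] = o1' ∘ o2' ∘ o1 ∘ o2, which executes o2, then o1, then o2', then o1'.
function commutatorOp(o1, o2) {
  return [...o2, ...o1, ...prime(o2), ...prime(o1)];
}

// Apply the swaps to a copy of the list R (positions are 1-based).
function applyToR(ops, R) {
  const result = R.slice();
  for (const { i, j } of ops) {
    [result[i - 1], result[j - 1]] = [result[j - 1], result[i - 1]];
  }
  return result;
}

// Net number of counter-clockwise swaps minus clockwise swaps.
function netTurns(ops) {
  return ops.reduce((sum, op) => sum + op.dir, 0);
}

const op = commutatorOp(ccw(1, 2), ccw(2, 3));
console.log(op.map((o) => (o.dir > 0 ? 'ccw' : 'cw') + o.i + o.j).join(' '));
// ccw23 ccw12 cw23 cw12
console.log(applyToR(op, ['r1', 'r2', 'r3'])); // [ 'r2', 'r3', 'r1' ]
console.log(netTurns(op)); // 0
```

In this model, any $[\![ o_1, o_2 ]\!]$ has zero net turns by construction, which is exactly the property used to rule out expressions with no nested radicals.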

Then not only do we want to talk about commutators of particular permutations, we want to talk about the set of commutators of a particular group. In fact, for a group $G$, this set of commutators forms a subgroup $K(G)$ called the commutator subgroup. For the quadratic case, we just have $S_2$, which has only a single non-trivial element, so its commutator subgroup $K(S_2)$ is the trivial group. For the cubic case, we started with $S_3$, and we computed the commutator subgroup $K(S_3)$, which is just $\{ e, (1 \; 2 \; 3), (3 \; 2 \; 1) \}$. We can also compute the commutator subgroup of this group, which is just the trivial group again, since $K(S_3)$ is commutative. So we can see that $K(K(S_3))$ being the trivial group means that we can’t rule out algebraic expressions with nested radicals as solutions to the general cubic equation.

Given all the elements of a group $G$, it’s not particularly complicated to compute the commutator subgroup: just take all possible pairs of elements $g, h \in G$, compute $[g, h]$, and remove duplicates. However, we can make things easier for ourselves by finding generators for $K(G)$ as commutators of generators of $G$, since then we can easily map those back to $[\![ o_1, o_2 ]\!]$ applied on the appropriate operations. Fortunately, when $G = S_n$, we can use a few facts from group theory to easily compute $K(S_n)$. First, $K(S_n)$ is called the alternating group $A_n$, and is generated by the $3$-cycles of the form $(i \enspace i+1 \enspace i+2)$, similar to how $S_n$ is generated by the $2$-cycles of the form $(i \enspace i + 1)$. But a $3$-cycle $(i \enspace i+1 \enspace i+2)$ can be expressed as the commutator of two $2$-cycles $[(i+2 \enspace i+1), (i \enspace i+1)]$.

Therefore, for $S_4$, the generators for $K(S_4)$ are just $(1 \; 2 \; 3) = [(2 \; 3), (1 \; 2)]$ and $(2 \; 3 \; 4) = [(3 \; 4), (2 \; 3)]$, with respective operations $[\![ ↺_{2, 3}, ↺_{1, 2} ]\!]$ and $[\![ ↺_{3, 4}, ↺_{2, 3} ]\!]$. However, these two generators are not quite enough to generate $K^{(2)}(S_4)$ via commutators. Fortunately, it suffices to just add $↺_{4, 1}$ to the list of operations, which lets us add $(1 \; 4)$ to the list of generators for $S_4$, and then add $(3 \; 4 \; 1)$ to the list of generators for $K(S_4)$. Then $(1 \; 4) (2 \; 3) = [(2 \; 3 \; 4), (1 \; 2 \; 3)]$ and $(2 \; 1) (3 \; 4) = [(3 \; 4 \; 1), (2 \; 3 \; 4)]$ suffice to generate $K^{(2)}(S_4)$.^[5] Finally, we can easily compute $K^{(3)}(S_4)$ to be the trivial group.

What does that tell us about what expressions we can rule out as solutions to the general quartic equation? Similarly to the cubic case, we expect to be able to rule out rational expressions and algebraic expressions with no nested radicals, and since $K^{(2)}(S_4)$ is not the trivial group, we also expect to be able to rule out algebraic expressions with singly-nested radicals, like $\sqrt{a - \sqrt{bc - 4}}$. But since $K^{(3)}(S_4)$ is the trivial group, we don’t expect to be able to rule out algebraic expressions with doubly-nested radicals, like $\sqrt{a - \sqrt{bc - \sqrt{d + 3}}}$.
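The sizes of the groups $K^{(d)}(S_4)$ can also be found by brute force. The following sketch computes every commutator, closes the result under composition (to be safe, since a set of commutators need not be closed in general), and repeats. Permutations are assumed to be $0$-based arrays of images, and the function names are ours.

```javascript
// All permutations of {0, ..., n-1}; f[x] is the image of x.
function permutations(n) {
  if (n === 0) return [[]];
  const result = [];
  for (const p of permutations(n - 1)) {
    for (let k = 0; k <= p.length; k++) {
      result.push([...p.slice(0, k), n - 1, ...p.slice(k)]);
    }
  }
  return result;
}

// compose(f, g) applies g first, then f.
function compose(f, g) {
  return g.map((x) => f[x]);
}

function inverse(f) {
  const inv = new Array(f.length);
  f.forEach((y, x) => { inv[y] = x; });
  return inv;
}

// [g, h] = g^{-1} h^{-1} g h.
function commutator(g, h) {
  return compose(compose(inverse(g), inverse(h)), compose(g, h));
}

// The subgroup generated by gens: keep multiplying by generators until
// nothing new appears. (In a finite group this also picks up inverses.)
function generate(gens) {
  const seen = new Map(gens.map((g) => [g.join(','), g]));
  const uniqueGens = [...seen.values()];
  let frontier = uniqueGens;
  while (frontier.length > 0) {
    const next = [];
    for (const a of frontier) {
      for (const b of uniqueGens) {
        const c = compose(a, b);
        const key = c.join(',');
        if (!seen.has(key)) {
          seen.set(key, c);
          next.push(c);
        }
      }
    }
    frontier = next;
  }
  return [...seen.values()];
}

function commutatorSubgroup(group) {
  const commutators = [];
  for (const g of group) {
    for (const h of group) {
      commutators.push(commutator(g, h));
    }
  }
  return generate(commutators);
}

// Orders of K^(0)(S_n), K^(1)(S_n), ..., stopping at depth or at the trivial group.
function derivedSeriesOrders(n, depth) {
  let group = permutations(n);
  const orders = [group.length];
  for (let d = 1; d <= depth && group.length > 1; d++) {
    group = commutatorSubgroup(group);
    orders.push(group.length);
  }
  return orders;
}

console.log(derivedSeriesOrders(3, 5)); // [ 6, 3, 1 ]
console.log(derivedSeriesOrders(4, 5)); // [ 24, 12, 4, 1 ]
```

The output for $S_4$ matches the discussion above: $K(S_4)$ has $12$ elements, $K^{(2)}(S_4)$ has $4$, and $K^{(3)}(S_4)$ is trivial.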

As an antidote to all the abstractness above, here is a visualization for quartics, where you can examine how the various operations interact with the quartic formula and its subexpressions.

Interactive Example 5: The quartic equation

Roots

$R = \langle r_1, r_2, r_3, r_4 \rangle$

Coefficients

Candidate solution

$X = \langle x_1, x_2, x_3, x_4, x_5, x_6 \rangle$

$x_1 = -27 Δ$
$x_{1, 2} = Q^3 = \frac{Δ_1 + \sqrt{-27 Δ}}{2}$
$x_{1, 2, 3, 4, 5, 6} = Q$
$x_{1, 2, 3, 4, 5, 6} = S =$
$\qquad \frac{1}{2} \sqrt{-\frac{2}{3} p + \frac{1}{3a} \left( Q + \frac{Δ_0}{Q} \right)}$
$x_{1, 2, 3, 4} = $
$\qquad -\frac{b}{4a} \mp S + \frac{1}{2} \sqrt{-4S^2 - 2p \pm \frac{q}{S}}$
(the quartic formula)

There are a few additions to the interactive display above. It now prints a message when it detects that the selected expression is ruled out as the quartic formula, which just looks at whether $R$ is not in order and $X$ is, and vice versa. There’s also a button to reset the ordering of $R$ and $X$.

The second addition is that the operations have been organized to make clear what commutator subgroup they’re in. The $A_i$ map to generators of $S_4$. Then taking the commutators of adjacent $A_i$ gives the $B_i$, which map to the generators of $K(S_4)$, and similarly for the $C_i$.

The third addition is a button that finds the first operation that rules out the selected formula, if any. It simply tries all the $A_i$s, then all the $B_i$s, then all the $C_i$s, checking $R$ and $X$ in between. The general algorithm, which assumes a fixed set of roots $r_1, \dotsc, r_n\text{,}$ takes an expression $f(a_n, a_{n-1}, \dotsc)$ where $a_n x^n + a_{n-1} x^{n-1} + \dotsb + a_0 = 0$ is the general $n$th-degree polynomial equation, takes a depth limit $k$, and looks like this (defining $K^{(0)}(G)$ to be just $G$):

For $i$ from 0 to $k$:
1. If $K^{(i)}(S_n)$ is trivial, then terminate indicating that $f(a_n, a_{n-1}, \dotsc)$ was unable to be ruled out because $K^{(i)}(S_n)$ is trivial.
2. Otherwise, find operations $o_1$ to $o_m$ that act as the generators $g_1$ to $g_m$ of $K^{(i)}(S_n)$. For $i > 0$, this can be done by applying $[\![ o_1, o_2 ]\!]$ to the operations corresponding to the generators of $K^{(i-1)}(S_n)$.
3. For each $o_j$:
  1. Apply $o_j$.
  2. If $R$ is not in order but $X$ is, terminate indicating that $o_j$ rules out $f(a_n, a_{n-1}, \dotsc)$.
  3. Undo $o_j$, i.e. apply $o_j^\prime$ or reset to the initial state of $r_1, \dotsc, r_n$.
Terminate indicating that $f(a_n, a_{n-1}, \dotsc)$ was unable to be ruled out because the depth limit has been reached.

This algorithm basically just implements the proof of the following lemma, which generalizes the previous theorems, except that it tries to find the simplest operation that is a generator that rules out the given expression.

Before we state the lemma, we need another definition: let the radical level of an algebraic expression $f(a_n, a_{n-1}, \dotsc)$ be $0$ if $f(a_n, a_{n-1}, \dotsc)$ is a rational expression, $1$ if $f(a_n, a_{n-1}, \dotsc)$ has only non-nested radicals, and $m + 1$ if the maximum number of nested radicals is $m$.

(Lemma 3.) If the algebraic expression $f(a_n, a_{n-1}, \dotsc)$ has radical level $d$ and $K^{(d)}(S_n)$ is non-trivial, then any operator that maps to a non-trivial element $g$ in $K^{(d)}(S_n)$ rules out $f(a_n, a_{n-1}, \dotsc)$ as the solution to the general $n$th-degree polynomial equation \[ a_n x^n + a_{n-1} x^{n-1} + \dotsb + a_0 = 0\text{.} \]

Rough sketch of proof. We just do induction on $d$. For the base case $d = 0$, if $K^{(0)}(S_n)$ is non-trivial, then $n \ge 2$. Let $g = (i \; j)$ for any $i \ne j$, of which there must at least be one. Then by the same reasoning as Theorem 1, $g$ rules out $f(a_n, a_{n-1}, \dotsc)$. Since the $(i \; j)$ generate $S_n$, then any $g \in S_n$ is the composition of some sequence of $(i \; j)$s, each of which rules out $f(a_n, a_{n-1}, \dotsc)$, so $g$ must also rule it out.

Assume the lemma holds for $d$, and let $x = f_{d+1}(a_n, a_{n-1}, \dotsc) = \sqrt[k]{f_d(a_n, a_{n-1}, \dotsc)}$ for some $k$, where $f_d$ has radical level $d$. Let $o$ act on $R$ like any non-trivial element $g$ of $K^{(d+1)}(S_n)$. By the induction hypothesis, all elements $h_i \in K^{(d)}(S_n)$ cause $x = f_d(a_n, a_{n-1}, \dotsc)$ to go around a loop, so pick a point $z_0$ that is never equal to any point $x$ traverses under any operation corresponding to $h_i$. Then, since $g = [h, k]$ for $h, k \in K^{(d)}(S_n)$, by the same reasoning as in Theorem 2, the total angle induced by $o$ on $x = f_{d+1}(a_n, a_{n-1}, \dotsc)$ around $z_0$ is $0$, and the distance from $z_0$ remains unchanged. Thus, $x = f_{d+1}(a_n, a_{n-1}, \dotsc)$ remains fixed, and $o$ rules it out.

By the same reasoning as in Theorem 2, this can be extended to the general case of $f(a_n, a_{n-1}, \dotsc)$ being any algebraic formula with nesting level $d + 1$. ∎

We can immediately deduce the following corollary, using the fact that $K^{(2)}(S_4)$ is non-trivial:

(Corollary 4.) An algebraic expression with at most singly-nested radicals in the coefficients of the general quartic equation \[ ax^4 + bx^3 + cx^2 + dx + e = 0 \] cannot be a solution to this equation.^[6]

5. Quintic Equations

Now, finally, the quintic. Let’s jump right to the interactive example.

Interactive Example 6: The quintic equation

Roots

$R = \langle r_1, r_2, r_3, r_4, r_5 \rangle$

Coefficients

Candidate solution

$X = \langle x_1, x_2, x_3, x_4, x_5, x_6 \rangle$

$x_1 = f_A = Δ$
$x_{1, 2} = f_B = \sqrt{f_A}$
$x_{1, 2, 3, 4, 5, 6} = f_C =$
$\qquad \sqrt[3]{(f_B - 0.8)(f_B - 0.75)}$

Similarly to the interactive example for the quartic, the operations are organized to make clear what commutator subgroup they’re in. There’s something interesting though—the $C_i$ seem very similar to the $B_i$. In fact, the $C_i$ also act on $R$ like $A_5$! Also, if you compute $D_i = [\![ C_{(i+1) \bmod 5}, C_{i \bmod 5} ]\!]$, you will find that $D_i$ acts exactly like $B_i$ on $R$!

Why can we do this for the quintic, but not for anything of lower degree? This is because $A_5$ is perfect, which means that it equals its own commutator subgroup. (You can verify this yourself by brute force, e.g. writing a program, or you can play around with $3$-cycles and see that any $3$-cycle is the commutator of two other $3$-cycles.) Then this immediately implies that $K^{(n)}(S_5)$ is non-trivial for any $n$, which then implies our main result:
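The parenthetical claim about $3$-cycles can be checked by a short program. The sketch below lists the $3$-cycles on $n$ points, computes the commutators of all pairs of them, and tests whether every $3$-cycle shows up; as before, permutations are assumed to be $0$-based arrays of images, and the function names are ours.

```javascript
// compose(f, g) applies g first, then f.
function compose(f, g) {
  return g.map((x) => f[x]);
}

function inverse(f) {
  const inv = new Array(f.length);
  f.forEach((y, x) => { inv[y] = x; });
  return inv;
}

// [g, h] = g^{-1} h^{-1} g h.
function commutator(g, h) {
  return compose(compose(inverse(g), inverse(h)), compose(g, h));
}

// All 3-cycles (a b c) on {0, ..., n-1}, each listed once by requiring
// that a be the smallest of its three entries.
function threeCycles(n) {
  const result = [];
  for (let a = 0; a < n; a++) {
    for (let b = a + 1; b < n; b++) {
      for (let c = a + 1; c < n; c++) {
        if (b === c) continue;
        const f = Array.from({ length: n }, (_, x) => x);
        f[a] = b;
        f[b] = c;
        f[c] = a;
        result.push(f);
      }
    }
  }
  return result;
}

// Is every 3-cycle on n points the commutator of two 3-cycles?
function everyThreeCycleIsCommutator(n) {
  const cycles = threeCycles(n);
  const commutators = new Set();
  for (const g of cycles) {
    for (const h of cycles) {
      commutators.add(commutator(g, h).join(','));
    }
  }
  return cycles.every((c) => commutators.has(c.join(',')));
}

console.log(threeCycles(5).length); // 20
console.log(everyThreeCycleIsCommutator(5)); // true
console.log(everyThreeCycleIsCommutator(4)); // false
```

Since the $3$-cycles generate $A_5$, the `true` result for $n = 5$ shows that $K(A_5)$ contains all of $A_5$. The `false` result for $n = 4$ is consistent with $K(A_4)$ being the Klein four-group, which contains no $3$-cycles.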

(Abel-Ruffini theorem.) An algebraic expression in the coefficients of the general $n$th-degree polynomial equation \[ a_n x^n + a_{n-1} x^{n-1} + \dotsb + a_0 = 0 \] for $n \ge 5$ cannot be a solution to this equation.

Proof. By the above, $A_5$ is perfect, so $K^{(d)}(S_5)$ is non-trivial for all $d$.

Since $S_5$ is a subgroup of $S_n$ for $n \ge 5$, $A_5 = K(S_5)$ must also be a subgroup of $A_n = K(S_n)$ for $n \ge 5$. But since $A_5$ is perfect, then $A_5$ must also be a subgroup of $K^{(d)}(S_n)$ for any $d$, which means that $K^{(d)}(S_n)$ is non-trivial for any $d$ and $n \ge 5$.

An algebraic expression has some finite radical level $d$, but $K^{(d)}(S_n)$ is non-trivial for any $d$ and $n \ge 5$, so by Lemma 3 no algebraic expression can be a solution to the general $n$th-degree polynomial equation for $n \ge 5$. ∎

With the theorem above, we now have a succinct answer to the question at the beginning of this article. You can’t write down a solution to the general quadratic equation that is a rational expression because you can find an operation on the roots that will permute them non-trivially and yet leave the result of the expression constant. For the same reason, you can’t write down a solution to the general $n$th-degree polynomial equation for $n \ge 5$ that is an algebraic expression!

Finally, as a bonus, I’ll explain how to generate algebraic expressions that require a “$d$th-level” operator, meaning an operator that maps to an element of $K^{(d)}(S_n)$, assuming it’s non-trivial. This shows that there’s no single “super-operation” that rules out all algebraic expressions.

As an example, the formulas in the interactive example above are chosen so that $f_A$ is ruled out by the $A_i$, $f_B$ is ruled out by the $B_i$, etc. They depend on the particular roots chosen, of course, which is why this interactive example doesn’t let you move the roots around, but in principle you could build formulas for any polynomial that is first ruled out by $C_i$, or $D_i$, or whatever you wish. Given a polynomial $P = a_n x^n + a_{n-1} x^{n-1} + \dotsb + a_0$ of degree $n \ge 5$ and $d$, a recursive algorithm to generate an expression that is ruled out only by a “$d$th-level” operator is:

If $d = 0$, return $Δ(a_n, a_{n-1}, \dotsc)$.
Otherwise, run this algorithm with $P$ and $d-1$ to get $f_{d-1}(a_n, a_{n-1}, \dotsc)$.
Find operations $o_1$ to $o_m$ that correspond to generators $g_1$ to $g_m$ of $K^{(d-1)}(S_n)$.
For each $o_i$:
1. Apply $o_i$, which makes $x = f_{d-1}(a_n, a_{n-1}, \dotsc)$ go around a loop. Record the looped-around regions and their associated rotation numbers (i.e., the total angle divided by $2π$).
Pick points $z_1, \dotsc, z_t$ such that each $z_i$ has a non-zero rotation number for at least one $o_j$. $t$ can be at most $m$.
Let $k$ be the least number such that, for every $o_i$, $k$ doesn’t divide any of the rotation numbers of any $z_j$ with respect to $o_i$. Return $f_d(a_n, a_{n-1}, \dotsc) = \sqrt[k]{\prod_i (f_{d-1}(a_n, a_{n-1}, \dotsc) - z_i)}$.


Footnotes

[1] This proof is originally due to Arnold. There are a couple of videos that talk about this proof, as well as this book based on Arnold’s lectures, and this paper. I mostly follow Boaz’s video, and the interactive visualizations are based on the visualizations he has in his video.

The interactive visualizations were generated using the excellent JSXGraph library. ↩

[2] Theorem 1 can be generalized even more! We can append other functions and operations to rational expressions, as long as those functions and operations are continuous and single-valued. For example, we can allow the use of exponentials and trigonometric functions, which is something that the standard Galois theory cannot handle.↩

[3] More precisely, a $↺_{i, j}$ contains a pair of simple paths, i.e. continuous injective functions $[0, 1] \to \mathbb{C}$, between two distinct points of $\mathbb{C}$, such that their concatenation defines a simple closed curve around a region in $\mathbb{C}$ with a counter-clockwise orientation. Also, depending on the exact method of formalizing $↺_{i, j}$, it either explicitly or implicitly encodes a permutation on $R$. Then we can define an operation $*$ on the $↺_{i, j}$ and $↻_{i, j}$ (defined analogously) which concatenates the paths (and composes the permutations, if explicitly encoded). Since the space of paths has no inverses or an identity, the $↺_{i, j}$ and $↻_{i, j}$ generate a free semigroup with the operation $*$. Then this semigroup defines an action on $R$ via its associated permutation on $R$, which then just generates $S_n$, since $S_n$ is generated by adjacent swaps.

We make a distinction between the operation $↺_{i, j}$ and the permutation it induces on $R$, since the latter “loses” the orientation information, which is important to preserve when talking about the action of $↺_{i, j}$ on some $x_i$. ↩

[4] Note that, depending on the text, the commutator may be defined slightly differently as $g h g^{-1} h^{-1}$. ↩

[5] $K(A_4)$ is isomorphic to $V$, the Klein four-group. ↩

[6] In fact, the quartic formula has three nested radicals. I wonder why? ↩

Computing Integer Roots

2016-01-10T00:00:00-08:00

1. The algorithm

Today I’m going to talk about the generalization of the integer square root algorithm to higher roots. That is, given $n$ and $p$, computing $\iroot(n, p) = \lfloor \sqrt[p]{n} \rfloor$, or the greatest integer whose $p$th power is less than or equal to $n$. The generalized algorithm is straightforward, and it’s easy to generalize the proof of correctness, but the run-time bound is a bit trickier, since it has a dependence on $p$.

First, the algorithm, which we’ll call $\NewtonRoot$:

If $n = 0$, return $0$.
If $p \ge \Bits(n)$, return $1$.
Otherwise, set $i$ to $0$ and set $x_0$ to $2^{\lceil \Bits(n) / p\rceil}$.
Repeat:
1. Set $x_{i+1}$ to $\lfloor ((p - 1) x_i + \lfloor n/x_i^{p-1} \rfloor) / p \rfloor$.
2. If $x_{i+1} \ge x_i$, return $x_i$. Otherwise, increment $i$.

and its implementation in Javascript:^[1]

// iroot returns the greatest number x such that x^p <= n. The type of
// n must behave like BigInteger (e.g.,
// https://github.com/akalin/jsbn ), n must be non-negative, and
// p must be a positive integer.
//
// Example (open up the JS console on this page and type):
//
//   iroot(new BigInteger("64"), 3).toString()
function iroot(n, p) {
  var s = n.signum();
  if (s < 0) {
    throw new Error('negative radicand');
  }
  if (p <= 0) {
    throw new Error('non-positive degree');
  }
  if (p !== (p|0)) {
    throw new Error('non-integral degree');
  }

  if (s == 0) {
    return n;
  }

  var b = n.bitLength();
  if (p >= b) {
    return n.constructor.ONE;
  }

  // x = 2^ceil(Bits(n)/p)
  var x = n.constructor.ONE.shiftLeft(Math.ceil(b/p));
  var pMinusOne = new n.constructor((p - 1).toString());
  var pBig = new n.constructor(p.toString());
  while (true) {
    // y = floor(((p-1)x + floor(n/x^(p-1)))/p)
    var y = pMinusOne.multiply(x).add(n.divide(x.pow(pMinusOne))).divide(pBig);
    if (y.compareTo(x) >= 0) {
      return x;
    }
    x = y;
  }
}

This algorithm turns out to require $Θ(p) + O(\lg \lg n)$ loop iterations, with the run-time for a loop iteration depending on what kind of arithmetic operations are used.
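For readers without a `BigInteger` library at hand, here is a sketch of the same algorithm using the native `BigInt` type. The name `irootBigInt` is ours, and the sketch assumes a JavaScript engine with `BigInt` support; it is meant as an illustration rather than a replacement for the version above.

```javascript
// irootBigInt returns the greatest BigInt x such that x^p <= n, where n is
// a non-negative BigInt and p is a positive integer Number.
function irootBigInt(n, p) {
  if (typeof n !== 'bigint' || n < 0n) {
    throw new Error('n must be a non-negative BigInt');
  }
  if (!Number.isInteger(p) || p <= 0) {
    throw new Error('p must be a positive integer');
  }
  if (n === 0n) {
    return 0n;
  }

  // Bits(n) is the length of n's binary representation.
  const bits = n.toString(2).length;
  if (p >= bits) {
    return 1n;
  }

  const pBig = BigInt(p);
  // x = 2^ceil(Bits(n)/p)
  let x = 1n << BigInt(Math.ceil(bits / p));
  while (true) {
    // y = floor(((p-1)x + floor(n/x^(p-1)))/p); BigInt division truncates,
    // which is the same as flooring for non-negative values.
    const y = ((pBig - 1n) * x + n / x ** (pBig - 1n)) / pBig;
    if (y >= x) {
      return x;
    }
    x = y;
  }
}

console.log(irootBigInt(64n, 3)); // 4n
console.log(irootBigInt(972n, 3)); // 9n
console.log(irootBigInt(80n, 4)); // 2n
```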

2. Correctness

Again we look at the iteration rule: \[ x_{i+1} = \left\lfloor \frac{(p - 1) x_i + \left\lfloor \frac{n}{x_i^{p-1}} \right\rfloor}{p} \right\rfloor \] Letting $f(x)$ be the right-hand side, we can again use basic properties of the floor function to remove the inner floor: \[ f(x) = \left\lfloor \frac{1}{p} ((p-1) x + n/x^{p-1}) \right\rfloor \] Letting $g(x)$ be its real-valued equivalent: \[ g(x) = \frac{1}{p} ((p-1) x + n/x^{p-1}) \] we can, again using basic properties of the floor function, show that $f(x) \le g(x)$, and for any integer $m$, $m \le f(x)$ if and only if $m \le g(x)$.

Finally, let’s give a name to our desired output: let $s = \iroot(n, p) = \lfloor \sqrt[p]{n} \rfloor$.^[2]

Unsurprisingly, $f(x)$ never underestimates:

(Lemma 1.) For $x \gt 0$, $f(x) \ge s$.

Proof. By the basic properties of $f(x)$ and $g(x)$ above, it suffices to show that $g(x) \ge s$. $g'(x) = (1 - 1/p) (1 - n/x^p)$ and $g''(x) = (p - 1) (n/x^{p+1})$. Therefore, $g(x)$ is concave-up for $x \gt 0$; in particular, its single positive extremum at $x = \sqrt[p]{n}$ is a minimum. But $g(\sqrt[p]{n}) = \sqrt[p]{n} \ge s$. ∎

Also, our initial guess is always an overestimate:

(Lemma 2.) $x_0 \gt s$.

Proof. $\Bits(n) = \lfloor \lg n \rfloor + 1 \gt \lg n$. Therefore, \[ \begin{aligned} x_0 &= 2^{\lceil \Bits(n) / p \rceil} \\ &\ge 2^{\Bits(n) / p} \\ &\gt 2^{\lg n / p} \\ &= \sqrt[p]{n} \\ &\ge s\text{.} \; \blacksquare \end{aligned} \]

Therefore, we again have the invariant that $x_i \ge s$, which lets us prove partial correctness:

(Theorem 1.) If $\NewtonRoot$ terminates, it returns the value $s$.

Proof. Assume it terminates. If it terminates in step $1$ or $2$, then we are done. Otherwise, it can only terminate in step $4.2$ where it returns $x_i$ such that $x_{i+1} = f(x_i) \ge x_i$. This implies $g(x_i) = ((p-1)x_i + n/x_i^{p-1}) / p \ge x_i$. Rearranging yields $n \ge x_i^p$ and combining with our invariant we get $\sqrt[p]{n} \ge x_i \ge s$. But $s + 1 \gt \sqrt[p]{n}$, so that forces $x_i$ to be $s$, and thus $\NewtonRoot$ returns $s$ if it terminates. ∎

Total correctness is also easy:

(Theorem 2.) $\NewtonRoot$ terminates.

Proof. Assume it doesn’t terminate. Then we have a strictly decreasing infinite sequence of integers $\{ x_0, x_1, \dotsc \}$. But this sequence is bounded below by $s$, so it cannot decrease indefinitely. This is a contradiction, so $\NewtonRoot$ must terminate. ∎

Note that, as in the square root case, the check in step $4.2$ cannot be weakened to $x_{i+1} = x_i$, as doing so would cause the algorithm to oscillate. In fact, as $p$ grows, so do the number of values of $n$ that exhibit this behavior, and so do the number of possible oscillations. For example, $n = 972$ with $p = 3$ would yield the sequence $\{ 16, 11, 10, 9, 10, 9, \dotsc \}$, and $n = 80$ with $p = 4$ would yield the sequence $\{ 4, 3, 2, 4, 3, 2, \dotsc \}$.
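These oscillating sequences are easy to reproduce. The sketch below just applies the iteration rule a fixed number of times without any termination check; it uses plain numbers, so it is only meant for small inputs like the ones above, and the name `newtonIterates` is ours.

```javascript
// Returns the first `count` iterates x_0, x_1, ... of the iteration rule
// for small positive integers n and p, with no termination check.
function newtonIterates(n, p, count) {
  const bits = n.toString(2).length;
  const xs = [2 ** Math.ceil(bits / p)];
  while (xs.length < count) {
    const x = xs[xs.length - 1];
    xs.push(Math.floor(((p - 1) * x + Math.floor(n / x ** (p - 1))) / p));
  }
  return xs;
}

console.log(newtonIterates(972, 3, 6)); // [ 16, 11, 10, 9, 10, 9 ]
console.log(newtonIterates(80, 4, 6)); // [ 4, 3, 2, 4, 3, 2 ]
```

In both cases the sequence first increases right after reaching $s$ ($9$ and $2$ respectively), which is exactly where the check $x_{i+1} \ge x_i$ stops.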

3. Run-time

We will show that $\NewtonRoot$ takes $Θ(p) + O(\lg \lg n)$ loop iterations. Then we will analyze a single loop iteration and the arithmetic operations used to get a total run-time bound.

Analogous to the square root case, define $\Err(x) = x^p/n - 1$ and let $ϵ_i = \Err(x_i)$. First, let’s prove our lower bound for $ϵ_i$, which translates directly from the square root case:

(Lemma 3.) $x_i \ge s + 1$ if and only if $ϵ_i \ge 1/n$.

Proof. $n \lt (s + 1)^p$, so $n + 1 \le (s + 1)^p$, and therefore $(s + 1)^p/n - 1 \ge 1/n$. But the expression on the left side is just $\Err(s + 1)$. $x_i \ge s + 1$ if and only if $ϵ_i \ge \Err(s + 1)$, so the result immediately follows. ∎

Now for the next few lemmas we need to do some algebra and calculus. Inverting $\Err(x)$, we get that $x_i = \sqrt[p]{(ϵ_i + 1) \cdot n}$. Expressing $g(x_i)$ in terms of $ϵ_i$ and $q = 1 - 1/p$ we get \[ g(x_i) = \sqrt[p]{n} \left( \frac{ϵ_i q + 1}{(ϵ_i + 1)^q} \right) \] and \[ \Err(g(x_i)) = \frac{(q ϵ_i + 1)^p}{(ϵ_i + 1)^{p-1}} - 1\text{.} \] Let \[ f(ϵ) = \frac{(q ϵ + 1)^p}{(ϵ + 1)^{p-1}} - 1\text{.} \] Then computing derivatives, \[ \begin{aligned} f'(ϵ) &= q ϵ \frac{(q ϵ + 1)^{p-1}}{(ϵ + 1)^p}\text{,} \\ f''(ϵ) &= q \frac{(q ϵ + 1)^{p-2}}{(ϵ + 1)^{p + 1}}\text{, and} \\ f'''(ϵ) &= -q (2 + q (2 + 3 ϵ)) \frac{(q ϵ + 1)^{p-3}}{(ϵ + 1)^{p + 2}}\text{.} \end{aligned} \] Note that $f(0) = f'(0) = 0$, and $f''(0) = q$. Also, for $ϵ > 0$, $f'(ϵ) \gt 0$, $f''(ϵ) \gt 0$, and $f'''(ϵ) < 0$.

Now we’re ready to show that the $ϵ_i$ shrink quadratically:

(Lemma 4.) $f(ϵ) \lt (ϵ/\sqrt{2})^2$ for $ϵ \gt 0$.

Proof. Taylor-expand $f(ϵ)$ around $0$ with the Lagrange remainder form to get \[ f(ϵ) = f(0) + f'(0) ϵ + \frac{f''(0)}{2} ϵ^2 + \frac{f'''(\xi)}{6} ϵ^3 \] for some $\xi$ such that $0 \lt \xi \lt ϵ$. Plugging in values, we see that $f(ϵ) = \frac{1}{2} q ϵ^2 + \frac{1}{6} f'''(\xi) ϵ^3$ with the last term being negative, so $f(ϵ) \lt \frac{1}{2} q ϵ^2 \lt \frac{1}{2} ϵ^2$. ∎

But this is only a useful upper bound when $ϵ_i \le 1$. In the square root case this was okay, since $ϵ_1 \le 1$, but that is not true for larger values of $p$. In fact, in general, the $ϵ_i$ start off shrinking linearly:

(Lemma 5.) For $ϵ \gt 1$, $f(ϵ) \gt ϵ/8$.

Proof. Since $f(0) = f'(0) = 0$, and $f''(ϵ) \gt 0$ for $ϵ \ge 0$, $f'(ϵ)$ and $f(ϵ)$ are increasing, and thus $f(1) \gt 0$ and $f(ϵ)$ is a concave-up curve.

Then $(0, 0)$ and $(1, f(1))$ are two points on a concave-up curve, and thus geometrically the line $y = f(1) ϵ$ must lie below $y = f(ϵ)$ for $ϵ \gt 1$, and thus $f(ϵ) \gt f(1) ϵ$ for $ϵ \gt 1$. Algebraically, this also follows from the definition of (strict) convexity (with $x_1 = 0$, $x_2 = ϵ$, and $t = 1 - 1/ϵ$).

But $f(1) = (2 - 1/p)^p/2^{p-1} - 1 = 2 \left(1 - \frac{1}{2p}\right)^p - 1$, which is always increasing as a function of $p$, as you can see by calculating its derivative. Therefore, its minimum is at $p = 2$, which is $1/8$, and so $f(ϵ) \gt f(1) ϵ \ge ϵ/8$. ∎
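
As a sanity check on Lemmas 4 and 5, the function $f$ can be evaluated directly in floating point. This is only a numerical spot check under the definition of $f$ given above, not a substitute for the proofs; the name `errMap` is made up for this sketch.

```javascript
// errMap(eps, p) evaluates f(eps) = (q eps + 1)^p / (eps + 1)^(p-1) - 1
// with q = 1 - 1/p, using ordinary floating-point arithmetic.
function errMap(eps, p) {
  var q = 1 - 1 / p;
  return Math.pow(q * eps + 1, p) / Math.pow(eps + 1, p - 1) - 1;
}
```

At $p = 2$ and $ϵ = 1$ this evaluates to $1/8$, the minimum slope used in Lemma 5, and sampling a few values of $p$ and $ϵ$ shows the quadratic bound $f(ϵ) \lt ϵ^2/2$ and the linear bound $f(ϵ) \gt ϵ/8$ (for $ϵ \gt 1$) holding side by side.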

Finally, let’s bound our initial values:

(Lemma 6.) $x_0 \le 2s$ and $ϵ_0 \le 2^p - 1$.

Proof. This is a straightforward generalization of the equivalent lemma from the square root case. Let’s start with $x_0$: \[ \begin{aligned} x_0 &= 2^{\lceil \Bits(n) / p \rceil} \\ &= 2^{\lfloor (\lfloor \lg n \rfloor + 1 + p - 1)/p \rfloor} \\ &= 2^{\lfloor \lg n / p \rfloor + 1} \\ &= 2 \cdot 2^{\lfloor \lg n / p \rfloor}\text{.} \end{aligned} \] Then $x_0/2 = 2^{\lfloor \lg n / p \rfloor} \le 2^{\lg n / p} = \sqrt[p]{n}$. Since $x_0/2$ is an integer, $x_0/2 \le \sqrt[p]{n}$ if and only if $x_0/2 \le \lfloor \sqrt[p]{n} \rfloor = s$. Therefore, $x_0 \le 2s$.

As for $ϵ_0$: \[ \begin{aligned} ϵ_0 &= \Err(x_0) \\ &\le \Err(2s) \\ &= (2s)^p/n - 1 \\ &= 2^p s^p/n - 1\text{.} \end{aligned} \] Since $s^p \le n$, $2^p s^p/n \le 2^p$ and thus $ϵ_0 \le 2^p - 1$. ∎

Now we’re ready to show our main result, which involves calculating how long the $ϵ_i$ shrink linearly:

(Theorem 3.) $\NewtonRoot$ performs $Θ(p) + O(\lg \lg n)$ loop iterations.

Proof. Assume that $ϵ_i \gt 1$ for $i \le j$, $ϵ_{j+1} \le 1$, and $j+k$ is the number of loop iterations performed when running the algorithm for $n$ and $p$ (i.e., $x_{j+k} \ge x_{j+k-1}$). Using Lemma 5, \[ \left( \frac{1}{8} \right)^{j+1} ϵ_0 \lt ϵ_{j+1} \le 1\text{,} \] which implies \[ j \gt \frac{\lg ϵ_0}{3} - 1\text{.} \]

Similarly, \[ \left( \frac{1}{8} \right)^j ϵ_0 \ge ϵ_j \gt 1\text{,} \] which implies \[ j \lt \frac{\lg ϵ_0}{3} \text{.} \] Therefore, $j = Θ(\lg ϵ_0)$, which is $Θ(p)$ by Lemma 6.

Now assume $k \ge 5$. Then $x_i \ge s + 1$ for $i \lt j + k - 1$. Since $ϵ_{j+1} \le 1$ by assumption, $ϵ_{j+3} \le 1/2$ and $ϵ_i \le (ϵ_{j+3})^{2^{i-j-3}}$ for $j + 3 \le i \lt j + k - 1$ by Lemma 4, then $ϵ_{j+k-2} \le 2^{-2^{k-5}}$. But $1/n \le ϵ_{j+k-2}$ by Lemma 3, so $1/n \le 2^{-2^{k-5}}$. Taking logs to bring down the $k$ yields $k - 5 \le \lg \lg n$. Then $k \le \lg \lg n + 5$, and thus $k = O(\lg \lg n)$.

Therefore, the total number of loop iterations is $Θ(p) + O(\lg \lg n)$. ∎

Note that $p \le \lg n$, so we can just say that $\NewtonRoot$ performs $O(\lg n)$ loop iterations. But that obscures rather than simplifies. Note that the proof above is very similar to the proof of the worse run-time of $\mathrm{N{\small EWTON}\text{-}I{\small SQRT}'}$ where the initial guess varies. In this case, the error in our initial guess is magnified, since we raise it to the $(p-1)$th power, and so that manifests as the $Θ(p)$ term.

Furthermore, unlike the square root case, the number of arithmetic operations in a loop iteration isn’t constant. In particular, the sub-step to compute $x_i^{p-1}$ takes a number of arithmetic operations dependent on $p - 1$. Using repeated squarings, this computation would take $Θ(\lg p)$ squarings and at most $Θ(\lg p)$ multiplications.

If the cost of an arithmetic operation is constant, e.g., we’re working with fixed-size integers, then the run-time bound is the above multiplied by $Θ(\lg p)$.

Otherwise, if the cost of an arithmetic operation depends on the length of its arguments, then we only have to multiply by a constant factor to get the run-time bounds in terms of arithmetic operations. If the cost of multiplying two numbers $\le x$ is $M(x) = O(\lg^k x)$, then the cost of computing $x^p$ is $O((p \lg x)^k)$. But $x$ is $Θ(n^{1/p})$, so the cost of computing $x^p$ is $O(\lg^k n)$, which is on the order of the cost of multiplying two numbers $\le n$. Furthermore, note that we divide the result into $n$, so we can stop once the computation of $x_i^{p-1}$ exceeds $n$. So in that case, we can treat a loop iteration as if it were performing a constant number of arithmetic operations on numbers of order $n$, and so, like in the square root case, we pick up a factor of $D(n)$, where $D(n)$ is the run-time of dividing $n$ by some number $\le n$.
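
The two ideas in this section, repeated squaring and stopping once the power exceeds $n$, can be combined in a few lines. The following is a sketch using native `BigInt` values; `powCapped` is a hypothetical helper name, not something taken from the implementations mentioned in the footnotes.

```javascript
// Hypothetical helper: returns x^e if x^e <= limit, or null as soon as the
// computation is known to exceed limit. All arguments are BigInts with
// x >= 1n, e >= 0n and limit >= 1n.
function powCapped(x, e, limit) {
  var result = 1n;
  var base = x;
  while (e > 0n) {
    if (e & 1n) {
      result *= base;
      if (result > limit) return null;
    }
    e >>= 1n;
    if (e > 0n) {
      base *= base;
      // The remaining exponent is non-zero, so this squared base will be
      // multiplied into the result at least once more; if it already exceeds
      // limit, so does the final power.
      if (base > limit) return null;
    }
  }
  return result;
}
```

In a loop iteration one would call something like `powCapped(x, p - 1n, n)`; a `null` result means $x_i^{p-1} \gt n$, so $\lfloor n / x_i^{p-1} \rfloor = 0$ and no division is needed.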


Footnotes

[1] Go and JS implementations are available on my GitHub. ↩

[2] Here, and in most of the article, we’ll implicitly assume that $n \gt 0$ and $p \gt 1$. ↩

Sampling the Visible Sphere

2015-08-26T00:00:00-07:00

(Note: this article is a summary of this thread on ompf2.)

The usual method for sampling a sphere from a point outside the sphere is to calculate the angle of the cone of the visible portion and uniformly sample within that cone, as described in Shirley/Wang.

However, one detail that is glossed over is that you still need to map from the sampled direction to the point on the sphere. The usual method is to simply generate a ray from the point and the sampled direction and intersect it with the sphere. However, this intersection test may fail due to floating point inaccuracies (e.g., if the sphere is small and the distance from the point is large).

I've found a couple of existing ways to deal with this. As described in the pbrt book, pbrt simply assumes that the ray just grazes the sphere if the intersection fails, and then projects the center of the sphere onto the ray (code here). mitsuba moves the origin of the ray closer to the sphere (in fact, from within it) before doing the test (falling back to projecting the center onto the ray if that still fails) (code here).

However, this seems inelegant. I've come up with a better way, which involves converting the sampled cone angle $θ$ (as measured from the segment connecting the point to the sphere center) into an angle $α$ from the inside of the sphere, and then simply using $α$ and the sampled polar angle $\varphi$ to compute the point on the sphere. This turns out to be simple, and in my unscientific tests a bit faster.

Here's a crude diagram showing the geometry:

You can see that \[ L = d \cos θ - \sqrt{r^2 - d^2 \sin^2 θ} \] and also by the law of cosines, \[ L^2 = d^2 + r^2 - 2 d r \cos α\text{.} \] We're actually more interested in $\cos α$, so solving for that we get \[ \cos α = \frac{d}{r} \sin^2 θ + \cos θ \sqrt{1 - \frac{d^2}{r^2} \sin^2 θ}\text{.} \] An alternate form, which may be easier to analyze, recalling that $\sin θ_{\max} = r/d$, is \[ \cos α = \frac{\sin^2 θ}{\sin θ_{\max}} + \cos θ \sqrt{1 - \frac{\sin^2 θ}{\sin^2 θ_{\max}}}\text{.} \]

So sampling pseudocode would look like:

(cos θ, φ) = uniformSampleCone(rng, cos θmax)
D = 1 - d² sin² θ / r²
if D ≤ 0 {
  cos α = sin θmax
} else {
  cos α = (d/r) sin² θ + cos θ √D
}
ω = sphericalDirection(cos α, φ)
pSurface = C + r ω

(I haven't done any analysis yet on the most robust way, in the floating-point sense, to do the calculations above.)

There's no backfacing since we clamp $\cos α$ to $\sin θ_{\max}$, which is analogous to the case when the ray from $P$ misses the sphere.
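
The core of the pseudocode, turning $\cos θ$ into $\cos α$, can be sketched in JavaScript as follows. The function names are made up for this example, and the second function assumes a local frame with the sphere center at the origin and $P$ on the positive $z$ axis at distance $d$; a real renderer would still have to rotate the result into world space.

```javascript
// Returns cos(alpha) given cos(theta), the distance d from P to the sphere
// center, and the sphere radius r. Assumes d > r > 0.
function cosAlphaFromCosTheta(cosTheta, d, r) {
  var sin2Theta = Math.max(0, 1 - cosTheta * cosTheta);
  var D = 1 - (d * d * sin2Theta) / (r * r);
  if (D <= 0) {
    // The ray would graze or miss the sphere: clamp to sin(thetaMax) = r/d.
    return r / d;
  }
  return (d / r) * sin2Theta + cosTheta * Math.sqrt(D);
}

// Point on the sphere in the assumed local frame (center at the origin,
// P at (0, 0, d)), returned as an [x, y, z] array.
function surfacePointLocal(cosAlpha, phi, r) {
  var sinAlpha = Math.sqrt(Math.max(0, 1 - cosAlpha * cosAlpha));
  return [
    r * sinAlpha * Math.cos(phi),
    r * sinAlpha * Math.sin(phi),
    r * cosAlpha
  ];
}
```

One way to check the algebra is to compare the distance from $P$ to the returned point with $L = d \cos θ - \sqrt{r^2 - d^2 \sin^2 θ}$; the two should agree up to rounding.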

Note that one cannot just compute $α_{\max}$ and uniformly sample the cone from inside the sphere, as that doesn't produce the same distribution over the visible region as sampling the cone from outside the sphere. To preserve correctness, you would have to use the (uniform) PDF over the surface area of the visible portion of the sphere, but you would have to then convert that to a PDF with respect to projected solid angle from $P$, which is suboptimal to just doing the sampling with respect to projected solid angle from $P$ as described above.


Computing the Integer Square Root

2014-12-09T00:00:00-08:00

1. The algorithm

Today I’m going to talk about a fast algorithm to compute the integer square root of a non-negative integer $n$, $\isqrt(n) = \lfloor \sqrt{n} \rfloor$, or in words, the greatest integer whose square is less than or equal to $n$.^[1] Most sources that describe the algorithm take it for granted that it is correct and fast. This is far from obvious! So I will prove both correctness and speed below.

One simple fact is that $\isqrt(n) \le n/2$, so a straightforward algorithm is just to test every non-negative integer up to $n/2$. This takes $O(n)$ arithmetic operations, which is bad since it’s exponential in the size of the input. That is, letting $\Bits(n)$ be the number of bits required to store $n$ and letting $\lg n$ be the base-$2$ logarithm of $n$, $\Bits(n) = O(\lg n)$, and thus this algorithm takes $O(2^{\Bits(n)})$ arithmetic operations.

We can do better by doing binary search; start with the range $[0, n/2]$ and adjust it based on comparing the square of an integer in the middle of the range to $n$. This takes $O(\lg n) = O(\Bits(n))$ arithmetic operations.
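
For comparison with what follows, here is a minimal sketch of the binary search approach using native `BigInt` values. The function name is invented for this example, and the search range is taken to be $[0, n]$ rather than $[0, n/2]$ so that $n = 1$ needs no special case.

```javascript
// Binary search for the greatest x with x^2 <= n. n must be a non-negative
// BigInt. Maintains the invariant lo^2 <= n < (hi + 1)^2.
function isqrtBinarySearch(n) {
  var lo = 0n;
  var hi = n;
  while (lo < hi) {
    // Round the midpoint up so that the range always shrinks.
    var mid = (lo + hi + 1n) / 2n;
    if (mid * mid <= n) {
      lo = mid;
    } else {
      hi = mid - 1n;
    }
  }
  return lo;
}
```

Each iteration halves the range, which is where the $O(\lg n)$ bound on arithmetic operations comes from.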

However, the algorithm below is even faster:^[2]

1. If $n = 0$, return $0$.
2. Otherwise, set $i$ to $0$ and set $x_0$ to $2^{\lceil \Bits(n) / 2\rceil}$.
3. Repeat:
   1. Set $x_{i+1}$ to $\lfloor (x_i + \lfloor n/x_i \rfloor) / 2 \rfloor$.
   2. If $x_{i+1} \ge x_i$, return $x_i$. Otherwise, increment $i$.

Call this algorithm $\NewtonSqrt$, since it’s based on Newton’s method. It’s not obvious, but this algorithm returns $\isqrt(n)$ using only $O(\lg \lg n) = O(\lg(\Bits(n)))$ arithmetic operations, as we will prove below. But first, here’s an implementation of the algorithm in JavaScript:^[3]

// isqrt returns the greatest number x such that x^2 <= n. The type of
// n must behave like BigInteger (e.g.,
// https://github.com/akalin/jsbn ), and n must be non-negative.
//
// Example (open up the JS console on this page and type):
//
//   isqrt(new BigInteger("64")).toString()
function isqrt(n) {
  var s = n.signum();
  if (s < 0) {
    throw new Error('negative radicand');
  }
  if (s == 0) {
    return n;
  }

  // x = 2^ceil(Bits(n)/2)
  var x = n.constructor.ONE.shiftLeft(Math.ceil(n.bitLength()/2));
  while (true) {
    // y = floor((x + floor(n/x))/2)
    var y = x.add(n.divide(x)).shiftRight(1);
    if (y.compareTo(x) >= 0) {
      return x;
    }
    x = y;
  }
}

2. Correctness

The core of the algorithm is the iteration rule: \[ x_{i+1} = \left\lfloor \frac{x_i + \lfloor \frac{n}{x_i} \rfloor}{2} \right\rfloor \] where the floor functions are there only because we’re using integer division. Define an integer-valued function $f(x)$ for the right side. Using basic properties of the floor function, you can show that you can remove the inner floor: \[ f(x) = \left\lfloor \frac{1}{2} (x + n/x) \right\rfloor \] which makes it a bit easier to analyze. Also, the properties of $f(x)$ are closely related to its equivalent real-valued function: \[ g(x) = \frac{1}{2} (x + n/x)\text{.} \]

For starters, again using basic properties of the floor function, you can show that $f(x) \le g(x)$, and for any integer $m$, $m \le f(x)$ if and only if $m \le g(x)$.
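
The claim that the inner floor can be removed is easy to spot-check with exact integer arithmetic, since $\lfloor (x + n/x)/2 \rfloor = \lfloor (x^2 + n) / (2x) \rfloor$. The sketch below uses native `BigInt` division, which truncates and so acts as floor for non-negative values; the function names are invented for the example.

```javascript
// f as written in the iteration rule, with both floors.
// x must be a positive BigInt and n a non-negative BigInt.
function fWithInnerFloor(n, x) {
  return (x + n / x) / 2n;
}

// f with the inner floor removed: floor((x + n/x) / 2), computed exactly
// as floor((x^2 + n) / (2x)).
function fWithoutInnerFloor(n, x) {
  return (x * x + n) / (2n * x);
}
```

The two functions agree on every pair of small inputs one cares to try, as the nested-floor property predicts.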

Finally, let’s give a name to our desired output: let $s = \isqrt(n) = \lfloor \sqrt{n} \rfloor$.^[4]

Intuitively, $f(x)$ and $g(x)$ “average out” however far away their input $x$ is from $\sqrt{n}$. Conveniently, this “average” is never an underestimate:

(Lemma 1.) For $x \gt 0$, $f(x) \ge s$.

Proof. By the basic properties of $f(x)$ and $g(x)$ above, it suffices to show that $g(x) \ge s$. $g'(x) = (1 - n/x^2)/2$ and $g''(x) = n/x^3$. Therefore, $g(x)$ is concave-up for $x \gt 0$; in particular, its single positive extremum at $x = \sqrt{n}$ is a minimum. But $g(\sqrt{n}) = \sqrt{n} \ge s$. ∎

(You can also prove Lemma 1 without calculus; show that $g(x) \ge s$ if and only if $x^2 - 2sx + n \ge 0$, which is true when $s^2 \le n$, which is true by definition.)

Furthermore, our initial estimate is always an overestimate:

(Lemma 2.) $x_0 \gt s$.

Proof. $\Bits(n) = \lfloor \lg n \rfloor + 1 \gt \lg n$. Therefore, \[ \begin{aligned} x_0 &= 2^{\lceil \Bits(n) / 2 \rceil} \\ &\ge 2^{\Bits(n) / 2} \\ &\gt 2^{\lg n / 2} \\ &= \sqrt{n} \\ &\ge s\text{.} \; \blacksquare \end{aligned} \]

(Note that any number greater than $s$, say $n$ or $\lceil n/2 \rceil$, can be chosen for our initial guess without affecting correctness. However, the expression above is necessary to guarantee performance. Another possibility is $2^{\lceil \lceil \lg n \rceil / 2 \rceil}$, which has the advantage that if $n$ is an even power of $2$, then $x_0$ is immediately set to $\sqrt{n}$. However, this is usually not worth the cost of checking that $n$ is a power of $2$, as is required to compute $\lceil \lg n \rceil$.)

An easy consequence of Lemmas 1 and 2 is that the invariant $x_i \ge s$ holds. That lets us prove partial correctness of $\NewtonSqrt$:

(Theorem 1.) If $\NewtonSqrt$ terminates, it returns the value $s$.

Proof. Assume it terminates. If it terminates in step $1$, then we are done. Otherwise, it can only terminate in step $3.2$ where it returns $x_i$ such that $x_{i+1} = f(x_i) \ge x_i$. This implies that $g(x_i) = (x_i + n/x_i) / 2 \ge x_i$. Rearranging yields $n \ge x_i^2$ and combining with our invariant we get $\sqrt{n} \ge x_i \ge s$. But $s + 1 \gt \sqrt{n}$, so that forces $x_i$ to be $s$, and thus $\NewtonSqrt$ returns $s$ if it terminates. ∎

For total correctness we also need to show that $\NewtonSqrt$ terminates. But this is easy:

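(Theorem 2.) $\NewtonSqrt$ terminates.

Proof. Assume it doesn’t terminate. Then we have a strictly decreasing infinite sequence of integers $\{ x_0, x_1, \dotsc \}$. But this sequence is bounded below by $s$, so it cannot decrease indefinitely. This is a contradiction, so $\NewtonSqrt$ must terminate. ∎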

We are done proving correctness, but you might wonder if the check $x_{i+1} \ge x_i$ in step $3.2$ is necessary. That is, can it be weakened to the check $x_{i+1} = x_i$? The answer is “no”; to see that, let $k = n - s^2$. Since $n \lt (s+1)^2$, $k \lt 2s + 1$. On the other hand, consider the inequality $f(x_i) \gt x_i$. Since that would cause the algorithm to terminate and return $x_i$, that implies that $x_i = s$. Therefore, that inequality is equivalent to $f(s) \gt s$, which is equivalent to $f(s) \ge s + 1$, which is equivalent to $g(s) = (s + n/s) / 2 \ge s + 1$. Rearranging yields $n \ge s^2 + 2s$. Substituting in $n = s^2 + k$, we get $s^2 + k \ge s^2 + 2s$, which is equivalent to $k \ge 2s$. But since $k \lt 2s + 1$, that forces $k$ to equal $2s$. That is the maximum value $k$ can be, so therefore $n$ must be one less than a perfect square. Indeed, for such numbers, weakening the check would cause the algorithm to oscillate between $s$ and $s + 1$. For example, $n = 99$ would yield the sequence $\{ 16, 11, 10, 9, 10, 9, \dotsc \}$.

3. Run-time

We will show that $\NewtonSqrt$ takes $O(\lg \lg n)$ arithmetic operations. Since each loop iteration does only a fixed number of arithmetic operations (with the division of $n$ by $x$ being the most expensive), it suffices to show that our algorithm performs $O(\lg \lg n)$ loop iterations.

It is well known that Newton’s method converges quadratically sufficiently close to a simple root. We can’t actually use this result directly, since it’s not clear that the convergence properties of Newton’s method are preserved when using integer operations, but we can do something similar.

Define $\Err(x) = x^2/n - 1$ and let $ϵ_i = \Err(x_i)$. Intuitively, $\Err(x)$ is a conveniently-scaled measure of the error of $x$: it is less than $1$ for most of the values we care about and it is bounded below for integers greater than our target $s$. Also, we will show that the $ϵ_i$ shrink quadratically. These facts will then let us show our bound for the iteration count.

First, let’s prove our lower bound for $ϵ_i$:

(Lemma 3.) $x_i \ge s + 1$ if and only if $ϵ_i \ge 1/n$.

Proof. $n \lt (s + 1)^2$, so $n + 1 \le (s + 1)^2$, and therefore $(s + 1)^2/n - 1 \ge 1/n$. But the expression on the left side is just $\Err(s + 1)$. $x_i \ge s + 1$ if and only if $ϵ_i \ge \Err(s + 1)$, so the result immediately follows. ∎

Then we can use that to show that the $ϵ_i$ shrink quadratically:

(Lemma 4.) If $x_i \ge s + 1$, then $ϵ_{i+1} \lt (ϵ_i/2)^2$.

Proof. $ϵ_{i+1}$ is just $\Err(f(x_i)) \le \Err(g(x_i))$, so it suffices to show that $\Err(g(x_i)) \lt (ϵ_i/2)^2$. Inverting $\Err(x)$, we get that $x_i = \sqrt{(ϵ_i + 1) \cdot n}$. Expressing $g(x_i)$ in terms of $ϵ_i$ we get \[ g(x_i) = \frac{\sqrt{n}}{2} \left( \frac{ϵ_i + 2}{\sqrt{ϵ_i + 1}} \right) \] and \[ \Err(g(x_i)) = \frac{(ϵ_i/2)^2}{ϵ_i+1}\text{.} \] Therefore, it suffices to show that the denominator is greater than $1$. But $x_i \ge s + 1$ implies $ϵ_i \gt 0$ by Lemma 3, so that follows immediately and the result is proved. ∎

Then let’s bound our initial values:

(Lemma 5.) $x_0 \le 2s$, $ϵ_0 \le 3$, and $ϵ_1 \le 1$.

Proof. Let’s start with $x_0$: \[ \begin{aligned} x_0 &= 2^{\lceil \Bits(n) / 2 \rceil} \\ &= 2^{\lfloor (\lfloor \lg n \rfloor + 1 + 1)/2 \rfloor} \\ &= 2^{\lfloor \lg n / 2 \rfloor + 1} \\ &= 2 \cdot 2^{\lfloor \lg n / 2 \rfloor}\text{.} \end{aligned} \] Then $x_0/2 = 2^{\lfloor \lg n / 2 \rfloor} \le 2^{\lg n / 2} = \sqrt{n}$. Since $x_0/2$ is an integer, $x_0/2 \le \sqrt{n}$ if and only if $x_0/2 \le \lfloor \sqrt{n} \rfloor = s$. Therefore, $x_0 \le 2s$.

As for $ϵ_0$: \[ \begin{aligned} ϵ_0 &= \Err(x_0) \\ &\le \Err(2s) \\ &= (2s)^2/n - 1 \\ &= 4s^2/n - 1\text{.} \end{aligned} \] Since $s^2 \le n$, $4s^2/n \le 4$ and thus $ϵ_0 \le 3$.

Finally, $ϵ_1$ is just $\Err(f(x_0))$. Using calculations from Lemma 4, \[ \begin{aligned} ϵ_1 &\le \Err(g(x_0)) \\ &= (ϵ_0/2)^2/(ϵ_0 + 1) \\ &\le (3/2)^2/(3 + 1) \\ &= 9/16\text{.} \end{aligned} \] Therefore, $ϵ_1 \le 1$. ∎

Finally, we can show our main result:

(Theorem 3.) $\NewtonSqrt$ performs $O(\lg \lg n)$ loop iterations.

Proof. Let $k$ be the number of loop iterations performed when running the algorithm for $n$ (i.e., $x_k \ge x_{k-1}$) and assume $k \ge 4$. Then $x_i \ge s + 1$ for $i \lt k - 1$. Since $ϵ_1 \le 1$ by Lemma 5, $ϵ_2 \le 1/2$ and $ϵ_i \le (ϵ_2)^{2^{i-2}}$ for $2 \le i \lt k - 1$ by Lemma 4, then $ϵ_{k-2} \le 2^{-2^{k-4}}$. But $1/n \le ϵ_{k-2}$ by Lemma 3, so $1/n \le 2^{-2^{k-4}}$. Taking logs to bring down the $k$ yields $k - 4 \le \lg \lg n$. Then $k \le \lg \lg n + 4$, and thus $k = O(\lg \lg n)$. ∎
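
The bound $k \le \lg \lg n + 4$ can also be observed empirically. Below is a sketch that re-implements the algorithm with native `BigInt` values (rather than the `BigInteger` type used in the listing above) and counts loop iterations; `newtonSqrtCounted` is a name made up for this example.

```javascript
// Runs the Newton iteration on a non-negative BigInt n and returns both the
// integer square root and the number of loop iterations performed.
function newtonSqrtCounted(n) {
  if (n < 0n) {
    throw new RangeError('negative radicand');
  }
  if (n === 0n) {
    return { root: 0n, iterations: 0 };
  }
  // x = 2^ceil(Bits(n)/2), with Bits(n) taken from the binary string length.
  var bits = n.toString(2).length;
  var x = 1n << BigInt(Math.ceil(bits / 2));
  var iterations = 0;
  while (true) {
    iterations++;
    // y = floor((x + floor(n/x))/2)
    var y = (x + n / x) >> 1n;
    if (y >= x) {
      return { root: x, iterations: iterations };
    }
    x = y;
  }
}
```

For $n = 99$ this walks through $16, 11, 10, 9$ and stops on the fourth iteration, comfortably within $\lg \lg 99 + 4 \approx 6.7$.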

Note that in general, an arithmetic operation is not constant-time, and in fact has run-time $\Omega(\lg n)$. Since the most expensive arithmetic operation we do is division, we can say that $\NewtonSqrt$ has run-time that is both $\Omega(\lg n)$ and $O(D(n) \cdot \lg \lg n)$, where $D(n)$ is the run-time of dividing $n$ by some number $\le n$.^[5]

4. The Initial Guess

It’s also useful to show that if the initial guess $x_0$ is bad, then the run-time degrades to $Θ(\lg n)$. We’ll do this by defining the function $\mathrm{N{\small EWTON}\text{-}I{\small SQRT}'}$ to be like $\NewtonSqrt$ except that it takes a function $\mathrm{I{\small NITIAL}\text{-}G{\small UESS}}$ that is called with $n$ and whose result is assigned to $x_0$ in step 2. Then, we can treat $ϵ_0$ as a function of $n$ and analyze how long $ϵ_i$ stays above $1$ to bound the number of loop iterations in terms of $ϵ_0(n)$ (Theorem 4). $\NewtonSqrt$ uses an initial guess such that $ϵ_0(n) = Θ(1)$, so Theorem 4 reduces to Theorem 3 in that case. However, if $x_0$ is chosen to be $Θ(n)$, e.g. the initial guess is just $n$ or $n/k$ for some $k$, then $ϵ_0(n)$ will also be $Θ(n)$, and so the run time will degrade to $Θ(\lg n)$. So having a good initial guess is important for the performance of $\NewtonSqrt$!
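
To see the effect of the initial guess in practice, here is a sketch of the variant that takes the guess as a parameter, written with native `BigInt` values. All three function names are invented for the example, and the guess function is assumed to return a positive value that is at least $s$, as correctness requires.

```javascript
// Newton iteration for the integer square root of a non-negative BigInt n,
// starting from initialGuess(n). Returns the root and the iteration count.
function newtonSqrtWithGuess(n, initialGuess) {
  if (n === 0n) {
    return { root: 0n, iterations: 0 };
  }
  var x = initialGuess(n);
  var iterations = 0;
  while (true) {
    iterations++;
    var y = (x + n / x) >> 1n;
    if (y >= x) {
      return { root: x, iterations: iterations };
    }
    x = y;
  }
}

// The guess used by the algorithm above: 2^ceil(Bits(n)/2).
function goodGuess(n) {
  return 1n << BigInt(Math.ceil(n.toString(2).length / 2));
}

// A deliberately poor guess: n itself.
function badGuess(n) {
  return n;
}
```

Starting from `badGuess`, each early iteration only roughly halves the estimate, so for $n = 2^{64}$ it takes more than $32$ iterations to get down to $2^{32}$, whereas `goodGuess` starts within a factor of two of the answer.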


Footnotes

[1] Aside from the Wikipedia article, the algorithm is described as Algorithm 9.2.11 in Prime Numbers: A Computational Perspective. ↩

[2] Note that only integer operations are used, which makes this algorithm suitable for arbitrary-precision integers. ↩

[3] Go and JS implementations are available on my GitHub. ↩

[4] Here, and in most of the article, we’ll implicitly assume that $n \gt 0$. ↩

[5] $D(n)$ is $Θ(\lg^2 n)$ using long division, but fancier division algorithms have better run-times. ↩

Finding the Most Significant Set Bit of a Word in Constant Time

2014-07-03T00:00:00-07:00

1. Overall method

Finding the most significant set bit of a word (equivalently, finding the integer log base 2 of a word, or counting the leading zeros of a word) is a well-studied problem. Bit Twiddling Hacks lists various methods, and Wikipedia gives the CPU instructions that perform the operation directly.

However, all of these methods are either specific to a certain word size or take more than constant time (in terms of number of word operations). That raises the question of whether there is a method that takes constant time—surprisingly, the answer is “yes”!^[1]

The key idea is to split a word into $\lceil \sqrt{w} \rceil$ blocks of $\lceil \sqrt{w} \rceil$ bits (where $w$ is the number of bits in a word). One can then do certain operations on blocks “in parallel” by stuffing multiple blocks into a word and then performing a single word operation.

Furthermore, since the block size and block count are the same, one can transform the bits of a block into the blocks of a word and vice versa in various ways using only a constant number of word operations.

In particular, this lets us split up the problem into two parts: finding the most significant set (i.e., non-zero) block, and finding the most significant set bit within that block. It then turns out that both parts can be done in constant time.

For concreteness, we'll use 32-bit words when explaining the method below.^[2]

2. Finding the most significant set bit of a block

First, let's consider the sub-problem of finding the most significant set bit of a block. In fact, let's give ourselves a bit of room and consider only blocks with the high bit cleared for now; we'll see why we need this extra bit of room soon.

For 32 bits, the block size is 6 bits, so with the high bit of a block cleared we're left with 5 bits. Let's look at a naive implementation:

function mssb5_naive(x) {
  var c = 0;
  for (var i = 0; i < 5 && x >= (1 << i); ++i) {
    ++c;
  }
  return c - 1;
}

In the above, we consider successive powers of 2 until we find one greater than our given number. Then the answer is simply one less than that power.

Notice that the loop has at most 5 iterations; this lines up nicely with the 5 full blocks in an entire 32-bit word. (This is why we saved our extra bit of room.) If we can copy our block to the higher 4 blocks and then use word operations to operate on those blocks in parallel, then we're good.

For our example, let $x = 5 = 00101$. Duplicating $x$ among all the blocks can easily be done by multiplying by the appropriate constant:

  00 000000 000000 000000 000000 000101
* 00 000001 000001 000001 000001 000001
  00 000000 000000 000000 000000 000101
  00 000000 000000 000000 000101
  00 000000 000000 000101
  00 000000 000101
  00 000101                            
  00 000101 000101 000101 000101 000101

In fact, this is a simple use of a more general tool. If $x$ and $y$ are expressed in binary, then multiplying $x$ by $y$ can be seen as taking the index of each set bit in $y$, creating a copy of $x$ shifted by each such index, and then adding up all the shifted copies. This case is just taking $y$ to be the constant where the $\{ 0, 6, 12, 18, 24 \}$th bits are set.
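
The "shifted copies" view of multiplication can be written out directly. The sketch below is only meant to illustrate the idea, with a made-up function name; it uses `Math.pow(2, i)` rather than `<<` for the shift so that intermediate values are not truncated to 32 bits.

```javascript
// Multiplies x by y by adding a copy of x shifted left by i for every set
// bit i of y. x and y are assumed to be non-negative integers below 2^32,
// with a product small enough to be represented exactly as a Number.
function multiplyByShifts(x, y) {
  var sum = 0;
  for (var i = 0; i < 32; ++i) {
    if ((y >>> i) & 1) {
      sum += x * Math.pow(2, i);
    }
  }
  return sum;
}
```

With $x = 5$ and $y$ the constant with bits $\{ 0, 6, 12, 18, 24 \}$ set, this produces the same word as the worked multiplication above.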

The first operation we need to parallelize is the comparison to the powers of 2. This can be converted to a word operation by noting the comparison $x \geq y$ can be performed by checking the sign of $x - y$, and that checking the sign can be done by setting the unused high bit of $x$ before doing the subtraction, and then checking to see if that high bit was left intact (i.e., not borrowed from). So we pre-compute a constant with the $n$th block containing the $n$th power of 2, then subtract that from our word containing the duplicated blocks with the high bits set. Finally, we can then mask off the unneeded lower bits:

  00 000101 000101 000101 000101 000101
| 00 100000 100000 100000 100000 100000
  00 100101 100101 100101 100101 100101
- 00 010000 001000 000100 000010 000001
  00 010101 011101 100001 100011 100100
& 00 100000 100000 100000 100000 100000
  00 000000 000000 100000 100000 100000
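
The parallel comparison step can be tried in isolation. In the sketch below, `b` is an assumed implementation of the binary-string helper described in the footnotes, and `compareToPowersOf2` is a made-up name for just this step.

```javascript
// Assumed implementation of the helper b(str): parses a binary string,
// ignoring the spaces used to separate blocks.
function b(str) {
  return parseInt(str.replace(/ /g, ''), 2);
}

// For a 5-bit value x, returns a word whose i-th block has its high bit set
// exactly when x >= 2^i, for i = 0..4.
function compareToPowersOf2(x) {
  // Duplicate x among all the blocks.
  var w = x * b('00 000001 000001 000001 000001 000001');
  // Set the high bit of each block, subtract the powers of 2, and keep only
  // the high bits, which survive exactly where no borrow was needed.
  w |= b('00 100000 100000 100000 100000 100000');
  w -= b('00 010000 001000 000100 000010 000001');
  w &= b('00 100000 100000 100000 100000 100000');
  return w;
}
```

For $x = 5$ the result has the high bits of the three lowest blocks set, as in the worked example, since $5 \ge 1$, $5 \ge 2$, and $5 \ge 4$ but $5 \lt 8$.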

We're left with a word where all bits except for the high bits of a block are zero. We still need to sum up those bits, but since they're a block apart, that can be done by multiplication with a constant to line up the bits in a single column. The constant turns out to have the $\{ 0, 6, 12, 18, 24 \}$th bits set, with the answer being in the top three bits:^[3]

  00 000000 000000 100000 100000 100000
* 00 000001 000001 000001 000001 000001
  00 000000 000000 100000 100000 100000
  00 000000 100000 100000 100000
  00 100000 100000 100000
  00 100000 100000
  00 100000                            
  01 100001 100001 100001 000000 100000

MSSB5(x) = 011 - 1 = 2

We can now write mssb5() using a constant number of word operations:^[4]

function mssb5(x) {
  // Duplicate x among all the blocks.
  x *= b('00 000001 000001 000001 000001 000001');

  // Compare to successive powers of 2 in parallel.
  x |= b('00 100000 100000 100000 100000 100000');
  x -= b('00 010000 001000 000100 000010 000001');
  x &= b('00 100000 100000 100000 100000 100000');

  // Sum up the bits into the high 3 bits.
  x *= b('00 000001 000001 000001 000001 000001');

  // Shift down and subtract 1 to get the answer.
  return (x >>> 29) - 1;
}

We can then find the most significant set bit of a full block by simply testing the high bit first:

function mssb6(x) {
  return (x & b('100000')) ? 5 : mssb5(x);
}

3. Finding the most significant set block

Let's now consider the sub-problem of finding the most significant set block of a word (ignoring the partial one). Similar to the above, we'd like to be able to use subtraction to compare all the blocks to zero at the same time. However, that requires the high bit of each block to be unused. That's easy enough to handle: just separate the high bit and the lower bits of each block, test the lower bits, and then bitwise-or the results together:

   x = 00 100000 000000 010000 000000 000001
&  C = 00 100000 100000 100000 100000 100000
  y1 = 00 100000 000000 000000 000000 000000

   x = 00 100000 000000 010000 000000 000001
& ~C = 00 011111 011111 011111 011111 011111
  t1 = 00 000000 000000 010000 000000 000001

   C = 00 100000 100000 100000 100000 100000
- t1 = 00 000000 000000 010000 000000 000001
  t2 = 00 100000 100000 010000 100000 011111

 ~t2 = 11 011111 011111 101111 011111 100000
&  C = 00 100000 100000 100000 100000 100000
  y2 = 00 000000 000000 100000 000000 100000

  y1 = 00 100000 000000 000000 000000 000000
| y2 = 00 000000 000000 100000 000000 100000
   y = 00 100000 000000 100000 000000 100000

The result is stored in the high bits of each block. If we could pack all the bits together, we'd then be able to use mssb5(). This is similar to where we had to add all the bits together in part 2, but we need a constant to stagger the bits instead of lining them up. The constant to put the answer in the high bits turns out to have the $\{ 7, 12, 17, 22, 27 \}$th bits set:

y >>> 5 = 00 000001 000000 000001 000000 000001
        * 00 001000 010000 100001 000010 000000
          10 000000 000010 000000 00001
          00 000001 000000 000001
          00 100000 000000 1
          00 000000 01
          00 001                               
        = 10 101001 010010 100001 000010 000000

This yields the answer 10101, where the $i$th bit is set exactly when the $i$th block of $x$ is non-zero. Therefore, the most significant block is then simply mssb5(10101).

4. Putting it all together

With the building blocks above, we can now implement the algorithm for finding the most significant set bit in the full blocks of a word:^[5]

function mssb30(x) {
  var C = b('00 100000 100000 100000 100000 100000');

  // Check whether the high bit of each block is set.
  var y1 = x & C;

  // Check whether any of the lower bits of each block are set.
  var y2 = ~(C - (x & ~C)) & C;

  var y = y1 | y2;

  // Shift the result bits down to the lowest 5 bits.
  var z = ((y >>> 5) * b('0000 10000 10000 10000 10000 10000000')) >>> 27;

  // Compute the bit index of the most significant set block.
  var b1 = 6 * mssb5(z);

  // Compute the most significant set bit inside the most significant
  // set block.
  var b2 = mssb6((x >>> b1) & b('111111'));

  return b1 + b2;
}

And then it's simple enough to extend it to find the most significant set bit of a full word:

function mssb32(x) {
  // Check the high duplet and fall back to mssb30 if it's not set.
  var h = x >>> 30;
  return h ? (30 + mssb5(h)) : mssb30(x);
}

So the code above shows that we can find the most significant set bit of a 32-bit word in a constant number of 32-bit word operations. It is easy enough to see how it can be adapted to yield a similar algorithm for a given arbitrary (but sufficiently large) word size, simply by pre-computing the various word-size-dependent constants.

It is also easy to see why no one actually uses this method on real computers even in the absence of built-in instructions: it is much more complicated and almost certainly slower than existing methods for real word sizes! Also, the word-RAM model—where we assume all word operations take constant time—is useful only when the word size is fixed or narrowly bounded. When we allow word size to vary arbitrarily, the word-RAM model breaks down—for one, multiplication grows super-linearly with respect to word size! Alas, this method is doomed to remain a theoretical curiosity, albeit one that uses a few clever tricks.
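
For contrast, this is what the practical alternative looks like in JavaScript, where `Math.clz32` (standard since ES2015) counts the leading zeros of a 32-bit word. The wrapper name is made up for the example.

```javascript
// Returns the index of the most significant set bit of x, treated as an
// unsigned 32-bit word, or -1 if x is 0.
function mssbBuiltin(x) {
  return 31 - Math.clz32(x);
}
```

On most hardware this maps to a single count-leading-zeros instruction, which is hard to beat with a sequence of multiplications and masks.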


Footnotes

[1] The constant-time method is detailed in the original papers for the fusion tree data structure. The first paper is unfortunately behind a paywall, but the second paper, essentially a rehash of the first one, is freely downloadable.

The method is also explained in lecture 12 of Erik Demaine's Advanced Data Structures class, which is how I originally found out about it. ↩

[2] Demaine uses 16-bit words, which factor nicely into 4 blocks of 4 bits, but it is instructive to see how the method deals with a word size that is not a perfect square. ↩

[3] In this case, the partial 6th block has enough room to hold the answer but this may not be true in general. This can be remedied easily enough by shifting down the block high bits to the low bits before multiplying; the answer will then be in the last full block. ↩

[4] b(str) just parses a number from its binary string representation. ↩

[5] Try out this function (and the others on this page) by opening up the JS console on this page! ↩

Primality Testing in Polynomial Time (Ⅱ)

2012-12-29T00:00:00-08:00

(Note: this article isn't fully polished yet, but I thought it would be a shame to let it languish during my sabbatical. Happy new year!)

5. Strengthening the AKS theorem

It turns out the conditions of the AKS theorem are stronger than they appear; they themselves imply that $n$ is prime. To show this, we need the following theorem, which we'll state without proof:

(Lenstra's squarefree test.) If $a^n \equiv a \pmod{n}$ for $1 \le a \lt \ln^2 n$, then $n$ is squarefree.^[1]

We also need a couple of lemmas:

(Lemma 1.) For $0 \le a \lt n$ and $r \gt 1$, let \[ (X + a)^n \equiv X^n + a \pmod{X^r - 1, n}\text{.} \] Then \[ (a + 1)^n = a + 1 \pmod{n}\text{.} \]

Proof. By definition, $(X + a)^n - (X^n + a) = k(X) \cdot (X^r - 1) \pmod{n}$. Treating both sides as a function of $x$ and substituting in $1$, we immediately get $(1 + a)^n - (1 + a) = 0 \pmod{n}$. ∎

(Lemma 2.) For $n \gt 1$, $\lfloor \lg n \rfloor \cdot \lg n \gt \ln^2 n$.

Proof. Since $\ln n = \frac{\lg n}{\lg e}$ and $e \gt 2$, $\lg n \gt \ln n$ for $n \gt 1$.

Letting $k = \lfloor \lg n \rfloor$, $\ln n \lt \frac{k + 1}{\lg e}$, so if $\frac{k + 1}{\lg e} \lt k$, that implies that $\ln n \lt k$. Solving for $k$, we get that $k \gt \frac{1}{\lg e - 1}$, which is true when $n \ge 8$.

So if $n \ge 8$, then $\ln n \lt \lfloor \lg n \rfloor$. Checking manually, we find that $\ln n \lt \lfloor \lg n \rfloor$ holds also for $n \in \{ 2, 4, 5, 6, 7 \}$, immediately implying the lemma for all $n \gt 1$ except $3$. But checking manually again, we find that the lemma holds for $3$ also. ∎
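Lemma 2 is also easy to spot-check numerically. The sketch below is not part of the code used elsewhere in this article; `floorLg`, `lemma2Holds`, and `lemma2Counterexamples` are names made up for illustration, and since it uses ordinary floating-point numbers it is only a sanity check for small $n$, not a proof.

```javascript
// Hypothetical helper: floor(lg n) for an integer 1 <= n < 2^32,
// computed from the bit length to avoid floating-point rounding.
function floorLg(n) {
  return 31 - Math.clz32(n);
}

// Returns whether floor(lg n) * lg n > ln^2 n, using doubles.
function lemma2Holds(n) {
  var lnN = Math.log(n);
  return floorLg(n) * Math.log2(n) > lnN * lnN;
}

// Returns all n in [2, limit] for which the inequality fails.
function lemma2Counterexamples(limit) {
  var bad = [];
  for (var n = 2; n <= limit; n++) {
    if (!lemma2Holds(n)) {
      bad.push(n);
    }
  }
  return bad;
}
```

Note that $n = 3$ is the one case where $\ln n \lt \lfloor \lg n \rfloor$ fails but the lemma still holds, which is why the proof treats it separately.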

Then, we can prove the strong version of the AKS theorem:

(AKS theorem, strong version.) Let $n \ge 2$, $r$ be relatively prime to $n$ with $o_r(n) \gt \lg^2 n$, and $M \gt \lfloor \sqrt{φ(r)} \rfloor \lg n$. Furthermore, let $n$ have no prime factor less than $M$ and let \[ (X + a)^n \equiv X^n + a \pmod{X^r - 1, n} \] for $0 \le a \lt M$. Then $n$ is prime.

Proof. From Lemma 1, we know that $a^n \equiv a \pmod{n}$ for $1 \le a \lt M$. Recall from the proof of the weak version that $t \gt \lg^2 n$, so $\lfloor \sqrt{t} \rfloor \ge \lfloor \lg n \rfloor$. Then $M \gt \lfloor \sqrt{t} \rfloor \lg n \ge \lfloor \lg n \rfloor \cdot \lg n \gt \ln^2 n$, where the last inequality is Lemma 2, so we can apply Lenstra's squarefree test to show that $n$ is squarefree. From the weak version of the AKS theorem, we also know that $n$ is a prime power. But since $n$ is squarefree, it must have distinct prime factors, which immediately implies that $n$ is prime. ∎

6. Finding a suitable $r$

The only remaining loose end is to show that there exists an $r$ with $o_r(n) \gt \lg^2 n$ and that it's small enough (i.e., polylog in $n$). The existence of $r$ is easy to see; we can simply pick the smallest $r$ that is co-prime to $n$ and greater than $n^{\lg^2 n}$. But that's obviously too big. We can do better:

(Upper bound for $r$.) Let $n \ge 2$. Then there exists some $r \le \max(3, \lceil \lg n \rceil^5)$ such that $o_r(n) \gt \lceil \lg n \rceil^2$.^[2]

(Proof.) Let's first prove the following lemma:

(Lemma 3.) Let $n \ge 9$ and $b = \lceil \lg n \rceil$. Then for $m \ge 1$, there exists some $r \le b^{2m + 1}$ such that $o_r(n) \gt b^m$.

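(Proof.) Let \[ N = n \cdot (n - 1) \cdot (n^2 - 1) \dotsm (n^{b^m} - 1)\text{.} \] Note that if a prime $r$ does not divide $N$, then $r$ does not divide $n$ (so $r$ is relatively prime to $n$) and $n^k \not\equiv 1 \pmod{r}$ for $1 \le k \le b^m$; that is, $o_r(n) \gt b^m$. So it suffices to find some prime $r$ that does not divide $N$.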

We can see that: \[ \begin{aligned} N &= n \cdot (n - 1) \cdot (n^2 - 1) \dotsm (n^{b^m} - 1) \\ &\lt n \cdot n \cdot n^2 \dotsm n^{b^m} \\ &= n^{1 + 1 + 2 + 3 + \dotsm + b^m} \\ &= n^{1 + b^m (b^m + 1) / 2} \\ &= n^{\frac{1}{2} b^{2m} + \frac{1}{2} b^m + 1}\text{.} \end{aligned} \] Furthermore, we can upper-bound the exponent of $n$ by $b^{2m}$, since the following inequalities are all equivalent: \[ \begin{aligned} b^{2m} &\gt \frac{1}{2} b^{2m} + \frac{1}{2} b^m + 1 \\ \frac{1}{2} b^{2m} - \frac{1}{2} b^m - 1 &\gt 0 \\ b^{2m} - b^m - 2 &\gt 0 \\ (b^m - 2) \cdot (b^m + 1) &\gt 0\text{.} \end{aligned} \] The last statement holds when $b^m \gt 2$, which is always the case since $b \ge 4$ and $m \ge 1$.

Applying the upper bound, \[ \begin{aligned} N &\lt n^{\frac{1}{2} b^{2m} + \frac{1}{2} b^m + 1} \\ &\lt n^{b^{2m}} \\ &\le 2^{b^{2m + 1}}\text{.} \end{aligned} \]

We can then use the following theorem, which we'll state without proof:

(Primorial lower bound.) For $x \ge 31$, the product of primes $\le x$ exceeds $2^x$.^[3] That is, \[ x\# = \prod_{p \le x\text{, }p\text{ is prime}} p \gt 2^x\text{.} \]

Since $b \ge 4$ and $m \ge 1$, $b^{2m + 1} \ge 31$, and so $2^{b^{2m + 1}} \lt (b^{2m + 1})\#$. Therefore, \[ N \lt 2^{b^{2m + 1}} \lt (b^{2m + 1})\#\text{.} \] But that implies that there is some prime number $p_0 \le b^{2m + 1}$ that does not divide $N$; if they all did, then $N$ would be at least their product $(b^{2m + 1})\#$, contradicting the inequality above. Therefore, $o_{p_0}(n) \gt b^m$. ∎
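To see the argument of Lemma 3 in action, here is a small sketch that builds $N$ explicitly and looks for the smallest prime that doesn't divide it. It uses the built-in `BigInt` type rather than the classes used elsewhere in this article, and all of the function names are made up for illustration. For example, with $n = 9$ and $m = 1$ we have $b = 4$ and $N = 9 \cdot 8 \cdot 80 \cdot 728 \cdot 6560$, whose prime factors are $2$, $3$, $5$, $7$, $13$, and $41$; the smallest prime not dividing $N$ is $11 \le 4^3$, and indeed $o_{11}(9) = 5 \gt 4$.

```javascript
// Hypothetical helper: ceil(lg n) for a BigInt n >= 1.
function ceilLg(n) {
  return n <= 1n ? 0n : BigInt((n - 1n).toString(2).length);
}

// Returns whether the BigInt p is prime, by naive trial division.
function isSmallPrime(p) {
  if (p < 2n) {
    return false;
  }
  for (var d = 2n; d * d <= p; d++) {
    if (p % d === 0n) {
      return false;
    }
  }
  return true;
}

// Returns the multiplicative order of n modulo r, assuming r > 1 and
// gcd(n, r) = 1 (otherwise this loops forever).
function multiplicativeOrder(n, r) {
  var x = n % r;
  var k = 1n;
  while (x !== 1n) {
    x = (x * n) % r;
    k++;
  }
  return k;
}

// Builds N = n * (n - 1) * (n^2 - 1) * ... * (n^(b^m) - 1) as in the
// proof of Lemma 3 and returns the smallest prime not dividing it.
// Both n and m are BigInts.
function smallestPrimeNotDividingN(n, m) {
  var b = ceilLg(n);
  var N = n;
  var power = 1n;
  for (var k = 1n; k <= b ** m; k++) {
    power *= n;
    N *= power - 1n;
  }
  for (var p = 2n; ; p++) {
    if (isSmallPrime(p) && N % p !== 0n) {
      return p;
    }
  }
}
```

Calling `smallestPrimeNotDividingN(9n, 1n)` reproduces the example above. This is only practical for tiny inputs, since $N$ grows very quickly.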

We can then prove our theorem: for $n \ge 9$, apply Lemma 3 with $m = 2$. Here are explicit values for the rest: for $n = 2$, $r = 3$; $n = 3$, $r = 7$; $n \in \{ 4, 6, 7, 8\}$, $r = 11$; and for $n = 5$, $r = 17$. ∎

Also, it turns out that about half the time, we can do better. We'll state this theorem without proof:

(Tight upper bound for some $r$.) Let $n \equiv \pm 3 \pmod{8}$. Then there exists some $r \lt 8 \lceil \lg n \rceil^2$ such that $o_r(n) \gt \lceil \lg n \rceil^2$.^[4]

7. The AKS algorithm (simple version)

Without further ado, here is a simple version of the AKS algorithm, given $n \ge 2$:

1. Starting from $\lceil \lg n \rceil^2 + 2$, find an $r$ such that $\gcd(r, n) = 1$ and $o_r(n) \gt \lceil \lg n \rceil^2$.
2. Compute $M = \lfloor \sqrt{r - 1} \rfloor \lceil \lg n \rceil + 1$.
3. Search for a prime factor of $n$ less than $M$. If one is found, return “composite”. If none are found and $M \gt \lfloor \sqrt{n} \rfloor$, return “prime”.
4. For each $1 \le a \lt M$, compute $(X + a)^n$, reducing coefficients mod $n$ and powers mod $r$. If the result is not equal to $X^{n\text{ mod }r} + a$, return “composite”.
5. Otherwise, return “prime”.

As we showed in the previous section, there always exists an $r$ such that $o_r(n) \gt \lceil \lg n \rceil^2$, so step 1 will terminate. All other steps are bounded, so the entire algorithm will always terminate.

In step 2, since $φ(r) \le r - 1$, the value of $M$ that we compute is always greater than $\lfloor \sqrt{φ(r)} \rfloor \lceil \lg n \rceil$, and thus also greater than $\lfloor \sqrt{φ(r)} \rfloor \lg n$. Step 4 checks if $(X + a)^n \equiv X^n + a \pmod{X^r - 1, n}$ holds. Therefore, by the strong AKS theorem, if the algorithm returns “prime”, then $n$ is prime. Furthermore, by the weak version of Fermat's little theorem for polynomials, if the algorithm returns “composite”, then $n$ is composite.

Since the algorithm always terminates and it returns the correct answer when it terminates, it is totally correct.
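To get a feel for the first three steps, here is a minimal self-contained sketch that uses plain numbers instead of a big-integer class, so it is only suitable for small $n$. The function names and the `'undecided'` verdict (meaning that step 4 would still have to run) are assumptions made for illustration. For example, for $n = 31$ it finds $r = 29$ and $M = 26$, and since $M \gt \lfloor \sqrt{31} \rfloor$ and no factor is found, it can return “prime” without ever looking at a polynomial.

```javascript
// Hypothetical helper: ceil(lg n) for an integer 1 <= n < 2^31.
function ceilLg(n) {
  return n <= 1 ? 0 : 32 - Math.clz32(n - 1);
}

function gcd(a, b) {
  while (b !== 0) {
    var t = a % b;
    a = b;
    b = t;
  }
  return a;
}

// Returns whether o_r(n) > bound, assuming gcd(n, r) = 1.
function orderExceeds(n, r, bound) {
  var x = 1;
  for (var k = 1; k <= bound; k++) {
    x = (x * (n % r)) % r;
    if (x === 1) {
      return false;
    }
  }
  return true;
}

// Runs steps 1 to 3 of the simple version and returns r, M, and a
// verdict of 'composite', 'prime', or 'undecided'.
function aksStepsOneToThree(n) {
  var bound = Math.pow(ceilLg(n), 2);

  // Step 1: find the least suitable r, starting from ceil(lg n)^2 + 2.
  var r = bound + 2;
  while (gcd(r, n) !== 1 || !orderExceeds(n, r, bound)) {
    r++;
  }

  // Step 2: compute M from r - 1 instead of phi(r).
  var M = Math.floor(Math.sqrt(r - 1)) * ceilLg(n) + 1;

  // Step 3: look for a non-trivial divisor below M.  (The smallest
  // such divisor is necessarily prime.)
  var verdict = (M > Math.floor(Math.sqrt(n))) ? 'prime' : 'undecided';
  for (var d = 2; d < M && d < n; d++) {
    if (n % d === 0) {
      verdict = 'composite';
      break;
    }
  }
  return { r: r, M: M, verdict: verdict };
}
```

For small $n$ the bound $M$ exceeds $\lfloor \sqrt{n} \rfloor$, so the verdict is always decided by trial division; the polynomial test in step 4 only comes into play once $n$ is large compared to $M^2$.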

As shown in the previous section, we have to test $O(\lg^5 n)$ values to find a suitable $r$. Assuming a straightforward algorithm to compute the multiplicative order that bails out once $\lceil \lg n \rceil^2$ is reached, and assuming we use the division-based Euclidean algorithm for computing the greatest common divisor, testing each value takes $O(\lg^2 n)$ multiplies and $O(\lg r) = O(\lg \lg n)$ divisions of $O(\lg r)$-bit numbers. Let $M(b)$ be the cost to multiply two $b$-bit numbers. The complexity of division is asymptotically the same as multiplication, so the total cost of step 1 is $O(\lg^5 n \cdot (\lg^2 n + \lg \lg n) \cdot M(\lg \lg n)) = O(\lg^7 n \cdot M(\lg \lg n))$, assuming $M(O(b)) = O(M(b))$.

Step 2 involves one square root, one multiplication, and one increment, all involving $O(\lg \lg n)$-bit numbers. The complexity of taking the square root is asymptotically the same as multiplication, so the total cost of step 2 is $O(M(\lg \lg n))$.

Step 3 takes a square root and tests $M = O(\lg^{7/2} n)$ numbers, and each test involves dividing two $O(\lg M)$-bit numbers, so the total cost of step 3 is $O(\lg^{7/2} n \cdot M(\lg \lg n))$.

Steps 4 and 5 test $O(\lg^{7/2} n)$ polynomials. Testing each polynomial involves exponentiating it by $n$, reducing powers mod $r$ and coefficients mod $n$ at each step, which requires $O(\lg n)$ multiplications of polynomials with $O(r)$ coefficients each of size $O(\lg n)$. The cost of multiplying two polynomials with $s$ coefficients of size $b$ is $M(s) \cdot M(b)$, so with $s = O(\lg^5 n)$ and $b = O(\lg n)$ the total cost of steps 4 and 5 is $O(\lg^{9/2} n \cdot M(\lg^5 n \cdot \lg n)) = O(\lg^{9/2} n \cdot M(\lg^6 n))$, assuming $M(a) \cdot M(b) = M(a \cdot b)$.

If long multiplication is used, then it costs $M(b) = b^2$, which gives a total cost of $O(\lg^{9/2} n \cdot \lg^{12} n) = O(\lg^{33/2} n)$ for the whole algorithm. More complicated multiplication methods cost only $M(b) = b \lg b$, which gives a total cost of $O(\lg^{21/2} n \cdot \lg \lg n) = O(\lg^{11} n)$ for the whole algorithm. Either way, the AKS primality test is shown to be implementable in polynomial time.

Below is step 1 implemented in Javascript; however, here we bound $r$ explicitly to be able to detect bugs quickly.^[5]

// Returns an upper bound for r such that o_r(n) > ceil(lg(n))^2 that
// is polylog in n.
function calculateAKSModulusUpperBound(n) {
  n = SNat.cast(n);
  var ceilLgN = new SNat(n.ceilLg());
  var rUpperBound = ceilLgN.pow(5).max(3);
  var nMod8 = n.mod(8);
  if (nMod8.eq(3) || nMod8.eq(5)) {
    rUpperBound = rUpperBound.min(ceilLgN.pow(2).times(8));
  }
  return rUpperBound;
}

// Returns the least r such that o_r(n) > ceil(lg(n))^2 >= ceil(lg(n)^2).
function calculateAKSModulus(n, multiplicativeOrderCalculator) {
  n = SNat.cast(n);
  multiplicativeOrderCalculator =
    multiplicativeOrderCalculator || calculateMultiplicativeOrderCRT;

  var ceilLgN = new SNat(n.ceilLg());
  var ceilLgNSq = ceilLgN.pow(2);
  var rLowerBound = ceilLgNSq.plus(2);
  var rUpperBound = calculateAKSModulusUpperBound(n);

  for (var r = rLowerBound; r.le(rUpperBound); r = r.plus(1)) {
    if (n.gcd(r).ne(1)) {
      continue;
    }
    var o = multiplicativeOrderCalculator(n, r);
    if (o.gt(ceilLgNSq)) {
      return r;
    }
  }

  throw new Error('Could not find AKS modulus');
}

Here is step 2 implemented in Javascript:

// Returns floor(sqrt(r-1)) * ceil(lg(n)) + 1 > floor(sqrt(Phi(r))) * lg(n).
function calculateAKSUpperBoundSimple(n, r) {
  n = SNat.cast(n);
  r = SNat.cast(r);

  // Use r - 1 instead of calculating Phi(r).
  return r.minus(1).floorRoot(2).times(n.ceilLg()).plus(1);
}

Here is part of step 3 implemented in Javascript, along with the comments for the functions used in trial division:

// Given a number n, a generator function getNextDivisor, and a
// processing function processPrimeFactor, factors n using the
// divisors returned by getNextDivisor and passes each prime factor
// with its multiplicity to processPrimeFactor.
//
// getNextDivisor is passed the current unfactorized part of n and it
// should return the next divisor to try, or null if there are no more
// divisors to generate (although processPrimeFactor may still be
// called).  processPrimeFactor is called with each non-trivial prime
// factor and its multiplicity.  If it returns a false value, it won't
// be called anymore.
function trialDivide(n, getNextDivisor, processPrimeFactor) {
  ...
}

// Returns a generator that generates primes up to 7, then odd numbers
// up to floor(sqrt(n)), using a mod-30 wheel to eliminate odd numbers
// that are known composite (roughly half).
function makeMod30WheelDivisorGenerator() {
  ...
}

// Returns the first factor of n < M from generator, or null if there
// is no such factor.
function getFirstFactorBelow(n, M, generator) {
  n = SNat.cast(n);
  M = SNat.cast(M);
  generator = generator || makeMod30WheelDivisorGenerator();

  var boundedGenerator = function(n) {
    var d = generator(n);
    return (d && d.lt(M)) ? d : null;
  };
  var factor = null;
  trialDivide(n, boundedGenerator, function(p, k) {
    if (p.lt(M.min(n))) {
      factor = p;
    }
    return false;
  });
  return factor;
}

Below is a function that ties steps 1 to 3 together; it is useful for testing purposes to separate it from the other steps. (Actually, we use a different function to compute $M$ which computes $φ(r)$ instead of using $r - 1$ so that we always have the tightest bound possible for $M$.)

// The getAKSParameters* functions below return a parameters object
// with the following fields:
//
//   n: the number the parameters are for.
//
//   factor: A prime factor of n.  If present, the fields below may
//           not be present.
//
//   isPrime: if set, n is prime.  If present, the fields below may
//            not be present.
//
//   r: the AKS modulus for n.
//
//   M: the AKS upper bound for n.

function getAKSParametersSimple(n) {
  n = SNat.cast(n);

  var r = calculateAKSModulus(n);
  var M = calculateAKSUpperBound(n, r);
  var parameters = {
    n: n,
    r: r,
    M: M
  };

  var factor = getFirstFactorBelow(n, M);
  if (factor) {
    parameters.factor = factor;
  } else if (M.gt(n.floorRoot(2))) {
    parameters.isPrime = true;
  }

  return parameters;
}

Finally, here is step 4 implemented in Javascript:

// Returns whether a is an AKS witness for n, i.e. whether
// (X + a)^n != X^n + a mod (X^r - 1, n), which shows that n is
// composite.
function isAKSWitness(n, r, a) {
  n = SNat.cast(n);
  r = SNat.cast(r);
  a = SNat.cast(a);

  function reduceAKS(p) {
    return p.modPow(r).mod(n);
  }

  function prodAKS(x, y) {
    return reduceAKS(x.times(y));
  }

  var one = new SPoly(new SNat(1));
  // X^n reduces to X^(n mod r) modulo X^r - 1.
  var xn = one.shiftLeft(n.mod(r));
  var ap = new SPoly(a);
  var lhs = one.shiftLeft(1).plus(ap).pow(n, prodAKS);
  var rhs = reduceAKS(xn.plus(ap));
  return lhs.ne(rhs);
}

// Returns the first a < M that is an AKS witness for n, or null if
// there isn't one.
function getFirstAKSWitness(n, r, M) {
  for (var a = new SNat(1); a.lt(M); a = a.plus(1)) {
    if (isAKSWitness(n, r, a)) {
      return a;
    }
  }
  return null;
}

Here's the code that ties it all together:

// Returns whether n is prime or not using the AKS primality test.
function isPrimeByAKS(n) {
  n = SNat.cast(n);

  var parameters = getAKSParameters(n);
  if (parameters.factor) {
    return false;
  }
  if (parameters.isPrime) {
    return true;
  }
  return (getFirstAKSWitness(n, parameters.r, parameters.M) == null);
}


(To-do: Have an interactive box to demonstrate how the per-$a$ AKS test works.)

8. The AKS algorithm (improved version)

Here is a slightly more complicated version of the AKS algorithm. Again given $n \ge 2$:

1. Search for a prime factor of $n$ less than $\lceil \lg n \rceil^2 + 2$. If one is found, return “composite”.
2. For each $r$ from $\lceil \lg n \rceil^2 + 2$:
   1. If $r \gt \lfloor \sqrt{n} \rfloor$, return “prime”.
   2. If $r$ divides $n$, return “composite”.
   3. Otherwise, factorize $r$.
   4. Compute $o_r(n)$ using $r$'s prime factors. If it is less than or equal to $\lceil \lg n \rceil^2$, jump back to the top of the loop with the next $r$.
   5. Otherwise, compute $φ(r)$ using $r$'s prime factors.
   6. Compute $M = \lfloor \sqrt{φ(r)} \rfloor \lceil \lg n \rceil + 1$, and break out of the loop.
3. For each $1 \le a \lt M$, compute $(X + a)^n$, reducing coefficients mod $n$ and powers mod $r$. If the result is not equal to $X^{n\text{ mod }r} + a$, return “composite”.
4. Otherwise, return “prime”.

The logic of steps 1 to 3 of the simple version is essentially merged together to form steps 1 and 2 of this version; since each $r$ has to be checked for co-primality with $n$, that effectively also checks if $r$ is a prime factor of $n$, so we only have to check for prime factors of $n$ up to the lower bound of $r$. Furthermore, both the multiplicative order as well as the totient function can be computed more quickly given a complete prime factorization, so we can compute that for each $r$. Third, we use $φ(r)$ instead of $r - 1$ to give a tighter bound for $M$. Finally, the last two steps are the same as in the simple version.
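As a small illustration of how much using $φ(r)$ can tighten $M$, here is a sketch that computes the totient by trial-division factoring and then the bound from step 2. The names `totient` and `upperBoundFromTotient` are made up for this example, and it uses plain numbers rather than the factorizer-based functions used in the real code. When $r$ is prime, $φ(r) = r - 1$ and nothing changes, but for a highly composite value such as $r = 60$ we get $φ(r) = 16$, so with $\lceil \lg n \rceil = 7$ the bound would drop from $7 \cdot 7 + 1 = 50$ to $4 \cdot 7 + 1 = 29$. (The value $r = 60$ is chosen purely for the arithmetic; it is not claimed to be a valid modulus for any particular $n$.)

```javascript
// Returns phi(r) for a small positive integer r, factoring r by trial
// division and applying phi(r) = r * prod(1 - 1/p) over its prime factors.
function totient(r) {
  var result = r;
  var m = r;
  for (var p = 2; p * p <= m; p++) {
    if (m % p === 0) {
      while (m % p === 0) {
        m /= p;
      }
      result -= result / p;
    }
  }
  if (m > 1) {
    // Whatever is left over is a single prime factor.
    result -= result / m;
  }
  return result;
}

// Returns floor(sqrt(phi(r))) * ceil(lg n) + 1, where ceilLgN is
// ceil(lg n) and is passed in directly.
function upperBoundFromTotient(r, ceilLgN) {
  return Math.floor(Math.sqrt(totient(r))) * ceilLgN + 1;
}
```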

Here are steps 1 and 2 of the above algorithm, implemented in Javascript:

function getAKSParameters(n, factorizer) {
  n = SNat.cast(n);
  factorizer = factorizer || defaultFactorizer;

  var ceilLgN = new SNat(n.ceilLg());
  var ceilLgNSq = ceilLgN.pow(2);
  var floorSqrtN = n.floorRoot(2);

  var rLowerBound = ceilLgNSq.plus(2);
  var rUpperBound = calculateAKSModulusUpperBound(n).min(floorSqrtN);

  var parameters = {
    n: n
  };

  var factor = getFirstFactorBelow(n, rLowerBound);
  if (factor) {
    parameters.factor = factor;
    return parameters;
  }

  for (var r = rLowerBound; r.le(rUpperBound); r = r.plus(1)) {
    if (n.mod(r).isZero()) {
      // r is the smallest non-trivial divisor of n, so it is prime.
      parameters.factor = r;
      return parameters;
    }

    var rFactors = getFactors(r, factorizer);
    var o = calculateMultiplicativeOrderCRTFactors(n, rFactors, factorizer);
    if (o.gt(ceilLgNSq)) {
      parameters.r = r;
      parameters.M = calculateAKSUpperBoundFactors(n, rFactors);
      return parameters;
    }
  }

  if (rUpperBound.eq(floorSqrtN)) {
    parameters.isPrime = true;
    return parameters;
  }

  throw new Error('Could not find AKS modulus');
}

(To-do: Wrap up and lead into what will be shown in part 3.)


Footnotes

[1] This is a version of Theorem 2 from Lenstra's paper Miller's Primality Test. ↩

[2] We work with $\lceil \lg n \rceil^2$ instead of $\lceil \lg^2 n \rceil$ or $\lg^2 n$ as it's easier to work with in an actual implementation. ↩

[3] This is exercise 1.27 from Prime Numbers: A Computational Perspective. ↩

[4] This is adapted from section 8.4 of Granville's It is Easy to Determine Whether a Given Number is Prime. ↩

[5] The SNat class used is the same as in my previous article, An Introduction to Primality Testing. ↩

Primality Testing in Polynomial Time (Ⅰ)

2012-08-06T00:00:00-07:00

1. Introduction

Exactly ten years ago, Agrawal, Kayal, and Saxena published “PRIMES is in P”, which described an algorithm that could provably determine whether a given number was prime or composite in polynomial time.

The AKS algorithm is quite short, but understanding how it works via the proofs in the paper requires some mathematical sophistication. Also, some results in the last decade have simplified both the algorithm and its accompanying proofs. In this article I will explain in detail the main result of the AKS paper, and in a follow-up article I will strengthen the main result, use it to get a polynomial-time primality testing algorithm, and implement that algorithm in Javascript. If you've understood my introduction to primality testing, you should be able to follow along.

Let's get started! The basis for the AKS primality test is the following generalization of Fermat's little theorem to polynomials:

(Fermat's little theorem for polynomials, strong version.) If $n \ge 2$ and $a$ is relatively prime to $n$, then $n$ is prime if and only if \[ (X + a)^n \equiv X^n + a \pmod{n}\text{.} \]

The form of the equation above may be unfamiliar. The polynomials in question are formal polynomials. That is, we care only about the coefficients of the polynomial and not how it behaves as a function. In this case, we restrict ourselves to polynomials with integer coefficients. Then we can meaningfully compare two polynomials modulo $n$: we consider two polynomials congruent modulo $n$ if their respective coefficients are all congruent modulo $n$. (Equivalently, two polynomials $f(X)$ and $g(X)$ are congruent modulo $n$ if $f(X) - g(X) = n \cdot h(X)$ for some polynomial $h(X)$.) This definition is consistent with how they behave as functions; if two polynomials $f(X)$ and $g(X)$ are congruent modulo $n$, then treating them as functions, $f(x)\ \equiv g(x) \pmod{n}$ for any integer $x$.^[1]
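For small $n$ we can check this theorem directly by expanding $(X + a)^n$ and comparing coefficients. The sketch below does exactly that with plain numbers; `expandModN` and `strongFermatHolds` are names made up for illustration. For example, $(X + 1)^4 \equiv X^4 + 2X^2 + 1 \pmod{4}$, which is not $X^4 + 1$, so $4$ is correctly identified as composite.

```javascript
// Returns the coefficients of (X + a)^n reduced modulo n, lowest
// degree first, by repeatedly multiplying by (X + a).  Only suitable
// for small n, since it touches all n + 1 coefficients.
function expandModN(a, n) {
  var coeffs = [1];
  for (var i = 0; i < n; i++) {
    var next = new Array(coeffs.length + 1).fill(0);
    for (var j = 0; j < coeffs.length; j++) {
      next[j] = (next[j] + a * coeffs[j]) % n;
      next[j + 1] = (next[j + 1] + coeffs[j]) % n;
    }
    coeffs = next;
  }
  return coeffs;
}

// Returns whether (X + a)^n = X^n + a (mod n) as formal polynomials,
// for n >= 2.
function strongFermatHolds(a, n) {
  var lhs = expandModN(a, n);
  for (var k = 0; k <= n; k++) {
    var expected = (k === 0) ? a % n : ((k === n) ? 1 : 0);
    if (lhs[k] !== expected) {
      return false;
    }
  }
  return true;
}
```

The loop over all $n + 1$ coefficients is what makes this test exponential in the bit length of $n$.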

Unfortunately, this test by itself cannot give a polynomial-time algorithm as testing even one value of $a$ may require looking at $n$ coefficients of the left-hand side. (Remember that we're interested in algorithms with time polynomial not in the input $n$, but in its bit length $\lg n$. Such an algorithm is described as having time polylog in $n$.) However, we can reduce the number of coefficients we have to look at by taking the powers of $X$ modulo some number $r$. This is equivalent to taking the modulo of the polynomials themselves by $X^r - 1$; you can see this for yourself by picking some polynomial and some value for $r$ and doing long division by $X^r - 1$ to find the remainder. (It may seem weird to talk about taking the modulo of one polynomial with another, but it's entirely analogous to integers.) This gives us a weaker version of the theorem above:

(Fermat's little theorem for polynomials, weak version.) If $n$ is prime and $a$ is not a multiple of $n$, then for any $r \ge 2$ \[ (X + a)^n \equiv X^n + a \pmod{X^r - 1, n}\text{.} \]

The “double mod” notation above may be unfamiliar, but in this case its meaning is simple. We consider two polynomials congruent modulo $X^r - 1, n$ when they are congruent modulo $n$ after you reduce the powers of $X$ modulo $r$ and combine like terms. More generally, two polynomials $f(X)$ and $g(X)$ are congruent modulo $n(X), n$ if $f(X) - g(X) \equiv n(X) \cdot h(X) \pmod{n}$ for some polynomial $h(X)$.
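Here is a minimal sketch of arithmetic modulo $X^r - 1, n$, using the built-in `BigInt` type and arrays of $r$ coefficients. The function names are made up for illustration, and the quadratic-time multiplication is chosen for clarity rather than speed. For example, for $n = 7$, $r = 5$, and $a = 1$, both sides reduce to $X^2 + 1$, whereas for $n = 4$, $r = 3$, and $a = 1$ the left-hand side reduces to $2X^2 + X + 1$ but the right-hand side is $X + 1$.

```javascript
// Multiplies two polynomials modulo (X^r - 1, n).  Polynomials are
// arrays of r BigInt coefficients, lowest degree first; r is a plain
// number with r >= 2, and n is a BigInt with n >= 2.
function mulMod(f, g, r, n) {
  var result = new Array(r).fill(0n);
  for (var i = 0; i < r; i++) {
    if (f[i] === 0n) {
      continue;
    }
    for (var j = 0; j < r; j++) {
      // X^r = 1, so exponents wrap around modulo r.
      var k = (i + j) % r;
      result[k] = (result[k] + f[i] * g[j]) % n;
    }
  }
  return result;
}

// Computes (X + a)^n modulo (X^r - 1, n) by repeated squaring.
function powXPlusAMod(a, n, r) {
  var base = new Array(r).fill(0n);
  base[0] = a % n;
  base[1] = 1n;
  var result = new Array(r).fill(0n);
  result[0] = 1n;
  for (var e = n; e > 0n; e >>= 1n) {
    if (e & 1n) {
      result = mulMod(result, base, r, n);
    }
    base = mulMod(base, base, r, n);
  }
  return result;
}

// Returns whether (X + a)^n = X^n + a (mod X^r - 1, n).
function weakFermatHolds(a, n, r) {
  var lhs = powXPlusAMod(a, n, r);
  var rhs = new Array(r).fill(0n);
  rhs[0] = a % n;
  var k = Number(n % BigInt(r));
  rhs[k] = (rhs[k] + 1n) % n;
  return lhs.every(function(c, i) { return c === rhs[i]; });
}
```

Each multiplication only ever touches $r$ coefficients, no matter how large $n$ is.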

With this theorem, we only have to compare $r$ coefficients, but we introduce the possibility of the condition above being met even when $n$ is composite. But can we impose conditions on $r$ and $a$ such that if the condition holds for a polynomial number of pairs of $r$ and $a$, we can be sure that $n$ is prime? The answer is “yes”; in particular, we can find a single $r$ and an upper bound $M$ polylog in $n$ such that if the condition holds for $r$ and $0 \le a \lt M$, then $n$ is prime.

In the remainder of this article, we'll work backwards. That is, we'll first assume we have some $n \ge 2$, $r \ge 2$, and $M \ge 1$ such that for all $0 \le a \lt M$ \[ (X + a)^n \equiv X^n + a \pmod{X^r - 1, n}\text{.} \] Then we'll assume that $n$ is not a power of one of its prime divisors $p$ and try to deduce the conditions that imposes on $n$, $r$, $M$, and $p$. Then we can take the contrapositive to find the inverse conditions on $n$, $r$, $M$, and $p$ that would then force $n$ to be a power of $p$. Since we can easily test whether $n$ is a perfect power, if it's not one, we can immediately conclude that $n = p^1$ and thus prime. (Of course, if it does turn out to be a perfect power, then it is trivially composite.)

To understand the conditions that we will derive, we must first talk about introspective numbers.

2. Introspective numbers

Given a base $b$, a polynomial $g(X)$ and a number $q$, we call $q$ introspective^[2] for $g(X)$ modulo $b$ (with $r$ fixed as above) if \[ g(X)^q \equiv g(X^q) \pmod{X^r - 1, b}\text{.} \]

We also say that $g(X)$ is introspective under $q$ modulo $b$.

A basic property of introspective numbers and polynomials is that they are closed under multiplication. That is, if $q_1$ and $q_2$ are introspective for $g(X)$ modulo $b$, then $q_1 \cdot q_2$ is also introspective for $g(X)$ modulo $b$, and if $g_1(X)$ and $g_2(X)$ are introspective under $q$ modulo $b$, then $g_1(X) \cdot g_2(X)$ is also introspective under $q$ modulo $b$.

In particular, given our assumptions above, we can easily see that $1$, $p$, and $n$ are introspective for $X + a$ modulo $p$ for any $0 \le a \lt M$. We can also show that $n/p$ is also introspective for $X + a$ modulo $p$. Using closure under multiplication, we can talk about the set of numbers generated by $p$ and $n/p$, which are all introspective for $X + a$ modulo $p$. Call this set $I$:

\[ I = \left\{ p^i \left( n/p \right)^j \mid i, j \ge 0 \right\}\text{.} \]

We can also take the closure of all $X + a$ to get a set of polynomials which are all introspective under $p$, $n/p$, or any number in $I$. Call this set $P$: \[ P = \left\{ 0 \right\} \cup \left\{ X^{e_0} \cdot (X + 1)^{e_1} \dotsm (X + M - 1)^{e_{M - 1}} \mid e_0, e_1, \dotsc, e_{M - 1} \ge 0 \right\}\text{.} \] To summarize, $I$ is a set of numbers and $P$ is a set of polynomials such that for any $i \in I$ and $g(X) \in P$, $i$ is introspective for $g(X)$ modulo $p$. Of course, it's still not clear what these two sets have to do with whether $n$ is prime or not. But we will examine certain finite sets related to $I$ and $P$ and their sizes, and we will see that we can deduce their properties depending on the relation of $n$ to $p$.

3. Bounds on finite sets related to $I$ and $P$

Now we're ready to work towards finding our restrictions on $n$, $r$, $M$, and $p$. We'll slowly build them up such that when the last one falls into place, we know that $n$ is a perfect power of $p$. Here's what we're starting with:

$n \ge 2$,
$r \ge 2$,
$M \ge 1$,
$p$ is a prime divisor of $n$.

Let us restrict $I$ to a finite set by bounding the exponents of $p$ and $n/p$: \[ I_k = \left\{ p^i (n/p)^j \mid 0 \le i, j \lt k \right\} \subset I\text{.} \]

Notice that if $n$ is not a power of $p$, then all members of $I_k$ are distinct, and therefore we can easily calculate its size:^[3] \[ |I_k| = k^2\text{.} \]
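A quick way to convince yourself of this is to count the distinct elements directly. The sketch below uses the built-in `BigInt` type, and `sizeOfIk` is a name made up for illustration. For $n = 12$ and $p = 2$ the elements are $2^i \cdot 6^j$, which are all distinct, so $|I_4| = 16$. But for $n = 8$ and $p = 2$ the elements are $2^{i + 2j}$, which collide, so $|I_4|$ is only $10$.

```javascript
// Returns the number of distinct values p^i * (n/p)^j for
// 0 <= i, j < k, where n and p are BigInts with p dividing n, and k
// is a plain number.
function sizeOfIk(n, p, k) {
  var q = n / p;
  var seen = new Set();
  for (var i = 0; i < k; i++) {
    for (var j = 0; j < k; j++) {
      // Store the decimal string so that equal values collapse.
      seen.add((p ** BigInt(i) * q ** BigInt(j)).toString());
    }
  }
  return seen.size;
}
```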

Let's also restrict $P$ to a finite set by bounding the degrees of its polynomials: \[ P_d = \left\{ g \in P \mid \deg(g) \lt d \right\} \subset P\text{.} \]

We can calculate $|P_d|$ exactly,^[4] but we only need a lower bound for when $d \le M$. Consider $P_d^{\{0, 1\}}$, the subset of $P_d$ where each $X + a$ is present at most once. Since each $X + a$ is either present or not present, but not all of them can be present at the same time, there are $2^d - 1$ distinct polynomials in $P_d^{\{0, 1\}}$. Adding back the zero polynomial yields $|P_d^{\{0, 1\}}| = 2^d$. Since $P_d^{\{0, 1\}}$ is a subset of $P_d$, $|P_d| \ge |P_d^{\{0, 1\}}| = 2^d$. Therefore, if $d \le M$, then^[5] \[ |P_d| \ge 2^d\text{.} \] This will be useful later (for a particular value of $d$), so let's add the restriction to $M$:

$n \ge 2$,
$r \ge 2$,
$M \ge d$,
$p$ is a prime divisor of $n$.

Let us restrict $I$ in a different way, by reducing modulo $r$: \[ J = \left\{ x \bmod r \mid x \in I \right\} \] and let $t = |J|$. (This size will play an important role later.)
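Even though $I$ is infinite, $J$ is easy to compute: start with $1$ and keep multiplying by $p$ and $n/p$ modulo $r$ until nothing new shows up. Here is a minimal sketch using plain numbers, with `computeJ` being a name made up for illustration. For example, for $n = 15$, $p = 3$, and $r = 7$ we get $J = \{ 1, 2, 3, 4, 5, 6 \}$ and so $t = 6$, whereas for $n = 15$, $p = 5$, and $r = 4$ we get $J = \{ 1, 3 \}$ and so $t = 2$.

```javascript
// Returns the set J of residues modulo r of all p^i * (n/p)^j, found
// by closing {1} under multiplication by p and n/p modulo r.  Uses
// plain numbers, so it is only suitable for small inputs; p must
// divide n.
function computeJ(n, p, r) {
  var generators = [p % r, (n / p) % r];
  var J = new Set([1 % r]);
  var pending = [1 % r];
  while (pending.length > 0) {
    var x = pending.pop();
    generators.forEach(function(g) {
      var y = (x * g) % r;
      if (!J.has(y)) {
        J.add(y);
        pending.push(y);
      }
    });
  }
  return J;
}
```

The size of the returned set is $t$.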

Our final set that we're interested in needs some background to define. We want to find a subset of $P$ that lies in some field $F$ because fields have some convenient properties that we will use later.^[6]

Consider $\mathbb{Z}/p\mathbb{Z}$, the ring of integers modulo $p$. Since $p$ is prime, it is also a field. In particular, it is the finite field $\mathbb{F}_p$ of order $p$. Then consider $\mathbb{F}_p[X]$, its polynomial ring, which is the set of polynomials with coefficients in $\mathbb{F}_p$. Given some polynomial $q(X) \in \mathbb{F}_p[X]$, we can further reduce modulo $q(X)$ to get $\mathbb{F}_p[X] / q(X)$. Finally, if $q(X)$ is irreducible over $\mathbb{F}_p$, then $\mathbb{F}_p[X] / q(X)$ is also a field.

(We can show that both $\mathbb{F}_p = \mathbb{Z}/p\mathbb{Z}$ and $\mathbb{F}_p[X] / q(X)$ are fields from the same general theorem of rings: if $R$ is a principal ideal domain and $(c)$ is the two-sided ideal generated by $c$, then the quotient ring $R / (c)$ is a field if and only if $c$ is a prime element of $R$.)^[7]

So we just need to find a polynomial that's irreducible over $\mathbb{F}_p$. We know that $X^r - 1$ has $Φ_r(X)$, the $r$th cyclotomic polynomial, as a factor. $Φ_r(X)$ is irreducible over $\mathbb{Z}$, but not necessarily over $\mathbb{F}_p$. But if $r$ is relatively prime to $p$, then $Φ_r(X)$ factors into irreducible polynomials all of degree $o_r(p)$ (the multiplicative order of $p$ modulo $r$) over $\mathbb{F}_p$.^[8] Then we can just require that $r$ be relatively prime to $p$. If we do so, then we can let $h(X)$ be one of the factors of $Φ_r(X)$ over $\mathbb{F}_p$ and we have our field $F = \mathbb{F}_p[X] / h(X)$.

$n \ge 2$,
$r \ge 2$, $r$ relatively prime to $p$,
$M \ge d$,
$p$ is a prime divisor of $n$.

Finally, we can define our last set. Let \[ Q = \left\{ f(X) \bmod (h(X), p) \mid f(X) \in P \right\} \subseteq F\text{.} \]

We can map elements of $P$ into $Q$ via reduction modulo $(h(X), p)$. But we're interested in only the elements of $P$ that map to distinct elements of $Q$, since that will let us find a lower bound for $|Q|$. A simple example would be the set of $X + a$ for $0 \le a \lt M$; if the degree of $h(X)$ is greater than $1$ and $p \ge M$, then each $X + a$ is distinct in $Q$.

Another interesting set is $X^k$ for $1 \le k \le r$. Since $h(X) \equiv 0 \pmod{h(X), p}$, we can say that $X$ is a root of the polynomial function $h(y)$ over the field $F$. But since $h(y)$ is a factor of $Φ_r(y)$, $X$ is then a primitive $r$th root of unity in $Q$.^[9] But the powers of a primitive $r$th root of unity (from $1$ to $r$) are all distinct. Therefore all $X^k$ for $1 \le k \le r$ are distinct in $Q$.

Most importantly, we can show that distinct elements in $P_d$ map to distinct elements in $Q$ if $d \le t$. Let $f(X)$ and $g(X)$ be two different elements of $P_d$. Assume that $f(X) \equiv g(X) \pmod{h(X), p}$. Then, for $m \in I$: \[ f(X^m) \equiv f(X)^m \pmod{X^r - 1, p} \] and \[ g(X^m) \equiv g(X)^m \pmod{X^r - 1, p} \] by introspection modulo $p$. Since $h(X)$ divides $X^r - 1$, both congruences also hold modulo $(h(X), p)$, where $f(X)^m \equiv g(X)^m$ by assumption, and therefore \[ f(X^m) \equiv g(X^m) \pmod{h(X), p}\text{.} \] Therefore, all $X^m$ for $m \in I$ are roots of the polynomial function $u(y) = f(y) - g(y)$ over the field $F$, and in particular all $X^m$ for $m \in J$. But all such $X^m$ are distinct in $Q$ by the argument above. Therefore, $u(y)$ must have degree at least $t$ since a polynomial over a field cannot have more roots than its degree. But the degree of $u(y)$ is less than $d$ since both $f(y)$ and $g(y)$ have degree less than $d$. Since $d \le t$, this is a contradiction, so therefore $f(X) \not\equiv g(X) \pmod{h(X), p}$. But since $f(X)$ and $g(X)$ were arbitrary, that implies that distinct elements of $P_d$ map to distinct elements of $Q$ for $d \le t$.

Given the above, we can conclude that as long as we require that $d \le t$, $p \ge M$, and $o_r(p) = \deg(h(X)) \gt 1$, then \[ |Q| \ge |P_d| \ge 2^d\text{.} \]

$n \ge 2$,
$o_r(p) \gt 1$,
$M \ge d$,
$t \ge d$,
$p \ge M$, $p$ is a prime divisor of $n$.

4. The AKS theorem (weak version)

We're finally ready to put it all together. Again assume $n$ is not a power of $p$, and recall that $|J| = t$. Let $s \gt \sqrt{t}$. Then $|I_s| = s^2 \gt t$. By the pigeonhole principle, there must be two elements $m_1, m_2 \in I_s$ that map to the same element in $J$; that is, there must be $m_1, m_2 \in I_s$ such that $m_1 \equiv m_2 \pmod{r}$. Now pick some $g(X)$ from $P$. Then \[ g(X)^{m_1} \equiv g(X^{m_1}) \pmod{X^r - 1, p} \] and \[ g(X)^{m_2} \equiv g(X^{m_2}) \pmod{X^r - 1, p} \] by introspection modulo $p$. But $X^{m_1} \equiv X^{m_2} \pmod{X^r - 1}$ since $m_1 \equiv m_2 \pmod{r}$, so \[ g(X^{m_1}) \equiv g(X^{m_2}) \pmod{X^r - 1, p}\text{.} \] Chaining all these congruences together lets us deduce that \[ g(X)^{m_1} \equiv g(X)^{m_2} \pmod{X^r - 1, p}\text{,} \] which immediately leads to \[ g(X)^{m_1} \equiv g(X)^{m_2} \pmod{h(X), p}\text{.} \]

That means that $g(X) \bmod (h(X), p) \in Q$ is a root of the polynomial function $u(y) = y^{m_1} - y^{m_2}$ over the field $F$. But $g(X)$ was picked arbitrarily from $P$, so $u(y)$ has at least $|Q|$ roots. $\deg(u(y)) = \max(m_1, m_2) \le p^{s-1} \cdot (n/p)^{s-1} = n^{s-1}$, and $u(y)$, being a polynomial over a field, cannot have more roots than its degree, so if $n$ is not a power of $p$, then $|Q| \le n^{s-1}$. Equivalently, if $|Q| \gt n^{s-1}$, then $n$ must be a power of $p$.^[10] But we've shown above that $|Q| \ge 2^d$ for $d \le t$, so if we can pick $d$ and $s$ such that $2^d \gt n^{s-1}$, then we can force $n$ to be a power of $p$. Taking logs, we see that this is equivalent to picking $d$ and $s$ such that $d \gt (s - 1) \lg n$. Since $d \le t$, this imposes $t \gt (s - 1) \lg n$ in order for there to be room to pick $d$. Rearranging, we get $s \lt \frac{t}{\lg n} + 1$. But $s \gt \sqrt{t}$, so this imposes $\sqrt{t} \lt \frac{t}{\lg n} + 1$ in order for there to be room to pick $s$. Rearranging again, we get $\frac{t}{\sqrt{t} - 1} \gt \lg n$. Since $\frac{t}{\sqrt{t} - 1} \gt \sqrt{t}$, it suffices to require that $t \gt \lg^2 n$ in order for there to be room to pick $d$ and $s$. Furthermore, since $s$ has to be an integer, then $s \ge \lfloor \sqrt{t} \rfloor + 1$, and therefore $d \gt \lfloor \sqrt{t} \rfloor \lg n$. Let's update our assumptions:

$n \ge 2$,
$o_r(p) \gt 1$,
$M \ge d \gt \lfloor \sqrt{t} \rfloor \lg n$,
$t \gt \lg^2 n$,
$p \ge M$, $p$ is a prime divisor of $n$.

So to summarize, if we make the above assumptions, we can pick $d$ and $s$ such that $|Q| \ge 2^d \gt n^{s - 1}$, which implies that $n$ must be a power of $p$, which was our goal. Now we just have to express all assumptions in terms of $n$, $r$, and $M$, strengthening them if necessary. $J$ is generated by $p$ and $n/p$, so its order (i.e., $t$) is at least $o_r(p)$, which is in turn at least $o_r(n)$, since $p$ is a prime factor of $n$ (this brings along the assumption that $r$ and $n$ are relatively prime). Therefore, we can replace the assumptions $t \gt \lg^2 n$ and $o_r(p) \gt 1$ with $o_r(n) \gt \lg^2 n$. We can remove the reference to $d$ by finding the maximum value of $t$. Since $r$ is relatively prime to $n$, $J$ is a subgroup of $Z_r$, and therefore its order divides (and therefore is at most) $φ(r)$. So we can replace $M \ge d \gt \lfloor \sqrt{t} \rfloor \lg n$ with $M \gt \lfloor \sqrt{φ(r)} \rfloor \lg n$. Finally, we can remove the reference to $p$ by mandating that $n$ has no prime factor less than $M$. Here are our final assumptions:

$n \ge 2$, $n$ has no prime factors less than $M$,
$o_r(n) \gt \lg^2 n$,
$M \gt \lfloor \sqrt{φ(r)} \rfloor \lg n$.

We can summarize the above discussion in the following theorem:

(AKS theorem, weak version.) Let $n \ge 2$, $r$ be relatively prime to $n$ with $o_r(n) \gt \lg^2 n$, and $M \gt \lfloor \sqrt{φ(r)} \rfloor \lg n$. Furthermore, let $n$ have no prime factor less than $M$ and let \[ (X + a)^n \equiv X^n + a \pmod{X^r - 1, n} \] for $0 \le a \lt M$. Then $n$ is the power of some prime $p \ge M$.
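To get a feel for the hypothesis $o_r(n) \gt \lg^2 n$, here is a minimal sketch that computes the multiplicative order by brute force. It uses the built-in BigInt type rather than any particular bignum class, and the names multiplicativeOrder and orderExceedsLgSquared are made up for this illustration. The comparison against $\lg^2 n$ is done in floating point, which is only adequate for small examples.

```javascript
// Greatest common divisor via the Euclidean algorithm (BigInt inputs).
function gcd(a, b) {
  while (b !== 0n) {
    [a, b] = [b, a % b];
  }
  return a;
}

// Returns o_r(n), the smallest k >= 1 with n^k = 1 (mod r), or 0n if
// n and r are not relatively prime (in which case no such k exists).
function multiplicativeOrder(n, r) {
  n = BigInt(n);
  r = BigInt(r);
  if (r < 2n) {
    throw new RangeError('r must be >= 2');
  }
  if (gcd(n % r, r) !== 1n) {
    return 0n;
  }
  var k = 1n;
  var power = n % r;
  while (power !== 1n) {
    power = (power * n) % r;
    k += 1n;
  }
  return k;
}

// Returns true iff o_r(n) > lg^2 n, comparing in floating point
// (an assumption that is fine for illustration only).
function orderExceedsLgSquared(n, r) {
  var lg = Math.log2(Number(n));
  return Number(multiplicativeOrder(n, r)) > lg * lg;
}
```

For example, one could loop over $r = 2, 3, \dotsc$ and stop at the first $r$ for which orderExceedsLgSquared(n, r) returns true.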

And that's it for now! In the follow-up article we will strengthen this theorem to further show that $n$ is equal to $p$, and therefore prime. Then we will use this result to get a primality-testing algorithm that we will prove to be polynomial time.


Footnotes

[1] We use uppercase letters for variables when we treat polynomials as formal polynomials and lowercase letters when we treat them as functions. ↩

[2] The term “introspection”, which comes from the original AKS paper, was probably chosen to invoke the idea that the exponent $q$ can be pushed into and pulled out of $g(X)$. Here we generalize it a bit. ↩

[3] This condition is too weak to be useful by itself, but we will parlay it into something we can use later. ↩

[4] Using the ideas on this page, we can show that $|P_d| = {M + d \choose d - 1} + 1$ by considering each $X + a$ a labeled urn (plus a “dummy” urn) and each unit of power an unlabeled ball. (This was used in the AKS paper.) ↩

[5] This lower bound, as well as other ideas that simplify the proof, was taken from Prime Numbers: A Computational Perspective. ↩

[6] You may first want to brush up on the definitions of group, ring, and field, and the differences between them. ↩

[7] This is Theorem 1.47(iv) from “Introduction to finite fields and their applications”. ↩

[8] The reducibility of $Φ_r(X)$ over $\mathbb{F}_p$ given $r$ relatively prime to $p$ is Theorem 2.47(ii) from “Introduction to finite fields and their applications”. ↩

[9] It's a bit weird to talk about a polynomial being the root of other polynomials, but recall that we can form a polynomial ring over any field, even a field of polynomials. We keep track of which polynomials belong to which domains by using different variables. ↩

[10] Here's where we force $n$ to be a prime power. ↩

An Introduction to Primality Testing

2012-07-08T00:00:00-07:00

I will explain two commonly-used primality tests: Fermat and Miller-Rabin. Along the way, I will cover the basic concepts of primality testing. I won't be assuming any background in number theory, but familiarity with modular arithmetic will be helpful. I will also be providing implementations in Javascript, so familiarity with it will also be helpful. Finally, since Javascript doesn't natively support arbitrary-precision arithmetic, I wrote a simple natural number class (SNat) that represents a number as an array of decimal digits. All algorithms used are the simplest possible, except when a more efficient one is needed by the algorithms we discuss.

Primality testing is the problem of determining whether a given natural number is prime or composite. Compared to the problem of integer factorization, which is to determine the prime factors of a given natural number, primality testing turns out to be easier; integer factorization is in NP and is thought to be neither in P nor NP-complete, whereas primality testing is now known to be in P.

Most primality tests are actually compositeness tests; they involve finding composite witnesses, which are numbers that, along with a given number to be tested, can be fed to some easily-computable function to prove that the given number is composite. (The composite witness, along with the function, is a certificate of compositeness of the given number.) A primality test can either check each possible witness or, like the Fermat and Miller-Rabin tests, it can randomly sample some number of possible witnesses and call the number prime if none turn out to be witnesses. In the latter case, there is a chance that a composite number can erroneously be called prime; ideally, this chance goes to zero quickly as the sample size increases.

The simplest possible witness type is, of course, a factor of the given number, which we'll call a factor witness. If the number to be tested is $n$ and the possible factor witness is $a$, then one can simply test whether $a$ divides $n$ (written as $a \mid n$) by evaluating $n \bmod a = 0$; that is, whether the remainder of $n$ divided by $a$ is zero. This doesn't yield a feasible deterministic primality test, though, since checking all possible witnesses is equivalent to factoring the given number. Nor does it yield a feasible probabilistic primality test, since in the worst case the given number has very few factors, which random sampling would miss.

The simplest useful witness type is a Fermat witness, which relies on the following theorem of Fermat:

(Fermat's little theorem.) If $n$ is prime and $a$ is not a multiple of $n$, then \[ a^{n-1} \equiv 1 \pmod{n}\text{.} \]

Thus, a Fermat witness is a number $1 \lt a \lt n$ such that $a^{n-1} \not\equiv 1 \pmod{n}$. Conversely, if $n$ is composite and $a^{n-1} \equiv 1 \pmod{n}$, then $a$ is a Fermat liar.


If $n$ has at least one Fermat witness that is relatively prime to $n$, then we can show that at least half of all possible witnesses are Fermat witnesses. (Roughly, if $a$ is the Fermat witness and $a_1, a_2, \dotsc, a_s$ are Fermat liars, then all $a \cdot a_i$ are also Fermat witnesses.) Therefore, for a sample of $k$ possible witnesses of $n$, the probability of all of them being Fermat liars is $\le 2^{-k}$, which goes to zero quickly enough to be practical.

However, there is the possibility that $n$ is a composite number with no relatively prime Fermat witnesses. These are called Carmichael numbers. Even though Carmichael numbers are rare, their existence still makes the Fermat primality test unsuitable for some situations, as when the numbers to be tested are provided by some adversary.
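The difference between an ordinary composite and a Carmichael number can be seen by brute force. The following is a minimal sketch using the built-in BigInt type instead of SNat; the helper names (powMod, gcd, countCoprimeFermat) are made up for this illustration.

```javascript
// Computes base^exponent mod modulus by repeated squaring (BigInt inputs).
function powMod(base, exponent, modulus) {
  var result = 1n % modulus;
  base %= modulus;
  while (exponent > 0n) {
    if (exponent & 1n) {
      result = (result * base) % modulus;
    }
    base = (base * base) % modulus;
    exponent >>= 1n;
  }
  return result;
}

// Greatest common divisor via the Euclidean algorithm.
function gcd(a, b) {
  while (b !== 0n) {
    [a, b] = [b, a % b];
  }
  return a;
}

// Among all 1 < a < n relatively prime to n, counts how many are
// Fermat witnesses (a^{n-1} != 1 mod n) and how many are not.
function countCoprimeFermat(n) {
  n = BigInt(n);
  var witnesses = 0;
  var liars = 0;
  for (var a = 2n; a < n; a++) {
    if (gcd(a, n) !== 1n) {
      continue;
    }
    if (powMod(a, n - 1n, n) === 1n) {
      liars++;
    } else {
      witnesses++;
    }
  }
  return { witnesses: witnesses, liars: liars };
}
```

Calling countCoprimeFermat(15) finds more witnesses than liars, whereas countCoprimeFermat(561) finds no relatively prime witnesses at all, since $561 = 3 \cdot 11 \cdot 17$ is the smallest Carmichael number.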

Here is the Fermat compositeness test implemented in Javascript:

// Runs the Fermat compositeness test given n > 2 and 1 < a < n.
// Calculates r = a^{n-1} mod n and whether a is a Fermat witness to n
// (i.e., r != 1, which means n is composite).  Returns a dictionary
// with a, n, r, and isCompositeByFermat, which is true iff a is a
// Fermat witness to n.
function testCompositenessByFermat(n, a) {
  n = SNat.cast(n);
  a = SNat.cast(a);

  if (n.le(2)) {
    throw new RangeError('n must be > 2');
  }

  if (a.le(1) || a.ge(n)) {
    throw new RangeError('a must satisfy 1 < a < n');
  }

  var r = a.powMod(n.minus(1), n);
  var isCompositeByFermat = r.ne(1);
  return {
    a: a,
    n: n,
    r: r,
    isCompositeByFermat: isCompositeByFermat
  };
}

Note that the algorithm depends on the efficiency of modular exponentiation when calculating $a^{n-1} \pmod{n}$. The naive method is unsuitable since it requires $Θ(n)$ $b$-bit multiplications, where $b = \lceil \lg n \rceil$. SNat uses repeated squaring, which requires only $Θ(\lg n)$ $b$-bit multiplications.
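For readers without SNat at hand, here is a minimal sketch of repeated squaring using the built-in BigInt type. It is not SNat's actual implementation, just the standard right-to-left binary method; the function name powMod is chosen for this example.

```javascript
// Computes base^exponent mod modulus using repeated squaring, which
// takes one squaring per bit of the exponent plus one multiplication
// per set bit, i.e. Θ(lg exponent) modular multiplications.
function powMod(base, exponent, modulus) {
  base = BigInt(base);
  exponent = BigInt(exponent);
  modulus = BigInt(modulus);

  if (modulus <= 0n) {
    throw new RangeError('modulus must be > 0');
  }
  if (exponent < 0n) {
    throw new RangeError('exponent must be >= 0');
  }

  var result = 1n % modulus;
  base %= modulus;
  while (exponent > 0n) {
    // If the lowest bit is set, fold the current power into the result.
    if (exponent & 1n) {
      result = (result * base) % modulus;
    }
    base = (base * base) % modulus;
    exponent >>= 1n;
  }
  return result;
}
```

Reducing modulo the modulus after every multiplication keeps all intermediate values below the square of the modulus, which is what makes each step a $b$-bit multiplication.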

Another useful witness type is a non-trivial square root of unity $\operatorname{mod} n$; that is, a number $a \not\equiv \pm 1 \pmod{n}$ such that $a^2 \equiv 1 \pmod{n}$. It is a theorem of number theory that if $n$ is prime, there are no non-trivial square roots of unity $\operatorname{mod} n$. Therefore, if we do find one, that means $n$ is composite. In fact, finding one leads directly to factors of $n$. By definition, a non-trivial square root of unity $a$ satisfies $a \pm 1 \not\equiv 0 \pmod{n}$ and $a^2 - 1 \equiv 0 \pmod{n}$. Factoring the latter leads to $(a+1)(a-1) \equiv 0 \pmod{n}$, which means that $n$ divides $(a+1)(a-1)$. But the first condition says that $n$ divides neither $a+1$ nor $a-1$, so $n$ must be a product of two numbers $p \mid a+1$ and $q \mid a-1$, both greater than $1$. Then $\gcd(a+1, n)$^[1] and $\gcd(a-1, n)$ are non-trivial factors of $n$.
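As a small worked example, $4^2 = 16 \equiv 1 \pmod{15}$ and $4 \not\equiv \pm 1 \pmod{15}$, so $\gcd(5, 15) = 5$ and $\gcd(3, 15) = 3$ are factors of $15$. The sketch below does the same computation with the built-in BigInt type; the function name factorsFromSquareRootOfUnity is made up for this illustration.

```javascript
// Greatest common divisor via the Euclidean algorithm (BigInt inputs).
function gcd(a, b) {
  while (b !== 0n) {
    [a, b] = [b, a % b];
  }
  return a;
}

// Given a non-trivial square root of unity a mod n, returns the pair
// [gcd(a+1, n), gcd(a-1, n)], both of which are non-trivial factors
// of n.  Throws if a is not a non-trivial square root of unity.
function factorsFromSquareRootOfUnity(a, n) {
  a = BigInt(a);
  n = BigInt(n);
  // Normalize a into the range [0, n).
  a = ((a % n) + n) % n;

  if ((a * a) % n !== 1n) {
    throw new RangeError('a^2 must be congruent to 1 mod n');
  }
  if (a === 1n || a === n - 1n) {
    throw new RangeError('a must not be congruent to +/-1 mod n');
  }

  return [gcd(a + 1n, n), gcd(a - 1n, n)];
}
```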

Finding non-trivial square roots of unity by itself doesn't give a useful primality testing algorithm, but combining it with the Fermat primality test does. $a^{n-1} \bmod n$ either equals $1$ or not. If it doesn't, you're done since you have a Fermat witness. If it does equal $1$, and $n-1$ is even, then consider the square root of $a^{n-1}$, i.e. $a^{(n-1)/2}$. If it is not $\pm 1$, then it is a non-trivial square root of unity. If it is $-1$, then you can't do anything else. But if it is $1$, and $(n-1)/2$ is even, you can then take another square root and repeat the test, stopping when the exponent of $a$ becomes odd or when you get a result not equal to $1$.

To turn this into an algorithm, you simply start from the bottom up: find the greatest odd factor of $n-1$, call it $t$, and keep squaring $a^t$ mod $n$ until you find a non-trivial square root of unity $\operatorname{mod} n$ or until you can deduce the value of $a^{n-1}$. In fact, this is almost as fast as the original Fermat primality test, since the exponentiation by $n-1$ has to do the same sort of squaring, and we're just adding comparisons to $\pm 1$ in between squarings.

The original idea for the test above is from Artjuhov, although it is usually credited to Miller. Therefore, we call $a$ an Artjuhov witness^[2] of $n$ if it shows $n$ composite by the above test.


If $n$ is an odd composite, then it can be shown (originally by Rabin) that at least three quarters of all possible witnesses are Artjuhov witnesses. Therefore, for a sample of $k$ possible witnesses of $n$, the probability of all of them being Artjuhov liars is $\le 4^{-k}$, which is stronger than the bound for the Fermat primality test. Furthermore, this bound is unconditional; there is nothing like Carmichael numbers for the Artjuhov test.
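The three-quarters bound can be checked by brute force for small $n$. The sketch below uses the built-in BigInt type and the compact textbook formulation of the test rather than the more detailed bookkeeping of the implementation in this article; the names isArtjuhovWitness and countNonWitnesses are made up for this illustration.

```javascript
// Computes base^exponent mod modulus by repeated squaring (BigInt inputs).
function powMod(base, exponent, modulus) {
  var result = 1n % modulus;
  base %= modulus;
  while (exponent > 0n) {
    if (exponent & 1n) {
      result = (result * base) % modulus;
    }
    base = (base * base) % modulus;
    exponent >>= 1n;
  }
  return result;
}

// Returns true iff a shows odd n > 2 to be composite, for 1 < a < n.
function isArtjuhovWitness(n, a) {
  // Write n-1 = t*2^s with t odd.
  var t = n - 1n;
  var s = 0;
  while (t % 2n === 0n) {
    t /= 2n;
    s++;
  }

  var r = powMod(a, t, n);
  if (r === 1n || r === n - 1n) {
    return false;
  }
  for (var i = 1; i < s; i++) {
    r = (r * r) % n;
    if (r === n - 1n) {
      return false;
    }
    if (r === 1n) {
      // The previous r was a non-trivial square root of unity.
      return true;
    }
  }
  return true;
}

// Counts the 2 <= a <= n-2 that are not Artjuhov witnesses to n.
function countNonWitnesses(n) {
  n = BigInt(n);
  var count = 0;
  for (var a = 2n; a <= n - 2n; a++) {
    if (!isArtjuhovWitness(n, a)) {
      count++;
    }
  }
  return count;
}
```

For an odd composite $n$, countNonWitnesses(n) counts the Artjuhov liars in the sampled range, and should never exceed a quarter of $n - 1$; for a prime $n$, every candidate is a non-witness.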

Here is the Artjuhov compositeness test, implemented in Javascript:

// Runs the Artjuhov compositeness test given n > 2 and 1 < a < n.
// Finds the largest s such that n-1 = t*2^s, calculates r = a^t mod
// n, then repeatedly squares r (mod n) up to s times until r is
// congruent to -1, 0, or 1 (mod n).  Then, based on the value of s
// and the final value of r and i (the number of squarings),
// determines whether a is an Artjuhov witness to n (i.e., n is
// composite).
//
// Returns a dictionary with a, n, s, t, i, r, rSqrt = sqrt(r) if i >
// 0 and null otherwise, and isCompositeByArtjuhov, which is true iff
// a is an Artjuhov witness to n.
function testCompositenessByArtjuhov(n, a) {
  n = SNat.cast(n);
  a = SNat.cast(a);

  if (n.le(2)) {
    throw new RangeError('n must be > 2');
  }

  if (a.le(1) || a.ge(n)) {
    throw new RangeError('a must satisfy 1 < a < n');
  }

  var nMinusOne = n.minus(1);

  // Find the largest s and t such that n-1 = t*2^s.
  var t = nMinusOne;
  var s = new SNat(0);
  while (t.isEven()) {
    t = t.div(2);
    s = s.plus(1);
  }

  // Find the smallest 0 <= i < s such that a^{t*2^i} = 0/-1/+1 (mod
  // n).
  var i = new SNat(0);
  var rSqrt = null;
  var r = a.powMod(t, n);
  while (i.lt(s) && r.gt(1) && r.lt(nMinusOne)) {
    i = i.plus(1);
    rSqrt = r;
    r = r.times(r).mod(n);
  }

  var isCompositeByArtjuhov = false;
  if (s.isZero()) {
    // If 0 = i = s, then this reduces to the Fermat primality test.
    isCompositeByArtjuhov = r.ne(1);
  } else if (i.isZero()) {
    // If 0 = i < s, then:
    //
    //   * r = 0    (mod n) -> a^{n-1} = 0 (mod n), and
    //   * r = +/-1 (mod n) -> a^{n-1} = 1 (mod n).
    isCompositeByArtjuhov = r.isZero();
  } else if (i.lt(s)) {
    // If 0 < i < s, then:
    //
    //   * r =  0 (mod n) -> a^{n-1} = 0 (mod n),
    //   * r = +1 (mod n) -> a^{t*2^{i-1}} is a non-trivial square root of
    //                       unity mod n, and
    //   * r = -1 (mod n) -> a^{n-1} = 1 (mod n).
    //
    // Note that the last case means r = n - 1 > 1.
    isCompositeByArtjuhov = r.le(1);
  } else {
    // If 0 < i = s, then:
    //
    //   * r =  0 (mod n) can't happen,
    //   * r = +1 (mod n) -> a^{t*2^{i-1}} is a non-trivial square root of
    //                       unity mod n, and
    //   * r > +1 (mod n) -> failure of the Fermat primality test.
    isCompositeByArtjuhov = true;
  }

  return {
    a: a,
    n: n,
    t: t,
    s: s,
    i: i,
    r: r,
    rSqrt: rSqrt,
    isCompositeByArtjuhov: isCompositeByArtjuhov
  };
}

With the two compositeness tests above, we can now write a probabilistic primality test:

// Returns true iff a is a Fermat witness to n, and thus n is
// composite.  a and n must satisfy the same conditions as in
// testCompositenessByFermat.
function hasFermatWitness(n, a) {
  return testCompositenessByFermat(n, a).isCompositeByFermat;
}

// Returns true iff a is an Artjuhov witness to n, and thus n is
// composite.  a and n must satisfy the same conditions as in
// testCompositenessByArtjuhov.
function hasArtjuhovWitness(n, a) {
  return testCompositenessByArtjuhov(n, a).isCompositeByArtjuhov;
}

// Returns true if n is probably prime, based on sampling the given
// number of possible witnesses and testing them against n.  If false
// is returned, then n is definitely composite.
//
// By default, uses the Artjuhov test for witnesses with 20 samples
// and Math.random for the random number generator.  This gives an
// error bound of 4^-20 if true is returned.
function isProbablePrime(n, hasWitness, numSamples, rng) {
  n = SNat.cast(n);
  hasWitness = hasWitness || hasArtjuhovWitness;
  rng = rng || Math.random;
  numSamples = numSamples || 20;

  if (n.le(1)) {
    return false;
  }

  if (n.le(3)) {
    return true;
  }

  if (n.isEven()) {
    return false;
  }

  for (var i = 0; i < numSamples; ++i) {
    var a = SNat.random(2, n.minus(2), rng);
    if (hasWitness(n, a)) {
      return false;
    }
  }

  return true;
}

isProbablePrime called with hasFermatWitness is the Fermat primality test, and isProbablePrime called with hasArtjuhovWitness is the Miller-Rabin primality test. The latter is the current general primality test of choice, replacing the Solovay-Strassen primality test.

We can also use isProbablePrime to randomly generate probable primes, which is useful for cryptographic applications:

// Returns a probable b-bit prime that is at least 2^(b-1).  All
// parameters but b are passed to isProbablePrime.
function findProbablePrime(b, hasWitness, rng, numSamples) {
  b = SNat.cast(b);
  rng = rng || Math.random;

  var lb = (new SNat(2)).pow(b.minus(1));
  var ub = lb.times(2);
  while (true) {
    var n = SNat.random(lb, ub, rng);
    // Note the parameter order of isProbablePrime: numSamples comes
    // before rng.
    if (isProbablePrime(n, hasWitness, numSamples, rng)) {
      return n;
    }
  }
}

In this case, for sufficiently large $b$, the Fermat primality test is acceptable, since Carmichael numbers are so rare and we're the ones generating the possible primes to be tested.^[3]

There are other primality tests, but they're less often used in practice because they're either less efficient or more sophisticated than the algorithms above, or they require $n$ to have special properties. Perhaps the most interesting of these tests is the AKS primality test, which proved once and for all that primality testing is in P.


Footnotes

[1] $\gcd$ is the greatest common divisor function. ↩

[2] “Artjuhov witness” is an idiosyncratic name on my part; a more common name is strong witness, which I don't like. ↩

[3] According to Wikipedia, PGP uses the Fermat primality test. ↩

A Pair of Counterexamples in Vector Calculus

2011-11-27T00:00:00-08:00

While recently reviewing some topics in vector calculus, I became curious as to why violating seemingly innocuous conditions for some theorems leads to surprisingly wild results. In fact, I was struck by how these theorems resemble computer programs, not in some abstract way, but in how the lack of “input validation” leads to non-obvious behavior in the face of erroneous input.

I found that understanding why these counterexamples lead to wild results deepened my understanding of the theorems involved and their proofs.^[1] Besides, pathological examples are more interesting than well-behaved ones!

First, let's look at a “counterexample” to Green's theorem:

1. Two functions $L, M \colon \mathbb{R}^2 \to \mathbb{R}$ and a positively-oriented, piecewise smooth, simple closed curve $C$ in $\mathbb{R}^2$ enclosing the region $D$ such that \[ ∮_C L \,dx + M \,dy \ne ∬_D \left( \frac{∂{M}}{∂{x}} - \frac{∂{L}}{∂{y}} \right) \,dx \,dy \text{.} \]

Let \[ L = -\frac{y}{x^2+y^2} \text{,} \quad M = \frac{x}{x^2+y^2} \text{,} \] and $C$ be a curve going counter-clockwise around the rectangle $D = [-1, 1]^2$.^[2] Then the integral of $L \,dx + M \, dy$ around $C$ is $2 π$ since it encloses the origin. But \[ \frac{∂{M}}{∂{x}} = \frac{∂{L}}{∂{y}} = \frac{y^2-x^2}{(x^2+y^2)^2} \] so the difference of the two vanishes everywhere but the origin, where neither function is defined. Therefore, the (improper) integral over $D$ also vanishes, proving the inequality. ∎

Of course, the easy explanation is that the discontinuity of $L$ and $M$ at the origin violates a condition of Green's theorem. But that doesn't really tell us anything, so let's break down the theorem and see where exactly it fails.

Green's theorem is usually proved first for rectangles $[a, b] \times [c, d]$, which suffices for our purpose. If $C$ is a curve that goes counter-clockwise around such a rectangle $D$, then we can easily show that \[ ∮_C L \,dx = - ∬_D \frac{∂{L}}{∂{y}} \,dx \,dy \] and \[ ∮_C M \,dy = ∬_D \frac{∂{M}}{∂{x}} \,dx \,dy \text{,} \] with the sum of these two formulas proving the theorem.

So the first sign of trouble is that the theorem freely interchanges addition and integration. Since the partial derivatives of our functions diverge at the origin, if $D$ contains the origin then the integrals of those partial derivatives over $D$ may not even be defined, even if the integral of their difference is.

But the problem arises even before that. The statements above are proved by showing \[ ∮_C L \,dx = - ∫_a^b \left( ∫_c^d \frac{∂{L}}{∂{y}} \,dy \right) \,dx \] and \[ ∮_C M \,dy = ∫_c^d \left( ∫_a^b \frac{∂{M}}{∂{x}} \,dx \right) \,dy \text{,} \] both of which hold for our example. But notice that in one case we integrate with respect to $y$ first, and in the other case we integrate with respect to $x$ first. Therefore, we have to interchange the order of integration or convert to a double integral in order to get them to a form where we can add them. And there's the rub: if $D$ contains the origin, switching the order of integration for either integral above switches the sign of the result!

This fully explains the discrepancy; since the result of both integrals above (with the iteration order preserved) is $π$, adding them together as-is gives the expected result of $2 π$. But if we switch the iteration order of one of the iterated integrals as done in the proof of Green's theorem, then we switch the result of that integral to $-π$, which cancels with the result of the other unchanged integral to produce $0$.

So now let's examine this strange behavior of the sign of an integration's result depending on the iteration order. This leads us to our next “counterexample,” this time for Fubini's theorem:

2. A function $f \colon \mathbb{R}^2 \to \mathbb{R}$ whose iterated integrals over a rectangle $D = [a, b] \times [c, d] \subset \mathbb{R}^2$ differ.

Let \[ f(x, y) = \frac{y^2-x^2}{(x^2+y^2)^2} \quad \text{ and } \quad D = [-1, 1]^2\text{.} \] The two iterated integrals of $f$ over $D$ are usually written as \[ ∫_{-1}^1 \left( ∫_{-1}^1 f(x, y) \,dy \right) \,dx \qquad \text{ and } \qquad ∫_{-1}^1 \left( ∫_{-1}^1 f(x, y) \,dx \right) \,dy \] but let's define them more carefully to make it easier to justify our calculations.

Let \[ \begin{aligned} u_k &= y \mapsto f(k, y) \\ v_l &= x \mapsto f(x, l) \text{.} \end{aligned} \] In other words, given the real constants $k$ and $l$, construct the (possibly partial) real functions $u_k(y)$ and $v_l(x)$ by partially-evaluating $f$ at $x = k$ and $y = l$, respectively.

Then, if we also let^[3] \[ U(x) = ∫_{-1}^1 u_x(y) \,dy \qquad \text{ and } \qquad V(y) = ∫_{-1}^1 v_y(x) \,dx \text{,} \] we can write the iterated integrals as \[ ∫_{-1}^1 U(x) \,dx \qquad \text{ and } \qquad ∫_{-1}^1 V(y) \,dy \text{.} \]

Computing $U(x)$ for $x ≠ 0$, we get^[4] \[ \begin{aligned} U(x) &= ∫_{-1}^1 \frac{∂{}}{∂{y}} \left( -\frac{y}{x^2+y^2} \right) \,dy \\ &= \left. -\frac{y}{x^2+y^2} \right|_{y=-1}^{y=1} \\ &= -\frac{2}{x^2+1} \text{.} \end{aligned} \]

Attempting to evaluate $U(0)$, we see that \[ \begin{aligned} U(0) &= ∫_{-1}^1 \frac{y^2-0^2}{(0^2+y^2)^2} \,dy \\ &= ∫_{-1}^1 \frac{dy}{y^2} \end{aligned} \] which diverges. So \[ U(x) = -\frac{2}{x^2+1} \text{ for } x \ne 0 \text{.} \]

By a similar computation, we find that^[5] \[ V(y) = \frac{2}{y^2+1} \text{ for } y \ne 0 \text{.} \]

Since $U(x)$ isn't defined at $0$, we have to treat it as an improper integral, although doing so poses no real difficulty: \[ \begin{aligned} ∫_{-1}^1 U(x)\,dx &= \lim_{a \nearrow 0} \left( ∫_{-1}^a -\frac{2}{x^2+1} \,dx \right) + \lim_{a \searrow 0} \left( ∫_{a}^1 -\frac{2}{x^2+1} \,dx \right) \\ &= \lim_{a \nearrow 0} \Bigl( \left. -2 \arctan(x) \right|_{-1}^{a} \Bigr) + \lim_{a \searrow 0} \Bigl( \left. -2 \arctan(x) \right|_{a}^{1} \Bigr) \\ &= \left. -2 \arctan(x) \right|_{-1}^{0} + \left. -2 \arctan(x) \right|_{0}^{1} \\ &= \left. -2 \arctan(x) \right|_{-1}^{1} \\ &= -π \text{.} \end{aligned} \]

Similarly, \[ ∫_{-1}^1 V(y)\,dy = π \text{,} \] so the iterated integrals of $f(x, y)$ over $[-1, 1]^2$ differ; in fact, as we claimed above, switching the iteration order switches the sign of the result. ∎
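The sign flip can also be sanity-checked numerically. The following is a rough sketch that evaluates each inner integral with the antiderivatives $-y/(x^2+y^2)$ and $x/(x^2+y^2)$ (these correspond to $f(x, y) = (y^2-x^2)/(x^2+y^2)^2$, the integrand for which $U(x)$ is negative) and evaluates each outer integral with the midpoint rule. An even number of steps is assumed so that no sample point lands exactly on $0$, where the inner integrals are undefined. All function names are made up for this illustration.

```javascript
// Approximates the integral of g over [a, b] with the midpoint rule.
function midpointRule(g, a, b, steps) {
  var h = (b - a) / steps;
  var sum = 0;
  for (var i = 0; i < steps; i++) {
    sum += g(a + (i + 0.5) * h);
  }
  return sum * h;
}

// U(x): the inner integral over y in [-1, 1], for x != 0, via the
// antiderivative -y/(x^2+y^2).
function U(x) {
  var F = function(y) { return -y / (x * x + y * y); };
  return F(1) - F(-1);
}

// V(y): the inner integral over x in [-1, 1], for y != 0, via the
// antiderivative x/(x^2+y^2).
function V(y) {
  var G = function(x) { return x / (x * x + y * y); };
  return G(1) - G(-1);
}

// Returns approximations of both iterated integrals over [-1, 1]^2.
// steps must be even so that no midpoint is exactly 0.
function iteratedIntegrals(steps) {
  if (steps % 2 !== 0) {
    throw new RangeError('steps must be even');
  }
  return {
    yThenX: midpointRule(U, -1, 1, steps),
    xThenY: midpointRule(V, -1, 1, steps)
  };
}
```

With a few thousand steps, yThenX should come out close to $-π$ and xThenY close to $π$.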

We can repeat the above calculations for an arbitrary rectangle to see that the iterated integrals of $f(x, y)$ differ if $D$ contains the origin either as an interior point or a corner. But there's an easier way to prove that statement and also gain some insight as to why $f(x, y)$ has this strange property.

Note that the key facts in the above calculations were that $U(x) \lt 0$ for any $x \ne 0$ and $V(y) \gt 0$ for any $y \ne 0$. Therefore, integrating $U(x)$ over any interval on the $x$-axis would produce a negative result and integrating $V(y)$ over any interval on the $y$-axis would produce a positive result, leading to the difference in iterated integrals. This holds more generally; for any $m, n \gt 0$: \[ ∫_{-n}^n f(x, y) \,dy \lt 0 \qquad \text{ and } \qquad ∫_{-m}^m f(x, y) \,dx \gt 0 \text{.} \] Therefore, \[ ∫_{-m}^m \left( ∫_{-n}^n f(x, y) \,dy \right) \,dx \lt 0 \qquad \text{ and } \qquad ∫_{-n}^n \left( ∫_{-m}^m f(x, y) \,dx \right) \,dy \gt 0 \] so the iterated integrals of $f(x, y)$ differ over the rectangles $[-m, m] \times [-n, n]$. Since any rectangle $D$ containing the origin as an interior point must contain some smaller rectangle $E = [-m, m] \times [-n, n]$, the iterated integrals of $f(x, y)$ over $E$ differ and therefore must also differ over $D$.

Furthermore, since $f(x, y)$ is even in both $x$ and $y$, you can carry out a similar argument to the above with intervals of the form $[0, m]$ or $[-m, 0]$ to show that the iterated integrals of $f(x, y)$ must also differ over any rectangle with the origin as a corner.

So the essential property of $f(x, y)$ is that slicing it along the $x$-axis gives a function which has positive area under the curve on any interval symmetric around $0$ or with $0$ as an endpoint, and that slicing it similarly along the $y$-axis gives a function which has negative area. Therefore, on a rectangle symmetric around the origin or with the origin as a corner, one can choose the sign of the iterated integral by choosing which axis to slice first.

The next thing to investigate is how exactly the iterated integrals of $f(x, y)$ over the rectangle $D$ are expressed such that they differ only when $D$ contains the origin, especially considering that $f(x, y)$ is expressed in quite a simple form. To do that, let's consider the simple case of a rectangle $D = [δ, 1] \times [ϵ, 1]$ where we can vary $δ$ and $ϵ$ at will.

Let \[ \begin{aligned} I_{yx}(δ, ϵ) &= ∫_{δ}^1 \left( ∫_{ϵ}^1 f(x, y) \,dy \right) \,dx \\ I_{xy}(δ, ϵ) &= ∫_{ϵ}^1 \left( ∫_{δ}^1 f(x, y) \,dx \right) \,dy \text{.} \end{aligned} \] Then, for $ϵ ≠ 0$: \[ \begin{aligned} I_{yx}(δ, ϵ) &= ∫_{δ}^1 \left( ∫_{ϵ}^1 \frac{y^2-x^2}{(x^2+y^2)^2} \,dy \right) \,dx \\ &= ∫_{δ}^1 \left( \left. -\frac{y}{x^2+y^2} \right|_{y=ϵ}^{y=1} \right) \,dx \\ &= ∫_{δ}^1 \Biggl( -\frac{1}{1+x^2} - \left( -\frac{ϵ}{ϵ^2+x^2} \right) \Biggr) \,dx \\ &= ∫_{δ}^1 \frac{dx/ϵ}{1+(x/ϵ)^2} - ∫_{δ}^1 \frac{dx}{1+x^2} \\ &= \arctan\left(\frac{1}{ϵ}\right) - \arctan\left(\frac{δ}{ϵ}\right) - \frac{π}{4} + \arctan(δ) \text{,} \end{aligned} \] and for $ϵ = 0$: \[ I_{yx}(δ, 0) = -\frac{π}{4} + \arctan(δ) \text{.} \] Similarly, for $δ ≠ 0$: \[ \begin{aligned} I_{xy}(δ, ϵ) &= ∫_{ϵ}^1 \left( ∫_{δ}^1 \frac{y^2-x^2}{(x^2+y^2)^2} \,dx \right) \,dy \\ &= ∫_{ϵ}^1 \left( \left. \frac{x}{x^2+y^2} \right|_{x=δ}^{x=1} \right) \,dy \\ &= ∫_{ϵ}^1 \left( \frac{1}{1+y^2} - \frac{δ}{δ^2+y^2} \right) \,dy \\ &= ∫_{ϵ}^1 \frac{dy}{1+y^2} - ∫_{ϵ}^1 \frac{dy/δ}{1+(y/δ)^2} \\ &= \frac{π}{4} - \arctan(ϵ) - \arctan\left(\frac{1}{δ}\right) + \arctan\left(\frac{ϵ}{δ}\right) \text{,} \end{aligned} \] and for $δ = 0$: \[ I_{xy}(0, ϵ) = \frac{π}{4} - \arctan(ϵ) \text{.} \] Then let $Δ = I_{xy} - I_{yx}$ be the difference between the two iterated integrals. We can use the identity \[ \arctan(x) + \arctan\left(\frac{1}{x}\right) = \frac{π}{2} \sgn(x) \] to simplify $Δ(δ, ϵ)$ if neither $δ$ nor $ϵ$ is zero: \[ \begin{aligned} Δ(δ, ϵ) &= \bigl( π/4 - \arctan(ϵ) - \arctan(1/δ) + \arctan(ϵ/δ) \bigr) \\ & \quad \mathbin{-} \bigl( \arctan(1/ϵ) - \arctan(δ/ϵ) - π/4 + \arctan(δ) \bigr) \\ &= π/2 - \bigl( \arctan(ϵ) + \arctan(1/ϵ) \bigr) \\ & \quad \mathbin{-} \bigl( \arctan(δ) + \arctan(1/δ) \bigr) \\ & \quad \mathbin{+} \bigl( \arctan(δ/ϵ) + \arctan(ϵ/δ) \bigr) \\ &= \frac{π}{2} \bigl( 1 - \sgn(ϵ) - \sgn(δ) + \sgn(δ/ϵ) \bigr) \text{.} \end{aligned} \]

Using the properties of $\sgn(x)$, we can simplify this to the final expression: \[ Δ(δ, ϵ) = \frac{π}{2} \bigl( 1 - \sgn(δ) \bigr) \bigl( 1 - \sgn(ϵ) \bigr) \] which we can prove still holds if either $δ$ or $ϵ$ is zero (or both).
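The claim that the simplified expression agrees with the arctangent expressions, including when $δ$ or $ϵ$ is zero, can be spot-checked numerically. This is a small sketch that transcribes the closed forms for $I_{yx}$ and $I_{xy}$ derived above; the function names are made up for this illustration, and it assumes $δ, ϵ \lt 1$ as in the text.

```javascript
// I_yx(delta, eps): integrate over y first, then over x.
function iYX(delta, eps) {
  if (eps === 0) {
    return -Math.PI / 4 + Math.atan(delta);
  }
  return Math.atan(1 / eps) - Math.atan(delta / eps) -
    Math.PI / 4 + Math.atan(delta);
}

// I_xy(delta, eps): integrate over x first, then over y.
function iXY(delta, eps) {
  if (delta === 0) {
    return Math.PI / 4 - Math.atan(eps);
  }
  return Math.PI / 4 - Math.atan(eps) -
    Math.atan(1 / delta) + Math.atan(eps / delta);
}

// The difference of the iterated integrals, from the arctan forms.
function deltaFromArctans(delta, eps) {
  return iXY(delta, eps) - iYX(delta, eps);
}

// The simplified expression (pi/2) (1 - sgn(delta)) (1 - sgn(eps)).
// Math.sign(0) is 0, which matches the usual convention for sgn.
function deltaFromSigns(delta, eps) {
  return (Math.PI / 2) * (1 - Math.sign(delta)) * (1 - Math.sign(eps));
}
```

Comparing deltaFromArctans and deltaFromSigns over a grid of positive, negative, and zero values of $δ$ and $ϵ$ should give agreement up to floating-point error.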

So with the simplified expression for $Δ(δ, ϵ)$, it becomes apparent how $\sgn(x)$ is used to control the value of $Δ(δ, ϵ)$; as long as either $δ$ or $ϵ$ is positive, $1 - \sgn(x)$ zeroes out the entire expression.


Footnotes

[1] There are actually whole books dedicated to counterexamples. They make good bathroom reading material. ↩

[2] The vector field $(L, M)$ also serves as the canonical “counterexample” to the gradient theorem. ↩

[3] $U(x)$ and $V(y)$ are also (partial) real functions. ↩

[4] We're justified in applying standard integration techniques here since $u_k(y)$ for $k \ne 0$ is defined and bounded for all $y$. ↩

[5] Note that $U(x)$ and $V(y)$ differ only in variable name and sign. ↩

Understanding Evlis Tail Recursion

2011-10-28T00:00:00-07:00

While reading about proper tail recursion in Scheme, I encountered a similar but obscure optimization called evlis tail recursion. In the paper where it was first described, the author claims it can "dramatically improve the space performance of many programs," which sounded promising.

However, the few places where it's mentioned don't do much more than state its definition and claim its usefulness. Hopefully I can provide a more detailed analysis here.

Consider the straightforward factorial implementation in Scheme:^[1]

(define (fact n) (if (<= n 1) 1 (* n (fact (- n 1)))))

It is not tail-recursive, since the recursive call is nested in another procedure call. However, it's almost tail-recursive; the call to * is a tail call, and the recursive call is its last subexpression, so it will be the last subexpression to be evaluated.

Recall what happens when a procedure call (represented as a list of subexpressions) is evaluated: each subexpression is evaluated, and the first result (the procedure) is passed the other results as arguments.^[2]

Evlis tail recursion can be described as follows: when performing a procedure call and during the evaluation of the last subexpression, the calling environment is discarded as soon as it is not required.^[3] The distinction between evlis tail recursion and proper tail recursion is subtle. Proper tail recursion requires only that the calling environment be discarded before the actual procedure call; evlis tail recursion discards the calling environment even sooner, if possible.

An example will help to clarify things. Given fact as defined above, say you evaluate (fact 10) and you're in the procedure call with n = 5. The call stack of a properly tail-recursive interpreter would look like this:

evalExpr
--------
env = { n: 10 } -> <top-level environment>
expr = '(* n (fact (- n 1)))'
proc = <native function: *>
args = [10, <pending evalExpr('(fact (- n 1))', env)>]

evalExpr
--------
env = { n: 9 } -> <top-level environment>
expr = '(* n (fact (- n 1)))'
proc = <native function: *>
args = [9, <pending evalExpr('(fact (- n 1))', env)>]

...

evalExpr
--------
env = { n: 6 } -> <top-level environment>
expr = '(* n (fact (- n 1)))'
proc = <native function: *>
args = [6, <pending evalExpr('(fact (- n 1))', env)>]

evalExpr
--------
env = { n: 5 } -> <top-level environment>
expr = '(if ...)'

whereas the call stack of an evlis tail-recursive interpreter would look like this:

evalExpr
--------
env = { n: 5 } -> <top-level environment>
pendingProcedureCalls = [
  [<native function: *>, 10],
  [<native function: *>, 9],
  ...
  [<native function: *>, 6]
]
expr = '(if ...)'

In this implementation, the last subexpression of a procedure call is evaluated exactly like a tail expression, but the procedure call and non-last subexpressions are pushed onto a stack. Whenever an expression is reduced to a simple one and the stack is non-empty, a pending procedure call with its other args is popped off, and it is called with the reduced expression as the final argument.
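The push-and-pop scheme can be mimicked directly for the factorial example. The following is a toy sketch in Javascript, not an actual Scheme interpreter: it hard-codes the structure of fact, keeps a single current environment, and stores only the pending procedure and its already-evaluated first argument for each level. The name factWithPendingCalls is made up for this illustration.

```javascript
// Stand-in for the native function *.
function multiply(a, b) {
  return a * b;
}

// Evaluates (fact n) while keeping only one environment alive at a
// time, plus a stack of pending procedure calls.
function factWithPendingCalls(n) {
  var pendingProcedureCalls = [];
  var env = { n: n };

  // Each iteration corresponds to evaluating the body of fact in env.
  while (env.n > 1) {
    // (* n (fact (- n 1))): save * and the evaluated first argument,
    // then evaluate the last subexpression like a tail call.
    pendingProcedureCalls.push([multiply, env.n]);
    // The old environment is replaced, so it becomes unreachable.
    env = { n: env.n - 1 };
  }

  // The expression has been reduced to the simple value 1; pop each
  // pending call and apply it with the reduced value as the final
  // argument.
  var result = 1;
  while (pendingProcedureCalls.length > 0) {
    var call = pendingProcedureCalls.pop();
    var proc = call[0];
    var firstArg = call[1];
    result = proc(firstArg, result);
  }
  return result;
}
```

Midway through factWithPendingCalls(10), when env is { n: 5 }, the stack holds the entries for 10 down to 6, matching the diagram above.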

Note that this didn't change the asymptotic behavior of the procedure; it still takes $Θ(n)$ memory to evaluate. However, only the bare minimum of information is saved: the list of pending functions and their arguments. Other auxiliary variables, and crucially the nested calling environments, aren't preserved, leading to a significant constant-factor reduction in memory.

This raises the question: Are there cases where evlis tail recursion leads to better asymptotic behavior? In fact, yes; consider the following (contrived) implementation of factorial^[4]:

(define (fact2 n)
  (define v (make-vector n))
  (if (<= n 1) 1 (* n (fact2 (- n 1)))))

Before the main body of the function, a vector of size $n$ is defined. This means that the environments in the call stack of a properly tail-recursive interpreter would look like this:^[5]

env = { n: 10, v: <vector of size 10> } -> <top-level environment>
env = { n: 9, v: <vector of size 9> } -> <top-level environment>
env = { n: 8, v: <vector of size 8> } -> <top-level environment>
env = { n: 7, v: <vector of size 7> } -> <top-level environment>
...

whereas an evlis tail-recursive interpreter would keep around only the current environment. Therefore, the properly tail-recursive interpreter would require $Θ(n^2)$ memory to evaluate (fact2 n), while the evlis tail-recursive interpreter would require only $Θ(n)$.
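
The arithmetic behind those two bounds can be checked with a toy model. The Python sketch below assumes that the recursion bottoms out at n = 1, that a vector of size k costs k cells, and that a pending call costs one cell; the function name peak_cells is hypothetical.

```python
def peak_cells(n):
    """Return (proper, evlis): peak retained cells for (fact2 n)."""
    proper = 0
    evlis = 0
    live_vectors = 0   # proper: every caller's vector stays reachable
    pending_calls = 0  # evlis: one pending [*, k] entry per caller
    for k in range(n, 0, -1):
        # Entering the call with n = k allocates a vector of size k.
        live_vectors += k
        proper = max(proper, live_vectors)
        # Under evlis tail recursion only the current vector is live,
        # plus the stack of pending calls.
        evlis = max(evlis, k + pending_calls)
        pending_calls += 1
    return proper, evlis


print(peak_cells(10))   # (55, 10)
print(peak_cells(100))  # (5050, 100)
```

In this model the first number grows like n(n+1)/2 and the second stays at n.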

Studying examples like the one above enabled me to finally understand how evlis tail recursion works and what sort of savings it gives. However, I have yet to find a practical example where evlis tail recursion gives the same sort of asymptotic gains as described above, and I'd be interested to receive some. But perhaps the "large gains" mentioned in the various papers describing it are only constant-factor reductions in memory.

In any case, another important difference in Scheme between proper tail recursion and evlis tail recursion is that the former is a language feature and the latter is an optimization. That means that it is acceptable and even encouraged to write Scheme programs that take advantage of proper tail recursion, but it would be unwise to rely on evlis tail recursion for the asymptotic performance of your function. Instead, one should treat it just as a nice constant-factor speed gain.

Note that it is easy to make evlis tail recursion "smarter." Since Scheme doesn't specify the order of argument evaluation, an interpreter could choose an evaluation order that maximizes the gains from evlis tail recursion. As an easy example, if we had switched the arguments to * in fact above, making it non-evlis-tail-recursive under left-to-right evaluation, a smart implementation could still treat it as such. A possible rule of thumb would be to pick a non-trivial function call to evaluate last.

To complete the picture, I will outline below the evaluation function for a simple evlis tail-recursive Scheme interpreter in JavaScript. All of the sources I've found describe it in terms of compilers, so I think it'll be useful to have a reference implementation for an interpreter.

Let's say we already have a properly tail-recursive interpreter:^[6]

function evalExpr(expr, env) {
  // Fake tail calls with a while loop and continue.
  while (true) {
    // Symbols, constants, quoted expressions, and lambdas.
    if (isSimpleExpr(expr)) {
      // The only exit point.
      return evalSimpleExpr(expr, env);
    }
    // (if test conseq alt)
    if (isSpecialForm(expr, 'if')) {
      expr = evalExpr(expr[1], env) ? expr[2] : expr[3];
      continue;
    }
    // (set! var expr)
    if (isSpecialForm(expr, 'set!')) {
      env.set(expr[1], evalExpr(expr[2], env));
      expr = null;
      continue;
    }
    // (define var expr?)
    if (isSpecialForm(expr, 'define')) {
      env.define(expr[1], evalExpr(expr[2], env));
      expr = null;
      continue;
    }
    // (begin expr*)
    if (isSpecialForm(expr, 'begin')) {
      if (expr.length == 1) {
        expr = null;
        continue;
      }
      // Evaluate all but the last subexpression.
      for (var i = 1; i < expr.length - 1; ++i) {
        evalExpr(expr[i], env);
      }
      expr = expr[expr.length - 1];
      continue;
    }
    // (proc expr*)
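    // Use expr[0] and slice() instead of shift() so that expr (which
    // may be a procedure body that gets evaluated again) isn't mutated.
    var proc = evalExpr(expr[0], env);
    var args = expr.slice(1).map(function(subExpr) { return evalExpr(subExpr, env); });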
    // proc.run() returns its body in result.expr and the environment
    // in which to evaluate it (with all its arguments bound) in
    // result.env.
    var result = proc.run(args);
    expr = result.expr;
    // The only time when env is changed.
    env = result.env;
    continue;
  }
}

Then implementing evlis tail recursion requires only a few changes:

function evalExpr(expr, env) {
  // This is a stack of procedures and their non-final arguments that
  // are waiting for their final argument to be evaluated.
  var pendingProcedureCalls = [];
  while (true) {
    if (isSimpleExpr(expr)) {
      expr = evalSimpleExpr(expr, env);
      // Discard calling environment.
      env = null;
      if (pendingProcedureCalls.length == 0) {
        // No pending procedure calls, so we're done (the only exit
        // point).
        return expr;
      }
      var args = pendingProcedureCalls.pop();
      var proc = args.shift();
      args.push(expr);
      var result = proc.run(args);
      expr = result.expr;
      // Change to new environment (the only time when env is
      // changed).
      env = result.env;
      continue;
    }
    ...
    // Everything else remains the same.
    ...
    // (proc expr*)
    var nonFinalSubExprs =
      expr.slice(0, -1).map(
        function(subExpr) { return evalExpr(subExpr, env); });
    pendingProcedureCalls.push(nonFinalSubExprs);
    // Evaluate the last subexpression as a tail call.
    expr = expr[expr.length - 1];
    continue;
  }
}


Footnotes

[1] Assume a left-to-right evaluation order for now. ↩

[2] The function that takes a list of expressions, evaluates them, and returns the results as a list is traditionally called evlis, hence the name of the optimization. ↩

[3] This assumes that the calling environment isn't stored somewhere else. ↩

[4] This was adapted from an example in Proper Tail Recursion and Space Efficiency. ↩

[5] Assume that the interpreter isn't smart enough to deduce that $v$ can be optimized out since it's never used. ↩

[6] Adapted from Peter Norvig's excellent lis.py. ↩

An Elementary Way to Calculate the Gaussian Integral

2011-01-06T00:00:00-08:00

While reading Timothy Gowers's blog I stumbled on Scott Carnahan's comment describing an elegant calculation of the Gaussian integral \[ ∫_{-∞}^{∞} e^{-x^2} \, dx = \sqrt{π}\text{.} \] I was so struck by its elementary character that I imagined what it would be like written up, say, as an extra credit exercise in a single-variable calculus class:

Exercise 1. (The Gaussian integral.) Let \[ F(t) = ∫_0^t e^{-x^2} \, dx \text{, }\qquad G(t) = ∫_0^1 \frac{e^{-t^2 (1+x^2)}}{1+x^2} \, dx \text{,} \] and $H(t) = F(t)^2 + G(t)$.

a. Calculate $H(0)$.
b. Calculate and simplify $H'(t)$. What does this imply about $H(t)$?
c. Use part b to calculate $F(∞) = \displaystyle\lim_{t \to ∞} F(t)$.
d. Use part c to calculate \[ ∫_{-∞}^{∞} e^{-x^2} \, dx\text{.} \]

Although this is simpler than the usual calculation of the Gaussian integral, for which careful reasoning is needed to justify the use of polar coordinates, it seems more like a certificate than an actual proof; you can convince yourself that the calculation is valid, but you gain no insight into the reasoning that led up to it.^[1]
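
One way to convince yourself numerically is the Python sketch below, which evaluates $F$, $G$, and $H$ from Exercise 1 with composite Simpson's rule. The helper name simpson and the choice of 2000 subintervals are assumptions made for illustration. Be warned that the printed values give away part of the answer.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3


def F(t):
    return simpson(lambda x: math.exp(-x * x), 0.0, t)


def G(t):
    return simpson(
        lambda x: math.exp(-t * t * (1 + x * x)) / (1 + x * x), 0.0, 1.0)


def H(t):
    return F(t) ** 2 + G(t)


# If H'(t) = 0, these values should all agree up to numerical error.
for t in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(t, H(t))
```

Of course, this is still only a certificate: it confirms that $H$ is constant without explaining where $G$ came from.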

Fortunately, David Speyer's comment solves the mystery; $G(t)$ falls out of doing the integration in Cartesian coordinates over a triangular region. Just for kicks, here's what I imagine an exercise based on this method would look like (this time for a multi-variable calculus class):

Exercise 2. (The Gaussian integral in Cartesian coordinates.) Let \[ A(t) = ∬\limits_{\triangle_t} e^{-(x^2+y^2)} \, dx \, dy \] where $\triangle_t$ is the triangle with vertices $(0, 0)$, $(t, 0)$, and $(t, t)$.

a. Use the substitution $y = sx$ to reduce $A(t)$ to a one-dimensional integral.
b. Use part a to calculate $A(∞) = \lim_{t \to ∞} A(t)$.
c. Use part b to calculate \[ ∫_{-∞}^{∞} e^{-x^2} \, dx\text{.} \]
d. Let \[ F(t) = ∫_0^t e^{-x^2} \, dx \qquad\text{ and }\qquad G(t) = ∫_0^1 \frac{e^{-t^2 (1+x^2)}}{1+x^2} \, dx \text{.} \] Use part a to relate $F(t)$ to $G(t)$.
e. Use part d to derive a proof of part c using only single-variable calculus.


Footnotes

[1] Similar to proving $\sum\limits_{i=0}^n i^3 = \frac{n^2(n+1)^2}{4}$ by induction. ↩

Parallelizing FLAC Encoding

2008-05-05T00:00:00-07:00

One thing I have noticed ever since getting a multi-core system is that the reference FLAC encoder is not multi-threaded. This isn't a huge problem for most people, as you can simply encode multiple files at the same time, but I usually rip my audio CDs into a single audio file with a cue sheet instead of separate track files, so I am usually encoding a single large audio file instead of multiple smaller ones. Even so, encoding a CD-length audio file takes under a minute, but I thought it would be a fun and useful weekend project to see if I could parallelize the simpler example encoder. The format specification indicates that input blocks are encoded independently, which makes the problem embarrassingly parallel, and trawling through the FLAC mailing lists reveals that no one has had the time or the inclination to look into it.

However, I was able to write a multithreaded FLAC encoder that achieves near-linear speedup with only minor hacks to the libFLAC API. Here are some encode times on an 8-core 2.8 GHz Xeon 5400 for a 636 MB wave file (some caveats are discussed below):

baseline	34.906s
1 thread	31.424s
2 threads	16.936s
4 threads	10.173s
8 threads	6.808s

I took the simple approach of sharding the input file into n roughly equal pieces and passing them to n encoder threads, assembling the output file from the n output buffers. In general this is not a good way of partitioning the workload as time is wasted if one shard takes significantly more time to process but for my use case this isn't a significant problem.
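
As a rough illustration of this sharding, here is a Python sketch that computes the byte range handed to each thread. The function name shard_ranges and the align parameter are hypothetical; a real encoder would presumably need shards to start on sample (or FLAC block) boundaries, which is what align stands in for here.

```python
def shard_ranges(data_size, num_threads, align=4):
    """Split data_size bytes into num_threads contiguous shards.

    Returns a list of (offset, size) pairs. Every shard except the
    last is a multiple of align bytes; the last one takes whatever
    remains, so it may be slightly larger.
    """
    bytes_per_thread = (data_size // num_threads) // align * align
    ranges = []
    for i in range(num_threads):
        offset = i * bytes_per_thread
        if i < num_threads - 1:
            size = bytes_per_thread
        else:
            size = data_size - offset
        ranges.append((offset, size))
    return ranges


print(shard_ranges(1003, 4))
# [(0, 248), (248, 248), (496, 248), (744, 259)]
```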

The best way to share the input file among the encoding threads is to map it into memory. In fact, memory-mapped file I/O has so many advantages in general that I'm surprised at how little I see it used, although it does have the disadvantage of requiring a bit more bookkeeping. Here is how I use it in my multithreaded encoder (slightly paraphrased):

#include <fcntl.h> /* open() */
#include <string.h> /* memcmp() */
#include <sys/mman.h> /* mmap()/munmap() */
#include <sys/stat.h> /* fstat() */
#include <unistd.h> /* close() */

int main(int argc, char *argv[]) {
  int fdin;
  struct stat buf;
  char *bufin;

  fdin = open(argv[1], O_RDONLY);
  fstat(fdin, &buf);
  bufin = mmap(NULL, buf.st_size, PROT_READ, MAP_SHARED, fdin, 0);

  /* The input file (passed in via argv[1]) is now mapped read-only to
     the memory region in bufin up to bufin + buf.st_size. */

  /* Note that you can work directly with the mapped input file
     instead of fread()ing the header into a buffer. */
  if((buf.st_size < WAV_HEADER_SIZE) ||
     memcmp(bufin, "RIFF", 4) ||
     memcmp(bufin+8, "WAVEfmt \020\000\000\000\001\000\002\000", 16) ||
     memcmp(bufin+32, "\004\000\020\000data", 8)) {
    /* Invalid input file: print error and exit. */
  }

  for (i = 0; i < num_threads; ++i) {
    shard_infos[i].bufin = bufin + WAV_HEADER_SIZE + i * bytes_per_thread;
    /* bufsize for the last thread may be slightly larger. */
    shard_infos[i].bufsize = bytes_per_thread;
  }

  /* Spawn encode threads (which calls encode_shard() below) passing
     an element of shard_infos to each. */

  ...

  munmap(bufin, buf.st_size);
  close(fdin);
}

FLAC__bool encode_shard(struct shard_info *shard_info) {
  FLAC__StreamEncoder *encoder = FLAC__stream_encoder_new();

  ...

  /* The input file is paged in lazily as this function accesses
     bufin from shard_info->bufin. */
  FLAC__stream_encoder_process_interleaved(encoder,
                                           shard_info->bufin,
                                           shard_info->bufsize);

  ...

  FLAC__stream_encoder_delete(encoder);
}
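
For comparison, the same idea of checking the header directly in the mapping can be sketched with Python's mmap module. This is an illustration rather than part of the encoder: the function name has_cd_wav_header is hypothetical, and WAV_HEADER_SIZE = 44 is an assumption (the usual size of a plain PCM WAV header).

```python
import mmap
import os
import struct
import tempfile

WAV_HEADER_SIZE = 44  # assumed: canonical PCM WAV header length


def has_cd_wav_header(path):
    """Return True if path starts with a 16-bit stereo PCM WAV header."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size < WAV_HEADER_SIZE:
            return False
        # Map the whole file read-only; pages are loaded lazily.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            return (
                buf[0:4] == b"RIFF"
                # "WAVEfmt ", chunk size 16, PCM format 1, 2 channels.
                and buf[8:24] == b"WAVEfmt " + struct.pack("<IHH", 16, 1, 2)
                # Block align 4, 16 bits per sample, then "data".
                and buf[32:40] == struct.pack("<HH", 4, 16) + b"data"
            )


# Build a tiny valid file and check it.
header = (b"RIFF" + struct.pack("<I", 36 + 16) + b"WAVEfmt "
          + struct.pack("<IHHIIHH", 16, 1, 2, 44100, 176400, 4, 16)
          + b"data" + struct.pack("<I", 16))
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.wav")
    with open(path, "wb") as f:
        f.write(header + b"\x00" * 16)
    header_ok = has_cd_wav_header(path)
print(header_ok)
```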

However, handling the output file is a bit trickier. Since the encoded FLAC data output by the threads varies in size, we have to wait until all encoding threads are done before we know the right offsets at which to write the output data. A convenient and fast way to handle this is to use asynchronous I/O; we know where to write the output data for a shard as soon as the encoding threads for all previous shards finish, so we simply wait for the encoding threads in shard order and queue up a write request after each thread finishes. Here is how I use the POSIX asynchronous I/O API in my multithreaded encoder (again, slightly paraphrased):

#include <aio.h> /* aio_*() */
#include <pthread.h> /* pthread_*() */
#include <string.h> /* memset() */

int main(int argc, char *argv[]) {
  int fdout;
  pthread_t threads[MAX_THREADS];
  struct aiocb aiocbs[MAX_THREADS];
  unsigned long byte_offset = 0;

  /* mmap input file in. */

  ...

  /* O_CREAT requires a mode argument. */
  fdout = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);

  /* Spawn encode threads passing an element of shard_infos to
     each. */

  ...

  /* Wait for each thread in sequence and queue up output writes. */

  /* We need to zero out any aiocb struct that we use before we fill
     in any members. */
  memset(aiocbs, 0, num_threads * sizeof(*aiocbs));
  for (i = 0; i < num_threads; ++i) {
    pthread_join(threads[i], NULL);
    aiocbs[i].aio_buf = shard_infos[i].bufout;
    aiocbs[i].aio_nbytes = shard_infos[i].bytes_written;
    aiocbs[i].aio_offset = byte_offset;
    aiocbs[i].aio_fildes = fdout;
    aio_write(&aiocbs[i]);
    byte_offset += shard_infos[i].bytes_written;
  }

  /* Wait for all output writes to finish. */

  for (i = 0; i < num_threads; ++i) {
    const struct aiocb *aiocbp = &aiocbs[i];
    aio_suspend(&aiocbp, 1, NULL);
    aio_return(&aiocbs[i]);
  }

  close(fdout);
}

The POSIX API is a bit unwieldy for this use case; ideally, there would be a version of aio_suspend() that would suspend the calling process until all of the specified requests have completed. As it is now the simplest way is to loop through the requests as above, especially since the maximum number of simultaneous asynchronous I/O requests is usually quite small (16 on my system).

Also, I found that the OS X implementation of aio_write() did not obey this part of the specified behavior:

  If O_APPEND is set for aiocbp->aio_fildes, aio_write() operations append
  to the file in the same order as the calls were made.  If O_APPEND is not
  set for the file descriptor, the write operation will occur at the
  absolute position from the beginning of the file plus aiocbp->aio_offset.

but it was just as easy (and clearer) to explicitly set the correct offset.
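
The offset bookkeeping can be sketched in Python as well. Python's standard library has no POSIX AIO bindings, so this sketch swaps in a thread pool and os.pwrite() (Unix only) for aio_write(), and uses zlib.compress() as a stand-in for FLAC encoding; the names encode_shard and encode_to_file are hypothetical.

```python
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor


def encode_shard(shard):
    # Stand-in for FLAC encoding: the output size varies per shard.
    return zlib.compress(shard)


def encode_to_file(shards, path, num_threads=4):
    """Encode shards in parallel and write the outputs back to back."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            encodes = [pool.submit(encode_shard, s) for s in shards]
            writes = []
            byte_offset = 0
            # Wait for the encodes in shard order; once shard i is
            # done, its output offset is known, so queue its write.
            for fut in encodes:
                out = fut.result()
                writes.append(pool.submit(os.pwrite, fd, out, byte_offset))
                byte_offset += len(out)
            # Wait for all queued writes to finish.
            for w in writes:
                w.result()
    finally:
        os.close(fd)
    return byte_offset


shards = [bytes([i]) * (1000 * (i + 1)) for i in range(4)]
expected = b"".join(zlib.compress(s) for s in shards)
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.bin")
    total = encode_to_file(shards, out_path)
    with open(out_path, "rb") as f:
        written = f.read()
print(total == len(expected), written == expected)
```

As in the C version, every write carries an explicit offset, so the result doesn't depend on the order in which the writes actually complete.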

I had to hack up libFLAC a bit to implement my multithreaded encoder. I exposed update_metadata_() to make it easy to write the correct total number of samples in the metadata field and also to zero out the min/max framesize fields. I also exposed the FLAC__stream_encoder_set_do_md5() function (which should have been exposed in the first place) so that I can turn off the writing of the MD5 field in the metadata. Finally, I added the function FLAC__stream_encoder_set_current_frame_number() so that the correct frame numbers are written at encode time.

For comparison purposes I turn off MD5 calculation in my multithreaded encoder as well as in the baseline one. Since calling FLAC__stream_encoder_set_current_frame_number() causes crashes with verification turned on, I also turn that off. The numbers above reflect that, so they underestimate how long a production multithreaded encoder would take. However, the essential behavior of the program shouldn't change much.

Here is a patch file for the flac 1.2.1 source that implements the hacks I described above. Here is the source for my multithreaded FLAC encoder. I've tested it with i686-apple-darwin9-gcc-4.0.1 and i686-apple-darwin9-gcc-4.2.1 on Mac OS X. I got the above numbers compiling mt_encode.c with gcc 4.2.1 and the switches -Wall -Werror -g -O2 -ansi.


bfpp

2008-04-23T00:00:00-07:00

Okay, I lied; you can't really embed brainfuck in C++ but you can get pretty close. Here is an example:

#include "bfpp.h"

int main() {
  // Prints out factorial numbers in sequence.  Adapted from
  // http://www.hevanet.com/cristofd/brainfuck/factorial.b .
  bfpp
    * + + + + + + + + + + * * * + * + -- * * * + -- - -- & & & & & -- +
    & & & & & ++ * * -- -- - ++ * -- & & + * + * - ++ & -- * + & - ++ &
    -- * + & - -- * + & - -- * + & - -- * + & - -- * + & - -- * + & - --
    * + & - -- * + & - -- * + & - -- * -- - ++ * * * * + * + & & & & & &
    - -- * + & - ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ * -- & + * - ++ + * *
    * * * ++ & & & & & -- & & & & & ++ * * * * * * * -- * * * * * ++ + +
    -- - & & & & & ++ * * * * * * - ++ + * * * * * ++ & -- * + + & - ++
    & & & & -- & -- * + & - ++ & & & & ++ * * -- - * -- - ++ + + + + + +
    -- & + + + + + + + + * - ++ * * * * ++ & & & & & -- & -- * + * + & &
    - ++ * ! & & & & & ++ * ! * * * * ++ 
  end_bfpp
}

I call this variant “bfpp” as it has some pretty significant differences from brainfuck. First of all, some commands had to be adapted; although + and - remain the same,

< and > were changed to & and *,
. and , were changed to ! and ~ (mnemonic: ! contains . within it and ~ is kind of like a sideways ,),
and [ and ] were changed to -- and ++ (mnemonic: [ and ] are the most complex brainfuck commands [to implement, at least] and so deserve to be mapped to the wider and more prominent operators).

This magic is made possible by the fact that brainfuck has exactly eight commands and C++ has exactly eight overloadable symbolic unary operators. Add some macros to hide the C++ scaffolding behind some delimiters and you have a convincing illusion of an embedded language.
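
To make the command mapping concrete, here is a small Python sketch that runs a bfpp program given as a whitespace-separated string of tokens. It only illustrates the mapping; it says nothing about how bfpp.h works, which relies on C++ operator overloading instead. The tape size, the 8-bit wrapping cells, and reading 0 at end of input are conventional assumptions, and run_bfpp is a hypothetical name.

```python
BFPP_TO_BF = {"+": "+", "-": "-", "&": "<", "*": ">",
              "!": ".", "~": ",", "--": "[", "++": "]"}


def run_bfpp(source, input_bytes=b""):
    """Run a whitespace-separated bfpp token string; return its output."""
    program = [BFPP_TO_BF[tok] for tok in source.split()]

    # Precompute the matching position for every bracket.
    jump, stack = {}, []
    for pos, cmd in enumerate(program):
        if cmd == "[":
            stack.append(pos)
        elif cmd == "]":
            start = stack.pop()
            jump[start], jump[pos] = pos, start

    tape, ptr, pc = [0] * 30000, 0, 0
    inp = iter(input_bytes)
    out = bytearray()
    while pc < len(program):
        cmd = program[pc]
        if cmd == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif cmd == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif cmd == ">":
            ptr += 1
        elif cmd == "<":
            ptr -= 1
        elif cmd == ".":
            out.append(tape[ptr])
        elif cmd == ",":
            tape[ptr] = next(inp, 0)
        elif cmd == "[" and tape[ptr] == 0:
            pc = jump[pc]
        elif cmd == "]" and tape[ptr] != 0:
            pc = jump[pc]
        pc += 1
    return bytes(out)


# 8 * 8 + 1 = 65, the code for "A".
print(run_bfpp("+ + + + + + + + -- * + + + + + + + + & - ++ * + !"))
```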

bfpp.h implements a simple (<100 lines) bfpp interpreter and the magic described above, and bf2bfpp.c is a straightforward translator from brainfuck to bfpp. Gotta love C++!


Finding the Longest Palindromic Substring in Linear Time

2007-11-28T00:00:00-08:00

Another interesting problem I stumbled across on reddit is finding the longest substring of a given string that is a palindrome. I found the explanation on Johan Jeuring's blog somewhat confusing and I had to spend some time poring over the Haskell code (eventually rewriting it in Python) and walking through examples before it "clicked." I haven't found any other explanations of the same approach so hopefully my explanation below will help the next person who is curious about this problem.

Of course, the most naive solution would be to exhaustively examine all $n \choose 2$ substrings of the given $n$-length string, test whether each one is a palindrome, and keep track of the longest one seen so far. This has complexity $O(n^3)$, but we can easily do better by realizing that a palindrome is centered on either a letter (for odd-length palindromes) or a space between letters (for even-length palindromes). Therefore we can examine all $2n + 1$ possible centers and find the longest palindrome for each center, keeping track of the overall longest palindrome. This has complexity $O(n^2)$.
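
A minimal Python sketch of the quadratic approach follows. The function name and the center numbering (even indices for spaces, odd indices for letters) are choices made for this illustration.

```python
def longest_palindrome_quadratic(s):
    """Return a longest palindromic substring by trying all centers."""
    n = len(s)
    best_start, best_len = 0, 0
    for center in range(2 * n + 1):
        # Even centers are spaces between letters (starting from the
        # empty palindrome); odd centers are letters (starting from a
        # single letter). lo and hi are the next letters to compare.
        lo = center // 2 - 1
        hi = center // 2 + center % 2
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        length = hi - lo - 1
        if length > best_len:
            best_start, best_len = lo + 1, length
    return s[best_start:best_start + best_len]


print(longest_palindrome_quadratic("xabbay"))   # abba
print(longest_palindrome_quadratic("abababa"))  # abababa
```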

It is not immediately clear that we can do better, but if we're told that a $Θ(n)$ algorithm exists we can infer that the algorithm is most likely structured as an iteration through all possible centers. As an off-the-cuff first attempt, we can adapt the above algorithm by keeping track of the current center and expanding until we find the longest palindrome around that center, in which case we then consider the last letter (or space) of that palindrome as the new center. The algorithm (which isn't correct) looks like this informally:

1. Set the current center to the first letter.
2. Loop while the current center is valid:
   a. Expand to the left and right simultaneously until we find the largest palindrome around this center.
   b. If the current palindrome is bigger than the stored maximum one, store the current one as the maximum one.
   c. Set the space following the current palindrome as the current center unless the two letters immediately surrounding it are different, in which case set the last letter of the current palindrome as the current center.
3. Return the stored maximum palindrome.

This seems to work but it doesn't handle all cases: consider the string "abababa". The first non-trivial palindrome we see is "a|bababa", followed by "aba|baba". Considering the current space as the center doesn't get us anywhere but considering the preceding letter (the second 'a') as the center, we can expand to get "ababa|ba". From this state, considering the current space again doesn't get us anywhere but considering the preceding letter as the center, we can expand to get "abababa|". However, this is incorrect as the longest palindrome is actually the entire string! We can remedy this case by changing the algorithm to try and set the new center to be one before the end of the last palindrome, but it is clear that having a fixed "lookbehind" doesn't solve the general case and anything more than that will probably bump us back up to quadratic time.

The key question is this: given the state from the example above, "ababa|ba", what makes the second 'b' so special that it should be the new center? To use another example, in "abcbabcba|bcba", what makes the second 'c' so special that it should be the new center? Hopefully, the answer to this question will lead to the answer to the more important question: once we stop expanding the palindrome around the current center, how do we pick the next center? To answer the first question, first notice that the current palindromes in the above examples themselves contain smaller non-trivial palindromes: "ababa" contains "aba" and "abcbabcba" contains "abcba" which also contains "bcb". Then, notice that if we expand around the "special" letters, we get a palindrome which shares a right edge with the current palindrome; that is, the longest palindromes around the special letters are proper suffixes of the current palindrome. With a little thought, we can then answer the second question: to pick the next center, take the center of the longest palindromic proper suffix of the current palindrome. Our algorithm then looks like this:

1. Set the current center to the first letter.
2. Loop while the current center is valid:
   a. Expand to the left and right simultaneously until we find the largest palindrome around this center.
   b. If the current palindrome is bigger than the stored maximum one, store the current one as the maximum one.
   c. Find the maximal palindromic proper suffix of the current palindrome.
   d. Set the center of the suffix from c as the current center and start expanding from the suffix as it is palindromic.
3. Return the stored maximum palindrome.

However, unless step 2c can be done efficiently, it will cause the algorithm to be superlinear. Doing step 2c efficiently seems impossible since we have to examine the entire current palindrome to find the longest palindromic suffix unless we somehow keep track of extra state as we progress through the input string. Notice that the longest palindromic suffix would by definition also be a palindrome of the input string so it might suffice to keep track of every palindrome that we see as we move through the string and hopefully, by the time we finish expanding around a given center, we would know where all the palindromes with centers lying to the left of the current one are. However, if the longest palindromic suffix has a center to the right of the current center, we would not know about it. But we also have at our disposal the very useful fact that a palindromic proper suffix of a palindrome has a corresponding dual palindromic proper prefix. For example, in one of our examples above, "abcbabcba", notice that "abcba" appears twice: once as a prefix and once as a suffix. Therefore, while we wouldn't know about all the palindromic suffixes of our current palindrome, we would know about either it or its dual.

Another crucial realization is the fact that we don't have to keep track of all the palindromes we've seen. To use the example "abcbabcba" again, we don't really care about "bcb" that much, since it's already contained in the palindrome "abcba". In fact, we only really care about keeping track of the longest palindromes for a given center or equivalently, the length of the longest palindrome for a given center. But this is simply a more general version of our original problem, which is to find the longest palindrome around any center! Thus, if we can keep track of this state efficiently, maybe by taking advantage of the properties of palindromes, we don't have to keep track of the maximal palindrome and can instead figure it out at the very end.

Unfortunately, we seem to be back where we started; the second naive algorithm that we have is simply to loop through all possible centers and for each one find the longest palindrome around that center. But our discussion has led us to a different incremental formulation: given a current center, the longest palindrome around that center, and a list of the lengths of the longest palindromes around the centers to the left of the current center, can we figure out the new center to consider and extend the list of longest palindrome lengths up to that center efficiently? For example, if we have the state:

<"ababa|??", [0, 1, 0, 3, 0, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?]>

where the highlighted letter is the current center, the vertical line is our current position, the question marks represent unread characters or unknown quantities, and the array represents the list of longest palindrome lengths by center, can we get to the state:

<"ababa|??", [0, 1, 0, 3, 0, 5, 0, ?, ?, ?, ?, ?, ?, ?, ?]>

and then to:

<"abababa|", [0, 1, 0, 3, 0, 5, 0, 7, 0, 5, 0, 3, 0, 1, 0]>

efficiently? The crucial thing to notice is that the longest palindrome lengths array (we'll call it simply the lengths array) in the final state is palindromic since the original string is palindromic. In fact, the lengths array obeys a more general property: the longest palindrome d places to the right of the current center (the d-right palindrome) is at least as long as the longest palindrome d places to the left of the current center (the d-left palindrome) if the d-left palindrome is completely contained in the longest palindrome around the current center (the center palindrome), and it is of equal length if the d-left palindrome is not a prefix of the center palindrome or if the center palindrome is a suffix of the entire string. This then implies that we can more or less fill in the values to the right of the current center from the values to the left of the current center. For example, from [0, 1, 0, 3, 0, 5, ?, ?, ?, ?, ?, ?, ?, ?, ?] we can get to [0, 1, 0, 3, 0, 5, 0, ≥3?, 0, ≥1?, 0, ?, ?, ?, ?]. This also implies that the first unknown entry (in this case, ≥3?) should be the new center because it means that the center palindrome is not a suffix of the input string (i.e., we're not done) and that the d-left palindrome is a prefix of the center palindrome.
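
The "at least as long" part of this property can be checked by brute force. In the Python sketch below, the lengths array is indexed by center as in the states above, and the bound is written as l[c + d] >= min(l[c - d], l[c] - d), where l[c] - d is how much room is left between the d-right center and the right edge of the center palindrome. The function names are hypothetical, and this is a sanity check on a few strings, not a proof.

```python
def naive_lengths(s):
    """Longest palindrome length around each of the 2n + 1 centers."""
    n = len(s)
    lengths = []
    for c in range(2 * n + 1):
        lo, hi = c // 2 - 1, c // 2 + c % 2
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo, hi = lo - 1, hi + 1
        lengths.append(hi - lo - 1)
    return lengths


def mirror_bound_holds(s):
    """Check the mirror bound for every center c and distance d."""
    l = naive_lengths(s)
    for c in range(len(l)):
        # Centers within l[c] places of c lie inside the center
        # palindrome, so c - d and c + d are always valid indices.
        for d in range(1, l[c] + 1):
            if l[c + d] < min(l[c - d], l[c] - d):
                return False
    return True


print(naive_lengths("abababa"))
# [0, 1, 0, 3, 0, 5, 0, 7, 0, 5, 0, 3, 0, 1, 0]
print(mirror_bound_holds("abcbabcbabcba"))  # True
```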

From these observations we can construct our final algorithm which returns the lengths array, and from which it is easy to find the longest palindromic substring:

1. Initialize the lengths array with one (unknown) entry per possible center.
2. Set the current center to the first center.
3. Loop while the current center is valid:
   a. Expand to the left and right simultaneously until we find the largest palindrome around this center.
   b. Fill in the appropriate entry in the longest palindrome lengths array.
   c. Iterate through the longest palindrome lengths array backwards and fill in the corresponding values to the right of the entry for the current center until an unknown value (as described above) is encountered.
   d. Set the new center to the index of this unknown value.
4. Return the lengths array.

Note that at each step of the algorithm we're either incrementing our current position in the input string or filling in an entry in the lengths array. Since the lengths array has size linear in the size of the input string, the algorithm has worst-case linear running time. Since given the lengths array we can find and return the longest palindromic substring in linear time, a linear-time algorithm to find the longest palindromic substring is the composition of these two operations.
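
The second of those two operations is short enough to sketch directly. Assuming the lengths array is indexed by center as in the states above (odd indices for letters, even indices for spaces), the palindrome for entry i starts at i // 2 - l[i] // 2. The function name longest_from_lengths is hypothetical.

```python
def longest_from_lengths(seq, lengths):
    """Return the longest palindromic substring given the lengths array."""
    # One linear pass to find the center with the longest palindrome.
    best = max(range(len(lengths)), key=lengths.__getitem__)
    start = best // 2 - lengths[best] // 2
    return seq[start:start + lengths[best]]


print(longest_from_lengths(
    "abababa", [0, 1, 0, 3, 0, 5, 0, 7, 0, 5, 0, 3, 0, 1, 0]))  # abababa
print(longest_from_lengths("abbc", [0, 1, 0, 1, 2, 1, 0, 1, 0]))  # bb
```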

Here is Python code that implements the above algorithm (although it is closer to Johan Jeuring's Haskell implementation than to the above description):

def fastLongestPalindromes(seq):
    """
    Behaves identically to naiveLongestPalindromes (see below), but
    runs in linear time.
    """
    seqLen = len(seq)
    l = []
    i = 0
    palLen = 0
    # Loop invariant: seq[(i - palLen):i] is a palindrome.
    # Loop invariant: len(l) >= 2 * i - palLen. The code path that
    # increments palLen skips the l-filling inner-loop.
    # Loop invariant: len(l) < 2 * i + 1. Any code path that
    # increments i past seqLen - 1 exits the loop early and so skips
    # the l-filling inner loop.
    while i < seqLen:
        # First, see if we can extend the current palindrome.  Note
        # that the center of the palindrome remains fixed.
        if i > palLen and seq[i - palLen - 1] == seq[i]:
            palLen += 2
            i += 1
            continue

        # The current palindrome is as large as it gets, so we append
        # it.
        l.append(palLen)

        # Now to make further progress, we look for a smaller
        # palindrome sharing the right edge with the current
        # palindrome.  If we find one, we can try to expand it and see
        # where that takes us.  At the same time, we can fill the
        # values for l that we neglected during the loop above. We
        # make use of our knowledge of the length of the previous
        # palindrome (palLen) and the fact that the values of l for
        # positions on the right half of the palindrome are closely
        # related to the values of the corresponding positions on the
        # left half of the palindrome.

        # Traverse backwards starting from the second-to-last index up
        # to the edge of the last palindrome.
        s = len(l) - 2
        e = s - palLen
        for j in range(s, e, -1):
            # d is the value l[j] must have in order for the
            # palindrome centered there to share the left edge with
            # the last palindrome.  (Drawing it out is helpful to
            # understanding why the - 1 is there.)
            d = j - e - 1

            # We check to see if the palindrome at l[j] shares a left
            # edge with the last palindrome.  If so, the corresponding
            # palindrome on the right half must share the right edge
            # with the last palindrome, and so we have a new value for
            # palLen.
            #
            # An exercise for the reader: in this place in the code you
            # might think that you can replace the == with >= to improve
            # performance.  This does not change the correctness of the
            # algorithm but it does hurt performance, contrary to
            # expectations.  Why?
            if l[j] == d:
                palLen = d
                # We actually want to go to the beginning of the outer
                # loop, but Python doesn't have loop labels.  Instead,
                # we use an else block corresponding to the inner
                # loop, which gets executed only when the for loop
                # exits normally (i.e., not via break).
                break

            # Otherwise, we just copy the value over to the right
            # side.  We have to bound l[j] because palindromes on the
            # left side could extend past the left edge of the last
            # palindrome, whereas their counterparts won't extend past
            # the right edge.
            l.append(min(d, l[j]))
        else:
            # This code is executed in two cases: when the for loop
            # isn't taken at all (palLen == 0) or the inner loop was
            # unable to find a palindrome sharing the left edge with
            # the last palindrome.  In either case, we're free to
            # consider the palindrome centered at seq[i].
            palLen = 1
            i += 1

    # We know from the loop invariant that len(l) < 2 * seqLen + 1, so
    # we must fill in the remaining values of l.

    # Obviously, the last palindrome we're looking at can't grow any
    # more.
    l.append(palLen)

    # Traverse backwards starting from the second-to-last index up
    # until we get l to size 2 * seqLen + 1. We can deduce from the
    # loop invariants we have enough elements.
    lLen = len(l)
    s = lLen - 2
    e = s - (2 * seqLen + 1 - lLen)
    for i in xrange(s, e, -1):
        # The d here uses the same formula as the d in the inner loop
        # above.  (Computes distance to left edge of the last
        # palindrome.)
        d = i - e - 1
        # We bound l[i] with min for the same reason as in the inner
        # loop above.
        l.append(min(d, l[i]))

    return l

And here is a naive quadratic version for comparison:

def naiveLongestPalindromes(seq):
    """
    Given a sequence seq, returns a list l such that l[2 * i + 1]
    holds the length of the longest palindrome centered at seq[i]
    (which must be odd), l[2 * i] holds the length of the longest
    palindrome centered between seq[i - 1] and seq[i] (which must be
    even), and l[2 * len(seq)] holds the length of the longest
    palindrome centered past the last element of seq (which must be 0,
    as is l[0]).

    The actual palindrome for l[i] is seq[s:(s + l[i])] where s is i
    // 2 - l[i] // 2. (// is integer division.)

    Example:
    naiveLongestPalindromes('ababa') -> [0, 1, 0, 3, 0, 5, 0, 3, 0, 1, 0]
    
    Runs in quadratic time.
    """
    seqLen = len(seq)
    lLen = 2 * seqLen + 1
    l = []

    for i in xrange(lLen):
        # If i is even (i.e., we're on a space), this will produce e
        # == s.  Otherwise, we're on an element and e == s + 1, as a
        # single letter is trivially a palindrome.
        s = i // 2
        e = s + i % 2

        # Loop invariant: seq[s:e] is a palindrome.
        while s > 0 and e < seqLen and seq[s - 1] == seq[e]:
            s -= 1
            e += 1

        l.append(e - s)

    return l
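
As a usage sketch, here is one way to turn that table into an actual substring using the start-index formula from the docstring. This is a Python 3 port (range and // instead of xrange and /), and the names naive_longest_palindromes and longest_palindrome are mine, introduced only for illustration:

```python
def naive_longest_palindromes(seq):
    """Python 3 port of naiveLongestPalindromes above."""
    seq_len = len(seq)
    lengths = []
    for i in range(2 * seq_len + 1):
        s = i // 2
        e = s + i % 2
        # Loop invariant: seq[s:e] is a palindrome.
        while s > 0 and e < seq_len and seq[s - 1] == seq[e]:
            s -= 1
            e += 1
        lengths.append(e - s)
    return lengths


def longest_palindrome(seq):
    """Return one longest palindromic slice of seq (hypothetical helper)."""
    lengths = naive_longest_palindromes(seq)
    # Index of the first maximal entry of the table.
    i = max(range(len(lengths)), key=lengths.__getitem__)
    # Start index from the docstring: s = i // 2 - l[i] // 2.
    s = i // 2 - lengths[i] // 2
    return seq[s:s + lengths[i]]
```

The same extraction should work unchanged on the table produced by the linear-time version, since both functions are documented to return the same list.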

Note that this is not the only efficient solution to this problem. Building a suffix tree takes time linear in the length of the input string, and you can use one to solve this problem, but as Johan also mentions, that is a much less direct and less efficient solution than this one.


A Foray into Number Theory with Haskell

2007-07-06T00:00:00-07:00

I encountered an interesting problem on reddit a few days ago which can be paraphrased as follows:

Find a perfect square $s$ such that $1597s + 1$ is also a perfect square.

After reading the discussion about implementing a brute-force algorithm to solve the problem, I spent a futile half-hour or so trying my hand at finding a better way. Then someone noticed that the problem was an instance of Pell's equation, which is known to have an elegant and fast solution; indeed, he posted a one-liner in Mathematica solving the given problem. However, I wanted to try coding up the solution myself: the Mathematica solution, while succinct, isn't very enlightening, since the heavy lifting is already done by a built-in function and an arbitrary constant was used for this particular instance of Pell's equation.

Pell's equation is simply the Diophantine equation $x^2 - dy^2 = 1$ for a given $d$^[1]; being Diophantine means that all variables involved take on only integer values. (In our original problem, $d$ is 1597 and we are asked for $y^2$.) The solution involves finding the continued fraction expansion of $\sqrt{d}$, finding the first convergent of the expansion that satisfies Pell's equation, and then generating all other solutions from that fundamental solution. We rule out the trivial solution $x = 1$, $y = 0$, which also implies that if $d$ is a perfect square then there is no solution.

A continued fraction is an expression of the form: \[ x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cfrac{1}{\ddots\,}}}} \] where all $a_i$ are integers and all but the first one are positive. The standard math notation for continued fractions is quite unwieldy so from now on we'll use $\left \langle a_0; a_1, a_2, \dotsc \right \rangle$ instead of the above.

The theory of continued fractions is a rich and beautiful one but for now we'll just state a few facts:

The continued fraction expansion of a number is (mostly) unique.
The continued fraction expansion of a rational number is finite.
The continued fraction expansion of an irrational number is infinite.
A quadratic surd is a number of the form $\frac{a + \sqrt{b}}{c}$ where $a$, $b$, and $c$ are integers. Except maybe for the first term, the continued fraction expansion of a quadratic surd is periodic; that is, it repeats forever after a certain number of terms. This applies in particular to the square root of an integer.
Truncating an infinite continued fraction to get a finite continued fraction gives (in some sense) an optimal rational approximation to the irrational number represented by the infinite continued fraction.
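
To make the first two facts concrete, here is a minimal Python sketch that expands a rational number into its (finite) continued fraction and collapses a list of terms back into a fraction. The function names continued_fraction and evaluate are my own, chosen just for this illustration:

```python
from fractions import Fraction


def continued_fraction(x):
    """Return the terms <a_0; a_1, ...> of the rational number x."""
    x = Fraction(x)
    terms = []
    while True:
        a = x.numerator // x.denominator  # floor(x)
        terms.append(a)
        frac = x - a
        if frac == 0:
            return terms
        # Same step as in the definition: x = a + 1 / (next x).
        x = 1 / frac


def evaluate(terms):
    """Collapse <a_0; a_1, ..., a_k> back into a single Fraction."""
    value = Fraction(terms[-1])
    for a in reversed(terms[:-1]):
        value = a + 1 / value
    return value
```

For example, $\frac{415}{93}$ expands to $\left \langle 4; 2, 6, 7 \right \rangle$. The "(mostly)" in the uniqueness claim shows up here too: $\left \langle 4; 2, 6, 6, 1 \right \rangle$ evaluates to the same number, since a final term $a_k > 1$ can always be rewritten as $a_k - 1$ followed by $1$.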

Given a quadratic surd it is fairly easy to manipulate it into the form $a + \frac{1}{q}$ where $q$ is another quadratic surd. This fact can be used to come up with an algorithm to find the continued fraction expansion of a square root. Wikipedia explains it pretty well so I won't go over it, but here is my Haskell implementation:

sqrt_continued_fraction n = [ a_i | (_, _, a_i) <- mdas ]
    where
      mdas = iterate get_next_triplet (m_0, d_0, a_0)

      m_0 = 0
      d_0 = 1
      a_0 = truncate $ sqrt $ fromIntegral n

      get_next_triplet (m_i, d_i, a_i) = (m_j, d_j, a_j)
          where
            m_j = d_i * a_i - m_i
            d_j = (n - m_j * m_j) `div` d_i
            a_j = (a_0 + m_j) `div` d_j

and here are some examples:

Prelude Main> take 20 $ sqrt_continued_fraction 2
[1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]

Prelude Main> take 20 $ sqrt_continued_fraction 103
[10,6,1,2,1,1,9,1,1,2,1,6,20,6,1,2,1,1,9,1]

Prelude Main> take 20 $ sqrt_continued_fraction 36
[6,*** Exception: divide by zero

(Note that we're assuming that we won't be called with a perfect square. Also, do you notice anything interesting about the periodic portion of the continued fractions, particularly of $\sqrt{103}$?)

For those who are unfamiliar with Haskell, here's a quick list of key facts:

The first line takes a list of triplets and forms a list of all third elements, which is what we're interested in. (The other two elements of the triplet are auxiliary variables used by the algorithm.)
iterate is a function which takes in another function f and an initial value x, and returns the infinite list [ x, f(x), f(f(x)), f(f(f(x))), ... ].
Note that Haskell uses lazy evaluation and so this function does not take an infinite amount of time to run; all its elements are evaluated (and memoized) only when needed.
The rest of the function is a straightforward representation of the meat of the algorithm described in the above Wikipedia entry.
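
For readers more comfortable with Python, the same recurrence can be sketched as a generator. This is a rough port rather than a definitive implementation; it assumes Python 3.8+ for math.isqrt, which I've swapped in for the floating-point square root so that a_0 stays exact for large inputs:

```python
from itertools import islice
from math import isqrt


def sqrt_continued_fraction(n):
    """Yield the terms a_0, a_1, ... of the expansion of sqrt(n).

    Assumes n is a positive non-square integer, like the Haskell version.
    """
    a0 = isqrt(n)  # exact integer square root, no floating point
    m, d, a = 0, 1, a0
    while True:
        yield a
        # Same triplet update as get_next_triplet above.
        m = d * a - m
        d = (n - m * m) // d
        a = (a0 + m) // d


print(list(islice(sqrt_continued_fraction(103), 20)))
```

The generator plays the role of the lazy infinite list: islice takes as many terms as needed, and nothing past that point is computed. As with the Haskell version, a perfect square leads to a division by zero after the first term.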

It may not be clear what $\sqrt{d}$ and its continued fraction expansion have to do with solving Pell's equation. However, notice that if $x$ and $y$ solve Pell's equation then manipulating Pell's equation to get $\sqrt{d}$ on one side reveals that $\frac{x}{y}$ is a good approximation of $\sqrt{d}$. In fact, it is so good that you can prove that $\frac{x}{y}$ must come from truncating the continued fraction expansion of $\sqrt{d}$.
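
Spelled out, the manipulation goes roughly as follows, assuming $x$ and $y$ are positive. Factoring the left side of Pell's equation gives $(x - y\sqrt{d})(x + y\sqrt{d}) = 1$, and dividing by $y(x + y\sqrt{d})$ yields \[ \frac{x}{y} - \sqrt{d} = \frac{1}{y(x + y\sqrt{d})}. \] The right side is positive, so $x > y\sqrt{d}$ and hence $x + y\sqrt{d} > 2y\sqrt{d}$, which gives \[ 0 < \frac{x}{y} - \sqrt{d} < \frac{1}{2y^2\sqrt{d}} < \frac{1}{2y^2}. \] The standard result that finishes the argument is Legendre's theorem: any fraction $\frac{p}{q}$ with $\left| \alpha - \frac{p}{q} \right| < \frac{1}{2q^2}$ is a convergent of the continued fraction expansion of $\alpha$.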

This leads us to the following: if you have an infinite continued fraction $\left \langle a_0; a_1, a_2, \dotsc \right \rangle$ you can truncate it into a finite continued fraction $\left \langle a_0; a_1, a_2, \dotsc, a_i \right \rangle$ and simplify it into the rational number $\frac{p_i}{q_i}$. The sequence $\frac{p_0}{q_0}, \frac{p_1}{q_1}, \frac{p_2}{q_2}, \dotsc$ forms the convergents of $\left \langle a_0; a_1, a_2, \dotsc \right \rangle$ and converges to its represented irrational number.

It turns out you can calculate $p_{i+1}$ and $q_{i+1}$ efficiently from $p_i$, $q_i$, $p_{i-1}$, $q_{i-1}$, and $a_{i+1}$ using the fundamental recurrence formulas (which can be proved by induction). Here is my Haskell implementation:

get_convergents (a_0 : a_1 : as) = pqs
    where
      pqs = (p_0, q_0) : (p_1, q_1) :
            zipWith3 get_next_convergent pqs (tail pqs) as

      p_0 = a_0
      q_0 = 1

      p_1 = a_1 * a_0 + 1
      q_1 = a_1

      get_next_convergent (p_i, q_i) (p_j, q_j) a_k = (p_k, q_k)
          where
            p_k = a_k * p_j + p_i
            q_k = a_k * q_j + q_i

and some more examples:

Prelude Main> take 8 $ get_convergents $ sqrt_continued_fraction 2
[(1,1),(3,2),(7,5),(17,12),(41,29),(99,70),(239,169),(577,408)]

Prelude Main> take 8 $ get_convergents $ sqrt_continued_fraction 103
[(10,1),(61,6),(71,7),(203,20),(274,27),(477,47),(4567,450),(5044,497)]

Prelude Main> take 8 $ get_convergents $ sqrt_continued_fraction 1597
[(39,1),(40,1),(1039,26),(1079,27),(2118,53),(3197,80),(27694,693),(113973,2852)]

Prelude Main> let divFrac (x, y) = (fromInteger x) / (fromInteger y)

Prelude Main> take 8 $ map divFrac $ get_convergents $ sqrt_continued_fraction 2
[1.0,1.5,1.4,1.4166666666666667,1.4137931034482758,1.4142857142857144,1.4142011834319526,1.4142156862745099]

Prelude Main> take 8 $ map divFrac $ get_convergents $ sqrt_continued_fraction 103
[10.0,10.166666666666666,10.142857142857142,10.15,10.148148148148149,10.148936170212766,10.148888888888889,10.148893360160965]

Prelude Main> take 8 $ map divFrac $ get_convergents $ sqrt_continued_fraction 1597
[39.0,40.0,39.96153846153846,39.96296296296296,39.9622641509434,39.9625,39.96248196248196,39.9624824684432]

Here are a few more quick facts to help those unfamiliar with Haskell:

The expression a : as forms a new list from the element a and the existing list as (equivalent to cons in Lisp).
zipWith3 is a function that takes in a function f, three lists a, b, and c of the same (possibly infinite) length n, and forms the new list [ f(a[0], b[0], c[0]), f(a[1], b[1], c[1]), ..., f(a[n-1], b[n-1], c[n-1]) ].
Note that the result of zipWith3 is part of the variable pqs which itself appears (twice!) in the arguments to zipWith3. This is a Haskell idiom and reflects the fact that the recurrence formulas define a convergent in terms of its two previous convergents. A simpler example (using the Fibonacci sequence) can be found in the Wikipedia entry for lazy evaluation.
Haskell has built-in data types for integers of arbitrary size, which are necessary as the numerators and denominators of the convergents get large quickly. In fact, Haskell has built-in data types for rational numbers (represented as fractions) but they don't help us much here.
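
For comparison, here is a rough Python sketch of the same recurrence. Instead of seeding with the first two convergents as the Haskell version does, it uses the conventional seed values $(p_{-2}, q_{-2}) = (0, 1)$ and $(p_{-1}, q_{-1}) = (1, 0)$, so every term goes through the same update. The name convergents is mine:

```python
def convergents(terms):
    """Yield (p_i, q_i) for the continued fraction <a_0; a_1, a_2, ...>."""
    # Seed values (p_{-2}, q_{-2}) = (0, 1) and (p_{-1}, q_{-1}) = (1, 0)
    # make the recurrence work from a_0 onward.
    p_prev, q_prev = 0, 1
    p, q = 1, 0
    for a in terms:
        # p_i = a_i * p_{i-1} + p_{i-2}, and likewise for q_i.
        p_prev, p = p, a * p + p_prev
        q_prev, q = q, a * q + q_prev
        yield p, q


print(list(convergents([10, 6, 1, 2, 1, 1, 9, 1])))
```

Because it is a generator, it can also consume an endless stream of terms lazily, and Python's integers are arbitrary-precision, so the rapidly growing numerators and denominators are not a problem.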

Since we are guaranteed that some convergent eventually satisfies Pell's equation, we can write a simple function to generate all convergents, test each one to see if it satisfies Pell's equation, and return the first one we see. Here is the Haskell implementation:

get_pell_fundamental_solution n = head $ solutions
    where
      solutions = [ (p, q) | (p, q) <- convergents, p * p - n * q * q == 1 ]

      convergents = get_convergents $ sqrt_continued_fraction n

Note the use of Haskell's list comprehension syntax, similar to Python's, which expresses what I just described in a manner reminiscent of set notation.

Here is the full Haskell program designed so its output may be conveniently piped to bc for verification:

module Main where

-- getArgs lives in System.Environment (the old Haskell 98 System
-- module is not available in current GHC).
import System.Environment (getArgs)

sqrt_continued_fraction :: (Integral a) => a -> [a]
{- ... the sqrt_continued_fraction function explained above ... -}

get_convergents :: (Integral a) => [a] -> [(a, a)]
{- ... the get_convergents function explained above ... -}

get_pell_fundamental_solution :: (Integral a) => a -> (a, a)
{- ... the get_pell_fundamental_solution function explained above ... -}

main :: IO ()
main = do
  args <- getArgs
  let d      = (read $ head $ args :: Integer)
      (p, q) = get_pell_fundamental_solution d in
    putStr $ "d = " ++ (show d) ++ "\n" ++
             "p = " ++ (show p) ++ "\n" ++
             "q = " ++ (show q) ++ "\n" ++
             "p^2 - d * q^2 == 1\n"

and here it is in action:

$ ./solve_pell 1597
d = 1597
p = 519711527755463096224266385375638449943026746249
q = 13004986088790772250309504643908671520836229100
p^2 - d * q^2 == 1

The solution to the original problem is therefore $s = q^2$, the square of the 47-digit $q$ printed above, which is a 93-digit number.
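
If you don't have bc handy, the whole computation can be cross-checked with a short Python sketch. It is a compact re-implementation under the same assumptions as the Haskell program (d is a positive non-square integer), not a translation of it; the function name pell_fundamental_solution is mine, and math.isqrt needs Python 3.8+:

```python
from math import isqrt


def pell_fundamental_solution(d):
    """Return the smallest (x, y) with y > 0 and x*x - d*y*y == 1.

    Assumes d is a positive non-square integer.
    """
    a0 = isqrt(d)
    m, den, a = 0, 1, a0          # state of the sqrt(d) expansion
    p_prev, q_prev = 1, 0         # convergent seeds (p_{-1}, q_{-1})
    p, q = a0, 1                  # first convergent (p_0, q_0)
    while p * p - d * q * q != 1:
        # Next term of the continued fraction of sqrt(d).
        m = den * a - m
        den = (d - m * m) // den
        a = (a0 + m) // den
        # Next convergent via the recurrence formulas.
        p_prev, p = p, a * p + p_prev
        q_prev, q = q, a * q + q_prev
    return p, q


d = 1597
p, q = pell_fundamental_solution(d)
s = q * q                         # the perfect square the problem asks for
assert isqrt(d * s + 1) ** 2 == d * s + 1
print(s)
```

The assertion checks the original statement of the problem directly: $1597s + 1$ is itself a perfect square (namely $p^2$).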

Now that we've found a method to get a solution, the question remains as to whether it's the only one. In fact it is not, but it is the minimal one, and all other solutions (of which there are an infinite number) can be generated from this fundamental one with a simple recurrence relation, as described in the Wikipedia article. My program above can be easily extended to generate all solutions instead of just the fundamental one (I'll leave it to the reader as an exercise).
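
For those who want to check their answer to that exercise, here is one possible sketch in Python. It relies on the standard recurrence $x_{k+1} = x_1 x_k + d y_1 y_k$, $y_{k+1} = x_1 y_k + y_1 x_k$, which amounts to repeatedly multiplying by $x_1 + y_1 \sqrt{d}$. The name pell_solutions is mine, and the caller is assumed to pass in the fundamental solution:

```python
from itertools import islice


def pell_solutions(d, x1, y1):
    """Yield the positive solutions of x*x - d*y*y == 1 in increasing order.

    (x1, y1) must be the fundamental solution for d.
    """
    x, y = x1, y1
    while True:
        yield x, y
        # Multiply (x + y*sqrt(d)) by (x1 + y1*sqrt(d)).
        x, y = x1 * x + d * y1 * y, x1 * y + y1 * x


# For d = 2 the fundamental solution is (3, 2).
print(list(islice(pell_solutions(2, 3, 2), 4)))
```

For $d = 2$ this reproduces every other convergent of $\sqrt{2}$ listed earlier: $(3, 2)$, $(17, 12)$, $(99, 70)$, $(577, 408)$.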

One remaining question is the efficiency of this algorithm. For simplicity, let's neglect the cost of the arbitrary-precision arithmetic involved and assume that the incremental cost of generating each term of the continued fraction expansion and each convergent is constant. Then the main cost is just how many convergents we have to generate before we find one that satisfies Pell's equation. It turns out that this depends on the length of the period of the continued fraction expansion of $\sqrt{d}$, which has a rough upper bound of $O(\sqrt{d} \ln d)$. Therefore, the cost of solving Pell's equation (in terms of how many convergents to generate) for a given $n$-bit number is $O(n 2^{n/2})$. This is pretty expensive already, although it's still much better than brute-force search (which is on the order of exponentiating the above expression). Can we do better? Well, sort of; it turns out the length of the answer is of the same order as the expression above, so any algorithm that explicitly outputs a solution necessarily takes that long. However, if you can somehow factor $d$ into $s d'$, where $s$ is a perfect square and $d'$ is squarefree (i.e., not divisible by any perfect square other than 1), then you can solve Pell's equation for the smaller number $d'$ and output the solution for $d$ as the smaller fundamental solution and an expression raised to a certain power involving it. Note that in general this involves factoring $d$, another hard problem, but one for which there exist tons of prior work. An interested reader can peruse the papers by Lenstra and Vardi for more details.
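
To get a feel for how the period length behaves, here is a small Python sketch that measures it. It leans on a standard fact not proved here: the period of the expansion of $\sqrt{d}$ ends exactly at the first term equal to $2 a_0$. The name sqrt_period_length is mine, and math.isqrt needs Python 3.8+:

```python
from math import isqrt


def sqrt_period_length(d):
    """Return the period length of the continued fraction of sqrt(d).

    Assumes d is a positive non-square integer.
    """
    a0 = isqrt(d)
    m, den, a = 0, 1, a0
    length = 0
    # The period ends at the first term equal to 2 * a0.
    while a != 2 * a0:
        m = den * a - m
        den = (d - m * m) // den
        a = (a0 + m) // den
        length += 1
    return length


for d in (2, 7, 61, 103):
    print(d, sqrt_period_length(d))
```

The link to the cost is the standard result that, for period length $L$, the fundamental solution is the convergent with index $L - 1$ when $L$ is even and $2L - 1$ when $L$ is odd, so the number of convergents generated is at most twice the period length.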

As a final note, one of the things I really like about number theory is that investigating such a simple problem can lead you down surprising avenues of mathematics and computational theory. In fact, I've had to omit a lot of things I had planned to say to avoid making this entry even longer than it already is. Hopefully, this entry helps someone else learn more about this interesting corner of number theory.


Footnotes

[1] As a rule we'll avoid considering trivial cases and re-stating obvious assumptions (like $d$ having to be a positive integer). ↩
