Fred Akalin
Notes on math, tech, and everything in between
2018-06-26T19:10:32-07:00
https://www.akalin.com/
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
https://www.akalin.com/curvature-moving-frames
Curvature computations with moving frames
2018-03-22T00:00:00-07:00
<script>
KaTeXMacros = {
"\\pd": "\\frac{∂{#1}}{∂{#2}}",
"\\CSF": "Γ_{#1}",
"\\CS": "{Γ^{#1}}_{#2}",
"\\cnf": "{ω^{#1}}_{#2}",
"\\crf": "{Ω^{#1}}_{#2}",
"\\Riem": "{\\operatorname{Riem}^{#1}}_{#2}",
"\\Ric": "\\operatorname{Ric}_{#1}",
"\\sgn": "\\operatorname{sgn}",
};
</script>
<style>
div.cheatsheet, div.important-equation {
border: 1px solid #002b36; /* solarized base03 */
background-color: #fdf6e3; /* solarized base3 */
color: #111;
margin: 0.5em 0em;
text-align: left;
padding-left: 0.5em;
padding-right: 0.5em;
}
div.cheatsheet > h2 {
font-weight: bold;
}
li > h3 {
font-weight: bold;
font-style: italic;
}
</style>
<section>
<header>
<h2>Overview</h2>
</header>
<p>Given a metric on a manifold, it is often necessary to compute its
curvature. However, the usual method of first computing the
Christoffel symbols and then using those to compute the Riemann
curvature tensor is tedious and error-prone.</p>
<p>Fortunately, there’s another way to compute the curvature
that’s often quicker and easier: Cartan’s method of
moving frames, or the <em>repère mobile</em>. Unfortunately,
explanations of this method aren’t very clear, so here
I’m going to provide my own, based on working through a few
examples.</p>
<p>I’m going to assume that you know enough Riemannian geometry
to be able to compute curvature the usual way, and also that
you’re familiar with the basics of differential forms and
exterior differentiation. Some familiarity with <a href="https://en.wikipedia.org/wiki/Pseudo-Riemannian_manifold">semi-Riemannian metrics</a>
will also be helpful, since a lot of motivating examples come from
general relativity, which uses
<a href="https://en.wikipedia.org/wiki/Pseudo-Riemannian_manifold#Lorentzian_manifold">Lorentzian metrics</a>.</p>
</section>
<section>
<header>
<h2>The coordinate frame method</h2>
</header>
<p>First, a quick overview of the usual method using coordinate
frames. Let \(g = g_{ij} \, dx^i ⊗ dx^j\) be a given semi-Riemannian
metric expressed in terms of the coordinates \((x^1, \dotsc, x^n)\).
We first compute the <em>Christoffel symbols</em> using the formula
\[
\CS{k}{ij} = \frac{1}{2} (g^*)^{kl} \left(∂_j g_{il} + ∂_i g_{lj} - ∂_l g_{ij}\right)\text{,}
\]
where \((g^*)^{ij}\) are the components of the dual metric \(g^*\),
which can be computed by taking components of the inverse of the
matrix \(G[i, j] = g_{ij}\) formed from the metric components, i.e. \((g^*)^{ij} = G^{-1}[i, j]\). Recall
that the Christoffel symbols are symmetric in the lower indices, so
if our manifold is \(n\)-dimensional, then in general we have \(n^2(n+1)/2\) independent
Christoffel symbols.</p>
<p>Note that we use the <a href="https://en.wikipedia.org/wiki/Einstein_notation">Einstein summation convention</a>;
in the absence of a summation sign, index variables that appear once
as a superscript and once as a subscript are implicitly summed over.</p>
<p>A useful special case is when the metric \(g\) is diagonal,<sup><a href="#fn1" id="r1">[1]</a></sup> i.e. \(g = g_{ii} \, dx^i ⊗ dx^i\). Then \((g^*)^{ii} = 1/g_{ii}\) and
\[
\begin{alignedat}{2}
\CS{k}{ij} &= 0 \qquad & \CS{k}{ik} &= \frac{∂_i g_{kk}}{2 g_{kk}} \\
\CS{k}{ii} &= -\frac{∂_k g_{ii}}{2 g_{kk}} \qquad & \CS{i}{ii} &= \frac{∂_i g_{ii}}{2 g_{ii}}\text{,}
\end{alignedat}
\]
where \(i\), \(j\), and \(k\) are distinct. Therefore in this case we have \(n^2\) non-zero independent Christoffel symbols.</p>
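<p>As a quick sanity check of the diagonal formulas (an illustrative example of my choosing, not one used later), take polar coordinates \((r, φ)\) on the plane, so that \(g_{rr} = 1\) and \(g_{φφ} = r^2\). The only non-zero derivative of a metric component is \(∂_r g_{φφ} = 2r\), so
\[
\CS{r}{φφ} = -\frac{∂_r g_{φφ}}{2 g_{rr}} = -r \qquad \CS{φ}{rφ} = \CS{φ}{φr} = \frac{∂_r g_{φφ}}{2 g_{φφ}} = \frac{1}{r}\text{,}
\]
and all other Christoffel symbols vanish.</p>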
<p>The Christoffel symbols are important in their own right, but we
need them only to compute curvature. We can compute the components
of the <em>Riemann curvature tensor</em> using the formula
\[
\Riem{k}{lij} = ∂_i \CS{k}{jl} - ∂_j \CS{k}{il} + \CS{k}{im} \CS{m}{jl} - \CS{k}{jm} \CS{m}{il}\text{.}
\]
We can then compute the <em>Ricci curvature tensor</em> and the <em>scalar curvature</em>:
\[
\Ric{ij} = \Riem{k}{ikj} \qquad S = (g^*)^{ij} \Ric{ij}\text{.}
\]</p>
<p>For applications, we’re most interested in the Ricci curvature tensor,
so we usually just want to calculate that directly:
\[
\Ric{ij} = ∂_k \CS{k}{ji} - ∂_j \CS{k}{ki} + \CS{k}{km} \CS{m}{ji} - \CS{k}{jm} \CS{m}{ki}\text{.}
\]</p>
<div class="cheatsheet">
<h2>Cheatsheet: coordinate frame method</h2>
<div class="p">Given the components \(g_{ij}\) of a semi-Riemannian metric:
<ol>
<li>Compute the Christoffel symbols. If the metric \(g\) is
diagonal, use
\[
\begin{alignedat}{2}
\CS{k}{ij} &= 0 \qquad & \CS{k}{ik} &= \frac{∂_i g_{kk}}{2 g_{kk}} \\
\CS{k}{ii} &= -\frac{∂_k g_{ii}}{2 g_{kk}} \qquad & \CS{i}{ii} &= \frac{∂_i g_{ii}}{2 g_{ii}}\text{.}
\end{alignedat}
\]
Otherwise, compute the dual metric components \((g^*)^{ij} = G^{-1}[i, j]\) where \(G[i, j] = g_{ij}\) and use
\[
\CS{k}{ij} = \frac{1}{2} (g^*)^{kl} \left(∂_j g_{il} + ∂_i g_{lj} - ∂_l g_{ij}\right)\text{.}
\]</li>
<li>Compute the Ricci curvature tensor:
\[
\Ric{ij} = ∂_k \CS{k}{ji} - ∂_j \CS{k}{ki} + \CS{k}{km} \CS{m}{ji} - \CS{k}{jm} \CS{m}{ki}\text{.}
\]</li>
</ol>
</div>
</div>
</section>
<section>
<header>
<h2>The Lagrangian method</h2>
</header>
<p>An alternate method for computing the Christoffel symbols is to write
down the Lagrangian corresponding to the metric:
\[
L(x^1, \dotsc, x^n, v^1, \dotsc, v^n) = g_{ij}(x^1, \dotsc, x^n) \, v^i v^j
\]
and then to compute the Euler-Lagrange equations for a path
\(γ(t) = \big(x^1(t), \dotsc, x^n(t)\big)\):
\[
\frac{d}{dt} \left( \frac{∂ L}{∂ v^k}(γ(t), \dot{γ}(t)) \right) - \frac{∂ L}{∂ x^k}(γ(t), \dot{γ}(t)) = 0
\]
to get the geodesic equations. Then we can compare these equations
to the geodesic equations expressed in terms of the Christoffel symbols
\[
\ddot{γ}^k + \CS{k}{ij} \dot{γ}^i \dot{γ}^j = 0\text{,}
\]
and then we can read off the Christoffel symbols from the coefficients of the
\(\dot{γ}^i \dot{γ}^j\) terms.</p>
<p>I’m not convinced that this method saves that much work,
especially when the metric is diagonal, but it’s at least a
clearer way to organize the computations for the Christoffel symbols.</p>
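<p>To see how the reading-off step goes on a small case (again an illustrative example, assuming polar coordinates \((r, φ)\) on the plane), the Lagrangian is \(L = (v^r)^2 + r^2 (v^φ)^2\), and the Euler-Lagrange equations are
\[
\begin{aligned}
\frac{d}{dt}(2 \dot{r}) - 2 r \dot{φ}^2 = 0 \quad &⟹ \quad \ddot{r} - r \, \dot{φ}^2 = 0 \\
\frac{d}{dt}(2 r^2 \dot{φ}) = 0 \quad &⟹ \quad \ddot{φ} + \frac{2}{r} \, \dot{r} \dot{φ} = 0\text{.}
\end{aligned}
\]
Comparing with the geodesic equation gives \(\CS{r}{φφ} = -r\). For the mixed term, the sum \(\CS{φ}{ij} \dot{γ}^i \dot{γ}^j\) contains \(\dot{r} \dot{φ}\) twice, once from \(\CS{φ}{rφ}\) and once from \(\CS{φ}{φr}\), so \(2 \CS{φ}{rφ} = 2/r\), i.e. \(\CS{φ}{rφ} = 1/r\). Forgetting this factor of two is the easiest mistake to make with this method.</p>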
<div class="cheatsheet">
<h2>Cheatsheet: Lagrangian method</h2>
<div class="p">Given the components \(g_{ij}\) of a semi-Riemannian metric:
<ol>
<li>With the Lagrangian
\[
L = g_{ij} \, v^i v^j\text{,}
\]
compute the Euler-Lagrange equations
\[
\frac{d}{dt} \left( \frac{∂ L}{∂ v^k}(γ(t), \dot{γ}(t)) \right) - \frac{∂ L}{∂ x^k}(γ(t), \dot{γ}(t)) = 0\text{.}
\]</li>
<li>Compare the Euler-Lagrange equations to the geodesic equation
\[
\ddot{γ}^k + \CS{k}{ij} \dot{γ}^i \dot{γ}^j = 0
\]
and read off the Christoffel symbols \(\CS{k}{ij}\).
</li>
<li>Compute the Ricci curvature tensor:
\[
\Ric{ij} = ∂_k \CS{k}{ji} - ∂_j \CS{k}{ki} + \CS{k}{km} \CS{m}{ji} - \CS{k}{jm} \CS{m}{ki}\text{.}
\]</li>
</ol>
</div>
</div>
</section>
<section>
<header>
<h2>The moving frame method</h2>
</header>
<p>Now, finally, I can explain the method of moving
frames. Don’t worry too much about understanding this the first
time through; I suggest skimming this section and then following along
with the examples below, referring back as necessary.</p>
<p>For now, let’s assume that we have not a semi-Riemannian, but
a Riemannian metric \(g = g_{ij} \, dx^i ⊗ dx^j\) expressed in terms
of the coordinates \((x^1, \dotsc, x^n)\). We want to find
<em>basis one-forms</em>
\((θ^1, \dotsc, θ^n)\) such that
\[
g = ∑_i θ^i ⊗ θ^i\text{.}
\]
If the metric is diagonal, this is easy (suspending the summation
convention):
\[
θ^i = \sqrt{g_{ii}} \, dx^i\text{.}
\]
If instead the metric is not diagonal, we may still be able to
factor it into a “sum of squares” form by
inspection. Otherwise, an equivalent definition of the \(θ^i\) is that
\[
g^*(θ^i, θ^j) = δ^i_j\text{,}
\]
i.e. the basis one-forms \(θ^i\) comprise an <em>orthonormal dual frame</em>.
We can then use a <a href="https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process">Gram-Schmidt-like</a> process on the \(dx^i\) or
some ad hoc method to compute the basis one-forms.</p>
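<p>As a small example of factoring by inspection (an illustrative metric, not one used later), suppose the line element is \(ds^2 = dx^2 + 2a \, dx \, dy + dy^2\) for some function \(a\) with \(\lvert a \rvert \lt 1\). Completing the square gives
\[
ds^2 = (dx + a \, dy)^2 + (1 - a^2) \, dy^2\text{,}
\]
so we can take
\[
θ^1 = dx + a \, dy \qquad θ^2 = \sqrt{1 - a^2} \, dy\text{.}
\]</p>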
<p>It is also convenient to express the coordinate forms in terms of the
basis one-forms, which is again simple if the metric is diagonal:
\[
dx^i = \frac{1}{\sqrt{g_{ii}}} \, θ^i\text{.}
\]
Otherwise, one would need to invert the matrix expressing the \(θ^i\)
in terms of the \(dx^i\).</p>
<div class="p">The next step is to compute the <em>connection one-forms</em> \(\cnf{i}{j}\).
To do so, we compute the exterior derivatives of the basis one-forms
\(dθ^i\) and express them in terms of the basis two-forms, i.e.
\[
dθ^i = a^i_{jk} \, θ^j ∧ θ^k
\]
for functions \(a^i_{jk}\).
Then we can use <em>Cartan’s first structure equation</em>
<div class="important-equation">
\[
dθ^i = -\cnf{i}{j} ∧ θ^j
\]
</div>
and the fact that <em>the connection forms are skew symmetric</em>
<div class="important-equation">
\[
\cnf{i}{j} = -\cnf{j}{i}
\]
</div>
to deduce the \(\cnf{i}{j}\).</div>
<p>There’s an explicit general formula for \(\cnf{i}{j}\) in
terms of the basis one-forms,<sup><a href="#fn2" id="r2">[2]</a></sup>
but it’s often easier to compare the expressions for \(dθ^i\)
to the form of the first structure equation, guess what the
connection forms are, taking advantage of their skew symmetry, and
check that the first structure equation holds. In fact, if the
metric is diagonal, the expressions for \(dθ^i\) are
nice enough that you can immediately read off the connection
forms. This “guess and check” method works because the
connection forms are guaranteed to exist, and furthermore are
guaranteed to be unique, so any guessed list of \(\cnf{i}{j}\) that
satisfies the first structure equation <em>must</em> be the
connection forms.</p>
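<p>Here is a minimal instance of “guess and check” (an illustrative example, assuming polar coordinates \((r, φ)\) on the plane with line element \(dr^2 + r^2 \, dφ^2\)). The basis one-forms are \(θ^1 = dr\) and \(θ^2 = r \, dφ\), with derivatives
\[
dθ^1 = 0 \qquad dθ^2 = dr ∧ dφ = \frac{1}{r} \, θ^1 ∧ θ^2\text{.}
\]
By skew symmetry there is only one independent connection form, and the first structure equations read \(dθ^1 = -\cnf{1}{2} ∧ θ^2\) and \(dθ^2 = \cnf{1}{2} ∧ θ^1\). Guessing
\[
\cnf{1}{2} = -\frac{1}{r} \, θ^2 = -dφ\text{,}
\]
we can check that
\[
-\cnf{1}{2} ∧ θ^2 = \frac{1}{r} \, θ^2 ∧ θ^2 = 0 = dθ^1 \qquad \cnf{1}{2} ∧ θ^1 = -\frac{1}{r} \, θ^2 ∧ θ^1 = \frac{1}{r} \, θ^1 ∧ θ^2 = dθ^2\text{,}
\]
so by uniqueness this guess is the connection form.</p>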
<p>Note that skew symmetry immediately implies that (suspending the
Einstein summation convention)
\[
\cnf{i}{i} = 0\text{.}
\]
Therefore, we have \(n(n-1)/2\) independent connection forms.</p>
<p>There <em>is</em> a formula for the connection forms when \(g\) is
diagonal, which is more useful for deducing properties of diagonal
metrics than it is for doing calculations. Suspending the summation
convention,
\[
\begin{aligned}
\cnf{i}{j}
&= \frac{∂_j g_{ii}}{2 g_{ii} \sqrt{g_{jj}}} \, θ^i - \frac{∂_i g_{jj}}{2 g_{jj} \sqrt{g_{ii}}} \, θ^j \\
&= \frac{∂_j g_{ii}}{2 \sqrt{g_{ii} g_{jj}}} \, dx^i - \frac{∂_i g_{jj}}{2 \sqrt{g_{ii} g_{jj}}} \, dx^j\text{.}
\end{aligned}
\]
This formula implies that a diagonal metric has connection forms
with at most two components each, as opposed to \(n\) components in
general. Furthermore, if a diagonal metric depends only on a single
coordinate \(x^r\), the only possible non-zero connection forms up to skew symmetry are \(\cnf{i}{r}\),
which are proportional to \(θ^i\). If instead a diagonal metric depends on two coordinates \(x^r\) and \(x^s\),
then the only possible non-zero connection forms up to skew symmetry
are \(\cnf{i}{r}\), \(\cnf{i}{s}\), or \(\cnf{r}{s}\). The first two
cases are proportional to \(θ^i\), and the
last case has at most two components: one proportional to \(θ^r\) and another proportional to \(θ^s\).</p>
<div class="p">The connection forms play an important role similar to the
Christoffel symbols, but we need them only to compute
curvature. First, observe that we can express each connection form
in two ways: in terms of the \(dx^i\), and in terms of the \(θ^i\). We
need to compute the derivatives \(d\cnf{i}{j}\), which is easiest to
do if \(\cnf{i}{j}\) is expressed in terms of the \(dx^i\), since
\(d(dx^i) = 0\). Then we can compute the <em>curvature forms</em>
\(\crf{i}{j}\) using <em>Cartan’s second structure equation</em>
<div class="important-equation">
\[
\crf{i}{j} = d\cnf{i}{j} + \cnf{i}{k} ∧ \cnf{k}{j}\text{.}
\]
</div>
Like the connection forms, <em>the curvature forms are skew symmetric</em>:
<div class="important-equation">
\[
\crf{i}{j} = -\crf{j}{i}\text{,}
\]
</div>
so we need only calculate \(n(n-1)/2\) independent curvature forms,
i.e. the ones where \(i ≠ j\). Also note that in the \(\cnf{i}{k} ∧ \cnf{k}{j}\) term, one need only take the sum over the \(n - 2\) terms \(k ∉ \{ i, j \}\), since
(suspending the summation convention) \(\cnf{i}{i} = \cnf{j}{j} = 0\).</div>
<p>From the properties discussed above, if a diagonal metric depends
only on a single coordinate, then each curvature form \(\crf{i}{j}\)
is proportional to \(θ^i ∧ θ^j\). If instead a diagonal metric depends on two coordinates \(x^r\) and \(x^s\),
then each curvature form \(\crf{i}{r}\) or \(\crf{i}{s}\), up to skew symmetry, has at most two components: one proportional to \(θ^i ∧ θ^r\) and another proportional to \(θ^i ∧ θ^s\), and all other curvature forms \(\crf{i}{j}\) are
proportional to \(θ^i ∧ θ^j\).</p>
<p>At this point we’re done, since the Riemann curvature tensor
with respect to the orthonormal frame \((E_1, \dotsc, E_n)\) dual to
\((θ^1, \dotsc, θ^n)\) is
\[
\Riem{l}{kij} = \crf{l}{k}(E_i, E_j)
\]
and the Ricci curvature tensor is
\[
\Ric{ij} = \crf{k}{i}(E_k, E_j)\text{.}
\]
Note that it’s not necessary to explicitly calculate \(E_i\);
it’s enough to use the definition
\[
θ^i(E_j) = δ^i_j\text{,}
\]
and the definition of the wedge product to derive the relations
\[
(θ^i ∧ θ^j)(E_k, E_l) = \begin{cases}
+1 & k = i ≠ j = l \\
-1 & l = i ≠ j = k \\
0 & \text{otherwise,}
\end{cases}
\]
which can then be used to compute the curvature tensor components.</p>
<p>From the properties discussed above, if a diagonal metric depends
only on a single coordinate, then \(\crf{i}{j}\) is proportional to \(θ^i ∧ θ^j\), which implies that \(\Ric{}\) is
also diagonal. Furthermore, if the metric is diagonal and depends
on two coordinates \(x^k\) and \(x^l\), then the only possible off-diagonal component is \(\Ric{kl}\).<sup><a href="#fn3" id="r3">[3]</a></sup></p>
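<p>To make the last step concrete, here is a small worked instance (an illustrative example, assuming the unit two-sphere in coordinates \((u, v)\) with line element \(du^2 + \sin^2 u \, dv^2\)). With \(θ^1 = du\) and \(θ^2 = \sin u \, dv\), one can check that \(\cnf{1}{2} = -\cos u \, dv\) satisfies the first structure equation, so
\[
\crf{1}{2} = d\cnf{1}{2} = \sin u \, du ∧ dv = θ^1 ∧ θ^2 \qquad \crf{2}{1} = -θ^1 ∧ θ^2\text{.}
\]
Then, using only the relations for \((θ^i ∧ θ^j)(E_k, E_l)\),
\[
\begin{aligned}
\Ric{11} &= \crf{2}{1}(E_2, E_1) = -(θ^1 ∧ θ^2)(E_2, E_1) = 1 \\
\Ric{22} &= \crf{1}{2}(E_1, E_2) = (θ^1 ∧ θ^2)(E_1, E_2) = 1 \\
\Ric{12} &= \crf{2}{1}(E_2, E_2) = 0\text{,}
\end{aligned}
\]
so the Ricci tensor is diagonal, as expected for a diagonal metric depending on a single coordinate.</p>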
<div class="cheatsheet">
<h2>Cheatsheet: The moving frame method for Riemannian metrics</h2>
<div class="p">Given the components \(g_{ij}\) of a Riemannian metric:
<ol>
<li>Find an orthonormal dual frame, i.e. basis one-forms \((θ^1, \dotsc, θ^n)\) such that
\[
g = ∑_i θ^i ⊗ θ^i\text{.}
\]
If the metric is diagonal, then (suspending the summation
convention)
\[
θ^i = \sqrt{g_{ii}} \, dx^i\text{.}
\]</li>
<li>Use the first structure equation
\[
dθ^i = -\cnf{i}{j} ∧ θ^j
\]
and the skew symmetry relations
\[
\cnf{i}{j} = -\cnf{j}{i}
\]
to deduce the connection forms \(\cnf{i}{j}\).</li>
<li>Compute the curvature forms using the second structure equation
\[
\crf{i}{j} = d\cnf{i}{j} + \cnf{i}{k} ∧ \cnf{k}{j}
\]
and the skew symmetry relations
\[
\crf{i}{j} = -\crf{j}{i}\text{.}
\]
Note that it’s easiest to compute \(d\cnf{i}{j}\) when
\(\cnf{i}{j}\) is expressed in terms of the \(dx^i\), since
\(d(dx^i) = 0\).</li>
<li>Compute the components of the Ricci curvature tensor via
\[
\Ric{ij} = \crf{k}{i}(E_k, E_j)
\]
and the relations
\[
(θ^i ∧ θ^j)(E_k, E_l) = \begin{cases}
+1 & k = i ≠ j = l \\
-1 & l = i ≠ j = k \\
0 & \text{otherwise.}
\end{cases}
\]</li>
</ol>
</div>
</div>
</section>
<section>
<header>
<h2>Comparing the methods</h2>
</header>
<p>As we saw above, one advantage of the moving frame method is
that, in the worst case, one need only compute \(n(n-1)/2\)
independent connection forms, each with at most \(n\) components,
rather than \(n^2(n+1)/2\) independent Christoffel symbols—a
saving of \(n^2\) “component calculations”. Even in
the simplest case, when the metric is diagonal, you still need to
compute \(n^2\) possibly
non-zero independent Christoffel symbols, as opposed to \(n(n -
1)/2\) independent connection forms, each with at most two components—still a saving of \(n\) “component calculations”.</p>
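<p>To put numbers on this, take \(n = 4\), the case relevant to general relativity. Then
\[
\frac{n^2(n+1)}{2} = 40 \qquad \frac{n(n-1)}{2} \cdot n = 6 \cdot 4 = 24\text{,}
\]
so in the worst case there are 40 Christoffel symbols against 24 components of connection forms, a difference of \(16 = n^2\). For a diagonal metric the counts are \(n^2 = 16\) Christoffel symbols against \(6 \cdot 2 = 12\) components, a difference of \(4 = n\).</p>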
<p>Also, when computing a curvature form, one need only compute a
single exterior derivative of a connection form and \(n - 2\) wedge
products of connection forms. This turns out to be less tedious
than the corresponding calculation using coordinate methods of \(\Riem{k}{lij}\) for
fixed \(k\) and \(l\) such that \(k ≠ l\).</p>
<p>Furthermore, the orthonormality of the dual frame tends to cause
symmetries to appear earlier in the calculation, leading to less
wasted work. This is advantageous when you know the answer
you’re looking for and it’s particularly simple,
e.g. if you expect the Ricci curvature to be zero, because
calculations that grow unduly complicated are then a sign of an
undetected mistake. With coordinate methods, even if calculations
become complicated, you can’t rule out terms cancelling if
you continue, so errors become apparent only later.</p>
<p>On the other hand, the moving frame method requires a certain
amount of cleverness, first in coming up with the one-forms \(θ^i\) if
the metric isn’t diagonal, and second in deducing the
connection forms \(\cnf{i}{j}\). The coordinate methods require
less thought, and are more “plug and chug”. In fact,
once we examine the semi-Riemannian case later, we’ll see
that the coordinate methods remain unchanged, yet the moving frame
method becomes more complicated.</p>
</section>
<section>
<header>
<h2>Example 1: Orthogonal coordinates on 2D surfaces</h2>
</header>
<p>Let \(g\) be a Riemannian metric on a 2D manifold. The method of
moving frames makes calculating curvature particularly easy, since
there is exactly one connection form and one curvature form. For
example, consider the special case when the metric is diagonal,
i.e. with line element
\[
ds^2 = E \, du^2 + G \, dv^2\text{.}
\]
</p>
<ol>
<li>
<h3>Orthonormal dual frame</h3>
<p>We can then read off an orthonormal dual frame:
\[
ds^2 = {\underbrace{(\sqrt{E} \, du)}_{θ^1}}^2 + {\underbrace{(\sqrt{G} \, dv)}_{θ^2}}^2\text{,}
\]
i.e.
\[
θ^1 = \sqrt{E} \, du \qquad θ^2 = \sqrt{G} \, dv\text{,}
\]
and express the coordinate forms in terms of it:
\[
du = \frac{1}{\sqrt{E}} \, θ^1 \qquad dv = \frac{1}{\sqrt{G}} \, θ^2\text{.}
\]</p>
</li>
<li>
<h3>Connection forms</h3>
<p>The derivatives of the basis one-forms are
\[
\begin{aligned}
dθ^1 &= \frac{∂_v E}{2 \sqrt{E}} \, dv ∧ du = \frac{∂_v E}{2 E \sqrt{G}} \, θ^2 ∧ θ^1 \\
dθ^2 &= \frac{∂_u G}{2 \sqrt{G}} \, du ∧ dv = \frac{∂_u G}{2 G \sqrt{E}} \, θ^1 ∧ θ^2
\end{aligned}
\]
and the first structure equations are
\[
\begin{aligned}
dθ^1 &= -\cnf{1}{2} ∧ θ^2 \\
dθ^2 &= -\cnf{2}{1} ∧ θ^1 = \cnf{1}{2} ∧ θ^1\text{.}
\end{aligned}
\]
Rewriting the derivative equations to match the first structure
equations,
<!-- TODO: File a bug for \(\) in \text{}, and clean up the below once \(\) is supported inside \text{}. -->
\[
\begin{aligned}
dθ^1 &= -\overbrace{\left(\frac{∂_v E}{2 E \sqrt{G}} \, θ^1\right)}^{\text{one term of $\cnf{1}{2}$}} ∧ θ^2 \\
dθ^2 &= \underbrace{\left(-\frac{∂_u G}{2 G \sqrt{E}} \, θ^2\right)}_{\text{another term of $\cnf{1}{2}$}} ∧ θ^1\text{,}
\end{aligned}
\]
we can guess that
\[
\cnf{1}{2} = \frac{∂_v E}{2 E \sqrt{G}} \, θ^1 - \frac{∂_u G}{2 G \sqrt{E}} \, θ^2\text{.}
\]
This guess works, since
\[
\begin{aligned}
-\cnf{1}{2} ∧ θ^2
&= -\left( \frac{∂_v E}{2 E \sqrt{G}} \, θ^1 - \frac{∂_u G}{2 G \sqrt{E}} \, θ^2 \right) ∧ θ^2 \\
&= -\frac{∂_v E}{2 E \sqrt{G}} \, θ^1 ∧ θ^2 + \underbrace{\cancel{\frac{∂_u G}{2 G \sqrt{E}} \, θ^2 ∧ θ^2}}_{θ^2 ∧ θ^2 = 0} \\
&= dθ^1
\end{aligned}
\]
and
\[
\begin{aligned}
\cnf{1}{2} ∧ θ^1
&= \left( \frac{∂_v E}{2 E \sqrt{G}} \, θ^1 - \frac{∂_u G}{2 G \sqrt{E}} \, θ^2 \right) ∧ θ^1 \\
&= \underbrace{\cancel{\frac{∂_v E}{2 E \sqrt{G}} \, θ^1 ∧ θ^1}}_{θ^1 ∧ θ^1 = 0} - \frac{∂_u G}{2 G \sqrt{E}} \, θ^2 ∧ θ^1 \\
&= dθ^2\text{,}
\end{aligned}
\]
using the fact that \(θ^1 ∧ θ^1 = θ^2 ∧ θ^2 = 0\). Therefore, by uniqueness of connection forms, this is <em>the</em> connection form. Then, expressing \(\cnf{1}{2}\) in
terms of both the basis one-forms and the coordinate forms,
\[
\cnf{1}{2} = \frac{∂_v E}{2 E \sqrt{G}} \, θ^1 - \frac{∂_u G}{2 G \sqrt{E}} \, θ^2 = \frac{∂_v E}{2 \sqrt{EG}} \, du - \frac{∂_u G}{2 \sqrt{EG}} \, dv\text{.}
\]
(By a very similar method, one can derive the formula stated
previously for the \(\cnf{i}{j}\) of a diagonal metric.)</p>
</li>
<li>
<h3>Curvature forms</h3>
<p>Since we only have the single connection form \(\cnf{1}{2}\), there are
no non-zero \(\cnf{i}{k} ∧ \cnf{k}{j}\) terms, since \(i\), \(j\), and \(k\) would all have to be distinct. Using the expression for
\(\cnf{1}{2}\) in terms of the coordinate forms \(du\) and \(dv\),
and that \(d(du) = d(dv) = 0\), the single curvature form is:
\[
\begin{aligned}
\crf{1}{2} = d\cnf{1}{2} &= \pd{}{v} \left( \frac{∂_v E}{2 \sqrt{EG}} \right) dv ∧ du - \pd{}{u} \left( \frac{∂_u G}{2 \sqrt{EG}} \right) du ∧ dv \\
&\begin{alignedat}{2}
&= \, & -\frac{1}{2} \left( \pd{}{u} \left( \frac{∂_u G}{\sqrt{EG}} \right) + \pd{}{v} \left( \frac{∂_v E}{\sqrt{EG}} \right) \right) & \, du ∧ dv \\
&= \, & -\frac{1}{2 \sqrt{EG}} \left( \pd{}{u} \left( \frac{∂_u G}{\sqrt{EG}} \right) + \pd{}{v} \left( \frac{∂_v E}{\sqrt{EG}} \right) \right) & \, θ^1 ∧ θ^2\text{.}
\end{alignedat}
\end{aligned}
\]</p>
</li>
<li>
<h3>Gaussian curvature</h3>
<p>Therefore, we get the classical result that
the Gaussian curvature \(K\), which is equal to the single independent
component of the Riemann curvature tensor (up to sign), is
\[
\begin{aligned}
K &= \Riem{1}{212} = \crf{1}{2}(E_1, E_2) \\
&= -\frac{1}{2 \sqrt{EG}} \left( \pd{}{u} \left( \frac{∂_u G}{\sqrt{EG}} \right) + \pd{}{v} \left( \frac{∂_v E}{\sqrt{EG}} \right) \right) \, (θ^1 ∧ θ^2)(E_1, E_2) \\
&= -\frac{1}{2 \sqrt{EG}} \left( \pd{}{u} \left( \frac{∂_u G}{\sqrt{EG}} \right) + \pd{}{v} \left( \frac{∂_v E}{\sqrt{EG}} \right) \right)\text{.}
\end{aligned}
\]
</p>
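<p>As a check on this formula (an illustrative example, assuming a sphere of radius \(R\) with \(u\) the polar angle, \(0 \lt u \lt π\)), take \(E = R^2\) and \(G = R^2 \sin^2 u\). Then \(\sqrt{EG} = R^2 \sin u\), \(∂_v E = 0\), and \(∂_u G = 2 R^2 \sin u \cos u\), so
\[
K = -\frac{1}{2 R^2 \sin u} \, \pd{}{u} \left( \frac{2 R^2 \sin u \cos u}{R^2 \sin u} \right) = -\frac{1}{2 R^2 \sin u} \, (-2 \sin u) = \frac{1}{R^2}\text{,}
\]
which is the expected constant curvature.</p>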
</li>
</ol>
</section>
<section>
<header>
<h2>The semi-Riemannian case</h2>
</header>
<p>As we alluded to above, in the semi-Riemannian case, the
coordinate methods remain unchanged, but the moving frame method
gets more complicated. The equation that the one-forms must satisfy becomes
\[
g = ∑_i ε_i \, θ^i ⊗ θ^i\text{,}
\]
where each \(ε_i\) is \(±1\) throughout the whole chart domain.<sup><a href="#fn4" id="r4">[4]</a></sup>
For example, in the Riemannian case, we let all \(ε_i = 1\), and
in the Lorentzian case we let \(ε_0 = -1\) and all other \(ε_i = +1\). (The entire list \((ε_i)\) is called the <a href="https://en.wikipedia.org/wiki/Metric_signature"><em>signature</em></a> of the metric.)</p>
<p>If the metric is diagonal, then each \(g_{ii}\)
must be non-zero throughout the whole chart domain, so
\(ε_i = \sgn(g_{ii})\) and (suspending the summation convention)
\[
θ^i = ε_i \sqrt{\lvert g_{ii} \rvert} \, dx^i\text{.}
\]</p>
<p>The equivalent definition of the \(θ^i\) becomes
\[
g^*(θ^i, θ^j) = ε_i δ^i_j\text{,}
\]
where each \(ε_i\) is \(±1\) throughout the whole chart
domain. Furthermore, the Gram-Schmidt process becomes harder to
apply; you’ll need to find a <em>non-degenerate basis</em> first; see <a href="https://math.stackexchange.com/q/2622562/343314">this Math StackExchange question</a> for details.</p>
<div class="p">Both Cartan structure equations still hold, but the connection
and curvature forms are not skew symmetric anymore; instead,
they’re <em>semi-skew symmetric</em>. Suspending the summation convention,
<div class="important-equation">
\[
\begin{aligned}
\cnf{i}{j} &= -ε_i ε_j \cnf{j}{i} \\
\crf{i}{j} &= -ε_i ε_j \crf{j}{i}\text{.}
\end{aligned}
\]
</div>
Fortunately, this still implies that (suspending the Einstein summation convention)
\[
\cnf{i}{i} = \crf{i}{i} = 0\text{.}
\]
</div>
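<p>Here is a small example of how the signs play out (an illustrative metric of my choosing: the two-dimensional Lorentzian line element \(ds^2 = -x^2 \, dt^2 + dx^2\) on \(x \gt 0\), which is a wedge of flat Minkowski space in Rindler coordinates). One valid choice of orthonormal dual frame is \(θ^0 = x \, dt\) and \(θ^1 = dx\), with \(ε_0 = -1\) and \(ε_1 = +1\). Then
\[
dθ^0 = dx ∧ dt = -\left( \frac{1}{x} \, θ^0 \right) ∧ θ^1 \qquad dθ^1 = 0\text{,}
\]
and semi-skew symmetry gives \(\cnf{1}{0} = -ε_1 ε_0 \cnf{0}{1} = \cnf{0}{1}\). Comparing with \(dθ^0 = -\cnf{0}{1} ∧ θ^1\), we can guess
\[
\cnf{0}{1} = \cnf{1}{0} = \frac{1}{x} \, θ^0 = dt\text{,}
\]
which also satisfies \(dθ^1 = -\cnf{1}{0} ∧ θ^0 = -x \, dt ∧ dt = 0\). Since there is no third index to sum over, \(\crf{0}{1} = d\cnf{0}{1} = d(dt) = 0\), so the metric is flat, as it should be.</p>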
<p>The formula for the connection forms of a diagonal metric becomes
(suspending the summation convention)
\[
\begin{aligned}
\cnf{i}{j}
&= \frac{∂_j g_{ii}}{2 g_{ii} \sqrt{g_{jj}}} \, θ^i - ε_i ε_j \frac{∂_i g_{jj}}{2 g_{jj} \sqrt{g_{ii}}} \, θ^j \\
&= \frac{∂_j g_{ii}}{2 \sqrt{g_{ii} g_{jj}}} \, dx^i - ε_i ε_j \frac{∂_i g_{jj}}{2 \sqrt{g_{ii} g_{jj}}} \, dx^j\text{.}
\end{aligned}
\]
However, none of the deduced properties of diagonal metrics
depending on one or two coordinates change.</p>
<p>Finally, note that the relations
\[
(θ^i ∧ θ^j)(E_k, E_l) = \begin{cases}
+1 & k = i ≠ j = l \\
-1 & l = i ≠ j = k \\
0 & \text{otherwise.}
\end{cases}
\]
still hold.</p>
<p>As you can tell, the moving frame method forces you to keep
careful track of signs, which you may count as a disadvantage.</p>
<div class="cheatsheet">
<h2>Cheatsheet: The moving frame method for semi-Riemannian metrics</h2>
<div class="p">Given the components \(g_{ij}\) of a semi-Riemannian metric:
<ol>
<li>Find an orthonormal dual frame, i.e. basis one-forms \((θ^1, \dotsc, θ^n)\) such that
\[
g = ∑_i ε_i \, θ^i ⊗ θ^i\text{,}
\]
where each \(ε_i\) is \(±1\) throughout the whole chart
domain. If the metric is diagonal, then (suspending the
summation convention) \(ε_i = \sgn(g_{ii})\), and
\[
θ^i = ε_i \sqrt{\lvert g_{ii} \rvert} \, dx^i\text{.}
\]</li>
<li>Use the first structure equation
\[
dθ^i = -\cnf{i}{j} ∧ θ^j
\]
and the semi-skew symmetry relations (suspending the summation convention)
\[
\cnf{i}{j} = -ε_i ε_j \cnf{j}{i}
\]
to deduce the connection forms \(\cnf{i}{j}\).</li>
<li>Compute the curvature forms using the second structure equation
\[
\crf{i}{j} = d\cnf{i}{j} + \cnf{i}{k} ∧ \cnf{k}{j}
\]
and the semi-skew symmetry relations (suspending the summation convention)
\[
\crf{i}{j} = -ε_i ε_j \crf{j}{i}\text{.}
\]
Note that it’s easiest to compute \(d\cnf{i}{j}\) when
\(\cnf{i}{j}\) is expressed in terms of the \(dx^i\), since
\(d(dx^i) = 0\).</li>
<li>Compute the components of the Ricci curvature tensor via
\[
\Ric{ij} = \crf{k}{i}(E_k, E_j)
\]
and the relations
\[
(θ^i ∧ θ^j)(E_k, E_l) = \begin{cases}
+1 & k = i ≠ j = l \\
-1 & l = i ≠ j = k \\
0 & \text{otherwise.}
\end{cases}
\]</li>
</ol>
</div>
</div>
</section>
<section>
<header>
<h2>Example 2: The Schwarzschild metric</h2>
</header>
<p>Now we’re ready to tackle a more complicated metric. For
our first semi-Riemannian example, let \(g\) be the <a href="https://en.wikipedia.org/wiki/Schwarzschild_metric"><em>Schwarzschild metric</em></a>, with line element
\[
ds^2 = -f(r) \, dt^2 + f(r)^{-1} \, dr^2 + r^2 \, dΩ^2\text{,}
\]
where
\[
f(r) = 1 - \frac{r_S}{r}\text{,}
\]
\(r_S\) is the Schwarzschild radius, which is constant, and
\[
dΩ^2 = dθ^2 + \sin^2 θ \, dφ^2
\]
is the line element of the round metric \(\mathring{g}\) on the
two-sphere. We want to show that this metric is <em>Ricci-flat</em>,
i.e. has vanishing Ricci curvature.</p>
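<p>It helps to record the derivatives of \(f\) up front (a side computation of mine, using only the definition of \(f\) above):
\[
f'(r) = \frac{r_S}{r^2} \qquad f''(r) = -\frac{2 r_S}{r^3}\text{,}
\]
from which
\[
f''(r) + \frac{2}{r} f'(r) = 0 \qquad 1 - f(r) = r f'(r)\text{.}
\]
If the metric really is Ricci-flat, these are the kinds of identities I would expect the cancellations to come from, so it is worth keeping \(f\) general for as long as possible and substituting only at the end.</p>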
<p>We can skip some steps by taking advantage of the metric being
diagonal and depending only on the two coordinates \(r\) and \(θ\),
but in the interest of showing the general method, we’ll do
everything the “hard way” and then
double-check our results using the properties of diagonal
metrics we deduced earlier.</p>
<ol>
<li>
<h3>Orthonormal dual frame</h3>
<p>Since the metric is diagonal, we can read off an orthonormal dual
frame with its corresponding signature:
\[
ds^2 =
\; \underbrace{-}_{ε_0} \;
{\underbrace{\left(f(r)^{1/2} \, dt\right)}_{ϑ^0}}^2
\; \underbrace{+}_{ε_1} \;
{\underbrace{\left(f(r)^{-1/2} \, dr\right)}_{ϑ^1}}^2
\; \underbrace{+}_{ε_2} \;
{\underbrace{(r \, dθ)}_{ϑ^2}}^2
\; \underbrace{+}_{ε_3} \;
{\underbrace{(r \sin θ \, dφ)}_{ϑ^3}}^2\text{,}
\]
i.e.
\[
\begin{alignedat}{2}
ϑ^0 &= \, & f(r)^{1/2} & \, dt \\
ϑ^1 &= \, & f(r)^{-1/2} & \, dr \\
ϑ^2 &= \, & r & \, dθ \\
ϑ^3 &= \, & r \sin θ & \, dφ
\end{alignedat}
\]
with Lorentzian signature \(({-} \; {+} \; {+} \; {+})\). We can then
express the coordinate forms in terms of it:
\[
\begin{alignedat}{2}
dt &= \, & f(r)^{-1/2} & \, ϑ^0 \\
dr &= \, & f(r)^{1/2} & \, ϑ^1 \\
dθ &= \, & r^{-1} & \, ϑ^2 \\
dφ &= \, & r^{-1} \csc θ & \, ϑ^3\text{.}
\end{alignedat}
\]
Note that since we’re using \(θ\) as a coordinate, we use \(ϑ^λ\) to
denote the basis one-forms. Furthermore, since this metric is Lorentzian, we
adopt the convention that the index of the first coordinate is \(0\),
Greek indices start from \(0\), and Latin indices start from \(1\).</p>
</li>
<li>
<h3>Connection forms</h3>
<p>The derivatives of the basis one-forms are
\[
\begin{alignedat}{2}
dϑ^0 &= \frac{1}{2}f(r)^{-1/2} f'(r) \, dr ∧ dt & &= \frac{1}{2}f(r)^{-1/2} f'(r) \, ϑ^1 ∧ ϑ^0 \\
dϑ^1 &= 0 & & \\
dϑ^2 &= dr ∧ dθ & &= \frac{f(r)^{1/2}}{r} \, ϑ^1 ∧ ϑ^2 \\
dϑ^3 &= \sin θ \, dr ∧ dφ + r \cos θ \, dθ ∧ dφ & &= \frac{f(r)^{1/2}}{r} \, ϑ^1 ∧ ϑ^3 + \frac{\cot θ}{r} \, ϑ^2 ∧ ϑ^3\text{.}
\end{alignedat}
\]
By semi-skew symmetry, since \(ε_0 = -1\) and \(ε_i = 1\), \(\cnf{0}{i} = \cnf{i}{0}\) and
\(\cnf{i}{j} = -\cnf{j}{i}\). Therefore, we can explicitly write out the first structure equations:
\[
\begin{alignedat}{4}
dϑ^0 &= & &- \cnf{0}{1} ∧ ϑ^1 & &- \cnf{0}{2} ∧ ϑ^2 & &- \cnf{0}{3} ∧ ϑ^3 \\
dϑ^1 &= -\cnf{0}{1} ∧ ϑ^0 & & & &- \cnf{1}{2} ∧ ϑ^2 & &- \cnf{1}{3} ∧ ϑ^3 \\
dϑ^2 &= -\cnf{0}{2} ∧ ϑ^0 & &+ \cnf{1}{2} ∧ ϑ^1 & & & &- \cnf{2}{3} ∧ ϑ^3 \\
dϑ^3 &= -\cnf{0}{3} ∧ ϑ^0 & &+ \cnf{1}{3} ∧ ϑ^1 & &+ \cnf{2}{3} ∧ ϑ^2\text{,} & &
\end{alignedat}
\]
and rewriting the derivative equations to match:
\[
\begin{alignedat}{3}
dϑ^0 &= &
\; -\overbrace{\left(\frac{1}{2}f(r)^{-1/2} f'(r) \, ϑ^0\right)}^{\text{one term of $\cnf{0}{1}$}} &∧ ϑ^1 &
& \\
dϑ^1 &= 0 &
& &
& \\
dϑ^2 &= &
\overbrace{\left(-\frac{f(r)^{1/2}}{r} \, ϑ^2\right)}^{\text{one term of $\cnf{1}{2}$}} &∧ ϑ^1 &
& \\
dϑ^3 &= &
\underbrace{\left(-\frac{f(r)^{1/2}}{r} \, ϑ^3\right)}_{\text{one term of $\cnf{1}{3}$}} &∧ ϑ^1 &
\; + \; \underbrace{\left( -\frac{\cot θ}{r} \, ϑ^3 \right)}_{\text{one term of $\cnf{2}{3}$}} &∧ ϑ^2\text{,}
\end{alignedat}
\]
we can guess that
\[
\begin{alignedat}{2}
\cnf{0}{1} &= \, & \frac{1}{2} f(r)^{-1/2} f'(r) & \, ϑ^0 \\
\cnf{1}{2} &= \, & -\frac{f(r)^{1/2}}{r} & \, ϑ^2 \\
\cnf{1}{3} &= \, & -\frac{f(r)^{1/2}}{r} & \, ϑ^3 \\
\cnf{2}{3} &= \, & -\frac{\cot θ}{r} & \, ϑ^3\text{.}
\end{alignedat}
\]
Happily, plugging these expressions back into the first
structure equations, we find that they hold. Therefore, by
uniqueness of the connection forms, they are <em>the</em> connection forms.</p>
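<p>To spell out the check for two of the four equations (taking \(\cnf{0}{2} = \cnf{0}{3} = 0\), since they do not appear in the guess), the equation for \(dϑ^1\) becomes
\[
-\cnf{0}{1} ∧ ϑ^0 - \cnf{1}{2} ∧ ϑ^2 - \cnf{1}{3} ∧ ϑ^3
= -\frac{1}{2} f(r)^{-1/2} f'(r) \, ϑ^0 ∧ ϑ^0 + \frac{f(r)^{1/2}}{r} \, ϑ^2 ∧ ϑ^2 + \frac{f(r)^{1/2}}{r} \, ϑ^3 ∧ ϑ^3 = 0\text{,}
\]
and the equation for \(dϑ^3\) becomes
\[
-\cnf{0}{3} ∧ ϑ^0 + \cnf{1}{3} ∧ ϑ^1 + \cnf{2}{3} ∧ ϑ^2
= -\frac{f(r)^{1/2}}{r} \, ϑ^3 ∧ ϑ^1 - \frac{\cot θ}{r} \, ϑ^3 ∧ ϑ^2
= \frac{f(r)^{1/2}}{r} \, ϑ^1 ∧ ϑ^3 + \frac{\cot θ}{r} \, ϑ^2 ∧ ϑ^3\text{,}
\]
both matching the derivatives computed above. The equations for \(dϑ^0\) and \(dϑ^2\) work the same way.</p>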
<p>Rather than plugging our guess into the first structure equations, a
slicker way to see that it works would be to split up the first
structure equation thus:
\[
dϑ^λ = -∑_{λ \lt μ} \cnf{λ}{μ} ∧ ϑ^μ - ∑_{λ > μ} \cnf{λ}{μ} ∧ ϑ^μ\text{,}
\]
and notice that our derivative equations have the particularly simple form
\[
dϑ^λ = ∑_{λ \lt μ} (f_μ \, ϑ^λ) ∧ ϑ^μ\text{,}
\]
so setting
\[
\cnf{λ}{μ} = -f_μ \, ϑ^λ \quad \text{for $λ \lt μ$}
\]
takes care of the left sum above. Then by semi-skew symmetry, if \(λ \gt μ\),
\[
\lvert \cnf{λ}{μ} ∧ ϑ^μ \rvert = \lvert \cnf{μ}{λ} ∧ ϑ^μ \rvert = \lvert (f_λ \, ϑ^μ) ∧ ϑ^μ \rvert = 0\text{.}
\]
Thus all terms in the right sum above vanish as required.</p>
<p>Then, expressing the connection forms in terms of both the basis one-forms and
the coordinate forms,
\[
\begin{alignedat}{6}
\cnf{0}{1} &= & &\cnf{1}{0} & &= \quad & \frac{1}{2} f(r)^{-1/2} f'(r) \, &ϑ^0 & \quad &= \quad & \frac{1}{2} f'(r) \, &dt \\
\cnf{2}{1} &= & \; -&\cnf{1}{2} & &= \quad & \frac{f(r)^{1/2}}{r} \, &ϑ^2 & \quad &= \quad & f(r)^{1/2} \, &dθ \\
\cnf{3}{1} &= & \; -&\cnf{1}{3} & &= \quad & \frac{f(r)^{1/2}}{r} \, &ϑ^3 & \quad &= \quad & f(r)^{1/2} \sin θ \, &dφ \\
\cnf{3}{2} &= & \; -&\cnf{2}{3} & &= \quad & \frac{\cot θ}{r} \, &ϑ^3 & \quad &= \quad & \cos θ \, &dφ \text{.}
\end{alignedat}
\]</p>
<p>Note that \(\cnf{2}{1}\) has only one component instead of
two; this is because \(g_{11}\) doesn’t depend on \(θ\). The other connection forms are either zero or have only one component, as expected for a diagonal metric depending on two coordinates.</p>
</li>
<li>
<h3>Curvature forms</h3>
<p>Using the expressions for \(\cnf{μ}{ν}\) in terms of the coordinate
one-forms, since \(d(dt) = d(dr) = d(dθ) = d(dφ) = 0\), the derivatives of the connection forms are:
\[
\begin{aligned}
d \cnf{0}{1}
&= \frac{1}{2} f''(r) \, dr ∧ dt \\
&= \frac{1}{2} f''(r) \, ϑ^1 ∧ ϑ^0 \\
d \cnf{2}{1}
&= \frac{1}{2} f(r)^{-1/2} f'(r) \, dr ∧ dθ \\
&= \frac{f'(r)}{2r} \, ϑ^1 ∧ ϑ^2 \\
d \cnf{3}{1}
&= \frac{1}{2} f(r)^{-1/2} f'(r) \sin θ \, dr ∧ dφ + f(r)^{1/2} \cos θ \, dθ ∧ dφ \\
&= \frac{f'(r)}{2r} \, ϑ^1 ∧ ϑ^3 + \frac{f(r)^{1/2} \cot θ}{r^2} \, ϑ^2 ∧ ϑ^3 \\
d \cnf{3}{2}
&= -\sin θ \, dθ ∧ dφ \\
&= -\frac{1}{r^2} \, ϑ^2 ∧ ϑ^3\text{.}
\end{aligned}
\]
For \(\cnf{μ}{λ} ∧ \cnf{λ}{ν}\), recalling that one need only sum over \(λ ∉ \{ μ, ν \}\),
the non-zero terms are
\[
\begin{alignedat}{3}
\cnf{0}{λ} ∧ \cnf{λ}{2} &= \cnf{0}{1} ∧ \cnf{1}{2} & &= \; & -\frac{f'(r)}{2r} \, &ϑ^0 ∧ ϑ^2 \\
\cnf{0}{λ} ∧ \cnf{λ}{3} &= \cnf{0}{1} ∧ \cnf{1}{3} & &= \; & -\frac{f'(r)}{2r} \, &ϑ^0 ∧ ϑ^3 \\
\cnf{1}{λ} ∧ \cnf{λ}{3} &= \cnf{1}{2} ∧ \cnf{2}{3} & &= \; & \frac{f(r)^{1/2} \cot θ}{r^2} \, &ϑ^2 ∧ ϑ^3 \\
\cnf{2}{λ} ∧ \cnf{λ}{3} &= \cnf{2}{1} ∧ \cnf{1}{3} & &= \; & -\frac{f(r)}{r^2} \, &ϑ^2 ∧ ϑ^3\text{.}
\end{alignedat}
\]
Then we can compute the curvature forms:
\[
\begin{aligned}
\crf{0}{1} &= d\cnf{0}{1} = \frac{1}{2} f''(r) \, ϑ^1 ∧ ϑ^0 \\
\crf{0}{2} &= \cnf{0}{λ} ∧ \cnf{λ}{2} = -\frac{f'(r)}{2r} \, ϑ^0 ∧ ϑ^2 \\
\crf{0}{3} &= \cnf{0}{λ} ∧ \cnf{λ}{3} = -\frac{f'(r)}{2r} \, ϑ^0 ∧ ϑ^3 \\
\crf{1}{2} &= d\cnf{1}{2} = -\frac{f'(r)}{2r} \, ϑ^1 ∧ ϑ^2 \\
\crf{1}{3} &= d\cnf{1}{3} + \cnf{1}{λ} ∧ \cnf{λ}{3} \\
&= -\frac{f'(r)}{2r} \, ϑ^1 ∧ ϑ^3 - \frac{f(r)^{1/2} \cot θ}{r^2} \, ϑ^2 ∧ ϑ^3 + \frac{f(r)^{1/2} \cot θ}{r^2} \, ϑ^2 ∧ ϑ^3 \\
&= -\frac{f'(r)}{2r} \, ϑ^1 ∧ ϑ^3 \\
\crf{2}{3} &= d\cnf{2}{3} + \cnf{2}{λ} ∧ \cnf{λ}{3} \\
&= \frac{1}{r^2} \, ϑ^2 ∧ ϑ^3 - \frac{f(r)}{r^2} \, ϑ^2 ∧ ϑ^3 \\
&= \frac{1 - f(r)}{r^2} \, ϑ^2 ∧ ϑ^3\text{.}
\end{aligned}
\]
Again by semi-skew symmetry, since \(ε_0 = -1\) and \(ε_i = 1\), \(\crf{0}{i} = \crf{i}{0}\) and
\(\crf{i}{j} = -\crf{j}{i}\). Therefore,
\[
\begin{alignedat}{3}
\crf{0}{1} &= \; & \crf{1}{0} &= \; & \frac{1}{2} f''(r) \, &ϑ^1 ∧ ϑ^0 \\
\crf{0}{2} &= \; & \crf{2}{0} &= \; & -\frac{f'(r)}{2r} \, &ϑ^0 ∧ ϑ^2 \\
\crf{0}{3} &= \; & \crf{3}{0} &= \; & -\frac{f'(r)}{2r} \, &ϑ^0 ∧ ϑ^3 \\
\crf{1}{2} &= \; & -\crf{2}{1} &= \; & -\frac{f'(r)}{2r} \, &ϑ^1 ∧ ϑ^2 \\
\crf{1}{3} &= \; & -\crf{3}{1} &= \; & -\frac{f'(r)}{2r} \, &ϑ^1 ∧ ϑ^3 \\
\crf{2}{3} &= \; & -\crf{3}{2} &= \; & \frac{1 - f(r)}{r^2} \, &ϑ^2 ∧ ϑ^3\text{.}
\end{alignedat}
\]
</p>
</li>
<li>
<h3>Ricci curvature</h3>
<p>We can compute the Ricci tensor \(\Ric{μν}\) as
\[
\Ric{μν} = \Riem{λ}{μλν} = \crf{λ}{μ}(E_λ, E_ν)\text{,}
\]
where the \(E_λ\) comprise the dual frame to \(ϑ^λ\). From the relations
\[
(ϑ^μ ∧ ϑ^ν)(E_ρ, E_σ) = \begin{cases}
+1 & ρ = μ ≠ ν = σ \\
-1 & σ = μ ≠ ν = ρ \\
0 & \text{otherwise,}
\end{cases}
\]
we can examine the expressions above and conclude that \(\crf{ρ}{σ}(E_μ, E_ν)\) is possibly non-zero only when
\(\{ μ, ν \} = \{ ρ, σ \}\). Furthermore, examining the expression for \(\Ric{μν}\),
we can further conclude that \(\Ric{μν}\) is zero
when \(μ ≠ ν\). Therefore, it suffices to check \(\Ric{λλ}\). (One
of the properties we deduced for a diagonal metric depending
on two coordinates was that \(\Ric{}\) would be diagonal
except for possibly \(\Ric{12}\), but since \(\cnf{1}{2}\) turned
out to not have a \(ϑ^1\) term, that immediately leads to \(\Ric{12} = 0\).)</p>
<p>From the expressions above,
\[
\begin{aligned}
\crf{0}{1}(E_0, E_1) &= -\frac{1}{2} f''(r) \\
\crf{0}{2}(E_0, E_2) &= \crf{0}{3}(E_0, E_3) = \crf{1}{2}(E_1, E_2) = \crf{1}{3}(E_1, E_3) = -\frac{f'(r)}{2r} \\
\crf{2}{3}(E_2, E_3) &= \frac{1 - f(r)}{r^2}\text{,}
\end{aligned}
\]
so using the skew symmetry of two-forms
\[
\crf{μ}{ν}(E_ρ, E_σ) = -\crf{μ}{ν}(E_σ, E_ρ)
\]
and the semi-skew symmetry of \(\crf{μ}{ν}\)
\[
\crf{0}{i} = \crf{i}{0} \quad \text{and} \quad \crf{i}{j} = -\crf{j}{i} \text{,}
\]
we can compute \(\Ric{λλ}\):
\[
\begin{aligned}
\Ric{00} &= \crf{1}{0}(E_1, E_0) + \crf{2}{0}(E_2, E_0) + \crf{3}{0}(E_3, E_0) \\
&= -\crf{0}{1}(E_0, E_1) - \crf{0}{2}(E_0, E_2) - \crf{0}{3}(E_0, E_3) \\
&= \frac{1}{2} f''(r) + \frac{f'(r)}{r} \\
\Ric{11} &= \crf{0}{1}(E_0, E_1) + \crf{2}{1}(E_2, E_1) + \crf{3}{1}(E_3, E_1) \\
&= \crf{0}{1}(E_0, E_1) + \crf{1}{2}(E_1, E_2) + \crf{1}{3}(E_1, E_3) \\
&= -\Ric{00} \\
\Ric{22} &= \crf{0}{2}(E_0, E_2) + \crf{1}{2}(E_1, E_2) + \crf{3}{2}(E_3, E_2) \\
&= \crf{0}{2}(E_0, E_2) + \crf{1}{2}(E_1, E_2) + \crf{2}{3}(E_2, E_3) \\
&= -\frac{f'(r)}{r} + \frac{1 - f(r)}{r^2} \\
\Ric{33} &= \crf{0}{3}(E_0, E_3) + \crf{1}{3}(E_1, E_3) + \crf{2}{3}(E_2, E_3) \\
&= \Ric{22}\text{.}
\end{aligned}
\]</p>
<p>Finally, a computation shows that for \(f(r) = 1 - \frac{r_S}{r}\),
\[
\frac{1 - f(r)}{r^2} = -\frac{1}{2} f''(r) = \frac{f'(r)}{r} \text{,}
\]
so all the Ricci tensor components above vanish.<sup><a href="#fn5" id="r5">[5]</a></sup></p>
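<p>To spell out that computation: for \(f(r) = 1 - \frac{r_S}{r}\), the derivatives are
\[
f'(r) = \frac{r_S}{r^2} \quad \text{and} \quad f''(r) = -\frac{2 r_S}{r^3}\text{,}
\]
so each of the three expressions in the identity equals \(r_S / r^3\). Substituting into the components computed above,
\[
\begin{aligned}
\Ric{00} &= \frac{1}{2} f''(r) + \frac{f'(r)}{r} = -\frac{r_S}{r^3} + \frac{r_S}{r^3} = 0 \\
\Ric{22} &= -\frac{f'(r)}{r} + \frac{1 - f(r)}{r^2} = -\frac{r_S}{r^3} + \frac{r_S}{r^3} = 0\text{,}
\end{aligned}
\]
and \(\Ric{11} = -\Ric{00}\) and \(\Ric{33} = \Ric{22}\) vanish along with them.</p>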
</li>
</ol>
</section>
<section>
<header>
<h2>Example 3: The pp-wave metric</h2>
</header>
<p>For our last example, to keep things interesting, let’s consider a non-diagonal metric. Let
\[
g = H(u, x, y) \, du ⊗ du + du ⊗ dv + dv ⊗ du + dx ⊗ dx + dy ⊗ dy
\]
be the <a href="https://en.wikipedia.org/wiki/Pp-wave_spacetime"><em>pp-wave metric</em></a>,
where \(H(u, x, y)\) is some smooth
function. We want to derive a necessary and sufficient condition for \(g\) to be Ricci-flat.</p>
<ol>
<li>
<h3>Orthonormal dual frame</h3>
<p>This metric has the matrix
\[
G = \begin{pmatrix}
H & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}\text{,}
\]
which has inverse
\[
G^{-1} = \begin{pmatrix}
0 & 1 & 0 & 0 \\
1 & -H & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}\text{,}
\]
so the dual metric is
\[
g^* = ∂_u ⊗ ∂_v + ∂_v ⊗ ∂_u - H(u, x, y) \, ∂_v ⊗ ∂_v + ∂_x ⊗ ∂_x + ∂_y ⊗ ∂_y\text{.}
\]
We can see that \(dx\) and \(dy\) form part of an orthonormal dual frame, but
we have to find the other two, which involve \(du\) and \(dv\). We also
have to figure out the signature of the metric, so we leave the signs \(ε_0\) and \(ε_1\) undetermined for now.
So set
\[
\begin{aligned}
θ^0 &= A \, du + B \, dv \\
θ^1 &= C \, du + D \, dv \\
θ^2 &= dx \\
θ^3 &= dy\text{,}
\end{aligned}
\]
and solve for \(A\), \(B\), \(C\), and \(D\) using the orthonormality
conditions
\[
\begin{aligned}
g^*(θ^0, θ^0) &= 2AB - B^2 H = ε_0 \\
g^*(θ^0, θ^1) &= AD + BC - BDH = 0 \\
g^*(θ^1, θ^1) &= 2CD - D^2 H = ε_1\text{.}
\end{aligned}
\]
The tricky thing is to pick the \(θ^μ\) without assuming that \(H\) is
non-zero. The simplest way to do that is to assume that none of the
coefficients of \(H\) vanish, and, since we have four unknowns (not
counting \(ε_0\) and \(ε_1\)) and three equations, to set \(B = 1\).
Then the first equation gives \(A = (ε_0 + H)/2\), the second equation
gives \(C = D(H - A)\), and plugging everything into the third equation
gives \(D^2 = -ε_1 / ε_0\), which implies that \(ε_1 = -ε_0\) and
\(D = ±1\). Set \(ε_0 = -1\) to make the frame have a Lorentzian signature
\(({-} \; {+} \; {+} \; {+})\), and let \(D = ε\). Then
\[
\begin{aligned}
A &= \frac{H - 1}{2} \\
B &= 1 \\
C &= ε\frac{H + 1}{2} \\
D &= ε\text{.}
\end{aligned}
\]
Setting \(ε = 1\) for symmetry, we finally have
\[
\begin{aligned}
θ^0 &= \frac{H-1}{2} \, du + dv \\
θ^1 &= \frac{H+1}{2} \, du + dv = θ^0 + du \\
θ^2 &= dx \\
θ^3 &= dy
\end{aligned}
\]
and
\[
\begin{aligned}
du &= θ^1 - θ^0 \\
dx &= θ^2 \\
dy &= θ^3\text{;}
\end{aligned}
\]
it’ll turn out that we don’t need to express \(dv\) in terms of
the \(θ^μ\).</p>
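<p>As a quick sanity check, plugging \(A = \frac{H-1}{2}\), \(B = 1\), \(C = \frac{H+1}{2}\), and \(D = 1\) back into the orthonormality conditions gives
\[
\begin{aligned}
g^*(θ^0, θ^0) &= 2AB - B^2 H = (H - 1) - H = -1 \\
g^*(θ^0, θ^1) &= AD + BC - BDH = \frac{H-1}{2} + \frac{H+1}{2} - H = 0 \\
g^*(θ^1, θ^1) &= 2CD - D^2 H = (H + 1) - H = 1\text{,}
\end{aligned}
\]
which matches \(ε_0 = -1\) and \(ε_1 = 1\) for every value of \(H\), including \(H = 0\).</p>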
</li>
<li>
<h3>Connection forms</h3>
<p>Since
\[
\begin{aligned}
θ^1 &= θ^0 + du \\
θ^2 &= dx \\
θ^3 &= dy\text{,}
\end{aligned}
\]
the derivatives of the basis one-forms are
\[
\begin{aligned}
dθ^0 &= dθ^1 = \frac{1}{2} (H_x \, dx + H_y \, dy) ∧ du \\
&= \frac{H_x}{2} \, θ^2 ∧ θ^1 - \frac{H_x}{2} \, θ^2 ∧ θ^0 + \frac{H_y}{2} \, θ^3 ∧ θ^1 - \frac{H_y}{2} \, θ^3 ∧ θ^0 \\
dθ^2 &= 0 \\
dθ^3 &= 0\text{.}
\end{aligned}
\]
</p>
<p>Similarly to the Schwarzschild example, by semi-skew symmetry,
since \(ε_0 = -1\) and \(ε_i = 1\), \(\cnf{0}{i} = \cnf{i}{0}\) and
\(\cnf{i}{j} = -\cnf{j}{i}\). Therefore, we can explicitly
write out the first structure equations:
\[
\begin{alignedat}{4}
dθ^0 &= & &- \cnf{0}{1} ∧ θ^1 & &- \cnf{0}{2} ∧ θ^2 & &- \cnf{0}{3} ∧ θ^3 \\
dθ^1 &= -\cnf{0}{1} ∧ θ^0 & & & &- \cnf{1}{2} ∧ θ^2 & &- \cnf{1}{3} ∧ θ^3 \\
dθ^2 &= -\cnf{0}{2} ∧ θ^0 & &+ \cnf{1}{2} ∧ θ^1 & & & &- \cnf{2}{3} ∧ θ^3 \\
dθ^3 &= -\cnf{0}{3} ∧ θ^0 & &+ \cnf{1}{3} ∧ θ^1 & &+ \cnf{2}{3} ∧ θ^2\text{.} & &
\end{alignedat}
\]
Unlike the Schwarzschild example, though, we can’t
simply read off the non-zero connection forms; for example, it’s not immediately clear whether the \(\frac{H_x}{2} \, θ^2 ∧ θ^1\) term in \(dθ^0\) belongs
to the \(\cnf{0}{1} ∧ θ^1\) term or the \(\cnf{0}{2} ∧ θ^2\) term. But since \(dθ^0 = dθ^1\), we can guess that \(\cnf{0}{2} = \cnf{1}{2}\)
and \(\cnf{0}{3} = \cnf{1}{3}\). Subtracting the first structure equation for \(dθ^0\) from the one for \(dθ^1\), we get
\[
\cnf{0}{1} ∧ (θ^1 - θ^0) = 0\text{,}
\]
i.e. that \(\cnf{0}{1} ∼ θ^1 - θ^0\). However, plugging this
into the first structure equation for \(dθ^0\) or \(dθ^1\), we get a \(θ^0
∧ θ^1\) term, which isn’t present in the derivative
equation for \(dθ^0 = dθ^1\), which then implies that \(\cnf{0}{1} = 0\). Thus,
there’s only one way to assign each term of the
derivative equation for \(dθ^0 = dθ^1\) to \(\cnf{0}{2} ∧ θ^2\) or \(\cnf{0}{3} ∧ θ^3\):
\[
\begin{aligned}
\cnf{0}{2} &= \cnf{1}{2} = -\frac{H_x}{2} \, (θ^1 - θ^0) = -\frac{H_x}{2} \, du \\
\cnf{0}{3} &= \cnf{1}{3} = -\frac{H_y}{2} \, (θ^1 - θ^0) = -\frac{H_y}{2} \, du\text{.}
\end{aligned}
\]
Plugging this into the structure equations for \(dθ^2\) and \(dθ^3\), we get
\[
\begin{aligned}
dθ^2 &= -\cnf{0}{2} ∧ θ^0 + \cnf{1}{2} ∧ θ^1 - \cnf{2}{3} ∧ θ^3 \\
&= \cnf{0}{2} ∧ du - \cnf{2}{3} ∧ θ^3 \\
&= -\frac{H_x}{2} \, du ∧ du - \cnf{2}{3} ∧ θ^3 \\
&= -\cnf{2}{3} ∧ θ^3 \\
dθ^3 &= -\cnf{0}{3} ∧ θ^0 + \cnf{1}{3} ∧ θ^1 + \cnf{2}{3} ∧ θ^2 \\
&= \cnf{0}{3} ∧ du + \cnf{2}{3} ∧ θ^2 \\
&= -\frac{H_y}{2} \, du ∧ du + \cnf{2}{3} ∧ θ^2 \\
&= \cnf{2}{3} ∧ θ^2\text{.}
\end{aligned}
\]
Since \(dθ^2 = dθ^3 = 0\) from the derivative equations, \(\cnf{2}{3}\) is proportional to both \(θ^2\) and \(θ^3\), i.e. \(\cnf{2}{3} = 0\). We’ve found expressions for \(\cnf{μ}{ν}\) that
satisfy the first structure equations. Therefore, by
uniqueness of the connection forms, these expressions are <em>the</em> connection forms. Then, expressing the connection forms in terms of both the basis one-forms and
the coordinate forms,
\[
\begin{aligned}
\cnf{0}{2} &= \cnf{2}{0} = \cnf{1}{2} = -\cnf{2}{1} = -\frac{H_x}{2} \, (θ^1 - θ^0) = -\frac{H_x}{2} \, du \\
\cnf{0}{3} &= \cnf{3}{0} = \cnf{1}{3} = -\cnf{3}{1} = -\frac{H_y}{2} \, (θ^1 - θ^0) = -\frac{H_y}{2} \, du\text{.}
\end{aligned}
\]</p>
</li>
<li>
<h3>Curvature forms</h3>
<p>Using the expressions for \(\cnf{μ}{ν}\) in terms of the coordinate
one-forms, since \(d(du) = 0\), the derivative of \(\cnf{0}{2} = \cnf{1}{2}\) is
\[
\begin{aligned}
d\cnf{0}{2} &= d\cnf{1}{2} = -\frac{1}{2} \, dH_x ∧ du \\
&= -\frac{1}{2} (H_{xx} \, dx + H_{xy} \, dy) ∧ du \\
&= -\frac{1}{2} (H_{xx} \, θ^2 + H_{xy} \, θ^3) ∧ (θ^1 - θ^0)
\end{aligned}
\]
and similarly the derivative of \(\cnf{0}{3} = \cnf{1}{3}\) is
\[
d\cnf{0}{3} = d\cnf{1}{3} = -\frac{1}{2} (H_{xy} \, θ^2 + H_{yy} \, θ^3) ∧ (θ^1 - θ^0)\text{.}
\]
Since
all the connection forms are proportional to \(du\), all possible
sums \(\cnf{μ}{λ} ∧ \cnf{λ}{ν}\) equal \(0\). Then we can compute the
curvature forms:
\[
\begin{aligned}
\crf{0}{2} &= \crf{1}{2} = -\frac{1}{2} (H_{xx} \, θ^2 + H_{xy} \, θ^3) ∧ (θ^1 - θ^0) \\
\crf{0}{3} &= \crf{1}{3} = -\frac{1}{2} (H_{xy} \, θ^2 + H_{yy} \, θ^3) ∧ (θ^1 - θ^0)\text{.}
\end{aligned}
\]
Again by semi-skew symmetry, since \(ε_0 = -1\) and \(ε_i = 1\), \(\crf{0}{i} = \crf{i}{0}\) and
\(\crf{i}{j} = -\crf{j}{i}\). Therefore,
\[
\begin{aligned}
\crf{0}{2} &= \crf{2}{0} = \crf{1}{2} = -\crf{2}{1} = -\frac{1}{2} (H_{xx} \, θ^2 + H_{xy} \, θ^3) ∧ (θ^1 - θ^0) \\
\crf{0}{3} &= \crf{3}{0} = \crf{1}{3} = -\crf{3}{1} = -\frac{1}{2} (H_{xy} \, θ^2 + H_{yy} \, θ^3) ∧ (θ^1 - θ^0)\text{.}
\end{aligned}
\]
</p>
</li>
<li>
<h3>Ricci curvature</h3>
<p>We can compute the Ricci tensor \(\Ric{μν}\) as
\[
\Ric{μν} = \Riem{λ}{μλν} = \crf{λ}{μ}(E_λ, E_ν)\text{,}
\]
where the \(E_λ\) comprise the dual frame to \(θ^λ\). First, using the relations
\[
(θ^μ ∧ θ^ν)(E_ρ, E_σ) = \begin{cases}
+1 & ρ = μ ≠ ν = σ \\
-1 & σ = μ ≠ ν = ρ \\
0 & \text{otherwise,}
\end{cases}
\]
we compute
\[
\Ric{0ν} = \crf{λ}{0}(E_λ, E_ν) = \crf{0}{λ}(E_λ, E_ν) = \crf{0}{2}(E_2, E_ν) + \crf{0}{3}(E_3, E_ν)
\]
and see that it’s only non-zero for \(ν ∈ \{ 0, 1 \}\); furthermore, \(\Ric{01} = -\Ric{00}\). Similarly,
\[
\Ric{1ν} = \crf{λ}{1}(E_λ, E_ν) = -\crf{1}{λ}(E_λ, E_ν) = -\crf{0}{λ}(E_λ, E_ν) = -\Ric{0ν}\text{.}
\]
For the last two, we can save some effort by calculating \((θ^1 - θ^0)(E_0 + E_1) = 0\), which implies
\[
(θ^μ ∧ (θ^1 - θ^0))(E_ν, E_0 + E_1) = 0\text{.}
\]
Then, using skew symmetry of two-forms
\[
\crf{μ}{ν}(E_ρ, E_σ) = -\crf{μ}{ν}(E_σ, E_ρ)\text{,}
\]
we compute
\[
\Ric{2ν} = \crf{λ}{2}(E_λ, E_ν) = -\crf{λ}{2}(E_ν, E_λ) = -\crf{0}{2}(E_ν, E_0) - \crf{1}{2}(E_ν, E_1) = -\crf{0}{2}(E_ν, E_0 + E_1) = 0
\]
and
\[
\Ric{3ν} = \crf{λ}{3}(E_λ, E_ν) = -\crf{λ}{3}(E_ν, E_λ) = -\crf{0}{3}(E_ν, E_0) - \crf{1}{3}(E_ν, E_1) = -\crf{0}{3}(E_ν, E_0 + E_1) = 0\text{,}
\]
so it suffices to compute \(\Ric{00}\):
\[
\begin{aligned}
\Ric{00} &= \crf{0}{2}(E_2, E_0) + \crf{0}{3}(E_3, E_0) \\
&= \frac{1}{2} (H_{xx} + H_{yy})\text{.}
\end{aligned}
\]
Finally, we can conclude that the pp-wave metric is Ricci-flat
exactly when
\[
H_{xx} + H_{yy} = 0\text{.}
\]</p>
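<p>As an illustration (the particular \(H\) here is my own choice of example), take
\[
H(u, x, y) = a(u) \, (x^2 - y^2)
\]
for an arbitrary smooth function \(a(u)\). Then
\[
H_{xx} + H_{yy} = 2 a(u) - 2 a(u) = 0\text{,}
\]
so the corresponding pp-wave metric is Ricci-flat. By contrast, \(H = x^2 + y^2\) gives \(H_{xx} + H_{yy} = 4\), so that metric is not Ricci-flat.</p>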
</li>
</ol>
</section>
<section>
<header>
<h2>Further reading</h2>
</header>
<p>The classic reference for the method of moving frames is
Volume 2, Chapter 7 of Spivak’s “A
Comprehensive Introduction to Differential Geometry”. However,
this only covers the Riemannian case. For the semi-Riemannian case,
look to § 1.8 of O’Neill’s “The Geometry of
Kerr Black Holes”, or § 14.6 of <a href="https://en.wikipedia.org/wiki/Gravitation_(book)">Gravitation</a>.</p>
</section>
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] A metric can only be diagonal <em>with respect to a
particular coordinate system</em>, but for brevity I’ll only mention it here. <a href="#r1">↩</a></p>
<p id="fn2">[2] See p. 52 of <em>The Geometry of Kerr Black Holes</em> by Barrett O’Neill. <a href="#r2">↩</a></p>
<p id="fn3">[3] The paper <a href="https://arxiv.org/abs/gr-qc/9602015">“Ricci Tensor of Diagonal Metric”</a> has
a similar discussion using coordinate methods; note that the
calculations are much more laborious! <a href="#r3">↩</a></p>
<p id="fn4">[4] One subtle technical point is that there might not be such an expression for \(g\) throughout the whole chart domain; see <a href="https://math.stackexchange.com/q/2625887/343314">this Math StackExchange question</a> for
details. In practice, though, this doesn’t turn out to be a
problem. <a href="#r4">↩</a></p>
<p id="fn5">[5] The Schwarzschild metric describes the field outside
a spherically symmetric and non-rotating massive body.
If we let \(f(r)\) have an \(r^{-2}\) term, e.g.
\[
f(r) = 1 - \frac{r_S}{r} + \frac{r_Q^2}{r^2}
\]
for some constant \(r_Q\), then we have non-vanishing Ricci
components. However, this metric, called the <a href="https://en.wikipedia.org/wiki/Reissner%E2%80%93Nordstr%C3%B6m_metric">Reissner–Nordström metric</a>,
is still useful, as it describes a <em>charged</em>, spherically
symmetric, non-rotating massive body. <a href="#r5">↩</a></p>
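<p>To see this concretely, here is a sketch of the check, reusing the expressions for \(\Ric{λλ}\) from the Schwarzschild example. With the extra term,
\[
f'(r) = \frac{r_S}{r^2} - \frac{2 r_Q^2}{r^3} \quad \text{and} \quad f''(r) = -\frac{2 r_S}{r^3} + \frac{6 r_Q^2}{r^4}\text{,}
\]
so
\[
\begin{aligned}
\Ric{00} &= \frac{1}{2} f''(r) + \frac{f'(r)}{r} = \frac{r_Q^2}{r^4} \\
\Ric{22} &= -\frac{f'(r)}{r} + \frac{1 - f(r)}{r^2} = \frac{r_Q^2}{r^4}\text{,}
\end{aligned}
\]
and then \(\Ric{11} = -\Ric{00}\) and \(\Ric{33} = \Ric{22}\) are non-zero as well whenever \(r_Q ≠ 0\).</p>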
</section>
https://www.akalin.com/intro-erasure-codes
A Gentle Introduction to Erasure Codes
2017-11-30T00:00:00-08:00
Fred Akalin
<script src="https://unpkg.com/preact@8.2.7"></script>
<script src="https://cdn.rawgit.com/akalin/jsbn/v1.4/jsbn.js"></script>
<script src="https://cdn.rawgit.com/akalin/jsbn/v1.4/jsbn2.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/arithmetic.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/math.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/carryless.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/field_257.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/field_256.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/rational.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/matrix.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/cauchy_erasure_code.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/demo_common.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/carryless_demo_common.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/matrix_demo_common.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/erasure_code_demo_common.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/carryless_div_demo.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/row_reduce.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/carryless_add_demo.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/carryless_mul_demo.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/carryless_div_demo_util.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/field_257_demo.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/field_256_demo.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/cauchy_matrix_demo.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/matrix_inverse_demo.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/compute_parity_demo.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/intro-erasure-codes/8d5e10f/reconstruct_data_demo.js"></script>
<script>
KaTeXMacros = {
"\\clplus": "\\oplus",
"\\clminus": "\\ominus",
"\\clmul": "\\otimes",
"\\cldiv": "\\oslash",
"\\bclmod": "\\mathbin{\\mathrm{clmod}}",
};
</script>
<section>
<header>
<h2>1. Overview</h2>
</header>
<p>This article explains Reed-Solomon erasure codes and the problems
they solve in gory detail, with the aim of providing enough
background to understand how the <a href="https://en.wikipedia.org/wiki/Parchive">PAR1
and PAR2</a> file formats work, the details of which will be covered in
future articles.</p>
<p>I’m assuming that the reader is familiar with programming,
but has not had much exposure to coding theory or linear
algebra. Thus, I’ll review the basics and treat the results we
need as a “black box”, stating them and moving
on. However, I’ll give self-contained proofs of those results
in a companion article.</p>
<p>So let’s start with the problem we’re trying to
solve! Let’s say you have \(n\) files of roughly the
same size, and you want to guard against \(m\) of them being
lost or corrupted. To do so, you generate \(m\)
<em>parity files</em>
ahead of time, and if in the future you lose up to \(m\) of the data
files, you can use an equal number of parity files to recover the
lost data files.</p>
<style>
.fig {
display: flex;
flex-flow: row;
width: 100%;
}
.fig img {
border: 1px solid black;
height: auto;
}
.fig div.column {
display: flex;
align-items: center;
flex-flow: column;
flex-grow: 1;
justify-content: center;
}
#fig1 div.column > div {
margin: 0.5em;
}
#fig1 img {
width: 9.375em;
}
#fig2 img {
margin: 0.5em 0em;
width: 6.25em;
}
</style>
<figure>
<div class="fig" id="fig1">
<div class="column">
<div>
<div><code>cashcat0.jpg</code></div>
<img src="intro-erasure-codes-files/cashcat0.jpg" />
</div>
<div>
<div><code>cashcat1.jpg</code></div>
<img src="intro-erasure-codes-files/cashcat1.jpg" />
</div>
<div>
<div><code>cashcat2.jpg</code></div>
<img src="intro-erasure-codes-files/cashcat2.jpg" />
</div>
</div>
<div class="column">
<div>\(\xmapsto{\mathtt{GenerateParityFiles}}\)</div>
</div>
<div class="column">
<div>
<div><code>cashcats.p00</code></div>
<img src="intro-erasure-codes-files/cashcats.p00.png" />
</div>
<div>
<div><code>cashcats.p01</code></div>
<img src="intro-erasure-codes-files/cashcats.p01.jpg" />
</div>
</div>
</div>
<figcaption>
<span class="figure-text">Figure 1</span>  Using
parity codes to protect against the loss or corruption of
up to two images (out of three) of <a href="https://twitter.com/CatsAndMoney">cashcats</a>.
</figcaption>
</figure>
<figure>
<div class="fig" id="fig2">
<div class="column">
<img src="intro-erasure-codes-files/cashcat0-glitched.png" />
<img src="intro-erasure-codes-files/cashcat1.jpg" />
<img src="intro-erasure-codes-files/broken-image.png" />
<img src="intro-erasure-codes-files/cashcats.p00.png" />
<img src="intro-erasure-codes-files/cashcats.p01.jpg" />
</div>
<div class="column">
<div>\(\xmapsto{\mathtt{ReconstructDataFiles}}\)</div>
</div>
<div class="column">
<img src="intro-erasure-codes-files/cashcat0.jpg" />
<img src="intro-erasure-codes-files/cashcat1.jpg" />
<img src="intro-erasure-codes-files/cashcat2.jpg" />
</div>
</div>
<figcaption>
<span class="figure-text">Figure 2</span>  With a
corrupted and a missing file, recovering the original
cashcat images using the parity files from Figure 1.
</figcaption>
</figure>
<p>Note that this works even if you lose some of the parity files
also; as long as you have \(n\) files, whether they be data or
parity files, you’ll be able to recover the original \(n\)
data files. Compare making \(n\) parity files with simply making a
copy of the \(n\) data files (for \(n > 1\)). In the latter case, if
you lose both a data file and its copy, that data file becomes
unrecoverable! So parity files take the same amount of space but
provide superior recovery capabilities.</p>
<p>Now we can reduce the problem above to a byte-level problem
as follows. Have <code>GenerateParityFiles</code> pad all the
data files so they’re the same size, and then for each
byte position <code>i</code> call a function <code>ComputeParityBytes</code> on the <code>i</code>th
byte of each data file, and store the results into the <code>i</code>th
byte of each parity file. Also take a checksum or hash of
each data file and store those (along with the original data
file sizes) with the parity files. Then, <code>ReconstructDataFiles</code>
can detect corrupted files using the checksums/hashes and
treat them as missing, and then for each byte position <code>i</code> it
can call a function <code>ReconstructDataBytes</code> on the <code>i</code>th
byte of each good data and parity file to recover the <code>i</code>th byte of the corrupted/missing data files.</p>
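<p>The reduction above can be sketched in runnable JavaScript. This is only an illustration under my own assumptions: files are <code>Uint8Array</code>s, padding uses zero bytes, the byte-level function is passed in as a parameter, and the checksum/hash step is left out. The names <code>generateParityFiles</code> and <code>xorParity</code> are hypothetical, and <code>xorParity</code> is just a stand-in byte-level code for a single parity file.</p>

```javascript
// Hypothetical sketch: reduce the file-level problem to a byte-level one.
// `computeParityBytes(dataBytes, m)` is assumed to return an array of m bytes.
function generateParityFiles(dataFiles, m, computeParityBytes) {
  // Pad every data file with zeros up to the length of the longest one.
  const size = Math.max(...dataFiles.map((f) => f.length));
  const padded = dataFiles.map((f) => {
    const p = new Uint8Array(size);
    p.set(f);
    return p;
  });

  const parityFiles = Array.from({ length: m }, () => new Uint8Array(size));
  for (let i = 0; i < size; i++) {
    // The i-th byte of each data file yields the i-th byte of each parity file.
    const parityBytes = computeParityBytes(padded.map((f) => f[i]), m);
    parityBytes.forEach((b, j) => { parityFiles[j][i] = b; });
  }

  // The original sizes are kept so that the padding can be stripped later.
  // (Checksums or hashes of the data files are omitted from this sketch.)
  return { parityFiles, originalSizes: dataFiles.map((f) => f.length) };
}

// Stand-in byte-level code for m = 1: xor all the data bytes together.
const xorParity = (dataBytes, m) => [dataBytes.reduce((a, b) => a ^ b, 0)];

const { parityFiles, originalSizes } = generateParityFiles(
  [Uint8Array.of(0xda, 0xdb), Uint8Array.of(0x0d)], 1, xorParity);
console.log(Array.from(parityFiles[0])); // [ 215, 219 ], i.e. d7 and db
console.log(originalSizes);              // [ 2, 1 ]
```

<p>Passing the byte-level function in as a parameter keeps the file handling independent of which erasure code is used.</p>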
<p>A byte error where we <em>know</em> the position of the
dropped/corrupted byte is called an <em>erasure</em>. Then, the pair
of functions <code>ComputeParityBytes</code> and <code>ReconstructDataBytes</code> which
behave as described above implements what is called an <a href="https://en.wikipedia.org/wiki/Erasure_code#Optimal_erasure_codes"><em>optimal erasure code</em></a>;
it’s an erasure code because it guards only against byte
erasures, and not more general errors where we don’t know
which data bytes have been corrupted, and it’s optimal because
in general you need at least \(n\) known bytes to recover the \(n\)
data bytes, and that bound is achieved.</p>
<div class="p">In detail, an optimal erasure code is composed of some
set of possible \((n, m)\) pairs, and for each possible pair, a
function
<pre class="code-container"><code class="language-javascript">ComputeParityBytes<n, m>(data: byte[n]) -> (parity: byte[m])</code></pre>
that computes \(m\) parity bytes given \(n\) data bytes, and a
function
<pre class="code-container"><code class="language-javascript">ReconstructDataBytes<n, m>(partialData: (byte?)[n], partialParity: (byte?)[m]) -> ((data: byte[n]) | Error)</code></pre>
that takes in a partial list of data and parity bytes from an
earlier call to <code>ComputeParityBytes</code>, and returns the full
list of data bytes if there are at least \(n\) known data or parity
bytes (i.e., there are no more than \(m\) omitted data or parity
bytes), and an error otherwise.</div>
<p>(In the above pseudocode, I’m using <code>T[n]</code> to mean an array of <code>n</code> objects of type <code>T</code>,
and <code>byte?</code> to mean <code>byte | None</code>. Also, I’ll omit the <code>-Bytes<n, m></code> suffix
from now on.)</p>
<p>By the end of this article, we’ll find out exactly how the
following example works:</p>
<div class="interactive-example">
<h3>Example 1: <code>ComputeParity</code> and <code>ReconstructData</code></h3>
<div class="interactive-example" id="computeParityDemo">
<h3><code>ComputeParity</code></h3>
Let
<span style="white-space: nowrap;">
<var>d</var> = [ da, db, 0d ]
</span>
be the input data bytes and let
<span style="white-space: nowrap;">
<var>m</var> = 2
</span>
be the desired parity byte count. Then the output parity bytes
are
<span style="white-space: nowrap;">
<var>p</var> = [ <span class="result">52</span>, <span class="result">0c</span> ].
</span>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('computeParityDemo');
render(h(ComputeParityDemo, {
initialD: 'da, db, 0d', initialM: '2',
name: 'computeParityDemo',
detailed: false,
header: h('h3', {}, h('code', {}, 'ComputeParity')),
containerClass: 'interactive-example',
inputClass: 'parameter',
resultColor: '#268bd2', // solarized blue
}), root.parent, root);
})();
</script>
<br />
<div class="interactive-example" id="reconstructDataDemo">
Let
<span style="white-space: nowrap;">
<var>d</var><sub>partial</sub> = [ ??, db, ?? ]
</span>
be the input partial data bytes and
<span style="white-space: nowrap;">
<var>p</var><sub>partial</sub> = [ 52, 0c ]
</span>
be the input partial parity bytes. Then the output data bytes are
<span style="white-space: nowrap;">
<var>d</var> = [ <span class="result">da</span>, <span class="result">db</span>, <span class="result">0d</span> ].
</span>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('reconstructDataDemo');
render(h(ReconstructDataDemo, {
initialPartialD: '??, db, ??', initialPartialP: '52, 0c',
name: 'reconstructDataDemo',
detailed: false,
header: h('h3', {}, h('code', {}, 'ReconstructData')),
containerClass: 'interactive-example',
inputClass: 'parameter',
resultColor: '#268bd2', // solarized blue
}), root.parent, root);
})();
</script>
</div>
</section>
<section>
<header>
<h2>2. Erasure codes for \(m = 1\)</h2>
</header>
<div class="p">The simplest erasure codes are when \(m = 1\). For example, define
<pre class="code-container"><code class="language-javascript">ComputeParitySum(data: byte[n]) {
return [data[0] + … + data[n-1]]
}</code></pre>
where we consider <code>byte</code> to be an unsigned type such that
addition and subtraction wrap around, i.e. byte arithmetic is done
modulo \(256\). Then also define
<pre class="code-container"><code class="language-javascript">ReconstructDataSum(partialData: (byte?)[n], partialParity: (byte?)[1]) {
if <em>there is more than one entry of partialData or partialParity set to None</em> {
return Error
} else if <em>partialData has no entry set to None</em> {
return partialData
}
i := partialData.firstIndexOf(None)
partialSum := partialData[0] + … + partialData[i-1] + partialData[i+1] + … + partialData[n-1]
return partialData[0:i] ++ [partialParity[0] - partialSum] ++ partialData[i+1:n]
}</code></pre>
where <code>a[i:j]</code> means the subarray of <code>a</code> starting at <code>i</code> and
ending (without inclusion) at <code>j</code>, and <code>++</code> is array concatenation.</div>
<p>This simple erasure code uses the fact that if you have the sum of
a list of numbers, then you can recover a missing number by
subtracting the sum of the other numbers from the total sum, and
also that this works even if you do the arithmetic modulo \(256\).</p>
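<p>Here’s a minimal runnable JavaScript version of the sum-based code, as a sketch rather than a definitive implementation. It assumes bytes are plain numbers in the range 0 to 255 and that a missing entry is represented by <code>null</code> (standing in for <code>None</code>); the camel-cased function names are my own.</p>

```javascript
// Parity byte: the sum of all data bytes, reduced modulo 256.
function computeParitySum(data) {
  return [data.reduce((sum, b) => (sum + b) % 256, 0)];
}

// Missing entries are represented by null (the pseudocode's None).
function reconstructDataSum(partialData, partialParity) {
  const missing = [...partialData, ...partialParity].filter((b) => b === null);
  if (missing.length > 1) {
    throw new Error('more than one byte is missing');
  }
  const i = partialData.indexOf(null);
  if (i === -1) {
    return partialData.slice(); // no data byte is missing
  }
  // Sum of the known data bytes, modulo 256.
  const partialSum = partialData
    .filter((b) => b !== null)
    .reduce((sum, b) => (sum + b) % 256, 0);
  const result = partialData.slice();
  // Adding 256 before reducing keeps the difference non-negative.
  result[i] = (partialParity[0] - partialSum + 256) % 256;
  return result;
}

const sumParity = computeParitySum([0xda, 0xdb, 0x0d]);
console.log(sumParity);                                         // [ 194 ]
console.log(reconstructDataSum([0xda, null, 0x0d], sumParity)); // [ 218, 219, 13 ]
```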
<div class="p">Another erasure code for \(m = 1\) uses <a href="https://en.wikipedia.org/wiki/Exclusive_or#Bitwise_operation">bitwise exclusive
or</a> (denoted as xor, <code>^</code>, or \(\oplus\)) instead
of arithmetic modulo \(256\). Define
<pre class="code-container"><code class="language-javascript">ComputeParityXor(data: byte[n]) {
return [data[0] ⊕ … ⊕ data[n-1]]
}</code></pre>
and
<pre class="code-container"><code class="language-javascript">ReconstructDataXor(partialData: (byte?)[n], partialParity: (byte?)[1]) {
if <em>there is more than one entry of partialData or partialParity set to None</em> {
return Error
} else if <em>partialData has no entry set to None</em> {
return partialData
}
i := partialData.firstIndexOf(None)
partialXor := partialData[0] ⊕ … ⊕ partialData[i-1] ⊕ partialData[i+1] ⊕ … ⊕ partialData[n-1]
return partialData[0:i] ++ [partialParity[0] ⊕ partialXor] ++ partialData[i+1:n]
}</code></pre>
</div>
<p>This relies on the fact that \(a \oplus a = 0\), so given the xor
of a list of bytes, you can recover a missing byte by xoring with
all the known bytes.</p>
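<p>For a concrete instance, take the data bytes from Example 1, written in hexadecimal. The parity byte is
\[
\mathtt{da} \oplus \mathtt{db} \oplus \mathtt{0d} = \mathtt{01} \oplus \mathtt{0d} = \mathtt{0c}\text{.}
\]
If the first data byte is then lost, xoring the parity byte with the two remaining data bytes gives
\[
\mathtt{0c} \oplus \mathtt{db} \oplus \mathtt{0d} = \mathtt{d7} \oplus \mathtt{0d} = \mathtt{da}\text{,}
\]
which is the missing byte.</p>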
</section>
<section>
<header>
<h2>3. Erasure codes for \(m = 2\) (almost)</h2>
</header>
<p>Now coming up with an erasure code for \(m = 2\) is more involved,
but we can get an inkling of how it could work by letting \(n = 3\)
for simplicity, and also letting the output of <code>ComputeParity</code> be
non-negative integers, instead of just bytes (i.e., less than
\(256\)). In that case, we can consider parity numbers that are
weighted sums of the data bytes. For example, like in the \(m = 1\)
case, we can have the first parity number be
\[
p_0 = d_0 + d_1 + d_2\text{,}
\]
(using \(d_i\) for data bytes and \(p_i\) for parity numbers)
but for the second parity number, we can pick different weights, say
\[
p_1 = 1 \cdot d_0 + 2 \cdot d_1 + 3 \cdot d_2\text{.}
\]
We want to make sure that the weights for the second parity number
are “sufficiently different” from that of the first
parity number, which we’ll clarify later, but for example note
that setting
\[
p_1 = 2 \cdot d_0 + 2 \cdot d_1 + 2 \cdot d_2
\]
can’t add any new information, since then \(p_1\) will
always be equal to \(2 \cdot p_0\).</p>
<div class="p">So then our <code>ComputeParity</code> function looks like
<pre class="code-container"><code class="language-javascript">ComputeParityWeighted(data: byte[3]) {
return [
int(data[0]) + int(data[1]) + int(data[2]),
int(data[0]) + 2 * int(data[1]) + 3 * int(data[2]),
]
}</code></pre>
</div>
<div class="p">As for <code>ReconstructData</code>, if we have two missing data bytes,
say \(d_i\) and \(d_j\) for \(i < j\), but still have both \(p_0\) and \(p_1\),
we can rearrange the equations
\[
\begin{aligned}
p_0 &= d_0 + d_1 + d_2 \\
p_1 &= 1 \cdot d_0 + 2 \cdot d_1 + 3 \cdot d_2
\end{aligned}
\]
to get all the unknowns on the left side, letting \(d_k\) be the known data byte:
\[
\begin{aligned}
d_i + d_j &= X = p_0 - d_k \\
(i+1) \cdot d_i + (j+1) \cdot d_j &= Y = p_1 - (k + 1) \cdot d_k\text{.}
\end{aligned}
\]
We can then multiply the first equation by \(i + 1\) and
subtract it from the second to cancel the \(d_i\) term and get
\[
d_j = (Y - (i + 1) \cdot X) / (j - i)\text{,}
\]
and then we can use the first equation to solve for \(d_i\):
\[
d_i = X - d_j = ((j + 1) \cdot X - Y) / (j - i)\text{.}
\]
Thus with these equations, we can implement <code>ReconstructData</code>:
<pre class="code-container"><code class="language-javascript">ReconstructDataWeighted(partialData: (byte?)[3], partialParity: (int?)[2]) {
<em>Handle all cases except when there are exactly two entries set to none in partialData.</em>
[i, j] := <em>indices of the unknown data bytes</em>
k := <em>index of the known data byte</em>
X := partialParity[0] - partialData[k]
Y := partialParity[1] - (k + 1) * partialData[k]
d_i := ((j + 1) * X - Y) / (j - i)
d_j := (Y - (i + 1) * X) / (j - i)
return <em>an array with d_i, d_j, and partialData[k] in the right order</em>
}</code></pre>
(Generalizing this to larger values of \(n\) is straightforward;
\(p_0\) will still have a weight of \(1\) for each data byte, and
\(p_1\) will have a weight of \(i + 1\) for \(d_i\). \(X\) and \(Y\)
will then have terms for all known bytes, and everything else
proceeds the same after that.)</div>
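<div class="p">As a concrete check of the formulas above, here is a minimal runnable JavaScript sketch for the case of exactly two missing data bytes. The lowercase function names, the use of <code>null</code> for a missing byte, and the sample data are assumptions made for illustration, not part of the pseudocode above.
<pre class="code-container"><code class="language-javascript">// Computes the two parity numbers p_0 and p_1 from three data bytes.
function computeParityWeighted(data) {
  return [
    data[0] + data[1] + data[2],
    data[0] + 2 * data[1] + 3 * data[2],
  ];
}

// Reconstructs the data when exactly two data bytes are missing (null)
// and both parity numbers are known.
function reconstructDataWeighted(partialData, parity) {
  const missing = [0, 1, 2].filter((idx) => partialData[idx] === null);
  if (missing.length !== 2) {
    throw new Error('this sketch only handles exactly two missing bytes');
  }
  const [i, j] = missing;
  const k = [0, 1, 2].find((idx) => partialData[idx] !== null);
  const X = parity[0] - partialData[k];
  const Y = parity[1] - (k + 1) * partialData[k];
  const result = partialData.slice();
  result[i] = ((j + 1) * X - Y) / (j - i);
  result[j] = (Y - (i + 1) * X) / (j - i);
  return result;
}

const parity = computeParityWeighted([0xda, 0xdb, 0x0d]);
const recovered = reconstructDataWeighted([null, 0xdb, null], parity);</code></pre>
With the sample bytes <code>[0xda, 0xdb, 0x0d]</code>, the parity numbers are <code>[450, 695]</code>, and losing the first and third bytes still lets us recover all three.</div>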
<p>Now what goes wrong if we just try to do everything modulo \(256\)?
The most obvious difference from the \(m = 1\) case is that solving
for \(d_i\) or \(d_j\) involves division, which works fine for
non-negative integers as long as there’s no remainder, but it
is not immediately clear how division can make sense modulo \(256\).</p>
<p>One possible way to define division modulo \(256\)
would be to first define the <em>multiplicative inverse</em> modulo
\(256\) of an integer \(0 \le x \lt 256\) as the integer \(0 \le y
\lt 256\) such that \((x \cdot y) \bmod 256 = 1\), if it exists, and
then define division by \(x\) modulo \(256\) to be multiplication by
\(y\) modulo \(256\). But this immediately runs into problems; \(2\)
has no multiplicative inverse modulo \(256\), and the same holds for
any even number, so reconstruction will fail if, for example, we
have the first and third data bytes missing, since then we’d
be trying to divide by \(j - i = 2\).</p>
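<div class="p">The claim about even numbers is easy to check by brute force. The following sketch (the helper name <code>inverseMod256</code> is made up for this example) searches for a multiplicative inverse directly from the definition above:
<pre class="code-container"><code class="language-javascript">// Returns the y in [0, 256) with (x * y) mod 256 = 1, or null if none exists.
function inverseMod256(x) {
  for (let y = 0; y < 256; y++) {
    if ((x * y) % 256 === 1) {
      return y;
    }
  }
  return null;
}

const inverseOf3 = inverseMod256(3); // 171, since 3 * 171 = 513 = 2 * 256 + 1
const inverseOf2 = inverseMod256(2); // null, since 2 * y is always even

// Collect every x in [0, 256) that has no inverse.
const nonInvertible = [];
for (let x = 0; x < 256; x++) {
  if (inverseMod256(x) === null) {
    nonInvertible.push(x);
  }
}</code></pre>
Here <code>nonInvertible</code> ends up holding exactly the \(128\) even numbers below \(256\).</div>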
<p>But for now, let’s leave aside the problem of generating
parity bytes instead of parity numbers, and instead focus on how we
can generalize the above for larger values of \(m\). To do so, we
need to first review some linear algebra.</p>
</section>
<section>
<header>
<h2>4. Just enough linear algebra to get by<sup><a href="#fn1" id="r1">[1]</a></sup></h2>
</header>
<p>In our \(n = 3, m = 2\) example in the previous section, the
equations for the parity numbers have the form
\[
p = a_0 \cdot d_0 + a_1 \cdot d_1 + a_2 \cdot d_2
\]
for constants \(a_0\), \(a_1\), and \(a_2\). We call such a
weighted sum of the \(d_i\)s a <em>linear combination</em> of
the \(d_i\)s, and we write this in a tabular form
\[
p =
\begin{pmatrix}
a_0 & a_1 & a_2
\end{pmatrix}
\cdot
\begin{bmatrix}
d_0 \\ d_1 \\ d_2
\end{bmatrix}\text{,}
\]
where we define the multiplication of a
<em>row vector</em> and a <em>column vector</em> by the
equation above, generalized in the straightforward manner
for any \(n\).</p>
<p>Then since we have two parity numbers \(p_0\) and \(p_1\),
each a linear combination of the \(d_i\)s, i.e.
\[
\begin{aligned}
p_0 &= a_{00} \cdot d_0 + a_{01} \cdot d_1 + a_{02} \cdot d_2 \\
p_1 &= a_{10} \cdot d_0 + a_{11} \cdot d_1 + a_{12} \cdot d_2\text{,}
\end{aligned}
\]
we can write this in a single tabular form as
\[
\begin{bmatrix}
p_0 \\ p_1
\end{bmatrix}
=
\begin{pmatrix}
a_{00} & a_{01} & a_{02} \\
a_{10} & a_{11} & a_{12}
\end{pmatrix}
\cdot
\begin{bmatrix}
d_0 \\ d_1 \\ d_2
\end{bmatrix}\text{,}
\]
where we define the multiplication of a <em>matrix</em> and
a column vector by the equations above.</p>
<p>Now if we restrict parity numbers to be linear combinations of the
data bytes, then we can identify a function
<code>ComputeParity</code> using some set of weights with the matrix
formed from that set of weights as above. This holds in general: if
a function is defined as a list of linear combinations of its
inputs, then it can be represented using a matrix as above, and we
call it a
<em>linear function</em>. Then we have a correspondence between
linear functions that take \(n\) numbers to \(m\) numbers and
matrices with \(m\) rows and \(n\) columns, which are denoted as \(m
\times n\) matrices.</p>
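<div class="p">To make this correspondence concrete, a matrix stored as an array of rows can be turned back into the function it represents. This is only a sketch; <code>applyMatrix</code> is a hypothetical helper name and the input values are arbitrary:
<pre class="code-container"><code class="language-javascript">// Applies the linear function described by `matrix` (an array of rows)
// to the column vector `x` (an array of numbers).
function applyMatrix(matrix, x) {
  return matrix.map((row) => {
    if (row.length !== x.length) {
      throw new Error('each row needs one weight per input');
    }
    // Each output is the linear combination of the inputs given by one row.
    return row.reduce((sum, weight, col) => sum + weight * x[col], 0);
  });
}

// The 2 x 3 matrix of weights from the n = 3, m = 2 example.
const weights = [
  [1, 1, 1],
  [1, 2, 3],
];
const parityNumbers = applyMatrix(weights, [5, 6, 7]); // [18, 38]</code></pre>
</div>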
<p>As the first example of this correspondence, note that we denote
the elements of the matrix above as \(a_{ij}\), where the first
index is the row index and the second index is the column
index. Looking back to the parity equations, we also see that the
first index corresponds to the output arguments of <code>ComputeParity</code>, and the second index corresponds to
the input arguments.<sup><a href="#fn2" id="r2">[2]</a></sup></p>
<p>The usefulness of the correspondence between linear functions and
matrices is that we can store and manipulate a linear function by
storing and manipulating its corresponding matrix of weights, which
you wouldn’t be able to easily do for functions in
general. For example, as we’ll see below, we’ll be able
to compute the inverse of a linear function by matrix operations,
which will be useful for <code>ReconstructData</code>.</p>
<div class="p">First, let’s examine some simple matrix operations and their
effects on the corresponding linear function:
<ul>
<li><em>Deleting a row</em> of a matrix corresponds to <em>deleting an output</em> of a linear function.</li>
<li><em>Swapping two rows</em> of a matrix corresponds to <em>swapping two outputs</em> of a linear function.</li>
<li><em>Appending a row</em> to a matrix corresponds to <em>adding an output</em> to a linear function.</li>
</ul>
In general, matrix row operations correspond to manipulations of a
linear function’s outputs.</div>
<div class="p">An important operation on functions is composition: if
\(g\) takes \(k\) inputs to \(m\) outputs, and \(f\) takes
\(m\) inputs to \(n\) outputs, then we can define \((f \circ
g)(x_0, \dotsc, x_{k-1}) = f(g(x_0, \dotsc, x_{k-1}))\) which takes
\(k\) inputs to \(n\) outputs. It turns out that the
composition of two linear functions is again a linear
function, and so there must be an operation which takes the
corresponding \(n \times m\) matrix \(F\) and the \(m \times
k\) matrix \(G\) and yields an \(n \times k\) matrix. This
important operation, the bane of high-schoolers everywhere,
is called <a href="https://en.wikipedia.org/wiki/Matrix_multiplication"><em>matrix multiplication</em></a>,
denoted by \(F \cdot G\). If \(H = F \cdot G\), then the
explicit formula for its elements is
\[
h_{ij} = \sum_{l=0}^{m-1} f_{il} \cdot g_{lj}\text{,}
\]
which corresponds to the following code:
<pre class="code-container"><code class="language-javascript">matrixMultiply(f: Matrix, g: Matrix) {
if (f.columns != g.rows) {
return Error
}
h := new Matrix(f.rows, g.columns)
for i := 0 to f.rows - 1 {
for j := 0 to g.columns - 1 {
t := 0
for k := 0 to f.columns - 1 {
t += f[i, k] * g[k, j]
}
h[i, j] = t
}
}
return h
}</code></pre>
You can convince yourself that the above formula and code are correct
by trying to compose some small linear functions by hand.
</div>
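<div class="p">One way to do that check without pencil and paper is to compare applying two small linear functions one after the other with applying their product matrix once. The sketch below uses plain arrays of rows instead of a <code>Matrix</code> type, and the matrices <code>A</code> and <code>B</code> and the input <code>x</code> are arbitrary choices for illustration:
<pre class="code-container"><code class="language-javascript">// Multiplies two matrices given as arrays of rows.
function matrixMultiply(f, g) {
  if (f[0].length !== g.length) {
    throw new Error('columns of the left matrix must match rows of the right');
  }
  return f.map((row) =>
    g[0].map((_, j) => row.reduce((t, fik, k) => t + fik * g[k][j], 0))
  );
}

// Applies a matrix to a column vector given as a plain array.
function apply(matrix, x) {
  return matrix.map((row) => row.reduce((t, a, k) => t + a * x[k], 0));
}

const A = [[1, 2], [3, 4], [5, 6]]; // 3 x 2: takes 2 inputs to 3 outputs
const B = [[1, 0, 2], [0, 1, 3]];   // 2 x 3: takes 3 inputs to 2 outputs
const x = [7, 8, 9];

const composed = apply(A, apply(B, x));            // apply B first, then A
const viaProduct = apply(matrixMultiply(A, B), x); // apply the product once</code></pre>
Both <code>composed</code> and <code>viaProduct</code> come out as <code>[95, 215, 335]</code>.</div>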
<p>A useful property of matrix multiplication is that it’s a
generalization of the product of a row vector and a column vector,
and the product of a matrix and a column vector as we defined above.</p>
<p>I would be remiss if I didn’t talk about the consequences of
defining matrix multiplication as the matrix of the composition of
the corresponding linear functions. First, this immediately implies
that you can only multiply matrices if the left matrix has the same
number of columns as the number of rows of the right matrix, which
corresponds to the fact that you can only compose functions if the
left function takes the same number of inputs as the number of
outputs of the right function. Furthermore, even if you have two \(n
\times n\) matrices \(F\) and \(G\), unlike numbers, it is not always true
that \(F \cdot G = G \cdot F\), which corresponds to the fact that
in general, for two functions that take \(n\) inputs to \(n\)
outputs, it is not true that \(f \circ g = g \circ f\). If you
learned matrix multiplication just from the formula above, then
these facts are much less obvious!</p>
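<div class="p">Both consequences are easy to see on tiny examples. In this sketch the two \(2 \times 2\) matrices are arbitrary picks whose products differ, and the last call deliberately uses mismatched sizes:
<pre class="code-container"><code class="language-javascript">// Multiplies two matrices given as arrays of rows.
function matrixMultiply(f, g) {
  if (f[0].length !== g.length) {
    throw new Error('columns of the left matrix must match rows of the right');
  }
  return f.map((row) =>
    g[0].map((_, j) => row.reduce((t, fik, k) => t + fik * g[k][j], 0))
  );
}

const F = [[1, 1], [0, 1]];
const G = [[1, 0], [1, 1]];
const FG = matrixMultiply(F, G); // [[2, 1], [1, 1]]
const GF = matrixMultiply(G, F); // [[1, 1], [1, 2]]

// A 1 x 3 matrix cannot be multiplied by a 2 x 2 matrix.
let mismatchError = null;
try {
  matrixMultiply([[1, 2, 3]], [[1, 2], [3, 4]]);
} catch (e) {
  mismatchError = e.message;
}</code></pre>
</div>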
<p>Finally, an important function is the <a href="https://en.wikipedia.org/wiki/Identity_function"><em>identity function</em></a>
\(\mathrm{Id}_n\), which returns its \(n\) inputs as its outputs. It
corresponds to the <a href="https://en.wikipedia.org/wiki/Identity_matrix"><em>identity matrix</em></a>
\[
I_n =
\begin{pmatrix}
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0 \\
0 & 0 & \cdots & 0 & 1
\end{pmatrix}\text{.}
\]</p>
<p>For a linear function \(f\) that takes \(n\) inputs to \(n\)
outputs, if there is a function \(g\) such that \(f \circ g =
\mathrm{Id}_n\), then we call \(g\) the inverse of \(f\), and denote
it as \(f^{-1}\). (It is also true that \(f^{-1} \circ f =
\mathrm{Id}_n\), i.e. \((f^{-1})^{-1} = f\).) Not all linear
functions taking \(n\) inputs to \(n\) outputs have inverses, but if
the inverse exists, it is also linear (and unique, which is why we
call it <em>the</em> inverse). Therefore, we can define
the <em>inverse</em> of an \(n \times n\) (or <em>square</em>)
matrix \(M\) as the unique matrix \(M^{-1}\) such that \(M \cdot
M^{-1} = M^{-1} \cdot M = I_n\), if it exists; also, if \(M\) has an
inverse, we say that \(M\) is <em>invertible</em>.</p>
<div class="interactive-example">
<h3>Example 2: The matrix/linear function correspondence</h3>
<div class="p">Let
\[
M = \begin{pmatrix} 1 & 2 \\ 3 & 4\end{pmatrix}\text{.}
\]
This corresponds to the linear function
<pre class="code-container"><code class="language-javascript">f(x: rational[2]) {
return [
1 * x[0] + 2 * x[1],
3 * x[0] + 4 * x[1],
]
}</code></pre>
where <code>rational</code> is an arbitrary-precision rational
number type.</div>
<div class="p">\(M\) is invertible with inverse
\[
M^{-1} = \begin{pmatrix} -2 & 1 \\ 3/2 & -1/2\end{pmatrix}\text{.}
\]
This corresponds to the linear function
<pre class="code-container"><code class="language-javascript">g(y: rational[2]) {
return [
-2 * y[0] + 1 * y[1],
(3/2) * y[0] + (-1/2) * y[1],
]
}</code></pre>
so <code>g</code> is the inverse function of <code>f</code>. Indeed, <code>f([5, 6])</code> is <code>[17, 39]</code> and <code>g([17, 39])</code> is <code>[5, 6]</code>.</div>
</div>
<p>So now we’ve reduced the problem of finding the inverse of a
linear function taking \(n\) inputs to \(n\) outputs to finding the
inverse of an \(n \times n\) matrix. Before we tackle the question
of computing those inverses, let’s first recast our problem in
the language of linear algebra and see why we need to find the
inverse of a linear function.</p>
</section>
<section>
<header>
<h2>5. Erasure codes in general</h2>
</header>
<div class="p">So, revisiting our \(n = 3, m = 2\) erasure code from
above, we have the linear function
<pre class="code-container"><code class="language-javascript">ComputeParityWeighted(data: byte[3]) {
return [
int(data[0]) + int(data[1]) + int(data[2]),
int(data[0]) + 2 * int(data[1]) + 3 * int(data[2]),
]
}</code></pre>
which therefore corresponds to the <em>parity matrix</em>
\[
P =
\begin{pmatrix}
1 & 1 & 1 \\
1 & 2 & 3
\end{pmatrix}\text{.}
\]
So in mathematical notation, <code>ComputeParityWeighted</code> looks like:
\[
\begin{bmatrix}
p_0 \\ p_1
\end{bmatrix}
=
\mathtt{ComputeParityWeighted}(d_0, d_1, d_2) =
\begin{pmatrix}
1 & 1 & 1 \\
1 & 2 & 3
\end{pmatrix}
\cdot
\begin{bmatrix}
d_0 \\ d_1 \\ d_2
\end{bmatrix}\text{.}
\]
</div>
<p>So let’s now reimplement <code>ReconstructDataWeighted</code> using linear algebra. First, append the rows of \(P\) to the identity matrix \(I_3\) to get the matrix equation
\[
\begin{bmatrix}
d_0 \\ d_1 \\ d_2 \\ p_0 \\ p_1
\end{bmatrix}
=
\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 1 & 1 \\
1 & 2 & 3
\end{pmatrix}
\cdot
\begin{bmatrix}
d_0 \\ d_1 \\ d_2
\end{bmatrix}\text{,}
\]
which corresponds to a linear function that returns the input data bytes in addition to computing the parity numbers. Now let’s say we lose the data bytes \(d_0\) and \(d_2\). Then let’s remove the rows corresponding to those bytes:
\[
\begin{bmatrix}
\xcancel{d_0} \\ d_1 \\ \xcancel{d_2} \\ p_0 \\ p_1
\end{bmatrix}
=
\begin{pmatrix}
\xcancel{1} & \xcancel{0} & \xcancel{0} \\
0 & 1 & 0 \\
\xcancel{0} & \xcancel{0} & \xcancel{1} \\
1 & 1 & 1 \\
1 & 2 & 3
\end{pmatrix}
\cdot
\begin{bmatrix}
d_0 \\ d_1 \\ d_2
\end{bmatrix}\text{,}
\]
which turns into
\[
\begin{bmatrix}
d_1 \\ p_0 \\ p_1
\end{bmatrix} =
\begin{pmatrix}
0 & 1 & 0 \\
1 & 1 & 1 \\
1 & 2 & 3
\end{pmatrix}
\cdot
\begin{bmatrix}
d_0 \\ d_1 \\ d_2
\end{bmatrix}\text{,}
\]
which corresponds to a linear function that maps the input data
bytes to the non-lost data bytes and the parity numbers. This
is the <em>inverse</em> of the function we want, so we want
to invert the \(3 \times 3\) matrix above, which we’ll
call \(M\). That inverse is
\[
M^{-1} =
\begin{pmatrix}
-1/2 & 3/2 & -1/2 \\
1 & 0 & 0 \\
-1/2 & -1/2 & 1/2
\end{pmatrix}\text{.}
\]
Multiplying both sides above by \(M^{-1}\), we get
\[
\begin{bmatrix}
d_0 \\ d_1 \\ d_2
\end{bmatrix}
=
\begin{pmatrix}
-1/2 & 3/2 & -1/2 \\
1 & 0 & 0 \\
-1/2 & -1/2 & 1/2
\end{pmatrix}
\cdot
\begin{bmatrix}
d_1 \\ p_0 \\ p_1
\end{bmatrix}\text{,}
\]
which is exactly what we want: the original data bytes in
terms of the known data bytes and the parity numbers!<sup><a href="#fn3" id="r3">[3]</a></sup></p>
<p>Comparing this equation to the one we manually computed previously,
they don’t look immediately similar, but some rearrangement
will reveal that they indeed compute the same thing. As a sanity
check, notice that the second row of \(M^{-1}\) means that the first
input argument is mapped unchanged to the second output argument,
which is exactly what we want for the known byte \(d_1\).</p>
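<div class="p">The rearrangement can also be checked numerically. The sketch below reconstructs the same made-up data bytes twice, once with \(M^{-1}\) and once with the earlier formulas for \(i = 0\), \(j = 2\), \(k = 1\); the helper name <code>apply</code> and the sample values are assumptions for illustration:
<pre class="code-container"><code class="language-javascript">// Applies a matrix (an array of rows) to a column vector (an array).
function apply(matrix, x) {
  return matrix.map((row) => row.reduce((t, a, k) => t + a * x[k], 0));
}

// The inverse of M from the text. Halves are exactly representable as
// floating-point numbers, so the arithmetic below is exact.
const mInverse = [
  [-1 / 2, 3 / 2, -1 / 2],
  [1, 0, 0],
  [-1 / 2, -1 / 2, 1 / 2],
];

// Sample data bytes (made up for this check) and their parity numbers.
const [d0, d1, d2] = [5, 6, 7];
const p0 = d0 + d1 + d2;
const p1 = d0 + 2 * d1 + 3 * d2;

// Reconstruction via the matrix equation, knowing only d1, p0, and p1.
const viaMatrix = apply(mInverse, [d1, p0, p1]);

// Reconstruction via the earlier formulas with i = 0, j = 2, k = 1.
const X = p0 - d1;
const Y = p1 - 2 * d1;
const viaFormulas = [(3 * X - Y) / 2, d1, (Y - X) / 2];</code></pre>
Both <code>viaMatrix</code> and <code>viaFormulas</code> come out as <code>[5, 6, 7]</code>.</div>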
<p>Now what does this look like in general, i.e. for
arbitrary \(n\) and \(m\)? First, we have to generate an
\(m \times n\) parity matrix \(P\) whose rows have to be
“sufficiently different” from each other,
which we still have to clarify. Then <code>ComputeParity</code> just multiplies \(P\) by \([d]\), the column matrix of input bytes, like so:
\[
\begin{bmatrix}
p_0 \\ \vdots \\ p_{m-1}
\end{bmatrix}
=
\mathtt{ComputeParity}(d_0, \dotsc, d_{n-1}) =
\begin{pmatrix}
p_0 \\
\vdots \\
p_{m-1}
\end{pmatrix}
\cdot
\begin{bmatrix}
d_0 \\ \vdots \\ d_{n-1}
\end{bmatrix}\text{,}
\]
where the \(p_i\) are the rows of \(P\).</p>
<p>As for <code>ReconstructData</code>, we first append
the rows of \(P\) to \(I_n\), whose rows we’ll denote as \(e_i\):
\[
\begin{bmatrix}
d_0 \\ \vdots \\ d_{n-1} \\
p_0 \\ \vdots \\ p_{m-1}
\end{bmatrix}
=
\begin{pmatrix}
e_0 \\
\vdots \\
e_{n-1} \\
p_0 \\
\vdots \\
p_{m-1}
\end{pmatrix}
\cdot
\begin{bmatrix}
d_0 \\ \vdots \\ d_{n-1}
\end{bmatrix}\text{.}
\]
Now assume that the indices of the missing \(k \le m\) data
bytes are \(i_0, \dotsc, i_{k-1}\).
Then we remove the rows
corresponding to the missing data bytes, and keep some \(k\)
parity rows, e.g. \(p_0\) to \(p_{k-1}\). This yields the equation
\[
\begin{bmatrix}
d_{j_0} \\ \vdots \\ d_{j_{n-k-1}} \\
p_0 \\ \vdots \\ p_{k-1}
\end{bmatrix}
=
\begin{pmatrix}
e_{j_0} \\
\vdots \\
e_{j_{n-k-1}} \\
p_0 \\
\vdots \\
p_{k-1}
\end{pmatrix}
\cdot
\begin{bmatrix}
d_0 \\ \vdots \\ d_{n-1}
\end{bmatrix}\text{,}
\]
where \(j_0, \dotsc, j_{n-k-1}\) are the indices of the
<em>present</em> \(n - k\) data bytes. Call that \(n \times n\)
matrix \(M\), and compute its inverse \(M^{-1}\). If \(P\) was chosen correctly, \(M^{-1}\) should always exist, so if the inverse computation fails, raise an error. Therefore, <code>ReconstructData</code> just multiplies \(M^{-1}\) by the column matrix of present data bytes and chosen parity numbers:
\[
\begin{bmatrix}
d_0 \\ \vdots \\ d_{n-1}
\end{bmatrix}
=
\mathtt{ReconstructData}(d_{j_0}, \dotsc, d_{j_{n-k-1}}, p_0, \dotsc, p_{k-1})
= M^{-1} \cdot
\begin{bmatrix}
d_{j_0} \\ \vdots \\ d_{j_{n-k-1}} \\
p_0 \\ \vdots \\ p_{k-1}
\end{bmatrix}\text{.}
\]
</p>
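<div class="p">The construction of \(M\) can be sketched in a few lines. This is an illustration only: the function name <code>buildReconstructionMatrix</code> is hypothetical, matrices are plain arrays of rows, and the sketch always keeps the first \(k\) parity rows:
<pre class="code-container"><code class="language-javascript">// Builds the square matrix M from an m x n parity matrix `P` (array of rows)
// and the indices of the missing data bytes.
function buildReconstructionMatrix(P, missingIndices) {
  const n = P[0].length;
  const k = missingIndices.length;
  if (k > P.length) {
    throw new Error('more missing data bytes than parity numbers');
  }
  const rows = [];
  for (let j = 0; j < n; j++) {
    if (!missingIndices.includes(j)) {
      // Row e_j of the identity matrix for each present data byte.
      rows.push(Array.from({ length: n }, (_, col) => (col === j ? 1 : 0)));
    }
  }
  // Keep the first k parity rows, p_0 to p_{k-1}.
  for (let l = 0; l < k; l++) {
    rows.push(P[l].slice());
  }
  return rows;
}

// Losing d_0 and d_2 in the n = 3, m = 2 example.
const M = buildReconstructionMatrix([[1, 1, 1], [1, 2, 3]], [0, 2]);</code></pre>
For the \(n = 3, m = 2\) example this yields the same matrix as before, with rows <code>[0, 1, 0]</code>, <code>[1, 1, 1]</code>, and <code>[1, 2, 3]</code>.</div>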
<p>As an optimization, some rows of \(M^{-1}\) correspond to just
shuffling around the known data bytes \(d_{j_*}\), so we can just
remove those rows, compute the missing data bytes, and do the
shuffling ourselves.</p>
<div class="p">So we now have outlines of implementations of both <code>ComputeParity</code> and <code>ReconstructData</code>,
but we still have missing pieces. In particular,
<ol>
<li>How do we compute matrix inverses?</li>
<li>How do we generate “optimal” parity matrices so that \(M^{-1}\) always exists?</li>
<li>How do we compute parity bytes instead of parity numbers?</li>
</ol>
</div>
<p>So first, let’s see how to compute matrix inverses using row
reduction.</p>
</section>
<section>
<header>
<h2>6. Finding matrix inverses using row reduction</h2>
</header>
<p>We developed the theory of matrices by identifying them with linear
functions of numbers. To show how to find matrix inverses, we have
to look at them in a slightly different way by identifying matrix
equations with systems of linear equations of numbers.</p>
<p>For example, consider the matrix equation
\[
M \cdot x = y\text{,}
\]
where
\[
M =
\begin{pmatrix}
1 & 2 \\
3 & 4
\end{pmatrix}\text{,}
\quad
x =
\begin{bmatrix}
x_1 \\ x_2
\end{bmatrix}
\text{,} \quad \text{and }
y =
\begin{bmatrix}
y_1 \\ y_2
\end{bmatrix}\text{.}
\]
This expands to
\[
\begin{pmatrix}
1 & 2 \\
3 & 4
\end{pmatrix}
\cdot
\begin{bmatrix}
x_1 \\ x_2
\end{bmatrix} =
\begin{bmatrix}
y_1 \\ y_2
\end{bmatrix}\text{,}
\]
or
\[
\begin{aligned}
y_1 &= 1 \cdot x_1 + 2 \cdot x_2 \\
y_2 &= 3 \cdot x_1 + 4 \cdot x_2\text{,}
\end{aligned}
\]
which is a linear system of equations of numbers. Letting \(M\) be
any matrix, and \(x\) and \(y\) be appropriately-sized column
matrices of variables, we can see that the matrix equation
\(M \cdot x = y\) is shorthand for a system of linear equations of
numbers.</p>
<p>If we could find \(M^{-1}\), we could solve the matrix
equation easily by multiplying both sides by it:
\[
\begin{aligned}
M^{-1} \cdot (M \cdot x) &= M^{-1} \cdot y \\
x &= M^{-1} \cdot y\text{,}
\end{aligned}
\]
and therefore solve the linear system for \(x\) in terms of \(y\).
Conversely, if we were able to solve the linear system for \(x\),
we’d then be able to read off \(M^{-1}\) from the new linear
system.</p>
<div class="p">But how do we solve a linear system? From the theory of linear systems of equations, we have a few tools at our disposal:
<ul>
<li>swapping two equations,</li>
<li>multiplying an equation by a non-zero number,</li>
<li>adding one equation to another, possibly multiplying
the equation by a number before adding.</li>
</ul>
</div>
<p>All of these are valid transformations because they
don’t change the solution set of the linear system.</p>
<p>For example, in the equation above, the first step would be
to subtract \(3\) times the first equation from the second
equation to yield
\[
\begin{aligned}
y_1 &= x_1 + 2 \cdot x_2 \\
y_2 - 3 \cdot y_1 &= -2 \cdot x_2\text{.}
\end{aligned}
\]
Then, add the second equation back to the first equation:
\[
\begin{aligned}
y_2 - 2 \cdot y_1 &= x_1 \\
y_2 - 3 \cdot y_1 &= -2 \cdot x_2\text{.}
\end{aligned}
\]
Finally, divide the second equation by \(-2\):
\[
\begin{aligned}
y_2 - 2 \cdot y_1 &= x_1 \\
(3/2) \cdot y_1 - (1/2) \cdot y_2 &= x_2\text{.}
\end{aligned}
\]
This is equivalent to the matrix equation
\[
\begin{pmatrix}
-2 & 1 \\ 3/2 & -1/2
\end{pmatrix}
\cdot
\begin{bmatrix}
y_1 \\ y_2
\end{bmatrix} =
\begin{bmatrix}
x_1 \\ x_2
\end{bmatrix}\text{,}
\]
so
\[
M^{-1} = \begin{pmatrix}
-2 & 1 \\ 3/2 & -1/2
\end{pmatrix}\text{.}
\]
</p>
<p>So how do we translate the above process to an algorithm operating on matrices? First, express our matrix equation in a slightly
different form:
\[
M \cdot x = I \cdot y\text{.}
\]
Using the example above, this is
\[
\begin{pmatrix}
1 & 2 \\
3 & 4
\end{pmatrix}
\cdot
\begin{bmatrix}
x_1 \\ x_2
\end{bmatrix}
=
\begin{pmatrix}
1 & 0 \\
0 & 1
\end{pmatrix}
\cdot
\begin{bmatrix}
y_1 \\ y_2
\end{bmatrix}\text{.}
\]
Then, you can see that the first step above corresponds to subtracting \(3\) times the first row from the second row to yield:
\[
\begin{pmatrix}
1 & 2 \\
0 & -2
\end{pmatrix}
\cdot
\begin{bmatrix}
x_1 \\ x_2
\end{bmatrix}
=
\begin{pmatrix}
1 & 0 \\
-3 & 1
\end{pmatrix}
\cdot
\begin{bmatrix}
y_1 \\ y_2
\end{bmatrix}\text{.}
\]
We don’t even need to keep writing the \(x\) and \(y\)
column matrices; we can just write the “augmented” matrix
\[
A =
\left( \hskip -5pt
\begin{array}{cc|cc}
1 & 2 & 1 & 0 \\
0 & -2 & -3 & 1
\end{array}
\hskip -5pt \right)
\]
and operate on it.</p>
<div class="p">Thus, the operations listed above on linear systems have corresponding operations on augmented matrices:
<ul>
<li><em>swapping two equations</em> corresponds to <em>swapping two rows</em>;</li>
<li><em>multiplying an equation by a non-zero number</em> corresponds to <em>multiplying a row by a non-zero number</em>; and</li>
<li><em>adding an equation to another</em>, possibly multiplying the
equation by a number before adding, corresponds to <em>adding a row to another row</em>,
possibly multiplying the row by a number before adding.</li>
</ul>
Then, the goal is to use these <em>row operations</em> to transform
the initial augmented matrix, where the right side looks like the
identity matrix, into one where the left side looks like the
identity matrix. Then, translating the augmented matrix back into a
matrix equation, that would give \(M^{-1}\) on the right side.<sup><a href="#fn4" id="r4">[4]</a></sup></div>
<div class="p">When doing this by hand, one usually works with the linear
system itself, trying to see which variables can be easily
eliminated so as to minimize arithmetic. However, to
translate this to an algorithm, we’re more interested
in a systematic way of doing this. Fortunately,
there’s an easy two-step process:
<ol>
<li>Turn the left side of \(A\) into a <em>unit upper triangular matrix</em>,
which means that all the elements on the main diagonal are
\(1\), and all elements below the main diagonal are \(0\),
i.e. that \(a_{ii} = 1\) for all \(i\), and \(a_{ij} = 0\) for
all \(i > j\).</li>
<li>Then turn the left side of \(A\) into the identity matrix.</li>
</ol>
This algorithm is called <a href="https://en.wikipedia.org/wiki/Row_reduction">row reduction</a>. The
first step can be further broken down:
<ol type="a">
<li>For each column \(i\) of the left side in ascending order:
<ol type="i">
<li>If \(a_{ii}\) is zero, look at the rows below the
\(i\)th row for a row \(j > i\) such that \(a_{ji} \ne
0\), and swap rows \(i\) and \(j\). If no such row
exists, return an error, as that means that \(M\), the original
left side of \(A\), is non-invertible.</li>
<li>Divide the \(i\)th row by \(a_{ii}\), so that \(a_{ii}
= 1\).</li>
<li>For each row \(j > i\), subtract \(a_{ji}\) times the
\(i\)th row from it, which will set \(a_{ji}\) to \(0\).</li>
</ol>
</li>
</ol>
The second step can be similarly broken down:
<ol type="a">
<li>For each column \(i\) of the left side, in order:
<ol type="i">
<li>For each row \(j < i\), subtract \(a_{ji}\) times the
\(i\)th row from it, which will set \(a_{ji}\) to \(0\).</li>
</ol>
</li>
</ol>
</div>
<p>Note that we’re assuming that all arithmetic is
exact, i.e. we use an arbitrary-precision rational number
type. If we use floating point numbers, we’d have to
worry a lot more about the order in which we do operations
and numerical stability.</p>
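<div class="p">The steps above can be sketched in runnable JavaScript as follows. This is a minimal illustration rather than a tuned implementation: exact rationals are represented as <code>[numerator, denominator]</code> pairs of <code>BigInt</code>s, and all names (<code>frac</code>, <code>invertMatrix</code>, <code>show</code>, and so on) are assumptions of this sketch.
<pre class="code-container"><code class="language-javascript">// Minimal exact rational numbers as [numerator, denominator] BigInt pairs.
function gcd(a, b) {
  while (b !== 0n) {
    [a, b] = [b, a % b];
  }
  return a < 0n ? -a : a;
}
function frac(num, den = 1n) {
  if (den < 0n) {
    num = -num;
    den = -den;
  }
  const g = gcd(num, den);
  return [num / g, den / g];
}
const sub = (x, y) => frac(x[0] * y[1] - y[0] * x[1], x[1] * y[1]);
const mul = (x, y) => frac(x[0] * y[0], x[1] * y[1]);
const div = (x, y) => frac(x[0] * y[1], x[1] * y[0]);
const isZero = (x) => x[0] === 0n;

// Inverts a square matrix of integers by row reduction on the augmented
// matrix (M | I). Throws if M is non-invertible.
function invertMatrix(m) {
  const n = m.length;
  const a = m.map((row, i) => [
    ...row.map((v) => frac(BigInt(v))),
    ...row.map((_, j) => frac(i === j ? 1n : 0n)),
  ]);
  // Step 1: make the left side unit upper triangular.
  for (let i = 0; i < n; i++) {
    if (isZero(a[i][i])) {
      let pivot = -1;
      for (let j = i + 1; j < n; j++) {
        if (!isZero(a[j][i])) {
          pivot = j;
          break;
        }
      }
      if (pivot === -1) {
        throw new Error('matrix is non-invertible');
      }
      [a[i], a[pivot]] = [a[pivot], a[i]];
    }
    const d = a[i][i];
    a[i] = a[i].map((v) => div(v, d));
    for (let j = i + 1; j < n; j++) {
      const c = a[j][i];
      a[j] = a[j].map((v, col) => sub(v, mul(c, a[i][col])));
    }
  }
  // Step 2: clear the entries above the diagonal.
  for (let i = n - 1; i >= 0; i--) {
    for (let j = 0; j < i; j++) {
      const c = a[j][i];
      a[j] = a[j].map((v, col) => sub(v, mul(c, a[i][col])));
    }
  }
  // The right half of the augmented matrix is now the inverse.
  return a.map((row) => row.slice(n));
}

// Formats a fraction as "p/q", or just "p" when q is 1.
const show = (x) => (x[1] === 1n ? `${x[0]}` : `${x[0]}/${x[1]}`);
const inverse = invertMatrix([[0, 2, 2], [3, 4, 5], [6, 6, 7]])
  .map((row) => row.map(show));</code></pre>
For this input, <code>inverse</code> has the rows <code>-1/3 -1/3 1/3</code>, <code>3/2 -2 1</code>, and <code>-1 2 -1</code>.</div>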
<style>
.swap-row-a { color: #dc322f; /* solarized red */ }
.swap-row-b { color: #268bd2; /* solarized blue */ }
.divide-row { color: #dc322f; /* solarized red */ }
.subtract-scaled-row-src { color: #268bd2; /* solarized blue */ }
.subtract-scaled-row-dest { color: #dc322f; /* solarized red */ }
</style>
<div class="interactive-example" id="matrixInverseDemo">
<h3>Example 3: Matrix inversion via row reduction</h3>
Let
<pre> / 0 2 2 \
M = | 3 4 5 |
\ 6 6 7 /.</pre>
The initial augmented matrix <var>A</var> is
<pre>/ 0 2 2 | 1 0 0 \
| 3 4 5 | 0 1 0 |
\ 6 6 7 | 0 0 1 /.</pre>
We need <var>A</var><sub>00</sub> to be non-zero, so swap rows <span class="swap-row-a">0</span> and <span class="swap-row-b">1</span>:
<pre>/ <span class="swap-row-a">0 2 2</span> | <span class="swap-row-a">1 0 0</span> \ / <span class="swap-row-b">3 4 5</span> | <span class="swap-row-b">0 1 0</span> \
| <span class="swap-row-b">3 4 5</span> | <span class="swap-row-b">0 1 0</span> | --> | <span class="swap-row-a">0 2 2</span> | <span class="swap-row-a">1 0 0</span> |
\ 6 6 7 | 0 0 1 / \ 6 6 7 | 0 0 1 /.</pre>
We need <var>A</var><sub>00</sub> to be 1, so divide row <span class="divide-row">0</span> by 3:
<pre>/ <span class="divide-row">3 4 5</span> | <span class="divide-row">0 1 0</span> \ / <span class="divide-row">1 4/3 5/3</span> | <span class="divide-row">0 1/3 0</span> \
| 0 2 2 | 1 0 0 | --> | 0 2 2 | 1 0 0 |
\ 6 6 7 | 0 0 1 / \ 6 6 7 | 0 0 1 /.</pre>
We need <var>A</var><sub>20</sub> to be 0, so subtract row <span class="subtract-scaled-row-src">0</span> scaled by 6 from row <span class="subtract-scaled-row-dest">2</span>:
<pre>/ <span class="subtract-scaled-row-src">1 4/3 5/3</span> | <span class="subtract-scaled-row-src">0 1/3 0</span> \ / 1 4/3 5/3 | 0 1/3 0 \
| 0 2 2 | 1 0 0 | --> | 0 2 2 | 1 0 0 |
\ <span class="subtract-scaled-row-dest">6 6 7</span> | <span class="subtract-scaled-row-dest">0 0 1</span> / \ <span class="subtract-scaled-row-dest">0 -2 -3</span> | <span class="subtract-scaled-row-dest">0 -2 1</span> /.</pre>
We need <var>A</var><sub>11</sub> to be 1, so divide row <span class="divide-row">1</span> by 2:
<pre>/ 1 4/3 5/3 | 0 1/3 0 \ / 1 4/3 5/3 | 0 1/3 0 \
| <span class="divide-row">0 2 2 </span> | <span class="divide-row"> 1 0 0</span> | --> | <span class="divide-row">0 1 1 </span> | <span class="divide-row">1/2 0 0</span> |
\ 0 -2 -3 | 0 -2 1 / \ 0 -2 -3 | 0 -2 1 /.</pre>
We need <var>A</var><sub>21</sub> to be 0, so subtract row <span class="subtract-scaled-row-src">1</span> scaled by −2 from row <span class="subtract-scaled-row-dest">2</span>:
<pre>/ 1 4/3 5/3 | 0 1/3 0 \ / 1 4/3 5/3 | 0 1/3 0 \
| <span class="subtract-scaled-row-src">0 1 1</span> | <span class="subtract-scaled-row-src">1/2 0 0</span> | --> | 0 1 1 | 1/2 0 0 |
\ <span class="subtract-scaled-row-dest">0 -2 -3</span> | <span class="subtract-scaled-row-dest">0 -2 1</span> / \ <span class="subtract-scaled-row-dest">0 0 -1</span> | <span class="subtract-scaled-row-dest">1 -2 1</span> /.</pre>
We need <var>A</var><sub>22</sub> to be 1, so divide row <span class="divide-row">2</span> by −1, which makes the left side of <var>A</var> a
unit upper triangular matrix:
<pre>/ 1 4/3 5/3 | 0 1/3 0 \ / 1 4/3 5/3 | 0 1/3 0 \
| 0 1 1 | 1/2 0 0 | --> | 0 1 1 | 1/2 0 0 |
\ <span class="divide-row">0 0 -1 </span> | <span class="divide-row"> 1 -2 1</span> / \ <span class="divide-row">0 0 1 </span> | <span class="divide-row">-1 2 -1</span> /.</pre>
We need <var>A</var><sub>12</sub> to be 0, so subtract row <span class="subtract-scaled-row-src">2</span> from row <span class="subtract-scaled-row-dest">1</span>:
<pre>/ 1 4/3 5/3 | 0 1/3 0 \ / 1 4/3 5/3 | 0 1/3 0 \
| <span class="subtract-scaled-row-dest">0 1 1</span> | <span class="subtract-scaled-row-dest">1/2 0 0</span> | --> | <span class="subtract-scaled-row-dest">0 1 0</span> | <span class="subtract-scaled-row-dest">3/2 -2 1</span> |
\ <span class="subtract-scaled-row-src">0 0 1</span> | <span class="subtract-scaled-row-src">-1 2 -1</span> / \ 0 0 1 | -1 2 -1 /.</pre>
We need <var>A</var><sub>02</sub> to be 0, so subtract row <span class="subtract-scaled-row-src">2</span> scaled by 5/3 from row <span class="subtract-scaled-row-dest">0</span>:
<pre>/ <span class="subtract-scaled-row-dest">1 4/3 5/3</span> | <span class="subtract-scaled-row-dest">0 1/3 0</span> \ / <span class="subtract-scaled-row-dest">1 4/3 0</span> | <span class="subtract-scaled-row-dest">5/3 -3 5/3</span> \
| 0 1 0 | 3/2 -2 1 | --> | 0 1 0 | 3/2 -2 1 |
\ <span class="subtract-scaled-row-src">0 0 1</span> | <span class="subtract-scaled-row-src">-1 2 -1</span> / \ 0 0 1 | -1 2 -1 /.</pre>
We need <var>A</var><sub>01</sub> to be 0, so subtract row <span class="subtract-scaled-row-src">1</span> scaled by 4/3 from row <span class="subtract-scaled-row-dest">0</span>, which makes the left side of <var>A</var> the identity matrix:
<pre>/ <span class="subtract-scaled-row-dest">1 4/3 0</span> | <span class="subtract-scaled-row-dest">5/3 -3 5/3</span> \ / <span class="subtract-scaled-row-dest">1 0 0</span> | <span class="subtract-scaled-row-dest">-1/3 -1/3 1/3</span> \
| <span class="subtract-scaled-row-src">0 1 0</span> | <span class="subtract-scaled-row-src">3/2 -2 1</span> | --> | 0 1 0 | 3/2 -2 1 |
\ 0 0 1 | -1 2 -1 / \ 0 0 1 | -1 2 -1 /.</pre>
Since the left side of <var>A</var> is the identity matrix, the right side of <var>A</var> is <var>M</var><sup>-1</sup>. Therefore,
<pre> / -1/3 -1/3 1/3 \
M^{-1} = | 3/2 -2 1 |
\ -1 2 -1 /.</pre>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('matrixInverseDemo');
render(h(MatrixInverseDemo, {
initialElements: '0, 2, 2, 3, 4, 5, 6, 6, 7', initialFieldType: 'rational',
name: 'matrixInverseDemo',
header: h('h3', null, 'Example 3: Matrix inversion via row reduction'),
containerClass: 'interactive-example',
inputClass: 'parameter',
buttonClass: 'interactive-example-button',
allowFieldTypeChanges: false,
swapRowAColor: '#dc322f', // solarized red
swapRowBColor: '#268bd2', // solarized blue
divideRowColor: '#dc322f', // solarized red
subtractScaledRowSrcColor: '#268bd2', // solarized blue
subtractScaledRowDestColor: '#dc322f', // solarized red
}), root.parent, root);
})();
</script>
<p>Now notice one thing: if \(M\) has a row that is proportional to
another row, then row reduction would eventually zero out one of the
rows, causing the algorithm to fail, and signaling that \(M\) is
non-invertible. In fact, a stronger statement is true: \(M\) has a
row that can be expressed as a linear combination of other rows of
\(M\) exactly when \(M\) is non-invertible. Informally, this means
that the linear function corresponding to \(M\) has one of its
outputs redundant with the other outputs, so it is essentially a
linear function taking \(n\) inputs to fewer than \(n\) outputs, and
such functions aren’t invertible.</p>
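<p>As a small worked example (the matrix here is chosen purely for illustration), take
\[
M =
\begin{pmatrix}
1 & 1 & 1 \\
1 & 2 & 3 \\
2 & 3 & 4
\end{pmatrix}\text{,}
\]
whose third row is the sum of the first two. Looking only at the left side of the augmented matrix, subtracting the first row from the second, and \(2\) times the first row from the third, gives
\[
\begin{pmatrix}
1 & 1 & 1 \\
0 & 1 & 2 \\
0 & 1 & 2
\end{pmatrix}\text{,}
\]
and subtracting the second row from the third then gives
\[
\begin{pmatrix}
1 & 1 & 1 \\
0 & 1 & 2 \\
0 & 0 & 0
\end{pmatrix}\text{.}
\]
The last diagonal element is zero and there is no row below it to swap with, so row reduction reports that \(M\) is non-invertible.</p>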
<p>This gets us a partial explanation for what “sufficiently
different” means for our parity functions. If one parity
function is a linear combination of other parity functions, then it
is redundant, and therefore not “sufficiently
different”. Therefore, we want our parity matrix \(P\) to be
such that no row can be expressed as a linear combination of other
rows.</p>
<p>However, this criterion for \(P\) isn’t quite enough
to guarantee that all possible matrices \(M\) computed as
part of <code>ReconstructData</code> are invertible. For example,
this criterion holds for the identity matrix \(I_n\), but if \(n >
1\) and you pick \(I_n\) as the parity matrix for \(n = m\), you can
certainly end up with a constructed matrix \(M\) with repeated rows,
since you’re starting by appending another copy of \(I_n\) on
top of \(P = I_n\)! This explains in a different way why simply
making a copy of the original data files makes for a poor erasure
code, unless of course you only have one data file. We’re led
to our next topic: what makes a parity matrix “optimal”?</p>
</section>
<section>
<header>
<h2>7. Optimal parity matrices</h2>
</header>
<p>Recall from above that we form the square matrix
\[
M =
\begin{pmatrix}
e_{j_0} \\
\vdots \\
e_{j_{n-k-1}} \\
p_0 \\
\vdots \\
p_{k-1}
\end{pmatrix}
\]
by prepending some rows of the identity matrix to the first
\(k\) rows of the parity matrix. We can generalize this a
bit more, since we don’t have to take the first \(k\)
rows, but instead can take any \(k\) rows of the parity
matrix, whose indices we denote here as \(l_0, \dotsc, l_{k-1}\):
\[
M =
\begin{pmatrix}
e_{j_0} \\
\vdots \\
e_{j_{n-k-1}} \\
p_{l_0} \\
\vdots \\
p_{l_{k-1}}
\end{pmatrix}\text{.}
\]
So we want to construct \(P\) so that any such square matrix
\(M\) formed from the rows of \(P\) is invertible. Therefore,
we call a parity matrix \(P\) <em>optimal</em> if it satisfies this
criterion.</p>
<div class="p">Fortunately, there is a simpler criterion for optimal parity
matrices. First, define a <a href="https://en.wikipedia.org/wiki/Matrix_(mathematics)#Submatrix"><em>submatrix</em></a>
of a matrix \(P\) to be a matrix that you get by deleting
any number of rows or columns, and call a matrix <em>non-empty</em> if
it has at least one row and one column. Then:
<div class="theorem">(<span class="theorem-name">Theorem 1</span>.)
A parity matrix \(P\) is optimal exactly when any non-empty square
submatrix of \(P\) is invertible.<sup><a href="#fn5" id="r5">[5]</a></sup></div>
Note that this criterion is stronger than the one in the previous
section, where we want a parity matrix \(P\) to have no row that can
be expressed as a linear combination of other rows. That is, if any
non-empty square submatrix of \(P\) is invertible, that means that
no row can be expressed as a linear combination of other rows.<sup><a href="#fn6" id="r6">[6]</a></sup> However, it is possible to have a matrix
\(P\) where no row can be expressed as a linear combination of
other rows, but which is not optimal. We’ve already seen an
example above: \(I_n\) for \(n \gt 1\), and indeed,
\[
I_2 =
\begin{pmatrix}
1 & 0 \\
0 & 1
\end{pmatrix}\text{,}
\]
has the \(1 \times 1\) non-invertible submatrix
\(\begin{pmatrix} 0 \end{pmatrix}\).</div>
<div class="interactive-example">
<h3>Example 4: An optimal parity matrix for \(m = 2\)</h3>
<p>Recall the parity matrix
\[
P =
\begin{pmatrix}
1 & 1 & 1 \\
1 & 2 & 3
\end{pmatrix}
\]
that we were using for our \(n = 3, m = 2\) example. For any \(n\),
this matrix looks like
\[
P =
\begin{pmatrix}
1 & 1 & \cdots & 1 \\
1 & 2 & \cdots & n
\end{pmatrix}\text{.}
\]
A \(1 \times 1\) matrix is invertible exactly when its single
element is non-zero, so any \(1 \times 1\) submatrix of \(P\) is
invertible. Any \(2 \times 2\) submatrix of \(P\) looks like
\[
A =
\begin{pmatrix}
1 & 1 \\
a & b
\end{pmatrix}
\]
for \(a \ne b\), which, using the <a href="https://en.wikipedia.org/wiki/Invertible_matrix#Inversion_of_2_.C3.97_2_matrices">formula for inverses of \(2 \times 2\) matrices</a>, has inverse
\[
A^{-1} = \begin{pmatrix} b/(b-a) & -1/(b-a) \\ -a/(b-a) & 1/(b-a) \end{pmatrix}\text{.}
\]
These are all the possible square submatrices of \(P\), so
this \(P\) is an optimal parity matrix for \(m = 2\).</p>
</div>
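<p>For small integer parity matrices, the criterion of Theorem 1 can
also be checked mechanically. The following is only a brute-force
sketch: the helper names (<code>subsets</code>, <code>det</code>,
<code>isOptimalParityMatrix</code>) are made up for illustration, and
the cofactor-expansion determinant is exact only for small integer
matrices.</p>

```javascript
// Enumerate all size-k subsets of the indices 0..n-1.
function subsets(n, k) {
  const result = [];
  const pick = (start, chosen) => {
    if (chosen.length === k) { result.push(chosen); return; }
    for (let i = start; i < n; i++) pick(i + 1, chosen.concat(i));
  };
  pick(0, []);
  return result;
}

// Determinant by cofactor expansion along the first row.
function det(m) {
  if (m.length === 1) return m[0][0];
  let total = 0;
  for (let j = 0; j < m.length; j++) {
    const minor = m.slice(1).map(row => row.filter((_, c) => c !== j));
    total += (j % 2 === 0 ? 1 : -1) * m[0][j] * det(minor);
  }
  return total;
}

// Theorem 1's criterion: every non-empty square submatrix must be
// invertible, i.e. must have a non-zero determinant.
function isOptimalParityMatrix(p) {
  const rows = p.length, cols = p[0].length;
  for (let k = 1; k <= Math.min(rows, cols); k++) {
    for (const rs of subsets(rows, k)) {
      for (const cs of subsets(cols, k)) {
        const sub = rs.map(r => cs.map(c => p[r][c]));
        if (det(sub) === 0) return false;
      }
    }
  }
  return true;
}

console.log(isOptimalParityMatrix([[1, 1, 1], [1, 2, 3]])); // true
console.log(isOptimalParityMatrix([[1, 0], [0, 1]]));       // false
```

<p>The second call fails because \(I_2\) contains a zero entry, which
is a non-invertible \(1 \times 1\) submatrix.</p>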
<p>Then, finally, we now have a complete definition of what makes a
list of parity functions “sufficiently different”; it is
exactly when the corresponding parity matrix is optimal as we’ve
defined it above.</p>
<p>Now this leads us to the question: how do we find such optimal
matrices? Fortunately, there’s a whole class of matrices that
are optimal: the <em>Cauchy matrices</em>.</p>
<p>Let \(a_0, \dotsc, a_{m+n-1}\) be a sequence of distinct
integers, meaning that no two \(a_i\) are equal, and let
\(x_0, \dotsc, x_{m-1}\) be the first \(m\) integers of \(a_i\) with \(y_0, \dotsc, y_{n-1}\)
the remaining integers. Then form the \(m \times n\)
matrix \(A\) by setting its elements according to:
\[
a_{ij} = \frac{1}{x_i - y_j}\text{,}
\]
which is always defined since the denominator is never zero, by the distinctness of the \(a_i\). Then \(A\) is a <em>Cauchy matrix</em>.</p>
<div class="p">What makes Cauchy matrices useful is the following theorem:
<div class="theorem">(<span class="theorem-name">Theorem 2</span>.)
Any non-empty square Cauchy matrix is invertible.</div>
Combining this with the simple fact that any submatrix of a
Cauchy matrix is also a Cauchy matrix, we get:
<div class="theorem">(<span class="theorem-name">Corollary 1</span>.)
Any non-empty square submatrix of a Cauchy matrix is
invertible, and thus any Cauchy parity matrix is optimal.</div>
</div>
<div class="interactive-example" id="cauchyMatrixDemo">
<h3>Example 5: Cauchy matrices</h3>
Let
<span style="white-space: nowrap;">
<var>x</var> = [ 1, 2, 3 ]
</span>
and
<span style="white-space: nowrap;">
<var>y</var> = [ -1, 4, 0 ].
</span>
Then, the Cauchy matrix constructed from
<var>x</var> and <var>y</var> is
<pre>/ 1/2 -1/3 1 \
| 1/3 -1/2 1/2 |
\ 1/4 -1 1/3 /,</pre>
which has inverse
<pre>/ -36/5 96/5 -36/5 \
| -3/10 9/5 -9/5 |
\ 9/2 -9 3 /.</pre>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('cauchyMatrixDemo');
render(h(CauchyMatrixDemo, {
initialX: '1, 2, 3', initialY: '-1, 4, 0', initialFieldType: 'rational',
name: 'cauchyMatrixDemo',
header: h('h3', null, 'Example 5: Cauchy matrices'),
containerClass: 'interactive-example',
inputClass: 'parameter',
allowFieldTypeChanges: false,
}), root.parent, root);
})();
</script>
<p>Therefore, to generate an optimal parity matrix for any \((n,
m)\), all we need to do is to generate an \(m \times n\)
Cauchy matrix. We can pick any sequence of \(m +
n\) distinct integers, so for simplicity let’s just use
\[
x_i = n + i \quad \text{and} \quad y_i = i\text{.}
\]</p>
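<p>As a rough sketch of this construction over the rationals, here is
one way to build the matrix. The function names are hypothetical, and
since every entry is of the form \(1/d\) for a non-zero integer \(d\),
the sketch just formats each entry as a string rather than using a
real fraction type.</p>

```javascript
// Format the fraction 1/d (d a non-zero integer) as a string like "1/3".
function reciprocalString(d) {
  const sign = d < 0 ? '-' : '';
  const den = Math.abs(d);
  return den === 1 ? `${sign}1` : `${sign}1/${den}`;
}

// Build the Cauchy matrix with entries 1 / (x[i] - y[j]).
function cauchyMatrix(x, y) {
  const all = x.concat(y);
  if (new Set(all).size !== all.length) {
    throw new Error('the elements of x and y must all be distinct');
  }
  return x.map(xi => y.map(yj => reciprocalString(xi - yj)));
}

// The m x n Cauchy parity matrix using x_i = n + i and y_i = i.
function cauchyParityMatrix(n, m) {
  const x = Array.from({ length: m }, (_, i) => n + i);
  const y = Array.from({ length: n }, (_, i) => i);
  return cauchyMatrix(x, y);
}

console.log(cauchyParityMatrix(3, 2));
// [ [ '1/3', '1/2', '1' ], [ '1/4', '1/3', '1/2' ] ]
```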
<div class="interactive-example">
<h3>Example 6: Cauchy parity matrices for \(m = 2\)</h3>
<p>For \(n = 3, m = 2\), we have the sequences
\[
x_0 = 3, x_1 = 4 \quad \text{and} \quad y_0 = 0, y_1 = 1, y_2 = 2\text{,}
\]
so the corresponding Cauchy parity matrix is
\[
P =
\begin{pmatrix}
1/3 & 1/2 & 1 \\
1/4 & 1/3 & 1/2
\end{pmatrix}\text{.}
\]
Similarly, for any \(n\),
\[
P =
\begin{pmatrix}
1/n & \cdots & 1/2 & 1 \\
1/(n + 1) & \cdots & 1/3 & 1/2
\end{pmatrix}\text{.}
\]
All entries of \(P\) are non-zero, so any \(1 \times 1\)
submatrix of \(P\) is invertible. Any \(2 \times 2\) submatrix
of \(P\) looks like
\[
A =
\begin{pmatrix}
1/a & 1/b \\
1/(a+1) & 1/(b+1)
\end{pmatrix}
\]
for \(a \ne b\), which, using the <a href="https://en.wikipedia.org/wiki/Invertible_matrix#Inversion_of_2_.C3.97_2_matrices">formula for inverses of \(2 \times 2\) matrices</a>, has inverse
\[
A^{-1} =
\begin{pmatrix}
\frac{ab(a+1)}{b-a} & -\frac{a(a+1)(b+1)}{b-a} \\
-\frac{ab(b+1)}{b-a} & \frac{b(a+1)(b+1)}{b-a}
\end{pmatrix}\text{.}
\]
These are all the possible square submatrices of \(P\), so
this \(P\) is an optimal parity matrix for \(m = 2\).</p>
</div>
<p>Note that our first parity matrix for \(n = 3, m = 2\)
isn’t a Cauchy matrix, since no Cauchy matrix can have
repeating elements in a single row. That means that there
are other possible optimal parity matrices that aren’t
Cauchy matrices.<sup><a href="#fn7" id="r7">[7]</a></sup></p>
<p>Also, our previous parity matrices had integers, and
Cauchy matrices have rational numbers (i.e.,
fractions). This means that our parity numbers are now
fractions. This isn’t a serious difference, though,
since we’d have to deal with fractions when
computing matrix inverses anyway. You could also change a
parity matrix with fractions into one without by simply
multiplying the entire matrix by some non-zero number that gets
rid of all the fractions, which doesn’t change the
optimality of the matrix. For example, we can multiply
\[
\begin{pmatrix}
1/3 & 1/2 & 1 \\
1/4 & 1/3 & 1/2
\end{pmatrix}
\]
by \(12\) to get the equivalent parity matrix
\[
\begin{pmatrix}
4 & 6 & 12 \\
3 & 4 & 6
\end{pmatrix}\text{.}
\]
</p>
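<p>The scaling step can be sketched as follows, assuming fractions are
stored as <code>[numerator, denominator]</code> pairs with positive
denominators. The name <code>clearDenominators</code> is made up for
this illustration; it multiplies by the least common multiple of all
the denominators.</p>

```javascript
function gcd(a, b) { return b === 0 ? a : gcd(b, a % b); }
function lcm(a, b) { return (a / gcd(a, b)) * b; }

// Multiply every entry by the least common multiple of the
// denominators, which turns all entries into integers.
function clearDenominators(matrix) {
  const scale = matrix.flat().reduce((acc, [, den]) => lcm(acc, den), 1);
  return {
    scale,
    matrix: matrix.map(row => row.map(([num, den]) => num * (scale / den))),
  };
}

const p = [
  [[1, 3], [1, 2], [1, 1]],
  [[1, 4], [1, 3], [1, 2]],
];
console.log(clearDenominators(p));
// { scale: 12, matrix: [ [ 4, 6, 12 ], [ 3, 4, 6 ] ] }
```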
<p>Now our only remaining missing piece is this: how do we
compute parity bytes instead of parity numbers? Answering
this would render the above discussion moot. However, to do
so, we first have to take another look at how we’re
doing linear algebra.</p>
</section>
<section>
<header>
<h2>8. Linear algebra over fields</h2>
</header>
<p>We ultimately want our parity numbers to be parity bytes, which
means that we want to work with matrices of bytes instead of
matrices of rational numbers. In order to do that, we need to define
an interface for matrix elements that preserves the operations and
properties we care about, and then we have to figure out how to
implement that interface using bytes.</p>
<p>Looking at the rule for matrix multiplication, we need to be able
to add and multiply matrix elements. Looking at how we do matrix
inversion, we also need to be able to subtract and divide matrix
elements. Finally, there are certain properties that hold for
rational numbers that we implicitly assume when doing matrix
operations, but that we have to make explicit for matrix elements.</p>
<div class="p">This leads us to the concept of a <em>field</em>, which
essentially defines the interface that matrix elements
should implement. Here it is:
<pre class="code-container"><code class="language-javascript">interface Field<T> {
static Zero: T, One: T
plus(b: T): T
negate(): T
times(b: T): T
reciprocate(): T
equals(b: T): bool
minus(b: T) = this.plus(b.negate())
dividedBy(b: T) = this.times(b.reciprocate())
}</code></pre>
</div>
<p>We need to be able to add and multiply field elements,
which we’ll denote generically by \(\oplus\) and \(\otimes\). We
also need to be able to take the negation (additive inverse) of an element \(x\),
which we’ll denote by \(-x\), and the reciprocal (multiplicative inverse) of a
non-zero element \(x\), which we’ll denote by
\(x^{-1}\). Then we can define subtraction of field elements to be
\[
a \ominus b = a \oplus -b
\]
and division of field elements to be
\[
a \cldiv b = a \otimes b^{-1}\text{,}
\]
when \(b \ne 0\).</p>
<div class="p">Also, an implementation of <code>Field</code> must satisfy further
properties, which are copied from the number laws you learn in school:
<ul>
<li>Identities: \(a \oplus 0 = a \otimes 1 = a\).</li>
<li>Inverses: \(a \oplus -a = 0\), and for \(a \ne 0\), \(a
\otimes a^{-1} = 1\).</li>
<li>Associativity: \((a \oplus b) \oplus c = a \oplus (b
\oplus c)\), and \((a \otimes b) \otimes c = a \otimes (b
\otimes c)\).</li>
<li>Commutativity: \(a \oplus b = b \oplus a\), and \(a \otimes
b = b \otimes a\).</li>
<li>Distributivity: \(a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c)\).</li>
</ul>
Of the above, guaranteeing the existence of reciprocals of
non-zero elements is usually the non-trivial part. Now the
rational numbers satisfy all of the above, since
\[
(p/q)^{-1} = q/p\text{,}
\]
so we say that they <em>form a field</em>. However, the integers
<em>do not</em> form a field, since for example \(2\) has no
integer reciprocal; only \(1\) and \(-1\) have integer
reciprocals. Furthermore, as we saw earlier, the integers
modulo \(256\), i.e. the numbers from \(0\) to \(255\) with
standard arithmetic operations modulo \(256\), do not form a
field either, since \((2 \cdot b) \bmod 256 \ne
1\) for any \(b\).</div>
<div class="p">However, we can construct a field with \(257\) elements, using the
fact that \(257\) is a prime number, and the following theorem:
<div class="theorem">(<span class="theorem-name" id="theorem-3">Theorem 3</span>.)
Given a prime number \(p\), for every integer \(0 \lt a \lt p\),
there is exactly one \(0 \lt b \lt p\) such that \((a \cdot b) \bmod
p = 1\).</div>
There are clever ways to find multiplicative inverses mod \(p\), but
since \(257\) is so small, we can just brute-force it. So an
implementation would look like:
<pre class="code-container"><code class="language-javascript">class Field257Element : implements Field<Field257Element> {
  plus(b) { return (this + b) % 257 }
  // The final "% 257" maps the negation of 0 back to 0.
  negate() { return (257 - this) % 257 }
  times(b) { return (this * b) % 257 }
  reciprocate() {
    if (this == 0) { return Error }
    // Brute force: try every candidate i until this * i = 1 (mod 257).
    for i := 0 to 256 {
      if (this.times(i) == 1) { return i; }
    }
    return Error
  }
  ...
}</code></pre>
</div>
<div class="interactive-example" id="field257Demo">
<h3>Example 7: Field with 257 elements</h3>
Denote operations on the field with 257
elements by a <sub>257</sub> subscript, and let
<span style="white-space: nowrap;">
<var>a</var> = 23
</span>
and
<span style="white-space: nowrap;">
<var>b</var> = 54.
</span>
Then
<ul>
<li>
<span style="white-space: nowrap;">
<var>a</var> +<sub>257</sub> <var>b</var> = (23 + 54) mod 257 = <span class="result">77</span>;
</span>
</li>
<li>
<span style="white-space: nowrap;">
−<sub>257</sub><var>b</var> = (257 − 54) mod 257 = <span class="result">203</span>;
</span>
</li>
<li>
<span style="white-space: nowrap;">
<var>a</var> −<sub>257</sub> <var>b</var> = <var>a</var> +<sub>257</sub> −<sub>257</sub><var>b</var> = (23 + 203) mod 257 = <span class="result">226</span>;
</span>
</li>
<li>
<span style="white-space: nowrap;">
<var>a</var> ×<sub>257</sub> <var>b</var> = (23 × 54) mod 257 = <span class="result">214</span>;
</span>
</li>
<li>
<span style="white-space: nowrap;">
54 ×<sub>257</sub> 119 = 1,
</span>
so
<span style="white-space: nowrap;">
<var>b</var><sup>-1</sup><sub>257</sub> = <span class="result">119</span>;
</span>
</li>
<li>
<span style="white-space: nowrap;">
<var>a</var> ÷<sub>257</sub> <var>b</var> = <var>a</var> ×<sub>257</sub> <var>b</var><sup>-1</sup><sub>257</sub> = (23 × 119) mod 257 = <span class="result">167</span>,
</span>
and indeed
<span style="white-space: nowrap;">
<var>b</var> ×<sub>257</sub> (<var>a</var> ÷<sub>257</sub> <var>b</var>) = (54 × 167) mod 257 = 23 = <var>a</var>.
</span>
</li>
</ul>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('field257Demo');
render(h(Field257Demo, {
initialA: '23', initialB: '54',
header: h('h3', null, 'Example 7: Field with 257 elements'),
containerClass: 'interactive-example',
inputClass: 'parameter',
resultColor: '#268bd2', // solarized blue
}), root.parent, root);
})();
</script>
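<p>For readers who want something they can paste into a console, here
is a rough runnable version of the same idea, written with plain
functions on numbers instead of a class. The function names are just
for this sketch, and <code>reciprocalMod</code> returns
<code>null</code> when no reciprocal exists, which also lets us see
why \(256\) doesn’t work as a modulus.</p>

```javascript
// Brute-force search for b with (a * b) mod n = 1; null if none exists.
function reciprocalMod(a, n) {
  for (let b = 1; b < n; b++) {
    if ((a * b) % n === 1) return b;
  }
  return null;
}

const P = 257;
const plus = (a, b) => (a + b) % P;
const negate = a => (P - a) % P;
const minus = (a, b) => plus(a, negate(b));
const times = (a, b) => (a * b) % P;
function reciprocate(a) {
  const b = reciprocalMod(a, P);
  if (b === null) throw new Error(`${a} has no reciprocal mod ${P}`);
  return b;
}
const dividedBy = (a, b) => times(a, reciprocate(b));

console.log(plus(23, 54), negate(54), minus(23, 54)); // 77 203 226
console.log(times(23, 54), reciprocate(54));          // 214 119
console.log(dividedBy(23, 54));                       // 167
console.log(reciprocalMod(2, 256));                   // null
```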
<div class="p">So this gets us closer, since we can use <code>Field257Element</code> instead
of a rational number type when implementing <code>ComputeParity</code> and <code>ReconstructData</code>,
and if we’ve abstracted our <code>Matrix</code> type correctly, almost everything should just work. However, there <em>is</em> one
thing we need to check: Are Cauchy parity matrices still
optimal if we use fields other than the rational numbers? Fortunately, the answer is yes:
<div class="theorem">(<span class="theorem-name">Theorem 1, general version</span>.)
A parity matrix \(P\) over any field is optimal exactly when any
non-empty square submatrix of \(P\) is invertible.</div>
<div class="theorem">(<span class="theorem-name">Theorem 2, general version</span>.)
Any non-empty square Cauchy matrix over any field is invertible.</div>
<div class="theorem">(<span class="theorem-name">Corollary 1, general version</span>.)
Any non-empty square submatrix of a Cauchy matrix over any field is
invertible, and thus any Cauchy parity matrix over any field is
optimal.</div>
However, note that to construct an \(m \times n\) Cauchy matrix, we
need \(m + n\) distinct elements. So if we’re working with a
field with \(257\) elements, then this imposes the condition that
\(m + n \le 257\), i.e. using a finite field limits the number of
data bytes and parity numbers you can have.</div>
<p>Now the question remains: can we construct a field with \(256\)
elements? As we saw above, we can’t do so the same way as we
constructed the field with \(257\) elements. In fact, we need to
start with defining different arithmetic operations on the
integers. This brings us to the topic of
<em>binary carry-less arithmetic</em>.</p>
</section>
<section>
<header>
<h2>9. Binary carry-less arithmetic</h2>
</header>
<p>The basic idea with binary carry-less (which I’ll henceforth
shorten to “carry-less”) arithmetic is to express all
integers in binary, then perform all arithmetic operations using
binary arithmetic, except ignoring all the carries.<sup><a href="#fn8" id="r8">[8]</a></sup></p>
<p>How does this work with addition? Let’s denote binary
carry-less add as \(\clplus\),<sup><a href="#fn9" id="r9">[9]</a></sup> and let’s see how it behaves on single binary digits:
\[
\begin{aligned}
0 \clplus 0 &= 0 \\
0 \clplus 1 &= 1 \\
1 \clplus 0 &= 1 \\
1 \clplus 1 &= 0\text{.}
\end{aligned}
\]
This is just the exclusive or operation on bits, so if we do
carry-less addition on any two integers, it turns out to be
nothing but xor! Since xor can also be denoted by \(\clplus\),
we can conveniently think of \(\clplus\) as meaning both carry-less
addition and xor.</p>
<div class="interactive-example" id="carrylessAddDemo">
<h3>Example 8: Carry-less addition</h3>
Let
<span style="white-space: nowrap;">
<var>a</var> = 23
</span>
and
<span style="white-space: nowrap;">
<var>b</var> = 54.
</span>
Then, with carry-less arithmetic,
<pre>    a = 23 =  10111b
^   b = 54 = 110110b
             -------
             100001b</pre>
so
<span style="white-space: nowrap;">
<var>a</var> ⊕ <var>b</var> = 100001<sub>b</sub> =
<span class="result">33</span>.
</span>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('carrylessAddDemo');
render(h(AddDemo, {
initialA: '23', initialB: '54',
name: 'carrylessAddDemo',
header: h('h3', null, 'Example 8: Carry-less addition'),
containerClass: 'interactive-example',
inputClass: 'parameter',
resultColor: '#268bd2', // solarized blue
}), root.parent, root);
})();
</script>
<p>What about subtraction? Recall that \((a \clplus b) \clplus
b = a\) for any \(a\) and \(b\). Therefore, every element
\(b\) is its own (carry-less binary) additive inverse, which
means that \(a \clminus b = a \clplus b\), i.e. carry-less
subtraction is also just xor.</p>
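<p>In code, both operations collapse to the same one-liner. This is a
tiny sketch with made-up names, assuming non-negative integers that
fit in 31 bits so that JavaScript’s bitwise operators behave as
expected.</p>

```javascript
// Carry-less addition of non-negative integers is bitwise xor.
const clAdd = (a, b) => a ^ b;
// Every element is its own additive inverse, so subtraction is the same.
const clSub = (a, b) => a ^ b;

console.log(clAdd(23, 54));            // 33
console.log(clSub(clAdd(23, 54), 54)); // 23
```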
<p><a href="https://en.wikipedia.org/wiki/Carry-less_product">Carry-less multiplication</a>
isn’t as simple, but recall that binary multiplication
is just adding shifted copies of \(a\) based on which bits
are set in \(b\) (or vice versa). To do carry-less
multiplication, just ignore carries when adding the shifted
copies again, i.e. xor shifted copies instead of adding
them.</p>
<div class="interactive-example" id="carrylessMulDemo">
<h3>Example 9: Carry-less multiplication</h3>
Let
<span style="white-space: nowrap;">
<var>a</var> = 23
</span>
and
<span style="white-space: nowrap;">
<var>b</var> = 54.
</span>
Then, with carry-less arithmetic,
<pre>     a = 23 =      10111b
^*   b = 54 =     110110b
              ------------
                  10111
^                10111
^              10111
^             10111
              ------------
              1111100010b</pre>
so
<span style="white-space: nowrap;">
<var>a</var> ⊗ <var>b</var> = 1111100010<sub>b</sub> =
<span class="result">994</span>.
</span>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('carrylessMulDemo');
render(h(MulDemo, {
initialA: '23', initialB: '54',
name: 'carrylessMulDemo',
header: h('h3', null, 'Example 9: Carry-less multiplication'),
containerClass: 'interactive-example',
inputClass: 'parameter',
resultColor: '#268bd2', // solarized blue
}), root.parent, root);
})();
</script>
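<p>The shift-and-xor procedure above can be sketched as a short
function. The name <code>clmul</code> matches the helper used in the
pseudocode later on, but this particular implementation is only an
illustration, and it assumes both inputs are non-negative and below
\(2^{16}\) so that the result fits in JavaScript’s 32-bit bitwise
operations.</p>

```javascript
// Carry-less multiplication: xor together one shifted copy of a for
// every bit that is set in b.
function clmul(a, b) {
  let product = 0;
  for (let shift = 0; (b >> shift) !== 0; shift++) {
    if ((b >> shift) & 1) product ^= a << shift;
  }
  return product;
}

console.log(clmul(23, 54)); // 994
console.log(clmul(54, 23)); // 994
```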
<p>Finally, we can define carry-less division with remainder. Binary
division with remainder is subtracting shifted copies of \(b\) from
\(a\) until you get a remainder less than the divisor; then
carry-less binary division with remainder is xor-ing shifted copies
of \(b\) with \(a\) until you get a remainder. However,
there’s a subtlety; with carry-less arithmetic, it’s not
enough to stop when the remainder (for that step) is less than the
divisor, because if the highest set bit of the remainder is the same
as the highest set bit of the divisor, you can still xor with the
divisor one more time to yield a smaller number (which then becomes
the final remainder).</p>
<p>Consider the example below, where we’re dividing \(55\) by
\(19\). The first remainder is \(17\), which is less than \(19\),
but still shares the same highest set bit, so we can xor one more
time with \(19\) to get the remainder \(2\).</p>
<div class="interactive-example" id="carrylessDivDemo">
<h3>Example 10: Carry-less division</h3>
Let
<span style="white-space: nowrap;">
<var>a</var> = 55
</span>
and
<span style="white-space: nowrap;">
<var>b</var> = 19.
</span>
Then, with carry-less arithmetic,
<pre>                     11b
                --------
b = 19 = 10011b )110111b = 55 = a
                ^10011
                 -----
                  10001
                 ^10011
                  -----
                     10b</pre>
so
<span style="white-space: nowrap;">
<var>a</var> ⨸ <var>b</var> = 11<sub>b</sub> =
<span class="result">3</span>
</span>
with remainder
<span style="white-space: nowrap;">
10<sub>b</sub> = <span class="result">2</span>.
</span>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('carrylessDivDemo');
render(h(DivDemo, {
initialA: '55', initialB: '19',
name: 'carrylessDivDemo',
header: h('h3', null, 'Example 10: Carry-less division'),
containerClass: 'interactive-example',
inputClass: 'parameter',
resultColor: '#268bd2', // solarized blue
}), root.parent, root);
})();
</script>
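<p>Here is a sketch of carry-less division with remainder that
handles the subtlety described above by comparing highest set bits
rather than magnitudes. The names <code>bitLength</code> and
<code>cldivmod</code> are made up for illustration, and the inputs are
assumed to be non-negative integers below \(2^{31}\).</p>

```javascript
// Number of bits needed to represent x, i.e. 1 + index of the highest
// set bit (and 0 for x = 0).
function bitLength(x) {
  return x === 0 ? 0 : 32 - Math.clz32(x);
}

// Keep xor-ing shifted copies of b into the remainder until the
// remainder's highest set bit is lower than b's highest set bit.
function cldivmod(a, b) {
  if (b === 0) throw new Error('division by zero');
  const bLen = bitLength(b);
  let quotient = 0;
  let remainder = a;
  while (bitLength(remainder) >= bLen) {
    const shift = bitLength(remainder) - bLen;
    quotient ^= 1 << shift;
    remainder ^= b << shift;
  }
  return { quotient, remainder };
}

console.log(cldivmod(55, 19)); // { quotient: 3, remainder: 2 }
```

<p>Note that the loop condition is on bit lengths, so the intermediate
remainder \(17\) gets one more xor with \(19\) even though \(17 \lt
19\).</p>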
<p>This leads to an interesting difference between the carry-less
modulo operation and the standard modulo operation. If you mod by a
number \(n\), you get \(n\) possible remainders, from \(0\) to \(n -
1\). However, if you clmod (carry-less mod) by a number \(2^k \le n
\lt 2^{k+1}\), you get \(2^k\) possible remainders, from \(0\) to
\(2^k-1\), since those are the numbers whose highest set bit is
lower than the highest set bit of \(n\).</p>
<p>In particular, if you clmod by a number \(256 \le n <
512\), you always get \(256\) possible remainders. This is
very close to what we want—now the hope is to find <em>some</em> \(256
\le n < 512\) so that doing binary carry-less arithmetic clmod
\(n\) yields a field, which will then be a field with \(256\)
elements!</p>
</section>
<section>
<header>
<h2>10. The finite field with \(256\) elements</h2>
</header>
<p>Since there are only \(256\) integers \(n\) with \(256 \le n \lt 512\), we
can just try each one of them to see if clmod-ing by it
yields a field. However, with a bit of math, we can gain more
insight into which numbers will work.</p>
<p>Recall the situation with the standard arithmetic
operations: arithmetic mod \(p\) yields a field exactly when
\(p\) is prime.<sup><a href="#fn10" id="r10">[10]</a></sup> But recall
the definition of a prime number: it is an integer greater than
\(1\) whose positive divisors are only itself and \(1\). Stated
another way, a prime number is an integer \(p \gt 1\) that cannot be
expressed as \(p = a \cdot b\), for \(a, b \gt 1\).</p>
<p>Thus, the concept of a prime number is determined by the
multiplication operation, and therefore we can define a
“carry-less” prime number to be an integer \( p \gt 1\)
that cannot be expressed as \(p = a \clmul b\), for \(a, b \gt 1\).<sup><a href="#fn11" id="r11">[11]</a></sup></p>
<div class="p">The only question remaining is whether there is an equivalent of <a href="#theorem-3">Theorem 3</a> for
carry-less arithmetic. And indeed there is:
<div class="theorem">(<span class="theorem-name">Theorem 4</span>.)
Given a carry-less prime number \(2^k \le p \lt 2^{k+1}\), for every
integer \(0 \lt a \lt 2^k\), there is exactly one \(0 \lt b \lt
2^k\) such that \((a \clmul b) \bclmod p = 1\).</div>
Now we just need to find a carry-less prime number \(256
\le p < 512\). However, the set of prime numbers and the
set of carry-less prime numbers are not necessarily related,
so for example, even though \(257\) is a prime number, it is <em>not</em> a
carry-less prime number.</div>
<p>It is easy enough to test each number \(256 \le n < 512\) for
carry-less primality though; doing so, we find the lowest one,
\(283\).<sup><a href="#fn12" id="r12">[12]</a></sup></p>
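<p>One way to run that search is by trial division, exactly as you
would for ordinary primes but with clmod in place of mod. This is
only a brute-force sketch with illustrative names; it relies on the
fact that \(p\) is a carry-less product \(a \clmul b\) with \(a, b \gt
1\) exactly when some \(1 \lt d \lt p\) leaves remainder \(0\).</p>

```javascript
function bitLength(x) {
  return x === 0 ? 0 : 32 - Math.clz32(x);
}

// Carry-less remainder of a divided by b (b non-zero).
function clmod(a, b) {
  const bLen = bitLength(b);
  while (bitLength(a) >= bLen) a ^= b << (bitLength(a) - bLen);
  return a;
}

// Trial division: p is a carry-less prime if no 1 < d < p divides it.
function isCarrylessPrime(p) {
  if (p < 2) return false;
  for (let d = 2; d < p; d++) {
    if (clmod(p, d) === 0) return false;
  }
  return true;
}

const candidates = [];
for (let n = 256; n < 512; n++) {
  if (isCarrylessPrime(n)) candidates.push(n);
}
console.log(candidates[0]);         // 283
console.log(isCarrylessPrime(257)); // false
```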
<div class="p">So finally, we have a field with \(256\) elements: the
integers with binary carry-less arithmetic clmod \(283\). An
implementation would look like:
<pre class="code-container"><code class="language-javascript">class Field256Element : implements Field<Field256Element> {
  plus(b) { return this ^ b }
  // Every element is its own additive inverse.
  negate() { return this }
  times(b) { return clmod(clmul(this, b), 283) }
  reciprocate() {
    if (this == 0) { return Error }
    // Brute force: try every candidate i until this * i = 1 (clmod 283).
    for i := 0 to 255 {
      if (this.times(i) == 1) { return i; }
    }
    return Error
  }
  ...
}</code></pre>
As with reciprocals mod \(257\), we find reciprocals clmod \(283\)
by brute force.</div>
<div class="interactive-example" id="field256Demo">
<h3>Example 11: Field with 256 elements</h3>
Denote operations on the field with 256
elements by a <sub>256</sub> subscript, and let
<span style="white-space: nowrap;">
<var>a</var> = 23
</span>
and
<span style="white-space: nowrap;">
<var>b</var> = 54.
</span>
Then
<ul>
<li>
<span style="white-space: nowrap;">
<var>a</var> ⊕<sub>256</sub> <var>b</var> = 23 ⊕ 54 = <span class="result">33</span>;
</span>
</li>
<li>
<span style="white-space: nowrap;">
⊖<sub>256</sub><var>b</var> = <var>b</var> = <span class="result">54</span>;
</span>
</li>
<li>
<span style="white-space: nowrap;">
<var>a</var> ⊖<sub>256</sub> <var>b</var> = <var>a</var> ⊕<sub>256</sub> ⊖<sub>256</sub><var>b</var> = <var>a</var> ⊕<sub>256</sub> <var>b</var> = <span class="result">33</span>;
</span>
</li>
<li>
<span style="white-space: nowrap;">
<var>a</var> ⊗<sub>256</sub> <var>b</var> = (23 ⊗ 54) clmod 283 = <span class="result">207</span>;
</span>
</li>
<li>
<span style="white-space: nowrap;">
54 ⊗<sub>256</sub> 102 = 1,
</span>
so
<span style="white-space: nowrap;">
<var>b</var><sup>-1</sup><sub>256</sub> = <span class="result">102</span>;
</span>
</li>
<li>
<span style="white-space: nowrap;">
<var>a</var> ⨸<sub>256</sub> <var>b</var> = <var>a</var> ⊗<sub>256</sub> <var>b</var><sup>-1</sup><sub>256</sub> = (23 ⊗ 102) clmod 283 = <span class="result">19</span>,
</span>
and indeed
<span style="white-space: nowrap;">
<var>b</var> ⊗<sub>256</sub> (<var>a</var> ⨸<sub>256</sub> <var>b</var>) = (54 ⊗ 19) clmod 283 = 23 = <var>a</var>.
</span>
</li>
</ul>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('field256Demo');
render(h(Field256Demo, {
initialA: '23', initialB: '54',
header: h('h3', null, 'Example 11: Field with 256 elements'),
containerClass: 'interactive-example',
inputClass: 'parameter',
resultColor: '#268bd2', // solarized blue
}), root.parent, root);
})();
</script>
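<p>Putting the carry-less pieces together, a runnable sketch of this
field might look like the following. It uses plain functions on
numbers from \(0\) to \(255\) rather than a class, and the helper
implementations are illustrative rather than the only way to write
them.</p>

```javascript
const MODULUS = 283; // the lowest carry-less prime between 256 and 511

function bitLength(x) {
  return x === 0 ? 0 : 32 - Math.clz32(x);
}

function clmul(a, b) {
  let product = 0;
  for (let shift = 0; (b >> shift) !== 0; shift++) {
    if ((b >> shift) & 1) product ^= a << shift;
  }
  return product;
}

function clmod(a, b) {
  const bLen = bitLength(b);
  while (bitLength(a) >= bLen) a ^= b << (bitLength(a) - bLen);
  return a;
}

const plus = (a, b) => a ^ b;
const negate = a => a;
const minus = (a, b) => plus(a, negate(b));
const times = (a, b) => clmod(clmul(a, b), MODULUS);
function reciprocate(a) {
  if (a === 0) throw new Error('0 has no reciprocal');
  for (let i = 1; i < 256; i++) {
    if (times(a, i) === 1) return i;
  }
  throw new Error('unreachable when MODULUS is a carry-less prime');
}
const dividedBy = (a, b) => times(a, reciprocate(b));

console.log(plus(23, 54), minus(23, 54));    // 33 33
console.log(times(23, 54), reciprocate(54)); // 207 102
console.log(dividedBy(23, 54));              // 19
```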
</section>
<section>
<header>
<h2>11. The full algorithm</h2>
</header>
<p>Now we have all the pieces we need to construct erasure codes for
any \((n, m)\) such that \(m + n \le 256\). First, we can compute an
\(m \times n\) Cauchy parity matrix over the field with \(256\)
elements. (Recall that this needs \(m + n\) distinct field elements,
which is what imposes the condition \(m + n \le 256\).)</p>
<div class="interactive-example" id="cauchyMatrixDemoGeneral">
<h3>Example 12: Cauchy matrices in general</h3>
Working over the field with 256 elements, let
<span style="white-space: nowrap;">
<var>x</var> = [ 1, 2, 3 ]
</span>
and
<span style="white-space: nowrap;">
<var>y</var> = [ 4, 5, 6 ].
</span>
Then, the Cauchy matrix constructed from
<var>x</var> and <var>y</var> is
<pre>/ 82 203 209 \
| 123 209 203 |
\ 209 123 82 /,</pre>
which has inverse
<pre>/ 130 31 176 \
| 252 219 31 |
\ 108 252 130 /.</pre>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('cauchyMatrixDemoGeneral');
render(h(CauchyMatrixDemo, {
initialX: '1, 2, 3', initialY: '4, 5, 6', initialFieldType: 'gf256',
name: 'cauchyMatrixDemoGeneral',
header: h('h3', null, 'Example 12: Cauchy matrices in general'),
containerClass: 'interactive-example',
inputClass: 'parameter',
allowFieldTypeChanges: true,
}), root.parent, root);
})();
</script>
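<p>Here is a sketch of building a Cauchy matrix over the field with
\(256\) elements. To keep it short, <code>gfMul</code> interleaves the
shifting and the reduction by \(283\) instead of calling separate
clmul and clmod helpers; that is an implementation choice for this
sketch and should give the same results. Since subtraction is xor,
each entry is the reciprocal of <code>x[i] ^ y[j]</code>.</p>

```javascript
// Multiplication in the field with 256 elements: carry-less multiply,
// reducing by 283 whenever the running copy of a reaches 9 bits.
function gfMul(a, b) {
  let product = 0;
  while (b > 0) {
    if (b & 1) product ^= a;
    a <<= 1;
    if (a & 0x100) a ^= 283;
    b >>= 1;
  }
  return product;
}

// Brute-force reciprocal, as in the pseudocode above.
function gfInv(a) {
  if (a === 0) throw new Error('0 has no reciprocal');
  for (let i = 1; i < 256; i++) {
    if (gfMul(a, i) === 1) return i;
  }
  throw new Error('unreachable');
}

// Cauchy matrix with entries 1 / (x[i] - y[j]), where "-" is xor.
function cauchyMatrix(x, y) {
  const all = x.concat(y);
  if (new Set(all).size !== all.length) {
    throw new Error('the elements of x and y must all be distinct');
  }
  return x.map(xi => y.map(yj => gfInv(xi ^ yj)));
}

console.log(cauchyMatrix([1, 2, 3], [4, 5, 6]));
// [ [ 82, 203, 209 ], [ 123, 209, 203 ], [ 209, 123, 82 ] ]
```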
<p>Then we can implement matrix multiplication over arbitrary fields,
and thus we can implement <code>ComputeParity</code>.</p>
<div class="interactive-example" id="computeParityDetailDemo">
<h3>Example 13: <code>ComputeParity</code> in detail</h3>
Let
<span style="white-space: nowrap;">
<var>d</var> = [ da, db, 0d ]
</span>
be the input data bytes and let
<span style="white-space: nowrap;">
<var>m</var> = 2
</span>
be the desired parity byte count. Then, with the input byte
count
<span style="white-space: nowrap;">
<var>n</var> = 3,
</span>
the
<span style="white-space: nowrap;">
<var>m</var> × <var>n</var>
</span>
Cauchy parity matrix computed using
<span style="white-space: nowrap;">
<var>x</var><sub>i</sub> = <var>n</var> + <var>i</var>
</span>
and
<span style="white-space: nowrap;">
<var>y</var><sub>i</sub> = <var>i</var>
</span>
is
<pre>/ f6 8d 01 \
\ cb 52 7b /.</pre>
Therefore, the parity bytes are computed as
<pre> _ _ _ _
/ f6 8d 01 \ | da | | 52 |
\ cb 52 7b / * | db | = |_ 0c _|,
|_ 0d _|</pre>
and thus the output parity bytes are
<span style="white-space: nowrap;">
<var>p</var> = [ <span class="result">52</span>, <span class="result">0c</span> ].
</span>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('computeParityDetailDemo');
render(h(ComputeParityDemo, {
initialD: 'da, db, 0d', initialM: '2',
name: 'computeParityDetailDemo',
detailed: true,
header: h('h3', null, 'Example 13: ', h('code', null, 'ComputeParity'), ' in detail'),
containerClass: 'interactive-example',
inputClass: 'parameter',
resultColor: '#268bd2', // solarized blue
}), root.parent, root);
})();
</script>
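<p>The computation in the example above can be sketched end to end as
follows. The function <code>computeParity</code> here is a
hypothetical stand-in for <code>ComputeParity</code>, operating on
arrays of byte values, and it rebuilds the Cauchy parity matrix on
every call for simplicity.</p>

```javascript
function gfMul(a, b) {
  let product = 0;
  while (b > 0) {
    if (b & 1) product ^= a;
    a <<= 1;
    if (a & 0x100) a ^= 283;
    b >>= 1;
  }
  return product;
}

function gfInv(a) {
  if (a === 0) throw new Error('0 has no reciprocal');
  for (let i = 1; i < 256; i++) {
    if (gfMul(a, i) === 1) return i;
  }
  throw new Error('unreachable');
}

// Multiply the m x n Cauchy parity matrix (x_i = n + i, y_j = j) by
// the column vector of data bytes, with xor as addition.
function computeParity(data, m) {
  const n = data.length;
  if (m + n > 256) throw new Error('need m + n <= 256');
  const parityMatrix = Array.from({ length: m }, (_, i) =>
    Array.from({ length: n }, (_, j) => gfInv((n + i) ^ j)));
  return parityMatrix.map(row =>
    row.reduce((sum, p, j) => sum ^ gfMul(p, data[j]), 0));
}

const parity = computeParity([0xda, 0xdb, 0x0d], 2);
console.log(parity.map(b => b.toString(16).padStart(2, '0')));
// [ '52', '0c' ]
```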
<p>Then we can implement matrix inversion using row reduction over
arbitrary fields.</p>
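<p>A compact sketch of row reduction over the field with \(256\)
elements is below. It is not a transcription of the step-by-step
procedure in the example: it clears each column both above and below
the pivot in a single pass, which is an assumption made here to keep
the code short, but it should arrive at the same inverse. The function
names are illustrative.</p>

```javascript
function gfMul(a, b) {
  let product = 0;
  while (b > 0) {
    if (b & 1) product ^= a;
    a <<= 1;
    if (a & 0x100) a ^= 283;
    b >>= 1;
  }
  return product;
}

function gfInv(a) {
  if (a === 0) throw new Error('0 has no reciprocal');
  for (let i = 1; i < 256; i++) {
    if (gfMul(a, i) === 1) return i;
  }
  throw new Error('unreachable');
}

function invertMatrix(m) {
  const n = m.length;
  // Start from the augmented matrix [ M | I ].
  const a = m.map((row, i) =>
    row.concat(Array.from({ length: n }, (_, j) => (i === j ? 1 : 0))));
  for (let col = 0; col < n; col++) {
    // Find a row at or below the diagonal with a non-zero entry and swap.
    let pivot = col;
    while (pivot < n && a[pivot][col] === 0) pivot++;
    if (pivot === n) throw new Error('matrix is not invertible');
    [a[col], a[pivot]] = [a[pivot], a[col]];
    // Divide the pivot row by the pivot so the diagonal entry becomes 1.
    const inv = gfInv(a[col][col]);
    a[col] = a[col].map(v => gfMul(v, inv));
    // Subtract (xor) scaled copies of the pivot row from every other row.
    for (let r = 0; r < n; r++) {
      const factor = a[r][col];
      if (r === col || factor === 0) continue;
      a[r] = a[r].map((v, j) => v ^ gfMul(factor, a[col][j]));
    }
  }
  // The right half of the augmented matrix is now the inverse.
  return a.map(row => row.slice(n));
}

console.log(invertMatrix([[0, 2, 2], [3, 4, 5], [6, 6, 7]]));
// [ [ 82, 82, 82 ], [ 121, 247, 246 ], [ 244, 247, 246 ] ]
```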
<div class="interactive-example" id="matrixInverseDemoGeneral">
<h3>Example 14: Matrix inversion via row reduction in general</h3>
Working over the field with 256 elements, let
<pre> / 0 2 2 \
M = | 3 4 5 |
\ 6 6 7 /.</pre>
The initial augmented matrix <var>A</var> is
<pre>/ 0 2 2 | 1 0 0 \
| 3 4 5 | 0 1 0 |
\ 6 6 7 | 0 0 1 /.</pre>
We need <var>A</var><sub>00</sub> to be non-zero, so swap rows <span class="swap-row-a">0</span> and <span class="swap-row-b">1</span>:
<pre>/ <span class="swap-row-a">0 2 2</span> | <span class="swap-row-a">1 0 0</span> \ / <span class="swap-row-b">3 4 5</span> | <span class="swap-row-b">0 1 0</span> \
| <span class="swap-row-b">3 4 5</span> | <span class="swap-row-b">0 1 0</span> | --> | <span class="swap-row-a">0 2 2</span> | <span class="swap-row-a">1 0 0</span> |
\ 6 6 7 | 0 0 1 / \ 6 6 7 | 0 0 1 /.</pre>
We need <var>A</var><sub>00</sub> to be 1, so divide row <span class="divide-row">0</span> by 3:
<pre>/ <span class="divide-row">3 4 5</span> | <span class="divide-row">0 1 0</span> \ / <span class="divide-row">1 245 3</span> | <span class="divide-row">0 246 0</span> \
| 0 2 2 | 1 0 0 | --> | 0 2 2 | 1 0 0 |
\ 6 6 7 | 0 0 1 / \ 6 6 7 | 0 0 1 /.</pre>
We need <var>A</var><sub>20</sub> to be 0, so subtract row <span class="subtract-scaled-row-src">0</span> scaled by 6 from row <span class="subtract-scaled-row-dest">2</span>:
<pre>/ <span class="subtract-scaled-row-src">1 245 3</span> | <span class="subtract-scaled-row-src">0 246 0</span> \ / 1 245 3 | 0 246 0 \
| 0 2 2 | 1 0 0 | --> | 0 2 2 | 1 0 0 |
\ <span class="subtract-scaled-row-dest">6 6 7</span> | <span class="subtract-scaled-row-dest">0 0 1</span> / \ <span class="subtract-scaled-row-dest">0 14 13</span> | <span class="subtract-scaled-row-dest">0 2 1</span> /.</pre>
We need <var>A</var><sub>11</sub> to be 1, so divide row <span class="divide-row">1</span> by 2:
<pre>/ 1 245 3 | 0 246 0 \ / 1 245 3 | 0 246 0 \
| <span class="divide-row">0 2 2</span> | <span class="divide-row">1 0 0</span> | --> | <span class="divide-row">0 1 1</span> | <span class="divide-row">141 0 0</span> |
\ 0 14 13 | 0 2 1 / \ 0 14 13 | 0 2 1 /.</pre>
We need <var>A</var><sub>21</sub> to be 0, so subtract row <span class="subtract-scaled-row-src">1</span> scaled by 14 from row <span class="subtract-scaled-row-dest">2</span>:
<pre>/ 1 245 3 | 0 246 0 \ / 1 245 3 | 0 246 0 \
| <span class="subtract-scaled-row-src">0 1 1</span> | <span class="subtract-scaled-row-src">141 0 0</span> | --> | 0 1 1 | 141 0 0 |
\ <span class="subtract-scaled-row-dest">0 14 13</span> | <span class="subtract-scaled-row-dest"> 0 2 1</span> / \ <span class="subtract-scaled-row-dest">0 0 3</span> | <span class="subtract-scaled-row-dest"> 7 2 1</span> /.</pre>
We need <var>A</var><sub>22</sub> to be 1, so divide row <span class="divide-row">2</span> by 3, which makes the left side of <var>A</var> a
unit upper triangular matrix:
<pre>/ 1 245 3 | 0 246 0 \ / 1 245 3 | 0 246 0 \
| 0 1 1 | 141 0 0 | --> | 0 1 1 | 141 0 0 |
\ <span class="divide-row">0 0 3</span> | <span class="divide-row">7 2 1</span> / \ <span class="divide-row">0 0 1</span> | <span class="divide-row">244 247 246</span> /.</pre>
We need <var>A</var><sub>12</sub> to be 0, so subtract row <span class="subtract-scaled-row-src">2</span> from row <span class="subtract-scaled-row-dest">1</span>:
<pre>/ 1 245 3 | 0 246 0 \ / 1 245 3 | 0 246 0 \
| <span class="subtract-scaled-row-dest">0 1 1</span> | <span class="subtract-scaled-row-dest">141 0 0 </span> | --> | <span class="subtract-scaled-row-dest">0 1 0</span> | <span class="subtract-scaled-row-dest">121 247 246</span> |
\ <span class="subtract-scaled-row-src">0 0 1</span> | <span class="subtract-scaled-row-src">244 247 246</span> / \ 0 0 1 | 244 247 246 /.</pre>
We need <var>A</var><sub>02</sub> to be 0, so subtract row <span class="subtract-scaled-row-src">2</span> scaled by 3 from row <span class="subtract-scaled-row-dest">0</span>:
<pre>/ <span class="subtract-scaled-row-dest">1 245 3</span> | <span class="subtract-scaled-row-dest"> 0 246 0 </span> \ / <span class="subtract-scaled-row-dest">1 245 0</span> | <span class="subtract-scaled-row-dest"> 7 244 1 </span> \
| 0 1 0 | 121 247 246 | --> | 0 1 0 | 121 247 246 |
\ <span class="subtract-scaled-row-src">0 0 1</span> | <span class="subtract-scaled-row-src">244 247 246</span> / \ 0 0 1 | 244 247 246 /.</pre>
We need <var>A</var><sub>01</sub> to be 0, so subtract row <span class="subtract-scaled-row-src">1</span> scaled by 245 from row <span class="subtract-scaled-row-dest">0</span>, which makes the left side of <var>A</var> the identity matrix:
<pre>/ <span class="subtract-scaled-row-dest">1 245 0</span> | <span class="subtract-scaled-row-dest"> 7 244 1 </span> \ / <span class="subtract-scaled-row-dest">1 0 0</span> | <span class="subtract-scaled-row-dest"> 82 82 82</span> \
| <span class="subtract-scaled-row-src">0 1 0</span> | <span class="subtract-scaled-row-src">121 247 246</span> | --> | 0 1 0 | 121 247 246 |
\ 0 0 1 | 244 247 246 / \ 0 0 1 | 244 247 246 /.</pre>
Since the left side of <var>A</var> is the identity matrix, the right side of <var>A</var> is <var>M</var><sup>-1</sup>. Therefore,
<pre> / 82 82 82 \
M^{-1} = | 121 247 246 |
\ 244 247 246 /.</pre>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('matrixInverseDemoGeneral');
render(h(MatrixInverseDemo, {
initialElements: '0, 2, 2, 3, 4, 5, 6, 6, 7', initialFieldType: 'gf256',
name: 'matrixInverseDemoGeneral',
header: h('h3', null, 'Example 14: Matrix inversion via row reduction in general'),
containerClass: 'interactive-example',
inputClass: 'parameter',
buttonClass: 'interactive-example-button',
allowFieldTypeChanges: true,
swapRowAColor: '#dc322f', // solarized red
swapRowBColor: '#268bd2', // solarized blue
divideRowColor: '#dc322f', // solarized red
subtractScaledRowSrcColor: '#268bd2', // solarized blue
subtractScaledRowDestColor: '#dc322f', // solarized red
}), root.parentNode, root);
})();
</script>
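<p>The row reduction above can be sketched in code as follows. This is a minimal sketch, not the implementation behind the demos: the names <code>gfMul</code>, <code>gfInv</code>, and <code>gfMatrixInverse</code> are hypothetical, the field is assumed to be the one with \(256\) elements defined by the carry-less modulus \(283\), inverses are found by brute-force search, and each column is cleared above and below the pivot in a single pass rather than in the two passes shown above.</p>

```javascript
// Multiply two bytes in the field with 256 elements, reducing by the
// carry-less modulus 283 (0x11b) whenever the product overflows 8 bits.
function gfMul(a, b) {
  let p = 0;
  while (b > 0) {
    if (b & 1) p ^= a;
    a <<= 1;
    if (a & 0x100) a ^= 0x11b;
    b >>= 1;
  }
  return p;
}

// Multiplicative inverse by brute-force search (fine for 255 candidates).
function gfInv(a) {
  if (a === 0) throw new Error('division by zero');
  for (let x = 1; x < 256; ++x) {
    if (gfMul(a, x) === 1) return x;
  }
  throw new Error('unreachable: every non-zero element has an inverse');
}

// Invert a square matrix by row-reducing the augmented matrix [ m | I ].
function gfMatrixInverse(m) {
  const n = m.length;
  const a = m.map((row, i) => row.concat(row.map((_, j) => (i === j ? 1 : 0))));
  for (let i = 0; i < n; ++i) {
    // Find a row at or below row i with a non-zero entry in column i.
    let pivot = i;
    while (pivot < n && a[pivot][i] === 0) ++pivot;
    if (pivot === n) throw new Error('matrix is not invertible');
    [a[i], a[pivot]] = [a[pivot], a[i]];
    // Divide row i by its pivot so that the pivot becomes 1.
    const inv = gfInv(a[i][i]);
    a[i] = a[i].map((x) => gfMul(x, inv));
    // Subtract scaled copies of row i from every other row (subtraction is XOR).
    for (let k = 0; k < n; ++k) {
      const c = a[k][i];
      if (k === i || c === 0) continue;
      a[k] = a[k].map((x, j) => x ^ gfMul(c, a[i][j]));
    }
  }
  return a.map((row) => row.slice(n));
}

console.log(gfMatrixInverse([[0, 2, 2], [3, 4, 5], [6, 6, 7]]));
// rows: 82 82 82 / 121 247 246 / 244 247 246, matching the example above
```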
<p>Finally, we can use that to implement <code>ReconstructData</code>.</p>
<div class="interactive-example" id="reconstructDataDetailDemo">
<h3>Example 15: <code>ReconstructData</code> in detail</h3>
Let
<span style="white-space: nowrap;">
<var>d</var><sub>partial</sub> = [ ??, db, ?? ]
</span>
be the input partial data bytes and
<span style="white-space: nowrap;">
<var>p</var><sub>partial</sub> = [ 52, 0c ]
</span>
be the input partial parity bytes. Then, with the data byte
count
<span style="white-space: nowrap;">
<var>n</var> = 3
</span>
and the parity byte count
<span style="white-space: nowrap;">
<var>m</var> = 2,
</span>
and appending the rows of the
<span style="white-space: nowrap;">
<var>m</var> × <var>n</var>
</span>
Cauchy parity matrix to the
<span style="white-space: nowrap;">
<var>n</var> × <var>n</var>
</span>
identity matrix, we get
<pre>/ X01X X00X X00X \
| 00 01 00 |
| X00X X00X X01X |
| f6 8d 01 |
\ cb 52 7b /,</pre>
where the rows corresponding to the unknown data and parity
bytes are crossed out. Taking the first <var>n</var> rows that
aren’t crossed out, we get the square matrix
<pre>/ 00 01 00 \
| f6 8d 01 |
\ cb 52 7b /</pre>
which has inverse
<pre>/ 01 d0 d6 \
| 01 00 00 |
\ 7b b8 bb /.</pre>
Therefore, the data bytes are reconstructed from the first
<var>n</var> known data and parity bytes as
<pre> _ _ _ _
/ 01 d0 d6 \ | db | | da |
| 01 00 00 | * | 52 | = | db |
\ 7b b8 bb / |_ 0c _| |_ 0d _|,</pre>
and thus the output data bytes are
<span style="white-space: nowrap;">
<var>d</var> = [ <span class="result">da</span>, <span class="result">db</span>, <span class="result">0d</span> ].
</span>
</div>
<script>
'use strict';
(function() {
const { h, render } = window.preact;
const root = document.getElementById('reconstructDataDetailDemo');
render(h(ReconstructDataDemo, {
initialPartialD: '??, db, ??', initialPartialP: '52, 0c',
name: 'reconstructDataDetailDemo',
detailed: true,
header: h('h3', null, 'Example 15: ', h('code', null, 'ReconstructData'), ' in detail'),
containerClass: 'interactive-example',
inputClass: 'parameter',
resultColor: '#268bd2', // solarized blue
}), root.parentNode, root);
})();
</script>
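<p>The last step of the example, multiplying the inverse matrix by the known bytes, can be checked with a few lines of code. This is only a sketch under the same assumptions as before (carry-less modulus \(283\)); <code>gfMul</code> and <code>gfMatVecMul</code> are hypothetical names, and the inverse matrix is copied from the example rather than computed.</p>

```javascript
// Multiply two bytes in the field with 256 elements (carry-less modulus 283).
function gfMul(a, b) {
  let p = 0;
  while (b > 0) {
    if (b & 1) p ^= a;
    a <<= 1;
    if (a & 0x100) a ^= 0x11b;
    b >>= 1;
  }
  return p;
}

// Multiply a matrix by a column vector over the field; addition is XOR.
function gfMatVecMul(m, v) {
  return m.map((row) => row.reduce((sum, x, j) => sum ^ gfMul(x, v[j]), 0));
}

// Inverse matrix and known bytes, copied from the example above.
const inverse = [
  [0x01, 0xd0, 0xd6],
  [0x01, 0x00, 0x00],
  [0x7b, 0xb8, 0xbb],
];
const known = [0xdb, 0x52, 0x0c];

const d = gfMatVecMul(inverse, known);
console.log(d.map((x) => x.toString(16).padStart(2, '0')).join(', '));
// prints "da, db, 0d"
```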
<p>And we’re done!</p>
</section>
<section>
<header>
<h2>12. Further reading</h2>
</header>
<p>Next time we’ll talk about the PAR1 file format, which is a
practical implementation of an erasure code very similar to the one
described above, and the various challenges in making it perform well
on sets of large files.</p>
<p>Also, for those of you interested in the mathematical details,
I’ll write a companion article. (This article is already
quite long!)</p>
<p>I gave <a href="./magic-erasure-codes">a 15-minute
presentation</a> for <a href="https://wafflejs.com">WaffleJS</a> covering
the same topics as this article but at a higher level and more
informally.</p>
<p>I got the idea for explaining the finite field with \(256\)
elements in terms of binary carry-less arithmetic from <a href="http://www.zlib.net/crc_v3.txt">A
Painless Guide to CRC Error Detection Algorithms</a>, which is an
excellent document in its own right.</p>
<p>Most sources below use Vandermonde matrices, which I plan to cover
in the next article on PAR1, instead of Cauchy matrices. Cauchy
matrices are more foolproof, which is why I started with
them. templexxx, whose Go implementation I cite below, <a href="http://www.templex.xyz/blog/101/cauchy.html">feels the same way</a>. (His
blog post is in Chinese, but <a href="https://translate.google.com/">Google Translate</a> or
a similar service translates it into English well enough.)</p>
<p>I started learning about erasure codes from <a href="https://web.eecs.utk.edu/~plank/">James
Plank’s</a> papers. See <a href="https://web.eecs.utk.edu/~plank/plank/papers/CS-96-332.pdf">A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like systems</a>, but also make sure to read the very important <a href="https://web.eecs.utk.edu/~plank/plank/papers/CS-03-504.pdf">correction</a> to it! <a href="http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf">Optimizing
Cauchy Reed-Solomon Codes for Fault-Tolerant Storage Applications</a> covers
Cauchy matrices, although in a slightly different context. The first
part of Plank’s <a href="http://web.eecs.utk.edu/~plank/plank/classes/cs560/560/notes/Erasure/2004-ICL.pdf">All About Erasure Codes</a> slides
also contains a good overview of the encoding/decoding process,
including a nifty color-coded matrix diagram.</p>
<p>As for implementations, <a href="https://github.com/klauspost/reedsolomon">klauspost</a> and <a href="https://github.com/templexxx/reedsolomon">templexxx</a> have
good ones written in Go. They were in turn inspired by <a href="https://github.com/Backblaze/JavaReedSolomon">Backblaze’s Java implementation</a>. <a href="https://www.backblaze.com/blog/reed-solomon/">Backblaze’s
accompanying blog post</a> is also a good overview of the topic. The
toy JS implementation powering the demos on this page is also
available on <a href="https://github.com/akalin/intro-erasure-codes">my GitHub</a>.</p>
<p><a href="https://people.cs.clemson.edu/~westall/851/rs-code.pdf">An Introduction to Galois Fields and Reed-Solomon Coding</a><sup><a href="#fn13" id="r13">[13]</a></sup> covers
much of the same material as I do, albeit assuming slightly more
mathematical background.</p>
<p>Going further afield, <a href="https://research.swtch.com/field">Russ Cox</a>, <a href="https://jeremykun.com/2015/03/23/the-codes-of-solomon-reed-and-muller/">Jeremy Kun</a>, and <a href="https://www.nayuki.io/page/reed-solomon-error-correcting-code-decoder">Nayuki</a>
also wrote about finite fields and Reed-Solomon codes.</p>
</section>
<hr />
<p class="thanks">Thanks to Ying-zong Huang, Ryan Hitchman, Charles
Ellis, and Josh Gao for comments/corrections/discussion.</p>
<p>Like this post? Subscribe to
<!-- The image is 256x256, the center of the dot is 189 pixels from the
top, and the radius of the dot is 24. Therefore, the dot is 43/256 =
0.16796875 of the image height above the bottom.-->
<a href="feed/atom">my feed <img src="feed-icon.svg" alt="RSS icon" style="width: 1em; height: 1em; vertical-align: -0.16796875em;" /></a>
or follow me on
<a href="https://twitter.com/fakalin">Twitter <img src="twitter-icon.svg" alt="Twitter icon" style="width: 1em; height: 1em;" /></a>.</p>
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] This discussion of linear algebra is necessarily
abbreviated for our purposes. For a more general but still basic
treatment, see <a href="https://www.khanacademy.org/math/linear-algebra/matrix-transformations">Khan Academy</a>. <a href="#r1">↩</a></p>
<p id="fn2">[2] Here and throughout this document, I index vectors and
matrices starting with \(0\), to better match array indices in
code. Most math texts index vectors and matrices starting at \(1\). <a href="#r2">↩</a></p>
<p id="fn3">[3] Now would be a good time to talk about the conventions
I and other texts use. Following <a href="https://web.eecs.utk.edu/~plank/plank/papers/CS-96-332.pdf">Plank</a>,
I use \(n\) for the data byte count and \(m\) for the parity
byte count, and I represent arrays and vectors as <em>column vectors</em>, where multiplication with a matrix is done with the column vector on the <em>right</em>,
which is the standard in most of math. However, in coding theory,
\(k\) is used for the data byte count, which they call the <em>message length</em>, and \(n\) is used for the sum of the data and parity byte counts, which they call the <em>codeword length</em>. Furthermore,
contrary to the rest of math, coding theory treats arrays and
vectors as <em>row vectors</em>, where multiplication with a matrix
is done with the row vector on the <em>left</em>, and the matrix used would be
the transpose of the matrix that would be used with a column
vector. <a href="#r3">↩</a></p>
<p id="fn4">[4] Khan Academy has a <a href="https://www.khanacademy.org/math/algebra-home/alg-matrices/alg-determinants-and-inverses-of-large-matrices/v/inverting-matrices-part-3">video stepping through an example</a> for a \(3 \times 3\) matrix. <a href="#r4">↩</a></p>
<p id="fn5">[5] People with experience in coding theory might
recognize that a parity matrix \(P\) being optimal is equivalent to
the corresponding erasure code being <a href="https://en.wikipedia.org/wiki/Singleton_bound#MDS_codes">MDS</a>. <a href="#r5">↩</a></p>
<p id="fn6">[6] An equivalent statement which is easier to see is that
if a row could be expressed as a linear combination of other rows,
then one would be able to construct a non-empty square submatrix of
\(P\) with those rows, which would then be non-invertible. <a href="#r6">↩</a></p>
<p id="fn7">[7] It is instead a (transposed) <a href="https://en.wikipedia.org/wiki/Vandermonde_matrix"><em>Vandermonde matrix</em></a>,
which we’ll cover when we talk about the PAR1 file format in a
follow-up article. <a href="#r7">↩</a></p>
<p id="fn8">[8] People with experience in abstract algebra might
recognize this as <a href="https://en.wikipedia.org/wiki/Finite_field_arithmetic#Effective_polynomial_representation">arithmetic over \(\mathbb{F}_2[x]\)</a>,
the polynomials with coefficients in the finite field with
\(2\) elements. <a href="#r8">↩</a></p>
<p id="fn9">[9] Our use of \(\clplus\), \(\clminus\), \(\clmul\), and
\(\cldiv\) to denote carry-less arithmetic clashes with our use of
the same symbols to denote generic field operations. However,
we’ll never need to talk about both at the same time, so
whichever one we mean should be obvious in context. <a href="#r9">↩</a></p>
<p id="fn10">[10] This is a slightly stronger statement than <a href="#theorem-3">Theorem 3</a>. <a href="#r10">↩</a></p>
<p id="fn11">[11] People with experience in abstract algebra might recognize carry-less primes as <a href="https://en.wikipedia.org/wiki/Irreducible_element">irreducible
elements</a> of \(\mathbb{F}_2[x]\). <a href="#r11">↩</a></p>
<p id="fn12">[12] Coincidentally, \(283\) is also a regular prime
number. Using another carry-less prime number \(256 \le p \lt 512\)
would also yield a field with \(256\) elements, but it is important to
consistently use the same carry-less modulus; different carry-less
moduli lead to fields with \(256\) elements that are <em>isomorphic</em>, but not identical.</p>
<p>Borrowing <a href="https://en.wikipedia.org/wiki/Mathematics_of_cyclic_redundancy_checks#Polynomial_representations">notation from CRCs</a>,
the carry-less modulus is sometimes represented as a hexadecimal
number with the leading digit (which is always \(1\)) omitted. For
example, \(283\) would be represented as \(\mathtt{0x1b}\), and we can say
that we’re using the field with \(256\) elements <em>defined
by</em>
\(\mathtt{0x1b}\). <a href="#r12">↩</a></p>
<p id="fn13">[13] <em>Galois field</em> is just another name for finite field. <a href="#r13">↩</a></p>
</section>
https://www.akalin.com/quintic-unsolvability
Why is the Quintic Unsolvable?
2016-09-26T00:00:00-07:00
Fred Akalin
https://www.akalin.com/
<link rel="stylesheet" type="text/css" href="https://cdn.rawgit.com/akalin/abel-ruffini-topological-proof/b8e50dd/jsxgraph.css" />
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/jsxgraph/0.99.5/jsxgraphcore.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/abel-ruffini-topological-proof/b8e50dd/complex.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/abel-ruffini-topological-proof/b8e50dd/complex_poly.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/abel-ruffini-topological-proof/b8e50dd/animation.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/abel-ruffini-topological-proof/b8e50dd/rotation_counter.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/abel-ruffini-topological-proof/b8e50dd/display.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/abel-ruffini-topological-proof/b8e50dd/complex_formula.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/abel-ruffini-topological-proof/b8e50dd/quadratic.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/abel-ruffini-topological-proof/b8e50dd/cubic.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/abel-ruffini-topological-proof/b8e50dd/quartic.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/abel-ruffini-topological-proof/b8e50dd/quintic.js"></script>
<!-- KaTeX messes up axes labels, for some reason, so remember to surround a
jxgbox div with <nokatex></nokatex>. -->
<style>
.graph {
display: block;
width: 300px;
height: 300px;
margin: 0.5em 0.2em;
}
.graph-container {
display: inline-block;
vertical-align: top;
max-width: 300px;
}
</style>
<p><em>(This was discussed on <a href="https://www.reddit.com/r/math/comments/57n07e/why_is_the_quintic_unsolvable/">r/math</a> and <a href="https://news.ycombinator.com/item?id=14685466">Hacker News</a>.)</em></p>
<section>
<header>
<h2>1. Overview</h2>
</header>
<p>In this article, I hope to convince you that the quintic equation
is unsolvable, in the sense that I can’t write down the solution
to the equation
\[
ax^5 + bx^4 + cx^3 + dx^2 + ex + f = 0
\]
using only addition, subtraction, multiplication, division, raising
to an integer power, and taking an integer root. In fact, I hope to
go further and explain how this is true for the same reason
that I can’t write down the solution to the equation
\[
ax^2 + bx + c = 0
\]
using only the first five operations above!</p>
<p>The usual approach to the above claim involves a semester’s
worth of abstract algebra and Galois theory. However, there’s
a much easier and shorter proof which involves only a bit of group
theory and complex analysis—enough to fit in a blog
post—and some interactive
visualizations.<sup><a href="#fn1" id="r1">[1]</a></sup></p>
</section>
<section>
<header>
<h2>2. Quadratic Equations</h2>
</header>
<p>Let’s start with quadratic equations, which hopefully you all
remember from high school. Given two complex numbers \(r_1\) and
\(r_2\), you can determine the quadratic equation whose solutions are
\(r_1\) and \(r_2\), namely
\[
(x - r_1)(x - r_2) = x^2 - (r_1 + r_2) x + r_1 r_2 = 0\text{.}
\]
If we take the standard form of a quadratic equation to be
\[
a x^2 + bx + c = 0\text{,}
\]
then we can define a function from \(r_1\) and \(r_2\) to \(a\), \(b\),
and \(c\), which is shown by the first two panels in the visualization below;
drag either of the points \(r_1\) and \(r_2\) and notice how \(b\) and
\(c\) move (\(a\) will always remain fixed at \(1\)).</p>
<p>Now pretend that we misremember the quadratic formula as
\[
x_{1, 2} = \frac{-b ± b^2 - 4ac}{4a}\text{.}
\]
The results of this formula—our candidate solution—are
shown in the third panel. Note that since \(x_1\) and \(x_2\) depend
on \(a\), \(b\), and \(c\), which all depend on \(r_1\) and \(r_2\),
they also move when you drag either \(r_1\) or \(r_2\).</p>
<div class="interactive-example">
<h3>Interactive Example 1: An incorrect quadratic formula</h3>
<div class="graph-container">
Roots
<nokatex><div id="rootBoardQuad1" class="graph jxgbox"></div></nokatex>
<button class="interactive-example-button quad1DisableWhileSwapping"
type="button" onclick="quad1.swap();">
Swap \(r_1\) and \(r_2\)
</button>
</div>
<div class="graph-container">
Coefficients
<nokatex><div id="coeffBoardQuad1" class="graph jxgbox"></div></nokatex>
</div>
<div class="graph-container">
Candidate solution
<nokatex><div id="formulaBoardQuad1" class="graph jxgbox"></div></nokatex>
</div>
</div>
<script type="text/javascript">
'use strict';
function runOp(display, op, time, disableSelector, state, doneCallback) {
if (state.running) {
return;
}
state.running = true;
var oldFixed = display.setRootsFixed(true);
var elems = document.querySelectorAll(disableSelector);
for (var i = 0; i < elems.length; ++i) {
elems[i].disabled = true;
}
op.run(time, function() {
state.running = false;
display.setRootsFixed(oldFixed);
for (var i = 0; i < elems.length; ++i) {
elems[i].disabled = false;
}
if (doneCallback !== undefined) {
doneCallback();
}
});
}
var incorrectQuadraticFormula = (function() {
var a = ComplexFormula.select(-1);
var b = ComplexFormula.select(-2);
var x1 = b.neg().plus(quadraticDiscriminantFormula).div(a.times(4));
var x2 = b.neg().minus(quadraticDiscriminantFormula).div(a.times(4));
return x1.concat(x2);
})();
var quad1 = (function() {
var initialRoots = [ new Complex(1, 0), new Complex(-1, 0) ];
var display = new Display(
"rootBoardQuad1", "coeffBoardQuad1", "formulaBoardQuad1", initialRoots,
incorrectQuadraticFormula, function() {});
display._resultRotationCounterPoint.setAttribute({visible: false});
var op = display.swapRootOp(0, 1, function() {});
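// Keep one state object across calls so that the re-entrancy guard in
// runOp (state.running) actually takes effect.
var state = {};
function swap() {
runOp(display, op, 1000, '.quad1DisableWhileSwapping', state);
}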
return {
display: display,
swap: swap
};
})();
</script>
<p>Now this formula looks right, since \(x_1\) and \(x_2\) are at the
same coordinates as \(r_1\) and \(r_2\). However, if you move
\(r_1\) or \(r_2\) around, you can easily convince yourself that
this formula can’t be right, since \(x_1\) and \(x_2\)
don’t move in the same way.</p>
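<p>If you’d rather check this with numbers than by dragging points, here is a small sketch. It is restricted to real roots for simplicity, and the function names <code>coefficientsFromRoots</code> and <code>misrememberedFormula</code> are made up for this illustration.</p>

```javascript
// Coefficients of (x - r1)(x - r2) = x^2 + bx + c, so a is always 1.
function coefficientsFromRoots(r1, r2) {
  return { a: 1, b: -(r1 + r2), c: r1 * r2 };
}

// The misremembered formula (-b ± (b^2 - 4ac)) / (4a); note the missing
// square root and the 4a in the denominator.
function misrememberedFormula({ a, b, c }) {
  const disc = b * b - 4 * a * c;
  return [(-b + disc) / (4 * a), (-b - disc) / (4 * a)];
}

// Starting position of the visualization: the formula looks right.
console.log(misrememberedFormula(coefficientsFromRoots(1, -1)));  // x1 = 1, x2 = -1

// Move r1 from 1 to 2: the formula no longer returns the roots.
console.log(misrememberedFormula(coefficientsFromRoots(2, -1)));  // x1 = 2.5, x2 = -2
```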
<p>Now if you remember from high school, the real quadratic formula
involves taking a square root, and since our candidate solution
doesn’t do that, that means it’s probably incorrect. I
say “probably” because there’s no immediate reason
why there can’t be <em>multiple</em> quadratic formulas, some
simpler than others, of which one is simple enough to not need a
square root. From manipulating \(r_1\) and \(r_2\), we know that our
candidate formula is incorrect, but that doesn’t immediately
follow from it not having a square root.</p>
<p>Fortunately, there is a general way to rule out candidate solutions
that are similar to the one above, namely those that use only
addition, subtraction, multiplication, division, and raising to an
integer power; we’ll call these <em>rational expressions</em>. Here’s
how it goes: if you press the button to swap \(r_1\) and \(r_2\),
which moves \(r_1\) to \(r_2\)’s position and vice versa,
\(a\), \(b\), and \(c\) move from their starting positions but
return once \(r_1\) and \(r_2\) reach their destinations. This makes
sense, because the coefficients of a polynomial don’t depend
on how you order the roots. But since \(x_1\) and \(x_2\) depend
only on \(a\), \(b\), and \(c\), they too must loop back to their
starting positions.</p>
<p>But that means that our candidate solution cannot be the quadratic
formula! If it were, then \(x_1\) and \(x_2\) would have ended up
swapped, too. Instead, they went back to their starting positions,
which is a contradiction. This reasoning holds for any expression
which is a <em>single-valued</em> function of \(a\), \(b\), and \(c\),
so in particular this holds for rational expressions.</p>
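<p>The key fact used above, that the coefficients don’t care which root is which, can be seen in a short sketch. Complex numbers are represented here as <code>[re, im]</code> pairs, and <code>coefficientsFromRoots</code> is a hypothetical helper, not part of the code behind the visualizations.</p>

```javascript
// Complex numbers are [re, im] pairs.
const add = (z, w) => [z[0] + w[0], z[1] + w[1]];
const mul = (z, w) => [z[0] * w[0] - z[1] * w[1], z[0] * w[1] + z[1] * w[0]];
const neg = (z) => [-z[0], -z[1]];

// Coefficients of (x - r1)(x - r2) = x^2 + bx + c, so a = 1.
function coefficientsFromRoots(r1, r2) {
  return { a: [1, 0], b: neg(add(r1, r2)), c: mul(r1, r2) };
}

const z1 = [1, 0];  // the complex number 1
const z2 = [0, 1];  // the complex number i

// Before the swap, r1 = z1 and r2 = z2; afterwards, r1 = z2 and r2 = z1.
const before = coefficientsFromRoots(z1, z2);
const after = coefficientsFromRoots(z2, z1);

// Any single-valued function of a, b, and c sees identical inputs, so it
// must return identical outputs before and after the swap.
console.log(JSON.stringify(before) === JSON.stringify(after));  // true
```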
<div class="p">Let’s summarize our reasoning in a theorem:
<div class="theorem">(<span class="theorem-name">Theorem 1</span>.) A
rational expression<sup><a href="#fn2" id="r2">[2]</a></sup> in the coefficients of the general quadratic
equation
\[
ax^2 + bx + c = 0
\]
cannot be a solution to this equation.</div>
<div class="proof">
<p><span class="proof-name">Sketch of proof.</span> Assume to the
contrary that the rational expression \(x = f(a, b, c)\) is a
solution. Assume that we start with \(r_1 = z_1\) and \(r_2 = z_2
\ne z_1\), and without loss of generality assume that we start with
\(x = z_1\).</p>
<p>Run \(r_1\) and \(r_2\) along continuous paths that swap their two
positions, i.e. make \(r_1\) head from \(z_1\) to \(z_2\)
continuously, and at the same time make \(r_2\) head from \(z_2\) to
\(z_1\) continuously, and make sure to pick paths such that \(r_1\)
and \(r_2\) never coincide.</p>
<p>Since \(a\), \(b\), and \(c\) are continuous functions of \(r_1\)
and \(r_2\), and \(x\) is a rational function of \(a\), \(b\) and
\(c\), and thus continuous, \(x\) then depends continuously on \(r_1\)
and \(r_2\). Thus, since we start with \(x = r_1 = z_1\), and \(r_1\)
never coincides with \(r_2\), then as \(r_1\) moves, \(x = r_1\) must
continue to hold, since \(x\) is a solution, and therefore
\(x\)’s final position must be the same as \(r_1\)’s,
which is \(z_2\).</p>
<p>However, since the coefficients \(a\), \(b\), and \(c\) don’t
depend on the ordering of \(r_1\) and \(r_2\), then their final
positions are the same as their initial positions. Since \(x\) is a
function of only \(a\), \(b\), and \(c\), its final position also
must be the same as its initial position, \(z_1\). This contradicts
the above, and therefore \(x\) cannot be a solution. ∎</p>
</div>
Now consider the candidate solution
\[
x_{1,2} = \sqrt{b^2 - 4ac}\text{.}
\]
This isn’t a rational expression since it has a square root. In
particular, in the visualization below, it behaves quite differently
from our first candidate solution. First, even though we have just a
single expression, it yields two points \(x_1\) and \(x_2\). Second,
and more surprisingly, if you swap \(r_1\) and \(r_2\), \(x_1\) and
\(x_2\) also exchange places, seemingly contradicting Theorem 1!
What is going on?
</div>
<div class="interactive-example">
<h3>Interactive Example 2: The quadratic equation</h3>
<div class="graph-container">
Roots
<nokatex><div id="rootBoardQuad2" class="graph jxgbox"></div></nokatex>
<button class="interactive-example-button quad2DisableWhileSwapping"
type="button" onclick="quad2.swap();">
Swap \(r_1\) and \(r_2\)
</button>
</div>
<div class="graph-container">
Coefficients
<nokatex><div id="coeffBoardQuad2" class="graph jxgbox"></div></nokatex>
</div>
<div class="graph-container">
Candidate solution
<nokatex><div id="formulaBoardQuad2" class="graph jxgbox"></div></nokatex>
<label>
<input class="quad2DisableWhileSwapping" name="quad2Formula" type="radio"
onchange="quad2.switchFormula(incorrectQuadraticFormula);" />
\(x_{1, 2} = \frac{-b \pm b^2 - 4ac}{4a}\)
</label>
<br />
<label>
<input class="quad2DisableWhileSwapping" name="quad2Formula" type="radio"
onchange="quad2.switchFormula(quadraticDiscriminantFormula);" />
\(x_1 = b^2 - 4ac\)
</label>
<br />
<label>
<input checked class="quad2DisableWhileSwapping" name="quad2Formula" type="radio"
onchange="quad2.switchFormula(quadraticDiscriminantFormula.root(2));" />
\(x_{1, 2} = \sqrt{b^2 - 4ac}\)
</label>
<br />
<label>
<input class="quad2DisableWhileSwapping" name="quad2Formula" type="radio"
onchange="quad2.switchFormula(newQuadraticFormula());" />
\(x_{1, 2} = \frac{-b + \sqrt{b^2 - 4ac}}{2a}\)
<br />
(the quadratic formula)
</label>
</div>
</div>
<script type="text/javascript">
'use strict';
function switchFormula(display, state, formula) {
if (state.running) {
return;
}
var numResults = display.setFormula(formula);
}
var quad2 = (function() {
var initialRoots = [ new Complex(1, 0), new Complex(0, 1) ];
var display = new Display(
"rootBoardQuad2", "coeffBoardQuad2", "formulaBoardQuad2", initialRoots,
quadraticDiscriminantFormula.root(2), function() {});
display._resultRotationCounterPoint.setAttribute({visible: false});
var op = display.swapRootOp(0, 1, function() {});
var state = {};
function swap() {
runOp(display, op, 1000, '.quad2DisableWhileSwapping', state);
}
function switchQuadFormula(formula) {
switchFormula(display, state, formula);
}
return {
display: display,
swap: swap,
switchFormula: switchQuadFormula
};
})();
</script>
<p>To answer this, we first need to review some facts about complex
numbers. Recall that a complex number \(z\) can be expressed in polar
coordinates, where it has a length \(r\) and an angle \(θ\), and
that it can be converted to the usual Cartesian coordinates using <a href="https://en.wikipedia.org/wiki/Euler%27s_formula">Euler’s formula</a>:
\[
z = r e^{iθ} = r \cos θ + i \, r \sin θ\text{.}
\]
Then, if you have two complex numbers \(z_1 = r_1 e^{iθ_1}\) and
\(z_2 = r_2 e^{iθ_2}\) in polar form, you can multiply them by
multiplying their lengths, and adding their angles:
\[
z_1 z_2 = r_1 r_2 e^{i (θ_1 + θ_2)}\text{.}
\]
So a square root of a complex number \(z = r e^{iθ}\) is just
\(\sqrt{r} e^{iθ/2}\), as you can easily verify. However, if
\(z\) is non-zero, there is one more square root of \(z\), namely
\(\sqrt{r} e^{i (θ/2 + π)}\), as you can also verify. (Recall
that angles that differ by \(2π = 360^\circ\) are considered the
same.)</p>
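<p>Here is a short sketch of the two square roots in code, using the polar form above. The representation of complex numbers as <code>[re, im]</code> pairs and the names <code>squareRoots</code> and <code>square</code> are choices made just for this example.</p>

```javascript
// Both square roots of the complex number [re, im], computed in polar form:
// take the square root of the length, and either half the angle or half
// the angle plus pi.
function squareRoots([re, im]) {
  const r = Math.hypot(re, im);
  const theta = Math.atan2(im, re);
  const s = Math.sqrt(r);
  return [theta / 2, theta / 2 + Math.PI].map(
      (angle) => [s * Math.cos(angle), s * Math.sin(angle)]);
}

// Square a complex number, to check the results.
const square = ([re, im]) => [re * re - im * im, 2 * re * im];

const [first, second] = squareRoots([0, 2]);  // the square roots of 2i
console.log(first, second);                   // approximately 1 + i and -1 - i
console.log(square(first), square(second));   // both approximately 2i
```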
<p>So in general, the square root of a rational expression, like our
candidate solution, yields two distinct points as long as the
rational expression is non-zero. In our case, \(b^2 - 4ac\) remains
non-zero as long as \(r_1\) and \(r_2\) don’t coincide. (We’ll
have more to say about this expression, called the <em>discriminant</em>,
once we talk about cubic equations below.) Therefore, if we want to
examine how \(x_1\) and \(x_2\) move as \(r_1\) and \(r_2\) move, we
have to number the square roots of \(b^2 - 4ac\), and we have to
keep this numbering consistent.</p>
<p>To do so, we have to do two things: we have to vary \(r_1\) and
\(r_2\) only continuously, and we have to vary \(r_1\) and \(r_2\)
such that they never coincide. If we do this, then we can
intuitively “lift” the expression \(b^2 - 4ac\) from the
complex plane to a new surface \(S\) where we consider only angles
that differ by \(4π = 720^\circ\), rather than \(2π\), to be
the same. In this space, we can take the “first” square
root of a non-zero complex number to be the one with half the angle,
and the “second” square root to be the one with half the
angle plus \(π\), and have these two square root functions behave
continuously as their argument goes around the origin.</p>
<figure>
<img src="quintic-unsolvability-files/Riemann_sqrt.svg"/>
<figcaption>
<span class="figure-text">Figure 1</span> \(S\), which is the
<a href="https://en.wikipedia.org/wiki/Riemann_surface">Riemann surface</a>
of \(\sqrt{z}\). (Image by <a href="https://en.wikipedia.org/wiki/File:Riemann_sqrt.svg">Leonid 2</a> licensed under <a href="https://creativecommons.org/licenses/by-sa/3.0/deed.en">CC BY-SA 3.0</a>.)
</figcaption>
</figure>
<p>Now this answers the question of why the proof of Theorem 1
fails for \(\sqrt{b^2 - 4ac}\). The coefficients \(a\), \(b\), and \(c\) go around a
single loop as \(r_1\) is swapped with \(r_2\), and therefore \(b^2
- 4ac\) goes around a single loop in the complex plane, but when
\(b^2 - 4ac\) is lifted to \(S\), the final position of \(b^2 -
4ac\) differs from the initial position only by an angle of
\(2π\), so it is <em>distinct</em> from the initial position, and
thus we can’t conclude that the final position of \(\sqrt{b^2
- 4ac}\) is the same as the initial position.</p>
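<p>To make the bookkeeping of angles concrete, here is a sketch that follows \(b^2 - 4ac\) numerically while the two roots swap places. It assumes one particular swap path (each root makes a counter-clockwise half-turn around their midpoint) and \(a = 1\); the names <code>discriminantAt</code> and <code>sweptAngle</code> are hypothetical.</p>

```javascript
// Complex numbers are [re, im] pairs.
const add = (z, w) => [z[0] + w[0], z[1] + w[1]];
const sub = (z, w) => [z[0] - w[0], z[1] - w[1]];
const mul = (z, w) => [z[0] * w[0] - z[1] * w[1], z[0] * w[1] + z[1] * w[0]];
const scale = (k, z) => [k * z[0], k * z[1]];

// Discriminant b^2 - 4ac (with a = 1) at time t in [0, 1], while the roots
// swap places by each making a half-turn around their midpoint.
function discriminantAt(t, z1, z2) {
  const mid = scale(0.5, add(z1, z2));
  const half = scale(0.5, sub(z1, z2));
  const turn = [Math.cos(Math.PI * t), Math.sin(Math.PI * t)];
  const r1 = add(mid, mul(half, turn));
  const r2 = sub(mid, mul(half, turn));
  const b = scale(-1, add(r1, r2));
  const c = mul(r1, r2);
  return sub(mul(b, b), scale(4, c));
}

// Total angle swept by the discriminant, accumulated in small steps so that
// full turns are not discarded the way a single atan2 call would discard them.
function sweptAngle(z1, z2, steps) {
  let total = 0;
  let prev = discriminantAt(0, z1, z2);
  for (let k = 1; k <= steps; ++k) {
    const cur = discriminantAt(k / steps, z1, z2);
    const cross = prev[0] * cur[1] - prev[1] * cur[0];
    const dot = prev[0] * cur[0] + prev[1] * cur[1];
    total += Math.atan2(cross, dot);
    prev = cur;
  }
  return total;
}

const swept = sweptAngle([1, 0], [0, 1], 1000);
console.log(swept / Math.PI);      // approximately 2: one full loop
console.log(swept / 2 / Math.PI);  // approximately 1: the square root makes half a loop
```

<p>Since the square root’s angle changes by only \(π\), a square root followed continuously ends up at the negative of where it started, which is exactly the exchange of \(x_1\) and \(x_2\) seen in the visualization.</p>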
<p>Similar reasoning holds for any algebraic expression that
isn’t a rational expression, i.e. ones that involve taking any
integer root, so Theorem 1 cannot apply to algebraic expressions
in general. Of course, this is consistent with what we know about the
quadratic formula, since we know that it has a square root!</p>
</section>
<section>
<header>
<h2>3. Cubic Equations</h2>
</header>
<p>Now we can move on to cubic equations. Similarly, given three
complex numbers \(r_1\), \(r_2\), and \(r_3\), you can determine the
cubic equation with those solutions, namely
\[
(x - r_1) (x - r_2) (x - r_3) = x^3 - (r_1 + r_2 + r_3) x^2 + (r_1 r_2 + r_1 r_3 + r_2 r_3) x - r_1 r_2 r_3\text{,}
\]
and so we can define a function from \(r_1\), \(r_2\), and \(r_3\) to
\(a\), \(b\), \(c\), and \(d\), where
\[
a x^3 + b x^2 + c x + d
\]
is the standard form of a cubic polynomial, and this is shown in the
visualization below.</p>
<p>In the previous section, we talked about the discriminant \(b^2 -
4ac\) of the general quadratic polynomial. However, the discriminant
is an expression that is defined for <em>any</em> polynomial. If
\(r_1, \dotsc, r_n\) are the roots of a polynomial (counting multiplicity)
with leading coefficient \(a_n\), then the
<a href="https://en.wikipedia.org/wiki/Discriminant">discriminant</a> is
\[
Δ = a_n^{2n - 2} ∏_{i \lt j} (r_i - r_j)^2\text{.}
\]
In other words, the discriminant is, up to sign and a power of the
leading coefficient, the product of the differences of all ordered pairs of
distinct roots. In particular, if the polynomial has repeated roots,
the discriminant is zero.</p>
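<p>As a quick check of this definition (a worked instance added for illustration), take \(n = 2\). For \(ax^2 + bx + c = a (x - r_1) (x - r_2)\) we have \(r_1 + r_2 = -b/a\) and \(r_1 r_2 = c/a\), so
\[
Δ = a^2 (r_1 - r_2)^2 = a^2 \left( (r_1 + r_2)^2 - 4 r_1 r_2 \right) = a^2 \left( \frac{b^2}{a^2} - \frac{4c}{a} \right) = b^2 - 4ac\text{,}
\]
which agrees with the quadratic discriminant from the previous section.</p>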
<p>Using the formula above, you can express the discriminant in terms
of the coefficients of the polynomial, as you can verify for
yourself with the quadratic equation. Indeed this is true in
general; for cubic polynomials, the discriminant can be expressed in
terms of the coefficients as
\[
Δ = b^2 c^2 - 4 a c^3 - 4 b^3 d - 27 a^2 d^2 + 18 a b c d\text{.}
\]
But why do we care? Because, as you can see in the visualization below, if
you swap any pair of roots, this causes the discriminant to make a single
loop around the origin, so it serves as a useful test function for
taking roots.</p>
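<p>The two expressions for the cubic discriminant can be compared numerically. The sketch below sticks to real roots so that plain numbers suffice (complex roots work the same way with complex arithmetic), and all three function names are made up for this example.</p>

```javascript
// Expand a(x - r1)(x - r2)(x - r3) into the coefficients [a, b, c, d].
function cubicCoefficients(a, r1, r2, r3) {
  return [
    a,
    -a * (r1 + r2 + r3),
    a * (r1 * r2 + r1 * r3 + r2 * r3),
    -a * r1 * r2 * r3,
  ];
}

// Discriminant from the roots: a^(2n - 2) = a^4 times the product of the
// squared differences of the roots.
function discriminantFromRoots(a, r1, r2, r3) {
  const diffs = (r1 - r2) * (r1 - r3) * (r2 - r3);
  return Math.pow(a, 4) * diffs * diffs;
}

// Discriminant from the coefficients, using the formula above.
function discriminantFromCoefficients([a, b, c, d]) {
  return b * b * c * c - 4 * a * c * c * c - 4 * b * b * b * d
      - 27 * a * a * d * d + 18 * a * b * c * d;
}

console.log(discriminantFromRoots(1, 1, 2, 3));                           // 4
console.log(discriminantFromCoefficients(cubicCoefficients(1, 1, 2, 3))); // 4
console.log(discriminantFromCoefficients(cubicCoefficients(1, 1, 1, 2))); // 0 (repeated root)
```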
<p>So now that we have three roots, we can swap them in multiple
ways. If \(R\) is a list that starts off as \(\langle r_1, r_2, r_3
\rangle\), let \(↺_{i, j}\) denote the counter-clockwise
paths that take the root at the \(i\)th index of \(R\) to the one
at the \(j\)th index of \(R\) and vice versa, and similarly for
\(↻_{i, j}\). (Note that this is not the same as the
paths that swap \(r_i\) and \(r_j\)! Play around with the buttons
in the visualization below to understand the difference.)</p>
<div class="interactive-example">
<h3>Interactive Example 3: The cubic discriminant</h3>
<div class="graph-container">
Roots
<nokatex><div id="rootBoardCubic1" class="graph jxgbox"></div></nokatex>
<span id="rootListCubic1">
\(R = \langle r_1, r_2, r_3 \rangle\)
</span>
<br />
<button class="interactive-example-button cubic1DisableWhileRunningOp"
type="button" onclick="cubic1.runOp(cubic1.opA, 1000);">
\(↺_{1, 2}\)
</button>
<button class="interactive-example-button cubic1DisableWhileRunningOp"
type="button" onclick="cubic1.runOp(cubic1.opB, 1000);">
\(↺_{2, 3}\)
</button>
<br />
<button class="interactive-example-button cubic1DisableWhileRunningOp"
type="button" onclick="cubic1.runOp(cubic1.opA.invert(), 1000);">
\(↻_{1, 2}\)
</button>
<button class="interactive-example-button cubic1DisableWhileRunningOp"
type="button" onclick="cubic1.runOp(cubic1.opB.invert(), 1000);">
\(↻_{2, 3}\)
</button>
</div>
<div class="graph-container">
Coefficients
<nokatex><div id="coeffBoardCubic1" class="graph jxgbox"></div></nokatex>
</div>
<div class="graph-container">
Candidate solution
<nokatex><div id="formulaBoardCubic1" class="graph jxgbox"></div></nokatex>
<label>
<input class="cubic1DisableWhileRunningOp" name="cubic1Formula" type="radio"
onchange="cubic1.switchFormula(cubicDiscFormula);" />
\(x_1 = Δ\)
</label>
<br />
<label>
<input checked class="cubic1DisableWhileRunningOp" name="cubic1Formula" type="radio"
onchange="cubic1.switchFormula(cubicDiscFormula.root(5));" />
\(x_{1, 2, 3, 4, 5} = \sqrt[5]{Δ}\)
</label>
</div>
</div>
<script type="text/javascript">
'use strict';
function updateRootList(display, rootListID) {
var rootPermutation = display.getRootPermutation();
var rootList = document.getElementById(rootListID);
var TeXOutput = 'R = \\langle ' + rootPermutation.map(function(i) {
return 'r_{' + (i+1) + '}';
}).join(', ') + ' \\rangle';
katex.render(TeXOutput, rootList);
}
function updateResultList(display, resultListID) {
var resultPermutation = display.getResultPermutation();
var resultList = document.getElementById(resultListID);
var TeXOutput = 'X = \\langle ' + resultPermutation.map(function(i) {
return 'x_{' + (i+1) + '}';
}).join(', ') + ' \\rangle';
katex.render(TeXOutput, resultList);
}
var cubicDiscFormula = cubicScaledDiscFormula.div(
ComplexFormula.select(-1).pow(2).times(-27));
var cubic1 = (function() {
var initialRoots = [
new Complex(-1, -0.5), new Complex(0.5, 0.5), new Complex(0, 1)
];
var display = new Display(
"rootBoardCubic1", "coeffBoardCubic1", "formulaBoardCubic1", initialRoots,
cubicDiscFormula.root(5), function() {});
display._resultRotationCounterPoint.setAttribute({visible: false});
function updateRootListCubic(display) {
updateRootList(display, "rootListCubic1");
}
var opA = display.swapRootOp(0, 1, updateRootListCubic);
var opB = display.swapRootOp(1, 2, updateRootListCubic);
var state = {}
function runCubicOp(op, time) {
runOp(display, op, time, '.cubic1DisableWhileRunningOp', state);
};
function switchCubicFormula(formula) {
switchFormula(display, state, formula);
updateRootListCubic(display);
}
return {
display: display,
opA: opA,
opB: opB,
runOp: runCubicOp,
cubicDiscFormula: cubicDiscFormula,
switchFormula: switchCubicFormula
};
})();
</script>
<p>Now, with the formula \(Δ\), the same reasoning as in the
previous section shows that it cannot possibly be the cubic formula,
nor can any other rational expression. However, unlike the quadratic
case, we can also rule out \(\sqrt[5]{Δ}\), or any other
algebraic formula with no nested radicals (i.e., that doesn’t
have a radical within a radical like \(\sqrt{a - \sqrt{bc - 5}}\)).
If you apply the operations \(↺_{2, 3}\),
\(↺_{1, 2}\), \(↻_{2, 3}\), and
\(↻_{1, 2}\) in sequence, \(r_1\), \(r_2\), and
\(r_3\) rotate among themselves, but all the \(x_i\) go back to
their original positions. Therefore, by similar reasoning as the
previous section, \(\sqrt[5]{Δ}\) also cannot possibly be the
cubic formula!</p>
<p>To make this statement precise, we need to review some group
theory. Recall that a
<a href="https://en.wikipedia.org/wiki/Group_(mathematics)">group</a>
is a set with an associative binary operation, an identity element,
and inverse elements. Most basic examples of groups are related to
numbers, like the integers under addition, or the non-zero rationals
under multiplication. However, more interesting examples of groups
are related to <em>functions</em>, not least because the group
operation for functions is <em>composition</em>, which is in general
not commutative; in other words, if \(f\) and \(g\) are functions,
then in general \(f \circ g \ne g \circ f\), and it is this non-commutativity that
will come in handy for our purposes.</p>
<p>So let’s say we have a list of \(n\) objects, and we’re
interested in the functions that rearrange this list’s
elements. These are <a href="https://en.wikipedia.org/wiki/Permutation">permutations</a>,
and they naturally form a group under composition, as you can check
for yourself, called \(S_n\), the <a href="https://en.wikipedia.org/wiki/Symmetric_group">symmetric group</a> on
\(n\) objects.</p>
<p>There’s a convenient way to write permutations, called <a href="https://en.wikipedia.org/wiki/Permutation#Cycle_notation">cycle notation</a>. If
you write \((i_1 \; i_2 \; \dotsc \; i_k)\), this denotes the
permutation that maps the \(i_1\)th position of the list to the
\(i_2\)th position, the \(i_2\)th position to the \(i_3\)th, and so on,
with the \(i_k\)th position mapping back to the \(i_1\)th; this is
called a <em>cycle</em>. Then you can write <em>any</em> permutation
as a composition of disjoint cycles, so this provides a convenient
way to write down and compute with permutations.</p>
<p>In the visualization above, we have four operations
\(↺_{1, 2}\), \(↺_{2, 3}\),
\(↻_{1, 2}\), and \(↻_{2, 3}\),
which <em>act on \(R\)</em>, meaning that they define permutations
on \(R\). In particular, \(↺_{1, 2}\) and
\(↻_{1, 2}\) both swap the first and second
elements of \(R\), so we say that \(↺_{1, 2}\) and
\(↻_{1, 2}\) act on \(R\) as \((1 \; 2)\), and
similarly, \(↺_{2, 3}\) and \(↻_{2,
3}\) act on \(R\) as \((2 \; 3)\).</p>
<p>Now concatenating two operations—doing one after the
other—corresponds to composing their mapped-to permutations on
\(R\). Denoting \(o_2 * o_1\) as doing \(o_1\), then doing \(o_2\),
the sequence of operations above is \(↻_{1, 2} *
↻_{2, 3} * ↺_{1, 2} *
↺_{2, 3}\) (note the order!), which acts on \(R\)
like \((1 \; 2) (2 \; 3) (1 \; 2) (2 \; 3)\), which is equal to \((1
\; 3 \; 2)\).<sup><a href="#fn3" id="r3">[3]</a></sup> (The
\(\circ\) is usually dropped when composing permutations.)</p>
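<p>If you’d rather not multiply cycles by hand, here’s a small
JavaScript sketch that checks the product above. It assumes the
conventions used in this section: a cycle \((i_1 \; i_2 \; \dotsc)\)
sends position \(i_1\) to position \(i_2\), and a product is applied
right to left. The array representation and the helper names are
choices made for this example only.</p>

```javascript
// A permutation of n positions is an array p, where p[i] is the
// (0-based) position that position i is sent to.
function permFromCycle(n, cycle) {
  var p = [];
  for (var i = 0; i < n; ++i) {
    p.push(i);
  }
  for (var j = 0; j < cycle.length; ++j) {
    // Cycles are written with 1-based positions, as in the text.
    p[cycle[j] - 1] = cycle[(j + 1) % cycle.length] - 1;
  }
  return p;
}

// compose(f, g) applies g first, then f.
function compose(f, g) {
  return g.map(function(x) { return f[x]; });
}

// Composes a product written left to right, so that the rightmost
// permutation is applied first.
function composeAll(perms) {
  return perms.reduce(compose);
}

var swap12 = permFromCycle(3, [1, 2]);
var swap23 = permFromCycle(3, [2, 3]);
console.log(composeAll([swap12, swap23, swap12, swap23]));  // [ 2, 0, 1 ]
console.log(permFromCycle(3, [1, 3, 2]));                   // [ 2, 0, 1 ]
```

<p>The two printed arrays agree, so \((1 \; 2) (2 \; 3) (1 \; 2) (2
\; 3)\) is indeed \((1 \; 3 \; 2)\) under these conventions.</p>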
<p>Now for the formula \(Δ\), all the operations make \(x_1\)
loop around the origin either clockwise or counter-clockwise; in
other words, they all induce a rotation of \(2π\) or \(-2π\) on
\(x_1\), and the final distance of \(x_1\) from the origin is the
same as the initial distance. Therefore, if we apply an equal number
of clockwise and counter-clockwise rotations, the total angle of
rotation will be \(0\) and the final distance will be the same as
the initial distance, i.e. the final position of \(x_1\) is the same
as its initial position. But the same reasoning holds for the
formula \(\sqrt[5]{Δ}\); all the operations induce a rotation
of \(2π/5\) or \(-2π/5\) and leave the distance from the origin
unchanged, so an equal number of clockwise and counter-clockwise
rotations will still induce a total angle of \(0\) and leave the
distance from the origin unchanged. Therefore, the operation
\(↻_{1, 2} * ↻_{2, 3} *
↺_{1, 2} * ↺_{2, 3}\) acts like \((1
\; 3\; 2)\) on \(R\), but leaves all \(x_i\) unchanged.</p>
<p>But how did we come up with \(↻_{1, 2} *
↻_{2, 3} * ↺_{1, 2} *
↺_{2, 3}\) in the first place? This involves a bit
more group theory. \(S_3\) is <em>not</em> a <a href="https://en.wikipedia.org/wiki/Abelian_group">commutative
group</a>; in particular, \((1 \; 2) (2 \; 3) \ne (2 \; 3) (1 \;
2)\). For two group elements \(g\) and \(h\), we can define
their
<a href="https://en.wikipedia.org/wiki/Commutator">commutator</a><sup><a href="#fn4" id="r4">[4]</a></sup>
\([ g, h ]\), which is the group element that corrects for
\(g\) and \(h\) not commuting. That is, we want the equation
\[
g h = h g [g, h]
\]
to hold, which means that
\[
[g, h] = g^{-1} h^{-1} g h\text{.}
\]
So the commutator provides a convenient way to generate a non-trivial
permutation from two other non-commuting permutations. Furthermore, it
involves two appearances of both elements, so we can pick a sequence of
operations that induce the commutator and also have an equal number of
clockwise and counter-clockwise operations. Then we’re guaranteed
that this sequence of operations permutes \(R\) and leaves all \(x_i\)
unchanged, even if each individual operation moves some \(x_i\). But of
course, this is just \(↻_{1, 2} * ↻_{2, 3} *
↺_{1, 2} * ↺_{2, 3}\)!</p>
<p>Let’s define some terminology to make proofs and discussion
easier. If \(o\) is an operation that acts on \(R\) non-trivially
but has the final position of the expression \(x = f(a, b, c,
\dotsc)\) the same as its initial position, we say that \(o\) <em>rules out</em> the
expression \(x = f(a, b, c, \dotsc)\). For example, Theorem 1
says that swapping both roots of a quadratic rules out all rational
expressions.</p>
<div class="p">Now we’re ready to state and prove the theorem:
<div class="theorem">(<span class="theorem-name">Theorem 2</span>.) An
algebraic expression with no nested radicals in the coefficients of
the general cubic equation
\[
ax^3 + bx^2 + cx + d = 0
\]
cannot be a solution to this equation.</div>
<div class="proof">
<p><span class="proof-name">Sketch of proof.</span> First assume to
the contrary that the expression \(x = \sqrt[k]{r(a, b, c, d)}\) is
a solution, where \(r(a, b, c, d)\) is a rational
expression. Assume we start with \(r_1 = z_1\), \(r_2 = z_2\), and
\(r_3 = z_3\), where all \(z_i\) are distinct, and without loss of
generality assume that we start with \(x = z_1\).</p>
<p>Any of the operations \(↺_{1, 2}\),
\(↺_{2, 3}\), \(↻_{1, 2}\), and
\(↻_{2, 3}\) applied to \(x = r(a, b, c, d)\)
causes \(x\)’s final position to be the same as its initial
position, by Theorem 1. Pick a point \(z_0\) that is never
equal to any point \(x\) traverses under any operation. Then, by
the same reasoning as above, the total angle induced by
\(↻_{1, 2} * ↻_{2, 3} *
↺_{1, 2} * ↺_{2, 3}\) on \(x =
\sqrt[k]{r(a, b, c, d)}\) around \(z_0\) is \(0\), and the
distance from \(z_0\) remains unchanged. Thus \(x\) remains
fixed, and this operation rules out \(x = \sqrt[k]{r(a, b, c,
d)}\).</p>
<p>For the general case, it suffices to show that if \(o\) rules out
the expressions \(f\) and \(g\), then \(o\) also rules out \(f\)
raised to an integer power, \(f + g\text{,}\) \(f - g\text{,}\) \(f
\cdot g\text{,}\) and \(f / g\) where \(g \ne 0\text{.}\) But this
is straightforward, and such formulas are just the algebraic
expressions with no nested radicals, so the statement holds in
general. ∎</p>
</div>
</div>
<p>Theorem 2 can be summarized thus: any \(↺_{i,
j}\) or \(↻_{i, j}\) rules out any rational
expression as the cubic formula, and if given an algebraic
expression with no nested radicals, either some
\(↺_{i, j}\) or \(↻_{i, j}\) rules it
out, or \(↻_{1, 2} * ↻_{2, 3} *
↺_{1, 2} * ↺_{2, 3}\) rules it out.</p>
<p>Now we can consider algebraic expressions with one level of
nesting. Can such formulas be ruled out as being the cubic formula?
We can’t do so via Theorem 2, at least; we would need a
non-trivial element of \(S_3\) that is the commutator of
commutators. But you can calculate that all non-trivial commutators of
\(S_3\) are either \((3 \; 2 \; 1)\) or \((1 \; 2\; 3)\), and these
two elements commute, so \(S_3\) cannot have a non-trivial commutator
of commutators.</p>
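<p>The calculation above is small enough to brute-force. The
following JavaScript sketch enumerates all of \(S_3\), collects the
distinct commutators \([g, h] = g^{-1} h^{-1} g h\), and then does
the same for the commutators themselves. Permutations are
represented as arrays of 0-based positions, and all helper names
are assumptions made for this example.</p>

```javascript
// A permutation is an array p, where p[i] is the position i is sent to.
// compose(f, g) applies g first, then f.
function compose(f, g) {
  return g.map(function(x) { return f[x]; });
}

function inverse(p) {
  var q = [];
  p.forEach(function(x, i) { q[x] = i; });
  return q;
}

// [g, h] = g^{-1} h^{-1} g h, as defined earlier in this section.
function commutator(g, h) {
  return compose(compose(inverse(g), inverse(h)), compose(g, h));
}

// All n! permutations of 0..n-1, built by inserting n-1 into every
// slot of every permutation of 0..n-2.
function allPermutations(n) {
  if (n === 0) {
    return [[]];
  }
  var result = [];
  allPermutations(n - 1).forEach(function(p) {
    for (var i = 0; i <= p.length; ++i) {
      var q = p.slice();
      q.splice(i, 0, n - 1);
      result.push(q);
    }
  });
  return result;
}

// Distinct commutators of all pairs of the given elements, as sorted
// strings like '120' (fine for n < 10).
function distinctCommutators(elements) {
  var seen = {};
  elements.forEach(function(g) {
    elements.forEach(function(h) {
      seen[commutator(g, h).join('')] = true;
    });
  });
  return Object.keys(seen).sort();
}

var s3Commutators = distinctCommutators(allPermutations(3));
console.log(s3Commutators);  // [ '012', '120', '201' ]

var s3CommutatorPerms = s3Commutators.map(function(key) {
  return key.split('').map(Number);
});
console.log(distinctCommutators(s3CommutatorPerms));  // [ '012' ]
```

<p>Here <code>'012'</code> is the identity, and <code>'120'</code>
and <code>'201'</code> are the two \(3\)-cycles; the second line
shows that commutators of those elements are all trivial.</p>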
<p>In fact, as we would expect, the actual <a href="https://en.wikipedia.org/wiki/Cubic_function#General_formula">cubic formula</a>
has such an algebraic expression, which is \(C\) in the visualization
below, so that serves as a convenient example of an algebraic
expression with a single nested radical that can’t be ruled out
by Theorem 2.</p>
<div class="interactive-example">
<h3>Interactive Example 4: The cubic equation</h3>
<div class="graph-container">
Roots
<nokatex><div id="rootBoardCubic2" class="graph jxgbox"></div></nokatex>
<span id="rootListCubic2">
\(R = \langle r_1, r_2, r_3 \rangle\)
</span>
<br />
<button class="interactive-example-button cubic2DisableWhileRunningOp"
type="button" onclick="cubic2.runOp(cubic2.opA, 1000);">
\(↺_{1, 2}\)
</button>
<button class="interactive-example-button cubic2DisableWhileRunningOp"
type="button" onclick="cubic2.runOp(cubic2.opB, 1000);">
\(↺_{2, 3}\)
</button>
<br />
<button class="interactive-example-button cubic2DisableWhileRunningOp"
type="button" onclick="cubic2.runOp(cubic2.opA.invert(), 1000);">
\(↻_{1, 2}\)
</button>
<button class="interactive-example-button cubic2DisableWhileRunningOp"
type="button" onclick="cubic2.runOp(cubic2.opB.invert(), 1000);">
\(↻_{2, 3}\)
</button>
<br />
<button class="interactive-example-button cubic2DisableWhileRunningOp"
type="button" onclick="cubic2.runOp(cubic2.opComAB, 4000);">
\(↻_{1, 2} * ↻_{2, 3} * ↺_{1, 2} * ↺_{2, 3}\)
</button>
<br />
<button class="interactive-example-button cubic2DisableWhileRunningOp"
type="button" onclick="cubic2.runOp(cubic2.opComAB.invert(), 4000);">
\(↺_{1, 2} * ↺_{2, 3} * ↻_{1, 2} * ↻_{2, 3}\)
</button>
</div>
<div class="graph-container">
Coefficients
<nokatex><div id="coeffBoardCubic2" class="graph jxgbox"></div></nokatex>
</div>
<div class="graph-container">
Candidate solution
<nokatex><div id="formulaBoardCubic2" class="graph jxgbox"></div></nokatex>
<span id="resultListCubic2">
\(X = \langle x_1, x_2, x_3, x_4, x_5, x_6 \rangle\)
</span>
<br />
<label>
<input class="cubic2DisableWhileRunningOp" name="cubic2Formula" type="radio"
onchange="cubic2.switchFormula(cubicScaledDiscFormula);" />
\(x_1 = -27a^2 Δ = {Δ_1}^2 - 4 {Δ_0}^3\)
</label>
<br />
<label>
<input class="cubic2DisableWhileRunningOp" name="cubic2Formula" type="radio"
onchange="cubic2.switchFormula(newCubicCCubedFormula());" />
\(x_{1, 2} = C^3 = \frac{Δ_1 + \sqrt{-27a^2 Δ}}{2}\)
</label>
<br />
<label>
<input checked class="cubic2DisableWhileRunningOp" name="cubic2Formula" type="radio"
onchange="cubic2.switchFormula(newCubicCCubedFormula().root(3));" />
\(x_{1,2,3,4,5,6} = C\)
</label>
<br />
<label>
<input class="cubic2DisableWhileRunningOp" name="cubic2Formula" type="radio"
onchange="cubic2.switchFormula(newCubicFormula());" />
\(x_{1, 2, 3} = -\frac{1}{3a} \left( b + C + \frac{Δ_0}{C} \right)\)
<br />
(the cubic formula)
</label>
</div>
</div>
<script type="text/javascript">
'use strict';
var cubic2 = (function() {
var initialRoots = [
new Complex(-1, -0.5), new Complex(0.5, 0.5), new Complex(0, 1)
];
var display = new Display(
"rootBoardCubic2", "coeffBoardCubic2", "formulaBoardCubic2", initialRoots,
newCubicCCubedFormula().root(3), function() {});
display._resultRotationCounterPoint.setAttribute({visible: false});
function updateRootAndResultList(display) {
updateRootList(display, "rootListCubic2");
updateResultList(display, "resultListCubic2");
}
var opA = display.swapRootOp(0, 1, updateRootAndResultList);
var opB = display.swapRootOp(1, 2, updateRootAndResultList);
var opComAB = newCommutatorAnimation(opA, opB);
var state = {}
function runCubicOp(op, time) {
runOp(display, op, time, '.cubic2DisableWhileRunningOp', state);
};
function switchCubicFormula(formula) {
switchFormula(display, state, formula);
updateRootAndResultList(display);
}
return {
display: display,
opA: opA,
opB: opB,
opComAB: opComAB,
runOp: runCubicOp,
cubicDiscFormula: cubicDiscFormula,
switchFormula: switchCubicFormula
};
})();
</script>
<p>Note that there is a new list \(X\), which lists the \(x_i\) in the
order in which they occupy their initial positions, like how \(R\) does
the same for the \(r_i\). In general, we can’t do this, since a
general multi-valued function won’t necessarily permute the
\(x_i\) among themselves, but in the interactive visualizations
we’ll only consider expressions that do.</p>
<p>We can then talk about how an operation acts on \(X\). For example, if we
pick \(\sqrt[5]{Δ}\) in Interactive Example 3, we can
say that \(↺_{i, j}\) acts like \((5 \; 4 \; 3 \; 2
\; 1)\) on \(X\) and \(↻_{i, j}\) acts like \((1 \; 2 \; 3 \; 4 \;
5)\) on \(X\). Therefore, \(↻_{1, 2} *
↻_{2, 3} * ↺_{1, 2} *
↺_{2, 3}\) acts non-trivially on \(R\) but acts
trivially on \(X\), which is another, more algebraic way of saying
that this operation rules out \(\sqrt[5]{Δ}\). (Note that the
action on \(X\) depends on the candidate formula.) On the other hand,
if you choose \(C\) in the visualization above, you can convince
yourself that no operation acts non-trivially on \(R\) without also
acting non-trivially on \(X\), and so \(C\) can’t be ruled out
as the cubic formula.</p>
</section>
<section>
<header>
<h2>4. Quartic Equations</h2>
</header>
<p>Now we can move on to quartic equations. As usual, given four
complex numbers \(r_1\), \(r_2\), \(r_3\), and \(r_4\), you can map
this to the coefficients \(a\), \(b\), \(c\), \(d\), and \(e\) of the
standard form of a quartic polynomial, as shown in the visualization
below, such that the \(r_i\) are the solutions to the quartic
equation
\[
a x^4 + b x^3 + c x^2 + d x + e = 0\text{.}
\]</p>
<p>Now that we have four roots, we have even more ways to permute them
using the \(↺_{i, j}\) and \(↻_{i,
j}\). Before we move on, we need more terminology and group theory to
handle this more complicated case.</p>
<p>First, we want a convenient way to denote the combination of operations
that act like a commutator, so let’s define
\(↺_{i, j}^\prime\) to mean \(↻_{i,
j}\) and vice versa, \((o_1 * o_2 * \dotsb * o_n)^\prime\)
to mean \(o_n^\prime * o_{n-1}^\prime * \dotsb *
o_1^\prime\), and \([\![ o_1, o_2 ]\!]\) to mean \(o_1^\prime *
o_2^\prime * o_1 * o_2\), so that if \(o_i\) acts on \(R\)
like \(g_i\), then \(o_i^\prime\) acts on \(R\) like \(g_i^{-1}\) and
\([\![o_i, o_j]\!]\) acts on \(R\) like \([g_i, g_j]\). For example,
in the previous section, we were using \([\![ ↺_{1, 2},
↺_{2, 3} ]\!]\) to rule out algebraic expressions with
no nested radicals.</p>
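<p>To see how quickly these nested operations grow, here’s a
JavaScript sketch of the notation above. It models an operation as
a plain list of basic moves in the order they’re performed; this
representation and the helper names are assumptions for
illustration, and aren’t how the page’s own animation code is
written.</p>

```javascript
// An operation is an array of basic moves, listed in the order in which
// they are performed. A basic move swaps the roots at indices i and j
// along counter-clockwise ('ccw') or clockwise ('cw') paths.
function basicOp(i, j, dir) {
  return [{i: i, j: j, dir: dir}];
}

// The primed operation: undo the moves in reverse order, flipping the
// direction of each one.
function prime(op) {
  return op.slice().reverse().map(function(move) {
    return {i: move.i, j: move.j, dir: move.dir === 'ccw' ? 'cw' : 'ccw'};
  });
}

// [[o1, o2]]: do o2, then o1, then o2', then o1'.
function commutatorOp(o1, o2) {
  return o2.concat(o1, prime(o2), prime(o1));
}

function describe(op) {
  return op.map(function(move) {
    return move.dir + '(' + move.i + ',' + move.j + ')';
  }).join(', ');
}

var moveA1 = basicOp(1, 2, 'ccw');
var moveA2 = basicOp(2, 3, 'ccw');
var moveA3 = basicOp(3, 4, 'ccw');

console.log(describe(commutatorOp(moveA1, moveA2)));
// ccw(2,3), ccw(1,2), cw(2,3), cw(1,2)

var nestedB1 = commutatorOp(moveA2, moveA1);
var nestedB2 = commutatorOp(moveA3, moveA2);
var nestedC1 = commutatorOp(nestedB2, nestedB1);
console.log(nestedB1.length, nestedC1.length);  // 4 16
```

<p>The first line of output is exactly the sequence of four
operations used in the previous section. Each further level of
commutators multiplies the number of basic moves by four, and every
such operation contains as many clockwise moves as counter-clockwise
ones.</p>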
<p>Then not only do we want to talk about commutators of particular
permutations, we want to talk about the set of all commutators
of a particular group. In fact, for a group \(G\), this set of
commutators generates a subgroup \(K(G)\) called the <a href="https://en.wikipedia.org/wiki/Commutator_subgroup">commutator subgroup</a>. For
the quadratic case, we just have \(S_2\), which has only a single
non-trivial element, so its commutator subgroup \(K(S_2)\) is the
trivial group. For the cubic case, we started with \(S_3\), and we
computed the commutator subgroup \(K(S_3)\), which is just \(\{ e,
(1 \; 2 \; 3), (3 \; 2 \; 1) \}\). We can also compute the
commutator subgroup of <em>this</em> group, \(K(K(S_3))\), which we’ll
also write as \(K^{(2)}(S_3)\); it is just the trivial group
again, since \(K(S_3)\) is commutative. So we can see that
\(K^{(2)}(S_3)\) being the trivial group means that we can’t rule
out algebraic expressions with nested radicals as solutions to the
general cubic equation.</p>
<p>Given all the elements of a group \(G\), it’s not
particularly complicated to compute the commutator subgroup—just
take all possible pairs of elements \(g, h \in G\), compute \([g,
h]\), and remove duplicates. However, we can make things easier for
ourselves by finding generators for \(K(G)\) as commutators of
generators of \(G\), since then we can easily map those back to \([\![
o_1, o_2 ]\!]\) applied on the appropriate operations. Fortunately,
when \(G = S_n\), we can use a few facts from group theory to easily
compute \(K(S_n)\). First, \(K(S_n)\) is called the <a href="https://en.wikipedia.org/wiki/Alternating_group">alternating group</a> \(A_n\),
and is generated by the \(3\)-cycles of the form \((i \enspace i+1
\enspace i+2)\), similar to how \(S_n\) is generated by the
\(2\)-cycles of the form \((i \enspace i + 1)\). But a \(3\)-cycle
\((i \enspace i+1 \enspace i+2)\) can be expressed as the commutator
of two \(2\)-cycles \([(i+2 \enspace i+1), (i \enspace
i+1)]\).</p>
<p>Therefore, for \(S_4\), the generators for \(K(S_4)\) are just \((1
\; 2 \; 3) = [(2 \; 3), (1 \; 2)]\) and \((2 \; 3 \; 4) = [(3 \; 4),
(2 \; 3)]\), with respective operations \([\![ ↺_{2,
3}, ↺_{1, 2} ]\!]\) and \([\![ ↺_{3,
4}, ↺_{2, 3} ]\!]\). However, these two generators
are not quite enough to generate \(K^{(2)}(S_4)\) via
commutators. Fortunately, it suffices to just add
\(↺_{4, 1}\) to the list of operations, which lets us
add \((1 \; 4)\) to the list of generators for \(S_4\), and then add
\((3 \; 4 \; 1)\) to the list of generators for \(K(S_4)\). Then
\((1 \; 4) (2 \; 3) = [(2 \; 3 \; 4), (1 \; 2 \; 3)]\) and \((2 \;
1) (3 \; 4) = [(3 \; 4 \; 1), (2 \; 3 \; 4)]\) suffice to generate
\(K^{(2)}(S_4)\).<sup><a href="#fn5" id="r5">[5]</a></sup> Finally,
we can easily compute \(K^{(3)}(S_4)\) to be the trivial group.</p>
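<p>These subgroup computations can also be double-checked by brute
force. The JavaScript sketch below starts from all of \(S_n\),
repeatedly takes the subgroup generated by all commutators, and
records the size of each group in the chain. It’s a straightforward
enumeration that’s only meant for small \(n\), and the helper names
are assumptions made for this example.</p>

```javascript
// A permutation is an array p, where p[i] is the position i is sent to.
// compose(f, g) applies g first, then f.
function compose(f, g) {
  return g.map(function(x) { return f[x]; });
}

function inverse(p) {
  var q = [];
  p.forEach(function(x, i) { q[x] = i; });
  return q;
}

function commutator(g, h) {
  return compose(compose(inverse(g), inverse(h)), compose(g, h));
}

function allPermutations(n) {
  if (n === 0) {
    return [[]];
  }
  var result = [];
  allPermutations(n - 1).forEach(function(p) {
    for (var i = 0; i <= p.length; ++i) {
      var q = p.slice();
      q.splice(i, 0, n - 1);
      result.push(q);
    }
  });
  return result;
}

// The subgroup generated by all commutators of elements of `group`,
// which must be a non-empty list of permutations of the same size.
function commutatorSubgroup(group) {
  var generators = [];
  group.forEach(function(g) {
    group.forEach(function(h) { generators.push(commutator(g, h)); });
  });
  // Close under multiplication by the generators, starting from the
  // identity; in a finite group this also picks up all inverses.
  var identity = group[0].map(function(_, i) { return i; });
  var elements = {};
  elements[identity.join(',')] = identity;
  var queue = [identity];
  while (queue.length > 0) {
    var p = queue.pop();
    generators.forEach(function(g) {
      var q = compose(g, p);
      var key = q.join(',');
      if (!(key in elements)) {
        elements[key] = q;
        queue.push(q);
      }
    });
  }
  return Object.keys(elements).map(function(key) { return elements[key]; });
}

// Sizes of S_n, K(S_n), K^(2)(S_n), ..., stopping once the group is
// trivial or stops shrinking.
function derivedSeriesOrders(n) {
  var group = allPermutations(n);
  var orders = [group.length];
  while (group.length > 1) {
    var next = commutatorSubgroup(group);
    if (next.length === group.length) {
      break;
    }
    group = next;
    orders.push(group.length);
  }
  return orders;
}

console.log(derivedSeriesOrders(3));  // [ 6, 3, 1 ]
console.log(derivedSeriesOrders(4));  // [ 24, 12, 4, 1 ]
```

<p>For \(S_4\) the chain has sizes \(24\), \(12\), \(4\), and \(1\),
so \(K^{(2)}(S_4)\) has four elements and \(K^{(3)}(S_4)\) is
trivial, matching the computation above.</p>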
<p>What does that tell us about what expressions we can rule out as
solutions to the general quartic equation? Similarly to the cubic
case, we expect to be able to rule out rational expressions and
algebraic expressions with no nested radicals, and since
\(K^{(2)}(S_4)\) is not the trivial group, we also expect to be able
to rule out algebraic expressions with singly-nested radicals, like
\(\sqrt{a - \sqrt{bc - 4}}\). But since \(K^{(3)}(S_4)\) is the
trivial group, we don’t expect to be able to rule out algebraic
expressions with doubly-nested radicals, like \(\sqrt{a - \sqrt{bc -
\sqrt{d + 3}}}\).</p>
<p>As an antidote to all the abstractness above, here is a
visualization for quartics, where you can examine how the various
operations interact with the <a href="https://en.wikipedia.org/wiki/Quartic_function#General_formula_for_roots">quartic formula</a>
and its subexpressions.</p>
<div class="interactive-example">
<h3>Interactive Example 5: The quartic equation</h3>
<div class="graph-container">
Roots
<nokatex><div id="rootBoardQuartic" class="graph jxgbox"></div></nokatex>
<span id="rootListQuartic">
\(R = \langle r_1, r_2, r_3, r_4 \rangle\)
</span>
<br />
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.resetRootAndResultList();">
Reset \(R\) and \(X\) order
</button>
<br />
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opA1, 1000);">
\(A_1 = ↺_{1, 2}\)
</button>
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opA2, 1000);">
\(A_2 = ↺_{2, 3}\)
</button>
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opA3, 1000);">
\(A_3 = ↺_{3, 4}\)
</button>
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opA4, 1000);">
\(A_4 = ↺_{4, 1}\)
</button>
<br />
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opA1.invert(), 1000);">
\(A_1^\prime = ↻_{1, 2}\)
</button>
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opA2.invert(), 1000);">
\(A_2^\prime = ↻_{2, 3}\)
</button>
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opA3.invert(), 1000);">
\(A_3^\prime = ↻_{3, 4}\)
</button>
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opA4.invert(), 1000);">
\(A_4^\prime = ↻_{4, 1}\)
</button>
<br />
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opB1, 4000);">
\(B_1 = [\![ A_2, A_1 ]\!] \mapsto (1 \; 2 \; 3)\)
</button>
<br />
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opB2, 4000);">
\(B_2 = [\![ A_3, A_2 ]\!] \mapsto (2 \; 3 \; 4)\)
</button>
<br />
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opB3, 4000);">
\(B_3 = [\![ A_4, A_3 ]\!] \mapsto (3 \; 4 \; 1)\)
</button>
<br />
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opB1.invert(), 4000);">
\(B_1^\prime\)
</button>
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opB2.invert(), 4000);">
\(B_2^\prime\)
</button>
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opB3.invert(), 4000);">
\(B_3^\prime\)
</button>
<br />
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opC1, 16000);">
\(C_1 = [\![ B_2, B_1 ]\!] \mapsto (1 \; 4) (2 \; 3)\)
</button>
<br />
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opC2, 16000);">
\(C_2 = [\![ B_3, B_2 ]\!] \mapsto (2 \; 1) (3 \; 4)\)
</button>
<br />
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opC1.invert(), 16000);">
\(C_1^\prime\)
</button>
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.runOp(quartic.opC2.invert(), 16000);">
\(C_2^\prime\)
</button>
</div>
<div class="graph-container">
Coefficients
<nokatex><div id="coeffBoardQuartic" class="graph jxgbox"></div></nokatex>
</div>
<div class="graph-container">
Candidate solution
<nokatex><div id="formulaBoardQuartic" class="graph jxgbox"></div></nokatex>
<span id="resultListQuartic">
\(X = \langle x_1, x_2, x_3, x_4, x_5, x_6 \rangle\)
</span>
<span id="resultNoteQuartic"></span>
<br />
<button class="interactive-example-button quarticDisableWhileRunningOp"
type="button" onclick="quartic.findFirstOpRulingOutSelectedFormula();">
Find first operation that rules out selected formula
</button>
<span id="findFirstOpStatusQuartic"></span>
<br />
<label>
<input class="quarticDisableWhileRunningOp" name="formulaQuartic" type="radio"
onchange="quartic.switchFormula(quarticScaledDiscFormula);" />
\(x_1 = -27 Δ\)
</label>
<br />
<label>
<input class="quarticDisableWhileRunningOp" name="formulaQuartic" type="radio"
onchange="quartic.switchFormula(newQuarticQCubedFormula());" />
\(x_{1, 2} = Q^3 = \frac{Δ_1 + \sqrt{-27 Δ}}{2}\)
</label>
<br />
<label>
<input checked class="quarticDisableWhileRunningOp" name="formulaQuartic" type="radio"
onchange="quartic.switchFormula(newQuarticQCubedFormula().root(3));" />
\(x_{1, 2, 3, 4, 5, 6} = Q\)
</label>
<br />
<label>
<input class="quarticDisableWhileRunningOp" name="formulaQuartic" type="radio"
onchange="quartic.switchFormula(newQuarticSFormula());" />
\(x_{1, 2, 3, 4, 5, 6} = S =\)
<br />
\(\qquad \frac{1}{2} \sqrt{-\frac{2}{3} p + \frac{1}{3a} \left( Q + \frac{Δ_0}{Q} \right)}\)
</label>
<br />
<label>
<input class="quarticDisableWhileRunningOp" name="formulaQuartic" type="radio"
onchange="quartic.switchFormula(newQuarticFormula());" />
\(x_{1, 2, 3, 4} = \)
<br />
\(\qquad -\frac{b}{4a} \mp S + \frac{1}{2} \sqrt{-4S^2 - 2p \pm \frac{q}{S}}\)
<br />
(the quartic formula)
</label>
</div>
</div>
<script type="text/javascript">
'use strict';
function isIdentityPermutation(permutation) {
for (var i = 0; i < permutation.length; ++i) {
if (permutation[i] != i) {
return false;
}
}
return true;
}
function updateResultNote(display, resultNoteID, formulaName) {
var rootPermutation = display.getRootPermutation();
var resultPermutation = display.getResultPermutation();
var resultNote = document.getElementById(resultNoteID);
if (isIdentityPermutation(rootPermutation) ==
isIdentityPermutation(resultPermutation)) {
resultNote.innerHTML = '';
} else {
resultNote.innerHTML = '(Applied operation rules out selected formula as the ' + formulaName + ' formula.)';
}
}
function checkOpRulesOutFormula(
display, resetFn, runOpFn, op, time, undoCallback, doneCallback) {
resetFn();
runOpFn(op, time, function() {
var rootPermutation = display.getRootPermutation();
var resultPermutation = display.getResultPermutation();
var rulesOut = (isIdentityPermutation(rootPermutation) !=
isIdentityPermutation(resultPermutation));
undoCallback();
runOpFn(op.invert(), time, function() {
doneCallback(rulesOut);
});
});
}
function findFirstOpRulingOutSelectedFormulaHelper(
display, resetFn, runOpFn, opInfos, statusCallback, doneCallback) {
var i = 0;
var undoCallback = function() {
statusCallback(opInfos[i], true);
}
var _doneCallback = function(rulesOut) {
if (rulesOut) {
doneCallback(opInfos[i]);
return;
}
++i;
if (i >= opInfos.length) {
doneCallback(null);
return;
}
statusCallback(opInfos[i], false);
checkOpRulesOutFormula(
display, resetFn, runOpFn,
opInfos[i].op, opInfos[i].time, undoCallback, _doneCallback);
};
statusCallback(opInfos[0], false);
checkOpRulesOutFormula(
display, resetFn, runOpFn,
opInfos[0].op, opInfos[0].time, undoCallback, _doneCallback);
}
function findFirstOpRulingOutSelectedFormula(
display, resetFn, runOpFn, opInfos, statusID) {
var status = document.getElementById(statusID);
var statusCallback = function(opInfo, isUndo) {
if (isUndo) {
status.innerHTML = 'Undoing ' + opInfo.name + '...';
} else {
status.innerHTML = 'Trying ' + opInfo.name + '...';
}
};
var doneCallback = function(opInfo) {
if (opInfo === null) {
status.innerHTML = 'No op ruling out selected formula found';
} else {
status.innerHTML = opInfo.name + ' rules out selected formula';
}
};
findFirstOpRulingOutSelectedFormulaHelper(
display, resetFn, runOpFn, opInfos, statusCallback, doneCallback);
}
var quartic = (function() {
var initialRoots = [
new Complex(0, 1), new Complex(-0.5, -0.5),
new Complex(0.5, 0.5), new Complex(0.5, -0.5)
];
var display = new Display(
"rootBoardQuartic", "coeffBoardQuartic", "formulaBoardQuartic",
initialRoots, newQuarticQCubedFormula().root(3), function() {});
display._resultRotationCounterPoint.setAttribute({visible: false});
function updateRootAndResultList(display) {
updateRootList(display, "rootListQuartic");
updateResultList(display, "resultListQuartic");
updateResultNote(display, "resultNoteQuartic", "quartic");
}
var state = {};
function runQuarticOp(op, time, doneCallback) {
runOp(display, op, time, '.quarticDisableWhileRunningOp', state, doneCallback);
};
function switchQuarticFormula(formula) {
switchFormula(display, state, formula);
updateRootAndResultList(display);
}
function resetRootAndResultList() {
display.reorderPointsBySubscript();
display.resetResultRotationCounters();
updateRootAndResultList(display);
}
var opA1 = display.swapRootOp(0, 1, updateRootAndResultList);
var opA2 = display.swapRootOp(1, 2, updateRootAndResultList);
var opA3 = display.swapRootOp(2, 3, updateRootAndResultList);
var opA4 = display.swapRootOp(3, 0, updateRootAndResultList);
var opB1 = newCommutatorAnimation(opA2, opA1);
var opB2 = newCommutatorAnimation(opA3, opA2);
var opB3 = newCommutatorAnimation(opA4, opA3);
var opC1 = newCommutatorAnimation(opB2, opB1);
var opC2 = newCommutatorAnimation(opB3, opB2);
var opInfos = [
{
name: 'A<sub>1</sub>',
op: opA1,
time: 1000
},
{
name: 'A<sub>2</sub>',
op: opA2,
time: 1000
},
{
name: 'A<sub>3</sub>',
op: opA3,
time: 1000
},
{
name: 'A<sub>4</sub>',
op: opA4,
time: 1000
},
{
name: 'B<sub>1</sub>',
op: opB1,
time: 4000
},
{
name: 'B<sub>2</sub>',
op: opB2,
time: 4000
},
{
name: 'B<sub>3</sub>',
op: opB3,
time: 4000
},
{
name: 'C<sub>1</sub>',
op: opC1,
time: 16000
},
{
name: 'C<sub>2</sub>',
op: opC2,
time: 16000
}
];
function findFirstOpRulingOutSelectedFormulaQuartic() {
findFirstOpRulingOutSelectedFormula(
display, resetRootAndResultList, runQuarticOp, opInfos,
'findFirstOpStatusQuartic');
}
return {
display: display,
opA1: opA1,
opA2: opA2,
opA3: opA3,
opA4: opA4,
opB1: opB1,
opB2: opB2,
opB3: opB3,
opC1: opC1,
opC2: opC2,
runOp: runQuarticOp,
resetRootAndResultList: resetRootAndResultList,
switchFormula: switchQuarticFormula,
findFirstOpRulingOutSelectedFormula: findFirstOpRulingOutSelectedFormulaQuartic
};
})();
</script>
<p>There are a few additions to the interactive display above. It now
prints a message when it detects that the selected expression is
ruled out as the quartic formula, which just looks at whether \(R\)
is not in order and \(X\) is, and vice versa. There’s also a
button to reset the ordering of \(R\) and \(X\).</p>
<p>The second addition is that the operations have been organized to
make clear what commutator subgroup they’re in. The \(A_i\) map
to generators of \(S_4\). Taking the commutators of adjacent
\(A_i\) then gives the \(B_i\), which map to the generators of \(K(S_4)\), and
similarly for the \(C_i\).</p>
<div class="p">The third addition is a button that finds the first operation that
rules out the selected formula, if any. It simply tries all the
\(A_i\)s, then all the \(B_i\)s, then all the \(C_i\)s, checking \(R\)
and \(X\) in between. The general algorithm, which assumes a fixed set
of roots \(r_1, \dotsc, r_n\text{,}\) takes an expression \(f(a_n, a_{n-1}, \dotsc)\)
where \(a_n x^n + a_{n-1} x^{n-1} + \dotsb + a_0 = 0\) is the general
\(n\)th-degree polynomial equation, takes a depth limit \(k\), and
looks like this (defining \(K^{(0)}(G)\) to be just \(G\)):
<ol>
<li>For \(i\) from 0 to \(k\):
<ol>
<li>If \(K^{(i)}(S_n)\) is trivial, then terminate indicating that
\(f(a_n, a_{n-1}, \dotsc)\) was unable to be ruled out because
\(K^{(i)}(S_n)\) is trivial.</li>
<li>Otherwise, find operations \(o_1\) to \(o_m\) that act as the
generators \(g_1\) to \(g_m\) of \(K^{(i)}(S_n)\). For \(i >
0\), this can be done by applying \([\![ o_1, o_2 ]\!]\) to the
operations corresponding to the generators of
\(K^{(i-1)}(S_n)\).</li>
<li>For each \(o_j\):
<ol>
<li>Apply \(o_j\).</li>
<li>If \(R\) is not in order but \(X\) is, terminate indicating
that \(o_j\) rules out \(f(a_n, a_{n-1}, \dotsc)\).</li>
<li>Undo \(o_j\), i.e. apply \(o_j^\prime\) or reset to the
initial state of \(r_1, \dotsc, r_n\).</li>
</ol></li>
</ol></li>
<li>Terminate indicating that \(f(a_n, a_{n-1}, \dotsc)\) was unable to
be ruled out because the depth limit has been reached.</li>
</ol>
</div>
<p>This algorithm basically just implements the proof of the following
lemma, which generalizes the previous theorems, except that it tries
to find the simplest operation that is a generator that rules out
the given expression.</p>
<p>Before we state the lemma, we need another definition: let the <em>radical level</em> of an algebraic expression
\(f(a_n, a_{n-1}, \dotsc)\) be \(0\) if \(f(a_n, a_{n-1}, \dotsc)\) is a
rational expression, \(1\) if \(f(a_n, a_{n-1}, \dotsc)\) has only
non-nested radicals, and \(m + 1\) if the maximum number of nested
radicals is \(m\).</p>
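<p>To make the definition concrete, here is a minimal sketch that computes the radical level of an expression tree. It reads the definition as "the greatest depth to which radicals are nested", and the tree representation (the <code>type</code>, <code>args</code>, <code>k</code>, and <code>arg</code> fields) is a hypothetical one chosen for illustration, not the formula type used by the interactive examples.</p>

```javascript
'use strict';

// Hypothetical expression tree (an assumption for this sketch):
//   { type: 'coeff', name: 'a' }         a coefficient of the polynomial
//   { type: 'const', value: 0.8 }        a numeric constant
//   { type: 'op', args: [e1, e2, ...] }  +, -, *, or / applied to subexpressions
//   { type: 'root', k: 3, arg: e }       the k-th root of e
function radicalLevel(expr) {
  switch (expr.type) {
    case 'coeff':
    case 'const':
      // Rational building blocks contribute no radicals.
      return 0;
    case 'op':
      // Rational operations don't add nesting; take the deepest argument.
      return Math.max.apply(null, expr.args.map(radicalLevel));
    case 'root':
      // Each radical adds one level on top of whatever is inside it.
      return 1 + radicalLevel(expr.arg);
    default:
      throw new Error('unknown node type: ' + expr.type);
  }
}

// A rational expression in the coefficients (level 0).
var rational = {
  type: 'op',
  args: [{ type: 'coeff', name: 'a' }, { type: 'coeff', name: 'b' }]
};

// sqrt(rational) - c, which has level 1.
function sqrtMinus(c) {
  return {
    type: 'op',
    args: [{ type: 'root', k: 2, arg: rational }, { type: 'const', value: c }]
  };
}

// A cube root of a product of two such terms, which has level 2.
var nested = {
  type: 'root',
  k: 3,
  arg: { type: 'op', args: [sqrtMinus(0.8), sqrtMinus(0.75)] }
};

console.log(radicalLevel(rational), radicalLevel(sqrtMinus(0.8)), radicalLevel(nested));
```

<p>Under this reading, an expression with at most singly-nested radicals has radical level at most \(2\).</p>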
<div class="theorem">(<span class="theorem-name">Lemma 3</span>.) If the
algebraic expression \(f(a_n, a_{n-1}, \dotsc)\) has radical level
\(d\) and \(K^{(d)}(S_n)\) is non-trivial, then any operator that
maps to a non-trivial element \(g\) in \(K^{(d)}(S_n)\) rules out
\(f(a_n, a_{n-1}, \dotsc)\) as the solution to the general
\(n\)th-degree polynomial equation
\[
a_n x^n + a_{n-1} x^{n-1} + \dotsb + a_0 = 0\text{.}
\]</div>
<div class="proof">
<p><span class="proof-name">Rough sketch of proof.</span> We just do
induction on \(d\). For the base case \(d = 0\), if \(K^{(0)}(S_n)\)
is non-trivial, then \(n \ge 2\). Let \(g = (i \; j)\) for any \(i
\ne j\), of which there must be at least one. Then by the same
reasoning as Theorem 1, \(g\) rules out \(f(a_n, a_{n-1},
\dotsc)\). Since the \((i \; j)\) generate \(S_n\), then any \(g \in
S_n\) is the composition of some sequence of \((i \; j)\)s, each of
which rules out \(f(a_n, a_{n-1}, \dotsc)\), so \(g\) must also rule
it out.</p>
<p>Assume the lemma holds for \(d\), and let \(x = f_{d+1}(a_n,
a_{n-1}, \dotsc) = \sqrt[k]{f_d(a_n, a_{n-1}, \dotsc)}\) for some
\(k\), where \(f_d\) has radical level \(d\). Let \(o\) act on \(R\)
like any non-trivial element \(g\) of \(K^{(d+1)}(S_n)\). By the
induction hypothesis, all elements \(h_i \in K^{(d)}(S_n)\) cause
\(x = f_d(a_n, a_{n-1}, \dotsc)\) to go around a loop, so pick a
point \(z_0\) that is never equal to any point \(x\) traverses under
any operation corresponding to \(h_i\). Then, since \(g = [h, k]\)
for \(h, k \in K^{(d)}(S_n)\), by the same reasoning as in
Theorem 2, the total angle induced by \(o\) on \(x =
f_{d+1}(a_n, a_{n-1}, \dotsc)\) around \(z_0\) is \(0\), and the
distance from \(z_0\) remains unchanged. Thus, \(x = f_{d+1}(a_n,
a_{n-1}, \dotsc)\) remains fixed, and \(o\) rules it out.</p>
<p>By the same reasoning as in Theorem 2, this can be extended to the
general case of \(f(a_n, a_{n-1}, \dotsc)\) being any algebraic
formula with radical level \(d + 1\). ∎</p>
</div>
<div>We can immediately deduce the following corollary, using the fact
that \(K^{(2)}(S_4)\) is non-trivial:
<div class="theorem">(<span class="theorem-name">Corollary 4</span>.) An
algebraic expression with at most singly-nested radicals in the
coefficients of the general quartic equation
\[
ax^4 + bx^3 + cx^2 + dx + e = 0
\]
cannot be a solution to this equation.<sup><a href="#fn6" id="r6">[6]</a></sup></div>
</div>
</section>
<section>
<header>
<h2>5. Quintic Equations</h2>
</header>
<p>Now, finally, the quintic. Let’s jump right to the interactive example.</p>
<div class="interactive-example">
<h3>Interactive Example 6: The quintic equation</h3>
<div class="graph-container">
Roots
<nokatex><div id="rootBoardQuintic" class="graph jxgbox"></div></nokatex>
<span id="rootListQuintic">
\(R = \langle r_1, r_2, r_3, r_4, r_5 \rangle\)
</span>
<br />
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.resetRootAndResultList();">
Reset \(R\) and \(X\) order
</button>
<br />
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opA1, 1000);">
\(A_1 = ↺_{1, 2}\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opA2, 1000);">
\(A_2 = ↺_{2, 3}\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opA3, 1000);">
\(A_3 = ↺_{3, 4}\)
</button>
<br />
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opA4, 1000);">
\(A_4 = ↺_{4, 5}\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opA5, 1000);">
\(A_5 = ↺_{5, 1}\)
</button>
<br />
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opA1.invert(), 1000);">
\(A_1^\prime = ↻_{1, 2}\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opA2.invert(), 1000);">
\(A_2^\prime = ↻_{2, 3}\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opA3.invert(), 1000);">
\(A_3^\prime = ↻_{3, 4}\)
</button>
<br />
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opA4.invert(), 1000);">
\(A_4^\prime = ↻_{4, 5}\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opA5.invert(), 1000);">
\(A_5^\prime = ↻_{5, 1}\)
</button>
<br />
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opB1, 4000);">
\(B_1 = [\![ A_2, A_1 ]\!] \mapsto (1 \; 2 \; 3)\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opB2, 4000);">
\(B_2 = [\![ A_3, A_2 ]\!] \mapsto (2 \; 3 \; 4)\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opB3, 4000);">
\(B_3 = [\![ A_4, A_3 ]\!] \mapsto (3 \; 4 \; 5)\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opB4, 4000);">
\(B_4 = [\![ A_5, A_4 ]\!] \mapsto (4 \; 5 \; 1)\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opB5, 4000);">
\(B_5 = [\![ A_1, A_5 ]\!] \mapsto (5 \; 1 \; 2)\)
</button>
<br />
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opB1.invert(), 4000);">
\(B_1^\prime\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opB2.invert(), 4000);">
\(B_2^\prime\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opB3.invert(), 4000);">
\(B_3^\prime\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opB4.invert(), 4000);">
\(B_4^\prime\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opB5.invert(), 4000);">
\(B_5^\prime\)
</button>
<br />
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opC1, 16000);">
\(C_1 = [\![ B_3, B_1 ]\!] \mapsto (2 \; 3 \; 5)\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opC2, 16000);">
\(C_2 = [\![ B_4, B_2 ]\!] \mapsto (3 \; 4 \; 1)\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opC3, 16000);">
\(C_3 = [\![ B_5, B_3 ]\!] \mapsto (4 \; 5 \; 2)\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opC4, 16000);">
\(C_4 = [\![ B_1, B_4 ]\!] \mapsto (5 \; 1 \; 3)\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opC5, 16000);">
\(C_5 = [\![ B_2, B_5 ]\!] \mapsto (1 \; 2 \; 4)\)
</button>
<br />
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opC1.invert(), 16000);">
\(C_1^\prime\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opC2.invert(), 16000);">
\(C_2^\prime\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opC3.invert(), 16000);">
\(C_3^\prime\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opC4.invert(), 16000);">
\(C_4^\prime\)
</button>
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.runOp(quintic.opC5.invert(), 16000);">
\(C_5^\prime\)
</button>
</div>
<div class="graph-container">
Coefficients
<nokatex><div id="coeffBoardQuintic" class="graph jxgbox"></div></nokatex>
</div>
<div class="graph-container">
Candidate solution
<nokatex><div id="formulaBoardQuintic" class="graph jxgbox"></div></nokatex>
<span id="resultListQuintic">
\(X = \langle x_1, x_2, x_3, x_4, x_5, x_6 \rangle\)
</span>
<span id="resultNoteQuintic"></span>
<br />
<button class="interactive-example-button quinticDisableWhileRunningOp"
type="button" onclick="quintic.findFirstOpRulingOutSelectedFormula();">
Find first operation that rules out selected formula
</button>
<span id="findFirstOpStatusQuintic"></span>
<br />
<label>
<input class="interactive-example-button quinticDisableWhileRunningOp" name="formulaQuintic" type="radio"
onchange="quintic.switchFormula(quintic.fA);" />
\(x_1 = f_A = Δ\)
</label>
<br />
<label>
<input class="interactive-example-button quinticDisableWhileRunningOp" name="formulaQuintic" type="radio"
onchange="quintic.switchFormula(quintic.newFB());" />
\(x_{1, 2} = f_B = \sqrt{f_A}\)
</label>
<br />
<label>
<input checked class="interactive-example-button quinticDisableWhileRunningOp" name="formulaQuintic" type="radio"
onchange="quintic.switchFormula(quintic.newFC());" />
\(x_{1, 2, 3, 4, 5, 6} = f_C =\)
<br />
\(\qquad \sqrt[3]{(f_B - 0.8)(f_B - 0.75)}\)
</label>
</div>
</div>
<script type="text/javascript">
'use strict';
var quintic = (function() {
var initialRoots = [
new Complex(0, 1), new Complex(-0.5, -0.5), new Complex(0.5, -0.5),
new Complex(1, 0), new Complex(0.5, 0.5)
];
var display = new Display(
"rootBoardQuintic", "coeffBoardQuintic", "formulaBoardQuintic",
initialRoots, newFC(), function() {});
display._resultRotationCounterPoint.setAttribute({visible: false});
for (var i = 0; i < display._rootPointsBySubscript.length; ++i) {
display._rootPointsBySubscript[i].setAttribute({
fixed: true
});
}
function updateRootAndResultList(display) {
updateRootList(display, "rootListQuintic");
updateResultList(display, "resultListQuintic");
updateResultNote(display, "resultNoteQuintic", "quintic");
}
var state = {};
function runQuinticOp(op, time, doneCallback) {
runOp(display, op, time, '.quinticDisableWhileRunningOp', state, doneCallback);
};
function switchQuinticFormula(formula) {
switchFormula(display, state, formula);
updateRootAndResultList(display);
}
function resetRootAndResultList() {
display.reorderPointsBySubscript();
display.resetResultRotationCounters();
updateRootAndResultList(display);
}
var opA1 = display.swapRootOp(0, 1, updateRootAndResultList);
var opA2 = display.swapRootOp(1, 2, updateRootAndResultList);
var opA3 = display.swapRootOp(2, 3, updateRootAndResultList);
var opA4 = display.swapRootOp(3, 4, updateRootAndResultList);
var opA5 = display.swapRootOp(4, 0, updateRootAndResultList);
var opA1Inv = opA1.invert();
var opA2Inv = opA2.invert();
var opA3Inv = opA3.invert();
var opA4Inv = opA4.invert();
var opA5Inv = opA5.invert();
var opB1 = newCommutatorAnimation(opA2, opA1);
var opB2 = newCommutatorAnimation(opA3, opA2);
var opB3 = newCommutatorAnimation(opA4, opA3);
var opB4 = newCommutatorAnimation(opA5, opA4);
var opB5 = newCommutatorAnimation(opA1, opA5);
var opB1Inv = opB1.invert();
var opB2Inv = opB2.invert();
var opB3Inv = opB3.invert();
var opB4Inv = opB4.invert();
var opB5Inv = opB5.invert();
var opC1 = newCommutatorAnimation(opB3, opB1);
var opC2 = newCommutatorAnimation(opB4, opB2);
var opC3 = newCommutatorAnimation(opB5, opB3);
var opC4 = newCommutatorAnimation(opB1, opB4);
var opC5 = newCommutatorAnimation(opB2, opB5);
var opC1Inv = opC1.invert();
var opC2Inv = opC2.invert();
var opC3Inv = opC3.invert();
var opC4Inv = opC4.invert();
var opC5Inv = opC5.invert();
var opInfos = [
{
name: 'A<sub>1</sub>',
op: opA1,
time: 1000
},
{
name: 'A<sub>2</sub>',
op: opA2,
time: 1000
},
{
name: 'A<sub>3</sub>',
op: opA3,
time: 1000
},
{
name: 'A<sub>4</sub>',
op: opA4,
time: 1000
},
{
name: 'A<sub>5</sub>',
op: opA5,
time: 1000
},
{
name: 'B<sub>1</sub>',
op: opB1,
time: 4000
},
{
name: 'B<sub>2</sub>',
op: opB2,
time: 4000
},
{
name: 'B<sub>3</sub>',
op: opB3,
time: 4000
},
{
name: 'B<sub>4</sub>',
op: opB4,
time: 4000
},
{
name: 'B<sub>5</sub>',
op: opB5,
time: 4000
},
{
name: 'C<sub>1</sub>',
op: opC1,
time: 16000
},
{
name: 'C<sub>2</sub>',
op: opC2,
time: 16000
},
{
name: 'C<sub>3</sub>',
op: opC3,
time: 16000
},
{
name: 'C<sub>4</sub>',
op: opC4,
time: 16000
},
{
name: 'C<sub>5</sub>',
op: opC5,
time: 16000
}
];
function findFirstOpRulingOutSelectedFormulaQuintic() {
findFirstOpRulingOutSelectedFormula(
display, resetRootAndResultList, runQuinticOp, opInfos,
'findFirstOpStatusQuintic');
}
// Ruled out by A_i.
var fA = quinticDiscFormula;
// Ruled out by B_i.
function newFB() {
return quinticDiscFormula.root(2);
}
// Has a rotation number with B_1, B_2, B_4, and B_5.
function newPreFC1() {
return newFB().minusAll(0.8);
}
// Has a rotation number with B_3.
function newPreFC2() {
return newFB().minusAll(0.75);
}
// Has a rotation number with all B_i.
function newPreFC3() {
return ComplexFormula.times(
newPreFC1(),
newPreFC2()
);
}
// 2 evenly divides the rotation numbers with B_1, B_2, B_4, and B_5, so
// this doesn't work for f_C.
function newPreFC4() {
return newPreFC3().root(2);
}
// Ruled out by C_i.
function newFC() {
return newPreFC3().root(3);
}
return {
display: display,
opA1: opA1,
opA2: opA2,
opA3: opA3,
opA4: opA4,
opA5: opA5,
opB1: opB1,
opB2: opB2,
opB3: opB3,
opB4: opB4,
opB5: opB5,
opC1: opC1,
opC2: opC2,
opC3: opC3,
opC4: opC4,
opC5: opC5,
fA: fA,
newFB: newFB,
newFC: newFC,
runOp: runQuinticOp,
resetRootAndResultList: resetRootAndResultList,
switchFormula: switchQuinticFormula,
findFirstOpRulingOutSelectedFormula: findFirstOpRulingOutSelectedFormulaQuintic
};
})();
</script>
<p>Similarly to the interactive example for the quartic, the
operations are organized to make clear what commutator subgroup
they’re in. There’s something interesting
though—the \(C_i\) seem very similar to the \(B_i\). In fact,
the \(C_i\) also act on \(R\) like \(A_5\)! Also, if you compute
\(D_i = [\![ C_{(i+1) \bmod 5}, C_{i \bmod
5} ]\!]\), you will find that \(D_i\) acts exactly like \(B_i\) on
\(R\)!</p>
<div class="p">Why can we do this for the quintic, but not for anything of lower
degree? This is because \(A_5\) is <a href="https://en.wikipedia.org/wiki/Perfect_group">perfect</a>,
which means that it equals its own commutator subgroup. (You can
verify this yourself by brute force, e.g. writing a program, or you
can play around with \(3\)-cycles and see that any \(3\)-cycle is
the commutator of two other \(3\)-cycles.) Then this immediately
implies that \(K^{(n)}(S_5)\) is non-trivial for any \(n\), which
then implies our main result:
<div class="theorem">(<span class="theorem-name">Abel-Ruffini theorem</span>.)
An algebraic expression in the coefficients of the general
\(n\)th-degree polynomial equation
\[
a_n x^n + a_{n-1} x^{n-1} + \dotsb + a_0 = 0
\]
for \(n \ge 5\) cannot be a solution to this equation.</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> By the above, \(A_5\) is
perfect, so \(K^{(d)}(S_5)\) is non-trivial for all \(d\).</p>
<p>Since \(S_5\) is a subgroup of \(S_n\) for \(n \ge 5\), \(A_5 =
K(S_5)\) must also be a subgroup of \(A_n = K(S_n)\) for \(n \ge
5\). But since \(A_5\) is perfect, then \(A_5\) must also be a
subgroup of \(K^{(d)}(S_n)\) for any \(d\), which means that
\(K^{(d)}(S_n)\) is non-trivial for any \(d\) and \(n \ge 5\).</p>
<p>An algebraic expression has some finite radical level \(d\), but
\(K^{(d)}(S_n)\) is non-trivial for any \(d\) and \(n \ge 5\), so by
Lemma 3 no algebraic expression can be a solution to the general
\(n\)th-degree polynomial equation for \(n \ge 5\). ∎</p>
</div>
</div>
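<p>The brute-force verification mentioned above can be sketched as follows. This is only an illustration: permutations are represented as 0-based arrays, the commutator is taken to be \(g^{-1} h^{-1} g h\) (the other common convention generates the same subgroup), and all function names are made up for this sketch.</p>

```javascript
'use strict';

// A permutation is an array p where p[i] is the image of i (0-based).
function compose(p, q) {
  // (p after q)(i) = p[q[i]]
  return q.map(function(qi) { return p[qi]; });
}

function inverse(p) {
  var inv = new Array(p.length);
  p.forEach(function(pi, i) { inv[pi] = i; });
  return inv;
}

// All n! permutations of {0, ..., n-1}.
function allPermutations(n) {
  if (n === 0) return [[]];
  var result = [];
  allPermutations(n - 1).forEach(function(p) {
    for (var pos = 0; pos <= p.length; ++pos) {
      var q = p.slice();
      q.splice(pos, 0, n - 1);
      result.push(q);
    }
  });
  return result;
}

// The subgroup generated by the given permutations. In a finite group,
// closing under multiplication alone already yields a subgroup.
function generatedSubgroup(generators, n) {
  var identity = [];
  for (var i = 0; i < n; ++i) identity.push(i);
  var seen = {};
  seen[identity.join(',')] = identity;
  var queue = [identity];
  while (queue.length > 0) {
    var g = queue.pop();
    generators.forEach(function(s) {
      var h = compose(s, g);
      var key = h.join(',');
      if (!(key in seen)) {
        seen[key] = h;
        queue.push(h);
      }
    });
  }
  return Object.keys(seen).map(function(k) { return seen[k]; });
}

// K(G): the subgroup generated by all commutators of elements of G.
function commutatorSubgroup(group, n) {
  var unique = {};
  group.forEach(function(g) {
    group.forEach(function(h) {
      var c = compose(compose(inverse(g), inverse(h)), compose(g, h));
      unique[c.join(',')] = c;
    });
  });
  var commutators = Object.keys(unique).map(function(k) { return unique[k]; });
  return generatedSubgroup(commutators, n);
}

// Orders of S_n, K(S_n), K^(2)(S_n), ..., K^(depth)(S_n).
function derivedSeriesOrders(n, depth) {
  var group = allPermutations(n);
  var orders = [group.length];
  for (var i = 0; i < depth; ++i) {
    group = commutatorSubgroup(group, n);
    orders.push(group.length);
  }
  return orders;
}

console.log(derivedSeriesOrders(4, 3)); // [ 24, 12, 4, 1 ]
console.log(derivedSeriesOrders(5, 3)); // [ 120, 60, 60, 60 ]
```

<p>For \(S_4\) the orders drop to \(1\) after three steps, whereas for \(S_5\) they get stuck at \(60\), the order of \(A_5\), which is exactly what it means for \(A_5\) to be perfect.</p>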
<p>With the theorem above, we now have a succinct answer to the
question at the beginning of this article. You can’t write down
a solution to the general quadratic equation that is a rational
expression because you can find an operation on the roots that will
permute them non-trivially and yet leave the result of the expression
constant. For the same reason, you can’t write down a solution
to the general \(n\)th-degree polynomial equation for \(n \ge 5\) that is an algebraic
expression!</p>
<p>Finally, as a bonus, I’ll explain how to generate algebraic
expressions that require a “\(d\)th-level” operator,
meaning an operator that maps to an element of \(K^{(d)}(S_n)\),
assuming it’s non-trivial. This shows that there’s no
single “super-operation” that rules out all algebraic
expressions.</p>
<div class="p">As an example, the formulas in the interactive example above are
chosen so that \(f_A\) is ruled out by the \(A_i\), \(f_B\) is ruled
out by the \(B_i\), etc. They depend on the particular roots chosen,
of course, which is why this interactive example doesn’t let you
move the roots around, but in principle you could build formulas for
any polynomial that is first ruled out by \(C_i\), or \(D_i\), or
whatever you wish. Given a polynomial \(P = a_n x^n + a_{n-1} x^{n-1}
+ \dotsb + a_0\) of degree \(n \ge 5\) and \(d\), a recursive
algorithm to generate an expression that is ruled out only by a
“\(d\)th-level” operator is:
<ol>
<li>If \(d = 0\), return \(Δ(a_n, a_{n-1}, \dotsc)\).</li>
<li>Otherwise, run this algorithm with \(P\) and \(d-1\) to get
\(f_{d-1}(a_n, a_{n-1}, \dotsc)\).</li>
<li>Find operations \(o_1\) to \(o_m\) that correspond to
generators \(g_1\) to \(g_m\) of \(K^{(d-1)}(S_n)\).</li>
<li>For each \(o_i\):
<ol>
<li>Apply \(o_i\), which makes \(x = f_{d-1}(a_n, a_{n-1},
\dotsc)\) go around a loop. Record the looped-around regions
and their associated rotation numbers (i.e., the total angle
divided by \(2π\)).</li>
</ol>
</li>
<li>Pick points \(z_1, \dotsc, z_t\) such that each \(z_i\) has
a non-zero rotation number for at least one \(o_j\). \(t\) can
be at most \(m\).</li>
<li>Let \(k\) be the least number such that, for every \(o_i\),
\(k\) doesn’t divide any of the rotation numbers of any
\(z_j\) with respect to \(o_i\). Return \(f_d(a_n, a_{n-1}, \dotsc) = \sqrt[k]{\prod_i
(f_{d-1}(a_n, a_{n-1}, \dotsc) - z_i)}\).
</li>
</ol>
</div>
</section>
<hr />
<p>Like this post? Subscribe to
<!-- The image is 256x256, the center of the dot is 189 pixels from the
top, and the radius of the dot is 24. Therefore, the dot is 43/256 =
0.16796875 of the image height above the bottom.-->
<a href="feed/atom">my feed <img src="feed-icon.svg" alt="RSS icon" style="width: 1em; height: 1em; vertical-align: -0.16796875em;" /></a>
or follow me on
<a href="https://twitter.com/fakalin">Twitter <img src="twitter-icon.svg" alt="Twitter icon" style="width: 1em; height: 1em;" /></a>.</p>
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] This proof is originally due to <a href="https://en.wikipedia.org/wiki/Vladimir_Arnold">Arnold</a>. There
are a <a href="https://www.youtube.com/watch?v=RhpVSV6iCko">couple</a>
of <a href="http://drorbn.net/dbnvp/AKT-140314.php">videos</a> that
talk about this proof, as well as
<a href="http://link.springer.com/book/10.1007%2F1-4020-2187-9">this book</a>
based on Arnold’s lectures, and
<a href="https://www.tmna.ncu.pl/static/files/v16n2-02.pdf">this paper</a>.
I mostly follow Boaz’s video, and the interactive
visualizations are based on the visualizations he has in his
video.</p>
<p>The interactive visualizations were generated using
the excellent
<a href="http://jsxgraph.uni-bayreuth.de/wp/index.html">JSXGraph</a> library.
<a href="#r1">↩</a></p>
<p id="fn2">[2] Theorem 1 can be generalized even more! We can
append other functions and operations to rational expressions, as
long as those functions and operations are continuous and
single-valued. For example, we can allow the use of exponentials
and trigonometric functions, which is something that the standard
Galois theory cannot handle.<a href="#r2">↩</a></p>
<p id="fn3">[3] More precisely, a \(↺_{i, j}\)
contains a pair of simple paths, i.e. continuous injective
functions \([0, 1] \to \mathbb{C}\), between two distinct points
of \(\mathbb{C}\), such that their concatenation defines a simple
closed curve
around a region in \(\mathbb{C}\) with a counter-clockwise
orientation. Also, depending on the exact method of formalizing
\(↺_{i, j}\), it either explicitly or implicitly
encodes a permutation on \(R\). Then we can define an operation
\(*\) on the \(↺_{i, j}\) and
\(↻_{i, j}\) (defined analogously) which
concatenates the paths (and composes the permutations, if
explicitly encoded). Since the space of paths has no inverses or
an identity, the \(↺_{i, j}\) and
\(↻_{i, j}\) generate a <a
href="https://en.wikipedia.org/wiki/Free_semigroup">free semigroup</a> with
the operation \(*\). Then this semigroup defines an
<a href="https://en.wikipedia.org/wiki/Semigroup_action">action</a>
on \(R\) via its associated permutation on \(R\), which then just
generates \(S_n\), since \(S_n\) is generated by adjacent swaps.</p>
<p>We make a distinction between the operation
\(↺_{i, j}\) and the permutation it induces on
\(R\), since the latter “loses” the orientation
information, which is important to preserve when talking about the
action of \(↺_{i, j}\) on some \(x_i\).
<a href="#r3">↩</a></p>
<p id="fn4">[4] Note that, depending on the text, the commutator may
be defined slightly differently as \(g h g^{-1} h^{-1}\).
<a href="#r4">↩</a></p>
<p id="fn5">[5] \(K(A_4)\) is isomorphic to \(V\), the
<a href="https://en.wikipedia.org/wiki/Klein_four-group">Klein four-group</a>.
<a href="#r5">↩</a></p>
<p id="fn6">[6] In fact, the quartic formula has three nested
radicals. I wonder why?
<a href="#r6">↩</a></p>
</section>
https://www.akalin.com/computing-iroot
Computing Integer Roots
2016-01-10T00:00:00-08:00
Fred Akalin
<script>
KaTeXMacros = {
"\\iroot": "\\operatorname{iroot}",
"\\Bits": "\\operatorname{Bits}",
"\\Err": "\\operatorname{Err}",
"\\NewtonRoot": "\\mathrm{N{\\small EWTON}\\text{-}I{\\small ROOT}}",
};
</script>
<script src="https://cdn.rawgit.com/akalin/jsbn/v1.4/jsbn.js"></script>
<script src="https://cdn.rawgit.com/akalin/jsbn/v1.4/jsbn2.js"></script>
<section>
<header>
<h2>1. The algorithm</h2>
</header>
<p>Today I’m going to talk about the generalization of
the <a href="/computing-isqrt">integer square root algorithm</a> to
higher roots. That is, given \(n\) and \(p\), computing
\(\iroot(n, p) = \lfloor \sqrt[p]{n} \rfloor\), or the
greatest integer whose \(p\)th power is less than or equal to
\(n\). The generalized algorithm is straightforward, and it’s
easy to generalize the proof of correctness, but the run-time bound is
a bit trickier, since it has a dependence on \(p\).</p>
<div class="p">First, the algorithm, which we’ll call \(\NewtonRoot\):
<ol>
<li>If \(n = 0\), return \(0\).</li>
<li>If \(p \ge \Bits(n)\), return \(1\).</li>
<li>Otherwise, set \(i\) to \(0\) and set \(x_0\) to \(2^{\lceil
\Bits(n) / p\rceil}\).</li>
<li>Repeat:
<ol>
<li>Set \(x_{i+1}\) to \(\lfloor ((p - 1) x_i + \lfloor
n/x_i^{p-1} \rfloor) / p \rfloor\).</li>
<li>If \(x_{i+1} \ge x_i\), return \(x_i\). Otherwise, increment
\(i\).</li>
</ol>
</li>
</ol>
</div>
<div class="p">and its implementation in JavaScript:<sup><a href="#fn1" id="r1">[1]</a></sup>
<script>
// iroot returns the greatest number x such that x^p <= n. The type of
// n must behave like BigInteger (e.g.,
// https://github.com/akalin/jsbn ), n must be non-negative, and
// p must be a positive integer.
//
// Example (open up the JS console on this page and type):
//
// iroot(new BigInteger("64"), 3).toString()
function iroot(n, p) {
var s = n.signum();
if (s < 0) {
throw new Error('negative radicand');
}
if (p <= 0) {
throw new Error('non-positive degree');
}
if (p !== (p|0)) {
throw new Error('non-integral degree');
}
if (s == 0) {
return n;
}
var b = n.bitLength();
if (p >= b) {
return n.constructor.ONE;
}
// x = 2^ceil(Bits(n)/p)
var x = n.constructor.ONE.shiftLeft(Math.ceil(b/p));
var pMinusOne = new n.constructor((p - 1).toString());
var pBig = new n.constructor(p.toString());
while (true) {
// y = floor(((p-1)x + floor(n/x^(p-1)))/p)
var y = pMinusOne.multiply(x).add(n.divide(x.pow(pMinusOne))).divide(pBig);
if (y.compareTo(x) >= 0) {
return x;
}
x = y;
}
}
</script>
<pre class="code-container"><code class="language-javascript">// iroot returns the greatest number x such that x^p <= n. The type of
// n must behave like BigInteger (e.g.,
// https://github.com/akalin/jsbn ), n must be non-negative, and
// p must be a positive integer.
//
// Example (open up the JS console on this page and type):
//
// iroot(new BigInteger("64"), 3).toString()
function iroot(n, p) {
var s = n.signum();
if (s < 0) {
throw new Error('negative radicand');
}
if (p <= 0) {
throw new Error('non-positive degree');
}
if (p !== (p|0)) {
throw new Error('non-integral degree');
}
if (s == 0) {
return n;
}
var b = n.bitLength();
if (p >= b) {
return n.constructor.ONE;
}
// x = 2^ceil(Bits(n)/p)
var x = n.constructor.ONE.shiftLeft(Math.ceil(b/p));
var pMinusOne = new n.constructor((p - 1).toString());
var pBig = new n.constructor(p.toString());
while (true) {
// y = floor(((p-1)x + floor(n/x^(p-1)))/p)
var y = pMinusOne.multiply(x).add(n.divide(x.pow(pMinusOne))).divide(pBig);
if (y.compareTo(x) >= 0) {
return x;
}
x = y;
}
}</code></pre>
</div>
<p>This algorithm turns out to require \(Θ(p) + O(\lg \lg n)\)
loop iterations, with the run-time for a loop iteration depending on
what kind of arithmetic operations are used.</p>
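<p>To get a feel for the number of loop iterations, here is a variant sketch that uses the native <code>BigInt</code> type instead of the BigInteger library above and also reports how many times the loop body ran. It assumes a JavaScript runtime with <code>BigInt</code> support; the name <code>irootCounting</code> and the shape of its return value are made up for this illustration.</p>

```javascript
'use strict';

// Same algorithm as iroot above, but on native BigInts, returning both
// the root and the number of loop iterations performed.
function irootCounting(n, p) {
  if (n < 0n) {
    throw new Error('negative radicand');
  }
  if (!Number.isInteger(p) || p <= 0) {
    throw new Error('degree must be a positive integer');
  }
  if (n === 0n) {
    return { root: 0n, iterations: 0 };
  }
  // Bits(n) is the length of the binary representation of n.
  var b = n.toString(2).length;
  if (p >= b) {
    return { root: 1n, iterations: 0 };
  }
  var pBig = BigInt(p);
  // x = 2^ceil(Bits(n)/p)
  var x = 1n << BigInt(Math.ceil(b / p));
  var iterations = 0;
  while (true) {
    ++iterations;
    // BigInt division truncates, which is floor for non-negative values.
    var y = ((pBig - 1n) * x + n / x ** (pBig - 1n)) / pBig;
    if (y >= x) {
      return { root: x, iterations: iterations };
    }
    x = y;
  }
}

var result = irootCounting(64n, 3);
console.log(result.root.toString(), result.iterations); // 4 3
```

<p>For \(n = 64\) and \(p = 3\) the iterates are \(8, 5, 4\), and the third pass through the loop computes \(4\) again and returns.</p>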
</section>
<section>
<header>
<h2>2. Correctness</h2>
</header>
<p>Again we look at the iteration rule:
\[
x_{i+1} = \left\lfloor \frac{(p - 1) x_i + \left\lfloor \frac{n}{x_i^{p-1}}
\right\rfloor}{p} \right\rfloor
\]
Letting \(f(x)\) be the right-hand side, we can again use basic
properties of the floor function to remove the inner floor:
\[
f(x) = \left\lfloor \frac{1}{p} ((p-1) x + n/x^{p-1}) \right\rfloor
\]
Letting \(g(x)\) be its real-valued equivalent:
\[
g(x) = \frac{1}{p} ((p-1) x + n/x^{p-1})
\]
we can, again using basic properties of the floor function, show that
\(f(x) \le g(x)\), and for any integer \(m\), \(m \le f(x)\) if and
only if \(m \le g(x)\).</p>
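<p>The removal of the inner floor can be spot-checked by brute force. The sketch below, which assumes a runtime with native <code>BigInt</code> and uses made-up function names, compares the iteration rule as written against the version without the inner floor, computed exactly by putting everything over the common denominator \(p x^{p-1}\).</p>

```javascript
'use strict';

// floor(((p-1)x + floor(n/x^(p-1))) / p), as in the iteration rule.
function fWithInnerFloor(n, p, x) {
  return ((p - 1n) * x + n / x ** (p - 1n)) / p;
}

// floor(((p-1)x + n/x^(p-1)) / p), computed exactly as
// floor(((p-1)x^p + n) / (p x^(p-1))).
function fWithoutInnerFloor(n, p, x) {
  var d = x ** (p - 1n);
  return ((p - 1n) * x * d + n) / (p * d);
}

// Counts the triples (n, p, x) in the given ranges on which the two
// versions disagree.
function countMismatches(maxN, maxP, maxX) {
  var mismatches = 0;
  for (var n = 1n; n <= maxN; ++n) {
    for (var p = 2n; p <= maxP; ++p) {
      for (var x = 1n; x <= maxX; ++x) {
        if (fWithInnerFloor(n, p, x) !== fWithoutInnerFloor(n, p, x)) {
          ++mismatches;
        }
      }
    }
  }
  return mismatches;
}

console.log(countMismatches(200n, 5n, 20n)); // 0
```

<p>This is only a sanity check over a small range, not a proof; the general fact is that \(\lfloor (a + \lfloor t \rfloor) / d \rfloor = \lfloor (a + t) / d \rfloor\) for an integer \(a\), a positive integer \(d\), and real \(t\).</p>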
<p>Finally, let’s give a name to our desired output: let \(s =
\iroot(n, p) = \lfloor \sqrt[p]{n} \rfloor\).<sup><a href="#fn2" id="r2">[2]</a></sup></p>
<div class="p">Unsurprisingly, \(f(x)\) never underestimates:
<div class="theorem">(<span class="theorem-name">Lemma 1</span>.) For
\(x \gt 0\), \(f(x) \ge s\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> By the basic properties of
\(f(x)\) and \(g(x)\) above, it suffices to show that \(g(x) \ge
s\). \(g'(x) = (1 - 1/p) (1 - n/x^p)\) and \(g''(x) = (p - 1)
(n/x^{p+1})\). Therefore, \(g(x)\) is concave-up for \(x \gt 0\); in
particular, its single positive extremum at \(x = \sqrt[p]{n}\) is a
minimum. But \(g(\sqrt[p]{n}) = \sqrt[p]{n} \ge s\). ∎</p>
</div>
Also, our initial guess is always an overestimate:
<div class="theorem">(<span class="theorem-name">Lemma 2</span>.) \(x_0
\gt s\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> \(\Bits(n) =
\lfloor \lg n \rfloor + 1 \gt \lg n\). Therefore,
\[
\begin{aligned}
x_0 &= 2^{\lceil \Bits(n) / p \rceil} \\
&\ge 2^{\Bits(n) / p} \\
&\gt 2^{\lg n / p} \\
&= \sqrt[p]{n} \\
&\ge s\text{.} \; \blacksquare
\end{aligned}
\]
</p>
</div>
Therefore, we again have the invariant that \(x_i \ge s\), which
lets us prove partial correctness:
<div class="theorem">(<span class="theorem-name">Theorem 1</span>.) If
\(\NewtonRoot\) terminates, it
returns the value \(s\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> Assume it terminates. If it
terminates in step \(1\) or \(2\), then we are done. Otherwise, it can
only terminate in step \(4.2\) where it returns \(x_i\) such that
\(x_{i+1} = f(x_i) \ge x_i\). This implies \(g(x_i) = ((p-1)x_i +
n/x_i^{p-1}) / p \ge x_i\). Rearranging yields \(n \ge x_i^p\) and
combining with our invariant we get \(\sqrt[p]{n} \ge x_i \ge s\). But
\(s + 1 \gt \sqrt[p]{n}\), so that forces \(x_i\) to be \(s\), and
thus \(\NewtonRoot\) returns \(s\)
if it terminates. ∎</p>
</div>
</div>
<div class="p">Total correctness is also easy:
<div class="theorem">(<span class="theorem-name">Theorem 2</span>.)
\(\NewtonRoot\) terminates.</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> Assume it doesn’t
terminate. Then we have a strictly decreasing infinite sequence of
integers \(\{ x_0, x_1, \dotsc \}\). But this sequence is bounded below
by \(s\), so it cannot decrease indefinitely. This is a contradiction,
so \(\NewtonRoot\) must
terminate. ∎</p>
</div>
Note that, as with \(\mathrm{N{\small EWTON}\text{-}I{\small SQRT}}\),
the check in step \(4.2\) cannot be weakened to \(x_{i+1} = x_i\), as
doing so can cause the algorithm to oscillate. In fact, as \(p\)
grows, so does the number of values of \(n\) that exhibit this behavior,
and so does the number of possible oscillations. For example, \(n =
972\) with \(p = 3\) would yield the sequence \(\{ 16, 11, 10, 9, 10,
9, \dotsc \}\), and \(n = 80\) with \(p = 4\) would yield the sequence
\(\{ 4, 3, 2, 4, 3, 2, \dotsc \}\).</div>
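<p>The two sequences above are easy to reproduce. The following is a minimal sketch, assuming native <code>BigInt</code> instead of the BigInteger library used elsewhere on this site; the names <code>bits</code>, <code>step</code>, and <code>iterates</code> are made up for illustration, and the update rule is the floored version of \(g(x)\) from the proof of Theorem 1.</p>

```javascript
// Number of bits needed to store the positive BigInt n.
function bits(n) {
  return n.toString(2).length;
}

// One application of f(x) = floor(((p - 1) x + floor(n / x^(p-1))) / p).
// BigInt division truncates, which is the same as floor for positive values.
function step(n, p, x) {
  const pBig = BigInt(p);
  return ((pBig - 1n) * x + n / x ** (pBig - 1n)) / pBig;
}

// Returns the first `count` iterates x_0, x_1, ... with no termination check,
// starting from x_0 = 2^ceil(Bits(n) / p).
function iterates(n, p, count) {
  let x = 1n << BigInt(Math.ceil(bits(n) / p));
  const xs = [];
  for (let i = 0; i < count; i++) {
    xs.push(x);
    x = step(n, p, x);
  }
  return xs;
}

console.log(iterates(972n, 3, 6).join(', ')); // 16, 11, 10, 9, 10, 9
console.log(iterates(80n, 4, 6).join(', '));  // 4, 3, 2, 4, 3, 2
```

<p>With the check \(x_{i+1} \ge x_i\) in place, the first run would stop at \(9\) and the second at \(2\), which are the correct roots.</p>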
</section>
<section>
<header>
<h2>3. Run-time</h2>
</header>
<p>We will show that \(\NewtonRoot\)
takes \(Θ(p) + O(\lg \lg n)\) loop iterations. Then we will
analyze a single loop iteration and the arithmetic operations used to
get a total run-time bound.</p>
<div class="p">Analogous to the square root case, define \(\Err(x) =
x^p/n - 1\) and let \(ϵ_i = \Err(x_i)\). First,
let’s prove our lower bound for \(ϵ_i\), which translates
directly from the square root case:
<div class="theorem">(<span class="theorem-name">Lemma 3</span>.) \(x_i
\ge s + 1\) if and only if \(ϵ_i \ge 1/n\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> \(n \lt (s + 1)^p\), so \(n + 1
\le (s + 1)^p\), and therefore \((s + 1)^p/n - 1 \ge 1/n\). But the
expression on the left side is just \(\Err(s +
1)\). \(x_i \ge s + 1\) if and only if \(ϵ_i \ge
\Err(s + 1)\), so the result immediately
follows. ∎</p>
</div>
</div>
<p>Now for the next few lemmas we need to do some algebra and
calculus. Inverting \(\Err(x)\), we get that \(x_i =
\sqrt[p]{(ϵ_i + 1) \cdot n}\). Expressing \(g(x_i)\) in terms
of \(ϵ_i\) and \(q = 1 - 1/p\) we get
\[ g(x_i) = \sqrt[p]{n} \left( \frac{ϵ_i q +
1}{(ϵ_i + 1)^q} \right) \]
and
\[
\Err(g(x_i))
= \frac{(q ϵ_i + 1)^p}{(ϵ_i + 1)^{p-1}} - 1\text{.}
\]
Let
\[
f(ϵ) = \frac{(q ϵ + 1)^p}{(ϵ + 1)^{p-1}} - 1\text{.}
\]
Then computing derivatives,
\[
\begin{aligned}
f'(ϵ) &= q ϵ \frac{(q ϵ + 1)^{p-1}}{(ϵ + 1)^p}\text{,} \\
f''(ϵ) &= q \frac{(q ϵ + 1)^{p-2}}{(ϵ + 1)^{p + 1}}\text{, and} \\
f'''(ϵ) &= -q (2 + q (2 + 3 ϵ)) \frac{(q ϵ + 1)^{p-3}}{(ϵ + 1)^{p + 2}}\text{.}
\end{aligned}
\]
Note that \(f(0) = f'(0) = 0\), and \(f''(0) = q\). Also, for
\(ϵ \gt 0\), \(f'(ϵ) \gt 0\), \(f''(ϵ) \gt 0\), and
\(f'''(ϵ) \lt 0\).</p>
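<p>For reference, here is one way to fill in the algebra behind the expression for \(g(x_i)\), assuming \(g(x) = ((p-1)x + n/x^{p-1})/p\) as in the proof of Theorem 1 and using \(x_i = \sqrt[p]{n} (ϵ_i + 1)^{1/p}\), \(x_i^{p-1} = (n (ϵ_i + 1))^q\), and \(1/p = 1 - q\):
\[
\begin{aligned}
g(x_i) &= \frac{1}{p} \left( (p-1) \sqrt[p]{n} (ϵ_i + 1)^{1/p} + \frac{n}{n^q (ϵ_i + 1)^q} \right) \\
&= \frac{\sqrt[p]{n}}{(ϵ_i + 1)^q} \cdot \frac{(p-1)(ϵ_i + 1) + 1}{p} \\
&= \frac{\sqrt[p]{n}}{(ϵ_i + 1)^q} \left( q ϵ_i + q + \frac{1}{p} \right) \\
&= \sqrt[p]{n} \left( \frac{q ϵ_i + 1}{(ϵ_i + 1)^q} \right)\text{.}
\end{aligned}
\]
Raising this to the \(p\)th power, dividing by \(n\), and using \(pq = p - 1\) gives the stated formula for \(\Err(g(x_i))\).</p>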
<div class="p">Now we’re ready to show that the \(ϵ_i\) shrink
quadratically:
<div class="theorem">(<span class="theorem-name">Lemma 4</span>.)
\(f(ϵ) \lt (ϵ/\sqrt{2})^2\) for \(ϵ \gt 0\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> Taylor-expand \(f(ϵ)\)
around \(0\) with
the <a href="https://en.wikipedia.org/wiki/Taylor%27s_theorem#Explicit_formulae_for_the_remainder">Lagrange
remainder form</a> to get \[ f(ϵ) = f(0) + f'(0) ϵ +
\frac{f''(0)}{2} ϵ^2 + \frac{f'''(\xi)}{6} ϵ^3 \] for
some \(\xi\) such that \(0 \lt \xi \lt ϵ\). Plugging in
values, we see that \(f(ϵ) = \frac{1}{2} q ϵ^2 +
\frac{1}{6} f'''(\xi) ϵ^3\) with the last term being negative,
so \(f(ϵ) \lt \frac{1}{2} q ϵ^2 \lt \frac{1}{2}
ϵ^2\). ∎</p>
</div>
But this is only a useful upper bound when \(ϵ_i \le 1\). In
the square root case this was okay, since \(ϵ_1 \le 1\), but
that is not true for larger values of \(p\). In fact, in general, the
\(ϵ_i\) start off shrinking <em>linearly</em>:
<div class="theorem">(<span class="theorem-name">Lemma 5</span>.) For
\(ϵ \gt 1\), \(f(ϵ) \gt ϵ/8\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> Since \(f(0) = f'(0) = 0\), and
\(f''(ϵ) \gt 0\) for \(ϵ \ge 0\), \(f'(ϵ)\) and
\(f(ϵ)\) are increasing, and thus \(f(1) \gt 0\) and
\(f(ϵ)\) is a concave-up curve.</p>
<p>Then \((0, 0)\) and \((1, f(1))\) are two points on a concave-up
curve, and thus geometrically the line \(y = f(1) ϵ\) must lie
below \(y = f(ϵ)\) for \(ϵ \gt 1\), and thus
\(f(ϵ) \gt f(1) ϵ\) for \(ϵ \gt
1\). Algebraically, this also follows from the definition
of <a href="https://en.wikipedia.org/wiki/Convex_function">(strict)
convexity</a> (with \(x_1 = 0\), \(x_2 = ϵ\), and \(t = 1 -
1/ϵ\)).</p>
<p>But \(f(1) = (2 - 1/p)^p/2^{p-1} - 1 = 2 \left(1 -
\frac{1}{2p}\right)^p - 1\), which is always increasing as a function
of \(p\), as you can see by calculating its derivative. Therefore, its
minimum is at \(p = 2\), which is \(1/8\), and so \(f(ϵ) \gt
f(1) ϵ \ge ϵ/8\). ∎</p>
</div>
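<p>As a quick numerical sanity check of Lemmas 4 and 5 (not a substitute for the proofs), here is a small floating-point sketch. The function name <code>errStep</code> and the sample values are arbitrary choices for illustration.</p>

```javascript
// f(eps) = (q eps + 1)^p / (eps + 1)^(p - 1) - 1, with q = 1 - 1/p.
function errStep(eps, p) {
  const q = 1 - 1 / p;
  return Math.pow(q * eps + 1, p) / Math.pow(eps + 1, p - 1) - 1;
}

for (const p of [2, 3, 5, 10]) {
  for (const eps of [0.25, 0.5, 1.5, 4, 100]) {
    const next = errStep(eps, p);
    // Lemma 4: f(eps) < eps^2 / 2 for eps > 0.
    const quadraticBoundHolds = next < (eps * eps) / 2;
    // Lemma 5: f(eps) > eps / 8, claimed only for eps > 1.
    const linearBoundHolds = eps <= 1 || next > eps / 8;
    console.log(`p=${p} eps=${eps} f=${next.toFixed(6)}`,
                quadraticBoundHolds, linearBoundHolds);
  }
}
```

<p>For \(ϵ\) well above \(1\) the quadratic bound is far from tight, which is why the linear bound is the relevant one there.</p>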
Finally, let’s bound our initial values:
<div class="theorem">(<span class="theorem-name">Lemma 6</span>.) \(x_0
\le 2s\) and \(ϵ_0 \le 2^p - 1\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span>
This is a straightforward generalization of the equivalent lemma
from the square root case. Let’s start with \(x_0\):
\[
\begin{aligned}
x_0 &= 2^{\lceil \Bits(n) / p \rceil} \\
&= 2^{\lfloor (\lfloor \lg n \rfloor + 1 + p - 1)/p \rfloor} \\
&= 2^{\lfloor \lg n / p \rfloor + 1} \\
&= 2 \cdot 2^{\lfloor \lg n / p \rfloor}\text{.}
\end{aligned}
\]
Then \(x_0/2 = 2^{\lfloor \lg n / p \rfloor} \le 2^{\lg n / p} =
\sqrt[p]{n}\). Since \(x_0/2\) is an integer, \(x_0/2 \le
\sqrt[p]{n}\) if and only if \(x_0/2 \le \lfloor \sqrt[p]{n} \rfloor =
s\). Therefore, \(x_0 \le 2s\).</p>
<p>As for \(ϵ_0\):
\[
\begin{aligned}
ϵ_0 &= \Err(x_0) \\
&\le \Err(2s) \\
&= (2s)^p/n - 1 \\
&= 2^p s^p/n - 1\text{.}
\end{aligned}
\]
Since \(s^p \le n\), \(2^p s^p/n \le 2^p\) and thus \(ϵ_0 \le
2^p - 1\). ∎</p>
</div>
</div>
<div class="p">Now we’re ready to show our main result, which involves
calculating how long the \(ϵ_i\) shrink linearly:
<div class="theorem">(<span class="theorem-name">Theorem 3</span>.)
\(\NewtonRoot\) performs \(Θ(p)
+ O(\lg \lg n)\) loop iterations.</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> Assume that \(ϵ_i \gt 1\)
for \(i \le j\), \(ϵ_{j+1} \le 1\), and \(j+k\) is the number
of loop iterations performed when running the algorithm for \(n\) and
\(p\) (i.e., \(x_{j+k} \ge x_{j+k-1}\)). Using Lemma 5,
\[
\left( \frac{1}{8} \right)^{j+1} ϵ_0 \lt ϵ_{j+1} \le 1\text{,}
\]
which implies
\[
j \gt \frac{\lg ϵ_0}{3} - 1\text{.}
\]
</p>
<p>Similarly,
\[
\left( \frac{1}{8} \right)^j ϵ_0 \ge ϵ_j \gt 1\text{,}
\]
which implies
\[
j \lt \frac{\lg ϵ_0}{3} \text{.}
\]
Therefore, \(j = Θ(\lg ϵ_0)\), which is \(Θ(p)\)
by Lemma 6.</p>
<p>Now assume \(k \ge 5\). Then \(x_i \ge s + 1\) for \(i \lt j + k -
1\). Since \(ϵ_{j+1} \le 1\) by assumption, \(ϵ_{j+3}
\le 1/2\) and \(ϵ_i \le (ϵ_{j+3})^{2^{i-j-3}}\) for \(j
+ 3 \le i \lt j + k - 1\) by Lemma 4, then \(ϵ_{j+k-2} \le
2^{-2^{k-5}}\). But \(1/n \le ϵ_{j+k-2}\) by Lemma 3, so \(1/n
\le 2^{-2^{k-5}}\). Taking logs to bring down the \(k\) yields \(k - 5
\le \lg \lg n\). Then \(k \le \lg \lg n + 5\), and thus \(k = O(\lg
\lg n)\).</p>
<p>Therefore, the total number of loop iterations is \(Θ(p) +
O(\lg \lg n)\). ∎</p>
</div>
</div>
<p>Note that \(p \le \lg n\), so we can just say that
\(\NewtonRoot\) performs
\(Θ(\lg n)\) operations. But that obscures rather than
simplifies. Note that the proof above is very similar to the proof of
the worse run-time of \(\mathrm{N{\small EWTON}\text{-}I{\small
SQRT}'}\) where the initial guess varies. In this case, the error in
our initial guess is magnified, since we raise it to the \((p-1)\)th
power, and so that manifests as the \(Θ(p)\) term.</p>
<p>Furthermore, unlike the square root case, the number of arithmetic
operations in a loop iteration isn’t constant. In particular,
the sub-step to compute \(x_i^{p-1}\) takes a number of arithmetic
operations dependent on \(p - 1\). Using repeated squarings, this
computation would take \(Θ(\lg p)\) squarings and
\(O(\lg p)\) additional multiplications.</p>
<p>If the cost of an arithmetic operation is constant, e.g.,
we’re working with fixed-size integers, then the run-time bound
is the iteration count above multiplied by \(Θ(\lg p)\).</p>
<p>Otherwise, if the cost of an arithmetic operation depends on the
length of its arguments, then we only have to multiply by a constant
factor to get the run-time bounds in terms of arithmetic
operations. If the cost of multiplying two numbers \(\le x\) is \(M(x)
= O(\lg^k x)\), then the cost of computing \(x^p\) is \(O((p \lg
x)^k)\). But \(x\) is \(Θ(n^{1/p})\), so the cost of computing
\(x^p\) is \(O(\lg^k n)\), which is on the order of the cost of
multiplying two numbers \(\le n\). Furthermore, note that we divide
the result into \(n\), so we can stop once the computation of
\(x_i^{p-1}\) exceeds \(n\). So in that case, we can treat a loop
iteration as if it were performing a constant number of arithmetic
operations on numbers of order \(n\), and so, like in the square root
case, we pick up a factor of \(D(n)\), where \(D(n)\) is the run-time
of dividing \(n\) by some number \(\le n\).</p>
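<p>The early exit mentioned above can be sketched as follows. This is only an illustration using native <code>BigInt</code>; the names <code>powCapped</code> and <code>quotientByPower</code> are hypothetical and not taken from the implementations linked in the footnotes.</p>

```javascript
// Returns x^e if x^e <= limit, and null as soon as the result is known to
// exceed limit. Uses repeated squaring; assumes x >= 1 and e >= 0 (BigInts).
function powCapped(x, e, limit) {
  let result = 1n;
  let base = x;
  while (e > 0n) {
    if (e & 1n) {
      result *= base;
      if (result > limit) return null;
    }
    e >>= 1n;
    if (e > 0n) {
      // Some later bit of e is set, so this (or a larger) power of x will
      // still be multiplied into the result.
      base *= base;
      if (base > limit) return null;
    }
  }
  return result;
}

// floor(n / x^(p-1)), which is 0 whenever x^(p-1) exceeds n.
function quotientByPower(n, x, p) {
  const power = powCapped(x, p - 1n, n);
  return power === null ? 0n : n / power;
}

console.log(quotientByPower(972n, 16n, 3n).toString());
```

<p>Capping the intermediate values at \(n\) keeps every operand at most about the size of \(n^2\), which is the point of the argument above.</p>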
</section>
<hr />
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] Go and JS implementations are available
on <a href="https://github.com/akalin/iroot">my GitHub</a>.
<a href="#r1">↩</a></p>
<p id="fn2">[2] Here, and in most of the article, we’ll
implicitly assume that \(n \gt 0\) and \(p \gt 1\).
<a href="#r2">↩</a></p>
</section>
https://www.akalin.com/sampling-visible-sphere
Sampling the Visible Sphere
2015-08-26T00:00:00-07:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<p><em>(Note: this article is a summary of
<a href="http://ompf2.com/viewtopic.php?f=3&t=1914">this thread on
ompf2</a>.)
</em></p>
<p>The usual method for sampling a sphere from a point outside the
sphere is to calculate the angle of the cone of the visible portion
and uniformly sample within that cone, as described in
<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.40.6561">Shirley/Wang</a>.</p>
<p>However, one detail that is glossed over is that you still need to map
from the sampled direction to the point on the sphere. The usual
method is to simply generate a ray from the point and the sampled
direction and intersect it with the sphere. However, this intersection
test may fail due to floating point inaccuracies (e.g., if the sphere
is small and the distance from the point is large).</p>
<p>I've found a couple of existing ways to deal with this. As
described in the pbrt book, pbrt simply assumes that the ray just
grazes the sphere if the intersection fails, and then projects the
center of the sphere onto the ray
(<a href="https://github.com/mmp/pbrt-v2/blob/master/src/shapes/sphere.cpp#L249">code
here</a>). mitsuba moves the origin of the ray closer to the sphere
(in fact, from within it) before doing the test (falling back to
projecting the center onto the ray if that still fails)
(<a href="https://www.mitsuba-renderer.org/repos/mitsuba/files/aeb7f95b37111187cc2ddf21cfffeff118bc52d2/src/shapes/sphere.cpp#L287">code
here</a>).</p>
<p>However, this seems inelegant. I've come up with a better way,
which involves converting the sampled cone angle \(θ\) (as
measured from the segment connecting the point to the sphere center)
into an angle \(α\) from the inside of the sphere, and then
simply using \(α\) and the sampled polar angle \(\varphi\) to get the point on
the sphere. This turns out to be simple, and in my unscientific tests
a bit faster.</p>
<p>Here's a crude diagram showing the geometry:</p>
<img src="/sampling-visible-sphere-files/diagram.png" alt="Diagram for derivation of cos α" />
<p>You can see that
\[
L = d \cos θ - \sqrt{r^2 - d^2 \sin^2 θ}
\]
and also by the law of cosines,
\[
L^2 = d^2 + r^2 - 2 d r \cos α\text{.}
\]
We're actually more interested in \(\cos α\), so solving for that
we get
\[
\cos α = \frac{d}{r} \sin^2 θ + \cos θ \sqrt{1 - \frac{d^2}{r^2} \sin^2 θ}\text{.}
\]
An alternate form, which may be easier to analyze, recalling that
\(\sin θ_{\max} = r/d\), is
\[
\cos α = \frac{\sin^2 θ}{\sin θ_{\max}} + \cos θ \sqrt{1 - \frac{\sin^2 θ}{\sin^2 θ_{\max}}}\text{.}
\]
</p>
<div class="p">So sampling pseudocode would look like:
<pre class="code-container"><code class="language-c++">(cos θ, φ) = uniformSampleCone(rng, cos θmax)
D = 1 - d² sin² θ / r²
if D ≤ 0 {
cos α = sin θmax
} else {
cos α = (d/r) sin² θ + cos θ √D
}
ω = sphericalDirection(cos α, φ)
pSurface = C + r ω</code></pre>
(I haven't done any analysis yet on the most robust way, in the
floating-point sense, to do the calculations above.)</div>
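<p>For concreteness, the \(\cos α\) step of the pseudocode might look like this in JavaScript. This is only a sketch: <code>cosAlphaFromCosTheta</code> is a made-up name, the cone sampling and the construction of the final direction are left out, and it assumes \(d \gt r \gt 0\).</p>

```javascript
// Maps a sampled cone angle θ (given as cos θ) to cos α, the angle measured
// at the sphere center.
//   d: distance from the point P to the sphere center
//   r: sphere radius
function cosAlphaFromCosTheta(cosTheta, d, r) {
  const sin2Theta = Math.max(0, 1 - cosTheta * cosTheta);
  const ratio = d / r;
  const D = 1 - ratio * ratio * sin2Theta;
  if (D <= 0) {
    // The ray grazes (or numerically misses) the sphere:
    // clamp to sin θmax = r / d.
    return r / d;
  }
  return ratio * sin2Theta + cosTheta * Math.sqrt(D);
}

console.log(cosAlphaFromCosTheta(1, 10, 1));     // straight at the center
console.log(cosAlphaFromCosTheta(0.999, 10, 1)); // somewhere inside the cone
```

<p>Sampling straight at the center gives \(\cos α = 1\), and sampling at the edge of the cone gives \(\cos α = r/d\), matching the clamped branch.</p>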
<p>There's no backfacing since we clamp \(\cos α\) to \(\sin
θ_{\max}\), which is analogous to the case when the ray from
\(P\) misses the sphere.</p>
<p>Note that one cannot just compute \(α_{\max}\) and uniformly
sample the cone from inside the sphere, as that doesn't produce the
same distribution over the visible region as sampling the cone from
outside the sphere. To preserve correctness, you would have to use the
(uniform) PDF over the surface area of the visible portion of the
sphere, but you would have to then convert that to a PDF with respect
to projected solid angle from \(P\), which is worse than just doing
the sampling with respect to projected solid angle from \(P\) as
described above.</p>
<hr />
https://www.akalin.com/computing-isqrt
Computing the Integer Square Root
2014-12-09T00:00:00-08:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<script>
KaTeXMacros = {
"\\isqrt": "\\operatorname{isqrt}",
"\\Bits": "\\operatorname{Bits}",
"\\Err": "\\operatorname{Err}",
"\\NewtonSqrt": "\\mathrm{N{\\small EWTON}\\text{-}I{\\small SQRT}}",
};
</script>
<script src="https://cdn.rawgit.com/akalin/jsbn/v1.4/jsbn.js"></script>
<script src="https://cdn.rawgit.com/akalin/jsbn/v1.4/jsbn2.js"></script>
<section>
<header>
<h2>1. The algorithm</h2>
</header>
<p>Today I’m going to talk about a fast algorithm to compute
the <em><a href="https://en.wikipedia.org/wiki/Integer_square_root">integer
square root</a></em> of a non-negative integer \(n\),
\(\isqrt(n) = \lfloor \sqrt{n} \rfloor\), or in words,
the greatest integer whose square is less than or equal to \(n\).<sup><a href="#fn1" id="r1">[1]</a></sup> Most
sources that describe the algorithm take it for granted that it is
correct and fast. This is far from obvious! So I will prove both
correctness and speed below.</p>
<p>One simple fact is that \(\isqrt(n) \le n/2\) (for \(n \ne 1\)), so a
straightforward algorithm is just to test every non-negative integer
up to \(n/2\). This takes \(O(n)\) arithmetic operations, which is bad
since it’s exponential in the <em>size</em> of the input. That
is, letting \(\Bits(n)\) be the number of bits required
to store \(n\) and letting \(\lg n\) be the base-\(2\) logarithm of
\(n\), \(\Bits(n) = O(\lg n)\), and thus this algorithm
takes \(O(2^{\Bits(n)})\) arithmetic operations.</p>
<p>We can do better by doing binary search; start with the range \([0,
n/2]\) and adjust it based on comparing the square of an integer in
the middle of the range to \(n\). This takes \(O(\lg n) =
O(\Bits(n))\) arithmetic operations.</p>
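<p>For comparison, here is a minimal sketch of the binary search approach using native <code>BigInt</code>. The name <code>isqrtBinarySearch</code> is made up for illustration, and for simplicity it searches the half-open range \([0, n + 1)\) rather than \([0, n/2]\), which also covers \(n = 1\).</p>

```javascript
// Binary search for floor(sqrt(n)), where n is a non-negative BigInt.
function isqrtBinarySearch(n) {
  if (n < 0n) {
    throw new Error('negative radicand');
  }
  // Invariant: lo^2 <= n < hi^2.
  let lo = 0n;
  let hi = n + 1n;
  while (hi - lo > 1n) {
    const mid = (lo + hi) / 2n;
    if (mid * mid <= n) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return lo;
}

console.log(isqrtBinarySearch(99n).toString());
```

<p>Each pass halves the range, so the loop runs about \(\lg n\) times, each time doing one multiplication and one comparison.</p>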
<div class="p">However, the algorithm below is even faster:<sup><a href="#fn2" id="r2">[2]</a></sup>
<ol>
<li>If \(n = 0\), return \(0\).</li>
<li>Otherwise, set \(i\) to \(0\) and set \(x_0\) to \(2^{\lceil
\Bits(n) / 2\rceil}\).</li>
<li>Repeat:
<ol>
<li>Set \(x_{i+1}\) to \(\lfloor (x_i + \lfloor n/x_i \rfloor) /
2 \rfloor\).</li>
<li>If \(x_{i+1} \ge x_i\), return \(x_i\). Otherwise, increment
\(i\).</li>
</ol>
</li>
</ol>
</div>
<div class="p">Call this algorithm \(\NewtonSqrt\), since it’s based
on <a href="https://en.wikipedia.org/wiki/Newton%27s_method">Newton’s
method</a>. It’s not obvious, but this algorithm returns
\(\isqrt(n)\) using only \(O(\lg \lg n) =
O(\lg(\Bits(n)))\) arithmetic operations, as we will
prove below. But first, here’s an implementation of the
algorithm in Javascript:<sup><a href="#fn3" id="r3">[3]</a></sup>
<script>
// isqrt returns the greatest number x such that x^2 <= n. The type of
// n must behave like BigInteger (e.g.,
// https://github.com/akalin/jsbn ), and n must be non-negative.
//
//
// Example (open up the JS console on this page and type):
//
// isqrt(new BigInteger("64")).toString()
function isqrt(n) {
var s = n.signum();
if (s < 0) {
throw new Error('negative radicand');
}
if (s == 0) {
return n;
}
// x = 2^ceil(Bits(n)/2)
var x = n.constructor.ONE.shiftLeft(Math.ceil(n.bitLength()/2));
while (true) {
// y = floor((x + floor(n/x))/2)
var y = x.add(n.divide(x)).shiftRight(1);
if (y.compareTo(x) >= 0) {
return x;
}
x = y;
}
}
</script>
<pre class="code-container"><code class="language-javascript">// isqrt returns the greatest number x such that x^2 <= n. The type of
// n must behave like BigInteger (e.g.,
// https://github.com/akalin/jsbn ), and n must be non-negative.
//
//
// Example (open up the JS console on this page and type):
//
// isqrt(new BigInteger("64")).toString()
function isqrt(n) {
var s = n.signum();
if (s < 0) {
throw new Error('negative radicand');
}
if (s == 0) {
return n;
}
// x = 2^ceil(Bits(n)/2)
var x = n.constructor.ONE.shiftLeft(Math.ceil(n.bitLength()/2));
while (true) {
// y = floor((x + floor(n/x))/2)
var y = x.add(n.divide(x)).shiftRight(1);
if (y.compareTo(x) >= 0) {
return x;
}
x = y;
}
}</code></pre>
</div>
</section>
<section>
<header>
<h2>2. Correctness</h2>
</header>
<p>The core of the algorithm is the iteration rule:
\[
x_{i+1} = \left\lfloor \frac{x_i + \lfloor \frac{n}{x_i}
\rfloor}{2} \right\rfloor
\]
where
the <a href="https://en.wikipedia.org/wiki/Floor_and_ceiling_functions">floor
functions</a> are there only because we’re using integer
division. Define an integer-valued function \(f(x)\) for the right
side. Using basic properties of the floor function, you can show that
you can remove the inner floor:
\[
f(x) = \left\lfloor \frac{1}{2} (x + n/x) \right\rfloor
\]
which makes it a bit easier to analyze. Also, the properties of
\(f(x)\) are closely related to its equivalent real-valued function:
\[
g(x) = \frac{1}{2} (x + n/x)\text{.}
\]</p>
<p>For starters, again using basic properties of the floor function,
you can show that \(f(x) \le g(x)\), and for any integer \(m\), \(m
\le f(x)\) if and only if \(m \le g(x)\).</p>
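<p>The claim that the inner floor can be removed is easy to check by brute force. The sketch below uses native <code>BigInt</code> and hypothetical function names, and compares the two forms exactly by rewriting \(\frac{1}{2}(x + n/x)\) as \((x^2 + n)/(2x)\).</p>

```javascript
// floor((x + floor(n / x)) / 2): the iteration rule as computed with
// integer division.
function fWithInnerFloor(n, x) {
  return (x + n / x) / 2n;
}

// floor((x + n/x) / 2), computed exactly as floor((x^2 + n) / (2x)).
function fWithoutInnerFloor(n, x) {
  return (x * x + n) / (2n * x);
}

// Counts the pairs 1 <= x <= n <= maxN where the two forms disagree.
function countMismatches(maxN) {
  let mismatches = 0;
  for (let n = 1n; n <= maxN; n++) {
    for (let x = 1n; x <= n; x++) {
      if (fWithInnerFloor(n, x) !== fWithoutInnerFloor(n, x)) {
        mismatches++;
      }
    }
  }
  return mismatches;
}

console.log(countMismatches(200n));
```

<p>This is only a spot check over a small range; the general statement follows from \(x + \lfloor y \rfloor = \lfloor x + y \rfloor\) for integer \(x\) and \(\lfloor \lfloor z \rfloor / 2 \rfloor = \lfloor z / 2 \rfloor\).</p>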
<p>Finally, let’s give a name to our desired output: let \(s =
\isqrt(n) = \lfloor \sqrt{n} \rfloor\).<sup><a href="#fn4" id="r4">[4]</a></sup></p>
<div class="p">Intuitively, \(f(x)\) and \(g(x)\) “average out”
however far away their input \(x\) is from \(\sqrt{n}\). Conveniently,
this “average” is never an underestimate:
<div class="theorem">(<span class="theorem-name">Lemma 1</span>.) For
\(x \gt 0\), \(f(x) \ge s\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> By the basic properties of
\(f(x)\) and \(g(x)\) above, it suffices to show that \(g(x) \ge
s\). \(g'(x) = (1 - n/x^2)/2\) and \(g''(x) = n/x^3\). Therefore,
\(g(x)\) is concave-up for \(x \gt 0\); in particular, its single
positive extremum at \(x = \sqrt{n}\) is a minimum. But \(g(\sqrt{n})
= \sqrt{n} \ge s\). ∎</p>
</div>
(You can also prove Lemma 1 without calculus; show that \(g(x) \ge
s\) if and only if \(x^2 - 2sx + n \ge 0\), which is true when \(s^2
\le n\), which is true by definition.)</div>
<div class="p">Furthermore, our initial estimate is always an overestimate:
<div class="theorem">(<span class="theorem-name">Lemma 2</span>.) \(x_0
\gt s\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> \(\Bits(n) =
\lfloor \lg n \rfloor + 1 \gt \lg n\). Therefore,
\[
\begin{aligned}
x_0 &= 2^{\lceil \Bits(n) / 2 \rceil} \\
&\ge 2^{\Bits(n) / 2} \\
&\gt 2^{\lg n / 2} \\
&= \sqrt{n} \\
&\ge s\text{.} \; \blacksquare
\end{aligned}
\]
</p>
</div>
</div>
<p>(Note that any number greater than \(s\), say \(n\) or \(\lceil n/2
\rceil\), can be chosen for our initial guess without affecting
correctness. However, the expression above is necessary to guarantee
performance. Another possibility is \(2^{\lceil \lceil \lg n \rceil /
2 \rceil}\), which has the advantage that if \(n\) is an even power of
\(2\), then \(x_0\) is immediately set to \(\sqrt{n}\). However, this
is usually not worth the cost of checking that \(n\) is a power of
\(2\), as is required to compute \(\lceil \lg n \rceil\).)</p>
<div class="p">An easy consequence of Lemmas 1 and 2 is that the invariant \(x_i
\ge s\) holds. That lets us prove partial correctness of
\(\NewtonSqrt\):
<div class="theorem">(<span class="theorem-name">Theorem 1</span>.) If
\(\NewtonSqrt\) terminates, it
returns the value \(s\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> Assume it terminates. If it
terminates in step \(1\), then we are done. Otherwise, it can only
terminate in step \(3.2\) where it returns \(x_i\) such that \(x_{i+1}
= f(x_i) \ge x_i\). This implies that \(g(x_i) = (x_i + n/x_i) / 2 \ge
x_i\). Rearranging yields \(n \ge x_i^2\) and combining with our
invariant we get \(\sqrt{n} \ge x_i \ge s\). But \(s + 1 \gt
\sqrt{n}\), so that forces \(x_i\) to be \(s\), and thus
\(\NewtonSqrt\) returns \(s\) if it
terminates. ∎</p>
</div>
For total correctness we also need to show that
\(\NewtonSqrt\) terminates. But this
is easy:
<div class="theorem">(<span class="theorem-name">Theorem 2</span>.)
\(\NewtonSqrt\) terminates.</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> Assume it doesn’t
terminate. Then we have a strictly decreasing infinite sequence of
integers \(\{ x_0, x_1, \dotsc \}\). But this sequence is bounded below
by \(s\), so it cannot decrease indefinitely. This is a contradiction,
so \(\NewtonSqrt\) must
terminate. ∎</p>
</div>
</div>
<p>We are done proving correctness, but you might wonder if the check
\(x_{i+1} \ge x_i\) in step \(3.2\) is necessary. That is, can it be
weakened to the check \(x_{i+1} = x_i\)? The answer is
“no”; to see that, let \(k = n - s^2\). Since \(n \lt
(s+1)^2\), \(k \lt 2s + 1\). On the other hand, consider the
inequality \(f(x_i) \gt x_i\). Since that would cause the algorithm to
terminate and return \(x_i\), that implies that \(x_i =
s\). Therefore, that inequality is equivalent to \(f(s) \gt s\), which
is equivalent to \(f(s) \ge s + 1\), which is equivalent to \(g(s) =
(s + n/s) / 2 \ge s + 1\). Rearranging yields \(n \ge s^2 +
2s\). Substituting in \(n = s^2 + k\), we get \(s^2 + k \ge s^2 +
2s\), which is equivalent to \(k \ge 2s\). But since \(k \lt 2s + 1\),
that forces \(k\) to equal \(2s\). That is the maximum value \(k\) can
be, so therefore \(n\) must be one less than a perfect square. Indeed,
for such numbers, weakening the check would cause the algorithm to
oscillate between \(s\) and \(s + 1\). For example, \(n = 99\) would
yield the sequence \(\{ 16, 11, 10, 9, 10, 9, \dotsc \}\).</p>
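<p>To see the oscillation concretely, here is a sketch of the algorithm with the weakened check, using native <code>BigInt</code> and an iteration cap so that it always returns. The function name and the cap are assumptions made for this illustration.</p>

```javascript
// Like the algorithm above, but stops only when x_{i+1} = x_i. Returns null
// if that never happens within maxIterations. Assumes n is a positive BigInt.
function isqrtWithEqualityCheck(n, maxIterations) {
  // x = 2^ceil(Bits(n)/2)
  let x = 1n << BigInt(Math.ceil(n.toString(2).length / 2));
  for (let i = 0; i < maxIterations; i++) {
    // y = floor((x + floor(n/x))/2)
    const y = (x + n / x) / 2n;
    if (y === x) {
      return x;
    }
    x = y;
  }
  return null;
}

console.log(isqrtWithEqualityCheck(100n, 1000)); // settles on a fixed point
console.log(isqrtWithEqualityCheck(99n, 1000));  // bounces between 9 and 10
```

<p>The same happens for other values one less than a perfect square, such as \(n = 3\), where the iterates alternate between \(2\) and \(1\).</p>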
</section>
<section>
<header>
<h2>3. Run-time</h2>
</header>
<p>We will show that \(\NewtonSqrt\)
takes \(O(\lg \lg n)\) arithmetic operations. Since each loop
iteration does only a fixed number of arithmetic operations (with the
division of \(n\) by \(x\) being the most expensive), it suffices to
show that our algorithm performs \(O(\lg \lg n)\) loop iterations.</p>
<p>It is well known that Newton’s
method <a href="https://en.wikipedia.org/wiki/Newton%27s_method#Proof_of_quadratic_convergence_for_Newton.27s_iterative_method">converges
quadratically</a> sufficiently close to a simple root. We can’t
actually use this result directly, since it’s not clear that the
convergence properties of Newton’s method are preserved when
using integer operations, but we can do something similar.</p>
<p>Define \(\Err(x) = x^2/n - 1\) and let \(ϵ_i =
\Err(x_i)\). Intuitively, \(\Err(x)\) is a
conveniently-scaled measure of the error of \(x\): it is less than
\(1\) for most of the values we care about and it is bounded below for
integers greater than our target \(s\). Also, we will show that the
\(ϵ_i\) shrink quadratically. These facts will then let us show
our bound for the iteration count.</p>
<div class="p">First, let’s prove our lower bound for \(ϵ_i\):
<div class="theorem">(<span class="theorem-name">Lemma 3</span>.) \(x_i
\ge s + 1\) if and only if \(ϵ_i \ge 1/n\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> \(n \lt (s + 1)^2\), so \(n + 1
\le (s + 1)^2\), and therefore \((s + 1)^2/n - 1 \ge 1/n\). But the
expression on the left side is just \(\Err(s +
1)\). \(x_i \ge s + 1\) if and only if \(ϵ_i \ge
\Err(s + 1)\), so the result immediately
follows. ∎</p>
</div>
Then we can use that to show that the \(ϵ_i\) shrink
quadratically:
<div class="theorem">(<span class="theorem-name">Lemma 4</span>.) If
\(x_i \ge s + 1\), then \(ϵ_{i+1} \lt (ϵ_i/2)^2\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> \(ϵ_{i+1}\) is just
\(\Err(f(x_i)) \le \Err(g(x_i))\), so it
suffices to show that \(\Err(g(x_i)) \lt
(ϵ_i/2)^2\). Inverting \(\Err(x)\), we get that
\(x_i = \sqrt{(ϵ_i + 1) \cdot n}\). Expressing \(g(x_i)\) in
terms of \(ϵ_i\) we get
\[ g(x_i) = \frac{\sqrt{n}}{2} \left( \frac{ϵ_i +
2}{\sqrt{ϵ_i + 1}} \right) \]
and
\[
\Err(g(x_i)) = \frac{(ϵ_i/2)^2}{ϵ_i+1}\text{.}
\]
Therefore, it suffices to show that the denominator is greater than
\(1\). But \(x_i \ge s + 1\) implies \(ϵ_i \gt 0\) by Lemma 3,
so that follows immediately and the result is proved. ∎</p>
</div>
Then let’s bound our initial values:
<div class="theorem">(<span class="theorem-name">Lemma 5</span>.) \(x_0
\le 2s\), \(ϵ_0 \le 3\), and \(ϵ_1 \le 1\).</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> Let’s start with \(x_0\):
\[
\begin{aligned}
x_0 &= 2^{\lceil \Bits(n) / 2 \rceil} \\
&= 2^{\lfloor (\lfloor \lg n \rfloor + 1 + 1)/2 \rfloor} \\
&= 2^{\lfloor \lg n / 2 \rfloor + 1} \\
&= 2 \cdot 2^{\lfloor \lg n / 2 \rfloor}\text{.}
\end{aligned}
\]
Then \(x_0/2 = 2^{\lfloor \lg n / 2 \rfloor} \le 2^{\lg n / 2} =
\sqrt{n}\). Since \(x_0/2\) is an integer, \(x_0/2 \le \sqrt{n}\) if
and only if \(x_0/2 \le \lfloor \sqrt{n} \rfloor = s\). Therefore,
\(x_0 \le 2s\).</p>
<p>As for \(ϵ_0\):
\[
\begin{aligned}
ϵ_0 &= \Err(x_0) \\
&\le \Err(2s) \\
&= (2s)^2/n - 1 \\
&= 4s^2/n - 1\text{.}
\end{aligned}
\]
Since \(s^2 \le n\), \(4s^2/n \le 4\) and thus \(ϵ_0 \le 3\).</p>
<p>Finally, \(ϵ_1\) is just
\(\Err(f(x_0))\). Using calculations from Lemma 4,
\[
\begin{aligned}
ϵ_1 &\le \Err(g(x_0)) \\
&= (ϵ_0/2)^2/(ϵ_0 + 1) \\
&\le (3/2)^2/(3 + 1) \\
&= 9/16\text{.}
\end{aligned}
\]
Therefore, \(ϵ_1 \le 1\). ∎</p>
</div>
</div>
<div class="p">Finally, we can show our main result:
<div class="theorem">(<span class="theorem-name">Theorem 3</span>.)
\(\NewtonSqrt\) performs \(O(\lg \lg
n)\) loop iterations.</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> Let \(k\) be the number of loop
iterations performed when running the algorithm for \(n\) (i.e., \(x_k
\ge x_{k-1}\)) and assume \(k \ge 4\). Then \(x_i \ge s + 1\) for \(i
\lt k - 1\). Since \(ϵ_1 \le 1\) by Lemma 5, \(ϵ_2 \le
1/2\) and \(ϵ_i \le (ϵ_2)^{2^{i-2}}\) for \(2 \le i \lt
k - 1\) by Lemma 4, then \(ϵ_{k-2} \le 2^{-2^{k-4}}\). But
\(1/n \le ϵ_{k-2}\) by Lemma 3, so \(1/n \le
2^{-2^{k-4}}\). Taking logs to bring down the \(k\) yields \(k - 4 \le
\lg \lg n\). Then \(k \le \lg \lg n + 4\), and thus \(k = O(\lg \lg
n)\). ∎</p>
</div>
Note that in general, an arithmetic operation is not constant-time,
and in fact has run-time \(\Omega(\lg n)\). Since the most expensive
arithmetic operation we do is division, we can say that
\(\NewtonSqrt\) has run-time that is
both \(\Omega(\lg n)\) and \(O(D(n) \cdot \lg \lg n)\), where \(D(n)\)
is the run-time of dividing \(n\) by some number \(\le n\).<sup><a href="#fn5" id="r5">[5]</a></sup></div>
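<p>The bound in Theorem 3 can also be observed empirically. The following sketch instruments the algorithm with a loop counter, using native <code>BigInt</code>; the function name, the returned fields, and the sample inputs are arbitrary choices for illustration. It compares the count against \(\lg(\Bits(n)) + 4\), which is slightly larger than \(\lg \lg n + 4\).</p>

```javascript
// Runs the algorithm on a positive BigInt n and reports how many loop
// iterations it took.
function isqrtCountingIterations(n) {
  const bitLength = n.toString(2).length;
  let x = 1n << BigInt(Math.ceil(bitLength / 2));
  let iterations = 0;
  while (true) {
    iterations++;
    const y = (x + n / x) / 2n;
    if (y >= x) {
      return { root: x, iterations, bitLength };
    }
    x = y;
  }
}

for (const e of [10n, 100n, 1000n]) {
  const n = 10n ** e + 7n;
  const { iterations, bitLength } = isqrtCountingIterations(n);
  const bound = Math.log2(bitLength) + 4;
  console.log(`10^${e} + 7: ${iterations} iterations, bound ${bound.toFixed(2)}`);
}
```

<p>Going from a 10-digit to a 1000-digit input adds only a handful of iterations, as the doubly-logarithmic bound suggests.</p>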
</section>
<section>
<header>
<h2>4. The Initial Guess</h2>
</header>
<p>It’s also useful to show that if the initial guess \(x_0\) is
bad, then the run-time degrades to \(Θ(\lg n)\). We’ll do
this by defining the function \(\mathrm{N{\small EWTON}\text{-}I{\small
SQRT}'}\) to be like \(\NewtonSqrt\)
except that it takes a function \(\mathrm{I{\small
NITIAL}\text{-}G{\small UESS}}\) that is called with \(n\) and assigned to
\(x_0\) in step 2. Then, we can treat \(ϵ_0\) as a function of
\(n\) and analyze how long \(ϵ_i\) stays above \(1\) to bound the
number of loop iterations in terms of \(ϵ_0(n)\) (Theorem 4). Note
that \(\NewtonSqrt\) uses an
initial guess such that \(ϵ_0(n) = Θ(1)\), so Theorem 4
reduces to Theorem 3 in that case. However, if \(x_0\) is chosen to be
\(Θ(n)\), e.g., the initial guess is just \(n\) or \(n/k\) for
some \(k\), then \(ϵ_0(n)\) will also be \(Θ(n)\), and so
the run-time will degrade to \(Θ(\lg n)\). So having a good
initial guess is important for the performance of
\(\NewtonSqrt\)!</p>
</section>
<hr />
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] Aside from
the <a href="https://en.wikipedia.org/wiki/Integer_square_root">Wikipedia
article</a>, the algorithm is described as Algorithm 9.2.11 in
<a href="http://www.amazon.com/Prime-Numbers-A-Computational-Perspective/dp/0387252827">Prime
Numbers: A Computational Perspective</a>.
<a href="#r1">↩</a></p>
<p id="fn2">[2] Note that only integer operations are used, which makes this
algorithm suitable for arbitrary-precision integers.
<a href="#r2">↩</a></p>
<p id="fn3">[3] Go and JS implementations are available
on <a href="https://github.com/akalin/iroot">my GitHub</a>.
<a href="#r3">↩</a></p>
<p id="fn4">[4] Here, and in most of the article, we’ll
implicitly assume that \(n \gt 0\).
<a href="#r4">↩</a></p>
<p id="fn5">[5] \(D(n)\) is \(Θ(\lg^2 n)\) using long division, but
fancier division algorithms have better run-times.
<a href="#r5">↩</a></p>
</section>
https://www.akalin.com/constant-time-mssb
Finding the Most Significant Set Bit of a Word in Constant Time
2014-07-03T00:00:00-07:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<script>
// Converts the given binary string (possibly with whitespace) to an integer.
function b(s) {
return parseInt(s.replace(/\s+/g, ''), 2);
}
// Converts the given integer to a binary string.
function bs(x) {
return x.toString(2);
}
</script>
<section>
<header>
<h2>1. Overall method</h2>
</header>
<p>Finding the most significant set bit of a word (equivalently, finding
the integer log base 2 of a word, or counting the leading zeros of a
word) is
a <a href="https://stackoverflow.com/questions/2589096/find-most-significant-bit-left-most-that-is-set-in-a-bit-array">well-studied
problem</a>. <a href="http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious">Bit
Twiddling Hacks</a> lists various methods,
and <a href="https://en.wikipedia.org/wiki/Count_leading_zeros">Wikipedia</a>
gives the CPU instructions that perform the operation directly.</p>
<p>However, all of these methods are either specific to a certain word
size or take more than constant time (in terms of number of word
operations). That raises the question of whether there <em>is</em> a
method that takes constant time—surprisingly, the answer is
“yes”!<sup><a href="#fn1" id="r1">[1]</a></sup></p>
<p>The key idea is to split a word into \(\lceil \sqrt{w} \rceil\)
blocks of \(\lceil \sqrt{w} \rceil\) bits (where \(w\) is the number
of bits in a word). One can then do certain operations on blocks
“in parallel” by stuffing multiple blocks into a word and
then performing a single word operation.</p>
<p>Furthermore, since the block size and block count are the same, one
can transform the bits of a block into the blocks of a word and vice
versa in various ways using only a constant number of word
operations.</p>
<p>In particular, this lets us split up the problem into two parts:
finding the most significant set (i.e., non-zero) block, and finding
the most significant set bit within that block. It then turns out that
both parts can be done in constant time.</p>
<p>For concreteness, we'll use 32-bit words when explaining the
method below.<sup><a href="#fn2" id="r2">[2]</a></sup></p>
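<p>As a minimal sketch of the block decomposition for 32-bit words, splitting a word into 6-bit blocks and reassembling it might look like the following. The helper names <code>splitIntoBlocks</code> and <code>joinBlocks</code> are illustrative only and are not used by the method itself.</p>

```javascript
// Splits a 32-bit word into blocks of ceil(sqrt(32)) = 6 bits, least
// significant block first. The sixth block is partial (only 2 bits).
function splitIntoBlocks(x) {
  var blocks = [];
  for (var shift = 0; shift < 32; shift += 6) {
    blocks.push((x >>> shift) & 0x3f);
  }
  return blocks;
}

// Reassembles a 32-bit word from the blocks produced above.
function joinBlocks(blocks) {
  var x = 0;
  for (var i = blocks.length - 1; i >= 0; --i) {
    // >>> 0 keeps the intermediate result an unsigned 32-bit value.
    x = ((x << 6) | blocks[i]) >>> 0;
  }
  return x;
}
```

<p>A 32-bit word therefore has five full blocks and one partial block holding the top two bits.</p>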
</section>
<section>
<header>
<h2>2. Finding the most significant set bit of a block</h2>
</header>
<p>First, let's consider the sub-problem of finding the most
significant set bit of a block. In fact, let's give ourselves a bit of
room and consider only blocks with the high bit cleared for now; we'll
see why we need this extra bit of room soon.</p>
<div class="p">For 32 bits, the block size is 6 bits, so with the high bit of a
block cleared we're left with 5 bits. Let's look at a naive
implementation:
<script>
function mssb5_naive(x) {
var c = 0;
for (var i = 0; i < 5 && x >= (1 << i); ++i) {
++c;
}
return c - 1;
}
</script>
<pre class="code-container"><code class="language-javascript">function mssb5_naive(x) {
var c = 0;
for (var i = 0; i < 5 && x >= (1 << i); ++i) {
++c;
}
return c - 1;
}</code></pre>
In the above, we consider successive powers of 2 until we find one
greater than our given number. Then the answer is simply one less than
the exponent of that power.</div>
<p>Notice that the loop has at most 5 iterations; this lines up nicely
with the 5 full blocks in an entire 32-bit word. (This is why we saved
our extra bit of room.) If we can copy our block to the higher 4
blocks and then use word operations to operate on those blocks in
parallel, then we're good.</p>
<p>For our example, let \(x = 5 = 00101\). Duplicating \(x\) among all
the blocks can easily be done by multiplying by the appropriate
constant:</p>
<style>
pre.binary-example {
border: 1px solid #073642; /* solarized base02 */
background-color: #fdf6e3; /* solarized base3 */
color: #586e75;
padding: 1em;
}
pre.binary-example span.dont-care {
color: #a3b1b1;
}
pre.binary-example span.last-operand {
text-decoration: underline;
}
</style>
<pre class="binary-example">
<span class="first-five"
>00 000000 000000 000000 000000 000101</span>
* <span class="last-operand low-bit-full"
>00 000001 000001 000001 000001 000001</span>
<span class="first-five"
>00 000000 000000 000000 000000 000101</span>
<span class="first-five"
>00 000000 000000 000000 000101</span>
<span class="first-five"
>00 000000 000000 000101</span>
<span class="first-five"
>00 000000 000101</span>
<span class="first-five last-operand"
>00 000101 </span>
<span class="lower-bits-full"
>00 000101 000101 000101 000101 000101</span>
</pre>
<p>In fact, this is a simple use of a more general tool. If \(x\) and
\(y\) are expressed in binary, then multiplying \(x\) by \(y\) can be
seen as taking the index of each set bit in \(y\), creating a copy of
\(x\) shifted by each such index, and then adding up all the shifted
copies. This case is just taking \(y\) to be the constant where the
\(\{ 0, 6, 12, 18, 24 \}\)th bits are set.</p>
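<p>The "shifted copies" view of multiplication can be sketched directly. The function name <code>multiplyByShifts</code> is hypothetical, and the sketch assumes the product fits comfortably in a double:</p>

```javascript
// Multiplies x by y by adding up one copy of x for each set bit of y,
// shifted left by the index of that bit.
function multiplyByShifts(x, y) {
  var product = 0;
  for (var i = 0; i < 32; ++i) {
    if ((y >>> i) & 1) {
      // Math.pow avoids the 32-bit wraparound of the << operator.
      product += x * Math.pow(2, i);
    }
  }
  return product;
}

// The constant with the {0, 6, 12, 18, 24}th bits set.
var DUPLICATE = 0x1041041;
```

<p>With these definitions, <code>multiplyByShifts(5, DUPLICATE)</code> adds five copies of <code>000101</code>, one per block, which is the duplication shown above.</p>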
<p>The first operation we need to parallelize is the comparisons to
the powers of 2. This can be converted to a word operation by noting
the comparison \(x \geq y\) can be performed by checking the sign of \(x
- y\), and that checking the sign can be done by setting the unused
high bit of \(x\) before doing the comparison, and then checking to
see if that high bit was left intact (i.e., not borrowed from). So we
pre-compute a constant with the \(n\)th block containing the \(n\)th
power of 2, then subtract that from our word containing the
duplicated blocks with the high bit of each block set. Finally, we mask off
the unneeded lower bits:</p>
<pre class="binary-example">
<span class="lower-bits-full"
>00 000101 000101 000101 000101 000101</span>
| <span class="last-operand high-bit-full"
>00 100000 100000 100000 100000 100000</span>
<span class="full"
>00 100101 100101 100101 100101 100101</span>
- <span class="last-operand lower-bits-full"
>00 010000 001000 000100 000010 000001</span>
<span class="high-bit-full"
>00 010101 011101 100001 100011 100100</span>
& <span class="last-operand high-bit-full"
>00 100000 100000 100000 100000 100000</span>
<span class="high-bit-full"
>00 000000 000000 100000 100000 100000</span>
</pre>
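<p>The parallel comparison above can be tried in isolation. In this sketch, <code>compareToPowersOfTwo</code> is a hypothetical helper; the final loop only unpacks the result for inspection and is not part of the constant-time method.</p>

```javascript
// Parses a binary string (possibly with whitespace) to an integer.
function b(s) {
  return parseInt(s.replace(/\s+/g, ''), 2);
}

// For a 5-bit x, returns an array whose i-th entry says whether
// x >= 2^i, computed with one OR, one subtraction, and one AND.
function compareToPowersOfTwo(x) {
  var C = b('00 100000 100000 100000 100000 100000');
  var w = x * b('00 000001 000001 000001 000001 000001');
  w |= C;
  w -= b('00 010000 001000 000100 000010 000001');
  w &= C;
  // Unpack the high bit of each block (for display only).
  var results = [];
  for (var i = 0; i < 5; ++i) {
    results.push(((w >>> (6 * i + 5)) & 1) === 1);
  }
  return results;
}
```

<p>For \(x = 5\), the first three entries are true and the last two are false, matching the three surviving high bits in the example.</p>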
<p>We're left with a word where all bits except for the high bit of
each block are zero. We still need to sum up those bits, but since they're
a block apart, that can be done by multiplication with a constant to
line up the bits in a single column. The constant turns out to have
the \(\{ 0, 6, 12, 18, 24 \}\)th bits set, with the answer being in
the top three bits:<sup><a href="#fn3" id="r3">[3]</a></sup></p>
<pre class="binary-example">
<span class="high-bit-full"
>00 000000 000000 100000 100000 100000</span>
* <span class="last-operand low-bit-full"
>00 000001 000001 000001 000001 000001</span>
<span class="high-bit-full"
>00 000000 000000 100000 100000 100000</span>
<span class="high-bit-full"
>00 000000 100000 100000 100000</span>
<span class="high-bit-full"
>00 100000 100000 100000</span>
<span class="high-bit-full"
>00 100000 100000</span>
<span class="high-bit-full last-operand"
>00 100000 </span>
<span class="top-three"
>01 100001 100001 100001 000000 100000</span>
MSSB5(x) = 011 - 1 = 2
</pre>
<div class="p">We can now write <code>mssb5()</code> using a constant number of
word operations:<sup><a href="#fn4" id="r4">[4]</a></sup>
<script>
function mssb5(x) {
// Duplicate x among all the blocks.
x *= b('00 000001 000001 000001 000001 000001');
// Compare to successive powers of 2 in parallel.
x |= b('00 100000 100000 100000 100000 100000');
x -= b('00 010000 001000 000100 000010 000001');
x &= b('00 100000 100000 100000 100000 100000');
// Sum up the bits into the high 3 bits.
x *= b('00 000001 000001 000001 000001 000001');
// Shift down and subtract 1 to get the answer.
return (x >>> 29) - 1;
}
</script>
<pre class="code-container"><code class="language-javascript">function mssb5(x) {
// Duplicate x among all the blocks.
x *= b('00 000001 000001 000001 000001 000001');
// Compare to successive powers of 2 in parallel.
x |= b('00 100000 100000 100000 100000 100000');
x -= b('00 010000 001000 000100 000010 000001');
x &= b('00 100000 100000 100000 100000 100000');
// Sum up the bits into the high 3 bits.
x *= b('00 000001 000001 000001 000001 000001');
// Shift down and subtract 1 to get the answer.
return (x >>> 29) - 1;
}</code></pre>
We can then find the most significant set bit of a full block
by simply testing the high bit first:
<script>
function mssb6(x) {
return (x & b('100000')) ? 5 : mssb5(x);
}
</script>
<pre class="code-container"><code class="language-javascript">function mssb6(x) {
return (x & b('100000')) ? 5 : mssb5(x);
}</code></pre>
</div>
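<p>Since the inputs are tiny, both functions can be checked exhaustively against a loop-based reference. The reference <code>mssbNaive</code> and the checker <code>agreesUpTo</code> are hypothetical helpers added for this sketch; <code>b</code>, <code>mssb5</code>, and <code>mssb6</code> are repeated from above so that the block runs on its own.</p>

```javascript
function b(s) {
  return parseInt(s.replace(/\s+/g, ''), 2);
}

function mssb5(x) {
  x *= b('00 000001 000001 000001 000001 000001');
  x |= b('00 100000 100000 100000 100000 100000');
  x -= b('00 010000 001000 000100 000010 000001');
  x &= b('00 100000 100000 100000 100000 100000');
  x *= b('00 000001 000001 000001 000001 000001');
  return (x >>> 29) - 1;
}

function mssb6(x) {
  return (x & b('100000')) ? 5 : mssb5(x);
}

// Reference implementation: halve until nothing is left. Returns -1
// for 0, like mssb5_naive.
function mssbNaive(x) {
  var index = -1;
  while (x > 0) {
    x = Math.floor(x / 2);
    ++index;
  }
  return index;
}

// Returns whether f agrees with mssbNaive on 0, 1, ..., limit - 1.
function agreesUpTo(f, limit) {
  for (var x = 0; x < limit; ++x) {
    if (f(x) !== mssbNaive(x)) {
      return false;
    }
  }
  return true;
}
```

<p>Under these definitions, <code>agreesUpTo(mssb5, 32)</code> and <code>agreesUpTo(mssb6, 64)</code> cover every possible input.</p>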
</section>
<section>
<header>
<h2>3. Finding the most significant set block</h2>
</header>
<p>Let's now consider the sub-problem of finding the most significant
set block of a word (ignoring the partial one). Similar to the above,
we'd like to be able to use subtraction to compare all the blocks to
zero at the same time. However, that requires the high bit of each
block to be unused. That's easy enough to handle: just separate the
high bit and the lower bits of each block, test the lower bits, and
then bitwise-or the results together:</p>
<pre class="binary-example">
x = <span class="full"
>00 100000 000000 010000 000000 000001</span>
& C = <span class="last-operand high-bit-full"
>00 100000 100000 100000 100000 100000</span>
y1 = <span class="high-bit-full"
>00 100000 000000 000000 000000 000000</span>
x = <span class="full"
>00 100000 000000 010000 000000 000001</span>
& ~C = <span class="last-operand lower-bits-full"
>00 011111 011111 011111 011111 011111</span>
t1 = <span class="lower-bits-full"
>00 000000 000000 010000 000000 000001</span>
C = <span class="full"
>00 100000 100000 100000 100000 100000</span>
- t1 = <span class="last-operand lower-bits-full"
>00 000000 000000 010000 000000 000001</span>
t2 = <span class="high-bit-full"
>00 100000 100000 010000 100000 011111</span>
~t2 = <span class="high-bit-full"
>11 011111 011111 101111 011111 100000</span>
& C = <span class="last-operand high-bit-full"
>00 100000 100000 100000 100000 100000</span>
y2 = <span class="high-bit-full"
>00 000000 000000 100000 000000 100000</span>
y1 = <span class="high-bit-full"
>00 100000 000000 000000 000000 000000</span>
| y2 = <span class="last-operand high-bit-full"
>00 000000 000000 100000 000000 100000</span>
y = <span class="high-bit-full"
>00 100000 000000 100000 000000 100000</span>
</pre>
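<p>The same steps can be written as a small function and compared with a per-block loop. Both <code>nonZeroBlocks</code> and <code>nonZeroBlocksNaive</code> are hypothetical names used only in this sketch.</p>

```javascript
function b(s) {
  return parseInt(s.replace(/\s+/g, ''), 2);
}

// Returns a word whose block high bits (bits 5, 11, 17, 23, 29) are
// set exactly for the non-zero full blocks of x.
function nonZeroBlocks(x) {
  var C = b('00 100000 100000 100000 100000 100000');
  var y1 = x & C;                // high bit of the block is set
  var y2 = ~(C - (x & ~C)) & C;  // some lower bit of the block is set
  return (y1 | y2) >>> 0;
}

// Reference implementation that inspects one block at a time.
function nonZeroBlocksNaive(x) {
  var y = 0;
  for (var i = 0; i < 5; ++i) {
    if ((x >>> (6 * i)) & 0x3f) {
      y += Math.pow(2, 6 * i + 5);
    }
  }
  return y;
}
```

<p>The bits of the partial top block do not affect the result, since borrows in the subtraction only travel upward.</p>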
<p>The result is stored in the high bits of each block. If we could
pack all the bits together, we'd then be able to
use <code>mssb5()</code>. This is similar to where we had to add all
the bits together in part 2, but we need a constant to stagger the
bits instead of lining them up. The constant to put the answer in the
high bits turns out to have the \(\{ 7, 12, 17, 22, 27 \}\)th bits
set:</p>
<pre class="binary-example">
y >>> 5 = <span class="low-bit-full"
>00 000001 000000 000001 000000 000001</span>
* <span class="last-operand every-fifth-from-seventh"
>00 001000 010000 100001 000010 000000</span>
<span class="low-bit-full"
>10 000000 000010 000000 00001</span>
<span class="low-bit-full"
>00 000001 000000 000001</span>
<span class="low-bit-full"
>00 100000 000000 1</span>
<span class="low-bit-full"
>00 000000 01</span>
<span class="last-operand low-bit-full"
>00 001 </span>
= <span class="top-five"
>10 101001 010010 100001 000010 000000</span>
</pre>
<p>This yields the answer <code>10101</code>, where the \(i\)th bit is
set exactly when the \(i\)th block of \(x\) is non-zero. Therefore,
the index of the most significant set block is
simply <code>mssb5(10101)</code>.</p>
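<p>The packing step on its own, as a sketch: <code>packBlockFlags</code> is a hypothetical name, and its input is assumed to have bits set only at the block high-bit positions.</p>

```javascript
function b(s) {
  return parseInt(s.replace(/\s+/g, ''), 2);
}

// Gathers the block high bits (bits 5, 11, 17, 23, 29) of y into a
// 5-bit number, with block i ending up at bit i.
function packBlockFlags(y) {
  // The multiplier has the {7, 12, 17, 22, 27}th bits set, which moves
  // bit 6i of (y >>> 5) to bit 27 + i without any collisions.
  var z = (y >>> 5) * b('0000 10000 10000 10000 10000 10000000');
  return z >>> 27;
}
```

<p>For the example above, <code>packBlockFlags(b('00 100000 000000 100000 000000 100000'))</code> evaluates to <code>b('10101')</code>.</p>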
</section>
<section>
<header>
<h2>4. Putting it all together</h2>
</header>
<div class="p">With the building blocks above, we can now implement the algorithm
for finding the most significant set bit in the full blocks of a
word:<sup><a href="#fn5" id="r5">[5]</a></sup>
<script>
function mssb30(x) {
var C = b('00 100000 100000 100000 100000 100000');
// Check whether the high bit of each block is set.
var y1 = x & C;
// Check whether the lower bits of each block are set.
var y2 = ~(C - (x & ~C)) & C;
var y = y1 | y2;
// Shift the result bits down to the lowest 5 bits.
var z = ((y >>> 5) * b('0000 10000 10000 10000 10000 10000000')) >>> 27;
// Compute the bit index of the most significant set block.
var b1 = 6 * mssb5(z);
// Compute the most significant set bit inside the most significant
// set block.
var b2 = mssb6((x >>> b1) & b('111111'));
return b1 + b2;
}
</script>
<pre class="code-container"><code class="language-javascript">function mssb30(x) {
var C = b('00 100000 100000 100000 100000 100000');
// Check whether the high bit of each block is set.
var y1 = x & C;
// Check whether the lower bits of each block are set.
var y2 = ~(C - (x & ~C)) & C;
var y = y1 | y2;
// Shift the result bits down to the lowest 5 bits.
var z = ((y >>> 5) * b('0000 10000 10000 10000 10000 10000000')) >>> 27;
// Compute the bit index of the most significant set block.
var b1 = 6 * mssb5(z);
// Compute the most significant set bit inside the most significant
// set block.
var b2 = mssb6((x >>> b1) & b('111111'));
return b1 + b2;
}</code></pre>
And then it's simple enough to extend it to find the most
significant set bit of a full word:
<script>
function mssb32(x) {
// Check the high duplet and fall back to mssb30 if it's not set.
var h = x >>> 30;
return h ? (30 + mssb5(h)) : mssb30(x);
}
</script>
<pre class="code-container"><code class="language-javascript">function mssb32(x) {
// Check the high duplet and fall back to mssb30 if it's not set.
var h = x >>> 30;
return h ? (30 + mssb5(h)) : mssb30(x);
}</code></pre>
So the code above shows that we can find the most significant set
bit of a 32-bit word in a constant number of 32-bit word
operations. It is easy enough to see how it can be adapted to yield a
similar algorithm for a given arbitrary (but sufficiently large) word
size, simply by pre-computing the various word-size-dependent
constants.</div>
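<p>One way to gain confidence in the result is to compare it against <code>Math.clz32</code>, which counts leading zeros, on a batch of non-zero words. This is only a sketch: <code>findMismatch</code> and <code>makeSamples</code> are hypothetical helpers, the functions from above are repeated in condensed form so the block runs on its own, and zero is skipped because it has no set bit.</p>

```javascript
function b(s) {
  return parseInt(s.replace(/\s+/g, ''), 2);
}

function mssb5(x) {
  x *= b('00 000001 000001 000001 000001 000001');
  x |= b('00 100000 100000 100000 100000 100000');
  x -= b('00 010000 001000 000100 000010 000001');
  x &= b('00 100000 100000 100000 100000 100000');
  x *= b('00 000001 000001 000001 000001 000001');
  return (x >>> 29) - 1;
}

function mssb6(x) {
  return (x & b('100000')) ? 5 : mssb5(x);
}

function mssb30(x) {
  var C = b('00 100000 100000 100000 100000 100000');
  var y = (x & C) | (~(C - (x & ~C)) & C);
  var z = ((y >>> 5) * b('0000 10000 10000 10000 10000 10000000')) >>> 27;
  var b1 = 6 * mssb5(z);
  return b1 + mssb6((x >>> b1) & b('111111'));
}

function mssb32(x) {
  var h = x >>> 30;
  return h ? (30 + mssb5(h)) : mssb30(x);
}

// Returns the first sample on which mssb32 disagrees with the
// clz32-based answer, or null if there is none.
function findMismatch(samples) {
  for (var i = 0; i < samples.length; ++i) {
    if (mssb32(samples[i]) !== 31 - Math.clz32(samples[i])) {
      return samples[i];
    }
  }
  return null;
}

// Powers of two, their neighbors, and some pseudo-random words.
function makeSamples() {
  var samples = [];
  for (var i = 0; i < 32; ++i) {
    var p = Math.pow(2, i);
    samples.push(p, p + (i > 0 ? 1 : 0), 2 * p - 1);
  }
  var seed = 12345;
  for (var j = 0; j < 1000; ++j) {
    seed = (seed * 1664525 + 1013904223) >>> 0;
    if (seed !== 0) {
      samples.push(seed);
    }
  }
  return samples;
}
```

<p>Calling <code>findMismatch(makeSamples())</code> reports the first disagreement, if any.</p>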
<p>It is also easy to see why no one actually uses this method on real
computers even in the absence of built-in instructions: it is much
more complicated and almost certainly slower than existing methods for
real word sizes! Also, the word-RAM model—where we assume all
word operations take constant time—is useful only when the word
size is fixed or narrowly bounded. When we allow word size to vary
arbitrarily, the word-RAM model breaks down—for one,
multiplication grows super-linearly with respect to word size! Alas,
this method is doomed to remain a theoretical curiosity, albeit one
that uses a few clever tricks.</p>
<script>
function highlightIndices(str, indices) {
var highlightedStr = '';
var i = 0, j = 0;
for (var k = 0; k < str.length; ++k) {
var chStr = str[str.length - k - 1];
if (chStr == '0' || chStr == '1') {
if (j < indices.length && i == indices[j]) {
++j;
} else {
chStr = '<span class="dont-care">' + chStr + '</span>';
}
++i;
}
highlightedStr = chStr + highlightedStr;
}
return highlightedStr;
}
function highlightElements(selector, indices) {
var es = document.querySelectorAll(selector);
for (var i = 0; i < es.length; ++i) {
var e = es[i];
e.innerHTML = highlightIndices(e.textContent, indices);
}
}
highlightElements('pre.binary-example span.first-five', [0, 1, 2, 3, 4]);
highlightElements('pre.binary-example span.low-bit-full', [0, 6, 12, 18, 24]);
highlightElements('pre.binary-example span.every-fifth-from-seventh',
[7, 12, 17, 22, 27]);
highlightElements('pre.binary-example span.lower-bits-full',
[0, 1, 2, 3, 4,
6, 7, 8, 9, 10,
12, 13, 14, 15, 16,
18, 19, 20, 21, 22,
24, 25, 26, 27, 28]);
highlightElements('pre.binary-example span.high-bit-full', [5, 11, 17, 23, 29]);
highlightElements('pre.binary-example span.full',
[0, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29]);
highlightElements('pre.binary-example span.top-three', [29, 30, 31]);
highlightElements('pre.binary-example span.top-five', [27, 28, 29, 30, 31]);
</script>
</section>
<hr />
<p>Like this post? Subscribe to
<!-- The image is 256x256, the center of the dot is 189 pixels from the
top, and the radius of the dot is 24. Therefore, the dot is 43/256 =
0.16796875 of the image height above the bottom.-->
<a href="feed/atom">my feed <img src="feed-icon.svg" alt="RSS icon" style="width: 1em; height: 1em; vertical-align: -0.16796875em;" /></a>
or follow me on
<a href="https://twitter.com/fakalin">Twitter <img src="twitter-icon.svg" alt="Twitter icon" style="width: 1em; height 1em;" /></a>.</p>
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] The constant-time method is detailed in the original
papers for the <a href="https://en.wikipedia.org/wiki/Fusion_tree">fusion
tree</a> data
structure. <a href="http://dl.acm.org/citation.cfm?id=100217">The
first paper</a> is unfortunately behind a paywall, but
<a href="https://www.sciencedirect.com/science/article/pii/0022000093900404?np=y">the
second paper</a>, essentially a rehash of the first one, is
freely downloadable.</p>
<p>The method is also explained in
<a href="http://courses.csail.mit.edu/6.851/spring12/lectures/L12.html">lecture
12</a> of Erik
Demaine's <a href="http://courses.csail.mit.edu/6.851/spring12/">Advanced
Data Structures</a> class, which is how I originally found out
about it.
<a href="#r1">↩</a></p>
<p id="fn2">[2] Demaine uses 16-bit words, which factor nicely into
4 blocks of 4 bits, but it is instructive to see how the method
deals with a word size that is not a perfect square.
<a href="#r2">↩</a></p>
<p id="fn3">[3] In this case, the partial 6th block has enough room
to hold the answer but this may not be true in general. This can
be remedied easily enough by shifting down the block high bits to
the low bits before multiplying; the answer will then be in the
last full block.
<a href="#r3">↩</a></p>
<p id="fn4">[4] <code>b(str)</code> just parses a number from its
binary string representation.
<a href="#r4">↩</a></p>
<p id="fn5">[5] Try out this function (and the others on this page)
by opening up the JS console on this page!
<a href="#r5">↩</a></p>
</section>
https://www.akalin.com/primality-testing-polynomial-time-part-2
Primality Testing in Polynomial Time (Ⅱ)
2012-12-29T00:00:00-08:00
Fred Akalin
https://www.akalin.com/
<script type="text/javascript"
src="https://cdnjs.cloudflare.com/ajax/libs/knockout/3.4.0/knockout-min.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/num.js/eab08d4/simple-arith.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/num.js/eab08d4/trial-division.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/num.js/eab08d4/euler-phi.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/num.js/eab08d4/multiplicative-order.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/num.js/eab08d4/primality-testing.js"></script>
<p><em>(Note: this article isn't fully polished yet, but I thought it
would be a shame to let it languish during my sabbatical. Happy new
year!)</em></p>
<section>
<header>
<h2>5. Strengthening the AKS theorem</h2>
</header>
<div class="p">It turns out the conditions of the AKS theorem are stronger than
they appear; they themselves imply that \(n\) is prime. To show this,
we need the following theorem, which we'll state without proof:
<div class="theorem">
(<span class="theorem-name">Lenstra's squarefree test</span>.) If
\(a^n \equiv a \pmod{n}\) for \(1 \le a \lt \ln^2 n\), then \(n\) is
<a href="http://en.wikipedia.org/wiki/Squarefree">squarefree</a>.<sup><a href="#fn1" id="r1">[1]</a></sup></div>
We also need a couple of lemmas:
<div class="theorem">
(<span class="theorem-name">Lemma 1</span>.)
For \(0 \le a \lt n\) and \(r \gt 1\), let
\[
(X + a)^n \equiv X^n + a \pmod{X^r - 1, n}\text{.}
\]
Then
\[
(a + 1)^n = a + 1 \pmod{n}\text{.}
\]
</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> By definition,
\((X + a)^n - (X^n + a) = k(X) \cdot (X^r - 1) \pmod{n}\) for some
polynomial \(k(X)\). Treating both sides as functions of \(X\) and
substituting in \(1\), we
immediately get \((1 + a)^n - (1 + a) = 0 \pmod{n}\). ∎</p>
</div>
<div class="theorem">
(<span class="theorem-name">Lemma 2</span>.)
For \(n \gt 1\), \(\lfloor \lg n \rfloor \cdot \lg n \gt \ln^2 n\).
</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> Since \(\ln n = \frac{\lg n}{\lg
e}\) and \(e \gt 2\), \(\lg n \gt \ln n\) for \(n \gt 1\).</p>
<p>Letting \(k = \lfloor \lg n \rfloor\), \(\ln n \lt \frac{k + 1}{\lg
e}\), so if \(\frac{k + 1}{\lg e} \lt k\), that implies that \(\ln n
\lt k\). Solving for \(k\), we get that \(k \gt \frac{1}{\lg e -
1}\), which is true when \(n \ge 8\).</p>
<p>So if \(n \ge 8\), then \(\ln n \lt \lfloor \lg n \rfloor\).
Checking manually, we find that \(\ln n \lt \lfloor \lg n \rfloor\)
holds also for \(n \in \{ 2, 4, 5, 6, 7 \}\), immediately implying the
lemma for all \(n \gt 1\) except \(3\). But checking manually again,
we find that the lemma holds for \(3\) also. ∎</p>
</div>
</div>
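<p>The manual checks in the proof of Lemma 2 can also be spot-checked numerically. The sketch below uses floating-point logarithms, so it is a sanity check rather than a proof; the function names are hypothetical.</p>

```javascript
// floor(lg n) for 1 <= n < 2^32, computed exactly from the bit length.
function floorLg(n) {
  return 31 - Math.clz32(n);
}

// Checks the statement of Lemma 2: floor(lg n) * lg n > ln^2 n.
function lemma2Holds(n) {
  var ln = Math.log(n);
  return floorLg(n) * Math.log2(n) > ln * ln;
}

// Checks the stronger inequality used in the proof: ln n < floor(lg n).
function lnBelowFloorLg(n) {
  return Math.log(n) < floorLg(n);
}

// Returns every n in [2, limit] for which predicate(n) is false.
function exceptions(predicate, limit) {
  var result = [];
  for (var n = 2; n <= limit; ++n) {
    if (!predicate(n)) {
      result.push(n);
    }
  }
  return result;
}
```

<p>Running <code>exceptions(lnBelowFloorLg, 100000)</code> singles out \(n = 3\), the one case the proof handles separately, while <code>exceptions(lemma2Holds, 100000)</code> comes back empty.</p>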
<div class="p">Then, we can prove the strong version of the AKS theorem:
<div class="theorem">
(<span class="theorem-name">AKS theorem, strong version</span>.) Let
\(n \ge 2\), \(r\) be relatively prime to \(n\) with \(o_r(n) \gt
\lg^2 n\), and \(M \gt \sqrt{φ(r)} \lg n\). Furthermore, let
\(n\) have no prime factor less than \(M\) and let
\[
(X + a)^n \equiv X^n + a \pmod{X^r - 1, n}
\]
for \(0 \le a \lt M\). Then \(n\) is prime.</div>
<div class="proof">
<p><span class="proof-name">Proof.</span> From Lemma 1, we know that \(a^n
= a \pmod{n}\) for \(1 \le a \lt M\). Since \(M \gt \lfloor \sqrt{t}
\rfloor \lg n \gt \lfloor \lg n \rfloor \cdot \lg n \gt \ln^2 n\) by
Lemma 2, we can apply Lenstra's squarefree test to show that \(n\) is
squarefree. From the weak version of the AKS theorem, we also know
that \(n\) is a prime power. But since \(n\) is squarefree, it must
have distinct prime factors, which immediately implies that \(n\) is
prime. ∎</p>
</div>
</div>
</section>
<section>
<header>
<h2>6. Finding a suitable \(r\)</h2>
</header>
<div class="p">The only remaining loose end is to show that there exists an \(r\)
with \(o_r(n) \gt \lg^2 n\) and that it's small enough (i.e., polylog
in \(n\)). The existence of \(r\) is easy to see; we can simply pick
the smallest \(r\) that is co-prime to \(n\) and greater than
\(n^{\lg^2 n}\). But that's obviously too big. We can do better:
<div class="theorem">
<span class="theorem-name">(Upper bound for \(r\).)</span> Let \(n \ge 2\).
Then there exists some \(r \le \max(3, \lceil \lg n \rceil^5)\) such
that \(o_r(n) \gt \lceil \lg n \rceil^2\).<sup><a href="#fn2" id="r2">[2]</a></sup>
</div>
<div class="proof">
<div class="p"><span class="proof-name">(Proof.)</span> Let's
first prove the following lemma:
<div class="theorem">
<span class="theorem-name">(Lemma 3.)</span> Let \(n \ge 9\) and \(b =
\lceil \lg n \rceil\). Then for \(m \ge 1\), there exists some \(r
\le b^{2m + 1}\) such that \(o_r(n) \gt b^m\).
</div>
<div class="proof">
<p><span class="proof-name">(Proof.)</span> Let
\[
N = n \cdot (n - 1) \cdot (n^2 - 1) \dotsm (n^{b^m} - 1)\text{.}
\]
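Note that if \(r\) is coprime to \(n\) and \(o_r(n) \le b^m\), then
\(r\) divides \(n^{o_r(n)} - 1\) and therefore \(N\). So
it suffices to find some prime \(r\) that does not divide \(N\): such an
\(r\) cannot divide \(n\), so it is coprime to \(n\) and \(o_r(n) \gt b^m\).</p>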
<p>We can see that:
\[
\begin{aligned}
N &= n \cdot (n - 1) \cdot (n^2 - 1) \dotsm (n^{b^m} - 1) \\
&\lt n \cdot n \cdot n^2 \dotsm n^{b^m} \\
&= n^{1 + 1 + 2 + 3 + \dotsm + b^m} \\
&= n^{1 + b^m (b^m + 1) / 2} \\
&= n^{\frac{1}{2} b^{2m} + \frac{1}{2} b^m + 1}\text{.}
\end{aligned}
\]
Furthermore, we can upper-bound the exponent of \(n\):
\[
\begin{aligned}
b^{2m} &\gt \frac{1}{2} b^{2m} + \frac{1}{2} b^m + 1 \\
\frac{1}{2} b^{2m} - \frac{1}{2} b^m - 1 &\gt 0 \\
b^{2m} - b^m - 2 &\gt 0 \\
(b^m - 2) \cdot (b^m + 1) &\gt 0\text{.}
\end{aligned}
\]
The last statement holds when \(b^m \gt 2\), which is always true since \(b
\ge 4\) and \(m \ge 1\).</p>
<p>Applying the upper bound,
\[
\begin{aligned}
N &\lt n^{\frac{1}{2} b^{2m} + \frac{1}{2} b^m + 1} \\
&\lt n^{b^{2m}} \\
&\le 2^{b^{2m + 1}}\text{.}
\end{aligned}
\]
</p>
<div class="p">We can then use the following theorem, which
we'll state without proof:
<div class="theorem">
<span class="theorem-name">(<a href="http://en.wikipedia.org/wiki/Primorial">Primorial</a>
lower bound.)</span> For \(x \ge 31\), the product of primes \(\le x\)
exceeds \(2^x\).<sup><a href="#fn3" id="r3">[3]</a></sup> That is,
\[
x\# = \prod_{p \le x\text{, }p\text{ is prime}} p \gt 2^x\text{.}
\]
</div>
<p>Since \(b \ge 4\) and \(m \ge 1\), \(b^{2m + 1} \ge 31\), and so
\(2^{b^{2m + 1}} \lt (b^{2m + 1})\#\). Therefore,
\[
N \lt 2^{b^{2m + 1}} \lt (b^{2m + 1})\#\text{.}
\]
But that implies that there is some prime number \(p_0 \le b^{2m +
1}\) that does not divide \(N\); if they all did, then \(N\) would be
at least their product \((b^{2m + 1})\#\), contradicting the
inequality above. Therefore, \(o_{p_0}(n) \gt b^m\). ∎</p>
</div>
</div>
</div>
<p>We can then prove our theorem: for \(n \ge 9\), apply Lemma 3 with
\(m = 2\). Here are explicit values for the rest: for \(n = 2\), \(r
= 3\); \(n = 3\), \(r = 7\); \(n \in \{ 4, 6, 7, 8\}\), \(r = 11\);
and for \(n = 5\), \(r = 17\). ∎</p>
</div>
</div>
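<p>For small \(n\), the explicit values above can be reproduced by brute force. The following is a minimal sketch using plain numbers rather than arbitrary-precision integers; <code>ceilLg</code>, <code>gcd</code>, <code>orderExceeds</code>, and <code>findModulus</code> are hypothetical helpers, not functions from the library used later.</p>

```javascript
// ceil(lg n) for 1 <= n <= 2^31.
function ceilLg(n) {
  return n <= 1 ? 0 : 32 - Math.clz32(n - 1);
}

function gcd(a, b) {
  while (b !== 0) {
    var t = a % b;
    a = b;
    b = t;
  }
  return a;
}

// Returns whether o_r(n) > bound, assuming r >= 2 and gcd(n, r) = 1.
// Bails out as soon as some power n^k with k <= bound is 1 mod r.
function orderExceeds(n, r, bound) {
  var acc = 1;
  for (var k = 1; k <= bound; ++k) {
    acc = (acc * (n % r)) % r;
    if (acc === 1) {
      return false;
    }
  }
  return true;
}

// Returns the least r coprime to n with o_r(n) > ceil(lg n)^2.
function findModulus(n) {
  var bound = Math.pow(ceilLg(n), 2);
  for (var r = 2; ; ++r) {
    if (gcd(n, r) === 1 && orderExceeds(n, r, bound)) {
      return r;
    }
  }
}
```

<p>For instance, <code>findModulus(5)</code> returns 17 and <code>findModulus(8)</code> returns 11, in line with the values listed in the proof.</p>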
<div class="p">Also, it turns out that about half the time, we can do better.
We'll state this theorem without proof:
<div class="theorem"><span class="theorem-name">(Tight upper bound for
some \(r\).)</span> Let \(n \equiv \pm 3 \pmod{8}\). Then there
exists some \(r \lt 8 \lceil \lg n \rceil^2\) such that \(o_r(n) \gt
\lceil \lg n \rceil^2\).<sup><a href="#fn4" id="r4">[4]</a></sup></div>
</div>
</section>
<section>
<header>
<h2>7. The AKS algorithm (simple version)</h2>
</header>
<div class="p">Without further ado, here is a simple version of the AKS
algorithm, given \(n \ge 2\):
<ol>
<li>Starting from \(\lceil \lg n \rceil^2 + 2\), find an \(r\) such
that \(\gcd(r, n) = 1\) and \(o_r(n) \gt \lceil \lg n
\rceil^2\).</li>
<li>Compute \(M = \lfloor \sqrt{r - 1} \rfloor \lceil \lg n
\rceil + 1\).</li>
<li>Search for a prime factor of \(n\) less than \(\min(M, n)\). If one is
found, return “composite”. If none are found and \(M \gt
\lfloor \sqrt{n} \rfloor\), return “prime”.</li>
<li>For each \(1 \le a \lt M\), compute \((X + a)^n\), reducing
coefficients mod \(n\) and powers mod \(r\). If the result is not
equal to \(X^{n\text{ mod }r} + a\), return
“composite”.</li>
<li>Otherwise, return “prime”.</li>
</ol>
</div>
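<p>The polynomial check in step 4 can be sketched with plain arrays of coefficients. This is an illustrative stand-alone version, not the implementation used later in this article: the function names are hypothetical, \(r \ge 2\) is assumed, and \(n\) is assumed to be below \(2^{26}\) so that products of coefficients stay exact.</p>

```javascript
// Returns an array of r zero coefficients (index = power of X).
function zeroPoly(r) {
  var p = [];
  for (var k = 0; k < r; ++k) {
    p.push(0);
  }
  return p;
}

// Multiplies p and q, reducing powers mod r (i.e., working mod
// X^r - 1) and coefficients mod n.
function polyMulMod(p, q, r, n) {
  var out = zeroPoly(r);
  for (var i = 0; i < r; ++i) {
    if (p[i] === 0) {
      continue;
    }
    for (var j = 0; j < r; ++j) {
      var idx = (i + j) % r;
      out[idx] = (out[idx] + p[i] * q[j]) % n;
    }
  }
  return out;
}

// Computes p^e mod (X^r - 1, n) by repeated squaring.
function polyPowMod(p, e, r, n) {
  var result = zeroPoly(r);
  result[0] = 1 % n;
  var base = p.slice();
  while (e > 0) {
    if (e % 2 === 1) {
      result = polyMulMod(result, base, r, n);
    }
    base = polyMulMod(base, base, r, n);
    e = Math.floor(e / 2);
  }
  return result;
}

// Returns whether (X + a)^n = X^n + a mod (X^r - 1, n).
function satisfiesAKSCongruence(n, r, a) {
  var xPlusA = zeroPoly(r);
  xPlusA[0] = a % n;
  xPlusA[1 % r] = (xPlusA[1 % r] + 1) % n;
  var rhs = zeroPoly(r);
  rhs[0] = a % n;
  rhs[n % r] = (rhs[n % r] + 1) % n;
  var lhs = polyPowMod(xPlusA, n, r, n);
  for (var i = 0; i < r; ++i) {
    if (lhs[i] !== rhs[i]) {
      return false;
    }
  }
  return true;
}
```

<p>For a prime such as \(n = 7\), the congruence holds for every \(a\) and \(r\), whereas for \(n = 9\), \(r = 5\), \(a = 1\) it fails.</p>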
<p>As we've shown in the previous section, there always exists an
\(r\) such that \(o_r(n) \gt \lceil \lg n \rceil^2\), so step 1 will
terminate. All other steps are bounded, so the entire algorithm will
always terminate.</p>
<p>In step 2, since \(φ(r) \le r - 1\), the value of \(M\) that
we compute is always greater than \(\sqrt{φ(r)} \lceil \lg n
\rceil\). Step 4 checks if \((X + a)^n \equiv X^n + a \pmod{X^r - 1, n}\) holds. Therefore, by the strong AKS theorem, if the algorithm
returns “prime”, then \(n\) is prime. Furthermore, by the
weak version of Fermat's little theorem for polynomials, if the
algorithm returns “composite”, then \(n\) is
composite.</p>
<p>Since the algorithm always terminates and it returns the correct
answer when it terminates, it
is <a href="http://en.wikipedia.org/wiki/Total_correctness">totally
correct</a>.</p>
<p>As shown in the previous section, we have to test \(O(\lg^5 n)\)
values to find a suitable \(r\). Assuming a straightforward algorithm
to compute the multiplicative order that bails out once \(\lceil \lg
n \rceil^2\) is reached, and assuming we use the
division-based <a href="http://en.wikipedia.org/wiki/Euclidean_algorithm">Euclidean
algorithm</a> for computing the greatest common divisor, testing each
value takes \(O(\lg^2 n)\) multiplies and \(O(\lg r) = O(\lg \lg n)\)
divisions of \(O(\lg r)\)-bit numbers. Let \(M(b)\) be the cost to
multiply two \(b\)-bit numbers. The complexity of division is
asymptotically the same as multiplication, so the total cost of step 1
is \(O(\lg^5 n \cdot (\lg^2 n + \lg \lg n) \cdot M(\lg \lg n)) =
O(\lg^7 n \cdot M(\lg \lg n))\), assuming \(M(O(b)) = O(M(b))\).</p>
<p>Step 2 involves one square root, one multiplication, and one
increment, all involving \(O(\lg \lg n)\)-bit numbers. The complexity
of taking the square root is asymptotically the same as
multiplication, so the total cost of step 2 is \(O(M(\lg \lg n))\).</p>
<p>Step 3 takes a square root and tests \(M = O(\lg^{7/2} n)\)
numbers, and each test involves dividing two \(O(\lg M)\)-bit numbers,
so the total cost of step 3 is \(O(\lg^{7/2} n \cdot M(\lg \lg
n))\).</p>
<p>Steps 4 and 5 test \(O(\lg^{7/2} n)\) polynomials. Testing each
polynomial involves exponentiating it by \(n\), reducing powers mod
\(r\) and coefficients mod \(n\) at each step, which requires \(O(\lg
n)\) multiplications of polynomials with \(O(r)\) coefficients each of
size \(O(\lg n)\). The cost of multiplying two polynomials with \(s\)
coefficients of size \(b\) is \(M(s) \cdot M(b)\), so the total cost
of steps 4 and 5 is \(O(\lg^{9/2} n \cdot M(\lg^5 n \cdot \lg n)) =
O(\lg^{9/2} n \cdot M(\lg^6 n))\), assuming \(M(a) \cdot M(b) = M(a \cdot b)\).</p>
<p>If <a href="http://en.wikipedia.org/wiki/Multiplication_algorithm#Long_multiplication">long
multiplication</a> is used, then it costs \(M(b) = b^2\), which gives
a total cost of \(O(\lg^{33/2} n) = O(\lg^{17} n)\)
for the whole
algorithm. <a href="http://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm">More
complicated multiplication methods</a> cost only \(M(b) = b \lg b\),
which gives a total cost of \(O(\lg^{21/2} n \cdot \lg \lg n) =
O(\lg^{11} n)\) for the whole algorithm.
Either way, the AKS primality test is shown to be implementable in
polynomial time.</p>
<div class="p">Below is step 1 implemented in JavaScript; however, here we bound
\(r\) explicitly to be able to detect bugs quickly.<sup><a href="#fn5" id="r5">[5]</a></sup>
<pre class="code-container"><code class="language-javascript">// Returns an upper bound for r such that o_r(n) > ceil(lg(n))^2 that
// is polylog in n.
function calculateAKSModulusUpperBound(n) {
n = SNat.cast(n);
var ceilLgN = new SNat(n.ceilLg());
var rUpperBound = ceilLgN.pow(5).max(3);
var nMod8 = n.mod(8);
if (nMod8.eq(3) || nMod8.eq(5)) {
rUpperBound = rUpperBound.min(ceilLgN.pow(2).times(8));
}
return rUpperBound;
}
// Returns the least r such that o_r(n) > ceil(lg(n))^2 >= ceil(lg(n)^2).
function calculateAKSModulus(n, multiplicativeOrderCalculator) {
n = SNat.cast(n);
multiplicativeOrderCalculator =
multiplicativeOrderCalculator || calculateMultiplicativeOrderCRT;
var ceilLgN = new SNat(n.ceilLg());
var ceilLgNSq = ceilLgN.pow(2);
var rLowerBound = ceilLgNSq.plus(2);
var rUpperBound = calculateAKSModulusUpperBound(n);
for (var r = rLowerBound; r.le(rUpperBound); r = r.plus(1)) {
if (n.gcd(r).ne(1)) {
continue;
}
var o = multiplicativeOrderCalculator(n, r);
if (o.gt(ceilLgNSq)) {
return r;
}
}
throw new Error('Could not find AKS modulus');
}</code></pre>
</div>
<div class="p">Here is step 2 implemented in JavaScript:
<pre class="code-container"><code class="language-javascript">// Returns floor(sqrt(r-1)) * ceil(lg(n)) + 1 > floor(sqrt(Phi(r))) * lg(n).
function calculateAKSUpperBoundSimple(n, r) {
n = SNat.cast(n);
r = SNat.cast(r);
// Use r - 1 instead of calculating Phi(r).
return r.minus(1).floorRoot(2).times(n.ceilLg()).plus(1);
}</code></pre>
</div>
<div class="p">Here is part of step 3 implemented in JavaScript, along with the
comments for the functions used in trial division:
<pre class="code-container"><code class="language-javascript">// Given a number n, a generator function getNextDivisor, and a
// processing function processPrimeFactor, factors n using the
// divisors returned by getNextDivisor and passes each prime factor
// with its multiplicity to processPrimeFactor.
//
// getNextDivisor is passed the current unfactorized part of n and it
// should return the next divisor to try, or null if there are no more
// divisors to generate (although processPrimeFactor may still be
// called). processPrimeFactor is called with each non-trivial prime
// factor and its multiplicity. If it returns a false value, it won't
// be called anymore.
function trialDivide(n, getNextDivisor, processPrimeFactor) {
...
}
// Returns a generator that generates primes up to 7, then odd numbers
// up to floor(sqrt(n)), using a mod-30 wheel to eliminate odd numbers
// that are known composite (roughly half).
function makeMod30WheelDivisorGenerator() {
...
}
// Returns the first factor of n less than M found using generator, or
// null if there is no such factor.
function getFirstFactorBelow(n, M, generator) {
n = SNat.cast(n);
M = SNat.cast(M);
generator = generator || makeMod30WheelDivisorGenerator();
var boundedGenerator = function(n) {
var d = generator(n);
return (d && d.lt(M)) ? d : null;
};
var factor = null;
trialDivide(n, boundedGenerator, function(p, k) {
if (p.lt(M.min(n))) {
factor = p;
}
return false;
});
return factor;
}</code></pre>
</div>
<div class="p">Below is a function that ties steps 1 to 3 together; it is useful
for testing purposes to separate it from the other steps. (Actually,
we use a different function to compute \(M\) which computes
\(φ(r)\) instead of using \(r - 1\) so that we always have the
tightest bound possible for \(M\).)
<pre class="code-container"><code class="language-javascript">// The getAKSParameters* functions below return a parameters object
// with the following fields:
//
// n: the number the parameters are for.
//
// factor: A prime factor of n. If present, the fields below may
// not be present.
//
// isPrime: if set, n is prime. If present, the fields below may
// not be present.
//
// r: the AKS modulus for n.
//
// M: the AKS upper bound for n.
function getAKSParametersSimple(n) {
n = SNat.cast(n);
var r = calculateAKSModulus(n);
var M = calculateAKSUpperBound(n, r);
var parameters = {
n: n,
r: r,
M: M
};
var factor = getFirstFactorBelow(n, M);
if (factor) {
parameters.factor = factor;
} else if (M.gt(n.floorRoot(2))) {
parameters.isPrime = true;
}
return parameters;
}</code></pre>
</div>
<div class="p">Finally, here is step 4 implemented in Javascript:
<pre class="code-container"><code class="language-javascript">// Returns whether (X + a)^n = X^n + a mod (X^r - 1, n).
function isAKSWitness(n, r, a) {
n = SNat.cast(n);
r = SNat.cast(r);
a = SNat.cast(a);
function reduceAKS(p) {
return p.modPow(r).mod(n);
}
function prodAKS(x, y) {
return reduceAKS(x.times(y));
}
var one = new SPoly(new SNat(1));
var xn = one.shiftLeft(n.mod(r));
var ap = new SPoly(a);
var lhs = one.shiftLeft(1).plus(ap).pow(n, prodAKS);
var rhs = reduceAKS(xn.plus(ap));
return lhs.ne(rhs);
}
// Returns the first a < M that is an AKS witness for n, or null if
// there isn't one.
function getFirstAKSWitness(n, r, M) {
for (var a = new SNat(1); a.lt(M); a = a.plus(1)) {
if (isAKSWitness(n, r, a)) {
return a;
}
}
return null;
}</code></pre>
</div>
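<div class="p">As a rough, self-contained illustration of the per-\(a\) test, here is a minimal sketch that uses built-in <code>BigInt</code> values and plain coefficient arrays in place of the <code>SNat</code> and <code>SPoly</code> classes. The names <code>mulModSketch</code> and <code>isAKSWitnessSketch</code> are hypothetical, \(r\) is assumed to be an ordinary number at least 2, and the schoolbook multiplication is only suitable for small inputs:
<pre class="code-container"><code class="language-javascript">// Multiplies two polynomials, given as arrays of r BigInt coefficients
// (lowest degree first), modulo (X^r - 1, n).
function mulModSketch(x, y, r, n) {
  const out = new Array(r).fill(0n);
  for (let i = 0; i < r; i++) {
    if (x[i] === 0n) continue;
    for (let j = 0; j < r; j++) {
      const k = (i + j) % r;
      out[k] = (out[k] + x[i] * y[j]) % n;
    }
  }
  return out;
}

// Returns whether (X + a)^n != X^n + a mod (X^r - 1, n).
// n and a are BigInts; r is a Number >= 2.
function isAKSWitnessSketch(n, r, a) {
  // base = X + a
  let base = new Array(r).fill(0n);
  base[0] = a % n;
  base[1] = (base[1] + 1n) % n;
  // lhs = (X + a)^n by repeated squaring.
  let lhs = new Array(r).fill(0n);
  lhs[0] = 1n % n;
  for (let e = n; e > 0n; e >>= 1n) {
    if (e & 1n) lhs = mulModSketch(lhs, base, r, n);
    base = mulModSketch(base, base, r, n);
  }
  // rhs = X^(n mod r) + a
  const rhs = new Array(r).fill(0n);
  rhs[0] = a % n;
  const k = Number(n % BigInt(r));
  rhs[k] = (rhs[k] + 1n) % n;
  return lhs.some((c, i) => c !== rhs[i]);
}</code></pre>
For a prime such as \(n = 13\) no \(a\) is a witness, whereas for \(n = 9\), \(r = 5\), the value \(a = 1\) already is one.
</div>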
<div class="p">Here's the code that ties it all together:
<pre class="code-container"><code class="language-javascript">// Returns whether n is prime or not using the AKS primality test.
function isPrimeByAKS(n) {
n = SNat.cast(n);
var parameters = getAKSParameters(n);
if (parameters.factor) {
return false;
}
if (parameters.isPrime) {
return true;
}
return (getFirstAKSWitness(n, parameters.r, parameters.M) == null);
}</code></pre>
</div>
<p class="interactive-example" id="aksExample">
Let
<span class="fake-katex"><var>n</var> =
<input class="parameter" size="6" pattern="[0-9]*" required
type="text" value="175507"
data-bind="value: nStr, valueUpdate: 'afterkeydown'" /></span>.
<!-- ko template: outputTemplate --><!-- /ko -->
<script type="text/html" id="aks.error.invalidN">
<span class="fake-katex"><var>n</var></span> is not a valid number.
</script>
<script type="text/html" id="aks.error.outOfBoundsN">
<span class="fake-katex"><var>n</var></span>
must be greater than or equal to 2.
</script>
<script type="text/html" id="aks.success">
<span class="fake-katex">⌈lg <var>n</var>⌉</span> is
<span class="fake-katex intermediate" data-bind="text: ceilLgN"></span>,
<span class="fake-katex"><var>r</var> =
<span class="intermediate" data-bind="text: r"></span></span>
is the least value such that
<span class="fake-katex">o<sub><var>r</var></sub>(<var>n</var>) =
<span class="intermediate" data-bind="text: nOrder"></span>
> ⌈lg <var>n</var>⌉<sup>2</sup>
= <span class="intermediate" data-bind="text: ceilLgNSq"></span></span>,
<span class="fake-katex"><var>φ</var>(<var>r</var>) =
<span class="intermediate" data-bind="text: eulerPhiR"></span></span>,
and <span class="fake-katex"><var>M</var> =
⌊√<var>φ</var>(<var>r</var>)⌋ ⋅
⌈lg <var>n</var>⌉ + 1 =
<span class="intermediate" data-bind="text: M"></span> >
⌊√<var>φ</var>(<var>r</var>)⌋ ⋅
lg <var>n</var></span>.
<span data-bind="if: factor()">
<span class="fake-katex"><var>n</var></span>
has a factor
<span class="fake-katex"><span class="intermediate"
data-bind="text: factor"></span>
< <var>M</var></span>, so therefore
<span class="fake-katex"><var>n</var></span> is
<span class="result">composite</span>.
</span>
<span data-bind="if: isPrime()">
<span class="fake-katex"><var>n</var></span>
has no factor <span class="fake-katex">< <var>M</var></span>
and <span class="fake-katex"><var>M</var> >
⌊√<var>n</var>⌋ =
<span class="intermediate" data-bind="text: floorSqrtN"></span></span>,
so therefore
<span class="fake-katex"><var>n</var></span> is
<span class="result">prime</span>.
</span>
<span data-bind="if: !factor() && !isPrime()">
<span class="fake-katex"><var>n</var></span>
has no factor <span class="fake-katex">< <var>M</var></span>
and <span class="fake-katex"><var>M</var> ≤
⌊√<var>n</var>⌋ =
<span class="intermediate" data-bind="text: floorSqrtN"></span></span>,
so <span class="fake-katex"><var>n</var></span> is prime iff
<span class="fake-katex">(<var>X</var> +
<var>a</var>)<sup><var>n</var></sup>
≡ <var>X</var><sup><var>n</var></sup> + <var>a</var>
(mod <var>X</var><sup><var>r</var></sup> − 1,
<var>n</var>)</span> for
<span class="fake-katex">0 ≤ <var>a</var>
< <var>M</var></span>.
</span>
</script>
</p>
<script type="text/javascript" src="/primality-testing-polynomial-time-part-2-files/aks-example.js"></script>
<p><em>(To-do: Have an interactive box to demonstrate how the
per-\(a\) AKS test works.)</em></p>
</section>
<section>
<header>
<h2>8. The AKS algorithm (improved version)</h2>
</header>
<div class="p">Here is a slightly more complicated version of the AKS algorithm.
Again given \(n \ge 2\):
<ol>
<li>Search for a prime factor of \(n\) less than \(\lceil \lg n
\rceil^2 + 2\). If one is found, return “composite”.</li>
<li>For each \(r\) from \(\lceil \lg n \rceil^2 + 2\):
<ol>
<li>If \(r \gt \lfloor \sqrt{n} \rfloor\), return
“prime”.</li>
<li>If \(r\) divides \(n\), return “composite”.</li>
<li>Otherwise, factorize \(r\).</li>
<li>Compute \(o_r(n)\) using \(r\)'s prime factors. If it is less
than or equal to \(\lceil \lg n \rceil^2\), jump back to the top of
the loop with the next \(r\).</li>
<li>Otherwise, compute \(φ(r)\) using \(r\)'s prime factors.</li>
<li>Compute \(M = \lfloor \sqrt{φ(r)} \rfloor \lceil \lg n
\rceil + 1\), and break out of the loop.</li>
</ol>
</li>
<li>For each \(1 \le a \lt M\), compute \((X + a)^n\), reducing
coefficients mod \(n\) and powers mod \(r\). If the result is not
equal to \(X^{n\text{ mod }r} + a\), return
“composite”.</li>
<li>Otherwise, return “prime”.</li>
</ol>
</div>
<p>The logic of steps 1 to 3 of the simple version is essentially
merged together to form steps 1 and 2 of this version; since each
\(r\) has to be checked for co-primality with \(n\), that effectively
also checks if \(r\) is a prime factor of \(n\), so we only have to
check for prime factors of \(n\) up to the lower bound of \(r\).
Furthermore, both the multiplicative order as well as the totient
function can be computed more quickly given a complete prime
factorization, so we can compute that for each \(r\). Third, we use
\(φ(r)\) instead of \(r - 1\) to give a tighter bound for \(M\).
Finally, the last two steps are the same as in the simple version.</p>
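<div class="p">To illustrate why a complete factorization helps, here is a minimal sketch using <code>BigInt</code>. It computes \(φ(r)\) directly from the prime factors of \(r\), and then finds \(o_r(n)\) with the textbook approach of starting from \(φ(r)\) and dividing out primes while the power stays congruent to \(1\). All function names are hypothetical, \(n\) is assumed to be relatively prime to \(r\), and this is not necessarily how the CRT-based function used in the code below works:
<pre class="code-container"><code class="language-javascript">// Returns the prime factorization of r as an array of [p, k] pairs,
// found by trial division.
function factorizeSketch(r) {
  const factors = [];
  for (let p = 2n; p * p <= r; p++) {
    let k = 0;
    while (r % p === 0n) {
      r /= p;
      k++;
    }
    if (k > 0) factors.push([p, k]);
  }
  if (r > 1n) factors.push([r, 1]);
  return factors;
}

// Returns b^e mod m by repeated squaring.
function powModSketch(b, e, m) {
  let result = 1n % m;
  b %= m;
  for (; e > 0n; e >>= 1n) {
    if (e & 1n) result = (result * b) % m;
    b = (b * b) % m;
  }
  return result;
}

// phi(r) is the product of p^(k-1) * (p - 1) over the prime powers p^k of r.
function eulerPhiSketch(rFactors) {
  let phi = 1n;
  for (const [p, k] of rFactors) {
    phi *= p ** BigInt(k - 1) * (p - 1n);
  }
  return phi;
}

// The order of n mod r divides phi(r), so start from phi(r) and strip
// each prime q for as long as n^(o/q) is still 1 mod r.
function multiplicativeOrderSketch(n, r, rFactors) {
  let o = eulerPhiSketch(rFactors);
  for (const [q] of factorizeSketch(o)) {
    while (o % q === 0n && powModSketch(n, o / q, r) === 1n) {
      o /= q;
    }
  }
  return o;
}</code></pre>
For example, with \(r = 7\) this gives \(φ(7) = 6\), \(o_7(2) = 3\), and \(o_7(3) = 6\).
</div>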
<div class="p">Here are steps 1 and 2 of the above algorithm, implemented in
Javascript:
<pre class="code-container"><code class="language-javascript">function getAKSParameters(n, factorizer) {
n = SNat.cast(n);
factorizer = factorizer || defaultFactorizer;
var ceilLgN = new SNat(n.ceilLg());
var ceilLgNSq = ceilLgN.pow(2);
var floorSqrtN = n.floorRoot(2);
var rLowerBound = ceilLgNSq.plus(2);
var rUpperBound = calculateAKSModulusUpperBound(n).min(floorSqrtN);
var parameters = {
n: n
};
var factor = getFirstFactorBelow(n, rLowerBound);
if (factor) {
parameters.factor = factor;
return parameters;
}
for (var r = rLowerBound; r.le(rUpperBound); r = r.plus(1)) {
if (n.mod(r).isZero()) {
parameters.factor = r;
return parameters;
}
var rFactors = getFactors(r, factorizer);
var o = calculateMultiplicativeOrderCRTFactors(n, rFactors, factorizer);
if (o.gt(ceilLgNSq)) {
parameters.r = r;
parameters.M = calculateAKSUpperBoundFactors(n, rFactors);
return parameters;
}
}
if (rUpperBound.eq(floorSqrtN)) {
parameters.isPrime = true;
return parameters;
}
throw new Error('Could not find AKS modulus');
}</code></pre>
</div>
</section>
<p><em>(To-do: Wrap up and lead into what will be shown in part
3.)</em></p>
<hr />
<p>Like this post? Subscribe to
<!-- The image is 256x256, the center of the dot is 189 pixels from the
top, and the radius of the dot is 24. Therefore, the dot is 43/256 =
0.16796875 of the image height above the bottom.-->
<a href="feed/atom">my feed <img src="feed-icon.svg" alt="RSS icon" style="width: 1em; height: 1em; vertical-align: -0.16796875em;" /></a>
or follow me on
<a href="https://twitter.com/fakalin">Twitter <img src="twitter-icon.svg" alt="Twitter icon" style="width: 1em; height: 1em;" /></a>.</p>
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] This is a version of Theorem 2 from Lenstra's
paper <a href="http://www.math.leidenuniv.nl/~hwl/PUBLICATIONS/1979e/art.pdf">Miller's
Primality Test</a>.
<a href="#r1">↩</a></p>
<p id="fn2">[2] We work with \(\lceil \lg n \rceil^2\) instead of
\(\lceil \lg^2 n \rceil\) or \(\lg^2 n\) as it's easier to work
with in an actual implementation.
<a href="#r2">↩</a></p>
<p id="fn3">[3] This is exercise 1.27
from <a href="http://www.amazon.com/Prime-Numbers-A-Computational-Perspective/dp/0387252827">Prime
Numbers: A Computational Perspective</a>.
<a href="#r3">↩</a></p>
<p id="fn4">[4] This is adapted from section 8.4 of Granville's <a href="http://www.dms.umontreal.ca/~andrew/PDF/Bulletin04.pdf">It
is Easy to Determine Whether a Given Number is Prime</a>.
<a href="#r4">↩</a></p>
<p id="fn5">[5] The <a href="https://cdn.rawgit.com/akalin/num.js/eab08d4/simple-arith.js"><code>SNat</code></a>
class used is the same as in my previous
article, <a href="intro-primality-testing">An Introduction to
Primality Testing</a>.
<a href="#r5">↩</a></p>
</section>
https://www.akalin.com/primality-testing-polynomial-time-part-1
Primality Testing in Polynomial Time (Ⅰ)
2012-08-06T00:00:00-07:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<section>
<header>
<h2>1. Introduction</h2>
</header>
<p>Exactly ten years
ago, <a href="http://www.cse.iitk.ac.in/users/manindra/">Agrawal</a>,
<a href="http://research.microsoft.com/en-us/people/neeraka/">Kayal</a>,
and <a href="http://www.math.uni-bonn.de/people/saxena/">Saxena</a>
published <a href="http://www.cse.iitk.ac.in/users/manindra/algebra/primality_v6.pdf">“PRIMES
is in P”</a>, which described an algorithm that could provably
determine whether a given number was prime or composite in polynomial
time.</p>
<p>The AKS algorithm is quite short, but understanding how it works
via the proofs in the paper requires some mathematical sophistication.
Also, some results in the last decade have simplified both the
algorithm and its accompanying proofs. In this article I will explain
in detail the main result of the AKS paper, and in a follow-up article
I will strengthen the main result, use it to get a polynomial-time
primality testing algorithm, and implement that algorithm in
Javascript. If you've
understood <a href="/intro-primality-testing">my introduction to
primality testing</a>, you should be able to follow along.</p>
<div class="p">Let's get started! The basis for the AKS primality test is the
following generalization
of <a href="http://en.wikipedia.org/wiki/Fermat%27s_little_theorem">Fermat's
little theorem</a> to polynomials:
<div class="theorem">
(<span class="theorem-name">Fermat's little theorem for polynomials,
strong version</span>.) If \(n \ge 2\) and \(a\) is relatively prime
to \(n\), then \(n\) is prime if and only if
\[
(X + a)^n \equiv X^n + a \pmod{n}\text{.}
\]
</div>
</div>
<p>The form of the equation above may be unfamiliar. The polynomials
in question
are <a href="http://en.wikipedia.org/wiki/Polynomial_ring#The_polynomial_ring_K.5BX.5D"><em>formal
polynomials</em></a>. That is, we care only about the coefficients of
the polynomial and not how it behaves as a function. In this case, we
restrict ourselves to polynomials with integer coefficients. Then we
can meaningfully compare two polynomials modulo \(n\): we consider two
polynomials congruent modulo \(n\) if their respective coefficients
are all congruent modulo \(n\). (Equivalently, two polynomials
\(f(X)\) and \(g(X)\) are congruent modulo \(n\) if \(f(X) - g(X) = n
\cdot h(X)\) for some polynomial \(h(X)\).) This definition is
consistent with how they behave as functions; if two polynomials
\(f(X)\) and \(g(X)\) are congruent modulo \(n\), then treating them
as functions, \(f(x) \equiv g(x) \pmod{n}\) for any integer
\(x\).<sup><a href="#fn1" id="r1">[1]</a></sup></p>
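<div class="p">As a small illustration of the strong version, here is a minimal sketch that compares the two sides coefficient by coefficient, using the fact that the coefficient of \(X^k\) in \((X + a)^n\) is \(\binom{n}{k} a^{n - k}\). The function name is hypothetical, and the loop over all \(n + 1\) coefficients makes it usable only for small \(n\):
<pre class="code-container"><code class="language-javascript">// Returns whether (X + a)^n and X^n + a are congruent modulo n as
// formal polynomials. n and a are BigInts, with n >= 2.
function satisfiesStrongCondition(n, a) {
  let binom = 1n; // C(n, 0)
  for (let k = 0n; k <= n; k++) {
    // Update C(n, k - 1) to C(n, k); the division is always exact.
    if (k > 0n) binom = (binom * (n - k + 1n)) / k;
    const lhs = binom * a ** (n - k);
    // X^n + a has coefficient 1 at k = n and a at k = 0.
    let rhs = 0n;
    if (k === n) rhs += 1n;
    if (k === 0n) rhs += a;
    if ((lhs - rhs) % n !== 0n) return false;
  }
  return true;
}</code></pre>
For instance, the condition holds for \(n = 7\), \(a = 3\), but fails for \(n = 9\), \(a = 2\).
</div>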
<div class="p">Unfortunately, this test by itself cannot give a polynomial-time
algorithm as testing even one value of \(a\) may require looking at
\(n\) coefficients of the left-hand side. (Remember that we're
interested in algorithms with time polynomial not in the input \(n\),
but in its bit length \(\lg n\). Such an algorithm is described as
having time <em>polylog in \(n\)</em>.) However, we can reduce the
number of coefficients we have to look at by taking the powers of
\(X\) modulo some number \(r\). This is equivalent to taking the
modulo of the polynomials themselves by \(X^r - 1\); you can see this
for yourself by picking some polynomial and some value for \(r\) and
doing long division by \(X^r - 1\) to find the remainder. (It may
seem weird to talk about taking the modulo of one polynomial with
another, but it's entirely analogous to integers.) This gives us a
weaker version of the theorem above:
<div class="theorem">
(<span class="theorem-name">Fermat's little theorem for polynomials,
weak version</span>.) If \(n\) is prime and \(a\) is not a multiple
of \(n\), then for any \(r \ge 2\)
\[
(X + a)^n \equiv X^n + a \pmod{X^r - 1, n}\text{.}
\]</div>
</div>
<p>The “double mod” notation above may be unfamiliar, but
in this case its meaning is simple. We consider two polynomials
congruent modulo \(X^r - 1, n\) when they are congruent modulo \(n\)
after you reduce the powers of \(X\) modulo \(r\) and combine like
terms. More generally, two polynomials \(f(X)\) and \(g(X)\) are
congruent modulo \(n(X), n\) if \(f(X) - g(X) \equiv n(X) \cdot h(X)
\pmod{n}\) for some polynomial \(h(X)\).</p>
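<div class="p">The reduction described above can be sketched as follows, with a polynomial represented as an array of <code>BigInt</code> coefficients, lowest degree first. The function names are hypothetical:
<pre class="code-container"><code class="language-javascript">// Reduces a polynomial modulo (X^r - 1, n): the X^k term is folded
// onto X^(k mod r), and every coefficient is reduced into [0, n).
function reduceDoubleMod(coeffs, r, n) {
  const out = new Array(r).fill(0n);
  coeffs.forEach((c, k) => {
    const i = k % r;
    out[i] = (((out[i] + c) % n) + n) % n;
  });
  return out;
}

// Two polynomials are congruent modulo (X^r - 1, n) exactly when
// their reductions agree coefficient by coefficient.
function congruentDoubleMod(f, g, r, n) {
  const rf = reduceDoubleMod(f, r, n);
  const rg = reduceDoubleMod(g, r, n);
  return rf.every((c, i) => c === rg[i]);
}</code></pre>
For example, with \(r = 3\) and \(n = 5\), the polynomial \(X^5 + 4X^3 + 7\) reduces to \(X^2 + 1\), since \(X^5\) folds onto \(X^2\), \(4X^3\) folds onto the constant term, and \(7 + 4 = 11 \equiv 1 \pmod{5}\).
</div>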
<!-- TODO(akalin): Put interactive applet for the condition here. -->
<p>With this theorem, we only have to compare \(r\) coefficients, but
we introduce the possibility of the condition above being met even
when \(n\) is composite. But can we impose conditions on \(r\) and
\(a\) such that if the condition holds for a polynomial number of
pairs of \(r\) and \(a\), we can be sure that \(n\) is prime? The
answer is “yes”; in particular, we can find a single \(r\)
and an upper bound \(M\) polylog in \(n\) such that if the condition
holds for \(r\) and \(0 \le a \lt M\), then \(n\) is prime.</p>
<p>In the remainder of this article, we'll work backwards. That is,
we'll first assume we have some \(n \ge 2\), \(r \ge 2\), and \(M \ge
1\) such that for all \(0 \le a \lt M\)
\[
(X + a)^n \equiv X^n + a \pmod{X^r - 1, n}\text{.}
\]
Then we'll assume that \(n\) is not a power of one of its prime
divisors \(p\) and try to deduce the conditions that imposes on \(n\),
\(r\), \(M\), and \(p\). Then we can take the contrapositive to find
the inverse conditions on \(n\), \(r\), \(M\), and \(p\) that would
then force \(n\) to be a power of \(p\). Since we can easily test
whether \(n\) is
a <a href="http://en.wikipedia.org/wiki/Perfect_power">perfect
power</a>, if it's not one, we can immediately conclude that \(n =
p^1\) and thus prime. (Of course, if it does turn out to be a perfect
power, then it is trivially composite.)</p>
<p>To understand the conditions that we will derive, we must first
talk about <em>introspective numbers</em>.</p>
</section>
<section>
<header>
<h2>2. Introspective numbers</h2>
</header>
<p>Given a base \(b\), a polynomial \(g(X)\) and a number \(q\), we
call \(q\) <em>introspective</em><sup><a href="#fn2" id="r2">[2]</a></sup> for \(g(X)\) modulo \(b\) if
\[
g(X)^q \equiv g(X^q) \pmod{X^r - 1, b}\text{.}
\]</p>
<p>We also say that \(g(X)\) is <em>introspective</em> under \(q\)
modulo \(b\).</p>
<p>A basic property of introspective numbers and polynomials is that
they are closed under multiplication. That is, if \(q_1\) and \(q_2\)
are introspective for \(g(X)\) modulo \(b\), then \(q_1 \cdot q_2\) is
also introspective for \(g(X)\) modulo \(b\), and if \(g_1(X)\) and
\(g_2(X)\) are introspective under \(q\) modulo \(b\), then \(g_1(X)
\cdot g_2(X)\) is also introspective under \(q\) modulo \(b\).</p>
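<div class="p">Here is a minimal sketch of an introspection check, with the congruence taken modulo \((X^r - 1, b)\), which is how the condition is applied in the rest of this article. Polynomials are arrays of \(r\) <code>BigInt</code> coefficients in \([0, b)\), lowest degree first, and all names are hypothetical:
<pre class="code-container"><code class="language-javascript">// Multiplies two polynomials modulo (X^r - 1, b).
function mulModXr(x, y, r, b) {
  const out = new Array(r).fill(0n);
  for (let i = 0; i < r; i++) {
    for (let j = 0; j < r; j++) {
      const k = (i + j) % r;
      out[k] = (out[k] + x[i] * y[j]) % b;
    }
  }
  return out;
}

// Computes g(X)^q modulo (X^r - 1, b) by repeated squaring.
function powModXr(g, q, r, b) {
  let result = new Array(r).fill(0n);
  result[0] = 1n % b;
  let base = g;
  for (; q > 0n; q >>= 1n) {
    if (q & 1n) result = mulModXr(result, base, r, b);
    base = mulModXr(base, base, r, b);
  }
  return result;
}

// Computes g(X^q) modulo (X^r - 1, b): the X^k term moves to
// X^(k * q mod r).
function substitutePower(g, q, r, b) {
  const out = new Array(r).fill(0n);
  g.forEach((c, k) => {
    const i = Number((BigInt(k) * q) % BigInt(r));
    out[i] = (out[i] + c) % b;
  });
  return out;
}

// Returns whether q is introspective for g(X) modulo b.
function isIntrospective(q, g, r, b) {
  const lhs = powModXr(g, q, r, b);
  const rhs = substitutePower(g, q, r, b);
  return lhs.every((c, i) => c === rhs[i]);
}</code></pre>
With \(r = 3\) and \(b = 5\), both \(5\) and \(5 \cdot 5 = 25\) are introspective for \(X + 1\), in line with closure under multiplication, while \(2\) is not.
</div>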
<p>In particular, given our assumptions above, we can easily see that
\(1\), \(p\), and \(n\) are introspective for \(X + a\) modulo \(p\)
for any \(0 \le a \lt M\). We can also show that \(n/p\) is also
introspective for \(X + a\) modulo \(p\). Using closure under
multiplication, we can talk about the set of numbers generated by
\(p\) and \(n/p\), which are all introspective for \(X + a\) modulo
\(p\). Call this set \(I\):</p>
\[
I = \left\{ p^i \left( n/p \right)^j \mid i, j \ge 0 \right\}\text{.}
\]
<p>We can also take the closure of all \(X + a\) to get a set of
polynomials which are all introspective under \(p\), \(n/p\), or any
number in \(I\). Call this set \(P\):
\[
P = \left\{ 0 \right\} \cup
\left\{ X^{e_0} \cdot (X + 1)^{e_1} \dotsm (X + M -
1)^{e_{M - 1}} \mid e_0, e_1, \dotsc, e_{M - 1} \ge 0 \right\}\text{.}
\]
To summarize, \(I\) is a set of numbers and \(P\) is a set of
polynomials such that for any \(i \in I\) and \(g(X) \in P\), \(i\) is
introspective for \(g(X)\) modulo \(p\). Of course, it's still not
clear what these two sets have to do with whether \(n\) is prime or
not. But we will examine certain finite sets related to \(I\) and
\(P\) and their sizes, and we will see that we can deduce their
properties depending on the relation of \(n\) to \(p\).</p>
</section>
<section>
<header>
<h2>3. Bounds on finite sets related to \(I\) and \(P\)</h2>
</header>
<p>Now we're ready to work towards finding our restrictions on \(n\),
\(r\), \(M\), and \(p\). We'll slowly build them up such that when
the last one falls into place, we know that \(n\) is a perfect power
of \(p\). Here's what we're starting with:</p>
<div class="insert">
\(n \ge 2\), <br/>
\(r \ge 2\), <br/>
\(M \ge 1\), <br/>
\(p\) is a prime divisor of \(n\).
</div>
<p>Let us restrict \(I\) to a finite set by bounding the exponents of
\(p\) and \(n/p\):
\[
I_k = \left\{ p^i (n/p)^j \mid 0 \le i, j \lt k \right\} \subset I\text{.}
\]</p>
<p>Notice that if \(n\) is not a power of \(p\), then all members of
\(I_k\) are distinct, and therefore we can easily calculate its
size:<sup><a href="#fn3" id="r3">[3]</a></sup>
\[
|I_k| = k^2\text{.}
\]</p>
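<div class="p">A quick way to see the role of the assumption on \(n\) is to build \(I_k\) explicitly. The following is a minimal sketch with a hypothetical function name, taking <code>BigInt</code> arguments:
<pre class="code-container"><code class="language-javascript">// Returns the set I_k of all p^i * (n/p)^j with 0 <= i, j < k.
function buildIk(n, p, k) {
  const members = new Set();
  for (let i = 0n; i < k; i++) {
    for (let j = 0n; j < k; j++) {
      members.add(p ** i * (n / p) ** j);
    }
  }
  return members;
}</code></pre>
For \(n = 12\), \(p = 2\), \(k = 3\) the set has all \(3^2 = 9\) members, but for \(n = 8 = 2^3\) some products coincide (for example \(p^2 = n/p = 4\)) and only \(7\) distinct members remain.
</div>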
<p>Let's also restrict \(P\) to a finite set by bounding the degrees
of its polynomials:
\[
P_d = \left\{ g \in P \mid \deg(g) \lt d \right\} \subset P\text{.}
\]</p>
<p>We can calculate \(|P_d|\) exactly,<sup><a href="#fn4" id="r4">[4]</a></sup> but
we only need a lower bound for when \(d \le M\). Consider \(P_d^{\{0,
1\}}\), the subset of \(P_d\) where each \(X + a\) is present at most
once. Since \(d \le M\), we can restrict ourselves to the \(d\) factors
\(X, X + 1, \dotsc, X + d - 1\). Each of them is either present or not
present, but not all of them can be present at the same time (the
degree would then be \(d\)), so there are at least \(2^d - 1\)
distinct polynomials in \(P_d^{\{0, 1\}}\). Adding back the zero
polynomial yields \(|P_d^{\{0, 1\}}| \ge 2^d\). Since \(P_d^{\{0,
1\}}\) is a subset of \(P_d\), \(|P_d| \ge |P_d^{\{0, 1\}}| \ge 2^d\).
Therefore, if \(d \le M\), then<sup><a href="#fn5" id="r5">[5]</a></sup>
\[ |P_d| \ge 2^d\text{.} \]
This will be useful later (for a particular value of \(d\)), so let's
add the restriction to \(M\):
</p>
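<div class="p">To get a feel for how loose this bound is, here is a minimal sketch that counts \(|P_d|\) for small \(M\) and \(d\). It relies on the fact that distinct exponent tuples \((e_0, \dotsc, e_{M - 1})\) give distinct polynomials, by unique factorization. The function name is hypothetical:
<pre class="code-container"><code class="language-javascript">// Counts the members of P_d: one per exponent tuple (e_0, ..., e_{M-1})
// whose sum is less than d, plus one for the zero polynomial.
function countPd(M, d) {
  // Counts tuples for the remaining factors with total degree at most
  // degreeLeft.
  function count(factorsLeft, degreeLeft) {
    if (factorsLeft === 0) return 1;
    let total = 0;
    for (let e = 0; e <= degreeLeft; e++) {
      total += count(factorsLeft - 1, degreeLeft - e);
    }
    return total;
  }
  return count(M, d - 1) + 1;
}</code></pre>
For \(M = d = 3\) this gives \(11 \ge 2^3\), and for \(M = d = 2\) it gives exactly \(4 = 2^2\).
</div>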
<div class="insert">
\(n \ge 2\), <br/>
\(r \ge 2\), <br/>
<em>\(M \ge d\)</em>, <br/>
\(p\) is a prime divisor of \(n\).
</div>
<p>Let us restrict \(I\) in a different way, by reducing modulo \(r\):
\[
J = \left\{ x \bmod r \mid x \in I \right\}
\]
and let \(t = |J|\). (This size will play an important role
later.)</p>
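<div class="p">Since \(J\) is finite, it can be computed by repeatedly multiplying by the two generators modulo \(r\) until nothing new appears. Here is a minimal sketch with a hypothetical function name, taking <code>BigInt</code> arguments:
<pre class="code-container"><code class="language-javascript">// Returns J, the set of residues modulo r of all p^i * (n/p)^j.
function buildJ(n, p, r) {
  const generators = [p % r, (n / p) % r];
  const seen = new Set([1n % r]);
  const pending = [1n % r];
  while (pending.length > 0) {
    const x = pending.pop();
    for (const g of generators) {
      const y = (x * g) % r;
      if (!seen.has(y)) {
        seen.add(y);
        pending.push(y);
      }
    }
  }
  return seen;
}</code></pre>
For \(n = 15\), \(p = 3\), \(r = 8\) this yields \(J = \{1, 3, 5, 7\}\), so \(t = 4\); for \(n = 35\), \(p = 5\), \(r = 11\) it yields all ten nonzero residues, so \(t = 10\).
</div>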
<p>Our final set that we're interested in needs some background to
define. We want to find a subset of \(P\) that lies in some field
\(F\) because fields have some convenient properties that we will use
later.<sup><a href="#fn6" id="r6">[6]</a></sup></p>
<p>Consider \(\mathbb{Z}/p\mathbb{Z}\), the ring
of <a href="http://en.wikipedia.org/wiki/Integers_modulo_n#Integers_modulo_n">integers
modulo \(p\)</a>. Since \(p\) is prime, it is also a field. In
particular, it is
the <a href="http://en.wikipedia.org/wiki/Finite_field">finite
field</a> \(\mathbb{F}_p\) of order \(p\). Then consider
\(\mathbb{F}_p[X]\),
its <a href="http://en.wikipedia.org/wiki/Polynomial_ring">polynomial
ring</a>, which is the set of polynomials with coefficients in
\(\mathbb{F}_p\). Given some polynomial \(q(X) \in \mathbb{F}_p[X]\),
we can further reduce modulo \(q(X)\) to get \(\mathbb{F}_p[X] /
q(X)\). Finally, if \(q(X)\) is
<a href="http://en.wikipedia.org/wiki/Irreducible_polynomial">irreducible</a>
over \(\mathbb{F}_p\), then \(\mathbb{F}_p[X] / q(X)\) is also a
field.</p>
<p>(We can show that both \(\mathbb{F}_p = \mathbb{Z}/p\mathbb{Z}\)
and \(\mathbb{F}_p[X] / q(X)\) are fields from the same general
theorem of rings: if \(R\) is
a <a href="http://en.wikipedia.org/wiki/Principal_ideal_domain">principal
ideal domain</a> and \((c)\) is
the <a href="http://en.wikipedia.org/wiki/Two-sided_ideal#Ideal_generated_by_a_set">two-sided
ideal generated by \(c\)</a>, then
the <a href="http://en.wikipedia.org/wiki/Quotient_ring">quotient
ring</a> \(R / (c)\) is a field if and only if \(c\) is
a <a href="http://en.wikipedia.org/wiki/Prime_element">prime
element</a> of \(R\).)<sup><a href="#fn7" id="r7">[7]</a></sup></p>
<p>So we just need to find a polynomial that's irreducible over
\(\mathbb{F}_p\). We know that \(X^r - 1\) has \(Φ_r(X)\), the
\(r\)th <a href="http://en.wikipedia.org/wiki/Cyclotomic_polynomial">cyclotomic
polynomial</a>, as a factor. \(Φ_r(X)\) is irreducible over
\(\mathbb{Z}\), but not necessarily over \(\mathbb{F}_p\). But if
\(r\) is relatively prime to \(p\), then \(Φ_r(X)\) factors into
irreducible polynomials all of degree \(o_r(p)\)
(the <a href="http://en.wikipedia.org/wiki/Multiplicative_order">multiplicative
order</a> of \(p\) modulo \(r\)) over \(\mathbb{F}_p\).<sup><a href="#fn8" id="r8">[8]</a></sup> Then we can
just require that \(r\) be relatively prime to \(p\). If we do so,
then we can let \(h(X)\) be one of the factors of \(Φ_r(X)\) over
\(\mathbb{F}_p\) and we have our field \(F = \mathbb{F}_p[X] /
h(X)\).</p>
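<div class="p">As a concrete check of this factorization, take \(r = 7\) and \(p = 2\). Then \(o_7(2) = 3\), and \(Φ_7(X) = X^6 + X^5 + \dotsb + 1\) splits over \(\mathbb{F}_2\) into the two cubics \(X^3 + X + 1\) and \(X^3 + X^2 + 1\). The following minimal sketch, with hypothetical names and coefficient arrays listed lowest degree first, verifies both facts:
<pre class="code-container"><code class="language-javascript">// Returns the multiplicative order of p modulo r, assuming r >= 2 and
// gcd(p, r) = 1, by stepping through the powers of p.
function naiveOrder(p, r) {
  let order = 1;
  let x = p % r;
  while (x !== 1) {
    x = (x * p) % r;
    order++;
  }
  return order;
}

// Multiplies two polynomials with coefficients reduced modulo p.
function mulPolyModP(f, g, p) {
  const out = new Array(f.length + g.length - 1).fill(0);
  f.forEach((a, i) => {
    g.forEach((b, j) => {
      out[i + j] = (out[i + j] + a * b) % p;
    });
  });
  return out;
}

const h1 = [1, 1, 0, 1]; // 1 + X + X^3
const h2 = [1, 0, 1, 1]; // 1 + X^2 + X^3
const phi7 = [1, 1, 1, 1, 1, 1, 1]; // 1 + X + ... + X^6
const product = mulPolyModP(h1, h2, 2);
const factorsMatch = product.every((c, i) => c === phi7[i]);</code></pre>
Either cubic could then serve as \(h(X)\), giving a field \(F\) with \(2^3 = 8\) elements.
</div>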
<div class="insert">
\(n \ge 2\), <br/>
\(r \ge 2\), <em>\(r\) relatively prime to \(p\)</em>,<br/>
\(M \ge d\), <br/>
\(p\) is a prime divisor of \(n\).
</div>
<p>Finally, we can define our last set. Let
\[
Q = \left\{ f(X) \bmod (h(X), p) \mid f(X) \in P \right\} \subseteq F\text{.}
\]</p>
<p>We can map elements of \(P\) into \(Q\) via reduction modulo
\((h(X), p)\). But we're interested in only the elements of \(P\)
that map to distinct elements of \(Q\), since that will let us find a
lower bound for \(|Q|\). A simple example would be the set of \(X +
a\) for \(0 \le a \lt M\); if the degree of \(h(X)\) is greater than
\(1\) and \(p \ge M\), then each \(X + a\) is distinct in \(Q\).</p>
<p>Another interesting set is \(X^k\) for \(1 \le k \le r\). Since
\(h(X) \equiv 0 \pmod{h(X), p}\), we can say that \(X\) is a root of
the polynomial function \(h(y)\) over the field \(F\). But since
\(h(y)\) is a factor of \(Φ_r(y)\), \(X\) is then a primitive
\(r\)th root of unity in \(Q\).<sup><a href="#fn9" id="r9">[9]</a></sup> But the powers of a primitive \(r\)th
root of unity (from \(1\) to \(r\)) are all distinct. Therefore all
\(X^k\) for \(1 \le k \le r\) are distinct in \(Q\).</p>
<p>Most importantly, we can show that distinct elements in \(P_d\) map
to distinct elements in \(Q\) if \(d \le t\). Let \(f(X)\) and
\(g(X)\) be two different elements of \(P_d\). Assume that \(f(X)
\equiv g(X) \pmod{h(X), p}\). Then, for \(m \in I\):
\[
f(X^m) \equiv f(X)^m \pmod{X^r - 1, p}
\]
and
\[
g(X^m) \equiv g(X)^m \pmod{X^r - 1, p}
\]
by introspection modulo \(p\), and therefore
\[
f(X^m) \equiv g(X^m) \pmod{X^r - 1, p}
\]
which immediately leads to
\[
f(X^m) \equiv g(X^m) \pmod{h(X), p}\text{.}
\]
Therefore, all \(X^m\) for \(m \in I\) are roots of the polynomial
function \(u(y) = f(y) - g(y)\) over the field \(F\), and in
particular all \(X^m\) for \(m \in J\). But all such \(X^m\)
are distinct in \(Q\) by the argument above. Therefore, \(u(y)\) must
have degree at least \(t\) since a polynomial over a field cannot have
more roots than its degree. But the degree of \(u(y)\) is less than
\(d\) since both \(f(y)\) and \(g(y)\) have degree less than \(d\).
Since \(d \le t\), this is a contradiction, so therefore \(f(X)
\not\equiv g(X) \pmod{h(X), p}\). But since \(f(X)\) and \(g(X)\)
were arbitrary, that implies that distinct elements of \(P_d\) map to
distinct elements of \(Q\) for \(d \le t\).</p>
<p>Given the above, we can conclude that as long as we require that
\(d \le t\), \(p \ge M\), and \(o_r(p) = \deg(h(X)) \gt 1\), then
\[
|Q| \ge |P_d| \ge 2^d\text{.}
\]</p>
<div class="insert">
\(n \ge 2\), <br/>
<em>\(o_r(p) \gt 1\)</em>,<br/>
\(M \ge d\), <br/>
<em>\(t \ge d\)</em>,<br/>
<em>\(p \ge M\)</em>, \(p\) is a prime divisor of \(n\).
</div>
</section>
<section>
<header>
<h2>4. The AKS theorem (weak version)</h2>
</header>
<p>We're finally ready to put it all together. Again assume \(n\) is
not a power of \(p\), and recall that \(|J| = t\). Let \(s \gt
\sqrt{t}\). Then \(|I_s| = s^2 \gt t\). By
the <a href="http://en.wikipedia.org/wiki/Pigeonhole_principle">pigeonhole
principle</a>, there must be two distinct elements \(m_1, m_2 \in I_s\) that
map to the same element in \(J\); that is, there must be \(m_1 \ne m_2\)
in \(I_s\) such that \(m_1 \equiv m_2 \pmod{r}\). Now pick some
\(g(X)\) from \(P\). Then
\[
g(X)^{m_1} \equiv g(X^{m_1}) \pmod{X^r - 1, p}
\]
and
\[
g(X)^{m_2} \equiv g(X^{m_2}) \pmod{X^r - 1, p}
\]
by introspection modulo \(p\). But \(X^{m_1} \equiv X^{m_2} \pmod{X^r - 1}\) since \(m_1 \equiv m_2 \pmod{r}\), so
\[
g(X^{m_1}) \equiv g(X^{m_2}) \pmod{X^r - 1, p}\text{.}
\]
Chaining all these congruences together lets us deduce that
\[
g(X)^{m_1} \equiv g(X)^{m_2} \pmod{X^r - 1, p}\text{,}
\]
which immediately leads to
\[
g(X)^{m_1} \equiv g(X)^{m_2} \pmod{h(X), p}\text{.}
\]
</p>
<p>That means that \(g(X) \bmod (h(X), p) \in Q\) is a root of the
polynomial function \(u(y) = y^{m_1} - y^{m_2}\) over the field \(F\).
But \(g(X)\) was picked arbitrarily from \(P\), so \(u(y)\) has at
least \(|Q|\) roots. \(\deg(u(y)) = \max(m_1, m_2) \le p^{s-1} \cdot
(n/p)^{s-1} = n^{s-1}\), and \(u(y)\), being a polynomial over a
field, cannot have more roots than its degree, so if \(n\) is not a
power of \(p\), then \(|Q| \le n^{s-1}\). Equivalently, if \(|Q| \gt
n^{s-1}\), then \(n\) must be a power of \(p\).<sup><a href="#fn10" id="r10">[10]</a></sup> But
we've shown above that \(|Q| \ge 2^d\) for \(d \le t\), so if we can
pick \(d\) and \(s\) such that \(2^d \gt n^{s-1}\), then we can force
\(n\) to be a power of \(p\). Taking logs, we see that this is
equivalent to picking \(d\) and \(s\) such that \(d \gt (s - 1) \lg
n\). Since \(d \le t\), this imposes \(t \gt (s - 1) \lg n\) in order
for there to be room to pick \(d\). Rearranging, we get \(s \lt
\frac{t}{\lg n} + 1\). But \(s \gt \sqrt{t}\), so this imposes
\(\sqrt{t} \lt \frac{t}{\lg n} + 1\) in order for there to be room to
pick \(s\). Rearranging again, we get \(\frac{t}{\sqrt{t} - 1} \gt
\lg n\). Since \(\frac{t}{\sqrt{t} - 1} \gt \sqrt{t}\), it suffices
to require that \(t \gt \lg^2 n\) in order for there to be room to
pick \(d\) and \(s\). Furthermore, since \(s\) has to be an integer,
then \(s \ge \lfloor \sqrt{t} \rfloor + 1\), and therefore \(d \gt
\lfloor \sqrt{t} \rfloor \lg n\). Let's update our assumptions:</p>
<div class="insert">
\(n \ge 2\), <br/>
\(o_r(p) \gt 1\),<br/>
<em>\(M \ge d \gt \lfloor \sqrt{t} \rfloor \lg n\)</em>,<br/>
<em>\(t \gt \lg^2 n\)</em>,<br/>
\(p \ge M\), \(p\) is a prime divisor of \(n\).
</div>
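<div class="p">To make the choice of \(s\) and \(d\) concrete, here is a minimal sketch that picks the smallest valid values for a given \(n\) and \(t\) using exact <code>BigInt</code> arithmetic. The function names are hypothetical. The smallest \(d\) with \(2^d \gt n^{s - 1}\) is simply the bit length of \(n^{s - 1}\):
<pre class="code-container"><code class="language-javascript">// Returns floor(sqrt(n)) for a BigInt n >= 0, using Newton's method.
function floorSqrt(n) {
  if (n < 2n) return n;
  let x = n;
  let y = (x + 1n) / 2n;
  while (y < x) {
    x = y;
    y = (x + n / x) / 2n;
  }
  return x;
}

// Picks s = floor(sqrt(t)) + 1 and the smallest d with 2^d > n^(s-1),
// and reports whether that d still satisfies d <= t.
function pickSAndD(n, t) {
  const s = floorSqrt(t) + 1n;
  const bound = n ** (s - 1n);
  const d = BigInt(bound.toString(2).length);
  return { s: s, d: d, ok: d <= t && (1n << d) > bound };
}</code></pre>
For \(n = 175507\) we have \(\lg^2 n \approx 303.5\), so \(t = 304\) is just large enough: the sketch picks \(s = 18\) and \(d = 297 \le t\). With \(t = 100\), on the other hand, the required \(d\) exceeds \(t\) and no valid choice exists.
</div>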
<p>So to summarize, if we make the above assumptions, we can pick
\(d\) and \(s\) such that \(|Q| \ge 2^d \gt n^{s - 1}\), which implies
that \(n\) must be a power of \(p\), which was our goal. Now we just
have to express all assumptions in terms of \(n\), \(r\), and \(M\),
strengthening them if necessary. \(J\) is generated by \(p\) and
\(n/p\), so it contains every power of \(n = p \cdot (n/p)\) modulo
\(r\), and therefore its order (i.e., \(t\)) is at least \(o_r(n)\)
(this brings along the assumption that \(r\) and \(n\) are relatively
prime). Also, if \(o_r(n) \gt 1\), then \(n \not\equiv 1 \pmod{r}\),
so at least one prime factor of \(n\) is not congruent to \(1\) modulo
\(r\); taking that factor as \(p\) gives \(o_r(p) \gt 1\). Therefore,
we can replace the assumptions \(t \gt \lg^2 n\) and \(o_r(p) \gt 1\)
with \(o_r(n) \gt \lg^2 n\). We can remove the
reference to \(d\) by finding the maximum value of \(t\). Since \(r\)
is relatively prime to \(n\), \(J\) is a subgroup of
\((\mathbb{Z}/r\mathbb{Z})^*\), the multiplicative group of integers
modulo \(r\), and
therefore its order divides (and therefore is at most) \(φ(r)\).
So we can replace \(M \ge d \gt \lfloor \sqrt{t} \rfloor \lg n\) with
\(M \gt \lfloor \sqrt{φ(r)} \rfloor \lg n\). Finally, we can
remove the reference to \(p\) by mandating that \(n\) has no prime
factor less than \(M\). Here are our final assumptions:</p>
<div class="insert">
\(n \ge 2\), <em>\(n\) has no prime factors less than \(M\)</em>,<br/>
<em>\(o_r(n) \gt \lg^2 n\)</em>,<br/>
<em>\(M \gt \lfloor \sqrt{φ(r)} \rfloor \lg n\)</em>.<br/>
</div>
<div class="p">We can summarize the above discussion in the following theorem:
<div class="theorem">
(<span class="theorem-name">AKS theorem, weak version</span>.) Let
\(n \ge 2\), \(r\) be relatively prime to \(n\) with \(o_r(n) \gt
\lg^2 n\), and \(M \gt \lfloor \sqrt{φ(r)} \rfloor \lg n\).
Furthermore, let \(n\) have no prime factor less than \(M\) and let
\[
(X + a)^n \equiv X^n + a \pmod{X^r - 1, n}
\]
for \(0 \le a \lt M\). Then \(n\) is the power of some prime \(p \ge
M\).</div>
</div>
<p>And that's it for now! In the follow-up article we will strengthen
this theorem to further show that \(n\) is equal to \(p\), and
therefore prime. Then we will use this result to get a
primality-testing algorithm that we will prove to be polynomial
time.</p>
</section>
<hr />
<p>Like this post? Subscribe to
<!-- The image is 256x256, the center of the dot is 189 pixels from the
top, and the radius of the dot is 24. Therefore, the dot is 43/256 =
0.16796875 of the image height above the bottom.-->
<a href="feed/atom">my feed <img src="feed-icon.svg" alt="RSS icon" style="width: 1em; height: 1em; vertical-align: -0.16796875em;" /></a>
or follow me on
<a href="https://twitter.com/fakalin">Twitter <img src="twitter-icon.svg" alt="Twitter icon" style="width: 1em; height: 1em;" /></a>.</p>
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] We use uppercase letters for variables when we treat
polynomials as formal polynomials and lowercase letters when we
treat them as functions. <a href="#r1">↩</a></p>
<p id="fn2">[2] The term “introspection”, which comes
from the original AKS paper, was probably chosen to invoke the idea
that the exponent \(q\) can be pushed into and pulled out of \(g(X)\).
Here we generalize it a bit. <a href="#r2">↩</a></p>
<p id="fn3">[3] This condition is too weak to be useful by itself,
but we will parlay it into something we can use later.
<a href="#r3">↩</a></p>
<p id="fn4">[4] Using the ideas
on <a href="http://www.johndcook.com/TwelvefoldWay.pdf">this page</a>,
we can show that \(|P_d| = {M + d - 1 \choose d - 1} + 1\) by
considering each \(X + a\) a labeled urn (plus a
“dummy” urn) and each unit of power an unlabeled
ball. (This was used in the AKS paper.)
<a href="#r4">↩</a></p>
<p id="fn5">[5] This lower bound, as well as other ideas that simplify the
proof, was taken
from <a href="http://www.amazon.com/Prime-Numbers-A-Computational-Perspective/dp/0387252827">Prime
Numbers: A Computational Perspective</a>.
<a href="#r5">↩</a></p>
<p id="fn6">[6] You may first want to brush up on the definitions
of <a href="http://en.wikipedia.org/wiki/Group_(mathematics)">group</a>,
<a href="http://en.wikipedia.org/wiki/Ring_(mathematics)">ring</a>,
and <a href="http://en.wikipedia.org/wiki/Field_(mathematics)">field</a>,
and the differences between them.
<a href="#r6">↩</a></p>
<p id="fn7">[7] This is Theorem 1.47(iv) from
“<a href="http://www.amazon.com/Introduction-Finite-Fields-their-Applications/dp/0521460948">Introduction
to finite fields and their applications</a>”.
<a href="#r7">↩</a></p>
<p id="fn8">[8] The reducibility of \(Φ_r(X)\) over
\(\mathbb{F}_p\) given \(r\) relatively prime to \(p\) is Theorem
2.47(ii) from
“<a href="http://www.amazon.com/Introduction-Finite-Fields-their-Applications/dp/0521460948">Introduction
to finite fields and their applications</a>”.
<a href="#r8">↩</a></p>
<p id="fn9">[9] It's a bit weird to talk about a polynomial being
the root of other polynomials, but recall that we can form a
polynomial ring over any field, even a field of polynomials. We
keep track of which polynomials belong to which domains by using
different variables.
<a href="#r9">↩</a></p>
<p id="fn10">[10] Here's where we force \(n\) to be a prime power.
<a href="#r10">↩</a></p>
</section>
https://www.akalin.com/intro-primality-testing
An Introduction to Primality Testing
2012-07-08T00:00:00-07:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<script type="text/javascript"
src="https://cdnjs.cloudflare.com/ajax/libs/knockout/3.4.0/knockout-min.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/num.js/eab08d4/simple-arith.js"></script>
<script type="text/javascript" src="https://cdn.rawgit.com/akalin/num.js/eab08d4/primality-testing.js"></script>
<p>I will explain two commonly-used primality tests: Fermat and
Miller-Rabin. Along the way, I will cover the basic concepts of
primality testing. I won't be assuming any background in number
theory, but familiarity
with <a href="http://en.wikipedia.org/wiki/Modular_arithmetic">modular
arithmetic</a> will be helpful. I will also be providing
implementations in Javascript,
so <a href="https://developer.mozilla.org/en/JavaScript">familiarity
with it</a> will also be helpful. Finally, since Javascript doesn't
natively support arbitrary-precision arithmetic, I wrote a simple
natural number class
(<a href="https://cdn.rawgit.com/akalin/num.js/eab08d4/simple-arith.js"><code>SNat</code></a>) that
represents a number as an array of decimal digits. All algorithms
used are the simplest possible, except when a more efficient one is
needed by the algorithms we discuss.</p>
<p>Primality testing is the problem of determining whether a given
natural number is prime or composite. Compared to the problem of
<a href="http://en.wikipedia.org/wiki/Integer_factorization">integer
factorization</a>, which is to determine the prime factors of a given
natural number, primality testing turns out to be easier; integer
factorization is
in <a href="http://en.wikipedia.org/wiki/NP_(complexity)">NP</a> and
thought to be neither
in <a href="http://en.wikipedia.org/wiki/P_(complexity)">P</a>
nor <a href="http://en.wikipedia.org/wiki/NP-complete">NP-complete</a>,
whereas primality testing
is <a href="http://www.cse.iitk.ac.in/users/manindra/algebra/primality_v6.pdf">now
known to be in P</a>.</p>
<p>Most primality tests are actually compositeness tests; they involve
finding <em>composite witnesses</em>, which are numbers that, along
with a given number to be tested, can be fed to some easily-computable
function to prove that the given number is composite. (The composite
witness, along with the function, is a <em>certificate of
compositeness</em> of the given number.) A primality test can either
check each possible witness or, like the Fermat and Miller-Rabin
tests, it can randomly sample some number of possible witnesses and
call the number prime if none turn out to be witnesses. In the latter
case, there is a chance that a composite number can erroneously be
called prime; ideally, this chance goes to zero quickly as the sample
size increases.</p>
<p>The simplest possible witness type is, of course, a factor of the
given number, which we'll call a <em>factor witness</em>. If the
number to be tested is \(n\) and the possible factor witness is \(a\),
then one can simply test whether \(a\) divides \(n\) (written as \(a
\mid n\)) by evaluating \(n \bmod a = 0\); that is, whether the
remainder of \(n\) divided by \(a\) is zero. This doesn't yield a
feasible deterministic primality test, though, since checking all
possible witnesses is equivalent to factoring the given number. Nor
does it yield a feasible probabilistic primality test, since in the
worst case the given number has very few factors, which random
sampling would miss.</p>
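<p>To make the cost concrete, here is a minimal sketch of a factor-witness check and the exhaustive search built on it. It assumes a JavaScript engine with built-in <code>BigInt</code> support and does not use <code>SNat</code>; the names <code>isFactorWitness</code> and <code>findFactorWitness</code> are hypothetical, chosen only for illustration.</p>

```javascript
// Hypothetical helper (not part of SNat): returns true iff a is a
// factor witness of n, i.e. 1 < a < n and a divides n.
function isFactorWitness(n, a) {
  n = BigInt(n);
  a = BigInt(a);
  return a > 1n && a < n && n % a === 0n;
}

// Exhaustive search: tries every candidate up to sqrt(n) and returns
// the first factor witness found, or null if there is none. This takes
// on the order of sqrt(n) divisions, which is exponential in the
// number of bits of n.
function findFactorWitness(n) {
  n = BigInt(n);
  for (let a = 2n; a * a <= n; a++) {
    if (isFactorWitness(n, a)) {
      return a;
    }
  }
  return null;
}
```

<p>For example, <code>findFactorWitness(91)</code> finds the factor 7, while <code>findFactorWitness(97)</code> finds nothing because 97 is prime.</p>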
<div class="p">The simplest useful witness type is a <em>Fermat witness</em>,
which relies on the following theorem of Fermat:
<div class="theorem">
(<span class="theorem-name">Fermat's little theorem</span>.) If \(n\)
is prime and \(a\) is not a multiple of \(n\), then
\[
a^{n-1} \equiv 1 \pmod{n}\text{.}
\]
</div>
</div>
<p>Thus, a Fermat witness is a number \(1 \lt a \lt n\) such that
\(a^{n-1} \not\equiv 1 \pmod{n}\). Conversely, if \(n\) is composite
and \(a^{n-1} \equiv 1 \pmod{n}\), then \(a\) is a <em>Fermat
liar</em>.</p>
<p class="interactive-example" id="fermatExample">
Let
<span class="fake-katex"><var>n</var> =
<input class="parameter" size="6" pattern="[0-9]*" required
type="text" value="355207"
data-bind="value: nStr, valueUpdate: 'afterkeydown'" /></span>
and
<span class="fake-katex"><var>a</var> =
<input class="parameter" size="6" pattern="[0-9]*" required
type="text" value="2"
data-bind="value: aStr, valueUpdate: 'afterkeydown'" /></span>.
<!-- ko template: outputTemplate --><!-- /ko -->
<script type="text/html" id="fermat.error.invalidN">
<span class="fake-katex"><var>n</var></span> is not a valid number.
</script>
<script type="text/html" id="fermat.error.invalidA">
<span class="fake-katex"><var>a</var></span> is not a valid number.
</script>
<script type="text/html" id="fermat.error.outOfBoundsN">
<span class="fake-katex"><var>n</var></span> must be greater than
<span class="fake-katex">2</span>.
</script>
<script type="text/html" id="fermat.error.outOfBoundsA">
<span class="fake-katex"><var>a</var></span> must be greater than
<span class="fake-katex">1</span> and less than
<span class="fake-katex"><var>n</var></span>.
</script>
<script type="text/html" id="fermat.success">
Then
<span class="fake-katex"><var>a</var><sup><var>n</var>−1</sup>
≡
<span class="intermediate" data-bind="text: r"></span>
<span data-bind="if: r() && r().ne(1)">≢ 1</span>
(mod <var>n</var>)</span> so therefore
<span class="fake-katex"><var>n</var></span> is
<span data-bind="if: isCompositeByFermat()">
<span class="result">composite</span>.
<span data-bind="if: r() && r().isZero()">
Furthermore,
<span class="fake-katex">gcd(<var>a</var>, <var>n</var>) =
<span class="intermediate" data-bind="text: k"></span></span>
is a non-trivial factor of
<span class="fake-katex"><var>n</var></span>.
</span>
</span>
<span data-bind="ifnot: isCompositeByFermat()">
either <span class="result">prime</span> or a
<span class="result">Fermat pseudoprime base
<span class="fake-katex"><var>a</var></span></span>.
</span>
</script>
</p>
<script type="text/javascript" src="/intro-primality-testing-files/fermat-example.js"></script>
<p>If \(n\) has at least one Fermat witness that is relatively prime to \(n\),
then we can show that at least half of all possible witnesses are
Fermat witnesses. (Roughly, if \(a\) is the Fermat witness and \(a_1,
a_2, \dotsc, a_s\) are Fermat liars, then all \(a \cdot a_i\) are also
Fermat witnesses.) Therefore, for a sample of \(k\) possible
witnesses of \(n\), the probability of all of them being Fermat liars
is \(\le 2^{-k}\), which goes to zero quickly enough to be
practical.</p>
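<p>As a small sanity check of the "at least half" claim, the following sketch counts Fermat witnesses by brute force. It assumes built-in <code>BigInt</code> support rather than <code>SNat</code>, and <code>modPow</code> and <code>countFermatWitnesses</code> are hypothetical helper names.</p>

```javascript
// Computes base^exp mod mod by repeated squaring (BigInt arguments).
function modPow(base, exp, mod) {
  let result = 1n % mod;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) {
      result = (result * base) % mod;
    }
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
}

// Counts how many 1 < a < n are Fermat witnesses of n.
function countFermatWitnesses(n) {
  n = BigInt(n);
  let witnesses = 0;
  let total = 0;
  for (let a = 2n; a < n; a++) {
    total++;
    if (modPow(a, n - 1n, n) !== 1n) {
      witnesses++;
    }
  }
  return { witnesses: witnesses, total: total };
}
```

<p>For \(n = 15\), the only Fermat liars with \(1 \lt a \lt n\) are 4, 11, and 14, so 10 of the 13 candidates are witnesses.</p>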
<p>However, there is the possibility that \(n\) is a composite number
with no relatively prime Fermat witnesses. These are
called <a href="http://en.wikipedia.org/wiki/Carmichael_numbers"><em>Carmichael
numbers</em></a>. Even though Carmichael numbers are rare, their
existence still makes the Fermat primality test unsuitable for some
situations, as when the numbers to be tested are provided by some
adversary.</p>
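<p>The smallest Carmichael number is \(561 = 3 \cdot 11 \cdot 17\). The sketch below checks by brute force whether a number has any Fermat witness relatively prime to it. It assumes built-in <code>BigInt</code> support, and the helper names are hypothetical.</p>

```javascript
// Greatest common divisor by the Euclidean algorithm (BigInt arguments).
function gcd(a, b) {
  while (b !== 0n) {
    var tmp = a % b;
    a = b;
    b = tmp;
  }
  return a;
}

// Computes base^exp mod mod by repeated squaring (BigInt arguments).
function modPow(base, exp, mod) {
  let result = 1n % mod;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) {
      result = (result * base) % mod;
    }
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
}

// Returns true iff some 1 < a < n with gcd(a, n) = 1 is a Fermat
// witness of n.
function hasCoprimeFermatWitness(n) {
  n = BigInt(n);
  for (let a = 2n; a < n; a++) {
    if (gcd(a, n) === 1n && modPow(a, n - 1n, n) !== 1n) {
      return true;
    }
  }
  return false;
}
```

<p>Here <code>hasCoprimeFermatWitness(561)</code> returns <code>false</code> even though 561 is composite, whereas <code>hasCoprimeFermatWitness(15)</code> returns <code>true</code>.</p>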
<div class="p">Here is the Fermat compositeness test implemented in
Javascript:
<pre class="code-container"><code class="language-javascript">// Runs the Fermat compositeness test given n > 2 and 1 < a < n.
// Calculates r = a^{n-1} mod n and whether a is a Fermat witness to n
// (i.e., r != 1, which means n is composite). Returns a dictionary
// with a, n, r, and isCompositeByFermat, which is true iff a is a
// Fermat witness to n.
function testCompositenessByFermat(n, a) {
n = SNat.cast(n);
a = SNat.cast(a);
if (n.le(2)) {
throw new RangeError('n must be > 2');
}
if (a.le(1) || a.ge(n)) {
throw new RangeError('a must satisfy 1 < a < n');
}
var r = a.powMod(n.minus(1), n);
var isCompositeByFermat = r.ne(1);
return {
a: a,
n: n,
r: r,
isCompositeByFermat: isCompositeByFermat
};
}</code></pre>
Note that the algorithm depends on the efficiency
of <a href="http://en.wikipedia.org/wiki/Modular_exponentiation"><em>modular
exponentiation</em></a> when calculating \(a^{n-1} \pmod{n}\). The
naive method is unsuitable since it requires \(Θ(n)\) \(b\)-bit
multiplications, where \(b = \lceil \lg n \rceil\). <code>SNat</code>
uses <a href="http://en.wikipedia.org/wiki/Repeated_squaring">repeated
squaring</a>, which requires only \(Θ(\lg n)\) \(b\)-bit
multiplications.</div>
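<p>To illustrate the difference between the two methods, here is a sketch of both using built-in <code>BigInt</code>. This is not the <code>SNat</code> implementation; <code>naiveModPow</code> and <code>modPow</code> are hypothetical names, and both expect <code>BigInt</code> arguments.</p>

```javascript
// Naive method: performs exp modular multiplications, which is
// Theta(n) when exp = n - 1.
function naiveModPow(base, exp, mod) {
  let result = 1n % mod;
  for (let i = 0n; i < exp; i++) {
    result = (result * base) % mod;
  }
  return result;
}

// Repeated squaring: walks the bits of exp from least to most
// significant, squaring base at each step and multiplying it into the
// result whenever the current bit is set. This takes at most about
// 2 * lg(exp) modular multiplications.
function modPow(base, exp, mod) {
  let result = 1n % mod;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) {
      result = (result * base) % mod;
    }
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
}
```

<p>For instance, <code>modPow(2n, 340n, 341n)</code> is <code>1n</code>, so 2 is a Fermat liar for the composite \(341 = 11 \cdot 31\), while <code>modPow(3n, 340n, 341n)</code> is <code>56n</code>, so 3 is a Fermat witness.</p>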
<p>Another useful witness type is a <em>non-trivial square root of
unity \(\operatorname{mod} n\)</em>; that is, a number \(a \not\equiv \pm
1 \pmod{n}\) such that \(a^2 \equiv 1 \pmod{n}\). It is a theorem of
number theory that if \(n\) is prime, there are no non-trivial square
roots of unity \(\operatorname{mod} n\). Therefore, if we do find one,
that means \(n\) is composite. In fact, finding one leads directly to
factors of \(n\). By definition, a non-trivial square root of unity
\(a\) satisfies \(a \pm 1 \not\equiv 0 \pmod{n}\) and \(a^2 - 1 \equiv 0
\pmod{n}\). Factoring the latter leads to \((a+1)(a-1) \equiv 0
\pmod{n}\), which means that \(n\) divides \((a+1)(a-1)\). But the
first condition says that \(n\) divides neither \(a+1\) nor \(a-1\),
so it must be a product of two numbers \(p \mid a+1\) and \(q \mid
a-1\). Then \(\gcd(a+1, n)\)<sup><a href="#fn1" id="r1">[1]</a></sup>
and \(\gcd(a-1, n)\) are non-trivial factors of \(n\).</p>
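<p>The argument above can be sketched in code as follows, assuming built-in <code>BigInt</code> support. The names <code>gcd</code> and <code>factorsFromSqrtOfUnity</code> are hypothetical and not part of <code>SNat</code>.</p>

```javascript
// Greatest common divisor by the Euclidean algorithm (BigInt arguments).
function gcd(a, b) {
  while (b !== 0n) {
    var tmp = a % b;
    a = b;
    b = tmp;
  }
  return a;
}

// Given a non-trivial square root of unity a mod n, returns the two
// factors gcd(a + 1, n) and gcd(a - 1, n) of n.
function factorsFromSqrtOfUnity(n, a) {
  n = BigInt(n);
  a = BigInt(a) % n;
  if ((a * a) % n !== 1n) {
    throw new RangeError('a is not a square root of unity mod n');
  }
  if (a === 1n || a === n - 1n) {
    throw new RangeError('a is a trivial square root of unity mod n');
  }
  return [gcd(a + 1n, n), gcd(a - 1n, n)];
}
```

<p>For example, \(4^2 = 16 \equiv 1 \pmod{15}\), and <code>factorsFromSqrtOfUnity(15, 4)</code> returns the factors 5 and 3. Similarly, \(67^2 = 4489 \equiv 1 \pmod{561}\) yields the factors 17 and 33.</p>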
<p>Finding non-trivial square roots of unity by itself doesn't give a
useful primality testing algorithm, but combining it with the Fermat
primality test does. \(a^{n-1} \bmod n\) either equals \(1\) or not.
If it doesn't, you're done since you have a Fermat witness. If it
does equal \(1\), and \(n-1\) is even, then consider the square root
of \(a^{n-1}\), i.e. \(a^{(n-1)/2}\). If it is not \(\pm 1\), then it
is a non-trivial square root of unity. If it is \(-1\), then you
can't do anything else. But if it is \(1\), and \((n-1)/2\) is even,
you can then take another square root and repeat the test, stopping
when the exponent of \(a\) becomes odd or when you get a result not
equal to \(1\).</p>
<p>To turn this into an algorithm, you simply start from the bottom
up: find the greatest odd factor of \(n-1\), call it \(t\), and keep
squaring \(a^t\) mod \(n\) until you find a non-trivial square root of
unity \(\operatorname{mod} n\) or until you can deduce the value of \(a^{n-1}\). In fact, this
is almost as fast as the original Fermat primality test, since the
exponentiation by \(n-1\) has to do the same sort of squaring, and
we're just adding comparisons to \(±1\) in between squarings.</p>
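<p>To see the bottom-up procedure in action, the sketch below writes \(n - 1 = t \cdot 2^s\) with \(t\) odd and lists \(a^t, a^{2t}, \dotsc, a^{n-1}\) modulo \(n\). It assumes built-in <code>BigInt</code> support; <code>modPow</code> and <code>squaringChain</code> are hypothetical names. Unlike the procedure described above, it always computes the full chain instead of stopping early, purely so that every intermediate value is visible.</p>

```javascript
// Computes base^exp mod mod by repeated squaring (BigInt arguments).
function modPow(base, exp, mod) {
  let result = 1n % mod;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) {
      result = (result * base) % mod;
    }
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
}

// Returns t and s with n - 1 = t * 2^s and t odd, along with the chain
// [a^t, a^(2t), ..., a^(t * 2^s)] of residues mod n.
function squaringChain(n, a) {
  n = BigInt(n);
  a = BigInt(a);
  let t = n - 1n;
  let s = 0;
  while (t % 2n === 0n) {
    t /= 2n;
    s++;
  }
  const chain = [modPow(a, t, n)];
  for (let i = 0; i < s; i++) {
    const prev = chain[chain.length - 1];
    chain.push((prev * prev) % n);
  }
  return { t: t, s: s, chain: chain };
}
```

<p>For \(n = 561\) and \(a = 2\), we have \(560 = 35 \cdot 2^4\) and the chain is 263, 166, 67, 1, 1. Since \(67 \not\equiv \pm 1\) but \(67^2 \equiv 1 \pmod{561}\), 67 is a non-trivial square root of unity, so 561 is composite even though \(2^{560} \equiv 1 \pmod{561}\).</p>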
<p>The original idea for the test above is from Artjuhov, although it
is usually credited to Miller. Therefore, we call \(a\) an <em>Artjuhov witness<sup><a href="#fn2" id="r2">[2]</a></sup> of \(n\)</em> if it shows \(n\) composite by
the above test.</p>
<p class="interactive-example" id="artjuhovExample">
Let
<span class="fake-katex"><var>n</var> =
<input class="parameter" size="6" pattern="[0-9]*" required
type="text" value="561"
data-bind="value: nStr, valueUpdate: 'afterkeydown'" /></span>
and
<span class="fake-katex"><var>a</var> =
<input class="parameter" size="6" pattern="[0-9]*" required
type="text" value="2"
data-bind="value: aStr, valueUpdate: 'afterkeydown'" /></span>.
<!-- ko template: outputTemplate --><!-- /ko -->
<script type="text/html" id="artjuhov.error.invalidN">
<span class="fake-katex"><var>n</var></span> is not a valid number.
</script>
<script type="text/html" id="artjuhov.error.invalidA">
<span class="fake-katex"><var>a</var></span> is not a valid number.
</script>
<script type="text/html" id="artjuhov.error.outOfBoundsN">
<span class="fake-katex"><var>n</var></span> must be greater than
<span class="fake-katex">2</span>.
</script>
<script type="text/html" id="artjuhov.error.outOfBoundsA">
<span class="fake-katex"><var>a</var></span> must be greater than
<span class="fake-katex">1</span> and less than
<span class="fake-katex"><var>n</var></span>.
</script>
<script type="text/html" id="artjuhov.success.fermatEquivResult">
Then
<span class="fake-katex"><var>n</var></span>
is even, so this reduces to the Fermat primality test.
<span class="fake-katex"><var>a</var><sup><var>n</var>−1</sup>
≡
<span class="intermediate" data-bind="text: r"></span>
<span data-bind="if: r() && r().ne(1)">≢ 1</span>
(mod <var>n</var>)</span> so therefore
<span class="fake-katex"><var>n</var></span> is
<span data-bind="if: isCompositeByArtjuhov()">
<span class="result">composite</span>.
<span data-bind="html: factorsHtml"></span>
</span>
<span data-bind="ifnot: isCompositeByArtjuhov()">
an <span class="result">Artjuhov pseudoprime base
<span class="fake-katex"><var>a</var></span></span>.
</span>
</script>
<script type="text/html" id="artjuhov.success.impliesFinalEquivResult">
Then
<span class="fake-katex"><var>n</var> − 1 =
<span data-bind="html: nMinusOneHtml"></span></span>,
and
<span class="fake-katex"><var>r</var> ≡
<span data-bind="html: rHtml"></span> ≡
<span data-bind="html: rResultHtml"></span> (mod <var>n</var>)</span>,
so
<span class="fake-katex"><var>a</var><sup><var>n</var>−1</sup>
≡
<span data-bind="html: aNMinusOneHtml"></span> (mod <var>n</var>)</span>,
and therefore
<span class="fake-katex"><var>n</var></span> is
<span data-bind="if: isCompositeByArtjuhov()">
<span class="result">composite</span>.
<span data-bind="html: factorsHtml"></span>
</span>
<span data-bind="ifnot: isCompositeByArtjuhov()">
either <span class="result">prime</span> or an
<span class="result">Artjuhov pseudoprime base
<span class="fake-katex"><var>a</var></span></span>.
</span>
</script>
<script type="text/html" id="artjuhov.success.nonTrivialSqrtResult">
Then
<span class="fake-katex"><var>n</var> − 1 =
<span data-bind="html: nMinusOneHtml"></span></span>,
<span class="fake-katex"><var>r</var> ≡
<span data-bind="html: rHtml"></span>
≡ <span class="intermediate">1</span>
(mod <var>n</var>)</span>, and
<span class="fake-katex">√<var>r</var> ≡
<span data-bind="html: rSqrtHtml"></span>
≡ <span class="intermediate" data-bind="text: rSqrt"></span>
(mod <var>n</var>)</span>, which is a non-trivial square root
of unity <span class="fake-katex">mod <var>n</var></span>
and therefore <span class="fake-katex"><var>n</var></span>
is <span class="result">composite</span>.
<span data-bind="html: factorsHtml"></span>
</script>
</p>
<script type="text/javascript" src="/intro-primality-testing-files/artjuhov-example.js"></script>
<p>If \(n\) is an odd composite, then it can be shown (originally by
Rabin) that at least three quarters of all possible witnesses are
Artjuhov witnesses. Therefore, for a sample of \(k\) possible
witnesses of \(n\), the probability of all of them being Artjuhov
liars is \(\le 4^{-k}\), which is stronger than the bound for the
Fermat primality test. Furthermore, this bound is unconditional;
there is nothing like Carmichael numbers for the Artjuhov test.</p>
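<p>The three-quarters bound can be checked by brute force for small \(n\). The sketch below assumes built-in <code>BigInt</code> support and uses the usual textbook formulation of the strong test for odd \(n \gt 2\), which is a swapped-in simplification rather than the implementation given in this article. All function names are hypothetical.</p>

```javascript
// Computes base^exp mod mod by repeated squaring (BigInt arguments).
function modPow(base, exp, mod) {
  let result = 1n % mod;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) {
      result = (result * base) % mod;
    }
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
}

// Textbook strong test for odd n > 2 (BigInt arguments): a is a
// witness unless a^t = 1 (mod n) or a^(t * 2^i) = -1 (mod n) for some
// 0 <= i < s, where n - 1 = t * 2^s with t odd.
function isArtjuhovWitness(n, a) {
  let t = n - 1n;
  let s = 0;
  while (t % 2n === 0n) {
    t /= 2n;
    s++;
  }
  let r = modPow(a, t, n);
  if (r === 1n || r === n - 1n) {
    return false;
  }
  for (let i = 1; i < s; i++) {
    r = (r * r) % n;
    if (r === n - 1n) {
      return false;
    }
    if (r === 1n) {
      return true; // The previous r was a non-trivial square root of unity.
    }
  }
  return true;
}

// Counts the witnesses among 2 <= a <= n - 2 for odd n > 3.
function countArtjuhovWitnesses(n) {
  n = BigInt(n);
  let witnesses = 0;
  let total = 0;
  for (let a = 2n; a <= n - 2n; a++) {
    total++;
    if (isArtjuhovWitness(n, a)) {
      witnesses++;
    }
  }
  return { witnesses: witnesses, total: total };
}
```

<p>Running <code>countArtjuhovWitnesses(561)</code> shows that well over three quarters of the candidates are witnesses, even though 561 is a Carmichael number and so defeats the Fermat test for every base relatively prime to it.</p>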
<div class="p">Here is the Artjuhov compositeness test, implemented in
Javascript:
<pre class="code-container"><code class="language-javascript">// Runs the Artjuhov compositeness test given n > 2 and 1 < a < n.
// Finds the largest s such that n-1 = t*2^s, calculates r = a^t mod
// n, then repeatedly squares r (mod n) up to s times until r is
// congruent to -1, 0, or 1 (mod n). Then, based on the value of s
// and the final value of r and i (the number of squarings),
// determines whether a is an Artjuhov witness to n (i.e., n is
// composite).
//
// Returns a dictionary with a, n, s, t, i, r, rSqrt = sqrt(r) if i >
// 0 and null otherwise, and isCompositeByArtjuhov, which is true iff
// a is an Artjuhov witness to n.
function testCompositenessByArtjuhov(n, a) {
n = SNat.cast(n);
a = SNat.cast(a);
if (n.le(2)) {
throw new RangeError('n must be > 2');
}
if (a.le(1) || a.ge(n)) {
throw new RangeError('a must satisfy 1 < a < n');
}
var nMinusOne = n.minus(1);
// Find the largest s and t such that n-1 = t*2^s.
var t = nMinusOne;
var s = new SNat(0);
while (t.isEven()) {
t = t.div(2);
s = s.plus(1);
}
// Find the smallest 0 <= i < s such that a^{t*2^i} = 0/-1/+1 (mod
// n).
var i = new SNat(0);
var rSqrt = null;
var r = a.powMod(t, n);
while (i.lt(s) && r.gt(1) && r.lt(nMinusOne)) {
i = i.plus(1);
rSqrt = r;
r = r.times(r).mod(n);
}
var isCompositeByArtjuhov = false;
if (s.isZero()) {
// If 0 = i = s, then this reduces to the Fermat primality test.
isCompositeByArtjuhov = r.ne(1);
} else if (i.isZero()) {
// If 0 = i < s, then:
//
// * r = 0 (mod n) -> a^{n-1} = 0 (mod n), and
// * r = +/-1 (mod n) -> a^{n-1} = 1 (mod n).
isCompositeByArtjuhov = r.isZero();
} else if (i.lt(s)) {
// If 0 < i < s, then:
//
// * r = 0 (mod n) -> a^{n-1} = 0 (mod n),
// * r = +1 (mod n) -> a^{t*2^{i-1}} is a non-trivial square root of
// unity mod n, and
// * r = -1 (mod n) -> a^{n-1} = 1 (mod n).
//
// Note that the last case means r = n - 1 > 1.
isCompositeByArtjuhov = r.le(1);
} else {
// If 0 < i = s, then:
//
// * r = 0 (mod n) can't happen,
// * r = +1 (mod n) -> a^{t*2^{i-1}} is a non-trivial square root of
// unity mod n, and
// * r > +1 (mod n) -> failure of the Fermat primality test.
isCompositeByArtjuhov = true;
}
return {
a: a,
n: n,
t: t,
s: s,
i: i,
r: r,
rSqrt: rSqrt,
isCompositeByArtjuhov: isCompositeByArtjuhov
};
}</code></pre>
With the two compositeness tests above, we can now write a
probabilistic primality test:
<pre class="code-container"><code class="language-javascript">// Returns true iff a is a Fermat witness to n, and thus n is
// composite. a and n must satisfy the same conditions as in
// testCompositenessByFermat.
function hasFermatWitness(n, a) {
return testCompositenessByFermat(n, a).isCompositeByFermat;
}
// Returns true iff a is an Artjuhov witness to n, and thus n is
// composite. a and n must satisfy the same conditions as in
// testCompositenessByArtjuhov.
function hasArtjuhovWitness(n, a) {
return testCompositenessByArtjuhov(n, a).isCompositeByArtjuhov;
}
// Returns true if n is probably prime, based on sampling the given
// number of possible witnesses and testing them against n. If false
// is returned, then n is definitely composite.
//
// By default, uses the Artjuhov test for witnesses with 20 samples
// and Math.random for the random number generator. This gives an
// error bound of 4^-20 if true is returned.
function isProbablePrime(n, hasWitness, numSamples, rng) {
n = SNat.cast(n);
hasWitness = hasWitness || hasArtjuhovWitness;
rng = rng || Math.random;
numSamples = numSamples || 20;
if (n.le(1)) {
return false;
}
if (n.le(3)) {
return true;
}
if (n.isEven()) {
return false;
}
for (var i = 0; i < numSamples; ++i) {
var a = SNat.random(2, n.minus(2), rng);
if (hasWitness(n, a)) {
return false;
}
}
return true;
}</code></pre>
</div>
<p><code>isProbablePrime</code> called
with <code>hasFermatWitness</code> is the <em>Fermat primality
test</em>, and <code>isProbablePrime</code> called
with <code>hasArtjuhovWitness</code> is the <em>Miller-Rabin primality
test</em>. The latter is the current general primality test of
choice, replacing
the <a href="http://en.wikipedia.org/wiki/Solovay-Strassen">Solovay-Strassen
primality test</a>.</p>
<p>We can also use <code>isProbablePrime</code> to randomly generate
probable primes, which is useful
for <a href="http://en.wikipedia.org/wiki/RSA_(algorithm)#Key_generation">cryptographic
applications</a>:</p>
<pre class="code-container"><code class="language-javascript">// Returns a probable b-bit prime that is at least 2^(b-1). All
// parameters but b are passed to isProbablePrime.
function findProbablePrime(b, hasWitness, rng, numSamples) {
b = SNat.cast(b);
var lb = (new SNat(2)).pow(b.minus(1));
var ub = lb.times(2);
while (true) {
var n = SNat.random(lb, ub);
if (isProbablePrime(n, hasWitness, numSamples, rng)) {
return n;
}
}
}</code></pre>
<p>In this case, for sufficiently large \(b\), the Fermat primality
test is acceptable, since Carmichael numbers are so rare and we're the
ones generating the possible primes to be tested.<sup><a href="#fn3" id="r3">[3]</a></sup></p>
<p>There are other primality tests, but they're less often used in
practice because they're
either <a href="http://en.wikipedia.org/wiki/Solovay%E2%80%93Strassen_primality_test">less
efficient</a> or <a href="http://www.pseudoprime.com/pseudo2.pdf">more
sophisticated</a> than the algorithms above, or they require \(n\) to
have <a href="http://en.wikipedia.org/wiki/Lucas_primality_test">special</a> <a href="http://en.wikipedia.org/wiki/Proth%27s_theorem">properties</a>.
Perhaps the most interesting of these tests is
the <a href="http://en.wikipedia.org/wiki/Aks_primality_test"><em>AKS
primality test</em></a>, which proved once and for all that primality
testing is in P.</p>
<hr />
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] \(\gcd\) is
the <a href="http://en.wikipedia.org/wiki/Greatest_common_divisor">greatest
common divisor</a> function.
<a href="#r1">↩</a></p>
<p id="fn2">[2] “Artjuhov witness” is an idiosyncratic
name on my part; a more common name is <em>strong witness</em>, which
I don't like.
<a href="#r2">↩</a></p>
<p id="fn3">[3]
<a href="http://en.wikipedia.org/wiki/Fermat_primality_test#Applications">According to Wikipedia</a>, PGP uses the Fermat primality test.
<a href="#r3">↩</a></p>
</section>
https://www.akalin.com/pair-counterexamples-vector-calculus
A Pair of Counterexamples in Vector Calculus
2011-11-27T00:00:00-08:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<script>
KaTeXMacros = {
"\\sgn": "\\operatorname{sgn}",
};
</script>
<p>While recently reviewing some topics in vector calculus, I became
curious as to why violating seemingly innocuous conditions for some
theorems leads to surprisingly wild results. In fact, I was struck by
how these theorems resemble computer programs, not in some
<a href="http://en.wikipedia.org/wiki/Curry-Howard_Correspondence">abstract
way</a>, but in how the lack of “input validation” leads
to
<a href="http://en.wikipedia.org/wiki/Undefined_behavior">non-obvious
behavior</a> in the face of erroneous input.</p>
<p>I found that understanding why these counterexamples lead to wild
results deepened my understanding of the theorems involved and their
proofs.<sup><a href="#fn1" id="r1">[1]</a></sup> Besides,
pathological examples are more interesting than well-behaved ones!</p>
<p>First, let's look at a “counterexample”
to <a href="http://en.wikipedia.org/wiki/Green%27s_theorem">Green's
theorem</a>:</p>
<p class="example">1. Two functions \(L, M \colon \mathbb{R}^2 \to \mathbb{R}\) and
a positively-oriented, piecewise smooth, simple closed curve \(C\)
in \(\mathbb{R}^2\) enclosing the region \(D\) such that
\[
∮_C L \,dx + M \,dy \ne
∬_D \left( \frac{∂{M}}{∂{x}} - \frac{∂{L}}{∂{y}} \right) \,dx \,dy \text{.}
\]</p>
<p>Let
\[
L = -\frac{y}{x^2+y^2} \text{,} \quad M = \frac{x}{x^2+y^2} \text{,}
\]
and \(C\) be a curve going counter-clockwise around the rectangle \(D = [-1,
1]^2\).<sup><a href="#fn2" id="r2">[2]</a></sup> Then the integral of \(L \,dx + M \, dy\) around \(C\) is \(2
π\) since it encloses the origin. But
\[
\frac{∂{M}}{∂{x}} = \frac{∂{L}}{∂{y}} = \frac{y^2-x^2}{(x^2+y^2)^2}
\]
so the difference of the two vanishes everywhere but the origin, where
neither function is defined. Therefore, the (improper) integral over
\(D\) also vanishes, proving the inequality. ∎</p>
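<p>The value of the line integral is easy to confirm numerically. The following sketch approximates it with the midpoint rule, traversing the boundary of \([-1, 1]^2\) counter-clockwise (the positive orientation). The helper names <code>integrate</code> and <code>lineIntegralAroundSquare</code> are hypothetical, and the step count is an arbitrary choice.</p>

```javascript
// The two component functions from the example.
const L = (x, y) => -y / (x * x + y * y);
const M = (x, y) => x / (x * x + y * y);

// Midpoint-rule approximation of the integral of g from a to b. If
// a > b, the step is negative and the sign flips as expected.
function integrate(g, a, b, steps = 100000) {
  const h = (b - a) / steps;
  let sum = 0;
  for (let i = 0; i < steps; i++) {
    sum += g(a + (i + 0.5) * h);
  }
  return sum * h;
}

// Integral of L dx + M dy counter-clockwise around [-1, 1]^2. On the
// horizontal sides dy = 0, and on the vertical sides dx = 0.
function lineIntegralAroundSquare() {
  return (
    integrate((x) => L(x, -1), -1, 1) + // bottom side, left to right
    integrate((y) => M(1, y), -1, 1) + // right side, bottom to top
    integrate((x) => L(x, 1), 1, -1) + // top side, right to left
    integrate((y) => M(-1, y), 1, -1) // left side, top to bottom
  );
}
```

<p>Each side contributes about \(π/2\), so the total is approximately \(2π\), in agreement with the claim above.</p>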
<p>Of course, the easy explanation is that the discontinuity of \(L\)
and \(M\) at the origin violates a condition of Green's theorem. But
that doesn't really tell us anything, so let's break down the theorem
and see where exactly it fails.</p>
<p>Green's theorem is usually proved first for rectangles \([a, b]
\times [c, d]\), which suffices for our purpose. If \(C\) is a curve
that goes counter-clockwise around such a rectangle \(D\), then we can
easily show that
\[
∮_C L \,dx = - ∬_D \frac{∂{L}}{∂{y}} \,dx \,dy
\]
and
\[
∮_C M \,dy = ∬_D \frac{∂{M}}{∂{x}} \,dx \,dy \text{,}
\]
with the sum of these two formulas proving the theorem.</p>
<p>So the first sign of trouble is that the theorem freely
interchanges addition and integration. Since the partial derivatives
of our functions diverge at the origin, if \(D\) contains the origin
then the integrals of those partial derivatives over \(D\) may not
even be defined, even if the integral of their difference is.</p>
<p>But the problem arises even before that. The statements above are
proved by showing
\[
∮_C L \,dx = - ∫_a^b \left( ∫_c^d \frac{∂{L}}{∂{y}} \,dy \right) \,dx
\]
and
\[
∮_C M \,dy = ∫_c^d \left( ∫_a^b \frac{∂{M}}{∂{x}} \,dx \right) \,dy
\text{,}
\]
both of which hold for our example. But notice that in one case we
integrate with respect to \(y\) first, and in the other case we
integrate with respect to \(x\) first. Therefore, we have to
interchange the order of integration or convert to a double integral
in order to get them to a form where we can add them. And there's the
rub: if \(D\) contains the origin, switching the order of integration
for either integral above switches the sign of the result!</p>
<p>This fully explains the discrepancy; since the result of both
integrals above (with the iteration order preserved) is \(π\),
adding them together as-is gives the expected result of \(2 π\).
But if we switch the iteration order of one of the iterated integrals
as done in the proof of Green's theorem, then we switch the result of
that integral to \(-π\), which cancels with the result of the other
unchanged integral to produce \(0\).</p>
<p>So now let's examine this strange behavior of the sign of an
integration's result depending on the iteration order. This leads us
to our next “counterexample,” this time
for <a href="http://en.wikipedia.org/wiki/Fubini%27s_theorem">Fubini's
theorem</a>:</p>
<p class="example">2. A function \(f \colon \mathbb{R}^2 \to \mathbb{R}\) whose
iterated integrals over a rectangle \(D = [a, b] \times [c, d]
\subset \mathbb{R}^2\) differ.</p>
<p>Let
\[
f(x, y) = \frac{y^2-x^2}{(x^2+y^2)^2}
\quad \text{ and } \quad
D = [-1, 1]^2\text{.}
\]
The two iterated integrals of \(f\) over \(D\) are usually written as
\[
∫_{-1}^1 \left( ∫_{-1}^1 f(x, y) \,dy \right) \,dx
\qquad \text{ and } \qquad
∫_{-1}^1 \left( ∫_{-1}^1 f(x, y) \,dx \right) \,dy
\]
but let's define them more carefully to make it easier to justify our
calculations.</p>
<p>Let
\[
\begin{aligned}
u_k &= y \mapsto f(k, y) \\
v_l &= x \mapsto f(x, l) \text{.}
\end{aligned}
\]
In other words, given the real constants \(k\) and \(l\), construct
the (possibly partial) real functions \(u_k(y)\) and \(v_l(x)\) by
partially-evaluating \(f\) at \(x = k\) and \(y = l\),
respectively.</p>
<p>Then, if we also let<sup><a href="#fn3" id="r3">[3]</a></sup>
\[
U(x) = ∫_{-1}^1 u_x(y) \,dy
\qquad \text{ and } \qquad
V(y) = ∫_{-1}^1 v_y(x) \,dx \text{,}
\]
we can write the iterated integrals as
\[
∫_{-1}^1 U(x) \,dx
\qquad \text{ and } \qquad
∫_{-1}^1 V(y) \,dy \text{.}
\]
</p>
<p>Computing \(U(x)\) for \(x ≠ 0\), we get<sup><a href="#fn4" id="r4">[4]</a></sup>
\[
\begin{aligned}
U(x) &= ∫_{-1}^1 \frac{∂{}}{∂{y}} \left( -\frac{y}{x^2+y^2} \right) \,dy \\
&= \left. -\frac{y}{x^2+y^2} \right|_{y=-1}^{y=1} \\
&= -\frac{2}{x^2+1} \text{.}
\end{aligned}
\]
</p>
<p>Attempting to evaluate \(U(0)\), we see that
\[
\begin{aligned}
U(0) &= ∫_{-1}^1 \frac{y^2-0^2}{(0^2+y^2)^2} \,dy \\
     &= ∫_{-1}^1 \frac{dy}{y^2}
\end{aligned}
\]
which diverges. So
\[
U(x) = -\frac{2}{x^2+1} \text{ for } x \ne 0 \text{.}
\]
</p>
<p>
By a similar computation, we find that<sup><a href="#fn5" id="r5">[5]</a></sup>
\[
V(y) = \frac{2}{y^2+1} \text{ for } y \ne 0 \text{.}
\]
</p>
<p>Since \(U(x)\) isn't defined at \(0\), we have to treat it as an
improper integral, although doing so poses no real difficulty:
\[
\begin{aligned}
∫_{-1}^1 U(x)\,dx
&= \lim_{a \nearrow 0} \left( ∫_{-1}^a -\frac{2}{x^2+1} \,dx \right) +
\lim_{a \searrow 0} \left( ∫_{a}^1 -\frac{2}{x^2+1} \,dx \right) \\
&= \lim_{a \nearrow 0}
\Bigl( \left. -2 \arctan(x) \right|_{-1}^{a} \Bigr) +
\lim_{a \searrow 0}
\Bigl( \left. -2 \arctan(x) \right|_{a}^{1} \Bigr) \\
&= \left. -2 \arctan(x) \right|_{-1}^{0} +
\left. -2 \arctan(x) \right|_{0}^{1} \\
&= \left. -2 \arctan(x) \right|_{-1}^{1} \\
&= -π \text{.}
\end{aligned}
\]
</p>
<p>Similarly,
\[
∫_{-1}^1 V(y)\,dy = π \text{,}
\]
so the iterated integrals of \(f(x, y)\) over \([-1, 1]^2\) differ; in
fact, as we claimed above, switching the iteration order switches the
sign of the result. ∎</p>
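<p>These values can be spot-checked numerically. The sketch below takes \(f(x, y) = \frac{y^2-x^2}{(x^2+y^2)^2}\), which is the integrand whose inner integrals are \(U(x) = -\frac{2}{x^2+1}\) and \(V(y) = \frac{2}{y^2+1}\). It first compares numerically-computed inner integrals against those closed forms at a point away from the origin, then integrates the closed forms. All helper names are hypothetical, and the midpoint rule with an even step count is an arbitrary choice.</p>

```javascript
const f = (x, y) => (y * y - x * x) / (x * x + y * y) ** 2;

// Midpoint-rule approximation of the integral of g from a to b.
function integrate(g, a, b, steps = 100000) {
  const h = (b - a) / steps;
  let sum = 0;
  for (let i = 0; i < steps; i++) {
    sum += g(a + (i + 0.5) * h);
  }
  return sum * h;
}

// Inner integrals computed numerically; only reliable when the fixed
// coordinate is not close to 0, where f blows up.
const numericU = (x) => integrate((y) => f(x, y), -1, 1);
const numericV = (y) => integrate((x) => f(x, y), -1, 1);

// Closed forms from the text, valid for nonzero arguments.
const closedU = (x) => -2 / (x * x + 1);
const closedV = (y) => 2 / (y * y + 1);

// Outer integrals of the closed forms over [-1, 1].
const integralOfU = integrate(closedU, -1, 1);
const integralOfV = integrate(closedV, -1, 1);
```

<p>Here <code>numericU(0.5)</code> and <code>closedU(0.5)</code> agree to several decimal places, as do <code>numericV(0.5)</code> and <code>closedV(0.5)</code>, while <code>integralOfU</code> is approximately \(-π\) and <code>integralOfV</code> is approximately \(π\).</p>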
<p>We can repeat the above calculations for an arbitrary rectangle to
see that the iterated integrals of \(f(x, y)\) differ if \(D\)
contains the origin either as an interior point or a corner. But
there's an easier way to prove that statement and also gain some
insight as to why \(f(x, y)\) has this strange property.</p>
<p>Note that the key facts in the above calculations were that \(U(x)
\lt 0\) for any \(x \ne 0\) and \(V(y) \gt 0\) for any \(y \ne 0\).
Therefore, integrating \(U(x)\) over any interval on the \(x\)-axis
would produce a negative result and integrating \(V(y)\) over any
interval on the \(y\)-axis would produce a positive result, leading to
the difference in iterated integrals. This holds more generally; for
any \(m, n \gt 0\):
\[
∫_{-n}^n f(x, y) \,dy \lt 0
\qquad \text{ and } \qquad
∫_{-m}^m f(x, y) \,dx \gt 0 \text{.}
\]
Therefore,
\[
∫_{-m}^m \left( ∫_{-n}^n f(x, y) \,dy \right) \,dx \lt 0
\qquad \text{ and } \qquad
∫_{-n}^n \left( ∫_{-m}^m f(x, y) \,dx \right) \,dy \gt 0
\]
so the iterated integrals of \(f(x, y)\) differ over the rectangles
\([-m, m] \times [-n, n]\). Since any rectangle \(D\) containing the
origin as an interior point must contain some smaller rectangle \(E =
[-m, m] \times [-n, n]\), the iterated integrals of \(f(x, y)\) over
\(E\) differ and therefore must also differ over \(D\).</p>
<p>Furthermore, since \(f(x, y)\) is even in both \(x\) and \(y\), you
can carry out a similar argument to the above with intervals of the
form \([0, m]\) or \([-m, 0]\) to show that the iterated integrals of
\(f(x, y)\) must also differ over any rectangle with the origin as a
corner.
</p>
<p>So the essential property of \(f(x, y)\) is that slicing it along
the \(x\)-axis gives a function which has positive area under the
curve on any interval symmetric around \(0\) or with \(0\) as an
endpoint, and that slicing it similarly along the \(y\)-axis gives a
function which has negative area. Therefore, on a rectangle symmetric
around the origin or with the origin as a corner, one can choose the
sign of the iterated integral by choosing which axis to slice
first.</p>
<p>The next thing to investigate is how exactly the iterated integrals
of \(f(x, y)\) over the rectangle \(D\) are expressed such that they
differ only when \(D\) contains the origin, especially considering
that \(f(x, y)\) is expressed in quite a simple form. To do that,
let's consider the simple case of a rectangle \(D = [δ, 1] \times
[ϵ, 1]\) where we can vary \(δ\) and \(ϵ\) at
will.</p>
<p>Let
\[
\begin{aligned}
I_{yx}(δ, ϵ) &=
∫_{δ}^1 \left( ∫_{ϵ}^1 f(x, y) \,dy \right) \,dx \\
I_{xy}(δ, ϵ) &=
∫_{ϵ}^1 \left( ∫_{δ}^1 f(x, y) \,dx \right) \,dy
\text{.}
\end{aligned}
\]
Then, for \(ϵ ≠ 0\):
\[
\begin{aligned}
I_{yx}(δ, ϵ) &=
∫_{δ}^1 \left( ∫_{ϵ}^1
\frac{y^2-x^2}{(x^2+y^2)^2} \,dy \right) \,dx \\
&= ∫_{δ}^1 \left(
\left. -\frac{y}{x^2+y^2} \right|_{y=ϵ}^{y=1} \right) \,dx \\
&= ∫_{δ}^1 \Biggl(
-\frac{1}{1+x^2} -
\left( -\frac{ϵ}{ϵ^2+x^2} \right) \Biggr) \,dx \\
&= ∫_{δ}^1 \frac{dx/ϵ}{1+(x/ϵ)^2} -
∫_{δ}^1 \frac{dx}{1+x^2} \\
&= \arctan\left(\frac{1}{ϵ}\right) -
\arctan\left(\frac{δ}{ϵ}\right) -
\frac{π}{4} + \arctan(δ) \text{,}
\end{aligned}
\]
and for \(ϵ = 0\):
\[
I_{yx}(δ, 0) = -\frac{π}{4} + \arctan(δ) \text{.}
\]
Similarly, for \(δ ≠ 0\):
\[
\begin{aligned}
I_{xy}(δ, ϵ) &=
∫_{ϵ}^1 \left( ∫_{δ}^1
\frac{y^2-x^2}{(x^2+y^2)^2} \,dx \right) \,dy \\
&= ∫_{ϵ}^1 \left(
\left. \frac{x}{x^2+y^2} \right|_{x=δ}^{x=1} \right) \,dy \\
&= ∫_{ϵ}^1 \left(
\frac{1}{1+y^2} - \frac{δ}{δ^2+y^2} \right) \,dy \\
&= ∫_{ϵ}^1 \frac{dy}{1+y^2} -
∫_{ϵ}^1 \frac{dy/δ}{1+(y/δ)^2} \\
&= \frac{π}{4} - \arctan(ϵ) -
\arctan\left(\frac{1}{δ}\right) +
\arctan\left(\frac{ϵ}{δ}\right) \text{,}
\end{aligned}
\]
and for \(δ = 0\):
\[
I_{xy}(0, ϵ) = \frac{π}{4} - \arctan(ϵ) \text{.}
\]
Then let \(Δ = I_{xy} - I_{yx}\) be the difference between the
two iterated integrals. We can use the identity
\[
\arctan(x) + \arctan\left(\frac{1}{x}\right) = \frac{π}{2} \sgn(x)
\]
to simplify \(Δ(δ, ϵ)\) if neither \(δ\) nor
\(ϵ\) is zero:
\[
\begin{aligned}
Δ(δ, ϵ)
&= \bigl( π/4 - \arctan(ϵ) - \arctan(1/δ)
+ \arctan(ϵ/δ) \bigr) \\
& \quad \mathbin{-}
\bigl( \arctan(1/ϵ) - \arctan(δ/ϵ)
- π/4 + \arctan(δ) \bigr) \\
&= π/2 - \bigl( \arctan(ϵ) + \arctan(1/ϵ) \bigr) \\
& \quad \mathbin{-} \bigl( \arctan(δ) + \arctan(1/δ) \bigr) \\
& \quad \mathbin{+}
\bigl( \arctan(δ/ϵ) + \arctan(ϵ/δ) \bigr) \\
&= \frac{π}{2} \bigl( 1 - \sgn(ϵ) - \sgn(δ)
+ \sgn(δ/ϵ) \bigr) \text{.}
\end{aligned}
\]
</p>
<p>
Using the properties of \(\sgn(x)\), we can simplify this to the final
expression:
\[
Δ(δ, ϵ) =
\frac{π}{2}
\bigl( 1 - \sgn(δ) \bigr) \bigl( 1 - \sgn(ϵ) \bigr)
\]
which we can prove still holds if either \(δ\) or \(ϵ\) is
zero (or both).</p>
<p>So with the simplified expression for \(Δ(δ, ϵ)\),
it becomes apparent how \(\sgn(x)\) is used to control the value of
\(Δ(δ, ϵ)\); as long as either \(δ\) or
\(ϵ\) is positive, \(1 - \sgn(x)\) zeroes out the entire
expression.</p>
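<p>As a numerical sanity check on the simplification above, here is a
minimal JavaScript sketch. The function names are illustrative only;
it simply evaluates the closed forms derived above for \(I_{xy}\) and
\(I_{yx}\) and compares their difference against the final expression
for \(Δ(δ, ϵ)\) on a small grid of nonzero values:</p>
<pre class="code-container"><code class="lang-javascript">// Closed form for I_yx(δ, ϵ), valid for ϵ ≠ 0.
function iYX(d, e) {
  return Math.atan(1 / e) - Math.atan(d / e) - Math.PI / 4 + Math.atan(d);
}
// Closed form for I_xy(δ, ϵ), valid for δ ≠ 0.
function iXY(d, e) {
  return Math.PI / 4 - Math.atan(e) - Math.atan(1 / d) + Math.atan(e / d);
}
// Final expression: Δ(δ, ϵ) = (π/2) (1 - sgn(δ)) (1 - sgn(ϵ)).
function deltaClosedForm(d, e) {
  return (Math.PI / 2) * (1 - Math.sign(d)) * (1 - Math.sign(e));
}
// Largest discrepancy between I_xy - I_yx and the final expression
// over a small grid of nonzero values of δ and ϵ.
function maxDiscrepancy() {
  var samples = [-1, -0.5, -0.1, 0.1, 0.5, 1];
  var worst = 0;
  samples.forEach(function(d) {
    samples.forEach(function(e) {
      var diff = Math.abs(iXY(d, e) - iYX(d, e) - deltaClosedForm(d, e));
      worst = Math.max(worst, diff);
    });
  });
  return worst;
}</code></pre>
<p>For \(δ = ϵ = -1\), both sides come out to \(2π\), which matches
the difference between the two iterated integrals \(π\) and \(-π\)
computed over \([-1, 1]^2\) earlier.</p>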
<hr />
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] There are
actually <a href="http://amzn.com/048668735X">whole</a>
<a href="http://amzn.com/0486428753">books</a> dedicated to
counterexamples. They make good bathroom reading material.
<a href="#r1">↩</a></p>
<p id="fn2">[2] The vector field \((L, M)\) also serves as the
canonical “counterexample” to
the <a href="http://en.wikipedia.org/wiki/Gradient_theorem">gradient
theorem</a>. <a href="#r2">↩</a></p>
<p id="fn3">[3] \(U(x)\) and \(V(y)\) are also (partial) real
functions. <a href="#r3">↩</a></p>
<p id="fn4">[4] We're justified in applying standard integration
techniques here since \(u_k(y)\) for \(k \gt 0\) is defined and
bounded for all \(y\). <a href="#r4">↩</a></p>
<p id="fn5">[5] Note that \(U(x)\) and \(V(y)\) differ only in
variable name and sign. <a href="#r5">↩</a></p>
</section>
https://www.akalin.com/evlis-tail-recursion
Understanding Evlis Tail Recursion
2011-10-28T00:00:00-07:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<p>While reading
about <a href="http://www.schemers.org/Documents/Standards/R5RS/HTML/r5rs-Z-H-6.html#%25_sec_3.5">proper
tail recursion</a> in Scheme, I encountered a similar but obscure
optimization called <em>evlis tail recursion</em>.
In <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.8567&rep=rep1&type=pdf">the
paper where it was first described</a>, the author claims it can
"dramatically improve the space performance of many programs," which
sounded promising.</p>
<p>However, the few places where it's mentioned don't do much more than
state its definition and claim its usefulness. Hopefully I can
provide a more detailed analysis here.</p>
<div class="p">Consider the straightforward factorial implementation in
Scheme:<sup><a href="#fn1" id="r1">[1]</a></sup>
<pre class="code-container"><code class="language-lisp">(define (fact n) (if (<= n 1) 1 (* n (fact (- n 1)))))</code></pre>
It is not tail-recursive, since the recursive call is nested in
another procedure call. However, it's <em>almost</em> tail-recursive;
the call to <code>*</code> is a tail call, and the recursive call is
its last subexpression, so it will be the last subexpression to be
evaluated.</div>
<p>Recall what happens when a procedure call (represented as a list of
subexpressions) is evaluated: each subexpression is evaluated, and the
first result (the procedure) is passed the other results as
arguments.<sup><a href="#fn2" id="r2">[2]</a></sup></p>
<p>Evlis tail recursion can be described as follows: when performing a
procedure call and during the evaluation of the last subexpression,
the calling environment is discarded as soon as it is not
required.<sup><a href="#fn3" id="r3">[3]</a></sup> The distinction
between evlis tail recursion and proper tail recursion is subtle.
Proper tail recursion requires only that the calling environment be
discarded before the actual procedure call; evlis tail recursion
discards the calling environment even sooner, if possible.</p>
<div class="p">An example will help to clarify things. Given <code>fact</code> as
defined above, say you evaluate <code>(fact 10)</code> and you're in
the procedure call with <code>n = 5</code>. The call stack of a
properly tail-recursive interpreter would look like this:
<style>
pre.stack {
margin-top: 1em;
margin-bottom: 1em;
}
</style>
<pre class="stack">
evalExpr
--------
env = { n: 10 } -> <top-level environment>
expr = '(* n (fact (- n 1)))'
proc = <native function: *>
args = [10, <pending evalExpr('(fact (- n 1))', env)>]
evalExpr
--------
env = { n: 9 } -> <top-level environment>
expr = '(* n (fact (- n 1)))'
proc = <native function: *>
args = [9, <pending evalExpr('(fact (- n 1))', env)>]
...
evalExpr
--------
env = { n: 6 } -> <top-level environment>
expr = '(* n (fact (- n 1)))'
proc = <native function: *>
args = [6, <pending evalExpr('(fact (- n 1))', env)>]
evalExpr
--------
env = { n: 5 } -> <top-level environment>
expr = '(if ...)'
</pre>
whereas the call stack of an evlis tail-recursive interpreter would
look like this:
<pre class="stack">
evalExpr
--------
env = { n: 5 } -> <top-level environment>
pendingProcedureCalls = [
[<native function: *>, 10],
[<native function: *>, 9],
...
[<native function: *>, 6]
]
expr = '(if ...)'
</pre>
In this implementation, the last subexpression of a procedure call
is evaluated exactly like a tail expression, but the procedure call
and non-last subexpressions are pushed onto a stack. Whenever an
expression is reduced to a simple one and the stack is non-empty, a
pending procedure call with its other args is popped off, and it is
called with the reduced expression as the final argument.</div>
<p>Note that this didn't change the asymptotic behavior of the
procedure; it still takes \(Θ(n)\) memory to evaluate. However,
only the bare minimum of information is saved: the list of pending
functions and their arguments. Other auxiliary variables, and
crucially the nested calling environments, aren't preserved, leading
to a significant constant-factor reduction in memory.</p>
<div class="p">This raises the question: Are there cases where evlis tail
recursion leads to better asymptotic behavior? In fact, yes; consider
the following (contrived) implementation of
factorial<sup><a href="#fn4" id="r4">[4]</a></sup>:
<pre class="code-container"><code class="language-lisp">(define (fact2 n)
  (define v (make-vector n))
  (if (<= n 1) 1 (* n (fact2 (- n 1)))))</code></pre>
Before the main body of the function, a vector of size \(n\) is
defined. This means that the environments in the call stack of a
properly tail-recursive interpreter would look like this:<sup><a href="#fn5" id="r5">[5]</a></sup>
<pre class="stack">
env = { n: 10, v: <vector of size 10> } -> <top-level environment>
env = { n: 9, v: <vector of size 9> } -> <top-level environment>
env = { n: 8, v: <vector of size 8> } -> <top-level environment>
env = { n: 7, v: <vector of size 7> } -> <top-level environment>
...
</pre>
whereas an evlis tail-recursive interpreter would keep around
only the current environment. Therefore, the properly tail-recursive
interpreter would require \(Θ(n^2)\) memory to
evaluate <code>(fact2 n)</code> while the evlis tail-recursive
interpreter would require only \(Θ(n)\).</div>
<p>Studying examples like the one above enabled me to finally
understand how evlis tail recursion works and what sort of savings it
gives. However, I have yet to find a practical example where evlis
tail recursion gives the same sort of asymptotic gains as described
above, and I'd be interested to receive some. But perhaps the "large
gains" mentioned in the various papers describing it are only
constant-factor reductions in memory.</p>
<p>In any case, another important difference in Scheme between proper
tail recursion and evlis tail recursion is that the former is
a <em>language feature</em> and the latter is
an <em>optimization</em>. That means that it is acceptable and even
encouraged to write Scheme programs that take advantage of proper tail
recursion, but it would be unwise to rely on evlis tail recursion for
the asymptotic performance of your function. Instead, one should
treat it just as a nice constant-factor memory savings.</p>
<p>Note that it is easy to make evlis tail recursion "smarter." Since
Scheme doesn't specify the order of argument evaluation, an
interpreter could evaluate arguments to maximize the gains from evlis
tail recursion. As an easy example, if we had switched the arguments
to <code>*</code> in <code>fact</code> above, making it
non-evlis-tail-recursive, a smart compiler could still treat it as
such. A possible rule of thumb would be to pick a non-trivial
function call to evaluate last.</p>
<p>To complete the picture, I will outline below the evaluation
function for a simple evlis tail-recursive Scheme interpreter in
JavaScript. All of the sources I've found describe it in terms of
compilers, so I think it'll be useful to have a reference
implementation for an interpreter.</p>
<div class="p">Let's say we already have a properly tail-recursive
interpreter:<sup><a href="#fn6" id="r6">[6]</a></sup>
<pre class="code-container"><code class="lang-javascript">function evalExpr(expr, env) {
  // Fake tail calls with a while loop and continue.
  while (true) {
    // Symbols, constants, quoted expressions, and lambdas.
    if (isSimpleExpr(expr)) {
      // The only exit point.
      return evalSimpleExpr(expr, env);
    }
    // (if test conseq alt)
    if (isSpecialForm(expr, 'if')) {
      expr = evalExpr(expr[1], env) ? expr[2] : expr[3];
      continue;
    }
    // (set! var expr)
    if (isSpecialForm(expr, 'set!')) {
      env.set(expr[1], evalExpr(expr[2], env));
      expr = null;
      continue;
    }
    // (define var expr?)
    if (isSpecialForm(expr, 'define')) {
      env.define(expr[1], evalExpr(expr[2], env));
      expr = null;
      continue;
    }
    // (begin expr*)
    if (isSpecialForm(expr, 'begin')) {
      if (expr.length == 1) {
        expr = null;
        continue;
      }
      // Evaluate all but the last subexpression.
      for (var i = 1; i < expr.length - 1; ++i) {
        evalExpr(expr[i], env);
      }
      expr = expr[expr.length - 1];
      continue;
    }
    // (proc expr*)
    // Use expr[0] and slice() rather than shift() so that expr itself
    // (possibly a procedure body that will be evaluated again) isn't
    // mutated.
    var proc = evalExpr(expr[0], env);
    var args = expr.slice(1).map(
        function(subExpr) { return evalExpr(subExpr, env); });
    // proc.run() returns its body in result.expr and the environment
    // in which to evaluate it (with all its arguments bound) in
    // result.env.
    var result = proc.run(args);
    expr = result.expr;
    // The only time when env is changed.
    env = result.env;
    continue;
  }
}</code></pre>
Then implementing evlis tail recursion requires only a few
changes:
<pre class="code-container"><code class="lang-javascript">function evalExpr(expr, env) {
  // This is a stack of procedures and their non-final arguments that
  // are waiting for their final argument to be evaluated.
  var pendingProcedureCalls = [];
  while (true) {
    if (isSimpleExpr(expr)) {
      expr = evalSimpleExpr(expr, env);
      // Discard calling environment.
      env = null;
      if (pendingProcedureCalls.length == 0) {
        // No pending procedure calls, so we're done (the only exit
        // point).
        return expr;
      }
      var args = pendingProcedureCalls.pop();
      // Append the final result before removing the procedure, so that
      // a call with no arguments (where the procedure itself was the
      // last subexpression) also works.
      args.push(expr);
      var proc = args.shift();
      var result = proc.run(args);
      expr = result.expr;
      // Change to new environment (the only time when env is
      // changed).
      env = result.env;
      continue;
    }
    ...
    // Everything else remains the same.
    ...
    // (proc expr*)
    var nonFinalSubExprs =
        expr.slice(0, -1).map(
            function(subExpr) { return evalExpr(subExpr, env); });
    pendingProcedureCalls.push(nonFinalSubExprs);
    // Evaluate the last subexpression as a tail call.
    expr = expr[expr.length - 1];
    continue;
  }
}</code></pre>
</div>
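<p>To check that the pieces fit together, here is a self-contained
sketch that can actually be run. Everything except the shape
of <code>evalExpr</code> is an assumption made for illustration:
expressions are encoded as nested JavaScript arrays with strings as
symbols, <code>Env</code>, <code>nativeProc</code>,
and <code>lambdaProc</code> are hypothetical helpers, values are
limited to numbers, booleans, and procedures, and
only <code>if</code>, <code>lambda</code>, and procedure calls are
supported:</p>
<pre class="code-container"><code class="lang-javascript">// Hypothetical minimal environment: a variable table plus a parent link.
function Env(vars, parent) {
  this.vars = vars;
  this.parent = parent;
}
Env.prototype.lookup = function(name) {
  for (var e = this; e; e = e.parent) {
    if (Object.prototype.hasOwnProperty.call(e.vars, name)) {
      return e.vars[name];
    }
  }
  throw new Error('unbound variable: ' + name);
};
// A native procedure hands back its result as a constant expression.
function nativeProc(fn) {
  return {
    run: function(args) { return { expr: fn.apply(null, args), env: null }; }
  };
}
// A lambda hands back its body and a new environment with its
// parameters bound.
function lambdaProc(params, body, env) {
  return {
    run: function(args) {
      var vars = {};
      params.forEach(function(p, i) { vars[p] = args[i]; });
      return { expr: body, env: new Env(vars, env) };
    }
  };
}
function isSimpleExpr(expr) {
  return !Array.isArray(expr) || expr[0] === 'lambda';
}
function evalSimpleExpr(expr, env) {
  if (typeof expr === 'string') return env.lookup(expr);
  if (Array.isArray(expr)) return lambdaProc(expr[1], expr[2], env);
  return expr;  // Numbers, booleans, and procedures evaluate to themselves.
}
function evalExpr(expr, env) {
  var pendingProcedureCalls = [];
  while (true) {
    if (isSimpleExpr(expr)) {
      expr = evalSimpleExpr(expr, env);
      env = null;  // Discard the calling environment.
      if (pendingProcedureCalls.length == 0) {
        return expr;
      }
      var args = pendingProcedureCalls.pop();
      args.push(expr);
      var proc = args.shift();
      var result = proc.run(args);
      expr = result.expr;
      env = result.env;
      continue;
    }
    // (if test conseq alt)
    if (expr[0] === 'if') {
      expr = evalExpr(expr[1], env) ? expr[2] : expr[3];
      continue;
    }
    // (proc expr*): evaluate all but the last subexpression now, and
    // the last one as a tail call.
    pendingProcedureCalls.push(expr.slice(0, -1).map(
        function(subExpr) { return evalExpr(subExpr, env); }));
    expr = expr[expr.length - 1];
  }
}
// Top-level environment with a few native procedures and fact.
var globalEnv = new Env({
  '*': nativeProc(function(a, b) { return a * b; }),
  '-': nativeProc(function(a, b) { return a - b; }),
  '<=': nativeProc(function(a, b) { return a <= b; })
}, null);
globalEnv.vars.fact = evalExpr(
    ['lambda', ['n'],
     ['if', ['<=', 'n', 1], 1, ['*', 'n', ['fact', ['-', 'n', 1]]]]],
    globalEnv);</code></pre>
<p>With these definitions, <code>evalExpr(['fact', 10], globalEnv)</code>
evaluates the factorial with a single <code>evalExpr</code> frame for
the chain of multiplications; the pending calls
to <code>*</code> live in <code>pendingProcedureCalls</code> rather
than on the JavaScript call stack.</p>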
<hr />
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] Assume a left-to-right evaluation order for now.
<a href="#r1">↩</a></p>
<p id="fn2">[2] The function that takes a list of expressions, evaluates them,
and returns the results as a list is traditionally
called <code>evlis</code>, hence the name of the optimization.
<a href="#r2">↩</a></p>
<p id="fn3">[3] This assumes that the calling environment isn't
stored somewhere else.
<a href="#r3">↩</a></p>
<p id="fn4">[4] This was adapted from an example
in <a href="ftp://ftp.ccs.neu.edu/pub/people/will/tail.pdf">Proper
Tail Recursion and Space Efficiency</a>.
<a href="#r4">↩</a></p>
<p id="fn5">[5] Assume that the interpreter isn't smart enough to deduce that \(v\)
can be optimized out since it's never used.
<a href="#r5">↩</a></p>
<p id="fn6">[6] Adapted from Peter Norvig's
excellent <a href="http://norvig.com/lispy.html"><code>lis.py</code></a>.
<a href="#r6">↩</a></p>
</section>
https://www.akalin.com/elementary-gaussian-proof
An Elementary Way to Calculate the Gaussian Integral
2011-01-06T00:00:00-08:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<p>
While reading <a href="http://gowers.wordpress.com">Timothy Gowers's blog</a> I stumbled on
<a href="http://gowers.wordpress.com/2007/10/04/when-are-two-proofs-essentially-the-same/#comment-239">Scott Carnahan's comment</a>
describing an elegant calculation of the Gaussian integral
\[
∫_{-∞}^{∞} e^{-x^2} \, dx = \sqrt{π}\text{.}
\]
I was so struck by its elementary character that I imagined what it
would be like written up, say, as an extra credit exercise in a
single-variable calculus class:
</p>
<div class="exercise">
<span class="exercise">Exercise 1.</span>
(<span class="exercise-name">The Gaussian integral</span>.) Let
\[
F(t) = ∫_0^t e^{-x^2} \, dx
\text{, }\qquad
G(t) = ∫_0^1 \frac{e^{-t^2 (1+x^2)}}{1+x^2} \, dx
\text{,}
\]
and \(H(t) = F(t)^2 + G(t)\).
<ol class="exercise-list">
<li>Calculate \(H(0)\).</li>
<li>Calculate and simplify \(H'(t)\). What does this
imply about \(H(t)\)?</li>
<li>Use part b to calculate \(F(∞) =
\displaystyle\lim_{t \to ∞} F(t)\).</li>
<li>Use part c to calculate
\[
∫_{-∞}^{∞} e^{-x^2} \, dx\text{.}
\]</li>
</ol>
</div>
<p>
Although this is simpler than
<a href="http://en.wikipedia.org/wiki/Gaussian_integral#Careful_proof">the
usual calculation of the Gaussian integral</a>, for which careful
reasoning is needed to justify the use of polar coordinates, it seems
more like a
<a href="http://en.wikipedia.org/wiki/Certificate_(complexity)">certificate</a>
than an actual
proof; you can convince yourself that the calculation is valid, but
you gain no insight into the reasoning that led up to it.<sup><a href="#fn1" id="r1">[1]</a></sup>
</p>
<p>
Fortunately, <a href="http://gowers.wordpress.com/2007/10/04/when-are-two-proofs-essentially-the-same/#comment-243">David Speyer's
comment</a> solves the mystery; \(G(t)\) falls out of doing the
integration in Cartesian coordinates over a triangular region. Just
for kicks, here's what I imagine an exercise based on this method would
look like (this time for a multi-variable calculus class):
</p>
<div class="exercise">
<span class="exercise">Exercise 2.</span>
(<span class="exercise-name">The Gaussian integral in Cartesian coordinates.</span>) Let
\[
A(t) = ∬\limits_{\triangle_t} e^{-(x^2+y^2)} \, dx \, dy
\]
where \(\triangle_t\) is the triangle with vertices \((0, 0)\), \((t,
0)\), and \((t, t)\).
<!-- TODO(akalin): Draw a diagram for \triangle_t. -->
<ol class="exercise-list">
<li>Use the substitution \(y = sx\) to reduce \(A(t)\) to a
one-dimensional integral.</li>
<li>Use part a to calculate \(A(∞) =
\lim_{t \to ∞} A(t)\).</li>
<li>Use part b to calculate
\[
∫_{-∞}^{∞} e^{-x^2} \, dx\text{.}
\]</li>
<li>Let
\[
F(t) = ∫_0^t e^{-x^2} \, dx
\qquad\text{ and }\qquad
G(t) = ∫_0^1 \frac{e^{-t^2 (1+x^2)}}{1+x^2} \, dx
\text{.}
\]
Use part a to relate \(F(t)\) to \(G(t)\).</li>
<li>Use part d to derive a proof of part c
using only single-variable calculus.</li>
</ol>
</div>
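<p>If you want to check your work on either exercise numerically,
\(F(t)\), \(G(t)\), and \(H(t) = F(t)^2 + G(t)\) are easy to
approximate. The following is only a rough sketch: the use of
composite Simpson's rule and the number of subintervals are my own
choices, not part of the exercises.</p>
<pre class="code-container"><code class="lang-javascript">// Composite Simpson's rule for integrating f over [a, b] using n
// subintervals (n must be even).
function simpson(f, a, b, n) {
  var h = (b - a) / n;
  var sum = f(a) + f(b);
  for (var i = 1; i < n; ++i) {
    sum += (i % 2 === 1 ? 4 : 2) * f(a + i * h);
  }
  return sum * h / 3;
}
// F(t) = integral from 0 to t of e^{-x^2}.
function F(t) {
  return simpson(function(x) { return Math.exp(-x * x); }, 0, t, 1000);
}
// G(t) = integral from 0 to 1 of e^{-t^2 (1+x^2)} / (1+x^2).
function G(t) {
  return simpson(function(x) {
    return Math.exp(-t * t * (1 + x * x)) / (1 + x * x);
  }, 0, 1, 1000);
}
// H(t) = F(t)^2 + G(t).
function H(t) {
  var f = F(t);
  return f * f + G(t);
}</code></pre>
<p>Evaluating <code>H</code> at a few values such as <code>0</code>,
<code>1</code>, and <code>5</code> is a quick way to test whatever you
conclude about \(H(t)\) in Exercise 1.</p>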
<hr />
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] Similar to proving \(\sum\limits_{i=0}^n i^3 =
\frac{n^2(n+1)^2}{4}\) by induction. <a href="#r1">↩</a></p>
</section>
https://www.akalin.com/parallelizing-flac-encoding
Parallelizing FLAC Encoding
2008-05-05T00:00:00-07:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<style type="text/css" media="all">
/*<![CDATA[*/
table.benchmark-results,
table.benchmark-results tr,
table.benchmark-results th {
border: 1px solid black;
}
table.benchmark-results {
font-family: "Arial", "Helvetica", sans-serif;
}
table.benchmark-results th,
table.benchmark-results td {
padding: .2em .4em;
}
/*]]>*/
</style>
<p>One thing I noticed ever since getting a multi-core system
was that the reference FLAC encoder is not multi-threaded. This isn't
a huge problem for most people as you can simply encode multiple files
at the same time but I usually rip my audio CDs into a single audio
file with a cue sheet instead of separate track files and so I am
usually encoding a single large audio file instead of multiple smaller
ones. Even so, encoding a CD-length audio file takes under a minute
but I thought it would be a fun and useful weekend project to see if I
could parallelize the simpler <a href="http://flac.cvs.sourceforge.net/flac/flac/examples/c/encode/file/main.c?revision=1.2&view=markup">example encoder</a>. The
<a href="http://flac.sourceforge.net/format.html">format specification</a> indicates that input blocks are
encoded independently, which makes the problem <a href="http://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly
parallel</a> and trawling through the <a href="http://www.mail-archive.com/flac-dev@xiph.org/msg00724.html">FLAC
mailing lists</a> reveals that no one has had the time
nor the inclination to look into it.</p>
<p>However, I was able to write a multithreaded FLAC encoder that
achieves near-linear speedup with only minor hacks to the libFLAC API.
Here are some encode times on an 8-core 2.8 GHz Xeon 5400 for a 636 MB
wave file (some caveats are discussed below):</p>
<table class="benchmark-results">
<tr>
<th>baseline</th><td>34.906s</td>
</tr>
<tr>
<th>1 thread</th><td>31.424s</td>
</tr>
<tr>
<th>2 threads</th><td>16.936s</td>
</tr>
<tr>
<th>4 threads</th><td>10.173s</td>
</tr>
<tr>
<th>8 threads</th><td>6.808s</td>
</tr>
</table>
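<p>To read the table in terms of speedup, the arithmetic can be
sketched as follows. The names are illustrative, and measuring against
the 1-thread run of the same encoder (rather than the baseline) is my
own choice here:</p>
<pre class="code-container"><code class="lang-javascript">// Encode times in seconds, copied from the table above.
var baselineSeconds = 34.906;
var threadedSeconds = { 1: 31.424, 2: 16.936, 4: 10.173, 8: 6.808 };
// Speedup relative to the 1-thread run, and parallel efficiency
// (speedup divided by the number of threads).
function speedupTable(times) {
  return Object.keys(times).map(function(key) {
    var threads = Number(key);
    var speedup = times[1] / times[threads];
    return {
      threads: threads,
      speedup: speedup,
      efficiency: speedup / threads
    };
  });
}</code></pre>
<p>Relative to the 1-thread run, this gives speedups of roughly 1.9×,
3.1×, and 4.6× for 2, 4, and 8 threads.</p>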
<p>I took the simple approach of sharding the input file into
<var>n</var> roughly equal pieces and passing them to <var>n</var>
encoder threads, assembling the output file from the <var>n</var>
output buffers. In general this is not a good way of partitioning the
workload as time is wasted if one shard takes significantly more time
to process but for my use case this isn't a significant problem.</p>
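<p>The partitioning itself can be sketched as below. This is an
illustration rather than the code from my encoder:
<code>computeShards</code> and its parameters are hypothetical names,
and rounding each shard to a whole number of sample frames is an
assumption of the sketch.</p>
<pre class="code-container"><code class="lang-javascript">// Split totalBytes of sample data into numThreads shards of roughly
// equal size. frameBytes is the size of one sample frame (e.g. 4 for
// 16-bit stereo), so that no frame straddles two shards. The last
// shard absorbs the remainder and so may be slightly larger.
function computeShards(totalBytes, numThreads, frameBytes) {
  var totalFrames = Math.floor(totalBytes / frameBytes);
  var framesPerThread = Math.floor(totalFrames / numThreads);
  var shards = [];
  for (var i = 0; i < numThreads; ++i) {
    var isLast = (i === numThreads - 1);
    var frames = isLast ? totalFrames - i * framesPerThread : framesPerThread;
    shards.push({
      offset: i * framesPerThread * frameBytes,
      size: frames * frameBytes
    });
  }
  return shards;
}</code></pre>
<p>For example, <code>computeShards(100, 3, 4)</code> yields shards of
32, 32, and 36 bytes starting at offsets 0, 32, and 64.</p>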
<div class="p">The best way to share the input file among the encoding threads is to
map it into memory. In fact, memory-mapped file I/O has so many
advantages in general that I'm surprised at how little I see it used,
although it does have the disadvantage of requiring a bit more
bookkeeping. Here is how I use it in my multithreaded encoder
(slightly paraphrased):
<pre class="code-container"><code class="language-cpp">#include <fcntl.h> /* open() */
#include <string.h> /* memcmp() */
#include <sys/mman.h> /* mmap()/munmap() */
#include <sys/stat.h> /* fstat() */
#include <unistd.h> /* close() */
int main(int argc, char *argv[]) {
  int fdin, i;
  struct stat buf;
  char *bufin;
  fdin = open(argv[1], O_RDONLY);
  fstat(fdin, &buf);
  bufin = mmap(NULL, buf.st_size, PROT_READ, MAP_SHARED, fdin, 0);
  /* The input file (passed in via argv[1]) is now mapped read-only to
     the memory region in bufin up to bufin + buf.st_size. */
  /* Note that you can work directly with the mapped input file
     instead of fread()ing the header into a buffer. */
  if((buf.st_size < WAV_HEADER_SIZE) ||
     memcmp(bufin, "RIFF", 4) ||
     memcmp(bufin+8, "WAVEfmt \020\000\000\000\001\000\002\000", 16) ||
     memcmp(bufin+32, "\004\000\020\000data", 8)) {
    /* Invalid input file: print error and exit. */
  }
  for (i = 0; i < num_threads; ++i) {
    shard_infos[i].bufin = bufin + WAV_HEADER_SIZE + i * bytes_per_thread;
    /* bufsize for the last thread may be slightly larger. */
    shard_infos[i].bufsize = bytes_per_thread;
  }
  /* Spawn encode threads (which calls encode_shard() below) passing
     an element of shard_infos to each. */
  ...
  munmap(bufin, buf.st_size);
  close(fdin);
}
FLAC__bool encode_shard(struct shard_info *shard_info) {
  FLAC__StreamEncoder *encoder = FLAC__stream_encoder_new();
  ...
  /* The input file is paged in lazily as this function accesses
     bufin from shard_info->bufin. */
  FLAC__stream_encoder_process_interleaved(encoder,
                                           shard_info->bufin,
                                           shard_info->bufsize);
  ...
  FLAC__stream_encoder_delete(encoder);
}</code></pre>
However, handling the output file is a bit trickier. Since the
encoded FLAC data output by each thread varies in size, we don't know
the right offset at which to write a shard's output until the encoding
threads for all previous shards are done. A convenient and fast way to
handle this is to use asynchronous I/O; we simply wait for the
encoding threads in shard order and queue up a write request as each
thread finishes. Here I use the POSIX
asynchronous I/O API in my multithreaded encoder (again, slightly
paraphrased):
<pre class="code-container"><code class="language-cpp">#include <aio.h> /* aio_*() */
#include <pthread.h> /* pthread_*() */
#include <string.h> /* memset() */
int main(int argc, char *argv[]) {
  int fdout, i;
  pthread_t threads[MAX_THREADS];
  struct aiocb aiocbs[MAX_THREADS];
  unsigned long byte_offset = 0;
  /* mmap input file in. */
  ...
  /* open() requires a mode argument when O_CREAT is given. */
  fdout = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
  /* Spawn encode threads passing an element of shard_infos to
     each. */
  ...
  /* Wait for each thread in sequence and queue up output writes. */
  /* We need to zero out any aiocb struct that we use before we fill
     in any members. */
  memset(aiocbs, 0, num_threads * sizeof(*aiocbs));
  for (i = 0; i < num_threads; ++i) {
    pthread_join(threads[i], NULL);
    aiocbs[i].aio_buf = shard_infos[i].bufout;
    aiocbs[i].aio_nbytes = shard_infos[i].bytes_written;
    aiocbs[i].aio_offset = byte_offset;
    aiocbs[i].aio_fildes = fdout;
    aio_write(&aiocbs[i]);
    byte_offset += shard_infos[i].bytes_written;
  }
  /* Wait for all output writes to finish. */
  for (i = 0; i < num_threads; ++i) {
    const struct aiocb *aiocbp = &aiocbs[i];
    aio_suspend(&aiocbp, 1, NULL);
    aio_return(&aiocbs[i]);
  }
  close(fdout);
}</code></pre>
</div>
<p>The POSIX API is a bit unwieldy for this use case; ideally, there
would be a version of <code>aio_suspend()</code> that would suspend the
calling process until <em>all</em> of the specified requests have completed.
As it is now the simplest way is to loop through the requests as
above, especially since the maximum number of simultaneous
asynchronous I/O requests is usually quite small (16 on my system).</p>
<p>Also, I found that the OS X implementation of <code>aio_write()</code>
did not obey this part of the specified behavior:</p>
<blockquote>
<pre> If O_APPEND is set for aiocbp->aio_fildes, aio_write() operations append
to the file in the same order as the calls were made. If O_APPEND is not
set for the file descriptor, the write operation will occur at the abso-
lute position from the beginning of the file plus aiocbp->aio_offset.</pre>
</blockquote>
<p>but it was just as easy (and clearer) to explicitly set the correct
offset.</p>
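<p>The offsets themselves are just a running sum of the shards' output
sizes. As a small sketch of that bookkeeping
(<code>outputOffsets</code> is a hypothetical helper for illustration,
not something in my encoder):</p>
<pre class="code-container"><code class="lang-javascript">// Given the number of bytes each shard produced (in shard order),
// return the offset in the output file at which each shard's data
// should be written.
function outputOffsets(bytesWritten) {
  var offsets = [];
  var byteOffset = 0;
  bytesWritten.forEach(function(n) {
    offsets.push(byteOffset);
    byteOffset += n;
  });
  return offsets;
}</code></pre>
<p>For shards that produced 10, 7, and 12 bytes, the writes would go
to offsets 0, 10, and 17. Each offset depends only on the sizes of the
shards before it, which is why the writes can be queued in shard order
as the threads are joined.</p>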
<p>I had to hack up libFLAC a bit to implement my multithreaded encoder.
I exposed the <code>update_metadata_()</code> function to make it easy to write the
correct number of total samples in the metadata field and also to zero
out the min/max framesize fields. I also exposed the
<code>FLAC__stream_encoder_set_do_md5()</code> function (which should
have been exposed in the first place) so that I can turn off the writing of
the md5 field in the metadata. Finally, I added the function
<code>FLAC__stream_encoder_set_current_frame_number()</code> so that the
correct frame numbers are written at encode time.</p>
<p>For comparison purposes I turn off md5 calculation in my multithreaded
encoder as well as the baseline one. Since calling
<code>FLAC__stream_encoder_set_current_frame_number()</code> causes
crashes with verification turned on, I also turn that off. The numbers
above reflect that, so they underestimate how long a production
multithreaded encoder would take. However, the essential behavior
of the program shouldn't change much.</p>
<p><a href="/parallelizing-flac-encoding-files/patch-libFLAC.in">Here</a> is a patch file for the <a href="http://downloads.sourceforge.net/flac/flac-1.2.1.tar.gz?modtime=1189961849&big_mirror=0">flac 1.2.1
source</a> that implements the hacks I described
above. <a href="/parallelizing-flac-encoding-files/mt_encode.c">Here</a> is the source for my multithreaded FLAC
encoder. I've tested it with <code>i686-apple-darwin9-gcc-4.0.1</code>
and <code>i686-apple-darwin9-gcc-4.2.1</code> on Mac OS X. I got the
above numbers compiling
<code>mt_encode.c</code> with gcc 4.2.1 and the switches <code>-Wall
-Werror -g -O2 -ansi</code>.</p>
<hr />
https://www.akalin.com/bfpp
bfpp
2008-04-23T00:00:00-07:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<div class="p">Okay, I lied; you can't <em>really</em> embed <a href="http://www.muppetlabs.com/~breadbox/bf/">brainfuck</a> in C++
but you can get pretty close. Here is an example:
<pre class="code-container"><code class="language-cpp">#include "bfpp.h"
int main() {
// Prints out factorial numbers in sequence. Adapted from
// http://www.hevanet.com/cristofd/brainfuck/factorial.b .
bfpp
* + + + + + + + + + + * * * + * + -- * * * + -- - -- & & & & & -- +
& & & & & ++ * * -- -- - ++ * -- & & + * + * - ++ & -- * + & - ++ &
-- * + & - -- * + & - -- * + & - -- * + & - -- * + & - -- * + & - --
* + & - -- * + & - -- * + & - -- * -- - ++ * * * * + * + & & & & & &
- -- * + & - ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ * -- & + * - ++ + * *
* * * ++ & & & & & -- & & & & & ++ * * * * * * * -- * * * * * ++ + +
-- - & & & & & ++ * * * * * * - ++ + * * * * * ++ & -- * + + & - ++
& & & & -- & -- * + & - ++ & & & & ++ * * -- - * -- - ++ + + + + + +
-- & + + + + + + + + * - ++ * * * * ++ & & & & & -- & -- * + * + & &
- ++ * ! & & & & & ++ * ! * * * * ++
end_bfpp
}</code></pre>
I call this variant “bfpp” as it has some pretty significant
differences from brainfuck. First of all, some commands had to be
adapted; although <code>+</code> and <code>-</code> remain the same,
<ul>
<li><code><</code> and <code>></code> were changed to <code>&</code> and
<code>*</code>,</li>
<li><code>.</code> and <code>,</code> were changed to <code>!</code> and <code>~</code>
(mnemonic: <code>!</code> contains <code>.</code> within it and <code>~</code>
is kind of like a sideways <code>,</code>),</li>
<li>and <code>[</code> and <code>]</code> were changed to <code>--</code> and
<code>++</code> (mnemonic: <code>[</code> and <code>]</code> are the most
complex brainfuck commands [to implement, at least] and so deserve to be mapped to the wider
and more prominent operators).</li>
</ul>
This magic is made possible by the fact that brainfuck has exactly
eight commands and C++ has exactly eight overloadable symbolic unary
operators. Add some macros to hide the C++ scaffolding behind some
delimiters and you have a convincing illusion of an embedded language.</div>
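<p>The command mapping above is mechanical enough to sketch in a few
lines. Note that this is not <code>bf2bfpp.c</code>; it is an
illustrative JavaScript version, and the names
<code>BF_TO_BFPP</code> and <code>bfToBfpp</code> are made up for the
example:</p>
<pre class="code-container"><code class="lang-javascript">// Mapping from brainfuck commands to bfpp tokens, as listed above.
var BF_TO_BFPP = {
  '+': '+', '-': '-',
  '<': '&', '>': '*',
  '.': '!', ',': '~',
  '[': '--', ']': '++'
};
// Translate a brainfuck program to a string of bfpp tokens.
// Characters that aren't brainfuck commands are treated as comments
// and dropped. Tokens are separated by spaces so that, for example,
// two increments come out as "+ +" rather than the token "++".
function bfToBfpp(source) {
  var tokens = [];
  for (var i = 0; i < source.length; ++i) {
    var ch = source.charAt(i);
    if (Object.prototype.hasOwnProperty.call(BF_TO_BFPP, ch)) {
      tokens.push(BF_TO_BFPP[ch]);
    }
  }
  return tokens.join(' ');
}</code></pre>
<p>For example, <code>bfToBfpp('>++[<.>-]')</code> returns
<code>* + + -- & ! * - ++</code>, which would then go between
the <code>bfpp</code> and <code>end_bfpp</code> delimiters.</p>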
<p><a href="/bfpp-files/bfpp.h">bfpp.h</a> implements a simple (<100 lines) bfpp interpreter and
the magic described above, and <a href="/bfpp-files/bf2bfpp.c">bf2bfpp.c</a> is a
straightforward translator from brainfuck to bfpp. Gotta love C++!</p>
<hr />
https://www.akalin.com/longest-palindrome-linear-time
Finding the Longest Palindromic Substring in Linear Time
2007-11-28T00:00:00-08:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<style type="text/css" media="all">
/*<![CDATA[*/
span.palind {
color: red;
}
/*]]>*/
</style>
<script>
function trackOutboundLink(url) {
ga('send', 'event', 'outbound', 'click', url, {
'hitCallback': function() { document.location = url; }
});
}
</script>
<p>Another <a href="http://www.reddit.com/r/programming/comments/2dykz/finding_palindromes_repairing_endos_dna_and_the/"
onclick="trackOutboundLink('http://programming.reddit.com/info/2dykz/comments/c2e7r0');
return false;">interesting problem</a> I stumbled across on reddit is
finding the longest substring of a given string that is a palindrome.
I
found <a href="http://johanjeuring.blogspot.com/2007/08/finding-palindromes.html"
onclick="trackOutboundLink('http://johanjeuring.blogspot.com/2007/08/finding-palindromes.html');
return false;">the explanation on Johan Jeuring's blog</a> somewhat
confusing and I had to spend some time poring over the Haskell code
(eventually rewriting it in Python) and walking through examples
before it "clicked." I haven't found any other explanations of the
same approach so hopefully my explanation below will help the next
person who is curious about this problem.</p>
<p>Of course, the most naive solution would be to exhaustively examine
all \(n \choose 2\) substrings of the given \(n\)-length string, test
whether each one is a palindrome, and keep track of the longest one seen so
far. This has complexity \(O(n^3)\), but we can easily do better by
realizing that a palindrome is centered on either a letter (for
odd-length palindromes) or a space between letters (for even-length
palindromes). Therefore we can examine all \(2n + 1\) possible centers
and find the longest palindrome for that center, keeping track of the
overall longest palindrome. This has complexity \(O(n^2)\).</p>
<div class="p">It is not immediately clear that we can do better but
if we're told that a \(Θ(n)\) algorithm exists we can infer that
the algorithm is most likely structured as an iteration through all
possible centers. As an off-the-cuff first attempt, we can adapt the
above algorithm by keeping track of the current center and expanding
until we find the longest palindrome around that center, in which case
we then consider the last letter (or space) of that palindrome as the
new center. The algorithm (which isn't correct) looks like this
informally:
<ol type="1">
<li>Set the current center to the first letter.</li>
<li>Loop while the current center is valid:
<ol type="a">
<li>Expand to the left and right simultaneously until we find
the largest palindrome around this center.</li>
<li>If the current palindrome is bigger than the stored maximum
one, store the current one as the maximum one.</li>
<li>Set the space following the current palindrome as the
current center unless the two letters immediately surrounding
it are different, in which case set the last letter of the
current palindrome as the current center.</li>
</ol>
</li>
<li>Return the stored maximum palindrome.</li>
</ol>
</div>
<p>This seems to work but it doesn't handle all cases: consider the
string "abababa". The first non-trivial palindrome we see is "<span
class="palind">a</span>|bababa", followed by "<span
class="palind">aba</span>|baba". Considering the current space as the
center doesn't get us anywhere but considering the preceding letter
(the second 'a') as the center, we can expand to get "<span
class="palind">ababa</span>|ba". From this state, considering the
current space again doesn't get us anywhere but considering the preceding
letter as the center, we can expand to get "ab<span
class="palind">ababa</span>|". However, this is incorrect as the
longest palindrome is actually the entire string! We can remedy this
case by changing the algorithm to try and set the new center to be one
before the end of the last palindrome, but it is clear that having a
fixed "lookbehind" doesn't solve the general case and anything more
than that will probably bump us back up to quadratic time.</p>
<div class="p">The key question is this: given the state from the example above,
"<span class="palind">ababa</span>|ba", what makes the second 'b' so
special that it should be the new center? To use another example, in
"<span class="palind">abcbabcba</span>|bcba", what makes the second
'c' so special that it should be the new center? Hopefully, the
answer to this question will lead to the answer to the more important
question: once we stop expanding the palindrome around the current
center, how do we pick the next center? To answer the first question,
first notice that the current palindromes in the above examples
themselves contain smaller non-trivial palindromes: "ababa" contains
"aba" and "abcbabcba" contains "abcba" which also contains "bcb".
Then, notice that if we expand around the "special" letters, we get a
palindrome which shares a right edge with the current palindrome; that
is, <em>the longest palindromes around the special letters are proper
suffixes of the current palindrome</em>. With a little thought, we
can then answer the second question: <em>to pick the next center, take
the center of the longest palindromic proper suffix of the current
palindrome</em>. Our algorithm then looks like this:
<ol type="1">
<li>Set the current center to the first letter.</li>
<li>Loop while the current center is valid:
<ol type="a">
<li>Expand to the left and right simultaneously until we find
the largest palindrome around this center.</li>
<li>If the current palindrome is bigger than the stored maximum
one, store the current one as the maximum one.</li>
<li>Find the maximal palindromic proper suffix of the current
palindrome.</li>
<li>Set the center of the suffix from step 2c as the current center
and start expanding from the suffix, since it is already known to be
palindromic.</li>
</ol>
</li>
<li>Return the stored maximum palindrome.</li>
</ol>
</div>
<p>However, unless step 2c can be done efficiently, it will cause the
algorithm to be superlinear. Doing step 2c efficiently seems
impossible since we have to examine the entire current palindrome to
find the longest palindromic suffix unless we somehow keep track of
extra state as we progress through the input string. Notice that the
longest palindromic suffix would by definition also be a palindrome of
the input string so it might suffice to keep track of every palindrome
that we see as we move through the string and hopefully, by the time
we finish expanding around a given center, we would know where all the
palindromes with centers lying to the left of the current one are.
However, if the longest palindromic suffix has a center to the right
of the current center, we would not know about it. But we also have
at our disposal the very useful fact that <em>a palindromic proper
suffix of a palindrome has a corresponding dual palindromic proper
prefix</em>. For example, in one of our examples above, "abcbabcba",
notice that "abcba" appears twice: once as a prefix and once as a
suffix. Therefore, while we wouldn't know about all the palindromic
suffixes of our current palindrome, we would know about either it or
its dual.</p>
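<p>As a quick sanity check of this duality, here is a minimal Python
sketch that lists the palindromic proper suffixes of a palindrome and
confirms that each one also appears as a prefix. The helper names are
made up for illustration and aren't part of the algorithm itself.</p>

```python
def is_palindrome(s):
    """Returns True if s reads the same forwards and backwards."""
    return s == s[::-1]


def palindromic_proper_suffixes(p):
    """Returns the non-empty palindromic proper suffixes of p, longest first."""
    return [p[k:] for k in range(1, len(p)) if is_palindrome(p[k:])]


p = "abcbabcba"
suffixes = palindromic_proper_suffixes(p)
# Reversing the palindrome p maps every suffix onto a prefix, and a
# palindromic suffix is its own reverse, so it must also be a prefix.
assert all(p.startswith(s) for s in suffixes)
print(suffixes)
```

<p>The check relies on <code>p</code> itself being a palindrome; for an
arbitrary string the assertion can fail.</p>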
<p>Another crucial realization is the fact that we don't have to keep
track of all the palindromes we've seen. To use the example
"abcbabcba" again, we don't really care about "bcb" that much, since
it's already contained in the palindrome "abcba". In fact, we only
really care about keeping track of the longest palindromes for a given
center or equivalently, the length of the longest palindrome for a
given center. But this is simply a more general version of our
original problem, which is to find the longest palindrome around
<em>any</em> center! Thus, if we can keep track of this state
efficiently, maybe by taking advantage of the properties of
palindromes, we don't have to keep track of the maximal palindrome and
can instead figure it out at the very end.</p>
<p>Unfortunately, we seem to be back where we started; the second
naive algorithm that we have is simply to loop through all possible
centers and for each one find the longest palindrome around that
center. But our discussion has led us to a different incremental
formulation: given a current center, the longest palindrome around
that center, and a list of the lengths of the longest palindromes
around the centers to the left of the current center, can we figure
out the new center to consider and extend the list of longest
palindrome lengths up to that center efficiently? For example, if we
have the state:</p>
<p><"ab<span class="palind">a</span>ba|??", [0, 1, 0, 3, 0, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?]></p>
<p>where the highlighted letter is the current center, the vertical line
is our current position, the question marks represent unread
characters or unknown quantities, and the array represents the list
of longest palindrome lengths by center, can we get to the state:</p>
<p><"aba<span class="palind">b</span>a|??", [0, 1, 0, 3, 0, 5, 0, ?, ?, ?, ?, ?, ?, ?, ?]></p>
<p>and then to:</p>
<p><"aba<span class="palind">b</span>aba|", [0, 1, 0, 3, 0, 5, 0, 7, 0, 5, 0, 3, 0, 1, 0]></p>
<p>efficiently? The crucial thing to notice is that the longest
palindrome lengths array (we'll call it simply the lengths array) in
the final state is palindromic since the original string is
palindromic. In fact, the lengths array obeys a more general
property: <em>the longest palindrome <var>d</var> places to the right
of the current center (the <var>d</var>-right palindrome) is at least
as long as the longest palindrome <var>d</var> places to the left of the current
center (the <var>d</var>-left palindrome) if the <var>d</var>-left
palindrome is completely contained in the longest palindrome around
the current center (the center palindrome), and it is of equal length
if the <var>d</var>-left palindrome is not a prefix of the center
palindrome or if the center palindrome is a suffix of the entire
string</em>. This then implies that we can more or less fill in the
values to the right of the current center from the values to the left
of the current center. For example, from [0, 1, 0, 3, 0, 5, ?, ?, ?,
?, ?, ?, ?, ?, ?] we can get to [0, 1, 0, 3, 0, 5, 0, ≥3?, 0,
≥1?, 0, ?, ?, ?, ?]. This also implies that the first unknown
entry (in this case, ≥3?) should be the new center because it
means that the center palindrome is not a suffix of the input string
(i.e., we're not done) and that the <var>d</var>-left palindrome is a
prefix of the center palindrome.</p>
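<p>To make the property concrete, here is a small Python sketch that
checks it by brute force. It assumes the lengths array is indexed by
center as above (even indices are spaces, odd indices are letters), and
it deliberately uses a quadratic reference computation, so it is a test
of the property and not the linear-time algorithm. The function names are
hypothetical.</p>

```python
def palindrome_lengths(seq):
    """Quadratic reference: longest palindrome length at each of the
    2 * len(seq) + 1 centers."""
    n = len(seq)
    lengths = []
    for c in range(2 * n + 1):
        s = c // 2
        e = s + c % 2
        while s > 0 and e < n and seq[s - 1] == seq[e]:
            s -= 1
            e += 1
        lengths.append(e - s)
    return lengths


def check_mirror_property(seq):
    """Checks the d-left / d-right relationship around every center."""
    lengths = palindrome_lengths(seq)
    for c, center_len in enumerate(lengths):
        for d in range(1, center_len + 1):
            left = lengths[c - d]
            right = lengths[c + d]
            # Room between the d-left center and the left edge of the
            # center palindrome.
            bound = center_len - d
            if left != bound:
                # The d-left palindrome is not a prefix of the center
                # palindrome, so the d-right value is fully determined.
                if right != min(left, bound):
                    return False
            elif right < bound:
                # The d-left palindrome is a prefix, so the d-right
                # palindrome is at least as long and may grow further.
                return False
    return True


print(palindrome_lengths("abababa"))
print(check_mirror_property("abababa"))
```

<p>The <code>min(left, bound)</code> expression is the same bound that
the Python implementation further down applies when it copies values to
the right of the current center.</p>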
<div class="p">From these observations we can construct our final algorithm which
returns the lengths array, and from which it is easy to find the
longest palindromic substring:
<ol type="1">
<li>Initialize the lengths array to have one (initially unknown) entry
for each possible center.</li>
<li>Set the current center to the first center.</li>
<li>Loop while the current center is valid:
<ol type="a">
<li>Expand to the left and right simultaneously until we find
the largest palindrome around this center.</li>
<li>Fill in the appropriate entry in the longest palindrome
lengths array.</li>
<li>Iterate through the longest palindrome lengths array
backwards and fill in the corresponding values to the right of
the entry for the current center until an unknown value (as
described above) is encountered.</li>
<li>Set the new center to the index of this unknown value.</li>
</ol>
</li>
<li>Return the lengths array.</li>
</ol>
</div>
<p>Note that at each step of the algorithm we're either incrementing
our current position in the input string or filling in an entry in the
lengths array. Since the lengths array has size linear in the size of
the input string, the algorithm has worst-case linear running time.
Since given the lengths array we can find and return the longest
palindromic substring in linear time, a linear-time algorithm to find
the longest palindromic substring is the composition of these two
operations.</p>
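<p>For the second of those two operations, here is a minimal sketch of
recovering the substring from a lengths array. It assumes the array is
indexed by center as described above; the function name is mine, and the
example arrays are written out by hand.</p>

```python
def longest_palindrome_from_lengths(seq, lengths):
    """Given the per-center lengths array for seq, returns the longest
    palindromic substring (the leftmost one if there are ties)."""
    best = max(range(len(lengths)), key=lambda c: lengths[c])
    # A palindrome seq[s:e] has center index s + e and length e - s, so
    # its start is (center - length) / 2, which is always an integer.
    start = (best - lengths[best]) // 2
    return seq[start:start + lengths[best]]


print(longest_palindrome_from_lengths(
    "abababa", [0, 1, 0, 3, 0, 5, 0, 7, 0, 5, 0, 3, 0, 1, 0]))
```

<p>This is a single pass over the lengths array plus one slice, so it is
linear in the length of the input.</p>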
<div class="p">Here is Python code that implements the above algorithm (although
it is closer to Johan Jeuring's Haskell implementation than to the
above description):
<pre class="code-container"><code class="language-python">def fastLongestPalindromes(seq):
    """
    Behaves identically to naiveLongestPalindromes (see below), but
    runs in linear time.
    """
    seqLen = len(seq)
    l = []
    i = 0
    palLen = 0
    # Loop invariant: seq[(i - palLen):i] is a palindrome.
    # Loop invariant: len(l) >= 2 * i - palLen. The code path that
    # increments palLen skips the l-filling inner-loop.
    # Loop invariant: len(l) < 2 * i + 1. Any code path that
    # increments i past seqLen - 1 exits the loop early and so skips
    # the l-filling inner loop.
    while i < seqLen:
        # First, see if we can extend the current palindrome. Note
        # that the center of the palindrome remains fixed.
        if i > palLen and seq[i - palLen - 1] == seq[i]:
            palLen += 2
            i += 1
            continue

        # The current palindrome is as large as it gets, so we append
        # it.
        l.append(palLen)

        # Now to make further progress, we look for a smaller
        # palindrome sharing the right edge with the current
        # palindrome. If we find one, we can try to expand it and see
        # where that takes us. At the same time, we can fill the
        # values for l that we neglected during the loop above. We
        # make use of our knowledge of the length of the previous
        # palindrome (palLen) and the fact that the values of l for
        # positions on the right half of the palindrome are closely
        # related to the values of the corresponding positions on the
        # left half of the palindrome.

        # Traverse backwards starting from the second-to-last index up
        # to the edge of the last palindrome.
        s = len(l) - 2
        e = s - palLen
        for j in range(s, e, -1):
            # d is the value l[j] must have in order for the
            # palindrome centered there to share the left edge with
            # the last palindrome. (Drawing it out is helpful to
            # understanding why the - 1 is there.)
            d = j - e - 1

            # We check to see if the palindrome at l[j] shares a left
            # edge with the last palindrome. If so, the corresponding
            # palindrome on the right half must share the right edge
            # with the last palindrome, and so we have a new value for
            # palLen.
            #
            # An exercise for the reader: in this place in the code you
            # might think that you can replace the == with >= to improve
            # performance. This does not change the correctness of the
            # algorithm but it does hurt performance, contrary to
            # expectations. Why?
            if l[j] == d:
                palLen = d
                # We actually want to go to the beginning of the outer
                # loop, but Python doesn't have loop labels. Instead,
                # we use an else block corresponding to the inner
                # loop, which gets executed only when the for loop
                # exits normally (i.e., not via break).
                break

            # Otherwise, we just copy the value over to the right
            # side. We have to bound l[j] because palindromes on the
            # left side could extend past the left edge of the last
            # palindrome, whereas their counterparts won't extend past
            # the right edge.
            l.append(min(d, l[j]))
        else:
            # This code is executed in two cases: when the for loop
            # isn't taken at all (palLen == 0) or the inner loop was
            # unable to find a palindrome sharing the left edge with
            # the last palindrome. In either case, we're free to
            # consider the palindrome centered at seq[i].
            palLen = 1
            i += 1

    # We know from the loop invariant that len(l) < 2 * seqLen + 1, so
    # we must fill in the remaining values of l.

    # Obviously, the last palindrome we're looking at can't grow any
    # more.
    l.append(palLen)

    # Traverse backwards starting from the second-to-last index up
    # until we get l to size 2 * seqLen + 1. We can deduce from the
    # loop invariants we have enough elements.
    lLen = len(l)
    s = lLen - 2
    e = s - (2 * seqLen + 1 - lLen)
    for i in range(s, e, -1):
        # The d here uses the same formula as the d in the inner loop
        # above. (Computes distance to left edge of the last
        # palindrome.)
        d = i - e - 1
        # We bound l[i] with min for the same reason as in the inner
        # loop above.
        l.append(min(d, l[i]))

    return l</code></pre>
And here is a naive quadratic version for comparison:
<pre class="code-container"><code class="language-python">def naiveLongestPalindromes(seq):
    """
    Given a sequence seq, returns a list l such that l[2 * i + 1]
    holds the length of the longest palindrome centered at seq[i]
    (which must be odd), l[2 * i] holds the length of the longest
    palindrome centered between seq[i - 1] and seq[i] (which must be
    even), and l[2 * len(seq)] holds the length of the longest
    palindrome centered past the last element of seq (which must be 0,
    as is l[0]).

    The actual palindrome for l[i] is seq[s:(s + l[i])] where s is i
    // 2 - l[i] // 2. (// is integer division.)

    Example:
    naiveLongestPalindromes('ababa') -> [0, 1, 0, 3, 0, 5, 0, 3, 0, 1, 0]

    Runs in quadratic time.
    """
    seqLen = len(seq)
    lLen = 2 * seqLen + 1
    l = []

    for i in range(lLen):
        # If i is even (i.e., we're on a space), this will produce e
        # == s. Otherwise, we're on an element and e == s + 1, as a
        # single letter is trivially a palindrome.
        s = i // 2
        e = s + i % 2

        # Loop invariant: seq[s:e] is a palindrome.
        while s > 0 and e < seqLen and seq[s - 1] == seq[e]:
            s -= 1
            e += 1

        l.append(e - s)

    return l</code></pre>
Note that this is not the only efficient solution to this problem;
building a suffix tree is linear in the length of the input string and
you can use one to solve this problem but as Johan also mentions,
that is a much less direct and efficient solution compared to this
one.</div>
<hr />
https://www.akalin.com/number-theory-haskell-foray
A Foray into Number Theory with Haskell
2007-07-06T00:00:00-07:00
Fred Akalin
https://www.akalin.com/
© Fred Akalin
2005–2018.
All rights reserved.
<script>
// See https://github.com/Khan/KaTeX/issues/85 .
KaTeXMacros = {
"\\cfrac": "\\dfrac{#1}{#2}\\kern-1.2pt",
};
</script>
<div class="p">I encountered
<a href="http://programming.reddit.com/info/216p9/comments">an
interesting problem</a> on reddit a few days ago which can be
paraphrased as follows:
<blockquote><p>Find a perfect square \(s\) such that \(1597s + 1\) is also
a perfect square.</p></blockquote>
</div>
<p>After reading the discussion about implementing a brute-force
algorithm to solve the problem and spending a futile half-hour or so
trying my hand at finding a better way, someone noticed that the problem
was an instance
of <a href="http://en.wikipedia.org/wiki/Pell%27s_equation">Pell's
equation</a> which is known to have an elegant and fast solution;
indeed, he posted
a <a href="http://programming.reddit.com/info/216p9/comments/c21dpn">one-liner
in Mathematica</a> solving the given problem. However, I wanted to try
coding up the solution myself as the Mathematica solution, while
succinct, isn't very enlightening since the heavy lifting is already
done by a built-in function and an arbitrary constant was used for this
particular instance of Pell's equation.</p>
<p>Pell's equation is simply the
<a href="http://en.wikipedia.org/wiki/Diophantine_equation">Diophantine
equation</a> \(x^2 - dy^2 = 1\) for a given
\(d\)<sup><a href="#fn1" id="r1">[1]</a></sup>; being Diophantine means
that all variables involved take on only integer values. (In our
original problem, \(d\) is 1597 and we are asked for \(y^2\).) The
solution involves finding the <em>continued fraction expansion</em> of
\(\sqrt{d}\), finding the first <em>convergent</em> of the expansion
that satisfies Pell's equation, and then generating all other
solutions from that
<em>fundamental solution</em>. We rule out the trivial solution \(x =
1\), \(y = 0\) which also implies that if \(d\) is a perfect square then
there is no solution.</p>
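<p>For small \(d\) you can find the fundamental solution by sheer brute
force, which is handy for checking the cleverer method described below.
Here is a minimal Python sketch; the function name and the
<code>max_y</code> cutoff are my own choices for illustration.</p>

```python
from math import isqrt


def brute_force_pell(d, max_y=10**6):
    """Returns the smallest (x, y) with y >= 1 and x^2 - d*y^2 == 1,
    or None if no such pair exists with y <= max_y."""
    for y in range(1, max_y + 1):
        x_squared = d * y * y + 1
        x = isqrt(x_squared)
        if x * x == x_squared:
            return x, y
    return None


print(brute_force_pell(2))
print(brute_force_pell(13))
# d = 4 is a perfect square, so only the trivial solution exists.
print(brute_force_pell(4, max_y=1000))
```

<p>This is hopeless for \(d = 1597\), where the smallest \(y\) has 47
digits, which is why a better method is needed.</p>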
<p>A continued fraction is an expression of the form:
\[
x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cfrac{1}{\ddots\,}}}}
\]
where all \(a_i\) are integers and all but the
first one are positive. The standard math notation for continued
fractions is quite unwieldy so from now on we'll use \(\left \langle
a_0; a_1, a_2, \dotsc \right \rangle\) instead of the above.</p>
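<p>To get a feel for the notation, here is a small Python sketch that
evaluates a finite continued fraction exactly by working from the
innermost term outwards. The function name is made up for this
example.</p>

```python
from fractions import Fraction


def evaluate_continued_fraction(terms):
    """Evaluates <a_0; a_1, ..., a_k> exactly, given terms = [a_0, ..., a_k]."""
    value = Fraction(terms[-1])
    for a in reversed(terms[:-1]):
        value = a + 1 / value
    return value


print(evaluate_continued_fraction([1, 2, 2, 2]))
print(evaluate_continued_fraction([3, 7, 15, 1]))
```

<p>The first example is a truncation of the expansion of \(\sqrt{2}\) and
the second is the well-known approximation \(\frac{355}{113}\) of
\(\pi\).</p>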
<div class="p">The theory of continued fractions is a rich and beautiful one but
for now we'll just state a few facts:
<ul>
<li>The continued fraction expansion of a number is (mostly) unique.</li>
<li>The continued fraction expansion of a rational number is
finite.</li>
<li>The continued fraction expansion of an irrational number is
infinite.</li>
<li>A <a href="http://en.wikipedia.org/wiki/Quadratic_surd">quadratic
surd</a> is a number of the form \(\frac{a + \sqrt{b}}{c}\)
where
\(a\), \(b\), and \(c\) are integers. Except
maybe for the first term, the continued fraction expansion of a
quadratic surd is periodic; that is, it repeats forever after a
certain number of terms. This applies in particular to the square root
of an integer.</li>
<li>Truncating an infinite continued fraction to get a finite
continued fraction gives (in some sense) an optimal rational
approximation to the irrational number represented by the infinite
continued fraction.</li>
</ul>
</div>
<div class="p">Given a quadratic surd it is fairly easy to manipulate it into the
form \(a + \frac{1}{q}\) where \(q\) is another quadratic surd. This fact
can be used to come up with an algorithm to find the continued
fraction expansion of a square
root. Wikipedia <a href="http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Continued_fraction_expansion">explains
it pretty well</a> so I won't go over it, but here is my Haskell
implementation:
<pre class="code-container"><code class="language-haskell">sqrt_continued_fraction n = [ a_i | (_, _, a_i) <- mdas ]
    where
      mdas = iterate get_next_triplet (m_0, d_0, a_0)

      m_0 = 0
      d_0 = 1
      a_0 = truncate $ sqrt $ fromIntegral n

      get_next_triplet (m_i, d_i, a_i) = (m_j, d_j, a_j)
          where
            m_j = d_i * a_i - m_i
            d_j = (n - m_j * m_j) `div` d_i
            a_j = (a_0 + m_j) `div` d_j</code></pre>
and here are some examples:
<pre class="code-container"><code class="language-shell">Prelude Main> take 20 $ sqrt_continued_fraction 2
[1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]
Prelude Main> take 20 $ sqrt_continued_fraction 103
[10,6,1,2,1,1,9,1,1,2,1,6,20,6,1,2,1,1,9,1]
Prelude Main> take 20 $ sqrt_continued_fraction 36
[6,*** Exception: divide by zero</code></pre>
</div>
<p>(Note that we're assuming that we won't be called with a perfect
square. Also, do you notice anything interesting about the periodic
portion of the continued fractions, particularly of \(\sqrt{103}\)?)</p>
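<p>For readers more comfortable with Python, here is a rough equivalent
of the Haskell function above, written as a generator. It is a sketch of
the same recurrence and not a line-by-line translation: it uses
<code>math.isqrt</code> in place of the floating-point square root and
raises an error for perfect squares, both of which are my own
choices.</p>

```python
from itertools import islice
from math import isqrt


def sqrt_continued_fraction(n):
    """Yields the terms a_0, a_1, ... of the continued fraction
    expansion of sqrt(n), for a positive non-square integer n."""
    a_0 = isqrt(n)
    if a_0 * a_0 == n:
        raise ValueError("n must not be a perfect square")
    m, d, a = 0, 1, a_0
    while True:
        yield a
        m = d * a - m
        d = (n - m * m) // d
        a = (a_0 + m) // d


print(list(islice(sqrt_continued_fraction(2), 20)))
print(list(islice(sqrt_continued_fraction(103), 20)))
```

<p>The generator plays the role of the lazy infinite list, and
<code>islice</code> plays the role of <code>take</code>.</p>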
<div class="p">For those who are unfamiliar with Haskell, here's a quick list of key facts:
<ul>
<li>The first line takes a list of triplets and forms a list of all
third elements, which is what we're interested in. (The other two
elements of the triplet are auxiliary variables used by the
algorithm.)</li>
<li><code>iterate</code> is a function which takes in another
function <code>f</code>, an initial variable <code>x</code>, and
returns the infinite list <code>[ x, f(x), f(f(x)), f(f(f(x))),
... ]</code>.</li>
<li>Note that Haskell
uses <a href="http://en.wikipedia.org/wiki/Lazy_evaluation">lazy
evaluation</a> and so this function does not take an infinite amount
of time to run; all its elements are evaluated (and memoized) only
when needed.</li>
<li>The rest of the function is a straightforward representation of
the meat of the algorithm described in the above Wikipedia entry.</li>
</ul>
</div>
<p>It may not be clear what \(\sqrt{d}\) and its continued fraction
expansion have to do with solving Pell's equation. However, notice that
if \(x\) and \(y\) solve Pell's equation then manipulating Pell's equation
to get \(\sqrt{d}\) on one side reveals that \(\frac{x}{y}\) is a good
approximation of \(\sqrt{d}\). In fact, it is so good that you can prove
that \(\frac{x}{y}\) <em>must</em> come from truncating the continued
fraction expansion of \(\sqrt{d}\).</p>
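<p>To spell out that manipulation (this is the standard argument,
sketched here for completeness): factoring the left-hand side of Pell's
equation and dividing by \(y (x + y\sqrt{d})\) gives
\[
(x - y\sqrt{d})(x + y\sqrt{d}) = 1
\quad\Longrightarrow\quad
\frac{x}{y} - \sqrt{d} = \frac{1}{y (x + y\sqrt{d})}\text{.}
\]
Since \(x^2 = d y^2 + 1 > d y^2\), we have \(x > y\sqrt{d}\) for positive
\(x\) and \(y\), and so
\[
0 < \frac{x}{y} - \sqrt{d} < \frac{1}{2 y^2 \sqrt{d}} < \frac{1}{2 y^2}\text{.}
\]
A classical theorem of Legendre says that any fraction \(\frac{x}{y}\)
within \(\frac{1}{2 y^2}\) of an irrational number must come from
truncating that number's continued fraction expansion.</p>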
<p>This leads us to the following: if you have an infinite continued
fraction \(\left \langle a_0; a_1, a_2, \dotsc \right \rangle\) you can
truncate it into a finite continued fraction \(\left \langle a_0; a_1,
a_2, \dotsc, a_i \right \rangle\) and simplify it into the rational
number \(\frac{p_i}{q_i}\). The sequence \(\frac{p_0}{q_0},
\frac{p_1}{q_1}, \frac{p_2}{q_2}, \dotsc\) forms the
<a href="http://en.wikipedia.org/wiki/Convergent_%28continued_fraction%29"><em>convergents</em></a>
of \(\left \langle a_0; a_1, a_2, \dotsc \right \rangle\) and converges to
its represented irrational number.</p>
<div class="p">It turns out you can calculate \(p_{i+1}\) and \(q_{i+1}\)
efficiently from \(p_i\), \(q_i\), \(p_{i-1}\), \(q_{i-1}\), and \(a_{i+1}\)
using
the <a href="http://en.wikipedia.org/wiki/Fundamental_recurrence_formulas"><em>fundamental
recurrence formulas</em></a> (which can be proved by induction). Here
is my Haskell implementation:
<pre class="code-container"><code class="language-haskell">get_convergents (a_0 : a_1 : as) = pqs
    where
      pqs = (p_0, q_0) : (p_1, q_1) :
            zipWith3 get_next_convergent pqs (tail pqs) as

      p_0 = a_0
      q_0 = 1

      p_1 = a_1 * a_0 + 1
      q_1 = a_1

      get_next_convergent (p_i, q_i) (p_j, q_j) a_k = (p_k, q_k)
          where
            p_k = a_k * p_j + p_i
            q_k = a_k * q_j + q_i</code></pre>
and some more examples:
<pre class="code-container"><code class="language-shell">Prelude Main> take 8 $ get_convergents $ sqrt_continued_fraction 2
[(1,1),(3,2),(7,5),(17,12),(41,29),(99,70),(239,169),(577,408)]
Prelude Main> take 8 $ get_convergents $ sqrt_continued_fraction 103
[(10,1),(61,6),(71,7),(203,20),(274,27),(477,47),(4567,450),(5044,497)]
Prelude Main> take 8 $ get_convergents $ sqrt_continued_fraction 1597
[(39,1),(40,1),(1039,26),(1079,27),(2118,53),(3197,80),(27694,693),(113973,2852)]
Prelude Main> let divFrac (x, y) = (fromInteger x) / (fromInteger y)
Prelude Main> take 8 $ map divFrac $ get_convergents $ sqrt_continued_fraction 2
[1.0,1.5,1.4,1.4166666666666667,1.4137931034482758,1.4142857142857144,1.4142011834319526,1.4142156862745099]
Prelude Main> take 8 $ map divFrac $ get_convergents $ sqrt_continued_fraction 103
[10.0,10.166666666666666,10.142857142857142,10.15,10.148148148148149,10.148936170212766,10.148888888888889,10.148893360160965]
Prelude Main> take 8 $ map divFrac $ get_convergents $ sqrt_continued_fraction 1597
[39.0,40.0,39.96153846153846,39.96296296296296,39.9622641509434,39.9625,39.96248196248196,39.9624824684432]</code></pre>
</div>
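<p>Here is the same recurrence as a short Python sketch, for comparison.
It takes any iterable of continued fraction terms and seeds the
recurrence with the conventional values \(p_{-2} = 0\), \(q_{-2} = 1\),
\(p_{-1} = 1\), \(q_{-1} = 0\) instead of special-casing the first two
convergents; the function name is my own.</p>

```python
def convergents(terms):
    """Yields the convergents (p_i, q_i) of the continued fraction
    whose terms are given by the iterable terms."""
    p_prev, q_prev = 0, 1  # p_{-2}, q_{-2}
    p, q = 1, 0            # p_{-1}, q_{-1}
    for a in terms:
        p_prev, q_prev, p, q = p, q, a * p + p_prev, a * q + q_prev
        yield p, q


# The first eight terms of the expansions of sqrt(2) and sqrt(1597).
print(list(convergents([1, 2, 2, 2, 2, 2, 2, 2])))
print(list(convergents([39, 1, 25, 1, 1, 1, 8, 4])))
```

<p>Python integers are arbitrary-precision, so nothing special is needed
to handle the rapidly growing numerators and denominators.</p>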
<div class="p">Here are a few more quick facts to help those unfamiliar with
Haskell:
<ul>
<li>The expression <code>a : as</code> forms a new list from the
element <code>a</code> and the existing list <code>as</code>
(equivalent to <code>cons</code> in Lisp).</li>
<li><code>zipWith3</code> is a function that takes in a
function <code>f</code>, three lists <code>a</code>, <code>b</code>,
and <code>c</code> of the same (possibly infinite)
length <code>n</code>, and forms the new list
<code>[ f(a[0], b[0], c[0]), f(a[1], b[1], c[1]), ..., f(a[n-1], b[n-1],
c[n-1]) ]</code>.</li>
<li>Note that the result of <code>zipWith3</code> is part of the
variable <code>pqs</code> which itself appears (twice!) in the
arguments to <code>zipWith3</code>. This is a Haskell idiom and
reflects the fact that the recurrence formulas define a convergent
in terms of its two previous convergents. A simpler example (using
the Fibonacci sequence) can be found in the
<a href="http://en.wikipedia.org/wiki/Lazy_evaluation">Wikipedia
entry for lazy evaluation</a>.</li>
<li>Haskell has built-in data types for integers of arbitrary size
which is necessary as the numerators and denominators of the
convergents get large quickly. In fact, Haskell has built-in
data types for rational numbers (represented as fractions) but they
don't help us much here.</li>
</ul>
</div>
<div class="p">Since we are guaranteed that some convergent eventually satisfies
Pell's equation, we can write a simple function to generate all
convergents, test each one to see if it satisfies Pell's equation,
and return the first one we see. Here is the Haskell implementation:
<pre class="code-container"><code class="language-haskell">get_pell_fundamental_solution n = head $ solutions
    where
      solutions = [ (p, q) | (p, q) <- convergents, p * p - n * q * q == 1 ]
      convergents = get_convergents $ sqrt_continued_fraction n</code></pre>
Note the use of
Haskell's <a href="http://en.wikipedia.org/wiki/List_comprehension">list
comprehension</a> syntax, similar to Python, which expresses what I
just described in a manner reminiscent of set notation.</div>
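<p>Putting the pieces together, here is a self-contained Python sketch of
the whole search. It fuses the continued fraction recurrence and the
convergent recurrence into one loop instead of composing lazy lists, so
treat it as an illustration of the idea and not as a translation of the
Haskell code; the function name is hypothetical.</p>

```python
from math import isqrt


def pell_fundamental_solution(d):
    """Returns the first convergent (p, q) of sqrt(d) that satisfies
    p^2 - d*q^2 == 1, for a positive non-square integer d."""
    a_0 = isqrt(d)
    if a_0 * a_0 == d:
        raise ValueError("d must not be a perfect square")
    # State for the continued fraction expansion of sqrt(d).
    m, den, a = 0, 1, a_0
    # The two most recent convergents, starting with (p_{-1}, q_{-1})
    # and (p_0, q_0).
    p_prev, q_prev = 1, 0
    p, q = a_0, 1
    while p * p - d * q * q != 1:
        m = den * a - m
        den = (d - m * m) // den
        a = (a_0 + m) // den
        p_prev, q_prev, p, q = p, q, a * p + p_prev, a * q + q_prev
    return p, q


print(pell_fundamental_solution(2))
print(pell_fundamental_solution(61))
```

<p>The loop only exits once the equation is satisfied, so whatever it
returns is a genuine solution; the claim that it is the smallest one
rests on the theory described above.</p>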
<div class="p">Here is the full Haskell program designed so its output may be
conveniently piped
to <a href="http://en.wikipedia.org/wiki/Bc_programming_language">bc</a>
for verification:
<pre class="code-container"><code class="language-haskell">module Main where

-- getArgs lives in System.Environment in current versions of GHC.
import System.Environment (getArgs)

sqrt_continued_fraction :: (Integral a) => a -> [a]
{- ... the sqrt_continued_fraction function explained above ... -}

get_convergents :: (Integral a) => [a] -> [(a, a)]
{- ... the get_convergents function explained above ... -}

get_pell_fundamental_solution :: (Integral a) => a -> (a, a)
{- ... the get_pell_fundamental_solution function explained above ... -}

main :: IO ()
main = do
  args <- getArgs
  let d = (read $ head $ args :: Integer)
      (p, q) = get_pell_fundamental_solution d
  putStr $ "d = " ++ (show d) ++ "\n" ++
           "p = " ++ (show p) ++ "\n" ++
           "q = " ++ (show q) ++ "\n" ++
           "p^2 - d * q^2 == 1\n"</code></pre>
and here it is in action:
<pre class="code-container"><code class="language-shell">$ ./solve_pell 1597
d = 1597
p = 519711527755463096224266385375638449943026746249
q = 13004986088790772250309504643908671520836229100
p^2 - d * q^2 == 1</code></pre>
</div>
<p>The solution to the original problem is therefore<br/>
<strong>\(s = q^2\)</strong>, the square of the 47-digit value of \(q\)
above, which is a 93-digit number.</p>
<p>Now that we've found a method to get <em>a</em> solution, the
question remains as to whether it's the only one. In fact it is not,
but it is the minimal one, and all other solutions (of which there are
an infinite number) can be generated from this fundamental one with a
simple recurrence relation as described on
the <a href="http://en.wikipedia.org/wiki/Pell%27s_equation#Solution_technique">Wikipedia
article</a>. My program above can be easily extended to generate all
solutions instead of just the fundamental one (I'll leave it to the
reader as an exercise).</p>
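<p>If you'd like to check your answer to that exercise, here is one
possible sketch in Python. It assumes you already have the fundamental
solution \((x_1, y_1)\) and uses the standard recurrence \(x_{k+1} = x_1
x_k + d y_1 y_k\), \(y_{k+1} = x_1 y_k + y_1 x_k\); the function name is
made up.</p>

```python
from itertools import islice


def pell_solutions(d, x_1, y_1):
    """Yields every positive solution of x^2 - d*y^2 == 1 in increasing
    order, given the fundamental solution (x_1, y_1)."""
    x, y = x_1, y_1
    while True:
        yield x, y
        x, y = x_1 * x + d * y_1 * y, x_1 * y + y_1 * x


for x, y in islice(pell_solutions(2, 3, 2), 4):
    assert x * x - 2 * y * y == 1
    print(x, y)
```

<p>The recurrence is just multiplication by \(x_1 + y_1 \sqrt{d}\), which
is why every generated pair still satisfies the equation.</p>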
<p>One remaining question is the efficiency of this algorithm. For
simplicity, let's neglect the cost of the arbitrary-precision
arithmetic involved and assume that the incremental cost of generating
each term of the continued fraction expansion and the convergents is
constant. Then the main cost is just how many convergents we have to
generate before we find one that satisfies Pell's equation. In fact,
it turns out that this depends on the length of the period of the
continued fraction expansion of \(\sqrt{d}\), which has a rough upper
bound of \(O(\sqrt{d} \ln d)\). Therefore, the cost of solving Pell's
equation (in terms of how many convergents to generate) for a given
\(n\)-bit number is \(O(n 2^{n/2})\). This is pretty expensive already,
although it's still much better than brute-force search (which is on
the order of exponentiating the above expression). Can we do better?
Well, sort of; it turns out the length of the answer is of the same
order as the expression above, so any algorithm that explicitly
outputs a solution necessarily takes that long. However, if you can
somehow factor \(d\) into \(s d'\), where \(s\) is a perfect square and \(d'\)
is <a href="http://en.wikipedia.org/wiki/Squarefree">squarefree</a>
(i.e., not divisible by any perfect square), then you can solve Pell's
equation for the smaller number \(d'\) and output the solution for \(d\)
compactly, as the smaller fundamental solution raised to a certain
power. Note that in general this involves
factoring \(d\), another hard problem, but one for which there exists a
great deal of prior work. An interested reader can peruse the papers
by <a href="http://www.ams.org/notices/200202/fea-lenstra.pdf">Lenstra</a>
and <a href="http://www.math.nyu.edu/~crorres/Archimedes/Cattle/cattle_vardi.pdf">Vardi</a>
for more details.</p>
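<p>To get a feel for how the period length behaves, here is a minimal sketch that computes it directly using the usual integer recurrence for the continued fraction expansion of \(\sqrt{d}\). The names <code>isqrt</code> and <code>periodLength</code> are hypothetical helpers and not part of the program above; the sketch relies on the standard fact that the period ends at the first term equal to twice the integer part of \(\sqrt{d}\).</p>

```haskell
-- Hypothetical helper: floor of the square root, by Newton's method.
isqrt :: Integer -> Integer
isqrt 0 = 0
isqrt n = go n
  where
    go x =
      let y = (x + n `div` x) `div` 2
      in if y >= x then x else go y

-- Hypothetical helper: length of the period of the continued fraction
-- expansion of sqrt d, or 0 if d is a perfect square.
periodLength :: Integer -> Int
periodLength d
  | a0 * a0 == d = 0
  | otherwise = go 0 1 a0 0
  where
    a0 = isqrt d
    -- Each complete quotient has the form (m + sqrt d) / den, with term a.
    go m den a k
      | k > 0 && a == 2 * a0 = k
      | otherwise =
          let m' = den * a - m
              den' = (d - m' * m') `div` den
              a' = (a0 + m') `div` den'
          in go m' den' a' (k + 1)

main :: IO ()
main = mapM_ report [2, 7, 13, 61, 1597]
  where
    report d = putStrLn (show d ++ ": " ++ show (periodLength d))
```

<p>Running this over a range of \(d\) shows how erratically the period length varies, which is why only an upper bound is quoted above.</p>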
<p>As a final note, one of the things I really like about number
theory is that investigating such a simple problem can lead you down
surprising avenues of mathematics and computational theory. In fact,
I've had to omit a lot of things I had planned to say to keep
this entry from growing even longer than it already is. Hopefully, this entry
helps someone else learn more about this interesting corner of number
theory.</p>
<hr />
<section class="footnotes">
<header>
<h2>Footnotes</h2>
</header>
<p id="fn1">[1] As a rule we'll avoid considering trivial cases and
re-stating obvious assumptions (like \(d\) having to be a positive
integer). <a href="#r1">↩</a></p>
</section>