Finding the Most Significant Set Bit of a Word in Constant Time

Fred Akalin

1. Overall method

Finding the most significant set bit of a word (equivalently, finding the integer log base 2 of a word, or counting the leading zeros of a word) is a well-studied problem. Bit Twiddling Hacks lists various methods, and Wikipedia gives the CPU instructions that perform the operation directly.

However, all of these methods are either specific to a certain word size or take more than constant time (in terms of number of word operations). That raises the question of whether there is a method that takes constant time—surprisingly, the answer is “yes”![1]

The key idea is to split a word into \(\lceil \sqrt{w} \rceil\) blocks of \(\lceil \sqrt{w} \rceil\) bits (where \(w\) is the number of bits in a word). One can then do certain operations on blocks “in parallel” by stuffing multiple blocks into a word and then performing a single word operation.

Furthermore, since the block size and block count are the same, one can transform the bits of a block into the blocks of a word and vice versa in various ways using only a constant number of word operations.

In particular, this lets us split up the problem into two parts: finding the most significant set (i.e., non-zero) block, and finding the most significant set bit within that block. It then turns out that both parts can be done in constant time.

For concreteness, we'll use 32-bit words when explaining the method below.[2]

2. Finding the most significant set bit of a block

First, let's consider the sub-problem of finding the most significant set bit of a block. In fact, let's give ourselves a bit of room and consider only blocks with the high bit cleared for now; we'll see why we need this extra bit of room soon.

For 32 bits, the block size is 6 bits, so with the high bit of a block cleared we're left with 5 bits. Let's look at a naive implementation:

function mssb5_naive(x) {
  var c = 0;
  for (var i = 0; i < 5 && x >= (1 << i); ++i) {
    ++c;
  }
  return c - 1;
}

In the above, we consider successive powers of 2 until we find one greater than our given number. Then the answer is simply one less than that power.

Notice that the loop has at most 5 iterations; this lines up nicely with the 5 full blocks in an entire 32-bit word. (This is why we saved our extra bit of room.) If we can copy our block to the higher 4 blocks and then use word operations to operate on those blocks in parallel, then we're good.

For our example, let \(x = 5 = 00101\). Duplicating \(x\) among all the blocks can easily be done by multiplying by the appropriate constant:

  00 000000 000000 000000 000000 000101
* 00 000001 000001 000001 000001 000001
  00 000000 000000 000000 000000 000101
  00 000000 000000 000000 000101
  00 000000 000000 000101
  00 000000 000101
  00 000101                            
  00 000101 000101 000101 000101 000101

In fact, this is a simple use of a more general tool. If \(x\) and \(y\) are expressed in binary, then multiplying \(x\) by \(y\) can be seen as taking the index of each set bit in \(y\), creating a copy of \(x\) shifted by each such index, and then adding up all the shifted copies. This case is just taking \(y\) to be the constant where the \(\{ 0, 6, 12, 18, 24 \}\)th bits are set.

The first operation we need to parallelize is the comparisons to the powers of 2. This can be converted to a word operation by noting the comparison \(x \geq y\) can be performed by checking the sign of \(x - y\), and that checking the sign can be done by setting the unused high bit of \(x\) before doing the comparison, and then checking to see if that high bit was left intact (i.e., not borrowed from). So we pre-compute a constant with the \(n\)th block containing the \(n\)th power of 2, then subtract that from our block containing the duplicated blocks with the high bit set. Finally, we can then mask off the unneeded lower bits:

  00 000101 000101 000101 000101 000101
| 00 100000 100000 100000 100000 100000
  00 100101 100101 100101 100101 100101
- 00 010000 001000 000100 000010 000001
  00 010101 011101 100001 100011 100100
& 00 100000 100000 100000 100000 100000
  00 000000 000000 100000 100000 100000

We're left with a word where all bits except for the high bits of a block are zero. We still need to sum up those bits, but since they're a block apart, that can be done by multiplication with a constant to line up the bits in a single column. The constant turns out to have the \(\{ 0, 6, 12, 18, 24 \}\)th bits set, with the answer being in the top three bits:[3]

  00 000000 000000 100000 100000 100000
* 00 000001 000001 000001 000001 000001
  00 000000 000000 100000 100000 100000
  00 000000 100000 100000 100000
  00 100000 100000 100000
  00 100000 100000
  00 100000                            
  01 100001 100001 100001 000000 100000

MSSB5(x) = 011 - 1 = 2

We can now write mssb5() using a constant number of word operations:[4]

function mssb5(x) {
  // Duplicate x among all the blocks.
  x *= b('00 000001 000001 000001 000001 000001');

  // Compare to successive powers of 2 in parallel.
  x |= b('00 100000 100000 100000 100000 100000');
  x -= b('00 010000 001000 000100 000010 000001');
  x &= b('00 100000 100000 100000 100000 100000');

  // Sum up the bits into the high 3 bits.
  x *= b('00 000001 000001 000001 000001 000001');

  // Shift down and subtract 1 to get the answer.
  return (x >>> 29) - 1;
}

Then we can then find the most significant set bit of a full block by simply testing the high bit first:

function mssb6(x) {
  return (x & b('100000')) ? 5 : mssb5(x);
}

3. Finding the most significant set block

Let's now consider the sub-problem of finding the most significant set block of a word (ignoring the partial one). Similar to the above, we'd like to be able to use subtraction to compare all the blocks to zero at the same time. However, that requires the high bit of each block to be unused. That's easy enough to handle: just separate the high bit and the lower bits of each block, test the lower bits, and then bitwise-or the results together:

   x = 00 100000 000000 010000 000000 000001
&  C = 00 100000 100000 100000 100000 100000
  y1 = 00 100000 000000 000000 000000 100000

   x = 00 100000 000000 010000 000000 000001
& ~C = 00 011111 011111 011111 011111 011111
  t1 = 00 000000 000000 010000 000000 000001

   C = 00 100000 100000 100000 100000 100000
- t1 = 00 000000 000000 010000 000000 000001
  t2 = 00 100000 100000 010000 100000 011111

 ~t2 = 11 011111 011111 101111 011111 100000
&  C = 00 100000 100000 100000 100000 100000
  y2 = 00 000000 000000 100000 000000 100000

  y1 = 00 100000 000000 000000 000000 100000
| y2 = 00 000000 000000 100000 000000 100000
   y = 00 100000 000000 100000 000000 100000

The result is stored in the high bits of each block. If we could pack all the bits together, we'd then be able to use mssb5(). This is similar to where we had to add all the bits together in part 2, but we need a constant to stagger the bits instead of lining them up. The constant to put the answer in the high bits turns out to have the \(\{ 7, 12, 17, 22, 27 \}\)th bits set:

y >>> 5 = 00 000001 000000 000001 000000 000001
        * 00 001000 010000 100001 000010 000000
          10 000000 000010 000000 00001
          00 000001 000000 000001
          00 100000 000000 1
          00 000000 01
          00 001                               
        = 10 101001 010010 100001 000010 000000

This yields the answer 10101, where the \(i\)th bit is set exactly when the \(i\)th block of \(x\) is non-zero. Therefore, the most significant block is then simply mssb5(10101).

4. Putting it all together

With the building blocks above, we can now implement the algorithm for finding the most significant set bit in the full blocks of a word:[5]

function mssb30(x) {
  var C = b('00 100000 100000 100000 100000 100000');

  // Check whether the high bit of each block is set.
  var y1 = x & C;

  // Check whether the lower bits of each block is set.
  var y2 = ~(C - (x & ~C)) & C;

  var y = y1 | y2;

  // Shift the result bits down to the lowest 5 bits.
  var z = ((y >>> 5) * b('0000 10000 10000 10000 10000 10000000')) >>> 27;

  // Compute the bit index of the most significant set block.
  var b1 = 6 * mssb5(z);

  // Compute the most significant set bit inside the most significant
  // set block.
  var b2 = mssb6((x >>> b1) & b('111111'));

  return b1 + b2;
}

And then it's simple enough to extend it to find the most significant set bit of a full word:

function mssb32(x) {
  // Check the high duplet and fall back to mssb30 if it's not set.
  var h = x >>> 30;
  return h ? (30 + mssb5(h)) : mssb30(x);
}

So the code above shows that we can find the most significant set bit of a 32-bit word in a constant number of 32-bit word operations. It is easy enough to see how it can be adapted to yield a similar algorithm for a given arbitrary (but sufficiently large) word size, simply by pre-computing the various word-size-dependent constants.

It is also easy to see why no one actually uses this method on real computers even in the absence of built-in instructions: it is much more complicated and almost certainly slower than existing methods for real word sizes! Also, the word-RAM model—where we assume all word operations take constant time—is useful only when the word size is fixed or narrowly bounded. When we allow word size to vary arbitrarily, the word-RAM model breaks down—for one, multiplication grows super-linearly with respect to word size! Alas, this method is doomed to remain a theoretical curiosity, albeit one that uses a few clever tricks.

[1] The constant-time method is detailed in the original papers for the fusion tree data structure. The first paper is unfortunately behind a paywall, but the second paper, essentially a rehash of the first one, is freely downloadable.

The method is also explained in lecture 12 of Erik Demaine's Advanced Data Structures class, which is how I originally found out about it.

[2] Demaine uses 16-bit words, which factors nicely into 4 blocks of 4 bits, but it is instructive to see how the method deals with the word size not a perfect square.

[3] In this case, the partial 6th block has enough room to hold the answer but this may not be true in general. This can be remedied easily enough by shifting down the block high bits to the low bits before multiplying; the answer will then be in the last full block.

[4] b(str) just parses a number from its binary string representation.

[5] Try out this function (and the others on this page) by opening up the JS console on this page!