Tuesday, March 23, 2021

Donald Knuth’s Algorithm D, its implementation in Hacker’s Delight and elsewhere

Valid HTML 4.01 Transitional Valid CSS Valid SVG 1.0

Donald Knuth’s Algorithm D, its implementation in Hacker’s Delight, and elsewhere

Purpose
Introduction
Generic implementation in Hacker’s Delight
Specialised implementation in Hacker’s Delight
Flawed and poor implementation in libdivide
Poor implementation in Free Pascal’s Run-Time Library
Poor implementation in Google’s Go Programming Language
Implementation in Google’s V8 JavaScript engine
Proper (and optimised) implementation …
… in ANSI C
… in i386 Assembler

Purpose

Show some shortcomings of the multiple precision division implementations presented in Hacker’s Delight, and their effect elsewhere.

Introduction

In Volume 2: Seminumerical Algorithms, Chapter 4.3: Multiple-Precision Arithmetic of The Art of Computer Programming, Donald Ervin Knuth presents Algorithm D (Division of nonnegative integers):

Note: I added an introductory step D0 providing definitions and descriptions.

Algorithm D (Division of nonnegative integers).
D0. [Define]
Let U be the dividend (or numerator) of m+n digits, stored in an array of m+n+1 elements, one digit per element, with the most significant digit in element U[m+n−1] and the least significant digit in element U[0];
Let V be the divisor (or denominator) of n digits, stored in a second array of n elements, with n greater than 1, a non-zero most significant digit in element V[n−1] and the least significant digit in element V[0];
Let B be the base (or radix) of the digits (also limbs, places or words), typically (a power of) a power of 2, with B greater than 1.
(The algorithm computes the quotient Q of m+1 digits as ⌊U ⁄ V⌋ (also U div V or U ÷ V), and the remainder R of n digits as U modulo V (also U mod V or U % V), using the following primitive operations:
¹ addition or subtraction of two single-digit integers, giving a single-digit sum and a carry, or a single-digit difference and a borrow;
² widening multiplication of two single-digit integers, giving a double-digit product;
³ narrowing division of a double-digit integer by a single-digit integer, giving a single-digit quotient and a single-digit remainder.)
D1. [Normalize]
Set D to (B − 1) ÷ V[n−1];
Multiply all digits of U and V by D.
(On a binary computer, choose D to be a power of 2 instead of the value provided above; any value of D that results in V[n−1] not less than B ÷ 2 will suffice. If D is greater than 1, the eventual overflow digit of the dividend U goes into element U[m+n].)
D2. [Initialize j]
Set the loop counter j to m.
D3. [Calculate Q′]
Set Q′ to (U[n+j] * B + U[n−1+j]) ÷ V[n−1];
Set R′ to (U[n+j] * B + U[n−1+j]) % V[n−1];
Test if Q′ equals B or Q′ * V[n−2] is greater than R′ * B + U[n−2+j];
If yes, then decrease Q′ by 1, increase R′ by V[n−1], and repeat this test while R is less than B.
D4. [Multiply and subtract]
Replace (U[n+j]U[n−1+j]…U[j]) by (U[n+j]U[n−1+j]…U[j]) − Q′ * (V[n−1]…V[1]V[0]).
(The digits (U[n+j]…U[j]) should be kept positive; if the result of this step is actually negative, (U[n+j]…U[j]) should be left as the true value plus Bn+1, namely as the B’s complement of the true value, and a borrow to the left should be remembered.)
D5. [Test remainder]
Set Q[j] to Q′;
If the result of step D4 was negative, i.e. the subtraction needed a borrow, then proceed with step D6; otherwise proceed with step D7.
D6. [Add back]
Decrease Q[j] by 1 and add (0V[n−1]…V[1]V[0]) to (U[n+j]U[n−1+j]…U[1+j]U[j]).
(A carry will occur to the left of U[n+j], and it should be ignored since it cancels with the borrow that occurred in step D4.)
D7. [Loop on j]
Decrease j by 1;
Test if j is not less than 0;
If yes, go back to step D3.
D8. [Unnormalize]
Now (Q[m]…Q[1]Q[0]) is the desired quotient Q, and the desired remainder R may be obtained by dividing (U[n−1]…U[1]U[0]) by D.

Generic implementation in Hacker’s Delight

In his book Hacker’s Delight, Henry S. ("Hank") Warren presents the following generic implementation divmnu64.c of Donald Knuth’s Algorithm D in ANSI C.

Note: I removed some parts of the code which are not relevant here.

/* This divides an n-word dividend by an m-word divisor, giving an
n-m+1-word quotient and m-word remainder. The bignums are in arrays of
words. Here a "word" is 32 bits. This routine is designed for a 64-bit
machine which has a 64/64 division instruction. */ 1
…
/* q[0], r[0], u[0], and v[0] contain the LEAST significant words.
(The sequence is in little-endian order).

This is a fairly precise implementation of Knuth's Algorithm D, for a
binary computer with base b = 2**32. The caller supplies:
   1. Space q for the quotient, m - n + 1 words (at least one).
   2. Space r for the remainder (optional), n words.
   3. The dividend u, m words, m >= 1.
   4. The divisor v, n words, n >= 2.
The most significant digit of the divisor, v[n-1], must be nonzero.  The
dividend u may have leading zeros; this just makes the algorithm take
longer and makes the quotient contain more leading zeros.  A value of
NULL may be given for the address of the remainder to signify that the
caller does not want the remainder.
   The program does not alter the input parameters u and v.
   The quotient and remainder returned may have leading zeros.  The
function itself returns a value of 0 for success and 1 for invalid
parameters (e.g., division by 0).
   For now, we must have m >= n.  Knuth's Algorithm D also requires
that the dividend be at least as long as the divisor.  (In his terms,
m >= 0 (unstated).  Therefore m+n >= n.) */

int divmnu(unsigned q[], unsigned r[],
     const unsigned u[], const unsigned v[],
     int m, int n) {

   const unsigned long long b = 4294967296LL; // Number base (2**32).
   unsigned *un, *vn;                         // Normalized form of u, v.
   unsigned long long qhat;                   // Estimated quotient digit.
   unsigned long long rhat;                   // A remainder.
   unsigned long long p;                      // Product of two digits.
   long long t, k;
   int s, i, j;

   if (m < n || n <= 1 || v[n-1] == 0)
      return 1;                         // Return if invalid param.
…
   /* Normalize by shifting v left just enough so that its high-order
   bit is on, and shift u left the same amount. We may have to append a
   high-order digit on the dividend; we do that unconditionally. */

   s = nlz(v[n-1]);             // 0 <= s <= 31.
   vn = (unsigned *)alloca(4*n);
   for (i = n - 1; i > 0; i--)
      vn[i] = (v[i] << s) | ((unsigned long long)v[i-1] >> (32-s));
   vn[0] = v[0] << s;

   un = (unsigned *)alloca(4*(m + 1));
   un[m] = (unsigned long long)u[m-1] >> (32-s);
   for (i = m - 1; i > 0; i--)
      un[i] = (u[i] << s) | ((unsigned long long)u[i-1] >> (32-s));
   un[0] = u[0] << s;

   for (j = m - n; j >= 0; j--) {       // Main loop.
      // Compute estimate qhat of q[j].
      qhat = (un[j+n]*b + un[j+n-1])/vn[n-1];
#ifdef OPTIMIZE
      rhat = (un[j+n]*b + un[j+n-1])%vn[n-1];
#else // ORIGINAL
2    rhat = (un[j+n]*b + un[j+n-1]) - qhat*vn[n-1];
#endif
again:
      if (qhat >= b ||
#ifdef OPTIMIZE
          (unsigned)qhat*(unsigned long long)vn[n-2] > b*rhat + un[j+n-2]) {
#else // ORIGINAL
3        qhat*vn[n-2] > b*rhat + un[j+n-2]) {
#endif
        qhat = qhat - 1;
        rhat = rhat + vn[n-1];
        if (rhat < b) goto again;
      }

      // Multiply and subtract.
      k = 0;
      for (i = 0; i < n; i++) {
#ifdef OPTIMIZE
         p = (unsigned)qhat*(unsigned long long)vn[i];
#else // ORIGINAL
3       p = qhat*vn[i];
#endif
         t = un[i+j] - k - (p & 0xFFFFFFFFLL);
         un[i+j] = t;
         k = (p >> 32) - (t >> 32);
      }
      t = un[j+n] - k;
      un[j+n] = t;

      q[j] = qhat;              // Store quotient digit.
      if (t < 0) {              // If we subtracted too
         q[j] = q[j] - 1;       // much, add back.
         k = 0;
         for (i = 0; i < n; i++) {
            t = (unsigned long long)un[i+j] + vn[i] + k;
            un[i+j] = t;
            k = t >> 32;
         }
         un[j+n] = un[j+n] + k;
      }
   } // End j.
   // If the caller wants the remainder, unnormalize
   // it and pass it back.
   if (r != NULL) {
      for (i = 0; i < n-1; i++)
         r[i] = (un[i] >> s) | ((unsigned long long)un[i+1] << (32-s));
      r[n-1] = un[n-1] >> s;
   }
   return 0;
}
…
This implementation exhibits the following misrepresentation and shortcomings:
  1. Contrary to the highlighted part of its initial comment, this routine does not need a 64-bit machine; it but needs support for 64-bit integers and elementary 64-bit arithmetic operations, which ANSI C compilers provide since the last millennium on 32-bit machines too.
  2. Although a 64/64 division instruction is explicitly stated, the code doesn’t take full advantage of it: instead to use the % operator to compute fetch the remainder of the 64÷64-bit division (which comes for free on 64-bit machines, and almost for free with software implementations on 32-bit machines), it but performs an extraneous 64×64-bit multiplication — which is not free, especially on 32-bit machines!
  3. The previous argument also holds for the multiplications qhat*vn[n-2] and qhat*vn[i]: while an optimising compiler should detect that these are 32×32-bit multiplications returning a 64-bit product, I would not count on it — better give the compiler the proper hints!
Note: Knuth’s Algorithm D builds upon a 64÷32-bit division returning a 32-bit quotient and a 32-bit remainder, also known as narrowing division, and a 32×32-bit multiplication returning a 64-bit product, also known as widening multiplication, which both are but not supported in ANSI C!

Specialised implementation in Hacker’s Delight

Hank also presents a simplified and specialised implementation divlu.c for unsigned 64÷32-bit division, based on a 32÷16-bit division returning a 32-bit quotient and a 16-bit remainder, i.e. suited for 16-bit machines.

Note: I removed some parts of the code which are not relevant here.

/* Long division, unsigned (64/32 ==> 32).
   This procedure performs unsigned "long division" i.e., division of a
64-bit unsigned dividend by a 32-bit unsigned divisor, producing a
32-bit quotient.  In the overflow cases (divide by 0, or quotient
exceeds 32 bits), it returns a remainder of 0xFFFFFFFF (an impossible
value).
   The dividend is u1 and u0, with u1 being the most significant word.
The divisor is parameter v. The value returned is the quotient.
   […] Several of the variables below could be
"short," but having them fullwords gives better code on gcc/Intel.
[…]
This is the version that's in the hacker book. */

unsigned divlu2(unsigned u1, unsigned u0, unsigned v,
                unsigned *r) {
   const unsigned b = 65536; // Number base (16 bits).
   unsigned un1, un0,        // Norm. dividend LSD's.
            vn1, vn0,        // Norm. divisor digits.
            q1, q0,          // Quotient digits.
            un32, un21, un10,// Dividend digit pairs.
            rhat;            // A remainder.
   int s;                    // Shift amount for norm.

   if (u1 >= v) {            // If overflow, set rem.
      if (r != NULL)         // to an impossible value,
         *r = 0xFFFFFFFF;    // and return the largest
      return 0xFFFFFFFF;}    // possible quotient.

   s = nlz(v);               // 0 <= s <= 31.
   v = v << s;               // Normalize divisor.
   vn1 = v >> 16;            // Break divisor up into
   vn0 = v & 0xFFFF;         // two 16-bit digits.

   un32 = (u1 << s) | (u0 >> 32 - s) & (-s >> 31);
   un10 = u0 << s;           // Shift dividend left.

   un1 = un10 >> 16;         // Break right half of
   un0 = un10 & 0xFFFF;      // dividend into two digits.

   q1 = un32/vn1;            // Compute the first
#ifdef OPTIMIZE
   rhat = un32%vn1;          // quotient digit, q1.
#else // ORIGINAL
   rhat = un32 - q1*vn1;     // quotient digit, q1.
#endif
again1:
   if (q1 >= b || q1*vn0 > b*rhat + un1) {
     q1 = q1 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again1;}

   un21 = un32*b + un1 - q1*v;  // Multiply and subtract.

   q0 = un21/vn1;            // Compute the second
#ifdef OPTIMIZE
   rhat = un21%vn1;          // quotient digit, q0.
#else // ORIGINAL
   rhat = un21 - q0*vn1;     // quotient digit, q0.
#endif
again2:
   if (q0 >= b || q0*vn0 > b*rhat + un0) {
     q0 = q0 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again2;}

   if (r != NULL)            // If remainder is wanted,
      *r = (un21*b + un0 - q0*v) >> s;     // return it.
   return q1*b + q0;
}
…
Especially notice the highlighted part of the initial comment!

Flawed implementation in libdivide

Now take a look at the naïve and rather undelightawful adaption of this simplified and specialised variant provided in libdivide to implement an unsigned 128÷64-bit division.

Warning: don’t use this function; it demonstrates that its author copyist does not understand the original code and fails to read or ignores the comment, it exhibits poor performance, and it has a serious bug!

// libdivide.h
// Copyright 2010 - 2018 ridiculous_fish
…
// Code taken from Hacker's Delight:
// http://www.hackersdelight.org/HDcode/divlu.c.
// License permits inclusion here per:
// http://www.hackersdelight.org/permissions.htm

static uint64_t libdivide_128_div_64_to_64(uint64_t u1, uint64_t u0, uint64_t v, uint64_t *r) {
1  const uint64_t b = (1ULL << 32); // Number base (16 bits)
2  uint64_t un1, un0; // Norm. dividend LSD's
2  uint64_t vn1, vn0; // Norm. divisor digits
2  uint64_t q1, q0; // Quotient digits
3  uint64_t un64, un21, un10; // Dividend digit pairs
    uint64_t rhat; // A remainder
#ifdef OPTIMIZE
    uint64_t qhat; // A quotient
    uint64_t p; // A product
#endif
    int32_t s; // Shift amount for norm

    // If overflow, set rem. to an impossible value,
    // and return the largest possible quotient
    if (u1 >= v) {
        if (r != NULL)
            *r = (uint64_t) -1;
        return (uint64_t) -1;
    }

    // count leading zeros
    s = libdivide__count_leading_zeros64(v);
    if (s > 0) {
        // Normalize divisor
        v = v << s;
5      un64 = (u1 << s) | ((u0 >> (64 - s)) & (-s >> 31));
4      un10 = u0 << s; // Shift dividend left
    } else {
        // Avoid undefined behavior
6      un64 = u1 | u0;
4      un10 = u0;
    }

    // Break divisor up into two 32-bit digits
7  vn1 = v >> 32;
7  vn0 = v & 0xFFFFFFFF;

    // Break right half of dividend into two digits
7  un1 = un10 >> 32;
7  un0 = un10 & 0xFFFFFFFF;

    // Compute the first quotient digit, q1
#ifdef OPTIMIZE
    qhat = un64 / vn1;
    rhat = un64 % vn1;
    p = qhat * vn0;

    while (qhat >= b || p > b * rhat + un1) {
        qhat = qhat - 1;
        rhat = rhat + vn1;
        if (rhat >= b)
            break;
        p = p - vn0;
    }
    q1 = qhat & 0xFFFFFFFF;
#else // ORIGINAL
    q1 = un64 / vn1;
8  rhat = un64 - q1 * vn1;

9  while (q1 >= b || q1 * vn0 > b * rhat + un1) {
        q1 = q1 - 1;
        rhat = rhat + vn1;
        if (rhat >= b)
            break;
    }
#endif
     // Multiply and subtract
    un21 = un64 * b + un1 - q1 * v;

    // Compute the second quotient digit
#ifdef OPTIMIZE
    qhat = un21 / vn1;
    rhat = un21 % vn1;
    p = qhat * vn0;

    while (qhat >= b || p > b * rhat + un0) {
        qhat = qhat - 1;
        rhat = rhat + vn1;
        if (rhat >= b)
            break;
        p = p - vn0;
    }
    q0 = qhat & 0xFFFFFFFF;
#else // ORIGINAL
    q0 = un21 / vn1;
8  rhat = un21 - q0 * vn1;

9  while (q0 >= b || q0 * vn0 > b * rhat + un0) {
        q0 = q0 - 1;
        rhat = rhat + vn1;
        if (rhat >= b)
            break;
    }
#endif
    // If remainder is wanted, return it
    if (r != NULL)
        *r = (un21 * b + un0 - q0 * v) >> s;

    return q1 * b + q0;
}
…
This implementation has an obvious bug and multiple deficiencies (in order of appearance):
  1. The number base is 232, i.e. 32 bits, not 16 bits, as stated in the comment.
  2. The normalised (least significant) digits of dividend and divisor as well as the final digits of the quotient fit in 32 bits, i.e. the variables un1, un0, vn1, vn0, q1 and q0 can and should of course be declared as uint32_t, not as uint64_t.
  3. The proper name for the variable with the two most significant digits of the dividend is un32, not un64.
  4. The variables un6432 and un10 are superfluous, the dividend u1 and u0 can and should of course be normalised in place, like the divisor v.
  5. The term & (-s >> 31) is superfluous: in the original code it is used to avoid the conditional expression if (s > 0) … else … and evaluates to either & 0 or & -1 (alias & 0xFFFFFFFF or & ~0) there; here it always evaluates to & -1 (alias & 0xFFFFFFFFFFFFFFFF or & ~0LL).
  6. The term | u0 produces wrong results and must be removed.
  7. The variables un1, un0, vn1 and vn0 are superfluous, they can and of course should be replaced by the expressions (uint32_t) (un10 >> 32), (uint32_t) (un10 & 0xFFFFFFFF), (uint32_t) (v >> 32) and (uint32_t) (v & 0xFFFFFFFF) respectively.
  8. The remainder rhat can and should of course be computed as un6432 % vn1 and un21 % vn1 respectively, not as un6432 - q1 * vn1 and un21 - q0 * vn1 using expensive 64×64-bit multiplications: both hardware division instructions and software division routines typically return quotient and remainder together.
  9. The repeated (expensive) computation of the 64×64-bit products q1 * vn0 and q0 * vn0 inside the while … loops should be moved outside the loops.

Poor implementation in Free Pascal’s Run-Time Library

flt_core.inc from Free Pascal’s Run-Time Library
…
(*-------------------------------------------------------
 | u128_div_u64_to_u64 [local]
 |
 | Divides unsigned 128-bit integer by unsigned 64-bit integer.
 | Returns 64-bit quotient and reminder.
 |
 | This routine is used here only for splitting specially prepared unsigned
 | 128-bit integer into two 64-bit ones before converting it to ASCII.
 |
 *-------------------------------------------------------*)
function u128_div_u64_to_u64( const xh, xl: qword; const y: qword; out quotient, reminder: qword ): boolean;
var
    b,                // Number base
1  v,                // Norm. divisor
2  un1, un0,         // Norm. dividend LSD's
2  vn1, vn0,         // Norm. divisor digits
    q1, q0,           // Quotient digits
3  un64, un21, un10, // Dividend digit pairs
    rhat: qword;      // A remainder
    s: integer;       // Shift amount for norm
begin
    // Overflow check
    if ( xh >= y ) then
    begin
        u128_div_u64_to_u64 := false;
        exit;
    end;
    // Count leading zeros
    s := 63 - BSRqword( y ); // 0 <= s <= 63
    // Normalize divisor
1  v := y shl s;
    // Break divisor up into two 32-bit digits
4  vn1 := hi(v);
4  vn0 := lo(v);
    // Shift dividend left
1  un64 := xh shl s;
    if ( s > 0 ) then
1      un64 := un64 or ( xl shr ( 64 - s ) );
1  un10 := xl shl s;
    // Break right half of dividend into two digits
4  un1 := hi(un10);
4  un0 := lo(un10);
    // Compute the first quotient digit, q1
3  q1 := un64 div vn1;
5  rhat := un64 - q1 * vn1;
    b := qword(1) shl 32; // Number base
6  while ( q1 >= b )
7     or ( q1 * vn0 > b * rhat + un1 ) do
    begin
        dec( q1 );
        inc( rhat, vn1 );
6      if rhat >= b then
            break;
    end;
    // Multiply and subtract
3  un21 := un64 * b + un1 - q1 * v;
    // Compute the second quotient digit, q0
    q0 := un21 div vn1;
5  rhat := un21 - q0 * vn1;
6  while ( q0 >= b )
7     or ( q0 * vn0 > b * rhat + un0 ) do
    begin
        dec( q0 );
        inc( rhat, vn1 );
6      if ( rhat >= b ) then
            break;
    end;
    // Result
8  reminder := ( un21 * b + un0 - q0 * v ) shr s;
8  quotient := q1 * b + q0;
    u128_div_u64_to_u64 := true;
end;
…
This implementation has multiple deficiencies, especially for 32-bit machines; 64-bit machines typically support 128÷64-bit division in hardware and don’t need this function:
  1. The variables v, un6432 and un10 are superfluous, the divisor y as well as the dividend xh and xl can and should of course be normalised in place.
  2. The normalised (least significant) digits of dividend and divisor fit in 32 bits, i.e. the variables un1, un0, vn1 and vn0 can and should of course be declared as dword, not as qword.
  3. The proper name for the variable with the two most significant digits of the dividend is un32, not un64.
  4. The variables un1, un0, vn1 and vn0 are superfluous, they can and of course should be replaced by the expressions hi(un10), lo(un10), hi(v) and low(v) respectively.
  5. The remainder rhat can and should of course be computed as un6432 mod vn1 and un21 mod vn1 respectively, not as un6432 - q1 * vn1 and un21 - q0 * vn1 using expensive 64×64-bit multiplications: both hardware division instructions and software division routines typically return quotient and remainder together.
  6. The comparisions q1 >= b, q0 >= b and rhat >= b can and should of course be replaced by hi(q1) <> 0, hi(q0) <> 0 and hi(rhat) <> 0 respectively.
  7. The repeated (expensive) computation of the 64×64-bit products q1 * vn0 and q0 * vn0 inside the while … loops should be moved outside the loops.
  8. The final quotient digits fit in 32 bits, i.e. the variables q1 and q0 can and should of course be replaced wherever possible by lo(q1) and lo(q0) respectively to avoid expensive 64×64-bit multiplications.

Poor implementation in Google’s Go Programming Language

bits.go

Note: the highlighted parts show the optimisations and simplifications missed by the author(s) of the Div64 function!

// Copyright 2017 The Go Authors. All rights reserved.
…
// Div64 returns the quotient and remainder of (hi, lo) divided by y:
// quo = (hi, lo)/y, rem = (hi, lo)%y with the dividend bits' upper
// half in parameter hi and the lower half in parameter lo.
// Div64 panics for y == 0 (division by zero) or y <= hi (quotient overflow).
func Div64(hi, lo, y uint64) (quo, rem uint64) {
        const (
                two32  = 1 << 32
                mask32 = two32 - 1
        )
        if y == 0 {
                panic(divideError)
        }
        if y <= hi {
                panic(overflowError)
        }

        s := uint(LeadingZeros64(y))
        y <<= s

1   yn1 := uint32(y >> 32)
1   yn0 := uint32(y & mask32)
        un32 := hi<<s | lo>>(64-s)
        un10 := lo << s
1   un1 := uint32(un10 >> 32)
1   un0 := uint32(un10 & mask32)
        q1 := un32 / yn1
2   rhat := un32 - q1*yn1

3   for q1 >= two32 || uint32(q1)*yn0 > two32*rhat+un1 {
                q1--
                rhat += yn1
                if rhat >= two32 {
                        break
                }
        }

3   un21 := un32*two32 + un1 - uint32(q1)*y
        q0 := un21 / yn1
2   rhat = un21 - q0*yn1

3   for q0 >= two32 || uint32(q0)*yn0 > two32*rhat+un0 {
                q0--
                rhat += yn1
                if rhat >= two32 {
                        break
                }
        }

3   return q1*two32 + q0, (un21*two32 + un0 - uint32(q0)*y) >> s
}
…
This implementation has multiple deficiencies, especially for 32-bit machines; 64-bit machines typically support 128÷64-bit division in hardware and don’t need this function:
  1. The normalised (least significant) digits of dividend and divisor fit in 32 bits, i.e. the variables yn1, yn0, un1 and un0 can and should of course be declared as uint32, not as uint64.
  2. The remainder rhat can and should of course be computed as un32 % yn1 and un21 % yn1 respectively, not as un32 - q1*yn1 and un21 - q0*yn1 using expensive 64×64-bit multiplications: both hardware division instructions and software division routines typically return quotient and remainder together.
  3. The final digits of the quotient fit in 32 bits, i.e. the variables q1 and q0 can and should of course be cast to uint32 wherever necessary to avoid expensive 64×64-bit multiplications.

Implementation in Google’s V8 JavaScript engine

bigint.cc from Google’s V8 JavaScript engine.

Note: the type digit_t is the same as uintptr_t.

// Copyright 2017 the V8 project authors. All rights reserved.
…
// Returns the quotient.
// quotient = (high << kDigitBits + low - remainder) / divisor
BigInt::digit_t MutableBigInt::digit_div(digit_t high, digit_t low,
                                         digit_t divisor, digit_t* remainder) {
  DCHECK(high < divisor);
#if V8_TARGET_ARCH_X64 && (__GNUC__ || __clang__)
  …
#elif V8_TARGET_ARCH_IA32 && (__GNUC__ || __clang__)
  …
#else
  static const digit_t kHalfDigitBase = 1ull << kHalfDigitBits;
  // Adapted from Warren, Hacker's Delight, p. 152.
  int s = base::bits::CountLeadingZeros(divisor);
  DCHECK_NE(s, kDigitBits);  // {divisor} is not 0.
  divisor <<= s;

  digit_t vn1 = divisor >> kHalfDigitBits;
  digit_t vn0 = divisor & kHalfDigitMask;
  // {s} can be 0. {low >> kDigitBits} would be undefined behavior, so
  // we mask the shift amount with {kShiftMask}, and the result with
  // {s_zero_mask} which is 0 if s == 0 and all 1-bits otherwise.
  STATIC_ASSERT(sizeof(intptr_t) == sizeof(digit_t));
  const int kShiftMask = kDigitBits - 1;
  digit_t s_zero_mask =
      static_cast<digit_t>(static_cast<intptr_t>(-s) >> (kDigitBits - 1));
  digit_t un32 =
      (high << s) | ((low >> ((kDigitBits - s) & kShiftMask)) & s_zero_mask);
  digit_t un10 = low << s;
  digit_t un1 = un10 >> kHalfDigitBits;
  digit_t un0 = un10 & kHalfDigitMask;
  digit_t q1 = un32 / vn1;
  digit_t rhat = un32 - q1 * vn1;

  while (q1 >= kHalfDigitBase || q1 * vn0 > rhat * kHalfDigitBase + un1) {
    q1--;
    rhat += vn1;
    if (rhat >= kHalfDigitBase) break;
  }

  digit_t un21 = un32 * kHalfDigitBase + un1 - q1 * divisor;
  digit_t q0 = un21 / vn1;
  rhat = un21 - q0 * vn1;

  while (q0 >= kHalfDigitBase || q0 * vn0 > rhat * kHalfDigitBase + un0) {
    q0--;
    rhat += vn1;
    if (rhat >= kHalfDigitBase) break;
  }

  *remainder = (un21 * kHalfDigitBase + un0 - q0 * divisor) >> s;
  return q1 * kHalfDigitBase + q0;
#endif
}
…

Proper (and optimised) implementation …

… in ANSI C

This further simplified and optimised implementation of Algorithm D for unsigned 128÷64-bit division on 32-bit machines is based on a 64÷32-bit division returning a 64-bit quotient and a 32-bit remainder, trivially implemented per long (alias schoolbook) division using a narrowing 64÷32-bit division returning a 32-bit quotient and a 32-bit remainder.
// Copyleft © 2011-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

// Divide a 128-bit integer dividend, supplied as a pair of 64-bit
// integers in u0 and u1, by a 64-bit integer divisor, supplied in v;
// return the 64-bit quotient and optionally the 64-bit remainder in *r.

unsigned long long divllu(unsigned long long u0,
                          unsigned long long u1,
                          unsigned long long v,
                          unsigned long long *r) {
    unsigned long long qhat;        // A quotient.
    unsigned long long rhat;        // A remainder.
    unsigned long long uhat;        // A dividend digit pair.
    unsigned           q0, q1;      // Quotient digits.
    unsigned           s;           // Shift amount for norm.

    if (u1 >= v) {                  // If overflow, set rem.
        if (r != NULL)              // to an impossible value,
            *r = ~0ULL;             // and return the largest
        return ~0ULL;               // possible quotient.
    }

    s = __builtin_clzll(v);         // 0 <= s <= 63.
    if (s != 0) {
        v <<= s;                    // Normalize divisor.
        u1 <<= s;                   // Shift dividend left.
        u1 |= u0 >> (64 - s);
        u0 <<= s;
    }
                                    // Compute high quotient digit.
    qhat = u1 / (unsigned) (v >> 32);
    rhat = u1 % (unsigned) (v >> 32);

    while ((unsigned) (qhat >> 32) != 0U ||
                                    // Both qhat and rhat are less 2**32 here!
           (unsigned long long) (unsigned) (qhat & ~0U) * (unsigned) (v & ~0U) >
           ((rhat << 32) | (unsigned) (u0 >> 32))) {
        qhat -= 1;
        rhat += (unsigned) (v >> 32);
        if ((unsigned) (rhat >> 32) != 0U) break;
    }

    q1 = (unsigned) (qhat & ~0U);
                                    // Multiply and subtract.
    uhat = ((u1 << 32) | (unsigned) (u0 >> 32)) - q1 * v;

                                    // Compute low quotient digit.
    qhat = uhat / (unsigned) (v >> 32);
    rhat = uhat % (unsigned) (v >> 32);

    while ((unsigned) (qhat >> 32) != 0U ||
                                    // Both qhat and rhat are less 2**32 here!
           (unsigned long long) (unsigned) (qhat & ~0U) * (unsigned) (v & ~0U) >
           ((rhat << 32) | (unsigned) (u0 & ~0U))) {
        qhat -= 1;
        rhat += (unsigned) (v >> 32);
        if ((unsigned) (rhat >> 32) != 0U) break;
    }

    q0 = (unsigned) (qhat & ~0U);

    if (r != NULL)                  // If remainder is wanted, return it.
        *r = (((uhat << 32) | (unsigned) (u0 & ~0U)) - q0 * v) >> s;

    return ((unsigned long long) q1 << 32) | q0;
}

… in i386 Assembler

Finally the implementation of this optimised simplified variant in assembly language, with yet another optimisation: it performs a long (alias schoolbook) division for divisors less than 232 and dividends less than 296.
; Copyright © 2011-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

        .386
        .model  flat, C
        .code

divllu  proc    public

        push    ebx
        push    ebp
        push    edi
        push    esi
…
        mov     esi, [esp+40]
        mov     edi, [esp+36]   ; esi:edi = divisor

        bsr     ecx, esi        ; ecx = index of leading '1' bit in high dword of divisor
        jnz     NORMALIZE       ; high dword of divisor (and high dword of dividend) = 0?

                                ; "long" division (dividend < 2**96, divisor < 2**32)
DIVISION:
        mov     edx, [esp+28]
        mov     eax, [esp+24]   ; edx:eax = "inner" qword of dividend
        div     edi             ; eax = high dword of quotient,
                                ; edx = high dword of dividend'
        mov     ebx, eax        ; ebx = high dword of quotient
        mov     eax, [esp+20]   ; edx:eax = low qword of dividend'
        div     edi             ; eax = low dword of quotient,
                                ; edx = (low dword) of remainder

        or      esi, [esp+44]   ; esi = address of remainder
        jz      @F              ; address of remainder = 0?

        xor     ecx, ecx        ; ecx:edx = remainder
        mov     [esi], edx
        mov     [esi+4], ecx    ; store remainder

@@:
        mov     edx, ebx        ; edx:eax = quotient

        pop     esi
        pop     edi
        pop     ebp
        pop     ebx
        ret

NORMALIZE:
        mov     edx, [esp+32]
        mov     eax, [esp+28]   ; edx:eax = high qword of dividend

        xor     ecx, 31         ; ecx = number of leading '0' bits in (high dword of) divisor
        jz      @F              ; number of leading '0' bits = 0?

        shld    esi, edi, cl
        shl     edi, cl         ; esi:edi = divisor'
        mov     [esp+40], esi
        mov     [esp+36], edi   ; save normalised divisor'

        mov     ebx, [esp+24]
        mov     ebp, [esp+20]   ; ebx:ebp = low qword of dividend
        shld    edx, eax, cl
        shld    eax, ebx, cl
        shld    ebx, ebp, cl
        shl     ebp, cl         ; edx:eax:ebx:ebp = dividend'
        mov     [esp+32], edx
        mov     [esp+28], eax
        mov     [esp+24], ebx
        mov     [esp+20], ebp   ; save normalised dividend'

@@:
        push    ecx             ; save number of leading '0' bits in divisor

DIVISION_1:
        cmp     esi, edx
        jna     OVERFLOW_1      ; overflow with normal division?

                                ; "normal" division
NORMAL_1:
        div     esi             ; eax = (low dword of) quotient',
                                ; edx = (low dword of) remainder'
        mov     ebp, edx        ; ebp = (low dword of) remainder'
        mov     ecx, eax
        xor     ebx, ebx        ; ebx:ecx = quotient'
        jmp     CHECK_1
                                ; "long" division
OVERFLOW_1:
        mov     ecx, eax
        mov     eax, edx
        xor     edx, edx
        div     esi             ; eax = high dword of quotient',
                                ; edx = high dword of remainder'
        mov     ebx, eax        ; ebx = high dword of quotient'
        mov     eax, ecx
        div     esi             ; eax = low dword of quotient',
                                ; edx = (low dword of) remainder'
        mov     ebp, edx        ; ebp = (low dword of) remainder'
        mov     ecx, eax        ; ebx:ecx = quotient'

ADJUST_1:
        add     ecx, -1
        adc     ebx, -1         ; ebx:ecx = quotient' - 1
        add     ebp, esi        ; ebp = (low dword of) remainder'
                                ;     + high dword of divisor'
        jc      BREAK_1         ; remainder' >= 2**32?

AGAIN_1:
        test    ebx, ebx
        jnz     ADJUST_1        ; quotient' >= 2**32?

CHECK_1:
        mov     eax, edi        ; eax = low dword of divisor'
        mul     ecx             ; edx:eax = low dword of divisor'
                                ;         * low dword of quotient'
ifdef JMPLESS
        cmp     [esp+28], eax
        mov     eax, ebp
        sbb     eax, edx
        jb      ADJUST_1
else
        cmp     edx, ebp
        jb      BREAK_1
        ja      ADJUST_1
        cmp     eax, [esp+28]
        ja      ADJUST_1
endif
BREAK_1:
        push    ecx             ; save (low dword of) quotient'

        mov     eax, edi        ; eax = low dword of divisor'
        mul     ecx             ; edx:eax = low dword of divisor'
                                ;         * (low dword of) quotient'
        imul    ecx, esi        ; ecx = (low dword of) quotient'
                                ;     * high dword of divisor'
        add     ecx, edx        ; ecx:eax = divisor'
                                ;         * (low dword of) quotient'
        mov     ebx, eax        ; ecx:ebx = divisor'
                                ;         * (low dword of) quotient'
        mov     eax, [esp+32]
        mov     edx, [esp+36]   ; edx:eax = "inner" qword of dividend'
        sub     eax, ebx
        sbb     edx, ecx        ; edx:eax = "inner" qword of dividend"
                                ;         = intermediate (normalised) remainder

        push    eax             ; save low dword of "inner" qword of dividend"

DIVISION_2:
        cmp     esi, edx
        jna     OVERFLOW_2      ; overflow with normal division?

                                ; "normal" division
NORMAL_2:
        div     esi             ; eax = (low dword of) quotient",
                                ; edx = (low dword of) remainder"
        mov     ebp, edx        ; ebp = (low dword of) remainder"
        mov     ecx, eax
        xor     ebx, ebx        ; ebx:ecx = quotient"
        jmp     CHECK_2
                                ; "long" division
OVERFLOW_2:
        mov     ecx, eax
        mov     eax, edx
        xor     edx, edx
        div     esi             ; eax = high dword of quotient",
                                ; edx = high dword of remainder"
        mov     ebx, eax        ; ebx = high dword of quotient"
        mov     eax, ecx
        div     esi             ; eax = low dword of quotient",
                                ; edx = (low dword of) remainder"
        mov     ebp, edx        ; ebp = (low dword of) remainder"
        mov     ecx, eax        ; ebx:ecx = quotient"

ADJUST_2:
        add     ecx, -1
        adc     ebx, -1         ; ebx:ecx = quotient" - 1
        add     ebp, esi        ; ebp = (low dword of) remainder"
                                ;     + high dword of divisor'
        jc      BREAK_2         ; remainder" >= 2**32?

AGAIN_2:
        test    ebx, ebx
        jnz     ADJUST_2        ; quotient" >= 2**32?

CHECK_2:
        mov     eax, edi        ; eax = low dword of divisor'
        mul     ecx             ; edx:eax = low dword of divisor'
                                ;         * low dword of quotient"
ifdef JMPLESS
        cmp     [esp+32], eax
        mov     eax, ebp
        sbb     eax, edx
        jb      ADJUST_2
else
        cmp     edx, ebp
        jb      BREAK_2
        ja      ADJUST_2
        cmp     eax, [esp+32]
        ja      ADJUST_2
endif
BREAK_2:
        pop     ebx             ; ebx = low dword of "inner" qword of dividend"
        push    ecx             ; save (low dword of) quotient"

        mov     ebp, [esp+56]   ; ebp = address of remainder
        test    ebp, ebp
        jz      QUOTIENT        ; address of remainder = 0?

REMAINDER:
        mov     eax, edi        ; eax = low dword of divisor'
        mul     ecx             ; edx:eax = low dword of divisor'
                                ;         * (low dword of) quotient"
        imul    ecx, esi        ; ecx = (low dword of) quotient"
                                ;     * high dword of divisor'
        add     edx, ecx        ; edx:eax = (low dword of) quotient"
                                ;         * divisor'
        mov     edi, [esp+32]
        mov     esi, ebx
        sub     edi, eax
        sbb     esi, edx        ; esi:edi = (normalised) remainder

        mov     ecx, [esp+8]    ; ecx = number of leading '0' bits in divisor
                                ;     = shift count
if 0
        jecxz   @F              ; shift count = 0?
else
        test    ecx, ecx
        jz      @F              ; shift count = 0?
endif
        shrd    edi, esi, cl
        shr     esi, cl         ; esi:edi = remainder

@@:
        mov     [ebp], edi
        mov     [ebp+4], esi    ; store remainder

QUOTIENT:
        pop     eax
        pop     edx             ; edx:eax = quotient

        pop     ecx
        pop     esi
        pop     edi
        pop     ebp
        pop     ebx
        ret

divllu  endp
        end
Note: this code runs on 35 year old Intel® 80386 micro-processors, and of course on current compatible processors too; it is more than twice as fast as the ANSI C version, and less than 1,5× slower than the native 128÷64-bit division on 64-bit processors (about 100 to 120 versus 80 to 95 processor clock cycles).

Contact

If you miss anything here, have additions, comments, corrections, criticism or questions, want to give feedback, hints or tipps, report broken links, bugs, deficiencies, errors, inaccuracies, misrepresentations, omissions, shortcomings, vulnerabilities or weaknesses, …: don’t hesitate to contact me and feel free to ask, comment, criticise, flame, notify or report!

Use the X.509 certificate to send S/MIME encrypted mail.

Note: email in weird format and without a proper sender name is likely to be discarded!

I dislike HTML (and even weirder formats too) in email, I prefer to receive plain text.
I also expect to see your full (real) name as sender, not your nickname.
I abhor top posts and expect inline quotes in replies.

Terms and Conditions

By using this site, you signify your agreement to these terms and conditions. If you do not agree to these terms and conditions, do not use this site!
  • The software and the documentation on this site are provided as is without any warranty, neither express nor implied.
    In no event will the author be held liable for any damage(s) arising from the use of the software or the documentation.
  • Permission is granted to use the current version of the software and the current version of the documentation solely for personal private and non-commercial purposes.
    An individuals use of the software or the documentation in his or her capacity or function as an agent, (independent) contractor, employee, member or officer of a business, corporation or organisation (commercial or non-commercial) does not qualify as personal private and non-commercial purpose.
  • Without written approval from the author the software or the documentation must not be used for a business, for commercial, corporate, governmental, military or organisational purposes of any kind, or in a commercial, corporate, governmental, military or organisational environment of any kind.
  • Redistribution of the software and the documentation is allowed only in unmodified form of its current version and free of charge.

Data Protection Declaration

This web page records no (personal) data and stores no cookies in the web browser.

The web service is operated and provided by

Telekom Deutschland GmbH
Business Center
D-64306 Darmstadt
Germany
<‍hosting‍@‍telekom‍.‍de‍>
+49 800 5252033

The web service provider stores a session cookie in the web browser and records every visit of this web site with the following data in an access log on their server(s):

  • the (pseudonymised) IP address;
  • the date and time of the request;
  • the URL of the requested web page or file;
  • the Referer and User-Agent HTTP headers sent by the web browser;
  • the result (success or failure) of the request;
  • the amount of data received and sent.

Copyright © 2005–2021 • Stefan Kanthak • <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>


from Hacker News https://ift.tt/3r1Fi3q

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.