Most of you probably recall the recent square root saga where Andy W. made
a claim that someone had written a square root routine with the following
features:

  o) 28 words of program memory
  o) 4 bytes of RAM, including the two that hold the 16 bit input
  o) Executes in 95 cycles.

The original author requested that it not be released publicly, however
Andy hoped that someone would be "prodded" into reproducing the routine.

A few weeks later I produced a square root routine with the following features:

  o) 27 words of program memory (actually it was 28, Eric and Martin
 independently
     suggested a way to shave one word).
  o) 4 bytes of RAM, including the two that hold the 16 bit input
  o) Executes in ~150 cycles.

This version saves one word of program memory at the cost of 50% reduction in
execution speed; A trade I would usually decline. I didn't see a way to improve
this routine any more, so I just left it alone.

Then one day I came across the book

 The Logic of Computer Arithmetic
 by Ivan Flores

He describes two different algorithms for computing the square root of a number.
I wrote PIC programs for each (actually several PIC programs for each) and
came up with this implementation

  o) 30 words of program memory
  o) 3 bytes of RAM, including the two that hold the 16 bit input
  o) Executes in ~104 cycles.


;----------------------------------------------------------
;sqrt
;
; The purpose of this routine is take the square root of a 16 bit
;unsigned integer.
;Inputs:  N_hi - High byte of the 16 bit number to be square rooted
;         N_lo - Low byte     "                  "             "
;Outputs: W register returned with the 8 bit result
;
;Memory used:
;  mask
;  30 Instructions

sqrt
        MOVLW   0xc0            ;Initialize value for mask
        MOVWF   mask
        MOVLW   0x40            ;Initial value for the root

sq1     SUBWF   N_hi,F          ;Subtract the root developed so far
        SKPC                    ;If it is larger than N_hi
          goto  sq5             ;  then go restore the subtraction

sq2     IORWF   mask,W          ;Set the current bit
sq3     RLF     N_lo,F          ;Shift N left one position
        RLF     N_hi,F
        RRF     mask,F          ;Shift the mask right, and pick up msb of N
        BTFSC   mask,7          ;If msb of N is set, then unconditionally
          goto  sq6             ;set the next bit (because subtracting the
                                ;root is guaranteed not to cause a borrow)

        XORWF   mask,W          ;Append "01" to the root developed so far

        SKPC                    ;If the lsb of mask was shifted into the carry
          goto  sq1             ;then we're done. Otherwise, loop again.

   ;We are almost done. In the last iteration, we have 7 bits of the root. When
   ;"01" is appended to it, we will have a 9-bit number that must be subtracted
   ;from N.

        SUBWF   N_hi,F          ;
        SKPC                    ;If the upper 7 bits cause a borrow, then
          RETURN                ;the appended "01" will as well: We're done.

        SKPNZ                   ;If the result of the subtraction is zero
          BTFSC N_lo,7          ;AND the msb of N_lo is set then the LSB of the
                                ;root is zero.
         XORLW  1               ;Otherwise, it is one.

        RETURN                  ;

sq6     SKPNC                   ;Need to unconditionally set the current bit of
 the root.
          RETURN                ;However, if we're through iterating, then
 leave. Note,
                                ;C is set by shifting the mask right and the LSB
 of root
                                ;was set by IORWF at sq2.

        BCF     mask,7          ;Clear the MSB of N that got shifted into the
 mask.
        XORWF   mask,W          ;Append "01" to the root developed so far.
        SUBWF   N_hi,F          ;Subtract
        goto    sq2             ;Go unconditionally set the current bit.

sq5     ADDWF   N_hi,F          ;Restore N_hi by reversing the SUBWF with a
 ADDWF
        goto    sq3             ;Don't set the current bit


While this version is three words longer, it saves one byte of RAM; this is
a trade I would make any day (in the PIC enviroment that is). Also, the
execution speed is only slightly slower than the one Andy's mystery programmer
produced, yet significantly faster than my original version.



Here's an excerpt from one of the messages I sent Andy while I was developing
this routine:

---------------------------------------

I was wandering around the SRI Library the other day and came across the book:

The Logic of Computer Arithmetic
by Ivan Flores

It's the kind of book that Knuth would love. It discusses all of the _classical_
arithmetic algorithms and how to efficiently implement them with hardware. The
thing I found really interesting is that he gives two ways to implement the
 _fast_
sqrt algorithm that we've been throwing around. He calls the first method the
 Binary
Square Roots - Restoring Method and the other the Binary Square Roots -
 Non-restoring
Method. While they're both very similar, the first one is the way I had
 implemented
the sqrt routine. His explanation of the theory of operation however, follows
 more
closely along the lines of what Bob F(??) had posted as a square root challenge
 some
months back. In other words, you pair the bits together and perform a magic
 algorithm:

Quoting Flores:

    In general, the procedure consists of taking the square root developed so
  far, appending 01 to it and subtracting it, properly shifted, from the current
  remainder. The 0 in 01 corresponds to mutliplying by 2; the 1 is a new trial
 bit.
  If the resulting remainder is positive, the new root bit developed is truly 1;
 if
  the remainder is negative, the new bit developed is 0 and the remainder must
 be
  restored (thus the name) by adding the quantity just subtracted.

The example he gives takes the square root of 01011111

      1  0  0  1 .
   -----------------
   ) 01 01 11 11 . 00
     -1
     ---
     00 01  <--- positive: first bit is a 1
     -1 01
     -----
     11 00  <--- negative: 2nd bit is a 0
     +1 01  <--- restore the wrong guess
    ------
     00 01 11
       -10 01
    ---------
     11 11 10  <--- negative: 3rd bit is a zero
       +10 01  <--- restore the wrong guess
    ---------
        01 11 11
        -1 00 01
       ---------
         0 11 10  <--- positive: 4th bit is a one
    etc...

  The other method does not restore the subtraction if the result was negative.
Instead, it appends a 11 to the root developed so far and on the next iteration
it performs an addition. If the addition causes an overflow, then on the next
iteration you go back to the subtraction mode. Before I botch the explanation
much further, let me quote Flores again:

    As long as the remainder is negative, we proceed as in the previous section;
  we enter a 1 in the corresponding root bit being developed; we append 01 to
  this number; we shift it the correct number of times and _subtract_ it from
  the previous remainder.
    When the remainder goes negative, we do not restore as in the previous
  section. First we enter a 0 as the next root bit developed. To this we append
  11. This result is shifted left the proper number of times and _added_ to the
  present remainder.

Again, the same example:

      1  0  0  1 .
   -----------------
   ) 01 01 11 11 . 00
     -1
     ---
     00 01  <--- positive: first bit is a 1
     -1 01  <--- Developed root is "1"; appended 01; subtract
     -----
     11 00 11  <--- negative: 2nd bit is a 0
       +10 11  <--- Developed root is "10"; append 11 and add.
    ---------
     11 11 10 11  <--- positive (i.e. didn't cause overflow): 3rd bit is a 0
         1 00 11  <--- Developed root is "100"; append 11 and add
    ---------
   1 00 00 11 10  <--- Overflow: 4th bit is a one

    etc...
Pretty neat!

----------------------------------------------------

Anyway, this latest version is also the "restoring" method. The non-restoring
version turned out to be significantly longer.

If any one can find a way to make this routine shorter and/or faster, please
let me know!


Scott

P.S. Logarithms and Powers are around the corner.