Calculating x^2.
Archive:
Summary:
; assumes W contains a 4 bit number, and squares it. ; by John Payson Sqr4: addwf PC db 0,1,4,9,16,25,36,49,64,81,100,121,144,169,196,225
(a+b)^2 = a^2 + b^2 + 2ab
Thus, any multiplication could be performed as three squaring operations, a subtraction, and a shift."
a*b = t(a+b) - ( t(a) + t(b) ) ; calculate t(W), for 0 <= W <= 22 ; by David Cary get_t_w: PAGESEL $+2 addwf PCL,f db 0,1,3,6, H'0A', H'0F', H'15', H'1C', H'24', H'2D' db H'37', H'42', H'4E', H'5B', H'69', H'78' db H'88', H'99', H'AB', H'BE', H'D2', H'E7', H'FD'
Stuart Taylor suggested starting with a standard 16x16 bit multiply and trimming it down to a 10x10 bit multiply.
Keith Dowsett suggested "break it down into three simpler multiplications:
(a+b)^2 = a^2 + b^2 + 2ab
where b is the bottom 8 bits of the number, a is the top 2 bits, and use a conventional 8x8 bit multiply for b^2, and (for speed) use a customized 2x8 bit multiply for 2ab, and a 4 entry lookup table for a^2.
N = a*32 + b a = 5 msb's b = 5 lsb's Now using Keith's observation: N^2 = 32^2 * a^2 + 64*a*b + b^2 = (a^2)<<10 + (a*b)<<5 + b^2
which can be evaluated with one general-purpose 5x5 bit multiply and a lookup table for squaring 5 bit numbers.
Then Scott wonders if breaking the number into 3 nybbles would be even faster:
N = a*256 + b*16 + c N^2 = (a*256)^2 + (b*16)^2 + c^2 + 2*(a*256*b*16 + a*256*c + b*16*c) = (a^2)<<16 + (b^2)<<8 + c^2 + ((a*b)<<12 + (a*c)<<8 + b*c<<4))<<1
(See 4x4 bit multiply )
Andrew Warren speculates that a standard 10-bit x 10-bit multiplication will turn out to be faster than any of the complicated schemes above.
A 10x10 multiply IS faster than 16x16, {for calculating a square} mostly because the result is guaranteed to fit in three bytes rather than 4. If you don't have the code space to unroll the whole multiplication into inline code (as I showed in my "Solution #1"), looped code can still be made pretty fast by writing three loops:One for the first few bits of the multiplier, where the result of each addition of the multiplicand to the intermediate result is guaranteed to fit in the least-significant two bytes,
one for the next group of bits, where the addition may affect all three bytes of the result, and
one for the final few bits, where the addition is guaranteed to only affect the two most-significant bytes of the result.
John Payson implemented Andy David's tips, in a completely unrolled loop (sacrificing code space for speed) See the debugged version
; [warning: untested] ; by John Payson 1996-11-05 clrf DstH clrf DstM clrf DstL movf SrcL,w andlw $0F btfss Src,4 addwf DstM rrf DstM rrf DstL btfss Src,5 addwf DstM rrf DstM rrf DstL btfss Src,6 addwf DstM rrf DstM rrf DstL btfss Src,7 addwf DstM call Sqr4 addwf DstL swapf SrcL andlw $0F call Sqr4 addwf DstM ; At this point, 16-bit result is in DstM:DstH ; 25 words of code prior to this point (plus a ; 17-word table-lookup). Total execution time: ; 35 cycles up to this point. btfss SrcH,0 goto NoBit8 movf SrcL,w btfsc C incf DstH addwf DstM btfsc C incf DstH incf DstH ; Another 9 words for bit 8; 3 or 9 cycles to exec. NoBit8: btfss SrcH,1 goto NoBit9 movlw 4 btfss SrcH,0 movlw 8 addwf DstH rlf SrcL,w btfsc C incf DstH btfsc C incf DstH addwf DstM btfsc C incf DstH addwf DstM btfsc C incf DstH ; Another 17 words for bit 9; 3 or 17 cycles to execute ; Total worst-case time: 35+26 = 61 cycles. NoBit9: retlw 0 ; All done! Sqr4: addwf PC db 0,1,4,9,16,25,36,49,64,81,100,121,144,169,196,225
Questions: