Floating Point
Usenet Postings
Subject: Re: Floating point on fpga, and serial FP adders
Newsgroups: comp.arch.fpga
Date: Sat, 3 Jul 1999 10:43:33 -0700

Roland Paterson-Jones wrote in message <377DC508.D5F1D048@bigfoot.com>...
>It has been variously stated that fpga's are no good for floating point
>operations. Why? As I see it, floating point operations are typically
>just shifted integer operations. Is the bit-width problematic?

For 16-bit floats with (say) 10-bit mantissas, FPGAs should be *great* for
floating point. The problems start with the wider bit-widths. The
area-expensive (and worse-than-linearly scaling) FP components are the barrel
shifters needed for pre-add mantissa operand alignment and post-add
normalization in the FP adder, and of course the FP multiplier array.

The FCCM papers on this subject include:

  Ligon et al, A Re-evaluation of the Practicality of Floating-Point
  Operations on FPGAs, FCCM 1998

  Louca et al, Implementation of IEEE Single Precision Floating Point
  Addition and Multiplication on FPGAs, FCCM 1996

  Shirazi et al, Quantitative Analysis of Floating Point Arithmetic on
  FPGA Based Custom Computing Machines, FCCM 1995

and the neat Leong paper on avoiding the problem entirely:

  Leong et al, Automating Floating to Fixed Point Translation and its
  Application to Post-Rendering 3D Warping, FCCM 1999

See the Ligon paper for a nice presentation of the speed-area tradeoffs of
various implementation choices. Ligon et al estimate their single-precision
FP adder's resource use at between 563 and 629 LUTs -- 36-40% of an XC4020E.
Note this group used a synthesis tool; a hand-mapped design could be smaller.
Put another way, that single-precision FP adder is almost twice the area of a
pipelined 32-bit RISC datapath. Ouch.

The rest of this article explores ideas for slower-but-smaller FP adders.

The two FP add barrel shifters are the problem. Each needs many LUTs and
much interconnect.
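To see where those two shifters come from, here is a minimal word-parallel FP add sketched in C (my illustration, not from the original post; it uses a toy normalized format and ignores signs, rounding, and all IEEE special cases):

```c
#include <stdint.h>

/* Toy format: value = mant * 2^(exp - 23), with mantissa bit 23 set
   (the hidden bit made explicit). Signs, rounding, and IEEE-754
   special cases are all ignored. */
typedef struct { uint32_t mant; int exp; } fp;

static fp fp_add(fp a, fp b) {
    if (a.exp < b.exp) { fp t = a; a = b; b = t; }  /* a has the larger exp */
    int d = a.exp - b.exp;
    /* Pre-add alignment: the first barrel shifter. */
    uint32_t bm = (d < 24) ? b.mant >> d : 0;
    uint32_t sum = a.mant + bm;
    int e = a.exp;
    if (sum >> 24) { sum >>= 1; e++; }              /* carry out of the add */
    /* Post-add normalization: the second barrel shifter (driven by a
       leading-zero counter in real hardware). */
    while (sum && !(sum >> 23)) { sum <<= 1; e--; }
    fp r = { sum, e };
    return r;
}
```

Both the alignment shift (`>> d`) and the normalization loop must complete in one or two cycles in a pipelined word-parallel adder, which is exactly what forces the full barrel shifters.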
For example, a w-bit-wide barrel shifter is often implemented as lg w stages
of w-bit 2-1 muxes, optionally pipelined.

Example 1: single-precision in << s, w=24

  m0  = s[0] ? in[22:0] << 1  : in;
  m1  = s[1] ? m0[21:0] << 2  : m0;
  m2  = s[2] ? m1[19:0] << 4  : m1;
  m3  = s[3] ? m2[15:0] << 8  : m2;   // 16 wires 8 high
  out = s[4] ? m3[7:0]  << 16 : m3;   // 8 wires 16 high

  5*24 2-1 muxes = 120 LUTs

Example 2: double-precision in << s, w=53

  m0  = s[0] ? in[51:0] << 1  : in;
  m1  = s[1] ? m0[50:0] << 2  : m0;
  m2  = s[2] ? m1[48:0] << 4  : m1;
  m3  = s[3] ? m2[44:0] << 8  : m2;   // 45 wires 8 high
  m4  = s[4] ? m3[36:0] << 16 : m3;   // 37 wires 16 high
  out = s[5] ? m4[20:0] << 32 : m4;   // 21 wires 32 high

  6*53 2-1 muxes = 318 LUTs

In a horizontally oriented datapath, the last few mux stages have many
vertical wires, each many LUTs high. This is more vertical interconnect than
is available in one column of LUTs/CLBs, so the actual area can be much worse
than the LUT count indicates.

But we can of course avoid the barrel shifters and do FP
denormalization/renormalization iteratively.

Idea #1: Replace the barrel shifters with early-out iterative shifters. For
example, build a registered 4-1 mux: w = mux(in, w<<1, w<<3, w<<7). Then an
arbitrary 24-bit shift can be done in 5 cycles or less, in ~1/3 of the area.
For double precision, make it something like w = mux(in, w<<1, w<<4, w<<12),
giving an arbitrary 53-bit shift in 8 cycles.

Idea #2: (half-baked and sketchy) Do FP addition in a bit- or nibble-serial
fashion. To add A+B, you

1) compare exponents A.exp and B.exp;
2) serialize A.mant and B.mant, LSB first;
3) swap (using 2 2-1 muxes) lsb-serial(A.mant) and lsb-serial(B.mant) if
   A.exp < B.exp;
4) delay lsb-serial(A.mant) in a w-bit FIFO for abs(A.exp-B.exp) cycles;
5) bit-serial-add delay(lsb-serial(A.mant)) + lsb-serial(B.mant) for w cycles;
6) collect the result in a "sum.mant" shift register;
7) shift up to w-1 cycles (until the result mantissa is normalized).
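As a behavioral check, the seven steps above can be modeled in C (again my illustration, not the original design; one loop iteration plays the role of one clock cycle, and signs, rounding, and IEEE specials are ignored):

```c
#include <stdint.h>

/* Bit-serial FP add model. Same toy format as a word-parallel adder:
   value = mant * 2^(exp - (W-1)), mantissa bit W-1 set. Assumes the
   exponent difference is small enough that the running sum fits in
   64 bits. */
enum { W = 24 };

typedef struct { uint32_t mant; int exp; } sfp;

static sfp serial_fp_add(sfp a, sfp b) {
    /* steps 1 and 3: compare exponents and swap so that b, the
       larger-exponent operand, is the stream that gets delayed */
    if (a.exp > b.exp) { sfp t = a; a = b; b = t; }
    int d = b.exp - a.exp;         /* step 4: FIFO delay, in cycles */
    uint64_t sum = 0, carry = 0;
    for (int c = 0; c < W + d + 1; c++) {
        /* step 2: both mantissas stream LSB first, one bit per cycle;
           delaying b's stream by d cycles performs the alignment */
        uint64_t abit = (c < W) ? (a.mant >> c) & 1 : 0;
        uint64_t bbit = (c >= d && c - d < W) ? (b.mant >> (c - d)) & 1 : 0;
        uint64_t s = abit + bbit + carry;   /* step 5: 1-bit full adder */
        carry = s >> 1;
        sum |= (s & 1) << c;                /* step 6: collect the result */
    }
    /* step 7 (with exponent bookkeeping): shift until bit W-1 is set */
    int e = a.exp;
    while (sum >> W)               { sum >>= 1; e++; }
    while (sum && !(sum >> (W-1))) { sum <<= 1; e--; }
    sfp r = { (uint32_t)sum, e };
    return r;
}
```

Note the datapath per cycle is tiny: two AND gates' worth of stream gating, a 1-bit full adder, and a flip-flop for the carry; all the width is in the delay FIFO and the result shift register.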
It may be that steps 4 and 6 are quite cheap, using Virtex 4-LUTs in shift
register mode -- they're variable tap, right?

It is interesting to consider eliminating steps 2, 6, and 7 by keeping your
FP mantissa values in the serialized representation between operations,
counting clocks since the last sum-1-bit was seen, and then normalizing
(exponent adjustment only) and aligning *both* operands (via swap/delay) on
input to the next FP operation. A big chained data computation might exploit
many serially interconnected serial FP adders and serial FP multipliers...

Is this approach better (throughput/area) than a traditional pipelined
word-oriented FP datapath? Probably not, but I don't know. If your FP needs
are modest (Mflops, not 100 Mflops), this approach should permit quite
compact FP hardware.

(Philip Freidin and I discussed this at FCCM99. Thanks, Philip.)

Jan Gray
Copyright © 2000, Gray Research LLC. All rights reserved.