CPUs vs. FPGAs
Subject: Re: advice request: CPUs vs. FPGAs again
Date: 20 Jan 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga

[This article is a repost of <01bc0692$176777e0$4e7dbacd@p5-166>, cleaned up and with corrections regarding use of XC4000 H-muxes.]

Ben Chaffin <98bcc-@williams.edu> wrote in article <32E01194.778E@williams.edu>...
> Hello,
> I am a computer science major at Williams College. I am working
> on a computational number theory problem for one of my professors. I have
> implemented it in C and optimized it to death and am now considering
> building a special purpose computer to speed it up. So, along with
> everyone else, I'm trying to find the best balance between a limited
> budget and speed. I have a design for discrete chips that should run
> about 5 times as fast as a P-120. The cost is $300-$400, which isn't bad,
> but the circuit involves nearly 200 chips, with about 3000 pins. That's a
> lot of wiring. The algorithm involves:
>
> Two parallel 128-bit additions of internal-only data
> Storage for the result (256 bits)
> Two 48-bit conditional inversions using XOR gates
> Two priority encodings across 48 bits
> Some counters, flip-flops and simple control circuitry
> Around 50 bits of output

Hooray for such questions, which help to give us insight into the categories of problems that are better done on FPGAs than on general purpose computers. In this case, we compare a fast implementation on a general purpose computer with a fast implementation on an FPGA (an XC4010E-2) which seems well suited to the problem. I hope the discussion is of some indirect use to Mr. Chaffin.

Let's consider how well this problem could run on a superscalar 64-bit microprocessor. The description above would seem to require 2 adds, 2 adds-with-carry, 2 complements, 2 conditional moves, and 2 instances of a priority encoder. Each of the latter could be implemented as two iterations of binary search followed by a table lookup, e.g. as a shift, branch, shift, branch, load byte (4KB table), add constant. That's 30-40 clocks on a 2-integer-issue machine, assuming all four branches are mispredicted with a 5-cycle misprediction penalty (as on a DEC Alpha 21164), and mostly L1 cache hits. With a 3 ns clock, that's about 100 ns per iteration.
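For concreteness, here is a minimal C sketch of that instruction sequence. It is my own illustration, not Mr. Chaffin's code; in particular, the data layout (applying the inversions and encodings to the low 48 bits of each sum) and names like msb12 are assumptions.

    #include <stdint.h>

    static uint8_t msb12[1 << 12];  /* 4KB table: index of the highest set
                                       bit of a nonzero 12-bit value */

    static void init_msb12(void)
    {
        for (int v = 1; v < (1 << 12); v++) {
            int b = 0;
            while ((v >> b) > 1) b++;
            msb12[v] = (uint8_t)b;
        }
    }

    /* Priority encode a nonzero 48-bit value: two binary-search steps
       (shift, branch) narrow 48 bits to 12, then load byte, add constant. */
    static int pri48(uint64_t x)
    {
        int base = 0;
        if (x >> 24) { x >>= 24; base = 24; }
        if (x >> 12) { x >>= 12; base += 12; }
        return base + msb12[x & 0xFFF];
    }

    /* One 128-bit add over two 64-bit words: add, then add-with-carry. */
    static void add128(uint64_t a[2], const uint64_t b[2])
    {
        uint64_t lo = a[0] + b[0];
        a[1] += b[1] + (lo < a[0]);  /* carry out of the low word */
        a[0] = lo;
    }

    /* One iteration: two 128-bit adds, two conditional 48-bit inversions
       (complement under a mask; the ?: should compile to conditional
       moves), and two priority encodings. */
    static void step(uint64_t s0[2], uint64_t s1[2],
                     const uint64_t a[2], const uint64_t b[2],
                     int inv0, int inv1, int *p0, int *p1)
    {
        const uint64_t mask48 = (1ULL << 48) - 1;
        add128(s0, a);
        add128(s1, b);
        uint64_t x = (inv0 ? ~s0[0] : s0[0]) & mask48;
        uint64_t y = (inv1 ? ~s1[0] : s1[0]) & mask48;
        *p0 = pri48(x);
        *p1 = pri48(y);
    }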
Now let's consider an FPGA implementation, specifically a simpleminded, non-pipelined, no-sharing approach. We'll use a Xilinx XC4010E-2 (20x20 CLBs), $110 quantity one according to www.marshall.com.

128-bit adder:

Slow approach: 64 CLBs configured as a 128-bit fast ripple carry adder (3+ columns of CLBs). Latency: approx. 3 + 64*0.6 = 42 ns.

Faster approach: 16 + 2*16 + 2*16 + 2*16 CLBs configured as a 32+32+32+32-bit carry select adder. Latency: approx. 3 + 16*0.6 + 5 gate delays + slop = 25 ns.

[By the way, if a CLB's F-LUTs and fast carry circuits are configured as a 2-bit fast ripple carry adder, can you still use its H-LUT to multiplex one of its F-LUT outputs with another signal, driving *that* onto an X or Y output?]

Assuming H-muxes to multiplex the intermediate sums, two such adders require about 240 CLBs, more than half of the chip's resources. You can store the sums in the flip-flops of the same CLBs. (Note: we did not discuss the adder input values. We may need to store an additional 256 or 512 bits of input state.)

Two 48-bit inversions require a total of 48 CLBs if you can't fold them in elsewhere. Latency: 2-3 ns of logic plus some interconnect delay.

48-bit priority encoder: a 4-bit encoder is 1.5 CLBs. A 16-bit encoder is about 10 CLBs: 4 4-bit encoders (6 CLBs) + about 4 CLBs to combine them. A 48-bit encoder is about 36 CLBs: 3 16-bit encoders (30 CLBs) + about 6 CLBs to combine them. Therefore, two encoders require 72 CLBs. Latency is about 4 gate delays (8 ns) plus some interconnect delay => 15 ns.
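The same hierarchical structure, modeled purely illustratively in C rather than in CLBs (leaf encoders whose results a second level combines; in the actual FPGA all levels evaluate in parallel, which is where the 4-gate-delay latency comes from):

    #include <stdint.h>

    typedef struct { int valid; int index; } enc_t;

    /* 4-bit leaf priority encoder */
    static enc_t enc4(unsigned v)
    {
        enc_t e = { v != 0, 0 };
        if (v & 8) e.index = 3;
        else if (v & 4) e.index = 2;
        else if (v & 2) e.index = 1;
        return e;
    }

    /* 16-bit encoder: four 4-bit encoders plus combining logic */
    static enc_t enc16(unsigned v)
    {
        enc_t e = { 0, 0 };
        for (int i = 3; i >= 0; i--) {  /* highest valid nibble wins */
            enc_t leaf = enc4((v >> (4 * i)) & 0xF);
            if (leaf.valid) { e.valid = 1; e.index = 4 * i + leaf.index; break; }
        }
        return e;
    }

    /* 48-bit encoder: three 16-bit encoders plus combining logic */
    static enc_t enc48(uint64_t v)
    {
        enc_t e = { 0, 0 };
        for (int i = 2; i >= 0; i--) {  /* highest valid 16-bit group wins */
            enc_t leaf = enc16((unsigned)(v >> (16 * i)) & 0xFFFF);
            if (leaf.valid) { e.valid = 1; e.index = 16 * i + leaf.index; break; }
        }
        return e;
    }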
Altogether, a non-pipelined implementation could complete an iteration in 50 ns. If there is a way to overlap or pipeline the adder and the priority encoder, you might get this down to 30 ns. That is 2-3X faster than the general purpose CPU approach.

However, it is also instructive to compare the approximate latencies of the individual operations:

                        CPU     FPGA
  Adds:                6 ns    25 ns
  Inversions:          6 ns   <10 ns
  Priority encoders:  90 ns    15 ns

The CPU loses badly because of the expense of the priority encoder implementation. So, one approach to the problem is to find a CPU with a priority encoder (e.g. the MicroUnity Mediaprocessor), or a not-yet-existent CPU with a little bit of programmable logic in its datapath, with which to implement a priority encoder...

This circuit just fits in the 400 CLBs of an XC4010E. Two instances would fit in an XC4020E or XC4025E, and more in the promised XC4000EX family parts. So, if this is not an inherently serial problem, doing several iterations in parallel would show a further performance advantage for the FPGA approach. Or, as Ray Andraka suggests in another reply, if your problem tolerates a lot of parallelism, a "massively" instanced bit serial approach will be faster yet.

Jan Gray

Copyright © 2000, Gray Research LLC. All rights reserved.