FPGA Multis and Superscalars
Usenet Postings
Subject: Re: FPGA multiprocessors => vs. uniprocessors
Date: 07 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga

Jack Greenbaum <spamfilt-@greenbaum.us.com> wrote in article
<ljg1qehm38.fsf@greenbaum.us.com>...
> "Jan Gray" <jsgray-@acm.org.nospam> writes:
> > Assuming careful floorplanning, it should be possible to place six 32-bit
> > processor tiles, or twelve 16-bit processor tiles, in a single 56x56
> > XC4085XL with space left over for interprocessor interconnect.
>
> You might be interested in another view of single chip multiprocessors.
>
> Patt et al., "One Billion Transistors, One Uniprocessor, One Chip",
> IEEE Computer, Sept. 1997, pp. 51-57.
>
> They argue against multiple processors on a single chip, because it
> makes what is already an I/O bound system even worse. Just because you
> can put multiple processors on a die doesn't mean you can feed them
> instructions and data.
>
> Jack Greenbaum -- jack at greenbaum.us.com

This month's IEEE Computer was certainly a blockbuster. With all respect to Dr. Patt and the U. Michigan guys, whose earlier HPS ideas are proven and now ship in nearly every high-end microprocessor, and whose present paper is most intriguing, I'm afraid I was somewhat more convinced by the Hammond et al. paper "A Single-Chip Multiprocessor" in the same issue. In particular, I don't yet believe that branch predictors can get as good as they say, especially on real-world spaghetti object-oriented code. But who knows where compiler technology, especially dynamic recompilation, will take us in the years to come.

Also, Patt et al.'s statement is: "In our view, if the design point is performance, the implementation of choice is a very high-performance uniprocessor on each chip, with chips interconnected to form a shared memory multiprocessor." This does not appear to be a statement about throughput or about price/performance. Indeed, today processor implementations are usually compared using single-threaded benchmarks, but I think this will change in the next decade. For example, my real job is writing infrastructure for transaction processing for distributed COM objects, and we can certainly keep many, many threads busy with useful work.
(//www.microsoft.com/transaction)
(//www.microsoft.com/backoffice/scalability/billion.htm)

Anyway, back to my posting about multiprocessor FPGAs. I wrote it not because I seriously think it's the best way to use FPGAs, but because I thought it remarkable that FPGAs are now large enough to host half a dozen or a dozen modest, simple RISC processors. And although I pointed out it was a paper design, I did account for providing adequate instruction/data bandwidth to each processor. In the case of the 6-way 32-bit RISC multiprocessor, there were six small I-caches and six separate memory ports to DRAM (could be SRAM, I suppose). The XC4085XL has enough IOBs and pins to do this. (Certainly if Xilinx hurries up and licenses the RAMBUS interface there will be pins and bandwidth galore.)

Also, FPGA RISC processors are of course relatively slow. A straightforward 3- or 4-stage pipelined implementation of a single-issue machine should run at 40 MHz or so. A more deeply pipelined microarchitecture could approach 80 MHz. (In today's XC4000XLs we are unlikely to significantly exceed 100 MHz, because that is approximately the register file (distributed RAM block) cycle rate.) Their slowness means their bandwidth needs are more modest, and their issue penalty for cache misses is also less severe.
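To put rough numbers on it (back-of-the-envelope only -- the clock, I-cache hit rate, and load/store mix below are illustrative guesses of mine, not measurements), consider one 40 MHz single-issue processor:

    /* Back-of-the-envelope external memory bandwidth for one slow FPGA RISC.
       Every constant here is an assumption for illustration, not a measurement. */
    #include <stdio.h>

    int main(void)
    {
        double mhz        = 40.0;               /* single-issue pipeline clock      */
        double ifetch_bw  = mhz * 4.0;          /* MB/s: one 32-bit fetch per cycle */
        double icache_hit = 0.90;               /* assumed small I-cache hit rate   */
        double ldst_frac  = 0.30;               /* assumed loads+stores per instr   */

        double ext_ifetch = ifetch_bw * (1.0 - icache_hit);
        double ext_data   = mhz * 4.0 * ldst_frac;

        printf("peak I-fetch:      %6.1f MB/s\n", ifetch_bw);            /* 160.0 */
        printf("external I-fetch:  %6.1f MB/s\n", ext_ifetch);           /*  16.0 */
        printf("external data:     %6.1f MB/s\n", ext_data);             /*  48.0 */
        printf("external total:    %6.1f MB/s\n", ext_ifetch + ext_data);/*  64.0 */
        return 0;
    }

That's roughly 64 MB/s of external traffic per processor on those guesses; even doubling them leaves each processor well within what a single reasonably efficient memory port can supply.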
So you don't need big caches, just a fairly efficient memory cycle -- SRAM, page-mode EDO, or open-bank SDRAM accesses.

Now, the multiprocessor I originally described had (say) six separate memory banks, each private to a processor. A more useful and more readily programmed machine would provide a single shared memory. I'm thinking of a design where (say) four processors contend for two or four address-interleaved banks of memory. You use the center part of the FPGA as a 4x2 or maybe 4x4 x32 crossbar switch so that several processors can simultaneously access separate memory banks. Of course, I haven't simulated this, so don't take it too seriously.

Finally, let's discuss where we can go with a fast uniprocessor on a large FPGA. I have given this some thought over the years.

One big challenge is register file design. The custom guys don't blink (much) at producing 1-cycle 8-read 4-write register files for their wide-issue superscalars, but given today's Xilinx distributed RAMs this is unachievable. In my processors I do 1-cycle 2-read 1-write register files using two copies of the 1-read 1-write distributed RAM. Doing a 1-cycle, fully arbitrary n-read 2-write register file is damn hard. Instead it is better to move to an LIW ISA with multiple independent register files. For instance, a 2-issue LIW would have instructions like:

    op1 dest1 src1a src1b   op2 dest2 src2a src2b

where dest1 is retired into register file 1 and dest2 into register file 2. With a few more copies of register file 1 and register file 2 we can allow some or all of the source operands to read from either register file. (For instance, we can build dual 3-read 1-write register files with six copies of the distributed RAM.)

Another challenge is ALU delay. Including clock-to-out, ALU latency, setup time, interconnect, etc., this can be more than 20 ns. Speeding it up requires either putting a register in the middle of the adder (not good) or using duplicate adders for even/odd cycles (good), accepting a two-cycle adder latency. Using this technique you can probably clock a processor at 66 or 80 MHz.

Put these ideas together and one can certainly see a 66 MHz 2-issue LIW in an XC4013E and perhaps a 4-issue VLIW in an XC4036XL. But for the latter you need a very good optimizing compiler.

Cheers,
Jan Gray


Subject: Re: FPGA multiprocessors => vs. uniprocessors
Date: 07 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga

Me again. I wrote:
> Put these ideas together and one can certainly see a
> 66 MHz 2 issue LIW in a XC4013E and perhaps a 4 issue
> VLIW in a XC4036XL. But for the latter you need a very
> good optimizing compiler.

It can be done, but I think I chose the wrong parts. First, for this speed we need each half of the LIW processor to get an I-cache slice or at least a loop buffer. This widens the datapath from 2x16x13 (say) to 2x16x20 CLBs and forces you up into the larger XC4000XL parts.

Jan Gray
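A note on the interleaved-bank idea above: here is a toy, one-cycle C model of the bank selection and arbitration (the word interleaving, fixed-priority grants, and all the constants are illustrative choices of mine -- again, nothing simulated or built):

    /* Toy model: 4 CPUs contend for 4 word-interleaved memory banks through
       a crossbar, at most one grant per bank per cycle.  The interleaving,
       the priority order, and the sizes are illustrative assumptions only. */
    #include <stdio.h>

    #define NCPU  4
    #define NBANK 4

    /* Low-order word-address bits select the bank. */
    static int bank_of(unsigned addr) { return (addr >> 2) & (NBANK - 1); }

    int main(void)
    {
        /* Each CPU's requested byte address this cycle. */
        unsigned req[NCPU] = { 0x1000, 0x1004, 0x2008, 0x2008 };
        int granted_cpu[NBANK];
        int b, c;

        for (b = 0; b < NBANK; b++)
            granted_cpu[b] = -1;

        /* Fixed-priority arbitration: the lowest-numbered CPU wins each bank. */
        for (c = 0; c < NCPU; c++) {
            b = bank_of(req[c]);
            if (granted_cpu[b] < 0)
                granted_cpu[b] = c;    /* crossbar routes CPU c to bank b */
            else
                printf("cpu %d stalls (bank %d busy)\n", c, b);
        }

        for (b = 0; b < NBANK; b++) {
            if (granted_cpu[b] < 0)
                printf("bank %d: idle\n", b);
            else
                printf("bank %d: serving cpu %d\n", b, granted_cpu[b]);
        }
        return 0;
    }

In this cycle CPUs 0, 1, and 2 hit different banks and proceed in parallel; CPU 3 collides with CPU 2 on the same bank and would simply retry the next cycle.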
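Similarly, the register file trick above (a 2-read 1-write file from two copies of the 1-read 1-write distributed RAM) is easy to see in software form. A rough model only -- the 16-entry depth, the names, and the 32-bit width are illustrative, and a real XC4000 design would typically build each copy from 16x1 distributed RAM columns, one per result bit:

    /* 2-read 1-write register file from two copies of a 1R1W RAM:
       the single write port updates both copies, and each read port
       reads its own copy, so both source operands arrive in one cycle. */
    #include <assert.h>
    #include <stdint.h>

    #define NREGS 16

    static uint32_t copy_a[NREGS];   /* feeds read port A (first source operand)  */
    static uint32_t copy_b[NREGS];   /* feeds read port B (second source operand) */

    static void rf_write(unsigned rd, uint32_t val)
    {
        copy_a[rd] = val;            /* one write port drives both copies */
        copy_b[rd] = val;
    }

    static uint32_t rf_read_a(unsigned rs) { return copy_a[rs]; }
    static uint32_t rf_read_b(unsigned rs) { return copy_b[rs]; }

    int main(void)
    {
        rf_write(3, 0x1234u);
        rf_write(5, 0x5678u);
        /* Both source operands of one instruction read in the same cycle. */
        assert(rf_read_a(3) == 0x1234u);
        assert(rf_read_b(5) == 0x5678u);
        return 0;
    }

The same duplication scales to the dual 3-read 1-write files of the 2-issue LIW: each register file keeps three copies, and each write still updates every copy of its own file, which is where the count of six comes from.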
Copyright © 2000, Gray Research LLC. All rights reserved.