Inner Loop Custom Datapaths
Subject: Re: FPGA multiprocessors
Date: 07 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga

Charles Sweeney <CharlesSweeney-@compuserve.com> wrote in article <3438A7D6.2431@compuserve.com>...
> Jan Gray wrote:
> > Assuming careful floorplanning, it should be possible to place six 32-bit
> > processor tiles, or twelve 16-bit processor tiles, in a single 56x56
> > XC4085XL with space left over for interprocessor interconnect. Also the
> > number of processor tiles can be doubled if we eschew the I-cache and
> > simplify the microarchitecture -- though performance would greatly suffer.
>
> It's good to see you planning to take advantage of the parallelism
> offered by FPGAs, but why constrain your software to have to run in a
> particular microprocessor architecture? why not go further and compile
> your programs directly into the hardware of the FPGA, Handel-C does
> exactly that, please see our web site below.

Good question. The trite answer is since designing processor ISAs and microarchitectures for FPGA implementations is my research interest, that's my hammer in search of nails. FPGA multiprocessors are now possible -- but it remains to be seen if they are actually useful!

The other answer is that I don't preclude a modest custom datapath per processor (and such datapaths could be designed from source code by tools such as Handel-C).

So I think an FPGA multiprocessor is the preferred solution for problems which:
1. are amenable to n-way "outer loop" parallelism and
2. involve too much irregular computation for custom datapath only and
3. involve enough inner loop regular computation that an FPGA custom datapath is faster/cheaper than a general purpose processor or multiprocessor built of same.
(Whether such problems exist and are important remains to be seen.)

As for your question "why not go further and compile your programs directly into the hardware of the FPGA?"

There will always be very regular signal processing applications, regular in computation, regular in operand fetch and result store, and relatively simple in the computation kernel, for which a custom datapath compiled to an FPGA is a good solution.

But there are also other computations which are either too irregular or too large to practically implement in an FPGA datapath, even in a time-multiplexed (reconfiguration) manner. The "outer loops" and "outer function calls" of these computations are best done in a general purpose processor, even as you move the inner loop(s) to a custom datapath. Indeed, the inner loops may constitute only a few percent of the total text of the source code of the computation. To help these large "dusty deck" applications take advantage of custom datapaths, it must be extremely convenient to interface the custom stuff to the general purpose processor.

For some problems where even the irregular computation is a critical path, especially those involving floating-point, it probably makes sense to choose a fast, cheap commercial off-the-shelf microprocessor. Of course there are penalties here. Cost of processor. Less integration. Board real-estate costs. "Representation domain crossing" costs. Relatively slow communication between processor and FPGA. Cost of FPGA resources spent interfacing to processor.

But for problems where the irregular computation is not the critical path, the now modest overhead (10-20%) of an embedded general purpose CPU enables an interesting integrated "system on chip" hybrid: embedded processor, on-chip bus, on-chip custom datapaths and peripherals.

In theory, you could compile your dusty deck C, C++, Java, FORTRAN, Scheme, etc. and run it immediately on your FPGA CPU. Then automatically (profile driven) or through explicit directives, you can compile the inner loops to a custom datapath. This can either be manifest as an on-chip command oriented coprocessor, or in some cases as new instructions.
The latter has the potential advantage of very high custom operation issue rates (today, 66 MHz) and access to processor register file, etc.

Given this approach, even if your dusty deck app stores its data in such advanced data structures (sarcasm) as a linked list (/sarcasm), it can still potentially take advantage of a custom datapath. This is much less feasible if your registers or operand(s) are microseconds away on the non-embedded host processor.

For example, the unused logic in
//www3.sympatico.ca/jsgray/sld021.htm
was reserved for the Gouraud rendering instructions described in the last paragraph in:
//www3.sympatico.ca/jsgray/render.txt

Of course, embedded processor in programmable logic is just one point on the CPU/custom datapath spectrum. See also the BRASS research
//http.cs.berkeley.edu/Research/Projects/brass
and my old essay on FPGA PC coprocessors
//www3.sympatico.ca/jsgray/coproc.txt

Jan Gray
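
The point about operand distance can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement. It assumes a linked-list traversal in which each node needs one custom operation, uses the 66 MHz issue rate quoted above for the embedded case, and takes "microseconds away" to mean a hypothetical 1 microsecond round trip per node for the non-embedded host. The function names, the round-trip figure, and the list length are all invented for illustration.

```python
# Back-of-the-envelope model. Every figure except the 66 MHz issue rate
# is an illustrative assumption, not a measurement.

ISSUE_RATE_HZ = 66e6       # custom operation issue rate quoted in the post
HOST_ROUND_TRIP_S = 1e-6   # assumed: "microseconds away" taken as 1 us


def embedded_time(nodes: int) -> float:
    """Seconds to process `nodes` list nodes when the custom operation is
    an instruction of the embedded CPU (optimistically, one issue slot
    per node, ignoring the CPU's own load latency)."""
    return nodes / ISSUE_RATE_HZ


def host_time(nodes: int, round_trip_s: float = HOST_ROUND_TRIP_S) -> float:
    """Seconds when each node's operand must cross from an external host.
    Pointer chasing is serial, so the model does not overlap round trips."""
    return nodes * (round_trip_s + 1.0 / ISSUE_RATE_HZ)


if __name__ == "__main__":
    n = 1_000_000  # hypothetical list length
    e, h = embedded_time(n), host_time(n)
    print(f"embedded: {e * 1e3:.1f} ms, host: {h * 1e3:.1f} ms, "
          f"ratio: {h / e:.0f}x")
```

Under these assumptions the external-host traversal is about 67 times slower, and almost all of that time is communication rather than computation. A regular array could be streamed in bulk to hide the round trip; a linked list cannot, which is why access to the processor's registers matters for irregular data.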
Copyright © 2000, Gray Research LLC. All rights reserved.