FPGA Array Supercomputers
Subject: FPGA array computers
Date: 15 Feb 1999 00:00:00 GMT
Newsgroups: comp.sys.super,comp.arch.fpga

Greg Pfister wrote in message <36C5D127.2096E5F7@usNOSPAM.ibm.com>...
>Anybody wanna bet that this sucker is SIMD?

If it is an array of FPGAs, it can be SIMD one moment, MIMD another, and
zero-instruction-multiple data (parallel hardwired datapaths) the next. Or a
hybrid, a SIMD or MIMD where each datapath also has problem specific function
units.

See www.optimagic.com/research.html for links to some FPGA computing research.

How about some numbers? What might one build from (say) 256 Xilinx XCV300
Virtex FPGAs, each of which has 32x48 4-bit configurable logic blocks + 16
256x16 dual-ported block SRAMs?

Let's consider some examples -- blue sky, back of the envelope numbers based
solely upon available FPGA resources.

(Disclaimers: I build homebrew FPGA RISC uniprocessors but my MP designs are
paper tigers. Any numbers are approximate, best case, will not exceed, peak
numbers. Actual numbers may underwhelm. Machines may be very difficult to
program. These design sketches may prove unroutable. Etc.)

1. 0IMD (hardwired datapaths):

   array of 16-bit adders:
   256 * 32*48*4 / 16 = 256 * 384 = 98000 adders
   at 100 MHz = 10e12 adds/s

   array of 16-bit adders + 16-word reg files:
   256 * 32*48*4 / 32 = 256 * 192 = 49000 adders
   at 100 MHz = 5e12 adds/s

2. SIMD: array of 16-bit datapaths.

   Assume each 16-bit datapath has:
   * 16 word register file
   * add/sub, logic unit
   * operand mux
   * SIMD control logic (conditionally suppress result writeback, etc.)
   * shared access to long-line operand/result broadcast bus
   -----
   8R*3C = 24 Virtex CLBs

   Assume 80% of FPGA area is datapath tiles and 20% is interconnect, memory
   interface, and control.

   256 * 32*48 / 24 * 0.8 = 16384 * 0.8 = 13000 datapaths
   at 50 MHz = 600e9 ops/s

3. MIMD: array of 32-bit RISC processors suitable for straightforward
   targeting from a C or FORTRAN compiler.

   Assume each 32-bit processor has:
   * 4-stage pipeline (IF/RF/EX/WB)
   * 16-bit instructions
   * 2R-1W 16 word x 32-bit register file
   * result forwarding (reg file bypass mux)
   * immediate operand mux
   * add/sub, logic unit, <<, >> 1, 4 (at least)
   * PC, PC incrementer, relative branches, conditional branches
   * jump, jump and link
   * memory address mux and register
   * pipelined control unit
   * 32-entry x 8 halfword-line i-cache (e.g. 256 instruction L1 i-cache)
   * no d-cache and no floating point
   ----
   16R*8C CLBs = 128 CLBs, + 1 256x16 block RAM

   This gives 8 processors per XCV300 and leaves 1/3 of chip area (32*16 CLBs)
   and half the block RAMs free for memory interface and interconnect.

   256 FPGAs * 8 CPUs/FPGA = 2000 32-bit processors
   at 50 MHz = 100e9 MIMD 32-bit ops/s

See also my old FPGA MP-on-chip discussion thread at
http://dejanews.com/getdoc.xp?AN=277216882. (Note, CLBs there are the 2-bit
CLBs of the XC4000 family, not the 4-bit CLBs of the Virtex family.)
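The peak figures above are straight resource arithmetic. As a sanity check,
here is a short Python sketch that replays it using only the device counts,
CLB sizes, and clock rates already quoted (the variable names are mine, for
illustration only):

# Back-of-the-envelope check of the three design points above.
FPGAS = 256
CLBS_PER_FPGA = 32 * 48             # 1536 CLBs per XCV300
BITS_PER_FPGA = CLBS_PER_FPGA * 4   # 4 "bits" of logic per Virtex CLB = 6144

# 1. 0IMD: hardwired 16-bit adders, with and without 16-word reg files
adders         = FPGAS * BITS_PER_FPGA // 16       # 98,304 adders
adders_with_rf = FPGAS * BITS_PER_FPGA // 32       # 49,152 adders + reg files
print(f"{adders * 100e6:.1e} adds/s")              # ~9.8e12 at 100 MHz
print(f"{adders_with_rf * 100e6:.1e} adds/s")      # ~4.9e12 at 100 MHz

# 2. SIMD: 24-CLB 16-bit datapaths, 80% of CLB area usable as datapath tiles
datapaths = int(FPGAS * CLBS_PER_FPGA / 24 * 0.8)  # ~13,107 datapaths
print(f"{datapaths * 50e6:.1e} ops/s")             # ~6.6e11 at 50 MHz

# 3. MIMD: 128-CLB 32-bit RISC cores, 8 per XCV300
cpus = FPGAS * 8                                   # 2048 processors
print(f"{cpus * 50e6:.1e} ops/s")                  # ~1.0e11 at 50 MHz

None of this arithmetic says anything about routability or programmability --
see the disclaimers above.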
Other comments.

Interconnect? Memory bandwidth? Consider a hypothetical "XYZ" machine using a
simple 3D mesh of 16 boards of 4x4 XCV300s. Give each FPGA 128 bits of SDRAM
-- 2 DIMM sockets w/ 64 MB each for a total of 256*2*64 MB = 32 GB. Add a
17th XCV300+SDRAM per board for configuration and control.

Configure each FPGA with 6 (NSEWUD) 16-bit channels, for 400 MB/s/chan at
200 MHz. (Virtex datasheet says 200 MHz chip-to-chip using "HSTL class IV"
signaling.) The FPGA at (x,y,z) transmits E to (x+1,y,z), N to (x,y+1,z), U to
(x,y,z+1) and receives W from (x-1,y,z), S from (x,y-1,z), and D from
(x,y,z-1). Assume the cross-board up/down channels only run at 50 MHz for
100 MB/s/chan.

Peak bisection bandwidth of 4*4*100 MB/s = 1.6 GB/s "sliced between boards"
and 4*16*400 MB/s = 25 GB/s sliced vertically. Peak external memory bandwidth
of 256 * 128/8 * 100 MHz = 400 GB/s. Peak internal memory bandwidth to block
RAMs = 256 FPGAs * 16 blocks * 2 ports * 2 B/port * 100 MHz = 1.6 TB/s.

While these point-to-point meshes have excellent bandwidth, they have
relatively high latency and seem complex and expensive to implement if
communication is irregular. For interconnecting a few hundred FPGAs in a
scalable shared memory MIMD, I prefer a simpler 2D or 3D Meerkat-like
interconnect with multiple buses in the X, Y, and Z dimensions, such that the
FPGA at (x,y,z) interconnects to (*,y,z) on the X[y][z] bus, (x,*,z) on the
Y[x][z] bus, and (x,y,*) on the Z[x][y] bus. (See "The Meerkat Multicomputer:
Tradeoffs in Multicomputer Architecture", Robert Bedichek Ph.D. thesis --
http://cag-www.lcs.mit.edu/~robertb/thesis.ps.)

Latency? For the MIMD sketched above, each processor has a local 256-halfword
i-cache. I-cache misses and all data accesses are from uncached RAM. Local
references to uncached RAM access local SDRAM in < 100 ns, much less if the
reference hits an open page. Non-local load/store transactions would issue
through the interconnect to a distant FPGA. Fortunately memory latency is less
of an issue when each processor is single issue and has a slow 20 ns clock.

Cost? The raw IC cost of this hypothetical machine is very approximately:

272 XCV300-4BG352C at $344 each, per Avnet web site at quantity 25 discount
544 64MB (8Mx64) PC100 SDRAM DIMMs at $88 each, per chip merchant
-------
~$150,000
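The XYZ machine's bandwidth, memory, and cost figures come from the same kind
of arithmetic. Here is a short Python sketch that replays those numbers (1999
street prices as quoted above; names again illustrative):

# Replay of the hypothetical XYZ machine's peak-bandwidth, memory, and cost
# arithmetic, using only the figures quoted above.
FPGAS = 256                                  # 16 boards x 4x4 compute XCV300s

sdram_gb = FPGAS * 2 * 64 / 1024             # 2 x 64 MB DIMMs per FPGA = 32 GB
ext_bw   = FPGAS * (128 // 8) * 100e6        # 128-bit SDRAM @ 100 MHz ~ 400 GB/s
int_bw   = FPGAS * 16 * 2 * 2 * 100e6        # 16 block RAMs x 2 ports x 2 B ~ 1.6 TB/s

bisect_between_boards = 4 * 4 * 100e6        # 16 up/down links @ 100 MB/s = 1.6 GB/s
bisect_vertical_slice = 4 * 16 * 400e6       # 64 in-board links @ 400 MB/s ~ 25.6 GB/s

# Raw IC cost: 272 FPGAs (including the 17th control FPGA per board) + 544 DIMMs
cost = 272 * 344 + 544 * 88                  # $141,440 raw parts; call it ~$150,000

print(sdram_gb, ext_bw, int_bw, bisect_between_boards, bisect_vertical_slice, cost)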
Jan Gray

Copyright © 2000, Gray Research LLC. All rights reserved.