Altera Flex10KE CPUs
Usenet Postings
Subject: Re: Dual port, new Altera FLEX 10KE EABs
Date: 22 Mar 1998 00:00:00 GMT
Newsgroups: sci.electronics.components,alt.engineering.electrical,sci.electronics.basics,sci.electronics.design,comp.arch.fpga

Ray Andraka wrote in message <3515F73D.5864@ids.net>...
>Contrasted with the memories in other FPGAs, such as the EAB in the
>altera, this is a better set-up IMHO. ALtera EABs get 'dual porting' by
>cycle splitting...that means that you get half the memory bandwidth, and
>more importantly no simultaneous read/write at different locations.

Ah, but Altera just announced the new 10KE family. See http://www.altera.com/html/mktg/10ke-intro.html.

Whereas the 10K(A) family had single ported RAM via 256x8 EABs, the 10KE family apparently _will_ have ("first device shipments beginning in June") dual ported RAM (one write addr/data port, one read addr/data port) via 256x16 EABs.

(Perhaps someone from Altera can explain the table on http://www.altera.com/html/mktg/10ke-1.html which shows different speeds for 16x32, 32x32, ..., 256x32 FIFOs, when presumably they are using two 256x16 EABs in each case, shouldn't they all be 150 MHz?)

Based on their marketing info, it appears 10KE family EABs will provide twice the storage and four times the dual port bandwidth of the 10KA family -- true dual porting plus twice as many bits wide per EAB. Of course this assumes there will be adequate interconnect resources to get all these address and data bits to/from the EAB from/to the LABs.

With this development I may have to reconsider some things I wrote last fall, in my comparison of the suitability of the Xilinx XC4000 and Altera FLEX 10K architectures to implement RISC processor datapaths (attached below). I am delighted to see Altera has addressed my concerns regarding the need for faster multiple port access and the need for x16 organization of the EABs. I can't help but wonder if my posting contributed to this product announcement...
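As a rough check on the "twice the storage and four times the dual port bandwidth" estimate, here is a small Python sketch of the arithmetic. It assumes a 10K(A) EAB completes one access (read or write) per cycle, and a 10KE EAB completes one read plus one write per cycle. The class and field names are made up for illustration and do not come from any Altera documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eab:
    """Simplified model of one embedded array block (illustrative only)."""
    depth: int               # number of words
    width: int               # bits per word
    accesses_per_cycle: int  # assumption: 1 = single port, 2 = one read + one write

    @property
    def bits(self) -> int:
        # Total storage in bits.
        return self.depth * self.width

    @property
    def bits_per_cycle(self) -> int:
        # Peak bits moved in or out of the block in one clock cycle.
        return self.width * self.accesses_per_cycle


# Organizations quoted in the posting; port counts are the modelling assumption.
FLEX_10KA = Eab(depth=256, width=8, accesses_per_cycle=1)
FLEX_10KE = Eab(depth=256, width=16, accesses_per_cycle=2)

storage_ratio = FLEX_10KE.bits / FLEX_10KA.bits
bandwidth_ratio = FLEX_10KE.bits_per_cycle / FLEX_10KA.bits_per_cycle

print(f"storage:   {FLEX_10KA.bits} -> {FLEX_10KE.bits} bits ({storage_ratio:g}x)")
print(f"bandwidth: {FLEX_10KA.bits_per_cycle} -> {FLEX_10KE.bits_per_cycle} "
      f"bits/cycle ({bandwidth_ratio:g}x)")
```

The model ignores clock rate differences between the families and any interconnect limits, so it only reproduces the per-cycle ratios.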
I still prefer an array chock full of distributed select-RAMs, over large central EAB RAM blocks, but better still would be to have both.

"This is a great time to be us". With the new 10KE family, the new ORCA 3C family, and, sooner or later, the new Xilinx Virtex family, we have many interesting projects ahead of us...

Jan Gray

------
attachment: old comparison of Altera vs Xilinx architectures for CPU datapaths:

Subject: Re: FPGA based CPU ideas, and novel extensions => distributed RAM and Altera CPUs
Date: 14 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga

David Atkins wrote in message ...
>Any of these kicking around for Altera, if not for a good reason, ?
>Somehting of an interest but not in aposition to find the time for the
>money to get into, we use 10k10's at present and the techniques would be
>intersting, any pointer greatfully recieved.

(Disclaimer: I have studied but never used Altera devices.)

FPGA RISC CPUs, e.g. CPUs with adequate register files, can certainly be implemented in the Altera FLEX 10K family, which has many nice features. However, in my opinion, the Xilinx XC4000 architecture seems a better platform (higher performance) for this application because of its distributed RAM feature.

In particular, a simple RISC datapath benefits from a 2-read, 1-write port register file. In an XC4000, these can (in theory) be built and run at up to about 10 ns/cycle using two banks of dual port mode distributed RAM. [tWCTS=9.0, 8.4, 7.7 ns in XC4000XL-3, -2, -1]. Of course to take advantage of this 66-100 MHz operation you need the deeply pipelined even/odd ALUs I described in another recent posting.

In contrast, in a FLEX 10K device, you would use EABs (the 256x8 embedded RAM blocks). A 32x32 2-read 1-write register file would then require 3 cycles using 4 EABs, or 2 cycles using 8 EABs (two copies of the register file), at (in theory) 10+ ns/cycle. [tEAWRCREG and tEARCREG=11.6, 9.5 ns in EPF10K50V-4, -3].
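The EAB counts and cycle counts quoted for the 32x32 register file can be reproduced with a short Python sketch. It assumes each 256x8 EAB is single ported (one access per cycle), that the two reads can be spread across copies of the register file, and that every write must update all copies. The function name and parameters are hypothetical, chosen only for this illustration.

```python
from math import ceil

EAB_DEPTH = 256  # words per FLEX 10K EAB, as quoted in the posting
EAB_WIDTH = 8    # bits per word


def eab_register_file(regs=32, width=32, reads=2, writes=1, copies=1):
    """Return (EABs used, cycles per instruction) for a register file
    built from single-ported EABs.

    Assumptions: one access per EAB per cycle; reads are shared out
    across the copies; each write goes to every copy in the same cycle.
    """
    if regs > EAB_DEPTH:
        raise ValueError("register file is deeper than one EAB")
    eabs = copies * ceil(width / EAB_WIDTH)    # EABs side by side for the width
    cycles = ceil(reads / copies) + writes     # read cycles, then write cycles
    return eabs, cycles


print(eab_register_file(copies=1))  # one copy of the register file
print(eab_register_file(copies=2))  # two copies, reads done in parallel
```

With one copy the model gives 4 EABs and 3 cycles; with two copies it gives 8 EABs and 2 cycles, matching the figures in the text. It says nothing about achievable clock period.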
(Perhaps an Altera expert will provide more correct and up-to-date information.)

Of course, an accumulator or stack oriented instruction set architecture (with TOS in a register) could reduce the average number of EAB accesses per cycle.

EABs could certainly excel at building LARGE register files (e.g. for vector registers or multiple thread contexts or register windows), on-chip RAM, ROM, caches, TLBs, cache tag RAMs for off-chip caches, etc. Indeed an AMD 29000 style variable sized register window implementation might avoid enough memory traffic to outperform a simpler 32-register RISC with half the cycle time. Might not.

Alas, compared to distributed RAM, EABs are often too narrow (256x8 instead of 128x16) and coarse. Take a simple I-cache design. A (256 byte) 16-entry by 4-word line by 32-bit I-cache in an XC4000 is one column of 16 CLBs for a 16x24 cache tag RAM, one column for a tag comparator and other control logic, and four columns for a 4x16x32 cache data RAM. Total approximately 6x16 CLBs, 10% of a 4025E, 3% of an XC4085XL. A (512 byte) 2-way set assoc, 32-entry cache would be about 200 CLBs, still a small percentage of a large device.

Whereas the smallest such 32-bit cache you can build from EABs is 4 EABs (both tags and data in same EABs) with two cycle cache access. 4 EABs is 33% of the EAB resources in a 10K100.

Another feature XC4000 has but which FLEX10K lacks is TBUFs (3-state drivers). These are very handy for sharing one wide bus across chip.
In the old J32 design, the processor half of the XC4010 uses almost every available TBUF to drive many different results onto the "result bus", destined for write-back into the register file:
* adder/subtractor
* logic unit
* operand A << 1, << 2, << 4, >> 1, >> 2, >> 4
* data-in (byte, halfword, word)
* sign extension of word/byte data-in for lbu/lbs/lhu/lhs
* next-PC (for jal (jump-and-link)) to save the next-PC into a register
* data-out during the first cycle of store instructions (not written back)

and the 32-bit on-chip data bus half of the XC4010 uses TBUFs for:
* various peripherals and boot ROM to return read data
* driving off-chip data-in onto the on-chip bus
* bus byte-lane shifting -- for instance for "lbu r1,3(r0)" (load byte unsigned from address 3), we move data on mem.d[31:24] down to mem.d[7:0]

On the other hand, even the 10K10 provides an astonishing 3x144 FastTrack row channels, so it seems straightforward to deliver even eight or ten 32-bit possible results to multiplexors implemented in LABs. Assuming each EAB/row is responsible for 8 bits of the processor, a 10K10 might implement a splendid 16- or 24-bit RISC. Furthermore you can always implement a 32-bit processor with an 8- or 16-bit datapath, if you perform several execute cycles per instruction.

Jan Gray
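To make the byte-lane shifting in the data bus list above concrete, here is a minimal Python sketch of what the bus does for a byte or halfword load. It is a behavioural model only, not the J32 logic. The byte case follows the "lbu r1,3(r0)" example (address 3 selects mem.d[31:24]); the halfword lane mapping and the function names are assumptions extrapolated from that example.

```python
def load_byte(word, addr, signed=False):
    """Model lbu/lbs: select the byte lane addr[1:0] of a 32-bit word
    and move it down to bits 7:0, optionally sign-extending to 32 bits."""
    lane = addr & 3                      # byte address n uses bits 8n+7..8n
    byte = (word >> (8 * lane)) & 0xFF
    if signed and byte & 0x80:
        byte |= 0xFFFFFF00               # replicate the sign bit into bits 31:8
    return byte


def load_half(word, addr, signed=False):
    """Model lhu/lhs: select the halfword lane addr[1] (assumed mapping)
    and move it down to bits 15:0, optionally sign-extending."""
    lane = (addr >> 1) & 1
    half = (word >> (16 * lane)) & 0xFFFF
    if signed and half & 0x8000:
        half |= 0xFFFF0000               # replicate the sign bit into bits 31:16
    return half


word = 0xAABBCCDD                        # value on mem.d[31:0]
print(hex(load_byte(word, 3)))               # lbu from address 3: mem.d[31:24]
print(hex(load_byte(word, 3, signed=True)))  # lbs from address 3
print(hex(load_half(word, 2)))               # lhu from address 2
```

In hardware each lane choice corresponds to a separate set of 3-state drivers (or a multiplexor input) onto the shared bus; the shifts here just stand in for that selection.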
Copyright © 2000, Gray Research LLC. All rights reserved.