fpgacpu.org - Altera Flex10K CPUs

Subject: Re: FPGA based CPU ideas, and novel extensions
         => distributed RAM and Altera CPUs
Date: 14 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga

David Atkins wrote in message ...
>Any of these kicking around for Altera, if not for a good reason, ?
>Somehting of an interest but not in aposition to find the time for the
>money to get into, we use 10k10's at present and the techniques would be
>intersting, any pointer greatfully recieved.

(Disclaimer: I have studied but never used Altera devices.)

FPGA RISC CPUs, i.e. CPUs with adequate register files, can certainly be
implemented in the Altera FLEX 10K family, which has many nice features.
However, in my opinion, the Xilinx XC4000 architecture seems a better
platform (higher performance) for this application because of its
distributed RAM feature.  In particular, a simple RISC datapath benefits
from a 2-read, 1-write port register file.  In an XC4000, one can (in
theory) be built and run as fast as about 10 ns/cycle using two banks of
dual port mode distributed RAM.  [tWCTS=9.0, 8.4, 7.7 ns in XC4000XL-3, -2,
-1].  Of course, to take advantage of this 66-100 MHz operation you need the
deeply pipelined even/odd ALUs I described in another recent posting.
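
To make the two-bank idea concrete, here is a minimal behavioral sketch in
Python (not HDL).  The class and method names are hypothetical.  It assumes
each bank has one write port and one independent read port, and that every
write goes to both banks so each bank can serve one read per cycle.

```python
class TwoBankRegFile:
    """Behavioral model of a 2-read, 1-write register file built from
    two identical banks, each with one write port and one read port."""

    def __init__(self, nregs=32, width=32):
        self.mask = (1 << width) - 1
        self.bank_a = [0] * nregs  # serves read port A
        self.bank_b = [0] * nregs  # serves read port B

    def cycle(self, ra, rb, wd=None, wdata=0):
        """One clock: read two registers, then optionally write one.
        Reads return the values held before this cycle's write."""
        a = self.bank_a[ra]
        b = self.bank_b[rb]
        if wd is not None:
            value = wdata & self.mask
            # The same value is written to both banks to keep them identical.
            self.bank_a[wd] = value
            self.bank_b[wd] = value
        return a, b
```

The cost of the second read port is a full second copy of the storage,
which is cheap in distributed RAM for a 32-entry file.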

In contrast, in a FLEX 10K device, you would use EABs (the 256x8 embedded
RAM blocks).  A 32x32 2-read 1-write register file would then require 3
cycles using 4 EABs, or 2 cycles using 8 EABs (two copies of the register
file), at (in theory) 10+ ns/cycle.  [tEAWRCREG and tEARCREG=11.6, 9.5 ns in
EPF10K50V-4, -3].  (Perhaps an Altera expert will provide more correct and
up-to-date information.)  Of course, an accumulator- or stack-oriented
instruction set architecture (with TOS in a register) could reduce the
average number of EAB accesses per cycle.
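
The cycle counts above follow from simple arithmetic, sketched here in
Python.  The function name is hypothetical, and the model assumes what the
figures above imply: each EAB performs one read or one write per cycle,
reads are spread across the copies, and every write goes to all copies.

```python
import math

def regfile_cost(reads, writes, copies, width=32, eab_width=8):
    """Return (EABs, cycles per instruction) for a register file of up
    to 256 entries built from `copies` copies of single-ported 256x8 EABs."""
    eabs = copies * math.ceil(width / eab_width)
    cycles = math.ceil(reads / copies) + writes
    return eabs, cycles

T_EAB_NS = 11.6  # tEAWRCREG / tEARCREG quoted above for EPF10K50V-4

for copies in (1, 2):
    eabs, cycles = regfile_cost(reads=2, writes=1, copies=copies)
    print(f"{eabs} EABs: {cycles} cycles, about {cycles * T_EAB_NS:.1f} ns")
```

Under the same assumption, a one-address accumulator machine (1 read, 1
write) would need 2 cycles with a single copy.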

EABs could certainly excel at building LARGE register files (e.g. for vector
registers or multiple thread contexts or register windows), on-chip RAM,
ROM, caches, TLBs, cache tag RAMs for off-chip caches, etc.  Indeed, an AMD
29000-style variable-sized register window implementation might avoid enough
memory traffic to outperform a simpler 32-register RISC with half the cycle
time.  Might not.

Alas, compared to distributed RAM, EABs are often too narrow (256x8 instead
of 128x16) and coarse.  Take a simple I-cache design.  A (256 byte) 16-entry
by 4-word line by 32-bit I-cache in an XC4000 is one column of 16 CLBs for a
16x24 cache tag RAM, one column for a tag comparator and other control
logic, and four columns for a 4x16x32 cache data RAM.  Total approximately
6x16 CLBs, 10% of a 4025E, 3% of an XC4085XL.  A (512 byte) 2-way
set-associative, 32-entry cache would be about 200 CLBs, still a small
percentage of a large device.  Whereas the smallest such 32-bit cache you
can build from EABs is 4 EABs (both tags and data in same EABs) with
two-cycle cache access.  4 EABs is 33% of the EAB resources in a 10K100.
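
The percentages can be checked with a few lines of Python.  The device
totals below are not in the text above: they are assumptions taken from my
recollection of the data sheets (a 32x32 CLB array for the XC4025E, 56x56
for the XC4085XL, 12 EABs in the EPF10K100) and should be verified there.

```python
# Assumed device totals (not from the posting; check the data sheets).
CLBS = {"XC4025E": 32 * 32, "XC4085XL": 56 * 56}
EABS_10K100 = 12

def percent(used, total):
    """Share of a device resource consumed, in percent."""
    return 100.0 * used / total

# One tag column, one control column, four data columns, 16 CLBs each.
cache_clbs = 6 * 16

for part, total in CLBS.items():
    print(f"{part}: {percent(cache_clbs, total):.1f}% of CLBs")
print(f"EPF10K100: {percent(4, EABS_10K100):.0f}% of EABs")
```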

Another feature that XC4000 has but FLEX 10K lacks is TBUFs (3-state
drivers).  These are very handy for sharing one wide bus across the chip.  In
the old J32 design, the processor half of the XC4010 uses almost every
available TBUF to drive many different results onto the "result bus",
destined for write-back into the register file:
* adder/subtractor
* logic unit
* operand A << 1, << 2, << 4, >> 1, >> 2, >> 4
* data-in (byte, halfword, word)
* zero/sign extension of byte/halfword data-in for lbu/lbs/lhu/lhs
* next-PC, which jal (jump-and-link) saves into a register
* data-out during the first cycle of store instructions (not written back)

and the 32-bit on-chip data bus half of the XC4010 uses TBUFs for:
* various peripherals and boot ROM to return read data
* driving off-chip data-in onto the on-chip bus
* bus byte-lane shifting -- for instance for "lbu r1,3(r0)" (load byte
unsigned from address 3), we move data on mem.d[31:24] down to mem.d[7:0]
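
The byte-lane shifting in the last item can be sketched behaviorally in
Python.  This is an illustration only, not the J32 logic; the function
names are hypothetical, and the lane numbering is assumed from the example
above (address 3 arrives on mem.d[31:24]).

```python
def lbu(word, addr):
    """Load byte unsigned: select the byte lane named by the low two
    address bits and move it down to bits [7:0], zero-extended."""
    lane = addr & 3
    return (word >> (8 * lane)) & 0xFF

def lbs(word, addr):
    """Load byte signed: same lane shift, then sign-extend to 32 bits."""
    b = lbu(word, addr)
    return (b | 0xFFFFFF00) if (b & 0x80) else b
```

For example, lbu(0xAABBCCDD, 3) selects the byte on bits [31:24] and
returns 0xAA.  In hardware each lane would enable its own set of TBUFs onto
the low byte of the bus rather than feed a shifter.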

On the other hand, even the 10K10 provides an astonishing 3x144 FastTrack
row channels, so it seems straightforward to deliver even eight or ten
32-bit possible results to multiplexors implemented in LABs.

Assuming each EAB/row is responsible for 8 bits of the processor, a 10K10
might implement a splendid 16- or 24-bit RISC.  Furthermore you can always
implement a 32-bit processor with an 8- or 16-bit datapath, if you perform
several execute cycles per instruction.
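
As a rough illustration of that last point, here is a Python sketch of a
32-bit add performed on a narrower datapath, one slice per execute cycle,
with the carry held between cycles.  The function name is hypothetical and
the model ignores everything but the adder.

```python
def add32_sliced(a, b, width=8):
    """Add two 32-bit values on a `width`-bit datapath, one slice per
    execute cycle, carrying between slices.  Returns (sum, cycles)."""
    mask = (1 << width) - 1
    carry = 0
    result = 0
    cycles = 32 // width
    for i in range(cycles):
        slice_a = (a >> (i * width)) & mask
        slice_b = (b >> (i * width)) & mask
        s = slice_a + slice_b + carry
        result |= (s & mask) << (i * width)
        carry = s >> width  # saved in a flip-flop for the next cycle
    return result, cycles
```

With width=8 the add takes four execute cycles; with width=16 it takes two.
The final carry out is discarded, as in a plain 32-bit add.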

Jan Gray