fpgacpu.org - Superscalar FPGA CPU Design

Superscalar FPGA CPU Design

Subject: Re: small superscalar design ?
Date: 01 Nov 1995 00:00:00 GMT
newsgroups: comp.arch.fpga

In <HAMMAMI.95Oct29205005@pross113.u-aizu.ac.jp> hammami-@u-aizu.ac.jp (Omar
Hammami) writes: 

>Dear Netters:
>
>Does anyone have ever tried to implement a small superscalar
>processor design using FPGAs ? I would like to have an idea of
>the gate complexity level for say a 16 bits superscalar with
>roughly 3 FP units and 2 Int units and a limited number of
>instructions.

Great question!  It is surely possible, but it might not be the best fit of
architecture to implementation technology...

First, let's talk functional units.  Integer ALUs are easy to implement: you
simply need an adder, logic unit, and multiplexer of some sort.  Ripple
carry adders will do, given vendors' dedicated carry chain hardware.  Decent
performance FPUs would be more difficult.  An FP add/sub will require
renormalization which requires a barrel shifter (bad: lots of wires!) or
several iterative cycles of fewer bit shifts.  A multiplier of manageable
size will also take several cycles, although at 16-bit FP (1 bit sign + 5
bit exp + 10 bit mantissa?) you might only need a 10x10->20 bit multiplier.
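
To make the renormalization trade-off above concrete, here is a minimal
Python sketch.  It assumes the 10-bit mantissa guessed at above; the function
names and the max_shift_per_cycle parameter are made up for illustration and
do not describe any particular FPU.

```python
# Sketch: renormalizing a mantissa after an FP add/sub, two ways.
# Assumption: 10-bit mantissa, normalized when its top bit is set.
MANT_BITS = 10


def normalize_iterative(mant, max_shift_per_cycle=1):
    """Shift left until the top bit is set.

    Returns (normalized mantissa, total shift, cycles used).  Each loop
    iteration models one clock with a small fixed-range shifter.
    """
    if mant == 0:
        return 0, 0, 0
    top = 1 << (MANT_BITS - 1)
    shift = cycles = 0
    while not (mant & top):
        leading_zeros = MANT_BITS - mant.bit_length()
        step = min(max_shift_per_cycle, leading_zeros)
        mant <<= step
        shift += step
        cycles += 1
    return mant, shift, cycles


def barrel_shifter_stages(width=MANT_BITS):
    """Mux stages in a log-depth barrel shifter: ceil(log2(width))."""
    return (width - 1).bit_length()
```

With a 1-bit-per-cycle shifter, a mantissa with eight leading zeros takes
eight cycles; allowing up to 4 bits per cycle brings that down to two, at the
price of a wider multiplexer per bit.  The single-cycle barrel shifter needs
four mux stages across all ten bits, which is where the wires go.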

(This is an invitation for you FP in FPGA veterans to chime in with your
experiences.)

Moving on, the register file could well be your critical path.  If you hope
to sustain an average of even one and a half integer instructions per clock
you will have to fetch three or four operands per clock and retire two.  A
3-read-2-write register file in which the two write back values are retired
before you read the three new operands will take at least 2 SRAM write
cycles on embedded dual ported SRAMs and up to 4 cycles on embedded non-dual
ported synchronous SRAMs, depending upon how many copies of the register
file you keep.  (See Xilinx 4000E, Altera FLEX 10K, Actel 3200DX?.)  For
instance, on the new Xilinx 40xxE-3 parts, you are talking at least two,
three, or four ~15 ns sync-SRAM cycles, best case.  This design would hardly be
competitive with a single-issue one which could sustain one instruction per
clock at twice the clock rate.
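
The arithmetic behind that comparison can be sketched as follows.  This is a
back-of-envelope model only: it assumes the register file alone sets the clock
period, takes the ~15 ns figure and the 1.5 instructions-per-clock target from
the paragraph above, and reads "twice the clock rate" as relative to the
best-case (two SRAM cycle) superscalar clock.  The function names are
invented for the example.

```python
# Back-of-envelope throughput comparison (a sketch, not a measurement).
SRAM_CYCLE_NS = 15.0   # "~15 ns" sync-SRAM cycle quoted for the 40xxE-3
SUPERSCALAR_IPC = 1.5  # the "one and a half instructions per clock" target


def mips(ipc, clock_ns):
    """Millions of instructions per second for a given IPC and clock period."""
    return ipc * 1000.0 / clock_ns


def superscalar_mips(sram_cycles_per_clock):
    """Assumes the register file alone sets the clock period."""
    return mips(SUPERSCALAR_IPC, sram_cycles_per_clock * SRAM_CYCLE_NS)


def single_issue_mips():
    """Assumes one instruction per clock at twice the best superscalar rate."""
    best_superscalar_clock_ns = 2 * SRAM_CYCLE_NS
    return mips(1.0, best_superscalar_clock_ns / 2)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(f"superscalar, {n} SRAM cycles/clock: "
              f"{superscalar_mips(n):.1f} MIPS")
    print(f"single-issue: {single_issue_mips():.1f} MIPS")
```

Under these assumptions the superscalar design lands at about 50, 33, or 25
MIPS for two, three, or four SRAM cycles per clock, against about 67 MIPS for
the single-issue machine.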

Once again the "speed demons" whip the "brainiacs"!

And if you hope to retire 3 or 4 results (peak) per clock, your basic cycle
time is even worse.  Your only hope might be to lobby the FPGA architects
for multi-multi-ported SRAMs (2-write, 2-read quad ported synchronous SRAMs,
anyone? :-)

A VLIW-like architecture could be a better architectural choice because much
of the register file, and its multiple write back liability, could be spread
about the functional units with limited degrees of communication between
units, achieving only one register file write back operation each clock per
unit.  (I have some "paper" VLIW datapath floorplans that satisfactorily fit
in large existing FPGAs.)  For instance, you could easily distribute 64
16-bit registers and four ALUs about as 4 units each of 16x16-bit register
file+ALU, assuming instruction operand constraints that a given instruction
segment for one unit could only read operands from that unit or (in a
limited way) from adjacent units, and could only write back results to that
unit.  Then you could keep the register file result write back cycle time
down to one or two sync-SRAM or dual port SRAM write and readback cycles no
matter how wide your machine grows.
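
A toy model of that operand constraint, in Python, might look like the
following.  It is only a sketch: the class name, the slot encoding, and the
four operations are hypothetical, the units are assumed to sit in a row so
that "adjacent" means index +/- 1, and the "limited way" of reading a
neighbour is simplified to an unrestricted read.

```python
import operator

NUM_UNITS = 4        # four register-file+ALU units, as in the post
REGS_PER_UNIT = 16   # 16 x 16-bit registers per unit (64 total)
MASK = 0xFFFF        # 16-bit datapath

OPS = {"add": operator.add, "sub": operator.sub,
       "and": operator.and_, "or": operator.or_}


class ClusteredVLIW:
    def __init__(self):
        self.regs = [[0] * REGS_PER_UNIT for _ in range(NUM_UNITS)]

    @staticmethod
    def can_read(unit, src_unit):
        # Assumption: units sit in a row, so "adjacent" means index +/- 1.
        return 0 <= src_unit < NUM_UNITS and abs(unit - src_unit) <= 1

    def issue(self, slots):
        """Execute one long instruction: one slot (or None) per unit.

        A slot is (op, dst_reg, (src_unit_a, reg_a), (src_unit_b, reg_b)).
        """
        if len(slots) != NUM_UNITS:
            raise ValueError("need exactly one slot per unit")
        pending = []
        for unit, slot in enumerate(slots):
            if slot is None:
                continue
            op, dst, (ua, ra), (ub, rb) = slot
            if not (self.can_read(unit, ua) and self.can_read(unit, ub)):
                raise ValueError(f"unit {unit} cannot read a non-adjacent unit")
            value = OPS[op](self.regs[ua][ra], self.regs[ub][rb]) & MASK
            pending.append((unit, dst, value))
        # Reads above all saw the old state; now each unit does at most one
        # write, and only into its own register file.
        for unit, dst, value in pending:
            self.regs[unit][dst] = value
```

For example, unit 0 may add its own r1 to unit 1's r2 and keep the result in
its own r0, but a slot for unit 0 that names a register in unit 2 is rejected.
Each register file sees at most one write per long instruction, which is the
point of the arrangement.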

But, I'd hate to have to write your compiler's code generator.

Another comment.  Superscalar microarchitectures typically demand many
operand busses to route lots of operands and results to lots of
functional units.  Unfortunately, wires are relatively much more
expensive in FPGAs than in custom designs (where they are already
plenty expensive).  The datapath of my (now working!) 32-bit pipelined
RISC (half an XC4010) has just barely enough wiring resources to
implement a single issue microarchitecture.  Unless you are a wizard at
time multiplexing different operands and results on the same wires, say
at 10 ns intervals, without incurring killer delays, I think you would
find today's FPGAs unacceptably wire limiting.

But by all means give it a try!  Sounds like a great push-the-envelope
project.

>Any pointers on books, TRs and projects descriptions for 
>undergraduates and graduates will be appreciated.

Very highly recommended reference: Mike Johnson's book, Superscalar
Microprocessor Design, Prentice Hall, 1991, ISBN 0-13-875634-1.

"Tired: superscalar; Wired: VLIW and multithread,"
Jan Gray