Life in an FPGA


Maximum element >>
<< Regexps in FPGAs

Usenet Postings
  By Subject
  By Date

  Why FPGA CPUs?
  Homebuilt processors
  Altera, Xilinx Announce
  Soft cores
  Porting lcc
  32-bit RISC CPU
  Superscalar FPGA CPUs
  Java processors
  Forth processors
  Reimplementing Alto
  FPGA CPU Speeds
  Synthesized CPUs
  Register files
  Register files (2)
  Floating point
  Using block RAM
  Flex10K CPUs
  Flex10KE CPUs

  Multis and fast unis
  Inner loop datapaths

  SoC On-Chip Buses
  On-chip Memory
  VGA controller
  Small footprints

  CNets and Datapaths
  Generators vs. synthesis

FPGAs vs. Processors
  CPUs vs. FPGAs
  Emulating FPGAs
  FPGAs as coprocessors
  Regexps in FPGAs
  Life in an FPGA
  Maximum element

  Pushing on a rope
  Virtex speculation
  Rambus for FPGAs
  3-D rendering
  LFSR Design

Google SiteSearch
Subject: Re: [++] Fast Life code (Was:Re: FPGA-based CPUs
         (was Re: Minimal ALU instruction set))
Date: 25 May 1998 00:00:00 GMT
Newsgroups: comp.arch,comp.arch.fpga,comp.arch.embedded

Tim Olson wrote
>This algorithm looks like the one described in the Smalltalk "blue book",
>where a version of Life was implemented using BitBlt operations to
>implement the cell counting in parallel.

Another reference to the bitwise parallel approach is "Life Algorithms",
Mark Niemiec, Byte, Jan. 1979, pp 70-79.  If I recall correctly, Mark, David
Buckingham, and friends, used Waterloo's Honeywell 66/60's EIS "move
mountain" instructions to animate 64K 36-bit words per iteration.

Inspired by Buckingham and the Blue Book, I wrote a bitblt version that did
800,000 cells in 34? bitblts on a Perq in 1983? and one that did 400,000
cells/s on an 8 MHz (1 M 2-operand 32-bit op/s) 68000 in 1985.

As Messrs. Mathisen and Mogensen describe, Life should run very fast on
modern processors (superscalar and multimedia enhanced and large caches).
64-bits, in 40 insns, in perhaps 15-20 clocks, at 3 ns/clock, e.g. 1 bit/ns.

FPGA Implementation: It is straightforward to run at full memory bandwidth.
For example, given an XCS20 and a 32Kx32 PBSRAM (32-bits in or out per 15 ns
clock) we can approach 32 bits/(2*15) ns, e.g. 1 bit/ns.

Since a given line is read three times (as "below", "current", and "above"),
we buffer 2 lines of cells in RAM in the FPGA.  A 1024 x n playfield
requires 2 x 1024 bits = 64 CLBs of single port RAM, and preferably 3 x 1024
bits for 3 banks since each clock you must read from up to two lines and
write to a third.

Detailed design/floor plan.  One bit requires approx. 9 CLBs.  Assuming the
cell neighbours are (a,b,c,d,e,f,g,h), we need :-
3 CLBs RAM -- 3 32x1 RAMs (3 banks of line buffer)
6 CLBs logic --
  1 s0a=a^b^c^d;  s0e=e^f^g^h
  2 s1a="a+b+c+d == 2 or 3"; s1e="e+f+g+h == 2 or 3"
  3 s2a=a&b&c&d; s2e=e&f&g&h
  4,5 (s3,s2,s1,s0)=(s2a,s1a,s0a) + (s2e,s1e,s0e)  (uses dedicated carry
  6 new = ~s3&~s2&s1&(s0|old)
and so in a 20x20 CLB XCS20, we explicitly place 16 rows of 1x9 CLB tiles in
the left half, another 16 in the right half, leaving plenty of room to spare
 for control and address generation.

At the 1997 FPGAs for Custom Computing Machines conference., the paper "The
RAW Benchmark Suite" by Babb et al proposed a set of benchmarks for
comparing reconfigurable computing systems.  One of the 12 benchmarks was
Life, for which they reported speedups of several hundred times over a
SparcStation20+software approach, but in fairness, they write "we are not
currently implementing known improvements to the software to take advantage
of the bit-level parallelism available in a microprocessor".

Summary. Hypotheticially...
Fast microprocessor + cache: ~1 bit/ns
Single FPGA + SRAM custom machine:  ~1 bit/ns

Jan Gray

Copyright © 2000, Gray Research LLC. All rights reserved.
Last updated: Feb 03 2001