fpgacpu.org - FPGA CPU News of November 2001

Saturday, November 24, 2001
D. Sulik, Bournemouth University, et al, Design of a RISC Microcontroller Core in 48 Hours (PDF). 8-bit RISC; "48 man-hours", using Celoxica's Handel-C; targets the XS40-010XL; 338 CLBs; clocks at about 12 MHz, but each instruction requires four clocks.
Robert Ristelhueber, Electronics Buyers News: Xilinx set to sample FPGAs built on 300mm wafers.
"The company is also pushing the envelope of process technology for its products, Sevcik said, with its first 0.13-micron chip scheduled to be released around January 1."
"That device will have nine mask layers and be built using IBM's copper process. Called the Virtex-II Pro, the chip will have an embedded PowerPC processor and will run at 300MHz and higher speeds, said Sevcik."

Friday, November 23, 2001
Yanzhe Liu and Greg Kahlert, in Integrated Systems Design magazine: Driving a 32-Bit RISC Processor in an FPGA. On porting the Lexra LX4189 core to an XCV1000: about 1500 slices and 12 BRAMs.

Thursday, November 22, 2001
On the fpga-cpu list, Veronica Merryfield notes that it is important to consider such issues as cache invalidation, context switching, MMUs, and IPC, in designing a processor:
"In short, think about the kernel software and the core features together not in isolation."

Good advice! Later, I proffered these various ideas and comments on fast context switches in FPGA CPUs:
First, let us tip our caps to the designers of

the Xerox Alto [Chuck Thacker et al] -- frequent microtask switching, as often as every few cycles, doing some I/O work in the same datapath as the CPU;

the Inmos Transputer [David May et al] -- fast task switch (stack architecture with limited state to switch);

the Denelcor HEP and the Cray (nee Tera) MTA [Burton Smith et al] -- multiple thread contexts with a thread switch each cycle;

and the i960CA [Glenn Hinton et al] -- fast reg file save/reload via wide buses.

Six ideas:

Thinking about FPGA CPUs, one can, of course, use block RAMs (BRAMs on Xilinx, ESBs on Altera) as really deep register files, vector register files, windowed register files, or multi-context register files. (See Using block RAM.)
Block RAM register files tend to be slower and to have fewer ports than register files built with LUT RAM. On Altera, which lacks LUT RAMs, you might as well pursue one of these avenues, and that's just what Altera Nios does -- register windows. See also Flex10K CPUs (1997) and Flex10KE CPUs:
"EABs could certainly excel at building LARGE register files (e.g. for vector registers or multiple thread contexts or register windows) ... "

There is a duality between a windowed reg file and a multi-context reg file. If I am not mistaken, you can do limited fast context switches on a SPARC with a window rotation or two (8 global registers notwithstanding). Perhaps the same idea would work on Altera Nios.

One can also build a simple barrel processor (say 4 threads (slots) x 32 regs = 128 entries of 32 bits = 2 16-bit ports on a single 256x16 BRAM, triple cycled, or two BRAMs double cycled) and switch threads on each cycle. Then you can have a 4-deep pipeline without need for any result forwarding muxes (by the time you read an operand on thread[i], you have already retired that thread's previous result to the register file).
This seems to me to be a perfectly simple and practical basis to issue instructions faster than the ALU + result forwarding mux + operand register recurrence critical path. Unfortunately single-thread performance is not so hot, but in workloads such as "network processing", who cares?
This idea was taken to sublime levels in the 20-stage pipelined 5-threaded 1 GHz MicroUnity MediaProcessor (which would have needed some result forwarding, but not 18 stages worth).
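To make the "no forwarding muxes" claim concrete, here is a minimal Python sketch of such a barrel pipeline. The four stage names, the one-register-per-thread model, and the function name run are assumptions for illustration only, not a description of any real core. Every thread runs a chain of dependent increments and reads its operand only from the register file.

```python
# Hypothetical 4-stage pipeline: 0 fetch, 1 read operands, 2 execute, 3 write back.
READ, WRITE = 1, 3

# Register file sizing from the text: 4 threads x 32 regs x 32 bits fits a 256x16 RAM.
assert 4 * 32 * 32 == 256 * 16


def run(num_threads, instrs_per_thread):
    """Each thread executes `r1 = r1 + 1` repeatedly, with no result forwarding.

    One instruction issues per cycle; threads take turns round-robin.
    Returns the final r1 of every thread context.
    """
    regs = [0] * num_threads        # r1 of each thread context
    issued = [0] * num_threads      # instructions issued so far, per thread
    pipeline = []                   # in-flight instructions
    retired, cycle = 0, 0
    while retired < num_threads * instrs_per_thread:
        # Write-back stage: retire results into the register file.
        for ins in pipeline:
            if cycle - ins["issue"] == WRITE:
                regs[ins["thread"]] = ins["result"]
                retired += 1
        pipeline = [ins for ins in pipeline if cycle - ins["issue"] < WRITE]
        # Operand read: the only source is the register file.
        for ins in pipeline:
            if cycle - ins["issue"] == READ:
                ins["result"] = regs[ins["thread"]] + 1
        # Issue stage: the thread slot is simply the cycle count modulo N.
        t = cycle % num_threads
        if issued[t] < instrs_per_thread:
            pipeline.append({"thread": t, "issue": cycle, "result": None})
            issued[t] += 1
        cycle += 1
    return regs


barrel = run(num_threads=4, instrs_per_thread=8)   # every thread counts correctly
single = run(num_threads=1, instrs_per_thread=8)   # back-to-back issue reads stale values
```

With four threads, each thread's previous result is written back two cycles before its next operand read, so all four counters reach 8. With a single thread issuing every cycle, the same pipeline reads stale operands and undercounts, which is exactly the hazard that forwarding muxes (or interlocks) would otherwise have to cover.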

You can do the same thing (multiple context register files) with LUT RAM, of course. In fact, it is quite trivial to make the xr16 (with its PC/DMA register file LUT RAM) multi-threaded, so long as you divvy up the available 16 general purpose registers among the available threads (or make the general purpose register file larger). Just don't switch threads on interlocked instructions, such as immediate prefix, which are stateful between instructions. (Or, of course, make the imm prefix register a register file too -- not a good use of LUTs though.)
(Of course, in the context of the XSOC system, the xr16 uses this facility to do cycle-stealing DMA transfers using the xr16's PC reg file and PC incrementer.)

The old superscalar i960CA achieved a fast procedure call/ret by having a wide (128-bit) and fast bus between the 6-ported reg file and the internal RAM and register file cache. On a CALL it could save the 16 local registers in just 4 cycles. This is entirely feasible in a pedal-to-the-metal multithreaded FPGA CPU using (say) 4 BRAMs configured each as 2x128x16 (or else 2 BRAMs at 2x256x32 in Virtex-II).
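The arithmetic behind those numbers can be checked with a few lines of Python. This sketch assumes that "2x128x16" reads as ports x depth x width, and that the i960CA's local registers are 32 bits wide; the helper names are made up for the example.

```python
def bus_width_bits(brams, ports_per_bram, port_width_bits):
    """Aggregate width when every port of every BRAM transfers in parallel."""
    return brams * ports_per_bram * port_width_bits


def save_cycles(num_regs, reg_bits, bus_bits):
    """Cycles needed to move a register set over the bus (ceiling division)."""
    return -(-(num_regs * reg_bits) // bus_bits)


virtex_bus = bus_width_bits(brams=4, ports_per_bram=2, port_width_bits=16)
virtex2_bus = bus_width_bits(brams=2, ports_per_bram=2, port_width_bits=32)
cycles = save_cycles(num_regs=16, reg_bits=32, bus_bits=virtex_bus)
```

Under those assumptions both BRAM arrangements give a 128-bit path, and 16 registers x 32 bits = 512 bits move in 4 cycles, matching the figure quoted for the i960CA.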

Like the good old Transputer, you can build a stack machine backed by BRAM, so that a context switch is simply saving/restoring the stack pointer register and perhaps a very few other task/process related registers.

Finally, I have a (new?) wacky idea for doing fast context switches using a LUT RAM register file backed by BRAM. (I don't like this idea enough to actually try it, but you may find it interesting anyway.)
Assume we can't or won't use a purely BRAM-based multi-context register file because it is not as fast or as multiported as we want (esp. if we are doing a 2-issue superscalar or an LIW -- BTW I sketched a simple 2-issue 6R2W-register file LIW 7 years back -- see the latter half of Homebuilt processors). No, we must use a single-thread-context LUT RAM register file. In that case, on a context switch, we would like to save the current reg file from LUT RAM to BRAM and reload the new thread's reg file from BRAM into LUT RAM.
First I must note the idea (that follows) is a win only if the new thread reads just a subset of its registers before another context switch occurs. But that's fine. If you aren't going to switch threads very often, the amortized cost of the context switch is insignificant. If you are going to switch threads as often as every 20-100 instructions, then this idea might pay off for you.
Here's the idea. For concreteness, assume 8 threads of 32 registers, with a 32x32 LUT RAM reg file and a 256x32 BRAM-based 8-thread-context backing store. Build 32 "valid register" flip-flops. On a context switch, these can be reset in one cycle. For each read port into the reg file, build a 32-1 mux to fetch that port's register's valid bit. For each write port into the reg file, allocate a corresponding write port into the BRAM. As each instruction result is retired into the LUT RAM, it is also retired into the BRAM.
After a context switch, all valid register flops are cleared. Then on an instruction like "add r3,r1,r2", we'll find that r1 and r2 are not yet valid (present) in LUT RAM and stall and fetch them from the BRAM-based multi-context reg file backing store. This may well take a cycle or two per register "read miss" (perhaps fewer if you do heroic things with double-cycled LUT RAMs and multiple BRAM ports).
(Again, remembering the duality of thread context switch and function call, I note that this same mechanism can be used to do very fast function call/return -- on each CALL or RET update the block RAM register window address counter and clear all the register file valid bits. This provides all the fast call/return benefits of deep register windows plus the benefits of a fast small register file. You never need to save registers in a function prolog (because they're always concurrently retired into the backing store), and you never need to reload registers in an epilog (they'll be reloaded on demand in the return site continuation). Again, in typical C/C++/Java code, with a lot of function calls, "much of the time" (hand waving) you typically don't read more than a quarter of the registers in the register file before making another call or return.)
This idea is an example of the hybrid LUT RAM + BRAM idea I mention in the aforementioned Using block RAM article/disclosure:
"... hybrid uses of large embedded RAM blocks together with smaller distributed RAM blocks to achieve large storage capacity with highly multiported access to a subset of that storage".

(By the way: the above valid bit per entry discrete FF + mux trick can also be used to flash invalidate a small (e.g. LUT-RAM-based) cache tag array.)
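The valid-bit scheme in the last idea above can be modeled behaviorally in a few lines of Python. This is only a sketch under stated assumptions: the class name LazyRegFile, the one-cycle miss penalty, and the stall counter are illustrative choices, and nothing here models ports or timing beyond counting read misses.

```python
class LazyRegFile:
    """Single-context LUT RAM register file backed by a multi-context BRAM."""

    def __init__(self, threads=8, regs=32, miss_penalty=1):
        self.regs = regs
        self.lut = [0] * regs                 # fast 32x32 LUT RAM (one context)
        self.valid = [False] * regs           # the "valid register" flip-flops
        self.bram = [0] * (threads * regs)    # 256x32 backing store for 8 x 32
        self.thread = 0
        self.miss_penalty = miss_penalty      # assumed stall per read miss
        self.stall_cycles = 0

    def context_switch(self, thread):
        """Select a new context and flash-clear all valid bits."""
        self.thread = thread
        self.valid = [False] * self.regs

    def read(self, r):
        """Read a register, reloading it from BRAM on a miss."""
        if not self.valid[r]:
            self.lut[r] = self.bram[self.thread * self.regs + r]
            self.valid[r] = True
            self.stall_cycles += self.miss_penalty
        return self.lut[r]

    def write(self, r, value):
        """Retire a result into LUT RAM and, concurrently, into BRAM."""
        self.lut[r] = value
        self.valid[r] = True
        self.bram[self.thread * self.regs + r] = value


rf = LazyRegFile()
rf.write(1, 10)
rf.write(2, 20)
rf.write(3, rf.read(1) + rf.read(2))   # add r3,r1,r2: both operands hit
rf.context_switch(1)
rf.write(1, 7)                         # thread 1 has its own r1
rf.context_switch(0)
r3_after_switch = rf.read(3)           # miss: reloaded on demand from BRAM
```

Because every write also goes to the backing store, a context switch never has to save anything; the only cost is the read misses that follow, and only for registers the new thread actually touches. Using the same object for function call/return would just mean treating the thread index as a register window counter.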

Wednesday, November 21, 2001
Anthony Cataldo, EE Times: Self-generating processors advance.
Loring Wirbel, EE Times: Startup revisits reconfigurable computing. StarBridge Systems Inc.
But see also Supercomputers. The advent of the XC2S300E and the collapse of DRAM prices mean the component costs of that hypothetical machine have probably fallen by a factor of ten in just three years -- even as its hypothetical clock frequency doubles.
Microtronix Embedded Linux Development Kit (PDF) for the Altera Nios soft CPU core. (Based upon Lineo's uClinux.)
Mike Esch, Microtronix, in Embedded Linux Journal: From Core to Kit.

Tuesday, November 20, 2001
Spartan-IIE
Xilinx announces the 0.18 micron 1.8V Spartan-IIE. Data sheets (Please, Xilinx, also give us the option of a single PDF.) FAQ. You might think that
as Virtex-E is to Virtex, so is Spartan-IIE to Spartan-II
But you would be wrong. According to data sheets, whereas an XCV200 has 14 BRAMs (56 Kb) and the XCV200E has 28 BRAMs (112 Kb), in the Spartan-II/E family, both the XC2S200 and (alas) the XC2S200E have the same 14 BRAMs (56 Kb).
If your work is "BRAM bound", as is my multiprocessor research, this is a disappointment.
Anthony Cataldo, EE Times: Xilinx spins cost-reduced FPGA for digital video.
'The company said stripping away some of the RAM is a safe bet. "We're finding that even in Spartan 2, designers are not using all the block memory that's there," said Steve Sharp, senior manager of silicon solutions marketing.'

But let us count our blessings. The new Spartan-IIE family is lower-voltage, faster (470 ps T_ILO (2SxxxE-6) vs. 700 ps T_ILO (2Sxxx-5)), offers a larger part (the 32x48 CLB = 6144 logic cell XC2S300E), supports tons of different I/O signalling standards, and (thank you, Xilinx) comes in TQ144 and PQ208 QFP packages.
[updated 11/24/01]
Crista Souza, Electronics Buyers News: Xilinx's new FPGAs aimed at consumers.
[updated 11/27/01]
Murray Disman, ChipCenter: Xilinx Introduces Spartan-IIE.
Return address linkage
Goran Bilski, the designer of the Xilinx MicroBlaze soft CPU core, comments on the benefits of keeping return addresses in general purpose registers in this fpga-cpu list thread. My two cents.

Thursday, November 15, 2001
Happy 30th birthday to the Intel 4004 (ref).
"Introduction date: November 15, 1971
Clock speed: 108 kilohertz
0.06 MIPS
Number of transistors: 2,300 (10 microns)
Bus width: 4 bits
Addressable memory: 640 bytes"
Ah, the good old days.
Gordon Moore (1965): Cramming more components onto integrated circuits.
"Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000."
Andy Shaw: Intel 4004 History: A Rashomon Story.
Slashdot thread. More stuff.
[updated 12/19/01]
Ron Wilson, EE Times: Inventor recalls birth of the MPU.
"... Remember that I was working with a very small number of gates. I ran a little calculation the other day, and in today's processes a 4004 would fit under a bonding pad."

Sunday, November 11, 2001
From next week's SC2001 session on reconfigurable architectures, David Caliga (SRC Computers, Inc.) and David Peter Barker (SUPERsmith): Delivering Acceleration: The Potential for Increased HPC Application Performance Using Reconfigurable Logic.
"The performance of the algorithm was compared with that of a 1700-MHz Pentium 4 microprocessor. The MAP functions units and logic were operating at 100 MHz. ..."
"Given the percentage of time spent in the routine, 99.5%, the application speedup achieved was over 50x. The price performance of the MAP exceeded that of the microprocessor on this problem by a factor greater than 12.5x."

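As a rough sanity check on the quoted numbers, one can apply Amdahl's law. This is my own back-of-the-envelope reading, not something the paper states, and the function names are made up for the example. With 99.5% of the run time in the accelerated routine, the sketch computes the ceiling on application speedup and the routine speedup needed to reach 50x overall.

```python
def overall_speedup(fraction, routine_speedup):
    """Amdahl's law: whole-application speedup when `fraction` of the
    original run time is accelerated by `routine_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / routine_speedup)


def routine_speedup_needed(fraction, overall):
    """Invert Amdahl's law for the routine speedup that yields `overall`."""
    return fraction / (1.0 / overall - (1.0 - fraction))


ceiling = 1.0 / (1.0 - 0.995)                   # limit with an infinitely fast routine
needed = routine_speedup_needed(0.995, 50.0)    # routine speedup for 50x overall
check = overall_speedup(0.995, needed)          # round trip, about 50
```

Under that reading, the application speedup can never exceed about 200x, and a 50x application speedup implies the routine itself ran roughly 66x faster than on the Pentium 4.
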
Friday, November 9, 2001
Dave Conroy's PDP-8/X is now joined by his new PDP-4/X (in an XC4010E).

Tuesday, November 6, 2001
Henry Fu, John W. Lockwood, Washington University: The FPX KCPSM Module: An Embedded, Reconfigurable Active Processing Module for the Field Programmable Port Extender (FPX) (PDF).
"This module loads the program memory of the KCPSM from an incoming UDP packet, and executes the new program upon receiving a new incoming UDP packet."
KCPSM (our coverage) -- Ken Chapman's Programmable State Machine.
The FPX: Field-programmable Port Extender.
Fall 2001 Gigabit Workshop Tutorial -- ROT13 in the network!
CS/CoE 535: Acceleration of Networking Algorithms in Reconfigurable Hardware.
Cool!

Sunday, November 4, 2001
Peter Alfke, Xilinx: Evolution and Revolution: Recent Progress in Field-Programmable Logic. "FPGAs have truly evolved from glue logic to cost-effective system platforms."
Slashdot: Low-cost Reconfigurable Computing (FPGA's).

Friday, November 2, 2001
Tom Cantrell, Circuit Cellar Online: Core War. An insightful introduction to MicroBlaze and its instruction set architecture. It looks like a nice, clean, and simple ISA after my own heart.
Once again Tom beautifully frames the business model considerations of FPGA CPU IP from FPGA vendors vs. IP from third parties:
"It all boils down to the fact that burying the IP price in a chip is the most streamlined way to accomplish the transaction. In a world wracked by Napster-like IP angst, the bit of plastic and silicon we call a chip is (just like plastic and paper that go into an audio CD) a handy place to hang the price tag. In essence, it's a royalty scheme without all the handwringing about opening the books, audits, dongles, and the like."
See also IP business models.
Peter Clarke, EE Times: Student's ARM7 clone disappears from Web. Related coverage.

FPGA CPU News, Vol. 2, No. 11
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.