FPGA CPU News of February 2001


Wednesday, February 28, 2001
So that's what a magnitude 6.8 earthquake (40 miles away) feels like.

Tuesday, February 27, 2001
Cary Snyder, Microprocessor Report (Embedded Processor Watch): Xilinx's A-to-Z System Platform.

Monday, February 26, 2001
Impressions of Mercury
More on the new 1.8V Altera Mercury family, based upon my impressions of information in the data sheet.

First let's compare this family, which offers 4800-14400 LEs, with APEX 20KE devices offering 1200-51000 LEs. My take is that Mercury should be regarded as complementary to other Altera families. Or is it a peek at the shape of things to come? It's interesting that the Mercury family does not seem to use the MegaLAB architecture of the 20K family (per se). For example, the ESBs are schematically located at the top and bottom of the device rather than distributed about the fabric.

Compared with earlier Altera families, Mercury seems to have an additional category of fast horizontal inter-LAB interconnect called RapidLAB. Data sheet:

"The local interconnect can drive LEs within the same LAB or adjacent LABs. This feature minimizes use of the row and column interconnects, providing higher performance and flexibility. Each LAB structure can drive 30 LEs through fast local interconnects."
I am not surprised. As proposed FPGA architectures are increasingly validated and tuned by recompiling existing customer designs against them, perhaps over time all of the various FPGA architectures will evolve into the same basic topology, varying only in the way the hierarchies of local, intermediate, and global interconnect are "chunked". So for example, we'll have Virtex-II with 8 LUTs per CLB competing with 10K/20K/Mercury at 10 LEs per LAB. And now we'll have a more "local" extra-LAB interconnect in Mercury.

On another tangent, I wonder if Altera's "Advanced Redundancy" yield enhancement technology (press release) is applicable to these new RapidLAB and Leap Line local inter-LAB interconnects.

Arithmetic
Arithmetic now uses an interesting carry-select lookahead architecture, which treats each 4-LUT as 4 2-LUTs. I suppose I still prefer the Virtex architecture, which makes it possible to implement

  o[i] = addsub ? (addf1 ? a[i]+b[i] : a[i]-b[i])
                : (addf1 ? f1(a[i],b[i]) : f2(a[i],b[i]))
in one 4-LUT per bit.

The Mercury multipliers are not separate dedicated multipliers (as in Virtex-II) but rather are built from programmable logic LEs, aided by a dedicated multiplier mode for forming partial products and adding them together (binary tree adder).

Quad port ESB RAM
Figure 17, the "ESB Quad-Port Block Diagram", diagrams a 4-port RAM with two read and two write ports, and indeed with four sets of address lines and control lines.

In this mode, the ports can only be up to x16. Nonetheless if this mode is fast, it could make a nice building block for a two-issue superscalar RISC, using two ESBs to provide a 2-write 4-read register file. (One challenge in implementing such architectures is retiring two results per cycle.)
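To make the register file idea concrete, here is a minimal behavioral sketch in Python rather than HDL. It assumes each quad-port ESB acts as an idealized memory with two read ports and two write ports, and that a read returns the old contents when the same address is written in the same cycle. The data sheet does not specify that behavior, so it is purely an assumption, and the class names QuadPortRam and RegFile2W4R are hypothetical.

```python
class QuadPortRam:
    """Idealized quad-port RAM: two read ports and two write ports per cycle."""

    def __init__(self, depth=16):
        self.mem = [0] * depth

    def cycle(self, reads, writes):
        # reads: up to two addresses; writes: up to two (addr, data) pairs.
        assert len(reads) <= 2 and len(writes) <= 2
        # Assumption: reads return the contents from before this cycle's writes.
        out = [self.mem[a] for a in reads]
        for addr, data in writes:
            self.mem[addr] = data
        return out


class RegFile2W4R:
    """2-write 4-read register file built from two mirrored quad-port RAMs."""

    def __init__(self, nregs=16):
        self.banks = [QuadPortRam(nregs), QuadPortRam(nregs)]

    def cycle(self, reads, writes):
        assert len(reads) <= 4
        # Both banks receive identical writes, so they always hold the same data.
        low = self.banks[0].cycle(reads[:2], writes)
        high = self.banks[1].cycle(reads[2:], writes)
        return low + high


rf = RegFile2W4R()
rf.cycle([], [(1, 10), (2, 20)])               # retire two results in one cycle
rf.cycle([], [(3, 30), (4, 40)])
operands = rf.cycle([1, 2, 3, 4], [(1, 99)])   # two instructions read four operands
```

Mirroring the writes into both RAMs is what turns 2 x (2 read + 2 write) ports into 4 read + 2 write ports; the cost is that the register file is stored twice.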

Compare this to Virtex/II, where you can write two locations per cycle but only read back what you wrote -- or optionally in -II, what used to be at those overwritten locations. In contrast, with Mercury ESBs you seem to be able to read back two other arbitrary locations.

Just as with Virtex, the write port can use a different data width than the read port. This is essential for providing support for SERDES -- receive and deserialize a high speed port into a RAM-based FIFO, process the data at lower frequency but wider width, then deposit results into another variable-width FIFO to serialize and transmit it.
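The deserialize/process/serialize flow described above can be sketched with a width-converting FIFO. This is a behavioral Python model, not a description of how an ESB is actually configured; the class name WidthConvertingFifo and the bit ordering (oldest bit is least significant) are assumptions made for illustration.

```python
from collections import deque


class WidthConvertingFifo:
    """FIFO written write_width bits at a time and read read_width bits at a time."""

    def __init__(self, write_width, read_width):
        self.write_width = write_width
        self.read_width = read_width
        self.bits = deque()  # individual bits, oldest first

    def write(self, value):
        # Assumption: the least significant bit of each word is the oldest bit.
        for i in range(self.write_width):
            self.bits.append((value >> i) & 1)

    def can_read(self):
        return len(self.bits) >= self.read_width

    def read(self):
        assert self.can_read()
        value = 0
        for i in range(self.read_width):
            value |= self.bits.popleft() << i
        return value


# Deserialize: a fast 1-bit receive side fills the FIFO, a slower 16-bit core drains it.
rx = WidthConvertingFifo(write_width=1, read_width=16)
for i in range(16):
    rx.write((0xBEEF >> i) & 1)
word = rx.read()

# Serialize: the core deposits 16-bit results, the transmit side drains 1 bit at a time.
tx = WidthConvertingFifo(write_width=16, read_width=1)
tx.write(word)
sent = [tx.read() for _ in range(16)]
```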

Unfortunately, the Mercury data sheet does not seem to specify the effect of simultaneously reading/writing a memory cell on more than one port.

It is also notable that the ESBs provide a TurboBit to run less speed-critical RAMs at lower speed and lower power.

Data sheet:

"ESBs can implement synchronous RAM, which is easier to use than asynchronous RAM. A circuit using asynchronous RAM must generate the RAM write enable (WE) signal while ensuring that its data and address signals meet setup and hold time specifications relative to the WE signal."
Hmm. Is this marketing? If so, I don't think it's effective. As I recall, the 1995 XC4000E ended the XC4000 era of WE glitch generators.

Soliciting guest commentary
Although it doesn't happen much, we're open to publishing relevant and interesting guest commentary here. For example, if you work at Altera and wish to comment or elaborate upon this commentary, or fill in the details, please drop me a line or send your comments to the fpga-cpu list.

Another perspective
Murray Disman, ChipCenter: Altera Ships Mercury Family.

Wednesday, February 21, 2001
Altera Mercury: press release, introduction, data sheet. This appears to be Altera's response to Virtex-II/Pro, and it has some interesting features. Besides adding 8-18 1.25 Gb/s channels, Mercury provides up to 100 8x8 "Distributed Multipliers". But perhaps of most utility to FPGA CPU designers:
"Altera's Mercury devices include embedded memory via new quad-port embedded system blocks (ESBs) each of which contains 4,096 programmable bits and can support up to four independent operations at once."

David Lammers, EE Times: Altera chips join PLD with gigabit transceiver.

Sunday, February 18, 2001
In the Teaching dept., here's a fun write-up from 1998 of four New Mexico Tech CS331 students' experiences building FPGA CPUs: Valerie Henson: How the Puerco was born.

This is so good and echoes so many of our themes that I can't help but quote three whole paragraphs:

"The hardest part was making our CPU fit on the chip. Our first synthesization produced something that used up %400 of our CLB's (combinational logic blocks). We minimized and minimized and threw out every extraneous bit of logic and went from 32 bits to 9 bits, the minimum needed to hold our opcode and any useful number of addresses. Once we got the 9-bit CPU to fit, we began discovering all sorts of sneaky ways to avoid using CLB's. Eventually, we got the whole pipelined 32 bit CPU on the chip, which made us ecstatically happy. The whole experience was good preparation for the frustration that must be found in, say, tax law, or counting poppy seeds."

"From Puerco Jr. to Puerco. The entire time we working on the Puerco Jr., our non pipelined CPU, we were dreading the Puerco. Pipelining sounds hard. It must be more difficult to design a processor that runs three instructions at the same time than one."

"Then we actually started designing it. It was easy! Basically, it did the same thing every cycle instead of different things each cycle. The way we handled invalid instructions was to simply add an invalid bit to each stage. If the instruction was valid, the results got written out to memory or latched into a register. If it wasn't valid, the results of that stage just got ignored. Piece of cake."
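The valid-bit trick in that last paragraph can be illustrated with a toy model. The following Python sketch is not the Puerco design: the three stages (fetch, execute, writeback), the tiny instruction set (li, add, jmp), and the function name run are all hypothetical. It only shows every stage doing the same thing each cycle, with a valid bit deciding whether a result is kept or ignored.

```python
def run(program, cycles):
    """Toy 3-stage pipeline (fetch, execute, writeback) with a valid bit per stage."""
    regs = [0] * 4
    pc = 0
    fetched = (None, False)     # latch: (instruction, valid)
    executed = (0, 0, False)    # latch: (dest register, value, valid)

    for _ in range(cycles):
        # Writeback: the result is committed only if its valid bit is set.
        dest, value, valid = executed
        if valid:
            regs[dest] = value

        # Execute: invalid instructions still flow through, but produce nothing.
        instr, valid = fetched
        executed = (0, 0, False)
        next_pc = pc + 1
        squash = False
        if valid:
            op = instr[0]
            if op == "li":
                executed = (instr[1], instr[2], True)
            elif op == "add":
                executed = (instr[1], regs[instr[2]] + regs[instr[3]], True)
            elif op == "jmp":
                next_pc = instr[1]
                squash = True   # the instruction being fetched now is wrong-path

        # Fetch: always fetches; the valid bit records whether to believe it.
        in_range = pc < len(program)
        fetched = (program[pc] if in_range else None, in_range and not squash)
        pc = next_pc

    return regs


program = [
    ("li", 0, 5),
    ("li", 1, 7),
    ("jmp", 4),
    ("li", 0, 99),      # fetched behind the jump, then ignored
    ("add", 2, 0, 1),
]
regs = run(program, cycles=8)
```

Because writeback is modeled before execute within a cycle, a result written in one cycle is visible to the instruction executing in that same cycle. That is a simplification of this sketch, not a claim about the students' hardware.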

(See also Puerco links from Ben Sittler's old project page.)

Excellent! Clearly these former undergraduates learned some things that are not taught in any textbook nor observed in any simulator.

(But note also the "average of 20 hours per person per week" times four students. Ouch, that's a tough workload. In my teaching paper I argue that a well designed course can provide students a working framework so that the total workload need not be so oppressive.)

[update 02/19/01] Henson:

"The actual design took us only a couple of weeks, figuring out how to just plain use the software took us much longer. ... The most useful way to decrease the coursework load would be a short training session with the software, better documented software, and better software, period. Designing the state machine and physical layout of the CPU in each group individually was definitely worth the time. Since we had to use what we designed, we put a lot more thought and effort into making our designs simple and robust. The rest of the semester was spent cussing at the software. :)"

Yet more FPGAs in EE Times
Bernard Cole, EE Times: Programmable-chip methods get fresh look. A survey.

Anna S. Chiang, Altera, in EE Times: Programming enters designer's core. An interesting, exhaustive list of requirements for a development platform for an embedded processor with programmable logic, as exemplified by the Altera Excalibur program, including its Quartus II with SoPC Builder.

Wednesday, February 14, 2001
Will Wade, EE Times: Chameleon eyes ASIC segments with reconfigurable chip.
"The CS2112 features 108 arithmetic processing units. Each of those 32-bit processing cores runs at 125 MHz. Their combined power, Fox said, is comparable to that of a Pentium-class processor running at frequencies higher than 12 GHz."
Chameleon Systems.

Saturday, February 10, 2001
The Xilinx XtremeDSP/Virtex-II Simulcast is now available as a series of web cast sessions with accompanying PDF slide sets. Registration required. My earlier comments.

I highly recommend viewing Erich Goetting's presentation on Virtex-II. There's some great motivation on features like the XCITE controlled impedance technology, the IP Immersion architecture, and so forth. After registering with Xilinx, you can also download the corresponding slide set, named module7.pdf.

The shape of things to come
In particular, I would like to bring to your attention slide 83 of 88. It shows a diagram of a huge FPGA with four embedded PowerPC cores and what appear to be 12 (top edge) + 12 (bottom edge) Conexant 3.125 Gb/s serial link cores -- which would be about 75 Gb/s/chip of link bandwidth.

(Memory) surpluses out to 2010
Zooming way in, this diagram appears to depict a monster 136x104 CLB part, which would be well over 100,000 logic cells, with 18 columns of block RAMs and multipliers (apparently 556 in all), or about six columns of CLBs per block RAM column. Stupendous!

I had been disappointed that the larger Virtex-II devices seemed relatively block-RAM-port poor. For example, the Virtex-II data sheet states the 120-CLB-column '2V10000 will have only 6 columns of block RAMs, or on average, only one column of block RAMs per 20 columns of CLBs.

But at 6 CLB columns per block RAM column, this monster FPGA assuages this concern -- and is quite reminiscent of the generously-RAM-endowed Virtex-EM family.

Indeed, the apparent 556 18 Kb block RAMs would total about 1.2 MB of block RAM, and might offer a total bandwidth of about 556 * 2 * 36 b at (say) 200 MHz = 8 Tb/s!
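The back-of-envelope figures in this section can be reproduced with a few lines of Python. All inputs are the numbers read off the slide as quoted above; the 8 LUTs per CLB figure is the Virtex-II number mentioned earlier on this page, the 200 MHz clock is the same guess used in the text, and a megabyte is taken to be 2^20 bytes.

```python
# Figures read off the slide, as quoted in the text.
CLB_ROWS, CLB_COLS = 136, 104
LUTS_PER_CLB = 8                  # Virtex-II: 8 LUTs per CLB
BLOCK_RAM_COLS = 18
BLOCK_RAMS = 556
BITS_PER_BLOCK_RAM = 18 * 1024    # 18 Kb each
PORTS_PER_BLOCK_RAM = 2
PORT_WIDTH_BITS = 36
CLOCK_HZ = 200e6                  # the "(say) 200 MHz" guess

luts = CLB_ROWS * CLB_COLS * LUTS_PER_CLB
clb_cols_per_ram_col = CLB_COLS / BLOCK_RAM_COLS
ram_mbytes = BLOCK_RAMS * BITS_PER_BLOCK_RAM / 8 / 2**20
ram_bandwidth_tbps = (
    BLOCK_RAMS * PORTS_PER_BLOCK_RAM * PORT_WIDTH_BITS * CLOCK_HZ / 1e12
)

print(f"{luts} LUTs, {clb_cols_per_ram_col:.1f} CLB columns per block RAM column")
print(f"{ram_mbytes:.2f} MB of block RAM, {ram_bandwidth_tbps:.1f} Tb/s")
```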

Immersed IP footprints
Each PowerPC core appears to displace 4 rows by 2 columns of block RAMs, plus apparently 16 rows by 2+6+2 columns of CLBs.

Each serial link core appears to displace one single block RAM and hardware multiplier block.

500 soft CPUs per chip?
At apparently 4 rows by 6 columns of CLBs per block RAM, this could be just about the perfect pitch for those of us with delusions of chip multiprocessors, since our area-optimized 16-bit CPU core (which uses one block RAM) should use 4 rows by 6-7 columns of CLBs in Virtex-II.

In such a device without IP immersion, this would yield about 34 rows by 16 columns of processors = 544 16-bit CPUs per die. Subtracting the areas partially covered by serial-link or processor hard cores (spanning apparently 13 of "my soft processor core tiles" per quadrant) would leave about 492 simple 16-bit CPUs per monster FPGA...
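The tiling arithmetic, spelled out as a small Python sketch. The 6.5-column tile width is an assumed midpoint of the "6-7 columns" estimate, chosen because it reproduces the 16 processor columns quoted above; the other inputs come straight from the text.

```python
CLB_ROWS, CLB_COLS = 136, 104
TILE_ROWS = 4
TILE_COLS = 6.5                  # assumed midpoint of "6-7 columns"
QUADRANTS = 4
TILES_LOST_PER_QUADRANT = 13     # tiles covered by hard cores, per the text

cpu_rows = CLB_ROWS // TILE_ROWS
cpu_cols = int(CLB_COLS // TILE_COLS)
cpus_without_hard_ip = cpu_rows * cpu_cols
cpus_with_hard_ip = cpus_without_hard_ip - QUADRANTS * TILES_LOST_PER_QUADRANT

print(cpu_rows, cpu_cols, cpus_without_hard_ip, cpus_with_hard_ip)
```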

Since the area of a Virtex-II-optimized compact simple 32-bit RISC core will be about 8 rows by 6-8 columns of CLBs, and since each PowerPC core seems to displace twice that many CLBs, we obtain this counterintuitive rule of thumb: one streamlined 32-bit soft CPU core optimized for programmable logic might need only half the silicon area of an elaborate 32-bit hard CPU core!

It's not an apples-to-apples comparison -- the PowerPC hard core runs much faster, has much more cache memory and many additional instructions and features -- but it does kind of turn conventional wisdom on its ear!

Put on your thinking caps
Whether this diagram depicts a hypothetical planned device, a trial balloon, a clever misdirection, or something else, does not matter. It is clear that this shows some flavor of the shape of things to come. It's time to start thinking imaginatively about how to best use such a monster -- not to mention a rack full of them. In some ways, a 500 CPU MIMD or a 1000 CPU SIMD per chip, or even a 100-trillion-instructions-per-second 1,000,000 CPU MIMD (20 boards, 100 chips per board, 500 CPUs per chip, 100 MIPS per CPU) is just about the least imaginative use possible.
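For the record, the rack-scale figure works out as follows. This Python snippet uses only the round numbers from the parenthetical above.

```python
BOARDS = 20
CHIPS_PER_BOARD = 100
CPUS_PER_CHIP = 500
MIPS_PER_CPU = 100

cpus = BOARDS * CHIPS_PER_BOARD * CPUS_PER_CHIP
instructions_per_second = cpus * MIPS_PER_CPU * 10**6

print(f"{cpus:,} CPUs, {instructions_per_second / 1e12:.0f} trillion instructions/s")
```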

Mind boggling stuff.

Thursday, February 1, 2001
Peter Clarke, EE Times: Momentum builds for open-source hardware. More coverage of LEON SPARC and OpenCores.org.

Sez you dept.
Steven Fyffe, Electronic News (on EDN): Altera, Xilinx Vie To Claim Fastest Development Software. Apparently, Brand A is at least twice as fast as Brand X. And vice versa!

FPGA CPU News, Vol. 2, No. 2
Back issues: Vol. 2 (2001): Jan; Vol. 1 (2000): Apr Aug Sep Oct Nov Dec.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.


Copyright © 2000-2002, Gray Research LLC. All rights reserved.
Last updated: Mar 04 2001