FPGA CPU News of April 2002


Thursday, April 18, 2002
Marc Boulé and Zeljko Zilic, McGill: An FPGA Based Move Generator for the Game of Chess.

Here's an interesting comp.arch thread on FPGA CPUs from 1994.

Wednesday, April 17, 2002
Ken Chapman's new Xilinx techXclusive series, part one: Creating Embedded Microcontrollers (Programmable State Machines).
"I hope you will be inspired to create your own application-specific PSM processors as well as find new applications for existing PSM macros."
Earlier KCPSM coverage.

Programmable World 2002 observations
I attended the Bellevue, WA satellite downlink. I found this to be a somewhat disappointing seminar. Some sessions were not technical enough, while others were curiously not oriented to FPGA designers. (See also Greg Neff's review.)

The first keynote was a rah-rah "internet exponential growth" talk that might as well have been given two years ago, before the bust, and even back then would have left us engineers hungry for some scraps of hard technical content.

Nick Tredennick on the other hand was thought provoking (as usual).

Once again, the highlight was the presentation by Erich Goetting on Virtex-II Pro. It was somewhat of a repeat of earlier stunners but still there were several interesting disclosures that I haven't seen elsewhere. (If you haven't already pored over the Virtex-II Pro Handbook, go off and do so, and then come back here when you have more context to interpret the following trivia.)

(The following points are transcribed from hastily typed notes I took during the talk -- please send corrections if necessary.)

  • The 1.5V V2Pro is designed for 130 nm, 9 layer metal, all Cu, low-K dielectric, with 92 nm gate lengths, and 22 angstrom gate oxide thickness.

  • The RocketIO MGTs (multi-gigabit serial transceivers) contain hard logic equivalent to 50K "ASIC gates" plus additional analog circuitry.

  • Since the 300 MHz, 456 D-MIPS, PowerPC 405 core uses just 0.9 mW/MHz (0.59 mW/D-MIPS), it's a "low Power PC", if you will.

    Indeed, Goetting humorously compared the 456 D-MIPS embedded PPC to an "acre full" of (1 D-MIPS) DEC VAX-11/780s. Later he compared the 100 mW used by a typical LED with the 169 D-MIPS of 405 computation that would use the same amount of power, and compared that in turn to the 157 D-MIPS of horsepower of the original Cray-1 (an apples-to-oranges comparison if I've ever heard one).

    Each instance (including the CPU, 16 KB I- and D-caches, MMU, etc.) occupies 3.8 mm², or 2% of a 2VP50 (and displaces approximately 1000 LUTs), so I suppose we can infer that a 2VP50 die is approximately 200 mm².

  • Xilinx will provide a Data2BlockRAM tool to insert compiled code and other data into the block RAM initialization bitstream.

    (It was unclear to me whether you can also arrange to pre-initialize the 405's I-cache and D-caches via the configuration bitstream, for those applications content to boot out of, or run entirely out of, the core's caches.)

  • There will be a System Generator for PowerPC later this year.
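
The power and die-area figures in the notes above hang together arithmetically. Here is a small Python sketch that rederives them; it uses only the numbers quoted from the talk, and the variable names are mine.

```python
# Figures as quoted in the talk notes above.
CLOCK_MHZ = 300.0
DMIPS = 456.0
MW_PER_MHZ = 0.9
LED_POWER_MW = 100.0
CORE_AREA_MM2 = 3.8
CORE_FRACTION_OF_2VP50 = 0.02  # "2% of 2VP50", presumably a rounded figure

core_power_mw = MW_PER_MHZ * CLOCK_MHZ                 # power at full clock rate
mw_per_dmips = core_power_mw / DMIPS                   # energy cost per D-MIPS
dmips_per_led = LED_POWER_MW / mw_per_dmips            # D-MIPS on one LED's budget
die_area_mm2 = CORE_AREA_MM2 / CORE_FRACTION_OF_2VP50  # implied 2VP50 die area

print(f"core power    : {core_power_mw:.0f} mW")
print(f"mW per D-MIPS : {mw_per_dmips:.2f}")
print(f"D-MIPS per LED: {dmips_per_led:.0f}")
print(f"2VP50 die area: {die_area_mm2:.0f} mm^2")
```

Since the 2% figure is surely rounded, the inferred die area is only good to perhaps one significant digit, which is consistent with "approximately 200 mm²".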

I was surprised that the PowerPC talk that followed was so focused on IBM Microelectronics' ASIC/ASSP/CSSP PowerPC business without spending that much time exploring the new opportunities inherent in the Xilinx PowerPC + FPGA platform.

I'd wager that over the next few years, IBM will gather considerably more CoreConnect licensees, partners, tools, and more CoreConnect reusable IP, based upon this Xilinx alliance, than they've seen in the whole history of embedded PowerPC ASIC products.

Conversely, I was left with the impression that John Fogelin of Wind River Systems really appreciated the potential of these new system platform FPGAs and also recognized the new challenges facing engineers. I thought he did a good job laying out Wind River's value proposition to FPGA SoC designers.

More on RocketIO MGT links
One of the talks I attended was on using the MGT links for building serial backplanes. Here the recommendation was to use a proven soft core to simplify interfacing to the rich and nontrivial MGT hard core. For example, you might use the forthcoming Aurora protocol interface (350 slices) to handle packet framing, data alignment, etc. (The Aurora part of the presentation was rather light on details -- specific control signals and so forth -- I was left with the impression that it is still being designed.) Or, you might use Xilinx's XAUI interface cores and thereby speak 10 Gb ethernet to XAUI switches in your backplane.

I learned some new things about the MGT links.

  • Each instance needs 9 pins -- TX+/- and RX+/-, of course, but also AVCCAUXRX, AVCCAUXTX, VTTX, VTRX, and GNDA. These must be accompanied by (recommended) 4 ferrite beads and 4 capacitors.

  • We were told that in FF (flip chip BGA) packages the links can run at the full 3.125 Gb/s speed, but not so in FG (regular BGA) packages. This is puzzling since the links are also supposed to run reliably through 20" of FR4 and a couple of backplane connectors. Confirmation: see this Xilinx support forum Q&A.

  • MGT links may use 350 mW each.

  • Apparently you can directly use the MGT transceivers as ethernet PHYs.
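
To get a feel for what these per-link figures mean at the board level, here is a rough Python budget calculator. It simply scales the numbers from my notes above (9 pins, 4 ferrite beads, 4 capacitors, up to 350 mW, 3.125 Gb/s per link); the function and field names are hypothetical, and the aggregate rate is the raw line rate per direction, before any line-coding overhead.

```python
from dataclasses import dataclass

# Per-link figures taken from the notes above.
PINS_PER_MGT = 9
FERRITES_PER_MGT = 4
CAPS_PER_MGT = 4
MAX_MW_PER_MGT = 350
LINE_RATE_GBPS = 3.125


@dataclass
class MgtBudget:
    links: int
    pins: int
    ferrite_beads: int
    capacitors: int
    power_mw: int
    raw_gbps: float  # aggregate raw line rate, one direction


def mgt_budget(links: int) -> MgtBudget:
    """Scale the per-link costs to a design that uses `links` transceivers."""
    if links < 0:
        raise ValueError("links must be non-negative")
    return MgtBudget(
        links=links,
        pins=links * PINS_PER_MGT,
        ferrite_beads=links * FERRITES_PER_MGT,
        capacitors=links * CAPS_PER_MGT,
        power_mw=links * MAX_MW_PER_MGT,
        raw_gbps=links * LINE_RATE_GBPS,
    )


# Example: four lanes, as a single XAUI port would use.
budget = mgt_budget(4)
print(budget)
```

Even a four-lane port costs a few dozen pins and passives and potentially more than a watt, so these links are not free.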

I was wondering about clock mismatch between transmitter and receiver. Each MGT receiver has an elastic buffer. As I understand it, if the transmitter is clocked slightly slower than the receiver, the receiver's client may clock out more characters than have been received. As the buffer drains, I understand the MGT receiver inserts protocol-specific IDLE characters into the buffer to keep it from underflowing. Presumably the protocol adapter interface receiver soft core will then drop these IDLE characters as they are retrieved.

Conversely, if the transmitter is running a little faster than the receiver, it may start to fill the elastic buffer faster than the receiver can drain it. In that case, as I understand it, the elastic buffer will start to delete protocol-specific IDLE characters from the buffer. Ah, but how did those IDLE characters get into the buffer? I believe the transmit-side protocol adapter has to insert a certain number of IDLE characters, perhaps between packets, or otherwise, into the stream of characters to be transmitted, so as to give the receiver's elastic buffer something to drop in the event that the transmitter is outrunning the receiver. Is that right?
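
To make the mechanism as I understand it concrete, here is a toy Python model of an elastic buffer with IDLE insertion and deletion. It is a sketch of the idea only, not the actual MGT clock-correction logic: the class, the watermark values, the packet framing, and the wildly exaggerated clock mismatch (about 10%, where real links differ by a tiny fraction of that) are all assumptions of mine.

```python
from collections import deque

IDLE = "IDLE"  # stand-in for a protocol-specific idle character


class ElasticBuffer:
    """Toy receive FIFO that absorbs a small TX/RX clock mismatch."""

    def __init__(self, depth=16, low=4, high=10):
        self.fifo = deque()
        self.depth, self.low, self.high = depth, low, high
        self.inserted = 0  # IDLEs handed out because the buffer was draining
        self.deleted = 0   # IDLEs dropped because the buffer was filling

    def write(self, ch):
        """Called once per transmit-side (recovered) clock."""
        if ch == IDLE and len(self.fifo) >= self.high:
            self.deleted += 1  # TX is outrunning RX: drop an IDLE
            return
        if len(self.fifo) >= self.depth:
            raise OverflowError("stream carried too few IDLEs to delete")
        self.fifo.append(ch)

    def read(self):
        """Called once per receive-side user clock."""
        if len(self.fifo) <= self.low:
            self.inserted += 1  # RX is outrunning TX: hand out an IDLE
            return IDLE
        return self.fifo.popleft()


def framed(payload, packet_len=8, idles=2):
    """Transmit-side adapter: append some IDLEs after every packet."""
    out = []
    for i in range(0, len(payload), packet_len):
        out.extend(payload[i:i + packet_len])
        out.extend([IDLE] * idles)
    return out


def simulate(payload, tx_period, rx_period):
    """Clock TX every tx_period ticks and RX every rx_period ticks."""
    buf = ElasticBuffer()
    stream = framed(payload)
    received = []
    for tick in range(1, len(stream) * tx_period + 1):
        if tick % tx_period == 0:
            buf.write(stream[tick // tx_period - 1])
        if tick % rx_period == 0:
            received.append(buf.read())
    received.extend(buf.fifo)  # flush whatever is still buffered
    data = [ch for ch in received if ch != IDLE]  # receive adapter strips IDLEs
    return data, buf.inserted, buf.deleted


payload = list(range(400))
for name, tx, rx in (("fast TX", 9, 10), ("slow TX", 10, 9)):
    data, inserted, deleted = simulate(payload, tx, rx)
    print(name, "intact:", data == payload, "inserted:", inserted, "deleted:", deleted)
```

In this model the payload survives in both directions of mismatch only because the transmit side sprinkles enough IDLEs into the stream; reduce `idles` to zero with a fast transmitter and the buffer eventually overflows.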

Another issue: I asked about latency through the MGT. Say you have one MGT send just two words (8 bytes) to an adjacent MGT; and then user logic at that second MGT sends the two words right back to the first.

What is the round trip time? I was told it may be some tens of user clocks of latency before the first receiver sees the 8 bytes, and then another some tens of user clocks before the data is received back at the source.

Oh well, if that is so, it may reduce the utility of these links as a low-latency interprocessor-cluster-interconnect in a scalable MP scenario. High bandwidth yes, low latency, maybe not.

Is this information specified somewhere?

Big picture
Xilinx is to be congratulated for democratizing these advanced technologies and putting them and the tools needed to access them in the hands of thousands of designers who are not necessarily "big companies".

Nevertheless, one is taken aback at the considerable detail and complexity of this new system platform. One could reasonably absorb 90% of the details of the XC4000 programmable logic fabric in a few hours. For Virtex, a few days might be required to also grok BRAMs, DLLs, etc. For Virtex-II, more time. But in Virtex-II Pro, there is tremendous flexibility, power, and yes, complexity, inherent in interfacing to the embedded hard cores; there are mixed software and hardware design scenarios; and so on and on.

Xilinx and its partners are going to be challenged to tie a neat bow around these technologies, using reference designs, using complexity busting "easy IP", using tools like their forthcoming System Generator for PowerPC, so that engineers new to SoC design, embedded processors, or high speed interconnects, can successfully apply all this great silicon.

That's a different kind of challenge than designing a great FPGA fabric or a better place-and-route algorithm. It will be interesting to see how they do.

(If I am not mistaken, the current best offering in this department is the Virtex-II Pro Developer's Kit, $95.)

Thinking caps
Will Apple's OS X ever be ported to a Virtex-II-Pro-based platform? Might FPGA-based hardware media acceleration and/or reconfigurable computing make a compelling platform for some future Macintosh Media machine?

What is the next hard IP block destined for IP immersion?

Will demand ever lead Xilinx to field a Virtex-II-Pro+, which, by analogy with Virtex-EM, contains a factor of 2-4x again more embedded processors?

Tuesday, April 16, 2002
Xilinx PR: Aurora serial link-layer protocol.

Anthony Cataldo, EE Times: Xilinx floats Aurora serial protocol.

[updated 05/01/02]
Murray Disman, ChipCenter: Xilinx's Aurora Protocol.

Reminder, tomorrow is Programmable World 2002. Next week is FCCM'02.

Friday, April 5, 2002
Anthony Cataldo, EE Times: FPGA vendors close in on 3.125-Gbit/s serial I/O.
'"There's no comparison between a standalone processor and one that's immersed in an FPGA," said Kent Dahlgren, a member of the technical marketing staff. "The bandwidth we have is far more important than [CPU] Mips or megahertz."'
(Just as the bandwidth to dozens or hundreds of instances of compact soft CPU cores will dwarf the bandwidth to a handful of hard cores.)

EE Times: SoC designers describe their 'best practices'. '"Increasingly, in the future, we are going to see multiprocessor SoC devices and multithreading cores."'

On the fpga-cpu list, Anand Gopal Shirahatti asks:

"... What I was wondering is, are there are Implementations of the TCP/IP Implementation over a Single FPGA, for mutilple connections. ..."

The simplest thing to do is run a software TCP/IP stack on a soft CPU core. For example, at ESC I saw TCP/IP running on uClinux on Altera Nios with a CS8900A ethernet MAC.

Note that a compact FPGA CPU core with integral DMA (e.g. xr16) may be hybridized into the data shovel aspect of an ethernet MAC. (Flexibly shovel the incoming bits to/from buffers, etc.) Indeed, one enhanced FPGA CPU might (time multiplexed or otherwise) manage several physical links.

You can also build hardware implementations of the TCP/IP protocol itself. There are several such implementations in custom VLSI. For FPGA approaches, see:

  • Smith et al's XCoNet.

  • BlueArc SiliconServer white paper.
    "The SiliconServer runs all normal TCP/IP functionality in state machine logic with a few exceptions that are currently dealt with by software running on the systems attached processor (e.g. ICMP traffic, fragmented traffic reassembly)."
And related things: FPXKSM.

Wednesday, April 3, 2002
Following up on yesterday's entry, Reinoud referred us to this comp.arch.fpga thread.

Tuesday, April 2, 2002
Legacy ISA soft cores in FPGAs?
Peter Alfke of Xilinx wrote on comp.arch.fpga:
"...IMHO, both PowerPC and ARM are too complex to be implemented as soft macros."
Implementations of integer subsets of MIPS, ARM, and PowerPC architectures are not too complex to be implemented as soft cores. One can produce an integer MIPS-I soft core as "small" as MicroBlaze; and I have done a spreadsheet analysis/design study for an FPGA-optimized PowerPC Book I soft core that costs between 1200 and 2000 LUTs (1.3-2.2x the size of MicroBlaze), depending upon performance tradeoffs and whether or not you trap and emulate certain rare and expensive instructions.

The only thing holding back fast (100 MHz) relatively compact (800-2000 LUTs) FPGA-optimized soft core implementations of subsetted commercial RISC instruction set architectures is the intellectual property landscape.

I am surprised that certain processor IP companies, that lack a hard core programmable logic platform, and may therefore be losing certain design wins to ARM and PPC, have not yet launched soft FPGA-optimized processor core products. Perhaps they too think it infeasible or impractical.

(Advertising: my company can help prove otherwise -- we may be available to develop FPGA-optimized soft cores for processor IP licensors.)

I predict that sooner or later all processor IP licensors will come to the realization that programmable logic has become the air that a great many of their designers breathe, and that eventually all processor IP licensors will offer or endorse FPGA-optimized soft processor core implementations of their ISAs. To not do so would be to surrender a quickly growing market segment to their competitors. I put that date around 2005.

There is no defense against the ATTACK OF THE KILLER FPGAS!

I also feel that binary translation (static or dynamic) will become important and then commonplace, both as a way to run legacy ISAs on streamlined FPGA-optimized cores, and as a way to run full ISAs on subsetted ISA implementations.

Billion transistor FPGAs and defects
After yesterday's entry, Ben Franchuk asked (on the fpga-cpu list),

"Now with that many transistors how is failing/defective transistors/CLB's handled? Need one design error detecting logic in the new cpu ISA's? While I know the decimal machines of the 1950's often had error detecting codes like 2 bits out 5 that not only detected storage problems it detected alu problems too. Is there anything simple for today's binary machines in re-coding information for storage and arithmetic to detect possible problems?"
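
For readers who have not met the "2 bits out of 5" codes the question mentions, here is a minimal Python sketch. Each decimal digit is stored as a 5-bit word with exactly two bits set, so any single-bit error yields a word with one or three bits set and is detected. The particular digit-to-codeword assignment below is arbitrary and is not meant to reproduce any historical machine's table.

```python
from itertools import combinations

# All 5-bit words with exactly two 1 bits: C(5, 2) = 10, one per decimal digit.
# The assignment of digits to codewords here is arbitrary (an assumption).
CODEWORDS = sorted(sum(1 << bit for bit in pair)
                   for pair in combinations(range(5), 2))
ENCODE = dict(enumerate(CODEWORDS))
DECODE = {word: digit for digit, word in ENCODE.items()}


def is_valid(word: int) -> bool:
    """A stored word is plausible only if exactly two of its five bits are set."""
    return 0 <= word < 32 and bin(word).count("1") == 2


def decode(word: int) -> int:
    """Return the digit, or raise if the word has been corrupted detectably."""
    if not is_valid(word):
        raise ValueError(f"corrupt 2-of-5 word: {word:05b}")
    return DECODE[word]


word = ENCODE[7]
print(f"digit 7 -> {word:05b} -> {decode(word)}")
print("single-bit upset detected:", not is_valid(word ^ 0b00100))
```

Note that a double error which clears one set bit and sets another produces a different valid codeword, so the code detects but does not correct, and it does not catch every multi-bit error.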

Each and every one of those transistors tests out "perfectly" at the factory. I understand that the tester downloads a number of configuration bitstreams that fully exercise and cover the configuration memory, the CLBs, interconnect, etc.

(((Wacky idea: I understand that testers step over each die on the unsawn wafer, pressing probe wires to the die's pads, powering it up, and running some test circuits. I wonder, is it practical to add power, ground, and JTAG-like test paths, between dice, to interconnect the dice on the unsawn wafer and thereby test entire wafers in parallel? You would still need to step the tester over each die to check out I/O defects, but since most internal logic defects would already have been diagnosed, the tester would not need to spend much time on known bad dice. Then you collect the self-test and tester-based test results and saw and keep the good dice, the EasyPath dice, etc.)))

Altera APEX: BTW in APEX parts, Altera reportedly uses redundancy to improve yield and hence lower cost.

EasyPath: Since only a fraction of FPGA transistors matter for a given configuration, the keen idea of the EasyPath product, as I understand it, is to qualify partially defective dice against a fixed configuration (or at any rate, a set of test configurations that covers the resources required by the fixed configuration).
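
Reduced to its essence, and only as I understand the idea, the qualification test is a set intersection: a partially defective die is acceptable for a fixed design if none of its bad resources are among those the design uses. Here is a tiny Python sketch; the resource names are made up for illustration, and the real flow surely works from test configurations rather than an explicit defect list.

```python
def blocking_defects(defective, required):
    """Return the defective resources that the fixed design actually uses."""
    return sorted(set(defective) & set(required))


def die_qualifies(defective, required):
    """A die is usable for this design if no required resource is defective."""
    return not blocking_defects(defective, required)


# Hypothetical resource names, for illustration only.
design_uses = {"CLB_R3C7", "CLB_R3C8", "BRAM_X0Y2", "ROUTE_1742"}
die_a_bad = {"CLB_R9C1", "ROUTE_0031"}   # defects in unused resources
die_b_bad = {"BRAM_X0Y2", "CLB_R9C1"}    # one defect lands in the design

print("die A qualifies:", die_qualifies(die_a_bad, design_uses))
print("die B qualifies:", die_qualifies(die_b_bad, design_uses),
      "blocked by", blocking_defects(die_b_bad, design_uses))
```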

That said, factory perfect FPGAs may still have failures in the field. Coping with those failures is a rich subject. Here are just a few comments.

  1. You can use readback to read the configuration bitstream. You can even read it back to an internal circuit within the FPGA. There you can compute a signature on the bitstream and so detect if it has changed through some kind of SRAM upset. You can even continuously read back the configuration and test that it is pristine every second (or more often than that).

  2. In one FPGA you can build two or more processors, and run them in lock step, comparing the write-back results of each processor each cycle. This can detect when one diverges from the other. I really think this is the easiest thing to do, at least to detect faults.
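
The readback signature check in item 1 can be sketched in software as follows. This is a model of the idea only: in a real system the check would be a circuit (or a small soft CPU) fed by the readback port, CRC-32 is my choice of signature rather than anything Xilinx specifies, and the byte string stands in for real configuration data. A real design would presumably also have to exclude bits that legitimately change at run time, such as LUT RAM contents.

```python
import zlib


def signature(bitstream: bytes) -> int:
    """Compute a 32-bit signature (CRC-32) over the configuration data."""
    return zlib.crc32(bitstream) & 0xFFFFFFFF


def configuration_is_pristine(read_back: bytes, golden_signature: int) -> bool:
    """Compare the signature of freshly read-back data against the known-good one."""
    return signature(read_back) == golden_signature


# Stand-in for a configuration bitstream; real data would come from readback.
golden = bytes(range(256)) * 4
golden_signature = signature(golden)

# Simulate a single-event upset: flip one configuration bit.
upset = bytearray(golden)
upset[100] ^= 0x08

print("clean readback ok:", configuration_is_pristine(golden, golden_signature))
print("upset detected   :", not configuration_is_pristine(bytes(upset), golden_signature))
```

Run periodically, a check like this only detects an upset; recovering from it (reconfiguring, restarting the computation) is a separate problem.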

You can also build a TMR system. (And I think I would have more confidence in a system done across three FPGAs than all on one.) And as in big systems you can always put EDAC (ECC) on the buses and/or RAMs in your system.
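
Lock-step comparison and TMR voting both boil down to a few gates per result bit. Here is a minimal Python model of the two checks, treating each processor's write-back result as an integer; the function names are mine, and a real implementation would of course be logic comparing the result buses every cycle.

```python
def lockstep_mismatch(result_a: int, result_b: int) -> bool:
    """Two processors in lock step: any difference signals a fault (but not whose)."""
    return result_a != result_b


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant results."""
    return (a & b) | (a & c) | (b & c)


def dissenting_replicas(a: int, b: int, c: int) -> list:
    """Name the replicas whose result differs from the voted value."""
    voted = tmr_vote(a, b, c)
    return [name for name, value in zip("ABC", (a, b, c)) if value != voted]


good = 0xDEADBEEF
bad = good ^ 0x00000010  # one replica suffers a single-bit error

print("lock-step fault detected:", lockstep_mismatch(good, bad))
print(f"TMR voted result        : {tmr_vote(good, good, bad):#010x}")
print("dissenting replica      :", dissenting_replicas(good, good, bad))
```

The pair can only detect a divergence, whereas the triple can also mask it and identify the faulty replica, at the cost of a third copy of the logic.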

Designers of aerospace systems have to worry about this all the time. See for example the MAPLD Conference.

Monday, April 1, 2002
Here are PDF slides of Xilinx's Peter Alfke's talk, Evolution, Revolution, and Convolution: Recent Progress in Programmable Logic. (Xilinx techXclusives version noted here earlier). It's quite Xilinx-centric, but still well worth reading, chock full of important issues and good ideas.
"FPGAs circa 2005"
  • "50 Million system gates"
  • "2 Billion transistors on one chip" (my emphasis)
  • "70-nm process technology"
  • "10-layer Cu technology"
  • "Hard and soft IP blocks"
  • "1 GHz embedded processor"
  • "Mixed-signal Intellectual Property"
  • "10-Giga-bps I/O channels"

The theme of one of the best issues of IEEE Computer ever, Sept. 1997, was The Future of Microprocessors. The introductory article was Burger et al., Billion-Transistor Architectures. It seems very likely to me that the first billion transistor microprocessors will be FPGA chip-multiprocessors.

FPGA CPU News, Vol. 3, No. 4
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.

Copyright © 2000-2002, Gray Research LLC. All rights reserved.
Last updated: May 15 2002