FPGA CPU News of October 2000



Saturday, October 28, 2000
Mike Rutenberg reports XSOC is discussed in c't magazine this month!
> Subject: XSOC has made the press
> You are in the latest issue of the (excellent!) c't magazine.  The
> magazine URL is below, though I am not sure the article is on the web
> site.  The article is on OpenSource-Hardware, and it mention you
> specifically and your processor, along with a number of others.  One
> thing it says is that the License only allows non-commercial use,
> implying that there are no options for using the system for other
> things.
Well, that's right, XSOC/xr16 is not currently licensed for commercial use. Never say never.

The Tao of Static Timing Analysis
One quote that I overlooked in yesterday's fine EE Times piece on tools:

"Both Sevcik and Greenfield said that with device performance a primary concern, customers are also using static timing analysis tools in high-end FPGA design."
I believe that all FPGA designers should use static timing analysis. As I learned from Philip Freidin, functional simulation plus static timing analysis is necessary and sufficient, which leaves little to no value in doing timing simulation, back-annotated or not.

(Here follows my brief enthusiast's take -- which doesn't do justice to Philip's eloquent and convincing exposition.)

First, I assume you're using a robust synchronous design method. No asynchronous feedback circuits, no gated clocks. All state is registered in flip-flops (and other synchronous primitives like RAMs), passes through a cloud of combinational logic, and is captured by other flip-flops (RAMs, etc.). All FFs are clocked by common clocks on low-skew clock lines. Clock-enables, not gated clocks, determine whether each FF is enabled on each cycle. Ideally all FFs are clocked on the rising clock edge, but sometimes you have to clock on the falling edge too. Ideally all circuits take one cycle, but sometimes some take more than one cycle. Ideally all inputs and outputs to the chip are registered in FFs.

In this paradigm, a cycle-by-cycle functional simulation will suffice to verify the logical correctness of your design.

Once your design is logically correct in all circumstances, it remains to determine whether it also satisfies its performance (timing) requirements.

That's where static timing analysis comes in. A single almost-automatic static timing analysis run can quickly establish that a design meets its timing constraints.

Given a placed-and-routed design, where every primitive element of logic and interconnect has been determined, a static timing analyzer like Xilinx's TRCE first determines the worst-case delays on every net and logic element. Then, using this data, TRCE traces every circuit path from every FF to every FF to determine worst-case delays from a given clock edge to another given clock edge, and this reveals the maximum guaranteed safe operating frequency over a range of supply voltages and temperatures.
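
To make the path-tracing idea concrete, here is a minimal Python sketch. It is not TRCE's actual algorithm: it simply treats the design as an acyclic graph of worst-case delays, propagates the latest arrival time from every flip-flop output to every flip-flop input, and reports the smallest safe clock period. The function name, node names, and delay numbers are all hypothetical, and it assumes a single clock, single-cycle paths, and rising-edge flip-flops only.

```python
from collections import defaultdict


def worst_case_period(arcs, ff_outputs, ff_inputs, t_cko, t_setup):
    """Return (minimum clock period in ns, critical FF input node).

    arcs:       (src, dst, worst_case_delay_ns) edges through nets and
                combinational logic; the graph must be acyclic.
    ff_outputs: nodes driven by flip-flop outputs (path start points).
    ff_inputs:  nodes feeding flip-flop data inputs (path end points).
    t_cko:      assumed clock-to-output delay of every flip-flop.
    t_setup:    assumed setup time of every flip-flop.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set(ff_outputs) | set(ff_inputs)
    for src, dst, delay in arcs:
        succ[src].append((dst, delay))
        indeg[dst] += 1
        nodes.update((src, dst))

    # Latest arrival time at each node, measured from the launching clock edge.
    # None means "not reachable from any flip-flop output".
    arrival = {n: None for n in nodes}
    for q in ff_outputs:
        arrival[q] = t_cko

    # Visit nodes in topological order so each node is final before it is used.
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:
        n = ready.pop()
        for dst, delay in succ[n]:
            if arrival[n] is not None:
                candidate = arrival[n] + delay
                if arrival[dst] is None or candidate > arrival[dst]:
                    arrival[dst] = candidate
            indeg[dst] -= 1
            if indeg[dst] == 0:
                ready.append(dst)

    # The slowest register-to-register path sets the clock period.
    return max((arrival[d] + t_setup, d)
               for d in ff_inputs if arrival[d] is not None)


# Hypothetical three-flip-flop example: two paths converge on d_c.
example_arcs = [
    ("q_a", "n1", 1.0),
    ("q_b", "n1", 0.5),
    ("n1", "d_c", 2.0),
    ("q_a", "d_c", 4.5),
]
period_ns, endpoint = worst_case_period(
    example_arcs, ["q_a", "q_b"], ["d_c"], t_cko=1.0, t_setup=0.5)
print(f"critical endpoint {endpoint}: {period_ns} ns, "
      f"{1000.0 / period_ns:.1f} MHz")
```

Note that no test vectors appear anywhere: every path is examined whether or not a simulation would ever exercise it.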

A static timing analysis is necessary and sufficient because it explores every path between each group of FFs in the design, and (together with a cycle-based functional simulation) covers the design even if you overlook some test vector for some path, and even if some elements are operating exceptionally fast. This 100% coverage is difficult to achieve with a timing simulation.

(I don't mean to oversell static timing analysis. But I like it. It does require that you apply minimal timing specifications to your design. Sometimes a single constraint as small as "net CLK period = 15 ns;" will suffice. Some designs, with some multi-cycle paths, multiple clock domains, etc., require much more extensive timing specifications. Sometimes it's hard to get the tools to report 100% coverage or to trust them when they do. Even so, STA can achieve better coverage than timing simulation, and for less time and effort.)

(This little commentary also does not touch on the other applications of static timing analysis during prototyping and design, to gain early insight into feasible circuit designs, track slack between groups of circuits as the design evolves (another Freidin technique), etc.)

Since functional simulation plus static timing analysis is necessary and sufficient, how can we explain the false allure of timing simulation, or the notion that only some designers are using static timing analysis?

  • "I'd never thought of it."
  • "I'd never heard of it."
  • "I don't know how to write timespecs."
  • "I've heard of it, but I (my manager) like (insists upon) timing simulation."
  • "I don't/won't/can't trust the static timing analyzer so I do timing simulation, as well, to get more confidence."

Friday, October 27, 2000
Today, I am investing some time working on the Knowledge.

I can now build RPMs in Verilog in Synplify: /* synthesis xc_props="RLOC=RyCx" */ seems to work well. Alas, the slice coordinate system added with Virtex is awkward. I have to produce two versions of some of my RPMs -- one with a _s0 suffix, one with a _s1 suffix. Within the lowest level modules, the suffix controls whether to apply "RLOC=RyCx.S0" or "RLOC=RyCx.S1". Unfortunately, this attribute propagates upwards to higher level modules, and so I must write addsub16_s0 and addsub16_s1 and so forth.
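
Since the two variants differ only in the slice suffix, one way to avoid writing them by hand is to generate the attribute text. The sketch below is a hypothetical Python helper (not part of Synplify or the Xilinx tools) that produces the per-bit attribute comments for a one-column RPM in either slice. It assumes two bits per slice and that bit 0 sits in the highest-numbered row so that the carry chain runs toward row 0; both are assumptions to adjust for the actual device and macro.

```python
def rloc_attrs(bits, slice_sel, col=0):
    """Return one RLOC attribute comment per bit of a single-column RPM.

    bits:      datapath width, e.g. 16 for an addsub16 macro.
    slice_sel: 0 or 1, selecting the .S0 or .S1 variant.
    col:       relative column of the macro (assumed 0 here).
    """
    rows = (bits + 1) // 2                # assumption: two bits per slice
    attrs = []
    for bit in range(bits):
        row = rows - 1 - bit // 2         # assumption: bit 0 in the bottom row
        attrs.append(
            f'/* synthesis xc_props="RLOC=R{row}C{col}.S{slice_sel}" */')
    return attrs


# Emit the attribute text for both variants of a 16-bit macro.
for suffix in (0, 1):
    print(f"// addsub16_s{suffix}")
    for bit, attr in enumerate(rloc_attrs(16, suffix)):
        print(f"//   bit {bit:2d}: {attr}")
```

This only removes the typing; the two variants still exist as separate modules in the generated Verilog.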


  • a 32-bit add/sub RPM, sinking and sourcing adjacent registers, floorplanned as 16Rx1C of slices, runs at about 150 MHz in a 2S50-5: Tcko + net + 16*Tbyp + Tccky < 7 ns. With care, a pipelined 32-bit RISC in a -5 part should hit 100 MHz.
  • floorplanning it as 8Rx2C of slices instead adds about 2.7 ns to the cycle time, due to the extra vertical net to route carry-out[15] from the top of the first 16-bit adder column to the bottom (carry-in port) of the second one, plus the extra Tbxcy. Ouch, so much for that idea.
  • a 64-bit ripple-carry adder, 32x1 slices, has a cycle time under 11 ns.
  • as does a 64-bit carry-select add/sub implemented with 3 32-bit adders and a mux.
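
For readers unfamiliar with the carry-select structure in the last bullet, here is a small functional model in Python. It is only a behavioral sketch with hypothetical function names, and it says nothing about timing: the lower 32 bits are added once, the upper 32 bits are added twice (assuming carry-in of 0 and of 1), and a mux picks the right upper result once the lower carry-out is known.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def add32(a, b, carry_in):
    """Model one 32-bit adder: return (sum, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def carry_select_add64(a, b, carry_in=0):
    """64-bit add built from three 32-bit adders and a mux."""
    lo, c32 = add32(a, b, carry_in)
    # In hardware these two adders run in parallel with the lower one.
    hi0, cout0 = add32(a >> 32, b >> 32, 0)
    hi1, cout1 = add32(a >> 32, b >> 32, 1)
    # The mux: the lower carry-out selects which upper result is used.
    hi, cout = (hi1, cout1) if c32 else (hi0, cout0)
    return (hi << 32) | lo, cout


def addsub64(a, b, subtract):
    """Add/sub wrapper: a - b is computed as a + ~b + 1."""
    if subtract:
        return carry_select_add64(a, ~b & MASK64, 1)
    return carry_select_add64(a, b, 0)


print(hex(addsub64(0xFFFFFFFF, 1, False)[0]))
print(hex(addsub64(5, 7, True)[0]))
```

The cost is the third adder and the mux; the benefit is that the upper half never waits for a carry to ripple through the lower 32 bits.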

Michael Santarini, EE Times: Can tools keep up with programmable silicon? Highly recommended. Indeed, tools and methodology issues are probably a greater barrier to widespread deployment of FPGA systems than the increasingly abundant and cheap programmable logic itself. For example, floorplanning attributes (RLOCs and so forth) have a different attribute syntax in the Synopsys, Exemplar, and Synplicity synthesis tools!

'Most sources agreed that HDLs have replaced the "schematic-sauruses" -- those who hand-tweaked gates and flip-flops to get the maximum performance out of FPGAs. Philip Freidin, a longtime schematic-saurus, said that he has begun to incorporate HDLs into his methodologies.'

'"It isn't because I wanted to do it; it is because customers demand it," said Freidin, who specializes in high-performance FPGA design at Fliptronics (Sunnyvale, Calif.). "The issue simply comes down to design time."'

Peter Clarke, EE Times: Panel ponders cost of programmable system-on-chip.

Monday, October 23, 2000
Hiro Higuma, Martin Won, Altera, for ChipCenter: Building Configurable Network Processors.
"Programmable logic-based packet processing functions offer many of the same flexibility and time-to-market benefits of off-the-shelf network processors. Further, programmable logic can provide better performance by utilizing dedicated hardware for specific packet processing functions."

"... Finally, the recent availability of 32-bit RISC soft cores for programmable logic adds another level of capability to these devices, affording users greater usability and the option to rapidly develop custom multiprocessor designs."

It's a great application for dozens of compact (200-500 logic cell) soft CPU cores per FPGA. See also Soft cores.

Sunday, October 22, 2000
One evening last August, I designed a simple RISC processor, even simpler and smaller than xr16. Like xr16, it is designed to be the target of lcc and has 16-bit instructions, 16-bit data, and 16 registers. Unlike xr16, (but like Brian Davis's YARD-1A (description)), it is a 2-operand architecture, is not pipelined, and uses a single bank of dual-port RAM for the register file. Initially it will use the Virtex-family block-RAM as the instruction store.

The goals are to provide a simple, fully embeddable MCU, comparable to KCPSM but C programmable, and to advance the agenda of demystifying processor design and encouraging student and enthusiast experimentation. (In retrospect, the pipelining of xr16 is good for performance but detracts from its simplicity.) I will be presenting this CPU/SoC design at an upcoming design conference. Like XSOC/xr16, it is "disclosed source", licensed under the XSOC LICENSE agreement.

Taking a page from the "literate programming" community, the write-up of the design is the design. Using Microsoft Word, I save the document as text and filter that to extract the Verilog source. Here is the current work-in-progress in PDF. (The processor is mostly sketched out but it surely doesn't compile yet.)
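
The extraction step can be very small. The entry does not say how the Verilog is marked up in the document, so the following Python sketch assumes a purely hypothetical convention: each listing sits between a line reading "// BEGIN VERILOG" and a line reading "// END VERILOG" in the saved text file. The function name and the markers are illustrative only.

```python
def extract_verilog(text, begin="// BEGIN VERILOG", end="// END VERILOG"):
    """Return only the lines that sit between begin/end marker lines.

    The marker strings are a hypothetical convention, not the one used
    in the actual write-up.
    """
    kept = []
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == begin:
            inside = True
        elif stripped == end:
            inside = False
        elif inside:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


document = """The adder sums its two operands.
// BEGIN VERILOG
assign sum = a + b;
// END VERILOG
The next section describes the register file.
"""
print(extract_verilog(document), end="")
```

Everything outside the markers is prose for the reader; everything inside goes to the synthesis tool unchanged.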

"Processor and SoC design is not rocket science ... To prove the point, this paper and accompanying 50-minute class will present the complete design and implementation of a streamlined, but real, system-on-a-chip and FPGA RISC processor."
Universidad de Valladolid (Spain): The uP1232 8-bit fpga-processor. 8-bit CISC, 32 registers, 200 XC4000E CLBs.

PLD processors by Jeung Joon Lee: Reliance-1, PopCorn-1, Acorn-1.

Saturday, October 21, 2000
Craig Matsumoto, EE Times: How Cisco beat chip world to net. Cisco's in-house network processor designs.
"Like most network processor designs, the Toaster is parallel-pipelined. "Each column you can think of as doing a system function with separate memories," allowing for better I/O bandwidth in any column, Nellenbach said. ... Kerr's main concern was in maximizing the possible number of 32-bit lookups per second, which meant getting the memory interface right ... So Jennings' team went with eight memory interfaces, all connecting to synchronous DRAM."
Commentary: some designs (not necessarily this one) are inevitably constrained by pin-bandwidth to external memory and I/O, regardless of whether the internals are hard gates or programmable logic.

Tuesday, October 17, 2000
Brian Dipert, EDN: Cunning circuits confound crooks. Nice survey article on PLD design security.
"With otherwise-SRAM-based FPGAs, for example, adding nonvolatile memory for unique device identifiers might be cost-prohibitive. Instead, Xilinx is including a hard-wired Triple-DES decryption block, along with two sets of 56-bit key registers and dedicated battery-backup supply-voltage inputs for only those registers, on its upcoming Virtex-II FPGAs. Xilinx estimates that the registers alone consume only nanoamps of current, orders of magnitude lower than if the entire device needed to be battery-powered. Xilinx's approach not only prohibits device cloning, but also prevents unwanted rogue bit streams, such as viruses, from being downloaded to the part in this increasingly network-connected world."

Tuesday, October 10, 2000
This afternoon I resumed the Virtex port of XSOC/xr16/xr32 (XSOC2 Log) and am now (finally) running XSOC/xr16 in my XESS XSV-300 prototyping board.

Today's work involved several compromises. Since this board does not have a tool to pre-load the SRAM, I modified the XSOC design to provide a 256x16 boot ROM in a block RAM. I further modified the new fully synchronous MEMCTRL so that instruction fetches from this block RAM signal RDY in the same cycle.

Just as with the XSOC/xr16 kit for XS40 boards, the design currently includes a bitmapped VGA display, using the DMA engine in the xr16 CPU core as the video address counter. (With a 50 MHz dot clock, it refreshes the display at 120 Hz!)
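
The 120 Hz figure can be sanity-checked with a little arithmetic. The entry does not give the display timing, so this Python sketch assumes the standard 640x480 VGA frame of 800 dot clocks per line and 525 lines per frame (blanking included); the function name is hypothetical.

```python
def refresh_hz(dot_clock_hz, clocks_per_line=800, lines_per_frame=525):
    """Frames per second for a given dot clock and total frame size."""
    return dot_clock_hz / (clocks_per_line * lines_per_frame)


print(f"{refresh_hz(25.175e6):.2f} Hz at the usual 25.175 MHz dot clock")
print(f"{refresh_hz(50e6):.2f} Hz at a 50 MHz dot clock")
```

Under that assumption a 50 MHz dot clock gives roughly 119 Hz, consistent with the rounded figure above.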

Alas the XSV's two 16-bit SRAM banks both lack byte-write-enables. For the time being I am using just one byte-wide bank of SRAM. Later I will modify MEMCTRL to perform read-modify-write accesses for byte stores to RAM.
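
A read-modify-write byte store is easy to model in software. The Python sketch below is a behavioral illustration only, not the MEMCTRL design: the class and function names are hypothetical, and it assumes even byte addresses map to the upper byte of each 16-bit word (swap the two branches for the opposite byte order).

```python
class WordSram:
    """16-bit-wide SRAM with no byte write enables: whole-word writes only."""

    def __init__(self, words):
        self.mem = [0] * words

    def read_word(self, index):
        return self.mem[index]

    def write_word(self, index, value):
        self.mem[index] = value & 0xFFFF


def store_byte(sram, byte_addr, value):
    """Store one byte using a read, a merge, and a full-word write."""
    index = byte_addr >> 1
    old = sram.read_word(index)                       # read
    if byte_addr & 1:
        new = (old & 0xFF00) | (value & 0xFF)         # modify the lower byte
    else:
        new = (old & 0x00FF) | ((value & 0xFF) << 8)  # modify the upper byte
    sram.write_word(index, new)                       # write


sram = WordSram(4)
sram.write_word(1, 0x1234)
store_byte(sram, 2, 0xAB)
store_byte(sram, 3, 0xCD)
print(hex(sram.read_word(1)))
```

Each byte store costs two memory accesses instead of one, which is the price of doing without byte write enables.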

Using a modified version of xr16 (replacing the double-cycled single-port RAMs with dual-port RAMs), we get a design that TRCE reports will run at 60 MHz in a V300-5. (Not floorplanned yet.) Total size of the design, including MEMCTRL and VGA, is about 400 logic cells.

The design runs fine at 33 MHz. At 50 MHz, the program runs fine, but accesses to the external SRAM frame buffer fail. I will therefore modify the memory controller to insert a wait state on each external SRAM access. That done, I should be able to tune the core design up to 67 MHz in short order, motivating integrated instruction and data caches...

Craig Matsumoto, EE Times: Adaptive Silicon preps FPGA core for ASICs.

Monday, October 9, 2000
Rich Katz (NASA)'s site "dedicated to the design and use of programmable and quick-turn technologies for space flight applications."

Peter Clarke, EE Times: Beefy parallel processor packs 128 cores. Pact GmbH's XPP-128, a 128 CPU MP-on-a-chip, 12.8B MAC/s at 100 MHz.

"The XPP is a mixture of a parallel-processing array with an interconnect architecture similar in style to that of an FPGA. But Vorbach said the second crucial element is the transparent run-time reconfiguration technology that dynamically controls the processing resources. Vorbach said this technology automatically makes changes to the array interconnect, assigning processes to clusters of processors based on internal or external events."

Based upon the sketchy description of the programming model, a pure data-parallel SIMD or a pure MIMD would seem easier to program.

Peter Glaskowsky, Microprocessor Report: PACT Debuts Extreme Processor: Reconfigurable ALU Array Is Very Powerful -- And Very Complex. The article provides excellent detail on the design, including its programmable interconnect, and thoughtful analysis of its prospects. Factoids -- 400 mm^2 in 0.21 micron, 32 256x32 embedded SRAMs, 8 SDRAM channels, 1,521 contact BGA!

XPP reminds me of the Univ. of Washington's RaPiD configurable computing architecture.

We may build similar things in big Virtex-Es. (With all that block RAM for local scratchpad RAMs or caches, it's a natural.) See Multiprocessors, Supercomputers, Soft cores, Using block RAM, and of course, Danny Hillis' The Connection Machine.

ARC Tangent
Peter Clarke, EE Times: ARC Tangent extends configurability to the system level.

'However the company is not introducing high-end features such as out-of-order instruction execution, conditional execution or speculative execution of branches. "The philosophy is still to keep the core simple, the base core gate count is still only about 17,000 gates," said Hakewill.'
An excellent philosophy. Simple is beautiful.

Saturday, October 7, 2000
Over on the fpga-cpu mailing list, I posted two messages on how to build really compact RISC cores, down around 150 logic cells, half the size of xr16 (and at least twice as slow).

In the first, I start with an ultra-minimalist datapath, and add features one-by-one to improve performance.

In the second, I subtract microarchitectural features one-by-one from xr16 to explore what savings might be realized, and at what cost in performance.

Friday, October 6, 2000
Ken Chapman, Xilinx, App Note XAPP213: 8-Bit Microcontroller for Virtex Devices. If I may be permitted to quote so extensively, I'll let this superb app note speak for itself:
"The Constant (k) Coded Programmable State Machine (KCPSM) presented in this application note is a fully embedded 8-bit microcontroller macro for the Virtex and Spartan-II devices. The module is remarkably small at just 35 CLBs, less than half of the smallest Spartan XC2S15 device, and virtually free in an XCV2000 device by consuming less than 0.37% of the device CLBs."

"This KCPSM provides 49 different instructions, 16 registers, 256 directly and indirectly addressable ports, and a maskable interrupt at 35 million instructions per second (MIPs). This performance exceeds that of traditional discrete microcontroller devices, making the KCPSM a cost-attractive solution for data processing as well as control algorithms."

"Fully embedded including the program memory, the KCPSM can be connected to many other functions and peripherals tuned to a specific design. Processing distributed over multiple KCPSM processors within a single device is suitable for applications such as neural networks." ...

"When a processor is completely embedded within an FPGA, no I/O resources are required to communicate with other modules in the same FPGA. Additionally, system design flexibility is included along with savings on PCB requirements, power consumption, and EMI. Whenever a special type of instruction is required, it can be created in hardware (other CLBs) and connected to the KCPSM as a kind of coprocessor. Indeed, there is nothing to prevent a coprocessor from being another KCPSM module. In this way, even the 256-instruction program length is not a limitation."

See also this app note by Chapman from almost six years ago. Dynamic Microcontroller in an XC4000 FPGA. Nice work, and a nice prior art reference. (Ah, XBLOX, those were the days. I designed my first 3-D rendering system with XBLOX.)

This new app note articulates many of the potential advantages of compact soft CPU cores. I feel strongly that small soft CPU cores will prove to be indispensable, both standalone in low-cost device SoCs, and as channel processors and smart peripherals to hard CPU cores in the forthcoming "hard CPU + PLD" hybrid devices. See also Soft cores and the theme of my Circuit Cellar articles.

What do I consider small? Not 950 or 1100 or 1700 logic cells. Certainly not 3000. By small, I mean cores like this excellent assembler-programmable KCPSM (35 CLBs => ~140 logic cells) or the integer-C-programmable xr16 (~300 logic cells). A Spartan-II-150 is a terrible thing to waste. See also Simple is beautiful.

Ericsson's Erlang FPGA CPU
From the Sixth International Erlang/OTP User Conference, Robert Tjärnström and Peter Lundell, Ericsson Telecom: ECOMP - an Erlang Processor (PowerPoint).

"An Erlang processor has been built in an FPGA (i.e. programmable hardware). The JAM compiler has been changed to generate native code which allows Erlang programs to be run directly on the processor without any OS and with improved performance."
This is an interesting LIW architecture that does concurrent real-time garbage collection in hardware. It also has ~20 cycle hardware process switching.

Modeled in Erlang, implemented in VHDL, prototyped in an RC1000-PP with an XC40150XV. Results: a speedup of 30 (cycle per cycle) while decreasing power by more than an order of magnitude. Compared to what, they didn't specify...

Over on the LEON SPARC mailing list, Jiri Gaisler announces version 2.2 beta, which uses AMBA AHB/APB internal buses. Another milestone.

Thursday, October 5, 2000
Xilinx Announces Acquisition of RocketChips. This is the first press release I've ever seen that is accompanied by a glossary! See also last Monday's announcements of Virtex-II 3.125 Gbps serial link technologies.

Craig Matsumoto, EE Times: Xilinx acquires I/O specialist.

Tuesday, October 3, 2000
Joel on Software: Painless Functional Specifications: Part One: Why Bother?, Part Two: What's a Spec?, Part 3: But... How?.

Alan Singletary, IBM: Techniques for Enabling FPGA Emulation of IBM CoreConnect Designs. I was surprised this paper describes implementing the CoreConnect bus external to a set of FPGAs, and not internal to a set of cores in one FPGA. Nevertheless, the same techniques apply -- use tri-state buses (via on-chip TBUFs) to reduce the interconnect and logic requirements of multi-master address and data bus multiplexing.

Monday, October 2, 2000
Xilinx Adds FPGA Support to Free Web Design Tools. Yahoo!
"Xilinx, Inc. today announced full support of the entire Spartan®-II FPGA family as well as the 300,000 system gate Virtex(TM) XCV300E FPGA in the WebPACK ISE(TM) tool suite. ... The next release of WebPACK ISE, V3.2i WP1.0, which contains the added FPGA support of Spartan-II and the Virtex XCV300E device is scheduled for release in mid-October 2000."
These tools and parts are more than adequate for all manner of sophisticated processor cores, multiprocessors, DSPs, etc. As the barriers to entry fall away, we're in for some serious innovation, serious products, and some serious fun.

[updated 00/08/05] Murray Disman, ChipCenter, Xilinx Offers Free FPGA Design Tools.

IP business models [revised 00/10/04]
Here is my take on some cogent analysis on IP business models from Tom Cantrell of Circuit Cellar. As he writes in Excalibur Marks the Spot,

"Remember that the cost of any chip is comprised of two parts-what it costs to make, sell, and support the silicon, and the value of the design (i.e., IP). As a silicon supplier, Altera has the ability to hide the IP cost in the chip price. Independent IP providers have no such luxury, short of messy and unpopular royalty schemes. Also, the free IP news may perk the interest of lawyers, similar to how MP3 got the recording industry riled up. I look forward to reading the fine print in the Nios license."
In the old days, chip vendors were also the IP developers and the EDA tools developers. Nowadays, we have specialized fab companies (TSMC), IP companies (ARM, MIPS, Gray Research LLC :-) ), and tools companies (Mentor, Cadence, etc.), and combinations of these (Intel). You can buy IP bundled with hardware (Intel), bundled with your tools (EDA companies), or separately (IP providers).

Enter the FPGA vendors (Xilinx, Altera). They have an opportunity to seize upon a unique business model.

Take Altera Nios. The Nios development kit is relatively inexpensive (~$1000) and they will supposedly issue you a license to use the Nios core in Altera FPGAs for $0. The more instances you make, the more programmable logic they sell. I suppose they make up the cost of developing, testing, supporting, documenting, etc. the IP in device sales (which also simplifies the accounting).

This business model gives these vendors a giant, almost unassailable advantage over third party IP vendors. The latter can never compete on price, because the FPGA vendors can always price their IP down to $0 and happily make up any lost revenue with further sales of CLBs. Therefore a third party IP vendor can only compete on value, quality, and innovation. For example, in the Altera CPU cores market, which includes the $0 Nios core family, one can only compete with a different value proposition, perhaps instruction set compatibility with a legacy ISA, or perhaps by offering a core which is dramatically smaller and faster than Nios. In the latter case, if your core uses $2 less programmable logic than the FPGA vendor's does, then it may have a value of $2/unit to a customer. Or not. It also depends on which vendor(s) establish a larger value chain of experts, plug-ins, etc.

In pricing their cores, FPGA vendors may also consider the customer lock-in value of a key piece of IP (such as a processor or on-chip bus protocol). Once a customer has designed against such a key facility or interface, it will be extremely costly to change horses later.

Perhaps FPGA vendors have a vested interest in giving away the largest cores that the market will bear, so as to sell more CLBs. There are two problems with this idea: driving away "cores partners", and competition with other device vendors.

Giving away: Free IP in a market may act to reduce customers' value model for IP -- "I'll be darned if I'm going to pay PrettyGoodCores Inc. $1000 for a UART license (even with validation and support) when I can get a whole processor soft core and an on-chip bus license from my vendor for nothing!" If FPGA vendors give away enough free cores, the end effect could be to discourage pure IP vendors from contributing to that device vendor's value chain, reducing the supply of device optimized cores, hence design wins, hence device sales.

Largest cores: Secondly, the FPGA vendors must compete with each other for design wins, and if one vendor has an excellent (fast, compact) set of cores they may sell fewer CLBs per design win, but may be able to win new designs from the competition.

It's an interesting conundrum.

Eventually there will also be a number of suppliers of high-quality free IP. This will drive down the price of "me too" equivalent-quality commercial IP except when propped up by artificial means. Even in that world, I think there will still be an interesting market for unique or highly-optimized commercial IP.

By the way, the "hello world" message I wrote to my LCD panel in my project at the Altera Nios Hands-on Workshop read: Such a great business model.

FPGA CPU News, Vol. 1, No. 4
Back issues: Sep, Aug, Apr.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.

Copyright © 2000-2002, Gray Research LLC. All rights reserved.
Last updated: Mar 09 2001