FPGA CPU News of August 2002


Friday, August 30, 2002
Software day!

Patent 6,442,620
Environment extensibility and automatic services for component applications using contexts, policies, and activators. Filed August 17, 1998, granted August 27, 2002.

"An object system provides composable object execution environment extensions with an object model that defines a framework with contexts, policies, policy makers and activators that act as object creation-time, reference creation-time and call-time event sinks to provide processing of effects specific to the environment extensions. At object creation time, an object instantiation service of the object system delegates to the activators to establish a context in which the object is created. The context contains context properties that represent particular of the composable environment extensions in which the object is to execute. The context properties also can act as policy makers that contribute policies to an optimized policy set for references that cross context boundaries. The policies in such optimized sets are issued policy events on calls across the context boundary to process effects of switching between the environment extensions of the two contexts."
My last project at Microsoft. It's the extensibility architecture that we added to COM to make it extensible enough to host automatic COM+ Services such as MTS (Transaction Server). This first shipped in Windows 2000.

I am very pleased that a second generation, more performant version of the Contexts architecture is an important part of the .NET Framework architecture (no thanks to me).

Mike Woodring's Context slides from last summer's Conference.NET.

Dharma Shukla, Simon Fell, and Chris Sells, in MSDN Magazine: Aspect-Oriented Programming Enables Better Code Encapsulation and Reuse.

Platforms
Joel on Software: Platforms.

"It's really, really important to figure out if your product is a platform or not, because platforms need to be marketed in a very different way to be successful. That's because a platform needs to appeal to developers first and foremost, not end users."
Exercise: can you reconcile Spolsky's Platform with the term Platform FPGA? In the latter case, who are developers and who are end users?

Smarter than you
I used to give technical interviews to candidates at Microsoft, one or two each week. One of the standard "clever insight" interview questions was:

"How would you detect a cycle in a singly linked list?"
So, I used to ask this one, and kept pushing -- "Per-node mark bits? Good answer, but can you do without 'em?" -- exploring one solution after another, until the solution space had been pruned down to solutions with just O(n) time and O(1) storage. There are several approaches which require only a couple of words of state (total).
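
(For concreteness, here is a minimal Python sketch of the best-known couple-of-words answer, Floyd's "tortoise and hare"; the Node class and example list are mine, purely illustrative.)

    class Node:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    def has_cycle(head):
        # Floyd's "tortoise and hare": slow advances one node per step,
        # fast advances two. If there is a cycle, fast eventually laps
        # slow and they meet. O(n) time, two pointers of state in total.
        slow = fast = head
        while fast is not None and fast.next is not None:
            slow = slow.next
            fast = fast.next.next
            if slow is fast:
                return True
        return False

    a, b, c = Node(1), Node(2), Node(3)
    a.next, b.next, c.next = b, c, b     # cycle: b -> c -> b
    print(has_cycle(a))                  # True

(Brent's cycle-finding algorithm is another classic answer with the same bounds.)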

Sometimes you interview people who can't think of a single workable algorithm, even without any constraints on the solution. And then, sometimes, you meet someone whose mind navigates mysterious higher planes of thought, someone an order of magnitude smarter than you or anyone else you know.

Once a colleague, whose initials are G.Y., heard this problem and immediately said, "Ah yes, that's almost the same problem as finding the cycle length of a pseudo-random number generator whose next value depends only upon its last value. There are two algorithms in exercises in Knuth." He pulled Seminumerical Algorithms (IIRC) from my shelf and turned straight to the relevant page. "See?" And there they were, isomorphic to my two preferred solutions. Damn!

Thursday, August 29, 2002
ISE 5.1i
Or as I prefer to think of it, five point oh.

As you read this piece, consider:

Those that can, do.
Those that can't, write weblog entries about what those that could, did.
Xilinx announces World's Fastest Software for Programmable Logic and System Design: ISE Version 5.1i. Marketing summary.

Michael Santarini, EE Times: Xilinx overhauls FPGA software design package.

(I haven't yet received my copy, so these comments are based only upon the press release and related materials.)

These announcements do show that Xilinx is working hard to make it easier and faster to target their devices. There does seem to be some there there, and many new features and tools to master/remaster.

Besides the usual 2X speedup and quality of results claims :-), some of the more interesting new features include Macro Builder, Architectural Wizards, and support for partial reconfigurability.

I also noted some things that weren't there. Most notable was any description of functionality that rivals Altera's SOPC Builder (my comments). A similar tool would seem to be the key enabler to help designers get started exploiting PowerPC/MicroBlaze/CoreConnect-based SoC design. Maybe it's in ISE 5.1i but not hyped in the marketing materials. (Also missing: I wonder when and if Xilinx will productize the technologies they purchased in the LavaLogic acquisition.)

"Throughout the remainder of 2002, Xilinx will roll out a series of embedded and system-level ISE 5.1i family and EDA partner design tools that enable on-demand architectural synthesis and flexible hardware/software partitioning."
I assume that means these tools are not in this release. Oh well, there's not that much left of 2002... On the subject of architectural synthesis, this press release is worth a read and a ponder.

Maybe I'm comparing apples and oranges, but this Xilinx vision seems to be broader, yet less focused on SoC design capture, than the Altera SOPC Builder reality.

Macro Builder
Macro Builder:

"Macro Builder: ... making it possible for customers and IP vendors to capture physical implementations of internally-developed IP; preserve placement information for timing-critical blocks; and ensure repeatable high-performance in future designs. Changes in other parts of a design do not affect macro performance, further supporting change without risk."
Is this RPMs for designers too lazy to figure out RLOCs? Or too busy to use them? RLOCs give you full control of placement, plus repeatable results. With RPMs, you control the initial placement. There is no pushing on a rope. What you want is what you get. LUTs and DFFs in your datapaths go where you put them, period. On the other hand, particularly for control logic, adding placement constraint attributes is slow and tedious; if Macro Builder can make it easy to build repeatably floorplanned control units, that would be a major advance.

(As I have noted before, RPMs can "halve critical path net delays, and, most importantly, make the delays predictable so you have terra firma upon which to make methodical implementation decisions".)

The current floorplanner allows manual control of placement, but when you change your source code, too often the synthesizer scrambles up all your synthesized net and instance names and the floorplanning process must be repeated.

I wonder how Macro Builder helps address this issue, which always seemed to me to be an unnecessary synthesis induced problem. There is no reason synthesis tools cannot fabricate synthesized names, in a deterministic way, based upon the topology of the circuit nearby. Certain synthesized instances, such as slices of registers, can be given repeatable names derived from identifiers in the source code. (The 7th bit of register pc is of course pc<7>.) And a gate between register X and register PC[7] could be named $lut_X_PC<7> or $lut_hash("X_PC<7>"), instead of $lut1234. Under this regime, no matter what changes you make to other parts of your design, this particular synthesized LUT is going to be repeatably named $lut_X_PC<7>.
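
To make that concrete, here is a hypothetical Python sketch, in the spirit of the $lut_hash("X_PC<7>") idea above, of deriving repeatable names from local topology. (The naming scheme and hash are mine, not any shipping tool's.)

    import hashlib

    def canonical_lut_name(fanin_nets, fanout_net):
        # Name a synthesized LUT by what it connects to, rather than by
        # a global counter: the same local topology yields the same name
        # on every resynthesis, no matter what changed elsewhere.
        key = ",".join(sorted(fanin_nets)) + ">" + fanout_net
        return "$lut_" + hashlib.md5(key.encode()).hexdigest()[:8]

    # The gate between registers X and PC<7> is always named identically:
    print(canonical_lut_name(["X", "PC<7>"], "PC_next<7>"))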

(Well, if it were my problem to solve, first I'd lobby the synthesis tools vendors to synthesize to repeatable, canonicalized names. I strongly suggested doing exactly this to Xilinx two years ago. This helps with incremental design respins too. Perhaps this has happened. In lieu of that, I'd start applying some graph and subgraph isomorphism algorithms. Perhaps the problem can be recast as a simple subtree isomorphism, so good old linear-time CSE value-numbering suffices.)
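
(A minimal sketch of that last idea, assuming a made-up netlist encoding with gates listed in topological order: two gates receive the same value number exactly when their kind and input value numbers match, so structurally identical subtrees can be matched up, and named, in linear time.)

    def value_number(netlist):
        # netlist: output net -> (gate kind, input nets), topologically
        # ordered. Isomorphic subtrees hash to equal value numbers.
        table = {}   # (kind, input value numbers) -> value number
        vn = {}      # net -> value number
        for net, (kind, inputs) in netlist.items():
            key = (kind, tuple(vn.get(i, i) for i in inputs))
            vn[net] = table.setdefault(key, len(table))
        return vn

    nl = {
        "n1": ("AND", ("a", "b")),
        "n2": ("AND", ("a", "b")),   # same shape as n1
        "n3": ("OR", ("n1", "c")),
        "n4": ("OR", ("n2", "c")),   # isomorphic to n3
    }
    print(value_number(nl))   # n1/n2 share a number, as do n3/n4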

RLOCs or no, fundamentally, there is a skill set that tools like Macro Builder cannot automatically apply. And that is The Art of High Performance FPGA Design.

Elsewhere, in its description of PACE (Pinout & Area Constraint Editor), Xilinx says it offers "design rule-driven floorplanning". I wonder what that is.

Nothing in the Macro Builder literature mentions RPMs. Oh dear. Please, Xilinx, please assure us that there is clean, smooth integration and composability between RPMs of RLOCs and this new Macro Builder feature.

Architectural Wizards
Architecture Wizards:

"For instance, the Digital Clock Managers (DCM) wizard and RocketIO multi-gigabit transceivers (MGT) wizards let the user graphically set DCM and MGT functions through dialog boxes available in the ISE Project Navigator. ISE then writes editable source code directly into the HDL source file to set and control these advanced capabilities. The Architecture Wizards enable correct-by-construction HDL code, alleviating the need to learn all of the programming attributes required to configure these powerful flexible device features, thereby speeding the design cycle. In a related announcement, Xilinx and Cadence delivered a kit for designing with MGTs in Virtex-II Pro devices."
Earlier this year, on Virtex-II Pro, I wrote:
"Xilinx and its partners are going to be challenged to tie a neat bow around these technologies, using reference designs, using complexity busting "easy IP", using tools like their forthcoming System Generator for PowerPC, so that engineers new to SoC design, embedded processors, or high speed interconnects, can successfully apply all this great silicon."
This seems to be a reasonable step in this direction. Of course, while making it easy to author the HDL that correctly sets attribute bits and wires the darn things up is very helpful, it is only one small piece of the puzzle of using these advanced IP blocks.

"In a related announcement, Xilinx and Cadence delivered a kit for designing with MGTs in Virtex-II Pro devices."
Perhaps the more significant announcement. Kits and reference designs (including software), that's the ticket.

Partial reconfigurability
Partial reconfigurability FAQ. Interesting, exotic, obscure, off-the-mainstream.

"Design communication between modules on TBUF bus macros". TBUFs, the Rodney Dangerfield of Xilinx device primitives. Maybe there's some hope for them after all. Naah.

And now for an essay. Ironically, I haven't been inspired to do any new FPGA design work for several months now.

The Art of High Performance FPGA Design
The trick to getting best area and performance out of FPGA designs is to not lose 50% quality of results here and there (and there again). In a nutshell: first you have to acquire The Knowledge so that you intuitively come to evaluate area and delay costs in terms of FPGA primitives. Then you have to apply it through a set of Best Practices (if you will forgive the tired cliche).

The Knowledge
If you want to be a cab driver in London, you first must earn The Knowledge. Students study for many months to memorize the thousands of little streets in London and learn the best routes from place to place. And they go out every day on scooters to scout around and validate their book learning.

Similarly, if you want to be a great FPGA-optimized core designer, you have to acquire The (Device) Knowledge. You have to know what the LUTs, registers, slices, CLBs, block RAMs, DLLs, etc. can and can't do. You have to learn exactly how much local, intermediate, and long routing is available per bit height of the logic in your datapath and how wide the input and output buses to the block RAMs are. You have to learn about carry chain tricks, clock inversions, GSR nets, "bonus" CLB and routing resources, TBUFs, and so forth.

You also need to know the limitations of the tools. What device features PAR can and can't utilize. How to make PAR obey your placement and timing constraints, and what things it can't handle. And how to "push on the rope" of your synthesis tools to make them emit what you already know you want.

The Knowledge isn't in any book, alas. Yes, you can read the 'street maps', e.g. the datasheets and app notes, but that only goes so far. You have to get out on your 'scooter' and explore, e.g. crank up your tools and design some test circuits, and then open up the timing analyzer and the FPGA editor and pore over what came out, what the latencies (logic and routing) tend to be, etc.

Best practices
A slow FPGA design is usually one with too many logic levels, a bad placement that runs nets (slow programmable interconnect) back and forth across half the chip, or both. Too often the designer writes the HDL without even knowing ahead of time what the FPGA result is going to be.

In contrast, a fast, compact FPGA design is one where the final FPGA implementation is in mind from the start; where the eventual area and cycle time results are no surprise. Here the designer knows what he or she wants and what the fabric is capable of; the task at hand is simply coding the design so that the tools emit the desired set of primitives and constraints.

Thoughtful technology mapping is crucial. When you design your architecture, or when you code your HDL, you must always be conscious of how your design will map into FPGA primitives. (In custom VLSI design you think in levels of gate delay; so it is in FPGAs, with levels of LUT delay.) Then you may have to run stripped-down test designs through the synthesis and place-and-route tools to double-check that your technology mapping assumptions are valid. Sometimes you have to spend an hour figuring out how to stand on your head (push on a rope) to make the tools emit what you know the FPGA fabric is capable of. Even a detail as small as how the global sync or async reset net is coded (in terms of Verilog always @() idioms) can make a significant difference. Sometimes you completely override your synthesis tool and provide an explicit LUT-at-a-time technology mapping to achieve the crucial trick that saves you a whole column of LUTs.

Floorplanning is an essential tool for managing interconnect delays. Xilinx: use RLOC relative placement constraints to hierarchically build up hard macro blocks of programmable logic that are constrained to a fixed relative placement; such a block is called an RPM (relatively placed macro). RPMs provide predictable and repeatable performance. That is important to your customers, and it is mandatory for you: consistently repeatable delays provide terra firma upon which to make methodical, iterative implementation decisions. Without floorplanning, you're just playing whack-a-mole with the critical path.

As I like to design in Verilog, and as Verilog lacks GENERATE statements for parameterized macros, of late I usually write Python programs that emit parametric RLOC-annotated structural Verilog, which (usually) passes through the synthesis tool unmolested; these constraint-bearing primitive instantiations are then properly respected by the mapping, placement, and routing phases.
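
For flavor, here is a toy generator in that spirit. (A sketch only: FDCE is a real Xilinx flip-flop primitive, but the RLOC attribute spelling below is illustrative; the exact syntax depends on your synthesis tool and device family.)

    def emit_register_column(name, width):
        # Emit one flip-flop instance per bit, each annotated with a
        # relative placement constraint so bit i lands in row i.
        # (Attribute spelling is tool-specific; adjust to taste.)
        lines = []
        for i in range(width):
            lines.append(f'// synthesis attribute RLOC of {name}_{i} is "R{i}C0"')
            lines.append(f"FDCE {name}_{i} (.C(clk), .CE(ce), .CLR(rst), "
                         f".D({name}_d[{i}]), .Q({name}_q[{i}]));")
        return "\n".join(lines)

    print(emit_register_column("pc", 16))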

You must also pore over the interactive timing analyzer reports to study where the critical paths are. Perhaps a blob of logic needs to be rewritten. Perhaps some retiming is in order. Perhaps the RPM floorplan must be rearranged to shorten some interconnect delay. Perhaps some duplication of high-fanout control registers will shave a half ns from the cycle time. It is always necessary to add timing constraints (cycle time budgets) to the design, declaring clock periods, disclaiming false paths and multicycle paths, and so forth, and to refine these constraints as the design tunes up.

Above all: iteration, iteration, iteration. While an expert can land in the vicinity of an optimal design in short order, there is no substitute for the grunt work of running the design (or experimental subsets thereof) through the tools over and over and over again (a hundred times) while you make little tweaks, redesign parts of datapaths, and constrain and further constrain mapping, placement, and timing.

If you have the opportunity, it also helps to iteratively evaluate and modify architecture and implementation together. Sometimes small changes to the architecture can save both area and time in the implementation.

Tuesday, August 13, 2002
Lattice's new ispXP FPGA Line
A new commercial SRAM-based FPGA architecture is about as rare as a total solar eclipse. The two big companies hold such commanding positions, mindshare, and patent portfolios that new entrants to this marketplace are few and far between. But last month Lattice boldly jumped into the fray with the launch of their ispXPGA family. (Recall it was not so long ago that Lattice also acquired the ORCA 2, 3, and 4 families from Agere/Lucent (nee AT&T Microelectronics).)

Lattice: Lattice Semiconductor Introduces World's First Infinitely Reconfigurable Instant-On FPGA.

(I must note that Lattice apparently won't let you access their data sheets and related literature online unless you register for a Lattice Web Account. What a clever way to drive potential customers away! I suppose someone at Lattice decided it is better to capture the identities of a determined few than to disseminate crucial information on their newest products to as wide an (anonymous) engineering audience as possible. Therefore, dear reader, in this instance, I have refrained from "deep linking" to Lattice data sheets, white papers, app notes, and so forth. If you want to know more, you will have to register for your very own Lattice Web Account. Not that interested? I don't blame you.)

ispXPGA home. There you will find links to the data sheet and other literature. The data sheet references a number of tech notes (TNxxx) but I was unable to locate some of those.

Having reviewed the data sheet, the new ispXPGA looks to be a competent new SRAM-based-yet-EEPROM-configured 4-LUT FPGA architecture. With four 4-LUTs per PFU (programmable function unit; think CLB), and some embedded RAM blocks, it is most reminiscent of the Xilinx Virtex (not Virtex-II) family. Yet it also echoes certain features of the Altera 10K family (PLLs).

Of course, it offers some unique new capabilities. Most notably, each ispXPGA device has both SRAM-based active configuration memory, and EEPROM non-volatile configuration memory. On power-up, the device loads its SRAM configuration memory from its EEPROM. This has several advantages. It is simpler. It uses less board real estate. It is more secure (there is no off-chip configuration bitstream download to capture). You can pre-program devices at your facility before mounting them on PCBs or selling them to your customers. You can of course still download a configuration to the SRAM. You can also download a configuration to EEPROM without disturbing the current configuration in SRAM. Loading a configuration from EEPROM is very fast, and this is probably the first SRAM FPGA you can categorize as "instant on". So in some ways it's rather like a 2-context FPGA with one of the contexts being non-volatile.

Perhaps someone could help me reconcile the marketing statement "infinitely reconfigurable ... FPGA", and the device data sheet that guarantees the EEPROM for a minimum of only 1000 programming cycles.

ispXPGAs will be available in four sizes, from the 1936-LUT (and 92 Kb block RAM) ispXPGA125 to the 15376-LUT (414 Kb) ispXPGA1200. The former is roughly comparable in size to the 1536-LUT (64 Kb) XCV50E, the latter to the 13824-LUT (288 Kb) XCV600E. Size-wise, the ispXPGA family is not competitive with some of the larger parts from Xilinx or Altera, for example the XCV3200E (~65,000 LUTs) and the XC2V8000 (~93,000 LUTs).

Let's review some of the more unusual elements of the architecture.

  • The chip appears to accommodate a wide range of VCC voltages: 1.8V, 2.5V, and 3.3V. How?

  • Each PFU has four 4-LUTs, a 4-bit carry-logic circuit, a wide-logic structure, and 8 flip-flops. Each LUT also has an AND gate a la Xilinx's MULT_AND for cheap and cheerful Booth multiplication. At each LUT, you can register any two of: the LUT output, the carry-logic unit's sum output, certain LUT and SEL inputs, and the wide-logic output. As Disman points out, two registers per LUT may be a nice feature, but only if it is exploited by your synthesis tool.

    I don't know about the 2 'flops per LUT. My processor datapaths are fully pipelined and they never need more than one flop per LUT. On the other hand, sometimes in a floorplanned design you want to register the datapath control signals, or even replicated control signals, adjacent to, or better yet, embedded in, the datapath. This reduces the interconnect delays. This architecture would accommodate that nicely. On the other other hand, Virtex-II's buffered Active Interconnect greatly reduces the need for careful control register placement and replication (as fan-out doesn't hurt nearly so much).

    Also, if a deeply pipelined datapath needs to delay certain results across more than one clock cycle, then multiple 'flops per LUT might be just the thing.

  • The 4-LUTs can be configured as distributed RAM, up to 64 bits per PFU (32 bits when dual-ported). Details are not provided. Since the ispXPGA family, like Xilinx's, provides distributed RAM apparently suitable for building small, fast register files, it should be a good platform for a small, fast RISC processor soft core, or a multiprocessor of same.

    Each LUT can also be configured as an up to 8-bit shift register.

    Speed seems competitive, with tLUT4 of 390 ps (-5 device) to 550 ps (-3), as compared to Virtex-E data sheet tILO's of 350 ps (-8) to 470 ps (-6). It's harder to compare adder, distributed RAM, or interconnect delays.

  • The "variable length" inter-PFU programmable interconnect sounds Xilinx-ish, but details are not provided. The horizontal and vertical long lines are "tri-statable" but details are not provided.

  • There are numerous dual-ported embedded block RAMs, apparently in the center and at the periphery of the device, but unlike Xilinx and Altera, there does not seem to be the flexibility to use a wide data bus on one port and a narrow one on another. In contrast, you can configure a Virtex-II dual port block RAM with one 1-bit wide port and one 32-bit wide port, which is quite useful for building high speed SERDES FIFOs.

    The Lattice BRAMs provide both synchronous and asynchronous read modes. In async read mode, with WE deasserted, the DATA read changes tEBADDO after ADDR changes, independent of CLK.

  • There are eight PLLs for clock multiplication and division (and presumably, to eliminate clock delay (delay by 360 degrees)).

  • The sysIO I/O blocks seem roughly comparable to the programmable I/O facilities provided in Xilinx and Altera FPGAs. There are 4-20 high speed differential serial blocks with integrated clock data recovery, SERDES, and 8B/10B (and 10B/12B) coding.
Overall, this looks to be a reasonable and plausibly competitive family of devices, particularly if a few hundred thousand or one million "system gates" of programmable logic is sufficient for your application. The several conveniences, including VCC as high as 3.3V, and the integrated EEPROM configuration memory with improved design security and "instant on", are sufficiently attractive that (if priced right), this family could certainly garner some design wins. But the competition is intense, and in the smaller devices, Lattice must strive to be price competitive with the formidable Spartan-IIE family and comparable Altera offerings (taking into account the costs of external FLASH config ROM).

(These days, the silicon is only half the story. Another factor in the strong positions of both Altera and Xilinx is their FPGA development tools products, which reflect over a decade of innovation, iteration, and improvement. Thus Lattice must ship tools that reflect comparable refinement.)

Murray Disman, ChipCenter: Lattice Introduces FPGA.

"An on-chip regulator allows the use of 1.8V, 2.5V, or 3.3V for the logic core's power supply."

Anthony Cataldo, EE Times: Lattice lands programmable-logic combo punch.

Graham Prophet, EDN Europe: High-density programmable logic uses dual-memory structure.

Applying FPGA SoCs
Jesse Kempa, Altera, at ChipCenter: Maximizing Embedded System Performance in the Era of Programmable Logic (PDF). A very nice article, based upon the task of speeding up a Nios SoC-based HTTP server, illustrating that creative application of programmable logic can deliver big speedups over a pure software approach.

"Implementing a microprocessor core in programmable logic offers many ways to customize an embedded system to fit the performance goals of a project that are not available in traditional design methodologies. The performance boost of two simple optimization methods performed in the above examples and combining these two in the final system has shown that system performance is by no means solely dependant on clock speed or Dhrystone MIPS. The continuing evolution in programmable logic devices and tools carries with it the ability to rapidly create powerful systems with close integration between hardware and software design."

Software defined radio
Some day, SDR might be a huge consumer of programmable logic silicon. EE Times has a splendid set of articles that explores the technology and business considerations.

Loring Wirbel, EE Times: Economics may rule out SDR, despite benefits. Nice analysis, but ... what a bummer!

Chris Dick, Xilinx, in EE Times: A case for using FPGAs in SDR PHY. Nice survey of the current FPGA-based software defined radio space.

Additional articles from Elixent, Analog Devices, and QuickSilver Technology.

Design security
Anthony Cataldo, EE Times: Actel pushes for better FPGA security safeguards.

"But when it comes to security, the SRAM-based FPGA has an Achilles' heel. Such devices require a separate PROM memory, which stores the configuration bits that are sent to the FPGA upon power-up. The configuration bits are thus exposed en route to the FPGA, and can be captured using a probe."
This should not be a problem in most cases, if the bitstream is triple-DES encrypted (Virtex-II and Pro) or if the bitstream is preloaded at the factory, with battery-backed up configuration memory, assuming you can address the battery lifetime and field replacement strategy.

[07/24/02] Ken McElvain, Synplicity, in EE Times: Future looks programmable. A paean to FPGAs.

Creative accounting, marketing gates style
Xilinx says the XC2V8000 has "104,832" logic cells. Yet its data sheet states it is 112x104 CLBs and each CLB has 8 LUTs. That's 93,184 LUTs. Sigh.

Monday, August 12, 2002
More obituaries, alas
Today, Google sports a link, "In memoriam, Edsger W. Dijkstra, 1930-2002." How cool is that? John Markoff, The New York Times: Edsger Dijkstra, 72, Physicist Who Shaped Computer Era, Dies. Edsger Dijkstra.

Kristen Nygaard
Last week I posted an obituary for Ole-Johan Dahl, one of the two designers of Simula, and fathers of object-oriented programming. Unfortunately his colleague, Kristen Nygaard, has also just passed away.

Larry Tesler: Kristen Nygaard 1926-2002.
Home Page for [Hjemmeside for] Kristen Nygaard.
Dahl and Nygaard: How Object-Oriented Programming Started.

FCCM Report
On 4/21-4/24, I attended FCCM'02, in Napa, CA. Here's a rather belated, partial write-up. Unfortunately I misplaced my Proceedings and my written notes therein, so some of this is from memory; please forgive my mistakes.

It was an interesting, not stunning, conference. For me, the highlight was not any particular presentation, but the Tuesday evening session on programmable logic in nanoelectronics. More on that below.

Monday -- day one
K.H. Tsoi et al, CUHK, A Massively Parallel RC4 Key Search Engine.

"A total of 96 RC4 decryption engines were integrated on a single Xilinx Virtex XCV1000E ... The resulting design operates at a 50 MHz clock rate and achieves a speedup of 58 over a 1.5 GHz Pentium 4."
The presenter described an elegant mapping of the RC4 algorithm to FPGA fabric, including efficiently embedding the S[] array and S[] lookups in a block RAM. The resulting RC4 core is sufficiently compact that it can be instantiated 96 times in a V1000E.

I appreciated that the designers used best practice FPGA design techniques, including replacing LUT-based 5-1 muxes with TBUF based ones, and RLOC-based floorplanning of their cores. That said, the design is block RAM constrained and has lots of "white space".

The design uses the Pilchard FPGA-in-an-SDRAM DIMM card in a Linux PC platform to provide high bandwidth low latency interconnect to the host.

This system searches 6M keys/s and can exhaustively search a 40-bit key space in 50 hours.

The presenter notes that a further factor of six speed up would be possible if they used a 2X faster clock (100 MHz) and an FPGA with 3X more block RAM (such as the XCV812EM).

T. Mitra, NU Singapore, et al, An FPGA Implementation of Triangle Mesh Decompression.

"The first hardware implementation of triangle mesh decomposition."

In 3D rendering systems, a tessellated surface is represented as a mesh of triangles. The triangles are sent from the geometry engine (often a PC host) to the rendering engine. It is important to compress the mesh data to reduce bus bandwidth requirements. If you walk the triangles in a systematic order, it is easy to see how to typically send only one vertex per new triangle (reusing the prior two vertices); see the sketch below.
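
(A Python sketch of just that baseline scheme, a generalized triangle strip, before the cleverer frontier-buffer walk described next.)

    def decode_strip(vertices):
        # After the first triangle, each arriving vertex forms a new
        # triangle with the previous two, so the stream carries roughly
        # one vertex per triangle. Winding alternates so all triangles
        # keep a consistent orientation.
        triangles = []
        for i in range(2, len(vertices)):
            a, b, c = vertices[i - 2], vertices[i - 1], vertices[i]
            triangles.append((a, b, c) if i % 2 == 0 else (b, a, c))
        return triangles

    print(decode_strip(["v0", "v1", "v2", "v3", "v4"]))   # 5 vertices -> 3 triangles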

But you can do much better. By walking the triangles in a still more clever order, for instance by systematically expanding a frontier of triangles in a clockwise or counterclockwise walk from a seed triangle, you can send fewer than one vertex per triangle, instead referring back to previously sent vertices in a vertex "frontier buffer".

The presenter described some nice hardware optimizations so that the frontier buffer can be efficiently implemented using an external RAM, plus a small cache of left and right vertices adjacent to the current edge; further, the implementation takes advantage of the perfect cache prefetching made possible by the clever mapping of the buffer to hardware.

The presenter described an FCCM to implement this process in an FPGA, specifically a PCI Pamette board. The system processes about 8 M triangles/s, and reduces the triangle mesh bus bandwidth requirements by about 83%.

Nicholas Weaver, UC Berkeley, The Effects of Datapath Placement and C-slow Retiming on Three Computational Benchmarks. One of the benchmarks was a pipelined RISC datapath. My recollection is that Weaver showed that by floorplanning and 3-slow retiming (3-threading) the datapath, he was able to improve its speed from 50 MHz to 100 MHz.

Demo Night: JHDL xr16vx integrated development environment
I caught demo night presentations by BYU, CUHK, Altera, Xilinx, Annapolis Micro, and others.

The demonstration given by the Configurable Computing Laboratory of Brigham Young University was wonderful. Prof. Brent Nelson and his students, particularly Eric Roesler, have taken Mike Butts' xr16vx independent JHDL reimplementation of the xr16 instruction set architecture and run with it. They demonstrated an integrated environment that included xr16vx, the xr16 compiler tools (ported to Linux, and with some bug fixes and enhancements), and the JHDL framework. They could do source-level debugging, assembly-level debugging, single-stepping, etc. You could single-step the processor and view, in the JHDL environment, a generated schematic with all signals and buses annotated with current signal values. The team also showed an xr16 state window which showed PC and next PC, current and next instruction, the xr16 register file, the immediate prefix, and also the two inputs and the output of the ALU.

(Looking back now, I told them, as I told Mike Butts long ago, that I would split the XSOC Project Kit into two pieces, relicensing the XSOC architectural and compiler tool chain components (save lcc which has its own license) under some open source license. Sorry folks. Please stay tuned just a little while longer.)

Tuesday -- day two
R. Franklin, et al, BYU, Assisting Network Intrusion Detection with Reconfigurable Hardware. Presenter Prof. Brad Hutchings, BYU, described how to accelerate SNORT network intrusion detection, using an FPGA to match the SNORT intrusion signature regular expression database. The results were quite good; for example, for a regexp of 4,900 characters, the FPGA implementation scanned at 784 kB/s, compared to a software implementation (Pentium 3/750 MHz) at 1.72 kB/s, for a cost of about 1.25 slices/character in the regexp. This paper builds upon a very clever paper, R. Sidhu and V. Prasanna: Fast regular expression matching using FPGAs. This subject was personally humbling in that Mr. Sidhu asked in this comp.arch thread about the practice of running NFAs in programmable logic. And yours truly weighed in with the conventional wisdom, which is that you convert the regexp to an NFA and from that to a DFA. Bzzt!
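
(The software analogue of running the NFA directly is bit-parallel simulation, one bit of machine state per NFA state, updated a character at a time with word-wide logic; the Sidhu-Prasanna scheme builds the NFA in logic the same way, roughly one flip-flop per NFA state. Here is a Python sketch for the simplest case, a literal pattern, via the classic shift-and algorithm.)

    def shift_and_search(pattern, text):
        # Bit i of `state` is 1 iff pattern[:i+1] matches a suffix of
        # the text read so far -- an NFA with one bit per state.
        masks = {}
        for i, ch in enumerate(pattern):
            masks[ch] = masks.get(ch, 0) | (1 << i)
        accept = 1 << (len(pattern) - 1)
        state = 0
        for pos, ch in enumerate(text):
            state = ((state << 1) | 1) & masks.get(ch, 0)
            if state & accept:
                return pos - len(pattern) + 1   # index of match start
        return -1

    print(shift_and_search("abab", "xxababxx"))   # -> 2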

Tuesday evening -- Survey of Nanoscale Digital System Technology
Speakers: Mike Butts (Cadence Design Systems), Andre DeHon (CalTech), Phil Kuekes (HP Labs). FCCM'02 Nano-Technology Panel Session.

Please return later for my write up of this session.

Press releases du jour
Xilinx: Xilinx Extends Speed And Density Leadership By Shipping Industry's Largest And Fastest Programmable IC. Significantly, the first shipping FPGA with over 100,000 logic cells.

"0.15" micron CMOS? According to this, "The most advanced products within this series, the Virtex-II FPGAs, are built at UMC's 300mm Fab 12A on the company's 130 nm (0.13 micron) eight layer copper/low-k process." [emphasis added]

And Most Expensive? "The Xilinx XC2V8000 is immediately available. Second half 2003 pricing for the XC2V8000 device is $3960 in volumes of 10,000 units." A $40M order! Phew, I wonder what the 3Q02 Q100 price is? ("If you have to ask, you can't afford it.")

Can you implement a 2V8000 design on a current PC? See these threads. I wonder how long a PAR run takes.

Altera: Altera Optimizes Leading-Edge IP Cores for Stratix FPGAs.

Sunday, August 11, 2002
T2
John Jakson reports on his ongoing T2 work (earlier mention).

Saturday, August 10, 2002
Charmed Labs' Xport
Charmed Labs' Xport for the Nintendo Game Boy Advance. FAQ. Xport is an XCS10XL-TQ144-based prototyping kit for the Game Boy Advance, which is itself a cool and inexpensive little machine with an ARM7TDMI, 4 MB of RAM, and a 240x160 TFT LCD, for under $70. The Xport board also has EEPROM for the FPGA configuration memory, and flash for its GBA application memory. The kit requires Foundation Student Edition 2.1i or the like, but includes a GCC tool chain for building apps for the GBA, and utilities for downloading your hardware and software designs into the EEPROM and flash.

The kit is $129. I will order one and a GBA. It could be fun to port XSOC/xr16 to it (time permitting).

Thursday, August 8, 2002
At the movies
Yesterday we saw Spy Kids 2. Good fun, although the story was inferior to "Episode 1".

About two minutes into the film there is a brief shot of the innards of the control panel of the Juggler amusement park ride at the fictional Troublemaker Studios Park. Unlike most films and ads, which use some non-techie's conception of electronics, this control panel was well grounded in reality, and had both Altera and Xilinx Inside. Perhaps that helps to explain the events which ensue...

Deconstructing the relationship
Here's more irreverent follow-up, pure speculation, to my Xilinx/IBM eFPGA piece. Maybe, taken in isolation, the upside for Xilinx of this announcement is not so compelling, considering the numerous engineering, tools, and business complexities involved.

But taking into account the Xilinx/IBM Microelectronics partnership in toto, perhaps this work is the discharge of an obligation, if you will, of the larger deal for IBM's embedded PowerPC and SoC technologies first manifest in Virtex-II Pro. That the two phases were not revealed simultaneously might simply reflect real world engineering scheduling necessities.

Thinking along those lines, an "eFPGA for PPC/SoC (plus/minus royalties and IBM fab capacity)" equitable deal analysis makes this announcement almost predictable. Otherwise it would have been -- what -- a "royalties for PPC/SoC (plus fab capacity)" deal!? Hardly an equitable intellectual property value exchange.

Big companies think big and deal big.

Wednesday, August 7, 2002
Edsger Dijkstra
Edsger Dijkstra died. He was one of the founding fathers of Computer Science. His rigorous and very productive work spanned many subdisciplines of CS and made widespread contributions to computing and to the Betterment of Mankind, so much so that today we take many of his results for granted: shortest path (Dijkstra's algorithm) and many other algorithmic results, programming language (Algol 60) implementation techniques, including stacks for recursive functions, mutual exclusion, semaphores and critical sections (P() and V() -- passeren and vrijgeven -- "to pass" and "to give free" in Dutch), cooperating sequential processes, dining philosophers (deadly embrace, starvation, fairness), structured programming, and the discipline of programming via stepwise refinement, applying invariants, and proofs of correctness.

Go To Considered Harmful (but see also the last paragraph of EWD1308: What led to "Notes on Structured Programming");
EWD498: How Do We Tell Truths that Might Hurt?;
EWD1304: The end of Computing Science?.

I had the privilege of hearing Prof. Dijkstra speak twice in December, 1999. In fact, here is the University of Waterloo CS Seminars Schedule for a few select days that week.

Seminars
Wednesday, 1 December 1999
Timothy Chan: -- The Dynamic Planar Convex Hull Problem
Jan Gray: -- Homebrew Processors and Integrated Systems in FPGAs

Thursday, 2 December 1999
Edsger W. Dijkstra: -- Calculational Mathematics

Friday, 3 December 1999
Edsger W. Dijkstra: -- Proofs and Programs

And some of my notes:
"There were two Dijkstra talks, both theory talks. The first, he gave 20 quick 4-line proofs on properties of "under" (abstraction of <=) and "up arrow" (abstraction of min): (omitted).

"The second, he showed how the techniques derived to prove properties of programs (applying invariants, demonstrating termination, etc) can be used to prove mathematical conjectures. For example, the conjecture "given n unique points, not collinear, there exists a line which passes through exactly 2 points" took many decades to prove in a complex proof, but Dijkstra proved it using undergrad proofs-of-algorithms techniques by constructing an algorithm to find such a line and showing it preserves its invariants and terminates."

"I enjoyed the talks. My brush with greatness was in the second talk. After he concluded his proof, I asked 'don't you have to show your construction there preserves your invariant there?' 'Oh, yes, thank you very much, ... . '"

As this article relates:
"Years from now, if you are doing something quick and dirty, you imagine that I am looking over your shoulder and say to yourself, "Dijkstra would not like this," well that would be immortality for me."

Ole-Johan Dahl
Ole-Johan Dahl died on June 29, 2002. With Kristen Nygaard, he developed Simula, which inspired object-oriented programming and Smalltalk, C++, Java, C#, etc., and much later he also helped design Beta. Along with Dijkstra and Sir Tony Hoare, he co-wrote an influential book, Structured Programming, in 1972.

Friday, August 2, 2002
H&P 3.0
Morgan Kaufmann have just published John Hennessy and David Patterson's Computer Architecture: A Quantitative Approach, Third Edition.

I read the first edition (1990), skipped the second edition (1996), and am now working my way through this new third edition. It presents many of the same themes as the earlier books but, reflecting twelve years of Moore's Law at work and the phenomenon of the internet, is completely overhauled with new data, new examples, and new topics. Besides its traditional focus on price/performance-optimized desktop processors, CA:AQA3e now also explores two other design points: server big iron and embedded system processors. An appendix presents answers to selected exercises. Unfortunately, due to space considerations (1100 pages), appendices C through I are only available online. I encourage you to follow the above link and take a look at some of the appendices. For example, Appendix C, A Survey of RISC Architectures for Desktop, Server, and Embedded Computers, is the best survey I have seen of the evolution of features in commercial RISC architectures.

If you are a serious student or practitioner of computer architecture, then you have already read the first or second edition. So far my experience has been that the third edition is sufficiently different from the first that your time and money will be well spent.

On the other hand, this is not a book for beginners. If you're a newbie, you'll probably be better off reading Patterson and Hennessy's other text, Computer Organization and Design: The Hardware/Software Interface, which explains in great detail how a simple RISC machine works.

Hmm, maybe we should try an online CA:AQA3e study group.



FPGA CPU News, Vol. 3, No. 8
Back issues:
Vol. 3 (2002): Jan Feb Mar Apr May Jun Jul;
Vol. 2 (2001): Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec;
Vol. 1 (2000): Apr Aug Sep Oct Nov Dec.


Copyright © 2000-2002, Gray Research LLC. All rights reserved.
Last updated: Oct 06 2002