FPGA CPU News of November 2000


Monday, November 20, 2000
Peter Clarke, EE Times: Xilinx dives into DSP waters.
"Under the XtremeDSP initiative, Xilinx will place on a Virtex-II FPGA as many as 192 18-bit x 18-bit single-cycle multipliers, associated registers, up to 3.5 Mbits of dual-port RAM and as many as 10 million gates of logic."

"The result, the company claims, is a theoretical performance of 600 billion 8-bit x 8-bit multiply-accumulate cycles (MACs) per second."

"Altera to Ask Court to Reverse Patent Infringement Verdict - If Denied, Will File Appeal; Will Contest Any Xilinx Request for an Injunction".

Friday, November 17, 2000
"Xilinx Wins Patent Case Against Altera". Q&A.
"Xilinx ... today announced that it will seek an immediate injunction against Altera Corporation to stop all shipments of Altera Flex product and its derivative programmable logic devices that infringe two Xilinx patents."

"Altera to Ask Court to Reverse Patent Infringement Verdict - If Denied, Will File Appeal".

USRE034363: Configurable electrical circuit having configurable logic elements and configurable interconnects.

US4642487: Special interconnect for configurable logic array.

Thursday, November 16, 2000
Today, for a change of pace, I'm reading Microsoft's ECMA submissions of the .NET Platform technologies (C# and Common Language Infrastructure).

Wednesday, November 15, 2000
Today we're celebrating this site's 100,000th page view. Thank you for your interest.

Three years ago I noted that multiprocessors-on-an-FPGA had become feasible; later I sketched a strawman FPGA-array-supercomputer. The last pages of my Computer Architecture Education workshop paper also discuss FPGA multiprocessors. More recently I wrote that even in an era of embedded hard processor cores, arrays of soft CPU cores will still play an important role.

This evening, as a lark, I designed an 8-way MP-on-a-chip. I took the work-in-progress GR0000 16-bit RISC core, which is now floorplanned as 8 rows by 6 columns of CLBs, plus two block RAMs (which provide a byte-addressable 16-bit wide 1 KB dual-ported embedded instruction/data memory), and simply instanced eight of them as 2 rows by 4 columns of processors plus memories, in the smallest Virtex-E part (XCV50E), which provides 16 rows by 24 columns of CLBs and 4 rows by 4 columns of block RAMs.
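The tiling arithmetic can be checked in a few lines of Python. This is only a sketch of the numbers quoted above (core footprint, array shape, device size); the function name `tile` is made up for illustration.

```python
# Figures quoted in the text above.
CORE_ROWS, CORE_COLS, CORE_BRAMS = 8, 6, 2      # one GR0000 core
MP_ROWS, MP_COLS = 2, 4                         # array of cores
DEV_ROWS, DEV_COLS, DEV_BRAMS = 16, 24, 16      # XCV50E (4x4 block RAMs)

def tile(core_rows, core_cols, core_brams, mp_rows, mp_cols):
    """Return (CLB rows, CLB columns, block RAMs) used by the array."""
    n = mp_rows * mp_cols
    return core_rows * mp_rows, core_cols * mp_cols, core_brams * n

used = tile(CORE_ROWS, CORE_COLS, CORE_BRAMS, MP_ROWS, MP_COLS)
fits = (used[0] <= DEV_ROWS and used[1] <= DEV_COLS
        and used[2] <= DEV_BRAMS)
print(used, fits)
```

The array uses every CLB row, every CLB column, and every block RAM in the part, which is why the fit is so tight.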

Here's the floorplan; diagonal stripes denote hand-floorplanned primitives:

[Figure: 8-way GR0000 MP floorplan in XCV50E]

Here's the device utilization data. As you can see below, it's a very tight fit, but thanks to the floorplanning, the design placed and routed in three minutes.

Design Summary:
   Number of errors:      0
   Number of warnings:   10
   Number of Slices:                729 out of    768   94%
   Number of Slices containing
      unrelated logic:                0 out of    729    0%
   Number of Slice Flip Flops:      288 out of  1,536   18%
   Total Number 4 input LUTs:     1,392 out of  1,536   90%
      Number used as LUTs:                      1,136
      Number used for Dual Port RAMs:             256
      (Two LUTs used per Dual Port RAM)
   Number of bonded IOBs:            16 out of     94   17%
   Number of Tbufs:                 768 out of    832   92%
   Number of Block RAMs:             16 out of     16  100%
   Number of GCLKs:                   1 out of      4   25%
   Number of GCLKIOBs:                1 out of      4   25%
   Number of RPM macros:            8
Total equivalent gate count for design:  291,768
This design will have a guaranteed-never-to-exceed performance of 8x50 "MIPS". Of course, this is currently a 100% useless MP-on-a-chip, with no interprocessor interconnect, no external I/O, no spare programmable logic for custom instructions/function units, etc., but it stands as proof-by-example of the feasibility of FPGA MPSoCs and further illustrates the utility of simple, compact, floorplanned processor cores.

(To be clear, even the uniprocessor GR0000 is not yet up and running in hardware, nor are its lcc-xr16-derived tools finished yet.)

Sunday, November 12, 2000
Yesterday, I did some more work on the GR0000 implementation. Recall it's a new, space-optimized 16-bit RISC for Virtex/Spartan-II. Early (work-in-progress) implementation results look very promising: 50 MHz in 50 CLBs, in half of an XC2S15-5.

Today, I have been investing in The Knowledge.

As I wrote earlier, each multiplexer in a processor datapath merits close scrutiny. In 4-LUT FPGAs, a 16-bit 2-1 multiplexer uses as much area as a 16-bit register or a 16-bit adder/subtractor.

Therefore it is imperative for the FPGA CPU core designer to find circuit structures that minimize the number of muxes. Alas, some muxes are unavoidable. Consider a processor's program sequencing unit, which determines the next value of the program counter, PC.

  1. Usually PC is incremented by 2 (or 4).
  2. Sometimes (taken branches), a short relative branch displacement is sign-extended and added to PC.
  3. Sometimes (jumps, calls, returns), PC is loaded with an effective address.
How shall we implement this? A naive approach is to write
    if (jump)
        pc_nxt = eff_addr;
    else if (branch)
        pc_nxt = pc + sign_ext({br_disp,1'b0});
    else
        pc_nxt = pc + 2;
which is two adders, and two 2-1 muxes or one 3-1 mux. However, we can save area by forming a simpler mux and sharing an adder:
    pc_disp = sign_ext(branch ? {br_disp,1'b0} : 2);
    pc_nxt = jump ? eff_addr : pc + pc_disp;
That's one small 2-1 mux, one adder, and one 2-1 mux. Better. Can we do better still?
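To check that the shared-adder form computes the same next PC as the naive form, here is a small Python model of both. It is a sketch under assumptions not stated in the text: a 16-bit PC and an 8-bit branch displacement field (`disp_bits`). The function names are illustrative only.

```python
MASK = 0xFFFF  # 16-bit program counter (assumption)

def sign_ext(value, bits):
    """Sign-extend a `bits`-wide field to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def pc_next_naive(pc, jump, branch, br_disp, eff_addr, disp_bits=8):
    """Two adders feeding a 3-1 mux."""
    if jump:
        return eff_addr & MASK
    if branch:
        return (pc + sign_ext(br_disp << 1, disp_bits + 1)) & MASK
    return (pc + 2) & MASK

def pc_next_shared(pc, jump, branch, br_disp, eff_addr, disp_bits=8):
    """Small mux selects the displacement; one adder; one 2-1 mux."""
    pc_disp = sign_ext(br_disp << 1, disp_bits + 1) if branch else 2
    return (eff_addr & MASK) if jump else ((pc + pc_disp) & MASK)

ok = all(
    pc_next_naive(pc, j, b, d, ea) == pc_next_shared(pc, j, b, d, ea)
    for pc in (0, 2, 0x7FFE, 0xFFFE)
    for j in (0, 1) for b in (0, 1)
    for d in range(256)
    for ea in (0, 0x1234)
)
print(ok)
```

The model only shows the two forms are functionally equivalent; it says nothing about area, which is the point of the restructuring.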

Consult The Knowledge. Is there an efficient circuit structure in Virtex that implements that last equation? Or put another way, can

    o = add ? (a + b) : k; 
be implemented in one logic cell per bit?

I tried Synplicity 6.0, but it emits an adder and a mux, i.e. two LUTs per bit. I looked into the Xilinx F2.1i libraries, in particular at the 8-bit loadable counter CC8CLE, and it builds something funky using the Virtex MUXCY and XORCY carry-chain resources, but it too seems to require two LUTs per bit.

Still, it looked possible... I stared for a while at the Virtex slice architecture schematic, and in particular at the MULT_AND, MUXCY, and XORCY resources. And then I did indeed figure out how to implement an add-mux circuit in just one LUT per bit! Here's how.

(If you want to make sense of this commentary, I recommend you follow along with your own copy of the Virtex slice architecture schematic.)

First, a brief review of the Virtex slice architecture. A slice has two 4-LUTs, plus two copies of each of the carry-logic primitives (MUXCY and XORCY) and the multiplier primitive (MULT_AND). In a regular a+b adder, the 4-LUT is usually configured to compute a[i]^b[i]. The MUXCY generates carry-out c[i]: it passes a[i] if a[i]^b[i]==0, or c[i-1] if a[i]^b[i]==1. The XORCY computes the sum bit a[i]^b[i]^c[i-1], as desired.
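The adder arrangement just described can be modeled bit by bit in Python. This is a behavioral sketch of the LUT, MUXCY and XORCY roles only, not of the real slice; the function names are made up for illustration.

```python
def muxcy(s, di, ci):
    """MUXCY: carry-out is CI when S is 1, else DI."""
    return ci if s else di

def xorcy(li, ci):
    """XORCY: sum bit is LI xor CI."""
    return li ^ ci

def carry_chain_add(a, b, n, ci=0):
    """n-bit add built from one LUT + MUXCY + XORCY per bit."""
    total = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        lut = ai ^ bi                 # 4-LUT configured as a[i]^b[i]
        total |= xorcy(lut, ci) << i  # sum bit uses carry-in c[i-1]
        ci = muxcy(lut, ai, ci)       # carry-out c[i]
    return total, ci

# Exhaustive check against ordinary 4-bit addition.
assert all(carry_chain_add(a, b, 4) == ((a + b) & 15, (a + b) >> 4)
           for a in range(16) for b in range(16))
```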

Now then, to this happy arrangement, we wish to add two additional inputs, add and k[i]. It is potentially feasible because, besides a[i] and b[i], there are two unused inputs on each 4-LUT.

To generate a sum and carry per LUT, we must still use the MUXCY and XORCY structures. Therefore, to pass the constant k through when add==0, the carry at every bit must be 0. But if we use the 4-LUT to compute

    o[i] = add&(a[i]^b[i]) | ~add&k[i];
then when add==0 and a[i]==1 and k[i]==0, the MUXCY might still propagate c[i]=a[i]==1 and we will incorrectly generate a carry-out that will cause the next more significant bit's XORCY to toggle k[i+1] into ~k[i+1].
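The failure is easy to reproduce in a small Python model of this naive version, in which the MUXCY data input is the raw a[i]. The function name is hypothetical, and the carry-in to bit 0 is assumed to be 0.

```python
def addmux_no_mult_and(add, a, b, k, n):
    """Per-bit LUT computes add&(a^b) | ~add&k; MUXCY DI is raw a[i]."""
    out, ci = 0, 0
    for i in range(n):
        ai, bi, ki = (a >> i) & 1, (b >> i) & 1, (k >> i) & 1
        lut = (ai ^ bi) if add else ki
        out |= (lut ^ ci) << i          # XORCY
        ci = ci if lut else ai          # MUXCY with DI = a[i]
    return out

# add==0, a[0]==1, k[0]==0: bit 0 emits a spurious carry into bit 1,
# so the output is not k (which is 0 here).
result = addmux_no_mult_and(add=0, a=0b0001, b=0, k=0b0000, n=4)
print(f"{result:04b}")   # prints 0010
```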

That's where MULT_AND comes in. The MULT_AND primitive was provided to help implement compact multipliers. In a multiplier, a x b, if the current bit of the multiplier, b[i], is 0, we add nothing (0 times the multiplicand) to the product. If it is 1, we add one times the multiplicand to the product.

In Virtex, the MULT_AND is provided so that, when b[i]==0 and a[i]==1, instead of passing a[i] through to the MUXCY (and then to the carry-out c[i]), we AND them together, a[i]&b[i], and pass 0, so the carry-out remains 0.

Using this structure, each bit of a parallel multiplier can be written as approximately

    prod[j][i] = b[j] ? prod[j-1][i] + a[i-j] : prod[j-1][i]
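Here is a shift-and-add sketch of that row structure in Python, assuming unsigned n-bit operands and taking row j to be gated by multiplier bit b[j]. The AND of each multiplicand bit with the row's multiplier bit stands in for MULT_AND; the function name is illustrative.

```python
def array_multiply(a, b, n):
    """Multiply two n-bit unsigned values, one multiplier bit per row."""
    prod = 0
    for j in range(n):
        bj = (b >> j) & 1
        # Row j: add the multiplicand, shifted left by j, only when b[j]
        # is 1. Gating each a bit with b[j] is the job MULT_AND does.
        addend = sum((((a >> i) & 1) & bj) << (i + j) for i in range(n))
        prod += addend
    return prod

mul_ok = all(array_multiply(a, b, 4) == a * b
             for a in range(16) for b in range(16))
print(mul_ok)
```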
This said, our goal of
    o = add ? (a + b) : k; 
is now in sight. The trick is to use MULT_AND to zero the a[i] input to the MUXCY when add is false.

Here's the source code for one bit of the circuit (synthesis directives omitted):

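    // One bit of o = add ? (a + b) : k.
    module addmux1(add, ci, a, b, k, sum, co);
        input add, ci, a, b, k;
        output sum, co;
    
        wire add_a, lut;
    
        // LUT: a^b when adding, else k.
        addmux_lut lut_(.add(add), .a(a), .b(b), .k(k), .o(lut));
        // MULT_AND zeroes the MUXCY data input when add is false.
        MULT_AND   and_(.I0(add), .I1(a), .LO(add_a));
        // Carry-out: ci if lut is 1, else add&a.
        MUXCY_L    cy_(.S(lut), .DI(add_a), .CI(ci), .LO(co));
        // Sum bit when adding; k when add is false and ci is 0.
        XORCY      xor_(.LI(lut), .CI(ci), .O(sum));
    endmodule
    
    // LUT function for one bit.
    module addmux_lut(add, a, b, k, o);
        input add, a, b, k;
        output o;
        assign o = add&(a^b) | ~add&k;
    endmodule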

Does this work? I haven't verified it yet. It looks good though.
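Short of running it in hardware, the logic equations can at least be checked with a software model. The Python sketch below mirrors the four primitives of addmux1 and chains n bits together, assuming the carry-in to bit 0 is tied to 0. It checks the Boolean function only; it says nothing about whether the tools actually pack each bit into one logic cell.

```python
def addmux(add, a, b, k, n, ci=0):
    """Software model of n chained addmux1 cells; ci is assumed to be 0."""
    out = 0
    for i in range(n):
        ai, bi, ki = (a >> i) & 1, (b >> i) & 1, (k >> i) & 1
        lut = (add & (ai ^ bi)) | ((add ^ 1) & ki)   # addmux_lut
        add_a = add & ai                             # MULT_AND
        out |= (lut ^ ci) << i                       # XORCY
        ci = ci if lut else add_a                    # MUXCY_L
    return out

N = 4
MASK = (1 << N) - 1
ok = all(
    addmux(add, a, b, k, N) == (((a + b) & MASK) if add else k)
    for add in (0, 1)
    for a in range(1 << N) for b in range(1 << N) for k in range(1 << N)
)
print(ok)
```

In this model, when add is 0 every MUXCY sees either a data input of 0 or a carry-in of 0, so the carry chain stays at 0 and each XORCY passes k[i] through unchanged.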

I am very pleased to find this. It will shave at least one, and perhaps two, logic-cells per bit from GR0000, XR processors, and so forth.

This same idea can be modified for other 4-input adder/mux circuits. For example, it is also possible to do a minimal ALU in one column of LUTs:

    o = add ? a + b : ~(a & b);
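The same kind of software model applies to this add/NAND column. The sketch assumes the LUT computes a^b when adding and the NAND otherwise, that MULT_AND again gates a[i] with add, and that the carry-in to bit 0 is 0. The function name is made up for illustration.

```python
def add_or_nand(add, a, b, n):
    """Model of one LUT column computing add ? a + b : ~(a & b), n bits."""
    out, ci = 0, 0                        # carry-in tied to 0 (assumption)
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        lut = (ai ^ bi) if add else 1 - (ai & bi)   # 3-input LUT function
        out |= (lut ^ ci) << i                      # XORCY
        ci = ci if lut else (add & ai)              # MUXCY, DI from MULT_AND
    return out

N = 4
MASK = (1 << N) - 1
alu_ok = all(
    add_or_nand(add, a, b, N)
    == (((a + b) & MASK) if add else (~(a & b) & MASK))
    for add in (0, 1) for a in range(1 << N) for b in range(1 << N)
)
print(alu_ok)
```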
Finally, it is worth pointing out one problem with this construction. In a conventional adder/mux, the mux latency is obviously one LUT delay. In this new construction this latency will (incorrectly) appear to a static timing analyzer to be up to one n-bit adder delay.

Friday, November 10, 2000
Craig Matsumoto, EE Times: Xilinx deal puts Synopsys in FPGA flow. Coming soon: a flow that takes C or SystemC and puts out gates.
'But generally, Synopsys expects the C-to-RTL design flow to yield the same circuits as a Verilog/VHDL-based flow, if not better, because both will hand off their RTL data to the same synthesis tools. "There is no difference in quality of results, because the synthesis is the same," Kunkel said.'
Apparently EDA vendors hope to realize higher prices on higher-end FPGA SoC tools... Meanwhile, CNets development is on hold while I pursue other opportunities.

Thursday, November 9, 2000
XESS: Introduction to WebPACK 3.1 (PDF): "Using XILINX WebPACK Software to Create CPLD Designs."

Xilinx Empower!/XtremeDSP/SystemIO Platform FPGA coverage
Craig Matsumoto, EE Times: Xilinx signs partners to pull cores onto its FPGAs.
Murray Disman, ChipCenter: Xilinx Reveals Platform FPGA Initiative.
EDN: Xilinx Aligns with 5 Companies for FPGA Platform.

Monday, November 6, 2000
Xilinx Aligns with Industry Leaders To Announce Platform FPGA Initiative. A "must read": [some emphasis added]
"Empower! ... Embedded PowerPC 405 microprocessor cores from IBM will operate at 300 MHz to offer over 420 Dhrystone MIPs of performance ... Additionally, embedded soft processor cores and high performance external interfaces ensure that designers can implement a wide variety of custom solutions."

"XtremeDSP ... For high performance DSP applications, the Xilinx XtremeDSP solutions will support over 600 billion multiply accumulate cycles (MACs)/sec, more than 100 times faster than the industry's leading embedded DSP processor core. The Virtex-II architecture includes fully distributed registers and RAM for efficient FIR filters, up to 3.5 Mbits embedded dual-port RAM for data buffering and embedded 18x18 multipliers for high performance MAC functions."

"SystemIO ... RocketIO gigabit serial interfaces will deliver unprecedented bandwidth for networking and communications systems for Platform FPGAs. Embedded 3.125Gbps SkyRail CMOS serial transceivers licensed from Conexant Systems, Inc. will support 10 Gbit Ethernet, OC-192, Infiniband(TM) and XAUI interface standards."

As Steve Ballmer used to say (and probably still does), "It's a great time to be us."

Saturday, November 4, 2000
We now have our first link from the Xilinx site. Go to the IP Center, and click on Processor Products and there we are. Thanks Xilinx!

Peter Clarke, EE Times: ASM, Philips build 70-nanometer gate. Excellent. I had read that SiO2 won't cut it as the gate insulator at those geometries, and now they've found something 1.1 nm thick that's a million times better! Twenty-five million FPGA gates, here we come!

Over on Scripting News, I comment on why Microsoft has a successful component software ecology (and why Unix and open-source software don't.)

assembling variable-length instructions
Today I am working to finish up the long-delayed xr32 tools story. The specs, core, SoC support, compiler, and simulator are done, but I have to do some more work on the assembler.

For xr16, things are simple. Any reference to a symbol is always going to be a 16-bit address, and so the immediate instruction (addi, lw, etc.) is always going to require an immediate prefix. And any call to a function, all of which are 16-byte aligned, will require a single call instruction.

For xr32, things are more complicated. In xr32, more than one imm prefix is permitted, in order to build up a larger immediate constant, even on the call instruction:

  lw r1,0x5678     -> imm 0x567 ;; lw r1,8(r0)
  lw r1,0x2345678  -> imm 0x567 ;; imm 0x234 ;; lw r1,8(r0)
  lw r1,0x12345678 -> imm 0x567 ;; imm 0x234 ;; imm 0x001 ;; lw r1,8(r0)

  call 0x5670      -> call 0x5670
  call 0x2345670   -> imm 0x567 ;; call 0x234
  call 0x12345670  -> imm 0x567 ;; imm 0x234 ;; call 0x001
So now a symbol's address determines what instruction sequence addresses it. But you don't know the address of anything until all the code and data has been assembled.
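The lw expansions above follow a simple pattern, sketched here in Python. The field widths are inferred from the examples, not stated in the text: the lw carries the low 4 bits of the address and each imm prefix carries the next 12 bits, lowest chunk first. Addresses are treated as unsigned, and `expand_lw` is a hypothetical helper, not part of the xr32 tools.

```python
def expand_lw(reg, addr):
    """Return the instruction sequence for `lw reg,addr` (absolute address)."""
    seq = []
    rest = addr >> 4                  # bits above the 4-bit lw immediate
    while rest:
        seq.append(f"imm 0x{rest & 0xFFF:03x}")   # next 12-bit chunk
        rest >>= 12
    seq.append(f"lw {reg},{addr & 0xF}(r0)")
    return seq

for address in (0x5678, 0x2345678, 0x12345678):
    print(" ;; ".join(expand_lw("r1", address)))
```

The length of the returned list is what the assembler cannot know until the symbol's address is known.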

This is rather like the long-standing XSOC Issue #3, which is that far branch displacements are not implemented. I'm going to fix that one today, too.

This is an old chestnut of the assembler and linker world. One good strategy is to do a first pass of assembly, building a table of symbolic references ("fixups"), assuming all references are short and emitting the shortest code sequence possible. Then loop over the whole program image, resolving every fixup. If a fixup site makes a reference to a symbol that is "too far" away, you must insert instructions to implement the long reference. Inserting code moves the following code down, which may make some formerly short references long, which may require yet more instructions, etc. Eventually the whole thing settles down to a fixed point, no insertions occur, and you can stop.
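Here is a minimal Python sketch of that fixed-point loop. The sizes and reach (`SHORT`, `LONG`, `REACH`) are made-up numbers, distance is measured from the start of the referencing instruction, and the item format is invented for illustration; none of this is the xr assembler's actual data structure.

```python
SHORT, LONG, REACH = 2, 4, 64   # hypothetical sizes (bytes) and short reach

def relax(items):
    """items: ('code', nbytes) | ('label', name) | ('ref', name).
    Returns (indices of refs that became long, label addresses)."""
    long_refs = set()
    while True:
        # Lay out the image with the current choice of short/long refs.
        addr, labels, sites = 0, {}, []
        for idx, (kind, arg) in enumerate(items):
            if kind == 'label':
                labels[arg] = addr
            elif kind == 'code':
                addr += arg
            else:
                sites.append((idx, addr, arg))
                addr += LONG if idx in long_refs else SHORT
        # Resolve fixups: any short ref that cannot reach must grow.
        grew = {idx for idx, at, name in sites
                if idx not in long_refs and abs(labels[name] - at) > REACH}
        if not grew:
            return long_refs, labels      # fixed point: nothing inserted
        long_refs |= grew

program = [('ref', 'near'), ('ref', 'far'), ('code', 60),
           ('label', 'near'), ('code', 100), ('label', 'far')]
long_refs, labels = relax(program)
print(sorted(long_refs), labels)
```

In this example the reference to `near` starts out just within reach, and only becomes long after the reference to `far` grows and pushes `near` two bytes further away. Because a reference never shrinks back to short, the loop must terminate.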

One additional complication: the xr call instruction requires the called function to be 16-byte aligned. Therefore all functions are 16-byte aligned. If you apply the 16-byte alignment padding early, it will be invalidated as soon as you insert any instructions in some other function that precedes it.

If you wait until fixup resolution achieves a fixed-point, and then apply function alignment padding, the padding that is inserted can make certain address references long, requiring more passes of fixup resolution. If you are not careful, you can insert some padding, insert something ahead of that, insert some more padding, insert something, etc., and the padding per function grows without bound.

Therefore, when we 16-byte-align functions, we must maintain the invariant that there are never more than 14 bytes of padding (0-14) before any function. The code-inserter must maintain this invariant carefully, which means it must now be function-boundary aware.
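One way to keep that invariant, sketched in Python, is to treat padding as derived data: after any insertion, recompute every function's padding from its new start address rather than adding to the padding already there. This assumes 2-byte instructions (so function sizes are even) and is not necessarily how the actual code-inserter works; the function names are illustrative.

```python
ALIGN, INSN = 16, 2   # alignment from the text; 2-byte instructions assumed

def layout(func_sizes):
    """Place functions in order; return [(start, padding_before)]."""
    addr, placed = 0, []
    for size in func_sizes:
        pad = (-addr) % ALIGN        # recomputed from scratch: 0..14
        addr += pad
        placed.append((addr, pad))
        addr += size
    return placed

def insert_bytes(func_sizes, index, nbytes):
    """Grow function `index`, then re-derive all padding (never add to it)."""
    sizes = list(func_sizes)
    sizes[index] += nbytes
    return sizes, layout(sizes)

sizes = [10, 20, 6]
before = layout(sizes)
sizes, after = insert_bytes(sizes, 0, 8)
print(before, after)
```

Since each padding value is a remainder modulo 16 of an even address, it can never exceed 14, no matter how many insertions precede the function.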

Friday, November 3, 2000
FCCM 2001, April 30-May 2, 2001, will be held in Rohnert Park, CA this year due to construction at the Napa site.

It's that time of year again. Time to send a check to Xilinx for a year of maintenance on Alliance Standard.

For as long as there has been an Alliance product, Alliance Standard has targeted the entire family of Xilinx devices. No more. Now Xilinx has introduced a new price tier, Alliance Elite. If you want to target (or even experiment with targeting) devices larger than a VirtexE-1000, you need Elite.

I was quoted a yearly maintenance fee on Elite of about triple that of Standard. Expensive enough to make you think twice. Since I don't currently need to target a device larger than a V1000E, and since (as I understand it) upgrading to Elite would put my Alliance license into the realm of time-based licensing, I'll pass on Elite for now.

Bottom line, Elite will keep at least this one customer from "kicking the tires" on larger devices -- and so I probably won't be publishing any reports on how many processors fit in an XCV3200E, or how fast an array of them will run, anytime soon.

but the good news is...
The free Xilinx WebPack ISE, which includes synthesis, simulation, and place-and-route tools for Spartan-II and Virtex V300E devices, is now available. All told, it's a big download -- over 100 MB.

FPGA CPU News, Vol. 1, No. 5
Back issues: Oct, Sep, Aug, Apr.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.


Copyright © 2000-2002, Gray Research LLC. All rights reserved.
Last updated: Feb 13 2001