XSOC 2.0 Log

Monday, August 7, 2000: Starting a new Virtex-optimized XSOC/xr project.
  Agenda.
  Using block RAMs.
  The Knowledge.
  First steps.
Tuesday, October 10, 2000: Up and running in an XSV-300 board.

Monday, August 7, 2000
Agenda
I am starting on a major cycle to port xr16 (and the not-yet-quite-finished xr32) to Virtex. I will probably also do a new 32-register machine to make it easier to port GCC, binutils, et al. via a cross-assembly or cross-targeting-linker strategy. More on that later.

The goal of the XC4000X xr16 was good performance in as little area as possible, e.g. a streamlined implementation that fit in ~3/5 of an XC4005X. The goal of the new Virtex xr16/xr32 and the new 32-register ISA cores will be leadership performance, with minimal area a second priority.

For xr16, I need at least 67 MHz, and perhaps even 100 MHz, in the slowest speed grade of Spartan-II.

Besides high clock rate, a second goal is to reduce CPI (cycles per instruction) to improve overall performance.

A third goal, harder to quantify, is to minimize the energy consumed per computation.

Another goal of the original xr16 was to make it easy to compose integrated systems from an xr16 and peripherals. Similarly, a fourth goal of this new effort is to make it easy to compose multiple processors and peripherals together into a multiprocessor system-on-a-chip. For example, I hope to demonstrate a 4-processor + peripherals system in a single XCV50 or XC2S50.

I still believe that the key to leadership performance, area, energy consumption, and usability, is to keep things simple, streamlined, and small.

Block RAMs change everything
I have installed Foundation 2.1i, which I understand will also be the basis for the next update to the Foundation Student Edition, and which should therefore be able to target small Virtex and Spartan-II devices.

(I have also installed Alliance 3.1i but haven't yet done anything with it.)

I have been experimenting with F2.1i and trying various circuit constructions in Virtex. I have learned a bit about the architecture of the Virtex Block RAMs which I will present later.

I expect to see two different configurations of the processor cores: one with dedicated instruction and/or data block RAMs and no external RAM, and one that uses block RAMs as I-cache and/or D-cache in front of external RAM.

(See also my thoughts on the myriad applications of block RAMs, Using block RAM.)

The Virtex block RAM is a true dual-ported synchronous SRAM that can read or write with 0 cycles of latency. Present address, enable, and write enable 2-3 ns ahead of the clock edge, and the data comes out just 3-4 ns after the clock rises.
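
For example, here is a minimal sketch of one block RAM used through both ports at once, one port for instruction fetch and one for loads/stores (RAMB4_S16_S16 is the Xilinx dual-port primitive; the surrounding signal names are illustrative, not taken from the XSOC sources):

    // sketch: one dual-ported 256x16 block RAM serving an instruction-fetch
    // port and a load/store port at the same time
    RAMB4_S16_S16 iram(
        // port A: instruction fetch (read only)
        .WEA(1'b0), .ENA(1'b1), .RSTA(1'b0), .CLKA(clk),
        .ADDRA(pc_nxt[8:1]), .DIA(16'b0), .DOA(insn),
        // port B: load/store data
        .WEB(store), .ENB(1'b1), .RSTB(1'b0), .CLKB(clk),
        .ADDRB(addr_nxt[8:1]), .DIB(d), .DOB(load_data));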

The speed of these RAMs raises the intriguing possibility of eliminating the IF pipeline stage. When designing with external memory, you need an IF stage to absorb the one or more cycles necessary for the instruction data to load from RAM. With on-chip block RAM, you can present address late in one cycle and you get your instruction or data early in the next cycle.

By deleting IF, we can reduce the branch and jump latency to 2 cycles. Can we get away with it? It will trim the slack time in the DC stage by Tbcko + routing - Tcko (e.g. the difference in latency between getting the instruction from the output register of the block RAM and getting it from an adjacent IR register in CLBs). This is about 3-4 ns. At 100 MHz, this leaves only 6-7 ns for the DC stage, probably not enough. We'll see.
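
In other words (a hedged sketch, with illustrative signal names and field positions), the block RAM's own output register can serve as the IR, so decode works directly on the RAM output in the cycle after the next PC is presented:

    // sketch: no IF stage -- the block RAM output register *is* the IR
    RAMB4_S16 imem(.WE(1'b0), .EN(1'b1), .RST(1'b0), .CLK(clk),
                   .ADDR(pc_nxt[8:1]), .DI(16'b0), .DO(insn));
    // no separate "ir <= insn" register; the DC stage decodes insn directly,
    // starting ~3-4 ns (Tbcko + routing) after the clock edge
    wire [3:0] op = insn[15:12];   // example decode; field position illustrative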

Furthermore, now that we have a little more area, we can add a mux and possibly a dedicated branch adder, so it may be possible to process branches, calls, and returns in the DC pipeline stage.

Combined with eliminating the IF stage, this would permit one-cycle branches, calls, and returns, compared to three cycles in the XC4000X xr16.
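
A hedged sketch of the idea (the signal names and immediate encoding are illustrative, not the actual xr16 datapath):

    // sketch: a dedicated DC-stage branch adder plus a next-PC mux, so taken
    // branches, calls, and returns redirect the fetch without waiting for EX
    wire [15:0] br_target = pc + {{7{imm8[7]}}, imm8, 1'b0};   // PC-relative
    wire [15:0] pc_nxt    = take_br   ? br_target   :
                            take_call ? call_target :
                            take_ret  ? ret_addr    :
                                        pc + 16'd2;            // sequential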

Similarly, it may be possible to eliminate the LS (load/store) pipe bubble that the XC4000X xr16 suffers, either by doing a data or D-cache access in the middle of the EX stage, or by adding (horrors) a MEM pipe stage. Load/store is less common and less of a priority, however, and it isn't the end of the world if load/store still takes two cycles. We'll see.

The Knowledge
If you want to be a cab driver in London, you first have to acquire The Knowledge. Students study for many months to memorize the thousands of little streets in London and learn the best routes from place to place. And they go out every day on scooters to scout around and validate their book learning.

Similarly, if you want to be a great FPGA-optimized core designer, you have to acquire The Device Knowledge. You have to know what the LUTs, registers, slices, CLBs, block RAMs, DLLs, etc. can and can't do. You have to learn exactly how much local, intermediate, and long routing is available per bit height of the logic in your datapath and how wide the input and output buses to the block RAMs are. You have to learn about carry chain tricks, "bonus" CLB and routing resources, TBUFs, and so forth.

You also need to know the limitations of the tools. What device features PAR can and can't utilize. How to make PAR obey your placement and timing constraints, and what things it can't handle. And how to "push on the rope" of your synthesis tools to make them emit what you already know you want.

The Knowledge isn't in any book, alas. Yes, you can read the 'street maps', e.g. the datasheets and app notes, but that only goes so far. You have to get out on your 'scooter' and explore, e.g. crank up your tools and design some test circuits, and then open up the timing analyzer and the FPGA editor and pore over what came out, what the latencies (logic and routing) tend to be, etc.

I have The XC4000X Knowledge, but I'm a Virtex newbie. I've read the data sheets of course, but only this week have I been playing with the tools and seeing what's what.

If you have any tricks for learning The Knowledge, or any good little Virtex surprises as compared to 4000X, please share them with us. Here's one: in 4000X CLBs, the RAM write-clock can be inverted independently of the FF clocks. Not so in Virtex -- they are either both inverted or neither is.

First Steps
As I said, I've been building all manner of little test circuits, pieces of instruction fetchers and register files and datapaths. For fun, last night I decided to port xr16 to Virtex to see how fast she'll run "out of the box".

1. I started with the XSOC project Verilog model in /xsoc/xsocv from the XSOC 0.93 beta distribution. (I also have some newer code which builds both xr16 and xr32 from one source via `define, but it's not ready for prime time yet.)

First I modified xsoc.v to be just an xr16 processor and one single-ported 256x16 block RAM:

// modified excerpt from xsoc.v, (C) 2000 Gray Research LLC:
    // submodules
    xr16 p(
        .clk(clk), .rst(rst), .rdy(1'b1),
        .ud_t(1'b1), .ld_t(1'b1), .udlt_t(1'b1),
        .int_req(1'b0), .dma_req(1'b0), .zerodma(1'b0),
        .insn(ramd), .mem_ce(mem_ce),
        .word_nxt(word_nxt), .read_nxt(read_nxt),
        .dbus_nxt(dbus_nxt),
        .dma(dma), .addr_nxt(addr_nxt), .d(d));

    RAMB4_S16 ram(.WE(1'b0), .EN(1'b1), .RST(1'b0), .CLK(clk),
                  .ADDR(addr_nxt[8:1]), .DI(16'b0), .DO(ramd));
I deleted MEMCTRL and so on, and hard-wired the processor so that it never stores data to RAM, takes interrupts, or accepts DMA requests; it simply reads instructions from the block RAM.

I ran that through the tools and it had a min cycle time of 27.3 ns.

2. I did a timing analysis. There was tight timing in the half-cycle path from the IR, through the RN muxes, through the reg file RAMs, to the falling-edge-clocked reg file output registers. So I modified xr16.v, datapath.v, and ctrl.v to replace the single-port reg file RAMs with dual-port RAMs. Basically I had to add a new ctrl output 'rnd' (destination register number) and modify the datapath register file like so:

// modified excerpt from datapath.v, (C) 2000 Gray Research LLC:
...
87,88c88,91
<   ram16x16s aregs(.wclk(clk), .addr(rna), .we(rf_we), .d(res),
         .o(areg_nxt));
<   ram16x16s bregs(.wclk(clk), .addr(rnb), .we(rf_we), .d(res),
         .o(breg_nxt));
---
>   ram16x16d aregs(.wclk(clk), .we(rf_we), .wr_addr(rnd), .addr(rna),
>                   .d(res), .wr_o(xa), .o(areg));
>   ram16x16d bregs(.wclk(clk), .we(rf_we), .wr_addr(rnd), .addr(rnb),
>                   .d(res), .wr_o(xb), .o(breg));
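
For reference, here is a hedged behavioral sketch of a dual-port 16x16 LUT RAM with the port names used in the diff above; the real ram16x16s/ram16x16d modules in the XSOC sources (built on the Xilinx distributed RAM primitives) may differ in detail:

    // sketch of a 16-entry x 16-bit dual-port register file RAM
    module ram16x16d(wclk, we, wr_addr, addr, d, wr_o, o);
        input         wclk;     // write clock
        input         we;       // write enable
        input  [3:0]  wr_addr;  // write (destination register) address
        input  [3:0]  addr;     // read (source register) address
        input  [15:0] d;        // write data (result bus)
        output [15:0] wr_o;     // read port at the write address
        output [15:0] o;        // read port at the read address

        reg [15:0] mem [0:15];
        always @(posedge wclk)
            if (we) mem[wr_addr] <= d;

        assign wr_o = mem[wr_addr];  // combinational, like RAM16X1D's SPO
        assign o    = mem[addr];     // combinational, like RAM16X1D's DPO
    endmodule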

I placed and routed that, and now my cycle time was just 18.1 ns, about 55 MHz. Here is the critical path at that point (as output from the timing analyzer):

Delay:    18.142ns p/dp/a<14> to p/branch
          18.134ns Total path delay (16.635ns delay plus 1.499ns setup)
           0.008ns clock skew

Path p/dp/a<14> to p/branch contains 11 levels of logic:
Path starting from Comp: CLB_R5C20.S1.CLK (from clk_BUFGPed)
To                   Delay type         Delay(ns)  Physical Resource
                                                   Logical Resource(s)
-------------------------------------------------  --------
CLB_R5C20.S1.YQ      Tcko                  1.372R  p/dp/a<14>
                                                   p/dp/a_reg<2>
CLB_R10C21.S1.G1     net (fanout=5)        2.255R  p/dp/a<2>
CLB_R10C21.S1.COUT   Topcyg                1.579R  p/dp/N463
                                                   p/dp/C477/C7/C0
                                                   p/dp/C477/C7/C2
CLB_R9C21.S1.CIN     net (fanout=1)        0.000R  p/dp/C477/C7/C2/O
CLB_R9C21.S1.COUT    Tbyp                  0.109R  p/dp/N465
                                                   p/dp/C477/C8/C2
                                                   p/dp/C477/C9/C2
CLB_R8C21.S1.CIN     net (fanout=1)        0.000R  p/dp/C477/C9/C2/O
CLB_R8C21.S1.COUT    Tbyp                  0.109R  p/dp/N467
                                                   p/dp/C477/C10/C2
                                                   p/dp/C477/C11/C2
CLB_R7C21.S1.CIN     net (fanout=1)        0.000R  p/dp/C477/C11/C2/O
CLB_R7C21.S1.X       Tcinx                 0.522R  p/dp/N469
                                                   p/dp/C477/C12/C1
CLB_R9C23.S0.F2      net (fanout=3)        1.694R  p/dp/N469
CLB_R9C23.S0.X       Tilo                  0.738R  N_xa<5>
                                                   C596
CLB_R5C22.S0.F4      net (fanout=1)        1.643R  syn7841
CLB_R5C22.S0.X       Tilo                  0.738R  p/dp/S_98/cell0
                                                   C591
CLB_R2C21.S1.G4      net (fanout=6)        1.430R  p/dp/S_98/cell0
CLB_R2C21.S1.Y       Tilo                  0.738R  syn2231
                                                   C583
CLB_R2C22.S0.G2      net (fanout=2)        0.660R  syn1687
CLB_R2C22.S0.Y       Tilo                  0.738R  syn7892
                                                   C577
CLB_R2C22.S0.F3      net (fanout=1)        0.112R  syn2239
CLB_R2C22.S0.X       Tilo                  0.738R  syn7892
                                                   C575
CLB_R4C20.S0.G3      net (fanout=1)        1.460R  syn7892
CLB_R4C20.S0.CLK     Tick                  1.499R  p/branch
                                                   C573
                                                   p/ctrl/branch_reg
-------------------------------------------------
Total (8.880ns logic, 9.254ns route)      18.134ns (to clk_BUFGPed)
      (49.0% logic, 51.0% route)

As you can see, the critical path is in the EX stage. The adder adds/subtracts the two operands A and B. The sum passes into a zero-detector NOR tree, then into the branch condition logic, finally setting up the 'branch' register. The 8.88 ns logic time shows that even with much better floorplanning, it is still going to be very tough to get this down to a 10 ns cycle time. This suggests we are going to need to do some retiming to push some of this branch condition evaluation logic forward into the next pipeline stage.

(It is funny to note this branch logic *used* to be in the next pipeline stage (e.g. the EX stage of the branch instruction which is currently being decoded), but I pushed it up into the DC stage (retimed it) during my XC4000X xr16 optimization work to fix an EX stage critical path. Now this decision needs to be revisited, because the ratio of adder latency to general logic+routing latency has been reduced in Virtex as compared to XC4000X.)
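
A hedged sketch of the sort of retiming this implies (signal names illustrative):

    // sketch: register the raw conditions out of the adder, then resolve the
    // branch from those registered flags in the following cycle, rather than
    // folding adder + zero-detect + branch select into one path to 'branch'
    reg z_q, n_q;
    always @(posedge clk) begin
        z_q <= (sum == 16'b0);      // zero detect now ends the stage
        n_q <= sum[15];             // sign bit
    end
    wire take_branch = br_on_eq ? z_q : n_q;   // short path from registered flags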

3. Even though I knew the next speed hurdle was architectural, I still wasted a couple of hours experimenting with Virtex floorplanning. While I much prefer to floorplan upstream in the design representation using RLOCs, I tried to see what kinds of floorplanning could be done using INST constraints in the UCF file. (This is what I did to apply a modicum of floorplanning in the current /xsoc/xsocv project, via its UCF file.)

The improvements were not very great (yet); for instance, the cycle time went as low as 16.6 ns (60 MHz, but not yet 67 MHz).

I also learned that 2.1i PAR will not honor a specific INST directive on a specific component if there is also a range INST on the larger module. That is, if you write

    INST p/ctrl/add_reg LOC=CLB_R1C1;
    INST p/* LOC=CLB_R1C1:CLB_R9C9;
the constraint on p/ctrl/add_reg is lost. That seems wrong, but we're stuck with it.

Finally, I learned that my practice of making every flip-flop in my design an asynchronously resettable one:

    always @(posedge clk or posedge rst) begin
        if (rst)
            ff <= 0;
        else
            ...
    end
may not be a good decision for Virtex. To my surprise, the tools actually routed a reset line to all my flip-flops using the general programmable interconnect (instead of simply using the hidden GSR signal).

It may be that I should design with no resets on anything except the core control-unit registers needed to properly initialize the rest of the design as it comes out of reset.
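
A minimal sketch of that style (not the current XSOC code):

    // sketch: datapath state carries no reset term, so no reset net is routed
    // to it; only the control register that sequences the machine out of
    // reset keeps its asynchronous reset
    always @(posedge clk)
        pc <= pc_nxt;                       // datapath: no reset
    always @(posedge clk or posedge rst)
        if (rst) run <= 1'b0;               // control: reset to a known state
        else     run <= 1'b1;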

Tuesday, October 10, 2000
Up and running
This afternoon I resumed the Virtex port of XSOC/xr16/xr32 and am now (finally) running XSOC/xr16 in my XESS XSV-300 prototyping board.

Today's work involved several compromises. Since this board does not have a tool to pre-load the SRAM, I modified the XSOC design to provide a 256x16 boot ROM in a block RAM. I further modified the new fully synchronous MEMCTRL so that instruction fetches from this block RAM signal RDY in the same cycle.
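
The RDY change amounts to something like this sketch (the address decode and signal names are assumptions, not the actual MEMCTRL source):

    // sketch: instruction fetches that hit the on-chip boot ROM are ready in
    // the same cycle; external SRAM accesses use the usual handshake
    wire rom_sel = (addr_nxt[15:9] == 7'b0);   // 512-byte boot ROM at address 0
    assign rdy   = rom_sel ? 1'b1 : sram_rdy;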

Just as with the XSOC/xr16 kit for XS40 boards, the design currently includes a bitmapped VGA display, using the DMA engine in the xr16 CPU core as the video address counter. (With a 50 MHz dot clock, it refreshes the display at 120 Hz!)

Alas, the XSV's two 16-bit SRAM banks both lack byte write enables. For the time being I am using just one byte-wide bank of SRAM. Later I will modify MEMCTRL to perform read-modify-write accesses for byte stores to RAM.
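
A hedged sketch of the read-modify-write byte store (the MEMCTRL specifics and the byte-lane order shown are assumptions):

    // sketch: read the 16-bit word, merge the store byte into its lane, then
    // write the merged word back; a small MEMCTRL FSM sequences the two
    // external accesses (byte-lane order here is illustrative)
    wire [15:0] merged = addr[0] ? {sram_do[15:8], store_byte}
                                 : {store_byte, sram_do[7:0]};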

Using a modified version of xr16 (replacing the double-cycled single-port RAMs with dual-port RAMs), we get a design that TRCE reports will run at 60 MHz in a V300-5. (Not floorplanned yet.) Total size of the design, including MEMCTRL and VGA, is about 400 logic cells.

The design runs fine at 33 MHz. At 50 MHz, the program runs fine, but accesses to the external SRAM frame buffer fail. I will therefore modify the memory controller to insert a wait state on each external SRAM access. That done, I should be able to tune the core design up to 67 MHz in short order, motivating integrated instruction and data caches...
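
Something like this sketch should do for the wait state (again, not the actual MEMCTRL code):

    // sketch: hold off RDY for one cycle on each external SRAM access
    reg ext_wait;
    always @(posedge clk or posedge rst)
        if (rst) ext_wait <= 1'b0;
        else     ext_wait <= ext_sel & ~ext_wait;  // one wait state per access
    assign rdy = ext_sel ? ext_wait : 1'b1;        // on-chip accesses never wait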


Copyright © 2000-2002, Gray Research LLC. All rights reserved.
Last updated: Feb 03 2001