Agenda
I am starting on a major cycle to port xr16 (and the not-quite-finished
xr32) to Virtex. I will probably also do a new 32-register
machine to make it easier to port GCC, binutils, et al
via a cross-assembly or cross-targeting-linker strategy.
More on that later.
The goal of the XC4000X xr16 was good performance in as little area as
possible, i.e. a streamlined implementation that fit in ~3/5 of an XC4005X.
The goal of the new Virtex xr16/xr32 and the new 32-register ISA cores
will be leadership performance, with minimal area a second priority.
For xr16, I need at least 67 MHz performance and perhaps even
100 MHz performance, in the slowest speed grade of SpartanII.
Besides high clock rate, a second goal is to reduce CPI
(cycles per instruction) to improve overall performance.
A third goal, harder to quantify, is to provide best energy
consumed per computation.
Another goal of the original xr16 was to make it easy to compose
integrated systems from an xr16 and peripherals. Similarly, a fourth goal
of this new effort is to make it easy to compose multiple processors
and peripherals together into a multiprocessor system-on-a-chip.
For example, I hope to demonstrate a 4-processor + peripherals system
in a single XCV50 or XC2S50.
I still believe that the key to leadership performance, area, energy
consumption, and usability, is to keep things simple, streamlined,
and small.
Block RAMs change everything
I have installed Foundation 2.1i which I understand will also be
the basis for the next update to the Foundation Student Ed., and
should therefore be able to target small Virtex and SpartanII devices.
(I have also installed Alliance 3.1i but haven't yet done anything
with it.)
I have been experimenting with F2.1i and trying various circuit
constructions in Virtex. I have learned a bit about the architecture
of the Virtex Block RAMs which I will present later.
I expect to see two different configurations of the processor cores:
one with a dedicated instruction and/or data block RAM (no external
RAM) and one with block RAMs for I-cache or D-cache.
(See also my thoughts on the myriad applications of block RAMs, Using block RAM.)
The Virtex block RAM is a true dual ported synchronous SRAM that can
read or write with 0 cycles of latency. Present address, enable, and
write enable 2-3 ns ahead of the clock edge, and the data comes out
just 3-4 ns after the clock rises.
The speed of these RAMs raises the intriguing possibility of eliminating
the IF pipeline stage. When designing with external memory, you need an
IF stage to absorb the one or more cycles necessary for the instruction
data to load from RAM. With on-chip block RAM, you can present address
late in one cycle and you get your instruction or data early in the
next cycle.
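To make that concrete, here is a minimal sketch (module and signal names
are my own illustrations, not from the xr16 source) of fetching straight
out of a block RAM with no separate IF stage:

```verilog
// Sketch (illustrative names): the block RAM's own input register
// plays the role of the IF stage. Present the next PC late in one
// cycle; the instruction appears early in the next.
module fetch_no_if(
    input         clk,
    input  [7:0]  pc_nxt,   // next instruction address, computed in DC
    output [15:0] insn      // instruction, valid early next cycle
);
    // 256x16 block RAM as dedicated instruction memory (read-only here)
    RAMB4_S16 imem(
        .CLK(clk), .EN(1'b1), .WE(1'b0), .RST(1'b0),
        .ADDR(pc_nxt), .DI(16'b0), .DO(insn));
endmodule
```

In effect the RAM's built-in address register serves where the IR used to,
which is exactly what makes deleting IF thinkable.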
By deleting IF, we can reduce the branch and jump latency to 2 cycles.
Can we get away with it? It will trim the slack time in the DC stage
by Tbcko + routing - Tcko (i.e. the difference in latency between
getting the instruction from the output register of the block RAM
and getting it from an adjacent IR register in CLBs). This is about
3-4 ns. At 100 MHz, this leaves only 6-7 ns for the DC stage, probably
not enough. We'll see.
Furthermore, now that we have a little more area, we can add a mux
and possibly a dedicated branch adder, so it may be possible
to process branches, calls, and returns in the DC pipeline stage.
Combined with eliminating the IF stage, this would permit one cycle
branches, calls, and returns, compared to three cycles in XC4000X xr16.
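A rough sketch of what DC-stage branch processing might look like (all
names here are illustrative assumptions; the real design would fold this
into the existing PC increment path):

```verilog
// Sketch (illustrative names): dedicated branch adder plus PC mux in
// the DC stage. If the branch condition is already known in DC, taken
// branches redirect the very next fetch -- one cycle.
module dc_branch(
    input  [15:0] pc,       // address of the instruction now in DC
    input  [15:0] disp,     // sign-extended branch displacement
    input         take_br,  // branch condition, resolved in DC
    output [15:0] pc_nxt    // next fetch address
);
    wire [15:0] br_target = pc + disp;                 // dedicated branch adder
    assign pc_nxt = take_br ? br_target : (pc + 16'd2); // PC mux
endmodule
```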
Similarly, it may be possible to eliminate the LS (load/store) pipe bubble
that the XC4000X xr16 suffers, either by doing a data or D-cache access
in the middle of the EX stage, or by adding (horrors) a MEM pipe stage.
Load/store is less common and less of a priority, however, and it isn't
the end of the world if load/store still takes two cycles. We'll see.
The Knowledge
If you want to be a cab driver in London, you first have to acquire
The Knowledge. Students study for many months to memorize the
thousands of little streets in London and learn the best routes
from place to place. And they go out every day on scooters to scout
around and validate their book learning.
Similarly, if you want to be a great FPGA-optimized core designer,
you have to acquire The Device Knowledge. You have to know what the
LUTs, registers, slices, CLBs, block RAMs, DLLs, etc. can and can't do.
You have to learn exactly how much local, intermediate, and long routing
is available per bit height of the logic in your datapath and how wide
the input and output buses to the block RAMs are. You have to learn
about carry chain tricks, "bonus" CLB and routing resources, TBUFs,
and so forth.
You also need to know the limitations of the tools. What device
features PAR can and can't utilize. How to make PAR obey your
placement and timing constraints, and what things it can't handle.
And how to "push on the rope" of your synthesis tools to make them
emit what you already know you want.
The Knowledge isn't in any book, alas. Yes, you can read the
'street maps', i.e. the datasheets and app notes, but that only goes
so far. You have to get out on your 'scooter' and explore, i.e.
crank up your tools and design some test circuits, and then open
up the timing analyzer and the FPGA editor and pore over what
came out, what the latencies (logic and routing) tend to be, etc.
I have The XC4000X Knowledge, but I'm a Virtex newbie. I've read
the data sheets of course, but only this week have I been playing
with the tools and seeing what's what.
If you have any tricks for learning The Knowledge, or any good
little Virtex surprises as compared to 4000X, please share
them with us. Here's one: in 4000X CLBs, the RAM write-clock
can be inverted independently of the FF clocks. Not so in
Virtex -- they are either both inverted or neither is.
First Steps
As I said, I've been building all manner of little test circuits,
pieces of instruction fetchers and register files and datapaths.
For fun, last night I decided to port xr16 to Virtex to see
how fast she'll run "out of the box".
1. I started with the XSOC project Verilog model in /xsoc/xsocv from
the XSOC 0.93 beta distribution. (I also have some newer code which
does both xr16 and xr32 via `define, but it's not ready for prime time yet.)
First I modified xsoc.v to be just an xr16 processor and one
single ported 256x16 block RAM:
// modified excerpt from xsoc.v, (C) 2000 Gray Research LLC:

// submodules
xr16 p(
    .clk(clk), .rst(rst), .rdy(1'b1),
    .ud_t(1'b1), .ld_t(1'b1), .udlt_t(1'b1),
    .int_req(1'b0), .dma_req(1'b0), .zerodma(1'b0),
    .insn(ramd), .mem_ce(mem_ce),
    .word_nxt(word_nxt), .read_nxt(read_nxt),
    .dbus_nxt(dbus_nxt),
    .dma(dma), .addr_nxt(addr_nxt), .d(d));

RAMB4_S16 ram(.WE(1'b0), .EN(1'b1), .RST(1'b0), .CLK(clk),
    .ADDR(addr_nxt[8:1]), .DI(16'b0), .DO(ramd));
I deleted MEMCTRL etc. I just hard-wired the processor to not actually
store data to RAM, nor take interrupts, nor DMA requests, etc., and
to simply read instructions from the block RAM.
I ran that through the tools and it had a min cycle time of 27.3 ns.
2. I did a timing analysis. There was tight timing in the 1/2
cycle path from IR through RN muxes through reg file RAMs to falling-edge
clocked reg file output registers.
So I modified xr16.v, datapath.v, ctrl.v to replace the single-port reg
file RAMs with dual port RAMs. Basically I had to add a new ctrl output
'rnd' (destination register number) and modify the datapath regfile so:
// modified excerpt from datapath.v, (C) 2000 Gray Research LLC:
...
87,88c88,91
<     ram16x16s aregs(.wclk(clk), .addr(rna), .we(rf_we), .d(res),
<         .o(areg_nxt));
<     ram16x16s bregs(.wclk(clk), .addr(rnb), .we(rf_we), .d(res),
<         .o(breg_nxt));
---
>     ram16x16d aregs(.wclk(clk), .we(rf_we), .wr_addr(rnd), .addr(rna),
>         .d(res), .wr_o(xa), .o(areg));
>     ram16x16d bregs(.wclk(clk), .we(rf_we), .wr_addr(rnd), .addr(rnb),
>         .d(res), .wr_o(xb), .o(breg));
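For reference, here is a sketch of how a ram16x16d dual-port register
file bank might be built from Virtex RAM16X1D primitives (this is my
guess at the wrapper, written with a Verilog-2001 generate loop for
brevity; the actual module may differ):

```verilog
// Sketch (assumed wrapper): 16-entry x 16-bit dual-port LUT RAM.
// The A port both writes and reads at wr_addr; the DPRA port reads
// independently at addr -- so rnd and rna/rnb no longer contend.
module ram16x16d(
    input         wclk,
    input         we,
    input  [3:0]  wr_addr,  // write (and write-port read-back) address
    input  [3:0]  addr,     // independent read address
    input  [15:0] d,
    output [15:0] wr_o,     // read-back of entry wr_addr
    output [15:0] o         // read of entry addr
);
    genvar i;
    generate
        for (i = 0; i < 16; i = i + 1) begin : bit
            RAM16X1D r(
                .WCLK(wclk), .WE(we), .D(d[i]),
                .A0(wr_addr[0]), .A1(wr_addr[1]),
                .A2(wr_addr[2]), .A3(wr_addr[3]),
                .DPRA0(addr[0]), .DPRA1(addr[1]),
                .DPRA2(addr[2]), .DPRA3(addr[3]),
                .SPO(wr_o[i]), .DPO(o[i]));
        end
    endgenerate
endmodule
```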
I placed and routed that and now my cycle time was just 18.1 ns,
nearly 60 MHz. Here is the critical path at that point (as output
from the timing analyzer):
Delay: 18.142ns p/dp/a<14> to p/branch
18.134ns Total path delay (16.635ns delay plus 1.499ns setup)
0.008ns clock skew
Path p/dp/a<14> to p/branch contains 11 levels of logic:
Path starting from Comp: CLB_R5C20.S1.CLK (from clk_BUFGPed)
To Delay type Delay(ns) Physical Resource
Logical Resource(s)
------------------------------------------------- --------
CLB_R5C20.S1.YQ Tcko 1.372R p/dp/a<14>
p/dp/a_reg<2>
CLB_R10C21.S1.G1 net (fanout=5) 2.255R p/dp/a<2>
CLB_R10C21.S1.COUT Topcyg 1.579R p/dp/N463
p/dp/C477/C7/C0
p/dp/C477/C7/C2
CLB_R9C21.S1.CIN net (fanout=1) 0.000R p/dp/C477/C7/C2/O
CLB_R9C21.S1.COUT Tbyp 0.109R p/dp/N465
p/dp/C477/C8/C2
p/dp/C477/C9/C2
CLB_R8C21.S1.CIN net (fanout=1) 0.000R p/dp/C477/C9/C2/O
CLB_R8C21.S1.COUT Tbyp 0.109R p/dp/N467
p/dp/C477/C10/C2
p/dp/C477/C11/C2
CLB_R7C21.S1.CIN net (fanout=1) 0.000R p/dp/C477/C11/C2/O
CLB_R7C21.S1.X Tcinx 0.522R p/dp/N469
p/dp/C477/C12/C1
CLB_R9C23.S0.F2 net (fanout=3) 1.694R p/dp/N469
CLB_R9C23.S0.X Tilo 0.738R N_xa<5>
C596
CLB_R5C22.S0.F4 net (fanout=1) 1.643R syn7841
CLB_R5C22.S0.X Tilo 0.738R p/dp/S_98/cell0
C591
CLB_R2C21.S1.G4 net (fanout=6) 1.430R p/dp/S_98/cell0
CLB_R2C21.S1.Y Tilo 0.738R syn2231
C583
CLB_R2C22.S0.G2 net (fanout=2) 0.660R syn1687
CLB_R2C22.S0.Y Tilo 0.738R syn7892
C577
CLB_R2C22.S0.F3 net (fanout=1) 0.112R syn2239
CLB_R2C22.S0.X Tilo 0.738R syn7892
C575
CLB_R4C20.S0.G3 net (fanout=1) 1.460R syn7892
CLB_R4C20.S0.CLK Tick 1.499R p/branch
C573
p/ctrl/branch_reg
-------------------------------------------------
Total (8.880ns logic, 9.254ns route) 18.134ns (to clk_BUFGPed)
(49.0% logic, 51.0% route)
As you can see, the critical path is in the EX stage. The adder
adds/subtracts the two operands A and B. The sum passes into a zero
detector nor-tree, and then into the branch condition logic,
finally setting up the 'branch' register. The 8.88 ns of logic delay shows
that even with much better floorplanning, it's still going
to be very tough to get this down to a 10 ns cycle time. This
suggests we are going to need to do some retiming to push some
of this branch condition evaluation logic forward into the next
pipeline stage.
(It is funny to note this branch logic *used* to be in the next
pipeline stage (i.e. the EX stage of the branch instruction which
is currently being decoded), but I pushed it up into the DC stage
(retimed it) during my XC4000X xr16 optimization work to fix an
EX stage critical path. Now this decision needs to be revisited
because the ratio of adder latency to general logic+routing latency
has been reduced in Virtex as compared to XC4000X.)
3. Even though I knew the next speed hurdle was architectural, I still
wasted a couple of hours experimenting with Virtex floorplanning.
While I much prefer to floorplan upstream in the design representation
using RLOCs, I tried to see what kinds of floorplanning could be done
using INST constraints in the UCF file. (This is what I did to apply a
modicum of floorplanning in the current /xsoc/xsocv project, via
its UCF file.)
The improvements were not very great (yet); for instance, the
cycle time went as low as 16.6 ns, past 60 MHz but not yet 67 MHz.
I also learned that 2.1i PAR will not honor a specific INST directive on
a specific component if there is also a range INST on the larger module.
That is, if you write
INST p/ctrl/add_reg LOC=CLB_R1C1;
INST p/* LOC=CLB_R1C1:CLB_R9C9;
the constraint on p/ctrl/add_reg is lost. That seems wrong, but we're
stuck with it.
Finally, I learned that my practice of making each flip-flop in my
design asynchronously resettable:
always @(posedge clk or posedge rst) begin
    if (rst)
        ff <= 0;
    else
        ...
end
may not be a good decision for Virtex. To my surprise, the tools
actually routed a reset line to all my flip-flops using the
programmable interconnect (instead of merely using the hidden
GSR signal).
It may be I should design with no resets on anything but the core
control unit registers necessary to properly initialize the rest
of the design as it comes out of reset.
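That style might look like this sketch (names are illustrative):
datapath registers get no reset term at all, so GSR alone initializes
them at configuration, and only the control state gets an explicit reset:

```verilog
// Sketch (illustrative names): reset only the essential control state.
module reset_style(
    input             clk,
    input             rst,
    input      [15:0] areg_nxt,
    input      [2:0]  state_nxt,
    output reg [15:0] areg,    // datapath register
    output reg [2:0]  state    // control register
);
    // Datapath: no reset term, so no reset net is routed through the
    // fabric; GSR still zeroes the flip-flop at configuration.
    always @(posedge clk)
        areg <= areg_nxt;

    // Control: explicit reset, to properly restart the machine, which
    // in turn re-initializes the rest of the design.
    always @(posedge clk or posedge rst) begin
        if (rst)
            state <= 0;
        else
            state <= state_nxt;
    end
endmodule
```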