Branch instructions, delay slots, and alternatives (predication, skip)
Tommy Thorn asked (via email):
"... I have a few [xr16] questions that I haven't seen answered on fpgacpu.org:
-
Ignoring compiler issues, why wouldn't it be much simpler just to
implement delay slots for branches? The annulling logic would go
away and the branch penalty would be eliminated.
Under the Interrupts section you mention that delay slots would
complicate interrupt handling. I don't see why you couldn't just
block the interrupts while in the taken-branch delay slot, much
like what is already done for interlocked pipeline stages. (Any
branches in the delay slot should be illegal).
-
Not really a question, but it seems that a "minor" modification to
your design could improve both performance and code density:
Augment the processor state with a condition bit which encodes the
result of a set of compare instructions (ceq, cne, clt, cge, cle,
cgt, cltu, cgeu, cleu, cgtu), eg.
cge r3, r5 --> cond = (r3 >= r5) ? 1 : 0;
Allocate two bits in the opcode to encode (always, if-true,
if-false). (The fourth case can be used for the prefix which
doesn't need to be conditional).
Assuming for the moment that the instruction set could somehow be
fitted into the remaining bits, the search.asm example
[JG: see p.2 of the
series]
(slightly improved) could be compiled as follows (using "0_inst" and "1_inst"
for conditional instructions):
_search: br L3
L2: cge r9, r3
0_lw r4, 4(r4)
1_lw r4, 2(r4)
L3: ceq r4, r0
0_lw r9, (r4)
0_ceq r9, r3
0_br L2
mov r2, r4
ret
The worst case inner loop here is 6+3=9 cycles (2 cycles could be
further saved using delayed branches), compared to the original of
9+12=21, that is, three times faster. (BTW, conditional instructions
can live in delay slots without problems, unlike branch
instructions).
A slightly more complicated variant of this idea is to make
instructions always conditional and use just one bit to encode the
polarity, eg:
_search: TRUE ; force known condition context
; (eg, 0_ceq r0, r0)
1_br L3
L2: 0_cge r9, r3
0_lw r4, 4(r4)
1_lw r4, 2(r4)
TRUE ; force known condition context
L3: 1_ceq r4, r0
0_lw r9, (r4)
0_ceq r9, r3
0_br L2
1_mov r2, r4
1_ret"
Great questions. My responses:
First: why no branch delay slots? There are several issues.
-
Simplicity. I was constantly struggling to keep the articles under
their 3x3000 word budget. Having delay slots would necessarily mean
discussing them in the articles.
-
Assembler instruction scheduling. Delay slots would of course require some code in the assembler to try to move an instruction into the delay slot.
-
Code bloat. An unfillable delay slot would require a nop. In the
old days, the rule of thumb was 60-70% of the time, the first delay slot
is fillable. This can be overcome by having two branch instruction
families, one with a delay slot and one without, or perhaps one that
conditionally annuls the delay slot. In that case, though, the code
bloat is quietly redistributed across the opcode map.
-
The annulling logic would not go away, but it would be changed slightly.
Currently when a jump or taken branch executes, the two instructions in the
IF (instruction fetch) and DC (decode/operand fetch) stages are annulled.
To implement a 1-instruction delay slot,
you need only change one gate in order to annul the IF stage
instruction, while leaving the DC stage instruction alone.
(The gate in question is the OR4 feeding the EXANNUL flip-flop on page 16 of the
XSOC series.)
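To make the distinction concrete, here is a minimal sketch of the annul control, with assumed signal names (branch_taken, if_annul, dc_annul) rather than the actual XSOC netlist:

    // Hedged sketch, not the actual XSOC logic: annul control for the
    // IF/DC/EX pipeline. branch_taken means a jump or taken branch is
    // executing. With DELAY_SLOT = 1, the instruction already in DC
    // (the delay slot) completes; only the IF-stage one is annulled.
    module annul_ctl #(parameter DELAY_SLOT = 0) (
      input  branch_taken,
      output if_annul,    // annul the instruction in IF
      output dc_annul);   // annul the instruction in DC

      assign if_annul = branch_taken;
      assign dc_annul = branch_taken & (DELAY_SLOT == 0);
    endmodule

The one-gate change in the text corresponds to forcing dc_annul off when the delay slot is enabled.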
-
Interrupts. Believe me, you don't want to write an interrupt handler
that handles return-from-interrupt-into-the-branch-delay-slot! But as
Thorn suggests, interrupts could indeed be deferred while in a branch
delay slot, thus avoiding this situation.
Checking back on some mail from that time (11/17/98), it appears
that since interrupts hadn't been designed yet, I hadn't realized
that the potential problem could be overcome by simply deferring
interrupts in branch delay slots. (I have since done so, though.)
As for Thorn's second question/suggestion, his code transformation above
is quite clever and is a compelling example of the benefits of
predicated execution.
With 16-bit instruction words, I don't know if it is possible or profitable
to squeeze out a precious opcode bit, or (heavens forfend) 2, for a
predicate, alas.
I don't deny the branch instruction architecture I settled on for xr16
was a very familiar and conventional one. (And one that was easy
to figure out how to make lcc target.) I did evaluate providing only
an unconditional branch (with a larger branch displacement), plus a skip1<cond>
and skip2<cond> regime a la Philip Freidin's RISC4005, and
it did provide some performance advantages in some cases, but I did not
pursue that for some reason that I can't recall.
More on Empower! and Excalibur
Anthony Cataldo, EE Times:
Xilinx moves forward, Altera pulls back on PowerPC cores.
"Xilinx Inc. says it's on track for yearend sampling of its Virtex 2
FPGA with an embedded PowerPC core. But rival Altera Corp. has postponed
plans to offer PowerPC and MIPS-based processor cores for its FPGAs,
citing cost concerns. ..."
"... Instead, Altera has focused on its ARM9 hard processor and Nios soft
microcontroller cores, both of which are shipping now."
See also this earlier discussion.
I have started a new section,
FPGA prototyping boards,
on the Links page.
Insight Electronics'
Design Solutions
page
now lists the new
Virtex-II MicroBlaze Development Kit.
FPGA MP3
The winning entry in the
Circuit Cellar
Design Logic 2001 Contest
(Atmel FPSLIC section)
is an
MP3 player
that uses the
FPSLIC's
AVR microcontroller for interface and control,
and the integrated on-chip FPGA fabric as a 32-bit fixed-point DSP coprocessor
that decodes the MP3 data. Congratulations to Geva Patz of Brighton, MA.
Xilinx has some MP3 player
design resources,
some
app
notes,
and a cute
reference design.
These students at Florida Atlantic did one (sort of).
(Also, I thought there was an Altera Nios based MP3 project, but Google and
I can find no definitive reference to it.)
Actel: MP3 Personal Digital Players Using Actel FPGAs
-- the Rio PMP300. The Actel device acts as a parallel port,
flash controller, smart media interface, CPU interface, and misc.
glue logic; an external CPU and MP3 decoder are still required.
Actel is Dominant Supplier ....
Murray Disman, ChipCenter:
Actel Claims to Dominate MP3 Market (5/00).
e.Digital and Actel to Deliver Silicon Solution for Digital Music and Voice Recorder/Players.
Murray Disman, ChipCenter:
e.Digital and Actel Collaborate.
Celoxica has an MP3 encoder
case study.
Altera LogicLock
As a big fan of explicit floorplanning using Xilinx RPMs, I was intrigued
when Altera mentioned LogicLock.
Now here is some more
detail, an
app note, and a
methodology white paper.
"During integration and system-level verification, the performance of each module of logic is preserved."
That is so important. Recently I wrote:
"... floorplanning [using RPMs] can certainly halve critical path net delays,
and, as importantly, make the delays predictable so you have terra firma
upon which to make methodical implementation decisions."
(Although I wish I'd written sometimes instead of certainly.)
Back in April, I asked:
"(By the way, it is important that such placement constraints can be schematic and HDL source code annotations -- it is rather less useful if the constraints are applied to post-synthesis entities. Is that the case with LogicLock?)
The
app note says
"A LogicLock region can be created graphically using the the (sic) LogicLock
window in the Quartus II Floorplan Editor or in text with a Tcl script."
So the answer appears to be that LogicLock constraints are indeed
applied to post-synthesis entities.
Thus LogicLock appears to be similar to the Xilinx Floorplanner
(or perhaps the
Modular Design Flow),
and less like Xilinx RLOCs.
Not that there's anything wrong with that.
By the way, Altera has an interesting quarterly newsletter,
News and Views;
of course, Xilinx does too.
Digital Communication Technologies'
lightfoot for FPGAs.
Lightfoot data sheet.
DCT/Xilinx press release.
"Lightfoot will run on Xilinx Spartan-II, Virtex-E and Virtex-II
FPGAs. Purpose-designed for embedded systems, DCT's Lightfoot core
uses around 25,000 gates in its conventional off-the-shelf chip form,
and requires just 1710 'CLB slices' of Xilinx logic in its IP form."
Peter Clarke, EE Times:
Startup builds Java processor with ARC core.
'The core, called Bigfoot, complements Digital Communication Technologies
Ltd.'s current offering, a Java and C-language CPU based on a stack
architecture, known as Lightfoot. ... '
'"We realized from the start that a Java-only processor had no chance,"
said Turner. ... "We realized that some applications were using a Dhrystone
benchmark even though they also wanted Java performance," he said. "That's
the area in which Lightfoot can't compete. With Bigfoot we can get the
Dhrystone benchmark inherently [from the ARC core]."'
From Drinking the .NET Kool-Aid,
Don Box at Conference.NET:
"... if you talk to people who really look at the trends in this industry
... it feels ... we are moving a world that there are basically
two places that code runs -- the JVM and the CLR. ..."
OK, so Don wasn't speaking to embedded system development.
But long term, J2ME and/or
.NET Compact Framework
support will be essential in heavier-weight embedded systems.
But as I ask in Java processors,
"The fundamental design issue is how sophisticated is your download-time
translation pass over the bytecodes, in order to canonicalize or
regularize them?"
Altera has posted tens of MB of detail on their Excalibur
(ARM hard core and Nios soft core) embedded processors on their
Excalibur literature
site. The
ARM-Based Embedded Processor Device Overview Data Sheet
provides a concise overview of the EPXA (embedded ARM) family.
".. the embedded processor domain and PLD domain can be asynchronous,
to allow optimized clock frequencies for each domain."
Note that Altera is once again offering free
Excalibur Embedded Processor Workshops.
I've signed up for the one in my area.
(Back in August 2000, I attended one of the first generation Nios workshops.
It was terrific, very hands-on, and time well spent.)
Nick Heaton and Ed Flaherty, Integrated System Design:
A Verification Environment for an Embedded Processor.
Great detail on the design and verification of the Altera EPXA10,
including how the hard microprocessor IP "stripe" interfaces to the
programmable logic.
"The Excalibur physical architecture consists of single and dual port
RAM, a 300k-gate standard cell area, an embedded processor IP core, and
one-million gates of APEX 20KE PLD all on a single chip, representing
some 80 million transistors. ..."
"In this context the "stripe" refers to the custom embedded logic area
along one edge for the Excalibur die as distinct from the PLD area. The
separation is required because of the fundamentally different ways in
which the two areas are designed and passed through layout."
Anthony Cataldo, EE Times:
Compiler that converts C-code to processor gates advances.
"But if the technology preview is any indication of what's coming, it's
only a matter of time before processors are considered just another
building block for system-on-chip design, just as transistors are for
today's devices. Moreover, system architects won't be beholden to a
select group of processor designers, Rowen said."
Altera:
Quartus II Design Software Now Available in Web Edition.
Altera Quartus II Web Edition
joins
Xilinx ISE WebPack
in offering a free environment providing "design entry, HDL synthesis, place and route, verification, and programming."
Altera Program License Subscription Agreement, part 9, TalkBack Feature Notice:
"The TalkBack feature, included with the Licensed Program(s), enables ALTERA to receive limited information concerning your compilation of logic designs (but not the logic design files themselves) using the Licensed Program(s). ..."
"You may disable/enable the TalkBack feature at any time by running qtb_install.exe located in your quartus/bin folder."
Pushing on a rope
Shucks, somehow I never noticed this.
Tom Cantrell, Circuit Cellar Online (4/01):
DesignCon Fusion:
Shades of Gray.
'In a process Gray described as "pushing on a rope," the HDL had to be
iteratively modified until it produced the optimized logic he expected.'
'In one example, after "much experimenting," he came up with nonobvious
HDL that was able to compel the synthesizer to generate an efficient
combined adder/subtractor rather than the separate adder, subtractor,
and multiplexer generated by a more readable version.'
(This was particularly a problem when trying to inject carry-in and
capture carry-out.)
Minimalism
On the fpga-cpu list, in a discussion of minimalist FPGA CPUs, Tim Boescke
asked:
"Jan Gray somehow managed to fit a 4 operator ALU into a
single LUT per bit. (In the GR0040) Is there any documentation on
this ? So maybe there is a way to reduce the ALU size to 6 CLBs."
Some of the background is here.
Some of it is hinted at in the GR0000
paper in section 3.12.
Now then. Say you build a 4-bit ALU using this technique (4 LUTs).
And say you attach that to a 16 entry x 4-bit LUT RAM (another 4 LUTs, or
8 if you make it dual-port RAM). And say you add a 2-bit counter (2 LUTs)
to sequence through LSB addresses 00, 01, 10, and 11, to that LUT RAM.
Add a LUT and FF to handle carry-in.
Now you have a simple datapath with 4 16-bit registers, nybble-serial,
that should easily run at 100 MHz (25 MHz for each 16-bit operation).
Total cost of datapath: 11-15 LUTs (3-4 Virtex CLBs; 2 Virtex2 CLBs).
To hook that up to a 512x8 or 256x16 BRAM for program and data storage,
you need another 8 or 9 FFs for a PC and/or address register. These
FFs can share the same handful of CLBs with the aforementioned LUTs.
The instruction register can be the BRAM output register.
Add a few LUTs for minimal instruction decoding, and you're most of the
way to a very compact (and extremely austere) processor.
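Here is a rough Verilog sketch of that datapath. The names and the op encoding are my assumptions for illustration -- a guess at one plausible arrangement, not the actual GR0040 logic:

    // Hedged sketch of the nybble-serial datapath: 16x4 LUT RAM
    // register file (4 x 16-bit registers), 4-bit ALU, 2-bit nybble
    // counter, and a carry flip-flop.
    module nybble_dp(
      input            clk,
      input      [1:0] op,      // assumed: 00 add, 01 sub, 10 and, 11 or
      input      [1:0] ra, rb,  // source/destination register numbers
      input            go);     // run one 4-cycle, 16-bit operation

      reg  [1:0] nyb = 2'b00;   // nybble counter: LSBs of LUT RAM address
      reg        cy;            // carry FF between nybbles
      reg  [3:0] rf [0:15];     // 16x4 LUT RAM: 4 regs x 4 nybbles each

      wire [3:0] a = rf[{ra, nyb}];
      wire [3:0] b = rf[{rb, nyb}];

      // 4-bit ALU; the single-LUT-per-bit trick folds these cases into
      // one LUT plus the carry chain, which this sketch hand-waves.
      wire [4:0] sum = a + (op[0] ? ~b : b) + ((nyb == 0) ? op[0] : cy);
      wire [3:0] f   = op[1] ? (op[0] ? (a | b) : (a & b)) : sum[3:0];

      always @(posedge clk) if (go) begin
        rf[{ra, nyb}] <= f;     // writeback: ra op= rb, a nybble per clock
        cy  <= sum[4];
        nyb <= nyb + 2'b01;     // sequence LSB addresses 00, 01, 10, 11
      end
    endmodule

At 4 clocks per 16-bit operation, this lines up with the 100 MHz / 25 MHz figures above.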
Bumming LUTs
In response to the above, Tim Boescke
wrote back:
"...the PC could be mapped to the registerfile, but this
would double the amount of cycles per instruction. (well, 8
cycles for a dual ported registerfile, 12 cycles for single ported) ..."
Bingo! That's the right mindset indeed. Keeping the PC in the register
file to save area is a technique I used in my
first FPGA CPU
seven years ago.
It may also be necessary to put 0 in the regfile (leaving 2 (or perhaps it is 2
3/4) 16-bit regs (or 6 8-bit regs, if you prefer)). Then IR=MEM[PC+=1]
becomes rf[PC] += rf[zero] + cin=1 across 4 cycles. As the ALU output
nybbles go by, you latch them in your nybble-serial-to-parallel FF-based
address register, that drives address lines to the BRAM. And perhaps you
can save more registers and a mux, if the data bus is only 4-bits wide,
if you use a RAMB_S16_S4 or something like that. (Hands wave furiously.)
(If this is all too complex, by all means, build a conventional 8-
or 16-bit high regfile and ALU -- I just wanted to demonstrate that it
is possible to build a minimal slow austere 16-bit datapath in just 2
Virtex2 CLBs.)
Another gate bumming idea: the output network of a RAM is a mux.
Use it before you use LUTs for muxes. In the XSOC/xr16 design, moving
the video address register into the PC register file cost a column
of LUT RAM, but saved a column of LUTs on a mux, and another column on
the video address incrementer (by reusing the PC adder/incrementer).
Another example: rather than design a video character generator ROM
(using BRAM) that puts out an 8-bit byte, and then sending *that* through
an 8-1 mux to drive pixels to the screen, instead configure the BRAM as
a RAMB4_S1, e.g. with a 1-bit output. The 8-1 mux disappears into the
BRAM's internal output mux logic.
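For instance, a character generator along those lines might look like this sketch. The glyph_base packing is hypothetical; the point is that pix_col lands in the low address bits, so the 8-1 mux is just the BRAM's column decode:

    // Hedged sketch of a 4096x1 character generator ROM in one block
    // RAM (RAMB4_S1 style). One pixel per clock; no external 8-1 mux.
    module chargen(
      input        clk,
      input  [8:0] glyph_base,  // assumed: f(char code, scan row)
      input  [2:0] pix_col,     // pixel column within the glyph row
      output reg   pixel);

      reg font [0:4095];                    // infers a 4096x1 block RAM
      initial $readmemb("font.mem", font);  // hypothetical font bitmap

      always @(posedge clk)
        pixel <= font[{glyph_base, pix_col}];
    endmodule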
Fun stuff. Brings back the old days of assembler one-upsmanship, striving
to bum an instruction or a cycle from a little compute kernel. "What's
the most compact sequence to do atoi, or itoa, or what have you..."
I feel sorry for anyone starting in computing today who thinks 4 KB
is nothing. They missed all the fun.
This thread reminds me of one of my favorite essays,
Simple is Beautiful.
FPGAs free to a good home
A while back, Gray Research bought 300 XCS20TQ144-3C (date code 9829)
in carriers of 60, for $0.50 each, in an auction. We don't need them
all. We'll donate up to 240 of them, in lots of 60, to one or more
universities. If you are a university professor based
here in the US, and you're reasonably sure you will put them to a good
use, just send me a very brief note
of proposal ("Dear Jan, we sure could use n of your FPGAs!")
and they're yours (while supplies last).
Remember that an XCS20 is a 5V FPGA similar to an XC4010, with 20x20 CLBs,
2 LUTs per CLB, and lists at $28 each per
Avnet.
Note these are untested, fine pitch PQFPs, and you'll probably have to bake
them before you use them.
Ernie Coombs
Goodbye, and thanks, dear
Mr. Dressup.
Tom Murphy, Electronics News:
ARM Core Boosts Altera's ASIC Alternative.
'"What you get is best-in-class performance that you would expect from an
ASIC without the minimum order quantities a foundry would expect or the
lengthy fab cycles," Chiang said. "In addition, there are no licensing
fees to pay to ARM. That is included in our pricing structure."'
Brian Dipert, EDN (2/01):
Do combo chips compute (or even compile)?.
"Neither your logic in the programmable-logic partition nor the
Altera-supplied soft IP also potentially stored in that partition
directly interfaces to the CPU core. Instead, the hookup is through
master-and-slave bridges and dual-port SRAM (different from the
embedded-array-block memory that the Apex 20K array contains)."
FPGA CPUs in Education
From Prof. Chris Meyers' Univ. of Utah
CS/EE 3710
course description:
"During this class you will design and implement a microprocessor in a
team of three. Note that grades will still be given individually, not just
to the team. You will be given 3 benchmark codes in C. Your job will be
to design an instruction set, determine a microprocessor architecture,
simulate and test using VHDL, and implement in FPGAs. The better your
design performs, the better your grade. Therefore, you should consider
advanced architecture features such as pipelining, branch prediction,
hardware multipliers, etc." [my emphasis -JG]
Fantastic.
Here's a similar sentiment from my paper on teaching computer architecture
"hands-on" with FPGA CPUs:
"Our favorite idea simulates the competitive processor design industry.
Student teams are issued a CPU design kit, including computer tools,
a working, non-pipelined processor core, a benchmark suite, and an
FPGA board, which runs "out of the box", and are instructed to evolve
and optimize their processor ... to run the benchmark suite as fast
as possible ... At end of term, teams submit their designs and vie for
the coveted "fastest CPU design" trophy. This sort of project could
uniquely motivate students to practice all manner of quantitative
analysis and design activities."
See also some of the other teaching links.
In particular, I still get a kick out of the Hiroshima University
City-1 pages.
Disman
Murray Disman, ChipCenter:
Xilinx Shipping MicroBlaze.
"The problem with these comparisons is that Xilinx is basing its results on the Nios 1.0 core."
Murray Disman, ChipCenter:
Altera Announces Nios 2.0.
"The number of logic elements required for the 16-bit bus implementation
has been decreased from 1100 to 900, while the elements needed for
the 32-bit design has dropped from 1700 to 1250. At the same time, the
processor speed has been increased from 33 MHz for Nios 1.0 to 80 MHz
for version 2.0."
Rather impressive improvements. Competition is good.
Are they apples-to-apples? (Frequency more than doubling in same device?)
So we have claims of 900 LUTs, 125 MHz, and 82 D-MIPS for Xilinx MicroBlaze 1.0,
and reports of 1250 LEs, 80 MHz, and
(EE Times:)
40 Dhrystone MIPS for Altera Nios 2.0.
(For comparison, gr1040: <180 LUTs; gr1050: <300 LUTs; north of 67 MHz
in Virtex-E; D-MIPS...?)
That said, I am inclined to side with this sentiment in the
same article:
'"We realized that it's not about the instruction set. It's about how easy it is to use and put the systems together and compile down," said Jordan Plofsky, senior vice president of embedded-processor products at Altera.'
Now in my multiprocessor-SoC experiments, the limiting density factor has
not been LUT counts but block RAM ports.
You can put 60 16-bit RISCs in a 72-block RAM XCV600E, but only if each
one uses/shares a total of about one block RAM (two ports) each.
Therefore, I encourage FPGA CPU vendors and users alike to quote both
LUT counts and block RAM / ESB counts on CPU cores. For example,
the smallest gr1040 that might make sense requires <180 LUTs and 1
block RAM as an on-chip instruction/data store. In comparison, the Nios
resource usage
app note
states that a 16-bit Nios 1.1.1 reference design, on Apex-II,
uses 31,488 EAB/ESB bits. At 4,096 bits per ESB, that's at least 8 ESBs.
Murray Disman, ChipCenter:
Altera Unveils HardCopy Program.
"Conversions of high-density FPGAs will make economic sense for production quantities in the low 100s, since the HardCopy device will typically cost about $1000 less than the FPGA it replaces."
Murray Disman, ChipCenter:
QuickLogic Ships MIPS-Based Hybrid FPGA.
"The MIPS program at Altera seems to have been placed on hold while the company concentrates on its other Excalibur products."
Legacy ISA IP companies: please write
me if you'd like to explore contracting
for a competitive (compact, fast) Xilinx-optimized reimplementation
of your architecture. Don't lose design wins for want of an FPGA solution.
Sacrificing silicon at the altar of programmability
Recently in comp.arch.fpga, "Dennis" wrote:
"I am an ASIC Designer trying to understand FPGAs. While going through
Xilinx Datasheets, I got some clues (although not fully understood)
about the definition of System Gates Capability of a particular
product. I wish to understand, How much Silicon is sacrificed for the
sake of Programmability? , for example: For a 315K equivalent Gates in
XC2V300, How Many ASIC Gates(2 input NAND) have been put in the
Silicon????"
Ray Andraka replied:
"The truth is, these marketing gates have been badly perverted by
oneupsmanship, to the point that on chip memory and extra features more or
less dominates the figure. A better measure of FPGA capability is a count
of the number of logic elements (consisting of a 4-LUT and flip-flop in
many devices), and then season that with any special features to the
extent that they BENEFIT YOUR application."
I replied:
I agree with Ray. But see also my weblog entry
marketing gates redux.
And read Peter Alfke's definitive
posting on this subject.
There Peter figured each logic cell would be worth about 12 ASIC gates
(6 for the LUT and 6 for the FF).
I thought a ballpark answer to the question might be interesting. Here
follow some educated guesses, but none based upon actual data from actual
shipping devices.
In the book "Architecture and CAD for Deep-Submicron FPGAs", Betz, Rose, and
Marquardt, Appendix B, pp.207-220, the authors provide a design for a
generic CLB 'cluster' of four 4-LUTs and FFs that occupies 1678 'minimum
width transistor areas'. (Each 4-LUT is 167 'MWT's.) That doesn't count the
myriad transistors in each cluster's programmable interconnect (routing
channels) -- configuration SRAM cells, switches, buffers etc.-- which I have
read (somewhere) can be 4X more transistors than the CLB cluster itself. So
let's say a tile with 4 4-LUTs, and its programmable interconnect, could
require 8,000 transistors -- that's to implement about 40-something ASIC
gates of logic and wiring -- call it 200 transistors per ASIC gate.
If you figure a CMOS 2-input NAND is four transistors, then at 200
transistors per NAND, it works out to a 50-1 'transistor overhead' for
programmability. That sounds bad, but remember that FPGA transistors are
typically manufactured in the latest and greatest processes, so often they
are smaller, faster, and cheaper than ASIC transistors.
Let's check our figures another way. Last year, Steve Young of Xilinx was
quoted
as saying that this year Xilinx Virtex-II designs would get up
into the 500 million transistors zone. Doing the math with a 2V6000
or a 2V10000 type device, this too indicates that several thousand
transistors go into each logic cell (plus its share of the routing
and RAM), or several hundred per equivalent ASIC gate.
And here's a third approach. A 2V10000 requires a 33.5 Mb configuration
bitstream. Assuming each bit is stored in a 6 transistor SRAM cell, and
each configuration bit drives only a single pass transistor, (way too
conservative), that's 33.5M*7 = at least 250 million transistors for the
123,000 LUTs = >2000 transistors per LUT, or again several hundred
transistors per ASIC gate.
MicroBlaze/Nios News Links
Crista Souza, EBN:
PLDs make inroads with processor designs.
Anthony Cataldo, EE Times:
Altera, Xilinx heat up processor-core fray.
Xilinx MicroBlaze
Xilinx Ships MicroBlaze.
Congratulations, you guys!
MicroBlaze page.
Performance data.
Forum.
Peripherals.
CoreConnect Technology.
CoreConnect architecture (IBM).
CoreConnect license (PDF).
Register and Download.
Literature:
Getting Started Guide;
Hardware Reference Guide;
Software Reference Guide.
"MicroBlaze delivers over three times the performance in less than half the size of competing soft processors."
82 "dhrystone MIPS" is excellent performance. I have to assume that
number represents running dhrystone entirely out of on-chip block RAM.
Presumably Xilinx's comparable Altera Nios numbers are for a Nios system
running code and data out of off-chip RAM. On the other hand, since Nios
uses several ESBs for the processor implementation itself, that
leaves fewer such block RAMs for hosting, er, benchmark memory images.
FPGA CPU benchmarking:
"For a while to come, expect apples-to-oranges data that warrant considerable
skepticism. Company #1 will present simulated results for their core for
their fastest speed grade parts (expensive unobtainium), running entirely
on-chip, with programs and data in on-chip block RAM, on their best case
inner loops. Company #2 will present measured results in the context of a
real low-cost system using last year's slowest-speed grade device, running
standard benchmark programs out of external RAM."
The performance page says MicroBlaze runs at 125 MHz in the fastest
speed grade of Virtex-II, but only 65 MHz in the fastest speed grade
of Spartan-II. Is it valid to assume that the performance in Spartan-II is
no better than 82*65/125 == ~43 D-MIPS?
'"According to some industry analysts, the processor based field-configurable
market is expected to grow to $235 million by 2004," said Babak Hedayati,
senior director of Product Solutions Marketing at Xilinx.'
This all comes back to last year's
IP business models discussion.
Is this new sales of FPGA CPU IP cores? $235 M / $500-$5000 == 50,000-500,000 design
wins? Doubtful!
Is this $235 M additional sales of programmable logic devices?
More likely.
Peripherals
The MicroBlaze platform includes an impressive
portfolio
of peripherals, all using the CoreConnect OPB V2.0 bus.
Of course, these and other peripherals also interop with the PowerPC
embedded hard cores coming in the Virtex-II Pro product.
It's quite an impressive story.
If you want to go into the IP business, selling peripherals cores
for the Xilinx platform, it's time to crack open some specs and
learn all about the OPB bus.
Altera
Altera Rolls Out ARM-based Excalibur Product Family.
Altera's Excalibur Development Kit Speeds ARM-Based SOPC Designs.
"Available immediately, the Excalibur System Development Kit includes a development board with the 200 MHz EPXA10 ARM-based Excalibur device that supports complex SOPC designs of up to 38,400 logic elements (1 million system gates)."
Altera Enhances Nios Soft Processor for High-Bandwidth Applications.
"After selling more than 2,500 Nios embedded processor development kits since its introduction in June 2000..."
Congratulations, Altera. That tops the >2,000 downloads of XSOC since
its introduction in March 2000.
Such a great business model, redux
Now the FPGA CPU / SoC lines are drawn for real. In one corner, Altera.
In the other, Xilinx. (In the third, independent IP providers.)
Now these two PLD giants can use their extensive (and presumably
high quality, well supported) cores and tools as velvet shackles
to lock you into their PLD architectures for the rest of time.
MPF correspondent wanted
I will not be at MPF's FPGA CPU session tomorrow to share in the fun.
If you attend, would you please be so kind as to share with
us your impressions of the session highlights? Thank you.
FPGA CPUs at MPF
Next week is
Microprocessor Forum 2001.
Tuesday Session 3,
is Microprocessors in Programmable Logic, moderated by Cary Snyder.
QuickLogic will present on QuickMIPS;
Altera has two presentations, presumably on their Nios 2.0 soft core
and ARM hard core products, and Xilinx two more on MicroBlaze and
Virtex-II Pro.
"The MicroBlaze Soft Processor
Reno Sanchez, Engineering Site Manager, Xilinx, Inc."
"Xilinx agrees that soft processors have a valuable place on FPGAs and the
company is ready to roll out its soft processor, dubbed MicroBlaze. MPF
2001 attendees will be the first to hear about the architectural details
which enable MicroBlaze to provide a two fold performance improvement
over all existing soft processors."
More on the new SoC design
Continuing design notes from last time. (Today's entry won't make
sense if you haven't yet read those earlier notes.)
Last time, I wrote this design will have 2 32-bit CPUs. Scratch that.
It may have up to 4. LUT-count wise, this is not a problem, as each core
is only about 300 LUTs, and the XC2S100 has 20x30x4 = 2400 LUTs. However,
if each cache-enhanced gr1050c processor requires 3 block RAMs,
the requisite 3x4 block RAMs exceed the device's available 10 block RAMs.
Fortunately, each gr10x0c can make do with just 2½ BRAMs: ½ for i-cache tags,
½ for i-cache data, ½ for d-cache tags, and 1 for d-cache data.
I was going to use one BRAM port for reading instructions from the i-cache
data RAM, and a second BRAM port on that same RAM for writing instructions
(on a cache miss line fill). But this can be done with just one port,
plus two 4-LUT muxes to count up the two LSBs of the
i-cache data address during cache refill.
The principal effect of this change is to reduce the i-cache from
64 lines of 4 instructions, to just 32 lines of 4 instructions, with some
(as yet unmeasured) reduction in i-cache hit rate.
Indeed, with so few cache lines, we may be better off with 64 lines of
2 instructions instead.
Today's design notes reflect this change.
Further i-cache design notes
First, some background. The gr10x0c is based upon the gr10x0i core.
This core closely resembles the gr0041 core described in the GR CPUs pages, except
- it is parametric in width -- the gr1040i is 2^4 == 16 bits wide, the gr1050i
is 2^5 == 32 bits wide;
- it is parametric in aspect ratio -- the 32-bit gr1050i can be made 8 rows of CLBs tall, or 16 rows tall; and most importantly
- it is pipelined -- there is a pipeline register immediately before the register file access.
There are two pipeline stages: decode, and execute. There is no fetch
stage, because the GR CPU designs assume a block RAM based instruction
store or instruction cache. (Recall that block RAMs have 0 clocks of
access latency: you present a valid address immediately ahead of the
clock edge, and a few ns later you have the corresponding data.)
In the present design, the new instruction address i_ad is presented
to the i-cache data and tag block RAMs just ahead of clock. After the
clock edge, the fetched instruction is decoded. Concurrent with this
decoding is the i-cache tag check. If the i-cache tag (read from
tag RAM at address i_ad[7:3]) differs from i_ad[23:8], this signals
an i-cache miss. The instruction being decoded is invalid and
its effects are annulled.
On an i-cache miss, the CPU core sends a 'read 4-words' command
to the memory system. As the memory system acknowledges each
word, these instructions are deposited in the right cache line
in the i-cache data memory. Upon receiving the final word in the
read transaction, the CPU core updates the i-cache tag to indicate the
corresponding cache line now holds a valid copy of the line.
To keep this first cut simple, we do not do so-called
critical-word-first line fill. Rather we read each of the
four instructions, in order, starting at the line-aligned address (0 mod 8).
All the while that read is happening, the CPU core stalls.
Well, not exactly. In fact, it sits there reading the same
cached instruction word and tag. Over and over, the tag doesn't match
and the fetched instruction is annulled. Eventually the memory
subsystem delivers the cache line, and the tag is updated.
On the next clock edge, the tag compares equal and the just-fetched
instruction is not annulled, and execution continues.
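A sketch of that tag check and stall-by-annul loop, with assumed signal names and the field widths given above (32 lines of 4 instructions, 24-bit addresses); the real gr10x0c logic surely differs in detail:

    // Hedged sketch of the i-cache tag check. A mismatch annuls the
    // instruction in decode and requests the 'read 4 words' fill; the
    // CPU keeps re-fetching the same word until the tag finally matches.
    module icache_ctl(
      input         clk,
      input  [23:1] i_ad,       // next instruction (byte) address
      input         fill_done,  // memory wrote the last word of the line
      output        miss,       // annul the instruction now in decode
      output        read4);     // request a 4-word line fill

      reg [23:8] itags [0:31];  // 32 tags: half a block RAM
      reg [23:8] tag;           // tag read, registered like a BRAM output
      reg [23:1] dc_ad;         // address of the instruction in decode

      always @(posedge clk) begin
        tag   <= itags[i_ad[7:3]];          // tag fetch alongside ifetch
        dc_ad <= i_ad;
        if (fill_done)
          itags[dc_ad[7:3]] <= dc_ad[23:8]; // mark the line valid
      end

      assign miss  = (tag != dc_ad[23:8]);
      assign read4 = miss;      // (a real design pulses this once per miss)
    endmodule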
This all works only because we use a single BRAM port to update
the i-cache data, and a second single port to update the i-cache tag.
When using block RAM, beware writing and reading to the same
address, via two different ports, on the same clock cycle.
Spartan-II data sheet, section Design Considerations/Using Block RAM Features/Conflict Resolution:
"If one port attempts a read of the same memory cell
the other simultaneously writes, violating the
clock-to-clock setup requirement, the following occurs.
- The write succeeds
- The data out on the writing port accurately reflects
the data written.
- The data out on the reading port is invalid."
Further d-cache design notes
Now if the instruction in the execute pipeline stage is a load
or store, we have some more work to do.
In the first clock cycle of such instructions, we compute the effective
address to d_ad, and present it to the d-cache data and tag
RAMs.
For longword stores, we drive the data to be stored to the result bus
and then into the d-cache data RAM; and write the new d_ad[23:9]
to the d-cache tag RAM.
Since (as we discussed last time), the d-cache design does not allow
partially valid d-cache lines, for word or byte stores, we do
not store the data to the d-cache. Instead, we mark the line as invalid.
For all flavors of stores, the store data are written-through to
main memory. The CPU waits for a rdy acknowledgement from
memory before advancing to the next instruction.
Loads are more complicated. Loads may hit in the d-cache, or may miss.
On a load longword hit, the read data is simply driven onto the
3-state result bus.
On a load word or load byte hit, the read data is properly aligned and
then driven onto the result bus. These loads do zero fill, so the
upper 16- or 24-bits are driven with 0.
On a load miss, whether it be a byte, word, or longword access,
the CPU loads a full 32-bit longword from RAM, drives it onto
the result bus, and writes it to the d-cache data RAM.
On a load longword miss, the instruction completes in that same cycle,
since the result bus already has the desired longword data.
On a load word or load byte miss, once the above multicycle miss handling
has taken place, on the next clock cycle, the d-cache hits and
things proceed as for the load word / load byte hit described above.
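Here is a sketch of the store side of that policy, under the line-size-equals-longword assumption. The names and the separate valid bit array are mine; a real tag RAM would presumably pack a valid bit in with the 15 tag bits:

    // Hedged sketch: write-through d-cache store policy. Longword
    // stores update the line and tag; byte/word stores just invalidate
    // the line. All stores also write through to main memory.
    module dcache_store(
      input         clk,
      input         store,       // any store in execute
      input         store_long,  // it is a longword (32-bit) store
      input  [23:2] d_ad,        // effective longword address
      input  [31:0] st_data,
      output        wr_mem);     // write-through request to memory

      reg [31:0] ddata [0:127];  // 128 lines of one longword each
      reg [23:9] dtags [0:127];
      reg        valid [0:127];

      always @(posedge clk) if (store) begin
        if (store_long) begin
          ddata[d_ad[8:2]] <= st_data;
          dtags[d_ad[8:2]] <= d_ad[23:9];
          valid[d_ad[8:2]] <= 1'b1;
        end else
          valid[d_ad[8:2]] <= 1'b0;  // sub-longword store: invalidate
      end

      assign wr_mem = store;     // CPU then waits for memory's rdy
    endmodule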
Here are some rough design notes for a new system-on-a-chip I'm
designing for the XESS XSA-100 board.
I thought you might prefer to read a rough outline now than
wait, possibly forever, for a more polished form.
(As always when I report on work-in-progress, remember,
such work may never get past the work-in-progress stage,
and even then, it will not necessarily be released here.)
Questions? Comments? Discuss on the fpga-cpu list.
- Target Virtex and Virtex derivatives
- Based on gr10x0 core
- Compact, pipelined, Virtex-optimized processor
- 16- or 32-bits
- Assumes block RAM instruction store or i-cache
- Target XESS XSA-100 board
- Spartan-2 FPGA: XC2S100TQ144-5C
- 16 MB SDRAM: 4 banks of 2Mx16: HY57V281620AT-H
- 256 KB FLASH: AT49F002-90TC, with CPLD: XC9572XLVQ64-5C
- FPGA configuration storage
- Extended boot ROM
- VGA, PS/2, and parallel ports
- Architecture
- Target frequency: 67 MHz
- Two 32-bit gr1050ic cores
- SDRAM controller
- Color frame buffer -- 1024x768x8 (6-bit DAC)
- 1280x1024 if sufficient RAM bandwidth
- PS/2 keyboard/mouse interface
- SDRAM considerations
- Independent open (activated) row per each of 4 banks
- Load/store latency
- 1 cycle to form effective address
- +1 cycle to drive SDRAM CMD at IOB FFs
- +0 cycles for writes
- +3 cycles for reads
- tCL (CAS latency) = 2 cycles
- +1 cycle to move data from IOBs to data-bus/cache
- +1 cycle for 32-bit data (write/read second 16-bits)
- +4/5 cycles for bank-row miss
- possible tWR (write recovery latency) = 1 cycle
- tRP (RAS precharge latency) = 2 cycles
- tRCD (RAS CAS delay) = 2 cycles
- Row-hit-store-longword: 3 cycles
- Row-hit-load-longword: 6 cycles
- Row-miss-store-longword: 7/8 cycles
- Row-miss-load-longword: 10/11 cycles
- + occasional refresh cycles
- Addressing (see the sketch following these notes)
- Strategy: maximize row hit rate by keeping all four
banks busy
- col = ad[9:1]
- bank = ad[11:10]
- row = ad[23:12]
- Less attractive alternative (may be better if video refresh
thrashes too many banks):
- col = ad[9:1]
- row = ad[21:10]
- bank = ad[23:22]
- Memory design
- Goal: performance/area
- Small data or small code scenarios can simply use on-chip BRAM
- SDRAM capacity is nice, but as we see above, as much as 10 cycles of
latency
- General purpose system:
- Implies i-cache and d-cache needed
- Goal: minimize use of logic
- Use second port of BRAMs to avoid address or data
multiplexers
- Goal: minimize use of block RAMs
- Goal: simplicity
- Issue: i-cache organization
- Direct mapped simplest, but 2-way set associative possible
alternative if there are enough spare BRAM ports
- 1 BRAM: i-cache data: 256 16-bit instructions
- port A: CPU ifetch
- port B: memory interface
- ½ BRAM: 64/128/256 i-cache tags
- port A: i-cache controller
- Should i-cache be (a) 64 4-word lines, (b) 128 2-word
lines, or (c) 256 1-word lines?
- i-cache miss penalty: (a) 7 cycles per 4 words, (b) 5
cycles per 2 words, (c) 4 cycles per word
- i-cache hit rate: to be measured
- Critical word first?
- Issue: d-cache organization and policy
- Write through or write back?
- Write allocate? Write invalidate?
- If write-back, dirty data must be saved when a cache
line is ejected.
- That means a store requires a d-cache tag check and a
d-cache data read (which may be concurrent with the tag check), which
adds at least one cycle of latency to the operation
- Write-through is much simpler. Data in cache is never
dirty with respect to RAM (although may not be coherent with data in
another cache).
- On write-through, should update cache? Yes, when possible.
- Can a d-cache line be partially valid? No. Keep things simple for now.
- If cache line size equals write item size, can update
tag and data at store time. However, if item size is less than cache
line size, should invalidate cache line.
- To keep things simple, set cache line size to word size (32-bits).
- Therefore (32-bit system) byte and word stores invalidate the cache
line; longword stores leave the cache line valid. This is important for fast register
save/reload on function call prolog/epilog.
- Physical d-cache data width -- 16-bits?
- Pro: simple interface to 16-bit SDRAM
- Pro: simpler if second design produced with 4 gr1040ic 16-bit RISCs
- Con: extra cycle to load or store 32-bit word on cache hit
- Physical d-cache data width -- 32-bits
- Pro: 1 cycle store, 2 cycle load on cache hit
- Con: uses 2 x16 BRAMs
- Choice: 32-bit d-cache data
- Performance ideas
- On stores (through), do not stop pipeline
- Write will get to memory eventually
- Pipeline only stops if another memory
transaction (i-cache miss, d-cache load miss, another store through)
occurs
- Too complicated
- Start a speculative memory load (even before the i- or
d-cache tag check fails). Saves a cycle.
- Too complicated
- Might cause unnecessary bank row misses
- In an MP or multi-master system, could waste bandwidth
needed by other masters
- Summing up: implementation
- i-cache 1½ RAMB4_S16_S16
- Data: 256x16 organized as 64 lines of 4 instruction
words
- Tags: 64x16
- d-cache: 1½ RAMB4_S16_S16
- Data: 2x128x16 organized as 128 lines of 32-bit longwords
- Tags: 128x16
- Memory interface:
- Avoid i_ad, d_ad multiplexer: instead, on d-cache miss,
drive d_ad through preexisting i_ad multiplexer to SDRAM controller
- SDRAM controller multiplexes p1.i_ad, p2.i_ad, video.ad
addresses
- sdram_dq driven onto p1.d or p2.d buses
- Arrange for i-cache data/tags and d-cache data/tags to
be preloaded at power-on.
- SDRAM controller design
- Per-bank active row registers
- Per-bank activated registers
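To make the addressing strategy above concrete, here is a sketch of the address slicing and the per-bank row-hit check. Signal names are assumed; the real controller will differ:

    // Hedged sketch: SDRAM address mapping (bank-interleaved strategy
    // above) plus per-bank active-row registers for row-hit detection.
    module sdram_map(
      input         clk,
      input  [23:1] ad,        // byte address; SDRAM data bus is 16 bits
      input         activate,  // a row is being activated in bank `bank`
      output [8:0]  col,
      output [1:0]  bank,
      output [11:0] row,
      output        row_hit);  // CAS immediately; no precharge/activate

      assign col  = ad[9:1];   // 512 columns x 16 bits
      assign bank = ad[11:10]; // interleave banks on 1 KB boundaries
      assign row  = ad[23:12]; // 4096 rows per bank

      reg [11:0] open_row [0:3];  // per-bank active row registers
      reg [3:0]  opened;          // per-bank activated flags

      always @(posedge clk) if (activate) begin
        open_row[bank] <= row;
        opened[bank]   <= 1'b1;
      end

      assign row_hit = opened[bank] && (open_row[bank] == row);
    endmodule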
Welcome back, dear readers. I'm sorry that it has been a while.
Nios laps MicroBlaze
A reader kindly pointed out that the Altera
Nios
soft processor core is now in version 2.0, whereas Xilinx
MicroBlaze still seems to be in beta test.
The Nios interface bus, generated by the
Nios System Builder tool, is called
Avalon. Avalon now supports multiple masters and DMA.
Note the difference in strategy here. Xilinx has announced its
MicroBlaze product will use the same popular CoreConnect bus as its
Virtex-II-Pro embedded PowerPC 405s use. So presumably soft cores
for one will work with the other. In contrast, Altera is using
the non-standard (and presumably lighter weight) Avalon bus for its
soft processor core, whereas its embedded processor cores (ARM and MIPS)
use the popular AMBA bus.
Also new/improved in Nios 2.0 is support for user-defined custom instructions,
and an on-chip debug peripheral.
This app note
on Nios resource usage and performance is instructive.
And here
is an interesting recent comp.arch.embedded posting on experience with Nios.
Altera Excalibur and MIPS?
This article
(in the same Nios discussion thread as above) speculates that Altera has
"decided to can the MIPS version" of their Excalibur embedded
hard CPU core, leaving just the
ARM 922T
based products.
The Excalibur index page
no longer seems to mention any MIPS embedded hard core devices,
except for a <meta> tag that reads:
"...The three families ... Nios ... ARM ... and MIPS-based hard core embedded processor ..."
If Altera is currently focusing on ARM, this makes good sense. The initial
announcements of two hard CPU architectures (see our
coverage) were surprising since it
is very costly to support one architecture, let alone three.
Also, each additional embedded CPU line (beyond the first) further
disrupts synergies, dilutes critical mass adoptions, leads to divergent
product strategies, etc.
Perhaps Lewin A.R.W. Edwards'
scenario
is as good as any.
XESS XSA-100
I now have an XSA-100.
In late September, I designed a cheap and cheerful ASCII VGA character
display core. It uses one 512x8 RAMB4_S8_S8 block RAM as a 16R x 32C
display buffer (with one RAM port exposed to the core client, which
reads and writes ASCII characters into this buffer under any clock discipline it
chooses), and a second block RAM configured 4096x1 (RAMB4_S1) as a 96x5x8
character generator. It was a tight squeeze, but the 96 printable ASCII
characters, including true lowercase descenders, do fit nicely in 4096 bits.
This is a really handy tool for on-chip debugging -- it is compact, fast,
very simple to drop into an arbitrary embedded system design, and needs
as few as three device pins -- video, hsync_n, and vsync_n.
I also have run 16- and 32-bit pipelined versions (gr1040i/gr1050i)
of the GR CPUs on this board.
I am now working on an SDRAM controller core, part of a new system
I am designing with a 32-bit gr1050i, an i-cache, d-cache,
SDRAM interface to the 16 MB SDRAM, and a high resolution
color frame buffer.
New versions
Xilinx has just released new versions of
WebPack 4.1i (free) and
Foundation/Alliance (now called
ISE) 4.1i (not free).
Xilinx: 4.1i press release.
Murray Disman, ChipCenter:
Xilinx Delivers ISE 4.1i.
Xilinx claims all sorts of performance enhancements in 4.1i, but
since my critical-path datapaths are already (nearly) optimally floorplanned,
I'm not expecting much of an improvement, if any.
The ISE tools do have some nice new features, such as cross-probing from
the timing analyzer to the floorplanner. (You click on the
critical net and the floorplanner zooms in on that part of the
chip floorplan and highlights the path.)
WebPack is turning into quite a nice, full-featured product.
It now supports Spartan2 devices, plus VirtexE and
Virtex2 "up to 300K", and it includes Xilinx's XST HDL synthesis tool.
Soon I will try running my Synplify-Verilog-based
floorplanned datapath designs through XST. I use a lot of
explicitly instantiated primitives with explicit
RLOC and INIT attributes -- and no doubt XST has a different and
incompatible attribute syntax.
Dear friends at Xilinx: the point one i suffixes aren't fooling anyone.
Admit it, 4.1i is really 4.0 -- or maybe 11.0.
If it is a major new version, it is a .0 product.
Calling it a .1 product to make it appear that you've already
shook out the usual .0 issues won't make it so.
And the i (for internet or perhaps for fuel injection?)
moniker is so 2000.
Which reminds me of a couple of definitions from Stan Kelly-Bootle's brilliant
The Devil's DP Dictionary, (out of print), since revised and expanded as
The Computer Contradictionary:
"upgrade n. & v.trans. [From up + Latin gradus "steep incline."]
1 n. An expensive counterexample to earlier conjectures
about upward compatibility.
2 n. A painful crisis which belatedly restores one's faith
in the previous system.
3 n. To replace (obsolete stability) with something less boring. ..."
"release n. & v.trans. [Latin relaxare "to ease the pain."]
1 n. A set of kludges issued by the manufacturer which
clashes with the private fixes made by the user since the last release.
2 n., also called next release. The long-awaited panacea,
due tomorrow, replacing all previous temporary patches, fixed patches,
and patched fixes with a freshly integrated, fully tested, notarized
update.
3 v.trans. Marketing To announce the availability (of a mooted
product) in response to the release by a competitor of a product
prompted by your previous release.
Care is needed to distinguish a last release from a next release,
since the difference is more than temporal. A last release is
characterized by being punctual but inadequate, a next release
avoids both errors. Next releases are worth waiting for. [And staying
in maintenance for -JG] ..."
"Tremendous Barrier to Entry"
Anthony Cataldo, EE Times:
CEO: Altera, Xilinx hold embedded PLD keys.
'The use of programmable-logic structures by providers of
application-specific standard products will be "impossible unless they
work with ourselves or Xilinx to license the technology," Daane said.'
Rick Merritt, EE Times:
Rattling Sabers.
HardCopy
Altera: HardCopy - The Right Product at the Right Time.
Anthony Cataldo, EE Times:
Altera joins FPGA-to-ASIC drive as gate arrays come back in vogue.
'Lightspeed Semiconductor, meanwhile, is looking to undercut Xilinx in
a similar manner. The company has started taking orders for its 4Em
... devices that use the same BGA packages as Xilinx's Virtex E and
identical memory configurations. ...
Routing is done in the metal and does without the extra capacitance
of pass transistors used in FPGAs, making the architecture typically
two times faster than FPGAs, said Lyle Smith, chief scientist for
Lightspeed. ...'
'The transfer of intellectual-property cores can also get dicey if another
vendor is involved. "Realistically, we're the only people that can do
this, in a sense, legally," said Altera's Tong. "We don't eliminate
the possibility of going to an ASIC supplier; however, there is going
to have to be licensing negotiations to use our IP in an ASIC."'
[my emphasis -JG]
See also IP business models.
Anthony Cataldo, EE Times:
Altera kicks off mask-programmable PLD program.
'"For extreme volumes, standard-cell is still the best solution," Daane said.
"But for the gate array vendors and conversion vendors it's over for them."'
SignOnce
Xilinx:
Xilinx and 21 Leading Intellectual Property Providers Launch Common FPGA Licensing Program.
Murray Disman, ChipCenter:
Xilinx Announces SignOnce IP License.
Michael Santarini, EE Times:
Xilinx, IP partners unify licensing.
'"We've taken our own license and have convinced all of our AllianceCore
partners to join the program and adopt our license as their own," said
Sevcik, noting that the effort to sign up the consortium partners had
taken a year and a half. ...'
'There are two versions of the SignOnce IP License. The Site License
gives users access to the IP in question for an unlimited amount of
projects within a 5-mile radius of where the license is granted. The
Project License limits use of the IP to a single project. The licenses
are typically granted for FPGA netlist versions of a given core.'
Silaria
Chris Edwards, Electronics Times:
Proteus processor core gets embedded Linux port.
"The company has added hardware design support to let the Proteus
3's datapath use any bit width from 4- to 256-bits but will only add
compilation support for 8- to 256-bit systems with the version 2.0
compilers due out next year."
This summer, I visited an old colleague, now at Silaria Ltd. in Dublin.
Sharp team. If you're in the market for configurable processor IP,
they are worthy of consideration.
Who mourns for DS-502?
My shelves runneth over with copies of XactStep 1.x, Foundation and/or
Alliance 1.3, 1.4, 1.5, 2.1i, 3.1i, and now 4.1i, not to mention a few
Student Editions. Now that many of the old devices are obsolete,
there's little need or justification to hang on to good old
DS-VL-BAS, DS-390, DS-502, and about two shelf-feet of old Viewlogic and
XACT manuals. Not to mention the Quarterdeck QEMM386 that they
required. (I doubt QEMM will even run on current GHz, 200+ MB PCs.)
I couldn't quite bring myself to toss them, but they've been paged out.
It's sad to think of the blood, sweat, and tears that must have gone into
those earlier products, which (just a few years later) are chucked
unceremoniously into the big bit bucket of history. And it's strange to
recall just how excited I was to get to work with those products,
but now, I could not care less for them.
I mourn too for
some
of
my
own
earlier
products,
which are now, for the most part, gathering dust.
They fulfilled their purpose, they achieved their mission,
the toothpaste tube of their productivity, potential, and utility has been
squeezed out and used up. It is (more or less) stupid to waste
time with them -- there are better alternatives. They are worthless.
And forgotten are each of these products' team members,
once a close but unruly family,
building a shared vision, now scattered to the four winds;
forgotten too, the all nighters and the crushing deadlines
and the death marches,
the feature shootouts, the infinite meetings, the unfinished specs,
the sprawling milestones, the bugs uncountable and the bug triages,
the alphas, the betas, and the release candidates, the sign offs and
product launches, the fierce (yet collegial) company rivalries,
the "win all reviews" battle cries and the reviews won and lost,
the happy customers (and the unhappy ones),
and all that spent energy, and the days and nights and weekends,
the months and the marriages, now spent and dissipated,
forgotten and fading away like Apollo in Who Mourns for Adonais?.
Cold comfort: Inevitably, some code lives on. Take the C++
object model
layout code I wrote in 1990. For compatibility reasons, that code will
remain as is for the rest of time or x86 computing, whichever comes first.
Whether you run Windows apps on Windows or on WINE on Linux, your code
is rather likely to have been compiled by my code.
In fact (the last time I checked),
I use Xilinx's software to build my products;
Xilinx uses my software to build their products!
Ye olde near branch / far branch problem
Last month, on the fpga-cpu list, someone wrote:
"I am trying to implement far branches in xr16asm (XSOC
Project) and was wondering if anyone did this before. If so
it would be nice if you post some lines of code, so i can see
how it has to be done."
This issue has always annoyed me,
so I went and fixed it.
The fix was not trivial, because of these side issues.
-
We can't tell which (forward) branches are far (>127 instructions from .) until the forward label has been seen. We could simply pad all forward branches with two NOPs (reserving space that we can later overwrite at FIX_BR fixup application time) but that seemed inelegant (although it would be simpler).
-
Instead, on seeing a far branch, we must turn
bcond label
into
bnot-cond skip
imm label[15:4]
jal r0,label[3:0](r0)
skip:
and insert up to two additional instructions at the branch site. This creates two problems:
-
It disturbs 16-byte aligned functions (which have been 16-byte aligned because they may be the targets of CALL instructions).
-
Inserting two instructions may make other branches over the branch also far.
-
The symbol, fixup, and line correspondence tables have address fields which will become invalid if code is inserted at a lower address than the given address field.
So I changed the applyFixups() function to work in three phases.
-
Resolve far branch fixups. Repeatedly scan over all branch fixups in the program. If one is found to be a far branch fixup, replace it with the b<not-cond> sequence shown above, by moving subsequent instructions and data down in memory. The FIX_BR fixup to the label becomes a FIX_EA (nee FIX_LDST) fixup on the label.
-
Resolve call target 16-byte alignment constraints. For each symbol (in order of increasing address) that is constrained to be 16-byte aligned (e.g. is seen to be a CALL target), if it is not already aligned, insert enough 0s to 16-byte align it.
[[I also changed lcc-xr16 to not emit "align 16"s in the prolog of function definitions. If a function is not a CALL target, (e.g. if it is only called indirect through a function pointer via JAL), it need not be 16-byte aligned.]]
-
Resolve all fixups (just as applyFixups used to do). At this point, all remaining FIX_BR fixups are near, and all FIX_CALL fixups are to 16-byte aligned targets.
I'm going to try to wrap this fix up in a new build of the XSOC Kit
someday.
FPGA CPU News, Vol. 2, No. 10
Back issues: Vol. 2 (2001): Jan Feb Mar Apr May Jun Jul Aug Sep; Vol. 1 (2000): Apr Aug Sep Oct Nov Dec.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.