Branch instructions, delay slots, and alternatives (predication, skip)
Tommy Thorn asked (via email):
"... I have a few [xr16] questions that I haven't seen answered on fpgacpu.org:
-
Ignoring compiler issues, why wouldn't it be much simpler just to
implement delay slots for branches? The annulling logic would go
away and the branch penalty would be eliminated.
Under the Interrupts section you mention that delay slots would
complicate interrupt handling. I don't see why you couldn't just
block the interrupts while in the taken-branch delay slot, much
like what is already done for interlocked pipeline stages. (Any
branches in the delay slot should be illegal).
-
Not really a question, but it seems that a "minor" modification to
your design could improve both performance and code density:
Augment the processor state with a condition bit which encodes the
result of a set of compare instructions (ceq, cne, clt, cge, cle,
cgt, cltu, cgeu, cleu, cgtu), eg.
cge r3, r5 --> cond = (r3 >= r5) ? 1 : 0;
Allocate two bits in the opcode to encode (always, if-true,
if-false). (The fourth case can be used for the prefix which
doesn't need to be conditional).
Assuming for the moment that the instruction set could somehow be
fitted into the remaining bits, the search.asm example
[JG: see p.2 of the
series]
(slightly improved) could be compiled as follows (using "0_inst" and "1_inst"
for conditional instructions):
_search: br L3
L2: cge r9, r3
0_lw r4, 4(r4)
1_lw r4, 2(r4)
L3: ceq r4, r0
0_lw r9, (r4)
0_ceq r9, r3
0_br L2
mov r2, r4
ret
The worst case inner loop here is 6+3=9 cycles (2 cycles could be
further saved using delayed branches), compared to the original of
9+12=21, that is, three times faster. (BTW, conditional instructions
can live in delay slots without problems, unlike branch
instructions).
A slightly more complicated variant of this idea is to make
instructions always conditional and use just one bit to encode the
polarity, eg:
_search: TRUE ; force known condition context
; (eg, 0_ceq r0, r0)
1_br L3
L2: 0_cge r9, r3
0_lw r4, 4(r4)
1_lw r4, 2(r4)
TRUE ; force known condition context
L3: 1_ceq r4, r0
0_lw r9, (r4)
0_ceq r9, r3
0_br L2
1_mov r2, r4
1_ret"
Great questions. My responses:
First: why no branch delay slots? There are several issues.
-
Simplicity. I was constantly struggling to keep the articles under
their 3x3000 word budget. Having delay slots would necessarily mean
discussing them in the articles.
-
Assembler instruction scheduling. Delay slots would of course require some code in the assembler to try to move an instruction into the delay slot.
-
Code bloat. An unfillable delay slot would require a nop. In the
old days, the rule of thumb was 60-70% of the time, the first delay slot
is fillable. This can be overcome by having two branch instruction
families, one with a delay slot and one without, or perhaps one that
conditionally annuls the delay slot. In that case, though, the code
bloat is quietly redistributed across the opcode map.
-
The annulling logic would not go away, but it would be changed slightly.
Currently when a jump or taken branch executes, the two instructions in the
IF (instruction fetch) and DC (decode/operand fetch) stages are annulled.
To implement a 1-instruction delay slot,
you need only change one gate in order to annul the IF stage
instruction, while leaving the DC stage instruction alone.
(The gate in question is the OR4 feeding the EXANNUL flip-flop on page 16 of the
XSOC series.)
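To make the distinction concrete, here is a minimal sketch of the annul control, with assumed signal names (branch_taken, if_annul, dc_annul) rather than the actual XSOC netlist:

    // Hedged sketch, not the actual XSOC logic: annul control for the
    // IF/DC/EX pipeline. branch_taken means a jump or taken branch is
    // executing. With DELAY_SLOT = 1, the instruction already in DC
    // (the delay slot) completes; only the IF-stage one is annulled.
    module annul_ctl #(parameter DELAY_SLOT = 0) (
      input  branch_taken,
      output if_annul,    // annul the instruction in IF
      output dc_annul);   // annul the instruction in DC

      assign if_annul = branch_taken;
      assign dc_annul = branch_taken & (DELAY_SLOT == 0);
    endmodule

The one-gate change in the text corresponds to forcing dc_annul off when the delay slot is enabled.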
-
Interrupts. Believe me, you don't want to write an interrupt handler
that handles return-from-interrupt-into-the-branch-delay-slot! But as
Thorn suggests, interrupts could indeed be deferred while in a branch
delay slot, thus avoiding this situation.
Checking back on some mail from that time (11/17/98), it appears
that since interrupts hadn't been designed yet, I hadn't realized
that the potential problem could be overcome by simply deferring
interrupts in branch delay slots. (I have since done so, though.)
As for Thorn's second question/suggestion, his code transformation above
is quite clever and is a compelling example of the benefits of
predicated execution.
With 16-bit instruction words, I don't know if it is possible or profitable
to squeeze out a precious opcode bit, or (heavens forfend) 2, for a
predicate, alas.
I don't deny the branch instruction architecture I settled on for xr16
was a very familiar and conventional one. (And one that was easy
to figure out how to make lcc target.) I did evaluate providing only
an unconditional branch (with a larger branch displacement), plus a skip1<cond>
and skip2<cond> regime a la Philip Freidin's RISC4005, and
it did provide some performance advantages in some cases, but I did not
pursue that for some reason that I can't recall.
More on Empower! and Excalibur
Anthony Cataldo, EE Times:
Xilinx moves forward, Altera pulls back on PowerPC cores.
"Xilinx Inc. says it's on track for yearend sampling of its Virtex 2
FPGA with an embedded PowerPC core. But rival Altera Corp. has postponed
plans to offer PowerPC and MIPS-based processor cores for its FPGAs,
citing cost concerns. ..."
"... Instead, Altera has focused on its ARM9 hard processor and Nios soft
microcontroller cores, both of which are shipping now."
See also this earlier discussion.
I have started a new section,
FPGA prototyping boards,
on the Links page.
Insight Electronics'
Design Solutions
page
now lists the new
Virtex-II MicroBlaze Development Kit.
FPGA MP3
The winning entry in the
Circuit Cellar
Design Logic 2001 Contest
(Atmel FPSLIC section)
is an
MP3 player
that uses the
FPSLIC's
AVR microcontroller for interface and control,
and the integrated on-chip FPGA fabric as a 32-bit fixed-point DSP coprocessor
that decodes the MP3 data. Congratulations to Geva Patz of Brighton, MA.
Xilinx has some MP3 player
design resources,
some
app
notes,
and a cute
reference design.
These students at Florida Atlantic did one (sort of).
(Also, I thought there was an Altera Nios based MP3 project, but Google and
I can find no definitive reference to it.)
Actel: MP3 Personal Digital Players Using Actel FPGAs
-- the Rio PMP300. The Actel device acts as a parallel port,
flash controller, smart media interface, CPU interface, and misc.
glue logic; an external CPU and MP3 decoder are still required.
Actel is Dominant Supplier ....
Murray Disman, ChipCenter:
Actel Claims to Dominate MP3 Market (5/00).
e.Digital and Actel to Deliver Silicon Solution for Digital Music and Voice Recorder/Players.
Murray Disman, ChipCenter:
e.Digital and Actel Collaborate.
Celoxica has an MP3 encoder
case study.
Altera LogicLock
As a big fan of explicit floorplanning using Xilinx RPMs, I was intrigued
when Altera mentioned LogicLock.
Now here is some more
detail, an
app note, and a
methodology white paper.
"During integration and system-level verification, the performance of each module of logic is preserved."
That is so important. Recently I wrote:
"... floorplanning [using RPMs] can certainly halve critical path net delays,
and, as importantly, make the delays predictable so you have terra firma
upon which to make methodical implementation decisions."
(Although I wish I'd written sometimes instead of certainly.)
Back in April, I asked:
"(By the way, it is important that such placement constraints can be schematic and HDL source code annotations -- it is rather less useful if the constraints are applied to post-synthesis entities. Is that the case with LogicLock?)
The
app note says
"A LogicLock region can be created graphically using the the (sic) LogicLock
window in the Quartus II Floorplan Editor or in text with a Tcl script."
So the answer appears to be that LogicLock constraints are indeed
applied to post-synthesis entities.
Thus LogicLock appears to be similar to the Xilinx Floorplanner
(or perhaps the
Modular Design Flow),
and less like Xilinx RLOCs.
Not that there's anything wrong with that.
By the way, Altera has an interesting quarterly newsletter,
News and Views;
of course, Xilinx does too.
Digital Communication Technologies'
lightfoot for FPGAs.
Lightfoot data sheet.
DCT/Xilinx press release.
"Lightfoot will run on Xilinx Spartan-II, Virtex-E and Virtex-II
FPGAs. Purpose-designed for embedded systems, DCT's Lightfoot core
uses around 25,000 gates in its conventional off-the-shelf chip form,
and requires just 1710 'CLB slices' of Xilinx logic in its IP form."
Peter Clarke, EE Times:
Startup builds Java processor with ARC core.
'The core, called Bigfoot, complements Digital Communication Technologies
Ltd.'s current offering, a Java and C-language CPU based on a stack
architecture, known as Lightfoot. ... '
'"We realized from the start that a Java-only processor had no chance,"
said Turner. ... "We realized that some applications were using a Dhrystone
benchmark even though they also wanted Java performance," he said. "That's
the area in which Lightfoot can't compete. With Bigfoot we can get the
Dhrystone benchmark inherently [from the ARC core]."'
From Drinking the .NET Kool-Aid,
Don Box at Conference.NET:
"... if you talk to people who really look at the trends in this industry
... it feels ... we are moving a world that there are basically
two places that code runs -- the JVM and the CLR. ..."
OK, so Don wasn't speaking to embedded system development.
But long term, J2ME and/or
.NET Compact Framework
support will be essential in heavier-weight embedded systems.
But as I ask in Java processors,
"The fundamental design issue is how sophisticated is your download-time
translation pass over the bytecodes, in order to canonicalize or
regularize them?"
Altera has posted tens of MB of detail on their Excalibur
(ARM hard core and Nios soft core) embedded processors on their
Excalibur literature
site. The
ARM-Based Embedded Processor Device Overview Data Sheet
provides a concise overview of the EPXA (embedded ARM) family.
".. the embedded processor domain and PLD domain can be asynchronous,
to allow optimized clock frequencies for each domain."
Note that Altera is once again offering free
Excalibur Embedded Processor Workshops.
I've signed up for the one in my area.
(Back in August 2000, I attended one of the first generation Nios workshops.
It was terrific, very hands-on, and time well spent.)
Nick Heaton and Ed Flaherty, Integrated System Design:
A Verification Environment for an Embedded Processor.
Great detail on the design and verification of the Altera EPXA10,
including how the hard microprocessor IP "stripe" interfaces to the
programmable logic.
"The Excalibur physical architecture consists of single and dual port
RAM, a 300k-gate standard cell area, an embedded processor IP core, and
one-million gates of APEX 20KE PLD all on a single chip, representing
some 80 million transistors. ..."
"In this context the "stripe" refers to the custom embedded logic area
along one edge for the Excalibur die as distinct from the PLD area. The
separation is required because of the fundamentally different ways in
which the two areas are designed and passed through layout."
Anthony Cataldo, EE Times:
Compiler that converts C-code to processor gates advances.
"But if the technology preview is any indication of what's coming, it's
only a matter of time before processors are considered just another
building block for system-on-chip design, just as transistors are for
today's devices. Moreover, system architects won't be beholden to a
select group of processor designers, Rowen said."
Altera:
Quartus II Design Software Now Available in Web Edition.
Altera Quartus II Web Edition
joins
Xilinx ISE WebPack
in offering a free environment providing "design entry, HDL synthesis, place and route, verification, and programming."
Altera Program License Subscription Agreement, part 9, TalkBack Feature Notice:
"The TalkBack feature, included with the Licensed Program(s), enables ALTERA to receive limited information concerning your compilation of logic designs (but not the logic design files themselves) using the Licensed Program(s). ..."
"You may disable/enable the TalkBack feature at any time by running qtb_install.exe located in your quartus/bin folder."
Pushing on a rope
Shucks, somehow I never noticed this.
Tom Cantrell, Circuit Cellar Online (4/01):
DesignCon Fusion:
Shades of Gray.
'In a process Gray described as "pushing on a rope," the HDL had to be
iteratively modified until it produced the optimized logic he expected.'
'In one example, after "much experimenting," he came up with nonobvious
HDL that was able to compel the synthesizer to generate an efficient
combined adder/subtractor rather than the separate adder, subtractor,
and multiplexer generated by a more readable version.'
(This was particularly a problem when trying to inject carry-in and
capture carry-out.)
Minimalism
On the fpga-cpu list, in a discussion of minimalist FPGA CPUs, Tim Boescke
asked:
"Jan Gray somehow managed to fit a 4 operator ALU into a
single LUT per bit. (In the GR0040) Is there any documentation on
this ? So maybe there is a way to reduce the ALU size to 6 CLBs."
Some of the background is here.
Some of it is hinted at in the GR0000
paper in section 3.12.
Now then. Say you build a 4-bit ALU using this technique (4 LUTs).
And say you attach that to a 16 entry x 4-bit LUT RAM (another 4 LUTs, or
8 if you make it dual-port RAM). And say you add a 2-bit counter (2 LUTs)
to sequence through LSB addresses 00, 01, 10, and 11, to that LUT RAM.
Add a LUT and FF to handle carry-in.
Now you have a simple datapath with 4 16-bit registers, nybble-serial,
that should easily run at 100 MHz (25 MHz for each 16-bit operation).
Total cost of datapath: 11-15 LUTs (3-4 Virtex CLBs; 2 Virtex2 CLBs).
To hook that up to a 512x8 or 256x16 BRAM for program and data storage,
you need another 8 or 9 FFs for a PC and/or address register. These
FFs can share the same handful of CLBs with the aforementioned LUTs.
The instruction register can be the BRAM output register.
Add a few LUTs for minimal instruction decoding, and you're most of the
way to a very compact (and extremely austere) processor.
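Here is a rough Verilog sketch of that datapath. The names and the op encoding are my assumptions for illustration -- a guess at one plausible arrangement, not the actual GR0040 logic:

    // Hedged sketch of the nybble-serial datapath: 16x4 LUT RAM
    // register file (4 x 16-bit registers), 4-bit ALU, 2-bit nybble
    // counter, and a carry flip-flop.
    module nybble_dp(
      input            clk,
      input      [1:0] op,      // assumed: 00 add, 01 sub, 10 and, 11 or
      input      [1:0] ra, rb,  // source/destination register numbers
      input            go);     // run one 4-cycle, 16-bit operation

      reg  [1:0] nyb = 2'b00;   // nybble counter: LSBs of LUT RAM address
      reg        cy;            // carry FF between nybbles
      reg  [3:0] rf [0:15];     // 16x4 LUT RAM: 4 regs x 4 nybbles each

      wire [3:0] a = rf[{ra, nyb}];
      wire [3:0] b = rf[{rb, nyb}];

      // 4-bit ALU; the single-LUT-per-bit trick folds these cases into
      // one LUT plus the carry chain, which this sketch hand-waves.
      wire [4:0] sum = a + (op[0] ? ~b : b) + ((nyb == 0) ? op[0] : cy);
      wire [3:0] f   = op[1] ? (op[0] ? (a | b) : (a & b)) : sum[3:0];

      always @(posedge clk) if (go) begin
        rf[{ra, nyb}] <= f;     // writeback: ra op= rb, a nybble per clock
        cy  <= sum[4];
        nyb <= nyb + 2'b01;     // sequence LSB addresses 00, 01, 10, 11
      end
    endmodule

At 4 clocks per 16-bit operation, this lines up with the 100 MHz / 25 MHz figures above.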
Bumming LUTs
In response to the above, Tim Boescke
wrote back:
"...the PC could be mapped to the registerfile, but this
would double the amount of cycles per instruction. (well, 8
cycles for a dual ported registerfile, 12 cycles for single ported) ..."
Bingo! That's the right mindset indeed. Keeping the PC in the register
file to save area is a technique I used in my
first FPGA CPU
seven years ago.
It may also be necessary to put 0 in the regfile (leaving 2 (or perhaps it is 2
3/4) 16-bit regs (or 6 8-bit regs, if you prefer)). Then IR=MEM[PC+=1]
becomes rf[PC] += rf[zero] + cin=1 across 4 cycles. As the ALU output
nybbles go by, you latch them in your nybble-serial-to-parallel FF-based
address register, that drives address lines to the BRAM. And perhaps you
can save more registers and a mux, if the data bus is only 4-bits wide,
if you use a RAMB_S16_S4 or something like that. (Hands wave furiously.)
(If this is all too complex, by all means, build a conventional 8-
or 16-bit high regfile and ALU -- I just wanted to demonstrate that it
is possible to build a minimal slow austere 16-bit datapath in just 2
Virtex2 CLBs.)
Another gate bumming idea: the output network of a RAM is a mux.
Use it before you use LUTs for muxes. In the XSOC/xr16 design, moving
the video address register into the PC register file cost a column
of LUT RAM, but saved a column of LUTs on a mux, and another column on
the video address incrementer (by reusing the PC adder/incrementer).
Another example: rather than design a video character generator ROM
(using BRAM) that puts out an 8-bit byte, and then sending *that* through
an 8-1 mux to drive pixels to the screen, instead configure the BRAM as
a RAMB4_S1, e.g. with a 1-bit output. The 8-1 mux disappears into the
BRAM's internal output mux logic.
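For instance, a character generator along those lines might look like this sketch. The glyph_base packing is hypothetical; the point is that pix_col lands in the low address bits, so the 8-1 mux is just the BRAM's column decode:

    // Hedged sketch of a 4096x1 character generator ROM in one block
    // RAM (RAMB4_S1 style). One pixel per clock; no external 8-1 mux.
    module chargen(
      input        clk,
      input  [8:0] glyph_base,  // assumed: f(char code, scan row)
      input  [2:0] pix_col,     // pixel column within the glyph row
      output reg   pixel);

      reg font [0:4095];                    // infers a 4096x1 block RAM
      initial $readmemb("font.mem", font);  // hypothetical font bitmap

      always @(posedge clk)
        pixel <= font[{glyph_base, pix_col}];
    endmodule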
Fun stuff. Brings back the old days of assembler one-upsmanship, striving
to bum an instruction or a cycle from a little compute kernel. "What's
the most compact sequence to do atoi, or itoa, or what have you..."
I feel sorry for anyone starting in computing today who thinks 4 KB
is nothing. They missed all the fun.
This thread reminds me of one of my favorite essays,
Simple is Beautiful.
FPGAs free to a good home
A while back, Gray Research bought 300 XCS20TQ144-3C (date code 9829)
in carriers of 60, for $0.50 each, in an auction. We don't need them
all. We'll donate up to 240 of them, in lots of 60, to one or more
universities. If you are a university professor based
here in the US, and you're reasonably sure you will put them to a good
use, just send me a very brief note
of proposal ("Dear Jan, we sure could use n of your FPGAs!")
and they're yours (while supplies last).
Remember that an XCS20 is a 5V FPGA similar to an XC4010, with 20x20 CLBs,
2 LUTs per CLB, and lists at $28 each per
Avnet.
Note these are untested, fine pitch PQFPs, and you'll probably have to bake
them before you use them.
Ernie Coombs
Goodbye, and thanks, dear
Mr. Dressup.
Tom Murphy, Electronics News:
ARM Core Boosts Altera's ASIC Alternative.
'"What you get is best-in-class performance that you would expect from an
ASIC without the minimum order quantities a foundry would expect or the
lengthy fab cycles," Chiang said. "In addition, there are no licensing
fees to pay to ARM. That is included in our pricing structure."'
Brian Dipert, EDN (2/01):
Do combo chips compute (or even compile)?.
"Neither your logic in the programmable-logic partition nor the
Altera-supplied soft IP also potentially stored in that partition
directly interfaces to the CPU core. Instead, the hookup is through
master-and-slave bridges and dual-port SRAM (different from the
embedded-array-block memory that the Apex 20K array contains)."
FPGA CPUs in Education
From Prof. Chris Meyers' Univ. of Utah
CS/EE 3710
course description:
"During this class you will design and implement a microprocessor in a
team of three. Note that grades will still be given individually, not just
to the team. You will be given 3 benchmark codes in C. Your job will be
to design an instruction set, determine a microprocessor architecture,
simulate and test using VHDL, and implement in FPGAs. The better your
design performs, the better your grade. Therefore, you should consider
advanced architecture features such as pipelining, branch prediction,
hardware multipliers, etc." [my emphasis -JG]
Fantastic.
Here's a similar sentiment from my paper on teaching computer architecture
"hands-on" with FPGA CPUs:
"Our favorite idea simulates the competitive processor design industry.
Student teams are issued a CPU design kit, including computer tools,
a working, non-pipelined processor core, a benchmark suite, and an
FPGA board, which runs "out of the box", and are instructed to evolve
and optimize their processor ... to run the benchmark suite as fast
as possible ... At end of term, teams submit their designs and vie for
the coveted "fastest CPU design" trophy. This sort of project could
uniquely motivate students to practice all manner of quantitative
analysis and design activities."
See also some of the other teaching links.
In particular, I still get a kick out of the Hiroshima University
City-1 pages.
Disman
Murray Disman, ChipCenter:
Xilinx Shipping MicroBlaze.
"The problem with these comparisons is that Xilinx is basing its results on the Nios 1.0 core."
Murray Disman, ChipCenter:
Altera Announces Nios 2.0.
"The number of logic elements required for the 16-bit bus implementation
has been decreased from 1100 to 900, while the elements needed for
the 32-bit design has dropped from 1700 to 1250. At the same time, the
processor speed has been increased from 33 MHz for Nios 1.0 to 80 MHz
for version 2.0."
Rather impressive improvements. Competition is good.
Are they apples-to-apples? (Frequency more than doubling in same device?)
So we have claims of 900 LUTs, 125 MHz, and 82 D-MIPS for Xilinx MicroBlaze 1.0,
and reports of 1250 LEs, 80 MHz, and
(EE Times:)
40 Dhrystone MIPS for Altera Nios 2.0.
(For comparison, gr1040: <180 LUTs; gr1050: <300 LUTs; north of 67 MHz
in Virtex-E; D-MIPS...?)
That said, I am inclined to side with this sentiment in the
same article:
'"We realized that it's not about the instruction set. It's about how easy it is to use and put the systems together and compile down," said Jordan Plofsky, senior vice president of embedded-processor products at Altera.'
Now in my multiprocessor-SoC experiments, the limiting density factor has
not been LUT counts but block RAM ports.
You can put 60 16-bit RISCs in a 72-block RAM XCV600E, but only if each
one uses/shares a total of about one block RAM (two ports) each.
Therefore, I encourage FPGA CPU vendors and users alike to quote both
LUT counts and block RAM / ESB counts on CPU cores. For example,
the smallest gr1040 that might make sense requires <180 LUTs and 1
block RAM as an on-chip instruction/data store. In comparison, the Nios
resource usage
app note
states that a 16-bit Nios 1.1.1 reference design, on Apex-II,
uses 31,488 EAB/ESB bits. At 4,096 bits per ESB, that's at least 8 ESBs.
Murray Disman, ChipCenter:
Altera Unveils HardCopy Program.
"Conversions of high-density FPGAs will make economic sense for production quantities in the low 100s, since the HardCopy device will typically cost about $1000 less than the FPGA it replaces."
Murray Disman, ChipCenter:
QuickLogic Ships MIPS-Based Hybrid FPGA.
"The MIPS program at Altera seems to have been placed on hold while the company concentrates on its other Excalibur products."
Legacy ISA IP companies: please write
me if you'd like to explore contracting
for a competitive (compact, fast) Xilinx-optimized reimplementation
of your architecture. Don't lose design wins for want of an FPGA solution.
Sacrificing silicon at the altar of programmability
Recently in comp.arch.fpga, "Dennis" wrote:
"I am an ASIC Designer trying to understand FPGAs. While going through
Xilinx Datasheets, I got some clues (although not fully understood)
about the definition of System Gates Capability of a particular
product. I wish to understand, How much Silicon is sacrificed for the
sake of Programmability? , for example: For a 315K equivalent Gates in
XC2V300, How Many ASIC Gates(2 input NAND) have been put in the
Silicon????"
Ray Andraka replied:
"The truth is, these marketing gates have been badly perverted by
oneupsmanship, to the point that on chip memory and extra features more or
less dominates the figure. A better measure of FPGA capability is a count
of the number of logic elements (consisting of a 4-LUT and flip-flop in
many devices), and then season that with any special features to the
extent that they BENEFIT YOUR application."
I replied:
I agree with Ray. But see also my weblog entry
marketing gates redux.
And read Peter Alfke's definitive
posting on this subject.
There Peter figured each logic cell would be worth about 12 ASIC gates
(6 for the LUT and 6 for the FF).
I thought a ballpark answer to the question might be interesting. Here
follow some educated guesses, but none based upon actual data from actual
shipping devices.
In the book "Architecture and CAD for Deep-Submicron FPGAs", Betz, Rose, and
Marquardt, Appendix B, pp.207-220, the authors provide a design for a
generic CLB 'cluster' of four 4-LUTs and FFs that occupies 1678 'minimum
width transistor areas'. (Each 4-LUT is 167 'MWT's.) That doesn't count the
myriad transistors in each cluster's programmable interconnect (routing
channels) -- configuration SRAM cells, switches, buffers etc.-- which I have
read (somewhere) can be 4X more transistors than the CLB cluster itself. So
let's say a tile with 4 4-LUTs, and its programmable interconnect, could
require 8,000 transistors -- that's to implement about 40-something ASIC
gates of logic and wiring -- call it 200 transistors per ASIC gate.
If you figure a CMOS 2-input NAND is four transistors, then at 200
transistors per NAND, it works out to a 50-1 'transistor overhead' for
programmability. That sounds bad, but remember that FPGA transistors are
typically manufactured in the latest and greatest processes, so often they
are smaller, faster, and cheaper than ASIC transistors.
Let's check our figures another way. Last year, Steve Young of Xilinx was
quoted
as saying that this year Xilinx Virtex-II designs would get up
into the 500 million transistors zone. Doing the math with a 2V6000
or a 2V10000 type device, this too indicates that several thousand
transistors go into each logic cell (plus its share of the routing
and RAM), or several hundred per equivalent ASIC gate.
And here's a third approach. A 2V10000 requires a 33.5 Mb configuration
bitstream. Assuming each bit is stored in a 6 transistor SRAM cell, and
each configuration bit drives only a single pass transistor, (way too
conservative), that's 33.5M*7 = at least 250 million transistors for the
123,000 LUTs = >2000 transistors per LUT, or again several hundred
transistors per ASIC gate.
MicroBlaze/Nios News Links
Crista Souza, EBN:
PLDs make inroads with processor designs.
Anthony Cataldo, EE Times:
Altera, Xilinx heat up processor-core fray.
Xilinx MicroBlaze
Xilinx Ships MicroBlaze.
Congratulations, you guys!
MicroBlaze page.
Performance data.
Forum.
Peripherals.
CoreConnect Technology.
CoreConnect architecture (IBM).
CoreConnect license (PDF).
Register and Download.
Literature:
Getting Started Guide;
Hardware Reference Guide;
Software Reference Guide.
"MicroBlaze delivers over three times the performance in less than half the size of competing soft processors."
82 "dhrystone MIPS" is excellent performance. I have to assume that
number represents running dhrystone entirely out of on-chip block RAM.
Presumably Xilinx's comparable Altera Nios numbers are for a Nios system
running code and data out of off-chip RAM. On the other hand, since Nios
uses several ESBs for the processor implementation itself, that
leaves fewer such block RAMs for hosting, er, benchmark memory images.
FPGA CPU benchmarking:
"For a while to come, expect apples-to-oranges data that warrant considerable
skepticism. Company #1 will present simulated results for their core for
their fastest speed grade parts (expensive unobtainium), running entirely
on-chip, with programs and data in on-chip block RAM, on their best case
inner loops. Company #2 will present measured results in the context of a
real low-cost system using last year's slowest-speed grade device, running
standard benchmark programs out of external RAM."
The performance page says MicroBlaze runs at 125 MHz in the fastest
speed grade of Virtex-II, but only 65 MHz in the fastest speed grade
of Spartan-II. Is it valid to assume that the performance in Spartan-II is
no better than 82*65/125 == ~43 D-MIPS?
'"According to some industry analysts, the processor based field-configurable
market is expected to grow to $235 million by 2004," said Babak Hedayati,
senior director of Product Solutions Marketing at Xilinx.'
This all comes back to last year's
IP business models discussion.
Is this new sales of FPGA CPU IP cores? $235 M / $500-$5000 == 50,000-500,000 design
wins? Doubtful!
Is this $235 M additional sales of programmable logic devices?
More likely.
Peripherals
The MicroBlaze platform includes an impressive
portfolio
of peripherals, all using the CoreConnect OPB V2.0 bus.
Of course, these and other peripherals also interop with the PowerPC
embedded hard cores coming in the Virtex-II Pro product.
It's quite an impressive story.
If you want to go into the IP business, selling peripherals cores
for the Xilinx platform, it's time to crack open some specs and
learn all about the OPB bus.
Altera
Altera Rolls Out ARM-based Excalibur Product Family.
Altera's Excalibur Development Kit Speeds ARM-Based SOPC Designs.
"Available immediately, the Excalibur System Development Kit includes a development board with the 200 MHz EPXA10 ARM-based Excalibur device that supports complex SOPC designs of up to 38,400 logic elements (1 million system gates)."
Altera Enhances Nios Soft Processor for High-Bandwidth Applications.
"After selling more than 2,500 Nios embedded processor development kits since its introduction in June 2000..."
Congratulations, Altera. That tops the >2,000 downloads of XSOC since
its introduction in March 2000.
Such a great business model, redux
Now the FPGA CPU / SoC lines are drawn for real. In one corner, Altera.
In the other, Xilinx. (In the third, independent IP providers.)
Now these two PLD giants can use their extensive (and presumably
high quality, well supported) cores and tools as velvet shackles
to lock you into their PLD architectures for the rest of time.
MPF correspondent wanted
I will not be at MPF's FPGA CPU session tomorrow to share in the fun.
If you attend, would you please be so kind as to share with
us your impressions of the session highlights? Thank you.
FPGA CPUs at MPF
Next week is
Microprocessor Forum 2001.
Tuesday Session 3,
is Microprocessors in Programmable Logic, moderated by Cary Snyder.
QuickLogic will present on QuickMIPS;
Altera has two presentations, presumably on their Nios 2.0 soft core
and ARM hard core products, and Xilinx two more on MicroBlaze and
Virtex-II Pro.
"The MicroBlaze Soft Processor
Reno Sanchez, Engineering Site Manager, Xilinx, Inc."
"Xilinx agrees that soft processors have a valuable place on FPGAs and the
company is ready to roll out its soft processor, dubbed MicroBlaze. MPF
2001 attendees will be the first to hear about the architectural details
which enable MicroBlaze to provide a two fold performance improvement
over all existing soft processors."
More on the new SoC design
Continuing design notes from last time. (Today's entry won't make
sense if you haven't yet read those earlier notes.)
Last time, I wrote this design will have 2 32-bit CPUs. Scratch that.
It may have up to 4. LUT-count wise, this is not a problem, as each core
is only about 300 LUTs, and the XC2S100 has 20x30x4 = 2400 LUTs. However,
if each cache-enhanced gr1050c processor requires 3 block RAMs,
the requisite 3x4 block RAMs exceed the device's available 10 block RAMs.
Fortunately, each gr10x0c can make do with just 2½ BRAMs: ½ for i-cache tags,
½ for i-cache data, ½ for d-cache tags, and 1 for d-cache data.
I was going to use one BRAM port for reading instructions from the i-cache
data RAM, and a second BRAM port on that same RAM for writing instructions
(on a cache miss line fill). But this can be done with just one port,
plus two 4-LUT muxes to count up the two LSBs of the
i-cache data address during cache refill.
The principal effect of this change is to reduce the i-cache from
64 lines of 4 instructions, to just 32 lines of 4 instructions, with some
(as yet unmeasured) reduction in i-cache hit rate.
Indeed, with so few cache lines, we may be better off with 64 lines of
2 instructions instead.
Today's design notes reflect this change.
Further i-cache design notes
First, some background. The gr10x0c is based upon the gr10x0i core.
This core closely resembles the gr0041 core described in the GR CPUs pages, except
- it is parametric in width -- the gr1040i is 2^4 == 16 bits wide, the gr1050i
is 2^5 == 32 bits wide;
- it is parametric in aspect ratio -- the 32-bit gr1050i can be made 8 rows of CLBs tall, or 16 rows tall; and most importantly
- it is pipelined -- there is a pipeline register immediately before the register file access.
There are two pipeline stages: decode, and execute. There is no fetch
stage, because the GR CPU designs assume a block RAM based instruction
store or instruction cache. (Recall that block RAMs have 0 clocks of
access latency: you present a valid address immediately ahead of the
clock edge, and a few ns later you have the corresponding data.)
In the present design, the new instruction address i_ad is presented
to the i-cache data and tag block RAMs just ahead of clock. After the
clock edge, the fetched instruction is decoded. Concurrent with this
decoding is the i-cache tag check. If the i-cache tag (read from
tag RAM at address i_ad[7:3]) differs from i_ad[23:8], this signals
an i-cache miss. The instruction being decoded is invalid and
its effects are annulled.
On an i-cache miss, the CPU core sends a 'read 4-words' command
to the memory system. As the memory system acknowledges each
word, these instructions are deposited in the right cache line
in the i-cache data memory. Upon receiving the final word in the
read transaction, the CPU core updates the i-cache tag to indicate the
corresponding cache line now holds a valid copy of the line.
To keep this first cut simple, we do not do so-called
critical-word-first line fill. Rather we read each of the
four instructions, in order, starting at the line-aligned address (0 mod 8).
All the while that read is happening, the CPU core stalls.
Well, not exactly. In fact, it sits there reading the same
cached instruction word and tag. Over and over, the tag doesn't match
and the fetched instruction is annulled. Eventually the memory
subsystem delivers the cache line, and the tag is updated.
On the next clock edge, the tag compares equal and the just-fetched
instruction is not annulled, and execution continues.
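A sketch of that tag check and stall-by-annul loop, with assumed signal names and the field widths given above (32 lines of 4 instructions, 24-bit addresses); the real gr10x0c logic surely differs in detail:

    // Hedged sketch of the i-cache tag check. A mismatch annuls the
    // instruction in decode and requests the 'read 4 words' fill; the
    // CPU keeps re-fetching the same word until the tag finally matches.
    module icache_ctl(
      input         clk,
      input  [23:1] i_ad,       // next instruction (byte) address
      input         fill_done,  // memory wrote the last word of the line
      output        miss,       // annul the instruction now in decode
      output        read4);     // request a 4-word line fill

      reg [23:8] itags [0:31];  // 32 tags: half a block RAM
      reg [23:8] tag;           // tag read, registered like a BRAM output
      reg [23:1] dc_ad;         // address of the instruction in decode

      always @(posedge clk) begin
        tag   <= itags[i_ad[7:3]];          // tag fetch alongside ifetch
        dc_ad <= i_ad;
        if (fill_done)
          itags[dc_ad[7:3]] <= dc_ad[23:8]; // mark the line valid
      end

      assign miss  = (tag != dc_ad[23:8]);
      assign read4 = miss;      // (a real design pulses this once per miss)
    endmodule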
This all works only because we use a single BRAM port to update
the i-cache data, and a second single port to update the i-cache tag.
When using block RAM, beware writing and reading to the same
address, via two different ports, on the same clock cycle.
Spartan-II data sheet, section Design Considerations/Using Block RAM Features/Conflict Resolution:
"If one port attempts a read of the same memory cell
the other simultaneously writes, violating the
clock-to-clock setup requirement, the following occurs.
- The write succeeds
- The data out on the writing port accurately reflects
the data written.
- The data out on the reading port is invalid."
Further d-cache design notes
Now if the instruction in the execute pipeline stage is a load
or store, we have some more work to do.
In the first clock cycle of such instructions, we compute the effective
address to d_ad, and present it to the d-cache data and tag
RAMs.
For longword stores, we drive the data to be stored to the result bus
and then into the d-cache data RAM; and write the new d_ad[23:9]
to the d-cache tag RAM.
Since (as we discussed last time), the d-cache design does not allow
partially valid d-cache lines, for word or byte stores, we do
not store the data to the d-cache. Instead, we mark the line as invalid.
For all flavors of stores, the store data are written-through to
main memory. The CPU waits for a rdy acknowledgement from
memory before advancing to the next instruction.
Loads are more complicated. Loads may hit in the d-cache, or may miss.
On a load longword hit, the read data is simply driven onto the
3-state result bus.
On a load word or load byte hit, the read data is properly aligned and
then driven onto the result bus. These loads do zero fill, so the
upper 16- or 24-bits are driven with 0.
On a load miss, whether it be a byte, word, or longword access,
the CPU loads a full 32-bit longword from RAM, drives it onto
the result bus, and writes it to the d-cache data RAM.
On a load longword miss, the instruction completes in that same cycle,
since the result bus already has the desired longword data.
On a load word or load byte miss, once the above multicycle miss handling
has taken place, on the next clock cycle, the d-cache hits and
things proceed as for the load word / load byte hit described above.
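Here is a sketch of the store side of that policy, under the line-size-equals-longword assumption. The names and the separate valid bit array are mine; a real tag RAM would presumably pack a valid bit in with the 15 tag bits:

    // Hedged sketch: write-through d-cache store policy. Longword
    // stores update the line and tag; byte/word stores just invalidate
    // the line. All stores also write through to main memory.
    module dcache_store(
      input         clk,
      input         store,       // any store in execute
      input         store_long,  // it is a longword (32-bit) store
      input  [23:2] d_ad,        // effective longword address
      input  [31:0] st_data,
      output        wr_mem);     // write-through request to memory

      reg [31:0] ddata [0:127];  // 128 lines of one longword each
      reg [23:9] dtags [0:127];
      reg        valid [0:127];

      always @(posedge clk) if (store) begin
        if (store_long) begin
          ddata[d_ad[8:2]] <= st_data;
          dtags[d_ad[8:2]] <= d_ad[23:9];
          valid[d_ad[8:2]] <= 1'b1;
        end else
          valid[d_ad[8:2]] <= 1'b0;  // sub-longword store: invalidate
      end

      assign wr_mem = store;     // CPU then waits for memory's rdy
    endmodule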
Here are some rough design notes for a new system-on-a-chip I'm
designing for the XESS XSA-100 board.
I thought you might prefer to read a rough outline now than
wait, possibly forever, for a more polished form.
(As always when I report on work-in-progress, remember,
such work may never get past the work-in-progress stage,
and even then, it will not necessarily be released here.)
Questions? Comments? Discuss on the fpga-cpu list.
- Target Virtex and Virtex derivatives
- Based on gr10x0 core
- Compact, pipelined, Virtex-optimized processor
- 16- or 32-bits
- Assumes block RAM instruction store or i-cache
- Target XESS XSA-100 board
- Spartan-2 FPGA: XC2S100TQ144-5C
- 16 MB SDRAM: 4 banks of 2Mx16: HY57V281620AT-H
- 256 KB FLASH: AT49F002-90TC, with CPLD: XC9572XLVQ64-5C
- FPGA configuration storage
- Extended boot ROM
- VGA, PS/2, and parallel ports
- Architecture
- Target frequency: 67 MHz
- Two 32-bit gr1050ic cores
- SDRAM controller
- Color frame buffer -- 1024x768x8 (6-bit DAC)
- 1280x1024 if sufficient RAM bandwidth
- PS/2 keyboard/mouse interface
- SDRAM considerations
- Independent open (activated) row per each of 4 banks
- Load/store latency
- 1 cycle to form effective address
- +1 cycle to drive SDRAM CMD at IOB FFs
- +0 cycles for writes
- +3 cycles for reads
- tCL (CAS latency) = 2 cycles
- +1 cycle to move data from IOBs to data-bus/cache
- +1 cycle for 32-bit data (write/read second 16-bits)
- +4/5 cycles for bank-row miss
- possible tWR (write recovery latency) = 1 cycle
- tRP (RAS precharge latency) = 2 cycles
- tRCD (RAS CAS delay) = 2 cycles
- Row-hit-store-longword: 3 cycles
- Row-hit-load-longword: 6 cycles
- Row-miss-store-longword: 7/8 cycles
- Row-miss-load-longword: 10/11 cycles
- + occasional refresh cycles
- Addressing (see the sketch following these notes)
- Strategy: maximize row hit rate by keeping all four
banks busy
- col = ad[9:1]
- bank = ad[11:10]
- row = ad[23:12]
- Less attractive alternative (may be better if video refresh
thrashes too many banks):
- col = ad[9:1]
- row = ad[21:10]
- bank = ad[23:22]
- Memory design
- Goal: performance/area
- Small data or small code scenarios can simply use on-chip BRAM
- SDRAM capacity is nice, but as we see above, as much as 10 cycles of
latency
- General purpose system:
- Implies i-cache and d-cache needed
- Goal: minimize use of logic
- Use second port of BRAMs to avoid address or data
multiplexers
- Goal: minimize use of block RAMs
- Goal: simplicity
- Issue: i-cache organization
- Direct mapped simplest, but 2-way set associative possible
alternative if there are enough spare BRAM ports
- 1 BRAM: i-cache data: 256 16-bit instructions
- port A: CPU ifetch
- port B: memory interface
- ½ BRAM: 64/128/256 i-cache tags
- port A: i-cache controller
- Should i-cache be (a) 64 4-word lines, (b) 128 2-word
lines, or (c) 256 1-word lines?
- i-cache miss penalty: (a) 7 cycles per 4 words, (b) 5
cycles per 2 words, (c) 4 cycles per word
- i-cache hit rate: to be measured
- Critical word first?
- Issue: d-cache organization and policy
- Write through or write back?
- Write allocate? Write invalidate?
- If write-back, dirty data must be saved when a cache
line is ejected.
- That means a store requires a d-cache tag check and a
d-cache data read (which may be concurrent with the tag check), which
adds at least one cycle of latency to the operation
- Write-through is much simpler. Data in cache is never
dirty with respect to RAM (although may not be coherent with data in
another cache).
- On write-through, should update cache? Yes, when possible.
- Can a d-cache line be partially valid? No. Keep things simple for now.
- If cache line size equals write item size, can update
tag and data at store time. However, if item size is less than cache
line size, should invalidate cache line.
- To keep things simple, set cache line size to word size (32-bits).
- Therefore (32-bit system) byte and word stores invalidate the cache
line; longword stores leave the cache line valid. This is important for fast register
save/reload on function call prolog/epilog.
- Physical d-cache data width -- 16-bits?
- Pro: simple interface to 16-bit SDRAM
- Pro: simpler if second design produced with 4 gr1040ic 16-bit RISCs
- Con: extra cycle to load or store 32-bit word on cache hit
- Physical d-cache data width -- 32-bits
- Pro: 1 cycle store, 2 cycle load on cache hit
- Con: uses 2 x16 BRAMs
- Choice: 32-bit d-cache data
- Performance ideas
- On stores (through), do not stop pipeline
- Write will get to memory eventually
- Pipeline only stops if another memory
transaction (i-cache miss, d-cache load miss, another store through)
occurs
- Too complicated
- Start a speculative memory load (even before the i- or
d-cache tag check fails). Saves a cycle.
- Too complicated
- Might cause unnecessary bank row misses
- In an MP or multi-master system, could waste bandwidth
needed by other masters
- Summing up: implementation
- i-cache 1½ RAMB4_S16_S16
- Data: 256x16 organized as 64 lines of 4 instruction
words
- Tags: 64x16
- d-cache: 1½ RAMB4_S16_S16
- Data: 2x128x16 organized as 128 lines of 32-bit longwords
- Tags: 128x16
- Memory interface:
- Avoid i_ad, d_ad multiplexer: instead, on d-cache miss,
drive d_ad through preexisting i_ad multiplexer to SDRAM controller
- SDRAM controller multiplexes p1.i_ad, p2.i_ad, video.ad
addresses
- sdram_dq driven onto p1.d or p2.d buses
- Arrange for i-cache data/tags and d-cache data/tags to
be preloaded at power-on.
- SDRAM controller design
- Per-bank active row registers
- Per-bank activated registers
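To make the addressing strategy above concrete, here is a sketch of the address slicing and the per-bank row-hit check. Signal names are assumed; the real controller will differ:

    // Hedged sketch: SDRAM address mapping (bank-interleaved strategy
    // above) plus per-bank active-row registers for row-hit detection.
    module sdram_map(
      input         clk,
      input  [23:1] ad,        // byte address; SDRAM data bus is 16 bits
      input         activate,  // a row is being activated in bank `bank`
      output [8:0]  col,
      output [1:0]  bank,
      output [11:0] row,
      output        row_hit);  // CAS immediately; no precharge/activate

      assign col  = ad[9:1];   // 512 columns x 16 bits
      assign bank = ad[11:10]; // interleave banks on 1 KB boundaries
      assign row  = ad[23:12]; // 4096 rows per bank

      reg [11:0] open_row [0:3];  // per-bank active row registers
      reg [3:0]  opened;          // per-bank activated flags

      always @(posedge clk) if (activate) begin
        open_row[bank] <= row;
        opened[bank]   <= 1'b1;
      end

      assign row_hit = opened[bank] && (open_row[bank] == row);
    endmodule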
Welcome back, dear readers. I'm sorry that it has been a while.
Nios laps MicroBlaze
A reader kindly pointed out that the Altera
Nios
soft processor core is now in version 2.0, whereas Xilinx
MicroBlaze still seems to be in beta test.
The Nios interface bus, generated by the
Nios System Builder tool, is called
Avalon. Avalon now supports multiple masters and DMA.
Note the difference in strategy here. Xilinx has announced its
MicroBlaze product will use the same popular CoreConnect bus as its
Virtex-II-Pro embedded PowerPC 405s use. So presumably soft cores
for one will work with the other. In contrast, Altera is using
the non-standard (and presumably lighter weight) Avalon bus for its
soft processor core, whereas its embedded processor cores (ARM and MIPS)
use the popular AMBA bus.
Also new/improved in Nios 2.0 is support for user-defined custom instructions,
and an on-chip debug peripheral.
This app note
on Nios resource usage and performance is instructive.
And here
is an interesting recent comp.arch.embedded posting on experience with Nios.
Altera Excalibur and MIPS?
This article
(in the same Nios discussion thread as above) speculates that Altera has
"decided to can the MIPS version" of their Excalibur embedded
hard CPU core, leaving just the
ARM 922T
based products.
The Excalibur index page
no longer seems to mention any MIPS embedded hard core devices,
except for a <meta> tag that reads:
"...The three families ... Nios ... ARM ... and MIPS-based hard core embedded processor ..."
If Altera is currently focusing on ARM, this makes good sense. The initial
announcements of two hard CPU architectures (see our
coverage) were surprising since it
is very costly to support one architecture, let alone three.
Also, each additional embedded CPU line (beyond the first) further
disrupts synergies, dilutes critical mass adoptions, leads to divergent
product strategies, etc.
Perhaps Lewin A.R.W. Edwards'
scenario
is as good as any.
XESS XSA-100
I now have an XSA-100.
In late September, I designed a cheap and cheerful ASCII VGA character
display core. It uses one 512x8 RAMB4_S8_S8 block RAM as a 16R x 32C
display buffer (with one RAM port exposed to the core client, which
reads and writes ASCII characters into this buffer under any clock discipline it
chooses), and a second block RAM configured 4096x1 (RAMB4_S1) as a 96x5x8
character generator. It was a tight squeeze, but the 96 printable ASCII
characters, including true lowercase descenders, do fit nicely in 4096 bits.
This is a really handy tool for on-chip debugging -- it is compact, fast,
very simple to drop into an arbitrary embedded system design, and needs
as few as three device pins -- video, hsync_n, and vsync_n.
I also have run 16- and 32-bit pipelined versions (gr1040i/gr1050i)
of the GR CPUs on this board.
I am now working on an SDRAM controller core, part of a new system
I am designing with a 32-bit gr1050i, an i-cache, d-cache,
SDRAM interface to the 16 MB SDRAM, and a high resolution
color frame buffer.
New versions
Xilinx has just released new versions of
WebPack 4.1i (free) and
Foundation/Alliance (now called
ISE) 4.1i (not free).
Xilinx: 4.1i press release.
Murray Disman, ChipCenter:
Xilinx Delivers ISE 4.1i.
Xilinx claims all sorts of performance enhancements in 4.1i, but
since my critical-path datapaths are already (nearly) optimally floorplanned,
I'm not expecting much of an improvement, if any.
The ISE tools do have some nice new features, such as cross-probing from
the timing analyzer to the floorplanner. (You click on the
critical net and the floorplanner zooms in on that part of the
chip floorplan and highlights the path.)
WebPack is turning into quite a nice, full-featured product.
It now supports Spartan2 devices, plus VirtexE and
Virtex2 "up to 300K", and it includes Xilinx's XST HDL synthesis tool.
Soon I will try running my Synplify-Verilog-based
floorplanned datapath designs through XST. I use a lot of
explicitly instantiated primitives with explicit
RLOC and INIT attributes -- and no doubt XST has a different and
incompatible attribute syntax.
Dear friends at Xilinx: the point one i suffixes aren't fooling anyone.
Admit it, 4.1i is really 4.0 -- or maybe 11.0.
If it is a major new version, it is a .0 product.
Calling it a .1 product to make it appear that you've already
shook out the usual .0 issues won't make it so.
And the i (for internet or perhaps for fuel injection?)
moniker is so 2000.
Which reminds me of a couple of definitions from Stan Kelly-Bootle's brilliant
The Devil's DP Dictionary, (out of print), since revised and expanded as
The Computer Contradictionary:
"upgrade n. & v.trans. [From up + Latin gradus "steep incline."]
1 n. An expensive counterexample to earlier conjectures
about upward compatibility.
2 n. A painful crisis which belatedly restores one's faith
in the previous system.
3 n. To replace (obsolete stability) with something less boring. ..."
"release n. & v.trans. [Latin relaxare "to ease the pain."]
1 n. A set of kludges issued by the manufacturer which
clashes with the private fixes made by the user since the last release.
2 n., also called next release. The long-awaited panacea,
due tomorrow, replacing all previous temporary patches, fixed patches,
and patched fixes with a freshly integrated, fully tested, notarized
update.
3 v.trans. Marketing To announce the availability (of a mooted
product) in response to the release by a competitor of a product
prompted by your previous release.
Care is needed to distinguish a last release from a next release,
since the difference is more than temporal. A last release is
characterized by being punctual but inadequate, a next release
avoids both errors. Next releases are worth waiting for. [And staying
in maintenance for -JG] ..."
"Tremendous Barrier to Entry"
Anthony Cataldo, EE Times:
CEO: Altera, Xilinx hold embedded PLD keys.
'The use of programmable-logic structures by providers of
application-specific standard products will be "impossible unless they
work with ourselves or Xilinx to license the technology," Daane said.'
Rick Merritt, EE Times:
Rattling Sabers.
HardCopy
Altera: HardCopy - The Right Product at the Right Time.
Anthony Cataldo, EE Times:
Altera joins FPGA-to-ASIC drive as gate arrays come back in vogue.
'Lightspeed Semiconductor, meanwhile, is looking to undercut Xilinx in
a similar manner. The company has started taking orders for its 4Em
... devices that use the same BGA packages as Xilinx's Virtex E and
identical memory configurations. ...
Routing is done in the metal and does without the extra capacitance
of pass transistors used in FPGAs, making the architecture typically
two times faster than FPGAs, said Lyle Smith, chief scientist for
Lightspeed. ...'
'The transfer of intellectual-property cores can also get dicey if another
vendor is involved. "Realistically, we're the only people that can do
this, in a sense, legally," said Altera's Tong. "We don't eliminate
the possibility of going to an ASIC supplier; however, there is going
to have to be licensing negotiations to use our IP in an ASIC."'
[my emphasis -JG]
See also IP business models.
Anthony Cataldo, EE Times:
Altera kicks off mask-programmable PLD program.
'"For extreme volumes, standard-cell is still the best solution," Daane said.
"But for the gate array vendors and conversion vendors it's over for them."'
SignOnce
Xilinx:
Xilinx and 21 Leading Intellectual Property Providers Launch Common FPGA Licensing Program.
Murray Disman, ChipCenter:
Xilinx Announces SignOnce IP License.
Michael Santarini, EE Times:
Xilinx, IP partners unify licensing.
'"We've taken our own license and have convinced all of our AllianceCore
partners to join the program and adopt our license as their own," said
Sevcik, noting that the effort to sign up the consortium partners had
taken a year and a half. ...'
'There are two versions of the SignOnce IP License. The Site License
gives users access to the IP in question for an unlimited amount of
projects within a 5-mile radius of where the license is granted. The
Project License limits use of the IP to a single project. The licenses
are typically granted for FPGA netlist versions of a given core.'
Silaria
Chris Edwards, Electronics Times:
Proteus processor core gets embedded Linux port.
"The company has added hardware design support to let the Proteus
3's datapath use any bit width from 4- to 256-bits but will only add
compilation support for 8- to 256-bit systems with the version 2.0
compilers due out next year."
This summer, I visited an old colleague, now at Silaria Ltd. in Dublin.
Sharp team. If you're in the market for configurable processor IP,
they are worthy of consideration.
Who mourns for DS-502?
My shelves runneth over with copies of XactStep 1.x, Foundation and/or
Alliance 1.3, 1.4, 1.5, 2.1i, 3.1i, and now 4.1i, not to mention a few
Student Editions. Now that many of the old devices are obsolete,
there's little need or justification to hang on to good old
DS-VL-BAS, DS-390, DS-502, and about two shelf-feet of old Viewlogic and
XACT manuals. Not to mention the Quarterdeck QEMM386 that they
required. (I doubt QEMM will even run on current GHz, 200+ MB PCs.)
I couldn't quite bring myself to toss them, but they've been paged out.
It's sad to think of the blood, sweat, and tears that must have gone into
those earlier products, which (just a few years later) are chucked
unceremoniously into the big bit bucket of history. And it's strange to
recall just how excited I was to get to work with those products,
but now, I could not care less for them.
I mourn too for
some
of
my
own
earlier
products,
which are now, for the most part, gathering dust.
They fulfilled their purpose, they achieved their mission,
the toothpaste tube of their productivity, potential, and utility has been
squeezed out and used up. It is (more or less) stupid to waste
time with them -- there are better alternatives. They are worthless.
And forgotten are each of these products' team members,
once a close but unruly family,
building a shared vision, now scattered to the four winds;
forgotten too, the all nighters and the crushing deadlines
and the death marches,
the feature shootouts, the infinite meetings, the unfinished specs,
the sprawling milestones, the bugs uncountable and the bug triages,
the alphas, the betas, and the release candidates, the sign offs and
product launches, the fierce (yet collegial) company rivalries,
the "win all reviews" battle cries and the reviews won and lost,
the happy customers (and the unhappy ones),
and all that spent energy, and the days and nights and weekends,
the months and the marriages, now spent and dissipated,
forgotten and fading away like Apollo in Who Mourns for Adonais?.
Cold comfort: Inevitably, some code lives on. Take the C++
object model
layout code I wrote in 1990. For compatibility reasons, that code will
remain as is for the rest of time or x86 computing, whichever comes first.
Whether you run Windows apps on Windows or on WINE on Linux, your code
is rather likely to have been compiled by my code.
In fact (the last time I checked),
I use Xilinx's software to build my products;
Xilinx uses my software to build their products!
Ye olde near branch / far branch problem
Last month, on the fpga-cpu list, someone wrote:
"I am trying to implement far branches in xr16asm (XSOC
Project) and was wondering if anyone did this before. If so
it would be nice if you post some lines of code, so i can see
how it has to be done."
This issue has always annoyed me,
so I went and fixed it.
The fix was not trivial, because of these side issues.
-
We can't tell which (forward) branches are far (>127 instructions from .) until the forward label has been seen. We could simply pad all forward branches with two NOPs (reserving space that we can later overwrite at FIX_BR fixup application time) but that seemed inelegant (although it would be simpler).
-
Instead, on seeing a far branch, we must turn
bcond label
into
bnot-cond skip
imm label[15:4]
jal r0,label[3:0](r0)
skip:
and insert up to two additional instructions at the branch site. This creates two problems:
-
It disturbs 16-byte aligned functions (which have been 16-byte aligned because they may be the targets of CALL instructions).
-
Inserting two instructions may make other branches over the branch also far.
-
The symbol, fixup, and line correspondence tables have address fields which will become invalid if code is inserted at a lower address than the given address field.
So I changed the applyFixups() function to work in three phases.
-
Resolve far branch fixups. Repeatedly scan over all branch fixups in the program. If one is found to be a far branch fixup, replace it with the b<not-cond> sequence shown above, by moving subsequent instructions and data down in memory. The FIX_BR fixup to the label becomes a FIX_EA (nee FIX_LDST) fixup on the label.
-
Resolve call target 16-byte alignment constraints. For each symbol (in order of increasing address) that is constrained to be 16-byte aligned (e.g. is seen to be a CALL target), if it is not already aligned, insert enough 0s to 16-byte align it.
[[I also changed lcc-xr16 to not emit "align 16"s in the prolog of function definitions. If a function is not a CALL target, (e.g. if it is only called indirect through a function pointer via JAL), it need not be 16-byte aligned.]]
-
Resolve all fixups (just as applyFixups used to do). At this point, all remaining FIX_BR fixups are near, and all FIX_CALL fixups are to 16-byte aligned targets.
I'm going to try to wrap this fix up in a new build of the XSOC Kit
someday.
FPGA CPU News, Vol. 2, No. 10
Back issues: Vol. 2 (2001): Jan Feb Mar Apr May Jun Jul Aug Sep; Vol. 1 (2000): Apr Aug Sep Oct Nov Dec.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.