Homebuilt Processors
Usenet Postings
Newsgroups: alt.comp.hardware.homebuilt,comp.arch.fpga
Subject: homebuilt processors using FPGAs (long)
Date: 11 Dec 1994 04:08:40 GMT

(Hope the crosspost to comp.arch.fpga is OK, the topic is amateur processor
implementations using FPGAs.)

In <3c6is4$d7k@gordon.enea.se> pefo-@enea.se (Per Fogelstrom) writes:
>
>PDP11 Hacker ..... (ard-@siva.bris.ac.uk) wrote:
>
>: My main interest is in designing a CPU _from scratch_. OK, I know I'll get
>: poor performance from all those FPGAs wire-wrapped together (all that
>: capacitive loading for one thing), but with a good underlying design it should
>: be useable (heck.. The PERQ 1a was had a CPU built from 250 TTL chips, PALs and
>: PROMs, clocked at 5MHz, and still beats this 386DX33 for graphics performance
>: :-)). And there's the joy when a prompt appears on a machine that you even
>: designed the instruction set for.
>
>I've did a few bitslice designs many years ago. One was for my own amusement
>and was based on AMD2903 slices (32 bits, 8 chips). It was fun but very time-
>consuming. It was clocked with 5Mhz and executed reg-reg instructions in
>two clocks. I later redesigned it to fetch and decode in the same cycle as
>the previous execute. It never ran any serious software.
>
>Per
>

On homebrew computers: start simple and learn as you go. When they work they are *very* satisfying. I was encouraged by helpful U.Waterloo hardware hacker friends (thanks Ashok and Mike and co., wherever you are) into building my first homebrew 6809 system -- the "Gray-1", in 12th grade about 14 years ago. It started with ROM, SRAM, and LEDs, and gradually acquired serial ports, video, and a Votrax speech synthesizer. Eight bit micros and 1 MHz clock rates are easy to do: easy to wire wrap, and easy to program. Start with one of those; PICs look like a good choice today.

On homebrew processors: I went into the software biz but my love for hardware and computer architecture remains.
I've always been envious of the engineers in industry and academia who get to design and build new processors. For a hobbyist, custom VLSI, gate arrays, and standard cells have hugely expensive barriers to entry. And only the most determined hobbyist would build a useful 32-bit CPU using bitslice parts.

In the years since, the programmable logic industry has arrived! These days you can buy, quantity one, 5,000 gate field programmable gate arrays (FPGAs) for ~$100, and 10,000 gate parts for ~$200. The beauty of these parts is that they are dense enough for implementing processors and they abstract away a lot of the high speed circuit stuff for you. For instance, clock skew is of little concern. If you stick to fully synchronous designs (no async preset/clear, no gated clocks, etc.), carefully floorplan your functional units, and stay on chip :-), your designs have a good chance of working at 20-25 MHz.

In my copious spare time I am experimenting with homebrew RISC CPUs. Right now I have a partially finished, partially functional 16-bit RISC CPU and ambitions for a dual issue 32-bit CPU. The former ("jr16") is compiled for a Xilinx XC4005PC84C-5; the latter ("NVLIW1" -- "not very long instruction word #1") will be for an XC4010PC84C-5.

jr16 is a pipelined, 3-operand, load/store RISC with 16 16-bit registers. The basic instruction formats are:

  { 0,  op: 3, rd: 4, ra: 4, rb: 4   /* add/logic operations */ }
  { 10, op: 2, rd: 4, ra: 4, imm: 4  /* load/store, EA=ra+imm4 */ }
  { 11, op: 2, rd: 4, imm: 8         /* load immediate, branch */ }

The instruction pipeline is the classic IF (insn fetch), RF (write back previous result and reg fetch), and EX (execute add/logic/effective address computation). If there's a load/store the pipeline stops until it completes.

The 16-bit datapath is 8 rows by 5 columns of CLBs (Xilinx Configurable Logic Blocks), only ~20% of an XC4005, which has an array of 14x14 CLBs.
The five datapath columns are: rfa (reg file read port A), rfb (reg file read port B), mux (multiplex B or immediate data), adder, and logic unit (and, or, xor, xnor). Results (add/logic/load data) are multiplexed into a write-back register on long lines (LLs) using the XC4000's dedicated LL tristate drivers.

For this first design I avoided a separate PC incrementor and associated multiplexors and instead use r15 as the PC. Thus the clock phases are:

  phase  register file          exec. unit           load/store
  1      write back result reg  add 2 to PC          latch insn, read another
  2      read next A, B regs    add 2 to PC
  3      write back PC          user insn add/logic
  4      read PC                user insn add/logic

(The execution unit takes two clocks to add/mux result at (unproven) 40 MHz.)

A nice aspect of this design is that alternating inc-PC and user-insn cycles means the previous user insn finishes, and any results are written back to the reg file, before the next user insn's operands are read. This eliminates any need for bypass multiplexors in the operand busses or ugly operation latencies in the programming model.

To date I have this design running using the 11 MHz Xilinx XChecker circuit probe, incrementing PC, fetching instructions from an on-chip 16-word boot ROM, and performing ALU operations, but I haven't yet implemented condition codes, branch or load/store circuitry. Soon! (I know it works as far as it does because I can verify internal state: the XChecker probe allows you to examine the state of every function generator and flip flop on the part.)

As for top speed, XDelay static timing analysis (I don't have the simulator software) indicates I should be able to clock this at 40 MHz (25 ns). (I do have a critical path or two to better pipeline yet.) Thus it should do 10 peak MIPS (one instruction every four 25 ns phases), not too shabby for a first design.
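
To make the three jr16 instruction formats given earlier concrete, here is a minimal Python sketch of a field decoder. It is an illustration only, not part of the jr16 design: it assumes the fields are packed most-significant-bit first in the order listed (the post does not state the bit order), and the function name `decode_jr16` and the format labels "alu", "mem" and "imm" are made up for this example.

```python
def decode_jr16(word):
    """Split a 16-bit jr16 instruction word into its fields.

    ASSUMPTION: fields are packed most-significant-bit first, in the order
    the post lists them. The format labels are hypothetical names.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("jr16 instruction words are 16 bits wide")

    # Under the assumed packing, rd occupies bits 11..8 in all three formats.
    rd = (word >> 8) & 0xF

    if (word >> 15) == 0:
        # { 0, op: 3, rd: 4, ra: 4, rb: 4 }  add/logic operations
        return {"format": "alu", "op": (word >> 12) & 0x7, "rd": rd,
                "ra": (word >> 4) & 0xF, "rb": word & 0xF}

    op = (word >> 12) & 0x3  # both remaining formats carry a 2-bit op
    if (word >> 14) == 0b10:
        # { 10, op: 2, rd: 4, ra: 4, imm: 4 }  load/store, EA = ra + imm4
        return {"format": "mem", "op": op, "rd": rd,
                "ra": (word >> 4) & 0xF, "imm": word & 0xF}

    # { 11, op: 2, rd: 4, imm: 8 }  load immediate, branch
    return {"format": "imm", "op": op, "rd": rd, "imm": word & 0xFF}
```

Under these assumptions, `decode_jr16(0x1234)` falls in the add/logic format with op 1, rd 2, ra 3 and rb 4. All three formats add up to exactly 16 bits (1+3+4+4+4, 2+2+4+4+4, 2+2+4+8).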
One neat thing about the Xilinx XC4000 architecture (and I haven't seriously looked at the other FPGA vendors' architectures to know if this is unique or inferior or superior) is that there are enough flip flops mixed in with the function generators that you can make a RISC datapath in as few as three columns of CLBs: one register file (which takes two clocks to read two operands), one adder, and one logic unit, with result multiplexing done on the LLs using tristate drivers. And using the dedicated carry paths you can do 16-bit adders in 9 CLBs, delay about 25 ns, and 32-bit adders in 17 CLBs, delay about 35 ns.

As for the dual-issue 32-bit NVLIW1, my current plans are for a two-unit implementation of a simple VLIW architecture. Each "unit" has a separate register file of 16 32-bit registers and 3-operand instructions (rdest = ra op rb). rdest and rb are local to the unit, each specified using a 4-bit reg no., but ra can be read from either unit, and is spec'd using 1+4 bits. Thus a 2-unit machine has a basic 34-bit insn word: { op0: 4, rd0: 4, ra0: 5, rb0: 4, op1: 4, rd1: 4, ra1: 5, rb1: 4 }. (I'd obviously like to get that 34-bit word down to 32 bits but there isn't much fluff left. Any ideas out there? 32 - 2*(4+5+4) = 6, and six bits doesn't encode two operations very well...)

Using the above "modestly decoupled" architecture, a separate PC incrementer, bypass result multiplexing, and VLIW-like limited access between register files/functional units, it should peak at two instructions every two clocks at 25 ns, or 40 MIPS.

Here, the columns of functional units in the data path floor plan will be something like LAMMRRR RRRMMAL (L=logic unit, A=adder, MM=4-way A-bus source mux, RRR=3-read 2-write register file), with the two halves placed such that splitting the LL bus lets me mux the adder or logic unit results of each concurrently. Thus the datapath of this 32-bit dual-issue machine should fit nicely in 14 columns x 17 rows of a 20x20 XC4010.
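
The NVLIW1 bit budget and the peak MIPS figures above can be checked with a short Python sketch. The field widths come from the post; everything else is an assumption made for illustration: the names `UNIT_FIELDS`, `pack_nvliw1` and `peak_mips` are hypothetical, and the packing order (unit 0 in the high bits, fields most-significant first) is a guess, since the post does not specify one.

```python
# Per-unit field widths as given in the post: op, rd, ra (1+4 bits), rb.
UNIT_FIELDS = (("op", 4), ("rd", 4), ("ra", 5), ("rb", 4))
UNITS = 2

UNIT_BITS = sum(width for _, width in UNIT_FIELDS)   # bits per unit
WORD_BITS = UNITS * UNIT_BITS                        # bits per insn word

# Bits left for both opcodes if the whole word had to fit in 32 bits.
OPERAND_BITS = UNIT_BITS - dict(UNIT_FIELDS)["op"]
OP_BITS_IN_32 = 32 - UNITS * OPERAND_BITS


def pack_nvliw1(units):
    """Pack one field dict per unit into a single integer insn word.

    ASSUMPTION: unit 0 lands in the high bits and fields are packed in
    the order listed in UNIT_FIELDS.
    """
    if len(units) != UNITS:
        raise ValueError("expected one field dict per unit")
    word = 0
    for fields in units:
        for name, width in UNIT_FIELDS:
            value = fields[name]
            if not 0 <= value < (1 << width):
                raise ValueError(f"{name}={value} does not fit in {width} bits")
            word = (word << width) | value
    return word


def peak_mips(insns, clocks, clock_ns):
    """Peak MIPS when `insns` instructions retire every `clocks` clocks."""
    return insns * 1000.0 / (clocks * clock_ns)
```

With these definitions `WORD_BITS` comes out at 34 and `OP_BITS_IN_32` at 6, matching the arithmetic in the post. `peak_mips(2, 2, 25)` gives the 40 MIPS quoted for NVLIW1, and `peak_mips(1, 4, 25)` gives the 10 MIPS quoted for jr16.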
On a 4013 (24x24) I would add a 16-entry 256-byte direct mapped cache (16 16-byte lines) whose cache and data SRAMs would burn another 5 rows by 16 columns. On a 4025 (32x32) ...

It is amazing what you can squeeze onto these parts if you design the machine architecture carefully to exploit FPGA resources. In contrast, there was a very interesting article in a recent EE Times by a fellow from VAutomation doing virtual 6502s in VHDL, then synthesizing them down into arbitrary FPGA architectures. Although the 6502 design used only about 4000 "ASIC gates" it didn't quite fit in an XC4010, a so-called "10,000 gate" FPGA. That a dual-issue 32-bit RISC should fit, and a 4 MHz 6502 does not, says a great deal about VHDL synthesis vs. manual placement, about legacy architectures vs. custom ones, and maybe even something about CISC vs. RISC...

Well, that serves as kind of a brain dump of work (play) in progress. Please drop me a line if you have questions, advice, etc.

Jan Gray
Copyright © 2000, Gray Research LLC. All rights reserved.