Synthesized CPU Core Issues
Usenet Postings
Subject: Re: FPGA vs CPLD? Any Experts out there?
Date: 12 Apr 1999 00:00:00 GMT
Newsgroups: comp.arch.fpga

Weri Kuolstad wrote in message <7etaf5$1ev$1@nnrp1.dejanews.com>...
>Hi Jan,
> I have been following this thread very closely. I am designing a RISC
>CPU based on the MIPS 2000 from "Computer Organization and Design: The
>Hardware/Software Interface" by John Hennessy and David Patterson onto an
>ORCA2C40 FPGA. Obviously I have that book. I also have the new Michael Celitti
>book on Verilog that has the Xilinx Student Edition (I don't have the book
>right now with me to quote the exact version #.) I am doing this design in
>Verilog with two main design goals:
> 1. Describe the entire design at a behavioral level in Verilog.
> 2. Get the entire 32-bit design to fit onto the ORCA2C40.
>
>I would appreciate any help...like book/link suggestions.
>Thank you.
>Weri

Be careful. An ill-prepared behavioral design may be much larger or slower than necessary.

To achieve a feasible, fast, and/or small FPGA implementation of a RISC processor, or anything else, you must first determine how your datapath maps to FPGA device primitives, e.g. 4-LUTs, FFs, BUFTs, RAMs, CYs, etc. I think this is crucial. You must study and internalize your FPGA data sheets, and, if available, review exemplary implementations. Only when you understand where (and how and how many of) your rams, adders, registers, muxes, etc. should fall on the die, only then, should you write your first line of Verilog or draw your first FDCE.

For the specific case of an instruction-set-compatible processor implementation: only when you understand what should be implemented in hardware, what in state machines, and what should trap to software, only then should you "break out the Verilog".

For example, MIPS-I implies a 32-bit barrel shifter. These are expensive to implement in an FPGA, comparable in area to a modest I-cache.
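To put a rough number on the barrel-shifter cost, here is a behavioral sketch of the usual logarithmic organization. It is written in Python, standing in for the Verilog the post discusses. The shifter uses five stages of 2:1 muxes, and each stage shifts by 1, 2, 4, 8 or 16 under control of one bit of the shift amount.

The sketch models only logical left shifts. `barrel_shift_left` and `barrel_mux_count` are illustrative names. The cost function is a first-order estimate that ignores right shifts, sign fill, and routing.

```python
MASK32 = 0xFFFFFFFF


def barrel_shift_left(x: int, amount: int) -> int:
    """Logarithmic barrel shifter model: 5 stages of 2:1 muxes for 32 bits.

    Stage i either passes its input through or shifts it left by 2**i,
    selected by bit i of the 5-bit shift amount.
    """
    x &= MASK32
    amount &= 31  # only the low 5 bits select stages
    for stage in range(5):
        if (amount >> stage) & 1:
            x = (x << (1 << stage)) & MASK32
    return x


def barrel_mux_count(width: int = 32) -> int:
    """First-order cost: one 2:1 mux per bit per stage (left shifts only)."""
    stages = (width - 1).bit_length()  # 5 stages for a 32-bit word
    return width * stages
```

A 2:1 mux has three inputs, so each one fits in a single 4-LUT. Even this stripped-down shifter is therefore on the order of 160 LUTs, before right-shift and arithmetic-fill support are added.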
If you thought about how a barrel shifter would map to device primitives, you might instead profitably design a small, multi-cycle shifter, perhaps one which only does 1- and 4-bit shifts each cycle, saving LOTS of chip area for other things.

Another example. MIPS-I implies a 1-cycle branch delay. In a straightforward implementation of the pipelined datapath sketched in Hennessy and Patterson, this would require 2 PC adders, one for PC+4 and one for PC+branch-displacement, and a MUX selecting between them. Instead, if you can accept a 2-cycle branch latency (e.g. one branch delay slot and one annulled cycle on branch taken), you can build a circuit in about 1/3 the area (PC + cheap-mux(4, sign-ext(branch-disp))).

Sooo, once you have decided what you expect the tools to output in the end, then it's a simple matter of "pushing on a rope" to get your particular elaboration tools to map your design specification into the right inputs to your FPGA vendor's implementation tools. Schematics give you more direct control, HDLs more parameterization, and netlist generators the best of both worlds (at the expense of incompatibility with anything else).

If you take this advice to heart, you should have no trouble fitting your design into a 2C40. IIRC that has 30x30 4-bit PFUs. My first 32-bit pipelined RISC, which did most of the MIPS-I integer instructions, had a datapath that was 16x11 2-bit CLBs, i.e. only about 5% of a 2C40. See my datapath floorplan slide (in www3.sympatico.ca/jsgray/j32.ppt or at www3.sympatico.ca/jsgray/sld021.htm) for an example.

If you can figure out how to make your behavioral Verilog source code compile to the desired device primitives, I would use that. Otherwise I would try to specify the datapath (only) in structural Verilog. I experimented with this last year using Foundation / FPGA Express Verilog with good results, although I had a little helper script to generate a UCF file to constrain the resulting primitives' LOCs to my desired floor plan.
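The two cheaper alternatives suggested above can also be modeled behaviorally, again in Python rather than Verilog, to see what they cost in cycles rather than area.

- `multicycle_shift_left` assumes the shifter takes a 4-bit step whenever at least 4 bits of shift remain, and a 1-bit step otherwise.
- `next_pc` assumes a single adder whose second operand is a mux between the constant 4 and a sign-extended 16-bit branch displacement scaled by 4 (the MIPS-I branch encoding).

Both names are hypothetical. Which PC value the displacement is added to depends on your pipeline timing, so the base used here is an assumption.

```python
MASK32 = 0xFFFFFFFF


def multicycle_shift_left(x: int, amount: int) -> tuple[int, int]:
    """Shift left using only 4-bit and 1-bit steps, one step per cycle.

    Returns (result, cycles). Assumed policy: take a 4-bit step while at
    least 4 bits of shift remain, then finish with 1-bit steps.
    """
    x &= MASK32
    remaining = amount & 31  # 5-bit shift amount, as in MIPS-I
    cycles = 0
    while remaining:
        step = 4 if remaining >= 4 else 1
        x = (x << step) & MASK32
        remaining -= step
        cycles += 1
    return x, cycles


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def next_pc(pc: int, branch_taken: bool, disp16: int) -> int:
    """One PC adder fed by a cheap mux: PC + (taken ? sext(disp) * 4 : 4).

    Assumption: `pc` is whatever the PC register holds when the branch
    resolves; the exact base depends on the pipeline's timing.
    """
    offset = (sign_extend(disp16, 16) << 2) if branch_taken else 4
    return (pc + offset) & MASK32
```

With this step policy the worst case, a 31-bit shift, takes 10 cycles (seven 4-bit steps plus three 1-bit steps). The branch path needs only one 32-bit adder, where the straightforward datapath needs two plus a result mux.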
I look forward to using other Verilog compilers which reportedly can pass FMAP and RLOC attributes through to the FPGA implementation tools.

I had not heard of this Celitti book w/ XSE; can you provide more information?

Jan Gray

Subject: Re: FPGA vs CPLD? Any Experts out there?
Date: 12 Apr 1999 00:00:00 GMT
Newsgroups: comp.arch.fpga

Jan Gray wrote in message <7ettgk$ddr$1@news-2.news.gte.net>...
>Only when you understand where (and how and how many of) your rams, adders,
>registers, muxes, etc. should fall on the die, only then, should you write
>your first line of Verilog or draw your first FDCE.

I don't like my own advice here, so let me try again.

Implementing a processor or other substantial design is an iterative process with subproblems which require analysis and experimentation. The more expert you are with your tools and with the device architecture, the less experimentation you'll need. If you're new to FPGA design, I think taking some time to try out different solutions to the subproblems will help to save time overall and achieve a better result.

Some of the subproblems to investigate include:
* how to implement a register file? a 2 read / 1 write port register file?
* how to source an operand from a register or an immediate field?
* how to implement an ALU? a shifter?
* how to multiplex the many results (incl. ALU, shifts, loads, sign exts (lbs), jal's)?
* how to implement zero/negative/carry/overflow detect?
* what is the external memory or on-chip bus interface like?
* how to implement load/store byte lane alignment logic?
* how to implement an instruction register? a program counter? incrementing it? branch displacements?
* how to pipeline the design? how many stages are beneficial? how to stall the pipe? how to annul insns?
* how to deal with pipeline hazards? memory not ready? branch/jump shadows? data hazards?
* where to implement the effective address adder?
* should memory be 1- or 2-ported? how to mux eff. addr. with PC?
* how to do interrupts and return from interrupt?
* what is the clock discipline? rising or both edges? 1 or multiple clocks per insn?
* what are the critical paths? what is the feasible cycle time? what is the required cycle time?
* is any retiming needed?

Some of these analyses will benefit from actually designing the subunit and observing what the tools produce, including layouts and delays (EPIC / static timing analysis), and from trying some alternatives. Then you'll know approximately how much area and time it takes to do a register file writeback and read vs. an add vs. a wide mux vs. a 32-bit zero detector, and you'll be able to make intelligent tradeoffs.

Have fun!
Jan.
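The list above includes zero/negative/carry/overflow detection, and the closing remark mentions a 32-bit zero detector. A behavioral sketch can pin down what each signal means before you decide how to map it to primitives.

MIPS-I itself has no condition-code register. Even so, beq/bne need an equality (zero) test, and add/sub must detect signed overflow in order to trap.

`add_with_flags` and `lut_tree_cost` are hypothetical helpers. The LUT count assumes a plain tree of 4-input LUTs and ignores carry-chain or other family-specific tricks.

```python
MASK32 = 0xFFFFFFFF


def add_with_flags(a: int, b: int) -> tuple[int, dict]:
    """32-bit add returning the sum and zero/negative/carry/overflow flags."""
    a &= MASK32
    b &= MASK32
    full = a + b
    s = full & MASK32
    flags = {
        "zero": s == 0,                 # the 32-input zero detector
        "negative": bool(s >> 31),      # sign bit of the result
        "carry": bool(full >> 32),      # carry out of bit 31
        # signed overflow: both operands' signs differ from the result's sign
        "overflow": bool(((a ^ s) & (b ^ s)) >> 31),
    }
    return s, flags


def lut_tree_cost(n_inputs: int, k: int = 4) -> tuple[int, int]:
    """LUTs and logic levels for an n-input AND/OR reduction built from k-LUTs."""
    luts = levels = 0
    signals = n_inputs
    while signals > 1:
        signals = -(-signals // k)  # ceiling division: one LUT per k signals
        luts += signals
        levels += 1
    return luts, levels
```

For a 32-input zero detect, this tree gives 11 LUTs in 3 levels (8, then 2, then 1). A device's dedicated carry logic may offer a cheaper or faster wide detect, which is the kind of alternative worth trying.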
Copyright © 2000, Gray Research LLC. All rights reserved.