3-D Rendering Acceleration
Usenet Postings
Newsgroups: comp.graphics.algorithms,comp.arch.fpga
Subject: Re: FPGA accelerated engines for volume rendering
Date: 22 Mar 1995 05:47:46 GMT

In <BENEDETT.95Mar20152030@caliban.dsi.unimo.it> benedett-@caliban.dsi.unimo.it (Arrigo Benedetti) writes:
>I'm looking for references to implementations of hardware accelerators for volume
>rendering algorithms (or other computationally intensive graphics algorithm)
>based on FPGA's.

I suspect this is not the volume rendering you mean, but maybe you'll find
it interesting anyway, a kind of software/hardware practice and experience,
if you will.

A while back, I did a design for a Gouraud shaded Z-buffered rendering
accelerator, whose datapath is compiled into a Xilinx XC4003A. Sure, it's
probably the most well understood graphics rendering problem, and my
implementation is simple at best (e.g. no blending, no textures), but I
wanted to see how far one could get, at home, on a hobbyist scale.

The inner loop (one scan line) of this simple polygon rendering algorithm is:

    // interpolate left to right, in (r,g,b) and z, and update
    // pixels for which z is closer than zbuf[x]:
    ... set up fixed point z, dz, r, dr, g, dg, b, db ...
    for (x = xleft; x < xright; x++) {
        if (z < zbuf[x]) {              // Z-buffer check
            zbuf[x] = z;                // update Z-buffer
            buf[x] = pixel(r,g,b);      // update image
        }
        // advance interpolants
        z += dz; r += dr; g += dg; b += db;
    }

When attached to 32-bits of DRAM or VRAM, and assuming a 16-bit Z-buffer,
this design required three passes, fast page mode streaming over memory,
to render a span of pixels across one scan line of a polygon.

That is, I implement the above as three passes :-

    bit closer[];

    // Pass 1: (check two Z-values per iteration)
    // initialize z0, z1, dz0, dz1
    for (x = xleft; x < xright; x += 2) {
        closer[x]   = (z0 < zbuf[x]);
        closer[x+1] = (z1 < zbuf[x+1]);
        z0 += dz0; z1 += dz1;
    }

    // Pass 2: (update up to two Z-values per iteration)
    // reinitialize z0, z1, dz0, dz1
    for (x = xleft; x < xright; x += 2) {
        if (closer[x])   zbuf[x]   = z0;
        if (closer[x+1]) zbuf[x+1] = z1;
        z0 += dz0; z1 += dz1;
    }

    // Pass 3: (update zero or one pixel value per iteration)
    // initialize r, g, b, dr, dg, db
    for (x = xleft; x < xright; x++) {
        if (closer[x]) buf[x] = pixel(r,g,b);
        r += dr; g += dg; b += db;
    }

.. in hardware, in each case doing one loop iteration per clock (50 ns clock).

((I separated passes 1 and 2 because I thought it would be easier to do
separate read and write passes on the Z-buffer memory, pipelined, rather
than one pass with lots of back to back read/modify/write traffic.))

Amortized cost: 100 ns/pixel, several times faster than an R4000 software
approach, even assuming packing several 8.8 bit fixed point interpolants
per 64-bit register.

Besides address sequencing and DRAM/VRAM control, the hardware to do the
above is only two 24-bit accumulators (for the 16.8 bit fixed point
interpolations of z0 and z1, and reused for 'r' and 'g' interpolation),
one 16-bit accumulator (for the 8.8 bit fixed point interpolation of 'b'),
and two 16-bit magnitude comparators (for comparing zbuf[i] and zbuf[i+1]
with z0 and z1), plus a 64-by-2 bit SRAM to buffer closer/farther values
(wider polygons would be divided into abutting narrow ones). All of which
fits nicely in a "3000-gate" XC4003A.

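The three-pass scheme above can be modelled in software and checked against
the single-pass loop. What follows is a minimal sketch in C, not the author's
hardware design. Assumptions that are not in the post: the names Span, z_int,
pixel, render_single and render_three_pass; the 0x00RRGGBB packing used by
pixel; indexing closer[] relative to xleft; the guard for odd-width spans;
and keeping every interpolant non-negative so that right shifts are well
defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_SPAN 64  /* mirrors the 64-entry closer/farther buffer in the post */

/* Start values and per-pixel deltas: z is 16.8 fixed point, r/g/b are 8.8. */
typedef struct {
    int32_t z, dz;
    int32_t r, dr, g, dg, b, db;
} Span;

/* Integer part of a 16.8 depth value (assumed non-negative). */
static uint16_t z_int(int32_t z) { return (uint16_t)(z >> 8); }

/* Hypothetical packing: integer parts of the 8.8 channels as 0x00RRGGBB. */
static uint32_t pixel(int32_t r, int32_t g, int32_t b)
{
    return ((((uint32_t)(r >> 8)) & 0xFFu) << 16) |
           ((((uint32_t)(g >> 8)) & 0xFFu) << 8)  |
            (((uint32_t)(b >> 8)) & 0xFFu);
}

/* Reference: the single-pass inner loop. */
static void render_single(const Span *s, int xleft, int xright,
                          uint16_t *zbuf, uint32_t *buf)
{
    int32_t z = s->z, r = s->r, g = s->g, b = s->b;
    for (int x = xleft; x < xright; x++) {
        if (z_int(z) < zbuf[x]) {
            zbuf[x] = z_int(z);
            buf[x] = pixel(r, g, b);
        }
        z += s->dz; r += s->dr; g += s->dg; b += s->db;
    }
}

/* The same span rendered as three streaming passes. */
static void render_three_pass(const Span *s, int xleft, int xright,
                              uint16_t *zbuf, uint32_t *buf)
{
    uint8_t closer[MAX_SPAN] = {0};
    int n = xright - xleft;
    assert(n >= 0 && n <= MAX_SPAN);

    /* Pass 1: read the Z-buffer, two depth compares per iteration. */
    int32_t z0 = s->z, z1 = s->z + s->dz, dz2 = 2 * s->dz;
    for (int i = 0; i < n; i += 2) {
        closer[i] = z_int(z0) < zbuf[xleft + i];
        if (i + 1 < n)                      /* guard for odd-width spans */
            closer[i + 1] = z_int(z1) < zbuf[xleft + i + 1];
        z0 += dz2; z1 += dz2;
    }

    /* Pass 2: write the Z-buffer, up to two updates per iteration. */
    z0 = s->z; z1 = s->z + s->dz;
    for (int i = 0; i < n; i += 2) {
        if (closer[i]) zbuf[xleft + i] = z_int(z0);
        if (i + 1 < n && closer[i + 1]) zbuf[xleft + i + 1] = z_int(z1);
        z0 += dz2; z1 += dz2;
    }

    /* Pass 3: write the image, at most one pixel per iteration. */
    int32_t r = s->r, g = s->g, b = s->b;
    for (int i = 0; i < n; i++) {
        if (closer[i]) buf[xleft + i] = pixel(r, g, b);
        r += s->dr; g += s->dg; b += s->db;
    }
}
```

Running both functions on identical copies of a Z-buffer and image should
leave the copies identical. The timing in the post follows from the loop
shapes: passes 1 and 2 each handle two pixels per 50 ns clock (25 ns/pixel
each) and pass 3 handles one (50 ns/pixel), which adds up to the stated
100 ns/pixel.
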
((An "accumulator" in Xilinx-speak is an adder whose output is captured in
a register "sum", and whose inputs are sum and another register "delta",
so that "sum += delta" is formed each clock.))

I also considered using 16-bits/pixel (565 RGB) and adding error
distribution "dithering" to propagate the error at each pixel to later
pixels on the same line. This would require another adder at each
accumulator.

In my first couple of nights using ViewLogic, XBLOX, and XACT
1.4-something, I was able to design and compile the datapath of the above.
Unfortunately at that point I got stuck, trying to determine how to
interface an R3081 and then an R4000 to the FPGA, and so never did get the
darn rendering engine built. (The R4000 bus protocols are nontrivial,
especially when trying to interface to an FPGA with its own, nontrivial
input setup/hold times and output delays.)

Now, when time permits, I am designing a 32-bit RISC in the left half of
an XC4010, and I hope to use the right half for a rendering accelerator as
described above. Here "interpolate" (one iteration of one of the above
passes) will be a machine instruction.

Jan Gray

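The error-distribution idea mentioned in the post (565 RGB with the error at
each pixel carried to later pixels on the same line) can be sketched for one
colour channel as follows. This is one possible reading, not the author's
design: the names DitherAcc and dither_step are hypothetical, truncation is
assumed as the quantiser, and the clamp to the 8.8 range is an added
safeguard. The line marked "extra adder" corresponds to the additional adder
per accumulator that the post mentions.

```c
#include <assert.h>
#include <stdint.h>

/* One colour channel: an 8.8 fixed-point accumulator plus a carried error. */
typedef struct {
    int32_t value;  /* current 8.8 interpolant ("sum") */
    int32_t delta;  /* per-pixel increment ("delta") */
    int32_t err;    /* quantisation error carried from the previous pixel */
} DitherAcc;

/* Emit the top `bits` bits of the channel for this pixel, carry the bits
 * that were dropped into the next pixel, then advance the interpolant. */
static unsigned dither_step(DitherAcc *a, int bits)
{
    assert(bits >= 1 && bits <= 8);
    int shift = 16 - bits;               /* low bits dropped from the 8.8 value */
    int32_t v = a->value + a->err;       /* extra adder: add the carried error */
    if (v > 0xFFFF) v = 0xFFFF;          /* clamp to the 8.8 range (assumption) */
    if (v < 0) v = 0;
    unsigned q = (unsigned)v >> shift;   /* quantised output, e.g. 5 bits for R */
    a->err = v - (int32_t)(q << shift);  /* remainder goes to the next pixel */
    a->value += a->delta;                /* the usual "sum += delta" step */
    return q;
}
```

For a constant channel value that sits halfway between two 5-bit levels, the
output alternates between the two levels, so the average along the scan line
matches the unquantised value.
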
Copyright © 2000, Gray Research LLC. All rights reserved.