Finding Max Element in an FPGA

Home

Floorplanning >>
<< Life in an FPGA

```Subject: Re: Efficient max-function architecture?
Date: 22 Sep 1998 00:00:00 GMT
Newsgroups: comp.arch.fpga

Algorithm analysis 101 tells us that to determine the largest item of n
items requires at least n-1 2-input "greater than" comparisons.  You might
build other circuit structures, e.g. using 3-input or 4-input max-functions,
of course, but I doubt they are as area efficient -- although I can't prove
that off the top of my head.

In the worst case, each comparison must consider each bit of its two
arguments, although there is a possibility (a probability) of "early out" if
some MSBs are different.  If you wish to take advantage of early-out, you
will have to design for the worst case (no early-outs occurred) or for
failure in the improbable case.

As you suggest, you can do the n-1 comparisons sequentially, in parallel,
pipelined, etc., depending upon your performance requirements.

Consider using dedicated carry chains (if available) to make your
comparators small and fast.  For example, an n-bit comparator requires
approximately n/2 XC4000 CLBs + approx 1/2 CLB to establish carry-in[0] as
0.

If the data arrives a byte at a time you could perform 1 load and n-1
comparison cycles using a single comparator and load-enabled register (e.g.
4.5 XC4000 CLBs, 1+15 cycles at ~10 ns/cycle).

If the 16 bytes of data arrive all at once, the pipelined tree of
comparators and muxes should do nicely (15*(4.5+4) = ~130 CLBs, 10
ns/cycle).

Hybrids: e.g. if you have 50 ns you can reduce the problem to a first step
where you iteratively find 4 largest of 4 subsets of 4, followed by 3
comparators+muxes, in area 4*4.5 +3*(4.5+4) = ~45 CLBs).

Jan Gray

Subject: Re: Efficient max-function architecture? -- "parallel bitwise max"
Date: 22 Sep 1998 00:00:00 GMT
Newsgroups: comp.arch.fpga

I wrote in message <6u8s8g\$l5s\$1@news-1.news.gte.net>...
>Algorithm analysis 101 tells us that to determine the largest item of n
>items requires at least n-1 2-input "greater than" comparisons. ...

SEGUE

Algorithm analysis 101 also teaches us to carefully choose a model (e.g.
elements and a 2-element comparison function) and a goal (e.g. minimize no.
of comparisons to determine the largest element) to represent the problem.
Only then can we compare algorithms or study optimal lower bounds on
algorithms.

For example, using a comparison based model, it can be shown that any
sorting algorithm on n elements must perform at least ceil lg n!
comparisons.  But there are other sorting algorithms which do not perform
any elementwise comparisons.  They assume a different model with different
operations on the elements.  Consider radix sort, in which we distribute the
n elements into r buckets, according to increasingly significant digits in
their base r representation, and then regather them and repeat.  (Remember
the old IBM punch card sorters?)

So, as radix sort is to a comparison-based sort, is there an analogous
"radix max" to our comparison-based max?

"PARALLEL BITWISE MAX"

Yes.  Here's a 'digit'al method for finding the max of n k-bit words.

We scan the n inputs as n serial bit streams, all clocked together, msbs
first.  One bit stream is the maximum if it is never observed to be less
than some other bit stream.  For any two bit streams A and B,
"A < B" iff ~msb(A)&msb(B) | (msb(A)==msb(B) & "lsbs(A) < lsbs(B)").

Design: we keep n candidate-for-max state bits, one for each bit stream.
All are initialized to 1, as each stream is initially a candidate to be the
maximum.  A candidate stream is still a candidate if it has a current 1
input bit.  A stream with a current 0 input bit loses its candidacy if there
is some other remaining candidate stream that has a current 1 bit.  After
all input bits have been clocked past, the remaining candidate is the
maximum bit stream.  If duplicate input bit streams are possible, there can
be more than one remaining candidate, each of which corresponds to the same
maximum value.

// pidgin netlist generator source:

input reset; // 1 during the reset cycle
input stream[n]; // current bit of each of the n input streams
reg cand[n]; // 1 if the i'th stream is still a candidate for maximum.
net some1; // 1 if some remaining candidate stream input bit is currently 1
output answer; // result, the index of max input stream

some1 = cand[0]&stream[0] | cand[1]&stream[1] | ... | cand[n-1]&stream[n-1];
for (i = 0; i < n; i++)
cand[i] := reset | cand[i]&(stream[i] | ~stream[i]&~some1);

If we know the input words are never two alike, then bitcount(cand)=1 and we
can use
otherwise,

Implemented in an XC4000 FPGA, for n=16, requires approximately
4.5 CLBs for some1
8 CLBs for cand[16]
12.5 CLBs for a 16-to-4 priority encoder
----
~25 CLBs, result every 9 clocks for 8-bit input data, at perhaps 10
ns/clock.

Radix 4 (2 bits/clock) is also possible but probably unwieldly and slow.

the slowest part of the design would be the 'some1' or-reduction circuit,
and even with 36 input streams this is only three CLB delays and mostly
local interconnect.

Jan Gray

Subject: Re: Efficient max-function architecture? -- "parallel bitwise max"
Date: 30 Sep 1998 00:00:00 GMT
Newsgroups: comp.arch.fpga

Brad Taylor wrote in message <3611D95A.527FC010@emf.net>...
>                       /=====REG==\
>                       |   [MAX]  |
>               [MAX]   \==>|a o|==/
> [hi_byte ]===>|a o|==REG=>|b__|
> [lo_byte ]===>|b__|

I like it!  Certainly nicer than the 4-way hybrid in my first posting.

To recap the thread and add yet another approach, if you have n inputs each
m bits long, some choices are :-

1. simple m/2+1 CLB max accumulator in ~n clocks
2. ~nm/2 CLB max-mux tree in 1 clock
3. hybrid which takes ~nm/2k CLBs in k clocks
4. "parallel bit serial max" in ~n CLBs in ~m clocks
5. "serial bit serial max" in ~lg m + m/16 CLBs in m*n clocks

About 5: use a bit serial max state machine together with an m-bit FIFO
implemented using dual port RAM, to store the "current max".  Operate by
streaming in all the input words, serially, one bit at a time, into the max
machine.  Each m clocks it bit serially emits the max of all inputs so far.

Jan Gray

Subject: Re: Efficient max-function architecture? -- "parallel bitwise max"
Date: 30 Sep 1998 00:00:00 GMT
Newsgroups: comp.arch.fpga

Jan Gray wrote in message <6usptm\$d1\$1@news-2.news.gte.net>...
>5. "serial bit serial max" in ~lg m + m/16 CLBs in m*n clocks
>
>About 5: use a bit serial max state machine together with an m-bit FIFO
>implemented using dual port RAM, to store the "current max"...

Oops, we don't need an m-bit FIFO, but rather a simple m-bit shift register.
Looking to http://www.xilinx.com/xapp/xapp052.pdf for inspiration, the m-bit
SR can be implemented in just 2 + m/32 CLBs so I should rather have written:

5. "serial bit serial max" in ~4 + m/32 CLBs in m*n clocks

Jan Gray
```