Regexps and Parsing in FPGAs |
|||
Home Life in an FPGA >> << FPGAs as coprocessors
Usenet Postings |
Newsgroups: comp.arch.fpga Subject: regular expression matching and parsing in FPGAs (was chess...) Date: Fri, 24 Dec 1999 12:30:30 -0800 I'm not sure how regular expression matching and parsing relates to chess, but these can of course be done at tremendous speeds in an FPGA. Here are some ideas. REGULAR EXPRESSIONS A regular expression (including compositions of simpler REs) can be implemented in hardware through these transformations: 1. convert the RE to an NFA (non-deterministic finite-state automaton) 2. convert the NFA to a DFA (deterministic FA) using the subset construction 3. implement the DFA as a one-hot FSM (finite state machine). Here each state is a flip-flop. Transitions between states are combinatorial expressions of the states and the input. If the input is encoded, e.g. as 8-bit characters, there would probably be additional logic to decode characters ('.') or character classes ([0123456789]). This could consume one character per clock at 100-200 MHz. LR PARSING A parser for a context-free grammar can be implemented in hardware through these transformations. 1. derive LR-parser shift-reduce tables (actions, goto, etc.) from the CFG productions. 2. implement the shift-reduce parsing algorithm approximately as: // types enum Token { term1, ..., nTerminals, nonterm1, ..., nTNT}; enum State { s0, s1, ..., nStates }; enum Prods { ..., nProds }; enum Action { accept, error, shift, reduce1, reduce2, ... }; // tables Action actions[nStates][nTNT]; State goto[nStates][nTNT]; Token lhs[nProds]; // token on lhs of production, e.g. X for X ::= Y1 ... Yn int rhs_len[nProds]; // length of rhs of production, e.g. n for above // shift-reduce parser State stack[]; // a stack of states int depth= 0; // index of top-of-stack State s = s0; stack[depth] = s; Token input = first token; repeat { Action a = actions[s][input]; if (a == accept) stop, accept; else if (a == error) stop, error; else if (a == shift) { s = stack[++depth] = goto[s][input]; input = next token; } else if (a == reduce p) { s = stack[depth -= rhs_len[p]]; s = stack[++depth] = goto[s][lhs[p]]; } } 3. implement *that* in an FPGA: For example, for parsing C, from off the top of my head, nStates < 256, nTNT < 128, nProds < 100, Stack depth <128. So implement 'input' in a 7-bit-wide FIFO; 'stack' in a 128x8 SRAM; 'lhs' in a 100x7 SRAM; 'rhs_len' in a 100x4 SRAM, and 'actions' and 'goto' in external SSRAM (for C, they won't fit in a few block rams). 'Stack' is indexed by depth; 'actions' by s and input; 'goto' by s and mux(input, lhs[p]). Implement depth in an accumulator (a 7-bit register + an adder of +1 or - prod_size[p]). Perhaps you could clock this parser at 100 MHz. A pity that it is not easy to pipeline the action and goto lookups. So here's a sketch of an FCCM to syntax check a preprocessed C program. Characters come in at 200 MHz to the lexical analyzer, which uses the RE recognizer described above to deposit tokens into a FIFO. The C parser drains the FIFO, consuming a token approximately every few cycles at 100 MHz. Overall the machine gobbles C code at 200 MB/s, or about 5 million lines per second, and then either the green Accept LED or the red Error LED lights up. :-) (Interesting gedanken experiment but I'd rather do it in software, thanks.) (I would be surprised if this (shift-reduce parsing in hardware) has not been studied already. Also, I wonder if there are alternative implementations of the LR parser algorithm that can better use 1-hot encodings. The problem is the current state gets pushed and popped, and a stack of 1-hot states is not very storage efficient; thus you'd have to encode them into a binary representation and decode them back to a 1-hot encoding.) (I suppose there might be real-world applications of work like this in areas like text classification and DNA sequencing.) Ref: see e.g. Compilers: Principles, Techniques, and Tools, Aho Sethi and Ullman, or Crafting a Compiler, Fischer and LeBlanc. Jan Gray
Copyright © 2000, Gray Research LLC. All rights reserved. |