A lexer is a tokenizer: it takes a character-based input stream (read from a file, from a string in memory, or from characters arriving over the network) and attempts to recognize characters and groups of characters as so-called terminal symbols, or tokens.
Tools that can be used to generate lexers from a grammar (a formal specification of how input should be lexed) include:
- Ragel: a state machine compiler for recognizing regular languages
- ANTLR: an advanced LL(*) parser generator
- Walrus: my own tool prototyped in Ruby and currently being rewritten in ANTLR
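To make the idea concrete, here is a minimal hand-written lexer sketch in Ruby (purely illustrative; not Walrus and not generated by any of the tools above). It uses the standard library's `StringScanner` to group characters into three hypothetical token kinds: integers, identifiers, and single-character operators.

```ruby
require 'strscan'

# Scan a string and group characters into tokens (terminal symbols).
# Token kinds here (:INTEGER, :IDENTIFIER, :OPERATOR) are illustrative.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    if scanner.scan(/\s+/)
      next                                   # skip whitespace
    elsif (text = scanner.scan(/\d+/))
      tokens << [:INTEGER, text]             # e.g. "42"
    elsif (text = scanner.scan(/[a-zA-Z_]\w*/))
      tokens << [:IDENTIFIER, text]          # e.g. "count"
    elsif (text = scanner.scan(%r{[+\-*/=]}))
      tokens << [:OPERATOR, text]            # e.g. "+"
    else
      raise "unrecognized character at offset #{scanner.pos}"
    end
  end
  tokens
end

tokenize("count = count + 1")
# => [[:IDENTIFIER, "count"], [:OPERATOR, "="], [:IDENTIFIER, "count"],
#     [:OPERATOR, "+"], [:INTEGER, "1"]]
```

A generated lexer does the same job, but the per-token regular expressions come from the grammar specification rather than being written by hand.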