A lexer is a tokenizer: it takes a character-based input stream (read from a file, from a string in memory, or from characters arriving over the network) and recognizes characters and groups of characters as so-called terminal symbols, or tokens.
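To make that concrete, here is a minimal hand-written lexer sketch in Python. The token set (`NUMBER`, `IDENT`, and so on) is hypothetical and chosen purely for illustration, not taken from any particular tool or grammar:

```python
import re

# Hypothetical token set for illustration: each entry pairs a token
# name with the regular expression that recognizes its lexemes.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"\s+"),  # whitespace is recognized but discarded
]

# Combine the patterns into one alternation; named groups record
# which token type matched.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Scan `text` left to right, yielding (token_type, lexeme) pairs."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        pos = m.end()
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())
```

For example, `list(tokenize("x + 42"))` yields `[("IDENT", "x"), ("PLUS", "+"), ("NUMBER", "42")]`. The tools below automate exactly this kind of work, compiling a grammar into a (usually much faster) state machine instead of trying one regular expression after another.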

Tools that can generate a lexer from a grammar (a formal specification of how the input should be lexed) include:

  • Ragel: a state machine compiler for recognizing regular languages
  • ANTLR: an advanced LL(*) parser generator
  • Walrus: my own tool prototyped in Ruby and currently being rewritten in ANTLR