Basic Concepts

The Congo Parser Generator is, at its core, a classic recursive descent parser generator. Now, I have to assume that many readers do not really understand what that means. For now, let me just make the following key point: when you use a tool like this to generate your parser, you are really implementing two separate machines -- the lexer (a.k.a. scanner or tokenizer) and the parser itself.

Of the two, the lexer is the lower-level machine. Its job is to break the input into tokens -- the smallest chunks of input text that carry some actual meaning in the language.

Concretely, consider the following line of Java code:

   int foo = bar + baz.bat;

At the pre-lexical stage, that is just a sequence of characters:

    · · · · i n t · f o o · = · b a r · + · b a z . b a t ; ␤

(Here, · stands for a space character and ␤ for the line feed that terminates the line.)

The lexer's job is to group those characters into the following:

  • int (a Java keyword)
  • foo (a Java identifier)
  • = (a Java operator)
  • bar (another identifier)
  • + (another operator)
  • baz (identifier)
  • . (the dot operator)
  • bat (identifier)
  • ; (a Java delimiter)

That is what the lexical stage is about: turning a stream of characters into a stream of tokens. As you can surely see, some of the input characters are really just ignorable, most typically whitespace -- for example, the space between the keyword int and the identifier foo after it, or the spaces immediately before and after the + operator. Those spaces are not even strictly necessary: there is no semantic difference between foo + bar and foo+bar; the extra spaces are only there for human readability. By the same token, the line-feed character that terminates the line is simply ignored by a Java compiler, which would be just as happy (sorry for the anthropomorphism!) if all the Java code were on a single line. However, code that is not broken into lines would obviously be very onerous for a human being to work with!
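To make this concrete, here is a toy, hand-rolled lexer for just this one line of input. It is purely an illustration of the idea -- a lexer that a tool like Congo generates handles a full language and is considerably more involved -- and all of the names here (Token, TokenType, ToyLexer) are made up for this sketch:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    // Hypothetical token representation: the kind of token, the exact
    // text it covers (its "image"), and where it begins in the input.
    enum TokenType { KEYWORD, IDENTIFIER, OPERATOR, DELIMITER }
    record Token(TokenType type, String image, int beginOffset) {}

    // A toy lexer for the example line. It skips whitespace (including
    // the trailing line feed) and groups the remaining characters into
    // keyword/identifier/operator/delimiter tokens.
    class ToyLexer {
        private static final Set<String> KEYWORDS = Set.of("int");

        static List<Token> tokenize(String input) {
            List<Token> tokens = new ArrayList<>();
            int i = 0;
            while (i < input.length()) {
                char c = input.charAt(i);
                if (Character.isWhitespace(c)) {
                    i++;  // ignorable: just skip it
                } else if (Character.isJavaIdentifierStart(c)) {
                    int start = i;
                    while (i < input.length()
                            && Character.isJavaIdentifierPart(input.charAt(i))) i++;
                    String image = input.substring(start, i);
                    TokenType type = KEYWORDS.contains(image)
                            ? TokenType.KEYWORD : TokenType.IDENTIFIER;
                    tokens.add(new Token(type, image, start));
                } else if (c == '=' || c == '+' || c == '.') {
                    tokens.add(new Token(TokenType.OPERATOR, String.valueOf(c), i++));
                } else if (c == ';') {
                    tokens.add(new Token(TokenType.DELIMITER, ";", i++));
                } else {
                    throw new IllegalArgumentException("unexpected character: " + c);
                }
            }
            return tokens;
        }
    }

Calling ToyLexer.tokenize("int foo = bar + baz.bat;\n") returns exactly the nine tokens listed above, in order; the whitespace never makes it into the output.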

While the lexer is concerned with breaking a stream of characters into tokens, the parser works at the syntactic level. It takes the stream of tokens coming from the lexer and generates a tree data structure (conventionally drawn inverted, with the root at the top) that could be visually represented as follows:

    <FieldDeclaration>
      <PrimitiveType>
        int
      <VariableDeclarator>
        <Identifier>
          foo
        <Operator>
          =
        <AdditiveExpression>
          <Name>
            <Identifier>
              bar
          <Operator>
            +
          <Name>
            <Identifier>
              baz
            <Operator>
              .
            <Identifier>
              bat
      <Delimiter>
        ;

So that is what the parser builds. Or something like that... Again, the problem is partitioned into a lexical side and a syntactic side. The lexer takes the stream of characters and turns it into a stream of tokens; the parser takes that stream of tokens and produces a tree-like data structure like the one we see above.
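This is also where the name "recursive descent" comes from: the parser contains one routine per grammar construct, and each routine consumes tokens and calls the routines for the constructs nested inside it. As a sketch only -- reusing the Token and TokenType types from the lexer sketch above, with the Node class and the grammar productions simplified and hypothetical -- two such hand-written routines might look like this:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical tree node: a node name plus child nodes; a leaf
    // node's name is simply the image of the token it holds.
    class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
        Node(String name, String leafImage) {
            this(name);
            children.add(new Node(leafImage));
        }
        void add(Node child) { children.add(child); }
    }

    class ToyParser {
        private final List<Token> tokens;
        private int pos = 0;

        ToyParser(List<Token> tokens) { this.tokens = tokens; }

        // AdditiveExpression := Name ( "+" Name )*
        Node additiveExpression() {
            Node node = new Node("AdditiveExpression");
            node.add(name());
            while (pos < tokens.size() && tokens.get(pos).image().equals("+")) {
                node.add(new Node("Operator", tokens.get(pos++).image()));
                node.add(name());  // descend (recursively) into the next Name
            }
            return node;
        }

        // Name := Identifier ( "." Identifier )*
        Node name() {
            Node node = new Node("Name");
            node.add(new Node("Identifier", consume(TokenType.IDENTIFIER).image()));
            while (pos < tokens.size() && tokens.get(pos).image().equals(".")) {
                node.add(new Node("Operator", tokens.get(pos++).image()));
                node.add(new Node("Identifier", consume(TokenType.IDENTIFIER).image()));
            }
            return node;
        }

        private Token consume(TokenType expected) {
            Token t = tokens.get(pos++);
            if (t.type() != expected)
                throw new IllegalStateException("expected " + expected + " but got " + t.image());
            return t;
        }
    }

Fed the tokens for bar + baz.bat, additiveExpression() builds exactly the <AdditiveExpression> subtree shown in the diagram: a Name for bar, an Operator for the +, and a second Name containing baz, the dot, and bat.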

Now, it should be clear that this lexer/parser combination is invariably part of a larger system, very likely a compiler or interpreter. A compiler can't do anything with the raw sequence of characters. Actually, it still can't do much with a stream of tokens either, though we're getting a bit warmer... However, the tree-like data structure above really is something that a program can work with! This is because it is arranged in a way that reflects the logic and structure -- a.k.a. the semantics -- of the language we are working with.
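For instance, once the tree exists, later stages can process it with simple recursive walks. Here is a sketch (again hypothetical, reusing the Node class from the parser sketch above) that prints a tree in the same indented form as the diagram:

    // Walks a Node tree recursively, printing non-leaf nodes in angle
    // brackets and leaf nodes (token images) as plain text, indented by depth.
    class TreePrinter {
        static void print(Node node, int depth) {
            String label = node.children.isEmpty() ? node.name : "<" + node.name + ">";
            System.out.println("  ".repeat(depth) + label);
            for (Node child : node.children) {
                print(child, depth + 1);
            }
        }
    }

A compiler's later stages -- name resolution, type checking, code generation -- are, at heart, more elaborate walks of exactly this kind over the same tree.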