Skip to main content

CongoCC and Parsers

What is CongoCC

CongoCC a Java program and its main functionality is a "parser-generator". That is, CongoCC generates "parsers". A parser is a program that takes in as input a text file, parses the text file, and — in typical usage — generates a data structure that represents the structure of the text file. The parsers generated by CongoCC can be executed separately, without any dependencies on CongoCC.

What are parsers?

Think about a simple "hello world" Java program, one that looks like the following (from https://introcs.cs.princeton.edu/java/11hello/HelloWorld.java.html):

/******************************************************************************
 *  Compilation:  javac HelloWorld.java
 *  Execution:    java HelloWorld
 *
 *  Prints "Hello, World". By tradition, this is everyone's first program.
 *
 *  % java HelloWorld
 *  Hello, World
 *
 *  These 17 lines of text are comments. They are not part of the program;
 *  they serve to remind us about its properties. The first two lines tell
 *  us what to type to compile and test the program. The next line describes
 *  the purpose of the program. The next few lines give a sample execution
 *  of the program and the resulting output. We will always include such
 *  lines in our programs and encourage you to do the same.
 *
 ******************************************************************************/

public class HelloWorld {

    public static void main(String[] args) {

        // Prints "Hello, World" in the terminal window.
        System.out.println("Hello, World");
    }
}

It is easy enough to compile and run this:

$ javac HelloWorld.java
$ java HelloWorld
Hello, World

The Java compiler takes in the HelloWorld.java file and produces a compiled HelloWorld.class file, which we can run using the Java program. Somewhere early in the compilation process, the compiler has to be able to understand the structure of the Java source file (in this example, HelloWorld.java). It needs this structure to know what to compile. The above source file has the following structure:

  1. There is a comment block at the top.

  2. There is a class named HelloWorld.

  3. The class named HelloWorld has a single method named main.

  4. The method main in the class "HelloWorld is of type public static void and takes in one argument called args. The argument is of type String[].

  5. The method main in the class HelloWorld has a comment and it calls the method System.out.println with the argument "Hello, World".

The part of the compiler that is responsible for parsing the input source file and producing a structured representation of the source file is called the "parser". This structured representation is called a "Syntax Tree", "Abstract Syntax Tree", or "Concrete Syntax Tree". For now, it doesn't matter too much what we call them. Generally speaking, parsers are language-specific. A single parser only understands the rules of a single language. For example, the parser in the Java compiler will only understand the Java language and will not understand Python, C++, or Rust.

What does CongoCC do?

CongoCC generates parsers given an input "grammar". A grammar (sometimes called "formal grammar") describes the rules of a language. For example, the Java programming language grammar has rules that look like these:

ClassOrInterfaceDeclaration: 
    {Modifier} (ClassDeclaration | InterfaceDeclaration)

ClassDeclaration: 
    NormalClassDeclaration
    EnumDeclaration

InterfaceDeclaration: 
    NormalInterfaceDeclaration
    AnnotationTypeDeclaration

NormalClassDeclaration: 
    class Identifier [TypeParameters]
                                [extends Type] [implements TypeList] ClassBody

EnumDeclaration:
    enum Identifier [implements TypeList] EnumBody

NormalInterfaceDeclaration: 
    interface Identifier [TypeParameters] [extends TypeList] InterfaceBody

AnnotationTypeDeclaration:
    @ interface Identifier AnnotationTypeBody

These rules describe how a programmar will declare classes within a Java source file. In programming languages, a grammar is necessary because it captures the precise syntactic rules of the language. For example, in the above rules, under NormalClassDeclaration, we're told that when defining a regular class in Java, we must use the class keyword (e.g.: public class HelloWorld).

CongoCC has a flexible format with which you can express the grammar of programming languages such as C#, Python, and Java (or even your own constructed language!). Once you express the grammar in CongoCC's format, you can pass it to CongoCC to generate a parser that can parse text files following those rules.

CongoCC's Parser programs

As described above, the parsers generated by CongoCC are programs themselves. Note however that CongoCC does not produce a compiled parser program. Instead, CongoCC produces the source code of a parser program, which can then be independently compiled and used. As of writing, CongoCC is powerful enough to generate a parser source code in C#, Python, or Java. In other words, you can use CongoCC to generate a Python parser program that is capable of parsing C# source files! Support for more languages in CongoCC is planned.

Why do I even need parsers?

Parsers solve a very specific problem: capturing the structure of a text file according to some rules. Any time you need such structure, you need a parser. The need for parsers is immediate when you look at problems within the domain of programming languages:

  • If you are writing a compiler, you need a parser.
  • If you are writing a tool that verifies the source code written by people for various properties, you need a parser. For example, the Python linting tool Ruff needs a parser (https://github.com/astral-sh/ruff/tree/main/crates/ruff_python_ast/src)
  • If you are writing a