CongoCC and Parsers
What is CongoCC
CongoCC a Java program and its main functionality is a "parser-generator". That is, CongoCC generates "parsers". A parser is a program that takes in as input a text file, parses the text file, and — in typical usage — generates a data structure that represents the structure of the text file. The parsers generated by CongoCC can be executed separately, without any dependencies on CongoCC.
What are parsers?
Think about a simple "hello world" Java program, one that looks like the following (from https://introcs.cs.princeton.edu/java/11hello/HelloWorld.java.html):
/******************************************************************************
* Compilation: javac HelloWorld.java
* Execution: java HelloWorld
*
* Prints "Hello, World". By tradition, this is everyone's first program.
*
* % java HelloWorld
* Hello, World
*
* These 17 lines of text are comments. They are not part of the program;
* they serve to remind us about its properties. The first two lines tell
* us what to type to compile and test the program. The next line describes
* the purpose of the program. The next few lines give a sample execution
* of the program and the resulting output. We will always include such
* lines in our programs and encourage you to do the same.
*
******************************************************************************/
public class HelloWorld {
public static void main(String[] args) {
// Prints "Hello, World" in the terminal window.
System.out.println("Hello, World");
}
}
It is easy enough to compile and run this:
$ javac HelloWorld.java
$ java HelloWorld
Hello, World
The Java compiler takes in the HelloWorld.java
file and produces a compiled HelloWorld.class
file, which we can run using the Java program. Somewhere early in the compilation process, the compiler has to be able to understand the structure of the Java source file (in this example, HelloWorld.java
). It needs this structure to know what to compile. The above source file has the following structure:
-
There is a comment block at the top.
-
There is a class named
HelloWorld
. -
The class named
HelloWorld
has a single method namedmain
. -
The method
main
in the class "HelloWorld
is of typepublic static void
and takes in one argument calledargs
. The argument is of typeString[]
. -
The method
main
in the classHelloWorld
has a comment and it calls the methodSystem.out.println
with the argument"Hello, World"
.
The part of the compiler that is responsible for parsing the input source file and producing a structured representation of the source file is called the "parser". This structured representation is called a "Syntax Tree", "Abstract Syntax Tree", or "Concrete Syntax Tree". For now, it doesn't matter too much what we call them. Generally speaking, parsers are language-specific. A single parser only understands the rules of a single language. For example, the parser in the Java compiler will only understand the Java language and will not understand Python, C++, or Rust.
What does CongoCC do?
CongoCC generates parsers given an input "grammar". A grammar (sometimes called "formal grammar") describes the rules of a language. For example, the Java programming language grammar has rules that look like these:
ClassOrInterfaceDeclaration:
{Modifier} (ClassDeclaration | InterfaceDeclaration)
ClassDeclaration:
NormalClassDeclaration
EnumDeclaration
InterfaceDeclaration:
NormalInterfaceDeclaration
AnnotationTypeDeclaration
NormalClassDeclaration:
class Identifier [TypeParameters]
[extends Type] [implements TypeList] ClassBody
EnumDeclaration:
enum Identifier [implements TypeList] EnumBody
NormalInterfaceDeclaration:
interface Identifier [TypeParameters] [extends TypeList] InterfaceBody
AnnotationTypeDeclaration:
@ interface Identifier AnnotationTypeBody
These rules describe how a programmar will declare classes within a Java source file. In programming languages, a grammar is necessary because it captures the precise syntactic rules of the language. For example, in the above rules, under NormalClassDeclaration
, we're told that when defining a regular class in Java, we must use the class
keyword (e.g.: public class HelloWorld
).
CongoCC has a flexible format with which you can express the grammar of programming languages such as C#, Python, and Java (or even your own constructed language!). Once you express the grammar in CongoCC's format, you can pass it to CongoCC to generate a parser that can parse text files following those rules.
CongoCC's Parser programs
As described above, the parsers generated by CongoCC are programs themselves. Note however that CongoCC does not produce a compiled parser program. Instead, CongoCC produces the source code of a parser program, which can then be independently compiled and used. As of writing, CongoCC is powerful enough to generate a parser source code in C#, Python, or Java. In other words, you can use CongoCC to generate a Python parser program that is capable of parsing C# source files! Support for more languages in CongoCC is planned.
Why do I even need parsers?
Parsers solve a very specific problem: capturing the structure of a text file according to some rules. Any time you need such structure, you need a parser. The need for parsers is immediate when you look at problems within the domain of programming languages:
- If you are writing a compiler, you need a parser.
- If you are writing a tool that verifies the source code written by people for various properties, you need a parser. For example, the Python linting tool Ruff needs a parser (https://github.com/astral-sh/ruff/tree/main/crates/ruff_python_ast/src)
- If you are writing a