CongoCC and Parsers
What is CongoCC
CongoCC a Java program and its main functionality is a "parser-generator". That is, CongoCC generates "parsers". A parser is a program that takes in as input a text file, parses the text file, and — in typical usage — generates a data structure that represents the structure of the text file. The parsers generated by CongoCC can be executed separately, without any dependencies on CongoCC.
What are parsers?
Think about a simple "hello world" Java program, one that looks like the following (from https://introcs.cs.princeton.edu/java/11hello/HelloWorld.java.html):
/******************************************************************************
* Compilation: javac HelloWorld.java
* Execution: java HelloWorld
*
* Prints "Hello, World". By tradition, this is everyone's first program.
*
* % java HelloWorld
* Hello, World
*
* These 17 lines of text are comments. They are not part of the program;
* they serve to remind us about its properties. The first two lines tell
* us what to type to compile and test the program. The next line describes
* the purpose of the program. The next few lines give a sample execution
* of the program and the resulting output. We will always include such
* lines in our programs and encourage you to do the same.
*
******************************************************************************/
public class HelloWorld {
public static void main(String[] args) {
// Prints "Hello, World" in the terminal window.
System.out.println("Hello, World");
}
}
It is easy enough to compile and run this:
$ javac HelloWorld.java
$ java HelloWorld
Hello, World
The Java compiler takes in the HelloWorld.java
file and produces a compiled HelloWorld.class
file, which we can run using the Java program. Somewhere early in the compilation process, the compiler has to be able to understand the structure of the Java source file (in this example, HelloWorld.java
). It needs this structure to know what to compile. The above source file has the following structure:
-
There is a comment block at the top.
-
There is a class named
HelloWorld
. -
The class named
HelloWorld
has a single method namedmain
. -
The method
main
in the class "HelloWorld
is of typepublic static void
and takes in one argument calledargs
. The argument is of typeString[]
. -
The method
main
in the classHelloWorld
has a comment and it calls the methodSystem.out.println
with the argument"Hello, World"
.
The part of the compiler that is responsible for parsing the input source file and producing a structured representation of the source file is called the "parser". This structured representation is called a "Syntax Tree", "Abstract Syntax Tree", or "Concrete Syntax Tree". For now, it doesn't matter too much what we call them. Generally speaking, parsers are language-specific. A single parser only understands the rules of a single language. For example, the parser in the Java compiler will only understand the Java language and will not understand Python, C++, or Rust.
What does CongoCC do?
CongoCC generates parsers given an input "grammar". A grammar (sometimes called "formal grammar") describes the rules of a language. For example, the Java programming language grammar has rules that look like these:
ClassOrInterfaceDeclaration:
{Modifier} (ClassDeclaration | InterfaceDeclaration)
ClassDeclaration:
NormalClassDeclaration
EnumDeclaration
InterfaceDeclaration:
NormalInterfaceDeclaration
AnnotationTypeDeclaration
NormalClassDeclaration:
class Identifier [TypeParameters]
[extends Type] [implements TypeList] ClassBody
EnumDeclaration:
enum Identifier [implements TypeList] EnumBody
NormalInterfaceDeclaration:
interface Identifier [TypeParameters] [extends TypeList] InterfaceBody
AnnotationTypeDeclaration:
@ interface Identifier AnnotationTypeBody
These rules describe how a programmar will declare classes within a Java source file. In programming languages, a grammar is necessary because it captures the precise syntactic rules of the language. For example, in the above rules, under NormalClassDeclaration
, we're told that when defining a regular class in Java, we must use the class
keyword (e.g.: public class HelloWorld
).
CongoCC has a flexible format with which you can express the grammar of programming languages such as C#, Python, and Java (or even your own constructed language!). Once you express the grammar in CongoCC's format, you can pass it to CongoCC to generate a parser that can parse text files following those rules.
CongoCC's Parser programs
As described above, the parsers generated by CongoCC are programs themselves. Note however that CongoCC does not produce a compiled parser program. Instead, CongoCC produces the source code of a parser program, which can then be independently compiled and used. As of writing, CongoCC is powerful enough to generate a parser source code in C#, Python, or Java. In other words, you can use CongoCC to generate a Python parser program that is capable of parsing C# source files! Support for more languages in CongoCC is planned.
When would I use a parser?
Parsers solve a very specific problem: capturing the structure of a text file according to some rules. Any time you need such structure, you need a parser. The need for parsers is immediate when you look at problems within the domain of programming languages:
- If you are writing a compiler, you need a parser.
- If you are writing a tool that verifies the source code written by people for various properties, you need a parser. For example, the Python linting tool Ruff needs a parser (https://github.com/astral-sh/ruff/tree/main/crates/ruff_python_ast/src)
- If you are developing an IDE like PyCharm, and you want to provide a functionality where users can rename a function
foo
tobar
and have the rest of Python project be automatically updated so that all previous calls tofoo()
are renamed tobar()
, you need a parser.
But let's say you're never going to do anything with programming languages. Even so, parsers can be useful! Imagine you're writing a system for businesses and you want to provide them with a way of expressing different kinds of business rules, such as: "if my current stock of printing paper drops below 10 units, automatically order more from supplier A, B, or C, after taking into consideration my budget, the prices from the suppliers, and their shipping speed."
In such instances, you need some kind of a DSL (Domain Specific Language). It could even be some custom JSON-based format, e.g.:
{
"if": {
"ItemType": "PaperStock",
"ConditionType": "LessThanEqual",
"ConditionValue": 10
},
"then": {
"DoSomethingHere".
}
}
When the user of your system provides you with some text written in your custom DSL, you will need to reason about it so that your system can do the appropriate things.
Your first option is to write adhoc logic in your program:
json = <input json from user>
condition_to_check = None
then_clause = None
for stanza, attributes in json.items():
if stanza == "if":
item_type = attributes.get("ItemType")
condition_type = attributes.get("ConditionType")
condition_value = attributes.get("ConditionValue")
if item_type is None or condition_type is None or condition_value is None:
raise RuntimeError("The rules are not well-formed. Missing either ItemType, ConditionType, or ConditionValue.")
condition_to_check = (item_type, condition_type, condition_value)
elif stanza == "then":
then_clause = produce_then_clause_from_attributes(attributes)
if condition_to_check is not None:
if then_clause is None:
raise RuntimeError("The if-condition is malformed. A "then" clause is not provided")
if run_condition(condition_to_check) is True:
then_clause.execute()
As you can see, this is very error prone and time consuming. Each time you add new features your DSL, you will need to do a lot of work to maintain this adhoc code.
The second option is to make use of parsers:
json = <input json from user>
parsed_tree, error_message = parse_tree(json)
if parsed_tree is None:
raise RuntimeError(f"The specified business rules are not well-formed. Encountered the following errors: " + error_message)
for node in parsed_tree:
if node.type == NodeType.IF:
if run_condition(node.condition) is True:
execute(node.then_child_node)
With a parser, you have cleaner code because you no longer need to parse anything yourself and you don't have to deal with all the error cases. You can instead focus on your business logic. With parsers, you have another two options: either write the parser yourself or use CongoCC. If you want to spend more time actually working on your business logic, the second option is the correct option :) You can express your DSL rules in the CongoCC format, generate a parser, and use the parser in your system without needing CongoCC again until the next time you change your DSL rules.
No Comments