So far I’ve been an enthusiastic user of ANTLR3, mostly for the MySQL Workbench product, where I based the entire parsing infrastructure on the ANTLR3 C runtime. However, with the appearance of v4 a few years ago, that ANTLR version became outdated and support for it has steadily declined since. Much as I would have liked to, I could not upgrade to the latest version because it lacked an essential piece: a C or C++ backend. Since I work mostly in a C++/Obj-C environment, this is an absolutely necessary part.

There were quite a number of requests for a C++ backend in the past, and so a few attempts were made to create one. Sam Harwell, co-author of ANTLR and creator of the C# runtime, started a C++ backend (utilizing C++14), but could never finish it. Dan McLaughlin and David Sisson then began porting the Java runtime to C++ about 3 years ago. However, this project received only occasional attention after the initial machine translation and saw no contributions after summer 2015.

This is why I decided in March to jump in and get this done. You can see all the work in the fork hosted by Dan on GitHub (this repo no longer exists).

Development

I started out from that machine-generated C++ code, which compiled but didn’t do anything useful. I turned over practically every character in the source code to clean it up, and fixed all the crashes, which were mostly caused by problems in memory handling. This is something specific to C++, as it doesn’t have the automatic memory management the other target languages have (currently Java, C#, JavaScript and Python). Unlike in previous ANTLR versions, all the algorithm code has moved into the runtime, so it took quite a while to understand it all well enough to make informed decisions about how to structure things in the C++ target. But that work is finally finished, and today I can proudly announce that we have a working C++ target that appears to be functionally complete and shows stable memory behavior. You can find the current code in the runtime folder where all targets live. David and I also got the ANTLR runtime tests passing, so we can be confident the current implementation is feature-complete.

One of my major concerns was speed. I had heard from others that ANTLR4 would create slower parsers, and when I created a test parser from my MySQL grammar, which I ported to ANTLR4, and ran it against our internal parser test suite (~1000 test cases), my worst fears came true. What executed in under one second with an ANTLR3-based parser now needed 2.5 minutes – unbearable. Fortunately I got help from Terence Parr, and after some investigation I identified two major reasons for the slowness:

  1. The parsing algorithm doesn’t cache DFA states if they belong to a rule with a predicate on the left-hand side (example: ID: {isID()}? [a-z];). This is because otherwise the predicates wouldn’t be executed on the next parse run. You can find more details about predicates in the ANTLR docs; read especially the part about context-dependent predicates and lexer predicates. Without caching, however, all the state and transition computation is done again and again for each input symbol or token (the same holds for both the lexer and the parser ATN simulator).
  2. In order to manage memory in the C++ runtime it was necessary to use smart pointers. However, they are expensive, really expensive. Especially std::shared_ptr<> can cost a lot of speed if not used carefully (it must be thread-safe, so every copy performs an atomic reference-count update). And since nothing was cached, all the memory was constantly reallocated, with smart pointers freed and created again and again. It’s pretty clear why this was slow (see the sketch below).
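
To illustrate the second point, here is a minimal, self-contained sketch (not code from the runtime): every copy of a std::shared_ptr performs an atomic reference-count update, which passing by const reference avoids entirely.

    #include <cstdio>
    #include <memory>

    // Hypothetical hot-path functions, only to show the cost difference.
    // Copying the shared_ptr bumps an atomic reference counter on entry
    // and decrements it again on exit.
    static int byValue(std::shared_ptr<int> p) { return *p; }

    // Passing by const reference touches no reference counts at all.
    static int byRef(const std::shared_ptr<int> &p) { return *p; }

    int main() {
      auto value = std::make_shared<int>(42);
      long sum = 0;
      for (int i = 0; i < 1000000; ++i) {
        sum += byValue(value); // two atomic operations per call
        sum += byRef(value);   // none
      }
      std::printf("%ld\n", sum);
      return 0;
    }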

After figuring this out, it was quite simple to get much better results. The test suite now takes between 1.5 and 2 seconds to execute (depending on whether you let the parser create a parse tree or not). Quite an impressive improvement over the previous 2.5 minutes, isn’t it? This makes the ANTLR4-based parser about two times slower than the ANTLR3-based one, which is acceptable. And there is still potential for optimization, as I haven’t used the new ANTLR4 features yet (first of all: direct left recursion, which simplifies the expression rules).

Specialities of this ANTLR target

There are a couple of things that only the C++ ANTLR target has to deal with:

Memory Management

Since C++ has no built-in memory management, we need to take extra care. For that we rely mostly on smart pointers, which however can cause speed penalties or memory side effects (like cyclic references) if not used with care. Currently, however, memory handling looks very stable. Generally, when you see a raw pointer in the code, consider it as being managed elsewhere. You should never try to manage such a pointer yourself (delete it, assign it to a smart pointer, etc.).
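
Here is a minimal sketch of that rule, assuming a grammar with generated classes MyLexer and MyParser and a start rule named main (all names hypothetical): the context object returned by a rule method is a raw pointer owned by the parser.

    #include <iostream>
    #include "antlr4-runtime.h"
    #include "MyLexer.h"   // hypothetical generated lexer
    #include "MyParser.h"  // hypothetical generated parser

    int main() {
      antlr4::ANTLRInputStream input("some input text");
      MyLexer lexer(&input);
      antlr4::CommonTokenStream tokens(&lexer);
      MyParser parser(&tokens);

      // Raw pointer: the context is owned by the parser and stays valid
      // as long as the parser lives. Never delete it and never assign it
      // to a smart pointer.
      MyParser::MainContext *tree = parser.main();
      std::cout << tree->toStringTree(&parser) << std::endl;
      return 0;
    }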

Unicode Support

Encoding is mostly an input issue, i.e. it concerns the point where the lexer converts text input into lexer tokens. The parser is completely encoding-unaware. Lexer input in a grammar is defined by character ranges: a single member (e.g. 'a' or [a] or [abc]), an explicit range (e.g. 'a'..'z' or [a-z]), the full Unicode range (for a wildcard), or the full Unicode range minus a sub-range (for negated ranges, e.g. ~[a]). The explicit ranges (including single-member ranges) are encoded as 16-bit numbers in the serialized ATN and hence cannot reach beyond 0xFFFF (the Unicode BMP), while the implicit ranges can include any value (and hence support the full Unicode set, up to 0x10FFFF).

An interesting side note here is that the Java target fully supports Unicode as well, despite the inherent limitations of the serialized ATN. That’s possible because the Java String class represents characters beyond the BMP as surrogate pairs (two 16-bit values) and even reads them as two separate input characters. To make this work, a character range for an identifier in a grammar must include the surrogate pair area (for a Java target parser).

Meanwhile, full Unicode support has been implemented in ANTLR4, and you can specify various Unicode character classes and blocks in your lexer rules. So nothing prevents you anymore from parsing emoji in chat messages 😀 .

The C++ target always expects UTF-8 input (either in a string or via a std::istream), which is then converted to UTF-32 (a char32_t array) and fed to the lexer. When parsing your grammar, ANTLR4 currently limits explicit character ranges to the BMP. So, in order to allow specifying the full Unicode set, the C++ target uses a little trick: whenever an explicit character range in a grammar includes the (unused) code point 0xFFFF, it is silently extended to the full Unicode range. Clearly, this is an all-or-nothing solution: you cannot define a subset of the Unicode code points beyond 0xFFFF that way. This can only be solved once ANTLR supports larger character intervals.
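
Here is a small sketch of the two input forms (the lexer name is again hypothetical). Both expect UTF-8; the conversion to UTF-32 happens inside the input stream, before the lexer sees a single character:

    #include <fstream>
    #include <string>
    #include "antlr4-runtime.h"

    int main() {
      // From a UTF-8 encoded string. The emoji is a code point beyond the
      // BMP and reaches the lexer as a single UTF-32 value.
      std::string text = u8"caf\u00E9 \U0001F600";
      antlr4::ANTLRInputStream fromString(text);

      // From a stream. The file content must be UTF-8 encoded as well.
      std::ifstream file("input.txt");
      antlr4::ANTLRInputStream fromStream(file);

      // Either stream can now be handed to a generated lexer, e.g.:
      //   MyLexer lexer(&fromString);
      return 0;
    }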

The different handling of characters beyond the BMP leads to a difference between Java and C++ lexers: the character offsets may not match. This is because Java reads two 16-bit values per Unicode character (if it falls into the surrogate area), while a C++ parser reads only one 32-bit value. For example, a token following a single emoji starts at character offset 2 for a Java lexer, but at offset 1 for a C++ lexer. That usually has no practical consequences, but it might confuse people comparing token positions.

Named Actions

In order to help customize the generated files there are a number of additional so-called named actions. These actions are tied to specific areas in the generated code and allow you to add custom (target-specific) code. All targets support these actions:

  • @parser::header
  • @parser::members
  • @lexer::header
  • @lexer::members

(and their scopeless alternatives @header and @members), where header doesn’t mean a C/C++ header file, but the top of a code file. The content of the header action appears on the first line of every generated file, which makes it a good place for things like license/copyright information.

The content of a @members action is placed in the public section of the lexer or parser class declaration. Hence it can be used for public variables or for predicate functions referenced by a grammar predicate. Since all targets support header and members actions, they are the best place for things that should also be available in the files generated for other languages.
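
As a sketch of the effect, assume a grammar named Demo whose @lexer::members action declares a helper for the {isID()}? predicate mentioned earlier (all names hypothetical; the action body is plain C++ for this target):

    #include "antlr4-runtime.h"

    // Grammar side, in Demo.g4:
    //
    //   @lexer::members {
    //     bool isID() { return !inStringMode; }
    //     bool inStringMode = false;
    //   }
    //
    // The generated DemoLexer.h then carries the action body verbatim in
    // the public section of the class declaration (heavily simplified):
    class DemoLexer : public antlr4::Lexer {
    public:
      bool isID() { return !inStringMode; }  // from @lexer::members
      bool inStringMode = false;             // from @lexer::members

      // ... generated constructor and lexer machinery follow here ...
    };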

In addition to these, the C++ target supports many more such named actions. Unfortunately, it’s not possible to define new scopes (e.g. @listener in addition to @parser), so they had to be defined as part of the existing scopes (@lexer or @parser). The grammar in the demo application contains all of the named actions as well, for reference. Here’s the list (a placement sketch follows after it):

  • @lexer::preinclude – Placed right before the first #include (e.g. good for headers that must appear first, like system headers). Appears in both the lexer’s .h and .cpp files.
  • @lexer::postinclude – Placed right after the last #include, but before any class code (e.g. for additional namespaces). Appears in both the lexer’s .h and .cpp files.
  • @lexer::context – Placed right before the lexer class declaration. Use it for e.g. additional types, aliases or forward declarations. Appears in the lexer’s .h file.
  • @lexer::declarations – Placed in the private section of the lexer declaration (sections in all generated classes strictly follow the pattern public, protected, private, from top to bottom). Use this for private variables etc.
  • @lexer::definitions – Placed before the other implementations in the .cpp file (but after postinclude). Use this to implement e.g. private types.
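
The following sketch (again with a hypothetical grammar named Demo) shows where each of these actions lands in the generated lexer files:

    // DemoLexer.h, simplified layout:

    // <content of @lexer::preinclude>   e.g. system headers needed first
    #include "antlr4-runtime.h"
    // <content of @lexer::postinclude>  e.g. using-directives

    // <content of @lexer::context>      e.g. forward declarations, aliases

    class DemoLexer : public antlr4::Lexer {
    public:
      // <content of @lexer::members>
    private:
      // <content of @lexer::declarations>
    };

    // DemoLexer.cpp, simplified layout:
    //
    //   <content of @lexer::preinclude>
    //   #include "DemoLexer.h"
    //   <content of @lexer::postinclude>
    //   <content of @lexer::definitions>  e.g. private type implementations
    //   ... generated lexer implementation ...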

For the parser the same actions exist as shown above for the lexer. In addition, there are even more actions for the generated listener and visitor classes:

  • @parser::listenerpreinclude
  • @parser::listenerpostinclude
  • @parser::listenerdeclarations
  • @parser::listenermembers
  • @parser::listenerdefinitions
  • @parser::baselistenerpreinclude
  • @parser::baselistenerpostinclude
  • @parser::baselistenerdeclarations
  • @parser::baselistenermembers
  • @parser::baselistenerdefinitions
  • @parser::visitorpreinclude
  • @parser::visitorpostinclude
  • @parser::visitordeclarations
  • @parser::visitormembers
  • @parser::visitordefinitions
  • @parser::basevisitorpreinclude
  • @parser::basevisitorpostinclude
  • @parser::basevisitordeclarations
  • @parser::basevisitormembers
  • @parser::basevisitordefinitions

and should be self-explanatory by now. Note: there is no context action for listeners or visitors.

Runtime

In order to compile generated parsers you will need the ANTLR C++ runtime. Building it yourself is easy using the provided project files in the ANTLR4 runtime/Cpp/runtime folder. There is a CMakeLists.txt file which can be used to build on Linux and macOS. Additionally, you can find an Xcode project which allows building static and dynamic libraries for macOS and iOS. And not to forget: there are also Visual Studio projects for VS 2013 and VS 2015.

For your convenience I provide source code and prebuilt binaries for MacOS and Windows from this page. See the download section at the end.

Testing

In order to get as many people as possible testing the runtime, I also created an ANTLR snapshot jar that interested developers can use to generate C++ parser code from their grammars. This jar can be downloaded from this site as well, until our fork has been merged back into the main ANTLR repo and new official jars become available (probably around next October, from what I heard). If you are new to ANTLR, read the documentation in the ANTLR repo for a quick start, grammar structure and other important information. Please join us on the ANTLR mailing list if you have questions, suggestions or problems with the new target. There is a small terminal demo application in the Cpp folder that compiles and runs on most platforms (Windows, macOS and Linux, using Visual Studio 2013+, Xcode or CMake). A minimal sketch of such an application follows below. Enjoy!
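
This is not the actual demo code, but a minimal sketch of what such a terminal application looks like, assuming generated classes DemoLexer, DemoParser and DemoBaseListener with a start rule named main (all names hypothetical):

    #include <iostream>
    #include "antlr4-runtime.h"
    #include "DemoLexer.h"
    #include "DemoParser.h"
    #include "DemoBaseListener.h"

    // A trivial listener that counts the rule contexts it enters.
    class CountingListener : public DemoBaseListener {
    public:
      size_t count = 0;
      void enterEveryRule(antlr4::ParserRuleContext * /*ctx*/) override {
        ++count;
      }
    };

    int main(int, char **) {
      antlr4::ANTLRInputStream input(u8"a + b * (c - d);");
      DemoLexer lexer(&input);
      antlr4::CommonTokenStream tokens(&lexer);
      DemoParser parser(&tokens);

      // Print the parse tree in LISP-like form, then walk it.
      antlr4::tree::ParseTree *tree = parser.main();
      std::cout << tree->toStringTree(&parser) << std::endl;

      CountingListener listener;
      antlr4::tree::ParseTreeWalker::DEFAULT.walk(&listener, tree);
      std::cout << "Visited " << listener.count << " rule contexts." << std::endl;
      return 0;
    }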

With the merge of the C++ runtime into the main ANTLR repo, all downloads are now available from the main ANTLR site.
