The optimized Java grammar

This grammar, based on the optimized Java7 grammar by Terence Parr and Sam Harwell, is meant to parse the latest version of the Java language and is optimized for performance, practical usage, and clarity.

It does not correspond exactly to the Java Language Specification (JLS). The java8, java9, and java20 grammars follow the JLS, but are slower than this grammar because of ambiguity and max-k problems in the published JLS EBNF.

This grammar parses the file ManyStringsConcat.java faster than the unoptimized java grammars. It implements operator precedence using Antlr4-style alt ordering instead of operator-precedence rules. Thus, it avoids creating parse trees with long, single-child chains for each string literal constant in ManyStringsConcat.java. In addition, it is faster because it avoids the large ATN-config set construction in the AdaptivePredict() parsing engine.
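
As a rough illustration of the two styles, here is a minimal sketch for a toy expression language. The grammar name `ExprSketch` and every rule and token name in it are hypothetical; none of them are taken from this grammar or from the JLS-based grammars.

```antlr
// Hypothetical toy grammar; rule and token names are illustrative only.
grammar ExprSketch;

// Style 1: Antlr4 alt ordering. Precedence follows the order of the
// alternatives, so MUL/DIV binds tighter than ADD/SUB.
expr
    : expr (MUL | DIV) expr
    | expr (ADD | SUB) expr
    | STRING
    ;

// Style 2: one rule per precedence level, as in a spec-style EBNF.
// Even a lone STRING is wrapped in one node per level.
additive
    : multiplicative ((ADD | SUB) multiplicative)*
    ;

multiplicative
    : primary ((MUL | DIV) primary)*
    ;

primary
    : STRING
    ;

ADD    : '+';
SUB    : '-';
MUL    : '*';
DIV    : '/';
STRING : '"' ~["\r\n]* '"';
WS     : [ \t\r\n]+ -> skip;
```

Starting from `expr`, a single string literal produces the tree `expr > STRING`. Starting from `additive`, the same input produces `additive > multiplicative > primary > STRING`. A grammar that mirrors the JLS has considerably more precedence levels than this sketch, so each literal carries a correspondingly longer single-child chain.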

JDK Enhancement Proposals (JEPs) are not implemented in this grammar.

Currently supported Java version

Java 24 (latest)

Main contributors

Terence Parr, 2013
Sam Harwell, 2013
Ivan Kochurkin (Positive Technologies), 2017
Michał Lorek, 2021

Tests

See examples/
OpenJDK 24, src/**/*.java (use Trash trgen to generate the test app, then run find ~/jdk-jdk-23-ga/src/ -name '*.java' | cygpath -w -f - | ./Test -x)

Benchmarks

Grammar performance has been tested on the following Java projects:

OpenJDK 24
Spring Framework
Elasticsearch
RxJava
JUnit4
Guava
Log4j

See the benchmarks page for details.

Grammar style

Please use antlr-format with the formatting style config to reformat the grammar to the coding standard for the repo.

String literals

Generally, you can use either a string literal or the corresponding lexer rule name (TOKEN_REF) directly in a parser rule for a token. It makes no difference because the java/java/ grammar is a split Antlr4 grammar, and the Antlr Tool prevents you from defining a new token with a string literal in a parser rule (it outputs "cannot create implicit token for string literal in non-combined grammar" if you try). When writing an Antlr listener or visitor, use the corresponding lexer rule name for the string literal used in the parser rule.
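
The following sketch shows the equivalence in a small split grammar. The names `SketchLexer`, `SketchParser`, `classWithNames`, and `classWithLiterals` are hypothetical and simplified for illustration; they are not rules from this grammar. The two grammars would live in separate files and are shown together only for brevity.

```antlr
// File 1 (hypothetical): SketchLexer.g4
lexer grammar SketchLexer;

CLASS      : 'class';
LBRACE     : '{';
RBRACE     : '}';
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
WS         : [ \t\r\n]+ -> skip;

// File 2 (hypothetical): SketchParser.g4
parser grammar SketchParser;

options {
    tokenVocab = SketchLexer;
}

// These two rules match the same token sequence: 'class', '{', and '}'
// resolve to the CLASS, LBRACE, and RBRACE tokens defined in the lexer.
classWithNames
    : CLASS IDENTIFIER LBRACE RBRACE
    ;

classWithLiterals
    : 'class' IDENTIFIER '{' '}'
    ;

// A literal with no matching lexer rule (for example 'record' in this
// sketch) would be rejected by the Antlr Tool with the implicit-token
// error quoted above.
```

Whichever spelling the parser rule uses, code written against the generated parser refers to the token by its lexer rule name, for example the token type constant `SketchParser.CLASS` in this sketch.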

Currently, the grammar contains a mixture of string literals and lexer rule names in parser rules. If you want a parser grammar that removes all string literals from parser rules, use Trash trfoldlit. If you want a parser grammar that uses string literals where a lexer rule exists for the string literal, use Trash trunfoldlit.

Reference

pldb