Sunday, 14 April 2013

Custom Lexing with Xtext

Xtext is an incredibly powerful framework for developing tooling and editors for domain-specific languages.

To simplify the language creation process, Xtext unifies the model definition and the parsing/lexing definition into a single .xtext file per grammar.

This approach is sufficient for most grammars, but occasionally you may encounter a situation where fine-grained control over the lexing is required. Xtext supports this type of control, although the method to achieve it is hidden away.

For the purpose of this post, I will use a sample grammar based upon a cut-down version of a real-world lexing problem.

The problem grammar



The problem


The problem (in this case) is that the PROPERTY_VALUE terminal is greedy. The parser expects to see the '=' token after the '[' token, but the lexer supplies a PROPERTY_VALUE instead.
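To make the greediness concrete, here is a small Java sketch of a longest-match tokenizer. It is not the lexer that Xtext or ANTLR generates, and the token shapes are invented for illustration (in particular, PROPERTY_VALUE is assumed to be an '=' followed by the rest of the line); the real grammar may differ. It only shows how a context-free terminal can swallow the '=' that the parser wanted as a separate token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyTerminalDemo {
    // Hypothetical terminal rules, in priority order (earlier rules win ties).
    private static final String[] NAMES = {
        "PROPERTY_VALUE", "EQUALS", "LBRACKET", "RBRACKET", "COMMA", "ID", "WS"
    };
    private static final Pattern[] PATTERNS = {
        Pattern.compile("=[^\\n]*"),               // assumed shape: '=' plus rest of line
        Pattern.compile("="),
        Pattern.compile("\\["),
        Pattern.compile("\\]"),
        Pattern.compile(","),
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*"),
        Pattern.compile("\\s+")
    };

    /** At each position, the rule that matches the most characters wins. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (int i = 0; i < PATTERNS.length; i++) {
                Matcher m = PATTERNS[i].matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = NAMES[i];
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("Unexpected character at offset " + pos);
            }
            if (!bestName.equals("WS")) {
                tokens.add(bestName); // whitespace is skipped
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The parser would like LBRACKET, EQUALS, ID, COMMA, ID, RBRACKET here.
        System.out.println(tokenize("[= a, b]")); // prints [LBRACKET, PROPERTY_VALUE]
    }
}
```

Because the tokenizer has no notion of context, nothing tells it that '=' directly after '[' should be treated differently.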

Our objective, then, is to disable the PROPERTY_VALUE terminal rule when in the context of an array definition. We could instead modify the grammar so that an array starts with a token other than an equals sign, but quite often we do not have the luxury of being able to alter the grammar. More importantly, = is the correct token to use in this context; anything else would be a work-around.

Note: This post provides a grammar for sample purposes only. Solutions that do not involve overriding the lexer might also be available, but for the purposes of this post we will solve the problem using a lexer override.

Requirements

Xtext 2.4 (this tutorial will likely work with any version of Xtext from 2.0 onwards, but I used 2.4 when working through this process)
ANTLR Works 1.5+ (a version compatible with ANTLR 3.x). Not strictly required, as the MWE2 workflow will compile the lexer itself, but quite useful for debugging the lexer.

Step 1 - Define your language using the Xtext file

Simply put, define your language, and start thinking about which terminal rules should be enabled/disabled in different lexical states. Xtext out of the box will not be able to handle the selective disabling of terminal rules, but that comes later.

Step 2 - Split the Lexer and the Parser (via the MWE2 file)

By default, Xtext generates a unified ANTLR lexer/parser by generating a single .g file that corresponds to the rules and terminal rules defined in your .xtext file.

In order to be able to override the lexer in isolation, we must first configure the MWE2 workflow associated with the grammar so that the lexer and the parser are isolated.

2a) Comment out or remove this block in your MWE2 file (this is the unified parser/lexer generator):

//fragment = parser.antlr.XtextAntlrGeneratorFragment auto-inject {
//
//}


2b) Now add the following line in the same area where you commented out the previous block:

fragment = org.eclipse.xtext.generator.parser.antlr.ex.rt.AntlrGeneratorFragment {
    options = {
        backtrack = false
        backtrackLexer = false
    }
}

This fragment will generate a separate lexer and parser. Cushdy.



Step 3 - Copy the Lexer



If you have successfully split the lexer and the parser, you should now have two packages in your src-gen folder corresponding to the parser and the lexer (where previously there was just one).

/src-gen/[your package prefix].antlr.internal.*   --> The generated parser (we will not touch this)
/src-gen/[your package prefix].antlr.lexer.* --> The generated lexer

It is important to note that any update to the .xtext file made after this copy, and the resulting regeneration of the parser, is potentially a breaking change that can result in a compile-time or runtime exception. Any update to the language grammar requires a re-copy and re-modification of the lexer (although for trivial changes this should be easy with the help of a diff tool).


3a) Copy contents of /src-gen/[your package prefix].antlr.lexer.* to /src/[your package prefix]/lexer.*

Copy the .g file from the package to the lexer package in your src folder.

3b) Rename the copied *.g file to [mylanguage]CustomLexer.g

We change the name of the lexer in order to distinguish it from the generated lexer.

3c) Modify lexer name and package name

/*
* generated by Xtext
*/
lexer grammar CustomAttributesLexer;


@header {
package org.consoli.customlexerexample.parser.antlr.lexer;




The lexer will start with lines similar to the lines shown above.

The name after "lexer grammar" (highlighted in the original post) should be modified to contain the name of your new custom lexer class. In this case it is called CustomAttributesLexer.

The package name in the @header block (also highlighted in the original post) should be modified to match the package created in step 3a (the package in the 'src' folder).
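Since these two edits have to be repeated every time the lexer is re-copied, they can be scripted. The following is a minimal Java sketch, not part of Xtext; the class name, the method name and the generated lexer name InternalAttributesLexer are assumptions made for the example. It rewrites the first "lexer grammar" declaration and the first package declaration in the grammar text and leaves everything else untouched.

```java
import java.util.regex.Matcher;

final class LexerGrammarRetarget {
    /**
     * Returns the grammar text with the lexer name and the @header package
     * replaced. Only the first occurrence of each declaration is changed.
     */
    static String retarget(String grammarText, String lexerName, String packageName) {
        String renamed = grammarText.replaceFirst(
                "lexer\\s+grammar\\s+\\w+\\s*;",
                Matcher.quoteReplacement("lexer grammar " + lexerName + ";"));
        return renamed.replaceFirst(
                "package\\s+[\\w.]+\\s*;",
                Matcher.quoteReplacement("package " + packageName + ";"));
    }

    public static void main(String[] args) {
        // Hypothetical start of a generated lexer grammar.
        String generated =
                "/*\n* generated by Xtext\n*/\n"
              + "lexer grammar InternalAttributesLexer;\n\n"
              + "@header {\n"
              + "package org.consoli.customlexerexample.parser.antlr.lexer;\n";
        System.out.println(retarget(
                generated,
                "CustomAttributesLexer",
                "org.consoli.customlexerexample.lexer"));
    }
}
```

The custom rules themselves still have to be merged back in by hand (or with a diff tool), so this only covers the mechanical part of the re-copy.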


Step 4 - Compile the Custom Lexer using ANTLR Works

4a) Configure ANTLR Works to compile into same folder as .g file

Go to the File/Preferences option, then on the general tab, clear the output path so that nothing is shown (no spaces either), and click apply. Without this step, the compiled lexer would be generated into a subfolder called 'output'.



4b) Incrementally compile the .g custom lexer to test for bugs.


Step 5 - Modify the lexer to add context

This step is heavily dependent on the grammar itself, so I will just show the old and new lexer for the grammar listed above step one. The .g is the only file that is required to be changed in this step.


I realise that I have skipped over this step somewhat, but it is beyond the scope of this article to explain how to write a custom lexer in ANTLR, and in fact, it is far easier to learn by example than to explain the rules. For more detailed customisations, see the MWE2 / Geppetto Puppet links at the bottom of the article.

The implementation of this custom lexer itself was very simple, and hinged around an identifier being used to disable array mode, and one of the array tokens used to enable array mode, with the property value terminal rule disabled when in that mode.
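The mode switching described above can be sketched in plain Java. This is not the ANTLR grammar from the original post: it is a hand-written toy lexer, and the token shapes are assumptions (an array is taken to open with '[', and PROPERTY_VALUE is taken to be an '=' followed by the rest of the line). It is only meant to show the state flag at work.

```java
import java.util.ArrayList;
import java.util.List;

final class ArrayModeLexerSketch {
    /** Tokenizes the input, disabling PROPERTY_VALUE while in array mode. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean arrayMode = false; // the extra lexical state
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++;
            } else if (c == '[') {
                tokens.add("LBRACKET");
                arrayMode = true;  // an array token enables array mode
                pos++;
            } else if (c == ']') {
                tokens.add("RBRACKET");
                pos++;
            } else if (c == ',') {
                tokens.add("COMMA");
                pos++;
            } else if (c == '=') {
                if (arrayMode) {
                    tokens.add("EQUALS"); // PROPERTY_VALUE is disabled here
                    pos++;
                } else {
                    int end = input.indexOf('\n', pos);
                    tokens.add("PROPERTY_VALUE");
                    pos = (end < 0) ? input.length() : end;
                }
            } else if (Character.isLetter(c) || c == '_') {
                int end = pos + 1;
                while (end < input.length()
                        && (Character.isLetterOrDigit(input.charAt(end))
                            || input.charAt(end) == '_')) {
                    end++;
                }
                tokens.add("ID");
                arrayMode = false; // an identifier disables array mode
                pos = end;
            } else {
                throw new IllegalArgumentException("Unexpected character at offset " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("name = hello world\nlist [= a, b]"));
    }
}
```

In an ANTLR 3 lexer grammar the same idea is usually expressed with a boolean field declared in the @members section and a gated semantic predicate ({...}?=>) on the terminal rule that should be switched off.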

This is, of course, not a catch-all: syntactic predicates might be an alternative to using a custom lexer, or a combination of predicates and a custom lexer may be required. There is currently no way to directly override the generated parser.

It is important not to add, rename or delete keywords in the custom lexer as they must match up with the generated parser.

Step 6 - Configure Xtext to use the custom lexer

Add this fragment after the previously inserted fragments in the MWE2 file. Any .java files that you compiled in ANTLR Works can now be deleted from the src folder, since the MWE2 workflow engine handles the .g and .tokens files itself.

// Uses ANTLR Tools to compile a custom lexer and will also add a binding in the runtime module to use the Lexer
fragment = parser.antlr.ex.ExternalAntlrLexerFragment {
    // A grammar file with .g will be expected in this package (should be stored in src folder)
    lexerGrammar = "org.consoli.customlexerexample.lexer.CustomAttributesLexer"
    runtime = true
    antlrParam = "-lib"
    // This is the folder where the lexer will be created
    antlrParam = "${runtimeProject}/src-gen/org/consoli/customlexerexample/lexer"
}

This step will also register the lexer with the language runtime module (AttributesRuntimeModule in our sample case). The lexerGrammar and antlrParam attributes will need to be updated to correspond to the name and package of your custom lexer.

If everything has been wired up correctly, there should be no compile-time errors.

Gotchas

Most gotchas in this process are likely to come from a mistyped path or class name.

Result

If we reload our development instance of Eclipse and look at our sample file again, we can see that arrays are now correctly handled, so we have successfully replaced the default lexer with our custom implementation.



Useful Resources