Sunday, April 29, 2007

Creating a Source Code Syntax Parser in Java Using ANTLR

Recently I found myself looking for a way to parse through the source code syntax of several languages. Rather than write my own syntax parser, I began looking for an existing open source solution. The first option I came across was JavaCC (https://javacc.dev.java.net/), which has a lot of documentation regarding how to install it, create grammars, create parsing code for grammars, but not really much on how to use it. Also according to the JavaCC FAQ:

JavaCC does not automate the building of trees (or any other specific parser output), although there are at least two tree building tools JJTree and JTB (see Chapter 6.) based on JavaCC, and building trees "by hand" with a JavaCC based parser is easy. (http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-moz.htm)

While searching for an easy way to for building trees "by hand" using JavaCC I came across another tool called ANTLR. ANTLR, ANother Tool for Language Recognition, (formerly PCCTS) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, Python, or C++ actions (http://www.antlr.org/about.html). With an abundance of grammars (http://www.antlr.org/grammar/list) and articles (http://www.antlr.org/article/list) it was pretty easy to get what I wanted working fast.

My specific example is a program that parses Java 1.5 source code for is imports, its "has a" relationships, its "is a" relationships, and the classes that it realizes (implements). Here are the steps to getting it to work:

  1. Download and install ANTLR (version 2.7.7) from http://www.antlr.org/download.html
  2. Download the Java 1.5 grammar for ANTLR 2.7.7 by Michael Studman's located at http://www.antlr.org/grammar/1090713067533/index.html.
  3. If running on Windows, add the location of ANTLR to the CLASSPATH environment variable. For example my library is located in "C:\antlr\277\lib\antlr.jar".
  4. Open a command window and navigate to the location of the grammar file, for example I put the Java 1.5 grammar in "C:\antlr\277\examples\java15-grammar".
  5. Run the command "java antlr.Tool java15.tree.g java15.g" assuming you are using Michael Studman's Java 1.5 grammar. This will generate the source code for the parser to be included in your project:
    • JavaLexer.java
    • JavaRecognizer.java
    • JavaTokenTypes.java
    • JavaTreeParser.java
    • JavaTreeParserTokenTypes.java
  6. Create a Java project using the above files along with the ANTLR JAR, and then use the following as the main class:
package antlrtest;

// Standard I/O
import java.io.IOException;
import java.io.File;
import java.io.BufferedReader;
import java.io.FileReader;
// ANTLR Library
import antlr.collections.AST;
import antlr.debug.misc.ASTFrame;
import antlr.RecognitionException;
import antlr.TokenStreamException;
import antlr.CommonAST;
import antlr.ASTFactory;
// Events
import java.awt.event.WindowAdapter;
import java.awt.event.WindowEvent;
// Data
import java.util.ArrayList;

/**
 * <p>Title: Main</p>
 *
 * <p>Description: Takes a Java source code file as an argument, parses it,
 * displays it content in a tree, and determines its imports and relationships
 * with other classes.</p>
 *
 * <p>Copyright: Copyright (c) 2007</p>
 *
 * @author John Valentino II
 * @version 1.0
 */
public class Main implements JavaTokenTypes {

    /**
     * Entry points for the program
     * @param args String[]
     */
    public static void main(String[] args) {

        if (args.length != 1) {
            System.out.println("Usage: main <java file name>");
            System.exit(0);
        }

        try {

            ArrayList<String> imports = new ArrayList<String>();
            ArrayList<String> is_a = new ArrayList<String>();
            ArrayList<String> realizes = new ArrayList<String>();
            ArrayList<String> has_a = new ArrayList<String>();

            //Open the given file
            File file = new File(args[0]);
            String fileName = file.getName();
            BufferedReader reader = new BufferedReader(new FileReader(file));

            //Create a scanner that reads from the input stream passed to us
            JavaLexer lexer = new JavaLexer(reader);
            lexer.setFilename(fileName);

            //Create a parser that reads from the scanner
            JavaRecognizer parser = new JavaRecognizer(lexer);
            parser.setFilename(fileName);

            //start parsing at the compilationUnit rule
            parser.compilationUnit();

            //Get the imports and relationships for this Java file and returns
            //the full class name
            String fullClassName = parseTree(parser.getAST(),
                parser.getTokenNames(),
                imports, is_a, realizes, has_a);

            System.out.println("Full Class Name: " + fullClassName);

            //display imports
            System.out.println("Imports:");
            for (int i = 0; i < imports.size(); i++) {
                System.out.println("\t" + imports.get(i));
            }

            //display the classes that this class inherits from
            System.out.println("Extends (Is A):");
            for (int i = 0; i < is_a.size(); i++) {
                System.out.println("\t" + is_a.get(i));
            }

            //display the "has a" relationships of this class
            System.out.println("Has A: ");
            for (int i = 0; i < has_a.size(); i++) {
                System.out.println("\t" + has_a.get(i));
            }

        } catch (IOException e) {
            e.printStackTrace();
        } catch (RecognitionException re) {
            re.printStackTrace();
        } catch (TokenStreamException tse) {
            tse.printStackTrace();
        }
    }


    /**
     * Parses the AST Root Node for a Java class. The AST Root node contains
     * the following nodes: PACKAGE_DEF, IMPORT, CLASS_DEF
     * @param t AST
     * @param tokenNames String[]
     * @param imports ArrayList
     * @param is_a ArrayList
     * @param realizes ArrayList
     * @param has_a ArrayList
     * @return String
     */
    private static String parseTree(AST t, String[] tokenNames,
        ArrayList<String> imports,
        ArrayList<String> is_a,
        ArrayList<String> realizes,
        ArrayList<String> has_a) {

        String packageName = "";
        String className = "";

        ((CommonAST) t).setVerboseStringConversion(true, tokenNames);
        ASTFactory factory = new ASTFactory();
        AST r = factory.create(0, "AST ROOT");
        r.setFirstChild(t);

        displayFrame(r);

        AST c = r.getFirstChild();

        //for each child...
        while (c != null) {

            int type = c.getType();

            switch (type) {

            case PACKAGE_DEF:
                packageName = getFormattedImport(c);
                break;
            case IMPORT:
                //get the import in the format "java.io.File"
                String importText = getFormattedImport(c);
                //add this import
                imports.add(importText);
                break;
            case CLASS_DEF:
                //there should only be one class per file
                className = parseClass(c, is_a, realizes, has_a);
                break;
            }

            c = c.getNextSibling();

        } //end for each child

        return packageName + "." + className;

    }

    /**
     * Displays the tree in a frame
     * @param r AST
     */
    private static void displayFrame(AST r) {
        final ASTFrame frame = new ASTFrame("Java AST", r);
        frame.setVisible(true);
        frame.addWindowListener(new WindowAdapter() {
            public void windowClosing(WindowEvent e) {
                frame.setVisible(false);
                frame.dispose();
                System.exit(0);
            }
        });
    }

    /**
     * Parses an IMPORT node and returns the full name
     * of the package
     * @param root AST
     * @return String
     */
    private static String getFormattedImport(AST root) {

        //get the import in the format "java.io.File."
        String importText = parseImport(root);

        //trim off the last character making it "java.io.File"
        importText = importText.substring(0, importText.length() - 1);

        return importText;

    }

    /**
     * Recursively parses an IMPORT node for the full name of the package
     * @param root AST
     * @return String
     */
    private static String parseImport(AST root) {

        String returner = "";

        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {
            int type = c.getType();
            if (type == DOT)
                returner = parseImport(c);
            else {
                if (!c.getText().equals(""))
                    returner += c.getText() + ".";
            }
            c = c.getNextSibling();
        } //end for each child

        return returner;

    }

    /**
     * Parses the given CLASS_DEF node, which contains the following nodes:
     * MODIFIERS, EXTENDS_CLAUSE, IMPLEMENTS_CLAUSE, OBJBLOCK, IDENT
     * @param root AST
     * @param is_a ArrayList
     * @param realizes ArrayList
     * @param has_a ArrayList
     * @return String
     */
    private static String parseClass(AST root,
        ArrayList<String> is_a,
        ArrayList<String> realizes,
        ArrayList<String> has_a) {

        String className = "";

        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {
            int type = c.getType();

            switch (type) {
            case IDENT:
                className = c.getText();
                break;
            case EXTENDS_CLAUSE:
                parseImplementsOrExtensions(c, is_a);
                break;
            case IMPLEMENTS_CLAUSE:
                parseImplementsOrExtensions(c, realizes);
                break;
            case OBJBLOCK:
                parseObject(c, has_a);
                break;
            }

            c = c.getNextSibling();

        } //end for each child

        return className;

    }

    /**
     * Parses the given IMPLEMENTS_CLAUSE or EXTENDS_CLAUSE to determine the
     * classes that are implements or extended
     * @param root AST
     * @param realizes ArrayList
     */
    private static void parseImplementsOrExtensions(AST root,
        ArrayList<String> realizes) {

        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {

            int type = c.getType();

            switch (type) {
            case IDENT:
                realizes.add(c.getText());
                break;
            }

            c = c.getNextSibling();

        } //end for each child

    }

    /**
     * Parses the given OBJBLOCK node, which contains the following nodes:
     * VARIABLE_DEF, METHOD_DEF, CLASS_DEF, CTOR_DEF, IDENT
     * @param root AST
     * @param has_a ArrayList
     */
    private static void parseObject(AST root, ArrayList<String> has_a) {

        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {
            int type = c.getType();

            switch (type) {
            case VARIABLE_DEF:
                parseVariableDef(c, has_a);
                break;
            }

            c = c.getNextSibling();

        } //end for each child

    }

    /**
     * Parses the given VARIABLE_DEF node, which contains the following nodes:
     * MODIFIERS, TYPE, IDENT
     * @param root AST
     * @param has_a ArrayList
     */
    private static void parseVariableDef(AST root, ArrayList<String> has_a) {
        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {

            int type = c.getType();

            switch (type) {
            case TYPE:
                parseType(c, has_a);
                break;
            }
            c = c.getNextSibling();

        } //end for each child

    }

    /**
     * Parses the given TYPE node for its class names
     * @param root AST
     * @param has_a ArrayList
     */
    private static void parseType(AST root, ArrayList<String> has_a) {
        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {

            int type = c.getType();

            switch (type) {
            case IDENT:
                has_a.add(c.getText());
                break;   
            }
            c = c.getNextSibling();

        } //end for each child

    }

}

Using the following the test file UMLHelper.java as the input to the above program:

package a.b.c.umlhelper;

import umlhelper.service.ServiceInterface;
import umlhelper.gui.MainFrameInterface;
import umlhelper.gui.MainFrame;
import umlhelper.service.Service;
import java.io.File;
import a.b.c.d.E;

/**
 * <p>Title: UML Helper</p>
 *
 * <p>Description: This is a test class for parsing, it does not compile.</p>
 *
 * <p>Copyright: Copyright (c) 2007</p>
 *
 * @author John valentino II
 * @version 1.0
 */
public class UMLHelper extends AB implements ServiceInterface, MainFrameInterface, A, B, C {

    /** Represents the gui tier of the application */
    private MainFrame mainFrame;
    /** Represents the service tier of the application */
    private Service service;
    /** Represents the version of this program */
    private float version = 0.1f;
    private ArrayList<String> list;

    /**
     * Entry point for the application
     * @param args String[]
     */
    public static void main(String[] args) {
        new UMLHelper();
    }

    /**
     * Creates the application
     */
    public UMLHelper() {
        this.service = new Service(this);
        this.mainFrame = new MainFrame(this);
    }

    /**
     * Returns the vesion of this program
     * @return float
     */
    public float getVersion() {
        return this.version;
    }

    /**
     * Opens the given directory and searches it for source code files and
     * other directories
     * @param directory File
     */
    public void open(File directory) {
        service.open(directory);
    }


}

The following output is generated:

Full Class Name: a.b.c.umlhelper.UMLHelper
Imports:
    umlhelper.service.ServiceInterface
    umlhelper.gui.MainFrameInterface
    umlhelper.gui.MainFrame
    umlhelper.service.Service
    java.io.File
    a.b.c.d.E
Extends (Is A):
    AB
Has A:
    MainFrame
    Service
    ArrayList

The following tree is displayed:

No comments:

Contributors