Sunday, April 29, 2007

Creating a Source Code Syntax Parser in Java Using ANTLR

Recently I found myself looking for a way to parse through the source code syntax of several languages. Rather than write my own syntax parser, I began looking for an existing open source solution. The first option I came across was JavaCC (https://javacc.dev.java.net/), which has a lot of documentation regarding how to install it, create grammars, create parsing code for grammars, but not really much on how to use it. Also according to the JavaCC FAQ:

JavaCC does not automate the building of trees (or any other specific parser output), although there are at least two tree building tools JJTree and JTB (see Chapter 6.) based on JavaCC, and building trees "by hand" with a JavaCC based parser is easy. (http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-moz.htm)

While searching for an easy way to for building trees "by hand" using JavaCC I came across another tool called ANTLR. ANTLR, ANother Tool for Language Recognition, (formerly PCCTS) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, Python, or C++ actions (http://www.antlr.org/about.html). With an abundance of grammars (http://www.antlr.org/grammar/list) and articles (http://www.antlr.org/article/list) it was pretty easy to get what I wanted working fast.

My specific example is a program that parses Java 1.5 source code for is imports, its "has a" relationships, its "is a" relationships, and the classes that it realizes (implements). Here are the steps to getting it to work:

  1. Download and install ANTLR (version 2.7.7) from http://www.antlr.org/download.html
  2. Download the Java 1.5 grammar for ANTLR 2.7.7 by Michael Studman's located at http://www.antlr.org/grammar/1090713067533/index.html.
  3. If running on Windows, add the location of ANTLR to the CLASSPATH environment variable. For example my library is located in "C:\antlr\277\lib\antlr.jar".
  4. Open a command window and navigate to the location of the grammar file, for example I put the Java 1.5 grammar in "C:\antlr\277\examples\java15-grammar".
  5. Run the command "java antlr.Tool java15.tree.g java15.g" assuming you are using Michael Studman's Java 1.5 grammar. This will generate the source code for the parser to be included in your project:
    • JavaLexer.java
    • JavaRecognizer.java
    • JavaTokenTypes.java
    • JavaTreeParser.java
    • JavaTreeParserTokenTypes.java
  6. Create a Java project using the above files along with the ANTLR JAR, and then use the following as the main class:
package antlrtest;

// Standard I/O
import java.io.IOException;
import java.io.File;
import java.io.BufferedReader;
import java.io.FileReader;
// ANTLR Library
import antlr.collections.AST;
import antlr.debug.misc.ASTFrame;
import antlr.RecognitionException;
import antlr.TokenStreamException;
import antlr.CommonAST;
import antlr.ASTFactory;
// Events
import java.awt.event.WindowAdapter;
import java.awt.event.WindowEvent;
// Data
import java.util.ArrayList;

/**
 * <p>Title: Main</p>
 *
 * <p>Description: Takes a Java source code file as an argument, parses it,
 * displays it content in a tree, and determines its imports and relationships
 * with other classes.</p>
 *
 * <p>Copyright: Copyright (c) 2007</p>
 *
 * @author John Valentino II
 * @version 1.0
 */
public class Main implements JavaTokenTypes {

    /**
     * Entry points for the program
     * @param args String[]
     */
    public static void main(String[] args) {

        if (args.length != 1) {
            System.out.println("Usage: main <java file name>");
            System.exit(0);
        }

        try {

            ArrayList<String> imports = new ArrayList<String>();
            ArrayList<String> is_a = new ArrayList<String>();
            ArrayList<String> realizes = new ArrayList<String>();
            ArrayList<String> has_a = new ArrayList<String>();

            //Open the given file
            File file = new File(args[0]);
            String fileName = file.getName();
            BufferedReader reader = new BufferedReader(new FileReader(file));

            //Create a scanner that reads from the input stream passed to us
            JavaLexer lexer = new JavaLexer(reader);
            lexer.setFilename(fileName);

            //Create a parser that reads from the scanner
            JavaRecognizer parser = new JavaRecognizer(lexer);
            parser.setFilename(fileName);

            //start parsing at the compilationUnit rule
            parser.compilationUnit();

            //Get the imports and relationships for this Java file and returns
            //the full class name
            String fullClassName = parseTree(parser.getAST(),
                parser.getTokenNames(),
                imports, is_a, realizes, has_a);

            System.out.println("Full Class Name: " + fullClassName);

            //display imports
            System.out.println("Imports:");
            for (int i = 0; i < imports.size(); i++) {
                System.out.println("\t" + imports.get(i));
            }

            //display the classes that this class inherits from
            System.out.println("Extends (Is A):");
            for (int i = 0; i < is_a.size(); i++) {
                System.out.println("\t" + is_a.get(i));
            }

            //display the "has a" relationships of this class
            System.out.println("Has A: ");
            for (int i = 0; i < has_a.size(); i++) {
                System.out.println("\t" + has_a.get(i));
            }

        } catch (IOException e) {
            e.printStackTrace();
        } catch (RecognitionException re) {
            re.printStackTrace();
        } catch (TokenStreamException tse) {
            tse.printStackTrace();
        }
    }


    /**
     * Parses the AST Root Node for a Java class. The AST Root node contains
     * the following nodes: PACKAGE_DEF, IMPORT, CLASS_DEF
     * @param t AST
     * @param tokenNames String[]
     * @param imports ArrayList
     * @param is_a ArrayList
     * @param realizes ArrayList
     * @param has_a ArrayList
     * @return String
     */
    private static String parseTree(AST t, String[] tokenNames,
        ArrayList<String> imports,
        ArrayList<String> is_a,
        ArrayList<String> realizes,
        ArrayList<String> has_a) {

        String packageName = "";
        String className = "";

        ((CommonAST) t).setVerboseStringConversion(true, tokenNames);
        ASTFactory factory = new ASTFactory();
        AST r = factory.create(0, "AST ROOT");
        r.setFirstChild(t);

        displayFrame(r);

        AST c = r.getFirstChild();

        //for each child...
        while (c != null) {

            int type = c.getType();

            switch (type) {

            case PACKAGE_DEF:
                packageName = getFormattedImport(c);
                break;
            case IMPORT:
                //get the import in the format "java.io.File"
                String importText = getFormattedImport(c);
                //add this import
                imports.add(importText);
                break;
            case CLASS_DEF:
                //there should only be one class per file
                className = parseClass(c, is_a, realizes, has_a);
                break;
            }

            c = c.getNextSibling();

        } //end for each child

        return packageName + "." + className;

    }

    /**
     * Displays the tree in a frame
     * @param r AST
     */
    private static void displayFrame(AST r) {
        final ASTFrame frame = new ASTFrame("Java AST", r);
        frame.setVisible(true);
        frame.addWindowListener(new WindowAdapter() {
            public void windowClosing(WindowEvent e) {
                frame.setVisible(false);
                frame.dispose();
                System.exit(0);
            }
        });
    }

    /**
     * Parses an IMPORT node and returns the full name
     * of the package
     * @param root AST
     * @return String
     */
    private static String getFormattedImport(AST root) {

        //get the import in the format "java.io.File."
        String importText = parseImport(root);

        //trim off the last character making it "java.io.File"
        importText = importText.substring(0, importText.length() - 1);

        return importText;

    }

    /**
     * Recursively parses an IMPORT node for the full name of the package
     * @param root AST
     * @return String
     */
    private static String parseImport(AST root) {

        String returner = "";

        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {
            int type = c.getType();
            if (type == DOT)
                returner = parseImport(c);
            else {
                if (!c.getText().equals(""))
                    returner += c.getText() + ".";
            }
            c = c.getNextSibling();
        } //end for each child

        return returner;

    }

    /**
     * Parses the given CLASS_DEF node, which contains the following nodes:
     * MODIFIERS, EXTENDS_CLAUSE, IMPLEMENTS_CLAUSE, OBJBLOCK, IDENT
     * @param root AST
     * @param is_a ArrayList
     * @param realizes ArrayList
     * @param has_a ArrayList
     * @return String
     */
    private static String parseClass(AST root,
        ArrayList<String> is_a,
        ArrayList<String> realizes,
        ArrayList<String> has_a) {

        String className = "";

        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {
            int type = c.getType();

            switch (type) {
            case IDENT:
                className = c.getText();
                break;
            case EXTENDS_CLAUSE:
                parseImplementsOrExtensions(c, is_a);
                break;
            case IMPLEMENTS_CLAUSE:
                parseImplementsOrExtensions(c, realizes);
                break;
            case OBJBLOCK:
                parseObject(c, has_a);
                break;
            }

            c = c.getNextSibling();

        } //end for each child

        return className;

    }

    /**
     * Parses the given IMPLEMENTS_CLAUSE or EXTENDS_CLAUSE to determine the
     * classes that are implements or extended
     * @param root AST
     * @param realizes ArrayList
     */
    private static void parseImplementsOrExtensions(AST root,
        ArrayList<String> realizes) {

        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {

            int type = c.getType();

            switch (type) {
            case IDENT:
                realizes.add(c.getText());
                break;
            }

            c = c.getNextSibling();

        } //end for each child

    }

    /**
     * Parses the given OBJBLOCK node, which contains the following nodes:
     * VARIABLE_DEF, METHOD_DEF, CLASS_DEF, CTOR_DEF, IDENT
     * @param root AST
     * @param has_a ArrayList
     */
    private static void parseObject(AST root, ArrayList<String> has_a) {

        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {
            int type = c.getType();

            switch (type) {
            case VARIABLE_DEF:
                parseVariableDef(c, has_a);
                break;
            }

            c = c.getNextSibling();

        } //end for each child

    }

    /**
     * Parses the given VARIABLE_DEF node, which contains the following nodes:
     * MODIFIERS, TYPE, IDENT
     * @param root AST
     * @param has_a ArrayList
     */
    private static void parseVariableDef(AST root, ArrayList<String> has_a) {
        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {

            int type = c.getType();

            switch (type) {
            case TYPE:
                parseType(c, has_a);
                break;
            }
            c = c.getNextSibling();

        } //end for each child

    }

    /**
     * Parses the given TYPE node for its class names
     * @param root AST
     * @param has_a ArrayList
     */
    private static void parseType(AST root, ArrayList<String> has_a) {
        AST c = root.getFirstChild();

        //for each child...
        while (c != null) {

            int type = c.getType();

            switch (type) {
            case IDENT:
                has_a.add(c.getText());
                break;   
            }
            c = c.getNextSibling();

        } //end for each child

    }

}

Using the following the test file UMLHelper.java as the input to the above program:

package a.b.c.umlhelper;

import umlhelper.service.ServiceInterface;
import umlhelper.gui.MainFrameInterface;
import umlhelper.gui.MainFrame;
import umlhelper.service.Service;
import java.io.File;
import a.b.c.d.E;

/**
 * <p>Title: UML Helper</p>
 *
 * <p>Description: This is a test class for parsing, it does not compile.</p>
 *
 * <p>Copyright: Copyright (c) 2007</p>
 *
 * @author John valentino II
 * @version 1.0
 */
public class UMLHelper extends AB implements ServiceInterface, MainFrameInterface, A, B, C {

    /** Represents the gui tier of the application */
    private MainFrame mainFrame;
    /** Represents the service tier of the application */
    private Service service;
    /** Represents the version of this program */
    private float version = 0.1f;
    private ArrayList<String> list;

    /**
     * Entry point for the application
     * @param args String[]
     */
    public static void main(String[] args) {
        new UMLHelper();
    }

    /**
     * Creates the application
     */
    public UMLHelper() {
        this.service = new Service(this);
        this.mainFrame = new MainFrame(this);
    }

    /**
     * Returns the vesion of this program
     * @return float
     */
    public float getVersion() {
        return this.version;
    }

    /**
     * Opens the given directory and searches it for source code files and
     * other directories
     * @param directory File
     */
    public void open(File directory) {
        service.open(directory);
    }


}

The following output is generated:

Full Class Name: a.b.c.umlhelper.UMLHelper
Imports:
    umlhelper.service.ServiceInterface
    umlhelper.gui.MainFrameInterface
    umlhelper.gui.MainFrame
    umlhelper.service.Service
    java.io.File
    a.b.c.d.E
Extends (Is A):
    AB
Has A:
    MainFrame
    Service
    ArrayList

The following tree is displayed:

Monday, April 23, 2007

Introduction to the Web Enterprise

Table of Contents


Web Servers

The term web server is used to refer to both the physical computer responsible for providing web services, and the software running on that computer that provides the web services (http://en.wikipedia.org/wiki/Web_server). The software responsible for providing web services is also referred to as an application server (http://en.wikipedia.org/wiki/Application_server). The hardware for a good web server is expensive, along with the cost of bandwidth depending on the service provider and traffic. For this reason smaller to medium sized websites are run on web hosts, which are web hosting service providers that provide the web server and application server for multiple entities (http://en.wikipedia.org/wiki/Web_host). The application server of the web host determines the server-side scripting languages, databases, and enterprise frameworks that you can use. The following are some of the more popular application servers, for a list of all web servers see http://en.wikipedia.org/wiki/Comparison_of_web_servers:

IIS (Internet Information Systems)

IIS is a Microsoft technology for running web services on a web server. Typically IIS is used to run the server-side scripting language ASP and the .NET framework. It also is commonly used with the database management system (DBMS) MS Access and MS SQL. IIS comes with Windows Server OS, and with Windows XP and MCE.

Apache

Apache originally was a Unix based application server, but has since been ported to Linux, Windows, and OS X. It is free and open source, and is credited with playing a key role in the growth of the Internet. Apache is typically used to run the server-side scripting language PHP, and the database management system MySQL.

Apache Tomcat

Apache Tomcat is a web container that implements Java Server Pages. Since this is Java based it is cross platform compatible, so it can run on any operating system and can use any DBMS that has a driver for Java. It should also be noted that projects such as Jakarta allow Apache Tomcat to run with IIS.

JAS (Java Application Server)

JAS is a platform used for delivering server-side web content using the J2EE framework, which includes JSP, servlets, and EJBs (Enterprise Java Beans). JAS includes its own database and can work with any other DBMS with a driver for Java.


Layers of Web Programming

Most people consider the web a mysterious random conglomeration of applications and content of different types and uses, with no categories or organization. While the web is a random series of static and dynamic content, the languages used to generate an maintain that content can be grouped by function and usages. Instead of just referring to every individual language used on the web, I find it more intuitive to group languages by overall function, and divide functions into layers.

  • Text formatting - These are languages that are just used to specify how to handle text in browsers.
  • Browser Scripting - These are languages that are used to program the behavior of the browser.
  • Server Scripting - These are languages that are used to program the behavior of the server.
  • Databases - This refers to the where data is stored, and the language used to communicate with those databases.
  • Data Formats - This refers to the various formats that are used to pass data on the web, which may or may not be independent of a database.
  • Enterprise Architectures - These are packages of web technologies which encapsulate all of the other layers.

Text Formatting

The text formatting languages on the web are basically variations of the Hyper Text Markup Language (HTML), and Cascading Style Sheets (CSS).

HTML (Hyper Text Markup Language)

HTML is simply a language for formatting text in files with the "htm" or "html " file extension. HTML uses markup tags to define what to do with text when it is interpreted by the browser. For example "<b>This is text</b>" in HTML would be displayed as "This is text" in a browser. The <b> markup tag is used to make text bold. While HTML can be done in any text editor, there are many tools that work like a Word Processor for generating HTML, so that the user doesn't have to remember all of the markup tags.

Tools for generating HTML

HTML Tutorials

CSS (Cascading Style Sheets)

CSS is used to define how to display individual HTML elements, by defining display properties within an HTML page or by linking HTML pages to external files with the "css" file extension.

Browser Scripting

The following are the most common languages and practices for programming the behavior of the browser (list from www.w3schools.com) :

  • JavaScript
  • HTML DOM (Hyper Text Markup Language Document Object Model)
  • DHTML (Dynamic Hyper Text Markup Language)
  • VBScript (Visual Basic Script)
  • AJAX (Asynchronous JavaScript and XML)
  • AppML (Application Markup Language)
  • E4X (ECMAScript for XML)
  • WMLScript (Wireless Markup Language Script)

JavaScript

JavaScript is a scripting language for programming the behavior of the browser, with syntax similar to that of the Java programming language. It should also be noted that JavaScript and the Java programming language are not the same thing. JavaScript can be embedded within an HTML page, or linked externally as a file with a ".js" file extension. Either way JavaScript is executed at runtime by the browser itself in order to provide dynamic interaction.

HTML DOM (Hyper Text Markup Language Document Object Model)

HTML DOM is a standardized set of methods for the management of HTML documents for multiple programming languages, by dividing the HTML document into a node tree:

Image from W3Schools - http://www.w3schools.com/htmldom/dom_nodes.asp

DHTML (Dynamic Hyper Text Markup Language)

DHTML is not a language itself, but is used to describe a combination of web languages used to add dynamic content to HTML. Usually the combination consists of HTML 4.0, CSS, DOM, and JavaScript.

VBScript (Visual Basic Script)

VB Script is a language for programming the behavior of the browser, with syntax similar to Visual Basic. VBScript is executed at runtime by the browser itself in order to provide dynamic interaction.

AJAX (Asynchronous JavaScript and XML)

AJAX is not a language, but is a technique for using JavaScript, XML, HTML, and CSS to create more interactive web pages faster.

AppML (Application Markup Language)

AppML is a declarative language that uses XML to describe internet applications, which is open source and browser independent. AppML XML files reside on a web server and use an AppML web service for execution.

E4X (ECMAScript for XML)

E4X is an standardized extension of JavaScript that includes direct support of XML.

WMLScript (Wireless Markup Language Script)

WMLScript is a browser scripting language for WML pages, which are pages written in an XLM based markup language that inherits from HTML, intended to be viewed by wireless devices.


Server Scripting

Server scripting languages are used to execute code directly on the server. Because these languages are executed on the server, no browser client support is needed. In the 3-tier architecture server side languages reside in the application/logic tier, while in the MVC pattern the server side languages reside in the controller. The following are the most common languages for programming the behavior of the server:

  • ASP (Active Server Pages)

  • ESP (Escapade Server Pages)

  • JSP (JavaServer Pages)

  • CFML (ColdFusion Markup Language)

  • PHP (Hypertext Preprocessor)

ASP (Active Server Pages)

ASP is a Microsoft technology that most commonly runs on the Internet Information Systems (IIS) web service used for generating dynamic web pages. ASP pages can actually be written in either VBScript, PerlScript, or JScript, where VBScript is the most common implementation. Though ASP is still in widespread use it has evolved into ASP.NET as part of the .NET architecture, which is discussed later.

ESP (Escapade Server Pages)

ESP is a server-side scripting language the runs on Linux and Unix based platforms. It is commonly used in Europe but is generally not used within the US.

JSP (JavaServer Pages)

JSP is a Java technology server-side scripting language with XML-like syntax that can run on Tomcat or JAS (Java Application Server), which is part of the J2EE architecture but can be used by itself.

CFML (ColdFusion Markup Language)

CFML is an XML based server-side scripting language that typically runs on the Adobe ColdFusion application server.

PHP (Hypertext Preprocessor)

PHP is a server-side scripting language with syntax similar to PERL. It typically runs on the Apache application server.


Databases

A database is a structured collection of data (http://en.wikipedia.org/wiki/Database), but in web programming the term database usually refers to the software used to manage and query the database. This software is referred to as a Database Management System (DBMS), or more correctly for database systems that are based on the relational model, Relational Database Management Systems (RDBMS). The relational model uses predicate logic and set theory to imply relationships between data within in a database (http://en.wikipedia.org/wiki/Database#Relational_model), which yields the ability to have related data. Communication with databases occurs through the use of queries using the Structured Query Language (SQL), which varies slightly from one database system to the next. Some of the more popular database systems are as follows:

Microsoft Access

Microsoft Access, now called Microsoft Office Access, is a simple relational database management system that what it lacks in power makes up for in simplicity. Microsoft Access doesn't require an application server and can be used as a desktop application, but when run with an application server it is typically IIS.

Microsoft SQL Server

Microsoft SQL Server is a relational database management system that typically runs on IIS.

MySQL

MySQL is a relational database management system that runs on numerous application servers, notably Apache and IIS.

Sybase

Sybase is a relational database management system that at one time was almost identical to MS SQL Server, but has since branched off into its own. It also runs on multiple platforms including Windows (IIS).

Oracle

The term Oracle is used to refer to both a relational database management system and the company that makes it. Oracle runs on multiple platforms, but is the most expensive of RDMS with prices depending on license agreement that range from $149 per user to $40,000 per machine.

DB2

DB2 is IBM's relational database management system, which runs on multiple platforms (Windows, Linux, and Unix).

PostgreSQL

PostgreSQL is an object-relational database management system (ORDBMS), which is a relational database management system that allows developers to integrate the database with their own custom data types and methods (http://en.wikipedia.org/wiki/ORDBMS).


Data Formats

The term data format in the context of this document is used to refer to standardized methods for storing and/or transmitting data with or without the use of a database. All of the formats described here are XML based, which means they use the XML syntax for storing self-describing data.

XML

XML

The Extensible Markup Language (XML) is a general purpose markup language used for storing self-descriptive data in a human readable format using tree-based structures.

XSL

The Extensible Stylesheet Language (XSL) is stylesheet language for XML, and is used to describe how an XML document should be displayed in the same way that CSS does for HTML. XSL consists of XSLT, XSL-FO, and XPath.

XSLT

XSL Transformations (XSLT) is a language for transforming XML documents.

XSL-FO

XSL Formatting Objects (XSL-FO) is a language for formatting XML documents.

XPath

XML Path (XPath) is a language for navigating XML documents.

XQuery

XML Query (XQuery) is a language for querying XML documents similar to SQL.

Linking in XML

XML Link (XLink) defines a format for creating hyperlinks in XML documents, while XML Pointer (XPointer) defines a format for creating hyperlinks to point to more specific parts of XML documents.

DTD

Document Type Definition (DTD) is used to define the structure of XML documents.

XSD

XML Schema Definition (XSD) is an alternative to DTD and is used to define the structure of XML documents.

XML DOM

XML Document Object Model (XML DOM) defines a standard way for traversing and manipulating XML documents.

SOAP

Simple Object Access Protocol (SOAP) is an XML based protocol for communication over HTTP.

WSDL

The Web Services Description Language (WSDL) uses XML to describe web services and how they are accessed.

RDF

The Resource Description Framework (RDF) is a standard for describing web resources.

RSS

Really Simple Syndication is an XML based method distributing web content, most commonly in the form of news feeds.


Enterprise Architectures

In terms of the Web Enterprise, Enterprise Frameworks refer to a group of related services for creating and managing web content over all tiers of web development using defined practices, tools, and methods. Two popular and competing Enterprise Frameworks are .NET and J2EE.

J2EE

Java Platform, Enterprise Edition (Java EE) is the industry standard for developing portable, robust, scalable and secure server-side Java applications (http://java.sun.com/javaee/). As shown in the following figure, J2EE provides the framework for all tiers of web development:

Image from java.sun.com - http://java.sun.com/javaee/5/docs/tutorial/doc/

.NET

The .NET Framework provides a large body of pre-coded solutions to common program requirements, and manages the execution of programs written specifically for the framework (http://en.wikipedia.org/wiki/Microsoft_.NET_Framework#.NET_Framework_architecture). As does J2EE, the .NET Framework encompasses all tiers of web development. It also relies on the Common Language Infrastructure (CLI) to take in code from different languages in order to execute the result on the windows platform:

Image from Wikipedia - http://en.wikipedia.org/wiki/Microsoft_.NET_Framework