Integration of ANTLR into coala core

Metadata
cEP 18
Version 1.0
Title Integration of ANTLR into coala core
Authors Viresh Gupta mailto:[email protected]
Status Proposed
Type Feature

Abstract

This document describes how an API based on ANTLR will be constructed and maintained.

Introduction

ANTLR provides parsers for various language grammars. We can therefore provide an interface to it from within coala and make it available to bear writers, so that they can write advanced linting bears. The interface aims to support the more flexible visitor-based method of AST traversal (as opposed to listener-based mechanisms).

The proposal is to introduce a concept parallel to the coala-bears library, which will be called coala-antlr from here on.

Proposed Change

Here is the detailed implementation, step by step:

  1. There will be a separate repository named coala-antlr, which will be installable via pip install coala-antlr.
  2. A new package, coantlib, will be introduced in the coala-antlr repository. It will maintain all the dependencies for antlr-ast based bears.
  3. The coantlib package will provide endpoints relevant to the visitor model of traversing the AST generated via ANTLR.
  4. Another package named coantparsers will be created inside the coala-antlr repository. This package will hold the parsers and lexers for the supported set of grammars, generated beforehand with the Python target.
  5. coantlib will have an ASTLoader class that will be responsible for loading the AST of a given file depending on the user input, using the file extension as a fallback. For example, a .c file will be loaded using the parser for the C language if no language is specified.
  6. An ASTWalker class will be responsible for providing a single method to resolve the language supplied by the bear and create a walker instance of the appropriate type.
  7. coantlib.walkers will also have several walker classes derived from the ASTWalker class that are language specific and grammar dependent, for example a Py3ASTWalker.
  8. The bears dependent on coantlib will reside in the same repository as coantlib. All of them will be in a different package, the coantbears package, and all of them will be installable together with coantlib via a single option in the setup.py.
  9. The bears will be able to use a high-level API by defining the LANGUAGE instance variable, which will give them a walker with high-level APIs if the language is supported. (For example, if LANGUAGE='Python', they will automatically get a BasePyASTWalker instance in the walker class variable.)
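
The fallback rule in step 5 can be sketched as follows. This is only an illustration under stated assumptions: the function name pick_language and the EXTENSION_TO_LANGUAGE table are hypothetical and not part of the proposed coantlib API; the sketch only shows that an explicit language wins and the file extension is consulted otherwise.

```python
import os

# Hypothetical extension table; the real mapping would live in coantlib.
EXTENSION_TO_LANGUAGE = {
    '.c': 'C',
    '.py': 'Python',
    '.js': 'JavaScript',
}


def pick_language(filename, lang='auto'):
    """Return the language to parse `filename` with.

    An explicit `lang` always wins; the file extension is only a fallback.
    """
    if lang != 'auto':
        return lang
    extension = os.path.splitext(filename)[1].lower()
    if extension not in EXTENSION_TO_LANGUAGE:
        raise ValueError('Unknown file type: {!r}'.format(filename))
    return EXTENSION_TO_LANGUAGE[extension]
```

With this table, pick_language('main.c') selects C, while pick_language('script.txt', lang='Python') keeps the language requested by the user.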

Management of the new coala-antlr repository

Managing a new repository is a heavy task, so it will be highly automated as follows:

  1. The parsers in coala-antlr/coantparsers will be generated and pushed via CI builds whenever a new commit is pushed.
  2. The cib tool can be enhanced to deal with the installation of bears that require only some specified parsers (for example, a PyUnusedVarBear would only require the parser for Python).
  3. There will be nightly cron jobs for keeping parsers up-to-date.
  4. To add support for a new language, a new Walker class deriving from ASTWalker will be added, and the rest will automatically be taken care of by the library.

Code Samples/Prototypes

Here is a prototype for the implementations within coantlib:

import antlr4
import coantparsers
import inspect

from antlr4 import InputStream
from coalib.bearlib.languages.definitions import *
from coalib.bearlib.languages.Language import parse_lang_str, Languages


def resolve_with_package_info(lang):
    """
    Use the inspect module to search amongst all
    classes that inherit ASTWalker within this module
    and return the class whose supported language matches.
    """


class ASTLoader(object):

    mapping = {
        C: [coantparsers.clexer, coantparsers.cparser],
        Python: [coantparsers.pylexer, coantparsers.pyparser],
        JavaScript: [coantparsers.jslexer, coantparsers.jsparser],
        ...
    }

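    @staticmethod
    def load_file(file, filename, lang='auto'):
        """
        Create the parser from which the file's AST is loaded.

        get_extension and FileExtensionNotSupported are helpers that
        this prototype assumes to exist.
        """
        if lang == 'auto':
            ext = get_extension(filename)
        else:
            ext = Languages(parse_lang_str(lang)[0])[0]
        if ext is None:
            raise FileExtensionNotSupported('Unknown File Type')
        input_stream = InputStream(''.join(file))
        lexer_class, parser_class = ASTLoader.mapping[ext]
        lexer = lexer_class(input_stream)
        # ANTLR parsers consume a token stream rather than a lexer.
        parser = parser_class(antlr4.CommonTokenStream(lexer))
        return parser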


class ASTWalker(object):
    treeRoot = None
    tree = None

    @staticmethod
    def make_walker(lang):
        """
        Resolve the walker class for the given language.
        """
        if lang == 'auto':
            return ASTWalker
        else:
            return resolve_with_package_info(lang)

    @staticmethod
    def get_walker(file, filename, lang):
        """
        Return a walker for the required language.
        """
        ret_class = ASTWalker.make_walker(lang)
        if not ret_class:
            raise Exception('Un-supported Language')
        else:
            return ret_class(file, filename)

    def __init__(self, file, filename):
        self.tree = ASTLoader.load_file(file, filename)
        self.treeRoot = self.tree
    ...
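
The body of resolve_with_package_info is left open in the prototype. One way it might work is sketched below. Note the swap: instead of scanning the module with inspect, this version walks the subclass tree through the built-in __subclasses__() method. The stand-in classes carry no ANTLR logic, and the names iter_subclasses and resolve_walker are hypothetical.

```python
class ASTWalker:
    """Stand-in for the proposed base walker; no ANTLR involved."""
    LANGUAGES = set()


class BasePyASTWalker(ASTWalker):
    LANGUAGES = {'Python'}


class Py3ASTWalker(BasePyASTWalker):
    LANGUAGES = {'Python 3'}


def iter_subclasses(cls):
    """Yield every direct and indirect subclass of `cls`."""
    for sub in cls.__subclasses__():
        yield sub
        yield from iter_subclasses(sub)


def resolve_walker(lang):
    """Return the walker class whose LANGUAGES contains `lang`, else None."""
    wanted = lang.strip().lower()
    for walker_class in iter_subclasses(ASTWalker):
        if wanted in {name.lower() for name in walker_class.LANGUAGES}:
            return walker_class
    return None
```

Because the lookup only reads the LANGUAGES attribute, adding a walker for a new language would need no change to the resolver, which matches the automation goal stated for new languages.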

Prototype of Py3ASTWalker

Walkers of this kind will supply a high-level walker API.

"""Prototype of Python ASTWalkers."""


class BasePyASTWalker(ASTWalker):
    """Base class for Python Walkers. This will be an abstract class."""

    LANGUAGES = {'Python'}

    def get_imports(self):
        """Return list of strings containing import statements."""

    def get_methods(self):
        """Return list of strings containing all methods from the file."""
    ...


class Py2ASTWalker(BasePyASTWalker):
    """Python 2 specific walker."""

    LANGUAGES = {'Python 2'}

    def get_print_statements(self):
        """Return list of print statements, since print is a keyword in py2."""
    ...


class Py3ASTWalker(BasePyASTWalker):
    """Python 3 specific walker."""

    LANGUAGES = {'Python 3'}

    def get_integer_div_stmts(self):
        """
        Return a list of integer divide statements.

        Since / and // are different in py3, this is py3 specific.
        """
    ...

Prototype of ASTBear class implementation

"""Prototype of ASTBear class."""

from coantlib import ASTWalker
from coalib.bears.LocalBear import LocalBear


class ASTBear(LocalBear):
    """
    ASTBear class.

    This class is similar to LocalBear except for the added capability of
    initialising a walker for use by the instance of this bear.
    """

    walker = None

    def initialise(self, file, filename):
        """Initialise self.walker according to the file and language."""
        if self.LANGUAGES:
            # LANGUAGES is assumed to be a set, as in the walker prototypes,
            # so it cannot be indexed.
            self.walker = ASTWalker.get_walker(file,
                                               filename,
                                               next(iter(self.LANGUAGES)))
        else:
            self.walker = ASTWalker.get_walker(file, filename, 'auto')

    def run(self,
            filename,
            file,
            tree,
            *args,
            dependency_results=None,
            **kwargs
            ):
        """Actual run method - to be implemented by bear."""
        raise NotImplementedError('Needs to be done by bear')

A test bear

"""A sample Bear."""

from coantlib.ASTBear import ASTBear
from coalib.results.Result import Result


class TestBear(ASTBear):
    """A test bear that uses walker from coantlib for linting."""

    def run(self,
            filename,
            file,
            tree,
            *args,
            dependency_results=None,
            **kwargs
            ):
        """Essence of logic for bear."""
        self.initialise(file, filename)
        violations = {}
        method_list = self.walker.get_methods()
        for method in method_list:
            """
            logic for checking naming convention
            and yielding an apt result
            wherever required
            """
        ...
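
The loop body above is deliberately left open. As a rough illustration of what such a check might look like, the sketch below assumes that get_methods() yields plain method names and that the convention being enforced is snake_case. Both are assumptions, and find_naming_violations is a hypothetical helper rather than part of coantlib.

```python
import re

# Leading underscores are allowed, then lowercase letters, digits, underscores.
SNAKE_CASE = re.compile(r'^_*[a-z][a-z0-9_]*$')


def find_naming_violations(method_names):
    """Return the names that are not snake_case, in input order."""
    return [name for name in method_names if not SNAKE_CASE.match(name)]
```

A bear could then yield one result per returned name.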

Providing Fixes with the library

ANTLR has a HiddenTokenStream which receives all the tokens irrelevant to the parser, such as whitespace. These tokens are not passed on to the parser and, as a result, are not returned when simply fetching the text of a node in the parse tree.

Consider the code:

def do_nothing():
    pass

def run_coala(console_printer=None,
              log_printer=None,
              print_results=do_nothing,
              print_section_beginning=do_nothing,
              nothing_done=do_nothing,
              autoapply=True,
              force_show_patch=False,
              arg_parser=None,
              arg_list=None,
              args=None,
              debug=False):
    """
    This function signature has been taken from coala-core for fairly complex
    example.
    """
    print('This function will do something')
    if (console_printer and
        log_printer and
        not (args == None and debug == False)
        ):
       print('These conditions were made to trick !')
       print(' !(a & b) == !a or !b')

The corresponding parse tree: [figure: example_tree_complex]

After converting this source back using:

#!/usr/bin/env python
from antlr4 import *
from Python3Lexer import Python3Lexer
from Python3Parser import Python3Parser
from Python3Visitor import Python3Visitor

import sys

class Visitor(Python3Visitor):
    def visitFile_input(self, ctx):
        print(ctx.getText())

def main(args):
    file = FileStream(args[1])
    lexer = Python3Lexer(file)
    stream = CommonTokenStream(lexer)
    parser = Python3Parser(stream)

    tre = parser.file_input()
    if parser.getNumberOfSyntaxErrors()!=0:
        print("File contains {} "
              "syntax errors".format(parser.getNumberOfSyntaxErrors()))
        return

    visitor = Visitor()
    visitor.stream = stream
    visitor.visit(tre)

if __name__ == '__main__':
    main(sys.argv)

We get the following output:

defdo_nothing():
    pass

defrun_coala(console_printer=None,log_printer=None,print_results=do_nothing,print_section_beginning=do_nothing,nothing_done=do_nothing,autoapply=True,force_show_patch=False,arg_parser=None,arg_list=None,args=None,debug=False):
    """
    This function signature has been taken from coala-core for fairly complex
    example.
    """
print('This function will do something')
if(console_printerandlog_printerandnot(args==Noneanddebug==False)):
       print('These conditions were made to trick !')
print(' !(a & b) == !a or !b')


<EOF>

The most important points to note in the above output, especially for the Python grammar, are:

  • All whitespace between two tokens has been removed unless it is inside a string.
  • All newlines have been stripped unless the grammar specifies that the newline is present in the parse tree. With the Python grammar, newlines appear irregularly because they are preserved only in the case of the compound_stmt rule.
  • All information about the indentation of the code is lost, which makes it impossible for the fixer to suggest a complex fix without guessing the indentation.

Despite all of the above, ANTLR provides a line number and a column number for each terminal node. Bears written with this library can therefore provide some fixes for linting errors; one such fix is implemented by the PyPluralBear. Once a complete round-trip mechanism is possible, any kind of fix can be made and the library can provide functions for doing so, but this is outside the scope of this project because of the aforementioned limitations.

Manual space insertion can be done using:

#!/usr/bin/env python
from antlr4 import *
from Python3Lexer import Python3Lexer
from Python3Parser import Python3Parser
from Python3Visitor import Python3Visitor

import sys

class Visitor(Python3Visitor):
    def visitTerminal(self, node):
        line = node.getPayload().line
        col = node.getPayload().column+1
        self.nodeinfo.append((line,col,node.getText(), type(node.parentCtx)))

def print_source(node_list):
    current_line=1
    current_column=1
    for element in sorted(node_list):
        while current_line<element[0]:
            current_column=1
            current_line+=1
            print("")

        while current_column<=element[1]-1:
            current_column+=1
            print(" ", end="")

        if element[2]!='\n' and element[2]!='\t' and element[2].strip()!='':
            print(element[2], end="")
            current_column = current_column + len(element[2])
        if element[2].count('\n') and element[2]!='\n':
            current_column = 1
            current_line += element[2].count('\n')
            print("")

def main(args):
    file = FileStream(args[1])
    lexer = Python3Lexer(file)
    stream = CommonTokenStream(lexer)
    parser = Python3Parser(stream)

    tre = parser.file_input()
    if parser.getNumberOfSyntaxErrors()!=0:
        print("File contains {} "
              "syntax errors".format(parser.getNumberOfSyntaxErrors()))
        return

    visitor = Visitor()
    visitor.stream = stream
    visitor.nodeinfo = []
    visitor.visit(tre)
    print_source(visitor.nodeinfo)

if __name__ == '__main__':
    main(sys.argv)
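
The same column-driven padding can be tried without ANTLR or generated parsers. The sketch below works on hand-written (line, column, text) tuples, a simplified form of the tuples collected by the visitor, and returns the text instead of printing it. The function name rebuild_source is hypothetical.

```python
def rebuild_source(tokens):
    """Rebuild text from (line, column, text) tuples.

    Lines and columns are 1-based. Gaps between tokens are padded with
    spaces; their original content (spaces versus tabs) cannot be
    recovered from positions alone.
    """
    lines = {}
    for line, column, text in sorted(tokens):
        current = lines.get(line, '')
        # Pad up to the token's column, then append the token itself.
        lines[line] = current.ljust(column - 1) + text
    if not lines:
        return ''
    last = max(lines)
    # Lines without any token are emitted as empty lines.
    return '\n'.join(lines.get(n, '') for n in range(1, last + 1))


tokens = [
    (1, 1, 'def'), (1, 5, 'do_nothing'), (1, 15, '('), (1, 16, ')'),
    (1, 17, ':'), (2, 5, 'pass'),
]
rebuilt = rebuild_source(tokens)
```

Here rebuilt holds the two-line function with its spacing restored, because every gap in this input was made of single spaces.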

However, this approach also has complications, especially when the source grammar uses stray nodes. For example, the Python grammar used here (with the Python target) uses nodes that the Python grammar for Java doesn't use. As of this cEP, this forces ANTLR programs written in Python to deal with stray character nodes. Since the library can interface with any arbitrary but correct ANTLR grammar, it currently cannot detect these stray nodes automatically.

An example of these stray nodes can be seen using the Python3 grammar with a source that is indented using spaces.

Sample output for the above input code using the improvised pretty printer:

def do_nothing():
 hpass

def run_coala(console_printer=None,
     log_printer=None,
     print_results=do_nothing,
     print_section_beginning=do_nothing,
     nothing_done=do_nothing,
     autoapply=True,
     force_show_patch=False,
     arg_parser=None,
     arg_list=None,
     args=None,
     debug=False):
 """
  This function signature has been taken from coala-core for fairly complex
  example.
  """
 F
 eprint('This function will do something')
 ift(console_printer and
  log_printer and
  not (args == None and debug == False)
  ):
  =print('These conditions were made to trick !')
  eprint(' !(a & b) == !a or !b')
<EOF>

Notice the stray h in front of pass; in the actual node, it coincides with the position of the character p. Similarly, there are extra F, e, t and = characters occupying positions seemingly at random (this is an effect of the context rules provided in the grammar). Also notice that replacing tab characters with spaces drastically reduced the indent levels.

For example, let's try to provide a fix from a bear that reduces the complexity of conditions inside an if statement. For the above example, this means converting

    if (console_printer and
        log_printer and
        not (args == None and debug == False)):
       print('These conditions were made to trick !')
       print(' !(a & b) == !a or !b')

into

    if (console_printer and
        log_printer and
        (args != None or debug != False)):
       print('These conditions were made to trick !')
       print(' !(a & b) == !a or !b')
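
The rewrite rests on De Morgan's law. A brute-force check over all truth values, written here as a small sketch with placeholder function names, confirms the equivalence. It also shows that the rewritten sub-expression needs its own parentheses when it follows an and, since and binds tighter than or.

```python
from itertools import product


def original(a, b):
    return not (a and b)


def rewritten(a, b):
    return (not a) or (not b)


# De Morgan: the two forms agree for every combination of truth values.
assert all(original(a, b) == rewritten(a, b)
           for a, b in product([False, True], repeat=2))


def with_parens(x, a, b):
    return x and ((not a) or (not b))


def without_parens(x, a, b):
    # Parsed as (x and (not a)) or (not b), because `and` binds tighter.
    return x and (not a) or (not b)


# Collect the inputs on which the unparenthesised form gives another result.
mismatches = [(x, a, b)
              for x, a, b in product([False, True], repeat=3)
              if with_parens(x, a, b) != without_parens(x, a, b)]
```

In terms of the example, x stands for console_printer and log_printer, a for args == None and b for debug == False. The two forms differ whenever x and b are both false.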

To perform the above transformation, one would ideally transform the tree at those and nodes that have a not node as a parent, and then reconstruct the source code from the transformed tree.

Since ANTLR doesn't provide a mechanism for node transformation as of this cEP, we have to perform this transformation manually at the string level. To do so, we save the starting line and column number of the not node as well as its ending line and column number. The bear then processes this information and constructs a new string, which is sent to coala for diff generation.
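
A minimal sketch of such a string-level fix is shown below. It assumes the bear already knows the start and end positions of the node. The helper replace_span is hypothetical, and the standard difflib module stands in for coala's own diff handling. In a real bear the positions would come from the parse tree rather than from a string search.

```python
import difflib


def replace_span(lines, start, end, replacement):
    """Replace the text between two positions and return the new lines.

    `lines` is a list of strings with line endings kept. `start` and
    `end` are 1-based (line, column) pairs; `end` points just past the
    last character to replace.
    """
    (start_line, start_col), (end_line, end_col) = start, end
    prefix = lines[start_line - 1][:start_col - 1]
    suffix = lines[end_line - 1][end_col - 1:]
    patched = (prefix + replacement + suffix).splitlines(keepends=True)
    return lines[:start_line - 1] + patched + lines[end_line:]


source = [
    'if (console_printer and\n',
    '    log_printer and\n',
    '    not (args == None and debug == False)):\n',
    "    print('done')\n",
]
target = 'not (args == None and debug == False)'
# Stand-in for the column that the not node would report.
column = source[2].index(target) + 1
fixed = replace_span(source, (3, column), (3, column + len(target)),
                     '(args != None or debug != False)')
diff = ''.join(difflib.unified_diff(source, fixed, 'before', 'after'))
```

Only the third line changes, so the resulting diff contains a single removed line and a single added line.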

Thus fixes from bears using this library are possible, but they have to be implemented by the bear itself. Also, as noted above, ANTLR has poor support for retrieving original source tokens that were discarded by the grammar from the parse tree. The feature to fix code using library-provided functions is therefore out of scope for this project and will be proposed as a future project.