Home
Parse C source code from Python
Parse SQL scripts from Python
Parse GTK-Doc style comments from Python (parses function name, arguments, annotations and return value)
Parse food recipes from Python
Check this one too: Simple Python-based Static Code Analyzer for C programs based on GLib/GStreamer.
pylangparser - Simple language parsing from Python. The project provides classes for parsing formal languages in an easy way. It uses no external libraries, only the standard-library modules unittest, re and pprint. There is a Lexer and a Parser class: the lexer produces a list of tokens, which the parser then uses to build the AST. The lexer can also be used as a standalone component. Customized ASTs are supported, and grammars are defined directly in Python code.
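To make the lexer/parser split concrete, here is a minimal standalone sketch of a regex-based lexer built only on the re module. It is not pylangparser's implementation or API: TOKEN_SPEC, IGNORED and tokenize are hypothetical names chosen for illustration, and the token kinds merely mirror the ones used later on this page. Patterns are tried in order, so the sketch lists '==' before '='.

```python
import re

# Hypothetical token table for this sketch (not pylangparser's API).
# Patterns are tried in order, so '==' is listed before '='.
TOKEN_SPEC = [
    ("KEYWORD", r"if\b"),
    ("OPERATOR", r"==|=|\+|-|<|>|;|\(|\)"),
    ("CONSTANT", r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("COMMENT", r"#[^\n]*"),
    ("SKIP", r"[ \t\v\f\n]+"),
]
# Token kinds that are matched but never emitted.
IGNORED = {"COMMENT", "SKIP"}


def tokenize(source):
    """Return a list of (kind, text) pairs; raise ValueError on unknown input."""
    compiled = [(kind, re.compile(pattern)) for kind, pattern in TOKEN_SPEC]
    tokens = []
    pos = 0
    while pos < len(source):
        for kind, regex in compiled:
            match = regex.match(source, pos)
            if match:
                if kind not in IGNORED:
                    tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            # No pattern matched at this position.
            raise ValueError("Unknown symbol %r at offset %d" % (source[pos], pos))
    return tokens


print(tokenize("p = 12;\nif (p == 12) p = 3 + 2;  # comment"))
```

The resulting flat token list is what a parser stage would consume next; keeping the two stages separate is also what lets a lexer be used on its own.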
In the examples folder you will find both simple example scripts demonstrating basic usage of the parser and some more useful and complex ones. For example, there is a script for parsing C source code and building and iterating the AST. A SQL parser will be added soon too.
Note: Documentation is not fully complete yet. Existing APIs can still change.
Feel free to send suggestions, comments and patches.
The test defines a simple calculator language, MATABC, and demonstrates how programs written in that language are parsed.
from pylangparser import *
# define all tokens in the language
IF = Keyword(r'if')
KEYWORDS = IF
PLUS = Operator(r'+')
MINUS = Operator(r'-')
ASSIGNMENT = Operator(r'=')
SEMICOLON = Operator(r';')
EQ = Operator(r'==')
LE = Operator(r'<')
GT = Operator(r'>')
LPAR = Operator(r'(')
RPAR = Operator(r')')
# order is important as the first operator that matches will be used,
# so a longer operator such as '==' must be listed before its prefix '='
OPERATORS = EQ & PLUS & MINUS & ASSIGNMENT & LE & GT & SEMICOLON & \
LPAR & RPAR
IGNORE_CHARS = Ignore(r'[ \t\v\f\n]+')
COMMENTS = Ignore(r'\#.*\n')
IDENTIFIER = Symbols(r'[A-Za-z_]+[A-Za-z0-9_]*')
CONSTANT = Symbols(r'[0-9]+')
TOKENS = KEYWORDS & OPERATORS & CONSTANT & IDENTIFIER & \
COMMENTS & IGNORE_CHARS
# we want that certain tokens are ignored in the AST
IgnoreTokensInAST(SEMICOLON & LPAR & RPAR)
# define our grammar
arthm_operator = \
OperatorParser(PLUS) | \
OperatorParser(MINUS)
comp_operator = \
OperatorParser(LE) | \
OperatorParser(GT) | \
OperatorParser(EQ)
operand = \
SymbolsParser(CONSTANT) | \
SymbolsParser(IDENTIFIER)
arthm_expression = \
SymbolsParser(IDENTIFIER) & \
OperatorParser(ASSIGNMENT) & \
(operand << Optional(arthm_operator << operand)) & \
OperatorParser(SEMICOLON)
condition = \
operand << \
comp_operator << \
operand
# if_statement and statement have circular dependency, that is why
# we have to use RecursiveParser
statement = RecursiveParser()
if_statement = \
KeywordParser(IF) & \
OperatorParser(LPAR) & \
condition & \
OperatorParser(RPAR) & \
statement
# notice the usage of the '+=' operator below
statement += \
if_statement | arthm_expression
# use AllTokensConsumed so that the parser parses the
# complete source
program = AllTokensConsumed(ZeroOrMore(statement))
# our source code
source = """
# example program written in ABCMATH
p = 12;
if (p == 12)
if (p == 5)
p = 3 + 2;
"""
# obtain list of tokens present in the source
lexer = Lexer(TOKENS)
tokens = lexer.parseTokens(source)
print(tokens)
# build AST
result = program(tokens, 0)
result.pretty_print()
When the program is run, it will output the following tree:
[[['p'], ['='], ['12']],
 [['if'],
  [['p'], ['=='], ['12']],
  [['if'], [['p'], ['=='], ['5']], [['p'], ['='], [['3'], ['+'], ['2']]]]]]
The tree can be reorganized a bit to make it easier to interpret. Let's modify our code.
First we modify the arthm_expression parser:
def update_arthm_expression(result):
    token = result.get_token()
    if len(token) == 3:
        # p = 1
        # ('p', '=', '1') or ('p', '=', ('3', '+', '2'))
        (lo, op, ro) = token
        if not ro.is_basic_token():
            ro = update_arthm_expression(ro)
        token = (op, lo, ro)
        result.set_token(token)
    return result
arthm_expression = \
    CustomizeResult(SymbolsParser(IDENTIFIER) & \
                    OperatorParser(ASSIGNMENT) & \
                    operand & \
                    Optional(arthm_operator & operand) & \
                    OperatorParser(SEMICOLON), update_arthm_expression)
And then the if_statement parser:
def update_condition(result):
    # p == 1
    # ('p', '==', '1')
    token = result.get_token()
    (lo, op, ro) = token
    result.set_token((op, lo, ro))
    return result
if_statement = \
    KeywordParser(IF) & \
    OperatorParser(LPAR) & \
    CustomizeResult(condition, update_condition) & \
    OperatorParser(RPAR) & \
    statement
The result tree will look a bit different now:
[[['='], ['p'], ['12']],
 [['if'],
  [['=='], ['p'], ['12']],
  [['if'], [['=='], ['p'], ['5']], [['='], ['p'], [['+'], ['3'], ['2']]]]]]
Always use CheckErrors or AllTokensConsumed as the top-level parser in order to get relevant information about parse errors:
Traceback (most recent call last):
  File "simple_calc_language.py", line 103, in <module>
    result = program(tokens, 0)
  File "../pylangparser.py", line 915, in __call__
    "Unknown symbol: %s" % tokens[i].get_token())
pylangparser.ParseException: row: 7, column: 7,
message: Unknown symbol: (
List of supported Tokens:
Keyword
Symbols
Operator
Ignore
If case-insensitive matching is desired when parsing tokens, set the ignorecase constructor argument when creating Token instances:
IF = Keyword(r'if', ignorecase=True)
List of supported Parsers:
KeywordParser
OperatorParser
SymbolsParser
Optional
ZeroOrMore
Repeat
AllTokensConsumed
RecursiveParser
IgnoreResult
CustomizeResult
CheckErrors
Parsers can be combined using the following operators: |, & and <<
p1 & p2 and p1 << p2 mean almost the same thing, but there is a small difference. To illustrate it, let's take variable declaration parsing in C as an example:
int a, b, c, d;
The grammar may look like:
additional_declarator_with_modifier = \
    OperatorParser(COMMA) & declarator_with_modifier
variable_declaration = \
    (type_specifier & declarator_with_modifier << \
     ZeroOrMore(additional_declarator_with_modifier) & \
     OperatorParser(SEMICOLON))
or:
additional_declarator_with_modifier = \
    OperatorParser(COMMA) & declarator_with_modifier
variable_declaration = \
    (type_specifier & declarator_with_modifier & \
     ZeroOrMore(additional_declarator_with_modifier) & \
     OperatorParser(SEMICOLON))
And the AST in both cases:
['int'], [['a'], ['b'], ['c'], ['d']]
and
['int'], [['a'], [['b'], ['c'], ['d']]]
The result of applying a parser combination to some input is a ParserResult. A ParserResult may contain a simple token, another ParserResult or a tuple of ParserResults. A ParserResult can be traversed using the get_sub_group(index) function, indexes or iterators. Indexes start from 1; 0 means the whole tree.
result = parser(tokens, 0)
sub_group = result.get_sub_group(1)
sub_group.pretty_print()
or:
sub_group = result[1]
sub_group.pretty_print()
or:
for sub_group in result:
    sub_group.pretty_print()
To check whether a given group/sub-group is the result of applying a particular parser, use the check_parser(parser) and check_parser_instance(parser_class) functions:
result = program(tokens, 0)
sub_group = result.get_sub_group(1)
if sub_group.check_parser(if_statement):
    print("this is an if-statement")
For more detailed info check the source code and the c_parser.py example.
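Once the tree is in the operator-first form shown earlier, a short recursive walker is enough to interpret it. The sketch below is only an illustration under stated assumptions: it works on plain nested lists copied from the printed output rather than on ParserResult objects, and is_leaf, evaluate, execute and run are hypothetical helper names, not part of pylangparser.

```python
# Operator-first tree copied from the printed output, as plain nested lists.
TREE = [
    [['='], ['p'], ['12']],
    [['if'],
     [['=='], ['p'], ['12']],
     [['if'], [['=='], ['p'], ['5']], [['='], ['p'], [['+'], ['3'], ['2']]]]],
]


def is_leaf(node):
    """A leaf is a one-element list holding the token text, e.g. ['p']."""
    return len(node) == 1 and isinstance(node[0], str)


def evaluate(node, env):
    """Evaluate an expression node to an int, reading variables from env."""
    if is_leaf(node):
        text = node[0]
        return int(text) if text.isdigit() else env[text]
    # Inner nodes have the shape [[operator], left, right].
    op, left, right = node[0][0], node[1], node[2]
    lhs, rhs = evaluate(left, env), evaluate(right, env)
    if op == '+':
        return lhs + rhs
    if op == '-':
        return lhs - rhs
    if op == '==':
        return int(lhs == rhs)
    if op == '<':
        return int(lhs < rhs)
    if op == '>':
        return int(lhs > rhs)
    raise ValueError("unsupported operator: %s" % op)


def execute(statement, env):
    """Run one statement: an assignment or an if with a nested statement."""
    head = statement[0][0]
    if head == '=':
        env[statement[1][0]] = evaluate(statement[2], env)
    elif head == 'if':
        if evaluate(statement[1], env):
            execute(statement[2], env)
    else:
        raise ValueError("unsupported statement: %s" % head)


def run(program):
    """Execute all statements in order and return the variable bindings."""
    env = {}
    for statement in program:
        execute(statement, env)
    return env


print(run(TREE))  # {'p': 12}
```

Because every inner node starts with its operator or keyword, the walker can dispatch on the first element without inspecting the rest of the node, which is the practical benefit of reorganizing the tree.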
Each group/sub-group can be pretty-printed with the pretty_print() function:
result.pretty_print()
sub_group.pretty_print()
You can download and try the examples.
(c)2014 Ognyan Tonchev (otonchev@gmail.com)