Chapter 6
Lexer

6.1 Regular expression syntax

The lexer is based on the re module, so TPG benefits from the full power of Python regular expressions. This document assumes the reader is familiar with regular expressions.

You can use the regular expression syntax expected by the re module, except for the grouping syntax, which is used internally by TPG to decide which token is recognized.
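
If a token's regular expression needs grouping, Python's non-capturing form (?:...) can be used instead, so it does not clash with the groups TPG adds internally. A minimal, hypothetical definition (the token name and action are made up for the example):

    token number: '\d+(?:\.\d+)?' float;    # (?:...) groups are safe, (...) groups are not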

6.2 Token definition

6.2.1 Predefined tokens

Tokens can be explicitly defined by the token and separator keywords.

A token is defined by:

a name
which identifies the token. This name is used by the parser.
a regular expression
which describes what to match to recognize the token.
an action
which can translate the matched text into a Python object. It can be a function of one argument or a non-callable object. If it is not callable, it is returned as is for every occurrence of the token; otherwise it is applied to the token text and the result is returned. This action is optional; by default the token text is returned.

Token definitions end with a ;.

See figure 6.1 for examples.




Figure 6.1: Token definition examples
    #     name     reg. exp        action  
    token integer: '\d+'           int;  
    token ident  : '[a-zA-Z]\w*'   ;  
 
    separator spaces  : '\s+';     # white spaces  
    separator comments: '#.*';     # comments
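
As a minimal sketch, the definitions of figure 6.1 can be embedded in a parser class. This assumes TPG's class-based API, where the grammar is the docstring of a tpg.Parser subclass and calling the instance parses the START symbol; the START rule is added here only for illustration:

    import tpg

    class Example(tpg.Parser):
        r"""
        token integer: '\d+'           int;
        token ident  : '[a-zA-Z]\w*'   ;

        separator spaces  : '\s+';
        separator comments: '#.*';

        START/i -> integer/i ;
        """

    parser = Example()
    print(parser("42   # a comment"))    # -> 42 (an int, thanks to the action)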


The order in which tokens are declared is important: the first token that matches is returned. Regular expressions that describe keywords receive a special treatment: TPG also looks for a word boundary after the keyword. If you try to match the keywords if and ifxyz, TPG will internally search for if\b and ifxyz\b. This way, if won't match the beginning of ifxyz and won't interfere with general identifiers (\w+ for example).
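
A short sketch of this behaviour (assuming the same class-based API; the token and rule names are made up):

    import tpg

    class Keywords(tpg.Parser):
        r"""
        separator spaces: '\s+';
        token kw_if: 'if';                 # internally matched as if\b
        token ident: '[a-zA-Z]\w*';

        START -> ( kw_if | ident )* ;
        """

    Keywords()("if ifxyz")    # 'if' lexes as kw_if, 'ifxyz' as ident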

There are two kinds of tokens. Tokens defined by the token keyword are seen by the parser, while tokens defined by the separator keyword are treated as separators (white spaces or comments for example) and are discarded by the lexer.

6.2.2 Inline tokens

Tokens can also be defined on the fly: their definitions are then inlined in the grammar rules. This feature is useful for keywords and punctuation signs. Unlike predefined tokens, inline tokens cannot be transformed by an action; they always return the matched text as a string.

See figure 6.2 for examples.




Figure 6.2: Inline token definition examples
    IfThenElse ->  
        'if' Cond  
        'then' Statement  
        'else' Statement  
        ;


Inline tokens have a higher precedence than predefined tokens to avoid conflicts (an inlined if won’t be matched as a predefined identifier).
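
A runnable sketch of figure 6.2, with Cond and Statement simplified to plain identifiers (these simplifications are mine, not part of the manual):

    import tpg

    class If(tpg.Parser):
        r"""
        separator spaces: '\s+';
        token ident: '[a-zA-Z]\w*';

        START -> IfThenElse ;
        IfThenElse ->
            'if' ident
            'then' ident
            'else' ident
            ;
        """

    If()("if x then y else z")    # 'if', 'then', 'else' lex as inline tokens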

6.3 Token matching

TPG works in two stages. The lexer first splits the input string into a list of tokens and then the parser parses this list.

6.3.1 Splitting the input string

The lexer splits the input string according to the token definitions (see 6.2). When the input string cannot be matched, a tpg.LexerError exception is raised.
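
For example (a sketch; the grammar is illustrative):

    import tpg

    class Digits(tpg.Parser):
        r"""
        separator spaces: '\s+';
        token integer: '\d+' int;

        START -> integer* ;
        """

    try:
        Digits()("12 34 oops")    # 'oops' matches no token definition
    except tpg.LexerError as e:
        print("lexical error:", e)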

The lexer may loop indefinitely if a token can match the empty string, since an empty string can be matched at any position. For example, a token defined as '\d*' matches the empty string; use '\d+' instead.

6.3.2 Matching tokens in grammar rules

Tokens are matched as symbols are recognized. Predefined tokens have the same syntax as non-terminal symbols. The token text (or the result of the function associated with the token) can be saved with the infix / operator (see figure 6.3).




Figure 6.3: Token usage examples
    S -> ident/i;
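
A runnable version of figure 6.3 (a sketch; making START return the saved text is my addition):

    import tpg

    class S(tpg.Parser):
        r"""
        separator spaces: '\s+';
        token ident: '[a-zA-Z]\w*';

        START/i -> ident/i ;
        """

    print(S()("hello"))    # -> 'hello'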


Inline tokens have a similar syntax: you just write the regular expression in a string. Their text can also be saved (see figure 6.4).




Figure 6.4: Inline token usage examples
    S -> '(' '\w+'/i ')';
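
And a runnable version of figure 6.4 (a sketch; the parentheses are escaped here because inline tokens are regular expressions):

    import tpg

    class P(tpg.Parser):
        r"""
        separator spaces: '\s+';

        START/i -> '\(' '\w+'/i '\)' ;
        """

    print(P()("( word )"))    # -> 'word'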