Chapter 6
Lexer

6.1 Regular expression syntax

The lexer is based on the re module, so TPG benefits from the full power of Python regular expressions. This document assumes the reader is familiar with regular expressions.

You can use the regular expression syntax expected by the re module, except for the grouping syntax, which is used internally by TPG to decide which token is recognized.
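
If a token's regular expression needs grouping, Python's non-capturing form (?:...) can be used instead, so it does not clash with the groups TPG adds internally. A minimal, hypothetical definition (the token name and action are made up for the example):

    token number: '\d+(?:\.\d+)?' float;    # (?:...) groups are safe, (...) groups are not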

6.2 Token definition

6.2.1 Predefined tokens

Tokens can be explicitly defined by the token and separator keywords.

A token is defined by:

a name
which identifies the token. This name is used by the parser.
a regular expression
which describes what to match to recognize the token.
an action
which can translate the matched text into a Python object. It can be a function of one argument or a non-callable object. If it is not callable, it is returned as is for every occurrence of the token; otherwise it is applied to the token text and the result is returned. This action is optional; by default the token text is returned.

Token definitions end with a ;.

See figure 6.1 for examples.




Figure 6.1: Token definition examples
    #     name     reg. exp        action  
    token integer: '\d+'           int;  
    token ident  : '[a-zA-Z]\w*'   ;  
 
    separator spaces  : '\s+';     # white spaces  
    separator comments: '#.*';     # comments
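
As a minimal sketch, the definitions of figure 6.1 can be embedded in a parser class. This assumes TPG's class-based API, where the grammar is the docstring of a tpg.Parser subclass and calling the instance parses the START symbol; the START rule is added here only for illustration:

    import tpg

    class Example(tpg.Parser):
        r"""
        token integer: '\d+'           int;
        token ident  : '[a-zA-Z]\w*'   ;

        separator spaces  : '\s+';
        separator comments: '#.*';

        START/i -> integer/i ;
        """

    parser = Example()
    print(parser("42   # a comment"))    # -> 42 (an int, thanks to the action)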


The order in which tokens are declared is important: the first token that matches is returned. Regular expressions that describe keywords receive a special treatment: TPG also looks for a word boundary after the keyword. If you try to match the keywords if and ifxyz, TPG will internally search for if\b and ifxyz\b. This way, if won't match the beginning of ifxyz and won't interfere with general identifiers (\w+ for example).
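
A short sketch of this behaviour (assuming the same class-based API; the token and rule names are made up):

    import tpg

    class Keywords(tpg.Parser):
        r"""
        separator spaces: '\s+';
        token kw_if: 'if';                 # internally matched as if\b
        token ident: '[a-zA-Z]\w*';

        START -> ( kw_if | ident )* ;
        """

    Keywords()("if ifxyz")    # 'if' lexes as kw_if, 'ifxyz' as ident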

There are two kinds of tokens. Tokens defined by the token keyword are seen by the parser, while tokens defined by the separator keyword are treated as separators (white spaces or comments for example) and are discarded by the lexer.

6.2.2 Inline tokens

Tokens can also be defined on the fly: their definitions are then inlined in the grammar rules. This feature is useful for keywords and punctuation signs. Unlike predefined tokens, inline tokens cannot be transformed by an action; they always return the matched text as a string.

See figure 6.2 for examples.




Figure 6.2: Inline token definition examples
    IfThenElse ->  
        'if' Cond  
        'then' Statement  
        'else' Statement  
        ;


Inline tokens have a higher precedence than predefined tokens to avoid conflicts (an inlined if won’t be matched as a predefined identifier).
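
A runnable sketch of figure 6.2, with Cond and Statement simplified to plain identifiers (these simplifications are mine, not part of the manual):

    import tpg

    class If(tpg.Parser):
        r"""
        separator spaces: '\s+';
        token ident: '[a-zA-Z]\w*';

        START -> IfThenElse ;
        IfThenElse ->
            'if' ident
            'then' ident
            'else' ident
            ;
        """

    If()("if x then y else z")    # 'if', 'then', 'else' lex as inline tokens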

6.3 Token matching

TPG works in two stages. The lexer first splits the input string into a list of tokens and then the parser parses this list.

6.3.1 Splitting the input string

The lexer splits the input string according to the token definitions (see 6.2). When the input string cannot be matched, a tpg.LexerError exception is raised.
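
For example (a sketch; the grammar is illustrative):

    import tpg

    class Digits(tpg.Parser):
        r"""
        separator spaces: '\s+';
        token integer: '\d+' int;

        START -> integer* ;
        """

    try:
        Digits()("12 34 oops")    # 'oops' matches no token definition
    except tpg.LexerError as e:
        print("lexical error:", e)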

The lexer may loop indefinitely if a token can match the empty string, since an empty string can be matched at any position. For example, a token defined as '\d*' matches the empty string; use '\d+' instead.

6.3.2 Matching tokens in grammar rules

Tokens are matched as symbols are recognized. Predefined tokens have the same syntax as non-terminal symbols. The token text (or the result of the function associated with the token) can be saved with the infix / operator (see figure 6.3).




Figure 6.3: Token usage examples
    S -> ident/i;
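
A runnable version of figure 6.3 (a sketch; making START return the saved text is my addition):

    import tpg

    class S(tpg.Parser):
        r"""
        separator spaces: '\s+';
        token ident: '[a-zA-Z]\w*';

        START/i -> ident/i ;
        """

    print(S()("hello"))    # -> 'hello'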


Inline tokens have a similar syntax: you just write the regular expression in a string. Their text can also be saved (see figure 6.4).




Figure 6.4: Inline token usage examples
    S -> '(' '\w+'/i ')';
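
And a runnable version of figure 6.4 (a sketch; the parentheses are escaped here because inline tokens are regular expressions):

    import tpg

    class P(tpg.Parser):
        r"""
        separator spaces: '\s+';

        START/i -> '\(' '\w+'/i '\)' ;
        """

    print(P()("( word )"))    # -> 'word'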