Parser
Parser is responsible for interpreting a SQL string into an abstract syntax tree (AST), which is more structural and easier to process. AST can be used for preprocessing, syntactic analysis, and so on.
The code lives in the pingcap/tidb repo, parser directory.
Understand Parser
Parser is generated by a parser generator named yacc. It takes the grammar file parser.y
as the input and outputs the source code file parser.go
, which is the real parser imported by TiDB. Thus, the core file is parser.y
because when the SQL syntax changes, most of the changes take place in parser.y
.
In case you are unfamiliar with yacc, some concepts are listed here:
Terminal Symbol is also known as "token". When a SQL string reaches parser, the first step is to tokenize them into an array of tokens. For example,
"SELECT * FROM t"
is tokenized to[selectKeyword, '*', fromKeyword, identifier(t)]
by lexer.Lex().Non-terminal Symbol is a syntactic variable, which can represent a group of terminal/non-terminal symbols.
Grammar Rule specifies which symbols can replace which other non-terminal symbol.
Semantic Action defines how an AST node is constructed.
An example of a grammar rule is as follows:
AlterDatabaseStmt
is a non-terminal symbol because there is no such token."ALTER"
is a terminal symbol.DatabaseSym
,DBName
andDatabaseOptionList
are non-terminal symbols that are defined in other grammar rules.The pseudo-code in brackets is the semantic action. It means an AST node
ast.AlterDatabaseStmt
will be constructed when the rule is reduced by the parser. Note that a dollar character$
followed by a number represents the binding Golang value previously (in other rules), where the number is the index of symbol in rule (1-based).$$
represents current binding value. After goyacc substitution, this code snippet will be valid Golang code.
Getting back to parser.y
, the structure of this file is divided into three parts:
%union
enumerates all the Golang types that can be passed around grammar rules.%token
or%type
declares the terminal or non-terminal symbols that will be used in grammar rules.Grammar rules define the syntax of SQL and related semantic actions.
Except for parser.y
, other sub-package/files should be easy to understand, feel free to explore them by yourself:
ast
The AST definition used by TiDB.
auth
Authentication-related functions.
charset
Currently supported charsets and encodings.
format
The Formatters for yacc file and the functions for restoring AST to SQL.
goyacc
The generator for parser.go.
model
Basic structures in TiDB like TableInfo, ColumnInfo...
mysql
MySQL constants, errors, privileges, types, and others.
opcode
Operator code like <, >, +, =...
terror
The errors used by TiDB.
test_driver
A parser driver only for unit tests.
tidb
TiDB related features' keywords.
types
The field types and evaluation types that used in TiDB.
Develop and Build
To get started with the parser development, please also take a look at quickstart.md. It shows the basic usage of the parser and it explains some concepts like parser_driver.
Run make parser
in the project root directory to generate a new parser.go
.
FAQ
How to debug the parsing procedure?
Put the test file in the parser
package. Set yyDebug
level to 4
(or any integers >= 4) before calling Parse()
. The parser will try to show state information in each step.
How to resolve shift-reduce or reduce-reduce conflicts?
Shift means "move the next token in" to match the current rule. Reduce means "replace current tokens/symbols to a non-terminal symbol". Shift-reduce conflicts occur when the parser cannot decide the next step is to shift or to reduce.
When yacc reports such conflicts, it also keeps the file y.output
. You can search "conflict on" in the file to locate which rule conflicts with other rules. Then you can try to annotate the %precedence
to tokens, rewrite the grammar rule, or ask for help on GitHub.
Last updated