What is Syntax analysis?

Syntax analysis is the second phase of the compiler design process, coming after lexical analysis. It analyses the syntactical structure of the given input and checks whether that input follows the syntax of the programming language in which it was written. Syntax analysis is also known as parsing, and its output is known as the parse tree or syntax tree.

The parse tree is built with the help of the pre-defined grammar of the language. The syntax analyzer checks whether a given program satisfies the rules implied by a context-free grammar. If it does, the parser creates the parse tree of that source program. Otherwise, it displays error messages.


Why do you need Syntax Analyzer?

  • Checks whether the code is grammatically valid
  • Helps you to apply the rules of the language to the code
  • Makes sure that each opening brace has a corresponding closing brace
  • Checks that each declaration has a type and that the type exists
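
The brace check mentioned above can be sketched with a stack. This is a minimal illustration, not how a real parser does it: the function name braces_balanced is hypothetical, and the sketch assumes the source contains no string literals or comments that hold brackets.

```python
def braces_balanced(source: str) -> bool:
    """Return True if every opening bracket has a matching closing bracket."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []  # opening brackets seen so far and not yet closed
    for ch in source:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            # A closing bracket must match the most recent opening bracket.
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack  # leftover opening brackets mean the input is unbalanced


print(braces_balanced("int main() { return (1 + 2); }"))
print(braces_balanced("{ ( }"))
```

A real syntax analyzer gets this check for free from the grammar, because a rule such as factor -> ( expression ) can only match when both parentheses are present.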

Important Syntax Analyzer Terminology

Important terminology used in the syntax analysis process:

  • Sentence: A sentence is a group of characters over some alphabet.
  • Lexeme: A lexeme is the lowest-level syntactic unit of a language (e.g., total, start).
  • Token: A token is a category of lexemes.
  • Keywords and reserved words: A keyword is an identifier that is used as a fixed part of the syntax of a statement. A reserved word is one that you can't use as a variable name or identifier.
  • Noise words: Noise words are optional words inserted in a statement to enhance the readability of the sentence.
  • Comments: Comments are a very important part of the documentation. They are mostly delimited by /* */ or //.
  • Blanks (spaces)
  • Delimiters: A delimiter is a syntactic element that marks the start or end of some syntactic unit, such as a statement or expression, for example "begin"..."end" or {}.
  • Character set: ASCII, Unicode
  • Identifiers: Names chosen by the programmer. Restrictions on their length can reduce the readability of the sentence.
  • Operator symbols: + and - perform two basic arithmetic operations.
  • Syntactic elements of the Language
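
To make the difference between a lexeme and a token concrete, here is a small tokenizer sketch. Everything in it is an assumption made for illustration: the token categories, the keyword set, and the function name tokenize do not come from any particular language.

```python
import re

# Hypothetical token categories; COMMENT is listed before OP so that "//" and
# "/*" are not mistaken for the division operator.
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/|//[^\n]*"),
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=]"),
    ("DELIM", r"[;{}()]"),
    ("SKIP", r"\s+"),
]
KEYWORDS = {"if", "else", "while", "int"}  # assumed reserved words
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(text):
    """Split text into (token, lexeme) pairs, dropping blanks and comments."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("SKIP", "COMMENT"):
            continue
        if kind == "IDENT" and lexeme in KEYWORDS:
            kind = "KEYWORD"  # reserved words cannot be used as identifiers
        tokens.append((kind, lexeme))
    return tokens


for token, lexeme in tokenize("int total = start + 1; // sum"):
    print(token, lexeme)
```

Here total and start are two different lexemes that belong to the same token category, IDENT.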

Why do we need Parsing?

A parser checks that the input string is well-formed and, if not, rejects it.

The following are important tasks performed by the parser:

  • Detect all types of syntax errors
  • Find the position at which an error has occurred
  • Give a clear and accurate description of the error
  • Recover from an error so that it can continue and find further errors in the code
  • Not affect the compilation of "correct" programs
  • Reject invalid texts by reporting syntax errors

Parsing Techniques

Parsing techniques are divided into two different groups:

  • Top-Down Parsing
  • Bottom-Up Parsing

Top-Down Parsing:

In top-down parsing, construction of the parse tree starts at the root and then proceeds towards the leaves.

Two types of Top-down parsing are:

  1. Predictive Parsing:

A predictive parser can predict which production should be used to expand a non-terminal for the current input. It uses a look-ahead pointer, which points to the next input symbols. Backtracking is not needed with this parsing technique. A predictive parser that uses one look-ahead symbol is known as an LL(1) parser.

  2. Recursive Descent Parsing:

This parsing technique recursively parses the input to make a parse tree. It consists of several small functions, one for each non-terminal in the grammar.
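
As a rough sketch of both ideas, the parser below has one function per non-terminal and chooses what to do by looking at a single look-ahead token, so it never backtracks. The grammar is an assumption chosen for illustration (expression, term, and factor over +, -, *, /, parentheses, and id), with repetition written as loops so that the functions do not call themselves endlessly on left-recursive rules. The class and function names are hypothetical, and the result is a simplified syntax tree built from nested tuples.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks the end of the input
        self.pos = 0

    def peek(self):
        """Return the look-ahead token without consuming it."""
        return self.tokens[self.pos]

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(
                f"expected {token!r} but found {self.peek()!r} at token {self.pos}"
            )
        self.pos += 1

    def expression(self):  # expression -> term { (+ | -) term }
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.term())
        return node

    def term(self):  # term -> factor { (* | /) factor }
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.factor())
        return node

    def factor(self):  # factor -> ( expression ) | id
        if self.peek() == "(":
            self.expect("(")
            node = self.expression()
            self.expect(")")
            return node
        self.expect("id")
        return "id"


def parse(tokens):
    parser = Parser(tokens)
    tree = parser.expression()
    parser.expect("$")  # reject trailing input
    return tree


print(parse(["id", "+", "id", "*", "id"]))
```

Because term is called from inside expression, the multiplication ends up deeper in the tree than the addition, which is how the grammar encodes operator precedence.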

Bottom-Up Parsing:

In the bottom-up parsing technique, construction of the parse tree starts with the leaves and then proceeds towards the root. It is also called shift-reduce parsing. This type of parser is usually built with the help of software tools.
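
The shift and reduce steps can be illustrated with a deliberately naive sketch. It assumes a hypothetical toy grammar (E -> E + T | T, T -> id) and simply reduces whenever the top of the stack matches a right-hand side, trying the longest rule first. That greedy choice happens to work for this grammar only; generated bottom-up parsers consult a parsing table to decide between shifting and reducing.

```python
# Hypothetical toy grammar: E -> E + T | T ; T -> id
RULES = [
    ("E", ["E", "+", "T"]),  # the longest right-hand side is tried first
    ("E", ["T"]),
    ("T", ["id"]),
]


def shift_reduce(tokens):
    """Return (accepted, trace) for a list of tokens."""
    stack, trace = [], []
    remaining = list(tokens)
    while True:
        for lhs, rhs in RULES:
            if stack[-len(rhs):] == rhs:
                # Reduce: replace the matched right-hand side with its left-hand side.
                del stack[-len(rhs):]
                stack.append(lhs)
                trace.append(f"reduce {lhs} -> {' '.join(rhs)}")
                break
        else:
            if not remaining:
                break  # nothing left to reduce or shift
            # Shift: move the next input token onto the stack.
            stack.append(remaining.pop(0))
            trace.append(f"shift {stack[-1]}")
    return stack == ["E"], trace


accepted, steps = shift_reduce(["id", "+", "id"])
print(accepted)
print("\n".join(steps))
```

The trace starts at the leaves (each id) and ends with the reduction to the start symbol E, which is the root of the parse tree.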

Error-Recovery Methods

Common Errors that occur in Parsing

  • Lexical: an incorrectly typed identifier name
  • Syntactical: unbalanced parentheses or a missing semicolon
  • Semantical: incompatible value assignment
  • Logical: an infinite loop or unreachable code

A program can contain the types of errors listed above at various stages of the compilation process. A parser should be able to detect and report any error found in the program. Whenever an error occurs, the parser should be able to handle it and carry on parsing the remaining input. There are five common error-recovery methods which can be implemented in the parser:

Statement mode recovery

  • When the parser encounters an error, it takes corrective steps so that the rest of the input can be parsed.
  • For example, inserting a missing semicolon is a statement-mode recovery. However, the parser designer needs to be careful when making such changes, as one wrong correction may lead to an infinite loop.

Panic-Mode recovery

  • When the parser encounters an error, this mode ignores the rest of the statement: it does not process the input between the erroneous token and a delimiter, such as a semicolon. This is a simple error-recovery method.
  • In this type of recovery, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. The synchronizing tokens are usually delimiters, such as a semicolon.
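
A minimal sketch of panic-mode recovery follows. It assumes a hypothetical language in which every statement has the shape id = id ; and it uses the semicolon as the only synchronizing token. The names SYNC_TOKENS and parse_statements are made up for this example.

```python
SYNC_TOKENS = {";"}  # assumed synchronizing tokens


def parse_statements(tokens):
    """Parse statements of the assumed form `id = id ;`.

    Returns the number of well-formed statements and a list of error messages.
    """
    expected = ["id", "=", "id", ";"]
    errors = []
    parsed = 0
    pos = 0
    while pos < len(tokens):
        for want in expected:
            if pos < len(tokens) and tokens[pos] == want:
                pos += 1
                continue
            found = tokens[pos] if pos < len(tokens) else "end of input"
            errors.append(f"token {pos}: expected {want!r}, found {found!r}")
            # Panic mode: discard tokens until a synchronizing token is found.
            while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
                pos += 1
            pos += 1  # consume the synchronizing token itself
            break
        else:
            parsed += 1  # all four tokens of the statement matched
    return parsed, errors


source = ["id", "=", "id", ";", "id", "id", ";", "id", "=", "id", ";"]
print(parse_statements(source))
```

The second statement is missing its = sign, so the parser reports one error, skips to the next semicolon, and still parses the third statement.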

Phrase-Level Recovery:

  • The compiler corrects the program by inserting or deleting tokens, which allows it to continue parsing from where it was. The correction is performed on the remaining input: the parser can replace a prefix of the remaining input with some string that helps it to continue.

Error Productions

  • Error-production recovery extends the grammar of the language with productions that generate the erroneous constructs. When such a production is used, the parser can issue an error diagnostic specific to that construct.

Global Correction:

  • The compiler should make as few changes as possible while processing an incorrect input string. Given an incorrect input string a and a grammar c, the algorithm searches for a parse tree for a related string b, such that the number of insertions, deletions, and modifications of tokens needed to transform a into b is as small as possible.
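
The cost that global correction minimizes can be sketched as an edit distance over tokens. The code below is only an approximation of the idea: a real global-correction algorithm searches all strings generated by the grammar c, whereas this sketch picks the closest string b from a short, hypothetical list of valid candidates. Both function names are assumptions.

```python
def token_edit_distance(a, b):
    """Minimum number of token insertions, deletions, and replacements that turn a into b."""
    previous = list(range(len(b) + 1))  # distances from an empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        current = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete tok_a
                    current[j - 1] + 1,      # insert tok_b
                    previous[j - 1] + cost,  # keep or replace
                )
            )
        previous = current
    return previous[-1]


def closest_valid(a, candidates):
    """Pick the candidate string b that needs the fewest changes."""
    return min(candidates, key=lambda b: token_edit_distance(a, b))


a = ["id", "=", "id"]  # incorrect input: the semicolon is missing
candidates = [["id", "=", "id", ";"], ["id", ";"]]
print(closest_valid(a, candidates))
```

Global correction is mostly of theoretical interest because this search is expensive in time and space, but the minimum-change cost gives a yardstick for judging other recovery methods.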

Grammar:

A grammar is a set of structural rules which describe a language. Grammars assign structure to any sentence. The term also refers to the study of these rules, which includes morphology, phonology, and syntax. A grammar is capable of describing much of the syntax of programming languages.

Rules of Formal Grammar

  • Every non-terminal symbol should appear on the left-hand side of at least one production
  • The goal symbol should never appear to the right of the ::= of any production
  • A rule is recursive if its LHS appears in its RHS
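
These three rules are easy to check mechanically once a grammar is stored as data. The sketch below assumes a hypothetical representation in which each production is a (LHS, RHS) pair; the toy grammar and the name check_grammar are illustrations only.

```python
# Hypothetical toy grammar, one (LHS, RHS) pair per production.
GRAMMAR = [
    ("goal", ["expression"]),
    ("expression", ["expression", "+", "term"]),
    ("expression", ["term"]),
    ("term", ["id"]),
]


def check_grammar(productions, nonterminals, goal):
    """Report how a grammar fares against the three rules above."""
    lhs_symbols = {lhs for lhs, _ in productions}
    return {
        # Non-terminals that never appear on the left of any production.
        "undefined": nonterminals - lhs_symbols,
        # True if the goal symbol shows up on some right-hand side.
        "goal_on_rhs": any(goal in rhs for _, rhs in productions),
        # Rules whose LHS appears in their own RHS.
        "recursive": [(lhs, rhs) for lhs, rhs in productions if lhs in rhs],
    }


report = check_grammar(GRAMMAR, {"goal", "expression", "term"}, "goal")
print(report)
```

For this grammar the report shows no undefined non-terminals, a goal symbol that stays off every right-hand side, and one recursive rule.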

Notational Conventions

An optional element may be indicated by enclosing the element in square brackets, [ ... ]. An arbitrary sequence of instances of an element can be indicated by enclosing the element in braces followed by an asterisk symbol, { ... }*.

A choice between alternatives may use the symbol | within a single rule. The alternatives may be enclosed in parentheses, ( ... ), when needed.

Grammar symbols are of two types: terminals and non-terminals.

1. Terminals:

  • Lower-case letters in the alphabet such as a, b, c
  • Operator symbols such as +, -, *, etc.
  • Punctuation symbols such as parentheses, hash, comma
  • The digits 0, 1, ..., 9
  • Boldface strings like id or if, anything which represents a single terminal symbol

2. Non-terminals:

  • Upper-case letters such as A, B, C
  • Lower-case italic names such as expression

Context Free Grammar

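A context-free grammar (CFG) is a grammar in which the left-hand side of every production is a single non-terminal. A CFG is left-recursive if it has at least one production in which the non-terminal on the left-hand side is also the first symbol on the right-hand side. The rules in a context-free grammar are mainly recursive. A syntax analyzer checks whether a specific program satisfies all the rules of the context-free grammar. If it does, the syntax analyzer may create a parse tree for that program.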

expression -> expression + term
expression -> expression - term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor -> ( expression )
factor -> id

Grammar Derivation

A grammar derivation is a sequence of grammar rule applications which transforms the start symbol into the string. A derivation proves that the string belongs to the grammar's language.

Left-most Derivation

When the left-most non-terminal of the sentential form is the one replaced at every step, so that the form is rewritten from left to right, the derivation is known as a left-most derivation. A sentential form which is derived by a left-most derivation is called a left-sentential form.

Right-most Derivation

When the right-most non-terminal of the sentential form is the one replaced at every step, so that the form is rewritten from right to left, the derivation is known as a right-most derivation. A sentential form which is derived by a right-most derivation is known as a right-sentential form.
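
The two kinds of derivation can be compared on a small example. The sketch below assumes a reduced, hypothetical fragment of an expression grammar (expression -> expression + term | term, term -> id) and derives the string id + id both ways. The function name derive and the way productions are passed in are assumptions made for illustration.

```python
NONTERMINALS = {"expression", "term"}  # assumed non-terminals of the toy grammar


def derive(start, productions, leftmost=True):
    """Apply each production to the left-most (or right-most) non-terminal.

    Returns the list of sentential forms, starting with the start symbol.
    """
    form = [start]
    steps = [" ".join(form)]
    for lhs, rhs in productions:
        indexes = [i for i, symbol in enumerate(form) if symbol in NONTERMINALS]
        if not indexes:
            raise ValueError("no non-terminal left to replace")
        i = indexes[0] if leftmost else indexes[-1]
        if form[i] != lhs:
            raise ValueError(f"cannot apply {lhs} -> {' '.join(rhs)} to {' '.join(form)}")
        form[i:i + 1] = rhs  # replace the chosen non-terminal with the right-hand side
        steps.append(" ".join(form))
    return steps


ADD = ("expression", ["expression", "+", "term"])
TO_TERM = ("expression", ["term"])
TO_ID = ("term", ["id"])

print(" => ".join(derive("expression", [ADD, TO_TERM, TO_ID, TO_ID], leftmost=True)))
print(" => ".join(derive("expression", [ADD, TO_ID, TO_TERM, TO_ID], leftmost=False)))
```

Both derivations use the same four productions and end in the same string; only the order in which the non-terminals are replaced differs.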

Syntax vs. Lexical Analyser

| Syntax Analyser | Lexical Analyser |
| --- | --- |
| The syntax analyzer mainly deals with recursive constructs of the language. | The lexical analyzer eases the task of the syntax analyzer. |
| The syntax analyzer works on tokens in a source program to recognize meaningful structures in the programming language. | The lexical analyzer recognizes the tokens in a source program. |
| It receives inputs, in the form of tokens, from lexical analyzers. | It is responsible for the validity of the tokens supplied to the syntax analyzer. |

Disadvantages of using Syntax Analysers

  • It never determines whether a token is valid
  • It does not help you determine whether an operation performed on a token type is valid
  • It can't decide whether a token is declared and initialized before it is used

Summary

  • Syntax analysis is the second phase of the compiler design process, coming after lexical analysis
  • The syntactical analyser helps you to apply the rules of the language to the code
  • Sentence, lexeme, token, keywords and reserved words, noise words, comments, delimiters, character set, and identifiers are some important terms used in syntax analysis
  • A parser checks that the input string is well-formed and, if not, rejects it
  • Parsing techniques are divided into two different groups: top-down parsing and bottom-up parsing
  • Lexical, syntactical, semantical, and logical errors are some common errors that a program can contain
  • A grammar is a set of structural rules which describe a language
  • An optional element may be indicated by enclosing the element in square brackets
  • A CFG is left-recursive if it has at least one production in which the non-terminal on the left-hand side is also the first symbol on the right-hand side
  • A grammar derivation is a sequence of grammar rule applications which transforms the start symbol into the string
  • The syntax analyzer mainly deals with recursive constructs of the language, while the lexical analyzer eases the task of the syntax analyzer
  • The drawback of the syntax analyzer is that it never determines whether a token is valid

 
