Skip to content

Latest commit

 

History

History
236 lines (200 loc) · 11.7 KB

File metadata and controls

236 lines (200 loc) · 11.7 KB

2. Language and Syntax

Every language displays a structure called its grammar or syntax. For example, a correct sentence always consists of a subject followed by a predicate, correct here meaning well formed. This fact can be described by the following formula:

sentence = subject predicate.

If we add to this formula the two further formulas

subject = "John" | "Mary".
predicate = "eats" | "talks".

then we define herewith exactly four possible sentences, namely

John eats Mary eats
John talks Mary talks

where the symbol | is to be pronounced as or. We call these formulas syntax rules, productions, or simply syntactic equations. Subject and predicate are syntactic classes. A shorter notation for the above omits meaningful identifiers:

S = AB. L = {ac, ad, bc, bd}
A = "a" | "b".
B = "c" | "d".

We will use this shorthand notation in the subsequent, short examples. The set L of sentences which can be generated in this way, that is, by repeated substitution of the left-hand sides by the right-hand sides of the equations, is called the language.

The example above evidently defines a language consisting of only four sentences. Typically, however, a language contains infinitely many sentences. The following example shows that an infinite set may very well be defined with a finite number of equations. The symbol ∅ stands for the empty sequence.

S = A. L = {∅, a, aa, aaa, aaaa, ... }
A = "a" A | ∅.

The means to do so is recursion which allows a substitution (here of A by "a"A) be repeated arbitrarily often.

Our third example is again based on the use of recursion. But it generates not only sentences consisting of an arbitrary sequence of the same symbol, but also nested sentences:

S = A. L = {b, abc, aabcc, aaabccc, ... }
A = "a" A "c" | "b".

It is clear that arbitrarily deep nestings (here of As) can be expressed, a property particularly important in the definition of structured languages. Our fourth and last example exhibits the structure of expressions. The symbols E, T, F, and V stand for expression, term, factor, and variable.

E = T | E "+" T.
T = F | T "*" F.
F = V | "(" E ")".
V = "a" | "b" | "c" | "d".

From this example it is evident that a syntax does not only define the set of sentences of a language, but also provides them with a structure. The syntax decomposes sentences in their constituents as shown in the example of Figure 2.1. The graphical representations are called structural trees or syntax trees.

Figure 2.1. Structure of expressions

Let us now formulate the concepts presented above more rigorously: A language is defined by the following:

  1. The set of terminal symbols. These are the symbols that occur in its sentences. They are said to be terminal, because they cannot be substituted by any other symbols. The substitution process stops with terminal symbols. In our first example this set consists of the elements a, b, c and d. The set is also called vocabulary.
  2. The set of nonterminal symbols. They denote syntactic classes and can be substituted. In our first example this set consists of the elements S, A and B.
  3. The set of syntactic equations (also called productions). These define the possible substitutions of nonterminal symbols. An equation is specified for each nonterminal symbol.
  4. The start symbol. It is a nonterminal symbol, in the examples above denoted by S.

A language is, therefore, the set of sequences of terminal symbols which, starting with the start symbol, can be generated by repeated application of syntactic equations, that is, substitutions.

We also wish to define rigorously and precisely the notation in which syntactic equations are specified. Let nonterminal symbols be identifiers as we know them from programming languages, that is, as sequences of letters (and possibly digits), for example, expression, term. Let terminal symbols be character sequences enclosed in quotes (strings), for example, "=", "|". For the definition of the structure of these equations it is convenient to use the tool just being defined itself:

syntax = production syntax | ∅.
production = identifier "=" expression "." .
expression = term | expression "|" term.
term = factor | term factor.
factor = identifier | string.
identifier = letter | identifier letter | identifier digit.
string = stringhead """.
stringhead = """ | stringhead character.
letter = "A" | ... | "Z".
digit = "0" | ... | "9".

This notation was introduced in 1960 by J. Backus and P. Naur in almost identical form for the formal description of the syntax of the language Algol 60. It is therefore called Backus Naur Form (BNF) (Naur, 1960). As our example shows, using recursion to express simple repetitions is rather detrimental to readability. Therefore, we extend this notation by two constructs to express repetition and optionality. Furthermore, we allow expressions to be enclosed within parentheses. Thereby an extension of BNF called EBNF (Wirth, 1977) is postulated, which again we immediately use for its own, precise definition:

 syntax = {production}.
 production = identifier "=" expression "." .
 expression = term {"|" term}.
 term = factor {factor}.
 factor = identifier | string | "(" expression ")" | "\[" expression "\]" | "{" expression "}".
 identifier = letter {letter | digit}.
 string = """ {character} """.
 letter = "A" | ... | "Z".
 digit = "0" | ... | "9".

A factor of the form {x} is equivalent to an arbitrarily long sequence of x, including the empty sequence. A production of the form

A = AB | ∅.

is now formulated more briefly as A = {B}. A factor of the form [x] is equivalent to x or nothing, that is, it expresses optionality. Hence, the need for the special symbol for the empty sequence vanishes.

The idea of defining languages and their grammar with mathematical precision goes back to N. Chomsky. It became clear, however, that the presented, simple scheme of substitution rules was insufficient to represent the complexity of spoken languages. This remained true even after the formalisms were considerably expanded. In contrast, this work proved extremely fruitful for the theory of programming languages and mathematical formalisms. With it, Algol 60 became the first programming language to be defined formally and precisely. In passing, we emphasize that this rigour applied to the syntax only, not to the semantics.

The use of the Chomsky formalism is also responsible for the term programming language, because programming languages seemed to exhibit a structure similar to spoken languages. We believe that this term is rather unfortunate on the whole, because a programming language is not spoken, and therefore is not a language in the true sense of the word. Formalism or formal notation would have been more appropriate terms.

One wonders why an exact definition of the sentences belonging to a language should be of any great importance. In fact, it is not really. However, it is important to know whether or not a sentence is well formed. But even here one may ask for a justification. Ultimately, the structure of a (well formed) sentence is relevant, because it is instrumental in establishing the sentence's meaning. Owing to the syntactic structure, the individual parts of the sentence and their meaning can be recognized independently, and together they yield the meaning of the whole. Let us illustrate this point using the following, trivial example of an expression with the addition symbol. Let E stand for expression, and N for number:

E = N | E "+" E.
N = "1" | "2" | "3" | "4" .

Evidently, 4 + 2 + 1 is a well-formed expression. It may even be derived in several ways, each corresponding to a different structure, as shown in Figure 2.2.

Figure 2.2. Differing structural trees for the same expression.

The two differing structures may also be expressed with appropriate parentheses, namely as (4 + 2) + 1 and as 4 + (2 + 1), respectively. Fortunately, thanks to the associativity of addition both yield the same value 7. But this need not always be the case. The mere use of subtraction in place of addition yields a counter example which shows that the two differing structures also yield a different interpretation and result: (4 - 2) - 1 = 1, 4 - (2 - 1) = 3. The example illustrates two facts:

  1. Interpretation of sentences always rests on the recognition of their syntactic structure.
  2. Every sentence must have a single structure in order to be unambiguous.

If the second requirement is not satisfied, ambiguous sentences arise. These may enrich spoken languages; ambiguous programming languages, however, are simply useless. We call a syntactic class ambiguous if it can be attributed several structures. A language is ambiguous if it contains at least one ambiguous syntactic class (construct).

2.1 Exercises

2.1.1

The Algol 60 Report contains the following syntax (translated into EBNF):

primary
	= unsignedNumber
	| variable
	| "(" arithmeticExpression ")"
	| ... .
factor
	= primary
	| factor "↑" primary.
term
	= factor
	| term ("×" | "/" | "÷") factor.
simpleArithmeticExpression
	= term
	| ("+" | "-") term
	| simpleArithmeticExpression ("+" | "-") term.
arithmeticExpression
	= simpleArithmeticExpression
	| "IF" BooleanExpression "THEN" simpleArithmeticExpression "ELSE" arithmeticExpression.
relationalOperator
	= "=" | "≠" | "≤" | "<" | "≥" | ">" .
relation
	= arithmeticExpression relationalOperator arithmeticExpression.
BooleanPrimary
	= logicalValue
	| variable
	| relation
	| "(" BooleanExpression ")" | ... .
BooleanSecondary
	= BooleanPrimary
	| "¬" BooleanPrimary.
BooleanFactor
	= BooleanSecondary
	| BooleanFactor "∧" BooleanSecondary.
BooleanTerm
	= BooleanFactor
	| BooleanTerm "∨" BooleanFactor.
implication
	= BooleanTerm
	| implication "⊃" BooleanTerm.
simpleBoolean
	= implication
	| simpleBoolean "≡" implication.
BooleanExpression
	= simpleBoolean
	| "IF" BooleanExpression "THEN" simpleBoolean "ELSE" BooleanExpression.

Determine the syntax trees of the following expressions, in which letters are to be taken as variables: a) x + y + z b) x × y + z c) x + y × z d) (x - y) × (x + y) e) -x ÷ y f) a + b < c + d g) a + b < c ∨ d ≠ e ∧ ¬ f ⊃ g > h ≡ i × j = k ↑ l ∨ m - n + p ≤ q

2.1.2

The following productions also are part of the original definition of Algol 60. They contain ambiguities which were eliminated in the Revised Report.

forListElement
	= arithmeticExpression
	| arithmeticExpression "STEP" arithmeticExpression "UNTIL" arithmeticExpression
	| arithmeticExpression "WHILE" BooleanExpression.
forList
	= forListElement
	| forList "," forListElement.
forClause
	= "FOR" variable ":=" forList "DO" .
forStatement
	= forClause statement.
compoundTail
	= statement "END"
	| statement ";" compoundTail.
compoundStatement
	= "BEGIN" compoundTail.
unconditional Statement
	= basicStatement
	| forStatement
	| compoundStatement
	| ... .
ifStatement
	= "IF" BooleanExpression "THEN" unconditionalStatement.
conditionalStatement
	= ifStatement
	| ifStatement "ELSE" statement.
statement
	= unconditionalStatement
	| conditionalStatement.

Find at least two different structures for the following expressions and statements. Let A and B stand for "basic statements".

IF a THEN b ELSE c = d
IF a THEN IF b THEN A ELSE B
IF a THEN FOR ... DO IF b THEN A ELSE B

Propose an alternative syntax which is unambiguous.

2.1.3

Consider the following constructs and find out which ones are correct in Algol, and which ones in Oberon: a) a + b = c + d b) a * -b c) a < b & c < d

Evaluate the following expressions: a) 5 * 13 DIV 4 = b) 13 DIV 5*4 =