Chapter 2 - Syntax and Semantics with Scheme

PPL 2023

We have introduced the following main concepts in Chapter 1 through examples in JavaScript and TypeScript:

In Chapter 2, we develop these concepts in a more systematic and in-depth manner.
The method we use is to design a new programming language, to specify exactly its syntax and its operational semantics, and to implement this specification into an executable interpreter.

Interpreters

In this chapter, we define a small but complete programming language. We demonstrate what are the elements necessary to define a programming language.
We first present these elements informally - to define a subset of the Scheme language. This language is a simple functional language.

We then move on to define the elements formally in the form of:

We implement these formal definitions into concrete programs - which together form an interpreter of the language.

As summarized by Hal Abelson in the introduction to Essentials of Programming Languages (Friedman, Wand and Haynes, 2000) - this chapter: brings us face to face with the most fundamental idea in computer programming: The interpreter for a computer language is just another program.

Interpreters are interesting because:

Object Language vs. Meta Language

We use interpreters to implement the formal definitions of a programming language.
The interpreter itself is written in a programming language, and it defines a programming language. To distinguish between these two programming roles, we use the following terminology:

The choice of each of these languages (object and meta) depends on the objective we set for ourselves.

Scheme as an Object Language

We introduce a subset of Scheme to illustrate how to specify a programming language and how to implement a full interpreter because Scheme is a small language (it has few primitives, few data types, an extremely simple and regular syntax) and a simple language (the evaluation rules are consistent and simple, there are not many special cases). Yet, Scheme is an expressive language - it is Turing complete, and because it follows the functional programming paradigm, it allows the definition of powerful functional abstractions which make programming productive.

All of these elements stand in contrast with JavaScript - which has evolved into a large language with many primitives, a very complex syntax, and because it is a multi-paradigm language - supporting FP, OOP, procedural programming and more - JavaScript has a complex evaluation semantics.

We thus select to describe a small and simple but full language - a subset of Scheme - as the object language of this course.

TypeScript as a Meta Language

We select to use a subset of TypeScript used as a Functional Language to implement the interpreter.
One of the advantages of this decision is that we exploit the TypeScript type system to encode the objects manipulated in the interpreter - abstract syntax trees and values. If the algorithm of the operational semantics is well understood, it can be implemented as a pure functional model in a code that is surprisingly short and elegant - and ported to any functional language.

Let us engage thus in developing a programming language bottom up, starting from small blocks and building up into a fuller version of a functional programming language.

Elements of a Programming Language

How do we specify a programming language? What are the elements that together define a programming language?

The key elements are:

  1. Primitives: expressions whose evaluation is built-in in the interpreter and which are not explained by the semantics of the language. These include primitive operations (for example, arithmetic operations on number or comparison operators) and primitive literal values (for examples, numbers or boolean values).
  2. Combination means: ways to create compound expressions and compound values from simpler ones.
  3. Abstraction means: ways to manipulate compound objects (expressions or data values) as standalone units by giving them simple names.

The language definition is structured in two aspects:

  1. Expressions are the words and sentences of the language.
  2. Values are the results of the computation of expressions according to the evaluation rules of the language. Values belong to the semantic domain of the language.

To describe our object programming language, we first present all the types of expressions in the language (this is called the syntax of the language) on the one hand, and all the possible values that can be computed by the language on the other. The syntax is the input to the interpreter, the values are the output of the interpreter.

We will introduce the syntax and semantics of the Scheme subset in several iterations - starting from the simplest forms of expressions, then describing the rules to evaluate their value, then introducing more complex forms of expressions and their evaluation rules. We will give names to each of the stages, starting from a very small language \(L1\), and growing the language to larger stages, \(L2\) and \(L3\).

Expressions

We distinguish

Atomic Expressions

The following are the types of atomic expressions in Scheme:

Compound Expressions

The only syntactic form used to combine expressions into complex expressions in Scheme is to arrange them into parentheses.

(+ 45 78) ;--> 123
(- 56 9)  ;--> 47
(* 6 50)  ;--> 300
(/ 25 5)  ;--> 5

These expressions are called forms. In this structure, we always refer to the leftmost sub-expression of the form as an operator and the rest of the sub-expressions as operands.
Scheme compound expressions are always written in prefix notation.

Forms can be nested recursively:

(+ (* 3 15) (- 9 2)) ;--> 52

It helps to pretty print (or indent) such complex expressions to figure out their internal structure:

(+ (* 3
      (+ 4 2)
      (* 2 5))
   7)

Expressions are evaluated by the interpreter and return a value. In Scheme, there are only expressions in the language. This is in contrast to JavaScript which contains expressions and statements (syntactic elements which, when evaluated, do not return a value, but simply execute a command).

Variables and Values

The programming language provides means to name objects. This is a fundamental form of abstraction: using a simple name instead of using a complex value or a complex expression.

In Scheme, define is used to bind a name to the value of an expression.

(define size 6) 

(* 2 size) ;--> 12 ;; size is now understood 

size in this context is called a variable. The relation between a variable and the value it denotes is called a binding.

Variables can be used as atomic expressions. They are evaluated to the value to which they were bound using define.

define is a form of abstraction because it allows the programmer to use names (variables) instead of complex operations.

(define area (* size size)) ;--> 36

In the syntax of a define operation, define is a special operator - it indicates that a special operation must be performed by the interpreter to evaluate the (define <var> <expression>) form.

The result of evaluating a define form is that the interpreter remembers that the variable is now bound to a value. The steps of this evaluation are:

Evaluation rule for forms of the type: (define <var> <exp>)
1. Let val = Evaluate(<exp>)
2. Add the binding < <var>, val > to the global environment

The global environment is a function which maps variable names to values.
It is best to think of it as an object of type variable => value.

Expression Types and Evaluation Rules - Round 1

We have now presented different expression types with the corresponding evaluation rules for each type of expression. Let us call this language \(L1\) and summarize the rules to evaluate all of the expression types in \(L1\).
We also summarize the set of all possible values computed by \(L1\).

Expression Types

All the expression types presented so far are:

  1. Atomic expressions:
    1. number literal expression (0, 1, 2, …)
    2. boolean literal expression (#t and #f)
    3. primitive operation expression (+, -, *, /, <, >, =)
    4. variable expression (a well formed name - made up of letters and punctuations)
  2. Compound expressions:
    1. Special compound expressions: (define <var> <exp>) - where <var> is a variable expression and <exp> is any expression except a define expression.
    2. Non-special compound expressions: \((exp_0 \ldots exp_n)\) - where each \(exp_i\) is any expression type except a define expression.

This inductive definition corresponds to the set \(Expression\) of all possible expressions in the language. (The definition is inductive because we use the term expression to define what are compound expressions.)

Evaluation Rules for Expressions

We define the function \(evaluate: Expression \rightarrow Value\) in an inductive manner:

1. Evaluation of atomic expressions:

  1. Variables are evaluated by looking up their value in the global environment.
  2. Primitive atomic expressions evaluate to their pre-defined denoted value.
    • Number atomic literal expressions evaluate to number values.
    • Boolean atomic literal expressions evaluate to boolean values true and false.
    • Primitive procedures evaluate to the primitive function that performs the denoted operation.

2. Evaluation of compound special forms:

  1. For each special form - a special evaluation rule exists.
    • The special form (define <var> <exp>) is evaluated according to this rule:
      1. Let val = evaluate(<exp>)
      2. Add the binding < <var>, val > to the global environment.
      3. The form returns a special void value.

In this rule, the sub-expression <var> is not evaluated.

We have only introduced one special form (define) so far - we will introduce more later.

3. Evaluation of compound non-special forms:

All compound forms are of the form \((exp_0 \ldots exp_n)\).

  1. Let \((val_0 \ldots val_n) = (evaluate(exp_0) \ldots evaluate(exp_n))\)
  2. Apply the procedure \(val_0\) to the values \((val_1 \ldots val_n)\).

NOTE: We assume that \(val_0\) evaluates into a primitive procedure - otherwise the second step above will not succeed. Can you think of which expressions are evaluated into primitive procedure values?

NOTE: Observe that this definition of the evaluate function is inductive - and that it follows the inductive definition of the set of expressions. This is an instance of the general principle we discussed in Section 1.4 which indicates that the structure of the types we define determines the structure of the code that processes values of the type.

Computed Values

Looking at all the evaluation rules, we can summarize the set of all possible values that can be returned by an invocation of evaluate:

Example Programs in \(L1\)

Let us write a few examples of programs in \(L1\):

5 ;--> 5

(* 3 2) ;--> 6

(+ (* 3 2) 4) --> 10

(> 2 3) ;--> #f

(= 2 (+ 1 1)) ;--> #t

Expressions in general are evaluated one by one. The order in which expressions are evaluated does not change their value.

Only in the case of the define form, there is a side-effect which makes the sequence of expressions significant:

(define radius 12)
(define pi 3.14)
(define area (* (* radius radius) pi))

(+ area (* 2 3))

Sequences in \(L1\)

In order to make sense of a program that includes define forms, we must define a compound expression which is a sequence of expressions and its evaluation rule. We will describe the details of the evaluation rule for sequences including define-expressions in more details later.

What is Not in \(L1\)

Let us try to assess what programs can be written in \(L1\) as defined.

On the side of the restrictions - we have:

On the positive side:

NOTE: consider the claim that programs in \(L1\) always terminate. Can you prove it? (Using structural induction).

\(L2\): User Defined Procedures and Conditional Expressions

Let us introduce two new types of expression into \(L1\) - leading to a new language we will call \(L2\). We choose to add together user defined procedures and conditional expressions - because these two language facilities work well together. The reason is that we will develop recursive functions - and when we write a recursive function, it helps to be able to test for the base case vs. recursive case.

Compound Procedure Definition with Lambda

lambda is a special operator which can be used in a special form of type lambda. The syntax is:

(lambda (<var> ...) <exp> ...)

For example:

(lambda (x) (* x x))

This expression is a procedure expression. It has three sub-expressions:

When this expression is evaluated, it creates a value whose type is called a closure. We will denote such values as <Closure (x) (* x x)>.

NOTE: (lambda (x) (* x x)) is an expression. <Closure (x) (* x x)> is a value. This is a similar distinction as: the literal expression 2 is an expression. Its value is the number 2.

NOTE: The expression (lambda (x) (* x x)) in Scheme is equivalent to (x) => x * x in TypeScript. The difference is syntactic only:

Closures: Composite or Atomic Values?

A closure value contains multiple parts - the parameters and the body. But there are no accessors to take apart these components from the value.

This leads to an interesting distinction:

When a lambda expression is evaluated, the body is not evaluated.

Naming User Procedures

To give a name to a procedure, we use the existing define mechanism:

(define square (lambda (x) (* x x)))

We will see later in the course that the capability to name procedures is a big deal - as it allows the definition of recursive functions - and, in particular, it changes the expressive power of the programming language.

Compound Procedure Application

When a lambda expression is computed, we obtain a closure. Closures can then be applied to values. This is the mechanism used when applying a closure to values:

Recall the evaluation rule for non-special forms in \(L1\):

Given a compound forms of the form \((exp_0 \ldots exp_n)\) where \(exp_0\) is not a special operator:

  1. Let \((val_0 \ldots val_n) = (evaluate(exp_0) \ldots evaluate(exp_n))\)
  2. Apply the procedure \(val_0\) to the values \((val_1 \ldots val_n)\).

In \(L1\) - the only possible procedure values were primitive procedures (the value of the atomic expressions +, * etc).

In \(L2\) - we also have user created closures.

The way a closure \(val_0\) = <Closure <p1, ..., pn> <exp1> ... <expk>> is applied to values \((val_1, \ldots, val_n)\) is according to the following rule:

  1. Replace all occurrences of \(p_1, \ldots, p_n\) in the expressions <exp1> ... <expk> of the body of the procedure with the corresponding values \(val_1, \ldots, val_n\).
  2. Evaluate all resulting expressions.
  3. The value returned by the application is the value of the last expression <expk>.

In summary:

Conditional Expressions

We introduce a third special form to the syntax of the language, in addition to define and lambda together with its specific evaluation rule to enable conditional evaluation.

The syntax of a Scheme conditional expression is:

(if <exp> <exp> <exp>)

For example:

(if (> x 2) x (* x 2))

if is a special operator - it has a special evaluation rule. The three other sub-expressions are called the test-part, then-part, and else-part of the compound if-expression. They can recursively be any type of expression.

NOTE: if-expressions are expressions - not statements as they are in TypeScript. They are evaluated into a value. They are equivalent to the ternary ?: operator we used in TypeScript.

Example: abs

(define abs (lambda (x) (if (> x 0) x (- x))))

(abs 2)  ;--> 2
(abs -3) ;--> 3

if-expressions can be nested as needed to define complicated decisions:

(if (= x y)
    0
    (if (> x y)
        1
        -1))

Scheme also includes an additional special form called cond which allows a more general form of conditional expression:

(cond (<p1> <e11> <e12> ... <e1k1>)
      (<p2> <e21> <e22> ... <e2k2>>)
      ...
      (else <en1> <en2> ... <enkn>>))

The sub-expressions of the cond form are called clauses - each clause starts with a predicate-expression and is followed by consequence-expressions.

Evaluation Rule for If-expressions

To evaluate an if-expression (if <test-exp> <then-exp> <else-exp>):

  1. Let p = evaluate <test-exp>
  2. If p is true, then evaluate <then-exp> and return this value as the value of the if-expression.
  3. else evaluate <else-exp> and return its value as the value of the if-expression.

NOTE: When evaluating an if-expression, the <test-exp> is always evaluated, but only one of <then-exp> or <else-exp> is evaluated. This is in contrast to what happens when we evaluate a non-special form - where all the sub-expressions are always evaluated.

Example Program in \(L2\)

The language we have defined so far is quite expressive. Let us define an example program demonsrating this: this program implements Newton’s method for computing square roots.

Newton’s method is stated as this algorithm:

To start this computation, we provide a non-zero guess like 1, and we need to decide when to stop guessing.

This algorithm is iterative:

Interestingly - we can implement this iterative algorithm even though we have no construct in the language to iterate.

(define sqrt (lambda (x) (sqrt-iter 1 x)))

(define sqrt-iter
  (lambda (guess x)
     (if (good-enough? guess x)
         guess
         (sqrt-iter (improve guess x)
                    x))))

(define abs (lambda (x) (if (< x 0) (- x) x)))
(define square (lambda (x) (* x x)))
(define epsilon 0.0001)

(define good-enough?
  (lambda (guess x)
     (< (abs (- (square guess) x)) epsilon)))
 
(define average
  (lambda (x y) (/ (+ x y) 2.0)))
  
(define improve
  (lambda (guess x)
    (average guess (/ x guess))))

This program illustrates many of the “good properties” we associated with the Procedural Programming paradigm in Chapter 1:

We haven’t discussed local variables (used inside each procedure without affecting the state of the program outside the scope of the procedures). We will see later in the chapter that even in \(L2\), we have enough semantic power to define local variables, but we have not provided syntactic constructs to encourage the use of this feature.

We have created a hierarchy of procedures, higher-level procedures call lower-level procedures:

Expression Types and Evaluation Rules - Round 2

Let us summarize the syntax (expression types) of the language \(L2\) and summarize the rules to evaluate all of the expression types in \(L2\).

Expression Types

All the expression types presented so far are:

  1. Atomic expressions:
    1. number literal expression (0, 1, 2, …)
    2. boolean literal expression (#t and #f)
    3. primitive operation expression (+, -, *, /, <, >, =)
    4. variable expression (a well formed name - made up of letters and punctuations)
  2. Compound expressions:
    1. Special compound expressions:
      • (define <var> <exp>)
      • (lambda (<var> ...) <exp> ...)
      • (if <exp> <exp> <exp>)
      • (cond (<exp> ...) ...)
    2. Non-special compound expressions: \((exp_0 \ldots exp_n)\)

This inductive definition corresponds to the set \(Expression\) of all possible expressions in the language.

Evaluation Rules for Expressions

We define the function \(evaluate: Expression \rightarrow Value\) in an inductive manner:

1. Evaluation of atomic expressions:

  1. Special operators are not evaluated (define, lambda, if, cond and else are the special operators).
  2. Variables are evaluated by looking up their value in the global environment.
  3. Primitive atomic expressions evaluate to their pre-defined denoted value.
    • Number atomic literal expressions evaluate to number values.
    • Boolean atomic literal expressions evaluate to boolean values true and false.
    • Primitive procedures evaluate to the primitive function that performs the denoted operation.

2. Evaluation of compound special forms:

  1. The special form (define <var> <exp>) is evaluated according to this rule:
    1. Let val = evaluate(<exp>)
    2. Add the binding < <var>, val > to the global environment.
    3. The form does not have any value (which we state as: it has a void value).
  2. The special form (lambda (<p1> ..) <exp1> ...) is evaluated into a value <Closure (<p1>...) <exp1>...>. The sub-expressions of the lambda-form are not evaluated.
  3. The special form (if <test-exp> <then-exp> <else-exp>) is evaluated as follows:
    • Let p = evaluate(<test-exp>)
    • If p is true - the value of the if-expression is evaluate(<then-exp>)
    • else it is evaluate(<else-exp>).

A similar rule applies for cond. For all special forms - not all sub-expressions are evaluated.

3. Evaluation of compound non-special forms:

All compound forms are of the form \((exp_0 \ldots exp_n)\).

  1. Let \((val_0, \ldots, val_n) = (evaluate(exp_0), \ldots, evaluate(exp_n))\)
  2. If \(val_0\) is a primitive procedure: Apply the procedure \(val_0\) to the values \((val_1, \ldots, val_n)\).
  3. Else if \(val_0\) is a closure <Closure <p1...pn> <exp1>..<expk>):
    • Replace \(p_1 ... p_n\) by \(val_1 ... val_n\) in <exp1> .. <expk>
    • Evaluate the resulting expressions and return the value of <expk>.

MISSING PARTS: In this specification, we did not explain yet:

These parts will become explicit and will need to be specified when we implement the evaluation rules in the code of the interpreter.

Computed Values

The set of all possible values computed by \(L2\) is:

What is Not in \(L2\)

Note the programming constructs which are absent from \(L2\):

Termination

Can you prove that some programs in \(L2\) may not terminate?

Extending the Language with Compound Values: \(L3\)

We observe that the computed values of \(L2\) are atomic values (numbers, booleans) or closures (which are a compoound value but which has no accessors).

To introduce compound data values in the language, we need:

In JavaScript, for example, compound values are constructed with the array and map constructors and denoted by the [] and {} notations for literal expressions.

In the minimalistic spirit which characterizes Scheme, we will introduce into \(L3\) a single compound value constructor and the capability to use it recursively.

The Pair Compound Data Type

A pair is a minimal compound data type that combines two values together into a single new unit. The language supports this by introducing:

We thus extend the language with 5 primitive functions: cons, car, cdr, pair?, equal?.

In addition, Scheme defines a standard form to print pair values and literal expressions that denote pair values: A literal pair expression is denoted as '( <exp> . <exp> ). For example:

(define p1 '(1 . 5))    ;; literal pair expression 
(define p2 (cons 1 5))  ;; constructor invocation

(car p1) ;--> 1
(cdr p1) ;--> 5

Pairs can be combined recursively into complex compound values:

(define p3 (cons p1 p2))
(define p4 (cons p3 p2))

(car p3) ;--> '(1 . 5)
(cdr p3) ;--> '(1 . 5)

NOTE: The cons Pair constructor can receive parameters of any types. The type of cons is thus described as (first T1, second T2)=>Pair(T1,T2).

(cons #t 1) ;--> '(#t . 1)

The List Compound Data Type

In addition to the Pair data type, \(L3\) introduces a recursive data type - called List. We first introduce it inductively:

Non empty lists are constructed from a value and a list. The size that characterizes lists in this inductive definition is the length of the list: a list of size \(n+1\) is constructed from a value and a list of size \(n\).

The definition of this inductive data type is a disjoint union between the empty list and non-empty lists.

Scheme implements List values by re-using the Pair data type for non-empty lists and a special value for the empty list. The additions to the language are:

(define l14 (cons 1 (cons 2 (cons 3 (cons 4 '()))))) ;--> '(1 2 3 4)
(list? l12) ;--> #t
(cdr l14)   ;--> '(2 3 4)
(car (cdr l14)) ;--> 2
(cons 10 l14)   ;--> '(10 1 2 3 4)

On the basis of this inductive definition, we can define functions over lists:

(define length
  (lambda (l)
    (if (empty? l)
        0
        (+ 1 (length (cdr l))))))

As usual, recursive functions operating over an inductive type have a structure similar to the inductive definition of the inductive compound data type: because List is a disjoint union over Empty and Non-Empty lists, the structure of a function operating over lists will be:

    (if (empty? l)
        <process empty list case>
        <process non-empty list case>

For example:

(define nth
  (lambda (n l)
    (if (empty? l)
        '()
        (if (= n 0)
            (car l)
            (nth (- n 1) (cdr l))))))
            
(nth 2 '(0 1 2 3 4)) ;--> 2
(nth 3 '(0 1))       ;--> '()

The list constructor can also receive parameters of any types. In particular, we can create list of pairs and recursive tree structures using the compound list data type.

Another list constructor is available in Scheme - which avoids the need for nested calls to cons: (list <e1> ... <en>) ;--> '(v1 ... vn) where vi is the value of <ei>.
Syntactically, list is complex because it can take variable number of arguments.

NOTE: Compare (list 1 2) and (cons 1 2) - how different are they?

Expression Types and Evaluation Rules - Round 3

Let us summarize the syntax (expression types) of the language \(L3\) and the rules to evaluate all of the expression types in \(L3\).

Expression Types

All the expression types presented so far are:

  1. Atomic expressions:
    1. number literal expression (0, 1, 2, …)
    2. boolean literal expression (#t and #f)
    3. primitive operation expression (+, -, *, /, <, >, =, cons, car, cdr, pair?, list? )
    4. variable expression (a well formed name - made up letters and punctuations)
    5. the empty list literal expression ‘()
  2. Compound expressions:
    1. Literal compound expressions: denoted '(<lit> . <lit>) for pairs or '(<lit> ... <lit>) for lists. where <lit> is an embedded literal expression (either atomic or compound).
    2. Special compound expressions:
      • (define <var> <exp>)
      • (lambda (<var> ...) <exp> ...)
      • (if <exp> <exp> <exp>)
      • (cond (<exp> ...) ...)
    3. Non-special compound expressions: \((exp_0 \ldots exp_n)\)

This inductive definition corresponds to the set \(Expression\) of all possible expressions in the language.

Evaluation Rules for Expressions

We define the function \(evaluate: Expression \rightarrow Value\) in an inductive manner:

1. Evaluation of atomic expressions:

  1. Special operators are not evaluated (define, lambda, if, cond and else are the special operators).
  2. Variables are evaluated by looking up their value in the global environment.
  3. Primitive atomic expressions evaluate to their pre-defined denoted value.
    • Number atomic literal expressions evaluate to number values.
    • Boolean atomic literal expressions evaluate to boolean values true and false.
    • Primitive procedures evaluate to the primitive function that performs the denoted operation.
    • The Empty list atomic literal expression evaluates to an empty list value.

2. Evaluation of compound special forms:

  1. The special form (define <var> <exp>) is evaluated according to this rule:
    1. Let val = evaluate(<exp>)
    2. Add the binding < <var>, val > to the global environment.
    3. The form does not have any value (which we state as: it has a void value).
    4. Compound literal expressions evaluate to compound values (pair or list).
  2. The special form (lambda (<p1> ..) <exp1> ...) is evaluated into a value <Closure (<p1>...) <exp1>...>. The sub-expressions of the lambda-form are not evaluated.
  3. The special form (if <test-exp> <then-exp> <else-exp>) is evaluated as follows:
    1. Let p = evaluate(<test-exp>)
    2. If p is true - the value of the if-expression is evaluate(<then-exp>)
    3. else it is evaluate(<else-exp>).

A similar rule applies for cond. For all special forms - not all sub-expressions are evaluated.

3. Evaluation of compound non-special forms:

All compound forms are of the form \((exp_0, \ldots, exp_n)\).

  1. Let \((val_0, \ldots, val_n) = (evaluate(exp_0), \ldots, evaluate(exp_n))\)
  2. If \(val_0\) is a primitive procedure: Apply the procedure \(val_0\) to the values \((val_1, \ldots, val_n)\).
  3. Else if \(val_0\) is a closure <Closure <p1...pn> <exp1>..<expk>):
    • Replace \(p_1 ... p_n\) by \(val_1 ... val_n\) in <exp1> .. <expk>
    • Evaluate the resulting expressions and return the value of <expk>.

Computed Values

The set of all possible values computed by \(L3\) is:

Comparison Functional JavaScript and \(L3\)

Let us compare the language we have obtained in \(L3\) and the Functional JavaScript subset we used in Chapter 1: