Data Types and Operations on Data

PPL 2023

In previous sections, we reviewed tools a programming language provides that help programmers design set of values that have common properties. The concrete form is a type language with which we can express type annotations. These type annotations denote sets of values. They are associated to variables or function parameters and function return values.

Such type annotations are useful to enable type checking - the process through which a compiler can ensure that variables will always be bound to values that belong to a specified type for all possible executions of the program - and issue errors when such assurance cannot be proven. Type annotations are also useful in documenting the intention of the programmer.

Type definitions have a third important function: they help the programmer structure the code that operates on complex values in a way that reflects the structure of the data type.
In other words, knowing the structure of a data type helps the programmer write correct functions that operates over values of this type.

Reversely, programmers design and name types for sets of values that will be processed by the same set of functions. That is, type definitions allow the definition of uniform functional interfaces.

We illustrate this point through 4 examples:

Homogeneous array types and the sequence interface (map, filter, reduce, …)
Modeling trees
Mutable data types in FP
Disjoint types and Disjoint Unions to enable uniform functional interfaces

Homogeneous Array Types and the Sequence Interface

Array values can be homogeneous (all items have the same type) or heterogeneous (items of different types appear).

The natural way to operate over arrays is to use the sequence interface - that is, a set of higher-order functions which can be applied on arrays. For example:

map(f, array)
filter(pred, array)
reduce(f, array, init)

and similar functions (such as every, some, find).

When we analyze the type of these functions, we realize that such operations will be easy and natural if the type of the function passed as a parameter is a simple type. For example:

import { map } from "ramda";

let arr = [1, 2, 3];
map(x => x * x, arr); // ==> [ 1, 4, 9 ]

// Or in curried manner:
map(x => x * x)(arr); 

This operation over the array works well because the type of the mapper function is simple and can be derived from the type of the array:

arr is of type number[]
x => x * x is a function of type number => number - it can thus be applied to all the items in arr
the resulting array is also of type number[]
the function map(x => x * x) is a function of type number[] => number[] - we say it is lifted from number to number[].

Similarly, if we use a mapper function of type number => boolean - we know the return value of map will be boolean[]:

import { map } from "ramda";

let arr = [1, 2, 3];
map(x => x > 2, arr); // ==> [ false, false, true ]

Operating over a heterogeneous array like [1, "a" ,true] would make the usage of these functions much more challenging - because the mapper function would need to know what to do for each type of parameter.

Summary: Homogeneous array types encourage the use of a simple functional interface - including map, filter, reduce. This functional interface (set of functions which operate on the same data structure) receive a function parameter with a simple type signature and abstract many forms of loops over repeated values of the same type.

Modeling Trees with Types

Let us review the definition of a binary tree:

type BinTree<T> = {
    root: T;
    left?: BinTree<T>;
    right?: BinTree<T>;
}

const bt: BinTree<number> = { root: 1, left: { root: 2 }, right: { root: 3 } };

bt; // ==> { root: 1, left: { root: 2 }, right: { root: 3 } }

When we write operations that operate over such trees, we must write code that operates according to the expected structure of the values in this type.

For example, let us write a function that traverses a BinTree in Depth-First order:

const traverseDFS: <T>(t: BinTree<T>) => void = t => {
    console.log(t.root);
    if (t.left !== undefined) traverseDFS(t.left);
    if (t.right !== undefined) traverseDFS(t.right);
};
traverseDFS(bt);
// ==>
// 1
// 2
// 3

The structure of this function follows the structure of the type definition - when we process a value of type BinTree, we know that accessing the field t.root is safe (will not return undefined or throw an exception). To access t.left we must first check whether it is undefined since the type allows for this (base case of the recursion). If it is not, we know it must be a value of type BinTree.

In addition to these assurances, we also know that after checking these conditions we have checked all possible configurations of values (that is, the function exhaustively covers all possible BinTree values).

Let us consider a variant task where we actually build values of type BinTree:

let bt2 = {
    root: 2,
    left: {
        root: 3,
        left: { root: 4 }
    },
    right: {
        root: 5,
        right: { root: 6 }
    }
};

const square: (x: number) => number = x => x * x;

const squareTree: (t: BinTree<number>) => BinTree<number> = t =>
    t.left === undefined && t.right === undefined ? { root: square(t.root) } :
    t.left === undefined ? { root: square(t.root), right: squareTree(t.right) } :
    t.right === undefined ? { root: square(t.root), left: squareTree(t.left) } :
    { root: square(t.root), left: squareTree(t.left), right: squareTree(t.right) };

squareTree(bt2);
// ==>
// {
//   root: 4,
//   left: { root: 9, left: { root: 16 } },
//   right: { root: 25, right: { root: 36 } }
// }

The function squareTree operates over a BinTree value and creates a new BinTree return value. It considers all possible configurations of BinTree (only root, root and left, root and right, root and left and right) and invokes the function squareTree recursively on each child value accordingly.

In this case as well - type analysis allows us to verify that all accesses to the fields of the BinTree value are safe and that all recursive calls pass values of the right type as parameters. In addition, we can verify that the function is exhaustive in checking all possible configurations for the value.

We can relax the type checking - and accept to receive values of type undefined in addition - yielding slightly shorter and more readable code, as shown below:

let bt3 = {
    root: 2,
    left: {
        root: 3,
        left: { root: 4 }
    },
    right: {
        root: 5,
        right: { root: 6 }
    }
};

const square: (x: number) => number = x => x * x;

const squareTree2: (t: BinTree<number> | undefined) => BinTree<number> | undefined = t =>
    t === undefined ? undefined :
    { root: square(t.root), left: squareTree2(t.left), right: squareTree2(t.right) };

squareTree2(bt3);
// ==>
// {
//   root: 4,
//   left: {
//     root: 9,
//     left: { root: 16, left: undefined, right: undefined },
//     right: undefined
//   },
//   right: {
//     root: 25,
//     left: undefined,
//     right: { root: 36, left: undefined, right: undefined }
//   }
// }

Observe that in this version:

We extended the expected values of type BinTree to also include the value undefined.
We explicitly test for this value as the first base case in the recursive function squareTree2.
The recursive calls are now simplified as we don’t need to avoid the recursive calls with a value undefined
The return value has values marked explicitly as undefined - these are semantically equivalent to absent values - but in the syntax of the object, they still appear.

We can get rid of them using a simple idiom:

JSON.parse(
    JSON.stringify({
        root: 4,
        left: {
            root: 9,
            left: { root: 16, left: undefined, right: undefined },
            right: undefined
        },
        right: {
            root: 25,
            left: undefined,
            right: { root: 36, left: undefined, right: undefined }
        }
    })
);
// ==>
// {
//   root: 4,
//   left: { root: 9, left: { root: 16 } },
//   right: { root: 25, right: { root: 36 } }
// }

The difference between the two versions is a matter of style preference. In general, the presence of undefined values complicates type analysis, but it is difficult to avoid dealing with it explicitly.

Mutable (Persistent) Data Types in FP

We indicated earlier that FP encourages immutable variables and data structures to achieve the goals of determinism (calling the same function with the same arguments should return the same value) and safe concurrency (avoid shared mutable data across threads).

Yet, some data types are thought of a mutable in their definition. Consider the example of a stack. It is defined as a container of values that enforces a specific access pattern through an interface:

push(x): modifies the stack by adding a new value x on top of it.
pop(): modifies the stack by removing the top value of the stack and returning its value.
empty(): determines whether the stack is empty.

As was discussed in SPL (for those who took this course), in OOP and Imperative Langauges, it is useful to split the Stack interface into Queries (functions that only return information about the data structure without changing it) and Commands (functions which only modify the data structure and do not return any value). Such distinction makes writing tests much easier (see Command Query Separation pattern).

To adopt this distinction, we split the pop() method into two distinct methods:

peek(): returns the value of the top element in the stack (this is a query method).
pop(): modifies the stack by removing the top element (this is a command method - that just has a side effect and no return value).

This definition is inherently procedural - as it defines the data type in terms of mutation (with the commands push() and pop()) in addition to the queries peek() and empty().

Note that according to this methodology, even the queries are not pure functions - because they are not deterministic (that is, we can call the same function twice on the same variable at different times and obtain different answers). Let us illustrate these points with a simple implementation of Stacks in TypeScript:

// Mutable implementation of stacks (NON Functional)

// Constructor
const makeStack = <T>(initValues: T[]): T[] => initValues;

// peek and empty are queries - they do not mutate the stack
const peek = <T>(stack: T[]): T => stack[0];
const empty = <T>(stack: T[]): boolean => stack.length === 0;

// push() and pop() are commands - they mutate the stack and return void (no return value)
const push = <T>(stack: T[], newVal: T): void => {
    stack.unshift(newVal);
    return;
};

const pop = <T>(stack: T[]): void => {
    stack.shift();
    return;
};

let s = makeStack([1, 2, 3]);
push(s, 0);
s; // ==> [ 0, 1, 2, 3 ]

This implementation relies on an array encoding for Stack values. It relies on generic data types in TypeScript so that we can use it for Stacks of any type - as long as it is a homogeneous stack.

It relies on the fact that arrays in JavaScript are mutable - and implements push() using the primitive unshift() operation on arrays, and pop() using the primitive shift() operation.

We skipped checking the preconditions to simplify the presentation.

Note the specific style: commands (functions that have a side-effect - in this case push() and pop()) have no return value - we mark them as void in TypeScript.

In contrast, queries (functions that have no side-effect and only return information about the data structure - in this case peek() and empty()) return a value.

Given that the underlying data type is mutable, we cannot obtain deterministic behavior:

let s = makeStack([1, 2, 3]);
push(s, 0);
console.log(peek(s)); // ==> 0
pop(s);
console.log(peek(s)); // ==> 1

The same operation peek(s) on the same variable s returns different values when mutation has occurred between the two calls. This is a case of non-determinism due to mutation.

Functional Stack: Step 1

Can we define a functional data structure for the Stack data type?

For example, can we define a Stack data structure that operates as an immutable data structure, while still offering the same interface to its clients (peek, empty, push, pop)?

The key change that is required to obtain such immutable functional data types is to modify the commands so that instead of mutating the existing data structure and returning void (that is, having no return value), the commands will return a new copy of the data structure.

This imposes first a change on the type of the functions, next a change on the client side. Let us illustrate this first round of changes (which we will find out is necessary but not sufficient):

// Implementation 2 of stacks (Towards Functional - not yet)

// To document the type, we give it a name - distinct from its implementation.
type Stack<T> = T[];

// Constructor
const makeStack = <T>(initValues: T[]): Stack<T> => initValues;

// peek and empty are queries - they do not mutate the stack - no change needed from V1 above.
const peek = <T>(stack: Stack<T>): T => stack[0];
const empty = <T>(stack: Stack<T>): boolean => stack.length === 0;

// push() and pop() are commands - they return a new version of the stack (instead of void in V1)
const push = <T>(stack: Stack<T>, newVal: T): Stack<T> => {
    stack.unshift(newVal);
    return stack;
};
const pop = <T>(stack: Stack<T>): Stack<T> => {
    stack.shift();
    return stack;
};

let s1 = makeStack([1, 2, 3]);
let s2 = push(s1, 0); // This is a new stack
s2; // ==> [ 0, 1, 2, 3 ]

This modification in the signature of the commands requires clients to change as well: each time a command is invoked, we must bind the return value to a new variable so that it can be used further:

let s1 = makeStack([1, 2, 3]);
let s2 = push(s1, 0); // This is a new stack
console.log(peek(s2)); // ==> 0
let s3 = pop(s2); // Another stack
peek(s3); // ==> 1

Value Aliasing

In this new style, we do not observe direct mutation - the calls seem to be deterministic.

Unfortunately, this is an illusion:

const s1 = makeStack([1, 2, 3]);
const s2 = push(s1, 0);
const s3 = pop(s2);
peek(s1); // ==> 1
pop(s1); // ==> [ 2, 3 ]
peek(s1); // ==> 2

The implementation relies on JavaScript arrays - which are internally mutable. We did not prevent this mutation by just changing the signature of the methods, because in the body of the commands we still call mutators on the internal representation of the stack.

The situation is even worse because we have created a very risky situation called variable aliasing - the stacks s2 and s3 share in memory cells that are used by s1. As a result, operations on s1 end up modifying the state of s2. If we now call the same expression peek(s2) and peek(s3) - we obtain different values from the earlier results (0 and 1) even though no direct mutation of s2 and s3 was performed:

peek(s2) // This was 0 before invoking pop(s1)
// ==> 2
peek(s3) // This was 1 before invoking pop(s1)
// ==> 2

The reason the stacks s2 and s3 were changed when we applied mutation on s1 is because the 3 stacks actually share parts of their value in memory - because Arrays in JavaScript behave like pointers to values in C++.

Functional Stack: Step 2

The solution to this problem is to require that commands actually copy the data structure when they need to modify it - so that each returned value is indeed a new value - and not an alias of the previous value.

This can be achieved by this new version of the Stack code:

// Implementation 3 of stacks (Functional - but inefficient)

// To document the type, we give it a name - distinct from its implementation.
type Stack<T> = T[];

// A utility to clone an array - relies on the fact that concat copies
// This is a shallow copy
const cloneArray = <T>(array: T[]): T[] => [].concat(array);

// Constructor
const makeStack = <T>(initValues: T[]): Stack<T> => cloneArray(initValues);

// peek and empty are queries - they do not mutate the stack - no change needed from V1.
const peek = <T>(stack: Stack<T>): T => stack[0];
const empty = <T>(stack: Stack<T>): boolean => stack.length === 0;

// push() and pop() are commands - they return a new copy of the stack
const push = <T>(stack: Stack<T>, newVal: T): Stack<T> => {
    let res = cloneArray(stack);
    res.unshift(newVal);
    return res;
};
const pop = <T>(stack: Stack<T>): Stack<T> => {
    let res = cloneArray(stack);
    res.shift();
    return res;
};

let s1 = makeStack([1, 2, 3]);
let s2 = push(s1, 0); // This is a new stack
s2; // ==> [ 0, 1, 2, 3 ]

Let us verify that this new implementation provides deterministic behavior for the Stack functions and no aliasing:

let s1 = makeStack([1, 2, 3]);
let s2 = push(s1, 0); // This is a new stack
console.log(peek(s2)); // Should be 0
// ==> 0
let s3 = pop(s2); // Another stack
console.log(peek(s3)); // Should be 1
// ==> 1

// Let's now change s1 and
// verify s2 and s3 are not affected
pop(s1);
console.log(peek(s1)); // Should remain 1
// ==> 1
console.log(peek(s2)); // Should remain 0
// ==> 0
console.log(peek(s3)); // Should remain 1
// ==> 1

This implementation is safe - it does not introduce unexepected side effects to the data structures, and the data structures remain immutable.

The cost of this implementation, though, is that each mutation requires a full copy of the data structure. This is inefficient in RAM (we obtain many copies of the same objects) and in CPU (the copying operations are expensive).

Efficient Functional Data Structures: Step 3

Based on the technique of amortization, the work of Okasaki has demonstrated how one can design efficient functional data structures that are immutable and still avoid unnecessary copying.

The book: Purely Functional Data Structures Chris Okasaki, Cambridge University Press, 1999 presents this general approach. It has become the standard reference to design efficient data structures for FP. In particular, it has been the basis for the design of the Clojure programming language.

The library Immutable https://immutable-js.com/ has been developed in TypeScript by Facebook. It provides an efficient immutable implementation of the main collection data structures (Map, Stack, Queue, Lists, Set) based on Okasaki’s original description.

Using the Immutable Stack implementation is performed as follows:

import { Stack as iStack } from "Immutable";

let s1 = iStack.of(1, 2, 3);
let s2 = s1.push(0);
console.log(`s1.size = ${s1.size}`); // ==> s1.size = 3
console.log(`s2.size = ${s2.size}`); // ==> s2.size = 4
console.log(`s1.peek() = ${s1.peek()}`); // ==> s1.peek() = 1
console.log(`s2.peek() = ${s2.peek()}`); // ==> s2.peek() = 0
let s3 = s1.pop();
let s4 = s2.pop();
console.log("After performing s1.pop() and s2.pop()");
console.log(`s1.size = ${s1.size}`); // ==> s1.size = 3
console.log(`s2.size = ${s2.size}`); // ==> s2.size = 4
console.log(`s3.size = ${s3.size}`); // ==> s3.size = 2
console.log(`s4.size = ${s4.size}`); // ==> s4.size = 3
console.log(`s1.peek() = ${s1.peek()}`); // ==> s1.peek() = 1
console.log(`s2.peek() = ${s2.peek()}`); // ==> s2.peek() = 0
console.log(`s3.peek() = ${s3.peek()}`); // ==> s3.peek() = 2
console.log(`s4.peek() = ${s4.peek()}`); // ==> s4.peek() = 1

We confirm that the Stack functions behave in a deterministic manner, with no aliasing. Yet, internally, minimal copying was performed.

Immutable works well with JSON and encourages the usage of embedded data structures in a safe immutable and efficient manner. It thus provides an essential component of FP in TypeScript.

JSON.stringify(s1); // ==> '[1,2,3]'

import { Stack as iStack } from "Immutable";
let personsStack = iStack(
    JSON.parse('[{"name":"avi", "age":23}, {"name":"bob", "age":26}]')
);
personsStack.peek(); // ==> { name: 'avi', age: 23 }

Another approach to functional immutable data structures is implemented in the https://immerjs.github.io/immer/ Immer library. Such libraries (Immutable.js or Immer) are essential in frontend programming for efficient state management.

Disjoint Types and Disjoint Union

We have highlighted the perspective of Data Types as denoting sets of values over which common operations can be performed. On the basis of this understanding, we defined operations over types which correspond to Set operations - such as Union and Intersection.

Such operations are provided for example in TypeScript, and we can define types such as:

type NoS = number | string;   // union: values that are either numbers or strings
type SoB = string | boolean;  // union: either string or boolean
type S = NoS & SoB;           // intersection - should be back to string.

Type union can be used for example if we want to model the set of values that can be denoted by the JSON notation - as a recursive union of possible values:

type Json =
  // Atomic values
  | string
  | number
  | boolean
  | null
  // Compound values - maps and arrays
  | { [property: string]: Json } // A map where the keys are all strings
  | Json[];

We have also defined that the type system implemented in Typescript follows structural subtyping as opposed to nominal subtyping. For example, if we define two types:

type Person = { name: string; address: string };    // a person record in a Database
type Variable = { name: string; address: string };  // a variable declaration in an Interpreter

let p: Person = { name: "a", address: "2" };
let v: Variable = p; // Compiles ok

Under nominal typing (like it exists in Java for example), these two types would be disjoint - values of type Person and values of type Variable would be different.

Under structural typing (like it exists in Typescript), these two types are actually equal - they describe the same set of values.

When modeling data types, we are often interested in distinguishing such types - so that the values we describe are distinct, and we cannot confuse a Person value with a Variable value.

The way to obtain this behavior is to add a discriminant field - called a tag - to distinguish values that are intended of being of different types.

const PERSON = "person";
const VARIABLE = "variable";

type Person = {
    tag: typeof PERSON;
    name: string;
    address: string;
} // a person record in a Database

type Variable = {
    tag: typeof VARIABLE;
    name: string;
    address: string;
} // a variable declaration in an Interpreter

let p: Person = { tag: PERSON, name: "a", address: "2" };
let v: Variable = p; // Does NOT compile:
// message:
// Type 'Person' is not assignable to type 'Variable'.
//   Types of property 'tag' are incompatible.
//     Type '"person"' is not assignable to type '"variable"'.

In this example, the type of the tag field is a set of a single value - the string "person" or the string "variable".

With the addition of the tag field with these specifications, the two types Person and Variable have become disjoint - that is, the set of values these type annotations denote are disjoint.

Disjoint Union

The possibility to define disjoint types can be combined into a very common pattern called disjoint union.

In set theory, the disjoint union of two sets \(A\) and \(B\) (https://en.wikipedia.org/wiki/Disjoint_union) is a binary operator that combines all distinct elements of a pair of given sets, while retaining the original set membership as a distinguishing characteristic of the union set.

\[A \uplus B = (A \times \{0\}) \cup (B \times \{1\})\]

For example:

\[\{0,1,2\} \uplus \{2,3\} = \{ (0,0), (1,0), (2,0), (2,1), (3,1) \}\]

We identify in that operation that the elements (0, 1) added to each pair play a role similar to the tag field we added to map types to make them disjoint.

In type descriptions, in order to define a disjoint union type, we define the union of two (or more) map types which are made disjoint by using the same tag field.

NOTE: tag is just a key, any other key could be used - for example, type or kind are often used. The key must be used consistently across all the types that appear in the union.

For example:

type Shape = Circle | Rectangle | Triangle;

type Circle = {
    tag: "circle";
    center: { x: number; y: number };
    radius: number;
}

type Rectangle = {
    tag: "rectangle";
    upperLeft: { x: number; y: number };
    lowerRight: { x: number; y: number };
}

type Triangle = {
    tag: "triangle";
    p1: { x: number; y: number };
    p2: { x: number; y: number };
    p3: { x: number; y: number };
}

These type definitions allow this type of processing - which is “case by case” processing of all the options in the disjoint union:

const area = (s: Shape): number => {
    switch (s.tag) {
        case "circle":
            return s.radius * s.radius * 3.14;
        case "rectangle":
            return (s.upperLeft.x - s.lowerRight.x) * (s.upperLeft.y - s.lowerRight.y);
        case "triangle":
            return 0; // I do not know the formula :(
    }
};

area({ tag: "circle", center: { x: 0, y: 0 }, radius: 2 }); // ==> 12.56

The tool of disjoint union together with the corresponding switch construct achieves an effect similar to sub-classes with virtual classes in Object-Oriented Programming. It allows the function to dispatch to different computations based on the type of the actual value received as a parameter.

We prefer in the course the following functional notation instead of switch, using the simple conditional expression:

const area = (s: Shape): number =>
    s.tag === "circle" ? s.radius * s.radius * 3.14 :
    s.tag === "rectangle" ? (s.upperLeft.x - s.lowerRight.x) * (s.upperLeft.y - s.lowerRight.y) :
    s.tag === "triangle" ? 0 :
    s // s is understood as a variable of type "never" here - it can be returned without affecting the return type of the function which remains number

The conditional expression when it is chained as in this example has the same semantic as a switch case where all the cases are exclusive. The difference is that in the last step, the syntax of the expression requires an else case which is a priori not needed. The type inference mechanism of TypeScript recognizes that after each test, the type of the variable s has less and less possible options. In the last else option, however, s is known to have no possible values. In this case, it is understood to have a type called never (which denotes the empty set, the opposite of the type any which denotes all possible values). In this case, we can just return s, which does not contradict the return type of the function, because we know that the function will return either one of the cases in the first 3 clauses (which is number) or never – altogether, the function returns number union never which is number.

Both the usage of switch and the conditional expression are safe against future changes in the code: if the programmer extends the definition of the type Shape with another option, then the code for the function area will not compile anymore.

The definition of the Union type in this specific context makes sense because it expresses the intention of the programmer:

these are disjoint types - they have no commonality in structure
but they have a similar functional interface - we consume them in a similar manner.

The type checker can determine that the switch construct covers all possible options - based on the structure of the type union - and for each case, it can check the expected keys based on the value of the tag key.

Summary

Type definitions help the programmer structure the code that operates on complex values in a way that reflects the structure of the data type.
Programmers design and name types for sets of values that will be processed by the same set of functions.
Homogeneous array types encourage the programmer to consume them using the sequence interface (map, filter, reduce) with processing functions applied to each item of the same type.
Recursive types are processed by recursive functions which inspect the values accordig to the type definition - and can determine the base case and the inductive case. We illustrated this with Tree processing functions.
Mutable data types can be modelled in a Functional Programming style by making sure commands are written as constructors which return a new version of the values, instead of mutating an existing value. We illustrated this on the case of the stack data type.
Disjoint types and Disjoint Unions enable the definition of uniform functional interfaces over types of values that are not structurally similar.

Principles of Programming Languages

Course Website