Benchmarking Loops in Python

PPL 2020 - Week #1

This notebook demonstrates the performance advantage obtained through vectorization on a loop. We compare a procedural implementation of a loop with mutation, a slightly better version using Python's map function (which can also be written as a list comprehension), and a vastly superior vectorized version using NumPy arrays and broadcast operations.

Our requirement is to compute the cube of all elements in a list of numbers of size \(n\). We first implement the Python version of the same function we reviewed in class:

n = 100

def cube(x):
    return x * x * x

Procedural Loop

def cubes_procedural(a):
    res = list(a)  # Copy the list
    for i in range(len(a)):
        res[i] = cube(a[i])
    return res

The function first copies the input list, then overwrites each element with its cube.

Let us test the function:

s = list(range(10))
cubes_procedural(s)  # ==> [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]

Let us now benchmark this function - using the IPython %%timeit magic command:

%%timeit 
for i in range(n):
    cubes_procedural(range(i))
1.23 ms ± 51.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Map

The procedure can be improved - in terms of readability, style, and performance - by using Python's map function. Python's map receives the same parameters as in TypeScript - a function and a list (or, more generally, an iterable) - and it returns an iterator. (We will learn later that map is a lazy function: it does not compute the resulting list immediately; the elements are computed only when we ask to construct a list out of the iterator.)

map(cube, [1, 2, 3])  # ==> <map at 0x28a542e29b0>
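
To see this laziness in action, here is a small sketch (the cube_verbose helper is ours, added for illustration): it prints every time it is called, showing that elements are computed only when the iterator is consumed:

def cube_verbose(x):
    print("cubing", x)
    return x * x * x

it = map(cube_verbose, [1, 2, 3])  # nothing is printed yet
next(it)  # prints "cubing 1"  ==> 1
next(it)  # prints "cubing 2"  ==> 8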

To force the evaluation of the result, we copy the resulting lazy iterator into a list:

list(map(cube, [1, 2, 3]))  # ==> [1, 8, 27]
def cubes_map(a):
    return list(map(cube, a))
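
As mentioned in the introduction, the same computation can also be written as a list comprehension; a minimal equivalent sketch:

def cubes_comprehension(a):
    return [cube(x) for x in a]

cubes_comprehension([1, 2, 3])  # ==> [1, 8, 27]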

Let us benchmark the performance of map to compare it with the procedural loop:

%%timeit
for i in range(n):
    cubes_map(range(i))
579 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The map version reduces the time of the test from ~1.23 ms to ~580 µs. This is mainly because map is implemented as a primitive (in C), which avoids the overhead of interpreting the loop in Python.

Vectorization

We can improve this more significantly by using vectorization - supported in Python by the NumPy package. NumPy offers a highly optimized array data structure together with mechanisms to broadcast functions over arrays - that is, to apply a function elementwise over arrays of numbers, as map does.

import numpy as np

# Use a NumPy Array instead of a list
a = np.arange(100)
a
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

NumPy interprets the application of a function such as cube to an array as a map of the function elementwise: the arithmetic operators inside cube operate on whole arrays. This is called broadcasting a function over an array.

cube(a)
array([     0,      1,      8,     27,     64,    125,    216,    343,
          512,    729,   1000,   1331,   1728,   2197,   2744,   3375,
         4096,   4913,   5832,   6859,   8000,   9261,  10648,  12167,
        13824,  15625,  17576,  19683,  21952,  24389,  27000,  29791,
        32768,  35937,  39304,  42875,  46656,  50653,  54872,  59319,
        64000,  68921,  74088,  79507,  85184,  91125,  97336, 103823,
       110592, 117649, 125000, 132651, 140608, 148877, 157464, 166375,
       175616, 185193, 195112, 205379, 216000, 226981, 238328, 250047,
       262144, 274625, 287496, 300763, 314432, 328509, 343000, 357911,
       373248, 389017, 405224, 421875, 438976, 456533, 474552, 493039,
       512000, 531441, 551368, 571787, 592704, 614125, 636056, 658503,
       681472, 704969, 729000, 753571, 778688, 804357, 830584, 857375,
       884736, 912673, 941192, 970299])
%%timeit 
for i in range(n):
    cube(np.arange(i))
184 µs ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Vectorization improved performance by a factor of more than 6 relative to the procedural loop (~1.23 ms vs. ~184 µs), and by a factor of about 3 relative to map.

The way vectorization works is that it exploits hardware parallelism: because all the work performed is trivially parallel, the runtime is free to split the array into as many blocks as there are available cores in the CPU and execute a subset of the work on each of the blocks in parallel. There is no need to lock anything, because the function that is applied (cube) is a pure function with no side effects, and the blocks are non-overlapping.
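
As an analogy only (this is not how NumPy is implemented internally), the following sketch splits the same pure computation across CPU cores using Python's multiprocessing module - no locks are needed, for exactly the reasons described above:

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool() as pool:  # one worker process per available core by default
        # Each worker receives a non-overlapping chunk of the input;
        # since cube is pure, no locking or coordination is required.
        result = pool.map(cube, range(1000))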

The benefits of vectorization increase as the size of the arrays and the number of cores in the hardware increase. On GPUs, one can exploit thousands of cores and obtain significant speedups when adopting this style of programming. This is one of the key ingredients that has recently enabled significant progress in Artificial Intelligence and Deep Learning.
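
To observe this scaling effect yourself, here is one possible sketch using the standard timeit module (absolute numbers are machine-dependent, so none are given here; the gap between the three versions widens as the input grows):

import timeit

big_list = list(range(1_000_000))
big_array = np.arange(1_000_000)

print(timeit.timeit(lambda: cubes_procedural(big_list), number=10))
print(timeit.timeit(lambda: cubes_map(big_list), number=10))
print(timeit.timeit(lambda: cube(big_array), number=10))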

From a programming language perspective - vectorization is made possible because the language encourages a declarative style, based on the Functional Programming paradigm, and the use of pure functions with no side effects.
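
To illustrate why this matters, consider a function that uses Python-level control flow instead of pure elementwise arithmetic. This sketch (ours, for illustration) fails to broadcast over an array, while the declarative np.where alternative succeeds:

def cube_if_positive(x):
    return x * x * x if x > 0 else 0

# cube_if_positive(a)  # raises ValueError: the truth value of a boolean
#                      # array is ambiguous inside an if expression

np.where(a > 0, cube(a), 0)  # declarative version: broadcasts elementwise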