Introduction
The Python library numba provides, among other things, just-in-time (jit) compilation for Python. Python code can gain tremendous speed-ups without the need to rewrite a single line of it. Simply adding a decorator to a function, as illustrated in this example:
from numba import njit

@njit
def accelerate_me():
    ...
can lead to run times comparable to C or Fortran code.
Let's look at a concrete example, where we add random numbers in a for loop (not advisable in practice, but useful here to demonstrate numba).
In the code below, we define a function add_numbers() which creates a numpy array of 100000 random numbers between 0 and 1. A for loop then iterates over the array's elements, adding their values to a variable result, which is then returned.
The function is evaluated 1000 times, and each evaluation is timed by taking the difference of two timestamps, one taken immediately before the evaluation and one immediately after it. The timings of all evaluations are stored in a list.
Finally, the mean value, standard deviation, minimum and maximum of all timings are reported as a dictionary.
import numpy
import time

def add_numbers():
    """ Create a random 1D array and sum its elements in a for loop. """
    a = numpy.random.random(100000)
    result = 0
    for i in range(len(a)):
        result += a[i]
    return result

timings = []
N = 1000
for i in range(N):
    start = time.time()
    add_numbers()
    end = time.time()
    timings.append(end - start)

timings = numpy.array(timings)
avg = timings.mean()
rms = timings.std()
mx = timings.max()
mn = timings.min()
print({
    'avg': avg,
    'rms': rms,
    'mx': mx,
    'mn': mn,
})
| avg | 0.014892277717590332 |
| rms | 0.0018411326207627022 |
| mx | 0.0355679988861084 |
| mn | 0.012987852096557617 |
On average, it takes 0.015 seconds to evaluate the function add_numbers().
Now, we will employ numba's just-in-time compilation. All we have to do is import the njit decorator and decorate our function add_numbers() with it. The rest of the code stays the same:
import numpy
import time
from numba import njit

@njit
def add_numbers():
    a = numpy.random.random(100000)
    result = 0
    for i in range(len(a)):
        result += a[i]
    return result

timings = []
N = 1000
for i in range(N):
    start = time.time()
    add_numbers()
    end = time.time()
    timings.append(end - start)

timings = numpy.array(timings)
avg = timings.mean()
rms = timings.std()
mx = timings.max()
mn = timings.min()
print({
    'avg': avg,
    'rms': rms,
    'mx': mx,
    'mn': mn,
})
| avg | 0.0013004329204559326 |
| rms | 0.013805845813135468 |
| mx | 0.43764829635620117 |
| mn | 0.0007064342498779297 |
Et voilà, the average run time has decreased by an order of magnitude (1.3e-3 s vs. 1.4e-2 s). Notably, the longest run with numba takes 0.44 seconds, versus 0.036 seconds without numba. I suspect that the first evaluation of add_numbers() takes significantly longer because the just-in-time compilation, which takes place on this first encounter of add_numbers(), adds a significant amount of overhead. Subsequent evaluations re-use the already compiled code and hence complete in much shorter time.
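A common way to exclude this one-off compilation cost from the measurement is a warm-up call: evaluate the function once before starting the timing loop, so the compiled code is already in place when the first measurement is taken. A minimal sketch of this approach:

import numpy
import time
from numba import njit

@njit
def add_numbers():
    a = numpy.random.random(100000)
    result = 0.0
    for i in range(len(a)):
        result += a[i]
    return result

add_numbers()  # warm-up call: triggers the jit compilation once

timings = []
for i in range(1000):
    start = time.time()
    add_numbers()
    end = time.time()
    timings.append(end - start)
# all 1000 measurements now reflect only the compiled code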
Let's plot the numba timings in the order in which they were recorded:
import numpy
import time
from matplotlib import pyplot
from numba import njit

@njit
def add_numbers():
    a = numpy.random.random(100000)
    result = 0
    for i in range(len(a)):
        result += a[i]
    return result

timings = []
N = 1000
for i in range(N):
    start = time.time()
    add_numbers()
    end = time.time()
    timings.append(end - start)

timings = numpy.array(timings)
pyplot.semilogy(timings)
pyplot.xlabel('# evaluation')
pyplot.ylabel('Wall time (seconds)')
pyplot.title('Wall time for 1000 evaluations with numba jit.')
plot_fname = 'numba_timings.png'
pyplot.savefig(plot_fname)
This confirms my suspicion: the first evaluation, which includes the numba compilation, takes more than two orders of magnitude longer than each of the subsequent 999 evaluations.
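Note that this compilation cost is paid again every time the Python interpreter is restarted. numba can persist the compiled machine code to disk via the decorator's cache option, so that later runs of the same script load it from a file-based cache instead of recompiling. A brief sketch, using the same function as above:

import numpy
from numba import njit

@njit(cache=True)  # persist the compiled machine code on disk
def add_numbers():
    a = numpy.random.random(100000)
    result = 0.0
    for i in range(len(a)):
        result += a[i]
    return result

# The first run of the script still compiles; subsequent runs of the
# script re-use the cached machine code and skip the compilation.
add_numbers()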
Numba and numpy
Using numba in code that offloads the heavy lifting to highly optimized libraries such as numpy is a completely different kind of beast. For its linear algebra routines, numpy delegates to LAPACK/BLAS implementations written in C and Fortran, leaving little to no room for improvement through jit, and potentially even degrading performance due to the compilation overhead involved.
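The same logic applies to the summation example from above: numpy's built-in sum already runs as compiled C code, so a jit-compiled Python loop has little headroom over it. A minimal sketch for comparison (timing a single call only, for brevity):

import numpy
import time

a = numpy.random.random(100000)

start = time.time()
total = a.sum()  # delegates to numpy's compiled implementation
end = time.time()

print('numpy.sum:', total, 'computed in', end - start, 'seconds')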
Consider the following example, where the function to be evaluated now diagonalizes a matrix using numpy.linalg.eigh().
import numpy
import time
# from numba import njit

def get_eigen(A):
    eigenvalues, eigenvectors = numpy.linalg.eigh(A)
    return eigenvalues, eigenvectors

# Note: eigh() assumes a Hermitian matrix and only reads one triangle
# of A, so an unsymmetrized random matrix is fine for timing purposes.
A = numpy.random.random(size=(100, 100))

timings = []
N = 1000
for i in range(N):
    start = time.time()
    eigenvalues, eigenvectors = get_eigen(A)
    end = time.time()
    timings.append(end - start)

timings = numpy.array(timings)
avg = timings.mean()
rms = timings.std()
mx = timings.max()
mn = timings.min()
print({
    'avg': avg,
    'rms': rms,
    'mx': mx,
    'mn': mn,
})
| avg | 0.002889450788497925 |
| rms | 0.00015085469247274813 |
| mx | 0.0037212371826171875 |
| mn | 0.0025556087493896484 |
The code (without numba) finishes in 0.003 seconds, on average.
Now, let's add the numba jit decorator:
import numpy
import time
from numba import njit

@njit
def get_eigen(A):
    eigenvalues, eigenvectors = numpy.linalg.eigh(A)
    return eigenvalues, eigenvectors

A = numpy.random.random(size=(100, 100))

timings = []
N = 1000
for i in range(N):
    start = time.time()
    eigenvalues, eigenvectors = get_eigen(A)
    end = time.time()
    timings.append(end - start)

timings = numpy.array(timings)
avg = timings.mean()
rms = timings.std()
mx = timings.max()
mn = timings.min()
print({
    'avg': avg,
    'rms': rms,
    'mx': mx,
    'mn': mn,
})
| avg | 0.0036052455902099607 |
| rms | 0.02540786315103151 |
| mx | 0.8066534996032715 |
| mn | 0.0024809837341308594 |
No speedup; au contraire, we find a slight performance decrease on average. The minimum run times (2.5e-3 s with numba vs. 2.6e-3 s without) suggest that the steady-state performance is essentially identical, and that the degraded average is dominated by the one-off compilation in the very first call (note the maximum of 0.81 s).
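If we are interested in the steady-state numbers, one option is to drop the first measurement, which contains the one-off compilation, before computing the statistics. A small helper along these lines (the function name is my own):

import numpy

def steady_state_stats(timings):
    """Timing statistics excluding the first measurement, which
    includes numba's one-off jit compilation."""
    t = numpy.asarray(timings[1:])
    return {'avg': t.mean(), 'rms': t.std(), 'mx': t.max(), 'mn': t.min()}

With the compilation outlier removed, I would expect the averages of the jit-compiled and the plain numpy variants to end up very close to each other.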
Numba on HPC
Things get even more complicated if the numba code is run on an HPC system. TBC…