Numba and (vs.) SLURM

Introduction

The Python library numba provides, among other things, just-in-time (jit) compilation for Python. Python code can gain tremendous speed-ups without having to be rewritten: simply adding a decorator to a function, as illustrated in this example:

from numba import njit

@njit
def accelerate_me():
    ...

can lead to run times comparable to C or Fortran code.
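
As an aside, njit is simply an alias for jit(nopython=True); the following two toy functions (placeholders for illustration) are therefore compiled identically:

  from numba import jit, njit

  @jit(nopython=True)   # compile to machine code; raise an error instead of falling back to object mode
  def f(x):
      return 2 * x

  @njit                 # alias for jit(nopython=True)
  def g(x):
      return 2 * x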

Let's look at a concrete example, in which we sum random numbers in a for loop (not advisable in practice, but instructive for demonstrating numba).

In the code below, we define a function add_numbers() which creates a numpy array of 100000 random numbers between 0 and 1. A for loop then iterates over the array's elements, adding their values to a variable result, which is finally returned.

The function is then evaluated 1000 times, and each evaluation is timed by taking the difference of two timestamps, one taken immediately before the evaluation and one immediately after it. The timings of all evaluations are stored in a list.

Finally, the mean value, standard deviation, minimum and maximum of all timings are reported as a dictionary.

  import numpy
  import time

  def add_numbers():
      """ Create a random 1D array and sum its elements in a for loop. """

      a = numpy.random.random(100000)

      result = 0
      for i in range(len(a)):
          result += a[i]

      return result

  timings = []
  N = 1000
  for i in range(N):
      start = time.time()
      add_numbers()
      end = time.time()
      timings.append(end-start)

  timings = numpy.array(timings)
  avg = timings.mean()
  rms = timings.std()
  mx = timings.max()
  mn = timings.min()

  return {
      'avg': avg,
      'rms': rms,
      'mx': mx,
      'mn': mn,
  }
avg : 0.014892277717590332 rms : 0.0018411326207627022 mx : 0.0355679988861084 mn : 0.012987852096557617

On average, it takes 0.015 seconds to evaluate add_numbers().
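
A side note on the measurement itself: time.time() has limited resolution for intervals this short, and the standard library recommends time.perf_counter() for benchmarking. A minimal sketch of the same timing pattern, with work() as a hypothetical stand-in for add_numbers():

  import time

  def work():
      # hypothetical stand-in for add_numbers() above
      return sum(range(100000))

  start = time.perf_counter()   # monotonic clock with the best available resolution
  work()
  end = time.perf_counter()
  print(f"wall time: {end - start:.6f} s")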

Now we will employ numba's just-in-time compilation. All we have to do is import the njit decorator and decorate add_numbers() with it. The rest of the code stays the same:

  import numpy
  import time
  from numba import njit


  @njit
  def add_numbers():
      a = numpy.random.random(100000)
      result = 0
      for i in range(len(a)):
          result += a[i]

      return result

  timings = []
  N = 1000
  for i in range(N):
      start = time.time()
      add_numbers()
      end = time.time()
      timings.append(end-start)

  timings = numpy.array(timings)
  avg = timings.mean()
  rms = timings.std()
  mx = timings.max()
  mn = timings.min()

  return {
      'avg': avg,
      'rms': rms,
      'mx': mx,
      'mn': mn,
  }
avg : 0.0013004329204559326 rms : 0.013805845813135468 mx : 0.43764829635620117 mn : 0.0007064342498779297

Et voilà, the average run time has decreased by an order of magnitude (1.3e-3 s vs. 1.4e-2 s). Notably, the longest run with numba takes 0.44 seconds, vs. 0.036 seconds without numba. I suspect that the first evaluation of add_numbers() takes significantly longer because the just-in-time compilation, which happens at this first invocation, adds a significant amount of overhead. Subsequent evaluations re-use the compiled code and hence complete much faster.
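
One way to test this suspicion (a sketch, not part of the benchmark above) is to trigger the compilation with a single untimed warm-up call and only start timing afterwards:

  import numpy
  import time
  from numba import njit


  @njit
  def add_numbers():
      a = numpy.random.random(100000)
      result = 0.0
      for i in range(len(a)):
          result += a[i]
      return result

  add_numbers()                    # untimed warm-up call; triggers the jit compilation

  start = time.time()
  add_numbers()                    # re-uses the already compiled machine code
  print('wall time:', time.time() - start)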

Let's plot the numba timings in the order they were recorded:

  import numpy
  import time
  from matplotlib import pyplot
  from numba import njit


  @njit
  def add_numbers():
      a = numpy.random.random(100000)
      result = 0
      for i in range(len(a)):
          result += a[i]

      return result

  timings = []
  N = 1000
  for i in range(N):
      start = time.time()
      add_numbers()
      end = time.time()
      timings.append(end-start)

  timings = numpy.array(timings)
  avg = timings.mean()
  rms = timings.std()
  mx = timings.max()
  mn = timings.min()

  pyplot.semilogy(timings)
  pyplot.xlabel('# evaluation')
  pyplot.ylabel('Wall time (seconds)')
  pyplot.title('Wall time for 1000 evaluations with numba jit.')
  plot_fname = 'numba_timings.png'
  pyplot.savefig(plot_fname)

  return plot_fname

[Figure numba_timings.png: wall time for 1000 evaluations with numba jit, log scale]

This confirms my suspicion: the first evaluation, which includes the numba compilation, takes more than two orders of magnitude longer than the subsequent 999 evaluations.
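
If this one-off compilation cost matters, e.g. for short-lived scripts, numba offers two ways to move it out of the measured run. A sketch, using a stand-alone copy of the summation loop:

  from numba import njit

  # Option 1: cache the compiled machine code on disk, so that later runs
  # of the script skip the recompilation (requires a writable cache directory).
  @njit(cache=True)
  def summed(a):
      result = 0.0
      for i in range(len(a)):
          result += a[i]
      return result

  # Option 2: eager compilation. An explicit signature makes numba compile
  # at decoration time rather than at the first call.
  @njit('float64(float64[:])')
  def summed_eager(a):
      result = 0.0
      for i in range(len(a)):
          result += a[i]
      return result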

Numba and numpy

Using numba in code that offloads the heavy lifting to highly optimized libraries such as numpy is a completely different kind of beast. Numpy delegates its mathematical algorithms to highly optimized implementations in C and Fortran (e.g. BLAS and LAPACK), leaving little to no room for improvement through jit compilation, and potentially even degrading performance due to the compilation overhead involved.
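
In fact, the summation example from above already illustrates this point: numpy ships a compiled sum that makes the explicit loop, and hence the jit, unnecessary. A minimal sketch:

  import numpy

  a = numpy.random.random(100000)
  result = a.sum()   # the loop from add_numbers(), executed in numpy's compiled C code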

Consider the following example, where the function to be evaluated now performs a matrix diagonalization using numpy.linalg.

  import numpy
  import time

  def get_eigen(A):
      eigenvalues, eigenvectors = numpy.linalg.eigh(A)

      return eigenvalues, eigenvectors

  A = numpy.random.random(size=(100,100))  # note: eigh only reads one triangle of A, so symmetry is not required


  timings = []

  N = 1000
  for i in range(N):
      start = time.time()
      eigenvalues, eigenvectors = get_eigen(A)
      end = time.time()

      timings.append(end-start)

  timings = numpy.array(timings)
  avg = timings.mean()
  rms = timings.std()
  mx = timings.max()
  mn = timings.min()

  return {
      'avg': avg,
      'rms': rms,
      'mx': mx,
      'mn': mn,
  }
avg : 0.002889450788497925 rms : 0.00015085469247274813 mx : 0.0037212371826171875 mn : 0.0025556087493896484

The code (without numba) finishes in 0.003 seconds, on average.

Now, let's add the numba jit decorator:

  import numpy
  import time
  from numba import njit

  @njit
  def get_eigen(A):
      eigenvalues, eigenvectors = numpy.linalg.eigh(A)

      return eigenvalues, eigenvectors

  A = numpy.random.random(size=(100,100))


  timings = []

  N = 1000
  for i in range(N):
      start = time.time()
      eigenvalues, eigenvectors = get_eigen(A)
      end = time.time()

      timings.append(end-start)

  timings = numpy.array(timings)
  avg = timings.mean()
  rms = timings.std()
  mx = timings.max()
  mn = timings.min()

  return {
      'avg': avg,
      'rms': rms,
      'mx': mx,
      'mn': mn,
  }
avg : 0.0036052455902099607 rms : 0.02540786315103151 mx : 0.8066534996032715 mn : 0.0024809837341308594

No speedup; au contraire, we find a slight performance decrease (on average).
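
The maximum of 0.81 seconds again points at the compilation triggered by the first call. A sketch of how one could split that one-off cost from the steady-state behaviour, re-using the timings array produced by the block above:

  # `timings` is assumed to be the numpy array from the benchmark above
  first, rest = timings[0], timings[1:]
  print('first call (incl. jit compilation):', first)
  print('average of the remaining calls:', rest.mean())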

Numba on HPC

Things get even more complicated if the numba code is run on an HPC system. TBC…