High-Performance Computing with Python

Course Content

Day 1: Profiling, Algorithms and Parallel Computation

Profiling

One of the most important steps toward a fast program is profiling: finding out where your program spends its time. Several tools for Python help you measure the run times of your program. The course gives an introduction to this topic.
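
As a rough illustration of what profiling looks like in practice, the sketch below uses cProfile and pstats from the standard library. The function slow_sum_of_squares and the helper profile_call are made-up names for this example and are not part of the course material.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive: builds a full list before summing it."""
    squares = []
    for i in range(n):
        squares.append(i * i)
    return sum(squares)


def profile_call(func, *args):
    """Run func(*args) under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and print at most five entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_call(slow_sum_of_squares, 100_000)
print(report)
```

The report lists each function with its call count and its own and cumulative time, which is usually enough to decide where optimization effort is worthwhile.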

Algorithms

Often the best speed improvements come from choosing a better algorithm. Python offers several data structures that come with efficient algorithms. The course gives an overview of common Python data structures and their run-time complexity.
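
To make the point about run-time complexity concrete, here is a small sketch comparing a membership test on a list (linear search) with the same test on a set (hash lookup). The sizes and repeat counts are arbitrary choices for illustration.

```python
import timeit

n = 20_000
as_list = list(range(n))
as_set = set(as_list)
missing = -1  # worst case for the list: every element is compared

# "in" on a list scans the elements one by one: O(n).
list_time = timeit.timeit(lambda: missing in as_list, number=200)

# "in" on a set is a hash lookup: O(1) on average.
set_time = timeit.timeit(lambda: missing in as_set, number=200)

print(f"list: {list_time:.5f} s, set: {set_time:.5f} s")
```

Both tests give the same answer, but the cost of the list version grows with the number of elements while the set version stays roughly constant.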

Multiprocessing

Python comes with the multiprocessing module, which lets you distribute calculations over several processes and thereby parallelize applications. Its API is closely modeled after that of the threading module. You will learn how to use multiprocessing.
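
A minimal sketch of distributing a CPU-bound calculation over worker processes with multiprocessing.Pool could look like this. The function count_primes and the chosen limits are hypothetical and serve only as a workload.

```python
import multiprocessing


def count_primes(limit):
    """Count the primes below limit by trial division (CPU-bound on purpose)."""
    count = 0
    for candidate in range(2, limit):
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            count += 1
    return count


def main():
    limits = [10_000, 20_000, 30_000, 40_000]
    # Pool.map hands the inputs to the worker processes and collects
    # the results in input order.
    with multiprocessing.Pool(processes=2) as pool:
        parallel = pool.map(count_primes, limits)
    serial = [count_primes(limit) for limit in limits]
    print(parallel == serial, parallel)


if __name__ == "__main__":
    # The guard is needed on platforms that start workers by importing
    # this module again.
    main()
```

For finer control, multiprocessing.Process(target=..., args=...) with start() and join() mirrors the way threading.Thread is used.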

Pyro4

There are many Python libraries for distributed programming on clusters or networked computers. Pyro is a very mature solution. The course introduces version 4 with some examples.

Day 2: Beyond Pure Python

Python is a very good glue language for connecting existing systems. There is a long tradition of writing modules in other languages. There are also some newer developments that increase the usefulness of Python for HPC. The course presents some of them.

Numba

Numba is a new module that is still undergoing considerable changes. It compiles pure Python code into machine code via LLVM. This means that many pure Python algorithms can run as fast as if they had been written in C. The course shows how Numba works and presents some implementations of algorithms.

PyPy

PyPy is an alternative Python implementation with a just-in-time (JIT) compiler. It is fully compliant with Python 2.7 and has several very innovative features. The course introduces working with PyPy.

f2py

Wrapping existing Fortran programs becomes much simpler with f2py. You will learn how to use f2py to wrap Fortran programs. Furthermore, the course covers accessing common memory in Fortran modules and calling Python functions from wrapped Fortran.

Day 3: NumPy for Fast Computations

The NumPy library is the de facto standard for working with arrays. You will get a solid introduction to NumPy and learn some of its more advanced features.

Introduction to NumPy

  • Array construction and array properties

  • Data types

  • Slicing and broadcasting

  • Universal functions
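
The four topics above can be sketched in a few lines. The array contents and variable names are arbitrary and chosen only for illustration.

```python
import numpy as np

# Array construction and array properties
grid = np.arange(12, dtype=np.float64).reshape(3, 4)
print(grid.shape, grid.ndim, grid.dtype)

# Data types: astype creates a new array with the requested type
as_int = grid.astype(np.int32)

# Slicing returns a view onto the same memory, not a copy
second_column = grid[:, 1]
second_column[0] = 100.0  # this also changes grid[0, 1]

# Broadcasting: a (3, 4) array minus a (4,) array of column means
centered = grid - grid.mean(axis=0)

# Universal functions work element-wise in compiled code
roots = np.sqrt(np.abs(centered))
print(roots)
```

Note that as_int is unaffected by the later assignment because astype copies the data, whereas the slice shares memory with grid.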

Advanced NumPy

  • Masked arrays

  • Customizing error handling

  • Testing NumPy programs
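
As a small, hedged sketch of how these three topics fit together: np.errstate controls how floating-point errors are reported, np.ma.masked_invalid hides invalid entries, and np.testing compares results with a tolerance. The sample values are invented for this example.

```python
import numpy as np

readings = np.array([1.0, 4.0, -9.0, 16.0])

# Customizing error handling: turn the invalid-value warning into an exception
try:
    with np.errstate(invalid="raise"):
        np.sqrt(readings)
    raised = False
except FloatingPointError:
    raised = True

# Masked arrays: hide the invalid entry instead of failing
with np.errstate(invalid="ignore"):
    roots = np.ma.masked_invalid(np.sqrt(readings))
mean_of_valid = roots.mean()  # the masked element is left out

# Testing: compare floating-point results with a tolerance
np.testing.assert_allclose(mean_of_valid, (1.0 + 2.0 + 4.0) / 3.0)
print(raised, roots, mean_of_valid)
```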

NumPy and C

  • A look into the implementation of ndarrays

  • Working with ndarrays from C

Numexpr

Numexpr can evaluate numerical expressions such as 5 * a + 3 * b - 2 * c. The evaluations, especially of complex expressions, are faster and use less memory than the equivalent NumPy calculations.

Numexpr can evaluate expressions in parallel on multiple cores. It also supports the Math Kernel Library (MKL) for further speed improvements.
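
The expression mentioned above could be evaluated roughly as follows, assuming numexpr and NumPy are installed. The array size, the random input data and the thread count of 2 are arbitrary choices for this sketch.

```python
import numexpr as ne
import numpy as np

rng = np.random.default_rng(0)
a, b, c = (rng.random(1_000_000) for _ in range(3))

# NumPy evaluates the expression operation by operation and allocates
# a temporary array for each intermediate result.
numpy_result = 5 * a + 3 * b - 2 * c

# Numexpr receives the whole expression as a string and evaluates it in
# one pass; the operands are passed explicitly via local_dict here.
numexpr_result = ne.evaluate(
    "5 * a + 3 * b - 2 * c",
    local_dict={"a": a, "b": b, "c": c},
)

# set_num_threads changes the number of worker threads and returns
# the previous setting.
previous_threads = ne.set_num_threads(2)
print(np.allclose(numpy_result, numexpr_result), previous_threads)
```

Whether this is actually faster than plain NumPy depends on the array size and the machine, so it is worth timing both variants on realistic data.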

Algorithms and SciPy

Examples of algorithms in NumPy and of routines in SciPy show how to solve common numerical problems.

Day 4: Cython for Speed

My first Cython extension

  • using pyximport to quickly (re-)build extension modules

  • using cython.inline() to compile code at runtime

  • building extension modules with distutils

Speeding up Python code with Cython

  • fast access to Python’s builtin types

  • fast looping over Python iterables and C types

  • string processing

  • fast arithmetic

  • incrementally optimizing Cython code

  • multi-threading outside of the GIL (Global Interpreter Lock)

Interfacing with external C code

  • calling into external C libraries

  • building against C libraries

  • writing Python wrapper APIs

  • calling C functions across extension module boundaries

Day 5: Cython and NumPy

Use of Python’s buffer interface from Cython code

  • directly accessing data buffers of other Python extensions

  • retrieving metadata about the buffer layout

  • setting up efficient memory views on external buffers

Implementing fast Cython loops over NumPy arrays

  • looping over NumPy exported buffers

  • implementing a simple image processing algorithm

  • using “fused types” (simple templating) to implement an algorithm once and run it efficiently on different C data types

Use of parallel loops to make use of multiple processing cores

  • building modules with OpenMP

  • processing data in parallel

  • speeding up an existing loop using OpenMP threads

Case studies

The participants are encouraged to send in short code examples from their own experience that they would like to see running faster by using Cython. Based on general interest and practicality, one or two of these examples will be examined as a case study. These examples must be available to the teacher at least one week before the course, and must be short but complete executable examples, including sufficient input data for benchmarking. Please be aware that example code that requires a substantial amount of explanation or background knowledge about a specific application domain will not be accepted.