High-Performance Computing with Python
Course Content
Day 1: Profiling, Algorithms and Parallel Computation
Profiling
One of the most important steps toward a fast program is profiling to find out where the program spends its time. Several tools for Python help you quantify the run times of your program. The course gives an introduction to this topic.
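As a rough illustration of what such a measurement can look like, the sketch below uses the standard library modules cProfile and pstats. The functions slow_sum and workload are hypothetical stand-ins for real application code and are not taken from the course material.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical hot spot: sum of squares in a plain Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    """Hypothetical program entry point that calls the hot spot repeatedly."""
    return [slow_sum(20_000) for _ in range(20)]


# Collect timing data only while the workload runs.
profiler = cProfile.Profile()
profiler.enable()
results = workload()
profiler.disable()

# Write the report into a string, sorted by cumulative time,
# and limit it to the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists call counts and times per function, which is usually enough to see which function deserves attention first.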
Algorithms
Often the best speed improvements can be achieved by finding a better algorithm. Python offers several data structures that come with efficient algorithms. The course gives an overview of common Python data structures and their run-time complexity.
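A small sketch of why the choice of data structure matters: a membership test scans a list element by element, while a set uses hashing. The sizes and repeat counts below are arbitrary assumptions chosen only to make the difference visible.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
needle = n - 1  # last element: worst case for the linear list scan

# "in" on a list is O(n); "in" on a set is O(1) on average.
list_time = timeit.timeit(lambda: needle in as_list, number=100)
set_time = timeit.timeit(lambda: needle in as_set, number=100)

print(f"list membership: {list_time:.6f} s for 100 lookups")
print(f"set membership:  {set_time:.6f} s for 100 lookups")
```

The absolute numbers depend on the machine, but the gap between the two grows with n.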
Multiprocessing
Python comes with the multiprocessing module, which lets you distribute
calculations over several processes and thereby parallelize applications.
Its API is closely modeled after that of the threading module. You will learn
how to use multiprocessing.
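A minimal sketch of distributing work with multiprocessing.Pool, assuming a CPU-bound function that can be applied to each input independently. The names square and parallel_squares are hypothetical and only serve the illustration.

```python
from multiprocessing import Pool


def square(x):
    """Hypothetical unit of work; must live at module level so it can be pickled."""
    return x * x


def parallel_squares(values, processes=2):
    """Map square over values using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block
    # on platforms that start workers by importing the main module.
    print(parallel_squares(range(10)))
```

For very cheap functions such as this one, the cost of starting processes and transferring data outweighs the gain; the pattern pays off when each call does substantial work.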
Pyro4
There are many libraries for Python to do distributed programming for clusters or networked computers. Pyro is a very mature solution. The course introduces version 4 with some examples.
Day 2: Beyond Pure Python
Python is a very good glue language for connecting existing systems. There is a long tradition of writing modules in other languages. There are also some newer developments that increase the usefulness of Python for HPC. The course presents some of them.
Numba
Numba is a new module that is still undergoing considerable changes. It compiles pure Python code into machine code via LLVM. This means that many pure Python algorithms can run as fast as if they had been written in C. The course shows how Numba works and presents some implementations of algorithms.
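The following sketch shows the general shape of a Numba-compiled loop using the njit decorator. The function sum_of_squares is a hypothetical example, and the fallback decorator is an assumption added here so that the snippet still runs (uncompiled) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # third-party; compiles the function via LLVM
except ImportError:
    def njit(func):
        """Fallback (assumption for this sketch): run as plain Python."""
        return func


@njit
def sum_of_squares(values):
    """Explicit loop that Numba can turn into machine code."""
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total


data = np.arange(10_000, dtype=np.float64)
print(sum_of_squares(data))
```

With Numba available, the first call includes compilation time; later calls with the same argument types reuse the compiled code.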
PyPy
PyPy is an alternative implementation of Python that has a just-in-time (JIT) compiler. It is a fully Python 2.7-compliant implementation with several very innovative features. The course introduces working with PyPy.
f2py
Wrapping existing Fortran programs becomes much simpler with f2py. You will learn how to use f2py to wrap Fortran programs. Furthermore, the course covers accessing common memory in Fortran modules and calling Python functions from wrapped Fortran.
Day 3: NumPy for Fast Computations
The NumPy library is the de facto standard for working with arrays. You will get a solid introduction to NumPy and learn some of its more advanced features.
Introduction to NumPy
Array construction and array properties
Data types
Slicing and broadcasting
Universal functions
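The topics listed above can be sketched in a few lines. The array grid and the derived names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Array construction and properties
grid = np.arange(12, dtype=np.float64).reshape(3, 4)
print(grid.shape, grid.ndim, grid.dtype)

# Data types: convert to 32-bit integers (creates a new array)
ints = grid.astype(np.int32)

# Slicing: rows 0-1, every second column (a view, not a copy)
first_rows = grid[:2, ::2]

# Broadcasting: the (4,) vector of column means is stretched over 3 rows
col_means = grid.mean(axis=0)
centered = grid - col_means

# Universal function: element-wise square root without a Python loop
roots = np.sqrt(grid)
```

Each operation works on whole arrays at once, which is where NumPy gets its speed advantage over explicit Python loops.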
Advanced NumPy
Masked arrays
Customizing error handling
Testing NumPy programs
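A short sketch touching the three topics above. The sentinel value -999.0 for missing readings is an assumption made for this example.

```python
import numpy as np
import numpy.testing as npt

# Masked arrays: hide invalid entries so they are ignored in computations
readings = np.array([1.0, -999.0, 3.0, 5.0])
masked = np.ma.masked_values(readings, -999.0)
mean_valid = masked.mean()  # mean over the three unmasked values

# Customizing error handling: turn division by zero into an exception
with np.errstate(divide="raise"):
    try:
        np.array([1.0]) / np.array([0.0])
        raised = False
    except FloatingPointError:
        raised = True

# Testing: compare floating-point results with a tolerance
npt.assert_allclose(mean_valid, 3.0)
```

Outside the errstate block, NumPy falls back to its default behavior of issuing a warning and returning inf.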
NumPy and C
A look into the implementation of ndarrays
Working with ndarrays from C
Numexpr
Numexpr can evaluate numerical expressions such as 5 * a + 3 * b - 2 * c. Evaluation, especially of
complex expressions, is faster and uses less memory than computing the same expressions with NumPy.
Numexpr can run evaluations in parallel on multiple cores. It also supports the Math Kernel Library (MKL) for further speed improvements.
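The expression mentioned above could be evaluated as sketched below. The arrays a, b and c are hypothetical, and the fallback to plain NumPy is an assumption added so that the snippet runs even when Numexpr is not installed.

```python
import numpy as np

try:
    import numexpr as ne  # third-party, optional for this sketch
except ImportError:
    ne = None

a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)
c = np.full_like(a, 2.0)

# Plain NumPy allocates a temporary array for each intermediate result.
expected = 5 * a + 3 * b - 2 * c

if ne is not None:
    # Numexpr parses the string and evaluates it in blocks,
    # avoiding the full-size temporaries.
    result = ne.evaluate(
        "5 * a + 3 * b - 2 * c",
        local_dict={"a": a, "b": b, "c": c},
    )
else:
    result = expected  # fallback: Numexpr not available
```

Both paths produce the same values; the difference lies in memory use and run time for large arrays.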
Algorithms and SciPy
Examples of algorithms in NumPy and their counterparts in SciPy showcase solutions for common numerical problems.
Day 4: Cython for Speed
My first Cython extension
using pyximport to quickly (re-)build extension modules
using cython.inline() to compile code at runtime
building extension modules with distutils
Speeding up Python code with Cython
fast access to Python’s builtin types
fast looping over Python iterables and C types
string processing
fast arithmetic
incrementally optimizing Cython code
multi-threading outside of the GIL (Global Interpreter Lock)
Interfacing with external C code
calling into external C libraries
building against C libraries
writing Python wrapper APIs
calling C functions across extension module boundaries
Day 5: Cython and NumPy
Use of Python’s buffer interface from Cython code
directly accessing data buffers of other Python extensions
retrieving meta data about the buffer layout
setting up efficient memory views on external buffers
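The buffer interface can also be explored from plain Python, which may help before moving to Cython. Note that the sketch below uses Python's built-in memoryview on an array.array, not Cython's typed memoryviews; the variable names are hypothetical.

```python
from array import array

# array.array exports its data through the buffer interface
samples = array("d", [1.0, 2.0, 3.0, 4.0])
view = memoryview(samples)

# Metadata about the buffer layout
print(view.format, view.itemsize, view.shape, view.strides, view.readonly)

# Writing through the view changes the underlying array (no copy involved)
view[0] = 10.0

# Slicing a memoryview yields another view onto the same memory
tail = view[2:]
print(tail.tolist())
```

Cython's memory views expose the same layout information (item size, shape, strides) at C speed.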
Implementing fast Cython loops over NumPy arrays
looping over NumPy exported buffers
implementing a simple image processing algorithm
using “fused types” (simple templating) to implement an algorithm once and run it efficiently on different C data types
Use of parallel loops to make use of multiple processing cores
building modules with OpenMP
processing data in parallel
speeding up an existing loop using OpenMP threads
Case studies
The participants are encouraged to send in short code examples from their own experience that they would like to see running faster by using Cython. Based on general interest and practicality, one or two of these examples will be examined as a case study. These examples must be available to the teacher at least one week before the course, and must be short but complete executable examples, including sufficient input data for benchmarking. Please be aware that example code that requires a substantial amount of explanation or background knowledge about a specific application domain will not be accepted.