integrate Intel TBB by wds15 · Pull Request #1376 · stan-dev/math

Summary

This PR integrates the Intel TBB with stan-math. Specifically, it ensures that the TBB threadpool is initialised, and it switches the C++11 threads implementation of map_rect over to tbb::parallel_for, which brings task scheduling and a more efficient threadpool.

To keep the Intel TBB optional in the next release, the TBB code is only used when STAN_THREADS is set. This makes the TBB mandatory only when threads are being used. The Intel TBB will become mandatory in the near future regardless of whether STAN_THREADS is set (which has licensing implications for GPL-2 projects using Stan-math).

The initialisation of the TBB happens in two places:

  • General initialisation of the maximal concurrency level in the TBB threadpool. This is handled by the new init_threadpool_tbb function defined in stan/math/prim/core/init_threadpool_tbb.hpp. The number of threads used is defined in the STAN_NUM_THREADS environment variable.

  • For all TBB worker threads, the AD tape is initialised through an observer which is registered with the Intel TBB. Every time a worker thread enters the threadpool, the observer is called and ensures proper initialisation of the AD tape resource automatically. See stan/math/rev/core/init_chainablestack_tbb.hpp.
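As a rough illustration of the first point, the sketch below shows one way a thread count could be derived from the value of STAN_NUM_THREADS. The helper name num_threads_from_env and the policy it encodes (unset means 1, -1 means all hardware threads, any other non-positive or non-numeric value is an error) are assumptions made for this example; they are not taken from this PR's code, and the sketch does not touch the TBB at all.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper, not the code from this PR: map the raw value of
// STAN_NUM_THREADS to a thread count.
// Assumed policy for this sketch:
//   unset            -> 1 thread
//   "-1"             -> all hardware threads (at least 1)
//   positive integer -> that many threads
//   anything else    -> std::invalid_argument
inline int num_threads_from_env(const char* value) {
  if (value == nullptr) {
    return 1;
  }
  const std::string text(value);
  std::size_t parsed = 0;
  int n = 0;
  try {
    n = std::stoi(text, &parsed);
  } catch (const std::exception&) {
    throw std::invalid_argument("STAN_NUM_THREADS is not an integer");
  }
  if (parsed != text.size()) {
    // Reject trailing characters such as "4x".
    throw std::invalid_argument("STAN_NUM_THREADS is not an integer");
  }
  if (n == -1) {
    // hardware_concurrency() may return 0 if the value is unknown.
    const unsigned hw = std::thread::hardware_concurrency();
    return hw == 0 ? 1 : static_cast<int>(hw);
  }
  if (n > 0) {
    return n;
  }
  throw std::invalid_argument("STAN_NUM_THREADS must be positive or -1");
}
```

A caller would pass the environment value directly, for example num_threads_from_env(std::getenv("STAN_NUM_THREADS")), and hand the result to whatever sets up the threadpool.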

Note: This PR is branched from feature/intel-tbb-lib, so it includes the changes from that branch in addition to the changes introduced here. feature/intel-tbb-lib has been selected as the reference branch to make it easier to follow the changes this PR introduces relative to the base branch.

Performance Tests

All reported times are in seconds.

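| threads | Ubuntu i7 develop | Ubuntu i7 TBB | Ubuntu i7 speedup | Ubuntu AMD TBB | Ubuntu AMD develop | Ubuntu AMD speedup |
|---------|-------------------|---------------|-------------------|----------------|--------------------|--------------------|
| 1       | 166               | 156           | 0.060             | 166.768        | 148.131            | -0.125             |
| 2       | 91                | 84            | 0.077             | 90.459         | 155.651            | 0.418              |
| 4       | 80                | 45            | 0.438             | 49.331         | 93.949             | 0.475              |
| 8       | 76                | 42            | 0.448             | 28.109         | 80.378             | 0.650              |
| 12      |                   |               |                   | 22.028         | 72.776             | 0.697              |
| 14      |                   |               |                   | 22.421         | 72.742             | 0.697              |
| 16      |                   |               |                   | 23.166         | 66.685             | 0.653              |
| 18      |                   |               |                   | 23.056         | 69.001             | 0.666              |
| 24      |                   |               |                   | 23.525         | 80.779             | 0.709              |
| 32      |                   |               |                   | 23.047         | 87.403             | 0.736              |

| threads | Mac i9 develop | Mac i9 TBB | Mac i9 speedup |
|---------|----------------|------------|----------------|
| 1       | 253            | 184        | 0.272          |
| 2       | 145            | 103        | 0.290          |
| 4       | 91             | 64         | 0.297          |
| 6       | 81             | 55         | 0.321          |

| threads | Win i7 develop | Win i7 TBB | Win i7 speedup |
|---------|----------------|------------|----------------|
| 1       | 452            | 464        | -0.027         |
| 2       | 222            | 230        | -0.036         |
| 4       | 148            | 132        | 0.108          |
| 6       | 146            | 116        | 0.205          |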
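
The speedup columns are not defined in the text. The reported values are consistent with the relative reduction in run time compared to develop; this reading is inferred from the numbers, not stated in the PR. Under that assumption, two of the rows work out as follows:

```latex
\text{speedup} = \frac{t_{\text{develop}} - t_{\text{TBB}}}{t_{\text{develop}}}

\text{Ubuntu AMD, 8 threads:}\quad \frac{80.378 - 28.109}{80.378} \approx 0.650

\text{Win i7, 1 thread:}\quad \frac{452 - 464}{452} \approx -0.027
```

With this definition, a negative value means the TBB build was slower than develop for that configuration.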

Tests

  • test/unit/math/prim/core/init_threadpool_tbb_test.cpp tests initialisation of the threadpool

  • test/unit/math/prim/core/init_threadpool_tbb_late_test.cpp tests initialisation of the threadpool in case a C++ user has already initialised the scheduler (in which case nothing happens)

  • test/unit/math/rev/mat/functor/gradient_test.cpp tests that the threaded gradient code runs correctly when using the Intel TBB as the backend. This checks that the AD tape initialisation has taken effect.

Side Effects

The initialisation of the TBB threadpool is optional. If client code does not call the init_threadpool_tbb function, then the first execution of any TBB code triggers default initialisation of the TBB threadpool, which is the default behaviour of the TBB. If the client code first initialises the TBB by explicitly instantiating tbb::task_scheduler_init, then a subsequent call to init_threadpool_tbb has no effect; this is again the default behaviour of the TBB (only the very first call is honoured).

For interfaces it is recommended to make a call to stan::math::init_threadpool_tbb() prior to using Stan-math. This will ensure the proper initialisation of the threadpool and ensure that no more than STAN_NUM_THREADS threads are running in the threadpool.
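
The "only the very first call is honoured" behaviour described above can be pictured with a small toy model. The class below is not the TBB API and not code from this PR; ToyThreadpool and its members are hypothetical names used only to show why an interface should initialise early: whichever caller initialises first fixes the thread count, and later calls are silently ignored.

```cpp
#include <mutex>

// Toy stand-in for a process-wide threadpool setting. This is NOT the TBB
// and NOT stan::math::init_threadpool_tbb; it only models the rule that
// the first initialisation wins.
class ToyThreadpool {
 public:
  // Returns true if this call fixed the thread count, false if an earlier
  // call already did (n_threads is then ignored).
  // Assumes n_threads > 0; 0 is reserved for "not yet initialised".
  static bool init(int n_threads) {
    std::lock_guard<std::mutex> lock(mutex());
    if (count() != 0) {
      return false;
    }
    count() = n_threads;
    return true;
  }

  // Thread count fixed by the first init() call, or 0 if none happened.
  static int num_threads() {
    std::lock_guard<std::mutex> lock(mutex());
    return count();
  }

 private:
  static std::mutex& mutex() {
    static std::mutex m;
    return m;
  }
  static int& count() {
    static int c = 0;
    return c;
  }
};
```

In this model, calling ToyThreadpool::init(4) followed by ToyThreadpool::init(8) leaves the count at 4, which mirrors the situation where client code sets up the scheduler before init_threadpool_tbb is called.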

Todo

  • agreement on design
  • a few more tests for init_threadpool_tbb
  • agree on the fate of stan::math::get_num_threads (moved to stan/math/prim/scal/fun/get_num_threads.hpp), see comments within

Checklist

  • Math issue #(issue number)

  • Copyright holder: Sebastian Weber

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass (make test-headers)
    • docs build (make doxygen)
    • code passes the built-in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested