integrate Intel TBB by wds15 · Pull Request #1376 · stan-dev/math
Summary
This PR integrates the Intel TBB with Stan-math. Specifically, it ensures that the TBB threadpool is initialised, and it switches the map_rect C++11 threads implementation over to tbb::parallel_for (which provides scheduling and a more efficient threadpool).
In order to keep the Intel TBB optional in the next release, the TBB code is only used when STAN_THREADS is set. This makes the TBB mandatory only when threads are being used. The Intel TBB will become mandatory in the near future regardless of whether STAN_THREADS is set (which has licensing implications for GPL-2 projects using Stan-math).
The initialisation of the TBB happens in two places:
- General initialisation of the maximal concurrency level in the TBB threadpool. This is handled by the new `init_threadpool_tbb` function defined in `stan/math/prim/core/init_threadpool_tbb.hpp`. The number of threads used is defined in the `STAN_NUM_THREADS` environment variable.
- For all TBB worker threads, the AD tape is initialised through an observer which is registered with the Intel TBB. Every time a worker thread enters the threadpool, the observer is called and ensures proper initialisation of the AD tape resource automatically. See `stan/math/rev/core/init_chainablestack_tbb.hpp`.
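To make the role of the environment variable concrete, here is a minimal sketch of how a raw `STAN_NUM_THREADS` value could be turned into a thread count. It uses only the standard library and is not the code in this PR. The helper names `resolve_num_threads` and `num_threads_from_environment` are hypothetical, and the handling of unset, `-1`, and invalid values is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper, not part of Stan Math.
// Assumed semantics (for illustration only):
//   unset            -> 1 thread
//   "-1"             -> all hardware threads
//   positive integer -> that many threads
//   anything else    -> std::invalid_argument
inline int resolve_num_threads(const char* env_value) {
  if (env_value == nullptr) {
    return 1;
  }
  const std::string text(env_value);
  std::size_t parsed = 0;
  int requested = 0;
  try {
    requested = std::stoi(text, &parsed);
  } catch (const std::exception&) {
    throw std::invalid_argument("STAN_NUM_THREADS is not an integer: " + text);
  }
  if (parsed != text.size()) {
    throw std::invalid_argument("STAN_NUM_THREADS has trailing characters: "
                                + text);
  }
  if (requested == -1) {
    // hardware_concurrency() may return 0 when the value is unknown.
    const unsigned hw = std::thread::hardware_concurrency();
    return hw > 0 ? static_cast<int>(hw) : 1;
  }
  if (requested < 1) {
    throw std::invalid_argument("STAN_NUM_THREADS must be positive or -1: "
                                + text);
  }
  assert(requested >= 1);
  return requested;
}

// Reads the variable from the process environment.
inline int num_threads_from_environment() {
  return resolve_num_threads(std::getenv("STAN_NUM_THREADS"));
}
```

Separating the parsing from the `std::getenv` call keeps the parsing testable without modifying the process environment.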
Note: This PR branches from feature/intel-tbb-lib, so it includes the changes from that branch in addition to the changes introduced here. feature/intel-tbb-lib has been selected as the reference branch to make it easier to follow the actual changes introduced relative to the base branch.
Performance Tests
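All reported times are in seconds. The speedup columns are consistent with the relative reduction in runtime, (develop - TBB) / develop, so negative values mean the TBB build was slower.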
| threads | Ubuntu i7 develop | Ubuntu i7 TBB | Ubuntu i7 speedup | Ubuntu AMD TBB | Ubuntu AMD develop | Ubuntu AMD speedup |
|---|---|---|---|---|---|---|
| 1 | 166 | 156 | 0.060 | 166.768 | 148.131 | -0.125 |
| 2 | 91 | 84 | 0.077 | 90.459 | 155.651 | 0.418 |
| 4 | 80 | 45 | 0.438 | 49.331 | 93.949 | 0.475 |
| 8 | 76 | 42 | 0.448 | 28.109 | 80.378 | 0.650 |
| 12 | | | | 22.028 | 72.776 | 0.697 |
| 14 | | | | 22.421 | 72.742 | 0.697 |
| 16 | | | | 23.166 | 66.685 | 0.653 |
| 18 | | | | 23.056 | 69.001 | 0.666 |
| 24 | | | | 23.525 | 80.779 | 0.709 |
| 32 | | | | 23.047 | 87.403 | 0.736 |
| threads | Mac i9 develop | Mac i9 TBB | Mac i9 speedup |
|---|---|---|---|
| 1 | 253 | 184 | 0.272 |
| 2 | 145 | 103 | 0.290 |
| 4 | 91 | 64 | 0.297 |
| 6 | 81 | 55 | 0.321 |
| threads | Win i7 develop | Win i7 TBB | Win i7 speedup |
|---|---|---|---|
| 1 | 452 | 464 | -0.027 |
| 2 | 222 | 230 | -0.036 |
| 4 | 148 | 132 | 0.108 |
| 6 | 146 | 116 | 0.205 |
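The speedup figures in the tables above can be reproduced from the timings. The sketch below assumes they are the relative runtime reduction of the TBB build versus develop, which matches the reported numbers to rounding. The function name `relative_speedup` is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: relative runtime reduction of the TBB build
// compared to develop. Positive means TBB is faster, negative slower.
inline double relative_speedup(double develop_seconds, double tbb_seconds) {
  assert(develop_seconds > 0.0);
  return (develop_seconds - tbb_seconds) / develop_seconds;
}
```

For example, the Ubuntu i7 single-thread row (166 s on develop, 156 s with TBB) gives 10 / 166, about 0.060, and the Windows i7 single-thread row (452 s versus 464 s) gives about -0.027.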
Tests
- `test/unit/math/prim/core/init_threadpool_tbb_test.cpp` tests initialisation of the threadpool
- `test/unit/math/prim/core/init_threadpool_tbb_late_test.cpp` tests initialisation of the threadpool in case a C++ user has already initialised the scheduler (in which case nothing happens)
- `test/unit/math/rev/mat/functor/gradient_test.cpp` tests if the threaded gradient code runs fine when using the Intel TBB as backend. This tests if the effects of AD tape initialisation have taken place.
Side Effects
The initialisation of the TBB threadpool is optional. That is, should client code not call the init_threadpool_tbb function, then the first execution of any TBB code will trigger default initialisation of the TBB threadpool, which is the default behaviour of the TBB. If the client code initialises the TBB first through explicit instantiation of the tbb::task_scheduler_init interface, then a subsequent call to init_threadpool_tbb has no effect. This is again the default behaviour of the TBB (only the very first call is honoured).
For interfaces it is recommended to make a call to stan::math::init_threadpool_tbb() prior to using Stan-math. This will ensure the proper initialisation of the threadpool and ensure that no more than STAN_NUM_THREADS threads are running in the threadpool.
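The "only the very first call is honoured" behaviour can be illustrated without the TBB. The sketch below is a standard-library stand-in, not the TBB API or the code in this PR: the hypothetical function `init_pool_limit` records the limit from its first call and ignores the argument of every later call.

```cpp
#include <cassert>

// Hypothetical stand-in for a process-wide threadpool limit.
// A function-local static is initialised exactly once (thread-safe since
// C++11), so only the first caller's request takes effect.
inline int init_pool_limit(int requested_threads) {
  assert(requested_threads >= 1);
  static const int limit = requested_threads;
  return limit;
}
```

Calling `init_pool_limit(4)` and then `init_pool_limit(8)` returns 4 both times, which mirrors why an interface should request its thread limit before any other code touches the threadpool.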
Todo
- agreement on design
- a few more tests for `init_threadpool_tbb`
- agree on the fate of `stan::math::get_num_threads` (moved to `stan/math/prim/scal/fun/get_num_threads.hpp`), see comments within
Checklist
- Math issue #(issue number)
- Copyright holder: Sebastian Weber
  The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
  - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
  - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
- the basic tests are passing
  - unit tests pass (to run, use: `./runTests.py test/unit`)
  - header checks pass (`make test-headers`)
  - docs build (`make doxygen`)
  - code passes the built-in C++ standards checks (`make cpplint`)
- the code is written in idiomatic C++ and changes are documented in the doxygen
- the new changes are tested