Bug: tensorflow-gpu takes long time before beginning to compute

I noticed that tensorflow always takes about ~2min before it actually starts to compute. I've been trying to find out, why this happens, and nothing really worked so far.

Tensorflow site says, I should use CUDA® Toolkit 9.0 and cuDNN v7.0. I have CUDA 9.0, so I downloaded CuDNN 7.0.5 for CUDA 9.0 and pasted the files to *C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0*, overwriting the ones form cuDNN 7.1.2, which I tested earlier. To make sure, I pip-installed tensorflow-gpu into a fresh anaconda env. See install here. The issue is still the same.

CUDA works, since it prints the 'Hello, TensorFlow!', when I use the official test example, but before that it takes like 2minutes every time!

When I tested this with another wheel (which is linked in this tutorial, I did not compile it myself.) on cuda 9.1/cudnn7.0.5, I had the same issues. A NVIDIA employee on stackoverflow suggested, I may be hitting a lengthy JIT compile step, because the GTX 1080 has compute capability of 6.1, which the wheel I used may not be compiled for.

So I tried to find wheels for tensorflow with compute capability 6.1 for windows, but the only one I found and tested produced the same problem.

Am I doing something wrong here, or do I just have to accept the 2min delay everytime I start my tensorflow/keras scripts?

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
Code:

import time
start_time = time.time()
import tensorflow as tf
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
timer = time.time()
print(timer - start_time)

Output:

(tf_clean) C:\python_code\test>C:/anaconda/envs/tf_clean/python.exe c:/python_code/test/tf_test.py
2018-04-18 14:36:04.376661: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this
TensorFlow binary was not compiled to use: AVX2
2018-04-18 14:36:04.689661: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.60GiB
2018-04-18 14:36:04.699485: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-18 14:38:12.227561: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-18 14:38:12.234504: I T:\src\github\tens2018-04-18 14:38:12.237156: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0:   N
2018-04-18 14:38:12.240997: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6379 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
2018-04-18 14:38:12.548288: I T:\src\github\tensorflow\tensorflow\core\common_runtime\direct_session.cc:297] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-18 14:38:12.559262: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:884] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-18 14:38:12.564847: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:884] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-18 14:38:12.570545: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:884] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
 [49. 64.]]
129.14624643325806

OS Platform and Distribution:
Windows 10 Education (Version 10.0.16299 Build 16299)
Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz, 3408 MHz, 4 Cores
TensorFlow installed from (source or binary):
binary
TensorFlow version:
tensorflow-gpu 1.5.0, 1.7.0
Python version:
3.5.5 & 3.6 (via anaconda, conda 4.5.1.)
Bazel Version:
N/A
CUDA/cuDNN version:
Tested combinations:
CUDA 9.0 and CuDNN 7.1.2 (tested on tensorflow 1.5.0, 1.7.0 and 1.8.0-dev20180329)
CUDA 9.1 and CuDNN 7.0.5 (tested on tensorflow 1.5.0 and 1.7.0)
GPU model and memory:
NVIDIA GeForce GTX 1080 (GP104-400) [Hewlett-Packard], 8192 MBytes of GDDR5X SDRAM [Micron]
Exact command to reproduce:
See: Have I written custom code...

=================================================================
EDIT:

Threadstarter here, hello.

Could you try with the latest nightly?
https://files.pythonhosted.org/packages/67/c0/e68a4f0400340b54c887703baa8eee188042c3d65a0cf535dda71abffbc2/tf_nightly_gpu-1.13.0.dev20190205-cp37-cp37m-win_amd64.whl

This works! I checked with that wheel, and then with tf-nightly-gpu-2.0-preview on PYPI, which also worked.
I initially wanted to use the anaconda cudatoolkit and cudnn packages, but currently, cudnn is only available up to version 7.3.1 on anaconda-cloud. Tensorflow 2.0 however, is compiled with 7.4.1, so I had to do this the oldschool way, and download the setups from Nvidia.
Soon, though...soon.

For everyone, here's what I did, as a guide:

How to install Tensorflow Nightly 2.0 GPU in Anaconda on Windows 10 x64

• I installed these CUDA/CuDnn Versions:
– cuda_10.0.130_win10_network (Nvidia CUDA Download: https://developer.nvidia.com/cuda-toolkit)
– cuDNN v7.4.1 (Nov 8, 2018), for CUDA 10.0 (Nvidia CuDnn Download: https://developer.nvidia.com/cudnn)
– Don't forget to check, whether the Cuda setup has correctly written itself to the PATH system variable.
– Reboot.
• Now make a new environment in Anaconda and activate it:
– conda create --name tf2-nightly-gpu python=3.6
– activate tf2-nightly-gpu
• Now, with the new env still activated, install the latest Tensorflow 2.0 nightly GPU build from PYPI:
– pip install tf-nightly-gpu-2.0-preview
• For machine learning in Jupyter notebook (or Jupyter Lab) , you need these as well:
– conda install nb_conda matplotlib scipy Pillow pandas scikit-learn
• Check, if your GPU is recognized by Tensorflow. Open the Anaconda prompt, activate the new environment and type python, then press Enter. Now type:
import tensorflow as tf
tf.test.is_gpu_available(cuda_only=False,min_cuda_compute_capability=None)
• Output should be something like this:

(tf2-nightly-gpu) C:\Users\___>python
>>> import tensorflow as tf
>>> tf.test.is_gpu_available(cuda_only=False,min_cuda_compute_capability=None)
2019-03-19 17:46:25.722209: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-03-19 17:46:25.729724: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2019-03-19 17:46:25.922934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1551] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.61GiB
2019-03-19 17:46:25.938231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1674] Adding visible gpu devices: 0
2019-03-19 17:46:26.539185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1082] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-19 17:46:26.546009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1088]      0
2019-03-19 17:46:26.550123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1101] 0:   N
2019-03-19 17:46:26.554188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1222] Created TensorFlow device (/device:GPU:0 with 6360 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
True

• Done.