[MRG]: add fast_dot function calling BLAS directly and consume only twice the memory of your data by dengemann · Pull Request #2248 · scikit-learn/scikit-learn

Hi there, I finally got it running.

This implements a feature 'advocated' on this scipy page (section on large arrays and linalg):

http://wiki.scipy.org/PerformanceTips

When calling BLAS directly instead of going through np.dot, it's possible to avoid copying when the data are passed in F-contiguous order. In addition, I've added chunking to the _logcosh function, which avoids an extra copy.
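To make the chunking idea concrete, here is a minimal sketch. It is not the code from this branch: the name `logcosh_chunked` and its signature are made up for illustration, and it assumes `x` is a 2-D float array of shape (n_components, n_samples) that may be overwritten. The tanh is applied in place, and the mean of the derivative is computed one row at a time, so the temporaries are only one row long instead of as large as the full array.

```python
import numpy as np


def logcosh_chunked(x, alpha=1.0):
    """Hypothetical helper: tanh nonlinearity plus row-wise mean of its derivative.

    WARNING: x is overwritten in place to avoid allocating a second full array.
    """
    x *= alpha                      # scale in place
    gx = np.tanh(x, out=x)          # tanh in place; gx is the same buffer as x
    g_x = np.empty(x.shape[0])
    # Process one row at a time so each temporary is only n_samples long.
    for i, row in enumerate(gx):
        g_x[i] = (alpha * (1.0 - row ** 2)).mean()
    return gx, g_x
```

The fully vectorized alternative, `(alpha * (1 - gx ** 2)).mean(axis=-1)`, gives the same numbers but allocates temporaries the size of the whole data array.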
This is how it now looks on 1 GB of test data:

[plot: fast_dot_chunking_logcosh]

This is how the same test looked on current master (plot from the last memory PR):

[plot: memory_ica_par_w1_computation_del_gwx]

To make this functionality available for other use cases, I've added a fast_dot function to utils.extmath, with deliberately simple but explicit tests that spell out the mapping between np.dot and fast_dot, which can be hell to get right.
Finally, I've made sure that downstream applications still work. For example, with this local branch the mne-python ICA looks as good as it did before.
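For readers unfamiliar with that mapping, here is a rough sketch of how np.dot calls translate into direct BLAS gemm calls through scipy. It is not the fast_dot implementation: `gemm_dot` is a hypothetical helper, and it assumes 2-D float arrays that are already F-contiguous (the transpose of a C-contiguous array qualifies), which is the case where the scipy wrapper should not need to copy the inputs.

```python
import numpy as np
from scipy.linalg import get_blas_funcs


def gemm_dot(a, b, trans_a=False, trans_b=False):
    """Hypothetical helper: compute op(a) @ op(b) with BLAS gemm.

    op(x) is x.T when the matching trans flag is set, else x.
    Assumes a and b are 2-D, F-contiguous, and share a float dtype.
    """
    gemm = get_blas_funcs("gemm", (a, b))
    return gemm(1.0, a, b, trans_a=int(trans_a), trans_b=int(trans_b))


rng = np.random.RandomState(42)
A = np.asfortranarray(rng.randn(5, 3))
B = np.asfortranarray(rng.randn(3, 4))
C = np.asfortranarray(rng.randn(5, 4))

# np.dot(A, B)   <->  gemm_dot(A, B)
AB = gemm_dot(A, B)
# np.dot(A.T, C) <->  gemm_dot(A, C, trans_a=True)
AtC = gemm_dot(A, C, trans_a=True)

# C-contiguous inputs: X.T and Y.T are F-contiguous views, and
# np.dot(X, Y) == np.dot(Y.T, X.T).T, so swap and transpose.
X = rng.randn(5, 3)
Y = rng.randn(3, 4)
XY = gemm_dot(Y.T, X.T).T
```

The C-contiguous case is the one that is easy to get wrong: the operands are swapped and the result is transposed back, rather than setting the trans flags.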

cc @agramfort @GaelVaroquaux @mblondel