This method was taking an hour to compute inv(np.dot(A.T, A)) of a dense A matrix of shape (38520, 1560).
Some speed up may be gained by avoiding the copies, see http://wiki.scipy.org/PerformanceTips
But it isn't that clear how this would be done efficiently for large A. I think matrix multiplication is about O(n^3) complexity and I doubt the inverse is good either.
The large A that was slowing things down was quite sparse so using scipy.sparse is a potential option.