
cublas backend of MatMul does not work with stream parallelism #618

@roastduck

Description


We should run cuBLAS in an appropriate stream, and this further requires creating a separate cuBLAS handle for each stream. Since we cache the cuBLAS handle in GPUContext, we should make that cache work across multiple streams.
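A minimal sketch of one way the cache could look (the method and member names here are hypothetical, not the project's actual GPUContext API): keep one handle per stream in a map, and bind each handle to its stream with `cublasSetStream` so that matmul calls issued through that handle are enqueued in the caller's stream rather than the default stream.

```cuda
// Hypothetical sketch of a per-stream cuBLAS handle cache.
// Assumes single-threaded access to GPUContext; a real implementation
// would need a mutex if contexts are shared across host threads.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <unordered_map>

class GPUContext {
public:
    // Return a cuBLAS handle bound to `stream`, creating and caching
    // one per stream on first use.
    cublasHandle_t cublas(cudaStream_t stream) {
        auto it = handles_.find(stream);
        if (it == handles_.end()) {
            cublasHandle_t h;
            cublasCreate(&h);
            cublasSetStream(h, stream);  // subsequent cuBLAS calls on h run in `stream`
            it = handles_.emplace(stream, h).first;
        }
        return it->second;
    }

    ~GPUContext() {
        for (auto &kv : handles_)
            cublasDestroy(kv.second);
    }

private:
    // cudaStream_t is a pointer type, so std::hash works out of the box.
    std::unordered_map<cudaStream_t, cublasHandle_t> handles_;
};
```

With this, a MatMul lowered onto stream `s` would call `ctx.cublas(s)` instead of a single cached handle, so two MatMuls on different streams no longer serialize on (or race over) one shared handle.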

Metadata


Assignees

No one assigned

Labels

bug: Something isn't working

Projects

No projects

Milestone

No milestone


Development

No branches or pull requests
