fix random generator: do not gen seed each time #9119
piiswrong merged 13 commits into apache:master
Conversation
@Javelinjs I've uploaded the script here https://gist.github.com/sxjscience/453605a1ea3102bc0010f9fb16df8238. Currently we should rely on the results of the "Chi Square test" and the "mean test", as the "var test" needs far more samples. The chi-square test performs best.
src/common/random_generator.cc
Outdated
}

template<>
RandGenerator<cpu, float> *NewRandGenerator<cpu, float>() {
return a value so you don't need DeleteRandGenerator

But for NewRandGenerator<gpu> it is allocated on the device.

You should return a class that has an internal pointer to the device memory.
@Javelinjs The updated gist here https://gist.github.com/sxjscience/453605a1ea3102bc0010f9fb16df8238 tests all the available random ops in MXNet: normal, uniform, gamma, exponential, poisson, negative_binomial, generalized_negative_binomial, multinomial.
src/common/random_generator.h
Outdated
const int kGPURndStateNum = 32768;

// uniform number generation in Cuda made consistent with stl (include 0 but exclude 1)
// by using 1.0-curand_uniform(). Needed as some samplers below won't be able to deal with
Comment taken from prior code. A bit misleading, as it references "samplers below", but in this file there are none.
src/operator/mxnet_op.h
Outdated
inline static void LaunchNativeRandomGenerator(mshadow::Stream<cpu> *,
                                               common::random::RandGenerator<cpu, GType> *rnd,
                                               const int N, Args... args) {
  // do not use openmp since it does not guarantee the output order.
It's not clear to me what "does not guarantee the output order" means. I really think we should support OpenMP here, as sampling on CPU is equally important to sampling on GPU, and we should not leave a potential speedup of 4-8x on the table. Sampling on CPU is really slow anyway.
Wouldn't it be natural to use the exact same design pattern as for the GPU case, i.e. a set of preallocated samplers as a GlobalSampler, and then assign them to the different threads?

I mean, assume the actual sequence of std::mt19937 is 0.1, 0.2, 0.3, 0.4. When it is generated for arr[4] with OpenMP, this could become arr = {0.2, 0.1, 0.4, 0.3}.
Since std::mt19937 is thread-safe, I didn't preallocate it for multiple threads. But you're right, the same design could be adopted.

yes I think we should use openmp.

Adopting the same design would solve this ordering issue. And we don't need to pre-allocate thousands of CPU samplers; 256 would certainly be enough.
Nice catch and implementation. I was thinking about the same pattern (using a sufficiently large pool of pre-allocated random generators) in the initial implementation as well, but wasn't sure about blocking that much memory for the entire runtime. In fact, it is the far better solution, and not only in terms of sampling accuracy.

@asmushetzel I'll merge the test in #9129
fix lint fix lint fix typo fix docstring fix docstring
include/mxnet/resource.h
Outdated
kTempSpace
kTempSpace,
/*! \brief common::RandGenerator<xpu> object, which can be used in GPU kernel functions */
kNativeRandom
Why use a new enum? Can this be merged with kRandom?

kRandom returns mshadow::Random, whose behavior is different from the new one.

I think it should be called kParallelRandom

+1 (kParallelRandom would be a name that expresses what it is good for). Same for all functions that have "native" in their name.
src/common/random_generator.h
Outdated
// (non-thread-safe) random generator stores global states,
// always use mxnet_op::LaunchNativeRandomGenerator for launching a multi-threaded kernel.
template<typename DType>
class RandGeneratorGlobal<gpu, DType> : public RandGenerator<gpu, DType> {
src/operator/mxnet_op.h
Outdated
* \param args Varargs to eventually pass to the OP::Map() function
*/
template<typename GType, typename ...Args>
inline static void LaunchNativeRandomGenerator(mshadow::Stream<cpu> *,
src/operator/mxnet_op.h
Outdated
const int N, Args... args) {
  using namespace mshadow::cuda;
  const int nloop(1 + (N - 1) / common::random::kGPUMinRndNumberPerThread);
  int ngrid = std::min(common::random::kGPURndStateNum / kBaseThreadNum,
common::random::kGPURndStateNum / kBaseThreadNum could be 0
src/operator/mxnet_op.h
Outdated
for (int i = id * kGPUMinRndNumberPerThread;
     i < N;
     i += nthread * kGPUMinRndNumberPerThread) {
  for (int j = 0; j < kGPUMinRndNumberPerThread && i + j < N; ++j) {
These two loops look weird. Are you sure it should be i<N?
src/common/random_generator.h
Outdated
class RandGenerator;

template<typename Device, typename DType MSHADOW_DEFAULT_DTYPE>
class RandGeneratorGlobal;
Looks like this should be the internal implementation of RandGenerator. It doesn't need to be a top-level public class.
I don't think we need LaunchRNG. Why not Launch with N = rnd->size()?

@piiswrong By using
I think

I have refactored

adding the LaunchXX interface complicates the design.
src/common/random_generator.h
Outdated

#if MXNET_USE_CUDA

// at least how many random numbers should be generated by one GPU thread.
a number of contiguous random numbers should be generated from one state
src/common/random_generator.h
Outdated
// at least how many random numbers should be generated by one CPU thread.
const int kCPUMinRndNumberPerThread = 64;
// store how many global random states for CPU.
const int kCPURndStateNum = 1024;
These should be RandGenerator<cpu, DType>::kNumRandomStates.
src/common/random_generator.h
Outdated

// Free the allocated GPU memory.
// For global singleton,
// calling this in destructor may cause undefined behavior.
src/common/random_generator.h
Outdated
// Will use float data type whenever instantiated for half_t or any other non
// standard real type.
template<typename Device, typename DType MSHADOW_DEFAULT_DTYPE>
class RandGeneratorImpl;
Why do you need RandGenerator and RandGeneratorImpl? Why not just one?
src/common/random_generator.h
Outdated
public:
  // Copy state to local memory for efficiency.
  __device__ explicit RandGeneratorImpl(curandStatePhilox4_32_10_t *state)
      : state_(*state) {}
So you are copying state to state_ by value. Then wouldn't the next call of the same random operator give you the same results?

I haven't tested the case of multiple runs of the same generator. I should add that.

Two choices:
- Follow the exact same pattern in CPU and GPU, i.e. copy by value but then also ensure to save the state back in the RandGenerator at the end of LaunchRNG (which isn't the case currently)
- Copy the state by reference in the CPU case
I would prefer a consistent handling for both cases (i.e. the first version).

The implementation here was correct. It did save the state back in LaunchRNG. I have now refactored it into a more readable version.
src/common/random_generator.h
Outdated
RandGenerator() {
  cudaError_t e = cudaMalloc(&states_, kGPURndStateNum * sizeof(curandStatePhilox4_32_10_t));
  if (e != cudaSuccess && e != cudaErrorCudartUnloading) {
    throw std::bad_alloc();
Why not use the existing macros for interpreting cuda errors (i.e. CUDA_CALL(cudaMalloc(.....)))? That would also tell the user that something went wrong on the device, while throwing std::bad_alloc is misleading, as it refers to a memory allocation on the host.
src/operator/mxnet_op.h
Outdated
#ifdef _OPENMP
const int omp_threads = std::min(kCPURndStateNum,
                                 engine::OpenMP::Get()->GetRecommendedOMPThreadCount());
if (omp_threads < 2) {
I don't think you need special code for omp_threads < 2. The general loop below will work there as well and not create any overhead either.

This was intentional. OpenMP disables some compiler optimizations, so an omp loop with 1 thread is slower than a simple loop without omp.
src/operator/mxnet_op.h
Outdated
common::random::RandGenerator<cpu, GType> *rnd,
const int N, Args... args) {
  using namespace mxnet::common::random;
#ifdef _OPENMP
I don't think you need that ifdef. Even if we compile without openmp support, the functions GetRecommendedOMPThreadCount() and omp_get_thread_num() should be replaced by appropriate stubs. So your code does not need any special handling (see dmlc/omp.h and engine/openmp.cc).
python/mxnet/test_utils.py
Outdated
import numpy as np
import numpy.testing as npt
import numpy.random as rnd
import scipy.stats as ss
we don't depend on scipy

How should I revise this? Move it to be inside the functions?

@Javelinjs @piiswrong I've added one commit to solve the problem. I also added tests for the case in which the generator is triggered multiple times: 199fabd
merged changes from #9129 |
* add tests for distribution generators fix lint fix lint fix typo fix docstring fix docstring
* [Bugfix] fix random generator: do not gen seed each time
* gen samplers on gpu for test_softmax
* fix test cases
* remove unnecessary prints
* refactor RandGenerator
* get_native_random -> get_parallel_random
* revise test cases + remove dependency of scipy
* raise warning
Description
The current implementation of the random sampler (#8179) generates a new (random) seed every time it starts to produce a random number, which is incorrect. @asmushetzel

Although std::mt19937 seems to work with this approach, samplers based on curand generate low-quality randomness, which probably makes these numbers correlated.

Firstly, I noticed that training with SGLD collapses to low accuracy (#8958). Secondly, @sxjscience has written new test cases for the random samplers, using mean/var/chi-square tests, and none of them passes with the current implementation.

According to the Nvidia documentation for the kernel random API, we need to maintain global random states and reuse them. Moreover, the random state is not thread-safe.

The implementation here maintains a fixed number of global random states, which can be accessed through Resource. In case it is accessed by multiple GPU streams, (by default) 4 independent GPU generators are created in the global Resource.

I tested example/bayesian-methods and it now converges to reasonable results. It should pass @sxjscience's new test cases as well. The memory usage and speed barely change.
Checklist
Essentials
make lint
Changes
SGLD optimizer arguments.