Python crashes (core-dump) instead of a graceful error message when GPU context is used on a CPU-only instance (EC2 x1.32xlarge) #8835
Description
Python crashes with a core dump instead of returning a graceful error message when a GPU context is used on a CPU-only instance (EC2 x1.32xlarge). The root cause may be that CUDA reports only "unknown error". Ideally it would return a specific CUDA error that MXNet can trap and display as an error message instead of crashing Python.
Environment info (Required)
EC2 instance type: x1.32xlarge
MXNet: Release candidate: v1.0.0 RC0
Build info (Required if built from source)
Release candidate: v1.0.0 RC0
Compiler (gcc/clang/mingw/visual studio): gcc 5.4 on Ubuntu Linux 16.04
Error Message:
--snip--
>>> import mxnet as mx
>>> mx.__version__
'1.0.0'
>>> shape = (10, 10)
>>> a = mx.nd.ones(shape, mx.gpu(0))
[17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/dmlc-core/include/dmlc/./logging.h:308: [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error
Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]
[17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/dmlc-core/include/dmlc/./logging.h:308: [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error
Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]
terminate called after throwing an instance of 'dmlc::Error'
what(): [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error
Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]
terminate called after throwing an instance of 'dmlc::Error'
what(): [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error
Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]
Aborted (core dumped)
--snip--
Minimum reproducible example
Build from source using the configuration below and run the reproduction steps on a CPU-only instance.
$ cd src/make
$ diff config.mk config.mk.ci | egrep ">"
DEBUG = 1
USE_CUDA = 1
USE_CUDA_PATH = /usr/local/cuda
USE_CUDNN = 1
USE_DIST_KVSTORE = 1
USE_S3 = 1
>>> import mxnet as mx
>>> mx.__version__
'1.0.0'
>>> shape = (10, 10)
>>> a = mx.nd.ones(shape, mx.gpu(0))
[17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/dmlc-core/include/dmlc/./logging.h:308: [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error
What have you tried to solve it?
Workaround: Do NOT use GPU context on a CPU-only instance.
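Because the failure aborts the whole interpreter, a `try`/`except` in the calling process cannot catch it. One possible way to apply the workaround automatically is to touch the GPU in a child process first and fall back to the CPU if that child dies. The sketch below is only an illustration under that assumption: the helper names `gpu_probe_succeeds` and `pick_device_type` and the probe command are hypothetical and not part of MXNet.

```python
import subprocess
import sys

# Hypothetical probe: allocate a tiny array on GPU 0 in a child process,
# so that an abort there cannot take down the calling interpreter.
DEFAULT_PROBE = [
    sys.executable,
    "-c",
    "import mxnet as mx; mx.nd.ones((1,), mx.gpu(0)).asnumpy()",
]


def gpu_probe_succeeds(probe_cmd=None, timeout=60):
    """Return True only if the probe command exits with status 0."""
    cmd = DEFAULT_PROBE if probe_cmd is None else probe_cmd
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The command could not be started or did not finish in time.
        return False
    # A child killed by a signal (e.g. SIGABRT) has a non-zero return code.
    return result.returncode == 0


def pick_device_type(probe_cmd=None):
    """Return "gpu" if the probe succeeded, otherwise "cpu"."""
    return "gpu" if gpu_probe_succeeds(probe_cmd) else "cpu"
```

A caller could then write `ctx = mx.gpu(0) if pick_device_type() == "gpu" else mx.cpu()`. The probe costs one extra process start and an extra MXNet import, so it is best run once at startup.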