This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Python crashes (core-dump) instead of a graceful error message when GPU context is used on a CPU-only instance (EC2 x1.32xlarge) #8835

@bhavinthaker

Description

Python crashes (core dump) instead of gracefully returning an error message when a GPU context is used on a CPU-only instance (EC2 x1.32xlarge). The root cause may be that CUDA reports only "unknown error"; ideally it would return a specific CUDA error that MXNet can trap and display as an error message instead of crashing Python.

Environment info (Required)

EC2 instance type: x1.32xlarge
MXNet: Release candidate: v1.0.0 RC0

Build info (Required if built from source)

Release candidate: v1.0.0 RC0

Compiler (gcc/clang/mingw/visual studio): gcc 5.4 on Ubuntu Linux 16.04

Error Message:

--snip--

>>> import mxnet as mx
>>> mx.__version__
'1.0.0'

>>> shape = (10, 10)
>>> a = mx.nd.ones(shape, mx.gpu(0))
[17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/dmlc-core/include/dmlc/./logging.h:308: [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error

Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]

terminate called after throwing an instance of 'dmlc::Error'
what(): [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error

Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]

Aborted (core dumped)
--snip--
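
The log above ends in `terminate called ... Aborted (core dumped)`, so the interpreter is killed rather than handed a Python exception, and a `try`/`except` around the call would not help. As a rough sketch (not something MXNet provides), the GPU call can be probed in a child process and the exit status inspected instead. The helper name `snippet_survives` and the `GPU_PROBE` string are made up for illustration, and the probe assumes `mxnet` is importable in the child interpreter.

```python
import subprocess
import sys

# Hypothetical probe snippet: allocate a tiny array on GPU 0 and force the
# asynchronous engine to execute it. Assumes mxnet is installed in the child.
GPU_PROBE = "import mxnet as mx; mx.nd.ones((1,), mx.gpu(0)).wait_to_read()"


def snippet_survives(snippet, timeout=60):
    """Run `snippet` in a fresh interpreter; True only if it exits with status 0.

    A child that aborts (as in the log above) ends with a non-zero status,
    so the parent process stays alive and can fall back to the CPU.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Calling `snippet_survives(GPU_PROBE)` once at start-up would then tell the parent whether `mx.gpu(0)` is usable, at the cost of one extra interpreter launch.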

Minimum reproducible example

Build from source using the configuration below and run the reproduction steps on a CPU-only instance.

$ cd src/make
$ diff config.mk config.mk.ci | egrep ">"
> DEBUG = 1
> USE_CUDA = 1
> USE_CUDA_PATH = /usr/local/cuda
> USE_CUDNN = 1
> USE_DIST_KVSTORE = 1
> USE_S3 = 1

>>> import mxnet as mx
>>> mx.__version__
'1.0.0'

>>> shape = (10, 10)
>>> a = mx.nd.ones(shape, mx.gpu(0))
[17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/dmlc-core/include/dmlc/./logging.h:308: [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error

What have you tried to solve it?

Workaround: Do NOT use GPU context on a CPU-only instance.
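
One way to apply this workaround in scripts that run on both kinds of instance is to check for a GPU before building the context. The following is only a sketch under assumptions: it relies on the NVIDIA driver's `nvidia-smi -L` listing (one `GPU N: ...` line per device) being available on GPU hosts, and the helper names `count_nvidia_gpus`, `choose_device`, and `make_context` are illustrative, not MXNet APIs.

```python
import shutil
import subprocess


def count_nvidia_gpus(timeout=10):
    """Best-effort GPU count from `nvidia-smi -L`; 0 if the tool is missing or fails."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return 0
    try:
        out = subprocess.run(
            [exe, "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return 0
    lines = out.stdout.decode(errors="replace").splitlines()
    return sum(1 for line in lines if line.startswith("GPU "))


def choose_device(gpu_count):
    """Return ("gpu", 0) when at least one GPU was found, else ("cpu", 0)."""
    return ("gpu", 0) if gpu_count > 0 else ("cpu", 0)


def make_context():
    """Build an MXNet context for the chosen device (mxnet imported lazily)."""
    import mxnet as mx  # assumption: mxnet is installed where this is called

    kind, index = choose_device(count_nvidia_gpus())
    return mx.gpu(index) if kind == "gpu" else mx.cpu()
```

With this, `a = mx.nd.ones(shape, make_context())` would fall back to the CPU on an x1.32xlarge instead of aborting. The check only says that a device is listed, not that the CUDA runtime can actually use it.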
