ut train fails on GPU with JIT compilation failed #60

@Shubhamcl

On Ubuntu 22, training works fine on CPU, but with --num_gpus=1 I get the error stack below.

The error appears when following the instructions for the demo.

I first thought it was a TensorFlow issue, so I ran GPU training using an example from the TensorFlow tutorials, but that worked fine.
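A minimal sketch of a standalone GPU sanity check of this kind, for anyone trying to narrow the problem down. It is not the tutorial script that was actually run: the function name `gpu_sanity_check` and the input values are hypothetical, and the masked select only imitates the `tf.where(tf.logical_and(...))` pattern visible in the trace, so it may or may not reproduce the failure.

```python
import importlib.util


def gpu_sanity_check():
    """Report whether TensorFlow sees a GPU and can run a small masked select."""
    # Return an empty report instead of crashing when TensorFlow is not installed.
    if importlib.util.find_spec("tensorflow") is None:
        return {"tensorflow": None, "gpus": [], "select_result": None}

    import tensorflow as tf

    # An empty list here means TensorFlow would silently fall back to the CPU.
    gpus = [device.name for device in tf.config.list_physical_devices("GPU")]

    # Hypothetical inputs: the real arguments in utime/evaluation/utils.py are
    # not shown in the trace. This only exercises the same op combination.
    values = tf.constant([0.0, 1.0, 2.0, 3.0])
    in_range = tf.logical_and(values > 0.0, values < 3.0)
    selected = tf.where(in_range, values, tf.zeros_like(values))

    return {
        "tensorflow": tf.__version__,
        "gpus": gpus,
        "select_result": selected.numpy().tolist(),
    }


if __name__ == "__main__":
    print(gpu_sanity_check())
```

If this runs cleanly inside the same conda environment with a GPU listed, the environment can at least execute the select op in isolation, which would point toward how the op is compiled inside the training function rather than toward the op itself.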

Detected at node 'SelectV2' defined at (most recent call last):
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/threading.py", line 908, in _bootstrap
self._bootstrap_inner()
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/threading.py", line 950, in _bootstrap_inner
self.run()
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 1000, in run_step
outputs = model.train_step(data)
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 864, in train_step
return self.compute_metrics(x, y, y_pred, sample_weight)
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 957, in compute_metrics
self.compiled_metrics.update_state(y, y_pred, sample_weight)
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 459, in update_state
metric_obj.update_state(y_t, y_p, sample_weight=mask)
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/utime/evaluation/utils.py", line 22, in wrapper
mask = tf.where(tf.logical_and(
Node: 'SelectV2'
Detected at node 'SelectV2' defined at (most recent call last):
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/threading.py", line 908, in _bootstrap
self._bootstrap_inner()
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/threading.py", line 950, in _bootstrap_inner
self.run()
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 1000, in run_step
outputs = model.train_step(data)
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 864, in train_step
return self.compute_metrics(x, y, y_pred, sample_weight)
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 957, in compute_metrics
self.compiled_metrics.update_state(y, y_pred, sample_weight)
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 459, in update_state
metric_obj.update_state(y_t, y_p, sample_weight=mask)
File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/utime/evaluation/utils.py", line 22, in wrapper
mask = tf.where(tf.logical_and(
Node: 'SelectV2'
2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node SelectV2}}]]
[[div_no_nan_1/ReadVariableOp/_12]]
(1) UNKNOWN: JIT compilation failed.
[[{{node SelectV2}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_13068]
