When I use fp16 (16-bit float) with multi-GPU training, the code hangs, waiting in SyncBN (comm.py).