Description
The shuffle read process consists of data fetching (Java) followed by decompression and deserialization (native). Currently, the GPU shuffle read can be blocked by a global GPU lock. Reading the shuffle streams and decompressing them on multiple threads can speed up the shuffle read.
The asynchronous path in the native shuffle read parallelizes only decompression and deserialization. The timing of data fetching still depends on when the Velox pipeline triggers the shuffle read.
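To illustrate the intended behavior, here is a minimal Python sketch, not Gluten or Velox code. It assumes the blocks have already been fetched, and uses `zlib` and `pickle` as stand-ins for the native codec and deserializer. The names `decode_block` and `read_shuffle` are hypothetical. Decoding runs on a thread pool, but no work starts until the consumer first pulls from the reader, which mirrors the fetch timing staying tied to the pipeline.

```python
import pickle
import zlib
from concurrent.futures import ThreadPoolExecutor


def decode_block(compressed: bytes):
    """Decompress and deserialize one block (stand-in for the native step)."""
    return pickle.loads(zlib.decompress(compressed))


def read_shuffle(fetched_blocks, num_threads=4):
    """Decode already-fetched blocks on a thread pool, preserving input order.

    This is a generator, so nothing runs until the consumer pulls the first
    result; only decompression and deserialization are parallelized.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() submits all blocks at once and yields results in input order.
        yield from pool.map(decode_block, fetched_blocks)


# Hypothetical input: eight small "shuffle blocks" standing in for fetched data.
rows = [list(range(i, i + 5)) for i in range(0, 40, 5)]
blocks = [zlib.compress(pickle.dumps(r)) for r in rows]

decoded = list(read_shuffle(blocks))
```

In this sketch the thread count is a plain parameter; how many threads are appropriate, and how they interact with the GPU lock, is outside what the sketch models.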
Gluten version
None