Issues on torchgpipe project and paper #28

@xshaun

Description

Thanks for sharing this project and paper.

I have one doubt after reading your paper: how do you achieve concurrent copy and computation using only the streams wrapped by torch?

  1. According to NVIDIA's Streams and Concurrency webinar (https://developer.download.nvidia.cn/CUDA/training/StreamsAndConcurrencyWebinar.pdf), kernels cannot run simultaneously on the default stream and on other streams. To overlap communication and computation, developers have to explicitly create two or more non-blocking streams; the default stream cannot be one of them.
  2. Besides, due to the Python GIL, it is also hard or impossible to launch kernels into several non-blocking streams simultaneously from Python.

Or do you introduce other technologies to address this issue?
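For context, here is a minimal sketch of the kind of stream usage the question is about: issuing a host-to-device copy and a compute kernel on two separate non-default PyTorch streams. All tensor and stream names are illustrative, and the snippet assumes a CUDA-capable device:

```python
import torch

if torch.cuda.is_available():
    # Two explicitly created (non-default) CUDA streams.
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()

    # Pinned host memory is required for truly asynchronous H2D copies.
    host_tensor = torch.randn(1024, 1024).pin_memory()
    x = torch.randn(1024, 1024, device="cuda")
    w = torch.randn(1024, 1024, device="cuda")

    with torch.cuda.stream(copy_stream):
        # non_blocking=True lets the copy proceed asynchronously
        # with respect to the host thread.
        device_tensor = host_tensor.to("cuda", non_blocking=True)

    with torch.cuda.stream(compute_stream):
        y = x @ w  # kernel launched on compute_stream

    # Make subsequent default-stream work wait on both streams
    # before consuming their results.
    torch.cuda.current_stream().wait_stream(copy_stream)
    torch.cuda.current_stream().wait_stream(compute_stream)
    torch.cuda.synchronize()
```

Note that kernel launches and non-blocking copies return to the host immediately, so a single Python thread can enqueue work onto several streams back-to-back; whether that enqueued work then overlaps on the device is exactly what this issue asks about.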

My doubts are as follows:

  1. Figure 5 in your paper shows kernels running on the default stream and on non-blocking streams. Is that accurate?
  2. Figure 7 shows the improvement you achieved by profiling the kernel timeline with the NVIDIA Nsight tool. But if we zoom into the timeline, do the kernels actually overlap, or do they still execute sequentially overall despite the extra streams?
  3. Is it possible that the improvement you observed comes from reduced idle gaps between kernels, rather than from overlapped communication and computation?
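One hypothetical way to probe doubt 3 empirically, assuming a CUDA device, is to time the same copy-plus-compute workload with and without a separate copy stream using CUDA events. If the two-stream variant is noticeably faster, genuine overlap is occurring rather than merely smaller gaps between kernels. Sizes and names below are illustrative:

```python
import torch

def timed_ms(fn):
    """Time a CUDA workload with CUDA events; returns milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

if torch.cuda.is_available():
    host = torch.randn(4096, 4096).pin_memory()  # pinned for async copy
    a = torch.randn(4096, 4096, device="cuda")
    copy_stream = torch.cuda.Stream()

    def sequential():
        # Copy and compute both on the current stream: no overlap possible.
        _ = host.to("cuda", non_blocking=False)
        _ = a @ a

    def two_streams():
        # Copy on a separate stream, compute on the current stream.
        with torch.cuda.stream(copy_stream):
            _ = host.to("cuda", non_blocking=True)
        _ = a @ a
        torch.cuda.current_stream().wait_stream(copy_stream)

    print("sequential  ms:", timed_ms(sequential))
    print("two streams ms:", timed_ms(two_streams))
```

A zoomed-in Nsight timeline of the same two runs would show directly whether the memcpy and the matmul occupy overlapping time ranges on the copy and compute engines.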

Thanks for your time and answer.

Labels: question (further information is requested)