Issues on torchgpipe project and paper #28

@xshaun

Description

Thanks for sharing this project and paper.

I have one doubt after reading your paper: how do you achieve concurrent copy and computation using only the streams wrapped by torch?

  1. According to NVIDIA's Streams and Concurrency webinar (https://developer.download.nvidia.cn/CUDA/training/StreamsAndConcurrencyWebinar.pdf), kernels cannot run simultaneously on the default stream and on other streams. To overlap communication and computation, developers have to explicitly create two or more non-blocking streams; the default stream cannot be one of them.
  2. Besides, due to the Python GIL, it is also hard or impossible to launch kernels into several non-blocking streams simultaneously from Python.

Or do you introduce other technologies to address this issue?
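For context, here is a minimal sketch of the kind of stream usage the question is about: issuing a host-to-device copy and a compute kernel on two separate non-default PyTorch streams. All tensor and stream names are illustrative, and the snippet assumes a CUDA-capable device:

```python
import torch

if torch.cuda.is_available():
    # Two explicitly created (non-default) CUDA streams.
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()

    # Pinned host memory is required for truly asynchronous H2D copies.
    host_tensor = torch.randn(1024, 1024).pin_memory()
    x = torch.randn(1024, 1024, device="cuda")
    w = torch.randn(1024, 1024, device="cuda")

    with torch.cuda.stream(copy_stream):
        # non_blocking=True lets the copy proceed asynchronously
        # with respect to the host thread.
        device_tensor = host_tensor.to("cuda", non_blocking=True)

    with torch.cuda.stream(compute_stream):
        y = x @ w  # kernel launched on compute_stream

    # Make subsequent default-stream work wait on both streams
    # before consuming their results.
    torch.cuda.current_stream().wait_stream(copy_stream)
    torch.cuda.current_stream().wait_stream(compute_stream)
    torch.cuda.synchronize()
```

Note that kernel launches and non-blocking copies return to the host immediately, so a single Python thread can enqueue work onto several streams back-to-back; whether that enqueued work then overlaps on the device is exactly what this issue asks about.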

My doubts are as follows:

  1. Figure 5 in your paper shows kernels running on the default stream and on non-blocking streams. Is that accurate?
  2. Figure 7 shows the improvement you achieved by profiling the kernel timeline with the NVIDIA Nsight tool. But if we zoom into the timeline, do the kernels actually overlap, or do they still execute sequentially overall despite the extra streams?
  3. Is it possible that the improvement you observed comes from reduced idle gaps between kernels, rather than from overlapped communication and computation?
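One hypothetical way to probe doubt 3 empirically, assuming a CUDA device, is to time the same copy-plus-compute workload with and without a separate copy stream using CUDA events. If the two-stream variant is noticeably faster, genuine overlap is occurring rather than merely smaller gaps between kernels. Sizes and names below are illustrative:

```python
import torch

def timed_ms(fn):
    """Time a CUDA workload with CUDA events; returns milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

if torch.cuda.is_available():
    host = torch.randn(4096, 4096).pin_memory()  # pinned for async copy
    a = torch.randn(4096, 4096, device="cuda")
    copy_stream = torch.cuda.Stream()

    def sequential():
        # Copy and compute both on the current stream: no overlap possible.
        _ = host.to("cuda", non_blocking=False)
        _ = a @ a

    def two_streams():
        # Copy on a separate stream, compute on the current stream.
        with torch.cuda.stream(copy_stream):
            _ = host.to("cuda", non_blocking=True)
        _ = a @ a
        torch.cuda.current_stream().wait_stream(copy_stream)

    print("sequential  ms:", timed_ms(sequential))
    print("two streams ms:", timed_ms(two_streams))
```

A zoomed-in Nsight timeline of the same two runs would show directly whether the memcpy and the matmul occupy overlapping time ranges on the copy and compute engines.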

Thanks for your time and answer.

Labels: question (further information is requested)