Conversation
def context_initialize(batch_size):
    """
    Initialize this module, must be invoked before calling any other functions.
    This function will block until it has been invoked from all replicas.
How's this enforced?
Dear Omkar, thank you for asking. Since we made the Context global, it is first initialized in init_process_group as Context_obj, following Aurick's suggestion. All subsequent code that needs the Context uses this Context_obj instead of creating a new one, so the initialization is enforced at the very beginning.
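A minimal sketch of the "initialize once, globally" pattern described in this reply, assuming a module-level singleton. Only Context, context_initialize, and init_process_group come from the discussion; the stub Context class and the names _context_obj and current_context are hypothetical, and the cross-replica blocking mentioned in the docstring is omitted.

class Context:
    """Stand-in for the real Context; holds the configured batch size."""
    def __init__(self, batch_size):
        self.batch_size = batch_size


_context_obj = None  # module-level singleton, created exactly once


def context_initialize(batch_size):
    """Create the global Context; called from init_process_group at startup."""
    global _context_obj
    if _context_obj is not None:
        raise RuntimeError("context_initialize() was already called")
    _context_obj = Context(batch_size)


def current_context():
    """Accessor that all later code paths use instead of re-initializing."""
    if _context_obj is None:
        raise RuntimeError("context_initialize() must be called first "
                           "(normally from init_process_group)")
    return _context_obj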
adaptdl/adaptdl/torch/__init__.py
Outdated
        world_size)


# Initialize Context module.
adaptdl.torch.data.context_initialize(batch_size=32)
How do you plan to make batch_size available in init_process_group?
We were just giving it a default value here for the global Context; users can replace the batch_size with their own value when they use it. But thanks for pointing this out: we can also remove the default value from here and supply it directly in the definition of the Context class. Users can still pass their own value, which then replaces the default.
Please also check our new commit to review the modification.
Many thanks.
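A hedged sketch of the default-value handling described in this reply: the default batch size lives on the Context class rather than at the call site, and a user-supplied value overrides it. DEFAULT_BATCH_SIZE and the constructor signature below are assumptions for illustration, not AdaptDL's actual code.

class Context:
    DEFAULT_BATCH_SIZE = 32  # class-level default instead of a literal at the call site

    def __init__(self, batch_size=None):
        # A user-provided batch_size replaces the class default.
        if batch_size is None:
            batch_size = self.DEFAULT_BATCH_SIZE
        self.batch_size = batch_size


ctx_default = Context()             # falls back to Context.DEFAULT_BATCH_SIZE
ctx_user = Context(batch_size=128)  # user value overrides the default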
adaptdl/adaptdl/torch/data.py
Outdated
    @property
    def current_batch_size(self):
-       return (self.current_local_bsz * (self.accumulation_steps + 1) *
+       return (self._context.get_batch_size() * (self._context.get_accum_steps() + 1) *
This is a bit inefficient: it can potentially invoke the goodput optimization twice, once per Context._get_local_bsz call. You could probably use _context.current_local_bsz and accumulation_steps instead when you just want to query the values without potentially triggering the optimization.
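A sketch of the distinction this comment draws, with hypothetical method bodies: _get_local_bsz (and the getters that call it) may re-run the goodput optimization, while current_local_bsz and accumulation_steps only read cached values. The trailing factor of the original expression is elided in the diff, so num_replicas below is an assumption.

class Context:
    def __init__(self):
        self._local_bsz = 32
        self._accum_steps = 0

    def _get_local_bsz(self):
        # Potentially expensive: may re-run the goodput optimization
        # before returning fresh values (optimization body omitted).
        return self._local_bsz, self._accum_steps

    @property
    def current_local_bsz(self):
        return self._local_bsz      # cheap read, never triggers optimization

    @property
    def accumulation_steps(self):
        return self._accum_steps    # cheap read, never triggers optimization


def current_batch_size(ctx, num_replicas):
    # Query cached values instead of the getters that might optimize twice.
    return ctx.current_local_bsz * (ctx.accumulation_steps + 1) * num_replicas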
Thank you Omkar, please check the new commit, where the triggering issue has been fixed. 🤝
adaptdl/adaptdl/torch/data.py
Outdated
    self.sampler.set_epoch(
        epoch, index=self._elastic.current_index)
-   self.batch_sampler.batch_size = self._elastic._sync_local_bsz()
+   self.batch_sampler.batch_size = self._elastic._context.get_batch_size()
_sync_local_bsz cannot be replaced by Context.get_batch_size, because _sync_local_bsz also does a broadcast from rank 0 to propagate the local batch size it calculated to the rest of the replicas, and it also acts as a barrier. Context.get_batch_size could cause the local batch sizes on the replicas to go out of sync.
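A hedged sketch of why _sync_local_bsz cannot be replaced by a plain getter: it broadcasts the value computed on rank 0 so every replica ends up with the same local batch size, and the collective also acts as a barrier. The body below is illustrative, not AdaptDL's actual implementation.

import torch
import torch.distributed as dist


def _sync_local_bsz(context):
    # Only rank 0 computes (or re-optimizes) the local batch size.
    if dist.get_rank() == 0:
        local_bsz = context.get_batch_size()
    else:
        local_bsz = 0
    tensor = torch.tensor([local_bsz], dtype=torch.long)
    # Broadcast from rank 0 to all replicas; the collective also
    # synchronizes them, acting as a barrier.
    dist.broadcast(tensor, src=0)
    return int(tensor.item())

Calling context.get_batch_size() independently on every replica could return different values if replicas re-optimize at different times, which is exactly the out-of-sync risk the comment points out; the broadcast avoids it.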
Thanks Omkar, please check the new commit, where the replacement issue is fixed.