Conversation
def context_initialize(batch_size):
    """
    Initialize this module, must be invoked before calling any other functions.
    This function will block until it has been invoked from all replicas.
How's this enforced?
Dear Omkar, thank you for asking. Since we made the Context global, it is first initialized in init_process_group as Context_obj, following Aurick's suggestion. All subsequent code that needs the Context uses this Context_obj instead of creating a new one, so the initialization is enforced at the very beginning.
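A minimal sketch of the "initialize once, globally" pattern described in this reply, assuming a module-level singleton. Only Context, context_initialize, and init_process_group come from the discussion; the stub Context class and the names _context_obj and current_context are hypothetical, and the cross-replica blocking mentioned in the docstring is omitted.

class Context:
    """Stand-in for the real Context; holds the configured batch size."""
    def __init__(self, batch_size):
        self.batch_size = batch_size


_context_obj = None  # module-level singleton, created exactly once


def context_initialize(batch_size):
    """Create the global Context; called from init_process_group at startup."""
    global _context_obj
    if _context_obj is not None:
        raise RuntimeError("context_initialize() was already called")
    _context_obj = Context(batch_size)


def current_context():
    """Accessor that all later code paths use instead of re-initializing."""
    if _context_obj is None:
        raise RuntimeError("context_initialize() must be called first "
                           "(normally from init_process_group)")
    return _context_obj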
adaptdl/adaptdl/torch/__init__.py
Outdated
        world_size)


# Initialize Context module.
adaptdl.torch.data.context_initialize(batch_size=32)
How do you plan to make batch_size available in init_process_group?
We were just giving it a default value here for the global Context; users can replace the batch_size with their own value when they use it. But thanks for pointing this out: we can also remove the default value from here and supply it directly in the definition of the Context class. Users can still pass their own value, which then replaces the default.
Please also check our new commit to review the modification.
Many thanks.
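A hedged sketch of the default-value handling described in this reply: the default batch size lives on the Context class rather than at the call site, and a user-supplied value overrides it. DEFAULT_BATCH_SIZE and the constructor signature below are assumptions for illustration, not AdaptDL's actual code.

class Context:
    DEFAULT_BATCH_SIZE = 32  # class-level default instead of a literal at the call site

    def __init__(self, batch_size=None):
        # A user-provided batch_size replaces the class default.
        if batch_size is None:
            batch_size = self.DEFAULT_BATCH_SIZE
        self.batch_size = batch_size


ctx_default = Context()             # falls back to Context.DEFAULT_BATCH_SIZE
ctx_user = Context(batch_size=128)  # user value overrides the default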
adaptdl/adaptdl/torch/data.py
Outdated
    @property
    def current_batch_size(self):
-       return (self.current_local_bsz * (self.accumulation_steps + 1) *
+       return (self._context.get_batch_size() * (self._context.get_accum_steps() + 1) *
This is a bit inefficient: it can potentially invoke the goodput optimization twice, once per Context._get_local_bsz call. You could probably use _context.current_local_bsz and accumulation_steps instead when you just want to query the values without potentially triggering the optimization.
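A sketch of the distinction this comment draws, with hypothetical method bodies: _get_local_bsz (and the getters that call it) may re-run the goodput optimization, while current_local_bsz and accumulation_steps only read cached values. The trailing factor of the original expression is elided in the diff, so num_replicas below is an assumption.

class Context:
    def __init__(self):
        self._local_bsz = 32
        self._accum_steps = 0

    def _get_local_bsz(self):
        # Potentially expensive: may re-run the goodput optimization
        # before returning fresh values (optimization body omitted).
        return self._local_bsz, self._accum_steps

    @property
    def current_local_bsz(self):
        return self._local_bsz      # cheap read, never triggers optimization

    @property
    def accumulation_steps(self):
        return self._accum_steps    # cheap read, never triggers optimization


def current_batch_size(ctx, num_replicas):
    # Query cached values instead of the getters that might optimize twice.
    return ctx.current_local_bsz * (ctx.accumulation_steps + 1) * num_replicas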
Thank you Omkar, please check the new commit, where the triggering issue has been fixed. 🤝
adaptdl/adaptdl/torch/data.py
Outdated
    self.sampler.set_epoch(
        epoch, index=self._elastic.current_index)
-   self.batch_sampler.batch_size = self._elastic._sync_local_bsz()
+   self.batch_sampler.batch_size = self._elastic._context.get_batch_size()
_sync_local_bsz cannot be replaced by Context.get_batch_size, because _sync_local_bsz also does a broadcast from rank 0 to propagate the local batch size it calculated to the rest of the replicas, and it also acts as a barrier. Context.get_batch_size could cause the local batch sizes on the replicas to go out of sync.
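A hedged sketch of why _sync_local_bsz cannot be replaced by a plain getter: it broadcasts the value computed on rank 0 so every replica ends up with the same local batch size, and the collective also acts as a barrier. The body below is illustrative, not AdaptDL's actual implementation.

import torch
import torch.distributed as dist


def _sync_local_bsz(context):
    # Only rank 0 computes (or re-optimizes) the local batch size.
    if dist.get_rank() == 0:
        local_bsz = context.get_batch_size()
    else:
        local_bsz = 0
    tensor = torch.tensor([local_bsz], dtype=torch.long)
    # Broadcast from rank 0 to all replicas; the collective also
    # synchronizes them, acting as a barrier.
    dist.broadcast(tensor, src=0)
    return int(tensor.item())

Calling context.get_batch_size() independently on every replica could return different values if replicas re-optimize at different times, which is exactly the out-of-sync risk the comment points out; the broadcast avoids it.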
Thanks Omkar, please check the new commit, where the replacement issue is fixed.