Perf: use fused Adam optimizer (#4463)

caic99 · web-flow · commit 104fc365ed8d · 2024-12-17T22:37:03.000Z
This PR sets the Adam optimizer to use the `fused=True` parameter. In the profiling results shown below, this change gives a 2.75x speedup of the optimizer update (22 ms vs. 8 ms) and a ~3% overall speedup (922 ms vs. 892 ms). The benchmark case is training a DPA-2 Q3 release model. Please note that absolute times may differ between steps.

<details><summary>Before</summary> <p>

![image](https://github.com/user-attachments/assets/d6b05a1d-6e6c-478d-921f-c497718bc551)

</p> </details>

<details><summary>After</summary> <p>

![image](https://github.com/user-attachments/assets/b216b919-094c-441f-96a7-146e1e3db483)

</p> </details>

[Ref](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html):

> The foreach and fused implementations are typically faster than the for-loop, single-tensor implementation, with **fused being theoretically fastest** with both vertical and horizontal fusion. As such, if the user has not specified either flag (i.e., when foreach = fused = None), we will attempt defaulting to the foreach implementation when the tensors are all on CUDA. Why not fused? Since the fused implementation is relatively new, we want to give it sufficient bake-in time.

## Summary by CodeRabbit

- **Bug Fixes**
  - Improved optimizer performance during training by modifying the initialization of the Adam optimizer.
- **Documentation**
  - Updated method signature for clarity in the `Trainer` class.

diff --git a/deepmd/pt/train/training.py b/deepmd/pt/train/training.py
@@ -579,7 +579,7 @@ def warm_up_linear(step, warmup_steps):
         # author: iProzd
         if self.opt_type == "Adam":
             self.optimizer = torch.optim.Adam(
-                self.wrapper.parameters(), lr=self.lr_exp.start_lr
+                self.wrapper.parameters(), lr=self.lr_exp.start_lr, fused=True
             )
             if optimizer_state_dict is not None and self.restart_training:
                 self.optimizer.load_state_dict(optimizer_state_dict)
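
Note on device support: the diff above passes `fused=True` unconditionally. Depending on the PyTorch version, the fused Adam kernel may only accept floating-point parameters on supported devices (historically CUDA), so a guarded construction can be useful for CPU runs. The following is a minimal sketch, not code from this PR: `build_adam` is a hypothetical helper name, and the "all parameters on CUDA" check is a deliberately conservative assumption rather than PyTorch's exact rule. It also assumes a PyTorch version whose `torch.optim.Adam` accepts the `fused` keyword.

```python
import torch


def build_adam(params, lr, prefer_fused=True):
    """Create an Adam optimizer, enabling the fused kernel only when it looks safe.

    Hypothetical helper for illustration; not part of deepmd-kit.
    """
    params = list(params)  # materialize: parameters() returns a one-shot generator
    # Conservative assumption: enable fused only for floating-point CUDA tensors.
    use_fused = (
        prefer_fused
        and len(params) > 0
        and all(p.is_cuda and torch.is_floating_point(p) for p in params)
    )
    if use_fused:
        return torch.optim.Adam(params, lr=lr, fused=True)
    # Otherwise let PyTorch choose its default (for-loop or foreach) implementation.
    return torch.optim.Adam(params, lr=lr)


# Usage: on a CPU-only model this falls back to the default implementation.
model = torch.nn.Linear(4, 2)
optimizer = build_adam(model.parameters(), lr=1e-3)
```

Whether such a guard is needed depends on the PyTorch versions and devices the project targets. If training always runs on CUDA, the unconditional `fused=True` in the diff is simpler.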
