Async-DiLoCo
each worker updates the global parameters asynchronously, as soon as it has finished its local SGD steps
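A minimal sketch of that update rule in plain Python/NumPy, assuming a single server process that owns the global weights; the class name, the plain gradient-descent outer step, and the learning-rate value are illustrative, not taken from the paper's code.

```python
import numpy as np

class AsyncServer:
    """Minimal sketch of the async update rule: the server owns the global
    parameters and applies each worker's pseudo-gradient the moment that
    worker finishes its local steps, with no barrier across workers."""

    def __init__(self, params, outer_lr=0.7):
        self.params = np.asarray(params, dtype=np.float64).copy()
        self.outer_lr = outer_lr  # illustrative value

    def worker_finished(self, start_params, end_params):
        # Pseudo-gradient = parameters the worker started from minus the
        # parameters it ended with after its local SGD steps; other workers
        # may have already moved self.params in the meantime (staleness).
        pseudo_grad = np.asarray(start_params) - np.asarray(end_params)
        # Plain gradient-descent outer step (the notes below discuss
        # Nesterov momentum as the better outer optimiser).
        self.params -= self.outer_lr * pseudo_grad
        return self.params.copy()  # the worker starts its next round from here
```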
-
straggler effect: when running local SGD on heterogeneous hardware, faster devices sit idle waiting for slower ones to catch up, undermining the overall efficiency of the system
-
Method:
- data shards are sampled for each worker only when they are "under-sampled"; this ensures that the data shard with slower progress is more likely to be sampled for training, thereby balancing the learning process across shards (one way to realise this is sketched after this list)
- each data shard is assigned its own learning-rate schedule, since in practice async training may conclude with a different final iteration count for each data shard
- a grace period for model synchronisation: before updating the global model parameters, a slight delay is allowed so that pseudo-gradients from other workers can arrive
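A hedged sketch of one way the shard-sampling rule could look, assuming the scheduler tracks how many local steps each shard has received; the weighting scheme here is illustrative rather than the paper's exact rule.

```python
import random

def pick_shard(shard_steps):
    """Pick a data shard for a newly idle worker, favouring shards whose
    training has progressed the least (a hedged reading of "under-sampled").

    shard_steps: dict mapping shard id -> local steps trained on it so far.
    """
    max_steps = max(shard_steps.values())
    # Weight each shard by how far it lags behind the most-trained shard;
    # the +1 keeps every shard sampleable even when all are equally trained.
    weights = {s: (max_steps - n) + 1 for s, n in shard_steps.items()}
    shards, w = zip(*weights.items())
    return random.choices(shards, weights=w, k=1)[0]
```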
-
Results:
- async local-SGD takes more iterations to converge than its synchronous counterpart, despite updating the global model parameters more frequently. This suggests that the inherent staleness from sequentially applying near-simultaneous updates leads to a performance drop.
- AdamW for the inner optimisation and Nesterov momentum for the outer optimisation yields the best results for async training. Note, however, that AdamW for the outer optimisation performs worse, likely because it introduces a normalisation effect that can be counterproductive: the pseudo-gradients gathered from the workers tend to be larger than true gradients (see the sketch after this list).
- Polynomial discounting of pseudo-gradients shows only marginal benefits.
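To make the optimiser split concrete, here is a hedged PyTorch sketch of the inner/outer configuration described above; the stand-in model, learning rates, and helper function are illustrative, not the paper's implementation.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for one worker's language model

# Inner optimiser: AdamW driving each worker's local steps (illustrative lr).
inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Outer optimiser: SGD with Nesterov momentum applied to pseudo-gradients on
# the server (illustrative hyperparameters). An outer AdamW would additionally
# normalise the pseudo-gradients, which the note above suggests is harmful.
global_params = [p.detach().clone().requires_grad_(True) for p in model.parameters()]
outer_opt = torch.optim.SGD(global_params, lr=0.7, momentum=0.9, nesterov=True)

def outer_step(local_params):
    """Apply one worker's pseudo-gradient (global - local) with the outer optimiser."""
    for p_global, p_local in zip(global_params, local_params):
        p_global.grad = p_global.detach() - p_local.detach()
    outer_opt.step()
    outer_opt.zero_grad()
    # The worker would then copy `global_params` back into its model and
    # resume local AdamW steps from the new global weights.
```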
-
Proposed Solutions:
-
Delayed Nesterov Updates (addresses the momentum challenge in outer optimisation): the Nesterov momentum update is applied only intermittently, once every fixed number of server updates. Between momentum updates, the incoming pseudo-gradients are aggregated into a buffer and applied with plain gradient descent, so the model parameters are still updated at every server step while the momentum update is delayed.
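A minimal NumPy sketch of a delayed-momentum outer optimiser along these lines; the buffering scheme, the `delay` value, and the exact placement of the momentum term are a hedged reading of the description above, not the paper's exact algorithm.

```python
import numpy as np

class DelayedNesterov:
    """Illustrative outer optimiser: parameters move by plain gradient descent
    at every server step, while the Nesterov momentum update is applied only
    every `delay` server steps on the buffered pseudo-gradients."""

    def __init__(self, params, lr=0.7, momentum=0.9, delay=4):
        self.params = np.asarray(params, dtype=np.float64).copy()
        self.velocity = np.zeros_like(self.params)
        self.buffer = np.zeros_like(self.params)  # aggregated pseudo-gradients
        self.lr, self.momentum, self.delay = lr, momentum, delay
        self.t = 0

    def step(self, pseudo_grad):
        self.t += 1
        self.buffer += pseudo_grad
        # Parameters are updated at every server step via plain gradient descent.
        self.params -= self.lr * pseudo_grad
        if self.t % self.delay == 0:
            # Delayed momentum update on the averaged buffer; the buffered
            # gradients were already applied above, so only the momentum
            # term moves the parameters here.
            avg = self.buffer / self.delay
            self.velocity = self.momentum * self.velocity + avg
            self.params -= self.lr * self.momentum * self.velocity
            self.buffer[:] = 0.0
        return self.params
```
-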
Dynamic Local Updates: slower workers execute fewer local steps, aligning completion times across workers. Coupled with a grace period for global parameter updates, this reduces the impact of stale pseudo-gradients.
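A small sketch of how per-worker step budgets might be assigned from measured device speeds; the function name, the speed units, and the constants are illustrative assumptions.

```python
def local_step_budget(worker_speeds, base_steps=128, min_steps=16):
    """Assign each worker a number of local steps proportional to its measured
    speed (steps/sec), so all workers finish at roughly the same time.
    `base_steps` is the budget given to the fastest worker (illustrative)."""
    fastest = max(worker_speeds.values())
    return {
        w: max(min_steps, round(base_steps * s / fastest))
        for w, s in worker_speeds.items()
    }

# Example: a worker running at half speed does half the local steps.
print(local_step_budget({"tpu-a": 100.0, "gpu-b": 50.0, "cpu-c": 10.0}))
# -> {'tpu-a': 128, 'gpu-b': 64, 'cpu-c': 16}  (the slowest worker is floored at min_steps)
```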
-
Evaluating Delayed Nesterov (DN) and Dynamic Local Updates (DyLU):
- DN + DyLU significantly reduces perplexity, surpassing synchronous DiLoCo's performance as a function of update steps
- DN + DyLU consistently excels across all heterogeneity levels
- As the number of workers increases, the benefit of Local-SGD training diminishes
- both sync and async local-SGD outperform fine-tuning on a single worker with an increased batch size
- DN + DyLU demonstrates consistent efficacy across various model sizes
-