<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>DeepSpeed Configuration: A Comprehensive Guide to ds_config.json</title>
<meta name="description" content="A detailed guide to DeepSpeed's configuration JSON (ds_config.json), covering batch size, optimizers, FP16/BFLOAT16 training, ZeRO optimizations, and more for efficient deep learning.">
</head>
<body>
<header>
<h1>DeepSpeed Configuration: A Comprehensive Guide to <code>ds_config.json</code></h1>
<p>Unlock the Power of Efficient Deep Learning with Detailed Configuration Options</p>
</header>
<main>
<section>
<h2>Introduction to DeepSpeed and its Configuration</h2>
<p>
<a href="https://www.deepspeed.ai/" target="_blank" rel="noopener noreferrer">DeepSpeed</a> is a powerful deep learning optimization library designed to make distributed training more accessible, efficient, and effective. A key component of DeepSpeed is its configuration file, <code>ds_config.json</code>, which allows users to fine-tune various aspects of the training process. This article provides an in-depth look at the parameters within this configuration file, helping you optimize your deep learning workflows.
</p>
<p>
This configuration file is your central hub for controlling how DeepSpeed operates, impacting everything from memory usage to communication overhead. By understanding and utilizing these parameters effectively, you can significantly improve the performance and scalability of your deep learning models.
</p>
</section>
<section>
<h2>Key Configuration Sections</h2>
<p>The <code>ds_config.json</code> file is organized into several key sections, each controlling a different aspect of DeepSpeed's functionality. Let's explore these sections in detail:</p>
<ul>
<li><a href="#batch-size-parameters">Batch Size Related Parameters</a></li>
<li><a href="#optimizer-parameters">Optimizer Parameters</a></li>
<li><a href="#scheduler-parameters">Scheduler Parameters</a></li>
<li><a href="#fp16-bfloat16-training">FP16 and BFLOAT16 Training Options</a></li>
<li><a href="#zero-optimizations">ZeRO Optimizations</a></li>
<li><a href="#data-efficiency">Data Efficiency</a></li>
<li><a href="#other-settings">Other Important Settings</a></li>
</ul>
</section>
<section id="batch-size-parameters">
<h2>Batch Size Related Parameters</h2>
<p>
Managing batch size is crucial for both training speed and memory utilization. DeepSpeed allows you to configure batch size parameters directly in the <code>ds_config.json</code> file.
</p>
<ul>
<li>
<code>train_batch_size</code>: The effective training batch size, i.e., the total number of samples that contribute to one optimizer step across all GPUs. If any two of the three batch-size parameters are provided, DeepSpeed infers the third automatically.
<br><b>Example</b>: <code>32</code>
</li>
<li>
<code>train_micro_batch_size_per_gpu</code>: The batch size processed by a single GPU in one forward/backward pass (i.e., before gradient accumulation).
<br><b>Example</b>: Omit if <code>train_batch_size</code> and <code>gradient_accumulation_steps</code> are provided.
</li>
<li>
<code>gradient_accumulation_steps</code>: The number of steps to accumulate gradients before updating model weights. This reduces communication overhead and can simulate larger batch sizes.
<br><b>Example</b>: <code>1</code>
</li>
</ul>
<p>
<b>Note:</b> <code>train_batch_size</code> must equal <code>train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs</code>.
</p>
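<p>
For illustration, here is a hedged sketch of a consistent set of batch-size settings. It assumes a run on 8 GPUs; with a micro batch of 4 per GPU and 1 accumulation step, the effective batch size is 4 * 1 * 8 = 32.
</p>
<pre><code>
"train_batch_size": 32,
"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 1
</code></pre>
<p>
Equivalently, any one of the three values may be omitted and DeepSpeed will derive it from the other two and the number of GPUs.
</p>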
</section>
<section id="optimizer-parameters">
<h2>Optimizer Parameters</h2>
<p>
DeepSpeed supports various optimizers, including Adam, AdamW, and its own optimized versions like OneBitAdam and OneBitLamb. Configuration is handled through the <code>optimizer</code> section.
</p>
<pre><code>
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.001,
"betas": [0.8, 0.999],
"eps": 1e-8,
"weight_decay": 3e-7
}
}
</code></pre>
<ul>
<li><code>type</code>: The optimizer name. DeepSpeed natively supports the Adam, AdamW, OneBitAdam, Lamb, and OneBitLamb optimizers, and can import other optimizers from <a href="https://pytorch.org/docs/stable/optim.html" target="_blank" rel="noopener noreferrer">torch.optim</a> by name (see the sketch after this list).</li>
<li><code>params</code>: A dictionary of parameters for the optimizer. These parameters should match the constructor signature of the chosen optimizer.
<ul>
<li><code>lr</code>: Learning rate.</li>
<li><code>betas</code>: Coefficients used for computing running averages of the gradient and its square.</li>
<li><code>eps</code>: Term added to the denominator to improve numerical stability.</li>
<li><code>weight_decay</code>: Weight decay (L2 penalty).</li>
</ul>
</li>
</ul>
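<p>
For example, here is a hedged sketch that pulls an optimizer from <code>torch.optim</code> by name. It assumes your installed PyTorch provides <code>torch.optim.SGD</code>; the values are illustrative, not recommendations.
</p>
<pre><code>
"optimizer": {
  "type": "SGD",
  "params": {
    "lr": 0.01,
    "momentum": 0.9
  }
}
</code></pre>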
<p>
DeepSpeed's optimized optimizers, such as 1-bit Adam and 1-bit LAMB, offer additional parameters to control compression and communication. Refer to the <a href="https://www.deepspeed.ai/tutorials/onebit-adam/" target="_blank" rel="noopener noreferrer">OneBitAdam tutorial</a> and <a href="https://www.deepspeed.ai/tutorials/onebit-lamb/" target="_blank" rel="noopener noreferrer">OneBitLAMB tutorial</a> for more details.
</p>
</section>
<section id="scheduler-parameters">
<h2>Scheduler Parameters</h2>
<p>
DeepSpeed integrates with learning rate schedulers to adjust the learning rate during training. Configure this using the <code>scheduler</code> section in <code>ds_config.json</code>.
</p>
<pre><code>
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 0.001,
"warmup_num_steps": 1000
}
}
</code></pre>
<ul>
<li><code>type</code>: The scheduler name. See a list of supported <a href="https://deepspeed.readthedocs.io/en/latest/schedulers.html" target="_blank" rel="noopener noreferrer">schedulers here</a>.</li>
<li><code>params</code>: A dictionary of parameters to instantiate the scheduler. The parameter names should match the scheduler constructor signature.
<ul>
<li><code>warmup_min_lr</code>: Minimum learning rate during warmup.</li>
<li><code>warmup_max_lr</code>: Maximum learning rate during warmup.</li>
<li><code>warmup_num_steps</code>: Number of warmup steps.</li>
</ul>
</li>
</ul>
<p>DeepSpeed calls the <code>step()</code> method of the scheduler at every training step when <code>model_engine.step()</code> is executed.</p>
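<p>
If you also want the learning rate to decay after warmup, the <code>WarmupDecayLR</code> scheduler can be configured in the same way. A hedged sketch follows; <code>total_num_steps</code> is an assumed length for the full run, and the remaining parameters mirror the <code>WarmupLR</code> example above.
</p>
<pre><code>
"scheduler": {
  "type": "WarmupDecayLR",
  "params": {
    "warmup_min_lr": 0,
    "warmup_max_lr": 0.001,
    "warmup_num_steps": 1000,
    "total_num_steps": 100000
  }
}
</code></pre>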
</section>
<section id="fp16-bfloat16-training">
<h2>FP16 and BFLOAT16 Training Options</h2>
<p>
Mixed precision training (FP16 and BFLOAT16) can significantly accelerate training and reduce memory consumption. DeepSpeed offers dedicated sections for configuring these options.
</p>
<h3>FP16 Training</h3>
<pre><code>
"fp16": {
"enabled": true,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
}
</code></pre>
<ul>
<li><code>enabled</code>: Enables FP16 training.</li>
<li><code>auto_cast</code>: Automatically casts inputs to FP16 format.</li>
<li><code>loss_scale</code>: The loss scale for FP16 training. A value of <code>0</code> enables dynamic loss scaling; a positive value fixes the loss scale (see the sketch after this list).</li>
<li><code>initial_scale_power</code>: The power of the initial dynamic loss scale value (loss scale = 2<sup>initial_scale_power</sup>; with the value 16 shown above, the initial scale is 2<sup>16</sup> = 65536).</li>
<li><code>loss_scale_window</code>: Window over which to raise/lower the dynamic loss scale value.</li>
<li><code>hysteresis</code>: Delay shift in dynamic loss scaling, i.e., the number of overflow steps tolerated before the loss scale is lowered.</li>
<li><code>min_loss_scale</code>: Minimum dynamic loss scale value.</li>
</ul>
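<p>
If you prefer to pin the loss scale rather than let DeepSpeed adjust it dynamically, set <code>loss_scale</code> to a positive value. A minimal sketch, where <code>128</code> is an illustrative value rather than a recommendation:
</p>
<pre><code>
"fp16": {
  "enabled": true,
  "loss_scale": 128
}
</code></pre>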
<h3>BFLOAT16 Training</h3>
<pre><code>
"bf16": {
"enabled": true
}
</code></pre>
<ul>
<li><code>enabled</code>: Enables BFLOAT16 training. Note: Requires hardware support (e.g., NVIDIA Ampere-generation or newer GPUs such as the A100).</li>
</ul>
<p><b>Important:</b> FP16 and BFLOAT16 modes cannot be combined with the AMP mode. </p>
</section>
<section id="zero-optimizations">
<h2>ZeRO Optimizations</h2>
<p>
ZeRO (Zero Redundancy Optimizer) is a key feature of DeepSpeed, designed to reduce memory footprint and enable training of larger models.
</p>
<pre><code>
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients": true
}
</code></pre>
<ul>
<li><code>stage</code>: Chooses the ZeRO optimization stage (0, 1, 2, or 3).
<ul>
<li>0: Disabled</li>
<li>1: Optimizer state partitioning</li>
<li>2: Optimizer+Gradient state partitioning</li>
<li>3: Optimizer+Gradient+Parameter partitioning</li>
</ul>
</li>
<li><code>allgather_partitions</code>: Chooses between an allgather collective and a series of broadcast collectives for gathering updated parameters from all GPUs at the end of each step.</li>
<li><code>allgather_bucket_size</code>: Number of elements gathered at a time.</li>
<li><code>overlap_comm</code>: Overlaps gradient reduction with backward computation.</li>
<li><code>reduce_scatter</code>: Uses reduce scatter instead of allreduce for averaging gradients.</li>
<li><code>reduce_bucket_size</code>: Number of elements reduced at a time.</li>
<li><code>contiguous_gradients</code>: Copies gradients to a contiguous buffer.</li>
<li><code>offload_param</code>: Enables offloading of model parameters to CPU or NVMe (ZeRO stage 3 only; see the subsection below).</li>
<li><code>offload_optimizer</code>: Enables offloading of optimizer state to CPU or NVMe, and of the optimizer computation to the CPU (see the subsection below).</li>
</ul>
<p>
Each ZeRO stage offers different memory savings and communication trade-offs. Stage 3, for example, offers the highest memory savings but requires more communication. Parameter and optimizer offloading can further reduce GPU memory usage.
</p>
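<p>
For reference, here is a hedged sketch of a ZeRO stage 3 configuration that offloads both parameters and optimizer state to CPU memory; the bucket size is an illustrative value and should be tuned for your model and hardware.
</p>
<pre><code>
"zero_optimization": {
  "stage": 3,
  "overlap_comm": true,
  "contiguous_gradients": true,
  "reduce_bucket_size": 5e8,
  "offload_param": {
    "device": "cpu",
    "pin_memory": true
  },
  "offload_optimizer": {
    "device": "cpu",
    "pin_memory": true
  }
}
</code></pre>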
<h3>Parameter Offloading</h3>
<pre><code>
"offload_param": {
"device": "cpu",
"pin_memory": true
}
</code></pre>
<ul>
<li><code>device</code>: Device memory to offload model parameters (cpu or nvme).</li>
<li><code>nvme_path</code>: Filesystem path for the NVMe device (see the NVMe sketch after this list).</li>
<li><code>pin_memory</code>: Offload to page-locked CPU memory.</li>
</ul>
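<p>
To offload parameters to an NVMe drive instead of CPU memory, set <code>device</code> to <code>nvme</code> and supply <code>nvme_path</code>. A hedged sketch, where <code>/local_nvme</code> is an assumed mount point on your machine:
</p>
<pre><code>
"offload_param": {
  "device": "nvme",
  "nvme_path": "/local_nvme",
  "pin_memory": true
}
</code></pre>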
<h3>Optimizer Offloading</h3>
<pre><code>
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
</code></pre>
<ul>
<li><code>device</code>: Device memory to offload optimizer state (cpu or nvme).</li>
<li><code>nvme_path</code>: Filesystem path for NVMe device.</li>
<li><code>pin_memory</code>: Offload to page-locked CPU memory.</li>
<li><code>ratio</code>: The fraction of parameters whose optimizer step (weight update) is performed on the CPU.</li>
</ul>
</section>
<section id="data-efficiency">
<h2>Data Efficiency</h2>
<p>DeepSpeed's data efficiency library provides techniques that aim to reach a target model quality with less data and compute, such as curriculum learning and random layerwise token dropping (random-LTD).</p>
<pre><code>
"data_efficiency": {
"enabled": true,
"curriculum_learning": {
"enabled": true,
"curriculum_type": "seqlen",
"min_difficulty": 8,
"max_difficulty": 1024,
"schedule_type": "fixed_linear",
"schedule_config": {
"total_curriculum_step": 40000,
"difficulty_step": 8
}
}
}
</code></pre>
<ul>
<li><code>enabled</code>: Enables data efficiency techniques.</li>
<li><code>curriculum_learning</code>: Configures curriculum learning, which starts training on easier samples (e.g., shorter sequences when <code>curriculum_type</code> is <code>seqlen</code>) and gradually increases difficulty.</li>
<li><code>random_ltd</code>: Configures random layerwise token dropping.</li>
</ul>
<p>See the <a href="https://www.deepspeed.ai/tutorials/data-efficiency/" target="_blank" rel="noopener noreferrer">DeepSpeed Data Efficiency Library tutorial</a> for further details.</p>
</section>
<section id="other-settings">
<h2>Other Important Settings</h2>
<p>
The <code>ds_config.json</code> file includes several other important settings for logging, autotuning, and more.
</p>
<h3>Logging</h3>
<pre><code>
"steps_per_print": 10,
"wall_clock_breakdown": false
</code></pre>
<ul>
<li><code>steps_per_print</code>: Print progress report every N training steps.</li>
<li><code>wall_clock_breakdown</code>: Enables timing of the forward/backward/update phases.</li>
</ul>
<h3>Autotuning</h3>
<pre><code>
"autotuning": {
"enabled": false,
"metric": "throughput",
"start_profile_step": 3,
"end_profile_step": 5
}
</code></pre>
<ul>
<li><code>enabled</code>: Enables the autotuner.</li>
<li><code>metric</code>: The performance metric to use (latency, throughput, or FLOPS).</li>
<li><code>start_profile_step</code>: The step at which to start profiling.</li>
<li><code>end_profile_step</code>: The step at which to end profiling.</li>
</ul>
<h3>Communication Logging</h3>
<pre><code>
"comms_logger": {
"enabled": true,
"verbose": false,
"prof_all": true
}
</code></pre>
<ul>
<li><code>enabled</code>: Whether communication logging is enabled.</li>
<li><code>verbose</code>: Whether to immediately print every communication operation.</li>
<li><code>prof_all</code>: Whether to profile all operations.</li>
</ul>
</section>
<section>
<h2>Conclusion</h2>
<p>
The <code>ds_config.json</code> file is a powerful tool for configuring and optimizing DeepSpeed for your specific deep learning workloads. By understanding the various parameters and sections within this file, you can unlock the full potential of DeepSpeed and achieve significant performance gains.
</p>
<p>
Experiment with these settings to find the optimal configuration for your model, hardware, and training data. DeepSpeed's flexibility allows you to tailor the training process to your specific needs, resulting in faster training times and the ability to tackle larger, more complex models.
</p>
</section>
<section>
<h2>Further Reading</h2>
<ul>
<li><a href="https://www.deepspeed.ai/tutorials/" target="_blank" rel="noopener noreferrer">DeepSpeed Tutorials</a></li>
<li><a href="https://github.com/deepspeed-ai/DeepSpeed" target="_blank" rel="noopener noreferrer">DeepSpeed Github</a></li>
</ul>
</section>
</main>
<footer>
<p>© 2024 DeepSpeed Configuration Guide</p>
</footer>
</body>
</html>