Training a model with Megatron-LM for ROCm#

2025-06-20

18 min read time

Applies to Linux

The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM, designed to enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI workloads. It is purpose-built to support models like Llama, DeepSeek, and Mixtral, enabling developers to train next-generation AI models more efficiently.

AMD provides a ready-to-use Docker image for MI300X series accelerators with essential components pre-installed, including PyTorch, ROCm libraries, and Megatron-LM utilities. The image contains the following software components to accelerate training workloads:

Software component  | Version
--------------------|--------------------
ROCm                | 6.3.4
PyTorch             | 2.8.0a0+gite2f9759
Python              | 3.12 or 3.10
Transformer Engine  | 1.13.0+bb061ade
Flash Attention     | 3.0.0
hipBLASLt           | 0.13.0-4f18bf6
Triton              | 3.3.0
RCCL                | 2.22.3

Megatron-LM provides the following key features to train large language models efficiently:

  • Transformer Engine (TE)

  • APEX

  • GEMM tuning

  • Torch.compile

  • 3D parallelism: TP + SP + CP

  • Distributed optimizer

  • Flash Attention (FA) 3

  • Fused kernels

  • Pre-training

The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.

Supported models#

The following models are supported for training performance benchmarking with Megatron-LM and ROCm. Some instructions, commands, and training recommendations in this documentation vary by model – choose the model variant you want to train before getting started.

Model       | Model variant
------------|----------------------------------------------------------------------
Meta Llama  | Llama 3.3 70B, Llama 3.1 8B, Llama 3.1 70B, Llama 2 7B, Llama 2 70B
DeepSeek    | DeepSeek-V3, DeepSeek-V2-Lite
Mistral AI  | Mixtral 8x7B, Mixtral 8x22B

Note

Some models, such as Llama, require an external license agreement through a third party (for example, Meta).

Performance measurements#

To evaluate performance, the Performance results with AMD ROCm software page provides reference throughput and latency measurements for training popular AI models.

Important

The performance data presented in Performance results with AMD ROCm software only reflects the latest version of this training benchmarking environment. The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.

System validation#

Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before starting training.

To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.
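
For example, one of the settings covered in that guide is NUMA auto-balancing. A quick way to inspect it and, if needed, turn it off (a minimal sketch that assumes root access on the Linux host – follow the guide for the complete procedure) is:

cat /proc/sys/kernel/numa_balancing                       # 1 = enabled, 0 = disabled

sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'     # disable NUMA auto-balancing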

Environment setup#

Use the following instructions to set up the environment, configure the script to train models, and reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker image.

Download the Docker image#

  1. Use one of the following commands to pull the Docker image from Docker Hub, depending on whether you want the Python 3.12 or Python 3.10 variant.

    docker pull rocm/megatron-lm:v25.5_py312   # Python 3.12 variant
    
    docker pull rocm/megatron-lm:v25.5_py310   # Python 3.10 variant
    
  2. Launch the Docker container. Use the image tag that matches the variant you pulled (Python 3.12 shown here).

    docker run -it --device /dev/dri --device /dev/kfd --device /dev/infiniband --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 64G --name megatron_training_env rocm/megatron-lm:v25.5_py312
    
  3. Use these commands if you exit the megatron_training_env container and need to return to it.

    docker start megatron_training_env
    docker exec -it megatron_training_env bash
    

The Docker container includes a pre-installed, verified version of the ROCm Megatron-LM development branch ROCm/Megatron-LM, including necessary training scripts.
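
To confirm the component versions listed earlier from inside the running container, a quick check (a sketch – exact paths and version strings can differ between image releases) is:

cat /opt/rocm/.info/version                            # ROCm version
python --version                                       # Python version
python -c "import torch; print(torch.__version__)"     # PyTorch version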

Configuration#

Update the configuration script for the model you want to train, found in ROCm/Megatron-LM, to configure your training run. Options can also be passed as command line arguments as described in Run training.

  • Llama 3.3 and Llama 3.1: train_llama3.sh in the examples/llama directory

  • Llama 2: train_llama2.sh in the examples/llama directory

  • DeepSeek-V3: train_deepseekv3.sh in the examples/deepseek_v3 directory

  • DeepSeek-V2-Lite: train_deepseekv2.sh in the examples/deepseek_v2 directory

  • Mixtral 8x7B and 8x22B: train_mixtral_moe.sh in the examples/mixtral directory

Note

See Key options for more information on configuration options.

Network interface#

Update the network interface in the script to match your system’s network interface. To find your network interface, run the following (outside of any Docker container):

ip a

Look for an active interface that has an IP address in the same subnet as your other nodes. Then, update the following variables in the script, for example:

export NCCL_SOCKET_IFNAME=ens50f0np0

export GLOO_SOCKET_IFNAME=ens50f0np0
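
If the output of ip a is long, a brief listing of only the interfaces that are up (assuming a reasonably recent iproute2) can make the right one easier to spot:

ip -brief -4 addr show up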

Tokenizer#

You can assign the path of an existing tokenizer to the TOKENIZER_MODEL variable as shown in the following examples. If the tokenizer is not found locally, it is downloaded automatically when publicly available.

If you do not have the Llama 3.3 tokenizer locally, you need your personal Hugging Face access token HF_TOKEN to download the tokenizer. See Llama-3.3-70B-Instruct. After you are authorized, use your HF_TOKEN to download the tokenizer and set the TOKENIZER_MODEL variable to the tokenizer path.

export HF_TOKEN=<Your personal Hugging Face access token>

The training scripts use the HuggingFaceTokenizer. Set TOKENIZER_MODEL to the Hugging Face model path for the model you're training, for example:

TOKENIZER_MODEL="meta-llama/Llama-3.3-70B-Instruct"   # Llama 3.3 70B

TOKENIZER_MODEL="meta-llama/Llama-3.1-8B"             # Llama 3.1 8B

TOKENIZER_MODEL="meta-llama/Llama-3.1-70B"            # Llama 3.1 70B

TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3"             # DeepSeek-V3

TOKENIZER_MODEL="deepseek-ai/DeepSeek-V2-Lite"        # DeepSeek-V2-Lite

For Llama 2, the training script uses either the Llama2Tokenizer or HuggingFaceTokenizer by default.

For Mixtral, download the Mixtral tokenizer.

mkdir tokenizer
cd tokenizer
export HF_TOKEN=<Your personal Hugging Face access token>
wget --header="Authorization: Bearer $HF_TOKEN" -O ./tokenizer.model https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/resolve/main/tokenizer.model

Use the HuggingFaceTokenizer and set TOKENIZER_MODEL to the path of the downloaded tokenizer.

TOKENIZER_MODEL=tokenizer/tokenizer.model

Dataset options#

You can use either mock data or real data for training.

  • Mock data can be useful for testing and validation. Use the MOCK_DATA variable to toggle between mock and real data. The default value is 1 for enabled.

    MOCK_DATA=1
    
  • If you’re using a real dataset, update the DATA_PATH variable to point to the ___location of your dataset.

    MOCK_DATA=0
    
    DATA_PATH="/data/bookcorpus_text_sentence"  # Change to where your dataset is stored
    

    Ensure that the files are accessible inside the Docker container, for example by bind-mounting the dataset directory as shown below.
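
A minimal sketch, assuming the dataset lives under /data on the host (adjust the paths to your setup):

# When launching the container, add a bind mount for the dataset directory:
#   docker run ... -v /data:/data ...

# Then, inside the container, confirm the files are visible:
ls /data/bookcorpus_text_sentence.*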

Download the dataset#

For Llama models, use the prepare_dataset.sh script to prepare your dataset. To download the dataset, set the DATASET variable to the dataset you’d like to use. Three datasets are supported: DATASET=wiki, DATASET=fineweb, and DATASET=bookcorpus.

DATASET=wiki TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for wiki-en dataset
DATASET=bookcorpus TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for bookcorpus dataset

TOKENIZER_MODEL can be any accessible Hugging Face tokenizer. Remember to either pre-download the tokenizer or set up Hugging Face access beforehand when needed – see the Tokenizer section.

Note

When training, set DATA_PATH to the file name prefix of the .bin and .idx files, as in the following example:

DATA_PATH="data/bookcorpus_text_sentence" # Change to where your dataset is stored.

If you don’t already have the dataset, download the DeepSeek dataset using the following commands:

mkdir deepseek-datasets
cd deepseek-datasets
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.bin
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.idx

To train on this data, update the DATA_DIR variable to point to the ___location of your dataset.

MOCK_DATA=0 # Train on real data

DATA_DIR="<path-to>/deepseek-datasets"  # Change to where your dataset is stored

Ensure that the files are accessible inside the Docker container.

If you don’t already have the dataset, download the DeepSeek dataset using the following commands:

mkdir deepseek-datasets
cd deepseek-datasets
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.bin
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.idx

To train on this data, update the DATA_DIR variable to point to the ___location of your dataset.

MOCK_DATA=0 # Train on real data

DATA_DIR="<path-to>/deepseek-datasets"  # Change to where your dataset is stored

Ensure that the files are accessible inside the Docker container.

If you don’t already have the dataset, download the Mixtral dataset using the following commands:

mkdir mixtral-datasets
cd mixtral-datasets
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.bin
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.idx

To train on this data, update the DATA_DIR variable to point to the ___location of your dataset.

MOCK_DATA=0 # Train on real data

DATA_DIR="<path-to>/mixtral-datasets"  # Change to where your dataset is stored

Ensure that the files are accessible inside the Docker container.

Multi-node configuration#

If you’re running multi-node training, update the following environment variables. They can also be passed as command line arguments. Refer to the following example configurations.

  • Change localhost to the master node’s hostname:

    MASTER_ADDR="${MASTER_ADDR:-localhost}"
    
  • Set the number of nodes you want to train on (for instance, 2, 4, 8):

    NNODES="${NNODES:-1}"
    
  • Set the rank of each node (0 for master, 1 for the first worker node, and so on):

    NODE_RANK="${NODE_RANK:-0}"
    
  • Set DATA_CACHE_PATH to a common directory accessible by all the nodes (for example, an NFS directory) for multi-node runs:

    DATA_CACHE_PATH=/root/cache # Set to a common directory for multi-node runs
    
  • For multi-node runs, make sure the correct network drivers are installed on the nodes. If inside a Docker container, either install the drivers inside the Docker container or pass the network drivers from the host while creating the Docker container.

    # Specify which RDMA interfaces to use for communication
    export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
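
To confirm which RDMA devices are visible on a node before setting NCCL_IB_HCA, you can list them with the ibverbs utilities (a sketch that assumes rdma-core tooling is installed on the host or in the container):

ibv_devices    # list RDMA device names and node GUIDs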
    

Getting started#

The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate system performance, conduct training benchmarks, and achieve superior performance for models like Llama, DeepSeek, and Mixtral. This container should not be expected to provide generalized performance across all training workloads. You can expect the container to perform well in the model configurations described in the following section, but other configurations are not validated by AMD.

Run training#

Use the following example commands to set up the environment, configure key options, and run training on MI300X series accelerators with the AMD Megatron-LM environment.

Single node training#

To run the training on a single node for Llama 3.3 70B BF16 with FSDP-v2 enabled, add the FSDP=1 argument. For example, use the following command:

TEE_OUTPUT=1 RECOMPUTE=1 SEQ_LENGTH=8192 MBS=2 BS=16 TE_FP8=0 TP=1 PP=1 FSDP=1 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh

Note

It is suggested to use TP=1 when FSDP is enabled for higher throughput. FSDP-v2 is not supported with pipeline parallelism, expert parallelism, MCore’s distributed optimizer, gradient accumulation fusion, or FP16.

Currently, FSDP is only compatible with BF16 precision.

To run training on a single node for Llama 3.1 8B FP8, navigate to the Megatron-LM folder and use the following command.

TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh

For Llama 3.1 8B BF16, use the following command:

TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=0 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh

To run the training on a single node for Llama 3.1 70B BF16 with FSDP-v2 enabled, add the FSDP=1 argument. For example, use the following command:

TEE_OUTPUT=1 MBS=3 BS=24 TP=1 TE_FP8=0 FSDP=1 RECOMPUTE=1 SEQ_LENGTH=8192 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh

To run training on a single node for Llama 2 7B FP8, navigate to the Megatron-LM folder and use the following command.

TEE_OUTPUT=1 MBS=4 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=4096 MODEL_SIZE=7 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh

For Llama 2 7B BF16, use the following command:

TEE_OUTPUT=1 MBS=4 BS=256 TP=1 TE_FP8=0 SEQ_LENGTH=4096 MODEL_SIZE=7 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh

To run the training on a single node for Llama 2 70B BF16 with FSDP-v2 enabled, add the FSDP=1 argument. For example, use the following command:

TEE_OUTPUT=1 MBS=7 BS=56 TP=1 TE_FP8=0 FSDP=1 RECOMPUTE=1 SEQ_LENGTH=4096 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh

To run training on a single node for DeepSeek-V3 (MoE with expert parallel) with 3-layer proxy, navigate to the Megatron-LM folder and use the following command.

FORCE_BANLANCE=true \
RUN_ENV=cluster \
MODEL_SIZE=671B \
TRAIN_ITERS=50 \
SEQ_LEN=4096 \
NUM_LAYERS=3 \
MICRO_BATCH_SIZE=1 GLOBAL_BATCH_SIZE=32 \
PR=bf16 \
TP=1 PP=1 ETP=1 EP=8 \
GEMM_TUNING=1 \
NVTE_CK_USES_BWD_V3=1 \
USE_GROUPED_GEMM=true MOE_USE_LEGACY_GROUPED_GEMM=true \
GPT_LAYER_IN_TE=true \
bash examples/deepseek_v3/train_deepseekv3.sh

To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel), navigate to the Megatron-LM folder and use the following command.

GEMM_TUNING=1 PR=bf16 MBS=4 AC=none SEQ_LEN=4096 PAD_LEN=4096 TRAIN_ITERS=50 bash examples/deepseek_v2/train_deepseekv2.sh

To run training on a single node for Mixtral 8x7B (MoE with expert parallel), navigate to the Megatron-LM folder and use the following command.

RECOMPUTE_NUM_LAYERS=0 TEE_OUTPUT=1 MBS=1 GBS=16 TP_SIZE=1 PP_SIZE=1 AC=none PR=bf16 EP_SIZE=8 ETP_SIZE=1 SEQLEN=4096 FORCE_BALANCE=true MOCK_DATA=1 RUN_ENV=cluster MODEL_SIZE=8x7B TRAIN_ITERS=50 bash examples/mixtral/train_mixtral_moe.sh

To run training on a single node for Mixtral 8x22B (MoE with expert parallel) with a 4-layer proxy, navigate to the Megatron-LM folder and use the following command.

RECOMPUTE_NUM_LAYERS=4 TEE_OUTPUT=1 MBS=1 GBS=16 TP_SIZE=1 PP_SIZE=1 AC=full NUM_LAYERS=4 PR=bf16 EP_SIZE=8 ETP_SIZE=1 SEQLEN=8192 FORCE_BALANCE=true MOCK_DATA=1 RUN_ENV=cluster MODEL_SIZE=8x22B TRAIN_ITERS=50 bash examples/mixtral/train_mixtral_moe.sh

Multi-node training#

To run training on multiple nodes, launch the Docker container on each node. For example, for Llama 3 using a two-node setup (NODE0 as the master node), use these commands.

  • On the master node NODE0:

    TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8  MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=0 bash examples/llama/train_llama3.sh
    
  • On the worker node NODE1:

    TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8  MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=1 bash examples/llama/train_llama3.sh
    

Alternatively, for DeepSeek-V3, an example script, train_deepseek_v3_slurm.sh, is provided in ROCm/Megatron-LM to enable training at scale in a SLURM environment. For example, to run training on 16 nodes, use the following command:

sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh
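
After submitting the job, you can monitor it with the usual SLURM tools (the log file name assumes SLURM's default slurm-<jobid>.out output pattern):

squeue -u $USER              # check job status
tail -f slurm-<jobid>.out    # follow the training log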

Key options#

The benchmark tests support the following variables.

TEE_OUTPUT

1 to enable training logs or 0 to disable.

TE_FP8

0 for BF16 or 1 for FP8 – 0 by default.

GEMM_TUNING

1 to enable GEMM tuning, which boosts performance by using the best GEMM kernels.

USE_FLASH_ATTN

1 to enable Flash Attention.

FSDP

1 to enable PyTorch FSDP2. If FSDP is enabled, --use-distributed-optimizer, --overlap-param-gather, and --sequence-parallel are automatically disabled.

ENABLE_PROFILING

1 to enable PyTorch profiling for performance analysis.

transformer-impl

transformer_engine to use the Transformer Engine (TE) or local to disable TE.

MODEL_SIZE

8B or 70B for Llama 3 and 3.1. 7B or 70B for Llama 2, for example.

TOTAL_ITERS

The total number of iterations – 10 by default.

MOCK_DATA

1 to use mock data or 0 to use real data you provide.

MBS

Micro batch size.

BS

Global batch size.

TP / TP_SIZE

Tensor parallel (1, 2, 4, 8). TP is disabled when FSDP is turned on.

EP / EP_SIZE

Expert parallel for MoE models.

SEQ_LENGTH

Input sequence length.

PR

Precision for training. bf16 for BF16 (default) or fp8 for FP8 GEMMs.

AC

Activation checkpointing (none, sel, or full) – sel by default.

NUM_LAYERS

Use a reduced number of layers as a proxy model.

RECOMPUTE_NUM_LAYERS

Number of layers used for checkpointing recompute.
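
As an illustration of how these variables combine on the command line, the following sketch mirrors the single-node Llama 3.1 8B FP8 example above with profiling enabled and a shorter run:

ENABLE_PROFILING=1 TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=10 bash examples/llama/train_llama3.sh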

Previous versions#

See Megatron-LM training performance testing version history to find documentation for previous releases of the ROCm/megatron-lm Docker image.