Training a model with PyTorch for ROCm#

2025-06-23

25 min read time

Applies to Linux

PyTorch is an open-source machine learning framework that is widely used for model training, with GPU-optimized components for transformer-based models.

The PyTorch for ROCm training Docker image (rocm/pytorch-training:v25.6) provides a prebuilt, optimized environment for fine-tuning and pretraining models on AMD Instinct MI325X and MI300X accelerators. It includes the following software components to accelerate training workloads:

| Software component | Version |
|---|---|
| ROCm | 6.3.4 |
| PyTorch | 2.8.0a0+git7d205b2 |
| Python | 3.10.17 |
| Transformer Engine | 1.14.0+2f85f5f2 |
| Flash Attention | 3.0.0.post1 |
| hipBLASLt | 0.15.0-8c6919d |
| Triton | 3.3.0 |

Supported models#

The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators.

| Pre-training | Fine-tuning |
|---|---|
| Llama 3.1 8B | Llama 4 Scout 17B-16E |
| Llama 3.1 70B | Llama 3.3 70B |
| FLUX.1-dev | Llama 3.2 1B |
| | Llama 3.2 3B |
| | Llama 3.2 Vision 11B |
| | Llama 3.2 Vision 90B |
| | Llama 3.1 8B |
| | Llama 3.1 70B |
| | Llama 3.1 405B |
| | Llama 3 8B |
| | Llama 3 70B |
| | Llama 2 7B |
| | Llama 2 13B |
| | Llama 2 70B |

Note

Some models require an external license agreement through a third party (for example, Meta).

Performance measurements#

To evaluate performance, the Performance results with AMD ROCm software page provides reference throughput and latency measurements for training popular AI models.

Note

The performance data presented in Performance results with AMD ROCm software should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.

System validation#

Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before starting training.
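
As a quick, non-authoritative check (the System validation and optimization guide remains the reference), you can inspect whether NUMA auto-balancing is currently enabled on the host and disable it if needed:

# Print the current NUMA auto-balancing setting: 1 means enabled, 0 means disabled.
cat /proc/sys/kernel/numa_balancing

# Disable NUMA auto-balancing, the setting generally recommended for GPU training workloads.
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'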

To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.

This Docker image is optimized for specific model configurations outlined below. Performance can vary for other training workloads, as AMD doesn’t validate configurations and run conditions outside those described.

Benchmarking#

Once the setup is complete, choose between two options to start benchmarking: MAD-integrated benchmarking on the host machine, or standalone benchmarking inside the Docker container.

MAD-integrated benchmarking

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt

For example, use this command to run the performance benchmark test on the Llama 3.1 8B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3.1-8b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3.1-8b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.
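
To take a quick look at the collected results after a run, you can pretty-print the CSV on the host. This is a minimal sketch; the exact set of columns depends on the MAD version in use.

# Render ~/MAD/perf.csv as an aligned, scrollable table.
column -s, -t < ~/MAD/perf.csv | less -S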

For example, use this command to run the performance benchmark test on the Llama 3.1 70B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3.1-70b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3.1-70b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the FLUX.1-dev model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_flux --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_flux, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 4 Scout 17B-16E model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-4-scout-17b-16e --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-4-scout-17b-16e, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 3.3 70B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3.3-70b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3.3-70b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 3.2 1B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3.2-1b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3.2-1b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 3.2 3B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3.2-3b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3.2-3b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 3.2 Vision 11B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3.2-vision-11b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3.2-vision-11b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 3.2 Vision 90B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3.2-vision-90b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3.2-vision-90b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 3.1 8B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3.1-8b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3.1-8b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 3.1 70B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3.1-70b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3.1-70b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 3.1 405B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3.1-405b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3.1-405b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 3 8B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3-8b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3-8b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 3 70B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-3-70b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-3-70b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 2 7B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-2-7b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-2-7b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 2 13B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-2-13b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-2-13b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

For example, use this command to run the performance benchmark test on the Llama 2 70B model using one GPU with the BF16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_train_llama-2-70b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_train_llama-2-70b, for example. The latency and throughput reports of the model are collected in the following path: ~/MAD/perf.csv.

Standalone benchmarking

Download the Docker image and required packages

Use the following command to pull the Docker image from Docker Hub.

docker pull rocm/pytorch-training:v25.6

Run the Docker container.

docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v  $HOME/.ssh:/root/.ssh --shm-size 64G --name training_env rocm/pytorch-training:v25.6

Use these commands if you exit the training_env container and need to return to it.

docker start training_env
docker exec -it training_env bash
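
Before launching a benchmark, it can be useful to confirm that the accelerators are visible inside the container. A minimal sketch, assuming the standard ROCm and PyTorch tooling in the image:

# List the detected AMD accelerators and their current utilization.
rocm-smi

# Confirm that PyTorch sees the GPUs through its ROCm (HIP) backend.
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"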

In the Docker container, clone the ROCm/MAD repository and navigate to the benchmark scripts directory /workspace/MAD/scripts/pytorch_train.

git clone https://github.com/ROCm/MAD
cd MAD/scripts/pytorch_train

Prepare training datasets and dependencies

The following benchmarking examples require downloading models and datasets from Hugging Face. To ensure successful access to gated repos, set your HF_TOKEN.

export HF_TOKEN=$your_personal_hugging_face_access_token
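
Optionally, verify that the token authenticates before running the setup script. This is a sketch that assumes the huggingface_hub CLI is available in the container:

# Should print your Hugging Face user name if HF_TOKEN is valid.
huggingface-cli whoami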

Run the setup script to install libraries and datasets needed for benchmarking.

./pytorch_benchmark_setup.sh

pytorch_benchmark_setup.sh installs the following libraries for Llama 3.1 8B:

| Library | Reference |
|---|---|
| accelerate | Hugging Face Accelerate |
| datasets | Hugging Face Datasets 3.2.0 |

pytorch_benchmark_setup.sh installs the following libraries for Llama 3.1 70B:

| Library | Reference |
|---|---|
| datasets | Hugging Face Datasets 3.2.0 |
| torchdata | TorchData |
| tomli | Tomli |
| tiktoken | tiktoken |
| blobfile | blobfile |
| tabulate | tabulate |
| wandb | Weights & Biases |
| sentencepiece | SentencePiece 0.2.0 |
| tensorboard | TensorBoard 2.18.0 |

pytorch_benchmark_setup.sh installs the following libraries for FLUX:

| Library | Reference |
|---|---|
| accelerate | Hugging Face Accelerate |
| datasets | Hugging Face Datasets 3.2.0 |
| sentencepiece | SentencePiece 0.2.0 |
| tensorboard | TensorBoard 2.18.0 |
| csvkit | csvkit 2.0.1 |
| deepspeed | DeepSpeed 0.16.2 |
| diffusers | Hugging Face Diffusers 0.31.0 |
| GitPython | GitPython 3.1.44 |
| opencv-python-headless | opencv-python-headless 4.10.0.84 |
| peft | PEFT 0.14.0 |
| protobuf | Protocol Buffers 5.29.2 |
| pytest | PyTest 8.3.4 |
| python-dotenv | python-dotenv 1.0.1 |
| seaborn | Seaborn 0.13.2 |
| transformers | Transformers 4.47.0 |

pytorch_benchmark_setup.sh downloads the following datasets from Hugging Face:

Pretraining: Llama 3.1 8B

To start the pre-training benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t pretrain -m Llama-3.1-8B -p $datatype -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $datatype | BF16 or FP8 | Only Llama 3.1 8B supports FP8 precision. |
| $sequence_length | Between 2048 and 8192. 8192 by default. | Sequence length for the language model. |
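
For instance, based on the options above, a BF16 run at the default sequence length and an FP8 run at a shorter sequence length could be launched as follows (the sequence lengths are example values):

./pytorch_benchmark_report.sh -t pretrain -m Llama-3.1-8B -p BF16 -s 8192
./pytorch_benchmark_report.sh -t pretrain -m Llama-3.1-8B -p FP8 -s 4096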

Pretraining: Llama 3.1 70B

To start the pre-training benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t pretrain -m Llama-3.1-70B -p $datatype -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $datatype | BF16 | Only Llama 3.1 8B supports FP8 precision. |
| $sequence_length | Between 2048 and 8192. 8192 by default. | Sequence length for the language model. |

Pretraining: FLUX.1-dev

To start the pre-training benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t pretrain -m Flux -p $datatype -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $datatype | BF16 | Only Llama 3.1 8B supports FP8 precision. |
| $sequence_length | Between 2048 and 8192. 8192 by default. | Sequence length for the language model. |

Note

Occasionally, downloading the Flux dataset might fail. In the event of this error, manually download it from Hugging Face at black-forest-labs/FLUX.1-dev and save it to /workspace/FluxBenchmark. This ensures that the test script can access the required dataset.
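
If you need to fetch it manually, one possible approach is the Hugging Face CLI, assuming huggingface-cli is available and your HF_TOKEN has been granted access to the gated repository:

# Download the gated FLUX.1-dev repository into the directory the test script expects.
huggingface-cli download black-forest-labs/FLUX.1-dev --local-dir /workspace/FluxBenchmark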

Fine-tuning: Llama 4 Scout 17B-16E

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-4-17B_16E -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 4 Scout 17B-16E currently supports the following fine-tuning methods:

  • finetune_fw

  • finetune_lora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.
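
As an illustration of the options above, a LoRA fine-tuning run of Llama 4 Scout 17B-16E in BF16 with a 4096-token sequence length could be launched as follows (the sequence length is an example value):

./pytorch_benchmark_report.sh -t finetune_lora -m Llama-4-17B_16E -p BF16 -s 4096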

Fine-tuning: Llama 3.3 70B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-3.3-70B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 3.3 70B currently supports the following fine-tuning methods:

  • finetune_fw

  • finetune_lora

  • finetune_qlora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.

Fine-tuning: Llama 3.2 1B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-3.2-1B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 3.2 1B currently supports the following fine-tuning methods:

  • finetune_fw

  • finetune_lora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.

Fine-tuning: Llama 3.2 3B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-3.2-3B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 3.2 3B currently supports the following fine-tuning methods:

  • finetune_fw

  • finetune_lora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.

Fine-tuning: Llama 3.2 Vision 11B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-3.2-Vision-11B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 3.2 Vision 11B currently supports the following fine-tuning methods:

  • finetune_fw

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.

Fine-tuning: Llama 3.2 Vision 90B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-3.2-Vision-90B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 3.2 Vision 90B currently supports the following fine-tuning methods:

  • finetune_fw

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.

Fine-tuning: Llama 3.1 8B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-3.1-8B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 3.1 8B currently supports the following fine-tuning methods:

  • finetune_fw

  • finetune_lora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.

Fine-tuning: Llama 3.1 70B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-3.1-70B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 3.1 70B currently supports the following fine-tuning methods:

  • finetune_fw

  • finetune_lora

  • finetune_qlora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.

Fine-tuning: Llama 3.1 405B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-3.1-405B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 3.1 405B currently supports the following fine-tuning methods:

  • finetune_qlora

  • HF_finetune_lora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.
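
For example, because Llama 3.1 405B supports only the QLoRA and Hugging Face PEFT LoRA paths, a QLoRA run in BF16 could be launched as follows (the sequence length is an example value):

./pytorch_benchmark_report.sh -t finetune_qlora -m Llama-3.1-405B -p BF16 -s 2048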

Fine-tuning: Llama 3 8B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-3-8B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 3 8B currently supports the following fine-tuning methods:

  • finetune_fw

  • finetune_lora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.

Fine-tuning: Llama 3 70B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-3-70B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 3 70B currently supports the following fine-tuning methods:

  • finetune_fw

  • finetune_lora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.

Fine-tuning: Llama 2 7B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-2-7B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 2 7B currently supports the following fine-tuning methods:

  • finetune_fw

  • finetune_lora

  • finetune_qlora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.

Fine-tuning: Llama 2 13B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-2-13B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 2 13B currently supports the following fine-tuning methods:

  • finetune_fw

  • finetune_lora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.

Fine-tuning: Llama 2 70B

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode -m Llama-2-70B -p BF16 -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $training_mode | finetune_fw | Full weight fine-tuning (BF16 supported) |
| | finetune_lora | LoRA fine-tuning (BF16 supported) |
| | finetune_qlora | QLoRA fine-tuning (BF16 supported) |
| | HF_finetune_lora | LoRA fine-tuning with Hugging Face PEFT |
| $datatype | BF16 | All models support BF16. |
| $sequence_length | Between 2048 and 16384. | Sequence length for the language model. |

Note

Llama 2 70B currently supports the following fine-tuning methods:

  • finetune_lora

  • finetune_qlora

  • HF_finetune_lora

The upstream torchtune repository does not currently provide YAML configuration files for other combinations of model and fine-tuning method. However, you can still enable fine-tuning methods not listed here by writing your own YAML configuration files, following the existing patterns in the /workspace/torchtune/recipes/configs directory.
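
For instance, a Hugging Face PEFT LoRA run of Llama 2 70B in BF16 could be launched as follows (the sequence length is an example value):

./pytorch_benchmark_report.sh -t HF_finetune_lora -m Llama-2-70B -p BF16 -s 4096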

Benchmarking examples

For examples of benchmarking commands, see ROCm/MAD.

Previous versions#

See PyTorch training performance testing version history to find documentation for previous releases of the ROCm/pytorch-training Docker image.