Deep Dive: Advanced distributed training with Hugging Face LLMs and AWS Trainium

  1,019 views

Julien Simon

6 months ago

Following up on the "Hugging Face on AWS accelerators" deep dive ( • Deep Dive: Hugging Fac... ), this video zooms in on distributed training with NeuronX Distributed, Optimum Neuron, and AWS Trainium.
First, we explain the basics and benefits of advanced distributed training techniques such as tensor parallelism, pipeline parallelism, sequence parallelism, and DeepSpeed ZeRO. Then, we discuss how these techniques are implemented in NeuronX Distributed and Optimum Neuron. Finally, we launch an Amazon EC2 Trainium-powered instance and demonstrate these techniques with distributed training runs on the TinyLlama and Llama 2 7B models. Of course, we share results on training time and cost, which will probably surprise you!
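To give a feel for what ZeRO stage 1 does before watching the video: each data-parallel rank keeps only its own shard of the optimizer state instead of a full replica, cutting memory roughly by the number of ranks. The sketch below is a toy, framework-free illustration of that idea, not the NeuronX Distributed or DeepSpeed implementation; all function names are made up.

```python
# Toy illustration of ZeRO stage 1: each data-parallel rank keeps only
# its shard of the optimizer state and updates only its own slice of
# the parameters. All names here are hypothetical.

def shard_bounds(num_params: int, rank: int, world_size: int):
    """Return the [start, end) slice of parameters owned by `rank`."""
    per_rank = (num_params + world_size - 1) // world_size
    start = rank * per_rank
    return start, min(start + per_rank, num_params)

def zero1_sgd_step(params, grads, rank, world_size, lr=0.5):
    """Each rank applies the update (plain SGD for simplicity) only to
    its own shard; a real implementation would then all-gather the
    updated shards so every rank sees the full parameter vector."""
    start, end = shard_bounds(len(params), rank, world_size)
    updated = list(params)
    for i in range(start, end):
        updated[i] = params[i] - lr * grads[i]
    return updated

# Simulate 2 ranks, then merge their shards as an all-gather would.
params = [1.0, 2.0, 3.0, 4.0]
grads = [2.0, 2.0, 2.0, 2.0]
rank0 = zero1_sgd_step(params, grads, rank=0, world_size=2)
rank1 = zero1_sgd_step(params, grads, rank=1, world_size=2)
merged = rank0[:2] + rank1[2:]
print(merged)  # [0.0, 1.0, 2.0, 3.0]
```

The memory saving comes from the fact that stateful optimizers like Adam keep several extra values per parameter; sharding those values is usually a much bigger win than sharding the parameters themselves.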
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at / julsimon or Substack at julsimon.substack.com. ⭐️⭐️⭐️
This video focuses on the software details essential for getting peak performance. It points you to relevant code snippets and developer resources, suitable for both newcomers and experienced practitioners. Whether you're already familiar with AWS Trainium or approaching it for the first time, this technical walkthrough prepares you to train Hugging Face models on AWS.
01:20 NeuronX Distributed
05:20 Tensor Parallelism
11:10 Pipeline Parallelism
16:55 Sequence Parallelism
20:32 Optimum Neuron
21:40 Optimum Neuron with Zero-1
25:55 Optimum Neuron with Tensor Parallelism and Sequence Parallelism
29:20 Amazon Machine Images (AMI) for Neuron devices
30:10 Launching an Amazon EC2 trn1n.32xlarge instance with the Hugging Face Neuron AMI
33:10 Fine-tuning TinyLlama with Optimum Neuron
41:15 Fine-tuning Llama 2 7B with Optimum Neuron
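As a rough flavor of the tensor parallelism chapter above: a linear layer's weight matrix is split column-wise across workers, each worker computes only its slice of the output, and the slices are then gathered back together. The toy below simulates this serially in plain Python; it is an illustration of the idea only, not Neuron or NeuronX Distributed code.

```python
# Toy column-wise tensor parallelism for y = x @ W (no bias).
# Each simulated "rank" owns a contiguous slice of W's columns and
# computes only its slice of the output vector.

def matvec(x, w_cols):
    """x: vector of length d; w_cols: list of columns, each of length d."""
    return [sum(xi * ci for xi, ci in zip(x, col)) for col in w_cols]

def split_columns(w_cols, world_size):
    """Split the list of columns into `world_size` contiguous shards."""
    per_rank = (len(w_cols) + world_size - 1) // world_size
    return [w_cols[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

x = [1.0, 2.0]
# W has 2 rows and 4 columns, stored column-by-column for clarity.
w_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]

# Serial reference result.
reference = matvec(x, w_cols)

# "Tensor parallel" result: each rank computes its output slice, then
# the slices are concatenated (the role an all-gather plays on real hardware).
shards = split_columns(w_cols, world_size=2)
parallel = [y for shard in shards for y in matvec(x, shard)]
print(parallel == reference)  # True
```

Because each rank stores only a fraction of the weights and activations for that layer, this is what lets models too large for one accelerator's memory train across several NeuronCores.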
Links:
- Hugging Face Optimum Neuron: huggingface.co/docs/optimum-n...
- Source code for supported models: github.com/huggingface/optimum-neuron/tree/main/optimum/neuron/distributed
- Release notes: github.com/huggingface/optimum-neuron/releases
- Distributed training docs: huggingface.co/docs/optimum-n...
- TinyLlama: huggingface.co/TinyLlama/Tiny...
- Llama 2 7B: huggingface.co/meta-llama/Lla...

Comments: 4
@yacinezahidi7206 6 months ago
Thanks. That's great content!
@juliensimonfr 6 months ago
Glad you liked it!
@andriimelashchenko2201 6 months ago
Can you please share the link to supported models?
@juliensimonfr 6 months ago
- Source code for supported models: github.com/huggingface/optimum-neuron/tree/main/optimum/neuron/distributed
- Release notes: github.com/huggingface/optimum-neuron/releases