Training Large Language Models on Kubernetes - Ronen Dar, Run:ai

CNCF [Cloud Native Computing Foundation]

Large Language Models (LLMs) are emerging as the biggest technology breakthrough since the launch of the iPhone. LLMs are very large, and training them requires massive amounts of data and compute power. LLM training is often carried out on bare metal servers with workload schedulers from the high-performance computing world, such as Slurm. In this talk, we present the challenges involved in pre-training LLMs in general and on Kubernetes specifically. We discuss best practices for networking optimization, distributed resource management, scheduling, and code manipulation. We provide scripts based on NVIDIA’s Megatron Transformer framework, with pre-made configurations, data pre-processing workflows, and training setup, so that users can quickly start LLM training on K8s. We also provide benchmark results comparing training throughput between bare metal and K8s-based environments with models such as GPT, T5, and BERT, across a varying number of GPU nodes.
