In this video, we walk through the experience of building the ALLaM language model developed by the Saudi Data and Artificial Intelligence Authority (SDAIA) in Saudi Arabia.
Following the research paper, we explore the stages from initial problem exploration all the way to building the full family of language models.
03:30 From scratch or not
06:00 How does the tokenizer work
08:51 LLaMA2 tokenizer
10:43 Fertility Rate
12:39 How can we expand the vocabulary
14:43 The ColossalAI Experiment
16:45 MMLU Datasets and the translation issues
22:00 How the pre-training data was prepared
25:19 The Pile dataset
25:50 Collecting the Arabic dataset
26:50 How to assess the quality of the collected data
28:23 The DataTrove project
29:40 The Cosmopedia dataset
30:45 The machine-translated dataset
32:45 How to evaluate data ratios
34:50 Mixed Data Ratios
36:00 Continued Pretraining
37:30 Training with the expanded vocabulary
40:00 Continued Pretraining Hyperparameters
44:00 Train from scratch
44:20 Training on multiple stages
45:30 The cross-lingual transfer phenomenon
48:30 Why do we need large batches
51:00 GPU Infrastructure
54:50 From base model to instruction-tuned model
56:40 The Ultra-Instinct dataset
01:00:01 The instruction-tuned model hyperparameters
01:04:00 Why do we need an additional fine-tuning step
01:07:00 Preference Training
01:07:50 DPO
01:11:00 On-Policy and Off-Policy Negative Sampling
01:13:00 DPO Data Augmentation
01:15:40 Learning Rates and Data Sizes
01:18:20 How many evaluation shots do you need
01:27:00 Human Evaluation Vs. Automated Evaluation
01:28:00 LMSys Arena
01:32:00 Why do we need to develop our own evaluation
01:33:50 Conclusion
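For reference, the DPO objective covered at 01:07:50 is, in its standard form, a preference loss over triples of a prompt $x$, a chosen response $y_w$, and a rejected response $y_l$:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls how far the policy may drift from the reference; the on-policy vs. off-policy distinction at 01:11:00 concerns how the rejected samples $y_l$ are generated.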
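The tokenizer fertility rate discussed at 10:43 is the average number of tokens a tokenizer emits per whitespace-delimited word; a high fertility for Arabic means the model burns more of its context window on the same text. A minimal sketch of the metric, using a toy fixed-chunk tokenizer purely as a stand-in for a real BPE vocabulary:

```python
# Fertility = average number of tokens emitted per whitespace-delimited
# word. Lower is better: it means the vocabulary covers the language
# with fewer tokens per word.
def fertility(tokenize, text):
    words = text.split()
    tokens = [t for w in words for t in tokenize(w)]
    return len(tokens) / len(words)

# Toy tokenizer (illustrative only, not the LLaMA2 tokenizer):
# splits each word into fixed 2-character chunks.
def toy_tokenize(word):
    return [word[i:i + 2] for i in range(0, len(word), 2)]

print(fertility(toy_tokenize, "language models need tokens"))  # → 3.0
```

With a real tokenizer you would replace `toy_tokenize` with the tokenizer's own encode function; comparing fertility on Arabic vs. English text before and after expanding the vocabulary shows whether the expansion actually helped.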