Deduplication of Large-scale Text Datasets for Pretraining of Language Models

Ai2

316 views · 1 day ago

In this talk, I'll cover the newly released DataComp for Language Models project, in which we build a testbed for controlled experiments on constructing better datasets for pretraining language models in a compute-limited regime. From there I'll pivot to one particular aspect of building better datasets: removing duplicates and near-duplicates from large text corpora, explaining several key techniques as well as our findings from extensive deduplication ablations. Finally, I'll raise several open questions and future directions regarding deduplication of pretraining datasets, including some unpublished (but interesting!) results.
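As a rough illustration of one standard near-duplicate detection technique in this space, here is a minimal MinHash sketch in Python. The function names, the 5-word shingle size, and the 64-hash signature length are all illustrative choices for this sketch, not necessarily the pipeline discussed in the talk:

```python
import hashlib

def shingles(text, k=5):
    """Split text into overlapping word k-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over all shingles; matching slots estimate Jaccard overlap."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river shore"
doc3 = "language models are trained on large web scale text corpora today"

s1 = minhash_signature(shingles(doc1))
s2 = minhash_signature(shingles(doc2))
s3 = minhash_signature(shingles(doc3))

print(estimated_jaccard(s1, s2))  # high: near-duplicate pair
print(estimated_jaccard(s1, s3))  # low: unrelated documents
```

In production-scale deduplication, signatures like these are typically bucketed with locality-sensitive hashing so that only likely near-duplicates are ever compared, avoiding an all-pairs scan over the corpus.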
