In this talk, I'll cover the newly released DataComp for Language Models project, in which we build a testbed for controlled experiments on constructing better datasets for pretraining language models in a compute-limited regime. From there, I'll pivot to one particular aspect of building better datasets: removing duplicates and near-duplicates from large text corpora, explaining several key techniques as well as our findings from extensive deduplication ablations. Finally, I'll raise several open questions and future directions regarding deduplication of pretraining datasets, including some unpublished (but interesting!) results.
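As background for the near-duplicate discussion, here is a minimal sketch of one common technique in this space: MinHash over character shingles, which estimates the Jaccard similarity of two documents from compact signatures. The shingle size, number of hash functions, and example documents below are illustrative assumptions, not the specific methods or settings used in the talk.

```python
# Minimal MinHash sketch for near-duplicate detection (illustrative only).
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Overlapping character k-grams representing the document."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_signature(items: Set[str], num_hashes: int = 64) -> List[int]:
    """Keep the minimum of each seeded hash over the shingle set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items))
    return sig


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
# A high estimate flags the pair as likely near-duplicates.
print(estimated_jaccard(sig_a, sig_b))
```

In practice, pairwise comparison of signatures is too slow at corpus scale, so systems typically bucket signature bands with locality-sensitive hashing so only candidate pairs are compared.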