Deduplication of Large-scale Text Datasets for Pretraining of Language Models

Ai2

316 views · 1 day ago

In this talk, I'll cover the newly released DataComp for Language Models project, in which we build a testbed for controlled experiments on constructing better datasets for pretraining language models in a compute-limited regime. From there I'll pivot to one particular aspect of building better datasets: removing duplicates and near-duplicates from large text corpora, explaining several key techniques as well as our findings from extensive deduplication ablations. Finally, I'll raise several open questions and future directions regarding deduplication of pretraining datasets, including some unpublished (but interesting!) results.
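As a rough illustration of one standard near-duplicate detection technique in this space, here is a minimal MinHash sketch in Python. The function names, the 5-word shingle size, and the 64-hash signature length are all illustrative choices for this sketch, not necessarily the pipeline discussed in the talk:

```python
import hashlib

def shingles(text, k=5):
    """Split text into overlapping word k-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over all shingles; matching slots estimate Jaccard overlap."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river shore"
doc3 = "language models are trained on large web scale text corpora today"

s1 = minhash_signature(shingles(doc1))
s2 = minhash_signature(shingles(doc2))
s3 = minhash_signature(shingles(doc3))

print(estimated_jaccard(s1, s2))  # high: near-duplicate pair
print(estimated_jaccard(s1, s3))  # low: unrelated documents
```

In production-scale deduplication, signatures like these are typically bucketed with locality-sensitive hashing so that only likely near-duplicates are ever compared, avoiding an all-pairs scan over the corpus.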
