Billion Scale Deduplication using Approximate Nearest Neighbours | Idan Richman Goshen, Sr DS @ Lusha

5,049 views

PyData


1 day ago

At Lusha we deal with contact profiles, lots of contact profiles. This data is messy by nature, and a single entity can have several representations in it. Beyond the time and money spent moving messy data through the various pipelines, such data is difficult to search, and valuable information is lost along the way. Ideally, we would merge all records of the same entity, even if they differ slightly (“Alagra Jones”, “Alagra Smith-Jones”). Comparing every pair of records is feasible at small scale but infeasible when dealing with billions of records.
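To make the scaling problem concrete, here is a small arithmetic sketch. It assumes a naive approach that compares every unordered pair exactly once, so n records need n(n-1)/2 comparisons; the function name `pair_count` is purely illustrative.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Growth is quadratic: 1,000x more records means roughly 1,000,000x more pairs.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} records -> {pair_count(n):,} pairs")
```

At a billion records the count is on the order of 5 x 10^17 pairs, which is why a cheaper candidate-generation step is needed before any detailed comparison.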
A family of algorithms known as approximate nearest neighbours (ANN) is becoming increasingly popular for such problems, because it makes text embeddings and clustering practical at large scale.
This talk will offer a brief overview of ANN algorithms and demonstrate how we can apply them to obtain a reasonably sized subset of candidates, which we can then pass to a classifier for a match/no-match outcome. I’ll demonstrate how we handle such a task at scale, how we evaluate the two steps, and the tools we use.
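The two-step shape described above (cheap candidate generation, then a match/no-match decision) can be sketched roughly as follows. The abstract does not say which ANN method, embedding, or classifier is used, so everything here is an assumption for illustration: MinHash LSH over character trigrams stands in for the ANN step, a Jaccard threshold stands in for the trained classifier, and all function names and parameters are hypothetical. Only the Python standard library is used.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Character k-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower().strip()} "
    grams = {padded[i:i + k] for i in range(len(padded) - k + 1)}
    return grams or {padded}  # guard against strings shorter than k


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token, varied by seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def minhash_signature(tokens: set[str], num_hashes: int = 32) -> tuple[int, ...]:
    """One minimum per seeded hash function; similar sets share many minima."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def candidate_pairs(names: list[str], num_hashes: int = 32, bands: int = 16) -> set[tuple[int, int]]:
    """Step 1 (assumed stand-in for ANN): records sharing any LSH band become candidates."""
    rows = num_hashes // bands
    buckets: dict[tuple, list[int]] = defaultdict(list)
    for idx, name in enumerate(names):
        sig = minhash_signature(shingles(name), num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    pairs: set[tuple[int, int]] = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity of the two strings' trigram sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def is_match(a: str, b: str, threshold: float = 0.5) -> bool:
    """Step 2 (placeholder for a trained classifier): threshold on exact similarity."""
    return jaccard(a, b) >= threshold


names = ["Alagra Jones", "Alagra Smith-Jones", " alagra jones", "Bob Lee"]
for i, j in sorted(candidate_pairs(names)):
    print(names[i], "|", names[j], "| match:", is_match(names[i], names[j]))
```

The point of the design is that the expensive decision in step 2 only runs on pairs that share a bucket, instead of on all pairs. Because LSH is probabilistic, some true duplicates can be missed in step 1, which is one reason the two steps are worth evaluating separately.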
