Vincent D. Warmerdam - Scikit-Learn can do THAT?!

Views: 4,703

PyData

A day ago

Many of us know scikit-learn for its ability to construct pipelines that can do .fit().predict(). It's an amazing feature for sure. But once you dive into the codebase ... you realise that there is just so much more.
This talk will be an attempt at demonstrating some extra features in scikit-learn, and its ecosystem, that are less common but deserve to be in the spotlight.
In particular I hope to discuss these things that scikit-learn can do:
- sparse datasets and models
- larger than memory datasets
- sample weight techniques
- image classification via embeddings
- tabular embeddings/vectorisation
- data deduplication
- pipeline caching
If time allows I may also touch on extra topics.
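
As a rough illustration of the pipeline caching item in the list above, here is a minimal sketch. The dataset, the choice of `PCA` and `LogisticRegression`, and the grid values are assumptions made for this example, not taken from the talk; only the `memory` parameter of `make_pipeline` is the feature being shown.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Fitted transformers are cached on disk in this temporary directory.
cache_dir = mkdtemp()
pipe = make_pipeline(
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
    memory=cache_dir,
)

# Only the classifier's C changes across candidates, so within each CV fold
# the fitted PCA step can be loaded from the cache instead of being refit.
grid = GridSearchCV(pipe, param_grid={"logisticregression__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
best_C = grid.best_params_["logisticregression__C"]

# Clean up the cache once the search is done.
rmtree(cache_dir, ignore_errors=True)
```

The benefit grows with the cost of the cached steps; for a cheap transformer the disk round trip may outweigh the saving.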

Comments: 10
@calmcode-io 23 days ago
Vincent here, and I need to make a correction! Around the 10 minute mark I mention metadata routing. But what I actually show on screen is not the new metadata routing feature, but rather the old way of doing it! Total mistake on my end! The scikit-learn docs will tell you more about the new syntax that is different from what you see here. I would normally share a link to the docs, but YT tends to shadowban comments with an outgoing link.
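
For readers unsure what "the old way" refers to, the sketch below shows the long-standing convention of passing a fit parameter to one pipeline step by prefixing it with the step name and a double underscore. This is an assumption about what appears on screen, based only on the comment above; the data and the weighting scheme are made up for illustration, and the newer metadata routing syntax is deliberately not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Hypothetical weighting scheme: class 1 counts ten times as much as class 0.
weights = np.where(y == 1, 10.0, 1.0)

# Baseline without weights.
plain = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Older style: "<step name>__<fit parameter>" forwards the argument
# to that step's fit method only (here, the final classifier).
weighted = make_pipeline(StandardScaler(), LogisticRegression())
weighted.fit(X, y, logisticregression__sample_weight=weights)

plain_intercept = plain.named_steps["logisticregression"].intercept_
weighted_intercept = weighted.named_steps["logisticregression"].intercept_
```

Comparing the two intercepts shows that the weights reached the classifier; the scaler in this sketch is fitted without them.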
@DistortedV12 12 days ago
Put it in the replies. Thanks sir!
@TechTalksWeekly 20 days ago
This is a fantastic talk and it has been featured in the last issue of Tech Talks Weekly newsletter 🎉 Congrats Vincent!
@JR-gy1lh 18 days ago
I love Vincent's ability to teach. I'm here because of :probabl, which I've loved thus far.
@JR-gy1lh 18 days ago
The fact you can abstract complexity with scikit-learn is amazing. What happens behind the scenes is beautiful.
@TheRonakagrawal 23 days ago
Wow, @calmcode-io @vincent! Fantastic talk, you kill it every time!
@wolpumba4099 22 days ago
Summary
- (0:00) Scikit-learn can cache pipeline steps: This significantly speeds up tasks like hyperparameter tuning with `GridSearchCV` by avoiding redundant computations. Use the `memory` parameter of the `make_pipeline` function.
- (7:00) Scikit-learn handles sample weights: You can assign different importance levels to data points during training using the `sample_weight` parameter in `fit` methods.
- (9:30) Keyword arguments can be passed to individual pipeline steps: The talk passes arguments like `sample_weight` to a single component using the double-underscore syntax (e.g., `stepname__sample_weight`). Per the speaker's correction in the comments, this is the older approach, not the new metadata routing feature.
- (10:52) Scikit-learn supports sparse datasets and models: It efficiently handles data with many zeros, such as text data represented with TF-IDF, and provides specialized handling within components like `StandardScaler`.
- (12:50) Scikit-learn allows training on large datasets (out-of-core learning): The `partial_fit` method enables training on data that doesn't fit entirely in memory by processing it in batches.
- (16:11) Scikit-learn provides robust numerical implementations: Behind the scenes, scikit-learn utilizes numerically stable algorithms, addressing issues like integer overflow that can arise in simpler implementations.
- (26:19) Scikit-learn has a semi-supervised learning module: Algorithms like `LabelSpreading` can propagate labels based on data structure and a limited number of labeled examples. This is useful for tasks like interactive search refinement.
- (26:50) Beyond `fit` and `predict`, scikit-learn offers a wealth of features: Explore the documentation for a wide array of tools, including metrics, monotonic constraints, density estimators, fairness tools, and more.
- (17:58) Scikit-learn's documentation is excellent and worth diving into: It often includes links to relevant research papers, providing deeper insights into the underlying methodologies.
- (19:46) Scikit-learn is actively maintained and improved: Keep an eye on the release notes for new features, performance enhancements, and bug fixes.
- (27:14) Consider exploring the broader scikit-learn ecosystem: Libraries like `umap` and `sentence_transformers` integrate seamlessly with scikit-learn, extending its capabilities.
- (33:33) Scikit-learn is generally thread-safe (with caveats): While mostly thread-safe, specific implementations or configurations might introduce exceptions, so be mindful and refer to the documentation and issue tracker for details.
Summarized by AI model: gemini-1.5-pro-exp-0801
Cost (if I didn't use the free tier): $0.0861
Input tokens: 21309
Output tokens: 1093
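
The out-of-core item in the summary above can be sketched roughly as follows. The `batches` generator is a hypothetical stand-in for reading chunks from disk, and `SGDClassifier` is just one of the estimators that expose `partial_fit`; neither is claimed to be what the talk uses.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)


def batches(n_batches=20, batch_size=100):
    """Hypothetical data source: yields one small chunk at a time."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 5))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)

# partial_fit needs the full set of class labels up front,
# because a single batch might not contain every class.
classes = np.array([0, 1])
for X_batch, y_batch in batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on fresh data drawn from the same synthetic rule.
X_test = rng.normal(size=(500, 5))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
accuracy = clf.score(X_test, y_test)
```

Only one batch is held in memory at any time, which is the point of the technique; the model never sees the whole dataset at once.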
@manuelalbertoromerogarcia9495 14 days ago
In addition, there is a set of compatible packages like multilearn or imbalearn; it can be quite overwhelming to know if you are using them the right way! Nice presentation, as usual!
@ldq-o2s 15 days ago
9:16 should use sum of weights in the denominator
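
The general point in the comment above, that a weighted mean divides by the sum of the weights rather than by the number of samples, can be checked with a few lines. The numbers are invented for illustration and are not taken from the slide at 9:16.

```python
import numpy as np

values = np.array([1.0, 2.0, 4.0])
weights = np.array([3.0, 1.0, 1.0])

# Weighted mean: divide by the sum of the weights.
weighted_mean = np.sum(weights * values) / np.sum(weights)  # 9 / 5 = 1.8

# Dividing by the sample count instead gives a different, inflated value.
divided_by_count = np.sum(weights * values) / len(values)  # 9 / 3 = 3.0

# NumPy's built-in weighted average agrees with the first form.
numpy_average = np.average(values, weights=weights)
```

The two forms coincide only when the weights happen to sum to the number of samples, for example when every weight is 1.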
@agwargergahht 23 days ago
Hahaha