Vincent D. Warmerdam - Scikit-Learn can do THAT?!

Views: 4,703


Many of us know scikit-learn for its ability to construct pipelines that can do .fit().predict(). It's an amazing feature for sure. But once you dive into the codebase ... you realise that there is just so much more.
This talk will be an attempt at demonstrating some extra features in scikit-learn, and its ecosystem, that are less common but deserve to be in the spotlight.
In particular I hope to discuss these things that scikit-learn can do:
- sparse datasets and models
- larger than memory datasets
- sample weight techniques
- image classification via embeddings
- tabular embeddings/vectorisation
- data deduplication
- pipeline caching
If time allows I may also touch on extra topics.
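
As a rough illustration of the pipeline caching item in the list above, here is a minimal sketch. It is not taken from the talk: the synthetic dataset, the choice of `PCA` and `LogisticRegression`, and the parameter grid are all assumptions made for the example. It relies on the `memory` argument of `make_pipeline`, which stores fitted transformers on disk so that a grid search that only changes the final estimator does not have to refit the earlier steps.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic data, purely for illustration (an assumption, not from the talk).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

cache_dir = mkdtemp()  # temporary folder that holds the cached transformers
try:
    pipe = make_pipeline(
        PCA(n_components=5),
        LogisticRegression(max_iter=1000),
        memory=cache_dir,
    )
    # Only the classifier's C changes between candidates, so the fitted PCA
    # for each cross-validation fold can be reused from the cache.
    grid = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=3)
    grid.fit(X, y)
    best_C = grid.best_params_["logisticregression__C"]
finally:
    rmtree(cache_dir)  # clean up the cache once the search is done
```

The benefit grows with the cost of the cached step; for a cheap transformer like this one the disk round-trip may outweigh the saving.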

Comments: 10

@calmcode-io 23 days ago

Vincent here, and I need to make a correction! Around the 10 minute mark I mention metadata routing. But what I actually show on screen is not the new metadata routing feature, but rather the old way of doing it! Total mistake on my end! The scikit-learn docs will tell you more about the new syntax that is different from what you see here. I would normally share a link to the docs, but YT tends to shadowban comments with an outgoing link.
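
For context, here is a minimal sketch of what the older approach usually looks like; it is an illustration and not a transcript of the code shown in the talk. The data, the weights, and the choice of estimators are assumptions. With metadata routing left at its default (disabled), `Pipeline.fit` accepts keyword arguments of the form `<step name>__<parameter>` and forwards them to that step's `fit`. The newer metadata routing syntax is different and is described in the scikit-learn docs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data and weights, invented for this example.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
weights = np.where(y == 1, 2.0, 1.0)  # count positive rows twice as much

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# make_pipeline names each step after its lowercased class name, so the
# classifier step is "logisticregression". The double underscore sends
# sample_weight to that step only; the scaler does not receive it.
pipe.fit(X, y, logisticregression__sample_weight=weights)
predictions = pipe.predict(X)
```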

@DistortedV12 12 days ago

Put it in the replies. Thanks sir!

@TechTalksWeekly 20 days ago

This is a fantastic talk and it has been featured in the last issue of Tech Talks Weekly newsletter 🎉 Congrats Vincent!

@JR-gy1lh 18 days ago

I love Vincent's ability to teach. I'm here cause of :probabl which I've loved thus far.

@JR-gy1lh 18 days ago

The fact you can abstract complexity with scikit-learn is amazing. The behind the scenes is beautiful.

@TheRonakagrawal 23 days ago

Wow, @calmcode-io @vincent! Fantastic talk, you kill it every time!

@wolpumba4099 22 days ago

Summary

- (0:00) Scikit-learn can cache pipeline steps: This significantly speeds up tasks like hyperparameter tuning using `GridSearchCV` by avoiding redundant computations. Use the `memory` parameter in the `make_pipeline` function.
- (7:00) Scikit-learn handles sample weights: You can assign different importance levels to data points during training using the `sample_weight` parameter in `fit` methods.
- (9:30) Passing `sample_weight` to pipeline steps: Keyword arguments such as `sample_weight` can be sent to individual components within a pipeline using the double underscore syntax (e.g., `component__sample_weight`). Per the speaker's correction in the comments, this is the older mechanism and not the new metadata routing feature.
- (10:52) Scikit-learn supports sparse datasets and models: It efficiently handles data with many zeros, such as text data represented with TF-IDF, and provides specialized handling within components like `StandardScaler`.
- (12:50) Scikit-learn allows training on large datasets (out-of-core learning): The `partial_fit` method enables training on data that doesn't fit entirely in memory by processing it in batches.
- (16:11) Scikit-learn provides robust numerical implementations: Behind the scenes, scikit-learn utilizes numerically stable algorithms, addressing issues like integer overflow that can arise in simpler implementations.
- (26:19) Scikit-learn has a semi-supervised learning module: Algorithms like `LabelSpreading` can propagate labels based on data structure and a limited number of labeled examples. This is useful for tasks like interactive search refinement.
- (26:50) Beyond `fit` and `predict`, scikit-learn offers a wealth of features: Explore the documentation for a wide array of tools, including metrics, monotonic constraints, density estimators, fairness tools, and more.
- (17:58) Scikit-learn's documentation is excellent and worth diving into: It often includes links to relevant research papers, providing deeper insights into the underlying methodologies.
- (19:46) Scikit-learn is actively maintained and improved: Keep an eye on the release notes for new features, performance enhancements, and bug fixes.
- (27:14) Consider exploring the broader scikit-learn ecosystem: Libraries like `umap` and `sentence_transformers` integrate seamlessly with scikit-learn, extending its capabilities.
- (33:33) Scikit-learn is generally thread-safe (with caveats): While mostly thread-safe, specific implementations or configurations might introduce exceptions, so be mindful and refer to the documentation and issue tracker for details.

Summarized by AI model: gemini-1.5-pro-exp-0801
Cost (if I didn't use the free tier): $0.0861
Input tokens: 21309
Output tokens: 1093
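
To make the `partial_fit` item in the summary above more concrete, here is a minimal sketch of out-of-core style training. It is an assumption-laden illustration, not the code from the talk: the `batches` generator is a hypothetical stand-in for reading chunks from disk, and `SGDClassifier` is just one of the estimators that implement `partial_fit`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])  # partial_fit must be told every label on its first call


def batches(n_batches=20, batch_size=100):
    """Hypothetical stand-in for streaming chunks of a large file."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 5))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)
for X_batch, y_batch in batches():
    # Each call updates the existing weights rather than refitting from
    # scratch, so only one batch needs to be in memory at a time.
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on a fresh batch drawn from the same synthetic rule.
X_test = rng.normal(size=(500, 5))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
accuracy = clf.score(X_test, y_test)
```

Any preprocessing in front of the model has to work incrementally or statelessly as well, otherwise the full dataset is needed to fit it.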

@manuelalbertoromerogarcia9495 14 days ago

In addition, there is a set of compatible packages like multilearn or imbalearn; it can be quite overwhelming to know if you are using them the right way! Nice presentation, as usual!