KNIME tutorial: Kaggle Titanic Machine Learning data prep and cleaning

  Views: 20,089

Eric

This is the first of a three-part series of tutorials on how to use KNIME for a Kaggle machine learning problem. This tutorial is a beginner-friendly way to build a machine learning model without needing to write code.
The Titanic problem is a classic Kaggle classification problem in which data scientists use passenger data to predict who survived and who did not.
Kaggle problem and data here: www.kaggle.com...
Download KNIME here: www.knime.com/...
In this tutorial we will cover data cleaning and preparation. We will utilize a random forest model to make our predictions. And finally, we will utilize feature engineering to improve our model's performance.
Please look for parts 2 and 3 of the tutorial coming soon.

Comments: 33
@TPM1878 4 years ago
Hi Eric, great job... please keep the content coming.
@ewhulbert 4 years ago
Glad you like it. Hope to find some time to add more soon.
@yoyo-tk1yz 1 year ago
Hi, may I know how to rewrite/identify the problem statement from this dataset? I'm still confused.
@DanielWeikert 5 years ago
Thanks, can you do more on this, Eric? It's kind of different from doing things in Python and the machine learning libraries there (even though the overall data science process is the same). I would love to see more videos from you diving into all the relevant nodes in KNIME which help us do all the fancy stuff available when we do things in Python directly. It's kind of difficult (at least for me) in KNIME when you first start and don't know all the necessary nodes. Thanks and best regards. Edit: all the relevant extensions that are required and useful would also be an interesting topic.
@ewhulbert 5 years ago
I'm working on it. I've got another one mostly written regarding using the H2O.ai GLM nodes to do a linear regression. Need to finish up and film it and I think that one will be pretty cool. It is a little different in terms of using nodes vs. Pandas or Numpy. My suggestion is to just think of what you want to do, and then search for a node that sounds like that. I'll try to keep adding stuff as best I can, but unfortunately I have a pretty demanding day job. I appreciate the interest and the request, that helps me figure out what people want to hear about.
@DanielWeikert 5 years ago
@@ewhulbert Great, thanks for your reply. Can't wait.
@MohamedAshraf-zs6nv 4 years ago
How do you perform parameter optimization for string values, like the pruning method in the Decision Tree Learner?
@ewhulbert 4 years ago
It's complicated and probably wouldn't fit into this video. Perhaps I'll make one in a month or two.
@MohamedAshraf-zs6nv 4 years ago
@@ewhulbert Thanks, but until then could you provide me with a link to understand it? You know, one or two months is not a short period of time.
@ewhulbert 4 years ago
@@MohamedAshraf-zs6nv Not sure if I see one in KNIME. There are some decent videos on doing it in python, which you can translate easily enough.
@christopherong4686 4 years ago
Hi Eric, how about the missing values in Cabin? How are you going to clean them?
@ewhulbert 4 years ago
Didn't use it. With so many values missing, it's hard to clean and get any use from it. It is also not readily apparent whether it's important at all; unlike class, where we know some are better than others for survival, it's not really clear which cabins would or would not be better. I'd strongly suspect that any solution which uses cabin is overfitting.
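As a rough illustration of the reply above (not taken from the video, and written in plain Python rather than as KNIME nodes), the idea of measuring how much of each column is missing and dropping the sparse ones could be sketched like this. The toy rows, the function names, and the 0.5 threshold are all assumptions made for the example.

```python
# Toy rows shaped like the Titanic data; the values are made up for illustration.
rows = [
    {"Pclass": 3, "Age": 22.0, "Cabin": None},
    {"Pclass": 1, "Age": 38.0, "Cabin": "C85"},
    {"Pclass": 3, "Age": None, "Cabin": None},
    {"Pclass": 2, "Age": 35.0, "Cabin": None},
]


def missing_fraction(rows):
    """Return {column: fraction of rows where the value is None}."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}


def drop_sparse_columns(rows, max_missing=0.5):
    """Drop every column whose missing fraction exceeds max_missing.

    The 0.5 default is a hypothetical threshold, not a recommendation.
    """
    fractions = missing_fraction(rows)
    keep = [c for c, f in fractions.items() if f <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]


fractions = missing_fraction(rows)
cleaned = drop_sparse_columns(rows, max_missing=0.5)
print(fractions)
print(list(cleaned[0].keys()))
```

In this toy data Cabin is missing in three of four rows, so it is dropped, while Age (one missing value) is kept and would still need imputing.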
@careenevans 4 years ago
Hi Eric, thank you for the well-detailed explanation of KNIME and machine learning. I have one question: how do you know when to use SVM, random forest, or any other classification method on the data?
@ewhulbert 4 years ago
Sorry for the slow reply, I missed this comment. There is a ton that goes into choosing, e.g., how it performs on a specific type of data, the size of the data set, the importance of interpretability, etc. It is way more than I can type here, and I don't fancy myself an expert, so I'll post a couple of articles on it that I think are pretty good. Classification specific: medium.com/datadriveninvestor/choosing-the-best-algorithm-for-your-classification-model-7c632c78f38f Basic for all ML: towardsdatascience.com/do-you-know-how-to-choose-the-right-machine-learning-algorithm-among-7-different-types-295d0b0c7f60
@careenevans 4 years ago
Eric Hulbert, thank you so much. I’ll go through these. I believe they will be helpful.
@royl7072 3 years ago
Hi Eric, why do we have to convert SibSp/Parch from integer to string? Is integer not feasible? What about the same case with a decision tree? Do we need to convert all of them? Thanks.
@ewhulbert 3 years ago
Technically speaking, you don't necessarily need to, as integers can be class variables. Practically speaking, it's not hard to do and it eliminates errors that sometimes crop up when using integers as classes. I don't remember ever having this issue in KNIME, but I have had it elsewhere, and it's a pain that can take a while to figure out.
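To make the integer-versus-string point concrete, here is a small plain-Python sketch (an illustration only, not the KNIME Number To String node, and with made-up values). Once the counts are strings, nothing downstream can treat them as magnitudes by accident; they can only be handled as labels, for example by one-hot encoding them.

```python
# Hypothetical SibSp values for five passengers.
sibsp = [1, 0, 3, 0, 1]

# Convert the integers to string labels so they are treated as categories.
sibsp_labels = [str(v) for v in sibsp]

# The distinct categories, in a stable order.
categories = sorted(set(sibsp_labels))

# One-hot encode: one 0/1 column per category, no numeric ordering implied.
one_hot = [[int(label == c) for c in categories] for label in sibsp_labels]

print(categories)
print(one_hot)
```

Whether a given learner needs this conversion depends on the tool; as the reply says, it is often optional.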
@regen2787 4 years ago
Hi Eric! I'm Alan Boey. Can you teach some ways to do advanced data preparation (e.g. data cleaning)? Which nodes should I use?
@ewhulbert 4 years ago
That's a good idea, Vincent. I'd like to do that and I think it would be valuable. Hopefully I can find the time soon.
@regen2787 4 years ago
@@ewhulbert Thanks! For now, do you have any nodes you could recommend for data cleaning/data preparation? Thanks in advance. -Alan Boey
@ihtishamkhalil7481 5 years ago
Thank you Eric. I am completing my assignment on the bank client dataset and I am having trouble, as my ROC curve always stands at 1. I don't know what I should do :(
@ewhulbert 5 years ago
So models can't be perfect, and if you see a model that is predicting things perfectly, that typically means you have an independent variable that is not really independent; it's a proxy for the target. Does that make sense? You're probably putting something into the model that is the same thing as your response. I'd start by checking the correlations of your independent variables (features) with the target variable, and if you see one with an r2 over 0.9, throw it out. That means it's really just a proxy for what you are trying to predict.
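The correlation check suggested above could look roughly like the following in plain Python. This is a sketch under assumptions: the feature names and values are invented, `outcome_code` is a hypothetical leaked column that simply copies the target, and the 0.9 cutoff on r squared is the rule of thumb from the reply rather than a fixed standard.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def suspected_proxies(features, target, r2_threshold=0.9):
    """Return the names of features whose r squared with the target exceeds the threshold."""
    flagged = []
    for name, values in features.items():
        r2 = pearson_r(values, target) ** 2
        if r2 > r2_threshold:
            flagged.append(name)
    return flagged


# Made-up data: 1 = positive outcome, 0 = negative outcome.
target = [0, 1, 0, 1, 1, 0]
features = {
    "balance": [120.0, 80.0, 300.0, 150.0, 90.0, 200.0],
    "outcome_code": [0, 1, 0, 1, 1, 0],  # hypothetical leaked copy of the target
}

flagged = suspected_proxies(features, target)
print(flagged)
```

A flagged column is only a suspect: it is worth checking how it was produced before removing it, since a linear correlation will not catch every kind of leakage.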
@ДарьяМ-т3л 3 years ago
Hi, but after "missing values", "Passenger ID" disappears. Why? In my program I still have it in "number to string". Could you help me and explain why this happens?
@ewhulbert 3 years ago
Not sure I understand the question, but it might be that you are excluding columns in which more than a certain percentage of the values are missing.
@thirumurthym7980 4 years ago
Eric Hulbert, can we have links to parts 2 and 3? Thanks.
@ewhulbert 4 years ago
Part two: kzbin.info/www/bejne/rni6inyXm62sn9k Part three: kzbin.info/www/bejne/h4SYfIVjhb-ajbc
@morphyon 3 years ago
I sense a Big Lebowski reference in here, but I might be wrong.
@ewhulbert 3 years ago
You are not wrong. Perhaps I should change my channel to El Analytics Duderino if fans are not into the whole brevity thing.
@morphyon 3 years ago
Ah well, no troubles, I am way out of my element. The naming convention is not the issue here. 🙂
@ewhulbert 3 years ago
@@morphyon Am I the only one who gives a crap about the rules? Mark it zero smokey!
@morphyon 3 years ago
@@ewhulbert Calm down, Analytics Dude. You're being very Un-Analytics Dude.
@ewhulbert 3 years ago
@@morphyon Calmer than you are... OK, I'll stop now. I went through periods in my life where friends and I could have entire conversations using Big Lebowski quotes and innuendo. Appreciate the support, hope you like the videos.