How to do the Titanic Kaggle competition in R

How to do the Titanic Kaggle competition in R - Part 1

101,294 views

1 day ago

Comments: 127

@jamiebond8481 7 years ago

Thank you for the tutorial; it is great and easy to understand. I am learning how to analyze datasets from Kaggle, and this is a great help for my final school project.

@amoghhuilgol1317 7 years ago

At 31:55, how do you map isSurvived and passengerId? I mean, how do we know which value belongs to which passengerId?

@퉁퉁이-p1n 4 years ago

At 27:34, I don't know why my R shows "titanic.model

@matthewmy22 2 years ago

As someone who is mainly accustomed to dplyr, I learned a ton seeing how you manipulate and clean data without it

@Datasciencedojo 2 years ago

Glad you liked it, keep following us for more tutorials.

@timmetzler4782 3 years ago

at 31:45 for output.df$Survived

@Kaneki-sn3bf 3 years ago

Why fill in with the mode? Why not just put NA for not available? (12:41)

@Vaggos16 2 years ago

I tried to install the randomForest package and it doesn't exist in R. What is going on? If someone could help me, I would appreciate it.

@deepakkannan7409 7 years ago

Hi, at around 20:00 he introduces the concept of "categorical casting" and converts only certain columns to factors. My questions: 1. What is "categorical casting"? 2. Why is it used only for certain columns and not all? 3. What will happen if we build our model without it? Many thanks in advance.

@Datasciencedojo 6 years ago

The concept of categorical casting is similar to factors in R. It tells our model that only certain values are allowed. For instance, if male and female are left as character strings and not converted to categories, we are telling the model that other input values are also possible. Categorical or factor casting forces the input to be one of only two values, {male, female}. Which features we apply this transform to depends on your understanding of the problem. Class {1,2,3} may be left as an integer or converted to a category, whereas Embarked must be treated as a categorical/factor variable.
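
A minimal sketch of what this casting might look like in R. The column names follow the Kaggle Titanic data, but the `passengers` data frame and its rows are made up for illustration:

```r
# Toy rows for illustration only; not taken from train.csv.
passengers <- data.frame(
  Pclass   = c(1, 3, 2, 3),
  Sex      = c("male", "female", "female", "male"),
  Embarked = c("S", "C", "Q", "S"),
  stringsAsFactors = FALSE
)

# Before casting, Sex is free text: any string would be accepted.
class(passengers$Sex)

# Cast the categorical columns to factors so only known levels are allowed.
passengers$Sex      <- as.factor(passengers$Sex)
passengers$Embarked <- as.factor(passengers$Embarked)
passengers$Pclass   <- as.factor(passengers$Pclass)  # optional; could stay numeric

levels(passengers$Sex)
levels(passengers$Embarked)
```

After the cast, `levels()` lists the only values each column can take, which is the restriction described in the reply.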

@s.a.Tawhid 3 years ago

@Datasciencedojo, thanks a lot. I was also wondering about this matter.

@karelsukup1973 6 years ago

Thank you so much! It's exactly what I was looking for.

@Kaneki-sn3bf 3 years ago

What is the 70/30 split? A link to the minute he talked about this would be helpful.

@CynicalScientist261 3 years ago

Can anyone explain strings as factors? I’m new and don’t get it.

@Datasciencedojo 3 years ago

Hello, please send your query through our site's chatbot or by email: datasciencedojo.com/

@RMCCVO 5 years ago

Error in randomForest.default(m, y, ...) : NA/NaN/Inf in foreign function call (arg 1) In addition: Warning messages: 1: In data.matrix(x) : NAs introduced by coercion 2: In data.matrix(x) : NAs introduced by coercion

@Javigarcia222 4 years ago

Same here! Did you solve it? I think it identifies the Survived values (0s) as if they were NAs :( I don't know how to solve it.

@harshitadas7397 4 years ago

I came across the same error. To resolve it, when you read the train and test csv files, change the stringsAsFactors argument to TRUE. Hope this helps.
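
A small sketch of the fix described here, assuming the error comes from text columns reaching the model as plain character strings. The CSV content is inlined and made up so the snippet runs on its own; with the real files you would pass the file name instead of `text =`:

```r
# Two-row stand-in for train.csv, inlined so the snippet is self-contained.
csv_text <- "PassengerId,Sex,Embarked\n1,male,S\n2,female,C"

# Since R 4.0.0, read.csv keeps strings as character vectors by default.
as_chars <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Asking for factors at read time gives categorical columns straight away.
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chars$Sex)    # character
class(as_factors$Sex)  # factor
```

An alternative with the same effect is to read the files as character data and call `as.factor()` on each categorical column afterwards.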

@amara1037 5 years ago

at 31:45 for output.df$Survived

@Javigarcia222 4 years ago

@pallavikurulkar631 4 years ago

I am getting the same error. Could you please explain why this is happening?

@harshitadas7397 4 years ago

I came across the same error. To resolve it, when you read the train and test csv files, change the stringsAsFactors argument to TRUE. Hope this helps.

@ksoovibe7606 3 years ago

Sir, could you please explain why the passenger ID changed in the submission file?

@Datasciencedojo 3 years ago

Hello, the submission file contains data on different passengers so that we can test our model. This is known as the train-test split and it helps us make sure our model is generalizable to data other than what we have trained on.

@ksoovibe7606 3 years ago

@Datasciencedojo Oh okay, it's clear now. Thank you so much, sir.

@Datasciencedojo 3 years ago

@ksoovibe7606 Keep following us for more content!

@deepakkannan7409 7 years ago

Hi features.equation

@moeketsimosia7290 6 years ago

Same here

@anuragkeshavaraju2169 7 years ago

Hello, all I get is this:

> tail(output.df)
    passengerId   Survived
413        1304 0.38337855
414        1305 0.04974327
415        1306 0.99573333
416        1307 0.01896443
417        1308 0.04974327
418        1309 0.54535266

and the Survived column in test is NA. Please explain.

@PitDriver 5 years ago

I get an error when rbinding titanic.test with train: Error in match.names(clabs, names(xi)) : names do not match previous names. What do I do to fix this?

@PitDriver 5 years ago

I found the issue: I had misspelled Survived.
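
A hedged sketch of the combining step this thread is about. The object names `titanic.train`, `titanic.test`, `titanic.full` and `IsTrainSet` appear in the comments; the rows here are toy values, and the `setdiff()` check is a suggestion for spotting a misspelled column rather than something taken from the video:

```r
# Toy stand-ins for the two Kaggle files; values are illustrative only.
titanic.train <- data.frame(PassengerId = 1:3, Survived = c(0, 1, 1), Age = c(22, 38, 26))
titanic.test  <- data.frame(PassengerId = 4:5, Age = c(35, NA))

# test.csv has no label column, so add one with exactly the same name.
titanic.test$Survived <- NA

# Flag the origin of each row so the sets can be separated again later.
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet  <- FALSE

# Any name listed here is what rbind would complain about (empty means OK).
setdiff(names(titanic.train), names(titanic.test))

titanic.full <- rbind(titanic.train, titanic.test)
```

`rbind()` on data frames matches columns by name, so a single misspelling such as `Survied` is enough to trigger the "names do not match previous names" error.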

@cheddar-samosa5789 7 years ago

Hi, I am facing a minor issue. After this: titanic.formula

@deepakkannan7409 7 years ago

Hi, replace titanic.test with titanic.train. Hope it works.

@deepakkannan7409 7 years ago

Yes, I remember after changing it, it worked.

@PraveenKumar-zh6gw 7 years ago

Nope, it didn't. Please tell me if there is any solution to this.

@bubblysudhi 7 years ago

Please check whether there are any missing values in your data before proceeding to random forest.

@VineelVishwanth 4 years ago

rf_model

@Vinit_Ambat 1 year ago

Thanks! I made my first ever submission on Kaggle for Spaceship project and got 73% accuracy

@Datasciencedojo 1 year ago

Congratulations, Vinit. Stay tuned with us for more tutorials.

@Vinit_Ambat 1 year ago

@Datasciencedojo Thanks!

@spunkykid5898 7 years ago

Hi, I was unable to combine when using rbind, any idea of what could be causing the issue?

@bonifaceyogendran2294 7 years ago

I am having the same problem too!

@bonifaceyogendran2294 7 years ago

Try titanic.full

@rehanawajahathussain7642 6 years ago

I have the same problem. I tried to install package rbind and it gives me a warning message.

@datamaster6955 4 years ago

@bonifaceyogendran2294 Thank you so much! You helped me.

@shaista2k1 4 years ago

The bind_rows function combines the training and testing datasets well, but after that I get an error whenever I use the titanic.full dataset.

@harshadakale2247 5 years ago

Can you upload a step-by-step project with data cleaning and missing values in R?

@uhsay1986 6 years ago

Why were the 2 missing Embarked cells assigned only to S and not to C or Q?

@kammelna 5 years ago

He assigned them to the most frequent value, which is S.
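
A minimal sketch of filling blanks with the most frequent value (the mode). It assumes the missing Embarked entries show up as empty strings; the `embarked` vector and its counts are made up for illustration:

```r
# Toy Embarked column with two blanks; the counts are illustrative only.
embarked <- c("S", "C", "", "S", "Q", "S", "", "C")

# Tabulate the non-missing values and pick the most frequent one.
counts <- table(embarked[embarked != ""])
mode_value <- names(counts)[which.max(counts)]

# Fill the blanks with that value.
embarked[embarked == ""] <- mode_value
table(embarked)
```

If the blanks were read in as `NA` instead, the same idea applies with `is.na(embarked)` in place of `embarked == ""`.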

@Fateaha 6 years ago

Why are we using the median here and not the mean?

@ananyajha451 5 years ago

I am not getting the final Survived values as 0s and 1s; I am getting a bunch of probabilities instead. Any idea why?

@Datasciencedojo 5 years ago

It is likely that you have set type="prob" in the predict() function. Set this to type="class" and you will get a class outcome. Otherwise, your Survived values might be interpreted as numbers rather than classes represented as numbers. They need to be treated as factor levels/classes 0 and 1. Use str(data) to check if they are num or factors. If they are num, convert them into factors - data$Survived
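
A rough sketch of the check and conversion described in this reply. The `toy` data frame is invented for illustration, and the model step only runs if the randomForest package (from CRAN) is installed; the feature choice and `ntree = 50` are arbitrary assumptions:

```r
# Toy data, illustrative only: the label is stored as a number at first.
set.seed(42)
toy <- data.frame(
  Age      = c(22, 38, 26, 35, 54, 2, 27, 14, 58, 20),
  Fare     = c(7.3, 71.3, 7.9, 53.1, 51.9, 21.1, 11.1, 30.1, 26.6, 8.1),
  Survived = c(0, 1, 1, 1, 0, 0, 1, 1, 1, 0)
)

str(toy$Survived)                        # num: a numeric label means regression
toy$Survived <- as.factor(toy$Survived)  # factor: the label is now a class

# Skip the model step if the package is not available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  model <- randomForest::randomForest(Survived ~ Age + Fare, data = toy, ntree = 50)
  pred_class <- predict(model, newdata = toy, type = "response")  # class labels
  pred_prob  <- predict(model, newdata = toy, type = "prob")      # class probabilities
}
```

With a numeric `Survived`, randomForest fits a regression forest and returns fractional values, which matches the probabilities-instead-of-classes symptom reported in this thread.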

@kumarsrajesh9078 6 years ago

Hi, please help me. I am not able to download the data set.

@shaista2k1 4 years ago

I get a "Need at least two classes to do classification" error while running the model. Please tell me how to resolve it.

@kgrocks5442 4 years ago

It's not allowing me to do rbind

@amoghhuilgol1317 7 years ago

I am new to machine learning. In a large dataset with lots of columns/attributes, how do we decide which attributes to choose for our random forest? In the video, I could see that you hardcoded the attributes. How can we choose attributes dynamically?

@Sociography00 7 years ago

It depends on your understanding of the data. You must understand which variables/attributes are going to affect your response variable.

@amoghhuilgol1317 7 years ago

Akshay Nevrekar, but suppose my dataset has thousands of attributes. How would I choose the best attributes? It would be practically impossible to know how those attributes affect the results. Also, understanding the attributes would mean making some assumptions, which could be false. Any thoughts on how to handle a large number of attributes?

@clinton11994 7 years ago

Hi, how does the R random forest classifier deal with a mix of categorical and quantitative data? Also, any suggestions for dealing with the same problem if we want to use other classification algorithms?

@aldofranco6764 6 years ago

I think R converts them to dummies (there are 2 options; I don't know which one is used) and voilà!

@nsovoshe 2 years ago

titanic.formula

@stevezhang9276 3 years ago

Can someone tell me why he combined the two datasets? Does that mean he changed the facts?

@Datasciencedojo 3 years ago

Hello Steve, we combine the datasets here so that we only have to apply pre-processing steps once and because we want to use the global median rather than the median of just the train.csv data. This is explained in the video around the 5-minute mark.
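
A short sketch of the "global median" idea from this reply, assuming Age is the column being filled. The `titanic.full` rows below are toy values, not the real data:

```r
# Toy combined data; values are illustrative only.
titanic.full <- data.frame(
  IsTrainSet = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  Age        = c(22, NA, 38, 30, NA)
)

# Median over train and test rows together, ignoring the missing entries.
age.median <- median(titanic.full$Age, na.rm = TRUE)

# Apply the same fill to every missing Age, whichever file the row came from.
titanic.full$Age[is.na(titanic.full$Age)] <- age.median
```

Because the fill is done once on the combined data, the train and test rows end up with the same replacement value.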

@stevezhang9276 3 years ago

@Datasciencedojo Thank you so much. Is there any other way to clean the data? By cleaning I mean filling the missing values with better predictions.

@Datasciencedojo 3 years ago

Hello @stevezhang9276, our data mining tutorials can help you out. Go through tutorials 7 to 11: kzbin.info/www/bejne/nGixl4Jna9Fjarc

@kashishmalhotra221 5 years ago

The GitHub link is not working. Does anybody have the code?

@samsadbinzubair-j2d 10 months ago

Could you please upload the codes in the description section?

@Datasciencedojo 10 months ago

Please check the link provided in the description.

@neerajraut6473 6 years ago

Why random forest in particular? Why not some other classification technique?

@Datasciencedojo 6 years ago

You can use any classification algorithm instead. Random forest models are popular in industry and are fairly easy to understand because of the underlying non-parametric (decision-tree-based) learning approach. You can get similar results with different learning algorithms such as logistic regression, boosted decision trees, naive Bayes, and so on.
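
As one illustration of swapping in a different algorithm, here is a sketch of logistic regression with base R's `glm()`. It uses the aggregated `Titanic` table that ships with R rather than the Kaggle train.csv, so the columns (Class, Sex, Age, Survived, Freq) differ from the video's data, and the chosen passenger profile is arbitrary:

```r
# Base R ships an aggregated Titanic table (counts by class, sex, age, survival).
# It is not the Kaggle train.csv, but it is enough to show the idea.
titanic.counts <- as.data.frame(Titanic)

# Logistic regression on the grouped counts; Freq gives each row its weight.
logit.model <- glm(
  Survived ~ Class + Sex + Age,
  family  = binomial,
  weights = Freq,
  data    = titanic.counts
)

# Predicted survival probability for one passenger profile.
profile <- data.frame(
  Class = factor("3rd",    levels = levels(titanic.counts$Class)),
  Sex   = factor("Female", levels = levels(titanic.counts$Sex)),
  Age   = factor("Adult",  levels = levels(titanic.counts$Age))
)
survival.prob <- predict(logit.model, newdata = profile, type = "response")
```

To turn the probability into a 0/1 prediction for a submission file, one common choice is to threshold it at 0.5.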

@469jnishant 7 years ago

Hi, can you please tell me why we didn't convert the categorical variables into dummy vars? Also, is there a need to convert them to dummy vars here at all? If not, then why do some tutorials say to convert them into dummy vars and then use them?

@vitoroliveira2933 3 years ago

I also had the same doubt, I know that when the categorical variable has only 2 categories, R does it automatically, but when it has more than 2, I believe it is necessary to do so.

@dipeshjoshi9093 7 years ago

Awesome tutorial. Thanks for sharing. :)

@4upranit 7 years ago

I am new to this. Can anyone explain why there are two datasets, e.g. train.csv and test.csv? Thank you.

@souravgames 7 years ago

One (train) is used to create a model that will give you the result. The other (test) is used to check whether the model predicts similar results when you do the same with the test data.

@swati2793 6 years ago

That's a good video. However, I did not understand why we needed to convert the variables (pClass, Sex, Embarked) into factors through as.factor. Also, why did we convert Survived into a factor after splitting the two datasets? I would be grateful if someone could explain it in a bit more detail. Many thanks :)

@umairansari87 5 years ago

Actually, these are categorical columns, so they need special handling, because ML algorithms generally work only on numerical data. We have two options: convert them to numeric, or create what are informally called dummies. Dummies are 0/1 columns. Suppose Sex takes the values Male and Female: we can create a column that is 1 when Sex is "Male", which carries the same information as writing Male in the column. What factor does here is create levels, one per unique value of the column. For Sex it creates two levels (informally, two columns), "Male" and "Female", which are populated as 0 and 1.
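
To make the dummy idea concrete, here is a small sketch using base R's `model.matrix()`, which expands factors into 0/1 indicator columns the way formula-based models such as `lm()` and `glm()` do. The `toy` rows are made up, and this is an illustration of dummies in general rather than a description of what the video's random forest does internally:

```r
# Toy rows, illustrative only.
toy <- data.frame(
  Sex      = factor(c("male", "female", "female")),
  Embarked = factor(c("S", "C", "Q"))
)

# Expand the factors into 0/1 indicator columns.
# One level per factor is dropped and acts as the baseline.
dummies <- model.matrix(~ Sex + Embarked, data = toy)
colnames(dummies)
dummies
```

A factor with three levels, such as Embarked, becomes two indicator columns here; the dropped level (C) is represented by both being 0.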

@amansingla5440 6 years ago

very well explained! Thank you so much..

@kgrocks5442 4 years ago

as.formula is not working

@vamshikrishna6410 6 years ago

titanic.combine

@upasnasharma8149 7 years ago

Hi, after using rbind() to create titanic.full, I'm getting this: Error in titanic.full$IsTrainSet : $ operator is invalid for atomic vectors. What should I do?

@Abhi92raj 7 years ago

Check the names of the columns, there must be an error

@mcpduk 7 years ago

What exchange rate did you use to convert the fare into $ from £ :)

@mcpduk 7 years ago

seriously though, this is EXCELLENT vid

@Datasciencedojo 6 years ago

Depends upon whether the date of disaster was before or after Brexit. Can you remind us, please? :)

@Datasciencedojo 6 years ago

Thanks :)

@lnasution76 7 years ago

thanks for sharing - great stuff

@prvns8586 7 years ago

too awesome...!!!!!! great help

@yashuverma3811 7 years ago

Hi, why did we add 'S' only for the missing values? Also, by mistake I added a lowercase s, so the table now shows C = 270, Q = 123, s = 2, S = 914. How can I remove the lowercase s now?

@ankitapradhan1855 7 years ago

Yes, I too have the same doubt: why did he put the NA values into the S category?

@ankitapradhan1855 7 years ago

Yes, I got the answer.

@empuraan4710 6 years ago

What's the answer, Ankita?

@TheMarinho1 6 years ago

Nice video! But I get an error for nodesize = 0.01 * nrow(iris.train): Error in nrow(x) : argument "x" is missing, with no default. Can anybody help me?

@lofimixradio 6 years ago

I think it should be titanic.train; iris is a built-in test data set. Check ?iris (Edgar Anderson's Iris Data).

@biseul 6 years ago

This line : Titanic.model

@davidluong98 6 years ago

I could not get it either. Did you figure it out?

@Adinasa2 6 years ago

Raphael Pacheco titanic.model

@castilloberroafrancisco2914 2 years ago

wow great that's good

@Datasciencedojo 2 years ago

Glad you liked it, keep following us for more data science tutorials.

@taiwankyh 4 years ago

Nice video

@akashvp5262 6 years ago

I am getting this error: Error in tail(titanic.train) : object 'titanic.train' not found.

@septiandani8238 6 years ago

Maybe you stored it under a misspelled name?

@akashvp5262 6 years ago

Septian Dani okay

@cybern9ne 7 years ago

I doubt you should manipulate or clean the test data. The test data should be treated as if it were data that arrives after modeling, which you would never have the opportunity to use.

@Datasciencedojo 7 years ago

Hey cybern9ne! True, and that is definitely what is taught in the textbooks! However, in the business world, where every bit of performance matters, you want to grab as much information as possible. As long as you do not have access to the true labels, you always want to grab as much data as you can get, to account for possible categories that exist in the test set but not in the train set, new missing values that exist in the test set but not in the train set, and so on.
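
A minimal sketch of the "categories that exist in the test set but not in the train set" point. The toy `titanic.full` below is invented so that "Q" appears only in a test row; casting to a factor on the combined data is one way (an assumption here, not necessarily the video's exact code) to keep both splits on the same set of levels:

```r
# Toy data: "Q" appears only in the test rows. Values are illustrative only.
titanic.full <- data.frame(
  IsTrainSet = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  Embarked   = c("S", "C", "S", "Q", "S"),
  stringsAsFactors = FALSE
)

# Cast once on the combined data so both splits share one set of levels.
titanic.full$Embarked <- as.factor(titanic.full$Embarked)

titanic.train <- titanic.full[titanic.full$IsTrainSet, ]
titanic.test  <- titanic.full[!titanic.full$IsTrainSet, ]

levels(titanic.train$Embarked)  # still includes "Q"
levels(titanic.test$Embarked)
```

Had the two files been cast separately, the train factor would have had no "Q" level, and a model trained on it could reject the test rows that contain it.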

@Arecapalm24 8 years ago

Great tutorial!

@pzmt8051 2 years ago

Could this be done in Python?

@Datasciencedojo 2 years ago

Hello, yes, this can be done in Python.

@minhaaj 4 years ago

Please share the R code as well.

@ChrisR410a 8 years ago

So what was the point of creating the features.equation?

@Datasciencedojo 7 years ago

Hey Chris, we never ended up using it. It's a bad habit that I have from data mining in other languages where you have to define the features. R automatically finds the features the model was trained on, which is convenient.
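
For readers stuck on the formula step mentioned in several comments, here is a sketch of building a model formula from a string with `as.formula()`. The name `titanic.formula` appears in the comments; `titanic.equation` and the exact feature list are assumptions based on the Kaggle column names, not the video's code:

```r
# Column names follow the Kaggle Titanic data; the feature list is an assumption.
features <- c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")

# Paste the pieces into "Survived ~ Pclass + Sex + ..." and convert to a formula.
titanic.equation <- paste("Survived", "~", paste(features, collapse = " + "))
titanic.formula  <- as.formula(titanic.equation)

titanic.formula
all.vars(titanic.formula)
```

The resulting formula object can be passed as the first argument to a modeling function together with `data = titanic.train`.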