Thank you for the tutorial; it is great and easy to understand. I am learning how to analyze datasets from Kaggle, and this is a great help for my final school project.
@amoghhuilgol1317 7 years ago
At 31:55, how do you map isSurvived to passengerId? I mean, how do we know which value belongs to which passengerId?
@퉁퉁이-p1n 4 years ago
At 27:34, I don't know why my R shows "titanic.model
@matthewmy22 2 years ago
As someone who is mainly accustomed to dplyr, I learned a ton seeing how you manipulate and clean data without it
@Datasciencedojo 2 years ago
Glad you liked it, keep following us for more tutorials.
@timmetzler4782 3 years ago
at 31:45 for output.df$Survived
@Kaneki-sn3bf 3 years ago
Why fill in with the mode? Why not just put NA for "not available"? (12:41)
@Vaggos16 2 years ago
I tried to install the randomForest package, but it doesn't exist in R. What is going on? If someone could help me, I would appreciate it.
@deepakkannan7409 7 years ago
Hi, at around 20:00 he introduces the concept of "categorical casting" and converts only certain columns to factors. My questions are: 1. What is "categorical casting"? 2. Why is it applied only to certain columns and not all? 3. What will happen if we build our model without it? Many thanks in advance.
@Datasciencedojo 6 years ago
The concept of categorical casting is similar to factors in R. It tells our model that only certain values are allowed. For instance, if male and female are left as character strings and not converted to categories, we are telling the model that other input values are also possible. Categorical or factor casting forces the input to be one of only two values, {male, female}. Which features we apply this transform to depends on your understanding of the problem. Class {1,2,3} may be left as an integer or converted to a category, whereas Embarked must be treated as a categorical/factor variable.
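A minimal sketch of the casting described in the reply above. The data frame is a made-up stand-in for a few Titanic rows (the values and the name `passengers` are assumptions for illustration); only the column names follow the Kaggle CSV.

```r
# Toy data standing in for a few Titanic rows (values are made up).
passengers <- data.frame(
  Pclass   = c(1, 3, 2, 3),
  Sex      = c("male", "female", "female", "male"),
  Embarked = c("S", "C", "Q", "S"),
  stringsAsFactors = FALSE
)

# Cast the text columns to factors so only the observed categories are allowed.
passengers$Sex      <- as.factor(passengers$Sex)
passengers$Embarked <- as.factor(passengers$Embarked)

# Pclass could stay numeric; it is cast here only to show that it is a choice.
passengers$Pclass <- as.factor(passengers$Pclass)

str(passengers)
levels(passengers$Sex)  # "female" "male"
```

Whether to cast a numeric column such as Pclass is a modelling decision: as a number it has an order, as a factor it is just three labels.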
@s.a.Tawhid 3 years ago
@Datasciencedojo, thanks a lot. I was also wondering about this.
@karelsukup1973 6 years ago
Thank you so much! It's exactly what I was looking for
@Kaneki-sn3bf 3 years ago
What is the 70/30 split? A link to the minute he talked about this would be helpful.
@CynicalScientist261 3 years ago
Can anyone explain strings as factors? I’m new and don’t get it.
@Datasciencedojo 3 years ago
Hello, please forward your query on our site's chatbot or email: datasciencedojo.com/
@RMCCVO 5 years ago
Error in randomForest.default(m, y, ...) : NA/NaN/Inf in foreign function call (arg 1) In addition: Warning messages: 1: In data.matrix(x) : NAs introduced by coercion 2: In data.matrix(x) : NAs introduced by coercion
@Javigarcia222 4 years ago
Same here! Did you solve it? I think it identifies the Survived values (0s) as if they were NAs :( I don't know how to solve it.
@harshitadas7397 4 years ago
I came across the same error. To resolve it, when you read the train and test csv files, change the stringsAsFactors argument to TRUE. Hope this helps.
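A small sketch of what the suggested fix changes, assuming the error comes from text columns being read as plain character strings. Inline CSV text stands in for train.csv so the example runs on its own; `csv_text`, `as_char` and `as_factor` are illustrative names.

```r
# Inline CSV text stands in for train.csv (values are made up).
csv_text <- "PassengerId,Sex,Embarked,Age
1,male,S,22
2,female,C,38
3,female,S,26"

# Since R 4.0, read.csv() defaults to stringsAsFactors = FALSE,
# so text columns stay character unless asked otherwise.
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_char$Sex)    # character
class(as_factor$Sex)  # factor
```

Calling as.factor() on the individual columns after reading has the same effect for those columns.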
@amara1037 5 years ago
at 31:45 for output.df$Survived
@Javigarcia222 4 years ago
Error in randomForest.default(m, y, ...) : NA/NaN/Inf in foreign function call (arg 1) In addition: Warning messages: 1: In data.matrix(x) : NAs introduced by coercion 2: In data.matrix(x) : NAs introduced by coercion Anyone? I am sure there are no NA values in the variables selected...
@pallavikurulkar631 4 years ago
I am getting the same error. Could you please explain why this is happening?
@harshitadas7397 4 years ago
I came across the same error. To resolve it, when you read the train and test csv files, change the stringsAsFactors argument to TRUE. Hope this helps.
@ksoovibe7606 3 years ago
Sir, could you please explain why the passenger IDs changed in the submission file?
@Datasciencedojo 3 years ago
Hello, the submission file contains data on different passengers so that we can test our model. This is known as the train-test split and it helps us make sure our model is generalizable to data other than what we have trained on.
@ksoovibe7606 3 years ago
@Datasciencedojo Oh okay, it's clear now. Thank you so much, sir.
@Datasciencedojo 3 years ago
@ksoovibe7606 Keep following us for more content!
@deepakkannan7409 7 years ago
Hi features.equation
@moeketsimosia7290 6 years ago
Same here
@anuragkeshavaraju2169 7 years ago
Hello, all I get is this:
> tail(output.df)
    passengerId   Survived
413        1304 0.38337855
414        1305 0.04974327
415        1306 0.99573333
416        1307 0.01896443
417        1308 0.04974327
418        1309 0.54535266
and the Survived column in test is NA. Please explain.
@PitDriver 5 years ago
I get an error when rbinding titanic.test with train: "Error in match.names(clabs, names(xi)) : names do not match previous names". What do I do to fix this?
@PitDriver 5 years ago
I found the issue: I had misspelled Survived.
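For anyone hitting the same rbind error, a sketch of the combine step with tiny made-up data frames in place of train.csv and test.csv. rbind() on data frames needs the same column names on both sides, so the missing Survived column is added to the test set first.

```r
# Toy stand-ins for train.csv and test.csv; the test set has no Survived column.
titanic.train <- data.frame(PassengerId = 1:3, Age = c(22, 38, 26), Survived = c(0, 1, 1))
titanic.test  <- data.frame(PassengerId = 4:5, Age = c(35, NA))

# Add the missing column so both frames have identical names.
titanic.test$Survived <- NA

# Flag where each row came from so the sets can be separated again later.
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet  <- FALSE

# Any name listed here exists in train but not in test (a typo would show up).
setdiff(names(titanic.train), names(titanic.test))

titanic.full <- rbind(titanic.train, titanic.test)
```

Column names are case-sensitive, so "survived" and "Survived" count as different names.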
@cheddar-samosa5789 7 years ago
Hi, I am facing a minor issue. After this: titanic.formula
@deepakkannan7409 7 years ago
Hi, replace titanic.test with titanic.train. Hope it will work.
@deepakkannan7409 7 years ago
Yes, I remember after changing it, it worked.
@PraveenKumar-zh6gw 7 years ago
Nope, it didn't. Please tell me if there is any solution to this.
@bubblysudhi 7 years ago
Please check whether there are any missing values in your data before proceeding to random forest.
@VineelVishwanth 4 years ago
rf_model
@Vinit_Ambat 1 year ago
Thanks! I made my first ever submission on Kaggle for the Spaceship project and got 73% accuracy
@Datasciencedojo 1 year ago
Congratulations, Vinit. Stay tuned with us for more tutorials.
@Vinit_Ambat 1 year ago
@Datasciencedojo Thanks!
@spunkykid5898 7 years ago
Hi, I was unable to combine when using rbind, any idea of what could be causing the issue?
@bonifaceyogendran2294 7 years ago
I am having the same problem too!
@bonifaceyogendran2294 7 years ago
Try titanic.full
@rehanawajahathussain7642 6 years ago
I have the same problem. I tried to install package rbind, and it gives me a warning message.
@datamaster6955 4 years ago
@bonifaceyogendran2294 Thank you so much! You helped me.
@shaista2k1 4 years ago
The bind_rows function combines the training and testing datasets well, but after that I get an error whenever I use the titanic.full dataset.
@harshadakale2247 5 years ago
Can you upload a step-by-step project with data cleaning and missing values in R?
@uhsay1986 6 years ago
Why were the 2 missing Embarked cells assigned only to S and not to C or Q?
@kammelna 5 years ago
He assigned them to the most frequent value, which is S.
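A sketch of that "most frequent value" rule on a made-up Embarked column. The vector and the names `counts` and `mode_port` are illustrative, not taken from the video.

```r
# Toy Embarked column with two blanks, like the two missing cells in the question.
embarked <- c("S", "C", "S", "", "Q", "S", "", "C")

# Count each port, ignoring blanks, and pick the most frequent one (the mode).
counts    <- table(embarked[embarked != ""])
mode_port <- names(counts)[which.max(counts)]

# Fill the blanks with the mode.
embarked[embarked == ""] <- mode_port
table(embarked)
```

With only two cells missing, filling them with the mode changes the column's distribution very little, which is one reason it is a common default.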
@Fateaha 6 years ago
Why are we using the median here and not the mean?
@ananyajha451 5 years ago
My final Survived values are not a bunch of 0s and 1s; I am getting a bunch of probabilities instead. Any idea why?
@Datasciencedojo 5 years ago
It is likely that you have set type="prob" in the predict() function. Set this to type="class" and you will get a class outcome. Otherwise, your Survived values might be interpreted as numbers rather than classes represented as numbers. They need to be treated as factor levels/classes 0 and 1. Use str(data) to check if they are num or factors. If they are num, convert them into factors - data$Survived
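A sketch of the check and conversion described in the reply above, on a made-up training frame (the name `train` and the values are assumptions). It uses base R only and does not fit a model; the point is that randomForest treats a numeric response as regression and a factor response as classification.

```r
# Toy training frame; Survived arrives as numeric 0/1, as it does from a CSV.
train <- data.frame(Survived = c(0, 1, 1, 0), Age = c(22, 38, 26, 35))

str(train$Survived)  # reported as num, so a model would predict numbers

# Cast to factor so 0 and 1 are treated as two classes.
train$Survived <- as.factor(train$Survived)

str(train$Survived)  # now a factor with levels "0" and "1"
```

Do the cast before fitting the model; converting only the predictions afterwards does not change how the model was trained.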
@kumarsrajesh9078 6 years ago
Hi, please help me; I am not able to download the data set.
@shaista2k1 4 years ago
I get a "Need at least two classes to do classification" error while running the model. Please tell me how to resolve it.
@kgrocks5442 4 years ago
It's not allowing me to do rbind
@amoghhuilgol1317 7 years ago
I am new to machine learning. In a large dataset with lots of columns/attributes, how do we decide which attributes to choose for our random forest? In the video above, I could see that you have hardcoded the attributes. How can we choose attributes dynamically?
@Sociography00 7 years ago
It depends on your understanding of the data. You need to understand which variables/attributes are going to affect your response variable.
@amoghhuilgol1317 7 years ago
Akshay Nevrekar, but suppose my dataset has thousands of attributes. How would I choose the best ones? It would be practically impossible to know how those attributes affect the results. Also, understanding the attributes would mean making some assumptions, which could be false. Any thoughts on how to handle a large number of attributes?
@clinton11994 7 years ago
Hi, how does the R random forest classifier deal with a mix of categorical and quantitative data? Also, any suggestions on dealing with the same problem if we want to use other classification algorithms?
@aldofranco6764 6 years ago
I think that R converts them to dummies (there are 2 options; I don't know which one is used) and voila!
@nsovoshe 2 years ago
titanic.formula
@stevezhang9276 3 years ago
Can someone tell me why he combined the two datasets? Does that mean he changed the facts?
@Datasciencedojo 3 years ago
Hello Steve, we combine the datasets here so that we only have to apply pre-processing steps once and because we want to use the global median rather than the median of just the train.csv data. This is explained in the video around the 5-minute mark.
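A sketch of the "clean once on the combined data, then split" idea from the reply above. The frame is made up; `IsTrainSet` is the flag column mentioned elsewhere in the comments, and `age.median` is an illustrative name.

```r
# Toy combined frame; IsTrainSet marks which file each row came from.
titanic.full <- data.frame(
  PassengerId = 1:6,
  Age         = c(22, NA, 26, 35, NA, 54),
  IsTrainSet  = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Median over train and test rows together, skipping the missing values.
age.median <- median(titanic.full$Age, na.rm = TRUE)

# Fill the gaps once, on the combined data.
titanic.full[is.na(titanic.full$Age), "Age"] <- age.median

# Split back into the two sets using the flag.
titanic.train <- titanic.full[titanic.full$IsTrainSet == TRUE, ]
titanic.test  <- titanic.full[titanic.full$IsTrainSet == FALSE, ]
```

The known values are not altered; only the missing cells are filled.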
@stevezhang9276 3 years ago
@Datasciencedojo Thank you so much. Is there any other way to clean the data? By cleaning I mean filling in the missing values with better predictions.
@Datasciencedojo 3 years ago
Hello @stevezhang9276, our data mining tutorials can help you out. Go through tutorials 7 to 11: kzbin.info/www/bejne/nGixl4Jna9Fjarc
@kashishmalhotra221 5 years ago
The GitHub link is not working. Does anybody have the code?
@samsadbinzubair-j2d 10 months ago
Could you please upload the codes in the description section?
@Datasciencedojo 10 months ago
Please check the link provided in the description.
@neerajraut6473 6 years ago
Why random forest in particular? Why not any other classification technique?
@Datasciencedojo 6 years ago
You can use any classification algorithm instead. Random forest models are popular in industry and are fairly easy to understand because of the underlying non-parametric (decision-tree-based) learning approach. You can get similar results with different learning algorithms such as logistic regression, boosted decision trees, naive Bayes and so on.
@469jnishant 7 years ago
Hi, can you please tell me why we didn't convert the categorical variables into dummy vars? Also, is there a need to convert them to dummy vars here at all? If not, then why do some tutorials say to convert into dummy vars and then use
@vitoroliveira2933 3 years ago
I also had the same doubt. I know that when the categorical variable has only 2 categories, R does it automatically, but when it has more than 2, I believe it is necessary to do so.
@dipeshjoshi9093 7 years ago
Awesome tutorial. Thanks for sharing. :)
@4upranit 7 years ago
I am new to this. Can anyone explain why there are two data sets, e.g. train.csv and test.csv? Thank you.
@souravgames 7 years ago
One (train) is used to create a model that will give you the result. The other (test) is used to check whether the model predicts similar results when you do the same with the test data.
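A sketch of a 70/30 split, assuming the term asked about earlier in the comments means fitting on 70% of the labelled rows and holding out the other 30% for checking. The data frame, the seed and all names here are made up for illustration.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Toy labelled data standing in for train.csv.
labelled <- data.frame(id = 1:10, Survived = rep(c(0, 1), 5))

# Draw 70% of the row indices for fitting; the rest is held out for checking.
n.fit    <- round(0.7 * nrow(labelled))
fit.rows <- sample(seq_len(nrow(labelled)), size = n.fit)

fit.set     <- labelled[fit.rows, ]
holdout.set <- labelled[-fit.rows, ]
```

Kaggle's test.csv plays a similar role, except that its true labels are hidden, so a local holdout is the only way to score a model before submitting.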
@swati2793 6 years ago
That's a good video. However, I did not understand why we needed to convert the variables (pClass, Sex, Embarked) into factors through as.factor. Also, why did we convert Survived into a factor after splitting the two data sets here? I would be grateful if someone could explain it in a bit more detail. Many thanks :)
@umairansari87 5 years ago
Actually, these are categorical columns, so they need to be handled, since ML algorithms only work with numerical data. We have two options here: convert them to numeric codes, or to what we can informally call dummies. Dummies are basically 0s and 1s. Suppose you have Sex with the values Male and Female: we can create a column that is 1 when Sex is "Male", which is the same as writing Male in the column. What factor does here is create levels, one per unique value of the column. For the Sex column it has informally created 2 columns (formally, levels), one denoting "Male" and one denoting "Female", populated with 0s and 1s for the values.
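A sketch contrasting what a factor stores with what explicit dummy columns look like, on a made-up Embarked vector. model.matrix() is base R (stats package); as far as I know, R's randomForest accepts factor predictors directly, so the dummy step is shown for comparison rather than as a requirement.

```r
# Toy Embarked column cast to a factor.
embarked <- factor(c("S", "C", "Q", "S"))

levels(embarked)      # the distinct categories: "C" "Q" "S"
as.integer(embarked)  # the integer codes stored internally: 3 1 2 3

# Explicit dummy (indicator) columns, one per level; "- 1" drops the intercept.
dummies <- model.matrix(~ embarked - 1)
dummies
```

A factor is a single column of codes plus a label table, while dummies spread the same information over one 0/1 column per level.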
@amansingla5440 6 years ago
Very well explained! Thank you so much..
@kgrocks5442 4 years ago
as.formula is not working
@vamshikrishna6410 6 years ago
titanic.combine
@upasnasharma8149 7 years ago
Hi... After using rbind() to create titanic.full, I'm getting this: "Error in titanic.full$IsTrainSet : $ operator is invalid for atomic vectors". What should I do?
@Abhi92raj 7 years ago
Check the names of the columns, there must be an error
@mcpduk 7 years ago
What exchange rate did you use to convert the fare into $ from £ :)
@mcpduk 7 years ago
Seriously though, this is an EXCELLENT vid
@Datasciencedojo 6 years ago
Depends upon whether the date of disaster was before or after Brexit. Can you remind us, please? :)
@Datasciencedojo 6 years ago
Thanks :)
@lnasution76 7 years ago
thanks for sharing - great stuff
@prvns8586 7 years ago
too awesome...!!!!!! great help
@yashuverma3811 7 years ago
Hi, why did we add only 'S' for the missing values? Also, by mistake I added a lowercase s, so it now shows me C = 270, Q = 123, s = 2, S = 914. How can I remove the lowercase s now?
@ankitapradhan1855 7 years ago
Yes, I too have the same doubt: why did he put the NA values into the S category?
@ankitapradhan1855 7 years ago
Yes, I got the answer
@empuraan4710 6 years ago
What's the answer Ankita...
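On the stray lowercase "s" asked about in this thread: one possible repair, sketched on a made-up factor. Renaming a level to a name that already exists merges the two levels, so no rows need to be edited by hand.

```r
# Toy column where two blanks were filled with a lowercase "s" by mistake.
embarked <- factor(c("S", "C", "s", "Q", "S", "s"))
table(embarked)

# Rename the stray level to the existing one; R merges the duplicates.
levels(embarked)[levels(embarked) == "s"] <- "S"
table(embarked)
```

If the column is still a character vector rather than a factor, a plain assignment such as `x[x == "s"] <- "S"` does the same job.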
@TheMarinho1 6 years ago
Nice video! But I get an error for nodesize = 0.01 * nrow(iris.train) stating "Error in nrow(x) : argument "x" is missing, with no default". Can anybody help me?
@lofimixradio 6 years ago
I think it should be titanic.train; iris is a sample data set, check ?iris (Edgar Anderson's Iris Data).
@biseul 6 years ago
This line : Titanic.model
@davidluong98 6 years ago
I could not get it either. Did you figure it out?
@Adinasa2 6 years ago
Raphael Pacheco titanic.model
@castilloberroafrancisco2914 2 years ago
wow great that's good
@Datasciencedojo 2 years ago
Glad you liked it, keep following us for more data science tutorials.
@taiwankyh 4 years ago
Nice video
@akashvp5262 6 years ago
"Error in tail(titanic.train) : object 'titanic.train' not found". I am getting that error.
@septiandani8238 6 years ago
Maybe you stored it with a typo?
@akashvp5262 6 years ago
Septian Dani okay
@cybern9ne 7 years ago
I doubt you should manipulate or clean the test data. The test data should be treated as if it were data that only arrives after modeling, which you would never have had the opportunity to use.
@Datasciencedojo 7 years ago
Hey cybern9ne! True, and that is definitely what is taught in the textbooks! However, in the business world, where every bit of performance matters, you want to grab as much information as possible. As long as you do not have access to the true labels, you want to use as much data as you can get, to account for possible categories that exist in the test set but not in the train set, new missing values that exist in the test set but not in the train set, etc.
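A sketch of the "category in test but not in train" case mentioned in the reply above, with made-up data. Building one shared set of levels from both sets is one way to keep the factors compatible; the names `train`, `test` and `all.levels` are illustrative.

```r
# Toy sets: the test set contains a port ("Q") that never appears in training.
train <- data.frame(Embarked = c("S", "C", "S"), stringsAsFactors = FALSE)
test  <- data.frame(Embarked = c("Q", "S"),      stringsAsFactors = FALSE)

# Build one shared level set from both sources, then apply it to each frame.
all.levels     <- sort(unique(c(train$Embarked, test$Embarked)))
train$Embarked <- factor(train$Embarked, levels = all.levels)
test$Embarked  <- factor(test$Embarked,  levels = all.levels)

levels(train$Embarked)
levels(test$Embarked)
```

Only the feature columns are shared here; no test labels are involved, which is the condition the reply attaches to this practice.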
@Arecapalm24 8 years ago
Great tutorial!
@pzmt8051 2 years ago
Could this be done in Python?
@Datasciencedojo 2 years ago
Hello, yes, this can be done in Python.
@minhaaj 4 years ago
Share the R code also
@ChrisR410a 8 years ago
So what was the point of creating the features.equation?
@Datasciencedojo 7 years ago
Hey Chris, we never ended up using it. It's a bad habit that I have from data mining in other languages where you have to define the features. R automatically finds the features the model was trained on, which is convenient.
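For readers asking about the formula step and as.formula, a sketch of how a model formula can be assembled from column names. The names `predictors`, `rhs` and `survival.formula` are hypothetical and are not the variables from the video; the column names are those of the Kaggle Titanic CSV.

```r
# Predictor columns as named in the Kaggle Titanic CSV.
predictors <- c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")

# Paste the pieces into "Survived ~ Pclass + Sex + ..." and convert to a formula.
rhs              <- paste(predictors, collapse = " + ")
survival.formula <- as.formula(paste("Survived ~", rhs))

survival.formula
all.vars(survival.formula)
```

as.formula() expects a single character string (or a formula); passing an object that has not been created yet is a common cause of errors at this step.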
@Sixsigma-acad 1 year ago
Thank you
@Datasciencedojo 1 year ago
Stay tuned with us for more tutorials, Jonathan.
@castilloberroafrancisco2914 2 years ago
Wow, great
@lidaabdollahi470 6 years ago
great!
@girishahb01 7 years ago
Thank you so much
@tws061105 7 years ago
Does this video display as blurry for anyone else?
@tws061105 7 years ago
NVM - just updated the quality and it's fine now - THANKS!
@gimpycoder8548 5 years ago
Take a shot every time he does that annoying click noise.