You're the best! My RMSE dropped dramatically after watching your video, from 0.20 to 0.14. Thanks to XGBoost and you! Thanks a lot!
@shivalikapatel72224 жыл бұрын
I'm new to the data science field. Could you please guide me a little on how we decide to choose XGBoost instead of something like linear regression?
@rajeevmayekar17753 жыл бұрын
@@shivalikapatel7222 Check Krish's video: search "EvalML Krish Naik" on YouTube. It covers which algorithm is best for the model; EvalML helps with that.
@Dynamite_mohit4 жыл бұрын
Here is a confession: I didn't like your videos earlier. But this series, and the way you explained things with the why and how, is awesome. Thanks a lot for this; it helped me clear multiple doubts. And a note to the viewers: I have never commented on a YouTube video before, but this guy deserves it.
@chetannalinde14415 жыл бұрын
Thanks a ton, Krish. This is exactly what I was looking for to start with Kaggle. Keep the awesome work coming; looking forward to more such videos.
@xpabhi5 жыл бұрын
Liked the way you explained the problem statement and the feature engineering part. I am pursuing a data scientist career and have great interest in ML techniques. It's always a pleasure to watch you.
@harshitahluwalia84434 жыл бұрын
At 18:08, if you want to see all those values, do this: pd.set_option('display.max_rows', None). Then run df.isnull().sum() again and you will be able to see all the values.
@manojkuna3962 Жыл бұрын
Thanks for the information
@whatitgoingtoB4 жыл бұрын
Thanks Krish, much appreciated. Not slow or time-consuming; it's everything a beginner needs, without wasting time... Loved it.
@shivadigitalsolutionsandam562 жыл бұрын
Brother, you explain really well. I watched at least 10 videos on the same problem, but your video is the good one.
@sandrafield98134 жыл бұрын
Hey thanks for this!! You're a great teacher. You really helped me parse through some things in my machine learning class. Sadly, I almost pushed ctrl-enter to enter this comment. lol
@MrJaga1215 жыл бұрын
Great work Krish . Thank you very much for explaining line by line .
@nabeelpm48945 жыл бұрын
Thanks again, Krish. Love your passion and humility. You are giving such valuable knowledge; indeed, sharing knowledge is the best thing in the world. Thank you so much, brother.
@samudragupta7195 жыл бұрын
All I can say is this is one of the best explanations I've ever gone through! This must go on... ❤
@codingfun9154 жыл бұрын
Instead of writing out the whole big list of categorical columns, we can do this:

c = data.columns
categorical = []
for a in c:
    if data[a].dtypes == object:
        categorical.append(a)

and categorical will be the same list as the columns list in the video. Btw, amazing video, just loved it.
@harikaepuri93374 жыл бұрын
Very neat and detailed explanation sir. Thank you very much for making me understand the whole project and how to participate in Kaggle competition. Looking forward to more such Kaggle competition videos sir.
@sabyasachighosh98475 жыл бұрын
Krish... you're doing a great job... Good to learn from you...
@sauravsrivastava23532 жыл бұрын
This video was really helpful for me because I am just a fresher in the data science world and I didn't even know how to deal with such real-world data science problems. So thanks, Krish sir, for this kind of video; please make more videos on other Kaggle competitions.
@rajinirox2 жыл бұрын
Around 22:07: if you want to check for yourself whether the number of categories actually differs between the training and test datasets, use this:

for column, col in zip(df, test_df):
    print(len(df[column].value_counts()), len(test_df[col].value_counts()))

This will print the number of categories column-wise for the training and test datasets.
@akshayjadhav22134 жыл бұрын
Very nice, dear sir. You explain from the basics, which is the necessary and important thing one should do. I had heard about Kaggle competitions, but today I understood how they work. Thanks a lot, and keep encouraging us.
@abhimanyutiwari1005 жыл бұрын
Nice. Coincidentally, I too completed this advanced regression Kaggle problem yesterday.
@RahulVarshney_4 жыл бұрын
How to extract all the categorical features:

features = df.select_dtypes(include=['object']).copy()

This will give a dataframe of all the categorical features. To extract the column names, we write:

categorical_features = features.columns 😊
@Omarismail-vs4jl4 жыл бұрын
you are a life saver
@astridbrenner29573 жыл бұрын
df.columns ?
@RahulVarshney_3 жыл бұрын
@@astridbrenner2957 df.columns will give you the original dataframe columns while features.columns will give you categorical columns
@astridbrenner29573 жыл бұрын
@@RahulVarshney_ Thank you Rahul. I'm new
@RahulVarshney_3 жыл бұрын
@@astridbrenner2957 You can go through the many tutorials on dataframes and play around with them.
@rakesh2you4 жыл бұрын
Thanks for this video. It helped me make a submission to Kaggle and understand what else goes into a data science project.
@oleholeynikov86592 жыл бұрын
it is my exam project. Thanks a lot for the video!!!!
@BiancaAguglia5 жыл бұрын
Nice job, as usual, Krish. 😊 One note about accuracy: I recently heard a data scientist at Netflix say that some of the models that win competitions on Kaggle are too complex and too impractical to be put into production. So our job is to find a balance between accuracy and usability. I thought that was interesting. 😊
@whoknows89925 жыл бұрын
Yeah, that's true!
@saurabhtripathi624 жыл бұрын
yes
@eyabaklouti90664 жыл бұрын
Can you give us the name of the documentary, please?
@mohammedfaisal67144 жыл бұрын
If increasing the accuracy by about 0.5% demands a lot of computational effort, companies will not be interested in such investments. I agree with your point, @Bianca A.
@BGrovesyy4 жыл бұрын
Exactly; it's very important to consider overfitting if the objective is deployment. An overfitted model will not respond well to new data and so will not be suitable for the "real world".
@akshayjhamb10225 жыл бұрын
Thanks Krish for the video. Keep them coming: more solutions to Kaggle competitions, please!
@gaganlohar55175 жыл бұрын
Thank You, Krish, you are doing a great job. Very nice video.
@pradipkaushik65835 жыл бұрын
Thank you so much Sir, you have explained these topics with so much ease. Keep posting such excellent videos.
@asdubey0074 жыл бұрын
Thanks a lot, sir, for making my concepts crystal clear... thank you so much for all this effort... keep doing the awesome work.
@vishal567655 жыл бұрын
Loved it... Try to upload follow-up videos like the one you did on hyperparameter tuning, so that we understand how to iterate on the same problem to get better results. Slowly make this problem a complete series using the same dataset.
@geekydanish59905 жыл бұрын
Great start, man. Hope to see more videos soon.
@PradeepSingh-gh1jp5 жыл бұрын
You made a big mistake here. If two or more features have identical category names, doing final_df = final_df.loc[:, ~final_df.columns.duplicated()] will actually create problems. You should have done pd.get_dummies(final_df[fields], drop_first=True, prefix=fields) to avoid this, but you did pd.get_dummies(final_df[fields], drop_first=True). The prefix is very important to distinguish each category by its associated feature. The rank you achieved here doesn't make sense after that, but the knowledge you shared is awesome. Thank you.
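To make the prefix point concrete, here is a toy sketch (the column names and values are made up, not taken from the video's dataset):

```python
import pandas as pd

# Two hypothetical categorical features that happen to share category names
df = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Gd"],
    "BsmtQual":  ["Gd", "Gd", "TA"],
})

# Without a prefix, each feature's dummies are named after the bare category,
# so both features emit a column literally called "TA"; de-duplicating by
# column name would then silently drop one feature's information
no_prefix = pd.concat(
    [pd.get_dummies(df[c], drop_first=True) for c in df.columns], axis=1
)

# With the default per-column prefix, every dummy name stays unique
with_prefix = pd.get_dummies(df, drop_first=True)

print(list(no_prefix.columns))    # ['TA', 'TA']  <- duplicate names
print(list(with_prefix.columns))  # ['ExterQual_TA', 'BsmtQual_TA']
```

The duplicates the video removes are therefore an artifact of dummifying each column without a prefix, not genuine duplicate data.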
@pranavkirdat81925 жыл бұрын
Keep up the good work. Your channel will grow.
@greenshadowooo Жыл бұрын
Very useful, thanks for sharing! 😍😍😍
@CharzIntLtd3 жыл бұрын
Thank you very much Mr Krish you have given me a clear start
@dennisbesseling92673 жыл бұрын
in the description file it says that N/A values should be considered as a value for the absence of the feature. So if there is a null value in any of the basement columns this means that the house doesn't have a basement and so on..
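Following that reading of the description file, a sentinel category can be used instead of the mode; a minimal sketch with a hypothetical basement column (the sentinel name "NoBasement" is my own choice):

```python
import pandas as pd

# NA here means "no basement" per the data description,
# so fill with an explicit category rather than the mode
df = pd.DataFrame({"BsmtQual": ["Gd", None, "TA"]})
df["BsmtQual"] = df["BsmtQual"].fillna("NoBasement")
print(df["BsmtQual"].tolist())  # ['Gd', 'NoBasement', 'TA']
```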
@rakeshkumarrout26294 жыл бұрын
Krish, this will be really useful for decades to come...
@bivekyadav08 Жыл бұрын
This man is always there to help🙌 Thanks 🥺❤🙏
@shailendraverma7615 жыл бұрын
Hi Krish. Thanks for such a nice explanation. I was able to follow along easily and tried a few other things on the data, which resulted in a rank of 2513.
@Prajwal_KV4 жыл бұрын
How are you dividing the training and test datasets?

df_train = final_df.iloc[:1422, :]
df_test = final_df.iloc[1422:, :]

How did you know it was 1422? How do you calculate it?
@kalyanprasad40694 жыл бұрын
Hello Krish, it would be really helpful if you did a video on how to step into Kaggle competitions: what are the basic things one should be aware of before entering? Thanks for understanding. Sincerely.
@o_rod89544 жыл бұрын
Thanks for the video. You make it easy to understand and follow!
@vijayanarayanan34254 жыл бұрын
Hi Krish, it was so nice listening to your video....thoroughly enjoyed.......
@spamaccount15132 ай бұрын
15:14 isn't that data leakage? He imputed mean values using the mean from the test df
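One way to avoid the leakage raised here is to compute the fill statistic on the training split only and reuse it on the test split; a minimal sketch with made-up LotFrontage values:

```python
import pandas as pd

# Made-up values for illustration
train = pd.DataFrame({"LotFrontage": [60.0, None, 80.0]})
test = pd.DataFrame({"LotFrontage": [None, 70.0]})

# Fit the statistic on the training split only...
train_mean = train["LotFrontage"].mean()  # 70.0

# ...then apply that same value to both splits
train["LotFrontage"] = train["LotFrontage"].fillna(train_mean)
test["LotFrontage"] = test["LotFrontage"].fillna(train_mean)
```

scikit-learn's SimpleImputer follows the same fit-on-train, transform-on-test pattern.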
@shashankvm5 жыл бұрын
You are my role model brother...I want to be like you :)
@Amrrkevin5 жыл бұрын
Please continue/complete the Deep Learning series. We are waiting for those videos very eagerly.
@VivekKumar-li6xr5 жыл бұрын
Yes please... even I am waiting for the same. But this too is very informative. Thanks a lot, Krish, for making this video out of your busy schedule.
@BhartiDeepak5 жыл бұрын
First of all, thanks for the video; it is very informative. I am new to data science, so this could be a novice thing to ask; however, one part I wanted to point out is that you are combining the train and test data for modelling. From what I have learnt, we should never combine train and test data for training, as the model will then not predict well on data it has not seen. Please correct me if I am wrong.
@mohammed.dawood_4 жыл бұрын
He didn't feed the combined dataset into the model; he only gave the training part as input, i.e. the first 1422 rows. The only reason for combining the training and test data was to create dummy variables. That could have been done separately, but he mentioned that for some attributes the training data had 3 categories and the test data had 4, so making dummy variables separately would have created a discrepancy in the number of columns.
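The combine-dummify-split pattern described above can be sketched like this (toy stand-ins for the video's frames; 1422 in the video is simply the number of training rows):

```python
import pandas as pd

# Toy data: the test split has a category the train split never saw
train_df = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"], "SalePrice": [200, 150, 180]})
test_df = pd.DataFrame({"MSZoning": ["RL", "C"]})

# Concatenate so get_dummies sees every category once, giving both
# splits an identical dummy-column layout
final_df = pd.concat([train_df.drop(columns=["SalePrice"]), test_df], axis=0)
final_df = pd.get_dummies(final_df, drop_first=True)

# Split back at the training row count
n_train = len(train_df)
df_train = final_df.iloc[:n_train, :]
df_test = final_df.iloc[n_train:, :]
```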
@maheshwarang20085 жыл бұрын
Thank you so much, sir. You are doing a superb job for those of us who want to get into machine learning.
@mohitkeshwani4565 жыл бұрын
This really helps... Thanks a lot for making these types of videos.
@aryamahima35 жыл бұрын
Very, very helpful videos... your efforts are highly appreciated.
@kevinmartinezperez4111 Жыл бұрын
Man, what a good video, thank you very much. Greetings from Peru!
@Marcel-f1 Жыл бұрын
In summary: the machine learning engineer is making a "guess" about the dataset he is working on, and performing multiple repetitive tasks like filling null values, feature extraction, and other work that could be automated.
@nandalal-dev60954 жыл бұрын
Sir, not every feature requires one-hot encoding. For example, for the feature LotShape we have values like Regular, Slightly Irregular, Moderately Irregular, and Irregular; we can do label encoding for these values ([1, 2, 3, 4]).
@AKHILESHKUMAR-nk2rk4 жыл бұрын
Is it an ordinal categorical feature?
@AKHILESHKUMAR-nk2rk4 жыл бұрын
yes pls do it
@colinodwanny3022 Жыл бұрын
I thought you weren't supposed to use a label encoder on the X variables? I don't have a lot of experience; that is just what I have heard.
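The ordinal encoding suggested in this thread could be sketched with a hand-written mapping (the Reg/IR1/IR2/IR3 codes follow the competition's data description; the 4..1 numeric scale is an assumption about the ordering):

```python
import pandas as pd

# Ordinal mapping: Regular > Slightly > Moderately > Irregular (assumed)
lot_shape_order = {"Reg": 4, "IR1": 3, "IR2": 2, "IR3": 1}

lot_shape = pd.Series(["Reg", "IR1", "IR3", "Reg"])
encoded = lot_shape.map(lot_shape_order)
print(encoded.tolist())  # [4, 3, 1, 4]
```

An explicit mapping like this, unlike scikit-learn's LabelEncoder, lets you control which category gets which rank.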
@Zeba_Sayyed5 жыл бұрын
Thank you so much, sir... your videos are so helpful.
@aktharm13175 жыл бұрын
Great!! Good work!
@abhileshm72165 жыл бұрын
Thank you for this initiative... Can you please explain in detail: what are the different category levels in the train and test data, why we concatenated them, what duplicates were created when we dummified, and why we removed those duplicates in this example? Please explain, sir.
@kadhirn47924 жыл бұрын
Good techniques. I have learnt a lot from this thank you so much
@tonyt63793 жыл бұрын
Thanks. Great work! Could you explain where the duplicate columns come from at 26:23 ? I don't understand why you get these from one hot encoding the train and test set together.
@ashutoshsalekar18103 жыл бұрын
There are some columns which have the same categories as other columns. When creating dummy variables, each new column gets the name of its category, so multiple columns get created with the same name. E.g.:

column1 = ['yes', 'no', 'neutral']  # 3 categories
column2 = ['yes', 'no']             # 2 categories

When we apply get_dummies to these columns, it creates columns named yes, no, neutral, yes, no: two columns named 'yes' and two named 'no'.
@ritika27083 жыл бұрын
@@ashutoshsalekar1810 Yeah, but shouldn't they be col1_yes and col2_yes? We need the values of both attributes; if we only have columns named 'yes' and 'no', how will we know which attribute each belongs to? A simple pd.get_dummies(df) would have done that. I don't understand why we need such a complicated method. I'm getting 275 distinct columns this way, and I'm not sure how only 175 columns can serve the purpose.
@rohitbharti93604 жыл бұрын
Thank you so much..... It is very helpful 🙂
@niteshsoni53795 жыл бұрын
Great job sir..
@ShubhamGuptaGgps4 жыл бұрын
What do I do about this? On writing df.isnull().sum(), my Jupyter notebook doesn't show the complete output; it abbreviates the middle with dots instead of a scroll bar:

df.isnull().sum()
Out[13]:
Id               0
MSSubClass       0
MSZoning         0
LotFrontage    259
LotArea          0
...
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 81, dtype: int64
@ShubhamGuptaGgps4 жыл бұрын
Got it:

nulls = df.isnull().sum().to_frame()
for index, row in nulls.iterrows():
    print(index, row[0])
@faisalkhan-oo5jd3 жыл бұрын
Great videos! Thanks a lot. But in the property description text file it is written, for some features, that NA means the feature is not present rather than that the data for that column is missing. Does anyone else agree that we shouldn't treat all the columns with the mean and mode?
@PrasadHonavar5 жыл бұрын
Excited for your next Kaggle video.
@isingcauseimliving4 жыл бұрын
Hi Krish. It would have helped if you had read the description of the pricing data: why some features were chosen, or why, for example, area = f(length × breadth) yet both are given separately. We could have created features ourselves, so I would have liked to see why you removed any of the features. I'm just a noob learning ML, so allow me to question. Can you also take it that the reason "Fence" has so many null values is that very few houses actually have fences? However, the houses which do have fences are, by intuition, costly houses. In that case, shouldn't we consider what those entries mean rather than just treating them as null points? For example, a big mansion would have a fence, yet there are not many mansions in the training set; that doesn't mean we exclude mansion prices from our solution. We would need to go over all 81 features and determine, with intuition, what the real-life scenario could be, rather than just thinking of the data as "null points". Please let me know if I am right or wrong. Thank you.
@eanamhossain11565 жыл бұрын
Thanks, brother, for this video. Please continue and upload more videos.
@pembasherpa32403 жыл бұрын
Very helpful! Thank You
@jeremyheng85732 жыл бұрын
Thank you! Good tutorial!
@liiinx_com4 жыл бұрын
Hey Man, Thanks for the Video!
@sogolgolafshan78433 жыл бұрын
When I type final_df.shape, I get a 'NoneType' error. Can you help me figure out what I should do?
@Peter-ns6jg2 жыл бұрын
this helped me a lot. thanks
@Prajwal_KV4 жыл бұрын
Sir, how are you dividing the training and test datasets?

df_train = final_df.iloc[:1422, :]
df_test = final_df.iloc[1422:, :]

How did you know it was 1422? How do you calculate it?
@shivambansal19933 жыл бұрын
I have the same question.
@naageshk12567 ай бұрын
Thank you sir ❤🎉
@chetanmazumder3104 жыл бұрын
You are GREAT sir .
@abhinavshrivastava46374 жыл бұрын
I want to evaluate my model with a confusion matrix, but the test data doesn't have the 'SalePrice' column, so if I run confusion_matrix(y_test, y_pred) I don't have y_test. What should I do? Please suggest.
@AKHILESHKUMAR-nk2rk4 жыл бұрын
A confusion matrix is for classification problems, right?
@atulpandey19795 жыл бұрын
Excellent..!!
@samratkishore46685 жыл бұрын
Sir, I think you have to fix the outliers in the dataset... that could improve your predictions, sir.
@krishnaik065 жыл бұрын
Yes, that is the plan; I will be doing that in my next video. This is just the beginning.
@samratkishore46685 жыл бұрын
@@krishnaik06 Thank you for your response, sir... awaiting more updates from you, sir!!! Always with love ❤️
@santosharavind28875 жыл бұрын
@@krishnaik06 Thank you for clearly explaining each and every step. I would like to know when you are going to post the continuation video, so we can complete one entire project. Thank you in advance.
@anandacharya99195 жыл бұрын
Super, but we can merge the test and train data first and then do the feature engineering, to avoid double work.
@krishnaik065 жыл бұрын
Yes we can
@hafizmfadli4 жыл бұрын
Nice video, thank you for sharing it.
@babayaga6265 жыл бұрын
Hello Sir, can you please discuss the parameters and values for the XGBoost classifier, and also how we get the best parameter values using cross-validation?
@cusematt234 жыл бұрын
I'm pretty sure you're deleting perfectly good columns when you're removing "duplicates". For example, the categorical features can come in conditions of "good", "bad", and "excellent", and this can apply to the garage, basement, or attic. Since you didn't use a prefix for get_dummies, you now have three columns each named "good", "bad", and "excellent", and you delete two of each when you "remove duplicates". They were never duplicates; they simply weren't intelligently named. If you use a prefix equal to fields, there are no duplicates. Logically, why would there be duplicates in the first place? It doesn't make sense that there would be duplicate columns after applying get_dummies when there weren't duplicate columns before applying it!
@nickey02073 жыл бұрын
I agree.
@ritika27083 жыл бұрын
Exactly, I have the same doubt; the columns should be like garage_good and attic_good. As I am a newbie in this field I thought I might be missing something, so I looked through the comments to see if anyone else felt the same way. Also, he used drop_first=True, which removes the first category; I'm not sure why he did that. I'm getting around 275 columns if I apply encoding to all the categorical columns, but I believe we could convert lots of them into a simple 1, 2, 3 ranking.
@kdevendranathyadav5087 Жыл бұрын
Bro, what is the final output of this project?
@victormayowa7989 Жыл бұрын
But duplicate checks go across observations unless they are restricted to a column.
@Nitsoney5 жыл бұрын
Hi Sir, thanks for uploading these videos and training us with such good content. I didn't get the one-hot column part: why was it done?
@abhishekbajiya33324 жыл бұрын
Hey Krish, why didn't you use OneHotEncoder and ColumnTransformer to encode the categorical variables?
@kristiangohibi73262 ай бұрын
Thank you for your video
@chandrakanthshalivahana86165 жыл бұрын
Sir, why did you drop GarageYrBlt with df.drop(['GarageYrBlt'], axis=1, inplace=True) when it has only 81 null values?
@selvaprabu38785 жыл бұрын
Superb, sir... great. I have two doubts, sir: 1) While treating skewness and normality, is it required to check all the columns (IVs) or only the DV? I am a beginner in data science (the linear regression assumptions don't say anything about the IVs); can you explain clearly, sir? 2) For feature selection, do I have to check each column separately, or can I use an automatic selection method? Manually, for numerical features it's based on correlation; what about categorical features, do I use a chi-squared or ANOVA test, or something else? I know the theory, but how does it work in real-time projects? (Some datasets have 400 columns.)
@thejswaroop52303 жыл бұрын
Thank you it was helpful
@MrPriti9995 жыл бұрын
great one
@edzhem3 жыл бұрын
Hey Krish, thanks for your effort! Just one quick question: the data is heterogeneous, isn't it?
@hiw924 жыл бұрын
Great video
@bibhasgiri5275 жыл бұрын
Thank you.. This is really helpful.
@devathimahesh80075 жыл бұрын
Nice video
@Akshat.agr135 жыл бұрын
Why are we dropping some records in the test and train data? The null values should already have been removed after applying the mean to numeric features and the mode to categorical features. (The heatmap generated the second time still shows null values for some features before you execute that step.)
@krishnaik065 жыл бұрын
Because at the end there were some features which had very few NaN values, so I thought of dropping them.
@zeuspolancosalgado33853 жыл бұрын
Kaggle Competition Link: www.kaggle.com/c/house-prices-advanced-regression-techniques/overview Original Dataset Link: jse.amstat.org/v19n3/decock.pdf
@himanshusahoo1435 жыл бұрын
Hi sir, just one question: why did you apply get_dummies rather than label encoding? get_dummies creates a huge number of columns, and then you remove duplicate columns. Can't we simply apply label encoding? Please advise.
@sahil-74734 жыл бұрын
Hello Sir. Thanks for showing a walkthrough of this problem. I have a suggestion: in the test data, why are you replacing NaNs with values computed from the test data itself? This is wrong. In the test data, you should replace NaNs with values computed from the train data; the test data should be treated as hidden. For example, say I trained a model on train.csv and I feed it a single query whose features may be null or have values. To fill a null for that one query, you don't have a bunch of test data available, right? But you do have the train data, so nulls should be replaced with statistics computed on the train data. Thanks.
@muzamilshah80285 жыл бұрын
very nice work
@bars-qt9yi5 жыл бұрын
Hi sir, nice work, but please make your videos in an ordered way or in sequence; we don't know how to start from scratch.
@BiancaAguglia5 жыл бұрын
Krish has several playlists organized by topic. Pick a playlist and try to go through it in order; I think you'll find that helpful. 😊 Another idea is to make your own list for the thing you're trying to learn. For example, let's say you're trying to learn statistics. Google "statistics for data science" and create a list of things you need to learn. Then come back to YouTube and search for each topic in order. This is your chance to start thinking like a data scientist. Data scientists solve problems. Think of this (i.e. finding a good sequence to learn data science) as your first data science problem. 😊 Don't get discouraged. It takes a while to become good at data science. Learning it is like going to a completely foreign city: at first you feel lost but, if you explore for a while, you'll be able to easily find your way around. Best wishes to you. 😊
@bars-qt9yi5 жыл бұрын
@@BiancaAguglia Love you, man, love you! Can you help me become a data scientist? Please tell me the full path or a YouTube channel. Your thinking and mind are very powerful. Love you, and please reply!
@BiancaAguglia5 жыл бұрын
@@bars-qt9yi Please remember that your goal is to be a problem solver and ask for as little help as possible. 😊 I know you feel overwhelmed at first though, so here is a quick path to get you started (look these up on YouTube): 1. Python tutorial for beginners. 2. Statistics tutorial for beginners. 3. Pandas tutorial for beginners. 4. Matplotlib tutorial for beginners. 5. Scikit-learn tutorial for beginners. 6. Sign up for a Kaggle account and look at the Titanic challenge; that's Kaggle's best-known challenge and you'll find plenty of help for it. Give yourself 2 to 6 months to go through these steps, and practice solving the problems in the tutorials on your own. 7. Start taking advanced statistics. 8. Start learning calculus. 9. Start learning linear algebra. By the time you're done with this list you'll know enough about data science to start carving your own path. 😊 Remember Henry Ford's words: "Whether you think you can, or you think you can't - you're right." 😊 Start thinking that you CAN be a data scientist, and then work hard and smart to become one. 😊 Best wishes.
@bars-qt9yi5 жыл бұрын
@@BiancaAguglia Love you, God bless you. How can I contact you?
@bars-qt9yi5 жыл бұрын
@@BiancaAguglia And is Krish's first playlist about machine learning or about data science?
@vivianjoseph8222 жыл бұрын
Thank you so much, brother!!
@Girish05125 жыл бұрын
Before getting into it, what basic preparation should one work on?
@preetdahiya30124 жыл бұрын
I am getting an error: I converted a float column to int32, but when saving the CSV file the 'Id' column's data type came out as int64, which caused an error on Kaggle... does anyone know how to overcome this? Please help me if you know.
@datasciencetoday71272 жыл бұрын
At 27:58: if you drop SalePrice there, XGBoost will give a feature-mismatch error.
@ibrahimkuru4502 жыл бұрын
Sir, I am getting this error. How can I solve it?
@ashutoshkumar28344 жыл бұрын
I have one question. Let's say I have two dataframes, train_df and test_df, both with the same columns (column_A, column_B, column_C), and both have some missing values. If I drop column_A in train_df, is it mandatory to drop column_A in test_df as well?
@wise15694 жыл бұрын
Yes
@anoopk46593 жыл бұрын
GarageCars is a categorical variable, but the mean is used for fillna.
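If GarageCars is treated as a discrete count as this comment suggests, filling with the mode instead of the mean might look like this (toy values, not the real dataset):

```python
import pandas as pd

# GarageCars is a discrete count, so the mode keeps the fill value realistic
# (the mean could produce a fractional number of cars)
df = pd.DataFrame({"GarageCars": [2.0, None, 2.0, 3.0]})
df["GarageCars"] = df["GarageCars"].fillna(df["GarageCars"].mode()[0])
print(df["GarageCars"].tolist())  # [2.0, 2.0, 2.0, 3.0]
```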
@ApurvaMishra93 жыл бұрын
Hi Krish! Thank you so much for this ML intro to Kaggle via house price prediction. I am a novice in the field and have a doubt; I'd be grateful if you could help me out. In theory, isn't the test data supposed to be left untouched? How can we perform preprocessing on new, unseen data without violating that concept?
@adityadwivedi91592 жыл бұрын
Bro, assume that he first preprocessed the train set by creating two functions and passing the train data to them: 1) a NaN handler, which handles all the NaN values; 2) a category handler, which handles all the categorical features. Now, if we don't use these two functions on the test set as well, we may still have null values and categorical features in the test data, and in that case the ML algorithms won't work on it and can't predict. So preprocessing the test set is also required.