I enjoyed this and found it instructive to follow along. I appreciate your quick-paced yet somehow unhurried teaching style.
@paulushimawan5196 3 years ago
Yes, that's the reason I like this video. Nice teaching style, although it's light on depth. Still, it's much better than those courses out there that just hand you the notebook and make you run it yourself, without explaining each step.
@onyedikachiadigwe8995 4 years ago
Can you show how we would predict which staff will leave, using the database?
@kazimrazatalpur7228 4 years ago
Amazing, very informative; can't wait to see your upcoming tutorials.
@shyamkishore6232 a year ago
How do you draw conclusions from the whole coding process for a presentation? For example, which of the columns/factors affect Attrition the most?
@QUIZ_WHIZ_SMART 4 years ago
This is a good, straightforward model-training walkthrough for beginners, but the model is weak. Especially for this problem: if you are modelling employee attrition, you want to know who will quit the job so you can perhaps contact them, not the other way around. It would probably be better to choose another metric, such as recall or the F1 score.
@ComputerSciencecompsci112358 4 years ago
You can never have enough metrics.
@cloudbaud7794 4 years ago
This has a recall of about 15%... how is that any good? And in this case, the cost of an employee leaving unpredicted can never be the same as falsely flagging someone who ends up staying. So does F1 really add that much value?
@SANJIVRAI6693 3 years ago
@@cloudbaud7794 The goal would be to reduce the false negatives as much as possible, so the higher the recall, the better.
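For anyone who wants to check those metrics themselves, here is a minimal sketch using scikit-learn's standard metrics; the names forest, X_test and Y_test are assumed to be the fitted model and test split from the video:

    from sklearn.metrics import classification_report, recall_score, f1_score

    # Predict on the held-out test split
    Y_pred = forest.predict(X_test)

    # Per-class precision, recall and F1 in one table
    print(classification_report(Y_test, Y_pred))

    # Recall and F1 for the positive ("Yes" = 1) class only
    print("recall:", recall_score(Y_test, Y_pred))
    print("f1:", f1_score(Y_test, Y_pred))

A high accuracy combined with a low recall on the Yes class is exactly the imbalance problem discussed in this thread.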
@2lauren54 a year ago
At df.corr(), why do I get this? (FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. df.corr()) And then an error on every cell after that.
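In case it helps: newer pandas versions no longer silently drop the text columns in corr(), which is what that warning (and the later error) is about. A minimal sketch of two ways around it, assuming df is the attrition DataFrame from the video:

    # Option 1: tell corr() to use only the numeric columns
    corr = df.corr(numeric_only=True)

    # Option 2: keep only the numeric columns first, then correlate
    corr = df.select_dtypes(include="number").corr()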
@sukruthms7984 3 years ago
Thank you for the appropriate explanation
@ComputerSciencecompsci112358 3 years ago
Glad you enjoyed the video!
@SaiCharan-zi1zu 3 years ago
Hi, this video is at its best, but I need a conclusion: which factors does attrition depend on most, and how do we find the main factor affecting attrition?
@jimalyajenkins9133 3 years ago
Solid tutorial. How do I use this though?
@shilpashreshta a month ago
How did you decide to use RandomForestClassifier? Why not go for logistic regression after dropping the redundant features? I am new to this, hence the confusion. Please guide.
@rohittiwari1610 4 years ago
Simple and easy code. Nice explanation. Thank you so much
@shashankbafna2867 4 years ago
Fantastic approach. Can you also make a video explaining how we can use this model? That is, what comes after creating the prediction?
@erickwang5850 4 years ago
I also want to know: can we see the significance of each feature, and how do we do that?
@SANJIVRAI6693 3 years ago
@@erickwang5850 Yes, you can check the important features by their importance scores.
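For reference, a minimal sketch of pulling those scores out of a fitted random forest; forest and X are assumed to be the fitted classifier and the feature DataFrame from the video:

    import pandas as pd

    # feature_importances_ holds one score per column of X, in the same order
    importances = pd.Series(forest.feature_importances_, index=X.columns)

    # Sort so the most influential features come first
    print(importances.sort_values(ascending=False).head(10))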
@robiparvez 9 months ago
Where can I find the dataset?
@michaelmullings 2 years ago
Question: how do I predict whether a current employee will attrit? How do I test which employees are now on their way out the door, and what factors do I look for that show this?
@NitinBhavvsarPoems 4 years ago
You are a pro, boss!! Good to see your video. Query: how do you validate the prediction results? What methods are there to validate them? Your thoughts on classification reports for this?
@jananisridhar5175 3 months ago
Can you share the IBM dataset?
@SantoshMaurya-is4bp a year ago
It's a very nice video; it really helped me.
@ajayantony4144 4 years ago
Instead of dropping the Age column, can't we change the index to one for Attrition? Just asking, since I am new to data science and curious.
@SANJIVRAI6693 3 years ago
Yes, you can.
@manideep4486 4 years ago
With this model, how can I check which employee is more likely to attrite?
@idowukila5992 4 years ago
Great question. I was wondering, too. Have you by any chance gotten an answer to this?
@QUIZ_WHIZ_SMART 4 years ago
@@idowukila5992 Well, this should be what recall measures, but in this tutorial it was very weak.
@SANJIVRAI6693 3 years ago
Any employee the model predicts as Yes is most likely to leave. Since it's binary classification, you only get a Yes or No result.
@sonyishutin9949 a year ago
@@SANJIVRAI6693 How do I see which employees are predicted to leave? I'm still learning.
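Since this question comes up several times in the thread, here is a minimal sketch of how the per-employee predictions could be inspected; forest and X_test are assumptions carried over from the video:

    import pandas as pd

    # Predict attrition (0 = stay, 1 = leave) for the test rows
    results = pd.DataFrame(X_test).copy()
    results["predicted_attrition"] = forest.predict(X_test)

    # Keep only the employees the model flags as likely to leave
    leavers = results[results["predicted_attrition"] == 1]
    print(leavers)

If X_test keeps the original DataFrame index, the index of leavers points back to the employees concerned.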
@mehtabrosul6909 a year ago
In the last step, forest.fit(x_train, y_train) shows "could not convert string to float". Why is that?
@RahulRautela5797 2 years ago
Can we also find the specific reason for leaving, i.e. the variable with the highest value?
@abdulalimbaig3286 4 years ago
Where is the link to the dataset?
@nbddesigns7620 3 years ago
Getting an error at RandomForestClassifier using sklearn. How do I solve this?
@jeevarajahjeevaratnam6224 4 years ago
I can't run seaborn; I keep getting ModuleNotFoundError: No module named 'resource'. I'm using Windows 10.
@debarati27 3 years ago
How do we show the decision tree?
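One way to do it, sketched here: a random forest is just a collection of decision trees, so you can plot any single tree from the fitted model (forest is the classifier from the video; the feature and class names are assumptions):

    import matplotlib.pyplot as plt
    from sklearn import tree

    # Take the first tree out of the fitted forest
    one_tree = forest.estimators_[0]

    plt.figure(figsize=(20, 10))
    tree.plot_tree(one_tree, feature_names=list(X.columns),
                   class_names=["No", "Yes"], filled=True, max_depth=3)
    plt.show()

max_depth=3 only limits how much of the tree is drawn, which keeps the plot readable.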
@allammihay 4 years ago
Hello, I have followed your steps on Medium, but at the last step, when I want to show the feature importances, there is an error: "ValueError: arrays must all be same length". I don't understand this problem; could you help me?
@sherifelgazar4089 2 years ago
Friend, can you post the dataset, so we can apply this?
@rahulahuja1412 4 years ago
Informative, thanks. But it would've been better had you standardized the data and then given an analysis of it.
@ComputerSciencecompsci112358 4 years ago
Thanks for your opinion!
@cloudbaud7794 4 years ago
standardized in what way?
@SANJIVRAI6693 3 years ago
@@cloudbaud7794 Scaling, meaning the dataset is rescaled so all values fall within a certain range, mostly from -1 to 1, with the lowest values mapped toward -1 and the highest toward 1 (strictly, StandardScaler centres each column to zero mean and unit variance, while MinMaxScaler is the one that maps values to a fixed range).
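For completeness, a minimal sketch of standardizing before training; X_train and X_test are the splits from the video, and StandardScaler here rescales each column to zero mean and unit variance:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    # Fit the scaler on the training data only, then apply it to both splits
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)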
@cloudbaud7794 4 years ago
Can someone please explain how we get 80% accuracy just by guessing "No" all the time? I need to understand the math: (1233-237)/1233.
@SANJIVRAI6693 3 years ago
What he means is that if you predict Attrition = No for all the rows, you will be correct about 80% of the time.
@AnkitBhargava 3 years ago
I think he is just trying to point out that there are far more No's (did not leave the company) than Yes's. So many that even without any modeling or scaling, if you simply guess that an employee has NOT left, you would be right about 80% of the time.
@ItAintNecessarilySo 2 years ago
It should really be (# who did not leave) / (total employees) = 1233 / (1233 + 237), which is approximately 84%, rather than the (1233 - 237) / 1233 the creator originally wrote.
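A quick sanity check of the arithmetic, using the counts quoted in this thread:

    stay, leave = 1233, 237
    total = stay + leave              # 1470 employees in the dataset

    # Accuracy of always predicting "No" (the majority class)
    baseline = stay / total
    print(round(baseline, 3))         # 0.839, i.e. roughly 84%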
@yashwantkumarverma1480 2 years ago
Can we fix the range of the x-axis? Because I have many data points along the x-axis.
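If this refers to the seaborn/matplotlib plots in the video, a minimal sketch; df and the Age/Attrition columns are from the video, and the limits below are placeholders:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # On a numeric axis, xlim clips the visible range directly
    sns.histplot(data=df, x="Age", hue="Attrition", multiple="stack")
    plt.xlim(18, 60)
    plt.show()

    # On a categorical countplot the ticks are category positions, so it is
    # usually easier to rotate or thin the labels than to clip the range
    ax = sns.countplot(x="Age", hue="Attrition", data=df)
    ax.tick_params(axis="x", rotation=90)
    plt.show()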
@prasunprakash2297 4 years ago
How do we calculate employee performance, department-wise?
@AnkitBhargava 3 years ago
Thank you for the walkthrough - really helpful. Question: early on in the analysis, you plotted a bar graph of Age with Attrition as the hue. But we don't know whether Age is correlated with any other attribute(s), so what would be the point of that graph? Age alone does not explain the attrition rate. Why look at it at all?
@nbddesigns7620 3 years ago
When we fit on X_train and Y_train, we get ValueError: Input contains NaN.
@fabfitmom 2 years ago
Your columns have null values; clean up the data so that every row has data in all columns. His step where he tests this is:
    # Get a count of empty values for each column
    df.isna().sum()
The above should give you 0 for all fields, and the below should give you False for the X_train/Y_train fit to work:
    # check for any missing or null values
    df.isnull().values.any()
@RohanTayal 4 years ago
Thank you for the amazing explanation, but I have a query: why did you use LabelEncoder and not OneHotEncoder to convert the non-numeric data into numeric data?
@SANJIVRAI6693 3 years ago
You can use either of them.
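For anyone comparing the two options, a minimal sketch; df is the attrition DataFrame from the video, and pd.get_dummies is the quickest route to one-hot encoding:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # Label encoding: each category becomes an integer in the same column
    df_label = df.copy()
    for col in df_label.select_dtypes(include="object").columns:
        df_label[col] = LabelEncoder().fit_transform(df_label[col])

    # One-hot encoding: each category becomes its own 0/1 column
    df_onehot = pd.get_dummies(df, drop_first=True)

Label encoding imposes an arbitrary order on the categories, which tree-based models tolerate well; for linear models, one-hot encoding is usually the safer choice.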
@harikanttiwari5326 2 years ago
I am getting an error after:
    # use random forest classifier
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
    forest.fit(X_train, Y_train)
and the error is: could not convert string to float: 'Non-Travel'
@dannymuzata4633 2 years ago
Before you get to the random forest classifier, you must make sure you have converted all your categorical data to numeric data. Then you won't have that error.
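As a sketch of what that looks like in practice ('Non-Travel' is a value from the BusinessTravel column, so at least that column is still text); this assumes the features are still held in DataFrames:

    from sklearn.preprocessing import LabelEncoder

    # List the feature columns that still hold strings
    text_cols = X_train.select_dtypes(include="object").columns
    print(list(text_cols))

    # Encode each of them to integers before calling forest.fit(...)
    # (in a real pipeline, fit the encoder before splitting so the test
    # split cannot contain categories the encoder has never seen)
    for col in text_cols:
        le = LabelEncoder()
        X_train[col] = le.fit_transform(X_train[col])
        X_test[col] = le.transform(X_test[col])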
@vijaysolanki7497 3 years ago
Why didn't you use oversampling on this imbalanced data (Yes: 237, No: 1233)?
@grahamg4529 2 years ago
Yes, that would really help with the false negatives and the recall score.
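Two common ways to handle the imbalance, sketched here; X_train/Y_train are the splits from the video, and the second option assumes the imbalanced-learn package is installed:

    from sklearn.ensemble import RandomForestClassifier

    # Option 1: keep the data as-is but weight the rare "Yes" class more heavily
    forest = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                    random_state=0)
    forest.fit(X_train, Y_train)

    # Option 2: oversample the minority class before fitting (imbalanced-learn)
    from imblearn.over_sampling import SMOTE
    X_res, Y_res = SMOTE(random_state=0).fit_resample(X_train, Y_train)
    forest.fit(X_res, Y_res)

Either way, judge the result on recall/F1 rather than accuracy, for the reasons discussed earlier in the thread.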
@surender6320 3 years ago
Can you please share the code, if you don't mind?
@namanagrawal4968 4 years ago
Where is the dataset used?
@DaisyBhullar27 4 years ago
Kaggle
@soumyasrm 4 years ago
Can you please share the GitHub link for this project?
@chowadagod 4 years ago
Lovely video, but please do projects that involve data cleaning, especially handling text data. What you do is lovely and very much appreciated, sir, but it's a bit too plain, and the majority of the work in data science is DATA CLEANING. So please focus on this aspect in upcoming videos. Thank you, sir.
@mrgz999 3 years ago
I agree - tutorials on (i) data cleaning and (ii) merging two files from two different years to do a combined analysis.
@codewiththink303 2 years ago
Please give me the HRM dataset.
@HumptyDumptyActual 4 years ago
Random guessing gives your model 80% accuracy, while machine learning gives 86%. That arguably makes a case against ML here, since results within about 6 percentage points of a random guess are not that far off, so it may be better to go with guessing than with ML. That's my opinion; others are welcome to share theirs as well.
@QUIZ_WHIZ_SMART 4 years ago
The building of this model was very straightforward. Of course, if you build it for a real project, you would do some feature-engineering steps before starting the training. You can see that the model is weak from the true positives: only 9, against 45 false positives. The recall is very bad, which means the whole model is not usable. But with some feature engineering and maybe a better algorithm, you will get great results!
@furkanozbudak4440 4 years ago
Guessing gives 80% accuracy only on this particular dataset. Newly gathered data could have 80% of attrition values = "Yes", which would drop your guess's accuracy to 20%. Then your guess would be far worse than flipping a coin and predicting based on heads or tails.
@grahamg4529 2 years ago
@@furkanozbudak4440 Exactly. I fell into the trap of relying on accuracy when working with an imbalanced dataset. It can be very misleading for a beginner, but I've learned that precision and recall are actually more important for identifying the target class.
@being_aspirang 4 years ago
This dataset is imbalanced, so we should use a different approach for the project...
@pavel822 4 years ago
Where can I get this data?
@q_1 4 years ago
Kaggle.com, IBM HR Analytics Employee Attrition
@alexanderthegreat9631 4 years ago
I keep getting a ValueError:
    from sklearn.model_selection import train_test_split
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=None, random_state=0)
    ValueError: Found input variables with inconsistent numbers of samples: [1, 1470]
Can someone help?
@SANJIVRAI6693 3 years ago
test_size needs to be defined - how much of the whole dataset you will split off for testing versus training.
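For reference, a sketch of the split as the video defines it; note that the message "inconsistent numbers of samples: [1, 1470]" also says X and Y ended up with different numbers of rows, so it is worth checking how X and Y were built:

    from sklearn.model_selection import train_test_split

    # X and Y must have the same number of rows (1470 each for this dataset)
    print(X.shape, Y.shape)

    # 75% of the rows for training, 25% held out for testing
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.25, random_state=0)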
@mrgz999 3 years ago
@@SANJIVRAI6693 Why did we select a 75/25 percent split? Why not more?
@ainli4125466 2 years ago
Thank you, but I got an error, "ValueError: Input contains NaN, infinity or a value too large for dtype('float32')", when running:
    # use the random forest classifier
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
    forest.fit(x_train, y_train)
Could you shed some light on how to fix it?
@grahamg4529 2 years ago
You need to remove NaNs from the dataset during the data-cleaning process.
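A minimal sketch of that cleaning step; df is the DataFrame from the video, and whether to drop or fill depends on the data:

    import numpy as np

    # See how many missing values each column has
    print(df.isna().sum())

    # Option 1: drop any row that still contains a missing value
    df = df.dropna()

    # Option 2: treat infinities as missing, then fill numeric gaps instead
    df = df.replace([np.inf, -np.inf], np.nan)
    df = df.fillna(df.median(numeric_only=True))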