Hi, thank you for the amazing video. Where can I find the dataset that you used?
@StatsWire 2 years ago
Thank you. You can find the notebook and dataset here: github.com/siddiquiamir/PySpark-Tutorial
@aksmalviyan8342 1 year ago
Thanks for the video....
@StatsWire 1 year ago
You're welcome!
@user-lq1cs 2 years ago
Hello, I have another question. I just found out that if you remove random_state from train_test_split, the f_score output is different every time you run the code. I want it to stay consistent without random_state, because I can't find a reason or explanation for using random_state in my final project. Can I use k-fold cross-validation here to keep the f_score consistent? I hope you understand what I'm talking about, thank you!
@StatsWire 2 years ago
Hi, if you remove the random_state parameter, then every time you split the data into train and test you get different samples, so the accuracy and F1 score will be different too. If you want the same result every time, use the random_state parameter.
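For illustration, here is a minimal sketch of both options; the Iris data and the LogisticRegression classifier are stand-ins, not the video's exact code:

```python
# Minimal sketch: a fixed random_state makes a single split reproducible,
# while cross-validation averages the score over several splits.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Option 1: fix random_state so the split (and hence the score) is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print("F1 on one fixed split:", f1_score(y_test, model.predict(X_test), average="macro"))

# Option 2: k-fold cross-validation, so the reported score depends far less
# on any single train/test split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, scoring="f1_macro", cv=cv)
print("Mean F1 over 5 folds:", scores.mean())
```

Fixing random_state keeps one particular split; cross-validation reports an average over several splits, which is usually the more defensible number for a final project.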
@leamon9024 2 years ago
Hi, thanks for your hard work. I have one question, though. My target variable is categorical, but my features are all numerical rather than categorical like the ones in the dataset in this video. Can I still use chi2 for feature selection?
@StatsWire 2 years ago
No, both the features and the target should be categorical.
@QornainAji 9 months ago
You can convert your numerical variables into categorical ones using a binning technique. How many bins you should use depends on your data. Hope that answers your question.
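As an illustration of that binning idea, here is a minimal sketch assuming pandas and scikit-learn; the column name "age" and the choice of three bins are made up for the example:

```python
# Minimal sketch of binning a numerical column before chi-square feature selection.
# The "age" column and the three bins are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

df = pd.DataFrame({"age": [23, 45, 31, 62, 18, 40, 55, 29]})

# Option 1: pandas.cut with labelled bins
df["age_binned"] = pd.cut(df["age"], bins=3, labels=["low", "mid", "high"])

# Option 2: KBinsDiscretizer, which produces non-negative ordinal codes
# that sklearn's chi2 can consume directly
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
df["age_bin_code"] = binner.fit_transform(df[["age"]]).ravel()
print(df)
```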
@reanwithkimleng 6 days ago
Hello sir, how do we do feature selection or compute feature importance when the dataset has both categorical and quantitative features? For example, the penguin dataset has length, depth, and mass as quantitative features, and sex, island, and species as categorical ones.
@Thomas-mr2xx 2 years ago
Hello, I don't understand why the chi2 function in your tutorial video outputs a (p_values, chi2) array, while the sklearn documentation and my local code output (chi2, p_values). Do you know why your code outputs it like this?
@StatsWire 2 years ago
The output is the same, I am just printing it in a different order.
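For reference, scikit-learn's chi2 returns the chi-square statistics first and the p-values second; printing them in a different order only changes the display, not the values. A minimal sketch, using the Iris data purely as an illustration rather than the video's dataset:

```python
# Minimal sketch: chi2 returns (chi2 statistics, p-values) in that order.
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)
chi2_stats, p_values = chi2(X, y)

print("chi2 statistics:", chi2_stats)
print("p-values:      ", p_values)
# Printing p_values before chi2_stats only swaps the display order.
```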
@mazharalamsiddiqui6904 2 years ago
Very nice
@StatsWire 2 years ago
Thank you
@putridisperindag6986 1 year ago
Hello Mr. Amir Siddiqui, thank you for your nicely explained video. May I ask, what is the difference between f_score and p_values, and which one should we choose? Thanks in advance.
@StatsWire 1 year ago
Hello, thank you for your kind words. The F-score and p-value are totally different. An F-score is the harmonic mean of precision and recall. A p-value is used in hypothesis testing to help you support or reject the null hypothesis; it measures the evidence against the null hypothesis.
@putridisperindag6986 1 year ago
Then, @StatsWire, for feature selection purposes, should we select the best features based on the top 10 highest p-values or the top 10 highest f_score values? Thanks.
@StatsWire 1 year ago
@putridisperindag6986 With chi-square we should select the features whose p-values are less than 0.05, because we are doing hypothesis testing.
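As an illustration of that rule, here is a minimal sketch that keeps the features with chi2 p-values below 0.05, using the Iris data as a stand-in dataset:

```python
# Minimal sketch of p-value-based selection with the 0.05 threshold mentioned above.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

data = load_iris(as_frame=True)
X, y = data.data, data.target

chi2_stats, p_values = chi2(X, y)
results = pd.DataFrame({"feature": X.columns, "chi2": chi2_stats, "p_value": p_values})

# Keep only the features for which we reject the null hypothesis of independence
selected = results[results["p_value"] < 0.05]["feature"].tolist()
print(results)
print("Selected features:", selected)
```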
@user-lq1cs 2 years ago
Hello, thank you for making the video, but I have a question. I followed every step exactly but used my own dataset, and the output I got looks like really high numbers; for example, the highest is 9.417441e-01 and the lowest is 1.134117e-01. Do you know where I went wrong? I'm so confused. Keep up the good work, thank you!
@StatsWire 2 years ago
Hello! Those numbers are actually small, not large. The e-01 part is scientific notation (multiply by 10^-1), so 9.417441e-01 is just 0.9417. You can convert the scientific notation to plain decimal numbers online or directly in your code.
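If it helps, here is a minimal sketch of forcing plain decimal display in NumPy and pandas; the two values are taken from the comment above:

```python
# Minimal sketch: display plain decimals instead of scientific notation.
import numpy as np
import pandas as pd

p_values = np.array([9.417441e-01, 1.134117e-01])

np.set_printoptions(suppress=True)  # NumPy: avoid scientific notation when printing arrays
print(p_values)

pd.set_option("display.float_format", "{:.6f}".format)  # pandas: fixed-point display
print(pd.Series(p_values, name="p_value"))
```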
@Angela-Gee 2 years ago
In the last step I'm getting this error: 'numpy.ndarray' object has no attribute 'index'
@StatsWire 2 years ago
Please check whether the previous steps are correct.
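One common cause of that error, offered only as a hedged guess since the exact code is not shown: .index exists on a pandas Series or DataFrame but not on a NumPy array, so wrapping the scores in a Series keyed by the column names usually fixes it. A minimal sketch with the Iris data as a stand-in:

```python
# Minimal sketch of a common cause of this error (not verified against the
# exact code in the video): .index exists on a pandas Series, not a NumPy array.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

data = load_iris(as_frame=True)
X, y = data.data, data.target

chi2_stats, p_values = chi2(X, y)

# p_values is a plain NumPy array, so p_values.index would raise
# AttributeError: 'numpy.ndarray' object has no attribute 'index'.
# Wrapping it in a Series keyed by the column names restores .index:
p_series = pd.Series(p_values, index=X.columns)
print(p_series.sort_values())
```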