How To Use A Decision Tree Classifier With A CSV File


Created and recorded in June 2022 by Vivek Jariwala
Music: Call of the Void, by Justin Miles, lmms.io/lsp?ac...
Imagine you have hundreds of data entries that you want to go through to find a pattern or draw a conclusion. There is a way to quickly analyze the dataset and make automatic predictions: use a Decision Tree Classifier!
The data that we are using today showcases the most popular genre of music for a specific age group and gender.
Let's load in the pandas library and call it "pd". Let's also load in the scikit-learn package, imported as sklearn, which is a widely used library for machine learning algorithms. Let's import the DecisionTreeClassifier class from scikit-learn, as that is the basic machine learning algorithm that we will be implementing today. A decision tree classifier creates a decision tree. Each node in the tree specifies a test on an attribute, and each branch descending from that node corresponds to one of the possible outcomes of that test.
Now let's also add the line "from sklearn.model_selection import train_test_split". So from the scikit-learn library, we access the model_selection module and import the function "train_test_split". Using this function, we can easily split the current dataset into two separate sets.
Since we also want to calculate the accuracy of the predictions being made by this model, let's once again import a function from scikit-learn. So let's write "from sklearn.metrics import accuracy_score". We are accessing the scikit-learn library again, accessing the metrics module, and importing the accuracy_score function.
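Put together, the import lines described above would look roughly like this at the top of the script. The narration does not spell out the module for DecisionTreeClassifier; sklearn.tree is where scikit-learn provides that class.

```python
# pandas for reading the CSV file, imported under the usual short name "pd"
import pandas as pd

# The classifier itself lives in scikit-learn's tree module
from sklearn.tree import DecisionTreeClassifier

# Helper for splitting the data into training and testing sets
from sklearn.model_selection import train_test_split

# Helper for scoring predictions against the known answers
from sklearn.metrics import accuracy_score
```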
Now to read a CSV file, which is fairly standard, use the pandas read_csv function and specify the name of the data set, which in our case is 'music.csv'. Today we are using this data set, which lists the ages and genders of a sample of people along with their favourite music genre. Let's open this CSV file for a second. The documentation for this open source file states that a 1 represents a male and a 0 represents a female. For example, we can see here that perhaps men from 31-37 like "Classical" music.
Let's create an A and a b subset of the data. The A data set contains all the data from the CSV file except the genre column. As you can see from the syntax "music_data.drop()", we are specifying in the argument section of that method that we are removing the column called 'genre'. The b subset, b = music_data['genre'], contains just the genre column.
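As a rough sketch, loading the data and separating inputs from outputs could look like the following. The real 'music.csv' is not reproduced here, so a small in-memory CSV stands in for it; the rows are made up for illustration, and the column names 'age' and 'gender' are assumptions (only 'genre' is named in the narration).

```python
import io
import pandas as pd

# Stand-in for 'music.csv'. These rows are invented placeholders, and the
# 'age' and 'gender' column names are assumed; only 'genre' comes from the video.
csv_text = """age,gender,genre
20,1,HipHop
23,1,HipHop
26,1,Jazz
29,1,Jazz
31,1,Classical
37,1,Classical
21,0,Dance
25,0,Dance
27,0,Acoustic
33,0,Classical
"""

# With the real file on disk this would be: pd.read_csv('music.csv')
music_data = pd.read_csv(io.StringIO(csv_text))

# A: every column except 'genre' (the inputs)
A = music_data.drop(columns=['genre'])

# b: only the 'genre' column (the outputs we want to predict)
b = music_data['genre']

print(A.head())
print(b.head())
```

Note that drop() returns a new DataFrame, so music_data itself still contains the 'genre' column afterwards.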
With this line: "A_train, A_test, b_train, b_test = train_test_split(A,b,test_size=0.2)", we can split up our data into a training set and a testing set. The "test_size" parameter is a fraction between 0.0 and 1.0 that specifies how much of the data is held out for testing, so 0.2 means 20% is used to test and 80% is used to train. In general, the more data the model has to train with, the more accurate the predictions tend to be. It is usually good practice to have 75-85% of the data be used to train the model if the goal is to have accurate predictions. The train_test_split function returns four values, so we store them in these four variables. The first two variables, A_train and A_test, are the input sets for training and testing. The last two variables, b_train and b_test, are the output sets for training and testing.
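To make the 80/20 split concrete, here is a small sketch on ten made-up rows (the 'age' and 'gender' column names are assumptions). The random_state argument is not part of the line in the video; it is added here only so the shuffle is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten invented rows standing in for the contents of 'music.csv'
music_data = pd.DataFrame({
    "age":    [20, 23, 26, 29, 31, 37, 21, 25, 27, 33],
    "gender": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "genre":  ["HipHop", "HipHop", "Jazz", "Jazz", "Classical",
               "Classical", "Dance", "Dance", "Acoustic", "Classical"],
})
A = music_data.drop(columns=["genre"])
b = music_data["genre"]

# test_size=0.2 holds out 20% of the rows for testing.
# random_state=0 is an addition for reproducibility, not from the video.
A_train, A_test, b_train, b_test = train_test_split(
    A, b, test_size=0.2, random_state=0
)

print(len(A_train), "training rows,", len(A_test), "testing rows")
```

With ten rows, this leaves eight for training and two for testing, and each input row keeps the same index as its matching output.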
Now we can create a DecisionTreeClassifier() and store it in a variable called "model". Then let's use this object to call the ".fit" method, which trains our model. For the arguments, we pass in the training data sets, so we put "A_train, b_train" in the brackets.
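A minimal sketch of the training step follows, again on made-up rows with assumed 'age' and 'gender' columns. To keep it short, the toy rows are used directly as the training set instead of coming from train_test_split. The export_text call is not part of the video; it is included only to show the tests that the fitted tree applies at each node.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented rows used directly as the training data for this sketch
A_train = pd.DataFrame({
    "age":    [20, 23, 26, 29, 31, 37, 21, 25, 27, 33],
    "gender": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})
b_train = pd.Series(
    ["HipHop", "HipHop", "Jazz", "Jazz", "Classical",
     "Classical", "Dance", "Dance", "Acoustic", "Classical"],
    name="genre",
)

# Create the classifier and train it on the inputs and known outputs
model = DecisionTreeClassifier()
model.fit(A_train, b_train)

# Optional: print the learned tree as text, one test per node
tree_text = export_text(model, feature_names=list(A_train.columns))
print(tree_text)
```

In scikit-learn each node is a threshold test on one column (for example, whether age is at or below some value), with one branch for each outcome of that test.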
We can then find the predictions using the ".predict" method, and we will store them in a variable called "predictions". As the argument, let's pass in the "A_test" data, since those are the inputs the model will use to make the predictions. Be careful with ".values" here. Many programmers run into a UserWarning in the output window stating that X "does not have valid feature names" but the classifier "was fitted with feature names". Essentially, this means that the model was fitted with data in a DataFrame, which has column names, and then only bare values (such as "A_test.values" or a plain list) were used to predict. The warning goes away when the inputs to ".fit" and ".predict" match: either pass DataFrames to both (A_train and A_test), or use ".values" on both (A_train.values and A_test.values).
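Here is a hedged sketch of the prediction step on made-up rows (assumed 'age' and 'gender' columns, random_state added for repeatability). It also records any warnings raised when a bare array is passed to a model that was fitted on a DataFrame, which on recent scikit-learn versions is the situation that produces the feature-name UserWarning.

```python
import warnings
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Invented rows standing in for 'music.csv'
music_data = pd.DataFrame({
    "age":    [20, 23, 26, 29, 31, 37, 21, 25, 27, 33],
    "gender": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "genre":  ["HipHop", "HipHop", "Jazz", "Jazz", "Classical",
               "Classical", "Dance", "Dance", "Acoustic", "Classical"],
})
A = music_data.drop(columns=["genre"])
b = music_data["genre"]
A_train, A_test, b_train, b_test = train_test_split(
    A, b, test_size=0.2, random_state=0
)

# Fitted on a DataFrame, so the model remembers the column names
model = DecisionTreeClassifier()
model.fit(A_train, b_train)

# DataFrame in, same column names as at fit time
predictions = model.predict(A_test)
print(predictions)

# Bare array in: same numbers, but no column names attached.
# Any warnings raised by this call are collected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    predictions_from_array = model.predict(A_test.values)

for w in caught:
    print(w.category.__name__, ":", w.message)
```

Both calls return the same predictions; the only difference is whether scikit-learn can check the column names against the ones it saw during training.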
In the beginning I also mentioned that I would show how to measure the accuracy of the predictions being made. That is also really simple and can be done in a single line of code. The way this works is that the program compares the b_test values, which are the known correct answers, with the predictions that were made. The syntax is as follows: "accuracy_score(b_test, predictions)". The accuracy_score function takes those two sets of data as arguments and returns a value between 0.0 and 1.0, which is the fraction of predictions that match the correct answers.
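As a small worked example of what accuracy_score computes, the lists below are made-up stand-ins for b_test and predictions. Three of the four predictions match the correct answers, so the score is 3/4.

```python
from sklearn.metrics import accuracy_score

# Made-up stand-ins: the known answers and what a model might have predicted
b_test = ["Jazz", "Classical", "HipHop", "Jazz"]
predictions = ["Jazz", "Classical", "Dance", "Jazz"]

# Fraction of positions where the prediction equals the known answer
score = accuracy_score(b_test, predictions)
print(score)  # 0.75
```

Because train_test_split shuffles the rows, the score on a real data set can change from run to run unless the split is made repeatable.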

Comments: 2

@palomaperez5694 2 years ago

Thanks for ur help and explain!

@ymonga122 1 year ago

X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names this happens with me can you make me understand i did not get in you description