Machine Learning [Python] – Decision Trees – Classification


In this tutorial, you will learn how to use decision trees. We will use this classification algorithm to build a model from historical data about patients and their responses to different medications. Then we will use the trained decision tree to predict the class of an unknown patient, that is, to find a suitable drug for a new patient.

A decision tree is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of that test, and each leaf node (terminal node) holds a class label.
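To make the node, branch, and leaf terminology concrete, a tiny tree can be written by hand as nested conditionals. This is only a sketch: the function name prescribe, the age threshold, and the drug assigned to each leaf are hypothetical and are not learned from the dataset used later in this tutorial.

```python
# Hypothetical hand-written tree. The threshold and the drug at each leaf
# are illustrative assumptions, not values learned from data.
def prescribe(age, blood_pressure):
    # Internal node: test on the blood-pressure attribute
    if blood_pressure == "HIGH":
        # Branch "HIGH" leads to another internal node: test on age
        if age < 50:
            return "Drug A"  # leaf node: class label
        return "Drug B"      # leaf node: class label
    # Branch "not HIGH" leads straight to a leaf
    return "Drug X"


print(prescribe(age=35, blood_pressure="HIGH"))    # Drug A
print(prescribe(age=62, blood_pressure="NORMAL"))  # Drug X
```

A trained classifier builds the same kind of structure automatically by choosing the tests from the training data.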

Here’s a visualization of the Decision Tree algorithm.

Illustrating a decision tree. Source: Navlani 2018.


Parts Required

  • Python environment (Spyder, Jupyter, etc.).

Procedure

Following are the steps required to perform this tutorial.

Packages Needed

import numpy as np 
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

Dataset

The collected data describes a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, or Drug Y.

The objective here is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are the Age, Sex, Blood Pressure (BP), Cholesterol, and sodium-to-potassium ratio (Na_to_K) of the patients, and the target is the drug that each patient responded to.

This is an example of a multiclass classification problem. You can use the training part of the dataset to build a decision tree, and then use it to predict the class of an unknown patient or to prescribe a drug to a new patient.

Load Data From CSV File

my_data = pd.read_csv("drug200.csv", delimiter=",")
my_data[0:5]

Pre-Processing

Using my_data as the drug200.csv data read by pandas, declare the following variables:

  • X as the Feature Matrix (data of my_data)
  • y as the response vector (target)

The feature matrix leaves out the target column (Drug), which is stored separately as y.

X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[0:5]

Some features in this dataset are categorical, such as Sex or BP. Unfortunately, scikit-learn decision trees do not handle categorical variables directly, so these features must be converted to numerical values first. One option is pandas.get_dummies(), which converts a categorical variable into dummy/indicator variables; the code below instead uses LabelEncoder to map each category to an integer code.

from sklearn import preprocessing

# Encode Sex (column 1) as integers
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F', 'M'])
X[:, 1] = le_sex.transform(X[:, 1])

# Encode blood pressure (column 2) as integers
le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:, 2] = le_BP.transform(X[:, 2])

# Encode cholesterol (column 3) as integers
le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL', 'HIGH'])
X[:, 3] = le_Chol.transform(X[:, 3])

X[0:5]
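One detail of the encoding step is easy to miss: LabelEncoder assigns integer codes in sorted order of the labels, not in the order they are passed to fit. The short check below illustrates this for the blood-pressure labels; the variable names le_bp_demo and codes are used only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the same three labels used for the BP column above
le_bp_demo = LabelEncoder()
le_bp_demo.fit(["LOW", "NORMAL", "HIGH"])

# classes_ stores the labels sorted alphabetically; the position is the code
print(le_bp_demo.classes_.tolist())  # ['HIGH', 'LOW', 'NORMAL']

codes = le_bp_demo.transform(["LOW", "NORMAL", "HIGH"])
print(codes.tolist())  # [1, 2, 0]

# inverse_transform maps the integer codes back to the original labels
print(le_bp_demo.inverse_transform([0, 1, 2]).tolist())
```

The resulting codes (HIGH=0, LOW=1, NORMAL=2) do not follow the natural low-to-high ordering. A tree can still split on them, but keep this in mind when reading thresholds from a fitted tree.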

Now we can fill in the target variable.

y = my_data["Drug"]
y[0:5]

Setting up the Decision Tree

We will use a train/test split to evaluate our decision tree. Let’s import train_test_split from sklearn.model_selection.

from sklearn.model_selection import train_test_split

train_test_split returns four arrays. We will name them: X_trainset, X_testset, y_trainset, y_testset.

We will call train_test_split with the parameters X, y, test_size=0.3, and random_state=3.

X and y are the arrays to split, test_size is the fraction of the data held out for testing (here 30%), and random_state fixes the random shuffle so that we obtain the same split on every run.

X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)
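The effect of test_size and random_state can be checked on synthetic data. The sketch below assumes 200 rows (suggested by the file name drug200.csv, but not confirmed here) and uses made-up arrays X_demo and y_demo in place of the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 200 rows (assumed) and 5 features
X_demo = np.arange(1000).reshape(200, 5)
y_demo = np.arange(200) % 5  # five dummy class labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3
)
print(X_tr.shape, X_te.shape)  # (140, 5) (60, 5)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3
)
print(np.array_equal(X_te, X_te2))  # True
```

With 200 rows, a test_size of 0.3 holds out 60 rows for testing and leaves 140 for training.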

Practice

Print the shape of X_trainset and y_trainset to ensure that the dimensions match.

print('Shape of X training set {}'.format(X_trainset.shape), '&', 'Shape of y training set {}'.format(y_trainset.shape))

Print the shape of X_testset and y_testset to ensure that the dimensions match.

print('Shape of X test set {}'.format(X_testset.shape), '&', 'Shape of y test set {}'.format(y_testset.shape))

Modeling

We will first create an instance of DecisionTreeClassifier called drugTree. Inside the classifier, specify criterion="entropy" so that splits are chosen by information gain, and max_depth=4 to limit the depth of the tree.

drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
drugTree  # displays the classifier and its parameters
drugTree.fit(X_trainset,y_trainset)
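The entropy criterion used above can be illustrated by hand. The following is a minimal sketch of Shannon entropy and information gain in plain Python, not scikit-learn's internal implementation; the helper names entropy and information_gain and the toy labels are assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy labels for illustration only
parent = ["drugA", "drugA", "drugB", "drugB"]
pure_split = [["drugA", "drugA"], ["drugB", "drugB"]]
mixed_split = [["drugA", "drugB"], ["drugA", "drugB"]]

print(entropy(parent))                        # 1.0
print(information_gain(parent, pure_split))   # 1.0
print(information_gain(parent, mixed_split))  # 0.0
```

A split that separates the classes perfectly removes all uncertainty (gain of 1 bit here), while a split that leaves each child as mixed as the parent gains nothing. The tree prefers splits with higher gain.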

Prediction

predTree = drugTree.predict(X_testset)

You can print out predTree and y_testset if you want to visually compare the predictions to the actual values.

print(predTree[0:5])
print(y_testset[0:5])

Evaluation

Let’s import metrics from sklearn and check the accuracy of our model.

from sklearn import metrics
print("Decision tree accuracy: ", metrics.accuracy_score(y_testset, predTree))

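For a single-label problem like this one, accuracy_score returns the fraction of test samples whose predicted label matches the true label. In multilabel classification, it computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.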
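For single-label predictions, the accuracy score reduces to a simple count of matches. The sketch below computes it in plain Python; the helper name accuracy and the label lists are hypothetical and are not results from the model trained above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for illustration only
actual = ["drugY", "drugX", "drugA", "drugY", "drugC"]
predicted = ["drugY", "drugX", "drugB", "drugY", "drugC"]

print(accuracy(actual, predicted))  # 0.8
```

Four of the five predictions match, so the accuracy is 0.8.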

Advantages of Decision Tree

  1. Compared to other algorithms, decision trees require less effort for data preparation during pre-processing;
  2. A decision tree does not require normalization of the data;
  3. A decision tree does not require scaling of the data either;
  4. Missing values in the data do not affect the process of building a decision tree to any considerable extent in implementations that handle them, although support varies by library and version, so they may need to be imputed first;
  5. A decision tree model is very intuitive and easy to explain to technical teams as well as stakeholders.

Disadvantages of Decision Tree

  1. A small change in the data can cause a large change in the structure of the decision tree, which makes the model unstable;
  2. For a decision tree, the calculations can sometimes become far more complex than for other algorithms;
  3. Training a decision tree is relatively expensive, because that added complexity often means more time to train the model;
  4. A single decision tree predicts piecewise-constant values, so it is often a poor fit for regression on continuous targets, even though regression trees exist (for example, scikit-learn's DecisionTreeRegressor).

