# Random Forest Algorithm and its Application

By

*Dr. Mohendra Roy*

According to Wikipedia, a forest is a large area dominated by **trees** [1]. In machine learning, the same analogy is used for building **ensemble algorithms** such as Random Forest. An ensemble technique combines a group of predictive models (such as decision trees) to achieve better accuracy and stability by avoiding bias and high variance. Here, bias is ‘how much, on average, the predicted values differ from the actual values’, and variance is ‘how different the predictions of the model would be at the same point if different samples were taken from the same population’ [2].

Generally, we can increase the prediction accuracy by increasing the complexity of a decision tree (by adding branches or sub-nodes), which lowers the bias of the model. However, this will **overfit** the model, which in turn will suffer from high variance. Therefore, we must find a way to balance these two errors.

This is where the random forest comes in. The random forest is based on the **bagging technique** (bootstrap aggregating) [3]. It reduces the variance of the predictions by combining the results of multiple classifiers (trees), each modelled on a different sub-sample of the same dataset. In short, the random forest algorithm outputs the value predicted by the maximum number of trees in the forest.
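
The bagging idea can be sketched in a few lines. This is a hypothetical toy illustration (not part of the chapter's code): each "model" simply predicts the majority label of its own bootstrap sample, and the forest's output is the majority vote over those models.

```python
from random import randrange, seed

def bootstrap(dataset):
    # draw a sample of the same size as the dataset, with replacement
    return [dataset[randrange(len(dataset))] for _ in range(len(dataset))]

def majority_vote(predictions):
    # return the most frequent prediction (the bagged output)
    return max(set(predictions), key=predictions.count)

seed(1)
data = ['M', 'M', 'M', 'R', 'R']  # toy labels
# each "model" here just predicts the majority label of its bootstrap sample
votes = [majority_vote(bootstrap(data)) for _ in range(5)]
forest_output = majority_vote(votes)
```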

In this chapter, we will discuss and develop the Random Forest algorithm from scratch.

The concise steps of the algorithm are as follows.

1) Fold the dataset into N sub-datasets. These N sub-datasets (sampled with replacement) will then be used to train the random forest model of n trees, with the minimum number of features per tree = sqrt(number of features in the dataset).

2) Grow each tree to the largest extent possible using a cost function (here we will use the Gini Index).

3) Take the value predicted by the maximum number of trees (for a classification problem) or the average of the predicted outputs (for a regression problem).

For the training and testing of this algorithm, we will use the sonar dataset. The sonar dataset contains 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions, and 97 patterns obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The dataset contains signals obtained from a variety of aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock.

Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during the chirp.

The label associated with each record contains the letter “R” if the object is a rock and “M” if it is a mine (metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.

We will design our model on top of this dataset to differentiate between the rock and the metal, as well as test it using test samples from the same dataset. Here we go …

**Note: Please follow the GitHub link for the code or the HTML documentation.**

https://github.com/mohendra/My_Projects/tree/master/Random_Forest

The **Random_Forest** function:

```
def Random_Forest(training_data, test_data, maximum_branch, minimum_sample, subsample_ratio, tree_number, no_of_features):
    trees = list()
    for i in range(tree_number):
        sub_sample = Make_Subsample(training_data, subsample_ratio)  # "Make_Subsample" is a user-defined function
        tree = Make_Tree(sub_sample, maximum_branch, minimum_sample, no_of_features)  # "Make_Tree" is a user-defined function
        trees.append(tree)
    output = [Voted_Output(trees, row) for row in test_data]
    return output
```

Here, the **“training_data”** argument is the dataset for training the **“Random_Forest”** function, and **“test_data”** is the dataset for testing the performance of the algorithm.

**maximum_branch** is the argument for the maximum *allowed branches* or *depth* of a tree. Similarly, **minimum_sample** is the minimum number of samples required in a **node** for further branching from that node.

**subsample_ratio** is the ratio at which the training sample will be randomly divided into subsamples for the training of the trees.

**tree_number** is the total number of trees present in the random forest.

**no_of_features** is the number of randomly selected features that each tree will use for training. Each tree will use a different set of features, and the minimum number of features required for the training of a tree is given by [4]:

Number of Features = sqrt(Number of features in the dataset)
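
For the sonar data used later (60 input features per row), this works out as a quick check:

```python
from math import sqrt

total_features = 60  # the sonar dataset has 60 measurements per pattern
features_per_tree = int(sqrt(total_features))  # sqrt(60) ≈ 7.75, truncated to 7
print(features_per_tree)  # 7
```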

**Note:** Here we use the following user-defined functions:

**1) Make_Subsample, 2) Make_Tree, 3) Voted_Output**

We will explain these functions below.

```
# Function for making a subsample from the training sample
from random import randrange  # "randrange" generates a random number in a given range
def Make_Subsample(dataset, subsample_ratio):
    sample = list()
    n_sample = round(len(dataset) * subsample_ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample
```

The above function returns a random subsample of the given dataset, drawn with replacement.
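
As a quick sanity check (a hypothetical toy run, not from the chapter), the function can be exercised on a small list of rows; with a ratio of 0.5 it returns round(10 × 0.5) = 5 rows drawn from the original ten:

```python
from random import randrange, seed

def Make_Subsample(dataset, subsample_ratio):
    # draw round(len * ratio) rows at random, with replacement
    sample = list()
    n_sample = round(len(dataset) * subsample_ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample

seed(2)  # fixed seed so the run is repeatable
toy = [[i, 'M' if i % 2 else 'R'] for i in range(10)]  # ten toy rows
sub = Make_Subsample(toy, 0.5)
print(len(sub))  # 5
```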

```
# Function for making a tree
def Make_Tree(training_data, maximum_branch, minimum_sample, no_of_features):
    root = Make_Label(training_data, no_of_features)  # "Make_Label" is a user-defined function
    Child_Split(root, maximum_branch, minimum_sample, no_of_features, 1)  # "Child_Split" is a user-defined function
    return root
```

The above function makes a tree. In this function, we used the **Make_Label** and **Child_Split** functions, which we will discuss after the **Voted_Output** function.

```
# Function for calculating the voted output
def Voted_Output(trees, test_row):
    p = [Predict(tree, test_row) for tree in trees]  # "Predict" is a user-defined function
    return max(set(p), key=p.count)  # returns the most frequent prediction in "p"
```

In the above function, the **“trees”** argument is the list of trees returned by the **Make_Tree** function. Again, in the above function, we used the **Predict** user-defined function. The details of this function are given below.
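
The voting line itself is easy to verify in isolation. For example, with three hypothetical tree predictions (integer class labels, invented for illustration):

```python
p = [1, 0, 1]  # predictions from three trees
voted = max(set(p), key=p.count)  # the class predicted by the most trees
print(voted)  # 1
```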

```
# Function to predict the group of a test row
def Predict(node, test_row):
    if test_row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return Predict(node['left'], test_row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return Predict(node['right'], test_row)
        else:
            return node['right']
```

Here the argument **“node”** is returned by the **Make_Tree** function. It is a dictionary containing the **“index”**, the **“value”**, and the **label** (or label value) of a terminal node, and, where applicable, nested dictionaries of the same form for the sub-branches.

**Note:** The **“isinstance”** function returns whether an object is an instance of a class or of a subclass thereof. Here, if the left (or right) child itself contains a dictionary (**dict**), then the “Predict” function is applied again to that child.
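
To see the recursion at work, we can run Predict (as defined above) on a small hand-built tree; the split indices and values below are made up for illustration:

```python
def Predict(node, test_row):
    # recursively descend until a terminal label (non-dict) is reached
    if test_row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return Predict(node['left'], test_row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return Predict(node['right'], test_row)
        else:
            return node['right']

# a two-level toy tree: split on feature 0, then on feature 1
tree = {'index': 0, 'value': 0.5,
        'left': 'R',  # terminal label
        'right': {'index': 1, 'value': 0.3, 'left': 'R', 'right': 'M'}}
print(Predict(tree, [0.2, 0.9]))  # 'R' (feature 0 < 0.5)
print(Predict(tree, [0.8, 0.9]))  # 'M' (feature 0 >= 0.5, feature 1 >= 0.3)
```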

```
# Function to make a label
def Make_Label(dataset, no_of_features):
    class_labels = list(set(row[-1] for row in dataset))
    bindex, bvalue, bscore, bgroups = 1000, 1000, 1000, None  # initial values for the best index, value, score, and groups
    features = list()
    while len(features) < no_of_features:
        index = randrange(len(dataset[0]) - 1)
        if index not in features:
            features.append(index)
    for index in features:
        for row in dataset:
            groups = Data_Split(index, row[index], dataset)  # "Data_Split" is a user-defined function
            gini = Gini_Index(groups, class_labels)  # "Gini_Index" is a user-defined function
            if gini < bscore:
                bindex, bvalue, bscore, bgroups = index, row[index], gini, groups
    return {'index': bindex, 'value': bvalue, 'groups': bgroups}
```

Here, **“class_labels”** contains the unique class labels, collected from the last column of each row of the dataset (train set).

Here we used the **Data_Split** and **Gini_Index** user-defined functions. These are explained below.

```
# Function to split the data based on the supplied index and value
def Data_Split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right
```
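
A toy run (values invented for illustration) shows how the rows are partitioned around the split value:

```python
def Data_Split(index, value, dataset):
    # rows with feature[index] < value go left, the rest go right
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

toy = [[0.1, 'R'], [0.4, 'R'], [0.6, 'M'], [0.9, 'M']]
left, right = Data_Split(0, 0.5, toy)  # split on feature 0 at 0.5
print(len(left), len(right))  # 2 2
```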

```
# Function for making the sub-nodes by splitting the parent node, to obtain the branches of the tree and eventually the end nodes
def Child_Split(node, maximum_branch, minimum_sample, no_of_features, current_depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = Terminate(left + right)  # "Terminate" is a user-defined function
        return
    # check for maximum depth
    if current_depth >= maximum_branch:
        node['left'], node['right'] = Terminate(left), Terminate(right)
        return
    # process the left child
    if len(left) <= minimum_sample:
        node['left'] = Terminate(left)
    else:
        node['left'] = Make_Label(left, no_of_features)
        Child_Split(node['left'], maximum_branch, minimum_sample, no_of_features, current_depth + 1)
    # process the right child
    if len(right) <= minimum_sample:
        node['right'] = Terminate(right)
    else:
        node['right'] = Make_Label(right, no_of_features)
        Child_Split(node['right'], maximum_branch, minimum_sample, no_of_features, current_depth + 1)
```

Here we used a user-defined function called **Terminate**.

```
def Terminate(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)
```

The **“return max(set(outcomes), key=outcomes.count)”** statement returns the most frequently occurring value in the **outcomes** list.

We know that the Gini Index is one of the cost functions used in the decision tree algorithm. It is given by:

Gini = Σ_i Σ_c P(c, i) × (1 − P(c, i))

where P(c, i) is the probability of a particular class *c* in group *i* (with “*i*” ranging from 1 to the n groups and “*c*” from 1 to the m classes). We can implement this as a function as described below.
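
A worked numeric check (toy groups, invented for illustration) confirms the two extremes of this score: a perfect split scores 0.0, while two half-and-half groups score the worst:

```python
def Gini_Index(groups, class_labels):
    # sum p*(1-p) over every (class, group) pair
    gini = 0.0
    for class_value in class_labels:
        for group in groups:
            size = len(group)
            if size == 0:
                continue  # skip empty groups
            prob = [row[-1] for row in group].count(class_value) / float(size)
            gini += (prob * (1.0 - prob))
    return gini

pure = ([[0, 'R'], [1, 'R']], [[2, 'M'], [3, 'M']])   # perfectly separated
mixed = ([[0, 'R'], [1, 'M']], [[2, 'R'], [3, 'M']])  # 50/50 in both groups
print(Gini_Index(pure, ['R', 'M']))   # 0.0
print(Gini_Index(mixed, ['R', 'M']))  # 1.0 (0.25 from each class in each group)
```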

```
# Function to determine the Gini Index
def Gini_Index(groups, class_labels):
    gini = 0.0
    for class_value in class_labels:
        for group in groups:
            size = len(group)
            if size == 0:
                continue  # this is to avoid empty groups
            prob = [row[-1] for row in group].count(class_value) / float(size)
            gini += (prob * (1.0 - prob))
    return gini
```

So far, we have built the random forest algorithm. Now we will write a function to evaluate its performance.

```
# Function to evaluate the Random Forest algorithm
def Evaluate_Algorithm(dataset, algorithm, number_of_folds, *args):  # *args contains the remaining arguments
    folds = Make_Fold(dataset, number_of_folds)  # "Make_Fold" is a user-defined function
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)  # because the current fold will be used as the test set
        train_set = sum(train_set, [])  # flatten the remaining folds into a single list of rows
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None  # remove the label from the test set
        predicted = algorithm(train_set, test_set, *args)  # here "algorithm" will be replaced by "Random_Forest"
        actual = [row[-1] for row in fold]  # copy the labels of the test set
        accuracy = Accuracy(actual, predicted)  # "Accuracy" is a user-defined function
        scores.append(accuracy)
    return scores
```

Here we used the **Make_Fold** and **Accuracy** user-defined functions.

```
# Function to fold the dataset to get the training and testing sub-datasets
def Make_Fold(dataset, number_of_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / number_of_folds)
    for i in range(number_of_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split
```
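
A toy run (ten invented rows, five folds) confirms that every row lands in exactly one fold and each fold holds len(dataset)/number_of_folds rows:

```python
from random import randrange

def Make_Fold(dataset, number_of_folds):
    # pop random rows from a copy so no row is reused across folds
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / number_of_folds)
    for i in range(number_of_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

toy = [[i] for i in range(10)]  # ten toy rows
folds = Make_Fold(toy, 5)
print([len(f) for f in folds])  # [2, 2, 2, 2, 2]
```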

```
# Function to find the accuracy between the predicted and actual data
def Accuracy(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0
```
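
For example (toy labels, invented for illustration), three correct predictions out of four give 75%:

```python
def Accuracy(actual, predicted):
    # percentage of positions where prediction matches the actual label
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

print(Accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 75.0
```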

Now we will write functions to **load** the file and prepare the dataset for use in the Random Forest algorithm.

```
# Function for importing a CSV file
from csv import reader  # "reader" will read the file
# Load a CSV file
def Load_File(filename):
    dataset = list()
    with open(filename, 'r') as file:  # open the file in read-only mode
        data = reader(file)
        for row in data:
            if not row:  # this is to eliminate any empty rows
                continue
            dataset.append(row)
    return dataset
```

Now we will prepare the dataset. That is, we convert the **label of the dataset from string to integer**, and convert the data values that are present as strings (in the CSV file) to floats.

```
# Convert a string column to float
def Str_Float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert labels to integers
def Label_Int(dataset, column):
    class_values = [row[column] for row in dataset]
    labels = set(class_values)
    L = dict()
    for i, label in enumerate(labels):
        L[label] = i
    for row in dataset:
        row[column] = L[row[column]]
    return L
```
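
A miniature run (three invented rows mimicking the CSV's string values) shows both conversions at work. Note that the integer assigned to each label depends on set iteration order, so only the set of assigned integers is checked:

```python
def Str_Float(dataset, column):
    # convert one column from string to float in place
    for row in dataset:
        row[column] = float(row[column].strip())

def Label_Int(dataset, column):
    # map each distinct label to an integer and rewrite the column in place
    class_values = [row[column] for row in dataset]
    labels = set(class_values)
    L = dict()
    for i, label in enumerate(labels):
        L[label] = i
    for row in dataset:
        row[column] = L[row[column]]
    return L

toy = [['0.02', 'R'], ['0.45 ', 'M'], ['0.11', 'R']]  # strings, as read from CSV
Str_Float(toy, 0)
mapping = Label_Int(toy, 1)
print(toy[1][0])        # 0.45, now a float (whitespace stripped)
print(sorted(mapping))  # ['M', 'R']
```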

**Test:** Now we will test our algorithm. We have downloaded the sd.csv (Sonar Dataset Case Study) file from the UC Irvine Machine Learning Repository.

**About the dataset:**

*NAME*: Sonar, Mines vs. Rocks

*SUMMARY*: This is the dataset used by Gorman and Sejnowski in their study of the classification of sonar signals using a neural network. The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.

*SOURCE*: The dataset was contributed to the benchmark collection by Terry Sejnowski, now at the Salk Institute and the University of California at San Diego. The dataset was developed in collaboration with R. Paul Gorman of Allied-Signal Aerospace Technology Center.

We used the custom-developed Load_File function to load the file, and subsequently converted the data to float and the labels to integer data types.

```
from math import sqrt  # required to compute the minimum number of features
# load and prepare the data
filename = '/Users/mohendra/Dropbox/My_Projects/Test/sd.csv'
dataset = Load_File(filename)
# convert string attributes to floats
for i in range(0, len(dataset[0]) - 1):
    Str_Float(dataset, i)
# convert the class column to integers
Label_Int(dataset, len(dataset[0]) - 1)
# evaluate the algorithm
number_of_folds = 5
maximum_branch = 10
minimum_sample = 1
subsample_ratio = 1.0
no_of_features = int(sqrt(len(dataset[0]) - 1))
for tree_number in [1, 5, 10]:
    scores = Evaluate_Algorithm(dataset, Random_Forest, number_of_folds, maximum_branch, minimum_sample, subsample_ratio, tree_number, no_of_features)
    print('Trees: %d' % tree_number)
    print('Scores: %s' % scores)
    print('Mean Accuracy: %.3f%%' % (sum(scores) / float(len(scores))))
```

**The output:** The script prints the fold scores and the mean accuracy for forests of 1, 5, and 10 trees.

**Advantages of Random Forest:**

1) This algorithm can solve both classification and regression problems.

2) It can handle large data sets.

**N.B.** Here we capitalized the names of the user-defined functions to differentiate them from the built-in functions.

**Acknowledgement:** I am deeply thankful to Jason Brownlee.

**References:**

[1] https://en.wikipedia.org/wiki/Forest

[2] http://scott.fortmann-roe.com/docs/BiasVariance.html

[3] https://en.wikipedia.org/wiki/Bootstrap_aggregating

[4] http://machinelearningmastery.com/blog/

[5] Machine Learning: A Probabilistic Perspective. A textbook by Kevin R Murphy. ISBN: 978-0-262-01802-9

**Note: Please comment if you find any mistakes or typos. Feel free to advise and make suggestions.**

Find the detailed description on GitHub:

https://github.com/mohendra/My_Projects/tree/master/Random_Forest