Random Forest Algorithm and its application


Dr. Mohendra Roy

According to Wikipedia, a forest is a large area dominated by trees [1]. In machine learning, the same analogy is used for building ensemble algorithms such as the random forest. An ensemble technique combines a group of predictive models (such as decision trees) to achieve better accuracy and stability by avoiding bias and high variance. Here bias is ‘how much, on average, the predicted values differ from the actual values’, and variance is ‘how different the predictions of the model will be at the same point if different samples are taken from the same population’ [2].

Generally, we can increase the prediction accuracy by increasing the complexity of the decision tree (by increasing the branches or sub-nodes), which will lower the bias of the model. However, this will overfit the model, which in turn will suffer from high variance. Therefore, we must find a way to balance these two errors.

This is where the random forest comes in. The random forest is based on the bagging technique (bootstrap aggregating) [3]. It reduces the variance of the predictions by combining the results of multiple classifiers (trees) modelled on different sub-samples of the same dataset. In short, the random forest algorithm outputs the value predicted by the maximum number of trees in that forest.
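To make the bagging idea concrete, here is a minimal, self-contained sketch. The helper names bootstrap_sample and majority_vote are purely illustrative, not part of the chapter's code: each stand-in "model" simply votes the majority label of its own bootstrap sample, and the forest then takes a majority vote over the models.

```python
from random import randrange, seed

def bootstrap_sample(dataset):
    # draw a sample of the same size as the dataset, with replacement
    return [dataset[randrange(len(dataset))] for _ in range(len(dataset))]

def majority_vote(predictions):
    # the most frequent prediction wins
    return max(set(predictions), key=predictions.count)

seed(1)
data = [[0.2, 'R'], [0.4, 'R'], [0.9, 'M'], [0.8, 'M']]
samples = [bootstrap_sample(data) for _ in range(5)]
# a trivial stand-in "model": predict the most common label in its own sample
predictions = [majority_vote([row[-1] for row in s]) for s in samples]
print(majority_vote(predictions))
```

Real trees, of course, do far better than this toy "model", but the bootstrap-then-vote structure is exactly what the functions below implement.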

In this chapter, we will discuss and develop the Random Forest algorithm from scratch.

The concise steps of the algorithm are as follows.

1) Fold the dataset into N sub-datasets. These N sub-datasets (drawn with replacement) will then be used to train the random forest of n trees, with the minimum number of features per tree = sqrt(number of features in the dataset).

2) Grow each tree to the largest extent possible using a cost function (here we will use the Gini index).

3) Take the most frequently predicted output of the trees (for a classification problem) or the average of the predicted outputs (for a regression problem).

For the training and testing of this algorithm, we will use the sonar dataset. The sonar dataset contains 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions. It also contains 97 patterns obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The dataset contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock.

Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during the chirp.

The label associated with each record contains the letter “R” if the object is a rock and “M” if it is a mine (metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.

We will design our model on top of this dataset to differentiate between rock and metal, and test it using test samples from the same dataset. Here we go …

Note: Please follow the GitHub link for the code or the HTML documentation. 



The Random_Forest function:

In [1]:
def Random_Forest(training_data, test_data, maximum_branch, minimum_sample, subsample_ratio, tree_number, no_of_features):
	trees = list()
	for i in range(tree_number):
		sub_sample = Make_Subsample(training_data, subsample_ratio) # "Make_Subsample" is a user defined function 
		tree = Make_Tree(sub_sample, maximum_branch, minimum_sample, no_of_features) # "Make_Tree" is a user defined function
		trees.append(tree) # collect every trained tree in the forest
	output = [Voted_Output(trees, row) for row in test_data] # "Voted_Output" is a user defined function
	return output

Here the “training_data” argument is the dataset used for training the “Random_Forest” function. Subsequently, “test_data” is the dataset used to test the performance of the algorithm.

maximum_branch is the argument for the maximum allowed branches or depth in a tree. Similarly, minimum_sample is the minimum number of samples required in a node for further branching from that node.

subsample_ratio is the ratio at which the training sample will be randomly divided into subsamples for the training of the trees.

tree_number is the total number of trees present in the random forest.

no_of_features is the number of random features that a tree will use for training. Each tree will use a different set of features, and the minimum number of features required for training a tree is given by [4] —

Number of Features = sqrt(Number of features in a row of the dataset)
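For the sonar data, each row holds 60 measurements plus a label, so this rule works out as follows (a quick sanity check, mirroring the no_of_features computation used later in this chapter):

```python
from math import sqrt

row_length = 61  # 60 frequency-band measurements + 1 label
no_of_features = int(sqrt(row_length - 1))  # sqrt(60) ≈ 7.75, truncated
print(no_of_features)  # → 7
```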

Note: Here we use the following user-defined functions:

1) Make_Subsample, 2) Make_Tree, 3) Voted_Output

Further, we will explain these functions.

In [2]:
# function for making subsample from the training sample. 

from random import randrange # "randrange" generates random number for a given range

def Make_Subsample(dataset, subsample_ratio):
	sample = list()
	n_sample = round(len(dataset) * subsample_ratio)
	while len(sample) < n_sample:
		index = randrange(len(dataset))
		sample.append(dataset[index]) # sampling with replacement
	return sample

The above function will return a random subset of the given dataset.
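A quick, self-contained demonstration of the subsampling (the toy data list below is purely illustrative):

```python
from random import randrange, seed

# A self-contained version of Make_Subsample, for illustration.
def Make_Subsample(dataset, subsample_ratio):
    sample = list()
    n_sample = round(len(dataset) * subsample_ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])  # sampling with replacement
    return sample

seed(2)
data = [[i] for i in range(10)]
sub = Make_Subsample(data, 0.5)
print(len(sub))  # → 5: half the dataset size, rows possibly repeated
```

Because the sampling is with replacement, the same row can appear in a subsample more than once; this is what makes the samples "bootstrap" samples.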

In [3]:
# Function for making tree

def Make_Tree(training_data, maximum_branch, minimum_sample, no_of_features):
	root = Make_Label(training_data, no_of_features) # "Make_Label" is a user defined function 
	Child_Split(root, maximum_branch, minimum_sample, no_of_features, 1) # Child_Split is a user defined function
	return root

The above function will make the tree. In this function, we used the Make_Label and Child_Split functions. We will discuss these after the Voted_Output function.

In [4]:
# Function for calculating voted output 
def Voted_Output(trees, test_row):
	p = [Predict(tree, test_row) for tree in trees] # "Predict" is a user defined function
	return max(set(p), key=p.count) # This will return the maximum occurred number in the row in **"p"** provided by the
# Predict function

In the above function, the “trees” argument is the list of trees returned by the Make_Tree function. Again, in the above function we used the Predict user-defined function. The details of the function are given below.
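To see how the vote itself behaves, here is a tiny standalone example of the max(set(p), key=p.count) expression (the list p below is made up):

```python
# Per-tree predictions for one test row, e.g. from five trees:
p = [1, 0, 1, 1, 0]
vote = max(set(p), key=p.count)  # counts each distinct value in p
print(vote)  # → 1, since three of the five trees predicted 1
```

Note that when two labels are predicted equally often, which one wins is arbitrary, since sets are unordered; with an odd number of trees on a two-class problem this cannot happen.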

In [7]:
# Function to predict the test row to be in a particular group

def Predict(node, test_row):
	if test_row[node['index']] < node['value']:
		if isinstance(node['left'], dict): 
			return Predict(node['left'], test_row)
		else:
			return node['left']
	else:
		if isinstance(node['right'], dict):
			return Predict(node['right'], test_row)
		else:
			return node['right']

Here the argument “node” is the dictionary returned by the Make_Tree function. It contains the “index” and “value” of the split, the label (or label value) of a terminal node and, where applicable, nested dictionaries of the same form for the sub-branches.

Note: the “isinstance” function returns whether an object is an instance of a class (or of a subclass thereof). Here, if the left (or right) child is itself a dictionary (dict), the “Predict” function is applied to it again.
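The recursion is easiest to see on a small hand-built tree. The sketch below uses a self-contained Predict (with explicit else branches) and a purely illustrative two-level tree dictionary:

```python
# Self-contained copy of Predict, for the trace below.
def Predict(node, test_row):
    if test_row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return Predict(node['left'], test_row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return Predict(node['right'], test_row)
        else:
            return node['right']

# Hand-built tree: split on feature 0 at 0.5; the left child splits
# again on feature 1 at 0.3.
tree = {'index': 0, 'value': 0.5,
        'left': {'index': 1, 'value': 0.3, 'left': 'R', 'right': 'M'},
        'right': 'M'}

print(Predict(tree, [0.2, 0.1]))  # → R  (0.2 < 0.5, then 0.1 < 0.3)
print(Predict(tree, [0.9, 0.1]))  # → M  (0.9 >= 0.5)
```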

In [6]:
# Function to make label

def Make_Label(dataset, no_of_features):
	class_labels = list(set(row[-1] for row in dataset))
	bindex, bvalue, bscore, bgroups = 1000, 1000, 1000, None # Initial values of the dictionary entries 
	features = list()
	while len(features) < no_of_features:
		index = randrange(len(dataset[0])-1)
		if index not in features:
			features.append(index) # collect unique random feature indices
	for index in features:
		for row in dataset:
			groups = Data_Split(index, row[index], dataset) # "Data_Split" is a user defined function
			gini = Gini_Index(groups, class_labels) # "Gini_Index" is a user defined function 
			if gini < bscore:
				bindex, bvalue, bscore, bgroups = index, row[index], gini, groups
	return {'index':bindex, 'value':bvalue, 'groups':bgroups}

Here “class_labels” contains the distinct labels (or label numbers) found in the last column of the dataset (train set).

Here we used the Data_Split and Gini_Index user-defined functions. These are explained below.

In [8]:
# Function to split the data based on the index and value that have supplied to it.

def Data_Split(index, value, dataset):
	left, right = list(), list()
	for row in dataset:
		if row[index] < value:
			left.append(row)
		else:
			right.append(row)
	return left, right

In [18]:
# function for making the sub nodes by splitting the parent node to obtain branches of the tree and eventually end node.

def Child_Split(node, maximum_branch, minimum_sample, no_of_features, current_depth):
	left, right = node['groups']
	del(node['groups']) # the raw rows are no longer needed on this node
	# check for a no split
	if not left or not right:
		node['left'] = node['right'] = Terminate(left + right) # "Terminate" is a user defined function
		return
	# check for max depth
	if current_depth >= maximum_branch:
		node['left'], node['right'] = Terminate(left), Terminate(right)
		return
	# process left child
	if len(left) <= minimum_sample:
		node['left'] = Terminate(left)
	else:
		node['left'] = Make_Label(left, no_of_features)
		Child_Split(node['left'], maximum_branch, minimum_sample, no_of_features, current_depth+1)
	# process right child
	if len(right) <= minimum_sample:
		node['right'] = Terminate(right)
	else:
		node['right'] = Make_Label(right, no_of_features)
		Child_Split(node['right'], maximum_branch, minimum_sample, no_of_features, current_depth+1)

Here we used a user-defined function called Terminate.

In [10]:
def Terminate(group):
	outcomes = [row[-1] for row in group]
	return max(set(outcomes), key=outcomes.count)

The “return max(set(outcomes), key=outcomes.count)” statement returns the most frequently occurring value in the outcomes list.

We know that the Gini index is one of the cost functions used in the decision tree algorithm. It is given by:

Gini = sum over i = 1 … n of sum over c = 1 … m of P(c, i) × (1 − P(c, i))

Where P(c, i) is the probability of class c in group i (with “i” ranging from 1 to n, the number of groups, and the class c ranging from 1 to m). We can implement this as a function as described below.

In [11]:
# function to determine the Gini Index
def Gini_Index(groups, class_labels):
	gini = 0.0
	for class_value in class_labels:
		for group in groups:
			size = len(group)
			if size == 0:
				continue # This is to avoid the empty groups
			prob = [row[-1] for row in group].count(class_value) / float(size)
			gini += (prob * (1.0 - prob))
	return gini
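As a quick sanity check of the function, consider a toy split with one pure group and one perfectly mixed group (the rows below are made up):

```python
# Self-contained copy of Gini_Index, for the check below.
def Gini_Index(groups, class_labels):
    gini = 0.0
    for class_value in class_labels:
        for group in groups:
            size = len(group)
            if size == 0:
                continue  # avoid empty groups
            prob = [row[-1] for row in group].count(class_value) / float(size)
            gini += prob * (1.0 - prob)
    return gini

# One pure group ('R','R') and one perfectly mixed group ('R','M'):
groups = ([[0.1, 'R'], [0.2, 'R']], [[0.3, 'R'], [0.4, 'M']])
print(Gini_Index(groups, ['R', 'M']))  # → 0.5 (0.0 from the pure group, 0.25 + 0.25 from the mixed one)
```

A pure group contributes nothing to the score, so a split that fully separates the classes has Gini = 0 and will be preferred by Make_Label.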

So far, we have built the random forest algorithm. Now we will write a function to evaluate its performance.

In [12]:
# Function to evaluate the Random Forest algorithm 

def Evaluate_Algorithm(dataset, algorithm, number_of_folds, *args): # *args contain the list of arguments
	folds = Make_Fold(dataset, number_of_folds) # "Make_Fold" is a user defined function
	scores = list()
	for fold in folds:
		train_set = list(folds)
		train_set.remove(fold) # this is because we will use the current fold as the test set
		train_set = sum(train_set, []) # this merges the remaining folds into one training set (a single matrix)
		test_set = list()
		for row in fold:
			row_copy = list(row)
			row_copy[-1] = None # remove the label from the test set
			test_set.append(row_copy)
		predicted = algorithm(train_set, test_set, *args) # later we will pass "Random_Forest" as the "algorithm"
		actual = [row[-1] for row in fold] # copy the labels of the test set
		accuracy = Accuracy(actual, predicted) # "Accuracy" is a user defined function
		scores.append(accuracy)
	return scores

Here we used the Make_Fold and Accuracy user-defined functions.

In [13]:
#  function to Make Fold of the database to get the training and testing sub dataset 

def Make_Fold(dataset, number_of_folds):
	dataset_split = list()
	dataset_copy = list(dataset)
	fold_size = int(len(dataset) / number_of_folds)
	for i in range(number_of_folds):
		fold = list()
		while len(fold) < fold_size:
			index = randrange(len(dataset_copy))
			fold.append(dataset_copy.pop(index)) # move a random row into the fold (without replacement)
		dataset_split.append(fold)
	return dataset_split

In [14]:
# function to find the accuracy between the predicted and actual data

def Accuracy(actual, predicted):
	correct = 0
	for i in range(len(actual)):
		if actual[i] == predicted[i]:
			correct += 1
	return correct / float(len(actual)) * 100.0
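For example (with a made-up pair of label lists):

```python
# Self-contained copy of Accuracy, for the example below.
def Accuracy(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

print(Accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # → 75.0 (three of four labels match)
```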

Now we will write functions to load the file and prepare the dataset for use in the Random Forest algorithm.

In [15]:
# Function for importing CSV file.
from csv import reader # reader will read the file

# Load a CSV file
def Load_File(filename):
	dataset = list()
	with open(filename, 'r') as file: # open the file in read-only mode
		data = reader(file)
		for row in data:
			if not row: # skip any empty row
				continue
			dataset.append(row)
	return dataset

Now we will prepare the dataset. That is, we convert the labels of the dataset from strings to integers, and convert the data values that are present as strings (in the CSV file) to floats.

In [16]:
# Convert string column to float
def Str_Float(dataset, column):
	for row in dataset:
		row[column] = float(row[column].strip())

# Convert labels to integer
def Label_Int(dataset, column):
	class_values = [row[column] for row in dataset]
	labels= set(class_values)
	L = dict()
	for i, label in enumerate(labels):
		L[label] = i
	for row in dataset:
		row[column] = L[row[column]]
	return L
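A toy run of the two helpers on CSV-style string rows (the rows below are made up; note that the integer assigned to each label depends on set iteration order, so only the grouping is guaranteed):

```python
# Self-contained copies of the two conversion helpers.
def Str_Float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

def Label_Int(dataset, column):
    class_values = [row[column] for row in dataset]
    labels = set(class_values)
    L = dict()
    for i, label in enumerate(labels):
        L[label] = i
    for row in dataset:
        row[column] = L[row[column]]
    return L

rows = [['0.02', 'R'], ['0.45', 'M'], ['0.17', 'R']]
Str_Float(rows, 0)            # column 0: strings → floats
mapping = Label_Int(rows, 1)  # column 1: 'R'/'M' → 0/1 (order not guaranteed)
print(rows)
```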

Test: Now we will test our algorithm. We have downloaded the sd.csv file (Sonar Dataset Case Study) from the UC Irvine Machine Learning Repository.

About the dataset:

NAME: Sonar, Mines vs. Rocks

SUMMARY: This is the data set used by Gorman and Sejnowski in their study
of the classification of sonar signals using a neural network. The
task is to train a network to discriminate between sonar signals bounced
off a metal cylinder and those bounced off a roughly cylindrical rock.

SOURCE: The data set was contributed to the benchmark collection by Terry
Sejnowski, now at the Salk Institute and the University of California at
San Diego. The data set was developed in collaboration with R. Paul
Gorman of Allied-Signal Aerospace Technology Center.

We used the custom-developed Load_File function to load the file, and subsequently converted the data to float and the labels to integer data types.

In [19]:
from math import sqrt # required to find the minimum number of features

# load and prepare data
filename = '/Users/mohendra/Dropbox/My_Projects/Test/sd.csv'
dataset = Load_File(filename)

# convert string attributes to integers
for i in range(0, len(dataset[0])-1):
	Str_Float(dataset, i)
# convert class column to integers
Label_Int(dataset, len(dataset[0])-1)

# evaluate algorithm

number_of_folds = 5
maximum_branch = 10
minimum_sample = 1
subsample_ratio = 1.0

no_of_features = int(sqrt(len(dataset[0])-1))

for tree_number in [1, 5, 10]:
	scores = Evaluate_Algorithm(dataset, Random_Forest, number_of_folds, maximum_branch, minimum_sample, subsample_ratio, tree_number, no_of_features)
	print('Trees: %d' % tree_number)
	print('Scores: %s' % scores)
	print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))
The output:
Trees: 1
Scores: [65.85365853658537, 68.29268292682927, 56.09756097560976, 56.09756097560976, 53.65853658536586]
Mean Accuracy: 60.000%
Trees: 5
Scores: [63.41463414634146, 53.65853658536586, 53.65853658536586, 58.536585365853654, 75.60975609756098]
Mean Accuracy: 60.976%
Trees: 10
Scores: [65.85365853658537, 70.73170731707317, 63.41463414634146, 58.536585365853654, 63.41463414634146]
Mean Accuracy: 64.390%

Advantages of Random Forest:

1) This algorithm can solve both classification and regression problems.

2) It can handle large data sets.

N.B. Here we capitalized the names of user-defined functions to differentiate them from built-in functions.

Acknowledgement: I am deeply thankful to Jason Brownlee.


[1] https://en.wikipedia.org/wiki/Forest

[2] http://scott.fortmann-roe.com/docs/BiasVariance.html

[3] https://en.wikipedia.org/wiki/Bootstrap_aggregating

[4] http://machinelearningmastery.com/blog/

[5] Machine Learning: A Probabilistic Perspective. A textbook by Kevin R Murphy. ISBN: 978-0-262-01802-9

Note: Please comment if you find any mistakes or typos. Feel free to advise and make suggestions.
