The random forest is one of the most widely used machine learning algorithms. It works by training multiple decision trees and combining their outputs, merging the individual trees into a ‘forest’ that ultimately produces a single result. Random forests are popular because of their flexibility and ease of use; they work extremely well on both classification and regression problems.
As mentioned earlier, random forests are made up of multiple decision trees. A decision tree is built around one basic question. From this initial question, follow-on questions are asked until the chain of questions leads to a single result. The follow-on questions are the decision nodes of the tree; each one splits the data further so the tree can reach a reasonable result. The path taken depends on whether the data answers a question one way or the other. The final decisions at the ends of all paths are the leaf nodes. A tree seeks the splits that form the most accurate subsets of the data, which is done by training it with the Classification and Regression Tree (CART) algorithm.
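The chain of one basic question followed by decision nodes and leaf nodes can be sketched as a tiny hand-written tree. The task, feature names, and thresholds below are invented purely for illustration; a trained tree would learn its splits from data via CART.

```python
# A hand-rolled decision tree for a toy "should I play outside?" task.
# Each decision node asks one yes/no question about the input; a branch
# leads either to another decision node or to a leaf (the final class).

def classify_weather(temperature_c: float, is_raining: bool) -> str:
    # Root decision node: the one basic question the tree starts from.
    if is_raining:
        return "stay inside"       # leaf node
    # Follow-on decision node: splits the remaining data further.
    if temperature_c >= 15:
        return "play outside"      # leaf node
    return "wear a jacket"         # leaf node

print(classify_weather(20, False))  # "play outside"
print(classify_weather(20, True))   # "stay inside"
```

Each call follows exactly one path from the root to a leaf, which is why a single tree's answer depends entirely on where its splits were placed.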
Any random forest algorithm has three main hyperparameters that need to be determined and set before any training or testing of the model can begin: node size, number of trees, and number of features sampled. Node size is the minimum number of samples a node must contain before it can be split; smaller node sizes permit more splitting, so the data is divided into more, finer subsets. Number of trees is fairly self-explanatory, as it is the number of independent decision trees in the forest; the more trees present, the larger the classifier. Finally, the number of features sampled plays an integral role, as it helps determine the depth and diversity of each tree. Beyond these three main hyperparameters, others may need tuning to maximise the output and accuracy of the model. These are case by case and require a sound understanding of the dataset, and potentially running the random forest algorithm over the dataset a few times.
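A minimal sketch of how those three hyperparameters shape the per-tree sampling, assuming a pure-Python training loop. The function name, the dictionary layout, and the square-root rule of thumb for the number of features sampled are illustrative assumptions, not any particular library's API.

```python
import math
import random

def plan_forest(n_samples: int, n_features: int,
                n_trees: int = 100, min_node_size: int = 5):
    """Sketch the per-tree sampling implied by the three hyperparameters."""
    # Number of features sampled: a common rule of thumb is sqrt(n_features).
    n_features_sampled = max(1, int(math.sqrt(n_features)))
    plans = []
    for _ in range(n_trees):                     # number of trees
        # Bootstrap: sample rows with replacement for this tree.
        rows = [random.randrange(n_samples) for _ in range(n_samples)]
        # Random feature subset this tree may consider at its splits.
        feats = random.sample(range(n_features), n_features_sampled)
        plans.append({"rows": rows, "features": feats,
                      "min_node_size": min_node_size})  # stop splitting below this
    return plans

plans = plan_forest(n_samples=150, n_features=16, n_trees=10)
print(len(plans), len(plans[0]["features"]))  # 10 trees, 4 features each
```

Because each tree sees a different bootstrap sample and a different feature subset, the trees end up decorrelated, which is what the ensemble relies on.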
Any random forest works in two phases: the first builds the forest by combining decisions from all the decision trees in the algorithm, while the second makes a prediction with each tree. For the first phase, random data points are drawn from the training set and a decision tree is built on each such sample; this effectively creates the subsets into which the selected data points can be classified. Once everything is built by the algorithm, data can be fed through the decision trees, where it is subjected to each tree's features and ‘questions’. Each decision tree outputs a prediction, which is passed into the random forest classifier. Once all decision trees have produced a predicted result, majority voting is performed and the most popular class is chosen as the final class.
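The second phase described above, with each tree predicting independently and the forest taking a majority vote, can be sketched as follows. The stub "trees" are plain functions standing in for trained decision trees; the sample format is an assumption for illustration.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Aggregate per-tree class predictions by majority vote."""
    votes = [tree(sample) for tree in trees]   # each tree predicts on its own
    # Majority voting: the most common class becomes the final class.
    return Counter(votes).most_common(1)[0][0]

# Three stub "trees" standing in for trained decision trees.
trees = [
    lambda s: "cat" if s["whiskers"] else "dog",
    lambda s: "cat",
    lambda s: "dog",
]
print(forest_predict(trees, {"whiskers": True}))  # "cat" (2 of 3 votes)
```

Note that the final class needs only a plurality of the votes, so one eccentric tree cannot override the rest of the forest.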
The random forest algorithm has proven essential for today’s data scientists, as it offers numerous features and great flexibility while remaining relatively straightforward to understand. As with any algorithm, there are benefits and obstacles to using it.
While already being relatively easy to use, the random forest algorithm reduces the risk of overfitting. Decision trees on their own tend to overfit: the model fits its training samples too exactly, generalises poorly to new data, and is likely to cause inaccuracies and problems down the line. A random forest mitigates this problem because it contains multiple decision trees, usually quite a substantial number of them. The classifier therefore avoids overfitting, as more uncorrelated decision trees lower both the potential for error and the variance of the predictions. Another key benefit behind the random forest's immense popularity is its flexibility: the fact that the algorithm can handle both regression and classification problems makes its versatility extremely attractive to data scientists. A further reason the random forest is so useful is feature bagging, which lets the classifier estimate missing data while maintaining very high accuracy.
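The variance-reduction claim can be checked numerically: averaging many independent, unbiased noisy predictors shrinks the variance of the ensemble by roughly a factor of the number of predictors. The simulation below is a sketch of that statistical effect under the assumption of fully uncorrelated "trees", not a full random forest.

```python
import random
import statistics

random.seed(0)
TRUTH = 10.0

def noisy_tree() -> float:
    # One "tree": an unbiased but high-variance predictor of TRUTH.
    return TRUTH + random.gauss(0, 2.0)

def ensemble(n_trees: int) -> float:
    # The "forest": the average of n_trees independent tree predictions.
    return sum(noisy_tree() for _ in range(n_trees)) / n_trees

single = [noisy_tree() for _ in range(5000)]
forest = [ensemble(50) for _ in range(5000)]

print(statistics.pvariance(single))  # ~4.0  (sigma^2 for one tree)
print(statistics.pvariance(forest))  # ~0.08 (roughly sigma^2 / 50)
```

Real trees built on overlapping bootstrap samples are only partially decorrelated, so the reduction is smaller in practice, but the direction of the effect is the same.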
Like any algorithm, the random forest suffers from a few shortcomings. A relatively important obstacle experienced when using it is that it can be time consuming, especially when there are large amounts of data to process; this is the trade-off for high accuracy. Another inconvenience is that random forests require a lot of resources and storage space, usually a by-product of having such large datasets. These two factors make the algorithm somewhat problematic when you require real-time processing and predictions from your model. If run time is an essential parameter, a random forest would not be ideal.
The flexibility and simplicity of random forests make them suitable for a variety of fields, from banking to healthcare. In healthcare, random forests help medical professionals make accurate predictions about the extent of a patient’s condition from the patient’s medical history and current parameters, allowing doctors to predict or determine whether the patient is suffering from a particular disease. Random forests are also seen in the stock market, where stock prices can be predicted by tracking a particular stock’s trends. Because the algorithm is so flexible, many parameters can be modified or added to produce price predictions accurate enough to let traders pick the stocks most likely to perform well.