Understanding Regression Trees: A Simple Guide to Machine Learning

Introduction to Regression Trees Without Jargon

This article aims to clarify the regression tree model in machine learning without the use of technical jargon or complex mathematical concepts. You don’t need any prior knowledge in computer science or mathematics to grasp the concepts discussed.

Decision trees are a popular machine learning technique most often used for classification tasks, but a variant known as the regression tree predicts continuous numerical outcomes instead. Here, I will introduce the regression tree in a way that’s accessible to everyone, regardless of their background in data science.

You won’t need a degree in mathematics to follow along! :)

Sample Data Scenario

Let’s create a hypothetical scenario. Most of us would agree that a student's study time correlates with their final exam scores. Imagine we surveyed a group of primary school students about their weekly study hours and gathered the following dataset, which pairs those hours with the corresponding exam scores.

While this scenario is idealized, it’s important to recognize that exam scores aren’t influenced by study hours alone. Nevertheless, it serves to illustrate the regression tree model. Another key point is that the data does not exhibit a linear relationship, as depicted in the following figure.

Creating Branches in a Regression Tree

The foundation of constructing a regression tree lies in determining how to split the nodes, which can be simplified as creating branches.

What Does a Branch Represent?

To illustrate, let’s assume we want to establish our first branch by partitioning the data based on the condition "Study Hours < 3.1". This value, 3.1, lies between the 7th and 8th data points as shown below.

As illustrated, the red vertical line (Study Hours = 3.1) divides the dataset into two regions: left and right.

For each region, we can compute the average score. The average for the left group of 7 data points is 15.96, while for the right group of 17 data points, it is 87.65. (A small code sketch of this step follows the list below.)

With this branching, we conclude that:

  • Students studying less than 3.1 hours are predicted to score 15.96.
  • Students studying 3.1 hours or more are predicted to score 87.65.
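
To make this concrete, here is a minimal sketch of that computation in Python. The dataset below is a made-up stand-in (the article's actual survey values are not reproduced here), but the logic of splitting on a threshold and predicting each region's mean is the same.

```python
# Hypothetical stand-in data: weekly study hours and exam scores.
study_hours = [0.5, 1.0, 1.4, 1.9, 2.3, 2.7, 3.0, 3.5, 4.1, 4.8, 5.5, 6.2]
exam_scores = [1.4, 6.0, 12.0, 18.0, 22.0, 26.0, 30.0, 72.0, 80.0, 85.0, 90.0, 94.0]

threshold = 3.1  # the split condition "Study Hours < 3.1"

# Partition the scores into the two regions created by the split.
left_scores  = [s for h, s in zip(study_hours, exam_scores) if h < threshold]
right_scores = [s for h, s in zip(study_hours, exam_scores) if h >= threshold]

# The prediction for each region is simply the mean score of that region.
left_prediction  = sum(left_scores) / len(left_scores)
right_prediction = sum(right_scores) / len(right_scores)

print(f"Study Hours < {threshold}:  predicted score {left_prediction:.2f}")
print(f"Study Hours >= {threshold}: predicted score {right_prediction:.2f}")
```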

Now, how do we determine the optimal split condition? In other words, why choose the value between the 7th and 8th points instead of the 8th and 9th, or others? There are 24 values, providing 23 potential split candidates.

The fundamental criterion is straightforward:

  • Minimize errors for each split.

Error Measurement

To minimize errors, we need a method for measuring them.

Clearly, our predictions will incur errors. For instance, according to our dataset, a student who studies 0.5 hours per week achieved only 1.4 marks (as depicted below). Yet, our regression tree predicts a score of 15.96.

Thus, the error is the difference between the predicted and actual values: 15.96 - 1.4 = 14.56. The dotted lines in the following figure represent these errors.

So, how do we identify the BEST splitting condition? We will evaluate every potential split within the dataset and select the one that results in the least error.

Observing the animation, it becomes clear that the total error is minimized when we split between the 7th and 8th points. Here, the total error is the sum of the squared differences between the actual and predicted scores, and even at its minimum it is still a substantial number.
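
This exhaustive search can be written down in a few lines. The sketch below (again using made-up stand-in data) tries every midpoint between consecutive study-hour values, measures the total squared error of the two resulting regions, and keeps the threshold with the smallest error.

```python
def squared_error(scores):
    """Total squared error when the group's mean is used as the prediction."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores)

def best_split(hours, scores):
    """Return (threshold, error) of the split with the least total squared error."""
    pairs = sorted(zip(hours, scores))
    best_threshold, best_error = None, float("inf")
    # Candidate thresholds are the midpoints between consecutive data points.
    for (h1, _), (h2, _) in zip(pairs, pairs[1:]):
        if h1 == h2:
            continue
        threshold = (h1 + h2) / 2
        left  = [s for h, s in pairs if h < threshold]
        right = [s for h, s in pairs if h >= threshold]
        error = squared_error(left) + squared_error(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error

# Stand-in data, as before.
study_hours = [0.5, 1.0, 1.4, 1.9, 2.3, 2.7, 3.0, 3.5, 4.1, 4.8, 5.5, 6.2]
exam_scores = [1.4, 6.0, 12.0, 18.0, 22.0, 26.0, 30.0, 72.0, 80.0, 85.0, 90.0, 94.0]

# The best threshold for this stand-in data also falls between the 7th and 8th points.
print(best_split(study_hours, exam_scores))
```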

Continuing the Splitting Process

Even though the condition "Study Hours < 3.1" is optimal for the first split, the total error it leaves behind is still considerable. We need to keep splitting the tree to refine our model.

After the previous split, we have two branches: one with 7 samples on the left and another with 17 samples on the right.

Applying the same logic to the left and right samples separately, we can identify the best splits for both groups as follows.

Now, if a student studies for only 0.5 hours, we would predict a score of 4.82, compared to the actual score of 1.4. We have successfully enhanced our regression tree model through continued splitting.
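
Continuing the process is just a matter of applying the same split search to each new group, which is naturally expressed as recursion. The sketch below reuses the best_split helper and stand-in data from the previous snippet; the dictionary format for the tree nodes is an arbitrary choice for illustration, not a standard library structure.

```python
def build_tree(hours, scores):
    """Recursively split a group until it can no longer be divided."""
    prediction = sum(scores) / len(scores)
    # A group with a single sample (or identical study hours) becomes a leaf.
    if len(scores) < 2 or len(set(hours)) == 1:
        return {"predict": round(prediction, 2)}
    threshold, _ = best_split(hours, scores)  # helper from the previous sketch
    left  = [(h, s) for h, s in zip(hours, scores) if h < threshold]
    right = [(h, s) for h, s in zip(hours, scores) if h >= threshold]
    return {
        "threshold": threshold,
        "left":  build_tree([h for h, _ in left],  [s for _, s in left]),
        "right": build_tree([h for h, _ in right], [s for _, s in right]),
    }

def predict(tree, hours):
    """Walk from the root down to a leaf to get a prediction."""
    while "threshold" in tree:
        tree = tree["left"] if hours < tree["threshold"] else tree["right"]
    return tree["predict"]

tree = build_tree(study_hours, exam_scores)
print(predict(tree, 0.5))  # a finer-grained prediction for 0.5 study hours
```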

Understanding Overfitting

We now comprehend how a regression tree is constructed. Following the outlined logic, we can continually refine our model for better predictions.

However, have you noticed a potential issue?

Examining the previous figure, we can continue splitting the second branch, which contains 2 samples averaging 43.8. This leads to two new branches:

  • For study hours between 2.3 and 2.7, a predicted score of 61.7.
  • For study hours between 2.7 and 3.1, a predicted score of 25.9.

If we validate the tree using our dataset, we find that the error for these two samples is ZERO!

However, this suggests a flaw. It seems our regression tree indicates that students studying slightly more hours will receive lower marks, which is illogical.

This anomaly arises because our original dataset is imperfect, a reality often encountered in practice. We can’t attribute a student’s exam scores solely to their study hours. Some students may naturally excel and require less study time to achieve similar results.

In this scenario, our model fits the training data perfectly but fails in real-world applications, a phenomenon known as overfitting.

A decision tree model, whether for classification or regression, can in principle achieve 100% accuracy on the training data: simply keep growing branches until every leaf contains a single sample. However, such a tree lacks practical utility, and we must strive to mitigate the overfitting issue.
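
To see this concretely, here is a small demonstration, assuming scikit-learn is installed and using the same kind of made-up stand-in data as before. An unconstrained DecisionTreeRegressor keeps splitting until every leaf holds a single sample and therefore reproduces the training scores exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Stand-in training data: study hours (a single feature column) and exam scores.
X = np.array([[0.5], [1.0], [1.4], [1.9], [2.3], [2.7], [3.0],
              [3.5], [4.1], [4.8], [5.5], [6.2]])
y = np.array([1.4, 6.0, 12.0, 18.0, 22.0, 26.0, 30.0,
              72.0, 80.0, 85.0, 90.0, 94.0])

# With no limits, the tree grows until every leaf contains one sample.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, y)

print(tree.predict(X))   # reproduces the training scores exactly
print(tree.score(X, y))  # R^2 of 1.0 on the training data
```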

Strategies to Combat Overfitting

While overfitting can rarely be avoided entirely, several strategies can help reduce its severity. Here are two commonly employed methods.

Limit the Tree Depth

By examining the tree constructed above, we can easily define its depth: the root of the tree sits at depth 0, and each subsequent split increases the depth by one.

To mitigate overfitting, we can impose a maximum depth limit, for instance, setting the max depth to 2. This would halt the tree construction upon reaching this limit.
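
In the hand-rolled sketch from earlier, this limit is just one more stopping condition: pass the current depth down through the recursion and turn a group into a leaf once the limit is reached. The max_depth name below is my own choice, echoing what tree libraries typically call this parameter.

```python
def build_tree_limited(hours, scores, depth=0, max_depth=2):
    """Same recursive splitting as before, but growth stops at max_depth."""
    prediction = sum(scores) / len(scores)
    if depth >= max_depth or len(scores) < 2 or len(set(hours)) == 1:
        return {"predict": round(prediction, 2)}
    threshold, _ = best_split(hours, scores)  # helper from the earlier sketch
    left  = [(h, s) for h, s in zip(hours, scores) if h < threshold]
    right = [(h, s) for h, s in zip(hours, scores) if h >= threshold]
    return {
        "threshold": threshold,
        "left":  build_tree_limited([h for h, _ in left],  [s for _, s in left],
                                    depth + 1, max_depth),
        "right": build_tree_limited([h for h, _ in right], [s for _, s in right],
                                    depth + 1, max_depth),
    }
```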

Limit the Number of Samples at Leaf Nodes

Another effective method to curb overfitting is to limit the number of samples at the leaf nodes. In the tree, the green boxes represent the leaf nodes, which signify the endpoints used for prediction.

If we establish a rule that each leaf must contain more than 5 samples, the resulting tree would look as follows.

In this regression tree, no leaf node contains fewer than 5 samples. Additionally, the first leaf node, which averages 15.96 and contains 7 samples, will not be split further, because we know one of the resulting branches would hold only 2 samples.

In practice, these two methods may be combined to effectively reduce overfitting.
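
In practice you would usually not hand-roll any of this. Assuming scikit-learn is available, both constraints are single constructor arguments and can be applied together; note that min_samples_leaf=5 enforces at least 5 samples per leaf, a close analogue of the rule described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Stand-in data, as in the earlier sketches.
X = np.array([[0.5], [1.0], [1.4], [1.9], [2.3], [2.7], [3.0],
              [3.5], [4.1], [4.8], [5.5], [6.2]])
y = np.array([1.4, 6.0, 12.0, 18.0, 22.0, 26.0, 30.0,
              72.0, 80.0, 85.0, 90.0, 94.0])

# Cap the depth at 2 and require at least 5 samples in every leaf.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=5, random_state=0)
tree.fit(X, y)

print(tree.predict([[0.5], [4.0]]))  # coarser, but less overfitted, predictions
```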

Conclusion

In this article, I aimed to elucidate the regression tree algorithm without resorting to formulas, equations, or complex terminology. I believe understanding the mechanics and intuition of this machine learning model should be accessible to individuals without a computer science or mathematics background.

The focus has been on constructing a regression tree and determining the optimal split conditions. Additionally, I have introduced the overfitting problem and common strategies to mitigate its effects.