Third January Post
- ynishimura73
- Jan 31, 2018
- 3 min read
Today, I first researched how to evaluate machine learning methods and predictive models, then continued with machine learning. Using the knowledge I've learned from my mentor, I created a new model called a Decision Tree in R. I used the same dataset as when I created the Linear Model and Random Forest, so that I could compare the models and decide which one is best.

As the picture above shows, I first made a variable called the train dataset, which contains 80% of the whole dataset, and a test dataset, which has the rest of the data. I explained this in the post "First January Post", but just to clarify, these datasets/variables look like the ones below.
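Since the screenshot isn't reproduced here, this is a minimal sketch of what that split could look like in R; the data frame name `dataset` and the use of sample() are assumptions, not the post's exact code — only the 80/20 proportions come from the post.

```r
# A sketch of the 80/20 train/test split described above
set.seed(123)                                     # assumed, for reproducibility
trainIndex <- sample(nrow(dataset), floor(0.8 * nrow(dataset)))
train <- dataset[trainIndex, ]                    # 80% of the rows
test  <- dataset[-trainIndex, ]                   # the remaining 20%
```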

We are predicting the column "Finance_Value". Lines 84 and 85 install and load a package called party, which provides nonparametric regression trees for nominal, ordinal, numeric, censored, and multivariate responses.
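Those two lines were most likely the standard install-and-load pair:

```r
install.packages("party")   # line 84: install the party package
library(party)              # line 85: load it so ctree() is available
```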

Next, I made new variables called trainCT and testCT, which are the same as the train and test datasets created before but without columns 4 and 7. Line 95 shows the function ctree(formula, data=), which is used to create a regression or classification tree. In this case, the formula's response is Finance_Value, the value I was trying to predict, and I used trainCT as the data. The period (.) after Finance_Value ~ means "everything else", so the features include all the remaining columns.
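Putting that together, the tree-fitting step probably looked something like this sketch; the dropped column indices and the variable names come from the post, but the exact code is an assumption.

```r
# Drop columns 4 and 7, then fit a conditional inference tree (line 95)
trainCT <- train[, -c(4, 7)]
testCT  <- test[, -c(4, 7)]
CT <- ctree(Finance_Value ~ ., data = trainCT)   # "." = all remaining columns
```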

Predicting with this model created a new column on the right that shows the prediction for Finance_Value. You can compare the predicted values with the actual values, which are in the column between Product and Month. However, in order to see how close the predictions were, I wrote an equation on line 99 that calculated the R-squared value.
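The prediction and R-squared steps might look like the following; the post's exact equation isn't shown, so the formula below is one standard way to compute R-squared, not necessarily the one on line 99.

```r
# Predict Finance_Value on the test rows, then compute R-squared (line 99)
testCT$prediction <- as.numeric(predict(CT, newdata = testCT))
DTr2 <- 1 - sum((testCT$Finance_Value - testCT$prediction)^2) /
            sum((testCT$Finance_Value - mean(testCT$Finance_Value))^2)
DTr2   # about 0.98 on this dataset, per the post
```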

The R-squared value of the Decision Tree is 0.98, which means the model's predictions are extremely close to the actual values. I called the R-squared value DTr2 just to differentiate it from the R-squared values of the other models.

What line 102 (plot(CT)) did is show visually how the Decision Tree split the data at each step. You can tell it took quite a few splits to predict the values.
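For reference, that line is just the base plotting call on the fitted tree object; party supplies a plot method for ctree results.

```r
plot(CT)   # line 102: draws the tree's splits and terminal-node distributions
```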
The last thing I did was write code that would tell me which of the three models I made was the best overall, meaning the one that predicted most accurately.

I wrote lines 106 and 107 so that the variable "winner" would state which model is best. The if-statement means: if the R-squared value of the Linear Model is greater than the R-squared values of the Random Forest and the Decision Tree, then the Linear Model is the best. If the R-squared value of the Decision Tree is greater than those of the Random Forest and the Linear Model, then the Decision Tree is the best. If both of those conditions are false, then the Random Forest is the best.
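The comparison was presumably a nested if-else like the sketch below; the R-squared variable names LMr2 and RFr2 for the other two models are assumptions patterned after DTr2.

```r
# Lines 106-107: name the model with the highest R-squared
# (LMr2 and RFr2 are assumed names for the other models' R-squared values)
if (LMr2 > RFr2 & LMr2 > DTr2) {
  winner <- "LM"                 # Linear Model wins
} else if (DTr2 > RFr2 & DTr2 > LMr2) {
  winner <- "DT"                 # Decision Tree wins
} else {
  winner <- "RF"                 # otherwise Random Forest wins
}
winner                           # printed "DT" in this run
```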

And when I ran it, it showed me "DT", meaning the Decision Tree was the best one among the models!
It was my first time coding by myself to create a model, and I was pretty unsure whether it would work. However, I used the materials I had learned previously to create the other models, so I was able to make it work. Depending on the dataset, the model that predicts most accurately changes, so it is necessary to try all the possible models and determine which one works best by setting up a comparison like the one I wrote on line 106.
It is so interesting to learn different models, and I like how what I'm predicting is real-world data!