
First January Post

  • ynishimura73
  • Jan 17, 2018
  • 4 min read

Today, I used R, the programming language I have been working with, to compare several models in order to figure out which one predicts the most accurately and is the most useful. My mentor gave me a dataset that contained 7,344 rows with 6 variables (entity, product, finance value, year, month, and time index). After importing the dataset into RStudio, I named it "DemoHistorical", and it appears in the box on the top left like below.
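A sketch of the import step. I don't have the original file, so this snippet writes a tiny stand-in CSV first (the column names are taken from the post); replace the path with your own file.

```r
# Write a tiny stand-in CSV so the snippet runs on its own.
path <- tempfile(fileext = ".csv")
writeLines(c("Entity,Product,Finance_Value,Year,Month,TimeIndex",
             "A,P1,100,2017,Jan,1",
             "B,P2,250,2017,Feb,2"), path)

# Read the file into a data frame, keeping text columns as characters.
DemoHistorical <- read.csv(path, stringsAsFactors = FALSE)
str(DemoHistorical)   # check the column names and types that were read in
```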

Right now, "Entity", "Product", and "Month" are read in as "characters", meaning R does not treat them as categories, so to fix that, I wrote the code below to convert the characters into "factors". "$" picks out one column of a data frame, and "<-" is the assignment operator. (My previous blog post gives more of an introduction to RStudio.)
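A sketch of the factor conversion, using a small stand-in data frame so the snippet runs on its own (the column names are taken from the post):

```r
# Stand-in for the real DemoHistorical data frame.
DemoHistorical <- data.frame(Entity = c("A", "B"), Product = c("P1", "P2"),
                             Month = c("Jan", "Feb"),
                             stringsAsFactors = FALSE)

# "$" picks out one column of the data frame; "<-" assigns to it.
DemoHistorical$Entity  <- as.factor(DemoHistorical$Entity)
DemoHistorical$Product <- as.factor(DemoHistorical$Product)
DemoHistorical$Month   <- as.factor(DemoHistorical$Month)

str(DemoHistorical)   # all three columns are now factors
```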

We now make a train dataset that contains 80% of the rows and a test dataset that contains the remaining 20%. The model is fit on the train dataset and then makes predictions on the test dataset, so the closer those predictions are to the actual test values, the better the model is.

Lines 11 through 14 define the "train" and "test" variables, and these variables produce tables like the ones below.
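A sketch of an 80/20 split. I don't have the original data or the exact split method from lines 11 through 14, so this uses a stand-in data frame and simple random sampling:

```r
set.seed(123)   # make the random split reproducible

# Stand-in data frame with 100 rows.
DemoHistorical <- data.frame(TimeIndex = 1:100,
                             Finance_Value = rnorm(100))

n   <- nrow(DemoHistorical)
idx <- sample(n, size = floor(0.8 * n))   # 80% of the row indices
train <- DemoHistorical[idx, ]            # 80% of the rows
test  <- DemoHistorical[-idx, ]           # the remaining 20%

nrow(train)   # 80 rows
nrow(test)    # 20 rows
```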

Now, we are ready to make some models and test them out!

#1 Linear Model:

A linear model is basically a straight line fitted through the dataset.

In line 20, we make a variable called LM, which holds an equation that sets up a linear model; we want the column Finance_Value to be the one that gets predicted, using the train dataset. LMPred in line 23 is a variable that predicts the finance value for the test dataset using the equation we got in LM. As line 25 suggests, we then make a new column called Prediction that contains the values of LMPred.
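A sketch of those steps with stand-in data; the post's exact formula isn't shown, so predicting Finance_Value from TimeIndex alone is an assumption here:

```r
set.seed(1)

# Stand-in train and test sets with a roughly linear relationship.
train <- data.frame(TimeIndex = 1:80)
train$Finance_Value <- 2 * train$TimeIndex + rnorm(80)
test  <- data.frame(TimeIndex = 81:100)
test$Finance_Value  <- 2 * test$TimeIndex + rnorm(20)

LM     <- lm(Finance_Value ~ TimeIndex, data = train)  # fit on the train set
LMPred <- predict(LM, newdata = test)                  # predict on the test set
test$Prediction <- LMPred                              # new Prediction column
head(test)
```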

As the table above shows, a new column has been added on the right. We can visually compare the predictions with the actual values that appear under the column Finance_Value. The predicted values are actually good except for the third and last numbers. Lines 29 and 30 calculate R-squared, which is "a statistical measure of how close the data are to the fitted regression line" (defined by https://www.linkedin.com/pulse/regression-analysis-how-do-i-interpret-r-squared-assess-gaurhari-dass). We can say our model predicts more accurately as R-squared gets closer to 1. For this linear model, our R-squared value is 0.737, meaning there may be a better one.
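One common way to compute R-squared by hand is 1 - SSE/SST, where SSE is the residual sum of squares and SST the total sum of squares; the post doesn't show lines 29 and 30, so this is a guess at the calculation, with made-up toy numbers just to show the arithmetic:

```r
actual    <- c(1, 2, 3, 4)   # toy "Finance_Value" column
predicted <- c(1, 2, 3, 5)   # toy "Prediction" column

SSE <- sum((actual - predicted)^2)      # residual sum of squares: 1
SST <- sum((actual - mean(actual))^2)   # total sum of squares: 5
Rsq <- 1 - SSE / SST                    # 0.8
Rsq
```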

#2 Random Forest:

The Random Forest algorithm builds multiple decision trees and combines their predictions. Generally, the more decision trees there are, the higher the accuracy gets.

To start off, I made new variables called trainRF and testRF, which are the same as the previous variables train and test, except that this time I took out columns 4 and 7 (year and prediction) using [-c(column#)], because the year column caused problems in the random forest model. The new table looks like the one below. Nothing else really changed.
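A sketch of dropping columns by position with negative indexing, using a toy data frame since I don't have the original one (the post drops columns 4 and 7):

```r
# Toy data frame with 7 columns standing in for the real train set.
df <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7)

trainRF <- df[, -c(4, 7)]   # keep everything except the 4th and 7th columns
names(trainRF)              # "a" "b" "c" "e" "f"
```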

Now, a new variable named AI is created in line 42. By the way, the names of variables can be anything, but they should be short. AI predicts the finance value using the values of trainRF with 100 trees. The number of trees is up to you, but more trees generally provide better results. Line 44 makes a plot that shows the most important predictors.

The plot above is the outcome, and it shows that the values of Product play the greatest role in predicting the financial value in this case. Then we make a new variable called CD, which predicts the financial value using testRF, and line 49 makes a new column that displays the predictions.
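A sketch of those random forest steps, assuming the randomForest package is installed and that trainRF and testRF are the data frames from above (I don't have the original data, so this won't run on its own):

```r
library(randomForest)

# Fit a random forest with 100 trees on the train set.
AI <- randomForest(Finance_Value ~ ., data = trainRF, ntree = 100)
varImpPlot(AI)                        # plot the most important predictors

CD <- predict(AI, newdata = testRF)   # predict on the test set
testRF$Prediction <- CD               # store the predictions in a new column
```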

Our predictions here seem better than the linear model's, except for the third and last values again. Indeed, the R-squared value calculated in line 52 for this Random Forest model is 0.876, which is much higher than in the linear model attempt.

#3 Regression Tree:

A regression tree basically divides the data based on whether they fit a condition. For example, if the splitting question is whether the value is bigger than 5, the values that are bigger than 5 go to the left and the rest go to the right.

To start off, we have to grow a tree, which can be done by writing rpart(formula, data= , method= , control= ) and assigning the result to a new variable S. We grow a tree for the finance value using the method "anova", which is the one used for regression trees, on the train data. Line 62 makes a plot that shows how many splits give the lowest error without overfitting.
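A sketch of growing the tree, assuming the rpart package is installed and that train is the data frame from above (the control argument is left at its default here, since the post doesn't show it):

```r
library(rpart)

# method = "anova" means a regression tree (continuous outcome).
S <- rpart(Finance_Value ~ ., data = train, method = "anova")
plotcp(S)   # cross-validated error against tree size
```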

According to the plot, it seems like 12 splits make the least amount of error. The next line shows a summary of how the tree divided the dataset at each split. It is long, so I will only show the first split as an example.

This means that the first split divides the dataset depending on when the product was sold: data for products sold before mid-May 2017 go to the left. This can be read from TimeIndex and Year. Lines 65 and 67 again make plots, this time showing the relationship between the R-squared value and the number of splits.
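A guess at those plots: rsq.rpart() in the rpart package draws R-squared (and relative error) against the number of splits for an "anova" tree, assuming the fitted tree S from the step above:

```r
# Plot R-squared and relative error versus the number of splits.
rsq.rpart(S)
```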

The last line draws the regression tree:

This basically shows how the tree was split.
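A sketch of drawing the tree itself with base rpart plotting functions, again assuming the fitted tree S from above:

```r
plot(S, uniform = TRUE)           # draw the branch structure
text(S, use.n = TRUE, cex = 0.8)  # label the splits and leaf values
```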

Conclusion:

As our second model, the Random Forest model, had the highest R-squared value among these three models, we can conclude that the Random Forest model should be used to predict the financial value based on the entity, product, year, and month.

In order to figure out which model works best, we have to try all the plausible models for each dataset, because the answer varies depending on the variables and the kind of prediction. It takes time, but it is definitely important to try all of them in order to make the most accurate prediction we can.

 
 
 



©2017 BY STEAM RESEARCH. PROUDLY CREATED WITH WIX.COM
