Second February Post

  • ynishimura73
  • Feb 14, 2018
  • 3 min read

Today, I learned how to show which model is most accurate without using the R-squared value. The goal is to be able to explain effectively, to someone who is not familiar with R-squared values, why the model I chose is the best predictive model. Even though I know why the graph is good, I have to explain the reasoning clearly in order to be understood.

Using the models I previously built, I experimented with them and made various graphs to see which one should be used.

By writing line 29 shown above after naming the variable "test" for the Linear Model, the sample data can also be exported in comma-separated values (CSV) format. Each cell in such a data file is separated by a special character, usually a comma, although other characters can be used as well. Similarly, I can make CSV files for the Random Forest and Decision Tree models. These files can be opened in Excel, where I am able to graph them.
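The post does not show the code itself, but the step can be sketched in Python, assuming a scikit-learn-style model and pandas for the export; the data, model, and file name here are all hypothetical stand-ins:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data standing in for the real dataset (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X, y)

# Put the real values and the model's predictions side by side,
# then export them as a CSV file that Excel can open directly.
test = pd.DataFrame({"real": y, "predicted": model.predict(X)})
test.to_csv("linear_model_predictions.csv", index=False)
```

The same two lines at the end would be repeated for the Random Forest and Decision Tree models, writing each to its own CSV file.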

The imported data looks like the screenshot above in Excel, since I pasted the file in such a way that the comma-separated values were split into columns. Because I used the same 80% of the raw data as training data for every model, even though the models are different, they all predict against the same set of real values.
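The shared 80/20 split described above can be sketched with scikit-learn's `train_test_split`; fixing `random_state` is one way to guarantee every model sees the same training rows and is scored against the same real test values (the data here is a hypothetical stand-in):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data in place of the real dataset (hypothetical).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, size=100)

# Same 80/20 split (same random_state) reused for every model,
# so each one is evaluated against the same real values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```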

Now, I would take some values to graph. For today, I used the first values. Column A contains the real values, Column B the predicted values from the Linear Model, Column C the predicted values from the Random Forest, and Column D the predicted values from the Decision Tree.

From the data, I am able to make graphs using different chart types. For example, a scatter plot looks like the graph above.

How can you tell if the predicted values are good?

If the values were predicted perfectly, the points of the scatter plot would fall on the line y = x, because the predicted values and real values would match each other exactly. After I add a trendline, I can compare it with the line y = x.
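This comparison can also be done numerically. A minimal sketch, assuming NumPy and made-up real/predicted values: fit a straight trendline (like Excel's "Add Trendline") and see how far its slope and intercept are from the ideal 1 and 0 of y = x.

```python
import numpy as np

# Hypothetical real vs. predicted values, close to but not exactly y = x.
rng = np.random.default_rng(2)
real = rng.uniform(0, 10, size=40)
predicted = real + rng.normal(0, 0.5, size=40)

# Fit a degree-1 trendline through the scatter of (real, predicted) points.
slope, intercept = np.polyfit(real, predicted, 1)

# Perfect predictions give slope 1 and intercept 0 (the line y = x);
# the gap from those values shows how far the model is from ideal.
print(f"trendline: y = {slope:.2f}x + {intercept:.2f}")
```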

As shown above, although the trendline, the dotted line, follows the trend of y = x, it is not quite on the same path, meaning the predictions are not fully accurate, but close.

By looking at the graphs from the different models, it is possible to guess which one is the most accurate, but it cannot be known for sure just by looking at them separately.

However, if we combine all the graphs into one, the difference becomes visible.

The black line represents the real data, the orange dotted line the predicted values from the Linear Model, the green dotted line the predicted values from the Random Forest, and the blue dotted line the predicted values from the Decision Tree. With all three combined in one graph, it can be seen that the orange line follows the black line most closely, meaning the Linear Model predicts the most accurately among the three models. However, one thing to remember is that the result changes depending on the values: in the graph, the blue line sits on the black line in some parts, so it is important to look at the overall result to determine which model is actually the best one.
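The combined chart above was made in Excel, but the same picture can be sketched with matplotlib; the values below are hypothetical stand-ins for the real and predicted columns:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display window needed
import matplotlib.pyplot as plt

# Hypothetical real values and predictions from the three models.
x = np.arange(20)
real = 2 * x + 3
linear = real + np.random.default_rng(3).normal(0, 1, 20)
forest = real + np.random.default_rng(4).normal(0, 3, 20)
tree = real + np.random.default_rng(5).normal(0, 4, 20)

# One combined chart: solid black line for the real data,
# dotted lines for each model's predictions.
plt.plot(x, real, "k-", label="Real")
plt.plot(x, linear, ":", color="orange", label="Linear Model")
plt.plot(x, forest, ":", color="green", label="Random Forest")
plt.plot(x, tree, ":", color="blue", label="Decision Tree")
plt.legend()
plt.savefig("combined_models.png")
```

The dotted line that stays closest to the solid black line across the whole range, not just in a few spots, is the model that predicts best overall.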

Today, I learned how to explain the result visually. Since there are times when my audience does not understand R-squared values, it is good to know other ways to present the conclusion.

Right now, I have been doing machine learning, but starting next week, I will hopefully shift my focus a little bit toward chatbots.
