First December Post
- ynishimura73
- Dec 5, 2017
- 2 min read
Today, I read a Kaggle article titled "Exploring the Titanic Dataset" (https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic/notebook) as a starter for machine learning and learned how R can be used to find out all kinds of things about a dataset.
There is a full set of data given: the names of all 891 passengers, whether they survived or not, their passenger class, their ages, their ticket numbers, how much they paid for their tickets, and so on (surprisingly, the data is REAL!). For survival, passengers are marked 0 if they did not survive and 1 if they survived.
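Just to get a feel for it, here is roughly what loading the data and checking those columns looks like in R. This is my own sketch, not code from the article, and it assumes train.csv from the Kaggle Titanic competition has been downloaded into the working directory:

```r
# Load the Titanic training data (assumes train.csv is in the working directory)
train <- read.csv("train.csv", stringsAsFactors = FALSE)

str(train)             # 891 rows: Survived, Pclass, Name, Age, Fare, Embarked, ...
table(train$Survived)  # 0 = did not survive, 1 = survived
```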
First, from each name, it is possible to grab just the title and the surname. Also, by looking at surnames, we can check whether families sink or swim together and graph the outcome.
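A name in the data looks like "Braund, Mr. Owen Harris", so the title and surname can be pulled out with a little string handling. This is my own sketch of the idea; the exact code in the article may differ:

```r
# Title: drop everything before ", " and everything from the first "." onwards
train$Title   <- gsub('(.*, )|(\\..*)', '', train$Name)

# Surname: everything before the first comma
train$Surname <- sapply(train$Name,
                        function(x) strsplit(x, split = '[,.]')[[1]][1])

table(train$Title)  # Mr, Mrs, Miss, Master, plus a handful of rare titles
```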
There is a lot of missing information in the data. However, instead of just ignoring the missing values, it is possible to fill them in by comparing them with the information that does exist and making an educated guess. For example, we are missing the embarkment port of a passenger, but we know that his class is 1 (first class) and that he paid $80. By gathering the same information for all the other passengers, we are able to graph:

The graph compares fare against embarkment port, and within each port there is one box for each of the three passenger classes. Knowing that the passenger paid $80 and is in first class, we can assume that he probably embarked at C, which stands for Cherbourg Port.
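A plot along those lines can be made with ggplot2. Again, this is only my sketch of what the article's graph shows, with a dashed line at the $80 fare:

```r
# Fare by embarkment port, one box per passenger class, reference line at $80
library(ggplot2)

ggplot(subset(train, Embarked != ""),
       aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
  geom_boxplot() +
  geom_hline(yintercept = 80, colour = "red", linetype = "dashed")
```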
In machine learning, predictive imputation is very important, since data is not always perfect but we cannot simply throw away the rows with missing values. In the article, they build a model that predicts Age from the other variables and use it to fill in the missing Age values.
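The article does this with the mice package. The sketch below is my guess at a minimal version; the variables I keep and the random-forest method are assumptions on my part, not a copy of the original code:

```r
# Predictive imputation of Age with mice (requires the mice package)
library(mice)

train$Sex <- factor(train$Sex)  # mice works with factors rather than strings
vars      <- c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare")

mice_mod  <- mice(train[vars], method = "rf", seed = 129)
train$Age <- complete(mice_mod)$Age  # replace missing ages with model-based guesses
```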
This time, I did not get to type the code myself, but just reading the article, with its pictures of code and graphs, helped me understand what is going on and how they did it. By next week, I will follow the steps, try to code my own version, and see if it works.