AOB Analytics is a successful Irish analytics company with plans to expand internationally into the American market. A limited budget challenges the company to look at creative ways to target this overseas audience with its business analytics service offering.
The Marketing Team has opted to use the digital arena to create awareness online using Guerrilla Marketing (Shock Advertising). They need to find a topic of interest which would (a) grab the attention of the American audience, (b) display business analysis capabilities on a sample dataset to highlight the service on offer and (c) promote it in a way that is controversial enough to generate word of mouth.
Choosing a Topic of Interest: The Dataset
What evokes empathy, emotion and passion more than a true story? What true story do we know of that sparks interest with Irish and American audiences alike? Whether it’s through family tree lines or James Cameron’s production and direction of the 1997 Hollywood adaptation, everyone knows the story of the Titanic. Pulling on Irish and American heartstrings with a true story like the tragic maiden voyage of the R.M.S. Titanic to launch an Irish business overseas is sure to generate word of mouth online.
About our Dataset
The R.M.S. Titanic dataset is an ideal source, as it lets us illustrate our predictive analytics capabilities by using machine learning to show which variables are associated with ‘survival’. We want to use our analysis (in RStudio) to highlight that being a survivor is associated with a person’s ‘class’. To do this, we will use regression and decision tree models to show the impact of multiple variables on the survival rate and then display it graphically.
We have access to the Titanic dataset through a data science site called Kaggle. The data has been split into two datasets, a training set and a test set, so that we can analyse which groups are more likely to survive. The training set includes the outcome for each passenger and is used to build the data mining model. The test set is used to validate the model.
- The training set has 891 observations (rows) and 12 variables (columns).
- The test set has 418 observations and 11 variables (survival variable removed so we can predict it).
- Variables include:
- PassengerId – Auto-incrementing integer ID
- Survived – Survival (0 = No; 1 = Yes). Not included in test.csv file.
- Pclass – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Name – Name
- Sex – Sex
- Age – Age
- Sibsp – Number of Siblings/Spouses Aboard
- Parch – Number of Parents/Children Aboard
- Ticket – Ticket Number
- Fare – Passenger Fare
- Cabin – Cabin
- Embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
The information can be examined in detail by (a) opening the csv files or (b) importing the files into RStudio, viewing the summary in the ‘Environment’ pane on the top right and examining it further with the ‘str’ function. Example of the latter below:
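A minimal sketch of option (b). It assumes the Kaggle training file would be saved locally as `train.csv` (a hypothetical file name); so that the snippet runs without the file, it builds a three-row stand-in with made-up values and the 12 Kaggle column names, then inspects it with `str`.

```r
# Tiny stand-in for the training data: three rows of illustrative values,
# not the real Kaggle file, using the 12 column names of the training set.
train <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0L, 1L, 1L),
  Pclass      = c(3L, 1L, 3L),
  Name        = c("Passenger A", "Passenger B", "Passenger C"),
  Sex         = c("male", "female", "female"),
  Age         = c(22, 38, NA),
  SibSp       = c(1L, 1L, 0L),
  Parch       = c(0L, 0L, 0L),
  Ticket      = c("T1", "T2", "T3"),
  Fare        = c(7.25, 71.28, 7.92),
  Cabin       = c(NA, "C85", NA),
  Embarked    = c("S", "C", "S"),
  stringsAsFactors = FALSE
)

# With the real file this would instead be (assumed file name):
# train <- read.csv("train.csv", stringsAsFactors = FALSE)

str(train)   # type and first values of every column
dim(train)   # number of rows, number of columns
```

On the real training file, `dim` should report the 891 rows and 12 columns described above.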
We predict that variables such as gender, age and class will impact survival, based on the knowledge that women, children and wealthier passengers were prioritised once the ship hit the iceberg. We will draw on these insights to focus on ‘class’.
Looking at the data structure, we can see that there are a number of missing values. The data needs to be cleansed, either by removing the unknown values or by substituting an average value, to prevent distortions. For example, there are 177 NAs for Age in the training set, and Cabin is missing 1014 cells across both datasets.
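The two cleansing options can be sketched as follows. The data frame here is a made-up five-row example, not the Kaggle data, and treating empty Cabin strings as missing is an assumption about how the file was imported.

```r
# Illustrative stand-in data; values are made up for the example.
passengers <- data.frame(
  Age   = c(22, NA, 38, NA, 4),
  Cabin = c("", "C85", "", "", "G6"),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing, then count missing values per column.
passengers$Cabin[passengers$Cabin == ""] <- NA
missing_counts <- colSums(is.na(passengers))
print(missing_counts)

# Option 1: remove the rows with an unknown age.
complete_age <- passengers[!is.na(passengers$Age), ]

# Option 2: replace an unknown age with the mean of the known ages.
imputed <- passengers
imputed$Age[is.na(imputed$Age)] <- mean(passengers$Age, na.rm = TRUE)
print(imputed$Age)
```

Removing rows keeps only observed values but shrinks the dataset; substituting the mean keeps every passenger at the cost of flattening the age distribution.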
- Logistic Regression
- Decision Tree (Classification Model)
- Multiple Linear Regression
- Q-Q plot to check the data for normality and view any outliers
1. Logistic Regression
For our predictive model we began with the assumption that everyone died (integer 0); the training dataset shows that 342 of the 891 passengers survived. We then mined the data for the Sex variable. The majority of the passengers are male (577 male to 314 female), but our proportion table shows that 74% of the 314 females survived compared with only 19% of the 577 males. This shows that being female is a factor in favour of survival.
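A proportion table of this kind can be sketched with `prop.table`. The row totals (314 female, 577 male) and the 342 survivors come from the text above; the four cell counts are an assumption chosen to be consistent with those totals, so treat them as illustrative rather than as output from the original session.

```r
# Counts by sex and outcome (cell values are an assumption consistent with
# the totals quoted in the text: 314 female, 577 male, 342 survivors).
counts <- matrix(
  c(81, 233,    # female: died, survived
    468, 109),  # male:   died, survived
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("female", "male"), Survived = c("0", "1"))
)

# Baseline: proportion of all passengers who survived.
overall <- sum(counts[, "1"]) / sum(counts)
print(round(overall, 2))

# Row-wise proportions: survival rate within each sex.
by_sex <- prop.table(counts, margin = 1)
print(round(by_sex, 2))
```

With the real data frame, the same table would typically come from `prop.table(table(train$Sex, train$Survived), 1)`, where `train` is the imported training set.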
From there we looked at the Age data to see how many children (i.e. under 18) survived: 38 female and 23 male. Analysing this as proportions, 69% of female children and 39% of male children survived, indicating again that gender influences survival chances.
Note: We have assumed that the 177 passengers with a missing age (NA) are of mean age, and therefore not children, so they have been assigned the integer 0.
From the analysis below it is evident that gender and class impact survival rates. The highest survival rates are for females paying 20-30 or 30+ in first or second class cabins. A female in first class paying 30+ has a 97% chance of survival, whereas a female paying 30+ in third class has a 12% chance. Among males, those in first class have the highest chance of survival.
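A breakdown by fare band, class and sex can be sketched with `cut` and `aggregate`. The eight passengers below are made up for illustration, and the `<10` and `10-20` bands are an assumption (the text only names 20-30 and 30+), as is the column name `FareBand`.

```r
# Made-up passengers for illustration only (not the Kaggle data).
passengers <- data.frame(
  Survived = c(1, 1, 1, 0, 0, 1, 0, 0),
  Sex      = c("female", "female", "female", "female",
               "male", "male", "male", "male"),
  Pclass   = c(1, 1, 3, 3, 1, 1, 3, 3),
  Fare     = c(80, 55, 8, 35, 60, 90, 7, 25)
)

# Bin the fare into bands; right = FALSE puts a fare of exactly 30 in "30+".
passengers$FareBand <- cut(
  passengers$Fare,
  breaks = c(-Inf, 10, 20, 30, Inf),
  labels = c("<10", "10-20", "20-30", "30+"),
  right  = FALSE
)

# Survival rate for each observed combination of fare band, class and sex.
rates <- aggregate(Survived ~ FareBand + Pclass + Sex,
                   data = passengers, FUN = mean)
print(rates)
```

Because `Survived` is coded 0/1, its mean within each group is that group's survival rate.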
2. Decision Tree: To display our data graphically we used a Classification Decision Tree (this required installing new packages, which is the reason for the red font in the console script below).
Looking at the Decision Tree in more detail below (diagram 5), the root node illustrates our prediction that ‘everyone will die’, i.e. 0. It shows that this is true for 62% of passengers and false for 38%, indicating an overall survival rate of 38%.
Further down the tree, the first split is on gender, as it gives the purest nodes. In the left node (male) 81% die and 19% survive; in the right node (female) 26% die and 74% survive. This illustrates a large difference in survival based on Sex.
Looking at the terminal nodes for males, age has a big part to play. Of males over the age of 6.5, 83% die, whereas for those under 6.5 years the figure falls to 33%, giving a 67% survival rate (this group is 3% of the overall dataset).
For females, the branches delve deeper into passenger class, fare and port of embarkation. Those in a passenger class greater than 2.5 (i.e. third class) and paying more than £23 have a 59% survival rate.
This decision tree leads us to reject the H0 that everyone dies and illustrates that sex, age and class each play a part in the survival rate.
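A classification tree of this kind can be sketched with the `rpart` package. This is not the original session: the training data below is made up so that sex alone determines the outcome, which makes the first split easy to see, and the package actually used for the diagram is not stated in the text.

```r
library(rpart)  # recommended package, normally installed alongside R

# Made-up training data: 20 men who died and 20 women who survived.
train <- data.frame(
  Survived = rep(c(0, 1), each = 20),
  Sex      = factor(rep(c("male", "female"), each = 20)),
  Pclass   = rep(c(3, 1, 2, 3), times = 10),
  Age      = rep(c(30, 45, 8, 22, 60), times = 8)
)

# method = "class" requests a classification tree (predicts 0 or 1).
fit <- rpart(Survived ~ Sex + Pclass + Age, data = train, method = "class")
print(fit)  # text view of the root node and each split

# Predict the class for two unseen, hypothetical passengers.
new_passengers <- data.frame(
  Sex    = factor(c("male", "female"), levels = levels(train$Sex)),
  Pclass = c(3, 1),
  Age    = c(40, 25)
)
predicted <- predict(fit, new_passengers, type = "class")
print(predicted)
```

On the real training set the tree would go on to split on variables such as age, class and fare, as described above.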
3. Multiple Linear Regression
To test further which variables are significant against the hypothesis that nobody survives, I used a Multiple Linear Regression model.
- Analysis shows three-star (***) significance for the variables pclass, sexmale and age, and two-star (**) significance for sibling or spouse, in their impact on the survival rate.
- Adjusted R-squared is below 0.7. This measures the goodness of fit, so the model explains only part of the variation in survival; it is not in itself a test of H0.
- The p-value of the overall model is < 2.2e-16; because p < 0.05, we reject the null hypothesis that the result is a fluke.
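The three quantities read from the regression output (significance of each variable, adjusted R-squared and the overall p-value) can be pulled out of `summary(lm(...))` as sketched below. The data is simulated for illustration, so the numbers it produces are not those of the Titanic model.

```r
set.seed(1)  # reproducible simulated data; not the Kaggle training set
n <- 200
passengers <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)
# Simulated outcome: survival is likelier for women and for higher classes.
p <- 0.9 - 0.15 * passengers$Pclass - 0.4 * (passengers$Sex == "male")
passengers$Survived <- rbinom(n, 1, p)

model <- lm(Survived ~ Pclass + Sex + Age, data = passengers)
s <- summary(model)

print(s$coefficients)      # estimate, std. error, t value and p-value per term
adj_r2 <- s$adj.r.squared  # goodness of fit, adjusted for number of predictors
f <- s$fstatistic          # overall F-test: value, numerator df, denominator df
overall_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
print(c(adj_r2 = adj_r2, overall_p = overall_p))
```

R names the coefficient `Sexmale` because `female` is the reference level of the factor. In the printed summary, `***` marks p < 0.001 and `**` marks p < 0.01.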
4. Scatterplot Matrix
The diagram below highlights the scatterplot matrix of the variables I want to focus on: P.Class, Sex and Age…
The model built on the training set has been applied to the test set to generate predictions:
    # PREDICTION FROM DECISION TREE CLASSIFICATION
    Prediction <- predict(fit, titanic_test.csv, type = "class")
    submit <- data.frame(PassengerId = titanic_test.csv$PassengerId, Survived = Prediction)
    write.csv(submit, file = "myfirstdtree.csv", row.names = FALSE)
Data Reporting for Business Purposes:
AOB Analytics has generated numerous aggregate charts, structured tables, proportion tables, decision tree classification and scatter plot charts, which can be used for reporting purposes. This data can be exported to various file formats from RStudio for use by the business.
Further research conducted by History.com and Column Five illustrates that class impacted the survival rate: 63% of first class passengers survived in comparison to 25% in third class.
A fourth variable which was not considered is the crew. However, the crew was not included in the dataset we used, so it is not a constraint and has not impacted the results.
Achieving Business Objective
Now we’re back to our business problem.
- Grab attention?
- Highlight business service?
- Generate word of mouth (controversial)?
Do you think you know your business?
What if you’re just looking at the tip of the iceberg?
Don’t let your sturdy ship sink…
Use AOB Analytics to delve deeper into your business.
Your future is in your hands.
View our analysis of the Titanic dataset, using predictive analytics and machine learning, to understand why ‘class’ ‘paid’ off for Titanic survivors.
Predictive analytics can be used for understanding your business, making better strategic decisions, generating revenue and increasing profit.