Business Analytics: Using the CRISP-DM Methodology

Business Understanding

Business Problem

AOB Analytics is a successful Irish analytics company with plans to expand internationally into the American market. A limited budget challenges the company to look at creative ways to target this overseas audience with its business analytics service offering.

The Marketing Team has opted to build awareness online using Guerrilla Marketing (Shock Advertising). They need a topic of interest that would (a) grab the attention of the American audience, (b) showcase our business analysis capabilities on a sample dataset to highlight the service on offer and (c) be controversial enough to generate word of mouth.

Choosing a Topic of Interest - The Dataset:

What evokes empathy, emotion and passion more than a true story? What true story sparks interest with Irish and American audiences alike? Whether it’s through family trees or James Cameron’s 1997 Hollywood adaptation, everyone knows the story of ‘The Titanic’. Pulling on Irish and American heartstrings with a true story like the tragic maiden voyage of the R.M.S. Titanic to launch an Irish business overseas is sure to generate word of mouth online.

Diagram 1: Titanic route between Ireland and the US

About our Dataset 

The R.M.S. Titanic dataset is an ideal source, as it lets us illustrate our predictive analytics capabilities by showing which variables are associated with ‘survival’ using machine learning. We want to use our analysis (in RStudio) to highlight that being a survivor is associated with the person’s ‘class’. To do this, we will use regression and decision tree models to show the impact of multiple variables on the survival rate and then display it graphically.


Data Understanding

We have access to the Titanic dataset through a data science site called Kaggle. To analyse which groups are more likely to survive, the data has been split into two datasets: a training set and a test set. The training set includes the outcome for each passenger and is used to build the data mining model. The test set is used to validate the model.

  • The training set has 891 observations (rows) and 12 variables (columns).
  • The test set has 418 observations and 11 variables (survival variable removed so we can predict it)
  • Variables include:
    1. PassengerId – Auto-incrementing integer identifier
    2. Survived – Survival (0 = No; 1 = Yes). Not included in test.csv file.
    3. Pclass – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
    4. Name – Name
    5. Sex – Sex
    6. Age – Age
    7. SibSp – Number of Siblings/Spouses Aboard
    8. Parch – Number of Parents/Children Aboard
    9. Ticket – Ticket Number
    10. Fare – Passenger Fare
    11. Cabin – Cabin
    12. Embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

The information can be examined in detail by (a) opening the csv files directly or (b) importing the files into RStudio, viewing the summary in the ‘Environment’ pane on the top right and examining it further with the ‘str’ function. Example of the latter below:

Data Structure
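A minimal sketch of that inspection step in base R. The three rows here are a made-up stand-in for train.csv so the snippet runs on its own; with the real file you would call read.csv("train.csv") instead, assuming it sits in your working directory.

```r
# Tiny inline stand-in for train.csv (illustrative rows, not the real file).
csv_text <- "PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.92"

train <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(train)      # type and first values of every column
summary(train)  # ranges, means and NA counts for numeric columns
```

The blank Age in the last row is read as NA, which is how missing ages show up in the summary.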

We predict that variables such as gender, age and class will impact survival, based on the knowledge that women, children and wealthier passengers were prioritised once the ship hit the iceberg. We will draw on these insights to focus on ‘class’.


Data Preparation

Looking at the data structure we can see that there are a number of missing values. The data needs to be cleansed, either by removing the unknowns or by applying an average value, to prevent distortions. For example, there are 177 NAs for Age in the training set, and Cabin is missing 1014 values across both datasets.
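The two cleansing options can be sketched as follows. The rows are illustrative only, and treating an empty string as a missing Cabin is an assumption about how the csv is read in.

```r
# Illustrative rows only; the real training set has 891 rows.
passengers <- data.frame(
  Age   = c(22, NA, 38, NA, 30),
  Cabin = c("", "C85", "", "", "E46"),
  stringsAsFactors = FALSE
)

# Count missing values: NA for Age, empty string for Cabin (assumed encoding).
missing_age   <- sum(is.na(passengers$Age))
missing_cabin <- sum(passengers$Cabin == "")

# Option 1: drop rows with an unknown age.
complete_rows <- passengers[!is.na(passengers$Age), ]

# Option 2: fill unknown ages with the mean of the known ages.
imputed <- passengers
imputed$Age[is.na(imputed$Age)] <- mean(passengers$Age, na.rm = TRUE)
```

Dropping rows keeps only observed values but shrinks the dataset; mean imputation keeps every row at the cost of flattening the age distribution.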


Data Modelling

  1. Logistic Regression
    • Gender
    • Age
    • Class
  2. Decision Tree (Classification Model)
  3. Multiple Linear Regression
  4. Q-Q plot to check normality and look for any outliers

Data Evaluation

  1. Logistic Regression

(a) Gender:

For our predictive model we began with the assumption that everyone died (Survived = 0); in fact, 342 of the 891 passengers in our training dataset survived. We then mined the data on the Gender/Sex variable. The majority of the dataset is male (577 male : 314 female), yet our proportion table shows that 74% of the 314 females survived compared with only 19% of the 577 males. Being female is therefore a strong factor in favour of survival.

Diagram 2: Analysis of the ratio of male to female survival after the sinking of the Titanic
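A sketch of how such a proportion table can be built in base R. To keep it self-contained, the Sex and Survived columns are rebuilt from counts rather than read from the csv. The per-sex survivor counts (233 females, 109 males) come from the public Kaggle training set rather than from the text above; they add up to the 342 survivors already mentioned.

```r
# Rebuild Sex and Survived columns from counts: 233 of 314 females and
# 109 of 577 males survived (342 of 891 overall).
sex      <- rep(c("female", "male"), times = c(314, 577))
survived <- c(rep(c(1, 0), times = c(233, 81)),
              rep(c(1, 0), times = c(109, 468)))

counts <- table(Sex = sex, Survived = survived)

# Row-wise proportions: share of each sex that died (0) or survived (1).
rates <- prop.table(counts, margin = 1)
print(round(rates, 2))
```

Setting margin = 1 makes each row sum to 1, so the figures read as "of all females, what share survived" rather than as shares of the whole ship.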

(b) Age:

From there we looked at the Age data to understand how many children (i.e. under 18) survived: 38 female and 23 male. Looking at proportions, 69% of female children and 39% of male children survived, indicating again that gender influences survival chances.

Note: We have assumed that the 177 passengers with a missing (NA) age are of mean age, i.e. adults, so they have been assigned a 0 (not a child).

Diagram 3: Impact of the variable Age on survival of the Titanic
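The child flag and the per-group survival rates can be sketched like this. The rows are illustrative only, and the column name Child is an assumption chosen for the example.

```r
# Illustrative rows only, including two unknown ages.
passengers <- data.frame(
  Sex      = c("female", "female", "male", "male", "male", "female"),
  Age      = c(4, 30, 10, NA, 45, NA),
  Survived = c(1, 1, 0, 0, 0, 1)
)

# Child indicator: start at 0, then flag known ages under 18.
# Rows with an NA age stay at 0, i.e. they are treated as adults.
passengers$Child <- 0
passengers$Child[!is.na(passengers$Age) & passengers$Age < 18] <- 1

# Survival rate for every Child x Sex combination.
child_rates <- aggregate(Survived ~ Child + Sex, data = passengers, FUN = mean)
print(child_rates)
```

Because Survived is coded 0/1, taking the mean within each group gives the proportion that survived directly.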

(c) Fare:

From the analysis below it is evident that gender and class impact survival rates. The highest survival rates are for females paying 20-30 or 30+ in first or second class cabins. A female in first class paying 30+ has a 97% chance of survival, whereas a female paying 30+ in third class has only a 12% chance. Among males, those in first class have the highest chance of survival.

Diagram 4: Survival rate from Titanic based on class
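One way to produce this kind of breakdown is to bin the fare into bands and aggregate. This is a sketch on illustrative rows; the 20-30 and 30+ bands match the text, while the two lower bands and the column name FareBand are assumptions.

```r
# Illustrative rows only.
passengers <- data.frame(
  Sex      = c("female", "female", "male", "female", "male", "female"),
  Pclass   = c(1, 3, 1, 3, 3, 1),
  Fare     = c(80, 35, 52, 8, 25, 90),
  Survived = c(1, 0, 1, 1, 0, 1)
)

# Bin the continuous fare into bands; right = FALSE puts a fare of
# exactly 30 into the "30+" band.
passengers$FareBand <- cut(
  passengers$Fare,
  breaks = c(-Inf, 10, 20, 30, Inf),
  labels = c("<10", "10-20", "20-30", "30+"),
  right  = FALSE
)

# Survival rate per fare band, class and sex.
fare_rates <- aggregate(Survived ~ FareBand + Pclass + Sex,
                        data = passengers, FUN = mean)
print(fare_rates)
```

Only combinations that actually occur in the data appear in the result, so sparse groups should be read with their small counts in mind.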

2. Decision Tree: To display our data graphically we used a Classification Decision Tree (this required installing new packages, which is why red text appears in the console script below)

Diagram 5: How to generate a Decision Tree Classification Model

Looking at the Decision Tree in more detail below (Diagram 6), the root node illustrates our prediction that ‘everyone will die’, i.e. 0. This is true for 62% of passengers and false for 38%, indicating an overall survival rate of 38%.

Further down the branch, the next node splits on gender as the purest variable. In the left node (male) 81% die and 19% survive; in the right node (female) 26% die and 74% survive. This illustrates a large difference in survival based on Sex.
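To make ‘purest’ concrete: classification trees typically choose the split that most reduces an impurity measure such as the Gini index. The sketch below assumes that criterion (the tree's actual settings are not stated here) and uses the training-set counts of 468 males and 81 females who died against 109 males and 233 females who survived.

```r
# Gini impurity of a node holding `died` and `survived` passengers.
gini <- function(died, survived) {
  p <- c(died, survived) / (died + survived)
  1 - sum(p^2)
}

# Root node: 549 died, 342 survived.
root <- gini(549, 342)

# Child nodes after splitting on Sex.
male   <- gini(468, 109)
female <- gini(81, 233)

# Weighted impurity after the split; lower than the root means purer nodes.
after_split <- (577 * male + 314 * female) / 891
print(c(root = root, after_split = after_split))
```

An impurity of 0 means a node contains only one outcome and 0.5 is an even mix, so the drop from the root to the weighted child nodes is what makes Sex an attractive first split.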

Looking at the terminal nodes for males, age has a big part to play. 83% of males over the age of 6.5 die, whereas for those under 6.5 years the figure falls to 33%, a 67% survival rate (this group is 3% of the overall dataset).

For females, the branches delve deeper into passenger class, fare and port of embarkation. Those in a passenger class greater than 2.5 (i.e. third class) and paying more than £23 have a 59% survival rate.

This decision tree shows that we reject the H0 that everyone dies and illustrates that sex, age and class each play a part in the survival rate.

Diagram 6: Understanding the Titanic Dataset using a Decision Tree

 

3. Multiple Linear Regression

To test further which variables are significant against the hypothesis that nobody survives, I used a Multiple Linear Regression model.

  • Analysis shows three-star (***) significance for the variables pclass, sexmale and age, and two-star (**) significance for sibling or spouse, in their impact on the survival rate.
  • Adjusted R-squared is below 0.7, so the model explains less than 70% of the variance in survival. This is a measure of goodness of fit rather than a hypothesis test.
  • The p-value is < 2.2e-16, therefore we reject the null hypothesis because p < 0.05.
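The figures in the bullets above come from the model summary. Here is a minimal sketch of where to find them, fitted on simulated data (not the Titanic data) so that it runs on its own; the effect sizes used to generate the data are arbitrary assumptions.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated passengers (NOT the Titanic data) with similar column names.
n <- 200
sim <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)
# Make survival more likely for women and for higher classes (arbitrary sizes).
p_survive    <- plogis(2 - 0.8 * sim$Pclass - 2 * (sim$Sex == "male"))
sim$Survived <- rbinom(n, 1, p_survive)

# Multiple linear regression of the 0/1 outcome on the predictors.
fit         <- lm(Survived ~ Pclass + Sex + Age, data = sim)
fit_summary <- summary(fit)

print(fit_summary$coefficients)   # estimates, std. errors, t and p values
print(fit_summary$adj.r.squared)  # adjusted R-squared (goodness of fit)
```

Printing summary(fit) in full also shows the significance stars and the overall F-test p-value. For a 0/1 outcome, glm with family = binomial (logistic regression) is the more usual choice, which is why it is listed first under Data Modelling.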

 

Diagram 7: Using Multiple Linear Regression to see the most significant variables

 

4. Scatterplot Matrix

The diagram below highlights the scatterplot matrix of the variables I want to focus on: P.Class, Sex and Age…

Diagram 8: Scatterplot matrix of the variables PClass, Sex and Age

Data Deployment

Data Validation:

The model built on the training set has been validated against the test set.

Data Reporting for Business Purposes:

AOB Analytics has generated numerous aggregate charts, structured tables, proportion tables, decision tree classifications and scatter plot charts, which can be used for reporting purposes. These outputs can be exported from RStudio in various file formats for use by the business.

Secondary Sources:

Further research conducted by History.com and Column Five illustrates that class impacted the survival rate: 63% of first class passengers survived in comparison to 25% in third class.

A fourth variable which was not considered is the crew. However, crew members were not included in the dataset we used, so this is not a constraint and has not impacted the results.

Further research

Trevor Stephens Tutorial.

 


Achieving Business Objective

Now we’re back to our business problem.

      • Grab attention?
      • Highlight business service?
      • Generate word of mouth (controversial)?

Do you think you know your business?

What if you’re just looking at the tip of the iceberg?

Don’t let your sturdy ship sink…

Use AOB Analytics to delve deeper into your business.

Your future is in your hands.

View our analysis of the Titanic dataset using predictive analysis machine learning to understand why ‘class’ ‘paid’ off for Titanic survivors.

Predictive analytics can be used for understanding your business, making better strategic decisions, generating revenue and increasing profit.


Using R Studio 3.3.1. to Analyse the Titanic Dataset- Console Script

Published by

aoife

A Dublin Business School student’s blog for Big Data. This blog summarises topics covered in the module B6IS106 Information Systems and Databases by Lecturer Darren Redmond, for the academic year 2015/2016.
