Business Analytics: Using the CRISP-DM Methodology

Business Understanding

Business Problem

AOB Analytics is a successful Irish analytics company with plans to expand internationally into the American market. A limited budget challenges the company to look at creative ways to target this overseas audience with its business analytics service offering.

The Marketing Team has opted to use the digital arena to create awareness online using Guerrilla Marketing (shock advertising). They need to find a topic of interest which will (a) grab the attention of the American audience, (b) demonstrate business analysis capabilities on a sample dataset to highlight the service on offer and (c) promote it in a way that is controversial enough to generate word of mouth.

Choosing a Topic of Interest – The Dataset:

What evokes empathy, emotion and passion more than a true story? What true story do we know of that sparks interest with both Irish and American audiences alike? Whether it’s through family ties or James Cameron’s production and direction of the 1997 Hollywood adaptation, everyone knows the story of ‘The Titanic’. Pulling on Irish and American heartstrings by using a true story like the tragic maiden voyage of the RMS Titanic to launch an overseas Irish business is sure to generate word of mouth online.

Diagram 1: Titanic route between Ireland and the US

About our Dataset 

The RMS Titanic dataset is an ideal source, as it allows us to illustrate our predictive analytics capabilities by showing which variables are associated with ‘survival’ using machine learning. We want to use our analysis (in RStudio) to highlight that being a survivor is associated with a person’s ‘class’. To do this, we will use regression and decision tree models to show the impact of multiple variables on the survival rate and then display the results graphically.


Data Understanding

We have access to the Titanic dataset through Kaggle, a data science site. The data has been split into two datasets, a training set and a test set, to analyse which groups are more likely to survive. The training set includes the outcome for each passenger and is used to build the data mining model; the test set is used to validate the model.

  • The training set has 891 observations (rows) and 12 variables (columns).
  • The test set has 418 observations and 11 variables (the survival variable is removed so we can predict it)
  • Variables include:
    1. PassengerId – a unique, auto-incrementing integer identifier
    2. Survival – Survival (0 = No; 1 = Yes). Not included in test.csv file.
    3. Pclass – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
    4. Name – Name
    5. Sex – Sex
    6. Age – Age
    7. Sibsp – Number of Siblings/Spouses Aboard
    8. Parch – Number of Parents/Children Aboard
    9. Ticket – Ticket Number
    10. Fare – Passenger Fare
    11. Cabin – Cabin
    12. Embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

The information can be examined in detail by (a) opening the csv files directly or (b) importing the files into RStudio, viewing the summary in the ‘Environment’ pane on the top right and examining them further with the str() function. An example of the latter is below:
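As a minimal sketch of this step (assuming the Kaggle files train.csv and test.csv have been downloaded to the working directory):

  # Import the Kaggle Titanic files and inspect their structure
  train <- read.csv("train.csv", stringsAsFactors = TRUE)
  test  <- read.csv("test.csv",  stringsAsFactors = TRUE)
  str(train)       # variable names, types and a preview of the values
  summary(train)   # quick statistical summary of each column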

Data Structure

We predict that variables such as gender, age and class will impact survival, based on the knowledge that women, children and the wealthy were prioritised once the ship hit the iceberg. We will draw on these insights to focus on ‘class’.


Data Preparation

Looking at the data structure we can identify a number of missing values. The data needs to be cleansed, either by removing the unknowns or by applying an average value, to prevent distortions. For example, there are 177 NAs for Age in the training set, and Cabin is missing 1,014 cells across both datasets.
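As a hedged illustration of the kind of cleansing involved (using the Kaggle column names), the missing values could be counted and the Age gaps filled with the mean:

  colSums(is.na(train))    # count the NAs in each column (Age has 177)
  sum(train$Cabin == "")   # Cabin is mostly recorded as an empty string
  # one simple option: replace missing ages with the mean of the known ages
  # (not applied in the child analysis later, where missing ages are instead treated as adults)
  train$Age[is.na(train$Age)] <- mean(train$Age, na.rm = TRUE)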


Data Modelling

  1. Logistic Regression
    • Gender
    • Age
    • Class
  2. Decision Tree (Classification Model)
  3. Multiple Linear Regression
  4. Q-Q plot to check normality and view any outliers

Data Evaluation

  1. Logistic Regression

(a) Gender:

For our predictive modelling tool we began with the assumption that everyone died (a 0 integer), and from the data we saw that 342 of the 891 passengers in our training dataset survived. We then mined the data for the Gender/Sex variable. We discovered that the majority of the dataset is male (577 male : 314 female); however, our proportion table illustrates that 74% of the 314 females survived, in comparison to only 18% of the 577 males. This shows that being female is a factor in favour of survival.
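A sketch of the R commands behind these figures (using the Kaggle column names Survived and Sex):

  table(train$Survived)                            # 549 died (0), 342 survived (1)
  table(train$Sex)                                 # 314 female, 577 male
  prop.table(table(train$Sex, train$Survived), 1)  # survival proportion within each gender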

Diagram 2: Analysis of the ratio of male to female survival after the sinking of the Titanic

(b) Age:

From there we viewed the Age data to understand the proportion of children (i.e. under 18) that survived: 38 female to 23 male. Analysing the proportions further, 69% of female children and 39% of male children survived, again indicating that gender influences survival chances.

Note: We have assumed that the 177 NA values equal the mean age (i.e. adults), so they do not count as children and have been assigned a 0 integer.
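A sketch of how this child analysis could be reproduced, with missing ages explicitly treated as adults (0) as per the note above:

  # flag children under 18; passengers with a missing Age default to 0 (adult)
  train$Child <- ifelse(!is.na(train$Age) & train$Age < 18, 1, 0)
  # proportion surviving by child/adult status and gender
  aggregate(Survived ~ Child + Sex, data = train, FUN = function(x) sum(x) / length(x))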

Diagram 3: Impact of the variable Age on survival of the Titanic

(c) Fare:

From the analysis below it is evident that gender and class impact survival rates. The highest survival rates are for females paying a fare of 20-30 or 30+ in first or second class cabins. A female in first class paying 30+ has a 97% survival rate, whereas a female paying 30+ in third class has a 12% chance. Among males, those in first class have the highest chance of survival.
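A sketch of how the fare bands used above could be constructed and cross-tabulated (the band labels are assumptions based on the analysis described):

  # group fares into bands
  train$Fare2 <- "30+"
  train$Fare2[train$Fare < 30 & train$Fare >= 20] <- "20-30"
  train$Fare2[train$Fare < 20 & train$Fare >= 10] <- "10-20"
  train$Fare2[train$Fare < 10] <- "<10"
  # survival proportion by fare band, passenger class and gender
  aggregate(Survived ~ Fare2 + Pclass + Sex, data = train,
            FUN = function(x) sum(x) / length(x))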

Diagram 4: Survival rate from Titanic based on class

2. Decision Tree: To display our data graphically we used a Classification Decision Tree (this required installing new packages, which is the reason for the red font in the console script below).
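A minimal sketch of how such a tree can be grown and plotted in RStudio; the exact packages used are an assumption (installing them is what produces the red console messages):

  # install.packages(c("rpart", "rattle", "rpart.plot", "RColorBrewer"))
  library(rpart)         # recursive partitioning for the classification tree
  library(rattle)        # fancyRpartPlot() for a readable tree diagram
  library(rpart.plot)
  library(RColorBrewer)

  fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
               data = train, method = "class")
  fancyRpartPlot(fit)    # draw the decision tree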

Diagram 5: How to generate a Decision Tree Classification Model

Looking at the Decision Tree in more detail below (Diagram 6), the root node illustrates our prediction that ‘everyone will die’, i.e. 0. This is true for 62% and false for 38%, indicating an overall survival rate of 38%.

Further down the branch, the next node splits on gender as the purest variable. In the left node (male) 81% die and 19% survive; in the right node (female) 26% die and 74% survive. This illustrates a large variance in survival based on sex.

Looking at the terminal nodes for males, age plays a big part: 83% of males over the age of 6.5 die, whereas under 6.5 years the figure declines to 33%, giving a 67% survival rate (3% of the overall dataset).

For females, the branches delve deeper into passenger class, fare and port of embarkation. Those in a passenger class greater than 2.5 who paid more than £23 have a 59% survival rate.

This decision tree shows that we reject the H0 that everyone dies, and illustrates that sex, age and class each play a part in the survival rate.

Diagram 6: Understanding the Titanic Dataset using a Decision Tree

 

3. Multiple Linear Regression

To further test which variables are significant against the hypothesis that nobody survives, we used a Multiple Linear Regression model (a sketch of the command appears after the bullets below).

  • The model summary shows three-star (***) significance for the variables Pclass, Sexmale and Age, and two-star (**) significance for SibSp (siblings/spouses aboard), in terms of their impact on the survival rate.
  • The adjusted R-squared is below 0.7; this statistic indicates the model’s goodness of fit.
  • The p-value is < 2.2e-16; since p < 0.05, we reject the null hypothesis.
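The exact model formula is not shown in the output above, but a sketch along the following lines produces the kind of summary described (significance stars, adjusted R-squared and overall p-value); rows with missing Age are dropped automatically:

  lm_fit <- lm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare, data = train)
  summary(lm_fit)   # coefficients with significance stars, adjusted R-squared, F-test p-value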

 

Using Multiple Linear Regression to see the most significant variables

 

4. Scatterplot Matrix

The diagram below shows the scatterplot matrix of the variables we want to focus on: Pclass, Sex and Age.

Scatterplot matrix displaying the variables Pclass, Sex and Age
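A sketch of how such a matrix could be produced with base R’s pairs() (Sex is converted to a numeric code so it can be plotted):

  vars <- data.frame(Pclass = train$Pclass,
                     Sex    = as.numeric(factor(train$Sex)),   # 1 = female, 2 = male
                     Age    = train$Age)
  pairs(vars, main = "Titanic: Pclass, Sex and Age")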

Data Deployment

Data Validation:

The model built on the training set has been applied to the test set for validation, as sketched below:
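A sketch of this step, applying the decision tree fitted earlier (fit) to the test set and writing the predictions out for review (the output file name is illustrative):

  prediction <- predict(fit, newdata = test, type = "class")
  submit <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
  write.csv(submit, file = "titanic_predictions.csv", row.names = FALSE)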

Data Reporting for Business Purposes:

AOB Analytics has generated numerous aggregate charts, structured tables, proportion tables, decision tree classifications and scatter plots, which can be used for reporting purposes. This data can be exported from RStudio to various file formats for use by the business.

Secondary Sources:

Further research by History.com and Column Five illustrates that class impacted the survival rate: 63% of first-class passengers survived, compared with 25% in third class.

A fourth variable that was not considered is the crew; however, crew members were not included in the dataset we used, so this is not a constraint and has not impacted the results.

Further research

Trevor Stephens’ Tutorial.

 


Achieving Business Objective

Now we’re back to our business problem.

  • Grab attention?
  • Highlight business service?
  • Generate word of mouth (controversial)?

Do you think you know your business?

What if you’re just looking at the tip of the iceberg?

Don’t let your sturdy ship sink…

Use AOB Analytics to delve deeper into your business.

Your future is in your hands.

View our analysis of the Titanic dataset, using predictive analytics and machine learning, to understand why ‘class’ ‘paid’ off for Titanic survivors.

Predictive analytics can be used to understand your business, make better strategic decisions, generate revenue and increase profit.


Big Data equals Big Decisions

Big Data is a buzzword in every innovative business right now! Why? Companies are vacuuming up data from an expansive network of online and offline sources (from phones and credit cards to the infrastructure of cities) – but it is not the quantity that is remarkable, it’s what we can do with it. It is now possible to process this data cost-effectively and in a timely manner to provide more valuable insights, become more empowered through business intelligence and make more informed business decisions.

Empowering business decision making through Big Data

Ultimately, Big Data provides companies with a competitive advantage through a better understanding of the 5 Cs (company, customers, competitors, collaborators, climate) and foresight about what actions can help drive the business into the future. For example, insights can help create cost efficiencies, improve processes to reduce time and generate economies of scale, understand customer needs better to assist with new product development, improve customer experience, or detect risks and fraud. And this is just the tip of the iceberg… I think now you can see why it’s such a hot topic! Show me the MONEY!

How Big Data can improve business decisions and provide competitive advantage

So…What is Big Data?

According to Chen et al., Big Data and Big Data analytics can be defined as:

“the data sets and analytical techniques in applications that are so large (from terabytes to exabytes) and complex (from sensor to social media data) that they require advanced and unique data storage, management, analysis, and visualization technologies.”

Simply put, Big Data is a term used to describe extremely large volumes of data that can be analysed by computer programs to reveal patterns, trends and associations. It can be both quantitative and qualitative in nature, stemming from a multitude of sources – such as streaming data (the web of connected devices, i.e. the IoT), social data and publicly available data sourced through secondary research.

2.5 quintillion bytes of data are created every day, and by 2020 the world’s data is expected to reach 40 zettabytes. While 20% of data is structured (governed by relational tables), a staggering 80% is semi-structured or unstructured, stemming from videos, images, social networks, etc., so it is essential that data storage, analysis, visualisation and reporting are at the core of how businesses operate.

Where we get our Big Data insights from

What is important is not the volume of raw data, but what companies do with that data: analysing it to generate insights for better decision making and enhanced strategic planning through Business Intelligence.

This is reinforced by Gartner’s definition of Big Data:

“Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.


Data Evolution

As above, Gartner analyst Doug Laney defined Big Data in terms of the 3 Vs (volume, velocity and variety) in 2001, and since then it has been simplified with the following formula:

Big Data = Transactions + Interactions + Observations

Big Data = Transactions + Interactions + Observations

Data is being generated by people, machines, applications and combinations of all three. Classical data is transactional data, which is highly structured, stems from an event with a time, a numerical value and an object – e.g. purchases, payments, inventory, shipping – and is usually accessed through SQL (Structured Query Language). Interactional data, which stems from relationships and interactions – e.g. web logs, social interactions and user-generated content (UGC) – and observational data, which stems from the Internet of Things – e.g. RFID, NFC and sensors for lights, pressure or alarms – both come from multi-structured sources and require NoSQL technologies.

From a marketing perspective, data has evolved over time from transactional data (where the customer was unknown), to demographic data (what the customer looks like), to psychographic data (defining people by their interests), to an age now where we can evaluate attitudinal data (understanding sociographics, i.e. how people think or feel). Big Data tells us what has happened, and Business Intelligence helps us understand behaviours and what can influence them. These steps sum up the use of Big Data:


Aggregate   >        Analyze          >         Articulate          >               Act


 

Big Data Challenges

The challenges analysts are faced with in order to understand Big Data

Due to the large volume, velocity and variety of Big Data, and the requirement for validity, there are huge challenges for IT: from infrastructure (data capacity, data speed), to platforms (end-to-end, easy-to-use and fully integrated), to databases (scalability and the ability to manage semi-structured and unstructured data). These are being addressed by HPC (High Performance Computing) through parallelism, clusters and cloud computing.

 

Sources

  • Chen et al. (2012), Business Intelligence and Analytics: From Big Data to Big Impact.
  • http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/
  • http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/#31dd37d23bf6
  • http://www.sas.com/en_us/insights/big-data/what-is-big-data.html

Revolutionary R – Making Sense of Data

What the fuRk is R?

R is a language: a (R)evolutionary statistical programming language in which you write functions and scripts as commands to analyse your data, instead of the traditional ‘point and click’ approach. It is also a software tool and environment used for data analysis and visualisation by data analysts worldwide.

R originated in New Zealand, where two lecturers at the University of Auckland (Ross Ihaka and Robert Gentleman) decided to provide a statistical computing platform for their students based on the S language. Over time it has evolved into a tool used worldwide for cutting-edge statistical and predictive modelling. Go Team Kiwis! At the heart of R today is ‘R Core’, a group of around 20 developers who guide R’s evolution.

R is “free, open source, powerful and highly extensible”.


Data: Extraction  >  Exploration    >  Visualisation   >    Sharing


How to get Started

To learn how to use R I visited ‘Try R’ on Code School and was guided through an eight-step tutorial. The site breaks the programming language down into eight simple chapters, with a navigation menu to show how far you have progressed (a few sample commands of the kind covered are sketched after the list):

  1. Using R Basics
  2. Vectors
  3. Matrices
  4. Statistics
  5. Factors
  6. Data Frames
  7. Real World Data
  8. Completion of course
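A few one-liners of the kind those chapters cover:

  prices <- c(250000, 310000, 287500)            # a numeric vector
  mean(prices)                                   # basic statistics: 282500
  m <- matrix(1:6, nrow = 2)                     # a 2 x 3 matrix
  town <- factor(c("Dublin", "Cork", "Dublin"))  # a factor (categorical variable)
  df <- data.frame(town = town, price = prices)  # a data frame
  summary(df)                                    # summarise the data frame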

At the end of each chapter you are rewarded with a badge, and the final chapter displays the TRY R badge.

Learning how to use R statistical programming language

Using R Studio

After spending several years overseas I have returned to Ireland, and the main topic of conversation seems to revolve around commitment. The questions on everyone’s lips are ‘do you think it’s time to settle down now?’ and ‘do you think it might be time to put your foot on the property ladder?’. So, for that reason, I’m going to use R to analyse trends in the property market over time to see if it’s a good time to purchase, and whether I should give in to peer pressure or continue to be a commitment-phobe.

  1. To begin, I downloaded the open-source RStudio software for Windows from the website.
  2. In the background, I sourced my property price csv data from the CSO website so I have a dataset I can work with. (Note: the time data was in the rows and the property type in the columns; I needed this the other way around, so I copied the data and transposed it in the Excel sheet.) Now I’m set to go.
  3. I imported the csv into RStudio (identifying the headers).
  4. I viewed the property file, i.e. > View(Propertyprice)
  5. To convert the Date column into a format R can understand, I create a date vector (as.Date() needs a day, so the first of the month is prefixed): > rdate <- as.Date(paste0("01/", Propertyprice$Date), format = "%d/%m/%y")
  6. Following from this, I plot the graph: > plot(Propertyprice$Dublinhouses ~ rdate, type = "l", col = "red")
  7. I insert a box around it by calling box().
  8. To add an x axis I submit axis(1, at = rdate, labels = format(rdate, "%m-%y")).
  9. My plot displays Dublin house prices, and their rises and falls, from January 2005 to February 2016; the consolidated script is sketched below.
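Putting the steps together, a minimal sketch of the full script (the file name, the column names Date and Dublinhouses, and the month/year date format are assumptions based on the steps above):

  Propertyprice <- read.csv("Propertyprice.csv", header = TRUE)

  # as.Date() needs a day, so prefix the first of the month before parsing
  rdate <- as.Date(paste0("01/", Propertyprice$Date), format = "%d/%m/%y")

  # line plot of Dublin house prices over time, with a box and a month-year axis
  plot(Propertyprice$Dublinhouses ~ rdate, type = "l", col = "red",
       xlab = "Date", ylab = "Dublin house prices")
  box()
  axis(1, at = rdate, labels = format(rdate, "%m-%y"))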

 

 

Google Fusion- Making Big Data Easy, Shareable and Fun

Let’s begin …

What are We Trying to Achieve?

Looking at reams of data in a csv file can become quite laborious; add multiple files to the mix and we’re facing a huge headache. So wouldn’t it be nice if we could merge all of our data and display it in an aesthetically appealing, easy-to-digest format for analysis, and share our findings with others? The great news is we can!


How do we Make our Data Appealing for Everyone?

Google Fusion, an experimental data visualisation web application, enables you to take data out of its silo and combine it with other public data on the web, so you can put Big Data into context and present it in an aesthetically pleasing format. The merged data can be displayed visually in real time as heat maps and intensity maps, as charts (bar, line, etc.) and in the form of motion charts and timelines. (If you want to drill down into your data, you can filter it with SQL-like queries for more selective analysis.) The added bonus is that you can make your data public and share it with the world, limit it to a few selected people, or keep it entirely private – you have control over your data sets and their distribution.


Data:                     Collaborate          >        Visualise            >          Share


A Step-by-Step Guide:

  1. To begin, I sourced population data from the 2011 Census on the CSO website and cleansed the csv file to present population by county, removing any spelling errors or duplications.
  2. Following from this, I sourced a KML file from the Independent with the county boundaries as my second dataset.
  3. I visited the Google Chrome web store and downloaded the Fusion Map app (version 0.2).
  4. Using the table upload function, I uploaded both files to my Google Drive (File > New Table), referencing the source and providing a description to credit the origin, for SEO and to help other users find the file.
Uploading csv and kml files to create a Fusion Map for data visualisation
  5. Next, I merged both tables in order to collate the data, making sure the key for each dataset matched, e.g. County Clare in the csv with County Clare in the KML file.
  6. Now I get to see the magic happen. Fusion Map has auto-detected location information through its Geocode service, so I click on ‘Map of Geometry’, view the feature map, and by clicking on each county I can see the population figures appear.
Fusion map auto detects location information and generates a Map of Geometry
  7. I now want to format my map so the variance in population is easily identified. Clicking the ‘change feature styles’ button opens a toolbox for editing.
Editing the Fusion Map to provide buckets for easily identifiable data by colour
  8. I change items such as the ‘Points’ marker icons and assign values to the population subsets to group data into ‘buckets’, in increments from 50,000 to 1,000,000. I also assign small dots so the lowest-population regions can be distinguished from the higher-population regions with large dots – this information displays on the county map.
  9. I also change the formatting in the ‘Polygon’ section to replicate the population increments in buckets and assign a gradient of blues, with the colour getting darker as the area becomes more densely populated.
  10. I provide a legend for the geometry map so the user can easily correlate the colours with the figures. First I update the title to ‘Population Distribution in Ireland’, then I choose the location (in this case the left side), and finally I provide a link reference.

Making my Fusion Map Public

To create a Fusion Map that can be viewed by the public, I amended the share setting from ‘Private’ to ‘Public’ in the top right of the page. In the toolbar, I also selected Tools > Publish to obtain the iframe code for insertion in my blog.


Fusion Map Live

Now my map is ready for viewing…

Fusion Map

Analysis of my Population Fusion Map:

From this Fusion Map it is evident that the most densely populated areas in Ireland are the counties with cities, i.e. the Dublin region in pole position (1,273,069), followed by Cork (519,032) and thirdly Galway (250,653).

The least densely populated areas are highlighted in light blue and form a corridor from Sligo in the north-west (65,393) all the way through the Midlands to Kilkenny in the south-east of Ireland (95,419).


Supplementary Research:

Conducting further research and merging additional datasets with the Fusion Map would provide more insight into the distribution of the population and the tendency for population density to increase in counties with cities. Data that could be used to verify this includes the road network, the availability of work, education facilities, technological resources, etc.


Getting Started

Fusion Maps turns a matrix of numbers into something visual, even for non-programmers. To get started, visit the Google Chrome Store.