Business Understanding
Business Problem
AOB Analytics is a successful Irish analytics company with plans to expand internationally into the American market. A limited budget challenges the company to look at creative ways to target this overseas audience with its business analytics service offering.
The Marketing Team has opted to use the digital arena to create awareness online using Guerrilla Marketing (Shock Advertising). They need to find a topic of interest which would (a) grab the attention of the American audience, (b) display business analysis capabilities on a sample dataset to highlight the service on offer and (c) promote it in a way that is controversial enough to generate word of mouth.
Choosing a Topic of Interest – The Dataset:
What evokes empathy, emotion and passion more than a true story? What true story sparks interest with Irish and American audiences alike? Whether it’s through family tree lines or James Cameron’s production and direction of the 1997 Hollywood adaptation, everyone knows the story of ‘The Titanic’. Pulling on Irish and American heart strings with a true story like the tragic maiden voyage of the R.M.S. Titanic to launch an Irish business overseas is sure to generate word of mouth online.

About our Dataset
The R.M.S. Titanic dataset is an ideal source, as it lets us illustrate our predictive analytics capabilities by showing which variables are associated with ‘survival’ using machine learning. We want to use our analysis (in RStudio) to highlight that survival is associated with a person’s ‘class’. To do this, we will use regression and decision tree models to show the impact of multiple variables on the survival rate and then display it graphically.
Data Understanding
We have access to the Titanic dataset through the data science site Kaggle. To analyse which groups were more likely to survive, the data has been split into two datasets: a training set and a test set. The training set includes the outcome for each passenger and is used to build the data mining model. The test set is used to validate the model.
- The training set has 891 observations (rows) and 12 variables (columns).
- The test set has 418 observations and 11 variables (the survival variable is removed so we can predict it).
- Variables include:
- PassengerId – Auto-incrementing integer identifier
- Survived – Survival (0 = No; 1 = Yes). Not included in test.csv file.
- Pclass – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Name – Name
- Sex – Sex
- Age – Age
- SibSp – Number of Siblings/Spouses Aboard
- Parch – Number of Parents/Children Aboard
- Ticket – Ticket Number
- Fare – Passenger Fare
- Cabin – Cabin
- Embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
The information can be examined in detail by (a) opening the csv files or (b) importing the files into RStudio, viewing the summary in the ‘Environment’ pane on the top right and examining it further with the ‘str’ function. The ‘str’ output for the training set appears in the console script at the end of this document.
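As a minimal sketch of option (b), the snippet below reads a few rows and inspects them with `str`. The four rows are made-up values in the Kaggle column layout (an assumption for illustration only); with the real data you would pass the file path to `read.csv` instead of `text =`.

```r
# Made-up rows in the Kaggle column layout; the real training file has 891 rows.
csv_text <- "PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,26,7.92
4,0,3,male,,8.05"

# read.csv() accepts literal text via `text =`; an empty numeric field becomes NA.
titanic_sample <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# str() lists each column with its type and first values.
str(titanic_sample)

# summary() on a numeric column also reports how many values are NA.
summary(titanic_sample$Age)
```

`stringsAsFactors = TRUE` is set explicitly so that Sex is read as a factor, matching the `str` output in the console script, regardless of the R version in use.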
We predict that variables such as gender, age and class will impact survival, based on the knowledge that women, children and wealthier passengers were prioritised once the ship hit the iceberg. We will draw on these insights to focus on ‘class’.
Data Preparation
Looking at the data structure, we can see that there are a number of missing values. The data needs to be cleansed, either by removing the unknown values or by substituting an average value, to prevent distortions. For example, there are 177 NAs for Age in the training set, and Cabin is missing in 1014 cells across both datasets.
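A minimal sketch of both cleansing steps, counting the gaps and then substituting the mean, is shown here on a small made-up data frame (the values are illustrative, not taken from the Kaggle files). It assumes, as the `str` output in the console script suggests, that a missing Cabin is stored as an empty string rather than NA.

```r
# Made-up values for illustration; Age uses NA and Cabin uses "" for missing entries.
passengers <- data.frame(
  Age   = c(22, 38, NA, 35, NA, 54),
  Cabin = c("", "C85", "", "C123", "", "E46"),
  stringsAsFactors = FALSE
)

# Count NA values per column, and blank strings separately for Cabin.
na_counts    <- colSums(is.na(passengers))
blank_cabins <- sum(passengers$Cabin == "")

# Mean imputation: replace each missing Age with the mean of the known ages.
mean_age <- mean(passengers$Age, na.rm = TRUE)
passengers$AgeFilled <- ifelse(is.na(passengers$Age), mean_age, passengers$Age)

na_counts
blank_cabins
passengers
```

Writing the imputed values to a new column (`AgeFilled`, a hypothetical name) keeps the original Age intact, so the effect of the substitution can still be checked later.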
Data Modelling
- Logistic Regression
- Gender
- Age
- Class
- Decision Tree (Classification Model)
- Multiple Linear Regression
- Q-Q plot to check normality and look for any outliers
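For the logistic regression item in this list, a minimal sketch with `glm` might look as follows. It is an assumption-laden illustration: it uses the `Titanic` table that ships with R (2,201 people, crew included, stored as counts), which is not the Kaggle file analysed in this document, so its figures will differ from ours.

```r
# Built-in table of counts by Class, Sex, Age and Survived; NOT the Kaggle dataset.
titanic_counts <- as.data.frame(Titanic)

# Logistic regression on grouped data: each row is weighted by its head count (Freq).
# Survived is a factor (No, Yes), so the model estimates the log-odds of "Yes".
logit_fit <- glm(
  Survived ~ Class + Sex + Age,
  family  = binomial,
  data    = titanic_counts,
  weights = Freq
)

summary(logit_fit)

# Exponentiated coefficients are odds ratios relative to the baseline levels.
odds_ratios <- exp(coef(logit_fit))
odds_ratios
```

An odds ratio above 1 for a level means higher odds of survival than the baseline level of that variable, holding the other variables fixed.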
Data Evaluation
1. Logistic Regression
(a) Gender:
We began our predictive modelling with the assumption that everyone died (Survived = 0). The training data shows that 342 of 891 passengers survived. We then mined the data on the Gender/Sex variable. The majority of the dataset is male (577 male to 314 female), however our proportion table shows that 74% of the 314 females survived in comparison to only 19% of the 577 males. This shows that being female is a factor in favour of survival.
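The row-wise proportions can be reproduced from the counts alone. The sketch below rebuilds the Sex and Survived columns from the totals in the console script (233 female and 109 male survivors, from the aggregate output) rather than reading the Kaggle file; the variable names are hypothetical.

```r
# Rebuild the two columns from the counts reported in the console script.
sex      <- rep(c("female", "male"), times = c(314, 577))
survived <- c(rep(c(1, 0), times = c(233, 81)),    # females: survived, died
              rep(c(1, 0), times = c(109, 468)))   # males: survived, died

counts <- table(Sex = sex, Survived = survived)

# margin = 1 divides each row by its own total, giving survival rates per gender.
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)
```

Without `margin = 1`, `prop.table` divides by the grand total instead, which answers a different question (what share of all passengers were, say, surviving females).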

(b) Age:
From there we viewed the Age data to understand how many children (i.e. under 18) survived: 38 females and 23 males. Expressed as proportions, 69% of female children and 40% of male children survived, indicating again that gender influences survival chances.
Note: The 177 passengers with a missing (NA) Age are assumed to be of the mean age (29.7) and therefore not children, so their Child flag remains 0.
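The treatment of missing ages can be made explicit in code. This is a sketch on made-up rows, not the author's exact script: it flags a child only when Age is known and below 18, so every NA age falls into the non-child group.

```r
# Made-up rows for illustration; two passengers have an unknown age.
people <- data.frame(
  Sex      = c("female", "female", "male", "male", "male", "female"),
  Age      = c(4, 30, 10, NA, 40, NA),
  Survived = c(1, 1, 0, 0, 0, 1)
)

# Child = 1 only when Age is known and under 18; NA ages stay at 0.
people$Child <- as.integer(!is.na(people$Age) & people$Age < 18)

# Survival rate per Child/Sex group; the mean of a 0/1 column is a proportion.
rates <- aggregate(Survived ~ Child + Sex, data = people, FUN = mean)
rates
```

Age itself is not in the `aggregate` formula, so rows with an NA age are kept in the table instead of being dropped.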

(c) Fare:
From the aggregate table in the console script it is evident that gender and class impact survival rates. The highest survival rates are among females in first or second class (83% to 100%). A female in first class paying 30+ has a 98% survival rate, whereas a female in third class paying 30+ has only a 12.5% chance. Among males, those in first class paying 20 or more have the highest chance of survival (38% to 40%).
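The fare bands used in this analysis (<10, 10-20, 20-30 and 30+) can also be built in one step with `cut`. This is an alternative sketch to the repeated assignments in the console script, shown on a few example fares; `fare_band` is a hypothetical name.

```r
# Example fares, including values that sit exactly on a band boundary.
fares <- c(7.25, 10, 25, 30, 71.28)

# right = FALSE makes each interval closed on the left: [10, 20), [20, 30), [30, Inf).
fare_band <- cut(
  fares,
  breaks = c(-Inf, 10, 20, 30, Inf),
  labels = c("<10", "10-20", "20-30", "30+"),
  right  = FALSE
)

data.frame(Fare = fares, Band = fare_band)
```

One difference worth noting: a missing fare becomes NA with `cut`, whereas the script's approach of defaulting every row to '30+' would leave a missing fare labelled '30+'.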

2. Decision Tree: To display our data graphically we used a Classification Decision Tree. This required installing new packages, which is the source of the installation messages in the console script below.

Looking at the Decision Tree in more detail below (diagram 5), the root node illustrates our prediction that ‘everyone will die’, i.e. 0. The node shows that this is true for 62% and false for 38%, indicating an overall survival rate of 38%.
Further down the branch, the first split is on gender, the variable that gives the purest split. In the left node (male) 81% die and 19% survive; in the right node (female) 26% die and 74% survive. This illustrates a large difference in survival based on Sex.
Looking at the terminal nodes for males, age has a big part to play. Of males over the age of 6.5, 83% die, whereas for those under 6.5 years the figure declines to 33%, a 67% survival rate (3% of the overall dataset).
For females, the branches delve deeper into passenger class, fare and port of embarkation. Those in a passenger class greater than 2.5 (i.e. third class) and paying more than £23 have a 59% survival rate.
This decision tree leads us to reject the H0 that everyone dies and illustrates that sex, age and class each play a part in the survival rate.
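The comparison between the tree and the ‘everyone dies’ starting point can be sketched as below. Note the assumptions: the example uses the `Titanic` table built into R (2,201 people including crew, with only Class, Sex and Age as predictors), not the Kaggle file, so the splits and percentages will not match diagram 5.

```r
library(rpart)

# Expand the built-in table of counts into one row per person.
titanic_counts <- as.data.frame(Titanic)
people <- titanic_counts[rep(seq_len(nrow(titanic_counts)), titanic_counts$Freq),
                         c("Class", "Sex", "Age", "Survived")]

# Classification tree, as in the console script, but on this stand-in data.
tree_fit <- rpart(Survived ~ Class + Sex + Age, data = people, method = "class")
print(tree_fit)

# Baseline: predict "No" (died) for everyone, as in the root node.
baseline_accuracy <- mean(people$Survived == "No")

# Accuracy of the fitted tree on the same data it was trained on.
predicted     <- predict(tree_fit, people, type = "class")
tree_accuracy <- mean(predicted == people$Survived)

c(baseline = baseline_accuracy, tree = tree_accuracy)
```

Accuracy measured on the training data is optimistic, so this comparison only shows that the splits add information beyond the root node; it is not a substitute for checking the model on unseen passengers.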

3. Multiple Linear Regression
To test further which variables are significant, against the hypothesis that nobody survives, I used a Multiple Linear Regression model.
- Analysis shows three-star (***) significance (p < 0.001) for the variables Pclass, Sexmale and Age, and two-star (**) significance (p < 0.01) for SibSp (siblings or spouses aboard), in their impact on the survival rate.
- The adjusted R-squared is 0.3958, so the model explains roughly 40% of the variation in survival. This is a measure of goodness of fit, not a hypothesis test.
- The p-value of the overall F-test is < 2.2e-16, so we reject the null hypothesis that none of the variables is related to survival, because p < 0.05.
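The three quantities discussed in these bullets can be pulled out of a fitted model programmatically. The sketch below again relies on the `Titanic` table built into R as a stand-in (an assumption; it is not the Kaggle file, so the numbers will differ from those quoted above).

```r
# Expand the built-in counts into one row per person and code survival as 0/1.
titanic_counts <- as.data.frame(Titanic)
people <- titanic_counts[rep(seq_len(nrow(titanic_counts)), titanic_counts$Freq), ]
people$SurvivedNum <- as.integer(people$Survived == "Yes")

lm_fit     <- lm(SurvivedNum ~ Class + Sex + Age, data = people)
lm_summary <- summary(lm_fit)

# Per-variable p-values (the source of the *, ** and *** markers).
coef_p_values <- lm_summary$coefficients[, "Pr(>|t|)"]

# Goodness of fit: share of variance explained, adjusted for the number of predictors.
adj_r2 <- lm_summary$adj.r.squared

# Overall F-test p-value, computed from the stored F statistic and its degrees of freedom.
f <- lm_summary$fstatistic
overall_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

round(coef_p_values, 4)
adj_r2
overall_p
```

The adjusted R-squared and the F-test answer different questions: the first describes how much variation the model explains, the second whether the predictors jointly explain more than chance would.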

4. Scatterplot Matrix
The diagram below highlights the scatterplot matrix of the variables I want to focus on: Pclass, Sex and Age…
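A minimal sketch of such a matrix with the base R `pairs` function is given below. It uses the first five Pclass, Sex and Age values shown in the `str` output of the console script, and converts Sex to an explicit numeric code so that the axes are easy to read; `SexCode` is a hypothetical column name.

```r
# First five passengers as listed in the str() output of the console script.
plot_data <- data.frame(
  Pclass = c(3, 1, 3, 1, 3),
  Sex    = factor(c("male", "female", "female", "female", "male")),
  Age    = c(22, 38, 26, 35, 35)
)

# Factor levels are alphabetical, so female = 1 and male = 2.
plot_data$SexCode <- as.integer(plot_data$Sex)

# One scatterplot for every pair of the selected columns.
pairs(plot_data[, c("Pclass", "SexCode", "Age")],
      main = "Simple Scatterplot Matrix")
```

With only three class values and two gender codes, many points overlap, so the matrix is most informative for the Age panels.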

Data Deployment
Data Validation:
The decision tree fitted on the training set was used to predict survival for the test set, and the predictions were written to a CSV file:
    > #PREDICTION FROM DECISION TREE CLASSIFICATION
    > Prediction <- predict(fit, titanic_test.csv, type = "class")
    > submit <- data.frame(PassengerId = titanic_test.csv$PassengerId, Survived = Prediction)
    > write.csv(submit, file = "myfirstdtree.csv", row.names = FALSE)
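Because the test file carries no survival column, accuracy cannot be measured on it locally. One common way to get an estimate, sketched here as an assumption rather than as part of the original analysis, is to hold back a share of the labelled data. The example uses the `Titanic` table built into R as a stand-in for the Kaggle training file, and the 80/20 split and seed are arbitrary choices.

```r
library(rpart)

set.seed(42)  # arbitrary seed so the split is repeatable

# Stand-in data: the built-in Titanic counts expanded to one row per person.
titanic_counts <- as.data.frame(Titanic)
people <- titanic_counts[rep(seq_len(nrow(titanic_counts)), titanic_counts$Freq),
                         c("Class", "Sex", "Age", "Survived")]

# Hold back 20% of the labelled rows; fit on the remaining 80%.
holdout_rows <- sample(nrow(people), size = round(0.2 * nrow(people)))
train   <- people[-holdout_rows, ]
holdout <- people[holdout_rows, ]

fit_holdout  <- rpart(Survived ~ Class + Sex + Age, data = train, method = "class")
holdout_pred <- predict(fit_holdout, holdout, type = "class")

# Confusion matrix of actual versus predicted outcomes, and overall accuracy.
confusion        <- table(Actual = holdout$Survived, Predicted = holdout_pred)
holdout_accuracy <- sum(diag(confusion)) / sum(confusion)

confusion
holdout_accuracy
```

The off-diagonal cells of the confusion matrix show which kind of error dominates, predicted survivors who died or predicted deaths who survived.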
Data Reporting for Business Purposes:
AOB Analytics has generated numerous aggregate charts, structured tables, proportion tables, decision tree classification and scatter plot charts, which can be used for reporting purposes. This data can be exported to various file formats from RStudio for use by the business.
Secondary Sources:
Further research conducted by History.com and Column Five illustrates that class impacted the survival rate: 63% of first class passengers survived in comparison to 25% in third class.
Another group which was not considered is the crew. The crew are not included in the dataset we used, so this is not a constraint and has not impacted the results.

Achieving Business Objective
Now we’re back to our business problem.
- Grab attention?
- Highlight business service?
- Generate word of mouth (controversial)?
Do you think you know your business?
What if you’re just looking at the tip of the iceberg?
Don’t let your sturdy ship sink…
Use AOB Analytics to delve deeper into your business.
Your future is in your hands.
View our analysis of the Titanic dataset, using predictive analytics and machine learning, to understand why ‘class’ ‘paid’ off for Titanic survivors.
Predictive analytics can be used for understanding your business, making better strategic decisions, generating revenue and increasing profit.
Using RStudio (R 3.3.1) to Analyse the Titanic Dataset – Console Script
    > #AoifeO'Brien
    > #Student ID 10331098
    > #Titanic dataset
    > #Data mining
    >
    > #Set working directory and import data files
    > setwd("~/Titanic_dataset")
    > titanic_training.csv <- read.csv("~/Titanic_dataset/titanic_training.csv.csv")
    > View(titanic_training.csv)
    > titanic_test.csv <- read.csv("~/Titanic_dataset/titanic_test.csv.csv")
    > View(titanic_test.csv)
    >
    > #View structure of dataset
    > str(titanic_training.csv)
    'data.frame': 891 obs. of 12 variables:
     $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
     $ Survived   : int 0 1 1 1 0 0 0 0 1 1 ...
     $ Pclass     : int 3 1 3 1 3 3 1 3 3 2 ...
     $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
     $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
     $ Age        : num 22 38 26 35 35 NA 54 2 27 14 ...
     $ SibSp      : int 1 1 0 1 0 0 0 3 0 1 ...
     $ Parch      : int 0 0 0 0 0 0 0 1 2 0 ...
     $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
     $ Fare       : num 7.25 71.28 7.92 53.1 8.05 ...
     $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
     $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
    >
    > #ATTRIBUTE ONE WORKING WITH GENDER
    >
    > #Count variables for the survival column with 1 meaning survival
    > table(titanic_training.csv$Survived)
      0   1
    549 342
    >
    > #Proportion of survivors
    > prop.table(table(titanic_training.csv$Survived))
            0         1
    0.6161616 0.3838384
    >
    > #Command to apply the assumption that everyone has died in the test set to create a new column
    > titanic_test.csv$Survived <- rep(0, 418)
    >
    > #Summary to view gender
    > summary(titanic_training.csv$Sex)
    female   male
       314    577
    >
    > #Proportion table to see gender split of male and female with survivors and tragedies based on the full dataset
    > prop.table(table(titanic_training.csv$Sex, titanic_training.csv$Survived))
                      0          1
      female 0.09090909 0.26150393
      male   0.52525253 0.12233446
    >
    > #Proportion table for survivors based on separate genders on a one dimensional row
    > prop.table(table(titanic_training.csv$Sex, titanic_training.csv$Survived),1)
                     0         1
      female 0.2579618 0.7420382
      male   0.8110919 0.1889081
    >
    > #Noone survives unless you're a female
    > titanic_test.csv$Survived <- 0
    > titanic_test.csv$Survived[titanic_test.csv$Sex == 'female'] <- 1
    >
    > #ATTRIBUTE TWO WORKING WITH AGE
    >
    > #Summary of age variable
    > summary(titanic_training.csv$Age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
       0.42   20.12   28.00   29.70   38.00   80.00     177
    >
    > #Create a new column that noone is a child then alter to account for under 18s
    > titanic_training.csv$Child <- 0
    > titanic_training.csv$Child[titanic_training.csv$Age < 18] <- 1
    >
    > #Create a table with target Survived for both gender and age to see the survival numbers
    > aggregate(Survived ~ Child + Sex, data=titanic_training.csv, FUN=sum)
      Child    Sex Survived
    1     0 female      195
    2     1 female       38
    3     0   male       86
    4     1   male       23
    >
    > #Create a table to see the proportion survival of children based on gender
    > aggregate(Survived ~ Child + Sex, data=titanic_training.csv, FUN=function(x) {sum(x)/length(x)})
      Child    Sex  Survived
    1     0 female 0.7528958
    2     1 female 0.6909091
    3     0   male 0.1657033
    4     1   male 0.3965517
    >
    > #ATTRIBUTE THREE WORKING WITH CLASS OR FARE
    >
    > #Bucket fare values to three groupe i.e. 30+ 20-30 and <10
    > titanic_training.csv$Fare2 <- '30+'
    > titanic_training.csv$Fare2[titanic_training.csv$Fare < 30 & titanic_training.csv$Fare >= 20] <- '20-30'
    > titanic_training.csv$Fare2[titanic_training.csv$Fare < 20 & titanic_training.csv$Fare >= 10] <- '10-20'
    > titanic_training.csv$Fare2[titanic_training.csv$Fare < 10] <- '<10'
    >
    > #Aggregate table to explore the gender subset of data for survival rates based on fare prices
    > aggregate(Survived ~ Fare2 + Pclass + Sex, data=titanic_training.csv, FUN=function(x) {sum(x)/length(x)})
       Fare2 Pclass    Sex  Survived
    1  20-30      1 female 0.8333333
    2    30+      1 female 0.9772727
    3  10-20      2 female 0.9142857
    4  20-30      2 female 0.9000000
    5    30+      2 female 1.0000000
    6    <10      3 female 0.5937500
    7  10-20      3 female 0.5813953
    8  20-30      3 female 0.3333333
    9    30+      3 female 0.1250000
    10   <10      1   male 0.0000000
    11 20-30      1   male 0.4000000
    12   30+      1   male 0.3837209
    13   <10      2   male 0.0000000
    14 10-20      2   male 0.1587302
    15 20-30      2   male 0.1600000
    16   30+      2   male 0.2142857
    17   <10      3   male 0.1115385
    18 10-20      3   male 0.2368421
    19 20-30      3   male 0.1250000
    20   30+      3   male 0.2400000
    >
    > #Analysis shows that females paying under 20 are also less likely to survive
    >
    > #DECISION TREES
    > library("rpart")
    >
    > #Formula for d.tree to show target survival using the dataframe (pclass, sex, age, sibsp, parch, fare, embarked) using the dataset (training) and using the decision tree (classification)
    > fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=titanic_training.csv, method="class")
    >
    > #Plotting the decision tree in text format
    > plot(fit)
    > text(fit)
    >
    > #Downloading packages to view decision tree in graphical format
    > install.packages("rattle")
    also installing the dependencies ‘RGtk2’, ‘magrittr’, ‘stringi’
    trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/RGtk2_2.20.31.zip'
    Content type 'application/zip' length 13600625 bytes (13.0 MB)
    downloaded 13.0 MB
    trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/magrittr_1.5.zip'
    Content type 'application/zip' length 149581 bytes (146 KB)
    downloaded 146 KB
    trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/stringi_1.1.1.zip'
    Content type 'application/zip' length 14258493 bytes (13.6 MB)
    downloaded 13.6 MB
    trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/rattle_4.1.0.zip'
    Content type 'application/zip' length 3853612 bytes (3.7 MB)
    downloaded 3.7 MB
    package ‘RGtk2’ successfully unpacked and MD5 sums checked
    package ‘magrittr’ successfully unpacked and MD5 sums checked
    package ‘stringi’ successfully unpacked and MD5 sums checked
    package ‘rattle’ successfully unpacked and MD5 sums checked
    The downloaded binary packages are in
      C:\Users\10331098\AppData\Local\Temp\Rtmpyw9BYv\downloaded_packages
    > install.packages("rpart.plot")
    trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/rpart.plot_2.0.1.zip'
    Content type 'application/zip' length 704700 bytes (688 KB)
    downloaded 688 KB
    package ‘rpart.plot’ successfully unpacked and MD5 sums checked
    The downloaded binary packages are in
      C:\Users\10331098\AppData\Local\Temp\Rtmpyw9BYv\downloaded_packages
    > install.packages("RColorBrewer")
    trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/RColorBrewer_1.1-2.zip'
    Content type 'application/zip' length 26676 bytes (26 KB)
    downloaded 26 KB
    package ‘RColorBrewer’ successfully unpacked and MD5 sums checked
    The downloaded binary packages are in
      C:\Users\10331098\AppData\Local\Temp\Rtmpyw9BYv\downloaded_packages
    > library("rattle")
    Error in inDL(x, as.logical(local), as.logical(now), ...) :
      unable to load shared object 'C:/Users/10331098/Documents/R/R-3.3.1/library/RGtk2/libs/x64/RGtk2.dll':
      LoadLibrary failure: The specified module could not be found.
    trying URL 'http://ftp.gnome.org/pub/gnome/binaries/win64/gtk+/2.22/gtk+-bundle_2.22.1-20101229_win64.zip'
    Content type 'application/zip' length 25830230 bytes (24.6 MB)
    downloaded 24.6 MB
    Learn more about GTK+ at http://www.gtk.org
    If the package still does not load, please ensure that GTK+ is installed and that it is on your PATH environment variable
    IN ANY CASE, RESTART R BEFORE TRYING TO LOAD THE PACKAGE AGAIN
    Rattle: A free graphical interface for data mining with R.
    Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
    Type 'rattle()' to shake, rattle, and roll your data.
    Warning messages:
    1: Failed to load RGtk2 dynamic library, attempting to install it.
    2: In dir.create(config_path, recursive = TRUE) :
      'C:\Users\10331098\Documents\R\R-3.3.1\library\RGtk2\gtk\x64\etc\gtk-2.0' already exists
    > library("rpart.plot")
    > library("RColorBrewer")
    >
    > #Rendering imagery to see the decision tree in a better format
    > fancyRpartPlot(fit)
    >
    > #PREDICTION FROM DECISION TREE CLASSIFICATION
    > Prediction <- predict(fit, titanic_test.csv, type = "class")
    > submit <- data.frame(PassengerId = titanic_test.csv$PassengerId, Survived = Prediction)
    > write.csv(submit, file = "myfirstdtree.csv", row.names = FALSE)
    >
    > #MULTIPLE LINEAR REGRESSION ANALYSIS
    >
    > #Formula for multiple linear regression model to show target survival using the dataframe (pclass, sex, age, sibsp, parch, fare, embarked) using the dataset (training)
    > fit <- lm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=titanic_training.csv)
    > summary(fit) # show results

    Call:
    lm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = titanic_training.csv)

    Residuals:
         Min       1Q   Median       3Q      Max
    -1.09344 -0.22857 -0.06788  0.22815  0.99815

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  1.4919161  0.2813984   5.302 1.54e-07 ***
    Pclass      -0.1873163  0.0228871  -8.184 1.28e-15 ***
    Sexmale     -0.4854369  0.0315205 -15.401  < 2e-16 ***
    Age         -0.0064061  0.0011324  -5.657 2.24e-08 ***
    SibSp       -0.0507454  0.0174357  -2.910  0.00372 **
    Parch       -0.0106791  0.0190483  -0.561  0.57523
    Fare         0.0001963  0.0003468   0.566  0.57162
    EmbarkedC   -0.0895853  0.2735887  -0.327  0.74343
    EmbarkedQ   -0.1883890  0.2822174  -0.668  0.50465
    EmbarkedS   -0.1559910  0.2728347  -0.572  0.56768
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.382 on 704 degrees of freedom
      (177 observations deleted due to missingness)
    Multiple R-squared: 0.4034, Adjusted R-squared: 0.3958
    F-statistic: 52.9 on 9 and 704 DF, p-value: < 2.2e-16

    >
    > #Analysis shows that there are 3x*** for pclass, sexmale and age and a significance of 2* for sibling or spouse to impact on the survival rate.
    > #Adjusted RSquare is below 0.7 so we reject the H0 that it is by fluke. This shows the goodness of fit.
    > #The PValue is: < 2.2e-16, therefore we reject the null hypothesis because p < 0.05.
    >
    > # Scatterplot Matrix
    > survived(~Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,data=titanic_training.csv, main="Simple Scatterplot Matrix")
    Error: could not find function "survived"
    >
    > # Investigating further with 3D scatterplot by installing, library and then displaying
    > install.packages("scatterplot3d")
    trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/scatterplot3d_0.3-37.zip'
    Content type 'application/zip' length 440178 bytes (429 KB)
    downloaded 429 KB
    package ‘scatterplot3d’ successfully unpacked and MD5 sums checked
    The downloaded binary packages are in
      C:\Users\10331098\AppData\Local\Temp\Rtmpyw9BYv\downloaded_packages
    > library("scatterplot3d")
    >
    > # 3D Scatterplot
    > library(scatterplot3d)
    > attach(titanic_training.csv)
    > scatterplot3d(Pclass, Sex, Age, main="3D Scatterplot")
    > # Scatterplot Matrix
    > pairs(~Pclass + Sex + Age,data=titanic_training.csv, main="Simple Scatterplot Matrix")