Titanic Survival Prediction

Assignment 7: Titanic Survival Prediction

This assignment is worth 80 points, four times the normal assignment weight. The goal of the project is to predict which types of passengers were more likely to survive. The available features include Name, Age, Gender, Fare Class, etc. A data dictionary is provided in the appendix. The data is partitioned into (1) ProjectTrain.csv and (2) ProjectTest.csv. Use the Train data to develop the models and report performance results on the Test dataset.

1) Develop Logistic Regression, LDA, QDA, and KNN based survival prediction models using Pclass, Sex, Age, SibSp, Parch, and Embarked as predictor variables. Note that some of these variables may need to be cast as categorical (factors in R). Also, Age has a lot of missing values, which may need to be imputed (e.g., with the mean) before the variable can be used. Try a few values of K in KNN to determine a suitable value. Compare and interpret the True Positive (TP) and False Positive (FP) results of the different models using the test data. 40 Points

2) “Cabin” has sparse data content. One approach to handling the missing data is to use a special value, “Not Available”, for all the missing values. For the Logistic Regression model, evaluate the performance improvement by comparing results with and without the Cabin feature on the test data. 10 Points

3) Like linear regression, logistic regression (LR) has the advantage of interpretability. Research the concepts of “Unadjusted Odds Ratio” and “Adjusted Odds Ratio”. Determine the adjusted odds ratios for Sex, Pclass, and Embarked using LR. Interpret the results. 10 Points

4) The default threshold for assigning an observation to a class is 0.5. For the LR models, vary the threshold to 0.8, 0.5, and 0.2. Which threshold value do you think is appropriate for survival prediction? Why? Justify your answer with respect to the misclassification rate on the test data. 10 Points

5) Develop an ROC plot for the LDA model. 5 Points

6) What features do you think are important for making the prediction? Why? Evaluate the KNN model's performance using only the important features. 5 Points
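For the preprocessing mentioned in questions 1 and 2, a minimal R sketch might look like the following. It uses a small made-up data frame rather than ProjectTrain.csv, so every value in it is an illustrative assumption; only the column names Pclass, Sex, Age, and Cabin come from the assignment.

```r
# Toy data frame standing in for ProjectTrain.csv; all values are made up.
toy <- data.frame(
  Pclass = c(1, 3, 2, 3, 1),
  Sex    = c("female", "male", "female", "male", "male"),
  Age    = c(29, NA, 40, NA, 51),
  Cabin  = c("C85", "", NA, "", "E46"),
  stringsAsFactors = FALSE
)

# Cast categorical predictors to factors so that modelling functions
# treat them as classes rather than as numbers or free text.
toy$Pclass <- factor(toy$Pclass)
toy$Sex    <- factor(toy$Sex)

# Mean imputation: replace missing ages with the mean of the observed ages.
age_mean <- mean(toy$Age, na.rm = TRUE)
toy$Age[is.na(toy$Age)] <- age_mean

# Give missing or empty Cabin entries the special value "Not Available",
# then treat Cabin as a factor with that value as one of its levels.
toy$Cabin[is.na(toy$Cabin) | toy$Cabin == ""] <- "Not Available"
toy$Cabin <- factor(toy$Cabin)

print(toy)
```

With the real files, one reasonable choice is to compute the mean age on the Train data and reuse that same value to fill the Test data, so that the test set does not influence the model.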
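Questions 1 and 4 both depend on turning predicted probabilities into classes and counting the errors. The sketch below shows one way to do that in base R. The labels and probabilities are made up for illustration, and evaluate_threshold is a hypothetical helper name, not a function from any package.

```r
# Made-up labels (1 = survived) and predicted probabilities, for illustration only.
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
prob   <- c(0.90, 0.85, 0.60, 0.30, 0.70, 0.40, 0.25, 0.15, 0.10, 0.05)

# Hypothetical helper: classify at a given threshold and summarise the errors.
evaluate_threshold <- function(actual, prob, threshold) {
  predicted <- as.integer(prob >= threshold)
  tp <- sum(predicted == 1 & actual == 1)  # true positives
  fp <- sum(predicted == 1 & actual == 0)  # false positives
  fn <- sum(predicted == 0 & actual == 1)  # false negatives
  tn <- sum(predicted == 0 & actual == 0)  # true negatives
  data.frame(
    threshold = threshold,
    TP = tp, FP = fp, FN = fn, TN = tn,
    TPR = tp / (tp + fn),
    FPR = fp / (fp + tn),
    misclassification = (fp + fn) / length(actual)
  )
}

# Evaluate the three thresholds named in question 4.
results <- do.call(
  rbind,
  lapply(c(0.8, 0.5, 0.2), function(t) evaluate_threshold(actual, prob, t))
)
print(results)
```

With real models, prob would be replaced by the predicted survival probabilities on the test set, for example from predict(fit, newdata = test, type = "response") for a glm fit.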
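For question 3, the difference between unadjusted and adjusted odds ratios can be sketched with two logistic regression fits. The counts below are invented for illustration, and Survived is an assumed name for the outcome column, which the assignment text does not spell out.

```r
# Made-up cell counts for illustration only; "Survived" is an assumed column name.
cells <- data.frame(
  Sex      = c("female", "female", "female", "female", "male", "male", "male", "male"),
  Pclass   = c("1", "1", "3", "3", "1", "1", "3", "3"),
  Survived = c(1, 0, 1, 0, 1, 0, 1, 0),
  n        = c(9, 1, 6, 4, 4, 6, 1, 9)
)

# Expand the counts into one row per passenger.
toy <- cells[rep(seq_len(nrow(cells)), cells$n), c("Sex", "Pclass", "Survived")]
toy$Sex    <- factor(toy$Sex)
toy$Pclass <- factor(toy$Pclass)

# Unadjusted: Sex is the only predictor.
fit_unadjusted <- glm(Survived ~ Sex, data = toy, family = binomial)

# Adjusted: effect of Sex with Pclass also in the model.
fit_adjusted <- glm(Survived ~ Sex + Pclass, data = toy, family = binomial)

# Exponentiating a logistic regression coefficient gives an odds ratio
# relative to the reference level (here "female" and Pclass "1").
or_unadjusted <- exp(coef(fit_unadjusted))
or_adjusted   <- exp(coef(fit_adjusted))
print(or_unadjusted)
print(or_adjusted)
```

In this toy data the unadjusted odds ratio for male versus female is (5/15) / (15/5) = 1/9, which the single-predictor model reproduces. The adjusted value comes from the model that also contains Pclass and is read as the odds ratio for Sex with Pclass held fixed.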
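For question 5, an ROC curve can be built by hand by sweeping the classification threshold and recording the true and false positive rates at each step. The sketch below uses made-up scores; in the assignment the scores would presumably be the posterior survival probabilities from the LDA model (for a model fitted with lda from the MASS package, predict returns these in its posterior component).

```r
# Made-up labels (1 = survived) and scores, for illustration only.
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
score  <- c(0.90, 0.85, 0.60, 0.30, 0.70, 0.40, 0.25, 0.15, 0.10, 0.05)

# Use every observed score as a threshold, from strictest to most lenient.
# Inf is included so the curve starts at the point (0, 0).
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))

# True and false positive rates at each threshold.
tpr <- sapply(thresholds, function(t) sum(score >= t & actual == 1) / sum(actual == 1))
fpr <- sapply(thresholds, function(t) sum(score >= t & actual == 0) / sum(actual == 0))

# ROC curve: true positive rate against false positive rate.
plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate",
     main = "ROC curve (toy data)")
abline(0, 1, lty = 2)  # reference line for a random classifier
```

A curve that bends toward the top-left corner indicates a model that separates survivors from non-survivors better than the dashed diagonal.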

In the report, include text of the R code.

Submit through link: eCampus -> Assignment 7

Deadline: March 18, 11:55 PM

Data Dictionary

Variable Notes

pclass: A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way…

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.
