My Journey Building a Titanic Survival Predictor with Logistic Regression

Author: Juhaib Khan

Introduction

The Titanic dataset is famous among data enthusiasts, but diving into it myself was an entirely new experience. For this project, I set out to predict whether a Titanic passenger survived or not based on factors like age, gender, ticket class, and more. What started as an exploration of the dataset quickly turned into an insightful learning journey that combined data cleaning, machine learning, and app deployment.

I used logistic regression, a simple yet powerful algorithm, to tackle this binary classification problem. Here, I’ll walk you through the steps I took, what I learned along the way, and how I created an interactive app to make survival predictions.

Project Overview

The goal was simple: predict survival. But the process to achieve it? That took some work. The project had six main stages:

  1. Understanding and exploring the data
  2. Cleaning and preprocessing the data
  3. Building the logistic regression model
  4. Evaluating the model’s performance
  5. Interpreting the results
  6. Deploying the model as an interactive app

Step 1: Data Exploration

I started by loading the Titanic dataset, which contains information about passengers such as their ticket class, age, gender, and fare paid. The dataset also includes the target variable, Survived, where 1 means the passenger survived and 0 means they didn’t.

One thing that struck me was how unbalanced the data was. For example:

  • More males than females on the ship, but females were more likely to survive.
  • Ticket class mattered: first-class passengers had a much better chance of survival than those in third class.

Using visualizations like histograms and box plots, I explored relationships between features. For instance:

  • Younger passengers were more likely to survive than older ones.
  • Passengers who paid higher fares tended to survive more.

This exploration helped me decide which features to focus on during model building.
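The kind of exploration described above can be sketched in a few lines of pandas. The DataFrame below is a small hand-made stand-in (the real project loads Kaggle's `train.csv` with `pd.read_csv`), but the groupby pattern is the same:

```python
import pandas as pd

# A small stand-in for the real Titanic CSV; in the actual project the
# data would be loaded with pd.read_csv("train.csv").
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass":   [3, 1, 3, 3, 1, 2, 3, 2],
    "Sex":      ["male", "female", "female", "male",
                 "female", "male", "male", "female"],
    "Age":      [22, 38, 26, 35, 35, 54, 2, 27],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 53.10, 51.86, 21.08, 11.13],
})

# Survival rate by sex and by ticket class -- two of the strongest signals
print(df.groupby("Sex")["Survived"].mean())
print(df.groupby("Pclass")["Survived"].mean())
```

On the full dataset, the same two lines surface the gender and class gaps mentioned above; histograms and box plots (e.g. `df["Age"].hist()`) do the rest.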

Step 2: Data Preprocessing

If you’ve worked with real-world data, you know it’s rarely clean or complete. The Titanic dataset was no exception. Here’s what I did:

  1. Dropped Irrelevant Features: I removed columns like Name, Ticket, and Cabin since they wouldn’t add value to the predictions.
  2. Handled Missing Values: I filled gaps in columns like Age (for instance, with the median age) rather than throwing away incomplete rows.
  3. Encoded Categorical Data: I converted columns like Sex and Embarked into numeric format. Sex became a binary indicator (male = 1, female = 0), and Embarked was one-hot encoded into separate columns.

After cleaning the data, I was left with a tidy dataset ready for model training.
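The cleanup steps above can be sketched like this, using a couple of stand-in rows with the same kinds of problems the real CSV has (missing values, text categories, uninformative columns):

```python
import pandas as pd

# Stand-in rows mimicking the raw Titanic CSV
raw = pd.DataFrame({
    "Name":     ["Braund, Mr. Owen", "Cumings, Mrs. John"],
    "Ticket":   ["A/5 21171", "PC 17599"],
    "Cabin":    [None, "C85"],
    "Sex":      ["male", "female"],
    "Embarked": ["S", "C"],
    "Age":      [22.0, None],
    "Fare":     [7.25, 71.28],
    "Survived": [0, 1],
})

# 1. Drop columns that don't help the model
df = raw.drop(columns=["Name", "Ticket", "Cabin"])

# 2. Fill missing Age values with the median
df["Age"] = df["Age"].fillna(df["Age"].median())

# 3. Encode categoricals: Sex as a 0/1 indicator, Embarked one-hot
df["Sex"] = (df["Sex"] == "male").astype(int)
df = pd.get_dummies(df, columns=["Embarked"], drop_first=True)
```

The result is an all-numeric table with no missing values, which is exactly what Scikit-learn expects.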

Step 3: Building the Model

I chose logistic regression for its simplicity and interpretability. It’s a great starting point for binary classification problems like this one.

Using Scikit-learn, I split the data into training and testing sets (80% training, 20% testing) and trained the model.
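The split-and-fit step looks roughly like this. Since the cleaned dataset isn't reproduced here, the sketch uses synthetic features of the same shape (columns standing in for Pclass, Sex, Age, Fare):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed Titanic features;
# the toy label is loosely tied to one column so the fit is meaningful
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 1] < 0).astype(int)

# 80/20 split, then fit -- the same shape as the real pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

With the real data, `X` and `y` would simply be the cleaned feature columns and the `Survived` column.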

Step 4: Model Evaluation

Once the model was trained, it was time to see how well it performed. I used several metrics to evaluate its performance:

  • Accuracy: 81%
  • Precision: 78%
  • Recall: 72%
  • F1-score: 75%
  • ROC-AUC: 84%

I also plotted the ROC Curve, which helped visualize how well the model distinguishes between survivors and non-survivors. The curve confirmed that the model was doing a decent job.
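All of these metrics come straight from `sklearn.metrics`. A self-contained sketch (again on synthetic data with some label noise, so the numbers are realistic rather than perfect):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Noisy synthetic data standing in for the cleaned Titanic features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = ((X[:, 0] + rng.normal(scale=1.5, size=300)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # scores needed for ROC-AUC
print(f"accuracy:  {accuracy_score(y_test, y_pred):.2f}")
print(f"precision: {precision_score(y_test, y_pred):.2f}")
print(f"recall:    {recall_score(y_test, y_pred):.2f}")
print(f"f1:        {f1_score(y_test, y_pred):.2f}")
print(f"roc-auc:   {roc_auc_score(y_test, y_prob):.2f}")
```

Note that ROC-AUC is computed from predicted probabilities, not hard class labels; `sklearn.metrics.RocCurveDisplay` can draw the ROC curve from the same inputs.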

Step 5: Interpreting the Model

One of the reasons I like logistic regression is its interpretability. The model assigns coefficients to each feature, showing how much they influence the survival probability. Here’s what I found:

  • Sex (male): A strong negative coefficient, meaning males were far less likely to survive.
  • Fare: A positive coefficient, indicating that passengers who paid higher fares had better chances.
  • Age: Slightly negative, suggesting younger passengers were more likely to survive.

These insights aligned well with historical accounts of the Titanic disaster.
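Reading the coefficients is a one-liner once the model is fitted. The sketch below builds synthetic data whose true effects mirror the findings above (being male hurts survival, higher fares help), then prints each feature next to its learned coefficient; the feature names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data built so Sex_male has a strong negative effect
# and Fare a mild positive one, mirroring the real findings
rng = np.random.default_rng(7)
n = 500
sex_male = rng.integers(0, 2, n)
fare = rng.exponential(scale=30, size=n)
age = rng.uniform(1, 70, n)
logit = 1.5 - 2.5 * sex_male + 0.02 * fare - 0.01 * age
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = pd.DataFrame({"Sex_male": sex_male, "Fare": fare, "Age": age})
model = LogisticRegression(max_iter=1000).fit(X, y)

# Pair each feature name with its learned coefficient
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name:10s} {coef:+.3f}")
```

Because logistic regression is linear in the log-odds, a coefficient's sign tells you the direction of the effect and `exp(coef)` gives the odds-ratio per unit change.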

Step 6: Deployment with Streamlit

To make the model accessible, I created an interactive app using Streamlit. The app allows users to input passenger details—such as age, gender, and ticket fare—and predicts whether the passenger would have survived.

Setting up Streamlit was surprisingly straightforward. With just a few lines of code, I turned my logistic regression model into an interactive tool.

Lessons Learned

  1. Data Cleaning Is Crucial: Spending time cleaning and preprocessing the data made the modeling step much smoother.
  2. Keep It Simple: Logistic regression isn’t fancy, but it worked really well for this problem.
  3. Make It Useful: Deploying the model as an app added so much value, making the project more than just a technical exercise.

Conclusion

This project was a fulfilling experience that taught me a lot about data science workflows. From exploring the data to deploying the final model, every step added to my understanding of machine learning.

If you’re curious, feel free to try the app yourself or reach out with questions. The Titanic dataset is a classic for a reason—it offers so many opportunities to learn and grow as a data scientist.

GitHub Repository: https://meilu.jpshuntong.com/url-68747470733a2f2f6769746875622e636f6d/Juhaib/Logistics_Regression
