My Journey Building a Titanic Survival Predictor with Logistic Regression
Author: Juhaib Khan
Introduction
The Titanic dataset is famous among data enthusiasts, but diving into it myself was an entirely new experience. For this project, I set out to predict whether a Titanic passenger survived or not based on factors like age, gender, ticket class, and more. What started as an exploration of the dataset quickly turned into an insightful learning journey that combined data cleaning, machine learning, and app deployment.
I used logistic regression, a simple yet powerful algorithm, to tackle this binary classification problem. Here, I’ll walk you through the steps I took, what I learned along the way, and how I created an interactive app to make survival predictions.
Project Overview
The goal was simple: predict survival. But the process to achieve it? That took some work. The project had six main stages:
Step 1: Data Exploration
I started by loading the Titanic dataset, which contains information about passengers such as their ticket class, age, gender, and fare paid. The dataset also includes the target variable, Survived, where 1 means the passenger survived and 0 means they didn’t.
One thing that struck me was how imbalanced the data was: only about 38% of the passengers in the training set survived.
Using visualizations like histograms and box plots, I explored relationships between features and the target. For instance, survival rates were noticeably higher for women and for first-class passengers.
This exploration helped me decide which features to focus on during model building.
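The exploration step can be sketched in a few lines of pandas. The real project loads the Kaggle train.csv; since that file isn't shown here, a few hand-written rows with the same column names stand in for it (illustrative values, not real passenger records).

```python
# Exploration sketch: overall survival rate, then survival broken down
# by gender and ticket class. Rows below are a stand-in for train.csv.
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 13.0, 51.86],
})

# Overall survival rate -- the first sign of how imbalanced the target is.
print(df["Survived"].mean())

# Survival rate broken down by gender and by ticket class.
print(df.groupby("Sex")["Survived"].mean())
print(df.groupby("Pclass")["Survived"].mean())
```

On the full dataset, the same `groupby` calls are what surface the gender and class gaps mentioned above.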
Step 2: Data Preprocessing
If you’ve worked with real-world data, you know it’s rarely clean or complete. The Titanic dataset was no exception: columns such as Age, Cabin, and Embarked contain missing values, and categorical features like Sex and Embarked had to be encoded numerically before modeling.
After cleaning the data, I was left with a tidy dataset ready for model training.
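The article doesn't list the exact cleaning steps, so the sketch below shows a common recipe for this dataset: impute Age with the median, drop the mostly-empty Cabin column, map Sex to 0/1, and one-hot encode Embarked. A few inline rows stand in for the real CSV.

```python
# Preprocessing sketch (assumed steps, using the standard Kaggle columns).
import pandas as pd

raw = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass":   [3, 1, 2, 3],
    "Sex":      ["male", "female", "female", "male"],
    "Age":      [22.0, None, 26.0, None],
    "Cabin":    [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "Q"],
    "Fare":     [7.25, 71.28, 13.0, 8.05],
})

clean = raw.drop(columns=["Cabin"])                        # too sparse to be useful
clean["Age"] = clean["Age"].fillna(clean["Age"].median())  # impute missing ages
clean["Sex"] = clean["Sex"].map({"male": 0, "female": 1})  # encode gender
clean = pd.get_dummies(clean, columns=["Embarked"], drop_first=True)

print(clean.isna().sum().sum())  # no missing values remain
```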
Step 3: Building the Model
I chose logistic regression for its simplicity and interpretability. It’s a great starting point for binary classification problems like this one.
Using scikit-learn, I split the data into training and testing sets (80% training, 20% testing) and trained the model.
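The split-and-fit step looks roughly like this in scikit-learn. Since the cleaned dataset isn't available here, synthetic features loosely mimicking the real pattern (women and higher classes surviving more often) stand in for it; the split proportions match the 80/20 described above.

```python
# Training sketch: 80/20 split, then a logistic regression fit.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200
X = np.column_stack([
    rng.integers(1, 4, n),   # Pclass
    rng.integers(0, 2, n),   # Sex (0 = male, 1 = female)
    rng.uniform(1, 70, n),   # Age
    rng.uniform(5, 100, n),  # Fare
])
# Synthetic target: survival odds rise for women and for higher classes.
logits = 1.5 * X[:, 1] - 0.7 * (X[:, 0] - 2)
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```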
Step 4: Model Evaluation
Once the model was trained, it was time to see how well it performed. I evaluated it with standard classification metrics such as accuracy, precision, recall, and F1-score.
I also plotted the ROC Curve, which helped visualize how well the model distinguishes between survivors and non-survivors. The curve confirmed that the model was doing a decent job.
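A quick sketch of those evaluation calls, shown on small illustrative label/score arrays rather than the article's actual predictions:

```python
# Evaluation sketch: accuracy, a per-class report, and the ROC/AUC.
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             roc_auc_score, roc_curve)

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])                    # actual outcomes
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])  # predicted P(survived)
y_pred  = (y_score >= 0.5).astype(int)                          # threshold at 0.5

print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=2))

# ROC: one (false-positive-rate, true-positive-rate) point per threshold;
# the area under that curve summarizes how well survivors are ranked
# above non-survivors.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))
```

With a trained model, `y_score` would come from `model.predict_proba(X_test)[:, 1]`.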
Step 5: Interpreting the Model
One of the reasons I like logistic regression is its interpretability. The model assigns a coefficient to each feature, showing how much it influences the survival probability. The strongest effects came from gender and passenger class: being female and traveling in a higher class both increased the predicted probability of survival.
These insights aligned well with historical accounts of the Titanic disaster.
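Inspecting coefficients is a one-liner once the model is fitted. The tiny dataset below is made up for illustration (in it, gender perfectly separates the classes, so its coefficient dominates), but the reading is the same on the real data: a positive coefficient raises the log-odds of survival, and exp(coefficient) is the corresponding odds ratio.

```python
# Coefficient inspection sketch on illustrative data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_names = ["Pclass", "Sex", "Age", "Fare"]
X = np.array([
    [3, 0, 22.0, 7.25],
    [1, 1, 38.0, 71.28],
    [3, 1, 26.0, 7.92],
    [1, 1, 35.0, 53.10],
    [3, 0, 35.0, 8.05],
    [2, 0, 54.0, 26.00],
    [3, 1, 27.0, 11.13],
    [2, 1, 14.0, 30.07],
])
y = np.array([0, 1, 1, 1, 0, 0, 1, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)
coefs = pd.Series(model.coef_[0], index=feature_names)
print(coefs.sort_values())
print(np.exp(coefs))  # odds ratios: >1 raises survival odds, <1 lowers them
```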
Step 6: Deployment with Streamlit
To make the model accessible, I created an interactive app using Streamlit. The app allows users to input passenger details—such as age, gender, and ticket fare—and predicts whether the passenger would have survived.
Setting up Streamlit was surprisingly straightforward. With just a few lines of code, I turned my logistic regression model into an interactive tool.
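A minimal sketch of what such an app can look like. The widget labels and layout here are assumptions, since the article doesn't show its code; Streamlit is imported inside the UI function so the prediction helper stays usable on its own.

```python
# Streamlit app sketch (assumed layout, not the article's actual code).
import numpy as np

def predict_survival(model, pclass, sex, age, fare):
    """Return (predicted label, survival probability) for one passenger."""
    features = np.array([[pclass, sex, age, fare]])
    prob = float(model.predict_proba(features)[0, 1])
    return int(prob >= 0.5), prob

def main(model):
    import streamlit as st

    st.title("Titanic Survival Predictor")
    pclass = st.selectbox("Ticket class", [1, 2, 3])
    sex = 1 if st.radio("Gender", ["male", "female"]) == "female" else 0
    age = st.slider("Age", 0, 80, 30)
    fare = st.number_input("Fare paid", min_value=0.0, value=32.0)

    if st.button("Predict"):
        label, prob = predict_survival(model, pclass, sex, age, fare)
        st.write(f"Survived: {'yes' if label else 'no'} "
                 f"(probability {prob:.0%})")
```

In practice the trained model would be loaded (or trained) at the top of the script, `main(model)` called at module level, and the whole thing launched with `streamlit run app.py`.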
Lessons Learned
Conclusion
This project was a fulfilling experience that taught me a lot about data science workflows. From exploring the data to deploying the final model, every step added to my understanding of machine learning.
If you’re curious, feel free to try the app yourself or reach out with questions. The Titanic dataset is a classic for a reason—it offers so many opportunities to learn and grow as a data scientist.