"Predicting Obesity with Logistic Regression: A Step-by-Step Guide Using Synthetic Data"
Outline for this article:
Logistic regression is a powerful statistical method used to model binary or categorical outcomes. It is commonly applied in cases where the dependent variable takes on two possible values, such as "yes" or "no", "success" or "failure", or "0" or "1". While similar to linear regression in some ways, logistic regression is designed specifically to handle situations where the response variable is categorical, making it ideal for classification tasks.
The logistic regression model estimates the probability that a given input point belongs to a particular class. This is achieved using the logistic function (also known as the sigmoid function), which maps any linear combination of input features to a value between 0 and 1. This result can then be interpreted as the probability of a given outcome, such as predicting whether a customer will make a purchase or whether a patient will develop a disease.
Example of Logistic Regression Applications
Understanding Categorical Variables
Categorical variables are variables that represent qualitative data by grouping observations into distinct categories. These categories do not have a numerical value but rather represent different classifications or characteristics.
Types of Categorical Variables:
Characteristics of Categorical Variables:
In logistic regression, binary categorical variables play a crucial role because they define the target outcome that we are attempting to predict.
Examples
Treatment in Data Analysis
When used in statistical modelling or machine learning:
The Logistic Function (Sigmoid Function)
At the heart of logistic regression is the sigmoid function, which transforms any real-valued number into a value between 0 and 1. This output can be interpreted as the probability of a given observation belonging to one class.
Sigmoid Function: Explanation
The Sigmoid function is a mathematical function that has an "S"-shaped curve and is commonly used in machine learning, statistics, and artificial neural networks. It maps any input value (real value) to a value between 0 and 1, making it useful in applications like binary classification problems.
Python code for Sigmoid function (illustration)
import numpy as np
import matplotlib.pyplot as plt
# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid function (not plotted here, but useful for reference)
def sigmoid_derivative(x):
    sig = sigmoid(x)
    return sig * (1 - sig)
# Plotting the sigmoid function
x = np.linspace(-10, 10, 100)
y = sigmoid(x)
plt.plot(x, y, label='Sigmoid')
plt.title('Sigmoid Function')
plt.xlabel('x')
plt.ylabel('σ(x)')
plt.grid(True)
plt.legend()
plt.show()
Binary Logistic Regression
The most common form is binary logistic regression, where the outcome is either 0 or 1. The logistic function constrains the output of the model between 0 and 1, which represents the probability of a class (e.g., the probability of success or failure).
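As a minimal sketch of this idea (using a toy dataset, not the project data below), scikit-learn's LogisticRegression exposes these class probabilities through predict_proba, and each value is constrained between 0 and 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data: the outcome is 1 when the single feature is positive
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
probs = model.predict_proba([[0.0], [3.0]])[:, 1]  # probability of class 1
print(np.round(probs, 2))  # each value lies between 0 and 1
```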
Types of Logistic Regression
Log-Likelihood and Maximum Likelihood Estimation (MLE)
Logistic regression uses Maximum Likelihood Estimation (MLE) to estimate the model parameters (coefficients). The likelihood function measures how probable the observed data are under a given set of model parameters.
Instead of minimizing the sum of squared residuals (as in linear regression), logistic regression maximizes the log-likelihood function, which represents the probability of the observed data given the model.
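To illustrate what the log-likelihood measures (a minimal sketch; the actual optimization is handled by the fitting library), it can be computed directly for a set of candidate coefficients:

```python
import numpy as np

def log_likelihood(X, y, beta):
    """Log-likelihood of binary labels y under a logistic model with
    coefficients beta. Assumes X already includes an intercept column."""
    z = X @ beta
    p = 1 / (1 + np.exp(-z))  # predicted probabilities via the sigmoid
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny illustration: two observations, intercept plus one feature
X = np.array([[1.0, 0.5], [1.0, -1.2]])
y = np.array([1, 0])
beta = np.array([0.1, 0.8])
print(log_likelihood(X, y, beta))  # always <= 0; MLE picks beta to maximize it
```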
Key Assumptions of Logistic Regression
Odds and Log-Odds:
The Odds Ratio (OR) is a statistical measure used to quantify the strength of association between two events or variables. It is commonly used in epidemiology, clinical research, and various other fields to assess the relationship between an exposure and an outcome. The odds ratio compares the odds of an event occurring in one group to the odds of it occurring in another group.
In logistic regression, we often talk about odds and log-odds:
The logistic regression equation models the log-odds as a linear combination of the independent variables.
Example of Odds:
Let’s say you are predicting whether a person will purchase a product, and you know the probability of purchase is 0.8 (80%) and the probability of not purchasing is 0.2 (20%).
Odds = 0.8 / 0.2 = 4
This means that the odds of purchasing the product are 4 to 1, or in simpler terms, the person is 4 times more likely to purchase than not purchase.
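The arithmetic above can be verified in a few lines of Python:

```python
import math

p = 0.8                    # probability of purchase
odds = p / (1 - p)         # 0.8 / 0.2 = 4.0
log_odds = math.log(odds)  # natural log of the odds

print(f"Odds: {odds:.1f}")          # Odds: 4.0
print(f"Log-odds: {log_odds:.3f}")
```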
Interpretation of Odds Ratios:
When the odds ratio (OR) = 1, it means that the predictor variable has no effect on the odds of the outcome occurring. In other words, a one-unit change in the predictor variable does not increase or decrease the likelihood of the outcome. Essentially, the odds of the event occurring are the same regardless of the value of that predictor variable.
Example with Real Data for Odds Ratio = 1
For example, suppose you are analyzing the effect of education level on whether a person buys a specific type of insurance. You run a logistic regression and find that the odds ratio for education level is 1. This means that people with different levels of education (e.g., high school diploma, bachelor’s degree, master’s degree) have the same odds of buying that type of insurance; education level is not a factor in determining whether or not they make the purchase.
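In practice, an estimated odds ratio is obtained by exponentiating a fitted coefficient. The sketch below uses hypothetical synthetic data in which the predictor has no real effect on the outcome, so the estimated OR should land near 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one predictor generated independently of the outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))     # e.g. a standardized education score
y = rng.integers(0, 2, size=1000)  # outcome unrelated to X

model = LogisticRegression().fit(X, y)
odds_ratio = np.exp(model.coef_[0][0])  # OR for a one-unit increase in X
print(f"Estimated odds ratio: {odds_ratio:.3f}")  # close to 1: no association
```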
Why Use Log-Odds?
In logistic regression, we are modeling the log-odds as a linear combination of the input variables. This transformation allows us to handle binary outcomes (which are otherwise non-linear in nature) and express them in a way that makes it possible to use a regression model.
The log-odds scale is linear and unbounded, which is why it's preferable for modeling in logistic regression. This linear property makes it easier to fit models using linear techniques.
How are Odds and Log-Odds Related to Logistic Regression?
In logistic regression, we are interested in modeling the relationship between one or more predictor variables X and the probability P(Y=1) of a binary outcome Y. Logistic regression models the log-odds of the outcome as a linear function of the predictor variables:

log( P(Y=1) / (1 − P(Y=1)) ) = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ

This equation shows that the log-odds of the outcome (e.g., success/failure) are linearly related to the predictor variables. By exponentiating both sides, we can transform the equation back into the odds.
Applying the logistic (sigmoid) function then gives the probability directly:

P(Y=1) = 1 / (1 + e^(−(β₀ + β₁X₁ + ... + βₙXₙ)))

This is essentially the inverse of the logit function, and it provides the probability of the event occurring.
Estimated Odds Ratio
Estimated Probability of an Event of Interest
Predicting Probability from Log-Odds:
Once you have the log-odds from a logistic regression model, you can easily convert it back to a probability using the inverse-logit formula:

p = 1 / (1 + e^(−log-odds))
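A small helper makes this conversion concrete (the function name log_odds_to_probability is illustrative):

```python
import math

def log_odds_to_probability(log_odds):
    """Invert the logit: p = 1 / (1 + e^(-log_odds))."""
    return 1 / (1 + math.exp(-log_odds))

print(round(log_odds_to_probability(0.0), 3))          # log-odds of 0 -> 50/50
print(round(log_odds_to_probability(math.log(4)), 3))  # recovers the odds-of-4 example
```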
Interpreting Coefficients in Logistic Regression
Logistic Regression Model Evaluation Metrics
There are several ways to evaluate the performance of a logistic regression model:
Regularization in Logistic Regression
To avoid overfitting, regularization techniques such as L1 (Lasso) and L2 (Ridge) regularization can be applied to logistic regression. These methods penalize large coefficients, preventing the model from becoming too complex and fitting the noise in the training data.
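A brief sketch of how these penalties are selected in scikit-learn (synthetic data; note that the parameter C is the inverse of the regularization strength, so smaller C means stronger regularization):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# Outcome depends only on the first feature; the other four are noise
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# L2 (Ridge) shrinks coefficients; L1 (Lasso) can zero them out entirely
ridge = LogisticRegression(penalty='l2', C=1.0).fit(X, y)
lasso = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X, y)

print("L2 coefficients:", np.round(ridge.coef_[0], 3))
print("L1 coefficients:", np.round(lasso.coef_[0], 3))
```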
Applications of Logistic Regression
Limitations of Logistic Regression
Summary of Key Concepts:
Project Overview: Predicting Obesity Using Logistic Regression
In this project, we utilize logistic regression to predict whether individuals in a synthetic dataset are obese based on their age and weight. The dataset is generated with 500 samples, with each individual assigned a randomly generated UserID, Age (between 18 and 80), and Weight. The Obesity Status is simulated as a binary outcome, where older individuals have a higher probability of being obese.
Illustration of the logistic model using the weight of the sample population to predict obesity
Brief Explanation of the Code
This code performs logistic regression on a synthetic dataset that simulates data about obesity based on age and weight. Here's a breakdown of the steps in the code:
1. Data Generation:
Weight and Obesity Status Relationship:
The generated dataset is saved in an Excel file named obesity_data.xlsx.
2. Logistic Regression Model:
Logistic Regression Classification:
Weight Distribution and Classification:
Model Evaluation
The model's performance is evaluated using the following metrics:
Saving Outputs:
Visualizations:
Several visualizations are provided to help understand the model's performance:
Summary of Python Code for Logistic Regression using Synthetic Data
# Logistic Regression on Synthetic Obesity Data with Visualizations and Excel Output
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
confusion_matrix, accuracy_score, classification_report,
roc_curve, auc
)
from matplotlib.colors import ListedColormap
import seaborn as sns
# Setting the random seed for reproducibility
np.random.seed(0)
# Number of samples
num_samples = 500
# Generating User IDs
user_ids = np.arange(1, num_samples + 1)
# Generating ages between 18 and 80
ages = np.random.randint(18, 81, size=num_samples)
# Generating obesity status based on age
obesity_prob_by_age = (ages - 18) / (80 - 18) # Probability increases with age
obesity_status = np.random.binomial(1, obesity_prob_by_age * 0.6) # Adjusted probability
# Generating weights based on obesity status
weights = []
for age, obese in zip(ages, obesity_status):
    if obese:
        # Obese individuals: mean weight increases with age
        mean_weight = 80 + (age - 18) * 0.5
        weight = np.random.normal(mean_weight, 10)
    else:
        # Non-obese individuals
        mean_weight = 60 + (age - 18) * 0.3
        weight = np.random.normal(mean_weight, 8)
    weights.append(weight)
weights = np.array(weights)
# Creating the DataFrame
df = pd.DataFrame({
'UserID': user_ids,
'Age': ages,
'Weight': weights,
'Obese': obesity_status
})
# Saving the dataset to an Excel file
df.to_excel('obesity_data.xlsx', index=False)
print("Dataset saved to 'obesity_data.xlsx'.")
# --------------------------------------------
# Reading the data and performing Logistic Regression
# --------------------------------------------
# Reading the data from the Excel file
df = pd.read_excel('obesity_data.xlsx')
# Features and target variable
X = df[['UserID', 'Age', 'Weight']]
y = df['Obese']
# Splitting the dataset into Training and Test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=0
)
# Extract UserID from X_train and X_test
X_train_userid = X_train['UserID'].values
X_test_userid = X_test['UserID'].values
# Extract Age and Weight for scaling
X_train_features = X_train[['Age', 'Weight']]
X_test_features = X_test[['Age', 'Weight']]
# Feature Scaling
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train_features)
X_test_scaled = sc.transform(X_test_features)
# Training the Logistic Regression Model
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_scaled, y_train)
# Making Predictions and Evaluating the Model
y_pred = classifier.predict(X_test_scaled)
y_prob = classifier.predict_proba(X_test_scaled)[:, 1]
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
print(f"\nAccuracy: {acc*100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# --------------------------------------------
# Saving Outputs to Excel
# --------------------------------------------
# Creating a DataFrame with the test set results
results_df = pd.DataFrame({
'UserID': X_test_userid,
'Age': X_test_features['Age'].values,
'Weight': X_test_features['Weight'].values,
'Actual Obese': y_test.values,
'Predicted Obese': y_pred,
'Predicted Probability': y_prob
})
# Optionally, sort the results by UserID
results_df = results_df.sort_values(by='UserID')
# Save results to an Excel file
with pd.ExcelWriter('obesity_model_outputs.xlsx') as writer:
    # Write the results DataFrame to a sheet
    results_df.to_excel(writer, sheet_name='Test Set Predictions', index=False)
    # Save the confusion matrix as a DataFrame
    cm_df = pd.DataFrame(cm, index=['Actual Not Obese', 'Actual Obese'],
                         columns=['Predicted Not Obese', 'Predicted Obese'])
    cm_df.to_excel(writer, sheet_name='Confusion Matrix')
    # Save the classification report as a DataFrame
    report = classification_report(y_test, y_pred, output_dict=True)
    report_df = pd.DataFrame(report).transpose()
    report_df.to_excel(writer, sheet_name='Classification Report')
print("\nOutputs have been saved to 'obesity_model_outputs.xlsx'.")
# --------------------------------------------
# Visualizations (Optional)
# --------------------------------------------
# 1. Decision Boundary Plot for Training Set
X_set, y_set = X_train_scaled, y_train
# Create meshgrid
X1, X2 = np.meshgrid(
np.arange(start=X_set[:, 0].min() - 0.5, stop=X_set[:, 0].max() + 0.5, step=0.01),
np.arange(start=X_set[:, 1].min() - 0.5, stop=X_set[:, 1].max() + 0.5, step=0.01)
)
plt.figure(figsize=(12, 6))
plt.contourf(
X1, X2,
classifier.predict(
np.array([X1.ravel(), X2.ravel()]).T
).reshape(X1.shape),
alpha=0.3, cmap=ListedColormap(('lightblue', 'lightcoral'))
)
plt.scatter(X_set[y_set == 0, 0], X_set[y_set == 0, 1],
c='blue', label='Not Obese', edgecolor='k', s=20)
plt.scatter(X_set[y_set == 1, 0], X_set[y_set == 1, 1],
c='red', label='Obese', edgecolor='k', s=20)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age (Standardized)')
plt.ylabel('Weight (Standardized)')
plt.legend()
plt.show()
# 2. Decision Boundary Plot for Test Set
X_set, y_set = X_test_scaled, y_test
# Create meshgrid
X1, X2 = np.meshgrid(
np.arange(start=X_set[:, 0].min() - 0.5, stop=X_set[:, 0].max() + 0.5, step=0.01),
np.arange(start=X_set[:, 1].min() - 0.5, stop=X_set[:, 1].max() + 0.5, step=0.01)
)
plt.figure(figsize=(12, 6))
plt.contourf(
X1, X2,
classifier.predict(
np.array([X1.ravel(), X2.ravel()]).T
).reshape(X1.shape),
alpha=0.3, cmap=ListedColormap(('lightblue', 'lightcoral'))
)
plt.scatter(X_set[y_set == 0, 0], X_set[y_set == 0, 1],
c='blue', label='Not Obese', edgecolor='k', s=20)
plt.scatter(X_set[y_set == 1, 0], X_set[y_set == 1, 1],
c='red', label='Obese', edgecolor='k', s=20)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age (Standardized)')
plt.ylabel('Weight (Standardized)')
plt.legend()
plt.show()
# 3. Confusion Matrix Heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
# 4. ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2,
label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
Output from the above code:
Dataset saved to 'obesity_data.xlsx'.

Confusion Matrix:
[[89  1]
 [ 6 29]]

Accuracy: 94.40%

Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.99      0.96        90
           1       0.97      0.83      0.89        35

    accuracy                           0.94       125
   macro avg       0.95      0.91      0.93       125
weighted avg       0.95      0.94      0.94       125

Outputs have been saved to 'obesity_model_outputs.xlsx'.
Summary of Logistic Regression Model Output
This project aimed to predict obesity based on age and weight using a logistic regression model. Here’s a concise summary of the key results and performance metrics:
Summary of Key Insights
Overall Conclusion
The logistic regression model performs well in predicting obesity, showing high accuracy and precision. While the recall for predicting obesity is slightly lower, the overall results suggest the model is effective for binary classification tasks involving health indicators like obesity. The results have been saved in an Excel file (obesity_model_outputs.xlsx), which contains detailed predictions, the confusion matrix, and the classification report for further analysis.
Thank you for reading, and we look forward to continuing this journey of predictive analytics with you!
We look forward to applying the logistic regression model to specific real-world datasets in the next article!