"Predicting Obesity with Logistic Regression: A Step-by-Step Guide Using Synthetic Data"

"Predicting Obesity with Logistic Regression: A Step-by-Step Guide Using Synthetic Data"

Outline for this article:

  • Introduction
  • Logistic Regression Overview
  • Understanding Categorical Variables
  • Sigmoid Function in Logistic Regression
  • Types of Logistic Regression
  • Key Assumptions of Logistic Regression
  • Odds, Log-Odds, and Odds Ratios
  • Project Overview
  • Classification of Obesity Using Weight Criteria
  • Logistic Regression Model Implementation
  • Model Evaluation
  • Conclusion

Logistic regression is a powerful statistical method used to model binary or categorical outcomes. It is commonly applied in cases where the dependent variable takes on two possible values, such as "yes" or "no", "success" or "failure", or "0" or "1". While similar to linear regression in some ways, logistic regression is designed specifically to handle situations where the response variable is categorical, making it ideal for classification tasks.

The logistic regression model estimates the probability that a given input point belongs to a particular class. This is achieved using the logistic function (also known as the sigmoid function), which maps any linear combination of input features to a value between 0 and 1. This result can then be interpreted as the probability of a given outcome, such as predicting whether a customer will make a purchase or whether a patient will develop a disease.

Example of Logistic Regression Applications

  • Customer Purchase Prediction: Predicting whether a customer will buy a product based on demographic factors such as age and income.
  • Medical Diagnosis: Predicting whether a patient has a disease, such as cancer or diabetes, based on health metrics and other characteristics.

Understanding Categorical Variables

Categorical variables are variables that represent qualitative data by grouping observations into distinct categories. These categories do not have a numerical value but rather represent different classifications or characteristics.

Types of Categorical Variables:

  • Nominal Variables: Categories with no intrinsic order (e.g., Gender: Male, Female, Other).
  • Ordinal Variables: Categories with a defined order, but the difference between them is not quantifiable (e.g., Education levels: High School, Bachelor's, Master's).
  • Binary Variables: A specific type of categorical variable with only two possible values (e.g., Yes/No, Success/Failure).

Characteristics of Categorical Variables:

  • Limited Values: A fixed number of possible categories.
  • Qualitative Nature: Representing qualities rather than quantities.
  • Non-Numeric Representation: Even if represented by numbers, the values do not have mathematical significance.

In logistic regression, binary categorical variables play a crucial role because they define the target outcome that we are attempting to predict.

Examples

  • Demographic Information: Age groups (18-25, 26-35), marital status (Married, Single).
  • Product Categories: Types of products (Electronics, Clothing, Food).
  • Health Status: Disease status (Healthy, Sick, Recovered).

Treatment in Data Analysis

When used in statistical modelling or machine learning:

  • Dummy Variables: Categorical variables must often be converted into dummy variables for regression analysis. This involves creating binary columns for each category while excluding one as a reference to avoid multicollinearity (see the sketch after this list).
  • Statistical Analysis: Categorical data can be summarized using frequency tables and visualized through bar charts or pie charts. Statistical tests like Chi-square tests are often employed to analyze relationships involving categorical variables.
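
As a minimal sketch (with a made-up marital_status column), pandas can create dummy variables with get_dummies, dropping one category as the reference:

import pandas as pd

# Hypothetical data with a single categorical column
df = pd.DataFrame({'marital_status': ['Married', 'Single', 'Single', 'Married']})

# One-hot encode, dropping the first category as the reference
# to avoid multicollinearity (the "dummy variable trap")
dummies = pd.get_dummies(df, columns=['marital_status'], drop_first=True)
print(dummies)  # the remaining column flags Single; Married is the reference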

The Logistic Function (Sigmoid Function)

At the heart of logistic regression is the sigmoid function, which transforms any real-valued number into a value between 0 and 1. This output can be interpreted as the probability of a given observation belonging to one class.

Sigmoid Function: Explanation

The sigmoid function is a mathematical function with an "S"-shaped curve, commonly used in machine learning, statistics, and artificial neural networks. Defined as σ(x) = 1 / (1 + e^(-x)), it maps any real-valued input to a value between 0 and 1, making it useful in applications like binary classification problems.


Python code for Sigmoid function (illustration)

import numpy as np
import matplotlib.pyplot as plt

# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_derivative(x):
    sig = sigmoid(x)
    return sig * (1 - sig)

# Plotting the sigmoid function and its derivative
x = np.linspace(-10, 10, 100)
y = sigmoid(x)

plt.plot(x, y, label='Sigmoid')
plt.plot(x, sigmoid_derivative(x), linestyle='--', label='Derivative')
plt.title('Sigmoid Function')
plt.xlabel('x')
plt.ylabel('σ(x)')
plt.grid(True)
plt.legend()
plt.show()


Binary Logistic Regression

The most common form is binary logistic regression, where the outcome is either 0 or 1. The logistic function constrains the output of the model between 0 and 1, which represents the probability of a class (e.g., the probability of success or failure).

Types of Logistic Regression

  1. Binary Logistic Regression: Outcome variable has two possible values (e.g., Yes/No).
  2. Multinomial Logistic Regression: Outcome variable has more than two categories without any order (e.g., classifying different species of plants).
  3. Ordinal Logistic Regression: Outcome variable has more than two categories, but these categories have a meaningful order (e.g., low, medium, high).
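
As a brief, hedged sketch: scikit-learn's LogisticRegression handles both the binary and multinomial cases with the same estimator (ordinal logistic regression needs a dedicated tool, such as statsmodels' OrderedModel). Here it is applied to the three unordered classes of the iris dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# The iris dataset has three unordered classes, so this is a multinomial problem
X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:2]))  # one probability per class; each row sums to 1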

Log-Likelihood and Maximum Likelihood Estimation (MLE)

Logistic regression uses Maximum Likelihood Estimation (MLE) to estimate the model parameters (coefficients). The likelihood function measures how probable the observed data are under a given set of model parameters.

Instead of minimizing the sum of squared residuals (as in linear regression), logistic regression maximizes the log-likelihood function, which represents the probability of the observed data under the model.
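
To make this concrete, here is a small sketch (with made-up labels and predicted probabilities) of the Bernoulli log-likelihood that logistic regression maximizes:

import numpy as np

def log_likelihood(y, p):
    # Sum over observations of y*log(p) + (1 - y)*log(1 - p)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])               # hypothetical observed outcomes
p_good = np.array([0.9, 0.2, 0.8, 0.7])  # probabilities close to the outcomes
p_bad = np.array([0.5, 0.5, 0.5, 0.5])   # uninformative probabilities

print(log_likelihood(y, p_good))  # closer to zero (less negative): better fit
print(log_likelihood(y, p_bad))   # more negative: worse fit

MLE searches for the coefficients that make this quantity as large as possible.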

Key Assumptions of Logistic Regression

  • Binary outcome: The dependent variable is binary (or categorical for multinomial and ordinal logistic regression).
  • Independence of observations: Each observation is independent of others.
  • Linearity of independent variables and log odds: Logistic regression assumes a linear relationship between the independent variables and the log odds of the dependent variable.
  • Little or no multicollinearity: Independent variables should not be highly correlated with each other.
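
A quick way to eyeball the multicollinearity assumption is a correlation matrix of the predictors (variance inflation factors are a more formal check); a minimal sketch with hypothetical columns:

import pandas as pd

# Hypothetical predictor columns
df = pd.DataFrame({'age': [25, 40, 33, 58, 47],
                   'weight': [62, 80, 71, 95, 78]})

print(df.corr())  # pairwise correlations near +/-1 signal multicollinearity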

Odds and Log-Odds:

The Odds Ratio (OR) is a statistical measure used to quantify the strength of association between two events or variables. It is commonly used in epidemiology, clinical research, and various other fields to assess the relationship between an exposure and an outcome. The odds ratio compares the odds of an event occurring in one group to the odds of it occurring in another group.

In logistic regression, we often talk about odds and log-odds. If p is the probability of the event, then:

Odds = p / (1 - p)

Log-odds (logit) = ln( p / (1 - p) )

The logistic regression equation models the log-odds as a linear combination of the independent variables:

ln( p / (1 - p) ) = β0 + β1x1 + β2x2 + … + βnxn

Example of Odds:

Let’s say you are predicting whether a person will purchase a product, and you know the probability of purchase is 0.8 (80%) and the probability of not purchasing is 0.2 (20%).

Odds = 0.8 / 0.2 = 4

This means that the odds of purchasing the product are 4 to 1, or in simpler terms, the person is 4 times more likely to purchase than not purchase.
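
The same conversion in a couple of lines of Python:

import numpy as np

p = 0.8                   # probability of purchase
odds = p / (1 - p)        # 4.0, i.e. "4 to 1"
log_odds = np.log(odds)   # ~1.386, the scale logistic regression models
print(odds, log_odds)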

Interpretation of Odds Ratios:

  • Odds Ratio > 1: The predictor variable is positively associated with the outcome. A one-unit increase in the predictor increases the odds of the outcome.
  • Odds Ratio < 1: The predictor variable is negatively associated with the outcome. A one-unit increase in the predictor decreases the odds of the outcome.
  • Odds Ratio = 1: The predictor variable has no effect on the odds of the outcome.

When the odds ratio equals 1, a one-unit change in the predictor variable neither increases nor decreases the likelihood of the outcome: the odds of the event occurring are the same regardless of the value of that predictor.

Example: Odds Ratio = 1

Suppose you are analyzing the effect of education level on whether a person buys a specific type of insurance. You run a logistic regression and find that the odds ratio for education level is 1. This means that people with different levels of education (e.g., high school diploma, bachelor's degree, master's degree) have the same odds of buying that type of insurance; education level is not a factor in determining whether they make the purchase.

Why Use Log-Odds?

In logistic regression, we model the log-odds as a linear combination of the input variables. This transformation maps a bounded probability (between 0 and 1) onto an unbounded scale, which is what makes it possible to fit a binary outcome with a regression-style model.

The log-odds scale is linear and unbounded, which is why it's preferable for modeling in logistic regression. This linear property makes it easier to fit models using linear techniques.

How are Odds and Log-Odds Related to Logistic Regression?

In logistic regression, we are interested in modeling the relationship between one or more predictor variables X and the probability P(Y=1) of a binary outcome Y. Logistic regression models the log-odds of the outcome as a linear function of the predictor variables:

ln( P(Y=1) / (1 - P(Y=1)) ) = β0 + β1X1 + β2X2 + … + βnXn

This equation shows that the log-odds of the outcome (e.g., success/failure) are linearly related to the predictor variables. By exponentiating the equation, we can transform it back into the odds:

P(Y=1) / (1 - P(Y=1)) = e^(β0 + β1X1 + … + βnXn)

Solving for P(Y=1) gives the inverse of the logit function, which provides the probability of the event occurring:

P(Y=1) = 1 / (1 + e^-(β0 + β1X1 + … + βnXn))

Estimated Odds Ratio

For an individual predictor Xi, the estimated odds ratio is obtained by exponentiating its coefficient:

OR = e^(βi)

Estimated Probability of an Event of Interest

Predicting Probability from Log-Odds:

Once you have the log-odds from a logistic regression model, you can easily convert it back to a probability using the following formula:

probability = e^(log-odds) / (1 + e^(log-odds)) = 1 / (1 + e^(-log-odds))
Interpreting Coefficients in Logistic Regression

  • The coefficients β1,β2,…βn represent the change in the log-odds of the outcome for a one-unit change in the corresponding predictor variable.
  • Exponentiating the coefficient, e^β, gives the odds ratio. An odds ratio greater than 1 indicates that as the predictor increases, the odds of the outcome occurring increase; an odds ratio less than 1 means the odds decrease.
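
A minimal sketch of how this looks with a fitted scikit-learn model (the data and the feature names age and weight are hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: columns are [age, weight]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (0.5 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios
odds_ratios = np.exp(model.coef_[0])
print(dict(zip(['age', 'weight'], odds_ratios)))  # OR > 1: odds rise per unit increase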

Logistic Regression Model Evaluation Metrics

There are several ways to evaluate the performance of a logistic regression model:

  1. Confusion Matrix: A table showing the true positives, true negatives, false positives, and false negatives.
  2. Accuracy: The proportion of correct predictions.
  3. Precision: The proportion of true positive predictions out of all positive predictions.
  4. Recall (Sensitivity): The proportion of true positives identified out of all actual positives.
  5. F1 Score: The harmonic mean of precision and recall.
  6. ROC Curve and AUC: A Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1-specificity). The AUC (Area Under the Curve) indicates how well the model separates the classes.
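
All of these metrics are available in scikit-learn; a compact sketch with made-up labels and probabilities:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

# Hypothetical true labels, hard predictions, and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

print(confusion_matrix(y_true, y_pred))
print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1       :', f1_score(y_true, y_pred))
print('AUC      :', roc_auc_score(y_true, y_prob))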

Regularization in Logistic Regression

To avoid overfitting, regularization techniques such as L1 (Lasso) and L2 (Ridge) regularization can be applied to logistic regression. These methods penalize large coefficients, preventing the model from becoming too complex and fitting the noise in the training data.

  • L1 Regularization adds a penalty equal to the absolute value of the coefficient magnitudes.
  • L2 Regularization adds a penalty equal to the square of the coefficient magnitudes.
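
In scikit-learn, the penalty type is selected with the penalty argument and its strength with C (the inverse of the regularization strength); a minimal sketch:

from sklearn.linear_model import LogisticRegression

# L2 (Ridge) is the scikit-learn default; a smaller C means a stronger penalty
l2_model = LogisticRegression(penalty='l2', C=1.0)

# L1 (Lasso) requires a solver that supports it, e.g. liblinear or saga
l1_model = LogisticRegression(penalty='l1', C=0.5, solver='liblinear')

Both models are then fitted with .fit(X_train, y_train) exactly as in the unregularized case.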

Applications of Logistic Regression

  • Medical Diagnosis: Predicting whether a patient has a disease (e.g., cancer detection).
  • Marketing: Predicting whether a customer will purchase a product based on demographic factors.
  • Credit Scoring: Determining whether a loan applicant is likely to default.
  • Customer Churn Prediction: Predicting whether a customer will stop using a service.
  • Spam Detection: Classifying emails as spam or not spam.

Limitations of Logistic Regression

  • Linearity in the log-odds: Logistic regression assumes a linear relationship between the independent variables and the log-odds of the outcome.
  • Sensitive to outliers: Logistic regression can be sensitive to outliers in the dataset.
  • Not suitable for large numbers of categorical features: Logistic regression may not perform well when there are many categorical features with a large number of levels.
  • Cannot handle complex relationships: For more complex relationships between variables, more advanced models like decision trees, random forests, or neural networks may be preferred.

Summary of Key Concepts:

  • Logistic regression is a classification algorithm that models the probability of a binary outcome.
  • The sigmoid function is used to convert a linear combination of input features into a probability between 0 and 1.
  • Outputs include predicted probabilities and binary classifications.
  • Model evaluation can be done using metrics like accuracy, precision, recall, and the AUC-ROC curve.
  • Logistic regression assumes independence of observations and a linear relationship between predictors and the log-odds of the outcome.

Project Overview: Predicting Obesity Using Logistic Regression

In this project, we utilize logistic regression to predict whether individuals in a synthetic dataset are obese based on their age and weight. The dataset is generated with 500 samples, with each individual assigned a randomly generated UserID, Age (between 18 and 80), and Weight. The Obesity Status is simulated as a binary outcome, where older individuals have a higher probability of being obese.


Illustration: a logistic model using the weight of the sample population to predict obesity.

Brief Explanation of the Code

This code performs logistic regression on a synthetic dataset that simulates data about obesity based on age and weight. Here's a breakdown of the steps in the code:

1. Data Generation:

  • Synthetic Data: The code generates a dataset with 500 samples where each individual has a randomly assigned UserID, Age (between 18 and 80), Weight, and Obesity Status (1 = Obese, 0 = Not Obese).
  • Obesity Status: The probability of being obese increases with age. This is simulated with a binomial distribution where older individuals are more likely to be labeled as obese.
  • Weight: The weight is generated based on the individual's obesity status, where obese individuals tend to have higher weights.

Weight and Obesity Status Relationship:

  • In the synthetic dataset, individuals’ obesity status is determined based on their weight and age.
  • The model simulates that as age increases, there’s a higher probability of an individual being obese, and their weight increases accordingly.
  • For obese individuals: the mean weight increases more rapidly with age, and the weight distribution is centered around a higher value. For example, for older individuals (50+ years), the mean weight might be 80+ kg with larger variation.
  • For non-obese individuals: the mean weight increases at a slower rate with age. For example, for younger or middle-aged individuals, the mean weight might be lower, around 60–70 kg.

The generated dataset is saved in an Excel file named obesity_data.xlsx.

2. Logistic Regression Model:

  • Feature Selection: The logistic regression model uses Age and Weight as the input features to predict Obesity.
  • Train-Test Split: The data is split into a training set (75%) and a test set (25%) to evaluate the model's performance.
  • Standardization: The features (Age and Weight) are scaled using StandardScaler to ensure that the model is not biased by the different ranges of values in the features.
  • Model Training: A Logistic Regression model is trained on the standardized training data.
  • Predictions: The model makes predictions on the test set and outputs probabilities of obesity.

Logistic Regression Classification:

  • The logistic regression model uses weight and age as the key input features to predict obesity (1 = Obese, 0 = Not Obese).
  • The model calculates the probability of an individual being obese based on these two features.
  • Sigmoid function: This function ensures that the output of the model lies between 0 and 1, representing the probability of being obese. A threshold (typically 0.5) is applied to classify individuals into either "obese" or "non-obese".
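
Applying the threshold to predicted probabilities is a one-liner; a small sketch with hypothetical probabilities:

import numpy as np

# Hypothetical predicted probabilities of obesity
y_prob = np.array([0.12, 0.47, 0.51, 0.93])

threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)
print(y_pred)  # [0 0 1 1]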

Weight Distribution and Classification:

  • For individuals with higher weights, the model tends to assign a higher probability of obesity, especially as weight increases in combination with age.
  • Individuals with lower weights relative to their age are more likely to be classified as non-obese.

Model Evaluation

The model's performance is evaluated using the following metrics:

  1. Confusion Matrix: A confusion matrix displays the true positives, true negatives, false positives, and false negatives, providing a clear picture of the model's performance.
  2. Accuracy: The accuracy score represents the proportion of correct predictions.
  3. Precision, Recall, and F1 Score: These metrics provide further insights into how well the model handles the classification task, particularly for the minority class (obese individuals).
  4. ROC Curve and AUC: The ROC curve illustrates the model's ability to distinguish between obese and non-obese individuals, with the AUC (Area Under the Curve) summarizing the model's performance into a single number.

Saving Outputs:

  • The test set results (actual and predicted obesity status along with predicted probabilities) are saved in an Excel file named obesity_model_outputs.xlsx. The confusion matrix and classification report are also saved in separate sheets in this file.

Visualizations:

Several visualizations are provided to help understand the model's performance:

  1. Decision Boundary Plot: Shows the regions the model classifies as obese versus non-obese in the standardized Age-Weight plane, for both the training and test sets.
  2. Confusion Matrix Heatmap: A heatmap of the confusion matrix, visually representing the correct and incorrect predictions.
  3. ROC Curve: The ROC curve shows the trade-off between the true positive rate and false positive rate at various threshold levels. The AUC (Area Under the Curve) is also calculated, providing a single metric to evaluate the classifier's ability to distinguish between classes.

Summary of Python Code for Logistic Regression using Synthetic Data

# Logistic Regression on Synthetic Obesity Data with Visualizations and Excel Output

# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, accuracy_score, classification_report,
    roc_curve, auc
)
from matplotlib.colors import ListedColormap
import seaborn as sns

# Setting the random seed for reproducibility
np.random.seed(0)

# Number of samples
num_samples = 500

# Generating User IDs
user_ids = np.arange(1, num_samples + 1)

# Generating ages between 18 and 80
ages = np.random.randint(18, 81, size=num_samples)

# Generating obesity status based on age
obesity_prob_by_age = (ages - 18) / (80 - 18)  # Probability increases with age
obesity_status = np.random.binomial(1, obesity_prob_by_age * 0.6)  # Adjusted probability

# Generating weights based on obesity status
weights = []
for age, obese in zip(ages, obesity_status):
    if obese:
        # Obese individuals
        mean_weight = 80 + (age - 18) * 0.5  # Mean weight increases with age
        weight = np.random.normal(mean_weight, 10)
    else:
        # Non-obese individuals
        mean_weight = 60 + (age - 18) * 0.3
        weight = np.random.normal(mean_weight, 8)
    weights.append(weight)

weights = np.array(weights)

# Creating the DataFrame
df = pd.DataFrame({
    'UserID': user_ids,
    'Age': ages,
    'Weight': weights,
    'Obese': obesity_status
})

# Saving the dataset to an Excel file
df.to_excel('obesity_data.xlsx', index=False)

print("Dataset saved to 'obesity_data.xlsx'.")

# --------------------------------------------
# Reading the data and performing Logistic Regression
# --------------------------------------------

# Reading the data from the Excel file
df = pd.read_excel('obesity_data.xlsx')

# Columns carried through the split: UserID is kept only to identify rows
# in the output; only Age and Weight will be used as model features
X = df[['UserID', 'Age', 'Weight']]
y = df['Obese']

# Splitting the dataset into Training and Test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Extract UserID from X_train and X_test
X_train_userid = X_train['UserID'].values
X_test_userid = X_test['UserID'].values

# Extract Age and Weight for scaling
X_train_features = X_train[['Age', 'Weight']]
X_test_features = X_test[['Age', 'Weight']]

# Feature Scaling
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train_features)
X_test_scaled = sc.transform(X_test_features)

# Training the Logistic Regression Model
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_scaled, y_train)

# Making Predictions and Evaluating the Model
y_pred = classifier.predict(X_test_scaled)
y_prob = classifier.predict_proba(X_test_scaled)[:, 1]

cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

print("\nConfusion Matrix:")
print(cm)
print(f"\nAccuracy: {acc*100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# --------------------------------------------
# Saving Outputs to Excel
# --------------------------------------------

# Creating a DataFrame with the test set results
results_df = pd.DataFrame({
    'UserID': X_test_userid,
    'Age': X_test_features['Age'].values,
    'Weight': X_test_features['Weight'].values,
    'Actual Obese': y_test.values,
    'Predicted Obese': y_pred,
    'Predicted Probability': y_prob
})

# Optionally, sort the results by UserID
results_df = results_df.sort_values(by='UserID')

# Save results to an Excel file
with pd.ExcelWriter('obesity_model_outputs.xlsx') as writer:
    # Write the results DataFrame to a sheet
    results_df.to_excel(writer, sheet_name='Test Set Predictions', index=False)

    # Save the confusion matrix as a DataFrame
    cm_df = pd.DataFrame(cm, index=['Actual Not Obese', 'Actual Obese'],
                         columns=['Predicted Not Obese', 'Predicted Obese'])
    cm_df.to_excel(writer, sheet_name='Confusion Matrix')

    # Save the classification report as a DataFrame
    report = classification_report(y_test, y_pred, output_dict=True)
    report_df = pd.DataFrame(report).transpose()
    report_df.to_excel(writer, sheet_name='Classification Report')

print("\nOutputs have been saved to 'obesity_model_outputs.xlsx'.")

# --------------------------------------------
# Visualizations (Optional)
# --------------------------------------------

# 1. Decision Boundary Plot for Training Set
X_set, y_set = X_train_scaled, y_train

# Create meshgrid
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 0.5, stop=X_set[:, 0].max() + 0.5, step=0.01),
    np.arange(start=X_set[:, 1].min() - 0.5, stop=X_set[:, 1].max() + 0.5, step=0.01)
)

plt.figure(figsize=(12, 6))
plt.contourf(
    X1, X2,
    classifier.predict(
        np.array([X1.ravel(), X2.ravel()]).T
    ).reshape(X1.shape),
    alpha=0.3, cmap=ListedColormap(('lightblue', 'lightcoral'))
)
plt.scatter(X_set[y_set == 0, 0], X_set[y_set == 0, 1],
            c='blue', label='Not Obese', edgecolor='k', s=20)
plt.scatter(X_set[y_set == 1, 0], X_set[y_set == 1, 1],
            c='red', label='Obese', edgecolor='k', s=20)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age (Standardized)')
plt.ylabel('Weight (Standardized)')
plt.legend()
plt.show()

# 2. Decision Boundary Plot for Test Set
X_set, y_set = X_test_scaled, y_test

# Create meshgrid
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 0.5, stop=X_set[:, 0].max() + 0.5, step=0.01),
    np.arange(start=X_set[:, 1].min() - 0.5, stop=X_set[:, 1].max() + 0.5, step=0.01)
)

plt.figure(figsize=(12, 6))
plt.contourf(
    X1, X2,
    classifier.predict(
        np.array([X1.ravel(), X2.ravel()]).T
    ).reshape(X1.shape),
    alpha=0.3, cmap=ListedColormap(('lightblue', 'lightcoral'))
)
plt.scatter(X_set[y_set == 0, 0], X_set[y_set == 0, 1],
            c='blue', label='Not Obese', edgecolor='k', s=20)
plt.scatter(X_set[y_set == 1, 0], X_set[y_set == 1, 1],
            c='red', label='Obese', edgecolor='k', s=20)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age (Standardized)')
plt.ylabel('Weight (Standardized)')
plt.legend()
plt.show()

# 3. Confusion Matrix Heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# 4. ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()        

Output from the above code:

Dataset saved to 'obesity_data.xlsx'.

Confusion Matrix:
[[89  1]
 [ 6 29]]

Accuracy: 94.40%

Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.99      0.96        90
           1       0.97      0.83      0.89        35

    accuracy                           0.94       125
   macro avg       0.95      0.91      0.93       125
weighted avg       0.95      0.94      0.94       125

Outputs have been saved to 'obesity_model_outputs.xlsx'.

Summary of Logistic Regression Model Output

This project aimed to predict obesity based on age and weight using a logistic regression model. Here's a concise summary of the key results and performance metrics:

Metric       Class 0 (Not Obese)   Class 1 (Obese)
Precision    0.94                  0.97
Recall       0.99                  0.83
F1-score     0.96                  0.89
Support      90                    35

Overall accuracy: 94.40% (125 test samples); AUC: 0.97
Summary of Key Insights

  • High Accuracy: The model achieved an accuracy of 94.4%, which suggests strong performance in predicting obesity based on age and weight.
  • Balanced Precision and Recall: Both precision and recall for the non-obese group (class 0) are near-perfect. However, for the obese group (class 1), recall is slightly lower (0.83), indicating that the model occasionally fails to identify some obese individuals.
  • Excellent AUC: With an AUC score of 0.97, the model demonstrates a high capability to differentiate between obese and non-obese individuals.

Overall Conclusion

The logistic regression model performs well in predicting obesity, showing high accuracy and precision. While the recall for predicting obesity is slightly lower, the overall results suggest the model is effective for binary classification tasks involving health indicators like obesity. The results have been saved in an Excel file (obesity_model_outputs.xlsx), which contains detailed predictions, the confusion matrix, and the classification report for further analysis.

Thank you for reading, and we look forward to continuing this journey of predictive analytics with you! In the next article, we will apply the logistic regression model to real-world datasets.
