Understanding feature engineering from a mathematical perspective

Background

In this article, we explore feature engineering from a mathematical perspective. The primary goal of feature engineering is to transform raw data into a representation that captures the underlying patterns in the problem more effectively. In a previous post, I shared that feature engineering and model evaluation can be thought of as a feedback loop. In this post, we extend this idea further.

From a mathematical perspective, feature engineering is about transforming raw data into a set of variables that better represent the underlying problem to predictive models, thereby improving their performance. This involves applying mathematically grounded transformations and extracting meaningful patterns or new information from the data.

Mathematically, feature engineering transforms the data into a space where it can be more effectively and efficiently processed by machine learning algorithms. This involves a variety of linear and nonlinear transformations, scaling, and extraction techniques designed to expose the underlying patterns in the data to the predictive models.
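
One common way to formalize this is as a feature map φ: X → R^d that sends each raw observation x to a vector of engineered features φ(x); the model then learns a function f such that f(φ(x)) approximates the target y. In these terms, feature engineering is the choice of φ, and a good φ is one under which f can be simple, accurate, and stable.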

Overview of feature engineering

Feature engineering involves creating features (input variables) from raw data that make machine learning algorithms work more effectively. Typically, feature engineering uses domain knowledge to discern features from the raw data. These engineered features can be used to improve the performance of machine learning algorithms. The goal is to provide meaningful information through these engineered features that the model can use to make accurate predictions or classifications.

Well-designed features can improve the predictive power of machine learning models by capturing important information in the data. Better features can allow a simpler model to perform well, reducing the need for more complex algorithms that are harder to interpret and maintain. By creating features that capture the underlying structure of the data, models are less likely to overfit to noise in the training set and are better at generalizing to new examples.

In this sense, feature engineering is crucial for machine learning. 

Components of feature engineering 

There are three primary components of feature engineering: feature transformation, feature scaling, and feature extraction.

Feature Transformation

Feature transformation involves changing the format or the scale of the data without altering its information content. This can make the data more suitable for modeling by changing its distribution or scale. Common feature transformations include the following (a short sketch in Python follows the list):

  • Normalization: Changing the range of the data to [0, 1] or [-1, 1].
  • Standardization (Z-score normalization): Rescaling the data to have a mean of 0 and a standard deviation of 1.
  • Log Transform: Applying the logarithm to each data point to reduce skewness.
  • Power Transform: Applying a power function (e.g., square root or cube root) to reduce skewness; the log transform is a limiting case of this family.
  • Box-Cox Transform: A parametrized transformation to reduce skewness and make the data more normal (Gaussian).
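
As a minimal sketch, the transformations above can be applied with NumPy and SciPy; the skewed array `x` below is synthetic and purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)   # skewed, strictly positive data

# Normalization: rescale to the range [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1
x_zscore = (x - x.mean()) / x.std()

# Log transform: compresses the long right tail (requires positive values)
x_log = np.log(x)

# Box-Cox transform: SciPy estimates the power parameter lambda from the data
x_boxcox, fitted_lambda = stats.boxcox(x)
```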

Feature Scaling

Feature scaling is a technique to rescale the independent features in the data to a fixed range. It is a part of feature transformation but focuses specifically on altering the scale of features so that they can be compared on common ground. This is particularly important for models that rely on the magnitude of the data, such as distance-based algorithms like K-Nearest Neighbors (KNN) and models trained with gradient-based optimization, such as linear regression. Examples of feature scaling include the following (a scikit-learn sketch follows the list):

  • Min-Max Scaling: Bringing all the values into a range between a new min and max, typically [0, 1].
  • Standard Scaling: Adjusting the data so that it has a zero mean and unit variance.
  • Robust Scaling: Using median and interquartile range to scale data, which makes it robust to outliers.
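
A minimal sketch using scikit-learn's scalers; the small matrix `X` is illustrative. In practice the scaler is fitted on the training split only and then reused on the test split to avoid information leakage.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

X = np.array([[1.0,   200.0],
              [2.0,   300.0],
              [3.0, 10000.0]])   # the second column contains an outlier

X_minmax = MinMaxScaler().fit_transform(X)      # each column mapped to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
X_robust = RobustScaler().fit_transform(X)      # median/IQR based, less sensitive to the outlier
```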

Feature Extraction

Feature extraction is the process of creating new features from existing data, capturing essential information in a more useful or composite form. This is particularly important for unstructured data such as text and images. Examples of feature extraction include the following (a PCA sketch follows the list):

  • Principal Component Analysis (PCA): Reducing the dimensionality of the data by transforming the features into a smaller set of uncorrelated features (principal components).
  • t-Distributed Stochastic Neighbor Embedding (t-SNE): Non-linear dimensionality reduction to visualize high-dimensional data.
  • Word Embeddings: Transforming text data into vector formats where the semantic meaning of the words is preserved (e.g., Word2Vec, GloVe).
  • Autoencoders: Neural networks designed to learn an encoding for a set of data, typically for dimensionality reduction or feature extraction purposes.
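
As a minimal sketch of the first technique, PCA can be applied with scikit-learn; the built-in digits dataset is used purely for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 1797 images, 64 pixel features each

pca = PCA(n_components=10)            # keep the 10 leading principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (1797, 10)
print(pca.explained_variance_ratio_.sum())    # fraction of total variance retained
```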

Interpreting feature engineering as understanding the underlying distribution of data for inference

You can also think of feature engineering as the ability to understand the underlying distribution of the data for the purpose of inference. The underlying distribution of a phenomenon refers to the statistical properties (mean, variance, skewness, relationships, etc.) of the process that generates the observations. Inference in machine learning refers to the model's ability to make predictions or draw conclusions from data. Feature engineering is thus a bridge between the underlying statistical properties of the data (as represented by its distribution) and the model's predictive performance.

Exploratory Data Analysis (EDA) is often the starting point, revealing insights into the data’s distribution and informing feature engineering strategies.

Feature engineering is, in essence, the process of translating the insights from EDA into features that improve the model’s inference capabilities.

Feature engineering impacts inference in several ways:

Capturing Informative Features: Features that represent the phenomenon's underlying distribution enable the model to learn meaningful relationships and patterns. For example, when predicting house prices, engineered features like "proximity to schools" or "number of rooms per occupant" may align better with the real-world distribution of housing prices than the raw input variables.
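
A hypothetical illustration of that example in pandas; the column names and values below are invented for the sketch, not taken from a real dataset.

```python
import pandas as pd

houses = pd.DataFrame({
    "price": [350_000, 420_000, 295_000],
    "distance_to_school_km": [0.8, 2.5, 4.1],
    "rooms": [3, 4, 2],
    "occupants": [2, 5, 1],
})

# Engineered features that track the narrative above better than the raw columns
houses["proximity_to_schools"] = 1.0 / (1.0 + houses["distance_to_school_km"])
houses["rooms_per_occupant"] = houses["rooms"] / houses["occupants"]
```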

Reducing Noise: Feature engineering helps isolate the signal from noise by emphasizing variables that are statistically significant or have predictive power. This leads to more reliable inference because the model focuses on aspects of the data that matter.

Addressing Non-linear Relationships: Many real-world phenomena have non-linear or complex relationships between variables. Feature engineering can transform data (e.g., log, polynomial, interaction terms) to capture these relationships, making the inference process more robust.
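
A minimal sketch of such transformations with scikit-learn and NumPy; `X` is an illustrative two-column feature matrix.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Degree-2 expansion adds squared terms and the pairwise interaction term x0 * x1
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out())   # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']

# A log transform can help linearize multiplicative relationships (non-negative values)
X_log = np.log1p(X)
```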

Improving Generalization: By representing the distribution well, the engineered features help models generalize better to unseen data. This reduces overfitting and ensures the model's inferences are applicable beyond the training dataset.

Matching the Model to the Distribution: Feature engineering is also about ensuring that the data aligns with the assumptions of the chosen model, based on its distribution (a sketch follows the list):

  • Linear models assume linear relationships, so feature transformations are often needed to match the data distribution to the model's inference mechanism.
  • Tree-based models (e.g., decision trees, random forests) handle non-linear relationships well, so feature engineering might focus more on handling missing values or creating categorical encodings.
  • Neural networks benefit from normalization or scaling to ensure numerical stability and better learning dynamics.
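
As a minimal sketch of the first point, a right-skewed target can be brought closer to a linear model's assumptions by fitting on a log scale; the synthetic data below is purely illustrative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = np.exp(0.5 * X[:, 0] + rng.normal(scale=0.2, size=200))   # multiplicative, right-skewed target

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,           # the linear model is fitted on log(y)
    inverse_func=np.exp,   # predictions are mapped back to the original scale
)
model.fit(X, y)
print(model.predict(X[:3]))
```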

Overcoming Distribution-Inference Misalignment: We can also think of feature engineering as overcoming misalignment between the data's distribution and the model's inference. Here are some common cases and ways to address them.

  1. Imbalanced Datasets: If one class dominates in a classification problem, the model may be biased toward the majority class. Techniques like SMOTE (Synthetic Minority Over-sampling Technique) can better represent the minority class.
  2. Time Series Data: Trends, seasonality, and autocorrelation can obscure inference in time series data. Adding lag features, rolling averages, or seasonal decomposition can capture these patterns (see the sketch after this list).
  3. Interaction Effects: The phenomenon might involve interactions between variables. Creating interaction terms (e.g., the product of two features) can expose these effects to the model.
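
A minimal sketch of the time-series case (point 2) using pandas; the daily sales series is invented for illustration. (For point 1, SMOTE is available in the imbalanced-learn package.)

```python
import pandas as pd

sales = pd.Series(
    [10, 12, 13, 15, 14, 16, 18, 20, 19, 22],
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
    name="sales",
)

features = pd.DataFrame({"sales": sales})
features["lag_1"] = sales.shift(1)                           # yesterday's value
features["lag_7"] = sales.shift(7)                           # value one week ago
features["rolling_mean_3"] = sales.rolling(window=3).mean()  # 3-day moving average

# Early rows are NaN until enough history exists; drop them before modeling
features = features.dropna()
```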

Conclusion

Understanding the underlying distribution of a phenomenon allows us to engineer features that effectively represent the data's structure and relationships. This representation directly influences the model's ability to make accurate inferences, as the quality and relevance of the features determine the model's success in capturing the true essence of the phenomenon. Feature engineering techniques thus act as the bridge that connects statistical understanding to predictive performance.

If you want to study with us, please see our course on #AI at #universityofoxford (almost full) https://lnkd.in/dcdrjSC2

 
