Automated methods to ensure data accuracy
Ever tried fitting an elephant into a Mini Cooper? That’s what Olivier Ledoit and Michael Wolf were up against when they tackled the monstrous problem of squishy, unreliable data sets. The Ledoit-Wolf method aims to solve common problems in estimating covariance matrices (the relationships between variables), especially when there are more variables than observations. Their groundbreaking paper, “Honey, I Shrunk the Sample Covariance Matrix” (2004), introduced a way to "shrink" traditional covariance estimates toward a target, reducing errors, particularly in high-dimensional contexts.
At Edgemesh, the Ledoit-Wolf shrinkage estimator helps us ensure that our data is accurate (or, more specifically, free of noise). We understand that the accuracy of data is paramount to delivering valuable insights in the eCommerce space, something we carried over from our previous lives in automated trading. The only thing worse than no data is bad data! Misguided decisions based on inaccurate data can lead to lost revenue, misallocation of resources, and a host of other issues. To guard against these risks, we employ advanced statistical methods to automatically detect and correct data inaccuracies.
Understanding the Ledoit-Wolf Method
The Ledoit-Wolf method is a statistical technique designed to improve the estimation of covariance matrices, particularly in situations where the sample size is small relative to the number of variables. Covariance matrices are grids that capture the relationships between pairs of variables in a dataset. However, real-world data often contains noise—random fluctuations that obscure the true relationships between variables. This noise can lead to an inaccurate or "noisy" covariance matrix, which in turn can distort the insights derived from the data.
Mathematically, the covariance matrix \(\Sigma\) for a set of variables is estimated as:

\[ \hat{\Sigma} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^\top \]

Where \(X_i\) represents each observation, \(\bar{X}\) is the mean vector, and \(n\) is the number of observations. However, when \(n\) is small compared to the number of variables, \(\hat{\Sigma}\) becomes an unreliable estimate, often leading to overfitting.
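As a quick illustration, here is a minimal sketch of that sample covariance estimate in Python. The data here is randomly generated purely for demonstration; the shape (few observations, several variables) mirrors the regime where the estimate gets unreliable.

```python
# A minimal sketch of the sample covariance estimate above, using NumPy.
# The data is simulated for illustration only.
import numpy as np

rng = np.random.default_rng(42)
n, p = 20, 5                      # few observations (n) relative to variables (p)
X = rng.normal(size=(n, p))       # rows = observations, columns = variables

X_bar = X.mean(axis=0)            # the mean vector, X-bar
centered = X - X_bar
sigma_hat = centered.T @ centered / (n - 1)   # unbiased sample covariance

# np.cov gives the same result when told variables are in columns
assert np.allclose(sigma_hat, np.cov(X, rowvar=False))
```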
The Ledoit-Wolf method addresses this issue by "shrinking" the noisy covariance matrix toward a structured target, such as the identity matrix or a diagonal matrix. This structured matrix is known as the target matrix, and in data quality systems the target matrix is often known in advance. That is the common application: removing noise from data. At Edgemesh, we have well-defined target matrices, so the presence of noise points directly to potential data inaccuracies!
To start, given the target matrix \(T\), we need to apply a shrinkage estimator. The shrinkage estimator is given by:

\[ \hat{\Sigma}_{\text{shrunk}} = \lambda T + (1 - \lambda)\hat{\Sigma} \]

Where \(\lambda\) is the shrinkage intensity and \(T\) is the target matrix (often the identity matrix). The shrinkage intensity \(\lambda\) is optimally chosen to minimize the expected mean-squared error between the estimator and the true covariance matrix \(\Sigma\):

\[ \lambda^{*} = \arg\min_{\lambda}\; \mathbb{E}\left[\left\| \lambda T + (1 - \lambda)\hat{\Sigma} - \Sigma \right\|_F^2\right] \]
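In practice you rarely need to derive \(\lambda\) by hand: scikit-learn ships a `LedoitWolf` estimator that picks the intensity analytically. The sketch below (again on simulated data) shows the fitted \(\lambda\) and verifies the shrinkage formula by hand; note that sklearn's default target is a scaled identity matrix, which may differ from a domain-specific target.

```python
# A sketch of the shrinkage step using scikit-learn's LedoitWolf estimator,
# which chooses the intensity (its `shrinkage_` attribute) analytically.
# sklearn's implicit target is a scaled identity matrix.
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # n=20 observations, p=5 variables

lw = LedoitWolf().fit(X)
print("shrinkage intensity lambda:", lw.shrinkage_)

# Equivalent by hand: lambda*T + (1 - lambda)*Sigma_hat,
# with T = (trace(Sigma_hat)/p) * I and the 1/n covariance sklearn uses.
sigma_hat = np.cov(X, rowvar=False, bias=True)
T = np.trace(sigma_hat) / X.shape[1] * np.eye(X.shape[1])
manual = lw.shrinkage_ * T + (1 - lw.shrinkage_) * sigma_hat
assert np.allclose(manual, lw.covariance_)
```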
And this is where we often have an opportunity to identify a data error. Once \(\lambda\) is known (and there are a myriad of heuristics for estimating it), any significant divergence in its value effectively signals a material change in the relationship of the underlying data. In other words, this is a great way of finding noise in the data! A sketch of such a check follows below.
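Here is one hypothetical way such a check could look: recompute \(\lambda\) on each new batch of metrics and flag a material jump against a rolling baseline. The window size, threshold, and simulated "corruption" are all illustrative assumptions, not a description of Edgemesh's production pipeline.

```python
# A hypothetical lambda-divergence monitor. Thresholds, batch sizes, and the
# simulated corruption are illustrative assumptions only.
import numpy as np
from sklearn.covariance import LedoitWolf

def shrinkage_intensity(batch: np.ndarray) -> float:
    """Ledoit-Wolf shrinkage intensity (lambda) for one batch of metrics."""
    return LedoitWolf().fit(batch).shrinkage_

def flag_divergence(history: list, current: float, k: float = 3.0) -> bool:
    """Flag if the current lambda sits more than k std devs from the baseline."""
    mu, sd = np.mean(history), np.std(history)
    return sd > 0 and abs(current - mu) > k * sd

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))            # mixing matrix to induce correlation

def batch(corrupted: bool = False) -> np.ndarray:
    z = rng.normal(size=(50, 4))
    return z if corrupted else z @ A   # corruption destroys the usual structure

history = [shrinkage_intensity(batch()) for _ in range(30)]
current = shrinkage_intensity(batch(corrupted=True))
if flag_divergence(history, current):
    print("lambda diverged: inspect this batch for data errors")
```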
Application in eCommerce Data
In the context of eCommerce, covariance matrices are essential for understanding the relationships between various metrics, such as product views, cart additions, and purchases. A noisy covariance matrix might suggest false correlations, leading to misguided marketing strategies or incorrect inventory decisions. More importantly, some relationships are (by construction) well formed, e.g. funnel conversion steps as a Markov process (such as Edgemesh's north star metrics of Engaged User Rate, Cart Active User Rate, etc.). By applying the Ledoit-Wolf method, we ensure that the covariance matrices used in our analyses are robust and trustworthy. And if the shrinkage weight toward the target is high, we can examine exactly what the source of noise is (is it bad data... or is it something like Prime Day!). A sketch of this idea on simulated funnel data follows below.
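To make that concrete, here is an illustrative application to funnel-style metrics. The column names (views, cart additions, purchases) and all values are simulated for the sketch; they are not Edgemesh's actual metrics or data.

```python
# An illustrative application to simulated funnel metrics. Names and values
# are made up for the sketch; each stage is a noisy fraction of the last,
# mimicking the Markov-style structure of conversion steps.
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(7)
views = rng.poisson(1000, size=30).astype(float)        # daily product views
cart_adds = views * rng.uniform(0.08, 0.12, size=30)    # ~10% add to cart
purchases = cart_adds * rng.uniform(0.25, 0.35, size=30)  # ~30% convert
X = np.column_stack([views, cart_adds, purchases])

lw = LedoitWolf().fit(X)
print("shrinkage weight toward target:", lw.shrinkage_)
# An unusually high shrinkage weight would say the observed relationships look
# noisy relative to the target structure, i.e. worth inspecting the data feed.
```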