Using EDA To Effectively Achieve Improved Data Quality and Higher Forecast Accuracy in the Supply Chain
Exploratory Data Analysis (EDA) plays a crucial role in enhancing data quality and improving forecast accuracy in supply chain management. Selecting the best-performing forecasting techniques for demand/supply chain planning has always been a challenging task for demand planners and forecast practitioners, especially when data become intermittent and are distorted by unusual events. Even a single disruption, like a strike, an oil embargo, or a pandemic, can lead to flawed results and misleading interpretations in data with otherwise stable seasonal patterns. Here's how EDA can be effectively utilized in this context:
Understanding Data Distributions: Use time plots, stem-and-leaf diagrams, box plots, and density plots to understand the distribution of key variables, such as demand, lead times, and inventory levels. Also analyze central tendency and variability with resistant measures of location and scale. This helps in identifying outliers and skewness in the data.
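As a small illustration (the numbers are invented, not from any dataset discussed here), the following Python sketch contrasts resistant measures of location and scale (median, interquartile range) with the non-resistant mean, which a single unusual value drags upward:

```python
import statistics

def resistant_summary(values):
    """Summarize a series with resistant measures: the median (location)
    and the interquartile range (scale), which outliers barely affect."""
    q = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": statistics.median(values), "iqr": q[2] - q[0]}

# Monthly demand with one unusual spike (hypothetical values)
demand = [120, 130, 125, 118, 122, 127, 900, 121, 126, 119, 124, 128]

print(resistant_summary(demand))   # median stays near the typical level
print(statistics.mean(demand))     # the mean is pulled far above it
```

The median sits near the typical monthly level while the mean is pulled well above every ordinary observation, which is exactly why resistant measures are preferred at the EDA stage.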
Identifying Missing Values
• Missing Data: Assess the extent and patterns of missing data. Visual tools like spreadsheet tables can help identify missing values in datasets, guiding decisions on imputation or removal.
• Imputation Strategies: Depending on the missing data mechanism, apply appropriate techniques (mean, median, or modeling methods like regression analysis).
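A minimal sketch of one such strategy, median imputation, which is a resistant choice when the observed values are skewed (the lead-time values here are hypothetical):

```python
import statistics

def impute_median(series):
    """Replace missing observations (None) with the median of the
    observed values -- resistant to the skew a long lead time creates."""
    observed = [v for v in series if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in series]

lead_times = [5, 7, None, 6, 40, None, 5, 6]  # days; None = missing
print(impute_median(lead_times))
```

Mean imputation would fill the gaps with a value inflated by the one 40-day lead time; the median ignores it.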
Detecting Outliers
• Outlier Detection: Use visual methods (scatter plots, box plots) and statistical tests to identify outliers that may skew the analysis. Understanding the cause of these outliers is essential for data quality.
• Addressing Outliers: Decide whether to remove, cap, or transform these outliers based on their impact on the analysis.
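One common way to cap outliers is with Tukey's interquartile-range fences; the sketch below (with made-up demand values) winsorizes any value beyond Q1 − 1.5·IQR or Q3 + 1.5·IQR:

```python
import statistics

def cap_outliers(values, k=1.5):
    """Cap values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lo), hi) for v in values]

demand = [100, 105, 98, 102, 500, 101, 99, 103]
print(cap_outliers(demand))   # the 500 is pulled down to the upper fence
```

Whether to cap, remove, or transform remains a judgment call; capping preserves the observation's timing while limiting its leverage on the analysis.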
Analyzing Relationships
• Correlation Analysis: Use robust and ordinary correlation matrices and scatter plots to explore relationships between variables, such as demand and promotions, seasonality, or economic indicators.
• Feature Engineering: Create new features that capture important relationships (e.g., lagged demand variables) to improve model performance.
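For example, lagged demand variables can be built in a few lines; the lag choices here (1 and 2 periods) are purely illustrative:

```python
def lagged_features(demand, lags=(1, 12)):
    """Build lagged-demand feature rows: for each period t, collect
    demand[t - lag] for each lag (None where history is too short)."""
    rows = []
    for t in range(len(demand)):
        rows.append({f"lag_{lag}": demand[t - lag] if t >= lag else None
                     for lag in lags})
    return rows

demand = [10, 12, 11, 13, 15, 14]
rows = lagged_features(demand, lags=(1, 2))
print(rows)
```

In a monthly supply chain context, a lag of 12 captures the same month a year earlier, which is often the single most informative feature for seasonal demand.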
Time Series Analysis
• Trends and Seasonality: Decompose time series data to identify trends, seasonal patterns, and cyclical behaviors. This helps in understanding demand fluctuations.
• Stationarity Tests: Conduct statistical tests to check for stationarity, which is crucial for many time series forecasting models.
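Formal stationarity tests (e.g., the augmented Dickey-Fuller test) require a statistics library; as a crude illustrative screen, one can compare the mean and variance of the two halves of a series, where a large shift suggests trend or level changes. This is only a rough diagnostic, not a substitute for a proper test:

```python
import statistics

def split_half_check(series):
    """Crude stationarity screen: compare mean and variance of the two
    halves of the series. A large mean shift suggests trend; a variance
    ratio far from 1 suggests changing volatility."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return {"mean_shift": statistics.mean(second) - statistics.mean(first),
            "var_ratio": statistics.pvariance(second) / statistics.pvariance(first)}

trending = [t + 0.5 * (t % 4) for t in range(24)]  # upward trend + cycle
print(split_half_check(trending))
```

A strongly trending series like this one shows a large mean shift between halves, signaling that differencing (or a trend term) is needed before fitting a stationary model.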
Segmenting Data
• Customer Segmentation: Analyze customer purchasing behavior to segment customers based on demand patterns, helping tailor inventory and supply strategies. In addition, EDA approaches can help demand planners incorporate more diverse data sources into their forecasting models. For example, they can leverage social media data, customer sentiment analysis, or even weather patterns to gain insights into consumer behavior and demand fluctuations.
• Product Categorization: Group products based on similarities in demand patterns, aiding in targeted forecasting efforts.
Benchmarking and Comparative Analysis
• Comparative Analysis: Compare data across different time periods, regions, or product categories to identify areas of improvement and best practices.
• Benchmarking: Establish benchmarks for key performance indicators (KPIs) that reflect data quality and forecasting accuracy.
For example, demand planners in supply chain organizations are so accustomed to using the Mean Absolute Percentage Error (MAPE) as the preferred approach for assessing accuracy and forecasting performance that they may be unaware of some of the more appropriate and better-performing measures out there (as is the case with intermittent demand situations).
Among practitioners, it is a jungle out there trying to understand the role of the APEs in accuracy measurement.
Using EDA, the undefined-APE and outlier issues commonly found with intermittent demand data can be resolved without loss of meaning by considering the Median Absolute Percentage Error (MdAPE) or the new Typical Absolute Percentage Error (HBB TAPE) measure, which can be used very effectively instead of the MAPE to summarize the APEs with more reliable summary properties than the arithmetic mean. The Typical APE (TAPE) is an M-estimator calculated with an iteratively re-weighted least squares algorithm.
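The following sketch uses made-up actuals and forecasts, including one near-zero actual of the kind that inflates an APE, to show how a single extreme APE dominates the MAPE while the MdAPE remains a representative summary. (The TAPE M-estimator itself is not implemented here.)

```python
import statistics

def ape(actual, forecast):
    """Absolute percentage errors; undefined when an actual equals zero."""
    return [abs(a - f) / abs(a) * 100 for a, f in zip(actual, forecast)]

actual   = [100, 110, 105,  5, 108, 102]   # the 5 is a near-zero actual
forecast = [ 98, 112, 103, 60, 110, 100]

errors = ape(actual, forecast)
mape  = statistics.mean(errors)    # dragged up by one extreme APE
mdape = statistics.median(errors)  # resistant summary of the same APEs
print(round(mape, 1), round(mdape, 1))
```

Five of the six periods have APEs of about 2%, yet the MAPE reports a triple-digit error; the MdAPE correctly reports roughly 2%.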
Enhancing Forecasting Models
• Model Selection: Use insights from an STI trend/seasonal classification to choose appropriate demand forecasting models (ARIMA, exponential smoothing, statistical learning/machine learning (SL/ML) algorithms) that suit the data characteristics.
• Validation: Split data into training and testing sets and validate model performance to ensure robustness.
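For time series, the validation split must be chronological, never shuffled; a minimal holdout split (with toy numbers) looks like this:

```python
def holdout_split(series, test_len):
    """Chronological split: keep the last test_len points as the holdout.
    Shuffling would leak future information into the training set."""
    return series[:-test_len], series[-test_len:]

demand = list(range(1, 25))            # 24 months of toy demand
train, test = holdout_split(demand, 6)
print(len(train), len(test))           # 18 6
```

Rolling this split forward in blocks, as done in the profile example later in this article, gives several out-of-sample evaluations from one series.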
Takeaway
• By integrating EDA into the supply chain process, organizations can significantly enhance data quality and forecasting accuracy.
• EDA not only provides insights into data characteristics but also guides decision-making, ultimately leading to a more agile supply chain operation.
• Regularly revisiting EDA can ensure that the models remain relevant and accurate as the global supply chain environment evolves.
A Profile Performance Analysis Example
In a detailed example, I will use a real-world dataset with a seasonal/trend forecast profile, which is what many of these methods are designed for. The dataset has 96 observations, and we need to forecast the next 18-month time horizon. I will create three lead-time forecasts with the data. Starting with the initial 42 values in the dataset and holding out an additional 18-month lead-time period, I make three 'rolling-block' 18-month profile forecasts.
The dataset is monthly time series N2796 from the IIF M3 Forecasting Competition, which comprises 3003 time series of various types and periodicities. This is of interest to me now because I participated in the M3 with the PP-Autocast entry. By plotting the N2796 time series, you can visually examine the seasonal and trend patterns and notice an unusual event at point #38 (value = 360). An ETS model projects a seasonal profile, as expected, but the unusual event impacts the reliability of the seasonal factors (affecting seasonal adjustments) as well as the uncertainty range around the forecasts (more uncertainty). However, how does this influence forecasting performance?
EDA Step 1. Preliminary Examination of Dominant Seasonal Variation. Some methods in the M-Competitions first perform a seasonal adjustment on the data and then forecast the seasonally adjusted series. In this example, the seasonal pattern is affected by a single outlier. When I analyze the first three years with the STI classification scheme, in which variation is decomposed into seasonality, trend, and other (e.g., Excel > Data > Data Analysis: ANOVA – Two-Way Without Replication > SS column), we see that there is a 2% reduction in the overall seasonal variation as a result of one unusual value. A seasonal adjustment method applied to the unadjusted data can therefore be impacted, as I confirm with ss(2) in the following EDA step.
EDA Step 2. Impact of Unusual Value on Seasonal Factors. An analysis of the first 42 values in the time series with an ETS (A,A,M) model shows that the affected seasonal month #38 is significantly impacted by the one unusual value! I surmise that a zero was missing, so I changed 360 to a more representative 3600, which yields the seasonal factors of the adjusted data.
EDA Step 3. Creating Alphabet Profiles. The Actual Alphabet Profile (AAP) and the Forecast Alphabet Profiles (FAP) are basic to an information-theoretic approach I developed for a profile forecasting analysis. I first create AAP = {a1, a2, . . ., a18} and FAP = {f1, f2, . . ., f18} over the 18-month holdout sample by dividing the lead-time total for each profile into the respective components of the profile data, resulting in a set of lead-time weights or fractions summing to one. The alphabet profiles have identical patterns corresponding to the original data profiles.
The Forecast Profile Error (FPE) measures the period-by-period discrepancy between the forecast and actual alphabet profiles. The sum of the FPE values over the horizon is called the Profile Miss and can be interpreted as a measure of ignorance about the forecast profile error; the closer to zero, the better. Thus, a forecast Profile Miss measures how much a forecast profile (or alphabet pattern) differs from a profile of actuals over a fixed horizon.
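Constructing an alphabet profile is simply a normalization of the lead-time values to weights summing to one. This sketch uses a shortened six-period profile with invented numbers rather than the full 18-month horizon:

```python
def alphabet_profile(values):
    """Convert a lead-time profile into alphabet weights by dividing
    each period's value by the lead-time total; weights sum to one."""
    total = sum(values)
    return [v / total for v in values]

actuals = [80, 95, 120, 110, 90, 105]   # shortened 6-period illustration
aap = alphabet_profile(actuals)
print(aap, sum(aap))
```

Because both the actual and forecast profiles are scaled to sum to one, the comparison in the next step is about the shape of the profile, independent of the overall lead-time volume.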
EDA Step 4. Measuring Profile Accuracy. The performance of the process that creates a forecast profile is now of interest because it can be measured by a ‘distance’ metric between a Forecast Alphabet Profile (FAP) and Actual Alphabet Profile (AAP). A measure for a forecast alphabet profile accuracy is given by the Kullback-Leibler divergence measure D(a|f):
D(a|f) can be interpreted as a measure of ignorance or uncertainty about Profile Accuracy, which is what we should be interested in for lead-time demand forecasting performance. The D(a|f) accuracy measure is non-negative and equals zero if and only if ai = fi (i = 1, 2, . . ., 18). When D(a|f) = 0, the alphabet profiles overlap, which we consider to be 100% accurate.
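Assuming the standard Kullback-Leibler formula with natural logarithms, D(a|f) = Σ ai·ln(ai/fi), the measure can be computed as follows (the three-component profiles are illustrative stand-ins for the 18-period alphabet profiles):

```python
import math

def kl_divergence(a, f):
    """D(a|f) = sum_i a_i * ln(a_i / f_i); non-negative, and zero
    if and only if the two alphabet profiles coincide."""
    return sum(ai * math.log(ai / fi) for ai, fi in zip(a, f) if ai > 0)

a = [0.2, 0.3, 0.5]
print(kl_divergence(a, a))                     # identical profiles -> 0.0
print(kl_divergence(a, [0.4, 0.3, 0.3]))       # mismatch -> positive
```

Note that D(a|f) is not symmetric in its arguments, which is why the direction D(actuals | forecast) is fixed by convention here.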
EDA Step 5. Evaluating Profile Forecasting Performance. Now, to evaluate the effectiveness of a Method compared to a benchmark Method, I create a new skill score for lead-time profile forecasting, called the Levenbach L-Skill Score = 1 – [D(a|Method) / D(a|Benchmark)]. This seems appropriate for performance measurement for this profile forecasting process. The L-skill score is proper and unit-free, as it does not have the units of the accuracy measure D(a|f), and it turns out to have a parallel relationship with the MSE skill score = 1 – [MSE(Method) / MSE(Benchmark)] used with point forecasting in a normal (i.e., Gaussian) forecast modeling framework.
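The L-skill score itself is a one-liner; the divergence values below are invented purely for illustration:

```python
def skill_score(d_method, d_benchmark):
    """L-skill score = 1 - D(a|Method) / D(a|Benchmark): positive when
    the method's profile divergence beats the benchmark's, zero when
    they tie, and negative when the benchmark wins."""
    return 1 - d_method / d_benchmark

print(skill_score(0.02, 0.08))   # method beats benchmark
```

The same function computes the MSE skill score if MSE values are passed instead of divergences, which is the parallel noted above.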
EDA Step 6. Evaluating Profile Performance over the Holdout Periods. For the three eighteen-month holdout samples (#1: 43 to 60; #2: 61 to 78; #3: 79 to 96), I have created forecasts by using (1) a Naïve method and (2) an ETS statistical model. For the Naïve method, I use the previous lead-time actuals [(#1: 25 to 42) (#2: 43 to 60) (#3: 61 to 78)] as a benchmark forecast for the holdout sample period. This benchmark forecast profile is labeled NaiveLT-18; it is also known as the Naïve-18 method for this type of multi-step-ahead lead-time forecast.
The model forecast profile is obtained with an automatic State Space forecasting model ETS (A,A,M), which is an exponential smoothing model with Additive error, Additive local level and Multiplicative seasonal forecast profile. This model has a deterministic seasonal/trend profile.
EDA Step 7. Impact of Outlier (#38) on Forecast Profiles. For holdout period #1 (43 to 60), there are differences in the ETS (A,A,M) seasonal profiles as a result of only one outlier, even though there is a dominant seasonal pattern in the data. As expected, the most significant impact is in period 8 (forecast #50 = #38 + 12 months), which corresponds to seasonal factor ss(2). The 41% change in the forecast, from an outlier (= 360) to a more representative value (= 3600), impacted the bias in all the forecasts in the same direction.
Treating these forecast profiles as point forecasts, the MAPE would give preference to the forecast based on the unadjusted data, while the MdAPE is more representative of a typical APE and would favor the forecast based on the adjusted data.
In practice, the smarter forecaster should scrutinize the data on an ongoing basis in order to assess the possible impact of outliers or unusual values on the performance results.
Data quality should be routinely checked before, during, and after all forecast modeling steps.
The examples are taken from my recently updated book Change & Chance Embraced. My books are available on all Amazon websites worldwide.
With the support and endorsement of the International Institute of Forecasters, I created the first certification curriculum for demand forecasters (CPDF) and have conducted numerous (pre-pandemic) hands-on Professional Development Workshops in public and company-private groups worldwide. The CPDF training materials are also available online on Amazon websites.
Hans is a Past President and Treasurer, and former member of the Board of Directors of the International Institute of Forecasters.
He is Owner/Manager of these LinkedIn groups: (1) Demand Forecaster Training and Certification, Blended Learning, Predictive Visualization, and (2) New Product Forecasting and Innovation Planning, Cognitive Modeling, Predictive Visualization.
I invite you to join these groups and share your thoughts and practical experiences with demand data quality and demand forecasting performance in the supply chain. Feel free to send me the details of your findings, including the underlying data without identifying proprietary descriptions. If possible, I will attempt an independent analysis and see if we can collaborate on something that will be beneficial to everyone.