Discovering the Best Choice for Spline’s Knots and Intervals Using Order of Polynomial Regression Model
1. Introduction
Regression analysis is typically used by academics to examine the impact of several independent factors, or explanatory variables, on a single response variable. Investigators use the regression equation to explain how the response and explanatory variables relate to one another [1]. Regression analysis can be divided into linear regression and nonlinear regression [2]. Linear regression is easy to understand and simple to fit, and it is attractive because many techniques exist for testing its assumptions. However, in many cases the data are not linearly related, and linear regression is then not recommended. The traditional nonlinear regression model fits the model

$$y_i = f(x_i; \beta) + \varepsilon_i,$$

where the functional form $f$ is specified in advance and $\beta$ is a vector of parameters to be estimated.
However, in some situations the structure of the data is so complicated that it is very difficult to find a function that estimates the relationship correctly. Other difficulties might emerge, such as selecting good starting values and a suitable criterion for declaring convergence. The general nonparametric regression model is written in a similar manner, but the function $f$ is left unspecified:

$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, \dots, n,$$

where $f$ is an unknown smooth function, $y_i$ are the observed values of the response variable $y$, $x_i$ are the observed values of the explanatory variable $x$, and $\varepsilon_i$ are normally distributed random errors with zero mean and common variance $\sigma^2$, i.e. $\varepsilon_i \sim N(0, \sigma^2)$. The basic goal of nonparametric regression is to estimate the regression function $f$ directly, rather than to estimate parameters [3] [4]. Most nonparametric regression methods assume that $f$ is smooth, that is, a continuous function whose first and second derivatives exist.

This work aims to provide practical guidance for selecting the most appropriate order of the polynomial model and for using it to fit the best model with spline regression. The study explores the relationship through empirical experiments and through model selection criteria such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The optimal order of the polynomial model is then used to determine the optimal number of spline knots and intervals that balance the spline model’s goodness of fit.
2. Methodology
Many different techniques, such as polynomials and splines, are used to fit the relationship between variables when that relationship is nonlinear. Nonlinear models that are intrinsically linear involve functions of the independent variable x that are either strictly increasing or strictly decreasing [5] [6]. In many cases, either theoretical reasoning or a scatter plot of the data suggests that the true regression function has one or more peaks or valleys, that is, at least one relative minimum or maximum. In such situations, a polynomial function may provide a satisfactory approximation to the true regression function.
2.1. Polynomial Regression
Linear regression cannot always capture the relationship between the dependent variable and the independent variables. In many cases, the data follow a nonlinear pattern that cannot be captured by a simple straight line [7]. Polynomial regression is useful in exactly this situation. Adding polynomial terms (e.g., $x^2$, $x^3$) to the simple regression model can improve the accuracy of the model and allow it to fit complex data patterns. Higher-degree polynomials can capture more intricate patterns, but they can also lead to overfitting [8] [9]. Polynomial regression is a special case of multiple linear regression, because the model remains linear in its coefficients even though it is nonlinear in $x$. The polynomial regression model of order $p$ is defined as:

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_p x_i^p + \varepsilon_i, \quad i = 1, \dots, n.$$
2.2. Selecting Polynomial Regression Degree
Given $n$ data points with distinct $x$ values, we can always find a polynomial of degree $n - 1$ that passes through every one of the points. Such a fit is useless for prediction, so the polynomial degree is instead selected to minimize the estimated variance, or we stop at the degree beyond which the variance no longer decreases significantly as the degree increases [7] [10] [11]. In general, we need to balance the tradeoff between the bias and the variance of the regression model. The estimated variance of the polynomial regression model is defined as:

$$\hat{\sigma}_p^2 = \frac{SSR_p}{n - p - 1},$$

where $SSR_p$ is the sum of squared residuals for the $p$th polynomial order, $p$ is the polynomial order, and $n$ is the number of data points.
Figure 1 below shows different polynomial orders used to fit polynomial data.
Figure 1. Plots (a) and (b) of the nonlinear relationship fitted using different polynomial orders.
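The degree-selection rule described in Section 2.2 can be sketched in a few lines. This is only an illustration under stated assumptions: the data are simulated from a hypothetical cubic trend (not the dataset used in this paper), NumPy is assumed to be available, and the helper names `polynomial_design` and `residual_variance` are invented for the example.

```python
import numpy as np

def polynomial_design(x, order):
    """Design matrix with columns 1, x, x^2, ..., x^order."""
    return np.vander(x, N=order + 1, increasing=True)

def residual_variance(x, y, order):
    """Estimated error variance SSR_p / (n - p - 1) for a polynomial of order p."""
    X = polynomial_design(x, order)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
    residuals = y - X @ beta
    return float(residuals @ residuals) / (len(y) - order - 1)

# Hypothetical data: a cubic trend plus normal noise (an assumption for illustration).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 100)
y = 1.0 - 2.0 * x + 3.0 * x**3 + rng.normal(scale=0.1, size=x.size)

# Estimated variance for each candidate order; look for where it stops dropping.
variances = {p: residual_variance(x, y, p) for p in range(1, 7)}
for p, v in variances.items():
    print(f"order {p}: estimated variance = {v:.5f}")
```

In a run like this one would expect the estimated variance to fall sharply up to the true order and then level off, which is the stopping signal the rule relies on.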
3. Splines Regression
A spline is a piecewise function in which each segment is a polynomial function. Splines are designed to draw curves that balance goodness of fit against the mean square error of the model. To fit a regression model using splines, we need to select certain breakpoints, called knots. A linear or polynomial regression model is then fitted between each pair of adjacent knots. Both the degree of the polynomial and the number of knots must be determined to fit the spline regression model [11]. The piecewise regression between the knots is assumed to be a continuous polynomial function. Many different types of spline regression models can be used to fit a smooth relationship between the explanatory variable and the response variable, such as cubic splines, B-splines, P-splines, natural splines, thin-plate splines, and smoothing splines [12] [13].
3.1. Basis Functions
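A basis is defined as a set of elements such that any element of the space can be expressed uniquely as a linear combination of them. The regression model can be extended to accommodate nonlinear effects by adding polynomial terms to the basis [14]. The most popular bases for the polynomial regression model are introduced in this section. The basis functions for the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ are $1$ and $x$, which form the columns of the design matrix:

$$X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}.$$

Using the matrix formula, we can obtain the vector of fitted values of $y$ by:

$$\hat{y} = X (X^T X)^{-1} X^T y = H y.$$

The basis functions for the quadratic model are $1$, $x$, and $x^2$:

$$X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}.$$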
Figure 2. Top left: (a) Simple linear model representation; Top right: (b) Corresponding basis representation; Bottom left: (c) Quadratic model representation; Bottom right: (d) Corresponding basis representation.
Figure 2(a) and Figure 2(b) illustrate the simple linear regression model and its corresponding basis functions, and Figure 2(c) and Figure 2(d) do the same for the quadratic regression model. The quadratic model extends the simple linear regression model so that it can accommodate a different type of nonlinear structure.
Figure 3. Top left: (a) Broken Stick model representation; Top right: (b) Corresponding basis representation; Bottom left: (c) Cubic model representation; Bottom right: (d) Corresponding basis representation.
The broken-stick regression model is a special case of the linear regression model with two differently sloped line segments. It is proposed in order to make complex, nonlinear functions more amenable to modelling [15]. The broken-stick model shown in Figure 3(a) consists of two differently sloped lines that join at a point denoted here by $x = \kappa$. To introduce the basis function for the broken-stick model, we need a slope for each of the two lines connected at $\kappa$: a positive slope to the left of $\kappa$ and a negative slope from $\kappa$ onward. The new basis function of the broken-stick model can be expressed as $(x - \kappa)_+$ (the positive part of the function $x - \kappa$), which equals $x - \kappa$ when $x > \kappa$ and zero otherwise.
The cubic-sin model is more complicated than the broken-stick model because it has several features, including peaks, troughs, and inflection points. Figure 3(c) shows a cubic-sin model including straight lines and inflection points, and Figure 3(d) shows the corresponding basis functions. The cubic-sin model with its associated basis functions can be written as:

$$f(x) = \beta_0 + \beta_1 x + \sum_{k=1}^{K} b_k (x - \kappa_k)_+ .$$

The function of the cubic-sin model contains multiple line segments that are tied together at $\kappa_1, \dots, \kappa_K$. The values $\kappa_k$ corresponding to the basis functions are usually referred to as knots [16]. A spline model can be estimated using a linear combination of the basis functions $1, x, (x - \kappa_1)_+, \dots, (x - \kappa_K)_+$ with multiple knots at $\kappa_1, \dots, \kappa_K$.
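As a rough illustration of the basis functions in Section 3.1, the sketch below builds the design matrix for a truncated-line basis and fits a broken-stick model by least squares. The join point 0.6, the slopes, and the function name `truncated_line_basis` are assumptions chosen for the example and are not taken from the figures. NumPy is assumed.

```python
import numpy as np

def truncated_line_basis(x, knots):
    """Design matrix with columns 1, x, (x - k_1)_+, ..., (x - k_K)_+."""
    x = np.asarray(x, dtype=float)
    columns = [np.ones_like(x), x]
    columns += [np.maximum(x - k, 0.0) for k in knots]  # positive-part functions
    return np.column_stack(columns)

# Hypothetical broken stick: slope +2 left of the knot, slope -1 from the knot onward.
knot = 0.6  # assumed join point, for illustration only
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x - 3.0 * np.maximum(x - knot, 0.0)  # noise-free for clarity

X = truncated_line_basis(x, [knot])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta holds: intercept, slope left of the knot, change in slope at the knot
print(beta)
```

The third coefficient is the change in slope at the knot, so the slope to the right of the knot is the sum of the second and third coefficients. Passing several knots to the same function gives the multi-knot spline basis.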
3.2. Knots Placement and Numbers
In the spline regression model, the number of knots and their placement along the range of $x$ must be determined by the analyst. Most software packages place the knots at the quartiles or quintiles of the data by default, and the analyst can override this default placement. The number of knots has an important effect on the spline fit, and analysts have found that where the knots are placed matters less than how many knots are used [17]. A spline with two knots will be linear and globally smooth because there is only one piecewise function. Increasing the number of knots increases the number of piecewise functions used to fit the data, so the number of knots controls both the number of piecewise fits and the amount of smoothing [18]. In practice, knots are commonly placed at evenly spaced intervals across the range of $x$ to obtain a smooth fit to the data. The number of knots effectively acts as a span parameter for splines, so selecting it involves a tradeoff. A spline model estimated with a small number of knots will be overly smooth, with little variability, and may be biased. Conversely, a spline model estimated with a large number of knots will have little bias but high variability, and may overfit the data. The spline fit is not overly sensitive to the selected number of knots [12]. There are two existing methods for selecting the number of knots: visual trial and error, which involves adding or removing knots based on the fit, and Akaike’s Information Criterion (AIC), which is less arbitrary and produces reasonable results. The choice also depends on sample size, with five knots for larger samples and three for smaller ones [18]. Knot selection can be complicated, but smoothing splines have made it easier to understand and compute.
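The AIC-based choice of the number of knots can be sketched as follows. Several things here are assumptions made for illustration: the data come from a hypothetical sine curve, the spline is a linear truncated-line spline, only interior knots are counted (boundary points are not treated as knots), knots are placed at equally spaced sample quantiles, and the AIC is the Gaussian form up to an additive constant. The function names are invented for the example and NumPy is assumed.

```python
import numpy as np

def spline_design(x, knots):
    """Columns 1, x, (x - k_1)_+, ..., (x - k_K)_+ for the given knots."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

def quantile_knots(x, n_knots):
    """Interior knots at equally spaced sample quantiles of x."""
    probs = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    return np.quantile(x, probs)

def gaussian_aic(y, fitted, n_params):
    """Gaussian AIC up to an additive constant: n * log(RSS / n) + 2 * k."""
    n = len(y)
    rss = float(np.sum((y - fitted) ** 2))
    return n * np.log(rss / n) + 2 * n_params

# Hypothetical data (not the paper's dataset).
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

aic_by_knots = {}
for k in range(1, 11):
    X = spline_design(x, quantile_knots(x, k))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    aic_by_knots[k] = gaussian_aic(y, X @ beta, X.shape[1])

best_k = min(aic_by_knots, key=aic_by_knots.get)
print("number of interior knots chosen by AIC:", best_k)
```

Swapping `2 * n_params` for `np.log(n) * n_params` would give the corresponding BIC, which penalizes extra knots more heavily for samples of this size.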
3.3. Smoothing Splines
Smoothing splines address the bias-variance tradeoff through the solution to a penalized sum of squares. More details about smoothing splines can be found in [18] [19]. The penalized sum of squares is

$$SS(f, \lambda) = \sum_{i=1}^{n} \left[ y_i - f(x_i) \right]^2 + \lambda \int \left[ f''(x) \right]^2 \, dx.$$

The residual sum of squares and the roughness penalty are the two key terms in smoothing splines. The first term is the residual sum of squares, while the second term is a roughness penalty consisting of $\lambda$, a smoothing parameter, and the integrated squared second derivative of $f$. The latter measures the curvature of the function, that is, the rate of change of its slope. As $\lambda$ increases, the second derivative is constrained toward zero, and the result approaches the globally linear least-squares fit. The penalty term thus pulls the fit toward linearity and limits the approximate degrees of freedom [16].
3.4. Selecting the Smoothing Parameter
Local-polynomial regression and smoothing splines have adjustable smoothing parameters that can be selected by trial and error or by cross-validation. Built-in functions from the ssanova library can be used to choose the smoothing parameter. The penalized spline model can be rewritten in matrix form, and the penalty term can be written as a quadratic form in the coefficient vector $\beta = (\beta_0, \beta_1, b_1, \dots, b_K)^T$ [16]. The matrix form of the penalty term can be written as:

$$\beta^T D \beta, \qquad D = \begin{bmatrix} 0_{2 \times 2} & 0_{2 \times K} \\ 0_{K \times 2} & I_{K \times K} \end{bmatrix},$$

where $K$ denotes the number of knots. Therefore, the penalized spline regression model can be expressed in matrix notation as follows:

$$\hat{\beta}_\lambda = \arg\min_{\beta} \left\{ \| y - X\beta \|^2 + \lambda \, \beta^T D \beta \right\} = (X^T X + \lambda D)^{-1} X^T y.$$

A possible way to accommodate the penalty term within the standard linear regression framework is to introduce the hat matrix [16]. A smoother matrix for splines can be derived as

$$S_\lambda = X (X^T X + \lambda D)^{-1} X^T,$$

where $\lambda$ is the scalar penalty parameter that multiplies the matrix operator $D$. The optimal value of $\lambda$ can be determined by using two approaches: Cross-Validation (CV) and Generalized Cross-Validation (GCV).
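The smoother matrix and a GCV search over the smoothing parameter can be sketched as below. This is a minimal illustration under stated assumptions: a linear truncated-line basis with ten evenly spaced interior knots, simulated data from a hypothetical sine curve, a penalty that multiplies the smoothing parameter directly into a matrix D with zeros for the intercept and slope, and one standard form of the GCV score, (RSS/n) / (1 - tr(S)/n)^2. The names `smoother_matrix` and `gcv_score` are invented for the example and NumPy is assumed.

```python
import numpy as np

def spline_design(x, knots):
    """Columns 1, x, (x - k_1)_+, ..., (x - k_K)_+ for fixed knots."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

def smoother_matrix(X, n_knots, lam):
    """S = X (X'X + lam * D)^{-1} X', where D penalizes only the knot terms."""
    D = np.diag([0.0, 0.0] + [1.0] * n_knots)
    return X @ np.linalg.solve(X.T @ X + lam * D, X.T)

def gcv_score(y, S):
    """One standard GCV form: (RSS / n) / (1 - trace(S) / n)^2."""
    n = len(y)
    resid = y - S @ y
    return (float(resid @ resid) / n) / (1.0 - np.trace(S) / n) ** 2

# Hypothetical data and knots, chosen only for illustration.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
knots = np.linspace(0.0, 1.0, 12)[1:-1]  # 10 evenly spaced interior knots
X = spline_design(x, knots)

# Grid search: keep the smoothing parameter with the smallest GCV score.
lambdas = np.logspace(-6, 2, 30)
scores = [gcv_score(y, smoother_matrix(X, len(knots), lam)) for lam in lambdas]
best_lam = float(lambdas[int(np.argmin(scores))])
df = float(np.trace(smoother_matrix(X, len(knots), best_lam)))
print(f"lambda chosen by GCV: {best_lam:.3g}, effective degrees of freedom: {df:.2f}")
```

The trace of the smoother matrix plays the role of the approximate degrees of freedom: it equals the number of basis functions when the penalty is zero and shrinks toward 2 (a straight line) as the penalty grows.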
3.5. Cross-Validation
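The fundamental concept of cross-validation is to leave the data pairs $(x_i, y_i)$ out one at a time and to choose the smoothing parameter $\lambda$ that minimizes the resulting sum of squared prediction errors. The squared residual at the point $x_i$ is computed from the function estimated using the remaining $n - 1$ data points. The cross-validation criterion is given by:

$$CV(\lambda) = \sum_{i=1}^{n} \left\{ y_i - \hat{f}_{-i}(x_i; \lambda) \right\}^2,$$

where $\hat{f}_{-i}(\cdot\,; \lambda)$ denotes the nonparametric regression estimate applied to the remaining data after $(x_i, y_i)$ has been deleted. We choose the parameter $\lambda$ that minimizes $CV(\lambda)$ over $\lambda \ge 0$. There are $O(n)$-order algorithms for the computation of $CV(\lambda)$ for the most common smoothing techniques [20]. For linear smoothers, the vector of fitted values is defined by:

$$\hat{y} = S_\lambda y, \qquad \hat{y}_i = \sum_{j=1}^{n} S_{\lambda, ij} \, y_j,$$

where $S_{\lambda, ij}$ is the $(i, j)$ entry of $S_\lambda$. For many smoothers,

$$\hat{f}_{-i}(x_i; \lambda) = \frac{\sum_{j \ne i} S_{\lambda, ij} \, y_j}{\sum_{j \ne i} S_{\lambda, ij}}.$$

Even when this expression does not hold exactly, it usually holds approximately [21]. Alternatively, we can use this expression as the definition of cross-validation [22]. Most smoothers have the sensible property that they preserve constants, $S_\lambda \mathbf{1} = \mathbf{1}$, which implies that $\sum_{j=1}^{n} S_{\lambda, ij} = 1$ for all $i$. The denominator in the leave-one-out expression is therefore equal to $1 - S_{\lambda, ii}$, and its numerator is $\hat{y}_i - S_{\lambda, ii} \, y_i$. Substituting into the cross-validation criterion, we can show that

$$CV(\lambda) = \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - S_{\lambda, ii}} \right)^2.$$

Therefore, cross-validation can be computed using only the ordinary residuals and the diagonal elements of the smoother matrix [23].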
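The shortcut described in Section 3.5 can be checked numerically. The sketch below compares a brute-force leave-one-out computation, which refits the model n times, with the single-fit formula based on the residuals and the diagonal of the smoother matrix. The assumptions are: a penalized linear spline with fixed knots and a fixed smoothing parameter, simulated data from a hypothetical sine curve, and illustrative function names. For this kind of penalized least-squares fit the two values should agree up to rounding error. NumPy is assumed.

```python
import numpy as np

def spline_design(x, knots):
    """Columns 1, x, (x - k_1)_+, ..., (x - k_K)_+ for fixed knots."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

def cv_brute_force(X, y, D, lam):
    """Refit n times, each time leaving one observation out (knots stay fixed)."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        beta = np.linalg.solve(Xi.T @ Xi + lam * D, Xi.T @ yi)
        total += float(y[i] - X[i] @ beta) ** 2  # squared prediction error at x_i
    return total

def cv_shortcut(X, y, D, lam):
    """Single fit: sum of ((y_i - yhat_i) / (1 - S_ii))^2."""
    S = X @ np.linalg.solve(X.T @ X + lam * D, X.T)
    resid = y - S @ y
    return float(np.sum((resid / (1.0 - np.diag(S))) ** 2))

# Hypothetical data, knots, and smoothing parameter (illustration only).
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
knots = [0.2, 0.4, 0.6, 0.8]
X = spline_design(x, knots)
D = np.diag([0.0, 0.0] + [1.0] * len(knots))
lam = 0.1

cv_slow = cv_brute_force(X, y, D, lam)
cv_fast = cv_shortcut(X, y, D, lam)
print(cv_slow, cv_fast)
```

The brute-force version solves n linear systems while the shortcut solves one, which is why the diagonal-based formula is the practical choice when scanning many candidate values of the smoothing parameter.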
4. Simulation Study
The simulation study was carried out to estimate a nonlinear model using a higher-order polynomial regression model and a splines regression model, and to determine the relationship between the polynomial order and the number of knots needed to fit the model using splines. RStudio was used to generate a dataset from a known nonlinear true equation. We used a sequence of values for the independent variable x, fixed values for the model parameters, and random noise from a normal distribution to build the dataset used for estimating the regression models. The regression model was fitted to the generated dataset using the two proposed methods. The results are shown in Figure 4 and Table 1.
Figure 4. Panels (a) and (b): the association between polynomial order and the number of knots for estimating the nonlinear model.
Table 1. Regression model results using splines and polynomials.

| Coefficients | Splines: Estimate | Splines: SE | Splines: t value | Splines: Pr(>\|t\|) | Polynomial: Estimate | Polynomial: SE | Polynomial: t value | Polynomial: Pr(>\|t\|) |
|---|---|---|---|---|---|---|---|---|
| $\beta_0$ | −1.8889 | 0.6177 | −3.058 | 0.00290** | −0.086 | 0.14197 | −0.603 | 0.548 |
| $\beta_1$ | 14.4120 | 1.1857 | 12.154 | <2e−16*** | −11.829 | 1.41970 | −8.333 | 6.37e−13*** |
| $\beta_2$ | −13.7853 | 0.8107 | −17.004 | <2e−16*** | −0.813 | 1.41970 | −0.573 | 0.568 |
| $\beta_3$ | 17.3196 | 1.0914 | 15.870 | <2e−16*** | −9.378 | 1.41970 | −6.605 | 2.35e−09*** |
| $\beta_4$ | −10.2133 | 0.8905 | −11.469 | <2e−16*** | −0.990 | 1.41970 | −0.697 | 0.487 |
| $\beta_5$ | 2.6959 | 0.8980 | 3.002 | 0.00343** | 23.143 | 1.41970 | 16.302 | 2e−16*** |
| MSE | 0.01248936 | | | | 0.01510638 | | | |
| $R^2$ | 0.8643 | | | | 0.8015 | | | |
| Adjusted $R^2$ | 0.8571 | | | | 0.791 | | | |
Figure 5. Top left: cross-validation of the polynomial model for different orders; Top right: AIC and BIC for the polynomial model; Bottom left: cross-validation of the splines model for different numbers of knots; Bottom right: AIC and BIC for the splines model.
From the estimated model results, we can observe that both methods performed well in estimating the nonlinear relationship. Moreover, comparing the MSE values estimated by the two methods shows that the splines model is more accurate than the polynomial model (with the polynomial order and the number of spline knots both fixed at 5). In this study, we would like not only to compare the estimated models but also to use the results of the polynomial model to determine the number of knots for fitting the splines model. The estimated values of MSE, $R^2$, and adjusted $R^2$ for the splines model are 0.01248936, 0.8643, and 0.8571 respectively, compared with 0.01510638, 0.8015, and 0.791 for the order-five polynomial model. Even though some of the higher-order polynomial coefficients are not significant, the estimated values of the MSE and $R^2$ of the two models are still comparable.
The plots in Figure 5 compare the cross-validation results and the goodness-of-fit criteria estimated for the two models. From the plots, we can observe that the polynomial regression model tends to overfit at high degrees and shows an excellent fit at order 5. Based on the regression results and the goodness-of-fit criteria (AIC and BIC), order 5 also minimizes the criteria for the polynomial model. For the splines model, a larger number of knots decreases the criteria. Therefore, we can say that the best number of knots for modeling the data with the spline model is 5, which also minimizes the spline model’s goodness-of-fit criteria.
5. Conclusion
The first important step in building a realistic regression model is to understand the differences between the two models considered here, splines and polynomial regression. The polynomial regression fit is smooth but can follow the data in a wiggly manner, probably because of its high degrees of freedom. The first-order polynomial model is a straightforward generalization of simple linear regression. Splines are more flexible and smoother than polynomial regression, and they behave better in extrapolation. The complexity of the spline model increases as the number of knots and the number of intervals increase. Therefore, the best choice for the number of knots and the number of intervals should be determined before building a regression model using splines. This study aimed to compare the results of the polynomial regression model and the splines regression model when the number of segments is equal to the polynomial degree. We compared the two models in terms of point estimates. The mean squared error and the fitted regression line show that the splines regression model performs well when the number of segments is equal to the polynomial degree. In our simulation study, order five of the polynomial model was the best choice for the number of spline segments.