1. Introduction
Given the continuous growth of the Chinese economy and the rapid improvement of people’s purchasing power, China has been the world’s largest retail market since 2016, when its sales volume reached USD 5.2 trillion [
1]. The various favorable advantages in the Chinese market, such as huge population, large spending power, remarkable market order, and tax preference, have attracted many global retail chains, such as Zara [
2,
3], Walmart [
4], and Carrefour [
5]. The boom of the retail industry in China makes it more convenient for consumers to purchase basic goods in nearby retail shops rather than in supermarkets far from home, thereby avoiding crowded urban traffic; retail shops thus play an important role in people’s daily lives [
6]. However, the potential Chinese market also brings challenges and competition for the retail industry in China, among which the first challenge is selecting appropriate locations with high market potential and low competition for new retail shops.
As one of the basic issues in modern urban planning and city construction, site selection comprehensively considers economic, environmental, and social factors, which may often be conflicting in site selection [
7]. The appropriate determination of site selection will sufficiently use city resources and provide high efficiency for the public [
8]. Given the development of spatial analysis technology, spatial site selection has been widely researched in fields, such as municipal planning [
9], port site selection [
10], transportation planning [
11], and electric management [
12].
A good site selection strategy is thought to be an effective means to reduce cost and obtain high benefits in fields, such as logistics storage siting and municipal planning [
13]. The business site selection for retail shops is a new application of GIS (geographic information system) theory in solving socioeconomic issues. Complex market potential and consumer characteristics make site selection strategies difficult for many retail business managers. To determine regional market rules, geographic analyses have been used in previous studies. Kayacan [
14] utilized a spatial interpolation method to estimate the market demand among regions and determine potential commercial centers for the site selection of sports retail shops. Luyao [
15] used spatial autocorrelation to determine the market demand based on the sales performance of nearby retailers in geographic units, and the new retail shops were recommended in places with high market demand but few existing shops. Şener [
16] established a model that combined AHP (analytical hierarchical process) with GIS methods to determine the appropriate location of landfill sites, considering complex social, environmental, and technical parameters.
The appropriate geographic location of retail shops can bring passenger flow to retailers and locate them far away from fierce competition, thereby leading to stable profit [
17]. Moreover, the good site selection of retail shops will also provide convenience for nearby residents. Some large retail companies have conducted site selection strategies based on the geographic analysis of target markets. Zara chose the locations of its chain shops based on the estimation of consumers in regions and conducted different brand distribution strategies after realizing the consumer preferences in different regions via historical sales performance data [
18]. The business strategies help Zara to grasp the market demand precisely and earn high profits.
In previous studies, spatial distance was mainly considered in site selection research. Many socioeconomic factors have been regarded as important influence factors on retail sales by studying the distribution patterns of successful retail shops. These factors include population density [
19], ethnicity [
20], regional income level, traffic networks [
21], and nearby facilities [
22]. However, socioeconomic information was usually obtained through statistical yearbooks, the scale of which is extremely large. Such a process is useful for the site selection of large supermarkets or superstores but also leads to weakness in the site selection for small retail shops [
23]. Given that the influence scopes of retail shops are consistently small, detailed information of the regional markets is needed. With the development of mobile networks, social media data are a new type of data sources for researchers. Given that social media data are an actual reflection of peoples’ activities, the data can provide spatiotemporal information for the site selection problems. Jiang [
24] established a Huff model using social media data to determine locations with high market potential for new shopping malls.
In actual markets, complex factors, such as traffic and competition factors, should be considered to obtain improved results, and some actual sales performance should be used for verification. In this study, we propose a two-step model for the site selection of retail shops. First, we established an improved gravity model for retail shops in micro scale to evaluate the accessibility of nearby retail shops in each region. Regions with low accessibility implied that existing retail shops could not satisfy the consumer needs in these areas. Then, we trained a PCA (principal component analysis)–BP (backpropagation network) model through 18 socioeconomic factors and actual sales data in Guiyang, China. The trained model was then used to estimate the market potential in regions with low spatial accessibility, thereby distinguishing regions with different market values.
The paper is organized as follows.
Section 2 reviews previous studies on site selection.
Section 3 introduces the methodology used in the present study.
Section 4 presents the data source used in the study and the study area.
Section 5 discusses the experiments conducted in Guiyang, China and recommends 42 best locations for the retail shops. The accuracy of the model was tested and compared with other methods. The conclusions and directions for future work are provided in
Section 6.
2. Materials
In view of complex socioeconomic factors, the most favorable site is often difficult to find, thereby leading to site selection being a challenging research topic in GIS [
25]. In previous studies, site selection has been mainly conducted for public facilities via historical experience and knowledge of the city environment [
26]. Given the lack of quantified standards in the process of site selection, the social benefits of the locations have been often high in some places and low in others. Thus, site locations will occasionally not meet the customer needs. Problems also occur when handling new regions with no reliable historical knowledge [
27,
28]. To overcome the problems in the traditional methods of site selection, two types of new technologies and theories were introduced, namely, qualitative and quantitative methods.
Quantitative methods have been established through specific and quantitative data sets. In view of the unavailability of certain socioeconomic factors in small scale, quantitative methods mainly focus on some aspects that can be easily obtained and quantized precisely, such as spatial distance. By investigating the sales performance of retail shops in 150 cities in America, Reilly [
29] proposed the “law of retail gravitation” to model quantitatively the relationship between consumer groups and retail gravity. In the model, gravity is in direct proportion to city size and in inverse proportion to spatial distance. The model was improved by Batty (1987) and has been widely used in the site selection of retail shops, thereby spurring additional advances in estimating trade areas and determining market shares [
30]. The Huff model is an improvement of Reilly’s law of retail gravitation, which comprehensively considers retailers across all locations [
31]. In the research of Jiang [
24], a Huff model was established according to the activity centers of people obtained from web check-in data to estimate the visit probability in a grid-based map, and the locations were recommended for the new malls based on the grid map. Puniway et al. [
32] established a GIS model for the site selection of marine artificial port in the Island of Hawaii. Through multiple superimposed analysis, the benefits and limitations caused by several aspects, including biophysical, regulatory, and spatiality of candidate locations, were compared to determine the best location.
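The Huff model mentioned above can be illustrated with a minimal sketch. It assumes the common textbook form, in which the probability of visiting a store is its attractiveness divided by a power of distance, normalized over all stores. The function name, the exponents `alpha` and `beta`, and the sample values are hypothetical and are not taken from the cited studies.

```python
def huff_probabilities(attractiveness, distances, alpha=1.0, beta=2.0):
    """Probability that a consumer at one origin visits each store.

    attractiveness: store attractiveness values S_j (e.g. floor area), all > 0
    distances: distances d_j from the origin to each store, all > 0
    alpha, beta: hypothetical exponents; in practice they are calibrated
    """
    # Utility of each store: attractiveness discounted by distance.
    utilities = [s ** alpha / d ** beta for s, d in zip(attractiveness, distances)]
    total = sum(utilities)
    # Normalize so that the probabilities over all stores sum to 1.
    return [u / total for u in utilities]


# Illustrative values only: a small nearby store and a larger, more distant one.
probs = huff_probabilities([100.0, 400.0], [1.0, 2.0])
```

With these illustrative inputs the two stores have equal utility (100/1 and 400/4), so each receives half of the visits.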
In a complex social environment, the available data often fail to meet accuracy requirements because the data set is limited in size or some socioeconomic factors are unavailable. Calculating parameters in site selection models from unrefined data sets causes considerable errors. To address this problem, qualitative methods have been introduced. These methods are models established based on historical experience or expert knowledge. As a typical qualitative method, AHP was first used to provide a reference for site selection strategies. On the basis of expert knowledge, the method assigns an effect to each socioeconomic factor, which is more reliable than traditional approaches [
33]. Mosadeghi [
34] used AHP to address land use issues. Through AHP, a suitability evaluation map containing the information of slope, water source, and sunlight was created to help farmers decide the appropriate locations for different crops. Rong [
35] combined AHP with spatial theory to estimate the land potential of different locations. The results were used to provide guidance for city planners. Given the boom in the development of geographic technologies, some GIS approach theories have been applied to site selection combined with the prioritization of qualitative criteria. Boroushaki [
36] established a model that combined fuzzy theory with shortest path analysis in GIS to determine the best places for constructing parking lots. Complex traffic network and land cost were considered, and the weights of multidimensional data were set based on the fuzzy theory.
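To make the AHP weighting step concrete, the following sketch derives criterion weights from a pairwise comparison matrix using the row geometric-mean approximation. This is one common way to approximate AHP priorities and is not necessarily the procedure used in the cited studies. The three criteria and the judgment values are hypothetical.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method.

    pairwise[i][j] states how much more important criterion i is than
    criterion j (so pairwise[j][i] should be its reciprocal).
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical expert judgments for three criteria (illustrative only).
pairwise = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(pairwise)
```

For this perfectly consistent matrix the row geometric means are 2, 1, and 0.5, so the weights are 4/7, 2/7, and 1/7.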
In addition to GIS–AHP, another important qualitative method is the spatial Delphi method, which is a modified version of the Delphi method [
37,
38]. Unlike traditional expert survey methods, Delphi is an iterative method. The survey of experts is anonymous and not face to face. Experts receive feedback from one another and adjust their initial assessments. After several rounds, the final results with a high level of consensus are taken as the ideal assessment results [
39]. Delphi was then introduced to the GIS field; the spatial Delphi [
40] and real-time spatial Delphi [
41] methods were proposed to solve spatial problems. The questionnaire for experts’ assessment was conducted on a map, and best locations for a specific purpose could be determined. Other methods include fuzzy logic methodology and multicriteria analysis (MCA). By combining GIS and fuzzy logic methodology, Aydi [
42] developed a site selection tool for the disposal site of Olive Mill Wastewater in Tunisia. The established hybrid model optimized the weights of each factor; the model was proven more effective than the GIS–AHP model. Papadimitriou [
43] applied a non-numerical algebraic and qualitative method based on lattice theory for land management in Rio de Janeiro, Brazil. The method addressed the complexity of the landscape while obeying a mathematical model or theory. MCA was also used via the ArcGIS platform to determine the most suitable sites for an electric substation [
44] and an environmental protection station [
45] under various conditions. Qaddah [
46] used a GIS-based methodology in conjunction with MCA to evaluate alternative site suitability to identify the best location for seismic stations based on given criteria developed in the GIS environment, and the individual satisfaction degrees for each alternative location were calculated using a weighted overlay tool. A GIS-based MCA method was performed by combining the information from several criteria to form a single index of evaluation, from which a choice will be made.
However, in previous studies, spatial distance is the main factor in site selection problems [
24,
35,
36]. The best location of a facility comprehensively considers many environmental, economic, and social factors; however, a linear correlation often exists among factors [
47]. To overcome this problem, PCA has been used to extract the main information of the factors in site selection issues [
48]. Moreover, the weights of factors are usually difficult to confirm via simple regression methods. To improve the accuracy in the analysis, McCulloch and Pitts [
49] introduced learning technology in 1943 to train a model with complex factors. Deep learning is a series of mathematical models that generalize the thinking and learning patterns of the human brain. Currently, deep learning algorithms have been widely used in fields such as image recognition, image classification, unmanned driving, and traffic prediction. ANN (artificial neural network) has been widely used in the prediction of the duration of industrial building construction [
50] and electric price [
51], site assessment of reservoir [
52], and the evaluation of smart cities [
53]. An ANN can extract the main features relating input and output from the training data. Thus, the established neural network can predict the results for new input. As a deep learning network, the BP neural network has been proven to achieve higher fitting precision than traditional regression models when facing multiple factor problems. The BP neural network with its simple structure has been applied in business domains, such as decisions for business locations in city centers [
54]. Zhai [
55] designed a hybrid BP neural network for the location analysis of large-scale supermarkets based on the consideration of approachability competition and space cost in different locations. Zhang [
56] proposed a site selection model based on BP neural network for business hotels. In view of rent cost, competition, and traffic factors, the model was established to determine the best locations for small hotels in micro scale. Like supermarkets and hotels, retail shops are commercial facilities that satisfy people’s needs. The site selection of retail shops also comprehensively considers consumer quantity, competition, and other socioeconomic factors; thus, the BP neural network could also be a new solution for the site selection of small retail shops.
5. Experimental Results
5.1. Spatial Division and PCA
Space division was conducted on the urban area of Guiyang City based on the studies of Zheng [68] and Wang [15]. To select the appropriate grid cell size and pondage factor (Formula (4)), 10 groups of grid sizes were used to segment the space. Two quantities were calculated for each division, namely, an evaluation index and the pondage factor in Formula (4); the results are shown in Table 4. According to Table 4, the evaluation index rose overall with grid size over part of the tested range and fluctuated slightly over another part. Its highest value was 0.927, when the grid cells were 400 m × 400 m. Thus, the most appropriate grid size was 400 m × 400 m, and the pondage factor was set as 0.300.
Guiyang was divided into 1867 regions via spatial division, and the scale of each grid was set as 400 m × 400 m based on the grid-size comparison. The retail shops were distributed in 1016 grids. A total of 851 grids did not have any retail shops, although some of them had large populations and many POIs. The unbalanced distribution of retailers causes inconvenience for citizens and reduces the profit of retail businesses. Thus, new site selection strategies for retail shops should be adopted in these regions. The purpose of our experiment was to construct a model that reveals the relationship between sales and social media data using the 1016 regions and then apply the model to the 851 grids to determine the locations with the highest potential. The 1016 regions were set as the original data set, from which 762 regions (75%) were randomly selected as the training data set of the BP network and the remaining 254 regions (25%) were used as the validation data set to confirm the accuracy of the BP model. In actual practice, the process was repeated 20 times to obtain the average accuracy. For each region in the training data set (762 regions), the 18 variables were initially considered as the input of the BP network and the retail sales as its output. However, linear correlation often exists among numerous input factors. Thus, PCA was conducted before the training process to extract the components used as the actual input of the BP network.
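The repeated 75%/25% random split described above can be sketched as follows. The function name, the use of integer region identifiers, and the seeding scheme are assumptions made for illustration; the original experiments were run in MATLAB.

```python
import random


def split_regions(region_ids, train_fraction=0.75, seed=None):
    """Randomly split region identifiers into training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(region_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# 1016 regions with retail shops, split 20 times with different seeds
# (hypothetical seeding) so that accuracy can be averaged over the runs.
region_ids = range(1016)
splits = [split_regions(region_ids, seed=run) for run in range(20)]
train_ids, val_ids = splits[0]
```

Each run yields 762 training regions and 254 validation regions, matching the counts reported in the text.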
The 18 × 1000 matrices were input into MATLAB 2017.
Table 5 shows the results of PCA. The 18 variables could be described by 7 components. Components 1–4 described 35.287%, 24.562%, 16.089%, and 12.172% of the information in the 18 variables, respectively, whereas Components 5–7 described less information. Together, Components 1–4 described more than 88% of the total input information. Thus, the first four components of PCA could be used as the input of the BP neural network.
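The component selection above amounts to accumulating explained-variance shares until a target is reached. The sketch below reproduces that arithmetic with the four percentages reported for Components 1–4; the helper name and the 88% target are illustrative assumptions, and the shares of Components 5–7 are omitted because they are not stated in the text.

```python
def components_needed(explained_pct, threshold_pct):
    """Count leading components needed to reach a cumulative variance target."""
    cumulative = 0.0
    for count, pct in enumerate(explained_pct, start=1):
        cumulative += pct
        if cumulative >= threshold_pct:
            return count, cumulative
    # Target not reached: report everything that was supplied.
    return len(explained_pct), cumulative


# Shares reported for Components 1-4 (in percent).
first_four = [35.287, 24.562, 16.089, 12.172]
count, cumulative = components_needed(first_four, threshold_pct=88.0)
```

The cumulative share of the first four components is 88.110%, which corresponds to an information loss of 11.890%.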
Table 6 shows the relationship between each component and input variables. Component 1 shows a high positive relationship with the variables Financial facilities (0.675), Catering facilities (0.658), Sports facilities (0.482), Life service facilities (0.474), Shopping malls (0.455), and Population (0.455); however, Component 1 shows an obvious negative relationship with Government agencies (−0.485). Thus, this component includes business information. Component 2 shows a high positive relationship with Catering facilities (0.658), Communal facilities (0.58), Traffic networks (0.574), Attractions (0.522), Population (0.419), and Medical facilities (0.394); however, Component 2 shows a high negative relationship with Shopping malls (−0.516). Thus, this component includes information about the public service facilities of Guiyang. Component 3 shows a high positive relationship with Residential quarters (0.595) and a high negative relationship with Factories (−0.473), Catering facilities (−0.380), and Attractions (−0.383). Component 4 contains little information and mainly describes Hotels (0.636) and Financial facilities (0.505).
The 18 variables were represented with four main components via PCA, and the information loss was less than 12%. Each component could be calculated by the results in
Table 6. The sample matrices were converted from 18 × 1000 to 4 × 1000 and then fed into the BP neural network for training.
5.2. Spatial Accessibility Estimation
The estimation of spatial accessibility in each region is a prerequisite step of site selection. We used the method in
Section 4.5 to estimate the spatial accessibility of each grid. The regions with different spatial accessibility were distinguished through K-means clustering algorithms [
32]. In a K-means algorithm, the number of clusters k is set before implementing the algorithm; the silhouette coefficient is often used to evaluate different k values [
34]. We evaluated the silhouette coefficient of the k value ranging from 2 to 8, and k was set as 3 when the silhouette coefficient reached the highest value (0.834), as shown in
Figure 4.
After determining the value of k, the regions were classified through K-means algorithm as follows: high spatial accessibility (0.753–0.914), moderate spatial accessibility (0.458–0.752), and low spatial accessibility (0.328–0.457).
Figure 5 shows the classification results in a small area in Guiyang.
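The choice of k by silhouette coefficient can be sketched as below. This is a simplified, dependency-free illustration: it clusters one-dimensional scores, starts the centres at evenly spaced quantiles, and uses absolute difference as the distance. The accessibility scores are made up for the example and are not the Guiyang data, and only k = 2 to 5 is tested because the toy sample is small.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on scalars; centres start at evenly spaced quantiles."""
    ordered = sorted(values)
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        # Move each centre to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels


def silhouette(values, labels):
    """Mean silhouette coefficient using absolute difference as the distance."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(sum(abs(v - u) for u in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up accessibility scores forming three loose groups.
scores = [0.35, 0.38, 0.41, 0.55, 0.60, 0.65, 0.80, 0.85, 0.90]
best_k = max(range(2, 6), key=lambda k: silhouette(scores, kmeans_1d(scores, k)))
```

On this toy sample the silhouette coefficient peaks at k = 3, mirroring the selection logic described in the text.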
In
Figure 5, the black points denote the web check-in data in the nine grids. The retail shops in the grids are represented by green shopping carts. The retail shops are mainly distributed near the roads and show a clustering trend. The check-in points are also distributed near the main roads, which reflects people’s travel habits. The figure also shows an imbalanced distribution of retail shops and population: the shops are mainly located in the north and east of the region, whereas the population is mainly distributed in the south.
Three colors show the estimation results. The blue grids show that the spatial accessibility in the regions is relatively high (0.753–0.914); thus, the average distance between activity centers and retail shops is short, and travel convenience in these regions is extremely high. In these regions, the current shops effectively satisfy consumer demand. The red grids show that the spatial accessibility in the regions is low (0.328–0.457). In these regions, consumption convenience is low; thus, the current retail shops cannot satisfy the consumer needs of the grids because of their long distance or small scale. This situation is mainly caused by the rapid development of the city. As the city grows, new activity centers, such as squares and residential quarters, continuously emerge; however, new shops and malls do not follow in time. Thus, business managers and city planners must urgently find the underserved regions and evaluate their market potential to provide strategies for the site selection of new business facilities. The gray grids indicate that the spatial accessibility in the regions is in the middle range (0.458–0.752). In these regions, consumer demand can be satisfied to a certain extent. The nearby retail shops are neither far from nor near the residents. In these regions, the current state should be maintained or several small shops should be opened.
The estimation results of the spatial accessibility of the entire city of Guiyang are shown in
Table 7. The table shows that almost 36% of the regions do not possess high spatial accessibility. Thus, we considered these regions for the site selection strategies. However, the market potential of these regions was different. Certain regions may indicate high market potential, whereas some may indicate low potential. To determine the market potential in these regions, we used the PCA–BP model.
5.3. Training via BP Neural Network
Business site selection aims to find locations with high market potential and less competition with other shops. After the PCA of the original variables, the first four components were fed into the BP neural network to model the relationship between regional demand and the socioeconomic factors represented by the four components. Experimental models were implemented with MATLAB 2017 and deep learning tools. In the training process, the four component values of each sample region were set as the input and the sales were set as the output. After the weights were trained, the model was applied to the test data set. As shown in
Figure 5, the regions in the red grids are areas with low spatial accessibility; business competition in these regions is weak. Thus, places in the red grids that also have high market potential are the best location choices for retail shops. We applied PCA–BP to the 347 red grids to estimate the market potential of these regions; the results are shown in
Figure 6.
The K-means algorithm was used to distinguish the regions based on their market potential, similar to the classification process for spatial accessibility. K-means clustering divided the regions into three classes. The market potential of each class was as follows: high potential (>$11,426/month), moderate potential ($4856–11,426/month), and low potential (<$4856/month). The grids were shown in three colors, namely, red, orange, and transparent, based on the estimated consumer demand in the regions (Figure 6). Site 1, in the red grids, comprises regions with high consumer demand (>$11,426/month), as estimated by the PCA–BP neural network. These regions are mainly distributed in the city centers. In these regions, spatial accessibility is low but the potential market demand is high, thereby making these regions the best site choice for retail managers to open one or more new shops. Site 2, in the orange grids, comprises regions with medium consumer demand ($4856–11,426/month), as estimated by the PCA–BP neural network. These regions are mainly distributed near the city centers and can be considered for new retail shops; however, their market potential is lower than that of Site 1. Site 3, in the transparent grids, represents the regions with low market potential (<$4856/month). These regions are mainly distributed in the urban areas. Thus, managers should avoid opening shops in these regions.
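Once the class boundaries are known, assigning a grid to a site class is a simple threshold check. The sketch below uses the two boundaries reported above ($4856 and $11,426 per month); the function name, the treatment of values exactly on a boundary, and the sample estimates are assumptions for illustration.

```python
def potential_class(monthly_sales_usd, low_cut=4856.0, high_cut=11426.0):
    """Map an estimated monthly sales value to one of the three site classes."""
    if monthly_sales_usd > high_cut:
        return "Site 1 (high)"
    if monthly_sales_usd >= low_cut:
        return "Site 2 (moderate)"
    return "Site 3 (low)"


# Hypothetical model estimates for three grids (USD per month).
estimates = [13250.0, 7400.0, 2100.0]
classes = [potential_class(value) for value in estimates]
```

Grids labelled Site 1 would then be the candidates for new shops, as in the discussion above.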
Table 8 shows the number and market potential of each site.
Table 8 indicates that 42 grids belong to Site 1, which are the best locations for retail shops. The number of grids of Sites 2 and 3 are 48 and 257, respectively. Thus, the 42 regions can be recommended for retail managers to open new retail shops. Each of the regions consisted of one or two blocks. The specific locations for new retail shops in these grids can consider the street centers, rent cost, or other orographic factors. The spatial accessibility we used can also be utilized in determining a location with high accessibility. The recommended locations with the highest spatial accessibility were determined in this study, and
Figure 7 shows the 42 recommended locations for retail managers.
The red points in
Figure 7 represent the recommended sites for new retail shops, and the spatial accessibility of these grids would be highest with the recommended new shops sites. The points we recommended could provide a reference for business managers. In practical site selection, their strategies could also be adjusted based on highly complex factors.
5.4. Accuracy Analysis of the Model
The accuracy results of one trained BP neural network are shown in
Figure 8. From the figure, the correlation coefficient of the training data is 0.993, and the correlation coefficients for validation and test data are 0.991 and 0.979, respectively. Given that the error of the model was lower than the acceptable threshold of 5%, the results indicated the high accuracy of the proposed PCA–BP neural network.
During the training of a BP network, the weights of the network were determined through hundreds of rounds of adjustment until they reached the required accuracy. In each round, the whole data set passes through the network and the weights are adjusted according to the output of the previous round; each such round is called an ‘epoch’. Therefore, the number of epochs is often used to evaluate the convergence speed or training cost of a BP network [
69]. The number of epochs could be one of the main factors influencing accuracy; thus, we compared the accuracy of the model under different numbers of epochs. The training and test accuracies diverge as the number of epochs increases (
Figure 9). When the number of epochs was 1, the training and test accuracies were only 0.75. As the number of epochs increased, both accuracies increased rapidly until the number of epochs was close to 320. Beyond 320 epochs, the training accuracy increased slowly, and the test accuracy remained almost the same at approximately 92.8%. More epochs also require more computation time for training. Therefore, the number of epochs selected for the PCA–BP model was 320.
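The epoch choice described above follows a plateau rule: stop adding epochs once test accuracy no longer improves meaningfully. A minimal sketch of such a rule is given below. The tolerance and most of the accuracy history are hypothetical; only the values 0.75 at 1 epoch and roughly 0.928 at 320 epochs come from the text.

```python
def pick_epoch(history, tolerance=0.002):
    """Return the first epoch after which test accuracy gains stay below tolerance.

    history: list of (epoch, test_accuracy) pairs sorted by epoch.
    """
    for i in range(len(history) - 1):
        epoch, acc = history[i]
        best_later = max(a for _, a in history[i + 1:])
        if best_later - acc < tolerance:
            return epoch
    # No plateau found: fall back to the last epoch count tried.
    return history[-1][0]


# Mostly hypothetical accuracy history for illustration.
history = [(1, 0.75), (80, 0.85), (160, 0.90), (320, 0.928), (480, 0.928), (640, 0.9285)]
chosen = pick_epoch(history)
```

With this history the rule selects 320 epochs, because later runs improve test accuracy by less than the tolerance while costing more computation.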
After the training and test process under 320 epochs, the accuracy and results of the PCA–BP model were calculated. The RMSE via the PCA–BP model was 0.065. The results showed that the proposed model could effectively pass through the cross-validation. To verify the accuracy and advantages of the PCA–BP model, we compared it with other regression models.
The comparison between the proposed PCA–BP method and several other methods that are often used in estimation and prediction tasks is shown in
Table 9. The RMSEs of the decision tree and OLS were 0.301 and 0.162, respectively. The accuracies of the two methods were relatively lower than the deep learning method. The accuracy of PCA–BP was higher than the BP method because some redundant information was removed. Thus, the PCA–BP method could be used for the estimation of market potential with socioeconomic factors.
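For reference, the RMSE used in this comparison is the square root of the mean squared difference between actual and predicted values. The sketch below assumes sales values normalized to a common scale, which is an assumption suggested by the magnitude of the reported RMSEs rather than something stated in the text; the sample values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up normalized sales values for three regions.
actual = [0.20, 0.50, 0.80]
predicted = [0.25, 0.45, 0.90]
error = rmse(actual, predicted)
```

A lower RMSE means that the predicted sales are, on average, closer to the observed sales.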
The results showed the high accuracy of the PCA–BP model. In this study, the PCA–BP model was combined with the spatial accessibility model for the site recommendation of retail shops. In the hybrid model, the spatial accessibility model was used to estimate the competition and urgency in different regions. The proposed PCA–BP model was used in estimating the market potential of these regions to determine places with high market demand. In this way, the best site locations could be recommended for business managers. In actual practice, the site selection issue was affected by 12 socioeconomic factors. In different cities, the factors still indicated a significant difference. The proposed PCA–BP model indicated high expandability, which was suitable for various situations. Moreover, the spatial accessibility model was added to evaluate regional competitions. The combination made the site selection potentially convincing.
6. Conclusions
Given the increase in consumer income, the retail industry in China has developed rapidly in recent years. However, the site selection for retail shops has been a difficult problem in practical business decisions. Moreover, research focusing on site selection for small retail shops is limited. To fill this research gap, a hybrid model was proposed in this study for convincing business site selection.
In this study, a spatial accessibility estimation model was proposed to evaluate the average spatial accessibility to retail shops in each region. By combining BP neural network with PCA, we estimated the market potential of the regions with low spatial accessibility via web check-in and POI data. The hybrid two-step model was applied in Guiyang, China. Through the experiment, the regions with low spatial accessibility were divided into three classes of sites for retail managers: those with high market potential, which would be the best choice for new retail shops; those with medium market potential; and those with low market potential, which should be avoided by business managers when they open new retail shops. The main contributions of this study are as follows.
- (1) The study provides a new method for business site selection, which fills the gap in the site selection for small retail shops. The two-step model, including the spatial accessibility estimation process with the gravity model and the market potential evaluation process with the PCA–BP model, makes the site selection convincing and close to reality.
- (2) Traditional research has mainly focused on the spatial distance in site selection problems. In the present study, additional socioeconomic factors were considered, such as POI data and road networks. The information on consumer groups was also obtained via social media data (web check-in data). Moreover, the actual locations and historical sales of retail shops were used. These complex data sources support accurate analysis. The proposed hybrid model is highly extensible, thereby enabling its use in other cities, where additional data sources could also be considered.
- (3) The study also presents an improved gravity model in the micro scale and represents a new application of geographic methods in solving business problems.
In this study, we mainly used static socioeconomic and retail sales data. The changes in sales performance over time were not considered. In future work, we will consider the temporal changes in sales performance and determine the spatiotemporal relationship between socioeconomic data and regional market potential. We will also add further factors, such as road connectivity, weather, and purchasing ability, to each geographic cell for more accurate and precise results.