Prediction model of artificial neural network for the risk of hyperuricemia incorporating dietary risk factors in a Chinese adult study

Background: Risk of hyperuricemia (HU) has been shown to be strongly associated with dietary factors. However, there is scarce evidence on prediction models incorporating dietary factors to estimate the risk of HU.
Objective: The aim of this study was to develop a model to predict the risk of HU in Chinese adults based on dietary information.
Design: Our study was based on a cross-sectional survey, which recruited 1,488 community residents aged 18 to 60 years in Beijing from October 2010 to January 2011. The eligible participants were randomly divided into a training set (n1 = 992) and a validation set (n2 = 496) in the ratio of 2:1. We developed the prediction model in three stages. We first used a logistic regression model (LRM) based on the training set to select a set of dietary risk factors related to the risk of HU. An artificial neural network (ANN) was then used to construct the prediction model using the training set. Finally, we used receiver operating characteristic (ROC) curve analysis to assess the accuracy of the prediction model using the training and validation sets.
Results: In the training set, the mean age of participants with and without HU was 39.3 (standard deviation [SD]: 9.65) and 38.2 (SD: 9.38) years, respectively. Participants with HU consisted of 101 males (77.7%) and 29 females (22.3%). The LRM found that food frequency (vegetables [odds ratio (OR) = 0.73], meat [OR = 0.72], eggs [OR = 0.80], plant oil [OR = 0.78], tea [OR = 0.51]), eating habits (breakfast [OR = 1.28]), and the salty cooking style (OR = 1.33) were associated with risk of HU. In the ANN analysis, we selected a three-layer back-propagation neural network (BPNN) model with 14, 3, and 1 neurons in the input, hidden, and output layers, respectively, as the best prediction model. The areas under the ROC curve for the training and validation sets were 0.827 and 0.814, respectively. Individuals were classified as having HU when the predicted probability exceeded 0.128. The accuracy, sensitivity, specificity, and Youden index suggested that the ANN model in our study is successful and valuable.
Conclusions: This study suggests that the ANN model could be used to predict the risk of HU in Chinese adults. Further prospective studies are needed to improve the accuracy and to generalize the use of the model.


Hyperuricemia (HU), which is defined as serum uric acid (SUA) > 420 µmol/L for men and > 360 µmol/L for women (1), is a major cause of disability because of its high prevalence globally (2). In 2011, the prevalence rate of HU ranged from 2.6 to 36% in different populations worldwide (3). In China, with economic development and lifestyle changes, the prevalence of HU has increased rapidly. By the year 2010, the prevalence reached 13.7% (21% in males and 7.9% in females) in northern and northeastern Chinese provinces (4,5). Emerging evidence has indicated that HU could increase the risk of gout, hypertension, cardiovascular disease, diabetes, and chronic kidney diseases (5,6).
Epidemiologic evidence has indicated that the incidence of HU is strongly related to dietary factors (7,8). For example, intake of meat, seafood, alcohol, and sugar-sweetened beverages is positively correlated with the risk of HU, while intake of fruits is correlated with a reduced risk of HU (9). However, research on the association between dietary patterns and risk of HU, which considers the complicated interaction of various food groups, has produced conflicting results. In China, a cross-sectional study with 374 participants demonstrated that the 'animal products and fried foods' dietary pattern was associated with a higher prevalence of HU, while the 'soybean products and fruit' pattern was associated with a lower prevalence (10). Another cross-sectional study with 266 participants found no significant association between dietary patterns and uric acid levels (11). Moreover, in previous studies, potential confounders such as age, gender, education, ethnicity, and body mass index (BMI) were not adjusted for properly (12). Therefore, it is necessary to identify the association between dietary factors and HU in order to establish the prediction model.
In the development of HU, complex combinations of foods interact through non-linear biological mechanisms, which probably call for a special mathematical approach such as artificial neural networks (ANNs) (13), because traditional statistical prediction and classification methods (such as linear regression models [LR] and logistic regression models [LRM]) have difficulty handling collinearity. Some previous studies have presented evidence that ANNs were more powerful than most traditional statistical prediction methods (14,15), but no studies have investigated the ability of ANNs to predict the risk of HU incorporating dietary risk factors in China.
Therefore, the aim of this study is to illustrate the potential usefulness of artificial intelligence, particularly ANN, in predicting the risk of HU vis-à-vis dietary factors. This model can be used as a preliminary screening tool to evaluate associations between HU and dietary risk factors, which would help with health management and prevention of HU in adults.

The subjects
Our study was based on a cross-sectional study that recruited 1,565 adults aged 18-61 years. The participants were enrolled randomly at a Beijing community hospital according to computer-generated random numbers for health checkup between October 2010 and January 2011. Participants with previously diagnosed gout were excluded. The respondents who did not complete at least 80% of the food frequency questionnaire (FFQ) (n = 43) and those having >20% missing SUA measures (n = 34) were excluded from analysis. Finally, 1,488 participants were included for analysis. The selection process of participants in this study is shown in Fig. 1.

Data collection
On the day of recruitment, all subjects had a face-to-face interview with well-trained interviewers using two questionnaires. The questionnaires were designed by experts from our research team and were modified and validated in a pilot study. The first questionnaire included sociodemographic characteristics, smoking, and drinking status. The second questionnaire was a semi-quantitative FFQ. It comprised the following 11 food groups: cereals, fruits, vegetables, meat, seafood, eggs, dairy products, legumes, plant oil, animal oil, and tea, and it was developed based on the Dietary Guidelines for Chinese People (2016) (16). Participants were required to select the category that best applied to them over the past 3 months: consumed rarely, once to three times a month, once a week, twice to four times a week, five to six times a week, once a day, twice or thrice a day, or four times a day or more. The categories were subsequently classified into either a 'low' group or a 'high' group according to the Dietary Guidelines (16). Additionally, the FFQ contained five questions on eating habits (breakfast frequency, midnight snack frequency, meal time regularity, dining out frequency, and snacking frequency) and three questions on cooking styles (the amount [grams per person per day] of sugar, oil, and salt used when preparing or cooking food). The categories were subsequently classified into either a 'regular' group or a 'non-regular' group according to the Dietary Guidelines (16). Serum uric acid levels were measured from venous blood collected in the morning after at least 8 h of fasting. Basic measurements (weight, height, waist circumference [WC], and blood pressure) were taken by trained interviewers using a standard protocol and validated equipment on the day of the interview. Body mass index was calculated by the following formula: $\mathrm{BMI}\ (\mathrm{kg/m^2}) = \mathrm{weight} / \mathrm{height}^2$.

Statistical analysis
All selected participants (n = 1,488) were randomly divided into two sets: a training set (n1 = 992) and a validation set (n2 = 496) in the ratio of 2:1. The process was based on deep learning of ANN for proportional division (17,18). Differences in participants' characteristics between the two sets were compared using the χ² test for categorical variables and the t-test for continuous variables. We used the training set to select variables and establish the predictive ANN model. Then we tested and evaluated the ANN model using the validation set. All variable values were normalized to the range of 0 to 1. The binary variables were coded as 0 and 1, meaning 'No' and 'Yes', respectively. Nonbinary variables were normalized as $X'_m = (X_m - X_{\min}) / (X_{\max} - X_{\min})$. Continuous variables were reported as mean values (standard deviation [SD]), and categorical variables were expressed as frequencies and percentages.
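The split and normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the authors worked in R, and the function names `min_max_normalize` and `split_two_to_one`, the seed, and the toy values are hypothetical.

```python
import random


def min_max_normalize(values):
    """Rescale numbers to [0, 1] using X' = (X - X_min) / (X_max - X_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0 (assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split_two_to_one(records, seed=0):
    """Shuffle the records and split them 2:1 into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * 2 / 3)
    return shuffled[:n_train], shuffled[n_train:]


# Toy ages (hypothetical values, not study data).
ages = [18, 39, 60]
scaled_ages = min_max_normalize(ages)

# Participant indices stand in for the 1,488 analysed records.
train_ids, valid_ids = split_two_to_one(range(1488))
print(scaled_ages, len(train_ids), len(valid_ids))
```

With 1,488 records, the 2:1 rule gives the 992 and 496 participants reported in the text.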
The model analysis was divided into three stages. In the first stage, a set of predictors of HU risk was identified by logistic regression analysis using the training set. Univariate binary LRM was used to evaluate the correlation between dietary risk factors and HU. Odds ratios (ORs) and 95% confidence intervals (CIs) were used to quantify this relationship. In the second stage, we developed an ANN model for predicting HU risk using the significant predictors identified in the first stage as inputs. Risk factors with a P-value < 0.20 in the univariate logistic models were retained as candidate inputs so that all possible predictors were captured (19). An ANN is a mathematical or computational model that attempts to simulate the structure or function of a biological neural network (20). It is a non-linear statistical data modeling tool that can be used to model complex relationships between inputs and outputs. The network generally consists of three layers: an input layer to receive information, a hidden layer to process information, and an output layer to calculate the response. There are different types of neural networks, including the feed-forward neural network, the radial basis function (RBF) network, and the Kohonen self-organizing network (21,22). The feed-forward neural networks, which include the back-propagation (BP) delta rule network and other networks, are the earliest and simplest ANNs, and the BP network is the most popular choice because of its relative simplicity and stability. Therefore, in this study, we used a BP network for the analyses.
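For a single binary predictor, the OR from a univariate logistic regression equals the cross-product ratio of the 2x2 table, and its Wald 95% CI equals the log-scale (Woolf) interval. The following minimal sketch shows that calculation; the function name `odds_ratio_ci` and the cell counts are hypothetical and are not taken from the study.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: exposed with HU,   b: exposed without HU,
    c: unexposed with HU, d: unexposed without HU.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for a 'high' versus 'low' intake group.
or_value, ci_low, ci_high = odds_ratio_ci(a=20, b=80, c=40, d=60)
print(f"OR = {or_value:.3f}, 95% CI: {ci_low:.2f}-{ci_high:.2f}")
```

An OR below 1 with an interval that excludes 1, as in this toy table, would indicate a negative association of the kind reported for several food groups.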
In the third stage, we assessed the performance of the risk evaluation model (using the training and validation sets) with accuracy, sensitivity (Se), specificity (Sp), the Youden index, and ROC curve analysis to evaluate the model's discriminatory ability. Accuracy is the percentage of correctly classified subjects. Sensitivity is the proportion of subjects with the target condition who give positive test results. Specificity is the proportion of subjects without the target condition who give negative test results. ROC curves plot the true-positive rate against the false-positive rate across a range of cut-offs and support selection of the optimal cut-off for clinical use. The Youden index is the sum of Se and Sp minus one (Se + Sp - 1) (23). Unless specified, we used a significance level of 0.05 for all analyses. All analyses were performed using R version 3.5.3.
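The evaluation measures defined above can be computed directly from predicted probabilities. The sketch below assumes that a subject is classified as positive when the predicted probability is strictly greater than the cut-off, and it picks the cut-off that maximizes the Youden index. The function names and the toy labels and probabilities are hypothetical.

```python
def classification_metrics(y_true, y_prob, cutoff):
    """Accuracy, sensitivity, specificity, and Youden index at one cut-off."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p > cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": se,
        "specificity": sp,
        "youden": se + sp - 1,
    }


def auc(y_true, y_prob):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def best_cutoff(y_true, y_prob):
    """Cut-off among the observed probabilities with the largest Youden index."""
    return max(
        sorted(set(y_prob)),
        key=lambda c: classification_metrics(y_true, y_prob, c)["youden"],
    )


# Toy data (hypothetical): 1 = HU, 0 = no HU.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.60]

cutoff = best_cutoff(y_true, y_prob)
result = classification_metrics(y_true, y_prob, cutoff)
print(cutoff, result, auc(y_true, y_prob))
```

On an imbalanced outcome the Youden-optimal cut-off can sit well below 0.5, which is in line with the low cut-off values reported for this model.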

Description of the participants
Supplementary Table S1 shows that the mean age of participants was over 37 years, with a mean age of 37.7 (9.6) years in the training set and 37.7 (9.8) years in the validation set. There were 521 males and 471 females (47.5%) in the training set, and 256 males and 240 females (48.4%) in the validation set. Differences between the training set and the validation set were neither substantial nor statistically significant; participants in the two data sets had similar characteristics (P > 0.05). Table 1 shows that the mean age in the training set was 39.3 (9.7) years for participants with HU and 38.2 (9.4) years for participants without HU. Participants with HU included 101 males (77.7%) and 29 females (22.3%). We found that gender, age, smoking status, drinking status, SUA, BMI, WC, systolic blood pressure (SBP), diastolic blood pressure (DBP), vegetable frequency, meat frequency, eggs frequency, plant oil frequency, tea frequency, breakfast frequency, and the salty cooking style differed significantly between participants with and without HU in the training data set (P < 0.05). Table 2 shows the predictors of HU risk identified from logistic regression analysis based on the training set (n1 = 992). From univariate LRM, we found significantly negative relationships between HU risk and the following factors: females (OR = 0.27, 95% CI: 0.18-0.42
Table footnotes: (e, j) Low = less than or equal to four times a week, high = more than four times a week. (f, g, h, k) Low = less than or equal to six times a week, high = more than six times a week. (l) Always = more than five times a week, rarely = less than five times a week. (m, n) Always = more than twice a week, rarely = less than twice a week. (o) Always = more than six times a week, rarely = less than six times a week. (p, q) Low = less than 13 g per person per day, high = more than 13 g per person per day. (r) Low = less than 45 g per person per day, high = more than 45 g per person per day.

Prediction models
We built the ANN model on the basis of the predictors of HU risk resulting from logistic regression analysis. The following predictors were used as model inputs: gender, age, smoking status, drinking status, BMI, SBP, DBP, vegetable frequency, meat frequency, eggs frequency, plant oil frequency, tea frequency, breakfast frequency, and the salty cooking style. The binary variable of whether an individual had HU was the output variable. In this analysis, the structure of the BP neural network included three layers (Fig. 2). Parameters were selected according to previous related studies (14,15). Training parameters such as learning rate and momentum were set at their default values. The training function was based on the Levenberg-Marquardt algorithm. The neural network was trained for 100 epochs. Dropping out 20% of input units and 50% of hidden units was often found to be optimal; it is a simple way to prevent neural networks from over-fitting (24). Each data point was weighted based on its outcome ratio to ensure that the output was not heavily skewed toward the dominant class. There were 14 neurons in the input layer, three neurons in the hidden layer, and one neuron in the output layer of the ANN model, corresponding to the forecast variable (that is, the probability of having HU). Figure 3 summarizes the areas under the ROC curves obtained from the training and validation sets of the ANN model. The area under the ROC curve (AUC) was 0.827 for the training set and 0.814 for the validation set. Thus, a well-trained ANN model could discriminate between individuals with and without HU, with high accuracy and a large AUC. The cut-off incidence probability values of HU were 0.128 for the training set and 0.146 for the validation set. That is, an individual is classified as having HU if the predicted probability is greater than 0.128. Table 3
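To make the 14-3-1 architecture concrete, the sketch below implements a three-layer back-propagation network with class-weighted training. It is not the authors' implementation: it swaps the Levenberg-Marquardt algorithm for plain stochastic gradient descent, omits dropout, assumes class weights inversely proportional to class frequency, and uses synthetic data. The class `TinyBPNN`, its methods, and all parameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyBPNN:
    """Feed-forward net: n_in inputs, n_hidden sigmoid units, 1 sigmoid output."""

    def __init__(self, n_in=14, n_hidden=3, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        """Return the hidden activations and the predicted probability."""
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        p = sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)
        return h, p

    def train_step(self, x, y, weight=1.0, lr=0.1):
        """One back-propagation update on a weighted cross-entropy loss."""
        h, p = self.forward(x)
        delta_out = weight * (p - y)  # gradient at the output pre-activation
        for j, hj in enumerate(h):
            # Propagate the error back through hidden unit j before updating w2[j].
            delta_h = delta_out * self.w2[j] * hj * (1.0 - hj)
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * delta_h * xi
            self.b1[j] -= lr * delta_h
            self.w2[j] -= lr * delta_out * hj
        self.b2 -= lr * delta_out


def weighted_log_loss(net, X, y, w):
    total = 0.0
    for xi, yi, wi in zip(X, y, w):
        p = min(max(net.forward(xi)[1], 1e-12), 1.0 - 1e-12)
        total -= wi * (yi * math.log(p) + (1 - yi) * math.log(1.0 - p))
    return total / sum(w)


# Synthetic data: 14 inputs already scaled to [0, 1]; the minority class is 1.
rng = random.Random(1)
X = [[rng.random() for _ in range(14)] for _ in range(200)]
y = [1 if row[0] > 0.7 else 0 for row in X]

# Class weights inversely proportional to class frequency (assumption).
n_pos = sum(y)
n_neg = len(y) - n_pos
weights = [len(y) / (2 * n_pos) if yi == 1 else len(y) / (2 * n_neg) for yi in y]

net = TinyBPNN()
loss_before = weighted_log_loss(net, X, y, weights)
for _ in range(100):  # 100 epochs, as in the text
    for xi, yi, wi in zip(X, y, weights):
        net.train_step(xi, yi, weight=wi)
loss_after = weighted_log_loss(net, X, y, weights)
print(loss_before, loss_after)
```

The weighting makes each class contribute equally to the loss, which is one way to keep the output from being dominated by the majority class.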

Discussion
To our knowledge, this is the first study to develop an ANN model for predicting HU based on dietary risk factors; the use of ANNs in the medical field is still emerging. In our study, the ANN model improved the predictive accuracy for HU. In this analysis, we developed an HU prediction model involving 14 significant predictors. For the training and validation sets, the AUCs of the ANN model were 0.827 and 0.814, respectively. The cut-off incidence probability values of HU were 0.128 for the training set and 0.146 for the validation set; that is, an individual was classified as having HU when the predicted probability was greater than 0.128. Furthermore, we did a post-hoc analysis and found that the AUCs based on ANN and LRM were 0.827 and 0.758, respectively (P < 0.05). A previous study by Jahandideh et al. (25) found that an ANN model was a better technique than LRM for predicting the presence of coronary artery disease, which is similar to our finding. More importantly, the accuracy, Se, Sp, and Youden index suggested that the ANN model in our study is successful and valuable. ANN is a systematic tool with great potential for clinical decision support; it can produce a specific predicted value for each patient according to their risk factors. The ability to provide targeted prediction is the most obvious strength of ANN compared with traditional statistical analysis methods.
The application of the ANN model is of great significance in public health. It could be used as a preliminary screening tool to identify individuals at high risk of HU based on their dietary factors; it could also guide prevention strategies in the clinic. The predictors included in the model are common and readily available, and can be assessed easily and non-invasively. Moreover, this model could be applied to the general population as well. The model can easily be implemented as a computer program: once the 14 predictors are entered, the program automatically calculates the risk probability of HU. Owing to its simplicity, this could be more efficient than traditional diagnostic techniques, which are more expensive and complex. However, whether the forecast probability could be applied to a particular individual requires further exploration.
Our study found that dietary factors are major predictors of HU. Positive relationships were found between HU and the following factors: current smoking, current drinking, BMI, SBP, DBP, and the salty cooking style, while negative relationships were found between HU risk and the following factors: female gender, age, vegetable frequency, meat frequency, eggs frequency, plant oil frequency, tea frequency, and breakfast frequency. In our analysis, higher tea frequency as an independent factor was associated with decreased risk of HU, which was consistent with a Chinese epidemiologic study (26). The explanation could be that the caffeine found in tea protects against increasing SUA because of its diuretic and antioxidative properties (27,28). However, meat and egg intake was inversely associated with the prevalence of HU, which was not consistent with a previous Chinese study (8). This may vary with different regions and populations.
Some studies have indicated that HU is affected prominently by lifestyle factors such as dietary habits and by demographic characteristics (29-32). For instance, some previous studies have found that high intake of salt was strongly related to HU (33,34). Several metabolic experiments in both animals and humans have demonstrated that a high purine load increases the serum urate level (31,35,36). Moreover, people who skip breakfast may be hungry and eat large amounts of foods rich in protein and purines for lunch, which would accelerate the occurrence of HU (37). Smoking increases carbon monoxide in the blood, forming carboxyhemoglobin, which could induce erythrocytosis because of inadequate oxygenation in the blood circulation (38). An increased total red blood cell count may lead to a large number of red blood cells being destroyed, which would promote purine metabolism and overproduction of uric acid (39).
Our study has several strengths. First, the data are of high quality because these were collected by well-trained investigators using validated questionnaires and equipment. Second, our study used ANN model based on dietary factors to predict the incidence of HU, which is a novel method seldom used in previous studies. The accuracy of the prediction was improved by using ANN, compared with traditional prediction models. Third, the model could be used to predict HU risk by dietary factors, which is non-invasive and easier to assess compared with SUA examination. It could potentially improve the prevention of HU, especially in poor areas where the medical service is insufficient.
There are some limitations to our study. First, the sample was limited geographically and ethnically, so the results should be generalized with caution. Second, as a cross-sectional survey, the study was unable to verify causality or the temporal relationship between diet and HU. Third, there may be some potential bias in the models that would influence our findings. Furthermore, ANN models may be less practical for use in the clinic, as they are more complex than LRM and LR models (40,41) and require a stronger statistical background from the researcher. Last but not least, the FFQ may inevitably introduce recall bias, which would result in non-differential misclassification and bias associations toward the null (42). Despite these limitations, we developed an accurate risk prediction model that estimates the combined impact of an individual's dietary factors on HU.

Conclusions
In conclusion, our study found that food frequency, eating habits, and cooking styles were associated with HU risk in Chinese adults. The results showed that the ANN model could be used to improve the predictive accuracy for HU. An individual was classified as having HU if the predicted probability was greater than 0.128. Further prospective studies are needed to confirm the findings and to validate our model for predicting HU risk in adults.