Development of Prediction Model for Diabetes
Paper Type: Free Essay | Subject: Sciences | Wordcount: 3025 words | Published: 8th Feb 2020
Abstract
Preventing diabetes is an ongoing area of interest to the healthcare community. Although several factors are thought to contribute to diabetes, identifying the most important of these factors would give a better understanding of the problem. The availability of large amounts of medical data creates a need for powerful mining tools to help healthcare professionals diagnose diabetes. The use of data mining techniques in the diagnosis of diabetes has been investigated comprehensively and has shown acceptable levels of accuracy. In this paper we analyse a diabetes data set and derive from it some interesting facts that can be used to develop a prediction model.
Keywords: Statistical Analysis, Data mining, Classification, Prediction.
1 Introduction
Diabetes is the epidemic of the 21st century and the biggest challenge confronting Australia’s health system. Based on data from the 2018 National Diabetes Services Scheme (NDSS) Fact Sheet, diabetes affects an estimated 1.7 million people in Australia. This includes all types of diagnosed diabetes (1.2 million known and registered) as well as silent, undiagnosed type 2 diabetes (up to 500,000 estimated). According to a study by the World Health Organization (WHO), the number of people with diabetes will rise to 552 million by 2030, meaning that one in ten adults will have diabetes if no serious measures are taken.
In this paper we will try to predict the presence of diabetes based on some relevant covariates. Bayesian binary regression models, particularly the Bayesian logistic regression model, will be chosen as the basis on which prediction is made. We are interested in looking at the classification problem from a Bayesian perspective. Consequently, we do not intend to focus on model selection or interpretation. Our analysis will focus on how different choices of prior for the parameters of the logistic regression model affect the prediction error rate.
2 Research Framework
Fig 1 shows the architecture of the proposed system, designed to answer the research question of classifying an individual as diabetic or not based on features such as Age, Blood Pressure, BMI, Diabetes Pedigree Function, Glucose Concentration, Insulin, Pregnancies and Skin Thickness. In view of the problem described above, we provide an in-depth analysis of how a data mining approach can help. First, we examine the raw dataset to understand the relevant data sources, assess data quality and discover interesting facts. The next steps take the initial raw data through processing to the final data sets, ready for model development. We then apply data mining techniques to predict and explain the factors causing diabetes. Finally, we evaluate and assess the models against the research objectives.
3 Dataset Description
The dataset used in this study is from Hospital Frankfurt, Germany, and concerns the diagnosis of diabetes in females over 21 years old. It was collected from the Kaggle Machine Learning Repository and introduced by John Da Silva (John D.S., 2018). With 2,000 observations, the dataset consists of 9 attributes: 8 attribute values and 1 class variable with two outcomes, namely whether the patient tested positive (indicated by 1) or negative (indicated by 0). The dataset has 684 women who were diagnosed with diabetes and 1,316 women who did not have diabetes, so the sample has a high occurrence (34%) of positive diabetes records.
Variable Name | Variable Type | Variable Description
Pregnancies | Integer | Number of pregnancies
Glucose | Integer | Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure | Integer | Diastolic blood pressure (mm Hg)
SkinThickness | Integer | Triceps skin fold thickness (mm)
Insulin | Integer | 2-hour serum insulin (µU/ml)
BMI | Numeric | Body mass index (kg/m2)
DiabetesPedigreeFunction | Numeric | History of diabetes in relatives (genetic influence)
Age | Integer | Age (years)
Outcome | Integer | Occurrence of diabetes (0 or 1)
Table 1: Attributes of Diabetes Data Set
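As a quick sanity check, the class counts stated above can be turned into the reported positive rate. This is a minimal Python sketch (the paper's own analysis is carried out in R); the counts used are exactly those given in the text.

```python
import pandas as pd

# Class counts reported in the text: 1,316 negative, 684 positive.
counts = pd.Series({0: 1316, 1: 684}, name="Outcome")

total = counts.sum()                    # 2,000 observations in all
positive_rate = counts[1] / total       # prevalence of diabetes
print(total, round(positive_rate * 100, 1))   # 2000 34.2
```

This reproduces the "high occurrence (34%)" figure quoted in the dataset description.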
4 Data Processing
Data processing is a machine learning step that converts raw data into a logical and comprehensible format. It involves activities such as data cleaning, data integration, data transformation, data reduction and data discretization. Here the dataset is checked for duplicate values, missing values and type mismatches, and these inconsistencies are eliminated. From the analysis we find abnormal zero values in the variables SkinThickness and Insulin (573 and 956 values respectively), and in total 48.25% of records are affected by missing values. Removing this many records would result in significant information loss. We therefore use a kNN imputation approach to impute the missing data, which contributes to the comprehensibility of the produced classifier and a better understanding of the learned concept. It is very important to clean the dataset before training a classifier so that the hidden patterns in the data can be learned.
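The zero-recoding and kNN imputation step can be sketched as follows. The paper's pipeline is in R; this is an illustrative Python equivalent using scikit-learn's KNNImputer, with a tiny made-up matrix standing in for the SkinThickness and Insulin columns.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny illustrative matrix: columns stand in for SkinThickness, Insulin.
# Zeros are physiologically impossible, so they are treated as missing.
X = np.array([
    [35.0, 130.0],
    [0.0, 140.0],   # zero SkinThickness -> missing
    [29.0, 120.0],
    [32.0, 0.0],    # zero Insulin -> missing
])

X[X == 0.0] = np.nan                 # recode impossible zeros as NaN
imputer = KNNImputer(n_neighbors=2)  # impute from the 2 nearest rows
X_imputed = imputer.fit_transform(X)
print(np.isnan(X_imputed).sum())     # 0 -> no missing values remain
```

Imputing rather than deleting preserves all 2,000 records, avoiding the information loss described above.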
5 Descriptive Statistics
Table 2: Summary of explanatory variables

Statistic                    N      Mean    St. Dev.   Min      Pctl(25)  Pctl(75)   Max
Pregnancies               2,000    3.704     3.306      0        1          6         17
Glucose                   2,000  121.880    30.562     44       99        141        199
BloodPressure             2,000   72.399    12.098     24       64         80        122
SkinThickness             2,000   29.132    10.302      7       21.7       36        110
Insulin                   2,000  148.468    99.391     14.000   78.847    183.250    744.000
BMI                       2,000   32.646     7.201     18.200   27.500     36.800     80.600
DiabetesPedigreeFunction  2,000    0.471     0.324      0.078    0.244      0.624      2.420
Age                       2,000   33.090    11.786     21       24         40         81
Outcome                   2,000   class (0 = 1,316; 1 = 684)
Figure 3 shows boxplots of the explanatory variables, giving an idea of the features of the dataset. We can see that several columns have outliers, with the column Insulin being the most extreme. From the analysis we can infer that the median glucose concentration is higher for patients who have diabetes, while blood pressure and skin thickness show little variation with diabetes status.
From the histogram plots in Figure 4 it is evident that the variables Pregnancies and Age are highly skewed. For the continuous variables, we can get more clarity on the distributions by analysing them against the dependent variable.
In Figure 5 we can see that little or no correlation exists between most of the variables, so the model is not likely to suffer from multicollinearity. Insulin and Glucose, and BMI and SkinThickness, show a moderate linear correlation.
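The correlation screening can be reproduced with a pairwise Pearson matrix. The values below are made-up illustrative data mimicking the moderately correlated pairs noted in the text, not the study's measurements; in the real analysis the matrix is computed over all eight predictors.

```python
import pandas as pd

# Made-up values loosely mimicking the correlated pairs
# (Glucose-Insulin, BMI-SkinThickness) noted in the text.
df = pd.DataFrame({
    "Glucose": [90, 120, 150, 180, 100],
    "Insulin": [60, 110, 160, 220, 90],
    "BMI": [22.0, 28.5, 33.0, 40.1, 25.3],
    "SkinThickness": [18, 25, 31, 39, 22],
})

corr = df.corr()    # pairwise Pearson correlations
print(corr.loc["Glucose", "Insulin"] > 0.5)   # True: a correlated pair
```

Pairs with high off-diagonal values would be candidates for dropping or combining if multicollinearity were a concern.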
6 Model Construction
In this study three popular supervised learning techniques (Logistic Regression, Decision Tree and Support Vector Machine) and one unsupervised learning technique (Principal Component Analysis) are applied and compared based on their predictive accuracy on hold-out samples. The data set is split 80:20 into training and test sets. We use cross-validation for parameter tuning to identify the best model. Model performance is evaluated with an emphasis on sensitivity, i.e. correctly identifying patients with diabetes.
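The 80:20 split and cross-validation setup can be sketched as follows. The study itself is implemented in R; this Python/scikit-learn version uses synthetic data of the same shape (2,000 rows, 8 predictors), so all names and numbers here are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in: 2,000 rows, 8 predictors, like the diabetes set.
X, y = make_classification(n_samples=2000, n_features=8, random_state=42)

# 80:20 train/test split, stratified to keep the class ratio intact
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# 10-fold cross-validation on the training set for parameter assessment
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, cv=10)
print(len(X_train), len(X_test), len(scores))   # 1600 400 10
```

Stratifying the split keeps the roughly 34% positive rate the same in both partitions, which matters when sensitivity is the headline metric.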
Logistic Regression
We use the ‘stepAIC’ function for stepwise selection of statistically significant independent variables, with the objective of minimizing the AIC value. The three most relevant features are “Pregnancies”, “Glucose” and “BMI”, on account of their low p-values.
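R's stepAIC has no direct scikit-learn equivalent, but the idea, i.e. backward elimination that keeps dropping the feature whose removal most reduces AIC, can be sketched by hand. Everything below (the synthetic data and the helper name aic_of) is illustrative, not the paper's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def aic_of(X, y):
    """AIC = 2k - 2*logL for a (nearly) unpenalized logistic fit."""
    model = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
    p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = X.shape[1] + 1          # slope coefficients + intercept
    return 2 * k - 2 * log_lik

X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=1)
features = list(range(X.shape[1]))
best_aic = aic_of(X[:, features], y)

improved = True
while improved and len(features) > 1:
    improved = False
    for f in list(features):
        trial = [g for g in features if g != f]
        trial_aic = aic_of(X[:, trial], y)
        if trial_aic < best_aic:     # dropping f lowers AIC: keep the drop
            best_aic, features, improved = trial_aic, trial, True
            break
print(len(features))                 # size of the reduced model
```

stepAIC in R additionally considers re-adding dropped terms ("both" direction); this sketch shows only the backward pass.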
Logistic Regression – Parameter Tuning
We evaluate the performance of the model using the following parameters:
1) AIC (Akaike Information Criterion): We can compare the AIC of the original model glm_fit1 with that of the model derived by the stepAIC function, glm_fit2. The AIC of glm_fit1 is 1513.9 and that of glm_fit2 is 1508.9; as expected, the model derived by the stepAIC function has the lower AIC value.
2) Confusion Matrix: The test error rate is 23%; in other words, the accuracy is 77%.
3) ROC Curve (Receiver Operating Characteristic) and AUC (Area Under the Curve): From Figure 6 we can see that the AUC value is 0.827, which indicates good predictive power.
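The three evaluation steps above (fit, confusion-matrix accuracy, ROC/AUC) can be mirrored in a compact sketch. The data here are synthetic, so the resulting accuracy and AUC will differ from the paper's 77% and 0.827.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data; the real numbers come from the diabetes model.
X, y = make_classification(n_samples=2000, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

cm = confusion_matrix(y_te, pred)     # 2x2 confusion matrix on test data
acc = accuracy_score(y_te, pred)      # accuracy = 1 - test error rate
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(cm.shape, acc <= 1.0, auc > 0.5)
```

Note that AUC is computed from predicted probabilities, not hard labels, which is why it can separate models that have identical accuracy.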
Decision Tree
From the summary results we can see that only 6 variables and 19 terminal nodes are used in the tree construction, and that the root node error is 0.34063. Figure 7 shows that the most important indicator of diabetes appears to be Glucose, on which the first branch is split (at < 128). We can also predict that if a woman’s glucose concentration is less than 128 and her Body Mass Index is less than 30, then she is less likely to have diabetes.
Decision Tree – Parameter Tuning
We use cross-validation to predict the response on the test data and produce a confusion matrix comparing the test labels to the predicted test labels. The misclassification error is 18.75% and the model accuracy is 81.25%.
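A decision-tree version of the same fit-and-evaluate routine is sketched below, again on synthetic data; max_depth=5 is an illustrative cap, not the paper's tree size (which has 19 terminal nodes).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the diabetes predictors.
X, y = make_classification(n_samples=2000, n_features=8, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=3)

tree = DecisionTreeClassifier(max_depth=5, random_state=3).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, tree.predict(X_te))

# Misclassification rate = off-diagonal count / total predictions
misclass = (cm[0, 1] + cm[1, 0]) / cm.sum()
print(cm.shape, 0.0 <= misclass <= 1.0)
```

In R the analogous pruning decision is usually driven by the complexity parameter (cp) from the cross-validated error table rather than a fixed depth.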
Support Vector Machine
The SVM algorithm’s primary goal is to maximize the minimum margin between the hyperplane and the support vectors while also reducing the misclassification rate. By introducing a larger cost for misclassification, SVM compromises on the minimum margin while correctly classifying the response variable. First, we use the tune() function to identify the best model, i.e. the parameter values that minimize the overall misclassification error rate. The results show that cost = 10, gamma = 0.125 and a radial SVM kernel give the minimum misclassification, making these the optimal parameters for the SVM model.
Fig: Misclassification error on SVM
Support Vector Machine – Parameter Tuning
We use cross-validation to predict the response on the test data and produce a confusion matrix comparing the test labels to the predicted test labels. The misclassification error is 16.29%; in other words, the model accuracy is 83.71%.
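The tune() search over cost and gamma maps directly onto scikit-learn's GridSearchCV. The grid below includes the paper's reported optimum (cost = 10, gamma = 0.125), but it runs on synthetic data, so the selected values may differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data; the grid contains the paper's optimum (C=10, gamma=0.125).
X, y = make_classification(n_samples=600, n_features=8, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=5)

grid = GridSearchCV(
    SVC(kernel="rbf"),                  # radial kernel, as in the paper
    {"C": [0.1, 1, 10], "gamma": [0.01, 0.125, 1.0]},
    cv=5)                               # 5-fold CV per grid point
grid.fit(X_tr, y_tr)

test_acc = grid.score(X_te, y_te)       # accuracy of the refit best model
print(sorted(grid.best_params_), 0.0 <= test_acc <= 1.0)
```

GridSearchCV refits the best parameter combination on the full training set automatically, so grid.score evaluates exactly the tuned model.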
7 Comparison of Model Accuracy
Comparing the three models (Logistic Regression, Decision Tree and Support Vector Machine), we get the following results:
To conclude, the graph shows that the Logistic Regression model has the lowest accuracy. However, the differences in accuracy between these classifiers are not significant.
Conclusion
The study demonstrates that data mining-based approaches can be used to assess the predictor variables influencing the risk of diabetes. The most important factors for diabetes in our findings are Pregnancies, Glucose and BMI. The diabetes dataset was analysed and explored in detail, and the patterns identified using data exploration methods were validated using the modelling techniques employed. Classification models (Logistic Regression, Classification Trees and Support Vector Machine) were built and evaluated to identify the best model for predicting the occurrence of diabetes. Based on the cross-validated sensitivity measures, the Support Vector Machine was concluded to be the best-performing model.
REFERENCES
[2] Nabi M., Wahid A., “Performance Analysis of Classification Algorithms in Predicting Diabetes”, International Journal of Advanced Research in Computer Science, Vol. 8, No. 3, March-April 2017.
[3] Akkarapol S., Jongsawas C., “An Analysis of Diabetes Risk Factors Using Data Mining Approach”, Paper PH10-2012.
[4] Ravneet S., Williamjeet S., “Data Mining in Healthcare for Diabetes Mellitus”, International Journal of Science and Research (IJSR), Vol. 3, Issue 7, July 2014.
[5] “RPubs: Using Predictive Models to Classify Diabetes Dataset 2018”, [Online]. Available: http://rpubs.com/rzezela77/346228. [Accessed 11-Nov-2018].
[6] “RPubs: Prediction of Diabetes in Pima Indian Women”, [Online]. Available: https://rpubs.com/jayarapm/PIMAIndianWomenDiabetes. [Accessed 11-Nov-2018].