Disclaimer: This is an example of a student written essay.

Any scientific information contained within this essay should not be treated as fact, this content is to be used for educational purposes only and may contain factual inaccuracies or be out of date.

Development of Prediction Model for Diabetes

Paper Type: Free Essay Subject: Sciences
Wordcount: 3025 words Published: 8th Feb 2020



Preventing diabetes is an ongoing area of interest to the healthcare community. Although several factors are considered to lead to diabetes, it is worthwhile to identify the most important ones in order to gain a better understanding of the issue. The availability of huge amounts of medical data creates a need for powerful mining tools to help health care professionals diagnose diabetes. The use of data mining techniques in the diagnosis of diabetes has been comprehensively investigated and has shown acceptable levels of accuracy. In this paper an attempt is made to analyse a diabetes data set and derive some interesting facts from it which can be used to develop a prediction model.

Keywords: Statistical Analysis, Data mining, Classification, Prediction.

1 Introduction

Diabetes is the epidemic of the 21st century and the biggest challenge confronting Australia’s health system. Based on data from the 2018 National Diabetes Services Scheme (NDSS) Fact Sheet, diabetes affects an estimated 1.7 million people in Australia. This includes all types of diagnosed diabetes (1.2 million known and registered) as well as silent, undiagnosed type 2 diabetes (up to 500,000 estimated). According to a study by the World Health Organization (WHO), the number of people with diabetes will have risen to 552 million by 2030, meaning that one in ten adults will have diabetes if no serious measures are taken.


In this paper we try to predict the presence of diabetes based on some relevant covariates. Bayesian binary regression models, particularly the Bayesian logistic regression model, are chosen as the model on which prediction is based. We are interested in looking at the classification problem from a Bayesian perspective. Consequently, we do not intend to focus on model selection or interpretation. Our analysis focuses on how different choices of prior for the parameters of the logistic regression model affect the prediction error rate.

2 Research Framework

Fig 1 shows the architecture of the proposed system, which addresses the research question of classifying an individual as diabetic or not based on features such as Age, Blood Pressure, BMI, Diabetes Pedigree Function, Glucose Concentration, Insulin, Pregnancies and Skin Thickness. In view of this problem description, we provide an in-depth analysis of how a data mining approach can help. First, we examine the raw dataset to understand the relevant data sources, assess data quality and discover interesting facts. The next steps process the data from its initial raw form into the final data sets, ready for model development. We then apply data mining techniques to predict and explain the factors that cause diabetes. Finally, we evaluate and assess the models against the research objectives.

2 Dataset Description

The dataset used for this study is from Hospital Frankfurt, Germany, and concerns the diagnosis of diabetes in females over 21 years old. It was obtained from the Kaggle Machine Learning Repository and introduced by John Da Silva (John D.S., 2018). With 2000 observations, the dataset consists of 9 attributes: 8 predictor attributes and 1 class variable with two outcomes, namely whether the patient tested positive (indicated by 1) or negative (indicated by 0). The dataset has 684 women who were diagnosed with diabetes and 1316 women who did not have diabetes. The sample has a high occurrence (34%) of positive records of diabetes.

Variable Name             Variable Type   Variable Description

Pregnancies                               Number of pregnancies
Glucose                                   Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure                             Diastolic blood pressure (mm Hg)
SkinThickness                             Triceps skin fold thickness (mm)
Insulin                                   2-hour serum insulin (µU/ml)
BMI                                       Body mass index (kg/m²)
DiabetesPedigreeFunction                  History of diabetes in relatives (genetic relationship)
Age                                       Age (years)
Outcome                                   Occurrence of diabetes (0 or 1)

Table 1: Attributes of Diabetes Data Set


3 Data Processing

Data processing is a machine learning technique that converts raw data into a logical, comprehensible format. It involves activities such as data cleaning, data integration, data transformation, data reduction and data discretization. Here the dataset is checked for duplicate values, missing values and type mismatches, and these inconsistencies are eliminated. The analysis shows abnormal zero values for variables such as SkinThickness and Insulin, which account for 573 and 956 values respectively, and the total missing error of the dataset is 48.25%. Removing this many observations would result in significant information loss. We therefore use kNN imputation to fill in the missing data, which contributes to better comprehensibility of the resulting classifier and a better understanding of the learned concept. It is very important to clean the dataset before training a classifier on it so that the classifier can better learn the hidden patterns in the data.
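The kNN imputation step described above can be illustrated with a minimal sketch. It is written in plain Python and is not the code used in this study (which appears to have relied on R); the function name knn_impute, the toy rows and the choice of k are assumptions made for illustration. Zero is treated as the missing-value marker in every column, which suits SkinThickness and Insulin but would be wrong for a column such as Pregnancies, where zero is a valid value. In practice the features would also be scaled before distances are computed.

```python
import math

def knn_impute(rows, k=3, missing=0):
    """Replace `missing` entries with the mean of that column over the
    k nearest rows in which the column is observed."""
    def distance(a, b):
        # Root mean squared difference over the columns observed in both rows,
        # so that rows sharing fewer columns remain comparable.
        shared = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value != missing:
                continue
            # Candidate donors: every other row with column j observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] != missing]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Hypothetical toy rows: (Glucose, SkinThickness, Insulin); 0 marks a missing value.
toy = [
    (148, 35, 0),
    (150, 33, 180),
    (85, 29, 60),
    (89, 23, 94),
    (145, 0, 175),
]
filled = knn_impute(toy, k=2)
print(filled[0])  # Insulin filled from the two most similar rows
print(filled[4])  # SkinThickness filled from the two most similar rows
```

Distances are computed on the original rows rather than on partially imputed ones, so the order in which cells are filled does not affect the result.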

4 Descriptive Statistics

Table: Summary of explanatory variables

Descriptive Statistics


Statistic                     N     Mean  St. Dev.     Min  Pctl(25)  Pctl(75)      Max
Pregnancies               2,000    3.704     3.306       0         1         6       17
Glucose                   2,000  121.880    30.562      44        99       141      199
BloodPressure             2,000   72.399    12.098      24        64        80      122
SkinThickness             2,000   29.132    10.302       7      21.7        36      110
Insulin                   2,000  148.468    99.391  14.000    78.847   183.250  744.000
BMI                       2,000   32.646     7.201  18.200    27.500    36.800   80.600
DiabetesPedigreeFunction  2,000    0.471     0.324   0.078     0.244     0.624    2.420
Age                       2,000   33.090    11.786      21        24        40       81
Outcome                   2,000  class (0=1316, 1=684)

Figure 3 shows the boxplots of the explanatory variables, which give an idea of the features of the dataset. We can see that several columns have outliers, with the column Insulin being the most critical. From the analysis, we can infer that the median glucose content is higher for patients who have diabetes. Blood pressure and skin thickness show little variation with diabetes status.
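The outlier observation can be checked roughly against the summary table. The sketch below assumes the conventional boxplot rule, in which points beyond 1.5 times the interquartile range from the quartiles are drawn as outliers; the essay does not state which rule its plots used, and the function name tukey_fences is illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# (25th percentile, 75th percentile, maximum) as reported in the summary table
summary = {
    "SkinThickness": (21.7, 36.0, 110.0),
    "Insulin": (78.847, 183.250, 744.0),
    "BMI": (27.5, 36.8, 80.6),
}

upper_fences = {}
for name, (q1, q3, maximum) in summary.items():
    lower, upper = tukey_fences(q1, q3)
    upper_fences[name] = upper
    print(f"{name}: upper fence {upper:.2f}, max {maximum}, "
          f"max beyond fence: {maximum > upper}")
```

Under this rule the reported maximum of Insulin (744) lies far above its upper fence of about 339.85, which is consistent with Insulin showing the most pronounced outliers.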

From the histograms in figure 4 it is evident that the variables Pregnancies and Age are highly skewed. For continuous variables, we can get a clearer picture of the distribution by analysing them against the dependent variable.

In figure 5, we can see that there is little or no correlation between most of the variables. Hence the model is not likely to suffer from multicollinearity. Insulin and Glucose, and BMI and Skin Thickness, have a moderate linear correlation.
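A correlation matrix such as the one in figure 5 is built from pairwise correlation coefficients. The sketch below assumes the Pearson coefficient (the essay does not name the coefficient) and uses hypothetical toy values rather than the study data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy measurements for five patients
bmi = [22.0, 27.5, 31.0, 36.8, 41.2]
skin_thickness = [18, 25, 24, 36, 33]
r = pearson(bmi, skin_thickness)
print(f"r = {r:.3f}")
```

Values of r close to 1 or -1 between two predictors would signal a multicollinearity risk, whereas values near 0 indicate little linear association.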


5 Model Construction

In this study three popular supervised learning techniques (Logistic Regression, Decision Tree and Support Vector Machine) and one unsupervised learning technique (Principal Component Analysis) are applied and compared with each other based on their predictive accuracy on the hold-out samples. The data set is split 80:20 into training and test sets. We use cross-validation for parameter tuning to identify the best model. Model performance is evaluated based on sensitivity, that is, the ability to correctly identify patients with diabetes.
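The 80:20 split can be sketched as follows. This version stratifies by class so that both sets keep roughly the same share of positive cases; whether the original split was stratified is not stated, so that choice, the function name stratified_split and the random seed are assumptions.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Class counts taken from the dataset description: 684 positive, 1316 negative
labels = [1] * 684 + [0] * 1316
train, test = stratified_split(labels)
print(len(train), len(test))
print(sum(labels[i] for i in test), "positive cases in the test set")
```

With 2000 observations this yields 1600 training rows and 400 test rows.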

Logistic Regression

We use the ‘stepAIC’ function to perform stepwise model selection of statistically significant independent variables, with the objective of minimizing the AIC value. The top three most relevant features are “Pregnancies”, “Glucose”, and “BMI”, as indicated by their low p-values.
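The criterion minimized by stepwise selection is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below shows how dropping weak predictors can lower the AIC even when the fit worsens slightly; the log-likelihoods and parameter counts are hypothetical and are not taken from the fitted models in this study.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a better trade-off
    between goodness of fit and model complexity."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical values: a full model with 9 parameters and a reduced model
# with 6 parameters whose log-likelihood is only slightly worse.
full_model = aic(log_likelihood=-750.0, n_params=9)
reduced_model = aic(log_likelihood=-750.5, n_params=6)
print(full_model, reduced_model)
```

Here the reduced model is preferred because the penalty saved by removing three parameters outweighs the small loss in likelihood.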

Logistic Regression – Parameter Tuning

We evaluate the performance of the model using the following parameters:

1)      AIC (Akaike Information Criterion): We can compare the AIC of the original model glm_fit1 with that of the model derived by the stepAIC function, glm_fit2. The AIC of glm_fit1 is 1513.9 and that of glm_fit2 is 1508.9. As expected, the model derived by the stepAIC function has the lower AIC value.

2)      Confusion Matrix: The test error rate is 23%. In other words, the accuracy is 77%.

3)      ROC Curve (Receiver Operating Characteristic) and AUC (Area Under the Curve): From figure 6, we can see that the AUC value is 0.8269522, which indicates good predictive power.
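The confusion-matrix and AUC measures listed above can be computed as in the following sketch. The labels and predicted probabilities are hypothetical toy values, the 0.5 classification threshold is an assumption, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def auc(y_true, scores):
    """Share of positive/negative pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test labels and predicted probabilities of diabetes
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.35, 0.3, 0.6, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)   # share of diabetic patients correctly identified
specificity = tn / (tn + fp)   # share of non-diabetic patients correctly identified
print(accuracy, sensitivity, specificity, auc(y_true, scores))
```

Sensitivity is the measure emphasised in this study because a missed diabetic patient (a false negative) is the costlier error.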

Decision Tree

From the summary result we can see that only 6 variables and 19 terminal nodes are used for tree construction and the root node error is 0.34063. Figure 7 shows that the most important indicator of “Diabetes” appears to be Glucose, on which the first branch is split (e.g. <128). We can also predict that if a woman’s Glucose concentration is less than 128 and her Body Mass Index is less than 30 then she is less likely to have diabetes.
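How a tree arrives at a first split such as Glucose < 128 can be sketched with a search over candidate thresholds. The example assumes a CART-style tree that minimizes weighted Gini impurity; the essay does not state the splitting criterion, and the glucose values below are hypothetical.

```python
def gini(labels):
    """Gini impurity of a set of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values and return the
    (threshold, weighted impurity) pair with the lowest impurity."""
    best = (None, float("inf"))
    ordered = sorted(set(values))
    for low, high in zip(ordered, ordered[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: glucose readings and diabetes outcome
glucose = [85, 89, 110, 120, 137, 148, 166, 183]
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_threshold(glucose, outcome)
print(threshold, impurity)
```

The root node error is the misclassification rate obtained by predicting the majority class for every training row. As a consistency check, and assuming a training set of 1600 rows, the reported 0.34063 would correspond to 545 positive cases (545/1600 = 0.340625).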

Decision Tree – Parameter Tuning

We use a Cross-Validation technique to predict the response on the test data and produce a confusion matrix comparing the test labels to the predicted test labels. The misclassification error is 18.75% and the model accuracy is 81.25%.

Support Vector Machine

The primary goal of the SVM algorithm is to maximize the minimum margin between the hyperplane and the support vectors while also reducing the misclassification rate. With a larger cost (the penalty for misclassification), the SVM compromises on the margin in order to classify more of the training responses correctly. First, we use the tune() function to tune the parameters and identify the model that minimizes the overall misclassification error rate. The result shows that cost = 10, gamma = 0.125 and a radial SVM kernel give the minimum misclassification, which makes them the optimal parameters for the SVM model.
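The role of gamma in the radial kernel can be illustrated with a short sketch. The radial (RBF) kernel measures similarity as exp(-gamma * ||x - z||^2); the function name rbf_kernel and the two scaled feature vectors are hypothetical, and only gamma = 0.125 comes from the tuning result.

```python
import math

def rbf_kernel(x, z, gamma=0.125):
    """Radial basis function kernel: 1.0 for identical points, decaying
    towards 0 as the squared distance between the points grows."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Hypothetical scaled feature vectors for two patients
patient_a = [0.5, -1.0, 0.2]
patient_b = [1.5, 0.0, 0.2]

for gamma in (0.01, 0.125, 1.0):
    print(gamma, round(rbf_kernel(patient_a, patient_b, gamma), 4))
```

A larger gamma makes similarity fall off faster with distance, giving a more flexible decision boundary, while the cost parameter controls how heavily misclassified training points are penalized.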

Fig: Misclassification error on SVM


Support Vector Machine – Parameter Tuning

We use cross-validation to predict the response on the test data and produce a confusion matrix comparing the test labels to the predicted test labels. The misclassification error is 0.1629; in other words, the model accuracy is 0.8371.

6 Comparison of Model Accuracy

Comparing the three models Logistic Regression, Decision Tree, Support Vector Machine we get the following results:


To conclude, the graph shows that the Logistic Regression model has the lowest accuracy. However, the differences in accuracy between these classifiers are modest.
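The comparison can be reproduced from the misclassification rates reported in the preceding sections, as in this short sketch (the variable names are illustrative):

```python
# Test misclassification rates reported for each model
error_rates = {
    "Logistic Regression": 0.23,
    "Decision Tree": 0.1875,
    "Support Vector Machine": 0.1629,
}

# Accuracy is one minus the misclassification rate
accuracy = {model: round(1 - err, 4) for model, err in error_rates.items()}
ranked = sorted(accuracy, key=accuracy.get, reverse=True)

for model in ranked:
    print(f"{model}: {accuracy[model]:.4f}")
```

The gap between the best and the worst model is about 6.7 percentage points of accuracy.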


The study demonstrates that data mining-based approaches can be used to assess the predictor variables influencing the risk of diabetes. The most important factors of diabetes in our findings are Pregnancies, Glucose and BMI. The diabetes dataset was analysed and explored in detail, and the patterns identified through data exploration were validated using the modelling techniques employed. Classification models (Logistic Regression, Classification Trees and Support Vector Machine) were built and evaluated to identify the best model for predicting the occurrence of diabetes. Based on the cross-validation performance measure of sensitivity, the Support Vector Machine model was found to be the best-performing model.


[1] K. Meena, N. Vijayalakshmi, “An Analysis of Risk Factor for Diabetes using Data Mining Approach”, Indian Journal of Public Health Research and Development, Vol. 6, Issue No. 2, pp. 112-117, April-June 2015.

[2] Nabi M, Wahid A, “Performance Analysis of Classification Algorithms in Predicting Diabetes”, International Journal of Advanced Research in Computer Science, Vol. 8, No. 3, March-April 2017.

[3] Akkarapol S, Jongsawas C, “An Analysis of Diabetes Risk Factors Using Data Mining Approach”, Paper PH10-2012.

[4] Ravneet S, Williamjeet S, “Data Mining in Healthcare for Diabetes Mellitus”, International Journal of Science and Research (IJSR), Vol. 3, Issue No. 7, July 2014.

[5] “RPubs: Using Predictive Models to Classify Diabetes Dataset 2018.”, [Online]. Available: http://rpubs.com/rzezela77/346228. [Accessed 11-Nov-2018].

[6] “RPubs: Prediction of Diabetes in Pima Indian Women.”, [Online]. Available: https://rpubs.com/jayarapm/PIMAIndianWomenDiabetes [Accessed 11-Nov-2018].

