A DATA-DRIVEN PARADIGM FOR FORECASTING TYPE 2 DIABETES RISK THROUGH THE INTEGRATION OF MACHINE LEARNING TECHNIQUES AND CLINICAL DATA WITHIN A UNIFIED WEB-BASED PLATFORM

Authors

  • Janani Sp, Department of Life Sciences, School of Science, Garden City University, Bangalore, India
  • Himanshu Shekhar Singh, Department of Life Sciences, School of Science, Garden City University, Bangalore, India
  • Mahak, Department of Life Sciences, School of Science, Garden City University, Bangalore, India
  • Shaik Mubarak Basha, Department of Life Sciences, School of Science, Garden City University, Bangalore, India
  • Kesiya Joy, Department of Life Sciences, School of Science, Garden City University, Bangalore, India

DOI:

https://doi.org/10.69980/gfs9k223

Keywords:

SDG-3, diabetes detection, feature engineering, personalized healthcare solution, XGBoost, machine learning, deep learning

Abstract

Type 2 diabetes mellitus (T2DM) has emerged as one of the most significant metabolic disorders affecting populations globally, necessitating the development of sophisticated early screening methodologies. The integration of artificial intelligence and data science in clinical applications has created unprecedented opportunities for improving disease prediction accuracy while advancing the United Nations Sustainable Development Goal 3 (Good Health and Well-Being). This research endeavors to establish a robust predictive framework by systematically evaluating multiple classification algorithms across four distinct diabetes-related datasets obtained from Kaggle and the UCI Machine Learning Repository. Our comprehensive analysis encompasses eight traditional machine learning classifiers alongside three deep neural network architectures, with model effectiveness measured using standard evaluation metrics including Accuracy, Precision, Recall, F1-Score, and Area Under the ROC Curve (AUC). We introduce a dimensionality reduction strategy based on Principal Component Analysis (PCA) to optimize feature representation, comparing its effectiveness against LASSO-based feature selection. Our experimental findings demonstrate that the XGBoost classifier combined with PCA-based feature engineering delivers exceptional predictive capability, achieving approximately 97% accuracy and F1-score with a 96% AUC across the diabetes datasets. The optimized model has been deployed within an intuitive web-based application designed to facilitate accessible diabetes risk assessment. Additionally, we propose an integrated digital health framework incorporating Internet of Medical Things (IoMT), Robotic Process Automation (RPA), and AI technologies to enhance the reliability and scalability of personalized diabetes management solutions.
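As an illustration of the five evaluation metrics named above, the following is a minimal sketch in plain Python, not the authors' evaluation code. It assumes binary labels (1 = diabetic, 0 = non-diabetic) and a predicted risk score per patient; the function names, the 0.5 decision threshold, and the toy data are hypothetical choices made only for this example. AUC is computed here by the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half).

```python
from typing import Dict, Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def classification_metrics(
    y_true: Sequence[int],
    y_score: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and AUC for a binary classifier."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc": roc_auc(y_true, y_score),
    }


# Toy labels and scores, invented for illustration only.
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]
print(classification_metrics(toy_labels, toy_scores))
```

Accuracy, Precision, Recall, and F1-Score depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two kinds of figures can differ for the same model.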

Published

2026-04-08