The scientific journal Modeling, Optimization and Information Technology
Online media
ISSN 2310-6018

Comparison of the efficiency of different feature selection methods for solving the binary classification problem of predicting in vitro fertilization pregnancy

Sinotova S.L., Limanovskaya O.V., Plaksina A.N., Makutina V.A.

UDC 519.683, 519-7
DOI: 10.26102/2310-6018/2020.30.3.025


Determining the range of factors that affect the object of research is one of the most important tasks in medical research. Its solution is complicated by a large amount of diverse data, including extensive anamnestic information and data from clinical studies, often combined with a limited number of observed patients. This work compares the results obtained by various feature selection methods in the search for a set of predictors on the basis of which a model with the best forecast quality can be built, for solving the binary classification problem of predicting the onset of pregnancy during in vitro fertilization (IVF). Data from the anamnesis of women, presented in binary form, were used as features. The sample consisted of 68 features and 689 objects. The features were examined for cross-correlation, after which methods and algorithms for selecting significant factors were applied: nonparametric criteria, interval estimation of proportions, the Z-test for the difference of two proportions, mutual information, the RFECV, ADD-DEL and Relief algorithms, algorithms based on permutation importance (Boruta, Permutation Importance, PIMP), and feature selection using model feature importances (lasso, random forest). To compare the quality of the selected feature sets, various classifiers were built, and their AUC metric and model complexity were calculated. All models have high prediction quality (AUC above 95%). The best of them are based on features selected using nonparametric criteria, model-based selection (lasso regression), and the Boruta, Permutation Importance, RFECV and ReliefF algorithms. The optimal set of predictors is the set of 30 binary features obtained by the Boruta algorithm, owing to the lower complexity of the model at a comparably high quality (model AUC of 0.983). Significant features include: data on pregnancies in the anamnesis in general, ectopic and regressed pregnancies, spontaneous and term deliveries, abortions up to 12 weeks; hypertension, ischemia, stroke, thrombosis, ulcers, obesity and diabetes mellitus in immediate family members; hormonal treatment currently underway that is not associated with the IVF procedure; allergies; harmful occupational factors; normal duration and stability of the menstrual cycle without taking medication; hysteroscopy, laparoscopy and laparotomy; resection of any organ of the genitourinary system; whether this is the first IVF attempt, the presence of any surgical interventions and diseases of the genitourinary system; the age and BMI of the patient; absence of chronic diseases; the presence of diffuse fibrocystic mastopathy and hypothyroidism.
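To make the pipeline sketched in the abstract concrete, the fragment below is a minimal, hypothetical reproduction of its final step for one of the methods: Boruta selection over a binary feature matrix followed by a cross-validated AUC estimate for a classifier restricted to the selected features. It is not the authors' code: the data are a random placeholder of the same shape as the study sample (689 objects, 68 binary features), the variable names are illustrative, and only the boruta_py and scikit-learn packages cited in the reference list are assumed to be installed.

```python
# Minimal sketch (not the authors' code) of Boruta feature selection followed by
# a cross-validated AUC check, assuming a binary feature matrix X and a 0/1 target y.
# The data are a random placeholder with the shape described in the abstract
# (689 objects, 68 binary features); all variable names are illustrative.
import numpy as np
from boruta import BorutaPy                        # boruta_py package, ref. 37
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
X = rng.randint(0, 2, size=(689, 68))
# Make the toy target depend on the first two features so Boruta has something to find.
y = (X[:, 0] + X[:, 1] + rng.rand(689) > 1.5).astype(int)

# Random forest serves as the base estimator whose importances Boruta compares
# against shuffled "shadow" copies of the features.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            n_jobs=-1, random_state=42)
selector = BorutaPy(rf, n_estimators="auto", alpha=0.05, max_iter=100, random_state=42)
selector.fit(X, y)

selected = np.where(selector.support_)[0]
print("Selected features:", selected)
if selected.size == 0:
    raise SystemExit("Boruta kept no features on this toy data.")

# Cross-validated AUC of a classifier restricted to the selected subset,
# analogous to the model comparison reported in the abstract.
auc = cross_val_score(RandomForestClassifier(n_estimators=500, random_state=42),
                      X[:, selected], y, scoring="roc_auc", cv=5).mean()
print(f"Mean cross-validated AUC: {auc:.3f}")
```

In the study itself, every selection method produced its own subset and the resulting models were compared by AUC and complexity; the sketch mirrors that comparison for a single method only.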

1. van Loendersloot L.L., van Wely M., Limpens J., Bossuyt P.M., Repping S., van der Veen F. Predictive factors in in vitro fertilization (IVF): a systematic review and meta-analysis. Hum Reprod Update. 2010;16(6):577–589. DOI: 10.1093/humupd/dmq015

2. Atasever M., Namlı Kalem M., Hatırnaz Ş., Hatırnaz E., Kalem Z., Kalaylıoğlu Z. Factors affecting clinical pregnancy rates after IUI for the treatment of unexplained infertility and mild male subfertility. J Turk Ger Gynecol Assoc. 2016;17:134–138. DOI: 10.5152/jtgga.2016.16056

3. Vaegter K.K., Lakic T.G., Olovsson M., Berglund L., Brodin T., Holte J. Which factors are most predictive for live birth after in vitro fertilization and intracytoplasmic sperm injection (IVF/ICSI) treatments? Analysis of 100 prospectively recorded variables in 8,400 IVF/ICSI single-embryo transfers. Fertil Steril. 2017;107(3):641–648.e2. DOI: 10.1016/j.fertnstert.2016.12.005

4. Vogiatzi P., Pouliakis A., Siristatidis C. An artificial neural network for the prediction of assisted reproduction outcome. J Assist Reprod Genet. 2019;36:1441–1448. DOI: 10.1007/s10815-019-01498-7

5. Guh R.-S., Wu T.-C.J., Weng S.-P. Integrating genetic algorithm and decision tree learning for assistance in predicting in vitro fertilization outcomes. Expert Systems with Applications. 2011;38(4):4437–4449. DOI: 10.1016/j.eswa.2010.09.112

6. Hassan M.R., Al-Insaif S., Hossain M.I., Kamruzzaman J. A machine learning approach for prediction of pregnancy outcome following IVF treatment. Neural Comput & Applic. 2020;32:2283–2297. DOI: 10.1007/s00521-018-3693-9

7. Hafiz P., Nematollahi M., Boostani R., Namavar Jahromi B. Predicting Implantation Outcome of In Vitro Fertilization and Intracytoplasmic Sperm Injection Using Data Mining Techniques. Int J Fertil Steril. 2017;11(3):184–190. DOI: 10.22074/ijfs.2017.4882

8. Raef B., Ferdousi R. A Review of Machine Learning Approaches in Assisted Reproductive Technologies. Acta Inform Med. 2019;27(3):205–211. DOI: 10.5455/aim.2019.27.205-211

9. Guyon I., Elisseeff A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003;3:1157–1182.

10. Guyon I., Weston J., Barnhill S., Vapnik V. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning. 2002;46:389–422. DOI: 10.1023/A:1012487302797

11. Saeys Y., Inza I., Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–2517. DOI: 10.1093/bioinformatics/btm344

12. Voroncov K.V. Lekcii po metodam ocenivanija i vybora modelej [Lectures on methods of model assessment and selection]. Available at: http://www.ccas.ru/voron/download/Modeling.pdf (accessed 18.08.2020) (In Russ)

13. Altmann A., Toloşi L., Sander O., Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010;26(10):1340–1347. DOI: 10.1093/bioinformatics/btq134

14. Kira K., Rendell L.A. The feature selection problem: traditional methods and a new algorithm. AAAI. 1992;129–134.

15. Kursa M.B., Rudnicki W.R. Feature Selection with the Boruta Package. Journal of Statistical Software. 2010;36(11):1–13. DOI: 10.18637/jss.v036.i11

16. Mazaheri V., Khodadadi H. Heart arrhythmia diagnosis based on the combination of morphological, frequency and nonlinear features of ECG signals and metaheuristic feature selection algorithm. Expert Systems with Applications. 2020;161:113697. DOI: 10.1016/j.eswa.2020.113697

17. Faris H., Mafarja M.M., Heidari A.A., Aljarah I., Al-Zoubi A.M., Mirjalili S., Fujita H. An efficient binary Salp Swarm Algorithm with crossover scheme for feature selection problems. Knowledge-Based Systems. 2018;154:43–67. DOI: 10.1016/j.knosys.2018.05.009

18. He H., Bai Y., Garcia E.A., Li S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). 2008;1322–1328. DOI: 10.1109/IJCNN.2008.4633969

19. Lemaître G., Nogueira F., Aridas C.K. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. JMLR. 2017;18(17):1–5.

20. Glantz S. Primer of biostatistics. M.: Practica; 1998. (In Russ)

21. Rothman K.J. A Show of Confidence. N Engl J Med. 1978;299(24):1362−1363. DOI: 10.1056/NEJM197812142992410

22. Das A.K., Kumar S., Jain S., Goswami S., Chakrabarti A., Chakraborty B. An information-theoretic graph-based approach for feature selection. Sādhanā. 2020;45:11. DOI: 10.1007/s12046-019-1238-2

23. Battiti R. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks. 1994;5(4):537−550. DOI: 10.1109/72.298224

24. Kononenko I. Estimating attributes: Analysis and extensions of RELIEF. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence). 1994;784:171−182.

25. Robnik-Sikonja M., Kononenko I. An adaptation of Relief for attribute estimation in regression. ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning. 1997;296–304.

26. Hamon J. Optimisation combinatoire pour la sélection de variables en régression en grande dimension: application en génétique animale [Combinatorial optimization for variable selection in high-dimensional regression: application in animal genetics]. PhD thesis. Université des Sciences et Technologies de Lille - Lille I; 2013. (In French). HAL: tel-00920205

27. Implementation of the RFECV algorithm in Scikit-learn. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV (accessed 18.08.2020)

28. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay É. Scikit-learn: Machine Learning in Python. JMLR. 2011;12(85):2825−2830.

29. Natekin A. Gradientnyj busting: vozmozhnosti, osobennosti i fishki za predelami standartnyh kaggle-style zadach [Gradient boosting: capabilities, features and tricks beyond standard kaggle-style tasks]. Moscow Data Science Meetup. 2017. Available at: https://www.youtube.com/watch?time_continue=746&v=cM2c47Xlqk&feature=emb_logo (accessed 18.08.2020) (In Russ)

30. Shitikov V.K., Mastickij S.Je. Klassifikacija, regressija, algoritmy Data Mining s ispol'zovaniem R [Classification, regression and Data Mining algorithms using R]. 2017. Available at: https://github.com/ranalytics/data-mining (accessed 18.08.2020) (In Russ)

31. ELI5 library. Available at: https://eli5.readthedocs.io/en/latest/index.html# (accessed 18.08.2020)

32. Anaconda - solutions for Data Science Practitioners and Enterprise Machine Learning. Available at: https://www.anaconda.com (accessed 18.08.2020)

33. SciPy library. Available at: https://www.scipy.org/index.html (accessed 18.08.2020)

34. ReliefF library. Available at: https://pypi.org/project/ReliefF/#description (accessed 18.08.2020)

35. LightGBM library. Available at: https://lightgbm.readthedocs.io/en/latest/index.html# (accessed 18.08.2020)

36. Grellier O. Feature Selection with Null Importances. Article on the Kaggle. Available at: https://www.kaggle.com/ogrellier/feature-selection-with-null-importances (accessed 18.08.2020)

37. Boruta implementation in Python. Available at: https://github.com/scikit-learn-contrib/boruta_py (accessed 18.08.2020)

38. NumPy library. Available at: https://numpy.org/ (accessed 18.08.2020)

39. Pandas library. Available at: https://pandas.pydata.org/ (accessed 18.08.2020)

40. Matplotlib library. Available at: https://matplotlib.org/index.html (accessed 18.08.2020)

41. Seaborn library. Available at: https://seaborn.pydata.org/# (accessed 18.08.2020)

42. Bergstra J., Yamins D., Cox D.D. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. JMLR Workshop and Conference Proceedings. 2013;28(1):115–123.

43. Grjibovski A.M. Analysis of nominal data (independent observations). Human Ecology. 2008;6:58–68. (In Russ)

44. Ng A. Machine Learning Yearning. Available at: https://www.mlyearning.org/ (accessed 18.08.2020)

Sinotova Svetlana L.

Email: sveta.volkova92@mail.ru


Institute of Fundamental Education, FSAEI HE «UrFU named after the first President of Russia B.N. Yeltsin»

Ekaterinburg, Russian Federation

Limanovskaya Oksana V.
Candidate of Chemical Sciences
Email: o.v.limanovskaia@urfu.ru


Institute of Fundamental Education, FSAEI HE «UrFU named after the first President of Russia B.N. Yeltsin»

Ekaterinburg, Russian Federation

Plaksina Anna N.
Candidate of Medical Sciences
Email: burberry20@yandex.ru


FSBEI HE «USMU of the Ministry of Health of the Russian Federation»

Ekaterinburg, Russian Federation

Makutina Valerija A.
Candidate of Biological Sciences
Email: makutina_v@rambler.ru


The Family Medicine Centre

Ekaterinburg, Russian Federation

Keywords: feature selection, binary classification problem, small data analysis, machine learning, assisted reproductive technologies

For citation: Sinotova S.L., Limanovskaya O.V., Plaksina A.N., Makutina V.A. Comparison of the efficiency of different feature selection methods for solving the binary classification problem of predicting in vitro fertilization pregnancy. Modeling, Optimization and Information Technology. 2020;8(3). Available from: https://moit.vivt.ru/wp-content/uploads/2020/08/SinotovaSoavtors_3_20_1.pdf DOI: 10.26102/2310-6018/2020.30.3.025 (In Russ).
