COMPARISON OF THE EFFICIENCY OF DIFFERENT FEATURE SELECTION METHODS FOR SOLVING THE BINARY CLASSIFICATION PROBLEM OF PREDICTING IN VITRO FERTILIZATION PREGNANCY
UDC 519.683, 519-7
S.L. Sinotova, O.V. Limanovskaya, A.N. Plaksina, V.A. Makutina
Determining the set of factors that affect the object of study is one of the most important tasks in medical research. The task is complicated by the large volume of heterogeneous data, including extensive anamnestic information and clinical study data, often combined with a limited number of observed patients. This work compares the results of various feature selection methods in the search for the set of predictors that yields the model with the best predictive quality for the binary classification problem of predicting the onset of pregnancy after in vitro fertilization (IVF). Anamnesis data of women, presented in binary form, were used as features. The sample consisted of 68 features and 689 objects. The features were checked for cross-correlation, after which the following methods and algorithms were applied to select significant factors: nonparametric criteria, interval estimates of proportions, the Z-test for the difference between two proportions, mutual information, the RFECV, ADD-DELL and Relief algorithms, algorithms based on permutation importance (Boruta, Permutation Importance, PIMP), and feature selection using model feature importance (lasso, random forest). To compare the quality of the selected feature sets, various classifiers were built, and the AUC and complexity of each model were calculated. All models have high prediction quality (AUC above 0.95). The three best of them are based on features selected using nonparametric criteria, model-based selection (lasso regression), and the Boruta, Permutation Importance, RFECV and ReliefF algorithms. The optimal set of predictors is the set of 30 binary features obtained by the Boruta algorithm, because it gives lower model complexity at a relatively high quality (model AUC 0.983).
The significant features include: data on pregnancies in the anamnesis in general, ectopic and regressive pregnancies, independent and term childbirth, and abortions up to 12 weeks; hypertension, ischemia, stroke, thrombosis, ulcers, obesity and diabetes mellitus in the immediate family; current hormonal treatment not associated with the IVF procedure; allergies; occupational hazards; normal duration and stability of the menstrual cycle without medication; hysteroscopy, laparoscopy and laparotomy; resection of any organ of the genitourinary system; whether this is the first IVF attempt, the presence of any surgical interventions, and diseases of the genitourinary system; the age and BMI of the patient; absence of chronic diseases; the presence of diffuse fibrocystic mastopathy and hypothyroidism.
Keywords: feature selection, binary classification problem, small data analysis, machine learning, assisted reproductive technologies.
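As an illustration of one of the filter methods named in the abstract, the Z-test for the difference between two proportions can be sketched for binary features as follows. This is a minimal sketch, not the authors' implementation: the function names, the pooled standard error, the two-sided alternative, the 0.05 significance level and the toy data are all assumptions made for the example.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided Z-test for the difference between two proportions.

    Assumption: the pooled standard error is used (hypothetical choice,
    not stated in the abstract).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups must be non-empty")
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # The feature is constant across both groups: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def select_features(X, y, alpha=0.05):
    """Return indices of binary features whose share of ones differs
    significantly between the positive (y == 1) and negative (y == 0) class."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    selected = []
    for j in range(len(X[0])):
        k_pos = sum(row[j] for row, label in zip(X, y) if label == 1)
        k_neg = sum(row[j] for row, label in zip(X, y) if label == 0)
        _, p_value = two_proportion_z_test(k_pos, n_pos, k_neg, n_neg)
        if p_value < alpha:
            selected.append(j)
    return selected


# Toy data (invented for illustration): 20 positive and 20 negative objects.
# Feature 0 is present in 18/20 positives and 2/20 negatives;
# feature 1 is present in 10/20 of each class.
y = [1] * 20 + [0] * 20
X = (
    [[1, 1]] * 10 + [[1, 0]] * 8 + [[0, 0]] * 2      # positive class
    + [[1, 1]] * 2 + [[0, 1]] * 8 + [[0, 0]] * 10    # negative class
)
print(select_features(X, y))  # -> [0]
```

A filter of this kind scores each feature independently of the others, so it does not account for correlated features; that is one reason to compare it against wrapper and model-based methods.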