UDC 004.424.5.032.24

G. V. Vorontsov, A. P. Preobrazhenskiy, O. N. Choporov

The relevance of this study stems from the growing need for automatic data classification. This paper considers a naive Bayesian algorithm applied to determining the subject of a text. The aim of the work is to implement the classifier, to identify and solve the problems that arise during its implementation and operation, and to evaluate its effectiveness. Two problems are identified: arithmetic overflow and a resulting probability of zero. Both are solved by means of Laplace smoothing and the properties of logarithms. Approaches to optimizing the program module and increasing its speed are also presented. As a result, a Bayesian classifier was implemented and studied on sets of articles covering 10 different subjects, using both analytical and test verification. The material is of practical value to those who intend to apply the algorithm considered here, or algorithms similar to it, in their own research.

Keywords: naive Bayesian classifier, text mining, algorithm, Bayes' theorem, document analysis.

Full text: