Abstract: Current research on machine-learning-based material property prediction usually uses all data samples obtained from a database, and trains the prediction model on high-dimensional vector representations computed from those samples. However, the high redundancy of samples in material databases leads to strongly biased and over-fitted models. To address this, this paper proposes an algorithm that eliminates redundant samples from a data set and selects representative samples from it. Material properties are then predicted with multiple machine learning algorithms and the results are compared. The results show that if redundancy control is not applied to the benchmark data set, even random raw data sets can yield good predictive performance metrics, because of the highly redundant samples. The study also finds that training on representative samples helps produce models that generalize better and are more predictive. This paper therefore proposes that reducing redundancy is a necessary step in evaluating material property prediction models.
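The idea of removing redundant samples and keeping representative ones can be illustrated with a minimal sketch. The abstract does not specify the algorithm, so the greedy distance-threshold rule below is an assumption chosen for illustration, not the method proposed in this paper. The function name `select_representatives`, the Euclidean distance, the `threshold` value, and the toy feature vectors are all hypothetical.

```python
import math


def select_representatives(features, threshold):
    """Greedy redundancy filter (illustrative assumption, not the paper's algorithm).

    A sample is kept only if its Euclidean distance to every sample
    already kept is greater than `threshold`. Returns the kept indices.
    """
    kept = []
    for i, x in enumerate(features):
        if all(math.dist(x, features[j]) > threshold for j in kept):
            kept.append(i)
    return kept


# Toy feature vectors (hypothetical): two pairs of near-duplicates plus one outlier.
samples = [
    (0.0, 0.0),
    (0.05, 0.0),  # near-duplicate of sample 0
    (1.0, 1.0),
    (1.0, 1.02),  # near-duplicate of sample 2
    (3.0, 0.5),
]

representatives = select_representatives(samples, threshold=0.1)
print(representatives)  # [0, 2, 4]
```

In such a scheme, the threshold controls how aggressively near-duplicates are dropped. If a model is evaluated on a random split of the unfiltered set, near-duplicates of training samples can end up in the test set, which is the kind of inflated metric the abstract warns about.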