Abstract: To address the sensitivity to noisy data and the redundancy of training samples that affect parallel SVM algorithms in big data environments, this paper proposes GIESVM-MR, a parallel SVM algorithm based on granularity and information entropy. First, a noise cleaning (NC) method evaluates the importance of each feature attribute and obtains the correlation between each sample and its category, which allows noisy data to be identified and deleted effectively. Second, a granulation-based data compression (GDC) strategy screens the information granules to retain class-boundary samples and delete non-support vectors, yielding a smaller data set and resolving the redundancy of training samples in big data environments. Finally, the final classification model is generated by combining the idea of Bagging with the MapReduce computing model. Experimental results show that GIESVM-MR not only improves classification accuracy but also reduces the time complexity of the parallel SVM algorithm in big data environments.
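As a rough illustration of how information entropy can score the importance of a feature attribute, the sketch below computes information gain for a discrete feature. The abstract does not state the exact criterion used by the NC method, so information gain is an assumption chosen for illustration, and the names `entropy` and `information_gain` and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after grouping samples by a discrete feature.

    Assumption: this stands in for the feature-importance score; the actual
    NC criterion may differ.
    """
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the label entropy inside each feature-value group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: one feature separates the classes perfectly, the other not at all.
labels = ["pos", "pos", "neg", "neg"]
informative = ["a", "a", "b", "b"]
uninformative = ["a", "b", "a", "b"]

print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

In this sketch a higher score means the feature tells us more about the category, so a feature with a score near zero would contribute little to the sample-category correlation.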