Abstract: In recent years, the bag-of-visual-words (BOVW) model based on spatio-temporal interest points (STIPs) has been widely used in action recognition research. However, the model ignores the weight of each visual word and does not consider the spatial and temporal distribution of STIPs, which degrades recognition accuracy. In this paper, we propose two new algorithms to address these problems. First, the term frequency–inverse document frequency (TF-IDF) method is used to optimize the traditional BOVW histogram, weighting each visual word according to its proportion in the vocabulary and in the BOVW histogram. Second, we propose a STIPs mutual information (STIPsMI) algorithm based on a three-dimensional co-occurrence matrix; this new descriptor captures the spatio-temporal relationships of interest points across different visual words. The STIPsMI descriptor is then concatenated with the optimized BOVW histogram to form the final descriptor of the video sequence. The proposed method is evaluated on two challenging human action datasets: the KTH dataset and the UCF Sports dataset. Experimental results confirm the validity of our approach, which outperforms the BOVW model and other mainstream methods.
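To make the TF-IDF re-weighting step concrete, the sketch below applies a standard smoothed TF-IDF scheme to a matrix of per-video BOVW histograms. The abstract does not specify the exact weighting formula, so the function name `tfidf_weight`, the smoothing constants, and the final L2 normalization are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def tfidf_weight(histograms):
    """Re-weight per-video BOVW histograms with a smoothed TF-IDF scheme.

    histograms: (n_videos, n_words) array of raw visual-word counts.
    Returns L2-normalized TF-IDF-weighted histograms.
    NOTE: this is a generic sketch; the paper's exact formula may differ.
    """
    counts = np.asarray(histograms, dtype=float)
    # Term frequency: each word's count relative to all words in that video.
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    # Document frequency: in how many videos each visual word appears.
    df = np.count_nonzero(counts > 0, axis=0)
    # Smoothed inverse document frequency (avoids division by zero).
    idf = np.log((1.0 + counts.shape[0]) / (1.0 + df)) + 1.0
    weighted = tf * idf
    # L2-normalize each video's descriptor for comparability.
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    return weighted / np.maximum(norms, 1e-12)

# Toy example: 3 videos, a 3-word visual vocabulary.
hists = np.array([[3, 0, 1],
                  [0, 2, 2],
                  [1, 1, 0]])
print(tfidf_weight(hists).round(3))
```

Words that occur in few videos receive a larger IDF factor, so discriminative visual words dominate the final descriptor, which is the intuition behind weighting the BOVW histogram.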