Human-computer interfaces and multimodal interaction are increasingly used in everyday life. Environments equipped with sensors can acquire and interpret a wide range of information, thus assisting humans in several application areas, such as behaviour understanding, event detection, and action recognition. In these areas, the suitable processing of this information is a key factor in properly structuring multimodal data. In particular, heterogeneous devices and different acquisition times can be exploited to improve recognition results. On the basis of these assumptions, this paper proposes a multimodal system based on Allen's temporal logic combined with a prediction method. The main aim of the system is to correlate users' events with the system's reactions. After post-processing the data coming from different acquisition devices (e.g., RGB images, depth maps, sounds, proximity sensors), the system manages the correlations between recognition/detection results and events in real time, thus creating an interactive environment for users. To increase recognition reliability, a predictive model is also associated with the method. The modularity of the system allows fully dynamic development and upgrading with customized modules. Finally, comparisons with other similar systems are presented, underlining the high flexibility and robustness of the proposed event management method.
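To make the role of Allen's temporal logic concrete, the sketch below classifies the relation between two time intervals (e.g., a detected user event and a system reaction) according to Allen's thirteen interval relations. This is a minimal illustrative implementation, not the paper's actual system; the `Interval` type and function name are assumptions introduced here for clarity.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A closed time interval, e.g. the span of a detected event."""
    start: float
    end: float


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the Allen relation holding between intervals a and b.

    Covers all 13 relations of Allen's interval algebra:
    before/after, meets/met-by, overlaps/overlapped-by,
    starts/started-by, during/contains, finishes/finished-by, equals.
    """
    if a.end < b.start:
        return "before"
    if b.end < a.start:
        return "after"
    if a.end == b.start:
        return "meets"
    if b.end == a.start:
        return "met-by"
    if a.start == b.start and a.end == b.end:
        return "equals"
    if a.start == b.start:
        return "starts" if a.end < b.end else "started-by"
    if a.end == b.end:
        return "finishes" if a.start > b.start else "finished-by"
    if b.start < a.start and a.end < b.end:
        return "during"
    if a.start < b.start and b.end < a.end:
        return "contains"
    # Remaining cases: proper overlap with no shared endpoints.
    return "overlaps" if a.start < b.start else "overlapped-by"
```

For instance, a "meets" relation between a detected gesture and a triggered reaction could indicate that the reaction began exactly when the gesture ended, which is the kind of temporal correlation the proposed system manages.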