Classical evaluation of information retrieval systems assesses a system on a static test collection. In Web search, however, the evaluation environment (EE) changes continuously, and the assumption of a static test collection does not reflect this changing reality. Moreover, changes in the EE, such as the document set, the topic set, the relevance judgments, and the chosen metrics, have an impact on performance measurement [1, 4]. To the best of our knowledge, there is no established way to evaluate two versions of a search engine across evolving EEs. We aim at proposing a continuous framework to evaluate different versions of a search engine in different EEs. The classical paradigm relies on a controlled test collection (i.e., a set of topics, a corpus of documents, and relevance assessments) as a stable and meaningful EE that guarantees the reproducibility of system results. We propose to take multiple EEs into account for the evaluation of systems, in a dynamic test collection (DTC). A DTC is a list of test collections based on a controlled evolution of a static test collection. The DTC allows us to quantify and relate the differences between the test collection elements, called Knowledge delta (K Delta), and the performance differences between systems evaluated on these varying test collections, called Result delta (R Delta). The continuous evaluation is thus characterized by K Deltas and R Deltas, and relating the changes in both deltas will allow us to interpret the evolution of system performance. The expected contributions of the thesis are: (i) a pivot strategy based on R Delta to compare systems evaluated in different EEs; (ii) a formalization of the DTC to simulate continuous evaluation and provide significant R Deltas in evolving contexts; and (iii) a continuous evaluation framework that incorporates K Deltas to explain the performance changes of evaluated systems.

It is not possible to measure the R Delta of two systems evaluated in different EEs directly, because the performance variations depend on the changes in the EEs [1]. To estimate this R Delta, we propose to use a reference system, called the pivot system, which is evaluated in both EEs under consideration. The R Delta is then measured from the relative distance between the pivot system and each evaluated system (a sketch of this computation is given below). Our results [2, 3] show that, using the pivot strategy, we improve the correctness of the ranking of systems (RoS) evaluated in two EEs (i.e., its similarity with the RoS obtained in the ground-truth EE), compared with the RoS constructed from the absolute performance values of each system evaluated in its own EE. The correctness of the RoS depends on the system chosen as pivot and on the metric.

The focus of the proposal then moves to continuous evaluation, i.e., the repeated assessment of the same or different versions of a Web search engine across evolving EEs. Current test collections do not consider the evolution of documents, topics, and relevance judgments. We require a DTC to extract significant R Deltas for the compared systems and to relate them to the changes in the EEs (K Deltas). We provide a method to define a DTC from static test collections based on controlled features, as a way to better simulate the evolving EE (a second sketch below illustrates one possible construction). According to our preliminary experiments, a system evaluated on our proposed DTC shows more variable performance, and larger R Deltas, than when it is evaluated on several random shards or bootstrap samples of documents.
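As a concrete illustration of the pivot strategy, the sketch below ranks systems evaluated in different EEs by their relative distance to a shared pivot run, and measures RoS correctness as Kendall's tau against a ground-truth RoS. The function names, the particular relative distance, the example scores, and the use of Kendall's tau are illustrative assumptions, not the exact formulation used in the thesis.

```python
from scipy.stats import kendalltau

def relative_distance(score: float, pivot_score: float) -> float:
    """Relative distance of a system's score to the pivot's score within one EE
    (one possible choice of distance; assumes a non-zero pivot score)."""
    return (score - pivot_score) / pivot_score

def pivot_ros(scores_by_ee: dict, pivot: str) -> list:
    """Rank systems evaluated in different EEs by their relative distance to the
    shared pivot system.

    scores_by_ee: {ee_id: {system_name: metric_value}}; the pivot is evaluated in
    every EE, and every other system appears in exactly one EE."""
    deltas = {}
    for scores in scores_by_ee.values():
        pivot_score = scores[pivot]
        for system, value in scores.items():
            if system != pivot:
                deltas[system] = relative_distance(value, pivot_score)
    # A larger relative gain over the pivot ranks higher.
    return sorted(deltas, key=deltas.get, reverse=True)

def ros_correctness(ros: list, ground_truth_ros: list) -> float:
    """Correctness of a ranking of systems (RoS) as Kendall's tau against the RoS
    obtained in the ground-truth EE."""
    gt_rank = {system: i for i, system in enumerate(ground_truth_ros)}
    tau, _ = kendalltau([gt_rank[s] for s in ros], list(range(len(ros))))
    return tau

if __name__ == "__main__":
    # Made-up scores for illustration: the pivot run is shared by both EEs,
    # and each engine version is evaluated only in its own EE.
    scores = {
        "EE_2022": {"pivot_run": 0.31, "engine_v1": 0.35},
        "EE_2023": {"pivot_run": 0.27, "engine_v2": 0.33},
    }
    print(pivot_ros(scores, pivot="pivot_run"))  # ['engine_v2', 'engine_v1']
```

In this toy example, engine_v2 is ranked above engine_v1 despite a lower absolute score, which is the kind of inversion the pivot strategy is intended to capture when the two EEs differ in difficulty.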
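The DTC construction is only summarized in this abstract; the sketch below shows one hypothetical way to derive it from a static seed collection by controlling a single feature (e.g., document recency), keeping a fixed fraction of the highest-scoring documents at each step and restricting the relevance judgments accordingly. The TestCollection class, the feature-based selection, and the parameters are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class TestCollection:
    documents: Dict[str, str]          # doc_id -> text
    topics: Dict[str, str]             # topic_id -> query text
    qrels: Dict[Tuple[str, str], int]  # (topic_id, doc_id) -> relevance grade

def build_dtc(seed: TestCollection,
              feature: Callable[[str, str], float],
              steps: int = 5,
              keep_ratio: float = 0.8) -> List[TestCollection]:
    """Derive a DTC as a list of collections obtained by a controlled evolution of
    the seed: at each step, keep only the fraction of documents scoring highest on
    the controlled feature and restrict the qrels to the surviving documents.
    Topics are kept unchanged here, but could evolve with a feature of their own."""
    dtc = [seed]
    current = seed
    for _ in range(steps):
        ranked = sorted(current.documents,
                        key=lambda d: feature(d, current.documents[d]),
                        reverse=True)
        kept = set(ranked[:int(len(ranked) * keep_ratio)])
        current = TestCollection(
            documents={d: t for d, t in current.documents.items() if d in kept},
            topics=dict(current.topics),
            qrels={(q, d): r for (q, d), r in current.qrels.items() if d in kept},
        )
        dtc.append(current)
    return dtc
```

Unlike random shards or bootstrap samples, the evolution here is driven by a controlled feature, which is what makes the differences between consecutive collections quantifiable and relatable to the observed R Deltas.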
As future work, we will integrate the K Deltas to formalize an explainable continuous evaluation framework. The pivot strategy tells us when a system's performance improves across EEs; the DTC provides the EEs required to identify significant R Deltas; and the inclusion of K Deltas in the framework will define a set of factors that explain the system's performance changes.
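To make the notion of such explanatory factors more concrete, the snippet below (reusing the hypothetical TestCollection class from the DTC sketch above) quantifies a K Delta between two EEs as per-element Jaccard distances over documents, topics, and judged (topic, document) pairs. This decomposition is only one possible instantiation, chosen for illustration; it is not the thesis's final definition of K Delta.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|: 0 for identical sets, 1 for disjoint sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def knowledge_delta(ee_a: "TestCollection", ee_b: "TestCollection") -> dict:
    """One per-element view of the K Delta between two EEs; each component can
    then be related to the R Delta observed between the systems evaluated in
    those EEs."""
    return {
        "documents": jaccard_distance(set(ee_a.documents), set(ee_b.documents)),
        "topics":    jaccard_distance(set(ee_a.topics), set(ee_b.topics)),
        "qrels":     jaccard_distance(set(ee_a.qrels), set(ee_b.qrels)),
    }
```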