Neural networks (NNs) have gained significant traction for time series classification over the past few years. Yet they are frequently perceived as black-box tools whose decisions are difficult to interpret. To address this issue, several methods have been proposed to produce maps of relevance scores that highlight the importance of individual time steps for a given model's prediction. These methods were initially developed for images and have more recently been adapted to time series data. Interpretability of NNs nevertheless remains challenging: interpretability methods often yield divergent, sometimes even diametrically opposite, results and may not explain how neurons collaborate to represent specific patterns. In this work, we propose a new evaluation framework for post-hoc interpretability methods applied to time series classification tasks. We argue that this work is a critical step toward understanding NN-based decisions and toward providing a more robust interpretability workflow. We also present a preliminary study assessing the robustness of the proposed evaluation metrics.
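To make the notion of a relevance map concrete, the sketch below shows one common post-hoc approach, vanilla gradient saliency, applied to a toy time series classifier. This is only an illustrative assumption, not the framework or the specific interpretability methods evaluated in this work; the model `TinyTSClassifier` and the helper `gradient_saliency` are hypothetical names introduced here for the example.

```python
import torch
import torch.nn as nn

# Hypothetical 1D CNN classifier for univariate time series (illustrative only).
class TinyTSClassifier(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(16, n_classes)

    def forward(self, x):                 # x: (batch, 1, time)
        z = self.features(x).squeeze(-1)  # (batch, 16)
        return self.head(z)               # (batch, n_classes)

def gradient_saliency(model: nn.Module, x: torch.Tensor, target: int) -> torch.Tensor:
    """Relevance of each time step as |d score(target) / d x_t| (vanilla gradient)."""
    x = x.clone().requires_grad_(True)
    score = model(x)[0, target]
    score.backward()
    return x.grad.abs().squeeze()         # shape: (time,)

# Usage: one relevance score per time step of a length-128 series.
model = TinyTSClassifier()
series = torch.randn(1, 1, 128)
relevance = gradient_saliency(model, series, target=1)
print(relevance.shape)  # torch.Size([128])
```

Different post-hoc methods (e.g., gradient-based or perturbation-based) applied to the same input and model can yield markedly different relevance maps, which is precisely the disagreement the proposed evaluation framework is meant to quantify.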