They seemed to be pretty mindful of this contamination, and call out that they aggressively pruned part of the training dataset and still observed strong performance. That said, I agree: I really want to try it out myself, see how it feels, and find out whether the scores actually translate to day-to-day capabilities.
From section 5:
In Figure 2.1, we see that training on CodeExercises leads to a substantial boost in the performance of the
model on the HumanEval benchmark. To investigate this boost, we propose to prune the CodeExercises
dataset by removing files that are “similar” to those in HumanEval. This process can be viewed as
a “strong form” of data decontamination. We then retrain our model on such pruned data, and still
observe strong performance on HumanEval. In particular, even after aggressively pruning more than
40% of the CodeExercises dataset (this even prunes files that are only vaguely similar to HumanEval, see
Appendix C), the retrained phi-1 still outperforms StarCoder.
I read that, but there is no good technique for ruling out close duplicates. I know because I tried to build one for my product. At best it relies on BLEU, embedding distance, and other proxies, all of which are far from ideal.
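To illustrate what I mean, here is a minimal sketch of the embedding-distance kind of filter, assuming the sentence-transformers package and an off-the-shelf model like all-MiniLM-L6-v2 (my arbitrary choices, not what the paper used); the cosine-similarity cutoff is exactly the kind of proxy threshold that makes these filters unreliable.

```python
# Sketch of a near-duplicate filter based on embedding distance.
# Assumptions: sentence-transformers is installed, all-MiniLM-L6-v2 is an
# arbitrary model choice, and the 0.92 threshold is an arbitrary proxy --
# which is precisely the weakness described above.
import numpy as np
from sentence_transformers import SentenceTransformer

def find_near_duplicates(train_snippets, benchmark_snippets, threshold=0.92):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # With normalized embeddings, cosine similarity is just a dot product.
    train_emb = model.encode(train_snippets, normalize_embeddings=True)
    bench_emb = model.encode(benchmark_snippets, normalize_embeddings=True)
    sims = train_emb @ bench_emb.T
    flagged = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            # Training snippet i looks "too similar" to benchmark snippet j.
            flagged.append((i, j, float(row[j])))
    return flagged
```

Whatever threshold you pick, you end up trading false positives against missed paraphrases, which is why I say these proxies never fully settle the contamination question.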
From section 5: