They seemed to be pretty mindful of this contamination, and call out that they aggressively pruned part of the training dataset and still observed strong performance. That said, I agree: I really want to try it out myself, see how it feels, and find out whether the scores actually translate to day-to-day capabilities.
From section 5:
In Figure 2.1, we see that training on CodeExercises leads to a substantial boost in the performance of the
model on the HumanEval benchmark. To investigate this boost, we propose to prune the CodeExercises
dataset by removing files that are “similar” to those in HumanEval. This process can be viewed as
a “strong form” of data decontamination. We then retrain our model on such pruned data, and still
observe strong performance on HumanEval. In particular, even after aggressively pruning more than
40% of the CodeExercises dataset (this even prunes files that are only vaguely similar to HumanEval, see
Appendix C), the retrained phi-1 still outperforms StarCoder.
I read that, but there is no good technique for ruling out close duplicates. I know because I tried to build one for my product. At best it relies on BLEU, embedding distance, and other proxies, all of which are far from ideal.
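To illustrate what I mean, here is a minimal sketch of the embedding-distance kind of filter, assuming the sentence-transformers package and an off-the-shelf model like all-MiniLM-L6-v2 (my arbitrary choices, not what the paper used); the cosine-similarity cutoff is exactly the kind of proxy threshold that makes these filters unreliable.

```python
# Sketch of a near-duplicate filter based on embedding distance.
# Assumptions: sentence-transformers is installed, all-MiniLM-L6-v2 is an
# arbitrary model choice, and the 0.92 threshold is an arbitrary proxy --
# which is precisely the weakness described above.
import numpy as np
from sentence_transformers import SentenceTransformer

def find_near_duplicates(train_snippets, benchmark_snippets, threshold=0.92):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # With normalized embeddings, cosine similarity is just a dot product.
    train_emb = model.encode(train_snippets, normalize_embeddings=True)
    bench_emb = model.encode(benchmark_snippets, normalize_embeddings=True)
    sims = train_emb @ bench_emb.T
    flagged = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            # Training snippet i looks "too similar" to benchmark snippet j.
            flagged.append((i, j, float(row[j])))
    return flagged
```

Whatever threshold you pick, you end up trading false positives against missed paraphrases, which is why I say these proxies never fully settle the contamination question.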
From section 5: