
They seemed to be pretty mindful of this contamination, and call out that they aggressively pruned part of the training dataset and still observed strong performance. That said, I agree: I really want to try it out myself, see how it feels, and find out whether the scores translate to day-to-day capability.

From section 5:

    In Figure 2.1, we see that training on CodeExercises leads to a substantial boost in the performance of the
    model on the HumanEval benchmark. To investigate this boost, we propose to prune the CodeExercises
    dataset by removing files that are “similar” to those in HumanEval. This process can be viewed as
    a “strong form” of data decontamination. We then retrain our model on such pruned data, and still
    observe strong performance on HumanEval. In particular, even after aggressively pruning more than
    40% of the CodeExercises dataset (this even prunes files that are only vaguely similar to HumanEval, see
    Appendix C), the retrained phi-1 still outperforms StarCoder.


I read that, but there is no reliable technique for ruling out close duplicates. I know because I tried to build one for my product. At best it relies on BLEU, embedding distance, and other proxies, which are far from ideal.
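
The embedding-distance version is easy to prototype, but picking the cutoff is the whole problem. A minimal sketch, assuming the sentence-transformers library (the model name and the 0.9 threshold are just placeholders, not anything from the paper):

    # Sketch: drop training files whose nearest benchmark file is "too similar".
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def prune_near_duplicates(train_files, benchmark_files, threshold=0.9):
        train_emb = model.encode(train_files, convert_to_tensor=True)
        bench_emb = model.encode(benchmark_files, convert_to_tensor=True)
        # Cosine similarity of every training file against every benchmark file.
        sims = util.cos_sim(train_emb, bench_emb)
        # Keep a file only if its closest benchmark match falls below the threshold.
        keep = sims.max(dim=1).values < threshold
        return [f for f, k in zip(train_files, keep) if k]

Set the threshold too high and contaminated files slip through; too low and you throw away legitimately similar exercises, which is exactly the ambiguity that makes "strong decontamination" hard to verify.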



