
commoncrawl.org

Our public web dataset goes back to 2008, and is widely used by academia and startups.



I always wanted to ask:

- How often is that updated?

- How current is it at any point in time?

- Does it offer historical / temporal access, i.e. can you check the history of a page, a la the Internet Archive?


- monthly

- it's a historical archive, the concept of "current" is hard to turn into a metric

- not only is our archive historical, it is also included in the Internet Archive's Wayback Machine.
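
For anyone wanting to try the historical lookup in practice, here is a minimal sketch of checking one page's captures within a single monthly crawl. Nothing here comes from the thread itself: the index host, the `-index` path, the `url`/`output` parameters, the `timestamp` and `status` field names, and the crawl ID used in the usage note are all assumptions about Common Crawl's public CDX index, and `capture_query_url` and `parse_captures` are hypothetical helper names. Check the current documentation before relying on any of it.

```python
import json
from urllib.parse import urlencode

# Assumption: Common Crawl's public CDX index server (not named in the thread).
INDEX_HOST = "https://index.commoncrawl.org"


def capture_query_url(crawl_id, page_url):
    """Build a lookup URL for one page within one crawl (hypothetical helper)."""
    query = urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def parse_captures(body):
    """Parse a newline-delimited JSON body into (timestamp, status) pairs.

    Assumes each non-empty line is one JSON object with a "timestamp" field
    and, optionally, a "status" field. Results are ordered oldest first.
    """
    records = [json.loads(line) for line in body.splitlines() if line.strip()]
    pairs = [(r["timestamp"], r.get("status")) for r in records]
    return sorted(pairs, key=lambda pair: pair[0])
```

To use it, one would fetch `capture_query_url("CC-MAIN-2024-10", "example.com")` with any HTTP client and pass the response text to `parse_captures`. Since crawls are monthly, repeating the query across several crawl IDs is one way to approximate a page's history.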



