> why can't webmasters spider everything Google returns?

I'm sure you could start caching, client-side, every search you ever run against Google.

But if you're searching enough to eat up Google's bandwidth, they're paying for that data and they're under no obligation to keep serving you as a client (much as any server is under no particular obligation to serve a search spider).
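
A minimal sketch of that kind of client-side cache in Python, assuming you key on the raw query string; the fetch callback and the cache path are hypothetical, not any real API:

    import shelve

    CACHE_PATH = "search_cache"   # hypothetical local cache file

    def cached_search(query, fetch):
        # Return a cached result for `query`, calling fetch(query) only on a miss.
        with shelve.open(CACHE_PATH) as cache:
            if query not in cache:
                cache[query] = fetch(query)   # fetch() is whatever actually runs the search
            return cache[query]

It only saves bandwidth for queries you repeat, of course, which is rather the point of the comment above.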



> But if you're searching enough to eat up Google's bandwidth, they're paying for that data and they're under no obligation to keep serving you as a client

Do you not see the irony?


Sites allow search engines to index them (instead of telling them to go away with robots.txt) because the search traffic is worth it to them.

Search engines don't allow people to scrape them (resorting to blocking after scrapers ignore robots.txt) because they don't get anything similarly valuable in return.
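
For concreteness, a rough sketch of the site-owner half of that bargain; the user-agents and paths here are illustrative, not any real site's robots.txt:

    # Admit a known search crawler, keep everything else out of private paths
    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /private/

A search engine's own robots.txt typically does the reverse for its result pages (e.g. disallowing /search), which is the "telling scrapers to go away" half.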

(Disclosure: I work for Google, though not on search.)


No, I don't. Can you help clarify it for me?

Search engines crawling millions of sites, each with on average a few MB of data, distributes the cost globally.

Extracting terabytes of index data from a single search engine concentrates the cost entirely on that one repository's bandwidth bill.

These are not symmetrical cost structures.
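
Back-of-envelope, with purely made-up but plausible numbers:

    # Illustrative figures only, to show where the bandwidth bill lands.
    sites = 1_000_000
    mb_per_site = 5                          # assumed average amount crawled per site

    total_mb = sites * mb_per_site           # ~5 TB moved in both cases
    cost_per_site = mb_per_site              # a crawled site pays for ~5 MB of egress
    cost_for_engine = total_mb               # a scraped engine pays for all ~5 TB itself

    print(cost_per_site, "MB vs", cost_for_engine, "MB")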


Our git repository went down when crawlers decided to index it.


But probably not Google. The Google crawler is very careful and backs off as soon as it encounters elevated error rates. Bing appears to do the same.
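
What's being described is essentially error-rate-based backoff. A rough sketch of the idea in Python (not Google's actual logic; the thresholds and names are invented):

    import time

    def polite_crawl(urls, fetch, max_error_rate=0.1, base_delay=1.0):
        # Fetch each URL, backing off on server errors and stopping
        # outright once the observed error rate gets too high.
        errors, seen, delay = 0, 0, base_delay
        for url in urls:
            status = fetch(url)              # assumed to return an HTTP status code
            seen += 1
            if status >= 500:
                errors += 1
                delay = min(delay * 2, 60)   # slow down while the server looks unhealthy
            if seen >= 20 and errors / seen > max_error_rate:
                break                        # give up entirely, the crawl is doing harm
            time.sleep(delay)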



