
Okay, but then what? Host your sites on something other than 'www' or '*', exclude them from search engines, and never link to them? Then, for the few people who do resolve these subdomains, you just gotta hope they don't do it using a DNS server owned by a company with an AI product (like Google, Microsoft, or Amazon)?

I really don't know how you're supposed to shield your content from AI without also shielding it from humanity.

Don't have any index pages or heavy cross-linking between pages.

None of that matters. AI bots can still figure out how to navigate the website.

The biggest problem I have seen with AI scraping is that the bots blindly try every possible combination of URLs once they find your site and blast it 100 times per second for each page they can find.

They don’t respect robots.txt, they don’t care about your sitemap, they don’t bother caching; they just mindlessly churn away, effectively a DDoS.
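
For a sense of the kind of stopgap this pushes people toward, here is a minimal per-IP rate limiter sketched in Python. The window size and request budget are made-up numbers for illustration, not thresholds any particular scraper is known to trip:

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10   # assumed window size
    MAX_REQUESTS = 20     # assumed per-IP budget within the window

    _recent = defaultdict(deque)

    def allow(ip):
        """Return True if this IP is still under its request budget."""
        now = time.monotonic()
        hits = _recent[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()
        if len(hits) >= MAX_REQUESTS:
            return False   # over budget: reject or challenge this request
        hits.append(now)
        return True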

Google at least played nice.

And so that is why things like Anubis exist, and why people flock to Cloudflare and all the other tried-and-true methods of blocking bots.
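
For what it's worth, the core idea behind Anubis-style blockers is a proof-of-work challenge: the server only hands over the page after the client burns some CPU solving a puzzle, which is cheap for one human but expensive at scraper scale. A rough Python sketch of the idea (the difficulty and hash format here are assumptions, not Anubis's actual scheme):

    import hashlib
    import os

    DIFFICULTY = 4  # assumed: required leading zero hex digits

    def make_challenge():
        return os.urandom(16).hex()

    def verify(challenge, nonce):
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        return digest.startswith("0" * DIFFICULTY)

    def solve(challenge):
        # The work the client (normally JS in the browser) has to do.
        nonce = 0
        while not verify(challenge, nonce):
            nonce += 1
        return nonce

    challenge = make_challenge()
    print(verify(challenge, solve(challenge)))  # True, after a brief brute force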


I don't see how that is possible. The website is a disconnected graph with a lot of components. If they get hold of a URL, maybe that gets them to a few other pages, but not all of them. Most of the pages on my personal site are .txt files with no outbound links, for that matter. Nothing to navigate.

How? If you don't have a default page and directory index listings are disabled, how can they derive page names?


