
This test compares:

1. Number of visitors as recorded in Google Analytics

2. Number of loads of a 1x1 pixel served on a different domain

They see higher numbers for (2) than (1), and attribute the difference to users blocking Google Analytics.
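For illustration, the counter in (2) could look roughly like this on the second domain (a minimal sketch using Flask; the endpoint path, log file, and log format are my own assumptions, not the article's actual setup):

    # Minimal sketch of a 1x1 pixel counter on a separate domain (hypothetical,
    # not the article's code). Each request is appended to a log for later tallying.
    import base64
    from flask import Flask, Response, request

    app = Flask(__name__)

    # Smallest commonly used transparent 1x1 GIF.
    PIXEL = base64.b64decode(
        "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

    @app.route("/pixel.gif")
    def pixel():
        # One line per load: client IP and User-Agent, tab-separated.
        with open("pixel_hits.log", "a") as f:
            f.write(f"{request.remote_addr}\t{request.headers.get('User-Agent', '-')}\n")
        return Response(PIXEL, mimetype="image/gif",
                        headers={"Cache-Control": "no-store"})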

I don't see them describing how they excluded bot traffic, however, and for my sites the majority of hits are from bots. Only some bots run JS, so I suspect their count of "blocking" users is thoroughly inflated by these bots.

(Disclosure: I work for Google, speaking only for myself)



Only some bots execute JS, but even fewer fetch images.


The article isn't just comparing two total hit counts.

The author extracted browser information from the server logs (presumably from the User-Agent header). If they were able to do that, I'd assume they also filtered bots out of the tally :)
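Something along these lines would be enough (a rough sketch; the log path, the combined-log-format assumption, and the bot token list are guesses on my part):

    # Sketch: tally browser families from an access log, skipping self-identified
    # bots by User-Agent (assumes combined log format; token list is illustrative).
    import re
    from collections import Counter

    BOT_TOKENS = re.compile(r"bot|crawler|spider|slurp|curl|wget", re.I)

    def browser_family(ua: str) -> str:
        # Order matters: Chrome UAs also contain "Safari".
        for name in ("Edg", "Firefox", "Chrome", "Safari"):
            if name in ua:
                return name
        return "Other"

    counts = Counter()
    with open("access.log") as f:
        for line in f:
            # In combined log format the User-Agent is the last quoted field.
            fields = re.findall(r'"([^"]*)"', line)
            ua = fields[-1] if fields else ""
            if not ua or BOT_TOKENS.search(ua):
                continue  # drop declared bots
            counts[browser_family(ua)] += 1

    print(counts.most_common())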


I'd expect serious bots to masquerade as whatever the latest Chrome user agent is.


Then you would expect Chrome to show a higher number of GA-blocking "users". But that's not the case: the article mentions that the percentage of users blocking GA on Chrome is on par with Safari's.

And I don't know what you mean by "serious". The most common crawlers (Google, Baidu, Yandex, etc.) identify themselves as bots very clearly in the User-Agent. Personally, those are the ones I'd call the most "serious", and also the ones I've seen generating the most traffic on servers.


The net is full of unidentified bots scraping content or probing for vulnerabilities (contact forms, WordPress logins, etc.). On many occasions I've had traffic issues and had to dig through logs, and these bots were very hard to block because they ignore robots.txt, don't advertise themselves in the User-Agent, and use a large pool of IPs.


I don't know why this comment is downvoted; it mirrors my experience. I'm responsible for a few domestic high-traffic websites and have done some log-file analysis to find suspicious traffic, e.g. user agents claiming to be Chrome but not loading images or CSS files, or racking up many page views (say 50, where our average user has 2). It wasn't foolproof, but the false positives were < 10% in my spot checks. These bots made up ~10-15% of page views.
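For the curious, the kind of heuristic I mean looks roughly like this (a sketch, not our production code; the thresholds and the combined-log-format assumption are illustrative):

    # Sketch: flag clients that claim to be Chrome but never fetch images/CSS,
    # or that rack up an implausible number of page views (thresholds are made up).
    import re
    from collections import defaultdict

    ASSET = re.compile(r"\.(css|js|png|jpe?g|gif|webp|svg|ico)(\?|\s|$)", re.I)
    pages = defaultdict(int)
    assets = defaultdict(int)

    with open("access.log") as f:
        for line in f:
            fields = re.findall(r'"([^"]*)"', line)
            if len(fields) < 2:
                continue
            request_line, ua = fields[0], fields[-1]
            ip = line.split()[0]
            if "Chrome" not in ua:
                continue
            if ASSET.search(request_line):
                assets[ip] += 1
            else:
                pages[ip] += 1

    suspicious = [ip for ip, n in pages.items()
                  if (assets[ip] == 0 and n >= 5) or n >= 50]
    print(f"{len(suspicious)} suspicious 'Chrome' clients")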


I meant bots with sufficiently sophisticated adversarial motives (ad fraud? blog comment spam? automated WordPress exploitation?), which I'd expect to want to avoid being recognized as such.


Ah, I see what you mean. Those are serious-ly malicious bots then! But yeah, completely agreed; those can be a PITA on sites with user-generated content.

But is your experience that these kinds of bots cause much traffic? Because, from what I've seen, they can make a mess with fake accounts, fake content, fake clicks, etc., but as far as traffic goes, they're completely dwarfed by search engine crawlers and real users.

Thanks for the clarification :D


Mhm, we have a bunch of non-crawled content that sees a significant minority of request volume from disguised bots. Overall, it is definitely dwarfed by traffic from real users, but it still forces a lot of work to prevent the bots from gaming metrics/analytics.


Yeah, the numbers in the article are so far off my intuition that I'm happy to latch onto any explanation for why they're weird. Being unable to effectively discount bots seems likely.


I'm reconsidering my intuition in the face of the fact that the sample is from OP's blog and not a customer-facing business.


How do you identify bots on your personal sites?


++ for checking that you were using the correct pronoun for the author!


I didn't check; I just use "they" when I don't know.



