Web Scraping with Modern Perl (Part 2 - Speed Edition) (perl.org)
44 points by creaktive on Feb 20, 2013 | hide | past | favorite | 14 comments


Are you a hardcore web scraper? Then check https://blog.databigbang.com for articles on topics like scraping sites with JavaScript, browserless OAuth, and implementing your own rotating proxies.

Disclosure: I am the author, but the site is helping and saving the time of thousands of people with code and examples.


Your site never seems to load for me.


Same here :(


You mean slow? It is on AWS.


Figured it out; the problem was the HTTPS. http://blog.databigbang.com/ is fine :)


Sorry, it was my fault. Too many HTTPS sites lately.


Here are some list-style resources:

20 Perl libraries for fetching web content - http://neilb.org/reviews/http-requesters.html

A Stack Overflow community wiki on HTML scraping - http://stackoverflow.com/questions/2861/options-for-html-scr...


YADA, the concurrent fetcher featured in the article, also has extensive benchmarks covering many Perl WWW user agent libraries: https://metacpan.org/module/AnyEvent::Net::Curl::Queued#BENC...
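For anyone curious, a concurrent-fetch sketch roughly following YADA's synopsis (the URLs are placeholders, and the `final_url`/`data` accessor names are taken from the synopsis, so double-check the module docs before relying on them):

```perl
use strict;
use warnings;
use YADA;

# Queue several URLs and fetch them concurrently via libcurl;
# the callback fires as each download completes.
YADA->new->append(
    [qw[
        http://example.com
        http://example.org
        http://example.net
    ]] => sub {
        my ($self) = @_;
        # final_url and data (a scalar ref to the body) per the YADA synopsis
        printf "Finished %s: %d bytes\n",
            $self->final_url, length ${$self->data};
    },
)->wait;
```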


Neil's reviews on the various categories of CPAN modules have been extremely helpful. Highly recommended!


Shameless on-topic plug here. Some readers might be more familiar with Ruby.

Multi-threaded web scraping over the Tor network with Ruby -

http://devcomsystems.com.au/2013/01/multi-thread-mechanize-u...


Take a look at Tales.

Tales is a block-tolerant web scraper that runs on top of AWS and Rackspace. Tales is designed to be easy to deploy, configure, and manage.

https://github.com/calufa/tales-core


I dunno... the tales install script seems to want to take over whatever account it's run as, going so far as to modify ~/.ssh/config. That alone gives me pause... then it requires you have a github account?

And it needs mysql, redis and mongo??!?

Oh, and of course, I'll need an aws/rackspace account...

If I need all that for a web-scraper it better be for a big project.

The CPAN modules in the linked article can all be installed and run as a non-privileged user (via either local::lib or perlbrew, etc.). And there are no daemons, nothing running as root, nothing listening on any ports, no configuration or tuning to think about, and it'll work everywhere from my macbook to my dev-server running linux or BSD or Solaris or whatever.

BTW, Mojolicious (http://mojolicio.us) is really great stuff. Outside of having a reasonably up-to-date version of Perl (5.10.1 or higher), it's got no external dependencies, not even other CPAN modules. It's fast, flexible, easy to use, and easy to deploy just about anywhere. sri++
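To give a flavor of the scraping side, here is a minimal sketch using Mojo::DOM's CSS selectors on an inline HTML snippet (the snippet and selectors are just illustrative; for a live page you would get the same kind of object from Mojo::UserAgent):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Parse a small inline snippet; for a live page you would use
# Mojo::UserAgent->new->get($url)->res->dom instead.
my $dom = Mojo::DOM->new(<<'HTML');
<html><head><title>Hello</title></head>
<body><a href="/a">A</a><a href="/b">B</a></body></html>
HTML

# jQuery-style CSS selectors
print $dom->at('title')->text, "\n";    # prints "Hello"
print "$_\n" for $dom->find('a')->map(sub { $_->{href} })->each;
```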


Mojolicious is featured more extensively in Part 1 of my article: http://news.ycombinator.com/item?id=5159452 ;)


Interesting! Do you have any benchmarks?



