Web Scraping with Modern Perl (Part 2 - Speed Edition) (perl.org)
44 points by creaktive on Feb 20, 2013 | hide | past | favorite | 14 comments


Are you a hardcore web scraper? Then check https://blog.databigbang.com for articles on topics like scraping sites with JavaScript, browserless OAuth, and implementing your own rotating proxies.

Disclosure: I am the author, but the site is helping and saving the time of thousands of people with code and examples.


Your site never seems to load for me.


Same here :(


You mean slow? It is on AWS.


Figured it out; the problem was the HTTPS. http://blog.databigbang.com/ is fine :)


Sorry, it was my fault. Too many HTTPS sites lately.


Here are some list-style resources:

20 Perl libraries for fetching web content - http://neilb.org/reviews/http-requesters.html

A Stack Overflow community wiki on HTML scraping - http://stackoverflow.com/questions/2861/options-for-html-scr...


YADA, the concurrent fetcher featured in the article, also has extensive benchmarks covering many Perl WWW user agent libraries: https://metacpan.org/module/AnyEvent::Net::Curl::Queued#BENC...
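For anyone curious, a concurrent-fetch sketch roughly following YADA's synopsis (the URLs are placeholders, and the `final_url`/`data` accessor names are taken from the synopsis, so double-check the module docs before relying on them):

```perl
use strict;
use warnings;
use YADA;

# Queue several URLs and fetch them concurrently via libcurl;
# the callback fires as each download completes.
YADA->new->append(
    [qw[
        http://example.com
        http://example.org
        http://example.net
    ]] => sub {
        my ($self) = @_;
        # final_url and data (a scalar ref to the body) per the YADA synopsis
        printf "Finished %s: %d bytes\n",
            $self->final_url, length ${$self->data};
    },
)->wait;
```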


Neil's reviews on the various categories of CPAN modules have been extremely helpful. Highly recommended!


Shameless on-topic plug here. Some readers might be more familiar with Ruby.

Multi-threaded web scraping over the Tor network with Ruby -

http://devcomsystems.com.au/2013/01/multi-thread-mechanize-u...


Take a look at Tales.

Tales is a block-tolerant web scraper that runs on top of AWS and Rackspace. Tales is designed to be easy to deploy, configure, and manage.

https://github.com/calufa/tales-core


I dunno... the tales install script seems to want to take over whatever account it's run as, going so far as to modify ~/.ssh/config. That alone gives me pause... then it requires you have a github account?

And it needs mysql, redis and mongo??!?

Oh, and of course, I'll need an aws/rackspace account...

If I need all that for a web-scraper it better be for a big project.

The CPAN modules in the linked article can all be installed and run as a non-privileged user (via either local::lib or perlbrew, etc.). And there are no daemons, nothing running as root, nothing listening on any ports, no configuration or tuning to think about, and it'll work everywhere from my macbook to my dev-server running linux or BSD or Solaris or whatever.

BTW, Mojolicious (http://mojolicio.us) is really great stuff. Outside of having a reasonably up-to-date version of Perl (5.10.1 or higher), it's got no external dependencies, not even other CPAN modules. It's fast, flexible, easy to use, and easy to deploy just about anywhere. sri++
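To give a flavor of the scraping side, here is a minimal sketch using Mojo::DOM's CSS selectors on an inline HTML snippet (the snippet and selectors are just illustrative; for a live page you would get the same kind of object from Mojo::UserAgent):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Parse a small inline snippet; for a live page you would use
# Mojo::UserAgent->new->get($url)->res->dom instead.
my $dom = Mojo::DOM->new(<<'HTML');
<html><head><title>Hello</title></head>
<body><a href="/a">A</a><a href="/b">B</a></body></html>
HTML

# jQuery-style CSS selectors
print $dom->at('title')->text, "\n";    # prints "Hello"
print "$_\n" for $dom->find('a')->map(sub { $_->{href} })->each;
```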


Mojolicious is featured more extensively in Part 1 of my article: http://news.ycombinator.com/item?id=5159452 ;)


Interesting! Do you have any benchmarks?



