
I've used Scrapy a lot. Just my opinion:

1. Instead of creating a global urls variable, use the start_requests method.

2. Don't use BeautifulSoup for parsing; use CSS or XPath selectors.

3. If you are following links across many pages over and over again, use CrawlSpider with Rule.



Can you please give some details about your second point? What's wrong with BeautifulSoup?


Using CSS and XPath to select elements is very natural for web pages. BS4 has very limited CSS selector support and no XPath support at all.


It is very slow. But personally, I prefer to write my crawlers in Go (custom code, not Colly).


Try Parsel: https://github.com/scrapy/parsel

It's way faster and has better support for CSS selectors.


> But personally, I prefer to write my crawlers in Go (custom code, not Colly).

This is my current setup as well; I've been scraping on and off for 20+ years now.


What's your problem with Colly? [0]

[0] http://go-colly.org/


Mostly that I started my crawler before learning about Colly and it didn't make sense to rewrite the code.

By "not Colly" I just wanted to remark that in Go it is relatively easy to write a crawler from scratch.



