Crawlee isn’t any less configurable than Scrapy. It just uses different, and in my personal opinion more approachable, patterns. That makes it easier to start with, but you can still tweak whatever you want. Btw, you can add middleware via the Crawlee Router.
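To give a concrete flavor, here’s roughly what per-request logic looks like with the router, adapted from the README (the handler body is just an illustration): cross-cutting concerns like logging, filtering, or tagging live in handlers instead of a middleware stack.

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    # Handlers registered on the router run for every matching request,
    # so this is where "middleware-style" logic goes.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        title = context.soup.title.string if context.soup.title else None
        await context.push_data({'url': context.request.url, 'title': title})
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])

if __name__ == '__main__':
    asyncio.run(main())
```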
> Crawlee isn’t any less configurable than Scrapy.
Oh, then I have obviously overlooked how one would be able to determine if a proxy has been blocked and evict it from the pool <https://github.com/rejoiceinhope/scrapy-proxy-pool/blob/b833...>. Or how to use an HTTP cache independent of the "browser" cache (e.g. to allow short-circuiting the actual request if I can prove it is not stale for my needs, which enables recrawls to fix logic bugs, or even downloading the actual request-response payloads for making better tests) https://docs.scrapy.org/en/2.11/topics/downloader-middleware...
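For reference, the proxy-eviction case is a few lines of downloader middleware in Scrapy. A minimal sketch, with the pool contents and the blocked-status heuristic as placeholders:

```python
# Minimal sketch of a proxy-evicting downloader middleware.
# Register it under DOWNLOADER_MIDDLEWARES in settings.py.

class ProxyEvictionMiddleware:
    BLOCKED_STATUSES = {403, 429}  # heuristic; tune per target site

    def __init__(self):
        # Placeholder pool; a real one would come from settings or a proxy service.
        self.pool = ['http://proxy-a:8000', 'http://proxy-b:8000']

    def process_request(self, request, spider):
        if self.pool:
            request.meta['proxy'] = self.pool[0]

    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        if response.status in self.BLOCKED_STATUSES and proxy in self.pool:
            spider.logger.info('Evicting blocked proxy %s', proxy)
            self.pool.remove(proxy)
            # Reschedule the request; process_request will pick a fresh proxy.
            return request.replace(dont_filter=True)
        return response
```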
Unless you meant what I said about "pip install -e && echo glhf", in which case, yes, it's a "simple matter of programming" into a framework that was not designed to be extended.
Cache management is also what I had in mind. I've been using golang+colly and the default caching behavior is just different enough from what I need. I haven't written a custom cache middleware yet, but I'm getting to that point.
Technically it can. You can log in with the PlaywrightCrawler class without issue. The question is whether there’s 2FA as well and how that’s handled. Crawlee doesn’t have any abstraction for handling 2FA, because it depends a lot on what verification options are supported on the SSO side. So that part would need a custom implementation within Crawlee.
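To illustrate the login part, here’s a minimal sketch (selectors, credentials, and URLs are placeholders, and the import path assumes the current Python release):

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def login_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page  # the underlying Playwright page
        await page.fill('#username', 'user@example.com')
        await page.fill('#password', 'correct-horse-battery-staple')
        await page.click('button[type="submit"]')
        await page.wait_for_load_state('networkidle')
        # The session is now authenticated; enqueue the pages you actually want.
        await context.enqueue_links()

    await crawler.run(['https://example.com/login'])

asyncio.run(main())
```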
But I personally think it does some things a little easier, a little faster, and a little more conveniently than the other libraries and tools out there.
There’s one thing the JS version of Crawlee has which unfortunately isn’t in Python yet, but it will be there soon. AFAIK it’s unique among all libraries: it automatically detects whether a headless browser is needed or if plain HTTP will suffice, and uses the more performant option.
We tried a self-hosted OCR model a few years ago, but the quality and speed weren’t great. From experience, it’s usually better to reverse engineer the APIs. The more complicated they are, the less they change. So it can sometimes be painful to set up the scrapers, but once they work, they tend to be more stable than other methods.
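To make that concrete, a hypothetical example of what it usually boils down to once you’ve found the JSON endpoint behind a page (URL, params, and field names are made up):

```python
import httpx

# Call the discovered endpoint directly instead of rendering the HTML page.
response = httpx.get(
    'https://example.com/api/v2/products',
    params={'page': 1, 'page_size': 50},
    headers={'User-Agent': 'Mozilla/5.0', 'Accept': 'application/json'},
)
response.raise_for_status()
for item in response.json()['items']:
    print(item['id'], item['name'])
```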
Data pollution is real. So are location-specific results, personalized results, A/B testing, and, my favorite, badly implemented websites.
When you encounter this, you can try scraping the data from different locations, with various tokens, cookies, referrers, etc., and often you can find a pattern that makes the data consistent. Websites hate scraping, but they hate showing wrong data to human users even more. So if you resemble a legit user, you’ll most likely get correct data. But of course, there are exceptions.
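A hypothetical sketch of that probing: request the same URL with a few header variants and compare whatever signal matters for your data (here just status code and body length):

```python
import asyncio

import httpx

# Illustrative header variants; in practice you'd also vary cookies, tokens, and exit IPs.
VARIANTS = [
    {'Accept-Language': 'en-US,en;q=0.9'},
    {'Accept-Language': 'en-US,en;q=0.9', 'Referer': 'https://www.google.com/'},
    {'Accept-Language': 'de-DE,de;q=0.9'},
]

async def probe(url: str) -> None:
    async with httpx.AsyncClient(follow_redirects=True) as client:
        for headers in VARIANTS:
            response = await client.get(url, headers=headers)
            # Compare whatever field actually matters, e.g. a price or result count.
            print(headers.get('Accept-Language'), response.status_code, len(response.text))

asyncio.run(probe('https://example.com/product/123'))
```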
It’s an “old” law that did not consider many intricacies of the internet and the platforms that exist on it, and it’s mostly been made obsolete by EU case law, which has shrunk the definition of a protected database under this law so much that it’s practically inapplicable to web scraping.
(Not my opinion. I attended a major global law firm’s seminar on this topic a month ago and this is what they said.)
Sorry about the confusion. Some features, like the tiered proxies, are not documented properly. You’re absolutely right. Updates will come soon.
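Until the docs catch up, here’s roughly what the tiered proxy setup looks like (the URLs are placeholders; tiers are ordered from cheapest to most robust, and the crawler escalates when a tier keeps getting blocked):

```python
from crawlee.proxy_configuration import ProxyConfiguration

# Placeholder URLs; tier 0 is tried first, higher tiers on repeated blocking.
proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        ['http://cheap-datacenter-proxy.example.com:8000'],
        ['http://expensive-residential-proxy.example.com:8000'],
    ],
)

# Then pass it to a crawler via its proxy_configuration parameter.
```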
We wanted to include as many features in the initial release as possible, because we have a local Python community conference coming up tomorrow and we wanted to have the library ready for that.
More docs will come soon. I promise. And thanks for the shout.
I literally had to go through the entire codebase, that’s how lacking the documentation is. It’s boring to write docs, but imo it’s the lowest-hanging fruit for getting people moving down that Crawlee -> Apify funnel.
Yeah, I agree, keeping the source HTML is great for debugging or retro-fixing issues. We also like to take screenshots on important errors when running headless.
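For anyone curious, a minimal sketch of that pattern with plain Playwright calls (the paths and naming scheme are illustrative):

```python
from pathlib import Path

async def save_debug_artifacts(page, request_url: str) -> None:
    """Dump the page HTML and a full-page screenshot when a handler fails."""
    safe_name = request_url.replace('://', '_').replace('/', '_')
    out_dir = Path('debug')
    out_dir.mkdir(exist_ok=True)
    (out_dir / f'{safe_name}.html').write_text(await page.content())
    await page.screenshot(path=out_dir / f'{safe_name}.png', full_page=True)
```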
If it doesn’t work, please open an issue. We know from the community that it works, but we don’t have tests specifically for Lambda. It should work, though, and we’ll help if it doesn’t.