This attack is particularly interesting. The attackers targeted the massively popular 'requests' package on PyPI, used bitsquatting to generate typosquat candidates, and the end result was ransomware being deployed.
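For context, bitsquat candidates are names that differ from the target by a single flipped bit, so a machine with a memory error can resolve the wrong package name. A minimal sketch of generating them (the helper name and the validity alphabet are my assumptions; PyPI normalizes names to roughly this character set):

```python
import string

def bitsquat_candidates(name: str) -> set[str]:
    """Generate single-bit-flip variants of a package name,
    keeping only variants made of characters that are valid
    in a normalized PyPI package name."""
    valid = set(string.ascii_lowercase + string.digits + "-._")
    candidates = set()
    for i, ch in enumerate(name):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped != ch and flipped in valid:
                candidates.add(name[:i] + flipped + name[i + 1:])
    return candidates

# e.g. flipping the low bit of 'r' (0x72) gives 's' (0x73),
# so "sequests" is one of the candidates
print(sorted(bitsquat_candidates("requests")))
```

Registering even a handful of these is cheap for an attacker, which is part of what makes the technique attractive.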
Totally agree. It feels like there is a pretty strong inverse correlation between standard library size and the average depth of a dependency tree for projects in a given language. In our world, that is pretty close to attack surface.
Rust is another example of this. Just bringing in grpc and protobuf pulls in about a hundred dependencies, some of them seemingly unrelated. For a language aimed at avoiding security bugs, I find this to be an issue. A good dependency manager and a small (or optionally absent) stdlib have led to highly granular dependencies and to bringing in giant libs for tiny bits of functionality.
We've found a lot of open-source packages that are authored by (well, released by authors identified by) disposable email addresses. We were shocked to find companies doing this, too.
The reason is obvious: people crawl pypi.org/github.com/npmjs.com and email their job posts or product launches. Every platform that requires an email address and shows it publicly will necessarily get a lot of disposable ones.
(Disclaimer: I work at Phylum, which has a very similar capability)
Not all of it has to be manual. Some vulnerabilities come with enough information to deduce reachability with a high degree of confidence using some slightly clever automation.
Not all vulns come with this information, but as time goes on the percentage that do is increasing. I'm very optimistic that automation plus a bit of human curation can drastically improve the S/N for open-source library vulns.
A nice property of this is: you only have to solve it once per vuln. If you look at the total set of vulns (and temporarily ignore super old C stuff) it's not insurmountable at all.
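To make the automation idea concrete: if an advisory names the vulnerable function (as OSV-style advisories increasingly do), a crude first-pass reachability check is just "does this file import that symbol and call it?" A sketch using Python's `ast` module (the function name and its simplifications are mine; a real system would resolve aliases across modules and build a proper call graph):

```python
import ast

def calls_vulnerable_symbol(source: str, module: str, func: str) -> bool:
    """Crude reachability check: does this source file do
    `from <module> import <func>` and then call it?"""
    tree = ast.parse(source)
    imported = False
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module == module:
            for alias in node.names:
                if alias.name == func:
                    imported = True
                    names.add(alias.asname or alias.name)
    if not imported:
        return False
    # Look for a direct call to any local name bound to the symbol.
    return any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in names
        for node in ast.walk(tree)
    )

src = "from yaml import load\nconfig = load(raw)"
print(calls_vulnerable_symbol(src, "yaml", "load"))  # prints True
```

Even this naive check separates "the vulnerable function is actually invoked" from "the package merely appears in the lockfile", which is where most of the noise comes from.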
I support this strongly.
Given that a software engineer should be able to understand and write code, it's a no-brainer to go the scraping route.
Even I, with absolutely no CS background (never studied it, but it's my digital life), started to learn Python for fun and did some scraping because I'm too lazy in some respects.
Scrape your info.
We're building a solution to exactly this problem at Phylum. I'm not trying to be a sales shill, but if anyone is interested in discussing ideas on how to best defend open-source libraries from these types of attacks, please get in touch. I'd love to hear from you!
We're consuming everything we can about a package to figure this out. We've built a static analysis system to reason about code (it's not perfect, but we're getting better and better). We process all the data we can get, then build analytics, heuristics, and ML models to extract evidence. The evidence is then pieced together to identify software supply chain risk.
In this case there is a lot of signal to show both bad and suspicious things are happening.
1. Obfuscation: obfuscated code produces a comparatively deep AST, which isn't difficult to identify.
2. Command execution: curl, wget, and LOLBINs like certutil are pretty easy to identify. This isn't a slam dunk every time you see it, but it adds evidence to a potentially malicious claim.
3. URLs: These are uncommon in libraries and add evidence.
4. Pre/Post install scripts: These are fairly commonly used for other things as well, but invoking node on a source file that is likely obfuscated is a good sign something suspicious is happening.
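Signal 1 above is simple to compute. Here's a minimal sketch of the AST-depth idea for Python packages (the threshold and function name are illustrative assumptions, not Phylum's actual implementation):

```python
import ast

def ast_depth(node: ast.AST) -> int:
    """Maximum nesting depth of a Python AST. Obfuscated code
    (packed byte lists, chained comprehensions, deeply nested
    expressions) tends to score far higher than typical
    hand-written code."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return 1
    return 1 + max(ast_depth(child) for child in children)

plain = ast.parse("x = 1 + 2")
packed = ast.parse("exec(''.join(chr(c ^ 7) for c in [103, 124, 99]))")
print(ast_depth(plain), ast_depth(packed))
```

On its own a depth score proves nothing, but combined with the other signals (command execution, embedded URLs, install hooks) it contributes useful evidence.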
We're trying to build everything fast enough to make the target far less attractive for attackers before it gets a lot worse.
I'm done trying to cast spells at Make