It would be nice if all packages (and each version) were cached somewhere global and then just referred to by a local environment, to deduplicate. I think Maven does this.
The downloaded packages themselves are shared in pip's cache. Once installed, though, they cannot be shared, for a number of reasons:
- Compiled Python bytecode can be generated and stored alongside the installed source files; sharing that installed package between envs with different Python versions would lead to weird bugs
- Installed packages are free to modify their own files (especially when they embed data files)
- Virtualenvs are essentially non-relocatable, and installed package manifests use absolute paths (see the sketch below)
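To make the bytecode and relocatability points concrete, here's a rough stdlib-only sketch; the module name and the choice of the `pip` script are just placeholder examples:

```python
# Minimal sketch of why an installed environment is hard to share verbatim.
# Assumes it runs inside a virtualenv; "some_module.py" and the "pip" console
# script are just example names.
import importlib.util
import sys
from pathlib import Path

# 1. Byte-compiled files are tagged with the interpreter version, so a .pyc
#    written next to a shared package only matches one Python.
print(importlib.util.cache_from_source("some_module.py"))
# -> __pycache__/some_module.cpython-312.pyc (tag varies per interpreter)

# 2. Console-script wrappers hard-code the absolute path of the venv's
#    interpreter in their shebang, which is a big part of why venvs are
#    not relocatable.
script = Path(sys.prefix) / "bin" / "pip"   # "Scripts" dir on Windows
if script.exists():
    print(script.read_text().splitlines()[0])
    # -> #!/home/user/project/.venv/bin/python
```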
Really, that shouldn't be a problem though. Downloaded packages are cached by pip, as are locally built wheels of source packages, meaning that once you've already installed a bunch of packages, installing the same ones into new environments is mostly a file extraction.
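For a rough sense of what's already sitting in that cache, something like this stdlib-only sketch works; it only counts the locally built wheels under the cache directory (HTTP downloads are stored separately as extension-less blobs):

```python
# Quick look at what pip has cached locally. "pip cache dir" (pip >= 20.1)
# prints the cache location; counting *.whl files under it captures the
# locally built wheels.
import subprocess
import sys
from pathlib import Path

cache_dir = Path(subprocess.run(
    [sys.executable, "-m", "pip", "cache", "dir"],
    capture_output=True, text=True, check=True,
).stdout.strip())

wheels = list(cache_dir.rglob("*.whl"))
total = sum(w.stat().st_size for w in wheels)
print(f"{len(wheels)} cached wheels, {total / 1e6:.1f} MB, in {cache_dir}")
```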
The real issue causing install slowness, IMHO, is that package maintainers have not gotten into the habit of being mindful of the size of their packages. If you install multiple data science packages, for instance, you can quickly end up with multi-GB virtualenvs, for the sole reason that package maintainers didn't bother to remove (or split out as an "extra") their test datasets, language translation files, static assets, etc.
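If you want to see where the bloat is in an existing environment, a rough stdlib-only sketch like this lists the heaviest installed distributions (note that `dist.files` can be None for some install formats, in which case those are skipped):

```python
# List the ten largest installed distributions in the current environment,
# using only importlib.metadata from the stdlib.
from importlib.metadata import distributions

sizes = {}
for dist in distributions():
    total = 0
    for f in (dist.files or []):
        try:
            total += f.locate().stat().st_size
        except OSError:
            pass  # listed in RECORD but missing on disk
    sizes[dist.metadata["Name"]] = total

for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{size / 1e6:8.1f} MB  {name}")
```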
It could be worse... you could be like the NuGet team and fail to identify "prevent cache from growing without bound" as a necessary feature. ;)
Last I checked, it was going on 7+ years for adding/updating last-used metadata on cached packages (because MS disabled last-access timestamps by default at the Windows filesystem level in Vista+ for performance reasons), to support automated pruning. https://github.com/NuGet/Home/issues/4980
(Admittedly, most of that was spent waiting for the glacially slow default versions of things to ship through the dotnet ecosystem.)
They can be; the pip developers just have to care about this. Nothing you described precludes file-based deduplication, which is what pnpm does for JS projects: it stores all library files in a global content-addressable directory and creates {sym,hard,ref}links depending on what your filesystem supports.
Being able to mutate files requires reflinks, but those are supported by e.g. XFS; you don't have to go to CoW filesystems, which have their own disadvantages.
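A toy version of the pnpm approach fits in a few lines of Python. The store path and function name here are made up for illustration, it only does hard links, and it assumes the store lives on the same filesystem as the environments (real tools also handle symlink/reflink fallbacks, permissions, and locking):

```python
# Content-addressed, file-level dedup: hash each file into a global store and
# replace duplicates with hard links to the stored copy.
import hashlib
import os
from pathlib import Path

STORE = Path.home() / ".cache" / "file-store"   # example location

def dedupe_tree(root: Path) -> None:
    STORE.mkdir(parents=True, exist_ok=True)
    for path in root.rglob("*"):
        if path.is_symlink() or not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        stored = STORE / digest[:2] / digest[2:]
        if not stored.exists():
            stored.parent.mkdir(exist_ok=True)
            os.link(path, stored)              # first copy seeds the store
        elif not path.samefile(stored):
            tmp = path.parent / (path.name + ".dedupe-tmp")
            os.link(stored, tmp)               # hard link to the stored copy
            os.replace(tmp, path)              # atomically swap it in

# dedupe_tree(Path(".venv/lib"))
```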
You can do something like that manually for virtualenvs too, by running rmlint on all of them in one go. It will pick an "original" for each duplicate file, and will replace duplicates with reflinks to it (or other types of links if needed). The downside is obvious: it has to be repeated as files change, but I've saved a lot of space this way.
Or just use a filesystem that supports deduplication natively like btrfs/zfs.
This is unreasonably dismissive: the `pip` authors care immensely about maintaining a Python package installer that successfully installs billions of distributions across disparate OSes and architectures each day. Adopting deduplication techniques that only work on some platforms, some of the time, means a more complicated codebase and harder-to-reproduce user-surfaced bugs.
It can be worth it, but it's not a matter of "care": it's a matter of bandwidth and relative priorities.
"Having other priorities" uses different words to say exactly the same thing. I'm guessing you did not look at pnpm. It works on all major operating systems; deduplication works everywhere too, which shows that it can be solved if needed. As far as I know, it has been developed by one guy in Ukraine.
Are there package name and version disclosure considerations when sharing packages between envs with hardlinks, and does that matter for this application?
Absolutely. Yes, they have many PMs, but they can always cache downloads in ~/.cache/pip or appdata/local/python/pip-cache-common. Packages don't change, so there's no need for separate caches; just dump them all in one folder and add basic locking on it.
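A toy version of the "one shared folder plus basic locks" idea, just to show it doesn't need much machinery. The cache path and helper name are made up, and real installers would want more robust locking (pip already honours PIP_CACHE_DIR if you want to point several environments at one cache):

```python
# Cross-platform-ish lock file around a shared download cache: O_CREAT|O_EXCL
# guarantees only one process creates the lock at a time.
import os
import time
from contextlib import contextmanager
from pathlib import Path

CACHE = Path.home() / ".cache" / "pip-cache-common"   # example location

@contextmanager
def cache_lock(timeout: float = 60.0):
    CACHE.mkdir(parents=True, exist_ok=True)
    lock = CACHE / ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock}")
            time.sleep(0.1)
    try:
        yield CACHE
    finally:
        os.close(fd)
        lock.unlink(missing_ok=True)

# with cache_lock() as cache_dir:
#     ...  # download/extract into the shared cache_dir
```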
I wonder how much traffic pressure could be taken off pip's remotes just by getting caching right.
Having a way to install all packages and all versions of Python somewhere central, and then referring to a particular combination of them from a virtual environment, would be fantastic. It's nice that an ecosystem of tools has grown up to fill this gap, but it does make good Python practice less approachable, as there are so many options.