Open-source drivers according to Habana

AshamedCaptain · on March 31, 2022

Kernel maintainers tend to refuse drivers that only work with proprietary user-space, so I guess this is just one way to workaround that.

hansendc · on March 31, 2022

It's not just drivers. It's really about ensuring that the folks that maintain the kernel have a way to test the code they maintain. The reasons that we (the kernel maintainers) have for this requirement are varied. But, for me, it's really nice to have at least one open source implementation that can test the kernel code. Without that, the kernel code can bit rot too easily.

Even better is if an open source implementation is in the kernel tree, like in tools/testing/selftests. That makes it even less likely that the kernel code gets broken.

Disclaimer: I work on Linux at Intel, although not on drivers like this Habana one.

eikenberry · on March 31, 2022

One of the points of having the drivers in kernel is that means they kernel can actually run on that hardware. In addition to allowing for testing as others have pointed out, it is also a way to make sure that drivers aren't used to restrict access to the hardware. It ensures the freedom of the platform.

10000truths · on March 31, 2022

Or they could pull an NVidia, and dedicate a whole in-house kernel team to maintaining an out-of-tree kernel module.

my123 · on March 31, 2022

For NVIDIA? More than one.

The Tegra stack uses a totally separate out-of-tree but GPLv2 kernel module (which also works on some dGPU SKUs). It's available at https://nv-tegra.nvidia.com/r/gitweb?p=linux-nvgpu.git;a=sum...

And then there's the partially closed kernel module stack, which is a different code base...

j16sdiz · on March 31, 2022

Background story: https://lwn.net/Articles/867168/

TLDR: Intel want some huge DRI-related in linux kernel. The DRI maintainer insist there need at least one open user mode user. So we have this Proof-of-Concept driver

drewg123 · on March 31, 2022

Thanks for this, without this I had no clue what the article was referring to.

It would be nice if there was a fairly stable kernel API (or ABI) so drivers like this didn't have to be in the kernel. Out of tree drivers are a nightmare to maintain.

I maintained some out-of-tree drivers for years at Myricom (version of myri10ge with valuable features nak'ed by netdev, MX HPC drivers). Doing this was a massive PITA. Pretty much every minor release brought with it some critical function changing the number of arguments, changing names, etc. RHEL updates were my own special version of hell, since their kernel X.Y.Z in no way resembled the upstream X.Y.Z It got so bad to support 2.6.9 through 3.x that the shim layer for the Linux driver was almost as big as the entire FreeBSD driver (where nobody cared that I implemented LRO).

10000truths · on March 31, 2022

This is by design. Linux doesn't want to pay the maintenance and performance costs of guaranteeing a stable in-kernel API/ABI:

https://www.kernel.org/doc/Documentation/process/stable-api-...

charcircuit · on March 31, 2022

You can make the same argument about user space compatibility. It's extra work and may prevent some improvements, but it's nice to have.

josephcsible · on March 31, 2022

> It would be nice if there was a fairly stable kernel API (or ABI) so drivers like this didn't have to be in the kernel.

Why? We want as many drivers as possible to be in the kernel.

rrss · on April 1, 2022

because many times people want to be able to run new drivers (e.g. with new hardware support) on an LTS kernel.

lots of vendors support this, and most do it by maintaining a version of their driver that can work with a wide range of kernel versions that detect and adapt to what APIs get added/renamed/removed.

josephcsible · on April 1, 2022

Citing being able to use an LTS kernel as the reason seems like circular logic. If every device had a mainlined driver, what would you need an LTS kernel for, rather than just always using the latest?

aseipp · on March 31, 2022

There's another preceding case here, from 2020, involving Qualcomm submitting a similar driver (and to some extent Microsoft), which is quietly linked to and worth looking at for some other history: https://lwn.net/Articles/821817/

The situation there is that Habana already had their driver submitted very early I suppose, and I guess the resistance wasn't high enough at the time to keep it out. Qualcomm later came around and their own AI 100 driver was rejected, on similar grounds that would have kept Habana out, had they been applied at the time. (Airlie even called out Greg at this time.)

The later scuffle (your OP link) is because the Habana driver eventually wanted to adopt DMA-BUF and P2P-DMA support in the driver, which the original developers intended for the GPU subsystem, so they consider this over the line, because the criteria for new GPU drivers is "a testable open source userspace". So, that work was rejected, but the driver itself wasn't pulled entirely. Just that particular series of patches was not applied.

Microsoft had a weirdly similar case where they wanted a virtualization driver for Linux that would effectively pass GPU compute through to Windows hosts running under HyperV, for the purposes of running Machine Learning compute workloads -- not graphics. (The underlying Windows component to handle these tasks does use DirectX, but only the DirectCompute part of it.) But, it wasn't out of hand rejected on the same principle; it's more like a VFIO passthrough device conceptually, and didn't need to use any DRI/DRM specific subsystems to accomplish that. But the basic outline is the same where the userspace component would be closed source, so the driver is just connecting a binary blob to a binary blob. It doesn't use any deeply involved APIs, but it's also not very useful for anyone except the WSL team. It's a bit of an inbetween case where it isn't quite the same thing, but it's not not the same thing. Strange one.

As of right now, looking at upstream:

- Habana now has DMA-BUF support, as of late last year, so presumably the minimal userspace given above was "good enough" for upstream, since they can presumably at least run minimal testing on the driver paths: https://github.com/torvalds/linux/commit/a9498ee575fa116e289...

- Microsoft's DXGI/whatever-it's-called driver for compute is still not upstream, but I think they ship it with their custom-by-default WSL2 kernel (`wsl -e uname -a` gives me `5.10.16.3-microsoft-standard-WSL2` right now). It was not rejected out of hand but they also didn't seem to mind if it didn't land immediately. I have no idea what it's status is.

- Qualcomm's driver for AI 100 was completely rejected immediately and I do not know of any further attempts to upstream it.

- And there are probably even more cases of this. I believe Xilinx has a driver for their (similarly closed) compiler + runtime stack included in Vitis, and I doubt it's going upstream soon (xocl/xcmgmnt)

So the rules in general aren't particularly conclusive. But it looks like most accelerator designs will eventually fall under the rules of the graphics subsystem, if they seek to scale through P2P/DMA designs. As a result of that, a lot of people will probably get blocked, but Habana to some extent got a first-mover advantage, I think.

Arguably if people want to complain about SynapseAI Core being unsuitable for production use, to some extent, they should also share a bit of blame with the Linux developers for that, if they consider the drivers a problem. I think this isn't an unreasonable position.

But ultimately this comes down to there being two different desires among people: the kernel developers' concerns aren't that every userspace stack for every accelerator, shipped to every production user, is fully open source. That might be the concern of some people who are users of the kernel and Linux (including some kernel developers themselves), but not "them" at large. Their concern might be more accurately stated as: they have enough tooling and information to maintain their own codebase and APIs reliably, given the hardware drivers they have. These are not the same objective, and this is a good example of that.

throw_a_grenade · on April 1, 2022

Oh, and that's what happens when there's no viable userspace: https://lore.kernel.org/lkml/CAFCwf10UrVGwGRqwPAfGUsyqqdp8Gx...

yboris · on March 31, 2022

Do I understand this right? It seems like Habana has some super-efficient and fast ML hardware, but you can't just drop in an ML project and start using it? For example, only a subset of TensorFlow or PyTorch is supported?

Is that right? If you want to use their hardware, you need to jump through some hoops?

https://docs.habana.ai/en/latest/Tensorflow_User_Guide/Tenso...

https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_...

j16sdiz · on March 31, 2022

If you want to use that, you need the closed-source driver.

The open-source driver is the minimal to make linux kernel developer happy.

yboris · on March 31, 2022

But my point is that even if you use some closed-source driver or whatever is required to "set up" your project - you still can't re-use your regular code wholesale, but will have to modify to fit within whatever features they support. Right?

my123 · on March 31, 2022

Unsupported layers will transparently fall back to CPU execution. You can then choose to implement those who can be as TPC kernels. It's fundamentally much less flexible than a GPU.

Efficiency of Habana hardware isn't that great either, but that's another story... (they're still using TSMC 16nm in 2022 notably). Where Habana has an advantage in some workloads is cost.

bertday · on March 31, 2022

The API of Tensorflow and PyTorch is quite large. Even Tensorflow has a dialect specifically for TPU, so the notion of accelerator independence (especially when considering performance) has not been realized yet (though this is still a work in progress).

Anyway, how could you reuse your code wholesale when one of the most common operations is calling “.cuda()” on a tensor?

naoqj · on March 31, 2022

...and then they'll complain that corporations would rather maintain their own forks of the kernel.

Aissen · on March 31, 2022

…and their customers will simply not buy their product if it needs a custom kernel.

_ehqz · on March 31, 2022

Okay, I clearly don't understand everything going on in this one. Why? There must be something going on that explains why we are letting people get away with breaking the rules the rest of us are supposed to be follow. Must be. Why?

Because if I understand anything at all about open source, it's that you don't get to post closed source as if it is open source; even if you try to resort to shenanigans and trickery to get around the rules.

So my question is this. Why are we letting Intel get away with loopholing the rules the rest of us would have to follow? Seems to me the best thing to do here is punish them like the petulant child they are being. Erase their code, and tell them to politely fuck off. Or at least more politely than Linus Torvalds did with Nvidia.

Also, I don't see why letting them put their code up is of any use in the first place if they are just going to do things like this to essentially break it. Like seriously folks, what in the ever living fuck?

If it were up to me, I'd have people hacking them just to show them their place, and putting EVERYTHING up online for all to use. And that would be after doing something like sending evidence of wrong doings to the justice departments of every single nation who wants to take a chew out of them. (of which there are likely many)

Why?

Because the fact we allow companies like Microsoft and Intel to continuously get away with all their bullshit that they keep on trying to pull more bullshit. It's not rocket science folks.

Time to apply the brake to their bullshitmobile and hard.

Topgamer7 · on March 31, 2022

> If it were up to me, I'd have people hacking them

Cut this garbage shit out. Hack on the driver implementation instead.

> Because if I understand anything at all about open source, it's that you don't get to post closed source as if it is open source

If it was upstreamed into the linux tree, then they have open sourced, with the appropriate license. So the driver is not fully functional, but what they have included could be the building blocks for someone to add the additional functionality.

It is a shitty practice, and if you want to drive adoption, this isn't going to do it.

At least they didn't make a marketing statement that they're open source purveyors and software saviors.