2022-07-31 12:04:27

by Oded Gabbay

Subject: New subsystem for acceleration devices

Hi,
Greg and I talked a couple of months ago about preparing a new accel
subsystem for compute/acceleration devices that are not GPUs and I
think your drivers that you are now trying to upstream fit it as well.

Would you be open/interested in migrating your drivers to this new subsystem ?

Because there were no outstanding candidates, I have done so far only
a basic and partial implementation of the infrastructure for this
subsystem, but if you are willing to join I believe I can finish it
rather quickly.

At start, the new subsystem will provide only a common device
character (e.g. /dev/acX) so everyone will do open/close/ioctl on the
same device character. Also sysfs/debugfs entries will be under that
device and maybe an IOCTL to retrieve information.

In the future I plan to move some of habanalabs driver's code into the
subsystem itself, for common tasks such as memory management, dma
memory allocation, etc.

Of course, you will be able to add your own IOCTLs as you see fit.
There will be a range of IOCTLs which are device-specific (similar to
drm).

wdyt ?

Thanks,
Oded


2022-07-31 15:54:43

by Greg Kroah-Hartman

Subject: Re: New subsystem for acceleration devices

On Sun, Jul 31, 2022 at 02:45:34PM +0300, Oded Gabbay wrote:
> Hi,
> Greg and I talked a couple of months ago about preparing a new accel
> subsystem for compute/acceleration devices that are not GPUs and I
> think your drivers that you are now trying to upstream fit it as well.
>
> Would you be open/interested in migrating your drivers to this new subsystem ?
>
> Because there were no outstanding candidates, I have done so far only
> a basic and partial implementation of the infrastructure for this
> subsystem, but if you are willing to join I believe I can finish it
> rather quickly.
>
> At start, the new subsystem will provide only a common device
> character (e.g. /dev/acX) so everyone will do open/close/ioctl on the
> same device character. Also sysfs/debugfs entries will be under that
> device and maybe an IOCTL to retrieve information.
>
> In the future I plan to move some of habanalabs driver's code into the
> subsystem itself, for common tasks such as memory management, dma
> memory allocation, etc.
>
> Of course, you will be able to add your own IOCTLs as you see fit.
> There will be a range of IOCTLs which are device-specific (similar to
> drm).
>
> wdyt ?

That sounds reasonable to me as a sane plan forward, thanks for working
on this.

greg k-h

2022-08-01 03:26:27

by Yuji Ishikawa

Subject: RE: New subsystem for acceleration devices

> -----Original Message-----
> From: Greg Kroah-Hartman <[email protected]>
> Sent: Monday, August 1, 2022 12:38 AM
> To: Oded Gabbay <[email protected]>
> Cc: ishikawa yuji($B@P@n(B $BM*;J(B $B!{#R#D#C""#A#I#T#C!{#E#A3+(B)
> <[email protected]>; Jiho Chu <[email protected]>; Arnd
> Bergmann <[email protected]>; Linux-Kernel@Vger. Kernel. Org
> <[email protected]>
> Subject: Re: New subsystem for acceleration devices
>
> On Sun, Jul 31, 2022 at 02:45:34PM +0300, Oded Gabbay wrote:
> > Hi,
> > Greg and I talked a couple of months ago about preparing a new accel
> > subsystem for compute/acceleration devices that are not GPUs and I
> > think your drivers that you are now trying to upstream fit it as well.
> >
> > Would you be open/interested in migrating your drivers to this new
> subsystem ?
> >
> > Because there were no outstanding candidates, I have done so far only
> > a basic and partial implementation of the infrastructure for this
> > subsystem, but if you are willing to join I believe I can finish it
> > rather quickly.
> >
> > At start, the new subsystem will provide only a common device
> > character (e.g. /dev/acX) so everyone will do open/close/ioctl on the
> > same device character. Also sysfs/debugfs entries will be under that
> > device and maybe an IOCTL to retrieve information.
> >
> > In the future I plan to move some of habanalabs driver's code into the
> > subsystem itself, for common tasks such as memory management, dma
> > memory allocation, etc.
> >
> > Of course, you will be able to add your own IOCTLs as you see fit.
> > There will be a range of IOCTLs which are device-specific (similar to
> > drm).
> >
> > wdyt ?
>
> That sounds reasonable to me as a sane plan forward, thanks for working on
> this.
>
> greg k-h

Hi,
Thank you for your suggestion.
I'm really interested in the idea of having a dedicated subsystem for accelerators.
Let me see whether the Visconti DNN driver can be rewritten to fit the framework.
As the Visconti SoC has several accelerators besides the DNN,
having a general/common interface will be a big benefit for maintaining our drivers over the long term.

I've heard that the framework already has a basic implementation.
Do you have some sample code, or an outline of the ideas behind the framework?
Would you share them with me, if that is not too much trouble?

If you need further information on the current DNN driver, please let me know.
Also, I can provide information on other accelerator drivers which are currently being cleaned up for contribution.

Regards,
Yuji


2022-08-01 09:31:22

by Oded Gabbay

Subject: Re: New subsystem for acceleration devices

On Mon, Aug 1, 2022 at 5:35 AM <[email protected]> wrote:
>
> > -----Original Message-----
> > From: Greg Kroah-Hartman <[email protected]>
> > Sent: Monday, August 1, 2022 12:38 AM
> > To: Oded Gabbay <[email protected]>
> > Cc: ishikawa yuji(石川 悠司 ○RDC□AITC○EA開)
> > <[email protected]>; Jiho Chu <[email protected]>; Arnd
> > Bergmann <[email protected]>; Linux-Kernel@Vger. Kernel. Org
> > <[email protected]>
> > Subject: Re: New subsystem for acceleration devices
> >
> > On Sun, Jul 31, 2022 at 02:45:34PM +0300, Oded Gabbay wrote:
> > > Hi,
> > > Greg and I talked a couple of months ago about preparing a new accel
> > > subsystem for compute/acceleration devices that are not GPUs and I
> > > think your drivers that you are now trying to upstream fit it as well.
> > >
> > > Would you be open/interested in migrating your drivers to this new
> > subsystem ?
> > >
> > > Because there were no outstanding candidates, I have done so far only
> > > a basic and partial implementation of the infrastructure for this
> > > subsystem, but if you are willing to join I believe I can finish it
> > > rather quickly.
> > >
> > > At start, the new subsystem will provide only a common device
> > > character (e.g. /dev/acX) so everyone will do open/close/ioctl on the
> > > same device character. Also sysfs/debugfs entries will be under that
> > > device and maybe an IOCTL to retrieve information.
> > >
> > > In the future I plan to move some of habanalabs driver's code into the
> > > subsystem itself, for common tasks such as memory management, dma
> > > memory allocation, etc.
> > >
> > > Of course, you will be able to add your own IOCTLs as you see fit.
> > > There will be a range of IOCTLs which are device-specific (similar to
> > > drm).
> > >
> > > wdyt ?
> >
> > That sounds reasonable to me as a sane plan forward, thanks for working on
> > this.
> >
> > greg k-h
>
> Hi,
> Thank you for your suggestion.
> I'm really interested in the idea to have a dedicated subsystem for accelerators.
> Let me challenge if the Visconti DNN driver can be re-written to the framework.
> As Visconti SoC has several accelerators as well as DNN,
> having a general/common interface will be a big benefit for us to maintain drivers for a long time.
>
> I've heard that the framework has some basic implementation.
> Do you have some sample code or enumeration of idea to describe the framework?
> Would you share them with me? if that does not bother you.
>
Great to hear that!

I will need a couple of days to complete the code and clean it up
because I stopped working on it a couple of months ago as there were
no other candidates at the time.
Once I have it ready I will put it in my linux repo and send you a branch name.

In the meantime, I'd like to describe at a high level how this
framework is going to work.

I'll start with the main theme of this framework which is allowing
maximum flexibility to the different device drivers. This is because
in my opinion there is and always will be a large variance between
different compute accelerators, which stems from the fact their h/w is
designed for different purposes. I believe that it would be nearly
impossible to create a standard acceleration API that will be
applicable to all compute accelerators.

Having said that, there are many things that can be common. I'll just
name here a few things:

- Exposing devices in a uniform way (same device character files) and
managing the major/minor/open/close (with callbacks to the open/close
of the device driver)

- Defining a standard user API for monitoring applications that
usually run in a data-center. There is a big difference
between the acceleration uAPI and a monitoring uAPI and while the
former is highly specialized, the latter can be standardized.

- Common implementation of a memory manager that will provide services
for allocating kernel memory that can be mmap'ed by the user.

- For devices with on-chip MMU, common memory manager for allocating
device memory and managing it.

- Integration with dma-buf for interoperability with other drivers (a
minimal export sketch follows this list).
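
To make the dma-buf point concrete, here is a minimal sketch of how a
driver could export one of its buffers through the existing dma-buf
core API; the my_accel_* names and the do-nothing callbacks are
placeholders, not part of any proposed framework:

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/scatterlist.h>
#include <linux/fcntl.h>

/* Placeholder for whatever buffer object the driver already manages. */
struct my_accel_buf {
	struct sg_table *sgt;
	size_t size;
};

static struct sg_table *my_accel_map(struct dma_buf_attachment *attach,
				     enum dma_data_direction dir)
{
	struct my_accel_buf *buf = attach->dmabuf->priv;

	/* A real driver would DMA-map the table for attach->dev here. */
	return buf->sgt;
}

static void my_accel_unmap(struct dma_buf_attachment *attach,
			   struct sg_table *sgt,
			   enum dma_data_direction dir)
{
	/* ...and undo that mapping here. */
}

static void my_accel_release(struct dma_buf *dmabuf)
{
	/* Drop the driver's reference on the backing buffer. */
}

static const struct dma_buf_ops my_accel_dmabuf_ops = {
	.map_dma_buf	= my_accel_map,
	.unmap_dma_buf	= my_accel_unmap,
	.release	= my_accel_release,
};

static struct dma_buf *my_accel_export(struct my_accel_buf *buf)
{
	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);

	exp_info.ops   = &my_accel_dmabuf_ops;
	exp_info.size  = buf->size;
	exp_info.flags = O_RDWR | O_CLOEXEC;
	exp_info.priv  = buf;

	/* The result can be handed to userspace with dma_buf_fd(). */
	return dma_buf_export(&exp_info);
}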

The initial implementation of the framework will expose two device
character files per device. One will be used for the main/compute
operations and the other one will be used for monitoring applications.
This is done in case there are some limitations on the main device
file (e.g. allowing only a single compute application to be connected
to the device).
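
From a userspace point of view the split could look roughly like this;
the /dev names are only placeholders for whatever the framework ends up
creating:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Main node: may be limited to a single compute application. */
	int compute_fd = open("/dev/ac0", O_RDWR | O_CLOEXEC);

	/* Second node: meant for data-center monitoring tools. */
	int monitor_fd = open("/dev/ac_control0", O_RDWR | O_CLOEXEC);

	if (compute_fd < 0 || monitor_fd < 0)
		perror("open");

	if (monitor_fd >= 0)
		close(monitor_fd);
	if (compute_fd >= 0)
		close(compute_fd);
	return 0;
}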

Each driver will call a register function during its probe function
(very similar to what is done in the drm subsystem). This will
register the device in the accel framework and expose the device
character files. Every call on those files
(e.g. open, ioctl) will then go through the accel subsystem first and
then will go to the callbacks of the specific device drivers. The
framework will take care of allocating major and minor numbers and
handle these issues in a uniform way.
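
As a purely hypothetical sketch, registration from a PCI driver's probe
could look like the following; none of the accel_* names exist yet,
they only illustrate the flow described above:

#include <linux/pci.h>
#include <linux/err.h>

/* All accel_* types and functions below are illustrative only. */
static const struct accel_device_ops my_accel_ops = {
	.open  = my_accel_open,   /* called after the accel core's open() */
	.close = my_accel_close,
	.ioctl = my_accel_ioctl,  /* device-specific ioctls */
};

static int my_accel_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct my_accel_dev *mdev;

	mdev = devm_kzalloc(&pdev->dev, sizeof(*mdev), GFP_KERNEL);
	if (!mdev)
		return -ENOMEM;

	/*
	 * The accel core would allocate major/minor numbers, create
	 * /dev/acN (plus the monitoring node) and route every file
	 * operation through its own fops before calling my_accel_ops.
	 */
	mdev->adev = accel_device_register(&pdev->dev, &my_accel_ops, mdev);
	if (IS_ERR(mdev->adev))
		return PTR_ERR(mdev->adev);

	pci_set_drvdata(pdev, mdev);
	return 0;
}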

For now, I plan to define a single ioctl in the common part, which is
an information ioctl, allowing the user to retrieve various
information on the device (fixed or dynamic). I don't want to jump
ahead of myself and define other common ioctls before gaining some
traction. As I wrote above, each driver will be allowed to define its
own custom ioctls.
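
A uAPI header sketch along these lines would match that description;
the names, numbers and struct layouts are all assumptions, loosely
modeled on how drm reserves a command base for its drivers:

#include <linux/ioctl.h>
#include <linux/types.h>

#define ACCEL_IOCTL_MAGIC	'A'

/* The single common ioctl planned for the first version. */
struct accel_info_args {
	__u32 op;	/* which fixed or dynamic property to query */
	__u32 pad;
	__u64 value;	/* returned value */
};
#define ACCEL_IOCTL_INFO \
	_IOWR(ACCEL_IOCTL_MAGIC, 0x00, struct accel_info_args)

/* Numbers from this offset up would be owned by each device driver. */
#define ACCEL_COMMAND_BASE	0x40

/* Example of a device-specific ioctl a driver could define. */
struct my_driver_submit_args {
	__u64 cb_addr;	/* purely illustrative fields */
	__u32 cb_size;
	__u32 flags;
};
#define MY_DRIVER_IOCTL_SUBMIT \
	_IOWR(ACCEL_IOCTL_MAGIC, ACCEL_COMMAND_BASE + 0x00, \
	      struct my_driver_submit_args)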

There will be an infrastructure to add custom sysfs and debugfs
entries. Once we start working, I'm sure we will find some properties
that we can move to the infrastructure itself (e.g. asking to reset
the device, operational status)
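
As one possible shape for that, the common code could create a
per-device debugfs directory and let drivers drop entries into it. The
debugfs calls below are the existing kernel API, while the directory
layout and the my_accel_dev fields are assumptions:

#include <linux/debugfs.h>
#include <linux/types.h>

/* Illustrative per-device state; the field names are made up. */
struct my_accel_dev {
	struct dentry *debugfs_root;
	u32 status;		/* operational status, exposed read-only */
	bool reset_request;	/* writing 1 could ask for a device reset */
};

static void my_accel_debugfs_init(struct my_accel_dev *mdev, const char *name)
{
	/* Created under the debugfs root here; the common code could
	 * just as well nest it under a shared "accel" directory. */
	mdev->debugfs_root = debugfs_create_dir(name, NULL);

	debugfs_create_u32("status", 0444, mdev->debugfs_root, &mdev->status);
	debugfs_create_bool("reset", 0200, mdev->debugfs_root,
			    &mdev->reset_request);
}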

I don't know if I want to add the memory manager we prepared in the
habanalabs driver on day one, because I would like to minimize the
effort needed to convert our drivers to using this framework. From
the above, I believe you can see the current effort is not large.

That's it from a high-level POV. As I wrote at the beginning, I'll get
back to you in a few days with a link to a git repo containing the
initial implementation.

> If you need further information on the current DNN driver, please let me know.
> Also, I can provide some for other accelerator drivers which are currently under cleaning up for contribution.
I looked at the patches you sent and I believe it will be a good match
for this framework.
If you have additional drivers that you think can be a good match, by
all means, please send links to me.

Thanks,
Oded

2022-08-02 18:09:45

by Jiho Chu

Subject: Re: New subsystem for acceleration devices

On Sun, 31 Jul 2022 14:45:34 +0300
Oded Gabbay <[email protected]> wrote:

> Hi,
> Greg and I talked a couple of months ago about preparing a new accel
> subsystem for compute/acceleration devices that are not GPUs and I
> think your drivers that you are now trying to upstream fit it as well.
>
> Would you be open/interested in migrating your drivers to this new subsystem ?
>
> Because there were no outstanding candidates, I have done so far only
> a basic and partial implementation of the infrastructure for this
> subsystem, but if you are willing to join I believe I can finish it
> rather quickly.
>
> At start, the new subsystem will provide only a common device
> character (e.g. /dev/acX) so everyone will do open/close/ioctl on the
> same device character. Also sysfs/debugfs entries will be under that
> device and maybe an IOCTL to retrieve information.
>
> In the future I plan to move some of habanalabs driver's code into the
> subsystem itself, for common tasks such as memory management, dma
> memory allocation, etc.
>
> Of course, you will be able to add your own IOCTLs as you see fit.
> There will be a range of IOCTLs which are device-specific (similar to
> drm).
>
> wdyt ?
>
> Thanks,
> Oded
>

Hi, Oded.
Thanks for sharing your idea. I'm really positive about a subsystem for AI acceleration devices.

The Samsung NPU driver is in the process of being upstreamed, so I am wondering whether the new subsystem can cover all of its operations.
I'd appreciate it if you could share the code when you're ready, so I can figure out whether our driver can be migrated.

Sincerely,
Jiho Chu






2022-08-02 19:09:00

by Oded Gabbay

Subject: Re: New subsystem for acceleration devices

On Tue, Aug 2, 2022 at 8:25 PM Jiho Chu <[email protected]> wrote:
>
> On Sun, 31 Jul 2022 14:45:34 +0300
> Oded Gabbay <[email protected]> wrote:
>
> > Hi,
> > Greg and I talked a couple of months ago about preparing a new accel
> > subsystem for compute/acceleration devices that are not GPUs and I
> > think your drivers that you are now trying to upstream fit it as well.
> >
> > Would you be open/interested in migrating your drivers to this new subsystem ?
> >
> > Because there were no outstanding candidates, I have done so far only
> > a basic and partial implementation of the infrastructure for this
> > subsystem, but if you are willing to join I believe I can finish it
> > rather quickly.
> >
> > At start, the new subsystem will provide only a common device
> > character (e.g. /dev/acX) so everyone will do open/close/ioctl on the
> > same device character. Also sysfs/debugfs entries will be under that
> > device and maybe an IOCTL to retrieve information.
> >
> > In the future I plan to move some of habanalabs driver's code into the
> > subsystem itself, for common tasks such as memory management, dma
> > memory allocation, etc.
> >
> > Of course, you will be able to add your own IOCTLs as you see fit.
> > There will be a range of IOCTLs which are device-specific (similar to
> > drm).
> >
> > wdyt ?
> >
> > Thanks,
> > Oded
> >
>
> Hi, Oded.
> Thanks for sharing your idea. And I'm really positive on the subsystem for ai acceleration devices.
>
> Samsung NPU driver is trying to upstream now, so I wonder new subsystem can cover all of the operations.
> I'll appreciate if you share the code when you ready, so I can figure out our driver can be migrated.
I'm working on it.
I'm also preparing a small demo driver so you will be able to better
understand how to integrate a driver with this subsystem.
I believe I will be able to post the link during the weekend.

Oded

>
> Sincerely,
> Jiho Chu
>
>
>
>
>

2022-08-03 05:33:24

by Yuji Ishikawa

Subject: RE: New subsystem for acceleration devices

> -----Original Message-----
> From: Oded Gabbay <[email protected]>
> Sent: Monday, August 1, 2022 5:21 PM
> To: ishikawa yuji(石川 悠司 ○RDC□AITC○EA開)
> <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>; Jiho Chu
> <[email protected]>; Arnd Bergmann <[email protected]>;
> Linux-Kernel@Vger. Kernel. Org <[email protected]>
> Subject: Re: New subsystem for acceleration devices
>
> On Mon, Aug 1, 2022 at 5:35 AM <[email protected]> wrote:
> >
> > > -----Original Message-----
> > > From: Greg Kroah-Hartman <[email protected]>
> > > Sent: Monday, August 1, 2022 12:38 AM
> > > To: Oded Gabbay <[email protected]>
> > > Cc: ishikawa yuji(石川 悠司 ○RDC□AITC○EA開)
> > > <[email protected]>; Jiho Chu <[email protected]>;
> > > Arnd Bergmann <[email protected]>; Linux-Kernel@Vger. Kernel. Org
> > > <[email protected]>
> > > Subject: Re: New subsystem for acceleration devices
> > >
> > > On Sun, Jul 31, 2022 at 02:45:34PM +0300, Oded Gabbay wrote:
> > > > Hi,
> > > > Greg and I talked a couple of months ago about preparing a new
> > > > accel subsystem for compute/acceleration devices that are not GPUs
> > > > and I think your drivers that you are now trying to upstream fit it as well.
> > > >
> > > > Would you be open/interested in migrating your drivers to this new
> > > subsystem ?
> > > >
> > > > Because there were no outstanding candidates, I have done so far
> > > > only a basic and partial implementation of the infrastructure for
> > > > this subsystem, but if you are willing to join I believe I can
> > > > finish it rather quickly.
> > > >
> > > > At start, the new subsystem will provide only a common device
> > > > character (e.g. /dev/acX) so everyone will do open/close/ioctl on
> > > > the same device character. Also sysfs/debugfs entries will be
> > > > under that device and maybe an IOCTL to retrieve information.
> > > >
> > > > In the future I plan to move some of habanalabs driver's code into
> > > > the subsystem itself, for common tasks such as memory management,
> > > > dma memory allocation, etc.
> > > >
> > > > Of course, you will be able to add your own IOCTLs as you see fit.
> > > > There will be a range of IOCTLs which are device-specific (similar
> > > > to drm).
> > > >
> > > > wdyt ?
> > >
> > > That sounds reasonable to me as a sane plan forward, thanks for
> > > working on this.
> > >
> > > greg k-h
> >
> > Hi,
> > Thank you for your suggestion.
> > I'm really interested in the idea to have a dedicated subsystem for
> accelerators.
> > Let me challenge if the Visconti DNN driver can be re-written to the
> framework.
> > As Visconti SoC has several accelerators as well as DNN, having a
> > general/common interface will be a big benefit for us to maintain drivers for a
> long time.
> >
> > I've heard that the framework has some basic implementation.
> > Do you have some sample code or enumeration of idea to describe the
> framework?
> > Would you share them with me? if that does not bother you.
> >
> Great to hear that!
>
> I will need a couple of days to complete the code and clean it up because I
> stopped working on it a couple of months ago as there were no other candidates
> at the time.
> Once I have it ready I will put it in my linux repo and send you a branch name.
>
> In the meantime, I'd like to describe at a high level how this framework is going
> to work.
>
> I'll start with the main theme of this framework which is allowing maximum
> flexibility to the different device drivers. This is because in my opinion there is
> and always will be a large variance between different compute accelerators,
> which stems from the fact their h/w is designed for different purposes. I believe
> that it would be nearly impossible to create a standard acceleration API that will
> be applicable to all compute accelerators.
>
> Having said that, there are many things that can be common. I'll just name here
> a few things:
>
> - Exposing devices in a uniform way (same device character files) and
> managing the major/minor/open/close (with callbacks to the open/close of the
> device driver)
>
> - Defining a standard user API for monitoring applications that usually run in a
> data-center. There is a big difference between the acceleration uAPI and a
> monitoring uAPI and while the former is highly specialized, the latter can be
> standardized.
>
> - Common implementation of a memory manager that will give services of
> allocating kernel memory that can be
> mmap'ed by the user.
>
> - For devices with on-chip MMU, common memory manager for allocating
> device memory and managing it.
>
> - Integration with dma-buf for interoperability with other drivers.
>
> The initial implementation of the framework will expose two device character
> files per device. One will be used for the main/compute operations and the
> other one will be used for monitoring applications.
> This is done in case there are some limitations on the main device file (e.g.
> allowing only a single compute application to be connected to the device).
>
> Each driver will call a register function during its probe function (very similar to
> what is done in the drm subsystem). This will register the device in the accel
> framework and expose the device character files. Every call on those files (e.g.
> open, ioctl) will then go through the accel subsystem first and then will go to
> the callbacks of the specific device drivers. The framework will take care of
> allocating major and minor numbers and handle these issues in a uniform way.
>
> For now, I plan to define a single ioctl in the common part, which is an
> information ioctl, allowing the user to retrieve various information on the device
> (fixed or dynamic). I don't want to jump ahead of myself and define other
> common ioctls before gaining some traction. As I wrote above, each driver will
> be allowed to define its own custom ioctls.
>
> There will be an infrastructure to add custom sysfs and debugfs entries. Once
> we start working, I'm sure we will find some properties that we can move to the
> infrastructure itself (e.g. asking to reset the device, operational status)
>
> I don't know if I want to add the memory manager we prepared in habanalabs
> driver at day one, because I would like to minimize the effort we have to convert
> our drivers to using this framework. From the above, I believe you can see the
> current effort is not large.
>
> That's it from high-level pov. As I wrote at the beginning, I'll get back to you in a
> few days with a link to a git repo containing the initial implementation.
>

Hi Oded,

Thank you for sharing the high-level idea with me.
I'm glad to hear that the DNN driver can be ported to the accelerator framework without too much effort. I think we can start with a minimum implementation and improve from there.
Currently the driver does not have an API for monitoring. I'll prepare some logic to match the framework.

Some of my interests are handling the IOMMU and DMA-BUF.
Currently the DNN driver has its own minimal implementation of those two.
Will the DMA-BUF feature of the accelerator framework be provided soon?

> > If you need further information on the current DNN driver, please let me know.
> > Also, I can provide some for other accelerator drivers which are currently
> under cleaning up for contribution.
> I looked at the patches you sent and I believe it will be a good match for this
> framework.
> If you have additional drivers that you think can be a goot match, by all means,
> please send links to me.
>
> Thanks,
> Oded

The Visconti DSP driver might be interesting to you, as it provides multiple functions and requires external firmware.
The problem is that I don't have an official public repository.
I wonder if it's possible to post some pieces of code to this ML.

Regards,
Yuji

2022-08-03 05:48:55

by Greg Kroah-Hartman

Subject: Re: New subsystem for acceleration devices

On Wed, Aug 03, 2022 at 04:39:12AM +0000, [email protected] wrote:
> Visconti DSP driver might be interesting for you as it provides multiple functions and requires external firmware.
> The problem is that I don't have official public repository.
> I wonder if it's possible to post some pieces of code to this ML.

You can always post code to this mailing list, that's what it is here
for! :)

2022-08-03 19:06:56

by Dave Airlie

Subject: Re: New subsystem for acceleration devices

On Sun, 31 Jul 2022 at 22:04, Oded Gabbay <[email protected]> wrote:
>
> Hi,
> Greg and I talked a couple of months ago about preparing a new accel
> subsystem for compute/acceleration devices that are not GPUs and I
> think your drivers that you are now trying to upstream fit it as well.

We've had some submissions for not-GPUs to the drm subsystem recently.

Intel GNA, Intel VPU, NVDLA, rpmsg AI processor unit.

why is creating a new subsystem at this time necessary?

Are we just creating a subsystem to avoid the open source userspace
consumer rules? Or do we have some concrete reasoning behind it?

Dave.

2022-08-03 21:13:57

by Oded Gabbay

Subject: Re: New subsystem for acceleration devices

On Wed, Aug 3, 2022 at 10:04 PM Dave Airlie <[email protected]> wrote:
>
> On Sun, 31 Jul 2022 at 22:04, Oded Gabbay <[email protected]> wrote:
> >
> > Hi,
> > Greg and I talked a couple of months ago about preparing a new accel
> > subsystem for compute/acceleration devices that are not GPUs and I
> > think your drivers that you are now trying to upstream fit it as well.
>
> We've had some submissions for not-GPUs to the drm subsystem recently.
>
> Intel GNA, Intel VPU, NVDLA, rpmsg AI processor unit.
>
> why is creating a new subsystem at this time necessary?
>
> Are we just creating a subsystem to avoid the open source userspace
> consumer rules? Or do we have some concrete reasoning behind it?
>
> Dave.

Hi Dave.
The reason it happened now is that I saw two drivers, which are
doing h/w acceleration for AI, trying to be accepted into the misc
subsystem.
Add to that the fact that I talked with Greg a couple of months ago
about doing a subsystem for any compute accelerators, which he was
positive about, and I thought it was a good opportunity to finally do it.

I also honestly think that I can contribute much to these drivers from
my experience with the habana driver (which is now deployed at scale at
AWS) and contribute code from the habana driver to a common framework
for AI drivers.

Regarding the open source userspace rules in drm - yes, I think your
rules are too limiting for the relatively young AI scene, and I saw at
the 2021 kernel summit that other people from the kernel community
think that as well.
But that's not the main reason, or even a reason at all for doing
this. After all, at least for habana, we open-sourced our compiler and
a runtime library. And Greg also asked those two drivers if they have
matching open-sourced user-space code.

And a final reason is that I thought this can also help in somewhat
reducing the workload on Greg. I saw in the last kernel summit there
was a concern about bringing more people to be kernel maintainers so I
thought this is a step in the right direction.

Oded

2022-08-03 21:17:53

by Oded Gabbay

Subject: Re: New subsystem for acceleration devices

On Wed, Aug 3, 2022 at 7:44 AM <[email protected]> wrote:
>
> > -----Original Message-----
> > From: Oded Gabbay <[email protected]>
> > Sent: Monday, August 1, 2022 5:21 PM
> > To: ishikawa yuji(石川 悠司 ○RDC□AITC○EA開)
> > <[email protected]>
> > Cc: Greg Kroah-Hartman <[email protected]>; Jiho Chu
> > <[email protected]>; Arnd Bergmann <[email protected]>;
> > Linux-Kernel@Vger. Kernel. Org <[email protected]>
> > Subject: Re: New subsystem for acceleration devices
> >
> > On Mon, Aug 1, 2022 at 5:35 AM <[email protected]> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Greg Kroah-Hartman <[email protected]>
> > > > Sent: Monday, August 1, 2022 12:38 AM
> > > > To: Oded Gabbay <[email protected]>
> > > > Cc: ishikawa yuji(石川 悠司 ○RDC□AITC○EA開)
> > > > <[email protected]>; Jiho Chu <[email protected]>;
> > > > Arnd Bergmann <[email protected]>; Linux-Kernel@Vger. Kernel. Org
> > > > <[email protected]>
> > > > Subject: Re: New subsystem for acceleration devices
> > > >
> > > > On Sun, Jul 31, 2022 at 02:45:34PM +0300, Oded Gabbay wrote:
> > > > > Hi,
> > > > > Greg and I talked a couple of months ago about preparing a new
> > > > > accel subsystem for compute/acceleration devices that are not GPUs
> > > > > and I think your drivers that you are now trying to upstream fit it as well.
> > > > >
> > > > > Would you be open/interested in migrating your drivers to this new
> > > > subsystem ?
> > > > >
> > > > > Because there were no outstanding candidates, I have done so far
> > > > > only a basic and partial implementation of the infrastructure for
> > > > > this subsystem, but if you are willing to join I believe I can
> > > > > finish it rather quickly.
> > > > >
> > > > > At start, the new subsystem will provide only a common device
> > > > > character (e.g. /dev/acX) so everyone will do open/close/ioctl on
> > > > > the same device character. Also sysfs/debugfs entries will be
> > > > > under that device and maybe an IOCTL to retrieve information.
> > > > >
> > > > > In the future I plan to move some of habanalabs driver's code into
> > > > > the subsystem itself, for common tasks such as memory management,
> > > > > dma memory allocation, etc.
> > > > >
> > > > > Of course, you will be able to add your own IOCTLs as you see fit.
> > > > > There will be a range of IOCTLs which are device-specific (similar
> > > > > to drm).
> > > > >
> > > > > wdyt ?
> > > >
> > > > That sounds reasonable to me as a sane plan forward, thanks for
> > > > working on this.
> > > >
> > > > greg k-h
> > >
> > > Hi,
> > > Thank you for your suggestion.
> > > I'm really interested in the idea to have a dedicated subsystem for
> > accelerators.
> > > Let me challenge if the Visconti DNN driver can be re-written to the
> > framework.
> > > As Visconti SoC has several accelerators as well as DNN, having a
> > > general/common interface will be a big benefit for us to maintain drivers for a
> > long time.
> > >
> > > I've heard that the framework has some basic implementation.
> > > Do you have some sample code or enumeration of idea to describe the
> > framework?
> > > Would you share them with me? if that does not bother you.
> > >
> > Great to hear that!
> >
> > I will need a couple of days to complete the code and clean it up because I
> > stopped working on it a couple of months ago as there were no other candidates
> > at the time.
> > Once I have it ready I will put it in my linux repo and send you a branch name.
> >
> > In the meantime, I'd like to describe at a high level how this framework is going
> > to work.
> >
> > I'll start with the main theme of this framework which is allowing maximum
> > flexibility to the different device drivers. This is because in my opinion there is
> > and always will be a large variance between different compute accelerators,
> > which stems from the fact their h/w is designed for different purposes. I believe
> > that it would be nearly impossible to create a standard acceleration API that will
> > be applicable to all compute accelerators.
> >
> > Having said that, there are many things that can be common. I'll just name here
> > a few things:
> >
> > - Exposing devices in a uniform way (same device character files) and
> > managing the major/minor/open/close (with callbacks to the open/close of the
> > device driver)
> >
> > - Defining a standard user API for monitoring applications that usually run in a
> > data-center. There is a big difference between the acceleration uAPI and a
> > monitoring uAPI and while the former is highly specialized, the latter can be
> > standardized.
> >
> > - Common implementation of a memory manager that will give services of
> > allocating kernel memory that can be
> > mmap'ed by the user.
> >
> > - For devices with on-chip MMU, common memory manager for allocating
> > device memory and managing it.
> >
> > - Integration with dma-buf for interoperability with other drivers.
> >
> > The initial implementation of the framework will expose two device character
> > files per device. One will be used for the main/compute operations and the
> > other one will be used for monitoring applications.
> > This is done in case there are some limitations on the main device file (e.g.
> > allowing only a single compute application to be connected to the device).
> >
> > Each driver will call a register function during its probe function (very similar to
> > what is done in the drm subsystem). This will register the device in the accel
> > framework and expose the device character files. Every call on those files (e.g.
> > open, ioctl) will then go through the accel subsystem first and then will go to
> > the callbacks of the specific device drivers. The framework will take care of
> > allocating major and minor numbers and handle these issues in a uniform way.
> >
> > For now, I plan to define a single ioctl in the common part, which is an
> > information ioctl, allowing the user to retrieve various information on the device
> > (fixed or dynamic). I don't want to jump ahead of myself and define other
> > common ioctls before gaining some traction. As I wrote above, each driver will
> > be allowed to define its own custom ioctls.
> >
> > There will be an infrastructure to add custom sysfs and debugfs entries. Once
> > we start working, I'm sure we will find some properties that we can move to the
> > infrastructure itself (e.g. asking to reset the device, operational status)
> >
> > I don't know if I want to add the memory manager we prepared in habanalabs
> > driver at day one, because I would like to minimize the effort we have to convert
> > our drivers to using this framework. From the above, I believe you can see the
> > current effort is not large.
> >
> > That's it from high-level pov. As I wrote at the beginning, I'll get back to you in a
> > few days with a link to a git repo containing the initial implementation.
> >
>
> Hi Oded,
>
> Thank you for sharing high level idea with me.
> I'm grad to hear that the DNN driver can be ported to the accelerator framework without too-much effort. I think we can start with minimum implementation and find better way.
> Currently the driver does not have API for monitoring. I'll prepare some logic to match the framework.
>
> Some of my interests are handling IOMMU and DMA-BUF.
> Currently the DNN driver has its own minimum implementation of those two.
> Will DMA-BUF feature of accelerator framework be provided soon?
I didn't plan for it to be in the initial release, but I'm sure we can
make it a priority to be added soon after we do the initial
integration and everything is accepted into the kernel.
Oded
>
> > > If you need further information on the current DNN driver, please let me know.
> > > Also, I can provide some for other accelerator drivers which are currently
> > under cleaning up for contribution.
> > I looked at the patches you sent and I believe it will be a good match for this
> > framework.
> > If you have additional drivers that you think can be a goot match, by all means,
> > please send links to me.
> >
> > Thanks,
> > Oded
>
> Visconti DSP driver might be interesting for you as it provides multiple functions and requires external firmware.
> The problem is that I don't have official public repository.
> I wonder if it's possible to post some pieces of code to this ML.
>
> Regards,
> Yuji

2022-08-03 23:43:30

by Daniel Stone

Subject: Re: New subsystem for acceleration devices

Hi Oded,

On Wed, 3 Aug 2022 at 21:21, Oded Gabbay <[email protected]> wrote:
> The reason it happened now is because I saw two drivers, which are
> doing h/w acceleration for AI, trying to be accepted to the misc
> subsystem.

Why misc?

> Regarding the open source userspace rules in drm - yes, I think your
> rules are too limiting for the relatively young AI scene, and I saw at
> the 2021 kernel summit that other people from the kernel community
> think that as well.
> But that's not the main reason, or even a reason at all for doing
> this. After all, at least for habana, we open-sourced our compiler and
> a runtime library. And Greg also asked those two drivers if they have
> matching open-sourced user-space code.
>
> And a final reason is that I thought this can also help in somewhat
> reducing the workload on Greg. I saw in the last kernel summit there
> was a concern about bringing more people to be kernel maintainers so I
> thought this is a step in the right direction.

Can you please explain what the reason is here?

Everything you have described - uniform device enumeration, common job
description, memory management helpers, unique job submission format,
etc - applies exactly to DRM. If open userspace is not a requirement,
and bypassing Greg's manual merging is a requirement, then I don't see
what the difference is between DRM and this new bespoke subsystem. It
would be great to have these differences enumerated in email as well
as in kerneldoc.

Cheers,
Daniel

2022-08-03 23:59:50

by Dave Airlie

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Thu, 4 Aug 2022 at 06:21, Oded Gabbay <[email protected]> wrote:
>
> On Wed, Aug 3, 2022 at 10:04 PM Dave Airlie <[email protected]> wrote:
> >
> > On Sun, 31 Jul 2022 at 22:04, Oded Gabbay <[email protected]> wrote:
> > >
> > > Hi,
> > > Greg and I talked a couple of months ago about preparing a new accel
> > > subsystem for compute/acceleration devices that are not GPUs and I
> > > think your drivers that you are now trying to upstream fit it as well.
> >
> > We've had some submissions for not-GPUs to the drm subsystem recently.
> >
> > Intel GNA, Intel VPU, NVDLA, rpmsg AI processor unit.
> >
> > why is creating a new subsystem at this time necessary?
> >
> > Are we just creating a subsystem to avoid the open source userspace
> > consumer rules? Or do we have some concrete reasoning behind it?
> >
> > Dave.
>
> Hi Dave.
> The reason it happened now is because I saw two drivers, which are
> doing h/w acceleration for AI, trying to be accepted to the misc
> subsystem.
> Add to that the fact I talked with Greg a couple of months ago about
> doing a subsystem for any compute accelerators, which he was positive
> about, I thought it is a good opportunity to finally do it.
>
> I also honestly think that I can contribute much to these drivers from
> my experience with the habana driver (which is now deployed in mass at
> AWS) and contribute code from the habana driver to a common framework
> for AI drivers.

Why not port the habana driver to drm now instead? I don't get why it
wouldn't make sense?

Stepping up to create a new subsystem is great, but we need rules
around what belongs where, we can't just spawn new subsystems when we
have no clear guidelines on where drivers should land.

What are the rules for a new accel subsystem? Do we have to now
retarget the 3 drivers that are queued up to use drm for accelerators,
because 2 drivers don't?

There's a lot more to figure out here than "merge some structures and
it will be fine".

I think the one area where I can see a divide is a new subsystem for
accelerators that are single-user, one-shot type things like media
drivers (though maybe they could just be media drivers).

I think anything that does command offloading to firmware or queues
belongs in drm, because that is pretty much what the framework does. I
think it might make sense to enhance some parts of drm to fit things
in better, but that shouldn't block things getting started.

I'm considering if we should add an accelerator staging area to drm
and land the 2-3 submissions we have and try and steer things towards
commonality that way instead of holding them out of tree.

I'd like to offload things from Greg by just not having people submit
misc drivers at all for things that should go elsewhere.

Dave.


>
> Regarding the open source userspace rules in drm - yes, I think your
> rules are too limiting for the relatively young AI scene, and I saw at
> the 2021 kernel summit that other people from the kernel community
> think that as well.
> But that's not the main reason, or even a reason at all for doing
> this. After all, at least for habana, we open-sourced our compiler and
> a runtime library. And Greg also asked those two drivers if they have
> matching open-sourced user-space code.
>
> And a final reason is that I thought this can also help in somewhat
> reducing the workload on Greg. I saw in the last kernel summit there
> was a concern about bringing more people to be kernel maintainers so I
> thought this is a step in the right direction.
>
> Oded

2022-08-04 07:00:36

by Oded Gabbay

Subject: Re: New subsystem for acceleration devices

On Thu, Aug 4, 2022 at 2:32 AM Daniel Stone <[email protected]> wrote:
>
> Hi Oded,
>
> On Wed, 3 Aug 2022 at 21:21, Oded Gabbay <[email protected]> wrote:
> > The reason it happened now is because I saw two drivers, which are
> > doing h/w acceleration for AI, trying to be accepted to the misc
> > subsystem.
>
> Why misc?
You will need to ask them ;)
Seriously, I guess they thought they were not gpu drivers and didn't
find anything else to go to.
And at least for one of them, I remember Greg and Arnd pointing them to misc.

>
> > Regarding the open source userspace rules in drm - yes, I think your
> > rules are too limiting for the relatively young AI scene, and I saw at
> > the 2021 kernel summit that other people from the kernel community
> > think that as well.
> > But that's not the main reason, or even a reason at all for doing
> > this. After all, at least for habana, we open-sourced our compiler and
> > a runtime library. And Greg also asked those two drivers if they have
> > matching open-sourced user-space code.
> >
> > And a final reason is that I thought this can also help in somewhat
> > reducing the workload on Greg. I saw in the last kernel summit there
> > was a concern about bringing more people to be kernel maintainers so I
> > thought this is a step in the right direction.
>
> Can you please explain what the reason is here?
>
> Everything you have described - uniform device enumeration, common job
> description, memory management helpers, unique job submission format,
> etc - applies exactly to DRM. If open userspace is not a requirement,
> and bypassing Greg's manual merging is a requirement, then I don't see
> what the difference is between DRM and this new bespoke subsystem. It
> would be great to have these differences enumerated in email as well
> as in kerneldoc.
I don't think preparing such a list at this point is relevant, because
I don't have a full-featured subsystem ready whose features I can
list and compare with drm.
I have a beginning of a subsystem, with very minimal common code, and
I planned for it to grow with time and with the relevant participants.

And regarding the userspace issue, I believe it will be less stringent
than in drm.
For example, afaik in drm you must upstream your LLVM fork to the
mainline LLVM tree. This is something that is really a heavy-lifting
task for most, if not all, companies.
So this is a requirement I think we can forgo.

Thanks,
Oded

>
> Cheers,
> Daniel

2022-08-04 08:18:40

by Oded Gabbay

Subject: Re: New subsystem for acceleration devices

On Thu, Aug 4, 2022 at 2:54 AM Dave Airlie <[email protected]> wrote:
>
> On Thu, 4 Aug 2022 at 06:21, Oded Gabbay <[email protected]> wrote:
> >
> > On Wed, Aug 3, 2022 at 10:04 PM Dave Airlie <[email protected]> wrote:
> > >
> > > On Sun, 31 Jul 2022 at 22:04, Oded Gabbay <[email protected]> wrote:
> > > >
> > > > Hi,
> > > > Greg and I talked a couple of months ago about preparing a new accel
> > > > subsystem for compute/acceleration devices that are not GPUs and I
> > > > think your drivers that you are now trying to upstream fit it as well.
> > >
> > > We've had some submissions for not-GPUs to the drm subsystem recently.
> > >
> > > Intel GNA, Intel VPU, NVDLA, rpmsg AI processor unit.
> > >
> > > why is creating a new subsystem at this time necessary?
> > >
> > > Are we just creating a subsystem to avoid the open source userspace
> > > consumer rules? Or do we have some concrete reasoning behind it?
> > >
> > > Dave.
> >
> > Hi Dave.
> > The reason it happened now is because I saw two drivers, which are
> > doing h/w acceleration for AI, trying to be accepted to the misc
> > subsystem.
> > Add to that the fact I talked with Greg a couple of months ago about
> > doing a subsystem for any compute accelerators, which he was positive
> > about, I thought it is a good opportunity to finally do it.
> >
> > I also honestly think that I can contribute much to these drivers from
> > my experience with the habana driver (which is now deployed in mass at
> > AWS) and contribute code from the habana driver to a common framework
> > for AI drivers.
>
> Why not port the habana driver to drm now instead? I don't get why it
> wouldn't make sense?

imho, no, I don't see the upside. This is not a trivial change, and
will require a large effort. What will it give me that I need and I
don't have now ?

>
> Stepping up to create a new subsystem is great, but we need rules
> around what belongs where, we can't just spawn new subsystems when we
> have no clear guidelines on where drivers should land.
>
> What are the rules for a new accel subsystem? Do we have to now
> retarget the 3 drivers that are queued up to use drm for accelerators,
> because 2 drivers don't?
>
> There's a lot more to figure out here than merge some structures and
> it will be fine.
I totally agree. We need to set some rules and make sure everyone in
the kernel community is familiar with them, because now you get
different answers based on who you consult with.

The rule of thumb I thought of was that if you don't have any
display (you don't need to support X/wayland) and you don't need to
support opengl/vulkan/opencl/directx or any other gpu-related software
stack, then you don't have to go through drm.
In other words, if you don't have gpu-specific h/w and/or you don't
need gpu uAPI, you don't belong in drm.

After all, memory management services or common char-device handling
I can get from other subsystems (e.g. rdma) as well. I'm sure I could
model my uAPI to be rdma uAPI compliant (I can define proprietary uAPI
there as well), but this doesn't mean I belong there, right ?

>
> I think the one area I can see a divide where a new subsystem is for
> accelerators that are single-user, one shot type things like media
> drivers (though maybe they could be just media drivers).
>
> I think anything that does command offloading to firmware or queues
> belongs in drm, because that is pretty much what the framework does. I
I think this is a very broad statement which doesn't reflect reality
in the kernel.

> think it might make sense to enhance some parts of drm to fit things
> in better, but that shouldn't block things getting started.
>
> I'm considering if, we should add an accelerator staging area to drm
> and land the 2-3 submissions we have and try and steer things towards
> commonality that way instead of holding them out of tree.
Sounds like a good idea regardless of this discussion.

>
> I'd like to offload things from Greg by just not having people submit
> misc drivers at all for things that should go elsewhere.
Great, at least we can agree on that.

Thanks,
Oded

>
> Dave.
>
>
> >
> > Regarding the open source userspace rules in drm - yes, I think your
> > rules are too limiting for the relatively young AI scene, and I saw at
> > the 2021 kernel summit that other people from the kernel community
> > think that as well.
> > But that's not the main reason, or even a reason at all for doing
> > this. After all, at least for habana, we open-sourced our compiler and
> > a runtime library. And Greg also asked those two drivers if they have
> > matching open-sourced user-space code.
> >
> > And a final reason is that I thought this can also help in somewhat
> > reducing the workload on Greg. I saw in the last kernel summit there
> > was a concern about bringing more people to be kernel maintainers so I
> > thought this is a step in the right direction.
> >
> > Oded

2022-08-04 09:40:42

by Jiho Chu

Subject: Re: New subsystem for acceleration devices

On Thu, 4 Aug 2022 09:46:49 +0300
Oded Gabbay <[email protected]> wrote:

> On Thu, Aug 4, 2022 at 2:32 AM Daniel Stone <[email protected]> wrote:
> >
> > Hi Oded,
> >
> > On Wed, 3 Aug 2022 at 21:21, Oded Gabbay <[email protected]> wrote:
> > > The reason it happened now is because I saw two drivers, which are
> > > doing h/w acceleration for AI, trying to be accepted to the misc
> > > subsystem.
> >
> > Why misc?
> You will need to ask them ;)
> Seriously, I guess they thought they were not gpu drivers and didn't
> find anything else to go to.
> And at least for one of them, I remember Greg and Arnd pointing them to misc.
>

Hi, Daniel.
The Samsung NPU driver is one of those trying to be a misc device. There are some
reasons why it chose misc, but simply put, GPU was not a perfect fit for an NPU.
AI workloads are not limited to graphical jobs; they can be NLP, data analysis or
training jobs. GPU/DRM can work for them, but its description is not aimed at
them.
e.g. AI workloads need to manage AI model data as well as input data. I guess
this could work with GEM objects, which would need to be expanded for model information.
But I have a question whether DRM would accept such a specialized GEM, given that it's not
related to graphics.
Other subsystems were similar, so I could only choose a misc device.
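
To illustrate the kind of extension described here, a device-specific
uAPI struct might tie a buffer handle to model metadata roughly like
this (every name and field below is hypothetical):

#include <linux/types.h>

/* Hypothetical: register a compiled model that already lives in a
 * driver-managed (GEM-like) buffer identified by "handle". */
struct my_npu_model_register {
	__u32 handle;		/* buffer handle holding the model blob */
	__u32 format;		/* vendor-specific compiled-model format id */
	__u64 size;		/* size of the model blob in bytes */
	__u64 num_layers;	/* extra model information the device needs */
	__u64 model_id;		/* returned: id to reference in later jobs */
};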

IMHO, for the same reason, I'm positive about Oded's work, expecting that the
new subsystem could be more specialized for AI workloads.

thanks,
Jiho

> >
> > > Regarding the open source userspace rules in drm - yes, I think your
> > > rules are too limiting for the relatively young AI scene, and I saw at
> > > the 2021 kernel summit that other people from the kernel community
> > > think that as well.
> > > But that's not the main reason, or even a reason at all for doing
> > > this. After all, at least for habana, we open-sourced our compiler and
> > > a runtime library. And Greg also asked those two drivers if they have
> > > matching open-sourced user-space code.
> > >
> > > And a final reason is that I thought this can also help in somewhat
> > > reducing the workload on Greg. I saw in the last kernel summit there
> > > was a concern about bringing more people to be kernel maintainers so I
> > > thought this is a step in the right direction.
> >
> > Can you please explain what the reason is here?
> >
> > Everything you have described - uniform device enumeration, common job
> > description, memory management helpers, unique job submission format,
> > etc - applies exactly to DRM. If open userspace is not a requirement,
> > and bypassing Greg's manual merging is a requirement, then I don't see
> > what the difference is between DRM and this new bespoke subsystem. It
> > would be great to have these differences enumerated in email as well
> > as in kerneldoc.
> I don't think preparing such a list at this point is relevant, because
> I don't have a full-featured subsystem ready, which I can take and
> list all its features and compare it with drm.
> I have a beginning of a subsystem, with very minimal common code, and
> I planned for it to grow with time and with the relevant participants.
>
> And regarding the serspace issue, I believe it will be less stringent
> than in drm.
> For example, afaik in drm you must upstream your LLVM fork to the
> mainline LLVM tree. This is something that is really a heavy-lifting
> task for most, if not all, companies.
> So this is a requirement I think we can forgo.
>
> Thanks,
> Oded
>
> >
> > Cheers,
> > Daniel
>

2022-08-04 13:02:51

by Tvrtko Ursulin

Subject: Re: New subsystem for acceleration devices


On 04/08/2022 00:54, Dave Airlie wrote:
> On Thu, 4 Aug 2022 at 06:21, Oded Gabbay <[email protected]> wrote:
>>
>> On Wed, Aug 3, 2022 at 10:04 PM Dave Airlie <[email protected]> wrote:
>>>
>>> On Sun, 31 Jul 2022 at 22:04, Oded Gabbay <[email protected]> wrote:
>>>>
>>>> Hi,
>>>> Greg and I talked a couple of months ago about preparing a new accel
>>>> subsystem for compute/acceleration devices that are not GPUs and I
>>>> think your drivers that you are now trying to upstream fit it as well.
>>>
>>> We've had some submissions for not-GPUs to the drm subsystem recently.
>>>
>>> Intel GNA, Intel VPU, NVDLA, rpmsg AI processor unit.
>>>
>>> why is creating a new subsystem at this time necessary?
>>>
>>> Are we just creating a subsystem to avoid the open source userspace
>>> consumer rules? Or do we have some concrete reasoning behind it?
>>>
>>> Dave.
>>
>> Hi Dave.
>> The reason it happened now is because I saw two drivers, which are
>> doing h/w acceleration for AI, trying to be accepted to the misc
>> subsystem.
>> Add to that the fact I talked with Greg a couple of months ago about
>> doing a subsystem for any compute accelerators, which he was positive
>> about, I thought it is a good opportunity to finally do it.
>>
>> I also honestly think that I can contribute much to these drivers from
>> my experience with the habana driver (which is now deployed in mass at
>> AWS) and contribute code from the habana driver to a common framework
>> for AI drivers.
>
> Why not port the habana driver to drm now instead? I don't get why it
> wouldn't make sense?
>
> Stepping up to create a new subsystem is great, but we need rules
> around what belongs where, we can't just spawn new subsystems when we
> have no clear guidelines on where drivers should land.
>
> What are the rules for a new accel subsystem? Do we have to now
> retarget the 3 drivers that are queued up to use drm for accelerators,
> because 2 drivers don't?

Aren't there three on the "don't prefer drm" side as well? Habana,
Toshiba and Samsung? Just so the numbers argument is not misrepresented.
Perhaps a poll - a) prefer DRM, b) prefer a new subsystem, c) don't
care in principle - is in order?

More to the point, code sharing is a very compelling argument if it can
be demonstrated to be significant, aka not needing to reinvent the same
wheel.

Perhaps one route forward could be a) to rename DRM to
something more appropriate, removing rendering from the name and
replacing it with accelerators, co-processors, I don't know... Although I
am not sure renaming the codebase, character device node names and
userspace headers is all that feasible. Thought to mention it
nevertheless; maybe it gives someone an idea of how it could be done.

And b) allow the userspace rules to be considered per driver, or per
class (whether it is a gpu or not should be a question that can be answered).
Shouldn't be a blocker if it still matches the rules present elsewhere
in the kernel.

Those two would remove the two most contentious points as far as I
understood the thread.

Regards,

Tvrtko

> There's a lot more to figure out here than merge some structures and
> it will be fine.
>
> I think the one area I can see a divide where a new subsystem is for
> accelerators that are single-user, one shot type things like media
> drivers (though maybe they could be just media drivers).
>
> I think anything that does command offloading to firmware or queues
> belongs in drm, because that is pretty much what the framework does. I
> think it might make sense to enhance some parts of drm to fit things
> in better, but that shouldn't block things getting started.
>
> I'm considering if, we should add an accelerator staging area to drm
> and land the 2-3 submissions we have and try and steer things towards
> commonality that way instead of holding them out of tree.
>
> I'd like to offload things from Greg by just not having people submit
> misc drivers at all for things that should go elsewhere.
>
> Dave.
>
>
>>
>> Regarding the open source userspace rules in drm - yes, I think your
>> rules are too limiting for the relatively young AI scene, and I saw at
>> the 2021 kernel summit that other people from the kernel community
>> think that as well.
>> But that's not the main reason, or even a reason at all for doing
>> this. After all, at least for habana, we open-sourced our compiler and
>> a runtime library. And Greg also asked those two drivers if they have
>> matching open-sourced user-space code.
>>
>> And a final reason is that I thought this can also help in somewhat
>> reducing the workload on Greg. I saw in the last kernel summit there
>> was a concern about bringing more people to be kernel maintainers so I
>> thought this is a step in the right direction.
>>
>> Oded

2022-08-04 15:03:13

by Jason Gunthorpe

Subject: Re: New subsystem for acceleration devices

On Thu, Aug 04, 2022 at 10:43:42AM +0300, Oded Gabbay wrote:

> After all, memory management services, or common device chars handling
> I can get from other subsystems (e.g. rdma) as well. I'm sure I could
> model my uAPI to be rdma uAPI compliant (I can define proprietary uAPI
> there as well), but this doesn't mean I belong there, right ?

You sure can, but there is still an expectation, eg in RDMA, that your
device has a similarity to the established standards (like roce in
habana's case) that RDMA is geared to support.

I think the most important thing in establishing a new subsystem is
to actually identify what commonalities it is supposed to be
providing. Usually this is driven by some standards body, but the
AI/ML space hasn't gone in that direction at all yet.

We don't need a "subsystem" to have a bunch of drivers expose chardevs
with completely unique ioctls.

The flip side is true of DRM - DRM is pretty general. I bet I could
implement an RDMA device under DRM - but that doesn't mean it should
be done.

My biggest concern is that this subsystem not turn into a back door
for a bunch of abuse of kernel APIs going forward. Though things are
better now, we still see this in DRM where expediency or performance
justifies hacky shortcuts instead of good in-kernel architecture. At
least DRM has reliable and experienced review these days.

Jason

2022-08-04 15:56:44

by Jeffrey Hugo

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On 8/4/2022 6:00 AM, Tvrtko Ursulin wrote:
>
> On 04/08/2022 00:54, Dave Airlie wrote:
>> On Thu, 4 Aug 2022 at 06:21, Oded Gabbay <[email protected]> wrote:
>>>
>>> On Wed, Aug 3, 2022 at 10:04 PM Dave Airlie <[email protected]> wrote:
>>>>
>>>> On Sun, 31 Jul 2022 at 22:04, Oded Gabbay <[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>> Greg and I talked a couple of months ago about preparing a new accel
>>>>> subsystem for compute/acceleration devices that are not GPUs and I
>>>>> think your drivers that you are now trying to upstream fit it as well.
>>>>
>>>> We've had some submissions for not-GPUs to the drm subsystem recently.
>>>>
>>>> Intel GNA, Intel VPU, NVDLA, rpmsg AI processor unit.
>>>>
>>>> why is creating a new subsystem at this time necessary?
>>>>
>>>> Are we just creating a subsystem to avoid the open source userspace
>>>> consumer rules? Or do we have some concrete reasoning behind it?
>>>>
>>>> Dave.
>>>
>>> Hi Dave.
>>> The reason it happened now is because I saw two drivers, which are
>>> doing h/w acceleration for AI, trying to be accepted to the misc
>>> subsystem.
>>> Add to that the fact I talked with Greg a couple of months ago about
>>> doing a subsystem for any compute accelerators, which he was positive
>>> about, I thought it is a good opportunity to finally do it.
>>>
>>> I also honestly think that I can contribute much to these drivers from
>>> my experience with the habana driver (which is now deployed in mass at
>>> AWS) and contribute code from the habana driver to a common framework
>>> for AI drivers.
>>
>> Why not port the habana driver to drm now instead? I don't get why it
>> wouldn't make sense?
>>
>> Stepping up to create a new subsystem is great, but we need rules
>> around what belongs where, we can't just spawn new subsystems when we
>> have no clear guidelines on where drivers should land.
>>
>> What are the rules for a new accel subsystem? Do we have to now
>> retarget the 3 drivers that are queued up to use drm for accelerators,
>> because 2 drivers don't?
>
> Isn't there three on the "don't prefer drm" side as well? Habana,
> Toshiba and Samsung? Just so the numbers argument is not misrepresented.
> Perhaps a poll like a) prefer DRM, b) prefer a new subsystem, c) don't
> care in principle; is in order?

I'll chime in with my opinions. Take them for what you will.

I would say I fall into the C category, but I'm targeting DRM and will
be the 5th(?) accel device to do so.

I'll say that the ksummit (from what I see in the LWN article) made me
very happy. Finally, the community had clear rules for accel drivers.
When I targeted misc in the past, it seemed like Greg moved the
goalposts just for me, which stalled our attempt. It was even more
frustrating to see that the high bar Greg set for us was not applied to
other devices of the same "class" in subsequent submissions.

However, the past is the past, and based on ksummit, we've spent a
number of months retargeting DRM. In a week (or two), I plan to post
something to start up the discussions again.

As far as the DRM userspace requirements, unless we've misunderstood
something, they've been easier to satisfy (pending review I suppose)
than what misc has set.

I would say that Dave Airlie's feedback on this discussion resonates
with me. From the perspective of a vendor wanting to be a part of the
community, clear rules are important and ksummit seemed to set that.
Oded's announcement has thrown all of that to the wind. Without a
proposal to evaluate (eg show me the code, with clear guidelines), I
cannot seriously consider Oded's idea, and I'm not sure I want to sit by
for another few years to see it settle out.

I expect to move forward with what we were planning prior to seeing this
thread which is targeting DRM. We'll see what the DRM folks say when
they have something to look at. If our device doesn't fit in DRM per an
assessment of the DRM folks, then I sure hope they can suggest where we
do fit because then we'll have tried misc and DRM, and not found a home.
Since "drivers/accel" doesn't exist, and realistically won't for a
long time if ever, I don't see why we should consider it.

Why DRM? We consume dma_buf and might look to p2pdma in the future.
ksummit appears clear - we are a DRM device. Also, someone could
probably run OpenCL on our device if they were so inclined to wire it
up. Over time, I've come to think that we are a GPU, just
without display. Yes, it would have helped if DRM and/or drivers/gpu
were renamed, but I think I'm past that point. Once you have everything
written, it doesn't seem like it matters if the uAPI device is called
/dev/drmX, /dev/miscX, or /dev/magic.

I will not opine on other devices as I am no expert on them. Today, my
opinion is that DRM is the best place for me. We'll see where that goes.

> More to the point, code sharing is a very compelling argument if it can
> be demonstrated to be significant, aka not needing to reinvent the same
> wheel.
>
> Perhaps one route forward could be a) to consider is to rename DRM to
> something more appropriate, removing rendering from the name and
> replacing with accelerators, co-processors, I don't know... Although I
> am not sure renaming the codebase, character device node names and
> userspace headers is all that feasible. Thought to mention it
> nevertheless, maybe it gives an idea to someone how it could be done.
>
> And b) allow the userspace rules to be considered per driver, or per
> class (is it a gpu or not should be a question that can be answered).
> Shouldn't be a blocker if it still matches the rules present elsewhere
> in the kernel.
>
> Those two would remove the two most contentions points as far as I
> understood the thread.
>
> Regards,
>
> Tvrtko
>


2022-08-04 18:19:36

by Oded Gabbay

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Thu, Aug 4, 2022 at 6:04 PM Jeffrey Hugo <[email protected]> wrote:
>
> On 8/4/2022 6:00 AM, Tvrtko Ursulin wrote:
> >
> > On 04/08/2022 00:54, Dave Airlie wrote:
> >> On Thu, 4 Aug 2022 at 06:21, Oded Gabbay <[email protected]> wrote:
> >>>
> >>> On Wed, Aug 3, 2022 at 10:04 PM Dave Airlie <[email protected]> wrote:
> >>>>
> >>>> On Sun, 31 Jul 2022 at 22:04, Oded Gabbay <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>> Hi,
> >>>>> Greg and I talked a couple of months ago about preparing a new accel
> >>>>> subsystem for compute/acceleration devices that are not GPUs and I
> >>>>> think your drivers that you are now trying to upstream fit it as well.
> >>>>
> >>>> We've had some submissions for not-GPUs to the drm subsystem recently.
> >>>>
> >>>> Intel GNA, Intel VPU, NVDLA, rpmsg AI processor unit.
> >>>>
> >>>> why is creating a new subsystem at this time necessary?
> >>>>
> >>>> Are we just creating a subsystem to avoid the open source userspace
> >>>> consumer rules? Or do we have some concrete reasoning behind it?
> >>>>
> >>>> Dave.
> >>>
> >>> Hi Dave.
> >>> The reason it happened now is because I saw two drivers, which are
> >>> doing h/w acceleration for AI, trying to be accepted to the misc
> >>> subsystem.
> >>> Add to that the fact I talked with Greg a couple of months ago about
> >>> doing a subsystem for any compute accelerators, which he was positive
> >>> about, I thought it is a good opportunity to finally do it.
> >>>
> >>> I also honestly think that I can contribute much to these drivers from
> >>> my experience with the habana driver (which is now deployed in mass at
> >>> AWS) and contribute code from the habana driver to a common framework
> >>> for AI drivers.
> >>
> >> Why not port the habana driver to drm now instead? I don't get why it
> >> wouldn't make sense?
> >>
> >> Stepping up to create a new subsystem is great, but we need rules
> >> around what belongs where, we can't just spawn new subsystems when we
> >> have no clear guidelines on where drivers should land.
> >>
> >> What are the rules for a new accel subsystem? Do we have to now
> >> retarget the 3 drivers that are queued up to use drm for accelerators,
> >> because 2 drivers don't?
> >
> > Isn't there three on the "don't prefer drm" side as well? Habana,
> > Toshiba and Samsung? Just so the numbers argument is not misrepresented.
> > Perhaps a poll like a) prefer DRM, b) prefer a new subsystem, c) don't
> > care in principle; is in order?
>
> I'll chime in with my opinions. Take them for what you will.
>
> I would say I fall into the C category, but I'm targeting DRM and will
> be the 5th(?) accel device to do so.
>
> I'll say that the ksummit (from what I see in the LWN article) made me
> very happy. Finally, the community had clear rules for accel drivers.
> When I targeted misc in the past, it seemed like Greg moved the goal
> post just for me, which stalled our attempt. It was even more
> frustrating to see that the high bar Greg set for us was not applied to
> other devices of the same "class" in following submissions.
>
> However, the past is the past, and based on ksummit, we've spent a
> number of months retargeting DRM. In a week (or two), I plan to post
> something to start up the discussions again.
>
> As far as the DRM userspace requirements, unless we've misunderstood
> something, they've been easier to satisfy (pending review I suppose)
> than what misc has set.
I think it is quite the opposite. In misc there were originally very
minimal userspace requirements, but when my driver started to use
dma-buf, Dave asked for more.
e.g. a driver that wants to get accepted into DRM while using a fork of
LLVM must not only open-source its code, but also upstream its fork to
the mainline LLVM tree. In misc there is nothing that comes close to
that requirement afaik.
>
> I would say that Dave Airlie's feedback on this discussion resonates
> with me. From the perspective of a vendor wanting to be a part of the
> community, clear rules are important and ksummit seemed to set that.
> Oded's announcement has thrown all of that into the wind. Without a
That wasn't my intention. I simply wanted to:
1. Offload these types of drivers from Greg.
2. Offer the new drivers standard char device handling.
3. Start a community of kernel hackers who are writing device drivers
for compute accelerators.

> proposal to evaluate (eg show me the code with clear guidelines), I
> cannot seriously consider Oded's idea, and I'm not sure I want to sit by
> another few years to see it settle out.
I thought of posting something quick (but not dirty) but this backlash
has made me rethink that.

>
> I expect to move forward with what we were planning prior to seeing this
> thread which is targeting DRM. We'll see what the DRM folks say when
> they have something to look at. If our device doesn't fit in DRM per an
> assessment of the DRM folks, then I sure hope they can suggest where we
> do fit because then we'll have tried misc and DRM, and not found a home.
> Since "drivers/accel" doesn't exist, and realistically won't for a
> long time if ever, I don't see why we should consider it.
>
> Why DRM? We consume dma_buf and might look to p2pdma in the future.
> ksummit appears clear - we are a DRM device. Also, someone could
> probably run openCL on our device if they were so inclined to wire it
> up. Over time, I've come to the thinking that we are a GPU, just
> without display. Yes, it would have helped if DRM and/or drivers/gpu
> were renamed, but I think I'm past that point. Once you have everything
> written, it doesn't seem like it matters if the uAPI device is called
> /dev/drmX, /dev/miscX, or /dev/magic.
>
> I will not opine on other devices as I am no expert on them. Today, my
> opinion is that DRM is the best place for me. We'll see where that goes.
>
> > More to the point, code sharing is a very compelling argument if it can
> > be demonstrated to be significant, aka not needing to reinvent the same
> > wheel.
> >
> > Perhaps one route forward could be a) to consider is to rename DRM to
> > something more appropriate, removing rendering from the name and
> > replacing with accelerators, co-processors, I don't know... Although I
> > am not sure renaming the codebase, character device node names and
> > userspace headers is all that feasible. Thought to mention it
> > nevertheless, maybe it gives an idea to someone how it could be done.
> >
> > And b) allow the userspace rules to be considered per driver, or per
> > class (is it a gpu or not should be a question that can be answered).
> > Shouldn't be a blocker if it still matches the rules present elsewhere
> > in the kernel.
> >
> > Those two would remove the two most contentions points as far as I
> > understood the thread.
> >
> > Regards,
> >
> > Tvrtko
> >
>

2022-08-04 18:31:53

by Oded Gabbay

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Thu, Aug 4, 2022 at 5:50 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Aug 04, 2022 at 10:43:42AM +0300, Oded Gabbay wrote:
>
> > After all, memory management services, or common device chars handling
> > I can get from other subsystems (e.g. rdma) as well. I'm sure I could
> > model my uAPI to be rdma uAPI compliant (I can define proprietary uAPI
> > there as well), but this doesn't mean I belong there, right ?
>
> You sure can, but there is still an expectation, eg in RDMA, that your
> device has a similarity to the established standards (like roce in
> habana's case) that RDMA is geared to support.
>
> I think the the most important thing to establish a new subsystem is
> to actually identify what commonalities it is supposed to be
> providing. Usually this is driven by some standards body, but the
> AI/ML space hasn't gone in that direction at all yet.
I agree. In the AI world such a standard doesn't exist and I don't see
anything on the horizon. There are the AI frameworks/compilers, which
are 30,000 feet above us, and there is CUDA, which is closed-source, so
I have no idea what it does inside.
>
> We don't need a "subsystem" to have a bunch of drivers expose chardevs
> with completely unique ioctls.
I totally agree with this sentence, and this is *exactly* why I
personally don't want to use DRM: when I look at the long list of
common IOCTLs in drm.h, I don't find anything that I can use. It is
simply either not relevant at all to my h/w, or it is something that
our h/w implemented differently.

This is in contrast to rdma, where, as you said, we have the ibverbs
API. So when you asked that we write an ibverbs driver I understood
the reasoning. There is a common user-space library which talks to the
rdma drivers, all the rdma applications use that library, and once I
write a (somewhat) standard driver, I can hopefully enjoy all of that.

>
> The flip is true of DRM - DRM is pretty general. I bet I could
> implement an RDMA device under DRM - but that doesn't mean it should
> be done.
>
> My biggest concern is that this subsystem not turn into a back door
> for a bunch of abuse of kernel APIs going forward. Though things are
How do you suggest we make sure it won't happen?

> better now, we still see this in DRM where expediency or performance
> justifies hacky shortcuts instead of good in-kernel architecture. At
> least DRM has reliable and experienced review these days.
Definitely. DRM has some parts that are really well written. For
example, the whole char device handling with sysfs/debugfs and managed
resources code. This is something I would gladly either use or
copy-paste into the hw accel subsystem.
And of course more pairs of eyes looking at the code will usually
produce better code.

I think it is clear from my previous email what I intended to provide:
a clean, simple framework for devices to register with and get
services for the most basic stuff, such as device char handling and
sysfs/debugfs. Later on, add more simple stuff such as a memory manager
and uAPI for memory handling. I guess someone can say all of that exists
in drm, but like I said, it exists in other subsystems as well.

I want to be perfectly honest and say there is nothing special here
for AI. It's actually the opposite, it is a generic framework for
compute only. Think of it as an easier path to upstream if you just
want to do compute acceleration. Maybe in the future it will be more,
but I can't predict the future.

If that's not enough for a new subsystem, fair enough, I'll withdraw my offer.

Thanks,
Oded

>
> Jason

2022-08-05 00:42:29

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Thu, Aug 04, 2022 at 08:48:28PM +0300, Oded Gabbay wrote:

> > The flip is true of DRM - DRM is pretty general. I bet I could
> > implement an RDMA device under DRM - but that doesn't mean it should
> > be done.
> >
> > My biggest concern is that this subsystem not turn into a back door
> > for a bunch of abuse of kernel APIs going forward. Though things
> > are
>
> How do you suggest to make sure it won't happen ?

Well, if you launch the subsystem then it is your job to make sure it
doesn't happen - or endure the complaints :)

Accelerators have this nasty tendency to become co-designed with their
SoCs in nasty intricate ways and then there is a desire to punch
through all the inconvenient layers to make it go faster.

> > better now, we still see this in DRM where expediency or performance
> > justifies hacky shortcuts instead of good in-kernel architecture. At
> > least DRM has reliable and experienced review these days.
> Definitely. DRM has some parts that are really well written. For
> example, the whole char device handling with sysfs/debugfs and managed
> resources code.

Arguably this should all be common code in the driver core/etc - what
value is a subsystem adding beyond that besides using it properly?

It would be nice to at least identify something that could obviously
be common, like some kind of enumeration and metadata kind of stuff
(think like ethtool, devlink, rdma tool, nvemctl etc)

> I think that it is clear from my previous email what I intended to
> provide. A clean, simple framework for devices to register with, get
> services for the most basic stuff such as device char handling,
> sysfs/debugfs.

This should all be trivially done by any driver using the core code;
if you see gaps, I'd ask why not improve the core code?

> Later on, add more simple stuff such as memory manager
> and uapi for memory handling. I guess someone can say all that exists
> in drm, but like I said it exists in other subsystems as well.

This makes sense to me - all accelerators need a way to set a memory
map - but on the other hand we are doing some very nifty stuff in this
area with iommufd, and it might be very interesting to just have the
accelerator driver link to that API instead of building yet another
copy of pin_user_pages() code, especially with PASID likely becoming
part of any accelerator toolkit.

> I want to be perfectly honest and say there is nothing special here
> for AI. It's actually the opposite, it is a generic framework for
> compute only. Think of it as an easier path to upstream if you just
> want to do compute acceleration. Maybe in the future it will be more,
> but I can't predict the future.

I can't either, and to be clear I'm only questioning the merit of a
"subsystem", eg with a struct class, rigorous uAPI, etc.

The idea that these kinds of accel drivers deserve specialized review
makes sense to me, even if they remain as unorganized char
devices. Just understand that is what you are signing up for :)

Jason

2022-08-05 03:30:55

by Dave Airlie

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Thu, 4 Aug 2022 at 17:44, Oded Gabbay <[email protected]> wrote:
>
> On Thu, Aug 4, 2022 at 2:54 AM Dave Airlie <[email protected]> wrote:
> >
> > On Thu, 4 Aug 2022 at 06:21, Oded Gabbay <[email protected]> wrote:
> > >
> > > On Wed, Aug 3, 2022 at 10:04 PM Dave Airlie <[email protected]> wrote:
> > > >
> > > > On Sun, 31 Jul 2022 at 22:04, Oded Gabbay <[email protected]> wrote:
> > > > >
> > > > > Hi,
> > > > > Greg and I talked a couple of months ago about preparing a new accel
> > > > > subsystem for compute/acceleration devices that are not GPUs and I
> > > > > think your drivers that you are now trying to upstream fit it as well.
> > > >
> > > > We've had some submissions for not-GPUs to the drm subsystem recently.
> > > >
> > > > Intel GNA, Intel VPU, NVDLA, rpmsg AI processor unit.
> > > >
> > > > why is creating a new subsystem at this time necessary?
> > > >
> > > > Are we just creating a subsystem to avoid the open source userspace
> > > > consumer rules? Or do we have some concrete reasoning behind it?
> > > >
> > > > Dave.
> > >
> > > Hi Dave.
> > > The reason it happened now is because I saw two drivers, which are
> > > doing h/w acceleration for AI, trying to be accepted to the misc
> > > subsystem.
> > > Add to that the fact I talked with Greg a couple of months ago about
> > > doing a subsystem for any compute accelerators, which he was positive
> > > about, I thought it is a good opportunity to finally do it.
> > >
> > > I also honestly think that I can contribute much to these drivers from
> > > my experience with the habana driver (which is now deployed in mass at
> > > AWS) and contribute code from the habana driver to a common framework
> > > for AI drivers.
> >
> > Why not port the habana driver to drm now instead? I don't get why it
> > wouldn't make sense?
>
> imho, no, I don't see the upside. This is not a trivial change, and
> will require a large effort. What will it give me that I need and I
> don't have now ?

The opportunity for review, code sharing, experience of locking
hierarchy, mm integration?

IMHO The biggest thing that drm has is the community of people who
understand accelerators, memory management, userspace command
submissions, fencing, dma-buf etc.

It's hard to have input to those discussions from the outside, and
they are always ongoing.

I think one of the Intel teams reported dropping a lot of code on
their drm port due to stuff already being there, I'd expect the same
for you.

The opposite question is also valid: how does moving to a new
subsystem help you or others, when there is one with all the
infrastructure and, more importantly, the reviewers?

I'd be happy to have accelerator submaintainers, I'd be happy to even
create an ACCELERATOR property for non-gpu drivers, so they can opt
out of some of the GPUier stuff, like multiple device node users etc,
or even create a new class of device nodes under /dev/dri.


> I totally agree. We need to set some rules and make sure everyone in
> the kernel community is familiar with them, because now you get
> different answers based on who you consult with.
>
> My rules of thumb that I thought of was that if you don't have any
> display (you don't need to support X/wayland) and you don't need to
> support opengl/vulkan/opencl/directx or any other gpu-related software
> stack, then you don't have to go through drm.
> In other words, if you don't have gpu-specific h/w and/or you don't
> need gpu uAPI, you don't belong in drm.

What happens when NVIDIA or Intel submits a compute-only driver for
what is actually a GPU?
This has been suggested as a workaround for our userspace rules a few times.

If my GPU can do compute tasks, do I have to add an accelerator
subsystem driver alongside my GPU one?

> After all, memory management services, or common device chars handling
> I can get from other subsystems (e.g. rdma) as well. I'm sure I could
> model my uAPI to be rdma uAPI compliant (I can define proprietary uAPI
> there as well), but this doesn't mean I belong there, right ?

Good point, but I think accelerators do mostly belong in drm or media,
because there is enough framework around them to allow them to work,
without reinventing everything.

> >
> > I think the one area I can see a divide where a new subsystem is for
> > accelerators that are single-user, one shot type things like media
> > drivers (though maybe they could be just media drivers).
> >
> > I think anything that does command offloading to firmware or queues
> > belongs in drm, because that is pretty much what the framework does. I
> I think this is a very broad statement which doesn't reflect reality
> in the kernel.

I think the habanalabs driver is one of the only ones that is outside
this really, and in major use. There might be one or two other minor
drivers with no real users.

Dave.

2022-08-07 06:51:14

by Oded Gabbay

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Fri, Aug 5, 2022 at 3:22 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Aug 04, 2022 at 08:48:28PM +0300, Oded Gabbay wrote:
>
> > > The flip is true of DRM - DRM is pretty general. I bet I could
> > > implement an RDMA device under DRM - but that doesn't mean it should
> > > be done.
> > >
> > > My biggest concern is that this subsystem not turn into a back door
> > > for a bunch of abuse of kernel APIs going forward. Though things
> > > are
> >
> > How do you suggest to make sure it won't happen ?
>
> Well, if you launch the subsystem then it is your job to make sure it
> doesn't happen - or endure the complaints :)
Understood, I'll make sure there is no room for complaints.
>
> Accelerators have this nasty tendancy to become co-designed with their
> SOCs in nasty intricate ways and then there is a desire to punch
> through all the inconvenient layers to make it go faster.
>
> > > better now, we still see this in DRM where expediency or performance
> > > justifies hacky shortcuts instead of good in-kernel architecture. At
> > > least DRM has reliable and experienced review these days.
> > Definitely. DRM has some parts that are really well written. For
> > example, the whole char device handling with sysfs/debugfs and managed
> > resources code.
>
> Arguably this should all be common code in the driver core/etc - what
> value is a subsystem adding beyond that besides using it properly?

I mainly see two things here:

1. If there is a subsystem which is responsible for creating and
exposing the device character files, then there should be some code
that connects each device driver to that subsystem,
i.e. there should be functions that each driver calls from its
probe and release callbacks.

Those functions should take care of the following (a rough sketch of
such an API follows below):
- Create metadata for the device, the device's minor(s), the driver's
ioctl table and the driver's file-operation callbacks (the last two
are common to all of the driver's devices). Save all that metadata with
proper locking.
- Create the device char files themselves and supply file operations
that will be called for each open/close/mmap/etc.
- Keep track of all these objects' lifetime in regard to the device
driver's lifetime, with proper handling for release.
- Add common handling and entries of sysfs/debugfs for these devices
with the ability for each device driver to add their own unique
entries.

2. I think that you underestimate (due to your experience) the "using
it properly" part... It is not so easy to do this properly for
inexperienced kernel people. If we provide all the code I mentioned
above, the device driver writer doesn't need to be aware of all these
kernel APIs.
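
To make this concrete, here is a minimal sketch of what the driver-facing
glue could look like. Everything here is hypothetical - accel_device,
accel_device_register() and accel_device_unregister() are invented names,
not an existing kernel API:

#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/fs.h>

/* Hypothetical accel core object; all names are illustrative only. */
struct accel_device {
	struct device *parent;			/* the PCI/platform device */
	const struct file_operations *fops;	/* driver's open/close/mmap/ioctl */
	struct device dev;			/* exposed as /dev/accelX */
	struct cdev cdev;
	int minor;
	void *drv_priv;				/* driver-private data */
};

/*
 * Called from the driver's probe(): allocates a minor, creates the char
 * device and the common sysfs/debugfs entries, and ties their lifetime to
 * the accel_device so release is handled centrally with proper locking.
 */
struct accel_device *accel_device_register(struct device *parent,
					   const struct file_operations *fops,
					   void *drv_priv);

/*
 * Called from the driver's remove(): tears down the char device and all the
 * common entries; the driver never touches cdev/class/devres directly.
 */
void accel_device_unregister(struct accel_device *adev);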

>
> It would be nice to at least identify something that could obviously
> be common, like some kind of enumeration and metadata kind of stuff
> (think like ethtool, devlink, rdma tool, nvemctl etc)
Definitely. I think we can have at least one ioctl that will be common
to all drivers from the start.
A kind of information retrieval ioctl. There are many information
points that I'm sure are common to most accelerators. We have an
extensive information ioctl in the habanalabs driver and most of the
information there is not habana specific imo.
>
> > I think that it is clear from my previous email what I intended to
> > provide. A clean, simple framework for devices to register with, get
> > services for the most basic stuff such as device char handling,
> > sysfs/debugfs.
>
> This should all be trivially done by any driver using the core codes,
> if you see gaps I'd ask why not improve the core code?
>
> > Later on, add more simple stuff such as memory manager
> > and uapi for memory handling. I guess someone can say all that exists
> > in drm, but like I said it exists in other subsystems as well.
>
> This makes sense to me, all accelerators need a way to set a memory
> map, but on the other hand we are doing some very nifty stuff in this
> area with iommufd and it might be very interesting to just have the
> accelerator driver link to that API instead of building yet another
> copy of pin_user_pages() code.. Especially with PASID likely becoming
> part of any accelerator toolkit.
Here I disagree with you. First of all, there are many relatively
simple accelerators, especially at the edge, where PASID is really not
relevant.
Second, even for the more sophisticated PCIe/CXL-based ones, PASID is
not mandatory and I suspect that it won't be in 100% of those devices.
But that should definitely be an alternative to the "classic" way of
handling dma'able memory (pin_user_pages()).
>
> > I want to be perfectly honest and say there is nothing special here
> > for AI. It's actually the opposite, it is a generic framework for
> > compute only. Think of it as an easier path to upstream if you just
> > want to do compute acceleration. Maybe in the future it will be more,
> > but I can't predict the future.
>
> I can't either, and to be clear I'm only questioning the merit of a
> "subsystem" eg with a struct class, rigerous uAPI, etc.
>
> The idea that these kinds of accel drivers deserve specialized review
> makes sense to me, even if they remain as unorganized char
> devices. Just understand that is what you are signing up for :)
I understand. That's why I'm taking all your points very seriously.
This is not a decision that should be taken lightly and I want to be
sure most agree this is the correct way forward.
My next step is to talk to Dave about it in-depth. In his other email
he wrote some interesting ideas which I want to discuss with him.

Maybe this is something that should be discussed at the kernel summit?

>
> Jason

2022-08-07 07:03:55

by Oded Gabbay

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Fri, Aug 5, 2022 at 6:03 AM Dave Airlie <[email protected]> wrote:
>
> On Thu, 4 Aug 2022 at 17:44, Oded Gabbay <[email protected]> wrote:
> >
> > On Thu, Aug 4, 2022 at 2:54 AM Dave Airlie <[email protected]> wrote:
> > >
> > > On Thu, 4 Aug 2022 at 06:21, Oded Gabbay <[email protected]> wrote:
> > > >
> > > > On Wed, Aug 3, 2022 at 10:04 PM Dave Airlie <[email protected]> wrote:
> > > > >
> > > > > On Sun, 31 Jul 2022 at 22:04, Oded Gabbay <[email protected]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > > Greg and I talked a couple of months ago about preparing a new accel
> > > > > > subsystem for compute/acceleration devices that are not GPUs and I
> > > > > > think your drivers that you are now trying to upstream fit it as well.
> > > > >
> > > > > We've had some submissions for not-GPUs to the drm subsystem recently.
> > > > >
> > > > > Intel GNA, Intel VPU, NVDLA, rpmsg AI processor unit.
> > > > >
> > > > > why is creating a new subsystem at this time necessary?
> > > > >
> > > > > Are we just creating a subsystem to avoid the open source userspace
> > > > > consumer rules? Or do we have some concrete reasoning behind it?
> > > > >
> > > > > Dave.
> > > >
> > > > Hi Dave.
> > > > The reason it happened now is because I saw two drivers, which are
> > > > doing h/w acceleration for AI, trying to be accepted to the misc
> > > > subsystem.
> > > > Add to that the fact I talked with Greg a couple of months ago about
> > > > doing a subsystem for any compute accelerators, which he was positive
> > > > about, I thought it is a good opportunity to finally do it.
> > > >
> > > > I also honestly think that I can contribute much to these drivers from
> > > > my experience with the habana driver (which is now deployed in mass at
> > > > AWS) and contribute code from the habana driver to a common framework
> > > > for AI drivers.
> > >
> > > Why not port the habana driver to drm now instead? I don't get why it
> > > wouldn't make sense?
> >
> > imho, no, I don't see the upside. This is not a trivial change, and
> > will require a large effort. What will it give me that I need and I
> > don't have now ?
>
> The opportunity for review, code sharing, experience of locking
> hierarchy, mm integration?
>
> IMHO The biggest thing that drm has is the community of people who
> understand accelerators, memory management, userspace command
> submissions, fencing, dma-buf etc.
>
> It's hard to have input to those discussions from the outside, and
> they are always ongoing.
>
> I think one of the Intel teams reported dropping a lot of code on
> their drm port due to stuff already being there, I'd expect the same
> for you.
>
> The opposite question is also valid, what does moving to a new
> subsystem help you or others, when there is one with all the
> infrastructure and more importantly reviewers.
>
> I'd be happy to have accelerator submaintainers, I'd be happy to even
> create an ACCELERATOR property for non-gpu drivers, so they can opt
> out of some of the GPUier stuff, like multiple device node users etc,
> or even create a new class of device nodes under /dev/dri.
>
I'm taking everything you wrote seriously; these are all good points.
As I wrote to Jason, I don't want to jump the gun here. I think we
should discuss this and explore the possibilities you suggested,
because I would like to reach consensus if possible.
Maybe this is something we can discuss at LPC or at the kernel summit?

Oded

>
> > I totally agree. We need to set some rules and make sure everyone in
> > the kernel community is familiar with them, because now you get
> > different answers based on who you consult with.
> >
> > My rules of thumb that I thought of was that if you don't have any
> > display (you don't need to support X/wayland) and you don't need to
> > support opengl/vulkan/opencl/directx or any other gpu-related software
> > stack, then you don't have to go through drm.
> > In other words, if you don't have gpu-specific h/w and/or you don't
> > need gpu uAPI, you don't belong in drm.
>
> What happens when NVIDIA submit a driver for just compute or intel?
> for what is actually a GPU?
> This has been suggested as workaround for our userspace rules a few times.
>
> If my GPU can do compute tasks, do I have to add an accelerator
> subsystem driver alongside my GPU one?
>
> > After all, memory management services, or common device chars handling
> > I can get from other subsystems (e.g. rdma) as well. I'm sure I could
> > model my uAPI to be rdma uAPI compliant (I can define proprietary uAPI
> > there as well), but this doesn't mean I belong there, right ?
>
> Good point, but I think accelerators do mostly belong in drm or media,
> because there is enough framework around them to allow them to work,
> without reinventing everything.
>
> > >
> > > I think the one area I can see a divide where a new subsystem is for
> > > accelerators that are single-user, one shot type things like media
> > > drivers (though maybe they could be just media drivers).
> > >
> > > I think anything that does command offloading to firmware or queues
> > > belongs in drm, because that is pretty much what the framework does. I
> > I think this is a very broad statement which doesn't reflect reality
> > in the kernel.
>
> I think the habanalabs driver is one of the only ones that is outside
> this really, and in major use. There might be one or two other minor
> drivers with no real users.
>
> Dave.

2022-08-07 11:27:31

by Oded Gabbay

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Sun, Aug 7, 2022 at 9:43 AM Oded Gabbay <[email protected]> wrote:
>
> On Fri, Aug 5, 2022 at 3:22 AM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Thu, Aug 04, 2022 at 08:48:28PM +0300, Oded Gabbay wrote:
> >
> > > > The flip is true of DRM - DRM is pretty general. I bet I could
> > > > implement an RDMA device under DRM - but that doesn't mean it should
> > > > be done.
> > > >
> > > > My biggest concern is that this subsystem not turn into a back door
> > > > for a bunch of abuse of kernel APIs going forward. Though things
> > > > are
> > >
> > > How do you suggest to make sure it won't happen ?
> >
> > Well, if you launch the subsystem then it is your job to make sure it
> > doesn't happen - or endure the complaints :)
> Understood, I'll make sure there is no room for complaints.
> >
> > Accelerators have this nasty tendancy to become co-designed with their
> > SOCs in nasty intricate ways and then there is a desire to punch
> > through all the inconvenient layers to make it go faster.
> >
> > > > better now, we still see this in DRM where expediency or performance
> > > > justifies hacky shortcuts instead of good in-kernel architecture. At
> > > > least DRM has reliable and experienced review these days.
> > > Definitely. DRM has some parts that are really well written. For
> > > example, the whole char device handling with sysfs/debugfs and managed
> > > resources code.
> >
> > Arguably this should all be common code in the driver core/etc - what
> > value is a subsystem adding beyond that besides using it properly?
>
> I mainly see two things here:
>
> 1. If there is a subsystem which is responsible for creating and
> exposing the device character files, then there should be some code
> that connects between each device driver to that subsystem.
> i.e. There should be functions that each driver should call from its
> probe and release callback functions.
>
> Those functions should take care of the following:
> - Create metadata for the device, the device's minor(s) and the
> driver's ioctls table and driver's callback for file operations (both
> are common for all the driver's devices). Save all that metadata with
> proper locking.
> - Create the device char files themselves and supply file operations
> that will be called per each open/close/mmap/etc.
> - Keep track of all these objects' lifetime in regard to the device
> driver's lifetime, with proper handling for release.
> - Add common handling and entries of sysfs/debugfs for these devices
> with the ability for each device driver to add their own unique
> entries.
>
> 2. I think that you underestimate (due to your experience) the "using
> it properly" part... It is not so easy to do this properly for
> inexperienced kernel people. If we provide all the code I mentioned
> above, the device driver writer doesn't need to be aware of all these
> kernel APIs.
>
Two more points I thought of as examples of added value that could live
in common code:
1. Common code for handling dma-buf, very similar to what was added a
year ago to rdma. This can be accompanied by a common ioctl to export
and import a dma-buf (see the sketch after this list).
2. Common code to handle drivers that want to allow only a single user
at a time to open the device char file.
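
For point 1, a rough sketch of what a common export helper could look
like, assuming a hypothetical accel core: accel_export_dmabuf() is an
invented name; only the dma-buf calls themselves are existing kernel API.

#include <linux/dma-buf.h>
#include <linux/err.h>
#include <linux/module.h>

/*
 * Hypothetical common helper: wrap a driver-provided buffer in a dma-buf
 * and return an fd, so individual drivers don't each reimplement the
 * export path. The driver only supplies its dma_buf_ops and the size.
 */
int accel_export_dmabuf(void *drv_buf, size_t size,
			const struct dma_buf_ops *ops, int flags)
{
	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
	struct dma_buf *dmabuf;
	int fd;

	exp_info.ops = ops;	/* driver's attach/map/unmap/release */
	exp_info.size = size;
	exp_info.flags = flags;	/* e.g. O_RDWR | O_CLOEXEC */
	exp_info.priv = drv_buf;

	dmabuf = dma_buf_export(&exp_info);
	if (IS_ERR(dmabuf))
		return PTR_ERR(dmabuf);

	fd = dma_buf_fd(dmabuf, flags);
	if (fd < 0)
		dma_buf_put(dmabuf);	/* drop the reference taken on export */

	return fd;
}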

Oded


> >
> > It would be nice to at least identify something that could obviously
> > be common, like some kind of enumeration and metadata kind of stuff
> > (think like ethtool, devlink, rdma tool, nvemctl etc)
> Definitely. I think we can have at least one ioctl that will be common
> to all drivers from the start.
> A kind of information retrieval ioctl. There are many information
> points that I'm sure are common to most accelerators. We have an
> extensive information ioctl in the habanalabs driver and most of the
> information there is not habana specific imo.
> >
> > > I think that it is clear from my previous email what I intended to
> > > provide. A clean, simple framework for devices to register with, get
> > > services for the most basic stuff such as device char handling,
> > > sysfs/debugfs.
> >
> > This should all be trivially done by any driver using the core codes,
> > if you see gaps I'd ask why not improve the core code?
> >
> > > Later on, add more simple stuff such as memory manager
> > > and uapi for memory handling. I guess someone can say all that exists
> > > in drm, but like I said it exists in other subsystems as well.
> >
> > This makes sense to me, all accelerators need a way to set a memory
> > map, but on the other hand we are doing some very nifty stuff in this
> > area with iommufd and it might be very interesting to just have the
> > accelerator driver link to that API instead of building yet another
> > copy of pin_user_pages() code.. Especially with PASID likely becoming
> > part of any accelerator toolkit.
> Here I disagree with you. First of all, there are many relatively
> simple accelerators, especially in edge, where PASID is really not
> relevant.
> Second, even for the more sophisticated PCIe/CXL-based ones, PASID is
> not mandatory and I suspect that it won't be in 100% of those devices.
> But definitely that should be an alternative to the "classic" way of
> handling dma'able memory (pin_user_pages()).
> >
> > > I want to be perfectly honest and say there is nothing special here
> > > for AI. It's actually the opposite, it is a generic framework for
> > > compute only. Think of it as an easier path to upstream if you just
> > > want to do compute acceleration. Maybe in the future it will be more,
> > > but I can't predict the future.
> >
> > I can't either, and to be clear I'm only questioning the merit of a
> > "subsystem" eg with a struct class, rigerous uAPI, etc.
> >
> > The idea that these kinds of accel drivers deserve specialized review
> > makes sense to me, even if they remain as unorganized char
> > devices. Just understand that is what you are signing up for :)
> I understand. That's why I'm taking all your points very seriously.
> This is not a decision that should be taken lightly and I want to be
> sure most agree this is the correct way forward.
> My next step is to talk to Dave about it in-depth. In his other email
> he wrote some interesting ideas which I want to discuss with him.
>
> Maybe this is something that should be discussed in the kernel summit ?
>
> >
> > Jason

2022-08-08 06:16:27

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Sun, Aug 07, 2022 at 02:25:33PM +0300, Oded Gabbay wrote:
> 2. Common code to handle drivers that want to allow a single user at a
> time to run open the device char file.

Note, that's an impossible request, and one that the kernel should never
worry about, so don't even try it. Think about userspace making a call
to dup() on an open char file descriptor and then passing that off
somewhere else.
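
A trivial userspace illustration of the point (the /dev/accel0 node name
is made up): the driver sees exactly one open() call, yet two processes
end up holding the open file.

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* One open() as seen by the driver's open callback ... */
	int fd = open("/dev/accel0", O_RDWR);	/* hypothetical node name */

	/* ... but the descriptor can be duplicated and inherited freely,
	 * without the driver ever seeing another open() it could refuse. */
	int fd2 = dup(fd);

	if (fork() == 0) {
		/* The child now talks to the same open file, e.g.
		 * ioctl(fd2, ...); */
		_exit(0);
	}

	close(fd2);
	close(fd);
	return 0;
}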

thanks,

greg k-h

2022-08-08 18:04:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Mon, Aug 08, 2022 at 08:10:22AM +0200, Greg Kroah-Hartman wrote:
> On Sun, Aug 07, 2022 at 02:25:33PM +0300, Oded Gabbay wrote:
> > 2. Common code to handle drivers that want to allow a single user at a
> > time to run open the device char file.
>
> Note, that's an impossible request, and one that the kernel should never
> worry about, so don't even try it. Think about userspace doing an call
> to dup() on an open char file descriptor and then passing that off
> somewhere else.

Oded is talking about a model like VFIO has, where the HW has a limited
number of concurrent state registers - let's say in this case the ASID
translation mapping the accelerator into DMA.

Each 'struct file' that is created owns one of those HW state
registers, and each struct file is completely isolated from all
others. eg someone controlling the accelerator through struct file A
cannot access memory mapped into the accelerator through struct file
B.

So, the number of struct files that can be created is capped at the
number of HW state registers the device can support (eg one for
Habana).

This is different from the number of FDs pointing at the struct file.
Userspace can open a HW state and point a lot of FDs at it, that is
userspace's problem. From a kernel view they all share one struct file
and thus one HW state.
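
A rough sketch of that model (all names invented): open() hands out one
of the limited HW contexts or fails, and release() - which runs once per
struct file, not once per fd - gives it back.

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/types.h>

#define NR_HW_CONTEXTS 1	/* e.g. a single ASID slot on the device */

struct accel_ctx {
	bool in_use;
	/* HW state for this context: ASID, page-table root, queues, ... */
};

static struct accel_ctx hw_ctx[NR_HW_CONTEXTS];
static DEFINE_MUTEX(ctx_lock);

static int accel_open(struct inode *inode, struct file *file)
{
	int i;

	/* Each successful open() claims one HW context; dup()'ed fds share
	 * this struct file and therefore the same context. */
	mutex_lock(&ctx_lock);
	for (i = 0; i < NR_HW_CONTEXTS; i++) {
		if (!hw_ctx[i].in_use) {
			hw_ctx[i].in_use = true;
			file->private_data = &hw_ctx[i];
			mutex_unlock(&ctx_lock);
			return 0;
		}
	}
	mutex_unlock(&ctx_lock);
	return -EBUSY;	/* all HW contexts are taken */
}

static int accel_release(struct inode *inode, struct file *file)
{
	struct accel_ctx *ctx = file->private_data;

	/* Called when the last fd pointing at this struct file goes away. */
	mutex_lock(&ctx_lock);
	ctx->in_use = false;
	mutex_unlock(&ctx_lock);
	return 0;
}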

Jason

2022-08-08 18:16:34

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Sun, Aug 07, 2022 at 09:43:40AM +0300, Oded Gabbay wrote:

> 1. If there is a subsystem which is responsible for creating and
> exposing the device character files, then there should be some code
> that connects between each device driver to that subsystem.
> i.e. There should be functions that each driver should call from its
> probe and release callback functions.
>
> Those functions should take care of the following:
> - Create metadata for the device, the device's minor(s) and the
> driver's ioctls table and driver's callback for file operations (both
> are common for all the driver's devices). Save all that metadata with
> proper locking.
> - Create the device char files themselves and supply file operations
> that will be called per each open/close/mmap/etc.
> - Keep track of all these objects' lifetime in regard to the device
> driver's lifetime, with proper handling for release.
> - Add common handling and entries of sysfs/debugfs for these devices
> with the ability for each device driver to add their own unique
> entries.
>
> 2. I think that you underestimate (due to your experience) the "using
> it properly" part... It is not so easy to do this properly for
> inexperienced kernel people. If we provide all the code I mentioned
> above, the device driver writer doesn't need to be aware of all these
> kernel APIs.

This may be, but it still seems weird to me to justify a subsystem as
"making existing APIs simpler so drivers don't mess them up". It
suggests perhaps we need additional core API helpers?

> > It would be nice to at least identify something that could obviously
> > be common, like some kind of enumeration and metadata kind of stuff
> > (think like ethtool, devlink, rdma tool, nvemctl etc)
> Definitely. I think we can have at least one ioctl that will be common
> to all drivers from the start.

Generally you don't want that as an ioctl because you have to open the
device to execute it - there may be permission issues, for instance -
or if you have a single-open-at-a-time model like VFIO, then it
doesn't work well together.

Usually this would be sysfs or netlink.
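
For example, a read-only sysfs attribute exposes a metadata point without
requiring the device node to be opened at all; the attribute name and
value below are made up for the sketch.

#include <linux/device.h>
#include <linux/sysfs.h>

/*
 * Illustrative only: expose a firmware version as a sysfs attribute of the
 * device, readable regardless of who (if anyone) holds the device node open.
 */
static ssize_t fw_version_show(struct device *dev,
			       struct device_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%s\n", "1.2.3");	/* placeholder value */
}
static DEVICE_ATTR_RO(fw_version);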

> > This makes sense to me, all accelerators need a way to set a memory
> > map, but on the other hand we are doing some very nifty stuff in this
> > area with iommufd and it might be very interesting to just have the
> > accelerator driver link to that API instead of building yet another
> > copy of pin_user_pages() code.. Especially with PASID likely becoming
> > part of any accelerator toolkit.
>
> Here I disagree with you. First of all, there are many relatively
> simple accelerators, especially in edge, where PASID is really not
> relevant.
> Second, even for the more sophisticated PCIe/CXL-based ones, PASID is
> not mandatory and I suspect that it won't be in 100% of those devices.
> But definitely that should be an alternative to the "classic" way of
> handling dma'able memory (pin_user_pages()).

My point was that iommufd can do the pinning for you and dump that
result into an iommu-based PASID, or it can do the pinning for you and
allow the driver to translate it into its own page table format, eg the
ASID in the habana device.

We don't need to have map/unmap APIs to manage address spaces in every
subsystem.

> Maybe this is something that should be discussed in the kernel summit ?

Maybe, I expect to be at LPC at least

Jason

2022-08-08 20:44:22

by Oded Gabbay

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Mon, Aug 8, 2022 at 8:46 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Sun, Aug 07, 2022 at 09:43:40AM +0300, Oded Gabbay wrote:
>
> > 1. If there is a subsystem which is responsible for creating and
> > exposing the device character files, then there should be some code
> > that connects between each device driver to that subsystem.
> > i.e. There should be functions that each driver should call from its
> > probe and release callback functions.
> >
> > Those functions should take care of the following:
> > - Create metadata for the device, the device's minor(s) and the
> > driver's ioctls table and driver's callback for file operations (both
> > are common for all the driver's devices). Save all that metadata with
> > proper locking.
> > - Create the device char files themselves and supply file operations
> > that will be called per each open/close/mmap/etc.
> > - Keep track of all these objects' lifetime in regard to the device
> > driver's lifetime, with proper handling for release.
> > - Add common handling and entries of sysfs/debugfs for these devices
> > with the ability for each device driver to add their own unique
> > entries.
> >
> > 2. I think that you underestimate (due to your experience) the "using
> > it properly" part... It is not so easy to do this properly for
> > inexperienced kernel people. If we provide all the code I mentioned
> > above, the device driver writer doesn't need to be aware of all these
> > kernel APIs.
>
> This may be, but it still seems weird to me to justify a subsystem as
> "making existing APIs simpler so drivers don't mess them up". It
> suggests perhaps we need additional core API helpers?
>
I'm sorry, but my original argument was poorly phrased. I'll try to
phrase it better.

What I'm trying to say is that imo one of the end goals of doing a
common subsystem is to provide a common uAPI for all the drivers that
belong to that subsystem.
I wrote this argument in a previous email as a criterion for whether a
driver should join a subsystem.

So if you want a common uAPI and a common userspace library to use it,
you need to expose the same device character files for every device,
regardless of the driver. e.g. you need all devices to be called
/dev/accelX and not /dev/habanaX or /dev/nvidiaX

This means that the whole device character creation/removal/open/close
is done in the common subsystem, not in each driver individually.
So even if it is simple code, as you said, it still must reside in
the subsystem common code.

Once you put that code there, you need to add meta-data as different
drivers attach to the subsystem and ask to create devices and minors
when their probe function is called. In addition, you need to remove
all this code from each individual driver.

That's what I mean by abstracting all this kernel API from the
drivers. Not because it is an API that is hard to use, but because the
drivers should *not* use it at all.

I think drm did that pretty well. Their code defines objects for
driver, device and minors, with a resource manager that takes care
of releasing the objects automatically (it is based on devres.c).
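
A rough sketch of what that common code could look like; the "accel"
class, the node naming and the helper are all hypothetical, and only the
chardev/ida/device_create calls are existing kernel API (error unwinding
omitted for brevity):

#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/idr.h>

#define ACCEL_MAX_MINORS 256		/* arbitrary for the sketch */

static struct class *accel_class;	/* created once at subsystem init */
static dev_t accel_devt;		/* base from alloc_chrdev_region() */
static DEFINE_IDA(accel_minor_ida);

/*
 * The core, not the driver, picks the node name, so every device shows up
 * as /dev/accelX no matter which driver registered it.
 */
static int accel_create_node(struct device *parent,
			     const struct file_operations *fops,
			     struct cdev *cdev)
{
	struct device *dev;
	int ret, minor;

	minor = ida_alloc_max(&accel_minor_ida, ACCEL_MAX_MINORS - 1,
			      GFP_KERNEL);
	if (minor < 0)
		return minor;

	cdev_init(cdev, fops);
	ret = cdev_add(cdev, MKDEV(MAJOR(accel_devt), minor), 1);
	if (ret)
		return ret;

	dev = device_create(accel_class, parent,
			    MKDEV(MAJOR(accel_devt), minor), NULL,
			    "accel%d", minor);
	return PTR_ERR_OR_ZERO(dev);
}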

> > > It would be nice to at least identify something that could obviously
> > > be common, like some kind of enumeration and metadata kind of stuff
> > > (think like ethtool, devlink, rdma tool, nvemctl etc)
> > Definitely. I think we can have at least one ioctl that will be common
> > to all drivers from the start.
>
> Generally you don't want that as an ioctl because you have to open the
> device to execute it, there may be permissions issues for instance -
> or if you have a single-open-at-a-time model like VFIO, then it
> doesn't work well together.
>
> Usually this would be sysfs or netlink.
So actually I do want an ioctl, but as you said, not on the main
device char - on an accompanying control device char.

In habana we define two device chars per device. One is the compute
device, which behaves like VFIO, and one is a control device which has
no limitation on the number of applications that can access it.
However, an application only has access to the information ioctl
through this device char (so it can't submit anything, allocate memory,
etc.) and can only retrieve metrics which do not leak information about
the compute application.

The reason I want an ioctl is that it is much more flexible than
sysfs and allows writing proper software to monitor the device in the
data-center. At least, that's my experience from the deployments we
have had so far.
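
As an illustration only - the magic number, fields and layout below are
invented for the sketch, not a proposed uAPI - such an information ioctl
on the control device char could look like:

#include <linux/ioctl.h>
#include <linux/types.h>

/* Hypothetical common "info" ioctl exposed on the control device char. */
struct accel_info {
	__u32 device_type;	/* vendor-assigned device/IP identifier */
	__u32 num_engines;	/* compute engines exposed to userspace */
	__u64 dram_size;	/* on-board memory size, in bytes */
	__u64 fw_version;	/* running firmware version */
};

#define ACCEL_IOC_MAGIC		'A'
#define ACCEL_IOCTL_INFO	_IOR(ACCEL_IOC_MAGIC, 0x00, struct accel_info)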

>
> > > This makes sense to me, all accelerators need a way to set a memory
> > > map, but on the other hand we are doing some very nifty stuff in this
> > > area with iommufd and it might be very interesting to just have the
> > > accelerator driver link to that API instead of building yet another
> > > copy of pin_user_pages() code.. Especially with PASID likely becoming
> > > part of any accelerator toolkit.
> >
> > Here I disagree with you. First of all, there are many relatively
> > simple accelerators, especially in edge, where PASID is really not
> > relevant.
> > Second, even for the more sophisticated PCIe/CXL-based ones, PASID is
> > not mandatory and I suspect that it won't be in 100% of those devices.
> > But definitely that should be an alternative to the "classic" way of
> > handling dma'able memory (pin_user_pages()).
>
> My point was that iommufd can do the pinning for you and dump that
> result into a iommu based PASID, or it can do the pinning for you and
> allow the driver to translate it into its own page table format eg the
> ASID in the habana device.
>
> We don't need to have map/unmap APIs to manage address spaces in every
> subsystem.
I see. I will need to learn this more in-depth if/when it becomes
relevant. But of course I agree that reusing existing code is much
better.

>
> > Maybe this is something that should be discussed in the kernel summit ?
>
> Maybe, I expect to be at LPC at least
Great!
Oded
>
> Jason

2022-08-09 06:52:33

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Mon, Aug 08, 2022 at 02:55:28PM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 08, 2022 at 08:10:22AM +0200, Greg Kroah-Hartman wrote:
> > On Sun, Aug 07, 2022 at 02:25:33PM +0300, Oded Gabbay wrote:
> > > 2. Common code to handle drivers that want to allow a single user at a
> > > time to run open the device char file.
> >
> > Note, that's an impossible request, and one that the kernel should never
> > worry about, so don't even try it. Think about userspace doing an call
> > to dup() on an open char file descriptor and then passing that off
> > somewhere else.
>
> Oded is talking about a model like VFIO has where the HW has a limited
> number of concurrent state registers - lets say in this case the ASID
> translation mapping the accelerator into DMA.

Based on the number of drivers that I see submitted weekly that try to
restrict their open call to just one user by using atomic variables or
other tricks, I think my interpretation of this stands :)

> Each 'struct file' that is created owns one of those HW state
> registers, and each struct file is completely isolated from all
> others. eg someone controlling the accelerator through struct file A
> cannot access memory mapped into the accelerator through struct file
> B.
>
> So, the number of struct files that can be created is capped at the
> number of HW state registers the device can support (eg one for
> Habana).
>
> This is different from the number of FDs pointing at the struct file.
> Userpsace can open a HW state and point a lot of FDs at it, that is
> userspace's problem. From a kernel view they all share one struct file
> and thus one HW state.

Yes, that's fine. If that is what is happening here, I have no
objection.

greg k-h

2022-08-09 08:15:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Tue, Aug 09, 2022 at 08:23:27AM +0200, Greg Kroah-Hartman wrote:
> Based on the number of drivers that I see submitted weekly that try to
> restrict their open call to just one user by using atomic variables or
> other tricks, I think my interpretation of this stands :)

I think they really want what Jason described most of the time. They
just don't know about the pitfalls of dup yet.

> > This is different from the number of FDs pointing at the struct file.
> > Userspace can open a HW state and point a lot of FDs at it, that is
> > userspace's problem. From a kernel view they all share one struct file
> > and thus one HW state.
>
> Yes, that's fine, if that is what is happening here, I have no
> objection.

It would be great if we could actually lift that into a common
layer (chardev or vfs), given just how common this is and how often
drivers get it wrong or do it suboptimally.

2022-08-09 08:56:21

by Arnd Bergmann

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Tue, Aug 9, 2022 at 10:04 AM Christoph Hellwig <[email protected]> wrote:
> On Tue, Aug 09, 2022 at 08:23:27AM +0200, Greg Kroah-Hartman wrote:
> > > This is different from the number of FDs pointing at the struct file.
> > > Userspace can open a HW state and point a lot of FDs at it, that is
> > > userspace's problem. From a kernel view they all share one struct file
> > > and thus one HW state.
> >
> > Yes, that's fine, if that is what is happening here, I have no
> > objection.
>
> It would be great if we could actually lift that into a common
> layer (chardev or vfs), given just how common this is and how often
> drivers get it wrong or do it suboptimally.

Totally agreed.

I think for devices with hardware MMU contexts you actually want to
bind the context to a 'mm_struct', and then ensure that the context
is only ever used from a process that shares this mm_struct,
regardless of who else has access to the same file or inode.

We did this in a somewhat hacky way in spufs a long time ago, and
I would expect this to be solved more cleanly in drivers/gpu/drm, but
I don't know where to look for that code.

Arnd
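
A rough sketch of the mm_struct binding suggested above, using invented
names (accel_ctx): remember the opener's mm at open() and reject requests
from any task that does not share it.  Whether this belongs in each driver
or in a common layer is exactly the open question here:

#include <linux/fs.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

struct accel_ctx {
	struct mm_struct *mm;
	/* ... per-context HW state ... */
};

static int accel_open(struct inode *inode, struct file *filp)
{
	struct accel_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

	if (!ctx)
		return -ENOMEM;

	ctx->mm = current->mm;
	mmgrab(ctx->mm);	/* keep the mm_struct itself alive */
	filp->private_data = ctx;
	return 0;
}

static long accel_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
	struct accel_ctx *ctx = filp->private_data;

	if (current->mm != ctx->mm)
		return -EACCES;	/* caller does not share the bound mm */

	/* ... handle the command ... */
	return 0;
}

static int accel_release(struct inode *inode, struct file *filp)
{
	struct accel_ctx *ctx = filp->private_data;

	mmdrop(ctx->mm);
	kfree(ctx);
	return 0;
}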

2022-08-09 09:09:35

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Tue, Aug 09, 2022 at 01:04:15AM -0700, Christoph Hellwig wrote:
> On Tue, Aug 09, 2022 at 08:23:27AM +0200, Greg Kroah-Hartman wrote:
> > Based on the number of drivers that I see submitted weekly that try to
> > restrict their open call to just one user by using atomic variables or
> > other tricks, I think my interpretation of this stands :)
>
> I think they really want what Jason described most of the time. They
> just don't know about the pitfalls of dup yet.
>
> > > This is different from the number of FDs pointing at the struct file.
> > > Userspace can open a HW state and point a lot of FDs at it, that is
> > > userspace's problem. From a kernel view they all share one struct file
> > > and thus one HW state.
> >
> > Yes, that's fine, if that is what is happening here, I have no
> > objection.
>
> It would be great if we could actually lift that into a common
> layer (chardev or vfs), given just how common this is and how often
> drivers get it wrong or do it suboptimally.

No objection from me, I'll gladly take patches to chardev or miscdev to
support this.

greg k-h

2022-08-09 12:23:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Tue, Aug 09, 2022 at 10:32:27AM +0200, Arnd Bergmann wrote:
> On Tue, Aug 9, 2022 at 10:04 AM Christoph Hellwig <[email protected]> wrote:
> > On Tue, Aug 09, 2022 at 08:23:27AM +0200, Greg Kroah-Hartman wrote:
> > > > This is different from the number of FDs pointing at the struct file.
> > > > Userspace can open a HW state and point a lot of FDs at it, that is
> > > > userspace's problem. From a kernel view they all share one struct file
> > > > and thus one HW state.
> > >
> > > Yes, that's fine, if that is what is happening here, I have no
> > > objection.
> >
> > It would be great if we could actually lift that into a common
> > layer (chardev or vfs), given just how common this is and how often
> > drivers get it wrong or do it suboptimally.
>
> Totally agreed.
>
> I think for devices with hardware MMU contexts you actually want to
> bind the context to a 'mm_struct', and then ensure that the context
> is only ever used from a process that shares this mm_struct,
> regardless of who else has access to the same file or inode.

I can't think of a security justification for this.

If process A stuffs part of its address space into the device and
passes the FD to process B which can then access that address space,
how is it any different from process A making a tmpfs, mmaping it, and
passing it to process B which can then access that address space?

IMHO the 'struct file' is the security domain and a process must be
careful to only allow FDs to be created that meet its security needs.

The kernel should not be involved in security here any further than
using the standard FD security mechanisms.

Who is to say there isn't a meaningful dual process use case for the
accelerator? We see dual-process designs regularly in networking
accelerators.

Jason

2022-08-09 12:50:47

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Mon, Aug 08, 2022 at 11:26:11PM +0300, Oded Gabbay wrote:

> So if you want a common uAPI and a common userspace library to use it,
> you need to expose the same device character files for every device,
> regardless of the driver. e.g. you need all devices to be called
> /dev/accelX and not /dev/habanaX or /dev/nvidiaX

So, this is an interesting idea. One of the things we did in RDMA that
turned out very well is to have the user side of the kernel/user API
in a single git repo for all the drivers, including the lowest layer
of the driver-specific APIs.

It gives a reasonable target for a DRM-like test of "you must have a
userspace". Ie send your userspace and userspace documentation/tests
before your kernel side can be merged.

Even if it is just a git repo collecting and curating driver-specific
libraries under the "accel" banner it could be quite a useful
activity.

But, probably this boils down to things that look like:

device = habana_open_device()
habana_mooo(device)

device = nvidia_open_device()
nvidia_baaa(device)

> That's what I mean by abstracting all this kernel API from the
> drivers. Not because it is an API that is hard to use, but because the
> drivers should *not* use it at all.
>
> I think drm did that pretty well. Their code defines objects for
> driver, device and minors, with resource manager that will take care
> of releasing the objects automatically (it is based on devres.c).

We have lots of examples of subsystems doing this - the main thing
unique about accel is that there is really no shared uAPI between
the drivers, and no 'abstraction' provided by the kernel. Maybe that
is the point..

> So actually I do want an ioctl but as you said, not for the main
> device char, but to an accompanied control device char.

There is a general problem across all these "thick" devices in the
kernel to support their RAS & configuration requirements and IMHO we
don't have a good answer at all.

We've been talking on and off here about having some kind of
subsystem/methodology specifically for this area - how to monitor,
configure, service, etc a very complicated off-CPU device. I think
there would be a lot of interest in this and maybe it shouldn't be
coupled to this accel idea.

Eg we already have some established mechanisms - I would expect any
accel device to be able to introspect and upgrade its flash FW using
the 'devlink flash' common API.

> an application only has access to the information ioctl through this
> device char (so it can't submit anything, allocate memory, etc.) and
> can only retrieve metrics which do not leak information about the
> compute application.

This is often done over a netlink socket as the "second char".

Jason
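
A hypothetical uAPI sketch of such an information-only control node; the
names (struct accel_ctl_info, ACCEL_CTL_IOCTL_INFO) are invented.  The single
read-only ioctl exposes metrics but gives no way to submit work or map
memory, so a monitoring tool can open it without touching the compute
application's state:

#include <linux/ioctl.h>
#include <linux/types.h>

struct accel_ctl_info {
	__u64 device_utilization;	/* percent x 100 */
	__u64 memory_used_bytes;
	__u64 temperature_mc;		/* millidegrees Celsius */
	__u64 uncorrectable_errors;	/* RAS counter */
};

#define ACCEL_CTL_IOCTL_INFO	_IOR('A', 0x00, struct accel_ctl_info)

A netlink family, as noted above, would carry the same kind of data without
a second char device.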

2022-08-09 13:36:09

by Arnd Bergmann

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Tue, Aug 9, 2022 at 2:18 PM Jason Gunthorpe <[email protected]> wrote:
> On Tue, Aug 09, 2022 at 10:32:27AM +0200, Arnd Bergmann wrote:
> > On Tue, Aug 9, 2022 at 10:04 AM Christoph Hellwig <[email protected]> wrote:
> >
> > I think for devices with hardware MMU contexts you actually want to
> > bind the context to a 'mm_struct', and then ensure that the context
> > is only ever used from a process that shares this mm_struct,
> > regardless of who else has access to the same file or inode.
>
> I can't think of a security justification for this.
>
> If process A stuffs part of its address space into the device and
> passes the FD to process B which can then access that address space,
> how is it any different from process A making a tmpfs, mmaping it, and
> passing it to process B which can then access that address space?
>
> IMHO the 'struct file' is the security domain and a process must be
> careful to only allow FDs to be created that meet its security needs.
>
> The kernel should not be involved in security here any further than
> using the standard FD security mechanisms.
>
> Who is to say there isn't a meaningful dual process use case for the
> accelerator? We see dual-process designs regularly in networking
> accelerators.

I think there is a lot of value in keeping things simple here, to
make sure users and developers don't make a mess of it. If the
accelerator has access to the memory of one process but I run
it from another process, I lose the benefits of the shared page
tables, but gain a lot of the complexity that I think the 'single-opener'
logic was trying (without success) to avoid.

E.g. attaching a debugger to single-step through the accelerator code
would then involve at least four address spaces:

- the addresses of the debugger
- the addresses local to the accelerator
- addresses in the shared address space of the process that owns
the memory
- addresses in the process that owns the accelerator context

which is at least one more than I'd like to deal with.

This is somewhat different for accelerators that have coherent
access to a process address space and only access explicitly
shared buffers. On these you could more easily allow sharing the
file descriptor between any number of processes.

Arnd

2022-08-09 14:27:19

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Tue, Aug 09, 2022 at 02:46:36PM +0200, Arnd Bergmann wrote:
> On Tue, Aug 9, 2022 at 2:18 PM Jason Gunthorpe <[email protected]> wrote:
> > On Tue, Aug 09, 2022 at 10:32:27AM +0200, Arnd Bergmann wrote:
> > > On Tue, Aug 9, 2022 at 10:04 AM Christoph Hellwig <[email protected]> wrote:
> > >
> > > I think for devices with hardware MMU contexts you actually want to
> > > bind the context to a 'mm_struct', and then ensure that the context
> > > is only ever used from a process that shares this mm_struct,
> > > regardless of who else has access to the same file or inode.
> >
> > I can't think of a security justification for this.
> >
> > If process A stuffs part of its address space into the device and
> > passes the FD to process B which can then access that address space,
> > how is it any different from process A making a tmpfs, mmaping it, and
> > passing it to process B which can then access that address space?
> >
> > IMHO the 'struct file' is the security domain and a process must be
> > careful to only allow FDs to be created that meet its security needs.
> >
> > The kernel should not be involved in security here any further than
> > using the standard FD security mechanisms.
> >
> > Who is to say there isn't a meaningful dual process use case for the
> > accelerator? We see dual-process designs regularly in networking
> > accelerators.
>
> I think there is a lot of value in keeping things simple here, to
> make sure users and developers don't make a mess of it.

I don't think the kernel should enforce policy on userspace. As long
as the kernel side is simple and reasonable then it should let
userspace make whatever mess it likes.

We have a lot of experience here now, and userspaces do take advantage
of this flexibility in more well established accelerator subsystems,
like DRM and RDMA.

> If the accelerator has access to the memory of one process but I run
> it from another process, I lose the benefits of the shared page
> tables,

There are several accelerator "ASID" models I've seen - devices can
have one or many ASID's and the ASID's can be map/unmap style or forced
1:1 with a mm_struct (usually called SVA/SVM).

Userspace is responsible to figure out how to use this stuff. With
map/unmap there should be no kernel restriction on what mappings can
be created, but often sane userspaces will probably want to stick to
1:1 map/unmap with a single process.

> E.g. attaching a debugger to single-step through the accelerator code
> would then involve at least four address spaces:
>
> - the addresses of the debugger
> - the addresses local to the accelerator
> - addresses in the shared address space of the process that owns
> the memory
> - addresses in the process that owns the accelerator context
>
> which is at least one more than I'd like to deal with.

It is a FD. There is no "owns the accelerator context" - that concept
is an abuse of the FD model, IMHO.

If you are debugging you have the mm_struct of each of the threads
you are debugging and each of the ASID's loaded in the accelerator to
deal with - it is inherent in the hardware design.

> This is somewhat different for accelerators that have coherent
> access to a process address space and only access explicitly
> shared buffers. On these you could more easily allow sharing the
> file descriptor between any number of processes.

That is just a multi-ASID accelerator and userspace has linked a
unique SVA ASID to each unique process using the FD.

The thing to understand is that the FD represents the security
context, so it is very reasonable that on a multi-ASID device I could share
the same security context with two co-operating processes, create two
ASID's and do accelerator operations that work jointly across both
memory spaces. For instance I might consume a source from process B,
process it and deliver the result into process A where process A will
then send it over the network or something. We have these kinds of use
models already.

Jason
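
A hypothetical uAPI sketch of the "map/unmap style" ASID model mentioned
above, with invented names; an SVA/SVM device would instead bind an ASID 1:1
to an mm and need no per-buffer map calls:

#include <linux/ioctl.h>
#include <linux/types.h>

struct accel_map_args {
	__u32 asid;		/* which device address space */
	__u32 flags;
	__u64 user_addr;	/* range in the calling process */
	__u64 device_addr;	/* where it appears to the accelerator */
	__u64 size;
};

#define ACCEL_IOCTL_MAP		_IOW('A', 0x10, struct accel_map_args)
#define ACCEL_IOCTL_UNMAP	_IOW('A', 0x11, struct accel_map_args)

Userspace decides how these are used; as said above, a sane userspace will
often keep a single process per ASID, but nothing in the kernel interface
has to enforce that.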

2022-08-09 21:51:10

by Oded Gabbay

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Sun, Aug 7, 2022 at 9:50 AM Oded Gabbay <[email protected]> wrote:
>
> On Fri, Aug 5, 2022 at 6:03 AM Dave Airlie <[email protected]> wrote:
> >
> > On Thu, 4 Aug 2022 at 17:44, Oded Gabbay <[email protected]> wrote:
> > >
> > > On Thu, Aug 4, 2022 at 2:54 AM Dave Airlie <[email protected]> wrote:
> > > >
> > > > On Thu, 4 Aug 2022 at 06:21, Oded Gabbay <[email protected]> wrote:
> > > > >
> > > > > On Wed, Aug 3, 2022 at 10:04 PM Dave Airlie <[email protected]> wrote:
> > > > > >
> > > > > > On Sun, 31 Jul 2022 at 22:04, Oded Gabbay <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > > Greg and I talked a couple of months ago about preparing a new accel
> > > > > > > subsystem for compute/acceleration devices that are not GPUs and I
> > > > > > > think your drivers that you are now trying to upstream fit it as well.
> > > > > >
> > > > > > We've had some submissions for not-GPUs to the drm subsystem recently.
> > > > > >
> > > > > > Intel GNA, Intel VPU, NVDLA, rpmsg AI processor unit.
> > > > > >
> > > > > > why is creating a new subsystem at this time necessary?
> > > > > >
> > > > > > Are we just creating a subsystem to avoid the open source userspace
> > > > > > consumer rules? Or do we have some concrete reasoning behind it?
> > > > > >
> > > > > > Dave.
> > > > >
> > > > > Hi Dave.
> > > > > The reason it happened now is because I saw two drivers, which are
> > > > > doing h/w acceleration for AI, trying to be accepted to the misc
> > > > > subsystem.
> > > > > Add to that the fact I talked with Greg a couple of months ago about
> > > > > doing a subsystem for any compute accelerators, which he was positive
> > > > > about, I thought it is a good opportunity to finally do it.
> > > > >
> > > > > I also honestly think that I can contribute much to these drivers from
> > > > > my experience with the habana driver (which is now deployed in mass at
> > > > > AWS) and contribute code from the habana driver to a common framework
> > > > > for AI drivers.
> > > >
> > > > Why not port the habana driver to drm now instead? I don't get why it
> > > > wouldn't make sense?
> > >
> > > imho, no, I don't see the upside. This is not a trivial change, and
> > > will require a large effort. What will it give me that I need and I
> > > don't have now ?
> >
> > The opportunity for review, code sharing, experience of locking
> > hierarchy, mm integration?
> >
> > IMHO The biggest thing that drm has is the community of people who
> > understand accelerators, memory management, userspace command
> > submissions, fencing, dma-buf etc.
> >
> > It's hard to have input to those discussions from the outside, and
> > they are always ongoing.
> >
> > I think one of the Intel teams reported dropping a lot of code on
> > their drm port due to stuff already being there, I'd expect the same
> > for you.
> >
> > The opposite question is also valid, what does moving to a new
> > subsystem help you or others, when there is one with all the
> > infrastructure and more importantly reviewers.
> >
> > I'd be happy to have accelerator submaintainers, I'd be happy to even
> > create an ACCELERATOR property for non-gpu drivers, so they can opt
> > out of some of the GPUier stuff, like multiple device node users etc,
> > or even create a new class of device nodes under /dev/dri.
> >
> I'm taking all what you wrote seriously, these are all good points.
> As I wrote to Jason, I don't want to jump the gun here. I think we
> should discuss this and explore the possibilities that you suggested
> because I would like to reach consensus if possible.
> Maybe this is something we can discuss in LPC or in the kernel summit ?
>
> Oded

Hi Jiho, Yuji.

I want to update that I'm currently in discussions with Dave to figure
out what's the best way to move forward. We are writing it down to do
a proper comparison between the two paths (new accel subsystem or
using drm). I guess it will take a week or so.

In the meantime, I'm putting the accel code on hold. I have only
managed to do the very basic infra and add a demo driver that shows
how to register and unregister from it.
You can check the code at:
https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux.git/log/?h=accel

It has two commits. The first adds the subsystem code and the second
adds the demo driver.
The subsystem code is basically drm code copied and renamed and
slightly modified, but I really only worked on it for a couple of
hours so take that into consideration.

The important thing is that the demo driver shows the basic steps are
really simple. You need to add two function calls in your probe and
one function call in your release. Of course you will need to supply
some function callbacks, but I haven't got to fill that in the demo
driver. Once you register, you get /dev/accel/ac0 and
/dev/accel/ac_controlD64 (if you want a control device). If I were to
continue this, the next step is to do the open and close part.

I will update once we know where things are heading. As I said, I
imagine it can take a few weeks.

Thanks,
Oded
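
As an illustration only, a sketch of what the "two function calls in your
probe and one function call in your release" flow could look like for a PCI
device.  The names (struct accel_device, accel_dev_alloc, accel_dev_register,
accel_dev_unregister, demo_accel_ops) are guesses; the branch above has the
real code:

#include <linux/pci.h>

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct accel_device *adev;

	adev = accel_dev_alloc(&pdev->dev, &demo_accel_ops);	/* call 1 */
	if (IS_ERR(adev))
		return PTR_ERR(adev);

	pci_set_drvdata(pdev, adev);
	return accel_dev_register(adev);	/* call 2: creates /dev/accel/acN */
}

static void demo_remove(struct pci_dev *pdev)
{
	accel_dev_unregister(pci_get_drvdata(pdev));	/* one call in release */
}

For a platform device the same two calls would simply move into the platform
driver's probe.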



>
> >
> > > I totally agree. We need to set some rules and make sure everyone in
> > > the kernel community is familiar with them, because now you get
> > > different answers based on who you consult with.
> > >
> > > My rules of thumb that I thought of was that if you don't have any
> > > display (you don't need to support X/wayland) and you don't need to
> > > support opengl/vulkan/opencl/directx or any other gpu-related software
> > > stack, then you don't have to go through drm.
> > > In other words, if you don't have gpu-specific h/w and/or you don't
> > > need gpu uAPI, you don't belong in drm.
> >
> > What happens when NVIDIA submit a driver for just compute or intel?
> > for what is actually a GPU?
> > This has been suggested as workaround for our userspace rules a few times.
> >
> > If my GPU can do compute tasks, do I have to add an accelerator
> > subsystem driver alongside my GPU one?
> >
> > > After all, memory management services, or common device chars handling
> > > I can get from other subsystems (e.g. rdma) as well. I'm sure I could
> > > model my uAPI to be rdma uAPI compliant (I can define proprietary uAPI
> > > there as well), but this doesn't mean I belong there, right ?
> >
> > Good point, but I think accelerators do mostly belong in drm or media,
> > because there is enough framework around them to allow them to work,
> > without reinventing everything.
> >
> > > >
> > > > I think the one area I can see a divide where a new subsystem is for
> > > > accelerators that are single-user, one shot type things like media
> > > > drivers (though maybe they could be just media drivers).
> > > >
> > > > I think anything that does command offloading to firmware or queues
> > > > belongs in drm, because that is pretty much what the framework does. I
> > > I think this is a very broad statement which doesn't reflect reality
> > > in the kernel.
> >
> > I think the habanalabs driver is one of the only ones that is outside
> > this really, and in major use. There might be one or two other minor
> > drivers with no real users.
> >
> > Dave.

2022-08-10 10:08:00

by Jiho Chu

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Wed, 10 Aug 2022 00:42:24 +0300
Oded Gabbay <[email protected]> wrote:

>
> Hi Jiho, Yuji.
>
> I want to update that I'm currently in discussions with Dave to figure
> out what's the best way to move forward. We are writing it down to do
> a proper comparison between the two paths (new accel subsystem or
> using drm). I guess it will take a week or so.
>
> In the meantime, I'm putting the accel code on hold. I have only
> managed to do the very basic infra and add a demo driver that shows
> how to register and unregister from it.
> You can check the code at:
> https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux.git/log/?h=accel
>
> It has two commits. The first adds the subsystem code and the second
> adds the demo driver.
> The subsystem code is basically drm code copied and renamed and
> slightly modified, but I really only worked on it for a couple of
> hours so take that into consideration.
>
> The important thing is that the demo driver shows the basic steps are
> really simple. You need to add two function calls in your probe and
> one function call in your release. Of course you will need to supply
> some function callbacks, but I haven't got to fill that in the demo
> driver. Once you register, you get /dev/accel/ac0 and
> /dev/accel/ac_controlD64 (if you want a control device). If I were to
> continue this, the next step is to do the open and close part.
>
> I will update once we know where things are heading. As I said, I
> imagine it can take a few weeks.
>
> Thanks,
> Oded
>

Hi Oded,
Thanks for sharing your code, it looks like a good start from the basic drm code.
I hope the discussion makes better result.

Thanks,
Jiho Chu

2022-08-10 14:22:18

by Yuji Ishikawa

[permalink] [raw]
Subject: RE: New subsystem for acceleration devices

> -----Original Message-----
> From: Oded Gabbay <[email protected]>
> Sent: Wednesday, August 10, 2022 6:42 AM
> To: Dave Airlie <[email protected]>; Greg Kroah-Hartman
> <[email protected]>; ishikawa yuji(石川 悠司 ○RDC□AITC○
> EA開) <[email protected]>; Jiho Chu <[email protected]>
> Cc: dri-devel <[email protected]>; Arnd Bergmann
> <[email protected]>; Linux-Kernel@Vger. Kernel. Org
> <[email protected]>; Jason Gunthorpe <[email protected]>
> Subject: Re: New subsystem for acceleration devices
>
> Hi Jiho, Yuji.
>
> I want to update that I'm currently in discussions with Dave to figure out what's
> the best way to move forward. We are writing it down to do a proper comparison
> between the two paths (new accel subsystem or using drm). I guess it will take
> a week or so.
>
> In the meantime, I'm putting the accel code on hold. I have only managed to do
> the very basic infra and add a demo driver that shows how to register and
> unregister from it.
> You can check the code at:
> https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux.git/log/?h=ac
> cel
>
> It has two commits. The first adds the subsystem code and the second adds the
> demo driver.
> The subsystem code is basically drm code copied and renamed and slightly
> modified, but I really only worked on it for a couple of hours so take that into
> consideration.
>
> The important thing is that the demo driver shows the basic steps are really
> simple. You need to add two function calls in your probe and one function call in
> your release. Of course you will need to supply some function callbacks, but I
> haven't got to fill that in the demo driver. Once you register, you get
> /dev/accel/ac0 and
> /dev/accel/ac_controlD64 (if you want a control device). If I were to continue
> this, the next step is to do the open and close part.
>
> I will update once we know where things are heading. As I said, I imagine it can
> take a few weeks.
>
> Thanks,
> Oded

Hi Oded,
Thank you for uploading the framework as well as a sample.
It's exciting to see new software growing up.

Since Visconti DNN is a platform device, I'll write some test code to initialize the driver and see if it works.

Regards,
Yuji

2022-08-10 15:45:50

by Oded Gabbay

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Wed, Aug 10, 2022 at 5:10 PM <[email protected]> wrote:
>
> > -----Original Message-----
> > From: Oded Gabbay <[email protected]>
> > Sent: Wednesday, August 10, 2022 6:42 AM
> > To: Dave Airlie <[email protected]>; Greg Kroah-Hartman
> > <[email protected]>; ishikawa yuji(石川 悠司 ○RDC□AITC○
> > EA開) <[email protected]>; Jiho Chu <[email protected]>
> > Cc: dri-devel <[email protected]>; Arnd Bergmann
> > <[email protected]>; Linux-Kernel@Vger. Kernel. Org
> > <[email protected]>; Jason Gunthorpe <[email protected]>
> > Subject: Re: New subsystem for acceleration devices
> >
> > Hi Jiho, Yuji.
> >
> > I want to update that I'm currently in discussions with Dave to figure out what's
> > the best way to move forward. We are writing it down to do a proper comparison
> > between the two paths (new accel subsystem or using drm). I guess it will take
> > a week or so.
> >
> > In the meantime, I'm putting the accel code on hold. I have only managed to do
> > the very basic infra and add a demo driver that shows how to register and
> > unregister from it.
> > You can check the code at:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux.git/log/?h=ac
> > cel
> >
> > It has two commits. The first adds the subsystem code and the second adds the
> > demo driver.
> > The subsystem code is basically drm code copied and renamed and slightly
> > modified, but I really only worked on it for a couple of hours so take that into
> > consideration.
> >
> > The important thing is that the demo driver shows the basic steps are really
> > simple. You need to add two function calls in your probe and one function call in
> > your release. Of course you will need to supply some function callbacks, but I
> > haven't got to fill that in the demo driver. Once you register, you get
> > /dev/accel/ac0 and
> > /dev/accel/ac_controlD64 (if you want a control device). If I were to continue
> > this, the next step is to do the open and close part.
> >
> > I will update once we know where things are heading. As I said, I imagine it can
> > take a few weeks.
> >
> > Thanks,
> > Oded
>
> Hi Oded,
> Thank you for uploading the framework as well as a sample.
> It's exciting to see new software growing up.
>
> Since Visconti DNN is a platform device, I'll write some test code to initialize the driver and see if it works.
>
> Regards,
> Yuji

Platform or PCI, it doesn't matter. You just call it from the probe.
But really, this is something I did in a few hours and I stopped
because there were some objections and I wanted to first talk about it
with Dave.
I don't know if it's worth it for you to waste time on it at this point.

Thanks,
Oded

2022-08-23 19:37:26

by Kevin Hilman

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

Hi Obed,

Oded Gabbay <[email protected]> writes:

[...]

> I want to update that I'm currently in discussions with Dave to figure
> out what's the best way to move forward. We are writing it down to do
> a proper comparison between the two paths (new accel subsystem or
> using drm). I guess it will take a week or so.

Any update on the discussions with Dave? and/or are there any plans to
discuss this further at LPC/ksummit yet?

We (BayLibre) are upstreaming support for APUs on Mediatek SoCs, and are
using the DRM-based approach. I'll also be at LPC and happy to discuss
in person.

For some context on my/our interest: back in Sept 2020 we initially
submitted an rpmsg based driver for kernel communication[1]. After
review comments, we rewrote that based on DRM[2] and are now using it
for some MTK SoCs[3] and supporting our MTK customers with it.

Hopefully we will get the kernel interfaces sorted out soon, but next,
there's the userspace side of things. To that end, we're also working
on libAPU, a common, open userspace stack. Alex Bailon recently
presented a proposal earlier this year at Embedded Recipes in Paris
(video[4], slides[5].)

libAPU would include abstractions of the kernel interfaces for DRM
(using libdrm), remoteproc/rpmsg, virtio etc. but also goes farther and
proposes an open firmware for the accelerator side using
libMetal/OpenAMP + rpmsg for communication with (most likely closed
source) vendor firmware. Think of this like sound open firmware (SOF[6]),
but for accelerators.

We've been using this successfully for Mediatek SoCs (which have a
Cadence VP6 APU) and have submitted/published the code, including the
OpenAMP[7] and libmetal[8] parts in addition to the kernel parts already
mentioned.

We're to the point where we're pretty happy with how this works for MTK
SoCs, and wanting to collaborate with folks working on other platforms
and to see what's needed to support other kinds of accelerators with a
common userspace and open firmware infrastructure.

Kevin

[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/r/[email protected]
[3] https://lore.kernel.org/r/[email protected]
[4] https://www.youtube.com/watch?v=Uj1FZoF8MMw&t=18211s
[5] https://embedded-recipes.org/2022/wp-content/uploads/2022/06/bailon.pdf
[6] https://www.sofproject.org/
[7] https://github.com/BayLibre/open-amp/tree/v2021.10-mtk
[8] https://github.com/BayLibre/libmetal/tree/v2021.10-mtk

2022-08-23 21:01:37

by Oded Gabbay

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Tue, Aug 23, 2022 at 9:24 PM Kevin Hilman <[email protected]> wrote:
>
> Hi Obed,
>
> Oded Gabbay <[email protected]> writes:
>
> [...]
>
> > I want to update that I'm currently in discussions with Dave to figure
> > out what's the best way to move forward. We are writing it down to do
> > a proper comparison between the two paths (new accel subsystem or
> > using drm). I guess it will take a week or so.
>
> Any update on the discussions with Dave? and/or are there any plans to
> discuss this further at LPC/ksummit yet?
Hi Kevin.

We are still discussing the details, as at least the habanalabs driver
is very complex and there are multiple parts that I need to see if and
how they can be mapped to drm.
Some of us will attend LPC so we will probably take advantage of that
to talk more about this.

>
> We (BayLibre) are upstreaming support for APUs on Mediatek SoCs, and are
> using the DRM-based approach. I'll also be at LPC and happy to discuss
> in person.
>
> For some context on my/our interest: back in Sept 2020 we initially
> submitted an rpmsg based driver for kernel communication[1]. After
> review comments, we rewrote that based on DRM[2] and are now using it
> for some MTK SoCs[3] and supporting our MTK customers with it.
>
> Hopefully we will get the kernel interfaces sorted out soon, but next,
> there's the userspace side of things. To that end, we're also working
> on libAPU, a common, open userspace stack. Alex Bailon recently
> presented a proposal earlier this year at Embedded Recipes in Paris
> (video[4], slides[5].)
>
> libAPU would include abstractions of the kernel interfaces for DRM
> (using libdrm), remoteproc/rpmsg, virtio etc. but also goes farther and
> proposes an open firmware for the accelerator side using
> libMetal/OpenAMP + rpmsg for communication with (most likely closed
> source) vendor firmware. Think of this like sound open firmware (SOF[6]),
> but for accelerators.
I think your device and the habana device are very different in
nature, and it is part of what Dave and I discussed, whether these two
classes of devices can live together. I guess they can live together
in the kernel, but in the userspace, not so much imo.

The first class is the edge inference devices (usually as part of some
SoC). I think your description of the APU on MTK SoC is a classic
example of such a device.
You usually have some firmware you load, you give it a graph and
pointers for input and output and then you just execute the graph
again and again to perform inference and just replace the inputs.

The second class is the data-center, training accelerators, which
habana's gaudi device is classified as such. These devices usually
have a number of different compute engines, a fabric for scaling out,
on-device memory, internal MMUs and RAS monitoring requirements. Those
devices are usually operated via command queues, either through their
kernel driver or directly from user-space. They have multiple APIs for
memory management, RAS, scaling-out and command-submissions.

>
> We've been using this successfully for Mediatek SoCs (which have a
> Cadence VP6 APU) and have submitted/published the code, including the
> OpenAMP[7] and libmetal[8] parts in addition to the kernel parts already
> mentioned.
What's the difference between libmetal and other open-source low-level
runtime drivers, such as oneAPI level-zero ?

Currently we have our own runtime driver which is tightly coupled with
our h/w. For example, the method the userspace "talks" to the
data-plane firmware is very proprietary as it is hard-wired into the
architecture of the entire ASIC and how it performs deep-learning
training. Therefore, I don't see how this can be shared with other
vendors. Not because of secrecy but because it is simply not relevant
to any other ASIC.

>
> We're to the point where we're pretty happy with how this works for MTK
> SoCs, and wanting to collaborate with folks working on other platforms
> and to see what's needed to support other kinds of accelerators with a
> common userspace and open firmware infrastructure.
>
> Kevin
>
> [1] https://lore.kernel.org/r/[email protected]
> [2] https://lore.kernel.org/r/[email protected]
> [3] https://lore.kernel.org/r/[email protected]
> [4] https://www.youtube.com/watch?v=Uj1FZoF8MMw&t=18211s
> [5] https://embedded-recipes.org/2022/wp-content/uploads/2022/06/bailon.pdf
> [6] https://www.sofproject.org/
> [7] https://github.com/BayLibre/open-amp/tree/v2021.10-mtk
> [8] https://github.com/BayLibre/libmetal/tree/v2021.10-mtk
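
A userspace-side sketch of the load-once/run-many edge-inference pattern
described above (load a graph, point it at inputs and outputs, execute it
repeatedly), with an invented uAPI (struct accel_run_args, ACCEL_LOAD_GRAPH,
ACCEL_RUN_GRAPH); a real driver such as the MTK APU one defines its own
interface:

#include <fcntl.h>
#include <stdint.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Invented uAPI, for illustration only. */
struct accel_run_args {
	uint64_t input_ptr;
	uint64_t output_ptr;
};
#define ACCEL_LOAD_GRAPH	_IOW('A', 0x20, struct accel_run_args)
#define ACCEL_RUN_GRAPH		_IOW('A', 0x21, struct accel_run_args)

int run_inference(void *graph_blob, void *input, void *output, int n_frames)
{
	int fd = open("/dev/accel/ac0", O_RDWR);
	struct accel_run_args args = {
		.input_ptr  = (uintptr_t)input,
		.output_ptr = (uintptr_t)output,
	};
	int i;

	if (fd < 0)
		return -1;

	if (ioctl(fd, ACCEL_LOAD_GRAPH, graph_blob) < 0)	/* load the graph once */
		goto err;

	for (i = 0; i < n_frames; i++)	/* then run it again and again */
		if (ioctl(fd, ACCEL_RUN_GRAPH, &args) < 0)
			goto err;

	close(fd);
	return 0;
err:
	close(fd);
	return -1;
}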

2022-08-29 20:58:19

by Kevin Hilman

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

Hi Oded (and sorry I misspelled your name last time),

Oded Gabbay <[email protected]> writes:

> On Tue, Aug 23, 2022 at 9:24 PM Kevin Hilman <[email protected]> wrote:
>>
>> Hi Obed,
>>
>> Oded Gabbay <[email protected]> writes:
>>
>> [...]
>>
>> > I want to update that I'm currently in discussions with Dave to figure
>> > out what's the best way to move forward. We are writing it down to do
>> > a proper comparison between the two paths (new accel subsystem or
>> > using drm). I guess it will take a week or so.
>>
>> Any update on the discussions with Dave? and/or are there any plans to
>> discuss this further at LPC/ksummit yet?
> Hi Kevin.
>
> We are still discussing the details, as at least the habanalabs driver
> is very complex and there are multiple parts that I need to see if and
> how they can be mapped to drm.
> Some of us will attend LPC so we will probably take advantage of that
> to talk more about this.

OK, looking forward to some more conversations at LPC.

>>
>> We (BayLibre) are upstreaming support for APUs on Mediatek SoCs, and are
>> using the DRM-based approach. I'll also be at LPC and happy to discuss
>> in person.
>>
>> For some context on my/our interest: back in Sept 2020 we initially
>> submitted an rpmsg based driver for kernel communication[1]. After
>> review comments, we rewrote that based on DRM[2] and are now using it
>> for some MTK SoCs[3] and supporting our MTK customers with it.
>>
>> Hopefully we will get the kernel interfaces sorted out soon, but next,
>> there's the userspace side of things. To that end, we're also working
>> on libAPU, a common, open userspace stack. Alex Bailon recently
>> presented a proposal earlier this year at Embedded Recipes in Paris
>> (video[4], slides[5].)
>>
>> libAPU would include abstractions of the kernel interfaces for DRM
>> (using libdrm), remoteproc/rpmsg, virtio etc. but also goes farther and
>> proposes an open firmware for the accelerator side using
>> libMetal/OpenAMP + rpmsg for communication with (most likely closed
>> source) vendor firmware. Think of this like sound open firmware (SOF[6]),
>> but for accelerators.
>
> I think your device and the habana device are very different in
> nature, and it is part of what Dave and I discussed, whether these two
> classes of devices can live together. I guess they can live together
> in the kernel, but in the userspace, not so much imo.

Yeah, for now I think focusing on how to handle both classes of devices
in the kernel is the most important.

> The first class is the edge inference devices (usually as part of some
> SoC). I think your description of the APU on MTK SoC is a classic
> example of such a device.

Correct.

> You usually have some firmware you load, you give it a graph and
> pointers for input and output and then you just execute the graph
> again and again to perform inference and just replace the inputs.
>
> The second class is the data-center, training accelerators, which
> habana's gaudi device is classified as such. These devices usually
> have a number of different compute engines, a fabric for scaling out,
> on-device memory, internal MMUs and RAS monitoring requirements. Those
> devices are usually operated via command queues, either through their
> kernel driver or directly from user-space. They have multiple APIs for
> memory management, RAS, scaling-out and command-submissions.

OK, I see.

>>
>> We've been using this successfully for Mediatek SoCs (which have a
>> Cadence VP6 APU) and have submitted/published the code, including the
>> OpenAMP[7] and libmetal[8] parts in addition to the kernel parts already
>> mentioned.
> What's the difference between libmetal and other open-source low-level
> runtime drivers, such as oneAPI level-zero ?

TBH, I'd never heard of oneAPI before, so I'm assuming it's mainly
focused on the data center. libmetal/openAMP are widely used
in the consumer and industrial embedded space, and heavily used by FPGAs in
many market segments.

> Currently we have our own runtime driver which is tightly coupled with
> our h/w. For example, the method the userspace "talks" to the
> data-plane firmware is very proprietary as it is hard-wired into the
> architecture of the entire ASIC and how it performs deep-learning
> training. Therefore, I don't see how this can be shared with other
> vendors. Not because of secrecy but because it is simply not relevant
> to any other ASIC.

OK, makes sense.

Thanks for clarifying your use case in more detail.

Kevin

2022-09-23 16:41:38

by Oded Gabbay

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Mon, Aug 29, 2022 at 11:54 PM Kevin Hilman <[email protected]> wrote:
>
> Hi Oded (and sorry I misspelled your name last time),
>
> Oded Gabbay <[email protected]> writes:
>
> > On Tue, Aug 23, 2022 at 9:24 PM Kevin Hilman <[email protected]> wrote:
> >>
> >> Hi Obed,
> >>
> >> Oded Gabbay <[email protected]> writes:
> >>
> >> [...]
> >>
> >> > I want to update that I'm currently in discussions with Dave to figure
> >> > out what's the best way to move forward. We are writing it down to do
> >> > a proper comparison between the two paths (new accel subsystem or
> >> > using drm). I guess it will take a week or so.
> >>
> >> Any update on the discussions with Dave? and/or are there any plans to
> >> discuss this further at LPC/ksummit yet?
> > Hi Kevin.
> >
> > We are still discussing the details, as at least the habanalabs driver
> > is very complex and there are multiple parts that I need to see if and
> > how they can be mapped to drm.
> > Some of us will attend LPC so we will probably take advantage of that
> > to talk more about this.
>
> OK, looking forward to some more conversations at LPC.
>
> >>
> >> We (BayLibre) are upstreaming support for APUs on Mediatek SoCs, and are
> >> using the DRM-based approach. I'll also be at LPC and happy to discuss
> >> in person.
> >>
> >> For some context on my/our interest: back in Sept 2020 we initially
> >> submitted an rpmsg based driver for kernel communication[1]. After
> >> review comments, we rewrote that based on DRM[2] and are now using it
> >> for some MTK SoCs[3] and supporting our MTK customers with it.
> >>
> >> Hopefully we will get the kernel interfaces sorted out soon, but next,
> >> there's the userspace side of things. To that end, we're also working
> >> on libAPU, a common, open userspace stack. Alex Bailon recently
> >> presented a proposal earlier this year at Embedded Recipes in Paris
> >> (video[4], slides[5].)
> >>
> >> libAPU would include abstractions of the kernel interfaces for DRM
> >> (using libdrm), remoteproc/rpmsg, virtio etc. but also goes farther and
> >> proposes an open firmware for the accelerator side using
> >> libMetal/OpenAMP + rpmsg for communication with (most likely closed
> >> source) vendor firmware. Think of this like sound open firmware (SOF[6]),
> >> but for accelerators.
> >
> > I think your device and the habana device are very different in
> > nature, and it is part of what Dave and I discussed, whether these two
> > classes of devices can live together. I guess they can live together
> > in the kernel, but in the userspace, not so much imo.
>
> Yeah, for now I think focusing on how to handle both classes of devices
> in the kernel is the most important.
>
> > The first class is the edge inference devices (usually as part of some
> > SoC). I think your description of the APU on MTK SoC is a classic
> > example of such a device.
>
> Correct.
>
> > You usually have some firmware you load, you give it a graph and
> > pointers for input and output and then you just execute the graph
> > again and again to perform inference and just replace the inputs.
> >
> > The second class is the data-center, training accelerators, which
> > habana's gaudi device is classified as such. These devices usually
> > have a number of different compute engines, a fabric for scaling out,
> > on-device memory, internal MMUs and RAS monitoring requirements. Those
> > devices are usually operated via command queues, either through their
> > kernel driver or directly from user-space. They have multiple APIs for
> > memory management, RAS, scaling-out and command-submissions.
>
> OK, I see.
>
> >>
> >> We've been using this successfully for Mediatek SoCs (which have a
> >> Cadence VP6 APU) and have submitted/published the code, including the
> >> OpenAMP[7] and libmetal[8] parts in addition to the kernel parts already
> >> mentioned.
> > What's the difference between libmetal and other open-source low-level
> > runtime drivers, such as oneAPI level-zero ?
>
> TBH, I'd never heard of oneAPI before, so I'm assuming it's mainly
> focused on the data center. libmetal/openAMP are widely used
> in the consumer and industrial embedded space, and heavily used by FPGAs in
> many market segments.
>
> > Currently we have our own runtime driver which is tightly coupled with
> > our h/w. For example, the method the userspace "talks" to the
> > data-plane firmware is very proprietary as it is hard-wired into the
> > architecture of the entire ASIC and how it performs deep-learning
> > training. Therefore, I don't see how this can be shared with other
> > vendors. Not because of secrecy but because it is simply not relevant
> > to any other ASIC.
>
> OK, makes sense.
>
> Thanks for clarifying your use case in more detail.
>
> Kevin

Hi all,
I wanted to update on this issue for those of you who weren't in LPC.
We had a BOF session about this topic with most, if not all, of the
relevant people - DRM maintainers, Greg, and other subsystem and device
driver maintainers.
Dave Airlie summarized the session in his blog -
https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

TL;DR
Accelerators drivers will use the DRM subsystem code, but they will be
located in a different directory (drivers/accel) and will be exposed
to the userspace using a new major and a new device char name
(/dev/accelX).

I'm currently working on preparing some prerequisite patches for the
DRM subsystem to support the new subsystem (e.g. new major number).
Once that is merged, I plan to move the habanalabs driver to the
new location and convert it to use the modified DRM framework code.

Thanks,
Oded

2022-09-26 08:51:06

by Christoph Hellwig

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

Btw, there is another interesting thing around on the block:

NVMe Computational Storage devices. Don't be fooled by the name, much
of it is about neither computation nor storage, but it allows using
the existing NVMe queuing model to allow access to arbitrary
accelerators, including a way to expose access to on-device memory.

The probably most current version is here:

https://www.snia.org/educational-library/nvme-computational-storage-update-standard-2022

The first version will be rather limited and miss some important
functionality like directly accessing host DRAM or CXL integration,
but much of that is planned. The initial version also probably won't
be able to be supported by Linux at all, but we need to think hard about
how to support it.

It would also be really helpful to get more people with accelerator
experience into the NVMe working group.

2022-09-29 07:08:44

by Oded Gabbay

[permalink] [raw]
Subject: Re: New subsystem for acceleration devices

On Mon, Sep 26, 2022 at 11:16 AM Christoph Hellwig <[email protected]> wrote:
>
> Btw, there is another interesting thing around on the block:
>
> NVMe Computational Storage devices. Don't be fooled by the name, much
> of it is about neither computation nor storage, but it allows using
> the existing NVMe queuing model to allow access to arbitrary
> accelerators, including a way to expose access to on-device memory.
>
> The probably most current version is here:
>
> https://www.snia.org/educational-library/nvme-computational-storage-update-standard-2022
Thanks for the link. Indeed, there were some people in the BOF that
mentioned computational storage as something that is relevant.
I'll watch the presentation to understand the direction it is going
and how it maps to what we were planning to do.

>
> The first version will be rather limited and miss some important
> functionality like directly accessing host DRAM or CXL integration,
> but much of that is planned. The initial version also probably won't
> be able to be supported by Linux at all, but we need to think hard about
> how to support it.
>
> It would also be really helpful to get more people with accelerator
> experience into the NVMe working group.
I will be happy to help and contribute.

Oded