2023-10-19 15:25:01

by Jakub Kicinski

[permalink] [raw]
Subject: Re: mlx5 ConnectX diagnostic misc driver

> The ConnectX HW family supported by the mlx5 drivers uses an architecture
> where a FW component executes "mailbox RPCs" issued by the driver to make
> changes to the device. This results in a complex debugging environment
> where the FW component has information and complex low level state that
> needs to be accessed to userspace for debugging purposes.

You're being very dishonest towards Greg by not telling him that this
is a networking device, and the networking maintainers explicitly nacked
this backdoor. Nacked it, because you can't answer basic questions like
"what are the use cases" with more than "custom config and debug".

Whether Greg wants to take this into the "misc" pile is entirely up
to him, but you gotta give him more context.


2023-10-19 15:36:36

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: mlx5 ConnectX diagnostic misc driver

On Thu, Oct 19, 2023 at 08:24:51AM -0700, Jakub Kicinski wrote:
> > The ConnectX HW family supported by the mlx5 drivers uses an architecture
> > where a FW component executes "mailbox RPCs" issued by the driver to make
> > changes to the device. This results in a complex debugging environment
> > where the FW component has information and complex low level state that
> > needs to be accessed to userspace for debugging purposes.
>
> You're being very dishonest towards Greg by not telling him that this
> is a networking device, and the networking maintainers explicitly nacked
> this backdoor. Nacked it, because you can't answer basic questions like
> "what are the use cases" with more than "custom config and debug".
>
> Whether Greg wants to take this into the "misc" pile is entirely up
> to him, but you gotta give him more context.

Well, in this case, no way in hell will I be taking this. If this is a
networking device, it needs to go through the normal networking driver
review process, thanks for the heads up.

greg k-h

2023-10-19 16:02:08

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: mlx5 ConnectX diagnostic misc driver

On Thu, Oct 19, 2023 at 05:36:04PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Oct 19, 2023 at 08:24:51AM -0700, Jakub Kicinski wrote:
> > > The ConnectX HW family supported by the mlx5 drivers uses an architecture
> > > where a FW component executes "mailbox RPCs" issued by the driver to make
> > > changes to the device. This results in a complex debugging environment
> > > where the FW component has information and complex low level state that
> > > needs to be accessed to userspace for debugging purposes.
> >
> > You're being very dishonest towards Greg by not telling him that this
> > is a networking device, and the networking maintainers explicitly nacked
> > this backdoor.

Do you have a lore link?

> Well, in this case, no way in hell will I be taking this. If this is a
> networking device, it needs to go through the normal networking driver
> review process, thanks for the heads up.

It is not just a networking device. mlx5 is a giant and complex
multi-subsystem piece of hardware.

This is shared debugging and configuration interface to the device FW
that interacts across all of the different subsystems the driver
supports.

Looking at Saeed's tool capability on his github it is significantly,
but not exclusively supporting RDMA (ie drivers/infiniband), with some
features for the mlx5 VFIO drivers, mlx5 VDPA and a bunch of lowlevel
PCI stuff too.

Calling it a "networking device" in the sense of "it is owned only be
netdev" is not accurate.

We think misc is an appropriate place to put something like this,
there are many other misc drivers that are sort of similar APIs to
access embedded FW to manage and debug it.

Jason

2023-10-19 16:11:10

by Saeed Mahameed

[permalink] [raw]
Subject: Re: mlx5 ConnectX diagnostic misc driver

On 19 Oct 08:24, Jakub Kicinski wrote:
>> The ConnectX HW family supported by the mlx5 drivers uses an architecture
>> where a FW component executes "mailbox RPCs" issued by the driver to make
>> changes to the device. This results in a complex debugging environment
>> where the FW component has information and complex low level state that
>> needs to be accessed to userspace for debugging purposes.
>
>You're being very dishonest towards Greg by not telling him that this
>is a networking device, and the networking maintainers explicitly nacked

This is not a netwroking only device, as described in the tools
documentation quoting:

"ConnectX devices are complex and provide a vast set of features
and components (SmartNiCs SoCs, Multi-protocol Network Adapters with
Ethernent, Infiniband, Storage, DPU, and many acceleration and offload
capabilities).

This project will provide unified tool set to access, debug, diagnose and
monitor those devices using the ConnectX architecture and onboard
processors and firmware."

mlx5 has multiple drivers in multiple subsystems, it's not only networking or
netdev, a huge part of mlx5 is the mlx5 RDMA driver another is mlx5 vdpa
driver, and the list goes on, including virtio/vfio.
ConnectX supports multiple types of PF/VFs/SFs and at least a
dozen of ULPs and aux devices.

A unified diag driver that provideslow level understanding of the ConnectX
architecture is the best way to go here, so it can serve everyone using a
ConnectX device, regardless of the subsystem it is being used for..


>this backdoor. Nacked it, because you can't answer basic questions like
>"what are the use cases" with more than "custom config and debug".
>

I think the tools project provides a clear picture on what the uses cases
are, it is already part of this cover-letter:
Please see the documentation, and let me know if you have other questions:
https://github.com/saeedtx/mlx5ctl

>Whether Greg wants to take this into the "misc" pile is entirely up
>to him, but you gotta give him more context.

2023-10-19 16:24:03

by Jakub Kicinski

[permalink] [raw]
Subject: Re: mlx5 ConnectX diagnostic misc driver

On Thu, 19 Oct 2023 13:01:45 -0300 Jason Gunthorpe wrote:
> Do you have a lore link?

No, it was pitched at conferences:

Last year's LPC: https://www.youtube.com/watch?v=JGR9ZCeiW-E
This year's netconf / KR, but Saeed didn't have slides:
https://netdev.bots.linux.dev/netconf/2023/index.html

Really, you should be asking Saeed this, not me.

> Looking at Saeed's tool capability on his github it is significantly,
> but not exclusively supporting RDMA (ie drivers/infiniband), with some
> features for the mlx5 VFIO drivers, mlx5 VDPA and a bunch of lowlevel
> PCI stuff too.
>
> Calling it a "networking device" in the sense of "it is owned only be
> netdev" is not accurate.

Yes, let's now have a pointless augment about when a NIC is a NIC
and when it's no longer a NIC because it has offloads.

My point is Saeed pitched this to networking maintainers *twice*
and then purposefully left us out of the CC list.
That is absolutely unacceptable IMO, but I'd like to consult with
others to make sure it's not just me. Please allow me some time
to do that.

2023-10-19 16:51:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: mlx5 ConnectX diagnostic misc driver

On Thu, Oct 19, 2023 at 09:23:46AM -0700, Jakub Kicinski wrote:
> On Thu, 19 Oct 2023 13:01:45 -0300 Jason Gunthorpe wrote:
> > Do you have a lore link?
>
> No, it was pitched at conferences:
>
> Last year's LPC: https://www.youtube.com/watch?v=JGR9ZCeiW-E
> This year's netconf / KR, but Saeed didn't have slides:
> https://netdev.bots.linux.dev/netconf/2023/index.html
>
> Really, you should be asking Saeed this, not me.

You brought it up. I was at that LPC talk, and you made it clear not
to even propose using devlink. Seems like that is what Saeed did to
me.

> > Looking at Saeed's tool capability on his github it is significantly,
> > but not exclusively supporting RDMA (ie drivers/infiniband), with some
> > features for the mlx5 VFIO drivers, mlx5 VDPA and a bunch of lowlevel
> > PCI stuff too.
> >
> > Calling it a "networking device" in the sense of "it is owned only be
> > netdev" is not accurate.
>
> Yes, let's now have a pointless augment about when a NIC is a NIC
> and when it's no longer a NIC because it has offloads.

This is not a pointless argument. How mlx5 hardware works in the many
subsystems it works with is NOT your exclusive decision to dictate
just because it has a shared networking port on the back of the card.

Lets focus on a community conensus approach between all the subsystems
please. If you have a better suggestion how to address this
cross-subsystem need then please share it.

> My point is Saeed pitched this to networking maintainers *twice*

I read that as for a devlink version, this is clearly not devlink.

> My point is Saeed pitched this to networking maintainers *twice*
> and then purposefully left us out of the CC list.
> That is absolutely unacceptable IMO, but I'd like to consult with
> others to make sure it's not just me. Please allow me some time
> to do that.

It is standard practice not to CC'd netdev for mlx5 related work
outside the scope of the netdev driver.

Jason

2023-10-19 17:12:31

by Leon Romanovsky

[permalink] [raw]
Subject: Re: mlx5 ConnectX diagnostic misc driver

On Thu, Oct 19, 2023 at 05:36:04PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Oct 19, 2023 at 08:24:51AM -0700, Jakub Kicinski wrote:
> > > The ConnectX HW family supported by the mlx5 drivers uses an architecture
> > > where a FW component executes "mailbox RPCs" issued by the driver to make
> > > changes to the device. This results in a complex debugging environment
> > > where the FW component has information and complex low level state that
> > > needs to be accessed to userspace for debugging purposes.
> >
> > You're being very dishonest towards Greg by not telling him that this
> > is a networking device, and the networking maintainers explicitly nacked
> > this backdoor. Nacked it, because you can't answer basic questions like
> > "what are the use cases" with more than "custom config and debug".
> >
> > Whether Greg wants to take this into the "misc" pile is entirely up
> > to him, but you gotta give him more context.
>
> Well, in this case, no way in hell will I be taking this. If this is a
> networking device,

It is not networking device. Only by coincidence, the PCI core of mlx5
driver is located in netdev, and it serves other subsystems as well
(VFIO, VDPA, RDMA, e.t.c.).

Just as a reminder, we added auxiliary bus to make sure that this device
is properly decoupled across various subsystems.

> it needs to go through the normal networking driver review process,
> thanks for the heads up.

That process is not relevant here, not everything in the world is netdev.

Thanks

>
> greg k-h

2023-10-19 17:59:37

by Saeed Mahameed

[permalink] [raw]
Subject: Re: mlx5 ConnectX diagnostic misc driver

On 19 Oct 09:23, Jakub Kicinski wrote:
>On Thu, 19 Oct 2023 13:01:45 -0300 Jason Gunthorpe wrote:
>> Do you have a lore link?
>
>No, it was pitched at conferences:
>
>Last year's LPC: https://www.youtube.com/watch?v=JGR9ZCeiW-E

Yes this was one of the problems I discussed at LPC, please keep in mind
that I raised multiple other unrelated problems too.

In this project we discuss the debug-ability features only,
The orchestration features are a separate project that is still
on-going work to be under devlink.

We all agreed that devlink isn't the place for such vast device debug
options, hence this is our proposal, a separate aux device that works in
unison with other aux devices such as virtio, vpda,rdma,netdev, etc ..

>This year's netconf / KR, but Saeed didn't have slides:
>https://netdev.bots.linux.dev/netconf/2023/index.html
>

At netconf/KR I only discussed the orchestration project we are working
on to improve an going work we are doing, multi netdev interface, such
as the socket direct issue we discussed, this has nothing to do with
this project proposal which cover RDMA,VDPA,netdev,virtio,Blue-Field,
etc .. ConnectX debug-ability and diagnosis problem.

>Really, you should be asking Saeed this, not me.
>