2023-11-21 07:08:29

by Saeed Mahameed

[permalink] [raw]
Subject: [PATCH V3 0/5] mlx5 ConnectX control misc driver

From: Saeed Mahameed <[email protected]>

V1: https://lore.kernel.org/all/[email protected]/#r
V2: https://lore.kernel.org/all/[email protected]/#r

V2->V3:
- Fix bad Sign-off line
- Fix kernel robot warnings, define a user ptr arg for umem_unreg ioctl
instead of plain integer to simplify compat_ioctl usage added in V2

V1->V2:
- Provide legal statement and sign-off for dual license use
- Fix License clause to use: BSD-3-Clause OR GPL-2.0
- Fix kernel robot warnings
- Use dev_dbg directly instead of umem_dbg() local wrapper
- Implement .compat_ioctl for 32bit compatibility
- Fix mlx5ctl_info ABI structure size and alignment
- Local pointer to correct type instead of in-place cast
- Check unused fields and flags are 0 on ioctl path
- Use correct macro to declare scalar arg ioctl command
#define MLX5CTL_IOCTL_UMEM_UNREG \
_IO(MLX5CTL_IOCTL_MAGIC, 0x3)

mlx5 ConnectX control misc driver
=================================

The ConnectX HW family supported by the mlx5 drivers uses an architecture
where a FW component executes "mailbox RPCs" issued by the driver to make
changes to the device. This results in a complex debugging environment
where the FW component has information and low level configuration that
needs to be accessed to userspace for debugging purposes.

Historically a userspace program was used that accessed the PCI register
and config space directly through /sys/bus/pci/.../XXX and could operate
these debugging interfaces in parallel with the running driver.
This approach is incompatible with secure boot and kernel lockdown so this
driver provides a secure and restricted interface to that.

1) The first patch in the series introduces the main driver file with the
implementation of a new mlx5 auxiliary device driver to run on top
mlx5_core device instances, on probe it creates a new misc device and in
this patch we implement the open and release fops, On open the driver
would allocate a special FW UID (user context ID) restricted to debug
RPCs only, where all user debug rpcs will be executed under this UID,
and on release the UID will be freed.

2) The second patch adds an info ioctl that will show the allocated UID
and the available capability masks of the device and the current UID, and
some other useful device information such as the underlying ConnectX

Example:
$ sudo ./mlx5ctlu mlx5_core.ctl.0
mlx5dev: 0000:00:04.0
UCTX UID: 1
UCTX CAP: 0x3
DEV UCTX CAP: 0x3
USER CAP: 0x1d

3) Third patch will add the capability to execute debug RPCs under the
special UID.

In the mlx5 architecture the FW RPC commands are of the format of
inbox and outbox buffers. The inbox buffer contains the command
rpc layout as described in the ConnectX Programmers Reference Manual
(PRM) document and as defined in linux/include/mlx5/mlx5_ifc.h.

On success the user outbox buffer will be filled with the device's rpc
response.

For example to query device capabilities:
a user fills out an inbox buffer with the inbox layout:
struct mlx5_ifc_query_hca_cap_in_bits
and expects an outbox buffer with the layout:
struct mlx5_ifc_cmd_hca_cap_bits

4) The fourth patch adds the ability to register user memory into the
ConntectX device and create a umem object that points to that memory.

Command rpc outbox buffer is limited in size, which can be very
annoying when trying to pull large traces out of the device.
Many rpcs offer the ability to scatter output traces, contexts
and logs directly into user space buffers in a single shot.

The registered memory will be described by a device UMEM object which
has a unique umem_id, this umem_id can be later used in the rpc inbox
to tell the device where to populate the response output,
e.g HW traces and other debug object queries.

Example usecase, a ConnectX device coredump can be as large as 2MB.
Using inline rpcs will take thousands of rpcs to get the full
coredump which can consume multiple seconds.

With UMEM, it can be done in a single rpc, using 2MB of umem user buffer.

Other usecases with umem:
- dynamic HW and FW trace monitoring
- high frequency diagnostic counters sampling
- batched objects and resource dumps

See links below for information about user space tools that use this
interface:

[1] https://github.com/saeedtx/mlx5ctl

[2] https://github.com/Mellanox/mstflint
see:

d) mstregdump utility
This utility dumps hardware registers from Mellanox hardware
for later analysis by Mellanox.

g) mstconfig
This tool sets or queries non-volatile configurable options
for Mellanox HCAs.

h) mstfwmanager
Mellanox firmware update and query utility which scans the system
for available Mellanox devices (only mst PCI devices) and performs
the necessary firmware updates.

i) mstreg
The mlxreg utility allows users to obtain information regarding
supported access registers, such as their fields

License: BSD-3-Clause OR GPL-2.0
================================
After a review of this thread [3], and a conversation with the LF,
Mellanox and NVIDIA legal continue to approve the use of a Dual GPL &
Permissive License for mlx5 related driver contributions. This makes it
clear to future contributors that this file may be adapted and reused
under BSD-3-Clause terms on other operating systems. Contributions will
be handled in the normal way and the dual license will apply
automatically. If people wish to contribute significantly and opt out of
a dual license they may separate their GPL only contributions in dedicated
files.

Jason has a signing authority for NVIDIA and has gone through our internal
process to get approval.

Signed-off-by: Jason Gunthorpe <[email protected]> # for legal

[3] https://lore.kernel.org/all/[email protected]/#r

================================

Signed-off-by: Saeed Mahameed <[email protected]>

Saeed Mahameed (5):
mlx5: Add aux dev for ctl interface
misc: mlx5ctl: Add mlx5ctl misc driver
misc: mlx5ctl: Add info ioctl
misc: mlx5ctl: Add command rpc ioctl
misc: mlx5ctl: Add umem reg/unreg ioctl

.../userspace-api/ioctl/ioctl-number.rst | 1 +
MAINTAINERS | 8 +
drivers/misc/Kconfig | 1 +
drivers/misc/Makefile | 1 +
drivers/misc/mlx5ctl/Kconfig | 14 +
drivers/misc/mlx5ctl/Makefile | 5 +
drivers/misc/mlx5ctl/main.c | 579 ++++++++++++++++++
drivers/misc/mlx5ctl/umem.c | 322 ++++++++++
drivers/misc/mlx5ctl/umem.h | 17 +
drivers/net/ethernet/mellanox/mlx5/core/dev.c | 8 +
include/uapi/misc/mlx5ctl.h | 59 ++
11 files changed, 1015 insertions(+)
create mode 100644 drivers/misc/mlx5ctl/Kconfig
create mode 100644 drivers/misc/mlx5ctl/Makefile
create mode 100644 drivers/misc/mlx5ctl/main.c
create mode 100644 drivers/misc/mlx5ctl/umem.c
create mode 100644 drivers/misc/mlx5ctl/umem.h
create mode 100644 include/uapi/misc/mlx5ctl.h

--
2.42.0


2023-11-21 07:08:36

by Saeed Mahameed

[permalink] [raw]
Subject: [PATCH V3 1/5] mlx5: Add aux dev for ctl interface

From: Saeed Mahameed <[email protected]>

Allow ctl protocol interface auxiliary driver in mlx5.

Signed-off-by: Saeed Mahameed <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/dev.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/dev.c
index cf0477f53dc4..753c11f7050d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/dev.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/dev.c
@@ -228,8 +228,14 @@ enum {
MLX5_INTERFACE_PROTOCOL_VNET,

MLX5_INTERFACE_PROTOCOL_DPLL,
+ MLX5_INTERFACE_PROTOCOL_CTL,
};

+static bool is_ctl_supported(struct mlx5_core_dev *dev)
+{
+ return MLX5_CAP_GEN(dev, uctx_cap);
+}
+
static const struct mlx5_adev_device {
const char *suffix;
bool (*is_supported)(struct mlx5_core_dev *dev);
@@ -252,6 +258,8 @@ static const struct mlx5_adev_device {
.is_supported = &is_mp_supported },
[MLX5_INTERFACE_PROTOCOL_DPLL] = { .suffix = "dpll",
.is_supported = &is_dpll_supported },
+ [MLX5_INTERFACE_PROTOCOL_CTL] = { .suffix = "ctl",
+ .is_supported = &is_ctl_supported },
};

int mlx5_adev_idx_alloc(void)
--
2.42.0

2023-11-21 07:08:39

by Saeed Mahameed

[permalink] [raw]
Subject: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

From: Saeed Mahameed <[email protected]>

The ConnectX HW family supported by the mlx5 drivers uses an architecture
where a FW component executes "mailbox RPCs" issued by the driver to make
changes to the device. This results in a complex debugging environment
where the FW component has information and low level configuration that
needs to be accessed to userspace for debugging purposes.

Historically a userspace program was used that accessed the PCI register
and config space directly through /sys/bus/pci/.../XXX and could operate
these debugging interfaces in parallel with the running driver.
This approach is incompatible with secure boot and kernel lockdown so this
driver provides a secure and restricted interface to that same data.

On open the driver would allocate a special FW UID (user context ID)
restrected to debug RPCs only, later in this series all user RPCs will
be stamped with this UID.

Reviewed-by: Jiri Pirko <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
---
MAINTAINERS | 8 +
drivers/misc/Kconfig | 1 +
drivers/misc/Makefile | 1 +
drivers/misc/mlx5ctl/Kconfig | 14 ++
drivers/misc/mlx5ctl/Makefile | 4 +
drivers/misc/mlx5ctl/main.c | 314 ++++++++++++++++++++++++++++++++++
6 files changed, 342 insertions(+)
create mode 100644 drivers/misc/mlx5ctl/Kconfig
create mode 100644 drivers/misc/mlx5ctl/Makefile
create mode 100644 drivers/misc/mlx5ctl/main.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 97f51d5ec1cf..4b532df16b19 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13835,6 +13835,14 @@ L: [email protected]
S: Supported
F: drivers/vdpa/mlx5/

+MELLANOX MLX5 ConnectX Diag DRIVER
+M: Saeed Mahameed <[email protected]>
+R: Itay Avraham <[email protected]>
+L: [email protected]
+S: Supported
+F: drivers/misc/mlx5ctl/
+F: include/uapi/misc/mlx5ctl.h
+
MELLANOX MLXCPLD I2C AND MUX DRIVER
M: Vadim Pasternak <[email protected]>
M: Michael Shych <[email protected]>
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index f37c4b8380ae..c0e4823648ed 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -579,4 +579,5 @@ source "drivers/misc/cardreader/Kconfig"
source "drivers/misc/uacce/Kconfig"
source "drivers/misc/pvpanic/Kconfig"
source "drivers/misc/mchp_pci1xxxx/Kconfig"
+source "drivers/misc/mlx5ctl/Kconfig"
endmenu
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index f2a4d1ff65d4..49bc4697f498 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -67,3 +67,4 @@ obj-$(CONFIG_TMR_MANAGER) += xilinx_tmr_manager.o
obj-$(CONFIG_TMR_INJECT) += xilinx_tmr_inject.o
obj-$(CONFIG_TPS6594_ESM) += tps6594-esm.o
obj-$(CONFIG_TPS6594_PFSM) += tps6594-pfsm.o
+obj-$(CONFIG_MLX5CTL) += mlx5ctl/
diff --git a/drivers/misc/mlx5ctl/Kconfig b/drivers/misc/mlx5ctl/Kconfig
new file mode 100644
index 000000000000..faaa1dba2cc2
--- /dev/null
+++ b/drivers/misc/mlx5ctl/Kconfig
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+
+config MLX5CTL
+ tristate "mlx5 ConnectX control misc driver"
+ depends on MLX5_CORE
+ help
+ MLX5CTL provides interface for the user process to access the debug and
+ configuration registers of the ConnectX hardware family
+ (NICs, PCI switches and SmartNIC SoCs).
+ This will allow configuration and debug tools to work out of the box on
+ mainstream kernel.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/misc/mlx5ctl/Makefile b/drivers/misc/mlx5ctl/Makefile
new file mode 100644
index 000000000000..b5c7f99e0ab6
--- /dev/null
+++ b/drivers/misc/mlx5ctl/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_MLX5CTL) += mlx5ctl.o
+mlx5ctl-y := main.o
diff --git a/drivers/misc/mlx5ctl/main.c b/drivers/misc/mlx5ctl/main.c
new file mode 100644
index 000000000000..8eb150461b80
--- /dev/null
+++ b/drivers/misc/mlx5ctl/main.c
@@ -0,0 +1,314 @@
+// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/auxiliary_bus.h>
+#include <linux/mlx5/device.h>
+#include <linux/mlx5/driver.h>
+#include <linux/atomic.h>
+#include <linux/refcount.h>
+
+MODULE_DESCRIPTION("mlx5 ConnectX control misc driver");
+MODULE_AUTHOR("Saeed Mahameed <[email protected]>");
+MODULE_LICENSE("Dual BSD/GPL");
+
+struct mlx5ctl_dev {
+ struct mlx5_core_dev *mdev;
+ struct miscdevice miscdev;
+ struct auxiliary_device *adev;
+ struct list_head fd_list;
+ spinlock_t fd_list_lock; /* protect list add/del */
+ struct rw_semaphore rw_lock;
+ struct kref refcount;
+};
+
+struct mlx5ctl_fd {
+ u16 uctx_uid;
+ u32 uctx_cap;
+ u32 ucap; /* user cap */
+ struct mlx5ctl_dev *mcdev;
+ struct list_head list;
+};
+
+#define mlx5ctl_err(mcdev, format, ...) \
+ dev_err(mcdev->miscdev.parent, format, ##__VA_ARGS__)
+
+#define mlx5ctl_dbg(mcdev, format, ...) \
+ dev_dbg(mcdev->miscdev.parent, "PID %d: " format, \
+ current->pid, ##__VA_ARGS__)
+
+enum {
+ MLX5_UCTX_OBJECT_CAP_RAW_TX = 0x1,
+ MLX5_UCTX_OBJECT_CAP_INTERNAL_DEVICE_RESOURCES = 0x2,
+ MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES = 0x4,
+};
+
+static int mlx5ctl_alloc_uid(struct mlx5ctl_dev *mcdev, u32 cap)
+{
+ u32 out[MLX5_ST_SZ_DW(create_uctx_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(create_uctx_in)] = {};
+ void *uctx;
+ int err;
+ u16 uid;
+
+ uctx = MLX5_ADDR_OF(create_uctx_in, in, uctx);
+
+ mlx5ctl_dbg(mcdev, "MLX5_CMD_OP_CREATE_UCTX: caps 0x%x\n", cap);
+ MLX5_SET(create_uctx_in, in, opcode, MLX5_CMD_OP_CREATE_UCTX);
+ MLX5_SET(uctx, uctx, cap, cap);
+
+ err = mlx5_cmd_exec(mcdev->mdev, in, sizeof(in), out, sizeof(out));
+ if (err)
+ return err;
+
+ uid = MLX5_GET(create_uctx_out, out, uid);
+ mlx5ctl_dbg(mcdev, "allocated uid %d with caps 0x%x\n", uid, cap);
+ return uid;
+}
+
+static void mlx5ctl_release_uid(struct mlx5ctl_dev *mcdev, u16 uid)
+{
+ u32 in[MLX5_ST_SZ_DW(destroy_uctx_in)] = {};
+ struct mlx5_core_dev *mdev = mcdev->mdev;
+ int err;
+
+ MLX5_SET(destroy_uctx_in, in, opcode, MLX5_CMD_OP_DESTROY_UCTX);
+ MLX5_SET(destroy_uctx_in, in, uid, uid);
+
+ err = mlx5_cmd_exec_in(mdev, destroy_uctx, in);
+ mlx5ctl_dbg(mcdev, "released uid %d err(%d)\n", uid, err);
+}
+
+static void mcdev_get(struct mlx5ctl_dev *mcdev);
+static void mcdev_put(struct mlx5ctl_dev *mcdev);
+
+static int mlx5ctl_open_mfd(struct mlx5ctl_fd *mfd)
+{
+ struct mlx5_core_dev *mdev = mfd->mcdev->mdev;
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+ u32 ucap = 0, cap = 0;
+ int uid;
+
+#define MLX5_UCTX_CAP(mdev, cap) \
+ (MLX5_CAP_GEN(mdev, uctx_cap) & MLX5_UCTX_OBJECT_CAP_##cap)
+
+ if (capable(CAP_NET_RAW) && MLX5_UCTX_CAP(mdev, RAW_TX)) {
+ ucap |= CAP_NET_RAW;
+ cap |= MLX5_UCTX_OBJECT_CAP_RAW_TX;
+ }
+
+ if (capable(CAP_SYS_RAWIO) && MLX5_UCTX_CAP(mdev, INTERNAL_DEVICE_RESOURCES)) {
+ ucap |= CAP_SYS_RAWIO;
+ cap |= MLX5_UCTX_OBJECT_CAP_INTERNAL_DEVICE_RESOURCES;
+ }
+
+ if (capable(CAP_SYS_ADMIN) && MLX5_UCTX_CAP(mdev, TOOLS_RESOURCES)) {
+ ucap |= CAP_SYS_ADMIN;
+ cap |= MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES;
+ }
+
+ uid = mlx5ctl_alloc_uid(mcdev, cap);
+ if (uid < 0)
+ return uid;
+
+ mfd->uctx_uid = uid;
+ mfd->uctx_cap = cap;
+ mfd->ucap = ucap;
+ mfd->mcdev = mcdev;
+
+ mlx5ctl_dbg(mcdev, "allocated uid %d with uctx caps 0x%x, user cap 0x%x\n",
+ uid, cap, ucap);
+ return 0;
+}
+
+static void mlx5ctl_release_mfd(struct mlx5ctl_fd *mfd)
+{
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+
+ mlx5ctl_release_uid(mcdev, mfd->uctx_uid);
+}
+
+static int mlx5ctl_open(struct inode *inode, struct file *file)
+{
+ struct mlx5_core_dev *mdev;
+ struct mlx5ctl_dev *mcdev;
+ struct mlx5ctl_fd *mfd;
+ int err = 0;
+
+ mcdev = container_of(file->private_data, struct mlx5ctl_dev, miscdev);
+ mcdev_get(mcdev);
+ down_read(&mcdev->rw_lock);
+ mdev = mcdev->mdev;
+ if (!mdev) {
+ err = -ENODEV;
+ goto unlock;
+ }
+
+ mfd = kzalloc(sizeof(*mfd), GFP_KERNEL_ACCOUNT);
+ if (!mfd)
+ return -ENOMEM;
+
+ mfd->mcdev = mcdev;
+ err = mlx5ctl_open_mfd(mfd);
+ if (err)
+ goto unlock;
+
+ spin_lock(&mcdev->fd_list_lock);
+ list_add_tail(&mfd->list, &mcdev->fd_list);
+ spin_unlock(&mcdev->fd_list_lock);
+
+ file->private_data = mfd;
+
+unlock:
+ up_read(&mcdev->rw_lock);
+ if (err) {
+ mcdev_put(mcdev);
+ kfree(mfd);
+ }
+ return err;
+}
+
+static int mlx5ctl_release(struct inode *inode, struct file *file)
+{
+ struct mlx5ctl_fd *mfd = file->private_data;
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+
+ down_read(&mcdev->rw_lock);
+ if (!mcdev->mdev) {
+ pr_debug("[%d] UID %d mlx5ctl: mdev is already released\n",
+ current->pid, mfd->uctx_uid);
+ /* All mfds are already released, skip ... */
+ goto unlock;
+ }
+
+ spin_lock(&mcdev->fd_list_lock);
+ list_del(&mfd->list);
+ spin_unlock(&mcdev->fd_list_lock);
+
+ mlx5ctl_release_mfd(mfd);
+
+unlock:
+ kfree(mfd);
+ up_read(&mcdev->rw_lock);
+ mcdev_put(mcdev);
+ file->private_data = NULL;
+ return 0;
+}
+
+static const struct file_operations mlx5ctl_fops = {
+ .owner = THIS_MODULE,
+ .open = mlx5ctl_open,
+ .release = mlx5ctl_release,
+};
+
+static int mlx5ctl_probe(struct auxiliary_device *adev,
+ const struct auxiliary_device_id *id)
+
+{
+ struct mlx5_adev *madev = container_of(adev, struct mlx5_adev, adev);
+ struct mlx5_core_dev *mdev = madev->mdev;
+ struct mlx5ctl_dev *mcdev;
+ char *devname = NULL;
+ int err;
+
+ mcdev = kzalloc(sizeof(*mcdev), GFP_KERNEL_ACCOUNT);
+ if (!mcdev)
+ return -ENOMEM;
+
+ kref_init(&mcdev->refcount);
+ INIT_LIST_HEAD(&mcdev->fd_list);
+ spin_lock_init(&mcdev->fd_list_lock);
+ init_rwsem(&mcdev->rw_lock);
+ mcdev->mdev = mdev;
+ mcdev->adev = adev;
+ devname = kasprintf(GFP_KERNEL_ACCOUNT, "mlx5ctl-%s",
+ dev_name(&adev->dev));
+ if (!devname) {
+ err = -ENOMEM;
+ goto abort;
+ }
+
+ mcdev->miscdev = (struct miscdevice) {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = devname,
+ .fops = &mlx5ctl_fops,
+ .parent = &adev->dev,
+ };
+
+ err = misc_register(&mcdev->miscdev);
+ if (err) {
+ mlx5ctl_err(mcdev, "mlx5ctl: failed to register misc device err %d\n", err);
+ goto abort;
+ }
+
+ mlx5ctl_dbg(mcdev, "probe mdev@%s %s\n", dev_driver_string(mdev->device), dev_name(mdev->device));
+
+ auxiliary_set_drvdata(adev, mcdev);
+
+ return 0;
+
+abort:
+ kfree(devname);
+ kfree(mcdev);
+ return err;
+}
+
+static void mlx5ctl_remove(struct auxiliary_device *adev)
+{
+ struct mlx5ctl_dev *mcdev = auxiliary_get_drvdata(adev);
+ struct mlx5_core_dev *mdev = mcdev->mdev;
+ struct mlx5ctl_fd *mfd, *n;
+
+ misc_deregister(&mcdev->miscdev);
+ down_write(&mcdev->rw_lock);
+
+ list_for_each_entry_safe(mfd, n, &mcdev->fd_list, list) {
+ mlx5ctl_dbg(mcdev, "UID %d still has open FDs\n", mfd->uctx_uid);
+ list_del(&mfd->list);
+ mlx5ctl_release_mfd(mfd);
+ }
+
+ mlx5ctl_dbg(mcdev, "removed mdev %s %s\n",
+ dev_driver_string(mdev->device), dev_name(mdev->device));
+
+ mcdev->mdev = NULL; /* prevent already open fds from accessing the device */
+ up_write(&mcdev->rw_lock);
+ mcdev_put(mcdev);
+}
+
+static void mcdev_free(struct kref *ref)
+{
+ struct mlx5ctl_dev *mcdev = container_of(ref, struct mlx5ctl_dev, refcount);
+
+ kfree(mcdev->miscdev.name);
+ kfree(mcdev);
+}
+
+static void mcdev_get(struct mlx5ctl_dev *mcdev)
+{
+ kref_get(&mcdev->refcount);
+}
+
+static void mcdev_put(struct mlx5ctl_dev *mcdev)
+{
+ kref_put(&mcdev->refcount, mcdev_free);
+}
+
+static const struct auxiliary_device_id mlx5ctl_id_table[] = {
+ { .name = MLX5_ADEV_NAME ".ctl", },
+ {},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, mlx5ctl_id_table);
+
+static struct auxiliary_driver mlx5ctl_driver = {
+ .name = "ctl",
+ .probe = mlx5ctl_probe,
+ .remove = mlx5ctl_remove,
+ .id_table = mlx5ctl_id_table,
+};
+
+module_auxiliary_driver(mlx5ctl_driver);
--
2.42.0

2023-11-21 07:09:04

by Saeed Mahameed

[permalink] [raw]
Subject: [PATCH V3 4/5] misc: mlx5ctl: Add command rpc ioctl

From: Saeed Mahameed <[email protected]>

Add new IOCTL to allow user space to send device debug rpcs and
attach the user's uctx UID to each rpc.

In the mlx5 architecture the FW RPC commands are of the format of
inbox and outbox buffers. The inbox buffer contains the command
rpc layout as described in the ConnectX Programmers Reference Manual
(PRM) document and as defined in include/linux/mlx5/mlx5_ifc.h.

On success the user outbox buffer will be filled with the device's rpc
response.

For example to query device capabilities:
a user fills out an inbox buffer with the inbox layout:
struct mlx5_ifc_query_hca_cap_in_bits
and expects an outbox buffer with the layout:
struct mlx5_ifc_cmd_hca_cap_bits

Reviewed-by: Jiri Pirko <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
---
drivers/misc/mlx5ctl/main.c | 95 +++++++++++++++++++++++++++++++++++++
include/uapi/misc/mlx5ctl.h | 13 +++++
2 files changed, 108 insertions(+)

diff --git a/drivers/misc/mlx5ctl/main.c b/drivers/misc/mlx5ctl/main.c
index 6a98b40e4300..e7776ea4bfca 100644
--- a/drivers/misc/mlx5ctl/main.c
+++ b/drivers/misc/mlx5ctl/main.c
@@ -232,6 +232,97 @@ static int mlx5ctl_info_ioctl(struct file *file,
return err;
}

+struct mlx5_ifc_mbox_in_hdr_bits {
+ u8 opcode[0x10];
+ u8 uid[0x10];
+
+ u8 reserved_at_20[0x10];
+ u8 op_mod[0x10];
+
+ u8 reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_mbox_out_hdr_bits {
+ u8 status[0x8];
+ u8 reserved_at_8[0x18];
+
+ u8 syndrome[0x20];
+
+ u8 reserved_at_40[0x40];
+};
+
+static int mlx5ctl_cmdrpc_ioctl(struct file *file,
+ struct mlx5ctl_cmdrpc __user *arg,
+ size_t usize)
+{
+ struct mlx5ctl_fd *mfd = file->private_data;
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+ struct mlx5ctl_cmdrpc *rpc = NULL;
+ void *in = NULL, *out = NULL;
+ size_t ksize = 0;
+ int err;
+
+ ksize = max(sizeof(struct mlx5ctl_cmdrpc), usize);
+ rpc = kzalloc(ksize, GFP_KERNEL_ACCOUNT);
+ if (!rpc)
+ return -ENOMEM;
+
+ err = copy_from_user(rpc, arg, usize);
+ if (err)
+ goto out;
+
+ mlx5ctl_dbg(mcdev, "[UID %d] cmdrpc: rpc->inlen %d rpc->outlen %d\n",
+ mfd->uctx_uid, rpc->inlen, rpc->outlen);
+
+ if (rpc->inlen < MLX5_ST_SZ_BYTES(mbox_in_hdr) ||
+ rpc->outlen < MLX5_ST_SZ_BYTES(mbox_out_hdr) ||
+ rpc->inlen > MLX5CTL_MAX_RPC_SIZE ||
+ rpc->outlen > MLX5CTL_MAX_RPC_SIZE) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (rpc->flags) {
+ err = -EOPNOTSUPP;
+ goto out;
+ }
+
+ in = memdup_user(u64_to_user_ptr(rpc->in), rpc->inlen);
+ if (IS_ERR(in)) {
+ err = PTR_ERR(in);
+ goto out;
+ }
+
+ out = kvzalloc(rpc->outlen, GFP_KERNEL_ACCOUNT);
+ if (!out) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ mlx5ctl_dbg(mcdev, "[UID %d] cmdif: opcode 0x%x inlen %d outlen %d\n",
+ mfd->uctx_uid,
+ MLX5_GET(mbox_in_hdr, in, opcode), rpc->inlen, rpc->outlen);
+
+ MLX5_SET(mbox_in_hdr, in, uid, mfd->uctx_uid);
+ err = mlx5_cmd_do(mcdev->mdev, in, rpc->inlen, out, rpc->outlen);
+ mlx5ctl_dbg(mcdev, "[UID %d] cmdif: opcode 0x%x retval %d\n",
+ mfd->uctx_uid,
+ MLX5_GET(mbox_in_hdr, in, opcode), err);
+
+ /* -EREMOTEIO means outbox is valid, but out.status is not */
+ if (!err || err == -EREMOTEIO) {
+ err = 0;
+ if (copy_to_user(u64_to_user_ptr(rpc->out), out, rpc->outlen))
+ err = -EFAULT;
+ }
+
+out:
+ kvfree(out);
+ kfree(in);
+ kfree(rpc);
+ return err;
+}
+
static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
struct mlx5ctl_fd *mfd = file->private_data;
@@ -257,6 +348,10 @@ static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg
err = mlx5ctl_info_ioctl(file, argp, size);
break;

+ case MLX5CTL_IOCTL_CMDRPC:
+ err = mlx5ctl_cmdrpc_ioctl(file, argp, size);
+ break;
+
default:
mlx5ctl_dbg(mcdev, "Unknown ioctl %x\n", cmd);
err = -ENOIOCTLCMD;
diff --git a/include/uapi/misc/mlx5ctl.h b/include/uapi/misc/mlx5ctl.h
index 37153cc0fc6e..3277eaf78a37 100644
--- a/include/uapi/misc/mlx5ctl.h
+++ b/include/uapi/misc/mlx5ctl.h
@@ -16,9 +16,22 @@ struct mlx5ctl_info {
__u32 reserved2;
};

+struct mlx5ctl_cmdrpc {
+ __aligned_u64 in; /* RPC inbox buffer user address */
+ __aligned_u64 out; /* RPC outbox buffer user address */
+ __u32 inlen; /* inbox buffer length */
+ __u32 outlen; /* outbox buffer length */
+ __aligned_u64 flags;
+};
+
+#define MLX5CTL_MAX_RPC_SIZE 8192
+
#define MLX5CTL_IOCTL_MAGIC 0x5c

#define MLX5CTL_IOCTL_INFO \
_IOR(MLX5CTL_IOCTL_MAGIC, 0x0, struct mlx5ctl_info)

+#define MLX5CTL_IOCTL_CMDRPC \
+ _IOWR(MLX5CTL_IOCTL_MAGIC, 0x1, struct mlx5ctl_cmdrpc)
+
#endif /* __MLX5CTL_IOCTL_H__ */
--
2.42.0

2023-11-21 07:09:06

by Saeed Mahameed

[permalink] [raw]
Subject: [PATCH V3 5/5] misc: mlx5ctl: Add umem reg/unreg ioctl

From: Saeed Mahameed <[email protected]>

Command rpc outbox buffer is limited in size, which can be very
annoying when trying to pull large traces out of the device.
Many device rpcs offer the ability to scatter output traces, contexts
and logs directly into user space buffers in a single shot.

Allow user to register user memory space, so the device may dump
information directly into user memory space.

The registered memory will be described by a device UMEM object which
has a unique umem_id, this umem_id can be later used in the rpc inbox
to tell the device where to populate the response output,
e.g HW traces and other debug object queries.

To do so this patch introduces two ioctls:

MLX5CTL_IOCTL_UMEM_REG(va_address, size):
- calculate page fragments from the user provided virtual address
- pin the pages, and allocate a sg list
- dma map the sg list
- create a UMEM device object that points to the dma addresses
- add a driver umem object to an xarray data base for bookkeeping
- return UMEM ID to user so it can be used in subsequent rpcs

MLX5CTL_IOCTL_UMEM_UNREG(umem_id):
- user provides a pre allocated umem ID
- unwinds the above

Example usecase, ConnectX device coredump can be as large as 2MB.
Using inline rpcs will take thousands of rpcs to get the full
coredump which can take multiple seconds.

With UMEM, it can be done in a single rpc, using 2MB of umem user buffer.

$ ./mlx5ctlu mlx5_core.ctl.0 coredump --umem_size=$(( 2 ** 20 ))

00 00 00 00 01 00 20 00 00 00 00 04 00 00 48 ec
00 00 00 08 00 00 00 00 00 00 00 0c 00 00 00 03
00 00 00 10 00 00 00 00 00 00 00 14 00 00 00 00
....
00 50 0b 3c 00 00 00 00 00 50 0b 40 00 00 00 00
00 50 0b 44 00 00 00 00 00 50 0b 48 00 00 00 00
00 50 0c 00 00 00 00 00

INFO : Core dump done
INFO : Core dump size 831304
INFO : Core dump address 0x0
INFO : Core dump cookie 0x500c04
INFO : More Dump 0

Other usecases are: dynamic HW and FW trace monitoring, high frequency
diagnostic counters sampling and batched objects and resource dumps.

Reviewed-by: Jiri Pirko <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
---
drivers/misc/mlx5ctl/Makefile | 1 +
drivers/misc/mlx5ctl/main.c | 99 +++++++++++
drivers/misc/mlx5ctl/umem.c | 322 ++++++++++++++++++++++++++++++++++
drivers/misc/mlx5ctl/umem.h | 17 ++
include/uapi/misc/mlx5ctl.h | 22 +++
5 files changed, 461 insertions(+)
create mode 100644 drivers/misc/mlx5ctl/umem.c
create mode 100644 drivers/misc/mlx5ctl/umem.h

diff --git a/drivers/misc/mlx5ctl/Makefile b/drivers/misc/mlx5ctl/Makefile
index b5c7f99e0ab6..f35234e931a8 100644
--- a/drivers/misc/mlx5ctl/Makefile
+++ b/drivers/misc/mlx5ctl/Makefile
@@ -2,3 +2,4 @@

obj-$(CONFIG_MLX5CTL) += mlx5ctl.o
mlx5ctl-y := main.o
+mlx5ctl-y += umem.o
diff --git a/drivers/misc/mlx5ctl/main.c b/drivers/misc/mlx5ctl/main.c
index e7776ea4bfca..7e6d6da26a79 100644
--- a/drivers/misc/mlx5ctl/main.c
+++ b/drivers/misc/mlx5ctl/main.c
@@ -12,6 +12,8 @@
#include <linux/atomic.h>
#include <linux/refcount.h>

+#include "umem.h"
+
MODULE_DESCRIPTION("mlx5 ConnectX control misc driver");
MODULE_AUTHOR("Saeed Mahameed <[email protected]>");
MODULE_LICENSE("Dual BSD/GPL");
@@ -30,6 +32,8 @@ struct mlx5ctl_fd {
u16 uctx_uid;
u32 uctx_cap;
u32 ucap; /* user cap */
+
+ struct mlx5ctl_umem_db *umem_db;
struct mlx5ctl_dev *mcdev;
struct list_head list;
};
@@ -115,6 +119,12 @@ static int mlx5ctl_open_mfd(struct mlx5ctl_fd *mfd)
if (uid < 0)
return uid;

+ mfd->umem_db = mlx5ctl_umem_db_create(mdev, uid);
+ if (IS_ERR(mfd->umem_db)) {
+ mlx5ctl_release_uid(mcdev, uid);
+ return PTR_ERR(mfd->umem_db);
+ }
+
mfd->uctx_uid = uid;
mfd->uctx_cap = cap;
mfd->ucap = ucap;
@@ -129,6 +139,7 @@ static void mlx5ctl_release_mfd(struct mlx5ctl_fd *mfd)
{
struct mlx5ctl_dev *mcdev = mfd->mcdev;

+ mlx5ctl_umem_db_destroy(mfd->umem_db);
mlx5ctl_release_uid(mcdev, mfd->uctx_uid);
}

@@ -323,6 +334,86 @@ static int mlx5ctl_cmdrpc_ioctl(struct file *file,
return err;
}

+static int mlx5ctl_ioctl_umem_reg(struct file *file,
+ struct mlx5ctl_umem_reg __user *arg,
+ size_t usize)
+{
+ struct mlx5ctl_fd *mfd = file->private_data;
+ struct mlx5ctl_umem_reg *umem_reg;
+ int umem_id, err = 0;
+ size_t ksize = 0;
+
+ ksize = max(sizeof(struct mlx5ctl_umem_reg), usize);
+ umem_reg = kzalloc(ksize, GFP_KERNEL_ACCOUNT);
+ if (!umem_reg)
+ return -ENOMEM;
+
+ umem_reg->size = sizeof(struct mlx5ctl_umem_reg);
+
+ if (copy_from_user(umem_reg, arg, usize)) {
+ err = -EFAULT;
+ goto out;
+ }
+
+ if (umem_reg->flags || umem_reg->reserved1 || umem_reg->reserved2) {
+ err = -EOPNOTSUPP;
+ goto out;
+ }
+
+ umem_id = mlx5ctl_umem_reg(mfd->umem_db,
+ (unsigned long)umem_reg->addr,
+ umem_reg->len);
+ if (umem_id < 0) {
+ err = umem_id;
+ goto out;
+ }
+
+ umem_reg->umem_id = umem_id;
+
+ if (copy_to_user(arg, umem_reg, usize)) {
+ mlx5ctl_umem_unreg(mfd->umem_db, umem_id);
+ err = -EFAULT;
+ }
+out:
+ kfree(umem_reg);
+ return err;
+}
+
+static int mlx5ctl_ioctl_umem_unreg(struct file *file,
+ struct mlx5ctl_umem_unreg __user *arg,
+ size_t usize)
+{
+ struct mlx5ctl_fd *mfd = file->private_data;
+ struct mlx5ctl_umem_unreg *umem_unreg;
+ size_t ksize = 0;
+ int err = 0;
+
+ ksize = max(sizeof(struct mlx5ctl_umem_unreg), usize);
+ umem_unreg = kzalloc(ksize, GFP_KERNEL_ACCOUNT);
+ if (!umem_unreg)
+ return -ENOMEM;
+
+ umem_unreg->size = sizeof(struct mlx5ctl_umem_unreg);
+
+ if (copy_from_user(umem_unreg, arg, usize)) {
+ err = -EFAULT;
+ goto out;
+ }
+
+ if (umem_unreg->flags) {
+ err = -EOPNOTSUPP;
+ goto out;
+ }
+
+ err = mlx5ctl_umem_unreg(mfd->umem_db, umem_unreg->umem_id);
+
+ if (!err && copy_to_user(arg, umem_unreg, usize))
+ err = -EFAULT;
+out:
+ kfree(umem_unreg);
+ return err;
+}
+
static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
struct mlx5ctl_fd *mfd = file->private_data;
@@ -352,6 +443,14 @@ static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg
err = mlx5ctl_cmdrpc_ioctl(file, argp, size);
break;

+ case MLX5CTL_IOCTL_UMEM_REG:
+ err = mlx5ctl_ioctl_umem_reg(file, argp, size);
+ break;
+
+ case MLX5CTL_IOCTL_UMEM_UNREG:
+ err = mlx5ctl_ioctl_umem_unreg(file, argp, size);
+ break;
+
default:
mlx5ctl_dbg(mcdev, "Unknown ioctl %x\n", cmd);
err = -ENOIOCTLCMD;
diff --git a/drivers/misc/mlx5ctl/umem.c b/drivers/misc/mlx5ctl/umem.c
new file mode 100644
index 000000000000..e62030dadf51
--- /dev/null
+++ b/drivers/misc/mlx5ctl/umem.c
@@ -0,0 +1,322 @@
+// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#include <linux/mlx5/device.h>
+#include <linux/mlx5/driver.h>
+#include <uapi/misc/mlx5ctl.h>
+
+#include "umem.h"
+
+#define MLX5CTL_UMEM_MAX_MB 64
+
+static unsigned long umem_num_pages(u64 addr, size_t len)
+{
+ return DIV_ROUND_UP(addr + len - PAGE_ALIGN_DOWN(addr), PAGE_SIZE);
+}
+
+struct mlx5ctl_umem {
+ struct sg_table sgt;
+ unsigned long addr;
+ size_t size;
+ size_t offset;
+ size_t npages;
+ struct task_struct *source_task;
+ struct mm_struct *source_mm;
+ struct user_struct *source_user;
+ u32 umem_id;
+ struct page **page_list;
+};
+
+struct mlx5ctl_umem_db {
+ struct xarray xarray;
+ struct mlx5_core_dev *mdev;
+ u32 uctx_uid;
+};
+
+static int inc_user_locked_vm(struct mlx5ctl_umem *umem, unsigned long npages)
+{
+ unsigned long lock_limit;
+ unsigned long cur_pages;
+ unsigned long new_pages;
+
+ lock_limit = task_rlimit(umem->source_task, RLIMIT_MEMLOCK) >>
+ PAGE_SHIFT;
+ do {
+ cur_pages = atomic_long_read(&umem->source_user->locked_vm);
+ new_pages = cur_pages + npages;
+ if (new_pages > lock_limit)
+ return -ENOMEM;
+ } while (atomic_long_cmpxchg(&umem->source_user->locked_vm, cur_pages,
+ new_pages) != cur_pages);
+ return 0;
+}
+
+static void dec_user_locked_vm(struct mlx5ctl_umem *umem, unsigned long npages)
+{
+ if (WARN_ON(atomic_long_read(&umem->source_user->locked_vm) < npages))
+ return;
+ atomic_long_sub(npages, &umem->source_user->locked_vm);
+}
+
+#define PAGES_2_MB(pages) ((pages) >> (20 - PAGE_SHIFT))
+
+static struct mlx5ctl_umem *mlx5ctl_umem_pin(struct mlx5ctl_umem_db *umem_db,
+ unsigned long addr, size_t size)
+{
+ size_t npages = umem_num_pages(addr, size);
+ struct mlx5_core_dev *mdev = umem_db->mdev;
+ unsigned long endaddr = addr + size;
+ struct mlx5ctl_umem *umem;
+ struct page **page_list;
+ int err = -EINVAL;
+ int pinned = 0;
+
+ dev_dbg(mdev->device, "%s: addr %p size %zu npages %zu\n",
+ __func__, (void __user *)addr, size, npages);
+
+ /* Avoid integer overflow */
+ if (endaddr < addr || PAGE_ALIGN(endaddr) < endaddr)
+ return ERR_PTR(-EINVAL);
+
+ if (npages == 0 || PAGES_2_MB(npages) > MLX5CTL_UMEM_MAX_MB)
+ return ERR_PTR(-EINVAL);
+
+ page_list = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL_ACCOUNT);
+ if (!page_list)
+ return ERR_PTR(-ENOMEM);
+
+ umem = kzalloc(sizeof(*umem), GFP_KERNEL_ACCOUNT);
+ if (!umem) {
+ kvfree(page_list);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ umem->addr = addr;
+ umem->size = size;
+ umem->offset = addr & ~PAGE_MASK;
+ umem->npages = npages;
+
+ umem->page_list = page_list;
+ umem->source_mm = current->mm;
+ umem->source_task = current->group_leader;
+ get_task_struct(current->group_leader);
+ umem->source_user = get_uid(current_user());
+
+ /* mm and RLIMIT_MEMLOCK user task accounting similar to what is
+ * being done in iopt_alloc_pages() and do_update_pinned()
+ * for IOPT_PAGES_ACCOUNT_USER @drivers/iommu/iommufd/pages.c
+ */
+ mmgrab(umem->source_mm);
+
+ pinned = pin_user_pages_fast(addr, npages, FOLL_WRITE, page_list);
+ if (pinned != npages) {
+ dev_dbg(mdev->device, "pin_user_pages_fast failed %d\n", pinned);
+ err = pinned < 0 ? pinned : -ENOMEM;
+ goto pin_failed;
+ }
+
+ err = inc_user_locked_vm(umem, npages);
+ if (err)
+ goto pin_failed;
+
+ atomic64_add(npages, &umem->source_mm->pinned_vm);
+
+ err = sg_alloc_table_from_pages(&umem->sgt, page_list, npages, 0,
+ npages << PAGE_SHIFT, GFP_KERNEL_ACCOUNT);
+ if (err) {
+ dev_dbg(mdev->device, "sg_alloc_table failed: %d\n", err);
+ goto sgt_failed;
+ }
+
+ dev_dbg(mdev->device, "\tsgt: size %zu npages %zu sgt.nents (%d)\n",
+ size, npages, umem->sgt.nents);
+
+ err = dma_map_sgtable(mdev->device, &umem->sgt, DMA_BIDIRECTIONAL, 0);
+ if (err) {
+ dev_dbg(mdev->device, "dma_map_sgtable failed: %d\n", err);
+ goto dma_failed;
+ }
+
+ dev_dbg(mdev->device, "\tsgt: dma_nents %d\n", umem->sgt.nents);
+ return umem;
+
+dma_failed:
+sgt_failed:
+ sg_free_table(&umem->sgt);
+ atomic64_sub(npages, &umem->source_mm->pinned_vm);
+ dec_user_locked_vm(umem, npages);
+pin_failed:
+ if (pinned > 0)
+ unpin_user_pages(page_list, pinned);
+ mmdrop(umem->source_mm);
+ free_uid(umem->source_user);
+ put_task_struct(umem->source_task);
+
+ kfree(umem);
+ kvfree(page_list);
+ return ERR_PTR(err);
+}
+
+static void mlx5ctl_umem_unpin(struct mlx5ctl_umem_db *umem_db,
+ struct mlx5ctl_umem *umem)
+{
+ struct mlx5_core_dev *mdev = umem_db->mdev;
+
+ dev_dbg(mdev->device, "%s: addr %p size %zu npages %zu dma_nents %d\n",
+ __func__, (void *)umem->addr, umem->size, umem->npages,
+ umem->sgt.nents);
+
+ dma_unmap_sgtable(mdev->device, &umem->sgt, DMA_BIDIRECTIONAL, 0);
+ sg_free_table(&umem->sgt);
+
+ atomic64_sub(umem->npages, &umem->source_mm->pinned_vm);
+ dec_user_locked_vm(umem, umem->npages);
+ unpin_user_pages(umem->page_list, umem->npages);
+ mmdrop(umem->source_mm);
+ free_uid(umem->source_user);
+ put_task_struct(umem->source_task);
+
+ kvfree(umem->page_list);
+ kfree(umem);
+}
+
+static int mlx5ctl_umem_create(struct mlx5_core_dev *mdev,
+ struct mlx5ctl_umem *umem, u32 uid)
+{
+ u32 out[MLX5_ST_SZ_DW(create_umem_out)] = {};
+ int err, inlen, i, n = 0;
+ struct scatterlist *sg;
+ void *in, *umemptr;
+ __be64 *mtt;
+
+ inlen = MLX5_ST_SZ_BYTES(create_umem_in) +
+ umem->npages * MLX5_ST_SZ_BYTES(mtt);
+
+ in = kzalloc(inlen, GFP_KERNEL_ACCOUNT);
+ if (!in)
+ return -ENOMEM;
+
+ MLX5_SET(create_umem_in, in, opcode, MLX5_CMD_OP_CREATE_UMEM);
+ MLX5_SET(create_umem_in, in, uid, uid);
+
+ umemptr = MLX5_ADDR_OF(create_umem_in, in, umem);
+
+ MLX5_SET(umem, umemptr, log_page_size,
+ PAGE_SHIFT - MLX5_ADAPTER_PAGE_SHIFT);
+ MLX5_SET64(umem, umemptr, num_of_mtt, umem->npages);
+ MLX5_SET(umem, umemptr, page_offset, umem->offset);
+
+ dev_dbg(mdev->device,
+ "UMEM CREATE: log_page_size %d num_of_mtt %lld page_offset %d\n",
+ MLX5_GET(umem, umemptr, log_page_size),
+ MLX5_GET64(umem, umemptr, num_of_mtt),
+ MLX5_GET(umem, umemptr, page_offset));
+
+ mtt = MLX5_ADDR_OF(create_umem_in, in, umem.mtt);
+ for_each_sgtable_dma_sg(&umem->sgt, sg, i) {
+ u64 dma_addr = sg_dma_address(sg);
+ ssize_t len = sg_dma_len(sg);
+
+ for (; n < umem->npages && len > 0; n++, mtt++) {
+ *mtt = cpu_to_be64(dma_addr);
+ MLX5_SET(mtt, mtt, wr_en, 1);
+ MLX5_SET(mtt, mtt, rd_en, 1);
+ dma_addr += PAGE_SIZE;
+ len -= PAGE_SIZE;
+ }
+ WARN_ON_ONCE(n == umem->npages && len > 0);
+ }
+
+ err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
+ if (err)
+ goto out;
+
+ umem->umem_id = MLX5_GET(create_umem_out, out, umem_id);
+ dev_dbg(mdev->device, "\tUMEM CREATED: umem_id %d\n", umem->umem_id);
+out:
+ kfree(in);
+ return err;
+}
+
+static void mlx5ctl_umem_destroy(struct mlx5_core_dev *mdev,
+ struct mlx5ctl_umem *umem)
+{
+ u32 in[MLX5_ST_SZ_DW(destroy_umem_in)] = {};
+
+ MLX5_SET(destroy_umem_in, in, opcode, MLX5_CMD_OP_DESTROY_UMEM);
+ MLX5_SET(destroy_umem_in, in, umem_id, umem->umem_id);
+
+ dev_dbg(mdev->device, "UMEM DESTROY: umem_id %d\n", umem->umem_id);
+ mlx5_cmd_exec_in(mdev, destroy_umem, in);
+}
+
+int mlx5ctl_umem_reg(struct mlx5ctl_umem_db *umem_db, unsigned long addr,
+ size_t size)
+{
+ struct mlx5ctl_umem *umem;
+ void *ret;
+ int err;
+
+ umem = mlx5ctl_umem_pin(umem_db, addr, size);
+ if (IS_ERR(umem))
+ return PTR_ERR(umem);
+
+ err = mlx5ctl_umem_create(umem_db->mdev, umem, umem_db->uctx_uid);
+ if (err)
+ goto umem_create_err;
+
+ ret = xa_store(&umem_db->xarray, umem->umem_id, umem, GFP_KERNEL_ACCOUNT);
+ if (WARN(xa_is_err(ret), "Failed to store UMEM")) {
+ err = xa_err(ret);
+ goto xa_store_err;
+ }
+
+ return umem->umem_id;
+
+xa_store_err:
+ mlx5ctl_umem_destroy(umem_db->mdev, umem);
+umem_create_err:
+ mlx5ctl_umem_unpin(umem_db, umem);
+ return err;
+}
+
+int mlx5ctl_umem_unreg(struct mlx5ctl_umem_db *umem_db, u32 umem_id)
+{
+ struct mlx5ctl_umem *umem;
+
+ umem = xa_erase(&umem_db->xarray, umem_id);
+ if (!umem)
+ return -ENOENT;
+
+ mlx5ctl_umem_destroy(umem_db->mdev, umem);
+ mlx5ctl_umem_unpin(umem_db, umem);
+ return 0;
+}
+
+struct mlx5ctl_umem_db *mlx5ctl_umem_db_create(struct mlx5_core_dev *mdev,
+ u32 uctx_uid)
+{
+ struct mlx5ctl_umem_db *umem_db;
+
+ umem_db = kzalloc(sizeof(*umem_db), GFP_KERNEL_ACCOUNT);
+ if (!umem_db)
+ return ERR_PTR(-ENOMEM);
+
+ xa_init(&umem_db->xarray);
+ umem_db->mdev = mdev;
+ umem_db->uctx_uid = uctx_uid;
+
+ return umem_db;
+}
+
+void mlx5ctl_umem_db_destroy(struct mlx5ctl_umem_db *umem_db)
+{
+ struct mlx5ctl_umem *umem;
+ unsigned long index;
+
+ xa_for_each(&umem_db->xarray, index, umem)
+ mlx5ctl_umem_unreg(umem_db, umem->umem_id);
+
+ xa_destroy(&umem_db->xarray);
+ kfree(umem_db);
+}
diff --git a/drivers/misc/mlx5ctl/umem.h b/drivers/misc/mlx5ctl/umem.h
new file mode 100644
index 000000000000..9cf62e5e775e
--- /dev/null
+++ b/drivers/misc/mlx5ctl/umem.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 */
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#ifndef __MLX5CTL_UMEM_H__
+#define __MLX5CTL_UMEM_H__
+
+#include <linux/types.h>
+#include <linux/mlx5/driver.h>
+
+struct mlx5ctl_umem_db;
+
+struct mlx5ctl_umem_db *mlx5ctl_umem_db_create(struct mlx5_core_dev *mdev, u32 uctx_uid);
+void mlx5ctl_umem_db_destroy(struct mlx5ctl_umem_db *umem_db);
+int mlx5ctl_umem_reg(struct mlx5ctl_umem_db *umem_db, unsigned long addr, size_t size);
+int mlx5ctl_umem_unreg(struct mlx5ctl_umem_db *umem_db, u32 umem_id);
+
+#endif /* __MLX5CTL_UMEM_H__ */
diff --git a/include/uapi/misc/mlx5ctl.h b/include/uapi/misc/mlx5ctl.h
index 3277eaf78a37..506aa8db75b4 100644
--- a/include/uapi/misc/mlx5ctl.h
+++ b/include/uapi/misc/mlx5ctl.h
@@ -24,6 +24,22 @@ struct mlx5ctl_cmdrpc {
__aligned_u64 flags;
};

+struct mlx5ctl_umem_reg {
+ __aligned_u64 flags;
+ __u32 size;
+ __u32 reserved1;
+ __aligned_u64 addr; /* user address */
+ __aligned_u64 len; /* user buffer length */
+ __u32 umem_id; /* returned device's umem ID */
+ __u32 reserved2;
+};
+
+struct mlx5ctl_umem_unreg {
+ __aligned_u64 flags;
+ __u32 size;
+ __u32 umem_id;
+};
+
#define MLX5CTL_MAX_RPC_SIZE 8192

#define MLX5CTL_IOCTL_MAGIC 0x5c
@@ -34,4 +50,10 @@ struct mlx5ctl_cmdrpc {
#define MLX5CTL_IOCTL_CMDRPC \
_IOWR(MLX5CTL_IOCTL_MAGIC, 0x1, struct mlx5ctl_cmdrpc)

+#define MLX5CTL_IOCTL_UMEM_REG \
+ _IOWR(MLX5CTL_IOCTL_MAGIC, 0x2, struct mlx5ctl_umem_reg)
+
+#define MLX5CTL_IOCTL_UMEM_UNREG \
+ _IOWR(MLX5CTL_IOCTL_MAGIC, 0x3, struct mlx5ctl_umem_unreg)
+
#endif /* __MLX5CTL_IOCTL_H__ */
--
2.42.0

2023-11-21 07:09:06

by Saeed Mahameed

[permalink] [raw]
Subject: [PATCH V3 3/5] misc: mlx5ctl: Add info ioctl

From: Saeed Mahameed <[email protected]>

Implement INFO ioctl to return the allocated UID and the capability flags
and some other useful device information such as the underlying ConnectX
device.

Example:
$ sudo ./mlx5ctlu mlx5_core.ctl.0
mlx5dev: 0000:00:04.0
UCTX UID: 1
UCTX CAP: 0x3
DEV UCTX CAP: 0x3
USER CAP: 0x1d

Reviewed-by: Jiri Pirko <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
---
.../userspace-api/ioctl/ioctl-number.rst | 1 +
drivers/misc/mlx5ctl/main.c | 71 +++++++++++++++++++
include/uapi/misc/mlx5ctl.h | 24 +++++++
3 files changed, 96 insertions(+)
create mode 100644 include/uapi/misc/mlx5ctl.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 4ea5b837399a..9faf91ffefff 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -89,6 +89,7 @@ Code Seq# Include File Comments
0x20 all drivers/cdrom/cm206.h
0x22 all scsi/sg.h
0x3E 00-0F linux/counter.h <mailto:[email protected]>
+0x5c all uapi/misc/mlx5ctl.h Nvidia ConnectX control
'!' 00-1F uapi/linux/seccomp.h
'#' 00-3F IEEE 1394 Subsystem
Block for the entire subsystem
diff --git a/drivers/misc/mlx5ctl/main.c b/drivers/misc/mlx5ctl/main.c
index 8eb150461b80..6a98b40e4300 100644
--- a/drivers/misc/mlx5ctl/main.c
+++ b/drivers/misc/mlx5ctl/main.c
@@ -8,6 +8,7 @@
#include <linux/auxiliary_bus.h>
#include <linux/mlx5/device.h>
#include <linux/mlx5/driver.h>
+#include <uapi/misc/mlx5ctl.h>
#include <linux/atomic.h>
#include <linux/refcount.h>

@@ -198,10 +199,80 @@ static int mlx5ctl_release(struct inode *inode, struct file *file)
return 0;
}

+static int mlx5ctl_info_ioctl(struct file *file,
+ struct mlx5ctl_info __user *arg,
+ size_t usize)
+{
+ struct mlx5ctl_fd *mfd = file->private_data;
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+ struct mlx5_core_dev *mdev = mcdev->mdev;
+ struct mlx5ctl_info *info;
+ size_t ksize = 0;
+ int err = 0;
+
+ ksize = max(sizeof(struct mlx5ctl_info), usize);
+ info = kzalloc(ksize, GFP_KERNEL_ACCOUNT);
+ if (!info)
+ return -ENOMEM;
+
+ info->size = sizeof(struct mlx5ctl_info);
+
+ info->dev_uctx_cap = MLX5_CAP_GEN(mdev, uctx_cap);
+ info->uctx_cap = mfd->uctx_cap;
+ info->uctx_uid = mfd->uctx_uid;
+ info->ucap = mfd->ucap;
+
+ strscpy(info->devname, dev_name(&mdev->pdev->dev),
+ sizeof(info->devname));
+
+ if (copy_to_user(arg, info, usize))
+ err = -EFAULT;
+
+ kfree(info);
+ return err;
+}
+
+static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct mlx5ctl_fd *mfd = file->private_data;
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+ void __user *argp = (void __user *)arg;
+ size_t size = _IOC_SIZE(cmd);
+ int err = 0;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ mlx5ctl_dbg(mcdev, "ioctl 0x%x type/nr: %d/%d size: %d DIR:%d\n", cmd,
+ _IOC_TYPE(cmd), _IOC_NR(cmd), _IOC_SIZE(cmd), _IOC_DIR(cmd));
+
+ down_read(&mcdev->rw_lock);
+ if (!mcdev->mdev) {
+ err = -ENODEV;
+ goto unlock;
+ }
+
+ switch (cmd) {
+ case MLX5CTL_IOCTL_INFO:
+ err = mlx5ctl_info_ioctl(file, argp, size);
+ break;
+
+ default:
+ mlx5ctl_dbg(mcdev, "Unknown ioctl %x\n", cmd);
+ err = -ENOIOCTLCMD;
+ break;
+ }
+unlock:
+ up_read(&mcdev->rw_lock);
+ return err;
+}
+
static const struct file_operations mlx5ctl_fops = {
.owner = THIS_MODULE,
.open = mlx5ctl_open,
.release = mlx5ctl_release,
+ .unlocked_ioctl = mlx5ctl_ioctl,
+ .compat_ioctl = compat_ptr_ioctl,
};

static int mlx5ctl_probe(struct auxiliary_device *adev,
diff --git a/include/uapi/misc/mlx5ctl.h b/include/uapi/misc/mlx5ctl.h
new file mode 100644
index 000000000000..37153cc0fc6e
--- /dev/null
+++ b/include/uapi/misc/mlx5ctl.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#ifndef __MLX5CTL_IOCTL_H__
+#define __MLX5CTL_IOCTL_H__
+
+struct mlx5ctl_info {
+ __aligned_u64 flags;
+ __u32 size;
+ __u8 devname[64]; /* underlaying ConnectX device */
+ __u16 uctx_uid; /* current process allocated UCTX UID */
+ __u16 reserved1;
+ __u32 uctx_cap; /* current process effective UCTX cap */
+ __u32 dev_uctx_cap; /* device's UCTX capabilities */
+ __u32 ucap; /* process user capability */
+ __u32 reserved2;
+};
+
+#define MLX5CTL_IOCTL_MAGIC 0x5c
+
+#define MLX5CTL_IOCTL_INFO \
+ _IOR(MLX5CTL_IOCTL_MAGIC, 0x0, struct mlx5ctl_info)
+
+#endif /* __MLX5CTL_IOCTL_H__ */
--
2.42.0

2023-11-21 20:45:08

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH V3 5/5] misc: mlx5ctl: Add umem reg/unreg ioctl

On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
> high frequency diagnostic counters

So is it a debug driver or not a debug driver?

Because I'm pretty sure some people want to have access to high freq
counters in production, across their fleet. What's worse David Ahern
has been pitching a way of exposing device counters which would be
common across netdev.

Definite nack on this patch.

2023-11-21 21:04:36

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 5/5] misc: mlx5ctl: Add umem reg/unreg ioctl

On 21 Nov 12:44, Jakub Kicinski wrote:
>On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
>> high frequency diagnostic counters
>
>So is it a debug driver or not a debug driver?
>

High frequency _diagnostic_ counters are a very useful tool for
debugging a high performance chip. So yes this is for diagnostics/debug.

>Because I'm pretty sure some people want to have access to high freq
>counters in production, across their fleet. What's worse David Ahern
>has been pitching a way of exposing device counters which would be
>common across netdev.
>

This is not netdev, this driver is to support ConnectX chips and SoCs
with any stack, netdev/rdma/vdpa/virtio and internal chip units and
acceleration engines, add to that ARM core diagnostics in case of
Blue-Field DPUs.

I am not looking for counting netdev ethernet packets in this driver.

I am also pretty sure David will also want an interface to access other
than netdev counters, to get more visibility on how a specific chip is
behaving.

>Definite nack on this patch.

Based on what ?


2023-11-21 22:13:08

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH V3 5/5] misc: mlx5ctl: Add umem reg/unreg ioctl

On Tue, 21 Nov 2023 13:04:06 -0800 Saeed Mahameed wrote:
> On 21 Nov 12:44, Jakub Kicinski wrote:
>> On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
>>> high frequency diagnostic counters
>>
>> So is it a debug driver or not a debug driver?
>
> High frequency _diagnostic_ counters are a very useful tool for
> debugging a high performance chip. So yes this is for diagnostics/debug.

You keep saying debugging but if it's expected to run on all servers in
the fleet _monitoring_ performance, then it's a very different thing.
To me it certainly moves this driver from "debug thing loaded when
things fail" to the "always loaded in production" category.

2023-11-21 22:18:51

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH V3 5/5] misc: mlx5ctl: Add umem reg/unreg ioctl

On 11/21/23 1:04 PM, Saeed Mahameed wrote:
> On 21 Nov 12:44, Jakub Kicinski wrote:
>> On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
>>> high frequency diagnostic counters
>>
>> So is it a debug driver or not a debug driver?
>>
>
> High frequency _diagnostic_ counters are a very useful tool for
> debugging a high performance chip. So yes this is for diagnostics/debug.
>
>> Because I'm pretty sure some people want to have access to high freq
>> counters in production, across their fleet. What's worse David Ahern
>> has been pitching a way of exposing device counters which would be
>> common across netdev.
.

For context on the `what's worse ...` comment for those who have not
seen the netconf slides:
https://netdev.bots.linux.dev/netconf/2023/david.pdf

and I am having a hard time parsing Kuba's intent with that comment here
(knowing you did not like the pitch I made at netconf :-))


>
> This is not netdev, this driver is to support ConnectX chips and SoCs
> with any stack, netdev/rdma/vdpa/virtio and internal chip units and
> acceleration engines, add to that ARM core diagnostics in case of
> Blue-Field DPUs.
> I am not looking for counting netdev ethernet packets in this driver.
>
> I am also pretty sure David will also want an interface to access other
> than netdev counters, to get more visibility on how a specific chip is
> behaving.

yes, and h/w counters were part of the proposal. One thought is to
leverage userspace registered memory with the device vs mapping bar
space, but we have not moved beyond a theoretical discussion at this point.

>
>> Definite nack on this patch.
>
> Based on what ?

It's a generic interface argument?


2023-11-21 22:47:10

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 5/5] misc: mlx5ctl: Add umem reg/unreg ioctl

On 21 Nov 14:18, David Ahern wrote:
>On 11/21/23 1:04 PM, Saeed Mahameed wrote:
>> On 21 Nov 12:44, Jakub Kicinski wrote:
>>> On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
>>>> high frequency diagnostic counters
>>>
>>> So is it a debug driver or not a debug driver?
>>>
>>
>> High frequency _diagnostic_ counters are a very useful tool for
>> debugging a high performance chip. So yes this is for diagnostics/debug.
>>
>>> Because I'm pretty sure some people want to have access to high freq
>>> counters in production, across their fleet. What's worse David Ahern
>>> has been pitching a way of exposing device counters which would be
>>> common across netdev.
>.
>
>For context on the `what's worse ...` comment for those who have not
>seen the netconf slides:
>https://netdev.bots.linux.dev/netconf/2023/david.pdf
>
>and I am having a hard time parsing Kuba's intent with that comment here
>(knowing you did not like the pitch I made at netconf :-))
>
>
>>
>> This is not netdev, this driver is to support ConnectX chips and SoCs
>> with any stack, netdev/rdma/vdpa/virtio and internal chip units and
>> acceleration engines, add to that ARM core diagnostics in case of
>> Blue-Field DPUs.
>> I am not looking for counting netdev ethernet packets in this driver.
>>
>> I am also pretty sure David will also want an interface to access other
>> than netdev counters, to get more visibility on how a specific chip is
>> behaving.
>
>yes, and h/w counters were part of the proposal. One thought is to
>leverage userspace registered memory with the device vs mapping bar
>space, but we have not moved beyond a theoretical discussion at this point.
>
>>
>>> Definite nack on this patch.
>>
>> Based on what ?
>
>It's a generic interface argument?
>

For this driver the diagnostic counters is only a small part of the debug
utilities the driver provides, so it is not fair to nak this patch based
on one use-case, we need this driver to also dump other stuff like
core dumps, FW contexts, internal objects, register dumps, resource dumps,
etc ..

This patch original purpose was to allow core dumps, since core dump can go
up to 2MB of memory, without this patch we won't have core dump ability
which is more important for debugging than diagnostic counters.

You can find more here:
https://github.com/saeedtx/mlx5ctl#mlx5ctl-userspace-linux-debug-utilities-for-mlx5-connectx-devices

For diagnostic counters we can continue the discussion to have a generic
interface I am all for it, but it's irrelevant for this submission.

2023-11-21 22:53:26

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 5/5] misc: mlx5ctl: Add umem reg/unreg ioctl

On 21 Nov 14:10, Jakub Kicinski wrote:
>On Tue, 21 Nov 2023 13:04:06 -0800 Saeed Mahameed wrote:
>> On 21 Nov 12:44, Jakub Kicinski wrote:
>>> On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
>>>> high frequency diagnostic counters
>>>
>>> So is it a debug driver or not a debug driver?
>>
>> High frequency _diagnostic_ counters are a very useful tool for
>> debugging a high performance chip. So yes this is for diagnostics/debug.
>
>You keep saying debugging but if it's expected to run on all servers in
>the fleet _monitoring_ performance, then it's a very different thing.
>To me it certainly moves this driver from "debug thing loaded when
>things fail" to the "always loaded in production" category.

Exactly, only when things fails or the user want to debug something.

For your example, you can monitor network performance via standard netdev
tools, once you start experiencing hiccups, you can use this driver and
the corresponding tools to quickly grab HW debug information, useful to
further root cause and analyze the network hiccups.

Again this is only one use-case, the driver is intended to provide any
debug information, not only diagnostic counters or monitoring tools.
The goal of this driver is not the one use case you have in mind.

2023-11-21 23:48:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V3 5/5] misc: mlx5ctl: Add umem reg/unreg ioctl

On Tue, Nov 21, 2023 at 12:44:56PM -0800, Jakub Kicinski wrote:
> On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
> > high frequency diagnostic counters
>
> So is it a debug driver or not a debug driver?

In the part you decided not to quote Saeed explained how the main
purpose of the generic DMA to userspace mechanism is to transfer FW
trace, FW memory copies and other large data dumps.

The thing with generic stuff is you can use it for lots of things if
you are so inclined. Saeed gave many examples. I think you took it in
the wrong way as I am not aware of any plan for actual high speed
netdev relavent counters in a performance monitor application. It
isn't that kind of "high speed".

The main interest is for micro-architectural debugging
information. The kind that are opaque unless you can reference the RTL
to understand what it means. It is "high speed" in the sense that
executing a FW command per register/counter would be offensively slow
compared to executing a FW command to bulk DMA a cluster of
micro-architecture registers/etc in the device.

The design is so generic because it is a debug interface that we want
to be always available and always fully functional. Mellanox ships new
FW and new chips at a rapid rate, we do not want to be changing the
kernel driver every time we do anything. That will never get
backported into production kernels across all our supported customers
fast enough. Debug features that a field support engineer cannot
access simply do not exist.

Debugs are challenging. mlx5 is the most popular datacenter NIC in the
world. We have so many insane problems, you wouldn't belive it. I just
spent 8 months leading a debug that turned out to be a qemu defect
(thanks Paolo for all the help!!). This debug data and flexibility is
critical to making these hugely complex systems work.

Jason

2023-11-27 13:55:07

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote:
> +config MLX5CTL
> + tristate "mlx5 ConnectX control misc driver"
> + depends on MLX5_CORE
> + help
> + MLX5CTL provides interface for the user process to access the debug and
> + configuration registers of the ConnectX hardware family
> + (NICs, PCI switches and SmartNIC SoCs).
> + This will allow configuration and debug tools to work out of the box on
> + mainstream kernel.

Did you run checkpatch.pl on this? You lost the tabs :(

> --- /dev/null
> +++ b/drivers/misc/mlx5ctl/main.c
> @@ -0,0 +1,314 @@
> +// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0

Why is this dual licensed? Again, you are using a GPL-only api for this
driver, how would that even be possible to make this code be BSD?

I thought we already discussed this, AND I talked to someone who
discussed this with a nvidia lawyer already and I thought this was going
to get changed. What happened to that?

thanks,

greg k-h

2023-11-27 14:41:22

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Mon, Nov 27, 2023 at 01:36:54PM +0000, Greg Kroah-Hartman wrote:

> Why is this dual licensed? Again, you are using a GPL-only api for this
> driver, how would that even be possible to make this code be BSD?

The code to FreeBSD with mlx5 related dual licensed code integrated is
freely available to inspect. I've never looked myself but I'm told
FreeBSD changes the reference Linux code to remove the use of GPL
code, and that FreeBSD has created BSD licensed versions of some
kernel APIs to support that.

> I thought we already discussed this, AND I talked to someone who
> discussed this with a nvidia lawyer already and I thought this was going
> to get changed. What happened to that?

It is in the cover letter. You asked for an approval and statement
from our legal and we obtained it. Our lawyers did a review, discussed
with a LF contact, and continue to instruct to use the dual
license. We've done what you required us to do.

The summary I have of the call you refer to does not include a
discussion or agreement about change in nvidia policy regarding mlx5
code.

Like Dave said, our lawyers are not your lawyers. Now that we have
involved legal, and they have given an instruction, we must follow it.

Regards,
Jason

2023-11-27 15:52:06

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Mon, Nov 27, 2023 at 10:40:17AM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 27, 2023 at 01:36:54PM +0000, Greg Kroah-Hartman wrote:
>
> > Why is this dual licensed? Again, you are using a GPL-only api for this
> > driver, how would that even be possible to make this code be BSD?
>
> The code to FreeBSD with mlx5 related dual licensed code integrated is
> freely available to inspect. I've never looked myself but I'm told
> FreeBSD changes the reference Linux code to remove the use of GPL
> code, and that FreeBSD has created BSD licensed versions of some
> kernel APIs to support that.

Yeah, I'm not going to get into the legality of "creating BSD licensed
versions of gpl-only Linux apis" but note, lots of lawyers look
longingly at those things when they consider early retirement :)

> > I thought we already discussed this, AND I talked to someone who
> > discussed this with a nvidia lawyer already and I thought this was going
> > to get changed. What happened to that?
>
> It is in the cover letter. You asked for an approval and statement
> from our legal and we obtained it. Our lawyers did a review, discussed
> with a LF contact, and continue to instruct to use the dual
> license. We've done what you required us to do.

Ah, missed that, sorry, I didn't see it in the changes for this set of
patches, it was in the previous submission.

> The summary I have of the call you refer to does not include a
> discussion or agreement about change in nvidia policy regarding mlx5
> code.
>
> Like Dave said, our lawyers are not your lawyers. Now that we have
> involved legal, and they have given an instruction, we must follow it.

I think everyone involved is thankful that your lawyers are not mine :)

Ok, best of luck with this mess, I'll stop harping on it now and just
point out all of the other issues here. First off, you all need to get
the network maintainers to agree that this driver is ok to do this way,
and I don't think that has happened yet, so I'll wait on reviewing the
series until that is resolved.

thanks,

greg k-h

2023-11-27 16:17:56

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Mon, Nov 27, 2023 at 03:51:10PM +0000, Greg Kroah-Hartman wrote:

> Ok, best of luck with this mess, I'll stop harping on it now and just
> point out all of the other issues here. First off, you all need to get
> the network maintainers to agree that this driver is ok to do this way,
> and I don't think that has happened yet, so I'll wait on reviewing the
> series until that is resolved.

As I said already, I strongly disagree with the idea that the netdev
maintainers get a global veto on what happens with mlx5 devices just
because they sometimes have an ethernet port on the back of the card.

This module is primarily (but not exclusively) for rdma related
functionality, not netdev, and the RDMA maintainers Ack it.

Thanks,
Jason

2023-11-27 18:28:07

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Mon, Nov 27, 2023 at 12:17:32PM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 27, 2023 at 03:51:10PM +0000, Greg Kroah-Hartman wrote:
>
> > Ok, best of luck with this mess, I'll stop harping on it now and just
> > point out all of the other issues here. First off, you all need to get
> > the network maintainers to agree that this driver is ok to do this way,
> > and I don't think that has happened yet, so I'll wait on reviewing the
> > series until that is resolved.
>
> As I said already, I strongly disagree with the idea that the netdev
> maintainers get a global veto on what happens with mlx5 devices just
> because they sometimes have an ethernet port on the back of the card.

I understand you might disagree, however I hold their opinion in high
regard and want to ensure that they agree that exposing device-specific
debugging information for a device that deals with networking is ok to
do so in a device-specific misc device node and not through some other
way that other networking devices normally do (i.e. netlink or
some-other-such-thing.)

Note, device-specific character devices have almost always proven to be
a bad idea in the long run, I understand your immediate need to do
something like this, but remember that keeping it alive for the next 20+
years is going to be tough.

> This module is primarily (but not exclusively) for rdma related
> functionality, not netdev, and the RDMA maintainers Ack it.

In my mind, RDMA implies networking, as it's over a network connection,
but hey, I might be wrong :)

thanks,

greg k-h

2023-11-27 18:59:30

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote:
> +struct mlx5ctl_dev {
> + struct mlx5_core_dev *mdev;
> + struct miscdevice miscdev;
> + struct auxiliary_device *adev;
> + struct list_head fd_list;
> + spinlock_t fd_list_lock; /* protect list add/del */
> + struct rw_semaphore rw_lock;
> + struct kref refcount;

You now have 2 different things that control the lifespan of this
structure. We really need some way to automatically check this so that
people don't keep making this same mistake, it happens all the time :(

Please pick one structure (miscdevice) or the other (kref) to control
the lifespan, having 2 will just not work.

Also, why a rw_semaphore? Only use those if you can prove with a
benchmark that it is actually faster, otherwise it's simpler to just use
a normal mutex (hint, you are changing the fields in the structure with
the read lock held, which feels very wrong, and so needs a LOT of
documentation, or just use a normal mutex...)

thanks,

greg k-h

2023-11-27 19:09:29

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH V3 3/5] misc: mlx5ctl: Add info ioctl

On Mon, Nov 20, 2023 at 11:06:17PM -0800, Saeed Mahameed wrote:
> +static int mlx5ctl_info_ioctl(struct file *file,
> + struct mlx5ctl_info __user *arg,
> + size_t usize)
> +{
> + struct mlx5ctl_fd *mfd = file->private_data;
> + struct mlx5ctl_dev *mcdev = mfd->mcdev;
> + struct mlx5_core_dev *mdev = mcdev->mdev;
> + struct mlx5ctl_info *info;
> + size_t ksize = 0;
> + int err = 0;
> +
> + ksize = max(sizeof(struct mlx5ctl_info), usize);

Why / How can usize be larger than the structure size and you still want
to allocate a memory chunk that big? Shouldn't the size always match?

And what if it's too small?

> + info = kzalloc(ksize, GFP_KERNEL_ACCOUNT);

Why account as it will go away almost instantly?

> + if (!info)
> + return -ENOMEM;
> +
> + info->size = sizeof(struct mlx5ctl_info);
> +
> + info->dev_uctx_cap = MLX5_CAP_GEN(mdev, uctx_cap);
> + info->uctx_cap = mfd->uctx_cap;
> + info->uctx_uid = mfd->uctx_uid;
> + info->ucap = mfd->ucap;
> +
> + strscpy(info->devname, dev_name(&mdev->pdev->dev),
> + sizeof(info->devname));
> +
> + if (copy_to_user(arg, info, usize))
> + err = -EFAULT;

So if usize is smaller than the structure you don't copy it all?

What am I missing here?

> +
> + kfree(info);
> + return err;
> +}
> +
> +static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + struct mlx5ctl_fd *mfd = file->private_data;
> + struct mlx5ctl_dev *mcdev = mfd->mcdev;
> + void __user *argp = (void __user *)arg;
> + size_t size = _IOC_SIZE(cmd);
> + int err = 0;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + mlx5ctl_dbg(mcdev, "ioctl 0x%x type/nr: %d/%d size: %d DIR:%d\n", cmd,
> + _IOC_TYPE(cmd), _IOC_NR(cmd), _IOC_SIZE(cmd), _IOC_DIR(cmd));
> +
> + down_read(&mcdev->rw_lock);
> + if (!mcdev->mdev) {
> + err = -ENODEV;
> + goto unlock;
> + }
> +
> + switch (cmd) {
> + case MLX5CTL_IOCTL_INFO:
> + err = mlx5ctl_info_ioctl(file, argp, size);
> + break;
> +
> + default:
> + mlx5ctl_dbg(mcdev, "Unknown ioctl %x\n", cmd);
> + err = -ENOIOCTLCMD;

-ENOTTY is the correct error.

> --- /dev/null
> +++ b/include/uapi/misc/mlx5ctl.h
> @@ -0,0 +1,24 @@
> +/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 WITH Linux-syscall-note */
> +/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
> +
> +#ifndef __MLX5CTL_IOCTL_H__
> +#define __MLX5CTL_IOCTL_H__
> +
> +struct mlx5ctl_info {
> + __aligned_u64 flags;

Is this used?

> + __u32 size;
> + __u8 devname[64]; /* underlaying ConnectX device */

64 should be a define somewhere, right? And why 64?

> + __u16 uctx_uid; /* current process allocated UCTX UID */
> + __u16 reserved1;

Where is this checked to be always 0? Well it's a read so I guess where
is the documentation saying it will always be set to 0?

> + __u32 uctx_cap; /* current process effective UCTX cap */
> + __u32 dev_uctx_cap; /* device's UCTX capabilities */
> + __u32 ucap; /* process user capability */
> + __u32 reserved2;

Same here.

And why reserve anything? What does that help with?

thanks,

greg k-h

2023-11-27 19:26:24

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On 27 Nov 18:27, Greg Kroah-Hartman wrote:
>On Mon, Nov 27, 2023 at 12:17:32PM -0400, Jason Gunthorpe wrote:
>> On Mon, Nov 27, 2023 at 03:51:10PM +0000, Greg Kroah-Hartman wrote:
>>
>> > Ok, best of luck with this mess, I'll stop harping on it now and just
>> > point out all of the other issues here. First off, you all need to get
>> > the network maintainers to agree that this driver is ok to do this way,
>> > and I don't think that has happened yet, so I'll wait on reviewing the
>> > series until that is resolved.
>>
>> As I said already, I strongly disagree with the idea that the netdev
>> maintainers get a global veto on what happens with mlx5 devices just
>> because they sometimes have an ethernet port on the back of the card.
>
>I understand you might disagree, however I hold their opinion in high
>regard and want to ensure that they agree that exposing device-specific
>debugging information for a device that deals with networking is ok to
>do so in a device-specific misc device node and not through some other
>way that other networking devices normally do (i.e. netlink or
>some-other-such-thing.)
>
>Note, device-specific character devices have almost always proven to be
>a bad idea in the long run, I understand your immediate need to do
>something like this, but remember that keeping it alive for the next 20+
>years is going to be tough.
>

This driver is different as it doesn't replace existing mlx5 drivers,
mlx5 functionality drivers are still there to expose the device features
through the standard stacks, this is just a companion driver to access
debug information, by driver and FW design mlx5ctl is not meant to
manage or pilot the device like other device specific char drivers.

To be clear this debug driver (or at least an older version of it)
has been already in use for over than 15 years, since the beginning
of mlx5, we used to only provide it as external package called mft
debug tools [1] which has the kernel parts as well. Now it's time to
upstream it.

mlx5ctl will keep serving existing and future HW for the next few decades,
I am pretty sure of that. as the cover-letter explains mlx5 architecture
is set in stone and written in ink, the same mlx5 drivers work on any
ConnectX chip since 2012, and the will keep working on the next generations
of chips, mlx5ctl will be no different.

[1] https://network.nvidia.com/products/adapter-software/firmware-tools/

>> This module is primarily (but not exclusively) for rdma related
>> functionality, not netdev, and the RDMA maintainers Ack it.
>

For Infiniband/virtio/vfio/vdpa/nvme/fpga ConnectX devices mlx5 netdev
doesn't even exist, so it is not reasonable to ask that the debug
interface should go via the netdev stack, mlx5ctl is needed to serve
all users of mlx5 devices, not only netdev (networking).

So I really find this odd, that one stack maintainer gets a veto over all
others.

>In my mind, RDMA implies networking, as it's over a network connection,
>but hey, I might be wrong :)
>
>thanks,
>
>greg k-h

2023-11-27 20:39:32

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 3/5] misc: mlx5ctl: Add info ioctl

On 27 Nov 19:09, Greg Kroah-Hartman wrote:
>On Mon, Nov 20, 2023 at 11:06:17PM -0800, Saeed Mahameed wrote:
>> +static int mlx5ctl_info_ioctl(struct file *file,
>> + struct mlx5ctl_info __user *arg,
>> + size_t usize)
>> +{
>> + struct mlx5ctl_fd *mfd = file->private_data;
>> + struct mlx5ctl_dev *mcdev = mfd->mcdev;
>> + struct mlx5_core_dev *mdev = mcdev->mdev;
>> + struct mlx5ctl_info *info;
>> + size_t ksize = 0;
>> + int err = 0;
>> +
>> + ksize = max(sizeof(struct mlx5ctl_info), usize);
>
>Why / How can usize be larger than the structure size and you still want
>to allocate a memory chunk that big? Shouldn't the size always match?
>

new user-space old kernel, the driver would allocate the usiae and make
sure to clear all the buffer with 0's, then fill in what the kernel
understands and send the whole buffer back to user with trailer always
zeroed out.

>And what if it's too small?
>

Too small means old user new kernel, the kernel will honor that and only
fill in what the user understands.

>> + info = kzalloc(ksize, GFP_KERNEL_ACCOUNT);
>
>Why account as it will go away almost instantly?
>

Will change that back to GFP_KERNEL.

>> + if (!info)
>> + return -ENOMEM;
>> +
>> + info->size = sizeof(struct mlx5ctl_info);
>> +
>> + info->dev_uctx_cap = MLX5_CAP_GEN(mdev, uctx_cap);
>> + info->uctx_cap = mfd->uctx_cap;
>> + info->uctx_uid = mfd->uctx_uid;
>> + info->ucap = mfd->ucap;
>> +
>> + strscpy(info->devname, dev_name(&mdev->pdev->dev),
>> + sizeof(info->devname));
>> +
>> + if (copy_to_user(arg, info, usize))
>> + err = -EFAULT;
>
>So if usize is smaller than the structure you don't copy it all?
>
>What am I missing here?
>

We copy back to user as much data as they expect, newer kernel can have
extended functionality and we want to allow expanding the ioctl structs in
the future while keeping support for old user, so new kernel would honor
old user and copy only the size the user expects.

>> +
>> + kfree(info);
>> + return err;
>> +}
>> +
>> +static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> +{
>> + struct mlx5ctl_fd *mfd = file->private_data;
>> + struct mlx5ctl_dev *mcdev = mfd->mcdev;
>> + void __user *argp = (void __user *)arg;
>> + size_t size = _IOC_SIZE(cmd);
>> + int err = 0;
>> +
>> + if (!capable(CAP_SYS_ADMIN))
>> + return -EPERM;
>> +
>> + mlx5ctl_dbg(mcdev, "ioctl 0x%x type/nr: %d/%d size: %d DIR:%d\n", cmd,
>> + _IOC_TYPE(cmd), _IOC_NR(cmd), _IOC_SIZE(cmd), _IOC_DIR(cmd));
>> +
>> + down_read(&mcdev->rw_lock);
>> + if (!mcdev->mdev) {
>> + err = -ENODEV;
>> + goto unlock;
>> + }
>> +
>> + switch (cmd) {
>> + case MLX5CTL_IOCTL_INFO:
>> + err = mlx5ctl_info_ioctl(file, argp, size);
>> + break;
>> +
>> + default:
>> + mlx5ctl_dbg(mcdev, "Unknown ioctl %x\n", cmd);
>> + err = -ENOIOCTLCMD;
>
>-ENOTTY is the correct error.
>
>> --- /dev/null
>> +++ b/include/uapi/misc/mlx5ctl.h
>> @@ -0,0 +1,24 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 WITH Linux-syscall-note */
>> +/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
>> +
>> +#ifndef __MLX5CTL_IOCTL_H__
>> +#define __MLX5CTL_IOCTL_H__
>> +
>> +struct mlx5ctl_info {
>> + __aligned_u64 flags;
>
>Is this used?
>

no, not yet, but it is good for future extendibility and compatibility
checking.

>> + __u32 size;
>> + __u8 devname[64]; /* underlaying ConnectX device */
>
>64 should be a define somewhere, right? And why 64?
>

It is usually the kobj->name of the underlying device, I will have to
define this in the uAPI. 64 seemed large enough, any other suggestion ?

This field is informational only for the user to have an idea which is the
underlying physical device, it's ok if in odd situation the name has to be
truncated to fit into the uAPI buffer.

>> + __u16 uctx_uid; /* current process allocated UCTX UID */
>> + __u16 reserved1;
>
>Where is this checked to be always 0? Well it's a read so I guess where
>is the documentation saying it will always be set to 0?
>

I forgot to add the checks in the info ioctl path, will add that.
Isn't it an unwritten rule that reserved fields has to be always 0 ?
Do I really need to document this ?

>> + __u32 uctx_cap; /* current process effective UCTX cap */
>> + __u32 dev_uctx_cap; /* device's UCTX capabilities */
>> + __u32 ucap; /* process user capability */
>> + __u32 reserved2;
>
>Same here.
>
>And why reserve anything? What does that help with?
>

struct alignment and explicit padding.

>thanks,
>
>greg k-h

2023-11-28 00:07:41

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Mon, 27 Nov 2023 11:26:06 -0800 Saeed Mahameed wrote:
> This driver is different as it doesn't replace existing mlx5 drivers,
> mlx5 functionality drivers are still there to expose the device features
> through the standard stacks, this is just a companion driver to access
> debug information, by driver and FW design mlx5ctl is not meant to
> manage or pilot the device like other device specific char drivers.

You keep saying "debug information" which is really underselling this
driver. Are you not going to support mstreg?

The common development flow as far as netdev is concerned is:
- some customer is interested in a new feature of a chip
- vendor hacks the support out of tree, using oot module and/or
user space tooling
- customer does a PoC with that hacked up, non-upstream solution
- if it works, vendor has to find out a proper upstream API,
hopefully agreed on with other vendors
- if it doesn't match customer needs the whole thing lands in the bin

If the vendor specific PoC can be achieved with fully upstream software
we lose leverage to force vendors to agree on common APIs.

This should all be self-evident for people familiar with netdev, but
I thought I'd spell it out.

I understand that the device has other uses, but every modern NIC has
other uses. I don't see how netdev can agree to this driver as long as
there is potential for configuring random networking things thru it.
All major netdev vendors have a set of out of tree tools / "expose
everything misc drivers", "for debug". They will soon follow your
example if we let this in.

2023-11-28 04:47:00

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Mon, Nov 27, 2023 at 04:07:19PM -0800, Jakub Kicinski wrote:
> On Mon, 27 Nov 2023 11:26:06 -0800 Saeed Mahameed wrote:
> > This driver is different as it doesn't replace existing mlx5 drivers,
> > mlx5 functionality drivers are still there to expose the device features
> > through the standard stacks, this is just a companion driver to access
> > debug information, by driver and FW design mlx5ctl is not meant to
> > manage or pilot the device like other device specific char drivers.
>
> You keep saying "debug information" which is really underselling this
> driver. Are you not going to support mstreg?
>
> The common development flow as far as netdev is concerned is:
> - some customer is interested in a new feature of a chip
> - vendor hacks the support out of tree, using oot module and/or
> user space tooling
> - customer does a PoC with that hacked up, non-upstream solution
> - if it works, vendor has to find out a proper upstream API,
> hopefully agreed on with other vendors
> - if it doesn't match customer needs the whole thing lands in the bin
>
> If the vendor specific PoC can be achieved with fully upstream software
> we lose leverage to force vendors to agree on common APIs.

Please elaborate on what "common" API there is to create here?

Do you agree that each ASIC in the device is unique and hence will
have made different trade offs - both features and nerd knobs to tune
and affect the performance and runtime capabilities? If you do not
agree, then we need to have a different discussion ...
If you do, please elaborate on the outline of some common API that
could possibly be done here.

You said no to the devlink parameters as a way to tune an ASIC. This
is a common, established tool, using a common, established message
channel but in an open, free form way of letting a customer see what
tunables there are for a specific H/W version and firmware version
and then set them. That is about as common as can be for different
vendors creating different ASICs with different goals and design
objectives. Yet, you somehow expect all of them to have nerd knob X
and tunable Y. That is not realistic.

>
> This should all be self-evident for people familiar with netdev, but
> I thought I'd spell it out.
>
> I understand that the device has other uses, but every modern NIC has
> other uses. I don't see how netdev can agree to this driver as long as
> there is potential for configuring random networking things thru it.
> All major netdev vendors have a set of out of tree tools / "expose
> everything misc drivers", "for debug". They will soon follow your
> example if we let this in.

Out of tree drivers are already ingrained into customers now. Mellanox
(in this case) has tried many different angles at getting access to H/W
unique tunables (i.e., the devlink comment) and now dumping huge amounts
of data. Your response to the devlink parameters attempt is to basically
abuse the firmware upgrade command as a way to push a binary blob that
can contain said settings (and I still have not fully wrapped my head
around the fact that you suggested that option).

What specifically are you looking for? There are different vendors and
different h/w options for specific market based reasons. Your hard line
stance against needs like this is what is pushing out of tree drivers
and tools to continue.

2023-11-28 09:13:55

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH V3 3/5] misc: mlx5ctl: Add info ioctl

On Mon, Nov 27, 2023 at 12:39:22PM -0800, Saeed Mahameed wrote:
> On 27 Nov 19:09, Greg Kroah-Hartman wrote:
> > On Mon, Nov 20, 2023 at 11:06:17PM -0800, Saeed Mahameed wrote:
> > > +static int mlx5ctl_info_ioctl(struct file *file,
> > > + struct mlx5ctl_info __user *arg,
> > > + size_t usize)
> > > +{
> > > + struct mlx5ctl_fd *mfd = file->private_data;
> > > + struct mlx5ctl_dev *mcdev = mfd->mcdev;
> > > + struct mlx5_core_dev *mdev = mcdev->mdev;
> > > + struct mlx5ctl_info *info;
> > > + size_t ksize = 0;
> > > + int err = 0;
> > > +
> > > + ksize = max(sizeof(struct mlx5ctl_info), usize);
> >
> > Why / How can usize be larger than the structure size and you still want
> > to allocate a memory chunk that big? Shouldn't the size always match?
> >
>
> new user-space old kernel, the driver would allocate the usiae and make
> sure to clear all the buffer with 0's, then fill in what the kernel
> understands and send the whole buffer back to user with trailer always
> zeroed out.

No, at that point you know something is wrong and you need to just abort
and return -EINVAL as the structure sizes do not match.

If you need to "extend" the structure to include more information, do so
in a new ioctl.

> > > --- /dev/null
> > > +++ b/include/uapi/misc/mlx5ctl.h
> > > @@ -0,0 +1,24 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 WITH Linux-syscall-note */
> > > +/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
> > > +
> > > +#ifndef __MLX5CTL_IOCTL_H__
> > > +#define __MLX5CTL_IOCTL_H__
> > > +
> > > +struct mlx5ctl_info {
> > > + __aligned_u64 flags;
> >
> > Is this used?
> >
>
> no, not yet, but it is good for future extendibility and compatibility
> checking.

But you are not checking anything now, so please don't include something
that will not work in the future.

> > > + __u32 size;
> > > + __u8 devname[64]; /* underlaying ConnectX device */
> >
> > 64 should be a define somewhere, right? And why 64?
> >
>
> It is usually the kobj->name of the underlying device, I will have to
> define this in the uAPI. 64 seemed large enough, any other suggestion ?

What happens if the names get bigger?

> This field is informational only for the user to have an idea which is the
> underlying physical device, it's ok if in odd situation the name has to be
> truncated to fit into the uAPI buffer.

As the truncation will happen on the right side of the string, usually
the actual device id or unique identifier, that's not going to help out
much to drop that portion :(

> > > + __u16 uctx_uid; /* current process allocated UCTX UID */
> > > + __u16 reserved1;
> >
> > Where is this checked to be always 0? Well it's a read so I guess where
> > is the documentation saying it will always be set to 0?
> >
>
> I forgot to add the checks in the info ioctl path, will add that.
> Isn't it an unwritten rule that reserved fields has to be always 0 ?
> Do I really need to document this ?

It is a written rule that reserved fields must be 0, please see the
documentation for how to write an ioctl.

thanks,

greg k-h

2023-11-28 14:54:07

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Mon, 27 Nov 2023 21:46:28 -0700 David Ahern wrote:
> > You keep saying "debug information" which is really underselling this
> > driver. Are you not going to support mstreg?
> >
> > The common development flow as far as netdev is concerned is:
> > - some customer is interested in a new feature of a chip
> > - vendor hacks the support out of tree, using oot module and/or
> > user space tooling
> > - customer does a PoC with that hacked up, non-upstream solution
> > - if it works, vendor has to find out a proper upstream API,
> > hopefully agreed on with other vendors
> > - if it doesn't match customer needs the whole thing lands in the bin
> >
> > If the vendor specific PoC can be achieved with fully upstream software
> > we lose leverage to force vendors to agree on common APIs.
>
> Please elaborate on what "common" API there is to create here?

Damn, am I so bad at explaining basic things or y'all are spending
5 seconds reading this and are not really paying attention? :|

> Do you agree that each ASIC in the device is unique and hence will
> have made different trade offs - both features and nerd knobs to tune
> and affect the performance and runtime capabilities? If you do not
> agree, then we need to have a different discussion ...
> If you do, please elaborate on the outline of some common API that
> could possibly be done here.

We have devlink params. If that doesn't scale we can look for other
solutions but let's see them not scale _in practice_ first.

> You said no to the devlink parameters as a way to tune an ASIC.

What? When?

> This is a common, established tool, using a common, established message
> channel but in an open, free form way of letting a customer see what
> tunables there are for a specific H/W version and firmware version
> and then set them. That is about as common as can be for different
> vendors creating different ASICs with different goals and design
> objectives. Yet, you somehow expect all of them to have nerd knob X
> and tunable Y. That is not realistic.

I don't know what you're talking about.

> > This should all be self-evident for people familiar with netdev, but
> > I thought I'd spell it out.
> >
> > I understand that the device has other uses, but every modern NIC has
> > other uses. I don't see how netdev can agree to this driver as long as
> > there is potential for configuring random networking things thru it.
> > All major netdev vendors have a set of out of tree tools / "expose
> > everything misc drivers", "for debug". They will soon follow your
> > example if we let this in.
>
> Out of tree drivers are already ingrained into customers now. Mellanox
> (in this case) has tried many different angles at getting access to H/W
> unique tunables (i.e., the devlink comment) and now dumping huge amounts
> of data. Your response to the devlink parameters attempt is to basically
> abuse the firmware upgrade command as a way to push a binary blob that
> can contain said settings (and I still have not fully wrapped my head
> around the fact that you suggested that option).
>
> What specifically are you looking for? There are different vendors and
> different h/w options for specific market based reasons. Your hard line
> stance against needs like this is what is pushing out of tree drivers
> and tools to continue.

Sounds like you'd like a similar open-ended interface for your device.

2023-11-28 16:24:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Tue, Nov 28, 2023 at 06:53:21AM -0800, Jakub Kicinski wrote:
> > You said no to the devlink parameters as a way to tune an ASIC.
>
> What? When?

You said you already rejected it at the very start of this discussion
and linked to the video recording of the rejection discussion:

https://lore.kernel.org/all/[email protected]/

This session was specifically on the 600 FW configuration parameters
that mlx5 has. This is something that is done today on non-secure boot
systems with direct PCI access on sysfs and would be absorbed into
this driver on secure-boot systems. Ie nothing really changes from the
broader ecosystem perspective.

The discussion starts at time index 22:11.

Dave (IIRC? sorry) is talking about unique things and suggesting
devlink. Many people in the audiance seem to support this idea

27:00 -> 28:00 you repeated this thread's argument about said NO
"occasionally you are allowed to use [devlink parameters] them"

At 29 you said just keep it all out of tree.

Oh, I chimed in around 30:00 saying it is not the kernel maintainers
job to force standardization. This is a user problem.

31:25 you said again "nothing"

31:30 S390 teams would like this and pushed back against "keep it all
out of tree" describing the situation as a "deadlock".

33:00 you said need two implementors for device specific parameters,
which misses the point of the discussion, IMHO.

And this was at the Dublin LPC, so over a year ago.

I second Dave's question - if you do not like mlx5ctl, then what is
your vision to solve all these user problems?

Jason

2023-11-28 16:44:34

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Tue, 28 Nov 2023 12:24:13 -0400 Jason Gunthorpe wrote:
> You said you already rejected it at the very start of this discussion
> and linked to the video recording of the rejection discussion:
>
> https://lore.kernel.org/all/[email protected]/
>
> This session was specifically on the 600 FW configuration parameters
> that mlx5 has. This is something that is done today on non-secure boot
> systems with direct PCI access on sysfs and would be absorbed into
> this driver on secure-boot systems. Ie nothing really changes from the
> broader ecosystem perspective.

The question at LPC was about making devlink params completely
transparent to the kernel. Basically added directly from FW.
That what I was not happy about.

You can add as many params at the driver level as you want.
In fact I asked Saeed repeatedly to start posting all those
params instead of complaining.

> I second Dave's question - if you do not like mlx5ctl, then what is
> your vision to solve all these user problems?

Let the users complain about the user problems. Also something
I repeatedly told Saeed. His response was something along the lines
of users are secret, they can't post on the list, blah, blah.

You know one user who is participating in this thread?
*ME*
While the lot of you work for vendors.

2023-11-28 16:52:20

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On 11/28/23 7:53 AM, Jakub Kicinski wrote:
> On Mon, 27 Nov 2023 21:46:28 -0700 David Ahern wrote:
>>> You keep saying "debug information" which is really underselling this
>>> driver. Are you not going to support mstreg?
>>>
>>> The common development flow as far as netdev is concerned is:
>>> - some customer is interested in a new feature of a chip
>>> - vendor hacks the support out of tree, using oot module and/or
>>> user space tooling
>>> - customer does a PoC with that hacked up, non-upstream solution
>>> - if it works, vendor has to find out a proper upstream API,
>>> hopefully agreed on with other vendors
>>> - if it doesn't match customer needs the whole thing lands in the bin
>>>
>>> If the vendor specific PoC can be achieved with fully upstream software
>>> we lose leverage to force vendors to agree on common APIs.
>>
>> Please elaborate on what "common" API there is to create here?
>
> Damn, am I so bad at explaining basic things or y'all are spending
> 5 seconds reading this and are not really paying attention? :|
>
>> Do you agree that each ASIC in the device is unique and hence will
>> have made different trade offs - both features and nerd knobs to tune
>> and affect the performance and runtime capabilities? If you do not
>> agree, then we need to have a different discussion ...
>> If you do, please elaborate on the outline of some common API that
>> could possibly be done here.
>
> We have devlink params. If that doesn't scale we can look for other
> solutions but let's see them not scale _in practice_ first.
>
>> You said no to the devlink parameters as a way to tune an ASIC.
>
> What? When?

Jason responded with the same LPC reference, so I will only add a
reference to the slides for interested parties:

https://lpc.events/event/16/contributions/1359/attachments/1092/2094/FW-Centric-devices.pdf

>
> Sounds like you'd like a similar open-ended interface for your device.

That's the danger of chiming in on threads like this - people drawing
conclusions they should not.

My interest on this thread is along the same lines for commenting during
both the LPC 2022 discussion and again at netconf this year - trying to
understand and keep track of the strongly held opinions of maintainers
and what options are deemed off limits for similar needs. No vendor
wants to go in circles redesigning and rewriting s/w.

2023-11-28 17:52:45

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Tue, Nov 28, 2023 at 08:44:21AM -0800, Jakub Kicinski wrote:
> On Tue, 28 Nov 2023 12:24:13 -0400 Jason Gunthorpe wrote:
> > You said you already rejected it at the very start of this discussion
> > and linked to the video recording of the rejection discussion:
> >
> > https://lore.kernel.org/all/[email protected]/
> >
> > This session was specifically on the 600 FW configuration parameters
> > that mlx5 has. This is something that is done today on non-secure boot
> > systems with direct PCI access on sysfs and would be absorbed into
> > this driver on secure-boot systems. Ie nothing really changes from the
> > broader ecosystem perspective.
>
> The question at LPC was about making devlink params completely
> transparent to the kernel. Basically added directly from FW.
> That what I was not happy about.

It is creating a back-porting nightmare for all the enterprise
distributions.

> You can add as many params at the driver level as you want.
> In fact I asked Saeed repeatedly to start posting all those
> params instead of complaining.

That really isn't what you said in the video.

Regardless, configurables are only one part of what mlx5ctl addresses,
we still have all the debugability problems, which are arguably more
important.

> > I second Dave's question - if you do not like mlx5ctl, then what is
> > your vision to solve all these user problems?
>
> Let the users complain about the user problems. Also something
> I repeatedly told Saeed. His response was something along the lines
> of users are secret, they can't post on the list, blah, blah.

You mean like the S390 team at IBM did in the video?

This is not a reasonable position. One of the jobs of the vendors is
to aggregate the user requests. Even the giant hyperscale customers
that do have the capacity to come on this list prefer to delegate
these things to us.

If you want to get a direct user forum the kernel mailing list is not
an appropriate place to do it.

> You know one user who is participating in this thread?
> *ME*
> While the lot of you work for vendors.

I'm sick of this vendor bashing. You work for *one* user. You know who
talks to *every* user out there? *ME*.

User and vendors need debugging of this complex HW. I don't need to
bring a parade of a dozen users to this thread to re-enforce that
obvious truth. Indeed when debugging is required the vendor usually
has to do it, so we are the user in this discussion.

You didn't answer the question, what is your alternative debug-ability
vision here?

Jason

2023-11-28 18:33:38

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote:
> > The question at LPC was about making devlink params completely
> > transparent to the kernel. Basically added directly from FW.
> > That what I was not happy about.
>
> It is creating a back-porting nightmare for all the enterprise
> distributions.

We don't care about enterprise distros, Jason, or stable kernel APIs.

> > You can add as many params at the driver level as you want.
> > In fact I asked Saeed repeatedly to start posting all those
> > params instead of complaining.
>
> That really isn't what you said in the video.
>
> Regardless, configurables are only one part of what mlx5ctl addresses,
> we still have all the debugability problems, which are arguably more
> important.

Read-only debug interfaces are "do whatever you want" in netdev.
Params controlling them (ie. writing stuff) need to be reviewed
but are also allowed.

Doesn't mlx5 have a pile of stuff in debugfs already?

Nobody bothered to answer my "are you not going support mstreg over
this" question (arbitrary register writes).

> > Let the users complain about the user problems. Also something
> > I repeatedly told Saeed. His response was something along the lines
> > of users are secret, they can't post on the list, blah, blah.
>
> You mean like the S390 team at IBM did in the video?
>
> This is not a reasonable position. One of the jobs of the vendors is
> to aggregate the user requests. Even the giant hyperscale customers
> that do have the capacity to come on this list prefer to delegate
> these things to us.
>
> If you want to get a direct user forum the kernel mailing list is not
> an appropriate place to do it.

Agree to disagree.

> > You know one user who is participating in this thread?
> > *ME*
> > While the lot of you work for vendors.
>
> I'm sick of this vendor bashing. You work for *one* user. You know who
> talks to *every* user out there? *ME*.
>
> User and vendors need debugging of this complex HW. I don't need to
> bring a parade of a dozen users to this thread to re-enforce that
> obvious truth. Indeed when debugging is required the vendor usually
> has to do it, so we are the user in this discussion.
>
> You didn't answer the question, what is your alternative debug-ability
> vision here?

Covered above. And it's been discussed multiple times.

Honestly I don't want to spend any more time discussing this.
Once you're ready to work together in good faith let me know.

On future revisions of this series please carry:

Nacked-by: Jakub Kicinski <[email protected]>

2023-11-28 19:32:03

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On 28 Nov 08:44, Jakub Kicinski wrote:
>On Tue, 28 Nov 2023 12:24:13 -0400 Jason Gunthorpe wrote:
>> You said you already rejected it at the very start of this discussion
>> and linked to the video recording of the rejection discussion:
>>
>> https://lore.kernel.org/all/[email protected]/
>>
>> This session was specifically on the 600 FW configuration parameters
>> that mlx5 has. This is something that is done today on non-secure boot
>> systems with direct PCI access on sysfs and would be absorbed into
>> this driver on secure-boot systems. Ie nothing really changes from the
>> broader ecosystem perspective.
>
>The question at LPC was about making devlink params completely
>transparent to the kernel. Basically added directly from FW.
>That what I was not happy about.
>
>You can add as many params at the driver level as you want.
>In fact I asked Saeed repeatedly to start posting all those
>params instead of complaining.
>

We posted many params over the years the you shot down on the spot,
do you really expect me to implement 600 of those knowing that you will
nack 80% of them asking to have common knobs for all vendors, and you know
that is impossible.
you nack patches then ask for a porpossal, we came up with many proposal
and discussed them face to face on multiple occasions, LPC/netconf etc,
then you ask for patches, then you nack again, we are just going in circles
here..

>> I second Dave's question - if you do not like mlx5ctl, then what is
>> your vision to solve all these user problems?
>
>Let the users complain about the user problems. Also something
>I repeatedly told Saeed. His response was something along the lines
>of users are secret, they can't post on the list, blah, blah.
>

I never said it is a secret, but I can't publicly reveal who my customers
are and what they want, you know very well who asked for the high
frequency counter sampling.. So we came up with a very clear solution,
that has nothing to do with netdev, since for that particular use-case it
is not netdev specific, and netdev stack isn't even present.

>You know one user who is participating in this thread?
>*ME*
>While the lot of you work for vendors.

And how *YOU* expect the vendors to debug *YOUR* issues, if you don't
allow them to access their HW?

Asking all vendors to use *YOUR* "devlink generic_dev generic_knob" is an
insult to all vendors, how about you provide the ASIC design and RTLs
to all vendors and we just manufacture it for you..

2023-11-28 19:55:28

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On 28 Nov 10:33, Jakub Kicinski wrote:
>On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote:
>> > The question at LPC was about making devlink params completely
>> > transparent to the kernel. Basically added directly from FW.
>> > That what I was not happy about.
>>
>> It is creating a back-porting nightmare for all the enterprise
>> distributions.
>
>We don't care about enterprise distros, Jason, or stable kernel APIs.
>
>> > You can add as many params at the driver level as you want.
>> > In fact I asked Saeed repeatedly to start posting all those
>> > params instead of complaining.
>>
>> That really isn't what you said in the video.
>>
>> Regardless, configurables are only one part of what mlx5ctl addresses,
>> we still have all the debugability problems, which are arguably more
>> important.
>
>Read-only debug interfaces are "do whatever you want" in netdev.
>Params controlling them (ie. writing stuff) need to be reviewed
>but are also allowed.
>
>Doesn't mlx5 have a pile of stuff in debugfs already?
>

not enough, not scalable and it's a backporting and maintenance nightmare
as Jason already showed.

mlx5 supports creating millions of objects, tools need to selectively
pick which objects to dump for a specific use case, if it's ok with you to
do this in debugfs, then ioctl is much cleaner .. so what's your problem
with mlx5ctl?


>Nobody bothered to answer my "are you not going support mstreg over
>this" question (arbitrary register writes).
>
>> > Let the users complain about the user problems. Also something
>> > I repeatedly told Saeed. His response was something along the lines
>> > of users are secret, they can't post on the list, blah, blah.
>>
>> You mean like the S390 team at IBM did in the video?
>>
>> This is not a reasonable position. One of the jobs of the vendors is
>> to aggregate the user requests. Even the giant hyperscale customers
>> that do have the capacity to come on this list prefer to delegate
>> these things to us.
>>
>> If you want to get a direct user forum the kernel mailing list is not
>> an appropriate place to do it.
>
>Agree to disagree.
>
>> > You know one user who is participating in this thread?
>> > *ME*
>> > While the lot of you work for vendors.
>>
>> I'm sick of this vendor bashing. You work for *one* user. You know who
>> talks to *every* user out there? *ME*.
>>
>> User and vendors need debugging of this complex HW. I don't need to
>> bring a parade of a dozen users to this thread to re-enforce that
>> obvious truth. Indeed when debugging is required the vendor usually
>> has to do it, so we are the user in this discussion.
>>
>> You didn't answer the question, what is your alternative debug-ability
>> vision here?
>
>Covered above. And it's been discussed multiple times.
>
>Honestly I don't want to spend any more time discussing this.
>Once you're ready to work together in good faith let me know.
>
>On future revisions of this series please carry:
>
>Nacked-by: Jakub Kicinski <[email protected]>

I asked before and I never got a technical answer, based on what?

All we got is just political views and complaints against vendors.
What is your proposal for accessing every possible debug information from a
vendor specific device ? devlink X Y Z, debugfs? won't work, sorry.

And I can't accept "do it out of tree" as an answer from a well
established linux maintainer, the whole point of this is to have this
available in every linux box with any mlx5 configuration (not only netdev)
so we can start debugging on the spot.

For your claims that we need this for setting device parameters, it
is simply not true, because we don't need this driver to do that,
so please go back and read the cover-letter and code, and let me know what
is wrong with our approach to get access to our device's debug info.


2023-11-28 20:10:12

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On 28 Nov 10:33, Jakub Kicinski wrote:
>On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote:
>> > The question at LPC was about making devlink params completely
>> > transparent to the kernel. Basically added directly from FW.
>> > That what I was not happy about.
>>
>> It is creating a back-porting nightmare for all the enterprise
>> distributions.
>
>We don't care about enterprise distros, Jason, or stable kernel APIs.
>

Oh, I missed this one, so you don't care about users?

Users often pay to distros for support, and distros always turn to vendors
for debug situations, in fact one of the high stakeholders for this is an
enterprise distro..

Also Jason and I are users, and more than 300 engineers at nvidia
and dozens of customers are users, deploying 100s of thousands of ConnectX
chips in their fleets.

if it weren't important for us and our users, I wouldn't even fight this
hard.. but it is important, and very useful, I can't express that hard
enough.

So you don't care ? Then don't ask about who the users are, in the other
email, apparently it's all about your personal opinion and views about
vendors what drives your responses. So it is really hard to debate with
you..

2023-11-29 08:53:54

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 3/5] misc: mlx5ctl: Add info ioctl

On 28 Nov 09:13, Greg Kroah-Hartman wrote:
>On Mon, Nov 27, 2023 at 12:39:22PM -0800, Saeed Mahameed wrote:
>> On 27 Nov 19:09, Greg Kroah-Hartman wrote:
>> > On Mon, Nov 20, 2023 at 11:06:17PM -0800, Saeed Mahameed wrote:
>> > > +static int mlx5ctl_info_ioctl(struct file *file,
>> > > + struct mlx5ctl_info __user *arg,
>> > > + size_t usize)
>> > > +{
>> > > + struct mlx5ctl_fd *mfd = file->private_data;
>> > > + struct mlx5ctl_dev *mcdev = mfd->mcdev;
>> > > + struct mlx5_core_dev *mdev = mcdev->mdev;
>> > > + struct mlx5ctl_info *info;
>> > > + size_t ksize = 0;
>> > > + int err = 0;
>> > > +
>> > > + ksize = max(sizeof(struct mlx5ctl_info), usize);
>> >
>> > Why / How can usize be larger than the structure size and you still want
>> > to allocate a memory chunk that big? Shouldn't the size always match?
>> >
>>
>> new user-space old kernel, the driver would allocate the usiae and make
>> sure to clear all the buffer with 0's, then fill in what the kernel
>> understands and send the whole buffer back to user with trailer always
>> zeroed out.
>
>No, at that point you know something is wrong and you need to just abort
>and return -EINVAL as the structure sizes do not match.
>
>If you need to "extend" the structure to include more information, do so
>in a new ioctl.
>

Ack, will remove these fields.

>> > > --- /dev/null
>> > > +++ b/include/uapi/misc/mlx5ctl.h
>> > > @@ -0,0 +1,24 @@
>> > > +/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 WITH Linux-syscall-note */
>> > > +/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
>> > > +
>> > > +#ifndef __MLX5CTL_IOCTL_H__
>> > > +#define __MLX5CTL_IOCTL_H__
>> > > +
>> > > +struct mlx5ctl_info {
>> > > + __aligned_u64 flags;
>> >
>> > Is this used?
>> >
>>
>> no, not yet, but it is good for future extendibility and compatibility
>> checking.
>
>But you are not checking anything now, so please don't include something
>that will not work in the future.
>

Ack, will remove.

>> > > + __u32 size;
>> > > + __u8 devname[64]; /* underlaying ConnectX device */
>> >
>> > 64 should be a define somewhere, right? And why 64?
>> >
>>
>> It is usually the kobj->name of the underlying device, I will have to
>> define this in the uAPI. 64 seemed large enough, any other suggestion ?
>
>What happens if the names get bigger?
>
>> This field is informational only for the user to have an idea which is the
>> underlying physical device, it's ok if in odd situation the name has to be
>> truncated to fit into the uAPI buffer.
>
>As the truncation will happen on the right side of the string, usually
>the actual device id or unique identifier, that's not going to help out
>much to drop that portion :(
>

Right :/, it's an assumption that mlx5 devices can either be a pci device
or an auxiliary device in case of a mlx5-subfunction, so i don't expect the
names to get larger and can easily fit in 64B string, but you are right, I
shouldn't make such assumption in an IOCTL, I will figure out something or
completely drop this field in V4.

>> > > + __u16 uctx_uid; /* current process allocated UCTX UID */
>> > > + __u16 reserved1;
>> >
>> > Where is this checked to be always 0? Well it's a read so I guess where
>> > is the documentation saying it will always be set to 0?
>> >
>>
>> I forgot to add the checks in the info ioctl path, will add that.
>> Isn't it an unwritten rule that reserved fields has to be always 0 ?
>> Do I really need to document this ?
>
>It is a written rule that reserved fields must be 0, please see the
>documentation for how to write an ioctl.
>

Ack, will document.

>thanks,
>
>greg k-h

2023-11-29 09:08:25

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Tue, Nov 28, 2023 at 12:10:00PM -0800, Saeed Mahameed wrote:
> On 28 Nov 10:33, Jakub Kicinski wrote:
> > On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote:
> > > > The question at LPC was about making devlink params completely
> > > > transparent to the kernel. Basically added directly from FW.
> > > > That what I was not happy about.
> > >
> > > It is creating a back-porting nightmare for all the enterprise
> > > distributions.
> >
> > We don't care about enterprise distros, Jason, or stable kernel APIs.
> >
>
> Oh, I missed this one, so you don't care about users?

Ok, time out please. This isn't going anywhere fast except to make
everyone else mad.

To Jakub's point, no, we don't care about enterprise distros, they are a
consumer of our releases and while some of them pay the salaries of our
developers, they really don't have much influence over our development
rules as they are just so far behind in releases that their releases
look nothing like what we do in places (i.e. Linux "like" just like many
Android systems.)

If the enterprise distros pop in here and give us their opinions of the
patchset, I would GREATLY appreciate it, as having more people review
code at this point in time would be most helpful instead of having to
hear about how the interfaces are broken 2 years from now.

And I think that's what is driving your work now, the "enterprise"
distros finally picked up the "lock down the kernel from random PCI
device access in userspace" which caused you to have to drop your
userspace implementation to create a real kernel driver, correct?

And as for stable kernel apis, you all know that's just not a thing, and
has nothing to do with users EXCEPT it benefits users because it keeps
kernel code smaller and faster overall, that's well documented and users
appreciate it.

> Users often pay to distros for support, and distros always turn to vendors
> for debug situations, in fact one of the high stakeholders for this is an
> enterprise distro..

Hey, I was right, an enterprise distro wants this driver, great, can you
get them to review it as well? I'm tired of being the only one to
review patches like this and could use the help if they are going to
rely on this (why do they pass that work to me?). They should be the
ones helping us catch basic things like the reference count issues I
pointed out, as they can test the code, I can't :)

But, let's step back a bit further.

It seems the network device normal interface for this is using devlink.
And drivers are allowed to have driver-specific devlink interfaces, as
you know, your driver has lots of them today. Why not just add more?
What's wrong with 600+ more interfaces being added as that way they
would be well documented and fit in with the existing infrastructure
that you have today.

Is the issue that the firmware doesn't guarantee that these interfaces
will be documented or stable or even known at this point in time? If
so, how are your going to have a good userspace tool for this? What am
I missing that requires a totally-new-and-custom user/kernel api from
what all other network drivers are using today?

thanks,

greg k-h

2023-11-29 09:09:01

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On 27 Nov 18:59, Greg Kroah-Hartman wrote:
>On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote:
>> +struct mlx5ctl_dev {
>> + struct mlx5_core_dev *mdev;
>> + struct miscdevice miscdev;
>> + struct auxiliary_device *adev;
>> + struct list_head fd_list;
>> + spinlock_t fd_list_lock; /* protect list add/del */
>> + struct rw_semaphore rw_lock;
>> + struct kref refcount;
>
>You now have 2 different things that control the lifespan of this
>structure. We really need some way to automatically check this so that
>people don't keep making this same mistake, it happens all the time :(
>
>Please pick one structure (miscdevice) or the other (kref) to control
>the lifespan, having 2 will just not work.
>

miscdevice doesn't handle the lifespan, open files will remain open even
after the miscdevice was unregistered, hence we use the kref to defer the
kfree until the last open file is closed.

>Also, why a rw_semaphore? Only use those if you can prove with a
>benchmark that it is actually faster, otherwise it's simpler to just use
>a normal mutex (hint, you are changing the fields in the structure with
>the read lock held, which feels very wrong, and so needs a LOT of
>documentation, or just use a normal mutex...)
>

It is needed so we can protect against underlaying device unloading while
miscdevice is active, we use rw semaphore since we don't care about
synchronization between any of the fops, but we want to protect current
active ioctls and fops from sudden mlx5ctl_remove (auxiliary_driver.remove)
which can happen dynamically due to underlaying device removal.

So here is the locking scheme:

write_lock() : only on mlx5_ctl remove and mark the device is down
via assigning NULL to mcdev->dev, to let all new readers abort and to wait
for current readers to finish their task.

read_lock(): used in all fops and ioctls, to make sure underlaying
mlx5_core device is still active, and to prevent open files to access the
device when miscdevice was already unregistered.

I agree, this should've been documented in the code, I will add
documentation.


>thanks,
>
>greg k-h

2023-11-29 09:20:58

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Wed, Nov 29, 2023 at 01:08:39AM -0800, Saeed Mahameed wrote:
> On 27 Nov 18:59, Greg Kroah-Hartman wrote:
> > On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote:
> > > +struct mlx5ctl_dev {
> > > + struct mlx5_core_dev *mdev;
> > > + struct miscdevice miscdev;
> > > + struct auxiliary_device *adev;
> > > + struct list_head fd_list;
> > > + spinlock_t fd_list_lock; /* protect list add/del */
> > > + struct rw_semaphore rw_lock;
> > > + struct kref refcount;
> >
> > You now have 2 different things that control the lifespan of this
> > structure. We really need some way to automatically check this so that
> > people don't keep making this same mistake, it happens all the time :(
> >
> > Please pick one structure (miscdevice) or the other (kref) to control
> > the lifespan, having 2 will just not work.
> >
>
> miscdevice doesn't handle the lifespan, open files will remain open even
> after the miscdevice was unregistered, hence we use the kref to defer the
> kfree until the last open file is closed.

miscdevice has a reference counter and a lifecycle, you can not have two
reference counted objects in the same structure and expect things to
work well.

> > Also, why a rw_semaphore? Only use those if you can prove with a
> > benchmark that it is actually faster, otherwise it's simpler to just use
> > a normal mutex (hint, you are changing the fields in the structure with
> > the read lock held, which feels very wrong, and so needs a LOT of
> > documentation, or just use a normal mutex...)
> >
>
> It is needed so we can protect against underlaying device unloading while
> miscdevice is active, we use rw semaphore since we don't care about
> synchronization between any of the fops, but we want to protect current
> active ioctls and fops from sudden mlx5ctl_remove (auxiliary_driver.remove)
> which can happen dynamically due to underlaying device removal.

Then use a normal mutex. Only use a rw lock if you can prove the
performance needs it as usually a rw lock is slower and more complex as
you then have to document stuff like:

> So here is the locking scheme:
>
> write_lock() : only on mlx5_ctl remove and mark the device is down
> via assigning NULL to mcdev->dev, to let all new readers abort and to wait
> for current readers to finish their task.
>
> read_lock(): used in all fops and ioctls, to make sure underlaying
> mlx5_core device is still active, and to prevent open files to access the
> device when miscdevice was already unregistered.
>
> I agree, this should've been documented in the code, I will add
> documentation.

Just make it simple and use a normal mutex please.

And fix up the reference counting, it shouldn't be this complex, it's
just a "simple" misc device driver :)

But before you do that, please see my other email about why not using
devlink for all of this instead.

thanks,

greg k-h

2023-12-04 21:38:40

by Aron Silverton

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

Hi,

On Wed, Nov 29, 2023 at 09:08:03AM +0000, Greg Kroah-Hartman wrote:
> On Tue, Nov 28, 2023 at 12:10:00PM -0800, Saeed Mahameed wrote:
> > On 28 Nov 10:33, Jakub Kicinski wrote:
> > > On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote:
> > > > > The question at LPC was about making devlink params completely
> > > > > transparent to the kernel. Basically added directly from FW.
> > > > > That what I was not happy about.
> > > >
> > > > It is creating a back-porting nightmare for all the enterprise
> > > > distributions.
> > >
> > > We don't care about enterprise distros, Jason, or stable kernel APIs.
> > >
> >
> > Oh, I missed this one, so you don't care about users?
>
> Ok, time out please. This isn't going anywhere fast except to make
> everyone else mad.
>
> To Jakub's point, no, we don't care about enterprise distros, they are a
> consumer of our releases and while some of them pay the salaries of our
> developers, they really don't have much influence over our development
> rules as they are just so far behind in releases that their releases
> look nothing like what we do in places (i.e. Linux "like" just like many
> Android systems.)
>
> If the enterprise distros pop in here and give us their opinions of the
> patchset, I would GREATLY appreciate it, as having more people review
> code at this point in time would be most helpful instead of having to
> hear about how the interfaces are broken 2 years from now.

We will be happy to test and review v4 of this series.

Fully interactive debugging has become essential to getting to the root
cause of complex issues that arise between the types of devices being
discussed and their interactions with various software layers. Turning
knobs and dumping registers just isn't sufficient, and I wish we'd had
this capability long ago.

Our customers have already benefited from the interactive debugging
capability that these patches provide, but the full potential won't be
realized until this is upstream.

>
> And I think that's what is driving your work now, the "enterprise"
> distros finally picked up the "lock down the kernel from random PCI
> device access in userspace" which caused you to have to drop your
> userspace implementation to create a real kernel driver, correct?
>
> And as for stable kernel apis, you all know that's just not a thing, and
> has nothing to do with users EXCEPT it benefits users because it keeps
> kernel code smaller and faster overall, that's well documented and users
> appreciate it.
>
> > Users often pay to distros for support, and distros always turn to vendors
> > for debug situations, in fact one of the high stakeholders for this is an
> > enterprise distro..
>
> Hey, I was right, an enterprise distro wants this driver, great, can you
> get them to review it as well? I'm tired of being the only one to
> review patches like this and could use the help if they are going to
> rely on this (why do they pass that work to me?). They should be the
> ones helping us catch basic things like the reference count issues I
> pointed out, as they can test the code, I can't :)

Guilty as charged? But, I'm sure we are not the only ones. I'm sorry that
we were not able to spot this and provide a review earlier, but we are
happy to do so for v4.

>
> But, let's step back a bit further.
>
> It seems the network device normal interface for this is using devlink.
> And drivers are allowed to have driver-specific devlink interfaces, as
> you know, your driver has lots of them today. Why not just add more?
> What's wrong with 600+ more interfaces being added as that way they
> would be well documented and fit in with the existing infrastructure
> that you have today.
>
> Is the issue that the firmware doesn't guarantee that these interfaces
> will be documented or stable or even known at this point in time? If
> so, how are your going to have a good userspace tool for this? What am
> I missing that requires a totally-new-and-custom user/kernel api from
> what all other network drivers are using today?
>
> thanks,
>
> greg k-h

Aron

2023-12-05 02:52:30

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Mon, 4 Dec 2023 15:37:56 -0600 Aron Silverton wrote:
> > To Jakub's point, no, we don't care about enterprise distros, they are a
> > consumer of our releases and while some of them pay the salaries of our
> > developers, they really don't have much influence over our development
> > rules as they are just so far behind in releases that their releases
> > look nothing like what we do in places (i.e. Linux "like" just like many
> > Android systems.)
> >
> > If the enterprise distros pop in here and give us their opinions of the
> > patchset, I would GREATLY appreciate it, as having more people review
> > code at this point in time would be most helpful instead of having to
> > hear about how the interfaces are broken 2 years from now.
>
> We will be happy to test and review v4 of this series.
>
> Fully interactive debugging has become essential to getting to the root
> cause of complex issues that arise between the types of devices being
> discussed and their interactions with various software layers. Turning
> knobs and dumping registers just isn't sufficient, and I wish we'd had
> this capability long ago.

Could you shed some light on what issues you were debugging, broadly?

> Our customers have already benefited from the interactive debugging
> capability that these patches provide, but the full potential won't be
> realized until this is upstream.

Can you elaborate on why "full potential won't be realized until this
is upstream"? The driver looks very easy to ship as a standalone module.

2023-12-05 17:11:43

by Aron Silverton

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

Hi Jakub,

On Mon, Dec 04, 2023 at 06:52:10PM -0800, Jakub Kicinski wrote:
> On Mon, 4 Dec 2023 15:37:56 -0600 Aron Silverton wrote:
> > > To Jakub's point, no, we don't care about enterprise distros, they are a
> > > consumer of our releases and while some of them pay the salaries of our
> > > developers, they really don't have much influence over our development
> > > rules as they are just so far behind in releases that their releases
> > > look nothing like what we do in places (i.e. Linux "like" just like many
> > > Android systems.)
> > >
> > > If the enterprise distros pop in here and give us their opinions of the
> > > patchset, I would GREATLY appreciate it, as having more people review
> > > code at this point in time would be most helpful instead of having to
> > > hear about how the interfaces are broken 2 years from now.
> >
> > We will be happy to test and review v4 of this series.
> >
> > Fully interactive debugging has become essential to getting to the root
> > cause of complex issues that arise between the types of devices being
> > discussed and their interactions with various software layers. Turning
> > knobs and dumping registers just isn't sufficient, and I wish we'd had
> > this capability long ago.
>
> Could you shed some light on what issues you were debugging, broadly?
>

I'll do my best, with two recent instances:

1. As mentioned already, we recently faced a complex problem with RDMA
in KVM and were getting nowhere trying to debug using the usual methods.
Mellanox support was able to use this debug interface to see what was
happening on the PCI bus and prove that the issue was caused by
corrupted PCIe transactions. This finally put the investigation on the
correct path. The debug interface was used consistently and extensively
to test theories about what was happening in the system and, ultimately,
allowed the problem to be solved.

2. We've faced RDMA issues related to lost EQ doorbells, requiring
complex debug, and ultimately root-caused as a defective CPU. Without
interactive access to the device allowing us to test theories like,
"what if we manually restart the EQ", we could not have proven this
definitively.

> > Our customers have already benefited from the interactive debugging
> > capability that these patches provide, but the full potential won't be
> > realized until this is upstream.
>
> Can you elaborate on why "full potential won't be realized until this
> is upstream"? The driver looks very easy to ship as a standalone module.

Firstly, We believe in working upstream and all of the advantages that
that brings to all the distros as well as to us and our customers.

Secondly, Our cloud business offers many types of machine instances,
some with bare metal/vfio mlx5 devices, that require customer driven
debug and we want our customers to have the freedom to choose which OS
they want to use.

Aron

2023-12-06 04:55:08

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote:
> 1. As mentioned already, we recently faced a complex problem with RDMA
> in KVM and were getting nowhere trying to debug using the usual methods.
> Mellanox support was able to use this debug interface to see what was
> happening on the PCI bus and prove that the issue was caused by
> corrupted PCIe transactions. This finally put the investigation on the
> correct path. The debug interface was used consistently and extensively
> to test theories about what was happening in the system and, ultimately,
> allowed the problem to be solved.

You hit on an important point, and what is also my experience working
at Meta. I may have even mentioned it in this thread already.
If there is a serious issue with a complex device, there are two ways
you can get support - dump all you can and send the dump to the vendor
or get on a live debugging session with their engineers. Users' ability
to debug those devices is practically non-existent. The idea that we
need access to FW internals is predicated on the assumption that we
have an ability to make sense of those internals.

Once you're on a support call with the vendor - just load a custom
kernel, module, whatever, it's already extremely expensive manual labor.

> 2. We've faced RDMA issues related to lost EQ doorbells, requiring
> complex debug, and ultimately root-caused as a defective CPU. Without
> interactive access to the device allowing us to test theories like,
> "what if we manually restart the EQ", we could not have proven this
> definitively.

I'm not familiar with the RDMA debugging capabilities. Perhaps there
are some gaps there. The more proprietary the implementation the harder
it is to debug. An answer to that would be "try to keep as much as
possible open".. and interfaces which let closed user space talk to
closed FW take us in the opposite direction.

FWIW good netdevice drivers have a selftest which tests IRQ generation
and EQ handling. I think that'd cover the case you're describing?
IDK if mlx5 has them, but if it doesn't definitely worth adding. And I
recommend running those on suspicious machines (ethtool -t, devlink has
some selftests, too)

> Firstly, We believe in working upstream and all of the advantages that
> that brings to all the distros as well as to us and our customers.
>
> Secondly, Our cloud business offers many types of machine instances,
> some with bare metal/vfio mlx5 devices, that require customer driven
> debug and we want our customers to have the freedom to choose which OS
> they want to use.

I understand that having everything packaged and shipped together makes
life easier.

If the point of the kernel at this stage of its evolution is to collect
incompatible bits of vendor software, make sure they build cleanly and
ship them to distros - someone should tell me, and I will relent.

2023-12-07 15:57:03

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On 12/5/23 9:48 PM, Jakub Kicinski wrote:
> On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote:
>> 1. As mentioned already, we recently faced a complex problem with RDMA
>> in KVM and were getting nowhere trying to debug using the usual methods.
>> Mellanox support was able to use this debug interface to see what was
>> happening on the PCI bus and prove that the issue was caused by
>> corrupted PCIe transactions. This finally put the investigation on the
>> correct path. The debug interface was used consistently and extensively
>> to test theories about what was happening in the system and, ultimately,
>> allowed the problem to be solved.
>
> You hit on an important point, and what is also my experience working
> at Meta. I may have even mentioned it in this thread already.
> If there is a serious issue with a complex device, there are two ways
> you can get support - dump all you can and send the dump to the vendor
> or get on a live debugging session with their engineers. Users' ability
> to debug those devices is practically non-existent. The idea that we
> need access to FW internals is predicated on the assumption that we
> have an ability to make sense of those internals.
>
> Once you're on a support call with the vendor - just load a custom
> kernel, module, whatever, it's already extremely expensive manual labor.

You rail against out of tree drivers and vendor proprietary tools, and
now you argue for just that. There is no reason debugging capabilities
can not be built into the OS and used when needed. That means anything
needed - from kernel modules to userspace tools.

The Meta data point is not representative of the world at large -
different scale, different needs, different expertise on staff (OS and
H/W). Getting S/W installed (especially anything requiring a compiler)
in a production server (and VMs) is not an easy request and in many
cases not even possible.

When a customer hits problem, the standard steps are to run a script,
generate a tar file and ship it to the OS vendor. Engineers at the OS
vendor go through it and may need other data - like getting detailed
dumps from individual pieces of H/W. Every time those requests require
going to a vendor web site to pull down vendor tools, get permission to
install them, schedule the run of said tool ... it only serves to drag
out the debugging process. ie., this open-ended stance only serves to
hurt Linux users.

2023-12-07 16:21:25

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

David, I agree with your points. I think you're misreading
what I said.

On Thu, 7 Dec 2023 08:54:51 -0700 David Ahern wrote:
> You rail against out of tree drivers and vendor proprietary tools, and
> now you argue for just that.

I don't rail against out of tree drivers, very much the opposite.
Linux supports out of tree modules, and I agree that's 100% the
correct thing to do. I'd encourage more people to take advantage of
that. The problem is quite the opposite, all the "security hardening"
is making it almost impossible for users to take advantage of OOT
modules.

> There is no reason debugging capabilities
> can not be built into the OS and used when needed. That means anything
> needed - from kernel modules to userspace tools.
>
> The Meta data point is not representative of the world at large -
> different scale, different needs, different expertise on staff (OS and
> H/W). Getting S/W installed (especially anything requiring a compiler)
> in a production server (and VMs) is not an easy request and in many
> cases not even possible.

I did not say it's easy.

> When a customer hits problem, the standard steps are to run a script,
> generate a tar file and ship it to the OS vendor. Engineers at the OS
> vendor go through it and may need other data - like getting detailed
> dumps from individual pieces of H/W.

You say that like this is not _exactly_ what I just said!?

> Every time those requests require
> going to a vendor web site to pull down vendor tools, get permission to
> install them, schedule the run of said tool ... it only serves to drag
> out the debugging process. ie., this open-ended stance only serves to
> hurt Linux users.

Right, exactly. What are you arguing with then? As I said - we have a
very open / accommodating policy for extracting all sort of debug and
state dumps. You can put whatever read only stuff you want in debugfs**
Read-write interfaces must be constrained to a clear set of commands /
settings but also very much allowed. As you said users need to be able
to extract debug info to share with the vendors, no tools necessary.

** (this will surprise you but you can also put stats there, if they
are custom, I don't care if they go into ethtool -S or debugfs)

2023-12-07 16:41:56

by Aron Silverton

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Tue, Dec 05, 2023 at 08:48:55PM -0800, Jakub Kicinski wrote:
> On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote:
> > 1. As mentioned already, we recently faced a complex problem with RDMA
> > in KVM and were getting nowhere trying to debug using the usual methods.
> > Mellanox support was able to use this debug interface to see what was
> > happening on the PCI bus and prove that the issue was caused by
> > corrupted PCIe transactions. This finally put the investigation on the
> > correct path. The debug interface was used consistently and extensively
> > to test theories about what was happening in the system and, ultimately,
> > allowed the problem to be solved.
>
> You hit on an important point, and what is also my experience working
> at Meta. I may have even mentioned it in this thread already.
> If there is a serious issue with a complex device, there are two ways
> you can get support - dump all you can and send the dump to the vendor
> or get on a live debugging session with their engineers. Users' ability
> to debug those devices is practically non-existent. The idea that we
> need access to FW internals is predicated on the assumption that we
> have an ability to make sense of those internals.
>
> Once you're on a support call with the vendor - just load a custom
> kernel, module, whatever, it's already extremely expensive manual labor.
>
> > 2. We've faced RDMA issues related to lost EQ doorbells, requiring
> > complex debug, and ultimately root-caused as a defective CPU. Without
> > interactive access to the device allowing us to test theories like,
> > "what if we manually restart the EQ", we could not have proven this
> > definitively.
>
> I'm not familiar with the RDMA debugging capabilities. Perhaps there
> are some gaps there. The more proprietary the implementation the harder
> it is to debug. An answer to that would be "try to keep as much as
> possible open".. and interfaces which let closed user space talk to
> closed FW take us in the opposite direction.
>
> FWIW good netdevice drivers have a selftest which tests IRQ generation
> and EQ handling. I think that'd cover the case you're describing?
> IDK if mlx5 has them, but if it doesn't definitely worth adding. And I
> recommend running those on suspicious machines (ethtool -t, devlink has
> some selftests, too)

Essentially, a warning light, and that doesn't solve the underlying
problem. We still need experts (e.g., vendors) to investigate with their
toolsets when and where the problem occurs.

I offered this as an example of one issue we solved. I cannot predict
what kind of issues will pop up in the future, and writing a self-test
for every possible situation is impossible by definition.

>
> > Firstly, We believe in working upstream and all of the advantages that
> > that brings to all the distros as well as to us and our customers.
> >
> > Secondly, Our cloud business offers many types of machine instances,
> > some with bare metal/vfio mlx5 devices, that require customer driven
> > debug and we want our customers to have the freedom to choose which OS
> > they want to use.
>
> I understand that having everything packaged and shipped together makes
> life easier.

I think it is a requirement. We operate with Secure Boot. The kernel is
locked down. We don't have debugfs access, even if it were sufficient,
and we cannot compile and load modules. Even without Secure Boot, there
may not be a build environment available.

We really need the module ready-to-go when the debug calls for it - no
building, no reboots, no months long attempts to reproduce in some lab -
just immediate availability of the debug interface on the affected
machine.

>
> If the point of the kernel at this stage of its evolution is to collect
> incompatible bits of vendor software, make sure they build cleanly and
> ship them to distros - someone should tell me, and I will relent.

I'm not sure I follow you... The mlx5ctl driver seems very compatible
with the mlx5 device driver. I may be misunderstanding.

2023-12-07 17:24:43

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote:
> > I understand that having everything packaged and shipped together makes
> > life easier.
>
> I think it is a requirement. We operate with Secure Boot. The kernel is
> locked down. We don't have debugfs access, even if it were sufficient,
> and we cannot compile and load modules. Even without Secure Boot, there
> may not be a build environment available.

This 'no debugfs' requirement is a kernel lockdown thing, I presume?
Are we expected to throw debugfs out the window and for all vendors
to reimplement their debug functionality via a misc driver taking
arbitrary ioctls? Not only does that sound like a complete waste of
time and going backward in terms of quality of the interfaces, needing
custom vendor tools etc. etc., but also you go from (hopefully somewhat)
upstream reviewed debugfs interface to an interface where the only
security assurance is vendor telling you "trust me, it's all good".

2023-12-07 18:07:09

by Aron Silverton

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Thu, Dec 07, 2023 at 09:23:29AM -0800, Jakub Kicinski wrote:
> On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote:
> > > I understand that having everything packaged and shipped together makes
> > > life easier.
> >
> > I think it is a requirement. We operate with Secure Boot. The kernel is
> > locked down. We don't have debugfs access, even if it were sufficient,
> > and we cannot compile and load modules. Even without Secure Boot, there
> > may not be a build environment available.
>
> This 'no debugfs' requirement is a kernel lockdown thing, I presume?
> Are we expected to throw debugfs out the window and for all vendors
> to reimplement their debug functionality via a misc driver taking
> arbitrary ioctls? Not only does that sound like a complete waste of
> time and going backward in terms of quality of the interfaces, needing
> custom vendor tools etc. etc., but also you go from (hopefully somewhat)
> upstream reviewed debugfs interface to an interface where the only
> security assurance is vendor telling you "trust me, it's all good".

IIRC, with lockdown, we can read from debugfs IFF the entries'
permissions are 0400. We cannot write. It's not suitable for
implementing an interactive debug interface.

2023-12-07 18:54:17

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On 07 Dec 10:41, Aron Silverton wrote:
>On Tue, Dec 05, 2023 at 08:48:55PM -0800, Jakub Kicinski wrote:
>> On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote:
>> > 1. As mentioned already, we recently faced a complex problem with RDMA
>> > in KVM and were getting nowhere trying to debug using the usual methods.
>> > Mellanox support was able to use this debug interface to see what was
>> > happening on the PCI bus and prove that the issue was caused by
>> > corrupted PCIe transactions. This finally put the investigation on the
>> > correct path. The debug interface was used consistently and extensively
>> > to test theories about what was happening in the system and, ultimately,
>> > allowed the problem to be solved.
>>
>> You hit on an important point, and what is also my experience working
>> at Meta. I may have even mentioned it in this thread already.
>> If there is a serious issue with a complex device, there are two ways
>> you can get support - dump all you can and send the dump to the vendor
>> or get on a live debugging session with their engineers. Users' ability
>> to debug those devices is practically non-existent. The idea that we
>> need access to FW internals is predicated on the assumption that we
>> have an ability to make sense of those internals.
>>
>> Once you're on a support call with the vendor - just load a custom
>> kernel, module, whatever, it's already extremely expensive manual labor.
>>
>> > 2. We've faced RDMA issues related to lost EQ doorbells, requiring
>> > complex debug, and ultimately root-caused as a defective CPU. Without
>> > interactive access to the device allowing us to test theories like,
>> > "what if we manually restart the EQ", we could not have proven this
>> > definitively.
>>
>> I'm not familiar with the RDMA debugging capabilities. Perhaps there
>> are some gaps there. The more proprietary the implementation the harder
>> it is to debug. An answer to that would be "try to keep as much as
>> possible open".. and interfaces which let closed user space talk to
>> closed FW take us in the opposite direction.
>>
>> FWIW good netdevice drivers have a selftest which tests IRQ generation
>> and EQ handling. I think that'd cover the case you're describing?
>> IDK if mlx5 has them, but if it doesn't definitely worth adding. And I
>> recommend running those on suspicious machines (ethtool -t, devlink has
>> some selftests, too)
>
>Essentially, a warning light, and that doesn't solve the underlying
>problem. We still need experts (e.g., vendors) to investigate with their
>toolsets when and where the problem occurs.
>
>I offered this as an example of one issue we solved. I cannot predict
>what kind of issues will pop up in the future, and writing a self-test
>for every possible situation is impossible by definition.
>
>>
>> > Firstly, We believe in working upstream and all of the advantages that
>> > that brings to all the distros as well as to us and our customers.
>> >
>> > Secondly, Our cloud business offers many types of machine instances,
>> > some with bare metal/vfio mlx5 devices, that require customer driven
>> > debug and we want our customers to have the freedom to choose which OS
>> > they want to use.
>>
>> I understand that having everything packaged and shipped together makes
>> life easier.
>
>I think it is a requirement. We operate with Secure Boot. The kernel is
>locked down. We don't have debugfs access, even if it were sufficient,
>and we cannot compile and load modules. Even without Secure Boot, there
>may not be a build environment available.
>
>We really need the module ready-to-go when the debug calls for it - no
>building, no reboots, no months long attempts to reproduce in some lab -
>just immediate availability of the debug interface on the affected
>machine.
>
>>
>> If the point of the kernel at this stage of its evolution is to collect
>> incompatible bits of vendor software, make sure they build cleanly and
>> ship them to distros - someone should tell me, and I will relent.
>
>I'm not sure I follow you... The mlx5ctl driver seems very compatible
>with the mlx5 device driver. I may be misunderstanding.
>

mlx5ctl is 100% compatible with mlx5 ConnectX open spec [1], and supports
any mlx5 driven stacks, not only netdev, it is able to expose millions of
objects and device states interactively, debugfs would explode if we even
try to accommodate some of these objects or states via debugfs, not to
mention it is also impossible to maintain a stable debugfs output for such
a huge data set, when this mlx5ctl interface speaks out a clear and open
ConnectX language, which is the hole point of the driver.

ConnectX is a highly programmable device for the enduser, and we have a
very open / accommodating policy, an advanced user who can read the open
spec [1], will also have the ability to do self-debug of their own
RDMA/DPU/FPGA apps or similar usecases.

Also I would like to repeat, this is not touching netdev, netdev's policies
do not apply to the greater kernel or RDMA, and we have use cases with
pure-infiniband/DPU/FPGA cards that have no netdev at all, or other cases
with pur virtio instances, and much more.

[1] https://network.nvidia.com/files/doc-2020/ethernet-adapters-programming-manual.pdf

2023-12-07 19:02:50

by Saeed Mahameed

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On 07 Dec 12:06, Aron Silverton wrote:
>On Thu, Dec 07, 2023 at 09:23:29AM -0800, Jakub Kicinski wrote:
>> On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote:
>> > > I understand that having everything packaged and shipped together makes
>> > > life easier.
>> >
>> > I think it is a requirement. We operate with Secure Boot. The kernel is
>> > locked down. We don't have debugfs access, even if it were sufficient,
>> > and we cannot compile and load modules. Even without Secure Boot, there
>> > may not be a build environment available.
>>
>> This 'no debugfs' requirement is a kernel lockdown thing, I presume?
>> Are we expected to throw debugfs out the window and for all vendors
>> to reimplement their debug functionality via a misc driver taking
>> arbitrary ioctls? Not only does that sound like a complete waste of
>> time and going backward in terms of quality of the interfaces, needing
>> custom vendor tools etc. etc., but also you go from (hopefully somewhat)
>> upstream reviewed debugfs interface to an interface where the only
>> security assurance is vendor telling you "trust me, it's all good".
>
>IIRC, with lockdown, we can read from debugfs IFF the entries'
>permissions are 0400. We cannot write. It's not suitable for
>implementing an interactive debug interface.

I would like to add that debugfs is usually used to expose the driver
software states, as it evolves and changes with the driver code, but as I
explained in the other email, it's clearly not a good solution to expose
arbitrary objects of complex devices, that require interactive and
selective debug interfaces tailored to the user use-case.

2023-12-08 05:27:33

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Thu, Dec 07, 2023 at 12:06:28PM -0600, Aron Silverton wrote:
> On Thu, Dec 07, 2023 at 09:23:29AM -0800, Jakub Kicinski wrote:
> > On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote:
> > > > I understand that having everything packaged and shipped together makes
> > > > life easier.
> > >
> > > I think it is a requirement. We operate with Secure Boot. The kernel is
> > > locked down. We don't have debugfs access, even if it were sufficient,
> > > and we cannot compile and load modules. Even without Secure Boot, there
> > > may not be a build environment available.
> >
> > This 'no debugfs' requirement is a kernel lockdown thing, I presume?
> > Are we expected to throw debugfs out the window and for all vendors
> > to reimplement their debug functionality via a misc driver taking
> > arbitrary ioctls? Not only does that sound like a complete waste of
> > time and going backward in terms of quality of the interfaces, needing
> > custom vendor tools etc. etc., but also you go from (hopefully somewhat)
> > upstream reviewed debugfs interface to an interface where the only
> > security assurance is vendor telling you "trust me, it's all good".
>
> IIRC, with lockdown, we can read from debugfs IFF the entries'
> permissions are 0400. We cannot write. It's not suitable for
> implementing an interactive debug interface.

This is a policy decision that a distro decides to do, and I have seen
it happen many times. The problem with this is then, as you have found
out, vendors try to work around the removal of access to debugfs by
creating new interfaces like misc drivers and sysfs files to emulate
what they were previously exporting with debugfs.

When they do that, they break the reason why the distro/vendor decided
to prevent access to debugfs in the first place, making the whole system
insecure again!

I see this happen all the time in Android devices as Android restricted
access to debugfs many years ago to try to solve the problem of drivers
doing insecure things there. Those drivers then just moved those
insecure operations to a different interface, making the system insecure
again.

To stop this cat-and-mouse game, work with the vendors that are
implementing these requirements to shut down access to these interfaces.
That is a policy decision made by them, and does NOT mean that the
kernel community needs to then start taking code to circumvent those
decisions as that makes the whole thing pointless.

So in short, use debugfs, that is what it was designed for. If a vendor
does not wish to enable access to debugfs, then that is their decision
(probably a good one), don't circumvent it by making a new interface, as
odds are, the vendor will eventually realize it and work to cut off that
new interface as well, as they rightfully should.

thanks,

greg k-h

2023-12-08 05:29:44

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Thu, Dec 07, 2023 at 11:02:36AM -0800, Saeed Mahameed wrote:
> I would like to add that debugfs is usually used to expose the driver
> software states, as it evolves and changes with the driver code, but as I
> explained in the other email, it's clearly not a good solution to expose
> arbitrary objects of complex devices, that require interactive and
> selective debug interfaces tailored to the user use-case.

Why not? Remember, the only rule in debugfs is "there are no rules!"

Well, there is one practical one, "do not rely on debugfs for any
functioning system properties", i.e. "if access to debugfs is not
present, or something in debugfs breaks, the kernel should continue to
work just fine with no change in operations". But that's an overall
system-level rule, not a rule for what you can put into debugfs.

Have fun!

greg k-h

2023-12-08 12:52:44

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Fri, Dec 08, 2023 at 06:27:14AM +0100, Greg Kroah-Hartman wrote:

> When they do that, they break the reason why the distro/vendor decided
> to prevent access to debugfs in the first place, making the whole system
> insecure again!

Maybe others have, be we did not!

In fact Saeed addresses this explicitly in the cover letter:

Historically a userspace program was used that accessed the PCI register
and config space directly through /sys/bus/pci/.../XXX and could operate
these debugging interfaces in parallel with the running driver.
This approach is incompatible with secure boot and kernel lockdown so this
driver provides a secure and restricted interface to that.

We did a lot of work here to build an interface that is compatible
with the security principles of lockdown. This work is embodied by the
new FW bit MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES which causes the FW to
run these RPCs in a security context that is compatible with lockdown.

The overall philosophical architecture is similar to the NVMe vendor
command channel which is also a way to deliver opaque RPCs to a device
FW and is considered lockdown compatible.

This series is not some insecure attempt to bypass lockdown like you
may have seen in Android. mlx5 is the driver for the most popular high
speed NIC in the world. Our customers take security seriously, and we
take security seriously.

Jason

2023-12-08 13:34:56

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Fri, Dec 08, 2023 at 06:29:29AM +0100, Greg Kroah-Hartman wrote:
> On Thu, Dec 07, 2023 at 11:02:36AM -0800, Saeed Mahameed wrote:
> > I would like to add that debugfs is usually used to expose the driver
> > software states, as it evolves and changes with the driver code, but as I
> > explained in the other email, it's clearly not a good solution to expose
> > arbitrary objects of complex devices, that require interactive and
> > selective debug interfaces tailored to the user use-case.
>
> Why not? Remember, the only rule in debugfs is "there are no rules!"

We already have debugfs files to issue RPCs. They are not secure and
not lockdown compatible. Few users have been interested in this, Aron
does a good job explaining the general perspective I've seen in many
places.

Users want an in-tree solution that is compatible with lockdown. A
solution that works for all the mlx5 deployment modes (including
Infiniband native without netdev) and covers most of the functionality
they previously enjoyed with the /sys/../resource based tooling.

This series delivers that.

Nobody has offered an alterative vision that achieves the same
thing. There have been lots of suggestions how to do small little
parts, but not everything together as this does.

> Well, there is one practical one, "do not rely on debugfs for any
> functioning system properties"

Jakub expressed additional "netdev only" rules for debugfs.

Read-write interfaces must be constrained to a clear set of commands /
settings

Which I think is what Saeed is reacting to.

Jason

2023-12-13 16:56:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

On Thu, Dec 07, 2023 at 10:54:02AM -0800, Saeed Mahameed wrote:
> Also I would like to repeat, this is not touching netdev, netdev's policies
> do not apply to the greater kernel or RDMA, and we have use cases with
> pure-infiniband/DPU/FPGA cards that have no netdev at all, or other cases
> with pur virtio instances, and much more.

Yes. I mean just about every complex block driver has some kind of
vendor spcific tooling for debugging, statistics, etc. Trying to deny
it just because one function expose by a device is a network device is
even more silly than disallowing it for pure net devices (which already
tend to be complex beasts).