From: Saeed Mahameed <[email protected]>
Recap from V3 discussion:
=========================
LWN has published an article on this series aptly summarizing the debate.
LINK: https://lwn.net/Articles/955001/
We continue to think that mlx5ctl is reasonable and aligned with the
greater kernel community values. People have pointed to the HW RAID
miscdevices as a good analog. The MD developers did not get to block HW
RAID configuration on the basis that it undermines their work on the
software RAID stack. Further, while there is a superficial similarity to
the DRM/accel debate, that was grounded in a real concern that DRM values
on open source would be bypassed. That argument does not hold up here as
this does come with open source userspace and the functionality mlx5ctl
enables on lockdown has always been available to ConnectX users through
the non-lockdown PCI sysfs. netdev has been doing just fine despite the
long standing presence of this tooling and we have continued to work with
Jakub on building common APIs when appropriate. mlx5 already implements
a wide range of the netdev common interfaces, many of which were pushed
forward by our staff - the DPLL configuration netlink being a recent
example.
Version history:
================
V3: https://lore.kernel.org/all/[email protected]/#r
V2: https://lore.kernel.org/all/[email protected]/#r
V1: https://lore.kernel.org/all/[email protected]/#r
V3->V4:
- Document locking scheme for device lifecycle
- Document reserved bits will always be checked for 0 by driver
- Use GFP_KERNEL instead of ACCOUNT for short lived buffers
- Create sysfs link to parent device under the misc device's sysfs
- Remove unnecessary device name from info ioctl output
- Remove reserved and future flags fields from ioctls
- Precise size checking for ioctl user input.
V2->V3:
- Fix bad Sign-off line.
- Fix kernel robot warnings, define a user ptr arg for umem_unreg ioctl
instead of plain integer to simplify compat_ioctl usage.
V1->V2:
- Provide legal statement and sign-off for dual license use
- Fix License clause to use: BSD-3-Clause OR GPL-2.0
- Fix kernel robot warnings
- Use dev_dbg directly instead of umem_dbg() local wrapper
- Implement .compat_ioctl for 32bit compatibility
- Fix mlx5ctl_info ABI structure size and alignment
- Local pointer to correct type instead of in-place cast
- Check unused fields and flags are 0 on ioctl path
- Use correct macro to declare scalar arg ioctl command
#define MLX5CTL_IOCTL_UMEM_UNREG \
_IO(MLX5CTL_IOCTL_MAGIC, 0x3)
mlx5 ConnectX control misc driver:
==================================
The ConnectX HW family supported by the mlx5 drivers uses an architecture
where a FW component executes "mailbox RPCs" issued by the driver to make
changes to the device. This results in a complex debugging environment
where the FW component has information and low level configuration that
needs to be accessed to userspace for debugging purposes.
Historically a userspace program was used that accessed the PCI register
and config space directly through /sys/bus/pci/.../XXX and could operate
these debugging interfaces in parallel with the running driver.
This approach is incompatible with secure boot and kernel lockdown so this
driver provides a secure and restricted interface to that.
Patch breakdown:
================
1) The first patch in the series introduces the main driver file with the
implementation of a new mlx5 auxiliary device driver to run on top
mlx5_core device instances, on probe it creates a new misc device and in
this patch we implement the open and release fops, On open the driver
would allocate a special FW UID (user context ID) restricted to debug
RPCs only, where all user debug rpcs will be executed under this UID,
and on release the UID will be freed.
2) The second patch adds an info ioctl that will show the allocated UID
and the available capability masks of the device and the current UID, and
some other useful device information such as the underlying ConnectX
Example:
$ sudo ./mlx5ctlu mlx5_core.ctl.0
mlx5dev: 0000:00:04.0
UCTX UID: 1
UCTX CAP: 0x3
DEV UCTX CAP: 0x3
USER CAP: 0x1d
3) Third patch will add the capability to execute debug RPCs under the
special UID.
In the mlx5 architecture the FW RPC commands are of the format of
inbox and outbox buffers. The inbox buffer contains the command
rpc layout as described in the ConnectX Programmers Reference Manual
(PRM) document and as defined in linux/include/mlx5/mlx5_ifc.h.
On success the user outbox buffer will be filled with the device's rpc
response.
For example to query device capabilities:
a user fills out an inbox buffer with the inbox layout:
struct mlx5_ifc_query_hca_cap_in_bits
and expects an outbox buffer with the layout:
struct mlx5_ifc_cmd_hca_cap_bits
4) The fourth patch adds the ability to register user memory into the
ConntectX device and create a umem object that points to that memory.
Command rpc outbox buffer is limited in size, which can be very
annoying when trying to pull large traces out of the device.
Many rpcs offer the ability to scatter output traces, contexts
and logs directly into user space buffers in a single shot.
The registered memory will be described by a device UMEM object which
has a unique umem_id, this umem_id can be later used in the rpc inbox
to tell the device where to populate the response output,
e.g HW traces and other debug object queries.
Example usecase, a ConnectX device coredump can be as large as 2MB.
Using inline rpcs will take thousands of rpcs to get the full
coredump which can consume multiple seconds.
With UMEM, it can be done in a single rpc, using 2MB of umem user buffer.
Other usecases with umem:
- dynamic HW and FW trace monitoring
- high frequency diagnostic counters sampling
- batched objects and resource dumps
See links below for information about user space tools that use this
interface:
[1] https://github.com/saeedtx/mlx5ctl
[2] https://github.com/Mellanox/mstflint
see:
d) mstregdump utility
This utility dumps hardware registers from Mellanox hardware
for later analysis by Mellanox.
g) mstconfig
This tool sets or queries non-volatile configurable options
for Mellanox HCAs.
h) mstfwmanager
Mellanox firmware update and query utility which scans the system
for available Mellanox devices (only mst PCI devices) and performs
the necessary firmware updates.
i) mstreg
The mlxreg utility allows users to obtain information regarding
supported access registers, such as their fields
License: BSD-3-Clause OR GPL-2.0
================================
After a review of this thread [3], and a conversation with the LF,
Mellanox and NVIDIA legal continue to approve the use of a Dual GPL &
Permissive License for mlx5 related driver contributions. This makes it
clear to future contributors that this file may be adapted and reused
under BSD-3-Clause terms on other operating systems. Contributions will
be handled in the normal way and the dual license will apply
automatically. If people wish to contribute significantly and opt out of
a dual license they may separate their GPL only contributions in dedicated
files.
Jason has a signing authority for NVIDIA and has gone through our internal
process to get approval.
[3] https://lore.kernel.org/all/[email protected]/#r
Signed-off-by: Jason Gunthorpe <[email protected]> # for legal
Signed-off-by: Saeed Mahameed <[email protected]>
Nacked-by: Jakub Kicinski <[email protected]>
Saeed Mahameed (5):
mlx5: Add aux dev for ctl interface
misc: mlx5ctl: Add mlx5ctl misc driver
misc: mlx5ctl: Add info ioctl
misc: mlx5ctl: Add command rpc ioctl
misc: mlx5ctl: Add umem reg/unreg ioctl
.../userspace-api/ioctl/ioctl-number.rst | 1 +
MAINTAINERS | 8 +
drivers/misc/Kconfig | 1 +
drivers/misc/Makefile | 1 +
drivers/misc/mlx5ctl/Kconfig | 14 +
drivers/misc/mlx5ctl/Makefile | 5 +
drivers/misc/mlx5ctl/main.c | 597 ++++++++++++++++++
drivers/misc/mlx5ctl/umem.c | 322 ++++++++++
drivers/misc/mlx5ctl/umem.h | 17 +
drivers/net/ethernet/mellanox/mlx5/core/dev.c | 8 +
include/uapi/misc/mlx5ctl.h | 50 ++
11 files changed, 1024 insertions(+)
create mode 100644 drivers/misc/mlx5ctl/Kconfig
create mode 100644 drivers/misc/mlx5ctl/Makefile
create mode 100644 drivers/misc/mlx5ctl/main.c
create mode 100644 drivers/misc/mlx5ctl/umem.c
create mode 100644 drivers/misc/mlx5ctl/umem.h
create mode 100644 include/uapi/misc/mlx5ctl.h
--
2.43.0
From: Saeed Mahameed <[email protected]>
Allow ctl protocol interface auxiliary driver in mlx5.
Signed-off-by: Saeed Mahameed <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/dev.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/dev.c
index cf0477f53dc4..753c11f7050d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/dev.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/dev.c
@@ -228,8 +228,14 @@ enum {
MLX5_INTERFACE_PROTOCOL_VNET,
MLX5_INTERFACE_PROTOCOL_DPLL,
+ MLX5_INTERFACE_PROTOCOL_CTL,
};
+static bool is_ctl_supported(struct mlx5_core_dev *dev)
+{
+ return MLX5_CAP_GEN(dev, uctx_cap);
+}
+
static const struct mlx5_adev_device {
const char *suffix;
bool (*is_supported)(struct mlx5_core_dev *dev);
@@ -252,6 +258,8 @@ static const struct mlx5_adev_device {
.is_supported = &is_mp_supported },
[MLX5_INTERFACE_PROTOCOL_DPLL] = { .suffix = "dpll",
.is_supported = &is_dpll_supported },
+ [MLX5_INTERFACE_PROTOCOL_CTL] = { .suffix = "ctl",
+ .is_supported = &is_ctl_supported },
};
int mlx5_adev_idx_alloc(void)
--
2.43.0
From: Saeed Mahameed <[email protected]>
The ConnectX HW family supported by the mlx5 drivers uses an architecture
where a FW component executes "mailbox RPCs" issued by the driver to make
changes to the device. This results in a complex debugging environment
where the FW component has information and low level configuration that
needs to be accessed to userspace for debugging purposes.
Historically a userspace program was used that accessed the PCI register
and config space directly through /sys/bus/pci/.../XXX and could operate
these debugging interfaces in parallel with the running driver.
This approach is incompatible with secure boot and kernel lockdown so this
driver provides a secure and restricted interface to that same data.
On open the driver would allocate a special FW UID (user context ID)
restrected to debug RPCs only, later in this series all user RPCs will
be stamped with this UID.
Reviewed-by: Jiri Pirko <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
---
MAINTAINERS | 8 +
drivers/misc/Kconfig | 1 +
drivers/misc/Makefile | 1 +
drivers/misc/mlx5ctl/Kconfig | 14 ++
drivers/misc/mlx5ctl/Makefile | 4 +
drivers/misc/mlx5ctl/main.c | 332 ++++++++++++++++++++++++++++++++++
6 files changed, 360 insertions(+)
create mode 100644 drivers/misc/mlx5ctl/Kconfig
create mode 100644 drivers/misc/mlx5ctl/Makefile
create mode 100644 drivers/misc/mlx5ctl/main.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 8d1052fa6a69..a41dc2056ae1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14011,6 +14011,14 @@ L: [email protected]
S: Supported
F: drivers/vdpa/mlx5/
+MELLANOX MLX5 ConnectX Diag DRIVER
+M: Saeed Mahameed <[email protected]>
+R: Itay Avraham <[email protected]>
+L: [email protected]
+S: Supported
+F: drivers/misc/mlx5ctl/
+F: include/uapi/misc/mlx5ctl.h
+
MELLANOX MLXCPLD I2C AND MUX DRIVER
M: Vadim Pasternak <[email protected]>
M: Michael Shych <[email protected]>
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 4fb291f0bf7c..4ab825acfd54 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -591,4 +591,5 @@ source "drivers/misc/cardreader/Kconfig"
source "drivers/misc/uacce/Kconfig"
source "drivers/misc/pvpanic/Kconfig"
source "drivers/misc/mchp_pci1xxxx/Kconfig"
+source "drivers/misc/mlx5ctl/Kconfig"
endmenu
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index ea6ea5bbbc9c..c491c2b8ac88 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -68,3 +68,4 @@ obj-$(CONFIG_TMR_INJECT) += xilinx_tmr_inject.o
obj-$(CONFIG_TPS6594_ESM) += tps6594-esm.o
obj-$(CONFIG_TPS6594_PFSM) += tps6594-pfsm.o
obj-$(CONFIG_NSM) += nsm.o
+obj-$(CONFIG_MLX5CTL) += mlx5ctl/
diff --git a/drivers/misc/mlx5ctl/Kconfig b/drivers/misc/mlx5ctl/Kconfig
new file mode 100644
index 000000000000..faaa1dba2cc2
--- /dev/null
+++ b/drivers/misc/mlx5ctl/Kconfig
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+
+config MLX5CTL
+ tristate "mlx5 ConnectX control misc driver"
+ depends on MLX5_CORE
+ help
+ MLX5CTL provides interface for the user process to access the debug and
+ configuration registers of the ConnectX hardware family
+ (NICs, PCI switches and SmartNIC SoCs).
+ This will allow configuration and debug tools to work out of the box on
+ mainstream kernel.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/misc/mlx5ctl/Makefile b/drivers/misc/mlx5ctl/Makefile
new file mode 100644
index 000000000000..b5c7f99e0ab6
--- /dev/null
+++ b/drivers/misc/mlx5ctl/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_MLX5CTL) += mlx5ctl.o
+mlx5ctl-y := main.o
diff --git a/drivers/misc/mlx5ctl/main.c b/drivers/misc/mlx5ctl/main.c
new file mode 100644
index 000000000000..314aeb45bb8e
--- /dev/null
+++ b/drivers/misc/mlx5ctl/main.c
@@ -0,0 +1,332 @@
+// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/auxiliary_bus.h>
+#include <linux/mlx5/device.h>
+#include <linux/mlx5/driver.h>
+#include <linux/atomic.h>
+#include <linux/refcount.h>
+
+MODULE_DESCRIPTION("mlx5 ConnectX control misc driver");
+MODULE_AUTHOR("Saeed Mahameed <[email protected]>");
+MODULE_LICENSE("Dual BSD/GPL");
+
+/* mlx5ctl_dev lifecycle locking scheme:
+ *
+ * 1. mcdev->rw_lock: protects mdev from removal on mlx5ctl_remove()
+ *
+ * - Write lock is taken by auxiliary_driver.remove
+ * - Set mdev to NULL
+ * - Read lock is taken by open/release and ioctls
+ * - To prevent mdev from being removed
+ * - Check mdev is not NULL, abort otherwise
+ *
+ * 2. mcdev->refcount: protects mcdev from removal after miscdevice is unregistered
+ * - miscdevice does not have a refcount, so we use kref
+ * - miscdevice is unregistered on mlx5ctl_remove()
+ * - already open fds will still have a reference to mcdev
+ * - mcdev is freed when refcount reaches 0 on last fd release
+ */
+struct mlx5ctl_dev {
+ struct mlx5_core_dev *mdev;
+ struct miscdevice miscdev;
+ struct auxiliary_device *adev;
+ struct list_head fd_list;
+ spinlock_t fd_list_lock; /* protect list add/del */
+ struct rw_semaphore rw_lock; /* protect mdev from device removal */
+ struct kref refcount;
+};
+
+struct mlx5ctl_fd {
+ u16 uctx_uid;
+ u32 uctx_cap;
+ u32 ucap; /* user cap */
+ struct mlx5ctl_dev *mcdev;
+ struct list_head list;
+};
+
+#define mlx5ctl_err(mcdev, format, ...) \
+ dev_err(mcdev->miscdev.parent, format, ##__VA_ARGS__)
+
+#define mlx5ctl_dbg(mcdev, format, ...) \
+ dev_dbg(mcdev->miscdev.parent, "PID %d: " format, \
+ current->pid, ##__VA_ARGS__)
+
+enum {
+ MLX5_UCTX_OBJECT_CAP_RAW_TX = 0x1,
+ MLX5_UCTX_OBJECT_CAP_INTERNAL_DEVICE_RESOURCES = 0x2,
+ MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES = 0x4,
+};
+
+static int mlx5ctl_alloc_uid(struct mlx5ctl_dev *mcdev, u32 cap)
+{
+ u32 out[MLX5_ST_SZ_DW(create_uctx_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(create_uctx_in)] = {};
+ void *uctx;
+ int err;
+ u16 uid;
+
+ uctx = MLX5_ADDR_OF(create_uctx_in, in, uctx);
+
+ mlx5ctl_dbg(mcdev, "MLX5_CMD_OP_CREATE_UCTX: caps 0x%x\n", cap);
+ MLX5_SET(create_uctx_in, in, opcode, MLX5_CMD_OP_CREATE_UCTX);
+ MLX5_SET(uctx, uctx, cap, cap);
+
+ err = mlx5_cmd_exec(mcdev->mdev, in, sizeof(in), out, sizeof(out));
+ if (err)
+ return err;
+
+ uid = MLX5_GET(create_uctx_out, out, uid);
+ mlx5ctl_dbg(mcdev, "allocated uid %d with caps 0x%x\n", uid, cap);
+ return uid;
+}
+
+static void mlx5ctl_release_uid(struct mlx5ctl_dev *mcdev, u16 uid)
+{
+ u32 in[MLX5_ST_SZ_DW(destroy_uctx_in)] = {};
+ struct mlx5_core_dev *mdev = mcdev->mdev;
+ int err;
+
+ MLX5_SET(destroy_uctx_in, in, opcode, MLX5_CMD_OP_DESTROY_UCTX);
+ MLX5_SET(destroy_uctx_in, in, uid, uid);
+
+ err = mlx5_cmd_exec_in(mdev, destroy_uctx, in);
+ mlx5ctl_dbg(mcdev, "released uid %d err(%d)\n", uid, err);
+}
+
+static void mcdev_get(struct mlx5ctl_dev *mcdev);
+static void mcdev_put(struct mlx5ctl_dev *mcdev);
+
+static int mlx5ctl_open_mfd(struct mlx5ctl_fd *mfd)
+{
+ struct mlx5_core_dev *mdev = mfd->mcdev->mdev;
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+ u32 ucap = 0, cap = 0;
+ int uid;
+
+#define MLX5_UCTX_CAP(mdev, cap) \
+ (MLX5_CAP_GEN(mdev, uctx_cap) & MLX5_UCTX_OBJECT_CAP_##cap)
+
+ if (capable(CAP_NET_RAW) && MLX5_UCTX_CAP(mdev, RAW_TX)) {
+ ucap |= CAP_NET_RAW;
+ cap |= MLX5_UCTX_OBJECT_CAP_RAW_TX;
+ }
+
+ if (capable(CAP_SYS_RAWIO) && MLX5_UCTX_CAP(mdev, INTERNAL_DEVICE_RESOURCES)) {
+ ucap |= CAP_SYS_RAWIO;
+ cap |= MLX5_UCTX_OBJECT_CAP_INTERNAL_DEVICE_RESOURCES;
+ }
+
+ if (capable(CAP_SYS_ADMIN) && MLX5_UCTX_CAP(mdev, TOOLS_RESOURCES)) {
+ ucap |= CAP_SYS_ADMIN;
+ cap |= MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES;
+ }
+
+ uid = mlx5ctl_alloc_uid(mcdev, cap);
+ if (uid < 0)
+ return uid;
+
+ mfd->uctx_uid = uid;
+ mfd->uctx_cap = cap;
+ mfd->ucap = ucap;
+ mfd->mcdev = mcdev;
+
+ mlx5ctl_dbg(mcdev, "allocated uid %d with uctx caps 0x%x, user cap 0x%x\n",
+ uid, cap, ucap);
+ return 0;
+}
+
+static void mlx5ctl_release_mfd(struct mlx5ctl_fd *mfd)
+{
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+
+ mlx5ctl_release_uid(mcdev, mfd->uctx_uid);
+}
+
+static int mlx5ctl_open(struct inode *inode, struct file *file)
+{
+ struct mlx5_core_dev *mdev;
+ struct mlx5ctl_dev *mcdev;
+ struct mlx5ctl_fd *mfd;
+ int err = 0;
+
+ mcdev = container_of(file->private_data, struct mlx5ctl_dev, miscdev);
+ mcdev_get(mcdev);
+ down_read(&mcdev->rw_lock);
+ mdev = mcdev->mdev;
+ if (!mdev) {
+ err = -ENODEV;
+ goto unlock;
+ }
+
+ mfd = kzalloc(sizeof(*mfd), GFP_KERNEL_ACCOUNT);
+ if (!mfd)
+ return -ENOMEM;
+
+ mfd->mcdev = mcdev;
+ err = mlx5ctl_open_mfd(mfd);
+ if (err)
+ goto unlock;
+
+ spin_lock(&mcdev->fd_list_lock);
+ list_add_tail(&mfd->list, &mcdev->fd_list);
+ spin_unlock(&mcdev->fd_list_lock);
+
+ file->private_data = mfd;
+
+unlock:
+ up_read(&mcdev->rw_lock);
+ if (err) {
+ mcdev_put(mcdev);
+ kfree(mfd);
+ }
+ return err;
+}
+
+static int mlx5ctl_release(struct inode *inode, struct file *file)
+{
+ struct mlx5ctl_fd *mfd = file->private_data;
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+
+ down_read(&mcdev->rw_lock);
+ if (!mcdev->mdev) {
+ pr_debug("[%d] UID %d mlx5ctl: mdev is already released\n",
+ current->pid, mfd->uctx_uid);
+ /* All mfds are already released, skip ... */
+ goto unlock;
+ }
+
+ spin_lock(&mcdev->fd_list_lock);
+ list_del(&mfd->list);
+ spin_unlock(&mcdev->fd_list_lock);
+
+ mlx5ctl_release_mfd(mfd);
+
+unlock:
+ kfree(mfd);
+ up_read(&mcdev->rw_lock);
+ mcdev_put(mcdev);
+ file->private_data = NULL;
+ return 0;
+}
+
+static const struct file_operations mlx5ctl_fops = {
+ .owner = THIS_MODULE,
+ .open = mlx5ctl_open,
+ .release = mlx5ctl_release,
+};
+
+static int mlx5ctl_probe(struct auxiliary_device *adev,
+ const struct auxiliary_device_id *id)
+
+{
+ struct mlx5_adev *madev = container_of(adev, struct mlx5_adev, adev);
+ struct mlx5_core_dev *mdev = madev->mdev;
+ struct mlx5ctl_dev *mcdev;
+ char *devname = NULL;
+ int err;
+
+ mcdev = kzalloc(sizeof(*mcdev), GFP_KERNEL_ACCOUNT);
+ if (!mcdev)
+ return -ENOMEM;
+
+ kref_init(&mcdev->refcount);
+ INIT_LIST_HEAD(&mcdev->fd_list);
+ spin_lock_init(&mcdev->fd_list_lock);
+ init_rwsem(&mcdev->rw_lock);
+ mcdev->mdev = mdev;
+ mcdev->adev = adev;
+ devname = kasprintf(GFP_KERNEL_ACCOUNT, "mlx5ctl-%s",
+ dev_name(&adev->dev));
+ if (!devname) {
+ err = -ENOMEM;
+ goto abort;
+ }
+
+ mcdev->miscdev = (struct miscdevice) {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = devname,
+ .fops = &mlx5ctl_fops,
+ .parent = &adev->dev,
+ };
+
+ err = misc_register(&mcdev->miscdev);
+ if (err) {
+ mlx5ctl_err(mcdev, "mlx5ctl: failed to register misc device err %d\n", err);
+ goto abort;
+ }
+
+ mlx5ctl_dbg(mcdev, "probe mdev@%s %s\n",
+ dev_driver_string(mdev->device),
+ dev_name(mdev->device));
+
+ auxiliary_set_drvdata(adev, mcdev);
+
+ return 0;
+
+abort:
+ kfree(devname);
+ kfree(mcdev);
+ return err;
+}
+
+static void mlx5ctl_remove(struct auxiliary_device *adev)
+{
+ struct mlx5ctl_dev *mcdev = auxiliary_get_drvdata(adev);
+ struct mlx5_core_dev *mdev = mcdev->mdev;
+ struct mlx5ctl_fd *mfd, *n;
+
+ misc_deregister(&mcdev->miscdev);
+ down_write(&mcdev->rw_lock);
+
+ list_for_each_entry_safe(mfd, n, &mcdev->fd_list, list) {
+ mlx5ctl_dbg(mcdev, "UID %d still has open FDs\n", mfd->uctx_uid);
+ list_del(&mfd->list);
+ mlx5ctl_release_mfd(mfd);
+ }
+
+ mlx5ctl_dbg(mcdev, "removed mdev %s %s\n",
+ dev_driver_string(mdev->device), dev_name(mdev->device));
+
+ mcdev->mdev = NULL; /* prevent already open fds from accessing the device */
+ up_write(&mcdev->rw_lock);
+ mcdev_put(mcdev);
+}
+
+static void mcdev_free(struct kref *ref)
+{
+ struct mlx5ctl_dev *mcdev = container_of(ref, struct mlx5ctl_dev, refcount);
+
+ kfree(mcdev->miscdev.name);
+ kfree(mcdev);
+}
+
+static void mcdev_get(struct mlx5ctl_dev *mcdev)
+{
+ kref_get(&mcdev->refcount);
+}
+
+static void mcdev_put(struct mlx5ctl_dev *mcdev)
+{
+ kref_put(&mcdev->refcount, mcdev_free);
+}
+
+static const struct auxiliary_device_id mlx5ctl_id_table[] = {
+ { .name = MLX5_ADEV_NAME ".ctl", },
+ {},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, mlx5ctl_id_table);
+
+static struct auxiliary_driver mlx5ctl_driver = {
+ .name = "ctl",
+ .probe = mlx5ctl_probe,
+ .remove = mlx5ctl_remove,
+ .id_table = mlx5ctl_id_table,
+};
+
+module_auxiliary_driver(mlx5ctl_driver);
--
2.43.0
From: Saeed Mahameed <[email protected]>
Implement INFO ioctl to return the allocated UID and the capability flags
and some other useful device information.
Example:
$ sudo ./mlx5ctl mlx5_core.ctl.0
UCTX UID: 1
UCTX CAP: 0x3
DEV UCTX CAP: 0x3
USER CAP: 0x1d
Reviewed-by: Jiri Pirko <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
---
.../userspace-api/ioctl/ioctl-number.rst | 1 +
drivers/misc/mlx5ctl/main.c | 68 +++++++++++++++++++
include/uapi/misc/mlx5ctl.h | 20 ++++++
3 files changed, 89 insertions(+)
create mode 100644 include/uapi/misc/mlx5ctl.h
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 457e16f06e04..17e6ab6d0d9f 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -89,6 +89,7 @@ Code Seq# Include File Comments
0x20 all drivers/cdrom/cm206.h
0x22 all scsi/sg.h
0x3E 00-0F linux/counter.h <mailto:[email protected]>
+0x5c all uapi/misc/mlx5ctl.h Nvidia ConnectX control
'!' 00-1F uapi/linux/seccomp.h
'#' 00-3F IEEE 1394 Subsystem
Block for the entire subsystem
diff --git a/drivers/misc/mlx5ctl/main.c b/drivers/misc/mlx5ctl/main.c
index 314aeb45bb8e..e4e70359dbe8 100644
--- a/drivers/misc/mlx5ctl/main.c
+++ b/drivers/misc/mlx5ctl/main.c
@@ -8,6 +8,7 @@
#include <linux/auxiliary_bus.h>
#include <linux/mlx5/device.h>
#include <linux/mlx5/driver.h>
+#include <uapi/misc/mlx5ctl.h>
#include <linux/atomic.h>
#include <linux/refcount.h>
@@ -214,10 +215,77 @@ static int mlx5ctl_release(struct inode *inode, struct file *file)
return 0;
}
+static int mlx5ctl_info_ioctl(struct file *file,
+ struct mlx5ctl_info __user *arg,
+ size_t usize)
+{
+ struct mlx5ctl_fd *mfd = file->private_data;
+ size_t ksize = sizeof(struct mlx5ctl_info);
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+ struct mlx5_core_dev *mdev = mcdev->mdev;
+ struct mlx5ctl_info *info;
+ int err = 0;
+
+ if (usize < ksize)
+ return -EINVAL;
+
+ info = kzalloc(ksize, GFP_KERNEL);
+ if (!info)
+ return -ENOMEM;
+
+ info->dev_uctx_cap = MLX5_CAP_GEN(mdev, uctx_cap);
+ info->uctx_cap = mfd->uctx_cap;
+ info->uctx_uid = mfd->uctx_uid;
+ info->ucap = mfd->ucap;
+
+ if (copy_to_user(arg, info, ksize))
+ err = -EFAULT;
+
+ kfree(info);
+ return err;
+}
+
+static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct mlx5ctl_fd *mfd = file->private_data;
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+ void __user *argp = (void __user *)arg;
+ size_t size = _IOC_SIZE(cmd);
+ int err = 0;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ mlx5ctl_dbg(mcdev, "ioctl 0x%x type/nr: %d/%d size: %d DIR:%d\n", cmd,
+ _IOC_TYPE(cmd), _IOC_NR(cmd), _IOC_SIZE(cmd), _IOC_DIR(cmd));
+
+ down_read(&mcdev->rw_lock);
+ if (!mcdev->mdev) {
+ err = -ENODEV;
+ goto unlock;
+ }
+
+ switch (cmd) {
+ case MLX5CTL_IOCTL_INFO:
+ err = mlx5ctl_info_ioctl(file, argp, size);
+ break;
+
+ default:
+ mlx5ctl_dbg(mcdev, "Unknown ioctl %x\n", cmd);
+ err = -ENOTTY;
+ break;
+ }
+unlock:
+ up_read(&mcdev->rw_lock);
+ return err;
+}
+
static const struct file_operations mlx5ctl_fops = {
.owner = THIS_MODULE,
.open = mlx5ctl_open,
.release = mlx5ctl_release,
+ .unlocked_ioctl = mlx5ctl_ioctl,
+ .compat_ioctl = compat_ptr_ioctl,
};
static int mlx5ctl_probe(struct auxiliary_device *adev,
diff --git a/include/uapi/misc/mlx5ctl.h b/include/uapi/misc/mlx5ctl.h
new file mode 100644
index 000000000000..9be944128025
--- /dev/null
+++ b/include/uapi/misc/mlx5ctl.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#ifndef __MLX5CTL_IOCTL_H__
+#define __MLX5CTL_IOCTL_H__
+
+struct mlx5ctl_info {
+ __u16 uctx_uid; /* current process allocated UCTX UID */
+ __u16 reserved1; /* explicit padding must be zero */
+ __u32 uctx_cap; /* current process effective UCTX cap */
+ __u32 dev_uctx_cap; /* device's UCTX capabilities */
+ __u32 ucap; /* process user capability */
+};
+
+#define MLX5CTL_IOCTL_MAGIC 0x5c
+
+#define MLX5CTL_IOCTL_INFO \
+ _IOR(MLX5CTL_IOCTL_MAGIC, 0x0, struct mlx5ctl_info)
+
+#endif /* __MLX5CTL_IOCTL_H__ */
--
2.43.0
From: Saeed Mahameed <[email protected]>
Add new IOCTL to allow user space to send device debug rpcs and
attach the user's uctx UID to each rpc.
In the mlx5 architecture the FW RPC commands are of the format of
inbox and outbox buffers. The inbox buffer contains the command
rpc layout as described in the ConnectX Programmers Reference Manual
(PRM) document and as defined in include/linux/mlx5/mlx5_ifc.h.
On success the user outbox buffer will be filled with the device's rpc
response.
For example to query device capabilities:
a user fills out an inbox buffer with the inbox layout:
struct mlx5_ifc_query_hca_cap_in_bits
and expects an outbox buffer with the layout:
struct mlx5_ifc_cmd_hca_cap_bits
Reviewed-by: Jiri Pirko <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
---
drivers/misc/mlx5ctl/main.c | 98 +++++++++++++++++++++++++++++++++++++
include/uapi/misc/mlx5ctl.h | 12 +++++
2 files changed, 110 insertions(+)
diff --git a/drivers/misc/mlx5ctl/main.c b/drivers/misc/mlx5ctl/main.c
index e4e70359dbe8..c02b80efffc1 100644
--- a/drivers/misc/mlx5ctl/main.c
+++ b/drivers/misc/mlx5ctl/main.c
@@ -245,6 +245,94 @@ static int mlx5ctl_info_ioctl(struct file *file,
return err;
}
+struct mlx5_ifc_mbox_in_hdr_bits {
+ u8 opcode[0x10];
+ u8 uid[0x10];
+
+ u8 reserved_at_20[0x10];
+ u8 op_mod[0x10];
+
+ u8 reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_mbox_out_hdr_bits {
+ u8 status[0x8];
+ u8 reserved_at_8[0x18];
+
+ u8 syndrome[0x20];
+
+ u8 reserved_at_40[0x40];
+};
+
+static int mlx5ctl_cmdrpc_ioctl(struct file *file,
+ struct mlx5ctl_cmdrpc __user *arg,
+ size_t usize)
+{
+ size_t ksize = sizeof(struct mlx5ctl_cmdrpc);
+ struct mlx5ctl_fd *mfd = file->private_data;
+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
+ struct mlx5ctl_cmdrpc *rpc = NULL;
+ void *in = NULL, *out = NULL;
+ int err;
+
+ if (usize < ksize)
+ return -EINVAL;
+
+ rpc = kzalloc(ksize, GFP_KERNEL);
+ if (!rpc)
+ return -ENOMEM;
+
+ err = copy_from_user(rpc, arg, usize);
+ if (err)
+ goto out;
+
+ mlx5ctl_dbg(mcdev, "[UID %d] cmdrpc: inlen %d outlen %d\n",
+ mfd->uctx_uid, rpc->inlen, rpc->outlen);
+
+ if (rpc->inlen < MLX5_ST_SZ_BYTES(mbox_in_hdr) ||
+ rpc->outlen < MLX5_ST_SZ_BYTES(mbox_out_hdr) ||
+ rpc->inlen > MLX5CTL_MAX_RPC_SIZE ||
+ rpc->outlen > MLX5CTL_MAX_RPC_SIZE) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ in = memdup_user(u64_to_user_ptr(rpc->in), rpc->inlen);
+ if (IS_ERR(in)) {
+ err = PTR_ERR(in);
+ goto out;
+ }
+
+ out = kvzalloc(rpc->outlen, GFP_KERNEL);
+ if (!out) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ mlx5ctl_dbg(mcdev, "[UID %d] cmdif: opcode 0x%x inlen %d outlen %d\n",
+ mfd->uctx_uid,
+ MLX5_GET(mbox_in_hdr, in, opcode), rpc->inlen, rpc->outlen);
+
+ MLX5_SET(mbox_in_hdr, in, uid, mfd->uctx_uid);
+ err = mlx5_cmd_do(mcdev->mdev, in, rpc->inlen, out, rpc->outlen);
+ mlx5ctl_dbg(mcdev, "[UID %d] cmdif: opcode 0x%x retval %d\n",
+ mfd->uctx_uid,
+ MLX5_GET(mbox_in_hdr, in, opcode), err);
+
+ /* -EREMOTEIO means outbox is valid, but out.status is not */
+ if (!err || err == -EREMOTEIO) {
+ err = 0;
+ if (copy_to_user(u64_to_user_ptr(rpc->out), out, rpc->outlen))
+ err = -EFAULT;
+ }
+
+out:
+ kvfree(out);
+ kfree(in);
+ kfree(rpc);
+ return err;
+}
+
static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
struct mlx5ctl_fd *mfd = file->private_data;
@@ -270,6 +358,10 @@ static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg
err = mlx5ctl_info_ioctl(file, argp, size);
break;
+ case MLX5CTL_IOCTL_CMDRPC:
+ err = mlx5ctl_cmdrpc_ioctl(file, argp, size);
+ break;
+
default:
mlx5ctl_dbg(mcdev, "Unknown ioctl %x\n", cmd);
err = -ENOTTY;
@@ -328,6 +420,11 @@ static int mlx5ctl_probe(struct auxiliary_device *adev,
goto abort;
}
+ err = sysfs_create_link_nowarn(&mcdev->miscdev.this_device->kobj,
+ &mdev->device->kobj, "mdev");
+ if (err)
+ mlx5ctl_dbg(mcdev, "mlx5ctl: failed to create sysfs link err %d\n", err);
+
mlx5ctl_dbg(mcdev, "probe mdev@%s %s\n",
dev_driver_string(mdev->device),
dev_name(mdev->device));
@@ -348,6 +445,7 @@ static void mlx5ctl_remove(struct auxiliary_device *adev)
struct mlx5_core_dev *mdev = mcdev->mdev;
struct mlx5ctl_fd *mfd, *n;
+ sysfs_remove_link(&mcdev->miscdev.this_device->kobj, "mdev");
misc_deregister(&mcdev->miscdev);
down_write(&mcdev->rw_lock);
diff --git a/include/uapi/misc/mlx5ctl.h b/include/uapi/misc/mlx5ctl.h
index 9be944128025..1e4622c5979f 100644
--- a/include/uapi/misc/mlx5ctl.h
+++ b/include/uapi/misc/mlx5ctl.h
@@ -12,9 +12,21 @@ struct mlx5ctl_info {
__u32 ucap; /* process user capability */
};
+struct mlx5ctl_cmdrpc {
+ __aligned_u64 in; /* RPC inbox buffer user address */
+ __aligned_u64 out; /* RPC outbox buffer user address */
+ __u32 inlen; /* inbox buffer length */
+ __u32 outlen; /* outbox buffer length */
+};
+
+#define MLX5CTL_MAX_RPC_SIZE (512 * 512) /* max FW RPC buffer size 512 blocks of 512 bytes */
+
#define MLX5CTL_IOCTL_MAGIC 0x5c
#define MLX5CTL_IOCTL_INFO \
_IOR(MLX5CTL_IOCTL_MAGIC, 0x0, struct mlx5ctl_info)
+#define MLX5CTL_IOCTL_CMDRPC \
+ _IOWR(MLX5CTL_IOCTL_MAGIC, 0x1, struct mlx5ctl_cmdrpc)
+
#endif /* __MLX5CTL_IOCTL_H__ */
--
2.43.0
From: Saeed Mahameed <[email protected]>
Command rpc outbox buffer is limited in size, which can be very
annoying when trying to pull large traces out of the device.
Many device rpcs offer the ability to scatter output traces, contexts
and logs directly into user space buffers in a single shot.
Allow user to register user memory space, so the device may dump
information directly into user memory space.
The registered memory will be described by a device UMEM object which
has a unique umem_id, this umem_id can be later used in the rpc inbox
to tell the device where to populate the response output,
e.g HW traces and other debug object queries.
To do so this patch introduces two ioctls:
MLX5CTL_IOCTL_UMEM_REG(va_address, size):
- calculate page fragments from the user provided virtual address
- pin the pages, and allocate a sg list
- dma map the sg list
- create a UMEM device object that points to the dma addresses
- add a driver umem object to an xarray data base for bookkeeping
- return UMEM ID to user so it can be used in subsequent rpcs
MLX5CTL_IOCTL_UMEM_UNREG(umem_id):
- user provides a pre allocated umem ID
- unwinds the above
Example usecase, ConnectX device coredump can be as large as 2MB.
Using inline rpcs will take thousands of rpcs to get the full
coredump which can take multiple seconds.
With UMEM, it can be done in a single rpc, using 2MB of umem user buffer.
$ ./mlx5ctlu mlx5_core.ctl.0 coredump --umem_size=$(( 2 ** 20 ))
00 00 00 00 01 00 20 00 00 00 00 04 00 00 48 ec
00 00 00 08 00 00 00 00 00 00 00 0c 00 00 00 03
00 00 00 10 00 00 00 00 00 00 00 14 00 00 00 00
...
00 50 0b 3c 00 00 00 00 00 50 0b 40 00 00 00 00
00 50 0b 44 00 00 00 00 00 50 0b 48 00 00 00 00
00 50 0c 00 00 00 00 00
INFO : Core dump done
INFO : Core dump size 831304
INFO : Core dump address 0x0
INFO : Core dump cookie 0x500c04
INFO : More Dump 0
Other usecases are: dynamic HW and FW trace monitoring, high frequency
diagnostic counters sampling and batched objects and resource dumps.
Reviewed-by: Jiri Pirko <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
---
drivers/misc/mlx5ctl/Makefile | 1 +
drivers/misc/mlx5ctl/main.c | 99 +++++++++++
drivers/misc/mlx5ctl/umem.c | 322 ++++++++++++++++++++++++++++++++++
drivers/misc/mlx5ctl/umem.h | 17 ++
include/uapi/misc/mlx5ctl.h | 18 ++
5 files changed, 457 insertions(+)
create mode 100644 drivers/misc/mlx5ctl/umem.c
create mode 100644 drivers/misc/mlx5ctl/umem.h
diff --git a/drivers/misc/mlx5ctl/Makefile b/drivers/misc/mlx5ctl/Makefile
index b5c7f99e0ab6..f35234e931a8 100644
--- a/drivers/misc/mlx5ctl/Makefile
+++ b/drivers/misc/mlx5ctl/Makefile
@@ -2,3 +2,4 @@
obj-$(CONFIG_MLX5CTL) += mlx5ctl.o
mlx5ctl-y := main.o
+mlx5ctl-y += umem.o
diff --git a/drivers/misc/mlx5ctl/main.c b/drivers/misc/mlx5ctl/main.c
index c02b80efffc1..f79e1aa62b8f 100644
--- a/drivers/misc/mlx5ctl/main.c
+++ b/drivers/misc/mlx5ctl/main.c
@@ -12,6 +12,8 @@
#include <linux/atomic.h>
#include <linux/refcount.h>
+#include "umem.h"
+
MODULE_DESCRIPTION("mlx5 ConnectX control misc driver");
MODULE_AUTHOR("Saeed Mahameed <[email protected]>");
MODULE_LICENSE("Dual BSD/GPL");
@@ -46,6 +48,8 @@ struct mlx5ctl_fd {
u16 uctx_uid;
u32 uctx_cap;
u32 ucap; /* user cap */
+
+ struct mlx5ctl_umem_db *umem_db;
struct mlx5ctl_dev *mcdev;
struct list_head list;
};
@@ -131,6 +135,12 @@ static int mlx5ctl_open_mfd(struct mlx5ctl_fd *mfd)
if (uid < 0)
return uid;
+ mfd->umem_db = mlx5ctl_umem_db_create(mdev, uid);
+ if (IS_ERR(mfd->umem_db)) {
+ mlx5ctl_release_uid(mcdev, uid);
+ return PTR_ERR(mfd->umem_db);
+ }
+
mfd->uctx_uid = uid;
mfd->uctx_cap = cap;
mfd->ucap = ucap;
@@ -145,6 +155,7 @@ static void mlx5ctl_release_mfd(struct mlx5ctl_fd *mfd)
{
struct mlx5ctl_dev *mcdev = mfd->mcdev;
+ mlx5ctl_umem_db_destroy(mfd->umem_db);
mlx5ctl_release_uid(mcdev, mfd->uctx_uid);
}
@@ -333,6 +344,86 @@ static int mlx5ctl_cmdrpc_ioctl(struct file *file,
return err;
}
+static int mlx5ctl_ioctl_umem_reg(struct file *file,
+ struct mlx5ctl_umem_reg __user *arg,
+ size_t usize)
+{
+ size_t ksize = sizeof(struct mlx5ctl_umem_reg);
+ struct mlx5ctl_fd *mfd = file->private_data;
+ struct mlx5ctl_umem_reg *umem_reg;
+ int umem_id, err = 0;
+
+ if (usize < ksize)
+ return -EINVAL;
+
+ umem_reg = kzalloc(ksize, GFP_KERNEL);
+ if (!umem_reg)
+ return -ENOMEM;
+
+ if (copy_from_user(umem_reg, arg, ksize)) {
+ err = -EFAULT;
+ goto out;
+ }
+
+ if (umem_reg->reserved) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ umem_id = mlx5ctl_umem_reg(mfd->umem_db,
+ (unsigned long)umem_reg->addr,
+ umem_reg->len);
+ if (umem_id < 0) {
+ err = umem_id;
+ goto out;
+ }
+
+ umem_reg->umem_id = umem_id;
+
+ if (copy_to_user(arg, umem_reg, ksize)) {
+ mlx5ctl_umem_unreg(mfd->umem_db, umem_id);
+ err = -EFAULT;
+ }
+out:
+ kfree(umem_reg);
+ return err;
+}
+
+static int mlx5ctl_ioctl_umem_unreg(struct file *file,
+ struct mlx5ctl_umem_unreg __user *arg,
+ size_t usize)
+{
+ size_t ksize = sizeof(struct mlx5ctl_umem_unreg);
+ struct mlx5ctl_fd *mfd = file->private_data;
+ struct mlx5ctl_umem_unreg *umem_unreg;
+ int err = 0;
+
+ if (usize < ksize)
+ return -EINVAL;
+
+ umem_unreg = kzalloc(ksize, GFP_KERNEL);
+ if (!umem_unreg)
+ return -ENOMEM;
+
+ if (copy_from_user(umem_unreg, arg, ksize)) {
+ err = -EFAULT;
+ goto out;
+ }
+
+ if (umem_unreg->reserved) {
+ err = -EOPNOTSUPP;
+ goto out;
+ }
+
+ err = mlx5ctl_umem_unreg(mfd->umem_db, umem_unreg->umem_id);
+
+ if (!err && copy_to_user(arg, umem_unreg, ksize))
+ err = -EFAULT;
+out:
+ kfree(umem_unreg);
+ return err;
+}
+
static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
struct mlx5ctl_fd *mfd = file->private_data;
@@ -362,6 +453,14 @@ static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg
err = mlx5ctl_cmdrpc_ioctl(file, argp, size);
break;
+ case MLX5CTL_IOCTL_UMEM_REG:
+ err = mlx5ctl_ioctl_umem_reg(file, argp, size);
+ break;
+
+ case MLX5CTL_IOCTL_UMEM_UNREG:
+ err = mlx5ctl_ioctl_umem_unreg(file, argp, size);
+ break;
+
default:
mlx5ctl_dbg(mcdev, "Unknown ioctl %x\n", cmd);
err = -ENOTTY;
diff --git a/drivers/misc/mlx5ctl/umem.c b/drivers/misc/mlx5ctl/umem.c
new file mode 100644
index 000000000000..29091a19305b
--- /dev/null
+++ b/drivers/misc/mlx5ctl/umem.c
@@ -0,0 +1,322 @@
+// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#include <linux/mlx5/device.h>
+#include <linux/mlx5/driver.h>
+#include <uapi/misc/mlx5ctl.h>
+
+#include "umem.h"
+
+#define MLX5CTL_UMEM_MAX_MB 64
+
+static unsigned long umem_num_pages(u64 addr, size_t len)
+{
+ return DIV_ROUND_UP(addr + len - PAGE_ALIGN_DOWN(addr), PAGE_SIZE);
+}
+
+struct mlx5ctl_umem {
+ struct sg_table sgt;
+ unsigned long addr;
+ size_t size;
+ size_t offset;
+ size_t npages;
+ struct task_struct *source_task;
+ struct mm_struct *source_mm;
+ struct user_struct *source_user;
+ u32 umem_id;
+ struct page **page_list;
+};
+
+struct mlx5ctl_umem_db {
+ struct xarray xarray;
+ struct mlx5_core_dev *mdev;
+ u32 uctx_uid;
+};
+
+static int inc_user_locked_vm(struct mlx5ctl_umem *umem, unsigned long npages)
+{
+ unsigned long lock_limit;
+ unsigned long cur_pages;
+ unsigned long new_pages;
+
+ lock_limit = task_rlimit(umem->source_task, RLIMIT_MEMLOCK) >>
+ PAGE_SHIFT;
+ do {
+ cur_pages = atomic_long_read(&umem->source_user->locked_vm);
+ new_pages = cur_pages + npages;
+ if (new_pages > lock_limit)
+ return -ENOMEM;
+ } while (atomic_long_cmpxchg(&umem->source_user->locked_vm, cur_pages,
+ new_pages) != cur_pages);
+ return 0;
+}
+
+static void dec_user_locked_vm(struct mlx5ctl_umem *umem, unsigned long npages)
+{
+ if (WARN_ON(atomic_long_read(&umem->source_user->locked_vm) < npages))
+ return;
+ atomic_long_sub(npages, &umem->source_user->locked_vm);
+}
+
+#define PAGES_2_MB(pages) ((pages) >> (20 - PAGE_SHIFT))
+
+static struct mlx5ctl_umem *mlx5ctl_umem_pin(struct mlx5ctl_umem_db *umem_db,
+ unsigned long addr, size_t size)
+{
+ size_t npages = umem_num_pages(addr, size);
+ struct mlx5_core_dev *mdev = umem_db->mdev;
+ unsigned long endaddr = addr + size;
+ struct mlx5ctl_umem *umem;
+ struct page **page_list;
+ int err = -EINVAL;
+ int pinned = 0;
+
+ dev_dbg(mdev->device, "%s: addr %p size %zu npages %zu\n",
+ __func__, (void __user *)addr, size, npages);
+
+ /* Avoid integer overflow */
+ if (endaddr < addr || PAGE_ALIGN(endaddr) < endaddr)
+ return ERR_PTR(-EINVAL);
+
+ if (npages == 0 || PAGES_2_MB(npages) > MLX5CTL_UMEM_MAX_MB)
+ return ERR_PTR(-EINVAL);
+
+ page_list = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL_ACCOUNT);
+ if (!page_list)
+ return ERR_PTR(-ENOMEM);
+
+ umem = kzalloc(sizeof(*umem), GFP_KERNEL_ACCOUNT);
+ if (!umem) {
+ kvfree(page_list);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ umem->addr = addr;
+ umem->size = size;
+ umem->offset = addr & ~PAGE_MASK;
+ umem->npages = npages;
+
+ umem->page_list = page_list;
+ umem->source_mm = current->mm;
+ umem->source_task = current->group_leader;
+ get_task_struct(current->group_leader);
+ umem->source_user = get_uid(current_user());
+
+ /* mm and RLIMIT_MEMLOCK user task accounting similar to what is
+ * being done in iopt_alloc_pages() and do_update_pinned()
+ * for IOPT_PAGES_ACCOUNT_USER @drivers/iommu/iommufd/pages.c
+ */
+ mmgrab(umem->source_mm);
+
+ pinned = pin_user_pages_fast(addr, npages, FOLL_WRITE, page_list);
+ if (pinned != npages) {
+ dev_dbg(mdev->device, "pin_user_pages_fast failed %d\n", pinned);
+ err = pinned < 0 ? pinned : -ENOMEM;
+ goto pin_failed;
+ }
+
+ err = inc_user_locked_vm(umem, npages);
+ if (err)
+ goto pin_failed;
+
+ atomic64_add(npages, &umem->source_mm->pinned_vm);
+
+ err = sg_alloc_table_from_pages(&umem->sgt, page_list, npages, 0,
+ npages << PAGE_SHIFT, GFP_KERNEL_ACCOUNT);
+ if (err) {
+ dev_dbg(mdev->device, "sg_alloc_table failed: %d\n", err);
+ goto sgt_failed;
+ }
+
+ dev_dbg(mdev->device, "\tsgt: size %zu npages %zu sgt.nents (%d)\n",
+ size, npages, umem->sgt.nents);
+
+ err = dma_map_sgtable(mdev->device, &umem->sgt, DMA_BIDIRECTIONAL, 0);
+ if (err) {
+ dev_dbg(mdev->device, "dma_map_sgtable failed: %d\n", err);
+ goto dma_failed;
+ }
+
+ dev_dbg(mdev->device, "\tsgt: dma_nents %d\n", umem->sgt.nents);
+ return umem;
+
+dma_failed:
+sgt_failed:
+ sg_free_table(&umem->sgt);
+ atomic64_sub(npages, &umem->source_mm->pinned_vm);
+ dec_user_locked_vm(umem, npages);
+pin_failed:
+ if (pinned > 0)
+ unpin_user_pages(page_list, pinned);
+ mmdrop(umem->source_mm);
+ free_uid(umem->source_user);
+ put_task_struct(umem->source_task);
+
+ kfree(umem);
+ kvfree(page_list);
+ return ERR_PTR(err);
+}
+
+static void mlx5ctl_umem_unpin(struct mlx5ctl_umem_db *umem_db,
+ struct mlx5ctl_umem *umem)
+{
+ struct mlx5_core_dev *mdev = umem_db->mdev;
+
+ dev_dbg(mdev->device, "%s: addr %p size %zu npages %zu dma_nents %d\n",
+ __func__, (void *)umem->addr, umem->size, umem->npages,
+ umem->sgt.nents);
+
+ dma_unmap_sgtable(mdev->device, &umem->sgt, DMA_BIDIRECTIONAL, 0);
+ sg_free_table(&umem->sgt);
+
+ atomic64_sub(umem->npages, &umem->source_mm->pinned_vm);
+ dec_user_locked_vm(umem, umem->npages);
+ unpin_user_pages(umem->page_list, umem->npages);
+ mmdrop(umem->source_mm);
+ free_uid(umem->source_user);
+ put_task_struct(umem->source_task);
+
+ kvfree(umem->page_list);
+ kfree(umem);
+}
+
+static int mlx5ctl_umem_create(struct mlx5_core_dev *mdev,
+ struct mlx5ctl_umem *umem, u32 uid)
+{
+ u32 out[MLX5_ST_SZ_DW(create_umem_out)] = {};
+ int err, inlen, i, n = 0;
+ struct scatterlist *sg;
+ void *in, *umemptr;
+ __be64 *mtt;
+
+ inlen = MLX5_ST_SZ_BYTES(create_umem_in) +
+ umem->npages * MLX5_ST_SZ_BYTES(mtt);
+
+ in = kzalloc(inlen, GFP_KERNEL);
+ if (!in)
+ return -ENOMEM;
+
+ MLX5_SET(create_umem_in, in, opcode, MLX5_CMD_OP_CREATE_UMEM);
+ MLX5_SET(create_umem_in, in, uid, uid);
+
+ umemptr = MLX5_ADDR_OF(create_umem_in, in, umem);
+
+ MLX5_SET(umem, umemptr, log_page_size,
+ PAGE_SHIFT - MLX5_ADAPTER_PAGE_SHIFT);
+ MLX5_SET64(umem, umemptr, num_of_mtt, umem->npages);
+ MLX5_SET(umem, umemptr, page_offset, umem->offset);
+
+ dev_dbg(mdev->device,
+ "UMEM CREATE: log_page_size %d num_of_mtt %lld page_offset %d\n",
+ MLX5_GET(umem, umemptr, log_page_size),
+ MLX5_GET64(umem, umemptr, num_of_mtt),
+ MLX5_GET(umem, umemptr, page_offset));
+
+ mtt = MLX5_ADDR_OF(create_umem_in, in, umem.mtt);
+ for_each_sgtable_dma_sg(&umem->sgt, sg, i) {
+ u64 dma_addr = sg_dma_address(sg);
+ ssize_t len = sg_dma_len(sg);
+
+ for (; n < umem->npages && len > 0; n++, mtt++) {
+ *mtt = cpu_to_be64(dma_addr);
+ MLX5_SET(mtt, mtt, wr_en, 1);
+ MLX5_SET(mtt, mtt, rd_en, 1);
+ dma_addr += PAGE_SIZE;
+ len -= PAGE_SIZE;
+ }
+ WARN_ON_ONCE(n == umem->npages && len > 0);
+ }
+
+ err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
+ if (err)
+ goto out;
+
+ umem->umem_id = MLX5_GET(create_umem_out, out, umem_id);
+ dev_dbg(mdev->device, "\tUMEM CREATED: umem_id %d\n", umem->umem_id);
+out:
+ kfree(in);
+ return err;
+}
+
+static void mlx5ctl_umem_destroy(struct mlx5_core_dev *mdev,
+ struct mlx5ctl_umem *umem)
+{
+ u32 in[MLX5_ST_SZ_DW(destroy_umem_in)] = {};
+
+ MLX5_SET(destroy_umem_in, in, opcode, MLX5_CMD_OP_DESTROY_UMEM);
+ MLX5_SET(destroy_umem_in, in, umem_id, umem->umem_id);
+
+ dev_dbg(mdev->device, "UMEM DESTROY: umem_id %d\n", umem->umem_id);
+ mlx5_cmd_exec_in(mdev, destroy_umem, in);
+}
+
+int mlx5ctl_umem_reg(struct mlx5ctl_umem_db *umem_db, unsigned long addr,
+ size_t size)
+{
+ struct mlx5ctl_umem *umem;
+ void *ret;
+ int err;
+
+ umem = mlx5ctl_umem_pin(umem_db, addr, size);
+ if (IS_ERR(umem))
+ return PTR_ERR(umem);
+
+ err = mlx5ctl_umem_create(umem_db->mdev, umem, umem_db->uctx_uid);
+ if (err)
+ goto umem_create_err;
+
+ ret = xa_store(&umem_db->xarray, umem->umem_id, umem, GFP_KERNEL_ACCOUNT);
+ if (WARN(xa_is_err(ret), "Failed to store UMEM")) {
+ err = xa_err(ret);
+ goto xa_store_err;
+ }
+
+ return umem->umem_id;
+
+xa_store_err:
+ mlx5ctl_umem_destroy(umem_db->mdev, umem);
+umem_create_err:
+ mlx5ctl_umem_unpin(umem_db, umem);
+ return err;
+}
+
+int mlx5ctl_umem_unreg(struct mlx5ctl_umem_db *umem_db, u32 umem_id)
+{
+ struct mlx5ctl_umem *umem;
+
+ umem = xa_erase(&umem_db->xarray, umem_id);
+ if (!umem)
+ return -ENOENT;
+
+ mlx5ctl_umem_destroy(umem_db->mdev, umem);
+ mlx5ctl_umem_unpin(umem_db, umem);
+ return 0;
+}
+
+struct mlx5ctl_umem_db *mlx5ctl_umem_db_create(struct mlx5_core_dev *mdev,
+ u32 uctx_uid)
+{
+ struct mlx5ctl_umem_db *umem_db;
+
+ umem_db = kzalloc(sizeof(*umem_db), GFP_KERNEL_ACCOUNT);
+ if (!umem_db)
+ return ERR_PTR(-ENOMEM);
+
+ xa_init(&umem_db->xarray);
+ umem_db->mdev = mdev;
+ umem_db->uctx_uid = uctx_uid;
+
+ return umem_db;
+}
+
+void mlx5ctl_umem_db_destroy(struct mlx5ctl_umem_db *umem_db)
+{
+ struct mlx5ctl_umem *umem;
+ unsigned long index;
+
+ xa_for_each(&umem_db->xarray, index, umem)
+ mlx5ctl_umem_unreg(umem_db, umem->umem_id);
+
+ xa_destroy(&umem_db->xarray);
+ kfree(umem_db);
+}
diff --git a/drivers/misc/mlx5ctl/umem.h b/drivers/misc/mlx5ctl/umem.h
new file mode 100644
index 000000000000..9cf62e5e775e
--- /dev/null
+++ b/drivers/misc/mlx5ctl/umem.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 */
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#ifndef __MLX5CTL_UMEM_H__
+#define __MLX5CTL_UMEM_H__
+
+#include <linux/types.h>
+#include <linux/mlx5/driver.h>
+
+struct mlx5ctl_umem_db;
+
+struct mlx5ctl_umem_db *mlx5ctl_umem_db_create(struct mlx5_core_dev *mdev, u32 uctx_uid);
+void mlx5ctl_umem_db_destroy(struct mlx5ctl_umem_db *umem_db);
+int mlx5ctl_umem_reg(struct mlx5ctl_umem_db *umem_db, unsigned long addr, size_t size);
+int mlx5ctl_umem_unreg(struct mlx5ctl_umem_db *umem_db, u32 umem_id);
+
+#endif /* __MLX5CTL_UMEM_H__ */
diff --git a/include/uapi/misc/mlx5ctl.h b/include/uapi/misc/mlx5ctl.h
index 1e4622c5979f..bb9ca8581112 100644
--- a/include/uapi/misc/mlx5ctl.h
+++ b/include/uapi/misc/mlx5ctl.h
@@ -19,6 +19,18 @@ struct mlx5ctl_cmdrpc {
__u32 outlen; /* outbox buffer length */
};
+struct mlx5ctl_umem_reg {
+ __aligned_u64 addr; /* user address */
+ __aligned_u64 len; /* user buffer length */
+ __u32 umem_id; /* returned device's umem ID */
+ __u32 reserved; /* explicit padding must be zero */
+};
+
+struct mlx5ctl_umem_unreg {
+ __u32 umem_id;
+ __u32 reserved; /* explicit padding must be zero */
+};
+
#define MLX5CTL_MAX_RPC_SIZE (512 * 512) /* max FW RPC buffer size 512 blocks of 512 bytes */
#define MLX5CTL_IOCTL_MAGIC 0x5c
@@ -29,4 +41,10 @@ struct mlx5ctl_cmdrpc {
#define MLX5CTL_IOCTL_CMDRPC \
_IOWR(MLX5CTL_IOCTL_MAGIC, 0x1, struct mlx5ctl_cmdrpc)
+#define MLX5CTL_IOCTL_UMEM_REG \
+ _IOWR(MLX5CTL_IOCTL_MAGIC, 0x2, struct mlx5ctl_umem_reg)
+
+#define MLX5CTL_IOCTL_UMEM_UNREG \
+ _IOWR(MLX5CTL_IOCTL_MAGIC, 0x3, struct mlx5ctl_umem_unreg)
+
#endif /* __MLX5CTL_IOCTL_H__ */
--
2.43.0
On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote:
> From: Saeed Mahameed <[email protected]>
>
> Recap from V3 discussion:
> =========================
>
> LWN has published an article on this series aptly summarizing the debate.
> LINK: https://lwn.net/Articles/955001/
>
> We continue to think that mlx5ctl is reasonable and aligned with the
> greater kernel community values. People have pointed to the HW RAID
> miscdevices as a good analog. The MD developers did not get to block HW
> RAID configuration on the basis that it undermines their work on the
> software RAID stack. Further, while there is a superficial similarity to
> the DRM/accel debate, that was grounded in a real concern that DRM values
> on open source would be bypassed. That argument does not hold up here as
> this does come with open source userspace and the functionality mlx5ctl
> enables on lockdown has always been available to ConnectX users through
> the non-lockdown PCI sysfs. netdev has been doing just fine despite the
> long standing presence of this tooling and we have continued to work with
> Jakub on building common APIs when appropriate. mlx5 already implements
> a wide range of the netdev common interfaces, many of which were pushed
> forward by our staff - the DPLL configuration netlink being a recent
> example.
I appreciate Jiri's contributions (and you hired Maciej off of Intel at
some point) but don't make it sound like nVidia lead DPLL, or pushed for
a common interface :| Intel posted SyncE support. I asked them make it
a standalone API family:
https://lore.kernel.org/netdev/20210830162909.110753ec@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
Vadim from Meta joined in and helped a lot based on the OCP time card.
Then after some delaying and weird noises y'all started participating.
> mlx5 ConnectX control misc driver:
> ==================================
>
> The ConnectX HW family supported by the mlx5 drivers uses an architecture
> where a FW component executes "mailbox RPCs" issued by the driver to make
> changes to the device. This results in a complex debugging environment
> where the FW component has information and low level configuration that
> needs to be accessed to userspace for debugging purposes.
You don't explain anywhere why addressing the challenges of using
debugfs in secure environments isn't the way to go.
Also you keep saying debugging purposes but the driver is called
"control misc driver", you need to iron out your narrative just
a bit more.
> Historically a userspace program was used that accessed the PCI register
> and config space directly through /sys/bus/pci/.../XXX and could operate
> these debugging interfaces in parallel with the running driver.
> This approach is incompatible with secure boot and kernel lockdown so this
> driver provides a secure and restricted interface to that.
[snip]
> i) mstreg
> The mlxreg utility allows users to obtain information regarding
> supported access registers, such as their fields
So the access mstreg gives over this interface is read only? That's
what your description sounds like, but given your complaints about
"inability to add knobs" and "control" in the name of the driver that
must be false.
> Other usecases with umem:
> - dynamic HW and FW trace monitoring
> - high frequency diagnostic counters sampling
Ah yes, the high frequency counters. Something that is definitely
impossible to implement in a generic way. You were literally in the
room at netconf when David Ahern described his proposal for this.
Anyway, I don't want to waste any more time arguing with you.
My opinion is that the kernel community is under no obligation to carry
your proprietary gateway interfaces. I may be wrong, but I'm entitled
to my opinion.
Please do me the basic courtesy of carrying my nack on these patches:
Nacked-by: Jakub Kicinski <[email protected]>
and CC netdev, so I don't have to respond again :|
On 07 Feb 07:03, Jakub Kicinski wrote:
>On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote:
>> From: Saeed Mahameed <[email protected]>
>>
>> Recap from V3 discussion:
>> =========================
>>
>> LWN has published an article on this series aptly summarizing the debate.
>> LINK: https://lwn.net/Articles/955001/
>>
>> We continue to think that mlx5ctl is reasonable and aligned with the
>> greater kernel community values. People have pointed to the HW RAID
>> miscdevices as a good analog. The MD developers did not get to block HW
>> RAID configuration on the basis that it undermines their work on the
>> software RAID stack. Further, while there is a superficial similarity to
>> the DRM/accel debate, that was grounded in a real concern that DRM values
>> on open source would be bypassed. That argument does not hold up here as
>> this does come with open source userspace and the functionality mlx5ctl
>> enables on lockdown has always been available to ConnectX users through
>> the non-lockdown PCI sysfs. netdev has been doing just fine despite the
>> long standing presence of this tooling and we have continued to work with
>> Jakub on building common APIs when appropriate. mlx5 already implements
>> a wide range of the netdev common interfaces, many of which were pushed
>> forward by our staff - the DPLL configuration netlink being a recent
>> example.
>
>I appreciate Jiri's contributions (and you hired Maciej off of Intel at
>some point) but don't make it sound like nVidia lead DPLL, or pushed for
>a common interface :| Intel posted SyncE support. I asked them make it
>a standalone API family:
>
I will let the stats speak for itself.
$ git shortlog -sne --no-merges net/devlink
and prior to commit f05bd8ebeb69 devlink: move code to a dedicated directory
$ git shortlog -sne --no-merges net/core/devlink.c
More than 70% of the commits are authored by more than 10 different individuals
form Mellanox/Nvidia ..
Ok you don't like DPLL, here is a list of some central devlink features we did
push to the devlink standard API:
- subfunction API and devlink infrastructure
- Shared buffer API
- port function and rate API
- shared buffer
- health
>https://lore.kernel.org/netdev/20210830162909.110753ec@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
>
>Vadim from Meta joined in and helped a lot based on the OCP time card.
>Then after some delaying and weird noises y'all started participating.
>
I remember those discussions, and I agree it is very weird when it
takes 3 vendors and 2 years to get a simple devlink API for single bit
flip accepted.
>> mlx5 ConnectX control misc driver:
>> ==================================
>>
>> The ConnectX HW family supported by the mlx5 drivers uses an architecture
>> where a FW component executes "mailbox RPCs" issued by the driver to make
>> changes to the device. This results in a complex debugging environment
>> where the FW component has information and low level configuration that
>> needs to be accessed to userspace for debugging purposes.
>
>You don't explain anywhere why addressing the challenges of using
>debugfs in secure environments isn't the way to go.
>
I already answered this question in length in v3
here: https://lore.kernel.org/all/ZWZFm2qqhV1wKKCV@x130/
>Also you keep saying debugging purposes but the driver is called
>"control misc driver", you need to iron out your narrative just
>a bit more.
>
>> Historically a userspace program was used that accessed the PCI register
>> and config space directly through /sys/bus/pci/.../XXX and could operate
>> these debugging interfaces in parallel with the running driver.
>> This approach is incompatible with secure boot and kernel lockdown so this
>> driver provides a secure and restricted interface to that.
>
>[snip]
>
>> i) mstreg
>> The mlxreg utility allows users to obtain information regarding
>> supported access registers, such as their fields
>
>So the access mstreg gives over this interface is read only? That's
>what your description sounds like, but given your complaints about
>"inability to add knobs" and "control" in the name of the driver that
>must be false.
>
Yes this is enforced by the mlx5ctl driver and FW using the special
debug uid.
>> Other usecases with umem:
>> - dynamic HW and FW trace monitoring
>> - high frequency diagnostic counters sampling
>
>Ah yes, the high frequency counters. Something that is definitely
>impossible to implement in a generic way. You were literally in the
>room at netconf when David Ahern described his proposal for this.
>
I was in the room and I am in support of David's idea, I like it a lot,
but I don't believe we have any concrete proposal, and we don't have any
use case for it in netdev for now, our use case for this is currently RDMA
and HPC specific.
Also siimilar to devlink we will be the first to jump in and implement
the new API once defined, but this doesn't mean I need to throw away the
whole driver just because one single use case will be implemented in netdev
one day, and I am sure the netdev implementation won't satisfy all the
use-cases of high frequency counters:
Also keep in mind high frequency counters is a very small part of the debug
and access capabilities the mlx5ctl interface offers.
>Anyway, I don't want to waste any more time arguing with you.
>My opinion is that the kernel community is under no obligation to carry
>your proprietary gateway interfaces. I may be wrong, but I'm entitled
>to my opinion.
>
Thanks, I appreciate your honesty, but I must disagree with your Nack, we
provided enough argument for why we believe this approach is the right
way to go, it is clear from the responses on V3 and from the LWN article
that we have the community support for this open source project.
>Please do me the basic courtesy of carrying my nack on these patches:
>
>Nacked-by: Jakub Kicinski <[email protected]>
>
>and CC netdev, so I don't have to respond again :|
Ack.
Thanks,
Saeed.
On Wed, 7 Feb 2024 21:03:35 -0800 Saeed Mahameed wrote:
> On 07 Feb 07:03, Jakub Kicinski wrote:
> >On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote:
> >> From: Saeed Mahameed <[email protected]>
> >>
> >> Recap from V3 discussion:
> >> =========================
> >>
> >> LWN has published an article on this series aptly summarizing the debate.
> >> LINK: https://lwn.net/Articles/955001/
> >>
> >> We continue to think that mlx5ctl is reasonable and aligned with the
> >> greater kernel community values. People have pointed to the HW RAID
> >> miscdevices as a good analog. The MD developers did not get to block HW
> >> RAID configuration on the basis that it undermines their work on the
> >> software RAID stack. Further, while there is a superficial similarity to
> >> the DRM/accel debate, that was grounded in a real concern that DRM values
> >> on open source would be bypassed. That argument does not hold up here as
> >> this does come with open source userspace and the functionality mlx5ctl
> >> enables on lockdown has always been available to ConnectX users through
> >> the non-lockdown PCI sysfs. netdev has been doing just fine despite the
> >> long standing presence of this tooling and we have continued to work with
> >> Jakub on building common APIs when appropriate. mlx5 already implements
> >> a wide range of the netdev common interfaces, many of which were pushed
> >> forward by our staff - the DPLL configuration netlink being a recent
> >> example.
> >
> >I appreciate Jiri's contributions (and you hired Maciej off of Intel at
> >some point) but don't make it sound like nVidia lead DPLL, or pushed for
> >a common interface :| Intel posted SyncE support. I asked them make it
> >a standalone API family:
>
> I will let the stats speak for itself.
> $ git shortlog -sne --no-merges net/devlink
> and prior to commit f05bd8ebeb69 devlink: move code to a dedicated directory
> $ git shortlog -sne --no-merges net/core/devlink.c
>
> More than 70% of the commits are authored by more than 10 different individuals
> form Mellanox/Nvidia ..
I'm not questioning that there are folks at nVidia who prefer to go
the generic infrastructure route. Jiri and mlxsw team especially.
I do worry that adding a pass thru interface will undermine them,
but that wasn't my main point.
> Ok you don't like DPLL,
I didn't say I dislike DPLL. I think it's a very odd example for
you to pick for nVidia's contribution. My recollection is:
- Maciej from Intel started developing upstream API for SyncE support
- I asked him to generalize it to DPLL, he started working on it
- nVidia expressed interest in creating a common interface, we thought
it'd be great to align vendors
- nVidia hired Maciej from Intel, shutting down Intel's progress for a while
- nVidia went AWoL, long response times, we held meetings to nudge
you along, no commitments
- then after months and months Jiri started helping Arkadiusz and Vadim
I remember thinking at the time that it must have been a terrible
experience for Intel, definitely not how cooperation upstream should
look :|
IDK how disconnected from upstream netdev you have to be to put that on
your banner.
> here is a list of some central devlink features we did
> push to the devlink standard API:
>
> - subfunction API and devlink infrastructure
> - Shared buffer API
> - port function and rate API
> - shared buffer
> - health
I guess shared buffer was important enough to mention twice? :)
> >https://lore.kernel.org/netdev/20210830162909.110753ec@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
> >
> >Vadim from Meta joined in and helped a lot based on the OCP time card.
> >Then after some delaying and weird noises y'all started participating.
> >
>
> I remember those discussions, and I agree it is very weird when it
> takes 3 vendors and 2 years to get a simple devlink API for single bit
> flip accepted.
In hindsight it was naive of us to try to wait for nVidia.
It's better to let one vendor blaze the trail and others to follow.
> >> i) mstreg
> >> The mlxreg utility allows users to obtain information regarding
> >> supported access registers, such as their fields
> >
> >So the access mstreg gives over this interface is read only? That's
> >what your description sounds like, but given your complaints about
> >"inability to add knobs" and "control" in the name of the driver that
> >must be false.
> >
>
> Yes this is enforced by the mlx5ctl driver and FW using the special
> debug uid.
So we trust the proprietary FW not to let the proprietary user space
do something out of scope. Got it.
> >> Other usecases with umem:
> >> - dynamic HW and FW trace monitoring
> >> - high frequency diagnostic counters sampling
> >
> >Ah yes, the high frequency counters. Something that is definitely
> >impossible to implement in a generic way. You were literally in the
> >room at netconf when David Ahern described his proposal for this.
> >
>
> I was in the room and I am in support of David's idea, I like it a lot,
> but I don't believe we have any concrete proposal, and we don't have any
> use case for it in netdev for now, our use case for this is currently RDMA
> and HPC specific.
>
> Also siimilar to devlink we will be the first to jump in and implement
> the new API once defined, but this doesn't mean I need to throw away the
I'm not asking to throw it away. The question is only whether you get
to push it upstream and skirt subsystem rules by posting a "misc" driver
without even CCing the maintainers on v1 :|
> whole driver just because one single use case will be implemented in netdev
> one day, and I am sure the netdev implementation won't satisfy all the
> use-cases of high frequency counters:
>
> Also keep in mind high frequency counters is a very small part of the debug
> and access capabilities the mlx5ctl interface offers.
>
> >Anyway, I don't want to waste any more time arguing with you.
> >My opinion is that the kernel community is under no obligation to carry
> >your proprietary gateway interfaces. I may be wrong, but I'm entitled
> >to my opinion.
>
> Thanks, I appreciate your honesty, but I must disagree with your Nack, we
> provided enough argument for why we believe this approach is the right
> way to go, it is clear from the responses on V3 and from the LWN article
> that we have the community support for this open source project.
Why don't you repost it to netdev and see how many acks you get?
I'm not the only netdev maintainer.
Fri, Feb 09, 2024 at 03:15:55AM CET, [email protected] wrote:
>On Wed, 7 Feb 2024 21:03:35 -0800 Saeed Mahameed wrote:
>> On 07 Feb 07:03, Jakub Kicinski wrote:
>> >On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote:
>> >> From: Saeed Mahameed <[email protected]>
[...]
>
>> Ok you don't like DPLL,
>
>I didn't say I dislike DPLL. I think it's a very odd example for
>you to pick for nVidia's contribution. My recollection is:
>
> - Maciej from Intel started developing upstream API for SyncE support
> - I asked him to generalize it to DPLL, he started working on it
> - nVidia expressed interest in creating a common interface, we thought
> it'd be great to align vendors
> - nVidia hired Maciej from Intel, shutting down Intel's progress for a while
> - nVidia went AWoL, long response times, we held meetings to nudge
> you along, no commitments
> - then after months and months Jiri started helping Arkadiusz and Vadim
>
>I remember thinking at the time that it must have been a terrible
>experience for Intel, definitely not how cooperation upstream should
>look :|
For the record, I spent huge amount of time reviewing the patchset and
ended up with redesigning significant chunks of it, steering Arkadiusz
and Vadim the way I felt is the correct one. Oftentimes I said to myself
it would be much quicker to take the patchset over and do it myself :)
Anyway, at the end, I think that the result is very good. Solid and well
defined uapi, nice kernel implementation, 3 drivers implementing it,
each with slightly different usecase, all clicks. If this is not
good example of upstream cooperation, I'm not sure what else is...
But, I don't think this is related to the misc driver discussion, I just
wanted to express my pov on dpll process, when I see people talking
about it :)
>
>IDK how disconnected from upstream netdev you have to be to put that on
>your banner.
[...]
On 2/8/24 7:15 PM, Jakub Kicinski wrote:
>>> Ah yes, the high frequency counters. Something that is definitely
>>> impossible to implement in a generic way. You were literally in the
>>> room at netconf when David Ahern described his proposal for this.
The key point of that proposal is host memory mapped to userspace where
H/W counters land (either via direct DMA by a H/W push or a
kthread/timer pulling in updates). That is similar to what is proposed here.
>>>
>>
>> I was in the room and I am in support of David's idea, I like it a lot,
>> but I don't believe we have any concrete proposal, and we don't have any
>> use case for it in netdev for now, our use case for this is currently RDMA
>> and HPC specific.
>>
>> Also siimilar to devlink we will be the first to jump in and implement
>> the new API once defined, but this doesn't mean I need to throw away the
>
> I'm not asking to throw it away. The question is only whether you get
> to push it upstream and skirt subsystem rules by posting a "misc" driver
> without even CCing the maintainers on v1 :|
Can you define what you mean by 'skirt subsystem rules'? That's a new
one to me.
BTW, there is already a broadcom driver under drivers/misc that seems to
have a lot of overlap capability wise to this driver. Perhaps a Broadcom
person could chime in.
>
>> whole driver just because one single use case will be implemented in netdev
>> one day, and I am sure the netdev implementation won't satisfy all the
>> use-cases of high frequency counters:
>>
>> Also keep in mind high frequency counters is a very small part of the debug
>> and access capabilities the mlx5ctl interface offers.
>>
>>> Anyway, I don't want to waste any more time arguing with you.
>>> My opinion is that the kernel community is under no obligation to carry
>>> your proprietary gateway interfaces. I may be wrong, but I'm entitled
>>> to my opinion.
>>
>> Thanks, I appreciate your honesty, but I must disagree with your Nack, we
>> provided enough argument for why we believe this approach is the right
>> way to go, it is clear from the responses on V3 and from the LWN article
>> that we have the community support for this open source project.
>
> Why don't you repost it to netdev and see how many acks you get?
> I'm not the only netdev maintainer.
I'll go out on that limb and say I would have no problem ACK'ing the
driver. It's been proven time and time again that these kinds of
debugging facilities are needed for these kinds of complex,
multifunction devices.
On Fri, 9 Feb 2024 15:42:16 -0700 David Ahern wrote:
> On 2/8/24 7:15 PM, Jakub Kicinski wrote:
> >> I was in the room and I am in support of David's idea, I like it a lot,
> >> but I don't believe we have any concrete proposal, and we don't have any
> >> use case for it in netdev for now, our use case for this is currently RDMA
> >> and HPC specific.
> >>
> >> Also siimilar to devlink we will be the first to jump in and implement
> >> the new API once defined, but this doesn't mean I need to throw away the
> >
> > I'm not asking to throw it away. The question is only whether you get
> > to push it upstream and skirt subsystem rules by posting a "misc" driver
> > without even CCing the maintainers on v1 :|
>
> Can you define what you mean by 'skirt subsystem rules'? That's a new
> one to me.
I mean that Saeed is well aware that direct FW <> user space interfaces
are not allowed in netdev, so he posted this "misc" driver without
CCing us, knowing we'd nack it.
Maybe the baseline question is whether major subsystems are allowed to
set their own rules. I think they should as historically we have a very
broad range of, eh, openness in different fields. Networking is pretty
open because of the necessary interoperability.
> BTW, there is already a broadcom driver under drivers/misc that seems to
> have a lot of overlap capability wise to this driver. Perhaps a Broadcom
> person could chime in.
I'm not aware. Or do you mean bcm-vk? That's a video encoder.
> >> Thanks, I appreciate your honesty, but I must disagree with your Nack, we
> >> provided enough argument for why we believe this approach is the right
> >> way to go, it is clear from the responses on V3 and from the LWN article
> >> that we have the community support for this open source project.
> >
> > Why don't you repost it to netdev and see how many acks you get?
> > I'm not the only netdev maintainer.
>
> I'll go out on that limb and say I would have no problem ACK'ing the
> driver. It's been proven time and time again that these kinds of
> debugging facilities are needed for these kinds of complex,
> multifunction devices.
Can we have bare minimum of honesty and stop pretending that his is
primarily about debug :|
On Fri, Feb 09, 2024 at 03:42:16PM -0700, David Ahern wrote:
> On 2/8/24 7:15 PM, Jakub Kicinski wrote:
> >>> Ah yes, the high frequency counters. Something that is definitely
> >>> impossible to implement in a generic way. You were literally in the
> >>> room at netconf when David Ahern described his proposal for this.
>
> The key point of that proposal is host memory mapped to userspace where
> H/W counters land (either via direct DMA by a H/W push or a
> kthread/timer pulling in updates). That is similar to what is proposed here.
The counter experiment that inspired Saeed to write about it here was
done using mlx5ctl interfaces and some other POC stuff on an RDMA
network monitoring RDMA workloads, inspecting RDMA objects.
So if your proposal also considers how to select RDMA object counters,
control the detailed sampling hardware with RDMA stuff, and works
on a netdev-free InfiniBand network, then it might be interesting.
It was actually interesting research, I hope some information will be
made public.
> BTW, there is already a broadcom driver under drivers/misc that seems to
> have a lot of overlap capability wise to this driver. Perhaps a Broadcom
> person could chime in.
Yeah, there are lots of examples of drivers that use this kind FW API
direct to userspace. It is a common design pattern across the kernel
in many subsystems. At the core it is following the general philosophy
of pushing things to userspace that don't need to be in the kernel. It
is more secure, more hackable and easier to deploy.
It becomes a userspace decision what kind of tooling community will
develop and what the ultimate user experience will be.
> > Why don't you repost it to netdev and see how many acks you get?
> > I'm not the only netdev maintainer.
>
> I'll go out on that limb and say I would have no problem ACK'ing the
> driver. It's been proven time and time again that these kinds of
> debugging facilities are needed for these kinds of complex,
> multifunction devices.
Agree as well. Ack for RDMA community. This is perfectly consistent
with the subsystem's existing design of directly exposing the device
to userspace. It is essential as we can't piggyback on any "generic"
netdev stuff on InfiniBand HW. Further, I anticipate most of the
mlx5ctl users would actually be running primarily RDMA related
workloads anyhow.
There is not that many people that buy these expensive cards and don't
use them to their full capability.
Recently at usenix Microsoft shared some details of their production
network in the paper "Empowering Azure Storage with RDMA".
Notably they shared that "Traffic statistics of all Azure public
regions between January 18 and February 16, 2023. Traffic was measured
by collecting switch counters of server-facing ports on all Top of
Rack (ToR) switches. Around 70% of traffic was RDMA."
It is a rare public insight into what is going on in the industry at
large, and why RDMA is a significant and important subsytem.
Jason
On 2/9/24 3:58 PM, Jakub Kicinski wrote:
> On Fri, 9 Feb 2024 15:42:16 -0700 David Ahern wrote:
>> On 2/8/24 7:15 PM, Jakub Kicinski wrote:
>>>> I was in the room and I am in support of David's idea, I like it a lot,
>>>> but I don't believe we have any concrete proposal, and we don't have any
>>>> use case for it in netdev for now, our use case for this is currently RDMA
>>>> and HPC specific.
>>>>
>>>> Also siimilar to devlink we will be the first to jump in and implement
>>>> the new API once defined, but this doesn't mean I need to throw away the
>>>
>>> I'm not asking to throw it away. The question is only whether you get
>>> to push it upstream and skirt subsystem rules by posting a "misc" driver
>>> without even CCing the maintainers on v1 :|
>>
>> Can you define what you mean by 'skirt subsystem rules'? That's a new
>> one to me.
>
> I mean that Saeed is well aware that direct FW <> user space interfaces
> are not allowed in netdev, so he posted this "misc" driver without
> CCing us, knowing we'd nack it.
The argument you are making here is that if a device has an ethernet
port, all patches must flow through netdev. Or am I misunderstanding?
There are a lot of devices that toggle between IB and ethernet with only
one (ignore roce here) active at a moment. MLX5 (like many) is split
between drivers/net and drivers/infinband. If the debugging capabilities
went through drivers/infiniband would you have the same stance?
Maybe my bigger question is should drivers that can do different
physical layers, or can operate without a netdev, should they be split
into different drivers? drivers/misc for the PCI interface, drivers/net
for ethernet interfaces and its APIs and drivers/infiniband for IB and
its APIs? Generic capabilities like this debugging interface belong to
the PCI device since it is generic to the device hence drivers/misc.
>
> Maybe the baseline question is whether major subsystems are allowed to
> set their own rules. I think they should as historically we have a very
> broad range of, eh, openness in different fields. Networking is pretty
> open because of the necessary interoperability.
Interoperability applies solely to networking protocols, not how a
device is designed and certainly not to how it can be debugged. There is
a clear difference between standard networking protocols (packets on a
wire and its headers) and device designs.
>
>> BTW, there is already a broadcom driver under drivers/misc that seems to
>> have a lot of overlap capability wise to this driver. Perhaps a Broadcom
>> person could chime in.
>
> I'm not aware. Or do you mean bcm-vk? That's a video encoder.
If that specific piece of S/W is a video encoder, why isn't it under
drivers/video? Scanning the code it seems to me to fall under the open
channel between userspace and F/W which is a common paradigm. But I do
not want this to distract from this patch set; really I was just
browsing existing drivers for any overlap.
On Fri, Feb 09, 2024 at 10:01:33PM -0700, David Ahern wrote:
> >> BTW, there is already a broadcom driver under drivers/misc that seems to
> >> have a lot of overlap capability wise to this driver. Perhaps a Broadcom
> >> person could chime in.
> >
> > I'm not aware. Or do you mean bcm-vk? That's a video encoder.
>
> If that specific piece of S/W is a video encoder, why isn't it under
> drivers/video? Scanning the code it seems to me to fall under the open
> channel between userspace and F/W which is a common paradigm. But I do
> not want this to distract from this patch set; really I was just
> browsing existing drivers for any overlap.
It is an "offload-engine" type of thing that was added before we had the
specific subsystem for it. It should be moved to drivers/accel/ one of
these days, want to volunteer to help out with that? :)
thanks,
greg k-h
On 2/9/24 6:01 PM, Jason Gunthorpe wrote:
> On Fri, Feb 09, 2024 at 03:42:16PM -0700, David Ahern wrote:
>> On 2/8/24 7:15 PM, Jakub Kicinski wrote:
>>>>> Ah yes, the high frequency counters. Something that is definitely
>>>>> impossible to implement in a generic way. You were literally in the
>>>>> room at netconf when David Ahern described his proposal for this.
>>
>> The key point of that proposal is host memory mapped to userspace where
>> H/W counters land (either via direct DMA by a H/W push or a
>> kthread/timer pulling in updates). That is similar to what is proposed here.
>
> The counter experiment that inspired Saeed to write about it here was
> done using mlx5ctl interfaces and some other POC stuff on an RDMA
> network monitoring RDMA workloads, inspecting RDMA objects.
>
> So if your proposal also considers how to select RDMA object counters,
> control the detailed sampling hardware with RDMA stuff, and works
> on a netdev-free InfiniBand network, then it might be interesting.
Response at netconf in September was mixed. As I recall Jakub for
example was shaking his head at 'yet another stats proposal', but since
he has referenced it a couple of times now maybe it is worth moving
beyond slides to a POC. The uapi discussed was netlink (genl) based;
driver hooks were not discussed. Perhaps I can get a working POC for
both stacks by netdevconf in July.
On 2/11/24 4:03 AM, Greg Kroah-Hartman wrote:
> On Fri, Feb 09, 2024 at 10:01:33PM -0700, David Ahern wrote:
>>>> BTW, there is already a broadcom driver under drivers/misc that seems to
>>>> have a lot of overlap capability wise to this driver. Perhaps a Broadcom
>>>> person could chime in.
>>>
>>> I'm not aware. Or do you mean bcm-vk? That's a video encoder.
>>
>> If that specific piece of S/W is a video encoder, why isn't it under
>> drivers/video? Scanning the code it seems to me to fall under the open
>> channel between userspace and F/W which is a common paradigm. But I do
>> not want this to distract from this patch set; really I was just
>> browsing existing drivers for any overlap.
>
> It is an "offload-engine" type of thing that was added before we had the
> specific subsystem for it. It should be moved to drivers/accel/ one of
> these days, want to volunteer to help out with that? :)
>
Thanks for the background.
As for volunteering to move it, I believe Broadcom has more kernel
engineers than Enfabrica. :-)
Chmining in late after a vacation last week, but I really do not
understand the discussion this cause.
Complex devices need diagnostics, and mlx5ctl and the (free software)
userspace tool provide that. Given how it is a complex multi-subsystem
driver that exports at least RDMA, ethernet, nvme and virtio_blk
interfaces this functionality by definition can't fit into a single
subsystem.
So with my nvme co-maintainer hat on:
Acked-by: Christoph Hellwig <[email protected]>
to th concept (which is not a full code review!).
With my busy kernel contributor head on I have to voice my
dissatisfaction with the subsystem maintainer overreach that's causing
the troubles here. I think all maintainers can and should voice the
opinions, be those technical or political, but trying to block a useful
feature without lots of precedence because it is vaguely related to the
subsystem is not helpful. Note that this is absolutely not intended to
shut the discussion down - if we can find valid arguments why some of
this functionality should also be reported through a netdev API we
should continue that. E.g. just because we have SCSI, NVMe or product
specific passthrough interface we still try to provide useful common
interface generically.
On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote:
> With my busy kernel contributor head on I have to voice my
> dissatisfaction with the subsystem maintainer overreach that's causing
> the troubles here.
Overreach is unfortunate, I'd love to say "please do merge it as part
of RDMA". You probably don't trust my opinion but Jason admitted himself
this is primarily for RDMA. RDMA is what it is in terms of openness and
all vendors trying to sell their secret magic sauce.
The problem is that some RDMA stuff is built really closely on TCP,
and given Jason's and co. inability to understand that good fences
make good neighbors it will soon start getting into the netdev stack :|
Ah, and I presume they may also want it for their DOCA products.
So 80% RDMA, 15% DOCA, 5% the rest is my guess.
> I think all maintainers can and should voice the
> opinions, be those technical or political, but trying to block a useful
> feature without lots of precedence because it is vaguely related to the
> subsystem is not helpful.
Not sure what you mean by "without lots of precedence" but you can ask
around netdev. We have nacked such interfaces multiple times.
The best proof the rule exists and is well established it is that Saeed
has himself asked us a number of times to lift it.
What should be expected of us is fairness and not engaging in politics.
We have a clear rule against opaque user space to FW interfaces,
and I don't see how we could enforce that fairly for pure Ethernet
devices if big vendors get to do whatever they want.
> Note that this is absolutely not intended to
> shut the discussion down - if we can find valid arguments why some of
> this functionality should also be reported through a netdev API we
> should continue that.
Once again, netdev requirements for debug are - you can do whatever you
want but other than knobs clearly defined in the code interface must be
read only.
On Wed, Feb 14, 2024 at 11:17:58AM -0500, Andy Gospodarek wrote:
> 1. How someone working at a distro would be able to help/understand if
> a tool like this was run and may have programmed their hardware
> differently than a default driver or FW.
This is stepping a bit further ahead of the debug focused interfaces
mlx5ctl currently has and into configuration..
Obviously a generic FW RPC can do configuration too.
We do have alot of industry experience with configuration already.
Realistically every complex device already has on-device FLASH that
has on-device FW configuration. Formalizing what already exists and is
widely used isn't going to make the world worse.
I'm also very certain there are actual problems with FW configuration
incompatibility. eg there are PCI related configurables on mlx5 you
can set that will break everything if you are not The Special User
that the configuration was created for.
Today everyone already deals with FW version specific behavioral
differences and bugs.
FW Vers + FW Vers&Config is also the current state of affairs, and is
delt with about the same. IMHO
> due to out-of-band configuration. One thought I had was some sort of
> journal to note that config happened from outside, but I'm not sure
> there is much value there.
I think devices using this kind of approach need to be well behaved -
like no wild (production) access to random bits of HW. Things need to
be structured and reportable.
There is a clear split in my mind between:
- inspection debugging
- invasive mutating debugging
- configuration
And maybe "invasive mutating debugging" taints the kernel or something
like that.
> With the ability to dump regs with devlink health it's possible to
> know that values may have changed, so I'm not concerned about this
> since that infra exists.
Distros need good tooling to report the state of HW in distro problem
reports. If something is lacking we should improve on it.
TBH I'd pick "report current status" over "journaling" because devices
have FLASH and configuration changes can typically be permanent across
reboots.
Also, I would generally frown on designs changing OS visible behavior
without running the device through a FLR.
> 2. If one can make configuration changes to hardware without kernel
> APIs (devlink et al), will people still develop new kernel APIs?
Sure? Why not? Every vendor has existing tooling to do this
configuration stuff. mlx5's existing tooling runs in userspace and
uses /sys/../resource. *Everyone* has been using this for the last 15
years. So whatever impact it has is long since baked in.
> I think the answer to this is 'yes' as realistically using default tools
> is much better than using vendor tools for regular configuration.
Personally I think the answer is more nuanced. Users have problems
they want to solve. If your problem is "provision my HW from the
factory" you may be perfectlly happy with a shared userspace program
that can do that for all the HW vendors at the site. No artificial
kernel enforced commonality required.
IMHO as kernel people we often look at the hammer we have and fall
into these patterns where "only the kernel can do abstractions!" but
it isn't true, we can very effectively make abstractions in userspace
too.
> if vendors provide shortcuts to program hardware for eval/testing/debug
> my experience is that these are not acceptable long-term. Requests are always
> made to include this type of changes in future releases. So I'm not too
> concerned about the ossification of kernel APIs due to this being included.
Nor am I. Users will ask for the things that work best for them and
vendors have been historically good at responding.
I also look at RDMA, which is deeply down this path of not doing
common interfaces in the kernel. We have alot of robust open source
now! The common interfaces that the userspace community developed are
wildly different and much more useful than what the prior effort to
build kernel common interfaces was creating. In fact there are now
competing ideas on what those interfaces should be and alot of
interesting development.
The kernel common APIs that were developed before turned out to be
such a straight jacket that it held back so much stuff and forced
people into out-of-tree and badly harmed community forming in the
userspace side.
In other words, allowing userspace to have freedom has really pushed
things forward in good ways.
> So if there is general agreement that this is acceptable (especially
> compared to other out-of-tree drivers, I think a few who find this
> useful should sync on the best way forward; I'm not sure a separate
> driver for each vendor is the right approach.
I also like this, I don't want the outcome of this discussion to be
that only mlx5ctl gets merged. I want all the HW that has this problem
to have support in the mainline kernel.
I want to see a robust multi-vendor userspace community that houses
alot of stuff and is working to solve user problems. To me this point
is the big win for the community.
If we need a formally named kernel subsystem to pull that community
together then lets do that. We can probably have some common ioctls
and shared uABI marshaling/discovery/etc like nvme does. At the end
it can't really be abstracted too much, the user facing API is really
going to be "send a FW RPC" like mlx5/nvme-vendor does.
Honestly it will be easier to discuss and document the overall design
and what devices have to do to be compliant within a tidy subsytem
with a name that people can identify as a thing. Naming things and
having spaces is really important to build community around them.
Certainly, I am up for this. If you have a similar usage flows I'm
quite happy to work with everyone to launch a new mini-subystem. If
other people are reading this and think they want to take part too
please speak up.
> If upstream (and therefore distros) are going to accept this we probably
> owe it to them to not have misc drivers for every different flavor of
> hardware out there when it might be possible to add a generic driver
> that can connect to a PCI device via new (auxiliary bus?) API.
Distros often prefer if they have less packaging, versioning and
community to deal with :)
I'm inspired by things like nvme-cli that show a nice way forward
where there can be vendor plugins and a shared github with a common
shared umbrella of CLI, documentation and packaging.
With some co-operation from the distros we can push vendors to
participate in the shared userspace, and we can push vendors from the
kernel by denying kernel driver support without basic integration to
the userspace (like DRM does with mesa).
Thanks,
Jason
On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote:
> There is a clear split in my mind between:
> - inspection debugging
> - invasive mutating debugging
> - configuration
Yes there's a clear split, and how are you going to enforce it on
an opaque interface? Put an "evil" bit in the common header?
> And maybe "invasive mutating debugging" taints the kernel or something
> like that.
On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote:
> On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote:
> > There is a clear split in my mind between:
> > - inspection debugging
> > - invasive mutating debugging
> > - configuration
>
> Yes there's a clear split, and how are you going to enforce it on
> an opaque interface? Put an "evil" bit in the common header?
The interface is opaque through a subsystem, it doesn't mean it is
completely opaque to every driver layer in the kernel. There is still a
HW specific kernel driver that delivers the FW command to the actual
HW.
In the mlx5 model the kernel driver stamps the command with "uid"
which is effectively a security scope label. This cannot be avoided by
userspace and is fundamental to why mlx5ctl is secure in a lockdown
kernel.
For example mlx5's FW interface has the concept of security scopes. We
have several defined today:
- Kernel
- Userspace rdma
- Userspace rdma with CAP_NET_RAW
- Userspace rdma with CAP_SYS_RAWIO
So we trivally add three more for the scopes I listed above. The
mlx5ctl driver as posted already introduced a new scope, for example.
Userspace will ask the kernel for an appropriate security scope after
opening the char-device. If userspace asks for invasive then you get a
taint. Issuing an invasive command without a kernel applied invasive
security label will be rejected by the FW.
We trust the kernel to apply the security label for the origin of the
command. We trust the the device FW to implement security scopes,
because these are RDMA devices and all of RDMA and all of SRIOV
virtualization are totally broken if the device FW cannot be trusted
to maintain security separation between scopes.
Jason
On Wed, Feb 14, 2024 at 12:29:16AM -0800, Christoph Hellwig wrote:
> Chmining in late after a vacation last week, but I really do not
> understand the discussion this cause.
>
> Complex devices need diagnostics, and mlx5ctl and the (free software)
> userspace tool provide that. Given how it is a complex multi-subsystem
> driver that exports at least RDMA, ethernet, nvme and virtio_blk
> interfaces this functionality by definition can't fit into a single
> subsystem.
>
> So with my nvme co-maintainer hat on:
>
> Acked-by: Christoph Hellwig <[email protected]>
>
> to th concept (which is not a full code review!).
I've been trying to figure out how to respond correctly to this over the
last few days, but I share your sentiment. It's probably time to have
something like this upstream for more devices. My initial concerns with
something that allows direct access to hardware to send messages to FW
or read/write registers over BARs were:
1. How someone working at a distro would be able to help/understand if
a tool like this was run and may have programmed their hardware
differently than a default driver or FW. There now exists a chance for
identical systems running identical drivers and FW to behave differently
due to out-of-band configuration. One thought I had was some sort of
journal to note that config happened from outside, but I'm not sure
there is much value there. With the ability to dump regs with
devlink health it's possible to know that values may have changed, so I'm not
concerned about this since that infra exists.
2. If one can make configuration changes to hardware without kernel
APIs (devlink et al), will people still develop new kernel APIs? I
think the answer to this is 'yes' as realistically using default tools
is much better than using vendor tools for regular configuration. Even
if vendors provide shortcuts to program hardware for eval/testing/debug
my experience is that these are not acceptable long-term. Requests are always
made to include this type of changes in future releases. So I'm not too
concerned about the ossification of kernel APIs due to this being included.
So if there is general agreement that this is acceptable (especially
compared to other out-of-tree drivers, I think a few who find this
useful should sync on the best way forward; I'm not sure a separate
driver for each vendor is the right approach.
If upstream (and therefore distros) are going to accept this we probably
owe it to them to not have misc drivers for every different flavor of
hardware out there when it might be possible to add a generic driver
that can connect to a PCI device via new (auxiliary bus?) API.
On 2/9/24 10:01 PM, David Ahern wrote:
> Maybe my bigger question is should drivers that can do different
> physical layers, or can operate without a netdev, should they be split
> into different drivers? drivers/misc for the PCI interface, drivers/net
> for ethernet interfaces and its APIs and drivers/infiniband for IB and
> its APIs? Generic capabilities like this debugging interface belong to
> the PCI device since it is generic to the device hence drivers/misc.
>
I do not recall seeing a response to this line of questions. We are
considering splitting our driver along these lines to upstream it given
the independent nature of the capabilities and need to interact with
different S/W APIs and hence different maintainers. Any reason new
drivers should not be split along these lines?
If the answer is no, then where is the best home for the PCI device
driver piece of it? drivers/misc or drivers/accel or create a new
drivers/multifcn/?
On Wed, Feb 14, 2024 at 01:31:29PM -0700, David Ahern wrote:
> On 2/9/24 10:01 PM, David Ahern wrote:
> > Maybe my bigger question is should drivers that can do different
> > physical layers, or can operate without a netdev, should they be split
> > into different drivers? drivers/misc for the PCI interface, drivers/net
> > for ethernet interfaces and its APIs and drivers/infiniband for IB and
> > its APIs? Generic capabilities like this debugging interface belong to
> > the PCI device since it is generic to the device hence drivers/misc.
> >
>
> I do not recall seeing a response to this line of questions. We are
> considering splitting our driver along these lines to upstream it given
> the independent nature of the capabilities and need to interact with
> different S/W APIs and hence different maintainers. Any reason new
> drivers should not be split along these lines?
I think it is essential to split them up like this, it is why auxbus
was created in the first place. The subsystem specific logic should
live in the subsystem space.
I've had some conversations where we have disagreements about which
functions end up in which directories in these split up drivers, but
the general principle is good.
I like mlx5's split where the core is about booting and communicating
with the FW and some common helpers then every subsystem has its own
FW calls in its own functions. Other people liked the idea that as
much "low level" code was in the core and the subsystems would call
helper functions in the core that invoke the FW functions.
> If the answer is no, then where is the best home for the PCI device
> driver piece of it? drivers/misc or drivers/accel or create a new
> drivers/multifcn/?
I've had conversations about this too, I would vote for a new
directory going forward. drivers/auxbus or something specifically for
these core components of an auxbus based multi-subsystem driver.
We likely need some reasonable rules so that the core driver remains
an in-kernel component and doesn't start growing weird uapi. If you
want uapi it needs to come through an appropriate subsystem specific
thing, even if that is misc.
Jason
Thu, Feb 15, 2024 at 08:00:40AM CET, [email protected] wrote:
>On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:
[...]
>> > I think all maintainers can and should voice the
>> > opinions, be those technical or political, but trying to block a useful
>> > feature without lots of precedence because it is vaguely related to the
>> > subsystem is not helpful.
>>
>> Not sure what you mean by "without lots of precedence" but you can ask
>
>Should have been with. Just about every subsystem with complex devices
>has this kind of direct interface for observability and co in at least
>some drivers.
What about configuration? How do you ensure the direct FW/HW access
is used only for observability/debug purposes. I mean, if you can't,
I think it is incorrect to name it that way, isn't it?
On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:
> Overreach is unfortunate, I'd love to say "please do merge it as part
> of RDMA". You probably don't trust my opinion but Jason admitted himself
> this is primarily for RDMA. RDMA is what it is in terms of openness and
> all vendors trying to sell their secret magic sauce.
Common. RDMA has two important open standards, one of them even done
in IETF that most open of all standards organizations.
> > I think all maintainers can and should voice the
> > opinions, be those technical or political, but trying to block a useful
> > feature without lots of precedence because it is vaguely related to the
> > subsystem is not helpful.
>
> Not sure what you mean by "without lots of precedence" but you can ask
Should have been with. Just about every subsystem with complex devices
has this kind of direct interface for observability and co in at least
some drivers.
On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:
> On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote:
> > With my busy kernel contributor head on I have to voice my
> > dissatisfaction with the subsystem maintainer overreach that's causing
> > the troubles here.
>
> Overreach is unfortunate, I'd love to say "please do merge it as part
> of RDMA". You probably don't trust my opinion but Jason admitted himself
> this is primarily for RDMA.
"admitted"? You make it sound like a crime. I've been very clear on
this need from the RDMA community since the first posting.
> The problem is that some RDMA stuff is built really closely on TCP,
Huh? Since when? Are you talking about soft-iwarp? That is a reasearch
project and Bernard is very responsive, if you have issues ask him and
he will help.
Otherwise the actual HW devices are not entangled with netdev TCP, the
few iWarp devices have their own TCP implementation, in accordance
with what the IETF standardized.
> and given Jason's and co. inability to understand that good fences
> make good neighbors it will soon start getting into the netdev stack :|
I seem to recall you saying RDMA shouldn't call any netdev APIs at
all. We were unable to agree on where to build the fence for this
reason.
> Ah, and I presume they may also want it for their DOCA products.
> So 80% RDMA, 15% DOCA, 5% the rest is my guess.
I don't know all details about DOCA, but what I know about runs over
RDMA.
> Not sure what you mean by "without lots of precedence" but you can ask
> around netdev. We have nacked such interfaces multiple times.
> The best proof the rule exists and is well established it is that Saeed
> has himself asked us a number of times to lift it.
>
> What should be expected of us is fairness and not engaging in politics.
> We have a clear rule against opaque user space to FW interfaces,
> and I don't see how we could enforce that fairly for pure Ethernet
> devices if big vendors get to do whatever they want.
If your community is telling your rules are not working for them
anymore, it is not nice to tell them that rules exist and cannot be
questioned. Try working together toward a reasonable consensus
solution.
The world has changed alot, the use cases are different, the users are
different, the devices are different. When Dave made that prohibition
long ago it was not in a world of a multi billion transistor NIC being
deployed in uniform clusters of unimaginable size.
Jason
On Wed, 14 Feb 2024 23:00:40 -0800 Christoph Hellwig wrote:
> On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:
> > Overreach is unfortunate, I'd love to say "please do merge it as part
> > of RDMA". You probably don't trust my opinion but Jason admitted himself
> > this is primarily for RDMA. RDMA is what it is in terms of openness and
> > all vendors trying to sell their secret magic sauce.
>
> Common. RDMA has two important open standards, one of them even done
> in IETF that most open of all standards organizations.
While I don't dispute that there are standards which can be read,
the practical interoperability of RDMA devices is extremely low.
By practical I mean having two devices from different vendors
achieve any reasonable performance talking to each other.
Even two devices from _the same_ vendor but different generations
are unlikely to perform.
Given how RDMA is deployed (uniform, greenfield/full replacement)
this is entirely reasonable from the engineering perspective.
But this is a bit of a vicious cycle, vendors have little incentive
to interoperate, and primarily focus on adding secret sauce outside of
the standard. In fact you're lucky if the vendor didn't bake some
extension which requires custom switches into the NICs :(
Compare that to WiFi, which is a level of standardization netdev folks
are more accustomed to. You can connect a new device from vendor X to
a 10 year old AP from vendor Y and it will run with high perf.
Unfortunately because of the AI craze I have some experience
with RDMA deployments now. Perhaps you have more, perhaps your
experience differs.
On Thu, 15 Feb 2024 09:21:38 -0400 Jason Gunthorpe wrote:
> On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:
> > On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote:
> > > With my busy kernel contributor head on I have to voice my
> > > dissatisfaction with the subsystem maintainer overreach that's causing
> > > the troubles here.
> >
> > Overreach is unfortunate, I'd love to say "please do merge it as part
> > of RDMA". You probably don't trust my opinion but Jason admitted himself
> > this is primarily for RDMA.
>
> "admitted"? You make it sound like a crime. I've been very clear on
> this need from the RDMA community since the first posting.
Sorry, unintentional :) Maybe it's a misunderstanding but my impression
was that at least Saeed was trying hard to make this driver a common
one, not just for RDMA.
> > The problem is that some RDMA stuff is built really closely on TCP,
>
> Huh? Since when? Are you talking about soft-iwarp? That is a reasearch
> project and Bernard is very responsive, if you have issues ask him and
> he will help.
>
> Otherwise the actual HW devices are not entangled with netdev TCP, the
> few iWarp devices have their own TCP implementation, in accordance
> with what the IETF standardized.
There are some things I know either from work or otherwise told me
in confidence which I can't share. This is very frustrating for
me, and probably for you, sorry :(
> > and given Jason's and co. inability to understand that good fences
> > make good neighbors it will soon start getting into the netdev stack :|
>
> I seem to recall you saying RDMA shouldn't call any netdev APIs at
> all. We were unable to agree on where to build the fence for this
> reason.
Last time you were calling into the IPsec stack right? It's not just
a basic API. IDK how to draw a line, definitely open to suggestions!
> > Ah, and I presume they may also want it for their DOCA products.
> > So 80% RDMA, 15% DOCA, 5% the rest is my guess.
>
> I don't know all details about DOCA, but what I know about runs over
> RDMA.
Well, since you're an RDMA person that's not really saying much
about existence of other parts.
On Wed, 14 Feb 2024 14:37:55 -0400 Jason Gunthorpe wrote:
> On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote:
> > On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote:
> > > There is a clear split in my mind between:
> > > - inspection debugging
> > > - invasive mutating debugging
> > > - configuration
> >
> > Yes there's a clear split, and how are you going to enforce it on
> > an opaque interface? Put an "evil" bit in the common header?
>
> The interface is opaque through a subsystem, it doesn't mean it is
> completely opaque to every driver layer in the kernel. There is still a
> HW specific kernel driver that delivers the FW command to the actual
> HW.
>
> In the mlx5 model the kernel driver stamps the command with "uid"
> which is effectively a security scope label. This cannot be avoided by
> userspace and is fundamental to why mlx5ctl is secure in a lockdown
> kernel.
>
> For example mlx5's FW interface has the concept of security scopes. We
> have several defined today:
> - Kernel
> - Userspace rdma
> - Userspace rdma with CAP_NET_RAW
> - Userspace rdma with CAP_SYS_RAWIO
>
> So we trivally add three more for the scopes I listed above. The
> mlx5ctl driver as posted already introduced a new scope, for example.
>
> Userspace will ask the kernel for an appropriate security scope after
> opening the char-device. If userspace asks for invasive then you get a
> taint. Issuing an invasive command without a kernel applied invasive
> security label will be rejected by the FW.
>
> We trust the kernel to apply the security label for the origin of the
> command. We trust the the device FW to implement security scopes,
> because these are RDMA devices and all of RDMA and all of SRIOV
> virtualization are totally broken if the device FW cannot be trusted
> to maintain security separation between scopes.
You have changed the argument.
The problem Andy was raising is that users having access to low level
configuration will make it impossible for distro's support to tell
device configuration. There won't be any trace of activity at the OS
level.
To which you replied that you can differentiate between debugging and
configuration on an opaque interface, _in the kernel_.
Which I disagree with, obviously.
And now you're saying that you can maintain security if you trust
the firmware to enforce some rules.
I'm not talking about security here, the evil bit is just an example
of an unsound design.
On 2/15/24 6:10 PM, Jakub Kicinski wrote:
>>> and given Jason's and co. inability to understand that good fences
>>> make good neighbors it will soon start getting into the netdev stack :|
>>
>> I seem to recall you saying RDMA shouldn't call any netdev APIs at
>> all. We were unable to agree on where to build the fence for this
>> reason.
>
> Last time you were calling into the IPsec stack right? It's not just
> a basic API. IDK how to draw a line, definitely open to suggestions!
>
In general, each subsystem should focus on proper APIs that other
subsystems can leverage and not worry about how people wire them up.
That NVME, RDMA/IB and io_uring want to re-use the Linux networking
stack is a good thing.
On Thu, Feb 15, 2024 at 05:40:34PM -0800, Jakub Kicinski wrote:
> On Wed, 14 Feb 2024 14:37:55 -0400 Jason Gunthorpe wrote:
> > On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote:
> > > On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote:
> > > > There is a clear split in my mind between:
> > > > - inspection debugging
> > > > - invasive mutating debugging
> > > > - configuration
> > >
> > > Yes there's a clear split, and how are you going to enforce it on
> > > an opaque interface? Put an "evil" bit in the common header?
> >
> > The interface is opaque through a subsystem, it doesn't mean it is
> > completely opaque to every driver layer in the kernel. There is still a
> > HW specific kernel driver that delivers the FW command to the actual
> > HW.
> >
> > In the mlx5 model the kernel driver stamps the command with "uid"
> > which is effectively a security scope label. This cannot be avoided by
> > userspace and is fundamental to why mlx5ctl is secure in a lockdown
> > kernel.
> >
> > For example mlx5's FW interface has the concept of security scopes. We
> > have several defined today:
> > - Kernel
> > - Userspace rdma
> > - Userspace rdma with CAP_NET_RAW
> > - Userspace rdma with CAP_SYS_RAWIO
> >
> > So we trivally add three more for the scopes I listed above. The
> > mlx5ctl driver as posted already introduced a new scope, for example.
> >
> > Userspace will ask the kernel for an appropriate security scope after
> > opening the char-device. If userspace asks for invasive then you get a
> > taint. Issuing an invasive command without a kernel applied invasive
> > security label will be rejected by the FW.
> >
> > We trust the kernel to apply the security label for the origin of the
> > command. We trust the the device FW to implement security scopes,
> > because these are RDMA devices and all of RDMA and all of SRIOV
> > virtualization are totally broken if the device FW cannot be trusted
> > to maintain security separation between scopes.
>
> You have changed the argument.
I explained how the technical bits of a part work, you clipped out my
answer to Andy's concern.
> The problem Andy was raising is that users having access to low level
> configuration will make it impossible for distro's support to tell
> device configuration. There won't be any trace of activity at the OS
> level.
I responded to that by saying the answer is to have robust dumping of
the device configuration and suggested a taint bit if changes are made
to the device outside that support envelope and can't be captured in
the dumps.
This first part is already what everyone already does. There is some
supported configuration in flash and there are tools to dump and
inspect this. The field teams understand they need to look at that,
and existing data collection tools already capture this stuff. I don't
view we have a real problem here.
The step beyond Andy was talking about is the hypothetical "what if
you touch random unsafe registers directly or something" Which
probably shouldn't be allowed on a lockdown kernel, but assuming a
device could do so safely, my answer was to trigger a taint.
Then you asked how do you trigger a taint if the kernel doesn't parse
the commands.
> To which you replied that you can differentiate between debugging and
> configuration on an opaque interface, _in the kernel_.
I did not say "_in the kernel_" meaning the kernel would do it, I
meant the kernel would ensure it is done.
The kernel delegates the differentiation to the FW and it trusts the
FW to do that work for it.
> Which I disagree with, obviously.
I don't know why? How is it obvious?
> And now you're saying that you can maintain security if you trust
> the firmware to enforce some rules.
Right. The userspace sends a command. The kernel tags it with the
permission the userspace has. The FW parses the command and checks the
kernel supplied permission against the command content and permits it.
We trust the FW to do the restriction on behalf of the kernel.
The restriction we are talking about is containing the userspace to
only operate within the distro's support envelope. The kernel can ask
the FW to enforce that rule as a matter of security and trust the FW
to do so.
Jason
On Thu, Feb 15, 2024 at 05:00:46PM -0800, Jakub Kicinski wrote:
> But this is a bit of a vicious cycle, vendors have little incentive
> to interoperate, and primarily focus on adding secret sauce outside of
> the standard. In fact you're lucky if the vendor didn't bake some
> extension which requires custom switches into the NICs :(
This may all seem shocking if you come from the netdev world, but this
has been normal for HPC networking for the last 30 years at least.
My counter perspective would be that we are currently in a pretty good
moment for HPC industry because we actually have open source
implementations for most of it. In fact most actual deployments are
running something quite close to the mainline open source stack.
The main hold out right now is Cray/HPE's Slingshot networking family
(based on ethernet apparently), but less open source.
I would say the HPC community has a very different community goal post
that netdev land. Make your thing, whatever it is. Come with an open
kernel driver, a open rdma-core, a open libfabric/ucx and plug into
the open dpdk/nccl/ucx/libfabric layer and demonstrate your thing
works with openmpi/etc applications.
Supporting that open stack is broadly my north star for the kernel
perspective as Mesa is to DRM.
Several points of this chain are open industry standards driven by
technical working group communities.
This is what the standardization and interoperability looks like
here. It is probably totally foreign from a netdev view point, far
less focus on the wire protocol, devices and kernel. Here the focus is
on application and software interoperability. Still, it is open in
a pretty solid way.
Jason
On Thu, Feb 15, 2024 at 05:10:13PM -0800, Jakub Kicinski wrote:
> On Thu, 15 Feb 2024 09:21:38 -0400 Jason Gunthorpe wrote:
> > On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:
> > > On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote:
> > > > With my busy kernel contributor head on I have to voice my
> > > > dissatisfaction with the subsystem maintainer overreach that's causing
> > > > the troubles here.
> > >
> > > Overreach is unfortunate, I'd love to say "please do merge it as part
> > > of RDMA". You probably don't trust my opinion but Jason admitted himself
> > > this is primarily for RDMA.
> >
> > "admitted"? You make it sound like a crime. I've been very clear on
> > this need from the RDMA community since the first posting.
>
> Sorry, unintentional :) Maybe it's a misunderstanding but my impression
> was that at least Saeed was trying hard to make this driver a common
> one, not just for RDMA.
The hardware is common, this is a driver to talk to the shared FW. I
don't know how you'd do just one when netdev is effectively an RDMA
application running in the kernel, from the perspective of the FW.
There is no real line between these things beyond the artificial one
we have created in the kernel.
> > > The problem is that some RDMA stuff is built really closely on TCP,
> >
> > Huh? Since when? Are you talking about soft-iwarp? That is a reasearch
> > project and Bernard is very responsive, if you have issues ask him and
> > he will help.
> >
> > Otherwise the actual HW devices are not entangled with netdev TCP, the
> > few iWarp devices have their own TCP implementation, in accordance
> > with what the IETF standardized.
>
> There are some things I know either from work or otherwise told me
> in confidence which I can't share. This is very frustrating for
> me, and probably for you, sorry :(
Well, all I can say is I know of no forthcoming RDMA things with any
different relationship to TCP. I think if someone wants to more TCP
they will have a hard time, and I'm not inclined to seriously help
anyone get more TCP into RDMA. iWarp is trouble enough already.
> > I seem to recall you saying RDMA shouldn't call any netdev APIs at
> > all. We were unable to agree on where to build the fence for this
> > reason.
>
> Last time you were calling into the IPsec stack right? It's not just
> a basic API. IDK how to draw a line, definitely open to suggestions!
I thought the two halfs of the mlx5 driver were co-ordinating their
usage of the shared HW around the ipsec configuration pushed into the
device by netdev xfrm.
> > > Ah, and I presume they may also want it for their DOCA products.
> > > So 80% RDMA, 15% DOCA, 5% the rest is my guess.
> >
> > I don't know all details about DOCA, but what I know about runs over
> > RDMA.
>
> Well, since you're an RDMA person that's not really saying much
> about existence of other parts.
<shrug> why does DOCA matter? Should we have not done io_uring because
Oracle might use it? Besides, from what I know about DOCA it is almost
all data plane stuff and RDMA fully covers that already..
Regards,
Jason
On 07/02/2024 08:24, Saeed Mahameed wrote:
> +static int mlx5ctl_open(struct inode *inode, struct file *file)
> +{
> + struct mlx5_core_dev *mdev;
> + struct mlx5ctl_dev *mcdev;
> + struct mlx5ctl_fd *mfd;
> + int err = 0;
> +
> + mcdev = container_of(file->private_data, struct mlx5ctl_dev, miscdev);
> + mcdev_get(mcdev);
> + down_read(&mcdev->rw_lock);
> + mdev = mcdev->mdev;
> + if (!mdev) {
> + err = -ENODEV;
> + goto unlock;
> + }
> +
> + mfd = kzalloc(sizeof(*mfd), GFP_KERNEL_ACCOUNT);
> + if (!mfd)
> + return -ENOMEM;
goto unlock?
Or why not reorder so you always allocate this before doing anything else?
> +
> + mfd->mcdev = mcdev;
> + err = mlx5ctl_open_mfd(mfd);
> + if (err)
> + goto unlock;
> +
> + spin_lock(&mcdev->fd_list_lock);
> + list_add_tail(&mfd->list, &mcdev->fd_list);
> + spin_unlock(&mcdev->fd_list_lock);
> +
> + file->private_data = mfd;
> +
> +unlock:
> + up_read(&mcdev->rw_lock);
> + if (err) {
> + mcdev_put(mcdev);
> + kfree(mfd);
> + }
> + return err;
> +}
Vegard
On 07/02/2024 08:24, Saeed Mahameed wrote:
> +static int mlx5ctl_info_ioctl(struct file *file,
> + struct mlx5ctl_info __user *arg,
> + size_t usize)
> +{
> + struct mlx5ctl_fd *mfd = file->private_data;
> + size_t ksize = sizeof(struct mlx5ctl_info);
> + struct mlx5ctl_dev *mcdev = mfd->mcdev;
> + struct mlx5_core_dev *mdev = mcdev->mdev;
> + struct mlx5ctl_info *info;
> + int err = 0;
> +
> + if (usize < ksize)
> + return -EINVAL;
> +
> + info = kzalloc(ksize, GFP_KERNEL);
> + if (!info)
> + return -ENOMEM;
struct mlx5ctl_info is small, why not put it on the stack or even copy
it directly from the original object, assuming it has no holes/padding?
> +
> + info->dev_uctx_cap = MLX5_CAP_GEN(mdev, uctx_cap);
> + info->uctx_cap = mfd->uctx_cap;
> + info->uctx_uid = mfd->uctx_uid;
> + info->ucap = mfd->ucap;
> +
> + if (copy_to_user(arg, info, ksize))
> + err = -EFAULT;
> +
> + kfree(info);
> + return err;
> +}
Is there even a remote possibility of extending this structure in the
future? If so the size check will not allow you to be backwards
compatible. Should there be a version field in there or would you
just add a new ioctl altogether? Adding [email protected] to Cc.
> diff --git a/include/uapi/misc/mlx5ctl.h b/include/uapi/misc/mlx5ctl.h
> new file mode 100644
> index 000000000000..9be944128025
> --- /dev/null
> +++ b/include/uapi/misc/mlx5ctl.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 WITH Linux-syscall-note */
> +/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
> +
> +#ifndef __MLX5CTL_IOCTL_H__
> +#define __MLX5CTL_IOCTL_H__
> +
> +struct mlx5ctl_info {
> + __u16 uctx_uid; /* current process allocated UCTX UID */
> + __u16 reserved1; /* explicit padding must be zero */
> + __u32 uctx_cap; /* current process effective UCTX cap */
> + __u32 dev_uctx_cap; /* device's UCTX capabilities */
> + __u32 ucap; /* process user capability */
> +};
> +
> +#define MLX5CTL_IOCTL_MAGIC 0x5c
> +
> +#define MLX5CTL_IOCTL_INFO \
> + _IOR(MLX5CTL_IOCTL_MAGIC, 0x0, struct mlx5ctl_info)
> +
> +#endif /* __MLX5CTL_IOCTL_H__ */
Should you add anything to Documentation/ABI/ ? (Or add other
documentation for this driver?)
Vegard
On 07/02/2024 08:24, Saeed Mahameed wrote:
> @@ -328,6 +420,11 @@ static int mlx5ctl_probe(struct auxiliary_device *adev,
> goto abort;
> }
>
> + err = sysfs_create_link_nowarn(&mcdev->miscdev.this_device->kobj,
> + &mdev->device->kobj, "mdev");
> + if (err)
> + mlx5ctl_dbg(mcdev, "mlx5ctl: failed to create sysfs link err %d\n", err);
> +
Should this propagate the error to the caller?
What happens if/when mlx5ctl_remove()/sysfs_remove_link() gets called
later for this kobj?
Vegard
On 07/02/2024 08:24, Saeed Mahameed wrote:
> +static int mlx5ctl_ioctl_umem_unreg(struct file *file,
> + struct mlx5ctl_umem_unreg __user *arg,
> + size_t usize)
> +{
> + size_t ksize = sizeof(struct mlx5ctl_umem_unreg);
> + struct mlx5ctl_fd *mfd = file->private_data;
> + struct mlx5ctl_umem_unreg *umem_unreg;
> + int err = 0;
> +
> + if (usize < ksize)
> + return -EINVAL;
> +
> + umem_unreg = kzalloc(ksize, GFP_KERNEL);
> + if (!umem_unreg)
> + return -ENOMEM;
> +
> + if (copy_from_user(umem_unreg, arg, ksize)) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + if (umem_unreg->reserved) {
> + err = -EOPNOTSUPP;
> + goto out;
> + }
> +
> + err = mlx5ctl_umem_unreg(mfd->umem_db, umem_unreg->umem_id);
> +
> + if (!err && copy_to_user(arg, umem_unreg, ksize))
> + err = -EFAULT;
Why do you copy this back to userspace, does it even change?
> +static struct mlx5ctl_umem *mlx5ctl_umem_pin(struct mlx5ctl_umem_db *umem_db,
> + unsigned long addr, size_t size)
> +{
> + size_t npages = umem_num_pages(addr, size);
> + struct mlx5_core_dev *mdev = umem_db->mdev;
> + unsigned long endaddr = addr + size;
> + struct mlx5ctl_umem *umem;
> + struct page **page_list;
> + int err = -EINVAL;
> + int pinned = 0;
> +
> + dev_dbg(mdev->device, "%s: addr %p size %zu npages %zu\n",
> + __func__, (void __user *)addr, size, npages);
> +
> + /* Avoid integer overflow */
> + if (endaddr < addr || PAGE_ALIGN(endaddr) < endaddr)
> + return ERR_PTR(-EINVAL);
There is a check_add_overflow() macro in <linux/overflow.h>, should that
be used instead?
> +
> + if (npages == 0 || PAGES_2_MB(npages) > MLX5CTL_UMEM_MAX_MB)
> + return ERR_PTR(-EINVAL);
> +
> + page_list = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL_ACCOUNT);
> + if (!page_list)
> + return ERR_PTR(-ENOMEM);
> +
> + umem = kzalloc(sizeof(*umem), GFP_KERNEL_ACCOUNT);
> + if (!umem) {
> + kvfree(page_list);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + umem->addr = addr;
> + umem->size = size;
> + umem->offset = addr & ~PAGE_MASK;
> + umem->npages = npages;
> +
> + umem->page_list = page_list;
> + umem->source_mm = current->mm;
> + umem->source_task = current->group_leader;
> + get_task_struct(current->group_leader);
> + umem->source_user = get_uid(current_user());
I think you could do this, right?
umem->source_task = get_task_struct(current->group_leader);
umem->source_user = get_current_user();
> +
> + /* mm and RLIMIT_MEMLOCK user task accounting similar to what is
> + * being done in iopt_alloc_pages() and do_update_pinned()
> + * for IOPT_PAGES_ACCOUNT_USER @drivers/iommu/iommufd/pages.c
> + */
> + mmgrab(umem->source_mm);
> +
> + pinned = pin_user_pages_fast(addr, npages, FOLL_WRITE, page_list);
> + if (pinned != npages) {
> + dev_dbg(mdev->device, "pin_user_pages_fast failed %d\n", pinned);
> + err = pinned < 0 ? pinned : -ENOMEM;
> + goto pin_failed;
> + }
> +
> + err = inc_user_locked_vm(umem, npages);
> + if (err)
> + goto pin_failed;
> +
> + atomic64_add(npages, &umem->source_mm->pinned_vm);
> +
> + err = sg_alloc_table_from_pages(&umem->sgt, page_list, npages, 0,
> + npages << PAGE_SHIFT, GFP_KERNEL_ACCOUNT);
> + if (err) {
> + dev_dbg(mdev->device, "sg_alloc_table failed: %d\n", err);
> + goto sgt_failed;
Is this correct? See below...
> + }
> +
> + dev_dbg(mdev->device, "\tsgt: size %zu npages %zu sgt.nents (%d)\n",
> + size, npages, umem->sgt.nents);
> +
> + err = dma_map_sgtable(mdev->device, &umem->sgt, DMA_BIDIRECTIONAL, 0);
> + if (err) {
> + dev_dbg(mdev->device, "dma_map_sgtable failed: %d\n", err);
> + goto dma_failed;
> + }
> +
> + dev_dbg(mdev->device, "\tsgt: dma_nents %d\n", umem->sgt.nents);
> + return umem;
> +
> +dma_failed:
> +sgt_failed:
> + sg_free_table(&umem->sgt);
Not sure about calling sg_free_table() if sg_alloc_table_from_pages()
failed.
> + atomic64_sub(npages, &umem->source_mm->pinned_vm);
> + dec_user_locked_vm(umem, npages);
> +pin_failed:
> + if (pinned > 0)
> + unpin_user_pages(page_list, pinned);
> + mmdrop(umem->source_mm);
> + free_uid(umem->source_user);
> + put_task_struct(umem->source_task);
> +
> + kfree(umem);
> + kvfree(page_list);
> + return ERR_PTR(err);
> +}
Vegard
On 29 Feb 12:49, Vegard Nossum wrote:
>
>On 07/02/2024 08:24, Saeed Mahameed wrote:
>>@@ -328,6 +420,11 @@ static int mlx5ctl_probe(struct auxiliary_device *adev,
>> goto abort;
>> }
>>+ err = sysfs_create_link_nowarn(&mcdev->miscdev.this_device->kobj,
>>+ &mdev->device->kobj, "mdev");
>>+ if (err)
>>+ mlx5ctl_dbg(mcdev, "mlx5ctl: failed to create sysfs link err %d\n", err);
>>+
>
>Should this propagate the error to the caller?
>
this link is informational only and not necessary for the driver function,
it meant to help user-space apps to associate mlx5ctl driver with it parent
mlx5_core device.
>What happens if/when mlx5ctl_remove()/sysfs_remove_link() gets called
>later for this kobj?
>
sysfs_remove_link() will eventually call kernfs_remove_by_name_ns()
and it will return -ENOENT; if not found, and it will be silently dropped.
On 29 Feb 12:47, Vegard Nossum wrote:
>
>On 07/02/2024 08:24, Saeed Mahameed wrote:
>>+static int mlx5ctl_info_ioctl(struct file *file,
>>+ struct mlx5ctl_info __user *arg,
>>+ size_t usize)
>>+{
>>+ struct mlx5ctl_fd *mfd = file->private_data;
>>+ size_t ksize = sizeof(struct mlx5ctl_info);
>>+ struct mlx5ctl_dev *mcdev = mfd->mcdev;
>>+ struct mlx5_core_dev *mdev = mcdev->mdev;
>>+ struct mlx5ctl_info *info;
>>+ int err = 0;
>>+
>>+ if (usize < ksize)
>>+ return -EINVAL;
>>+
>>+ info = kzalloc(ksize, GFP_KERNEL);
>>+ if (!info)
>>+ return -ENOMEM;
>
>struct mlx5ctl_info is small, why not put it on the stack or even copy
>it directly from the original object, assuming it has no holes/padding?
>
There's no original object, but yes storing it on the stack should work.
>>+
>>+ info->dev_uctx_cap = MLX5_CAP_GEN(mdev, uctx_cap);
>>+ info->uctx_cap = mfd->uctx_cap;
>>+ info->uctx_uid = mfd->uctx_uid;
>>+ info->ucap = mfd->ucap;
>>+
>>+ if (copy_to_user(arg, info, ksize))
>>+ err = -EFAULT;
>>+
>>+ kfree(info);
>>+ return err;
>>+}
>
>Is there even a remote possibility of extending this structure in the
>future? If so the size check will not allow you to be backwards
>compatible. Should there be a version field in there or would you
>just add a new ioctl altogether? Adding [email protected] to Cc.
>
This was my original implementation, but Greg's preference is to allow no
extension to the ioctl structures, in case of extension required, new IOCTL
and structure should be introduced.
>>diff --git a/include/uapi/misc/mlx5ctl.h b/include/uapi/misc/mlx5ctl.h
>>new file mode 100644
>>index 000000000000..9be944128025
>>--- /dev/null
>>+++ b/include/uapi/misc/mlx5ctl.h
>>@@ -0,0 +1,20 @@
>>+/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 WITH Linux-syscall-note */
>>+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
>>+
>>+#ifndef __MLX5CTL_IOCTL_H__
>>+#define __MLX5CTL_IOCTL_H__
>>+
>>+struct mlx5ctl_info {
>>+ __u16 uctx_uid; /* current process allocated UCTX UID */
>>+ __u16 reserved1; /* explicit padding must be zero */
>>+ __u32 uctx_cap; /* current process effective UCTX cap */
>>+ __u32 dev_uctx_cap; /* device's UCTX capabilities */
>>+ __u32 ucap; /* process user capability */
>>+};
>>+
>>+#define MLX5CTL_IOCTL_MAGIC 0x5c
>>+
>>+#define MLX5CTL_IOCTL_INFO \
>>+ _IOR(MLX5CTL_IOCTL_MAGIC, 0x0, struct mlx5ctl_info)
>>+
>>+#endif /* __MLX5CTL_IOCTL_H__ */
>
>Should you add anything to Documentation/ABI/ ? (Or add other
>documentation for this driver?)
>
The driver doesn't expose any sysfs other than the IOCTLs, but yes
a documentation might be useful to make sure ABI is stable, most of the
other drivers point out to the uapi header for documentation.
On 29 Feb 12:44, Vegard Nossum wrote:
>
>On 07/02/2024 08:24, Saeed Mahameed wrote:
>>+static int mlx5ctl_open(struct inode *inode, struct file *file)
>>+{
>>+ struct mlx5_core_dev *mdev;
>>+ struct mlx5ctl_dev *mcdev;
>>+ struct mlx5ctl_fd *mfd;
>>+ int err = 0;
>>+
>>+ mcdev = container_of(file->private_data, struct mlx5ctl_dev, miscdev);
>>+ mcdev_get(mcdev);
>>+ down_read(&mcdev->rw_lock);
>>+ mdev = mcdev->mdev;
>>+ if (!mdev) {
>>+ err = -ENODEV;
>>+ goto unlock;
>>+ }
>>+
>>+ mfd = kzalloc(sizeof(*mfd), GFP_KERNEL_ACCOUNT);
>>+ if (!mfd)
>>+ return -ENOMEM;
>
>goto unlock?
>
>Or why not reorder so you always allocate this before doing anything else?
Good catch !, thanks I will reorder.
On Wed, Feb 14, 2024 at 01:57:35PM -0400, Jason Gunthorpe wrote:
> I also like this, I don't want the outcome of this discussion to be
> that only mlx5ctl gets merged. I want all the HW that has this problem
> to have support in the mainline kernel.
To this end here is my proposal to move forward with a new
mini-subsystem to provide rules for this common approach. Get the
existing tools out of hacky unstructured direct hardware access via
/sys/XX and into a formalized lockdown compatible system. I've talked
to enough people now to think we have a critical mass.
===============
fwctl subsystem
===============
Overview
--------
Modern devices contain extensive amounts of FW, and in many cases, are largely
software defined pieces of hardware. The evolution of this approach is largely a
reaction to Moore's Law where a chip tape out is now highly expensive, and the
chip design is extremely large. Replacing fixed HW logic with a flexible and
tightly coupled FW/HW combination is an effective risk mitigation against chip
respin. Problems in the HW design can be counteracted in device FW. This is
especially true for devices which present a stable and backwards compatible
interface to the operating system driver (such as NVMe).
The FW layer in devices has grown to incredible sizes and devices frequently
integrate clusters of fast processors to run it. For example, mlx5 devices have
over 30MB of FW code, and big configurations operate with over1GB of FW managed
runtime state.
The availability of such a flexible layer has created quite a variety in the
industry where single pieces of silicon are now configurable software defined
devices and can operate in quite different ways depending on the need. Further
we often see cases where specific sites wish to operate devices in ways that are
highly specialized and can only run an application that has been tailored to
their unique configuration.
Further, devices have become multi-functional and integrated they no longer
fit neatly into the kernel's division of subsystems. Modern multi-functional
devices have drivers, such as bnxt/ice/mlx5/pds, that span many subsystems
while sharing the underlying hardware using the auxiliary device system.
All together this creates a challenge for the operating system, where devices
have an expansive FW environment that needs robust vendor-specific debugging
support, and FW driven functionality that is not well suited to “generic”
interfaces. fwctl seeks to allow access to the full device functionality from
user space in the areas of debuggability, management, and first-boot/nth-boot
provisioning.
fwctl is aimed at the common device design pattern where the OS and FW
communicate via an RPC message layer constructed with a queue or mailbox scheme.
In this case the driver will typically have some layer to deliver RPC messages
and collect RPC responses from device FW. The in-kernel subsystem drivers that
operate the device for its primary purposes will use these RPCs to build their
drivers, but devices also usually have a set of ancillary RPCs that don't really
fit into any specific subsystem. For example, a HW RAID controller is primarily
operated by the block layer but also comes with a set of RPCs to administer the
construction of drives within the HW RAID.
In the past when devices were more single function individual subsystems would
grow different approaches to solving some of these common problems, for instance
monitoring device health, manipulating its FLASH, debugging the FW,
provisioning, all have various unique interfaces across the kernel.
fwctl's purpose is to define a common set of limited rules, described below,
that allow user space to securely construct and execute RPCs inside device FW.
The rules serve as an agreement between the operating system and FW on how to
correctly design the RPC interface. As a uAPI the subsystem provides a thin
layer of discovery and a generic uAPI to deliver the RPCs and collect the
response. It supports a system of user space libraries and tools which will
use this interface to control the device.
Scope of Action
---------------
fwctl drivers are strictly restricted to being a way to operate the device FW.
It is not an avenue to access random kernel internals, or other operating system
SW states.
fwctl instances must operate on a well-defined device function, and the device
should have a well-defined security model for what scope within the device the
function is permitted to access. For instance, the most complex PCIe device
today may broadly have several function level scopes:
1. A privileged function with full access to the on-device global state and
configuration
2. Multiple hypervisor function with control over itself and child functions
used with VMs
3. Multiple VM function tightly scoped within the VM
The device may create a logical parent/child relationship between these scopes,
for instance a child VM's FW may be within the scope of the hypervisor FW. It is
quite common in the VFIO world that the hypervisor environment has a complex
provisioning/profiling/configuration responsibility for the function VFIO
assigns to the VM.
Further, within the function, devices often have RPC commands that fall within
some general scopes of action:
1. Access to function & child configuration, flash, etc that becomes live at a
function reset.
2. Access to function & child runtime configuration that kernel drivers can
discover at runtime.
3. Read only access to function debug information that may report on FW objects
in the function & child, including FW objects owned by other kernel
subsystems.
4. Write access to function & child debug information strictly compatible with
the principles of kernel lockdown and kernel integrity protection. Triggers
a kernel Taint.
5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.
Limitation to the appropriate scope must be performed by the kernel.
There are many things this interface must not allow user space to do (without a
Taint or CAP), broadly derived from the principles of kernel lockdown. Some
examples:
1. DMA to/from arbitrary memory, hang the system, run code in the device, or
otherwise compromise device or system security and integrity.
2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
objects owned by kernel drivers.
3. Directly configure or otherwise control kernel drivers. The kernel driver
can react to the device configuration at function reset/driver load time,
but otherwise should not be coupled to fwctl.
4. Operate the HW in a way that overlaps with the core purpose of another
primary kernel subsystem, such as read/write to LBAs, send/receive of
Network packets, or operate an accelerator's data plane.
fwctl is not a replacement for device direct access subsystems like uacce or
VFIO.
Security Response
-----------------
The kernel remains the gatekeeper for this interface. If violations of the
scopes, security or isolation principles are found, we have options to let
devices fix them with a FW update, push a kernel patch to parse and block RPC
commands or push a kernel patch to block entire firmware versions, or devices.
While the kernel can always directly parse and restrict RPCs, it is expected
that the existing kernel pattern of allowing drivers to delegate validation to
FW to be a useful design.
fwctl Driver design
-------------------
In many cases a fwctl driver is going to be part of a larger cross-subsystem
device possibly using the auxiliary_device mechanism. In that case several
subsystems are going to be sharing the same device and FW interface layer so the
device design must already provide for isolation and co-operation between kernel
subsystems. fwctl should fit into that same model.
Part of the driver should include a description of how its scope restrictions
and security model work. The driver and FW together must ensure that RPCs
provided by user space are mapped to the appropriate scope and that the user
space has rights to work on that scope. Broadly there are two approaches drivers
can take:
Have the kernel provide a scope label to the FW along with the user space RPC
and FW will parse and enforce the scope
Have the kernel parse the RPC and enforce the scope
The driver and FW must co-operate to ensure that either fwctl cannot allocate
any FW resources, or any resources it does allocate are freed on FD closure. A
driver primarily constructed around FW RPCs may find that its core PCI function
and RPC layer belongs under fwctl with auxiliary devices connecting to other
subsystems.
User space Operation
--------------------
To be written in detail, broadly:
- A “fwctl” class
- Each fwctl device driver makes a fwctl instance with a struct device and
sysfs presence in its class and a char device /dev/fwctl/XX
- Some sysfs files per-instance describing what the device is and bootstrap
details needed to operate and format RPCs.
- fds opened from /dev/fwctl/XX support an IOCTL to specify the scope, execute
an RPC and return the response.
- A security layer ensuring that operations from user space are restricted and
contained within the defined scopes.
User space Community
--------------------
Drawing inspiration from nvme-cli, participating in the kernel side must come
with a user space in a common TBD git tree, at a minimum to usefully operate the
kernel driver. Providing such an implementation is a pre-condition to merging a
kernel driver.
The goal is to build user space community around some of the shared problems
we all have, and ideally develop some common user space programs with some
starting themes of:
- Device in-field debugging
- HW provisioning
- VFIO child device profiling before VM boot
- Confidential Compute topics (attestation, secure provisioning)
That stretches across all subsystems in the kernel. fwupd is a great example of
how an excellent user space experience can emerge out of kernel-side diversity.
Existing Similar Examples
-------------------------
The approach described in this document is not a new idea. Direct, or near
direct device access has been offered by the kernel in different areas for
decades. With more devices wanting to follow this design pattern it is becoming
clear that it is not entirely well understood and, more importantly, the
security considerations are not well defined or agreed upon.
Some examples:
- HW RAID controllers. This includes RPCs to do things like compose drives into
a RAID volume, configure RAID parameters, monitor the HW and more.
- Baseboard managers. RPCs for configuring settings in the device and more
- NVMe vendor command capsules. nvme-cli provides access to some monitoring
functions that vendors have defined, but more exists.
- DRM allows user space drivers to send commands to the device via kernel
mediation
- RDMA allows user space drivers to directly push commands to the device
without kernel involvement
Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
The first 3 would be examples of areas that fwctl intends to cover.
Some key lessons learned from these past efforts are the importance of having a
common user space project to use as a pre-condition for obtaining a kernel
driver. Developing good community around useful software in user space is key to
getting vendor companies to fund participation.
On 3/4/24 9:02 AM, Jason Gunthorpe wrote:
> On Wed, Feb 14, 2024 at 01:57:35PM -0400, Jason Gunthorpe wrote:
>
>> I also like this, I don't want the outcome of this discussion to be
>> that only mlx5ctl gets merged. I want all the HW that has this problem
>> to have support in the mainline kernel.
>
> To this end here is my proposal to move forward with a new
> mini-subsystem to provide rules for this common approach. Get the
> existing tools out of hacky unstructured direct hardware access via
> /sys/XX and into a formalized lockdown compatible system. I've talked
> to enough people now to think we have a critical mass.
>
It has been almost 3 weeks and no response (to this and other proposals
in this thread). I have been around Linux long enough to know that
silence is not acceptance.
To me, this seems like a sane start for a new subsystem addressing
challenges with complex, modern devices. To that end:
Acked-by: David Ahern <[email protected]>
On Thu, Mar 21, 2024 at 09:23:28PM -0600, David Ahern wrote:
> On 3/4/24 9:02 AM, Jason Gunthorpe wrote:
> > On Wed, Feb 14, 2024 at 01:57:35PM -0400, Jason Gunthorpe wrote:
> >
> >> I also like this, I don't want the outcome of this discussion to be
> >> that only mlx5ctl gets merged. I want all the HW that has this problem
> >> to have support in the mainline kernel.
> >
> > To this end here is my proposal to move forward with a new
> > mini-subsystem to provide rules for this common approach. Get the
> > existing tools out of hacky unstructured direct hardware access via
> > /sys/XX and into a formalized lockdown compatible system. I've talked
> > to enough people now to think we have a critical mass.
> >
>
> It has been almost 3 weeks and no response (to this and other proposals
> in this thread). I have been around Linux long enough to know that
> silence is not acceptance.
>
> To me, this seems like a sane start for a new subsystem addressing
> challenges with complex, modern devices. To that end:
>
> Acked-by: David Ahern <[email protected]>
It's the middle of the merge window, not much we can actually do and
this patch series itself couldn't be applied as-is, so it's hard to see
what could have happened on my end...
thanks,
greg k-h
On Thu, Mar 21, 2024 at 09:23:28PM -0600, David Ahern wrote:
> On 3/4/24 9:02 AM, Jason Gunthorpe wrote:
> > On Wed, Feb 14, 2024 at 01:57:35PM -0400, Jason Gunthorpe wrote:
> >
> >> I also like this, I don't want the outcome of this discussion to be
> >> that only mlx5ctl gets merged. I want all the HW that has this problem
> >> to have support in the mainline kernel.
> >
> > To this end here is my proposal to move forward with a new
> > mini-subsystem to provide rules for this common approach. Get the
> > existing tools out of hacky unstructured direct hardware access via
> > /sys/XX and into a formalized lockdown compatible system. I've talked
> > to enough people now to think we have a critical mass.
> >
>
> It has been almost 3 weeks and no response (to this and other proposals
> in this thread). I have been around Linux long enough to know that
> silence is not acceptance.
>
> To me, this seems like a sane start for a new subsystem addressing
> challenges with complex, modern devices. To that end:
>
> Acked-by: David Ahern <[email protected]>
+1. We look forward to helping with the development and testing
in any way that we can.
Jason, thank you for creating this proposal and apologies for not
responding on-list sooner.
Aron
On 3/22/24 1:32 AM, Greg Kroah-Hartman wrote:
>
> It's the middle of the merge window, not much we can actually do and
> this patch series itself couldn't be applied as-is, so it's hard to see
> what could have happened on my end...
>
The proposal was sent a week before the end of the last development
cycle, and I believe the intent was to motivate discussion around a
concrete proposal to converge on an acceptable solution before sending
patches.
On your end, what would be helpful is either a simple yes this seems
reasonable or no you don't like it for reasons X, Y, and Z.
On Fri, Mar 22, 2024 at 09:24:29AM -0600, David Ahern wrote:
> On 3/22/24 1:32 AM, Greg Kroah-Hartman wrote:
> >
> > It's the middle of the merge window, not much we can actually do and
> > this patch series itself couldn't be applied as-is, so it's hard to see
> > what could have happened on my end...
> >
>
> The proposal was sent a week before the end of the last development
> cycle, and I believe the intent was to motivate discussion around a
> concrete proposal to converge on an acceptable solution before sending
> patches.
>
> On your end, what would be helpful is either a simple yes this seems
> reasonable or no you don't like it for reasons X, Y, and Z.
>
Well said, David.
I would totally support doing something like this in a fairly generic
way that could be leveraged/instantiated by drivers that will allow
communication/inspection of hardware blocks in the datapath. There are
lots of different ways this could go, so feedback on this would help get
us all moving in the right direction.
On Fri, 22 Mar 2024 11:46:27 -0400 Andy Gospodarek wrote:
> > > It's the middle of the merge window, not much we can actually do and
> > > this patch series itself couldn't be applied as-is, so it's hard to see
> > > what could have happened on my end...
> >
> > The proposal was sent a week before the end of the last development
> > cycle, and I believe the intent was to motivate discussion around a
> > concrete proposal to converge on an acceptable solution before sending
> > patches.
> >
> > On your end, what would be helpful is either a simple yes this seems
> > reasonable or no you don't like it for reasons X, Y, and Z.
>
> Well said, David.
>
> I would totally support doing something like this in a fairly generic
> way that could be leveraged/instantiated by drivers that will allow
> communication/inspection of hardware blocks in the datapath. There are
> lots of different ways this could go, so feedback on this would help get
> us all moving in the right direction.
The more I learn, the more I am convinced that the technical
justifications here are just smoke and mirrors. The main motivation
for nVidia, Broadcom, (and Enfabrica?) being to hide as much as
possible of what you consider your proprietary advantage in the
"AI gold rush".
RDMA is what it is but I really hate how you're trying to pretend
that it's is somehow an inherent need of advanced technology and
we need to lower the openness standards for all of the kernel.
On 3/22/24 2:58 PM, Jakub Kicinski wrote:
> The more I learn, the more I am convinced that the technical
> justifications here are just smoke and mirrors. The main motivation for
> nVidia, Broadcom, (and Enfabrica?) being to hide as much as possible of
> what you consider your proprietary advantage in the "AI gold rush". RDMA
> is what it is but I really hate how you're trying to pretend that it's
> is somehow an inherent need of advanced technology and we need to lower
> the openness standards for all of the kernel.
Kuba:
can you respond to Jason's email with the proposal for the new fwctl
subsystem and identify places you do not agree? That would provide more
concrete discussion points. Thanks,
On Fri, Mar 22, 2024 at 01:58:26PM -0700, Jakub Kicinski wrote:
> On Fri, 22 Mar 2024 11:46:27 -0400 Andy Gospodarek wrote:
> > > > It's the middle of the merge window, not much we can actually do and
> > > > this patch series itself couldn't be applied as-is, so it's hard to see
> > > > what could have happened on my end...
> > >
> > > The proposal was sent a week before the end of the last development
> > > cycle, and I believe the intent was to motivate discussion around a
> > > concrete proposal to converge on an acceptable solution before sending
> > > patches.
> > >
> > > On your end, what would be helpful is either a simple yes this seems
> > > reasonable or no you don't like it for reasons X, Y, and Z.
> >
> > Well said, David.
> >
> > I would totally support doing something like this in a fairly generic
> > way that could be leveraged/instantiated by drivers that will allow
> > communication/inspection of hardware blocks in the datapath. There are
> > lots of different ways this could go, so feedback on this would help get
> > us all moving in the right direction.
>
> The more I learn, the more I am convinced that the technical
> justifications here are just smoke and mirrors.
Let's see some evidence of this then, point to some sillicon devices
in the multibillion gate space that don't have complex FW built into
their design?
> The main motivation for nVidia, Broadcom, (and Enfabrica?) being to
> hide as much as possible of what you consider your proprietary
> advantage in the "AI gold rush".
Despite all of those having built devices like this well before the
"AI gold rush" and it being a general overall design principle for the
industry because, yes, the silicon technology available actually
demands it.
It is not to say you couldn't do otherwise, it is just simply too
expensive.
> RDMA is what it is but I really hate how you're trying to pretend
> that it's is somehow an inherent need of advanced technology and
> we need to lower the openness standards for all of the kernel.
Open hardware has never been an "openness standard" for the kernel.
Jason
On Fri, 22 Mar 2024 18:44:23 -0300 Jason Gunthorpe wrote:
> On Fri, Mar 22, 2024 at 01:58:26PM -0700, Jakub Kicinski wrote:
> > > Well said, David.
> > >
> > > I would totally support doing something like this in a fairly generic
> > > way that could be leveraged/instantiated by drivers that will allow
> > > communication/inspection of hardware blocks in the datapath. There are
> > > lots of different ways this could go, so feedback on this would help get
> > > us all moving in the right direction.
> >
> > The more I learn, the more I am convinced that the technical
> > justifications here are just smoke and mirrors.
>
> Let's see some evidence of this then, point to some sillicon devices
> in the multibillion gate space that don't have complex FW built into
> their design?
Existence of complex FW does not imply that production systems must
have a backdoor to talk to that FW in kernel-unmitigated fashion.
As an existence proof I give you NICs we use at Meta.
Or old Netronome NICs, you can pick.
> > The main motivation for nVidia, Broadcom, (and Enfabrica?) being to
> > hide as much as possible of what you consider your proprietary
> > advantage in the "AI gold rush".
>
> Despite all of those having built devices like this well before the
> "AI gold rush" and it being a general overall design principle for the
> industry because, yes, the silicon technology available actually
> demands it.
>
> It is not to say you couldn't do otherwise, it is just simply too
> expensive.
I do agree that it is expensive, not sure if it's "too" expensive.
But Linux never promised that our way of doing SW development would
always be the most cost effective option, right? Especially short
term. Or that we'll be competitive time to market.
> > RDMA is what it is but I really hate how you're trying to pretend
> > that it's is somehow an inherent need of advanced technology and
> > we need to lower the openness standards for all of the kernel.
>
> Open hardware has never been an "openness standard" for the kernel.
I was in the meeting with a vendor this morning and when explicitly
asked by an SRE (not from my org nor in any way "primed" by me)
whether configuration of some run of the mill PCI thing can be exposed
via devlink params instead of whatever proprietary thing the vendor was
pitching, the vendor's answer was silence and then a pitch of another
proprietary mechanism.
So no, the "open hardware" is certainly not a requirement for the
kernel. But users can't get vendors to implement standard Linux
configuration interfaces, and your proposal will make it a lot worse.
On Fri, 22 Mar 2024 15:18:09 -0600 David Ahern wrote:
> can you respond to Jason's email with the proposal for the new fwctl
> subsystem and identify places you do not agree? That would provide more
> concrete discussion points. Thanks,
Respond in what way, David? Comment on technical aspects of whether
a common class device with a discovery mechanism and a sprinkling of
semantically meaningless fields can be implemented? Some trivial object
hierarchy?
On whether someone can actually enforce any of the 4 "don't"s, and
whether this interface is basically encouraging and giving a leg up
to those willing to be dishonest?
Or should we go for another loop of me talking about openness and
building common abstractions, and vendors saying how their way of
doing basic configuration is so very special, and this is just for
debug and security and because others.
There's absolutely no willingness to try and build a common interface
here.
On 22 Mar 15:29, Jakub Kicinski wrote:
>On Fri, 22 Mar 2024 18:44:23 -0300 Jason Gunthorpe wrote:
>> On Fri, Mar 22, 2024 at 01:58:26PM -0700, Jakub Kicinski wrote:
>> > > Well said, David.
>> > >
>> > > I would totally support doing something like this in a fairly generic
>> > > way that could be leveraged/instantiated by drivers that will allow
>> > > communication/inspection of hardware blocks in the datapath. There are
>> > > lots of different ways this could go, so feedback on this would help get
>> > > us all moving in the right direction.
>> >
>> > The more I learn, the more I am convinced that the technical
>> > justifications here are just smoke and mirrors.
>>
>> Let's see some evidence of this then, point to some sillicon devices
>> in the multibillion gate space that don't have complex FW built into
>> their design?
>
>Existence of complex FW does not imply that production systems must
>have a backdoor to talk to that FW in kernel-unmitigated fashion.
>
>As an existence proof I give you NICs we use at Meta.
>Or old Netronome NICs, you can pick.
>
This is not true at all, at least for our NICs. Our NICs do need
non-netdev interfaces at least for debug and monitoring non-netdev
functionality and use-cases at Meta. We can talk about this offline.
Also below you mentioned another one of your vendors using proprietary
mechanism for configuration. So you can't just have it both ways.
It is obvious to everyone that in the AI era, everyone needs
customization, this interface is the proposal for the standardization,
if you cared to look at Jason's proposal you will see how he goes in
length describing how abstraction can happen in user space.
>> > The main motivation for nVidia, Broadcom, (and Enfabrica?) being to
>> > hide as much as possible of what you consider your proprietary
>> > advantage in the "AI gold rush".
>>
>> Despite all of those having built devices like this well before the
>> "AI gold rush" and it being a general overall design principle for the
>> industry because, yes, the silicon technology available actually
>> demands it.
>>
>> It is not to say you couldn't do otherwise, it is just simply too
>> expensive.
>
>I do agree that it is expensive, not sure if it's "too" expensive.
>But Linux never promised that our way of doing SW development would
>always be the most cost effective option, right? Especially short
>term. Or that we'll be competitive time to market.
>
>> > RDMA is what it is but I really hate how you're trying to pretend
>> > that it's is somehow an inherent need of advanced technology and
>> > we need to lower the openness standards for all of the kernel.
>>
>> Open hardware has never been an "openness standard" for the kernel.
>
>I was in the meeting with a vendor this morning and when explicitly
>asked by an SRE (not from my org nor in any way "primed" by me)
>whether configuration of some run of the mill PCI thing can be exposed
>via devlink params instead of whatever proprietary thing the vendor was
>pitching, the vendor's answer was silence and then a pitch of another
>proprietary mechanism.
>
Well, this is why we came up with fwctl interface, so nobody needs to sit
in silence and all vendors can agree to one interface.
We both know devlink params can't scale well enough and accommodate all
vendors and don't forget it's netdev specifc!
>So no, the "open hardware" is certainly not a requirement for the
>kernel. But users can't get vendors to implement standard Linux
>configuration interfaces, and your proposal will make it a lot worse.
Vendors are already using proprietary configuration interfaces, using
direct PCI access via sysfs.. So on the contrary to what you say, this
proposal came to unify vendors, and improve the user's experience..
with fwctl, and the proper use-space shared tooling as Jason's suggested
you can force other vendors to follow the herd and implement the new
standard interfaces that we already have 3 vendors agree to..
On Fri, Mar 22, 2024 at 03:29:24PM -0700, Jakub Kicinski wrote:
> On Fri, 22 Mar 2024 18:44:23 -0300 Jason Gunthorpe wrote:
> > On Fri, Mar 22, 2024 at 01:58:26PM -0700, Jakub Kicinski wrote:
> > > > Well said, David.
> > > >
> > > > I would totally support doing something like this in a fairly generic
> > > > way that could be leveraged/instantiated by drivers that will allow
> > > > communication/inspection of hardware blocks in the datapath. There are
> > > > lots of different ways this could go, so feedback on this would help get
> > > > us all moving in the right direction.
> > >
> > > The more I learn, the more I am convinced that the technical
> > > justifications here are just smoke and mirrors.
> >
> > Let's see some evidence of this then, point to some sillicon devices
> > in the multibillion gate space that don't have complex FW built into
> > their design?
>
> Existence of complex FW does not imply that production systems must
> have a backdoor to talk to that FW in kernel-unmitigated fashion.
I think we've been over this endlessly, production systems require a
solid debugging story for all software in the system, including the
device FW.
> As an existence proof I give you NICs we use at Meta.
> Or old Netronome NICs, you can pick.
I wouldn't pick those, they don't meet the multi-billion gate
criteria. ie expensive chips built on the most cutting edge and
expensive process. Smaller chips on non-cutting edge process have
different economics. Startups have different economics.
Look at something like an Intel/AMD GPU or Habana labs device. Lack of
FW creates a very opaque driver that is twiddling megabytes of
register mnemonics that is incomprehensible without HW
documentation. A win for code availability, but it hasn't actually
crossed into being usefully open.
> > Despite all of those having built devices like this well before the
> > "AI gold rush" and it being a general overall design principle for the
> > industry because, yes, the silicon technology available actually
> > demands it.
> >
> > It is not to say you couldn't do otherwise, it is just simply too
> > expensive.
>
> I do agree that it is expensive, not sure if it's "too" expensive.
> But Linux never promised that our way of doing SW development would
> always be the most cost effective option, right? Especially short
> term. Or that we'll be competitive time to market.
By "too expensive" I mean the vendor cannot produce a chip at a price
that is salable. Or a startup goes out of business because it can't
afford to respin.
Linux promises community collaboration where the community members
broadly should be driving the priorities. When TTM matters and is
agreed it is done. See for instance the various massive security
fixes.
> > > RDMA is what it is but I really hate how you're trying to pretend
> > > that it's is somehow an inherent need of advanced technology and
> > > we need to lower the openness standards for all of the kernel.
> >
> > Open hardware has never been an "openness standard" for the kernel.
>
> I was in the meeting with a vendor this morning and when explicitly
> asked by an SRE (not from my org nor in any way "primed" by me)
> whether configuration of some run of the mill PCI thing can be
> exposed
"run of the mill PCI thing"? Does this thing already have devlink
knob? Usually "run of the mill PCI things" are configured through the
PCI subsystem, not devlink.
> via devlink params instead of whatever proprietary thing the vendor was
> pitching, the vendor's answer was silence and then a pitch of another
> proprietary mechanism.
Our team was very excited about devlink when it first came about. But
now we have so many devlink parameters that have been kept out of
mainline I see that the excitement has died.
> So no, the "open hardware" is certainly not a requirement for the
> kernel. But users can't get vendors to implement standard Linux
> configuration interfaces, and your proposal will make it a lot worse.
I don't agree. If you can't get your vendor to implement the thing you
want on devlink right now today, this fwctl isn't going to change that
one bit. You already have the vendor tool and the vendor telling you
to use it. It makes no difference at all how the existing vendor tools
reach the device.
Indeed, counting on lockdown to break all the existing vendor tools
and render them permanently unusable seems to me to be straining one
of the few hard full project rules Linux actually has: don't break
existing userspace.
Think bigger, maybe your SRE will be happier if as part of this we can
get the vendors to agree on some common userspace tooling for device
configuration! Wouldn't that be a great big ecosystem improvement!
Frankly, there are many forms of common interfaces and many paths to
get there. I don't view your very restrictive approach to be helpful
to growing this space. It hasn't delivered a working ecosystem, and I
don't think it ever will. Instead, as this thread is showing, we have
a bunch of unhappy community members and poor upstream support for
devices.
Jason
On 3/22/24 4:40 PM, Jakub Kicinski wrote:
> On Fri, 22 Mar 2024 15:18:09 -0600 David Ahern wrote:
>> can you respond to Jason's email with the proposal for the new fwctl
>> subsystem and identify places you do not agree? That would provide more
>> concrete discussion points. Thanks,
>
> Respond in what way, David? Comment on technical aspects of whether
> a common class device with a discovery mechanism and a sprinkling of
> semantically meaningless fields can be implemented? Some trivial object
> hierarchy?
>
> On whether someone can actually enforce any of the 4 "don't"s, and
> whether this interface is basically encouraging and giving a leg up
> to those willing to be dishonest?
>
> Or should we go for another loop of me talking about openness and
> building common abstractions, and vendors saying how their way of
> doing basic configuration is so very special, and this is just for
> debug and security and because others.
>
> There's absolutely no willingness to try and build a common interface
> here.
The proposal is an attempt at a common interface and common tooling to a
degree but independent of any specific subsystem of which many are
supported by the device.
Your responses continue to align with the notion that because the device
can spit out ethernet frames, all diagnostics, debugging, configuration,
etc. MUST go through networking APIs.
You seem unwilling to acknowledge that devices can work for various use
cases without a netdev driver, and thus aspects of managing that device
should be done outside of a netdev driver.
On Tue, Mar 26, 2024 at 08:57:29AM -0600, David Ahern wrote:
> On 3/22/24 4:40 PM, Jakub Kicinski wrote:
> > On Fri, 22 Mar 2024 15:18:09 -0600 David Ahern wrote:
> >> can you respond to Jason's email with the proposal for the new fwctl
> >> subsystem and identify places you do not agree? That would provide more
> >> concrete discussion points. Thanks,
> >
> > Respond in what way, David? Comment on technical aspects of whether
> > a common class device with a discovery mechanism and a sprinkling of
> > semantically meaningless fields can be implemented? Some trivial object
> > hierarchy?
> >
> > On whether someone can actually enforce any of the 4 "don't"s, and
> > whether this interface is basically encouraging and giving a leg up
> > to those willing to be dishonest?
> >
> > Or should we go for another loop of me talking about openness and
> > building common abstractions, and vendors saying how their way of
> > doing basic configuration is so very special, and this is just for
> > debug and security and because others.
> >
> > There's absolutely no willingness to try and build a common interface
> > here.
>
> The proposal is an attempt at a common interface and common tooling to a
> degree but independent of any specific subsystem of which many are
> supported by the device.
>
> Your responses continue to align with the notion that because the device
> can spit out ethernet frames, all diagnostics, debugging, configuration,
> etc. MUST go through networking APIs.
>
> You seem unwilling to acknowledge that devices can work for various use
> cases without a netdev driver, and thus aspects of managing that device
> should be done outside of a netdev driver.
HNS driver is a good example of such device. It has nothing to do with
netdev and needs common and reliable way to configure FW.
https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/
Thanks
On Mon, 1 Apr 2024 15:30:03 +0300 Leon Romanovsky wrote:
> > The proposal is an attempt at a common interface and common tooling to a
> > degree but independent of any specific subsystem of which many are
> > supported by the device.
> >
> > Your responses continue to align with the notion that because the device
> > can spit out ethernet frames, all diagnostics, debugging, configuration,
> > etc. MUST go through networking APIs.
> >
> > You seem unwilling to acknowledge that devices can work for various use
> > cases without a netdev driver, and thus aspects of managing that device
> > should be done outside of a netdev driver.
>
> HNS driver is a good example of such device. It has nothing to do with
> netdev and needs common and reliable way to configure FW.
Sorry, I have a completely different reading of that thread.
Thanks for bringing it up, tho.
As I said multiple times I agree that configuring custom parameters
in RDMA is a necessity. Junxian's approach of putting such code in
the RDMA driver / subsystem is more than reasonable. Even better,
it looks like the API is fairly narrowly defined.
On Mon, Apr 01, 2024 at 07:50:03AM -0700, Jakub Kicinski wrote:
> On Mon, 1 Apr 2024 15:30:03 +0300 Leon Romanovsky wrote:
> > > The proposal is an attempt at a common interface and common tooling to a
> > > degree but independent of any specific subsystem of which many are
> > > supported by the device.
> > >
> > > Your responses continue to align with the notion that because the device
> > > can spit out ethernet frames, all diagnostics, debugging, configuration,
> > > etc. MUST go through networking APIs.
> > >
> > > You seem unwilling to acknowledge that devices can work for various use
> > > cases without a netdev driver, and thus aspects of managing that device
> > > should be done outside of a netdev driver.
> >
> > HNS driver is a good example of such device. It has nothing to do with
> > netdev and needs common and reliable way to configure FW.
>
> Sorry, I have a completely different reading of that thread.
> Thanks for bringing it up, tho.
Different people have different opinions, and it is fine.
>
> As I said multiple times I agree that configuring custom parameters
> in RDMA is a necessity. Junxian's approach of putting such code in
> the RDMA driver / subsystem is more than reasonable. Even better,
> it looks like the API is fairly narrowly defined.
It was a tiny example, which emphasizes the need for a common way.
If we were listen to average RDMA driver author, we would find ourselves
with gazillion different sysfs knobs which do nothing except sending
raw data to FW. As a subsystem, we don't want to waste our time in
not-beneficial to the subsystem code.
Thanks
On Mon, 1 Apr 2024 21:10:33 +0300 Leon Romanovsky wrote:
> > Sorry, I have a completely different reading of that thread.
> > Thanks for bringing it up, tho.
>
> Different people have different opinions, and it is fine.
Agreed! I think the core of our disagreement is whether and when
one set of people get to force their set of choices on other people.
Far be it from me to try to force my opinions in RDMA or the kernel
as a whole.
> > As I said multiple times I agree that configuring custom parameters
> > in RDMA is a necessity. Junxian's approach of putting such code in
> > the RDMA driver / subsystem is more than reasonable. Even better,
> > it looks like the API is fairly narrowly defined.
>
> It was a tiny example, which emphasizes the need for a common way.
>
> If we were listen to average RDMA driver author, we would find ourselves
> with gazillion different sysfs knobs which do nothing except sending
> raw data to FW. As a subsystem, we don't want to waste our time in
> not-beneficial to the subsystem code.
No disagreement here either, there's a trade-off here between what
you call waste of time and what I'd call fostering organic
collaboration. Even without taking your priorities into account -
whether reviewing device APIs is beneficial is (a) subjective,
(b) subsystem dependent, so you should be allowed to make your choice
(within the bounds of Linus's trust). But what I keep arguing is that
so should we :|
On 26/03/2024 14:57, David Ahern wrote:
> The proposal is an attempt at a common interface and common tooling to a
> degree but independent of any specific subsystem of which many are
> supported by the device.
[ Let me prefix this by noting that I'm speaking personally here, and
not representing the position of my employer. ]
You can't have a "common interface" and yet be "independent" of anything
that could give semantics to that interface. What you have is a common
*transport*, used to connect to a *vendor* interface.
If you can't even bring yourself to be honest about that, it's no wonder
you're getting maintainer pushback.
Do we need to go all the way back to operating systems 101 and point out
that one of the fundamental jobs of a kernel is to *abstract* the
hardware, and provide *services* to userspace rather than mere devices?
Frankly, this whole thread reads to me like certain vendors whining that
they weren't expecting to have to get their new features *reviewed* by
upstream — possibly they thought devlink params would just get rubber-
stamped — and now they're finding that the kernel's quality standards
still apply.
Complaining that devlink params "don't scale" is disingenuous. Patches
aren't languishing for want of reviewer resources; it's just that it
takes *submitter* time and effort to bring them up to the quality level
that's required, and occasionally the vendor has to (shock! horror!)
tell the world what one of their magic knobs actually *does*.
If all the configuration of these Complex Devices™ goes through fwctl
backdoors, where exactly is anyone going to discover the commonalities
to underlie the generic interfaces of the next generation? What would
configuring plain vanilla netdevs be like today if, instead of a set of
well-defined cross-vendor APIs, ethtool (say) had been a mechanism to
write arbitrary values to hardware registers on the NIC?
These commonalities are key to allowing a product category to mature. I
realise vendors in many cases don't want that to happen, because mature
products are largely commoditised and thus don't command huge margins;
but Linux isn't here to serve vendors' interests at the expense of
users.
On 23/03/2024 01:27, Saeed Mahameed wrote:
> It is obvious to everyone that in the AI era, everyone needs
> customization
It's always possible to argue that the New Thing is qualitatively
different from anything that went before, that these "multibillion
gate devices" need to be able to break the rules.
But the truth is, you aren't that special.
-e
On Tue, Apr 02, 2024 at 05:32:44PM +0100, Edward Cree wrote:
> On 26/03/2024 14:57, David Ahern wrote:
> > The proposal is an attempt at a common interface and common tooling to a
> > degree but independent of any specific subsystem of which many are
> > supported by the device.
>
> [ Let me prefix this by noting that I'm speaking personally here, and
> not representing the position of my employer. ]
>
> You can't have a "common interface" and yet be "independent" of anything
> that could give semantics to that interface. What you have is a common
> *transport*, used to connect to a *vendor* interface.
> If you can't even bring yourself to be honest about that, it's no wonder
> you're getting maintainer pushback.
I think this was covered in the doc I posted, it is unapologetically a
common transport.
I have some ideas for different "common interface's" built in
userspace. I think others will have good ideas too.
It is 'independent [of subsystems]' because it is manipulating *the
device* not the subsystem software.
> Do we need to go all the way back to operating systems 101 and point out
> that one of the fundamental jobs of a kernel is to *abstract* the
> hardware, and provide *services* to userspace rather than mere devices?
Except that isn't even true! That is a very naive and simplistic view
of what an OS should do. If we took that tonnes of interfaces in Linux
should just be purged.
A fundamental thing the OS exists to do is to arbitrate, secure and
multiplex access to hardware. There are many levels where people can
put a stick and say 'common interface', and it is not some 'wrong
architecture' for the OS to delegate the 'common interface' job to
userspace, or even say direct access is fine.
This is not an exclusive situation, fwctl doesn't mean future OS side
common interfaces are somehow blocked.
> Frankly, this whole thread reads to me like certain vendors whining that
> they weren't expecting to have to get their new features *reviewed* by
> upstream — possibly they thought devlink params would just get rubber-
> stamped — and now they're finding that the kernel's quality standards
> still apply.
Oh please. "quality standards" is not the issue here. This is a
philosophical disagreement on OS design and, as Jakub pointed out, a
second argument about who gets to have power in our community.
> Complaining that devlink params "don't scale" is disingenuous.
Saeed's remarks mean the review process doesn't scale. There are
something like 600-800 configurables in mlx5, I assume other devices
are similar. You can't fight over commonality bit by bit that many
times and ever get anywhere. Everyone will get exhausted well before
it is done.
> If all the configuration of these Complex Devices™ goes through fwctl
> backdoors, where exactly is anyone going to discover the commonalities
> to underlie the generic interfaces of the next generation? What would
> configuring plain vanilla netdevs be like today if, instead of a set of
It would be *exactly the same as today* because today everyone with
these devices uses the vendor tooling to configure them. Where is the
screaming? Where has the concrete demands for common interfaces on
some of the knobs been all these years? Where has keeping blessed
support out of the kernel got us?
If there was a real industry consensus to make commonality here it
would be done. In my view there is not industry will because it is not
actually an important problem that needs solving. When you buy this
kind of complex HW you need to ensure the flash is configured for your
site. You will work with the vendor and either get devices with flash
preconfigured, or you will work with the vendor to get a suitable FW
version and configurable list for your site. Then the job is simple,
userspace needs to confirm and fix the device to have the target flash
state.
The actual benefit of common names for the individual configuration
values is pretty tiny. Unless 100% of the configuration is covered by
common names it is not going to be meaningful in practice.
In my view there is alot of benefit to have a common tool to take
descriptions of devices target flash states and then enforce
it. Something that is common to many devices and could work for
everything, not just some useless subset of aruged-to-death-blessed
items.
> These commonalities are key to allowing a product category to
> mature.
Uh no. Alot of this configuration stuff is to serve niche customer
interests where the customers have no interest in publishing and
commonizing their method of operating because it would give away their
own operator secrets. The vendor's created it because their big
customers demanded it.
eg there are configurables in mlx5 that exist *soley* to accomodate
certain customer's pre-existing SW.
Stuff is like this because the *industry* doesn't want to fix it,
don't dump everything on the vendors like they are the source of the
problem.
And, as before, this message is talking about configuration and
devlink, but there are many more problems fwctl will solve with these
devices including debugging/etc that inherently can't be solved in an
abstract way.
> > It is obvious to everyone that in the AI era, everyone needs
> > customization
>
> It's always possible to argue that the New Thing is qualitatively
> different from anything that went before, that these "multibillion
> gate devices" need to be able to break the rules.
What supposed "rules"? Read my post again, this is well trodden ground
in the kernel. Delegation is fairly normal Linux design - try to
minimize the kernel footprint is also a well accepted design axiom.
Jason
On Mon, Apr 01, 2024 at 07:50:03AM -0700, Jakub Kicinski wrote:
> On Mon, 1 Apr 2024 15:30:03 +0300 Leon Romanovsky wrote:
> > > The proposal is an attempt at a common interface and common tooling to a
> > > degree but independent of any specific subsystem of which many are
> > > supported by the device.
> > >
> > > Your responses continue to align with the notion that because the device
> > > can spit out ethernet frames, all diagnostics, debugging, configuration,
> > > etc. MUST go through networking APIs.
> > >
> > > You seem unwilling to acknowledge that devices can work for various use
> > > cases without a netdev driver, and thus aspects of managing that device
> > > should be done outside of a netdev driver.
> >
> > HNS driver is a good example of such device. It has nothing to do with
> > netdev and needs common and reliable way to configure FW.
>
> Sorry, I have a completely different reading of that thread.
> Thanks for bringing it up, tho.
>
> As I said multiple times I agree that configuring custom parameters
> in RDMA is a necessity. Junxian's approach of putting such code in
> the RDMA driver / subsystem is more than reasonable. Even better,
> it looks like the API is fairly narrowly defined.
Uh, if I understand netdev rules aren't read/write sysfs created from
drivers banned?
So reasonable for RDMA but unacceptable to netdev? My brain hurts.
FWIW, I've been trying to push RDMA away from driver created sysfs for
a while now. Aside from the API complexity, implementations have
messed up using the sysfs APIs and resulted in some significant
problems :(
Jason
On Tue, Apr 02, 2024 at 05:32:44PM +0100, Edward Cree wrote:
> On 26/03/2024 14:57, David Ahern wrote:
> > The proposal is an attempt at a common interface and common tooling to a
> > degree but independent of any specific subsystem of which many are
> > supported by the device.
>
> [ Let me prefix this by noting that I'm speaking personally here, and
> not representing the position of my employer. ]
<...>
> you're getting maintainer pushback.
May I suggest you to take a short break, collect names of people who
participated in this discussion and check in git history/MAINTAINERS
file their contribution to the linux kernel?
After you do that, try to ask yourself if your response is still appropriate.
Thanks.
>
> Do we need to go all the way back to operating systems 101 and point out
> that one of the fundamental jobs of a kernel is to *abstract* the
> hardware, and provide *services* to userspace rather than mere devices?
>
> Frankly, this whole thread reads to me like certain vendors whining that
> they weren't expecting to have to get their new features *reviewed* by
> upstream — possibly they thought devlink params would just get rubber-
> stamped — and now they're finding that the kernel's quality standards
> still apply.
> Complaining that devlink params "don't scale" is disingenuous. Patches
> aren't languishing for want of reviewer resources; it's just that it
> takes *submitter* time and effort to bring them up to the quality level
> that's required, and occasionally the vendor has to (shock! horror!)
> tell the world what one of their magic knobs actually *does*.
>
> If all the configuration of these Complex Devices™ goes through fwctl
> backdoors, where exactly is anyone going to discover the commonalities
> to underlie the generic interfaces of the next generation? What would
> configuring plain vanilla netdevs be like today if, instead of a set of
> well-defined cross-vendor APIs, ethtool (say) had been a mechanism to
> write arbitrary values to hardware registers on the NIC?
> These commonalities are key to allowing a product category to mature. I
> realise vendors in many cases don't want that to happen, because mature
> products are largely commoditised and thus don't command huge margins;
> but Linux isn't here to serve vendors' interests at the expense of
> users.
>
> On 23/03/2024 01:27, Saeed Mahameed wrote:
> > It is obvious to everyone that in the AI era, everyone needs
> > customization
>
> It's always possible to argue that the New Thing is qualitatively
> different from anything that went before, that these "multibillion
> gate devices" need to be able to break the rules.
> But the truth is, you aren't that special.
>
> -e
On Mon, Apr 01, 2024 at 12:04:20PM -0700, Jakub Kicinski wrote:
> On Mon, 1 Apr 2024 21:10:33 +0300 Leon Romanovsky wrote:
> > > Sorry, I have a completely different reading of that thread.
> > > Thanks for bringing it up, tho.
> >
> > Different people have different opinions, and it is fine.
>
> Agreed! I think the core of our disagreement is whether and when
> one set of people get to force their set of choices on other people.
> Far be it from me to try to force my opinions in RDMA or the kernel
> as a whole.
>
> > > As I said multiple times I agree that configuring custom parameters
> > > in RDMA is a necessity. Junxian's approach of putting such code in
> > > the RDMA driver / subsystem is more than reasonable. Even better,
> > > it looks like the API is fairly narrowly defined.
> >
> > It was a tiny example, which emphasizes the need for a common way.
> >
> > If we were listen to average RDMA driver author, we would find ourselves
> > with gazillion different sysfs knobs which do nothing except sending
> > raw data to FW. As a subsystem, we don't want to waste our time in
> > not-beneficial to the subsystem code.
>
> No disagreement here either, there's a trade-off here between what
> you call waste of time and what I'd call fostering organic
> collaboration. Even without taking your priorities into account -
> whether reviewing device APIs is beneficial is (a) subjective,
> (b) subsystem dependent, so you should be allowed to make your choice
> (within the bounds of Linus's trust). But what I keep arguing is that
> so should we :|
Internal device API is a different story, and I assure you that we are
very close in our views here. Probably main difference is that I'm very
lax for first user and very strict for the second one, while you are more
stricter than me even for the first one.
I gave an example from HNS for API to configure FW without any subsystem
involvement and there is no real way to push to collaboration here without
standardization.
Thanks
On Tue, 2 Apr 2024 15:45:54 -0300 Jason Gunthorpe wrote:
> On Mon, Apr 01, 2024 at 07:50:03AM -0700, Jakub Kicinski wrote:
> > On Mon, 1 Apr 2024 15:30:03 +0300 Leon Romanovsky wrote:
> > > HNS driver is a good example of such device. It has nothing to do with
> > > netdev and needs common and reliable way to configure FW.
> >
> > Sorry, I have a completely different reading of that thread.
> > Thanks for bringing it up, tho.
> >
> > As I said multiple times I agree that configuring custom parameters
> > in RDMA is a necessity. Junxian's approach of putting such code in
> > the RDMA driver / subsystem is more than reasonable. Even better,
> > it looks like the API is fairly narrowly defined.
>
> Uh, if I understand netdev rules aren't read/write sysfs created from
> drivers banned?
Neither is that true as an absolute "netdev rule" nor relevant
to the discussion.
> So reasonable for RDMA but unacceptable to netdev?
I don't know or care what interface guidance you provide.
What I called reasonable is putting that code in RDMA driver
/ subsystem.
> My brain hurts.
Maybe brains are better suited for understanding what other people
say rather than twisting and misinterpreting..
> FWIW, I've been trying to push RDMA away from driver created sysfs for
> a while now. Aside from the API complexity, implementations have
> messed up using the sysfs APIs and resulted in some significant
> problems :(
Sure, agreed, but off-topic.
On Tue, Apr 02, 2024 at 02:36:07PM -0700, Jakub Kicinski wrote:
> > FWIW, I've been trying to push RDMA away from driver created sysfs for
> > a while now. Aside from the API complexity, implementations have
> > messed up using the sysfs APIs and resulted in some significant
> > problems :(
>
> Sure, agreed, but off-topic.
It is not - I don't want a huge amount of sysfs in drivers to replace
what fwctl will do for pretty solid technical reasons.
I do object, as snarky as I was, to you saying RDMA should take on a
whole bunch of read/write sysfs in drivers that netdev would not
accept as a "reasonable" direction.
Jason
On Tue, 2 Apr 2024 19:46:32 -0300 Jason Gunthorpe wrote:
> On Tue, Apr 02, 2024 at 02:36:07PM -0700, Jakub Kicinski wrote:
> > > FWIW, I've been trying to push RDMA away from driver created sysfs for
> > > a while now. Aside from the API complexity, implementations have
> > > messed up using the sysfs APIs and resulted in some significant
> > > problems :(
> >
> > Sure, agreed, but off-topic.
>
> It is not - I don't want a huge amount of sysfs in drivers to replace
> what fwctl will do for pretty solid technical reasons.
>
> I do object, as snarky as I was, to you saying RDMA should take on a
> whole bunch of read/write sysfs in drivers that netdev would not
> accept as a "reasonable" direction.
Repeating myself a bit - I'm not saying you should take the sysfs
patches. Just that RDMA configuration belongs in the RDMA subsystem.
I don't want to suggest another solution, because frankly given
our "no direct FW interface exposure" I don't have much experience
maintaining such APIs upstream. It'd be vain of me to make suggestions.
Also I don't want to sound like I'm giving you my "blessing" to do
whatever, since my personal beliefs(?) remain unchanged. But they
carry little weight when netdevs or traffic which ends up in netdev
are not involved.
On Tue, 2 Apr 2024 16:21:06 -0700 Jakub Kicinski wrote:
> they carry little weight when netdevs or traffic which ends up in
> netdev are not involved.
Clarification before this goes into "Leon's quote book"
netdev == code base, traffic == network "packets"
as opposed to the mailing list and emails.
On Tue, Apr 02, 2024 at 05:15:35PM -0700, Jakub Kicinski wrote:
> On Tue, 2 Apr 2024 16:21:06 -0700 Jakub Kicinski wrote:
> > they carry little weight when netdevs or traffic which ends up in
> > netdev are not involved.
>
> Clarification before this goes into "Leon's quote book"
I don't have one, should I start write it?
And sorry for listening and remembering what people say.
Thanks
On 02/04/2024 19:48, Leon Romanovsky wrote:
> On Tue, Apr 02, 2024 at 05:32:44PM +0100, Edward Cree wrote:
>> you're getting maintainer pushback.
>
> May I suggest you to take a short break, collect names of people who
> participated in this discussion and check in git history/MAINTAINERS
> file their contribution to the linux kernel?
Whether you like it or not, Kuba is a kernel maintainer.
And thus, semantically, a Nack from him is "maintainer pushback".
That remains true regardless of who else in the discussion is also
a kernel maintainer.
If you had an actual point, feel free to explain to me, without the
veiled language, what was so 'inappropriate' about my posting.
On 4/2/24 12:40 PM, Jason Gunthorpe wrote:
> and, as Jakub pointed out, a second argument about who gets to have
> power in our community.
I think that is the key summary of this disagreement: this is about
power over what vendors are allowed to do with in-tree drivers.
Jakub is taking the stance that because the devices have use cases with
the networking stack, he (as a maintainer of netdev) gets a say on how
those devices are accessed across the entire Linux code base and its
various use cases. That the device can be used without a netdev seems to
be irrelevant.
On 4/3/24 1:00 PM, Leon Romanovsky wrote:
> On Wed, Apr 03, 2024 at 01:26:50PM +0100, Edward Cree wrote:
>> On 02/04/2024 19:48, Leon Romanovsky wrote:
>>> On Tue, Apr 02, 2024 at 05:32:44PM +0100, Edward Cree wrote:
>>>> you're getting maintainer pushback.
>>>
>>> May I suggest you to take a short break, collect names of people who
>>> participated in this discussion and check in git history/MAINTAINERS
>>> file their contribution to the linux kernel?
>> Whether you like it or not, Kuba is a kernel maintainer.
>> And thus, semantically, a Nack from him is "maintainer pushback".
>> That remains true regardless of who else in the discussion is also
>> a kernel maintainer.
>>
>> If you had an actual point, feel free to explain to me, without the
>> veiled language, what was so 'inappropriate' about my posting.
>
> Language, tone, and content of your email were inappropriate:
>
> https://lore.kernel.org/all/[email protected]/
> ...certain vendors whining...
>
> ^^^^ Language
>
> ... possibly they thought devlink params would just get rubber-
> stamped — and now they're finding that the kernel's quality standards
> still apply. ...
>
> ^^^^ Tone
> EVERYONE who participated in this discussion knows about kernel's
> quality standards.
>
> ... Patches aren't languishing for want of reviewer resources; it's just that it
> takes *submitter* time and effort to bring them up to the quality level
> that's required, and occasionally the vendor has to (shock! horror!)
> tell the world what one of their magic knobs actually *does*. ...
>
> ^^^^ Content
> This paragraph alone shows that you completely didn't understand the
> discussion here.
>
> Thanks
+1
On Wed, Apr 03, 2024 at 01:26:50PM +0100, Edward Cree wrote:
> On 02/04/2024 19:48, Leon Romanovsky wrote:
> > On Tue, Apr 02, 2024 at 05:32:44PM +0100, Edward Cree wrote:
> >> you're getting maintainer pushback.
> >
> > May I suggest you to take a short break, collect names of people who
> > participated in this discussion and check in git history/MAINTAINERS
> > file their contribution to the linux kernel?
> Whether you like it or not, Kuba is a kernel maintainer.
> And thus, semantically, a Nack from him is "maintainer pushback".
> That remains true regardless of who else in the discussion is also
> a kernel maintainer.
>
> If you had an actual point, feel free to explain to me, without the
> veiled language, what was so 'inappropriate' about my posting.
Language, tone, and content of your email were inappropriate:
https://lore.kernel.org/all/[email protected]/
..certain vendors whining...
^^^^ Language
.. possibly they thought devlink params would just get rubber-
stamped — and now they're finding that the kernel's quality standards
still apply. ...
^^^^ Tone
EVERYONE who participated in this discussion knows about kernel's
quality standards.
.. Patches aren't languishing for want of reviewer resources; it's just that it
takes *submitter* time and effort to bring them up to the quality level
that's required, and occasionally the vendor has to (shock! horror!)
tell the world what one of their magic knobs actually *does*. ...
^^^^ Content
This paragraph alone shows that you completely didn't understand the
discussion here.
Thanks
On Wed, 3 Apr 2024 13:31:46 -0600 David Ahern wrote:
> > ... Patches aren't languishing for want of reviewer resources; it's just that it
> > takes *submitter* time and effort to bring them up to the quality level
> > that's required, and occasionally the vendor has to (shock! horror!)
> > tell the world what one of their magic knobs actually *does*. ...
> >
> > ^^^^ Content
> > This paragraph alone shows that you completely didn't understand the
> > discussion here.
>
> +1
"didn't understand the discussion" is an ironic thing for you to +1,
David. After all my emails about HNS3 RDMA you somehow concluded today
that I want to make rules for the entire kernel:
https://lore.kernel.org/all/[email protected]/
And I second what Ed said. I have asked multiple vendors preaching
impossibilism in this thread to start posting those knobs. I offered
to do a quick off-list review of the list of knobs they have to give
a quick yay / nay, so they don't waste time implementing things that
would get nacked. None of the vendors bothered taking me up on that
offer.
On 4/3/24 6:01 PM, Jakub Kicinski wrote:
> On Wed, 3 Apr 2024 13:31:46 -0600 David Ahern wrote:
>>> ... Patches aren't languishing for want of reviewer resources; it's just that it
>>> takes *submitter* time and effort to bring them up to the quality level
>>> that's required, and occasionally the vendor has to (shock! horror!)
>>> tell the world what one of their magic knobs actually *does*. ...
>>>
>>> ^^^^ Content
>>> This paragraph alone shows that you completely didn't understand the
>>> discussion here.
>>
>> +1
>
> "didn't understand the discussion" is an ironic thing for you to +1,
Come on, Jakub. The +1 was regarding Leon's entire response to Ed, not
just the last part you so conveniently quoted here. I agree 100% wiht
Leon that Ed's "Language, tone, and content of your email were
inappropriate". That is why I left the entire quote in the response and
not just the paragraph above.
> David. After all my emails about HNS3 RDMA you somehow concluded today
> that I want to make rules for the entire kernel:
> https://lore.kernel.org/all/[email protected]/
>
And as for that response (the URL you listed there), what else should
one conclude? That has to be a fair summation of your stance because
this entire thread is on a driver completely unrelated to netdev that
you jumped in on with a NACK. I have asked multiple times in this thread
and the one before where you believe your boundary ends as a maintainer.
> And I second what Ed said. I have asked multiple vendors preaching
> impossibilism in this thread to start posting those knobs. I offered
> to do a quick off-list review of the list of knobs they have to give
> a quick yay / nay, so they don't waste time implementing things that
> would get nacked. None of the vendors bothered taking me up on that
> offer.
Again, entirely missing the point. This is not about configure knobs
that have a potential to be consistent across devices.
Sadly, this thread is not a spiral, ever so slowly converging on an
agreement, but a merry-ground just going in circles.
On Wed, Apr 03, 2024 at 05:01:49PM -0700, Jakub Kicinski wrote:
> On Wed, 3 Apr 2024 13:31:46 -0600 David Ahern wrote:
> > > ... Patches aren't languishing for want of reviewer resources; it's just that it
> > > takes *submitter* time and effort to bring them up to the quality level
> > > that's required, and occasionally the vendor has to (shock! horror!)
> > > tell the world what one of their magic knobs actually *does*. ...
> > >
> > > ^^^^ Content
> > > This paragraph alone shows that you completely didn't understand the
> > > discussion here.
> >
> > +1
>
> "didn't understand the discussion" is an ironic thing for you to +1,
> David. After all my emails about HNS3 RDMA you somehow concluded today
> that I want to make rules for the entire kernel:
> https://lore.kernel.org/all/[email protected]/
What if (hypothetically) I tould you that the congestion control
settings in the device FW impacted netdev sourced ethernet trafic as
well? Would you be so sanguine that RDMA should have those settings?
> And I second what Ed said. I have asked multiple vendors preaching
> impossibilism in this thread to start posting those knobs. I offered
> to do a quick off-list review of the list of knobs they have to give
> a quick yay / nay, so they don't waste time implementing things that
> would get nacked. None of the vendors bothered taking me up on that
> offer.
As far as configuration/provisioning goes, it is really all or
nothing.
If a specific site can configure only 90% of the stuff required
because you will NAK the missing 10% it then it is still not usable
and is a wasted effort for everyone.
You have never shown that there is a path to 100% with your approach
to devlink. In fact I believe you've said flat out that 100% is not
achievable. Right here you illustrate the fundamental problem again:
there are configurables that already exist in the device that you will
NAK for devlink.
This is fundamentally why no one is taking you up on these generous
offers to pre-NAK device's designs. You made it explicit that you will
will NAK something and then it is not 100%.
Saeed has said repeatedly he wants 100% of the endless configurables
in mlx5. You have the manual and know what they are, tell him how to
get to 100% in a few months of work and I will believe you that it is
not impossible.
Then we only have fwctl's support for debugging and other topics to
argue about :P
Jason
On Thu, 4 Apr 2024 09:23:38 -0300 Jason Gunthorpe wrote:
> > "didn't understand the discussion" is an ironic thing for you to +1,
> > David. After all my emails about HNS3 RDMA you somehow concluded today
> > that I want to make rules for the entire kernel:
> > https://lore.kernel.org/all/[email protected]/
>
> What if (hypothetically) I tould you that the congestion control
> settings in the device FW impacted netdev sourced ethernet trafic as
> well? Would you be so sanguine that RDMA should have those settings?
We can lawyer the words until the cows come home.
The team I work on takes care of both RoCE/IB/pick your fav proto
and TCP/IP NICs. It's fairly obvious what is RoCE and what is TCP
or user UDP when there are no incentives to act otherwise :|
> > And I second what Ed said. I have asked multiple vendors preaching
> > impossibilism in this thread to start posting those knobs. I offered
> > to do a quick off-list review of the list of knobs they have to give
> > a quick yay / nay, so they don't waste time implementing things that
> > would get nacked. None of the vendors bothered taking me up on that
> > offer.
>
> As far as configuration/provisioning goes, it is really all or
> nothing.
>
> If a specific site can configure only 90% of the stuff required
> because you will NAK the missing 10% it then it is still not usable
> and is a wasted effort for everyone.
(a) are you saying that the device needs 100% of the knobs to be used?
oof, you better warn your prospective customers :S
(b) as Ed pointed out some of the "knobs" are just hacks and lazy
workarounds so we rejected them for quality reasons; the remaining
rejects are because the knobs aren't really device specific, but
vendors don't want to extend existing APIs, as it is easier to
ship "features" without having a core kernel dependency...
> You have never shown that there is a path to 100% with your approach
> to devlink. In fact I believe you've said flat out that 100% is not
> achievable. Right here you illustrate the fundamental problem again:
> there are configurables that already exist in the device that you will
> NAK for devlink.
>
> This is fundamentally why no one is taking you up on these generous
> offers to pre-NAK device's designs. You made it explicit that you will
> will NAK something and then it is not 100%.
>
> Saeed has said repeatedly he wants 100% of the endless configurables
> in mlx5. You have the manual and know what they are, tell him how to
> get to 100% in a few months of work and I will believe you that it is
> not impossible.
Sorry, are you saying that I'm responsible for a providing a solution
to allow arbitrary vendor tools to work and proprietary user space to
communicate directly with the proprietary firmware?
> Then we only have fwctl's support for debugging and other topics to
> argue about :P
[ Again, I had better disassociate my employer from the vitriol below,
for which I alone am responsible. ]
On 02/04/2024 19:40, Jason Gunthorpe wrote:
> Uh no. Alot of this configuration stuff is to serve niche customer
> interests where the customers have no interest in publishing and
> commonizing their method of operating because it would give away their
> own operator secrets. The vendor's created it because their big
> customers demanded it.
>
> eg there are configurables in mlx5 that exist *soley* to accomodate
> certain customer's pre-existing SW.
So it's a single-user hack, why do you want support for it in upstream?
Oh right, because you want to be able to ship single-user hacks to all
your customers while still getting the _cachet_ of being An Intree
Driver with the implied engineering benefits of open source — despite
the actual feature implementations being obscured in firmware that's
not subject to the review process and thus doesn't actually carry
those benefits.
> There are something like 600-800 configurables in mlx5
So because your engineers can't design clean, orthogonal APIs for
toffee, you should be allowed to bypass review? Sorry, but no.
The right approach is to find generic mechanisms that cover multiple
customer configurations by allowing each customer to specify a policy
— which is better not just for the kernel but also for your customers
(they can experiment more easily with new policies) and even for you
(you don't have to spend engineer time on implementing hundreds of
single-purpose configuration switches, and can instead focus on
making better products). And of course the customer's policy, with
their 'valuable' operator secrets, stays in-house.
This is not some revolutionary new idea or blue-sky architecture
astronaut thinking; this is the way the Unix engineering tradition
has worked for half a century.
And it's not like we don't give you the tools to do it!
BPF demonstrates that the kernel is perfectly willing to expose highly
complex configuration primitives to userspace, *as long as* the
interface is well-defined and cross-vendor.
Of course, without knowing what your several hundred knobs are for, I
can't tell you even in the broadest sense what shape a clean config
system to replace them would look like. But 800 magic flags isn't it.
> Where is the screaming? Where has keeping blessed support out of
> the kernel got us?
Well, clearly *someone* wants you to supply them an in-tree driver,
else you wouldn't be trying to upstream this. Now maybe they really
do just have a psychological aversion to TAINT_OOT_MODULE, but it's
more likely that the underlying reason is the improved reliability,
maintainability, and portability that come from the upstreaming
process. And that in turn only happens because the kernel does not
"bless" crap.
> The actual benefit of common names for the individual configuration
> values is pretty tiny.
In case I still need to make it clearer: the purpose of requiring
your configurables to go through review is not so you can have
"common names" for 800 magic flags. It's so that you are forced to
come up with configurables that actually have a sensible design and
meaningful semantics that you can define in a way that gives you
something to *put* in the name and the commit message that's not
just "magic behaviour tweak #42 for $BigCustomer".
> a second argument about who gets to have power in our community.
As I understand it, power in Linux is entirely social and informal.
If you think Kuba doesn't have standing to object, there's nothing
*technical* stopping you from applying the patches with his
Nacked-by tag included in the commit messages, then sending the
resulting PR to Linus.
And if you think that Linus would reject the PR in that case, then
you're implicitly conceding that Kuba *does* have standing.
On Thu, Apr 04, 2024 at 07:48:50AM -0700, Jakub Kicinski wrote:
> On Thu, 4 Apr 2024 09:23:38 -0300 Jason Gunthorpe wrote:
> > > "didn't understand the discussion" is an ironic thing for you to +1,
> > > David. After all my emails about HNS3 RDMA you somehow concluded today
> > > that I want to make rules for the entire kernel:
> > > https://lore.kernel.org/all/[email protected]/
> >
> > What if (hypothetically) I tould you that the congestion control
> > settings in the device FW impacted netdev sourced ethernet trafic as
> > well? Would you be so sanguine that RDMA should have those settings?
>
> We can lawyer the words until the cows come home.
> The team I work on takes care of both RoCE/IB/pick your fav proto
> and TCP/IP NICs. It's fairly obvious what is RoCE and what is TCP
> or user UDP when there are no incentives to act otherwise :|
Sure you can tell the difference, that isn't my hypothetical. I'm
asking what if the people who designed the device choose not to tell
the difference?
> > > And I second what Ed said. I have asked multiple vendors preaching
> > > impossibilism in this thread to start posting those knobs. I offered
> > > to do a quick off-list review of the list of knobs they have to give
> > > a quick yay / nay, so they don't waste time implementing things that
> > > would get nacked. None of the vendors bothered taking me up on that
> > > offer.
> >
> > As far as configuration/provisioning goes, it is really all or
> > nothing.
> >
> > If a specific site can configure only 90% of the stuff required
> > because you will NAK the missing 10% it then it is still not usable
> > and is a wasted effort for everyone.
>
> (a) are you saying that the device needs 100% of the knobs to be used?
> oof, you better warn your prospective customers :S
No, I'm saying I have 100 customers, 600 configurables and every
customer needs a partially overlapping set of 30 of them to be
different than the COTS manufacturing default.
I can implement 30 and support one customer, but I can't support all
100 customers without all 600 knobs.
> (b) as Ed pointed out some of the "knobs" are just hacks and lazy
> workarounds so we rejected them for quality reasons; the remaining
> rejects are because the knobs aren't really device specific, but
> vendors don't want to extend existing APIs, as it is easier to
> ship "features" without having a core kernel dependency...
Which is back to my point. You are picking and choosing what gets to
be supported, and the end result is none of the 100 customers get to
actually work.
It is overreach to demand that the devices be re-designed as a
condition to be part of mainline. The configurables exist as they are
and need to be supported, in one way or another, by the kernel.
> > You have never shown that there is a path to 100% with your approach
> > to devlink. In fact I believe you've said flat out that 100% is not
> > achievable. Right here you illustrate the fundamental problem again:
> > there are configurables that already exist in the device that you will
> > NAK for devlink.
> >
> > This is fundamentally why no one is taking you up on these generous
> > offers to pre-NAK device's designs. You made it explicit that you will
> > will NAK something and then it is not 100%.
> >
> > Saeed has said repeatedly he wants 100% of the endless configurables
> > in mlx5. You have the manual and know what they are, tell him how to
> > get to 100% in a few months of work and I will believe you that it is
> > not impossible.
>
> Sorry, are you saying that I'm responsible for a providing a solution
> to allow arbitrary vendor tools to work and proprietary user space to
> communicate directly with the proprietary firmware?
I am responding to your remark about "vendors preaching
impossibilism". If you want me to agree with you that it is possible
then yes, you need to show a way where we get to a point that users
are actually able to solve their problems.
Otherwise all I hear is how you are going to NAK some unknowable
subset of the needed configurables. Sure sounds impossible to me.
Jason
#include <disclaimer.h>
On 04/04/2024 18:47, Jason Gunthorpe wrote:
> The configurables exist as they are
> and need to be supported, in one way or another, by the kernel.
Why? What does the kernel get out of it?
Maybe *you* need them to be supported, but maybe you should have
thought of that earlier in the design process. ("A failure on
your part to foresee the eminently foreseeable does not
constitute an emergency on mine.")
If we let folks bypass our standards with a _fait accompli_, we
don't really have standards in the first place.
On Thu, Apr 04, 2024 at 06:35:04PM +0100, Edward Cree wrote:
> [ Again, I had better disassociate my employer from the vitriol below,
> for which I alone am responsible. ]
>
> On 02/04/2024 19:40, Jason Gunthorpe wrote:
> > Uh no. Alot of this configuration stuff is to serve niche customer
> > interests where the customers have no interest in publishing and
> > commonizing their method of operating because it would give away their
> > own operator secrets. The vendor's created it because their big
> > customers demanded it.
> >
> > eg there are configurables in mlx5 that exist *soley* to accomodate
> > certain customer's pre-existing SW.
>
> So it's a single-user hack, why do you want support for it in upstream?
> Oh right, because you want to be able to ship single-user hacks to all
> your customers while still getting the _cachet_ of being An Intree
> Driver with the implied engineering benefits of open source — despite
> the actual feature implementations being obscured in firmware that's
> not subject to the review process and thus doesn't actually carry
> those benefits.
Uh no, mlx5 already has an excellent in-tree driver, thank you very
much. The configuration is not changing the driver, it is changing the
device.
Consider, I can ship unique devices pre-configured for each site's
special needs. They still work with the same inbox driver.
In fact that happens already for certain large customers. It is why
Jakub recently pointed out that Meta doesn't need any
provisioning/mlx5ctl/misc driver. They just buy different devices than
everyone else.
So, it is really some kind of extremism to say that allowing users to
configure the device in their own system in a booted Linux OS instead
of in the factory looses the "implied engineering benefits of open
source".
Further it is really rude and inappropriate to say that some customers
should not be able to enjoy intree drivers because of your aesthetic
opinion of hacks in a device's design.
> > There are something like 600-800 configurables in mlx5
>
> So because your engineers can't design clean, orthogonal APIs for
> toffee, you should be allowed to bypass review? Sorry, but no.
Overreach. The job of the kernel maintainer is to review the driver
software, not the device design.
> > Where is the screaming? Where has keeping blessed support out of
> > the kernel got us?
>
> Well, clearly *someone* wants you to supply them an in-tree driver,
> else you wouldn't be trying to upstream this.
We already have an intree driver for this access, it is built into
sysfs. Tooling exists everything is deployed, people are happy.
If you had read the thread to understand the issue, you'd know this is
because the distros have turned on module signing, secure boot and
kernel lock down.
This disables the ability to have an OOT driver, it disables the
existing sysfs scheme and it forces everything to be in Linus's tree
or it does not exist. It effectively breaks the long existing
ecosystem in this space.
Arguing that keeping things working as they have been working for the
last 10 years is somehow going to wreck opensource or whatever is
absurd.
Jason
On Thu, Apr 04, 2024 at 07:06:53PM +0100, Edward Cree wrote:
> #include <disclaimer.h>
>
> On 04/04/2024 18:47, Jason Gunthorpe wrote:
> > The configurables exist as they are
> > and need to be supported, in one way or another, by the kernel.
>
> Why? What does the kernel get out of it?
>
> Maybe *you* need them to be supported, but maybe you should have
> thought of that earlier in the design process. ("A failure on
> your part to foresee the eminently foreseeable does not
> constitute an emergency on mine.")
> If we let folks bypass our standards with a _fait accompli_, we
> don't really have standards in the first place.
Sorry, who are "we" and what are "our standards"?
And, please, please reread the thread, it is not about devlink at all.
Thanks
> > There are something like 600-800 configurables in mlx5
I keep seeing that 600-800, and thinking that 2 to the power 600 is a
very big number, assuming they are simple on/off configurations.
Does any user actually have the 'optimal' configuration? How can they
even know?
Andrew
disclaimer.sh
On 04/04/2024 19:33, Jason Gunthorpe wrote:
> Uh no, mlx5 already has an excellent in-tree driver, thank you very
> much.
I was referring to *mlx5ctl*, which is not currently in-tree, which
is why this thread trying to add it exists in the first place.
> So, it is really some kind of extremism to say that allowing users to
> configure the device in their own system in a booted Linux OS instead
Um, nothing upstream does is stopping them installing an OOT mlx5ctl
driver, *if* that's what they want to do. Clearly some of them don't
like that solution, otherwise we wouldn't be here.
> of in the factory looses the "implied engineering benefits of open
> source".
You're looking at the wrong point on the causal graph.
Whether you apply the hacks in the factory or at runtime, their hacky
design is what prevents them from having the benefits of open source
(because those benefits consist of a development process that weeds
out hacks and bad design).
It is just that the latter case, if done through an intree driver,
would appear to be (and might be marketed as) an open-source developed
product, which users would naturally expect to have those benefits,
when in fact it doesn't.
>> So because your engineers can't design clean, orthogonal APIs for
>> toffee, you should be allowed to bypass review? Sorry, but no.
>
> Overreach. The job of the kernel maintainer is to review the driver
> software, not the device design.
Really? [1]
The kernel always has the choice to not support a given device, or a
given feature of a device; and kernel maintainers are precisely the
people with both the right and the duty to make that determination.
> If you had read the thread to understand the issue, you'd know this is
> because the distros have turned on module signing, secure boot and
> kernel lock down.
Funnily enough, I am aware of that.
And if your customers didn't want those things, they'd be quite capable
of forking the distro to undo it. Several hyperscalers already have
their own in-house distros anyway.
They could add their own signing key to the kernel, and sign your ctl
driver with it.
They could disable lockdown, or patch the kernel to allow your
(hopefully signed and IMA-ed) userspace tool to do its PCI-over-sysfs
crap even when lockdown blocks it for everything else.
But strangely, there are people out there who trust the upstream process
to ensure quality/security/etc. more than they trust vendors and their
"oh don't worry, the device will enforce security scopes on these raw
commands from userspace" magic firmware blobs.
What you are asking for here is a special exemption to all those
requirements and security measures because "just trust me bro".
-ed
[1]: https://wiki.linuxfoundation.org/networking/toe
On 04/04/2024 19:35, Leon Romanovsky wrote:
> On Thu, Apr 04, 2024 at 07:06:53PM +0100, Edward Cree wrote:
>> Why? What does the kernel get out of it?
>>
>> Maybe *you* need them to be supported, but maybe you should have
>> thought of that earlier in the design process. ("A failure on
>> your part to foresee the eminently foreseeable does not
>> constitute an emergency on mine.")
>> If we let folks bypass our standards with a _fait accompli_, we
>> don't really have standards in the first place.
>
> Sorry, who are "we" and what are "our standards"?
As should be obvious from context, "we" in that sentence referred to
the mainline kernel. And while participants in this thread currently
disagree on what "our standards" are, I hope it is not contentious
that the kernel community *does* have standards as to what code and
design is acceptable for inclusion.
Necessarily, for those standards to be meaningful, there cannot be an
exception for "oh, but we already built our product, it's too late
now", otherwise everyone would just ignore the standards and then
demand to be merged anyway.
That is true regardless of the specific standards in question.
> And, please, please reread the thread, it is not about devlink at all.
Please, please, reread the email to which you are replying; it is
not about devlink at all.
-ed
On Thu, 4 Apr 2024 15:33:05 -0300 Jason Gunthorpe wrote:
> Uh no, mlx5 already has an excellent in-tree driver, thank you very
> much. The configuration is not changing the driver, it is changing the
> device.
>
> Consider, I can ship unique devices pre-configured for each site's
> special needs. They still work with the same inbox driver.
>
> In fact that happens already for certain large customers. It is why
> Jakub recently pointed out that Meta doesn't need any
> provisioning/mlx5ctl/misc driver. They just buy different devices than
> everyone else.
> Further it is really rude and inappropriate to say that some customers
> should not be able to enjoy intree drivers because of your aesthetic
> opinion of hacks in a device's design.
To my knowledge the "customizations" are mostly around fitting into OCP
servers. Those unfamiliar with how hyperscalers operate can mentally
replace $hyperscaler with HP or Dell in your message. Minus all the
proprietary OOB management stuff those guys also provide.
> Overreach. The job of the kernel maintainer is to review the driver
> software, not the device design.
Agreed. I rarely if ever comment on what I think about device design.
Discussion is about mlx5ctrl not "the device" as much as you'd like
to equate the two.
On Thu, Apr 04, 2024 at 12:53:39PM -0700, Jakub Kicinski wrote:
> On Thu, 4 Apr 2024 15:33:05 -0300 Jason Gunthorpe wrote:
> > Uh no, mlx5 already has an excellent in-tree driver, thank you very
> > much. The configuration is not changing the driver, it is changing the
> > device.
> >
> > Consider, I can ship unique devices pre-configured for each site's
> > special needs. They still work with the same inbox driver.
> >
> > In fact that happens already for certain large customers. It is why
> > Jakub recently pointed out that Meta doesn't need any
> > provisioning/mlx5ctl/misc driver. They just buy different devices than
> > everyone else.
>
> > Further it is really rude and inappropriate to say that some customers
> > should not be able to enjoy intree drivers because of your aesthetic
> > opinion of hacks in a device's design.
>
> To my knowledge the "customizations" are mostly around fitting into OCP
> servers.
Nope. I understand it is significant. If Meta had to work with a COTS
environment like a HP/Dell customer then Meta would have a list of
flash configurables to set. I think you greatly underestimate the
privilege of being at a hyperscaler and having vendors create custom
products just for you..
> Those unfamiliar with how hyperscalers operate can mentally
> replace $hyperscaler with HP or Dell in your message. Minus all the
> proprietary OOB management stuff those guys also provide.
A significant Dell customer will get a server pre-populated with a NIC
with some generic Dell configuration. In most cases the customer will
have to then program the flash to match their needs. Set a specific FW
version, set site specific configurables, etc.
Similar to how a Dell customer will have to change the BIOS settings
in the Dell to match their needs.
> Discussion is about mlx5ctrl not "the device" as much as you'd like
> to equate the two.
I view mlx5ctl/fwctl as a window into the device.
Jason
On Thu, Apr 04, 2024 at 08:44:56PM +0200, Andrew Lunn wrote:
> > > There are something like 600-800 configurables in mlx5
>
> I keep seeing that 600-800, and thinking that 2 to the power 600 is a
> very big number, assuming they are simple on/off configurations.
>
> Does any user actually have the 'optimal' configuration? How can they
> even know?
First, the big number is the union of all configurables on all devices
released over the past 20 years that work with the mlx5
driver. Looking at a single device will see a much smaller number of
configurable.
Second, much of it is not performance tuning. The site has a
functional workload they need to run and configurations need to be
adjusted to match.
For instance there is a bunch that configure the UEFI option
rom. Various devices can PXE boot, iSCSI boot and so on.
There is other stuff to turn on/off functional features which can be
quite often because stuff just needs to be known at FLR time.
I'm sure there are things in a more classical tunable space but it
going to be a more workably small number than 600..
Jason
On 04/04/2024 21:25, Jason Gunthorpe wrote:
> For instance there is a bunch that configure the UEFI option
> rom. Various devices can PXE boot, iSCSI boot and so on.
Sounds like a perfect example of something that should have a
standardised cross-vendor configuration mechanism, rather than
every vendor having its own special tool as is sadly the case
today.
On Thu, 4 Apr 2024 17:44:54 -0300 Jason Gunthorpe wrote:
> > To my knowledge the "customizations" are mostly around fitting into OCP
> > servers.
>
> Nope. I understand it is significant. If Meta had to work with a COTS
> environment like a HP/Dell customer then Meta would have a list of
> flash configurables to set. I think you greatly underestimate the
> privilege of being at a hyperscaler and having vendors create custom
> products just for you..
>
> > Those unfamiliar with how hyperscalers operate can mentally
> > replace $hyperscaler with HP or Dell in your message. Minus all the
> > proprietary OOB management stuff those guys also provide.
>
> A significant Dell customer will get a server pre-populated with a NIC
> with some generic Dell configuration. In most cases the customer will
> have to then program the flash to match their needs. Set a specific FW
> version, set site specific configurables, etc.
>
> Similar to how a Dell customer will have to change the BIOS settings
> in the Dell to match their needs.
I can only guess that you are again thinking about RDMA/HPC.
Flashing tunables is not a workable solution for extremely varied
and ephemeral TCP/IP workloads :|
On Thu, Apr 04, 2024 at 08:46:41PM +0100, Edward Cree wrote:
> On 04/04/2024 19:35, Leon Romanovsky wrote:
> > On Thu, Apr 04, 2024 at 07:06:53PM +0100, Edward Cree wrote:
> >> Why? What does the kernel get out of it?
> >>
> >> Maybe *you* need them to be supported, but maybe you should have
> >> thought of that earlier in the design process. ("A failure on
> >> your part to foresee the eminently foreseeable does not
> >> constitute an emergency on mine.")
> >> If we let folks bypass our standards with a _fait accompli_, we
> >> don't really have standards in the first place.
> >
> > Sorry, who are "we" and what are "our standards"?
>
> As should be obvious from context, "we" in that sentence referred to
> the mainline kernel. And while participants in this thread currently
> disagree on what "our standards" are, I hope it is not contentious
> that the kernel community *does* have standards as to what code and
> design is acceptable for inclusion.
You didn't answer my question. What are "our standards"?
Thanks
On Thu, Apr 04, 2024 at 09:53:14PM +0100, Edward Cree wrote:
> On 04/04/2024 21:25, Jason Gunthorpe wrote:
> > For instance there is a bunch that configure the UEFI option
> > rom. Various devices can PXE boot, iSCSI boot and so on.
>
> Sounds like a perfect example of something that should have a
> standardised cross-vendor configuration mechanism, rather than
> every vendor having its own special tool as is sadly the case
> today.
Sure, I imagine exploring this kind of commonality in userspace with a
fwctl based tool. The kernel is the wrong layer to do it.
Jason
On Thu, Apr 04, 2024 at 02:34:07PM -0700, Jakub Kicinski wrote:
> On Thu, 4 Apr 2024 17:44:54 -0300 Jason Gunthorpe wrote:
> > > To my knowledge the "customizations" are mostly around fitting into OCP
> > > servers.
> >
> > Nope. I understand it is significant. If Meta had to work with a COTS
> > environment like a HP/Dell customer then Meta would have a list of
> > flash configurables to set. I think you greatly underestimate the
> > privilege of being at a hyperscaler and having vendors create custom
> > products just for you..
> >
> > > Those unfamiliar with how hyperscalers operate can mentally
> > > replace $hyperscaler with HP or Dell in your message. Minus all the
> > > proprietary OOB management stuff those guys also provide.
> >
> > A significant Dell customer will get a server pre-populated with a NIC
> > with some generic Dell configuration. In most cases the customer will
> > have to then program the flash to match their needs. Set a specific FW
> > version, set site specific configurables, etc.
> >
> > Similar to how a Dell customer will have to change the BIOS settings
> > in the Dell to match their needs.
>
> I can only guess that you are again thinking about RDMA/HPC.
> Flashing tunables is not a workable solution for extremely varied
> and ephemeral TCP/IP workloads :|
As I answered to Anderew, a lot is functional behavior not so much
"tunables". The same way many BIOS settings are not all tunables but
have functional impacts to the machine. Like enable SRIOV, for
instance.
Even for dataplane tunables - you know there are micro-architectural
performance tunables set in the special Meta NICs that are wired just
for Meta's special use case? Apparently that is actually perfectly
workable.
It is really strange to hear you act like "Meta doesn't need
provisioning or tuning" when the NIC Meta uses is *highly* customized
specifically for Meta to the point it is an entirely different
product. Of course you don't need provisioning, alot of other people
did alot of hard work to make it that way.
So please don't use that as a justification to pull up the ladder so
nobody else can enjoy even a semi-customized device.
Jason
On Thu, Apr 04, 2024 at 08:31:24PM +0100, Edward Cree wrote:
> > of in the factory looses the "implied engineering benefits of open
> > source".
>
> You're looking at the wrong point on the causal graph.
> Whether you apply the hacks in the factory or at runtime, their hacky
> design is what prevents them from having the benefits of open source
> (because those benefits consist of a development process that weeds
> out hacks and bad design).
Oh please. They are not hacks, you are inventing this idea entirely
out of thin air based on your own lack of detailed knowledge or
personal aesthetic preferences.
> > Overreach. The job of the kernel maintainer is to review the driver
> > software, not the device design.
>
> Really? [1]
Yep. TOEs are supported in the RDMA stack.
The only thing the kernel community decided was that a TOE cannot be
accessed using the socket API, the HW is still supported.
> But strangely, there are people out there who trust the upstream process
> to ensure quality/security/etc. more than they trust vendors and their
> "oh don't worry, the device will enforce security scopes on these raw
> commands from userspace" magic firmware blobs.
Upstream Linux community has already decided trusting firwmare is
acceptable. I gave many examples in the fwctl document if you'd care
to understand it.
And of course we get the usual garbage solution "oh just fork linux"
which *completely misses the point* of what Linux really is or what
the community is fundamentally seeking to achieve.
> What you are asking for here is a special exemption to all those
> requirements and security measures because "just trust me bro".
Actually no, fwctl is following accepted kernel design principles with
many prior examples.
Jason
On Fri, 5 Apr 2024 08:13:06 -0300 Jason Gunthorpe wrote:
> As I answered to Anderew, a lot is functional behavior not so much
> "tunables". The same way many BIOS settings are not all tunables but
> have functional impacts to the machine. Like enable SRIOV, for
> instance.
Thanks, SRIOV is a great example:
https://docs.kernel.org/next/networking/devlink/devlink-params.html#id2
Literally the first devlink param on the list.
"We will flash it for you" seems to be what vendors like to offer.
> Even for dataplane tunables - you know there are micro-architectural
> performance tunables set in the special Meta NICs that are wired just
> for Meta's special use case? Apparently that is actually perfectly
> workable.
The only "tunables" I'm aware of were for the OCP Yosemite platform,
which is an interesting beast with 4 hosts plugged into one NIC,
and constrained PCIe BW. Which is why I said the "tunables" are really
about the server platform not being off the shelf. Updating NIC FW
to fix server compatibility is hardly unusual.
> It is really strange to hear you act like "Meta doesn't need
> provisioning or tuning" when the NIC Meta uses is *highly* customized
> specifically for Meta to the point it is an entirely different
> product. Of course you don't need provisioning, alot of other people
> did alot of hard work to make it that way.
:) When you say *highly* I think I know what you mean :)
It'd be unprofessional for us to discuss here, and I really doubt
you actually want to air that laundry publicly :) :)
> So please don't use that as a justification to pull up the ladder so
> nobody else can enjoy even a semi-customized device.
So in this thread I'm pulling up the ladder and in the fbnic one I'm
not (as I hope you'd agree)? One could hopefully be forgiven for
wondering to what extent your assessment of my intentions is colored
by whether they align with your particular goals :(
On Fri, 5 Apr 2024 08:38:27 -0700 Jakub Kicinski wrote:
> > It is really strange to hear you act like "Meta doesn't need
> > provisioning or tuning" when the NIC Meta uses is *highly* customized
> > specifically for Meta to the point it is an entirely different
> > product. Of course you don't need provisioning, alot of other people
> > did alot of hard work to make it that way.
>
> :) When you say *highly* I think I know what you mean :)
> It'd be unprofessional for us to discuss here, and I really doubt
> you actually want to air that laundry publicly :) :)
Maybe that's unnecessary air of mystery. Long time ago there was
a concern about impact of the rapidly developing eswitch offload
market(?) on FW stability so a requirement was put forward to
*compile out* major unused FW features. Such requirement is no
longer in place (or fulfilled) largely due to my support.
I wish I could support that out by referring to the OCP NIC SW spec,
by it's stuck in "legal review" of one of the vendors for months.
I'd like to ask that vendor not to pull up the ladder and let everyone
else enjoy access to NIC requirements and recommendations from Meta
and Google.
On 4/5/24 12:41, Leon Romanovsky wrote:
> On Thu, Apr 04, 2024 at 08:46:41PM +0100, Edward Cree wrote:
>> On 04/04/2024 19:35, Leon Romanovsky wrote:
>>> On Thu, Apr 04, 2024 at 07:06:53PM +0100, Edward Cree wrote:
>>>> Why? What does the kernel get out of it?
>>>>
>>>> Maybe *you* need them to be supported, but maybe you should have
>>>> thought of that earlier in the design process. ("A failure on
>>>> your part to foresee the eminently foreseeable does not
>>>> constitute an emergency on mine.")
>>>> If we let folks bypass our standards with a _fait accompli_, we
>>>> don't really have standards in the first place.
>>>
>>> Sorry, who are "we" and what are "our standards"?
>>
>> As should be obvious from context, "we" in that sentence referred to
>> the mainline kernel. And while participants in this thread currently
>> disagree on what "our standards" are, I hope it is not contentious
>> that the kernel community *does* have standards as to what code and
>> design is acceptable for inclusion.
>
> You didn't answer my question. What are "our standards"?
>
> Thanks
>
Our standards include use of tone that is welcoming, and, in this
thread, you are aggressive instead.
With that said, I appreciate your review feedback that I have received
from you.
On Fri, Apr 05, 2024 at 08:38:27AM -0700, Jakub Kicinski wrote:
> On Fri, 5 Apr 2024 08:13:06 -0300 Jason Gunthorpe wrote:
> > As I answered to Anderew, a lot is functional behavior not so much
> > "tunables". The same way many BIOS settings are not all tunables but
> > have functional impacts to the machine. Like enable SRIOV, for
> > instance.
>
> Thanks, SRIOV is a great example:
> https://docs.kernel.org/next/networking/devlink/devlink-params.html#id2
> Literally the first devlink param on the list.
It is too basic to be really usable, unfortunately - recall my earlier
remarks that a site needs to configure everything, not just the 2 SRIOV
related values in that take. Those are nice generic ones, but others
are not.
> The only "tunables" I'm aware of were for the OCP Yosemite platform,
> which is an interesting beast with 4 hosts plugged into one NIC,
> and constrained PCIe BW. Which is why I said the "tunables" are really
> about the server platform not being off the shelf. Updating NIC FW
> to fix server compatibility is hardly unusual.
I don't think it is appropriate to go into details of all the stuff
that happens in the commercial relationship between Meta and
NVIDIA.. There is lots of history there.
My main point is, for others reading these threads, that taking a COTS
device like mlx5 and essentially forking it for a single user's
special requirements is pretty much standard operating procedure
now. Some people enjoy this with custom devices and custom FW, some
people have to run standard FW and customize it at provisioning time.
Edward accused this all of being "hacks", but I strongly
disagree. Having device specific customization and parameters that
make sense for the device architecture is not a hack. Functional
changes are not the same was weird performance tunables.
> > So please don't use that as a justification to pull up the ladder so
> > nobody else can enjoy even a semi-customized device.
>
> So in this thread I'm pulling up the ladder and in the fbnic one I'm
> not (as I hope you'd agree)? One could hopefully be forgiven for
> wondering to what extent your assessment of my intentions is colored
> by whether they align with your particular goals :(
I'm not sure what your position is on fbnic TBH. I said mine, I think
it would be fine to merge that series - but I'm pretty moderate in my
views of what should be accepted to Linux for ideological reasons.
Jason
On Fri, Apr 05, 2024 at 10:48:29AM -0700, Jakub Kicinski wrote:
> I wish I could support that out by referring to the OCP NIC SW spec,
> by it's stuck in "legal review" of one of the vendors for months.
> I'd like to ask that vendor not to pull up the ladder and let everyone
> else enjoy access to NIC requirements and recommendations from Meta
> and Google.
Heh. Building standards is hard. It is a delicate game of politics to
convince everyone that they are getting some kind of win-win-win,
especially when IP and competitive concerns becomes involved. Many
attempts fail while still immature, it is why there is so much value
in the successful ones.
Jason
On 3/4/24 9:02 AM, Jason Gunthorpe wrote:
> On Wed, Feb 14, 2024 at 01:57:35PM -0400, Jason Gunthorpe wrote:
>
>> I also like this, I don't want the outcome of this discussion to be
>> that only mlx5ctl gets merged. I want all the HW that has this problem
>> to have support in the mainline kernel.
>
> To this end here is my proposal to move forward with a new
> mini-subsystem to provide rules for this common approach. Get the
> existing tools out of hacky unstructured direct hardware access via
> /sys/XX and into a formalized lockdown compatible system. I've talked
> to enough people now to think we have a critical mass.
>
> ===============
> fwctl subsystem
> ===============
The aux bus infrastructure was created specifically for multifunction
devices -- it allows a core PCI device driver with smaller, subsystem
focused drivers for vendor specific implementations of various S/W stack
APIs (IB, netdev, etc). One aspect not addressed in that design is where
to park the various drivers and extensions that are not solely tied to
any one subsystem.
Given, that how about moving the existing auxbus code into a new
directory, drivers/auxbus. From there, create a subdirectory for this
new fwctl subsystem which is most likely to be realized as a new auxbus
device and then subdirectories for vendor specific drivers for the aux
bus device. Then new drivers being developed in this auxbus world can
put the core PCI device handling code under drivers/auxbus/core.
In short:
- drivers/auxbus/auxiliary.c
- drivers/auxbus/core/<vendor>/ - per h/w device driver for managing
the PCI device which is shared across multiple auxbus devices
- drivers/auxbus/fwctl/fwctl.c - this FW interface
- drivers/auxbus/fwctl/<vendor>/ - vendor specific driver for a fwctl
auxbus device
On Mon, Apr 29, 2024 at 07:36:41PM -0600, David Ahern wrote:
> On 3/4/24 9:02 AM, Jason Gunthorpe wrote:
> > On Wed, Feb 14, 2024 at 01:57:35PM -0400, Jason Gunthorpe wrote:
> >
> >> I also like this, I don't want the outcome of this discussion to be
> >> that only mlx5ctl gets merged. I want all the HW that has this problem
> >> to have support in the mainline kernel.
> >
> > To this end here is my proposal to move forward with a new
> > mini-subsystem to provide rules for this common approach. Get the
> > existing tools out of hacky unstructured direct hardware access via
> > /sys/XX and into a formalized lockdown compatible system. I've talked
> > to enough people now to think we have a critical mass.
> >
> > ===============
> > fwctl subsystem
> > ===============
>
>
> The aux bus infrastructure was created specifically for multifunction
> devices -- it allows a core PCI device driver with smaller, subsystem
> focused drivers for vendor specific implementations of various S/W stack
> APIs (IB, netdev, etc). One aspect not addressed in that design is where
> to park the various drivers and extensions that are not solely tied to
> any one subsystem.
>
> Given, that how about moving the existing auxbus code into a new
> directory, drivers/auxbus. From there, create a subdirectory for this
> new fwctl subsystem which is most likely to be realized as a new auxbus
> device and then subdirectories for vendor specific drivers for the aux
> bus device. Then new drivers being developed in this auxbus world can
> put the core PCI device handling code under drivers/auxbus/core.
>
> In short:
> - drivers/auxbus/auxiliary.c
Ah, the ever-frequent "where do we put the files" discussion :)
Is it "per bus" or "per functionality"? In the end, it's a bit of both,
HOWEVER, no, auxiliary.c belongs in drivers/base/ for now, why move it?
And really, stuff on the auxbus has nothing to do with the auxbus, it
has everything to do with with the common device that auxbus really is
emulating (i.e. a pci device with multi-functions, or whatever.)
> - drivers/auxbus/core/<vendor>/ ?- per h/w device driver for managing
> the PCI device which is shared across multiple auxbus devices
Why "vendor"? What happens when vendor names change? :)
> - drivers/auxbus/fwctl/fwctl.c - this FW interface
fwctl has nothing to do with auxbus other than it might be a user of it,
right? So again, no need for it to be under auxbus/ as that is not
needed.
> - drivers/auxbus/fwctl/<vendor>/ - vendor specific driver for a fwctl
> auxbus device
if you want drivers/fwctl/<VENDOR>/ if <VENDOR> numbers get huge, sure
make subdirs, but really, why create a subdir like that if you don't
need it?
Start simple please, you can always move files later if it gets too
complex. I think that discussing where to place non-existant files at
this point in time is kind of funny, given we have yet to see any actual
code...
thanks,
greg k-h