2019-03-01 05:54:03

by Parav Pandit

Subject: [RFC net-next 0/8] Introducing subdev bus and devlink extension

Use case:
---------
A user wants to create/delete hardware-linked sub devices without
using SR-IOV.
For a PCI device, these sub devices can be a netdev (optionally with an
rdma device) or other devices. Such sub devices share some of the PCI
device's resources and also have their own dedicated resources.

A few examples are:
1. netdev having its own txq(s), rq(s) and/or hw offload parameters.
2. netdev with switchdev mode using netdev representor
3. rdma device with IB link layer and IPoIB netdev
4. rdma/RoCE device and a netdev
5. rdma device with multiple ports

Requirements for above use cases:
--------------------------------
1. We need a generic user interface & core APIs to create sub devices
from a parent PCI device; they should be generic enough to work for
other parent devices as well
2. Interface should be vendor agnostic
3. User should be able to set device params at creation time
4. In the future, if needed, the tool should be able to create a
passthrough device to map to a virtual machine
5. A device can have multiple ports
6. Orchestration software wants to know how many such sub devices
can be created from a parent device so that it can manage them as
global cluster resources.

So how is it done?
------------------
(a) user in control
To address the above requirements, the generic iproute2/devlink tool is
extended to manage the sub device life cycle.
However, the devlink tool and its kernel counterpart are not sufficient
to create protocol agnostic devices on an existing PCI bus.

(b) subdev bus
A given bus defines a well defined addressing scheme. Creating sub devices
on the existing PCI bus with a different naming scheme is just weird.
So, creating well named devices on an appropriate bus is desired.

Hence a new 'subdev' bus is created.
The user adds/removes sub devices on this bus via the devlink tool.
The devlink tool instructs the hardware vendor driver to
create/remove/configure such devices. The hardware vendor driver places
the devices on the bus. Another driver, or the same vendor driver, matches
them based on a vendor-id, device-id scheme and runs through the classic
device driver model.

Given that these are user created devices for a given hardware, and in
the absence of a central entity like PCI-SIG to assign vendor and device
ids, unique vendor and device ids are maintained as enums in
include/linux/subdev_ids.h.

subdev bus device names follow the default device naming scheme of the
Linux kernel: 'subdev<instance_id>', such as subdev0, subdev3.

A subdev device inherits its parent's DMA parameters.
subdev follows the rich power management infrastructure of the core kernel,
so that every vendor driver doesn't have to iterate over its child
devices or invent a locking and device anchoring scheme.
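
To make the vendor-id/device-id matching step concrete, here is a minimal
userspace sketch. It is not code from this series. The struct layout
assumes the 16-bit ids described in patch 1, and DEMO_VENDOR,
DEMO_DEVICE_SF, demo_table and subdev_id_match() are hypothetical names
used only for illustration. Treating an all-zero entry as the end of the
table is also an assumption, borrowed from the { 0, } terminator in the
mlx5 id table of patch 8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the RFC's struct subdev_id (16-bit ids assumed). */
struct subdev_id {
	uint16_t vendor_id;
	uint16_t device_id;
};

/* Hypothetical id values, used only for this illustration. */
enum { DEMO_VENDOR = 0x1, DEMO_DEVICE_SF = 0x1 };

/*
 * Return true if (vendor, device) appears in a table that is terminated
 * by an all-zero entry. Stopping at the terminator keeps the walk inside
 * the array.
 */
static bool subdev_id_match(const struct subdev_id *table,
			    uint16_t vendor, uint16_t device)
{
	const struct subdev_id *id;

	assert(table != NULL);
	for (id = table; id->vendor_id || id->device_id; id++) {
		if (id->vendor_id == vendor && id->device_id == device)
			return true;
	}
	return false;
}

/* A driver's id table: one supported device, then the terminator. */
static const struct subdev_id demo_table[] = {
	{ .vendor_id = DEMO_VENDOR, .device_id = DEMO_DEVICE_SF },
	{ 0, 0 },
};
```

A bus match callback would call something like subdev_id_match() with the
ids stored in the device, and bind the driver only when it returns true.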

Patchset summary:
-----------------
Patch-1, 2 introduce a subdev bus and an interface for the subdev life cycle.
Patch-3 extends the modpost tool for the subdev module device id table.
Patch-4, 5, 6 implement devlink support for a vendor driver to add/remove
devices.
Patch-7 implements subdev devices in the mlx5 driver and places them on the
subdev bus.
Patch-8 matches against the subdev for the mlx5 vendor and device id and
creates a fake netdevice.

All patches are only a reference implementation to see the RFC at work
at the devlink, sysfs and device model level. Once the RFC looks good, a
more solid upstreamable version of the implementation will be done.
All patches are functional except the last two patches, which just
create fake subdev devices and a fake netdevice.

System example view:
--------------------

$ devlink dev show
pci/0000:05:00.0

$ devlink dev add pci/0000:05:00.0
$ devlink dev show
pci/0000:05:00.0
subdev/subdev0

sysfs view with subdev:

$ ls -l /sys/bus/pci/devices/0000:05:00.0
[..]
drwxr-xr-x 3 root root 0 Feb 13 15:57 infiniband
-rw-r--r-- 1 root root 4096 Feb 13 15:57 msi_bus
drwxr-xr-x 3 root root 0 Feb 13 15:57 net
drwxr-xr-x 2 root root 0 Feb 13 15:57 power
drwxr-xr-x 3 root root 0 Feb 13 15:57 ptp
drwxr-xr-x 4 root root 0 Feb 13 15:57 subdev0

$ ls -l /sys/bus/pci/devices/0000:05:00.0/subdev0
lrwxrwxrwx 1 root root 0 Feb 13 15:58 driver -> ../../../../../bus/subdev/drivers/mlx5_core
drwxr-xr-x 3 root root 0 Feb 13 15:58 net
drwxr-xr-x 2 root root 0 Feb 13 15:58 power
lrwxrwxrwx 1 root root 0 Feb 13 15:58 subsystem -> ../../../../../bus/subdev
-rw-r--r-- 1 root root 4096 Feb 13 15:58 uevent

$ ls -l /sys/bus/pci/devices/0000:05:00.0/subdev0/net/
drwxr-xr-x 5 root root 0 Feb 13 15:58 eth0

Software view:
-------------
For those who prefer a picture, the diagram below shows the software
modules in the bus/device hierarchy.

devlink user (iproute2/devlink)
------------------------------
         |
         |
+----------------+
| devlink module |
|         doit() |     +------------------+
|            |   |     |  vendor driver   |
+------------|---+     |      (mlx5)      |
             ----------+-> subdev_ops()   |
                       +|-----------------+
                        |
              +---------|--+    +-----------+   +------------------+
              | subdev bus |    |   core    |   |  subdev device   |
              |   driver   |    |  kernel   |   |     drivers      |
              | (add/del)  |    | dev model |   |  (netdev, rdma)  |
              |  -------------------------------> probe/remove()   |
              +------------+    +-----------+   +------------------+

Alternatives considered:
------------------------
Will discuss separately if needed to keep this RFC short.


Parav Pandit (8):
subdev: Introducing subdev bus
subdev: Introduce pm callbacks
modpost: Add support for subdev device id table
devlink: Introduce and use devlink_init/cleanup() in alloc/free
devlink: Add variant of devlink_register/unregister
devlink: Add support for devlink subdev lifecycle
net/mlx5: Add devlink subdev life cycle command support
net/mlx5: Add subdev driver to bind to subdev devices

drivers/Kconfig | 2 +
drivers/Makefile | 1 +
drivers/net/ethernet/mellanox/mlx5/core/Makefile | 1 +
drivers/net/ethernet/mellanox/mlx5/core/main.c | 12 +-
.../net/ethernet/mellanox/mlx5/core/mlx5_core.h | 7 +
drivers/net/ethernet/mellanox/mlx5/core/subdev.c | 55 ++++++
.../ethernet/mellanox/mlx5/core/subdev_driver.c | 93 +++++++++
drivers/subdev/Kconfig | 12 ++
drivers/subdev/Makefile | 8 +
drivers/subdev/subdev_main.c | 212 +++++++++++++++++++++
include/linux/mod_devicetable.h | 12 ++
include/linux/subdev_bus.h | 63 ++++++
include/linux/subdev_ids.h | 17 ++
include/net/devlink.h | 29 ++-
include/uapi/linux/devlink.h | 3 +
net/core/devlink.c | 179 +++++++++++++++--
scripts/mod/devicetable-offsets.c | 4 +
scripts/mod/file2alias.c | 15 ++
18 files changed, 704 insertions(+), 21 deletions(-)
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/subdev.c
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
create mode 100644 drivers/subdev/Kconfig
create mode 100644 drivers/subdev/Makefile
create mode 100644 drivers/subdev/subdev_main.c
create mode 100644 include/linux/subdev_bus.h
create mode 100644 include/linux/subdev_ids.h

--
1.8.3.1



2019-03-01 05:39:24

by Parav Pandit

Subject: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to subdev devices

Add a subdev driver that probes subdev devices and creates a fake
netdevice for the probed device.

Signed-off-by: Parav Pandit <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/Makefile | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/main.c | 8 +-
.../net/ethernet/mellanox/mlx5/core/mlx5_core.h | 3 +
.../ethernet/mellanox/mlx5/core/subdev_driver.c | 93 ++++++++++++++++++++++
4 files changed, 104 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index f218789..c8aeaf1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -16,7 +16,7 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
fs_counters.o rl.o lag.o dev.o events.o wq.o lib/gid.o \
lib/devcom.o diag/fs_tracepoint.o diag/fw_tracer.o
-mlx5_core-$(CONFIG_SUBDEV) += subdev.o
+mlx5_core-$(CONFIG_SUBDEV) += subdev.o subdev_driver.o

#
# Netdev basic
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 5f8cf0d..7dfa8c4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1548,7 +1548,11 @@ static int __init init(void)
mlx5e_init();
#endif

- return 0;
+ err = subdev_register_driver(&mlx5_subdev_driver);
+ if (err)
+ pci_unregister_driver(&mlx5_core_driver);
+
+ return err;

err_debug:
mlx5_unregister_debugfs();
@@ -1557,6 +1561,8 @@ static int __init init(void)

static void __exit cleanup(void)
{
+ subdev_unregister_driver(&mlx5_subdev_driver);
+
#ifdef CONFIG_MLX5_CORE_EN
mlx5e_cleanup();
#endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 2a54148..1b733c7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -41,12 +41,15 @@
#include <linux/ptp_clock_kernel.h>
#include <linux/mlx5/cq.h>
#include <linux/mlx5/fs.h>
+#include <linux/subdev_bus.h>

#define DRIVER_NAME "mlx5_core"
#define DRIVER_VERSION "5.0-0"

extern uint mlx5_core_debug_mask;

+extern struct subdev_driver mlx5_subdev_driver;
+
#define mlx5_core_dbg(__dev, format, ...) \
dev_dbg(&(__dev)->pdev->dev, "%s:%d:(pid %d): " format, \
__func__, __LINE__, current->pid, \
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
new file mode 100644
index 0000000..880aa4f
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018-19 Mellanox Technologies
+
+#include <linux/module.h>
+#include <linux/dma-mapping.h>
+#include <linux/subdev_bus.h>
+#include <linux/subdev_ids.h>
+#include <linux/etherdevice.h>
+
+struct mlx5_subdev_ndev {
+ struct net_device ndev;
+};
+
+static void mlx5_dma_test(struct device *dev)
+{
+ dma_addr_t pa;
+ void *va;
+
+ va = dma_alloc_coherent(dev, 4096, &pa, GFP_KERNEL);
+ if (va)
+ dma_free_coherent(dev, 4096, va, pa);
+}
+
+static struct net_device *ndev;
+
+static int mlx5e_subdev_open(struct net_device *netdev)
+{
+ return 0;
+}
+
+static int mlx5e_subdev_close(struct net_device *netdev)
+{
+ return 0;
+}
+
+static netdev_tx_t
+mlx5e_subdev_xmit(struct sk_buff *skb, struct net_device *netdev)
+{
+ return NETDEV_TX_BUSY;
+}
+
+const struct net_device_ops mlx5e_subdev_netdev_ops = {
+ .ndo_open = mlx5e_subdev_open,
+ .ndo_stop = mlx5e_subdev_close,
+ .ndo_start_xmit = mlx5e_subdev_xmit,
+};
+
+static int mlx5_subdev_probe(struct device *dev)
+{
+ int err;
+
+ mlx5_dma_test(dev);
+ /* Only one device supported in rfc */
+ if (ndev)
+ return 0;
+
+ ndev = alloc_etherdev_mqs(sizeof(struct mlx5_subdev_ndev), 1, 1);
+ if (!ndev)
+ return -ENOMEM;
+
+ SET_NETDEV_DEV(ndev, dev);
+ ndev->netdev_ops = &mlx5e_subdev_netdev_ops;
+ err = register_netdev(ndev);
+ if (err) {
+ free_netdev(ndev);
+ ndev = NULL;
+ }
+ return err;
+}
+
+static int mlx5_subdev_remove(struct device *dev)
+{
+ if (ndev) {
+ unregister_netdev(ndev);
+ free_netdev(ndev);
+ ndev = NULL;
+ }
+ return 0;
+}
+
+static const struct subdev_id mlx5_subdev_id_table[] = {
+ { .vendor_id = SUBDEV_VENDOR_ID_MELLANOX,
+ .device_id = SUBDEV_DEVICE_ID_MELLANOX_SF },
+ { 0, }
+};
+MODULE_DEVICE_TABLE(subdev, mlx5_subdev_id_table);
+
+struct subdev_driver mlx5_subdev_driver = {
+ .id_table = mlx5_subdev_id_table,
+ .driver.name = "mlx5_subdev_driver",
+ .driver.probe = mlx5_subdev_probe,
+ .driver.remove = mlx5_subdev_remove,
+};
--
1.8.3.1


2019-03-01 05:39:35

by Parav Pandit

Subject: [RFC net-next 5/8] devlink: Add variant of devlink_register/unregister

Add variants of devlink_register() and devlink_unregister() which do not
acquire/release the devlink_mutex lock themselves, but require that the
caller holds it.

This is required to create child devlink devices while working on a
parent devlink device.
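
The locked/unlocked split can be illustrated with a small userspace
sketch. This is only a model of the pattern, not the devlink code: the
toy struct demo_lock records whether it is held so that the unlocked
helper can assert on it, standing in for lockdep_assert_held(). All
demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock that remembers whether it is held, so helpers can check it. */
struct demo_lock {
	bool held;
};

struct demo_registry {
	struct demo_lock lock;
	int count;		/* number of registered instances */
};

static void demo_lock_acquire(struct demo_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void demo_lock_release(struct demo_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Unlocked variant: the caller must already hold reg->lock. */
static void demo_register_locked(struct demo_registry *reg)
{
	assert(reg->lock.held);	/* models lockdep_assert_held() */
	reg->count++;
}

/* Locked variant: takes the lock around the unlocked helper. */
static void demo_register(struct demo_registry *reg)
{
	demo_lock_acquire(&reg->lock);
	demo_register_locked(reg);
	demo_lock_release(&reg->lock);
}

/* A caller that already holds the lock registers a parent and a child. */
static void demo_register_parent_and_child(struct demo_registry *reg)
{
	demo_lock_acquire(&reg->lock);
	demo_register_locked(reg);
	demo_register_locked(reg);
	demo_lock_release(&reg->lock);
}
```

The point of the split is that a code path which already holds the lock,
such as one creating a child while working on the parent, can reuse the
registration logic without trying to take the same lock twice.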

Change-Id: I74417158144b28ff51ecfb2d1105c83ebefdf985
Signed-off-by: Parav Pandit <[email protected]>
---
include/net/devlink.h | 15 ++++++++++++++-
net/core/devlink.c | 36 +++++++++++++++++++++++++++++++-----
2 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index ae5e0e6..9a067b1 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -545,7 +545,9 @@ static inline struct devlink *priv_to_devlink(void *priv)
void devlink_init(struct devlink *devlink, const struct devlink_ops *ops);
void devlink_cleanup(struct devlink *devlink);
struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size);
+void __devlink_register(struct devlink *devlink, struct device *dev);
int devlink_register(struct devlink *devlink, struct device *dev);
+void __devlink_unregister(struct devlink *devlink);
void devlink_unregister(struct devlink *devlink);
void devlink_free(struct devlink *devlink);
int devlink_port_register(struct devlink *devlink,
@@ -713,6 +715,7 @@ int devlink_health_report(struct devlink_health_reporter *reporter,

static inline void devlink_init(struct devlink *devlink,
const struct devlink_ops *ops)
+{
}

static inline void devlink_cleanup(struct devlink *devlink)
@@ -725,11 +728,21 @@ static inline struct devlink *devlink_alloc(const struct devlink_ops *ops,
return kzalloc(sizeof(struct devlink) + priv_size, GFP_KERNEL);
}

-static inline int devlink_register(struct devlink *devlink, struct device *dev)
+static inline void __devlink_register(struct devlink *devlink,
+ struct device *dev)
+{
+}
+
+static inline int devlink_register(struct devlink *devlink,
+ struct device *dev)
{
return 0;
}

+static inline void __devlink_unregister(struct devlink *devlink)
+{
+}
+
static inline void devlink_unregister(struct devlink *devlink)
{
}
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 25492c6..cfbad2c 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -5262,22 +5262,49 @@ struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size)
EXPORT_SYMBOL_GPL(devlink_alloc);

/**
- * devlink_register - Register devlink instance
+ * __devlink_register - Register devlink instance
+ * Caller must hold devlink_mutex.
*
* @devlink: devlink
*/
-int devlink_register(struct devlink *devlink, struct device *dev)
+void __devlink_register(struct devlink *devlink, struct device *dev)
{
- mutex_lock(&devlink_mutex);
+ lockdep_assert_held(&devlink_mutex);
devlink->dev = dev;
list_add_tail(&devlink->list, &devlink_list);
devlink_notify(devlink, DEVLINK_CMD_NEW);
+}
+EXPORT_SYMBOL_GPL(__devlink_register);
+
+/**
+ * devlink_register - Register devlink instance
+ *
+ * @devlink: devlink
+ */
+int devlink_register(struct devlink *devlink, struct device *dev)
+{
+ mutex_lock(&devlink_mutex);
+ __devlink_register(devlink, dev);
mutex_unlock(&devlink_mutex);
return 0;
}
EXPORT_SYMBOL_GPL(devlink_register);

/**
+ * __devlink_unregister - Unregister devlink instance
+ * Caller must hold the devlink_mutex while invoking this API.
+ *
+ * @devlink: devlink
+ */
+void __devlink_unregister(struct devlink *devlink)
+{
+ lockdep_assert_held(&devlink_mutex);
+ devlink_notify(devlink, DEVLINK_CMD_DEL);
+ list_del(&devlink->list);
+}
+EXPORT_SYMBOL_GPL(__devlink_unregister);
+
+/**
* devlink_unregister - Unregister devlink instance
*
* @devlink: devlink
@@ -5285,8 +5312,7 @@ int devlink_register(struct devlink *devlink, struct device *dev)
void devlink_unregister(struct devlink *devlink)
{
mutex_lock(&devlink_mutex);
- devlink_notify(devlink, DEVLINK_CMD_DEL);
- list_del(&devlink->list);
+ __devlink_unregister(devlink);
mutex_unlock(&devlink_mutex);
}
EXPORT_SYMBOL_GPL(devlink_unregister);
--
1.8.3.1


2019-03-01 05:40:07

by Parav Pandit

Subject: [RFC net-next 3/8] modpost: Add support for subdev device id table

Add support for parsing the subdev module device id table.
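
As an illustration of the resulting alias string, the userspace sketch
below formats an alias of the "subdev:vNdN" shape. It assumes the 16-bit
vendor and device id fields described in patch 1, so each id is printed
as four upper-case hex digits, followed by the trailing wildcard.
demo_subdev_alias() is a hypothetical helper, not a modpost function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format "subdev:v<VVVV>d<DDDD>*" into buf.
 * Returns the number of characters the full alias needs, as snprintf() does.
 */
static int demo_subdev_alias(char *buf, size_t len,
			     uint16_t vendor, uint16_t device)
{
	assert(buf != NULL);
	return snprintf(buf, len, "subdev:v%04Xd%04X*",
			(unsigned int)vendor, (unsigned int)device);
}
```

The vendor and device values passed in are up to the caller; the ids
actually assigned in include/linux/subdev_ids.h are not shown here.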

Signed-off-by: Parav Pandit <[email protected]>
---
scripts/mod/devicetable-offsets.c | 4 ++++
scripts/mod/file2alias.c | 15 +++++++++++++++
2 files changed, 19 insertions(+)

diff --git a/scripts/mod/devicetable-offsets.c b/scripts/mod/devicetable-offsets.c
index 2930044..77f6b6e 100644
--- a/scripts/mod/devicetable-offsets.c
+++ b/scripts/mod/devicetable-offsets.c
@@ -225,5 +225,9 @@ int main(void)
DEVID_FIELD(typec_device_id, svid);
DEVID_FIELD(typec_device_id, mode);

+ DEVID(subdev_id);
+ DEVID_FIELD(subdev_id, vendor_id);
+ DEVID_FIELD(subdev_id, device_id);
+
return 0;
}
diff --git a/scripts/mod/file2alias.c b/scripts/mod/file2alias.c
index a37af7d..be89e8e 100644
--- a/scripts/mod/file2alias.c
+++ b/scripts/mod/file2alias.c
@@ -1287,6 +1287,20 @@ static int do_typec_entry(const char *filename, void *symval, char *alias)
return 1;
}

+/* Looks like: subdev:vNdN. */
+static int do_subdev_entry(const char *filename, void *symval, char *alias)
+{
+ DEF_FIELD(symval, subdev_id, vendor_id);
+ DEF_FIELD(symval, subdev_id, device_id);
+
+ strcpy(alias, "subdev:");
+ ADD(alias, "v", 1, vendor_id);
+ ADD(alias, "d", 1, device_id);
+
+ add_wildcard(alias);
+ return 1;
+}
+
/* Does namelen bytes of name exactly match the symbol? */
static bool sym_is(const char *name, unsigned namelen, const char *symbol)
{
@@ -1357,6 +1371,7 @@ static void do_table(void *symval, unsigned long size,
{"fslmc", SIZE_fsl_mc_device_id, do_fsl_mc_entry},
{"tbsvc", SIZE_tb_service_id, do_tbsvc_entry},
{"typec", SIZE_typec_device_id, do_typec_entry},
+ {"subdev", SIZE_subdev_id, do_subdev_entry},
};

/* Create MODULE_ALIAS() statements.
--
1.8.3.1


2019-03-01 05:40:15

by Parav Pandit

Subject: [RFC net-next 2/8] subdev: Introduce pm callbacks

Add bus-level power management callbacks that forward each PM event to
the bound driver when the driver has registered a matching callback.
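
The forwarding pattern can be sketched in userspace as follows. This is
a simplified model with hypothetical demo_* types, covering only suspend
and resume. Unlike the callbacks in this patch, the sketch also returns
success when no driver is bound at all; whether the bus needs that extra
check is an assumption on my side.

```c
#include <assert.h>
#include <stddef.h>

struct demo_device;

/* Optional per-driver PM callbacks; any of them may be NULL. */
struct demo_pm_ops {
	int (*suspend)(struct demo_device *dev);
	int (*resume)(struct demo_device *dev);
};

struct demo_driver {
	const struct demo_pm_ops *pm;	/* NULL if the driver has no PM ops */
};

struct demo_device {
	const struct demo_driver *driver;	/* NULL if no driver is bound */
	int suspended;
};

/* Bus-level suspend: forward to the driver callback if there is one. */
static int demo_bus_suspend(struct demo_device *dev)
{
	const struct demo_driver *drv;

	assert(dev != NULL);
	drv = dev->driver;
	if (drv && drv->pm && drv->pm->suspend)
		return drv->pm->suspend(dev);
	return 0;	/* nothing registered: treat as success */
}

/* Bus-level resume: same forwarding rule as suspend. */
static int demo_bus_resume(struct demo_device *dev)
{
	const struct demo_driver *drv;

	assert(dev != NULL);
	drv = dev->driver;
	if (drv && drv->pm && drv->pm->resume)
		return drv->pm->resume(dev);
	return 0;
}

/* Example driver callbacks that just record the state. */
static int demo_drv_suspend(struct demo_device *dev)
{
	dev->suspended = 1;
	return 0;
}

static int demo_drv_resume(struct demo_device *dev)
{
	dev->suspended = 0;
	return 0;
}

static const struct demo_pm_ops demo_pm = {
	.suspend = demo_drv_suspend,
	.resume = demo_drv_resume,
};

static const struct demo_driver demo_drv_with_pm = { .pm = &demo_pm };
static const struct demo_driver demo_drv_no_pm = { .pm = NULL };
```

A driver that does not care about power management simply leaves the
callbacks unset and the bus reports success on its behalf.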

Signed-off-by: Parav Pandit <[email protected]>
---
drivers/subdev/subdev_main.c | 59 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)

diff --git a/drivers/subdev/subdev_main.c b/drivers/subdev/subdev_main.c
index 4aabcaa..e213331 100644
--- a/drivers/subdev/subdev_main.c
+++ b/drivers/subdev/subdev_main.c
@@ -23,10 +23,69 @@ static int subdev_bus_match(struct device *dev, struct device_driver *drv)
return 0;
}

+static int subdev_pm_prepare(struct device *dev)
+{
+ if (dev->driver->pm && dev->driver->pm->prepare)
+ return dev->driver->pm->prepare(dev);
+ return 0;
+}
+
+static void subdev_pm_complete(struct device *dev)
+{
+ if (dev->driver->pm && dev->driver->pm->complete)
+ dev->driver->pm->complete(dev);
+}
+
+static int subdev_pm_suspend(struct device *dev)
+{
+ if (dev->driver->pm && dev->driver->pm->suspend)
+ return dev->driver->pm->suspend(dev);
+ return 0;
+}
+
+static int subdev_pm_suspend_late(struct device *dev)
+{
+ if (dev->driver->pm && dev->driver->pm->suspend_late)
+ return dev->driver->pm->suspend_late(dev);
+ return 0;
+}
+
+static int subdev_pm_resume(struct device *dev)
+{
+ if (dev->driver->pm && dev->driver->pm->resume)
+ return dev->driver->pm->resume(dev);
+ return 0;
+}
+
+static int subdev_pm_freeze(struct device *dev)
+{
+ if (dev->driver->pm && dev->driver->pm->freeze)
+ return dev->driver->pm->freeze(dev);
+ return 0;
+}
+
+static int subdev_pm_freeze_late(struct device *dev)
+{
+ if (dev->driver->pm && dev->driver->pm->freeze_late)
+ return dev->driver->pm->freeze_late(dev);
+ return 0;
+}
+
+static const struct dev_pm_ops subdev_dev_pm_ops = {
+ .prepare = subdev_pm_prepare,
+ .complete = subdev_pm_complete,
+ .suspend = subdev_pm_suspend,
+ .suspend_late = subdev_pm_suspend_late,
+ .resume = subdev_pm_resume,
+ .freeze = subdev_pm_freeze,
+ .freeze_late = subdev_pm_freeze_late,
+};
+
static struct bus_type subdev_bus_type = {
.dev_name = "subdev",
.name = "subdev",
.match = subdev_bus_match,
+ .pm = &subdev_dev_pm_ops,
};

int __subdev_register_driver(struct subdev_driver *drv, struct module *owner,
--
1.8.3.1


2019-03-01 05:41:02

by Parav Pandit

Subject: [RFC net-next 6/8] devlink: Add support for devlink subdev lifecycle

Add support for creating and deleting devlink subdevices.
Every subdev created on the subdev bus has a corresponding devlink device.
This devlink device serves as the control point for any internal device
configuration which is usually required before setting up the protocol
specific devices such as netdev, block or infiniband devices.

devlink subdevs are created and deleted using iproute2 devlink tool
commands such as:
(a) create a devlink subdev
$ devlink dev add DEV
output: subdev/subdev0

(b) delete a devlink subdev
$ devlink dev del DEV
$ devlink dev del subdev/subdev0
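
The parent/child relationship that this patch adds can be modelled in
userspace as below. This is a simplified sketch with hypothetical demo_*
names: a fixed-size array stands in for the devlink list, only one level
of children is handled, and removing a child just clears its slot rather
than calling a driver's dev_del() callback.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX 8	/* arbitrary capacity for the illustration */

struct demo_devlink {
	const char *name;
	struct demo_devlink *parent;	/* NULL for a top-level instance */
	int registered;
};

static struct demo_devlink *demo_list[DEMO_MAX];

/* Register an instance, optionally as the child of a parent instance. */
static int demo_register(struct demo_devlink *d, struct demo_devlink *parent)
{
	for (int i = 0; i < DEMO_MAX; i++) {
		if (!demo_list[i]) {
			d->parent = parent;
			d->registered = 1;
			demo_list[i] = d;
			return 0;
		}
	}
	return -1;	/* no free slot */
}

/* Unregister an instance; its direct children are removed first. */
static void demo_unregister(struct demo_devlink *d)
{
	for (int i = 0; i < DEMO_MAX; i++) {
		if (demo_list[i] && demo_list[i]->parent == d) {
			demo_list[i]->registered = 0;
			demo_list[i] = NULL;
		}
	}
	for (int i = 0; i < DEMO_MAX; i++) {
		if (demo_list[i] == d) {
			d->registered = 0;
			demo_list[i] = NULL;
		}
	}
}

/* Count the instances that are currently registered. */
static int demo_count(void)
{
	int n = 0;

	for (int i = 0; i < DEMO_MAX; i++)
		if (demo_list[i])
			n++;
	return n;
}
```

In this model, unregistering the parent also removes subdev/subdev0,
while unrelated top-level instances stay registered.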

Signed-off-by: Parav Pandit <[email protected]>
---
include/net/devlink.h | 6 ++-
include/uapi/linux/devlink.h | 3 ++
net/core/devlink.c | 97 ++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 9a067b1..3265508 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -36,6 +36,7 @@ struct devlink {
struct device *dev;
possible_net_t _net;
struct mutex lock;
+ struct devlink *parent; /* optional if this is child devlink device */
char priv[0] __aligned(NETDEV_ALIGN);
};

@@ -524,6 +525,8 @@ struct devlink_ops {
int (*flash_update)(struct devlink *devlink, const char *file_name,
const char *component,
struct netlink_ext_ack *extack);
+ struct devlink* (*dev_add)(struct devlink *devlink);
+ void (*dev_del)(struct devlink *del_dev);
};

static inline void *devlink_priv(struct devlink *devlink)
@@ -545,7 +548,8 @@ static inline struct devlink *priv_to_devlink(void *priv)
void devlink_init(struct devlink *devlink, const struct devlink_ops *ops);
void devlink_cleanup(struct devlink *devlink);
struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size);
-void __devlink_register(struct devlink *devlink, struct device *dev);
+int __devlink_register(struct devlink *devlink, struct device *dev,
+ struct devlink *parent);
int devlink_register(struct devlink *devlink, struct device *dev);
void __devlink_unregister(struct devlink *devlink);
void devlink_unregister(struct devlink *devlink);
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 53de880..233f5bc 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -105,6 +105,9 @@ enum devlink_command {

DEVLINK_CMD_FLASH_UPDATE,

+ DEVLINK_CMD_DEV_ADD,
+ DEVLINK_CMD_DEV_DEL,
+
/* add new commands above here */
__DEVLINK_CMD_MAX,
DEVLINK_CMD_MAX = __DEVLINK_CMD_MAX - 1
diff --git a/net/core/devlink.c b/net/core/devlink.c
index cfbad2c..3b5c961 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -3759,6 +3759,57 @@ static int devlink_nl_cmd_region_read_dumpit(struct sk_buff *skb,
return err;
}

+static int
+devlink_nl_cmd_dev_add_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ struct devlink *devlink = info->user_ptr[0];
+ struct devlink *new_devlink;
+ struct sk_buff *msg;
+ int err;
+
+ if (!devlink->ops->dev_add || !devlink->ops->dev_del)
+ return -EOPNOTSUPP;
+
+ msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+ if (!msg)
+ return -ENOMEM;
+
+ new_devlink = devlink->ops->dev_add(devlink);
+ if (IS_ERR(new_devlink)) {
+ err = PTR_ERR(new_devlink);
+ goto dev_err;
+ }
+
+ err = devlink_nl_put_handle(msg, new_devlink);
+ if (err)
+ goto put_err;
+
+ return genlmsg_reply(msg, info);
+
+put_err:
+ devlink->ops->dev_del(new_devlink);
+dev_err:
+ nlmsg_free(msg);
+ return err;
+}
+
+static int
+devlink_nl_cmd_dev_del_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ struct devlink *devlink;
+ struct devlink *parent;
+
+ devlink = devlink_get_from_info(info);
+ if (!devlink)
+ return -ENODEV;
+ parent = devlink->parent;
+ if (!parent)
+ return -EOPNOTSUPP;
+
+ parent->ops->dev_del(devlink);
+ return 0;
+}
+
struct devlink_info_req {
struct sk_buff *msg;
};
@@ -5201,6 +5252,20 @@ static int devlink_nl_cmd_health_reporter_dump_get_doit(struct sk_buff *skb,
.flags = GENL_ADMIN_PERM,
.internal_flags = DEVLINK_NL_FLAG_NEED_DEVLINK,
},
+ {
+ .cmd = DEVLINK_CMD_DEV_ADD,
+ .doit = devlink_nl_cmd_dev_add_doit,
+ .policy = devlink_nl_policy,
+ .flags = GENL_ADMIN_PERM,
+ .internal_flags = DEVLINK_NL_FLAG_NEED_DEVLINK,
+ },
+ {
+ .cmd = DEVLINK_CMD_DEV_DEL,
+ .doit = devlink_nl_cmd_dev_del_doit,
+ .policy = devlink_nl_policy,
+ .flags = GENL_ADMIN_PERM,
+ .internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
+ },
};

static struct genl_family devlink_nl_family __ro_after_init = {
@@ -5266,13 +5331,24 @@ struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size)
* Caller must hold devlink_mutex.
*
* @devlink: devlink
+ * @parent: pointer to parent devlink instance for which child devlink
+ * device is created. It must be set when child devlink
+ * device is created. It is optional otherwise.
*/
-void __devlink_register(struct devlink *devlink, struct device *dev)
+int __devlink_register(struct devlink *devlink, struct device *dev,
+ struct devlink *parent)
{
lockdep_assert_held(&devlink_mutex);
+
+ if (parent && (!parent->ops || !parent->ops->dev_add ||
+ !parent->ops->dev_del))
+ return -EINVAL;
+
devlink->dev = dev;
+ devlink->parent = parent;
list_add_tail(&devlink->list, &devlink_list);
devlink_notify(devlink, DEVLINK_CMD_NEW);
+ return 0;
}
EXPORT_SYMBOL_GPL(__devlink_register);

@@ -5283,13 +5359,27 @@ void __devlink_register(struct devlink *devlink, struct device *dev)
*/
int devlink_register(struct devlink *devlink, struct device *dev)
{
+ int ret;
+
mutex_lock(&devlink_mutex);
- __devlink_register(devlink, dev);
+ ret = __devlink_register(devlink, dev, NULL);
mutex_unlock(&devlink_mutex);
- return 0;
+ return ret;
}
EXPORT_SYMBOL_GPL(devlink_register);

+static void devlink_child_devices_delete(struct devlink *devlink)
+{
+ struct devlink *cur, *tmp;
+
+ list_for_each_entry_safe(cur, tmp, &devlink_list, list) {
+ struct devlink *parent = cur->parent;
+
+ if (devlink == parent)
+ parent->ops->dev_del(cur);
+ }
+}
+
/**
* __devlink_unregister - Unregister devlink instance
* Caller must hold the devlink_mutex while invoking this API.
@@ -5299,6 +5389,7 @@ int devlink_register(struct devlink *devlink, struct device *dev)
void __devlink_unregister(struct devlink *devlink)
{
lockdep_assert_held(&devlink_mutex);
+ devlink_child_devices_delete(devlink);
devlink_notify(devlink, DEVLINK_CMD_DEL);
list_del(&devlink->list);
}
--
1.8.3.1


2019-03-01 05:54:03

by Parav Pandit

Subject: [RFC net-next 1/8] subdev: Introducing subdev bus

Introduce a new subdev bus which holds sub devices created from a
primary device. These devices are named 'subdev'.
A subdev is identified, similarly to a PCI device, by a 16-bit vendor id
and device id.
Unlike PCI devices, the scope of a subdev is limited to the Linux kernel.
The central place that assigns unique subdev vendor and device ids is the
set of enums in include/linux/subdev_ids.h. Enums are chosen over define
macros so that two vendors do not end up with the same vendor id during
the kernel development process.

The subdev bus holds subdevices of multiple devices. A subdev created
for a PCI device typically appears in the sysfs tree under its parent
device, named using the core's default device naming scheme:

subdev<instance_id>
e.g.
subdev0
subdev1

$ ls -l /sys/bus/pci/devices/0000:05:00.0
[..]
drwxr-xr-x 4 root root 0 Feb 13 15:57 subdev0
drwxr-xr-x 4 root root 0 Feb 13 15:57 subdev1
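
The naming scheme can be sketched in userspace as an instance id
allocator plus a name formatter. This is only a model: the bool array
stands in for the XArray used in subdev_main.c, the lowest-free-id policy
is an assumption of the sketch, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define DEMO_MAX_IDS 8	/* arbitrary small limit for the illustration */

static bool demo_id_used[DEMO_MAX_IDS];

/* Hand out the lowest free instance id, or -1 if none is left. */
static int demo_id_alloc(void)
{
	for (int i = 0; i < DEMO_MAX_IDS; i++) {
		if (!demo_id_used[i]) {
			demo_id_used[i] = true;
			return i;
		}
	}
	return -1;
}

/* Return an instance id so that it can be reused. */
static void demo_id_free(int id)
{
	assert(id >= 0 && id < DEMO_MAX_IDS);
	demo_id_used[id] = false;
}

/* Build "<bus dev_name><instance_id>", for example "subdev0". */
static int demo_dev_name(char *buf, size_t len, int id)
{
	return snprintf(buf, len, "subdev%d", id);
}
```

With this policy the first two devices are named subdev0 and subdev1, and
an id that has been freed is handed out again to the next device.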

Device model view:
------------------
     +------+    +------+       +------+
     |subdev|    |subdev|       |subdev|
-----|  1   |----|  2   |-------|  3   |----------
|    +--|---+    +-|----+       +--|---+         |
--------|----------|---subdev bus--|--------------
        |          |               |
     +--+----+-----+           +---+---+
     |pcidev |                 |pcidev |
-----|   A   |-----------------|   B   |----------
|    +-------+                 +-------+         |
-------------------pci bus------------------------

subdevs are allocated and freed using the subdev_alloc() and subdev_free()
APIs. A driver which wants to create an actual class driver such as
net/block/infiniband needs to use the subdev_register_driver() and
subdev_unregister_driver() APIs.
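
The allocate/free life cycle relies on reference counting: the device is
freed by its release callback once the last reference is dropped, not
directly by the caller. The userspace sketch below models that idea with
a plain counter. It is not the kernel's struct device, and all demo_*
names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_dev {
	int refcount;
	void (*release)(struct demo_dev *dev);
};

static int demo_release_calls;	/* lets a caller observe the release */

/* Release callback: runs once, when the last reference is dropped. */
static void demo_release(struct demo_dev *dev)
{
	demo_release_calls++;
	free(dev);
}

/* Allocate a device that starts with one reference, owned by the caller. */
static struct demo_dev *demo_alloc_dev(void)
{
	struct demo_dev *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	dev->release = demo_release;
	dev->refcount = 1;
	return dev;
}

/* Take an extra reference, e.g. while the device sits on the bus. */
static void demo_get_dev(struct demo_dev *dev)
{
	dev->refcount++;
}

/* Drop a reference; the device is released when the count reaches zero. */
static void demo_put_dev(struct demo_dev *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->release(dev);
}
```

In this model the free step is just a final demo_put_dev(): memory is
released only after every other holder has dropped its reference.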

Signed-off-by: Parav Pandit <[email protected]>
---
drivers/Kconfig | 2 +
drivers/Makefile | 1 +
drivers/subdev/Kconfig | 12 ++++
drivers/subdev/Makefile | 8 +++
drivers/subdev/subdev_main.c | 153 ++++++++++++++++++++++++++++++++++++++++
include/linux/mod_devicetable.h | 12 ++++
include/linux/subdev_bus.h | 63 +++++++++++++++++
include/linux/subdev_ids.h | 17 +++++
8 files changed, 268 insertions(+)
create mode 100644 drivers/subdev/Kconfig
create mode 100644 drivers/subdev/Makefile
create mode 100644 drivers/subdev/subdev_main.c
create mode 100644 include/linux/subdev_bus.h
create mode 100644 include/linux/subdev_ids.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index 4f9f990..1818796 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -228,4 +228,6 @@ source "drivers/siox/Kconfig"

source "drivers/slimbus/Kconfig"

+source "drivers/subdev/Kconfig"
+
endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index e1ce029..a040e96 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -186,3 +186,4 @@ obj-$(CONFIG_MULTIPLEXER) += mux/
obj-$(CONFIG_UNISYS_VISORBUS) += visorbus/
obj-$(CONFIG_SIOX) += siox/
obj-$(CONFIG_GNSS) += gnss/
+obj-$(CONFIG_SUBDEV) += subdev/
diff --git a/drivers/subdev/Kconfig b/drivers/subdev/Kconfig
new file mode 100644
index 0000000..8ce3acc
--- /dev/null
+++ b/drivers/subdev/Kconfig
@@ -0,0 +1,12 @@
+#
+# subdev configuration
+#
+
+config SUBDEV
+ tristate "subdev bus driver"
+ help
+ The subdev bus driver allows creating hardware based sub devices
+ from a parent device. The subdev bus driver is required to create
+ and discover devices and to attach device drivers to these subdev
+ devices. These subdev devices are created by the user using the
+ devlink tool.
diff --git a/drivers/subdev/Makefile b/drivers/subdev/Makefile
new file mode 100644
index 0000000..405b74a
--- /dev/null
+++ b/drivers/subdev/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for subdev bus driver
+#
+
+obj-$(CONFIG_SUBDEV) += subdev.o
+
+subdev-y := subdev_main.o
diff --git a/drivers/subdev/subdev_main.c b/drivers/subdev/subdev_main.c
new file mode 100644
index 0000000..4aabcaa
--- /dev/null
+++ b/drivers/subdev/subdev_main.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/subdev_bus.h>
+
+static DEFINE_XARRAY_FLAGS(subdev_ids, XA_FLAGS_ALLOC);
+
+static int subdev_bus_match(struct device *dev, struct device_driver *drv)
+{
+ struct subdev_driver *subdev_drv = to_subdev_driver(drv);
+ const struct subdev_id *ids = subdev_drv->id_table;
+ const struct subdev *subdev = to_subdev_device(dev);
+
+ while (ids && (ids->vendor_id || ids->device_id)) {
+ if (ids->vendor_id == subdev->dev_id.vendor_id &&
+ ids->device_id == subdev->dev_id.device_id)
+ return 1;
+
+ ids++;
+ }
+ return 0;
+}
+
+static struct bus_type subdev_bus_type = {
+ .dev_name = "subdev",
+ .name = "subdev",
+ .match = subdev_bus_match,
+};
+
+int __subdev_register_driver(struct subdev_driver *drv, struct module *owner,
+ const char *mod_name)
+{
+ drv->driver.name = mod_name;
+ drv->driver.bus = &subdev_bus_type;
+ drv->driver.owner = owner;
+ drv->driver.mod_name = mod_name;
+
+ return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(__subdev_register_driver);
+
+void subdev_unregister_driver(struct subdev_driver *drv)
+{
+ driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(subdev_unregister_driver);
+
+static void subdev_release(struct device *dev)
+{
+ struct subdev *subdev = to_subdev_device(dev);
+
+ kfree(subdev);
+}
+
+/**
+ * _subdev_alloc_dev - Allocate a subdev device.
+ * @size: Size of the structure to allocate; it must contain the subdev
+ * device as its first element.
+ * Returns a pointer to a valid subdev structure, or an ERR_PTR on failure.
+ *
+ */
+struct subdev *_subdev_alloc_dev(size_t size)
+{
+ struct subdev *subdev;
+
+ subdev = kzalloc(size, GFP_KERNEL);
+ if (!subdev)
+ return ERR_PTR(-ENOMEM);
+ subdev->dev.release = subdev_release;
+ device_initialize(&subdev->dev);
+ return subdev;
+}
+EXPORT_SYMBOL(_subdev_alloc_dev);
+
+/**
+ * subdev_free_dev - Free allocated subdev device.
+ * @subdev: Pointer to subdev
+ *
+ */
+void subdev_free_dev(struct subdev *subdev)
+{
+ put_device(&subdev->dev);
+}
+EXPORT_SYMBOL(subdev_free_dev);
+
+/**
+ * subdev_add_dev - Add a sub device to bus.
+ * @subdev: subdev device to be placed on the bus
+ * @parent_dev: Parent device of the subdev
+ * @vid: Vendor ID of the device
+ * @did: Device ID of the device
+ *
+ * Returns 0 on successfully adding subdev to bus or error code on failure.
+ * Once the device is added, it can be probed by any device driver that
+ * matches it.
+ *
+ */
+int subdev_add_dev(struct subdev *subdev, struct device *parent_dev,
+ enum subdev_vendor_id vid, enum subdev_device_id did)
+{
+ u32 id = 0;
+ int ret;
+
+ if (!parent_dev)
+ return -EINVAL;
+
+ ret = xa_alloc(&subdev_ids, &id, UINT_MAX, NULL, GFP_KERNEL);
+ if (ret < 0)
+ return ret;
+
+ subdev->dev.id = id;
+ subdev->dev_id.vendor_id = vid;
+ subdev->dev_id.device_id = did;
+ subdev->dev.parent = parent_dev;
+ subdev->dev.bus = &subdev_bus_type;
+ subdev->dev.dma_mask = parent_dev->dma_mask;
+ subdev->dev.dma_parms = parent_dev->dma_parms;
+ subdev->dev.coherent_dma_mask = parent_dev->coherent_dma_mask;
+ ret = device_add(&subdev->dev);
+ if (ret)
+ xa_erase(&subdev_ids, id);
+ return ret;
+}
+EXPORT_SYMBOL(subdev_add_dev);
+
+/**
+ * subdev_delete_dev - Delete previously added subdev device
+ *
+ * @subdev: Pointer to subdev device to delete
+ */
+void subdev_delete_dev(struct subdev *subdev)
+{
+ device_del(&subdev->dev);
+ xa_erase(&subdev_ids, subdev->dev.id);
+}
+EXPORT_SYMBOL(subdev_delete_dev);
+
+static int __init subdev_init(void)
+{
+ return bus_register(&subdev_bus_type);
+}
+
+static void __exit subdev_exit(void)
+{
+ bus_unregister(&subdev_bus_type);
+}
+
+module_init(subdev_init);
+module_exit(subdev_exit);
+
+MODULE_LICENSE("GPL");
diff --git a/include/linux/mod_devicetable.h b/include/linux/mod_devicetable.h
index f9bd2f3..f271dab 100644
--- a/include/linux/mod_devicetable.h
+++ b/include/linux/mod_devicetable.h
@@ -779,4 +779,16 @@ struct typec_device_id {
kernel_ulong_t driver_data;
};

+/**
+ * struct subdev_id - subdev device identifiers defined in
+ * include/linux/subdev_ids.h
+ *
+ * @vendor_id: Vendor ID
+ * @device_id: Device ID
+ */
+struct subdev_id {
+ __u16 vendor_id;
+ __u16 device_id;
+};
+
#endif /* LINUX_MOD_DEVICETABLE_H */
diff --git a/include/linux/subdev_bus.h b/include/linux/subdev_bus.h
new file mode 100644
index 0000000..c6410e3
--- /dev/null
+++ b/include/linux/subdev_bus.h
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef SUBDEV_BUS_H
+#define SUBDEV_BUS_H
+
+#include <linux/mod_devicetable.h>
+#include <linux/device.h>
+#include <linux/subdev_ids.h>
+
+struct subdev_driver {
+ const struct subdev_id *id_table;
+ struct device_driver driver;
+ struct list_head list;
+};
+
+#define to_subdev_driver(x) container_of(x, struct subdev_driver, driver)
+
+int __subdev_register_driver(struct subdev_driver *drv, struct module *owner,
+ const char *mod_name);
+#define subdev_register_driver(driver) \
+ __subdev_register_driver(driver, THIS_MODULE, KBUILD_MODNAME)
+
+void subdev_unregister_driver(struct subdev_driver *dev);
+
+/**
+ * subdev - A subdev device representation
+ *
+ * @dev: device struct that represent subdev device in core device model
+ * @dev_id: Unique vendor id, device id that subdev device drivers match
+ * against. A unique id that defines this subdev assigned in
+ * include/linux/subdev_ids.h
+ */
+struct subdev {
+ struct device dev;
+ struct subdev_id dev_id;
+};
+
+#define to_subdev_device(x) container_of(x, struct subdev, dev)
+
+struct subdev *_subdev_alloc_dev(size_t size);
+
+/**
+ * subdev_alloc_dev - allocate memory for driver structure which holds
+ * subdev structure and other driver's device specific
+ * fields.
+ * @drv_struct: Driver's device structure which defines subdev device
+ * as the first member in the structure.
+ * @member: Name of the subdev instance name in drivers device
+ * structure.
+ */
+#define subdev_alloc_dev(drv_struct, member) \
+ container_of(_subdev_alloc_dev(sizeof(struct drv_struct) + \
+ BUILD_BUG_ON_ZERO(offsetof( \
+ struct drv_struct, member))), \
+ struct drv_struct, member)
+
+void subdev_free_dev(struct subdev *subdev);
+
+int subdev_add_dev(struct subdev *subdev, struct device *parent_dev,
+ enum subdev_vendor_id vid, enum subdev_device_id did);
+void subdev_delete_dev(struct subdev *subdev);
+
+#endif
diff --git a/include/linux/subdev_ids.h b/include/linux/subdev_ids.h
new file mode 100644
index 0000000..361faa3
--- /dev/null
+++ b/include/linux/subdev_ids.h
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef SUBDEV_IDS_H
+#define SUBDEV_IDS_H
+
+enum subdev_vendor_id {
+ SUBDEV_VENDOR_ID_MELLANOX,
+
+ /* new vendor id must be added above at the end */
+};
+
+enum subdev_device_id {
+ SUBDEV_DEVICE_ID_MELLANOX_SF,
+
+ /* new device id must be added above at the end */
+};
+#endif
--
1.8.3.1


2019-03-01 05:55:53

by Parav Pandit

Subject: [RFC net-next 4/8] devlink: Introduce and use devlink_init/cleanup() in alloc/free

There is a use case for allocating a devlink instance as part of another
structure instance.
This is the case when struct devlink and struct device are desired to be
part of a single structure instance whose life cycle is driven by the life
cycle of the core device.
To support it, add more granular init/cleanup APIs and reuse them in the
existing alloc/free APIs.
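
As a rough user-space illustration of this split (not kernel code; struct
obj, obj_init(), obj_cleanup(), obj_alloc() and obj_free() are hypothetical
stand-ins for struct devlink and its helpers), the allocating variant
becomes a thin wrapper around the init step, so a caller that embeds the
object in a larger structure can call init/cleanup directly:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct devlink; not the kernel type. */
struct obj {
	const void *ops;
	int initialized;
};

/* Initialize caller-provided storage; never allocates. */
static void obj_init(struct obj *o, const void *ops)
{
	o->ops = ops;
	o->initialized = 1;
}

/* Undo obj_init(); never frees. Sanity-check the state first. */
static void obj_cleanup(struct obj *o)
{
	assert(o->initialized);
	o->initialized = 0;
}

/* The allocating variant is a thin wrapper around obj_init(). */
static struct obj *obj_alloc(const void *ops, size_t priv_size)
{
	struct obj *o = calloc(1, sizeof(*o) + priv_size);

	if (o)
		obj_init(o, ops);
	return o;
}

/* The freeing variant is a thin wrapper around obj_cleanup(). */
static void obj_free(struct obj *o)
{
	obj_cleanup(o);
	free(o);
}

/* An outer structure embeds the object; its lifetime follows the outer one. */
struct outer {
	int refcount;
	struct obj dl;
};
```

With this layout, the owner of struct outer calls obj_init(&outer->dl, ops)
after allocating the outer structure and obj_cleanup(&outer->dl) before
releasing it, without ever calling obj_alloc() or obj_free().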

Signed-off-by: Parav Pandit <[email protected]>
---
include/net/devlink.h | 10 ++++++++++
net/core/devlink.c | 50 +++++++++++++++++++++++++++++++++++++-------------
2 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index a2da49d..ae5e0e6 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -542,6 +542,8 @@ static inline struct devlink *priv_to_devlink(void *priv)

#if IS_ENABLED(CONFIG_NET_DEVLINK)

+void devlink_init(struct devlink *devlink, const struct devlink_ops *ops);
+void devlink_cleanup(struct devlink *devlink);
struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size);
int devlink_register(struct devlink *devlink, struct device *dev);
void devlink_unregister(struct devlink *devlink);
@@ -709,6 +711,15 @@ int devlink_health_report(struct devlink_health_reporter *reporter,

#else

+static inline void devlink_init(struct devlink *devlink,
+ const struct devlink_ops *ops)
+{
+}
+
+static inline void devlink_cleanup(struct devlink *devlink)
+{
+}
+
static inline struct devlink *devlink_alloc(const struct devlink_ops *ops,
size_t priv_size)
{
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 04d9855..25492c6 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -5218,21 +5218,16 @@ static int devlink_nl_cmd_health_reporter_dump_get_doit(struct sk_buff *skb,
};

/**
- * devlink_alloc - Allocate new devlink instance resources
+ * devlink_init - Initialize devlink instance
*
- * @ops: ops
- * @priv_size: size of user private data
+ * @devlink: devlink pointer, which is not allocated using devlink_alloc().
*
- * Allocate new devlink instance resources, including devlink index
- * and name.
+ * When user wants to allocate devlink object along with other objects
+ * in driver such as refcounted using struct device, it is useful to
+ * just init the devlink instance without allocating.
*/
-struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size)
+void devlink_init(struct devlink *devlink, const struct devlink_ops *ops)
{
- struct devlink *devlink;
-
- devlink = kzalloc(sizeof(*devlink) + priv_size, GFP_KERNEL);
- if (!devlink)
- return NULL;
devlink->ops = ops;
devlink_net_set(devlink, &init_net);
INIT_LIST_HEAD(&devlink->port_list);
@@ -5243,6 +5238,25 @@ struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size)
INIT_LIST_HEAD(&devlink->region_list);
INIT_LIST_HEAD(&devlink->reporter_list);
mutex_init(&devlink->lock);
+}
+EXPORT_SYMBOL_GPL(devlink_init);
+
+/**
+ * devlink_alloc - Allocate new devlink instance resources
+ *
+ * @ops: ops
+ * @priv_size: size of user private data
+ *
+ * Allocate new devlink instance resources, including devlink index
+ * and name.
+ */
+struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size)
+{
+ struct devlink *devlink;
+
+ devlink = kzalloc(sizeof(*devlink) + priv_size, GFP_KERNEL);
+ if (devlink)
+ devlink_init(devlink, ops);
return devlink;
}
EXPORT_SYMBOL_GPL(devlink_alloc);
@@ -5278,11 +5292,11 @@ void devlink_unregister(struct devlink *devlink)
EXPORT_SYMBOL_GPL(devlink_unregister);

/**
- * devlink_free - Free devlink instance resources
+ * devlink_cleanup - Cleanup devlink instance resources
*
* @devlink: devlink
*/
-void devlink_free(struct devlink *devlink)
+void devlink_cleanup(struct devlink *devlink)
{
WARN_ON(!list_empty(&devlink->reporter_list));
WARN_ON(!list_empty(&devlink->region_list));
@@ -5291,7 +5305,17 @@ void devlink_free(struct devlink *devlink)
WARN_ON(!list_empty(&devlink->dpipe_table_list));
WARN_ON(!list_empty(&devlink->sb_list));
WARN_ON(!list_empty(&devlink->port_list));
+}
+EXPORT_SYMBOL_GPL(devlink_cleanup);

+/**
+ * devlink_free - Free devlink instance resources
+ *
+ * @devlink: devlink
+ */
+void devlink_free(struct devlink *devlink)
+{
+ devlink_cleanup(devlink);
kfree(devlink);
}
EXPORT_SYMBOL_GPL(devlink_free);
--
1.8.3.1


2019-03-01 05:56:47

by Parav Pandit

Subject: [RFC net-next 7/8] net/mlx5: Add devlink subdev life cycle command support

Implement the devlink device add/del commands, which create dummy subdev
devices that the actual driver can bind to using the standard device
driver model.
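
A minimal user-space sketch of the embedding used here, assuming stand-in
types (struct base_dev, struct handle, struct wrapper, wrapper_add() and
wrapper_del() are hypothetical names, not the mlx5 or subdev API): the add
path hands out a pointer to the embedded handle, and the del path recovers
the enclosing structure with container_of().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct base_dev { int id; };        /* plays the role of struct subdev */
struct handle { int registered; };  /* plays the role of struct devlink */

struct wrapper {
	struct base_dev base;       /* must stay the first member */
	struct handle h;
};

/* Compile-time check in the spirit of BUILD_BUG_ON_ZERO(offsetof(...)). */
_Static_assert(offsetof(struct wrapper, base) == 0, "base must be first");

/* Allocate the wrapper and return only the embedded handle. */
static struct handle *wrapper_add(int id)
{
	struct wrapper *w = calloc(1, sizeof(*w));

	if (!w)
		return NULL;
	w->base.id = id;
	w->h.registered = 1;
	return &w->h;
}

/* Recover the wrapper from the handle, tear it down, return its id. */
static int wrapper_del(struct handle *h)
{
	struct wrapper *w = container_of(h, struct wrapper, h);
	int id = w->base.id;

	h->registered = 0;
	free(w);
	return id;
}
```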

Signed-off-by: Parav Pandit <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/Makefile | 1 +
drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 ++
.../net/ethernet/mellanox/mlx5/core/mlx5_core.h | 4 ++
drivers/net/ethernet/mellanox/mlx5/core/subdev.c | 55 ++++++++++++++++++++++
4 files changed, 64 insertions(+)
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/subdev.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 82d636b..f218789 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -16,6 +16,7 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
fs_counters.o rl.o lag.o dev.o events.o wq.o lib/gid.o \
lib/devcom.o diag/fs_tracepoint.o diag/fw_tracer.o
+mlx5_core-$(CONFIG_SUBDEV) += subdev.o

#
# Netdev basic
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 40d591c..5f8cf0d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1213,6 +1213,10 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
.eswitch_encap_mode_set = mlx5_devlink_eswitch_encap_mode_set,
.eswitch_encap_mode_get = mlx5_devlink_eswitch_encap_mode_get,
#endif
+#if IS_ENABLED(CONFIG_SUBDEV)
+ .dev_add = mlx5_devlink_dev_add,
+ .dev_del = mlx5_devlink_dev_del,
+#endif
};

#define MLX5_IB_MOD "mlx5_ib"
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 9529cf9..2a54148 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -202,4 +202,8 @@ enum {

u8 mlx5_get_nic_state(struct mlx5_core_dev *dev);
void mlx5_set_nic_state(struct mlx5_core_dev *dev, u8 state);
+
+struct devlink *mlx5_devlink_dev_add(struct devlink *devlink);
+void mlx5_devlink_dev_del(struct devlink *devlink);
+
#endif /* __MLX5_CORE_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/subdev.c b/drivers/net/ethernet/mellanox/mlx5/core/subdev.c
new file mode 100644
index 0000000..9e78ea01
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/subdev.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2018-19 Mellanox Technologies
+
+#include <linux/subdev_bus.h>
+#include <linux/subdev_ids.h>
+#include <linux/mlx5/driver.h>
+#include <net/devlink.h>
+
+#include "mlx5_core.h"
+
+struct mlx5_subdev {
+ struct subdev subdev;
+ struct devlink dl;
+};
+
+struct devlink *mlx5_devlink_dev_add(struct devlink *devlink)
+{
+ struct mlx5_subdev *subdev;
+ int ret;
+
+ subdev = subdev_alloc_dev(mlx5_subdev, subdev);
+ if (IS_ERR(subdev))
+ return ERR_CAST(subdev);
+
+ devlink_init(&subdev->dl, NULL);
+
+ ret = subdev_add_dev(&subdev->subdev, devlink->dev,
+ SUBDEV_VENDOR_ID_MELLANOX,
+ SUBDEV_DEVICE_ID_MELLANOX_SF);
+ if (ret)
+ goto add_err;
+
+ ret = __devlink_register(&subdev->dl, &subdev->subdev.dev, devlink);
+ if (ret)
+ goto reg_err;
+
+ return &subdev->dl;
+
+reg_err:
+ devlink_cleanup(&subdev->dl);
+add_err:
+ subdev_free_dev(&subdev->subdev);
+ return ERR_PTR(ret);
+}
+
+void mlx5_devlink_dev_del(struct devlink *devlink)
+{
+ struct mlx5_subdev *subdev =
+ container_of(devlink, struct mlx5_subdev, dl);
+
+ __devlink_unregister(devlink);
+ devlink_cleanup(devlink);
+ subdev_delete_dev(&subdev->subdev);
+ subdev_free_dev(&subdev->subdev);
+}
--
1.8.3.1


2019-03-01 07:19:26

by Greg Kroah-Hartman

Subject: Re: [RFC net-next 1/8] subdev: Introducing subdev bus

On Thu, Feb 28, 2019 at 11:37:45PM -0600, Parav Pandit wrote:
> Introduce a new subdev bus which holds sub devices created from a
> primary device. These devices are named as 'subdev'.
> A subdev is identified similarly to pci device using 16-bit vendor id
> and device id.
> Unlike PCI devices, scope of subdev is limited to Linux kernel.

But these are limited to only PCI devices, right?

This sounds a lot like that ARM proposal a week or so ago that asked for
something like this, are you working with them to make sure your
proposal works for them as well? (sorry, can't find where that was
announced, it was online somewhere...)

> A central entry that assigns unique subdev vendor and device id is:
> include/linux/subdev_ids.h enums. Enum are chosen over define macro so
> that two vendors do not end up with vendor id in kernel development
> process.

Why not just make it dynamic with no static ids?

> subdev bus holds subdevices of multiple devices. A typical created
> subdev for a PCI device in sysfs tree appears under their parent's
> device as using core's default device naming scheme:
>
> subdev<instance_id>.
> i.e.
> subdev0
> subdev1
>
> $ ls -l /sys/bus/pci/devices/0000:05:00.0
> [..]
> drwxr-xr-x 4 root root 0 Feb 13 15:57 subdev0
> drwxr-xr-x 4 root root 0 Feb 13 15:57 subdev1
>
> Device model view:
> ------------------
>      +------+    +------+       +------+
>      |subdev|    |subdev|       |subdev|
> -----|  1   |----|  2   |-------|  3   |----------
> |    +--|---+    +-|----+       +--|---+         |
> --------|----------|---subdev bus--|--------------
>         |          |               |
>      +--+----+-----+           +---+---+
>      |pcidev |                 |pcidev |
> -----|   A   |-----------------|   B   |----------
> |    +-------+                 +-------+         |
> -------------------pci bus------------------------

To be clear, "subdev bus" is just a logical grouping, there is no
physical backing "bus" here at all, right?

What is going to "bind" to subdev devices? PCI drivers? Or new types
of drivers?

> subdev are allocated and freed using subdev_alloc(), subdev_free() APIs.
> A driver which wants to create actual class driver such as
> net/block/infiniband need to use subdev_register_driver(),
> subdev_unregister_driver() APIs.
>
> Signed-off-by: Parav Pandit <[email protected]>
> ---
> drivers/Kconfig | 2 +
> drivers/Makefile | 1 +
> drivers/subdev/Kconfig | 12 ++++
> drivers/subdev/Makefile | 8 +++
> drivers/subdev/subdev_main.c | 153 ++++++++++++++++++++++++++++++++++++++++
> include/linux/mod_devicetable.h | 12 ++++
> include/linux/subdev_bus.h | 63 +++++++++++++++++
> include/linux/subdev_ids.h | 17 +++++
> 8 files changed, 268 insertions(+)
> create mode 100644 drivers/subdev/Kconfig
> create mode 100644 drivers/subdev/Makefile
> create mode 100644 drivers/subdev/subdev_main.c
> create mode 100644 include/linux/subdev_bus.h
> create mode 100644 include/linux/subdev_ids.h
>
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index 4f9f990..1818796 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -228,4 +228,6 @@ source "drivers/siox/Kconfig"
>
> source "drivers/slimbus/Kconfig"
>
> +source "drivers/subdev/Kconfig"
> +
> endmenu
> diff --git a/drivers/Makefile b/drivers/Makefile
> index e1ce029..a040e96 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -186,3 +186,4 @@ obj-$(CONFIG_MULTIPLEXER) += mux/
> obj-$(CONFIG_UNISYS_VISORBUS) += visorbus/
> obj-$(CONFIG_SIOX) += siox/
> obj-$(CONFIG_GNSS) += gnss/
> +obj-$(CONFIG_SUBDEV) += subdev/
> diff --git a/drivers/subdev/Kconfig b/drivers/subdev/Kconfig
> new file mode 100644
> index 0000000..8ce3acc
> --- /dev/null
> +++ b/drivers/subdev/Kconfig
> @@ -0,0 +1,12 @@
> +#
> +# subdev configuration
> +#
> +
> +config SUBDEV
> + tristate "subdev bus driver"
> + help
> + The subdev bus driver allows creating hardware based sub devices
> + from a parent device. The subdev bus driver is required to create,
> + discover devices and to attach device drivers to this subdev
> + devices. These subdev devices are created using devlink tool by
> + user.


Your definition of the bus uses the name of the bus in the definition :)

> diff --git a/drivers/subdev/Makefile b/drivers/subdev/Makefile
> new file mode 100644
> index 0000000..405b74a
> --- /dev/null
> +++ b/drivers/subdev/Makefile
> @@ -0,0 +1,8 @@
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Makefile for subdev bus driver
> +#
> +
> +obj-$(CONFIG_SUBDEV) += subdev.o
> +
> +subdev-y := subdev_main.o
> diff --git a/drivers/subdev/subdev_main.c b/drivers/subdev/subdev_main.c
> new file mode 100644
> index 0000000..4aabcaa
> --- /dev/null
> +++ b/drivers/subdev/subdev_main.c
> @@ -0,0 +1,153 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/subdev_bus.h>
> +
> +static DEFINE_XARRAY_FLAGS(subdev_ids, XA_FLAGS_ALLOC);

Why not an idr?

> +
> +static int subdev_bus_match(struct device *dev, struct device_driver *drv)
> +{
> + struct subdev_driver *subdev_drv = to_subdev_driver(drv);
> + const struct subdev_id *ids = subdev_drv->id_table;
> + const struct subdev *subdev = to_subdev_device(dev);
> +
> + while (ids) {
> + if (ids->vendor_id == subdev->dev_id.vendor_id &&
> + ids->device_id == subdev->dev_id.device_id)
> + return 1;
> +
> + ids++;
> + }
> + return 0;
> +}
> +
> +static struct bus_type subdev_bus_type = {
> + .dev_name = "subdev",
> + .name = "subdev",
> + .match = subdev_bus_match,
> +};
> +
> +int __subdev_register_driver(struct subdev_driver *drv, struct module *owner,
> + const char *mod_name)
> +{
> + drv->driver.name = mod_name;
> + drv->driver.bus = &subdev_bus_type;
> + drv->driver.owner = owner;
> + drv->driver.mod_name = mod_name;
> +
> + return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(__subdev_register_driver);

EXPORT_SYMBOL_GPL() for this and the other ones as you are just wrapping
the driver core logic loosely.

> +
> +void subdev_unregister_driver(struct subdev_driver *drv)
> +{
> + driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(subdev_unregister_driver);
> +
> +static void subdev_release(struct device *dev)
> +{
> + struct subdev *subdev = to_subdev_device(dev);
> +
> + kfree(subdev);
> +}
> +
> +/**
> + * _subdev_alloc_subdev - Allocate a subdev device.
> + * @size: Size of the device that to allocate that contains subdev
> + * device as the first element.
> + * Returns pointer to a valid subdev structure or returns ERR_PTR.
> + *
> + */
> +struct subdev *_subdev_alloc_dev(size_t size)
> +{
> + struct subdev *subdev;
> +
> + subdev = kzalloc(size, GFP_KERNEL);
> + if (!subdev)
> + return ERR_PTR(-ENOMEM);
> + subdev->dev.release = subdev_release;
> + device_initialize(&subdev->dev);
> + return subdev;
> +}
> +EXPORT_SYMBOL(_subdev_alloc_dev);
> +
> +/**
> + * subdev_free_dev - Free allocated subdev device.
> + * @subdev: Pointer to subdev
> + *
> + */
> +void subdev_free_dev(struct subdev *subdev)
> +{
> + put_device(&subdev->dev);
> +}
> +EXPORT_SYMBOL(subdev_free_dev);
> +
> +/**
> + * subdev_add_dev - Add a sub device to bus.
> + * @subdev: subdev devie to be placed on the bus
> + * @parent_dev: Parent device of the subdev
> + * @vid: Vendor ID of the device
> + * @did: Device ID of the device
> + *
> + * Returns 0 on successfully adding subdev to bus or error code on failure.
> + * Once the device is added, it can be probed by the device driver who
> + * wish to match it.
> + *
> + */
> +int subdev_add_dev(struct subdev *subdev, struct device *parent_dev,
> + enum subdev_vendor_id vid, enum subdev_device_id did)
> +{
> + u32 id = 0;
> + int ret;
> +
> + if (!parent_dev)
> + return -EINVAL;

No root devices?

> +
> + ret = xa_alloc(&subdev_ids, &id, UINT_MAX, NULL, GFP_KERNEL);

No locking needed?

> + if (ret < 0)
> + return ret;
> +
> + subdev->dev.id = id;
> + subdev->dev_id.vendor_id = vid;
> + subdev->dev_id.device_id = did;
> + subdev->dev.parent = parent_dev;
> + subdev->dev.bus = &subdev_bus_type;
> + subdev->dev.dma_mask = parent_dev->dma_mask;
> + subdev->dev.dma_parms = parent_dev->dma_parms;
> + subdev->dev.coherent_dma_mask = parent_dev->coherent_dma_mask;
> + ret = device_add(&subdev->dev);
> + if (ret)
> + xa_erase(&subdev_ids, id);
> + return ret;
> +}
> +EXPORT_SYMBOL(subdev_add_dev);
> +
> +/**
> + * subdev_delete_dev - Delete previously added subdev device
> + *
> + * @subdev: Pointer to subdev device to delete
> + */
> +void subdev_delete_dev(struct subdev *subdev)
> +{
> + device_del(&subdev->dev);
> + xa_erase(&subdev_ids, subdev->dev.id);
> +}
> +EXPORT_SYMBOL(subdev_delete_dev);
> +
> +static int __init subdev_init(void)
> +{
> + return bus_register(&subdev_bus_type);
> +}
> +
> +static void __exit subdev_exit(void)
> +{
> + bus_unregister(&subdev_bus_type);
> +}
> +
> +module_init(subdev_init);
> +module_exit(subdev_exit);
> +
> +MODULE_LICENSE("GPL");

Nit, for a few more weeks, this needs to be "GPL v2".

> diff --git a/include/linux/mod_devicetable.h b/include/linux/mod_devicetable.h
> index f9bd2f3..f271dab 100644
> --- a/include/linux/mod_devicetable.h
> +++ b/include/linux/mod_devicetable.h
> @@ -779,4 +779,16 @@ struct typec_device_id {
> kernel_ulong_t driver_data;
> };
>
> +/**
> + * struct subdev_id - subdev device identifiers defined in
> + * include/linux/subdev_ids.h
> + *
> + * @vendor_id: Vendor ID
> + * @device_id: Device ID
> + */
> +struct subdev_id {
> + __u16 vendor_id;
> + __u16 device_id;
> +};
> +
> #endif /* LINUX_MOD_DEVICETABLE_H */
> diff --git a/include/linux/subdev_bus.h b/include/linux/subdev_bus.h
> new file mode 100644
> index 0000000..c6410e3
> --- /dev/null
> +++ b/include/linux/subdev_bus.h
> @@ -0,0 +1,63 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#ifndef SUBDEV_BUS_H
> +#define SUBDEV_BUS_H
> +
> +#include <linux/mod_devicetable.h>
> +#include <linux/device.h>
> +#include <linux/subdev_ids.h>
> +
> +struct subdev_driver {
> + const struct subdev_id *id_table;
> + struct device_driver driver;
> + struct list_head list;
> +};
> +
> +#define to_subdev_driver(x) container_of(x, struct subdev_driver, driver)
> +
> +int __subdev_register_driver(struct subdev_driver *drv, struct module *owner,
> + const char *mod_name);
> +#define subdev_register_driver(driver) \
> + __subdev_register_driver(driver, THIS_MODULE, KBUILD_MODNAME)
> +
> +void subdev_unregister_driver(struct subdev_driver *dev);
> +
> +/**
> + * subdev - A subdev device representation
> + *
> + * @dev: device struct that represent subdev device in core device model
> + * @dev_id: Unique vendor id, device id that subdev device drivers match
> + * against. A unique id that defines this subdev assigned in
> + * include/linux/subdev_ids.h
> + */
> +struct subdev {
> + struct device dev;
> + struct subdev_id dev_id;
> +};
> +
> +#define to_subdev_device(x) container_of(x, struct subdev, dev)
> +
> +struct subdev *_subdev_alloc_dev(size_t size);
> +
> +/**
> + * subdev_alloc_dev - allocate memory for driver structure which holds
> + * subdev structure and other driver's device specific
> + * fields.
> + * @drv_struct: Driver's device structure which defines subdev device
> + * as the first member in the structure.
> + * @member: Name of the subdev instance name in drivers device
> + * structure.
> + */
> +#define subdev_alloc_dev(drv_struct, member) \
> + container_of(_subdev_alloc_dev(sizeof(struct drv_struct) + \
> + BUILD_BUG_ON_ZERO(offsetof( \
> + struct drv_struct, member))), \
> + struct drv_struct, member)
> +
> +void subdev_free_dev(struct subdev *subdev);
> +
> +int subdev_add_dev(struct subdev *subdev, struct device *parent_dev,
> + enum subdev_vendor_id vid, enum subdev_device_id did);
> +void subdev_delete_dev(struct subdev *subdev);
> +
> +#endif
> diff --git a/include/linux/subdev_ids.h b/include/linux/subdev_ids.h
> new file mode 100644
> index 0000000..361faa3
> --- /dev/null
> +++ b/include/linux/subdev_ids.h
> @@ -0,0 +1,17 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#ifndef SUBDEV_IDS_H
> +#define SUBDEV_IDS_H
> +
> +enum subdev_vendor_id {
> + SUBDEV_VENDOR_ID_MELLANOX,
> +
> + /* new device id must be added above at the end */

Again, why ids at all?

So far, this is just a very loose wrapping of the driver core bus
functionality, which is fine, but I really don't see the goal here...

thanks,

greg k-h

2019-03-01 07:19:48

by Greg Kroah-Hartman

Subject: Re: [RFC net-next 7/8] net/mlx5: Add devlink subdev life cycle command support

On Thu, Feb 28, 2019 at 11:37:51PM -0600, Parav Pandit wrote:
> --- /dev/null
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/subdev.c
> @@ -0,0 +1,55 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB

For new stuff, just use GPL-2.0, no need to keep the mistake of the
Linux-OpenIB license around :)

thanks,

greg k-h

2019-03-01 07:24:04

by Greg Kroah-Hartman

Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to subdev devices

On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> Add a subdev driver to probe the subdev devices and create fake
> netdevice for it.

So I'm guessing here is the "meat" of the whole goal here?

You just want multiple netdevices per PCI device? Why can't you do that
today in your PCI driver?

What problem are you trying to solve that others also are having that
requires all of this?

Adding a new bus type and subsystem is fine, but usually we want more
than just one user of it, as this does not really show how it is
exercised very well. Ideally 3 users would be there as that is when it
proves itself that it is flexible enough.

Would just using the mfd subsystem work better for you? That provides
core support for "multi-function" drivers/devices already. What is
missing from that subsystem that does not work for you here?

thanks,

greg k-h

2019-03-01 16:36:40

by Parav Pandit

Subject: RE: [RFC net-next 1/8] subdev: Introducing subdev bus

Hi Greg,

> -----Original Message-----
> From: Greg KH <[email protected]>
> Sent: Friday, March 1, 2019 1:17 AM
> To: Parav Pandit <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Jiri Pirko
> <[email protected]>
> Subject: Re: [RFC net-next 1/8] subdev: Introducing subdev bus
>
> On Thu, Feb 28, 2019 at 11:37:45PM -0600, Parav Pandit wrote:
> > Introduce a new subdev bus which holds sub devices created from a
> > primary device. These devices are named as 'subdev'.
> > A subdev is identified similarly to pci device using 16-bit vendor id
> > and device id.
> > Unlike PCI devices, scope of subdev is limited to Linux kernel.
>
> But these are limited to only PCI devices, right?
>
For the Mellanox use case, yes, it's limited to PCI devices.

> This sounds a lot like that ARM proposal a week or so ago that asked for
> something like this, are you working with them to make sure your proposal
> works for them as well? (sorry, can't find where that was announced, it was
> online somewhere...)
>
We were not aware of it, mostly because we are on the networking side of the mailing lists (netdev, rdma, virt, etc.).
The ARM proposal was likely on linux-kernel, I guess.
I will look up that proposal and see if both of us can use a common infrastructure.

> > A central entry that assigns unique subdev vendor and device id is:
> > include/linux/subdev_ids.h enums. Enum are chosen over define macro so
> > that two vendors do not end up with vendor id in kernel development
> > process.
>
> Why not just make it dynamic with no static ids?
>
Can you please elaborate?
Do you mean we should use something similar to pci_add_dynid(), with an enhancement to catch duplicate id additions?

> > subdev bus holds subdevices of multiple devices. A typical created
> > subdev for a PCI device in sysfs tree appears under their parent's
> > device as using core's default device naming scheme:
> >
> > subdev<instance_id>.
> > i.e.
> > subdev0
> > subdev1
> >
> > $ ls -l /sys/bus/pci/devices/0000:05:00.0
> > [..]
> > drwxr-xr-x 4 root root 0 Feb 13 15:57 subdev0
> > drwxr-xr-x 4 root root 0 Feb 13 15:57 subdev1
> >
> > Device model view:
> > ------------------
> >      +------+    +------+       +------+
> >      |subdev|    |subdev|       |subdev|
> > -----|  1   |----|  2   |-------|  3   |----------
> > |    +--|---+    +-|----+       +--|---+         |
> > --------|----------|---subdev bus--|--------------
> >         |          |               |
> >      +--+----+-----+           +---+---+
> >      |pcidev |                 |pcidev |
> > -----|   A   |-----------------|   B   |----------
> > |    +-------+                 +-------+         |
> > -------------------pci bus------------------------
>
> To be clear, "subdev bus" is just a logical grouping, there is no physical
> backing "bus" here at all, right?
>
Yep. that's correct.

> What is going to "bind" to subdev devices? PCI drivers? Or new types of
> drivers?
>
Devices are placed on the subdev bus using the devlink interface, and the probe() method of drivers registered with subdev_register_driver() will be called.
So yes, those are PCI vendor drivers.
I tried to capture this in the cover letter.
At present users haven't asked to map a subdev to a VM, but there is a very high chance that once we have this without PCI SR-IOV, they will want to extend it to VMs too.
In that case devlink will have an option to add a 'passthrough' device, and instead of the vendor's PCI driver, some high-level vfio-type driver will bind to it.
That is just anticipation; we haven't really worked this out fully.
But this model allows for it.
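
To illustrate the match step that precedes probe(), here is a minimal user-space sketch. It assumes a table terminated by an all-zero entry and ids that start at 1, which differs from the enums in this RFC (they start at 0); struct dev_id, example_ids and id_table_match() are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of a vendor/device id pair. */
struct dev_id {
	uint16_t vendor_id;
	uint16_t device_id;
};

/* Assumption: valid ids start at 1 so that {0, 0} can end the table. */
enum { VENDOR_EXAMPLE = 1 };
enum { DEVICE_EXAMPLE_SF = 1 };

static const struct dev_id example_ids[] = {
	{ VENDOR_EXAMPLE, DEVICE_EXAMPLE_SF },
	{ 0, 0 },	/* sentinel entry */
};

/* Return 1 if dev appears in the sentinel-terminated table, else 0. */
static int id_table_match(const struct dev_id *ids, const struct dev_id *dev)
{
	if (!ids)
		return 0;

	for (; ids->vendor_id || ids->device_id; ids++) {
		if (ids->vendor_id == dev->vendor_id &&
		    ids->device_id == dev->device_id)
			return 1;
	}
	return 0;
}
```

The loop stops at the sentinel entry rather than testing the table pointer itself, so a driver with no matching entry simply does not bind.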

> > subdev are allocated and freed using subdev_alloc(), subdev_free() APIs.
> > A driver which wants to create actual class driver such as
> > net/block/infiniband need to use subdev_register_driver(),
> > subdev_unregister_driver() APIs.
> >
> > +++ b/drivers/subdev/Kconfig
> > @@ -0,0 +1,12 @@
> > +#
> > +# subdev configuration
> > +#
> > +
> > +config SUBDEV
> > + tristate "subdev bus driver"
> > + help
> > + The subdev bus driver allows creating hardware based sub devices
> > + from a parent device. The subdev bus driver is required to create,
> > + discover devices and to attach device drivers to this subdev
> > + devices. These subdev devices are created using devlink tool by
> > + user.
>
>
> Your definition of the bus uses the name of the bus in the definition :)
>
I will word this better in v2 if we don't go the mfd route.

> > diff --git a/drivers/subdev/Makefile b/drivers/subdev/Makefile
> > new file mode 100644
> > index 0000000..405b74a
> > --- /dev/null
> > +++ b/drivers/subdev/Makefile
> > @@ -0,0 +1,8 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +#
> > +# Makefile for subdev bus driver
> > +#
> > +
> > +obj-$(CONFIG_SUBDEV) += subdev.o
> > +
> > +subdev-y := subdev_main.o
> > diff --git a/drivers/subdev/subdev_main.c
> > b/drivers/subdev/subdev_main.c new file mode 100644 index
> > 0000000..4aabcaa
> > --- /dev/null
> > +++ b/drivers/subdev/subdev_main.c
> > @@ -0,0 +1,153 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/slab.h>
> > +#include <linux/subdev_bus.h>
> > +
> > +static DEFINE_XARRAY_FLAGS(subdev_ids, XA_FLAGS_ALLOC);
>
> Why not an idr?
>
Matt is running a large patch series to get rid of idr/ida and replace them with xarray.
So instead of creating more work for him, I thought to start with xarray from the beginning.

> > +
> > +int __subdev_register_driver(struct subdev_driver *drv, struct module
> *owner,
> > + const char *mod_name)
> > +{
> > + drv->driver.name = mod_name;
> > + drv->driver.bus = &subdev_bus_type;
> > + drv->driver.owner = owner;
> > + drv->driver.mod_name = mod_name;
> > +
> > + return driver_register(&drv->driver); }
> > +EXPORT_SYMBOL(__subdev_register_driver);
>
> EXPORT_SYMBOL_GPL() for this and the other ones as you are just wrapping
> the driver core logic loosely.
>
I see. Yes. will fix this.

> > +/**
> > + * subdev_add_dev - Add a sub device to bus.
> > + * @subdev: subdev devie to be placed on the bus
> > + * @parent_dev: Parent device of the subdev
> > + * @vid: Vendor ID of the device
> > + * @did: Device ID of the device
> > + *
> > + * Returns 0 on successfully adding subdev to bus or error code on failure.
> > + * Once the device is added, it can be probed by the device driver
> > +who
> > + * wish to match it.
> > + *
> > + */
> > +int subdev_add_dev(struct subdev *subdev, struct device *parent_dev,
> > + enum subdev_vendor_id vid, enum subdev_device_id did) {
> > + u32 id = 0;
> > + int ret;
> > +
> > + if (!parent_dev)
> > + return -EINVAL;
>
> No root devices?
>
I didn't get the comment. The intent of this check is that a subdev must have a parent. The parent's type doesn't matter.

> > +
> > + ret = xa_alloc(&subdev_ids, &id, UINT_MAX, NULL, GFP_KERNEL);
>
> No locking needed?
>
Documentation at [1] describes that xa_alloc() and xa_erase() take the lock internally.
[1] https://www.kernel.org/doc/html/latest/core-api/xarray.html

> > +module_init(subdev_init);
> > +module_exit(subdev_exit);
> > +
> > +MODULE_LICENSE("GPL");
>
> Nit, for a few more weeks, this needs to be "GPL v2".
>
Ok.

> > #endif /* LINUX_MOD_DEVICETABLE_H */
> > diff --git a/include/linux/subdev_bus.h b/include/linux/subdev_bus.h
> > new file mode 100644 index 0000000..c6410e3
> > --- /dev/null
> > +++ b/include/linux/subdev_bus.h
> > @@ -0,0 +1,63 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#ifndef SUBDEV_BUS_H
> > +#define SUBDEV_BUS_H
> > +
> > +#include <linux/mod_devicetable.h>
> > +#include <linux/device.h>
> > +#include <linux/subdev_ids.h>
> > +
> > +struct subdev_driver {
> > + const struct subdev_id *id_table;
> > + struct device_driver driver;
> > + struct list_head list;
> > +};
> > +
> > +#define to_subdev_driver(x) container_of(x, struct subdev_driver,
> > +driver)
> > +
> > +int __subdev_register_driver(struct subdev_driver *drv, struct module
> *owner,
> > + const char *mod_name);
> > +#define subdev_register_driver(driver) \
> > + __subdev_register_driver(driver, THIS_MODULE, KBUILD_MODNAME)
> > +
> > +void subdev_unregister_driver(struct subdev_driver *dev);
> > +
> > +/**
> > + * subdev - A subdev device representation
> > + *
> > + * @dev: device struct that represent subdev device in core device
> model
> > + * @dev_id: Unique vendor id, device id that subdev device drivers match
> > + * against. A unique id that defines this subdev assigned in
> > + * include/linux/subdev_ids.h
> > + */
> > +struct subdev {
> > + struct device dev;
> > + struct subdev_id dev_id;
> > +};
> > +
> > +#endif
> > diff --git a/include/linux/subdev_ids.h b/include/linux/subdev_ids.h
> > new file mode 100644 index 0000000..361faa3
> > --- /dev/null
> > +++ b/include/linux/subdev_ids.h
> > @@ -0,0 +1,17 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#ifndef SUBDEV_IDS_H
> > +#define SUBDEV_IDS_H
> > +
> > +enum subdev_vendor_id {
> > + SUBDEV_VENDOR_ID_MELLANOX,
> > +
> > + /* new device id must be added above at the end */
>
> Again, why ids at all?
>
Do you mean we should just use a string or something similar to decide which driver binds to the device?
Or do something similar to pci_add_dynid()?
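
For comparison, name-based matching (roughly what the platform bus falls back to when it compares device and driver names) needs no central ID registry. Below is a minimal user-space sketch of that idea; the model_ types and function are hypothetical and are not part of this patchset or of any kernel API.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a device that carries only a name. */
struct model_named_dev {
	const char *name;	/* chosen by the driver that creates the device */
};

/* Hypothetical stand-in for a driver that binds by name. */
struct model_named_drv {
	const char *name;	/* name the driver binds against */
};

/* Bind when the device name equals the driver name; returns 1 on match. */
static int model_name_match(const struct model_named_dev *dev,
			    const struct model_named_drv *drv)
{
	assert(dev->name != NULL && drv->name != NULL);
	return strcmp(dev->name, drv->name) == 0;
}
```

With this approach uniqueness depends on the names chosen by each vendor driver rather than on a shared enum.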

> So far, this is just a very loose wrapping of the driver core bus functionality,
Right, it is wrapped to avoid creating random device_driver and device objects in the device tree.

> which is fine, but I really don't see the goal here...

I see you have extended this question in the mail where you ask about creating netdevices and using mfd.
So let's discuss it in that context, as it is more appropriate there. This patch is just code.


2019-03-01 17:01:17

by Greg Kroah-Hartman

Subject: Re: [RFC net-next 1/8] subdev: Introducing subdev bus

On Fri, Mar 01, 2019 at 04:35:46PM +0000, Parav Pandit wrote:
> Hi Greg,
>
> > -----Original Message-----
> > From: Greg KH <[email protected]>
> > Sent: Friday, March 1, 2019 1:17 AM
> > To: Parav Pandit <[email protected]>
> > Cc: [email protected]; [email protected];
> > [email protected]; [email protected]; Jiri Pirko
> > <[email protected]>
> > Subject: Re: [RFC net-next 1/8] subdev: Introducing subdev bus
> >
> > On Thu, Feb 28, 2019 at 11:37:45PM -0600, Parav Pandit wrote:
> > > Introduce a new subdev bus which holds sub devices created from a
> > > primary device. These devices are named as 'subdev'.
> > > A subdev is identified similarly to pci device using 16-bit vendor id
> > > and device id.
> > > Unlike PCI devices, scope of subdev is limited to Linux kernel.
> >
> > But these are limited to only PCI devices, right?
> >
> For Mellanox use case yes, its limited to PCI devices.
>
> > This sounds a lot like that ARM proposal a week or so ago that asked for
> > something like this, are you working with them to make sure your proposal
> > works for them as well? (sorry, can't find where that was announced, it was
> > online somewhere...)
> >
> We were not aware of it, mostly because we are either on net side of mailing lists (netdev, rdma, virt etc).
> ARM proposal likely on linux-kernel, I guess.
> I will lookup that proposal and surely see if both of us can use common infrastructure.
>
> > > A central entry that assigns unique subdev vendor and device id is:
> > > include/linux/subdev_ids.h enums. Enum are chosen over define macro so
> > > that two vendors do not end up with vendor id in kernel development
> > > process.
> >
> > Why not just make it dynamic with on static ids?
> >
> Can you please elaborate?
> Do you mean we should use something similar to pci_add_dynid() with enhancement to catch duplicate id addition?

I have no idea what I wrote here, sorry :)

I was trying to say something like "using an enumerated type is going to
rely on a central authority for your "dynamic" bus, why is that needed
at all"?

> > > subdev bus holds subdevices of multiple devices. A typical created
> > > subdev for a PCI device in sysfs tree appears under their parent's
> > > device as using core's default device naming scheme:
> > >
> > > subdev<instance_id>.
> > > i.e.
> > > subdev0
> > > subdev1
> > >
> > > $ ls -l /sys/bus/pci/devices/0000:05:00.0 [..]
> > > drwxr-xr-x 4 root root 0 Feb 13 15:57 subdev0
> > > drwxr-xr-x 4 root root 0 Feb 13 15:57 subdev1
> > >
> > > Device model view:
> > > ------------------
> > > +------+ +------+ +------+
> > > |subdev| |subdev| |subdev|
> > > -----| 1 |----| 2 |-------| 3 |----------
> > > | +--|---+ +-|----+ +--|---+ |
> > > --------|----------|---subdev bus--|--------------
> > > | | |
> > > +--+----+-----+ +---+---+
> > > |pcidev | |pcidev |
> > > -----| A |-----------------| B |----------
> > > | +-------+ +-------+ |
> > > -------------------pci bus------------------------
> >
> > To be clear, "subdev bus" is just a logical grouping, there is no physical
> > backing "bus" here at all, right?
> >
> Yep. that's correct.
>
> > What is going to "bind" to subdev devices? PCI drivers? Or new types of
> > drivers?
> >
> Devices are placed on subdev bus using devlink interface. And drivers which registers using subdev_register_driver(), their probe() method will be called.

But it's just a virtual mapping, what "good" does this provide anyone?
You are still sharing the same backing device here, what does this
logical split buy you?

> So yes, those are PCI vendor driver.
> I tried to capture this in cover-letter.
> At present users didn't ask to map this subdev to VM, but there is very high chance that once we have this without PCI SR-IOV, they would like to extend to VMs too.
> So in that case devlink will have option to say, add 'passthrough' device, and in that case instead of vendor's pci driver, some high level vfio type driver will bind to it.
> That is just the anticipation, but we haven't really worked out this fully.
> But this model allows to do so.

I think mfd is what you want to do here, instead of creating your own
bus type.

> > > +int subdev_add_dev(struct subdev *subdev, struct device *parent_dev,
> > > + enum subdev_vendor_id vid, enum subdev_device_id did) {
> > > + u32 id = 0;
> > > + int ret;
> > > +
> > > + if (!parent_dev)
> > > + return -EINVAL;
> >
> > No root devices?
> >
> I didn't get the comment. Intent of this check is subdev must have parent. Parent type doesn't matter.

You do not allow a subdev to sit at the "root" of the device tree.
That's fine, it was just a comment, it's your choice.

thanks,

greg k-h

2019-03-01 17:39:48

by Parav Pandit

Subject: RE: [RFC net-next 7/8] net/mlx5: Add devlink subdev life cycle command support



> -----Original Message-----
> From: Greg KH <[email protected]>
> Sent: Friday, March 1, 2019 1:19 AM
> To: Parav Pandit <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Jiri Pirko
> <[email protected]>
> Subject: Re: [RFC net-next 7/8] net/mlx5: Add devlink subdev life cycle
> command support
>
> On Thu, Feb 28, 2019 at 11:37:51PM -0600, Parav Pandit wrote:
> > --- /dev/null
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/subdev.c
> > @@ -0,0 +1,55 @@
> > +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
>
> For new stuff, just use GPL-2.0, no need to keep the mistake of the Linux-
> OpenIB license around :)
>
Oh yes, that was my copy-paste error, carrying the OpenIB license forward out of ignorance.
Checkpatch did actually complain, but I didn't realize that it may be about OpenIB.
Will fix this.

> thanks,
>
> greg k-h

2019-03-01 18:02:50

by Parav Pandit

Subject: RE: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to subdev devices



> -----Original Message-----
> From: Greg KH <[email protected]>
> Sent: Friday, March 1, 2019 1:22 AM
> To: Parav Pandit <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Jiri Pirko
> <[email protected]>
> Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> subdev devices
>
> On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> > Add a subdev driver to probe the subdev devices and create fake
> > netdevice for it.
>
> So I'm guessing here is the "meat" of the whole goal here?
>
> You just want multiple netdevices per PCI device? Why can't you do that
> today in your PCI driver?
>
Yes, but it is not just multiple netdevices.
Let me elaborate in detail.

There is a switchdev mode of a PCI function for netdevices.
In this mode a given netdev has an additional control netdev (called a representor netdevice = rep-ndev).
This rep-ndev is attached to OVS for adding rules, offloads, etc. using the standard tc and netfilter infra.
Currently this rep-ndev controls the switch side of the settings, but not the host side of the netdev.
So there is discussion about creating another netdev or devlink port.

Additionally this subdev has an optional rdma device too.

And when we are in switchdev mode, this rdma dev has a similar rdma rep device for control.

In some cases we actually don't create a netdev, when it is in InfiniBand mode.
Here there is PCI device->rdma_device.

In another case, a given sub device for rdma is a dual-port device, having a netdevice for each port that can use the existing netdev->dev_port.

Creating 4 devices of two different classes using one iproute2/ip or iproute2/rdma command is a horrible thing to do.

If this sub device has to be a passthrough device, the ip link command will fail badly that day, because we would be creating a sub device which is not even a netdevice.

So iproute2/devlink, which works on bus+device (mainly PCI today), seems the right abstraction point to create sub devices.
This also extends the rich infrastructure that is already built (mapping ports of the device, health, register debug, etc.) to sub devices.

Additionally, we don't want the mlx driver and other drivers to iterate over their child devices (split logic in netdev and rdma) for power management.
Kernel core code does that well today, which we would like to leverage through subdev bus or mfd pm callbacks.

So it is a lot more than just creating netdevices.
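
To illustrate the power management point: when sub devices are real children in the device hierarchy, the core can quiesce children before their parent without each vendor driver walking its own child list. Below is a minimal user-space sketch of that ordering. The model_dev and suspend_log types and model_suspend() are hypothetical and only model the idea; they are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4
#define MAX_LOG 16

/* Hypothetical model of a device node; not the kernel's struct device. */
struct model_dev {
	const char *name;
	struct model_dev *children[MAX_CHILDREN];
	size_t nr_children;
};

/* Records the order in which devices were suspended. */
struct suspend_log {
	const char *names[MAX_LOG];
	size_t count;
};

/* Suspend every child before the device itself (post-order walk). */
static void model_suspend(const struct model_dev *dev, struct suspend_log *log)
{
	size_t i;

	for (i = 0; i < dev->nr_children; i++)
		model_suspend(dev->children[i], log);

	assert(log->count < MAX_LOG);	/* the sketch assumes a small tree */
	log->names[log->count++] = dev->name;
}
```

For a PCI function with two sub devices, the log ends up holding both sub devices first and the PCI function last.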

> What problem are you trying to solve that others also are having that
> requires all of this?
>
> Adding a new bus type and subsystem is fine, but usually we want more
> than just one user of it, as this does not really show how it is exercised very
> well.
This subdev and devlink infrastructure solves the problem of creating smaller sub devices out of one PCI device.
Someone has to start.. :-)

To my knowledge, Netronome, Broadcom and Mellanox are actively using this devlink and switchdev infra today.
I added Jakub from Netronome to CC to hear his feedback; he is on the netdev mailing list.

> Ideally 3 users would be there as that is when it proves itself that it is
> flexible enough.
>

We looked at drivers/visorbus to see if we could repurpose it, but its GUID device naming scheme is just not user friendly.
It has only a single user, s-Par, whose guest drivers have been in staging for more than a year now. So it doesn't really fit well.

> Would just using the mfd subsystem work better for you? That provides
> core support for "multi-function" drivers/devices already. What is missing
> from that subsystem that does not work for you here?
>
We were not aware of mfd until now. I have looked at it only at a very high level so far. It's a wrapper around platform devices and seems widely used.
Before the subdev proposal, Jason suggested an alternative: create platform devices and attach a driver to them.

When I read the kernel documentation [1], it says "platform devices typically appear as autonomous entities".
Here, instead of being autonomous, the devices are under the user's control.
Platform devices probably don't disappear often in a live system, as opposed to subdevices, which are created and removed dynamically and much more often.

Not sure if using platform devices for this purpose is abuse or not.
So guidance on which direction to go, devlink->mfd (platform wrapper) or devlink->subdev, would obviously be a huge blessing.

[1] https://www.kernel.org/doc/Documentation/driver-model/platform.txt


2019-03-01 20:05:03

by Jakub Kicinski

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On Thu, 28 Feb 2019 23:37:44 -0600, Parav Pandit wrote:
> Use case:
> ---------
> A user wants to create/delete hardware linked sub devices without
> using SR-IOV.
> These devices for a pci device can be netdev (optional rdma device)
> or other devices. Such sub devices share some of the PCI device
> resources and also have their own dedicated resources.
>
> Few examples are:
> 1. netdev having its own txq(s), rq(s) and/or hw offload parameters.
> 2. netdev with switchdev mode using netdev representor
> 3. rdma device with IB link layer and IPoIB netdev
> 4. rdma/RoCE device and a netdev
> 5. rdma device with multiple ports
>
> Requirements for above use cases:
> --------------------------------
> 1. We need a generic user interface & core APIs to create sub devices
> from a parent pci device but should be generic enough for other parent
> devices
> 2. Interface should be vendor agnostic
> 3. User should be able to set device params at creation time
> 4. In future if needed, tool should be able to create passthrough
> device to map to a virtual machine

Like a mediated device?

https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
https://www.dpdk.org/wp-content/uploads/sites/35/2018/06/Mediated-Devices-Better-Userland-IO.pdf

Other than pass-through it is entirely unclear to me why you'd need
a bus. (Or should I say VM pass through or DPDK?) Could you clarify
why the need for a bus?

My thinking is that we should allow spawning subports in devlink and
if user specifies "passthrough" the device spawned would be an mdev.

> 5. A device can have multiple ports

What does this mean, in practice? You want to spawn a subdev which can
access both ports? That'd be for RDMA use cases, more than Ethernet,
right? (Just clarifying :))

> 6. An orchestration software wants to know how many such sub devices
> can be created from a parent device so that it can manage them in global
> cluster resources.
>
> So how is it done?
> ------------------
> (a) user in control
> To address above requirements, a generic tool iproute2/devlink is
> extended for sub device's life cycle.
> However a devlink tool and its kernel counter part is not sufficient
> to create protocol agnostic devices on a existing PCI bus.

"Protocol agnostic"?... What does that mean?

> (b) subdev bus
> A given bus defines well defined addressing scheme. Creating sub devices
> on existing PCI bus with a different naming scheme is just weird.
> So, creating well named devices on appropriate bus is desired.

What's that address scheme you're referring to, you seem to assign IDs
in sequence?

> Hence a new 'subdev' bus is created.
> User adds/removes new sub devices subdev on this bus via a devlink tool.
> devlink tool instructs hardware driver to create/remove/configure
> such devices. Hardware vendor driver places devices on the bus.
> Another or same vendor driver matches based on vendor-id, device-id
> scheme and run through classic device driver model.
>
> Given that, these are user created devices for a given hardware and in
> absence of a central entity like PCISIG to assign vendor and device ids,
> A unique vendor and device id are maintained as enum in
> include/linux/subdev_ids.h.

Why do we need IDs? The sysfs hierarchy isn't sufficient? Do we need
a driver to match on those again? Is it going to be a different driver?

> subdev bus device names follow default device naming scheme of Linux
> kernel. It is done as 'subdev<instance_id>' such as, subdev0, subdev3.
>
> subdev device inherits its parent's DMA parameters.
> subdev will follow rich power management infrastructure of core kernel/
> So that every vendor driver doesn't have to iterate over its child
> devices, invent a locking and device anchoring scheme.
>
> Patchset summary:
> -----------------
> Patch-1, 2 introduces a subdev bus and interface for subdev life cycle.
> Patch-3 extends modpost tool for module device id table.
> Patch-4,5,6 implements a devlink vendor driver to add/remove devices.
> Patch-7 mlx5 driver implements subdev devices and places them on subdev
> bus.
> Patch-8 match against the subdev for mlx5 vendor, device id and creates
> fake netdevice.
>
> All patches are only a reference implementation to see RFC in works
> at devlink, sysfs and device model level. Once RFC looks good, more
> solid upstreamable version of the implementation will be done.
> All patches are functional except the last two patches, which just
> create fake subdev devices and fake netdevice.
>
> System example view:
> --------------------
>
> $ devlink dev show
> pci/0000:05:00.0
>
> $ devlink dev add pci/0000:05:00.0

That does not look great.

Also you have to return the id of the spawned device, otherwise this
is very racy.

> $ devlink dev show
> pci/0000:05:00.0
> subdev/subdev0

Please don't spawn devlink instances. Devlink instance is supposed to
represent an ASIC. If we start spawning them willy nilly for whatever
software construct we want to model the clarity of the ontology will
suffer a lot.

Please see the discussion on my recent patchset. I think Jiri CCed you.

> sysfs view with subdev:
>
> $ ls -l /sys/bus/pci/devices/0000:05:00.0
> [..]
> drwxr-xr-x 3 root root 0 Feb 13 15:57 infiniband
> -rw-r--r-- 1 root root 4096 Feb 13 15:57 msi_bus
> drwxr-xr-x 3 root root 0 Feb 13 15:57 net
> drwxr-xr-x 2 root root 0 Feb 13 15:57 power
> drwxr-xr-x 3 root root 0 Feb 13 15:57 ptp
> drwxr-xr-x 4 root root 0 Feb 13 15:57 subdev0
>
> $ ls -l /sys/bus/pci/devices/0000:05:00.0/subdev0
> lrwxrwxrwx 1 root root 0 Feb 13 15:58 driver -> ../../../../../bus/subdev/drivers/mlx5_core
> drwxr-xr-x 3 root root 0 Feb 13 15:58 net
> drwxr-xr-x 2 root root 0 Feb 13 15:58 power
> lrwxrwxrwx 1 root root 0 Feb 13 15:58 subsystem -> ../../../../../bus/subdev
> -rw-r--r-- 1 root root 4096 Feb 13 15:58 uevent
>
> $ ls -l /sys/bus/pci/devices/0000:05:00.0/subdev0/net/
> drwxr-xr-x 5 root root 0 Feb 13 15:58 eth0
>
> Software view:
> -------------
> Some of you if you prefer to see in picture, below diagram tries to
> show software modules in bus/device hierarchy.
>
> devlink user (iproute2/devlink)
> ------------------------------
> |
> |
> +----------------+
> | devlink module |
> | doit() | +------------------+
> | | | | vendor driver |
> +------------|---+ | (mlx5) |
> ----------+-> subdev_ops() |
> +|-----------------+
> |
> +---------|--+ +-----------+ +------------------+
> | subdev bus | | core | | subdev device |
> | driver | | kernel | | drivers |
> | (add/del) | | dev model | | (netdev, rdma) |
> | ----------------------> probe/remove() |
> +------------+ +-----------+ +------------------+
>
> Alternatives considered:
> ------------------------
> Will discuss separately if needed to keep this RFC short.

Please do discuss.

The key thing for me on the netdev side is what is the
forwarding model to this new entity. Is this basically VMDQ?
Should we just go ahead and mandate "switchdev mode" here?

Thanks for working on a common architecture and suffering through
people's reviews rather than adding a debugfs interface that does
this like a different vendor did :)

2019-03-01 22:12:46

by Saeed Mahameed

Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to subdev devices

On Thu, 2019-02-28 at 23:37 -0600, Parav Pandit wrote:
> Add a subdev driver to probe the subdev devices and create fake
> netdevice for it.
>
> Signed-off-by: Parav Pandit <[email protected]>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/Makefile | 2 +-
> drivers/net/ethernet/mellanox/mlx5/core/main.c | 8 +-
> .../net/ethernet/mellanox/mlx5/core/mlx5_core.h | 3 +
> .../ethernet/mellanox/mlx5/core/subdev_driver.c | 93
> ++++++++++++++++++++++
> 4 files changed, 104 insertions(+), 2 deletions(-)
> create mode 100644
> drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> index f218789..c8aeaf1 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> @@ -16,7 +16,7 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o
> eq.o uar.o pagealloc.o \
> transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
> fs_counters.o rl.o lag.o dev.o events.o wq.o lib/gid.o
> \
> lib/devcom.o diag/fs_tracepoint.o diag/fw_tracer.o
> -mlx5_core-$(CONFIG_SUBDEV) += subdev.o
> +mlx5_core-$(CONFIG_SUBDEV) += subdev.o subdev_driver.o
>
> #
> # Netdev basic
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 5f8cf0d..7dfa8c4 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -1548,7 +1548,11 @@ static int __init init(void)
> mlx5e_init();
> #endif
>
> - return 0;
> + err = subdev_register_driver(&mlx5_subdev_driver);
> + if (err)
> + pci_unregister_driver(&mlx5_core_driver);
> +
> + return err;
>
> err_debug:
> mlx5_unregister_debugfs();
> @@ -1557,6 +1561,8 @@ static int __init init(void)
>
> static void __exit cleanup(void)
> {
> + subdev_unregister_driver(&mlx5_subdev_driver);
> +
> #ifdef CONFIG_MLX5_CORE_EN
> mlx5e_cleanup();
> #endif
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> index 2a54148..1b733c7 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> @@ -41,12 +41,15 @@
> #include <linux/ptp_clock_kernel.h>
> #include <linux/mlx5/cq.h>
> #include <linux/mlx5/fs.h>
> +#include <linux/subdev_bus.h>
>
> #define DRIVER_NAME "mlx5_core"
> #define DRIVER_VERSION "5.0-0"
>
> extern uint mlx5_core_debug_mask;
>
> +extern struct subdev_driver mlx5_subdev_driver;
> +
> #define mlx5_core_dbg(__dev, format, ...)
> \
> dev_dbg(&(__dev)->pdev->dev, "%s:%d:(pid %d): " format,
> \
> __func__, __LINE__, current->pid,
> \
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> new file mode 100644
> index 0000000..880aa4f
> --- /dev/null
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2018-19 Mellanox Technologies
> +
> +#include <linux/module.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/subdev_bus.h>
> +#include <linux/subdev_ids.h>
> +#include <linux/etherdevice.h>
> +
> +struct mlx5_subdev_ndev {
> + struct net_device ndev;
> +};
> +
> +static void mlx5_dma_test(struct device *dev)
> +{
> + dma_addr_t pa;
> + void *va;
> +
> + va = dma_alloc_coherent(dev, 4096, &pa, GFP_KERNEL);
> + if (va)
> + dma_free_coherent(dev, 4096, va, pa);
> +}
> +
> +static struct net_device *ndev;
> +
> +static int mlx5e_subdev_open(struct net_device *netdev)
> +{
> + return 0;
> +}
> +
> +static int mlx5e_subdev_close(struct net_device *netdev)
> +{
> + return 0;
> +}
> +
> +static netdev_tx_t
> +mlx5e_subdev_xmit(struct sk_buff *skb, struct net_device *netdev)
> +{
> + return NETDEV_TX_BUSY;
> +}
> +
> +const struct net_device_ops mlx5e_subdev_netdev_ops = {
> + .ndo_open = mlx5e_subdev_open,
> + .ndo_stop = mlx5e_subdev_close,
> + .ndo_start_xmit = mlx5e_subdev_xmit,
> +};
> +
> +static int mlx5_subdev_probe(struct device *dev)
> +{
> + int err;
> +
> + mlx5_dma_test(dev);

Hi Parav, can you please shed some light on how you plan to
communicate with the parent device (the pci_dev and its running driver
instance)? We will need to share some resources, such as IRQs/BARs/etc.,
and maybe some HW objects which are going to be managed by the
parent pci device driver.

Just allocating a dma buffer doesn't mean anything; the dma buffer is
just bound to the generic device.

> + /* Only one device supported in rfc */
> + if (ndev)
> + return 0;
> +
> + ndev = alloc_etherdev_mqs(sizeof(struct mlx5_subdev_ndev), 1,
> 1);
> + if (!ndev)
> + return -ENOMEM;
> +
> + SET_NETDEV_DEV(ndev, dev);
> + ndev->netdev_ops = &mlx5e_subdev_netdev_ops;
> + err = register_netdev(ndev);
> + if (err) {
> + free_netdev(ndev);
> + ndev = NULL;
> + }
> + return err;
> +}
> +
> +static int mlx5_subdev_remove(struct device *dev)
> +{
> + if (ndev) {
> + unregister_netdev(ndev);
> + free_netdev(ndev);
> + ndev = NULL;
> + }
> + return 0;
> +}
> +
> +static const struct subdev_id mlx5_subdev_id_table[] = {
> + { .vendor_id = SUBDEV_VENDOR_ID_MELLANOX,
> + .device_id = SUBDEV_DEVICE_ID_MELLANOX_SF },
> + { 0, }
> +};
> +MODULE_DEVICE_TABLE(subdev, mlx5_subdev_id_table);
> +
> +struct subdev_driver mlx5_subdev_driver = {
> + .id_table = mlx5_subdev_id_table,
> + .driver.name = "mlx5_subdev_driver",
> + .driver.probe = mlx5_subdev_probe,
> + .driver.remove = mlx5_subdev_remove,
> +};
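
On the question above of how a subdev driver could reach its parent: one conventional driver-model pattern (an assumption here, not something the RFC specifies) is to follow the device's parent pointer and recover the enclosing parent structure with container_of(). A user-space sketch with hypothetical model_ types standing in for struct device, struct pci_dev and struct subdev:

```c
#include <assert.h>
#include <stddef.h>

/* Same idea as the kernel's container_of(), written portably. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct device. */
struct model_device {
	struct model_device *parent;
};

/* Hypothetical stand-in for the parent PCI function. */
struct model_pci_dev {
	int irq;			/* a resource owned by the parent */
	struct model_device dev;
};

/* Hypothetical stand-in for struct subdev. */
struct model_subdev {
	unsigned int id;
	struct model_device dev;
};

/* Walk from a subdev to the parent PCI function that created it. */
static struct model_pci_dev *model_subdev_to_pci(struct model_subdev *sd)
{
	assert(sd->dev.parent != NULL);	/* a subdev always has a parent */
	return container_of(sd->dev.parent, struct model_pci_dev, dev);
}
```

Such a cast is only safe when the parent's type is known, so a real driver would have to verify that the parent really is a PCI device before relying on it.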

2019-03-04 04:43:56

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> -----Original Message-----
> From: Jakub Kicinski <[email protected]>
> Sent: Friday, March 1, 2019 2:04 PM
> To: Parav Pandit <[email protected]>; Or Gerlitz <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
> On Thu, 28 Feb 2019 23:37:44 -0600, Parav Pandit wrote:
> > Requirements for above use cases:
> > --------------------------------
> > 1. We need a generic user interface & core APIs to create sub devices
> > from a parent pci device but should be generic enough for other parent
> > devices 2. Interface should be vendor agnostic 3. User should be able
> > to set device params at creation time 4. In future if needed, tool
> > should be able to create passthrough device to map to a virtual
> > machine
>
> Like a mediated device?
>
Yes.

> https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
> https://www.dpdk.org/wp-content/uploads/sites/35/2018/06/Mediated-
> Devices-Better-Userland-IO.pdf
>
> Other than pass-through it is entirely unclear to me why you'd need a bus.
> (Or should I say VM pass through or DPDK?) Could you clarify why the need
> for a bus?
>
A bus follows the standard Linux kernel device driver model to attach a driver to a specific device.
A platform device, with my limited understanding, looks like a hack/abuse of it based on the documentation [1], but it can possibly be an alternative to a bus if it looks fine to Greg and others.

> My thinking is that we should allow spawning subports in devlink and if user
> specifies "passthrough" the device spawned would be an mdev.
>
A devlink device is a much more comprehensive way to create sub-devices than sub-ports, for at least the reasons below.

1. devlink device already defines the device->port relation, which enables creating a multiport device.
A subport breaks that.
2. The bus model enables us to load a driver of the same vendor, or a generic one such as vfio in future.
3. Devices live on the bus; mapping a subport to 'struct device' is not intuitive.
4. A sub-device allows using the existing devlink port, registers and health infrastructure for sub devices, which would otherwise need to be duplicated for ports.
5. Even though current devlink devices are networking devices, nothing restricts them to be that way.
So a subport is a restricted view.
6. devlink device already covers the port sub-object, hence creating a devlink device is desired.

> > 5. A device can have multiple ports
>
> What does this mean, in practice? You want to spawn a subdev which can
> access both ports? That'd be for RDMA use cases, more than Ethernet,
> right? (Just clarifying :))
>
Yep, you got it right. :-)

> > So how is it done?
> > ------------------
> > (a) user in control
> > To address above requirements, a generic tool iproute2/devlink is
> > extended for sub device's life cycle.
> > However a devlink tool and its kernel counter part is not sufficient
> > to create protocol agnostic devices on a existing PCI bus.
>
> "Protocol agnostic"?... What does that mean?
>
Devlink works on a bus,device model. It doesn't matter what the class of the device is.
For example, for PCI the class can be anything. So newly created sub-devices are not limited to netdev/rdma devices.
It's agnostic to protocol.
More importantly, we don't want to create these sub-devices with bus type 'pci'.
Because, as described below, PCI has its own addressing scheme, and the pci bus must not have mix-and-match devices.

So probably a better wording would be:
'a devlink tool and its kernel counterpart are not sufficient to create sub-devices of the same class as that of the PCI device.'

> > (b) subdev bus
> > A given bus defines well defined addressing scheme. Creating sub
> > devices on existing PCI bus with a different naming scheme is just weird.
> > So, creating well named devices on appropriate bus is desired.
>
> What's that address scheme you're referring to, you seem to assign IDs in
> sequence?
>
Yes. A device on the subdev bus follows the standard Linux driver model id assignment scheme (a u32).
Devices are named like 'subdev0': prefix + id, the default scheme of the core driver model.
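
To make the "prefix + id" naming concrete, here is a minimal userspace sketch. It is not code from the patchset: subdev_format_name() is a hypothetical helper, and in the kernel the name would presumably be set through dev_set_name() rather than snprintf().

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): build the default
 * "<bus prefix><instance id>" name, e.g. "subdev0" or "subdev3".
 * Returns the number of characters that the full name needs,
 * as snprintf() does.
 */
static int subdev_format_name(char *buf, size_t len, uint32_t id)
{
	assert(buf != NULL);
	return snprintf(buf, len, "subdev%" PRIu32, id);
}
```

The id is only an instance number that is unique on the bus; it does not encode which parent device the sub-device belongs to.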

> >
> > Given that, these are user created devices for a given hardware and in
> > absence of a central entity like PCISIG to assign vendor and device
> > ids, A unique vendor and device id are maintained as enum in
> > include/linux/subdev_ids.h.
>
> Why do we need IDs? The sysfs hierarchy isn't sufficient?

> Do we need a driver to match on those again? Is it going to be a different driver?
>
IDs are used to match a driver against the created device.
It can be the same or a different driver.
Even in the same-driver case, it provides clear code separation between creating sub-devices and creating their one or more protocol devices (netdev, rep-netdev, rdma, ...).
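
As a rough illustration of ID-based matching, the sketch below walks a zero-terminated id table the way a bus match callback typically would. The field names vendor_id and device_id follow the id table in patch 8/8; the enum values, the example table and example_subdev_match() are made up for this sketch and are not the subdev bus implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field names as used in the patch 8/8 id table. */
struct subdev_id {
	uint32_t vendor_id;
	uint32_t device_id;
};

/* Made-up numeric values, for illustration only. */
enum { EX_VENDOR_ID_MELLANOX = 1 };
enum { EX_DEVICE_ID_MELLANOX_SF = 1 };

/* A driver's id table, terminated by an all-zero entry. */
static const struct subdev_id example_id_table[] = {
	{ .vendor_id = EX_VENDOR_ID_MELLANOX,
	  .device_id = EX_DEVICE_ID_MELLANOX_SF },
	{ 0, }
};

/*
 * Hypothetical match helper: return the table entry that matches the
 * device's vendor/device id pair, or NULL if the driver does not
 * claim this device.
 */
static const struct subdev_id *
example_subdev_match(const struct subdev_id *table,
		     uint32_t vendor, uint32_t device)
{
	assert(table != NULL);
	for (; table->vendor_id || table->device_id; table++) {
		if (table->vendor_id == vendor && table->device_id == device)
			return table;
	}
	return NULL;
}
```

With a table like this, the same module or a different one can claim a sub-device purely by listing its id pair.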

> > subdev bus device names follow default device naming scheme of Linux
> > kernel. It is done as 'subdev<instance_id>' such as, subdev0, subdev3.
> >
> > System example view:
> > --------------------
> >
> > $ devlink dev show
> > pci/0000:05:00.0
> >
> > $ devlink dev add pci/0000:05:00.0
>
> That does not look great.
>
Yes, it must return the bus+device attributes in the user output too.
The code in the existing patchset returns them; this is just not shown here.
I will fix the cover-letter.

> Also you have to return the id of the spawned device, otherwise this is very
> racy.
>
Yes, that is correct. It must return a devlink device id = {bus+device} attr.
I will update the example in v2.

> > $ devlink dev show
> > pci/0000:05:00.0
> > subdev/subdev0
>
> Please don't spawn devlink instances. Devlink instance is supposed to
> represent an ASIC. If we start spawning them willy nilly for whatever
> software construct we want to model the clarity of the ontology will suffer a
> lot.
Devlink devices are not restricted to an ASIC, even though today it is representing the ASIC for one vendor.
Today for one ASIC, it already presents multiple devlink devices (128 or more) for PF and VFs, two PFs on the same ASIC, etc.
A VF is just a sub-device that is well defined by PCISIG, whereas this sub-device is not.
Sub-devices do consume actual ASIC resources (just like PFs and VFs).
Hence point (6) of the cover letter indicates the need for a devlink capability to tell how many such sub-devices can be created.

In the above example, they are created for a given bus-device following the existing devlink construct.

>
> Please see the discussion on my recent patchset. I think Jiri CCed you.
>
I will review the discussion in a short while after this reply, and provide comments.

> > Alternatives considered:
> > ------------------------
> > Will discuss separately if needed to keep this RFC short.
>
> Please do discuss.
>
(a) subports instead of subdevices.
We dropped this option because it's too restrictive; I explained above the benefits of devlink device.

(b) extending iproute2/ip link and iproute2/rdma tools to create sub-devices.
But that is too limiting and doesn't provide all the features we get using devlink.
It also doesn't address the passthrough needs, and it's just ugly to create and manage PCI level devices using high level tools like 'ip' and 'rdma'.

(c) creating platform device and platform driver instead of subdev bus
Our understanding is that a platform device for this purpose would be an abuse/misuse, but our view is limited to what the kernel documentation in [1] says.
[1] says "platform devices typically appear as autonomous entities".
Sub-devices, in contrast, are created, managed and configured by the user.
Most things in the "Platform devices" section of [1] do not match subdev.

Greg suggested using the mfd framework (a wrapper around platform devices), which also needs an extension.
mfd_remove_devices() removes all the devices, while here, based on user request, we want to add/remove individual devices.
I will wait to hear whether he is OK with the subdev bus or prefers to extend the platform documentation and mfd for removing individual devices.

(d) drivers/visorbus
This bus is limited to a UUID/GUID based naming scheme and is very specific to the s-Par standard and vendor.
Additionally, its guest drivers have been living in staging for more than a year.
So it doesn't appear to be the right direction.

(e) creating subdev as child objects of devlink device (such as port, registers, health, etc).
In this mode, a given devlink device has a multiport child device which is anchored using 'struct device' and life cycled through devlink.
The only difference from the current proposal is that it doesn't follow the standard driver model to bind to another driver.
It also doesn't show up in a unified way in 'devlink dev show'.

So instead of these alternatives, a devlink device that matches PF, VF and sub-device, plus a subdev bus, seems the better design.
This follows all standard constructs of 1. Devlink, 2. Linux driver model.
It is not limited to ports and is generic enough for networking and non-networking devices.

> The things key thing for me on the netdev side is what is the forwarding
> model to this new entity. Is this basically VMDQ?
> Should we just go ahead and mandate "switchdev mode" here?
>
It will follow the switchdev mode, but it is not limited to it.
Switchdev mode is for the eswitch functionality. There isn't a need to combine this.
rdma Infiniband will be able to use this without switchdev mode.

> Thanks for working on a common architecture and suffering through
> people's reviews rather than adding a debugfs interface that does this like a
> different vendor did :)
Oh yes, let's not do debugfs.
Thanks a lot Jakub for the review.
This common architecture should be able to address such common needs.
Please let me know if this needs more refinement, if I missed something.

[1] https://www.kernel.org/doc/Documentation/driver-model/platform.txt


2019-03-04 17:08:29

by Parav Pandit

Subject: RE: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to subdev devices

Hi Saeed,

> -----Original Message-----
> From: Saeed Mahameed
> Sent: Friday, March 1, 2019 4:12 PM
> To: Jiri Pirko <[email protected]>; [email protected]; linux-
> [email protected]; Parav Pandit <[email protected]>;
> [email protected]; [email protected];
> [email protected]
> Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> subdev devices
>
> On Thu, 2019-02-28 at 23:37 -0600, Parav Pandit wrote:
> > Add a subdev driver to probe the subdev devices and create fake
> > netdevice for it.
> >
> > Signed-off-by: Parav Pandit <[email protected]>
> > ---
> > drivers/net/ethernet/mellanox/mlx5/core/Makefile | 2 +-
> > drivers/net/ethernet/mellanox/mlx5/core/main.c | 8 +-
> > .../net/ethernet/mellanox/mlx5/core/mlx5_core.h | 3 +
> > .../ethernet/mellanox/mlx5/core/subdev_driver.c | 93
> > ++++++++++++++++++++++
> > 4 files changed, 104 insertions(+), 2 deletions(-) create mode
> > 100644 drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > index f218789..c8aeaf1 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > @@ -16,7 +16,7 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o
> > eq.o uar.o pagealloc.o \
> > transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
> > fs_counters.o rl.o lag.o dev.o events.o wq.o lib/gid.o \
> > lib/devcom.o diag/fs_tracepoint.o diag/fw_tracer.o
> > -mlx5_core-$(CONFIG_SUBDEV) += subdev.o
> > +mlx5_core-$(CONFIG_SUBDEV) += subdev.o subdev_driver.o
> >
> > #
> > # Netdev basic
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > index 5f8cf0d..7dfa8c4 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > @@ -1548,7 +1548,11 @@ static int __init init(void)
> > mlx5e_init();
> > #endif
> >
> > - return 0;
> > + err = subdev_register_driver(&mlx5_subdev_driver);
> > + if (err)
> > + pci_unregister_driver(&mlx5_core_driver);
> > +
> > + return err;
> >
> > err_debug:
> > mlx5_unregister_debugfs();
> > @@ -1557,6 +1561,8 @@ static int __init init(void)
> >
> > static void __exit cleanup(void)
> > {
> > + subdev_unregister_driver(&mlx5_subdev_driver);
> > +
> > #ifdef CONFIG_MLX5_CORE_EN
> > mlx5e_cleanup();
> > #endif
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> > b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> > index 2a54148..1b733c7 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> > @@ -41,12 +41,15 @@
> > #include <linux/ptp_clock_kernel.h>
> > #include <linux/mlx5/cq.h>
> > #include <linux/mlx5/fs.h>
> > +#include <linux/subdev_bus.h>
> >
> > #define DRIVER_NAME "mlx5_core"
> > #define DRIVER_VERSION "5.0-0"
> >
> > extern uint mlx5_core_debug_mask;
> >
> > +extern struct subdev_driver mlx5_subdev_driver;
> > +
> > #define mlx5_core_dbg(__dev, format, ...)
> > \
> > dev_dbg(&(__dev)->pdev->dev, "%s:%d:(pid %d): " format,
>
> > \
> > __func__, __LINE__, current->pid,
> > \
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> > new file mode 100644
> > index 0000000..880aa4f
> > --- /dev/null
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> > @@ -0,0 +1,93 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// Copyright (c) 2018-19 Mellanox Technologies
> > +
> > +#include <linux/module.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/subdev_bus.h>
> > +#include <linux/subdev_ids.h>
> > +#include <linux/etherdevice.h>
> > +
> > +struct mlx5_subdev_ndev {
> > + struct net_device ndev;
> > +};
> > +
> > +static void mlx5_dma_test(struct device *dev) {
> > + dma_addr_t pa;
> > + void *va;
> > +
> > + va = dma_alloc_coherent(dev, 4096, &pa, GFP_KERNEL);
> > + if (va)
> > + dma_free_coherent(dev, 4096, va, pa); }
> > +
> > +static struct net_device *ndev;
> > +
> > +static int mlx5e_subdev_open(struct net_device *netdev) {
> > + return 0;
> > +}
> > +
> > +static int mlx5e_subdev_close(struct net_device *netdev) {
> > + return 0;
> > +}
> > +
> > +static netdev_tx_t
> > +mlx5e_subdev_xmit(struct sk_buff *skb, struct net_device *netdev) {
> > + return NETDEV_TX_BUSY;
> > +}
> > +
> > +const struct net_device_ops mlx5e_subdev_netdev_ops = {
> > + .ndo_open = mlx5e_subdev_open,
> > + .ndo_stop = mlx5e_subdev_close,
> > + .ndo_start_xmit = mlx5e_subdev_xmit,
> > +};
> > +
> > +static int mlx5_subdev_probe(struct device *dev) {
> > + int err;
> > +
> > + mlx5_dma_test(dev);
>
> Hi Parav, can you please shed some light on how do you plan to
> communicate with the parent device ? (pci_dev and its running driver
> instance), We will need to share some resources, such as IRQs/BARs/etc ..,
> and maybe some HW objects which are going to be managed by the parent
> pci device driver.
>
Since the mlx5 driver works on its PCI device, in mlx5_subdev_probe(struct device *dev),
dev->parent is the PCI device for the driver to use.
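
A minimal userspace model of that parent lookup is sketched below. The struct layouts are stripped-down stand-ins for the kernel's struct device and struct pci_dev, and example_parent_pci_dev() is a hypothetical helper; in the kernel the equivalent step would be to_pci_dev(dev->parent), ideally guarded by dev_is_pci().

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel structures, reduced to what the sketch needs. */
struct device {
	struct device *parent;
};

struct pci_dev {
	unsigned int devfn;
	struct device dev;	/* embedded generic device */
};

/* Same idea as the kernel's container_of(), without the type checks. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Hypothetical helper: given a sub-device, return the PCI device that
 * embeds its parent 'struct device'. This is only valid when the
 * parent really is embedded in a struct pci_dev.
 */
static struct pci_dev *example_parent_pci_dev(struct device *dev)
{
	assert(dev != NULL && dev->parent != NULL);
	return container_of(dev->parent, struct pci_dev, dev);
}
```

This only shows how the probe routine can reach the parent; how IRQs, BARs and other HW objects are actually shared with the parent driver is not covered by the sketch.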

> Just allocating a dma buffer doesn't mean anything, the dma buffer is just
> bound to the generic device.
>

The dma buffer allocation is just there to make sure that dma allocations by the stack, the core and the rdma ULPs work the same way as they do for the PCI device.

> > + /* Only one device supported in rfc */
> > + if (ndev)
> > + return 0;
> > +
> > + ndev = alloc_etherdev_mqs(sizeof(struct mlx5_subdev_ndev), 1,
> > 1);
> > + if (!ndev)
> > + return -ENOMEM;
> > +
> > + SET_NETDEV_DEV(ndev, dev);
> > + ndev->netdev_ops = &mlx5e_subdev_netdev_ops;
> > + err = register_netdev(ndev);
> > + if (err) {
> > + free_netdev(ndev);
> > + ndev = NULL;
> > + }
> > + return err;
> > +}
> > +
> > +static int mlx5_subdev_remove(struct device *dev) {
> > + if (ndev) {
> > + unregister_netdev(ndev);
> > + free_netdev(ndev);
> > + ndev = NULL;
> > + }
> > + return 0;
> > +}
> > +
> > +static const struct subdev_id mlx5_subdev_id_table[] = {
> > + { .vendor_id = SUBDEV_VENDOR_ID_MELLANOX,
> > + .device_id = SUBDEV_DEVICE_ID_MELLANOX_SF },
> > + { 0, }
> > +};
> > +MODULE_DEVICE_TABLE(subdev, mlx5_subdev_id_table);
> > +
> > +struct subdev_driver mlx5_subdev_driver = {
> > + .id_table = mlx5_subdev_id_table,
> > + .driver.name = "mlx5_subdev_driver",
> > + .driver.probe = mlx5_subdev_probe,
> > + .driver.remove = mlx5_subdev_remove, };

2019-03-05 01:36:18

by Jakub Kicinski

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

Parav, please wrap your responses to at most 80 characters.
This is hard to read.

On Mon, 4 Mar 2019 04:41:01 +0000, Parav Pandit wrote:
> > -----Original Message-----
> > From: Jakub Kicinski <[email protected]>
> > Sent: Friday, March 1, 2019 2:04 PM
> > To: Parav Pandit <[email protected]>; Or Gerlitz <[email protected]>
> > Cc: [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; Jiri Pirko <[email protected]>
> > Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
> >
> > On Thu, 28 Feb 2019 23:37:44 -0600, Parav Pandit wrote:
> > > Requirements for above use cases:
> > > --------------------------------
> > > 1. We need a generic user interface & core APIs to create sub devices
> > > from a parent pci device but should be generic enough for other parent
> > > devices 2. Interface should be vendor agnostic 3. User should be able
> > > to set device params at creation time 4. In future if needed, tool
> > > should be able to create passthrough device to map to a virtual
> > > machine
> >
> > Like a mediated device?
>
> Yes.
>
> > https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
> > https://www.dpdk.org/wp-content/uploads/sites/35/2018/06/Mediated-
> > Devices-Better-Userland-IO.pdf
> >
> > Other than pass-through it is entirely unclear to me why you'd need a bus.
> > (Or should I say VM pass through or DPDK?) Could you clarify why the need
> > for a bus?
> >
> A bus follow standard linux kernel device driver model to attach a
> driver to specific device. Platform device with my limited
> understanding looks a hack/abuse of it based on documentation [1],
> but it can possibly be an alternative to bus if it looks fine to Greg
> and others.

I grok from this text that the main advantage you see is the ability to
choose a driver for the subdevice.

> > My thinking is that we should allow spawning subports in devlink
> > and if user specifies "passthrough" the device spawned would be an
> > mdev.
>
> devlink device is much more comprehensive way to create sub-devices
> than sub-ports for at least below reasons.
>
> 1. devlink device already defines device->port relation which enables
> to create multiport device.

I presume that by devlink device you mean devlink instance? Yes, this
part I'm following.

> subport breaks that.

Breaks what? The ability to create a devlink instance with multiple
ports?

> 2. With bus model, it enables us to load driver of same vendor or
> generic one such a vfio in future.

Yes, sorry, I'm not an expert on mdevs, but isn't that the goal of
those? Could you go into more detail why not just use mdevs?

> 3. Devices live on the bus, mapping a subport to 'struct device' is
> not intuitive.

Are you saying that the main devlink instance would not have any port
information for the subdevices?

Devices live on a bus. Software constructs, depending on how one wants
to model them, don't have to.

> 4. sub-device allows to use existing devlink port,
> registers, health infrastructure to sub devices, which otherwise need
> to be duplicated for ports.

Health stuff is not tied to a port, I'm not following you. You can
create a reporter per port, per ACL rule or per SB or per whatever your
heart desires..

> 5. Even though current devlink devices are networking devices, there
> is nothing restricts it to be that way. So subport is a restricted
> view.
> 6. devlink device already covers
> port sub-object, hence creating devlink device is desired.
>
> > > 5. A device can have multiple ports
> >
> > What does this mean, in practice? You want to spawn a subdev which
> > can access both ports? That'd be for RDMA use cases, more than
> > Ethernet, right? (Just clarifying :))
> >
> Yep, you got it right. :-)
>
> > > So how is it done?
> > > ------------------
> > > (a) user in control
> > > To address above requirements, a generic tool iproute2/devlink is
> > > extended for sub device's life cycle.
> > > However a devlink tool and its kernel counter part is not
> > > sufficient to create protocol agnostic devices on a existing PCI
> > > bus.
> >
> > "Protocol agnostic"?... What does that mean?
> >
> Devlink works on bus,device model. It doesn't matter what class of
> device is. For example, for pci class can be anything. So newly
> created sub-devices are not limited to netdev/rdma devices. Its
> agnostic to protocol. More importantly, we don't want to create these
> sub-devices who bus type is 'pci'. Because as described below, PCI
> has its addressing scheme and pci bus must not have mix-n match
> devices.
>
> So probably better wording should be,
> 'a devlink tool and its kernel counterpart is not sufficient to
> create sub-devices of same class as that of PCI device.

Let me clarify - for networking devices the partition will most likely
end up as a subport, but it's not a requirement that each partition must
be a subport.. The question was about the necessity to invent a new
bus, and have every resource have a struct device..

> > > (b) subdev bus
> > > A given bus defines well defined addressing scheme. Creating sub
> > > devices on existing PCI bus with a different naming scheme is
> > > just weird. So, creating well named devices on appropriate bus is
> > > desired.
> >
> > What's that address scheme you're referring to, you seem to assign
> > IDs in sequence?
> >
> Yes. a device on subdev bus follows standard linux driver model based
> id assignment scheme = u32. And devices are well named as 'subdev0'.
> Prefix + id as the default scheme of core driver model.

I thought "well defined addressing scheme" means I can address
subdevice X of device Y with your scheme. I can't, it's just a
global ID. Thanks for clarifying.

> > > Given that, these are user created devices for a given hardware
> > > and in absence of a central entity like PCISIG to assign vendor
> > > and device ids, A unique vendor and device id are maintained as
> > > enum in include/linux/subdev_ids.h.
> >
> > Why do we need IDs? The sysfs hierarchy isn't sufficient?
>
> > Do we need a driver to match on those again? Is it going to be a
> > different driver?
> IDs are used to match driver against the created device.
> It can be same or different driver.
> Even in same driver case, it provides a clear code separation for
> creating sub-devices and their respective one or more protocol
> devices (netdev, rep-netdev, rdma ..)
>
> > > subdev bus device names follow default device naming scheme of
> > > Linux kernel. It is done as 'subdev<instance_id>' such as,
> > > subdev0, subdev3.
> > >
> > > System example view:
> > > --------------------
> > >
> > > $ devlink dev show
> > > pci/0000:05:00.0
> > >
> > > $ devlink dev add pci/0000:05:00.0
> >
> > That does not look great.
> >
> Yes, It must return bus+device attributes in user output too
> Code in existing patchset returns it, it is not shown here.
> I will fix the cover-letter.
>
> > Also you have to return the id of the spawned device, otherwise
> > this is very racy.
> >
> Yes, that is correct. It must return an devlink device id =
> {bus+device} attr. I will update the example in v2.
>
> > > $ devlink dev show
> > > pci/0000:05:00.0
> > > subdev/subdev0
>
> > Please don't spawn devlink instances. Devlink instance is supposed
> > to represent an ASIC. If we start spawning them willy nilly for
> > whatever software construct we want to model the clarity of the
> > ontology will suffer a lot.
> Devlink devices not restricted to ASIC even though today it is
> representing ASIC for one vendor. Today for one ASIC, it already
> presents multiple devlink devices (128 or more) for PF and VFs, two
> PFs on same ASIC etc. VF is just a sub-device which is well defined
> by PCISIG, whereas sub-device is not. Sub-device do consume actual
> ASIC resources (just like PFs and VFs), Hence point-(6) of
> cover-letter indicate that the devlink capability to tell how many
> such sub-devices can be created.
>
> In above example, they are created for a given bus-device following
> existing devlink construct.
>
> > Please see the discussion on my recent patchset. I think Jiri CCed
> > you.
> I will review the discussion in short while after this reply, and
> provide comments.
>
> > > Alternatives considered:
> > > ------------------------
> > > Will discuss separately if needed to keep this RFC short.
> >
> > Please do discuss.
> >
> (a) subports instead of subdevices.
> We dropped this option because its two restrictive; I explained above
> the benefits of devlink device.
>
> (b) extending iproute2/ip link and iproute2/rdma tools to creating
> sub-devices. But that is too limiting which doesn't provide all the
> features we get using devlink. It also doesn't address the
> passthrough needs and its just ugly to create and manage PCI level
> devices using high level tools like 'ip' and 'rdma'.
>
> (c) creating platform device and platform driver instead of subdev bus
> Our understanding is that - platform device for this purpose would be
> an abuse/misuse, but our view is limited based on kernel
> documentation in [2]. [1] says "platform devices typically appear as
> autonomous entities" Sub-devices are well managed, created,
> configurable by user. Most things of [1] -> "Platform devices"
> section do not match with subdev.
>
> Greg suggested to use mfd framework (wrapper to platform), which also
> needs extension. mfd_remove_devices() removes all the devices, while
> here based on user request, we want to add/remove individual device.
> Will wait if he is ok with subdev bus or he prefers to extend the
> platform documentation and mfd for removing individual devices.
>
> (d) drivers/visorbus
> This bus is limited to UUID/GUID based naming scheme and very
> specific to s-Par standard and vendor. Additionally its guest drivers
> are living in staging for more than year. So it doesn't appear the
> right direction.
>
> (e) creating subdev as child objects of devlink device (such as port,
> registers, health, etc). In this mode, a given devlink device has
> multiport child device which is anchored using 'struct device' and
> life cycled through devlink. Only difference with current proposal is
> it doesn't follow standard driver model to bind to other driver. It
> also doesn't show in unified way using devlink dev show.
>
> So instead of these alternatives, devlink device that matches PF, VF,
> sub-device, + subdev bus seems better design. This follows all
> standard constructs of 1. Devlink, 2. Linux driver model. It is not
> limited to ports and generic enough for networking and not networking
> devices.
> > The things key thing for me on the netdev side is what is the
> > forwarding model to this new entity. Is this basically VMDQ?
> > Should we just go ahead and mandate "switchdev mode" here?
> >
> It will follow the switchdev mode, but it not limited to it.
> Switchdev mode is for the eswitch functionality. There isn't a need
> to combine this. rdma Infiniband will be able to use this without
> switchdev mode.

It's the devlink instance that's in "switchdev mode", regardless of
type of any of its ports.

> > Thanks for working on a common architecture and suffering through
> > people's reviews rather than adding a debugfs interface that does
> > this like a different vendor did :)
> Oh yes, lets not do debugfs.
> Thanks a lot Jakub for the review.
> This common architecture should be able to address such common needs.
> Please let me know if this needs more refinement, if I missed
> something.
>
> [1] https://www.kernel.org/doc/Documentation/driver-model/platform.txt
>


2019-03-05 01:46:58

by Jakub Kicinski

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On Mon, 4 Mar 2019 04:41:01 +0000, Parav Pandit wrote:
> > > $ devlink dev show
> > > pci/0000:05:00.0
> > > subdev/subdev0
> >
> > Please don't spawn devlink instances. Devlink instance is supposed to
> > represent an ASIC. If we start spawning them willy nilly for whatever
> > software construct we want to model the clarity of the ontology will suffer a
> > lot.
> Devlink devices not restricted to ASIC even though today it is
> representing ASIC for one vendor. Today for one ASIC, it already
> presents multiple devlink devices (128 or more) for PF and VFs, two
> PFs on same ASIC etc. VF is just a sub-device which is well defined
> by PCISIG, whereas sub-device is not. Sub-device do consume actual
> ASIC resources (just like PFs and VFs), Hence point-(6) of
> cover-letter indicate that the devlink capability to tell how many
> such sub-devices can be created.
>
> In above example, they are created for a given bus-device following
> existing devlink construct.

No, it's not "representing the ASIC for one vendor". It's how it works
for switches (including mlxsw) and how it was described in the original
cover letter:

Introduce devlink interface and first drivers to use it

There a is need for some userspace API that would allow to expose things
that are not directly related to any device class like net_device of
ib_device, but rather chip-wide/switch-ASIC-wide stuff.

[...]

We can deviate from the original intent if need be and dilute the
ontology. But let's be clear on the status quo, please.

2019-03-05 07:14:14

by Greg Kroah-Hartman

Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to subdev devices

On Fri, Mar 01, 2019 at 05:21:13PM +0000, Parav Pandit wrote:
>
>
> > -----Original Message-----
> > From: Greg KH <[email protected]>
> > Sent: Friday, March 1, 2019 1:22 AM
> > To: Parav Pandit <[email protected]>
> > Cc: [email protected]; [email protected];
> > [email protected]; [email protected]; Jiri Pirko
> > <[email protected]>
> > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> > subdev devices
> >
> > On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> > > Add a subdev driver to probe the subdev devices and create fake
> > > netdevice for it.
> >
> > So I'm guessing here is the "meat" of the whole goal here?
> >
> > You just want multiple netdevices per PCI device? Why can't you do that
> > today in your PCI driver?
> >
> Yes, but it just not multiple netdevices.
> Let me please elaborate in detail.
>
> There is a swichdev mode of a PCI function for netdevices.
> In this mode a given netdev has additional control netdev (called representor netdevice = rep-ndev).
> This rep-ndev is attached to OVS for adding rules, offloads etc using standard tc, netfilter infra.
> Currently this rep-ndev controls switch side of the settings, but not the host side of netdev.
> So there is discussion to create another netdev or devlink port..
>
> Additionally this subdev has optional rdma device too.
>
> And when we are in switchdev mode, this rdma dev has similar rdma rep device for control.
>
> In some cases we actually don't create netdev when it is in InfiniBand mode.
> Here there is PCI device->rdma_device.
>
> In other case, a given sub device for rdma is dual port device, having netdevice for each that can use existing netdev->dev_port.
>
> Creating 4 devices of two different classes using one iproute2/ip or iproute2/rdma command is horrible thing to do.

Why is that?

> In case if this sub device has to be a passthrough device, ip link command will fail badly that day, because we are creating some sub device which is not even a netdevice.

But it is a network device, right?

> So iproute2/devlink which works on bus+device, mainly PCI today, seems right abstraction point to create sub devices.
> This also extends to map ports of the device, health, registers debug, etc rich infrastructure that is already built.
>
> Additionally, we don't want mlx driver and other drivers to go through its child devices (split logic in netdev and rdma) for power management.

And how is power management going to work with your new devices? All
you have here is a tiny shim around a driver bus, I do not see any new
functionality, and as others have said, no way to actually share, or
split up, the PCI resources.

> Kernel core code does that well today, that we like to leverage through subdev bus or mfd pm callbacks.
>
> So it is lot more than just creating netdevices.

But that's all you are showing here :)

> > What problem are you trying to solve that others also are having that
> > requires all of this?
> >
> > Adding a new bus type and subsystem is fine, but usually we want more
> > than just one user of it, as this does not really show how it is exercised very
> > well.
> This subdev and devlink infrastructure solves this problem of creating smaller sub devices out of one PCI device.
> Someone has to start.. :-)

That is what a mfd should allow you to do.

> To my knowledge, currently Netronome, Broadcom and Mellanox are actively using this devlink and switchdev infra today.

Where are they "using it"? This patchset does not show that.

> > Ideally 3 users would be there as that is when it proves itself that it is
> > flexible enough.
> >
>
> We were looking at drivers/visorbus if we can repurpose it, but GUID device naming scheme is just not user friendly.

You can always change the naming scheme if needed. But why isn't a GUID
ok? It's very easy to reserve properly, and you do not need a central
naming "authority".

> > Would just using the mfd subsystem work better for you? That provides
> > core support for "multi-function" drivers/devices already. What is missing
> > from that subsystem that does not work for you here?
> >
> We were not aware of mfd until now. I looked at very high level now. It's a wrapper to platform devices and seems widely use.
> Before subdev proposal, Jason suggested an alternative is to create platform devices and driver attach to it.
>
> When I read kernel documentation [1], it says "platform devices typically appear as autonomous entities"
> Here instead of autonomy, it is in user's control.
> Platform devices probably don't disappear a lot in live system as opposed to subdevices which are created and removed dynamically a lot often.
>
> Not sure if platform device is abuse for this purpose or not.

No, do not abuse a platform device. You should be able to just use a
normal PCI device for this just fine, and if not, we should be able to
make the needed changes to mfd for that.

thanks,

greg k-h

2019-03-05 17:41:06

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> -----Original Message-----
> From: Jakub Kicinski <[email protected]>
> Sent: Monday, March 4, 2019 7:46 PM
> To: Parav Pandit <[email protected]>
> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
> On Mon, 4 Mar 2019 04:41:01 +0000, Parav Pandit wrote:
> > > > $ devlink dev show
> > > > pci/0000:05:00.0
> > > > subdev/subdev0
> > >
> > > Please don't spawn devlink instances. Devlink instance is supposed
> > > to represent an ASIC. If we start spawning them willy nilly for
> > > whatever software construct we want to model the clarity of the
> > > ontology will suffer a lot.
> > Devlink devices not restricted to ASIC even though today it is
> > representing ASIC for one vendor. Today for one ASIC, it already
> > presents multiple devlink devices (128 or more) for PF and VFs, two
> > PFs on same ASIC etc. VF is just a sub-device which is well defined by
> > PCISIG, whereas sub-device is not. Sub-device do consume actual ASIC
> > resources (just like PFs and VFs), Hence point-(6) of cover-letter
> > indicate that the devlink capability to tell how many such sub-devices
> > can be created.
> >
> > In above example, they are created for a given bus-device following
> > existing devlink construct.
>
> No, it's not "representing the ASIC for one vendor". It's how it works for
> switches (including mlxsw) and how it was described in the original cover
> letter:
>
Sorry for the confusion.
I meant to say that, as I understand it, Netronome creates one devlink instance for the whole ASIC.
Please correct me if this is incorrect.
The mlx5_core driver creates multiple devlink devices, for the PF and VFs, for one ASIC.

> Introduce devlink interface and first drivers to use it
>
> There a is need for some userspace API that would allow to expose things
> that are not directly related to any device class like net_device of
> ib_device, but rather chip-wide/switch-ASIC-wide stuff.
>
> [...]
>
> We can deviate from the original intent if need be and dilute the ontology.
> But let's be clear on the status quo, please.
The status quo is that the mlx5_core driver creates multiple devlink devices. It creates a devlink device for each PF and VF of a single ASIC.

2019-03-05 18:54:06

by Parav Pandit

Subject: RE: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to subdev devices



> -----Original Message-----
> From: Greg KH <[email protected]>
> Sent: Tuesday, March 5, 2019 1:14 AM
> To: Parav Pandit <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Jiri Pirko
> <[email protected]>; Jakub Kicinski <[email protected]>
> Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> subdev devices
>
> On Fri, Mar 01, 2019 at 05:21:13PM +0000, Parav Pandit wrote:
> >
> >
> > > -----Original Message-----
> > > From: Greg KH <[email protected]>
> > > Sent: Friday, March 1, 2019 1:22 AM
> > > To: Parav Pandit <[email protected]>
> > > Cc: [email protected]; [email protected];
> > > [email protected]; [email protected]; Jiri Pirko
> > > <[email protected]>
> > > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind
> > > to subdev devices
> > >
> > > On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> > > > Add a subdev driver to probe the subdev devices and create fake
> > > > netdevice for it.
> > >
> > > So I'm guessing here is the "meat" of the whole goal here?
> > >
> > > You just want multiple netdevices per PCI device? Why can't you do
> > > that today in your PCI driver?
> > >
> > Yes, but it just not multiple netdevices.
> > Let me please elaborate in detail.
> >
> > There is a swichdev mode of a PCI function for netdevices.
> > In this mode a given netdev has additional control netdev (called
> representor netdevice = rep-ndev).
> > This rep-ndev is attached to OVS for adding rules, offloads etc using
> standard tc, netfilter infra.
> > Currently this rep-ndev controls switch side of the settings, but not the
> host side of netdev.
> > So there is discussion to create another netdev or devlink port..
> >
> > Additionally this subdev has optional rdma device too.
> >
> > And when we are in switchdev mode, this rdma dev has similar rdma rep
> device for control.
> >
> > In some cases we actually don't create netdev when it is in InfiniBand
> mode.
> > Here there is PCI device->rdma_device.
> >
> > In other case, a given sub device for rdma is dual port device, having
> netdevice for each that can use existing netdev->dev_port.
> >
> > Creating 4 devices of two different classes using one iproute2/ip or
> iproute2/rdma command is horrible thing to do.
>
> Why is that?
>
When a user creates a device, the user tool needs to return the handle of the device that was created.
Creating multiple devices with one command doesn't make sense. I haven't seen any tool do such a crazy thing.

> > In case if this sub device has to be a passthrough device, ip link command
> will fail badly that day, because we are creating some sub device which is not
> even a netdevice.
>
> But it is a network device, right?
>
When there is a passthrough subdevice, no netdevice will be created.
We don't want to create a passthrough subdevice using the iproute2/ip tool, which primarily works on netdevices.

> > So iproute2/devlink which works on bus+device, mainly PCI today, seems
> right abstraction point to create sub devices.
> > This also extends to map ports of the device, health, registers debug, etc
> rich infrastructure that is already built.
> >
> > Additionally, we don't want mlx driver and other drivers to go through its
> child devices (split logic in netdev and rdma) for power management.
>
> And how is power management going to work with your new devices? All
> you have here is a tiny shim around a driver bus,
Subdevices' power management is done before their parent's.
The vendor driver doesn't need to iterate over its child devices to suspend/resume them.

> I do not see any new
> functionality, and as others have said, no way to actually share, or split up,
> the PCI resources.
>
The devlink tool's create command will be able to accept more parameters at device creation time to share and split PCI resources.
This is just the start of the development, and the RFC is meant to agree on the direction.
The devlink tool has parameter options that can be queried/set, and the existing infra will be used for granular device config.

> > Kernel core code does that well today, that we like to leverage through
> subdev bus or mfd pm callbacks.
> >
> > So it is lot more than just creating netdevices.
>
> But that's all you are showing here :)
>
The starting use case is netdev and rdma, but we don't want to create new tools a few months or a year later for passthrough mode, different link layers, etc.

> > > What problem are you trying to solve that others also are having
> > > that requires all of this?
> > >
> > > Adding a new bus type and subsystem is fine, but usually we want
> > > more than just one user of it, as this does not really show how it
> > > is exercised very well.
> > This subdev and devlink infrastructure solves this problem of creating
> smaller sub devices out of one PCI device.
> > Someone has to start.. :-)
>
> That is what a mfd should allow you to do.
>
I took a cursory look at mfd.
It lacks a way to remove specific devices, but that is a small gap. It can be enhanced to remove a specific mfd device.

> > To my knowledge, currently Netronome, Broadcom and Mellanox are
> actively using this devlink and switchdev infra today.
>
> Where are they "using it"? This patchset does not show that.
>
devlink and switchdev mode for SR-IOV are common among these vendors and more.
The code is in:
drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c
drivers/net/ethernet/netronome/nfp/nfp_net_main.c
drivers/net/ethernet/mellanox/mlx5/core/main.c

This patchset covers only mlx5, but other vendors who also intend to create subdevices will be able to reuse it.
This RFC doesn't cover other vendors.
Jakub and the netdev list are in CC. We are discussing this with Jakub in this patchset thread.

> > > Ideally 3 users would be there as that is when it proves itself that
> > > it is flexible enough.
> > >
> >
> > We were looking at drivers/visorbus if we can repurpose it, but GUID
> device naming scheme is just not user friendly.
>
> You can always change the naming scheme if needed. But why isn't a GUID
> ok?
I think it was ok.
A vendor/device id scheme seems more user friendly and is under the kernel's control; it also fits with the existing modpost tools.
A GUID can be used instead of vendor and device ids.
However, visorbus is tied to ACPI, and its device life cycle, handled under a workqueue, is very different.
It is also meant for one vendor's s-Par devices.
Its guest drivers have been in staging without a clear roadmap for more than a year now.
So we do not want to depend on it. mfd or a dedicated bus seems a better fit.
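
To make the vendor/device id idea concrete, here is a minimal sketch in Python (a userspace model, not kernel code) of how a bus could match a device to a driver through an id table. The names SubdevId, Driver, find_driver and the driver names are hypothetical and do not come from the patch series. 0x15b3 is the Mellanox PCI vendor id, reused here only for illustration; the device ids are made up.

```python
from typing import NamedTuple, Optional, Sequence


class SubdevId(NamedTuple):
    """A (vendor, device) pair, in the spirit of a PCI id table entry."""
    vendor: int
    device: int


class Driver(NamedTuple):
    name: str
    id_table: Sequence[SubdevId]


def find_driver(drivers: Sequence[Driver], dev_id: SubdevId) -> Optional[str]:
    """Return the name of the first driver whose id table lists dev_id."""
    for drv in drivers:
        if dev_id in drv.id_table:
            return drv.name
    return None


# Hypothetical drivers and device ids, for illustration only.
DRIVERS = [
    Driver("vendor_subdev_netdev", (SubdevId(0x15B3, 0x0001),)),
    Driver("generic_passthrough", (SubdevId(0x15B3, 0x0002),)),
]
```

Under this model, choosing a different driver for a subdevice amounts to the device carrying a different id, which is one way to read the "ability to bind different driver" point made elsewhere in the thread.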

> It's very easy to reserve properly, and you do not need a central naming
> "authority".
>
> > > Would just using the mfd subsystem work better for you? That
> > > provides core support for "multi-function" drivers/devices already.
> > > What is missing from that subsystem that does not work for you here?
> > >
> > We were not aware of mfd until now. I looked at very high level now. It's a
> wrapper to platform devices and seems widely use.
> > Before subdev proposal, Jason suggested an alternative is to create
> platform devices and driver attach to it.
> >
> > When I read kernel documentation [1], it says "platform devices typically
> appear as autonomous entities"
> > Here instead of autonomy, it is in user's control.
> > Platform devices probably don't disappear a lot in live system as opposed
> to subdevices which are created and removed dynamically a lot often.
> >
> > Not sure if platform device is abuse for this purpose or not.
>
> No, do not abuse a platform device.
Yes, that is my point: mfd devices are platform devices.
mfd creates platform devices, and to match them, platform_driver_register() has to be called to bind a driver to them.
I do not currently know whether we have the flexibility to say that, instead of binding driver X, driver Y should be bound for platform devices.

> You should be able to just use a normal
> PCI device for this just fine, and if not, we should be able to make the
> needed changes to mfd for that.
>
Ok, so a parent PCI device and mfd devices.
mfd seems to fit this use case.
Do you think the 'Platform devices' section in [1] is stale on the autonomy, host bridge, SoC platform etc. points?
Should we update the documentation to indicate that it can be used for non-autonomous, user-created devices, and for creating devices on top of a PCI parent device, etc.?

[1] https://www.kernel.org/doc/Documentation/driver-model/platform.txt

2019-03-05 19:33:07

by Greg Kroah-Hartman

Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to subdev devices

On Tue, Mar 05, 2019 at 05:57:58PM +0000, Parav Pandit wrote:
>
>
> > -----Original Message-----
> > From: Greg KH <[email protected]>
> > Sent: Tuesday, March 5, 2019 1:14 AM
> > To: Parav Pandit <[email protected]>
> > Cc: [email protected]; [email protected];
> > [email protected]; [email protected]; Jiri Pirko
> > <[email protected]>; Jakub Kicinski <[email protected]>
> > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> > subdev devices
> >
> > On Fri, Mar 01, 2019 at 05:21:13PM +0000, Parav Pandit wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Greg KH <[email protected]>
> > > > Sent: Friday, March 1, 2019 1:22 AM
> > > > To: Parav Pandit <[email protected]>
> > > > Cc: [email protected]; [email protected];
> > > > [email protected]; [email protected]; Jiri Pirko
> > > > <[email protected]>
> > > > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind
> > > > to subdev devices
> > > >
> > > > On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> > > > > Add a subdev driver to probe the subdev devices and create fake
> > > > > netdevice for it.
> > > >
> > > > So I'm guessing here is the "meat" of the whole goal here?
> > > >
> > > > You just want multiple netdevices per PCI device? Why can't you do
> > > > that today in your PCI driver?
> > > >
> > > Yes, but it just not multiple netdevices.
> > > Let me please elaborate in detail.
> > >
> > > There is a swichdev mode of a PCI function for netdevices.
> > > In this mode a given netdev has additional control netdev (called
> > representor netdevice = rep-ndev).
> > > This rep-ndev is attached to OVS for adding rules, offloads etc using
> > standard tc, netfilter infra.
> > > Currently this rep-ndev controls switch side of the settings, but not the
> > host side of netdev.
> > > So there is discussion to create another netdev or devlink port..
> > >
> > > Additionally this subdev has optional rdma device too.
> > >
> > > And when we are in switchdev mode, this rdma dev has similar rdma rep
> > device for control.
> > >
> > > In some cases we actually don't create netdev when it is in InfiniBand
> > mode.
> > > Here there is PCI device->rdma_device.
> > >
> > > In other case, a given sub device for rdma is dual port device, having
> > netdevice for each that can use existing netdev->dev_port.
> > >
> > > Creating 4 devices of two different classes using one iproute2/ip or
> > iproute2/rdma command is horrible thing to do.
> >
> > Why is that?
> >
> When user creates the device, user tool needs to return a device handle that got created.
> Creating multiple devices doesn't make sense. I haven't seen any tool doing such crazy thing.

And what do you mean by "device handle"? All you get here is a sysfs
device tree.

> > > In case if this sub device has to be a passthrough device, ip link command
> > will fail badly that day, because we are creating some sub device which is not
> > even a netdevice.
> >
> > But it is a network device, right?
> >
> When there is passthrough subdevice, there won't be netdevice created.
> We don't want to create passthrough subdevice using iproute2/ip tool which primarily works on netdevices.

I don't know enough networking to claim anything here, so I'll ignore
this :)

> > > So iproute2/devlink which works on bus+device, mainly PCI today, seems
> > right abstraction point to create sub devices.
> > > This also extends to map ports of the device, health, registers debug, etc
> > rich infrastructure that is already built.
> > >
> > > Additionally, we don't want mlx driver and other drivers to go through its
> > child devices (split logic in netdev and rdma) for power management.
> >
> > And how is power management going to work with your new devices? All
> > you have here is a tiny shim around a driver bus,
> So subdevices power management is done before their parent's.
> Vendor driver doesn't need to iterate its child devices to suspend/resume it.

True, so we can just autosuspend these "children" device and the "vendor
driver" is not going to care? You are going to care as you are talking
to the same PCI device. This goes to the other question about "how are
you sharing PCI device resources?"

> > I do not see any new
> > functionality, and as others have said, no way to actually share, or split up,
> > the PCI resources.
> >
> devlink tool create command will be able to accept more parameters during device creation time to share and split PCI resources.
> This is just the start of the development and RFC is to agree on direction.
> devlink tool has parameters options that can be queried/set and existing infra will be used for granular device config.

Pointers to this beast?

> > > Kernel core code does that well today, that we like to leverage through
> > subdev bus or mfd pm callbacks.
> > >
> > > So it is lot more than just creating netdevices.
> >
> > But that's all you are showing here :)
> >
> Starting use case is netdev and rdma, but we don't want to create new
> tools few months/a year later for passthrough mode or for different
> link layers etc.

And I don't want to see duplicated driver model code happen either,
which is why I point out the MFD layer :)

> > > > What problem are you trying to solve that others also are having
> > > > that requires all of this?
> > > >
> > > > Adding a new bus type and subsystem is fine, but usually we want
> > > > more than just one user of it, as this does not really show how it
> > > > is exercised very well.
> > > This subdev and devlink infrastructure solves this problem of creating
> > smaller sub devices out of one PCI device.
> > > Someone has to start.. :-)
> >
> > That is what a mfd should allow you to do.
> >
> I did cursory look at mfd.
> It lacks removing specific devices, but that is small. It can be
> enhanced to remove specific mfd device.

That should be easy enough, work with the MFD developers. I think
something like that should work today as you can use USB devices with
MFD, right?

> >
> > No, do not abuse a platform device.
> Yes. that is my point mfd devices are platform devices.
> mfd creates platform devices. and to match to it, platfrom_register_driver() have to be called to bind to it.
> I do not know currently if we have the flexibility to say that instead of binding X driver, bind Y driver for platform devices.

try it :)

> > You should be able to just use a normal
> > PCI device for this just fine, and if not, we should be able to make the
> > needed changes to mfd for that.
> >
> Ok. so parent pci device and mfd devices.
> mfd seems to fit this use case.
> Do you think 'Platform devices' section is stale in [1] for autonomy, host bridge, soc platform etc points?

Nope, they are still horrible things and I hate them :)

Maybe we should just make MFD create "virtual" devices (bare ones, no
need for platform stuff), and that would solve the issue of the platform
device bloat being drug around everywhere.

> Should we update the documentation to indicate that it can be used for
> non-autonomous, user created devices and it can be used for creating
> devices on top of PCI parent device etc?

Nope, leave it alone please.

thanks,

greg k-h

2019-03-05 19:51:59

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> -----Original Message-----
> From: Jakub Kicinski <[email protected]>
> Sent: Monday, March 4, 2019 7:35 PM
> To: Parav Pandit <[email protected]>
> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
> Parav, please wrap your responses to at most 80 characters.
> This is hard to read.
>
Sorry about that. I will wrap from now on.

> On Mon, 4 Mar 2019 04:41:01 +0000, Parav Pandit wrote:
> > > -----Original Message-----
> > > From: Jakub Kicinski <[email protected]>
> > > Sent: Friday, March 1, 2019 2:04 PM
> > > To: Parav Pandit <[email protected]>; Or Gerlitz
> > > <[email protected]>
> > > Cc: [email protected]; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; Jiri Pirko <[email protected]>
> > > Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
> > > extension
> > >
> > > On Thu, 28 Feb 2019 23:37:44 -0600, Parav Pandit wrote:
> > > > Requirements for above use cases:
> > > > --------------------------------
> > > > 1. We need a generic user interface & core APIs to create sub
> > > > devices from a parent pci device but should be generic enough for
> > > > other parent devices 2. Interface should be vendor agnostic 3.
> > > > User should be able to set device params at creation time 4. In
> > > > future if needed, tool should be able to create passthrough device
> > > > to map to a virtual machine
> > >
> > > Like a mediated device?
> >
> > Yes.
> >
> > > https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
> > > https://www.dpdk.org/wp-content/uploads/sites/35/2018/06/Mediated-
> > > Devices-Better-Userland-IO.pdf
> > >
> > > Other than pass-through it is entirely unclear to me why you'd need a
> bus.
> > > (Or should I say VM pass through or DPDK?) Could you clarify why
> > > the need for a bus?
> > >
> > A bus follow standard linux kernel device driver model to attach a
> > driver to specific device. Platform device with my limited
> > understanding looks a hack/abuse of it based on documentation [1], but
> > it can possibly be an alternative to bus if it looks fine to Greg and
> > others.
>
> I grok from this text that the main advantage you see is the ability to choose
> a driver for the subdevice.
>
Yes.

> > > My thinking is that we should allow spawning subports in devlink and
> > > if user specifies "passthrough" the device spawned would be an mdev.
> >
> > devlink device is much more comprehensive way to create sub-devices
> > than sub-ports for at least below reasons.
> >
> > 1. devlink device already defines device->port relation which enables
> > to create multiport device.
>
> I presume that by devlink device you mean devlink instance? Yes, this part
> I'm following.
>
Yes -> 'struct devlink'
> > subport breaks that.
>
> Breaks what? The ability to create a devlink instance with multiple ports?
>
Right.

> > 2. With bus model, it enables us to load driver of same vendor or
> > generic one such a vfio in future.
>
> Yes, sorry, I'm not an expert on mdevs, but isn't that the goal of those?
> Could you go into more detail why not just use mdevs?
>
I am a novice at the mdev level too, whether mdev or vfio mdev.
Currently we bind to the same vendor driver by default, but when a device is created as a passthrough device, the vendor driver won't create a netdevice or rdma device for it.
vfio/mdev, or whatever mature driver is available, would bind at that point.

> > 3. Devices live on the bus, mapping a subport to 'struct device' is
> > not intuitive.
>
> Are you saying that the main devlink instance would not have any port
> information for the subdevices?
>
Right, this newly created devlink device is the control point of its port(s).

> Devices live on a bus. Software constructs - depend on how one wants to
> model them - don't have to.
>
> > 4. sub-device allows to use existing devlink port, registers, health
> > infrastructure to sub devices, which otherwise need to be duplicated
> > for ports.
>
> Health stuff is not tied to a port, I'm not following you. You can create a
> reporter per port, per ACL rule or per SB or per whatever your heart desires..
>
Instead of creating multiple reporters and inventing reporter naming schemes,
creating a devlink instance leverages all the health reporting done for a devlink instance.
So whatever is done for instance A (parent) can be available for instance B (subdev).

> > 5. Even though current devlink devices are networking devices, there
> > is nothing restricts it to be that way. So subport is a restricted
> > view.
> > 6. devlink device already covers
> > port sub-object, hence creating devlink device is desired.
> >
> > > > 5. A device can have multiple ports
> > >
> > > What does this mean, in practice? You want to spawn a subdev which
> > > can access both ports? That'd be for RDMA use cases, more than
> > > Ethernet, right? (Just clarifying :))
> > >
> > Yep, you got it right. :-)
> >
> > > > So how is it done?
> > > > ------------------
> > > > (a) user in control
> > > > To address above requirements, a generic tool iproute2/devlink is
> > > > extended for sub device's life cycle.
> > > > However a devlink tool and its kernel counter part is not
> > > > sufficient to create protocol agnostic devices on a existing PCI
> > > > bus.
> > >
> > > "Protocol agnostic"?... What does that mean?
> > >
> > Devlink works on bus,device model. It doesn't matter what class of
> > device is. For example, for pci class can be anything. So newly
> > created sub-devices are not limited to netdev/rdma devices. Its
> > agnostic to protocol. More importantly, we don't want to create these
> > sub-devices who bus type is 'pci'. Because as described below, PCI has
> > its addressing scheme and pci bus must not have mix-n match devices.
> >
> > So probably better wording should be,
> > 'a devlink tool and its kernel counterpart is not sufficient to create
> > sub-devices of same class as that of PCI device.
>
> Let me clarify - for networking devices the partition will most likely end up as
> a subport, but its not a requirement that each partition must be a subport..
> The question was about the necessity to invent a new bus, and have every
> resource have a struct device..
>

A device object and a bus connect all the software objects correctly. This includes:
1. devlink bus/name handle based access
2. matching such device in sysfs
3. parent child hierarchy in sysfs
4. ability to bind different driver
5. multi-ports per device
6. still usable for single port use case
7. parameters setting at devlink instance level
8. parent-child relation handling power mgmt
9. follows standard linux driver model

Some of these are achievable through mfd too, instead of a subdev bus.
Will follow Greg's guidance on this.

> > > > (b) subdev bus
> > > > A given bus defines well defined addressing scheme. Creating sub
> > > > devices on existing PCI bus with a different naming scheme is just
> > > > weird. So, creating well named devices on appropriate bus is
> > > > desired.
> > >
> > > What's that address scheme you're referring to, you seem to assign
> > > IDs in sequence?
> > >
> > Yes. a device on subdev bus follows standard linux driver model based
> > id assignment scheme = u32. And devices are well named as 'subdev0'.
> > Prefix + id as the default scheme of core driver model.
>
> I thought "well defined addressing scheme" means I can address subdevice X
> of device Y with your scheme. I can't, it's just an global ID. Thanks for
> clarifying.
>
It's a global ID on the subdev bus.
Subdevices X are listed under parent device Y.

We did consider embedding the parent PCI address in the child, but it is duplicate info that doesn't seem worth it.

devlink will show its parent device link, like
$devlink dev show
pci/0000:05:00.0
subdev/subdev0 parent pci/0000:05:00.0
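
As an illustration of how an orchestration tool might consume this listing, here is a minimal Python sketch that parses output in the format shown above and groups subdevices under their parent. The "parent" keyword follows the example output proposed in this RFC; the function names parse_dev_show and children_of are hypothetical, and nothing here is part of iproute2.

```python
from typing import Dict, List, Optional


def parse_dev_show(output: str) -> Dict[str, Optional[str]]:
    """Map each device handle to its parent handle (None if no parent)."""
    parents: Dict[str, Optional[str]] = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        handle = fields[0]
        parent = None
        # Assumed line format: "<handle> parent <parent-handle>"
        if len(fields) >= 3 and fields[1] == "parent":
            parent = fields[2]
        parents[handle] = parent
    return parents


def children_of(parents: Dict[str, Optional[str]], parent: str) -> List[str]:
    """List the handles whose parent is the given handle."""
    return [h for h, p in parents.items() if p == parent]


SAMPLE = """\
pci/0000:05:00.0
subdev/subdev0 parent pci/0000:05:00.0
"""
```

Because the subdev id is global on the bus, the parent link in the listing is the only place where the X-of-Y relation is visible to such a tool.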

> > > The things key thing for me on the netdev side is what is the
> > > forwarding model to this new entity. Is this basically VMDQ?
> > > Should we just go ahead and mandate "switchdev mode" here?
> > >
> > It will follow the switchdev mode, but it not limited to it.
> > Switchdev mode is for the eswitch functionality. There isn't a need to
> > combine this. rdma Infiniband will be able to use this without
> > switchdev mode.
>
> It's the devlink instance that's in "switchdev mode", regardless of type of any
> of its ports.
>
I didn't follow your comment.
What I wanted to say is:
when $devlink dev add pci/0000:05:00.0 is done,
the devlink instance pci/0000:05:00.0 doesn't have to be in switchdev mode.
We do not plan to support switchdev, but it is not devlink's domain to enforce it.

switchdev mode has nothing to do with sriov, even though it might have started with that vision.


2019-03-05 21:44:43

by Parav Pandit

Subject: RE: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to subdev devices



> -----Original Message-----
> From: Greg KH <[email protected]>
> Sent: Tuesday, March 5, 2019 1:27 PM
> To: Parav Pandit <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Jiri Pirko
> <[email protected]>; Jakub Kicinski <[email protected]>
> Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> subdev devices
>
> On Tue, Mar 05, 2019 at 05:57:58PM +0000, Parav Pandit wrote:
> >
> >
> > > -----Original Message-----
> > > From: Greg KH <[email protected]>
> > > Sent: Tuesday, March 5, 2019 1:14 AM
> > > To: Parav Pandit <[email protected]>
> > > Cc: [email protected]; [email protected];
> > > [email protected]; [email protected]; Jiri Pirko
> > > <[email protected]>; Jakub Kicinski <[email protected]>
> > > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind
> > > to subdev devices
> > >
> > > On Fri, Mar 01, 2019 at 05:21:13PM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Greg KH <[email protected]>
> > > > > Sent: Friday, March 1, 2019 1:22 AM
> > > > > To: Parav Pandit <[email protected]>
> > > > > Cc: [email protected]; [email protected];
> > > > > [email protected]; [email protected]; Jiri Pirko
> > > > > <[email protected]>
> > > > > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to
> > > > > bind to subdev devices
> > > > >
> > > > > On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> > > > > > Add a subdev driver to probe the subdev devices and create
> > > > > > fake netdevice for it.
> > > > >
> > > > > So I'm guessing here is the "meat" of the whole goal here?
> > > > >
> > > > > You just want multiple netdevices per PCI device? Why can't you
> > > > > do that today in your PCI driver?
> > > > >
> > > > Yes, but it just not multiple netdevices.
> > > > Let me please elaborate in detail.
> > > >
> > > > There is a swichdev mode of a PCI function for netdevices.
> > > > In this mode a given netdev has additional control netdev (called
> > > representor netdevice = rep-ndev).
> > > > This rep-ndev is attached to OVS for adding rules, offloads etc
> > > > using
> > > standard tc, netfilter infra.
> > > > Currently this rep-ndev controls switch side of the settings, but
> > > > not the
> > > host side of netdev.
> > > > So there is discussion to create another netdev or devlink port..
> > > >
> > > > Additionally this subdev has optional rdma device too.
> > > >
> > > > And when we are in switchdev mode, this rdma dev has similar rdma
> > > > rep
> > > device for control.
> > > >
> > > > In some cases we actually don't create netdev when it is in
> > > > InfiniBand
> > > mode.
> > > > Here there is PCI device->rdma_device.
> > > >
> > > > In other case, a given sub device for rdma is dual port device,
> > > > having
> > > netdevice for each that can use existing netdev->dev_port.
> > > >
> > > > Creating 4 devices of two different classes using one iproute2/ip
> > > > or
> > > iproute2/rdma command is horrible thing to do.
> > >
> > > Why is that?
> > >
> > When user creates the device, user tool needs to return a device handle
> that got created.
> > Creating multiple devices doesn't make sense. I haven't seen any tool
> doing such crazy thing.
>
> And what do you mean by "device handle"? All you get here is a sysfs device
> tree.
>
Subdev devices are created using the devlink tool, which works on a device handle.
A device handle is defined by the bus/device name of a 'struct device'.
It is described in [1].
$ devlink dev add DEV creates a new devlink device instance and the 'struct device' that holds it.
This command returns the device handle, i.e. the new devlink instance's bus/name.
Patch 6 in the series returns the device handle.
Patch 6 is at [2], with an example in which the sysfs name and the devlink name match each other.
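
A device handle in this sense is just a bus name and a device name joined by a slash. The following Python sketch shows one way to model and parse such handles; DevHandle and parse_handle are hypothetical names used only for this illustration.

```python
from typing import NamedTuple


class DevHandle(NamedTuple):
    """A devlink-style handle: the bus name plus the device name on that bus."""
    bus: str
    name: str

    def __str__(self) -> str:
        return f"{self.bus}/{self.name}"


def parse_handle(text: str) -> DevHandle:
    """Split "bus/name" at the first slash; reject malformed input."""
    bus, sep, name = text.partition("/")
    if not sep or not bus or not name:
        raise ValueError(f"not a bus/name handle: {text!r}")
    return DevHandle(bus, name)
```

For example, parse_handle("subdev/subdev0") yields bus "subdev" and name "subdev0", and converting the result back with str() gives the original string.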

> > > > In case if this sub device has to be a passthrough device, ip link
> > > > command
> > > will fail badly that day, because we are creating some sub device
> > > which is not even a netdevice.
> > >
> > > But it is a network device, right?
> > >
> > When there is passthrough subdevice, there won't be netdevice created.
> > We don't want to create passthrough subdevice using iproute2/ip tool
> which primarily works on netdevices.
>
> I don't know enough networking to claim anything here, so I'll ignore this :)
>
> > > > So iproute2/devlink which works on bus+device, mainly PCI today,
> > > > seems
> > > right abstraction point to create sub devices.
> > > > This also extends to map ports of the device, health, registers
> > > > debug, etc
> > > rich infrastructure that is already built.
> > > >
> > > > Additionally, we don't want mlx driver and other drivers to go
> > > > through its
> > > child devices (split logic in netdev and rdma) for power management.
> > >
> > > And how is power management going to work with your new devices?
> > > All you have here is a tiny shim around a driver bus,
> > So subdevices power management is done before their parent's.
> > Vendor driver doesn't need to iterate its child devices to suspend/resume
> it.
>
> True, so we can just autosuspend these "children" device and the "vendor
> driver" is not going to care? You are going to care as you are talking to the
> same PCI device.
Oh, the vendor driver certainly cares.
The subdev vendor driver implements driver->pm callbacks that work on just a specific subdev.
Patch 2 in the series, at [3], implements the shim layer by connecting the core pm layer to the driver pm callbacks.
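
The ordering argument can be modeled in a few lines. This Python sketch is a simplified userspace model, not the kernel's PM core: it walks a parent/child device tree so that every child is suspended before its parent and resumed after it. The Device class and both function names are hypothetical, and the order among siblings is an arbitrary choice of this sketch.

```python
from typing import List, Optional


class Device:
    """A node in a parent/child device tree."""

    def __init__(self, name: str, parent: Optional["Device"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["Device"] = []
        if parent is not None:
            parent.children.append(self)


def suspend_order(root: Device) -> List[str]:
    """Post-order walk: all children are suspended before their parent."""
    order: List[str] = []
    for child in root.children:
        order.extend(suspend_order(child))
    order.append(root.name)
    return order


def resume_order(root: Device) -> List[str]:
    """Resume is the reverse of suspend: the parent comes back first."""
    return list(reversed(suspend_order(root)))
```

With the subdevices registered as children of the PCI function, the parent driver never has to walk its own children; the tree walk does it, and each subdev driver only handles its own device in its pm callbacks.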

> This goes to the other question about "how are you
> sharing PCI device resources?"
>
Currently it is an equal distribution among all subdevices.
When an actual user asks for a specific resource reservation etc., we
will add those parameters using the existing devlink infra [4].

> > > I do not see any new
> > > functionality, and as others have said, no way to actually share, or
> > > split up, the PCI resources.
> > >
> > devlink tool create command will be able to accept more parameters
> during device creation time to share and split PCI resources.
> > This is just the start of the development and RFC is to agree on direction.
> > devlink tool has parameters options that can be queried/set and existing
> infra will be used for granular device config.
>
> Pointers to this beast?
>
[1] and [4].

> > > > Kernel core code does that well today, that we like to leverage
> > > > through
> > > subdev bus or mfd pm callbacks.
> > > >
> > > > So it is lot more than just creating netdevices.
> > >
> > > But that's all you are showing here :)
> > >
> > Starting use case is netdev and rdma, but we don't want to create new
> > tools few months/a year later for passthrough mode or for different
> > link layers etc.
>
> And I don't want to see duplicated driver model code happen either, which
> is why I point out the MFD layer :)
>
Yes. Sure.

> > > > > What problem are you trying to solve that others also are having
> > > > > that requires all of this?
> > > > >
> > > > > Adding a new bus type and subsystem is fine, but usually we want
> > > > > more than just one user of it, as this does not really show how
> > > > > it is exercised very well.
> > > > This subdev and devlink infrastructure solves this problem of
> > > > creating
> > > smaller sub devices out of one PCI device.
> > > > Someone has to start.. :-)
> > >
> > > That is what a mfd should allow you to do.
> > >
> > I did cursory look at mfd.
> > It lacks removing specific devices, but that is small. It can be
> > enhanced to remove specific mfd device.
>
> That should be easy enough, work with the MFD developers. I think
> something like that should work today as you can use USB devices with MFD,
> right?
>
> > >
> > > No, do not abuse a platform device.
> > Yes. that is my point mfd devices are platform devices.
> > mfd creates platform devices. and to match to it, platfrom_register_driver()
> have to be called to bind to it.
> > I do not know currently if we have the flexibility to say that instead of
> binding X driver, bind Y driver for platform devices.
>
> try it :)
>
> > > You should be able to just use a normal PCI device for this just
> > > fine, and if not, we should be able to make the needed changes to
> > > mfd for that.
> > >
> > Ok. so parent pci device and mfd devices.
> > mfd seems to fit this use case.
> > Do you think 'Platform devices' section is stale in [1] for autonomy, host
> bridge, soc platform etc points?
>
> Nope, they are still horrible things and I hate them :)
>
> Maybe we should just make MFD create "virtual" devices (bare ones, no
> need for platform stuff), and that would solve the issue of the platform
> device bloat being drug around everywhere.
>
If you mean virtual MFD devices in /sys/devices/virtual/, then it becomes
difficult to manage their life cycle using devlink, because a devlink
handle = bus+device.
devlink will fail to work. Inventing a new tool and making it work with
devlink wouldn't work either.

virtual device has bus=NULL.
mfd device currently has bus_type=platform.

We still need to link the subdevice to its parent PCI device for power
management to work, right?
And also to see the right device hierarchy.
Don't you think the subdev bus is actually able to link all the pieces
together: devlink, sysfs, core kernel, vendor drivers?

> > Should we update the documentation to indicate that it can be used for
> > non-autonomous, user created devices and it can be used for creating
> > devices on top of PCI parent device etc?
>
> Nope, leave it alone please.
>
> thanks,
>
> greg k-h

[1] http://man7.org/linux/man-pages/man8/devlink-dev.8.html
[2] https://lore.kernel.org/patchwork/patch/1046995/
[3] https://lore.kernel.org/patchwork/patch/1046996/
[4] https://lore.kernel.org/patchwork/patch/959280/



2019-03-05 22:40:33

by Kirti Wankhede

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension



On 3/6/2019 1:16 AM, Parav Pandit wrote:
>
>
>> -----Original Message-----
>> From: Jakub Kicinski <[email protected]>
>> Sent: Monday, March 4, 2019 7:35 PM
>> To: Parav Pandit <[email protected]>
>> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> [email protected]; Jiri Pirko <[email protected]>
>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>>
>> Parav, please wrap your responses to at most 80 characters.
>> This is hard to read.
>>
> Sorry about it. I will wrap now on.
>
>> On Mon, 4 Mar 2019 04:41:01 +0000, Parav Pandit wrote:
>>>> -----Original Message-----
>>>> From: Jakub Kicinski <[email protected]>
>>>> Sent: Friday, March 1, 2019 2:04 PM
>>>> To: Parav Pandit <[email protected]>; Or Gerlitz
>>>> <[email protected]>
>>>> Cc: [email protected]; [email protected];
>>>> [email protected]; [email protected];
>>>> [email protected]; Jiri Pirko <[email protected]>
>>>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
>>>> extension
>>>>
>>>> On Thu, 28 Feb 2019 23:37:44 -0600, Parav Pandit wrote:
>>>>> Requirements for above use cases:
>>>>> --------------------------------
>>>>> 1. We need a generic user interface & core APIs to create sub
>>>>> devices from a parent pci device but should be generic enough for
>>>>> other parent devices 2. Interface should be vendor agnostic 3.
>>>>> User should be able to set device params at creation time 4. In
>>>>> future if needed, tool should be able to create passthrough device
>>>>> to map to a virtual machine
>>>>
>>>> Like a mediated device?
>>>
>>> Yes.
>>>
>>>> https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
>>>> https://www.dpdk.org/wp-content/uploads/sites/35/2018/06/Mediated-
>>>> Devices-Better-Userland-IO.pdf
>>>>
>>>> Other than pass-through it is entirely unclear to me why you'd need a
>> bus.
>>>> (Or should I say VM pass through or DPDK?) Could you clarify why
>>>> the need for a bus?
>>>>
>>> A bus follow standard linux kernel device driver model to attach a
>>> driver to specific device. Platform device with my limited
>>> understanding looks a hack/abuse of it based on documentation [1], but
>>> it can possibly be an alternative to bus if it looks fine to Greg and
>>> others.
>>
>> I grok from this text that the main advantage you see is the ability to choose
>> a driver for the subdevice.
>>
> Yes.
>
>>>> My thinking is that we should allow spawning subports in devlink and
>>>> if user specifies "passthrough" the device spawned would be an mdev.
>>>
>>> devlink device is much more comprehensive way to create sub-devices
>>> than sub-ports for at least below reasons.
>>>
>>> 1. devlink device already defines device->port relation which enables
>>> to create multiport device.
>>
>> I presume that by devlink device you mean devlink instance? Yes, this part
>> I'm following.
>>
> Yes -> 'struct devlink'
>>> subport breaks that.
>>
>> Breaks what? The ability to create a devlink instance with multiple ports?
>>
> Right.
>
>>> 2. With bus model, it enables us to load driver of same vendor or
>>> generic one such a vfio in future.
>>

You can achieve this with mdev as well.

>> Yes, sorry, I'm not an expert on mdevs, but isn't that the goal of those?
>> Could you go into more detail why not just use mdevs?
>>
> I am novice at mdev level too. mdev or vfio mdev.
> Currently by default we bind to same vendor driver, but when it was created as passthrough device, vendor driver won't create netdevice or rdma device for it.
> And vfio/mdev or whatever mature available driver would bind at that point.
>

Using the mdev framework, if you want to partition a physical device into
multiple logical devices, you can bind those devices to the same vendor
driver through vfio-mdev, whereas if you want to pass through the device,
bind it to vfio-pci. If I understand correctly, that is what you are
looking for.


>>> 3. Devices live on the bus, mapping a subport to 'struct device' is
>>> not intuitive.
>>
>> Are you saying that the main devlink instance would not have any port
>> information for the subdevices?
>>
> Right, this newly created devlink device is the control point of its port(s).
>
>> Devices live on a bus. Software constructs - depend on how one wants to
>> model them - don't have to.
>>
>>> 4. sub-device allows to use existing devlink port, registers, health
>>> infrastructure to sub devices, which otherwise need to be duplicated
>>> for ports.
>>
>> Health stuff is not tied to a port, I'm not following you. You can create a
>> reporter per port, per ACL rule or per SB or per whatever your heart desires..
>>
> Instead of creating multiple reporters and inventing these reporter naming schemes,
> creating devlink instance leverage all health reporting done for a devliink instance.
> So whatever is done for instance A (parent), can be available for instance B (subdev).
>
>>> 5. Even though current devlink devices are networking devices, there
>>> is nothing restricts it to be that way. So subport is a restricted
>>> view.
>>> 6. devlink device already covers
>>> port sub-object, hence creating devlink device is desired.
>>>
>>>>> 5. A device can have multiple ports
>>>>
>>>> What does this mean, in practice? You want to spawn a subdev which
>>>> can access both ports? That'd be for RDMA use cases, more than
>>>> Ethernet, right? (Just clarifying :))
>>>>
>>> Yep, you got it right. :-)
>>>
>>>>> So how is it done?
>>>>> ------------------
>>>>> (a) user in control
>>>>> To address above requirements, a generic tool iproute2/devlink is
>>>>> extended for sub device's life cycle.
>>>>> However a devlink tool and its kernel counter part is not
>>>>> sufficient to create protocol agnostic devices on a existing PCI
>>>>> bus.
>>>>
>>>> "Protocol agnostic"?... What does that mean?
>>>>
>>> Devlink works on bus,device model. It doesn't matter what class of
>>> device is. For example, for pci class can be anything. So newly
>>> created sub-devices are not limited to netdev/rdma devices. Its
>>> agnostic to protocol. More importantly, we don't want to create these
>>> sub-devices who bus type is 'pci'. Because as described below, PCI has
>>> its addressing scheme and pci bus must not have mix-n match devices.
>>>
>>> So probably better wording should be,
>>> 'a devlink tool and its kernel counterpart is not sufficient to create
>>> sub-devices of same class as that of PCI device.
>>
>> Let me clarify - for networking devices the partition will most likely end up as
>> a subport, but its not a requirement that each partition must be a subport..
>> The question was about the necessity to invent a new bus, and have every
>> resource have a struct device..
>>
>
> A device object and bus connecting all software objects correctly. This includes,
> 1. devlink bus/name handle based access
> 2. matching such device in sysfs
> 3. parent child hierarchy in sysfs
> 4. ability to bind different driver
> 5. multi-ports per device
> 6. still usable for single port use case
> 7. parameters setting at devlink instance level
> 8. parent-child relation handling power mgmt
> 9. follows standard linux driver model
>
> Some are achievable to through mfd too, instead of subdev bus.
> Will follow Greg's guidance on this.
>

I think you can achieve all the above points with mdev framework as
well. Check samples at samples/vfio-mdev/ in kernel for quick
understanding.

Thanks,
Kirti

2019-03-05 23:45:04

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension

Hi Kirti,

> -----Original Message-----
> From: Kirti Wankhede <[email protected]>
> Sent: Tuesday, March 5, 2019 4:40 PM
> To: Parav Pandit <[email protected]>; Jakub Kicinski
> <[email protected]>
> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
>
>
> >> On Mon, 4 Mar 2019 04:41:01 +0000, Parav Pandit wrote:
> >>>> -----Original Message-----
> >>>> From: Jakub Kicinski <[email protected]>
> >>>> Sent: Friday, March 1, 2019 2:04 PM
> >>>> To: Parav Pandit <[email protected]>; Or Gerlitz
> >>>> <[email protected]>
> >>>> Cc: [email protected]; [email protected];
> >>>> [email protected]; [email protected];
> >>>> [email protected]; Jiri Pirko <[email protected]>
> >>>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
> >>>> extension
> >>>>
> >>>> On Thu, 28 Feb 2019 23:37:44 -0600, Parav Pandit wrote:
> >>>>> Requirements for above use cases:
> >>>>> --------------------------------
> >>>>> 1. We need a generic user interface & core APIs to create sub
> >>>>> devices from a parent pci device but should be generic enough for
> >>>>> other parent devices 2. Interface should be vendor agnostic 3.
> >>>>> User should be able to set device params at creation time 4. In
> >>>>> future if needed, tool should be able to create passthrough device
> >>>>> to map to a virtual machine
> >>>>
> >>>> Like a mediated device?
> >>>
> >>> Yes.
> >>>
> >>>> https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
> >>>> https://www.dpdk.org/wp-
> content/uploads/sites/35/2018/06/Mediated-
> >>>> Devices-Better-Userland-IO.pdf
> >>>>
> >>>> Other than pass-through it is entirely unclear to me why you'd need
> >>>> a
> >> bus.
> >>>> (Or should I say VM pass through or DPDK?) Could you clarify why
> >>>> the need for a bus?
> >>>>
> >>> A bus follow standard linux kernel device driver model to attach a
> >>> driver to specific device. Platform device with my limited
> >>> understanding looks a hack/abuse of it based on documentation [1],
> >>> but it can possibly be an alternative to bus if it looks fine to
> >>> Greg and others.
> >>
> >> I grok from this text that the main advantage you see is the ability
> >> to choose a driver for the subdevice.
> >>
> > Yes.
> >
> >>>> My thinking is that we should allow spawning subports in devlink
> >>>> and if user specifies "passthrough" the device spawned would be an
> mdev.
> >>>
> >>> devlink device is much more comprehensive way to create sub-devices
> >>> than sub-ports for at least below reasons.
> >>>
> >>> 1. devlink device already defines device->port relation which
> >>> enables to create multiport device.
> >>
> >> I presume that by devlink device you mean devlink instance? Yes,
> >> this part I'm following.
> >>
> > Yes -> 'struct devlink'
> >>> subport breaks that.
> >>
> >> Breaks what? The ability to create a devlink instance with multiple ports?
> >>
> > Right.
> >
> >>> 2. With bus model, it enables us to load driver of same vendor or
> >>> generic one such a vfio in future.
> >>
>
> You can achieve this with mdev as well.
>
> >> Yes, sorry, I'm not an expert on mdevs, but isn't that the goal of those?
> >> Could you go into more detail why not just use mdevs?
> >>
> > I am novice at mdev level too. mdev or vfio mdev.
> > Currently by default we bind to same vendor driver, but when it was
> created as passthrough device, vendor driver won't create netdevice or rdma
> device for it.
> > And vfio/mdev or whatever mature available driver would bind at that
> point.
> >
>
> Using mdev framework, if you want to partition a physical device into
> multiple logic devices, you can bind those devices to same vendor driver
> through vfio-mdev, where as if you want to passthrough the device bind it to
> vfio-pci. If I understand correctly, that is what you are looking for.
>
>
We cannot bind the whole PCI device to vfio-pci. The reason is that
a given PCI device already has protocol devices on it, such as netdevs
and an rdma device.
The device is partitioned while those protocol devices exist and the
mlx5_core and mlx5_ib drivers are loaded on it.
We also need to connect these objects correctly to the eswitch exposed
by the devlink interface (net/core/devlink.c), which supports eswitch
binding, health, registers, parameters and ports.
It also supports existing PCI VFs.

I don’t think we want to replicate all of this again in the mdev subsystem [1].

[1] https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt

So using the devlink interface to migrate users from managing VFs to
non-VF sub devices is a natural progression.

However, in the future, I believe we would be creating mediated devices
on user request, to use mdev modules and map them to a VM.

Also, 'mdev_bus' is created as a class and not as a bus. This prevents
use of the devlink interface, whose handle is bus+device name.

So one option is to change mdev from a class to a bus.
devlink will create mdevs on the bus, and the mdev driver can probe
these devices on the host system by default.
If told to do passthrough, a different driver exposes them to the VM.
How feasible is this?
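
The binding idea can be sketched as a toy model (this is not kernel
code, and every name in it is hypothetical): devices on a bus get the
default host driver unless the user asked for another one, such as a
passthrough driver.

```python
class Bus:
    """Toy model of a bus that matches each device to one driver."""

    def __init__(self, name):
        self.name = name
        self.drivers = {}  # driver name -> probe callable

    def register_driver(self, name, probe):
        self.drivers[name] = probe

    def bind(self, device, default_driver, override=None):
        # An explicit override wins; otherwise the default driver probes.
        chosen = override if override is not None else default_driver
        if chosen not in self.drivers:
            raise LookupError(f"no driver {chosen!r} on bus {self.name!r}")
        return chosen, self.drivers[chosen](device)


def host_probe(device):
    # Hypothetical vendor driver: creates host-side protocol devices.
    return ["netdev", "rdma"]


def passthrough_probe(device):
    # Hypothetical passthrough driver: no host protocol devices.
    return []
```

With both drivers registered, bind("dev0", "vendor") yields the host
netdev/rdma devices, while bind("dev0", "vendor", override="passthrough")
yields none, which is the behaviour described above.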

> >>> 3. Devices live on the bus, mapping a subport to 'struct device' is
> >>> not intuitive.
> >>
> >> Are you saying that the main devlink instance would not have any port
> >> information for the subdevices?
> >>
> > Right, this newly created devlink device is the control point of its port(s).
> >
> >> Devices live on a bus. Software constructs - depend on how one wants
> >> to model them - don't have to.
> >>
> >>> 4. sub-device allows to use existing devlink port, registers, health
> >>> infrastructure to sub devices, which otherwise need to be duplicated
> >>> for ports.
> >>
> >> Health stuff is not tied to a port, I'm not following you. You can
> >> create a reporter per port, per ACL rule or per SB or per whatever your
> heart desires..
> >>
> > Instead of creating multiple reporters and inventing these reporter
> > naming schemes, creating devlink instance leverage all health reporting
> done for a devliink instance.
> > So whatever is done for instance A (parent), can be available for instance B
> (subdev).
> >
> >>> 5. Even though current devlink devices are networking devices, there
> >>> is nothing restricts it to be that way. So subport is a restricted
> >>> view.
> >>> 6. devlink device already covers
> >>> port sub-object, hence creating devlink device is desired.
> >>>
> >>>>> 5. A device can have multiple ports
> >>>>
> >>>> What does this mean, in practice? You want to spawn a subdev which
> >>>> can access both ports? That'd be for RDMA use cases, more than
> >>>> Ethernet, right? (Just clarifying :))
> >>>>
> >>> Yep, you got it right. :-)
> >>>
> >>>>> So how is it done?
> >>>>> ------------------
> >>>>> (a) user in control
> >>>>> To address above requirements, a generic tool iproute2/devlink is
> >>>>> extended for sub device's life cycle.
> >>>>> However a devlink tool and its kernel counter part is not
> >>>>> sufficient to create protocol agnostic devices on a existing PCI
> >>>>> bus.
> >>>>
> >>>> "Protocol agnostic"?... What does that mean?
> >>>>
> >>> Devlink works on bus,device model. It doesn't matter what class of
> >>> device is. For example, for pci class can be anything. So newly
> >>> created sub-devices are not limited to netdev/rdma devices. Its
> >>> agnostic to protocol. More importantly, we don't want to create
> >>> these sub-devices who bus type is 'pci'. Because as described below,
> >>> PCI has its addressing scheme and pci bus must not have mix-n match
> devices.
> >>>
> >>> So probably better wording should be, 'a devlink tool and its kernel
> >>> counterpart is not sufficient to create sub-devices of same class as
> >>> that of PCI device.
> >>
> >> Let me clarify - for networking devices the partition will most
> >> likely end up as a subport, but its not a requirement that each partition
> must be a subport..
> >> The question was about the necessity to invent a new bus, and have
> >> every resource have a struct device..
> >>
> >
> > A device object and bus connecting all software objects correctly.
> > This includes, 1. devlink bus/name handle based access 2. matching
> > such device in sysfs 3. parent child hierarchy in sysfs 4. ability to
> > bind different driver 5. multi-ports per device 6. still usable for
> > single port use case 7. parameters setting at devlink instance level
> > 8. parent-child relation handling power mgmt 9. follows standard linux
> > driver model
> >
> > Some are achievable to through mfd too, instead of subdev bus.
> > Will follow Greg's guidance on this.
> >
>
> I think you can achieve all the above points with mdev framework as well.
> Check samples at samples/vfio-mdev/ in kernel for quick understanding.
>
> Thanks,
> Kirti

2019-03-05 23:49:38

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> -----Original Message-----
> From: [email protected] <linux-kernel-
> [email protected]> On Behalf Of Parav Pandit
> Sent: Tuesday, March 5, 2019 5:17 PM
> To: Kirti Wankhede <[email protected]>; Jakub Kicinski
> <[email protected]>
> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>
> Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
> Hi Kirti,
>
> > -----Original Message-----
> > From: Kirti Wankhede <[email protected]>
> > Sent: Tuesday, March 5, 2019 4:40 PM
> > To: Parav Pandit <[email protected]>; Jakub Kicinski
> > <[email protected]>
> > Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> > [email protected]; [email protected]; [email protected];
> > [email protected]; Jiri Pirko <[email protected]>
> > Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
> > extension
> >
> >
> >
> > > I am novice at mdev level too. mdev or vfio mdev.
> > > Currently by default we bind to same vendor driver, but when it was
> > created as passthrough device, vendor driver won't create netdevice or
> > rdma device for it.
> > > And vfio/mdev or whatever mature available driver would bind at that
> > point.
> > >
> >
> > Using mdev framework, if you want to partition a physical device into
> > multiple logic devices, you can bind those devices to same vendor
> > driver through vfio-mdev, where as if you want to passthrough the
> > device bind it to vfio-pci. If I understand correctly, that is what you are
> looking for.
> >
> >
> We cannot bind a whole PCI device to vfio-pci, reason is, A given PCI device
> has existing protocol devices on it such as netdevs and rdma dev.
> This device is partitioned while those protocol devices exist and mlx5_core,
> mlx5_ib drivers are loaded on it.
> And we also need to connect these objects rightly to eswitch exposed by
> devlink interface (net/core/devlink.c) that supports eswitch binding, health,
> registers, parameters, ports support.
> It also supports existing PCI VFs.
>
> I don’t think we want to replicate all of this again in mdev subsystem [1].
>
> [1] https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
>
> So devlink interface to migrate users from managing VFs to non_VF sub
> device is natural progression.
>
> However, in future, I believe we would be creating mediated devices on user
> request, to use mdev modules and map them to VM.
>
> Also 'mdev_bus' is created as a class and not as a bus. This limits to not use
> devlink interface whose handle is bus+device name.
>
> So one option is to change mdev from class to bus.
> devlink will create mdevs on the bus, mdev driver can probe these devices
> on host system by default.
> And if told to do passthrough, a different driver exposes them to VM.
> How feasible is this?
>
Wait, I do see an mdev bus, and mdevs are created on this bus using
mdev_device_create().
So how about we create mdevs on this bus using devlink, instead of sysfs?
And the driver side on the host gets mdev_register_driver()->probe()?


2019-03-06 00:48:06

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension

Hi Greg, Kirti,

> -----Original Message-----
> From: Parav Pandit
> Sent: Tuesday, March 5, 2019 5:45 PM
> To: Parav Pandit <[email protected]>; Kirti Wankhede
> <[email protected]>; Jakub Kicinski <[email protected]>
> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>
> Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
>
>
> > -----Original Message-----
> > From: [email protected] <linux-kernel-
> > [email protected]> On Behalf Of Parav Pandit
> > Sent: Tuesday, March 5, 2019 5:17 PM
> > To: Kirti Wankhede <[email protected]>; Jakub Kicinski
> > <[email protected]>
> > Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> > [email protected]; [email protected]; [email protected];
> > [email protected]; Jiri Pirko <[email protected]>
> > Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink
> > extension
> >
> > Hi Kirti,
> >
> > > -----Original Message-----
> > > From: Kirti Wankhede <[email protected]>
> > > Sent: Tuesday, March 5, 2019 4:40 PM
> > > To: Parav Pandit <[email protected]>; Jakub Kicinski
> > > <[email protected]>
> > > Cc: Or Gerlitz <[email protected]>; [email protected];
> > > linux- [email protected]; [email protected];
> > > [email protected]; [email protected]; Jiri Pirko
> > > <[email protected]>
> > > Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
> > > extension
> > >
> > >
> > >
> > > > I am novice at mdev level too. mdev or vfio mdev.
> > > > Currently by default we bind to same vendor driver, but when it
> > > > was
> > > created as passthrough device, vendor driver won't create netdevice
> > > or rdma device for it.
> > > > And vfio/mdev or whatever mature available driver would bind at
> > > > that
> > > point.
> > > >
> > >
> > > Using mdev framework, if you want to partition a physical device
> > > into multiple logic devices, you can bind those devices to same
> > > vendor driver through vfio-mdev, where as if you want to passthrough
> > > the device bind it to vfio-pci. If I understand correctly, that is
> > > what you are
> > looking for.
> > >
> > >
> > We cannot bind a whole PCI device to vfio-pci, reason is, A given PCI
> > device has existing protocol devices on it such as netdevs and rdma dev.
> > This device is partitioned while those protocol devices exist and
> > mlx5_core, mlx5_ib drivers are loaded on it.
> > And we also need to connect these objects rightly to eswitch exposed
> > by devlink interface (net/core/devlink.c) that supports eswitch
> > binding, health, registers, parameters, ports support.
> > It also supports existing PCI VFs.
> >
> > I don’t think we want to replicate all of this again in mdev subsystem [1].
> >
> > [1] https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
> >
> > So devlink interface to migrate users from managing VFs to non_VF sub
> > device is natural progression.
> >
> > However, in future, I believe we would be creating mediated devices on
> > user request, to use mdev modules and map them to VM.
> >
> > Also 'mdev_bus' is created as a class and not as a bus. This limits to
> > not use devlink interface whose handle is bus+device name.
> >
> > So one option is to change mdev from class to bus.
> > devlink will create mdevs on the bus, mdev driver can probe these
> > devices on host system by default.
> > And if told to do passthrough, a different driver exposes them to VM.
> > How feasible is this?
> >
> Wait, I do see a mdev bus and mdevs are created on this bus using
> mdev_device_create().
> So how about we create mdevs on this bus using devlink, instead of sysfs?
> And driver side on host gets the mdev_register_driver()->probe()?
>

Thinking more and reviewing more of the mdev code, I believe mdev fits
this need a lot better than a new subdev bus, mfd, platform device, or
devlink subport.
In the future, mapping this sub device (mdev) to a VM will also be
easier using the mdev bus.

I also believe we can use the sysfs interface for the mdev life cycle
(instead of having the life cycle via devlink).
When an mdev is created, it will register as a devlink instance, and
its parameters can be queried/configured before a driver probes the
device.
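
For reference, the existing mdev sysfs life cycle (as documented in
vfio-mediated-device.txt) is a write of a UUID to a per-type 'create'
file and a write of 1 to the device's 'remove' file. The sketch below
only builds those paths and does not touch sysfs; the helper names are
made up for illustration.

```python
import uuid


def mdev_create_path(parent_sysfs, type_id):
    """Path of the 'create' file for one supported type of a parent."""
    return f"{parent_sysfs}/mdev_supported_types/{type_id}/create"


def mdev_remove_path(mdev_uuid):
    """Path of the 'remove' file of an existing mdev device."""
    return f"/sys/bus/mdev/devices/{mdev_uuid}/remove"


def create_request(parent_sysfs, type_id, mdev_uuid=None):
    """Return (path, uuid string) to write in order to create an mdev."""
    mdev_uuid = str(mdev_uuid or uuid.uuid4())
    uuid.UUID(mdev_uuid)  # raises ValueError if not a valid UUID
    return mdev_create_path(parent_sysfs, type_id), mdev_uuid
```

Writing the returned UUID to the returned path creates the device;
writing 1 to mdev_remove_path(uuid) removes it. Any devlink registration
would hang off the create step and is not shown here.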

A few enhancements would be needed on the mdev side:
1. making the iommu optional
2. configuring mdev device parameters at creation time

More once I get my hands dirty with mdev in RFCv2.

What do you think?

2019-03-06 03:59:04

by Kirti Wankhede

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension



On 3/6/2019 6:14 AM, Parav Pandit wrote:
> Hi Greg, Kirti,
>
>> -----Original Message-----
>> From: Parav Pandit
>> Sent: Tuesday, March 5, 2019 5:45 PM
>> To: Parav Pandit <[email protected]>; Kirti Wankhede
>> <[email protected]>; Jakub Kicinski <[email protected]>
>> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> [email protected]; Jiri Pirko <[email protected]>
>> Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>>
>>
>>
>>> -----Original Message-----
>>> From: [email protected] <linux-kernel-
>>> [email protected]> On Behalf Of Parav Pandit
>>> Sent: Tuesday, March 5, 2019 5:17 PM
>>> To: Kirti Wankhede <[email protected]>; Jakub Kicinski
>>> <[email protected]>
>>> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
>>> [email protected]; [email protected]; [email protected];
>>> [email protected]; Jiri Pirko <[email protected]>
>>> Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink
>>> extension
>>>
>>> Hi Kirti,
>>>
>>>> -----Original Message-----
>>>> From: Kirti Wankhede <[email protected]>
>>>> Sent: Tuesday, March 5, 2019 4:40 PM
>>>> To: Parav Pandit <[email protected]>; Jakub Kicinski
>>>> <[email protected]>
>>>> Cc: Or Gerlitz <[email protected]>; [email protected];
>>>> linux- [email protected]; [email protected];
>>>> [email protected]; [email protected]; Jiri Pirko
>>>> <[email protected]>
>>>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
>>>> extension
>>>>
>>>>
>>>>
>>>>> I am novice at mdev level too. mdev or vfio mdev.
>>>>> Currently by default we bind to same vendor driver, but when it
>>>>> was
>>>> created as passthrough device, vendor driver won't create netdevice
>>>> or rdma device for it.
>>>>> And vfio/mdev or whatever mature available driver would bind at
>>>>> that
>>>> point.
>>>>>
>>>>
>>>> Using mdev framework, if you want to partition a physical device
>>>> into multiple logic devices, you can bind those devices to same
>>>> vendor driver through vfio-mdev, where as if you want to passthrough
>>>> the device bind it to vfio-pci. If I understand correctly, that is
>>>> what you are
>>> looking for.
>>>>
>>>>
>>> We cannot bind a whole PCI device to vfio-pci, reason is, A given PCI
>>> device has existing protocol devices on it such as netdevs and rdma dev.
>>> This device is partitioned while those protocol devices exist and
>>> mlx5_core, mlx5_ib drivers are loaded on it.
>>> And we also need to connect these objects rightly to eswitch exposed
>>> by devlink interface (net/core/devlink.c) that supports eswitch
>>> binding, health, registers, parameters, ports support.
>>> It also supports existing PCI VFs.
>>>
>>> I don’t think we want to replicate all of this again in mdev subsystem [1].
>>>
>>> [1] https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
>>>
>>> So devlink interface to migrate users from managing VFs to non_VF sub
>>> device is natural progression.
>>>
>>> However, in future, I believe we would be creating mediated devices on
>>> user request, to use mdev modules and map them to VM.
>>>
>>> Also 'mdev_bus' is created as a class and not as a bus. This limits to
>>> not use devlink interface whose handle is bus+device name.
>>>
>>> So one option is to change mdev from class to bus.
>>> devlink will create mdevs on the bus, mdev driver can probe these
>>> devices on host system by default.
>>> And if told to do passthrough, a different driver exposes them to VM.
>>> How feasible is this?
>>>
>> Wait, I do see a mdev bus and mdevs are created on this bus using
>> mdev_device_create().
>> So how about we create mdevs on this bus using devlink, instead of sysfs?
>> And driver side on host gets the mdev_register_driver()->probe()?
>>
>
> Thinking more and reviewing more mdev code, I believe mdev fits
> this need a lot better than new subdev bus, mfd, platform device, or devlink subport.
> For coming future, to map this sub device (mdev) to VM will also be easier by using mdev bus.
>

Thanks for taking a close look at the mdev code.

Support for assigning an mdev to a VM is already in place; QEMU and
libvirt can both assign an mdev device to a VM.

> I also believe we can use the sysfs interface for mdev life cycle.
> Here when mdev are created it will register as devlink instance and
> will be able to query/config parameters before driver probe the device.
> (instead of having life cycle via devlink)
>
> Few enhancements would be needed for mdev side.
> 1. making iommu optional.

Currently mdev devices are not IOMMU aware; the vendor driver is
responsible for programming the IOMMU for an mdev device, if required.
The IOMMU-aware mdev device patch set is almost reviewed and ready to be
pulled. This is optional: the vendor driver has to decide whether an mdev
device should be associated with its parent's IOMMU or not. I'm testing
it. I think Alex is on vacation, and this will get pulled when he is
back.
https://lwn.net/Articles/779650/

> 2. configuring mdev device parameters during creation time
>

The mdev framework provides a way to define multiple types for creation
through sysfs. You can define multiple types rather than having
creation-time parameters, and update 'available_instances' accordingly
on creation. Mdev also provides a way to expose vendor-specific
attributes for the parent physical device as well as for a created mdev
device. You can add a sysfs interface to take input parameters for an
mdev device, which the vendor driver can then use when open() is called
on that mdev device.
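
As a rough illustration of the type-based approach described above, here
is a toy model in plain Python (not mdev code). The type names "mlx5-1q"
and "mlx5-4q", the shared queue pool and all class names are assumptions
made up for this sketch; only the idea of recomputing
'available_instances' on creation comes from the text.

```python
# Sketch only: type names, queue counts and classes are hypothetical.
from dataclasses import dataclass, field


@dataclass
class MdevType:
    name: str    # hypothetical type id, e.g. "mlx5-4q"
    queues: int  # parent resources consumed by one instance of this type


@dataclass
class Parent:
    total_queues: int
    types: dict
    devices: dict = field(default_factory=dict)  # uuid -> type name

    def used_queues(self):
        return sum(self.types[t].queues for t in self.devices.values())

    def available_instances(self, type_name):
        # Recomputed from the remaining pool, so creating an instance of
        # one type also shrinks what the other types report.
        free = self.total_queues - self.used_queues()
        return free // self.types[type_name].queues

    def create(self, type_name, uuid):
        if uuid in self.devices:
            raise ValueError("mdev already exists: " + uuid)
        if self.available_instances(type_name) == 0:
            raise RuntimeError("no instances left for " + type_name)
        self.devices[uuid] = type_name


def make_parent():
    # Two predefined types stand in for a "number of queues" creation parameter.
    types = {t.name: t for t in (MdevType("mlx5-1q", 1), MdevType("mlx5-4q", 4))}
    return Parent(total_queues=8, types=types)
```

The point of the sketch is that the choice a creation-time parameter
would express is instead made by picking one of the predefined types.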

Thanks,
Kirti


2019-03-06 06:29:08

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension

Hi Kirti,

> -----Original Message-----
> From: Kirti Wankhede <[email protected]>
> Sent: Tuesday, March 5, 2019 9:51 PM
> To: Parav Pandit <[email protected]>; Jakub Kicinski
> <[email protected]>
> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
>
>
> On 3/6/2019 6:14 AM, Parav Pandit wrote:
> > Hi Greg, Kirti,
> >
> > Thinking more and reviewing more mdev code, I believe mdev fits this
> > need a lot better than new subdev bus, mfd, platform device, or devlink
> subport.
> > For coming future, to map this sub device (mdev) to VM will also be easier
> by using mdev bus.
> >
>
> Thanks for taking close look at mdev code.
>
> Assigning mdev to VM support is already in place, QEMU and libvirt have
> support to assign mdev device to VM.
>
> > I also believe we can use the sysfs interface for mdev life cycle.
> > Here when mdev are created it will register as devlink instance and
> > will be able to query/config parameters before driver probe the device.
> > (instead of having life cycle via devlink)
> >
> > Few enhancements would be needed for mdev side.
> > 1. making iommu optional.
>
> Currently mdev devices are not IOMMU aware, vendor driver is responsible
> for programming IOMMU for mdev device, if required.
> IOMMU aware mdev device patch set is almost reviewed and ready to get
> pulled. This is optional, vendor driver have to decide whether mdev device
> should be associated with its parents IOMMU or not. I'm testing it and I
> think Alex is on vacation and this will get pulled when Alex will be back from
> vacation.
> https://lwn.net/Articles/779650/
>
> > 2. configuring mdev device parameters during creation time
> >
>
> Mdev framework provides a way to define multiple types for creation
> through sysfs. You can define multiple types rather than having creation
> time parameter and on creation accordingly update 'available_instances'.
> Mdev also provides a way to provide vendor-specific-attributes for parent
> physical device as well as for created mdev device. You can add sysfs
> interface to get input parameters for a mdev device which can be used by
> vendor driver when open() on that mdev device is called.
>
> Thanks,
> Kirti

Yes. I have adapted my patches to the mdev way and will be posting RFC v2 soon.
I will wait a day for more comments/views from Greg and others.

As I explained in the cover letter and in this discussion,
the first use case is to create and use mdevs in the host (and not in a VM).
Later on, once mdevs are available, I am sure VM users will likely use them.

So the mlx5_core driver will have two components as a starting point.

1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
This is the mdev device life cycle driver, which calls mdev_register_device() and implements mlx5_mdev_ops.

2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
This is the mdev device driver, which calls mdev_register_driver();
its probe() creates the netdev by heavily reusing the existing code of the PF device.

These drivers will not be placed under drivers/vfio/mdev, because they are not vfio drivers.
This is fine, right?
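
To make the split between the two components concrete, here is a toy
device/driver model in plain Python. It is only a sketch of the idea and
not the kernel API: MdevBus, LifeCycle, NetDriver and the driver name
"mlx5_mdev" are hypothetical names, and register_driver() merely stands
in for what mdev_register_driver() plus probe() would do.

```python
# Sketch only: a toy bus with device/driver matching, not kernel code.
class MdevBus:
    def __init__(self):
        self.devices = {}  # uuid -> name of bound driver, or None
        self.drivers = {}  # driver name -> probe callable

    def register_driver(self, name, probe):
        # Stand-in for mdev_register_driver(): probe devices that are unbound.
        self.drivers[name] = probe
        for uuid, bound in self.devices.items():
            if bound is None and probe(uuid):
                self.devices[uuid] = name

    def add_device(self, uuid):
        # Called by the life cycle component when a user creates an mdev.
        self.devices[uuid] = None
        for name, probe in self.drivers.items():
            if probe(uuid):
                self.devices[uuid] = name
                break


class LifeCycle:
    """Component 1: creates mdevs and places them on the bus."""

    def __init__(self, bus):
        self.bus = bus

    def create(self, uuid):
        self.bus.add_device(uuid)


class NetDriver:
    """Component 2: its probe() creates a netdev for each mdev it binds."""

    def __init__(self):
        self.netdevs = {}  # uuid -> netdev name

    def probe(self, uuid):
        self.netdevs[uuid] = "netdev-" + uuid[:8]
        return True
```

In this model the life cycle side never creates a netdev itself; that
happens only when a driver's probe() runs for the device.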

Given that this is a net driver, we will be submitting the patches
to the netdev mailing list, to go through Dave Miller's net-next tree,
and will CC [email protected], you and others as usual.
Are you OK with merging the code this way, as an mdev device creator plus an mdev driver?

2019-03-07 19:05:31

by Kirti Wankhede

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

CC += Alex

On 3/6/2019 11:12 AM, Parav Pandit wrote:
> Hi Kirti,
>
>> -----Original Message-----
>> From: Kirti Wankhede <[email protected]>
>> Sent: Tuesday, March 5, 2019 9:51 PM
>> To: Parav Pandit <[email protected]>; Jakub Kicinski
>> <[email protected]>
>> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> [email protected]; Jiri Pirko <[email protected]>
>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>>
>>
>>
>> On 3/6/2019 6:14 AM, Parav Pandit wrote:
>>> Hi Greg, Kirti,
>>>
>>> Thinking more and reviewing more mdev code, I believe mdev fits this
>>> need a lot better than new subdev bus, mfd, platform device, or devlink
>> subport.
>>> For coming future, to map this sub device (mdev) to VM will also be easier
>> by using mdev bus.
>>>
>>
>> Thanks for taking close look at mdev code.
>>
>> Assigning mdev to VM support is already in place, QEMU and libvirt have
>> support to assign mdev device to VM.
>>
>>> I also believe we can use the sysfs interface for mdev life cycle.
>>> Here when mdev are created it will register as devlink instance and
>>> will be able to query/config parameters before driver probe the device.
>>> (instead of having life cycle via devlink)
>>>
>>> Few enhancements would be needed for mdev side.
>>> 1. making iommu optional.
>>
>> Currently mdev devices are not IOMMU aware, vendor driver is responsible
>> for programming IOMMU for mdev device, if required.
>> IOMMU aware mdev device patch set is almost reviewed and ready to get
>> pulled. This is optional, vendor driver have to decide whether mdev device
>> should be associated with its parents IOMMU or not. I'm testing it and I
>> think Alex is on vacation and this will get pulled when Alex will be back from
>> vacation.
>> https://lwn.net/Articles/779650/
>>
>>> 2. configuring mdev device parameters during creation time
>>>
>>
>> Mdev framework provides a way to define multiple types for creation
>> through sysfs. You can define multiple types rather than having creation
>> time parameter and on creation accordingly update 'available_instances'.
>> Mdev also provides a way to provide vendor-specific-attributes for parent
>> physical device as well as for created mdev device. You can add sysfs
>> interface to get input parameters for a mdev device which can be used by
>> vendor driver when open() on that mdev device is called.
>>
>> Thanks,
>> Kirti
>
> Yes. I got my patches to adapt to mdev way. Will be posting RFC v2 soon.
> Will wait for a day to receive more comments/views from Greg and others.
>
> As I explained in this cover-letter and discussion,
> First use case is to create and use mdevs in the host (and not in VM).
> Later on, I am sure once we have mdevs available, VM users will likely use it.
>
> So, mlx5_core driver will have two components as starting point.
>
> 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
> This is mdev device life cycle driver which will do, mdev_register_device() and implements mlx5_mdev_ops.
>
OK. I would suggest not using mdev.c as the file name; maybe add the
device name, something like mlx_mdev.c or vfio_mlx.c.

> 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
> This is mdev device driver which does mdev_register_driver()
> and probe() creates netdev by heavily reusing existing code of the PF device.
> These drivers will not be placed under drivers/vfio/mdev, because this is not a vfio driver.
> This is fine, right?
>

I'm not too familiar with netdev, but can you create the netdev in the
open() call on the mlx mdev device? Then you don't have to write an mdev
device driver.


> Given that this is net driver, we will be submitting patches,
> through netdev mailing list through Dave Miller's net-next tree.
> And CC [email protected], you and others as usual.
> Are you ok, merging code this way as mdev device creator and mdev driver.
> Yes?
>

Keep Alex and me in the loop.

Thanks,
Kirti

2019-03-07 20:28:16

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> -----Original Message-----
> From: Kirti Wankhede <[email protected]>
> Sent: Thursday, March 7, 2019 1:04 PM
> To: Parav Pandit <[email protected]>; Jakub Kicinski
> <[email protected]>
> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>; Alex
> Williamson <[email protected]>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
> CC += Alex
>
> On 3/6/2019 11:12 AM, Parav Pandit wrote:
> > Hi Kirti,
> >
> >> -----Original Message-----
> >> From: Kirti Wankhede <[email protected]>
> >> Sent: Tuesday, March 5, 2019 9:51 PM
> >> To: Parav Pandit <[email protected]>; Jakub Kicinski
> >> <[email protected]>
> >> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> >> [email protected]; [email protected];
> [email protected];
> >> [email protected]; Jiri Pirko <[email protected]>
> >> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
> >> extension
> >>
> >>
> >>
> >> On 3/6/2019 6:14 AM, Parav Pandit wrote:
> >>> Hi Greg, Kirti,
> >>>
> >>> Thinking more and reviewing more mdev code, I believe mdev fits this
> >>> need a lot better than new subdev bus, mfd, platform device, or
> >>> devlink
> >> subport.
> >>> For coming future, to map this sub device (mdev) to VM will also be
> >>> easier
> >> by using mdev bus.
> >>>
> >>
> >> Thanks for taking close look at mdev code.
> >>
> >> Assigning mdev to VM support is already in place, QEMU and libvirt
> >> have support to assign mdev device to VM.
> >>
> >>> I also believe we can use the sysfs interface for mdev life cycle.
> >>> Here when mdev are created it will register as devlink instance and
> >>> will be able to query/config parameters before driver probe the device.
> >>> (instead of having life cycle via devlink)
> >>>
> >>> Few enhancements would be needed for mdev side.
> >>> 1. making iommu optional.
> >>
> >> Currently mdev devices are not IOMMU aware, vendor driver is
> >> responsible for programming IOMMU for mdev device, if required.
> >> IOMMU aware mdev device patch set is almost reviewed and ready to get
> >> pulled. This is optional, vendor driver have to decide whether mdev
> >> device should be associated with its parents IOMMU or not. I'm
> >> testing it and I think Alex is on vacation and this will get pulled
> >> when Alex will be back from vacation.
> >> https://lwn.net/Articles/779650/
> >>
> >>> 2. configuring mdev device parameters during creation time
> >>>
> >>
> >> Mdev framework provides a way to define multiple types for creation
> >> through sysfs. You can define multiple types rather than having
> >> creation time parameter and on creation accordingly update
> 'available_instances'.
> >> Mdev also provides a way to provide vendor-specific-attributes for
> >> parent physical device as well as for created mdev device. You can
> >> add sysfs interface to get input parameters for a mdev device which
> >> can be used by vendor driver when open() on that mdev device is called.
> >>
> >> Thanks,
> >> Kirti
> >
> > Yes. I got my patches to adapt to mdev way. Will be posting RFC v2 soon.
> > Will wait for a day to receive more comments/views from Greg and others.
> >
> > As I explained in this cover-letter and discussion, First use case is
> > to create and use mdevs in the host (and not in VM).
> > Later on, I am sure once we have mdevs available, VM users will likely use
> it.
> >
> > So, mlx5_core driver will have two components as starting point.
> >
> > 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
> > This is mdev device life cycle driver which will do, mdev_register_device()
> and implements mlx5_mdev_ops.
> >
> Ok. I would suggest not use mdev.c file name, may be add device name,
> something like mlx_mdev.c or vfio_mlx.c
>
The mlx5/core coding convention does not prefix mlx to its 40+ files.

It uses the actual subsystem or functionality name, such as:
sriov.c
eswitch.c
fw.c
en_tc.c (en for Ethernet)
lag.c
So mdev.c aligns with the rest of the 40+ files.


> > 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
> > This is mdev device driver which does mdev_register_driver() and
> > probe() creates netdev by heavily reusing existing code of the PF device.
> > These drivers will not be placed under drivers/vfio/mdev, because this is
> not a vfio driver.
> > This is fine, right?
> >
>
> I'm not too familiar with netdev, but can you create netdev on open() call on
> mlx mdev device? Then you don't have to write mdev device driver.
>
Who invokes open() and release()?
I believe it is QEMU that would do open(), release(), read/write/mmap?

Assuming that is the case,
I think it's incorrect to create the netdev in open().
When we want to map the mdev to a VM using the above mdev calls, we actually won't be creating a netdev in the host.
Instead, some queues etc. will be set up as part of these calls.

By default this created mdev is bound to vfio_mdev.
Once we unbind the device from this driver, we need to bind it to the mlx5 driver so that the driver can create the netdev etc.

Or did I get open() and friends wrong?
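
For reference, the unbind/bind step mentioned above would go through the
generic driver-core sysfs files under /sys/bus/mdev/drivers/. The sketch
below only builds the list of writes; the target driver name "mlx5_mdev"
is a made-up placeholder, and the helper names are my own.

```python
# Sketch only: "mlx5_mdev" is a hypothetical driver name.
def rebind_writes(uuid, old_driver="vfio_mdev", new_driver="mlx5_mdev"):
    """Return the (path, value) sysfs writes that move one mdev between drivers."""
    base = "/sys/bus/mdev/drivers"
    return [
        (base + "/" + old_driver + "/unbind", uuid),  # detach from the current driver
        (base + "/" + new_driver + "/bind", uuid),    # attach to the new driver
    ]


def apply_writes(writes):
    """Perform the writes; needs root and a kernel that has these drivers."""
    for path, value in writes:
        with open(path, "w") as f:
            f.write(value)
```

Building the writes separately from applying them keeps the sketch
usable on a machine without any mdev devices.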

>
> > Given that this is net driver, we will be submitting patches, through
> > netdev mailing list through Dave Miller's net-next tree.
> > And CC [email protected], you and others as usual.
> > Are you ok, merging code this way as mdev device creator and mdev driver.
> > Yes?
> >
>
> Keep Alex and me in loop.
Sure. Thanks.

2019-03-07 20:54:48

by Kirti Wankhede

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension



<snip>

>>>
>>> Yes. I got my patches to adapt to mdev way. Will be posting RFC v2 soon.
>>> Will wait for a day to receive more comments/views from Greg and others.
>>>
>>> As I explained in this cover-letter and discussion, First use case is
>>> to create and use mdevs in the host (and not in VM).
>>> Later on, I am sure once we have mdevs available, VM users will likely use
>> it.
>>>
>>> So, mlx5_core driver will have two components as starting point.
>>>
>>> 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
>>> This is mdev device life cycle driver which will do, mdev_register_device()
>> and implements mlx5_mdev_ops.
>>>
>> Ok. I would suggest not use mdev.c file name, may be add device name,
>> something like mlx_mdev.c or vfio_mlx.c
>>
> mlx5/core is coding convention is not following to prefix mlx to its 40+ files.
>
> it uses actual subsystem or functionality name, such as,
> sriov.c
> eswitch.c
> fw.c
> en_tc.c (en for Ethernet)
> lag.c
> so,
> mdev.c aligns to rest of the 40+ files.
>
>
>>> 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
>>> This is mdev device driver which does mdev_register_driver() and
>>> probe() creates netdev by heavily reusing existing code of the PF device.
>>> These drivers will not be placed under drivers/vfio/mdev, because this is
>> not a vfio driver.
>>> This is fine, right?
>>>
>>
>> I'm not too familiar with netdev, but can you create netdev on open() call on
>> mlx mdev device? Then you don't have to write mdev device driver.
>>
> Who invokes open() and release()?
> I believe it is the qemu would do open(), release, read/write/mmap?
>
> Assuming that is the case,
> I think its incorrect to create netdev in open.
> Because when we want to map the mdev to VM using above mdev calls, we actually wont be creating netdev in host.
> Instead, some queues etc will be setup as part of these calls.
>
> By default this created mdev is bound to vfio_mdev.
> And once we unbind the device from this driver, we need to bind to mlx5 driver so that driver can create the netdev etc.
>
> Or did I get open() and friends call wrong?
>

In 'struct mdev_parent_ops' there are create() and remove(). When a user
creates an mdev device by writing a UUID to the 'create' sysfs file, the
vendor driver's create() callback gets called. That callback should be
used to allocate/commit resources from the parent device, and the
remove() callback should free those resources. So there is no need to
bind the mlx5 driver to that mdev device.

open/release/read/write/mmap/ioctl are regular file operations for that
mdev device.
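
The create()/remove() flow above can be pictured with a small toy model
in plain Python. This is a sketch under assumptions, not the real
'struct mdev_parent_ops': VendorOps, MdevCore and the per-mdev queue
count are invented for illustration.

```python
# Sketch only: plain Python standing in for the kernel callbacks.
class VendorOps:
    """Stand-in for the vendor driver's create()/remove() callbacks."""

    def __init__(self, total_queues):
        self.free_queues = total_queues
        self.committed = {}  # uuid -> queues taken from the parent

    def create(self, uuid, queues):
        if uuid in self.committed:
            raise ValueError("mdev already exists: " + uuid)
        if queues > self.free_queues:
            raise RuntimeError("parent has too few free queues")
        self.free_queues -= queues
        self.committed[uuid] = queues

    def remove(self, uuid):
        # Give back exactly what create() committed for this mdev.
        self.free_queues += self.committed.pop(uuid)


class MdevCore:
    """Stand-in for the framework side that reacts to sysfs writes."""

    def __init__(self, ops, queues_per_mdev=2):
        self.ops = ops
        self.queues_per_mdev = queues_per_mdev

    def write_create(self, uuid):
        # The user wrote a UUID to the 'create' attribute.
        self.ops.create(uuid, self.queues_per_mdev)

    def write_remove(self, uuid):
        # The user asked for removal through the mdev's 'remove' attribute.
        self.ops.remove(uuid)
```

Resources are tied to the mdev's lifetime here, independent of whether
anything ever calls open() on it.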

Thanks,
Kirti


2019-03-07 21:03:19

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> -----Original Message-----
> From: Kirti Wankhede <[email protected]>
> Sent: Thursday, March 7, 2019 2:54 PM
> To: Parav Pandit <[email protected]>; Jakub Kicinski
> <[email protected]>
> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>; Alex
> Williamson <[email protected]>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
>
>
> <snip>
>
> >>>
> >>> Yes. I got my patches to adapt to mdev way. Will be posting RFC v2 soon.
> >>> Will wait for a day to receive more comments/views from Greg and
> others.
> >>>
> >>> As I explained in this cover-letter and discussion, First use case
> >>> is to create and use mdevs in the host (and not in VM).
> >>> Later on, I am sure once we have mdevs available, VM users will
> >>> likely use
> >> it.
> >>>
> >>> So, mlx5_core driver will have two components as starting point.
> >>>
> >>> 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
> >>> This is mdev device life cycle driver which will do,
> >>> mdev_register_device()
> >> and implements mlx5_mdev_ops.
> >>>
> >> Ok. I would suggest not use mdev.c file name, may be add device name,
> >> something like mlx_mdev.c or vfio_mlx.c
> >>
> > mlx5/core is coding convention is not following to prefix mlx to its 40+
> files.
> >
> > it uses actual subsystem or functionality name, such as, sriov.c
> > eswitch.c fw.c en_tc.c (en for Ethernet) lag.c so, mdev.c aligns to
> > rest of the 40+ files.
> >
> >
> >>> 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
> >>> This is mdev device driver which does mdev_register_driver() and
> >>> probe() creates netdev by heavily reusing existing code of the PF device.
> >>> These drivers will not be placed under drivers/vfio/mdev, because
> >>> this is
> >> not a vfio driver.
> >>> This is fine, right?
> >>>
> >>
> >> I'm not too familiar with netdev, but can you create netdev on open()
> >> call on mlx mdev device? Then you don't have to write mdev device
> driver.
> >>
> > Who invokes open() and release()?
> > I believe it is the qemu would do open(), release, read/write/mmap?
> >
> > Assuming that is the case,
> > I think its incorrect to create netdev in open.
> > Because when we want to map the mdev to VM using above mdev calls, we
> actually wont be creating netdev in host.
> > Instead, some queues etc will be setup as part of these calls.
> >
> > By default this created mdev is bound to vfio_mdev.
> > And once we unbind the device from this driver, we need to bind to mlx5
> driver so that driver can create the netdev etc.
> >
> > Or did I get open() and friends call wrong?
> >
>
> In 'struct mdev_parent_ops' there are create() and remove(). When user
> creates mdev device by writing UUID to create sysfs, vendor driver's
> create() callback gets called. This should be used to allocate/commit
Yes. I am already past that stage.

> resources from parent device and on remove() callback free those resources.
> So there is no need to bind mlx5 driver to that mdev device.
>
If we don't bind the mlx5 driver, the vfio_mdev driver stays bound to it, and that driver won't create a netdev.
Again, we do not want to map this mdev to a VM.
We want to consume it in the host where the mdev is created.
So I detach this mdev from the vfio_mdev driver as usual using
$ echo mdev_name > ../drivers/vfio_mdev/unbind

Followed by binding it to mlx5_core driver.
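
For illustration, the unbind-then-bind step can be sketched as a small C helper that writes the device name to two sysfs-style attribute files. This is a minimal sketch under assumptions: sysfs_write_attr() and rebind_device() are hypothetical helper names, and both paths are passed in by the caller because the thread shows only the relative unbind path, not the bind path for mlx5_core.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Write one value to a sysfs-style attribute file.
 * Returns 0 on success or a negative errno value on failure.
 */
static int sysfs_write_attr(const char *path, const char *value)
{
    FILE *f;
    int err = 0;

    assert(path && value);

    f = fopen(path, "w");
    if (!f)
        return -errno;
    if (fputs(value, f) == EOF)
        err = -EIO;
    if (fclose(f) == EOF && !err)
        err = -errno;
    return err;
}

/*
 * Move a device from one driver to another: write its name to the old
 * driver's "unbind" file, then to the new driver's "bind" file.
 * Both paths are supplied by the caller (hypothetical in this sketch).
 */
static int rebind_device(const char *unbind_path, const char *bind_path,
                         const char *dev_name)
{
    int err = sysfs_write_attr(unbind_path, dev_name);

    if (err)
        return err;             /* do not bind if unbind failed */
    return sysfs_write_attr(bind_path, dev_name);
}
```

With real sysfs files the writes need root, and the kernel rejects a bind if the target driver does not accept the device.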

Below is sample output before binding it to the mlx5_core driver.
When we bind it to the mlx5_core driver, that driver creates the netdev in the host.
If the user wants to map this mdev to a VM, the user won't bind it to the mlx5_core driver. Instead, the user will bind it to the vfio driver, which does the usual open/release/...


lrwxrwxrwx 1 root root 0 Mar 7 14:24 69ea1551-d054-46e9-974d-8edae8f0aefe -> ../../../devices/pci0000:00/0000:00:02.2/0000:05:00.0/69ea1551-d054-46e9-974d-8edae8f0aefe
[root@sw-mtx-036 net-next]# ls -l /sys/bus/mdev/devices/69ea1551-d054-46e9-974d-8edae8f0aefe/
total 0
lrwxrwxrwx 1 root root 0 Mar 7 14:24 driver -> ../../../../../bus/mdev/drivers/vfio_mdev
lrwxrwxrwx 1 root root 0 Mar 7 14:24 iommu_group -> ../../../../../kernel/iommu_groups/0
lrwxrwxrwx 1 root root 0 Mar 7 14:24 mdev_type -> ../mdev_supported_types/mlx5_core-mgmt
drwxr-xr-x 2 root root 0 Mar 7 14:24 power
--w------- 1 root root 4096 Mar 7 14:24 remove
lrwxrwxrwx 1 root root 0 Mar 7 14:24 subsystem -> ../../../../../bus/mdev
-rw-r--r-- 1 root root 4096 Mar 7 14:24 uevent

> open/release/read/write/mmap/ioctl are regular file operations for that
> mdev device.
>

> Thanks,
> Kirti

2019-03-07 21:08:32

by Kirti Wankhede

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension



On 3/8/2019 2:32 AM, Parav Pandit wrote:
>
>
>> -----Original Message-----
>> From: Kirti Wankhede <[email protected]>
>> Sent: Thursday, March 7, 2019 2:54 PM
>> To: Parav Pandit <[email protected]>; Jakub Kicinski
>> <[email protected]>
>> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> [email protected]; Jiri Pirko <[email protected]>; Alex
>> Williamson <[email protected]>
>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>>
>>
>>
>> <snip>
>>
>>>>>
>>>>> Yes. I got my patches to adapt to mdev way. Will be posting RFC v2 soon.
>>>>> Will wait for a day to receive more comments/views from Greg and
>> others.
>>>>>
>>>>> As I explained in this cover-letter and discussion, First use case
>>>>> is to create and use mdevs in the host (and not in VM).
>>>>> Later on, I am sure once we have mdevs available, VM users will
>>>>> likely use
>>>> it.
>>>>>
>>>>> So, mlx5_core driver will have two components as starting point.
>>>>>
>>>>> 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
>>>>> This is mdev device life cycle driver which will do,
>>>>> mdev_register_device()
>>>> and implements mlx5_mdev_ops.
>>>>>
>>>> Ok. I would suggest not use mdev.c file name, may be add device name,
>>>> something like mlx_mdev.c or vfio_mlx.c
>>>>
>>> mlx5/core is coding convention is not following to prefix mlx to its 40+
>> files.
>>>
>>> it uses actual subsystem or functionality name, such as, sriov.c
>>> eswitch.c fw.c en_tc.c (en for Ethernet) lag.c so, mdev.c aligns to
>>> rest of the 40+ files.
>>>
>>>
>>>>> 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
>>>>> This is mdev device driver which does mdev_register_driver() and
>>>>> probe() creates netdev by heavily reusing existing code of the PF device.
>>>>> These drivers will not be placed under drivers/vfio/mdev, because
>>>>> this is
>>>> not a vfio driver.
>>>>> This is fine, right?
>>>>>
>>>>
>>>> I'm not too familiar with netdev, but can you create netdev on open()
>>>> call on mlx mdev device? Then you don't have to write mdev device
>> driver.
>>>>
>>> Who invokes open() and release()?
>>> I believe it is the qemu would do open(), release, read/write/mmap?
>>>
>>> Assuming that is the case,
>>> I think its incorrect to create netdev in open.
>>> Because when we want to map the mdev to VM using above mdev calls, we
>> actually wont be creating netdev in host.
>>> Instead, some queues etc will be setup as part of these calls.
>>>
>>> By default this created mdev is bound to vfio_mdev.
>>> And once we unbind the device from this driver, we need to bind to mlx5
>> driver so that driver can create the netdev etc.
>>>
>>> Or did I get open() and friends call wrong?
>>>
>>
>> In 'struct mdev_parent_ops' there are create() and remove(). When user
>> creates mdev device by writing UUID to create sysfs, vendor driver's
>> create() callback gets called. This should be used to allocate/commit
> Yes. I am already past that stage.
>
>> resources from parent device and on remove() callback free those resources.
>> So there is no need to bind mlx5 driver to that mdev device.
>>
> If we don't bind mlx5 driver, vfio_mdev driver is bound to it. Such driver won't create netdev.

It doesn't need to.

Create the netdev from the create() callback.

Thanks,
Kirti

> Again, we do not want to map this mdev to a VM.
> We want to consume it in the host where mdev is created.
> So I am able to detach this mdev from vfio_mdev driver as usual using
> $ echo mdev_name > ../drivers/vfio_mdev/unbind
>
> Followed by binding it to mlx5_core driver.
>
> Below is sample output before binding it to mlx5_core driver.
> When we bind with mlx5_core driver, that driver creates the netdev in host.
> If user wants to map this mdev to VM, user won't bind to mlx5_core driver. instead he will bind to vfio driver and that does usual open/release/...
>
>
> lrwxrwxrwx 1 root root 0 Mar 7 14:24 69ea1551-d054-46e9-974d-8edae8f0aefe -> ../../../devices/pci0000:00/0000:00:02.2/0000:05:00.0/69ea1551-d054-46e9-974d-8edae8f0aefe
> [root@sw-mtx-036 net-next]# ls -l /sys/bus/mdev/devices/69ea1551-d054-46e9-974d-8edae8f0aefe/
> total 0
> lrwxrwxrwx 1 root root 0 Mar 7 14:24 driver -> ../../../../../bus/mdev/drivers/vfio_mdev
> lrwxrwxrwx 1 root root 0 Mar 7 14:24 iommu_group -> ../../../../../kernel/iommu_groups/0
> lrwxrwxrwx 1 root root 0 Mar 7 14:24 mdev_type -> ../mdev_supported_types/mlx5_core-mgmt
> drwxr-xr-x 2 root root 0 Mar 7 14:24 power
> --w------- 1 root root 4096 Mar 7 14:24 remove
> lrwxrwxrwx 1 root root 0 Mar 7 14:24 subsystem -> ../../../../../bus/mdev
> -rw-r--r-- 1 root root 4096 Mar 7 14:24 uevent
>
>> open/release/read/write/mmap/ioctl are regular file operations for that
>> mdev device.
>>
>
>> Thanks,
>> Kirti
>

2019-03-07 21:22:53

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> -----Original Message-----
> From: Kirti Wankhede <[email protected]>
> Sent: Thursday, March 7, 2019 3:08 PM
> To: Parav Pandit <[email protected]>; Jakub Kicinski
> <[email protected]>
> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>; Alex
> Williamson <[email protected]>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
>
>
> On 3/8/2019 2:32 AM, Parav Pandit wrote:
> >
> >
> >> -----Original Message-----
> >> From: Kirti Wankhede <[email protected]>
> >> Sent: Thursday, March 7, 2019 2:54 PM
> >> To: Parav Pandit <[email protected]>; Jakub Kicinski
> >> <[email protected]>
> >> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> >> [email protected]; [email protected];
> [email protected];
> >> [email protected]; Jiri Pirko <[email protected]>; Alex
> >> Williamson <[email protected]>
> >> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
> >> extension
> >>
> >>
> >>
> >> <snip>
> >>
> >>>>>
> >>>>> Yes. I got my patches to adapt to mdev way. Will be posting RFC v2
> soon.
> >>>>> Will wait for a day to receive more comments/views from Greg and
> >> others.
> >>>>>
> >>>>> As I explained in this cover-letter and discussion, First use case
> >>>>> is to create and use mdevs in the host (and not in VM).
> >>>>> Later on, I am sure once we have mdevs available, VM users will
> >>>>> likely use
> >>>> it.
> >>>>>
> >>>>> So, mlx5_core driver will have two components as starting point.
> >>>>>
> >>>>> 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
> >>>>> This is mdev device life cycle driver which will do,
> >>>>> mdev_register_device()
> >>>> and implements mlx5_mdev_ops.
> >>>>>
> >>>> Ok. I would suggest not use mdev.c file name, may be add device
> >>>> name, something like mlx_mdev.c or vfio_mlx.c
> >>>>
> >>> mlx5/core is coding convention is not following to prefix mlx to its
> >>> 40+
> >> files.
> >>>
> >>> it uses actual subsystem or functionality name, such as, sriov.c
> >>> eswitch.c fw.c en_tc.c (en for Ethernet) lag.c so, mdev.c aligns to
> >>> rest of the 40+ files.
> >>>
> >>>
> >>>>> 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
> >>>>> This is mdev device driver which does mdev_register_driver() and
> >>>>> probe() creates netdev by heavily reusing existing code of the PF
> device.
> >>>>> These drivers will not be placed under drivers/vfio/mdev, because
> >>>>> this is
> >>>> not a vfio driver.
> >>>>> This is fine, right?
> >>>>>
> >>>>
> >>>> I'm not too familiar with netdev, but can you create netdev on
> >>>> open() call on mlx mdev device? Then you don't have to write mdev
> >>>> device
> >> driver.
> >>>>
> >>> Who invokes open() and release()?
> >>> I believe it is the qemu would do open(), release, read/write/mmap?
> >>>
> >>> Assuming that is the case,
> >>> I think its incorrect to create netdev in open.
> >>> Because when we want to map the mdev to VM using above mdev calls,
> >>> we
> >> actually wont be creating netdev in host.
> >>> Instead, some queues etc will be setup as part of these calls.
> >>>
> >>> By default this created mdev is bound to vfio_mdev.
> >>> And once we unbind the device from this driver, we need to bind to
> >>> mlx5
> >> driver so that driver can create the netdev etc.
> >>>
> >>> Or did I get open() and friends call wrong?
> >>>
> >>
> >> In 'struct mdev_parent_ops' there are create() and remove(). When
> >> user creates mdev device by writing UUID to create sysfs, vendor
> >> driver's
> >> create() callback gets called. This should be used to allocate/commit
> > Yes. I am already past that stage.
> >
> >> resources from parent device and on remove() callback free those
> resources.
> >> So there is no need to bind mlx5 driver to that mdev device.
> >>
> > If we don't bind mlx5 driver, vfio_mdev driver is bound to it. Such driver
> won't create netdev.
>
> Doesn't need to.
>
> Create netdev from create() callback.
>
I strongly believe this is an incorrect way to use the create() API,
because an mdev is a mediated device of its primary PCI device. It is not a protocol device.

It is also incorrect to tell the user that the vfio_mdev driver is bound to this mdev while the mlx5_core driver creates a netdev on top of it.

When we want to map this mdev to a VM, what should create() do?
We would have to move the code from create() to mdev_device_driver()->probe() to support selectively mapping an mdev to a VM or to the host, and implement the appropriate open/close etc. functions for the VM case.

So why not start correctly from the beginning?


> Thanks,
> Kirti
>
> > Again, we do not want to map this mdev to a VM.
> > We want to consume it in the host where mdev is created.
> > So I am able to detach this mdev from vfio_mdev driver as usual using
> > $ echo mdev_name > ../drivers/vfio_mdev/unbind
> >
> > Followed by binding it to mlx5_core driver.
> >
> > Below is sample output before binding it to mlx5_core driver.
> > When we bind with mlx5_core driver, that driver creates the netdev in
> host.
> > If user wants to map this mdev to VM, user won't bind to mlx5_core driver.
> instead he will bind to vfio driver and that does usual open/release/...
> >
> >
> > lrwxrwxrwx 1 root root 0 Mar 7 14:24
> > 69ea1551-d054-46e9-974d-8edae8f0aefe ->
> > ../../../devices/pci0000:00/0000:00:02.2/0000:05:00.0/69ea1551-d054-46
> > e9-974d-8edae8f0aefe
> > [root@sw-mtx-036 net-next]# ls -l
> > /sys/bus/mdev/devices/69ea1551-d054-46e9-974d-8edae8f0aefe/
> > total 0
> > lrwxrwxrwx 1 root root 0 Mar 7 14:24 driver ->
> ../../../../../bus/mdev/drivers/vfio_mdev
> > lrwxrwxrwx 1 root root 0 Mar 7 14:24 iommu_group ->
> ../../../../../kernel/iommu_groups/0
> > lrwxrwxrwx 1 root root 0 Mar 7 14:24 mdev_type ->
> ../mdev_supported_types/mlx5_core-mgmt
> > drwxr-xr-x 2 root root 0 Mar 7 14:24 power
> > --w------- 1 root root 4096 Mar 7 14:24 remove
> > lrwxrwxrwx 1 root root 0 Mar 7 14:24 subsystem -> ../../../../../bus/mdev
> > -rw-r--r-- 1 root root 4096 Mar 7 14:24 uevent
> >
> >> open/release/read/write/mmap/ioctl are regular file operations for
> >> that mdev device.
> >>
> >
> >> Thanks,
> >> Kirti
> >

2019-03-07 22:02:46

by Kirti Wankhede

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension



On 3/8/2019 2:51 AM, Parav Pandit wrote:
>
>
>> -----Original Message-----
>> From: Kirti Wankhede <[email protected]>
>> Sent: Thursday, March 7, 2019 3:08 PM
>> To: Parav Pandit <[email protected]>; Jakub Kicinski
>> <[email protected]>
>> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> [email protected]; Jiri Pirko <[email protected]>; Alex
>> Williamson <[email protected]>
>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>>
>>
>>
>> On 3/8/2019 2:32 AM, Parav Pandit wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Kirti Wankhede <[email protected]>
>>>> Sent: Thursday, March 7, 2019 2:54 PM
>>>> To: Parav Pandit <[email protected]>; Jakub Kicinski
>>>> <[email protected]>
>>>> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
>>>> [email protected]; [email protected];
>> [email protected];
>>>> [email protected]; Jiri Pirko <[email protected]>; Alex
>>>> Williamson <[email protected]>
>>>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
>>>> extension
>>>>
>>>>
>>>>
>>>> <snip>
>>>>
>>>>>>>
>>>>>>> Yes. I got my patches to adapt to mdev way. Will be posting RFC v2
>> soon.
>>>>>>> Will wait for a day to receive more comments/views from Greg and
>>>> others.
>>>>>>>
>>>>>>> As I explained in this cover-letter and discussion, First use case
>>>>>>> is to create and use mdevs in the host (and not in VM).
>>>>>>> Later on, I am sure once we have mdevs available, VM users will
>>>>>>> likely use
>>>>>> it.
>>>>>>>
>>>>>>> So, mlx5_core driver will have two components as starting point.
>>>>>>>
>>>>>>> 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
>>>>>>> This is mdev device life cycle driver which will do,
>>>>>>> mdev_register_device()
>>>>>> and implements mlx5_mdev_ops.
>>>>>>>
>>>>>> Ok. I would suggest not use mdev.c file name, may be add device
>>>>>> name, something like mlx_mdev.c or vfio_mlx.c
>>>>>>
>>>>> mlx5/core is coding convention is not following to prefix mlx to its
>>>>> 40+
>>>> files.
>>>>>
>>>>> it uses actual subsystem or functionality name, such as, sriov.c
>>>>> eswitch.c fw.c en_tc.c (en for Ethernet) lag.c so, mdev.c aligns to
>>>>> rest of the 40+ files.
>>>>>
>>>>>
>>>>>>> 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
>>>>>>> This is mdev device driver which does mdev_register_driver() and
>>>>>>> probe() creates netdev by heavily reusing existing code of the PF
>> device.
>>>>>>> These drivers will not be placed under drivers/vfio/mdev, because
>>>>>>> this is
>>>>>> not a vfio driver.
>>>>>>> This is fine, right?
>>>>>>>
>>>>>>
>>>>>> I'm not too familiar with netdev, but can you create netdev on
>>>>>> open() call on mlx mdev device? Then you don't have to write mdev
>>>>>> device
>>>> driver.
>>>>>>
>>>>> Who invokes open() and release()?
>>>>> I believe it is the qemu would do open(), release, read/write/mmap?
>>>>>
>>>>> Assuming that is the case,
>>>>> I think its incorrect to create netdev in open.
>>>>> Because when we want to map the mdev to VM using above mdev calls,
>>>>> we
>>>> actually wont be creating netdev in host.
>>>>> Instead, some queues etc will be setup as part of these calls.
>>>>>
>>>>> By default this created mdev is bound to vfio_mdev.
>>>>> And once we unbind the device from this driver, we need to bind to
>>>>> mlx5
>>>> driver so that driver can create the netdev etc.
>>>>>
>>>>> Or did I get open() and friends call wrong?
>>>>>
>>>>
>>>> In 'struct mdev_parent_ops' there are create() and remove(). When
>>>> user creates mdev device by writing UUID to create sysfs, vendor
>>>> driver's
>>>> create() callback gets called. This should be used to allocate/commit
>>> Yes. I am already past that stage.
>>>
>>>> resources from parent device and on remove() callback free those
>> resources.
>>>> So there is no need to bind mlx5 driver to that mdev device.
>>>>
>>> If we don't bind mlx5 driver, vfio_mdev driver is bound to it. Such driver
>> won't create netdev.
>>
>> Doesn't need to.
>>
>> Create netdev from create() callback.
>>
> I strongly believe this is incorrect way to use create() API.
> Because,
> mdev is mediated device from its primary pci device. It is not a protocol device.
>
> It is also incorrect to tell user that vfio_mdev driver is bound to this mdev and mlx5_core driver creating netdev on top of mdev.
>

vfio_mdev is a generic common driver.
A vendor driver that wants to partition its device should handle the creation
of its children and their life cycle. What is wrong with that? Why does the
netdev have to be created from probe() only and not from create()?

> When we want to map this mdev to VM, what should create() do?

A mediated device should be created before it is mapped to a VM.

If you look at the sequence of mdev device creation:
- 'struct device' is created with bus 'mdev_bus_type'
- register the device - device_register(&mdev->dev) -> which calls
vfio_mdev's probe() -> common code for all vendor drivers
- mdev_device_create_ops() -> calls vendor driver's create() -> this is
for vendor specific allocation and initialization. This is the callback
where you can do whatever you need for mdev device creation and
initialization. Why does it have to be named probe()?
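
The ordering listed above can be written down as a small userspace trace in C. This is only a sketch of the sequence as described in this message: the seq_* names are hypothetical, and the functions merely record which step ran, they do not call any kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Steps of the sequence, in the order this message describes them. */
enum seq_step {
    SEQ_ALLOC_DEVICE,       /* 'struct device' set up on the mdev bus */
    SEQ_DEVICE_REGISTER,    /* device_register(&mdev->dev) */
    SEQ_GENERIC_PROBE,      /* vfio_mdev's probe(), common to all vendors */
    SEQ_VENDOR_CREATE,      /* vendor driver's create() callback */
};

#define SEQ_MAX_STEPS 8

struct seq_trace {
    enum seq_step steps[SEQ_MAX_STEPS];
    int count;
};

static void seq_record(struct seq_trace *t, enum seq_step step)
{
    assert(t != NULL);
    if (t->count < SEQ_MAX_STEPS)
        t->steps[t->count++] = step;
}

/* Registering the device lets the bus invoke the generic driver's probe(). */
static void seq_device_register(struct seq_trace *t)
{
    seq_record(t, SEQ_DEVICE_REGISTER);
    seq_record(t, SEQ_GENERIC_PROBE);
}

/* Vendor-specific allocation and initialization. */
static int seq_vendor_create(struct seq_trace *t)
{
    seq_record(t, SEQ_VENDOR_CREATE);
    return 0;
}

/* The whole creation path, top to bottom. */
static int seq_mdev_device_create(struct seq_trace *t)
{
    seq_record(t, SEQ_ALLOC_DEVICE);
    seq_device_register(t);
    return seq_vendor_create(t);
}
```

In this model the generic probe() always runs before the vendor create(), which is why the question of what create() may assume about the final consumer matters.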


> We will have to shift the code from create() to mdev_device_driver()->probe() to address a use case of selectively mapping a mdev to VM or to host and implement appropriate open/close etc functions for VM case.
>
> So why not start correctly from the beginning?
>

What is wrong with the current implementation, which is being used and tested
for multiple devices?

Thanks,
Kirti

>
>> Thanks,
>> Kirti
>>
>>> Again, we do not want to map this mdev to a VM.
>>> We want to consume it in the host where mdev is created.
>>> So I am able to detach this mdev from vfio_mdev driver as usual using
>>> $ echo mdev_name > ../drivers/vfio_mdev/unbind
>>>
>>> Followed by binding it to mlx5_core driver.
>>>
>>> Below is sample output before binding it to mlx5_core driver.
>>> When we bind with mlx5_core driver, that driver creates the netdev in
>> host.
>>> If user wants to map this mdev to VM, user won't bind to mlx5_core driver.
>> instead he will bind to vfio driver and that does usual open/release/...
>>>
>>>
>>> lrwxrwxrwx 1 root root 0 Mar 7 14:24
>>> 69ea1551-d054-46e9-974d-8edae8f0aefe ->
>>> ../../../devices/pci0000:00/0000:00:02.2/0000:05:00.0/69ea1551-d054-46
>>> e9-974d-8edae8f0aefe
>>> [root@sw-mtx-036 net-next]# ls -l
>>> /sys/bus/mdev/devices/69ea1551-d054-46e9-974d-8edae8f0aefe/
>>> total 0
>>> lrwxrwxrwx 1 root root 0 Mar 7 14:24 driver ->
>> ../../../../../bus/mdev/drivers/vfio_mdev
>>> lrwxrwxrwx 1 root root 0 Mar 7 14:24 iommu_group ->
>> ../../../../../kernel/iommu_groups/0
>>> lrwxrwxrwx 1 root root 0 Mar 7 14:24 mdev_type ->
>> ../mdev_supported_types/mlx5_core-mgmt
>>> drwxr-xr-x 2 root root 0 Mar 7 14:24 power
>>> --w------- 1 root root 4096 Mar 7 14:24 remove
>>> lrwxrwxrwx 1 root root 0 Mar 7 14:24 subsystem -> ../../../../../bus/mdev
>>> -rw-r--r-- 1 root root 4096 Mar 7 14:24 uevent
>>>
>>>> open/release/read/write/mmap/ioctl are regular file operations for
>>>> that mdev device.
>>>>
>>>
>>>> Thanks,
>>>> Kirti
>>>

2019-03-07 22:32:29

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> -----Original Message-----
> From: Kirti Wankhede <[email protected]>
> Sent: Thursday, March 7, 2019 4:02 PM
> To: Parav Pandit <[email protected]>; Jakub Kicinski
> <[email protected]>
> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>; Alex
> Williamson <[email protected]>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
>
>
> On 3/8/2019 2:51 AM, Parav Pandit wrote:
> >
> >
> >> -----Original Message-----
> >> From: Kirti Wankhede <[email protected]>
> >> Sent: Thursday, March 7, 2019 3:08 PM
> >> To: Parav Pandit <[email protected]>; Jakub Kicinski
> >> <[email protected]>
> >> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> >> [email protected]; [email protected];
> [email protected];
> >> [email protected]; Jiri Pirko <[email protected]>; Alex
> >> Williamson <[email protected]>
> >> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
> >> extension
> >>
> >>
> >>
> >> On 3/8/2019 2:32 AM, Parav Pandit wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Kirti Wankhede <[email protected]>
> >>>> Sent: Thursday, March 7, 2019 2:54 PM
> >>>> To: Parav Pandit <[email protected]>; Jakub Kicinski
> >>>> <[email protected]>
> >>>> Cc: Or Gerlitz <[email protected]>; [email protected];
> >>>> linux- [email protected]; [email protected];
> >> [email protected];
> >>>> [email protected]; Jiri Pirko <[email protected]>; Alex
> >>>> Williamson <[email protected]>
> >>>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
> >>>> extension
> >>>>
> >>>>
> >>>>
> >>>> <snip>
> >>>>
> >>>>>>>
> >>>>>>> Yes. I got my patches to adapt to mdev way. Will be posting RFC
> >>>>>>> v2
> >> soon.
> >>>>>>> Will wait for a day to receive more comments/views from Greg and
> >>>> others.
> >>>>>>>
> >>>>>>> As I explained in this cover-letter and discussion, First use
> >>>>>>> case is to create and use mdevs in the host (and not in VM).
> >>>>>>> Later on, I am sure once we have mdevs available, VM users will
> >>>>>>> likely use
> >>>>>> it.
> >>>>>>>
> >>>>>>> So, mlx5_core driver will have two components as starting point.
> >>>>>>>
> >>>>>>> 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
> >>>>>>> This is mdev device life cycle driver which will do,
> >>>>>>> mdev_register_device()
> >>>>>> and implements mlx5_mdev_ops.
> >>>>>>>
> >>>>>> Ok. I would suggest not use mdev.c file name, may be add device
> >>>>>> name, something like mlx_mdev.c or vfio_mlx.c
> >>>>>>
> >>>>> mlx5/core is coding convention is not following to prefix mlx to
> >>>>> its
> >>>>> 40+
> >>>> files.
> >>>>>
> >>>>> it uses actual subsystem or functionality name, such as, sriov.c
> >>>>> eswitch.c fw.c en_tc.c (en for Ethernet) lag.c so, mdev.c aligns
> >>>>> to rest of the 40+ files.
> >>>>>
> >>>>>
> >>>>>>> 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
> >>>>>>> This is mdev device driver which does mdev_register_driver() and
> >>>>>>> probe() creates netdev by heavily reusing existing code of the
> >>>>>>> PF
> >> device.
> >>>>>>> These drivers will not be placed under drivers/vfio/mdev,
> >>>>>>> because this is
> >>>>>> not a vfio driver.
> >>>>>>> This is fine, right?
> >>>>>>>
> >>>>>>
> >>>>>> I'm not too familiar with netdev, but can you create netdev on
> >>>>>> open() call on mlx mdev device? Then you don't have to write mdev
> >>>>>> device
> >>>> driver.
> >>>>>>
> >>>>> Who invokes open() and release()?
> >>>>> I believe it is the qemu would do open(), release, read/write/mmap?
> >>>>>
> >>>>> Assuming that is the case,
> >>>>> I think its incorrect to create netdev in open.
> >>>>> Because when we want to map the mdev to VM using above mdev
> calls,
> >>>>> we
> >>>> actually wont be creating netdev in host.
> >>>>> Instead, some queues etc will be setup as part of these calls.
> >>>>>
> >>>>> By default this created mdev is bound to vfio_mdev.
> >>>>> And once we unbind the device from this driver, we need to bind to
> >>>>> mlx5
> >>>> driver so that driver can create the netdev etc.
> >>>>>
> >>>>> Or did I get open() and friends call wrong?
> >>>>>
> >>>>
> >>>> In 'struct mdev_parent_ops' there are create() and remove(). When
> >>>> user creates mdev device by writing UUID to create sysfs, vendor
> >>>> driver's
> >>>> create() callback gets called. This should be used to
> >>>> allocate/commit
> >>> Yes. I am already past that stage.
> >>>
> >>>> resources from parent device and on remove() callback free those
> >> resources.
> >>>> So there is no need to bind mlx5 driver to that mdev device.
> >>>>
> >>> If we don't bind mlx5 driver, vfio_mdev driver is bound to it. Such
> >>> driver
> >> won't create netdev.
> >>
> >> Doesn't need to.
> >>
> >> Create netdev from create() callback.
> >>
> > I strongly believe this is incorrect way to use create() API.
> > Because,
> > mdev is mediated device from its primary pci device. It is not a protocol
> device.
> >
> > It is also incorrect to tell user that vfio_mdev driver is bound to this mdev
> and mlx5_core driver creating netdev on top of mdev.
> >
>
> vfio_mdev is generic common driver.
> Vendor driver who want to partition its device should handle its child
> creation and its life cycle. What is wrong in that? Why netdev has to be
> created from probe() only and not from create()?
>
I am not suggesting that we invent any new probe() method.
create() is the generic mdev creation entry point.
When the vendor driver implements create(), it doesn't know whether this mdev will be provisioned for a VM or for the host.
So it must do only the generic mdev init sequence.
This means it cannot create a netdev there. As simple as that.

When the user wants to use this mdev in the host, the user first unbinds it from the vfio_mdev driver and binds it to the mlx5_core driver.
The probe() of the mlx5_core driver, which called mdev_register_driver(), is then invoked.
At this point the netdev is created.

If the user wants to use this mdev for a VM, then the vfio_mdev driver and qemu control it via the open/release and friends functions.
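
To make the proposed split concrete, here is a small userspace model in C. It is a sketch of the flow described above, not kernel code: the split_* names are hypothetical, "bind" stands in for the bound driver's probe(), and the host netdev is reduced to a flag.

```c
#include <assert.h>

/* Hypothetical userspace model of the proposed split; not kernel code. */
enum split_driver {
    SPLIT_DRV_NONE,
    SPLIT_DRV_VFIO_MDEV,    /* generic driver: device is meant for a VM */
    SPLIT_DRV_MLX5_CORE,    /* vendor driver: device is used in the host */
};

struct split_mdev {
    int resources_ready;    /* set by create(): generic init only */
    int host_netdev;        /* set only by the vendor driver's probe() */
    enum split_driver bound;
};

/* create(): does not know the final consumer, so no netdev here. */
static void split_create(struct split_mdev *m)
{
    m->resources_ready = 1;
    m->host_netdev = 0;
    m->bound = SPLIT_DRV_VFIO_MDEV;     /* default binding after creation */
}

/* unbind: the vendor driver would tear down its netdev on the way out. */
static void split_unbind(struct split_mdev *m)
{
    if (m->bound == SPLIT_DRV_MLX5_CORE)
        m->host_netdev = 0;
    m->bound = SPLIT_DRV_NONE;
}

/* bind: models the driver's probe(). Returns 0 on success, -1 if refused. */
static int split_bind(struct split_mdev *m, enum split_driver drv)
{
    assert(m->resources_ready);         /* create() must have run first */
    if (m->bound != SPLIT_DRV_NONE || drv == SPLIT_DRV_NONE)
        return -1;                      /* already bound, or nothing to bind */
    m->bound = drv;
    if (drv == SPLIT_DRV_MLX5_CORE)
        m->host_netdev = 1;             /* netdev appears only on this path */
    return 0;
}
```

In this model the netdev exists only while the vendor driver is bound, and the same created mdev can later be rebound to vfio_mdev for the VM case without create() having to guess.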

> > When we want to map this mdev to VM, what should create() do?
>
> Mediated device should be created before it is mapped to VM.
>
Of course.
Let me rephrase the question:
what shouldn't create() do when the mdev is going to be mapped to a VM?
The answer is: it shouldn't create a netdev, but it should do the necessary initialization so that the mdev can be mapped to a VM,
because the netdev will be created inside the VM, not in the host.

create() simply doesn't know at creation time where this mdev will be used (VM or host).
So it doesn't make any sense to create a netdev in create().

I hope it's clear now.

> If you look at the sequence of mdev device creation:
> - 'struct device' is created with bus 'mdev_bus_type'
> - register the device - device_register(&mdev->dev) -> which calls vfio_mdev's
> probe() -> common code for all vendor drivers
> - mdev_device_create_ops() -> calls vendor driver's create() -> this is for
> vendor specific allocation and initialization. This is the callback from where
> you can do what you want to do for mdev device creation and initialization.
> Why it has to be named as probe()?

I do not intend to create any new probe().
I think I explained the flow well above,
i.e. the role of mdev_driver->probe() vs mdev_device->create().

>
> > We will have to shift the code from create() to mdev_device_driver()-
> >probe() to address a use case of selectively mapping a mdev to VM or to
> host and implement appropriate open/close etc functions for VM case.
> >
> > So why not start correctly from the beginning?
> >
>
> What is wrong with current implementation which is being used and tested
> for multiple devices?
>
Oh, nothing is wrong with the current implementation.
Which current implementation provisions an mdev in the host (and not in a guest VM)?

I am just using the right code split of the already available mdev.
When the user wants to map a device to a VM, attach the vfio_mdev driver and create a vfio_device.
When the user wants to use a device in the host, don't attach the vfio_mdev driver; instead, attach the appropriate driver that owns this mdev.

Again, I am not inventing any new probe().
We will use all the existing infrastructure of mdev and the core kernel to bind/unbind a driver with a device.

2019-03-08 12:19:56

by Kirti Wankhede

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension



On 3/8/2019 4:01 AM, Parav Pandit wrote:
>
>
>> -----Original Message-----
>> From: Kirti Wankhede <[email protected]>
>> Sent: Thursday, March 7, 2019 4:02 PM
>> To: Parav Pandit <[email protected]>; Jakub Kicinski
>> <[email protected]>
>> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> [email protected]; Jiri Pirko <[email protected]>; Alex
>> Williamson <[email protected]>
>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>>
>>
>>
>> On 3/8/2019 2:51 AM, Parav Pandit wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Kirti Wankhede <[email protected]>
>>>> Sent: Thursday, March 7, 2019 3:08 PM
>>>> To: Parav Pandit <[email protected]>; Jakub Kicinski
>>>> <[email protected]>
>>>> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
>>>> [email protected]; [email protected];
>> [email protected];
>>>> [email protected]; Jiri Pirko <[email protected]>; Alex
>>>> Williamson <[email protected]>
>>>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
>>>> extension
>>>>
>>>>
>>>>
>>>> On 3/8/2019 2:32 AM, Parav Pandit wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Kirti Wankhede <[email protected]>
>>>>>> Sent: Thursday, March 7, 2019 2:54 PM
>>>>>> To: Parav Pandit <[email protected]>; Jakub Kicinski
>>>>>> <[email protected]>
>>>>>> Cc: Or Gerlitz <[email protected]>; [email protected];
>>>>>> linux- [email protected]; [email protected];
>>>> [email protected];
>>>>>> [email protected]; Jiri Pirko <[email protected]>; Alex
>>>>>> Williamson <[email protected]>
>>>>>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
>>>>>> extension
>>>>>>
>>>>>>
>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>>>>>
>>>>>>>>> Yes. I got my patches to adapt to mdev way. Will be posting RFC
>>>>>>>>> v2
>>>> soon.
>>>>>>>>> Will wait for a day to receive more comments/views from Greg and
>>>>>> others.
>>>>>>>>>
>>>>>>>>> As I explained in this cover-letter and discussion, First use
>>>>>>>>> case is to create and use mdevs in the host (and not in VM).
>>>>>>>>> Later on, I am sure once we have mdevs available, VM users will
>>>>>>>>> likely use
>>>>>>>> it.
>>>>>>>>>
>>>>>>>>> So, mlx5_core driver will have two components as starting point.
>>>>>>>>>
>>>>>>>>> 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
>>>>>>>>> This is mdev device life cycle driver which will do,
>>>>>>>>> mdev_register_device()
>>>>>>>> and implements mlx5_mdev_ops.
>>>>>>>>>
>>>>>>>> Ok. I would suggest not use mdev.c file name, may be add device
>>>>>>>> name, something like mlx_mdev.c or vfio_mlx.c
>>>>>>>>
>>>>>>> mlx5/core is coding convention is not following to prefix mlx to
>>>>>>> its
>>>>>>> 40+
>>>>>> files.
>>>>>>>
>>>>>>> it uses actual subsystem or functionality name, such as, sriov.c
>>>>>>> eswitch.c fw.c en_tc.c (en for Ethernet) lag.c so, mdev.c aligns
>>>>>>> to rest of the 40+ files.
>>>>>>>
>>>>>>>
>>>>>>>>> 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
>>>>>>>>> This is mdev device driver which does mdev_register_driver() and
>>>>>>>>> probe() creates netdev by heavily reusing existing code of the
>>>>>>>>> PF
>>>> device.
>>>>>>>>> These drivers will not be placed under drivers/vfio/mdev,
>>>>>>>>> because this is
>>>>>>>> not a vfio driver.
>>>>>>>>> This is fine, right?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'm not too familiar with netdev, but can you create netdev on
>>>>>>>> open() call on mlx mdev device? Then you don't have to write mdev
>>>>>>>> device
>>>>>> driver.
>>>>>>>>
>>>>>>> Who invokes open() and release()?
>>>>>>> I believe it is the qemu would do open(), release, read/write/mmap?
>>>>>>>
>>>>>>> Assuming that is the case,
>>>>>>> I think its incorrect to create netdev in open.
>>>>>>> Because when we want to map the mdev to VM using above mdev
>> calls,
>>>>>>> we
>>>>>> actually wont be creating netdev in host.
>>>>>>> Instead, some queues etc will be setup as part of these calls.
>>>>>>>
>>>>>>> By default this created mdev is bound to vfio_mdev.
>>>>>>> And once we unbind the device from this driver, we need to bind to
>>>>>>> mlx5
>>>>>> driver so that driver can create the netdev etc.
>>>>>>>
>>>>>>> Or did I get open() and friends call wrong?
>>>>>>>
>>>>>>
>>>>>> In 'struct mdev_parent_ops' there are create() and remove(). When
>>>>>> user creates mdev device by writing UUID to create sysfs, vendor
>>>>>> driver's
>>>>>> create() callback gets called. This should be used to
>>>>>> allocate/commit
>>>>> Yes. I am already past that stage.
>>>>>
>>>>>> resources from parent device and on remove() callback free those
>>>> resources.
>>>>>> So there is no need to bind mlx5 driver to that mdev device.
>>>>>>
>>>>> If we don't bind mlx5 driver, vfio_mdev driver is bound to it. Such
>>>>> driver
>>>> won't create netdev.
>>>>
>>>> Doesn't need to.
>>>>
>>>> Create netdev from create() callback.
>>>>
>>> I strongly believe this is incorrect way to use create() API.
>>> Because,
>>> mdev is mediated device from its primary pci device. It is not a protocol
>> device.
>>>
>>> It it also incorrect to tell user that vfio_mdev driver is bound to this mdev
>> and mlx5_core driver creating netdev on top of mdev.
>>>
>>
>> vfio_mdev is generic common driver.
>> Vendor driver who want to partition its device should handle its child
>> creation and its life cycle. What is wrong in that? Why netdev has to be
>> created from probe() only and not from create()?
>>
> I am not suggesting to invent any new probe() method.
> create() is generic mdev creation entry point.
> When create() is implemented by vendor driver, vendor driver doesn't know if this mdev will be provisioned for VM or for host.

The vendor driver doesn't need to know whether the mdev device is going to
be used for a VM or the host.

> So it must do only generic mdev init sequence.
> This means, it cannot create netdev here. As simple as that.
>
> When user wants to use this mdev in a host, user will first unbind it from vfio_mdev driver and binds this mdev to mlx5_driver.
> probe() of mlx5_core driver is called who did mdev_register_driver.
> At this point netdev is created.
>
> If user wants to use this mdev for VM, than vfio_mdev driver and qemu will control it via open/release friend functions.
>

The VFIO interface is a generic interface; QEMU is one example of a user
space application that uses it. But using the VFIO interface doesn't mean
the device can only be used for a VM.
You can write a user space application for the host using the VFIO interface.


>>> When we want to map this mdev to VM, what should create() do?
>>
>> Mediated device should be created before it is mapped to VM.
>>
> Of course.
> Let me rephrase the question:
> what shouldn't be done by create() when it wants to map to VM?
> Answer is: it shouldn't create a netdev, but do necessary initialization so that it can be mapped to VM.
> Because netdev will be created inside the VM not in the host.
>
> create() simply doesn't know during creation time, where this mdev will be used (VM or host).
> So it doesn't make any sense to create netdev in create().
>

As I explained above, I disagree with this comment.

> I hope it's clear now.
>
>> If you look at the sequence of mdev device creation:
>> - 'struct device' is created with bus 'mdev_bus_type'
>> - register the device - device_register(&mdev->dev) -> which calls vfio_mdev's
>> probe() -> common code for all vendor drivers
>> - mdev_device_create_ops() -> calls vendor driver's create() -> this is for
>> vendor specific allocation and initialization. This is the callback from where
>> you can do what you want to do for mdev device creation and initialization.
>> Why it has to be named as probe()?
>
> I do not intent to create any new probe().
> I think I explained the flow well above -
> i.e. role of mdev_driver->probe() vs mdev_device->create().
>
>>
>>> We will have to shift the code from create() to mdev_device_driver()-
>>> probe() to address a use case of selectively mapping a mdev to VM or to
>> host and implement appropriate open/close etc functions for VM case.
>>>
>>> So why not start correctly from the beginning?
>>>
>>
>> What is wrong with current implementation which is being used and tested
>> for multiple devices?
>>
> Oh, nothing wrong in current implementation.
> Which current implementation provisions mdev in host (and not in guest VM)?
>

All mdev vendor driver implementations. There is no such difference between
an mdev device for the host and one for a VM; the interface is generic.

> I am just using right code split of already available mdev.
> When user wants to map a device to VM, attach vfio_mdev driver and create vfio_device.
> When user wants to use a device in host, don't attach vfio_mdev driver, instead attach, appropriate driver what owns this mdev.
>

No need to do that. Use VFIO interface in your user space application
when you want to use device in host.

Thanks,
Kirti

> Again, I am not inventing any new probe().
> We will use all existing infra of mdev and core kernel to bind/unbind driver with device.
>

2019-03-08 17:10:36

by Parav Pandit

Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> -----Original Message-----
> From: Kirti Wankhede <[email protected]>
> Sent: Friday, March 8, 2019 6:19 AM
> To: Parav Pandit <[email protected]>; Jakub Kicinski
> <[email protected]>
> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Jiri Pirko <[email protected]>; Alex
> Williamson <[email protected]>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
>
>
> >>>>>> <snip>
> >>>>>>
> >>>>>>>>>
> >>>>>>>>> Yes. I got my patches to adapt to mdev way. Will be posting
> >>>>>>>>> RFC
> >>>>>>>>> v2
> >>>> soon.
> >>>>>>>>> Will wait for a day to receive more comments/views from Greg
> >>>>>>>>> and
> >>>>>> others.
> >>>>>>>>>
> >>>>>>>>> As I explained in this cover-letter and discussion, First use
> >>>>>>>>> case is to create and use mdevs in the host (and not in VM).
> >>>>>>>>> Later on, I am sure once we have mdevs available, VM users
> >>>>>>>>> will likely use
> >>>>>>>> it.
> >>>>>>>>>
> >>>>>>>>> So, mlx5_core driver will have two components as starting point.
> >>>>>>>>>
> >>>>>>>>> 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
> >>>>>>>>> This is mdev device life cycle driver which will do,
> >>>>>>>>> mdev_register_device()
> >>>>>>>> and implements mlx5_mdev_ops.
> >>>>>>>>>
> >>>>>>>> Ok. I would suggest not use mdev.c file name, may be add device
> >>>>>>>> name, something like mlx_mdev.c or vfio_mlx.c
> >>>>>>>>
> >>>>>>> mlx5/core is coding convention is not following to prefix mlx to
> >>>>>>> its
> >>>>>>> 40+
> >>>>>> files.
> >>>>>>>
> >>>>>>> it uses actual subsystem or functionality name, such as, sriov.c
> >>>>>>> eswitch.c fw.c en_tc.c (en for Ethernet) lag.c so, mdev.c aligns
> >>>>>>> to rest of the 40+ files.
> >>>>>>>
> >>>>>>>
> >>>>>>>>> 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
> >>>>>>>>> This is mdev device driver which does mdev_register_driver()
> >>>>>>>>> and
> >>>>>>>>> probe() creates netdev by heavily reusing existing code of the
> >>>>>>>>> PF
> >>>> device.
> >>>>>>>>> These drivers will not be placed under drivers/vfio/mdev,
> >>>>>>>>> because this is
> >>>>>>>> not a vfio driver.
> >>>>>>>>> This is fine, right?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I'm not too familiar with netdev, but can you create netdev on
> >>>>>>>> open() call on mlx mdev device? Then you don't have to write
> >>>>>>>> mdev device
> >>>>>> driver.
> >>>>>>>>
> >>>>>>> Who invokes open() and release()?
> >>>>>>> I believe it is the qemu would do open(), release,
> read/write/mmap?
> >>>>>>>
> >>>>>>> Assuming that is the case,
> >>>>>>> I think its incorrect to create netdev in open.
> >>>>>>> Because when we want to map the mdev to VM using above mdev
> >> calls,
> >>>>>>> we
> >>>>>> actually wont be creating netdev in host.
> >>>>>>> Instead, some queues etc will be setup as part of these calls.
> >>>>>>>
> >>>>>>> By default this created mdev is bound to vfio_mdev.
> >>>>>>> And once we unbind the device from this driver, we need to bind
> >>>>>>> to
> >>>>>>> mlx5
> >>>>>> driver so that driver can create the netdev etc.
> >>>>>>>
> >>>>>>> Or did I get open() and friends call wrong?
> >>>>>>>
> >>>>>>
> >>>>>> In 'struct mdev_parent_ops' there are create() and remove(). When
> >>>>>> user creates mdev device by writing UUID to create sysfs, vendor
> >>>>>> driver's
> >>>>>> create() callback gets called. This should be used to
> >>>>>> allocate/commit
> >>>>> Yes. I am already past that stage.
> >>>>>
> >>>>>> resources from parent device and on remove() callback free those
> >>>> resources.
> >>>>>> So there is no need to bind mlx5 driver to that mdev device.
> >>>>>>
> >>>>> If we don't bind mlx5 driver, vfio_mdev driver is bound to it.
> >>>>> Such driver
> >>>> won't create netdev.
> >>>>
> >>>> Doesn't need to.
> >>>>
> >>>> Create netdev from create() callback.
> >>>>
> >>> I strongly believe this is incorrect way to use create() API.
> >>> Because,
> >>> mdev is mediated device from its primary pci device. It is not a
> >>> protocol
> >> device.
> >>>
> >>> It it also incorrect to tell user that vfio_mdev driver is bound to
> >>> this mdev
> >> and mlx5_core driver creating netdev on top of mdev.
> >>>
> >>
> >> vfio_mdev is generic common driver.
> >> Vendor driver who want to partition its device should handle its
> >> child creation and its life cycle. What is wrong in that? Why netdev
> >> has to be created from probe() only and not from create()?
> >>
> > I am not suggesting to invent any new probe() method.
> > create() is generic mdev creation entry point.
> > When create() is implemented by vendor driver, vendor driver doesn't
> know if this mdev will be provisioned for VM or for host.
>
> Vendor driver doesn't need to know if mdev device is going to used for VM
> or host.
>
I explained the use cases to you. How can I explain them better?
When a user wants to use an mdev in the host, the rdma and net devices need to be created by the kernel vendor driver.
When a user wants to use an mdev in a VM (passthrough), the rdma and net devices won't be created by the host kernel driver.
Can you please ack that these two use cases are understood?
Because I get the feeling from your proposal that they are not.
And we should get level set on that first.

These are two different paths, and obviously they have two different init sequences.
More below.

> > So it must do only generic mdev init sequence.
> > This means, it cannot create netdev here. As simple as that.
> >
> > When user wants to use this mdev in a host, user will first unbind it from
> vfio_mdev driver and binds this mdev to mlx5_driver.
> > probe() of mlx5_core driver is called who did mdev_register_driver.
> > At this point netdev is created.
> >
> > If user wants to use this mdev for VM, than vfio_mdev driver and qemu
> will control it via open/release friend functions.
> >
>
> VFIO interface is generic interface, one example is QEMU, as a user space
> application, uses VFIO interface. But that doesn't mean that if you are using
> VFIO interface that is only be used for VM.
> You can write a user space application for host using VFIO interface.
>
VM is one use case. We picked VM for discussion here.

If mdevs are supposed to be consumed by a single driver (vfio_mdev), there is no need to have vfio_mdev as a separate driver.
Can you please explain why the vfio_mdev driver logic was split from the core mdev?

> >>> When we want to map this mdev to VM, what should create() do?
> >>
> >> Mediated device should be created before it is mapped to VM.
> >>
> > Of course.
> > Let me rephrase the question:
> > what shouldn't be done by create() when it wants to map to VM?
> > Answer is: it shouldn't create a netdev, but do necessary initialization so
> that it can be mapped to VM.
> > Because netdev will be created inside the VM not in the host.
> >
> > create() simply doesn't know during creation time, where this mdev will be
> used (VM or host).
> > So it doesn't make any sense to create netdev in create().
> >
>
> As I explained above, I disagree with this comment.
>
But that disagreement doesn't propose a good software solution. :-)
More discussion below.

> > I hope it's clear now.
> >
> >> If you look at the sequence of mdev device creation:
> >> - 'struct device' is created with bus 'mdev_bus_type'
> >> - register the device - device_register(&mdev->dev) -> which calls
> >> vfio_mdev's
> >> probe() -> common code for all vendor drivers
> >> - mdev_device_create_ops() -> calls vendor driver's create() -> this
> >> is for vendor specific allocation and initialization. This is the
> >> callback from where you can do what you want to do for mdev device
> creation and initialization.
> >> Why it has to be named as probe()?
> >
> > I do not intent to create any new probe().
> > I think I explained the flow well above - i.e. role of
> > mdev_driver->probe() vs mdev_device->create().
> >
> >>
> >>> We will have to shift the code from create() to
> >>> mdev_device_driver()-
> >>> probe() to address a use case of selectively mapping a mdev to VM or
> >>> to
> >> host and implement appropriate open/close etc functions for VM case.
> >>>
> >>> So why not start correctly from the beginning?
> >>>
> >>
> >> What is wrong with current implementation which is being used and
> >> tested for multiple devices?
> >>
> > Oh, nothing wrong in current implementation.
> > Which current implementation provisions mdev in host (and not in guest
> VM)?
> >
>
> All mdev vendor driver implementations. Mdev device for host or VM :
> there is no such difference, interface is generic.
>
Can you point to an example driver that creates a usable device (not just the mdev, but an actual device for a given mdev)?
I looked at intel_gvt_activate_vgpu() and ioctl(); that doesn't create any gpu in the host.
I looked at samples/vfio-mdev/mtty.c; that doesn't create any tty device on an open() or ioctl() call.

> > I am just using right code split of already available mdev.
> > When user wants to map a device to VM, attach vfio_mdev driver and
> create vfio_device.
> > When user wants to use a device in host, don't attach vfio_mdev driver,
> instead attach, appropriate driver what owns this mdev.
> >
>
> No need to do that. Use VFIO interface in your user space application when
> you want to use device in host.
>
1. Can you explain what you propose: which callback should create the rdma and net devices if we should not have a dedicated vendor bus driver?
2. Can you please also explain how the vendor driver should avoid creating these rdma and net devices when the device is to be used in a VM?

So far I have heard only two suggestions from you:
one was to create the netdev during the create() callback;
the other was to create the netdev during the open() call.

Neither of these modes fits the need, as I explained.
Did I miss something?

3. We certainly do not want to invent new ioctl opcodes on the vfio char device to create/delete these devices.
There should be only VFIO_DEVICE_*.
The netdev subsystem has come a long way in avoiding such crazy vendor ioctl() extensions.
I hope you do not intend to propose that when you say "Use VFIO interface in your user space application ...".

4. Finally, the Linux kernel core provides a bus model and a way to bind/attach a specific driver to a device.
I assume that you are aware of the bind/unbind sysfs files of a driver.
Can you please explain why you think the mdev subsystem should not use the standard Linux kernel driver model?
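
For readers unfamiliar with the bind/unbind files mentioned in point 4, here is a minimal sketch of the two sysfs writes involved. It only builds the list of writes and performs none of them. It assumes that the mdev bus shows up as /sys/bus/mdev and that mdev devices are named by their UUID; "mlx5_mdev" is a hypothetical driver name, since no such driver had been posted at this point in the thread.

```python
from typing import List, Tuple

MDEV_BUS_DRIVERS = "/sys/bus/mdev/drivers"  # assumed sysfs location of the mdev bus drivers


def rebind_plan(mdev_uuid: str, old_driver: str, new_driver: str) -> List[Tuple[str, str]]:
    """Return the (sysfs file, value) writes that move one mdev between drivers."""
    return [
        # Detach the device from the driver that currently owns it.
        (f"{MDEV_BUS_DRIVERS}/{old_driver}/unbind", mdev_uuid),
        # Attach it to the new driver, which triggers that driver's probe().
        (f"{MDEV_BUS_DRIVERS}/{new_driver}/bind", mdev_uuid),
    ]


for path, value in rebind_plan(
    "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001",  # arbitrary example UUID
    "vfio_mdev",
    "mlx5_mdev",                             # hypothetical vendor driver name
):
    print(f"echo {value} > {path}")
```

Returning the plan instead of writing to sysfs keeps the sketch safe to run on any machine; the printed lines are the shell commands an administrator would issue by hand.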

2019-03-26 11:49:40

by Lorenzo Pieralisi

Subject: Re: [RFC net-next 1/8] subdev: Introducing subdev bus

On Fri, Mar 01, 2019 at 08:17:27AM +0100, Greg KH wrote:
> On Thu, Feb 28, 2019 at 11:37:45PM -0600, Parav Pandit wrote:
> > Introduce a new subdev bus which holds sub devices created from a
> > primary device. These devices are named as 'subdev'.
> > A subdev is identified similarly to pci device using 16-bit vendor id
> > and device id.
> > Unlike PCI devices, scope of subdev is limited to Linux kernel.
>
> But these are limited to only PCI devices, right?
>
> This sounds a lot like that ARM proposal a week or so ago that asked for
> something like this, are you working with them to make sure your
> proposal works for them as well? (sorry, can't find where that was
> announced, it was online somewhere...)

Thanks for pointing this out and sorry for the delay in chiming in.

Blog post and white paper are available here:

https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/high-performance-device-virtualization-approach-to-standardization

It would certainly be good to reach a degree of convergence in this
design space, which will eventually benefit the kernel interfaces
required.

Thanks again for pointing this out.

Lorenzo

2021-05-31 10:38:37

by moyufeng

Subject: Re: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension


Hi, Jiri & Jakub

Generally, a devlink instance is created for each PF/VF. This
facilitates the query and configuration of the settings of each
function. But for some common objects, like the health status of
the entire ASIC, the data read through those instances will be duplicated.

So I wonder whether I should use a single common devlink instance for the
entire ASIC to avoid reading the same data. If so, then I can't set
parameters for each function individually. Or is there a better way
to implement it?

Thanks! ~

On 2019/3/6 0:52, Parav Pandit wrote:
>
>
>> -----Original Message-----
>> From: Jakub Kicinski <[email protected]>
>> Sent: Monday, March 4, 2019 7:46 PM
>> To: Parav Pandit <[email protected]>
>> Cc: Or Gerlitz <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> [email protected]; Jiri Pirko <[email protected]>
>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>>
>> On Mon, 4 Mar 2019 04:41:01 +0000, Parav Pandit wrote:
>>>>> $ devlink dev show
>>>>> pci/0000:05:00.0
>>>>> subdev/subdev0
>>>>
>>>> Please don't spawn devlink instances. Devlink instance is supposed
>>>> to represent an ASIC. If we start spawning them willy nilly for
>>>> whatever software construct we want to model the clarity of the
>>>> ontology will suffer a lot.
>>> Devlink devices not restricted to ASIC even though today it is
>>> representing ASIC for one vendor. Today for one ASIC, it already
>>> presents multiple devlink devices (128 or more) for PF and VFs, two
>>> PFs on same ASIC etc. VF is just a sub-device which is well defined by
>>> PCISIG, whereas sub-device is not. Sub-device do consume actual ASIC
>>> resources (just like PFs and VFs), Hence point-(6) of cover-letter
>>> indicate that the devlink capability to tell how many such sub-devices
>>> can be created.
>>>
>>> In above example, they are created for a given bus-device following
>>> existing devlink construct.
>>
>> No, it's not "representing the ASIC for one vendor". It's how it works for
>> switches (including mlxsw) and how it was described in the original cover
>> letter:
>>
> Sorry for the confusion.
> I meant to say, my understanding is Netronome creates one devlink instance for whole ASIC.
> Please correct me if this is incorrect.
> mlx5_core driver creates multiple devlink devices for PF and VFs for one ASIC.
>
>> Introduce devlink interface and first drivers to use it
>>
>> There a is need for some userspace API that would allow to expose things
>> that are not directly related to any device class like net_device of
>> ib_device, but rather chip-wide/switch-ASIC-wide stuff.
>>
>> [...]
>>
>> We can deviate from the original intent if need be and dilute the ontology.
>> But let's be clear on the status quo, please.
> Status quo is mlx5_core driver creates multiple devlink devices. It creates for devlink device for each PF and VF of a single ASIC.
>

2021-06-01 05:40:40

by Jakub Kicinski

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On Mon, 31 May 2021 18:36:12 +0800 moyufeng wrote:
> Hi, Jiri & Jakub
>
> Generally, a devlink instance is created for each PF/VF. This
> facilitates the query and configuration of the settings of each
> function. But if some common objects, like the health status of
> the entire ASIC, the data read by those instances will be duplicate.
>
> So I wonder do I just need to apply a public devlink instance for the
> entire ASIC to avoid reading the same data? If so, then I can't set
> parameters for each function individually. Or is there a better suggestion
> to implement it?

I don't think there is a great way to solve this today. In my mind
devlink instances should be per ASIC, but I never had to solve this
problem for a multi-function ASIC.

Can you assume all functions are in the same control domain? Can they
trust each other?

2021-06-01 07:33:54

by Yunsheng Lin

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/1 13:37, Jakub Kicinski wrote:
> On Mon, 31 May 2021 18:36:12 +0800 moyufeng wrote:
>> Hi, Jiri & Jakub
>>
>> Generally, a devlink instance is created for each PF/VF. This
>> facilitates the query and configuration of the settings of each
>> function. But if some common objects, like the health status of
>> the entire ASIC, the data read by those instances will be duplicate.
>>
>> So I wonder do I just need to apply a public devlink instance for the
>> entire ASIC to avoid reading the same data? If so, then I can't set
>> parameters for each function individually. Or is there a better suggestion
>> to implement it?
>
> I don't think there is a great way to solve this today. In my mind
> devlink instances should be per ASIC, but I never had to solve this
> problem for a multi-function ASIC.

Is there a reason why it hasn't had to be solved yet?
Is it because the devices currently supporting devlink do not have
this kind of problem, e.g. a single-function ASIC or a multi-function
ASIC that doesn't share common resources?

Was there a discussion in the past about how to solve it?

>
> Can you assume all functions are in the same control domain? Can they
> trust each other?

"same control domain" means if it is controlled by a single host, not
by multiple hosts, right?

If a PF is not passed through to a VM using VFIO and the other PFs are still
in the host, then I think we can say it is controlled by a single host.

And the PFs trust each other right now, at least at the driver
level, but that is not the case between VFs.

>
> .
>

2021-06-01 21:38:22

by Jakub Kicinski

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On Tue, 1 Jun 2021 15:33:09 +0800 Yunsheng Lin wrote:
> On 2021/6/1 13:37, Jakub Kicinski wrote:
> > On Mon, 31 May 2021 18:36:12 +0800 moyufeng wrote:
> >> Hi, Jiri & Jakub
> >>
> >> Generally, a devlink instance is created for each PF/VF. This
> >> facilitates the query and configuration of the settings of each
> >> function. But if some common objects, like the health status of
> >> the entire ASIC, the data read by those instances will be duplicate.
> >>
> >> So I wonder do I just need to apply a public devlink instance for the
> >> entire ASIC to avoid reading the same data? If so, then I can't set
> >> parameters for each function individually. Or is there a better suggestion
> >> to implement it?
> >
> > I don't think there is a great way to solve this today. In my mind
> > devlink instances should be per ASIC, but I never had to solve this
> > problem for a multi-function ASIC.
>
> Is there a reason why it didn't have to be solved yet?
> Is it because the devices currently supporting devlink do not have
> this kind of problem, like single-function ASIC or multi-function
> ASIC without sharing common resource?

I'm not 100% sure, my guess is multi-function devices supporting
devlink are simple enough for the problem not to matter all that much.

> Was there a discussion how to solved it in the past?

Not really, we floated an idea of creating aliases for devlink
instances so a single devlink instance could answer to multiple
bus identifiers. But nothing concrete.

> > Can you assume all functions are in the same control domain? Can they
> > trust each other?
>
> "same control domain" means if it is controlled by a single host, not
> by multi hosts, right?
>
> If the PF is not passed through to a vm using VFIO and other PF is still
> in the host, then I think we can say it is controlled by a single host.
>
> And each PF is trusted with each other right now, at least at the driver
> level, but not between VF.

Right, the challenge AFAIU is how to match up multiple functions into
a single devlink instance, when driver has to probe them one by one.
If there is no requirement that different functions are securely
isolated it becomes a lot simpler (e.g. just compare device serial
numbers).
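
The "compare device serial numbers" idea could look roughly like the sketch below. It is an illustration only: the PCI addresses and serial strings are invented, and how a driver would actually read a serial number at probe time is left out.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_serial(functions: Dict[str, str]) -> Dict[str, List[str]]:
    """Map device serial number -> sorted list of PCI functions reporting it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for bdf, serial in functions.items():
        groups[serial].append(bdf)
    # Each resulting group would share one devlink instance.
    return {serial: sorted(bdfs) for serial, bdfs in groups.items()}


probed = {
    "0000:03:00.1": "SN-A",  # made-up serial numbers
    "0000:03:00.0": "SN-A",
    "0000:82:00.0": "SN-B",
}
print(group_by_serial(probed))
```

This only works under the condition stated above: if the functions are not in the same control domain, matching on a serial number says nothing about whether they should be allowed to see each other's state.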

2021-06-02 02:26:59

by Yunsheng Lin

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/2 5:34, Jakub Kicinski wrote:
> On Tue, 1 Jun 2021 15:33:09 +0800 Yunsheng Lin wrote:
>> On 2021/6/1 13:37, Jakub Kicinski wrote:
>>> On Mon, 31 May 2021 18:36:12 +0800 moyufeng wrote:
>>>> Hi, Jiri & Jakub
>>>>
>>>> Generally, a devlink instance is created for each PF/VF. This
>>>> facilitates the query and configuration of the settings of each
>>>> function. But if some common objects, like the health status of
>>>> the entire ASIC, the data read by those instances will be duplicate.
>>>>
>>>> So I wonder do I just need to apply a public devlink instance for the
>>>> entire ASIC to avoid reading the same data? If so, then I can't set
>>>> parameters for each function individually. Or is there a better suggestion
>>>> to implement it?
>>>
>>> I don't think there is a great way to solve this today. In my mind
>>> devlink instances should be per ASIC, but I never had to solve this
>>> problem for a multi-function ASIC.
>>
>> Is there a reason why it didn't have to be solved yet?
>> Is it because the devices currently supporting devlink do not have
>> this kind of problem, like single-function ASIC or multi-function
>> ASIC without sharing common resource?
>
> I'm not 100% sure, my guess is multi-function devices supporting
> devlink are simple enough for the problem not to matter all that much.
>
>> Was there a discussion how to solved it in the past?
>
> Not really, we floated an idea of creating aliases for devlink
> instances so a single devlink instance could answer to multiple
> bus identifiers. But nothing concrete.

What does "answer to multiple bus identifiers" mean? I
suppose it means the user provides the bus identifier when setting or
getting something, and the devlink instance uses that bus identifier
to differentiate between PFs in the same ASIC?

Can a devlink port be used to indicate different PFs in the same ASIC,
since it already has the bus identifier in it? It seems we need an
extra identifier to indicate the ASIC?

$ devlink port show
...
pci/0000:03:00.0/61: type eth netdev sw1p1s0 split_group 0
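
To make the point about the port handle concrete, here is a small sketch that splits the handle shown above into its parts. The helper and type names are hypothetical; the only assumption about the tool is the "bus/device/index" shape visible in the output line.

```python
from typing import NamedTuple


class PortHandle(NamedTuple):
    bus: str      # e.g. "pci"
    device: str   # bus identifier of the function, e.g. "0000:03:00.0"
    index: int    # port index within the devlink instance


def parse_port_handle(handle: str) -> PortHandle:
    """Split a 'bus/device/index' handle as printed by `devlink port show`."""
    bus, device, index = handle.rstrip(":").split("/")
    return PortHandle(bus, device, int(index))


port = parse_port_handle("pci/0000:03:00.0/61:")
print(port.bus, port.device, port.index)
```

The device field identifies one PCI function, but nothing in the handle says which ASIC that function belongs to, which is the gap the question above is pointing at.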

>
>>> Can you assume all functions are in the same control domain? Can they
>>> trust each other?
>>
>> "same control domain" means if it is controlled by a single host, not
>> by multi hosts, right?
>>
>> If the PF is not passed through to a vm using VFIO and other PF is still
>> in the host, then I think we can say it is controlled by a single host.
>>
>> And each PF is trusted with each other right now, at least at the driver
>> level, but not between VF.
>
> Right, the challenge AFAIU is how to match up multiple functions into
> a single devlink instance, when driver has to probe them one by one.

Does it make sense for the first PF probed to create an auxiliary device,
and for the auxiliary device driver to create the devlink instance? And
can a PF probed later connect/register to that devlink instance?

> If there is no requirement that different functions are securely
> isolated it becomes a lot simpler (e.g. just compare device serial
> numbers).

Is there any known requirement if the different functions are not
securely isolated?

>
> .
>

2021-06-02 16:38:29

by Jakub Kicinski

Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On Wed, 2 Jun 2021 10:24:11 +0800 Yunsheng Lin wrote:
> On 2021/6/2 5:34, Jakub Kicinski wrote:
> > On Tue, 1 Jun 2021 15:33:09 +0800 Yunsheng Lin wrote:
> >> Is there a reason why it didn't have to be solved yet?
> >> Is it because the devices currently supporting devlink do not have
> >> this kind of problem, like single-function ASIC or multi-function
> >> ASIC without sharing common resource?
> >
> > I'm not 100% sure, my guess is multi-function devices supporting
> > devlink are simple enough for the problem not to matter all that much.
> >
> >> Was there a discussion how to solved it in the past?
> >
> > Not really, we floated an idea of creating aliases for devlink
> > instances so a single devlink instance could answer to multiple
> > bus identifiers. But nothing concrete.
>
> What does it mean by "answer to multiple bus identifiers"? I
> suppose it means user provides the bus identifiers when setting or
> getting something, and devlink instance uses that bus identifiers
> to differentiate different PF in the same ASIC?

Correct.

> can devlink port be used to indicate different PF in the same ASIC,
> which already has the bus identifiers in it? It seems we need a
> extra identifier to indicate the ASIC?
>
> $ devlink port show
> ...
> pci/0000:03:00.0/61: type eth netdev sw1p1s0 split_group 0

Ports can obviously be used, but which PCI device will you use to
register the devlink instance? Perhaps using just one doesn't matter
if there is only one NIC in the system, but may be confusing with
multiple NICs, no?
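
To make the naming question concrete, here is a minimal sketch of how a
port handle such as the one quoted above splits into bus name, device
name and port index, and why the first two parts name exactly one PCI
device. The helper names (PortHandle, parse_port_handle, instance_name)
are hypothetical and only for illustration; this is not devlink code.

```python
from typing import NamedTuple


class PortHandle(NamedTuple):
    bus: str         # e.g. "pci"
    device: str      # e.g. "0000:03:00.0"
    port_index: int  # e.g. 61


def parse_port_handle(handle: str) -> PortHandle:
    """Split a "bus/device/index" port handle into its three parts."""
    bus, device, index = handle.split("/")
    return PortHandle(bus, device, int(index))


def instance_name(handle: str) -> str:
    """Return the "bus/device" prefix that names the devlink instance."""
    port = parse_port_handle(handle)
    return f"{port.bus}/{port.device}"


# Two ports on different functions of one NIC end up under different
# instance names, which is the ambiguity discussed above.
print(instance_name("pci/0000:03:00.0/61"))
print(instance_name("pci/0000:03:00.1/62"))
```

With a single instance per ASIC, only one of those prefixes could be
the registered name unless the instance answers to several of them.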

> >> "same control domain" means if it is controlled by a single host, not
> >> by multi hosts, right?
> >>
> >> If the PF is not passed through to a vm using VFIO and other PF is still
> >> in the host, then I think we can say it is controlled by a single host.
> >>
> >> And each PF is trusted with each other right now, at least at the driver
> >> level, but not between VF.
> >
> > Right, the challenge AFAIU is how to match up multiple functions into
> > a single devlink instance, when driver has to probe them one by one.
>
> Does it make sense if the PF first probed creates a auxiliary device,
> and the auxiliary device driver creates the devlink instance? And
> the PF probed later can connect/register to that devlink instance?

I would say no, that just adds another layer of complication and
doesn't link the functions in any way.

> > If there is no requirement that different functions are securely
> > isolated it becomes a lot simpler (e.g. just compare device serial
> > numbers).
>
> Is there any known requirement if the different functions are not
> securely isolated?

Not sure I understand. If the functions are in different domains
of control allowing one of them to dump state of the other may be
problematic given features like TLS offload, for instance.

2021-06-03 03:48:47

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/3 0:34, Jakub Kicinski wrote:
> On Wed, 2 Jun 2021 10:24:11 +0800 Yunsheng Lin wrote:
>> On 2021/6/2 5:34, Jakub Kicinski wrote:
>>> On Tue, 1 Jun 2021 15:33:09 +0800 Yunsheng Lin wrote:
>>>> Is there a reason why it didn't have to be solved yet?
>>>> Is it because the devices currently supporting devlink do not have
>>>> this kind of problem, like single-function ASIC or multi-function
>>>> ASIC without sharing common resource?
>>>
>>> I'm not 100% sure, my guess is multi-function devices supporting
>>> devlink are simple enough for the problem not to matter all that much.
>>>
>>>> Was there a discussion how to solved it in the past?
>>>
>>> Not really, we floated an idea of creating aliases for devlink
>>> instances so a single devlink instance could answer to multiple
>>> bus identifiers. But nothing concrete.
>>
>> What does it mean by "answer to multiple bus identifiers"? I
>> suppose it means user provides the bus identifiers when setting or
>> getting something, and devlink instance uses that bus identifiers
>> to differentiate different PF in the same ASIC?
>
> Correct.
>
>> can devlink port be used to indicate different PF in the same ASIC,
>> which already has the bus identifiers in it? It seems we need a
>> extra identifier to indicate the ASIC?
>>
>> $ devlink port show
>> ...
>> pci/0000:03:00.0/61: type eth netdev sw1p1s0 split_group 0
>
> Ports can obviously be used, but which PCI device will you use to
> register the devlink instance? Perhaps using just one doesn't matter
> if there is only one NIC in the system, but may be confusing with
> multiple NICs, no?

Yes, it is confusing. How about using the controler_id to indicate
different NICs? We can make sure controler_id is unique within the same
host, and a controler_id corresponds to a devlink instance. Vendor info
or serial number for the devlink instance can then give more info
to the system user?

pci/controler_id/0000:03:00.0/61

>
>>>> "same control domain" means if it is controlled by a single host, not
>>>> by multi hosts, right?
>>>>
>>>> If the PF is not passed through to a vm using VFIO and other PF is still
>>>> in the host, then I think we can say it is controlled by a single host.
>>>>
>>>> And each PF is trusted with each other right now, at least at the driver
>>>> level, but not between VF.
>>>
>>> Right, the challenge AFAIU is how to match up multiple functions into
>>> a single devlink instance, when driver has to probe them one by one.
>>
>> Does it make sense if the PF first probed creates a auxiliary device,
>> and the auxiliary device driver creates the devlink instance? And
>> the PF probed later can connect/register to that devlink instance?
>
> I would say no, that just adds another layer of complication and
> doesn't link the functions in any way.

How about:
The PF probed first creates the devlink instance? A PF probed later can
connect/register to the devlink instance created by the first PF.
It seems some locking is needed to ensure the above happens as intended.

About linking: each PF provides the vendor info/serial number (or whatever
is unique between different vendors) of the controller it belongs to. If
the controller does not exist yet, create one and connect/register to that
devlink instance; otherwise just do the connecting/registering.
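
A rough sketch of the get-or-create scheme described above, written as
a userspace model rather than kernel code. The Registry and
DevlinkInstance classes and the probe() method are hypothetical names;
the (vendor, serial) key and the single lock are assumptions taken from
the description, not an existing devlink API.

```python
import threading


class DevlinkInstance:
    """Stand-in for one shared devlink instance per ASIC."""

    def __init__(self, key):
        self.key = key          # (vendor, serial) of the ASIC
        self.functions = []     # PCI names of the PFs attached so far


class Registry:
    """Maps (vendor, serial) to the instance created by the first PF."""

    def __init__(self):
        self._lock = threading.Lock()
        self._instances = {}

    def probe(self, vendor, serial, pci_name):
        key = (vendor, serial)
        # The lock makes lookup and creation one atomic step, so two PFs
        # probing concurrently cannot both create an instance.
        with self._lock:
            inst = self._instances.get(key)
            if inst is None:
                inst = DevlinkInstance(key)
                self._instances[key] = inst
            inst.functions.append(pci_name)
            return inst


registry = Registry()
first = registry.probe("vendor-a", "SN0001", "0000:03:00.0")
second = registry.probe("vendor-a", "SN0001", "0000:03:00.1")
other = registry.probe("vendor-a", "SN0002", "0000:04:00.0")
print(first is second, first is other)
```

The sketch leaves out removal: a real version would also need to decide
which PF tears the instance down when functions unbind in any order.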

2021-06-03 17:54:45

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On Thu, 3 Jun 2021 11:46:43 +0800 Yunsheng Lin wrote:
> >> can devlink port be used to indicate different PF in the same ASIC,
> >> which already has the bus identifiers in it? It seems we need a
> >> extra identifier to indicate the ASIC?
> >>
> >> $ devlink port show
> >> ...
> >> pci/0000:03:00.0/61: type eth netdev sw1p1s0 split_group 0
> >
> > Ports can obviously be used, but which PCI device will you use to
> > register the devlink instance? Perhaps using just one doesn't matter
> > if there is only one NIC in the system, but may be confusing with
> > multiple NICs, no?
>
> Yes, it is confusing, how about using the controler_id to indicate
> different NIC? we can make sure controler_id is unqiue in the same
> host, a controler_id corresponds to a devlink instance, vendor info
> or serial num for the devlink instance can further indicate more info
> to the system user?
>
> pci/controler_id/0000:03:00.0/61

What is a "controller id" in concrete terms? Another abstract ID which
may change on every boot?

> >> Does it make sense if the PF first probed creates a auxiliary device,
> >> and the auxiliary device driver creates the devlink instance? And
> >> the PF probed later can connect/register to that devlink instance?
> >
> > I would say no, that just adds another layer of complication and
> > doesn't link the functions in any way.
>
> How about:
> The PF first probed creates the devlink instance? PF probed later can
> connect/register to that devlink instance created by the PF first probed.
> It seems some locking need to ensure the above happens as intended too.
>
> About linking, the PF provide vendor info/serial number(or whatever is
> unqiue between different vendor) of a controller it belong to, if the
> controller does not exist yet, create one and connect/register to that
> devlink instance, otherwise just do the connecting/registering.

Sounds about right, but I don't understand why another ID is
necessary. Why not allow devlink instances to have multiple names,
like we allow aliases for netdevs these days?

2021-06-04 01:20:09

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/4 1:53, Jakub Kicinski wrote:
> On Thu, 3 Jun 2021 11:46:43 +0800 Yunsheng Lin wrote:
>>>> can devlink port be used to indicate different PF in the same ASIC,
>>>> which already has the bus identifiers in it? It seems we need a
>>>> extra identifier to indicate the ASIC?
>>>>
>>>> $ devlink port show
>>>> ...
>>>> pci/0000:03:00.0/61: type eth netdev sw1p1s0 split_group 0
>>>
>>> Ports can obviously be used, but which PCI device will you use to
>>> register the devlink instance? Perhaps using just one doesn't matter
>>> if there is only one NIC in the system, but may be confusing with
>>> multiple NICs, no?
>>
>> Yes, it is confusing, how about using the controler_id to indicate
>> different NIC? we can make sure controler_id is unqiue in the same
>> host, a controler_id corresponds to a devlink instance, vendor info
>> or serial num for the devlink instance can further indicate more info
>> to the system user?
>>
>> pci/controler_id/0000:03:00.0/61
>
> What is a "controller id" in concrete terms? Another abstract ID which
> may change on every boot?

My initial thinking is an ID from a global IDA pool, which indeed may
change on every boot.

I have not thought much more deeply about the controller id; I was just
mirroring the bus identifiers for PCIe devices and ifindex for netdevs,
which may change too if the device is plugged into a different PCI slot
on every boot?
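
To illustrate why such an ID is not stable, here is a small model of a
pool that hands out the smallest free integer, loosely in the spirit of
a kernel IDA. IdPool and assign_ids are hypothetical names and the
serial numbers are made up; the point is only that the ID depends on
probe order.

```python
class IdPool:
    """Hands out the smallest free non-negative integer."""

    def __init__(self):
        self._used = set()

    def alloc(self):
        candidate = 0
        while candidate in self._used:
            candidate += 1
        self._used.add(candidate)
        return candidate

    def free(self, ident):
        self._used.discard(ident)


def assign_ids(probe_order):
    """Give each ASIC (named by serial) an ID in the order it probes."""
    pool = IdPool()
    return {serial: pool.alloc() for serial in probe_order}


# Same two ASICs, different probe order on two boots.
print(assign_ids(["SN0001", "SN0002"]))
print(assign_ids(["SN0002", "SN0001"]))
```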

>
>>>> Does it make sense if the PF first probed creates a auxiliary device,
>>>> and the auxiliary device driver creates the devlink instance? And
>>>> the PF probed later can connect/register to that devlink instance?
>>>
>>> I would say no, that just adds another layer of complication and
>>> doesn't link the functions in any way.
>>
>> How about:
>> The PF first probed creates the devlink instance? PF probed later can
>> connect/register to that devlink instance created by the PF first probed.
>> It seems some locking need to ensure the above happens as intended too.
>>
>> About linking, the PF provide vendor info/serial number(or whatever is
>> unqiue between different vendor) of a controller it belong to, if the
>> controller does not exist yet, create one and connect/register to that
>> devlink instance, otherwise just do the connecting/registering.
>
> Sounds about right, but I don't understand why another ID is
> necessary. Why not allow devlink instances to have multiple names,
> like we allow aliases for netdevs these days?

We could still allow devlink instances to have multiple names,
which seems to be more of a devlink tool problem?

For example, the devlink tool could use the id or the vendor_info/
serial_number to indicate a devlink instance according to the user's
request.

An alias could be allowed too, as long as the devlink core provides a
field and ops to set/get the field, mirroring the ifalias for
netdevice?

2021-06-04 18:42:50

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On Fri, 4 Jun 2021 09:18:04 +0800 Yunsheng Lin wrote:
> >> Yes, it is confusing, how about using the controler_id to indicate
> >> different NIC? we can make sure controler_id is unqiue in the same
> >> host, a controler_id corresponds to a devlink instance, vendor info
> >> or serial num for the devlink instance can further indicate more info
> >> to the system user?
> >>
> >> pci/controler_id/0000:03:00.0/61
> >
> > What is a "controller id" in concrete terms? Another abstract ID which
> > may change on every boot?
>
> My initial thinking is a id from a global IDA pool, which indeed may
> change on every boot.
>
> I am not really thinking much deeper about the controller id, just
> mirroring the bus identifiers for pcie device and ifindex for netdev,

devlink instance id seems fine, but there's already a controller
concept in devlink so please steer clear of that naming.

> which may change too if the device is pluged into different pci slot
> on every boot?

Heh. What if someone reflashes the part to change its serial number? :)
pci slot is reasonably stable, as proven by years of experience trying
to find stable naming for netdevs.

> >> How about:
> >> The PF first probed creates the devlink instance? PF probed later can
> >> connect/register to that devlink instance created by the PF first probed.
> >> It seems some locking need to ensure the above happens as intended too.
> >>
> >> About linking, the PF provide vendor info/serial number(or whatever is
> >> unqiue between different vendor) of a controller it belong to, if the
> >> controller does not exist yet, create one and connect/register to that
> >> devlink instance, otherwise just do the connecting/registering.
> >
> > Sounds about right, but I don't understand why another ID is
> > necessary. Why not allow devlink instances to have multiple names,
> > like we allow aliases for netdevs these days?
>
> We could still allow devlink instances to have multiple names,
> which seems to be more like devlink tool problem?
>
> For example, devlink tool could use the id or the vendor_info/
> serial_number to indicate a devlink instance according to user's
> request.

Typing serial numbers seems pretty painful.

> Aliase could be allowed too as long as devlink core provide a
> field and ops to set/get the field mirroring the ifalias for
> netdevice?

I don't understand.

2021-06-07 01:39:03

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/5 2:41, Jakub Kicinski wrote:
> On Fri, 4 Jun 2021 09:18:04 +0800 Yunsheng Lin wrote:
>>>> Yes, it is confusing, how about using the controler_id to indicate
>>>> different NIC? we can make sure controler_id is unqiue in the same
>>>> host, a controler_id corresponds to a devlink instance, vendor info
>>>> or serial num for the devlink instance can further indicate more info
>>>> to the system user?
>>>>
>>>> pci/controler_id/0000:03:00.0/61
>>>
>>> What is a "controller id" in concrete terms? Another abstract ID which
>>> may change on every boot?
>>
>> My initial thinking is a id from a global IDA pool, which indeed may
>> change on every boot.
>>
>> I am not really thinking much deeper about the controller id, just
>> mirroring the bus identifiers for pcie device and ifindex for netdev,
>
> devlink instance id seems fine, but there's already a controller
> concept in devlink so please steer clear of that naming.
I am not sure if the existing controller concept is reusable for
the problem of representing a devlink instance for multiple functions
that share common resources in the same ASIC. If not, we do need to pick
another name.

Another thing I have not really thought through is how the VF is represented
by the devlink instance when the VF is passed through to a VM.
I was thinking the VF is represented as a devlink port, just like the PF (with
a different port flavour), and the VF devlink port only exists on the same host
as the PF (which assumes the PF is never passed through to a VM), so it may mean
the PF is responsible for creating the devlink port for the VF when the VF is
passed through to a VM?

Or do we need to create a devlink instance for the VF in the VM too when the
VF is passed through to a VM? More specifically, does the user need to query
or configure devlink info or configuration in a VM? If not, then a devlink
instance in the VM seems unnecessary?

>
>> which may change too if the device is pluged into different pci slot
>> on every boot?
>
> Heh. What is someone reflashes the part to change it's serial number? :)
> pci slot is reasonably stable, as proven by years of experience trying
> to find stable naming for netdevs.

I suppose that requires a reboot to take effect and a vendor tool
to reflash the serial number. It seems reasonable that the vendor/user will
try their best not to mess up the serial number, otherwise service and
maintenance based on the serial number will not work?
I was thinking about adding the vendor name alongside the serial number
to indicate a devlink instance, to avoid the case where two pieces of hw
from different vendors accidentally have the same serial number.

>
>>>> How about:
>>>> The PF first probed creates the devlink instance? PF probed later can
>>>> connect/register to that devlink instance created by the PF first probed.
>>>> It seems some locking need to ensure the above happens as intended too.
>>>>
>>>> About linking, the PF provide vendor info/serial number(or whatever is
>>>> unqiue between different vendor) of a controller it belong to, if the
>>>> controller does not exist yet, create one and connect/register to that
>>>> devlink instance, otherwise just do the connecting/registering.
>>>
>>> Sounds about right, but I don't understand why another ID is
>>> necessary. Why not allow devlink instances to have multiple names,
>>> like we allow aliases for netdevs these days?
>>
>> We could still allow devlink instances to have multiple names,
>> which seems to be more like devlink tool problem?
>>
>> For example, devlink tool could use the id or the vendor_info/
>> serial_number to indicate a devlink instance according to user's
>> request.
>
> Typing serial numbers seems pretty painful.
>
>> Aliase could be allowed too as long as devlink core provide a
>> field and ops to set/get the field mirroring the ifalias for
>> netdevice?
>
> I don't understand.

I meant we could still allow the user to provide a more meaningful
name to indicate a devlink instance besides the id.

2021-06-07 19:48:38

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On Mon, 7 Jun 2021 09:36:38 +0800 Yunsheng Lin wrote:
> On 2021/6/5 2:41, Jakub Kicinski wrote:
> > On Fri, 4 Jun 2021 09:18:04 +0800 Yunsheng Lin wrote:
> >> My initial thinking is a id from a global IDA pool, which indeed may
> >> change on every boot.
> >>
> >> I am not really thinking much deeper about the controller id, just
> >> mirroring the bus identifiers for pcie device and ifindex for netdev,
> >
> > devlink instance id seems fine, but there's already a controller
> > concept in devlink so please steer clear of that naming.
> I am not sure if controller concept already existed is reusable for
> the devlink instance representing problem for multi-function which
> shares common resource in the same ASIC. If not, we do need to pick
> up other name.
>
> Another thing I am not really think throught is how is the VF represented
> by the devlink instance when VF is passed through to a VM.
> I was thinking about VF is represented as devlink port, just like PF(with
> different port flavour), and VF devlink port only exist on the same host
> as PF(which assumes PF is never passed through to a VM), so it may means
> the PF is responsible for creating the devlink port for VF when VF is passed
> through to a VM?
>
> Or do we need to create a devlink instance for VF in the VM too when the
> VF is passed through to a VM? Or more specificly, does user need to query
> or configure devlink info or configuration in a VM? If not, then devlink
> instance in VM seems unnecessary?

I believe the current best practice is to create a devlink instance for
the VF with a devlink port of type "virtual". Such instance represents
a "virtualized" view of the device.

> >> which may change too if the device is pluged into different pci slot
> >> on every boot?
> >
> > Heh. What is someone reflashes the part to change it's serial number? :)
> > pci slot is reasonably stable, as proven by years of experience trying
> > to find stable naming for netdevs.
>
> I suppose that requires a booting to take effect and a vendor tool
> to reflash the serial number, it seems reasonable the vendor/user will
> try their best to not mess the serial number, otherwise service and
> maintenance based on serial number will not work?
> I was thinking about adding the vendor name besides the serial number
> to indicate a devlink instance, to avoid that case that two hw from
> different vendor having the same serial number accidentally.

I'm not opposed to the use of attributes such as serial number for
selecting instance, in principle. I was just trying to prove that PCI
slot/PCI device name is as stable as any other attribute.

In fact for mass-produced machines using PCI slot is far more
convenient than globally unique identifiers because it can be used
to talk to a specific device in a server for all machines of a given
model, hence easing automation.

> >> We could still allow devlink instances to have multiple names,
> >> which seems to be more like devlink tool problem?
> >>
> >> For example, devlink tool could use the id or the vendor_info/
> >> serial_number to indicate a devlink instance according to user's
> >> request.
> >
> > Typing serial numbers seems pretty painful.
> >
> >> Aliase could be allowed too as long as devlink core provide a
> >> field and ops to set/get the field mirroring the ifalias for
> >> netdevice?
> >
> > I don't understand.
>
> I meant we could still allow the user to provide a more meaningful
> name to indicate a devlink instance besides the id.

To clarify/summarize my statement above: serial number may be a useful
addition, but PCI device names should IMHO remain the primary
identifiers, even if it means devlink instances with multiple names.

In addition I don't think that user-controlled names/aliases are
necessarily a great idea for devlink.

2021-06-09 00:58:58

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/8 3:46, Jakub Kicinski wrote:
> On Mon, 7 Jun 2021 09:36:38 +0800 Yunsheng Lin wrote:
>> On 2021/6/5 2:41, Jakub Kicinski wrote:
>>> On Fri, 4 Jun 2021 09:18:04 +0800 Yunsheng Lin wrote:
>>>> My initial thinking is a id from a global IDA pool, which indeed may
>>>> change on every boot.
>>>>
>>>> I am not really thinking much deeper about the controller id, just
>>>> mirroring the bus identifiers for pcie device and ifindex for netdev,
>>>
>>> devlink instance id seems fine, but there's already a controller
>>> concept in devlink so please steer clear of that naming.
>> I am not sure if controller concept already existed is reusable for
>> the devlink instance representing problem for multi-function which
>> shares common resource in the same ASIC. If not, we do need to pick
>> up other name.
>>
>> Another thing I am not really think throught is how is the VF represented
>> by the devlink instance when VF is passed through to a VM.
>> I was thinking about VF is represented as devlink port, just like PF(with
>> different port flavour), and VF devlink port only exist on the same host
>> as PF(which assumes PF is never passed through to a VM), so it may means
>> the PF is responsible for creating the devlink port for VF when VF is passed
>> through to a VM?
>>
>> Or do we need to create a devlink instance for VF in the VM too when the
>> VF is passed through to a VM? Or more specificly, does user need to query
>> or configure devlink info or configuration in a VM? If not, then devlink
>> instance in VM seems unnecessary?
>
> I believe the current best practice is to create a devlink instance for
> the VF with a devlink port of type "virtual". Such instance represents
> a "virtualized" view of the device.

After discussion with Parav in the other thread, I understood it was the current
practice, but I am not sure I understand why it is the current *best* practice.

If we allow all PFs of an ASIC to register to the same devlink instance, does
it not make sense that all VFs under one PF also register to the same devlink
instance that their PF registers to when they are in the same host?

For eswitch legacy mode, whether the VF and PF are in the same host or not, the VF
can also provide the serial number of an ASIC to register to the devlink instance;
if that devlink instance does not exist yet, just create it
according to the serial number, just like the PF does.

For eswitch DEVLINK_ESWITCH_MODE_SWITCHDEV mode, the flavour of the devlink
port instance representing the netdev of the VF function is FLAVOUR_VIRTUAL, and
the flavour of the devlink port instance representing the representor netdev of
the VF is FLAVOUR_PCI_VF. These are different flavours, so they can register to
the same devlink instance even when both devlink port instances are in the same host?

Is there any reason why a VF uses its own devlink instance?

>
>>>> which may change too if the device is pluged into different pci slot
>>>> on every boot?
>>>
>>> Heh. What is someone reflashes the part to change it's serial number? :)
>>> pci slot is reasonably stable, as proven by years of experience trying
>>> to find stable naming for netdevs.
>>
>> I suppose that requires a booting to take effect and a vendor tool
>> to reflash the serial number, it seems reasonable the vendor/user will
>> try their best to not mess the serial number, otherwise service and
>> maintenance based on serial number will not work?
>> I was thinking about adding the vendor name besides the serial number
>> to indicate a devlink instance, to avoid that case that two hw from
>> different vendor having the same serial number accidentally.
>
> I'm not opposed to the use of attributes such as serial number for
> selecting instance, in principle. I was just trying to prove that PCI
> slot/PCI device name is as stable as any other attribute.
>
> In fact for mass-produced machines using PCI slot is far more
> convenient than globally unique identifiers because it can be used
> to talk to a specific device in a server for all machines of a given
> model, hence easing automation.

Makes sense.

>
>>>> We could still allow devlink instances to have multiple names,
>>>> which seems to be more like devlink tool problem?
>>>>
>>>> For example, devlink tool could use the id or the vendor_info/
>>>> serial_number to indicate a devlink instance according to user's
>>>> request.
>>>
>>> Typing serial numbers seems pretty painful.
>>>
>>>> Aliase could be allowed too as long as devlink core provide a
>>>> field and ops to set/get the field mirroring the ifalias for
>>>> netdevice?
>>>
>>> I don't understand.
>>
>> I meant we could still allow the user to provide a more meaningful
>> name to indicate a devlink instance besides the id.
>
> To clarify/summarize my statement above serial number may be a useful
> addition but PCI device names should IMHO remain the primary
> identifiers, even if it means devlink instances with multiple names.

I am not sure I understand what is meant by "devlink instances with
multiple names"?

Does that mean whenever a devlink port instance is registered to a devlink
instance, that devlink instance gets a new name according to the PCI device
which the just-registered devlink port instance corresponds to?

>
> In addition I don't think that user-controlled names/aliases are
> necessarily a great idea for devlink.

2021-06-09 08:52:28

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On Tue, 8 Jun 2021 20:10:37 +0800 Yunsheng Lin wrote:
> >> I am not sure if controller concept already existed is reusable for
> >> the devlink instance representing problem for multi-function which
> >> shares common resource in the same ASIC. If not, we do need to pick
> >> up other name.
> >>
> >> Another thing I am not really think throught is how is the VF represented
> >> by the devlink instance when VF is passed through to a VM.
> >> I was thinking about VF is represented as devlink port, just like PF(with
> >> different port flavour), and VF devlink port only exist on the same host
> >> as PF(which assumes PF is never passed through to a VM), so it may means
> >> the PF is responsible for creating the devlink port for VF when VF is passed
> >> through to a VM?
> >>
> >> Or do we need to create a devlink instance for VF in the VM too when the
> >> VF is passed through to a VM? Or more specificly, does user need to query
> >> or configure devlink info or configuration in a VM? If not, then devlink
> >> instance in VM seems unnecessary?
> >
> > I believe the current best practice is to create a devlink instance for
> > the VF with a devlink port of type "virtual". Such instance represents
> > a "virtualized" view of the device.
>
> Afer discussion with Parav in other thread, I undersood it was the current
> practice, but I am not sure I understand why it is current *best* practice.
>
> If we allow all PF of a ASCI to register to the same devlink instance, does
> it not make sense that all VF under one PF also register to the same devlink
> instance that it's PF is registering to when they are in the same host?
>
> For eswitch legacy mode, whether VF and PF are the same host or not, the VF
> can also provide the serial number of a ASIC to register to the devlink instance,
> if that devlink instance does not exist yet, just create that devlink instance
> according to the serial number, just like PF does.
>
> For eswitch DEVLINK_ESWITCH_MODE_SWITCHDEV mode, the flavour type for devlink
> port instance representing the netdev of VF function is FLAVOUR_VIRTUAL, the
> flavour type for devlink port instance representing the representor netdev of
> VF is FLAVOUR_PCI_VF, which are different type, so they can register to the same
> devlink instance even when both of the devlink port instance is in the same host?
>
> Is there any reason why VF use its own devlink instance?

The primary use case for VFs is virtual environments where the guest isn't
trusted, so tying the VF to the main devlink instance, over which the guest
should have no control, is counterproductive.

> >> I meant we could still allow the user to provide a more meaningful
> >> name to indicate a devlink instance besides the id.
> >
> > To clarify/summarize my statement above serial number may be a useful
> > addition but PCI device names should IMHO remain the primary
> > identifiers, even if it means devlink instances with multiple names.
>
> I am not sure I understand what does it mean by "devlink instances with
> multiple names"?
>
> Does that mean whenever a devlink port instance is registered to a devlink
> instance, that devlink instance get a new name according to the PCI device
> which the just registered devlink port instance corresponds to?

Not devlink port, new PCI device. Multiple ports may reside on the same
PCI function, some ports don't have a function (e.g. Ethernet ports).
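
A minimal model of that distinction, assuming an alias table keyed by
PCI device name: the instance gains a name when a PCI function joins,
while adding ports leaves the set of names untouched. Instance,
AliasTable, add_function and add_port are hypothetical names for this
sketch and do not exist in devlink.

```python
class Instance:
    """One devlink instance that may answer to several PCI names."""

    def __init__(self):
        self.names = []   # PCI device names this instance answers to
        self.ports = []   # (port_index, pci_name or None)


class AliasTable:
    """Resolves any registered PCI device name to its instance."""

    def __init__(self):
        self._by_name = {}

    def add_function(self, inst, pci_name):
        if pci_name in self._by_name:
            raise ValueError(f"{pci_name} is already registered")
        self._by_name[pci_name] = inst
        inst.names.append(pci_name)

    def lookup(self, pci_name):
        return self._by_name[pci_name]


def add_port(inst, port_index, pci_name=None):
    # Ports never add names; pci_name is None for ports without a
    # function, such as a plain Ethernet port.
    inst.ports.append((port_index, pci_name))


table = AliasTable()
nic = Instance()
table.add_function(nic, "pci/0000:03:00.0")
table.add_function(nic, "pci/0000:03:00.1")
add_port(nic, 0)                       # Ethernet port, no function
add_port(nic, 1, "pci/0000:03:00.0")   # two ports on the same function
add_port(nic, 2, "pci/0000:03:00.0")
print(nic.names)
```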

2021-06-09 14:06:29

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/9 1:29, Jakub Kicinski wrote:
> On Tue, 8 Jun 2021 20:10:37 +0800 Yunsheng Lin wrote:
>>>> I am not sure if controller concept already existed is reusable for
>>>> the devlink instance representing problem for multi-function which
>>>> shares common resource in the same ASIC. If not, we do need to pick
>>>> up other name.
>>>>
>>>> Another thing I am not really think throught is how is the VF represented
>>>> by the devlink instance when VF is passed through to a VM.
>>>> I was thinking about VF is represented as devlink port, just like PF(with
>>>> different port flavour), and VF devlink port only exist on the same host
>>>> as PF(which assumes PF is never passed through to a VM), so it may means
>>>> the PF is responsible for creating the devlink port for VF when VF is passed
>>>> through to a VM?
>>>>
>>>> Or do we need to create a devlink instance for VF in the VM too when the
>>>> VF is passed through to a VM? Or more specificly, does user need to query
>>>> or configure devlink info or configuration in a VM? If not, then devlink
>>>> instance in VM seems unnecessary?
>>>
>>> I believe the current best practice is to create a devlink instance for
>>> the VF with a devlink port of type "virtual". Such instance represents
>>> a "virtualized" view of the device.
>>
>> Afer discussion with Parav in other thread, I undersood it was the current
>> practice, but I am not sure I understand why it is current *best* practice.
>>
>> If we allow all PF of a ASCI to register to the same devlink instance, does
>> it not make sense that all VF under one PF also register to the same devlink
>> instance that it's PF is registering to when they are in the same host?
>>
>> For eswitch legacy mode, whether VF and PF are the same host or not, the VF
>> can also provide the serial number of a ASIC to register to the devlink instance,
>> if that devlink instance does not exist yet, just create that devlink instance
>> according to the serial number, just like PF does.
>>
>> For eswitch DEVLINK_ESWITCH_MODE_SWITCHDEV mode, the flavour type for devlink
>> port instance representing the netdev of VF function is FLAVOUR_VIRTUAL, the
>> flavour type for devlink port instance representing the representor netdev of
>> VF is FLAVOUR_PCI_VF, which are different type, so they can register to the same
>> devlink instance even when both of the devlink port instance is in the same host?
>>
>> Is there any reason why VF use its own devlink instance?
>
> Primary use case for VFs is virtual environments where guest isn't
> trusted, so tying the VF to the main devlink instance, over which guest
> should have no control is counter productive.

The security concern is mainly about the VF being used in a container,
right? When a VF is used in a VM, the VM is a different host, which means
a different devlink instance for the VF, so there is no security issue in
the VM case? But that might not hold for a VF used in a container?

Also, I read the devlink discussion between you and Jiri in [1]:
"I think we agree that all objects of an ASIC should be under one
devlink instance, the question remains whether both ends of the pipe
for PCI devices (subdevs or not) should appear under ports or does the
"far end" (from ASICs perspective)/"host end" get its own category."

I am not sure if there is already any conclusion about the latter part
(I did not find the conclusion in that thread)?

"far end" (from ASICs perspective)/"host end" means PF/VF, right?
Which seems to correspond to port flavor of FLAVOUR_PHYSICAL and
FLAVOUR_VIRTUAL if we try to represent PF/VF using devlink port
instance?

It seems the conclusion is very important to our discussion in this
thread, as we are trying to represent PF/VF as devlink port instances
here (at least that is what I think; hns3 does not support eswitch
SWITCHDEV mode yet).

Also, there is a "switch_id" concept in Jiri's example, which does not
seem to be implemented yet?
pci/0000:05:00.0/10000: type eth netdev enp5s0npf0s0 flavour pci_pf pf 0 subport 0 switch_id 00154d130d2f

1. https://lore.kernel.org/netdev/[email protected]/t/

>
>>>> I meant we could still allow the user to provide a more meaningful
>>>> name to indicate a devlink instance besides the id.
>>>
>>> To clarify/summarize my statement above serial number may be a useful
>>> addition but PCI device names should IMHO remain the primary
>>> identifiers, even if it means devlink instances with multiple names.
>>
>> I am not sure I understand what does it mean by "devlink instances with
>> multiple names"?
>>
>> Does that mean whenever a devlink port instance is registered to a devlink
>> instance, that devlink instance get a new name according to the PCI device
>> which the just registered devlink port instance corresponds to?
>
> Not devlink port, new PCI device. Multiple ports may reside on the same
> PCI function, some ports don't have a function (e.g. Ethernet ports).

Multiple ports on the same PCI function mainly means subfunctions from
mlx, right?

“some ports don't have a function (e.g. Ethernet ports)” does not seem
to exist yet? For now, a devlink port instance of FLAVOUR_PHYSICAL
represents both the PF and the Ethernet port?

>
> .
>

2021-06-09 14:09:48

by Parav Pandit

[permalink] [raw]
Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension


> From: Yunsheng Lin <[email protected]>
> Sent: Wednesday, June 9, 2021 2:46 PM
>
[..]

> >> Is there any reason why VF use its own devlink instance?
> >
> > Primary use case for VFs is virtual environments where guest isn't
> > trusted, so tying the VF to the main devlink instance, over which
> > guest should have no control is counter productive.
>
> The security is mainly about VF using in container case, right?
> Because VF using in VM, it is different host, it means a different devlink
> instance for VF, so there is no security issue for VF using in VM case?
> But it might not be the case for VF using in container?
A devlink instance has a net namespace attached to it, which is controlled using the devlink reload command.
So a VF devlink instance can be assigned to a container/process running in a specific net namespace.

$ ip netns add n1
$ devlink dev reload pci/0000:06:00.4 netns n1
                     ^^^^^^^^^^^^^^^^
                     PCI VF/PF/SF.

> Also, there is a "switch_id" concept from jiri's example, which seems to be
> not implemented yet?

switch_id is present for switch ports in [1] and documented in [2].

[1] /sys/class/net/representor_netdev/phys_switch_id.
[2] https://www.kernel.org/doc/Documentation/networking/switchdev.txt " Switch ID"
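
As a rough illustration of the sysfs attribute in [1], the sketch below
prints the switch ID of every netdev that reports one. The SYSFS_NET
variable and the print_switch_ids name are made up for this sketch, and
it assumes that reading phys_switch_id simply fails on netdevs that have
no switch ID.

```shell
#!/bin/sh
# SYSFS_NET is an assumption of this sketch so that the path can be
# overridden; on a real system it would be /sys/class/net.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# print_switch_ids: print "<netdev> <switch_id>" for each netdev that
# reports a switch ID.
print_switch_ids() {
    for dir in "$SYSFS_NET"/*; do
        [ -e "$dir/phys_switch_id" ] || continue
        # The read is expected to fail on netdevs without a switch ID;
        # skip those quietly.
        id="$(cat "$dir/phys_switch_id" 2>/dev/null)" || continue
        [ -n "$id" ] || continue
        echo "${dir##*/} $id"
    done
}

print_switch_ids
```

Netdevs that print the same ID would then be ports of the same switch.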

2021-06-09 14:16:03

by Parav Pandit

[permalink] [raw]
Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension


> From: Yunsheng Lin <[email protected]>
> Sent: Tuesday, June 8, 2021 5:41 PM

>
> Is there any reason why VF use its own devlink instance?
Because a devlink instance gives the VF and SF the ability to control themselves:
(a) device parameters (devlink dev param show)
(b) resources of the device
(c) health reporters
(d) reload in net ns

These knobs, (a) to (c) etc., are not for the hypervisor to control. They are mainly for VF/SF users to manage their own device.

2021-06-09 14:35:51

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/9 17:38, Parav Pandit wrote:
>
>> From: Yunsheng Lin <[email protected]>
>> Sent: Wednesday, June 9, 2021 2:46 PM
>>
> [..]
>
>>>> Is there any reason why VF use its own devlink instance?
>>>
>>> Primary use case for VFs is virtual environments where guest isn't
>>> trusted, so tying the VF to the main devlink instance, over which
>>> guest should have no control is counter productive.
>>
>> The security is mainly about VF using in container case, right?
>> Because VF using in VM, it is different host, it means a different devlink
>> instance for VF, so there is no security issue for VF using in VM case?
>> But it might not be the case for VF using in container?
> Devlink instance has net namespace attached to it controlled using devlink reload command.
> So a VF devlink instance can be assigned to a container/process running in a specific net namespace.
>
> $ ip netns add n1
> $ devlink dev reload pci/0000:06:00.4 netns n1
> ^^^^^^^^^^^^^
> PCI VF/PF/SF.

Could we create another devlink instance when the net namespace of a
devlink port instance is changed? It seems we would need to change the
net namespace per devlink port instance instead of per devlink instance.
That way the container case would be similar to the VM case?

>
>> Also, there is a "switch_id" concept from jiri's example, which seems to be
>> not implemented yet?
>
> switch_id is present for switch ports in [1] and documented in [2].
>
> [1] /sys/class/net/representor_netdev/phys_switch_id.
> [2] https://www.kernel.org/doc/Documentation/networking/switchdev.txt " Switch ID"

Thanks for the info.
I suppose we could use "switch_id" to identify an eswitch, since
"switch_id is present for switch ports"?
Where does the "switch_id" of a switch port come from? Is it from FW,
or does the driver generate it?

Is there any rule for "switch_id"? Or is it vendor specific?

>

2021-06-09 15:00:56

by Parav Pandit

[permalink] [raw]
Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> From: Yunsheng Lin <[email protected]>
> Sent: Wednesday, June 9, 2021 4:35 PM
>
> On 2021/6/9 17:38, Parav Pandit wrote:
> >
> >> From: Yunsheng Lin <[email protected]>
> >> Sent: Wednesday, June 9, 2021 2:46 PM
> >>
> > [..]
> >
> >>>> Is there any reason why VF use its own devlink instance?
> >>>
> >>> Primary use case for VFs is virtual environments where guest isn't
> >>> trusted, so tying the VF to the main devlink instance, over which
> >>> guest should have no control is counter productive.
> >>
> >> The security is mainly about VF using in container case, right?
> >> Because VF using in VM, it is different host, it means a different
> >> devlink instance for VF, so there is no security issue for VF using in VM
> case?
> >> But it might not be the case for VF using in container?
> > Devlink instance has net namespace attached to it controlled using devlink
> reload command.
> > So a VF devlink instance can be assigned to a container/process running in a
> specific net namespace.
> >
> > $ ip netns add n1
> > $ devlink dev reload pci/0000:06:00.4 netns n1
> > ^^^^^^^^^^^^^
> > PCI VF/PF/SF.
>
> Could we create another devlink instance when the net namespace of
> devlink port instance is changed?
The net namespace of (a) a netdevice, (b) an rdma device and (c) a devlink instance can be changed.
The net namespace of a devlink port cannot be changed.
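
A minimal sketch of what that looks like per object type. It only prints
the commands rather than running them; netns_move_cmd and the device
names eth2 and mlx5_2 are placeholders invented for this sketch.

```shell
#!/bin/sh
# Sketch: print the command that moves each kind of object into a netns.
# Nothing is executed, so no root or hardware is needed.

# netns_move_cmd <kind> <object> <netns>
netns_move_cmd() {
    case "$1" in
        netdev)
            echo "ip link set dev $2 netns $3" ;;
        rdma)
            # rdma devices can only be moved when the rdma subsystem is in
            # exclusive netns mode (rdma system set netns exclusive).
            echo "rdma dev set $2 netns $3" ;;
        devlink)
            echo "devlink dev reload $2 netns $3" ;;
        port)
            echo "unsupported: the netns of a devlink port cannot be changed" >&2
            return 1 ;;
        *)
            echo "unknown kind: $1" >&2
            return 1 ;;
    esac
}

netns_move_cmd netdev eth2 n1
netns_move_cmd rdma mlx5_2 n1
netns_move_cmd devlink pci/0000:06:00.4 n1
```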

> It may seems we need to change the net
> namespace based on devlink port instance instead of devlink instance.
> This way container case seems be similiar to the VM case?
I do not really understand the topology you have in mind; if you explained it previously, I missed that thread.
In your case, what is the flavour of the devlink port?

>
> >
> >> Also, there is a "switch_id" concept from jiri's example, which seems
> >> to be not implemented yet?
> >
> > switch_id is present for switch ports in [1] and documented in [2].
> >
> > [1] /sys/class/net/representor_netdev/phys_switch_id.
> > [2]
> https://www.kernel.org/doc/Documentation/networking/switchdev.txt "
> Switch ID"
>
> Thanks for info.
> I suppose we could use "switch_id" to indentify a eswitch since "switch_id is
> present for switch ports"?
> Where does the "switch_id" of switch port come from? Is it from FW?
> Or the driver generated it?
>
> Is there any rule for "switch_id"? Or is it vendor specific?
>
> >
It should be unique enough; it is usually generated from the board serial ID or other fields, such as the vendor OUI, that make it fairly unique.


2021-06-09 15:25:07

by Parav Pandit

[permalink] [raw]
Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> From: Yunsheng Lin <[email protected]>
> Sent: Wednesday, June 9, 2021 6:00 PM
>
> On 2021/6/9 19:59, Parav Pandit wrote:
> >> From: Yunsheng Lin <[email protected]>
> >> Sent: Wednesday, June 9, 2021 4:35 PM
> >>
> >> On 2021/6/9 17:38, Parav Pandit wrote:
> >>>
> >>>> From: Yunsheng Lin <[email protected]>
> >>>> Sent: Wednesday, June 9, 2021 2:46 PM
> >>>>
> >>> [..]
> >>>
> >>>>>> Is there any reason why VF use its own devlink instance?
> >>>>>
> >>>>> Primary use case for VFs is virtual environments where guest isn't
> >>>>> trusted, so tying the VF to the main devlink instance, over which
> >>>>> guest should have no control is counter productive.
> >>>>
> >>>> The security is mainly about VF using in container case, right?
> >>>> Because VF using in VM, it is different host, it means a different
> >>>> devlink instance for VF, so there is no security issue for VF using
> >>>> in VM
> >> case?
> >>>> But it might not be the case for VF using in container?
> >>> Devlink instance has net namespace attached to it controlled using
> >>> devlink
> >> reload command.
> >>> So a VF devlink instance can be assigned to a container/process
> >>> running in a
> >> specific net namespace.
> >>>
> >>> $ ip netns add n1
> >>> $ devlink dev reload pci/0000:06:00.4 netns n1
> >>> ^^^^^^^^^^^^^
> >>> PCI VF/PF/SF.
> >>
> >> Could we create another devlink instance when the net namespace of
> >> devlink port instance is changed?
> > Net namespace of (a) netdevice (b) rdma device (c) devlink instance can be
> changed.
> > Net namespace of devlink port cannot be changed.
>
> Yes, net namespace is changed based on the devlink instance, not devlink
> port instance, *right now*.
>
> >
> >> It may seems we need to change the net namespace based on devlink
> >> port instance instead of devlink instance.
> >> This way container case seems be similiar to the VM case?
> > I mostly do not understand the topology you have in mind or if you
> explained previously I missed the thread.
> > In your case what is the flavour of a devlink port?
>
> flavour of the devlink port instance is FLAVOUR_PHYSICAL or
> FLAVOUR_VIRTUAL.
>
> The reason I suggest to change the net namespace on devlink port instance
> instead of devlink instance is:
> I proposed that all the PF and VF in the same ASIC are registered to the same
> devlink instance as flavour FLAVOUR_PHYSICAL or FLAVOUR_VIRTUAL when
> there are in the same host and in the same net namespace.
>
> If a VF's devlink port instance is unregistered from old devlink instance in the
> old net namespace and registered to new devlink instance in the new net
> namespace(create a new devlink instance if
> needed) when devlink port instance's net namespace is changed, then the
> security mentioned by jakub is not a issue any more?

It seems that a devlink instance for the VF is not needed in your case, and if so, what is the motivation to even have a VIRTUAL port attached to the PF?
If only the netdevice of the VF is of interest, it can be assigned to a net namespace directly.

It doesn’t make sense to me to create a new devlink instance in the new net namespace; it would also need to be deleted when the net ns is deleted.
And the pre_exit() routine will most likely deadlock while holding the global devlink_mutex.

2021-06-09 16:43:57

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On Wed, 9 Jun 2021 17:16:06 +0800 Yunsheng Lin wrote:
> On 2021/6/9 1:29, Jakub Kicinski wrote:
> > On Tue, 8 Jun 2021 20:10:37 +0800 Yunsheng Lin wrote:
> >> Afer discussion with Parav in other thread, I undersood it was the current
> >> practice, but I am not sure I understand why it is current *best* practice.
> >>
> >> If we allow all PF of a ASCI to register to the same devlink instance, does
> >> it not make sense that all VF under one PF also register to the same devlink
> >> instance that it's PF is registering to when they are in the same host?
> >>
> >> For eswitch legacy mode, whether VF and PF are the same host or not, the VF
> >> can also provide the serial number of a ASIC to register to the devlink instance,
> >> if that devlink instance does not exist yet, just create that devlink instance
> >> according to the serial number, just like PF does.
> >>
> >> For eswitch DEVLINK_ESWITCH_MODE_SWITCHDEV mode, the flavour type for devlink
> >> port instance representing the netdev of VF function is FLAVOUR_VIRTUAL, the
> >> flavour type for devlink port instance representing the representor netdev of
> >> VF is FLAVOUR_PCI_VF, which are different type, so they can register to the same
> >> devlink instance even when both of the devlink port instance is in the same host?
> >>
> >> Is there any reason why VF use its own devlink instance?
> >
> > Primary use case for VFs is virtual environments where guest isn't
> > trusted, so tying the VF to the main devlink instance, over which guest
> > should have no control is counter productive.
>
> The security is mainly about VF using in container case, right?
> Because VF using in VM, it is different host, it means a different devlink
> instance for VF, so there is no security issue for VF using in VM case?
> But it might not be the case for VF using in container?

How do you differentiate from the device perspective VF being assigned
to the host vs VM? Presumably PFs and VFs have a similar API to talk to
the FW, if VF can "join" the devlink instance of the PF that'd suggest
to me it has access to privileged FW commands.

> Also I read about the devlink disscusion betwwen you and jiri in [1]:
> "I think we agree that all objects of an ASIC should be under one
> devlink instance, the question remains whether both ends of the pipe
> for PCI devices (subdevs or not) should appear under ports or does the
> "far end" (from ASICs perspective)/"host end" get its own category."
>
> I am not sure if there is already any conclusion about the latter part
> (I did not find the conclusion in that thread)?
>
> "far end" (from ASICs perspective)/"host end" means PF/VF, right?
> Which seems to correspond to port flavor of FLAVOUR_PHYSICAL and
> FLAVOUR_VIRTUAL if we try to represent PF/VF using devlink port
> instance?

No, no, PHYSICAL is a physical port on the adapter, like an SFP port.
There wasn't any conclusion to that discussion. Mellanox views devlink
ports as eswitch ports, I view them as device ports which is hard to
reconcile.

> It seems the conclusion is very important to our disscusion in this
> thread, as we are trying to represent PF/VF as devlink port instance
> in this thread(at least that is what I think, hns3 does not support
> eswitch SWITCHDEV mode yet).
>
> Also, there is a "switch_id" concept from jiri's example, which seems
> to be not implemented yet?
> pci/0000:05:00.0/10000: type eth netdev enp5s0npf0s0 flavour pci_pf pf 0 subport 0 switch_id 00154d130d2f
>
> 1. https://lore.kernel.org/netdev/[email protected]/t/
>
> >> I am not sure I understand what does it mean by "devlink instances with
> >> multiple names"?
> >>
> >> Does that mean whenever a devlink port instance is registered to a devlink
> >> instance, that devlink instance get a new name according to the PCI device
> >> which the just registered devlink port instance corresponds to?
> >
> > Not devlink port, new PCI device. Multiple ports may reside on the same
> > PCI function, some ports don't have a function (e.g. Ethernet ports).
>
> Multiple ports on the same mainly PCI function means subfunction from mlx,
> right?

Not necessarily, there are older devices out there (older NFPs, mlx4)
which have one PF which is logically divided by the driver to service
multiple ports.

> “some ports don't have a function (e.g. Ethernet ports)” does not seem
> exist yet? For now devlink port instance of FLAVOUR_PHYSICAL represents
> both PF and Ethernet ports?

It does. I think Mellanox cards are incapable of divorcing PFs from
Ethernet ports, but the NFP driver represents the Ethernet port/SFP
as one netdev and devlink port (PHYSICAL) and the host port by another
netdev and devlink port (PCI_PF). Which allows forwarding frames between
PFs and between Ethernet ports directly (again, something not supported
efficiently by simpler cards, but supported by NFPs).

2021-06-09 17:20:13

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/9 17:52, Parav Pandit wrote:
>
>> From: Yunsheng Lin <[email protected]>
>> Sent: Tuesday, June 8, 2021 5:41 PM
>
>>
>> Is there any reason why VF use its own devlink instance?
> Because devlink instance gives the ability for the VF and SF to control itself.
> (a) device parameters (devlink dev param show)
> (b) resources of the device
> (c) health reporters
> (d) reload in net ns
>
> There knobs (a) to (c) etc are not for the hypervisor to control. These are mainly for the VF/SF users to manage its own device.

Do we need to prevent the user from changing the net ns in a container?

>

2021-06-09 17:27:03

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/9 19:59, Parav Pandit wrote:
>> From: Yunsheng Lin <[email protected]>
>> Sent: Wednesday, June 9, 2021 4:35 PM
>>
>> On 2021/6/9 17:38, Parav Pandit wrote:
>>>
>>>> From: Yunsheng Lin <[email protected]>
>>>> Sent: Wednesday, June 9, 2021 2:46 PM
>>>>
>>> [..]
>>>
>>>>>> Is there any reason why VF use its own devlink instance?
>>>>>
>>>>> Primary use case for VFs is virtual environments where guest isn't
>>>>> trusted, so tying the VF to the main devlink instance, over which
>>>>> guest should have no control is counter productive.
>>>>
>>>> The security is mainly about VF using in container case, right?
>>>> Because VF using in VM, it is different host, it means a different
>>>> devlink instance for VF, so there is no security issue for VF using in VM
>> case?
>>>> But it might not be the case for VF using in container?
>>> Devlink instance has net namespace attached to it controlled using devlink
>> reload command.
>>> So a VF devlink instance can be assigned to a container/process running in a
>> specific net namespace.
>>>
>>> $ ip netns add n1
>>> $ devlink dev reload pci/0000:06:00.4 netns n1
>>> ^^^^^^^^^^^^^
>>> PCI VF/PF/SF.
>>
>> Could we create another devlink instance when the net namespace of
>> devlink port instance is changed?
> Net namespace of (a) netdevice (b) rdma device (c) devlink instance can be changed.
> Net namespace of devlink port cannot be changed.

Yes, the net namespace is changed per devlink instance, not per
devlink port instance, *right now*.

>
>> It may seems we need to change the net
>> namespace based on devlink port instance instead of devlink instance.
>> This way container case seems be similiar to the VM case?
> I mostly do not understand the topology you have in mind or if you explained previously I missed the thread.
> In your case what is the flavour of a devlink port?

The flavour of the devlink port instance is FLAVOUR_PHYSICAL or
FLAVOUR_VIRTUAL.

The reason I suggest changing the net namespace per devlink port
instance instead of per devlink instance is:
I proposed that all the PFs and VFs in the same ASIC are registered to
the same devlink instance, with flavour FLAVOUR_PHYSICAL or
FLAVOUR_VIRTUAL, when they are on the same host and in the same net
namespace.

If, when a VF devlink port instance's net namespace is changed, that
port instance is unregistered from the old devlink instance in the old
net namespace and registered to a new devlink instance in the new net
namespace (creating the new devlink instance if needed), then the
security concern mentioned by Jakub is no longer an issue?

>
>>
>>>
>>>> Also, there is a "switch_id" concept from jiri's example, which seems
>>>> to be not implemented yet?
>>>
>>> switch_id is present for switch ports in [1] and documented in [2].
>>>
>>> [1] /sys/class/net/representor_netdev/phys_switch_id.
>>> [2]
>> https://www.kernel.org/doc/Documentation/networking/switchdev.txt "
>> Switch ID"
>>
>> Thanks for info.
>> I suppose we could use "switch_id" to indentify a eswitch since "switch_id is
>> present for switch ports"?
>> Where does the "switch_id" of switch port come from? Is it from FW?
>> Or the driver generated it?
>>
>> Is there any rule for "switch_id"? Or is it vendor specific?
>>
>>>
> It should be unique enough, usually generated out of board serial id or other fields such as vendor OUI that makes it fairly unique.
>
>

2021-06-09 18:43:44

by Parav Pandit

[permalink] [raw]
Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> From: Yunsheng Lin <[email protected]>
> Sent: Wednesday, June 9, 2021 4:47 PM
>
> On 2021/6/9 17:52, Parav Pandit wrote:
> >
> >> From: Yunsheng Lin <[email protected]>
> >> Sent: Tuesday, June 8, 2021 5:41 PM
> >
> >>
> >> Is there any reason why VF use its own devlink instance?
> > Because devlink instance gives the ability for the VF and SF to control itself.
> > (a) device parameters (devlink dev param show)
> > (b) resources of the device
> > (c) health reporters
> > (d) reload in net ns
> >
> > There knobs (a) to (c) etc are not for the hypervisor to control. These are
> mainly for the VF/SF users to manage its own device.
>
> Do we need to disable user from changing the net ns in a container?
It is not the role of the hw/vendor driver to disable it.
Process capabilities such as NET_ADMIN etc take care of it.
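
For illustration only, a sketch of checking whether the current process
holds the NET_ADMIN capability mentioned above. It reads the effective
capability mask from /proc/self/status; has_net_admin is a helper name
made up here, and which devlink commands are actually gated on this
capability is not something the sketch shows.

```shell
#!/bin/sh
# CAP_NET_ADMIN is capability number 12 in linux/capability.h.
CAP_NET_ADMIN_BIT=12

# has_net_admin <hex mask>: succeed if the CAP_NET_ADMIN bit is set in
# the given capability mask (hex digits without a 0x prefix).
has_net_admin() {
    [ $(( (0x$1 >> CAP_NET_ADMIN_BIT) & 1 )) -eq 1 ]
}

# Check the effective capability mask of this process, if /proc exists.
if [ -r /proc/self/status ]; then
    mask="$(awk '/^CapEff:/ { print $2 }' /proc/self/status)"
    if [ -n "$mask" ] && has_net_admin "$mask"; then
        echo "CAP_NET_ADMIN present"
    else
        echo "CAP_NET_ADMIN absent"
    fi
fi
```

A container runtime that drops this capability would make the check
report "absent" for processes inside the container.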

2021-06-10 06:54:53

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/10 0:40, Jakub Kicinski wrote:
> On Wed, 9 Jun 2021 17:16:06 +0800 Yunsheng Lin wrote:
>> On 2021/6/9 1:29, Jakub Kicinski wrote:
>>> On Tue, 8 Jun 2021 20:10:37 +0800 Yunsheng Lin wrote:
>>>> Afer discussion with Parav in other thread, I undersood it was the current
>>>> practice, but I am not sure I understand why it is current *best* practice.
>>>>
>>>> If we allow all PF of a ASCI to register to the same devlink instance, does
>>>> it not make sense that all VF under one PF also register to the same devlink
>>>> instance that it's PF is registering to when they are in the same host?
>>>>
>>>> For eswitch legacy mode, whether VF and PF are the same host or not, the VF
>>>> can also provide the serial number of a ASIC to register to the devlink instance,
>>>> if that devlink instance does not exist yet, just create that devlink instance
>>>> according to the serial number, just like PF does.
>>>>
>>>> For eswitch DEVLINK_ESWITCH_MODE_SWITCHDEV mode, the flavour type for devlink
>>>> port instance representing the netdev of VF function is FLAVOUR_VIRTUAL, the
>>>> flavour type for devlink port instance representing the representor netdev of
>>>> VF is FLAVOUR_PCI_VF, which are different type, so they can register to the same
>>>> devlink instance even when both of the devlink port instance is in the same host?
>>>>
>>>> Is there any reason why VF use its own devlink instance?
>>>
>>> Primary use case for VFs is virtual environments where guest isn't
>>> trusted, so tying the VF to the main devlink instance, over which guest
>>> should have no control is counter productive.
>>
>> The security is mainly about VF using in container case, right?
>> Because VF using in VM, it is different host, it means a different devlink
>> instance for VF, so there is no security issue for VF using in VM case?
>> But it might not be the case for VF using in container?
>
> How do you differentiate from the device perspective VF being assigned
> to the host vs VM? Presumably PFs and VFs have a similar API to talk to
> the FW, if VF can "join" the devlink instance of the PF that'd suggest
> to me it has access to privileged FW commands.

I was thinking that info/param/health specific to a VF is only registered
to the devlink port instance of that VF, and the same for resources
specific to a PF. It seems params can already be registered either per
devlink instance (devlink_params_register()) or per devlink port instance
(devlink_port_params_register()).

Only a PF will register the privileged common resources on the devlink
instance. We may need to ensure that only one PF registers them (maybe
the PF probed first does the registering; I am not sure yet how to
ensure or implement that).

When the user accesses a common resource in the devlink instance, I think
it is ok to pass the access through one of the PFs (supposing all PFs are
at the same privilege level)?

When the user accesses a resource in a devlink port instance of flavour
PHYSICAL/VIRTUAL, the access goes through the specific function (PF/VF)
that corresponds to that devlink port instance?

When the user accesses a resource in a devlink port instance of flavour
PCI_PF/PCI_VF/PCI_SF, the access goes through the function where the
eswitch is located?

So if a devlink instance only has devlink port instances of VFs, that
devlink instance has no privileged common resource registered, and the
user is not able to access the privileged common resources?

If the PF and VF are on the same host and in the same net namespace, I
suppose it is ok for the PF and VF to share the same devlink instance
with the privileged common resources registered?

>
>> Also I read about the devlink disscusion betwwen you and jiri in [1]:
>> "I think we agree that all objects of an ASIC should be under one
>> devlink instance, the question remains whether both ends of the pipe
>> for PCI devices (subdevs or not) should appear under ports or does the
>> "far end" (from ASICs perspective)/"host end" get its own category."
>>
>> I am not sure if there is already any conclusion about the latter part
>> (I did not find the conclusion in that thread)?
>>
>> "far end" (from ASICs perspective)/"host end" means PF/VF, right?
>> Which seems to correspond to port flavor of FLAVOUR_PHYSICAL and
>> FLAVOUR_VIRTUAL if we try to represent PF/VF using devlink port
>> instance?
>
> No, no, PHYSICAL is a physical port on the adapter, like an SFP port.
> There wasn't any conclusion to that discussion. Mellanox views devlink
> ports as eswitch ports, I view them as device ports which is hard to
> reconcile.

I suppose eswitch ports only exist when DEVLINK_ESWITCH_MODE_SWITCHDEV
mode is enabled, right? Does "Mellanox views devlink ports as eswitch
ports" mean the mlx driver will not create any devlink port instance when
DEVLINK_ESWITCH_MODE_LEGACY mode is enabled?
That does not seem to be the case any more, because the PF is registered
as a devlink port instance of FLAVOUR_PHYSICAL and the VF is registered
as a devlink port instance of FLAVOUR_VIRTUAL in
mlx5e_devlink_port_register(), unless mlx5e_devlink_port_register() is
only called in SWITCHDEV mode too.

From the discussion with Parav in the other thread [1], it seems:
1. Whenever there is a PCIe function (PF/VF, maybe SF too?), there is a
devlink instance corresponding to that PCIe function.
2. Whenever there is a netdev (netdev of a PF/VF, or a representor
netdev), there is a devlink port instance corresponding to that netdev.

It seems we only need to change (1) to enable "all objects of an ASIC
should be under one devlink instance", as below:
Whenever there is an ASIC (or switch), there is a devlink instance
corresponding to that ASIC (or switch)?

I am not sure I understand what "device ports" means. A netdev? A
"physical port on the adapter, like an SFP port"? Or a "pcie function
like PF/VF"? Let's suppose it is in MODE_LEGACY mode.

1. https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/#24231633

>
>> It seems the conclusion is very important to our disscusion in this
>> thread, as we are trying to represent PF/VF as devlink port instance
>> in this thread(at least that is what I think, hns3 does not support
>> eswitch SWITCHDEV mode yet).
>>
>> Also, there is a "switch_id" concept from jiri's example, which seems
>> to be not implemented yet?
>> pci/0000:05:00.0/10000: type eth netdev enp5s0npf0s0 flavour pci_pf pf 0 subport 0 switch_id 00154d130d2f
>>
>> 1. https://lore.kernel.org/netdev/[email protected]/t/
>>
>>>> I am not sure I understand what does it mean by "devlink instances with
>>>> multiple names"?
>>>>
>>>> Does that mean whenever a devlink port instance is registered to a devlink
>>>> instance, that devlink instance get a new name according to the PCI device
>>>> which the just registered devlink port instance corresponds to?
>>>
>>> Not devlink port, new PCI device. Multiple ports may reside on the same
>>> PCI function, some ports don't have a function (e.g. Ethernet ports).
>>
>> Multiple ports on the same mainly PCI function means subfunction from mlx,
>> right?
>
> Not necessarily, there are older devices out there (older NFPs, mlx4)
> which have one PF which is logically divided by the driver to service
> multiple ports.
>
>> “some ports don't have a function (e.g. Ethernet ports)” does not seem
>> exist yet? For now devlink port instance of FLAVOUR_PHYSICAL represents
>> both PF and Ethernet ports?
>
> It does. I think Mellanox cards are incapable of divorcing PFs from
> Ethernet ports, but the NFP driver represents the Ethernet port/SFP
> as one netdev and devlink port (PHYSICAL) and the host port by another
> netdev and devlink port (PCI_PF). Which allows forwarding frames between
> PFs and between Ethernet ports directly (again, something not supported
> efficiently by simpler cards, but supported by NFPs).

If "Whenever there is a netdev(netdev of PF/VF, or representor netdev),
there is a devlink port instance corresponds to that netdev." rule apply
to the above case, as there is one netdev for PF and one netdev for Ethernet
port, then we have two devlink port instance too, one for netdev of PF, one
for the netdev of Ethernet port, which is different from Mellanox having one
netdev for both PF and Ethernet port,hence one devlink port for both PF and
Ethernet port.

It seems we need to clarify whether FLAVOUR_PHYSICAL and FLAVOUR_PCI_PF
have different semantics between NFP and Mellanox?

We might need to add another flavour type to indicate the netdev of the PF,
if FLAVOUR_PHYSICAL indicates the netdev of the Ethernet port (if that
netdev exists) and FLAVOUR_PCI_PF indicates the representor netdev of the
PF, as in the comments in the definition of the flavour types:

	DEVLINK_PORT_FLAVOUR_PHYSICAL, /* Any kind of a port physically
					* facing the user.
					*/

	DEVLINK_PORT_FLAVOUR_PCI_PF, /* Represents eswitch port for
				      * the PCI PF. It is an internal
				      * port that faces the PCI PF.
				      */
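
To make the two layouts being compared concrete, here is a minimal
user-space sketch. It only models what is described in this thread: NFP
with one netdev/devlink port for the Ethernet port (PHYSICAL) and another
for the host port (PCI_PF), and Mellanox with a single netdev and devlink
port covering both. The enum, struct, netdev names and helper are all
hypothetical and are not kernel definitions.

```c
#include <stddef.h>

/* Local model of the two flavours quoted above; not the kernel enum. */
enum model_flavour {
	MODEL_FLAVOUR_PHYSICAL,	/* port physically facing the user */
	MODEL_FLAVOUR_PCI_PF,	/* eswitch port facing the PCI PF */
};

struct model_port {
	const char *netdev;		/* hypothetical netdev name */
	enum model_flavour flavour;
};

/* NFP layout as described in this thread: two netdevs, two ports. */
static const struct model_port nfp_ports[] = {
	{ "eth_port0", MODEL_FLAVOUR_PHYSICAL },
	{ "host_pf0",  MODEL_FLAVOUR_PCI_PF },
};

/* Mellanox layout as described in this thread: one netdev, one port. */
static const struct model_port mlx_ports[] = {
	{ "pf0", MODEL_FLAVOUR_PHYSICAL },
};

/* Count the ports in a layout that carry the given flavour. */
static size_t count_flavour(const struct model_port *ports, size_t n,
			    enum model_flavour flavour)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (ports[i].flavour == flavour)
			count++;
	return count;
}
```

In this model the NFP layout has one port of each flavour, while the
Mellanox layout has no PCI_PF port at all, which is the ambiguity the
question about flavour semantics is pointing at.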

>
> .
>

2021-06-10 07:06:55

by Yunsheng Lin

[permalink] [raw]
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

On 2021/6/9 21:45, Parav Pandit wrote:
>> From: Yunsheng Lin <[email protected]>
>> Sent: Wednesday, June 9, 2021 6:00 PM
>>
>> On 2021/6/9 19:59, Parav Pandit wrote:
>>>> From: Yunsheng Lin <[email protected]>
>>>> Sent: Wednesday, June 9, 2021 4:35 PM
>>>>
>>>> On 2021/6/9 17:38, Parav Pandit wrote:
>>>>>
>>>>>> From: Yunsheng Lin <[email protected]>
>>>>>> Sent: Wednesday, June 9, 2021 2:46 PM
>>>>>>
>>>>> [..]
>>>>>
>>>>>>>> Is there any reason why VF use its own devlink instance?
>>>>>>>
>>>>>>> Primary use case for VFs is virtual environments where guest isn't
>>>>>>> trusted, so tying the VF to the main devlink instance, over which
>>>>>>> guest should have no control is counter productive.
>>>>>>
>>>>>> The security is mainly about VF using in container case, right?
>>>>>> Because VF using in VM, it is different host, it means a different
>>>>>> devlink instance for VF, so there is no security issue for VF using
>>>>>> in VM
>>>> case?
>>>>>> But it might not be the case for VF using in container?
>>>>> Devlink instance has net namespace attached to it controlled using
>>>>> devlink
>>>> reload command.
>>>>> So a VF devlink instance can be assigned to a container/process
>>>>> running in a
>>>> specific net namespace.
>>>>>
>>>>> $ ip netns add n1
>>>>> $ devlink dev reload pci/0000:06:00.4 netns n1
>>>>> ^^^^^^^^^^^^^
>>>>> PCI VF/PF/SF.
>>>>
>>>> Could we create another devlink instance when the net namespace of
>>>> devlink port instance is changed?
>>> Net namespace of (a) netdevice (b) rdma device (c) devlink instance can be
>> changed.
>>> Net namespace of devlink port cannot be changed.
>>
>> Yes, net namespace is changed based on the devlink instance, not devlink
>> port instance, *right now*.
>>
>>>
>>>> It may seems we need to change the net namespace based on devlink
>>>> port instance instead of devlink instance.
>>>> This way container case seems be similiar to the VM case?
>>> I mostly do not understand the topology you have in mind or if you
>> explained previously I missed the thread.
>>> In your case what is the flavour of a devlink port?
>>
>> flavour of the devlink port instance is FLAVOUR_PHYSICAL or
>> FLAVOUR_VIRTUAL.
>>
>> The reason I suggest to change the net namespace on devlink port instance
>> instead of devlink instance is:
>> I proposed that all the PF and VF in the same ASIC are registered to the same
>> devlink instance as flavour FLAVOUR_PHYSICAL or FLAVOUR_VIRTUAL when
>> there are in the same host and in the same net namespace.
>>
>> If a VF's devlink port instance is unregistered from old devlink instance in the
>> old net namespace and registered to new devlink instance in the new net
>> namespace(create a new devlink instance if
>> needed) when devlink port instance's net namespace is changed, then the
>> security mentioned by jakub is not a issue any more?
>
> It seems that devlink instance of VF is not needed in your case, and if so what is the motivation to even have VIRTUAL port attach to the PF?

The devlink instance is mainly used to hold the devlink port instance
of the VF if there is only one VF in the VM; we might still need to have
param/health specific to the VF registered to the devlink port
instance of that VF.

> If only netdevice of the VF is of interest, it can be assigned to net namespace directly.

I think that is another option, if there is nothing in the devlink port
instance specific to the VF that needs exposing to the user in another net
namespace.

>
> It doesn’t make sense to me to create new devlink instance in new net namespace, that also needs to be deleted when net ns is deleted.
> And pre_exit() routine will mostly deadlock holding global devlink_mutex.

Would you be more specific about why there is a deadlock?
It seems more of an implementation detail, which we can discuss later,
once we have agreed this is the right way to go?

>

2021-06-10 07:18:39

by Parav Pandit

[permalink] [raw]
Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension



> From: Yunsheng Lin <[email protected]>
> Sent: Thursday, June 10, 2021 12:34 PM
>
> On 2021/6/9 21:45, Parav Pandit wrote:
> >> From: Yunsheng Lin <[email protected]>
> >> Sent: Wednesday, June 9, 2021 6:00 PM
> >>
> >> On 2021/6/9 19:59, Parav Pandit wrote:
> >>>> From: Yunsheng Lin <[email protected]>
> >>>> Sent: Wednesday, June 9, 2021 4:35 PM
> >>>>
> >>>> On 2021/6/9 17:38, Parav Pandit wrote:
> >>>>>
> >>>>>> From: Yunsheng Lin <[email protected]>
> >>>>>> Sent: Wednesday, June 9, 2021 2:46 PM
> >>>>>>
> >>>>> [..]
> >>>>>
> >>>>>>>> Is there any reason why VF use its own devlink instance?
> >>>>>>>
> >>>>>>> Primary use case for VFs is virtual environments where guest
> >>>>>>> isn't trusted, so tying the VF to the main devlink instance,
> >>>>>>> over which guest should have no control is counter productive.
> >>>>>>
> >>>>>> The security is mainly about VF using in container case, right?
> >>>>>> Because VF using in VM, it is different host, it means a
> >>>>>> different devlink instance for VF, so there is no security issue
> >>>>>> for VF using in VM
> >>>> case?
> >>>>>> But it might not be the case for VF using in container?
> >>>>> Devlink instance has net namespace attached to it controlled using
> >>>>> devlink
> >>>> reload command.
> >>>>> So a VF devlink instance can be assigned to a container/process
> >>>>> running in a
> >>>> specific net namespace.
> >>>>>
> >>>>> $ ip netns add n1
> >>>>> $ devlink dev reload pci/0000:06:00.4 netns n1
> >>>>> ^^^^^^^^^^^^^
> >>>>> PCI VF/PF/SF.
> >>>>
> >>>> Could we create another devlink instance when the net namespace of
> >>>> devlink port instance is changed?
> >>> Net namespace of (a) netdevice (b) rdma device (c) devlink instance
> >>> can be
> >> changed.
> >>> Net namespace of devlink port cannot be changed.
> >>
> >> Yes, net namespace is changed based on the devlink instance, not
> >> devlink port instance, *right now*.
> >>
> >>>
> >>>> It may seems we need to change the net namespace based on devlink
> >>>> port instance instead of devlink instance.
> >>>> This way container case seems be similiar to the VM case?
> >>> I mostly do not understand the topology you have in mind or if you
> >> explained previously I missed the thread.
> >>> In your case what is the flavour of a devlink port?
> >>
> >> flavour of the devlink port instance is FLAVOUR_PHYSICAL or
> >> FLAVOUR_VIRTUAL.
> >>
> >> The reason I suggest to change the net namespace on devlink port
> >> instance instead of devlink instance is:
> >> I proposed that all the PF and VF in the same ASIC are registered to
> >> the same devlink instance as flavour FLAVOUR_PHYSICAL or
> >> FLAVOUR_VIRTUAL when there are in the same host and in the same net
> namespace.
> >>
> >> If a VF's devlink port instance is unregistered from old devlink
> >> instance in the old net namespace and registered to new devlink
> >> instance in the new net namespace(create a new devlink instance if
> >> needed) when devlink port instance's net namespace is changed, then
> >> the security mentioned by jakub is not a issue any more?
> >
> > It seems that devlink instance of VF is not needed in your case, and if so
> what is the motivation to even have VIRTUAL port attach to the PF?
>
> The devlink instance is mainly used to hold the devlink port instance of VF if
> there is only one VF in vm, we might still need to have param/health specific
> to the VF to registered to the devlink port instance of that VF.
>
This will cover things uniformly with/without container or VM.

> > If only netdevice of the VF is of interest, it can be assigned to net
> namespace directly.
>
> I think that is another option, if there is nothing in the devlink port instance
> specific to VF that need exposing to the user in another net namespace.
>
Yes. No need for devlink instance or devlink port.

> >
> > It doesn’t make sense to me to create new devlink instance in new net
> namespace, that also needs to be deleted when net ns is deleted.
> > And pre_exit() routine will mostly deadlock holding global devlink_mutex.
>
> Would you be more specific why there is deadlock?
The net namespace exit routine cannot invoke a devlink API that needs to acquire the global devlink mutex.
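
A toy user-space sketch of that constraint, under the assumption (not
stated above) that the namespace exit handler already runs with the global
mutex held. Everything here is hypothetical: model_mutex stands in for a
non-recursive mutex and reports -EDEADLK at the point where a real
mutex_lock() would simply hang; none of these functions are kernel APIs.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a non-recursive global mutex; not the kernel's mutex. */
struct model_mutex {
	bool held;
};

/* Returns 0 on success, -EDEADLK where a real mutex_lock() would hang. */
static int model_lock(struct model_mutex *m)
{
	if (m->held)
		return -EDEADLK;
	m->held = true;
	return 0;
}

static void model_unlock(struct model_mutex *m)
{
	m->held = false;
}

/* Hypothetical devlink API that needs the global mutex. */
static int model_devlink_api(struct model_mutex *global)
{
	int err = model_lock(global);

	if (err)
		return err;
	/* ... would modify the global instance list here ... */
	model_unlock(global);
	return 0;
}

/* Hypothetical exit path, assumed to hold the global mutex throughout. */
static int model_netns_pre_exit(struct model_mutex *global)
{
	int err = model_lock(global);

	if (err)
		return err;
	err = model_devlink_api(global);	/* second acquisition fails */
	model_unlock(global);
	return err;
}
```

Calling model_devlink_api() on its own succeeds, while calling it from
model_netns_pre_exit() returns -EDEADLK, because the same context tries to
take the mutex twice.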