2022-10-15 01:45:33

by Si-Wei Liu

[permalink] [raw]
Subject: [PATCH 0/4] vDPA: dev config export via "vdpa dev show" command

Live migration of vdpa would typically require re-instating the vdpa
device with an identical set of configs on the destination node, the
same way the source node created the device in the first place.

In order to allow live migration orchestration software to export the
initial set of vdpa attributes with which the device was created, it
will be useful if the vdpa tool can report the config on demand with a
simple query. This will ease the orchestration software implementation
so that it doesn't have to keep track of vdpa config changes, or have
to persist vdpa attributes across failure and recovery, for fear of
being killed by an accidental software error.

In this series, the initial device config used for vdpa creation will be
exported via the "vdpa dev show" command. This is unlike the "vdpa
dev config show" command, which usually reports the live values in
the device config space; those are not reliable, subject to the dynamics
of feature negotiation and possible changes to the device config space.

Examples:

1) Create vDPA by default without any config attribute

$ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
$ vdpa dev show vdpa0
vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
$ vdpa dev -jp show vdpa0
{
"dev": {
"vdpa0": {
"type": "network",
"mgmtdev": "pci/0000:41:04.2",
"vendor_id": 5555,
"max_vqs": 9,
"max_vq_size": 256,
}
}
}

2) Create vDPA with config attribute(s) specified

$ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
mac e4:11:c6:d3:45:f0 max_vq_pairs 4
$ vdpa dev show
vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
mac e4:11:c6:d3:45:f0 max_vq_pairs 4
$ vdpa dev -jp show
{
"dev": {
"vdpa0": {
"type": "network",
"mgmtdev": "pci/0000:41:04.2",
"vendor_id": 5555,
"max_vqs": 9,
"max_vq_size": 256,
"mac": "e4:11:c6:d3:45:f0",
"max_vq_pairs": 4
}
}
}

---

Si-Wei Liu (4):
vdpa: save vdpa_dev_set_config in struct vdpa_device
vdpa: pass initial config to _vdpa_register_device()
vdpa: show dev config as-is in "vdpa dev show" output
vdpa: fix improper error message when adding vdpa dev

drivers/vdpa/ifcvf/ifcvf_main.c | 2 +-
drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
drivers/vdpa/vdpa.c | 63 +++++++++++++++++++++++++++++++++---
drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 2 +-
drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
drivers/vdpa/virtio_pci/vp_vdpa.c | 3 +-
include/linux/vdpa.h | 26 ++++++++-------
8 files changed, 80 insertions(+), 22 deletions(-)

--
1.8.3.1


2022-10-15 01:45:35

by Si-Wei Liu

[permalink] [raw]
Subject: [PATCH 4/4] vdpa: fix improper error message when adding vdpa dev

In the example below, before the fix, the mtu attribute is supported
by the parent mgmtdev, yet the error message saying "All
provided attributes are not supported" is misleading.

$ vdpa mgmtdev show
vdpasim_net:
supported_classes net
max_supported_vqs 3
dev_features MTU MAC CTRL_VQ CTRL_MAC_ADDR ANY_LAYOUT VERSION_1 ACCESS_PLATFORM

$ vdpa dev add mgmtdev vdpasim_net name vdpasim0 mtu 5000 max_vqp 2
Error: vdpa: All provided attributes are not supported.
kernel answers: Operation not supported

After the fix, the relevant error messages look like:

$ vdpa dev add mgmtdev vdpasim_net name vdpasim0 mtu 5000 max_vqp 2
Error: vdpa: Some provided attributes are not supported.
kernel answers: Operation not supported

$ vdpa dev add mgmtdev vdpasim_net name vdpasim0 max_vqp 2
Error: vdpa: All provided attributes are not supported.
kernel answers: Operation not supported

Signed-off-by: Si-Wei Liu <[email protected]>
---
drivers/vdpa/vdpa.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index 91eca6d..ff15e0a 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -629,13 +629,20 @@ static int vdpa_nl_cmd_dev_add_set_doit(struct sk_buff *skb, struct genl_info *i
err = PTR_ERR(mdev);
goto err;
}
- if ((config.mask & mdev->config_attr_mask) != config.mask) {
+ if (config.mask && (config.mask & mdev->config_attr_mask) == 0) {
NL_SET_ERR_MSG_MOD(info->extack,
"All provided attributes are not supported");
err = -EOPNOTSUPP;
goto err;
}

+ if ((config.mask & mdev->config_attr_mask) != config.mask) {
+ NL_SET_ERR_MSG_MOD(info->extack,
+ "Some provided attributes are not supported");
+ err = -EOPNOTSUPP;
+ goto err;
+ }
+
err = mdev->ops->dev_add(mdev, name, &config);
err:
up_write(&vdpa_dev_lock);
--
1.8.3.1

2022-10-15 01:45:51

by Si-Wei Liu

[permalink] [raw]
Subject: [PATCH 3/4] vdpa: show dev config as-is in "vdpa dev show" output

Live migration of vdpa would typically require re-instating the vdpa
device with an identical set of configs on the destination node, the
same way the source node created the device in the first
place. To save orchestration software from memorizing and keeping
track of the vdpa config, it will be helpful if the vdpa tool
provides a means of exporting, as-is, the initial configs from
which the vdpa device was created. The "vdpa dev show" command
seems to be the right vehicle for that. It is unlike the "vdpa dev
config show" command, whose output usually reflects the live values
in the device config space; those are not quite reliable, subject
to the dynamics of feature negotiation and possible changes to the
device config space.

Examples:

1) Create vDPA by default without any config attribute

$ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
$ vdpa dev show vdpa0
vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
$ vdpa dev -jp show vdpa0
{
"dev": {
"vdpa0": {
"type": "network",
"mgmtdev": "pci/0000:41:04.2",
"vendor_id": 5555,
"max_vqs": 9,
"max_vq_size": 256,
}
}
}

2) Create vDPA with config attribute(s) specified

$ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
mac e4:11:c6:d3:45:f0 max_vq_pairs 4
$ vdpa dev show
vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
mac e4:11:c6:d3:45:f0 max_vq_pairs 4
$ vdpa dev -jp show
{
"dev": {
"vdpa0": {
"type": "network",
"mgmtdev": "pci/0000:41:04.2",
"vendor_id": 5555,
"max_vqs": 9,
"max_vq_size": 256,
"mac": "e4:11:c6:d3:45:f0",
"max_vq_pairs": 4
}
}
}

Signed-off-by: Si-Wei Liu <[email protected]>
---
drivers/vdpa/vdpa.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index 566c1c6..91eca6d 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -677,6 +677,41 @@ static int vdpa_nl_cmd_dev_del_set_doit(struct sk_buff *skb, struct genl_info *i
}

static int
+vdpa_dev_cfgattrs_fill(struct vdpa_device *vdev, struct sk_buff *msg, u32 device_id)
+{
+ struct vdpa_dev_set_config *cfg = &vdev->vdev_cfg;
+ int err = -EMSGSIZE;
+
+ if (!cfg->mask)
+ return 0;
+
+ switch (device_id) {
+ case VIRTIO_ID_NET:
+ if ((cfg->mask & BIT_ULL(VDPA_ATTR_DEV_NET_CFG_MACADDR)) != 0 &&
+ nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
+ sizeof(cfg->net.mac), cfg->net.mac))
+ return err;
+ if ((cfg->mask & BIT_ULL(VDPA_ATTR_DEV_NET_CFG_MTU)) != 0 &&
+ nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, cfg->net.mtu))
+ return err;
+ if ((cfg->mask & BIT_ULL(VDPA_ATTR_DEV_NET_CFG_MAX_VQP)) != 0 &&
+ nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP,
+ cfg->net.max_vq_pairs))
+ return err;
+ break;
+ default:
+ break;
+ }
+
+ if ((cfg->mask & BIT_ULL(VDPA_ATTR_DEV_FEATURES)) != 0 &&
+ nla_put_u64_64bit(msg, VDPA_ATTR_DEV_FEATURES,
+ cfg->device_features, VDPA_ATTR_PAD))
+ return err;
+
+ return 0;
+}
+
+static int
vdpa_dev_fill(struct vdpa_device *vdev, struct sk_buff *msg, u32 portid, u32 seq,
int flags, struct netlink_ext_ack *extack)
{
@@ -715,6 +750,10 @@ static int vdpa_nl_cmd_dev_del_set_doit(struct sk_buff *skb, struct genl_info *i
if (nla_put_u16(msg, VDPA_ATTR_DEV_MIN_VQ_SIZE, min_vq_size))
goto msg_err;

+ err = vdpa_dev_cfgattrs_fill(vdev, msg, device_id);
+ if (err)
+ goto msg_err;
+
genlmsg_end(msg, hdr);
return 0;

--
1.8.3.1

2022-10-15 01:45:55

by Si-Wei Liu

[permalink] [raw]
Subject: [PATCH 2/4] vdpa: pass initial config to _vdpa_register_device()

Just as _vdpa_register_device() takes @nvqs as the number of queues
to feed userspace inquiry via vdpa_dev_fill(), we can follow the
same pattern to stash the config attributes in struct vdpa_device at
the time of vdpa registration.

Signed-off-by: Si-Wei Liu <[email protected]>
---
drivers/vdpa/ifcvf/ifcvf_main.c | 2 +-
drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
drivers/vdpa/vdpa.c | 15 +++++++++++----
drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 2 +-
drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
drivers/vdpa/virtio_pci/vp_vdpa.c | 3 ++-
include/linux/vdpa.h | 3 ++-
8 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index f9c0044..c54ab2c 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -771,7 +771,7 @@ static int ifcvf_vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
else
ret = dev_set_name(&vdpa_dev->dev, "vdpa%u", vdpa_dev->index);

- ret = _vdpa_register_device(&adapter->vdpa, vf->nr_vring);
+ ret = _vdpa_register_device(&adapter->vdpa, vf->nr_vring, config);
if (ret) {
put_device(&adapter->vdpa.dev);
IFCVF_ERR(pdev, "Failed to register to vDPA bus");
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 9091336..376082e 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -3206,7 +3206,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name,
mlx5_notifier_register(mdev, &ndev->nb);
ndev->nb_registered = true;
mvdev->vdev.mdev = &mgtdev->mgtdev;
- err = _vdpa_register_device(&mvdev->vdev, max_vqs + 1);
+ err = _vdpa_register_device(&mvdev->vdev, max_vqs + 1, add_config);
if (err)
goto err_reg;

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index febdc99..566c1c6 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -215,11 +215,16 @@ static int vdpa_name_match(struct device *dev, const void *data)
return (strcmp(dev_name(&vdev->dev), data) == 0);
}

-static int __vdpa_register_device(struct vdpa_device *vdev, u32 nvqs)
+static int __vdpa_register_device(struct vdpa_device *vdev, u32 nvqs,
+ const struct vdpa_dev_set_config *cfg)
{
struct device *dev;

vdev->nvqs = nvqs;
+ if (cfg)
+ vdev->vdev_cfg = *cfg;
+ else
+ vdev->vdev_cfg.mask = 0ULL;

lockdep_assert_held(&vdpa_dev_lock);
dev = bus_find_device(&vdpa_bus, NULL, dev_name(&vdev->dev), vdpa_name_match);
@@ -237,15 +242,17 @@ static int __vdpa_register_device(struct vdpa_device *vdev, u32 nvqs)
* callback after setting up valid mgmtdev for this vdpa device.
* @vdev: the vdpa device to be registered to vDPA bus
* @nvqs: number of virtqueues supported by this device
+ * @cfg: initial config on vdpa device creation
*
* Return: Returns an error when fail to add device to vDPA bus
*/
-int _vdpa_register_device(struct vdpa_device *vdev, u32 nvqs)
+int _vdpa_register_device(struct vdpa_device *vdev, u32 nvqs,
+ const struct vdpa_dev_set_config *cfg)
{
if (!vdev->mdev)
return -EINVAL;

- return __vdpa_register_device(vdev, nvqs);
+ return __vdpa_register_device(vdev, nvqs, cfg);
}
EXPORT_SYMBOL_GPL(_vdpa_register_device);

@@ -262,7 +269,7 @@ int vdpa_register_device(struct vdpa_device *vdev, u32 nvqs)
int err;

down_write(&vdpa_dev_lock);
- err = __vdpa_register_device(vdev, nvqs);
+ err = __vdpa_register_device(vdev, nvqs, NULL);
up_write(&vdpa_dev_lock);
return err;
}
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
index c6db1a1..5e1cebc 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
@@ -387,7 +387,7 @@ static int vdpasim_blk_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
if (IS_ERR(simdev))
return PTR_ERR(simdev);

- ret = _vdpa_register_device(&simdev->vdpa, VDPASIM_BLK_VQ_NUM);
+ ret = _vdpa_register_device(&simdev->vdpa, VDPASIM_BLK_VQ_NUM, config);
if (ret)
goto put_dev;

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim_net.c b/drivers/vdpa/vdpa_sim/vdpa_sim_net.c
index c3cb225..06ef5a0 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim_net.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim_net.c
@@ -260,7 +260,7 @@ static int vdpasim_net_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,

vdpasim_net_setup_config(simdev, config);

- ret = _vdpa_register_device(&simdev->vdpa, VDPASIM_NET_VQ_NUM);
+ ret = _vdpa_register_device(&simdev->vdpa, VDPASIM_NET_VQ_NUM, config);
if (ret)
goto reg_err;

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 35dceee..6530fd2 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1713,7 +1713,7 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name,
if (ret)
return ret;

- ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
+ ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num, config);
if (ret) {
put_device(&dev->vdev->vdpa.dev);
return ret;
diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c b/drivers/vdpa/virtio_pci/vp_vdpa.c
index d448db0..ffdc90e 100644
--- a/drivers/vdpa/virtio_pci/vp_vdpa.c
+++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
@@ -538,7 +538,8 @@ static int vp_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name,
vp_vdpa->config_irq = VIRTIO_MSI_NO_VECTOR;

vp_vdpa->vdpa.mdev = &vp_vdpa_mgtdev->mgtdev;
- ret = _vdpa_register_device(&vp_vdpa->vdpa, vp_vdpa->queues);
+ ret = _vdpa_register_device(&vp_vdpa->vdpa, vp_vdpa->queues,
+ add_config);
if (ret) {
dev_err(&pdev->dev, "Failed to register to vdpa bus\n");
goto err;
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index f1838f5..b9d50e8 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -381,7 +381,8 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
int vdpa_register_device(struct vdpa_device *vdev, u32 nvqs);
void vdpa_unregister_device(struct vdpa_device *vdev);

-int _vdpa_register_device(struct vdpa_device *vdev, u32 nvqs);
+int _vdpa_register_device(struct vdpa_device *vdev, u32 nvqs,
+ const struct vdpa_dev_set_config *cfg);
void _vdpa_unregister_device(struct vdpa_device *vdev);

/**
--
1.8.3.1

2022-10-15 02:09:08

by Si-Wei Liu

[permalink] [raw]
Subject: [PATCH 1/4] vdpa: save vdpa_dev_set_config in struct vdpa_device

In order to allow live migration orchestration software to export the
initial set of vdpa attributes with which the device was created, it
will be useful if the vdpa tool can report the config on demand with a
simple query. This will ease the orchestration software implementation
so that it doesn't have to keep track of vdpa config changes, or have
to persist vdpa attributes across failure and recovery, for fear of
being killed by an accidental software error.

This commit makes struct vdpa_device contain a struct
vdpa_dev_set_config, into which all config attributes supplied at
vdpa creation are carried over. This will be used in subsequent
commits.

Signed-off-by: Si-Wei Liu <[email protected]>
---
include/linux/vdpa.h | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 6d0f5e4..f1838f5 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -58,6 +58,16 @@ struct vdpa_vq_state {
};
};

+struct vdpa_dev_set_config {
+ u64 device_features;
+ struct {
+ u8 mac[ETH_ALEN];
+ u16 mtu;
+ u16 max_vq_pairs;
+ } net;
+ u64 mask;
+};
+
struct vdpa_mgmt_dev;

/**
@@ -77,6 +87,8 @@ struct vdpa_vq_state {
* @nvqs: maximum number of supported virtqueues
* @mdev: management device pointer; caller must setup when registering device as part
* of dev_add() mgmtdev ops callback before invoking _vdpa_register_device().
+ * @vdev_cfg: initial device config on vdpa creation; useful when instantiation
+ * with the exact same config is needed.
*/
struct vdpa_device {
struct device dev;
@@ -91,6 +103,7 @@ struct vdpa_device {
struct vdpa_mgmt_dev *mdev;
unsigned int ngroups;
unsigned int nas;
+ struct vdpa_dev_set_config vdev_cfg;
};

/**
@@ -103,16 +116,6 @@ struct vdpa_iova_range {
u64 last;
};

-struct vdpa_dev_set_config {
- u64 device_features;
- struct {
- u8 mac[ETH_ALEN];
- u16 mtu;
- u16 max_vq_pairs;
- } net;
- u64 mask;
-};
-
/**
* Corresponding file area for device memory mapping
* @file: vma->vm_file for the mapping
--
1.8.3.1

2022-10-17 07:12:54

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 0/4] vDPA: dev config export via "vdpa dev show" command

Adding Sean and Daniel for more thoughts.

On Sat, Oct 15, 2022 at 9:33 AM Si-Wei Liu <[email protected]> wrote:
>
> Live migration of vdpa would typically require re-instate vdpa
> device with an idential set of configs on the destination node,
> same way as how source node created the device in the first place.
>
> In order to allow live migration orchestration software to export the
> initial set of vdpa attributes with which the device was created, it
> will be useful if the vdpa tool can report the config on demand with
> simple query.

For live migration, I think the management layer should have this
knowledge and can communicate it directly without bothering the vdpa
tool on the source. If I'm not wrong, this is the way libvirt is
doing it now.

> This will ease the orchestration software implementation
> so that it doesn't have to keep track of vdpa config change, or have
> to persist vdpa attributes across failure and recovery, in fear of
> being killed due to accidental software error.
>
> In this series, the initial device config for vdpa creation will be
> exported via the "vdpa dev show" command.
> This is unlike the "vdpa
> dev config show" command that usually goes with the live value in
> the device config space, which is not reliable subject to the dynamics
> of feature negotiation and possible change in device config space.
>
> Examples:
>
> 1) Create vDPA by default without any config attribute
>
> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
> $ vdpa dev show vdpa0
> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> $ vdpa dev -jp show vdpa0
> {
> "dev": {
> "vdpa0": {
> "type": "network",
> "mgmtdev": "pci/0000:41:04.2",
> "vendor_id": 5555,
> "max_vqs": 9,
> "max_vq_size": 256,
> }
> }
> }
>
> 2) Create vDPA with config attribute(s) specified
>
> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
> mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> $ vdpa dev show
> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> $ vdpa dev -jp show
> {
> "dev": {
> "vdpa0": {
> "type": "network",
> "mgmtdev": "pci/0000:41:04.2",

So "mgmtdev" looks unnecessary for live migration.

Thanks

> "vendor_id": 5555,
> "max_vqs": 9,
> "max_vq_size": 256,
> "mac": "e4:11:c6:d3:45:f0",
> "max_vq_pairs": 4
> }
> }
> }
>
> ---
>
> Si-Wei Liu (4):
> vdpa: save vdpa_dev_set_config in struct vdpa_device
> vdpa: pass initial config to _vdpa_register_device()
> vdpa: show dev config as-is in "vdpa dev show" output
> vdpa: fix improper error message when adding vdpa dev
>
> drivers/vdpa/ifcvf/ifcvf_main.c | 2 +-
> drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
> drivers/vdpa/vdpa.c | 63 +++++++++++++++++++++++++++++++++---
> drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 2 +-
> drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
> drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
> drivers/vdpa/virtio_pci/vp_vdpa.c | 3 +-
> include/linux/vdpa.h | 26 ++++++++-------
> 8 files changed, 80 insertions(+), 22 deletions(-)
>
> --
> 1.8.3.1
>

2022-10-17 13:06:19

by Sean Mooney

[permalink] [raw]
Subject: Re: [PATCH 0/4] vDPA: dev config export via "vdpa dev show" command

On Mon, 2022-10-17 at 15:08 +0800, Jason Wang wrote:
> Adding Sean and Daniel for more thoughts.
>
> On Sat, Oct 15, 2022 at 9:33 AM Si-Wei Liu <[email protected]> wrote:
> >
> > Live migration of vdpa would typically require re-instate vdpa
> > device with an idential set of configs on the destination node,
> > same way as how source node created the device in the first place.
> >
> > In order to allow live migration orchestration software to export the
> > initial set of vdpa attributes with which the device was created, it
> > will be useful if the vdpa tool can report the config on demand with
> > simple query.
>
> For live migration, I think the management layer should have this
> knowledge and they can communicate directly without bothering the vdpa
> tool on the source. If I was not wrong this is the way libvirt is
> doing now.
At least from an openstack (nova) perspective we are not expecting to do any vdpa device configuration
at the openstack level. To use a vdpa device in openstack, the operator, when installing openstack,
needs to create a udev/systemd script to pre-create the vdpa devices.

nova will query libvirt for the list of available vdpa devices at start up and record them in our database.
when scheduling we select a host that has a free vdpa device, and on that host we generate an xml snippet
that references the vdpa device and provide that to libvirt, which will in turn program the mac.

"""
<interface type="vdpa">
<mac address="b5:bc:2e:e7:51:ee"/>
<source dev="/dev/vhost-vdpa-3"/>
</interface>
"""

when live migrating the workflow is similar. we ask our scheduler for a host that should have enough available
resources, then we make an rpc call "pre_live_migrate" which makes a number of assertions such as cpu compatibility
but also computes cpu pinning and device passthrough assignments. i.e. in pre_live_migrate we select which cpu cores, pcie
devices and, in this case, vdpa devices to use on the destination host, and return that in our rpc result.

we then use that information to update the libvirt domain xml with the new host-specific information and start
the migration at the libvirt level.

today in openstack we use a hack i came up with to work around the fact that you can't migrate with sriov/pci passthrough
devices, in order to support live migration with vdpa. basically, before we call libvirt to live migrate, we hot-unplug the vdpa nics
from the guest and add them back after the migration is complete. if you don't bond the vdpa nics with a transparently migratable
nic in the guest, that obviously results in a loss of network connectivity while the migration is happening, which is not ideal,
so a normal virtio-net interface on ovs is what we recommend as the fallback interface for the bond.

obviously when vdpa supports transparent live migration we can just skip this workaround, which would be a very nice ux improvement.
one of the side effects of the hack however is that you can start with an intel nic and end up with a mellanox nic, because we don't
need to preserve the device capabilities since we are hotplugging.

with vdpa we will at least have a virtual virtio-net-pci frontend in qemu to provide some level of abstraction.
i guess the point you are raising is that for live migration we can't start with 4 queue pairs and vq_size=256,
then select a device with 2 queue pairs and a vq_size of 512, and expect that to just work.

There are two ways to address that: 1) we can start recording this info in our db and schedule only to hosts with the same
configuration values, or 2) we can record the capabilities, i.e. the max values that are supported by a device, and schedule to a host
where they are >= the current values and rely on libvirt to reconfigure the device.

libvirt requires very little input today to consume a vdpa interface
https://libvirt.org/formatdomain.html#vdpa-devices
there are some generic virtio device options we could set https://libvirt.org/formatdomain.html#virtio-related-options
and some generic options, like the mtu, that the interface element supports

but the minimal valid xml snippet is literally just the source dev path.

<devices>
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </interface>
</devices>

nova only adds the mac address and MTU today, although i have some untested code that will try to also set the vq size.
https://github.com/openstack/nova/blob/11cb31258fa5b429ea9881c92b2d745fd127cdaf/nova/virt/libvirt/designer.py#L154-L167

The basic support we have today assumes, however, that the vq_size is either the same on all hosts or that it does not matter,
because we do not support transparent live migration today, so it's ok for it to change from host to host.
in any case we do not track the vq_size or vq count today, so we can't schedule based on it or communicate it to libvirt via our
pre_live_migration rpc result. that means libvirt should check if the dest device has the same config, or update it if possible,
before starting the destination qemu instance and beginning the migration.

>
> > This will ease the orchestration software implementation
> > so that it doesn't have to keep track of vdpa config change, or have
> > to persist vdpa attributes across failure and recovery, in fear of
> > being killed due to accidental software error.
the vdpa device config is not something we handle today, so this would make our lives more complex depending on
what that info is. at least in the case of nova we do not use the vdpa cli at all; we use libvirt as an indirection layer.
so libvirt would need to support this interface, and we would then have to add it to our db and modify our RPC interface
to update the libvirt xml with additional info we don't need today.
> >
> > In this series, the initial device config for vdpa creation will be
> > exported via the "vdpa dev show" command.
> > This is unlike the "vdpa
> > dev config show" command that usually goes with the live value in
> > the device config space, which is not reliable subject to the dynamics
> > of feature negotiation and possible change in device config space.
> >
> > Examples:
> >
> > 1) Create vDPA by default without any config attribute
> >
> > $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
> > $ vdpa dev show vdpa0
> > vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> > $ vdpa dev -jp show vdpa0
> > {
> > "dev": {
> > "vdpa0": {
> > "type": "network",
> > "mgmtdev": "pci/0000:41:04.2",
> > "vendor_id": 5555,
> > "max_vqs": 9,
> > "max_vq_size": 256,
> > }
> > }
> > }
This is how openstack works today. this step is done statically at boot time, typically via a udev script or systemd service file.
the mac address is updated on the vdpa interface by libvirt when it's assigned to the qemu process.
if we wanted to support multi-queue or vq size configuration, it would also happen at that time, not during device creation.
> >
> > 2) Create vDPA with config attribute(s) specified
> >
> > $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
> > mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> > $ vdpa dev show
> > vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> > mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> > $ vdpa dev -jp show
> > {
> > "dev": {
> > "vdpa0": {
> > "type": "network",
> > "mgmtdev": "pci/0000:41:04.2",
>
> So "mgmtdev" looks not necessary for live migration.
>
> Thanks
>
> > "vendor_id": 5555,
> > "max_vqs": 9,
> > "max_vq_size": 256,
> > "mac": "e4:11:c6:d3:45:f0",
> > "max_vq_pairs": 4
> > }
> > }
> > }
dynamically creating vdpa devices at runtime, while possible, is not an approach we are planning to support.

currently in nova we prefer to do allocation of statically provisioned resources in nova.
for persistent memory, sriov/pci passthrough, dedicated cpus, hugepages and vdpa devices, we manage inventories
of resources that the operator has configured on the platform.

we have one exception to this static approach, which is semi-dynamic: that is how we manage vfio mediated devices.
for reasons that are not important, we currently track the parent devices that are capable of providing MDEVs
and we directly write to /sys/... to create the mdev instance of a requested mdev on demand.

This has proven to be quite problematic, as we have encountered caching bugs due to the delay between device
creation and when the /sys interface exposes the directory structure for the mdev. This has led to libvirt, and as a result,
nova getting out of sync with the actual state of the host. There are also issues with host reboots.

while we do see the advantage of being able to create vdpa interfaces on demand, especially if we can do finer-grained resource
partitioning by allocating one mdev with 4 vqs and another with 8 extra, our experience with dynamic mdev management gives us
pause. we can and will fix our bugs with mdevs, but we have found that most of our customers that use features like this
are telcos or other similar industries that typically have very static workloads. while there is some interest in making
their clouds more dynamic, they typically fill a host and run the same workload on that host for months to years at a
time, and plan their hardware accordingly, so they are well served by the static use case "1) Create vDPA by default without any config attribute".

> >
> > ---
> >
> > Si-Wei Liu (4):
> > vdpa: save vdpa_dev_set_config in struct vdpa_device
> > vdpa: pass initial config to _vdpa_register_device()
> > vdpa: show dev config as-is in "vdpa dev show" output
> > vdpa: fix improper error message when adding vdpa dev
> >
> > drivers/vdpa/ifcvf/ifcvf_main.c | 2 +-
> > drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
> > drivers/vdpa/vdpa.c | 63 +++++++++++++++++++++++++++++++++---
> > drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 2 +-
> > drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
> > drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
> > drivers/vdpa/virtio_pci/vp_vdpa.c | 3 +-
> > include/linux/vdpa.h | 26 ++++++++-------
> > 8 files changed, 80 insertions(+), 22 deletions(-)
> >
> > --
> > 1.8.3.1
> >
>

2022-10-17 23:03:25

by Si-Wei Liu

[permalink] [raw]
Subject: Re: [PATCH 0/4] vDPA: dev config export via "vdpa dev show" command



On 10/17/2022 12:08 AM, Jason Wang wrote:
> Adding Sean and Daniel for more thoughts.
>
> On Sat, Oct 15, 2022 at 9:33 AM Si-Wei Liu <[email protected]> wrote:
>> Live migration of vdpa would typically require re-instate vdpa
>> device with an idential set of configs on the destination node,
>> same way as how source node created the device in the first place.
>>
>> In order to allow live migration orchestration software to export the
>> initial set of vdpa attributes with which the device was created, it
>> will be useful if the vdpa tool can report the config on demand with
>> simple query.
> For live migration, I think the management layer should have this
> knowledge and they can communicate directly without bothering the vdpa
> tool on the source. If I was not wrong this is the way libvirt is
> doing now.
I think this series doesn't conflict with what libvirt is doing now. For
example it can still remember the supported features for the parent
mgmtdev, and the mtu and mac properties for vdpa creation, and use them to
replicate the vdpa device on the destination node. The extra benefit is
that the management software (for live migration) doesn't need to care
about those mgmtdev specifics - such as what features the parent mgmtdev
supports, whether some features are mandatory, what the default values
for those are, and whether there's enough system or hardware resource
available to create a vdpa device with the requested features et al. This
kind of process can be simplified by just getting a vdpa device created
with the exact same features and configs exposed via the 'vdpa dev show'
command. Essentially this export facility just provides the layer of
abstraction needed for virtio related device configuration and for the
very core need of vdpa live migration. For example, what's exported can
even be useful to facilitate live migration from vdpa to software virtio.
Basically, it doesn't prevent libvirt from implementing another layer on
top to manage vdpa between mgmtdev devices and vdpa creation, and on the
other hand, it would benefit lightweight mgmt software implementations
with device management and live migration orchestration decoupled at the
upper level.

>> This will ease the orchestration software implementation
>> so that it doesn't have to keep track of vdpa config changes, or have
>> to persist vdpa attributes across failure and recovery, for fear of
>> being killed due to an accidental software error.
>>
>> In this series, the initial device config for vdpa creation will be
>> exported via the "vdpa dev show" command.
>> This is unlike the "vdpa
>> dev config show" command, which usually reports the live values in
>> the device config space; those are not reliable, being subject to the
>> dynamics of feature negotiation and possible changes in the device
>> config space.
>>
>> Examples:
>>
>> 1) Create vDPA by default without any config attribute
>>
>> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
>> $ vdpa dev show vdpa0
>> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
>> $ vdpa dev -jp show vdpa0
>> {
>> "dev": {
>> "vdpa0": {
>> "type": "network",
>> "mgmtdev": "pci/0000:41:04.2",
>> "vendor_id": 5555,
>> "max_vqs": 9,
>> "max_vq_size": 256
>> }
>> }
>> }
>>
>> 2) Create vDPA with config attribute(s) specified
>>
>> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
>> mac e4:11:c6:d3:45:f0 max_vq_pairs 4
>> $ vdpa dev show
>> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
>> mac e4:11:c6:d3:45:f0 max_vq_pairs 4
>> $ vdpa dev -jp show
>> {
>> "dev": {
>> "vdpa0": {
>> "type": "network",
>> "mgmtdev": "pci/0000:41:04.2",
> So "mgmtdev" looks unnecessary for live migration.
Right, so once the resulting device_features is exposed in the 'vdpa dev
show' output, the mgmt software could infer the set of config options to
recreate the vdpa device with, and filter out unwanted attributes (or
pick only what it really wants).
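
To make the intended consumption concrete, here is a hypothetical sketch (not part of this series) of how mgmt software might parse the `vdpa dev -jp show` JSON, filter out the read-only capability attributes, and rebuild the creation command. The `CREATION_ATTRS` whitelist and the helper name are illustrative assumptions, not an actual interface:

```python
import json

# Attributes from "vdpa dev -jp show" that are also valid "vdpa dev add"
# options (hypothetical whitelist; vendor_id/max_vqs/max_vq_size are
# read-only capabilities and must be filtered out).
CREATION_ATTRS = ("mac", "mtu", "max_vq_pairs")

def rebuild_add_cmd(show_json, mgmtdev, name):
    """Build a 'vdpa dev add' command line from 'vdpa dev -jp show' output."""
    cfg = json.loads(show_json)["dev"][name]
    cmd = ["vdpa", "dev", "add", "mgmtdev", mgmtdev, "name", name]
    for attr in CREATION_ATTRS:
        if attr in cfg:
            cmd += [attr, str(cfg[attr])]
    return " ".join(cmd)

# Sample output mirroring example 2) from the cover letter.
sample = json.dumps({"dev": {"vdpa0": {
    "type": "network", "mgmtdev": "pci/0000:41:04.2",
    "vendor_id": 5555, "max_vqs": 9, "max_vq_size": 256,
    "mac": "e4:11:c6:d3:45:f0", "max_vq_pairs": 4}}})

print(rebuild_add_cmd(sample, "pci/0000:41:04.2", "vdpa0"))
# vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 mac e4:11:c6:d3:45:f0 max_vq_pairs 4
```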

-Siwei

>
> Thanks
>
>> "vendor_id": 5555,
>> "max_vqs": 9,
>> "max_vq_size": 256,
>> "mac": "e4:11:c6:d3:45:f0",
>> "max_vq_pairs": 4
>> }
>> }
>> }
>>
>> ---
>>
>> Si-Wei Liu (4):
>> vdpa: save vdpa_dev_set_config in struct vdpa_device
>> vdpa: pass initial config to _vdpa_register_device()
>> vdpa: show dev config as-is in "vdpa dev show" output
>> vdpa: fix improper error message when adding vdpa dev
>>
>> drivers/vdpa/ifcvf/ifcvf_main.c | 2 +-
>> drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
>> drivers/vdpa/vdpa.c | 63 +++++++++++++++++++++++++++++++++---
>> drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 2 +-
>> drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
>> drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
>> drivers/vdpa/virtio_pci/vp_vdpa.c | 3 +-
>> include/linux/vdpa.h | 26 ++++++++-------
>> 8 files changed, 80 insertions(+), 22 deletions(-)
>>
>> --
>> 1.8.3.1
>>

2022-10-17 23:55:50

by Si-Wei Liu

[permalink] [raw]
Subject: Re: [PATCH 0/4] vDPA: dev config export via "vdpa dev show" command



On 10/17/2022 5:28 AM, Sean Mooney wrote:
> On Mon, 2022-10-17 at 15:08 +0800, Jason Wang wrote:
>> Adding Sean and Daniel for more thoughts.
>>
>> On Sat, Oct 15, 2022 at 9:33 AM Si-Wei Liu <[email protected]> wrote:
>>> Live migration of vdpa would typically require re-instating the vdpa
>>> device with an identical set of configs on the destination node, the
>>> same way the source node created the device in the first place.
>>>
>>> In order to allow live migration orchestration software to export the
>>> initial set of vdpa attributes with which the device was created, it
>>> will be useful if the vdpa tool can report the config on demand with
>>> a simple query.
>> For live migration, I think the management layer should have this
>> knowledge and they can communicate directly without bothering the vdpa
>> tool on the source. If I was not wrong this is the way libvirt is
>> doing now.
> At least from an OpenStack (Nova) perspective, we are not expecting to do any vdpa device configuration
> at the OpenStack level. To use a vdpa device in OpenStack, the operator needs to create a
> udev/systemd script at install time to pre-create the vdpa devices.
This seems to correlate vdpa device creation with the static allocation
of SR-IOV VF devices. Perhaps OpenStack doesn't have a plan to support
dynamic vdpa creation, but conceptually vdpa creation can be on demand,
e.g. over a Mellanox SubFunction or an Intel Scalable IOV device.

>
> Nova will query libvirt for the list of available vdpa devices at startup and record them in our database.
> When scheduling, we select a host that has a free vdpa device, and on that host we generate an XML snippet
> that references the vdpa device and provide that to libvirt, which will in turn program the MAC.
>
> """
> <interface type="vdpa">
> <mac address="b5:bc:2e:e7:51:ee"/>
> <source dev="/dev/vhost-vdpa-3"/>
> </interface>
> """
>
> When live migrating, the workflow is similar. We ask our scheduler for a host that should have enough available
> resources, then we make an RPC call "pre_live_migrate" which makes a number of assertions such as CPU compatibility,
> but also computes CPU pinning and device passthrough assignments, i.e. in pre_live_migrate we select which CPU cores, PCIe
> devices and, in this case, vdpa devices to use on the destination host
In the case of vdpa, does it (the pre_live_migrate RPC) now just select
the parent mgmtdev for creating the vdpa device in a later phase, or does
it end up with a vdpa device being created? Note that for now there are
only a few properties for vdpa creation, e.g. mtu and mac, so no special
reservation of resources is needed to create a vdpa device. But that may
well change in the future.

> and we return that in our RPC result.
>
> We then use that information to update the libvirt domain XML with the new host-specific information and start
> the migration at the libvirt level.
>
> Today in OpenStack we use a hack I came up with to work around the fact that you can't migrate with SR-IOV/PCI passthrough
> devices, in order to support live migration with vdpa. Basically, before we call libvirt to live migrate, we hot-unplug the vdpa NICs
> from the guest and add them back after the migration is complete. If you don't bond the vdpa NICs with a transparently migratable
> NIC in the guest, that obviously results in a loss of network connectivity while the migration is happening, which is not ideal,
> so a normal virtio-net interface on OVS is what we recommend as the fallback interface for the bond.
Do you need to preserve the MAC address when falling back to the normal
virtio-net interface, and similarly any other network config/state?
Basically vDPA doesn't support live migration for the moment. This
doesn't seem to be a technically correct solution for making it work.
>
> Obviously, when vdpa supports transparent live migration we can just skip this workaround, which would be a very nice UX improvement.
> One of the side effects of the hack, however, is that you can start with an Intel NIC and end up with a Mellanox NIC, because we don't need
> to preserve the device capabilities since we are hotplugging.
Exactly. This is the issue.
>
> With vdpa we will at least have a virtual virtio-net-pci frontend in QEMU to provide some level of abstraction.
> I guess the point you are raising is that for live migration we can't start with 4 queue pairs and vq_size=256
> and select a device with 2 queue pairs and a vq_size of 512 and expect that to just work.
Not exactly; the vq_size comes from QEMU and has nothing to do with the
vDPA tool. And live migrating from 4 queue pairs to 2 queue pairs won't
work for the guest driver. A change of queue pair count would need a
device reset, which won't happen transparently during live migration.
Basically libvirt has to match the exact queue pair number and queue
length on the destination node.
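
A destination-side compatibility check along these lines could look like the following hypothetical sketch; the attribute names are assumptions based on the examples in this thread, not an actual libvirt interface:

```python
def migration_compatible(src, dst):
    """Source/destination vdpa configs must match exactly on attributes
    that cannot change without a device reset (hypothetical check)."""
    must_match = ("max_vq_pairs", "max_vq_size", "mac")
    return all(src.get(k) == dst.get(k) for k in must_match)

src = {"max_vq_pairs": 4, "max_vq_size": 256, "mac": "e4:11:c6:d3:45:f0"}
print(migration_compatible(src, dict(src)))                   # True
# 2 queue pairs on the destination would require a device reset -> reject.
print(migration_compatible(src, {**src, "max_vq_pairs": 2}))  # False
```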

>
> There are two ways to address that: 1) we can start recording this info in our DB and schedule only to hosts with the same
> configuration values, or 2) we can record the capabilities, i.e. the max values that are supported by a device, and schedule to a host
> where they are >= the current values and rely on libvirt to reconfigure the device.
>
> libvirt requires very little input today to consume a vdpa interface:
> https://libvirt.org/formatdomain.html#vdpa-devices
> there are some generic virtio device options we could set: https://libvirt.org/formatdomain.html#virtio-related-options
> and some generic options, like the MTU, that the interface element supports
>
> but the minimal valid XML snippet is literally just the source dev path.
>
> <devices>
> <interface type='vdpa'>
> <source dev='/dev/vhost-vdpa-0'/>
> </interface>
> </devices>
>
> Nova only adds the MAC address and MTU today, although I have some untested code that will try to also set the vq size.
> https://github.com/openstack/nova/blob/11cb31258fa5b429ea9881c92b2d745fd127cdaf/nova/virt/libvirt/designer.py#L154-L167
>
> The basic support we have today assumes, however, that the vq_size is either the same on all hosts or that it does not matter, because we do
> not support transparent live migration today, so it's OK for it to change from host to host.
> In any case, we do not track the vq_size or vq count today, so we can't schedule based on it or communicate it to libvirt via our
> pre_live_migration RPC result. That means libvirt should check whether the dest device has the same config, or update it if possible,
> before starting the destination QEMU instance and beginning the migration.
>
>>> This will ease the orchestration software implementation
>>> so that it doesn't have to keep track of vdpa config changes, or have
>>> to persist vdpa attributes across failure and recovery, for fear of
>>> being killed due to an accidental software error.
> the vdpa device config is not something we do today, so this would make our lives more complex
It's a question of which use cases to support. These configs existed
well before my change.

> depending on
> what that info is. At least in the case of Nova, we do not use the vdpa CLI at all; we use libvirt as an indirection layer.
> So libvirt would need to support this interface, and we would have to then add it to our DB and modify our RPC interface
> to then update the libvirt XML with additional info we don't need today.

Yes. You can follow libvirt when the corresponding support is done, but
I think it's orthogonal to my changes. Basically my change won't
affect libvirt's implementation at all.

Thanks,
-Siwei


>>> In this series, the initial device config for vdpa creation will be
>>> exported via the "vdpa dev show" command.
>>> This is unlike the "vdpa
>>> dev config show" command, which usually reports the live values in
>>> the device config space; those are not reliable, being subject to the
>>> dynamics of feature negotiation and possible changes in the device
>>> config space.
>>>
>>> Examples:
>>>
>>> 1) Create vDPA by default without any config attribute
>>>
>>> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
>>> $ vdpa dev show vdpa0
>>> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
>>> $ vdpa dev -jp show vdpa0
>>> {
>>> "dev": {
>>> "vdpa0": {
>>> "type": "network",
>>> "mgmtdev": "pci/0000:41:04.2",
>>> "vendor_id": 5555,
>>> "max_vqs": 9,
>>> "max_vq_size": 256
>>> }
>>> }
>>> }
> This is how OpenStack works today; this step is done statically at boot time, typically via a udev script or systemd service file.
> The MAC address is updated on the vdpa interface by libvirt when it's assigned to the QEMU process.
> If we wanted to support multi-queue or vq size configuration, it would also happen at that time, not during device creation.
>>> 2) Create vDPA with config attribute(s) specified
>>>
>>> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
>>> mac e4:11:c6:d3:45:f0 max_vq_pairs 4
>>> $ vdpa dev show
>>> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
>>> mac e4:11:c6:d3:45:f0 max_vq_pairs 4
>>> $ vdpa dev -jp show
>>> {
>>> "dev": {
>>> "vdpa0": {
>>> "type": "network",
>>> "mgmtdev": "pci/0000:41:04.2",
>> So "mgmtdev" looks unnecessary for live migration.
>>
>> Thanks
>>
>>> "vendor_id": 5555,
>>> "max_vqs": 9,
>>> "max_vq_size": 256,
>>> "mac": "e4:11:c6:d3:45:f0",
>>> "max_vq_pairs": 4
>>> }
>>> }
>>> }
> Dynamically creating vdpa devices at runtime, while possible, is not an approach we are planning to support.
>
> Currently in Nova we prefer to do allocation of statically provisioned resources. For
> persistent memory, SR-IOV/PCI passthrough, dedicated CPUs, hugepages and vdpa devices, we manage inventories
> of resources that the operator has configured on the platform.
>
> We have one exception to this static approach, which is semi-dynamic: how we manage VFIO mediated devices.
> For reasons that are not important, we currently track the parent devices that are capable of providing mdevs,
> and we directly write to /sys/... to create the mdev instance of a requested mdev on demand.
>
> This has proven to be quite problematic, as we have encountered caching bugs due to the delay between device
> creation and when the /sys interface exposes the directory structure for the mdev. This has led to libvirt, and as a result
> Nova, getting out of sync with the actual state of the host. There are also issues with host reboots.
>
> While we do see the advantage of being able to create vdpa interfaces on demand, especially if we can do finer-grained resource
> partitioning by allocating one device with 4 vqs and another with 8, etc., our experience with dynamic mdev management gives us
> pause. We can and will fix our bugs with mdevs, but we have found that most of our customers that use features like this
> are telcos or other similar industries that typically have very static workloads. While there is some interest in making
> their clouds more dynamic, they typically fill a host and run the same workload on that host for months to years at a
> time, and plan their hardware accordingly, so they are well served by the static use case "1) Create vDPA by default without any config attribute".
>
>>> ---
>>>
>>> Si-Wei Liu (4):
>>> vdpa: save vdpa_dev_set_config in struct vdpa_device
>>> vdpa: pass initial config to _vdpa_register_device()
>>> vdpa: show dev config as-is in "vdpa dev show" output
>>> vdpa: fix improper error message when adding vdpa dev
>>>
>>> drivers/vdpa/ifcvf/ifcvf_main.c | 2 +-
>>> drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
>>> drivers/vdpa/vdpa.c | 63 +++++++++++++++++++++++++++++++++---
>>> drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 2 +-
>>> drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
>>> drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
>>> drivers/vdpa/virtio_pci/vp_vdpa.c | 3 +-
>>> include/linux/vdpa.h | 26 ++++++++-------
>>> 8 files changed, 80 insertions(+), 22 deletions(-)
>>>
>>> --
>>> 1.8.3.1
>>>

2022-10-18 08:15:32

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 0/4] vDPA: dev config export via "vdpa dev show" command

On Tue, Oct 18, 2022 at 7:35 AM Si-Wei Liu <[email protected]> wrote:
>
>
>
> On 10/17/2022 5:28 AM, Sean Mooney wrote:
> > On Mon, 2022-10-17 at 15:08 +0800, Jason Wang wrote:
> >> Adding Sean and Daniel for more thoughts.
> >>
> >> On Sat, Oct 15, 2022 at 9:33 AM Si-Wei Liu <[email protected]> wrote:
> >>> Live migration of vdpa would typically require re-instating the vdpa
> >>> device with an identical set of configs on the destination node, the
> >>> same way the source node created the device in the first place.
> >>>
> >>> In order to allow live migration orchestration software to export the
> >>> initial set of vdpa attributes with which the device was created, it
> >>> will be useful if the vdpa tool can report the config on demand with
> >>> a simple query.
> >> For live migration, I think the management layer should have this
> >> knowledge and they can communicate directly without bothering the vdpa
> >> tool on the source. If I was not wrong this is the way libvirt is
> >> doing now.
> > At least from an OpenStack (Nova) perspective, we are not expecting to do any vdpa device configuration
> > at the OpenStack level. To use a vdpa device in OpenStack, the operator needs to create a
> > udev/systemd script at install time to pre-create the vdpa devices.
> This seems to correlate vdpa device creation with the static allocation
> of SR-IOV VF devices. Perhaps OpenStack doesn't have a plan to support
> dynamic vdpa creation, but conceptually vdpa creation can be on demand,
> e.g. over a Mellanox SubFunction or an Intel Scalable IOV device.

Yes, it's not specific to vDPA but something that OpenStack needs to consider.

>
> >
> > Nova will query libvirt for the list of available vdpa devices at startup and record them in our database.
> > When scheduling, we select a host that has a free vdpa device, and on that host we generate an XML snippet
> > that references the vdpa device and provide that to libvirt, which will in turn program the MAC.
> >
> > """
> > <interface type="vdpa">
> > <mac address="b5:bc:2e:e7:51:ee"/>
> > <source dev="/dev/vhost-vdpa-3"/>
> > </interface>
> > """
> >
> > When live migrating, the workflow is similar. We ask our scheduler for a host that should have enough available
> > resources, then we make an RPC call "pre_live_migrate" which makes a number of assertions such as CPU compatibility,

A migration compatibility check for vDPA should be done as well here.

> > but also computes CPU pinning and device passthrough assignments, i.e. in pre_live_migrate we select which CPU cores, PCIe
> > devices and, in this case, vdpa devices to use on the destination host
> In the case of vdpa, does it (the pre_live_migrate RPC) now just select
> the parent mgmtdev for creating the vdpa device in a later phase, or does
> it end up with a vdpa device being created? Note that for now there are
> only a few properties for vdpa creation, e.g. mtu and mac, so no special
> reservation of resources is needed to create a vdpa device. But that may
> well change in the future.
>
> > and we return that in our RPC result.
> >
> > We then use that information to update the libvirt domain XML with the new host-specific information and start
> > the migration at the libvirt level.
> >
> > Today in OpenStack we use a hack I came up with to work around the fact that you can't migrate with SR-IOV/PCI passthrough
> > devices, in order to support live migration with vdpa. Basically, before we call libvirt to live migrate, we hot-unplug the vdpa NICs
> > from the guest and add them back after the migration is complete. If you don't bond the vdpa NICs with a transparently migratable
> > NIC in the guest, that obviously results in a loss of network connectivity while the migration is happening, which is not ideal,
> > so a normal virtio-net interface on OVS is what we recommend as the fallback interface for the bond.
> Do you need to preserve the mac address when falling back to the normal
> virtio-net interface, and similarly any other network config/state?
> Basically vDPA doesn't support live migration for the moment.

Basic shadow vq based live migration can work now. Eugenio is working
to make it fully ready in the near future.

> This doesn't seem to be a technically correct solution for making it
> work.

I agree.

> >
> > Obviously, when vdpa supports transparent live migration we can just skip this workaround, which would be a very nice UX improvement.
> > One of the side effects of the hack, however, is that you can start with an Intel NIC and end up with a Mellanox NIC, because we don't need
> > to preserve the device capabilities since we are hotplugging.
> Exactly. This is the issue.
> >
> > With vdpa we will at least have a virtual virtio-net-pci frontend in QEMU to provide some level of abstraction.
> > I guess the point you are raising is that for live migration we can't start with 4 queue pairs and vq_size=256
> > and select a device with 2 queue pairs and a vq_size of 512 and expect that to just work.
> Not exactly; the vq_size comes from QEMU and has nothing to do with the
> vDPA tool. And live migrating from 4 queue pairs to 2 queue pairs won't
> work for the guest driver. A change of queue pair count would need a
> device reset, which won't happen transparently during live migration.
> Basically libvirt has to match the exact queue pair number and queue
> length on the destination node.
>
> >
> > There are two ways to address that: 1) we can start recording this info in our DB and schedule only to hosts with the same
> > configuration values, or 2) we can record the capabilities, i.e. the max values that are supported by a device, and schedule to a host
> > where they are >= the current values and rely on libvirt to reconfigure the device.
> >
> > libvirt requires very little input today to consume a vdpa interface:
> > https://libvirt.org/formatdomain.html#vdpa-devices

So a question here: if we need to create vDPA devices on demand (e.g.
with the features and configs from the source), who will do the
provisioning? Is it libvirt?

Thanks

> > there are some generic virtio device options we could set: https://libvirt.org/formatdomain.html#virtio-related-options
> > and some generic options, like the MTU, that the interface element supports
> >
> > but the minimal valid XML snippet is literally just the source dev path.
> >
> > <devices>
> > <interface type='vdpa'>
> > <source dev='/dev/vhost-vdpa-0'/>
> > </interface>
> > </devices>
> >
> > Nova only adds the MAC address and MTU today, although I have some untested code that will try to also set the vq size.
> > https://github.com/openstack/nova/blob/11cb31258fa5b429ea9881c92b2d745fd127cdaf/nova/virt/libvirt/designer.py#L154-L167
> >
> > The basic support we have today assumes, however, that the vq_size is either the same on all hosts or that it does not matter, because we do
> > not support transparent live migration today, so it's OK for it to change from host to host.
> > In any case, we do not track the vq_size or vq count today, so we can't schedule based on it or communicate it to libvirt via our
> > pre_live_migration RPC result. That means libvirt should check whether the dest device has the same config, or update it if possible,
> > before starting the destination QEMU instance and beginning the migration.
> >
> >>> This will ease the orchestration software implementation
> >>> so that it doesn't have to keep track of vdpa config changes, or have
> >>> to persist vdpa attributes across failure and recovery, for fear of
> >>> being killed due to an accidental software error.
> > the vdpa device config is not something we do today, so this would make our lives more complex
> It's a question of which use cases to support. These configs existed
> well before my change.
>
> > depending on
> > what that info is. At least in the case of Nova, we do not use the vdpa CLI at all; we use libvirt as an indirection layer.
> > So libvirt would need to support this interface, and we would have to then add it to our DB and modify our RPC interface
> > to then update the libvirt XML with additional info we don't need today.
>
> Yes. You can follow libvirt when the corresponding support is done, but
> I think it's orthogonal to my changes. Basically my change won't
> affect libvirt's implementation at all.
>
> Thanks,
> -Siwei
>
>
> >>> In this series, the initial device config for vdpa creation will be
> >>> exported via the "vdpa dev show" command.
> >>> This is unlike the "vdpa
> >>> dev config show" command, which usually reports the live values in
> >>> the device config space; those are not reliable, being subject to the
> >>> dynamics of feature negotiation and possible changes in the device
> >>> config space.
> >>>
> >>> Examples:
> >>>
> >>> 1) Create vDPA by default without any config attribute
> >>>
> >>> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
> >>> $ vdpa dev show vdpa0
> >>> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> >>> $ vdpa dev -jp show vdpa0
> >>> {
> >>> "dev": {
> >>> "vdpa0": {
> >>> "type": "network",
> >>> "mgmtdev": "pci/0000:41:04.2",
> >>> "vendor_id": 5555,
> >>> "max_vqs": 9,
> >>> "max_vq_size": 256
> >>> }
> >>> }
> >>> }
> > This is how OpenStack works today; this step is done statically at boot time, typically via a udev script or systemd service file.
> > The MAC address is updated on the vdpa interface by libvirt when it's assigned to the QEMU process.
> > If we wanted to support multi-queue or vq size configuration, it would also happen at that time, not during device creation.
> >>> 2) Create vDPA with config attribute(s) specified
> >>>
> >>> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
> >>> mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> >>> $ vdpa dev show
> >>> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> >>> mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> >>> $ vdpa dev -jp show
> >>> {
> >>> "dev": {
> >>> "vdpa0": {
> >>> "type": "network",
> >>> "mgmtdev": "pci/0000:41:04.2",
> >> So "mgmtdev" looks unnecessary for live migration.
> >>
> >> Thanks
> >>
> >>> "vendor_id": 5555,
> >>> "max_vqs": 9,
> >>> "max_vq_size": 256,
> >>> "mac": "e4:11:c6:d3:45:f0",
> >>> "max_vq_pairs": 4
> >>> }
> >>> }
> >>> }
> > Dynamically creating vdpa devices at runtime, while possible, is not an approach we are planning to support.
> >
> > Currently in Nova we prefer to do allocation of statically provisioned resources. For
> > persistent memory, SR-IOV/PCI passthrough, dedicated CPUs, hugepages and vdpa devices, we manage inventories
> > of resources that the operator has configured on the platform.
> >
> > We have one exception to this static approach, which is semi-dynamic: how we manage VFIO mediated devices.
> > For reasons that are not important, we currently track the parent devices that are capable of providing mdevs,
> > and we directly write to /sys/... to create the mdev instance of a requested mdev on demand.
> >
> > This has proven to be quite problematic, as we have encountered caching bugs due to the delay between device
> > creation and when the /sys interface exposes the directory structure for the mdev. This has led to libvirt, and as a result
> > Nova, getting out of sync with the actual state of the host. There are also issues with host reboots.
> >
> > While we do see the advantage of being able to create vdpa interfaces on demand, especially if we can do finer-grained resource
> > partitioning by allocating one device with 4 vqs and another with 8, etc., our experience with dynamic mdev management gives us
> > pause. We can and will fix our bugs with mdevs, but we have found that most of our customers that use features like this
> > are telcos or other similar industries that typically have very static workloads. While there is some interest in making
> > their clouds more dynamic, they typically fill a host and run the same workload on that host for months to years at a
> > time, and plan their hardware accordingly, so they are well served by the static use case "1) Create vDPA by default without any config attribute".
> >
> >>> ---
> >>>
> >>> Si-Wei Liu (4):
> >>> vdpa: save vdpa_dev_set_config in struct vdpa_device
> >>> vdpa: pass initial config to _vdpa_register_device()
> >>> vdpa: show dev config as-is in "vdpa dev show" output
> >>> vdpa: fix improper error message when adding vdpa dev
> >>>
> >>> drivers/vdpa/ifcvf/ifcvf_main.c | 2 +-
> >>> drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
> >>> drivers/vdpa/vdpa.c | 63 +++++++++++++++++++++++++++++++++---
> >>> drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 2 +-
> >>> drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
> >>> drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
> >>> drivers/vdpa/virtio_pci/vp_vdpa.c | 3 +-
> >>> include/linux/vdpa.h | 26 ++++++++-------
> >>> 8 files changed, 80 insertions(+), 22 deletions(-)
> >>>
> >>> --
> >>> 1.8.3.1
> >>>
>

2022-10-18 08:56:18

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 0/4] vDPA: dev config export via "vdpa dev show" command

On Tue, Oct 18, 2022 at 6:58 AM Si-Wei Liu <[email protected]> wrote:
>
>
>
> On 10/17/2022 12:08 AM, Jason Wang wrote:
> > Adding Sean and Daniel for more thoughts.
> >
> > On Sat, Oct 15, 2022 at 9:33 AM Si-Wei Liu <[email protected]> wrote:
> >> Live migration of vdpa would typically require re-instating the vdpa
> >> device with an identical set of configs on the destination node, the
> >> same way the source node created the device in the first place.
> >>
> >> In order to allow live migration orchestration software to export the
> >> initial set of vdpa attributes with which the device was created, it
> >> will be useful if the vdpa tool can report the config on demand with
> >> a simple query.
> > For live migration, I think the management layer should have this
> > knowledge and they can communicate directly without bothering the vdpa
> > tool on the source. If I was not wrong this is the way libvirt is
> > doing now.
> I think this series doesn't conflict with what libvirt is doing now. For
> example, it can still remember the supported features for the parent
> mgmtdev, and the mtu and mac properties for vdpa creation, and use them to
> replicate the vdpa device on the destination node. The extra benefit is
> that the management software (for live migration) doesn't need to care
> about mgmtdev specifics - such as which features the parent mgmtdev
> supports, whether some features are mandatory and what their default
> values are, or whether there is enough system or hardware resource
> available to create a vdpa device with the requested features. This kind
> of process can be simplified by just getting a vdpa device created with
> the exact same features and configs as exposed via the 'vdpa dev show'
> command. Essentially this export facility just provides the layer of
> abstraction needed for virtio related device configuration, and for the
> very core need of vdpa live migration. For example, what's exported can
> even be useful to facilitate live migration from vdpa to software virtio.
> Basically, it doesn't prevent libvirt from implementing another layer on
> top to manage mgmtdev devices and vdpa creation, and on the other hand it
> would benefit lightweight mgmt software implementations by decoupling
> device management from live migration orchestration at the upper level.

Ok, I think this is fine.

>
> >> This will ease the orchestration software implementation
> >> so that it doesn't have to keep track of vdpa config changes, or have
> >> to persist vdpa attributes across failure and recovery, for fear of
> >> being killed due to an accidental software error.
> >>
> >> In this series, the initial device config for vdpa creation will be
> >> exported via the "vdpa dev show" command.
> >> This is unlike the "vdpa
> >> dev config show" command, which usually reports the live values in
> >> the device config space; those are not reliable, being subject to the
> >> dynamics of feature negotiation and possible changes in the device
> >> config space.
> >>
> >> Examples:
> >>
> >> 1) Create vDPA by default without any config attribute
> >>
> >> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
> >> $ vdpa dev show vdpa0
> >> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> >> $ vdpa dev -jp show vdpa0
> >> {
> >> "dev": {
> >> "vdpa0": {
> >> "type": "network",
> >> "mgmtdev": "pci/0000:41:04.2",
> >> "vendor_id": 5555,
> >> "max_vqs": 9,
> >> "max_vq_size": 256
> >> }
> >> }
> >> }
> >>
> >> 2) Create vDPA with config attribute(s) specified
> >>
> >> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
> >> mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> >> $ vdpa dev show
> >> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> >> mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> >> $ vdpa dev -jp show
> >> {
> >> "dev": {
> >> "vdpa0": {
> >> "type": "network",
> >> "mgmtdev": "pci/0000:41:04.2",
> > So "mgmtdev" looks unnecessary for live migration.
> Right, so once the resulting device_features is exposed in the 'vdpa dev
> show' output, the mgmt software could infer the set of config options to
> recreate the vdpa device with, and filter out unwanted attributes (or
> pick only what it really wants).

Ok, so I wonder if it is better to have a new command instead of
mixing it with "dev show"?

Or at least have a separate key for virtio, like

"vdpa0": {
"mgmtdev": "vdpasim_net",
"virtio config": {
....
}
}
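
A consumer of such a nested layout could then extract the recreation-relevant part without any attribute whitelist. A hypothetical sketch, using `virtio_config` as a stand-in key name since no key name has been settled in this thread:

```python
import json

# Hypothetical "vdpa dev show" output using a nested key for the virtio
# device config, as suggested above ("virtio_config" is a stand-in name).
out = json.loads("""
{
  "dev": {
    "vdpa0": {
      "mgmtdev": "vdpasim_net",
      "virtio_config": {
        "mac": "e4:11:c6:d3:45:f0",
        "max_vq_pairs": 4
      }
    }
  }
}
""")

# Everything needed to recreate the device lives under one key;
# mgmtdev and other host-local details stay outside it.
virtio_cfg = out["dev"]["vdpa0"]["virtio_config"]
print(sorted(virtio_cfg))  # ['mac', 'max_vq_pairs']
```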

Thanks

>
> -Siwei
>
> >
> > Thanks
> >
> >> "vendor_id": 5555,
> >> "max_vqs": 9,
> >> "max_vq_size": 256,
> >> "mac": "e4:11:c6:d3:45:f0",
> >> "max_vq_pairs": 4
> >> }
> >> }
> >> }
> >>
> >> ---
> >>
> >> Si-Wei Liu (4):
> >> vdpa: save vdpa_dev_set_config in struct vdpa_device
> >> vdpa: pass initial config to _vdpa_register_device()
> >> vdpa: show dev config as-is in "vdpa dev show" output
> >> vdpa: fix improper error message when adding vdpa dev
> >>
> >> drivers/vdpa/ifcvf/ifcvf_main.c | 2 +-
> >> drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
> >> drivers/vdpa/vdpa.c | 63 +++++++++++++++++++++++++++++++++---
> >> drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 2 +-
> >> drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
> >> drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
> >> drivers/vdpa/virtio_pci/vp_vdpa.c | 3 +-
> >> include/linux/vdpa.h | 26 ++++++++-------
> >> 8 files changed, 80 insertions(+), 22 deletions(-)
> >>
> >> --
> >> 1.8.3.1
> >>
>

2022-10-19 10:34:43

by Sean Mooney

[permalink] [raw]
Subject: Re: [PATCH 0/4] vDPA: dev config export via "vdpa dev show" command

On Tue, 2022-10-18 at 15:59 +0800, Jason Wang wrote:
> On Tue, Oct 18, 2022 at 7:35 AM Si-Wei Liu <[email protected]> wrote:
> >
> >
> >
> > On 10/17/2022 5:28 AM, Sean Mooney wrote:
> > > On Mon, 2022-10-17 at 15:08 +0800, Jason Wang wrote:
> > > > Adding Sean and Daniel for more thoughts.
> > > >
> > > > On Sat, Oct 15, 2022 at 9:33 AM Si-Wei Liu <[email protected]> wrote:
> > > > > Live migration of vdpa would typically require re-instate vdpa
> > > > > device with an identical set of configs on the destination node,
> > > > > same way as how source node created the device in the first place.
> > > > >
> > > > > In order to allow live migration orchestration software to export the
> > > > > initial set of vdpa attributes with which the device was created, it
> > > > > will be useful if the vdpa tool can report the config on demand with
> > > > > simple query.
> > > > For live migration, I think the management layer should have this
> > > > knowledge and they can communicate directly without bothering the vdpa
> > > > tool on the source. If I'm not wrong, this is the way libvirt is
> > > > doing it now.
> > > At least from an openstack (nova) perspective we are not expecting to do any vdpa device configuration
> > > at the openstack level. To use a vdpa device in openstack, the operator, when installing openstack,
> > > needs to create a udev/systemd script to pre-create the vdpa devices.
> > This seems to correlate vdpa device creation with the static allocation
> > of SR-IOV VF devices. Perhaps OpenStack doesn't have a plan to support
> > dynamic vdpa creation, but conceptually vdpa creation can be on demand
> > for e.g. over Mellanox SubFunction or Intel Scalable IOV device.
>
> Yes, it's not specific to vDPA but something that openstack needs to consider.

yes, so before i joined redhat in 2018 i worked at intel and was trying to integrate their mdev based
solution for a nic that got canceled due to the changes required to adopt vdpa and some other issues.
that original approach was going to dynamically create the mdev. for vdpa we built the support on top of
our pci manager, so the current support directly requires the vdpa device's parent to be a vf.

we did this for two reasons. first, when i started working on this again in 2020 that was the only model that
worked with any nic that was available, and at the time mdevs were automatically created when you allocated the
vf, since this predated the newer model of using the vdpa cli to add devices. secondly, building on top of the
sriov/pci passthrough framework was less invasive. with openstack we have currently only tested our vdpa support
with mellanox connectx6-dx cards, which at the time i added support only supported vdpa device creation on vfs,
not subfunctions.

we are aware that vdpa devices can be created over a Mellanox SubFunction or Intel Scalable IOV device,
but we do not plan to support either in the near term. i'm the primary person who has been working on the
vdpa support in openstack, but i will be working on some other work for the next 6 months or so, so
it's unlikely we will look at extending nova's vdpa capabilities until our upstream b cycle, which starts in march next year.

conceptually speaking we have two concerns with the dynamic approach. nova, for better or worse, effectively runs with
full root access, although that is confined with selinux/apparmor policies and/or with containers in most production
installations. even so, we try to limit privileged calls, and in the past we did not want nova specifically
to make privileged calls to reconfigure hardware. we already have precedent for breaking that in the form of our
generic mdev support, where we are passed a list of parent devices and the mdev type to create on each device, and then
dynamically advertise pools of available mdevs which we create dynamically. the second concern is we have seen caching issues
with libvirt when devices change dynamically, as it does not always process the udev events correctly. as such, our experience
tells us that we are less likely to have caching related bugs if we take a static approach.

Enabling dynamic vdpa creation would require a large rewrite of how vdpa has been integrated in nova and neutron (openstack's networking component).
right now, when you create a neutron port of type vdpa, we internally convert that to a pci_request of type vdpa, which
our pci_manager matches to a db record for the precreated vdpa devices. we would likely need a new vdpa-dynamic or similar
port type that would correspond to the subfunction/iov backed vdpa devices, and we would need to model them differently in our database.

for sr-iov vfs the consumable, and therefore schedulable, resource is the vf, so we have pools of vfs with some metadata to map them to logical
networks in the case of a nic. for a subfunction/iov backed device there are presumably two logical consumable resources: the number of hardware
vq pairs, and presumably a finite number of iov/subfunctions a pf can allocate. i'm sure there are other qualitative aspects we would like
to schedule on, but those are the main consumables that will determine how many vdpa devices we can create when selecting a host.

we/i intentionally put this out of scope for the initial vdpa support, as i had no way to test this back in 2020, partly because we have a policy
of not supporting test-only code paths, i.e. adding support for the vdpa_sim kernel module was discussed and rejected because it does not create
pci vfs for the vdpa devices and would have prevented using the pci manager, or at least required us to fake the pci address info using a sentinel like
all 0. if we did not need to support dynamic creation, then using a sentinel pci address would actually be a quick way to add iov/subfunction support,
but it would still require static creation and a slight tweak to how we discover the vdpa devices from libvirt.

so to summarise: supporting vdpa with iov/subfunctions as the backend device would be relatively simple if they are precreated; we can
fake the entry in the pci tracker in that case, using a sentinel address or another tag in the db row to denote they are not vf backed.
this would not require any change to neutron, although we would likely also need to modify os-vif to look up the representor netdev which we
attach to openvswitch, to account for the fact that it's an iov/subfunction instead of a vf parent.
the dynamic approach

>
> >
> > >
> > > nova will query libvirt for the list of available vdpa devices at start up and record them in our database.
> > > when scheduling, we select a host that has a free vdpa device, and on that host we generate an xml snippet
> > > that references the vdpa device and provide that to libvirt, and it will in turn program the mac.
> > >
> > > """
> > > <interface type="vdpa">
> > > <mac address="b5:bc:2e:e7:51:ee"/>
> > > <source dev="/dev/vhost-vdpa-3"/>
> > > </interface>
> > > """
> > >
> > > when live migrating, the workflow is similar. we ask our scheduler for a host that should have enough available
> > > resources, then we make an rpc call "pre_live_migrate" which makes a number of assertions, such as cpu compatibility,
>
> A migration compatibility check for vDPA should be done as well here.
yes, it could be, although at this point we have already completed scheduling, so if the check fails we will abort the migration.
presumably libvirt will also check after we invoke the migration, but the ideal case would be that we have enough
info in our db to select a host we know will work rather than checking after the fact.
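A db-side pre-check along these lines could be sketched as follows. The record fields (`max_vq_pairs`, `vq_size`) are hypothetical placeholders for whatever the scheduler database would actually store; the matching rule follows the point made later in this thread that queue pair count and queue length must match exactly, since changing them needs a device reset that cannot happen transparently during live migration:

```python
# Hypothetical scheduler-side vDPA compatibility check: pick a destination
# host whose recorded free vdpa devices can accept the source device
# without a reset. Field names are assumptions, not nova's actual schema.

def compatible(src, dst):
    """Destination must match the source's queue layout exactly."""
    return (src["max_vq_pairs"] == dst["max_vq_pairs"]
            and src["vq_size"] == dst["vq_size"])

def pick_host(src_dev, candidates):
    """candidates maps host name -> list of free vdpa device records."""
    for host, devs in candidates.items():
        if any(compatible(src_dev, d) for d in devs):
            return host
    return None  # no compatible host: abort before starting the migration

src = {"max_vq_pairs": 4, "vq_size": 256}
hosts = {
    "node-a": [{"max_vq_pairs": 2, "vq_size": 512}],
    "node-b": [{"max_vq_pairs": 4, "vq_size": 256}],
}
print(pick_host(src, hosts))  # -> node-b
```

Checking this in the db before `pre_live_migrate` avoids scheduling a host that libvirt would later reject.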
>
> > > but also computes cpu pinning and device passthrough assignments, i.e. in pre_live_migrate we select which cpu cores, pcie
> > > devices, and in this case vdpa devices, to use on the destination host
> > In the case of vdpa, does it (the pre_live_migrate rpc) now just select
> > the parent mgmtdev for creating vdpa in a later phase, or does it end up
> > with a vdpa device being created? Note that by now there are only a few
> > properties for vdpa creation, e.g. mtu and mac, so it doesn't need
> > special reservation of resources for creating a vdpa device. But that
> > may well change in the future.
> >
> > > and return that in our rpc result.
> > >
> > > we then use that information to update the libvirt domain xml with the new host specific information and start
> > > the migration at the libvirt level.
> > >
> > > today in openstack we use a hack i came up with to work around the fact that you can't migrate with sriov/pci passthrough
> > > devices, to support live migration with vdpa. basically, before we call libvirt to live migrate, we hot unplug the vdpa nics
> > > from the guest and add them back after the migration is complete. if you don't bond the vdpa nics with a transparently migratable
> > > nic in the guest, that obviously results in a loss of network connectivity while the migration is happening, which is not ideal,
> > > so a normal virtio-net interface on ovs is what we recommend as the fallback interface for the bond.
> > Do you need to preserve the mac address when falling back to the normal
> > virtio-net interface, and similarly any other network config/state?
> > Basically vDPA doesn't support live migration for the moment.
>
> Basic shadow vq based live migration can work now. Eugenio is working
> to make it fully ready in the near future.
this missed our merge window for the Zed cycle, which feature froze in september and was released a week or two ago.

>
> > This
> > doesn't look like a technically correct solution for it to work.
>
> I agree.
this is what we added for sriov VF direct passthrough:
all you need to do in the guest is use the linux kernel bond driver to create a bond between the vf and another live migratable port type.
in the basic test i did for this, i added the mac of the VF to the allowed address pairs for the ovs port and used that as the mac of the vf.
this is all out of scope for openstack to configure and manage. we are not using the virtio fallback support that was added after we implemented this,
as we don't have a good way today to describe the grouping of the primary and fallback interfaces in the neutron api.

the details of waht we do are here https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvirt-neutron-sriov-livemigration.html

we only do automatic hotplug for nics, and only if they are directly attached; if they use a macvtap to attach the vf to the vm we do not remove them.
our intent for vdpa was to also not remove them, and rely on the native live migration support when we detect that the source and destination host
support it.

>
> > >
> > > obviously when vdpa supports transparent live migration we can just skip this workaround, which would be a very nice ux improvement.
> > > one of the side effects of the hack, however, is you can start with an intel nic and end up with a mellanox nic, because we don't need
> > > to preserve the device capabilities since we are hotplugging.
> > Exactly. This is the issue.
so today in an openstack cloud you, as a normal user, in general should not know the vendor of the nic that is provided.
for vdpa that is even more important, as vdpa is meant to abstract that and provide a standardised virtio interface to the guest.
there are attributes of the virtio interface that obviously need to match, like the vq_size, but we should be able to
live migrate from a vdpa device created on a connectx6-dx vf to one exposed by an intel iov device without the guest knowing
that we have changed the backend.
> > >
> > > with vdpa we will at least have a virtual virtio-net-pci frontend in qemu to provide some level of abstraction.
> > > i guess the point you are raising is that for live migration we can't start with 4 queue pairs and vq_size=256
> > > and select a device with 2 queue pairs and a vq_size of 512 and expect that to just work.
> > Not exactly, the vq_size comes from QEMU and has nothing to do with the
> > vDPA tool. And live migrating from 4 queue pairs to 2 queue pairs won't
> > work for the guest driver. Change of queue pair numbers would need
> > device reset which won't happen transparently during live migration.
> > Basically libvirt has to match the exact queue pair number and queue
> > length on destination node.
> >
> > >
> > > There are two ways to address that: 1) we can start recording this info in our db and schedule only to hosts with the same
> > > configuration values, or 2) we can record the capabilities, i.e. the max values that are supported by a device, and schedule to a host
> > > where they are >= the current values, and rely on libvirt to reconfigure the device.
> > >
> > > libvirt requires very little input today to consume a vdpa interface
> > > https://libvirt.org/formatdomain.html#vdpa-devices
>
> So a question here: if we need to create vDPA on demand (e.g. with the
> features and configs from the source), who will do the provisioning? Is it
> libvirt?
for mdevs we directly write to /sys. we decided not to support mdevctl as it was not vendor neutral at the time, or packaged in
distros other than fedora. since then, libvirt has added the ability to create mdevs via its nodedev api. if that had existed
at the time we probably would have used it instead. so if we were to support dynamic vdpa creation we would have two options:

1.) wrap calls to the vdpa cli into privileged functions executed in our privilege separation daemon
2.) use a libvirt provided api, likely an extension to the nodedev api like the mdev one.

timing would be a factor, but if libvirt supported the capability when we started working on support, i don't see why we would
bypass it and do it ourselves. if libvirt did not want to support this we would fall back to option 1.

as noted above, the openstack team at redhat will not have capacity to consume this ongoing work in the kernel/libvirt
until q2 next year at the earliest, so we likely won't make any decision in this regard until then.

>
> Thanks
>
> > > there are some generic virtio device options we could set https://libvirt.org/formatdomain.html#virtio-related-options
> > > and some generic options, like the mtu, that the interface element supports,
> > >
> > > but the minimal valid xml snippet is literally just the source dev path.
> > >
> > > <devices>
> > > <interface type='vdpa'>
> > > <source dev='/dev/vhost-vdpa-0'/>
> > > </interface>
> > > </devices>
> > >
> > > nova only adds the mac address and MTU today, although i have some untested code that will try to also set the vq size.
> > > https://github.com/openstack/nova/blob/11cb31258fa5b429ea9881c92b2d745fd127cdaf/nova/virt/libvirt/designer.py#L154-L167
> > >
> > > The basic support we have today assumes, however, that the vq_size is either the same on all hosts or it does not matter, because we do
> > > not support transparent live migration today, so it's ok for it to change from host to host.
> > > in any case, we do not track the vq_size or vq count today, so we can't schedule based on it or communicate it to libvirt via our
> > > pre_live_migration rpc result. that means libvirt should check if the dest device has the same config, or update it if possible,
> > > before starting the destination qemu instance and beginning the migration.
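For illustration, a minimal sketch (not nova's actual designer code, which is linked above) of building that interface snippet with the optional mac, mtu and tentative queue-size attributes might look like this; the `rx_queue_size`/`tx_queue_size` driver attributes are the ones libvirt exposes for virtio queue length:

```python
# Sketch of generating a libvirt <interface type='vdpa'> snippet with the
# optional attributes discussed in this thread. Not nova's real code.
import xml.etree.ElementTree as ET

def vdpa_interface_xml(dev_path, mac=None, mtu=None, queue_size=None):
    iface = ET.Element("interface", type="vdpa")
    ET.SubElement(iface, "source", dev=dev_path)
    if mac:
        ET.SubElement(iface, "mac", address=mac)
    if mtu:
        ET.SubElement(iface, "mtu", size=str(mtu))
    if queue_size:
        # libvirt's driver attributes for virtio rx/tx queue length
        ET.SubElement(iface, "driver",
                      rx_queue_size=str(queue_size),
                      tx_queue_size=str(queue_size))
    return ET.tostring(iface, encoding="unicode")

print(vdpa_interface_xml("/dev/vhost-vdpa-0", mac="b5:bc:2e:e7:51:ee"))
```

With only the source path given, this reduces to the minimal valid snippet quoted above.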
> > >
> > > > > This will ease the orchestration software implementation
> > > > > so that it doesn't have to keep track of vdpa config change, or have
> > > > > to persist vdpa attributes across failure and recovery, in fear of
> > > > > being killed due to accidental software error.
> > > the vdpa device config is not something we do today, so this would make our lives more complex
> > It's a question of which use cases to support or not. These configs well
> > existed before my change.
> >
> > > depending on
> > > what that info is. at least in the case of nova we do not use the vdpa cli at all; we use libvirt as an indirection layer.
> > > so libvirt would need to support this interface, we would have to then add it to our db, and modify our rpc interface
> > > to then update the libvirt xml with additional info we don't need today.
> >
> > Yes. You can follow libvirt when the corresponding support is done, but
> > I think it's orthogonal to my changes. Basically my change won't
> > affect libvirt's implementation at all.
> >
> > Thanks,
> > -Siwei
> >
> >
> > > > > In this series, the initial device config for vdpa creation will be
> > > > > exported via the "vdpa dev show" command.
> > > > > This is unlike the "vdpa
> > > > > dev config show" command that usually goes with the live value in
> > > > > the device config space, which is not reliable subject to the dynamics
> > > > > of feature negotiation and possible change in device config space.
> > > > >
> > > > > Examples:
> > > > >
> > > > > 1) Create vDPA by default without any config attribute
> > > > >
> > > > > $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
> > > > > $ vdpa dev show vdpa0
> > > > > vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> > > > > $ vdpa dev -jp show vdpa0
> > > > > {
> > > > > "dev": {
> > > > > "vdpa0": {
> > > > > "type": "network",
> > > > > "mgmtdev": "pci/0000:41:04.2",
> > > > > "vendor_id": 5555,
> > > > > "max_vqs": 9,
> > > > > "max_vq_size": 256,
> > > > > }
> > > > > }
> > > > > }
> > > This is how openstack works today. this step is done statically at boot time, typically via a udev script or systemd service file.
> > > the mac address is updated on the vdpa interface by libvirt when it's assigned to the qemu process.
> > > if we wanted to support multi queue or vq size configuration it would also happen at that time, not during device creation.
> > > > > 2) Create vDPA with config attribute(s) specified
> > > > >
> > > > > $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
> > > > > mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> > > > > $ vdpa dev show
> > > > > vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> > > > > mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> > > > > $ vdpa dev -jp show
> > > > > {
> > > > > "dev": {
> > > > > "vdpa0": {
> > > > > "type": "network",
> > > > > "mgmtdev": "pci/0000:41:04.2",
> > > > So "mgmtdev" looks not necessary for live migration.
> > > >
> > > > Thanks
> > > >
> > > > > "vendor_id": 5555,
> > > > > "max_vqs": 9,
> > > > > "max_vq_size": 256,
> > > > > "mac": "e4:11:c6:d3:45:f0",
> > > > > "max_vq_pairs": 4
> > > > > }
> > > > > }
> > > > > }
> > > dynamically creating vdpa devices at runtime, while possible, is not an approach we are planning to support.
> > >
> > > currently in nova we prefer to do allocation of statically provisioned resources in nova.
> > > for persistent memory, sriov/pci passthrough, dedicated cpus, hugepages and vdpa devices, we manage inventories
> > > of resources that the operator has configured on the platform.
> > >
> > > we have one exception to this static approach, which is semi dynamic: that is how we manage vfio mediated devices.
> > > for reasons that are not important we currently track the parent devices that are capable of providing mdevs,
> > > and we directly write to /sys/... to create the mdev instance of a requested mdev type on demand.
> > >
> > > This has proven to be quite problematic, as we have encountered caching bugs due to the delay between device
> > > creation and when the /sys interface exposes the directory structure for the mdev. This has led to libvirt, and as a result
> > > nova, getting out of sync with the actual state of the host. There are also issues with host reboots.
> > >
> > > while we do see the advantage of being able to create vdpa interfaces on demand, especially if we can do finer grained resource
> > > partitioning by allocating one mdev with 4 vqs and another with 8, etc., our experience with dynamic mdev management gives us
> > > pause. we can and will fix our bugs with mdevs, but we have found that most of our customers that use features like this
> > > are telcos or other similar industries that typically have very static workloads. while there is some interest in making
> > > their clouds more dynamic, they typically fill a host and run the same workload on that host for months to years at a
> > > time, and plan their hardware accordingly, so they are well served by the static usecase "1) Create vDPA by default without any config attribute".
> > >
> > > > > ---
> > > > >
> > > > > Si-Wei Liu (4):
> > > > > vdpa: save vdpa_dev_set_config in struct vdpa_device
> > > > > vdpa: pass initial config to _vdpa_register_device()
> > > > > vdpa: show dev config as-is in "vdpa dev show" output
> > > > > vdpa: fix improper error message when adding vdpa dev
> > > > >
> > > > > drivers/vdpa/ifcvf/ifcvf_main.c | 2 +-
> > > > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
> > > > > drivers/vdpa/vdpa.c | 63 +++++++++++++++++++++++++++++++++---
> > > > > drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 2 +-
> > > > > drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
> > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
> > > > > drivers/vdpa/virtio_pci/vp_vdpa.c | 3 +-
> > > > > include/linux/vdpa.h | 26 ++++++++-------
> > > > > 8 files changed, 80 insertions(+), 22 deletions(-)
> > > > >
> > > > > --
> > > > > 1.8.3.1
> > > > >
> >
>

2022-10-21 07:34:27

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 0/4] vDPA: dev config export via "vdpa dev show" command

On Wed, Oct 19, 2022 at 5:09 PM Sean Mooney <[email protected]> wrote:
>
> On Tue, 2022-10-18 at 15:59 +0800, Jason Wang wrote:
> > On Tue, Oct 18, 2022 at 7:35 AM Si-Wei Liu <[email protected]> wrote:
> > >
> > >
> > >
> > > On 10/17/2022 5:28 AM, Sean Mooney wrote:
> > > > On Mon, 2022-10-17 at 15:08 +0800, Jason Wang wrote:
> > > > > Adding Sean and Daniel for more thoughts.
> > > > >
> > > > > On Sat, Oct 15, 2022 at 9:33 AM Si-Wei Liu <[email protected]> wrote:
> > > > > > Live migration of vdpa would typically require re-instate vdpa
> > > > > > device with an identical set of configs on the destination node,
> > > > > > same way as how source node created the device in the first place.
> > > > > >
> > > > > > In order to allow live migration orchestration software to export the
> > > > > > initial set of vdpa attributes with which the device was created, it
> > > > > > will be useful if the vdpa tool can report the config on demand with
> > > > > > simple query.
> > > > > For live migration, I think the management layer should have this
> > > > > knowledge and they can communicate directly without bothering the vdpa
> > > > > tool on the source. If I'm not wrong, this is the way libvirt is
> > > > > doing it now.
> > > > At least from an openstack (nova) perspective we are not expecting to do any vdpa device configuration
> > > > at the openstack level. To use a vdpa device in openstack, the operator, when installing openstack,
> > > > needs to create a udev/systemd script to pre-create the vdpa devices.
> > > This seems to correlate vdpa device creation with the static allocation
> > > of SR-IOV VF devices. Perhaps OpenStack doesn't have a plan to support
> > > dynamic vdpa creation, but conceptually vdpa creation can be on demand
> > > for e.g. over Mellanox SubFunction or Intel Scalable IOV device.
> >
> > Yes, it's not specific to vDPA but something that openstack needs to consider.
>
> yes, so before i joined redhat in 2018 i worked at intel and was trying to integrate their mdev based
> solution for a nic that got canceled due to the changes required to adopt vdpa and some other issues.
> that original approach was going to dynamically create the mdev. for vdpa we built the support on top of
> our pci manager, so the current support directly requires the vdpa device's parent to be a vf.
>
> we did this for two reasons. first, when i started working on this again in 2020 that was the only model that
> worked with any nic that was available, and at the time mdevs were automatically created when you allocated the
> vf, since this predated the newer model of using the vdpa cli to add devices. secondly, building on top of the
> sriov/pci passthrough framework was less invasive. with openstack we have currently only tested our vdpa support
> with mellanox connectx6-dx cards, which at the time i added support only supported vdpa device creation on vfs,
> not subfunctions.
>
> we are aware that vdpa devices can be created over a Mellanox SubFunction or Intel Scalable IOV device,
> but we do not plan to support either in the near term. i'm the primary person who has been working on the
> vdpa support in openstack, but i will be working on some other work for the next 6 months or so, so
> it's unlikely we will look at extending nova's vdpa capabilities until our upstream b cycle, which starts in march next year.

Ok, I see. But note that cx6-dx has been switched to use the vdpa tool, so
the vDPA device is no longer automatically created after the VF is
probed.

Actually, there's a third choice: openstack can still choose to
"statically" provision the vDPA instances during boot by just
calling the vdpa tool to create the instances.
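A sketch of that third choice: an operator-supplied declarative config is turned into the `vdpa dev add` invocations that a boot-time script or systemd unit (hypothetical, not an existing openstack component) would run once, preserving the static model while going through the vdpa tool:

```python
# Sketch of "static" provisioning through the vdpa tool: a declarative
# operator config is expanded into the boot-time 'vdpa dev add' commands.
# The config format here is an illustration, not an existing interface.
STATIC_CONFIG = [
    {"name": "vdpa0", "mgmtdev": "pci/0000:41:04.2",
     "mac": "e4:11:c6:d3:45:f0", "max_vq_pairs": 4},
    {"name": "vdpa1", "mgmtdev": "pci/0000:41:04.3"},
]

def provision_commands(config):
    cmds = []
    for dev in config:
        cmd = ["vdpa", "dev", "add",
               "mgmtdev", dev["mgmtdev"], "name", dev["name"]]
        for attr in ("mac", "mtu", "max_vq_pairs"):
            if attr in dev:
                cmd += [attr, str(dev[attr])]
        cmds.append(" ".join(cmd))
    return cmds

for c in provision_commands(STATIC_CONFIG):
    print(c)
```

From nova's point of view the devices then look exactly like the udev/systemd pre-created ones it already supports.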

>
> conceptually speaking we have two concerns with the dynamic approach. nova, for better or worse, effectively runs with
> full root access, although that is confined with selinux/apparmor policies and/or with containers in most production
> installations.

I think the situation should be no different from the case of, e.g.,
using iproute2 to create/destroy/configure networking devices, e.g.
when we need to create a bond or configure a bridge/ovs. Can we keep
using the way we've used for iproute2?

> even so, we try to limit privileged calls, and in the past we did not want nova specifically
> to make privileged calls to reconfigure hardware. we already have precedent for breaking that in the form of our
> generic mdev support, where we are passed a list of parent devices and the mdev type to create on each device, and then
> dynamically advertise pools of available mdevs which we create dynamically. the second concern is we have seen caching issues
> with libvirt when devices change dynamically, as it does not always process the udev events correctly. as such, our experience
> tells us that we are less likely to have caching related bugs if we take a static approach.

I guess these might be bugs which need to be fixed in the kernel instead
of trying to work around them in userspace, but anyhow, we can still
emulate the "static" provisioning via the vdpa tool; it seems smoother
than an immediate switch to the dynamic one, and then we will have
time to fix the dynamic things.

>
> Enabling dynamic vdpa creation would require a large rewrite of how vdpa has been integrated in nova and neutron (openstack's networking component).
> right now, when you create a neutron port of type vdpa, we internally convert that to a pci_request of type vdpa, which
> our pci_manager matches to a db record for the precreated vdpa devices. we would likely need a new vdpa-dynamic or similar
> port type that would correspond to the subfunction/iov backed vdpa devices, and we would need to model them differently in our database.

Yes, so this needs to be considered together with mdev/SIOV/SF; they
basically have the same model, more or less.

>
> for sr-iov vfs the consumable, and therefore schedulable, resource is the vf, so we have pools of vfs with some metadata to map them to logical
> networks in the case of a nic. for a subfunction/iov backed device there are presumably two logical consumable resources: the number of hardware
> vq pairs, and presumably a finite number of iov/subfunctions a pf can allocate.

This reminds me that the current vDPA tool cannot report the number
of available vDPA instances that could be created by a mgmtdev; this
needs to be fixed.

> i'm sure there are other qualitative aspects we would like
> to schedule on, but those are the main consumables that will determine how many vdpa devices we can create when selecting a host.
>
> we/i intentionally put this out of scope for the initial vdpa support, as i had no way to test this back in 2020, partly because we have a policy
> of not supporting test-only code paths, i.e. adding support for the vdpa_sim kernel module was discussed and rejected because it does not create
> pci vfs for the vdpa devices and would have prevented using the pci manager, or at least required us to fake the pci address info using a sentinel like
> all 0. if we did not need to support dynamic creation, then using a sentinel pci address would actually be a quick way to add iov/subfunction support,
> but it would still require static creation and a slight tweak to how we discover the vdpa devices from libvirt.

Right, but for now, we have plenty of devices, e.g. we can create vDPA
on top of a subfunction (and vDPA seems to be the first of several
device types that could be created via SF).

>
> so to summarise: supporting vdpa with iov/subfunctions as the backend device would be relatively simple if they are precreated; we can
> fake the entry in the pci tracker in that case, using a sentinel address or another tag in the db row to denote they are not vf backed.
> this would not require any change to neutron, although we would likely also need to modify os-vif to look up the representor netdev which we
> attach to openvswitch, to account for the fact that it's an iov/subfunction instead of a vf parent.
> the dynamic approach
>
> >
> > >
> > > >
> > > > nova will query libvirt for the list of available vdpa devices at start up and record them in our database.
> > > > when scheduling, we select a host that has a free vdpa device, and on that host we generate an xml snippet
> > > > that references the vdpa device and provide that to libvirt, and it will in turn program the mac.
> > > >
> > > > """
> > > > <interface type="vdpa">
> > > > <mac address="b5:bc:2e:e7:51:ee"/>
> > > > <source dev="/dev/vhost-vdpa-3"/>
> > > > </interface>
> > > > """
> > > >
> > > > when live migrating, the workflow is similar. we ask our scheduler for a host that should have enough available
> > > > resources, then we make an rpc call "pre_live_migrate" which makes a number of assertions, such as cpu compatibility,
> >
> > A migration compatibility check for vDPA should be done as well here.
> yes, it could be, although at this point we have already completed scheduling, so if the check fails we will abort the migration.
> presumably libvirt will also check after we invoke the migration, but the ideal case would be that we have enough
> info in our db to select a host we know will work rather than checking after the fact.
> >
> > > > but also computes cpu pinning and device passthrough assignments, i.e. in pre_live_migrate we select which cpu cores, pcie
> > > > devices, and in this case vdpa devices, to use on the destination host
> > > In the case of vdpa, does it (the pre_live_migrate rpc) now just select
> > > the parent mgmtdev for creating vdpa in a later phase, or does it end up
> > > with a vdpa device being created? Note that by now there are only a few
> > > properties for vdpa creation, e.g. mtu and mac, so it doesn't need
> > > special reservation of resources for creating a vdpa device. But that
> > > may well change in the future.
> > >
> > > > and return that in our rpc result.
> > > >
> > > > we then use that information to update the libvirt domain xml with the new host-specific information and start
> > > > the migration at the libvirt level.
> > > >
> > > > today in openstack we use a hack i came up with to work around the fact that you can't migrate with sriov/pci passthrough
> > > > devices, in order to support live migration with vdpa. basically, before we call libvirt to live migrate, we hot unplug the vdpa nics
> > > > from the guest and add them back after the migration is complete. if you don't bond the vdpa nics with a transparently migratable
> > > > nic in the guest, that obviously results in a loss of network connectivity while the migration is happening, which is not ideal,
> > > > so a normal virtio-net interface on ovs is what we recommend as the fallback interface for the bond.
> > > Do you need to preserve the mac address when falling back to the normal
> > > virtio-net interface, and similarly any other network config/state?
> > > Basically vDPA doesn't support live migration for the moment.
> >
> > Basic shadow vq based live migration can work now. Eugenio is working
> > to make it fully ready in the near future.
> this missed our merge window for the Zed cycle, which hit feature freeze in september and was released a week or two ago.

I see.

>
> >
> > > This
> > > doesn't look like a technically correct solution to make it work.
> >
> > I agree.
> this is what we added for sriov VF direct passthrough.
> all you need to do in the guest is use the linux kernel bond driver to create a bond between the vf and another live migratable port type.
> in the basic test i did for this, i added the mac of the VF to the allowed address pairs for the ovs port and used that as the mac of the vf.
> this is all out of scope for openstack to configure and manage. we are not using the virtio fallback support that was added after we implemented this,
> as we don't have a good way today to describe the grouping of the primary and fallback interfaces in the neutron api.
>
> the details of what we do are here: https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvirt-neutron-sriov-livemigration.html
>
> we only do automatic hotplug for nics, and only if they are directly attached; if a macvtap is used to attach the vf to the vm we do not remove them.
> our intent for vdpa was to likewise not remove them and rely on the native live migration support when we detect that the source and destination hosts
> support it.

Yes, it should work for vDPA as well, but what we want to say is that
vDPA will support native live migration. And vDPA is more than just a
networking device. It might not be easy to find a similar technology
like bond/multipath in other types of devices.

>
> >
> > > >
> > > > obviously when vdpa supports transparent live migration we can just skip this workaround, which would be a very nice ux improvement.
> > > > one of the side effects of the hack however is that you can start with an intel nic and end up with a mellanox nic, because we don't need
> > > > to preserve the device capabilities since we are hotplugging.
> > > Exactly. This is the issue.
> so today in an openstack cloud, as a normal user you should in general not know the vendor of the nic that is provided.
> for vdpa that is even more important, as vdpa is meant to abstract that and provide a standardised virtio interface to the guest.
> there are attributes of the virtio interface that obviously need to match, like the vq_size, but we should be able to
> live migrate from a vdpa device created on a connectx6-dx vf to one exposed by an intel SIOV device without the guest knowing
> that we have changed the backend.

Yes.

> > > >
> > > > with vdpa we will at least have a virtual virtio-net-pci frontend in qemu to provide some level of abstraction.
> > > > i guess the point you are raising is that for live migration we can't start with 4 queue pairs and vq_size=256
> > > > and select a device with 2 queue pairs and a vq_size of 512 and expect that to just work.
> > > Not exactly, the vq_size comes from QEMU that has nothing to do with
> > > vDPA tool. And live migrating from 4 queue pairs to 2 queue pairs won't
> > > work for the guest driver. Change of queue pair numbers would need
> > > device reset which won't happen transparently during live migration.
> > > Basically libvirt has to match the exact queue pair number and queue
> > > length on destination node.
> > >
> > > >
> > > > There are two ways to address that: 1) we can start recording this info in our db and schedule only to hosts with the same
> > > > configuration values, or 2) we can record the capabilities, i.e. the max values supported by a device, schedule to a host
> > > > where they are >= the current values, and rely on libvirt to reconfigure the device.
> > > >
> > > > libvirt requires very little input today to consume a vdpa interface:
> > > > https://libvirt.org/formatdomain.html#vdpa-devices
> >
> > So a question here, if we need to create vDPA on demand (e.g with the
> > features and configs from the source) who will do the provision? Is it
> > libvirt?
> for mdevs we directly write to /sys. we decided not to support mdevctl as, at the time, it was not vendor neutral or packaged in
> distros other than fedora. since then libvirt has added the ability to create mdevs via its nodedev api.

Good to know that, then it looks like libvirt should be in charge of this.

> if that had existed
> at the time we probably would have used that instead. so if we were to support dynamic vdpa creation we would have two options:
>
> 1.) wrap calls to the vdpa cli in privileged functions executed in our privilege separation daemon
> 2.) use a libvirt-provided api, likely an extension to the nodedev api like the mdev one.
>
> timing would be a factor, but if libvirt supported the capability when we started working on support i don't see why we would
> bypass it and do it ourselves. if libvirt did not want to support this we would fall back to option 1.
>
> as noted above, the openstack team at redhat will not have capacity to consume this ongoing work in the kernel/libvirt
> until q2 next year at the earliest, so we likely won't make any decision in this regard until then.

I see, then we can see if it would be interesting for other vendors to
implement.

Thanks

>
> >
> > Thanks
> >
> > > > there are some generic virtio device options we could set (https://libvirt.org/formatdomain.html#virtio-related-options)
> > > > and some generic options, like the mtu, that the interface element supports,
> > > >
> > > > but the minimal valid xml snippet is literally just the source dev path:
> > > >
> > > > <devices>
> > > > <interface type='vdpa'>
> > > > <source dev='/dev/vhost-vdpa-0'/>
> > > > </interface>
> > > > </devices>
> > > >
> > > > nova only adds the mac address and MTU today, although i have some untested code that will try to also set the vq size:
> > > > https://github.com/openstack/nova/blob/11cb31258fa5b429ea9881c92b2d745fd127cdaf/nova/virt/libvirt/designer.py#L154-L167
> > > >
> > > > The basic support we have today assumes, however, that the vq_size is either the same on all hosts or that it does not matter, because we do
> > > > not support transparent live migration today, so it's ok for it to change from host to host.
> > > > in any case we do not track the vq_size or vq count today, so we can't schedule based on them or communicate them to libvirt via our
> > > > pre_live_migration rpc result. that means libvirt should check whether the dest device has the same config, or update it if possible,
> > > > before starting the destination qemu instance and beginning the migration.
> > > >
> > > > > > This will ease the orchestration software implementation
> > > > > > so that it doesn't have to keep track of vdpa config change, or have
> > > > > > to persist vdpa attributes across failure and recovery, in fear of
> > > > > > being killed due to accidental software error.
> > > > configuring the vdpa device is not something we do today, so this would make our lives more complex
> > > It's a question of which use cases to support. These configs
> > > existed well before my change.
> > >
> > > > depending on
> > > > what that info is. at least in the case of nova we do not use the vdpa cli at all; we use libvirt as an indirection layer.
> > > > so libvirt would need to support this interface, we would then have to add it to our db and modify our RPC interface
> > > > to update the libvirt xml with additional info we don't need today.
> > >
> > > Yes. You can follow libvirt when the corresponding support is done, but
> > > I think it's orthogonal with my changes. Basically my change won't
> > > affect libvirt's implementation at all.
> > >
> > > Thanks,
> > > -Siwei
> > >
> > >
> > > > > > In this series, the initial device config for vdpa creation will be
> > > > > > exported via the "vdpa dev show" command.
> > > > > > This is unlike the "vdpa
> > > > > > dev config show" command that usually goes with the live value in
> > > > > > the device config space, which is not reliable subject to the dynamics
> > > > > > of feature negotiation and possible change in device config space.
> > > > > >
> > > > > > Examples:
> > > > > >
> > > > > > 1) Create vDPA by default without any config attribute
> > > > > >
> > > > > > $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
> > > > > > $ vdpa dev show vdpa0
> > > > > > vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> > > > > > $ vdpa dev -jp show vdpa0
> > > > > > {
> > > > > > "dev": {
> > > > > > "vdpa0": {
> > > > > > "type": "network",
> > > > > > "mgmtdev": "pci/0000:41:04.2",
> > > > > > "vendor_id": 5555,
> > > > > > "max_vqs": 9,
> > > > > > "max_vq_size": 256,
> > > > > > }
> > > > > > }
> > > > > > }
> > > > This is how openstack works today. this step is done statically at boot time, typically via a udev script or systemd service file.
> > > > the mac address is updated on the vdpa interface by libvirt when it's assigned to the qemu process.
> > > > if we wanted to support multi-queue or vq size configuration it would also happen at that time, not during device creation.
> > > > > > 2) Create vDPA with config attribute(s) specified
> > > > > >
> > > > > > $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
> > > > > > mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> > > > > > $ vdpa dev show
> > > > > > vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> > > > > > mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> > > > > > $ vdpa dev -jp show
> > > > > > {
> > > > > > "dev": {
> > > > > > "vdpa0": {
> > > > > > "type": "network",
> > > > > > "mgmtdev": "pci/0000:41:04.2",
> > > > > So "mgmtdev" looks not necessary for live migration.
> > > > >
> > > > > Thanks
> > > > >
> > > > > > "vendor_id": 5555,
> > > > > > "max_vqs": 9,
> > > > > > "max_vq_size": 256,
> > > > > > "mac": "e4:11:c6:d3:45:f0",
> > > > > > "max_vq_pairs": 4
> > > > > > }
> > > > > > }
> > > > > > }
> > > > dynamically creating vdpa devices at runtime, while possible, is not an approach we are planning to support.
> > > >
> > > > currently in nova we prefer to do allocation of statically provisioned resources.
> > > > for persistent memory, sriov/pci passthrough, dedicated cpus, hugepages and vdpa devices we manage inventories
> > > > of resources that the operator has configured on the platform.
> > > >
> > > > we have one exception to this static approach which is semi-dynamic: how we manage vfio mediated devices.
> > > > for reasons that are not important, we currently track the parent devices that are capable of providing MDEVs
> > > > and we directly write to /sys/... to create the mdev instance of a requested mdev type on demand.
> > > >
> > > > This has proven to be quite problematic, as we have encountered caching bugs due to the delay between device
> > > > creation and when the /sys interface exposes the directory structure for the mdev. This has led to libvirt, and as a result
> > > > nova, getting out of sync with the actual state of the host. There are also issues with host reboots.
> > > >
> > > > while we do see the advantage of being able to create vdpa interfaces on demand, especially if we can do finer-grained resource
> > > > partitioning by allocating one device with 4 vqs and another with 8, etc., our experience with dynamic mdev management gives us
> > > > pause. we can and will fix our bugs with mdevs, but we have found that most of our customers that use features like this
> > > > are telcos or other similar industries that typically have very static workloads. while there is some interest in making
> > > > their clouds more dynamic, they typically fill a host and run the same workload on it for months to years at a
> > > > time, and plan their hardware accordingly, so they are well served by the static use case "1) Create vDPA by default without any config attribute".
> > > >
> > > > > > ---
> > > > > >
> > > > > > Si-Wei Liu (4):
> > > > > > vdpa: save vdpa_dev_set_config in struct vdpa_device
> > > > > > vdpa: pass initial config to _vdpa_register_device()
> > > > > > vdpa: show dev config as-is in "vdpa dev show" output
> > > > > > vdpa: fix improper error message when adding vdpa dev
> > > > > >
> > > > > > drivers/vdpa/ifcvf/ifcvf_main.c | 2 +-
> > > > > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
> > > > > > drivers/vdpa/vdpa.c | 63 +++++++++++++++++++++++++++++++++---
> > > > > > drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 2 +-
> > > > > > drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
> > > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
> > > > > > drivers/vdpa/virtio_pci/vp_vdpa.c | 3 +-
> > > > > > include/linux/vdpa.h | 26 ++++++++-------
> > > > > > 8 files changed, 80 insertions(+), 22 deletions(-)
> > > > > >
> > > > > > --
> > > > > > 1.8.3.1
> > > > > >
> > >
> >
>

2022-10-21 12:22:17

by Sean Mooney

Subject: Re: [PATCH 0/4] vDPA: dev config export via "vdpa dev show" command

On Fri, 2022-10-21 at 15:13 +0800, Jason Wang wrote:
> On Wed, Oct 19, 2022 at 5:09 PM Sean Mooney <[email protected]> wrote:
> >
> > On Tue, 2022-10-18 at 15:59 +0800, Jason Wang wrote:
> > > On Tue, Oct 18, 2022 at 7:35 AM Si-Wei Liu <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > On 10/17/2022 5:28 AM, Sean Mooney wrote:
> > > > > On Mon, 2022-10-17 at 15:08 +0800, Jason Wang wrote:
> > > > > > Adding Sean and Daniel for more thoughts.
> > > > > >
> > > > > > On Sat, Oct 15, 2022 at 9:33 AM Si-Wei Liu <[email protected]> wrote:
> > > > > > > Live migration of vdpa would typically require re-instate vdpa
> > > > > > > device with an idential set of configs on the destination node,
> > > > > > > same way as how source node created the device in the first place.
> > > > > > >
> > > > > > > In order to allow live migration orchestration software to export the
> > > > > > > initial set of vdpa attributes with which the device was created, it
> > > > > > > will be useful if the vdpa tool can report the config on demand with
> > > > > > > simple query.
> > > > > > For live migration, I think the management layer should have this
> > > > > > knowledge and they can communicate directly without bothering the vdpa
> > > > > > tool on the source. If I'm not wrong, this is the way libvirt
> > > > > > does it now.
> > > > > At least from an openstack (nova) perspective we are not expecting to do any vdpa device configuration
> > > > > at the openstack level. To use a vdpa device in openstack, the operator, when installing openstack,
> > > > > needs to create a udev/systemd script to precreate the vdpa devices.
> > > > This seems to correlate vdpa device creation with the static allocation
> > > > of SR-IOV VF devices. Perhaps OpenStack doesn't have a plan to support
> > > > dynamic vdpa creation, but conceptually vdpa creation can be on demand,
> > > > e.g. over a Mellanox SubFunction or an Intel Scalable IOV device.
> > >
> > > Yes, it's not specific to vDPA but something that openstack needs to consider.
> >
> > yes, so before i joined redhat in 2018 i worked at intel and was trying to integrate their mdev-based
> > solution for a nic that got canceled due to the changes required to adopt vdpa and some other issues.
> > that original approach was going to dynamically create the mdev. for vdpa we built the support on top of
> > our pci manager, so the current support directly requires the vdpa device's parent to be a vf.
> >
> > we did this for two reasons. first, when i started working on this again in 2020, that was the only model that
> > worked with any nics that were available, and at the time vdpa devices were automatically created when you allocated the
> > vf, since this predated the newer model of using the vdpa cli to add devices. secondly, building on top of the
> > sriov/pci passthrough framework was less invasive. with openstack we have currently only tested our vdpa support
> > with mellanox connectx6-dx cards, which at the time i added support only supported vdpa device creation on vfs,
> > not subfunctions.
> >
> > we are aware that vdpa devices can be created over a Mellanox SubFunction or an Intel Scalable IOV device,
> > but we do not plan to support either in the near term. i'm the primary person who has been working on the openstack
> > support for vdpa, but i will be working on some other work for the next 6 months or so;
> > it's unlikely we will look at extending nova's vdpa capabilities until our upstream B cycle, which starts in march next year.
>
> Ok, I see. But note that cx6-dx has been switched to use the vdpa tool, so
> the vDPA device is no longer automatically created after the VF is
> probed.
yes, we had to rewrite our installer support for that, but the effect is the same:
nova expects the devices to exist before nova is started.
we now use udev rules to allocate the VFs and then call the vdpa tool to create the vdpa devices on system boot.
it was not a large change since we already used udev rules to create the VFs.
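to sketch what that boot-time provisioning looks like (the unit file, interface name, pci address and VF count below are illustrative placeholders, not our actual installer output; only the `vdpa dev add` syntax is the real CLI):

```ini
# hypothetical /etc/systemd/system/vdpa-provision.service
[Unit]
Description=Statically pre-create vdpa devices at boot
Before=libvirtd.service

[Service]
Type=oneshot
# allocate the VFs on the PF (as our udev rules already did)
ExecStart=/bin/sh -c 'echo 4 > /sys/class/net/enp65s0f0/device/sriov_numvfs'
# then create one vdpa device per VF with the vdpa tool
ExecStart=/usr/sbin/vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0

[Install]
WantedBy=multi-user.target
```

nova then just discovers the pre-created devices via libvirt at start up, exactly as before.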
>
> Actually, there's a third choice, the openstack can still choose to
> "statically" provision the vDPA instance during the boot by just
> calling the vdpa tool to create the instances.
we could, yes, although that would be a departure from how we normally operate
>
> >
> > conceptually speaking, we have two concerns with the dynamic approach. nova, for better or worse, effectively runs with
> > full root access, although that is confined with selinux/apparmor policies and/or with containers in most production
> > installations.
>
> I think the situation should be no different from the case of e.g.
> using iproute2 to create/destroy/configure networking devices, e.g.
> when we need to create a bond or configure a bridge/ovs. Can we keep
> using the way we've used iproute2?
openstack does not create vfs today, and allocating the vdpa device is similar to that.

we do have a component, "os-vif", which is the virtual interface lib, and it will add/remove ports to ovs or linuxbridge.
it could also be extended to reconfigure the vdpa device, but if we were to do that then libvirt would need to not modify the device.
we do not want to have multiple actors managing the configuration of the same vdpa device.

if we were to add support for SIOV/subfunctions we may delegate the dynamic creation of the vdpa device to os-vif as required.
it would just require extending https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/linux_net.py to add the function to do the device
creation, and a slight change to our object model to be able to pass the SF/SIOV info for the parent device.
we may also need to tweak the capabilities the library has slightly (https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/privsep.py#L23),
although i would hope CAP_NET_ADMIN is sufficient to create vdpa devices, as it would be for invoking iproute2.
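as a rough sketch of what such an os-vif helper might look like (the function name and the privsep wiring are hypothetical, not existing os-vif code; only the `vdpa dev add` syntax is the real CLI):

```python
import shlex

def build_vdpa_add_cmd(mgmtdev, name, mac=None):
    """Build the `vdpa dev add` argv for creating a device on a parent mgmtdev."""
    cmd = ["vdpa", "dev", "add", "mgmtdev", mgmtdev, "name", name]
    if mac is not None:
        cmd += ["mac", mac]
    return cmd

# In os-vif this would be wrapped in a privsep entrypoint so that only the
# privilege separation daemon (holding CAP_NET_ADMIN) executes it, e.g.:
#
#   @privsep.vif_plug.entrypoint          # hypothetical wiring
#   def create_vdpa_dev(mgmtdev, name, mac=None):
#       processutils.execute(*build_vdpa_add_cmd(mgmtdev, name, mac))

if __name__ == "__main__":
    print(shlex.join(build_vdpa_add_cmd("pci/0000:41:04.2", "vdpa0",
                                        mac="e4:11:c6:d3:45:f0")))
```

the unprivileged side only builds the argv; the actual execution stays behind the privsep boundary, same as our existing iproute2 calls.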
>
> > even so, we try to limit privileged calls, and in the past we did not want nova specifically
> > to make privileged calls to reconfigure hardware. we already have precedent for breaking that in the form of our
> > generic mdev support, where we are passed a list of parent devices and the mdev type to create on each device, and then
> > dynamically advertise pools of available mdevs which we create dynamically. The second concern is that we have seen caching issues
> > with libvirt when devices change dynamically, as it does not always process the udev events correctly. as such, our experience
> > tells us that we are less likely to have caching-related bugs if we take a static approach.
>
> I guess those might be bugs which need to be fixed in the kernel instead
> of trying to have workarounds in userspace, but anyhow, we can still
> emulate the "static" provisioning via the vdpa tool; it seems
> smoother than an immediate switch to the dynamic one, and then we will have
> the time to fix the dynamic things.
yes, we can emulate it, and were required to when moving from rhel 8.4 to rhel 9.0

>
> >
> > Enabling dynamic vdpa creation would require a large rewrite of how vdpa has been integrated in nova and neutron (the openstack networking component).
> > right now when you create a neutron port of type vdpa we internally convert that to a pci_request of type vdpa, which
> > our pci_manager matches to a db record for the precreated vdpa devices. we would likely need to have a new vdpa-dynamic or similar
> > port type that would correspond to the subfunction/SIOV-backed vdpa devices, and we would need to model them differently in our database.
>
> Yes, so this needs to be considered together with mdev/SIOV/SF; they
> basically have the same model, more or less.
yes, they are similar. one of the issues is that openstack also has a dedicated accelerator-as-a-service project called cyborg.
cyborg's scope is to provide end-to-end lifecycle management of accelerators like fpgas, gpus or smart nics.
because its scope includes end-to-end lifecycle management, including programming of things like fpgas, its data model is intended for both
static and dynamic device management. From an openstack upstream perspective vdpa really should be managed via cyborg, not nova;
however, the static model was a minor extension to our sriov support, so it was accepted.

unless we can confine the dynamic element to os-vif or libvirt, or otherwise show that it will be a minimal extension to our existing functionality,
it's highly likely that future device management support will be rejected in nova and we would be redirected to cyborg.

if i take my upstream hat off and don my redhat one: from a product perspective we currently do not have a way to ship cyborg in osp 17, which we just
released. we would have to entirely bootstrap the packaging, installation, documentation, QE, and ci/maintenance of cyborg. that would likely take 1-2
years to do properly, meaning it's unlikely to be something we could execute on for osp 18, so any cyborg-based solution would likely take 3+ years to
ship to customers unless we invest a lot to accelerate it.

that does not mean that the kernel community or libvirt should not embrace a more dynamic approach.
we can adapt to it and enable more use cases that way in openstack in the medium term, either via nova/os-vif or cyborg,
and with my redhat hat on we will figure out how to deliver it to our customers one way or another.
>
> >
> > for sr-iov vfs the consumable, and therefore schedulable, resource is the vf, so we have pools of vfs with some metadata to map them to logical
> > networks in the case of a nic. for a subfunction/SIOV-backed device there are presumably two logical consumable resources: the number of hardware
> > vq pairs, and presumably a finite number of SIOV devices/subfunctions a PF can allocate.
>
> This reminds me that the current vDPA tool cannot report the number
> of available vDPA instances that could be created by a mgmtdev; this
> needs to be fixed.
yep, this would be required for us to move to the dynamic approach with SIOV/SF
>
> > i'm sure there are other qualitative aspects we would like
> > to schedule on, but those are the main consumables that will determine how many vdpa devices we can create when selecting a host.
> >
> > we/i intentionally put this out of scope for the initial vdpa support, as i had no way to test it back in 2020, partly because we have a policy
> > of not supporting test-only code paths, i.e. adding support for the vdpa_sim kernel module, which was discussed and rejected because it does not create
> > pci vfs for the vdpa devices and would have prevented using the pci manager, or at least required us to fake the pci address info using a sentinel like
> > all 0s. if we did not need to support dynamic creation then using a sentinel pci address would actually be a quick way to add SIOV/subfunction support,
> > but it would still require the static creation and a slight tweak to how we discover the vdpa devices from libvirt.
>
> Right, but for now, we have plenty of devices, e.g. we can create vDPA
> on top of a subfunction (and vDPA seems to be the first of several
> device types that could be created via SF).
ya, so with a small hack, either setting the pci address to all 0s or using some other reserved value,
we may be able to retrofit SFs into how we track PFs and VFs, and if you pre-create them then that would work in the short term
>
> >
> > so to summarise, supporting vdpa with SIOV/subfunctions as the backend device would be relatively simple if they are precreated; we can
> > fake the entry in the pci tracker in that case, using a sentinel address or other tag in the db row to denote they are not vf-backed.
> > this would not require any change to neutron, although we would likely also need to modify os-vif to look up the representor netdev which we
> > attach to openvswitch, to account for the fact that it's an SIOV/subfunction instead of a vf parent.
> > the dynamic approach
> >
> > >
> > > >
> > > > >
> > > > > nova will query libvirt for the list of available vdpa devices at start up and record them in our database.
> > > > > when scheduling we select a host that has a free vdpa device, and on that host we generate an xml snippet
> > > > > that references the vdpa device and provide that to libvirt, which will in turn program the mac.
> > > > >
> > > > > """
> > > > > <interface type="vdpa">
> > > > > <mac address="b5:bc:2e:e7:51:ee"/>
> > > > > <source dev="/dev/vhost-vdpa-3"/>
> > > > > </interface>
> > > > > """
> > > > >
> > > > > when live migrating the workflow is similar. we ask our scheduler for a host that should have enough available
> > > > > resources, then we make an rpc call "pre_live_migrate" which makes a number of assertions such as cpu compatibility
> > >
> > > A migration compatibility check for vDPA should be done as well here.
> > yes it could be, although at this point we have already completed scheduling, so if the check fails we will abort the migration.
> > presumably libvirt will also check after we invoke the migration, but the ideal case would be that we have enough
> > info in our db to select a host we know will work rather than checking after the fact.
> > >
> > > > > but also computes cpu pinning and device passthrough assignments, i.e. in pre_live_migrate we select which cpu cores, pcie
> > > > > devices and, in this case, vdpa devices to use on the destination host
> > > > In the case of vdpa, does it (the pre_live_migrate rpc) now just select
> > > > the parent mgmtdev for creating vdpa in a later phase, or does it end up with
> > > > a vdpa device being created? Note that by now there are only a few
> > > > properties for vdpa creation, e.g. mtu and mac, so it doesn't need
> > > > special reservation of resources for creating a vdpa device. But that
> > > > may well change in the future.
> > > >
> > > > > and return that in our rpc result.
> > > > >
> > > > > we then use that information to update the libvirt domain xml with the new host-specific information and start
> > > > > the migration at the libvirt level.
> > > > >
> > > > > today in openstack we use a hack i came up with to work around the fact that you can't migrate with sriov/pci passthrough
> > > > > devices, in order to support live migration with vdpa. basically, before we call libvirt to live migrate, we hot unplug the vdpa nics
> > > > > from the guest and add them back after the migration is complete. if you don't bond the vdpa nics with a transparently migratable
> > > > > nic in the guest, that obviously results in a loss of network connectivity while the migration is happening, which is not ideal,
> > > > > so a normal virtio-net interface on ovs is what we recommend as the fallback interface for the bond.
> > > > Do you need to preserve the mac address when falling back to the normal
> > > > virtio-net interface, and similarly any other network config/state?
> > > > Basically vDPA doesn't support live migration for the moment.
> > >
> > > Basic shadow vq based live migration can work now. Eugenio is working
> > > to make it fully ready in the near future.
> > this missed our merge window for the Zed cycle, which hit feature freeze in september and was released a week or two ago.
>
> I see.
>
> >
> > >
> > > > This
> > > > doesn't look like a technically correct solution to make it work.
> > >
> > > I agree.
> > this is what we added for sriov VF direct passthrough.
> > all you need to do in the guest is use the linux kernel bond driver to create a bond between the vf and another live migratable port type.
> > in the basic test i did for this, i added the mac of the VF to the allowed address pairs for the ovs port and used that as the mac of the vf.
> > this is all out of scope for openstack to configure and manage. we are not using the virtio fallback support that was added after we implemented this,
> > as we don't have a good way today to describe the grouping of the primary and fallback interfaces in the neutron api.
> >
> > the details of what we do are here: https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvirt-neutron-sriov-livemigration.html
> >
> > we only do automatic hotplug for nics, and only if they are directly attached; if a macvtap is used to attach the vf to the vm we do not remove them.
> > our intent for vdpa was to likewise not remove them and rely on the native live migration support when we detect that the source and destination hosts
> > support it.
>
> Yes, it should work for vDPA as well, but what we want to say is that
> vDPA will support native live migration.
>
yes, we discussed this internally in the past, and it was the promise of transparent live migration that motivated us to add support
in openstack in 2020. As i noted in conversation with PM, the latest we could really have accommodated enabling transparent live migration for the zed
release was if this had been completed end to end, including libvirt/qemu/kernel releases, by may of this year.
when that passed, i was asked to still provide some support for any type of live migration in zed, so i extended the sriov hotplug support to vdpa
as a temporary measure.
> And vDPA is more than just a
> networking device. It might not be easy to find a similar technology
> like bond/multipath in other type of devices.

yes, eventually we would like to leverage it for virtio-blk devices and/or any other type of device that is consumable via vdpa.
it's not clear that that would be accepted in the nova project; it may need to be enabled in cyborg, but i'm not going to preempt that
discussion before it happens upstream in nova.

we intentionally don't allow this hotplug migration for anything other than a nic, as we knew that if we tried it with a block device or gpu we
would very likely crash the guest workload. live migration in openstack is an admin-only operation, so we felt it was acceptable, though not ideal,
as we document the limitation and block unsafe live migrations if other passthrough devices are present.

once the native support is available in all our dependencies we will deprecate and remove the hotplug support for vdpa and just use the native
support. It's a much better end-user ux and it is the correct long-term solution.

>
> >
> > >
> > > > >
> > > > > obviously when vdpa supports transparent live migration we can just skip this workaround, which would be a very nice ux improvement.
> > > > > one of the side effects of the hack however is that you can start with an intel nic and end up with a mellanox nic, because we don't need
> > > > > to preserve the device capabilities since we are hotplugging.
> > > > Exactly. This is the issue.
> > So today, in an OpenStack cloud, as a normal user you in general should not know the vendor of the NIC that is provided.
> > For vdpa that is even more important, as vdpa is meant to abstract that away and provide a standardized virtio interface to the guest.
> > There are attributes of the virtio interface that obviously need to match, like the vq_size, but we should be able to
> > live migrate from a vdpa device created on a ConnectX-6 Dx VF to one exposed by an Intel IOV device without the guest knowing
> > that we have changed the backend.
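The compatibility requirement described here can be sketched in a few lines of Python. This is an illustration only, not code from the thread: the JSON shape follows the "vdpa dev -jp show" examples in the cover letter, and the set of attributes that must match is an assumption based on this discussion (queue geometry and MAC, but not the management device).

```python
import json

# Attributes the guest-visible virtio interface depends on; the assumption
# here is that these must match between source and destination, while
# mgmtdev (and vendor) may differ when switching NIC vendors.
MUST_MATCH = ("max_vq_pairs", "max_vq_size", "mac")

def migration_compatible(src_json, dst_json, name="vdpa0"):
    """Compare two 'vdpa dev -jp show' outputs for the named device."""
    src = json.loads(src_json)["dev"][name]
    dst = json.loads(dst_json)["dev"][name]
    return all(src.get(k) == dst.get(k) for k in MUST_MATCH)

# Sample outputs modeled on the cover letter; mgmtdev intentionally differs,
# as when moving from a Mellanox VF to an Intel device.
src = '{"dev": {"vdpa0": {"mgmtdev": "pci/0000:41:04.2", "max_vq_size": 256, "max_vq_pairs": 4, "mac": "e4:11:c6:d3:45:f0"}}}'
dst = '{"dev": {"vdpa0": {"mgmtdev": "pci/0000:3b:00.1", "max_vq_size": 256, "max_vq_pairs": 4, "mac": "e4:11:c6:d3:45:f0"}}}'
print(migration_compatible(src, dst))  # True: only mgmtdev differs
```

A real orchestrator would of course read these JSON documents from the vdpa tool on each node rather than from literals.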
>
> Yes.
>
> > > > >
> > > > > With vdpa we will at least have a virtual virtio-net-pci frontend in QEMU to provide some level of abstraction.
> > > > > I guess the point you are raising is that for live migration we can't start with 4 queue pairs and vq_size=256,
> > > > > then select a device with 2 queue pairs and a vq_size of 512, and expect that to just work.
> > > > Not exactly: the vq_size comes from QEMU and has nothing to do with the
> > > > vDPA tool. And live migrating from 4 queue pairs to 2 queue pairs won't
> > > > work for the guest driver. Change of queue pair numbers would need
> > > > device reset which won't happen transparently during live migration.
> > > > Basically libvirt has to match the exact queue pair number and queue
> > > > length on destination node.
> > > >
> > > > >
> > > > > There are two ways to address that: 1) we can start recording this info in our DB and schedule only to hosts with the same
> > > > > configuration values, or 2) we can record the capabilities, i.e. the max values supported by a device, schedule to a host
> > > > > where they are >= the current values, and rely on libvirt to reconfigure the device.
> > > > >
> > > > > libvirt requires very little input today to consume a vdpa interface:
> > > > > https://libvirt.org/formatdomain.html#vdpa-devices
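Option 2 above can be sketched as a simple capability filter. This is an illustration, not OpenStack code; the capability names follow the "vdpa dev show" output from the cover letter.

```python
# A host is eligible if its device capabilities (max values) meet or exceed
# the values the source device was created with; libvirt would then be
# expected to reconfigure the destination device to match the source exactly.
def host_eligible(host_caps, src_config):
    return all(host_caps.get(k, 0) >= v for k, v in src_config.items())

hosts = {
    "hostA": {"max_vqs": 9, "max_vq_size": 256},
    "hostB": {"max_vqs": 4, "max_vq_size": 256},
}
src = {"max_vqs": 8, "max_vq_size": 256}
print([h for h, caps in hosts.items() if host_eligible(caps, src)])  # ['hostA']
```

Option 1 would instead compare for strict equality, trading flexibility in placement for not needing any reconfiguration step.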
> > >
> > > So a question here: if we need to create a vDPA device on demand (e.g. with the
> > > features and configs from the source), who will do the provisioning? Is it
> > > libvirt?
> > For mdevs we directly write to /sys; we decided not to support mdevctl as it was not vendor neutral at the time, nor packaged in
> > distros other than Fedora. Since then libvirt has added the ability to create mdevs via its nodedev API.
>
> Good to know that, then it looks like libvirt should be in charge of this.
>
> > If that had existed
> > at the time, we probably would have used that instead. So if we were to support dynamic vdpa device creation, we would have two options:
> >
> > 1) wrap calls to the vdpa CLI in privileged functions executed in our privilege-separation daemon, or
> > 2) use a libvirt-provided API, likely an extension to the nodedev API like the mdev one.
> >
> > Timing would be a factor, but if libvirt supported the capability when we started working on support, I don't see why we would
> > bypass it and do it ourselves. If libvirt did not want to support this, we would fall back to option 1.
> >
> > As noted above, the OpenStack team at Red Hat will not have capacity to consume this ongoing work in the kernel/libvirt
> > until Q2 next year at the earliest, so we likely won't make any decision in this regard until then.
>
> I see, then we can see if it would be interesting for other vendors to
> implement.
>
> Thanks
>
> >
> > >
> > > Thanks
> > >
> > > > > There are some generic virtio device options we could set (https://libvirt.org/formatdomain.html#virtio-related-options)
> > > > > and some generic options, like the MTU, that the interface element supports,
> > > > >
> > > > > but the minimal valid XML snippet is literally just the source dev path.
> > > > >
> > > > > <devices>
> > > > > <interface type='vdpa'>
> > > > > <source dev='/dev/vhost-vdpa-0'/>
> > > > > </interface>
> > > > > </devices>
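For illustration, a snippet like the one above could be generated programmatically. This sketch is hypothetical: the optional mac element is standard libvirt interface XML, while the queues attribute on the driver element is modeled on the virtio driver options linked above and on what Nova's designer code sets.

```python
import xml.etree.ElementTree as ET

def vdpa_interface_xml(dev_path, mac=None, queues=None):
    # Minimal libvirt <interface type='vdpa'> element; mac and driver/queues
    # are optional extras (assumed shape, modeled on Nova's designer code).
    iface = ET.Element("interface", type="vdpa")
    ET.SubElement(iface, "source", dev=dev_path)
    if mac:
        ET.SubElement(iface, "mac", address=mac)
    if queues:
        ET.SubElement(iface, "driver", queues=str(queues))
    return ET.tostring(iface, encoding="unicode")

print(vdpa_interface_xml("/dev/vhost-vdpa-0"))
```

With only the device path given, this reproduces the minimal snippet quoted above (modulo attribute quoting).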
> > > > >
> > > > > Nova only adds the MAC address and MTU today, although I have some untested code that will try to also set the vq size.
> > > > > https://github.com/openstack/nova/blob/11cb31258fa5b429ea9881c92b2d745fd127cdaf/nova/virt/libvirt/designer.py#L154-L167
> > > > >
> > > > > The basic support we have today assumes, however, that the vq_size is either the same on all hosts or does not matter, because we do
> > > > > not support transparent live migration today, so it is OK for it to change from host to host.
> > > > > In any case, we do not track the vq_size or vq count today, so we can't schedule based on them or communicate them to libvirt via our
> > > > > pre_live_migration RPC result. That means libvirt should check whether the destination device has the same config, or update it if possible,
> > > > > before starting the destination QEMU instance and beginning the migration.
> > > > >
> > > > > > > This will ease the orchestration software implementation
> > > > > > > so that it doesn't have to keep track of vdpa config change, or have
> > > > > > > to persist vdpa attributes across failure and recovery, in fear of
> > > > > > > being killed due to accidental software error.
> > > > > Tracking the vdpa device config is not something we do today, so this would make our lives more complex,
> > > > It's a question of which use cases to support or not. These configs
> > > > existed well before my change.
> > > >
> > > > > depending on
> > > > > what that info is. At least in the case of Nova we do not use the vdpa CLI at all; we use libvirt as an indirection layer.
> > > > > So libvirt would need to support this interface, we would then have to add it to our DB, and we would modify our RPC interface
> > > > > to update the libvirt XML with additional info we don't need today.
> > > >
> > > > Yes. You can follow libvirt when the corresponding support is done, but
> > > > I think it's orthogonal to my changes. Basically my change won't
> > > > affect libvirt's implementation at all.
> > > >
> > > > Thanks,
> > > > -Siwei
> > > >
> > > >
> > > > > > > In this series, the initial device config for vdpa creation will be
> > > > > > > exported via the "vdpa dev show" command.
> > > > > > > This is unlike the "vdpa
> > > > > > > dev config show" command that usually goes with the live value in
> > > > > > > the device config space, which is not reliable subject to the dynamics
> > > > > > > of feature negotiation and possible change in device config space.
> > > > > > >
> > > > > > > Examples:
> > > > > > >
> > > > > > > 1) Create vDPA by default without any config attribute
> > > > > > >
> > > > > > > $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
> > > > > > > $ vdpa dev show vdpa0
> > > > > > > vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> > > > > > > $ vdpa dev -jp show vdpa0
> > > > > > > {
> > > > > > > "dev": {
> > > > > > > "vdpa0": {
> > > > > > > "type": "network",
> > > > > > > "mgmtdev": "pci/0000:41:04.2",
> > > > > > > "vendor_id": 5555,
> > > > > > > "max_vqs": 9,
> > > > > > > "max_vq_size": 256,
> > > > > > > }
> > > > > > > }
> > > > > > > }
> > > > > This is how OpenStack works today. This step is done statically at boot time, typically via a udev script or systemd service file.
> > > > > The MAC address is updated on the vdpa interface by libvirt when it is assigned to the QEMU process.
> > > > > If we wanted to support multi-queue or vq size configuration, it would also happen at that time, not during device creation.
> > > > > > > 2) Create vDPA with config attribute(s) specified
> > > > > > >
> > > > > > > $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
> > > > > > > mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> > > > > > > $ vdpa dev show
> > > > > > > vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> > > > > > > mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> > > > > > > $ vdpa dev -jp show
> > > > > > > {
> > > > > > > "dev": {
> > > > > > > "vdpa0": {
> > > > > > > "type": "network",
> > > > > > > "mgmtdev": "pci/0000:41:04.2",
> > > > > > So "mgmtdev" looks not necessary for live migration.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > > "vendor_id": 5555,
> > > > > > > "max_vqs": 9,
> > > > > > > "max_vq_size": 256,
> > > > > > > "mac": "e4:11:c6:d3:45:f0",
> > > > > > > "max_vq_pairs": 4
> > > > > > > }
> > > > > > > }
> > > > > > > }
> > > > > Dynamically creating vdpa devices at runtime, while possible, is not an approach we are planning to support.
> > > > >
> > > > > Currently in Nova we prefer to do allocation of statically provisioned resources.
> > > > > For persistent memory, SR-IOV/PCI passthrough, dedicated CPUs, hugepages, and vdpa devices, we manage inventories
> > > > > of resources that the operator has configured on the platform.
> > > > >
> > > > > We have one exception to this static approach, which is semi-dynamic: how we manage VFIO mediated devices.
> > > > > For reasons that are not important here, we currently track the parent devices that are capable of providing mdevs,
> > > > > and we directly write to /sys/... to create an instance of the requested mdev type on demand.
> > > > >
> > > > > This has proven to be quite problematic, as we have encountered caching bugs due to the delay between device
> > > > > creation and when the /sys interface exposes the directory structure for the mdev. This has led to libvirt, and as a result
> > > > > Nova, getting out of sync with the actual state of the host. There are also issues with host reboots.
> > > > >
> > > > > While we do see the advantage of being able to create vdpa interfaces on demand, especially if we can do finer-grained resource
> > > > > partitioning by allocating one vdpa device with 4 vqs and another with 8, etc., our experience with dynamic mdev management gives us
> > > > > pause. We can and will fix our bugs with mdevs, but we have found that most of our customers that use features like this
> > > > > are telcos or other similar industries that typically have very static workloads. While there is some interest in making
> > > > > their clouds more dynamic, they typically fill a host and run the same workload on that host for months to years at a
> > > > > time, and plan their hardware accordingly, so they are well served by the static use case "1) Create vDPA by default without any config attribute".
> > > > >
> > > > > > > ---
> > > > > > >
> > > > > > > Si-Wei Liu (4):
> > > > > > > vdpa: save vdpa_dev_set_config in struct vdpa_device
> > > > > > > vdpa: pass initial config to _vdpa_register_device()
> > > > > > > vdpa: show dev config as-is in "vdpa dev show" output
> > > > > > > vdpa: fix improper error message when adding vdpa dev
> > > > > > >
> > > > > > > drivers/vdpa/ifcvf/ifcvf_main.c | 2 +-
> > > > > > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
> > > > > > > drivers/vdpa/vdpa.c | 63 +++++++++++++++++++++++++++++++++---
> > > > > > > drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 2 +-
> > > > > > > drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 2 +-
> > > > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
> > > > > > > drivers/vdpa/virtio_pci/vp_vdpa.c | 3 +-
> > > > > > > include/linux/vdpa.h | 26 ++++++++-------
> > > > > > > 8 files changed, 80 insertions(+), 22 deletions(-)
> > > > > > >
> > > > > > > --
> > > > > > > 1.8.3.1
> > > > > > >
> > > >
> > >
> >
>