2024-01-10 20:46:27

by Steven Sistare

[permalink] [raw]
Subject: [RFC V1 00/13] vdpa live update

Live update is a technique wherein an application saves its state, exec's
to an updated version of itself, and restores its state. Clients of the
application experience a brief suspension of service, on the order of
100's of milliseconds, but are otherwise unaffected.

Define and implement interfaces that allow vdpa devices to be preserved
across fork or exec, to support live update for applications such as qemu.
The device must be suspended during the update, but its dma mappings are
preserved, so the suspension is brief.

The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
accounting from one process to another.

The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
VHOST_NEW_OWNER is supported.

The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
address in the new process.

The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
VHOST_IOTLB_REMAP is supported and required. Some devices do not
require it, because the userland address of each dma mapping is discarded
after being translated to a physical address.

Here is a pseudo-code sequence for performing live update, based on
suspend + reset because resume is not yet available. The vdpa device
descriptor, fd, remains open across the exec.

ioctl(fd, VHOST_VDPA_SUSPEND)
ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
exec

ioctl(fd, VHOST_NEW_OWNER)

issue ioctls to re-create vrings

if VHOST_BACKEND_F_IOTLB_REMAP
foreach dma mapping
write(fd, {VHOST_IOTLB_REMAP, new_addr})

ioctl(fd, VHOST_VDPA_SET_STATUS,
ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)


Steve Sistare (13):
vhost-vdpa: count pinned memory
vhost-vdpa: pass mm to bind
vhost-vdpa: VHOST_NEW_OWNER
vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
vhost-vdpa: VHOST_IOTLB_REMAP
vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
vhost-vdpa: flush workers on suspend
vduse: flush workers on suspend
vdpa_sim: reset must not run
vdpa_sim: flush workers on suspend
vdpa/mlx5: new owner capability
vdpa_sim: new owner capability
vduse: new owner capability

drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +-
drivers/vdpa/vdpa_sim/vdpa_sim.c | 24 ++++++-
drivers/vdpa/vdpa_user/vduse_dev.c | 32 +++++++++
drivers/vhost/vdpa.c | 101 +++++++++++++++++++++++++++--
drivers/vhost/vhost.c | 15 +++++
drivers/vhost/vhost.h | 1 +
include/uapi/linux/vhost.h | 10 +++
include/uapi/linux/vhost_types.h | 15 ++++-
8 files changed, 191 insertions(+), 10 deletions(-)

--
2.39.3



2024-01-10 20:47:35

by Steven Sistare

[permalink] [raw]
Subject: [RFC V1 07/13] vhost-vdpa: flush workers on suspend

To pass ownership of a live vdpa device to a new process, the user
suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
mm. Flush workers in suspend to guarantee that no worker sees the new
mm and old VA in between.

Signed-off-by: Steve Sistare <[email protected]>
---
drivers/vhost/vdpa.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 8fe1562d24af..9673e8e20d11 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
{
struct vdpa_device *vdpa = v->vdpa;
const struct vdpa_config_ops *ops = vdpa->config;
+ struct vhost_dev *vdev = &v->vdev;

if (!ops->suspend)
return -EOPNOTSUPP;

+ if (vdev->use_worker)
+ vhost_dev_flush(vdev);
+
return ops->suspend(vdpa);
}

--
2.39.3


2024-01-10 20:49:52

by Steven Sistare

[permalink] [raw]
Subject: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP

When device ownership is passed to a new process via VHOST_NEW_OWNER,
some devices need to know the new userland addresses of the dma mappings.
Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
of a mapping. The new uaddr must address the same memory object as
originally mapped.

The user must suspend the device before the old address is invalidated,
and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
requirement is not enforced by the API.

Signed-off-by: Steve Sistare <[email protected]>
---
drivers/vhost/vdpa.c | 34 ++++++++++++++++++++++++++++++++
include/uapi/linux/vhost_types.h | 11 ++++++++++-
2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index faed6471934a..ec5ca20bd47d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,

}

+static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
+ struct vhost_iotlb *iotlb,
+ struct vhost_iotlb_msg *msg)
+{
+ struct vdpa_device *vdpa = v->vdpa;
+ const struct vdpa_config_ops *ops = vdpa->config;
+ u32 asid = iotlb_to_asid(iotlb);
+ u64 start = msg->iova;
+ u64 last = start + msg->size - 1;
+ struct vhost_iotlb_map *map;
+ int r = 0;
+
+ if (msg->perm || !msg->size)
+ return -EINVAL;
+
+ map = vhost_iotlb_itree_first(iotlb, start, last);
+ if (!map)
+ return -ENOENT;
+
+ if (map->start != start || map->last != last)
+ return -EINVAL;
+
+ /* batch will finish with remap. non-batch must do it now. */
+ if (!v->in_batch)
+ r = ops->set_map(vdpa, asid, iotlb);
+ if (!r)
+ map->addr = msg->uaddr;
+
+ return r;
+}
+
static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
struct vhost_iotlb *iotlb,
struct vhost_iotlb_msg *msg)
@@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
ops->set_map(vdpa, asid, iotlb);
v->in_batch = false;
break;
+ case VHOST_IOTLB_REMAP:
+ r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
+ break;
default:
r = -EINVAL;
break;
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index 9177843951e9..35908315ff55 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
/*
* VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
* multiple mappings in one go: beginning with
- * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
+ * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
* VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
* When one of these two values is used as the message type, the rest
* of the fields in the message are ignored. There's no guarantee that
@@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
*/
#define VHOST_IOTLB_BATCH_BEGIN 5
#define VHOST_IOTLB_BATCH_END 6
+
+/*
+ * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
+ * The new uaddr must address the same memory object as originally mapped.
+ * Failure to do so will result in user memory corruption and/or device
+ * misbehavior. iova and size must match the arguments used to create the
+ * an existing mapping. Protection is not changed, and perm must be 0.
+ */
+#define VHOST_IOTLB_REMAP 7
__u8 type;
};

--
2.39.3


2024-01-11 02:56:22

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC V1 00/13] vdpa live update

On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <[email protected]> wrote:
>
> Live update is a technique wherein an application saves its state, exec's
> to an updated version of itself, and restores its state. Clients of the
> application experience a brief suspension of service, on the order of
> 100's of milliseconds, but are otherwise unaffected.
>
> Define and implement interfaces that allow vdpa devices to be preserved
> across fork or exec, to support live update for applications such as qemu.
> The device must be suspended during the update, but its dma mappings are
> preserved, so the suspension is brief.
>
> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> accounting from one process to another.
>
> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> VHOST_NEW_OWNER is supported.
>
> The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
> address in the new process.
>
> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> VHOST_IOTLB_REMAP is supported and required. Some devices do not
> require it, because the userland address of each dma mapping is discarded
> after being translated to a physical address.
>
> Here is a pseudo-code sequence for performing live update, based on
> suspend + reset because resume is not yet available. The vdpa device
> descriptor, fd, remains open across the exec.
>
> ioctl(fd, VHOST_VDPA_SUSPEND)
> ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
> exec

Is there a userspace implementation as a reference?

>
> ioctl(fd, VHOST_NEW_OWNER)
>
> issue ioctls to re-create vrings
>
> if VHOST_BACKEND_F_IOTLB_REMAP
> foreach dma mapping
> write(fd, {VHOST_IOTLB_REMAP, new_addr})

I think I need to understand the advantages of this approach. For
example, why it is better than

ioctl(VHOST_RESET_OWNER)
exec

ioctl(VHOST_SET_OWNER)

for each dma mapping
ioctl(VHOST_IOTLB_UPDATE)

Thanks

>
> ioctl(fd, VHOST_VDPA_SET_STATUS,
> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
>
>
> Steve Sistare (13):
> vhost-vdpa: count pinned memory
> vhost-vdpa: pass mm to bind
> vhost-vdpa: VHOST_NEW_OWNER
> vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
> vhost-vdpa: VHOST_IOTLB_REMAP
> vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
> vhost-vdpa: flush workers on suspend
> vduse: flush workers on suspend
> vdpa_sim: reset must not run
> vdpa_sim: flush workers on suspend
> vdpa/mlx5: new owner capability
> vdpa_sim: new owner capability
> vduse: new owner capability
>
> drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +-
> drivers/vdpa/vdpa_sim/vdpa_sim.c | 24 ++++++-
> drivers/vdpa/vdpa_user/vduse_dev.c | 32 +++++++++
> drivers/vhost/vdpa.c | 101 +++++++++++++++++++++++++++--
> drivers/vhost/vhost.c | 15 +++++
> drivers/vhost/vhost.h | 1 +
> include/uapi/linux/vhost.h | 10 +++
> include/uapi/linux/vhost_types.h | 15 ++++-
> 8 files changed, 191 insertions(+), 10 deletions(-)
>
> --
> 2.39.3
>


2024-01-11 03:09:07

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP

On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <[email protected]> wrote:
>
> When device ownership is passed to a new process via VHOST_NEW_OWNER,
> some devices need to know the new userland addresses of the dma mappings.
> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
> of a mapping. The new uaddr must address the same memory object as
> originally mapped.
>
> The user must suspend the device before the old address is invalidated,
> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
> requirement is not enforced by the API.
>
> Signed-off-by: Steve Sistare <[email protected]>
> ---
> drivers/vhost/vdpa.c | 34 ++++++++++++++++++++++++++++++++
> include/uapi/linux/vhost_types.h | 11 ++++++++++-
> 2 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index faed6471934a..ec5ca20bd47d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>
> }
>
> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
> + struct vhost_iotlb *iotlb,
> + struct vhost_iotlb_msg *msg)
> +{
> + struct vdpa_device *vdpa = v->vdpa;
> + const struct vdpa_config_ops *ops = vdpa->config;
> + u32 asid = iotlb_to_asid(iotlb);
> + u64 start = msg->iova;
> + u64 last = start + msg->size - 1;
> + struct vhost_iotlb_map *map;
> + int r = 0;
> +
> + if (msg->perm || !msg->size)
> + return -EINVAL;
> +
> + map = vhost_iotlb_itree_first(iotlb, start, last);
> + if (!map)
> + return -ENOENT;
> +
> + if (map->start != start || map->last != last)
> + return -EINVAL;
> +
> + /* batch will finish with remap. non-batch must do it now. */
> + if (!v->in_batch)
> + r = ops->set_map(vdpa, asid, iotlb);
> + if (!r)
> + map->addr = msg->uaddr;

I may miss something, for example for PA mapping,

1) need to convert uaddr into phys addr
2) need to check whether the uaddr is backed by the same page or not?

Thanks

> +
> + return r;
> +}
> +
> static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> struct vhost_iotlb *iotlb,
> struct vhost_iotlb_msg *msg)
> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
> ops->set_map(vdpa, asid, iotlb);
> v->in_batch = false;
> break;
> + case VHOST_IOTLB_REMAP:
> + r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
> + break;
> default:
> r = -EINVAL;
> break;
> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> index 9177843951e9..35908315ff55 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
> /*
> * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
> * multiple mappings in one go: beginning with
> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
> * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
> * When one of these two values is used as the message type, the rest
> * of the fields in the message are ignored. There's no guarantee that
> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
> */
> #define VHOST_IOTLB_BATCH_BEGIN 5
> #define VHOST_IOTLB_BATCH_END 6
> +
> +/*
> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
> + * The new uaddr must address the same memory object as originally mapped.
> + * Failure to do so will result in user memory corruption and/or device
> + * misbehavior. iova and size must match the arguments used to create the
> + * an existing mapping. Protection is not changed, and perm must be 0.
> + */
> +#define VHOST_IOTLB_REMAP 7
> __u8 type;
> };
>
> --
> 2.39.3
>


2024-01-11 03:09:45

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend

On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <[email protected]> wrote:
>
> To pass ownership of a live vdpa device to a new process, the user
> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
> mm. Flush workers in suspend to guarantee that no worker sees the new
> mm and old VA in between.
>
> Signed-off-by: Steve Sistare <[email protected]>
> ---
> drivers/vhost/vdpa.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 8fe1562d24af..9673e8e20d11 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
> {
> struct vdpa_device *vdpa = v->vdpa;
> const struct vdpa_config_ops *ops = vdpa->config;
> + struct vhost_dev *vdev = &v->vdev;
>
> if (!ops->suspend)
> return -EOPNOTSUPP;
>
> + if (vdev->use_worker)
> + vhost_dev_flush(vdev);

It looks to me like it's better to check use_woker in vhost_dev_flush.

Thanks


> +
> return ops->suspend(vdpa);
> }
>
> --
> 2.39.3
>


2024-01-11 16:24:34

by Mike Christie

[permalink] [raw]
Subject: Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend

On 1/10/24 9:09 PM, Jason Wang wrote:
> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <[email protected]> wrote:
>>
>> To pass ownership of a live vdpa device to a new process, the user
>> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
>> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
>> mm. Flush workers in suspend to guarantee that no worker sees the new
>> mm and old VA in between.
>>
>> Signed-off-by: Steve Sistare <[email protected]>
>> ---
>> drivers/vhost/vdpa.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index 8fe1562d24af..9673e8e20d11 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
>> {
>> struct vdpa_device *vdpa = v->vdpa;
>> const struct vdpa_config_ops *ops = vdpa->config;
>> + struct vhost_dev *vdev = &v->vdev;
>>
>> if (!ops->suspend)
>> return -EOPNOTSUPP;
>>
>> + if (vdev->use_worker)
>> + vhost_dev_flush(vdev);
>
> It looks to me like it's better to check use_woker in vhost_dev_flush.
>

You can now just call vhost_dev_flush and it will do the right thing.
The xa_for_each loop will only flush workers if they have been setup,
so for vdpa it will not find/flush anything.




2024-01-12 02:28:46

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend

On Fri, Jan 12, 2024 at 12:18 AM Mike Christie
<[email protected]> wrote:
>
> On 1/10/24 9:09 PM, Jason Wang wrote:
> > On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <[email protected]> wrote:
> >>
> >> To pass ownership of a live vdpa device to a new process, the user
> >> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
> >> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
> >> mm. Flush workers in suspend to guarantee that no worker sees the new
> >> mm and old VA in between.
> >>
> >> Signed-off-by: Steve Sistare <[email protected]>
> >> ---
> >> drivers/vhost/vdpa.c | 4 ++++
> >> 1 file changed, 4 insertions(+)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index 8fe1562d24af..9673e8e20d11 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
> >> {
> >> struct vdpa_device *vdpa = v->vdpa;
> >> const struct vdpa_config_ops *ops = vdpa->config;
> >> + struct vhost_dev *vdev = &v->vdev;
> >>
> >> if (!ops->suspend)
> >> return -EOPNOTSUPP;
> >>
> >> + if (vdev->use_worker)
> >> + vhost_dev_flush(vdev);
> >
> > It looks to me like it's better to check use_woker in vhost_dev_flush.
> >
>
> You can now just call vhost_dev_flush and it will do the right thing.
> The xa_for_each loop will only flush workers if they have been setup,
> so for vdpa it will not find/flush anything.

Right.

Thanks

>
>
>


2024-01-16 18:16:38

by Eugenio Perez Martin

[permalink] [raw]
Subject: Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP

On Wed, Jan 10, 2024 at 9:40 PM Steve Sistare <[email protected]> wrote:
>
> When device ownership is passed to a new process via VHOST_NEW_OWNER,
> some devices need to know the new userland addresses of the dma mappings.
> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
> of a mapping. The new uaddr must address the same memory object as
> originally mapped.
>

I think this new ioctl is very interesting, and can be used to
optimize some parts of live migration with shadow virtqueue if it
allows to actually replace the uaddr. Would you be interested in that
capability?

> The user must suspend the device before the old address is invalidated,
> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
> requirement is not enforced by the API.
>
> Signed-off-by: Steve Sistare <[email protected]>
> ---
> drivers/vhost/vdpa.c | 34 ++++++++++++++++++++++++++++++++
> include/uapi/linux/vhost_types.h | 11 ++++++++++-
> 2 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index faed6471934a..ec5ca20bd47d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>
> }
>
> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
> + struct vhost_iotlb *iotlb,
> + struct vhost_iotlb_msg *msg)
> +{
> + struct vdpa_device *vdpa = v->vdpa;
> + const struct vdpa_config_ops *ops = vdpa->config;
> + u32 asid = iotlb_to_asid(iotlb);
> + u64 start = msg->iova;
> + u64 last = start + msg->size - 1;
> + struct vhost_iotlb_map *map;
> + int r = 0;
> +
> + if (msg->perm || !msg->size)
> + return -EINVAL;
> +
> + map = vhost_iotlb_itree_first(iotlb, start, last);
> + if (!map)
> + return -ENOENT;
> +
> + if (map->start != start || map->last != last)
> + return -EINVAL;
> +
> + /* batch will finish with remap. non-batch must do it now. */
> + if (!v->in_batch)
> + r = ops->set_map(vdpa, asid, iotlb);

I'm missing how these cases are handled:
* The device does not expose set_map but dma_map / dma_unmap (you can
check this case with the simulator)
* The device uses platform iommu (!set_map && !dma_unmap vdpa_ops).

> + if (!r)
> + map->addr = msg->uaddr;
> +
> + return r;
> +}
> +
> static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> struct vhost_iotlb *iotlb,
> struct vhost_iotlb_msg *msg)
> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
> ops->set_map(vdpa, asid, iotlb);
> v->in_batch = false;
> break;
> + case VHOST_IOTLB_REMAP:
> + r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
> + break;
> default:
> r = -EINVAL;
> break;
> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> index 9177843951e9..35908315ff55 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
> /*
> * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
> * multiple mappings in one go: beginning with
> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
> * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
> * When one of these two values is used as the message type, the rest
> * of the fields in the message are ignored. There's no guarantee that
> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
> */
> #define VHOST_IOTLB_BATCH_BEGIN 5
> #define VHOST_IOTLB_BATCH_END 6
> +
> +/*
> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
> + * The new uaddr must address the same memory object as originally mapped.
> + * Failure to do so will result in user memory corruption and/or device
> + * misbehavior. iova and size must match the arguments used to create the
> + * an existing mapping. Protection is not changed, and perm must be 0.
> + */
> +#define VHOST_IOTLB_REMAP 7
> __u8 type;
> };
>
> --
> 2.39.3
>


2024-01-17 20:31:58

by Steven Sistare

[permalink] [raw]
Subject: Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend

On 1/11/2024 9:28 PM, Jason Wang wrote:
> On Fri, Jan 12, 2024 at 12:18 AM Mike Christie
> <[email protected]> wrote:
>>
>> On 1/10/24 9:09 PM, Jason Wang wrote:
>>> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <[email protected]> wrote:
>>>>
>>>> To pass ownership of a live vdpa device to a new process, the user
>>>> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
>>>> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
>>>> mm. Flush workers in suspend to guarantee that no worker sees the new
>>>> mm and old VA in between.
>>>>
>>>> Signed-off-by: Steve Sistare <[email protected]>
>>>> ---
>>>> drivers/vhost/vdpa.c | 4 ++++
>>>> 1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>>> index 8fe1562d24af..9673e8e20d11 100644
>>>> --- a/drivers/vhost/vdpa.c
>>>> +++ b/drivers/vhost/vdpa.c
>>>> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
>>>> {
>>>> struct vdpa_device *vdpa = v->vdpa;
>>>> const struct vdpa_config_ops *ops = vdpa->config;
>>>> + struct vhost_dev *vdev = &v->vdev;
>>>>
>>>> if (!ops->suspend)
>>>> return -EOPNOTSUPP;
>>>>
>>>> + if (vdev->use_worker)
>>>> + vhost_dev_flush(vdev);
>>>
>>> It looks to me like it's better to check use_woker in vhost_dev_flush.
>>
>> You can now just call vhost_dev_flush and it will do the right thing.
>> The xa_for_each loop will only flush workers if they have been setup,
>> so for vdpa it will not find/flush anything.

Very good, I will drop this patch.

- Steve

2024-01-17 20:32:48

by Steven Sistare

[permalink] [raw]
Subject: Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP

On 1/10/2024 10:08 PM, Jason Wang wrote:
> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <[email protected]> wrote:
>>
>> When device ownership is passed to a new process via VHOST_NEW_OWNER,
>> some devices need to know the new userland addresses of the dma mappings.
>> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
>> of a mapping. The new uaddr must address the same memory object as
>> originally mapped.
>>
>> The user must suspend the device before the old address is invalidated,
>> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
>> requirement is not enforced by the API.
>>
>> Signed-off-by: Steve Sistare <[email protected]>
>> ---
>> drivers/vhost/vdpa.c | 34 ++++++++++++++++++++++++++++++++
>> include/uapi/linux/vhost_types.h | 11 ++++++++++-
>> 2 files changed, 44 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index faed6471934a..ec5ca20bd47d 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>>
>> }
>>
>> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
>> + struct vhost_iotlb *iotlb,
>> + struct vhost_iotlb_msg *msg)
>> +{
>> + struct vdpa_device *vdpa = v->vdpa;
>> + const struct vdpa_config_ops *ops = vdpa->config;
>> + u32 asid = iotlb_to_asid(iotlb);
>> + u64 start = msg->iova;
>> + u64 last = start + msg->size - 1;
>> + struct vhost_iotlb_map *map;
>> + int r = 0;
>> +
>> + if (msg->perm || !msg->size)
>> + return -EINVAL;
>> +
>> + map = vhost_iotlb_itree_first(iotlb, start, last);
>> + if (!map)
>> + return -ENOENT;
>> +
>> + if (map->start != start || map->last != last)
>> + return -EINVAL;
>> +
>> + /* batch will finish with remap. non-batch must do it now. */
>> + if (!v->in_batch)
>> + r = ops->set_map(vdpa, asid, iotlb);
>> + if (!r)
>> + map->addr = msg->uaddr;
>
> I may miss something, for example for PA mapping,
>
> 1) need to convert uaddr into phys addr
> 2) need to check whether the uaddr is backed by the same page or not?

This code does not verify that the new size@uaddr points to the same physical
pages as the old size@uaddr. If the app screws up and they differ, then the app
may corrupt its own memory, but no-one else's.

It would be expensive for large memories to verify page by page, O(npages), and such
verification lies on the critical path for virtual machine downtime during live update.
I could compare the properties of the vma(s) for the old size@uaddr vs the vma for the
new, but that is more complicated and would be a maintenance headache. When I submitted
such code to Alex W when writing the equivalent patches for vfio, he said don't check,
correctness is the user's responsibility.

- Steve

>> +
>> + return r;
>> +}
>> +
>> static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>> struct vhost_iotlb *iotlb,
>> struct vhost_iotlb_msg *msg)
>> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
>> ops->set_map(vdpa, asid, iotlb);
>> v->in_batch = false;
>> break;
>> + case VHOST_IOTLB_REMAP:
>> + r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
>> + break;
>> default:
>> r = -EINVAL;
>> break;
>> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
>> index 9177843951e9..35908315ff55 100644
>> --- a/include/uapi/linux/vhost_types.h
>> +++ b/include/uapi/linux/vhost_types.h
>> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
>> /*
>> * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
>> * multiple mappings in one go: beginning with
>> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
>> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
>> * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
>> * When one of these two values is used as the message type, the rest
>> * of the fields in the message are ignored. There's no guarantee that
>> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
>> */
>> #define VHOST_IOTLB_BATCH_BEGIN 5
>> #define VHOST_IOTLB_BATCH_END 6
>> +
>> +/*
>> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
>> + * The new uaddr must address the same memory object as originally mapped.
>> + * Failure to do so will result in user memory corruption and/or device
>> + * misbehavior. iova and size must match the arguments used to create the
>> + * an existing mapping. Protection is not changed, and perm must be 0.
>> + */
>> +#define VHOST_IOTLB_REMAP 7
>> __u8 type;
>> };
>>
>> --
>> 2.39.3
>>
>

2024-01-17 20:41:37

by Steven Sistare

[permalink] [raw]
Subject: Re: [RFC V1 00/13] vdpa live update

On 1/10/2024 9:55 PM, Jason Wang wrote:
> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <[email protected]> wrote:
>>
>> Live update is a technique wherein an application saves its state, exec's
>> to an updated version of itself, and restores its state. Clients of the
>> application experience a brief suspension of service, on the order of
>> 100's of milliseconds, but are otherwise unaffected.
>>
>> Define and implement interfaces that allow vdpa devices to be preserved
>> across fork or exec, to support live update for applications such as qemu.
>> The device must be suspended during the update, but its dma mappings are
>> preserved, so the suspension is brief.
>>
>> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
>> accounting from one process to another.
>>
>> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
>> VHOST_NEW_OWNER is supported.
>>
>> The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
>> address in the new process.
>>
>> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
>> VHOST_IOTLB_REMAP is supported and required. Some devices do not
>> require it, because the userland address of each dma mapping is discarded
>> after being translated to a physical address.
>>
>> Here is a pseudo-code sequence for performing live update, based on
>> suspend + reset because resume is not yet available. The vdpa device
>> descriptor, fd, remains open across the exec.
>>
>> ioctl(fd, VHOST_VDPA_SUSPEND)
>> ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
>> exec
>
> Is there a userspace implementation as a reference?

I have working patches for qemu that use these ioctl's, but they depend on other
qemu cpr patches that are a work in progress, and not posted yet. I'm working on
that.

>> ioctl(fd, VHOST_NEW_OWNER)
>>
>> issue ioctls to re-create vrings
>>
>> if VHOST_BACKEND_F_IOTLB_REMAP
>> foreach dma mapping
>> write(fd, {VHOST_IOTLB_REMAP, new_addr})
>
> I think I need to understand the advantages of this approach. For
> example, why it is better than
>
> ioctl(VHOST_RESET_OWNER)
> exec
>
> ioctl(VHOST_SET_OWNER)
>
> for each dma mapping
> ioctl(VHOST_IOTLB_UPDATE)

That is slower. VHOST_RESET_OWNER unbinds physical pages, and VHOST_IOTLB_UPDATE
rebinds them. It costs multiple seconds for large memories, and is incurred during the
virtual machine's pause time during live update. For comparison, the total pause time
for live update with vfio interfaces is ~100 millis.

However, the interaction with userland is so similar that the same code paths can be used.
In my qemu prototype, after cpr exec's new qemu:
- vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
- vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of VHOST_IOTLB_UPDATE

- Steve


2024-01-22 04:06:48

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP

On Thu, Jan 18, 2024 at 4:32 AM Steven Sistare
<[email protected]> wrote:
>
> On 1/10/2024 10:08 PM, Jason Wang wrote:
> > On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <[email protected]> wrote:
> >>
> >> When device ownership is passed to a new process via VHOST_NEW_OWNER,
> >> some devices need to know the new userland addresses of the dma mappings.
> >> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
> >> of a mapping. The new uaddr must address the same memory object as
> >> originally mapped.
> >>
> >> The user must suspend the device before the old address is invalidated,
> >> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
> >> requirement is not enforced by the API.
> >>
> >> Signed-off-by: Steve Sistare <[email protected]>
> >> ---
> >> drivers/vhost/vdpa.c | 34 ++++++++++++++++++++++++++++++++
> >> include/uapi/linux/vhost_types.h | 11 ++++++++++-
> >> 2 files changed, 44 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index faed6471934a..ec5ca20bd47d 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> >>
> >> }
> >>
> >> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
> >> + struct vhost_iotlb *iotlb,
> >> + struct vhost_iotlb_msg *msg)
> >> +{
> >> + struct vdpa_device *vdpa = v->vdpa;
> >> + const struct vdpa_config_ops *ops = vdpa->config;
> >> + u32 asid = iotlb_to_asid(iotlb);
> >> + u64 start = msg->iova;
> >> + u64 last = start + msg->size - 1;
> >> + struct vhost_iotlb_map *map;
> >> + int r = 0;
> >> +
> >> + if (msg->perm || !msg->size)
> >> + return -EINVAL;
> >> +
> >> + map = vhost_iotlb_itree_first(iotlb, start, last);
> >> + if (!map)
> >> + return -ENOENT;
> >> +
> >> + if (map->start != start || map->last != last)
> >> + return -EINVAL;
> >> +
> >> + /* batch will finish with remap. non-batch must do it now. */
> >> + if (!v->in_batch)
> >> + r = ops->set_map(vdpa, asid, iotlb);
> >> + if (!r)
> >> + map->addr = msg->uaddr;
> >
> > I may miss something, for example for PA mapping,
> >
> > 1) need to convert uaddr into phys addr
> > 2) need to check whether the uaddr is backed by the same page or not?
>
> This code does not verify that the new size@uaddr points to the same physical
> pages as the old size@uaddr. If the app screws up and they differ, then the app
> may corrupt its own memory, but no-one else's.
>
> It would be expensive for large memories to verify page by page, O(npages), and such
> verification lies on the critical path for virtual machine downtime during live update.
> I could compare the properties of the vma(s) for the old size@uaddr vs the vma for the
> new, but that is more complicated and would be a maintenance headache. When I submitted
> such code to Alex W when writing the equivalent patches for vfio, he said don't check,
> correctness is the user's responsibility.

Ok, let's document this somewhere.

Thanks

>
> - Steve
>
> >> +
> >> + return r;
> >> +}
> >> +
> >> static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >> struct vhost_iotlb *iotlb,
> >> struct vhost_iotlb_msg *msg)
> >> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
> >> ops->set_map(vdpa, asid, iotlb);
> >> v->in_batch = false;
> >> break;
> >> + case VHOST_IOTLB_REMAP:
> >> + r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
> >> + break;
> >> default:
> >> r = -EINVAL;
> >> break;
> >> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> >> index 9177843951e9..35908315ff55 100644
> >> --- a/include/uapi/linux/vhost_types.h
> >> +++ b/include/uapi/linux/vhost_types.h
> >> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
> >> /*
> >> * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
> >> * multiple mappings in one go: beginning with
> >> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
> >> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
> >> * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
> >> * When one of these two values is used as the message type, the rest
> >> * of the fields in the message are ignored. There's no guarantee that
> >> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
> >> */
> >> #define VHOST_IOTLB_BATCH_BEGIN 5
> >> #define VHOST_IOTLB_BATCH_END 6
> >> +
> >> +/*
> >> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
> >> + * The new uaddr must address the same memory object as originally mapped.
> >> + * Failure to do so will result in user memory corruption and/or device
> >> + * misbehavior. iova and size must match the arguments used to create the
> >> + * an existing mapping. Protection is not changed, and perm must be 0.
> >> + */
> >> +#define VHOST_IOTLB_REMAP 7
> >> __u8 type;
> >> };
> >>
> >> --
> >> 2.39.3
> >>
> >
>


2024-01-22 04:12:29

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC V1 00/13] vdpa live update

On Thu, Jan 18, 2024 at 4:32 AM Steven Sistare
<[email protected]> wrote:
>
> On 1/10/2024 9:55 PM, Jason Wang wrote:
> > On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <[email protected]> wrote:
> >>
> >> Live update is a technique wherein an application saves its state, exec's
> >> to an updated version of itself, and restores its state. Clients of the
> >> application experience a brief suspension of service, on the order of
> >> 100's of milliseconds, but are otherwise unaffected.
> >>
> >> Define and implement interfaces that allow vdpa devices to be preserved
> >> across fork or exec, to support live update for applications such as qemu.
> >> The device must be suspended during the update, but its dma mappings are
> >> preserved, so the suspension is brief.
> >>
> >> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> >> accounting from one process to another.
> >>
> >> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> >> VHOST_NEW_OWNER is supported.
> >>
> >> The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
> >> address in the new process.
> >>
> >> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> >> VHOST_IOTLB_REMAP is supported and required. Some devices do not
> >> require it, because the userland address of each dma mapping is discarded
> >> after being translated to a physical address.
> >>
> >> Here is a pseudo-code sequence for performing live update, based on
> >> suspend + reset because resume is not yet available. The vdpa device
> >> descriptor, fd, remains open across the exec.
> >>
> >> ioctl(fd, VHOST_VDPA_SUSPEND)
> >> ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
> >> exec
> >
> > Is there a userspace implementation as a reference?
>
> I have working patches for qemu that use these ioctl's, but they depend on other
> qemu cpr patches that are a work in progress, and not posted yet. I'm working on
> that.

Ok.

>
> >> ioctl(fd, VHOST_NEW_OWNER)
> >>
> >> issue ioctls to re-create vrings
> >>
> >> if VHOST_BACKEND_F_IOTLB_REMAP
> >> foreach dma mapping
> >> write(fd, {VHOST_IOTLB_REMAP, new_addr})
> >
> > I think I need to understand the advantages of this approach. For
> > example, why it is better than
> >
> > ioctl(VHOST_RESET_OWNER)
> > exec
> >
> > ioctl(VHOST_SET_OWNER)
> >
> > for each dma mapping
> > ioctl(VHOST_IOTLB_UPDATE)
>
> That is slower. VHOST_RESET_OWNER unbinds physical pages, and VHOST_IOTLB_UPDATE
> rebinds them. It costs multiple seconds for large memories, and is incurred during the
> virtual machine's pause time during live update. For comparison, the total pause time
> for live update with vfio interfaces is ~100 millis.
>
> However, the interaction with userland is so similar that the same code paths can be used.
> In my qemu prototype, after cpr exec's new qemu:
> - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
> - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of VHOST_IOTLB_UPDATE
>
> - Steve
>

Ok, let's document this in the changlog.

Thanks


2024-02-09 16:13:40

by Steven Sistare

[permalink] [raw]
Subject: Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP

On 1/16/2024 1:14 PM, Eugenio Perez Martin wrote:
> On Wed, Jan 10, 2024 at 9:40 PM Steve Sistare <[email protected]> wrote:
>>
>> When device ownership is passed to a new process via VHOST_NEW_OWNER,
>> some devices need to know the new userland addresses of the dma mappings.
>> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
>> of a mapping. The new uaddr must address the same memory object as
>> originally mapped.
>
> I think this new ioctl is very interesting, and can be used to
> optimize some parts of live migration with shadow virtqueue if it
> allows to actually replace the uaddr. Would you be interested in that
> capability?

Sure. Please share your thoughts on how it would be used. I don't follow,
because with live migration, you are creating a new vdpa instance with new
shadow queues on the destination, versus relocating old shadow queues (which
we could do during live update which preserves the same vdpa instance).

(Sorry for the late response, I stashed this email and forgot to respond.)

>> The user must suspend the device before the old address is invalidated,
>> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
>> requirement is not enforced by the API.
>>
>> Signed-off-by: Steve Sistare <[email protected]>
>> ---
>> drivers/vhost/vdpa.c | 34 ++++++++++++++++++++++++++++++++
>> include/uapi/linux/vhost_types.h | 11 ++++++++++-
>> 2 files changed, 44 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index faed6471934a..ec5ca20bd47d 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>>
>> }
>>
>> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
>> + struct vhost_iotlb *iotlb,
>> + struct vhost_iotlb_msg *msg)
>> +{
>> + struct vdpa_device *vdpa = v->vdpa;
>> + const struct vdpa_config_ops *ops = vdpa->config;
>> + u32 asid = iotlb_to_asid(iotlb);
>> + u64 start = msg->iova;
>> + u64 last = start + msg->size - 1;
>> + struct vhost_iotlb_map *map;
>> + int r = 0;
>> +
>> + if (msg->perm || !msg->size)
>> + return -EINVAL;
>> +
>> + map = vhost_iotlb_itree_first(iotlb, start, last);
>> + if (!map)
>> + return -ENOENT;
>> +
>> + if (map->start != start || map->last != last)
>> + return -EINVAL;
>> +
>> + /* batch will finish with remap. non-batch must do it now. */
>> + if (!v->in_batch)
>> + r = ops->set_map(vdpa, asid, iotlb);
>
> I'm missing how these cases are handled:
>
> * The device does not expose set_map but dma_map / dma_unmap (you can
> check this case with the simulator)

I chose not to support dma_map, because set_map looks to be more commonly supported,
and batch-mode plus set_map is the most efficient way to support remapping.
I could define a dma_remap entry point if folks think that is important.
Regardless, I will check that ops->set_map != NULL before calling it.

> * The device uses platform iommu (!set_map && !dma_unmap vdpa_ops).

I believe this just works, because iommu_map() never saves userland address, so it
does not need to be informed when uaddr changes. That is certainly true for the
code path:

vhost_vdpa_pa_map()
vhost_vdpa_map()
if !dma_map && !set_map
iommu_map()

However, this code path confuses me:

vhost_vdpa_process_iotlb_update()
if (vdpa->use_va)
vhost_vdpa_va_map(... uaddr)
vhost_vdpa_map(... uaddr)
iommu_map(... uaddr)

AFAICT uaddr is never translated to physical.
Maybe use_va is always false if !dma_map && !set_map ?

- Steve

>> + if (!r)
>> + map->addr = msg->uaddr;
>> +
>> + return r;
>> +}
>> +
>> static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>> struct vhost_iotlb *iotlb,
>> struct vhost_iotlb_msg *msg)
>> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
>> ops->set_map(vdpa, asid, iotlb);
>> v->in_batch = false;
>> break;
>> + case VHOST_IOTLB_REMAP:
>> + r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
>> + break;
>> default:
>> r = -EINVAL;
>> break;
>> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
>> index 9177843951e9..35908315ff55 100644
>> --- a/include/uapi/linux/vhost_types.h
>> +++ b/include/uapi/linux/vhost_types.h
>> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
>> /*
>> * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
>> * multiple mappings in one go: beginning with
>> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
>> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
>> * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
>> * When one of these two values is used as the message type, the rest
>> * of the fields in the message are ignored. There's no guarantee that
>> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
>> */
>> #define VHOST_IOTLB_BATCH_BEGIN 5
>> #define VHOST_IOTLB_BATCH_END 6
>> +
>> +/*
>> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
>> + * The new uaddr must address the same memory object as originally mapped.
>> + * Failure to do so will result in user memory corruption and/or device
>> + * misbehavior. iova and size must match the arguments used to create the
>> + * an existing mapping. Protection is not changed, and perm must be 0.
>> + */
>> +#define VHOST_IOTLB_REMAP 7
>> __u8 type;
>> };
>>
>> --
>> 2.39.3
>>
>