LinuxLists.cc - [PATCH] vfio: remove useless judgement

2022-06-27 04:45:11

Subject: [PATCH] vfio: remove useless judgement

From: Li Zhe <[email protected]>

In function vfio_dma_do_unmap(), we currently prevent process to unmap
vfio dma region whose mm_struct is different from the vfio_dma->task.
In our virtual machine scenario which is using kvm and qemu, this
judgement stops us from liveupgrading our qemu, which uses fork() &&
exec() to load the new binary but the new process cannot do the
VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.

This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
task structure to vfio_dma") for the security reason. But it seems that
no other task who has no family relationship with old and new process
can get the same vfio_dma struct here for the reason of resource
isolation. So this patch delete it.

Signed-off-by: Li Zhe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c13b9290e357..a8ff00dad834 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,

if (!iommu->v2 && iova > dma->iova)
break;
- /*
- * Task with same address space who mapped this iova range is
- * allowed to unmap the iova range.
- */
- if (dma->task->mm != current->mm)
- break;

if (invalidate_vaddr) {
if (dma->vaddr_invalid) {
--
2.20.1

2022-06-27 22:37:27

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH] vfio: remove useless judgement

Hey Steve, how did you get around this for cpr or is this a gap?
Thanks,

Alex

On Mon, 27 Jun 2022 11:51:09 +0800
[email protected] wrote:

> From: Li Zhe <[email protected]>
>
> In function vfio_dma_do_unmap(), we currently prevent process to unmap
> vfio dma region whose mm_struct is different from the vfio_dma->task.
> In our virtual machine scenario which is using kvm and qemu, this
> judgement stops us from liveupgrading our qemu, which uses fork() &&
> exec() to load the new binary but the new process cannot do the
> VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.
>
> This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
> task structure to vfio_dma") for the security reason. But it seems that
> no other task who has no family relationship with old and new process
> can get the same vfio_dma struct here for the reason of resource
> isolation. So this patch delete it.
>
> Signed-off-by: Li Zhe <[email protected]>
> Reviewed-by: Jason Gunthorpe <[email protected]>
> ---
> drivers/vfio/vfio_iommu_type1.c | 6 ------
> 1 file changed, 6 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c13b9290e357..a8ff00dad834 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>
> if (!iommu->v2 && iova > dma->iova)
> break;
> - /*
> - * Task with same address space who mapped this iova range is
> - * allowed to unmap the iova range.
> - */
> - if (dma->task->mm != current->mm)
> - break;
>
> if (invalidate_vaddr) {
> if (dma->vaddr_invalid) {

2022-06-28 12:50:50

by Steven Sistare

[permalink] [raw]

Subject: Re: [PATCH] vfio: remove useless judgement

For cpr, old qemu directly exec's new qemu, so task does not change.

To support fork+exec, the ownership test needs to be deleted or modified.

Pinned page accounting is another issue, as the parent counts pins in its
mm->locked_vm. If the child unmaps, it cannot simply decrement its own
mm->locked_vm counter. As you and I have discussed, the count is also
wrong in the direct exec model, because exec clears mm->locked_vm. I am
thinking vfio could count pins in struct user locked_vm to handle both
models. The user struct and its count would persist across direct exec,
and be shared by parent and child for fork+exec. However, that does change
the RLIMIT_MEMLOCK value that applications must set, because the limit must
accommodate vfio plus other sub-systems that count in user->locked_vm, which
includes io_uring, skbuff, xdp, and perf. Plus, the limit must accommodate all
processes of that user, not just a single process.

Folks like fork+exec because it allows recovery if the new qemu process fails to
initialize. One can fall back to the original process, if the above issues are fixed.

- Steve

On 6/27/2022 6:06 PM, Alex Williamson wrote:
>
> Hey Steve, how did you get around this for cpr or is this a gap?
> Thanks,
>
> Alex
>
> On Mon, 27 Jun 2022 11:51:09 +0800
> [email protected] wrote:
>
>> From: Li Zhe <[email protected]>
>>
>> In function vfio_dma_do_unmap(), we currently prevent process to unmap
>> vfio dma region whose mm_struct is different from the vfio_dma->task.
>> In our virtual machine scenario which is using kvm and qemu, this
>> judgement stops us from liveupgrading our qemu, which uses fork() &&
>> exec() to load the new binary but the new process cannot do the
>> VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.
>>
>> This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
>> task structure to vfio_dma") for the security reason. But it seems that
>> no other task who has no family relationship with old and new process
>> can get the same vfio_dma struct here for the reason of resource
>> isolation. So this patch delete it.
>>
>> Signed-off-by: Li Zhe <[email protected]>
>> Reviewed-by: Jason Gunthorpe <[email protected]>
>> ---
>> drivers/vfio/vfio_iommu_type1.c | 6 ------
>> 1 file changed, 6 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index c13b9290e357..a8ff00dad834 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>
>> if (!iommu->v2 && iova > dma->iova)
>> break;
>> - /*
>> - * Task with same address space who mapped this iova range is
>> - * allowed to unmap the iova range.
>> - */
>> - if (dma->task->mm != current->mm)
>> - break;
>>
>> if (invalidate_vaddr) {
>> if (dma->vaddr_invalid) {
>

2022-06-28 13:16:37

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [PATCH] vfio: remove useless judgement

On Tue, Jun 28, 2022 at 08:48:11AM -0400, Steven Sistare wrote:
> For cpr, old qemu directly exec's new qemu, so task does not change.
>
> To support fork+exec, the ownership test needs to be deleted or modified.
>
> Pinned page accounting is another issue, as the parent counts pins in its
> mm->locked_vm. If the child unmaps, it cannot simply decrement its own
> mm->locked_vm counter.

It is fine already:

mm = async ? get_task_mm(dma->task) : dma->task->mm;
if (!mm)
return -ESRCH; /* process exited */

ret = mmap_write_lock_killable(mm);
if (!ret) {
ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
dma->lock_cap);

Each 'dma' already stores a pointer to the mm that sourced it and only
manipulates the counter in that mm. AFAICT 'current' is not used
during unmap.

> As you and I have discussed, the count is also wrong in the direct
> exec model, because exec clears mm->locked_vm.

Really? Yikes, I thought exec would generate a new mm?

> I am thinking vfio could count pins in struct user locked_vm to handle both
> models. The user struct and its count would persist across direct exec,
> and be shared by parent and child for fork+exec. However, that does change
> the RLIMIT_MEMLOCK value that applications must set, because the limit must
> accommodate vfio plus other sub-systems that count in user->locked_vm, which
> includes io_uring, skbuff, xdp, and perf. Plus, the limit must accommodate all
> processes of that user, not just a single process.

We discussed this, for iommufd we are currently planning to go this
way and will See How it Goes.

Jason

2022-06-28 13:59:27

by Steven Sistare

[permalink] [raw]

Subject: Re: [PATCH] vfio: remove useless judgement

On 6/28/2022 9:03 AM, Jason Gunthorpe wrote:
> On Tue, Jun 28, 2022 at 08:48:11AM -0400, Steven Sistare wrote:
>> For cpr, old qemu directly exec's new qemu, so task does not change.
>>
>> To support fork+exec, the ownership test needs to be deleted or modified.
>>
>> Pinned page accounting is another issue, as the parent counts pins in its
>> mm->locked_vm. If the child unmaps, it cannot simply decrement its own
>> mm->locked_vm counter.
>
> It is fine already:
>
> mm = async ? get_task_mm(dma->task) : dma->task->mm;
> if (!mm)
> return -ESRCH; /* process exited */
>
> ret = mmap_write_lock_killable(mm);
> if (!ret) {
> ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
> dma->lock_cap);
>
> Each 'dma' already stores a pointer to the mm that sourced it and only
> manipulates the counter in that mm. AFAICT 'current' is not used
> during unmap.
Ah yes, no problem then.
Limits become looser, though, as the child can pin an additional RLIMIT_MEMLOCK
of pages. That is the natural consequence of mm->locked_vm being a per process limit,
but probably not what the application wants. Another argument for switching to
user->locked_vm.

>> As you and I have discussed, the count is also wrong in the direct
>> exec model, because exec clears mm->locked_vm.
>
> Really? Yikes, I thought exec would generate a new mm?

Yes, exec creates a new mm with locked_vm = 0. The old locked_vm count is dropped
on the floor. The existing dma points to the same task, but task->mm has changed,
and dma->task->mm->locked_vm is 0. An unmap ioctl drives it negative.

I have prototyped a few possible fixes. One changes vfio to use user->locked_vm.
Another changes to mm->pinned_vm and preserves it during exec. A third preserves
mm->locked_vm across exec, but that is not practical, because mm->locked_vm mixes
vfio pins and mlocks. The mlock component must be cleared during exec, and we don't
have a separate count for it.

>> I am thinking vfio could count pins in struct user locked_vm to handle both
>> models. The user struct and its count would persist across direct exec,
>> and be shared by parent and child for fork+exec. However, that does change
>> the RLIMIT_MEMLOCK value that applications must set, because the limit must
>> accommodate vfio plus other sub-systems that count in user->locked_vm, which
>> includes io_uring, skbuff, xdp, and perf. Plus, the limit must accommodate all
>> processes of that user, not just a single process.
>
> We discussed this, for iommufd we are currently planning to go this
> way and will See How it Goes.

Yes, I have followed that thread with interest.

- Steve

2022-06-28 14:34:28

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [PATCH] vfio: remove useless judgement

On Tue, Jun 28, 2022 at 09:54:19AM -0400, Steven Sistare wrote:

> >> As you and I have discussed, the count is also wrong in the direct
> >> exec model, because exec clears mm->locked_vm.
> >
> > Really? Yikes, I thought exec would generate a new mm?
>
> Yes, exec creates a new mm with locked_vm = 0. The old locked_vm count is dropped
> on the floor. The existing dma points to the same task, but task->mm has changed,
> and dma->task->mm->locked_vm is 0. An unmap ioctl drives it
> negative.

Oh.. This is probably a bug, vfio should never use task->mm, the mm
itself should be held using mmgrab instead.

Otherwise exec case is broken as you describe.

> I have prototyped a few possible fixes. One changes vfio to use user->locked_vm.
> Another changes to mm->pinned_vm and preserves it during exec. A third preserves
> mm->locked_vm across exec, but that is not practical, because mm->locked_vm mixes
> vfio pins and mlocks. The mlock component must be cleared during exec, and we don't
> have a separate count for it.

Lossing locked_vm on exec/fork is the correct and expected behavior
for the core kernel code, the bug is that vfio drives it negative.

Jason

2022-06-30 19:55:11

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH] vfio: remove useless judgement

On Mon, 27 Jun 2022 11:51:09 +0800
[email protected] wrote:

> From: Li Zhe <[email protected]>
>
> In function vfio_dma_do_unmap(), we currently prevent process to unmap
> vfio dma region whose mm_struct is different from the vfio_dma->task.
> In our virtual machine scenario which is using kvm and qemu, this
> judgement stops us from liveupgrading our qemu, which uses fork() &&
> exec() to load the new binary but the new process cannot do the
> VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.
>
> This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
> task structure to vfio_dma") for the security reason. But it seems that
> no other task who has no family relationship with old and new process
> can get the same vfio_dma struct here for the reason of resource
> isolation. So this patch delete it.
>
> Signed-off-by: Li Zhe <[email protected]>
> Reviewed-by: Jason Gunthorpe <[email protected]>
> ---
> drivers/vfio/vfio_iommu_type1.c | 6 ------
> 1 file changed, 6 deletions(-)

Applied to vfio next branch for v5.20. Thanks,

Alex

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c13b9290e357..a8ff00dad834 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>
> if (!iommu->v2 && iova > dma->iova)
> break;
> - /*
> - * Task with same address space who mapped this iova range is
> - * allowed to unmap the iova range.
> - */
> - if (dma->task->mm != current->mm)
> - break;
>
> if (invalidate_vaddr) {
> if (dma->vaddr_invalid) {