2023-05-08 13:29:12

by Yan Zhao

[permalink] [raw]
Subject: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range

In VFIO type1, vaddr_get_pfns() will try fault in MMIO PFNs after
pin_user_pages_remote() returns -EFAULT.

follow_fault_pfn
fixup_user_fault
handle_mm_fault
handle_mm_fault
do_fault
do_shared_fault
do_fault
__do_fault
vfio_pci_mmap_fault
io_remap_pfn_range
remap_pfn_range
track_pfn_remap
vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
remap_pfn_range_notrack
vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)

As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
holding of mmap write lock is required.
So, update vfio_pci_mmap_fault() to drop mmap read lock and take mmap
write lock.

[1] https://lkml.kernel.org/r/[email protected]
commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
commit 1c71222e5f23
("mm: replace vma->vm_flags direct modifications with modifier calls")

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..5082f89152b3 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
struct vfio_pci_mmap_vma *mmap_vma;
vm_fault_t ret = VM_FAULT_NOPAGE;

+ mmap_assert_locked(vma->vm_mm);
+ mmap_read_unlock(vma->vm_mm);
+
+ if (mmap_write_lock_killable(vma->vm_mm))
+ return VM_FAULT_RETRY;
+
mutex_lock(&vdev->vma_lock);
down_read(&vdev->memory_lock);

@@ -1726,6 +1732,17 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
up_out:
up_read(&vdev->memory_lock);
mutex_unlock(&vdev->vma_lock);
+ mmap_write_unlock(vma->vm_mm);
+
+ /* If PTE is installed successfully, add the completed flag to
+ * indicate mmap lock is released,
+ * otherwise retake the read lock
+ */
+ if (ret == VM_FAULT_NOPAGE)
+ ret |= VM_FAULT_COMPLETED;
+ else
+ mmap_read_lock(vma->vm_mm);
+
return ret;
}


base-commit: 705b004ee377b789e39ae237519bab714297ac83
--
2.17.1


2023-05-08 17:00:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range

On Mon, May 08, 2023 at 08:58:42PM +0800, Yan Zhao wrote:
> In VFIO type1, vaddr_get_pfns() will try fault in MMIO PFNs after
> pin_user_pages_remote() returns -EFAULT.
>
> follow_fault_pfn
> fixup_user_fault
> handle_mm_fault
> handle_mm_fault
> do_fault
> do_shared_fault
> do_fault
> __do_fault
> vfio_pci_mmap_fault
> io_remap_pfn_range
> remap_pfn_range
> track_pfn_remap
> vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> remap_pfn_range_notrack
> vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
>
> As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> holding of mmap write lock is required.
> So, update vfio_pci_mmap_fault() to drop mmap read lock and take mmap
> write lock.
>
> [1] https://lkml.kernel.org/r/[email protected]
> commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
> commit 1c71222e5f23
> ("mm: replace vma->vm_flags direct modifications with modifier calls")
>
> Signed-off-by: Yan Zhao <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..5082f89152b3 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> struct vfio_pci_mmap_vma *mmap_vma;
> vm_fault_t ret = VM_FAULT_NOPAGE;
>
> + mmap_assert_locked(vma->vm_mm);
> + mmap_read_unlock(vma->vm_mm);
> +
> + if (mmap_write_lock_killable(vma->vm_mm))
> + return VM_FAULT_RETRY;

Certainly not..

I'm not sure how to resolve this properly, set the flags in advance?

The address space conversion?

Jason

2023-05-08 21:24:44

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range

On Mon, 8 May 2023 13:48:30 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Mon, May 08, 2023 at 08:58:42PM +0800, Yan Zhao wrote:
> > In VFIO type1, vaddr_get_pfns() will try fault in MMIO PFNs after
> > pin_user_pages_remote() returns -EFAULT.
> >
> > follow_fault_pfn
> > fixup_user_fault
> > handle_mm_fault
> > handle_mm_fault
> > do_fault
> > do_shared_fault
> > do_fault
> > __do_fault
> > vfio_pci_mmap_fault
> > io_remap_pfn_range
> > remap_pfn_range
> > track_pfn_remap
> > vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> > remap_pfn_range_notrack
> > vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> >
> > As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> > holding of mmap write lock is required.
> > So, update vfio_pci_mmap_fault() to drop mmap read lock and take mmap
> > write lock.
> >
> > [1] https://lkml.kernel.org/r/[email protected]
> > commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
> > commit 1c71222e5f23
> > ("mm: replace vma->vm_flags direct modifications with modifier calls")
> >
> > Signed-off-by: Yan Zhao <[email protected]>
> > ---
> > drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
> > 1 file changed, 17 insertions(+)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index a5ab416cf476..5082f89152b3 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > struct vfio_pci_mmap_vma *mmap_vma;
> > vm_fault_t ret = VM_FAULT_NOPAGE;
> >
> > + mmap_assert_locked(vma->vm_mm);
> > + mmap_read_unlock(vma->vm_mm);
> > +
> > + if (mmap_write_lock_killable(vma->vm_mm))
> > + return VM_FAULT_RETRY;
>
> Certainly not..
>
> I'm not sure how to resolve this properly, set the flags in advance?
>
> The address space conversion?

We already try to set the flags in advance, but there are some
architectural flags like VM_PAT that make that tricky. Cedric has been
looking at inserting individual pages with vmf_insert_pfn(), but that
incurs a lot more faults and therefore latency vs remapping the entire
vma on fault. I'm not convinced that we shouldn't just attempt to
remove the fault handler entirely, but I haven't tried it yet to know
what gotchas are down that path. Thanks,

Alex

2023-05-10 21:20:19

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range

On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:

> We already try to set the flags in advance, but there are some
> architectural flags like VM_PAT that make that tricky. Cedric has been
> looking at inserting individual pages with vmf_insert_pfn(), but that
> incurs a lot more faults and therefore latency vs remapping the entire
> vma on fault. I'm not convinced that we shouldn't just attempt to
> remove the fault handler entirely, but I haven't tried it yet to know
> what gotchas are down that path. Thanks,

I thought we did it like this because there were races otherwise with
PTE insertion and zapping? I don't remember well anymore.

I vaugely remember the address_space conversion might help remove the
fault handler?

Jason

2023-05-11 07:32:47

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range

On Wed, May 10, 2023 at 05:41:06PM -0300, Jason Gunthorpe wrote:
> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
> > We already try to set the flags in advance, but there are some
> > architectural flags like VM_PAT that make that tricky. Cedric has been
> > looking at inserting individual pages with vmf_insert_pfn(), but that
> > incurs a lot more faults and therefore latency vs remapping the entire
> > vma on fault. I'm not convinced that we shouldn't just attempt to
> > remove the fault handler entirely, but I haven't tried it yet to know
> > what gotchas are down that path. Thanks,
>
> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.
>
> I vaugely remember the address_space conversion might help remove the
> fault handler?
>
What about calling vmf_insert_pfn() in bulk as below?
And what is address_space conversion?


diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..1476e537f593 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1686,6 +1686,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
struct vfio_pci_core_device *vdev = vma->vm_private_data;
struct vfio_pci_mmap_vma *mmap_vma;
vm_fault_t ret = VM_FAULT_NOPAGE;
+ unsigned long base_pfn, offset, i;

mutex_lock(&vdev->vma_lock);
down_read(&vdev->memory_lock);
@@ -1710,12 +1711,15 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
goto up_out;
}

- if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
- vma->vm_end - vma->vm_start,
- vma->vm_page_prot)) {
- ret = VM_FAULT_SIGBUS;
- zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
- goto up_out;
+ base_pfn = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
+ base_pfn += vma->vm_pgoff;
+ for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
+ offset = (i - vma->vm_start) >> PAGE_SHIFT;
+ ret = vmf_insert_pfn(vma, i, base_pfn + offset);
+ if (ret != VM_FAULT_NOPAGE) {
+ zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+ goto up_out;
+ }
}

if (__vfio_pci_add_vma(vdev, vma)) {


2023-05-11 07:55:52

by Cédric Le Goater

[permalink] [raw]
Subject: Re: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range

On 5/11/23 08:56, Yan Zhao wrote:
> On Wed, May 10, 2023 at 05:41:06PM -0300, Jason Gunthorpe wrote:
>> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>>
>>> We already try to set the flags in advance, but there are some
>>> architectural flags like VM_PAT that make that tricky. Cedric has been
>>> looking at inserting individual pages with vmf_insert_pfn(), but that
>>> incurs a lot more faults and therefore latency vs remapping the entire
>>> vma on fault. I'm not convinced that we shouldn't just attempt to
>>> remove the fault handler entirely, but I haven't tried it yet to know
>>> what gotchas are down that path. Thanks,
>>
>> I thought we did it like this because there were races otherwise with
>> PTE insertion and zapping? I don't remember well anymore.
>>
>> I vaugely remember the address_space conversion might help remove the
>> fault handler?
>>
> What about calling vmf_insert_pfn() in bulk as below?

This works too, it is slightly slower than the io_remap_pfn_range() call
but doesn't have the lockdep issues.

Thanks,

C.

> And what is address_space conversion?
>
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..1476e537f593 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1686,6 +1686,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> struct vfio_pci_core_device *vdev = vma->vm_private_data;
> struct vfio_pci_mmap_vma *mmap_vma;
> vm_fault_t ret = VM_FAULT_NOPAGE;
> + unsigned long base_pfn, offset, i;
>
> mutex_lock(&vdev->vma_lock);
> down_read(&vdev->memory_lock);
> @@ -1710,12 +1711,15 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> goto up_out;
> }
>
> - if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> - vma->vm_end - vma->vm_start,
> - vma->vm_page_prot)) {
> - ret = VM_FAULT_SIGBUS;
> - zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> - goto up_out;
> + base_pfn = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> + base_pfn += vma->vm_pgoff;
> + for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> + offset = (i - vma->vm_start) >> PAGE_SHIFT;
> + ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> + if (ret != VM_FAULT_NOPAGE) {
> + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> + goto up_out;
> + }
> }
>
> if (__vfio_pci_add_vma(vdev, vma)) {
>


2023-05-11 07:56:36

by Cédric Le Goater

[permalink] [raw]
Subject: Re: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range

On 5/10/23 22:41, Jason Gunthorpe wrote:
> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
>> We already try to set the flags in advance, but there are some
>> architectural flags like VM_PAT that make that tricky. Cedric has been
>> looking at inserting individual pages with vmf_insert_pfn(), but that
>> incurs a lot more faults and therefore latency vs remapping the entire
>> vma on fault. I'm not convinced that we shouldn't just attempt to
>> remove the fault handler entirely, but I haven't tried it yet to know
>> what gotchas are down that path. Thanks,

OTOH I didn't see any noticeable slowdowns in the tests I did with
NICs, IGD and NVIDIA GPU pass-through. I lack devices with large
BARs though. If anyone has some time for it, here are commits :

https://github.com/legoater/linux/commits/vfio

First does a one page insert and fixes the lockdep issues we are
seeing with the recent series:

https://lore.kernel.org/all/[email protected]/

Second adds some statistics. For the NVIDIA GPU, faults reach 800k.

Thanks,

C.

> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.
>
> I vaugely remember the address_space conversion might help remove the
> fault handler?
>
> Jason
>


2023-05-11 16:30:53

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range

On Wed, 10 May 2023 17:41:06 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
> > We already try to set the flags in advance, but there are some
> > architectural flags like VM_PAT that make that tricky. Cedric has been
> > looking at inserting individual pages with vmf_insert_pfn(), but that
> > incurs a lot more faults and therefore latency vs remapping the entire
> > vma on fault. I'm not convinced that we shouldn't just attempt to
> > remove the fault handler entirely, but I haven't tried it yet to know
> > what gotchas are down that path. Thanks,
>
> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.

TBH, I don't recall if we tried a synchronous approach previously. The
benefit of the faulting approach was that we could track the minimum
set of vmas which are actually making use of the mapping and throw that
tracking list away when zapping. Without that, we need to add vmas
both on mmap and in vm_ops.open, removing only in vm_ops.close, and
acquire all the proper mm locking for each vma to re-insert the
mappings.

> I vaugely remember the address_space conversion might help remove the
> fault handler?

Yes, this did remove the fault handler entirely, it's (obviously)
dropped off my radar, but perhaps in the interim we could switch to
vmf_insert_pfn() and revive the address space series to eventually
remove the fault handling and vma list altogether.

For reference, I think this was the last posting of the address space
series:

https://lore.kernel.org/all/162818167535.1511194.6614962507750594786.stgit@omen/

Thanks,
Alex


2023-05-11 18:00:35

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range

On Thu, May 11, 2023 at 10:07:06AM -0600, Alex Williamson wrote:

> > I vaugely remember the address_space conversion might help remove the
> > fault handler?
>
> Yes, this did remove the fault handler entirely, it's (obviously)
> dropped off my radar, but perhaps in the interim we could switch to
> vmf_insert_pfn() and revive the address space series to eventually
> remove the fault handling and vma list altogether.

vmf_insert_pfn() technically isn't supposed to be used for MMIO..

Eg it doesn't do the PAT stuff on x86 that is causing this problem in
the first place.

So doing the address space removing series seems like the best fix. It
has been mislocked for a long time, I suspect there isn't a real
urgent problem beyond we actually have lockdep annoations to catch the
mislocking now.

Jason

2023-05-12 08:42:25

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range

On Thu, May 11, 2023 at 10:07:06AM -0600, Alex Williamson wrote:
> On Wed, 10 May 2023 17:41:06 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
> >
> > > We already try to set the flags in advance, but there are some
> > > architectural flags like VM_PAT that make that tricky. Cedric has been
> > > looking at inserting individual pages with vmf_insert_pfn(), but that
> > > incurs a lot more faults and therefore latency vs remapping the entire
> > > vma on fault. I'm not convinced that we shouldn't just attempt to
> > > remove the fault handler entirely, but I haven't tried it yet to know
> > > what gotchas are down that path. Thanks,
> >
> > I thought we did it like this because there were races otherwise with
> > PTE insertion and zapping? I don't remember well anymore.
>
> TBH, I don't recall if we tried a synchronous approach previously. The
> benefit of the faulting approach was that we could track the minimum
> set of vmas which are actually making use of the mapping and throw that
> tracking list away when zapping. Without that, we need to add vmas
> both on mmap and in vm_ops.open, removing only in vm_ops.close, and
> acquire all the proper mm locking for each vma to re-insert the
> mappings.
>
> > I vaugely remember the address_space conversion might help remove the
> > fault handler?
>
> Yes, this did remove the fault handler entirely, it's (obviously)
> dropped off my radar, but perhaps in the interim we could switch to
> vmf_insert_pfn() and revive the address space series to eventually
> remove the fault handling and vma list altogether.
>
> For reference, I think this was the last posting of the address space
> series:
>
> https://lore.kernel.org/all/162818167535.1511194.6614962507750594786.stgit@omen/

Just took a quick look at this series.
A question is that looks it still needs to call io_remap_pfn_range() in
places like vfio_basic_config_write() for PCI_COMMAND, and device reset,
so mmap write lock is still required around vdev->memory_lock.