With the introduction of memory I/O (MIO) instructions enabled in commit
71ba41c9b1d9 ("s390/pci: provide support for MIO instructions") s390
gained support for direct user-space access to mapped PCI resources.
Even without those, however, user-space can access mapped PCI resources
via the s390-specific MMIO syscalls. There is thus nothing fundamentally
preventing s390 from supporting VFIO_PCI_MMAP, allowing user-space drivers
to access PCI resources without going through the pread() interface.
To actually enable VFIO_PCI_MMAP, however, a few issues need fixing.
Firstly the s390 MMIO syscalls do not cause a page fault when
follow_pte() fails due to the page not being present. This breaks
vfio-pci's mmap() handling which lazily maps on first access.
Secondly on s390 there is a virtual PCI device called ISM which has
a few oddities. For one, it claims to have a 256 TiB PCI BAR (not a typo),
which causes any attempt to mmap() it to fail with the following message:
vmap allocation for size 281474976714752 failed: use vmalloc=<size> to increase size
Even if one tried to map this BAR only partially the mapping would not
be usable on systems with MIO support enabled. So just block mapping
BARs which don't fit between IOREMAP_START and IOREMAP_END.
Note:
For your convenience the code is also available in the tagged
b4/vfio_pci_mmap branch on my git.kernel.org site below:
https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/
Thanks,
Niklas
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Niklas Schnelle <[email protected]>
---
Changes in v3:
- Rebased on v6.10-rc1 requiring change to follow_pte() call
- Use current->mm for fixup_user_fault() as it seems more common
- Collected new trailers
- Link to v2: https://lore.kernel.org/r/[email protected]
Changes in v2:
- Changed last patch to remove VFIO_PCI_MMAP instead of just enabling it
for s390 as it is unconditionally true with s390 supporting PCI resource mmap() (Jason)
- Collected R-bs from Jason
- Link to v1: https://lore.kernel.org/r/[email protected]
---
Niklas Schnelle (3):
s390/pci: Fix s390_mmio_read/write syscall page fault handling
vfio/pci: Tolerate oversized BARs by disallowing mmap
vfio/pci: Enable PCI resource mmap() on s390 and remove VFIO_PCI_MMAP
arch/s390/pci/pci_mmio.c | 18 +++++++++++++-----
drivers/vfio/pci/Kconfig | 4 ----
drivers/vfio/pci/vfio_pci_core.c | 11 ++++++-----
3 files changed, 19 insertions(+), 14 deletions(-)
---
base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
change-id: 20240503-vfio_pci_mmap-1549e3d02ca7
Best regards,
--
Niklas Schnelle
On s390 there is a virtual PCI device called ISM which has a few rather
annoying oddities. For one it claims to have a 256 TiB PCI BAR (not
a typo) which leads to any attempt to mmap() it failing during vmap.
Even if one tried to map this "BAR" only partially, the mapping would not
be usable on systems with MIO support enabled. This is because of
another oddity: this virtual PCI device does not support the newer
memory I/O (MIO) PCI instructions, and the legacy PCI instructions are
not accessible by user-space when MIO is in use. If this device needs to
be accessed by user-space it will thus need a vfio-pci variant driver.
Until then work around both issues by excluding resources which don't
fit between IOREMAP_START and IOREMAP_END in vfio_pci_probe_mmaps().
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Matthew Rosato <[email protected]>
Signed-off-by: Niklas Schnelle <[email protected]>
---
drivers/vfio/pci/vfio_pci_core.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 80cae87fff36..0f1ddf2d3ef2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -28,6 +28,7 @@
#include <linux/nospec.h>
#include <linux/sched/mm.h>
#include <linux/iommufd.h>
+#include <linux/ioremap.h>
#if IS_ENABLED(CONFIG_EEH)
#include <asm/eeh.h>
#endif
@@ -129,9 +130,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
/*
* The PCI core shouldn't set up a resource with a
* type but zero size. But there may be bugs that
- * cause us to do that.
+ * cause us to do that. There is also at least one
+ * device which advertises a resource too large to
+ * ioremap().
*/
- if (!resource_size(res))
+ if (!resource_size(res) ||
+ resource_size(res) > (IOREMAP_END + 1 - IOREMAP_START))
goto no_mmap;
if (resource_size(res) >= PAGE_SIZE) {
--
2.40.1
The s390 MMIO syscalls when using the classic PCI instructions do not
cause a page fault when follow_pte() fails due to the page not being
present. Besides being a general deficiency this breaks vfio-pci's mmap()
handling once VFIO_PCI_MMAP gets enabled as this lazily maps on first
access. Fix this by following a failed follow_pte() with
fixup_user_fault() and retrying the follow_pte().
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Matthew Rosato <[email protected]>
Signed-off-by: Niklas Schnelle <[email protected]>
---
arch/s390/pci/pci_mmio.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
index 5398729bfe1b..80c21b1a101c 100644
--- a/arch/s390/pci/pci_mmio.c
+++ b/arch/s390/pci/pci_mmio.c
@@ -170,8 +170,12 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
goto out_unlock_mmap;
ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
- if (ret)
- goto out_unlock_mmap;
+ if (ret) {
+ fixup_user_fault(current->mm, mmio_addr, FAULT_FLAG_WRITE, NULL);
+ ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
+ if (ret)
+ goto out_unlock_mmap;
+ }
io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
(mmio_addr & ~PAGE_MASK));
@@ -305,12 +309,16 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
goto out_unlock_mmap;
ret = -EACCES;
- if (!(vma->vm_flags & VM_WRITE))
+ if (!(vma->vm_flags & VM_READ))
goto out_unlock_mmap;
ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
- if (ret)
- goto out_unlock_mmap;
+ if (ret) {
+ fixup_user_fault(current->mm, mmio_addr, 0, NULL);
+ ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
+ if (ret)
+ goto out_unlock_mmap;
+ }
io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
(mmio_addr & ~PAGE_MASK));
--
2.40.1
With the introduction of memory I/O (MIO) instructions enabled in commit
71ba41c9b1d9 ("s390/pci: provide support for MIO instructions") s390
gained support for direct user-space access to mapped PCI resources.
Even without those, however, user-space can access mapped PCI resources
via the s390-specific MMIO syscalls. Thus mmap() can and should be
supported on all s390 systems with native PCI. Since enabling
VFIO_PCI_MMAP for s390 would make the option unconditionally true and
thus pointless, remove it entirely.
Link: https://lore.kernel.org/all/[email protected]/
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Matthew Rosato <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Niklas Schnelle <[email protected]>
---
drivers/vfio/pci/Kconfig | 4 ----
drivers/vfio/pci/vfio_pci_core.c | 3 ---
2 files changed, 7 deletions(-)
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index bf50ffa10bde..c3bcb6911c53 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -7,10 +7,6 @@ config VFIO_PCI_CORE
select VFIO_VIRQFD
select IRQ_BYPASS_MANAGER
-config VFIO_PCI_MMAP
- def_bool y if !S390
- depends on VFIO_PCI_CORE
-
config VFIO_PCI_INTX
def_bool y if !S390
depends on VFIO_PCI_CORE
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 0f1ddf2d3ef2..a0e2e2a806d1 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -121,9 +121,6 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
res = &vdev->pdev->resource[bar];
- if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
- goto no_mmap;
-
if (!(res->flags & IORESOURCE_MEM))
goto no_mmap;
--
2.40.1
Am 29.05.24 um 13:36 schrieb Niklas Schnelle:
> With the introduction of memory I/O (MIO) instructions enbaled in commit
> 71ba41c9b1d9 ("s390/pci: provide support for MIO instructions") s390
> gained support for direct user-space access to mapped PCI resources.
> Even without those however user-space can access mapped PCI resources
> via the s390 specific MMIO syscalls. There is thus nothing fundamentally
> preventing s390 from supporting VFIO_PCI_MMAP allowing user-space drivers
> to access PCI resources without going through the pread() interface.
> To actually enable VFIO_PCI_MMAP a few issues need fixing however.
>
> Firstly the s390 MMIO syscalls do not cause a page fault when
> follow_pte() fails due to the page not being present. This breaks
> vfio-pci's mmap() handling which lazily maps on first access.
>
> Secondly on s390 there is a virtual PCI device called ISM which has
> a few oddities. For one it claims to have a 256 TiB PCI BAR (not a typo)
> which leads to any attempt to mmap() it fail with the following message:
>
> vmap allocation for size 281474976714752 failed: use vmalloc=<size> to increase size
>
> Even if one tried to map this BAR only partially the mapping would not
> be usable on systems with MIO support enabled. So just block mapping
> BARs which don't fit between IOREMAP_START and IOREMAP_END.
>
> Note:
> For your convenience the code is also available in the tagged
> b4/vfio_pci_mmap branch on my git.kernel.org site below:
> https: //git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/
I guess it's now mostly a question of who picks those patches? Alex?
Any patch suitable for stable?
On Mon, 2024-06-03 at 17:50 +0200, Christian Borntraeger wrote:
> Am 29.05.24 um 13:36 schrieb Niklas Schnelle:
> > With the introduction of memory I/O (MIO) instructions enbaled in commit
> > 71ba41c9b1d9 ("s390/pci: provide support for MIO instructions") s390
> > gained support for direct user-space access to mapped PCI resources.
> > Even without those however user-space can access mapped PCI resources
> > via the s390 specific MMIO syscalls. There is thus nothing fundamentally
> > preventing s390 from supporting VFIO_PCI_MMAP allowing user-space drivers
> > to access PCI resources without going through the pread() interface.
> > To actually enable VFIO_PCI_MMAP a few issues need fixing however.
> >
> > Firstly the s390 MMIO syscalls do not cause a page fault when
> > follow_pte() fails due to the page not being present. This breaks
> > vfio-pci's mmap() handling which lazily maps on first access.
> >
> > Secondly on s390 there is a virtual PCI device called ISM which has
> > a few oddities. For one it claims to have a 256 TiB PCI BAR (not a typo)
> > which leads to any attempt to mmap() it fail with the following message:
> >
> > vmap allocation for size 281474976714752 failed: use vmalloc=<size> to increase size
> >
> > Even if one tried to map this BAR only partially the mapping would not
> > be usable on systems with MIO support enabled. So just block mapping
> > BARs which don't fit between IOREMAP_START and IOREMAP_END.
> >
> > Note:
> > For your convenience the code is also available in the tagged
> > b4/vfio_pci_mmap branch on my git.kernel.org site below:
> > https: //git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/
>
>
> I guess its now mostly a question of who picks those patches? Alex?
That matches my understanding as well.
>
> Any patch suitable for stable?
I'd almost say all but the last one may be candidates for stable. I
found it hard to pinpoint a specific commit they fix though, hence the
lack of Fixes tag. For the first one I'm actually not sure if e.g.
rdma-core users could also run into this problem when they get swapped
out as I'm not sure if the mapping is pinned there.
On Tue, 2024-06-04 at 11:27 +0200, Niklas Schnelle wrote:
> On Mon, 2024-06-03 at 17:50 +0200, Christian Borntraeger wrote:
> > Am 29.05.24 um 13:36 schrieb Niklas Schnelle:
> > > With the introduction of memory I/O (MIO) instructions enbaled in commit
> > > 71ba41c9b1d9 ("s390/pci: provide support for MIO instructions") s390
> > > gained support for direct user-space access to mapped PCI resources.
> > > Even without those however user-space can access mapped PCI resources
> > > via the s390 specific MMIO syscalls. There is thus nothing fundamentally
> > > preventing s390 from supporting VFIO_PCI_MMAP allowing user-space drivers
> > > to access PCI resources without going through the pread() interface.
> > > To actually enable VFIO_PCI_MMAP a few issues need fixing however.
> > >
> > > Firstly the s390 MMIO syscalls do not cause a page fault when
> > > follow_pte() fails due to the page not being present. This breaks
> > > vfio-pci's mmap() handling which lazily maps on first access.
> > >
> > > Secondly on s390 there is a virtual PCI device called ISM which has
> > > a few oddities. For one it claims to have a 256 TiB PCI BAR (not a typo)
> > > which leads to any attempt to mmap() it fail with the following message:
> > >
> > > vmap allocation for size 281474976714752 failed: use vmalloc=<size> to increase size
> > >
> > > Even if one tried to map this BAR only partially the mapping would not
> > > be usable on systems with MIO support enabled. So just block mapping
> > > BARs which don't fit between IOREMAP_START and IOREMAP_END.
> > >
> > > Note:
> > > For your convenience the code is also available in the tagged
> > > b4/vfio_pci_mmap branch on my git.kernel.org site below:
> > > https: //git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/
> >
> >
> > I guess its now mostly a question of who picks those patches? Alex?
>
> That matches my understanding as well.
>
> >
> > Any patch suitable for stable?
>
> I'd almost say all but the last one may be candidates for stable. I
> found it hard to pinpoint a specific commit they fix though, hence the
> lack of Fixes tag. For the first one I'm actually not sure if e.g.
> rdma-core users could also run into this problem when they get swapped
> out as I'm not sure if the mapping is pinned there.
>
Was a bit unclear/wrong above. Obviously MMIO mappings can't be
"swapped out" I should have said "are subject to page faults".
On Mon, 3 Jun 2024 17:50:13 +0200
Christian Borntraeger <[email protected]> wrote:
> Am 29.05.24 um 13:36 schrieb Niklas Schnelle:
> > With the introduction of memory I/O (MIO) instructions enbaled in commit
> > 71ba41c9b1d9 ("s390/pci: provide support for MIO instructions") s390
> > gained support for direct user-space access to mapped PCI resources.
> > Even without those however user-space can access mapped PCI resources
> > via the s390 specific MMIO syscalls. There is thus nothing fundamentally
> > preventing s390 from supporting VFIO_PCI_MMAP allowing user-space drivers
> > to access PCI resources without going through the pread() interface.
> > To actually enable VFIO_PCI_MMAP a few issues need fixing however.
> >
> > Firstly the s390 MMIO syscalls do not cause a page fault when
> > follow_pte() fails due to the page not being present. This breaks
> > vfio-pci's mmap() handling which lazily maps on first access.
> >
> > Secondly on s390 there is a virtual PCI device called ISM which has
> > a few oddities. For one it claims to have a 256 TiB PCI BAR (not a typo)
> > which leads to any attempt to mmap() it fail with the following message:
> >
> > vmap allocation for size 281474976714752 failed: use vmalloc=<size> to increase size
> >
> > Even if one tried to map this BAR only partially the mapping would not
> > be usable on systems with MIO support enabled. So just block mapping
> > BARs which don't fit between IOREMAP_START and IOREMAP_END.
> >
> > Note:
> > For your convenience the code is also available in the tagged
> > b4/vfio_pci_mmap branch on my git.kernel.org site below:
> > https: //git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/
>
>
> I guess its now mostly a question of who picks those patches? Alex?
>
> Any patch suitable for stable?
Nothing here looks like stable material to me. 1/ only becomes an
issue when mmap of MMIO is allowed on s390 (ie. 3/), 2/ is generic, but
only really targets a device found on s390, and finally 3/ is
essentially enabling a new feature.
If we expect any conflicts with 1/ in the next merge window I can take
a branch for it and apply 2/ and 3/ through the vfio tree, otherwise I
can bring them all through the vfio tree if the s390 folks agree.
Thanks,
Alex
On Thu, Jun 06, 2024 at 11:27:18AM -0600, Alex Williamson wrote:
Hi Alex,
> If we expect any conflicts with 1/ in the next merge window I can take
> a branch for it and apply 2/ and 3/ through the vfio tree, otherwise I
> can bring them all through the vfio tree if the s390 folks agree.
Yes. Pull it via the vfio tree, please.
> Thanks,
>
> Alex
Thanks!
On Thu, 2024-06-06 at 11:27 -0600, Alex Williamson wrote:
> On Mon, 3 Jun 2024 17:50:13 +0200
> Christian Borntraeger <[email protected]> wrote:
>
> > Am 29.05.24 um 13:36 schrieb Niklas Schnelle:
> > > With the introduction of memory I/O (MIO) instructions enbaled in commit
> > > 71ba41c9b1d9 ("s390/pci: provide support for MIO instructions") s390
> > > gained support for direct user-space access to mapped PCI resources.
> > > Even without those however user-space can access mapped PCI resources
> > > via the s390 specific MMIO syscalls. There is thus nothing fundamentally
> > > preventing s390 from supporting VFIO_PCI_MMAP allowing user-space drivers
> > > to access PCI resources without going through the pread() interface.
> > > To actually enable VFIO_PCI_MMAP a few issues need fixing however.
> > >
> > > Firstly the s390 MMIO syscalls do not cause a page fault when
> > > follow_pte() fails due to the page not being present. This breaks
> > > vfio-pci's mmap() handling which lazily maps on first access.
> > >
> > > Secondly on s390 there is a virtual PCI device called ISM which has
> > > a few oddities. For one it claims to have a 256 TiB PCI BAR (not a typo)
> > > which leads to any attempt to mmap() it fail with the following message:
> > >
> > > vmap allocation for size 281474976714752 failed: use vmalloc=<size> to increase size
> > >
> > > Even if one tried to map this BAR only partially the mapping would not
> > > be usable on systems with MIO support enabled. So just block mapping
> > > BARs which don't fit between IOREMAP_START and IOREMAP_END.
> > >
> > > Note:
> > > For your convenience the code is also available in the tagged
> > > b4/vfio_pci_mmap branch on my git.kernel.org site below:
> > > https: //git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/
> >
> >
> > I guess its now mostly a question of who picks those patches? Alex?
> >
> > Any patch suitable for stable?
>
> Nothing here looks like stable material to me. 1/ only becomes an
> issue when mmap of MMIO is allowed on s390 (ie. 3/), 2/ is generic, but
> only really targets a device found on s390, and finally 3/ is
> essentially enabling a new feature.
I trust your judgement and was unsure too. I think for the
s390_mmio_read/write syscalls the only existing users out there are via
rdma-core, so unless Jason tells us that he thinks they could also be
affected by the lack of page fault handling I see no problem in going
upstream only.
>
> If we expect any conflicts with 1/ in the next merge window I can take
> a branch for it and apply 2/ and 3/ through the vfio tree, otherwise I
> can bring them all through the vfio tree if the s390 folks agree.
> Thanks,
>
> Alex
>
I also agree with this going via the vfio tree. I don't foresee
conflicts with 1.
Thanks,
Niklas
On Fri, Jun 07, 2024 at 09:47:08AM +0200, Niklas Schnelle wrote:
> I trust your judgement and was unsure too. I think for the
> s390_mmio_read/write syscalls the only existing users out there are via
> rdma-core, so unless Jason tells us that he thinks they could also be
> affected by the lack of page fault handling I see no problem in going
> upstream only.
rdma doesn't use the fault path for the doorbell mmio.
Jason
On Wed, 2024-05-29 at 13:36 +0200, Niklas Schnelle wrote:
> The s390 MMIO syscalls when using the classic PCI instructions do not
> cause a page fault when follow_pte() fails due to the page not being
> present. Besides being a general deficiency this breaks vfio-pci's mmap()
> handling once VFIO_PCI_MMAP gets enabled as this lazily maps on first
> access. Fix this by following a failed follow_pte() with
> fixup_user_page() and retrying the follow_pte().
>
> Reviewed-by: Jason Gunthorpe <[email protected]>
> Reviewed-by: Matthew Rosato <[email protected]>
> Signed-off-by: Niklas Schnelle <[email protected]>
> ---
> arch/s390/pci/pci_mmio.c | 18 +++++++++++++-----
> 1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
> index 5398729bfe1b..80c21b1a101c 100644
> --- a/arch/s390/pci/pci_mmio.c
> +++ b/arch/s390/pci/pci_mmio.c
> @@ -170,8 +170,12 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
> goto out_unlock_mmap;
>
> ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> - if (ret)
> - goto out_unlock_mmap;
> + if (ret) {
> + fixup_user_fault(current->mm, mmio_addr, FAULT_FLAG_WRITE, NULL);
> + ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> + if (ret)
> + goto out_unlock_mmap;
> + }
>
> io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
> (mmio_addr & ~PAGE_MASK));
> @@ -305,12 +309,16 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
> if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> goto out_unlock_mmap;
> ret = -EACCES;
> - if (!(vma->vm_flags & VM_WRITE))
> + if (!(vma->vm_flags & VM_READ))
> goto out_unlock_mmap;
>
> ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> - if (ret)
> - goto out_unlock_mmap;
> + if (ret) {
> + fixup_user_fault(current->mm, mmio_addr, 0, NULL);
> + ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> + if (ret)
> + goto out_unlock_mmap;
> + }
>
> io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
> (mmio_addr & ~PAGE_MASK));
>
Ughh, I think I just stumbled over a problem with this. This is a
failing lock-held assertion via __is_vma_write_locked() in
remap_pfn_range_notrack(), but I'm not sure yet what exactly causes it:
[ 67.338855] ------------[ cut here ]------------
[ 67.338865] WARNING: CPU: 15 PID: 2056 at include/linux/rwsem.h:85 remap_pfn_range_notrack+0x596/0x5b0
[ 67.338874] Modules linked in: <--- 8< --->
[ 67.338931] CPU: 15 PID: 2056 Comm: vfio-test Not tainted 6.10.0-rc1-pci-pfault-00004-g193e3a513cee #5
[ 67.338934] Hardware name: IBM 3931 A01 701 (LPAR)
[ 67.338935] Krnl PSW : 0704c00180000000 000003e54c9730ea (remap_pfn_range_notrack+0x59a/0x5b0)
[ 67.338940] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 67.338944] Krnl GPRS: 0000000000000100 000003655915fb78 000002d80b9a5928 000003ff7fa00000
[ 67.338946] 0004008000000000 0000000000004000 0000000000000711 000003ff7fa04000
[ 67.338948] 000002d80c533f00 000002d800000100 000002d81bbe6c28 000002d80b9a5928
[ 67.338950] 000003ff7fa00000 000002d80c533f00 000003e54c973120 000003655915fab0
[ 67.338956] Krnl Code: 000003e54c9730de: a708ffea lhi %r0,-22
000003e54c9730e2: a7f4fff6 brc 15,000003e54c9730ce
#000003e54c9730e6: af000000 mc 0,0
>000003e54c9730ea: a7f4fd6e brc 15,000003e54c972bc6
000003e54c9730ee: af000000 mc 0,0
000003e54c9730f2: af000000 mc 0,0
000003e54c9730f6: 0707 bcr 0,%r7
000003e54c9730f8: 0707 bcr 0,%r7
[ 67.339025] Call Trace:
[ 67.339027] [<000003e54c9730ea>] remap_pfn_range_notrack+0x59a/0x5b0
[ 67.339032] [<000003e54c973120>] remap_pfn_range+0x20/0x30
[ 67.339035] [<000003e4cce5396c>] vfio_pci_mmap_fault+0xec/0x1d0 [vfio_pci_core]
[ 67.339043] [<000003e54c977240>] handle_mm_fault+0x6b0/0x25a0
[ 67.339046] [<000003e54c966328>] fixup_user_fault+0x138/0x310
[ 67.339048] [<000003e54c63a91c>] __s390x_sys_s390_pci_mmio_read+0x28c/0x3a0
[ 67.339051] [<000003e54c5e200a>] do_syscall+0xea/0x120
[ 67.339055] [<000003e54d5f9954>] __do_syscall+0x94/0x140
[ 67.339059] [<000003e54d611020>] system_call+0x70/0xa0
[ 67.339063] Last Breaking-Event-Address:
[ 67.339065] [<000003e54c972bc2>] remap_pfn_range_notrack+0x72/0x5b0
[ 67.339067] ---[ end trace 0000000000000000 ]---
On Tue, 2024-06-11 at 13:21 +0200, Niklas Schnelle wrote:
> On Wed, 2024-05-29 at 13:36 +0200, Niklas Schnelle wrote:
> > The s390 MMIO syscalls when using the classic PCI instructions do not
> > cause a page fault when follow_pte() fails due to the page not being
> > present. Besides being a general deficiency this breaks vfio-pci's mmap()
> > handling once VFIO_PCI_MMAP gets enabled as this lazily maps on first
> > access. Fix this by following a failed follow_pte() with
> > fixup_user_page() and retrying the follow_pte().
> >
> > Reviewed-by: Jason Gunthorpe <[email protected]>
> > Reviewed-by: Matthew Rosato <[email protected]>
> > Signed-off-by: Niklas Schnelle <[email protected]>
> > ---
> > arch/s390/pci/pci_mmio.c | 18 +++++++++++++-----
> > 1 file changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
> > index 5398729bfe1b..80c21b1a101c 100644
> > --- a/arch/s390/pci/pci_mmio.c
> > +++ b/arch/s390/pci/pci_mmio.c
> > @@ -170,8 +170,12 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
> > goto out_unlock_mmap;
> >
> > ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> > - if (ret)
> > - goto out_unlock_mmap;
> > + if (ret) {
> > + fixup_user_fault(current->mm, mmio_addr, FAULT_FLAG_WRITE, NULL);
> > + ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> > + if (ret)
> > + goto out_unlock_mmap;
> > + }
> >
> > io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
> > (mmio_addr & ~PAGE_MASK));
> > @@ -305,12 +309,16 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
> > if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> > goto out_unlock_mmap;
> > ret = -EACCES;
> > - if (!(vma->vm_flags & VM_WRITE))
> > + if (!(vma->vm_flags & VM_READ))
> > goto out_unlock_mmap;
> >
> > ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> > - if (ret)
> > - goto out_unlock_mmap;
> > + if (ret) {
> > + fixup_user_fault(current->mm, mmio_addr, 0, NULL);
> > + ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> > + if (ret)
> > + goto out_unlock_mmap;
> > + }
> >
> > io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
> > (mmio_addr & ~PAGE_MASK));
> >
>
> Ughh, I think I just stumbled over a problem with this. This is a
> failing lock held assertion via __is_vma_write_locked() in
> remap_pfn_range_notrack() but I'm not sure yet what exactly causes this
>
> [ 67.338855] ------------[ cut here ]------------
> [ 67.338865] WARNING: CPU: 15 PID: 2056 at include/linux/rwsem.h:85 remap_pfn_range_notrack+0x596/0x5b0
> [ 67.338874] Modules linked in: <--- 8< --->
> [ 67.338931] CPU: 15 PID: 2056 Comm: vfio-test Not tainted 6.10.0-rc1-pci-pfault-00004-g193e3a513cee #5
> [ 67.338934] Hardware name: IBM 3931 A01 701 (LPAR)
> [ 67.338935] Krnl PSW : 0704c00180000000 000003e54c9730ea (remap_pfn_range_notrack+0x59a/0x5b0)
> [ 67.338940] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> [ 67.338944] Krnl GPRS: 0000000000000100 000003655915fb78 000002d80b9a5928 000003ff7fa00000
> [ 67.338946] 0004008000000000 0000000000004000 0000000000000711 000003ff7fa04000
> [ 67.338948] 000002d80c533f00 000002d800000100 000002d81bbe6c28 000002d80b9a5928
> [ 67.338950] 000003ff7fa00000 000002d80c533f00 000003e54c973120 000003655915fab0
> [ 67.338956] Krnl Code: 000003e54c9730de: a708ffea lhi %r0,-22
> 000003e54c9730e2: a7f4fff6 brc 15,000003e54c9730ce
> #000003e54c9730e6: af000000 mc 0,0
> >000003e54c9730ea: a7f4fd6e brc 15,000003e54c972bc6
> 000003e54c9730ee: af000000 mc 0,0
> 000003e54c9730f2: af000000 mc 0,0
> 000003e54c9730f6: 0707 bcr 0,%r7
> 000003e54c9730f8: 0707 bcr 0,%r7
> [ 67.339025] Call Trace:
> [ 67.339027] [<000003e54c9730ea>] remap_pfn_range_notrack+0x59a/0x5b0
> [ 67.339032] [<000003e54c973120>] remap_pfn_range+0x20/0x30
> [ 67.339035] [<000003e4cce5396c>] vfio_pci_mmap_fault+0xec/0x1d0 [vfio_pci_core]
> [ 67.339043] [<000003e54c977240>] handle_mm_fault+0x6b0/0x25a0
> [ 67.339046] [<000003e54c966328>] fixup_user_fault+0x138/0x310
> [ 67.339048] [<000003e54c63a91c>] __s390x_sys_s390_pci_mmio_read+0x28c/0x3a0
> [ 67.339051] [<000003e54c5e200a>] do_syscall+0xea/0x120
> [ 67.339055] [<000003e54d5f9954>] __do_syscall+0x94/0x140
> [ 67.339059] [<000003e54d611020>] system_call+0x70/0xa0
> [ 67.339063] Last Breaking-Event-Address:
> [ 67.339065] [<000003e54c972bc2>] remap_pfn_range_notrack+0x72/0x5b0
> [ 67.339067] ---[ end trace 0000000000000000 ]---
>
This has me a bit confused: __is_vma_write_locked() checks
mmap_assert_write_locked(vma->vm_mm), but most other users of
fixup_user_fault() hold mmap_read_lock() just like this code, and
clearly in the non-page-fault case only the read lock is needed.
On Tue, 2024-06-11 at 14:08 +0200, Niklas Schnelle wrote:
> On Tue, 2024-06-11 at 13:21 +0200, Niklas Schnelle wrote:
> > On Wed, 2024-05-29 at 13:36 +0200, Niklas Schnelle wrote:
> > > The s390 MMIO syscalls when using the classic PCI instructions do not
> > > cause a page fault when follow_pte() fails due to the page not being
> > > present. Besides being a general deficiency this breaks vfio-pci's mmap()
> > > handling once VFIO_PCI_MMAP gets enabled as this lazily maps on first
> > > access. Fix this by following a failed follow_pte() with
> > > fixup_user_page() and retrying the follow_pte().
> > >
> > > Reviewed-by: Jason Gunthorpe <[email protected]>
> > > Reviewed-by: Matthew Rosato <[email protected]>
> > > Signed-off-by: Niklas Schnelle <[email protected]>
> > > ---
> > > arch/s390/pci/pci_mmio.c | 18 +++++++++++++-----
> > > 1 file changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
> > > index 5398729bfe1b..80c21b1a101c 100644
> > > --- a/arch/s390/pci/pci_mmio.c
> > > +++ b/arch/s390/pci/pci_mmio.c
> > > @@ -170,8 +170,12 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
> > > goto out_unlock_mmap;
> > >
> > > ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> > > - if (ret)
> > > - goto out_unlock_mmap;
> > > + if (ret) {
> > > + fixup_user_fault(current->mm, mmio_addr, FAULT_FLAG_WRITE, NULL);
> > > + ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> > > + if (ret)
> > > + goto out_unlock_mmap;
> > > + }
> > >
> > > io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
> > > (mmio_addr & ~PAGE_MASK));
> > > @@ -305,12 +309,16 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
> > > if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> > > goto out_unlock_mmap;
> > > ret = -EACCES;
> > > - if (!(vma->vm_flags & VM_WRITE))
> > > + if (!(vma->vm_flags & VM_READ))
> > > goto out_unlock_mmap;
> > >
> > > ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> > > - if (ret)
> > > - goto out_unlock_mmap;
> > > + if (ret) {
> > > + fixup_user_fault(current->mm, mmio_addr, 0, NULL);
> > > + ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
> > > + if (ret)
> > > + goto out_unlock_mmap;
> > > + }
> > >
> > > io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
> > > (mmio_addr & ~PAGE_MASK));
> > >
> >
> > Ughh, I think I just stumbled over a problem with this. This is a
> > failing lock held assertion via __is_vma_write_locked() in
> > remap_pfn_range_notrack() but I'm not sure yet what exactly causes this
> >
> > [ 67.338855] ------------[ cut here ]------------
> > [ 67.338865] WARNING: CPU: 15 PID: 2056 at include/linux/rwsem.h:85 remap_pfn_range_notrack+0x596/0x5b0
> > [ 67.338874] Modules linked in: <--- 8< --->
> > [ 67.338931] CPU: 15 PID: 2056 Comm: vfio-test Not tainted 6.10.0-rc1-pci-pfault-00004-g193e3a513cee #5
> > [ 67.338934] Hardware name: IBM 3931 A01 701 (LPAR)
> > [ 67.338935] Krnl PSW : 0704c00180000000 000003e54c9730ea (remap_pfn_range_notrack+0x59a/0x5b0)
> > [ 67.338940] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> > [ 67.338944] Krnl GPRS: 0000000000000100 000003655915fb78 000002d80b9a5928 000003ff7fa00000
> > [ 67.338946] 0004008000000000 0000000000004000 0000000000000711 000003ff7fa04000
> > [ 67.338948] 000002d80c533f00 000002d800000100 000002d81bbe6c28 000002d80b9a5928
> > [ 67.338950] 000003ff7fa00000 000002d80c533f00 000003e54c973120 000003655915fab0
> > [ 67.338956] Krnl Code: 000003e54c9730de: a708ffea lhi %r0,-22
> > 000003e54c9730e2: a7f4fff6 brc 15,000003e54c9730ce
> > #000003e54c9730e6: af000000 mc 0,0
> > >000003e54c9730ea: a7f4fd6e brc 15,000003e54c972bc6
> > 000003e54c9730ee: af000000 mc 0,0
> > 000003e54c9730f2: af000000 mc 0,0
> > 000003e54c9730f6: 0707 bcr 0,%r7
> > 000003e54c9730f8: 0707 bcr 0,%r7
> > [ 67.339025] Call Trace:
> > [ 67.339027] [<000003e54c9730ea>] remap_pfn_range_notrack+0x59a/0x5b0
> > [ 67.339032] [<000003e54c973120>] remap_pfn_range+0x20/0x30
> > [ 67.339035] [<000003e4cce5396c>] vfio_pci_mmap_fault+0xec/0x1d0 [vfio_pci_core]
> > [ 67.339043] [<000003e54c977240>] handle_mm_fault+0x6b0/0x25a0
> > [ 67.339046] [<000003e54c966328>] fixup_user_fault+0x138/0x310
> > [ 67.339048] [<000003e54c63a91c>] __s390x_sys_s390_pci_mmio_read+0x28c/0x3a0
> > [ 67.339051] [<000003e54c5e200a>] do_syscall+0xea/0x120
> > [ 67.339055] [<000003e54d5f9954>] __do_syscall+0x94/0x140
> > [ 67.339059] [<000003e54d611020>] system_call+0x70/0xa0
> > [ 67.339063] Last Breaking-Event-Address:
> > [ 67.339065] [<000003e54c972bc2>] remap_pfn_range_notrack+0x72/0x5b0
> > [ 67.339067] ---[ end trace 0000000000000000 ]---
> >
>
> This has me a bit confused so far as __is_vma_write_locked() checks
> mmap_assert_write_locked(vma->vm_mm) but most other users of
> fixup_user_fault() hold mmap_read_lock() just like this code and
> clearly in the non page fault case we only need the read lock.
>
And it gets weirder, as I could have sworn that I properly tested this
on v1, I retested with v1 (tags/sent/vfio_pci_mmap-v1 on my
git.kernel.org/niks and based on v6.9) and there I don't get the above
warning. I also made sure that it's not caused by my change to
"current->mm" for v2. But I'm also not hitting the checks David moved
into follow_pte() so yeah not sure what's going on here.
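As an aside, the lazy fault-on-first-access behavior the patch has to cope with, and the touch-and-retry fix it applies, can be mimicked entirely from user space. Here is a minimal, purely illustrative sketch (not the patch itself) where mincore(2) stands in for follow_pte() — it only reports pages that are already present — and a plain memory access stands in for fixup_user_fault():

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Analogue of follow_pte(): succeeds only if the page backing addr is
 * already resident, i.e. present in the page tables. addr must be
 * page aligned for mincore(). */
static int lookup_resident(void *addr)
{
	unsigned char vec = 0;

	if (mincore(addr, 1, &vec) != 0)
		return -1;
	return (vec & 1) ? 0 : -1;
}

/* Analogue of the patched syscall path: on lookup failure, touch the
 * page to fault it in (the role fixup_user_fault() plays in the
 * kernel) and retry the lookup once. Returns 0 on success. */
int lookup_with_fixup(volatile char *addr)
{
	if (lookup_resident((void *)addr) == 0)
		return 0;
	*addr = *addr;	/* read-modify-write faults the page in */
	return lookup_resident((void *)addr);
}

/* Maps one fresh anonymous page; it is not resident until first
 * touched, so the plain lookup fails while the fixup-and-retry
 * variant succeeds. Returns 0 if both behave as described. */
int demo(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int plain, fixed;
	char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	plain = lookup_resident(p);	/* fails: page not yet faulted in */
	fixed = lookup_with_fixup(p);	/* succeeds after touching */
	munmap(p, psz);
	return (plain == -1 && fixed == 0) ? 0 : 1;
}
```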
On 11.06.24 15:23, Niklas Schnelle wrote:
> On Tue, 2024-06-11 at 14:08 +0200, Niklas Schnelle wrote:
>> On Tue, 2024-06-11 at 13:21 +0200, Niklas Schnelle wrote:
>>> On Wed, 2024-05-29 at 13:36 +0200, Niklas Schnelle wrote:
>>>> The s390 MMIO syscalls when using the classic PCI instructions do not
>>>> cause a page fault when follow_pte() fails due to the page not being
>>>> present. Besides being a general deficiency this breaks vfio-pci's mmap()
>>>> handling once VFIO_PCI_MMAP gets enabled as this lazily maps on first
>>>> access. Fix this by following a failed follow_pte() with
>>>> fixup_user_fault() and retrying the follow_pte().
>>>>
>>>> Reviewed-by: Jason Gunthorpe <[email protected]>
>>>> Reviewed-by: Matthew Rosato <[email protected]>
>>>> Signed-off-by: Niklas Schnelle <[email protected]>
>>>> ---
>>>> arch/s390/pci/pci_mmio.c | 18 +++++++++++++-----
>>>> 1 file changed, 13 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
>>>> index 5398729bfe1b..80c21b1a101c 100644
>>>> --- a/arch/s390/pci/pci_mmio.c
>>>> +++ b/arch/s390/pci/pci_mmio.c
>>>> @@ -170,8 +170,12 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
>>>> goto out_unlock_mmap;
>>>>
>>>> ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
>>>> - if (ret)
>>>> - goto out_unlock_mmap;
>>>> + if (ret) {
>>>> + fixup_user_fault(current->mm, mmio_addr, FAULT_FLAG_WRITE, NULL);
>>>> + ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
>>>> + if (ret)
>>>> + goto out_unlock_mmap;
>>>> + }
>>>>
>>>> io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
>>>> (mmio_addr & ~PAGE_MASK));
>>>> @@ -305,12 +309,16 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
>>>> if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
>>>> goto out_unlock_mmap;
>>>> ret = -EACCES;
>>>> - if (!(vma->vm_flags & VM_WRITE))
>>>> + if (!(vma->vm_flags & VM_READ))
>>>> goto out_unlock_mmap;
>>>>
>>>> ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
>>>> - if (ret)
>>>> - goto out_unlock_mmap;
>>>> + if (ret) {
>>>> + fixup_user_fault(current->mm, mmio_addr, 0, NULL);
>>>> + ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
>>>> + if (ret)
>>>> + goto out_unlock_mmap;
>>>> + }
>>>>
>>>> io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
>>>> (mmio_addr & ~PAGE_MASK));
>>>>
>>>
>>> Ughh, I think I just stumbled over a problem with this. This is a
>>> failing lock held assertion via __is_vma_write_locked() in
>>> remap_pfn_range_notrack() but I'm not sure yet what exactly causes this
>>>
>>> --- 8< snip 8< ---
>>>
>>
>> This has me a bit confused so far as __is_vma_write_locked() checks
>> mmap_assert_write_locked(vma->vm_mm) but most other users of
>> fixup_user_fault() hold mmap_read_lock() just like this code and
>> clearly in the non page fault case we only need the read lock.
This is likely the
vm_flags_set()->vma_start_write(vma)->__is_vma_write_locked()
which checks mmap_assert_write_locked().
Setting VMA flags would be racy with the mmap lock in read mode.
remap_pfn_range() documents: "this is only safe if the mm semaphore is
held when called." which doesn't spell out if it needs to be held in
write mode (which I think it does) :)
My best guess is: if you are using remap_pfn_range() from a fault
handler (not during mmap time) you are doing something wrong, that's why
you get that report.
vmf_insert_pfn() and friends might be better alternatives, that make
sure that the VMA already received the proper VMA flags at mmap time.
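To make the racy part concrete, here is a purely illustrative user-space simulation (not kernel code) of the interleaving two mmap_read_lock() holders could produce if both performed the non-atomic read-modify-write that a flag update boils down to. The two updates are replayed sequentially, so the lost update is deterministic:

```c
/* Illustrative only: replays the interleaving of two concurrent
 * readers each doing a non-atomic read-modify-write of a vm_flags-like
 * word. One update is silently lost, which is why updating VMA flags
 * requires the mmap lock in write mode (and why vma_start_write()
 * asserts mmap_assert_write_locked()). */
unsigned long simulate_racy_flag_set(void)
{
	unsigned long vm_flags = 0;
	unsigned long a, b;

	a = vm_flags;	/* updater A loads the flags */
	b = vm_flags;	/* updater B loads the flags */
	a |= 0x1;	/* A wants to set bit 0 (think VM_IO)     */
	b |= 0x2;	/* B wants to set bit 1 (think VM_PFNMAP) */
	vm_flags = a;	/* A stores its copy */
	vm_flags = b;	/* B stores last: A's bit is gone */

	return vm_flags;	/* 0x2 instead of the intended 0x3 */
}
```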
>>
>
> And it gets weirder, as I could have sworn that I properly tested this
> on v1, I retested with v1 (tags/sent/vfio_pci_mmap-v1 on my
> git.kernel.org/niks and based on v6.9) and there I don't get the above
> warning. I also made sure that it's not caused by my change to
> "current->mm" for v2. But I'm also not hitting the checks David moved
> into follow_pte() so yeah not sure what's going on here.
You mean the mmap_assert_locked()? Yeah, that only checks if you have it
in read mode, but not in write mode.
--
Cheers,
David / dhildenb
> > >
--- 8< snip 8< ---
> > > > Ughh, I think I just stumbled over a problem with this. This is a
> > > > failing lock held assertion via __is_vma_write_locked() in
> > > > remap_pfn_range_notrack() but I'm not sure yet what exactly causes this
> > > >
> > > > --- 8< snip 8< ---
> > > >
> > >
> > > This has me a bit confused so far as __is_vma_write_locked() checks
> > > mmap_assert_write_locked(vma->vm_mm) but most other users of
> > > fixup_user_fault() hold mmap_read_lock() just like this code and
> > > clearly in the non page fault case we only need the read lock.
>
> This is likely the
> vm_flags_set()->vma_start_write(vma)->__is_vma_write_locked()
Yes
>
> which checks mmap_assert_write_locked().
>
> Setting VMA flags would be racy with the mmap lock in read mode.
>
>
> remap_pfn_range() documents: "this is only safe if the mm semaphore is
> held when called." which doesn't spell out if it needs to be held in
> write mode (which I think it does) :)
Logically this makes sense to me. At the same time it looks like
fixup_user_fault() expects the caller to only hold mmap_read_lock() as
I do here. In there it even retakes mmap_read_lock(). But then wouldn't
any fault handling by its nature need to hold the write lock?
>
>
> My best guess is: if you are using remap_pfn_range() from a fault
> handler (not during mmap time) you are doing something wrong, that's why
> you get that report.
@Alex: I guess so far the vfio_pci_mmap_fault() handler is only ever
triggered by "normal"/"actual" page faults where this isn't a problem?
Or could it be a problem there too?
>
> vmf_insert_pfn() and friends might be better alternatives, that make
> sure that the VMA already received the proper VMA flags at mmap time.
>
> > >
> >
> > And it gets weirder, as I could have sworn that I properly tested this
> > on v1, I retested with v1 (tags/sent/vfio_pci_mmap-v1 on my
> > git.kernel.org/niks and based on v6.9) and there I don't get the above
> > warning. I also made sure that it's not caused by my change to
> > "current->mm" for v2. But I'm also not hitting the checks David moved
> > into follow_pte() so yeah not sure what's going on here.
>
>
> You mean the mmap_assert_locked()? Yeah, that only checks if you have it
> in read mode, but not in write mode.
>
Turns out this part was just me being stupid and not running the old
version with lockdep enabled when it "worked"; this only turned into a
normal warning with commit ba168b52bf8e ("mm: use rwsem assertion
macros for mmap_lock"). Rerunning v1 with lockdep on gave me an
equivalent lockdep splat.
>>
>> which checks mmap_assert_write_locked().
>>
>> Setting VMA flags would be racy with the mmap lock in read mode.
>>
>>
>> remap_pfn_range() documents: "this is only safe if the mm semaphore is
>> held when called." which doesn't spell out if it needs to be held in
>> write mode (which I think it does) :)
>
> Logically this makes sense to me. At the same time it looks like
> fixup_user_fault() expects the caller to only hold mmap_read_lock() as
> I do here. In there it even retakes mmap_read_lock(). But then wouldn't
> any fault handling by its nature need to hold the write lock?
Well, if you're calling remap_pfn_range() right now the expectation is
that we hold it in write mode. :)
Staring at some random users, they all call it from mmap(), where you
hold the mmap lock in write mode.
I wonder why we are not seeing that splat with vfio all of the time?
That mmap lock check was added "recently". In 1c71222e5f23 we started
using vm_flags_set(). That (including the mmap_assert_write_locked())
check was added via bc292ab00f6c almost 1.5 years ago.
Maybe vfio is a bit special and was never really run with lockdep?
>
>>
>>
>> My best guess is: if you are using remap_pfn_range() from a fault
>> handler (not during mmap time) you are doing something wrong, that's why
>> you get that report.
>
> @Alex: I guess so far the vfio_pci_mmap_fault() handler is only ever
> triggered by "normal"/"actual" page faults where this isn't a problem?
> Or could it be a problem there too?
>
I think we should see it there as well, unless I am missing something.
>>
>> vmf_insert_pfn() and friends might be better alternatives, that make
>> sure that the VMA already received the proper VMA flags at mmap time.
>>
There would be ways of silencing that check: for example, making sure at
mmap time that these flags are already set, and skipping the
modification when they are.
But, we'll run into more similar checks in x86 VM_PAT code, where we
would do vm_flags_set(vma, VM_PAT) from track_pfn_remap. Some of that
code really doesn't want to be called concurrently (e.g., "vma->vm_pgoff
= pfn;").
I thought that we silenced some of these warnings in the past using
__vm_flags_mod(). But it sounds hacky.
CCing Suren.
--
Cheers,
David / dhildenb
On Tue, 2024-06-11 at 17:10 +0200, David Hildenbrand wrote:
> > >
> > > which checks mmap_assert_write_locked().
> > >
> > > Setting VMA flags would be racy with the mmap lock in read mode.
> > >
> > >
> > > remap_pfn_range() documents: "this is only safe if the mm semaphore is
> > > held when called." which doesn't spell out if it needs to be held in
> > > write mode (which I think it does) :)
> >
> > Logically this makes sense to me. At the same time it looks like
> > fixup_user_fault() expects the caller to only hold mmap_read_lock() as
> > I do here. In there it even retakes mmap_read_lock(). But then wouldn't
> > any fault handling by its nature need to hold the write lock?
>
> Well, if you're calling remap_pfn_range() right now the expectation is
> that we hold it in write mode. :)
>
> Staring at some random users, they all call it from mmap(), where you
> hold the mmap lock in write mode.
>
>
> I wonder why we are not seeing that splat with vfio all of the time?
>
> That mmap lock check was added "recently". In 1c71222e5f23 we started
> using vm_flags_set(). That (including the mmap_assert_write_locked())
> check was added via bc292ab00f6c almost 1.5 years ago.
>
> Maybe vfio is a bit special and was never really run with lockdep?
>
> >
> > >
> > >
> > > My best guess is: if you are using remap_pfn_range() from a fault
> > > handler (not during mmap time) you are doing something wrong, that's why
> > > you get that report.
> >
> > @Alex: I guess so far the vfio_pci_mmap_fault() handler is only ever
> > triggered by "normal"/"actual" page faults where this isn't a problem?
> > Or could it be a problem there too?
> >
>
> I think we should see it there as well, unless I am missing something.
Well good news for me, bad news for everyone else. I just reproduced
the same problem on my x86_64 workstation. I "ported over" (hacked it
until it compiles) an x86 version of my trivial vfio-pci user-space
test code that mmaps() the BAR 0 of an NVMe and MMIO reads the NVMe
version field at offset 8. On my x86_64 box this leads to the following
splat (still on v6.10-rc1).
[ 555.396773] ------------[ cut here ]------------
[ 555.396774] WARNING: CPU: 3 PID: 1424 at include/linux/rwsem.h:85 remap_pfn_range_notrack+0x625/0x650
[ 555.396778] Modules linked in: vfio_pci <-- 8< -->
[ 555.396877] CPU: 3 PID: 1424 Comm: vfio-test Tainted: G W 6.10.0-rc1-niks-00007-gb19d6d864df1 #4 d09afec01ce27ca8218580af28295f25e2d2ed53
[ 555.396880] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Creator, BIOS P3.40 01/28/2021
[ 555.396881] RIP: 0010:remap_pfn_range_notrack+0x625/0x650
[ 555.396884] Code: a8 00 00 00 75 39 44 89 e0 48 81 c4 b0 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d e9 26 a7 e5 00 cc 0f 0b 41 bc ea ff ff ff eb c9 <0f> 0b 49 8b 47 10 e9 72 fa ff ff e8 8b 56 b5 ff e9 c0 fa ff ff e8
[ 555.396887] RSP: 0000:ffffaf8b04ed3bc0 EFLAGS: 00010246
[ 555.396889] RAX: ffff9ea747cfe300 RBX: 00000000000ee200 RCX: 0000000000000100
[ 555.396890] RDX: 00000000000ee200 RSI: ffff9ea747cfe300 RDI: ffff9ea76db58fd0
[ 555.396892] RBP: 00000000ffffffea R08: 8000000000000035 R09: 0000000000000000
[ 555.396894] R10: ffff9ea76d9bbf40 R11: ffffffff96e5ce50 R12: 0000000000004000
[ 555.396895] R13: 00007f23b988a000 R14: ffff9ea76db58fd0 R15: ffff9ea76db58fd0
[ 555.396897] FS: 00007f23b9561740(0000) GS:ffff9eb66e780000(0000) knlGS:0000000000000000
[ 555.396899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 555.396901] CR2: 00007f23b988a008 CR3: 0000000136bde000 CR4: 0000000000350ef0
[ 555.396903] Call Trace:
[ 555.396904] <TASK>
[ 555.396905] ? __warn+0x18c/0x2a0
[ 555.396908] ? remap_pfn_range_notrack+0x625/0x650
[ 555.396911] ? report_bug+0x1bb/0x270
[ 555.396915] ? handle_bug+0x42/0x70
[ 555.396917] ? exc_invalid_op+0x1a/0x50
[ 555.396920] ? asm_exc_invalid_op+0x1a/0x20
[ 555.396923] ? __pfx_is_ISA_range+0x10/0x10
[ 555.396926] ? remap_pfn_range_notrack+0x625/0x650
[ 555.396929] ? asm_exc_invalid_op+0x1a/0x20
[ 555.396933] ? track_pfn_remap+0x170/0x180
[ 555.396936] remap_pfn_range+0x6f/0xc0
[ 555.396940] vfio_pci_mmap_fault+0xf3/0x1b0 [vfio_pci_core 6df3b7ac5dcecb63cb090734847a65c799a8fef2]
[ 555.396946] __do_fault+0x11b/0x210
[ 555.396949] do_pte_missing+0x239/0x1350
[ 555.396953] handle_mm_fault+0xb10/0x18b0
[ 555.396959] do_user_addr_fault+0x293/0x710
[ 555.396963] exc_page_fault+0x82/0x1c0
[ 555.396966] asm_exc_page_fault+0x26/0x30
[ 555.396968] RIP: 0033:0x55b0ea8bb7ac
[ 555.396972] Code: 00 00 b0 00 e8 e5 f8 ff ff 31 c0 48 83 c4 20 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 89 7d f8 48 8b 45 f8 <8b> 00 89 c0 5d c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48
[ 555.396974] RSP: 002b:00007fff80973530 EFLAGS: 00010202
[ 555.396976] RAX: 00007f23b988a008 RBX: 00007fff80973738 RCX: 00007f23b988a000
[ 555.396978] RDX: 0000000000000001 RSI: 00007fff809735e8 RDI: 00007f23b988a008
[ 555.396979] RBP: 00007fff80973530 R08: 0000000000000005 R09: 0000000000000000
[ 555.396981] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[ 555.396982] R13: 0000000000000000 R14: 00007f23b98c8000 R15: 000055b0ea8bddc0
[ 555.396986] </TASK>
[ 555.396987] ---[ end trace 0000000000000000 ]---
On Tue, 2024-06-11 at 17:10 +0200, David Hildenbrand wrote:
> --- 8< snip 8< ---
Before I forget it, thanks a lot for your incredible help David!
On Tue, 11 Jun 2024 17:37:20 +0200
Niklas Schnelle <[email protected]> wrote:
> --- 8< snip 8< ---
>
> Well good news for me, bad news for everyone else. I just reproduced
> the same problem on my x86_64 workstation. I "ported over" (hacked it
> until it compiles) an x86 version of my trivial vfio-pci user-space
> test code that mmaps() the BAR 0 of an NVMe and MMIO reads the NVMe
> version field at offset 8. On my x86_64 box this leads to the following
> splat (still on v6.10-rc1).
There's already a fix for this queued[1] in my for-linus branch for
v6.10. The problem has indeed existed with lockdep for some time but
only with the recent lockdep changes to generate a warning regardless
of debug kernel settings has it gone from just sketchy to having a fire
under it. There's still an outstanding question of whether we
can/should insert as many pfns as we can during the fault[2] to reduce
the new overhead and hopefully at some point we'll have an even cleaner
option to use huge_fault for pfnmaps, but currently
vmf_insert_pfn_{pmd,pud} don't work with those pfnmaps.
So hopefully this problem disappears on current linux-next, but let me
know if there's still an issue. Thanks,
Alex
[1]https://lore.kernel.org/all/[email protected]/
[2]https://lore.kernel.org/all/[email protected]/
> --- 8< snip 8< ---
On 12.06.24 00:21, Alex Williamson wrote:
> --- 8< snip 8< ---
>
> There's already a fix for this queued[1] in my for-linus branch for
> v6.10. The problem has indeed existed with lockdep for some time but
> only with the recent lockdep changes to generate a warning regardless
> of debug kernel settings has it gone from just sketchy to having a fire
> under it. There's still an outstanding question of whether we
> can/should insert as many pfns as we can during the fault[2] to reduce
> the new overhead and hopefully at some point we'll have an even cleaner
> option to use huge_fault for pfnmaps, but currently
> vmf_insert_pfn_{pmd,pud} don't work with those pfnmaps.
>
> So hopefully this problem disappears on current linux-next, but let me
> know if there's still an issue. Thanks,
I see us now using vmf_insert_pfn(), which should be the right thing to
do. So I suspect this problem should be disappearing.
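Roughly, the vmf_insert_pfn() shape looks like the following untested, kernel-style sketch (my_mmap(), my_fault() and my_base_pfn() are made-up names, not the actual vfio-pci code): the VMA flags are set once at mmap() time, where the mmap lock is held for writing, and the fault handler only inserts PFNs under the read lock.

```c
/* Hypothetical sketch, not the actual vfio-pci implementation. */
static vm_fault_t my_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long pfn = my_base_pfn(vma) + vmf->pgoff;

	/* No vm_flags_set() here, so no write-lock assertion fires;
	 * inserting a single PFN is fine under mmap_read_lock(). */
	return vmf_insert_pfn(vma, vmf->address, pfn);
}

static const struct vm_operations_struct my_vm_ops = {
	.fault = my_fault,
};

static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* mmap() path: caller holds the mmap lock in write mode, so
	 * modifying the VMA flags is safe here. */
	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
	vma->vm_ops = &my_vm_ops;
	return 0;
}
```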
--
Cheers,
David / dhildenb