2021-01-06 16:20:15

by Bharat Bhushan

Subject: vfio-pci: protect remap_pfn_range() from simultaneous calls

Hi Ankur,

We are observing the BUG_ON() below with the latest kernel:

[10011.321645] ------------[ cut here ]------------
[10011.322262] kernel BUG at mm/memory.c:1816!
[10011.323793] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[10011.326108] CPU: 2 PID: 1147 Comm: odp_l2fwd Not tainted 5.4.74-05938-gb9598e49fe61 #15
[10011.328272] Hardware name: Marvell CN106XX board (DT)
[10011.330328] pstate: 80400009 (Nzcv daif +PAN -UAO)
[10011.332402] pc : remap_pfn_range+0x1a4/0x260
[10011.334383] lr : remap_pfn_range+0x14c/0x260
[10011.335911] sp : ffff8000156afc10
[10011.337360] x29: ffff8000156afc10 x28: ffffffdffa240000
[10011.339671] x27: ffff00014a241000 x26: 0000002182000000
[10011.341984] x25: ffff0001489fbe00 x24: 0000002182040000
[10011.344279] x23: 0000002182040000 x22: 0068000000000fc3
[10011.346539] x21: 0000002182040000 x20: ffff000149d70860
[10011.348846] x19: 0000000000000041 x18: 0000000000000000
[10011.351064] x17: 0000000000000000 x16: 0000000000000000
[10011.353304] x15: 0000000000000000 x14: 0000000000000000
[10011.355519] x13: 0000000000000000 x12: 0000000000000000
[10011.357812] x11: 0000000000000000 x10: ffffffdfffe00000
[10011.360136] x9 : 0000000000000000 x8 : 0000000000000000
[10011.362414] x7 : 0000000000000000 x6 : 0000042182000000
[10011.364773] x5 : 0001000000000000 x4 : 0000000000000000
[10011.367103] x3 : ffffffe000328928 x2 : 016800017c240fc3
[10011.369462] x1 : 0000000000000000 x0 : ffffffe000328928
[10011.371694] Call trace:
[10011.373510] remap_pfn_range+0x1a4/0x260
[10011.375386] vfio_pci_mmap_fault+0x9c/0x114
[10011.377346] __do_fault+0x38/0x100
[10011.379253] __handle_mm_fault+0x81c/0xce4
[10011.381247] handle_mm_fault+0xb4/0x17c
[10011.383220] do_page_fault+0x110/0x430
[10011.385188] do_translation_fault+0x80/0x90
[10011.387069] do_mem_abort+0x3c/0xa0
[10011.388852] el0_da+0x20/0x24
[10011.391239] Code: eb1a02ff 54000080 f9400362 b4fffe42 (d4210000)
[10011.393306] ---[ end trace ae8b75b32426d53c ]---
[10011.395140] note: odp_l2fwd[1147] exited with preempt_count 2

This is observed after the patch "vfio-pci: Fault mmaps to enable vma tracking", where the actual mapping is delayed until page fault.
When the same page is accessed at (or around) the same time by threads running on different cores, the page fault for that page is taken on multiple cores simultaneously. One fault handler creates the mapping while a second handler finds that the page-table mapping already exists, which leads to the above kernel BUG_ON().
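For reference, the assertion that fires is in remap_pte_range(), which
remap_pfn_range()/io_remap_pfn_range() use to populate the ptes. A
simplified sketch of that loop (paraphrased from mm/memory.c, not the
verbatim 5.4.74 source) shows why a second remap of an already-populated
pte is fatal:

static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
			   unsigned long addr, unsigned long end,
			   unsigned long pfn, pgprot_t prot)
{
	pte_t *pte;
	spinlock_t *ptl;

	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return -ENOMEM;
	do {
		/* Fires when a concurrent fault already set this pte. */
		BUG_ON(!pte_none(*pte));
		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
		pfn++;
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(pte - 1, ptl);
	return 0;
}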

The article https://lwn.net/Articles/828536/ suggests that you have already faced and fixed this issue:
"- vfio-pci: protect remap_pfn_range() from simultaneous calls (Ankur Arora) [Orabug: 31663628] {CVE-2020-12888} {CVE-2020-12888}"

But I do not see any patch submitted or under review upstream; hopefully I have not missed a discussion. Please let us know whether you have already submitted a fix, are planning to submit one, or whether someone else has fixed the same issue.

Thanks
-Bharat


2021-01-06 18:20:03

by Ankur Arora

Subject: Re: vfio-pci: protect remap_pfn_range() from simultaneous calls

On 2021-01-06 8:17 a.m., Bharat Bhushan wrote:
> Hi Ankur,
>
> We are observing the BUG_ON() below with the latest kernel:
>
> [10011.321645] ------------[ cut here ]------------
> [10011.322262] kernel BUG at mm/memory.c:1816!
> [10011.323793] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [10011.326108] CPU: 2 PID: 1147 Comm: odp_l2fwd Not tainted 5.4.74-05938-gb9598e49fe61 #15
> [10011.328272] Hardware name: Marvell CN106XX board (DT)
> [10011.330328] pstate: 80400009 (Nzcv daif +PAN -UAO)
> [10011.332402] pc : remap_pfn_range+0x1a4/0x260
> [10011.334383] lr : remap_pfn_range+0x14c/0x260
> [10011.335911] sp : ffff8000156afc10
> [10011.337360] x29: ffff8000156afc10 x28: ffffffdffa240000
> [10011.339671] x27: ffff00014a241000 x26: 0000002182000000
> [10011.341984] x25: ffff0001489fbe00 x24: 0000002182040000
> [10011.344279] x23: 0000002182040000 x22: 0068000000000fc3
> [10011.346539] x21: 0000002182040000 x20: ffff000149d70860
> [10011.348846] x19: 0000000000000041 x18: 0000000000000000
> [10011.351064] x17: 0000000000000000 x16: 0000000000000000
> [10011.353304] x15: 0000000000000000 x14: 0000000000000000
> [10011.355519] x13: 0000000000000000 x12: 0000000000000000
> [10011.357812] x11: 0000000000000000 x10: ffffffdfffe00000
> [10011.360136] x9 : 0000000000000000 x8 : 0000000000000000
> [10011.362414] x7 : 0000000000000000 x6 : 0000042182000000
> [10011.364773] x5 : 0001000000000000 x4 : 0000000000000000
> [10011.367103] x3 : ffffffe000328928 x2 : 016800017c240fc3
> [10011.369462] x1 : 0000000000000000 x0 : ffffffe000328928
> [10011.371694] Call trace:
> [10011.373510] remap_pfn_range+0x1a4/0x260
> [10011.375386] vfio_pci_mmap_fault+0x9c/0x114
> [10011.377346] __do_fault+0x38/0x100
> [10011.379253] __handle_mm_fault+0x81c/0xce4
> [10011.381247] handle_mm_fault+0xb4/0x17c
> [10011.383220] do_page_fault+0x110/0x430
> [10011.385188] do_translation_fault+0x80/0x90
> [10011.387069] do_mem_abort+0x3c/0xa0
> [10011.388852] el0_da+0x20/0x24
> [10011.391239] Code: eb1a02ff 54000080 f9400362 b4fffe42 (d4210000)
> [10011.393306] ---[ end trace ae8b75b32426d53c ]---
> [10011.395140] note: odp_l2fwd[1147] exited with preempt_count 2
>
> This is observed after the patch "vfio-pci: Fault mmaps to enable vma tracking", where the actual mapping is delayed until page fault.
> When the same page is accessed at (or around) the same time by threads running on different cores, the page fault for that page is taken on multiple cores simultaneously. One fault handler creates the mapping while a second handler finds that the page-table mapping already exists, which leads to the above kernel BUG_ON().

Yeah, that's what my fix addressed as well.

>
> > The article https://lwn.net/Articles/828536/ suggests that you have already faced and fixed this issue:
> > "- vfio-pci: protect remap_pfn_range() from simultaneous calls (Ankur Arora) [Orabug: 31663628] {CVE-2020-12888} {CVE-2020-12888}"
> >
> > But I do not see any patch submitted or under review upstream; hopefully I have not missed a discussion. Please let us know whether you have already submitted a fix, are planning to submit one, or whether someone else has fixed the same issue.

No, you haven't missed a discussion on this. For upstream this was more of
a theoretical race, so I dallied a bit before sending the patch upstream.

I'll submit a patch soon. Also, would you mind running this failing test
against it before submission?

Thanks
Ankur

>
> Thanks
> -Bharat
>

2021-01-07 05:00:18

by Bharat Bhushan

Subject: RE: [EXT] Re: vfio-pci: protect remap_pfn_range() from simultaneous calls



> -----Original Message-----
> From: Ankur Arora <[email protected]>
> Sent: Wednesday, January 6, 2021 11:44 PM
> To: Bharat Bhushan <[email protected]>; [email protected]
> Cc: [email protected]; Sunil Kovvuri Goutham
> <[email protected]>
> Subject: [EXT] Re: vfio-pci: protect remap_pfn_range() from simultaneous calls
>
> On 2021-01-06 8:17 a.m., Bharat Bhushan wrote:
> > Hi Ankur,
> >
> > We are observing the BUG_ON() below with the latest kernel:
> >
> > [10011.321645] ------------[ cut here ]------------
> > [10011.322262] kernel BUG at mm/memory.c:1816!
> > [10011.323793] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> > [10011.326108] CPU: 2 PID: 1147 Comm: odp_l2fwd Not tainted 5.4.74-05938-gb9598e49fe61 #15
> > [10011.328272] Hardware name: Marvell CN106XX board (DT)
> > [10011.330328] pstate: 80400009 (Nzcv daif +PAN -UAO)
> > [10011.332402] pc : remap_pfn_range+0x1a4/0x260
> > [10011.334383] lr : remap_pfn_range+0x14c/0x260
> > [10011.335911] sp : ffff8000156afc10
> > [10011.337360] x29: ffff8000156afc10 x28: ffffffdffa240000
> > [10011.339671] x27: ffff00014a241000 x26: 0000002182000000
> > [10011.341984] x25: ffff0001489fbe00 x24: 0000002182040000
> > [10011.344279] x23: 0000002182040000 x22: 0068000000000fc3
> > [10011.346539] x21: 0000002182040000 x20: ffff000149d70860
> > [10011.348846] x19: 0000000000000041 x18: 0000000000000000
> > [10011.351064] x17: 0000000000000000 x16: 0000000000000000
> > [10011.353304] x15: 0000000000000000 x14: 0000000000000000
> > [10011.355519] x13: 0000000000000000 x12: 0000000000000000
> > [10011.357812] x11: 0000000000000000 x10: ffffffdfffe00000
> > [10011.360136] x9 : 0000000000000000 x8 : 0000000000000000
> > [10011.362414] x7 : 0000000000000000 x6 : 0000042182000000
> > [10011.364773] x5 : 0001000000000000 x4 : 0000000000000000
> > [10011.367103] x3 : ffffffe000328928 x2 : 016800017c240fc3
> > [10011.369462] x1 : 0000000000000000 x0 : ffffffe000328928
> > [10011.371694] Call trace:
> > [10011.373510] remap_pfn_range+0x1a4/0x260
> > [10011.375386] vfio_pci_mmap_fault+0x9c/0x114
> > [10011.377346] __do_fault+0x38/0x100
> > [10011.379253] __handle_mm_fault+0x81c/0xce4
> > [10011.381247] handle_mm_fault+0xb4/0x17c
> > [10011.383220] do_page_fault+0x110/0x430
> > [10011.385188] do_translation_fault+0x80/0x90
> > [10011.387069] do_mem_abort+0x3c/0xa0
> > [10011.388852] el0_da+0x20/0x24
> > [10011.391239] Code: eb1a02ff 54000080 f9400362 b4fffe42 (d4210000)
> > [10011.393306] ---[ end trace ae8b75b32426d53c ]---
> > [10011.395140] note: odp_l2fwd[1147] exited with preempt_count 2
> >
> > This is observed after the patch "vfio-pci: Fault mmaps to enable vma tracking", where the actual mapping is delayed until page fault.
> > When the same page is accessed at (or around) the same time by threads running on different cores, the page fault for that page is taken on multiple cores simultaneously. One fault handler creates the mapping while a second handler finds that the page-table mapping already exists, which leads to the above kernel BUG_ON().
>
> Yeah, that's what my fix addressed as well.
>
> >
> > The article https://lwn.net/Articles/828536/ suggests that you have already faced and fixed this issue:
> > "- vfio-pci: protect remap_pfn_range() from simultaneous calls (Ankur Arora) [Orabug: 31663628] {CVE-2020-12888} {CVE-2020-12888}"
> >
> > But I do not see any patch submitted or under review upstream; hopefully I have not missed a discussion. Please let us know whether you have already submitted a fix, are planning to submit one, or whether someone else has fixed the same issue.
>
> No, you haven't missed a discussion on this. For upstream this was more of
> a theoretical race, so I dallied a bit before sending the patch upstream.
>
> I'll submit a patch soon. Also, would you mind running this failing test
> against it before submission?

Sure, we will review and test.

Thanks
-Bharat

>
> Thanks
> Ankur
>
> >
> > Thanks
> > -Bharat
> >

2021-01-19 08:54:16

by Bharat Bhushan

Subject: RE: [EXT] Re: vfio-pci: protect remap_pfn_range() from simultaneous calls

Hi Ankur,

> -----Original Message-----
> From: Ankur Arora <[email protected]>
> Sent: Wednesday, January 6, 2021 11:44 PM
> To: Bharat Bhushan <[email protected]>; [email protected]
> Cc: [email protected]; Sunil Kovvuri Goutham
> <[email protected]>
> Subject: [EXT] Re: vfio-pci: protect remap_pfn_range() from simultaneous calls
>
> On 2021-01-06 8:17 a.m., Bharat Bhushan wrote:
> > Hi Ankur,
> >
> > We are observing the BUG_ON() below with the latest kernel:
> >
> > [10011.321645] ------------[ cut here ]------------
> > [10011.322262] kernel BUG at mm/memory.c:1816!
> > [10011.323793] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> > [10011.326108] CPU: 2 PID: 1147 Comm: odp_l2fwd Not tainted 5.4.74-05938-gb9598e49fe61 #15
> > [10011.328272] Hardware name: Marvell CN106XX board (DT)
> > [10011.330328] pstate: 80400009 (Nzcv daif +PAN -UAO)
> > [10011.332402] pc : remap_pfn_range+0x1a4/0x260
> > [10011.334383] lr : remap_pfn_range+0x14c/0x260
> > [10011.335911] sp : ffff8000156afc10
> > [10011.337360] x29: ffff8000156afc10 x28: ffffffdffa240000
> > [10011.339671] x27: ffff00014a241000 x26: 0000002182000000
> > [10011.341984] x25: ffff0001489fbe00 x24: 0000002182040000
> > [10011.344279] x23: 0000002182040000 x22: 0068000000000fc3
> > [10011.346539] x21: 0000002182040000 x20: ffff000149d70860
> > [10011.348846] x19: 0000000000000041 x18: 0000000000000000
> > [10011.351064] x17: 0000000000000000 x16: 0000000000000000
> > [10011.353304] x15: 0000000000000000 x14: 0000000000000000
> > [10011.355519] x13: 0000000000000000 x12: 0000000000000000
> > [10011.357812] x11: 0000000000000000 x10: ffffffdfffe00000
> > [10011.360136] x9 : 0000000000000000 x8 : 0000000000000000
> > [10011.362414] x7 : 0000000000000000 x6 : 0000042182000000
> > [10011.364773] x5 : 0001000000000000 x4 : 0000000000000000
> > [10011.367103] x3 : ffffffe000328928 x2 : 016800017c240fc3
> > [10011.369462] x1 : 0000000000000000 x0 : ffffffe000328928
> > [10011.371694] Call trace:
> > [10011.373510] remap_pfn_range+0x1a4/0x260
> > [10011.375386] vfio_pci_mmap_fault+0x9c/0x114
> > [10011.377346] __do_fault+0x38/0x100
> > [10011.379253] __handle_mm_fault+0x81c/0xce4
> > [10011.381247] handle_mm_fault+0xb4/0x17c
> > [10011.383220] do_page_fault+0x110/0x430
> > [10011.385188] do_translation_fault+0x80/0x90
> > [10011.387069] do_mem_abort+0x3c/0xa0
> > [10011.388852] el0_da+0x20/0x24
> > [10011.391239] Code: eb1a02ff 54000080 f9400362 b4fffe42 (d4210000)
> > [10011.393306] ---[ end trace ae8b75b32426d53c ]---
> > [10011.395140] note: odp_l2fwd[1147] exited with preempt_count 2
> >
> > This is observed after the patch "vfio-pci: Fault mmaps to enable vma tracking", where the actual mapping is delayed until page fault.
> > When the same page is accessed at (or around) the same time by threads running on different cores, the page fault for that page is taken on multiple cores simultaneously. One fault handler creates the mapping while a second handler finds that the page-table mapping already exists, which leads to the above kernel BUG_ON().
>
> Yeah, that's what my fix addressed as well.
>
> >
> > The article https://lwn.net/Articles/828536/ suggests that you have already faced and fixed this issue:
> > "- vfio-pci: protect remap_pfn_range() from simultaneous calls (Ankur Arora) [Orabug: 31663628] {CVE-2020-12888} {CVE-2020-12888}"
> >
> > But I do not see any patch submitted or under review upstream; hopefully I have not missed a discussion. Please let us know whether you have already submitted a fix, are planning to submit one, or whether someone else has fixed the same issue.
>
> No, you haven't missed a discussion on this. For upstream this was more of
> a theoretical race, so I dallied a bit before sending the patch upstream.
>
> I'll submit a patch soon. Also, would you mind running this failing test
> against it before submission?

Checking in: has the fix been sent out for review? Or did I miss something?

Thanks
-Bharat

>
> Thanks
> Ankur
>
> >
> > Thanks
> > -Bharat
> >

2021-01-21 04:46:12

by Ankur Arora

Subject: Re: vfio-pci: protect remap_pfn_range() from simultaneous calls

Hi Bharat,

So I don't have a final patch for you, but can you test the one below
the scissors mark? (The patch is correct, but I'm not happy that it
introduces a new lock.)

Ankur

-- >8 --

Date: Thu, 21 Jan 2021 00:04:36 +0000
Subject: vfio-pci: protect io_remap_pfn_range() from simultaneous calls

A fix for CVE-2020-12888 changed the mmap to be dynamic so it gets
zapped out and faulted back in based on the state of the PCI_COMMAND
register.

The original code flow was:

vfio_iommu_type1::vfio_device_mmap()
  vfio_pci::vfio_pci_mmap()
    remap_pfn_range(vma->vm_start .. vma->vm_end);

vfio_iommu_type1::vfio_iommu_type1_ioctl(VFIO_PIN_MAP_DMA)
  get_user_pages_remote()
  iommu_map()

Which became:

vfio_iommu_type1::vfio_device_mmap()
  vfio_pci::vfio_pci_mmap()
    /* Setup vma->vm_ops */

vfio_iommu_type1::vfio_iommu_type1_ioctl(VFIO_PIN_MAP_DMA)
  get_user_pages_remote()
    follow_fault_pfn(vma, vaddr); /* For missing pages */
      fixup_user_fault()
        handle_mm_fault()
          vfio_pci::vfio_pci_mmap_fault()
            io_remap_pfn_range(vma->vm_start .. vma->vm_end);
  iommu_map()

With this, simultaneous calls to VFIO_PIN_MAP_DMA for an overlapping
region would mean potentially multiple calls to io_remap_pfn_range()
-- each of which tries to remap the full extent.

This ends up hitting BUG_ON(!pte_none(*pte)) in remap_pte_range()
because we are mapping an already mapped pte.

The fix is to elide calls to io_remap_pfn_range() if the VMA is already
mapped.

Fixes: abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on disabled memory")

Signed-off-by: Ankur Arora <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 48 ++++++++++++++++++++++++++++++++++---
drivers/vfio/pci/vfio_pci_private.h | 2 ++
2 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 706de3ef94bb..db7a2a716f51 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -64,6 +64,11 @@ static bool disable_denylist;
module_param(disable_denylist, bool, 0444);
MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");

+struct vdev_vma_priv {
+ struct vfio_pci_device *vdev;
+ bool vma_mapped;
+};
+
static inline bool vfio_vga_disabled(void)
{
#ifdef CONFIG_VFIO_PCI_VGA
@@ -1527,15 +1532,20 @@ static int vfio_pci_zap_and_vma_lock(struct vfio_pci_device *vdev, bool try)
list_for_each_entry_safe(mmap_vma, tmp,
&vdev->vma_list, vma_next) {
struct vm_area_struct *vma = mmap_vma->vma;
+ struct vdev_vma_priv *p;

if (vma->vm_mm != mm)
continue;

list_del(&mmap_vma->vma_next);
kfree(mmap_vma);
+ p = vma->vm_private_data;

+ mutex_lock(&vdev->map_lock);
zap_vma_ptes(vma, vma->vm_start,
vma->vm_end - vma->vm_start);
+ p->vma_mapped = false;
+ mutex_unlock(&vdev->map_lock);
}
mutex_unlock(&vdev->vma_lock);
mmap_read_unlock(mm);
@@ -1591,12 +1601,19 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
*/
static void vfio_pci_mmap_open(struct vm_area_struct *vma)
{
+ struct vdev_vma_priv *p = vma->vm_private_data;
+ struct vfio_pci_device *vdev = p->vdev;
+
+ mutex_lock(&vdev->map_lock);
+ p->vma_mapped = false;
zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+ mutex_unlock(&vdev->map_lock);
}

static void vfio_pci_mmap_close(struct vm_area_struct *vma)
{
- struct vfio_pci_device *vdev = vma->vm_private_data;
+ struct vdev_vma_priv *p = vma->vm_private_data;
+ struct vfio_pci_device *vdev = p->vdev;
struct vfio_pci_mmap_vma *mmap_vma;

mutex_lock(&vdev->vma_lock);
@@ -1604,6 +1621,7 @@ static void vfio_pci_mmap_close(struct vm_area_struct *vma)
if (mmap_vma->vma == vma) {
list_del(&mmap_vma->vma_next);
kfree(mmap_vma);
+ kfree(p);
break;
}
}
@@ -1613,7 +1631,8 @@ static void vfio_pci_mmap_close(struct vm_area_struct *vma)
static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- struct vfio_pci_device *vdev = vma->vm_private_data;
+ struct vdev_vma_priv *p = vma->vm_private_data;
+ struct vfio_pci_device *vdev = p->vdev;
vm_fault_t ret = VM_FAULT_NOPAGE;

mutex_lock(&vdev->vma_lock);
@@ -1633,10 +1652,24 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)

mutex_unlock(&vdev->vma_lock);

+ /*
+ * The vdev->map_lock in vfio_pci_zap_and_vma_lock() nests
+ * inside the vdev->vma_lock but doesn't depend on that for
+ * protection of the VMA.
+ * So take vdev->map_lock after releasing vdev->vma_lock.
+ */
+ mutex_lock(&vdev->map_lock);
+ if (p->vma_mapped)
+ goto unlock_out;
+
if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
vma->vm_end - vma->vm_start, vma->vm_page_prot))
ret = VM_FAULT_SIGBUS;
+ else
+ p->vma_mapped = true;

+unlock_out:
+ mutex_unlock(&vdev->map_lock);
up_out:
up_read(&vdev->memory_lock);
return ret;
@@ -1654,6 +1687,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
struct pci_dev *pdev = vdev->pdev;
unsigned int index;
u64 phys_len, req_len, pgoff, req_start;
+ struct vdev_vma_priv *priv;
int ret;

index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
@@ -1702,7 +1736,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
}
}

- vma->vm_private_data = vdev;
+ priv = kzalloc(sizeof(struct vdev_vma_priv), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->vdev = vdev;
+ priv->vma_mapped = false;
+
+ vma->vm_private_data = priv;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;

@@ -1967,6 +2008,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
INIT_LIST_HEAD(&vdev->dummy_resources_list);
INIT_LIST_HEAD(&vdev->ioeventfds_list);
mutex_init(&vdev->vma_lock);
+ mutex_init(&vdev->map_lock);
INIT_LIST_HEAD(&vdev->vma_list);
init_rwsem(&vdev->memory_lock);

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5c90e560c5c7..f3010c49b06c 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -142,6 +142,8 @@ struct vfio_pci_device {
struct mutex vma_lock;
struct list_head vma_list;
struct rw_semaphore memory_lock;
+ /* Protects VMA against simultaneous remaps. */
+ struct mutex map_lock;
};

#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
--
2.9.3

2021-02-26 00:59:36

by Ankur Arora

Subject: Re: vfio-pci: protect remap_pfn_range() from simultaneous calls

Hi Bharat,

Can you test the patch below to see if it works for you?

Also, could you add some more detail to your earlier description of the bug?
In particular, AFAICS you are using ODP (-DPDK?) with multiple
threads touching this region. From your stack, it looks like the
fault was user-space generated, and I'm guessing you were not
using the VFIO_IOMMU_MAP_DMA.

Ankur

-- >8 --

Subject: [PATCH] vfio-pci: protect io_remap_pfn_range() from simultaneous calls

vfio_pci_mmap_fault() maps the complete VMA on fault. With concurrent
faults, this would result in multiple calls to io_remap_pfn_range(),
where it would hit a BUG_ON(!pte_none(*pte)) in remap_pte_range().
(It would also link the same VMA multiple times in vdev->vma_list
but given the BUG_ON that is less serious.)

Normally, however, this won't happen -- at least with vfio_iommu_type1 --
the VFIO_IOMMU_MAP_DMA path is protected by iommu->lock.

If, however, we are using some kind of parallelization mechanism like
this one with ktask under discussion [1], we would hit this.
Even if we were doing this serially, given that vfio-pci remaps a larger
extent than strictly necessary it should internally enforce coherence of
its data structures.

Handle this by using the VMA's presence in the vdev->vma_list as
indicative of a fully mapped VMA and returning success early to
all but the first VMA fault. Note that this is clearly optimistic given
that the mapping is ongoing, and might mean that the caller sees
more faults until the remap is done.

[1] https://lore.kernel.org/linux-mm/[email protected]/

Signed-off-by: Ankur Arora <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b44578..b9f509863db1 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
{
struct vfio_pci_mmap_vma *mmap_vma;

+ list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+ if (mmap_vma->vma == vma)
+ return 1;
+ }
+
mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
if (!mmap_vma)
return -ENOMEM;
@@ -1613,6 +1618,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
struct vfio_pci_device *vdev = vma->vm_private_data;
vm_fault_t ret = VM_FAULT_NOPAGE;
+ int vma_present;

mutex_lock(&vdev->vma_lock);
down_read(&vdev->memory_lock);
@@ -1623,7 +1629,21 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
goto up_out;
}

- if (__vfio_pci_add_vma(vdev, vma)) {
+ /*
+ * __vfio_pci_add_vma() either adds the vma to the vdev->vma_list
+ * (vma_present == 0), or indicates that the vma is already present
+ * on the list (vma_present == 1).
+ *
+ * Overload the meaning of this flag to also imply that the vma is
+ * fully mapped. This allows us to serialize the mapping -- ensuring
+ * that simultaneous faults will not both try to call
+ * io_remap_pfn_range().
+ *
+ * However, this might mean that callers to which we returned success
+ * optimistically will see more faults until the remap is complete.
+ */
+ vma_present = __vfio_pci_add_vma(vdev, vma);
+ if (vma_present < 0) {
ret = VM_FAULT_OOM;
mutex_unlock(&vdev->vma_lock);
goto up_out;
@@ -1631,6 +1651,9 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)

mutex_unlock(&vdev->vma_lock);

+ if (vma_present)
+ goto up_out;
+
if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
vma->vm_end - vma->vm_start, vma->vm_page_prot))
ret = VM_FAULT_SIGBUS;
--
2.29.2

2021-03-04 06:23:54

by Bharat Bhushan

Subject: RE: [EXT] Re: vfio-pci: protect remap_pfn_range() from simultaneous calls

Hi Ankur,

> -----Original Message-----
> From: Ankur Arora <[email protected]>
> Sent: Friday, February 26, 2021 6:24 AM
> To: Bharat Bhushan <[email protected]>
> Cc: [email protected]; [email protected]; linux-
> [email protected]; Sunil Kovvuri Goutham <[email protected]>;
> [email protected]
> Subject: [EXT] Re: vfio-pci: protect remap_pfn_range() from simultaneous calls
>
> Hi Bharat,
>
> Can you test the patch below to see if it works for you?

Sorry for the late reply, I actually missed this email.

Reproducibility of the issue was low in my test scenario: about one out of ~15 runs. I ran it multiple times, including overnight, and observed no issues.

>
> Also, could you add some more detail to your earlier description of the bug?

Our test case runs an ODP multi-threaded application, where the parent process maps the region (yes, it uses MAP_DMA) and the child processes then access the same region. As a workaround, we accessed the region once from the parent process before creating the other accessor threads, and it worked as expected.
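For illustration, a minimal sketch of that workaround under stated
assumptions -- bar_fd, BAR_SIZE, and poll_bar are hypothetical names
standing in for the actual ODP application's region and accessors:

/*
 * The parent faults in every page of the shared mapping serially, so
 * vfio_pci_mmap_fault() has completed before any concurrent accessor
 * exists. bar_fd is assumed to be an already-opened region fd.
 */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR_SIZE (64 * 1024)		/* illustrative region size */

static void *poll_bar(void *p)		/* concurrent accessor thread */
{
	volatile uint32_t *reg = p;
	return (void *)(uintptr_t)*reg;
}

int map_and_prefault(int bar_fd)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	pthread_t t1, t2;
	uint8_t *bar;

	bar = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, bar_fd, 0);
	if (bar == MAP_FAILED)
		return -1;

	/* Touch each page once, single-threaded, to pre-create the ptes. */
	for (size_t off = 0; off < BAR_SIZE; off += pagesz)
		(void)*(volatile uint8_t *)(bar + off);

	/* Only now create the threads that would otherwise race. */
	pthread_create(&t1, NULL, poll_bar, bar);
	pthread_create(&t2, NULL, poll_bar, bar);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}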

Thanks
-Bharat

> In particular, AFAICS you are using ODP (-DPDK?) with multiple
> threads touching this region. From your stack, it looks like the
> fault was user-space generated, and I'm guessing you were not
> using the VFIO_IOMMU_MAP_DMA.
>
> Ankur
>
> -- >8 --
>
> Subject: [PATCH] vfio-pci: protect io_remap_pfn_range() from simultaneous calls
>
> vfio_pci_mmap_fault() maps the complete VMA on fault. With concurrent
> faults, this would result in multiple calls to io_remap_pfn_range(),
> where it would hit a BUG_ON(!pte_none(*pte)) in remap_pte_range().
> (It would also link the same VMA multiple times in vdev->vma_list
> but given the BUG_ON that is less serious.)
>
> Normally, however, this won't happen -- at least with vfio_iommu_type1 --
> the VFIO_IOMMU_MAP_DMA path is protected by iommu->lock.
>
> If, however, we are using some kind of parallelization mechanism like
> this one with ktask under discussion [1], we would hit this.
> Even if we were doing this serially, given that vfio-pci remaps a larger
> extent than strictly necessary it should internally enforce coherence of
> its data structures.
>
> Handle this by using the VMA's presence in the vdev->vma_list as
> indicative of a fully mapped VMA and returning success early to
> all but the first VMA fault. Note that this is clearly optimistic given
> that the mapping is ongoing, and might mean that the caller sees
> more faults until the remap is done.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci.c | 25 ++++++++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 65e7e6b44578..b9f509863db1 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
> {
> struct vfio_pci_mmap_vma *mmap_vma;
>
> + list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> + if (mmap_vma->vma == vma)
> + return 1;
> + }
> +
> mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> if (!mmap_vma)
> return -ENOMEM;
> @@ -1613,6 +1618,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> struct vm_area_struct *vma = vmf->vma;
> struct vfio_pci_device *vdev = vma->vm_private_data;
> vm_fault_t ret = VM_FAULT_NOPAGE;
> + int vma_present;
>
> mutex_lock(&vdev->vma_lock);
> down_read(&vdev->memory_lock);
> @@ -1623,7 +1629,21 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> goto up_out;
> }
>
> - if (__vfio_pci_add_vma(vdev, vma)) {
> + /*
> + * __vfio_pci_add_vma() either adds the vma to the vdev->vma_list
> + * (vma_present == 0), or indicates that the vma is already present
> + * on the list (vma_present == 1).
> + *
> + * Overload the meaning of this flag to also imply that the vma is
> + * fully mapped. This allows us to serialize the mapping -- ensuring
> + * that simultaneous faults will not both try to call
> + * io_remap_pfn_range().
> + *
> + * However, this might mean that callers to which we returned success
> + * optimistically will see more faults until the remap is complete.
> + */
> + vma_present = __vfio_pci_add_vma(vdev, vma);
> + if (vma_present < 0) {
> ret = VM_FAULT_OOM;
> mutex_unlock(&vdev->vma_lock);
> goto up_out;
> @@ -1631,6 +1651,9 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>
> mutex_unlock(&vdev->vma_lock);
>
> + if (vma_present)
> + goto up_out;
> +
> if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> vma->vm_end - vma->vm_start, vma->vm_page_prot))
> ret = VM_FAULT_SIGBUS;
> --
> 2.29.2

2021-03-08 08:46:46

by Ankur Arora

Subject: Re: [EXT] Re: vfio-pci: protect remap_pfn_range() from simultaneous calls

On 2021-03-02 4:47 a.m., Bharat Bhushan wrote:
> Hi Ankur,
>
>> -----Original Message-----
>> From: Ankur Arora <[email protected]>
>> Sent: Friday, February 26, 2021 6:24 AM
>> To: Bharat Bhushan <[email protected]>
>> Cc: [email protected]; [email protected]; linux-
>> [email protected]; Sunil Kovvuri Goutham <[email protected]>;
>> [email protected]
>> Subject: [EXT] Re: vfio-pci: protect remap_pfn_range() from simultaneous calls
>>
>> Hi Bharat,
>>
>> Can you test the patch below to see if it works for you?
>
> Sorry for the late reply, I actually missed this email.
>
> Reproducibility of the issue was low in my test scenario: about one out of ~15 runs. I ran it multiple times, including overnight, and observed no issues.

Awesome. Would you mind giving me your Tested-by for the
patch?

>> Also, could you add some more detail to your earlier description of the bug?
>
> Our test case runs an ODP multi-threaded application, where the parent process maps the region (yes, it uses MAP_DMA) and the child processes then access the same region. As a workaround, we accessed the region once from the parent process before creating the other accessor threads, and it worked as expected.

Thanks for the detail. So if the child processes start early, they
might fault while the VFIO_IOMMU_MAP_DMA is still going on. And since
they only acquire the mmap_lock in read mode, both paths would end up
calling io_remap_pfn_range() via the fault handler.
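To make that interleaving concrete, a sketch of the two racing paths
(call names per the flow in the patch description; the timing and
alignment are illustrative):

CPU 0: VFIO_IOMMU_MAP_DMA (parent)        CPU 1: load from mapping (child)
----------------------------------        --------------------------------
mmap_lock taken for read                  mmap_lock taken for read
fixup_user_fault()                        handle_mm_fault()
  handle_mm_fault()                         vfio_pci_mmap_fault()
    vfio_pci_mmap_fault()                     io_remap_pfn_range()
      io_remap_pfn_range()                      remap_pte_range()
        /* populates the ptes */                  BUG_ON(!pte_none(*pte))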

Thanks
Ankur

>
> Thanks
> -Bharat
>
>> In particular, AFAICS you are using ODP (-DPDK?) with multiple
>> threads touching this region. From your stack, it looks like the
>> fault was user-space generated, and I'm guessing you were not
>> using the VFIO_IOMMU_MAP_DMA.
>>
>> Ankur
>>
>> -- >8 --
>>
>> Subject: [PATCH] vfio-pci: protect io_remap_pfn_range() from simultaneous calls
>>
>> vfio_pci_mmap_fault() maps the complete VMA on fault. With concurrent
>> faults, this would result in multiple calls to io_remap_pfn_range(),
>> where it would hit a BUG_ON(!pte_none(*pte)) in remap_pte_range().
>> (It would also link the same VMA multiple times in vdev->vma_list
>> but given the BUG_ON that is less serious.)
>>
>> Normally, however, this won't happen -- at least with vfio_iommu_type1 --
>> the VFIO_IOMMU_MAP_DMA path is protected by iommu->lock.
>>
>> If, however, we are using some kind of parallelization mechanism like
>> this one with ktask under discussion [1], we would hit this.
>> Even if we were doing this serially, given that vfio-pci remaps a larger
>> extent than strictly necessary it should internally enforce coherence of
>> its data structures.
>>
>> Handle this by using the VMA's presence in the vdev->vma_list as
>> indicative of a fully mapped VMA and returning success early to
>> all but the first VMA fault. Note that this is clearly optimistic given
>> that the mapping is ongoing, and might mean that the caller sees
>> more faults until the remap is done.
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> drivers/vfio/pci/vfio_pci.c | 25 ++++++++++++++++++++++++-
>> 1 file changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 65e7e6b44578..b9f509863db1 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
>> {
>> struct vfio_pci_mmap_vma *mmap_vma;
>>
>> + list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
>> + if (mmap_vma->vma == vma)
>> + return 1;
>> + }
>> +
>> mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
>> if (!mmap_vma)
>> return -ENOMEM;
>> @@ -1613,6 +1618,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>> struct vm_area_struct *vma = vmf->vma;
>> struct vfio_pci_device *vdev = vma->vm_private_data;
>> vm_fault_t ret = VM_FAULT_NOPAGE;
>> + int vma_present;
>>
>> mutex_lock(&vdev->vma_lock);
>> down_read(&vdev->memory_lock);
>> @@ -1623,7 +1629,21 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>> goto up_out;
>> }
>>
>> - if (__vfio_pci_add_vma(vdev, vma)) {
>> + /*
>> + * __vfio_pci_add_vma() either adds the vma to the vdev->vma_list
>> + * (vma_present == 0), or indicates that the vma is already present
>> + * on the list (vma_present == 1).
>> + *
>> + * Overload the meaning of this flag to also imply that the vma is
>> + * fully mapped. This allows us to serialize the mapping -- ensuring
>> + * that simultaneous faults will not both try to call
>> + * io_remap_pfn_range().
>> + *
>> + * However, this might mean that callers to which we returned success
>> + * optimistically will see more faults until the remap is complete.
>> + */
>> + vma_present = __vfio_pci_add_vma(vdev, vma);
>> + if (vma_present < 0) {
>> ret = VM_FAULT_OOM;
>> mutex_unlock(&vdev->vma_lock);
>> goto up_out;
>> @@ -1631,6 +1651,9 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>>
>> mutex_unlock(&vdev->vma_lock);
>>
>> + if (vma_present)
>> + goto up_out;
>> +
>> if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
>> vma->vm_end - vma->vm_start, vma->vm_page_prot))
>> ret = VM_FAULT_SIGBUS;
>> --
>> 2.29.2
>