2020-05-05 21:56:58

by Alex Williamson

[permalink] [raw]
Subject: [PATCH v2 0/3] vfio-pci: Block user access to disabled device MMIO

v2:

Locking in 3/ is substantially changed to avoid the retry scenario
within the fault handler, therefore a caller who does not allow retry
will no longer receive a SIGBUS on contention. IOMMU invalidations
are still not included here, I expect that will be a future follow-on
change as we're not fundamentally changing that issue in this series.
The 'add to vma list only on fault' behavior is also still included
here, per the discussion I think it's still a valid approach and has
some advantages, particularly in a VM scenario where we potentially
defer the mapping until the MMIO BAR is actually DMA mapped into the
VM address space (or the guest driver actually accesses the device
if that DMA mapping is eliminated at some point). Further discussion
and review appreciated. Thanks,

Alex

v1:

Add tracking of the device memory enable bit and block/fault accesses
to device MMIO space while disabled. This provides synchronous fault
handling for CPU accesses to the device and prevents the user from
triggering platform level error handling present on some systems.
Device reset and MSI-X vector table accesses are also included such
that access is blocked across reset and vector table accesses do not
depend on the user configuration of the device.

This is based on the vfio for-linus branch currently in next, making
use of follow_pfn() in vaddr_get_pfn() and therefore requiring patch
1/ to force the user fault in the case that a PFNMAP vma might be
DMA mapped before user access. Further PFNMAP iommu invalidation
tracking is not yet included here.

As noted in the comments, I'm copying quite a bit of the logic from
rdma code for performing the zap_vma_ptes() calls and I'm also
attempting to resolve lock ordering issues in the fault handler to
lockdep's satisfaction. I appreciate extra eyes on these sections in
particular.

I expect this to be functionally equivalent for any well behaved
userspace driver, but obviously there is a potential for the user to
get -EIO or SIGBUS on device access. The device is provided to the
user enabled and device resets will restore the command register, so
by my evaluation a user would need to explicitly disable the memory
enable bit to trigger these faults. We could potentially remap vmas
to a zero page rather than SIGBUS if we experience regressions, but
without known code requiring that, SIGBUS seems the appropriate
response to this condition. Thanks,

Alex

---

Alex Williamson (3):
vfio/type1: Support faulting PFNMAP vmas
vfio-pci: Fault mmaps to enable vma tracking
vfio-pci: Invalidate mmaps and block MMIO access on disabled memory


drivers/vfio/pci/vfio_pci.c | 321 +++++++++++++++++++++++++++++++++--
drivers/vfio/pci/vfio_pci_config.c | 36 +++-
drivers/vfio/pci/vfio_pci_intrs.c | 18 ++
drivers/vfio/pci/vfio_pci_private.h | 12 +
drivers/vfio/pci/vfio_pci_rdwr.c | 12 +
drivers/vfio/vfio_iommu_type1.c | 36 ++++
6 files changed, 405 insertions(+), 30 deletions(-)


2020-05-05 21:56:59

by Alex Williamson

[permalink] [raw]
Subject: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
the range being faulted into the vma. Add support to manually provide
that, in the same way as done on KVM with hva_to_pfn_remapped().

Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++++---
1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index cc1d64765ce7..4a4cb7cd86b2 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
return 0;
}

+static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
+ unsigned long vaddr, unsigned long *pfn,
+ bool write_fault)
+{
+ int ret;
+
+ ret = follow_pfn(vma, vaddr, pfn);
+ if (ret) {
+ bool unlocked = false;
+
+ ret = fixup_user_fault(NULL, mm, vaddr,
+ FAULT_FLAG_REMOTE |
+ (write_fault ? FAULT_FLAG_WRITE : 0),
+ &unlocked);
+ if (unlocked)
+ return -EAGAIN;
+
+ if (ret)
+ return ret;
+
+ ret = follow_pfn(vma, vaddr, pfn);
+ }
+
+ return ret;
+}
+
static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
int prot, unsigned long *pfn)
{
@@ -339,12 +365,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,

vaddr = untagged_addr(vaddr);

+retry:
vma = find_vma_intersection(mm, vaddr, vaddr + 1);

if (vma && vma->vm_flags & VM_PFNMAP) {
- if (!follow_pfn(vma, vaddr, pfn) &&
- is_invalid_reserved_pfn(*pfn))
- ret = 0;
+ ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
+ if (ret == -EAGAIN)
+ goto retry;
+
+ if (!ret && !is_invalid_reserved_pfn(*pfn))
+ ret = -EFAULT;
}
done:
up_read(&mm->mmap_sem);

2020-05-05 21:57:19

by Alex Williamson

[permalink] [raw]
Subject: [PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking

Rather than calling remap_pfn_range() when a region is mmap'd, setup
a vm_ops handler to support dynamic faulting of the range on access.
This allows us to manage a list of vmas actively mapping the area that
we can later use to invalidate those mappings. The open callback
invalidates the vma range so that all tracking is inserted in the
fault handler and removed in the close handler.

Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 76 ++++++++++++++++++++++++++++++++++-
drivers/vfio/pci/vfio_pci_private.h | 7 +++
2 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6c6b37b5c04e..66a545a01f8f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1299,6 +1299,70 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
}

+static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
+ struct vm_area_struct *vma)
+{
+ struct vfio_pci_mmap_vma *mmap_vma;
+
+ mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
+ if (!mmap_vma)
+ return -ENOMEM;
+
+ mmap_vma->vma = vma;
+
+ mutex_lock(&vdev->vma_lock);
+ list_add(&mmap_vma->vma_next, &vdev->vma_list);
+ mutex_unlock(&vdev->vma_lock);
+
+ return 0;
+}
+
+/*
+ * Zap mmaps on open so that we can fault them in on access and therefore
+ * our vma_list only tracks mappings accessed since last zap.
+ */
+static void vfio_pci_mmap_open(struct vm_area_struct *vma)
+{
+ zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+}
+
+static void vfio_pci_mmap_close(struct vm_area_struct *vma)
+{
+ struct vfio_pci_device *vdev = vma->vm_private_data;
+ struct vfio_pci_mmap_vma *mmap_vma;
+
+ mutex_lock(&vdev->vma_lock);
+ list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+ if (mmap_vma->vma == vma) {
+ list_del(&mmap_vma->vma_next);
+ kfree(mmap_vma);
+ break;
+ }
+ }
+ mutex_unlock(&vdev->vma_lock);
+}
+
+static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct vfio_pci_device *vdev = vma->vm_private_data;
+
+ if (vfio_pci_add_vma(vdev, vma))
+ return VM_FAULT_OOM;
+
+ if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+ vma->vm_end - vma->vm_start, vma->vm_page_prot))
+ return VM_FAULT_SIGBUS;
+
+ return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vfio_pci_mmap_ops = {
+ .open = vfio_pci_mmap_open,
+ .close = vfio_pci_mmap_close,
+ .fault = vfio_pci_mmap_fault,
+};
+
static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
{
struct vfio_pci_device *vdev = device_data;
@@ -1357,8 +1421,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;

- return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
- req_len, vma->vm_page_prot);
+ /*
+ * See remap_pfn_range(), called from vfio_pci_fault() but we can't
+ * change vm_flags within the fault handler. Set them now.
+ */
+ vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+ vma->vm_ops = &vfio_pci_mmap_ops;
+
+ return 0;
}

static void vfio_pci_request(void *device_data, unsigned int count)
@@ -1608,6 +1678,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
spin_lock_init(&vdev->irqlock);
mutex_init(&vdev->ioeventfds_lock);
INIT_LIST_HEAD(&vdev->ioeventfds_list);
+ mutex_init(&vdev->vma_lock);
+ INIT_LIST_HEAD(&vdev->vma_list);

ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret)
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 36ec69081ecd..9b25f9f6ce1d 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -92,6 +92,11 @@ struct vfio_pci_vf_token {
int users;
};

+struct vfio_pci_mmap_vma {
+ struct vm_area_struct *vma;
+ struct list_head vma_next;
+};
+
struct vfio_pci_device {
struct pci_dev *pdev;
void __iomem *barmap[PCI_STD_NUM_BARS];
@@ -132,6 +137,8 @@ struct vfio_pci_device {
struct list_head ioeventfds_list;
struct vfio_pci_vf_token *vf_token;
struct notifier_block nb;
+ struct mutex vma_lock;
+ struct list_head vma_list;
};

#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)

2020-05-05 21:59:06

by Alex Williamson

[permalink] [raw]
Subject: [PATCH v2 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

Accessing the disabled memory space of a PCI device would typically
result in a master abort response on conventional PCI, or an
unsupported request on PCI express. The user would generally see
these as a -1 response for the read return data and the write would be
silently discarded, possibly with an uncorrected, non-fatal AER error
triggered on the host. Some systems however take it upon themselves
to bring down the entire system when they see something that might
indicate a loss of data, such as this discarded write to a disabled
memory space.

To avoid this, we want to try to block the user from accessing memory
spaces while they're disabled. We start with a semaphore around the
memory enable bit, where writers modify the memory enable state and
must be serialized, while readers make use of the memory region and
can access in parallel. Writers include both direct manipulation via
the command register, as well as any reset path where the internal
mechanics of the reset may both explicitly and implicitly disable
memory access, and manipulation of the MSI-X configuration, where the
MSI-X vector table resides in MMIO space of the device. Readers
include the read and write file ops to access the vfio device fd
offsets as well as memory mapped access. In the latter case, we make
use of our new vma list support to zap, or invalidate, those memory
mappings in order to force them to be faulted back in on access.

Our semaphore usage will stall user access to MMIO spaces across
internal operations like reset, but the user might experience new
behavior when trying to access the MMIO space while disabled via the
PCI command register. Access via read or write while disabled will
return -EIO and access via memory maps will result in a SIGBUS. This
is expected to be compatible with known use cases and potentially
provides better error handling capabilities than present in the
hardware, while avoiding the more readily accessible and severe
platform error responses that might otherwise occur.

Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 263 +++++++++++++++++++++++++++++++----
drivers/vfio/pci/vfio_pci_config.c | 36 ++++-
drivers/vfio/pci/vfio_pci_intrs.c | 18 ++
drivers/vfio/pci/vfio_pci_private.h | 5 +
drivers/vfio/pci/vfio_pci_rdwr.c | 12 ++
5 files changed, 300 insertions(+), 34 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 66a545a01f8f..49ae9faa6099 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -26,6 +26,7 @@
#include <linux/vfio.h>
#include <linux/vgaarb.h>
#include <linux/nospec.h>
+#include <linux/sched/mm.h>

#include "vfio_pci_private.h"

@@ -184,6 +185,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)

static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
static void vfio_pci_disable(struct vfio_pci_device *vdev);
+static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data);

/*
* INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -736,6 +738,12 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
return 0;
}

+struct vfio_devices {
+ struct vfio_device **devices;
+ int cur_index;
+ int max_index;
+};
+
static long vfio_pci_ioctl(void *device_data,
unsigned int cmd, unsigned long arg)
{
@@ -984,8 +992,16 @@ static long vfio_pci_ioctl(void *device_data,
return ret;

} else if (cmd == VFIO_DEVICE_RESET) {
- return vdev->reset_works ?
- pci_try_reset_function(vdev->pdev) : -EINVAL;
+ int ret;
+
+ if (!vdev->reset_works)
+ return -EINVAL;
+
+ vfio_pci_zap_and_down_write_memory_lock(vdev);
+ ret = pci_try_reset_function(vdev->pdev);
+ up_write(&vdev->memory_lock);
+
+ return ret;

} else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
struct vfio_pci_hot_reset_info hdr;
@@ -1065,8 +1081,9 @@ static long vfio_pci_ioctl(void *device_data,
int32_t *group_fds;
struct vfio_pci_group_entry *groups;
struct vfio_pci_group_info info;
+ struct vfio_devices devs = { .cur_index = 0 };
bool slot = false;
- int i, count = 0, ret = 0;
+ int i, group_idx, mem_idx = 0, count = 0, ret = 0;

minsz = offsetofend(struct vfio_pci_hot_reset, count);

@@ -1118,9 +1135,9 @@ static long vfio_pci_ioctl(void *device_data,
* user interface and store the group and iommu ID. This
* ensures the group is held across the reset.
*/
- for (i = 0; i < hdr.count; i++) {
+ for (group_idx = 0; group_idx < hdr.count; group_idx++) {
struct vfio_group *group;
- struct fd f = fdget(group_fds[i]);
+ struct fd f = fdget(group_fds[group_idx]);
if (!f.file) {
ret = -EBADF;
break;
@@ -1133,8 +1150,9 @@ static long vfio_pci_ioctl(void *device_data,
break;
}

- groups[i].group = group;
- groups[i].id = vfio_external_user_iommu_id(group);
+ groups[group_idx].group = group;
+ groups[group_idx].id =
+ vfio_external_user_iommu_id(group);
}

kfree(group_fds);
@@ -1153,13 +1171,63 @@ static long vfio_pci_ioctl(void *device_data,
ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
vfio_pci_validate_devs,
&info, slot);
- if (!ret)
- /* User has access, do the reset */
- ret = pci_reset_bus(vdev->pdev);
+ if (ret)
+ goto hot_reset_release;
+
+ devs.max_index = count;
+ devs.devices = kcalloc(count, sizeof(struct vfio_device *),
+ GFP_KERNEL);
+ if (!devs.devices) {
+ ret = -ENOMEM;
+ goto hot_reset_release;
+ }
+
+ /*
+ * We need to get memory_lock for each device, but devices
+ * can share mmap_sem, therefore we need to zap and hold
+ * the vma_lock for each device, and only then get each
+ * memory_lock.
+ */
+ ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+ vfio_pci_try_zap_and_vma_lock_cb,
+ &devs, slot);
+ if (ret)
+ goto hot_reset_release;
+
+ for (; mem_idx < devs.cur_index; mem_idx++) {
+ struct vfio_pci_device *tmp;
+
+ tmp = vfio_device_data(devs.devices[mem_idx]);
+
+ ret = down_write_trylock(&tmp->memory_lock);
+ if (!ret) {
+ ret = -EBUSY;
+ goto hot_reset_release;
+ }
+ mutex_unlock(&tmp->vma_lock);
+ }
+
+ /* User has access, do the reset */
+ ret = pci_reset_bus(vdev->pdev);

hot_reset_release:
- for (i--; i >= 0; i--)
- vfio_group_put_external_user(groups[i].group);
+ for (i = 0; i < devs.cur_index; i++) {
+ struct vfio_device *device;
+ struct vfio_pci_device *tmp;
+
+ device = devs.devices[i];
+ tmp = vfio_device_data(device);
+
+ if (i < mem_idx)
+ up_write(&tmp->memory_lock);
+ else
+ mutex_unlock(&tmp->vma_lock);
+ vfio_device_put(device);
+ }
+ kfree(devs.devices);
+
+ for (group_idx--; group_idx >= 0; group_idx--)
+ vfio_group_put_external_user(groups[group_idx].group);

kfree(groups);
return ret;
@@ -1299,8 +1367,107 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
}

-static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
- struct vm_area_struct *vma)
+/* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
+static int vfio_pci_zap_and_vma_lock(struct vfio_pci_device *vdev, bool try)
+{
+ struct vfio_pci_mmap_vma *mmap_vma, *tmp;
+
+ /*
+ * Lock ordering:
+ * vma_lock is nested under mmap_sem for vm_ops callback paths.
+ * The memory_lock semaphore is used by both code paths calling
+ * into this function to zap vmas and the vm_ops.fault callback
+ * to protect the memory enable state of the device.
+ *
+ * When zapping vmas we need to maintain the mmap_sem => vma_lock
+ * ordering, which requires using vma_lock to walk vma_list to
+ * acquire an mm, then dropping vma_lock to get the mmap_sem and
+ * reacquiring vma_lock. This logic is derived from similar
+ * requirements in uverbs_user_mmap_disassociate().
+ *
+ * mmap_sem must always be the top-level lock when it is taken.
+ * Therefore we can only hold the memory_lock write lock when
+ * vma_list is empty, as we'd need to take mmap_sem to clear
+ * entries. vma_list can only be guaranteed empty when holding
+ * vma_lock, thus memory_lock is nested under vma_lock.
+ *
+ * This enables the vm_ops.fault callback to acquire vma_lock,
+ * followed by memory_lock read lock, while already holding
+ * mmap_sem without risk of deadlock.
+ */
+ while (1) {
+ struct mm_struct *mm = NULL;
+
+ if (try) {
+ if (!mutex_trylock(&vdev->vma_lock))
+ return 0;
+ } else {
+ mutex_lock(&vdev->vma_lock);
+ }
+ while (!list_empty(&vdev->vma_list)) {
+ mmap_vma = list_first_entry(&vdev->vma_list,
+ struct vfio_pci_mmap_vma,
+ vma_next);
+ mm = mmap_vma->vma->vm_mm;
+ if (mmget_not_zero(mm))
+ break;
+
+ list_del(&mmap_vma->vma_next);
+ kfree(mmap_vma);
+ mm = NULL;
+ }
+ if (!mm)
+ return 1;
+ mutex_unlock(&vdev->vma_lock);
+
+ if (try) {
+ if (!down_read_trylock(&mm->mmap_sem)) {
+ mmput(mm);
+ return 0;
+ }
+ } else {
+ down_read(&mm->mmap_sem);
+ }
+ if (mmget_still_valid(mm)) {
+ if (try) {
+ if (!mutex_trylock(&vdev->vma_lock)) {
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ return 0;
+ }
+ } else {
+ mutex_lock(&vdev->vma_lock);
+ }
+ list_for_each_entry_safe(mmap_vma, tmp,
+ &vdev->vma_list, vma_next) {
+ struct vm_area_struct *vma = mmap_vma->vma;
+
+ if (vma->vm_mm != mm)
+ continue;
+
+ list_del(&mmap_vma->vma_next);
+ kfree(mmap_vma);
+
+ zap_vma_ptes(vma, vma->vm_start,
+ vma->vm_end - vma->vm_start);
+ }
+ mutex_unlock(&vdev->vma_lock);
+ }
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ }
+}
+
+void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device *vdev)
+{
+ vfio_pci_zap_and_vma_lock(vdev, false);
+ down_write(&vdev->memory_lock);
+ mutex_unlock(&vdev->vma_lock);
+}
+
+/* Caller holds vma_lock */
+static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
+ struct vm_area_struct *vma)
{
struct vfio_pci_mmap_vma *mmap_vma;

@@ -1309,10 +1476,7 @@ static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
return -ENOMEM;

mmap_vma->vma = vma;
-
- mutex_lock(&vdev->vma_lock);
list_add(&mmap_vma->vma_next, &vdev->vma_list);
- mutex_unlock(&vdev->vma_lock);

return 0;
}
@@ -1346,15 +1510,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct vfio_pci_device *vdev = vma->vm_private_data;
+ vm_fault_t ret = VM_FAULT_NOPAGE;

- if (vfio_pci_add_vma(vdev, vma))
- return VM_FAULT_OOM;
+ mutex_lock(&vdev->vma_lock);
+ down_read(&vdev->memory_lock);
+
+ if (!__vfio_pci_memory_enabled(vdev)) {
+ ret = VM_FAULT_SIGBUS;
+ mutex_unlock(&vdev->vma_lock);
+ goto up_out;
+ }
+
+ if (__vfio_pci_add_vma(vdev, vma)) {
+ ret = VM_FAULT_OOM;
+ mutex_unlock(&vdev->vma_lock);
+ goto up_out;
+ }
+
+ mutex_unlock(&vdev->vma_lock);

if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
vma->vm_end - vma->vm_start, vma->vm_page_prot))
- return VM_FAULT_SIGBUS;
+ ret = VM_FAULT_SIGBUS;

- return VM_FAULT_NOPAGE;
+up_out:
+ up_read(&vdev->memory_lock);
+ return ret;
}

static const struct vm_operations_struct vfio_pci_mmap_ops = {
@@ -1680,6 +1861,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
INIT_LIST_HEAD(&vdev->ioeventfds_list);
mutex_init(&vdev->vma_lock);
INIT_LIST_HEAD(&vdev->vma_list);
+ init_rwsem(&vdev->memory_lock);

ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret)
@@ -1933,12 +2115,6 @@ static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
}

-struct vfio_devices {
- struct vfio_device **devices;
- int cur_index;
- int max_index;
-};
-
static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
{
struct vfio_devices *devs = data;
@@ -1969,6 +2145,39 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
return 0;
}

+static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
+{
+ struct vfio_devices *devs = data;
+ struct vfio_device *device;
+ struct vfio_pci_device *vdev;
+
+ if (devs->cur_index == devs->max_index)
+ return -ENOSPC;
+
+ device = vfio_device_get_from_dev(&pdev->dev);
+ if (!device)
+ return -EINVAL;
+
+ if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+ vfio_device_put(device);
+ return -EBUSY;
+ }
+
+ vdev = vfio_device_data(device);
+
+ /*
+ * Locking multiple devices is prone to deadlock, runaway and
+ * unwind if we hit contention.
+ */
+ if (!vfio_pci_zap_and_vma_lock(vdev, true)) {
+ vfio_device_put(device);
+ return -EBUSY;
+ }
+
+ devs->devices[devs->cur_index++] = device;
+ return 0;
+}
+
/*
* If a bus or slot reset is available for the provided device and:
* - All of the devices affected by that bus or slot reset are unused
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 90c0b80f8acf..3dcddbd572e6 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -395,6 +395,14 @@ static inline void p_setd(struct perm_bits *p, int off, u32 virt, u32 write)
*(__le32 *)(&p->write[off]) = cpu_to_le32(write);
}

+/* Caller should hold memory_lock semaphore */
+bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev)
+{
+ u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
+
+ return cmd & PCI_COMMAND_MEMORY;
+}
+
/*
* Restore the *real* BARs after we detect a FLR or backdoor reset.
* (backdoor = some device specific technique that we didn't catch)
@@ -556,13 +564,18 @@ static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,

new_cmd = le32_to_cpu(val);

+ phys_io = !!(phys_cmd & PCI_COMMAND_IO);
+ virt_io = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_IO);
+ new_io = !!(new_cmd & PCI_COMMAND_IO);
+
phys_mem = !!(phys_cmd & PCI_COMMAND_MEMORY);
virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);

- phys_io = !!(phys_cmd & PCI_COMMAND_IO);
- virt_io = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_IO);
- new_io = !!(new_cmd & PCI_COMMAND_IO);
+ if (!new_mem)
+ vfio_pci_zap_and_down_write_memory_lock(vdev);
+ else
+ down_write(&vdev->memory_lock);

/*
* If the user is writing mem/io enable (new_mem/io) and we
@@ -579,8 +592,11 @@ static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
}

count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
- if (count < 0)
+ if (count < 0) {
+ if (offset == PCI_COMMAND)
+ up_write(&vdev->memory_lock);
return count;
+ }

/*
* Save current memory/io enable bits in vconfig to allow for
@@ -591,6 +607,8 @@ static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,

*virt_cmd &= cpu_to_le16(~mask);
*virt_cmd |= cpu_to_le16(new_cmd & mask);
+
+ up_write(&vdev->memory_lock);
}

/* Emulate INTx disable */
@@ -828,8 +846,11 @@ static int vfio_exp_config_write(struct vfio_pci_device *vdev, int pos,
pos - offset + PCI_EXP_DEVCAP,
&cap);

- if (!ret && (cap & PCI_EXP_DEVCAP_FLR))
+ if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
+ vfio_pci_zap_and_down_write_memory_lock(vdev);
pci_try_reset_function(vdev->pdev);
+ up_write(&vdev->memory_lock);
+ }
}

/*
@@ -907,8 +928,11 @@ static int vfio_af_config_write(struct vfio_pci_device *vdev, int pos,
pos - offset + PCI_AF_CAP,
&cap);

- if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP))
+ if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
+ vfio_pci_zap_and_down_write_memory_lock(vdev);
pci_try_reset_function(vdev->pdev);
+ up_write(&vdev->memory_lock);
+ }
}

return count;
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 2056f3f85f59..54102a7eb9d3 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -626,6 +626,8 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
int (*func)(struct vfio_pci_device *vdev, unsigned index,
unsigned start, unsigned count, uint32_t flags,
void *data) = NULL;
+ int ret;
+ u16 cmd;

switch (index) {
case VFIO_PCI_INTX_IRQ_INDEX:
@@ -673,5 +675,19 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
if (!func)
return -ENOTTY;

- return func(vdev, index, start, count, flags, data);
+ if (index == VFIO_PCI_MSIX_IRQ_INDEX) {
+ down_write(&vdev->memory_lock);
+ pci_read_config_word(vdev->pdev, PCI_COMMAND, &cmd);
+ pci_write_config_word(vdev->pdev, PCI_COMMAND,
+ cmd | PCI_COMMAND_MEMORY);
+ }
+
+ ret = func(vdev, index, start, count, flags, data);
+
+ if (index == VFIO_PCI_MSIX_IRQ_INDEX) {
+ pci_write_config_word(vdev->pdev, PCI_COMMAND, cmd);
+ up_write(&vdev->memory_lock);
+ }
+
+ return ret;
}
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 9b25f9f6ce1d..c4f25f1e80d7 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -139,6 +139,7 @@ struct vfio_pci_device {
struct notifier_block nb;
struct mutex vma_lock;
struct list_head vma_list;
+ struct rw_semaphore memory_lock;
};

#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
@@ -181,6 +182,10 @@ extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
pci_power_t state);

+extern bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev);
+extern void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device
+ *vdev);
+
#ifdef CONFIG_VFIO_PCI_IGD
extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
#else
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index a87992892a9f..f58c45308682 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -162,6 +162,7 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
size_t x_start = 0, x_end = 0;
resource_size_t end;
void __iomem *io;
+ struct resource *res = &vdev->pdev->resource[bar];
ssize_t done;

if (pci_resource_start(pdev, bar))
@@ -200,8 +201,19 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
x_end = vdev->msix_offset + vdev->msix_size;
}

+ if (res->flags & IORESOURCE_MEM) {
+ down_read(&vdev->memory_lock);
+ if (!__vfio_pci_memory_enabled(vdev)) {
+ up_read(&vdev->memory_lock);
+ return -EIO;
+ }
+ }
+
done = do_io_rw(io, buf, pos, count, x_start, x_end, iswrite);

+ if (res->flags & IORESOURCE_MEM)
+ up_read(&vdev->memory_lock);
+
if (done >= 0)
*ppos += done;


2020-05-07 21:26:49

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

On Tue, May 05, 2020 at 03:54:44PM -0600, Alex Williamson wrote:
> With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> the range being faulted into the vma. Add support to manually provide
> that, in the same way as done on KVM with hva_to_pfn_remapped().
>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
> drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++++---
> 1 file changed, 33 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index cc1d64765ce7..4a4cb7cd86b2 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> return 0;
> }
>
> +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> + unsigned long vaddr, unsigned long *pfn,
> + bool write_fault)
> +{
> + int ret;
> +
> + ret = follow_pfn(vma, vaddr, pfn);
> + if (ret) {
> + bool unlocked = false;
> +
> + ret = fixup_user_fault(NULL, mm, vaddr,
> + FAULT_FLAG_REMOTE |
> + (write_fault ? FAULT_FLAG_WRITE : 0),
> + &unlocked);
> + if (unlocked)
> + return -EAGAIN;

Hi, Alex,

IIUC this retry is not needed too because fixup_user_fault() will guarantee the
fault-in is done correctly with the valid PTE as long as ret==0, even if
unlocked==true.

Note: there's another patch just removed the similar retry in kvm:

https://lore.kernel.org/kvm/[email protected]/

Thanks,

> +
> + if (ret)
> + return ret;
> +
> + ret = follow_pfn(vma, vaddr, pfn);
> + }
> +
> + return ret;
> +}

--
Peter Xu

2020-05-07 21:50:37

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking

Hi, Alex,

On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> +/*
> + * Zap mmaps on open so that we can fault them in on access and therefore
> + * our vma_list only tracks mappings accessed since last zap.
> + */
> +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> +{
> + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);

A pure question: is this only a safety-belt or it is required in some known
scenarios?

In all cases:

Reviewed-by: Peter Xu <[email protected]>

Thanks,

--
Peter Xu

2020-05-07 21:52:25

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

On Thu, 7 May 2020 17:24:43 -0400
Peter Xu <[email protected]> wrote:

> On Tue, May 05, 2020 at 03:54:44PM -0600, Alex Williamson wrote:
> > With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> > the range being faulted into the vma. Add support to manually provide
> > that, in the same way as done on KVM with hva_to_pfn_remapped().
> >
> > Signed-off-by: Alex Williamson <[email protected]>
> > ---
> > drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++++---
> > 1 file changed, 33 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index cc1d64765ce7..4a4cb7cd86b2 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> > return 0;
> > }
> >
> > +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > + unsigned long vaddr, unsigned long *pfn,
> > + bool write_fault)
> > +{
> > + int ret;
> > +
> > + ret = follow_pfn(vma, vaddr, pfn);
> > + if (ret) {
> > + bool unlocked = false;
> > +
> > + ret = fixup_user_fault(NULL, mm, vaddr,
> > + FAULT_FLAG_REMOTE |
> > + (write_fault ? FAULT_FLAG_WRITE : 0),
> > + &unlocked);
> > + if (unlocked)
> > + return -EAGAIN;
>
> Hi, Alex,
>
> IIUC this retry is not needed too because fixup_user_fault() will guarantee the
> fault-in is done correctly with the valid PTE as long as ret==0, even if
> unlocked==true.
>
> Note: there's another patch just removed the similar retry in kvm:
>
> https://lore.kernel.org/kvm/[email protected]/

Great, I was basing this on that kvm code, so I can make essentially an
identical fix. Thanks!

Alex

2020-05-07 22:01:25

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] vfio-pci: Block user access to disabled device MMIO

On Tue, May 05, 2020 at 03:54:36PM -0600, Alex Williamson wrote:
> v2:
>
> Locking in 3/ is substantially changed to avoid the retry scenario
> within the fault handler, therefore a caller who does not allow retry
> will no longer receive a SIGBUS on contention. IOMMU invalidations
> are still not included here, I expect that will be a future follow-on
> change as we're not fundamentally changing that issue in this series.
> The 'add to vma list only on fault' behavior is also still included
> here, per the discussion I think it's still a valid approach and has
> some advantages, particularly in a VM scenario where we potentially
> defer the mapping until the MMIO BAR is actually DMA mapped into the
> VM address space (or the guest driver actually accesses the device
> if that DMA mapping is eliminated at some point). Further discussion
> and review appreciated. Thanks,

Hi, Alex,

I have a general question on the series.

IIUC this series tries to protect illegal vfio userspace writes to device MMIO
regions which may cause platform-level issues. That makes perfect sense to me.
However what if the write comes from the devices' side? E.g.:

- Device A maps MMIO region X

- Device B do VFIO_IOMMU_DMA_MAP on Device A's MMIO region X
(so X's MMIO PFNs are mapped in device B's IOMMU page table)

- Device A clears PCI_COMMAND_MEMORY (reset, etc.)
- this should zap all existing vmas that mapping region X, however device
B's IOMMU page table is not aware of this?

- Device B writes to MMIO region X of device A even if PCI_COMMAND_MEMORY
cleared on device A's PCI_COMMAND register

Could this happen?

Thanks,

--
Peter Xu

2020-05-07 22:06:02

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking

On Thu, 7 May 2020 17:47:44 -0400
Peter Xu <[email protected]> wrote:

> Hi, Alex,
>
> On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > +/*
> > + * Zap mmaps on open so that we can fault them in on access and therefore
> > + * our vma_list only tracks mappings accessed since last zap.
> > + */
> > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > +{
> > + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
>
> A pure question: is this only a safety-belt or it is required in some known
> scenarios?

It's not required. I originally did this so that I'm not allocating a
vma_list entry in a path where I can't return error, but as Jason
suggested I could zap here only in the case that I do encounter that
allocation fault. However I still like consolidating the vma_list
handling to the vm_ops .fault and .close callbacks and potentially we
reduce the zap latency by keeping the vma_list to actual users, which
we'll get to eventually anyway in the VM case as memory BARs are sized
and assigned addresses.

> In all cases:
>
> Reviewed-by: Peter Xu <[email protected]>

Thanks!
Alex

2020-05-07 22:26:49

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking

On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
> On Thu, 7 May 2020 17:47:44 -0400
> Peter Xu <[email protected]> wrote:
>
> > Hi, Alex,
> >
> > On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > > +/*
> > > + * Zap mmaps on open so that we can fault them in on access and therefore
> > > + * our vma_list only tracks mappings accessed since last zap.
> > > + */
> > > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > > +{
> > > + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> >
> > A pure question: is this only a safety-belt or it is required in some known
> > scenarios?
>
> It's not required. I originally did this so that I'm not allocating a
> vma_list entry in a path where I can't return error, but as Jason
> suggested I could zap here only in the case that I do encounter that
> allocation fault. However I still like consolidating the vma_list
> handling to the vm_ops .fault and .close callbacks and potentially we
> reduce the zap latency by keeping the vma_list to actual users, which
> we'll get to eventually anyway in the VM case as memory BARs are sized
> and assigned addresses.

Yes, I don't see much problem either on doing the vma_list maintainance only in
.fault() and .close(). My understandingg is that the worst case is the perf
critical applications (e.g. DPDK) could pre-fault these MMIO region easily
during setup if they want. My question was majorly about whether the vma
should be guaranteed to have no mapping at all when .open() is called. But I
agree with you that it's always good to have that as safety-belt anyways.

Thanks!

--
Peter Xu

2020-05-07 22:38:40

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] vfio-pci: Block user access to disabled device MMIO

On Thu, 7 May 2020 17:59:08 -0400
Peter Xu <[email protected]> wrote:

> On Tue, May 05, 2020 at 03:54:36PM -0600, Alex Williamson wrote:
> > v2:
> >
> > Locking in 3/ is substantially changed to avoid the retry scenario
> > within the fault handler, therefore a caller who does not allow retry
> > will no longer receive a SIGBUS on contention. IOMMU invalidations
> > are still not included here, I expect that will be a future follow-on
> > change as we're not fundamentally changing that issue in this series.
> > The 'add to vma list only on fault' behavior is also still included
> > here, per the discussion I think it's still a valid approach and has
> > some advantages, particularly in a VM scenario where we potentially
> > defer the mapping until the MMIO BAR is actually DMA mapped into the
> > VM address space (or the guest driver actually accesses the device
> > if that DMA mapping is eliminated at some point). Further discussion
> > and review appreciated. Thanks,
>
> Hi, Alex,
>
> I have a general question on the series.
>
> IIUC this series tries to protect illegal vfio userspace writes to device MMIO
> regions which may cause platform-level issues. That makes perfect sense to me.
> However what if the write comes from the devices' side? E.g.:
>
> - Device A maps MMIO region X
>
> - Device B do VFIO_IOMMU_DMA_MAP on Device A's MMIO region X
> (so X's MMIO PFNs are mapped in device B's IOMMU page table)
>
> - Device A clears PCI_COMMAND_MEMORY (reset, etc.)
> - this should zap all existing vmas that mapping region X, however device
> B's IOMMU page table is not aware of this?
>
> - Device B writes to MMIO region X of device A even if PCI_COMMAND_MEMORY
> cleared on device A's PCI_COMMAND register
>
> Could this happen?

Yes, this can happen and Jason has brought up variations on this
scenario that are important to fix as well. I've got some ideas, but
the access in this series was the current priority. There are also
issues in the above scenario that if a platform considers a DMA write
to an invalid IOMMU PTE and triggering an IOMMU fault to have the same
severity as the write to disabled MMIO space we've prevented, then our
hands are tied. Thanks,

Alex

2020-05-07 23:56:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

On Thu, May 07, 2020 at 05:24:43PM -0400, Peter Xu wrote:
> On Tue, May 05, 2020 at 03:54:44PM -0600, Alex Williamson wrote:
> > With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> > the range being faulted into the vma. Add support to manually provide
> > that, in the same way as done on KVM with hva_to_pfn_remapped().
> >
> > Signed-off-by: Alex Williamson <[email protected]>
> > drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++++---
> > 1 file changed, 33 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index cc1d64765ce7..4a4cb7cd86b2 100644
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> > return 0;
> > }
> >
> > +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > + unsigned long vaddr, unsigned long *pfn,
> > + bool write_fault)
> > +{
> > + int ret;
> > +
> > + ret = follow_pfn(vma, vaddr, pfn);
> > + if (ret) {
> > + bool unlocked = false;
> > +
> > + ret = fixup_user_fault(NULL, mm, vaddr,
> > + FAULT_FLAG_REMOTE |
> > + (write_fault ? FAULT_FLAG_WRITE : 0),
> > + &unlocked);
> > + if (unlocked)
> > + return -EAGAIN;
>
> Hi, Alex,
>
> IIUC this retry is not needed too because fixup_user_fault() will guarantee the
> fault-in is done correctly with the valid PTE as long as ret==0, even if
> unlocked==true.

It is true, and today it is fine, but be careful when reworking this
to use notifiers as unlocked also means things like the vma pointer
are invalidated.

Jason

2020-05-08 00:00:54

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking

On Thu, May 07, 2020 at 06:22:23PM -0400, Peter Xu wrote:
> On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
> > On Thu, 7 May 2020 17:47:44 -0400
> > Peter Xu <[email protected]> wrote:
> >
> > > Hi, Alex,
> > >
> > > On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > > > +/*
> > > > + * Zap mmaps on open so that we can fault them in on access and therefore
> > > > + * our vma_list only tracks mappings accessed since last zap.
> > > > + */
> > > > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > > > +{
> > > > + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > >
> > > A pure question: is this only a safety-belt or it is required in some known
> > > scenarios?
> >
> > It's not required. I originally did this so that I'm not allocating a
> > vma_list entry in a path where I can't return error, but as Jason
> > suggested I could zap here only in the case that I do encounter that
> > allocation fault. However I still like consolidating the vma_list
> > handling to the vm_ops .fault and .close callbacks and potentially we
> > reduce the zap latency by keeping the vma_list to actual users, which
> > we'll get to eventually anyway in the VM case as memory BARs are sized
> > and assigned addresses.
>
> Yes, I don't see much problem either on doing the vma_list maintainance only in
> .fault() and .close(). My understandingg is that the worst case is the perf
> critical applications (e.g. DPDK) could pre-fault these MMIO region easily
> during setup if they want. My question was majorly about whether the vma
> should be guaranteed to have no mapping at all when .open() is called. But I
> agree with you that it's always good to have that as safety-belt anyways.

If the VMA has a mapping then that specific VMA has to be in the
linked list.

So if the zap is skipped then the you have to allocate something and
add to the linked list to track the VMA with mapping.

It is not a 'safety belt'

Jason

2020-05-08 02:19:06

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking

On Thu, May 07, 2020 at 08:56:33PM -0300, Jason Gunthorpe wrote:
> On Thu, May 07, 2020 at 06:22:23PM -0400, Peter Xu wrote:
> > On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
> > > On Thu, 7 May 2020 17:47:44 -0400
> > > Peter Xu <[email protected]> wrote:
> > >
> > > > Hi, Alex,
> > > >
> > > > On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > > > > +/*
> > > > > + * Zap mmaps on open so that we can fault them in on access and therefore
> > > > > + * our vma_list only tracks mappings accessed since last zap.
> > > > > + */
> > > > > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > > > > +{
> > > > > + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > > >
> > > > A pure question: is this only a safety-belt or it is required in some known
> > > > scenarios?
> > >
> > > It's not required. I originally did this so that I'm not allocating a
> > > vma_list entry in a path where I can't return error, but as Jason
> > > suggested I could zap here only in the case that I do encounter that
> > > allocation fault. However I still like consolidating the vma_list
> > > handling to the vm_ops .fault and .close callbacks and potentially we
> > > reduce the zap latency by keeping the vma_list to actual users, which
> > > we'll get to eventually anyway in the VM case as memory BARs are sized
> > > and assigned addresses.
> >
> > Yes, I don't see much problem either on doing the vma_list maintainance only in
> > .fault() and .close(). My understandingg is that the worst case is the perf
> > critical applications (e.g. DPDK) could pre-fault these MMIO region easily
> > during setup if they want. My question was majorly about whether the vma
> > should be guaranteed to have no mapping at all when .open() is called. But I
> > agree with you that it's always good to have that as safety-belt anyways.
>
> If the VMA has a mapping then that specific VMA has to be in the
> linked list.
>
> So if the zap is skipped then the you have to allocate something and
> add to the linked list to track the VMA with mapping.
>
> It is not a 'safety belt'

But shouldn't open() only be called when the VMA is created for a memory range?
If so, does it also mean that the address range must have not been mapped yet?

Thanks,

--
Peter Xu

2020-05-08 02:21:39

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

On Thu, May 07, 2020 at 08:54:21PM -0300, Jason Gunthorpe wrote:
> On Thu, May 07, 2020 at 05:24:43PM -0400, Peter Xu wrote:
> > On Tue, May 05, 2020 at 03:54:44PM -0600, Alex Williamson wrote:
> > > With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> > > the range being faulted into the vma. Add support to manually provide
> > > that, in the same way as done on KVM with hva_to_pfn_remapped().
> > >
> > > Signed-off-by: Alex Williamson <[email protected]>
> > > drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++++---
> > > 1 file changed, 33 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index cc1d64765ce7..4a4cb7cd86b2 100644
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> > > return 0;
> > > }
> > >
> > > +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > > + unsigned long vaddr, unsigned long *pfn,
> > > + bool write_fault)
> > > +{
> > > + int ret;
> > > +
> > > + ret = follow_pfn(vma, vaddr, pfn);
> > > + if (ret) {
> > > + bool unlocked = false;
> > > +
> > > + ret = fixup_user_fault(NULL, mm, vaddr,
> > > + FAULT_FLAG_REMOTE |
> > > + (write_fault ? FAULT_FLAG_WRITE : 0),
> > > + &unlocked);
> > > + if (unlocked)
> > > + return -EAGAIN;
> >
> > Hi, Alex,
> >
> > IIUC this retry is not needed too because fixup_user_fault() will guarantee the
> > fault-in is done correctly with the valid PTE as long as ret==0, even if
> > unlocked==true.
>
> It is true, and today it is fine, but be careful when reworking this
> to use notifiers as unlocked also means things like the vma pointer
> are invalidated.

Oh right, thanks for noticing that. Then we should probably still keep the
retry logic... because otherwise the latter follow_pfn() could be referencing
an invalid vma already...

Thanks,

--
Peter Xu

2020-05-08 02:35:56

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] vfio-pci: Block user access to disabled device MMIO

On Thu, May 07, 2020 at 04:34:37PM -0600, Alex Williamson wrote:
> On Thu, 7 May 2020 17:59:08 -0400
> Peter Xu <[email protected]> wrote:
>
> > On Tue, May 05, 2020 at 03:54:36PM -0600, Alex Williamson wrote:
> > > v2:
> > >
> > > Locking in 3/ is substantially changed to avoid the retry scenario
> > > within the fault handler, therefore a caller who does not allow retry
> > > will no longer receive a SIGBUS on contention. IOMMU invalidations
> > > are still not included here, I expect that will be a future follow-on
> > > change as we're not fundamentally changing that issue in this series.
> > > The 'add to vma list only on fault' behavior is also still included
> > > here, per the discussion I think it's still a valid approach and has
> > > some advantages, particularly in a VM scenario where we potentially
> > > defer the mapping until the MMIO BAR is actually DMA mapped into the
> > > VM address space (or the guest driver actually accesses the device
> > > if that DMA mapping is eliminated at some point). Further discussion
> > > and review appreciated. Thanks,
> >
> > Hi, Alex,
> >
> > I have a general question on the series.
> >
> > IIUC this series tries to protect illegal vfio userspace writes to device MMIO
> > regions which may cause platform-level issues. That makes perfect sense to me.
> > However what if the write comes from the devices' side? E.g.:
> >
> > - Device A maps MMIO region X
> >
> > - Device B do VFIO_IOMMU_DMA_MAP on Device A's MMIO region X
> > (so X's MMIO PFNs are mapped in device B's IOMMU page table)
> >
> > - Device A clears PCI_COMMAND_MEMORY (reset, etc.)
> > - this should zap all existing vmas that mapping region X, however device
> > B's IOMMU page table is not aware of this?
> >
> > - Device B writes to MMIO region X of device A even if PCI_COMMAND_MEMORY
> > cleared on device A's PCI_COMMAND register
> >
> > Could this happen?
>
> Yes, this can happen and Jason has brought up variations on this
> scenario that are important to fix as well. I've got some ideas, but
> the access in this series was the current priority. There are also
> issues in the above scenario that if a platform considers a DMA write
> to an invalid IOMMU PTE and triggering an IOMMU fault to have the same
> severity as the write to disabled MMIO space we've prevented, then our
> hands are tied. Thanks,

I see the point now; it makes sense to start with a series like this. Thanks, Alex.

--
Peter Xu

2020-05-08 06:48:33

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking


On 2020/5/8 上午10:16, Peter Xu wrote:
> On Thu, May 07, 2020 at 08:56:33PM -0300, Jason Gunthorpe wrote:
>> On Thu, May 07, 2020 at 06:22:23PM -0400, Peter Xu wrote:
>>> On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
>>>> On Thu, 7 May 2020 17:47:44 -0400
>>>> Peter Xu <[email protected]> wrote:
>>>>
>>>>> Hi, Alex,
>>>>>
>>>>> On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
>>>>>> +/*
>>>>>> + * Zap mmaps on open so that we can fault them in on access and therefore
>>>>>> + * our vma_list only tracks mappings accessed since last zap.
>>>>>> + */
>>>>>> +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
>>>>>> +{
>>>>>> + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
>>>>> A pure question: is this only a safety-belt or it is required in some known
>>>>> scenarios?
>>>> It's not required. I originally did this so that I'm not allocating a
>>>> vma_list entry in a path where I can't return error, but as Jason
>>>> suggested I could zap here only in the case that I do encounter that
>>>> allocation fault. However I still like consolidating the vma_list
>>>> handling to the vm_ops .fault and .close callbacks and potentially we
>>>> reduce the zap latency by keeping the vma_list to actual users, which
>>>> we'll get to eventually anyway in the VM case as memory BARs are sized
>>>> and assigned addresses.
>>> Yes, I don't see much problem either on doing the vma_list maintainance only in
>>> .fault() and .close(). My understandingg is that the worst case is the perf
>>> critical applications (e.g. DPDK) could pre-fault these MMIO region easily
>>> during setup if they want. My question was majorly about whether the vma
>>> should be guaranteed to have no mapping at all when .open() is called. But I
>>> agree with you that it's always good to have that as safety-belt anyways.
>> If the VMA has a mapping then that specific VMA has to be in the
>> linked list.
>>
>> So if the zap is skipped then the you have to allocate something and
>> add to the linked list to track the VMA with mapping.
>>
>> It is not a 'safety belt'
> But shouldn't open() only be called when the VMA is created for a memory range?
> If so, does it also mean that the address range must have not been mapped yet?


Probably not, e.g when VMA is being split.

Thanks


>
> Thanks,
>

2020-05-08 12:11:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking

On Thu, May 07, 2020 at 10:16:56PM -0400, Peter Xu wrote:
> On Thu, May 07, 2020 at 08:56:33PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 07, 2020 at 06:22:23PM -0400, Peter Xu wrote:
> > > On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
> > > > On Thu, 7 May 2020 17:47:44 -0400
> > > > Peter Xu <[email protected]> wrote:
> > > >
> > > > > Hi, Alex,
> > > > >
> > > > > On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > > > > > +/*
> > > > > > + * Zap mmaps on open so that we can fault them in on access and therefore
> > > > > > + * our vma_list only tracks mappings accessed since last zap.
> > > > > > + */
> > > > > > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > > > > > +{
> > > > > > + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > > > >
> > > > > A pure question: is this only a safety-belt or it is required in some known
> > > > > scenarios?
> > > >
> > > > It's not required. I originally did this so that I'm not allocating a
> > > > vma_list entry in a path where I can't return error, but as Jason
> > > > suggested I could zap here only in the case that I do encounter that
> > > > allocation fault. However I still like consolidating the vma_list
> > > > handling to the vm_ops .fault and .close callbacks and potentially we
> > > > reduce the zap latency by keeping the vma_list to actual users, which
> > > > we'll get to eventually anyway in the VM case as memory BARs are sized
> > > > and assigned addresses.
> > >
> > > Yes, I don't see much problem either on doing the vma_list maintainance only in
> > > .fault() and .close(). My understandingg is that the worst case is the perf
> > > critical applications (e.g. DPDK) could pre-fault these MMIO region easily
> > > during setup if they want. My question was majorly about whether the vma
> > > should be guaranteed to have no mapping at all when .open() is called. But I
> > > agree with you that it's always good to have that as safety-belt anyways.
> >
> > If the VMA has a mapping then that specific VMA has to be in the
> > linked list.
> >
> > So if the zap is skipped then the you have to allocate something and
> > add to the linked list to track the VMA with mapping.
> >
> > It is not a 'safety belt'
>
> But shouldn't open() only be called when the VMA is created for a memory range?
> If so, does it also mean that the address range must have not been mapped yet?

open is called whenever a VMA is copied, I don't think it is called
when the VMA is first created?

Jason

2020-05-08 12:12:58

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

On Thu, May 07, 2020 at 10:19:39PM -0400, Peter Xu wrote:
> On Thu, May 07, 2020 at 08:54:21PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 07, 2020 at 05:24:43PM -0400, Peter Xu wrote:
> > > On Tue, May 05, 2020 at 03:54:44PM -0600, Alex Williamson wrote:
> > > > With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> > > > the range being faulted into the vma. Add support to manually provide
> > > > that, in the same way as done on KVM with hva_to_pfn_remapped().
> > > >
> > > > Signed-off-by: Alex Williamson <[email protected]>
> > > > drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++++---
> > > > 1 file changed, 33 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index cc1d64765ce7..4a4cb7cd86b2 100644
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> > > > return 0;
> > > > }
> > > >
> > > > +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > > > + unsigned long vaddr, unsigned long *pfn,
> > > > + bool write_fault)
> > > > +{
> > > > + int ret;
> > > > +
> > > > + ret = follow_pfn(vma, vaddr, pfn);
> > > > + if (ret) {
> > > > + bool unlocked = false;
> > > > +
> > > > + ret = fixup_user_fault(NULL, mm, vaddr,
> > > > + FAULT_FLAG_REMOTE |
> > > > + (write_fault ? FAULT_FLAG_WRITE : 0),
> > > > + &unlocked);
> > > > + if (unlocked)
> > > > + return -EAGAIN;
> > >
> > > Hi, Alex,
> > >
> > > IIUC this retry is not needed too because fixup_user_fault() will guarantee the
> > > fault-in is done correctly with the valid PTE as long as ret==0, even if
> > > unlocked==true.
> >
> > It is true, and today it is fine, but be careful when reworking this
> > to use notifiers as unlocked also means things like the vma pointer
> > are invalidated.
>
> Oh right, thanks for noticing that. Then we should probably still keep the
> retry logic... because otherwise the latter follow_pfn() could be referencing
> an invalid vma already...

I looked briefly and thought this flow used the vma only once?

Jason

2020-05-08 14:29:00

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking

On Fri, May 08, 2020 at 09:08:01AM -0300, Jason Gunthorpe wrote:
> On Thu, May 07, 2020 at 10:16:56PM -0400, Peter Xu wrote:
> > On Thu, May 07, 2020 at 08:56:33PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 07, 2020 at 06:22:23PM -0400, Peter Xu wrote:
> > > > On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
> > > > > On Thu, 7 May 2020 17:47:44 -0400
> > > > > Peter Xu <[email protected]> wrote:
> > > > >
> > > > > > Hi, Alex,
> > > > > >
> > > > > > On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > > > > > > +/*
> > > > > > > + * Zap mmaps on open so that we can fault them in on access and therefore
> > > > > > > + * our vma_list only tracks mappings accessed since last zap.
> > > > > > > + */
> > > > > > > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > > > > > > +{
> > > > > > > + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > > > > >
> > > > > > A pure question: is this only a safety-belt or it is required in some known
> > > > > > scenarios?
> > > > >
> > > > > It's not required. I originally did this so that I'm not allocating a
> > > > > vma_list entry in a path where I can't return error, but as Jason
> > > > > suggested I could zap here only in the case that I do encounter that
> > > > > allocation fault. However I still like consolidating the vma_list
> > > > > handling to the vm_ops .fault and .close callbacks and potentially we
> > > > > reduce the zap latency by keeping the vma_list to actual users, which
> > > > > we'll get to eventually anyway in the VM case as memory BARs are sized
> > > > > and assigned addresses.
> > > >
> > > > Yes, I don't see much problem either on doing the vma_list maintainance only in
> > > > .fault() and .close(). My understandingg is that the worst case is the perf
> > > > critical applications (e.g. DPDK) could pre-fault these MMIO region easily
> > > > during setup if they want. My question was majorly about whether the vma
> > > > should be guaranteed to have no mapping at all when .open() is called. But I
> > > > agree with you that it's always good to have that as safety-belt anyways.
> > >
> > > If the VMA has a mapping then that specific VMA has to be in the
> > > linked list.
> > >
> > > So if the zap is skipped then the you have to allocate something and
> > > add to the linked list to track the VMA with mapping.
> > >
> > > It is not a 'safety belt'
> >
> > But shouldn't open() only be called when the VMA is created for a memory range?
> > If so, does it also mean that the address range must have not been mapped yet?
>
> open is called whenever a VMA is copied, I don't think it is called
> when the VMA is first created?

Ah I think you're right. I misunderstood open() which I thought should be
always pairing with close(), but it seems not. Thanks,

--
Peter Xu

2020-05-08 14:31:35

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] vfio-pci: Fault mmaps to enable vma tracking

Hi, Jason!

On Fri, May 08, 2020 at 02:44:44PM +0800, Jason Wang wrote:
> Probably not, e.g when VMA is being split.

I see it now, thanks. :)

--
Peter Xu

2020-05-08 14:32:47

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

On Fri, May 08, 2020 at 09:10:13AM -0300, Jason Gunthorpe wrote:
> On Thu, May 07, 2020 at 10:19:39PM -0400, Peter Xu wrote:
> > On Thu, May 07, 2020 at 08:54:21PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 07, 2020 at 05:24:43PM -0400, Peter Xu wrote:
> > > > On Tue, May 05, 2020 at 03:54:44PM -0600, Alex Williamson wrote:
> > > > > With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> > > > > the range being faulted into the vma. Add support to manually provide
> > > > > that, in the same way as done on KVM with hva_to_pfn_remapped().
> > > > >
> > > > > Signed-off-by: Alex Williamson <[email protected]>
> > > > > drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++++---
> > > > > 1 file changed, 33 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > index cc1d64765ce7..4a4cb7cd86b2 100644
> > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > > > > + unsigned long vaddr, unsigned long *pfn,
> > > > > + bool write_fault)
> > > > > +{
> > > > > + int ret;
> > > > > +
> > > > > + ret = follow_pfn(vma, vaddr, pfn);
> > > > > + if (ret) {
> > > > > + bool unlocked = false;
> > > > > +
> > > > > + ret = fixup_user_fault(NULL, mm, vaddr,
> > > > > + FAULT_FLAG_REMOTE |
> > > > > + (write_fault ? FAULT_FLAG_WRITE : 0),
> > > > > + &unlocked);
> > > > > + if (unlocked)
> > > > > + return -EAGAIN;
> > > >
> > > > Hi, Alex,
> > > >
> > > > IIUC this retry is not needed too because fixup_user_fault() will guarantee the
> > > > fault-in is done correctly with the valid PTE as long as ret==0, even if
> > > > unlocked==true.
> > >
> > > It is true, and today it is fine, but be careful when reworking this
> > > to use notifiers as unlocked also means things like the vma pointer
> > > are invalidated.
> >
> > Oh right, thanks for noticing that. Then we should probably still keep the
> > retry logic... because otherwise the latter follow_pfn() could be referencing
> > an invalid vma already...
>
> I looked briefly and thought this flow used the vma only once?

ret = follow_pfn(vma, vaddr, pfn);
if (ret) {
bool unlocked = false;

ret = fixup_user_fault(NULL, mm, vaddr,
FAULT_FLAG_REMOTE |
(write_fault ? FAULT_FLAG_WRITE : 0),
&unlocked);
if (unlocked)
return -EAGAIN;

if (ret)
return ret;

ret = follow_pfn(vma, vaddr, pfn); <--------------- [1]
}

So imo the 2nd follow_pfn() [1] could be racy if without the unlocked check.

Thanks,

--
Peter Xu

2020-05-08 15:08:03

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

On Fri, May 08, 2020 at 10:30:42AM -0400, Peter Xu wrote:
> On Fri, May 08, 2020 at 09:10:13AM -0300, Jason Gunthorpe wrote:
> > On Thu, May 07, 2020 at 10:19:39PM -0400, Peter Xu wrote:
> > > On Thu, May 07, 2020 at 08:54:21PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, May 07, 2020 at 05:24:43PM -0400, Peter Xu wrote:
> > > > > On Tue, May 05, 2020 at 03:54:44PM -0600, Alex Williamson wrote:
> > > > > > With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> > > > > > the range being faulted into the vma. Add support to manually provide
> > > > > > that, in the same way as done on KVM with hva_to_pfn_remapped().
> > > > > >
> > > > > > Signed-off-by: Alex Williamson <[email protected]>
> > > > > > drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++++---
> > > > > > 1 file changed, 33 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > > index cc1d64765ce7..4a4cb7cd86b2 100644
> > > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> > > > > > return 0;
> > > > > > }
> > > > > >
> > > > > > +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > > > > > + unsigned long vaddr, unsigned long *pfn,
> > > > > > + bool write_fault)
> > > > > > +{
> > > > > > + int ret;
> > > > > > +
> > > > > > + ret = follow_pfn(vma, vaddr, pfn);
> > > > > > + if (ret) {
> > > > > > + bool unlocked = false;
> > > > > > +
> > > > > > + ret = fixup_user_fault(NULL, mm, vaddr,
> > > > > > + FAULT_FLAG_REMOTE |
> > > > > > + (write_fault ? FAULT_FLAG_WRITE : 0),
> > > > > > + &unlocked);
> > > > > > + if (unlocked)
> > > > > > + return -EAGAIN;
> > > > >
> > > > > Hi, Alex,
> > > > >
> > > > > IIUC this retry is not needed too because fixup_user_fault() will guarantee the
> > > > > fault-in is done correctly with the valid PTE as long as ret==0, even if
> > > > > unlocked==true.
> > > >
> > > > It is true, and today it is fine, but be careful when reworking this
> > > > to use notifiers as unlocked also means things like the vma pointer
> > > > are invalidated.
> > >
> > > Oh right, thanks for noticing that. Then we should probably still keep the
> > > retry logic... because otherwise the latter follow_pfn() could be referencing
> > > an invalid vma already...
> >
> > I looked briefly and thought this flow used the vma only once?
>
> ret = follow_pfn(vma, vaddr, pfn);
> if (ret) {
> bool unlocked = false;
>
> ret = fixup_user_fault(NULL, mm, vaddr,
> FAULT_FLAG_REMOTE |
> (write_fault ? FAULT_FLAG_WRITE : 0),
> &unlocked);
> if (unlocked)
> return -EAGAIN;
>
> if (ret)
> return ret;
>
> ret = follow_pfn(vma, vaddr, pfn); <--------------- [1]
> }
>
> So imo the 2nd follow_pfn() [1] could be racy if without the unlocked check.

Ah yes, I didn't notice that, you can't touch vma here if unlocked is true.

Jason

2020-05-08 15:44:17

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

On Fri, 8 May 2020 12:05:40 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Fri, May 08, 2020 at 10:30:42AM -0400, Peter Xu wrote:
> > On Fri, May 08, 2020 at 09:10:13AM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 07, 2020 at 10:19:39PM -0400, Peter Xu wrote:
> > > > On Thu, May 07, 2020 at 08:54:21PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, May 07, 2020 at 05:24:43PM -0400, Peter Xu wrote:
> > > > > > On Tue, May 05, 2020 at 03:54:44PM -0600, Alex Williamson wrote:
> > > > > > > With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> > > > > > > the range being faulted into the vma. Add support to manually provide
> > > > > > > that, in the same way as done on KVM with hva_to_pfn_remapped().
> > > > > > >
> > > > > > > Signed-off-by: Alex Williamson <[email protected]>
> > > > > > > drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++++---
> > > > > > > 1 file changed, 33 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > index cc1d64765ce7..4a4cb7cd86b2 100644
> > > > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> > > > > > > return 0;
> > > > > > > }
> > > > > > >
> > > > > > > +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > > > > > > + unsigned long vaddr, unsigned long *pfn,
> > > > > > > + bool write_fault)
> > > > > > > +{
> > > > > > > + int ret;
> > > > > > > +
> > > > > > > + ret = follow_pfn(vma, vaddr, pfn);
> > > > > > > + if (ret) {
> > > > > > > + bool unlocked = false;
> > > > > > > +
> > > > > > > + ret = fixup_user_fault(NULL, mm, vaddr,
> > > > > > > + FAULT_FLAG_REMOTE |
> > > > > > > + (write_fault ? FAULT_FLAG_WRITE : 0),
> > > > > > > + &unlocked);
> > > > > > > + if (unlocked)
> > > > > > > + return -EAGAIN;
> > > > > >
> > > > > > Hi, Alex,
> > > > > >
> > > > > > IIUC this retry is not needed too because fixup_user_fault() will guarantee the
> > > > > > fault-in is done correctly with the valid PTE as long as ret==0, even if
> > > > > > unlocked==true.
> > > > >
> > > > > It is true, and today it is fine, but be careful when reworking this
> > > > > to use notifiers as unlocked also means things like the vma pointer
> > > > > are invalidated.
> > > >
> > > > Oh right, thanks for noticing that. Then we should probably still keep the
> > > > retry logic... because otherwise the latter follow_pfn() could be referencing
> > > > an invalid vma already...
> > >
> > > I looked briefly and thought this flow used the vma only once?
> >
> > ret = follow_pfn(vma, vaddr, pfn);
> > if (ret) {
> > bool unlocked = false;
> >
> > ret = fixup_user_fault(NULL, mm, vaddr,
> > FAULT_FLAG_REMOTE |
> > (write_fault ? FAULT_FLAG_WRITE : 0),
> > &unlocked);
> > if (unlocked)
> > return -EAGAIN;
> >
> > if (ret)
> > return ret;
> >
> > ret = follow_pfn(vma, vaddr, pfn); <--------------- [1]
> > }
> >
> > So imo the 2nd follow_pfn() [1] could be racy if without the unlocked check.
>
> Ah yes, I didn't notice that, you can't touch vma here if unlocked is true.

Thanks for the discussion. I gather then that this patch is correct as
written, which probably also mean the patch Peter linked for KVM should
not be applied since the logic is the same there. Correct? Thanks,

Alex

2020-05-08 16:09:58

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

On Fri, May 08, 2020 at 09:42:13AM -0600, Alex Williamson wrote:
> Thanks for the discussion. I gather then that this patch is correct as
> written, which probably also mean the patch Peter linked for KVM should
> not be applied since the logic is the same there. Correct? Thanks,

Right. I've already replied yesterday in that thread so Paolo unqueue that
patch too. Thanks,

--
Peter Xu

2020-05-08 18:41:35

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] vfio/type1: Support faulting PFNMAP vmas

On Tue, May 05, 2020 at 03:54:44PM -0600, Alex Williamson wrote:
> With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> the range being faulted into the vma. Add support to manually provide
> that, in the same way as done on KVM with hva_to_pfn_remapped().
>
> Signed-off-by: Alex Williamson <[email protected]>

And since after the discussion...

Reviewed-by: Peter Xu <[email protected]>

Thanks,

--
Peter Xu

2020-05-22 02:41:09

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

On Tue, May 05, 2020 at 03:55:02PM -0600, Alex Williamson wrote:
[]
vfio_pci_mmap_fault(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> struct vfio_pci_device *vdev = vma->vm_private_data;
> + vm_fault_t ret = VM_FAULT_NOPAGE;
>
> - if (vfio_pci_add_vma(vdev, vma))
> - return VM_FAULT_OOM;
> + mutex_lock(&vdev->vma_lock);
> + down_read(&vdev->memory_lock);

This lock here will trigger,

[17368.321363][T3614103] ======================================================
[17368.321375][T3614103] WARNING: possible circular locking dependency detected
[17368.321399][T3614103] 5.7.0-rc6-next-20200521+ #116 Tainted: G W
[17368.321410][T3614103] ------------------------------------------------------
[17368.321433][T3614103] qemu-kvm/3614103 is trying to acquire lock:
[17368.321443][T3614103] c000200fb2328968 (&kvm->lock){+.+.}-{3:3}, at: kvmppc_irq_bypass_add_producer_hv+0xd4/0x3b0 [kvm_hv]
[17368.321488][T3614103]
[17368.321488][T3614103] but task is already holding lock:
[17368.321533][T3614103] c0000000016f4dc8 (lock#7){+.+.}-{3:3}, at: irq_bypass_register_producer+0x80/0x1d0
[17368.321564][T3614103]
[17368.321564][T3614103] which lock already depends on the new lock.
[17368.321564][T3614103]
[17368.321590][T3614103]
[17368.321590][T3614103] the existing dependency chain (in reverse order) is:
[17368.321625][T3614103]
[17368.321625][T3614103] -> #4 (lock#7){+.+.}-{3:3}:
[17368.321662][T3614103] __mutex_lock+0xdc/0xb80
[17368.321683][T3614103] irq_bypass_register_producer+0x80/0x1d0
[17368.321706][T3614103] vfio_msi_set_vector_signal+0x1d8/0x350 [vfio_pci]
[17368.321719][T3614103] vfio_msi_set_block+0xb0/0x1e0 [vfio_pci]
[17368.321752][T3614103] vfio_pci_set_msi_trigger+0x13c/0x3e0 [vfio_pci]
[17368.321787][T3614103] vfio_pci_set_irqs_ioctl+0x134/0x2c0 [vfio_pci]
[17368.321821][T3614103] vfio_pci_ioctl+0xe10/0x1460 [vfio_pci]
[17368.321855][T3614103] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
[17368.321879][T3614103] ksys_ioctl+0xd8/0x130
[17368.321888][T3614103] sys_ioctl+0x28/0x40
[17368.321910][T3614103] system_call_exception+0x108/0x1d0
[17368.321932][T3614103] system_call_common+0xf0/0x278
[17368.321951][T3614103]
[17368.321951][T3614103] -> #3 (&vdev->memory_lock){++++}-{3:3}:
[17368.321988][T3614103] lock_release+0x190/0x5e0
[17368.322009][T3614103] __mutex_unlock_slowpath+0x68/0x410
[17368.322042][T3614103] vfio_pci_mmap_fault+0xe8/0x1f0 [vfio_pci]
vfio_pci_mmap_fault at drivers/vfio/pci/vfio_pci.c:1534
[17368.322066][T3614103] __do_fault+0x64/0x220
[17368.322086][T3614103] handle_mm_fault+0x12f0/0x19e0
[17368.322107][T3614103] __do_page_fault+0x284/0xf70
[17368.322116][T3614103] handle_page_fault+0x10/0x2c
[17368.322136][T3614103]
[17368.322136][T3614103] -> #2 (&mm->mmap_sem){++++}-{3:3}:
[17368.322160][T3614103] __might_fault+0x84/0xe0
[17368.322182][T3614103] _copy_to_user+0x3c/0x120
[17368.322206][T3614103] kvm_vcpu_ioctl+0x1ec/0xac0 [kvm]
[17368.322239][T3614103] ksys_ioctl+0xd8/0x130
[17368.322270][T3614103] sys_ioctl+0x28/0x40
[17368.322301][T3614103] system_call_exception+0x108/0x1d0
[17368.322334][T3614103] system_call_common+0xf0/0x278
[17368.322375][T3614103]
[17368.322375][T3614103] -> #1 (&vcpu->mutex){+.+.}-{3:3}:
[17368.322411][T3614103] __mutex_lock+0xdc/0xb80
[17368.322446][T3614103] kvmppc_xive_release+0xd8/0x260 [kvm]
[17368.322484][T3614103] kvm_device_release+0xc4/0x110 [kvm]
[17368.322518][T3614103] __fput+0x154/0x3b0
[17368.322562][T3614103] task_work_run+0xd8/0x170
[17368.322583][T3614103] do_exit+0x4f8/0xeb0
[17368.322604][T3614103] do_group_exit+0x78/0x160
[17368.322625][T3614103] get_signal+0x230/0x1440
[17368.322657][T3614103] do_notify_resume+0x130/0x3e0
[17368.322677][T3614103] syscall_exit_prepare+0x1a4/0x280
[17368.322687][T3614103] system_call_common+0xf8/0x278
[17368.322718][T3614103]
[17368.322718][T3614103] -> #0 (&kvm->lock){+.+.}-{3:3}:
[17368.322753][T3614103] __lock_acquire+0x1fe4/0x3190
[17368.322774][T3614103] lock_acquire+0x140/0x9a0
[17368.322805][T3614103] __mutex_lock+0xdc/0xb80
[17368.322838][T3614103] kvmppc_irq_bypass_add_producer_hv+0xd4/0x3b0 [kvm_hv]
[17368.322888][T3614103] kvm_arch_irq_bypass_add_producer+0x40/0x70 [kvm]
[17368.322925][T3614103] __connect+0x118/0x150
[17368.322956][T3614103] irq_bypass_register_producer+0x198/0x1d0
[17368.322989][T3614103] vfio_msi_set_vector_signal+0x1d8/0x350 [vfio_pci]
[17368.323024][T3614103] vfio_msi_set_block+0xb0/0x1e0 [vfio_pci]
[17368.323057][T3614103] vfio_pci_set_irqs_ioctl+0x134/0x2c0 [vfio_pci]
[17368.323091][T3614103] vfio_pci_ioctl+0xe10/0x1460 [vfio_pci]
[17368.323124][T3614103] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
[17368.323156][T3614103] ksys_ioctl+0xd8/0x130
[17368.323176][T3614103] sys_ioctl+0x28/0x40
[17368.323208][T3614103] system_call_exception+0x108/0x1d0
[17368.323229][T3614103] system_call_common+0xf0/0x278
[17368.323259][T3614103]
[17368.323259][T3614103] other info that might help us debug this:
[17368.323259][T3614103]
[17368.323295][T3614103] Chain exists of:
[17368.323295][T3614103] &kvm->lock --> &vdev->memory_lock --> lock#7
[17368.323295][T3614103]
[17368.323335][T3614103] Possible unsafe locking scenario:
[17368.323335][T3614103]
[17368.323368][T3614103] CPU0 CPU1
[17368.323410][T3614103] ---- ----
[17368.323429][T3614103] lock(lock#7);
[17368.323447][T3614103] lock(&vdev->memory_lock);
[17368.323503][T3614103] lock(lock#7);
[17368.323535][T3614103] lock(&kvm->lock);
[17368.323554][T3614103]
[17368.323554][T3614103] *** DEADLOCK ***
[17368.323554][T3614103]
[17368.323590][T3614103] 3 locks held by qemu-kvm/3614103:
[17368.323609][T3614103] #0: c000001f8daa2900 (&vdev->igate){+.+.}-{3:3}, at: vfio_pci_ioctl+0xdf0/0x1460 [vfio_pci]
[17368.323626][T3614103] #1: c000001f8daa2b88 (&vdev->memory_lock){++++}-{3:3}, at: vfio_pci_set_irqs_ioctl+0xe8/0x2c0 [vfio_pci]
[17368.323646][T3614103] #2: c0000000016f4dc8 (lock#7){+.+.}-{3:3}, at: irq_bypass_register_producer+0x80/0x1d0
[17368.323779][T3614103]
[17368.323779][T3614103] stack backtrace:
[17368.323835][T3614103] CPU: 84 PID: 3614103 Comm: qemu-kvm Tainted: G W 5.7.0-rc6-next-20200521+ #116
[17368.323963][T3614103] Call Trace:
[17368.324012][T3614103] [c000200d4260f380] [c00000000079cb38] dump_stack+0xfc/0x174 (unreliable)
[17368.324147][T3614103] [c000200d4260f3d0] [c0000000001ed958] print_circular_bug+0x2d8/0x350
[17368.324231][T3614103] [c000200d4260f470] [c0000000001edc40] check_noncircular+0x270/0x330
[17368.324328][T3614103] [c000200d4260f570] [c0000000001f4ce4] __lock_acquire+0x1fe4/0x3190
[17368.324427][T3614103] [c000200d4260f710] [c0000000001f0ec0] lock_acquire+0x140/0x9a0
[17368.324513][T3614103] [c000200d4260f840] [c000000000ad643c] __mutex_lock+0xdc/0xb80
[17368.324620][T3614103] [c000200d4260f960] [c0080000101e952c] kvmppc_irq_bypass_add_producer_hv+0xd4/0x3b0 [kvm_hv]
[17368.324747][T3614103] [c000200d4260fa10] [c0080000103ec698] kvm_arch_irq_bypass_add_producer+0x40/0x70 [kvm]
[17368.324845][T3614103] [c000200d4260fa30] [c000000000ad0558] __connect+0x118/0x150
[17368.324943][T3614103] [c000200d4260fa70] [c000000000ad0838] irq_bypass_register_producer+0x198/0x1d0
[17368.325048][T3614103] [c000200d4260fab0] [c00800000ffc5460] vfio_msi_set_vector_signal+0x1d8/0x350 [vfio_pci]
[17368.325178][T3614103] [c000200d4260fb70] [c00800000ffc5688] vfio_msi_set_block+0xb0/0x1e0 [vfio_pci]
[17368.325273][T3614103] [c000200d4260fbe0] [c00800000ffc6f2c] vfio_pci_set_irqs_ioctl+0x134/0x2c0 [vfio_pci]
[17368.325389][T3614103] [c000200d4260fc40] [c00800000ffc4be8] vfio_pci_ioctl+0xe10/0x1460 [vfio_pci]
[17368.325487][T3614103] [c000200d4260fd30] [c00800000fc805ec] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
[17368.325598][T3614103] [c000200d4260fd50] [c0000000005a30b8] ksys_ioctl+0xd8/0x130
[17368.325681][T3614103] [c000200d4260fda0] [c0000000005a3138] sys_ioctl+0x28/0x40
[17368.325787][T3614103] [c000200d4260fdc0] [c000000000039e78] system_call_exception+0x108/0x1d0
[17368.325872][T3614103] [c000200d4260fe20] [c00000000000c9f0] system_call_common+0xf0/0x278

> +
> + if (!__vfio_pci_memory_enabled(vdev)) {
> + ret = VM_FAULT_SIGBUS;
> + mutex_unlock(&vdev->vma_lock);
> + goto up_out;
> + }
> +
> + if (__vfio_pci_add_vma(vdev, vma)) {
> + ret = VM_FAULT_OOM;
> + mutex_unlock(&vdev->vma_lock);
> + goto up_out;
> + }
> +
> + mutex_unlock(&vdev->vma_lock);
>
> if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> vma->vm_end - vma->vm_start, vma->vm_page_prot))
> - return VM_FAULT_SIGBUS;
> + ret = VM_FAULT_SIGBUS;
>
> - return VM_FAULT_NOPAGE;
> +up_out:
> + up_read(&vdev->memory_lock);
> + return ret;
> }

2020-05-22 04:20:30

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

On Thu, 21 May 2020 22:39:06 -0400
Qian Cai <[email protected]> wrote:

> On Tue, May 05, 2020 at 03:55:02PM -0600, Alex Williamson wrote:
> []
> vfio_pci_mmap_fault(struct vm_fault *vmf)
> > {
> > struct vm_area_struct *vma = vmf->vma;
> > struct vfio_pci_device *vdev = vma->vm_private_data;
> > + vm_fault_t ret = VM_FAULT_NOPAGE;
> >
> > - if (vfio_pci_add_vma(vdev, vma))
> > - return VM_FAULT_OOM;
> > + mutex_lock(&vdev->vma_lock);
> > + down_read(&vdev->memory_lock);
>
> This lock here will trigger,
>
> [17368.321363][T3614103] ======================================================
> [17368.321375][T3614103] WARNING: possible circular locking dependency detected
> [17368.321399][T3614103] 5.7.0-rc6-next-20200521+ #116 Tainted: G W
> [17368.321410][T3614103] ------------------------------------------------------
> [17368.321433][T3614103] qemu-kvm/3614103 is trying to acquire lock:
> [17368.321443][T3614103] c000200fb2328968 (&kvm->lock){+.+.}-{3:3}, at: kvmppc_irq_bypass_add_producer_hv+0xd4/0x3b0 [kvm_hv]
> [17368.321488][T3614103]
> [17368.321488][T3614103] but task is already holding lock:
> [17368.321533][T3614103] c0000000016f4dc8 (lock#7){+.+.}-{3:3}, at: irq_bypass_register_producer+0x80/0x1d0
> [17368.321564][T3614103]
> [17368.321564][T3614103] which lock already depends on the new lock.
> [17368.321564][T3614103]
> [17368.321590][T3614103]
> [17368.321590][T3614103] the existing dependency chain (in reverse order) is:
> [17368.321625][T3614103]
> [17368.321625][T3614103] -> #4 (lock#7){+.+.}-{3:3}:
> [17368.321662][T3614103] __mutex_lock+0xdc/0xb80
> [17368.321683][T3614103] irq_bypass_register_producer+0x80/0x1d0
> [17368.321706][T3614103] vfio_msi_set_vector_signal+0x1d8/0x350 [vfio_pci]
> [17368.321719][T3614103] vfio_msi_set_block+0xb0/0x1e0 [vfio_pci]
> [17368.321752][T3614103] vfio_pci_set_msi_trigger+0x13c/0x3e0 [vfio_pci]
> [17368.321787][T3614103] vfio_pci_set_irqs_ioctl+0x134/0x2c0 [vfio_pci]
> [17368.321821][T3614103] vfio_pci_ioctl+0xe10/0x1460 [vfio_pci]
> [17368.321855][T3614103] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
> [17368.321879][T3614103] ksys_ioctl+0xd8/0x130
> [17368.321888][T3614103] sys_ioctl+0x28/0x40
> [17368.321910][T3614103] system_call_exception+0x108/0x1d0
> [17368.321932][T3614103] system_call_common+0xf0/0x278
> [17368.321951][T3614103]
> [17368.321951][T3614103] -> #3 (&vdev->memory_lock){++++}-{3:3}:
> [17368.321988][T3614103] lock_release+0x190/0x5e0
> [17368.322009][T3614103] __mutex_unlock_slowpath+0x68/0x410
> [17368.322042][T3614103] vfio_pci_mmap_fault+0xe8/0x1f0 [vfio_pci]
> vfio_pci_mmap_fault at drivers/vfio/pci/vfio_pci.c:1534
> [17368.322066][T3614103] __do_fault+0x64/0x220
> [17368.322086][T3614103] handle_mm_fault+0x12f0/0x19e0
> [17368.322107][T3614103] __do_page_fault+0x284/0xf70
> [17368.322116][T3614103] handle_page_fault+0x10/0x2c
> [17368.322136][T3614103]
> [17368.322136][T3614103] -> #2 (&mm->mmap_sem){++++}-{3:3}:
> [17368.322160][T3614103] __might_fault+0x84/0xe0
> [17368.322182][T3614103] _copy_to_user+0x3c/0x120
> [17368.322206][T3614103] kvm_vcpu_ioctl+0x1ec/0xac0 [kvm]
> [17368.322239][T3614103] ksys_ioctl+0xd8/0x130
> [17368.322270][T3614103] sys_ioctl+0x28/0x40
> [17368.322301][T3614103] system_call_exception+0x108/0x1d0
> [17368.322334][T3614103] system_call_common+0xf0/0x278
> [17368.322375][T3614103]
> [17368.322375][T3614103] -> #1 (&vcpu->mutex){+.+.}-{3:3}:
> [17368.322411][T3614103] __mutex_lock+0xdc/0xb80
> [17368.322446][T3614103] kvmppc_xive_release+0xd8/0x260 [kvm]
> [17368.322484][T3614103] kvm_device_release+0xc4/0x110 [kvm]
> [17368.322518][T3614103] __fput+0x154/0x3b0
> [17368.322562][T3614103] task_work_run+0xd8/0x170
> [17368.322583][T3614103] do_exit+0x4f8/0xeb0
> [17368.322604][T3614103] do_group_exit+0x78/0x160
> [17368.322625][T3614103] get_signal+0x230/0x1440
> [17368.322657][T3614103] do_notify_resume+0x130/0x3e0
> [17368.322677][T3614103] syscall_exit_prepare+0x1a4/0x280
> [17368.322687][T3614103] system_call_common+0xf8/0x278
> [17368.322718][T3614103]
> [17368.322718][T3614103] -> #0 (&kvm->lock){+.+.}-{3:3}:
> [17368.322753][T3614103] __lock_acquire+0x1fe4/0x3190
> [17368.322774][T3614103] lock_acquire+0x140/0x9a0
> [17368.322805][T3614103] __mutex_lock+0xdc/0xb80
> [17368.322838][T3614103] kvmppc_irq_bypass_add_producer_hv+0xd4/0x3b0 [kvm_hv]
> [17368.322888][T3614103] kvm_arch_irq_bypass_add_producer+0x40/0x70 [kvm]
> [17368.322925][T3614103] __connect+0x118/0x150
> [17368.322956][T3614103] irq_bypass_register_producer+0x198/0x1d0
> [17368.322989][T3614103] vfio_msi_set_vector_signal+0x1d8/0x350 [vfio_pci]
> [17368.323024][T3614103] vfio_msi_set_block+0xb0/0x1e0 [vfio_pci]
> [17368.323057][T3614103] vfio_pci_set_irqs_ioctl+0x134/0x2c0 [vfio_pci]
> [17368.323091][T3614103] vfio_pci_ioctl+0xe10/0x1460 [vfio_pci]
> [17368.323124][T3614103] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
> [17368.323156][T3614103] ksys_ioctl+0xd8/0x130
> [17368.323176][T3614103] sys_ioctl+0x28/0x40
> [17368.323208][T3614103] system_call_exception+0x108/0x1d0
> [17368.323229][T3614103] system_call_common+0xf0/0x278
> [17368.323259][T3614103]
> [17368.323259][T3614103] other info that might help us debug this:
> [17368.323259][T3614103]
> [17368.323295][T3614103] Chain exists of:
> [17368.323295][T3614103] &kvm->lock --> &vdev->memory_lock --> lock#7
> [17368.323295][T3614103]
> [17368.323335][T3614103] Possible unsafe locking scenario:
> [17368.323335][T3614103]
> [17368.323368][T3614103] CPU0 CPU1
> [17368.323410][T3614103] ---- ----
> [17368.323429][T3614103] lock(lock#7);
> [17368.323447][T3614103] lock(&vdev->memory_lock);
> [17368.323503][T3614103] lock(lock#7);
> [17368.323535][T3614103] lock(&kvm->lock);
> [17368.323554][T3614103]
> [17368.323554][T3614103] *** DEADLOCK ***
> [17368.323554][T3614103]
> [17368.323590][T3614103] 3 locks held by qemu-kvm/3614103:
> [17368.323609][T3614103] #0: c000001f8daa2900 (&vdev->igate){+.+.}-{3:3}, at: vfio_pci_ioctl+0xdf0/0x1460 [vfio_pci]
> [17368.323626][T3614103] #1: c000001f8daa2b88 (&vdev->memory_lock){++++}-{3:3}, at: vfio_pci_set_irqs_ioctl+0xe8/0x2c0 [vfio_pci]
> [17368.323646][T3614103] #2: c0000000016f4dc8 (lock#7){+.+.}-{3:3}, at: irq_bypass_register_producer+0x80/0x1d0


Thanks for the report. I think this means that we're spanning too much
code by holding memory_lock across the entire MSI-X related ioctl, we
need to only wrap the callouts that touch the vector table space, which
should avoid us ever calling through kvm code with that lock held.
Thanks,

Alex

> [17368.323779][T3614103]
> [17368.323779][T3614103] stack backtrace:
> [17368.323835][T3614103] CPU: 84 PID: 3614103 Comm: qemu-kvm Tainted: G W 5.7.0-rc6-next-20200521+ #116
> [17368.323963][T3614103] Call Trace:
> [17368.324012][T3614103] [c000200d4260f380] [c00000000079cb38] dump_stack+0xfc/0x174 (unreliable)
> [17368.324147][T3614103] [c000200d4260f3d0] [c0000000001ed958] print_circular_bug+0x2d8/0x350
> [17368.324231][T3614103] [c000200d4260f470] [c0000000001edc40] check_noncircular+0x270/0x330
> [17368.324328][T3614103] [c000200d4260f570] [c0000000001f4ce4] __lock_acquire+0x1fe4/0x3190
> [17368.324427][T3614103] [c000200d4260f710] [c0000000001f0ec0] lock_acquire+0x140/0x9a0
> [17368.324513][T3614103] [c000200d4260f840] [c000000000ad643c] __mutex_lock+0xdc/0xb80
> [17368.324620][T3614103] [c000200d4260f960] [c0080000101e952c] kvmppc_irq_bypass_add_producer_hv+0xd4/0x3b0 [kvm_hv]
> [17368.324747][T3614103] [c000200d4260fa10] [c0080000103ec698] kvm_arch_irq_bypass_add_producer+0x40/0x70 [kvm]
> [17368.324845][T3614103] [c000200d4260fa30] [c000000000ad0558] __connect+0x118/0x150
> [17368.324943][T3614103] [c000200d4260fa70] [c000000000ad0838] irq_bypass_register_producer+0x198/0x1d0
> [17368.325048][T3614103] [c000200d4260fab0] [c00800000ffc5460] vfio_msi_set_vector_signal+0x1d8/0x350 [vfio_pci]
> [17368.325178][T3614103] [c000200d4260fb70] [c00800000ffc5688] vfio_msi_set_block+0xb0/0x1e0 [vfio_pci]
> [17368.325273][T3614103] [c000200d4260fbe0] [c00800000ffc6f2c] vfio_pci_set_irqs_ioctl+0x134/0x2c0 [vfio_pci]
> [17368.325389][T3614103] [c000200d4260fc40] [c00800000ffc4be8] vfio_pci_ioctl+0xe10/0x1460 [vfio_pci]
> [17368.325487][T3614103] [c000200d4260fd30] [c00800000fc805ec] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
> [17368.325598][T3614103] [c000200d4260fd50] [c0000000005a30b8] ksys_ioctl+0xd8/0x130
> [17368.325681][T3614103] [c000200d4260fda0] [c0000000005a3138] sys_ioctl+0x28/0x40
> [17368.325787][T3614103] [c000200d4260fdc0] [c000000000039e78] system_call_exception+0x108/0x1d0
> [17368.325872][T3614103] [c000200d4260fe20] [c00000000000c9f0] system_call_common+0xf0/0x278
>
> > +
> > + if (!__vfio_pci_memory_enabled(vdev)) {
> > + ret = VM_FAULT_SIGBUS;
> > + mutex_unlock(&vdev->vma_lock);
> > + goto up_out;
> > + }
> > +
> > + if (__vfio_pci_add_vma(vdev, vma)) {
> > + ret = VM_FAULT_OOM;
> > + mutex_unlock(&vdev->vma_lock);
> > + goto up_out;
> > + }
> > +
> > + mutex_unlock(&vdev->vma_lock);
> >
> > if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > vma->vm_end - vma->vm_start, vma->vm_page_prot))
> > - return VM_FAULT_SIGBUS;
> > + ret = VM_FAULT_SIGBUS;
> >
> > - return VM_FAULT_NOPAGE;
> > +up_out:
> > + up_read(&vdev->memory_lock);
> > + return ret;
> > }
>