2022-05-28 20:15:16

by Lu Baolu

Subject: [PATCH 00/12] iommu/vt-d: Optimize the use of locks

Hi folks,

This series tries to optimize the uses of two locks in the Intel IOMMU
driver:

- The intel_iommu::lock is used to protect the IOMMU resources shared by
devices. They include the IOMMU root and context tables, the pasid
tables and the domain IDs.
- The global device_domain_lock is used to protect the global and the
per-domain device tracking lists.

The optimizations include:

- Remove the unnecessary global device tracking list;
- Remove unnecessary locking;
- Reduce the scope of each lock so that it is held only where necessary;
- Transform the global lock into a per-domain lock to improve efficiency;
- Convert the spinlock into a mutex so that functions which cannot run in
  atomic context can be called while the lock is held.

This series is also available on github:
https://github.com/LuBaolu/intel-iommu/commits/intel-iommu-lock-optimization-v1

Your comments and suggestions are much appreciated.

Best regards,
baolu

Lu Baolu (12):
iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
iommu/vt-d: Remove for_each_device_domain()
iommu/vt-d: Remove clearing translation data in disable_dmar_iommu()
iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()
iommu/vt-d: Unnecessary spinlock for root table alloc and free
iommu/vt-d: Acquiring lock in domain ID allocation helpers
iommu/vt-d: Acquiring lock in pasid manipulation helpers
iommu/vt-d: Replace spin_lock_irqsave() with spin_lock()
iommu/vt-d: Check device list of domain in domain free path
iommu/vt-d: Fold __dmar_remove_one_dev_info() into its caller
iommu/vt-d: Use device_domain_lock accurately
iommu/vt-d: Convert device_domain_lock into per-domain mutex

drivers/iommu/intel/iommu.h | 5 +-
drivers/iommu/intel/debugfs.c | 26 ++--
drivers/iommu/intel/iommu.c | 271 +++++++++-------------------------
drivers/iommu/intel/pasid.c | 143 +++++++++---------
drivers/iommu/intel/svm.c | 5 +-
5 files changed, 147 insertions(+), 303 deletions(-)

--
2.25.1



2022-05-28 20:17:08

by Lu Baolu

Subject: [PATCH 04/12] iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()

Use pci_get_domain_bus_and_slot() instead of searching the global list
to retrieve the pci device pointer. This allows the global
device_domain_list to be removed as there are no consumers anymore.

Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/intel/iommu.h | 1 -
drivers/iommu/intel/iommu.c | 33 ++++++---------------------------
2 files changed, 6 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 2f4a5b9509c0..6724703d573b 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -609,7 +609,6 @@ struct intel_iommu {
/* PCI domain-device relationship */
struct device_domain_info {
struct list_head link; /* link to domain siblings */
- struct list_head global; /* link to global list */
struct list_head table; /* link to pasid table */
u32 segment; /* PCI segment number */
u8 bus; /* PCI bus number */
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 25d4c5200526..bbdd3417a1b1 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -131,8 +131,6 @@ static struct intel_iommu **g_iommus;

static void __init check_tylersburg_isoch(void);
static int rwbf_quirk;
-static inline struct device_domain_info *
-dmar_search_domain_by_dev_info(int segment, int bus, int devfn);

/*
* set to 1 to panic kernel if can't successfully enable VT-d
@@ -315,7 +313,6 @@ static int iommu_skip_te_disable;
#define IDENTMAP_AZALIA 4

static DEFINE_SPINLOCK(device_domain_lock);
-static LIST_HEAD(device_domain_list);
const struct iommu_ops intel_iommu_ops;

static bool translation_pre_enabled(struct intel_iommu *iommu)
@@ -845,9 +842,14 @@ static void pgtable_walk(struct intel_iommu *iommu, unsigned long pfn, u8 bus, u
struct device_domain_info *info;
struct dma_pte *parent, *pte;
struct dmar_domain *domain;
+ struct pci_dev *pdev;
int offset, level;

- info = dmar_search_domain_by_dev_info(iommu->segment, bus, devfn);
+ pdev = pci_get_domain_bus_and_slot(iommu->segment, bus, devfn);
+ if (!pdev)
+ return;
+
+ info = dev_iommu_priv_get(&pdev->dev);
if (!info || !info->domain) {
pr_info("device [%02x:%02x.%d] not probed\n",
bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -2348,19 +2350,6 @@ static void domain_remove_dev_info(struct dmar_domain *domain)
spin_unlock_irqrestore(&device_domain_lock, flags);
}

-static inline struct device_domain_info *
-dmar_search_domain_by_dev_info(int segment, int bus, int devfn)
-{
- struct device_domain_info *info;
-
- list_for_each_entry(info, &device_domain_list, global)
- if (info->segment == segment && info->bus == bus &&
- info->devfn == devfn)
- return info;
-
- return NULL;
-}
-
static int domain_setup_first_level(struct intel_iommu *iommu,
struct dmar_domain *domain,
struct device *dev,
@@ -4564,7 +4553,6 @@ static struct iommu_device *intel_iommu_probe_device(struct device *dev)
struct pci_dev *pdev = dev_is_pci(dev) ? to_pci_dev(dev) : NULL;
struct device_domain_info *info;
struct intel_iommu *iommu;
- unsigned long flags;
u8 bus, devfn;

iommu = device_to_iommu(dev, &bus, &devfn);
@@ -4607,10 +4595,7 @@ static struct iommu_device *intel_iommu_probe_device(struct device *dev)
}
}

- spin_lock_irqsave(&device_domain_lock, flags);
- list_add(&info->global, &device_domain_list);
dev_iommu_priv_set(dev, info);
- spin_unlock_irqrestore(&device_domain_lock, flags);

return &iommu->iommu;
}
@@ -4618,15 +4603,9 @@ static struct iommu_device *intel_iommu_probe_device(struct device *dev)
static void intel_iommu_release_device(struct device *dev)
{
struct device_domain_info *info = dev_iommu_priv_get(dev);
- unsigned long flags;

dmar_remove_one_dev_info(dev);
-
- spin_lock_irqsave(&device_domain_lock, flags);
dev_iommu_priv_set(dev, NULL);
- list_del(&info->global);
- spin_unlock_irqrestore(&device_domain_lock, flags);
-
kfree(info);
set_dma_ops(dev, NULL);
}
--
2.25.1


2022-05-28 20:28:13

by Lu Baolu

Subject: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

Retrieve the attached domain for a device through the generic interface
exposed by the iommu core. This also makes device_domain_lock static.

Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/intel/iommu.h | 1 -
drivers/iommu/intel/debugfs.c | 20 ++++++++------------
drivers/iommu/intel/iommu.c | 2 +-
3 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index a22adfbdf870..8a6d64d726c0 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -480,7 +480,6 @@ enum {
#define VTD_FLAG_SVM_CAPABLE (1 << 2)

extern int intel_iommu_sm;
-extern spinlock_t device_domain_lock;

#define sm_supported(iommu) (intel_iommu_sm && ecap_smts((iommu)->ecap))
#define pasid_supported(iommu) (sm_supported(iommu) && \
diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
index d927ef10641b..eea8727aa7bc 100644
--- a/drivers/iommu/intel/debugfs.c
+++ b/drivers/iommu/intel/debugfs.c
@@ -344,19 +344,21 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde,

static int show_device_domain_translation(struct device *dev, void *data)
{
- struct device_domain_info *info = dev_iommu_priv_get(dev);
- struct dmar_domain *domain = info->domain;
+ struct dmar_domain *dmar_domain;
+ struct iommu_domain *domain;
struct seq_file *m = data;
u64 path[6] = { 0 };

+ domain = iommu_get_domain_for_dev(dev);
if (!domain)
return 0;

+ dmar_domain = to_dmar_domain(domain);
seq_printf(m, "Device %s @0x%llx\n", dev_name(dev),
- (u64)virt_to_phys(domain->pgd));
+ (u64)virt_to_phys(dmar_domain->pgd));
seq_puts(m, "IOVA_PFN\t\tPML5E\t\t\tPML4E\t\t\tPDPE\t\t\tPDE\t\t\tPTE\n");

- pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path);
+ pgtable_walk_level(m, dmar_domain->pgd, dmar_domain->agaw + 2, 0, path);
seq_putc(m, '\n');

return 0;
@@ -364,15 +366,9 @@ static int show_device_domain_translation(struct device *dev, void *data)

static int domain_translation_struct_show(struct seq_file *m, void *unused)
{
- unsigned long flags;
- int ret;

- spin_lock_irqsave(&device_domain_lock, flags);
- ret = bus_for_each_dev(&pci_bus_type, NULL, m,
- show_device_domain_translation);
- spin_unlock_irqrestore(&device_domain_lock, flags);
-
- return ret;
+ return bus_for_each_dev(&pci_bus_type, NULL, m,
+ show_device_domain_translation);
}
DEFINE_SHOW_ATTRIBUTE(domain_translation_struct);

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 1af4b6562266..cacae8bdaa65 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -314,7 +314,7 @@ static int iommu_skip_te_disable;
#define IDENTMAP_GFX 2
#define IDENTMAP_AZALIA 4

-DEFINE_SPINLOCK(device_domain_lock);
+static DEFINE_SPINLOCK(device_domain_lock);
static LIST_HEAD(device_domain_list);

/*
--
2.25.1


2022-05-28 20:33:30

by Lu Baolu

Subject: [PATCH 12/12] iommu/vt-d: Convert device_domain_lock into per-domain mutex

Using a global device_domain_lock spinlock to protect the per-domain device
tracking lists is inefficient, especially considering that this lock is also
needed in the hot paths.

On the other hand, in the iommu_unmap() path, the driver needs to iterate
over the device tracking list and flush the caches on the devices through
qi_submit_sync(), where unfortunately cpu_relax() is used. In order to
avoid holding a spinlock while cpu_relax() is called, this also converts
the spinlock into a mutex. This works because the device tracking lists are
not touched in any interrupt context.

Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/intel/iommu.h | 1 +
drivers/iommu/intel/iommu.c | 45 +++++++++++++++----------------------
2 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 6724703d573b..9e572ddffc08 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -541,6 +541,7 @@ struct dmar_domain {
u8 force_snooping : 1; /* Create IOPTEs with snoop control */
u8 set_pte_snp:1;

+ struct mutex mutex; /* Protect device tracking lists */
struct list_head devices; /* all devices' list */

struct dma_pte *pgd; /* virtual address */
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index f8aa8649dc6f..1815a9d73426 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -310,7 +310,6 @@ static int iommu_skip_te_disable;
#define IDENTMAP_GFX 2
#define IDENTMAP_AZALIA 4

-static DEFINE_SPINLOCK(device_domain_lock);
const struct iommu_ops intel_iommu_ops;

static bool translation_pre_enabled(struct intel_iommu *iommu)
@@ -534,9 +533,8 @@ static int domain_update_device_node(struct dmar_domain *domain)
{
struct device_domain_info *info;
int nid = NUMA_NO_NODE;
- unsigned long flags;

- spin_lock_irqsave(&device_domain_lock, flags);
+ mutex_lock(&domain->mutex);
list_for_each_entry(info, &domain->devices, link) {
/*
* There could possibly be multiple device numa nodes as devices
@@ -548,7 +546,7 @@ static int domain_update_device_node(struct dmar_domain *domain)
if (nid != NUMA_NO_NODE)
break;
}
- spin_unlock_irqrestore(&device_domain_lock, flags);
+ mutex_unlock(&domain->mutex);

return nid;
}
@@ -1375,12 +1373,11 @@ iommu_support_dev_iotlb(struct dmar_domain *domain, struct intel_iommu *iommu,
u8 bus, u8 devfn)
{
struct device_domain_info *info = NULL, *tmp;
- unsigned long flags;

if (!iommu->qi)
return NULL;

- spin_lock_irqsave(&device_domain_lock, flags);
+ mutex_lock(&domain->mutex);
list_for_each_entry(tmp, &domain->devices, link) {
if (tmp->iommu == iommu && tmp->bus == bus &&
tmp->devfn == devfn) {
@@ -1389,7 +1386,7 @@ iommu_support_dev_iotlb(struct dmar_domain *domain, struct intel_iommu *iommu,
break;
}
}
- spin_unlock_irqrestore(&device_domain_lock, flags);
+ mutex_unlock(&domain->mutex);

return info;
}
@@ -1398,9 +1395,8 @@ static void domain_update_iotlb(struct dmar_domain *domain)
{
struct device_domain_info *info;
bool has_iotlb_device = false;
- unsigned long flags;

- spin_lock_irqsave(&device_domain_lock, flags);
+ mutex_lock(&domain->mutex);
list_for_each_entry(info, &domain->devices, link) {
if (info->ats_enabled) {
has_iotlb_device = true;
@@ -1408,7 +1404,7 @@ static void domain_update_iotlb(struct dmar_domain *domain)
}
}
domain->has_iotlb_device = has_iotlb_device;
- spin_unlock_irqrestore(&device_domain_lock, flags);
+ mutex_unlock(&domain->mutex);
}

static void iommu_enable_dev_iotlb(struct device_domain_info *info)
@@ -1499,17 +1495,15 @@ static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
u64 addr, unsigned mask)
{
- unsigned long flags;
struct device_domain_info *info;

if (!domain->has_iotlb_device)
return;

- spin_lock_irqsave(&device_domain_lock, flags);
+ mutex_lock(&domain->mutex);
list_for_each_entry(info, &domain->devices, link)
__iommu_flush_dev_iotlb(info, addr, mask);
-
- spin_unlock_irqrestore(&device_domain_lock, flags);
+ mutex_unlock(&domain->mutex);
}

static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
@@ -1761,6 +1755,7 @@ static struct dmar_domain *alloc_domain(unsigned int type)
domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
domain->has_iotlb_device = false;
INIT_LIST_HEAD(&domain->devices);
+ mutex_init(&domain->mutex);

return domain;
}
@@ -2434,7 +2429,6 @@ static int domain_add_dev_info(struct dmar_domain *domain, struct device *dev)
{
struct device_domain_info *info = dev_iommu_priv_get(dev);
struct intel_iommu *iommu;
- unsigned long flags;
u8 bus, devfn;
int ret;

@@ -2446,9 +2440,9 @@ static int domain_add_dev_info(struct dmar_domain *domain, struct device *dev)
if (ret)
return ret;

- spin_lock_irqsave(&device_domain_lock, flags);
+ mutex_lock(&domain->mutex);
list_add(&info->link, &domain->devices);
- spin_unlock_irqrestore(&device_domain_lock, flags);
+ mutex_unlock(&domain->mutex);
info->domain = domain;

/* PASID table is mandatory for a PCI device in scalable mode. */
@@ -4126,7 +4120,6 @@ static void dmar_remove_one_dev_info(struct device *dev)
struct device_domain_info *info = dev_iommu_priv_get(dev);
struct dmar_domain *domain = info->domain;
struct intel_iommu *iommu = info->iommu;
- unsigned long flags;

if (!dev_is_real_dma_subdevice(info->dev)) {
if (dev_is_pci(info->dev) && sm_supported(iommu))
@@ -4138,9 +4131,9 @@ static void dmar_remove_one_dev_info(struct device *dev)
intel_pasid_free_table(info->dev);
}

- spin_lock_irqsave(&device_domain_lock, flags);
+ mutex_lock(&domain->mutex);
list_del(&info->link);
- spin_unlock_irqrestore(&device_domain_lock, flags);
+ mutex_unlock(&domain->mutex);

domain_detach_iommu(domain, iommu);
}
@@ -4424,7 +4417,7 @@ static bool domain_support_force_snooping(struct dmar_domain *domain)
struct device_domain_info *info;
bool support = true;

- assert_spin_locked(&device_domain_lock);
+ lockdep_assert_held(&domain->mutex);
list_for_each_entry(info, &domain->devices, link) {
if (!ecap_sc_support(info->iommu->ecap)) {
support = false;
@@ -4439,8 +4432,7 @@ static void domain_set_force_snooping(struct dmar_domain *domain)
{
struct device_domain_info *info;

- assert_spin_locked(&device_domain_lock);
-
+ lockdep_assert_held(&domain->mutex);
/*
* Second level page table supports per-PTE snoop control. The
* iommu_map() interface will handle this by setting SNP bit.
@@ -4458,20 +4450,19 @@ static void domain_set_force_snooping(struct dmar_domain *domain)
static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
{
struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- unsigned long flags;

if (dmar_domain->force_snooping)
return true;

- spin_lock_irqsave(&device_domain_lock, flags);
+ mutex_lock(&dmar_domain->mutex);
if (!domain_support_force_snooping(dmar_domain)) {
- spin_unlock_irqrestore(&device_domain_lock, flags);
+ mutex_unlock(&dmar_domain->mutex);
return false;
}

domain_set_force_snooping(dmar_domain);
dmar_domain->force_snooping = true;
- spin_unlock_irqrestore(&device_domain_lock, flags);
+ mutex_unlock(&dmar_domain->mutex);

return true;
}
--
2.25.1


2022-06-01 01:12:52

by Jason Gunthorpe

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:

> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
> iommu_unmap() and dumping tables in debugfs exclusive. This does not
> work because debugfs may depend on the DMA of the devices to work. It
> seems that what we can do is to allow this race, but when we traverse
> the page table in debugfs, we will check the validity of the physical
> address retrieved from the page table entry. Then, the worst case is to
> print some useless information.

Sounds horrible, don't you have locking around the IOPTEs of some
kind? How does updating them work reliably?

It is just debugfs, so maybe it is not the end of the world, but
still..

Jason

2022-06-01 05:45:58

by Robin Murphy

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On 2022-05-31 16:59, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 02:52:28PM +0100, Robin Murphy wrote:
>
>>> +                break;
>>> +            pgtable_walk_level(m, phys_to_virt(phys_addr),
>>
>> Also, obligatory reminder that pfn_valid() only means that pfn_to_page()
>> gets you a valid struct page. Whether that page is direct-mapped kernel
>> memory or not is a different matter.
>
> Even though this is debugfs, if the operation is sketchy like that and
> can theortically crash the kernel the driver should test capabilities,
> CAP_SYS_RAWIO or something may be appropriate. I don't think we have a
> better cap for 'userspace may crash the kernel'

It shouldn't be insurmountable to make this safe, it just needs a bit
more than pfn_valid(), which can still return true off the ends of the
memory map if they're not perfectly section-aligned, and for random
reserved holes in the middle.

Robin.

2022-06-01 07:09:29

by Jason Gunthorpe

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On Tue, May 31, 2022 at 05:01:46PM +0100, Robin Murphy wrote:

> The DMA API doesn't need locking, partly since it can trust itself not to do
> stupid things, and mostly because it's DMA API performance that's
> fundamentally incompatible with serialisation anyway. Why do you think we
> have a complicated per-CPU IOVA caching mechanism, if not to support big
> multi-queue devices with multiple CPU threads mapping/unmapping in different
> parts of the same DMA domain concurrently?

Well, per-CPU is a form of locking.

So what are the actual locking rules here? We can call map/unmap
concurrently but not if ... ?

IOVA overlaps?

And we expect the iommu driver to be unable to free page table levels
that have IOVA boundaries in them?

> The simpler drivers already serialise on a per-domain lock internally, while
> the more performance-focused ones implement lock-free atomic pagetable
> management in a similar style to CPU arch code; either way it should work
> fine as-is.

The mm has page table locks at every level and generally expects them
to be held for a lot of manipulations. There are some lockless cases,
but it is not as aggressive as this sounds.

> The difference with debugfs is that it's a completely orthogonal
> side-channel - an iommu_domain user like VFIO or iommu-dma can make sure its
> *own* API usage is sane, but can't be aware of the user triggering some
> driver-internal introspection of that domain in a manner that could race
> more harmfully.

The mm solution to this problem is to RCU free the page table
levels. This way something like debugfs can read a page table under
RCU completely safely, though incoherently, and there is no
performance cost on the map/unmap fast path side.

Today struct page has a rcu_head that can be used to rcu free it, so
it costs nothing.
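
For illustration, a minimal sketch of that scheme (free_pgtable_page_rcu()
and dump_domain_tables() are made-up names, not existing driver functions;
the rcu_head used is the one already in struct page):

static void pgtable_free_rcu_cb(struct rcu_head *head)
{
        __free_page(container_of(head, struct page, rcu_head));
}

static void free_pgtable_page_rcu(struct page *pg)
{
        /* Defer the actual free until current RCU readers are done. */
        call_rcu(&pg->rcu_head, pgtable_free_rcu_cb);
}

/* debugfs side: possibly stale, but never use-after-free */
static void dump_domain_tables(struct seq_file *m, struct dmar_domain *domain)
{
        u64 path[6] = { 0 };

        rcu_read_lock();
        pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path);
        rcu_read_unlock();
}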

Jason

2022-06-01 11:58:22

by Jason Gunthorpe

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On Tue, May 31, 2022 at 10:22:32PM +0100, Robin Murphy wrote:

> There are only 3 instances where we'll free a table while the domain is
> live. The first is the one legitimate race condition, where two map requests
> targeting relatively nearby PTEs both go to fill in an intermediate level of
> table; whoever loses that race frees the table they allocated, but it was
> never visible to anyone else so that's definitely fine. The second is if
> we're mapping a block entry, and find that there's already a table entry
> there, wherein we assume the table must be empty, clear the entry,
> invalidate any walk caches, install the block entry, then free the orphaned
> table; since we're mapping the entire IOVA range covered by that table,
> there should be no other operations on that IOVA range attempting to walk
> the table at the same time, so it's fine.

I saw these two in the Intel driver

> The third is effectively the inverse, if we get a block-sized unmap
> but find a table entry rather than a block at that point (on the
> assumption that it's de-facto allowed for a single unmap to cover
> multiple adjacent mappings as long as it does so exactly); similarly
> we assume that the table must be full, and no other operations
> should be racing because we're unmapping its whole IOVA range, so we
> remove the table entry, invalidate, and free as before.

Not sure I noticed this one though

This all makes sense though.

> Although we don't have debug dumping for io-pgtable-arm, it's good to be
> thinking about this, since it's made me realise that dirty-tracking sweeps
> per that proposal might pose a similar kind of concern, so we might still
> need to harden these corners for the sake of that.

Let's make sure Joao sees this..

It is interesting because we probably don't want the big latency
spikes that would come from using locking to block map/unmap while
dirty reading is happening - eg at the iommufd level.

From a consistency POV the only case that matters is unmap and unmap
should already be doing a dedicated dirty read directly prior to the
unmap (as per that other long thread)

So having safe racy reading in the kernel is probably best, and so RCU
would be good here too.

> that somewhere I have some half-finished patches making io-pgtable-arm use
> the iommu_iotlb_gather freelist, so maybe I'll tackle both concerns at once
> (perhaps we might even be able to RCU-ify the freelist generically? I'll see
> how it goes when I get there).

This is all very similar to how the mm's tlb gather stuff works,
especially on PPC which requires RCU protection of page tables instead
of the IPI trick.

Currently the rcu_head and the lru overlap in the struct page. To do
this I'd suggest following what was done for slab - see ca1a46d6f506
("Merge tag 'slab-for-5.17'..) and create a 'struct page' sub-class
like 'struct slab' for use with iommu page tables and put the
list_head and rcu_head sequentially in the struct iommu_pt_page.

Continue to use put_pages_list() just with the list_head in the new
struct not the lru.

If we need it for dirty tracking then it makes sense to start
progressing parts at least..
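
Roughly like this, purely as an illustration (field layout is hand-waved;
the real struct would need padding and compile-time asserts so it overlays
struct page exactly, as struct slab does):

/* A struct-page sub-class for IOMMU page tables: list_head and rcu_head
 * laid out sequentially instead of overlapping like lru/rcu_head today. */
struct iommu_pt_page {
        unsigned long flags;            /* overlays page->flags */
        struct list_head freelist;      /* linkage for put_pages_list() */
        struct rcu_head rcu_head;       /* RCU-deferred free of a table */
};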

Jason

2022-06-01 12:21:37

by Jason Gunthorpe

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On Tue, May 31, 2022 at 07:07:32PM +0100, Robin Murphy wrote:

> > And we expect the iommu driver to be unable to free page table levels
> > that have IOVA boundaries in them?
>
> I'm not entirely sure what you mean there, but in general an unmap request
> is expected to match some previous map request

atomic cmpxchg is OK for inserting new page table levels but it can't
protect you against concurrent freeing of page table levels. So
without locks it means that page tables can't usually be freed. Which
seems to match what the Intel driver does - at least from a cursory
look.

This is one of the reasons the mm has the mmap/etc lock and spinlocks
because we do expect page table levels to get wiped out when VMA's are
zap'd - all the different locks provide the protection against page
tables disappearing out from under something manipulating them.

Basically every "lockless" walk in (process) MM land is actually
protected by some kind of lock that blocks zap_page_range() from
removing the page table levels themselves.

> They might either unmap the entire region originally mapped, or just
> the requested part, or might fail entirely (IIRC there was some
> nasty code in VFIO for detecting a particular behaviour).

This is something I did differently in iommufd. It always generates
unmaps that are strict supersets of the maps it issued. So it would be
a kernel bug if the driver unmaps more or less than requested.

> Oh, I've spent the last couple of weeks hacking up horrible things
> manipulating entries in init_mm, and never realised that that was actually
> the special case. Oh well, live and learn.

The init_mm is sort of different, it doesn't have zap in quite the
same way, for example. I was talking about the typical process mm.

Anyhow, the right solution is to use RCU as I described before, Baolu
do you want to try?

Jason

2022-06-01 18:34:50

by Tian, Kevin

Subject: RE: [PATCH 04/12] iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()

> From: Lu Baolu <[email protected]>
> Sent: Friday, May 27, 2022 2:30 PM
>
> Use pci_get_domain_bus_and_slot() instead of searching the global list
> to retrieve the pci device pointer. This allows the global
> device_domain_list to be removed as there are no consumers anymore.
>
> Signed-off-by: Lu Baolu <[email protected]>

Reviewed-by: Kevin Tian <[email protected]>


2022-06-01 18:36:34

by Lu Baolu

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On 2022/5/31 21:10, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
>
>> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
>> iommu_unmap() and dumping tables in debugfs exclusive. This does not
>> work because debugfs may depend on the DMA of the devices to work. It
>> seems that what we can do is to allow this race, but when we traverse
>> the page table in debugfs, we will check the validity of the physical
>> address retrieved from the page table entry. Then, the worst case is to
>> print some useless information.
>
> Sounds horrible, don't you have locking around the IOPTEs of some
> kind? How does updating them work reliably?

There's no locking around updating the IOPTEs. The basic assumption is
that at any time, there's only a single thread manipulating the mappings
of the range specified in iommu_map/unmap() APIs. Therefore, the race
only exists when multiple ranges share some high-level IOPTEs. The IOMMU
driver updates those IOPTEs using the compare-and-exchange atomic
operation.
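
For illustration, this is roughly the shape of it (a simplified sketch;
install_table_level() is a made-up name, the real logic lives in
pfn_to_dma_pte()):

/* Install a missing intermediate table with cmpxchg64(); whoever loses
 * the race frees the page it allocated and uses the winner's entry. */
static struct dma_pte *install_table_level(struct dmar_domain *domain,
                                           struct dma_pte *pte)
{
        void *new_table = alloc_pgtable_page(domain->nid);
        u64 pteval;

        if (!new_table)
                return NULL;

        pteval = virt_to_phys(new_table) | DMA_PTE_READ | DMA_PTE_WRITE;
        if (cmpxchg64(&pte->val, 0ULL, pteval))
                /* Someone else populated this entry first; use theirs. */
                free_pgtable_page(new_table);

        return phys_to_virt(dma_pte_addr(pte));
}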

>
> It is just debugfs, so maybe it is not the end of the world, but
> still..

Fair enough. I think this is somewhat similar to the fact that the IOMMU
hardware can traverse the page table at any time without considering when
the CPUs update it. The IOMMU hardware will generate faults when it
encounters a failure while walking the page table. Similarly, perhaps
debugfs could dump all-ones for an invalid IOPTE?
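
Something like the below, just as a sketch of the idea (pgtable_walk_next()
is a made-up helper; and as Robin notes elsewhere in the thread, pfn_valid()
alone is not a sufficient check, so treat this as the shape only):

/* Print a placeholder instead of walking through a physical address
 * that does not look like direct-mapped memory. */
static void pgtable_walk_next(struct seq_file *m, struct dma_pte *pte,
                              int level, u64 *path)
{
        phys_addr_t phys = dma_pte_addr(pte);

        if (!phys || !pfn_valid(phys >> PAGE_SHIFT)) {
                seq_puts(m, "0xffffffffffffffff (suspect IOPTE)\n");
                return;
        }

        pgtable_walk_level(m, phys_to_virt(phys), level - 1, 0, path);
}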

Best regards,
baolu

2022-06-01 18:42:22

by Lu Baolu

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On 2022/6/1 02:51, Jason Gunthorpe wrote:
>> Oh, I've spent the last couple of weeks hacking up horrible things
>> manipulating entries in init_mm, and never realised that that was actually
>> the special case. Oh well, live and learn.
> The init_mm is sort of different, it doesn't have zap in quite the
> same way, for example. I was talking about the typical process mm.
>
> Anyhow, the right solution is to use RCU as I described before, Baolu
> do you want to try?

Yes, of course.

Your discussion with Robin gave me a lot of inspiration. Much
appreciated! I want to solve this debugfs problem in a separate patch,
because it goes beyond the original scope of this series.

Best regards,
baolu

2022-06-01 19:11:23

by Robin Murphy

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On 2022-05-31 17:21, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 05:01:46PM +0100, Robin Murphy wrote:
>
>> The DMA API doesn't need locking, partly since it can trust itself not to do
>> stupid things, and mostly because it's DMA API performance that's
>> fundamentally incompatible with serialisation anyway. Why do you think we
>> have a complicated per-CPU IOVA caching mechanism, if not to support big
>> multi-queue devices with multiple CPU threads mapping/unmapping in different
>> parts of the same DMA domain concurrently?
>
> Well, per-CPU is a form of locking.

Right, but that only applies for alloc_iova_fast() itself - once the
CPUs have each got their distinct IOVA region, they can then pile into
iommu_map() completely unhindered (and the inverse for the unmap path).

> So what are the actual locking rules here? We can call map/unmap
> concurrently but not if ... ?
>
> IOVA overlaps?

Well, I think the de-facto rule is that you technically *can* make
overlapping requests, but one or both may fail, and the final outcome in
terms of what ends up mapped in the domain is undefined (especially if
they both succeed). The only real benefit of enforcing serialisation
would be better failure behaviour in such cases, but it remains
fundamentally nonsensical for callers to make contradictory requests
anyway, whether concurrently or sequentially, so there doesn't seem much
point in spending effort on improving support for nonsense.

> And we expect the iommu driver to be unable to free page table levels
> that have IOVA boundaries in them?

I'm not entirely sure what you mean there, but in general an unmap
request is expected to match some previous map request - there isn't a
defined API-level behaviour for partial unmaps. They might either unmap
the entire region originally mapped, or just the requested part, or
might fail entirely (IIRC there was some nasty code in VFIO for
detecting a particular behaviour). Similarly for unmapping anything
that's already not mapped, some drivers treat that as a no-op, others as
an error. But again, this is even further unrelated to concurrency.

>> The simpler drivers already serialise on a per-domain lock internally, while
>> the more performance-focused ones implement lock-free atomic pagetable
>> management in a similar style to CPU arch code; either way it should work
>> fine as-is.
>
> The mm has page table locks at every level and generally expects them
> to be held for a lot of manipulations. There are some lockless cases,
> but it is not as aggressive as this sounds.

Oh, I've spent the last couple of weeks hacking up horrible things
manipulating entries in init_mm, and never realised that that was
actually the special case. Oh well, live and learn.

Robin.

2022-06-01 19:42:17

by Robin Murphy

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On 2022-05-31 16:13, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 04:01:04PM +0100, Robin Murphy wrote:
>> On 2022-05-31 15:53, Jason Gunthorpe wrote:
>>> On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:
>>>> On 2022/5/31 21:10, Jason Gunthorpe wrote:
>>>>> On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
>>>>>
>>>>>> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
>>>>>> iommu_unmap() and dumping tables in debugfs exclusive. This does not
>>>>>> work because debugfs may depend on the DMA of the devices to work. It
>>>>>> seems that what we can do is to allow this race, but when we traverse
>>>>>> the page table in debugfs, we will check the validity of the physical
>>>>>> address retrieved from the page table entry. Then, the worst case is to
>>>>>> print some useless information.
>>>>>
>>>>> Sounds horrible, don't you have locking around the IOPTEs of some
>>>>> kind? How does updating them work reliably?
>>>>
>>>> There's no locking around updating the IOPTEs. The basic assumption is
>>>> that at any time, there's only a single thread manipulating the mappings
>>>> of the range specified in iommu_map/unmap() APIs. Therefore, the race
>>>> only exists when multiple ranges share some high-level IOPTEs. The IOMMU
>>>> driver updates those IOPTEs using the compare-and-exchange atomic
>>>> operation.
>>>
>>> Oh? Did I miss where that was documented as part of the iommu API?
>>>
>>> Daniel posted patches for VFIO to multi-thread iommu_domain mapping.
>>>
>>> iommufd goes out of its way to avoid this kind of serialization so
>>> that userspace can parallel map IOVA.
>>>
>>> I think if this is the requirement then the iommu API needs to
>>> provide a lock around the domain for the driver..
>>
>> Eww, no, we can't kill performance by forcing serialisation on the entire
>> API just for one silly driver-internal debugfs corner :(
>
> I'm not worried about debugfs, I'm worried about these efforts to
> speed up VFIO VM booting by parallel domain loading:
>
> https://lore.kernel.org/kvm/[email protected]/
>
> The DMA API should maintain its own external lock, but the general
> domain interface to the rest of the kernel should be safe, IMHO.

The DMA API doesn't need locking, partly since it can trust itself not
to do stupid things, and mostly because it's DMA API performance that's
fundamentally incompatible with serialisation anyway. Why do you think
we have a complicated per-CPU IOVA caching mechanism, if not to support
big multi-queue devices with multiple CPU threads mapping/unmapping in
different parts of the same DMA domain concurrently?

> Or at least it should be documented..

As far as I'm aware there's never been any restriction at the IOMMU API
level. It should be self-evident that racing concurrent
iommu_{map,unmap} requests against iommu_domain_free(), or against each
other for overlapping IOVAs, is a bad idea and don't do that if you want
a well-defined outcome. The simpler drivers already serialise on a
per-domain lock internally, while the more performance-focused ones
implement lock-free atomic pagetable management in a similar style to
CPU arch code; either way it should work fine as-is. The difference with
debugfs is that it's a completely orthogonal side-channel - an
iommu_domain user like VFIO or iommu-dma can make sure its *own* API
usage is sane, but can't be aware of the user triggering some
driver-internal introspection of that domain in a manner that could race
more harmfully. It has to be down to individual drivers to make that
safe if they choose to expose such an interface.

Robin.

2022-06-01 20:24:40

by Jason Gunthorpe

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On Wed, Jun 01, 2022 at 02:52:05PM +0100, Joao Martins wrote:
> On 6/1/22 13:33, Jason Gunthorpe wrote:
> > On Wed, Jun 01, 2022 at 01:18:52PM +0100, Joao Martins wrote:
> >
> >>> So having safe racy reading in the kernel is probably best, and so RCU
> >>> would be good here too.
> >>
> >> Reading dirties ought to be similar to map/unmap but slightly simpler as
> >> I supposedly don't need to care about the pte changing under the hood (or
> >> so I initially thought). I was wrestling at some point if test-and-clear
> >> was enough or whether I switch back cmpxchg to detect the pte has changed
> >> and only mark dirty based on the old value[*]. The latter would align with
> >> how map/unmap performs the pte updates.
> >
> > test-and-clear should be fine, but this all needs to be done under a
> > RCU context while the page tables themselves are freed by RCU. Then
> > you can safely chase the page table pointers down to each level
> > without fear of UAF.
> >
>
> I was actually thinking more towards holding the same IOVA range lock to
> align with the rest of map/unmap/demote/etc? All these IO page table
> manipulations have the same performance requirements.

IMHO ideally we would not make read dirty use the IOVA range lock because
we want to read dirty big swaths of IOVA at a time and it shouldn't be
forced to chunk based on the arbitrary area construction.

But yes this could also be an option.

Jason

2022-06-01 20:29:48

by Tian, Kevin

Subject: RE: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, June 1, 2022 7:11 AM
>
> On Tue, May 31, 2022 at 10:22:32PM +0100, Robin Murphy wrote:
>
> > There are only 3 instances where we'll free a table while the domain is
> > live. The first is the one legitimate race condition, where two map requests
> > targeting relatively nearby PTEs both go to fill in an intermediate level of
> > table; whoever loses that race frees the table they allocated, but it was
> > never visible to anyone else so that's definitely fine. The second is if
> > we're mapping a block entry, and find that there's already a table entry
> > there, wherein we assume the table must be empty, clear the entry,
> > invalidate any walk caches, install the block entry, then free the orphaned
> > table; since we're mapping the entire IOVA range covered by that table,
> > there should be no other operations on that IOVA range attempting to walk
> > the table at the same time, so it's fine.
>
> I saw these two in the Intel driver
>
> > The third is effectively the inverse, if we get a block-sized unmap
> > but find a table entry rather than a block at that point (on the
> > assumption that it's de-facto allowed for a single unmap to cover
> > multiple adjacent mappings as long as it does so exactly); similarly
> > we assume that the table must be full, and no other operations
> > should be racing because we're unmapping its whole IOVA range, so we
> > remove the table entry, invalidate, and free as before.
>
> Not sure I noticed this one though
>
> This all makes sense though.

Intel driver also does this. See dma_pte_clear_level():

/* If range covers entire pagetable, free it */
if (start_pfn <= level_pfn &&
last_pfn >= level_pfn + level_size(level) - 1) {
...

2022-06-01 20:32:11

by Joao Martins

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On 6/1/22 13:33, Jason Gunthorpe wrote:
> On Wed, Jun 01, 2022 at 01:18:52PM +0100, Joao Martins wrote:
>
>>> So having safe racy reading in the kernel is probably best, and so RCU
>>> would be good here too.
>>
>> Reading dirties ought to be similar to map/unmap but slightly simpler as
>> I supposedly don't need to care about the pte changing under the hood (or
>> so I initially thought). I was wrestling at some point if test-and-clear
>> was enough or whether I switch back cmpxchg to detect the pte has changed
>> and only mark dirty based on the old value[*]. The latter would align with
>> how map/unmap performs the pte updates.
>
> test-and-clear should be fine, but this all needs to be done under a
> RCU context while the page tables themselves are freed by RCU. Then
> you can safely chase the page table pointers down to each level
> without fear of UAF.
>

I was actually thinking more towards holding the same IOVA range lock to
align with the rest of map/unmap/demote/etc? All these IO page table
manipulations have the same performance requirements.

>> I am not sure yet whether dynamic demote/promote of page sizes changes this.
>
> For this kind of primitive the caller must provide the locking, just
> like map/unmap.
>
Ah OK.

> Effectively you can consider the iommu_domain as having externally
> provided range-locks over the IOVA space. map/unmap/demote/promote
> must run serially over intersecting IOVA ranges.
>
> In terms of iommufd this means we always have to hold a lock related
> to the area (which is the IOVA range) before issuing any iommu call on
> the domain.

/me nods

2022-06-01 20:36:55

by Jason Gunthorpe

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On Tue, May 31, 2022 at 04:01:04PM +0100, Robin Murphy wrote:
> On 2022-05-31 15:53, Jason Gunthorpe wrote:
> > On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:
> > > On 2022/5/31 21:10, Jason Gunthorpe wrote:
> > > > On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
> > > >
> > > > > For case 2, it is a bit weird. I tried to add a rwsem lock to make the
> > > > > iommu_unmap() and dumping tables in debugfs exclusive. This does not
> > > > > work because debugfs may depend on the DMA of the devices to work. It
> > > > > seems that what we can do is to allow this race, but when we traverse
> > > > > the page table in debugfs, we will check the validity of the physical
> > > > > address retrieved from the page table entry. Then, the worst case is to
> > > > > print some useless information.
> > > >
> > > > Sounds horrible, don't you have locking around the IOPTEs of some
> > > > kind? How does updating them work reliably?
> > >
> > > There's no locking around updating the IOPTEs. The basic assumption is
> > > that at any time, there's only a single thread manipulating the mappings
> > > of the range specified in iommu_map/unmap() APIs. Therefore, the race
> > > only exists when multiple ranges share some high-level IOPTEs. The IOMMU
> > > driver updates those IOPTEs using the compare-and-exchange atomic
> > > operation.
> >
> > Oh? Did I miss where that was documented as part of the iommu API?
> >
> > Daniel posted patches for VFIO to multi-thread iommu_domain mapping.
> >
> > iommufd goes out of its way to avoid this kind of serialization so
> > that userspace can parallel map IOVA.
> >
> > I think if this is the requirement then the iommu API needs to
> > provide a lock around the domain for the driver..
>
> Eww, no, we can't kill performance by forcing serialisation on the entire
> API just for one silly driver-internal debugfs corner :(

I'm not worried about debugfs, I'm worried about these efforts to
speed up VFIO VM booting by parallel domain loading:

https://lore.kernel.org/kvm/[email protected]/

The DMA API should maintain its own external lock, but the general
domain interface to the rest of the kernel should be safe, IMHO.

Or at least it should be documented..

Jason

2022-06-01 20:48:44

by Robin Murphy

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On 2022-05-31 19:51, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 07:07:32PM +0100, Robin Murphy wrote:
>
>>> And we expect the iommu driver to be unable to free page table levels
>>> that have IOVA boundaries in them?
>>
>> I'm not entirely sure what you mean there, but in general an unmap request
>> is expected to match some previous map request
>
> atomic cmpxchg is OK for inserting new page table levels but it can't
> protect you against concurrent freeing of page table levels. So
> without locks it means that page tables can't usually be freed. Which
> seems to match what the Intel driver does - at least from a cursory
> look.
>
> This is one of the reasons the mm has the mmap/etc lock and spinlocks
> because we do expect page table levels to get wiped out when VMA's are
> zap'd - all the different locks provide the protection against page
> tables disappearing out from under something manipulating them.
>
> Basically every "lockless" walk in (process) MM land is actually
> protected by some kind of lock that blocks zap_page_range() from
> removing the page table levels themselves.

I'm not an expert in the Intel or AMD code, so I can only speak with
confidence about what we do in io-pgtable-arm, but the main reason for
not freeing pagetables is that it's simply not worth the bother of
trying to work out whether a whole sub-tree is empty. Not to mention
whether it's *still* empty by the time that we may have figured out that
it was.

There are only 3 instances where we'll free a table while the domain is
live. The first is the one legitimate race condition, where two map
requests targeting relatively nearby PTEs both go to fill in an
intermediate level of table; whoever loses that race frees the table
they allocated, but it was never visible to anyone else so that's
definitely fine. The second is if we're mapping a block entry, and find
that there's already a table entry there, wherein we assume the table
must be empty, clear the entry, invalidate any walk caches, install the
block entry, then free the orphaned table; since we're mapping the
entire IOVA range covered by that table, there should be no other
operations on that IOVA range attempting to walk the table at the same
time, so it's fine. The third is effectively the inverse, if we get a
block-sized unmap but find a table entry rather than a block at that
point (on the assumption that it's de-facto allowed for a single unmap
to cover multiple adjacent mappings as long as it does so exactly);
similarly we assume that the table must be full, and no other operations
should be racing because we're unmapping its whole IOVA range, so we
remove the table entry, invalidate, and free as before.

Again for efficiency reasons we don't attempt to validate those
assumptions by inspecting the freed tables, so odd behaviour can fall
out if the caller *does* do something bogus. For example if two calls
race to map a block and a page in the same (unmapped) region, the block
mapping will always succeed (and be what ends up in the final pagetable
state), but the page mapping may or may not report failure depending on
the exact timing.

Although we don't have debug dumping for io-pgtable-arm, it's good to be
thinking about this, since it's made me realise that dirty-tracking
sweeps per that proposal might pose a similar kind of concern, so we
might still need to harden these corners for the sake of that. Which
also reminds me that somewhere I have some half-finished patches making
io-pgtable-arm use the iommu_iotlb_gather freelist, so maybe I'll tackle
both concerns at once (perhaps we might even be able to RCU-ify the
freelist generically? I'll see how it goes when I get there).

Cheers,
Robin.

2022-06-01 20:57:42

by Jason Gunthorpe

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On Wed, Jun 01, 2022 at 01:18:52PM +0100, Joao Martins wrote:

> > So having safe racy reading in the kernel is probably best, and so RCU
> > would be good here too.
>
> Reading dirties ought to be similar to map/unmap but slightly simpler as
> I supposedly don't need to care about the pte changing under the hood (or
> so I initially thought). I was wrestling at some point if test-and-clear
> was enough or whether I switch back cmpxchg to detect the pte has changed
> and only mark dirty based on the old value[*]. The latter would align with
> how map/unmap performs the pte updates.

test-and-clear should be fine, but this all needs to be done under a
RCU context while the page tables themselves are freed by RCU. Then
you can safely chase the page table pointers down to each level
without fear of UAF.

> I am not sure yet whether dynamic demote/promote of page sizes changes this.

For this kind of primitive the caller must provide the locking, just
like map/unmap.

Effectively you can consider the iommu_domain as having externally
provided range-locks over the IOVA space. map/unmap/demote/promote
must run serially over intersecting IOVA ranges.

In terms of iommufd this means we always have to hold a lock related
to the area (which is the IOVA range) before issuing any iommu call on
the domain.
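
As a hand-wavy illustration of that contract (struct iova_area and
area_map() are made up here, not iommufd's real types): the caller
serializes per IOVA area, while the iommu_domain itself stays lockless.

/* Caller-side serialization: all iommu ops touching the same IOVA
 * range go through the area lock. */
struct iova_area {
        struct mutex lock;              /* serializes [iova, iova + size) */
        unsigned long iova;
        size_t size;
};

static int area_map(struct iommu_domain *domain, struct iova_area *area,
                    phys_addr_t paddr, int prot)
{
        int ret;

        mutex_lock(&area->lock);
        ret = iommu_map(domain, area->iova, paddr, area->size, prot);
        mutex_unlock(&area->lock);

        return ret;
}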

Jason

2022-06-01 21:10:24

by Joao Martins

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On 6/1/22 00:10, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 10:22:32PM +0100, Robin Murphy wrote:
>> There are only 3 instances where we'll free a table while the domain is
>> live. The first is the one legitimate race condition, where two map requests
>> targeting relatively nearby PTEs both go to fill in an intermediate level of
>> table; whoever loses that race frees the table they allocated, but it was
>> never visible to anyone else so that's definitely fine. The second is if
>> we're mapping a block entry, and find that there's already a table entry
>> there, wherein we assume the table must be empty, clear the entry,
>> invalidate any walk caches, install the block entry, then free the orphaned
>> table; since we're mapping the entire IOVA range covered by that table,
>> there should be no other operations on that IOVA range attempting to walk
>> the table at the same time, so it's fine.
>
> I saw these two in the Intel driver
>
>> The third is effectively the inverse, if we get a block-sized unmap
>> but find a table entry rather than a block at that point (on the
>> assumption that it's de-facto allowed for a single unmap to cover
>> multiple adjacent mappings as long as it does so exactly); similarly
>> we assume that the table must be full, and no other operations
>> should be racing because we're unmapping its whole IOVA range, so we
>> remove the table entry, invalidate, and free as before.
>
> Not sure I noticed this one though
>
> This all makes sense though.
>

I saw all three incarnations in the AMD iommu /I think/

AMD has sort of a replicated-PTEs concept: a power-of-2 page size is mapped
using 'default' page sizes (4K, 2M, 1G), and we need to check every single
one of the replicas. While writing the RFC I realized this is not really the
most efficient thing to keep, considering the IOMMU hardware doesn't
guarantee the dirty bit is set on every replicated pte. And maybe it's
obscuring the fact that we don't attempt to pick the best page size for the
IOVA mapping (like Intel does), and instead allow all powers of 2 and let
the IOMMU hw try its best to cope.

>> Although we don't have debug dumping for io-pgtable-arm, it's good to be
>> thinking about this, since it's made me realise that dirty-tracking sweeps
>> per that proposal might pose a similar kind of concern, so we might still
>> need to harden these corners for the sake of that.
>
> Lets make sure Joao sees this..
>
> It is interesting because we probably don't want the big latency
> spikes that would come from using locking to block map/unmap while
> dirty reading is happening - eg at the iommfd level.
>
Especially when we might be scanning big IOVA ranges.

> From a consistency POV the only case that matters is unmap and unmap
> should already be doing a dedicated dirty read directly prior to the
> unmap (as per that other long thread)
>
/me nods, yes

FWIW, I've already removed the unmap_read_dirty ops.

> So having safe racy reading in the kernel is probably best, and so RCU
> would be good here too.
>

Reading dirties ought to be similar to map/unmap but slightly simpler as
I supposedly don't need to care about the pte changing under the hood (or
so I initially thought). I was wrestling at some point if test-and-clear
was enough or whether I switch back cmpxchg to detect the pte has changed
and only mark dirty based on the old value[*]. The latter would align with
how map/unmap performs the pte updates.

[*] e.g. potentially the new entry hasn't been 'seen' by the IOMMU walker, or
might be a smaller size than what got dirtied with an iopte split, or racing
with a new map
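
For the test-and-clear variant, something like the below is what I have in
mind (DMA_PTE_DIRTY_BIT is a placeholder, the real bit position depends on
the PTE format, and the walk would sit under the RCU scheme discussed
above):

#define DMA_PTE_DIRTY_BIT      9       /* placeholder bit position */

/* Atomically read and clear the dirty bit of one IOPTE. */
static bool dma_pte_test_and_clear_dirty(struct dma_pte *pte)
{
        return test_and_clear_bit(DMA_PTE_DIRTY_BIT,
                                  (unsigned long *)&pte->val);
}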

>> that somewhere I have some half-finished patches making io-pgtable-arm use
>> the iommu_iotlb_gather freelist, so maybe I'll tackle both concerns at once
>> (perhaps we might even be able to RCU-ify the freelist generically? I'll see
>> how it goes when I get there).
>
> This is all very similar to how the mm's tlb gather stuff works,
> especially on PPC which requires RCU protection of page tables instead
> of the IPI trick.
>
> Currently the rcu_head and the lru overlap in the struct page. To do
> this I'd suggest following what was done for slab - see ca1a46d6f506
> ("Merge tag 'slab-for-5.17'..) and create a 'struct page' sub-class
> like 'struct slab' for use with iommu page tables and put the
> list_head and rcu_head sequentially in the struct iommu_pt_page.
>
> Continue to use put_pages_list() just with the list_head in the new
> struct not the lru.
>
> If we need it for dirty tracking then it makes sense to start
> progressing parts at least..
>
The suggestions here seem to apply to map/unmap too, not just
read_dirty() alone, IIUC

I am not sure yet whether dynamic demote/promote of page sizes changes this.

2022-06-01 21:27:51

by Jason Gunthorpe

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:
> On 2022/5/31 21:10, Jason Gunthorpe wrote:
> > On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
> >
> > > For case 2, it is a bit weird. I tried to add a rwsem lock to make the
> > > iommu_unmap() and dumping tables in debugfs exclusive. This does not
> > > work because debugfs may depend on the DMA of the devices to work. It
> > > seems that what we can do is to allow this race, but when we traverse
> > > the page table in debugfs, we will check the validity of the physical
> > > address retrieved from the page table entry. Then, the worst case is to
> > > print some useless information.
> >
> > Sounds horrible, don't you have locking around the IOPTEs of some
> > kind? How does updating them work reliably?
>
> There's no locking around updating the IOPTEs. The basic assumption is
> that at any time, there's only a single thread manipulating the mappings
> of the range specified in iommu_map/unmap() APIs. Therefore, the race
> only exists when multiple ranges share some high-level IOPTEs. The IOMMU
> driver updates those IOPTEs using the compare-and-exchange atomic
> operation.

Oh? Did I miss where that was documented as part of the iommu API?

Daniel posted patches for VFIO to multi-thread iommu_domain mapping.

iommufd goes out of its way to avoid this kind of serialization so
that userspace can parallel map IOVA.

I think if this is the requirement then the iommu API needs to
provide a lock around the domain for the driver..

Jason

2022-06-01 21:32:22

by Robin Murphy

Subject: Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

On 2022-05-31 15:53, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:
>> On 2022/5/31 21:10, Jason Gunthorpe wrote:
>>> On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
>>>
>>>> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
>>>> iommu_unmap() and dumping tables in debugfs exclusive. This does not
>>>> work because debugfs may depend on the DMA of the devices to work. It
>>>> seems that what we can do is to allow this race, but when we traverse
>>>> the page table in debugfs, we will check the validity of the physical
>>>> address retrieved from the page table entry. Then, the worst case is to
>>>> print some useless information.
>>>
>>> Sounds horrible, don't you have locking around the IOPTEs of some
>>> kind? How does updating them work reliably?
>>
>> There's no locking around updating the IOPTEs. The basic assumption is
>> that at any time, there's only a single thread manipulating the mappings
>> of the range specified in iommu_map/unmap() APIs. Therefore, the race
>> only exists when multiple ranges share some high-level IOPTEs. The IOMMU
>> driver updates those IOPTEs using the compare-and-exchange atomic
>> operation.
>
> Oh? Did I miss where that was documented as part of the iommu API?
>
> Daniel posted patches for VFIO to multi-thread iommu_domain mapping.
>
> iommufd goes out of its way to avoid this kind of serialization so
> that userspace can parallel map IOVA.
>
> I think if this is the requirement then the iommu API needs to
> provide a lock around the domain for the driver..

Eww, no, we can't kill performance by forcing serialisation on the
entire API just for one silly driver-internal debugfs corner :(

Robin.