2020-04-07 02:13:55

by Qian Cai

[permalink] [raw]
Subject: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

fetch_pte() could race with increase_address_space() because it held no
lock from iommu_unmap_page(). On the CPU that runs fetch_pte() it could
see a stale domain->pt_root and a new increased domain->mode from
increase_address_space(). As the result, it could trigger invalid
accesses later on. Fix it by using a pair of smp_[w|r]mb in those
places.

kernel BUG at drivers/iommu/amd_iommu.c:1704!
BUG_ON(unmapped && !is_power_of_2(unmapped));
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
RIP: 0010:amd_iommu_unmap+0x1b2/0x1d0
Call Trace:
<IRQ>
__iommu_unmap+0x106/0x320
iommu_unmap_fast+0xe/0x10
__iommu_dma_unmap+0xdc/0x1a0
iommu_dma_unmap_sg+0xae/0xd0
scsi_dma_unmap+0xe7/0x150
pqi_raid_io_complete+0x37/0x60 [smartpqi]
pqi_irq_handler+0x1fc/0x13f0 [smartpqi]
__handle_irq_event_percpu+0x78/0x4f0
handle_irq_event_percpu+0x70/0x100
handle_irq_event+0x5a/0x8b
handle_edge_irq+0x10c/0x370
do_IRQ+0x9e/0x1e0
common_interrupt+0xf/0xf
</IRQ>

Signed-off-by: Qian Cai <[email protected]>
---
drivers/iommu/amd_iommu.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..22328a23335f 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1434,6 +1434,11 @@ static bool increase_address_space(struct protection_domain *domain,
*pte = PM_LEVEL_PDE(domain->mode,
iommu_virt_to_phys(domain->pt_root));
domain->pt_root = pte;
+ /*
+ * Make sure fetch_pte() will see the new domain->pt_root before it
+ * snapshots domain->mode.
+ */
+ smp_wmb();
domain->mode += 1;

ret = true;
@@ -1460,6 +1465,8 @@ static u64 *alloc_pte(struct protection_domain *domain,
*updated = increase_address_space(domain, address, gfp) || *updated;

level = domain->mode - 1;
+ /* To pair with smp_wmb() in increase_address_space(). */
+ smp_rmb();
pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
address = PAGE_SIZE_ALIGN(address, page_size);
end_lvl = PAGE_SIZE_LEVEL(page_size);
@@ -1545,6 +1552,8 @@ static u64 *fetch_pte(struct protection_domain *domain,
return NULL;

level = domain->mode - 1;
+ /* To pair with smp_wmb() in increase_address_space(). */
+ smp_rmb();
pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
*page_size = PTE_LEVEL_PAGE_SIZE(level);

--
2.21.0 (Apple Git-122.2)


2020-04-07 15:38:07

by Qian Cai

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()



> On Apr 6, 2020, at 10:12 PM, Qian Cai <[email protected]> wrote:
>
> fetch_pte() could race with increase_address_space() because it held no
> lock from iommu_unmap_page(). On the CPU that runs fetch_pte() it could
> see a stale domain->pt_root and a new increased domain->mode from
> increase_address_space(). As the result, it could trigger invalid
> accesses later on. Fix it by using a pair of smp_[w|r]mb in those
> places.

After further testing, the change along is insufficient. What I am chasing right
now is the swap device will go offline after heavy memory pressure below. The
symptom is similar to what we have in the commit,

754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)

Apparently, it is no possible to take the domain->lock in fetch_pte() because it
could sleep.

Thoughts?

[ 7292.727377][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd80000 flags=0x0010]
[ 7292.740571][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81000 flags=0x0010]
[ 7292.753268][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81900 flags=0x0010]
[ 7292.766078][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81d00 flags=0x0010]
[ 7292.778447][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82000 flags=0x0010]
[ 7292.790724][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82400 flags=0x0010]
[ 7292.803195][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82800 flags=0x0010]
[ 7292.815664][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82c00 flags=0x0010]
[ 7292.828089][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd83000 flags=0x0010]
[ 7292.840468][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd83400 flags=0x0010]
[ 7292.852795][ T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffd83800 flags=0x0010]
[ 7292.864566][ T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffd83c00 flags=0x0010]
[ 7326.583908][ C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803
[ 7326.592386][ C8] smartpqi 0000:23:00.0: controller offline
[ 7326.663728][ C66] sd 0:1:0:0: [sda] tag#467 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=6s
[ 7326.664321][ T2738] smartpqi 0000:23:00.0: resetting scsi 0:1:0:0
[ 7326.673849][ C66] sd 0:1:0:0: [sda] tag#467 CDB: opcode=0x28 28 00 02 ee 2e 60 00 00 08 00
[ 7326.680118][ T2738] smartpqi 0000:23:00.0: reset of scsi 0:1:0:0: FAILED
[ 7326.688612][ C66] blk_update_request: I/O error, dev sda, sector 49163872 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 7326.695456][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.706632][ C66] Read-error on swap-device (254:1:45833824)
[ 7326.714030][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.723382][T45523] sd 0:1:0:0: rejecting I/O to offline device
[ 7326.727352][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.727379][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.727442][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery

>
> kernel BUG at drivers/iommu/amd_iommu.c:1704!
> BUG_ON(unmapped && !is_power_of_2(unmapped));
> Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> RIP: 0010:amd_iommu_unmap+0x1b2/0x1d0
> Call Trace:
> <IRQ>
> __iommu_unmap+0x106/0x320
> iommu_unmap_fast+0xe/0x10
> __iommu_dma_unmap+0xdc/0x1a0
> iommu_dma_unmap_sg+0xae/0xd0
> scsi_dma_unmap+0xe7/0x150
> pqi_raid_io_complete+0x37/0x60 [smartpqi]
> pqi_irq_handler+0x1fc/0x13f0 [smartpqi]
> __handle_irq_event_percpu+0x78/0x4f0
> handle_irq_event_percpu+0x70/0x100
> handle_irq_event+0x5a/0x8b
> handle_edge_irq+0x10c/0x370
> do_IRQ+0x9e/0x1e0
> common_interrupt+0xf/0xf
> </IRQ>
>
> Signed-off-by: Qian Cai <[email protected]>
> ---
> drivers/iommu/amd_iommu.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 20cce366e951..22328a23335f 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -1434,6 +1434,11 @@ static bool increase_address_space(struct protection_domain *domain,
> *pte = PM_LEVEL_PDE(domain->mode,
> iommu_virt_to_phys(domain->pt_root));
> domain->pt_root = pte;
> + /*
> + * Make sure fetch_pte() will see the new domain->pt_root before it
> + * snapshots domain->mode.
> + */
> + smp_wmb();
> domain->mode += 1;
>
> ret = true;
> @@ -1460,6 +1465,8 @@ static u64 *alloc_pte(struct protection_domain *domain,
> *updated = increase_address_space(domain, address, gfp) || *updated;
>
> level = domain->mode - 1;
> + /* To pair with smp_wmb() in increase_address_space(). */
> + smp_rmb();
> pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> address = PAGE_SIZE_ALIGN(address, page_size);
> end_lvl = PAGE_SIZE_LEVEL(page_size);
> @@ -1545,6 +1552,8 @@ static u64 *fetch_pte(struct protection_domain *domain,
> return NULL;
>
> level = domain->mode - 1;
> + /* To pair with smp_wmb() in increase_address_space(). */
> + smp_rmb();
> pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> *page_size = PTE_LEVEL_PAGE_SIZE(level);
>
> --
> 2.21.0 (Apple Git-122.2)
>

2020-04-08 16:19:36

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

Hi Qian,

On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
> After further testing, the change along is insufficient. What I am chasing right
> now is the swap device will go offline after heavy memory pressure below. The
> symptom is similar to what we have in the commit,
>
> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>
> Apparently, it is no possible to take the domain->lock in fetch_pte() because it
> could sleep.

Thanks a lot for finding and tracking down another race in the AMD IOMMU
page-table code. The domain->lock is a spin-lock and taking it can't
sleep. But fetch_pte() is a fast-path and must not take any locks.

I think the best fix is to update the pt_root and mode of the domain
atomically by storing the mode in the lower 12 bits of pt_root. This way
they are stored together and can be read/write atomically.

Regards,

Joerg

2020-04-14 13:51:38

by Qian Cai

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()



> On Apr 8, 2020, at 10:19 AM, Joerg Roedel <[email protected]> wrote:
>
> Hi Qian,
>
> On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
>> After further testing, the change along is insufficient. What I am chasing right
>> now is the swap device will go offline after heavy memory pressure below. The
>> symptom is similar to what we have in the commit,
>>
>> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>>
>> Apparently, it is no possible to take the domain->lock in fetch_pte() because it
>> could sleep.
>
> Thanks a lot for finding and tracking down another race in the AMD IOMMU
> page-table code. The domain->lock is a spin-lock and taking it can't
> sleep. But fetch_pte() is a fast-path and must not take any locks.
>
> I think the best fix is to update the pt_root and mode of the domain
> atomically by storing the mode in the lower 12 bits of pt_root. This way
> they are stored together and can be read/write atomically.

Like this?

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..b36c6b07cbfd 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, int mode,

static void free_pagetable(struct protection_domain *domain)
{
- unsigned long root = (unsigned long)domain->pt_root;
+ int level = iommu_get_mode(domain->pt_root);
+ unsigned long root = iommu_get_root(domain->pt_root);
struct page *freelist = NULL;

- BUG_ON(domain->mode < PAGE_MODE_NONE ||
- domain->mode > PAGE_MODE_6_LEVEL);
+ BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL);

- freelist = free_sub_pt(root, domain->mode, freelist);
+ freelist = free_sub_pt(root, level, freelist);

free_page_list(freelist);
}
@@ -1417,24 +1417,27 @@ static bool increase_address_space(struct protection_domain *domain,
unsigned long address,
gfp_t gfp)
{
+ int level;
unsigned long flags;
bool ret = false;
u64 *pte;

spin_lock_irqsave(&domain->lock, flags);

- if (address <= PM_LEVEL_SIZE(domain->mode) ||
- WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
+ level = iommu_get_mode(domain->pt_root);
+
+ if (address <= PM_LEVEL_SIZE(level) ||
+ WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL))
goto out;

pte = (void *)get_zeroed_page(gfp);
if (!pte)
goto out;

- *pte = PM_LEVEL_PDE(domain->mode,
- iommu_virt_to_phys(domain->pt_root));
- domain->pt_root = pte;
- domain->mode += 1;
+ *pte = PM_LEVEL_PDE(level,
+ iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root)));
+
+ WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1);

ret = true;

@@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain *domain,
bool *updated)
{
int level, end_lvl;
- u64 *pte, *page;
+ u64 *pte, *page, *pt_root, *root;

BUG_ON(!is_power_of_2(page_size));

- while (address > PM_LEVEL_SIZE(domain->mode))
+ while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root)))
*updated = increase_address_space(domain, address, gfp) || *updated;

- level = domain->mode - 1;
- pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+ pt_root = READ_ONCE(domain->pt_root);
+ root = (void *)iommu_get_root(pt_root);
+ level = iommu_get_mode(pt_root) - 1;
+ pte = &root[PM_LEVEL_INDEX(level, address)];
address = PAGE_SIZE_ALIGN(address, page_size);
end_lvl = PAGE_SIZE_LEVEL(page_size);

@@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain *domain,
unsigned long address,
unsigned long *page_size)
{
- int level;
u64 *pte;
+ u64 *pt_root = READ_ONCE(domain->pt_root);
+ u64 *root = (void *)iommu_get_root(pt_root);
+ int level = iommu_get_mode(pt_root);

*page_size = 0;

- if (address > PM_LEVEL_SIZE(domain->mode))
+ if (address > PM_LEVEL_SIZE(level))
return NULL;

- level = domain->mode - 1;
- pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+ level--;
+ pte = &root[PM_LEVEL_INDEX(level, address)];
*page_size = PTE_LEVEL_PAGE_SIZE(level);

while (level > 0) {
@@ -1814,12 +1821,13 @@ static struct protection_domain *dma_ops_domain_alloc(void)
if (protection_domain_init(domain))
goto free_domain;

- domain->mode = PAGE_MODE_3_LEVEL;
domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
domain->flags = PD_DMA_OPS_MASK;
if (!domain->pt_root)
goto free_domain;

+ domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_3_LEVEL);
+
if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM)
goto free_domain;

@@ -1847,10 +1855,10 @@ static void set_dte_entry(u16 devid, struct protection_domain *domain,
u64 flags = 0;
u32 old_domid;

- if (domain->mode != PAGE_MODE_NONE)
- pte_root = iommu_virt_to_phys(domain->pt_root);
+ if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
+ pte_root = iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root));

- pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK)
+ pte_root |= ((unsigned long)domain->pt_root & DEV_ENTRY_MODE_MASK)
<< DEV_ENTRY_MODE_SHIFT;
pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV;

@@ -2382,13 +2390,14 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
if (!pdomain)
return NULL;

- pdomain->mode = PAGE_MODE_3_LEVEL;
pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
if (!pdomain->pt_root) {
protection_domain_free(pdomain);
return NULL;
}

+ pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
+ PAGE_MODE_3_LEVEL);
pdomain->domain.geometry.aperture_start = 0;
pdomain->domain.geometry.aperture_end = ~0ULL;
pdomain->domain.geometry.force_aperture = true;
@@ -2406,7 +2415,8 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
if (!pdomain)
return NULL;

- pdomain->mode = PAGE_MODE_NONE;
+ pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
+ PAGE_MODE_NONE);
break;
default:
return NULL;
@@ -2435,7 +2445,7 @@ static void amd_iommu_domain_free(struct iommu_domain *dom)
dma_ops_domain_free(domain);
break;
default:
- if (domain->mode != PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
free_pagetable(domain);

if (domain->flags & PD_IOMMUV2_MASK)
@@ -2521,7 +2531,7 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
int prot = 0;
int ret;

- if (domain->mode == PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
return -EINVAL;

if (iommu_prot & IOMMU_READ)
@@ -2542,7 +2552,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
{
struct protection_domain *domain = to_pdomain(dom);

- if (domain->mode == PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
return 0;

return iommu_unmap_page(domain, iova, page_size);
@@ -2555,7 +2565,7 @@ static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
unsigned long offset_mask, pte_pgsize;
u64 *pte, __pte;

- if (domain->mode == PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
return iova;

pte = fetch_pte(domain, iova, &pte_pgsize);
@@ -2713,7 +2723,7 @@ void amd_iommu_domain_direct_map(struct iommu_domain *dom)
spin_lock_irqsave(&domain->lock, flags);

/* Update data structure */
- domain->mode = PAGE_MODE_NONE;
+ domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_NONE);

/* Make changes visible to IOMMUs */
update_domain(domain);
@@ -2910,7 +2920,7 @@ static int __set_gcr3(struct protection_domain *domain, int pasid,
{
u64 *pte;

- if (domain->mode != PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
return -EINVAL;

pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true);
@@ -2926,7 +2936,7 @@ static int __clear_gcr3(struct protection_domain *domain, int pasid)
{
u64 *pte;

- if (domain->mode != PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
return -EINVAL;

pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false);
diff --git a/drivers/iommu/amd_iommu_proto.h b/drivers/iommu/amd_iommu_proto.h
index 92c2ba6468a0..34d4dd66cf9b 100644
--- a/drivers/iommu/amd_iommu_proto.h
+++ b/drivers/iommu/amd_iommu_proto.h
@@ -67,6 +67,21 @@ static inline int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
extern int amd_iommu_complete_ppr(struct pci_dev *pdev, int pasid,
int status, int tag);

+static inline int iommu_get_mode(void *pt_root)
+{
+ return (unsigned long)pt_root & ~PAGE_MASK;
+}
+
+static inline unsigned long iommu_get_root(void *pt_root)
+{
+ return (unsigned long)pt_root & PAGE_MASK;
+}
+
+static inline void *iommu_set_mode(void *pt_root, int mode)
+{
+ return (void *)(iommu_get_root(pt_root) + mode);
+}
+
static inline bool is_rd890_iommu(struct pci_dev *pdev)
{
return (pdev->vendor == PCI_VENDOR_ID_ATI) &&
diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h
index ca8c4522045b..221adefa56a0 100644
--- a/drivers/iommu/amd_iommu_types.h
+++ b/drivers/iommu/amd_iommu_types.h
@@ -468,8 +468,8 @@ struct protection_domain {
iommu core code */
spinlock_t lock; /* mostly used to lock the page table*/
u16 id; /* the domain id written to the device table */
- int mode; /* paging mode (0-6 levels) */
- u64 *pt_root; /* page table root pointer */
+ u64 *pt_root; /* page table root pointer and paging mode (0-6
+ levels) */
int glx; /* Number of levels for GCR3 table */
u64 *gcr3_tbl; /* Guest CR3 table */
unsigned long flags; /* flags to find out type of domain */
--
2.21.0 (Apple Git-122.2)

2020-04-17 01:45:07

by Qian Cai

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()



> On Apr 13, 2020, at 9:36 PM, Qian Cai <[email protected]> wrote:
>
>
>
>> On Apr 8, 2020, at 10:19 AM, Joerg Roedel <[email protected]> wrote:
>>
>> Hi Qian,
>>
>> On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
>>> After further testing, the change along is insufficient. What I am chasing right
>>> now is the swap device will go offline after heavy memory pressure below. The
>>> symptom is similar to what we have in the commit,
>>>
>>> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>>>
>>> Apparently, it is no possible to take the domain->lock in fetch_pte() because it
>>> could sleep.
>>
>> Thanks a lot for finding and tracking down another race in the AMD IOMMU
>> page-table code. The domain->lock is a spin-lock and taking it can't
>> sleep. But fetch_pte() is a fast-path and must not take any locks.
>>
>> I think the best fix is to update the pt_root and mode of the domain
>> atomically by storing the mode in the lower 12 bits of pt_root. This way
>> they are stored together and can be read/write atomically.
>
> Like this?

So, this is still not enough that would still trigger storage driver offline under
memory pressure for a bit longer. It looks to me that in fetch_pte() there are
could still racy?

level = domain->mode - 1;
pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
<— increase_address_space();
*page_size = PTE_LEVEL_PAGE_SIZE(level);

while (level > 0) {
*page_size = PTE_LEVEL_PAGE_SIZE(level);

Then in iommu_unmap_page(),

while (unmapped < page_size) {
pte = fetch_pte(dom, bus_addr, &unmap_size);

bus_addr = (bus_addr & ~(unmap_size - 1)) + unmap_size;

bus_addr would be unsync with dom->mode when it enter fetch_pte() again.
Could that be a problem?

>
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 20cce366e951..b36c6b07cbfd 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, int mode,
>
> static void free_pagetable(struct protection_domain *domain)
> {
> - unsigned long root = (unsigned long)domain->pt_root;
> + int level = iommu_get_mode(domain->pt_root);
> + unsigned long root = iommu_get_root(domain->pt_root);
> struct page *freelist = NULL;
>
> - BUG_ON(domain->mode < PAGE_MODE_NONE ||
> - domain->mode > PAGE_MODE_6_LEVEL);
> + BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL);
>
> - freelist = free_sub_pt(root, domain->mode, freelist);
> + freelist = free_sub_pt(root, level, freelist);
>
> free_page_list(freelist);
> }
> @@ -1417,24 +1417,27 @@ static bool increase_address_space(struct protection_domain *domain,
> unsigned long address,
> gfp_t gfp)
> {
> + int level;
> unsigned long flags;
> bool ret = false;
> u64 *pte;
>
> spin_lock_irqsave(&domain->lock, flags);
>
> - if (address <= PM_LEVEL_SIZE(domain->mode) ||
> - WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
> + level = iommu_get_mode(domain->pt_root);
> +
> + if (address <= PM_LEVEL_SIZE(level) ||
> + WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL))
> goto out;
>
> pte = (void *)get_zeroed_page(gfp);
> if (!pte)
> goto out;
>
> - *pte = PM_LEVEL_PDE(domain->mode,
> - iommu_virt_to_phys(domain->pt_root));
> - domain->pt_root = pte;
> - domain->mode += 1;
> + *pte = PM_LEVEL_PDE(level,
> + iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root)));
> +
> + WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1);
>
> ret = true;
>
> @@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain *domain,
> bool *updated)
> {
> int level, end_lvl;
> - u64 *pte, *page;
> + u64 *pte, *page, *pt_root, *root;
>
> BUG_ON(!is_power_of_2(page_size));
>
> - while (address > PM_LEVEL_SIZE(domain->mode))
> + while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root)))
> *updated = increase_address_space(domain, address, gfp) || *updated;
>
> - level = domain->mode - 1;
> - pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> + pt_root = READ_ONCE(domain->pt_root);
> + root = (void *)iommu_get_root(pt_root);
> + level = iommu_get_mode(pt_root) - 1;
> + pte = &root[PM_LEVEL_INDEX(level, address)];
> address = PAGE_SIZE_ALIGN(address, page_size);
> end_lvl = PAGE_SIZE_LEVEL(page_size);
>
> @@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain *domain,
> unsigned long address,
> unsigned long *page_size)
> {
> - int level;
> u64 *pte;
> + u64 *pt_root = READ_ONCE(domain->pt_root);
> + u64 *root = (void *)iommu_get_root(pt_root);
> + int level = iommu_get_mode(pt_root);
>
> *page_size = 0;
>
> - if (address > PM_LEVEL_SIZE(domain->mode))
> + if (address > PM_LEVEL_SIZE(level))
> return NULL;
>
> - level = domain->mode - 1;
> - pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> + level--;
> + pte = &root[PM_LEVEL_INDEX(level, address)];
> *page_size = PTE_LEVEL_PAGE_SIZE(level);
>
> while (level > 0) {
> @@ -1814,12 +1821,13 @@ static struct protection_domain *dma_ops_domain_alloc(void)
> if (protection_domain_init(domain))
> goto free_domain;
>
> - domain->mode = PAGE_MODE_3_LEVEL;
> domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
> domain->flags = PD_DMA_OPS_MASK;
> if (!domain->pt_root)
> goto free_domain;
>
> + domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_3_LEVEL);
> +
> if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM)
> goto free_domain;
>
> @@ -1847,10 +1855,10 @@ static void set_dte_entry(u16 devid, struct protection_domain *domain,
> u64 flags = 0;
> u32 old_domid;
>
> - if (domain->mode != PAGE_MODE_NONE)
> - pte_root = iommu_virt_to_phys(domain->pt_root);
> + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> + pte_root = iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root));
>
> - pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK)
> + pte_root |= ((unsigned long)domain->pt_root & DEV_ENTRY_MODE_MASK)
> << DEV_ENTRY_MODE_SHIFT;
> pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV;
>
> @@ -2382,13 +2390,14 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
> if (!pdomain)
> return NULL;
>
> - pdomain->mode = PAGE_MODE_3_LEVEL;
> pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
> if (!pdomain->pt_root) {
> protection_domain_free(pdomain);
> return NULL;
> }
>
> + pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
> + PAGE_MODE_3_LEVEL);
> pdomain->domain.geometry.aperture_start = 0;
> pdomain->domain.geometry.aperture_end = ~0ULL;
> pdomain->domain.geometry.force_aperture = true;
> @@ -2406,7 +2415,8 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
> if (!pdomain)
> return NULL;
>
> - pdomain->mode = PAGE_MODE_NONE;
> + pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
> + PAGE_MODE_NONE);
> break;
> default:
> return NULL;
> @@ -2435,7 +2445,7 @@ static void amd_iommu_domain_free(struct iommu_domain *dom)
> dma_ops_domain_free(domain);
> break;
> default:
> - if (domain->mode != PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> free_pagetable(domain);
>
> if (domain->flags & PD_IOMMUV2_MASK)
> @@ -2521,7 +2531,7 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
> int prot = 0;
> int ret;
>
> - if (domain->mode == PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
> return -EINVAL;
>
> if (iommu_prot & IOMMU_READ)
> @@ -2542,7 +2552,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
> {
> struct protection_domain *domain = to_pdomain(dom);
>
> - if (domain->mode == PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
> return 0;
>
> return iommu_unmap_page(domain, iova, page_size);
> @@ -2555,7 +2565,7 @@ static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
> unsigned long offset_mask, pte_pgsize;
> u64 *pte, __pte;
>
> - if (domain->mode == PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
> return iova;
>
> pte = fetch_pte(domain, iova, &pte_pgsize);
> @@ -2713,7 +2723,7 @@ void amd_iommu_domain_direct_map(struct iommu_domain *dom)
> spin_lock_irqsave(&domain->lock, flags);
>
> /* Update data structure */
> - domain->mode = PAGE_MODE_NONE;
> + domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_NONE);
>
> /* Make changes visible to IOMMUs */
> update_domain(domain);
> @@ -2910,7 +2920,7 @@ static int __set_gcr3(struct protection_domain *domain, int pasid,
> {
> u64 *pte;
>
> - if (domain->mode != PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> return -EINVAL;
>
> pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true);
> @@ -2926,7 +2936,7 @@ static int __clear_gcr3(struct protection_domain *domain, int pasid)
> {
> u64 *pte;
>
> - if (domain->mode != PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> return -EINVAL;
>
> pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false);
> diff --git a/drivers/iommu/amd_iommu_proto.h b/drivers/iommu/amd_iommu_proto.h
> index 92c2ba6468a0..34d4dd66cf9b 100644
> --- a/drivers/iommu/amd_iommu_proto.h
> +++ b/drivers/iommu/amd_iommu_proto.h
> @@ -67,6 +67,21 @@ static inline int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
> extern int amd_iommu_complete_ppr(struct pci_dev *pdev, int pasid,
> int status, int tag);
>
> +static inline int iommu_get_mode(void *pt_root)
> +{
> + return (unsigned long)pt_root & ~PAGE_MASK;
> +}
> +
> +static inline unsigned long iommu_get_root(void *pt_root)
> +{
> + return (unsigned long)pt_root & PAGE_MASK;
> +}
> +
> +static inline void *iommu_set_mode(void *pt_root, int mode)
> +{
> + return (void *)(iommu_get_root(pt_root) + mode);
> +}
> +
> static inline bool is_rd890_iommu(struct pci_dev *pdev)
> {
> return (pdev->vendor == PCI_VENDOR_ID_ATI) &&
> diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h
> index ca8c4522045b..221adefa56a0 100644
> --- a/drivers/iommu/amd_iommu_types.h
> +++ b/drivers/iommu/amd_iommu_types.h
> @@ -468,8 +468,8 @@ struct protection_domain {
> iommu core code */
> spinlock_t lock; /* mostly used to lock the page table*/
> u16 id; /* the domain id written to the device table */
> - int mode; /* paging mode (0-6 levels) */
> - u64 *pt_root; /* page table root pointer */
> + u64 *pt_root; /* page table root pointer and paging mode (0-6
> + levels) */
> int glx; /* Number of levels for GCR3 table */
> u64 *gcr3_tbl; /* Guest CR3 table */
> unsigned long flags; /* flags to find out type of domain */
> --
> 2.21.0 (Apple Git-122.2)

2020-04-18 12:11:45

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

On Thu, Apr 16, 2020 at 09:42:41PM -0400, Qian Cai wrote:
> So, this is still not enough that would still trigger storage driver offline under
> memory pressure for a bit longer. It looks to me that in fetch_pte() there are
> could still racy?

Yes, your patch still looks racy. You need to atomically read
domain->pt_root to a stack variable and derive the pt_root pointer and
the mode from that variable instead of domain->pt_root directly. If you
read the domain->pt_root twice there could still be an update between
the two reads.
Probably the lock in increase_address_space() can also be avoided if
pt_root is updated using cmpxchg()?

Regards,

Joerg

2020-04-18 13:04:06

by Qian Cai

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()



> On Apr 18, 2020, at 8:10 AM, Joerg Roedel <[email protected]> wrote:
>
> Yes, your patch still looks racy. You need to atomically read
> domain->pt_root to a stack variable and derive the pt_root pointer and
> the mode from that variable instead of domain->pt_root directly. If you
> read the domain->pt_root twice there could still be an update between
> the two reads.
> Probably the lock in increase_address_space() can also be avoided if
> pt_root is updated using cmpxchg()?

Hard to tell without testing further. I’ll leave that optimization in the future, and focus on fixing those races first.

2020-04-18 18:36:35

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
> Hard to tell without testing further. I’ll leave that optimization in
> the future, and focus on fixing those races first.

Yeah right, we should fix the existing races first before introducing
new ones ;)

Btw, THANKS A LOT for tracking down all these race condition bugs, I am
not even remotely able to trigger them with the hardware I have around.

I did some hacking and the attached diff shows how I think this race
condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1,
but did no further testing. Can you test it please?

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..28229a38af4d 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -151,6 +151,26 @@ static struct protection_domain *to_pdomain(struct iommu_domain *dom)
return container_of(dom, struct protection_domain, domain);
}

+static void amd_iommu_domain_get_pgtable(struct protection_domain *domain,
+ struct domain_pgtable *pgtable)
+{
+ u64 pt_root = atomic64_read(&domain->pt_root);
+
+ pgtable->root = (u64 *)(pt_root & PAGE_MASK);
+ pgtable->mode = pt_root & 7; /* lowest 3 bits encode pgtable mode */
+}
+
+static u64 amd_iommu_domain_encode_pgtable(u64 *root, int mode)
+{
+ u64 pt_root;
+
+ /* lowest 3 bits encode pgtable mode */
+ pt_root = mode & 7;
+ pt_root |= (u64)root;
+
+ return pt_root;
+}
+
static struct iommu_dev_data *alloc_dev_data(u16 devid)
{
struct iommu_dev_data *dev_data;
@@ -1397,13 +1417,18 @@ static struct page *free_sub_pt(unsigned long root, int mode,

static void free_pagetable(struct protection_domain *domain)
{
- unsigned long root = (unsigned long)domain->pt_root;
+ struct domain_pgtable pgtable;
struct page *freelist = NULL;
+ unsigned long root;
+
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ atomic64_set(&domain->pt_root, 0);

- BUG_ON(domain->mode < PAGE_MODE_NONE ||
- domain->mode > PAGE_MODE_6_LEVEL);
+ BUG_ON(pgtable.mode < PAGE_MODE_NONE ||
+ pgtable.mode > PAGE_MODE_6_LEVEL);

- freelist = free_sub_pt(root, domain->mode, freelist);
+ root = (unsigned long)pgtable.root;
+ freelist = free_sub_pt(root, pgtable.mode, freelist);

free_page_list(freelist);
}
@@ -1417,24 +1442,28 @@ static bool increase_address_space(struct protection_domain *domain,
unsigned long address,
gfp_t gfp)
{
+ struct domain_pgtable pgtable;
unsigned long flags;
bool ret = false;
- u64 *pte;
+ u64 *pte, root;

spin_lock_irqsave(&domain->lock, flags);

- if (address <= PM_LEVEL_SIZE(domain->mode) ||
- WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+ if (address <= PM_LEVEL_SIZE(pgtable.mode) ||
+ WARN_ON_ONCE(pgtable.mode == PAGE_MODE_6_LEVEL))
goto out;

pte = (void *)get_zeroed_page(gfp);
if (!pte)
goto out;

- *pte = PM_LEVEL_PDE(domain->mode,
- iommu_virt_to_phys(domain->pt_root));
- domain->pt_root = pte;
- domain->mode += 1;
+ *pte = PM_LEVEL_PDE(pgtable.mode, iommu_virt_to_phys(pgtable.root));
+
+ root = amd_iommu_domain_encode_pgtable(pte, pgtable.mode + 1);
+
+ atomic64_set(&domain->pt_root, root);

ret = true;

@@ -1451,16 +1480,22 @@ static u64 *alloc_pte(struct protection_domain *domain,
gfp_t gfp,
bool *updated)
{
+ struct domain_pgtable pgtable;
int level, end_lvl;
u64 *pte, *page;

BUG_ON(!is_power_of_2(page_size));

- while (address > PM_LEVEL_SIZE(domain->mode))
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+ while (address > PM_LEVEL_SIZE(pgtable.mode)) {
*updated = increase_address_space(domain, address, gfp) || *updated;
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ }
+

- level = domain->mode - 1;
- pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+ level = pgtable.mode - 1;
+ pte = &pgtable.root[PM_LEVEL_INDEX(level, address)];
address = PAGE_SIZE_ALIGN(address, page_size);
end_lvl = PAGE_SIZE_LEVEL(page_size);

@@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain *domain,
unsigned long address,
unsigned long *page_size)
{
+ struct domain_pgtable pgtable;
int level;
u64 *pte;

*page_size = 0;

- if (address > PM_LEVEL_SIZE(domain->mode))
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+ if (address > PM_LEVEL_SIZE(pgtable.mode))
return NULL;

- level = domain->mode - 1;
- pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+ level = pgtable.mode - 1;
+ pte = &pgtable.root[PM_LEVEL_INDEX(level, address)];
*page_size = PTE_LEVEL_PAGE_SIZE(level);

while (level > 0) {
@@ -1806,6 +1844,7 @@ static void dma_ops_domain_free(struct protection_domain *domain)
static struct protection_domain *dma_ops_domain_alloc(void)
{
struct protection_domain *domain;
+ u64 *pt_root, root;

domain = kzalloc(sizeof(struct protection_domain), GFP_KERNEL);
if (!domain)
@@ -1814,12 +1853,14 @@ static struct protection_domain *dma_ops_domain_alloc(void)
if (protection_domain_init(domain))
goto free_domain;

- domain->mode = PAGE_MODE_3_LEVEL;
- domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
- domain->flags = PD_DMA_OPS_MASK;
- if (!domain->pt_root)
+ pt_root = (void *)get_zeroed_page(GFP_KERNEL);
+ if (!pt_root)
goto free_domain;

+ root = amd_iommu_domain_encode_pgtable(pt_root, PAGE_MODE_3_LEVEL);
+ atomic64_set(&domain->pt_root, root);
+ domain->flags = PD_DMA_OPS_MASK;
+
if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM)
goto free_domain;

@@ -1843,14 +1884,17 @@ static bool dma_ops_domain(struct protection_domain *domain)
static void set_dte_entry(u16 devid, struct protection_domain *domain,
bool ats, bool ppr)
{
+ struct domain_pgtable pgtable;
u64 pte_root = 0;
u64 flags = 0;
u32 old_domid;

- if (domain->mode != PAGE_MODE_NONE)
- pte_root = iommu_virt_to_phys(domain->pt_root);
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+ if (pgtable.mode != PAGE_MODE_NONE)
+ pte_root = iommu_virt_to_phys(pgtable.root);

- pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK)
+ pte_root |= (pgtable.mode & DEV_ENTRY_MODE_MASK)
<< DEV_ENTRY_MODE_SHIFT;
pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV;

@@ -2375,6 +2419,7 @@ static struct protection_domain *protection_domain_alloc(void)
static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
{
struct protection_domain *pdomain;
+ u64 *pt_root, root;

switch (type) {
case IOMMU_DOMAIN_UNMANAGED:
@@ -2382,13 +2427,15 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
if (!pdomain)
return NULL;

- pdomain->mode = PAGE_MODE_3_LEVEL;
- pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
- if (!pdomain->pt_root) {
+ pt_root = (void *)get_zeroed_page(GFP_KERNEL);
+ if (!pt_root) {
protection_domain_free(pdomain);
return NULL;
}

+ root = amd_iommu_domain_encode_pgtable(pt_root, PAGE_MODE_3_LEVEL);
+ atomic64_set(&pdomain->pt_root, root);
+
pdomain->domain.geometry.aperture_start = 0;
pdomain->domain.geometry.aperture_end = ~0ULL;
pdomain->domain.geometry.force_aperture = true;
@@ -2406,7 +2453,7 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
if (!pdomain)
return NULL;

- pdomain->mode = PAGE_MODE_NONE;
+ atomic64_set(&pdomain->pt_root, PAGE_MODE_NONE);
break;
default:
return NULL;
@@ -2418,6 +2465,7 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
static void amd_iommu_domain_free(struct iommu_domain *dom)
{
struct protection_domain *domain;
+ struct domain_pgtable pgtable;

domain = to_pdomain(dom);

@@ -2435,7 +2483,9 @@ static void amd_iommu_domain_free(struct iommu_domain *dom)
dma_ops_domain_free(domain);
break;
default:
- if (domain->mode != PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+ if (pgtable.mode != PAGE_MODE_NONE)
free_pagetable(domain);

if (domain->flags & PD_IOMMUV2_MASK)
@@ -2518,10 +2568,12 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
gfp_t gfp)
{
struct protection_domain *domain = to_pdomain(dom);
+ struct domain_pgtable pgtable;
int prot = 0;
int ret;

- if (domain->mode == PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ if (pgtable.mode == PAGE_MODE_NONE)
return -EINVAL;

if (iommu_prot & IOMMU_READ)
@@ -2541,8 +2593,10 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
struct iommu_iotlb_gather *gather)
{
struct protection_domain *domain = to_pdomain(dom);
+ struct domain_pgtable pgtable;

- if (domain->mode == PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ if (pgtable.mode == PAGE_MODE_NONE)
return 0;

return iommu_unmap_page(domain, iova, page_size);
@@ -2553,9 +2607,11 @@ static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
{
struct protection_domain *domain = to_pdomain(dom);
unsigned long offset_mask, pte_pgsize;
+ struct domain_pgtable pgtable;
u64 *pte, __pte;

- if (domain->mode == PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ if (pgtable.mode == PAGE_MODE_NONE)
return iova;

pte = fetch_pte(domain, iova, &pte_pgsize);
@@ -2708,16 +2764,26 @@ EXPORT_SYMBOL(amd_iommu_unregister_ppr_notifier);
void amd_iommu_domain_direct_map(struct iommu_domain *dom)
{
struct protection_domain *domain = to_pdomain(dom);
+ struct domain_pgtable pgtable;
unsigned long flags;
+ u64 pt_root;

spin_lock_irqsave(&domain->lock, flags);

+ /* First save pgtable configuration*/
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
/* Update data structure */
- domain->mode = PAGE_MODE_NONE;
+ pt_root = amd_iommu_domain_encode_pgtable(NULL, PAGE_MODE_NONE);
+ atomic64_set(&domain->pt_root, pt_root);

/* Make changes visible to IOMMUs */
update_domain(domain);

+ /* Restore old pgtable in domain->ptroot to free page-table */
+ pt_root = amd_iommu_domain_encode_pgtable(pgtable.root, pgtable.mode);
+ atomic64_set(&domain->pt_root, pt_root);
+
/* Page-table is not visible to IOMMU anymore, so free it */
free_pagetable(domain);

@@ -2908,9 +2974,11 @@ static u64 *__get_gcr3_pte(u64 *root, int level, int pasid, bool alloc)
static int __set_gcr3(struct protection_domain *domain, int pasid,
unsigned long cr3)
{
+ struct domain_pgtable pgtable;
u64 *pte;

- if (domain->mode != PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ if (pgtable.mode != PAGE_MODE_NONE)
return -EINVAL;

pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true);
@@ -2924,9 +2992,11 @@ static int __set_gcr3(struct protection_domain *domain, int pasid,

static int __clear_gcr3(struct protection_domain *domain, int pasid)
{
+ struct domain_pgtable pgtable;
u64 *pte;

- if (domain->mode != PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ if (pgtable.mode != PAGE_MODE_NONE)
return -EINVAL;

pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false);
diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h
index ca8c4522045b..7a8fdec138bd 100644
--- a/drivers/iommu/amd_iommu_types.h
+++ b/drivers/iommu/amd_iommu_types.h
@@ -468,8 +468,7 @@ struct protection_domain {
iommu core code */
spinlock_t lock; /* mostly used to lock the page table*/
u16 id; /* the domain id written to the device table */
- int mode; /* paging mode (0-6 levels) */
- u64 *pt_root; /* page table root pointer */
+ atomic64_t pt_root; /* pgtable root and pgtable mode */
int glx; /* Number of levels for GCR3 table */
u64 *gcr3_tbl; /* Guest CR3 table */
unsigned long flags; /* flags to find out type of domain */
@@ -477,6 +476,12 @@ struct protection_domain {
unsigned dev_iommu[MAX_IOMMUS]; /* per-IOMMU reference count */
};

+/* For decocded pt_root */
+struct domain_pgtable {
+ int mode;
+ u64 *root;
+};
+
/*
* Structure where we save information about one hardware AMD IOMMU in the
* system.

2020-04-20 02:10:28

by Qian Cai

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()



> On Apr 18, 2020, at 2:34 PM, Joerg Roedel <[email protected]> wrote:
>
> On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
>> Hard to tell without testing further. I’ll leave that optimization in
>> the future, and focus on fixing those races first.
>
> Yeah right, we should fix the existing races first before introducing
> new ones ;)
>
> Btw, THANKS A LOT for tracking down all these race condition bugs, I am
> not even remotely able to trigger them with the hardware I have around.
>
> I did some hacking and the attached diff shows how I think this race
> condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1,
> but did no further testing. Can you test it please?

Sure, give it a few days to see if it could survive.

2020-04-20 16:37:16

by Qian Cai

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()



> On Apr 18, 2020, at 2:34 PM, Joerg Roedel <[email protected]> wrote:
>
> On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
>> Hard to tell without testing further. I’ll leave that optimization in
>> the future, and focus on fixing those races first.
>
> Yeah right, we should fix the existing races first before introducing
> new ones ;)
>
> Btw, THANKS A LOT for tracking down all these race condition bugs, I am
> not even remotely able to trigger them with the hardware I have around.
>
> I did some hacking and the attached diff shows how I think this race
> condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1,
> but did no further testing. Can you test it please?

No dice. There could be some other races. For example,

> @@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain *domain,
> unsigned long address,
> unsigned long *page_size)
...
> amd_iommu_domain_get_pgtable(domain, &pgtable);
>
> if (address > PM_LEVEL_SIZE(pgtable.mode))
> return NULL;
>
> level = pgtable.mode - 1;
> pte = &pgtable.root[PM_LEVEL_INDEX(level, address)];

<— increase_address_space()

> *page_size = PTE_LEVEL_PAGE_SIZE(level);
>

while (level > 0) {
*page_size = PTE_LEVEL_PAGE_SIZE(level);

Then in iommu_unmap_page(),

while (unmapped < page_size) {
pte = fetch_pte(dom, bus_addr, &unmap_size);

bus_addr = (bus_addr & ~(unmap_size - 1)) + unmap_size;

bus_addr would be unsync with dom->mode when it enter fetch_pte() again.
Could that be a problem?


[ 5159.274829][ T4057] LTP: starting oom02
[ 5160.382787][ C52] perf: interrupt took too long (7443 > 6208), lowering kernel.perf_event_max_sample_rate to 26800
[ 5167.951785][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc0000 flags=0x0010]
[ 5167.964540][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1000 flags=0x0010]
[ 5167.977442][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1900 flags=0x0010]
[ 5167.989901][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1d00 flags=0x0010]
[ 5168.002291][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2000 flags=0x0010]
[ 5168.014665][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2400 flags=0x0010]
[ 5168.027132][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2800 flags=0x0010]
[ 5168.039566][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2c00 flags=0x0010]
[ 5168.051956][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc3000 flags=0x0010]
[ 5168.064310][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc3400 flags=0x0010]
[ 5168.076652][ T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xfffffffffffc3800 flags=0x0010]
[ 5168.088290][ T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xfffffffffffc3c00 flags=0x0010]
[ 5183.695829][ C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803
[ 5183.704390][ C8] smartpqi 0000:23:00.0: controller offline
[ 5183.756594][ C101] blk_update_request: I/O error, dev sda, sector 22306304 op 0x1:(WRITE) flags 0x8000000 phys_seg 4 prio class 0
[ 5183.756628][ C34] sd 0:1:0:0: [sda] tag#655 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756759][ C56] blk_update_request: I/O error, dev sda, sector 58480128 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756810][ C79] sd 0:1:0:0: [sda] tag#234 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756816][ C121] sd 0:1:0:0: [sda] tag#104 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756837][ T53] sd 0:1:0:0: [sda] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756882][ C121] sd 0:1:0:0: [sda] tag#104 CDB: opcode=0x2a 2a 00 00 4d d4 00 00 02 00 00
[ 5183.756892][ C79] sd 0:1:0:0: [sda] tag#234 CDB: opcode=0x2a 2a 00 02 03 e4 00 00 02 00 00
[ 5183.756909][ C121] blk_update_request: I/O error, dev sda, sector 5100544 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756920][ C79] blk_update_request: I/O error, dev sda, sector 33809408 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756939][ T53] sd 0:1:0:0: [sda] tag#4 CDB: opcode=0x2a 2a 00 02 4b f8 00 00 02 00 00
[ 5183.756967][ T53] blk_update_request: I/O error, dev sda, sector 38533120 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756989][ C49] blk_update_request: I/O error, dev sda, sector 30181376 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757045][ C51] sd 0:1:0:0: [sda] tag#452 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757107][ C51] sd 0:1:0:0: [sda] tag#452 CDB: opcode=0x2a 2a 00 02 95 06 00 00 02 00 00
[ 5183.757125][ C51] blk_update_request: I/O error, dev sda, sector 43320832 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757199][ C82] blk_update_request: I/O error, dev sda, sector 10187776 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757209][ C109] blk_update_request: I/O error, dev sda, sector 29812736 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757215][ C77] sd 0:1:0:0: [sda] tag#849 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757222][ C110] sd 0:1:0:0: [sda] tag#558 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757237][ C92] blk_update_request: I/O error, dev sda, sector 6410240 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757244][ C91] sd 0:1:0:0: [sda] tag#73 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757251][ C68] sd 0:1:0:0: [sda] tag#416 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757458][ C77] sd 0:1:0:0: [sda] tag#849 CDB: opcode=0x2a 2a 00 02 78 a4 00 00 02 00 00
[ 5183.757467][ C110] sd 0:1:0:0: [sda] tag#558 CDB: opcode=0x2a 2a 00 03 58 94 00 00 02 00 00
[ 5183.757515][ C122] sd 0:1:0:0: [sda] tag#747 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757525][ C68] sd 0:1:0:0: [sda] tag#416 CDB: opcode=0x2a 2a 00 01 0e 32 00 00 02 00 00
[ 5183.757536][ C91] sd 0:1:0:0: [sda] tag#73 CDB: opcode=0x2a 2a 00 01 a2 86 00 00 02 00 00
[ 5183.757727][ C122] sd 0:1:0:0: [sda] tag#747 CDB: opcode=0x2a 2a 00 02 a7 24 00 00 02 00 00
[ 5183.758530][ T53] Write-error on swap-device (254:1:64823296)
[ 5183.758758][ T53] Write-error on swap-device (254:1:35201024)
[ 5183.758811][ C105] Write-error on swap-device (254:1:52690944)
[ 5183.758959][ C82] Write-error on swap-device (254:1:6856704)



2020-04-29 08:49:54

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

Hi Qian,

On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
>
> No dice. There could be some other races. For example,

Okay, I think I know what is happening. The increase_address_space()
function increases the address space, but does not update the
DTE and does not flush the old DTE from the caches. But this needs to
happen before domain->pt_root is updated, because otherwise another CPU
can come along and map something into the increased address-space which
is not yet accessible by the device because the DTE is not updated yet.

Regards,

Joerg

2020-04-29 11:24:07

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
> No dice. There could be some other races. For example,

Can you please test this branch:

https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes

It has the previous fix in it and a couple more to make sure the
device-table is updated and flushed before increase_address_space()
updates domain->pt_root.

Thanks,

Joerg

2020-04-30 01:06:18

by Qian Cai

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()



> On Apr 29, 2020, at 7:20 AM, Joerg Roedel <[email protected]> wrote:
>
> On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
>> No dice. There could be some other races. For example,
>
> Can you please test this branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
>
> It has the previous fix in it and a couple more to make sure the
> device-table is updated and flushed before increase_address_space()
> updates domain->pt_root.

It looks promising as it survives for a day’s stress testing. I’ll keep it running for a few days to be sure.

2020-05-03 13:06:38

by Qian Cai

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()



> On Apr 29, 2020, at 7:20 AM, Joerg Roedel <[email protected]> wrote:
>
> On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
>> No dice. There could be some other races. For example,
>
> Can you please test this branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
>
> It has the previous fix in it and a couple more to make sure the
> device-table is updated and flushed before increase_address_space()
> updates domain->pt_root.

I believe this closed the existing races as it had survived for many days. Great work!

2020-05-03 18:42:04

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

Hi Qian,

On Sun, May 03, 2020 at 09:04:03AM -0400, Qian Cai wrote:
> > On Apr 29, 2020, at 7:20 AM, Joerg Roedel <[email protected]> wrote:
> > Can you please test this branch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
> >
> > It has the previous fix in it and a couple more to make sure the
> > device-table is updated and flushed before increase_address_space()
> > updates domain->pt_root.
>
> I believe this closed the existing races as it had survived for many
> days. Great work!

Thanks a lot for testing these changes! Can I add your Tested-by when I
send them to the mailing list tomorrow?

Regards,

Joerg

2020-05-03 19:16:32

by Qian Cai

[permalink] [raw]
Subject: Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()



> On May 3, 2020, at 2:39 PM, Joerg Roedel <[email protected]> wrote:
>
> Can I add your Tested-by when I
> send them to the mailing list tomorrow?

Sure. Feel free to add,

Tested-by: Qian Cai <[email protected]>