fetch_pte() could race with increase_address_space() because it held no
lock from iommu_unmap_page(). On the CPU that runs fetch_pte() it could
see a stale domain->pt_root and a new increased domain->mode from
increase_address_space(). As the result, it could trigger invalid
accesses later on. Fix it by using a pair of smp_[w|r]mb in those
places.
kernel BUG at drivers/iommu/amd_iommu.c:1704!
BUG_ON(unmapped && !is_power_of_2(unmapped));
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
RIP: 0010:amd_iommu_unmap+0x1b2/0x1d0
Call Trace:
<IRQ>
__iommu_unmap+0x106/0x320
iommu_unmap_fast+0xe/0x10
__iommu_dma_unmap+0xdc/0x1a0
iommu_dma_unmap_sg+0xae/0xd0
scsi_dma_unmap+0xe7/0x150
pqi_raid_io_complete+0x37/0x60 [smartpqi]
pqi_irq_handler+0x1fc/0x13f0 [smartpqi]
__handle_irq_event_percpu+0x78/0x4f0
handle_irq_event_percpu+0x70/0x100
handle_irq_event+0x5a/0x8b
handle_edge_irq+0x10c/0x370
do_IRQ+0x9e/0x1e0
common_interrupt+0xf/0xf
</IRQ>
Signed-off-by: Qian Cai <[email protected]>
---
drivers/iommu/amd_iommu.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..22328a23335f 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1434,6 +1434,11 @@ static bool increase_address_space(struct protection_domain *domain,
*pte = PM_LEVEL_PDE(domain->mode,
iommu_virt_to_phys(domain->pt_root));
domain->pt_root = pte;
+ /*
+ * Make sure fetch_pte() will see the new domain->pt_root before it
+ * snapshots domain->mode.
+ */
+ smp_wmb();
domain->mode += 1;
ret = true;
@@ -1460,6 +1465,8 @@ static u64 *alloc_pte(struct protection_domain *domain,
*updated = increase_address_space(domain, address, gfp) || *updated;
level = domain->mode - 1;
+ /* To pair with smp_wmb() in increase_address_space(). */
+ smp_rmb();
pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
address = PAGE_SIZE_ALIGN(address, page_size);
end_lvl = PAGE_SIZE_LEVEL(page_size);
@@ -1545,6 +1552,8 @@ static u64 *fetch_pte(struct protection_domain *domain,
return NULL;
level = domain->mode - 1;
+ /* To pair with smp_wmb() in increase_address_space(). */
+ smp_rmb();
pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
*page_size = PTE_LEVEL_PAGE_SIZE(level);
--
2.21.0 (Apple Git-122.2)
> On Apr 6, 2020, at 10:12 PM, Qian Cai <[email protected]> wrote:
>
> fetch_pte() could race with increase_address_space() because it held no
> lock from iommu_unmap_page(). On the CPU that runs fetch_pte() it could
> see a stale domain->pt_root and a new increased domain->mode from
> increase_address_space(). As the result, it could trigger invalid
> accesses later on. Fix it by using a pair of smp_[w|r]mb in those
> places.
After further testing, the change along is insufficient. What I am chasing right
now is the swap device will go offline after heavy memory pressure below. The
symptom is similar to what we have in the commit,
754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
Apparently, it is no possible to take the domain->lock in fetch_pte() because it
could sleep.
Thoughts?
[ 7292.727377][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd80000 flags=0x0010]
[ 7292.740571][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81000 flags=0x0010]
[ 7292.753268][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81900 flags=0x0010]
[ 7292.766078][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81d00 flags=0x0010]
[ 7292.778447][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82000 flags=0x0010]
[ 7292.790724][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82400 flags=0x0010]
[ 7292.803195][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82800 flags=0x0010]
[ 7292.815664][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82c00 flags=0x0010]
[ 7292.828089][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd83000 flags=0x0010]
[ 7292.840468][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd83400 flags=0x0010]
[ 7292.852795][ T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffd83800 flags=0x0010]
[ 7292.864566][ T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffd83c00 flags=0x0010]
[ 7326.583908][ C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803
[ 7326.592386][ C8] smartpqi 0000:23:00.0: controller offline
[ 7326.663728][ C66] sd 0:1:0:0: [sda] tag#467 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=6s
[ 7326.664321][ T2738] smartpqi 0000:23:00.0: resetting scsi 0:1:0:0
[ 7326.673849][ C66] sd 0:1:0:0: [sda] tag#467 CDB: opcode=0x28 28 00 02 ee 2e 60 00 00 08 00
[ 7326.680118][ T2738] smartpqi 0000:23:00.0: reset of scsi 0:1:0:0: FAILED
[ 7326.688612][ C66] blk_update_request: I/O error, dev sda, sector 49163872 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 7326.695456][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.706632][ C66] Read-error on swap-device (254:1:45833824)
[ 7326.714030][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.723382][T45523] sd 0:1:0:0: rejecting I/O to offline device
[ 7326.727352][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.727379][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.727442][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
>
> kernel BUG at drivers/iommu/amd_iommu.c:1704!
> BUG_ON(unmapped && !is_power_of_2(unmapped));
> Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> RIP: 0010:amd_iommu_unmap+0x1b2/0x1d0
> Call Trace:
> <IRQ>
> __iommu_unmap+0x106/0x320
> iommu_unmap_fast+0xe/0x10
> __iommu_dma_unmap+0xdc/0x1a0
> iommu_dma_unmap_sg+0xae/0xd0
> scsi_dma_unmap+0xe7/0x150
> pqi_raid_io_complete+0x37/0x60 [smartpqi]
> pqi_irq_handler+0x1fc/0x13f0 [smartpqi]
> __handle_irq_event_percpu+0x78/0x4f0
> handle_irq_event_percpu+0x70/0x100
> handle_irq_event+0x5a/0x8b
> handle_edge_irq+0x10c/0x370
> do_IRQ+0x9e/0x1e0
> common_interrupt+0xf/0xf
> </IRQ>
>
> Signed-off-by: Qian Cai <[email protected]>
> ---
> drivers/iommu/amd_iommu.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 20cce366e951..22328a23335f 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -1434,6 +1434,11 @@ static bool increase_address_space(struct protection_domain *domain,
> *pte = PM_LEVEL_PDE(domain->mode,
> iommu_virt_to_phys(domain->pt_root));
> domain->pt_root = pte;
> + /*
> + * Make sure fetch_pte() will see the new domain->pt_root before it
> + * snapshots domain->mode.
> + */
> + smp_wmb();
> domain->mode += 1;
>
> ret = true;
> @@ -1460,6 +1465,8 @@ static u64 *alloc_pte(struct protection_domain *domain,
> *updated = increase_address_space(domain, address, gfp) || *updated;
>
> level = domain->mode - 1;
> + /* To pair with smp_wmb() in increase_address_space(). */
> + smp_rmb();
> pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> address = PAGE_SIZE_ALIGN(address, page_size);
> end_lvl = PAGE_SIZE_LEVEL(page_size);
> @@ -1545,6 +1552,8 @@ static u64 *fetch_pte(struct protection_domain *domain,
> return NULL;
>
> level = domain->mode - 1;
> + /* To pair with smp_wmb() in increase_address_space(). */
> + smp_rmb();
> pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> *page_size = PTE_LEVEL_PAGE_SIZE(level);
>
> --
> 2.21.0 (Apple Git-122.2)
>
Hi Qian,
On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
> After further testing, the change along is insufficient. What I am chasing right
> now is the swap device will go offline after heavy memory pressure below. The
> symptom is similar to what we have in the commit,
>
> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>
> Apparently, it is no possible to take the domain->lock in fetch_pte() because it
> could sleep.
Thanks a lot for finding and tracking down another race in the AMD IOMMU
page-table code. The domain->lock is a spin-lock and taking it can't
sleep. But fetch_pte() is a fast-path and must not take any locks.
I think the best fix is to update the pt_root and mode of the domain
atomically by storing the mode in the lower 12 bits of pt_root. This way
they are stored together and can be read/write atomically.
Regards,
Joerg
> On Apr 8, 2020, at 10:19 AM, Joerg Roedel <[email protected]> wrote:
>
> Hi Qian,
>
> On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
>> After further testing, the change along is insufficient. What I am chasing right
>> now is the swap device will go offline after heavy memory pressure below. The
>> symptom is similar to what we have in the commit,
>>
>> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>>
>> Apparently, it is no possible to take the domain->lock in fetch_pte() because it
>> could sleep.
>
> Thanks a lot for finding and tracking down another race in the AMD IOMMU
> page-table code. The domain->lock is a spin-lock and taking it can't
> sleep. But fetch_pte() is a fast-path and must not take any locks.
>
> I think the best fix is to update the pt_root and mode of the domain
> atomically by storing the mode in the lower 12 bits of pt_root. This way
> they are stored together and can be read/write atomically.
Like this?
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..b36c6b07cbfd 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, int mode,
static void free_pagetable(struct protection_domain *domain)
{
- unsigned long root = (unsigned long)domain->pt_root;
+ int level = iommu_get_mode(domain->pt_root);
+ unsigned long root = iommu_get_root(domain->pt_root);
struct page *freelist = NULL;
- BUG_ON(domain->mode < PAGE_MODE_NONE ||
- domain->mode > PAGE_MODE_6_LEVEL);
+ BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL);
- freelist = free_sub_pt(root, domain->mode, freelist);
+ freelist = free_sub_pt(root, level, freelist);
free_page_list(freelist);
}
@@ -1417,24 +1417,27 @@ static bool increase_address_space(struct protection_domain *domain,
unsigned long address,
gfp_t gfp)
{
+ int level;
unsigned long flags;
bool ret = false;
u64 *pte;
spin_lock_irqsave(&domain->lock, flags);
- if (address <= PM_LEVEL_SIZE(domain->mode) ||
- WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
+ level = iommu_get_mode(domain->pt_root);
+
+ if (address <= PM_LEVEL_SIZE(level) ||
+ WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL))
goto out;
pte = (void *)get_zeroed_page(gfp);
if (!pte)
goto out;
- *pte = PM_LEVEL_PDE(domain->mode,
- iommu_virt_to_phys(domain->pt_root));
- domain->pt_root = pte;
- domain->mode += 1;
+ *pte = PM_LEVEL_PDE(level,
+ iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root)));
+
+ WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1);
ret = true;
@@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain *domain,
bool *updated)
{
int level, end_lvl;
- u64 *pte, *page;
+ u64 *pte, *page, *pt_root, *root;
BUG_ON(!is_power_of_2(page_size));
- while (address > PM_LEVEL_SIZE(domain->mode))
+ while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root)))
*updated = increase_address_space(domain, address, gfp) || *updated;
- level = domain->mode - 1;
- pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+ pt_root = READ_ONCE(domain->pt_root);
+ root = (void *)iommu_get_root(pt_root);
+ level = iommu_get_mode(pt_root) - 1;
+ pte = &root[PM_LEVEL_INDEX(level, address)];
address = PAGE_SIZE_ALIGN(address, page_size);
end_lvl = PAGE_SIZE_LEVEL(page_size);
@@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain *domain,
unsigned long address,
unsigned long *page_size)
{
- int level;
u64 *pte;
+ u64 *pt_root = READ_ONCE(domain->pt_root);
+ u64 *root = (void *)iommu_get_root(pt_root);
+ int level = iommu_get_mode(pt_root);
*page_size = 0;
- if (address > PM_LEVEL_SIZE(domain->mode))
+ if (address > PM_LEVEL_SIZE(level))
return NULL;
- level = domain->mode - 1;
- pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+ level--;
+ pte = &root[PM_LEVEL_INDEX(level, address)];
*page_size = PTE_LEVEL_PAGE_SIZE(level);
while (level > 0) {
@@ -1814,12 +1821,13 @@ static struct protection_domain *dma_ops_domain_alloc(void)
if (protection_domain_init(domain))
goto free_domain;
- domain->mode = PAGE_MODE_3_LEVEL;
domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
domain->flags = PD_DMA_OPS_MASK;
if (!domain->pt_root)
goto free_domain;
+ domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_3_LEVEL);
+
if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM)
goto free_domain;
@@ -1847,10 +1855,10 @@ static void set_dte_entry(u16 devid, struct protection_domain *domain,
u64 flags = 0;
u32 old_domid;
- if (domain->mode != PAGE_MODE_NONE)
- pte_root = iommu_virt_to_phys(domain->pt_root);
+ if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
+ pte_root = iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root));
- pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK)
+ pte_root |= ((unsigned long)domain->pt_root & DEV_ENTRY_MODE_MASK)
<< DEV_ENTRY_MODE_SHIFT;
pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV;
@@ -2382,13 +2390,14 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
if (!pdomain)
return NULL;
- pdomain->mode = PAGE_MODE_3_LEVEL;
pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
if (!pdomain->pt_root) {
protection_domain_free(pdomain);
return NULL;
}
+ pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
+ PAGE_MODE_3_LEVEL);
pdomain->domain.geometry.aperture_start = 0;
pdomain->domain.geometry.aperture_end = ~0ULL;
pdomain->domain.geometry.force_aperture = true;
@@ -2406,7 +2415,8 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
if (!pdomain)
return NULL;
- pdomain->mode = PAGE_MODE_NONE;
+ pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
+ PAGE_MODE_NONE);
break;
default:
return NULL;
@@ -2435,7 +2445,7 @@ static void amd_iommu_domain_free(struct iommu_domain *dom)
dma_ops_domain_free(domain);
break;
default:
- if (domain->mode != PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
free_pagetable(domain);
if (domain->flags & PD_IOMMUV2_MASK)
@@ -2521,7 +2531,7 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
int prot = 0;
int ret;
- if (domain->mode == PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
return -EINVAL;
if (iommu_prot & IOMMU_READ)
@@ -2542,7 +2552,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
{
struct protection_domain *domain = to_pdomain(dom);
- if (domain->mode == PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
return 0;
return iommu_unmap_page(domain, iova, page_size);
@@ -2555,7 +2565,7 @@ static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
unsigned long offset_mask, pte_pgsize;
u64 *pte, __pte;
- if (domain->mode == PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
return iova;
pte = fetch_pte(domain, iova, &pte_pgsize);
@@ -2713,7 +2723,7 @@ void amd_iommu_domain_direct_map(struct iommu_domain *dom)
spin_lock_irqsave(&domain->lock, flags);
/* Update data structure */
- domain->mode = PAGE_MODE_NONE;
+ domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_NONE);
/* Make changes visible to IOMMUs */
update_domain(domain);
@@ -2910,7 +2920,7 @@ static int __set_gcr3(struct protection_domain *domain, int pasid,
{
u64 *pte;
- if (domain->mode != PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
return -EINVAL;
pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true);
@@ -2926,7 +2936,7 @@ static int __clear_gcr3(struct protection_domain *domain, int pasid)
{
u64 *pte;
- if (domain->mode != PAGE_MODE_NONE)
+ if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
return -EINVAL;
pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false);
diff --git a/drivers/iommu/amd_iommu_proto.h b/drivers/iommu/amd_iommu_proto.h
index 92c2ba6468a0..34d4dd66cf9b 100644
--- a/drivers/iommu/amd_iommu_proto.h
+++ b/drivers/iommu/amd_iommu_proto.h
@@ -67,6 +67,21 @@ static inline int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
extern int amd_iommu_complete_ppr(struct pci_dev *pdev, int pasid,
int status, int tag);
+static inline int iommu_get_mode(void *pt_root)
+{
+ return (unsigned long)pt_root & ~PAGE_MASK;
+}
+
+static inline unsigned long iommu_get_root(void *pt_root)
+{
+ return (unsigned long)pt_root & PAGE_MASK;
+}
+
+static inline void *iommu_set_mode(void *pt_root, int mode)
+{
+ return (void *)(iommu_get_root(pt_root) + mode);
+}
+
static inline bool is_rd890_iommu(struct pci_dev *pdev)
{
return (pdev->vendor == PCI_VENDOR_ID_ATI) &&
diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h
index ca8c4522045b..221adefa56a0 100644
--- a/drivers/iommu/amd_iommu_types.h
+++ b/drivers/iommu/amd_iommu_types.h
@@ -468,8 +468,8 @@ struct protection_domain {
iommu core code */
spinlock_t lock; /* mostly used to lock the page table*/
u16 id; /* the domain id written to the device table */
- int mode; /* paging mode (0-6 levels) */
- u64 *pt_root; /* page table root pointer */
+ u64 *pt_root; /* page table root pointer and paging mode (0-6
+ levels) */
int glx; /* Number of levels for GCR3 table */
u64 *gcr3_tbl; /* Guest CR3 table */
unsigned long flags; /* flags to find out type of domain */
--
2.21.0 (Apple Git-122.2)
> On Apr 13, 2020, at 9:36 PM, Qian Cai <[email protected]> wrote:
>
>
>
>> On Apr 8, 2020, at 10:19 AM, Joerg Roedel <[email protected]> wrote:
>>
>> Hi Qian,
>>
>> On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
>>> After further testing, the change along is insufficient. What I am chasing right
>>> now is the swap device will go offline after heavy memory pressure below. The
>>> symptom is similar to what we have in the commit,
>>>
>>> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>>>
>>> Apparently, it is no possible to take the domain->lock in fetch_pte() because it
>>> could sleep.
>>
>> Thanks a lot for finding and tracking down another race in the AMD IOMMU
>> page-table code. The domain->lock is a spin-lock and taking it can't
>> sleep. But fetch_pte() is a fast-path and must not take any locks.
>>
>> I think the best fix is to update the pt_root and mode of the domain
>> atomically by storing the mode in the lower 12 bits of pt_root. This way
>> they are stored together and can be read/write atomically.
>
> Like this?
So, this is still not enough that would still trigger storage driver offline under
memory pressure for a bit longer. It looks to me that in fetch_pte() there are
could still racy?
level = domain->mode - 1;
pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
<— increase_address_space();
*page_size = PTE_LEVEL_PAGE_SIZE(level);
while (level > 0) {
*page_size = PTE_LEVEL_PAGE_SIZE(level);
Then in iommu_unmap_page(),
while (unmapped < page_size) {
pte = fetch_pte(dom, bus_addr, &unmap_size);
…
bus_addr = (bus_addr & ~(unmap_size - 1)) + unmap_size;
bus_addr would be unsync with dom->mode when it enter fetch_pte() again.
Could that be a problem?
>
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 20cce366e951..b36c6b07cbfd 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, int mode,
>
> static void free_pagetable(struct protection_domain *domain)
> {
> - unsigned long root = (unsigned long)domain->pt_root;
> + int level = iommu_get_mode(domain->pt_root);
> + unsigned long root = iommu_get_root(domain->pt_root);
> struct page *freelist = NULL;
>
> - BUG_ON(domain->mode < PAGE_MODE_NONE ||
> - domain->mode > PAGE_MODE_6_LEVEL);
> + BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL);
>
> - freelist = free_sub_pt(root, domain->mode, freelist);
> + freelist = free_sub_pt(root, level, freelist);
>
> free_page_list(freelist);
> }
> @@ -1417,24 +1417,27 @@ static bool increase_address_space(struct protection_domain *domain,
> unsigned long address,
> gfp_t gfp)
> {
> + int level;
> unsigned long flags;
> bool ret = false;
> u64 *pte;
>
> spin_lock_irqsave(&domain->lock, flags);
>
> - if (address <= PM_LEVEL_SIZE(domain->mode) ||
> - WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
> + level = iommu_get_mode(domain->pt_root);
> +
> + if (address <= PM_LEVEL_SIZE(level) ||
> + WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL))
> goto out;
>
> pte = (void *)get_zeroed_page(gfp);
> if (!pte)
> goto out;
>
> - *pte = PM_LEVEL_PDE(domain->mode,
> - iommu_virt_to_phys(domain->pt_root));
> - domain->pt_root = pte;
> - domain->mode += 1;
> + *pte = PM_LEVEL_PDE(level,
> + iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root)));
> +
> + WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1);
>
> ret = true;
>
> @@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain *domain,
> bool *updated)
> {
> int level, end_lvl;
> - u64 *pte, *page;
> + u64 *pte, *page, *pt_root, *root;
>
> BUG_ON(!is_power_of_2(page_size));
>
> - while (address > PM_LEVEL_SIZE(domain->mode))
> + while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root)))
> *updated = increase_address_space(domain, address, gfp) || *updated;
>
> - level = domain->mode - 1;
> - pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> + pt_root = READ_ONCE(domain->pt_root);
> + root = (void *)iommu_get_root(pt_root);
> + level = iommu_get_mode(pt_root) - 1;
> + pte = &root[PM_LEVEL_INDEX(level, address)];
> address = PAGE_SIZE_ALIGN(address, page_size);
> end_lvl = PAGE_SIZE_LEVEL(page_size);
>
> @@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain *domain,
> unsigned long address,
> unsigned long *page_size)
> {
> - int level;
> u64 *pte;
> + u64 *pt_root = READ_ONCE(domain->pt_root);
> + u64 *root = (void *)iommu_get_root(pt_root);
> + int level = iommu_get_mode(pt_root);
>
> *page_size = 0;
>
> - if (address > PM_LEVEL_SIZE(domain->mode))
> + if (address > PM_LEVEL_SIZE(level))
> return NULL;
>
> - level = domain->mode - 1;
> - pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> + level--;
> + pte = &root[PM_LEVEL_INDEX(level, address)];
> *page_size = PTE_LEVEL_PAGE_SIZE(level);
>
> while (level > 0) {
> @@ -1814,12 +1821,13 @@ static struct protection_domain *dma_ops_domain_alloc(void)
> if (protection_domain_init(domain))
> goto free_domain;
>
> - domain->mode = PAGE_MODE_3_LEVEL;
> domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
> domain->flags = PD_DMA_OPS_MASK;
> if (!domain->pt_root)
> goto free_domain;
>
> + domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_3_LEVEL);
> +
> if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM)
> goto free_domain;
>
> @@ -1847,10 +1855,10 @@ static void set_dte_entry(u16 devid, struct protection_domain *domain,
> u64 flags = 0;
> u32 old_domid;
>
> - if (domain->mode != PAGE_MODE_NONE)
> - pte_root = iommu_virt_to_phys(domain->pt_root);
> + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> + pte_root = iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root));
>
> - pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK)
> + pte_root |= ((unsigned long)domain->pt_root & DEV_ENTRY_MODE_MASK)
> << DEV_ENTRY_MODE_SHIFT;
> pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV;
>
> @@ -2382,13 +2390,14 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
> if (!pdomain)
> return NULL;
>
> - pdomain->mode = PAGE_MODE_3_LEVEL;
> pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
> if (!pdomain->pt_root) {
> protection_domain_free(pdomain);
> return NULL;
> }
>
> + pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
> + PAGE_MODE_3_LEVEL);
> pdomain->domain.geometry.aperture_start = 0;
> pdomain->domain.geometry.aperture_end = ~0ULL;
> pdomain->domain.geometry.force_aperture = true;
> @@ -2406,7 +2415,8 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
> if (!pdomain)
> return NULL;
>
> - pdomain->mode = PAGE_MODE_NONE;
> + pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
> + PAGE_MODE_NONE);
> break;
> default:
> return NULL;
> @@ -2435,7 +2445,7 @@ static void amd_iommu_domain_free(struct iommu_domain *dom)
> dma_ops_domain_free(domain);
> break;
> default:
> - if (domain->mode != PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> free_pagetable(domain);
>
> if (domain->flags & PD_IOMMUV2_MASK)
> @@ -2521,7 +2531,7 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
> int prot = 0;
> int ret;
>
> - if (domain->mode == PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
> return -EINVAL;
>
> if (iommu_prot & IOMMU_READ)
> @@ -2542,7 +2552,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
> {
> struct protection_domain *domain = to_pdomain(dom);
>
> - if (domain->mode == PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
> return 0;
>
> return iommu_unmap_page(domain, iova, page_size);
> @@ -2555,7 +2565,7 @@ static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
> unsigned long offset_mask, pte_pgsize;
> u64 *pte, __pte;
>
> - if (domain->mode == PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
> return iova;
>
> pte = fetch_pte(domain, iova, &pte_pgsize);
> @@ -2713,7 +2723,7 @@ void amd_iommu_domain_direct_map(struct iommu_domain *dom)
> spin_lock_irqsave(&domain->lock, flags);
>
> /* Update data structure */
> - domain->mode = PAGE_MODE_NONE;
> + domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_NONE);
>
> /* Make changes visible to IOMMUs */
> update_domain(domain);
> @@ -2910,7 +2920,7 @@ static int __set_gcr3(struct protection_domain *domain, int pasid,
> {
> u64 *pte;
>
> - if (domain->mode != PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> return -EINVAL;
>
> pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true);
> @@ -2926,7 +2936,7 @@ static int __clear_gcr3(struct protection_domain *domain, int pasid)
> {
> u64 *pte;
>
> - if (domain->mode != PAGE_MODE_NONE)
> + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> return -EINVAL;
>
> pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false);
> diff --git a/drivers/iommu/amd_iommu_proto.h b/drivers/iommu/amd_iommu_proto.h
> index 92c2ba6468a0..34d4dd66cf9b 100644
> --- a/drivers/iommu/amd_iommu_proto.h
> +++ b/drivers/iommu/amd_iommu_proto.h
> @@ -67,6 +67,21 @@ static inline int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
> extern int amd_iommu_complete_ppr(struct pci_dev *pdev, int pasid,
> int status, int tag);
>
> +static inline int iommu_get_mode(void *pt_root)
> +{
> + return (unsigned long)pt_root & ~PAGE_MASK;
> +}
> +
> +static inline unsigned long iommu_get_root(void *pt_root)
> +{
> + return (unsigned long)pt_root & PAGE_MASK;
> +}
> +
> +static inline void *iommu_set_mode(void *pt_root, int mode)
> +{
> + return (void *)(iommu_get_root(pt_root) + mode);
> +}
> +
> static inline bool is_rd890_iommu(struct pci_dev *pdev)
> {
> return (pdev->vendor == PCI_VENDOR_ID_ATI) &&
> diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h
> index ca8c4522045b..221adefa56a0 100644
> --- a/drivers/iommu/amd_iommu_types.h
> +++ b/drivers/iommu/amd_iommu_types.h
> @@ -468,8 +468,8 @@ struct protection_domain {
> iommu core code */
> spinlock_t lock; /* mostly used to lock the page table*/
> u16 id; /* the domain id written to the device table */
> - int mode; /* paging mode (0-6 levels) */
> - u64 *pt_root; /* page table root pointer */
> + u64 *pt_root; /* page table root pointer and paging mode (0-6
> + levels) */
> int glx; /* Number of levels for GCR3 table */
> u64 *gcr3_tbl; /* Guest CR3 table */
> unsigned long flags; /* flags to find out type of domain */
> --
> 2.21.0 (Apple Git-122.2)
On Thu, Apr 16, 2020 at 09:42:41PM -0400, Qian Cai wrote:
> So, this is still not enough that would still trigger storage driver offline under
> memory pressure for a bit longer. It looks to me that in fetch_pte() there are
> could still racy?
Yes, your patch still looks racy. You need to atomically read
domain->pt_root to a stack variable and derive the pt_root pointer and
the mode from that variable instead of domain->pt_root directly. If you
read the domain->pt_root twice there could still be an update between
the two reads.
Probably the lock in increase_address_space() can also be avoided if
pt_root is updated using cmpxchg()?
Regards,
Joerg
> On Apr 18, 2020, at 8:10 AM, Joerg Roedel <[email protected]> wrote:
>
> Yes, your patch still looks racy. You need to atomically read
> domain->pt_root to a stack variable and derive the pt_root pointer and
> the mode from that variable instead of domain->pt_root directly. If you
> read the domain->pt_root twice there could still be an update between
> the two reads.
> Probably the lock in increase_address_space() can also be avoided if
> pt_root is updated using cmpxchg()?
Hard to tell without testing further. I’ll leave that optimization in the future, and focus on fixing those races first.
On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
> Hard to tell without testing further. I’ll leave that optimization in
> the future, and focus on fixing those races first.
Yeah right, we should fix the existing races first before introducing
new ones ;)
Btw, THANKS A LOT for tracking down all these race condition bugs, I am
not even remotely able to trigger them with the hardware I have around.
I did some hacking and the attached diff shows how I think this race
condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1,
but did no further testing. Can you test it please?
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..28229a38af4d 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -151,6 +151,26 @@ static struct protection_domain *to_pdomain(struct iommu_domain *dom)
return container_of(dom, struct protection_domain, domain);
}
+static void amd_iommu_domain_get_pgtable(struct protection_domain *domain,
+ struct domain_pgtable *pgtable)
+{
+ u64 pt_root = atomic64_read(&domain->pt_root);
+
+ pgtable->root = (u64 *)(pt_root & PAGE_MASK);
+ pgtable->mode = pt_root & 7; /* lowest 3 bits encode pgtable mode */
+}
+
+static u64 amd_iommu_domain_encode_pgtable(u64 *root, int mode)
+{
+ u64 pt_root;
+
+ /* lowest 3 bits encode pgtable mode */
+ pt_root = mode & 7;
+ pt_root |= (u64)root;
+
+ return pt_root;
+}
+
static struct iommu_dev_data *alloc_dev_data(u16 devid)
{
struct iommu_dev_data *dev_data;
@@ -1397,13 +1417,18 @@ static struct page *free_sub_pt(unsigned long root, int mode,
static void free_pagetable(struct protection_domain *domain)
{
- unsigned long root = (unsigned long)domain->pt_root;
+ struct domain_pgtable pgtable;
struct page *freelist = NULL;
+ unsigned long root;
+
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ atomic64_set(&domain->pt_root, 0);
- BUG_ON(domain->mode < PAGE_MODE_NONE ||
- domain->mode > PAGE_MODE_6_LEVEL);
+ BUG_ON(pgtable.mode < PAGE_MODE_NONE ||
+ pgtable.mode > PAGE_MODE_6_LEVEL);
- freelist = free_sub_pt(root, domain->mode, freelist);
+ root = (unsigned long)pgtable.root;
+ freelist = free_sub_pt(root, pgtable.mode, freelist);
free_page_list(freelist);
}
@@ -1417,24 +1442,28 @@ static bool increase_address_space(struct protection_domain *domain,
unsigned long address,
gfp_t gfp)
{
+ struct domain_pgtable pgtable;
unsigned long flags;
bool ret = false;
- u64 *pte;
+ u64 *pte, root;
spin_lock_irqsave(&domain->lock, flags);
- if (address <= PM_LEVEL_SIZE(domain->mode) ||
- WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+ if (address <= PM_LEVEL_SIZE(pgtable.mode) ||
+ WARN_ON_ONCE(pgtable.mode == PAGE_MODE_6_LEVEL))
goto out;
pte = (void *)get_zeroed_page(gfp);
if (!pte)
goto out;
- *pte = PM_LEVEL_PDE(domain->mode,
- iommu_virt_to_phys(domain->pt_root));
- domain->pt_root = pte;
- domain->mode += 1;
+ *pte = PM_LEVEL_PDE(pgtable.mode, iommu_virt_to_phys(pgtable.root));
+
+ root = amd_iommu_domain_encode_pgtable(pte, pgtable.mode + 1);
+
+ atomic64_set(&domain->pt_root, root);
ret = true;
@@ -1451,16 +1480,22 @@ static u64 *alloc_pte(struct protection_domain *domain,
gfp_t gfp,
bool *updated)
{
+ struct domain_pgtable pgtable;
int level, end_lvl;
u64 *pte, *page;
BUG_ON(!is_power_of_2(page_size));
- while (address > PM_LEVEL_SIZE(domain->mode))
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+ while (address > PM_LEVEL_SIZE(pgtable.mode)) {
*updated = increase_address_space(domain, address, gfp) || *updated;
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ }
+
- level = domain->mode - 1;
- pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+ level = pgtable.mode - 1;
+ pte = &pgtable.root[PM_LEVEL_INDEX(level, address)];
address = PAGE_SIZE_ALIGN(address, page_size);
end_lvl = PAGE_SIZE_LEVEL(page_size);
@@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain *domain,
unsigned long address,
unsigned long *page_size)
{
+ struct domain_pgtable pgtable;
int level;
u64 *pte;
*page_size = 0;
- if (address > PM_LEVEL_SIZE(domain->mode))
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+ if (address > PM_LEVEL_SIZE(pgtable.mode))
return NULL;
- level = domain->mode - 1;
- pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+ level = pgtable.mode - 1;
+ pte = &pgtable.root[PM_LEVEL_INDEX(level, address)];
*page_size = PTE_LEVEL_PAGE_SIZE(level);
while (level > 0) {
@@ -1806,6 +1844,7 @@ static void dma_ops_domain_free(struct protection_domain *domain)
static struct protection_domain *dma_ops_domain_alloc(void)
{
struct protection_domain *domain;
+ u64 *pt_root, root;
domain = kzalloc(sizeof(struct protection_domain), GFP_KERNEL);
if (!domain)
@@ -1814,12 +1853,14 @@ static struct protection_domain *dma_ops_domain_alloc(void)
if (protection_domain_init(domain))
goto free_domain;
- domain->mode = PAGE_MODE_3_LEVEL;
- domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
- domain->flags = PD_DMA_OPS_MASK;
- if (!domain->pt_root)
+ pt_root = (void *)get_zeroed_page(GFP_KERNEL);
+ if (!pt_root)
goto free_domain;
+ root = amd_iommu_domain_encode_pgtable(pt_root, PAGE_MODE_3_LEVEL);
+ atomic64_set(&domain->pt_root, root);
+ domain->flags = PD_DMA_OPS_MASK;
+
if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM)
goto free_domain;
@@ -1843,14 +1884,17 @@ static bool dma_ops_domain(struct protection_domain *domain)
static void set_dte_entry(u16 devid, struct protection_domain *domain,
bool ats, bool ppr)
{
+ struct domain_pgtable pgtable;
u64 pte_root = 0;
u64 flags = 0;
u32 old_domid;
- if (domain->mode != PAGE_MODE_NONE)
- pte_root = iommu_virt_to_phys(domain->pt_root);
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+ if (pgtable.mode != PAGE_MODE_NONE)
+ pte_root = iommu_virt_to_phys(pgtable.root);
- pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK)
+ pte_root |= (pgtable.mode & DEV_ENTRY_MODE_MASK)
<< DEV_ENTRY_MODE_SHIFT;
pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV;
@@ -2375,6 +2419,7 @@ static struct protection_domain *protection_domain_alloc(void)
static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
{
struct protection_domain *pdomain;
+ u64 *pt_root, root;
switch (type) {
case IOMMU_DOMAIN_UNMANAGED:
@@ -2382,13 +2427,15 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
if (!pdomain)
return NULL;
- pdomain->mode = PAGE_MODE_3_LEVEL;
- pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
- if (!pdomain->pt_root) {
+ pt_root = (void *)get_zeroed_page(GFP_KERNEL);
+ if (!pt_root) {
protection_domain_free(pdomain);
return NULL;
}
+ root = amd_iommu_domain_encode_pgtable(pt_root, PAGE_MODE_3_LEVEL);
+ atomic64_set(&pdomain->pt_root, root);
+
pdomain->domain.geometry.aperture_start = 0;
pdomain->domain.geometry.aperture_end = ~0ULL;
pdomain->domain.geometry.force_aperture = true;
@@ -2406,7 +2453,7 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
if (!pdomain)
return NULL;
- pdomain->mode = PAGE_MODE_NONE;
+ atomic64_set(&pdomain->pt_root, PAGE_MODE_NONE);
break;
default:
return NULL;
@@ -2418,6 +2465,7 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
static void amd_iommu_domain_free(struct iommu_domain *dom)
{
struct protection_domain *domain;
+ struct domain_pgtable pgtable;
domain = to_pdomain(dom);
@@ -2435,7 +2483,9 @@ static void amd_iommu_domain_free(struct iommu_domain *dom)
dma_ops_domain_free(domain);
break;
default:
- if (domain->mode != PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+ if (pgtable.mode != PAGE_MODE_NONE)
free_pagetable(domain);
if (domain->flags & PD_IOMMUV2_MASK)
@@ -2518,10 +2568,12 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
gfp_t gfp)
{
struct protection_domain *domain = to_pdomain(dom);
+ struct domain_pgtable pgtable;
int prot = 0;
int ret;
- if (domain->mode == PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ if (pgtable.mode == PAGE_MODE_NONE)
return -EINVAL;
if (iommu_prot & IOMMU_READ)
@@ -2541,8 +2593,10 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
struct iommu_iotlb_gather *gather)
{
struct protection_domain *domain = to_pdomain(dom);
+ struct domain_pgtable pgtable;
- if (domain->mode == PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ if (pgtable.mode == PAGE_MODE_NONE)
return 0;
return iommu_unmap_page(domain, iova, page_size);
@@ -2553,9 +2607,11 @@ static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
{
struct protection_domain *domain = to_pdomain(dom);
unsigned long offset_mask, pte_pgsize;
+ struct domain_pgtable pgtable;
u64 *pte, __pte;
- if (domain->mode == PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ if (pgtable.mode == PAGE_MODE_NONE)
return iova;
pte = fetch_pte(domain, iova, &pte_pgsize);
@@ -2708,16 +2764,26 @@ EXPORT_SYMBOL(amd_iommu_unregister_ppr_notifier);
void amd_iommu_domain_direct_map(struct iommu_domain *dom)
{
struct protection_domain *domain = to_pdomain(dom);
+ struct domain_pgtable pgtable;
unsigned long flags;
+ u64 pt_root;
spin_lock_irqsave(&domain->lock, flags);
+ /* First save pgtable configuration*/
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+
/* Update data structure */
- domain->mode = PAGE_MODE_NONE;
+ pt_root = amd_iommu_domain_encode_pgtable(NULL, PAGE_MODE_NONE);
+ atomic64_set(&domain->pt_root, pt_root);
/* Make changes visible to IOMMUs */
update_domain(domain);
+ /* Restore old pgtable in domain->ptroot to free page-table */
+ pt_root = amd_iommu_domain_encode_pgtable(pgtable.root, pgtable.mode);
+ atomic64_set(&domain->pt_root, pt_root);
+
/* Page-table is not visible to IOMMU anymore, so free it */
free_pagetable(domain);
@@ -2908,9 +2974,11 @@ static u64 *__get_gcr3_pte(u64 *root, int level, int pasid, bool alloc)
static int __set_gcr3(struct protection_domain *domain, int pasid,
unsigned long cr3)
{
+ struct domain_pgtable pgtable;
u64 *pte;
- if (domain->mode != PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ if (pgtable.mode != PAGE_MODE_NONE)
return -EINVAL;
pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true);
@@ -2924,9 +2992,11 @@ static int __set_gcr3(struct protection_domain *domain, int pasid,
static int __clear_gcr3(struct protection_domain *domain, int pasid)
{
+ struct domain_pgtable pgtable;
u64 *pte;
- if (domain->mode != PAGE_MODE_NONE)
+ amd_iommu_domain_get_pgtable(domain, &pgtable);
+ if (pgtable.mode != PAGE_MODE_NONE)
return -EINVAL;
pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false);
diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h
index ca8c4522045b..7a8fdec138bd 100644
--- a/drivers/iommu/amd_iommu_types.h
+++ b/drivers/iommu/amd_iommu_types.h
@@ -468,8 +468,7 @@ struct protection_domain {
iommu core code */
spinlock_t lock; /* mostly used to lock the page table*/
u16 id; /* the domain id written to the device table */
- int mode; /* paging mode (0-6 levels) */
- u64 *pt_root; /* page table root pointer */
+ atomic64_t pt_root; /* pgtable root and pgtable mode */
int glx; /* Number of levels for GCR3 table */
u64 *gcr3_tbl; /* Guest CR3 table */
unsigned long flags; /* flags to find out type of domain */
@@ -477,6 +476,12 @@ struct protection_domain {
unsigned dev_iommu[MAX_IOMMUS]; /* per-IOMMU reference count */
};
+/* For decocded pt_root */
+struct domain_pgtable {
+ int mode;
+ u64 *root;
+};
+
/*
* Structure where we save information about one hardware AMD IOMMU in the
* system.
> On Apr 18, 2020, at 2:34 PM, Joerg Roedel <[email protected]> wrote:
>
> On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
>> Hard to tell without testing further. I’ll leave that optimization in
>> the future, and focus on fixing those races first.
>
> Yeah right, we should fix the existing races first before introducing
> new ones ;)
>
> Btw, THANKS A LOT for tracking down all these race condition bugs, I am
> not even remotely able to trigger them with the hardware I have around.
>
> I did some hacking and the attached diff shows how I think this race
> condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1,
> but did no further testing. Can you test it please?
Sure, give it a few days to see if it could survive.
> On Apr 18, 2020, at 2:34 PM, Joerg Roedel <[email protected]> wrote:
>
> On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
>> Hard to tell without testing further. I’ll leave that optimization in
>> the future, and focus on fixing those races first.
>
> Yeah right, we should fix the existing races first before introducing
> new ones ;)
>
> Btw, THANKS A LOT for tracking down all these race condition bugs, I am
> not even remotely able to trigger them with the hardware I have around.
>
> I did some hacking and the attached diff shows how I think this race
> condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1,
> but did no further testing. Can you test it please?
No dice. There could be some other races. For example,
> @@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain *domain,
> unsigned long address,
> unsigned long *page_size)
...
> amd_iommu_domain_get_pgtable(domain, &pgtable);
>
> if (address > PM_LEVEL_SIZE(pgtable.mode))
> return NULL;
>
> level = pgtable.mode - 1;
> pte = &pgtable.root[PM_LEVEL_INDEX(level, address)];
<— increase_address_space()
> *page_size = PTE_LEVEL_PAGE_SIZE(level);
>
while (level > 0) {
*page_size = PTE_LEVEL_PAGE_SIZE(level);
Then in iommu_unmap_page(),
while (unmapped < page_size) {
pte = fetch_pte(dom, bus_addr, &unmap_size);
…
bus_addr = (bus_addr & ~(unmap_size - 1)) + unmap_size;
bus_addr would be unsync with dom->mode when it enter fetch_pte() again.
Could that be a problem?
[ 5159.274829][ T4057] LTP: starting oom02
[ 5160.382787][ C52] perf: interrupt took too long (7443 > 6208), lowering kernel.perf_event_max_sample_rate to 26800
[ 5167.951785][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc0000 flags=0x0010]
[ 5167.964540][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1000 flags=0x0010]
[ 5167.977442][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1900 flags=0x0010]
[ 5167.989901][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1d00 flags=0x0010]
[ 5168.002291][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2000 flags=0x0010]
[ 5168.014665][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2400 flags=0x0010]
[ 5168.027132][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2800 flags=0x0010]
[ 5168.039566][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2c00 flags=0x0010]
[ 5168.051956][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc3000 flags=0x0010]
[ 5168.064310][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc3400 flags=0x0010]
[ 5168.076652][ T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xfffffffffffc3800 flags=0x0010]
[ 5168.088290][ T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xfffffffffffc3c00 flags=0x0010]
[ 5183.695829][ C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803
[ 5183.704390][ C8] smartpqi 0000:23:00.0: controller offline
[ 5183.756594][ C101] blk_update_request: I/O error, dev sda, sector 22306304 op 0x1:(WRITE) flags 0x8000000 phys_seg 4 prio class 0
[ 5183.756628][ C34] sd 0:1:0:0: [sda] tag#655 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756759][ C56] blk_update_request: I/O error, dev sda, sector 58480128 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756810][ C79] sd 0:1:0:0: [sda] tag#234 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756816][ C121] sd 0:1:0:0: [sda] tag#104 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756837][ T53] sd 0:1:0:0: [sda] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756882][ C121] sd 0:1:0:0: [sda] tag#104 CDB: opcode=0x2a 2a 00 00 4d d4 00 00 02 00 00
[ 5183.756892][ C79] sd 0:1:0:0: [sda] tag#234 CDB: opcode=0x2a 2a 00 02 03 e4 00 00 02 00 00
[ 5183.756909][ C121] blk_update_request: I/O error, dev sda, sector 5100544 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756920][ C79] blk_update_request: I/O error, dev sda, sector 33809408 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756939][ T53] sd 0:1:0:0: [sda] tag#4 CDB: opcode=0x2a 2a 00 02 4b f8 00 00 02 00 00
[ 5183.756967][ T53] blk_update_request: I/O error, dev sda, sector 38533120 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756989][ C49] blk_update_request: I/O error, dev sda, sector 30181376 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757045][ C51] sd 0:1:0:0: [sda] tag#452 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757107][ C51] sd 0:1:0:0: [sda] tag#452 CDB: opcode=0x2a 2a 00 02 95 06 00 00 02 00 00
[ 5183.757125][ C51] blk_update_request: I/O error, dev sda, sector 43320832 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757199][ C82] blk_update_request: I/O error, dev sda, sector 10187776 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757209][ C109] blk_update_request: I/O error, dev sda, sector 29812736 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757215][ C77] sd 0:1:0:0: [sda] tag#849 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757222][ C110] sd 0:1:0:0: [sda] tag#558 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757237][ C92] blk_update_request: I/O error, dev sda, sector 6410240 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757244][ C91] sd 0:1:0:0: [sda] tag#73 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757251][ C68] sd 0:1:0:0: [sda] tag#416 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757458][ C77] sd 0:1:0:0: [sda] tag#849 CDB: opcode=0x2a 2a 00 02 78 a4 00 00 02 00 00
[ 5183.757467][ C110] sd 0:1:0:0: [sda] tag#558 CDB: opcode=0x2a 2a 00 03 58 94 00 00 02 00 00
[ 5183.757515][ C122] sd 0:1:0:0: [sda] tag#747 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757525][ C68] sd 0:1:0:0: [sda] tag#416 CDB: opcode=0x2a 2a 00 01 0e 32 00 00 02 00 00
[ 5183.757536][ C91] sd 0:1:0:0: [sda] tag#73 CDB: opcode=0x2a 2a 00 01 a2 86 00 00 02 00 00
[ 5183.757727][ C122] sd 0:1:0:0: [sda] tag#747 CDB: opcode=0x2a 2a 00 02 a7 24 00 00 02 00 00
[ 5183.758530][ T53] Write-error on swap-device (254:1:64823296)
[ 5183.758758][ T53] Write-error on swap-device (254:1:35201024)
[ 5183.758811][ C105] Write-error on swap-device (254:1:52690944)
[ 5183.758959][ C82] Write-error on swap-device (254:1:6856704)
Hi Qian,
On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
>
> No dice. There could be some other races. For example,
Okay, I think I know what is happening. The increase_address_space()
function increases the address space, but does not update the
DTE and does not flush the old DTE from the caches. But this needs to
happen before domain->pt_root is updated, because otherwise another CPU
can come along and map something into the increased address-space which
is not yet accessible by the device because the DTE is not updated yet.
Regards,
Joerg
On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
> No dice. There could be some other races. For example,
Can you please test this branch:
https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
It has the previous fix in it and a couple more to make sure the
device-table is updated and flushed before increase_address_space()
updates domain->pt_root.
Thanks,
Joerg
> On Apr 29, 2020, at 7:20 AM, Joerg Roedel <[email protected]> wrote:
>
> On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
>> No dice. There could be some other races. For example,
>
> Can you please test this branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
>
> It has the previous fix in it and a couple more to make sure the
> device-table is updated and flushed before increase_address_space()
> updates domain->pt_root.
It looks promising as it survives for a day’s stress testing. I’ll keep it running for a few days to be sure.
> On Apr 29, 2020, at 7:20 AM, Joerg Roedel <[email protected]> wrote:
>
> On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
>> No dice. There could be some other races. For example,
>
> Can you please test this branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
>
> It has the previous fix in it and a couple more to make sure the
> device-table is updated and flushed before increase_address_space()
> updates domain->pt_root.
I believe this closed the existing races as it had survived for many days. Great work!
Hi Qian,
On Sun, May 03, 2020 at 09:04:03AM -0400, Qian Cai wrote:
> > On Apr 29, 2020, at 7:20 AM, Joerg Roedel <[email protected]> wrote:
> > Can you please test this branch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
> >
> > It has the previous fix in it and a couple more to make sure the
> > device-table is updated and flushed before increase_address_space()
> > updates domain->pt_root.
>
> I believe this closed the existing races as it had survived for many
> days. Great work!
Thanks a lot for testing these changes! Can I add your Tested-by when I
send them to the mailing list tomorrow?
Regards,
Joerg
> On May 3, 2020, at 2:39 PM, Joerg Roedel <[email protected]> wrote:
>
> Can I add your Tested-by when I
> send them to the mailing list tomorrow?
Sure. Feel free to add,
Tested-by: Qian Cai <[email protected]>