2019-09-04 21:25:58

by Qian Cai

Subject: [RFC PATCH] iommu/amd: fix a race in increase_address_space()

When the system is under memory pressure, it can cause disks attached
through the "smartpqi" driver to go offline, as shown in the log below.
The UBSAN report indicates that "domain->mode" becomes 7. Further
investigation shows that the only place that increases "domain->mode"
is increase_address_space(), but that function has this check before
actually increasing it:

if (domain->mode == PAGE_MODE_6_LEVEL)
/* address space already 64 bit large */
return false;

This gives a clue that there is a race between multiple threads calling
increase_address_space() concurrently.

In amd_iommu_map() and amd_iommu_unmap(), iommu_map_page() and
iommu_unmap_page() are called with mutex_lock()/mutex_unlock() around
them, but there are several other places that call iommu_map_page()
with interrupts enabled and without taking the lock. One of them for
sure is alloc_coherent(), which calls __map_single(); a comment there
says the caller needs to hold a lock, but none is held. I am less sure
about map_page() and map_sg(), where my investigation so far indicates
that many call sites are in atomic context.

smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000
address=0xff702a00 flags=0x0010]
UBSAN: Undefined behaviour in drivers/iommu/amd_iommu.c:1464:29
shift exponent 66 is too large for 64-bit type 'long unsigned int'
CPU: 39 PID: 821 Comm: kswapd4 Tainted: G O
5.3.0-rc7-next-20190903+ #4
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40
07/10/2019
Call Trace:
dump_stack+0x62/0x9a
ubsan_epilogue+0xd/0x3a
smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000
address=0xff702c80 flags=0x0010]
__ubsan_handle_shift_out_of_bounds.cold.12+0x21/0x68
iommu_map_page.cold.39+0x7f/0x84
map_sg+0x1ce/0x2f0
scsi_dma_map+0xc6/0x160
pqi_raid_submit_scsi_cmd_with_io_request+0x1c3/0x470 [smartpqi]
smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000
address=0xff702900 flags=0x0010]
pqi_scsi_queue_command+0x7d6/0xeb0 [smartpqi]
scsi_queue_rq+0x7ee/0x12b0
__blk_mq_try_issue_directly+0x295/0x3f0
blk_mq_try_issue_directly+0xad/0x130
blk_mq_make_request+0xb12/0xee0
generic_make_request+0x179/0x4a0
submit_bio+0xaa/0x270
__swap_writepage+0x8f5/0xba0
swap_writepage+0x65/0xb0
pageout.isra.3+0x3e5/0xa00
shrink_page_list+0x15a0/0x2660
shrink_inactive_list+0x373/0x770
shrink_node_memcg+0x4ff/0x1550
shrink_node+0x123/0x800
balance_pgdat+0x493/0x9b0
kswapd+0x39b/0x7f0
kthread+0x1df/0x200
ret_from_fork+0x22/0x40
========================================================================
smartpqi 0000:23:00.0: resetting scsi 0:1:0:0
INFO: task khugepaged:797 blocked for more than 122 seconds.
Tainted: G O 5.3.0-rc7-next-20190903+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
khugepaged D22656 797 2 0x80004000
Call Trace:
__schedule+0x4bb/0xc20
schedule+0x6c/0x140
io_schedule+0x21/0x50
rq_qos_wait+0x18e/0x2b0
wbt_wait+0x12b/0x1b0
__rq_qos_throttle+0x3b/0x60
blk_mq_make_request+0x217/0xee0
generic_make_request+0x179/0x4a0
submit_bio+0xaa/0x270
__swap_writepage+0x8f5/0xba0
swap_writepage+0x65/0xb0
pageout.isra.3+0x3e5/0xa00
shrink_page_list+0x15a0/0x2660
shrink_inactive_list+0x373/0x770
shrink_node_memcg+0x4ff/0x1550
shrink_node+0x123/0x800
do_try_to_free_pages+0x22f/0x820
try_to_free_pages+0x242/0x4d0
__alloc_pages_nodemask+0x9f6/0x1bb0
khugepaged_alloc_page+0x6f/0xf0
collapse_huge_page+0x103/0x1060
khugepaged_scan_pmd+0x840/0xa70
khugepaged+0x1571/0x18d0
kthread+0x1df/0x200
ret_from_fork+0x22/0x40
INFO: task systemd-journal:3112 blocked for more than 123 seconds.
Tainted: G O 5.3.0-rc7-next-20190903+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
systemd-journal D20744 3112 1 0x00004120
Call Trace:
__do_page_cache_readahead+0x149/0x3a0
filemap_fault+0x9da/0x1050
__xfs_filemap_fault+0x167/0x450 [xfs]
xfs_filemap_fault+0x68/0x70 [xfs]
__do_fault+0x83/0x220
__handle_mm_fault+0x1034/0x1a50
handle_mm_fault+0x17f/0x37e
__do_page_fault+0x369/0x630
do_page_fault+0x50/0x2d3
page_fault+0x2f/0x40
RIP: 0033:0x56088e81faa0
Code: Bad RIP value.
RSP: 002b:00007ffdc2158ba8 EFLAGS: 000102_iter+0x135/0x850
shrink_node+0x123/0x800
do_try_to_free_pages+0x22f/0x820
try_to_free_pages+0x242/0x4d0
__alloc_pages_nodemask+0x9f6/0x1bb0
[11967.13967 __swap_writepage+0x8f5/0xba0
swap_writepage+0x65/0xb0
pageout.isra.3+0x3e5/0xa00
shrink_page_list+0x15a0/0x2660
io_schedule+0x21/0x50
rq_qos_wait+0x18e/0x2b0
wbt_wait+0x12b/0x1b0
__rq_qos_throttle+0x3b/0x60
blk_mq_make_request+0x217/0xee0
schedule+0x6c/0x140
io_schedule+0x21/0x50
rq_qos_wait+0x18e/0x2b0
wbt_wait+0x12b/0x1b0
__rq_qos_throttle+0x3b/0x60
blk_mq_make_request+0x217/0xee0
[11968.69198.0+0x60/0x60
swap_writepage+0x65/0xb0
pageout.isra.3+0x3e5/0xa00
shrink_page_list+0x15a0/0x2660
generic_make_request+0x179/0x4a0
submit_bio+0xaa/0x270
__swap_writepage+0x8f5/0xba0
swap_writepage+0x65/0xb0
pageout.isra.3+0x3e5/0xa00
__swap_writepage+0x8f5/0xba0
swap_writepage+0x65/0xb0
pageout.isra.3+0x3e5/0xa00
shrink_page_list+0x15a0/0x2660
wbt_wait+0x12b/0x1b0
__rq_qos_throttle+0x3b/0x60
blk_mq_make_request+0x217/0xee0
generic_make_request+0x179/0x4a0
submit_bio+0xaa/0x270
__swap_writepage+0x8f5/0xba0
1 lock held by khungtaskd/791:
#0: ffffffffa1b8b360 (rcu_read_lock){....}, at:
debug_show_all_locks+0x33/0x16a
1 lock held by khugepaged/797:
#0: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by kswapd0/820:
#0: ffffffffa1cc55c0 (fs_reclaim){....}, at:
__fs_reclaim_acquire+0x5/0x30
#1: ffff88846fa43538 (&md->io_barrier){....}, at:
dm_get_live_table+0x5/0x70 [dm_mod]
2 locks held by kswapd4/821:
#0: ffffffffa1cc55c0 (fs_reclaim){....}, at:
__fs_reclaim_acquire+0x5/0x30
#1: ffff88846fa43538 (&md->io_barrier){....}, at:
dm_get_live_table+0x5/0x70 [dm_mod]
1 lock held by scsi_eh_0/1577:
#0: ffff8884cb52cca0 (&ctrl_info->lun_reset_mutex){....}, at:
pqi_eh_device_reset_handler+0x313/0x970 [smartpqi]
[11acquire.part.15+0x5/0x30
2 locks held by oom01/121253:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121254:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121255:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121256:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
[11971.6650x5/0x30
2 locks held by oom01/121278:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121279:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121280:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121281:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acqfffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121299:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121300:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121301:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121302:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, 2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121320:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121321:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121322:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121323:
#0: ffff888338118840 (&om01/121340:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121341:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121342:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121343:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121366:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121367:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121368:
#0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
#1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121369:
smartpqi 0000:23:00.0: no heartbeat detected - last heartbeat count:
11794
smartpqi 0000:23:00.0: controller offline
smartpqi 0000:23:00.0: reset of scsi 0:1:0:0: FAILED
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
========================================================================
sd 0:1:0:0: Device offlined - not ready after error recovery
UBSAN: Undefined behaviour in drivers/iommu/amd_iommu.c:1526:28
shift exponent 66 is too large for 64-bit type 'long unsigned int'
CPU: 8 PID: 49038 Comm: kworker/8:0 Tainted: G O
5.3.0-rc7-next-20190903+ #4
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40
07/10/2019
sd 0:1:0:0: Device offlined - not ready after error recovery
Workqueue: events pqi_ctrl_offline_worker [smartpqi]
sd 0:1:0:0: Device offlined - not ready after error recovery
Call Trace:
dump_stack+0x62/0x9a
ubsan_epilogue+0xd/0x3a
sd 0:1:0:0: Device offlined - not ready after error recovery
__ubsan_handle_shift_out_of_bounds.cold.12+0x21/0x68
fetch_pte.cold.23+0x9c/0xcf
iommu_unmap_page+0xcd/0x180
sd 0:1:0:0: Device offlined - not ready after error recovery
__unmap_single.isra.18+0x6a/0x120
unmap_sg+0x74/0x90
scsi_dma_unmap+0xdc/0x140
sd 0:1:0:0: Device offlined - not ready after error recovery
pqi_raid_io_complete+0x2d/0x60 [smartpqi]
pqi_ctrl_offline_worker+0x128/0x270 [smartpqi]
sd 0:1:0:0: Device offlined - not ready after error recovery
process_one_work+0x53b/0xa70
worker_thread+0x63/0x5b0
kthread+0x1df/0x200
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
ret_from_fork+0x22/0x40
sd 0:1:0:0: Device offlined - not ready after error recovery
========================================================================
sd 0:1:0:0: Device offlined - not ready after error recovery
========================================================================
sd 0:1:0:0: Device offlined - not ready after error recovery
UBSAN: Undefined behaviour in drivers/iommu/amd_iommu.c:1527:16
shift exponent 66 is too large for 64-bit type 'long long unsigned int'
CPU: 8 PID: 49038 Comm: kworker/8:0 Tainted: G O
5.3.0-rc7-next-20190903+ #4
sd 0:1:0:0: Device offlined - not ready after error recovery
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40
07/10/2019
Workqueue: events pqi_ctrl_offline_worker [smartpqi]
Call Trace:
dump_stack+0x62/0x9a
sd 0:1:0:0: Device offlined - not ready after error recovery
ubsan_epilogue+0xd/0x3a
sd 0:1:0:0: Device offlined - not ready after error recovery
__ubsan_handle_shift_out_of_bounds.cold.12+0x21/0x68
sd 0:1:0:0: Device offlined - not ready after error recovery
fetch_pte.cold.23+0xca/0xcf
iommu_unmap_page+0xcd/0x180
__unmap_single.isra.18+0x6a/0x120
sd 0:1:0:0: Device offlined - not ready after error recovery
unmap_sg+0x74/0x90
sd 0:1:0:0: Device offlined - not ready after error recovery
scsi_dma_unmap+0xdc/0x140
pqi_raid_io_complete+0x2d/0x60 [smartpqi]
pqi_ctrl_offline_worker+0x128/0x270 [smartpqi]
sd 0:1:0:0: Device offlined - not ready after error recovery
process_one_work+0x53b/0xa70
worker_thread+0x63/0x5b0
kthread+0x1df/0x200
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
ret_from_fork+0x22/0x40
========================================================================
sd 0:1:0:0: rejecting I/O to offline device
sd 0:1:0:0: Device offlined - not ready after error recovery
blk_update_request: I/O error, dev sda, 43208)
Write-error on swap-device (254:1:29843216)
Write-error on swap-device (254:1:29843224)
Write-error on swap-device (254:1:29843232)
Write-error on swap-device (254:1:29843240)
---
drivers/iommu/amd_iommu.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index b607a92791d3..3874a0f5e197 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2660,8 +2660,10 @@ static void *alloc_coherent(struct device *dev, size_t size,
if (!dma_mask)
dma_mask = *dev->dma_mask;

+ mutex_lock(&domain->api_lock);
*dma_addr = __map_single(dev, dma_dom, page_to_phys(page),
size, DMA_BIDIRECTIONAL, dma_mask);
+ mutex_unlock(&domain->api_lock);

if (*dma_addr == DMA_MAPPING_ERROR)
goto out_free;
@@ -2696,7 +2698,9 @@ static void free_coherent(struct device *dev, size_t size,

dma_dom = to_dma_ops_domain(domain);

+ mutex_lock(&domain->api_lock);
__unmap_single(dma_dom, dma_addr, size, DMA_BIDIRECTIONAL);
+ mutex_unlock(&domain->api_lock);

free_mem:
if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
--
1.8.3.1


2019-09-05 13:15:48

by Jörg Rödel

Subject: Re: [RFC PATCH] iommu/amd: fix a race in increase_address_space()

Hi Qian,

On Wed, Sep 04, 2019 at 05:24:22PM -0400, Qian Cai wrote:
> if (domain->mode == PAGE_MODE_6_LEVEL)
> /* address space already 64 bit large */
> return false;
>
> This gives a clue that there must be a race between multiple concurrent
> threads in increase_address_space().

Thanks for tracking this down, there is a race indeed.

> + mutex_lock(&domain->api_lock);
> *dma_addr = __map_single(dev, dma_dom, page_to_phys(page),
> size, DMA_BIDIRECTIONAL, dma_mask);
> + mutex_unlock(&domain->api_lock);
>
> if (*dma_addr == DMA_MAPPING_ERROR)
> goto out_free;
> @@ -2696,7 +2698,9 @@ static void free_coherent(struct device *dev, size_t size,
>
> dma_dom = to_dma_ops_domain(domain);
>
> + mutex_lock(&domain->api_lock);
> __unmap_single(dma_dom, dma_addr, size, DMA_BIDIRECTIONAL);
> + mutex_unlock(&domain->api_lock);

But I think the right fix is to lock the operation in
increase_address_space() directly, and not the calls around it, like in
the diff below. It is untested, so can you please try it and report back
if it fixes your issue?

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index b607a92791d3..1ff705f16239 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1424,18 +1424,21 @@ static void free_pagetable(struct protection_domain *domain)
* another level increases the size of the address space by 9 bits to a size up
* to 64 bits.
*/
-static bool increase_address_space(struct protection_domain *domain,
+static void increase_address_space(struct protection_domain *domain,
gfp_t gfp)
{
+ unsigned long flags;
u64 *pte;

- if (domain->mode == PAGE_MODE_6_LEVEL)
+ spin_lock_irqsave(&domain->lock, flags);
+
+ if (WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
/* address space already 64 bit large */
- return false;
+ goto out;

pte = (void *)get_zeroed_page(gfp);
if (!pte)
- return false;
+ goto out;

*pte = PM_LEVEL_PDE(domain->mode,
iommu_virt_to_phys(domain->pt_root));
@@ -1443,7 +1446,10 @@ static bool increase_address_space(struct protection_domain *domain,
domain->mode += 1;
domain->updated = true;

- return true;
+out:
+ spin_unlock_irqrestore(&domain->lock, flags);
+
+ return;
}

static u64 *alloc_pte(struct protection_domain *domain,

2019-09-06 08:18:48

by Qian Cai

Subject: Re: [RFC PATCH] iommu/amd: fix a race in increase_address_space()

On Thu, 2019-09-05 at 13:43 +0200, Joerg Roedel wrote:
> Hi Qian,
>
> On Wed, Sep 04, 2019 at 05:24:22PM -0400, Qian Cai wrote:
> > if (domain->mode == PAGE_MODE_6_LEVEL)
> > /* address space already 64 bit large */
> > return false;
> >
> > This gives a clue that there must be a race between multiple concurrent
> > threads in increase_address_space().
>
> Thanks for tracking this down, there is a race indeed.
>
> > + mutex_lock(&domain->api_lock);
> > *dma_addr = __map_single(dev, dma_dom, page_to_phys(page),
> > size, DMA_BIDIRECTIONAL, dma_mask);
> > + mutex_unlock(&domain->api_lock);
> >
> > if (*dma_addr == DMA_MAPPING_ERROR)
> > goto out_free;
> > @@ -2696,7 +2698,9 @@ static void free_coherent(struct device *dev, size_t size,
> >
> > dma_dom = to_dma_ops_domain(domain);
> >
> > + mutex_lock(&domain->api_lock);
> > __unmap_single(dma_dom, dma_addr, size, DMA_BIDIRECTIONAL);
> > + mutex_unlock(&domain->api_lock);
>
> But I think the right fix is to lock the operation in
> increase_address_space() directly, and not the calls around it, like in
> the diff below. It is untested, so can you please try it and report back
> if it fixes your issue?

Yes, it works great so far.

>
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index b607a92791d3..1ff705f16239 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -1424,18 +1424,21 @@ static void free_pagetable(struct protection_domain *domain)
> * another level increases the size of the address space by 9 bits to a size up
> * to 64 bits.
> */
> -static bool increase_address_space(struct protection_domain *domain,
> +static void increase_address_space(struct protection_domain *domain,
> gfp_t gfp)
> {
> + unsigned long flags;
> u64 *pte;
>
> - if (domain->mode == PAGE_MODE_6_LEVEL)
> + spin_lock_irqsave(&domain->lock, flags);
> +
> + if (WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
> /* address space already 64 bit large */
> - return false;
> + goto out;
>
> pte = (void *)get_zeroed_page(gfp);
> if (!pte)
> - return false;
> + goto out;
>
> *pte = PM_LEVEL_PDE(domain->mode,
> iommu_virt_to_phys(domain->pt_root));
> @@ -1443,7 +1446,10 @@ static bool increase_address_space(struct protection_domain *domain,
> domain->mode += 1;
> domain->updated = true;
>
> - return true;
> +out:
> + spin_unlock_irqrestore(&domain->lock, flags);
> +
> + return;
> }
>
> static u64 *alloc_pte(struct protection_domain *domain,

2019-09-06 11:11:49

by Jörg Rödel

Subject: [PATCH] iommu/amd: Fix race in increase_address_space()

On Thu, Sep 05, 2019 at 04:57:21PM -0400, Qian Cai wrote:
> On Thu, 2019-09-05 at 13:43 +0200, Joerg Roedel wrote:
> > But I think the right fix is to lock the operation in
> > increase_address_space() directly, and not the calls around it, like in
> > the diff below. It is untested, so can you please try it and report back
> > if it fixes your issue?
>
> Yes, it works great so far.

Thanks for testing! I queued the patch below to the IOMMU tree and will
send it upstream for v5.3 today.

Thanks,

Joerg

From 754265bcab78a9014f0f99cd35e0d610fcd7dfa7 Mon Sep 17 00:00:00 2001
From: Joerg Roedel <[email protected]>
Date: Fri, 6 Sep 2019 10:39:54 +0200
Subject: [PATCH] iommu/amd: Fix race in increase_address_space()

After the conversion to the lock-less dma-api path, the
increase_address_space() function can be called without any
locking. Multiple CPUs could potentially race to increase
the address space, leading to invalid domain->mode settings
and invalid page-tables. This has been happening in the wild
under high IO load and memory pressure.

Fix the race by locking this operation. The function is
called infrequently, so this does not re-introduce a
performance regression in the dma-api path.

Reported-by: Qian Cai <[email protected]>
Fixes: 256e4621c21a ('iommu/amd: Make use of the generic IOVA allocator')
Signed-off-by: Joerg Roedel <[email protected]>
---
drivers/iommu/amd_iommu.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index f853b96ee547..61de81965c44 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1435,18 +1435,21 @@ static void free_pagetable(struct protection_domain *domain)
* another level increases the size of the address space by 9 bits to a size up
* to 64 bits.
*/
-static bool increase_address_space(struct protection_domain *domain,
+static void increase_address_space(struct protection_domain *domain,
gfp_t gfp)
{
+ unsigned long flags;
u64 *pte;

- if (domain->mode == PAGE_MODE_6_LEVEL)
+ spin_lock_irqsave(&domain->lock, flags);
+
+ if (WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
/* address space already 64 bit large */
- return false;
+ goto out;

pte = (void *)get_zeroed_page(gfp);
if (!pte)
- return false;
+ goto out;

*pte = PM_LEVEL_PDE(domain->mode,
iommu_virt_to_phys(domain->pt_root));
@@ -1454,7 +1457,10 @@ static bool increase_address_space(struct protection_domain *domain,
domain->mode += 1;
domain->updated = true;

- return true;
+out:
+ spin_unlock_irqrestore(&domain->lock, flags);
+
+ return;
}

static u64 *alloc_pte(struct protection_domain *domain,
--
2.16.4