LinuxLists.cc - lockdep splat due to klist iteration from atomic context in Intel IOMMU driver

2022-08-15 13:04:25

Subject: lockdep splat due to klist iteration from atomic context in Intel IOMMU driver

On a build of 7ebfc85e2cd7 ("Merge tag 'net-6.0-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net"), with
CONFIG_INTEL_IOMMU_DEBUGFS enabled, I am seeing the lockdep splat
below when an I/O page fault occurs on a machine with an Intel
IOMMU in it.

The issue seems to be the klist iterator functions using
spin_*lock_irq*() but the klist insertion functions using
spin_*lock(), combined with the Intel DMAR IOMMU driver iterating
over klists from atomic (hardirq) context as of commit 8ac0b64b9735
("iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()")
when CONFIG_INTEL_IOMMU_DEBUGFS is enabled, where
pci_get_domain_bus_and_slot() calls into bus_find_device() which
iterates over klists.

I found this commit from 2018:

commit 624fa7790f80575a4ec28fbdb2034097dc18d051
Author: Bart Van Assche <[email protected]>
Date: Fri Jun 22 14:54:49 2018 -0700

scsi: klist: Make it safe to use klists in atomic context

This commit switched lib/klist.c:klist_{prev,next} from
spin_{,un}lock() to spin_{lock_irqsave,unlock_irqrestore}(), but left
the spin_{,un}lock() calls in add_{head,tail}() untouched.

The simplest fix for this would be to switch lib/klist.c:add_{head,tail}()
over to use the IRQ-safe spinlock variants as well?

[ 32.283545] DMAR: DRHD: handling fault status reg 3
[ 32.288571] DMAR: [DMA Write NO_PASID] Request device [00:1a.0] fault addr 0x0 [fault reason 0x05] PTE Write access is not set
[ 32.288576] DMAR: Dump dmar0 table entries for IOVA 0x0
[ 32.288579] DMAR: root entry: 0x0000000127f42001
[ 32.288581] DMAR: context entry: hi 0x0000000000001502, low 0x000000012d8ab001
[ 32.288589]
[ 32.288590] ================================
[ 32.288591] WARNING: inconsistent lock state
[ 32.288593] 5.20.0-0.rc0.20220812git7ebfc85e2cd7.10.fc38.x86_64 #1 Not tainted
[ 32.288594] --------------------------------
[ 32.288595] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[ 32.288596] rngd/1006 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 32.288598] ff177021416f2d78 (&k->k_lock){?.+.}-{2:2}, at: klist_next+0x1b/0x160
[ 32.288610] {HARDIRQ-ON-W} state was registered at:
[ 32.288611] lock_acquire+0xce/0x2d0
[ 32.288617] _raw_spin_lock+0x33/0x80
[ 32.288625] klist_add_tail+0x46/0x80
[ 32.288627] bus_add_device+0xee/0x150
[ 32.288632] device_add+0x39d/0x9a0
[ 32.288634] add_memory_block+0x108/0x1d0
[ 32.288640] memory_dev_init+0xe1/0x117
[ 32.288648] driver_init+0x43/0x4d
[ 32.288651] kernel_init_freeable+0x1c2/0x2cc
[ 32.288657] kernel_init+0x16/0x140
[ 32.288663] ret_from_fork+0x1f/0x30
[ 32.406866] irq event stamp: 7812
[ 32.406867] hardirqs last enabled at (7811): [<ffffffff85000e86>] asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 32.406873] hardirqs last disabled at (7812): [<ffffffff84f16894>] irqentry_enter+0x54/0x60
[ 32.406876] softirqs last enabled at (7794): [<ffffffff840ff669>] __irq_exit_rcu+0xf9/0x170
[ 32.406883] softirqs last disabled at (7787): [<ffffffff840ff669>] __irq_exit_rcu+0xf9/0x170
[ 32.406886]
[ 32.406886] other info that might help us debug this:
[ 32.406886] Possible unsafe locking scenario:
[ 32.406886]
[ 32.406887] CPU0
[ 32.406888] ----
[ 32.406888] lock(&k->k_lock);
[ 32.406889] <Interrupt>
[ 32.406890] lock(&k->k_lock);
[ 32.406891]
[ 32.406891] *** DEADLOCK ***
[ 32.406891]
[ 32.406892] no locks held by rngd/1006.
[ 32.406893]
[ 32.406893] stack backtrace:
[ 32.406894] CPU: 0 PID: 1006 Comm: rngd Not tainted 5.20.0-0.rc0.20220812git7ebfc85e2cd7.10.fc38.x86_64 #1
[ 32.406897] Hardware name: Intel Corporation Idaville/Unknown, BIOS SB_IDVH .001.001.000.000.008.00006.D-569F2C6B571C2177 Aug 11 2022
[ 32.406898] Call Trace:
[ 32.406900] <TASK>
[ 32.406901] dump_stack_lvl+0x5b/0x77
[ 32.406910] mark_lock.cold+0x48/0xe1
[ 32.406916] ? find_held_lock+0x32/0x90
[ 32.406924] __lock_acquire+0x948/0x1ef0
[ 32.406926] ? __lock_acquire+0x388/0x1ef0
[ 32.406929] lock_acquire+0xce/0x2d0
[ 32.406931] ? klist_next+0x1b/0x160
[ 32.406933] ? lock_is_held_type+0xe8/0x140
[ 32.406936] ? find_held_lock+0x32/0x90
[ 32.406939] ? sched_clock_cpu+0xb/0xc0
[ 32.406942] ? system_root_device_release+0x10/0x10
[ 32.406944] _raw_spin_lock_irqsave+0x4d/0xa0
[ 32.406947] ? klist_next+0x1b/0x160
[ 32.406949] klist_next+0x1b/0x160
[ 32.406951] ? pci_uevent_ers+0x80/0x80
[ 32.406956] bus_find_device+0x52/0xa0
[ 32.406959] pci_get_domain_bus_and_slot+0x70/0xf0
[ 32.406968] dmar_fault_dump_ptes+0xf5/0x28f
[ 32.406972] dmar_fault.cold+0xf2/0x1e6
[ 32.487869] __handle_irq_event_percpu+0x90/0x330
[ 32.497674] handle_irq_event+0x34/0x70
[ 32.497677] handle_edge_irq+0x9f/0x240
[ 32.497681] __common_interrupt+0x6e/0x150
[ 32.497686] common_interrupt+0x5c/0xd0
[ 32.509825] asm_common_interrupt+0x22/0x40
[ 32.509828] RIP: 0033:0x7facc2b09688
[ 32.644127] Code: 98 48 83 c0 40 48 8b 00 48 f7 d0 48 89 c1 48 8b 45 98 48 83 c0 48 48 8b 00 48 21 c1 48 8b 45 98 48 83 c0 38 48 31 ca 48 89 10 <48> 8b 45 98 48 83 c0 60 48 8b 10 48 8b 45 98 48 83 c0 68 480
[ 32.644129] RSP: 002b:00007facc1c80820 EFLAGS: 00000202
[ 32.644132] RAX: 000055b21df61db8 RBX: 00007facbc004c10 RCX: 04428d8038200044
[ 32.644133] RDX: 5e6e24505b76b92a RSI: 1968b00000000000 RDI: 000055b21df61d80
[ 32.644135] RBP: 00007facc1c80890 R08: 00007facbc004c10 R09: 0000000000000064
[ 32.644136] R10: 00007fffeddfa080 R11: 00000000000078d6 R12: 0000000000000000
[ 32.697491] R13: 000055b21d88e440 R14: 000055b21d8892e5 R15: 000055b21d88a058
[ 32.697496] </TASK>
[ 32.697521] DMAR: PTE not present at level 4

2022-08-15 13:44:15

by Bart Van Assche

[permalink] [raw]

Subject: Re: lockdep splat due to klist iteration from atomic context in Intel IOMMU driver

On 8/15/22 05:05, Lennert Buytenhek wrote:
> On a build of 7ebfc85e2cd7 ("Merge tag 'net-6.0-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net"), with
> CONFIG_INTEL_IOMMU_DEBUGFS enabled, I am seeing the lockdep splat
> below when an I/O page fault occurs on a machine with an Intel
> IOMMU in it.
>
> The issue seems to be the klist iterator functions using
> spin_*lock_irq*() but the klist insertion functions using
> spin_*lock(), combined with the Intel DMAR IOMMU driver iterating
> over klists from atomic (hardirq) context as of commit 8ac0b64b9735
> ("iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()")
> when CONFIG_INTEL_IOMMU_DEBUGFS is enabled, where
> pci_get_domain_bus_and_slot() calls into bus_find_device() which
> iterates over klists.
>
> I found this commit from 2018:
>
> commit 624fa7790f80575a4ec28fbdb2034097dc18d051
> Author: Bart Van Assche <[email protected]>
> Date: Fri Jun 22 14:54:49 2018 -0700
>
> scsi: klist: Make it safe to use klists in atomic context
>
> This commit switched lib/klist.c:klist_{prev,next} from
> spin_{,un}lock() to spin_{lock_irqsave,unlock_irqrestore}(), but left
> the spin_{,un}lock() calls in add_{head,tail}() untouched.
>
> The simplest fix for this would be to switch lib/klist.c:add_{head,tail}()
> over to use the IRQ-safe spinlock variants as well?

Another possibility would be to evaluate whether it is safe to revert
commit 624fa7790f80 ("scsi: klist: Make it safe to use klists in atomic
context"). That commit is no longer needed by the SRP transport driver
since the legacy block layer has been removed from the kernel.

Thanks,

Bart.

2022-08-15 14:03:50

by Lennert Buytenhek

[permalink] [raw]

Subject: Re: lockdep splat due to klist iteration from atomic context in Intel IOMMU driver

On Mon, Aug 15, 2022 at 06:32:24AM -0700, Bart Van Assche wrote:

> > On a build of 7ebfc85e2cd7 ("Merge tag 'net-6.0-rc1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net"), with
> > CONFIG_INTEL_IOMMU_DEBUGFS enabled, I am seeing the lockdep splat
> > below when an I/O page fault occurs on a machine with an Intel
> > IOMMU in it.
> >
> > The issue seems to be the klist iterator functions using
> > spin_*lock_irq*() but the klist insertion functions using
> > spin_*lock(), combined with the Intel DMAR IOMMU driver iterating
> > over klists from atomic (hardirq) context as of commit 8ac0b64b9735
> > ("iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()")
> > when CONFIG_INTEL_IOMMU_DEBUGFS is enabled, where
> > pci_get_domain_bus_and_slot() calls into bus_find_device() which
> > iterates over klists.
> >
> > I found this commit from 2018:
> >
> > commit 624fa7790f80575a4ec28fbdb2034097dc18d051
> > Author: Bart Van Assche <[email protected]>
> > Date: Fri Jun 22 14:54:49 2018 -0700
> >
> > scsi: klist: Make it safe to use klists in atomic context
> >
> > This commit switched lib/klist.c:klist_{prev,next} from
> > spin_{,un}lock() to spin_{lock_irqsave,unlock_irqrestore}(), but left
> > the spin_{,un}lock() calls in add_{head,tail}() untouched.
> >
> > The simplest fix for this would be to switch lib/klist.c:add_{head,tail}()
> > over to use the IRQ-safe spinlock variants as well?
>
> Another possibility would be to evaluate whether it is safe to revert commit
> 624fa7790f80 ("scsi: klist: Make it safe to use klists in atomic context").
> That commit is no longer needed by the SRP transport driver since the legacy
> block layer has been removed from the kernel.

And then to fix the 6.0-rc1 iommu/vt-d lockdep splat with
CONFIG_INTEL_IOMMU_DEBUGFS enabled, we could convert the Intel DMAR
IRQ handler to a threaded IRQ handler. We (Arista) carry the patch
below in our kernel tree, and the last two hunks of the patch do
exactly that, for the same reason (having to call
pci_get_domain_bus_and_slot() from the IRQ handler) but this is
probably too big of a change for 6.0-rc.

commit 90a8e7da0facf198692a641fcfe6f89c478608e0
Author: Lennert Buytenhek <[email protected]>
Date: Wed Jul 13 15:34:30 2022 +0300

iommu/vt-d: Use report_iommu_fault()

This patch makes iommu/vt-d call report_iommu_fault() when an I/O
page fault occurs, which has two effects:

1) It allows device drivers to register a callback to be notified
of I/O page faults, via the iommu_set_fault_handler() API.

2) It triggers the io_page_fault tracepoint in report_iommu_fault()
when an I/O page fault occurs.

The latter point is the main aim of this patch, as it allows
rasdaemon-like daemons to be notified of I/O page faults, and to
possibly initiate corrective action in response.

Most of the other IOMMU drivers already use report_iommu_fault(),
including iommu/amd on the x86 front, since Linux 5.16 via commit
9f78e446bde8 ("iommu/amd: Use report_iommu_fault()").

To be able to call report_iommu_fault() from the dmar_fault IRQ
handler, we need to make this IRQ handler run from thread context,
as we need to call pci_get_domain_bus_and_slot() to turn the PCI
BDF of the faulting device into a struct pci_dev * to pass into
report_iommu_fault(), and pci_get_domain_bus_and_slot() uses
bus_find_device() which iterates over klists, which is not safe to
do from hardirq context.

When IRQ remapping is enabled, iommu/vt-d will try to set up its
dmar_fault IRQ handler from start_kernel() -> x86_late_time_init()
-> apic_intr_mode_init() -> apic_bsp_setup() ->
irq_remap_enable_fault_handling() -> enable_drhd_fault_handling(),
which happens before kthreadd is started, and trying to set up a
threaded IRQ handler this early on will oops. However, there
doesn't seem to be a reason why iommu/vt-d needs to set up its
fault reporting IRQ handler this early, and if we remove the IRQ
setup code from enable_drhd_fault_handling(), the IRQ will be
registered instead from pci_iommu_init() -> intel_iommu_init() ->
init_dmars(), at which point kthreadd has already been started,
making it possible to set up a threaded IRQ handler.

Suggested-by: Scarlett Gourley <[email protected]>
Suggested-by: James Sewart <[email protected]>
Suggested-by: Jack O'Sullivan <[email protected]>
Signed-off-by: Lennert Buytenhek <[email protected]>

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 5a8f780e7ffd..f366041271af 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1909,15 +1909,43 @@ void dmar_msi_read(int irq, struct msi_msg *msg)
raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
}

-static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
- u8 fault_reason, u32 pasid, u16 source_id,
- unsigned long long addr)
+static int check_dma_fault(struct intel_iommu *iommu, int type,
+ u16 source_id, unsigned long long addr)
{
- const char *reason;
- int fault_type;
+ int ret;
+ struct pci_dev *pdev;
+
+ ret = -1;
+
+ pdev = pci_get_domain_bus_and_slot(iommu->segment,
+ PCI_BUS_NUM(source_id),
+ source_id & 0xFF);
+ if (pdev) {
+ struct device_domain_info *dev_data;
+
+ dev_data = dev_iommu_priv_get(&pdev->dev);
+ if (dev_data) {
+ int flags;
+
+ if (type)
+ flags = IOMMU_FAULT_READ;
+ else
+ flags = IOMMU_FAULT_WRITE;
+
+ ret = report_iommu_fault(&dev_data->domain->domain,
+ &pdev->dev, addr, flags);
+ }
+
+ pci_dev_put(pdev);
+ }

- reason = dmar_get_fault_reason(fault_reason, &fault_type);
+ return ret;
+}

+static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
+ u8 fault_reason, u32 pasid, u16 source_id,
+ unsigned long long addr, const char *reason, int fault_type)
+{
if (fault_type == INTR_REMAP) {
pr_err("[INTR-REMAP] Request device [%02x:%02x.%d] fault index 0x%llx [fault reason 0x%02x] %s\n",
source_id >> 8, PCI_SLOT(source_id & 0xFF),
@@ -1968,8 +1996,6 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
fault_index = dma_fsts_fault_record_index(fault_status);
reg = cap_fault_reg_offset(iommu->cap);
while (1) {
- /* Disable printing, simply clear the fault when ratelimited */
- bool ratelimited = !__ratelimit(&rs);
u8 fault_reason;
u16 source_id;
u64 guest_addr;
@@ -1977,6 +2003,8 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
int type;
u32 data;
bool pasid_present;
+ const char *reason;
+ int fault_type;

/* highest 32 bits */
data = readl(iommu->reg + reg +
@@ -1984,20 +2012,18 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
if (!(data & DMA_FRCD_F))
break;

- if (!ratelimited) {
- fault_reason = dma_frcd_fault_reason(data);
- type = dma_frcd_type(data);
+ fault_reason = dma_frcd_fault_reason(data);
+ type = dma_frcd_type(data);

- pasid = dma_frcd_pasid_value(data);
- data = readl(iommu->reg + reg +
- fault_index * PRIMARY_FAULT_REG_LEN + 8);
- source_id = dma_frcd_source_id(data);
+ pasid = dma_frcd_pasid_value(data);
+ data = readl(iommu->reg + reg +
+ fault_index * PRIMARY_FAULT_REG_LEN + 8);
+ source_id = dma_frcd_source_id(data);

- pasid_present = dma_frcd_pasid_present(data);
- guest_addr = dmar_readq(iommu->reg + reg +
- fault_index * PRIMARY_FAULT_REG_LEN);
- guest_addr = dma_frcd_page_addr(guest_addr);
- }
+ pasid_present = dma_frcd_pasid_present(data);
+ guest_addr = dmar_readq(iommu->reg + reg +
+ fault_index * PRIMARY_FAULT_REG_LEN);
+ guest_addr = dma_frcd_page_addr(guest_addr);

/* clear the fault */
writel(DMA_FRCD_F, iommu->reg + reg +
@@ -2005,11 +2031,17 @@ irqreturn_t dmar_fault(int irq, void *dev_id)

raw_spin_unlock_irqrestore(&iommu->register_lock, flag);

- if (!ratelimited)
+ reason = dmar_get_fault_reason(fault_reason, &fault_type);
+
+ if ((fault_type != DMA_REMAP ||
+ check_dma_fault(iommu, type, source_id, guest_addr)) &&
+ __ratelimit(&rs)) {
/* Using pasid -1 if pasid is not present */
dmar_fault_do_one(iommu, type, fault_reason,
pasid_present ? pasid : INVALID_IOASID,
- source_id, guest_addr);
+ source_id, guest_addr,
+ reason, fault_type);
+ }

fault_index++;
if (fault_index >= cap_num_fault_regs(iommu->cap))
@@ -2043,7 +2075,8 @@ int dmar_set_interrupt(struct intel_iommu *iommu)
return -EINVAL;
}

- ret = request_irq(irq, dmar_fault, IRQF_NO_THREAD, iommu->name, iommu);
+ ret = request_threaded_irq(irq, NULL, dmar_fault, IRQF_ONESHOT,
+ iommu->name, iommu);
if (ret)
pr_err("Can't request irq\n");
return ret;
@@ -2051,30 +2084,6 @@ int dmar_set_interrupt(struct intel_iommu *iommu)

int __init enable_drhd_fault_handling(void)
{
- struct dmar_drhd_unit *drhd;
- struct intel_iommu *iommu;
-
- /*
- * Enable fault control interrupt.
- */
- for_each_iommu(iommu, drhd) {
- u32 fault_status;
- int ret = dmar_set_interrupt(iommu);
-
- if (ret) {
- pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
- (unsigned long long)drhd->reg_base_addr, ret);
- return -1;
- }
-
- /*
- * Clear any previous faults.
- */
- dmar_fault(iommu->irq, iommu);
- fault_status = readl(iommu->reg + DMAR_FSTS_REG);
- writel(fault_status, iommu->reg + DMAR_FSTS_REG);
- }
-
return 0;
}

2022-08-17 04:06:25

by Lu Baolu

[permalink] [raw]

Subject: Re: lockdep splat due to klist iteration from atomic context in Intel IOMMU driver

On 2022/8/15 21:57, Lennert Buytenhek wrote:
> On Mon, Aug 15, 2022 at 06:32:24AM -0700, Bart Van Assche wrote:
>
>>> On a build of 7ebfc85e2cd7 ("Merge tag 'net-6.0-rc1' of
>>> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net"), with
>>> CONFIG_INTEL_IOMMU_DEBUGFS enabled, I am seeing the lockdep splat
>>> below when an I/O page fault occurs on a machine with an Intel
>>> IOMMU in it.
>>>
>>> The issue seems to be the klist iterator functions using
>>> spin_*lock_irq*() but the klist insertion functions using
>>> spin_*lock(), combined with the Intel DMAR IOMMU driver iterating
>>> over klists from atomic (hardirq) context as of commit 8ac0b64b9735
>>> ("iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()")
>>> when CONFIG_INTEL_IOMMU_DEBUGFS is enabled, where
>>> pci_get_domain_bus_and_slot() calls into bus_find_device() which
>>> iterates over klists.
>>>
>>> I found this commit from 2018:
>>>
>>> commit 624fa7790f80575a4ec28fbdb2034097dc18d051
>>> Author: Bart Van Assche<[email protected]>
>>> Date: Fri Jun 22 14:54:49 2018 -0700
>>>
>>> scsi: klist: Make it safe to use klists in atomic context
>>>
>>> This commit switched lib/klist.c:klist_{prev,next} from
>>> spin_{,un}lock() to spin_{lock_irqsave,unlock_irqrestore}(), but left
>>> the spin_{,un}lock() calls in add_{head,tail}() untouched.
>>>
>>> The simplest fix for this would be to switch lib/klist.c:add_{head,tail}()
>>> over to use the IRQ-safe spinlock variants as well?
>> Another possibility would be to evaluate whether it is safe to revert commit
>> 624fa7790f80 ("scsi: klist: Make it safe to use klists in atomic context").
>> That commit is no longer needed by the SRP transport driver since the legacy
>> block layer has been removed from the kernel.
> And then to fix the 6.0-rc1 iommu/vt-d lockdep splat with
> CONFIG_INTEL_IOMMU_DEBUGFS enabled, we could convert the Intel DMAR
> IRQ handler to a threaded IRQ handler. We (Arista) carry the patch
> below in our kernel tree, and the last two hunks of the patch do
> exactly that, for the same reason (having to call
> pci_get_domain_bus_and_slot() from the IRQ handler) but this is
> probably too big of a change for 6.0-rc.
>
>
>
> commit 90a8e7da0facf198692a641fcfe6f89c478608e0
> Author: Lennert Buytenhek<[email protected]>
> Date: Wed Jul 13 15:34:30 2022 +0300
>
> iommu/vt-d: Use report_iommu_fault()
>
> This patch makes iommu/vt-d call report_iommu_fault() when an I/O
> page fault occurs, which has two effects:
>
> 1) It allows device drivers to register a callback to be notified
> of I/O page faults, via the iommu_set_fault_handler() API.
>
> 2) It triggers the io_page_fault tracepoint in report_iommu_fault()
> when an I/O page fault occurs.
>
> The latter point is the main aim of this patch, as it allows
> rasdaemon-like daemons to be notified of I/O page faults, and to
> possibly initiate corrective action in response.

The IOMMU subsystem already has a framework to handle I/O page faults:

commit fc36479db74e9 "iommu: Add a page fault handler"

And below series,

https://lore.kernel.org/linux-iommu/[email protected]/

is trying to make it more generic. It seems to be more suitable for your
case.

The report_iommu_fault() probably will be replaced by
iommu_register_device_fault_handler() eventually. So I don't encourage
its usage in the VT-d driver.

Best regards,
baolu

2022-08-17 05:00:51

by Lu Baolu

[permalink] [raw]

Subject: Re: lockdep splat due to klist iteration from atomic context in Intel IOMMU driver

On 2022/8/15 21:32, Bart Van Assche wrote:
> On 8/15/22 05:05, Lennert Buytenhek wrote:
>> On a build of 7ebfc85e2cd7 ("Merge tag 'net-6.0-rc1' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net"), with
>> CONFIG_INTEL_IOMMU_DEBUGFS enabled, I am seeing the lockdep splat
>> below when an I/O page fault occurs on a machine with an Intel
>> IOMMU in it.
>>
>> The issue seems to be the klist iterator functions using
>> spin_*lock_irq*() but the klist insertion functions using
>> spin_*lock(), combined with the Intel DMAR IOMMU driver iterating
>> over klists from atomic (hardirq) context as of commit 8ac0b64b9735
>> ("iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()")
>> when CONFIG_INTEL_IOMMU_DEBUGFS is enabled, where
>> pci_get_domain_bus_and_slot() calls into bus_find_device() which
>> iterates over klists.
>>
>> I found this commit from 2018:
>>
>>     commit 624fa7790f80575a4ec28fbdb2034097dc18d051
>>     Author: Bart Van Assche <[email protected]>
>>     Date:   Fri Jun 22 14:54:49 2018 -0700
>>
>>         scsi: klist: Make it safe to use klists in atomic context
>>
>> This commit switched lib/klist.c:klist_{prev,next} from
>> spin_{,un}lock() to spin_{lock_irqsave,unlock_irqrestore}(), but left
>> the spin_{,un}lock() calls in add_{head,tail}() untouched.
>>
>> The simplest fix for this would be to switch
>> lib/klist.c:add_{head,tail}()
>> over to use the IRQ-safe spinlock variants as well?
>
> Another possibility would be to evaluate whether it is safe to revert
> commit 624fa7790f80 ("scsi: klist: Make it safe to use klists in atomic
> context"). That commit is no longer needed by the SRP transport driver
> since the legacy block layer has been removed from the kernel.

If so, pci_get_domain_bus_and_slot() can not be used in this interrupt
context, right?

Best regards,
baolu

2022-08-17 06:22:40

by Lennert Buytenhek

[permalink] [raw]

Subject: Re: lockdep splat due to klist iteration from atomic context in Intel IOMMU driver

On Wed, Aug 17, 2022 at 12:04:10PM +0800, Baolu Lu wrote:

> > > > On a build of 7ebfc85e2cd7 ("Merge tag 'net-6.0-rc1' of
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net"), with
> > > > CONFIG_INTEL_IOMMU_DEBUGFS enabled, I am seeing the lockdep splat
> > > > below when an I/O page fault occurs on a machine with an Intel
> > > > IOMMU in it.
> > > >
> > > > The issue seems to be the klist iterator functions using
> > > > spin_*lock_irq*() but the klist insertion functions using
> > > > spin_*lock(), combined with the Intel DMAR IOMMU driver iterating
> > > > over klists from atomic (hardirq) context as of commit 8ac0b64b9735
> > > > ("iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()")
> > > > when CONFIG_INTEL_IOMMU_DEBUGFS is enabled, where
> > > > pci_get_domain_bus_and_slot() calls into bus_find_device() which
> > > > iterates over klists.
> > > >
> > > > I found this commit from 2018:
> > > >
> > > > commit 624fa7790f80575a4ec28fbdb2034097dc18d051
> > > > Author: Bart Van Assche<[email protected]>
> > > > Date: Fri Jun 22 14:54:49 2018 -0700
> > > >
> > > > scsi: klist: Make it safe to use klists in atomic context
> > > >
> > > > This commit switched lib/klist.c:klist_{prev,next} from
> > > > spin_{,un}lock() to spin_{lock_irqsave,unlock_irqrestore}(), but left
> > > > the spin_{,un}lock() calls in add_{head,tail}() untouched.
> > > >
> > > > The simplest fix for this would be to switch lib/klist.c:add_{head,tail}()
> > > > over to use the IRQ-safe spinlock variants as well?
> > > Another possibility would be to evaluate whether it is safe to revert commit
> > > 624fa7790f80 ("scsi: klist: Make it safe to use klists in atomic context").
> > > That commit is no longer needed by the SRP transport driver since the legacy
> > > block layer has been removed from the kernel.
> > And then to fix the 6.0-rc1 iommu/vt-d lockdep splat with
> > CONFIG_INTEL_IOMMU_DEBUGFS enabled, we could convert the Intel DMAR
> > IRQ handler to a threaded IRQ handler. We (Arista) carry the patch
> > below in our kernel tree, and the last two hunks of the patch do
> > exactly that, for the same reason (having to call
> > pci_get_domain_bus_and_slot() from the IRQ handler) but this is
> > probably too big of a change for 6.0-rc.
> >
> >
> >
> > commit 90a8e7da0facf198692a641fcfe6f89c478608e0
> > Author: Lennert Buytenhek<[email protected]>
> > Date: Wed Jul 13 15:34:30 2022 +0300
> >
> > iommu/vt-d: Use report_iommu_fault()
> > This patch makes iommu/vt-d call report_iommu_fault() when an I/O
> > page fault occurs, which has two effects:
> > 1) It allows device drivers to register a callback to be notified
> > of I/O page faults, via the iommu_set_fault_handler() API.
> > 2) It triggers the io_page_fault tracepoint in report_iommu_fault()
> > when an I/O page fault occurs.
> > The latter point is the main aim of this patch, as it allows
> > rasdaemon-like daemons to be notified of I/O page faults, and to
> > possibly initiate corrective action in response.
>
> The IOMMU subsystem already has a framework to handle I/O page faults:
>
> commit fc36479db74e9 "iommu: Add a page fault handler"
>
> And below series,
>
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> is trying to make it more generic. It seems to be more suitable for your
> case.
>
> The report_iommu_fault() probably will be replaced by
> iommu_register_device_fault_handler() eventually. So I don't encourage
> its usage in the VT-d driver.

We use the iommu/io_page_fault tracepoint from userspace to be notified
of (non-ATS) I/O page faults so that we can detect malfunctioning PCIe
devices, which in our systems are typically switch/router line cards,
and take corrective action, such as restarting the offending line card.

Calling report_iommu_fault() causes the iommu/io_page_fault tracepoint
to be invoked, which is why we made the AMD and Intel IOMMU drivers use
report_iommu_fault() in our kernel tree.

It seems that iommu_queue_iopf() is specific to the SVA use case, while
we are not using SVA, in which case it would not address our use case.
(We don't care about knowing about ATS translation faults, we just want
to know when a non-ATS PCI device is malfunctioning.)

Kind regards,
Lennert

2022-08-17 06:41:44

by Lennert Buytenhek

[permalink] [raw]

Subject: Re: lockdep splat due to klist iteration from atomic context in Intel IOMMU driver

On Wed, Aug 17, 2022 at 12:45:26PM +0800, Baolu Lu wrote:

> > > On a build of 7ebfc85e2cd7 ("Merge tag 'net-6.0-rc1' of
> > > git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net"), with
> > > CONFIG_INTEL_IOMMU_DEBUGFS enabled, I am seeing the lockdep splat
> > > below when an I/O page fault occurs on a machine with an Intel
> > > IOMMU in it.
> > >
> > > The issue seems to be the klist iterator functions using
> > > spin_*lock_irq*() but the klist insertion functions using
> > > spin_*lock(), combined with the Intel DMAR IOMMU driver iterating
> > > over klists from atomic (hardirq) context as of commit 8ac0b64b9735
> > > ("iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()")
> > > when CONFIG_INTEL_IOMMU_DEBUGFS is enabled, where
> > > pci_get_domain_bus_and_slot() calls into bus_find_device() which
> > > iterates over klists.
> > >
> > > I found this commit from 2018:
> > >
> > > ????commit 624fa7790f80575a4ec28fbdb2034097dc18d051
> > > ????Author: Bart Van Assche <[email protected]>
> > > ????Date:?? Fri Jun 22 14:54:49 2018 -0700
> > >
> > > ??????? scsi: klist: Make it safe to use klists in atomic context
> > >
> > > This commit switched lib/klist.c:klist_{prev,next} from
> > > spin_{,un}lock() to spin_{lock_irqsave,unlock_irqrestore}(), but left
> > > the spin_{,un}lock() calls in add_{head,tail}() untouched.
> > >
> > > The simplest fix for this would be to switch
> > > lib/klist.c:add_{head,tail}()
> > > over to use the IRQ-safe spinlock variants as well?
> >
> > Another possibility would be to evaluate whether it is safe to revert
> > commit 624fa7790f80 ("scsi: klist: Make it safe to use klists in atomic
> > context"). That commit is no longer needed by the SRP transport driver
> > since the legacy block layer has been removed from the kernel.
>
> If so, pci_get_domain_bus_and_slot() can not be used in this interrupt
> context, right?

The 624fa7790f80 commit from 2018 tried to make klist use safe from
atomic context, but since it missed a few of the klist accessors, it
didn't actually manage to make it safe, so it's already not safe to use
pci_get_domain_bus_and_slot() from interrupt context right now, even
without reverting this commit. Reverting the commit would just be a
declaration that klist use from atomic context isn't safe and never was.

A quick check doesn't turn up any other cases where people have run
into this issue with code in mainline, so it would seem that the
newly-added use of pci_get_domain_bus_and_slot() to the iommu/vt-d
fault reporting interrupt handler is one of very few (if not only)
cases of mainline code wanting to access klists from atomic context.

Kind regards,
Lennert

2022-08-17 07:37:25

by Lu Baolu

[permalink] [raw]

Subject: Re: lockdep splat due to klist iteration from atomic context in Intel IOMMU driver

On 2022/8/17 14:09, Lennert Buytenhek wrote:
> On Wed, Aug 17, 2022 at 12:04:10PM +0800, Baolu Lu wrote:
>
>>>>> On a build of 7ebfc85e2cd7 ("Merge tag 'net-6.0-rc1' of
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net"), with
>>>>> CONFIG_INTEL_IOMMU_DEBUGFS enabled, I am seeing the lockdep splat
>>>>> below when an I/O page fault occurs on a machine with an Intel
>>>>> IOMMU in it.
>>>>>
>>>>> The issue seems to be the klist iterator functions using
>>>>> spin_*lock_irq*() but the klist insertion functions using
>>>>> spin_*lock(), combined with the Intel DMAR IOMMU driver iterating
>>>>> over klists from atomic (hardirq) context as of commit 8ac0b64b9735
>>>>> ("iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()")
>>>>> when CONFIG_INTEL_IOMMU_DEBUGFS is enabled, where
>>>>> pci_get_domain_bus_and_slot() calls into bus_find_device() which
>>>>> iterates over klists.
>>>>>
>>>>> I found this commit from 2018:
>>>>>
>>>>> commit 624fa7790f80575a4ec28fbdb2034097dc18d051
>>>>> Author: Bart Van Assche<[email protected]>
>>>>> Date: Fri Jun 22 14:54:49 2018 -0700
>>>>>
>>>>> scsi: klist: Make it safe to use klists in atomic context
>>>>>
>>>>> This commit switched lib/klist.c:klist_{prev,next} from
>>>>> spin_{,un}lock() to spin_{lock_irqsave,unlock_irqrestore}(), but left
>>>>> the spin_{,un}lock() calls in add_{head,tail}() untouched.
>>>>>
>>>>> The simplest fix for this would be to switch lib/klist.c:add_{head,tail}()
>>>>> over to use the IRQ-safe spinlock variants as well?
>>>> Another possibility would be to evaluate whether it is safe to revert commit
>>>> 624fa7790f80 ("scsi: klist: Make it safe to use klists in atomic context").
>>>> That commit is no longer needed by the SRP transport driver since the legacy
>>>> block layer has been removed from the kernel.
>>> And then to fix the 6.0-rc1 iommu/vt-d lockdep splat with
>>> CONFIG_INTEL_IOMMU_DEBUGFS enabled, we could convert the Intel DMAR
>>> IRQ handler to a threaded IRQ handler. We (Arista) carry the patch
>>> below in our kernel tree, and the last two hunks of the patch do
>>> exactly that, for the same reason (having to call
>>> pci_get_domain_bus_and_slot() from the IRQ handler) but this is
>>> probably too big of a change for 6.0-rc.
>>>
>>>
>>>
>>> commit 90a8e7da0facf198692a641fcfe6f89c478608e0
>>> Author: Lennert Buytenhek<[email protected]>
>>> Date: Wed Jul 13 15:34:30 2022 +0300
>>>
>>> iommu/vt-d: Use report_iommu_fault()
>>> This patch makes iommu/vt-d call report_iommu_fault() when an I/O
>>> page fault occurs, which has two effects:
>>> 1) It allows device drivers to register a callback to be notified
>>> of I/O page faults, via the iommu_set_fault_handler() API.
>>> 2) It triggers the io_page_fault tracepoint in report_iommu_fault()
>>> when an I/O page fault occurs.
>>> The latter point is the main aim of this patch, as it allows
>>> rasdaemon-like daemons to be notified of I/O page faults, and to
>>> possibly initiate corrective action in response.
>>
>> The IOMMU subsystem already has a framework to handle I/O page faults:
>>
>> commit fc36479db74e9 "iommu: Add a page fault handler"
>>
>> And below series,
>>
>> https://lore.kernel.org/linux-iommu/[email protected]/
>>
>> is trying to make it more generic. It seems to be more suitable for your
>> case.
>>
>> The report_iommu_fault() probably will be replaced by
>> iommu_register_device_fault_handler() eventually. So I don't encourage
>> its usage in the VT-d driver.
>
> We use the iommu/io_page_fault tracepoint from userspace to be notified
> of (non-ATS) I/O page faults so that we can detect malfunctioning PCIe
> devices, which in our systems are typically switch/router line cards,
> and take corrective action, such as restarting the offending line card.

Yes. Make sense.

>
> Calling report_iommu_fault() causes the iommu/io_page_fault tracepoint
> to be invoked, which is why we made the AMD and Intel IOMMU drivers use
> report_iommu_fault() in our kernel tree.

Can iommu_register_device_fault_handler() also serve your case?
report_iommu_fault() is domain based, while the former is device based.

>
> It seems that iommu_queue_iopf() is specific to the SVA use case, while
> we are not using SVA, in which case it would not address our use case.
> (We don't care about knowing about ATS translation faults, we just want
> to know when a non-ATS PCI device is malfunctioning.)

The iommu_queue_iopf() is for recoverable I/O page fault. Your case only
cares about unrecoverable DMA faults. So it's not suitable for you.
Sorry for the misunderstanding.

Best regards,
baolu