From: Tomas Glozar <[email protected]>
nd_region_acquire_lane uses get_cpu, which disables preemption. This is
an issue on PREEMPT_RT kernels, since btt_write_pg and also
nd_region_acquire_lane itself take a spin lock, which is a sleeping lock
on PREEMPT_RT, resulting in a "BUG: sleeping function called from
invalid context" splat.

Fix the issue by replacing get_cpu/put_cpu with smp_processor_id and
migrate_disable/migrate_enable where needed. This makes BTT operations
preemptible while still preventing migration, thus permitting the use
of spin_lock.
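
In sketch form (this mirrors the diff below; the comments are
explanatory only, not part of the change):

	migrate_disable();		/* stay on this CPU, but remain preemptible */
	cpu = smp_processor_id();	/* stable while migration is disabled */
	...
	spin_lock(&ndl_lock->lock);	/* a sleeping lock on RT; safe now that preemption stays enabled */
	...
	migrate_enable();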
Fixes: 5212e11fde4d ("nd_btt: atomic sector updates")
Signed-off-by: Tomas Glozar <[email protected]>
---
drivers/nvdimm/region_devs.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 0a81f87f6f6c..e2f1fb99707f 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -939,7 +939,8 @@ unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
 {
 	unsigned int cpu, lane;
 
-	cpu = get_cpu();
+	migrate_disable();
+	cpu = smp_processor_id();
 	if (nd_region->num_lanes < nr_cpu_ids) {
 		struct nd_percpu_lane *ndl_lock, *ndl_count;
 
@@ -958,16 +959,15 @@ EXPORT_SYMBOL(nd_region_acquire_lane);
 void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
 {
 	if (nd_region->num_lanes < nr_cpu_ids) {
-		unsigned int cpu = get_cpu();
+		unsigned int cpu = smp_processor_id();
 		struct nd_percpu_lane *ndl_lock, *ndl_count;
 
 		ndl_count = per_cpu_ptr(nd_region->lane, cpu);
 		ndl_lock = per_cpu_ptr(nd_region->lane, lane);
 		if (--ndl_count->count == 0)
 			spin_unlock(&ndl_lock->lock);
-		put_cpu();
 	}
-	put_cpu();
+	migrate_enable();
 }
 EXPORT_SYMBOL(nd_region_release_lane);
 
--
2.39.3
Tomáš Glozar wrote:
> From: Tomas Glozar <[email protected]>
>
> nd_region_acquire_lane uses get_cpu, which disables preemption. This is
> an issue on PREEMPT_RT kernels, since btt_write_pg and also
> nd_region_acquire_lane itself take a spin lock, which is a sleeping lock
> on PREEMPT_RT, resulting in a "BUG: sleeping function called from
> invalid context" splat.
Is the bug in one of these two places?

1) When btt_write_pg()->lock_map() is called (when the number of lanes
is < the number of CPUs) after the lane has been acquired?

*or*

2) When nd_region_acquire_lane() internally tries to take its lock?

A copy/paste of the observed BUG would have been clearer, I think.

Regardless, I *think* this is OK, but I'm worried I don't fully
understand what the problem is.

Ira
>
> Fix the issue by replacing get_cpu/put_cpu with smp_processor_id and
> migrate_disable/migrate_enable where needed. This makes BTT operations
> preemptible while still preventing migration, thus permitting the use
> of spin_lock.
>
> Fixes: 5212e11fde4d ("nd_btt: atomic sector updates")
> Signed-off-by: Tomas Glozar <[email protected]>
> ---
> drivers/nvdimm/region_devs.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 0a81f87f6f6c..e2f1fb99707f 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -939,7 +939,8 @@ unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
> {
> unsigned int cpu, lane;
>
> - cpu = get_cpu();
> + migrate_disable();
> + cpu = smp_processor_id();
> if (nd_region->num_lanes < nr_cpu_ids) {
> struct nd_percpu_lane *ndl_lock, *ndl_count;
>
> @@ -958,16 +959,15 @@ EXPORT_SYMBOL(nd_region_acquire_lane);
> void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
> {
> if (nd_region->num_lanes < nr_cpu_ids) {
> - unsigned int cpu = get_cpu();
> + unsigned int cpu = smp_processor_id();
> struct nd_percpu_lane *ndl_lock, *ndl_count;
>
> ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> if (--ndl_count->count == 0)
> spin_unlock(&ndl_lock->lock);
> - put_cpu();
> }
> - put_cpu();
> + migrate_enable();
> }
> EXPORT_SYMBOL(nd_region_release_lane);
>
> --
> 2.39.3
>
On Thu, Sep 14, 2023 at 10:18 PM Ira Weiny <[email protected]> wrote:
> Is the bug in one of these two places?
>
> 1) When btt_write_pg()->lock_map() is called (when the number of lanes
> is < the number of CPUs) after the lane has been acquired?
>
> *or*
>
> 2) When nd_region_acquire_lane() internally tries to take its lock?
>
> A copy/paste of the observed BUG would have been clearer, I think.
>
The BUG was observed in btt_write_pg()->lock_map(), but I assume it
will also happen on the lock in nd_region_acquire_lane, since that is
also a spin lock, i.e. a sleeping lock on RT.
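
For illustration, the failing sequence is roughly this (a simplified
sketch of the call path named in the splat below, not the exact driver
code):

	lane = nd_region_acquire_lane(nd_region);	/* get_cpu() -> preempt_disable() */
	...
	/* in btt_write_pg() -> lock_map(): */
	spin_lock(&arena->map_locks[i].lock);	/* sleeping lock on RT, taken with preemption disabled -> BUG */
	...
	spin_unlock(&arena->map_locks[i].lock);
	...
	nd_region_release_lane(nd_region, lane);	/* put_cpu() */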
BUG observed in dmesg when running ndctl tests on an RT kernel without the patch:
[ 123.262740] nfit_test_iomap: loading out-of-tree module taints kernel.
[ 123.262744] nfit_test_iomap: loading test module taints kernel.
[ 123.408628] nfit_test nfit_test.0: changing numa node from -1 to 0 for nfit region [0x0000000100000000-0x0000000101ffffff]
[ 123.408633] nfit_test nfit_test.0: changing target node from -1 to 0 for nfit region [0x0000000100000000-0x0000000101ffffff]
[ 123.408852] nfit_test nfit_test.0: changing numa node from -1 to 0 for nfit region [0x0000000108000000-0x000000010bffffff]
[ 123.408855] nfit_test nfit_test.0: changing target node from -1 to 0 for nfit region [0x0000000108000000-0x000000010bffffff]
[ 123.408933] nfit_test nfit_test.0: changing numa node from -1 to 0 for nfit region [0xffffb900c2cbd000-0xffffb900c2cbd00b]
[ 123.408935] nfit_test nfit_test.0: changing target node from -1 to 0 for nfit region [0xffffb900c2cbd000-0xffffb900c2cbd00b]
[ 123.408961] nfit_test nfit_test.0: changing numa node from -1 to 0 for nfit region [0xffffb900c2cde000-0xffffb900c2cde00b]
[ 123.408963] nfit_test nfit_test.0: changing target node from -1 to 0 for nfit region [0xffffb900c2cde000-0xffffb900c2cde00b]
[ 123.408988] nfit_test nfit_test.0: changing numa node from -1 to 0 for nfit region [0xffffb900c2cff000-0xffffb900c2cff00b]
[ 123.408990] nfit_test nfit_test.0: changing target node from -1 to 0 for nfit region [0xffffb900c2cff000-0xffffb900c2cff00b]
[ 123.409015] nfit_test nfit_test.0: changing numa node from -1 to 0 for nfit region [0xffffb900c2d45000-0xffffb900c2d4500b]
[ 123.409018] nfit_test nfit_test.0: changing target node from -1 to 0 for nfit region [0xffffb900c2d45000-0xffffb900c2d4500b]
[ 123.409586] nfit_test nfit_test.0: failed to evaluate _FIT
[ 123.441834] nfit_test nfit_test.1: Error found in NVDIMM nmem4 flags: save_fail restore_fail flush_fail not_armed
[ 123.441857] nfit_test nfit_test.1: Error found in NVDIMM nmem5 flags: map_fail
[ 123.457346] nfit_test nfit_test.1: changing numa node from -1 to 0 for nfit region [0x0000000140000000-0x0000000141ffffff]
[ 123.457351] nfit_test nfit_test.1: changing target node from -1 to 0 for nfit region [0x0000000140000000-0x0000000141ffffff]
[ 123.457427] nfit_test nfit_test.1: changing numa node from -1 to 0 for nfit region [0xffffb900c361d000-0xffffb900c3a1cfff]
[ 123.457429] nfit_test nfit_test.1: changing target node from -1 to 0 for nfit region [0xffffb900c361d000-0xffffb900c3a1cfff]
[ 123.475513] nd_pmem namespace3.0: unable to guarantee persistence of writes
[ 123.484778] nd_pmem namespace2.0: region2 read-only, marking pmem2 read-only
[ 126.349866] nd_pmem btt0.0: No existing arenas
[ 126.407070] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[ 126.407073] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 4903, name: libndctl
[ 126.407074] preempt_count: 1, expected: 0
[ 126.407075] RCU nest depth: 0, expected: 0
[ 126.407075] 1 lock held by libndctl/4903:
[ 126.407076] #0: ffff8c184a270060 (&arena->map_locks[i].lock){+.+.}-{2:2}, at: btt_write_pg+0x2d7/0x500 [nd_btt]
[ 126.407085] Preemption disabled at:
[ 126.407085] [<ffffffffc1313db5>] nd_region_acquire_lane+0x15/0x90 [libnvdimm]
[ 126.407099] CPU: 1 PID: 4903 Comm: libndctl Kdump: loaded Tainted: G W O N------- --- 6.5.0-ark.ndctl-tests.el9+rt-debug-rt6+ #6
[ 126.407101] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
[ 126.407102] Call Trace:
[ 126.407103]  <TASK>
[ 126.407104]  dump_stack_lvl+0x8e/0xb0
[ 126.407109]  __might_resched+0x19b/0x250
[ 126.407113]  rt_spin_lock+0x4c/0x100
[ 126.407116]  ? btt_write_pg+0x2d7/0x500 [nd_btt]
[ 126.407120]  btt_write_pg+0x2d7/0x500 [nd_btt]
[ 126.407127]  ? local_clock_noinstr+0x9/0xc0
[ 126.407131]  btt_submit_bio+0x16d/0x270 [nd_btt]
[ 126.407138]  __submit_bio+0x48/0x80
[ 126.407141]  __submit_bio_noacct+0x7e/0x1e0
[ 126.407146]  submit_bio_wait+0x58/0xb0
[ 126.407153]  __blkdev_direct_IO_simple+0x107/0x240
[ 126.407156]  ? inode_set_ctime_current+0x51/0x110
[ 126.407164]  ? __pfx_submit_bio_wait_endio+0x10/0x10
[ 126.407171]  blkdev_write_iter+0x1d8/0x290
[ 126.407174]  vfs_write+0x237/0x330
[ 126.407183]  ksys_write+0x68/0xf0
[ 126.407187]  do_syscall_64+0x59/0x90
[ 126.407192]  ? do_syscall_64+0x69/0x90
[ 126.407193]  ? lockdep_hardirqs_on+0x79/0x100
[ 126.407195]  ? do_syscall_64+0x69/0x90
[ 126.407196]  ? lockdep_hardirqs_on+0x79/0x100
[ 126.407198]  ? do_syscall_64+0x69/0x90
[ 126.407199]  ? do_syscall_64+0x69/0x90
[ 126.407200]  ? lockdep_hardirqs_on+0x79/0x100
[ 126.407202]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 126.407205] RIP: 0033:0x7f95f2f3eba7
[ 126.407218] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 126.407219] RSP: 002b:00007ffe7a298678 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 126.407221] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f95f2f3eba7
[ 126.407222] RDX: 0000000000001000 RSI: 0000000001ea9000 RDI: 000000000000000a
[ 126.407223] RBP: 00007ffe7a298740 R08: 0000000000000000 R09: 00007f95f2fb14e0
[ 126.407223] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe7a298be8
[ 126.407224] R13: 000000000040e3a2 R14: 0000000000412db8 R15: 00007f95f31ff000
[ 126.407233]  </TASK>
Tomas Glozar wrote:
> On Thu, Sep 14, 2023 at 10:18 PM Ira Weiny <[email protected]> wrote:
> > Is the bug in one of these two places?
> >
> > 1) When btt_write_pg()->lock_map() is called (when the number of lanes
> > is < the number of CPUs) after the lane has been acquired?
> >
> > *or*
> >
> > 2) When nd_region_acquire_lane() internally tries to take its lock?
> >
> > A copy/paste of the observed BUG would have been clearer, I think.
> >
>
> The BUG was observed in btt_write_pg()->lock_map(), but I assume it
> will also happen on the lock in nd_region_acquire_lane, since that is
> also a spin lock, i.e. a sleeping lock on RT.
>
> BUG observed in dmesg when running ndctl tests on an RT kernel without the patch:
Thanks for clarifying. Could you respin the patch with the text below?
That would have saved me a lot of time digging to see what the code path
was.
...
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 4903, name: libndctl
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
1 lock held by libndctl/4903:
#0: ffff8c184a270060 (&arena->map_locks[i].lock){+.+.}-{2:2}, at: btt_write_pg+0x2d7/0x500 [nd_btt]
Preemption disabled at:
[<ffffffffc1313db5>] nd_region_acquire_lane+0x15/0x90 [libnvdimm]
Call Trace:
<TASK>
dump_stack_lvl+0x8e/0xb0
__might_resched+0x19b/0x250
rt_spin_lock+0x4c/0x100
? btt_write_pg+0x2d7/0x500 [nd_btt]
btt_write_pg+0x2d7/0x500 [nd_btt]
? local_clock_noinstr+0x9/0xc0
btt_submit_bio+0x16d/0x270 [nd_btt]
__submit_bio+0x48/0x80
__submit_bio_noacct+0x7e/0x1e0
submit_bio_wait+0x58/0xb0
__blkdev_direct_IO_simple+0x107/0x240
? inode_set_ctime_current+0x51/0x110
? __pfx_submit_bio_wait_endio+0x10/0x10
blkdev_write_iter+0x1d8/0x290
vfs_write+0x237/0x330
...
With a respin including this trace:
Reviewed-by: Ira Weiny <[email protected]>
On Mon, Sep 18, 2023 at 11:32 PM Ira Weiny <[email protected]> wrote:
> Thanks for clarifying. Could you respin the patch with the text below?
> That would have saved me a lot of time digging to see what the code path
> was.
Sent v2 with relevant parts of the BUG message.