From: He Zhe <[email protected]>
kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
causes the follow BUG.
BUG: scheduling while atomic: migration/15/132/0x00000002
Modules linked in: iTCO_wdt iTCO_vendor_support intel_rapl pcc_cpufreq
pnd2_edac intel_powerclamp coretemp crct10dif_pclmul crct10dif_common
aesni_intel matroxfb_base aes_x86_64 matroxfb_g450 matroxfb_accel
crypto_simd matroxfb_DAC1064 cryptd glue_helper g450_pll matroxfb_misc
i2c_ismt i2c_i801 acpi_cpufreq
Preemption disabled at:
[<ffffffff8c927c11>] cpu_stopper_thread+0x71/0x100
CPU: 15 PID: 132 Comm: migration/15 Not tainted 4.19.0-rt1-preempt-rt #1
Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
Call Trace:
dump_stack+0x4f/0x6a
? cpu_stopper_thread+0x71/0x100
__schedule_bug.cold.16+0x38/0x55
__schedule+0x484/0x6c0
schedule+0x3d/0xe0
rt_spin_lock_slowlock_locked+0x118/0x2a0
rt_spin_lock_slowlock+0x57/0x90
__rt_spin_lock+0x26/0x30
__write_rt_lock+0x23/0x1a0
? intel_pmu_cpu_dying+0x67/0x70
rt_write_lock+0x2a/0x30
find_and_remove_object+0x1e/0x80
delete_object_full+0x10/0x20
kmemleak_free+0x32/0x50
kfree+0x104/0x1f0
? x86_pmu_starting_cpu+0x30/0x30
intel_pmu_cpu_dying+0x67/0x70
x86_pmu_dying_cpu+0x1a/0x30
cpuhp_invoke_callback+0x92/0x700
take_cpu_down+0x70/0xa0
multi_cpu_stop+0x62/0xc0
? cpu_stop_queue_work+0x130/0x130
cpu_stopper_thread+0x79/0x100
smpboot_thread_fn+0x20f/0x2d0
kthread+0x121/0x140
? sort_range+0x30/0x30
? kthread_park+0x90/0x90
ret_from_fork+0x35/0x40
And on v4.18 stable tree the following call trace, caused by grabbing
kmemleak_lock again, is also observed.
kernel BUG at kernel/locking/rtmutex.c:1048!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 5 PID: 689 Comm: mkfs.ext4 Not tainted 4.18.16-rt9-preempt-rt #1
Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
RIP: 0010:rt_spin_lock_slowlock_locked+0x277/0x2a0
Code: e8 5e 64 61 ff e9 bc fe ff ff e8 54 64 61 ff e9 b7 fe ff ff 0f 0b e8 98 57 53 ff e9 43 fe ff ff e8 8e 57 53 ff e9 74 ff ff ff <0f> 0b 0f 0b 0f 0b 48 8b 43 10 48 85 c0 74 06 48 3b 58 38 75 0b 49
RSP: 0018:ffff936846d4f3b0 EFLAGS: 00010046
RAX: ffff8e3680361e00 RBX: ffffffff83a8b240 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff8e3680361e00 RDI: ffffffff83a8b258
RBP: ffff936846d4f3e8 R08: ffff8e3680361e01 R09: ffffffff82adfdf0
R10: ffffffff827ede18 R11: 0000000000000000 R12: ffff936846d4f3f8
R13: ffff8e3680361e00 R14: ffff936846d4f3f8 R15: 0000000000000246
FS: 00007fc8b6bfd780(0000) GS:ffff8e369f340000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055fb5659e000 CR3: 00000007fdd14000 CR4: 00000000003406e0
Call Trace:
? preempt_count_add+0x74/0xc0
rt_spin_lock_slowlock+0x57/0x90
? __kernel_text_address+0x12/0x40
? __save_stack_trace+0x75/0x100
__rt_spin_lock+0x26/0x30
__write_rt_lock+0x23/0x1a0
rt_write_lock+0x2a/0x30
create_object+0x17d/0x2b0
kmemleak_alloc+0x34/0x50
kmem_cache_alloc+0x146/0x220
? mempool_alloc_slab+0x15/0x20
mempool_alloc_slab+0x15/0x20
mempool_alloc+0x65/0x170
sg_pool_alloc+0x21/0x60
__sg_alloc_table+0x101/0x160
? sg_free_table_chained+0x30/0x30
sg_alloc_table_chained+0x8b/0xb0
scsi_init_sgtable+0x31/0x90
scsi_init_io+0x44/0x130
sd_setup_write_same16_cmnd+0xef/0x150
sd_init_command+0x6bf/0xaa0
? cgroup_base_stat_cputime_account_end.isra.0+0x26/0x60
? elv_rb_del+0x2a/0x40
scsi_setup_cmnd+0x8e/0x140
scsi_prep_fn+0x5d/0x140
blk_peek_request+0xda/0x2f0
scsi_request_fn+0x33/0x550
? cfq_rb_erase+0x23/0x40
__blk_run_queue+0x43/0x60
cfq_insert_request+0x2f3/0x5d0
__elv_add_request+0x160/0x290
blk_flush_plug_list+0x204/0x230
schedule+0x87/0xe0
__write_rt_lock+0x18b/0x1a0
rt_write_lock+0x2a/0x30
create_object+0x17d/0x2b0
kmemleak_alloc+0x34/0x50
__kmalloc_node+0x1cd/0x340
alloc_request_size+0x30/0x70
mempool_alloc+0x65/0x170
? ioc_lookup_icq+0x54/0x70
get_request+0x4e3/0x8d0
? wait_woken+0x80/0x80
blk_queue_bio+0x153/0x470
generic_make_request+0x1dc/0x3f0
submit_bio+0x49/0x140
? next_bio+0x38/0x40
submit_bio_wait+0x59/0x90
blkdev_issue_discard+0x7a/0xd0
? _raw_spin_unlock_irqrestore+0x18/0x50
blk_ioctl_discard+0xc7/0x110
blkdev_ioctl+0x57e/0x960
? __wake_up+0x13/0x20
block_ioctl+0x3d/0x50
do_vfs_ioctl+0xa8/0x610
? vfs_write+0x166/0x1b0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x4d/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmemleak is an error detecting feature. We would not expect as good performance
as without it. As there is no raw rwlock defining helpers, we turn kmemleak_lock
to a raw spinlock.
Signed-off-by: He Zhe <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
v2: Remove stable tag as this is only for preempt-rt patchset
mm/kmemleak.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 17dd883..b68a3d0 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -26,7 +26,7 @@
*
* The following locks and mutexes are used by kmemleak:
*
- * - kmemleak_lock (rwlock): protects the object_list modifications and
+ * - kmemleak_lock (raw spinlock): protects the object_list modifications and
* accesses to the object_tree_root. The object_list is the main list
* holding the metadata (struct kmemleak_object) for the allocated memory
* blocks. The object_tree_root is a red black tree used to look-up
@@ -197,7 +197,7 @@ static LIST_HEAD(gray_list);
/* search tree for object boundaries */
static struct rb_root object_tree_root = RB_ROOT;
/* rw_lock protecting the access to object_list and object_tree_root */
-static DEFINE_RWLOCK(kmemleak_lock);
+static DEFINE_RAW_SPINLOCK(kmemleak_lock);
/* allocation caches for kmemleak internal data */
static struct kmem_cache *object_cache;
@@ -491,9 +491,9 @@ static struct kmemleak_object *find_and_get_object(unsigned long ptr, int alias)
struct kmemleak_object *object;
rcu_read_lock();
- read_lock_irqsave(&kmemleak_lock, flags);
+ raw_spin_lock_irqsave(&kmemleak_lock, flags);
object = lookup_object(ptr, alias);
- read_unlock_irqrestore(&kmemleak_lock, flags);
+ raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
/* check whether the object is still available */
if (object && !get_object(object))
@@ -513,13 +513,13 @@ static struct kmemleak_object *find_and_remove_object(unsigned long ptr, int ali
unsigned long flags;
struct kmemleak_object *object;
- write_lock_irqsave(&kmemleak_lock, flags);
+ raw_spin_lock_irqsave(&kmemleak_lock, flags);
object = lookup_object(ptr, alias);
if (object) {
rb_erase(&object->rb_node, &object_tree_root);
list_del_rcu(&object->object_list);
}
- write_unlock_irqrestore(&kmemleak_lock, flags);
+ raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
return object;
}
@@ -593,7 +593,7 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
/* kernel backtrace */
object->trace_len = __save_stack_trace(object->trace);
- write_lock_irqsave(&kmemleak_lock, flags);
+ raw_spin_lock_irqsave(&kmemleak_lock, flags);
min_addr = min(min_addr, ptr);
max_addr = max(max_addr, ptr + size);
@@ -624,7 +624,7 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
list_add_tail_rcu(&object->object_list, &object_list);
out:
- write_unlock_irqrestore(&kmemleak_lock, flags);
+ raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
return object;
}
@@ -1310,7 +1310,7 @@ static void scan_block(void *_start, void *_end,
unsigned long *end = _end - (BYTES_PER_POINTER - 1);
unsigned long flags;
- read_lock_irqsave(&kmemleak_lock, flags);
+ raw_spin_lock_irqsave(&kmemleak_lock, flags);
for (ptr = start; ptr < end; ptr++) {
struct kmemleak_object *object;
unsigned long pointer;
@@ -1367,7 +1367,7 @@ static void scan_block(void *_start, void *_end,
spin_unlock(&object->lock);
}
}
- read_unlock_irqrestore(&kmemleak_lock, flags);
+ raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
}
/*
--
2.7.4
On 2018-11-22 17:04:19 [+0800], [email protected] wrote:
> From: He Zhe <[email protected]>
>
> kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
> causes the follow BUG.
please use
[PATCH RT … ]
in future while posting for RT. And this was (and is) on my TODO list.
Sebastian
On 2018-11-22 17:04:19 [+0800], [email protected] wrote:
> From: He Zhe <[email protected]>
>
> kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
> causes the follow BUG.
>
> BUG: scheduling while atomic: migration/15/132/0x00000002
…
> Preemption disabled at:
> [<ffffffff8c927c11>] cpu_stopper_thread+0x71/0x100
> CPU: 15 PID: 132 Comm: migration/15 Not tainted 4.19.0-rt1-preempt-rt #1
> Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
> Call Trace:
> dump_stack+0x4f/0x6a
> ? cpu_stopper_thread+0x71/0x100
> __schedule_bug.cold.16+0x38/0x55
> __schedule+0x484/0x6c0
> schedule+0x3d/0xe0
> rt_spin_lock_slowlock_locked+0x118/0x2a0
> rt_spin_lock_slowlock+0x57/0x90
> __rt_spin_lock+0x26/0x30
> __write_rt_lock+0x23/0x1a0
> ? intel_pmu_cpu_dying+0x67/0x70
> rt_write_lock+0x2a/0x30
> find_and_remove_object+0x1e/0x80
> delete_object_full+0x10/0x20
> kmemleak_free+0x32/0x50
> kfree+0x104/0x1f0
> ? x86_pmu_starting_cpu+0x30/0x30
> intel_pmu_cpu_dying+0x67/0x70
> x86_pmu_dying_cpu+0x1a/0x30
> cpuhp_invoke_callback+0x92/0x700
> take_cpu_down+0x70/0xa0
> multi_cpu_stop+0x62/0xc0
> ? cpu_stop_queue_work+0x130/0x130
> cpu_stopper_thread+0x79/0x100
> smpboot_thread_fn+0x20f/0x2d0
> kthread+0x121/0x140
> ? sort_range+0x30/0x30
> ? kthread_park+0x90/0x90
> ret_from_fork+0x35/0x40
If this is the only problem? kfree() from a preempt-disabled section
should cause a warning even without kmemleak.
> And on v4.18 stable tree the following call trace, caused by grabbing
> kmemleak_lock again, is also observed.
>
> kernel BUG at kernel/locking/rtmutex.c:1048!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 5 PID: 689 Comm: mkfs.ext4 Not tainted 4.18.16-rt9-preempt-rt #1
…
> Call Trace:
> ? preempt_count_add+0x74/0xc0
> rt_spin_lock_slowlock+0x57/0x90
> ? __kernel_text_address+0x12/0x40
> ? __save_stack_trace+0x75/0x100
> __rt_spin_lock+0x26/0x30
> __write_rt_lock+0x23/0x1a0
> rt_write_lock+0x2a/0x30
> create_object+0x17d/0x2b0
…
is this an RT-only problem? Because mainline should not allow read->read
locking or read->write locking for reader-writer locks. If this only
happens on v4.18 and not on v4.19 then something must have fixed it.
Sebastian
> is this an RT-only problem? Because mainline should not allow read->read
> locking or read->write locking for reader-writer locks. If this only
> happens on v4.18 and not on v4.19 then something must have fixed it.
Probably misunderstanding, but I'd say that read->read locking is "the
norm"...?
If you don't use qrwlock, readers are also "recursive", in part.,
P0 P1
read_lock(l)
write_lock(l)
read_lock(l)
won't block P0 on the second read_lock(). (qrwlock somehow complicate
the analysis; IIUC, they are recursive if and only if in_interrupt().).
Andrea
>
>
> Sebastian
On 2018-11-23 12:02:55 [+0100], Andrea Parri wrote:
> > is this an RT-only problem? Because mainline should not allow read->read
> > locking or read->write locking for reader-writer locks. If this only
> > happens on v4.18 and not on v4.19 then something must have fixed it.
>
> Probably misunderstanding, but I'd say that read->read locking is "the
> norm"...?
>
> If you don't use qrwlock, readers are also "recursive", in part.,
>
> P0 P1
> read_lock(l)
> write_lock(l)
> read_lock(l)
>
> won't block P0 on the second read_lock(). (qrwlock somehow complicate
> the analysis; IIUC, they are recursive if and only if in_interrupt().).
ehm, peterz, is that true? My memory on that is that all readers will
block if there is a writer pending.
> Andrea
Sebastian
On Fri, Nov 23, 2018 at 12:06:11PM +0100, Sebastian Andrzej Siewior wrote:
> On 2018-11-23 12:02:55 [+0100], Andrea Parri wrote:
> > > is this an RT-only problem? Because mainline should not allow read->read
> > > locking or read->write locking for reader-writer locks. If this only
> > > happens on v4.18 and not on v4.19 then something must have fixed it.
> >
> > Probably misunderstanding, but I'd say that read->read locking is "the
> > norm"...?
> >
> > If you don't use qrwlock, readers are also "recursive", in part.,
> >
> > P0 P1
> > read_lock(l)
> > write_lock(l)
> > read_lock(l)
> >
> > won't block P0 on the second read_lock(). (qrwlock somehow complicate
> > the analysis; IIUC, they are recursive if and only if in_interrupt().).
>
> ehm, peterz, is that true? My memory on that is that all readers will
> block if there is a writer pending.
With qwrlocks, the readers will normally block if there is a pending
writer (to avoid starving the writer), unless in_interrupt() when the
readers are allowed to starve a pending writer.
TLA+/PlusCal model here: ;)
https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/qrwlock.tla
--
Catalin
On Fri, 23 Nov 2018 11:31:31 +0000
Catalin Marinas <[email protected]> wrote:
> With qwrlocks, the readers will normally block if there is a pending
> writer (to avoid starving the writer), unless in_interrupt() when the
> readers are allowed to starve a pending writer.
>
> TLA+/PlusCal model here: ;)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/qrwlock.tla
>
And the code appears to confirm it too:
void queued_read_lock_slowpath(struct qrwlock *lock)
{
/*
* Readers come here when they cannot get the lock without waiting
*/
if (unlikely(in_interrupt())) {
/*
* Readers in interrupt context will get the lock immediately
* if the writer is just waiting (not holding the lock yet),
* so spin with ACQUIRE semantics until the lock is available
* without waiting in the queue.
*/
atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
return;
}
atomic_sub(_QR_BIAS, &lock->cnts);
/*
* Put the reader into the wait queue
*/
arch_spin_lock(&lock->wait_lock);
atomic_add(_QR_BIAS, &lock->cnts);
-- Steve
On 2018/11/23 17:53, Sebastian Andrzej Siewior wrote:
> On 2018-11-22 17:04:19 [+0800], [email protected] wrote:
>> From: He Zhe <[email protected]>
>>
>> kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
>> causes the follow BUG.
>>
>> BUG: scheduling while atomic: migration/15/132/0x00000002
> …
>> Preemption disabled at:
>> [<ffffffff8c927c11>] cpu_stopper_thread+0x71/0x100
>> CPU: 15 PID: 132 Comm: migration/15 Not tainted 4.19.0-rt1-preempt-rt #1
>> Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
>> Call Trace:
>> dump_stack+0x4f/0x6a
>> ? cpu_stopper_thread+0x71/0x100
>> __schedule_bug.cold.16+0x38/0x55
>> __schedule+0x484/0x6c0
>> schedule+0x3d/0xe0
>> rt_spin_lock_slowlock_locked+0x118/0x2a0
>> rt_spin_lock_slowlock+0x57/0x90
>> __rt_spin_lock+0x26/0x30
>> __write_rt_lock+0x23/0x1a0
>> ? intel_pmu_cpu_dying+0x67/0x70
>> rt_write_lock+0x2a/0x30
>> find_and_remove_object+0x1e/0x80
>> delete_object_full+0x10/0x20
>> kmemleak_free+0x32/0x50
>> kfree+0x104/0x1f0
>> ? x86_pmu_starting_cpu+0x30/0x30
>> intel_pmu_cpu_dying+0x67/0x70
>> x86_pmu_dying_cpu+0x1a/0x30
>> cpuhp_invoke_callback+0x92/0x700
>> take_cpu_down+0x70/0xa0
>> multi_cpu_stop+0x62/0xc0
>> ? cpu_stop_queue_work+0x130/0x130
>> cpu_stopper_thread+0x79/0x100
>> smpboot_thread_fn+0x20f/0x2d0
>> kthread+0x121/0x140
>> ? sort_range+0x30/0x30
>> ? kthread_park+0x90/0x90
>> ret_from_fork+0x35/0x40
> If this is the only problem? kfree() from a preempt-disabled section
> should cause a warning even without kmemleak.
Thanks for your review. I just did some tests aginst the latest code.
On latest v4.19.1-rt3, both of the call traces can be reproduced with kmemleak
enabied. And none can be reproduced with kmemleak disabled.
On latest mainline tree, none can be reproduced no matter kmemleak is enabled
or disabled.
I don't get why kfree from a preempt-disabled section should cause a warning
without kmemleak, since kfree can't sleep.
If I understand correctly, the call trace above is caused by trying to schedule
after preemption is disabled, which cannot be reached in mainline kernel. So
we might need to turn to use raw lock to keep preemption disabled.
>
>> And on v4.18 stable tree the following call trace, caused by grabbing
>> kmemleak_lock again, is also observed.
>>
>> kernel BUG at kernel/locking/rtmutex.c:1048!
>> invalid opcode: 0000 [#1] PREEMPT SMP PTI
>> CPU: 5 PID: 689 Comm: mkfs.ext4 Not tainted 4.18.16-rt9-preempt-rt #1
> …
>> Call Trace:
>> ? preempt_count_add+0x74/0xc0
>> rt_spin_lock_slowlock+0x57/0x90
>> ? __kernel_text_address+0x12/0x40
>> ? __save_stack_trace+0x75/0x100
>> __rt_spin_lock+0x26/0x30
>> __write_rt_lock+0x23/0x1a0
>> rt_write_lock+0x2a/0x30
>> create_object+0x17d/0x2b0
> …
>
> is this an RT-only problem? Because mainline should not allow read->read
> locking or read->write locking for reader-writer locks. If this only
> happens on v4.18 and not on v4.19 then something must have fixed it.
From what I reached above, this is RT-only and happens on v4.18 and v4.19.
The call trace above is caused by grabbing kmemleak_lock and then getting
scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
this problem.
Thanks,
Zhe
>
>
> Sebastian
>
On Fri, Nov 23, 2018 at 12:06:11PM +0100, Sebastian Andrzej Siewior wrote:
> On 2018-11-23 12:02:55 [+0100], Andrea Parri wrote:
> > > is this an RT-only problem? Because mainline should not allow read->read
> > > locking or read->write locking for reader-writer locks. If this only
> > > happens on v4.18 and not on v4.19 then something must have fixed it.
> >
> > Probably misunderstanding, but I'd say that read->read locking is "the
> > norm"...?
> >
> > If you don't use qrwlock, readers are also "recursive", in part.,
> >
> > P0 P1
> > read_lock(l)
> > write_lock(l)
> > read_lock(l)
> >
> > won't block P0 on the second read_lock(). (qrwlock somehow complicate
> > the analysis; IIUC, they are recursive if and only if in_interrupt().).
>
> ehm, peterz, is that true? My memory on that is that all readers will
> block if there is a writer pending.
Since qrwlock is the more strict, all users should use its semantics.
Just like we cannot 'rely' on the unfairness of some lock
implementations.
On 2018-11-24 22:26:46 [+0800], He Zhe wrote:
> On latest v4.19.1-rt3, both of the call traces can be reproduced with kmemleak
> enabied. And none can be reproduced with kmemleak disabled.
okay. So it needs attention.
> On latest mainline tree, none can be reproduced no matter kmemleak is enabled
> or disabled.
>
> I don't get why kfree from a preempt-disabled section should cause a warning
> without kmemleak, since kfree can't sleep.
it might. It will acquire a sleeping lock if it has go down to the
memory allocator to actually give memory back.
> If I understand correctly, the call trace above is caused by trying to schedule
> after preemption is disabled, which cannot be reached in mainline kernel. So
> we might need to turn to use raw lock to keep preemption disabled.
The buddy-allocator runs with spin locks so it is okay on !RT. So you
can use kfree() with disabled preemption or disabled interrupts.
I don't think that we want to use raw-locks in the buddy-allocator.
> >From what I reached above, this is RT-only and happens on v4.18 and v4.19.
>
> The call trace above is caused by grabbing kmemleak_lock and then getting
> scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
> this problem.
But this is a reader / writer lock. And if I understand the other part
of the thread then it needs multiple readers.
Couldn't we just get rid of that kfree() or move it somewhere else?
I mean if the free() memory on CPU-down and allocate it again CPU-up
then we could skip that, rigth? Just allocate it and don't free it
because the CPU will likely get up again.
> Thanks,
> Zhe
Sebastian
On 2018/12/1 02:19, Sebastian Andrzej Siewior wrote:
> On 2018-11-24 22:26:46 [+0800], He Zhe wrote:
>> On latest v4.19.1-rt3, both of the call traces can be reproduced with kmemleak
>> enabied. And none can be reproduced with kmemleak disabled.
> okay. So it needs attention.
>
>> On latest mainline tree, none can be reproduced no matter kmemleak is enabled
>> or disabled.
>>
>> I don't get why kfree from a preempt-disabled section should cause a warning
>> without kmemleak, since kfree can't sleep.
> it might. It will acquire a sleeping lock if it has go down to the
> memory allocator to actually give memory back.
Got it. Thanks.
>
>> If I understand correctly, the call trace above is caused by trying to schedule
>> after preemption is disabled, which cannot be reached in mainline kernel. So
>> we might need to turn to use raw lock to keep preemption disabled.
> The buddy-allocator runs with spin locks so it is okay on !RT. So you
> can use kfree() with disabled preemption or disabled interrupts.
> I don't think that we want to use raw-locks in the buddy-allocator.
For call trace 1:
I went through the calling paths inside kfree() and found that there have already
been things using raw lock in it as follow.
1) in the path of kfree() itself
kfree -> slab_free -> do_slab_free -> __slab_free -> raw_spin_lock_irqsave
2) in the path of CONFIG_DEBUG_OBJECTS_FREE
kfree -> slab_free -> slab_free_freelist_hook -> slab_free_hook -> debug_check_no_obj_freed -> __debug_check_no_obj_freed -> raw_spin_lock_irqsave
3) in the path of CONFIG_LOCKDEP
kfree -> __free_pages -> free_unref_page -> free_unref_page_prepare -> free_pcp_prepare -> free_pages_prepare -> debug_check_no_locks_freed -> debug_check_no_locks_freed -> raw_local_irq_save
Since kmemleak would most likely be used to debug in environments where
we would not expect as great performance as without it, and kfree() has raw locks
in its main path and other debug function paths, I suppose it wouldn't hurt that
we change to raw locks.
>
>> >From what I reached above, this is RT-only and happens on v4.18 and v4.19.
>>
>> The call trace above is caused by grabbing kmemleak_lock and then getting
>> scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
>> this problem.
> But this is a reader / writer lock. And if I understand the other part
> of the thread then it needs multiple readers.
For call trace 2:
I don't get what "it needs multiple readers" exactly means here.
In this call trace, the kmemleak_lock is grabbed as write lock, and then scheduled
away, and then grabbed again as write lock from another path. It's a
write->write locking, compared to the discussion in the other part of the thread.
This is essentially because kmemleak hooks on the very low level memory
allocation and free operations. After scheduled away, it can easily re-enter itself.
We need raw locks to prevent this from happening.
> Couldn't we just get rid of that kfree() or move it somewhere else?
> I mean if the free() memory on CPU-down and allocate it again CPU-up
> then we could skip that, rigth? Just allocate it and don't free it
> because the CPU will likely get up again.
For call trace 1:
I went through the CPU hotplug code and found that the allocation of the
problematic data, cpuc->shared_regs, is done in intel_pmu_cpu_prepare. And
the free is done in intel_pmu_cpu_dying. They are handlers triggered by two
different perf events.
It seems we can hardly form a convincing method that holds the data while
CPUs are off and then uses it again. raw locks would be easy and good enough.
Thanks,
Zhe
>
>> Thanks,
>> Zhe
> Sebastian
>
On 2018-12-05 21:53:37 [+0800], He Zhe wrote:
> For call trace 1:
…
> Since kmemleak would most likely be used to debug in environments where
> we would not expect as great performance as without it, and kfree() has raw locks
> in its main path and other debug function paths, I suppose it wouldn't hurt that
> we change to raw locks.
okay.
> >> >From what I reached above, this is RT-only and happens on v4.18 and v4.19.
> >>
> >> The call trace above is caused by grabbing kmemleak_lock and then getting
> >> scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
> >> this problem.
> > But this is a reader / writer lock. And if I understand the other part
> > of the thread then it needs multiple readers.
>
> For call trace 2:
>
> I don't get what "it needs multiple readers" exactly means here.
>
> In this call trace, the kmemleak_lock is grabbed as write lock, and then scheduled
> away, and then grabbed again as write lock from another path. It's a
> write->write locking, compared to the discussion in the other part of the thread.
>
> This is essentially because kmemleak hooks on the very low level memory
> allocation and free operations. After scheduled away, it can easily re-enter itself.
> We need raw locks to prevent this from happening.
With raw locks you wouldn't have multiple readers at the same time.
Maybe you wouldn't have recursion but since you can't have multiple
readers you would add lock contention where was none (because you could
have two readers at the same time).
> > Couldn't we just get rid of that kfree() or move it somewhere else?
> > I mean if the free() memory on CPU-down and allocate it again CPU-up
> > then we could skip that, rigth? Just allocate it and don't free it
> > because the CPU will likely get up again.
>
> For call trace 1:
>
> I went through the CPU hotplug code and found that the allocation of the
> problematic data, cpuc->shared_regs, is done in intel_pmu_cpu_prepare. And
> the free is done in intel_pmu_cpu_dying. They are handlers triggered by two
> different perf events.
>
> It seems we can hardly form a convincing method that holds the data while
> CPUs are off and then uses it again. raw locks would be easy and good enough.
Why not allocate the memory in intel_pmu_cpu_prepare() if it is not
already there (otherwise skip the allocation) and in
intel_pmu_cpu_dying() not free it. It looks easy.
> Thanks,
> Zhe
Sebastian
On 2018/12/6 03:14, Sebastian Andrzej Siewior wrote:
> On 2018-12-05 21:53:37 [+0800], He Zhe wrote:
>> For call trace 1:
> …
>> Since kmemleak would most likely be used to debug in environments where
>> we would not expect as great performance as without it, and kfree() has raw locks
>> in its main path and other debug function paths, I suppose it wouldn't hurt that
>> we change to raw locks.
> okay.
>
>>>> >From what I reached above, this is RT-only and happens on v4.18 and v4.19.
>>>>
>>>> The call trace above is caused by grabbing kmemleak_lock and then getting
>>>> scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
>>>> this problem.
>>> But this is a reader / writer lock. And if I understand the other part
>>> of the thread then it needs multiple readers.
>> For call trace 2:
>>
>> I don't get what "it needs multiple readers" exactly means here.
>>
>> In this call trace, the kmemleak_lock is grabbed as write lock, and then scheduled
>> away, and then grabbed again as write lock from another path. It's a
>> write->write locking, compared to the discussion in the other part of the thread.
>>
>> This is essentially because kmemleak hooks on the very low level memory
>> allocation and free operations. After scheduled away, it can easily re-enter itself.
>> We need raw locks to prevent this from happening.
> With raw locks you wouldn't have multiple readers at the same time.
> Maybe you wouldn't have recursion but since you can't have multiple
> readers you would add lock contention where was none (because you could
> have two readers at the same time).
Sorry for slow reply.
OK. I understand your concern finally. At the commit log said, I wanted to use raw
rwlock but didn't find the DEFINE helper for it. Thinking it would not be expected to
have great performance, I turn to use raw spinlock instead. And yes, this would
introduce worse performance.
Maybe I miss the reason, but why don't we have rwlock_types_raw.h to define raw
rwlock helper for RT? With that, we can cleanly replace kmemleak_lock with a raw
rwlock.
Or should we just define a raw rwlock using basic type, like arch_rwlock_t, only in
kmemleak?
>
>>> Couldn't we just get rid of that kfree() or move it somewhere else?
>>> I mean if the free() memory on CPU-down and allocate it again CPU-up
>>> then we could skip that, rigth? Just allocate it and don't free it
>>> because the CPU will likely get up again.
>> For call trace 1:
>>
>> I went through the CPU hotplug code and found that the allocation of the
>> problematic data, cpuc->shared_regs, is done in intel_pmu_cpu_prepare. And
>> the free is done in intel_pmu_cpu_dying. They are handlers triggered by two
>> different perf events.
>>
>> It seems we can hardly form a convincing method that holds the data while
>> CPUs are off and then uses it again. raw locks would be easy and good enough.
> Why not allocate the memory in intel_pmu_cpu_prepare() if it is not
> already there (otherwise skip the allocation) and in
> intel_pmu_cpu_dying() not free it. It looks easy.
Thanks for your suggestion. I've sent the change for call trace 1 to mainline
mailing list. Hopefully it can be accepted.
Zhe
>
>> Thanks,
>> Zhe
> Sebastian
>
On Tue, Dec 18, 2018 at 06:51:41PM +0800, He Zhe wrote:
> On 2018/12/6 03:14, Sebastian Andrzej Siewior wrote:
> > With raw locks you wouldn't have multiple readers at the same time.
> > Maybe you wouldn't have recursion but since you can't have multiple
> > readers you would add lock contention where was none (because you could
> > have two readers at the same time).
>
> OK. I understand your concern finally. At the commit log said, I wanted to use raw
> rwlock but didn't find the DEFINE helper for it. Thinking it would not be expected to
> have great performance, I turn to use raw spinlock instead. And yes, this would
> introduce worse performance.
Looking through the kmemleak code, I can't really find significant
reader contention. The longest holder of this lock (read) is the scan
thread which is also protected by a scan_mutex, so can't run
concurrently with another scanner (triggered through debugfs). The other
read_lock(&kmemleak_lock) user is find_and_get_object() called from a
few places. However, all such places normally follow a create_object()
call (kmemleak_alloc() and friends) which already performs a
write_lock(&kmemleak_lock), so it needs to wait for the scan thread to
release the kmemleak_lock.
It may be worth running some performance/latency tests during kmemleak
scanning (echo scan > /sys/kernel/debug/kmemleak) but at a quick look,
I don't think we'd see any difference with a raw_spin_lock_t.
With a bit more thinking (though I'll be off until the new year), we
could probably get rid of the kmemleak_lock entirely in scan_block() and
make lookup_object() and the related rbtree code in kmemleak RCU-safe.
--
Catalin
On Thu, Nov 22, 2018 at 05:04:19PM +0800, [email protected] wrote:
> From: He Zhe <[email protected]>
>
> kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
> causes the follow BUG.
>
> BUG: scheduling while atomic: migration/15/132/0x00000002
> Modules linked in: iTCO_wdt iTCO_vendor_support intel_rapl pcc_cpufreq
> pnd2_edac intel_powerclamp coretemp crct10dif_pclmul crct10dif_common
> aesni_intel matroxfb_base aes_x86_64 matroxfb_g450 matroxfb_accel
> crypto_simd matroxfb_DAC1064 cryptd glue_helper g450_pll matroxfb_misc
> i2c_ismt i2c_i801 acpi_cpufreq
> Preemption disabled at:
> [<ffffffff8c927c11>] cpu_stopper_thread+0x71/0x100
> CPU: 15 PID: 132 Comm: migration/15 Not tainted 4.19.0-rt1-preempt-rt #1
> Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
> Call Trace:
> dump_stack+0x4f/0x6a
> ? cpu_stopper_thread+0x71/0x100
> __schedule_bug.cold.16+0x38/0x55
> __schedule+0x484/0x6c0
> schedule+0x3d/0xe0
> rt_spin_lock_slowlock_locked+0x118/0x2a0
> rt_spin_lock_slowlock+0x57/0x90
> __rt_spin_lock+0x26/0x30
> __write_rt_lock+0x23/0x1a0
> ? intel_pmu_cpu_dying+0x67/0x70
> rt_write_lock+0x2a/0x30
> find_and_remove_object+0x1e/0x80
> delete_object_full+0x10/0x20
> kmemleak_free+0x32/0x50
> kfree+0x104/0x1f0
> ? x86_pmu_starting_cpu+0x30/0x30
> intel_pmu_cpu_dying+0x67/0x70
> x86_pmu_dying_cpu+0x1a/0x30
> cpuhp_invoke_callback+0x92/0x700
> take_cpu_down+0x70/0xa0
> multi_cpu_stop+0x62/0xc0
> ? cpu_stop_queue_work+0x130/0x130
> cpu_stopper_thread+0x79/0x100
> smpboot_thread_fn+0x20f/0x2d0
> kthread+0x121/0x140
> ? sort_range+0x30/0x30
> ? kthread_park+0x90/0x90
> ret_from_fork+0x35/0x40
>
> And on v4.18 stable tree the following call trace, caused by grabbing
> kmemleak_lock again, is also observed.
>
> kernel BUG at kernel/locking/rtmutex.c:1048!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 5 PID: 689 Comm: mkfs.ext4 Not tainted 4.18.16-rt9-preempt-rt #1
> Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
> RIP: 0010:rt_spin_lock_slowlock_locked+0x277/0x2a0
> Code: e8 5e 64 61 ff e9 bc fe ff ff e8 54 64 61 ff e9 b7 fe ff ff 0f 0b e8 98 57 53 ff e9 43 fe ff ff e8 8e 57 53 ff e9 74 ff ff ff <0f> 0b 0f 0b 0f 0b 48 8b 43 10 48 85 c0 74 06 48 3b 58 38 75 0b 49
> RSP: 0018:ffff936846d4f3b0 EFLAGS: 00010046
> RAX: ffff8e3680361e00 RBX: ffffffff83a8b240 RCX: 0000000000000001
> RDX: 0000000000000000 RSI: ffff8e3680361e00 RDI: ffffffff83a8b258
> RBP: ffff936846d4f3e8 R08: ffff8e3680361e01 R09: ffffffff82adfdf0
> R10: ffffffff827ede18 R11: 0000000000000000 R12: ffff936846d4f3f8
> R13: ffff8e3680361e00 R14: ffff936846d4f3f8 R15: 0000000000000246
> FS: 00007fc8b6bfd780(0000) GS:ffff8e369f340000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055fb5659e000 CR3: 00000007fdd14000 CR4: 00000000003406e0
> Call Trace:
> ? preempt_count_add+0x74/0xc0
> rt_spin_lock_slowlock+0x57/0x90
> ? __kernel_text_address+0x12/0x40
> ? __save_stack_trace+0x75/0x100
> __rt_spin_lock+0x26/0x30
> __write_rt_lock+0x23/0x1a0
> rt_write_lock+0x2a/0x30
> create_object+0x17d/0x2b0
> kmemleak_alloc+0x34/0x50
> kmem_cache_alloc+0x146/0x220
> ? mempool_alloc_slab+0x15/0x20
> mempool_alloc_slab+0x15/0x20
> mempool_alloc+0x65/0x170
> sg_pool_alloc+0x21/0x60
> __sg_alloc_table+0x101/0x160
> ? sg_free_table_chained+0x30/0x30
> sg_alloc_table_chained+0x8b/0xb0
> scsi_init_sgtable+0x31/0x90
> scsi_init_io+0x44/0x130
> sd_setup_write_same16_cmnd+0xef/0x150
> sd_init_command+0x6bf/0xaa0
> ? cgroup_base_stat_cputime_account_end.isra.0+0x26/0x60
> ? elv_rb_del+0x2a/0x40
> scsi_setup_cmnd+0x8e/0x140
> scsi_prep_fn+0x5d/0x140
> blk_peek_request+0xda/0x2f0
> scsi_request_fn+0x33/0x550
> ? cfq_rb_erase+0x23/0x40
> __blk_run_queue+0x43/0x60
> cfq_insert_request+0x2f3/0x5d0
> __elv_add_request+0x160/0x290
> blk_flush_plug_list+0x204/0x230
> schedule+0x87/0xe0
> __write_rt_lock+0x18b/0x1a0
> rt_write_lock+0x2a/0x30
> create_object+0x17d/0x2b0
> kmemleak_alloc+0x34/0x50
> __kmalloc_node+0x1cd/0x340
> alloc_request_size+0x30/0x70
> mempool_alloc+0x65/0x170
> ? ioc_lookup_icq+0x54/0x70
> get_request+0x4e3/0x8d0
> ? wait_woken+0x80/0x80
> blk_queue_bio+0x153/0x470
> generic_make_request+0x1dc/0x3f0
> submit_bio+0x49/0x140
> ? next_bio+0x38/0x40
> submit_bio_wait+0x59/0x90
> blkdev_issue_discard+0x7a/0xd0
> ? _raw_spin_unlock_irqrestore+0x18/0x50
> blk_ioctl_discard+0xc7/0x110
> blkdev_ioctl+0x57e/0x960
> ? __wake_up+0x13/0x20
> block_ioctl+0x3d/0x50
> do_vfs_ioctl+0xa8/0x610
> ? vfs_write+0x166/0x1b0
> ksys_ioctl+0x67/0x90
> __x64_sys_ioctl+0x1a/0x20
> do_syscall_64+0x4d/0xf0
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> kmemleak is an error detecting feature. We would not expect as good performance
> as without it. As there is no raw rwlock defining helpers, we turn kmemleak_lock
> to a raw spinlock.
>
> Signed-off-by: He Zhe <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
As I replied already, I don't think this patch would increase the
kmemleak latency (or performance), although I haven't actually tested
it. FWIW:
Acked-by: Catalin Marinas <[email protected]>
On 2018-12-18 15:07:45 [+0000], Catalin Marinas wrote:
…
> It may be worth running some performance/latency tests during kmemleak
> scanning (echo scan > /sys/kernel/debug/kmemleak) but at a quick look,
> I don't think we'd see any difference with a raw_spin_lock_t.
>
> With a bit more thinking (though I'll be off until the new year), we
> could probably get rid of the kmemleak_lock entirely in scan_block() and
> make lookup_object() and the related rbtree code in kmemleak RCU-safe.
Okay. So let me apply that patch into my RT tree with your ack (from the
other email). And then I hope that it either shows up upstream or gets
replaced with RCU in the ende :)
Thanks.
Sebastian
On 2018/12/19 23:30, Sebastian Andrzej Siewior wrote:
> On 2018-12-18 15:07:45 [+0000], Catalin Marinas wrote:
> …
>> It may be worth running some performance/latency tests during kmemleak
>> scanning (echo scan > /sys/kernel/debug/kmemleak) but at a quick look,
>> I don't think we'd see any difference with a raw_spin_lock_t.
>>
>> With a bit more thinking (though I'll be off until the new year), we
>> could probably get rid of the kmemleak_lock entirely in scan_block() and
>> make lookup_object() and the related rbtree code in kmemleak RCU-safe.
> Okay. So let me apply that patch into my RT tree with your ack (from the
> other email). And then I hope that it either shows up upstream or gets
> replaced with RCU in the ende :)
I'd like to do the upstreaming or replacing. Thanks.
Zhe
>
> Thanks.
>
> Sebastian
>