Hello Tejun,
I've been observing the following crashes on kernel 4.2.6:
73309.529940] BUG: unable to handle kernel NULL pointer dereference at (null)
[73309.530238] IP: [<ffffffff8106b663>] __queue_work+0xb3/0x390
[73309.530466] PGD 0
[73309.530681] Oops: 0000 [#1] SMP
[73309.530947] Modules linked in: dm_snapshot dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c ipv6 xt_multiport iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables ext2 dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 igb i2c_algo_bit i2c_core lpc_ich mfd_core ipmi_devintf ipmi_si ipmi_msghandler ioatdma dca
[73309.533556] CPU: 19 PID: 0 Comm: swapper/19 Not tainted 4.2.6-wbpatch-qib #1
[73309.533734] Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013
[73309.533911] task: ffff880276501b80 ti: ffff880276510000 task.ti: ffff880276510000
[73309.534093] RIP: 0010:[<ffffffff8106b663>] [<ffffffff8106b663>] __queue_work+0xb3/0x390
[73309.534321] RSP: 0018:ffff88047fce3d58 EFLAGS: 00010086
[73309.534495] RAX: ffff880277812400 RBX: ffff8801e53e24c0 RCX: 00000000000100f0
[73309.534672] RDX: 0000000000000000 RSI: 0000000000000030 RDI: ffff8801e53e24c0
[73309.534849] RBP: ffff88047fce3de8 R08: 000042ad628a3480 R09: 0000000000000000
[73309.535023] R10: ffffffff816099d5 R11: 0000000000000000 R12: ffffffff8106b940
[73309.535196] R13: 0000000000000013 R14: ffff8803df464c00 R15: 0000000000000013
[73309.535370] FS: 0000000000000000(0000) GS:ffff88047fce0000(0000) knlGS:0000000000000000
[73309.535544] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[73309.535714] CR2: 0000000000000000 CR3: 0000000001a0e000 CR4: 00000000000406e0
[73309.535886] Stack:
[73309.536049] ffff88047fcefcd8 0000000000000092 0000000000000000 ffff8803df464d10
[73309.536415] 0000000000000032 00000000000100f0 0000000000000000 ffff88047fcf4a00
[73309.536785] ffff88047fcf4a00 0000000000000013 0000000000000000 ffff880276501b80
[73309.537152] Call Trace:
[73309.537319] <IRQ>
[73309.537373] [<ffffffff8106b940>] ? __queue_work+0x390/0x390
[73309.537714] [<ffffffff8106b958>] delayed_work_timer_fn+0x18/0x20
[73309.537891] [<ffffffff810ad1d7>] call_timer_fn+0x47/0x110
[73309.538071] [<ffffffff810be302>] ? tick_sched_timer+0x52/0xa0
[73309.538249] [<ffffffff810adb6f>] run_timer_softirq+0x17f/0x2b0
[73309.538425] [<ffffffff8106b940>] ? __queue_work+0x390/0x390
[73309.538604] [<ffffffff81057f40>] __do_softirq+0xe0/0x290
[73309.538778] [<ffffffff810581e6>] irq_exit+0xa6/0xb0
[73309.538952] [<ffffffff8159413a>] smp_apic_timer_interrupt+0x4a/0x59
[73309.539128] [<ffffffff815926bb>] apic_timer_interrupt+0x6b/0x70
[73309.539300] <EOI>
[73309.539355] [<ffffffff8148b136>] ? cpuidle_enter_state+0x136/0x290
[73309.539694] [<ffffffff8148b12d>] ? cpuidle_enter_state+0x12d/0x290
[73309.539870] [<ffffffff8158d9ed>] ? __schedule+0x37d/0x840
[73309.540045] [<ffffffff8148b2a7>] cpuidle_enter+0x17/0x20
[73309.540222] [<ffffffff810936c5>] cpuidle_idle_call+0x95/0x140
[73309.540398] [<ffffffff81072766>] ? atomic_notifier_call_chain+0x16/0x20
[73309.540574] [<ffffffff810938b5>] cpu_idle_loop+0x145/0x200
[73309.540748] [<ffffffff8109398b>] ? cpu_startup_entry+0x1b/0x70
[73309.540924] [<ffffffff813a1948>] ? get_random_bytes+0x48/0x90
[73309.541098] [<ffffffff810939cf>] cpu_startup_entry+0x5f/0x70
[73309.541274] [<ffffffff81033832>] start_secondary+0xc2/0xd0
[73309.541446] Code: 49 8b 96 08 01 00 00 49 63 c7 48 03 14 c5 e0 af ab 81 48 89 55 80 48 89 df e8 0a ee ff ff 48 8b 55 80 48 85 c0 0f 84 3e 01 00 00 <48> 8b 3a 48 39 f8 0f 84 35 01 00 00 48 89 c7 48 89 85 78 ff ff
[73309.545008] RIP [<ffffffff8106b663>] __queue_work+0xb3/0x390
[73309.545231] RSP <ffff88047fce3d58>
[73309.545399] CR2: 0000000000000000
The gist is that this fails on the following line:
if (last_pool && last_pool != pwq->pool) {
The pointer 'pwq' is bogus: it is loaded into %rdx, which in this case is
0000000000000000. Looking at the function's source, pwq would normally be
loaded via per_cpu_ptr, but since the if (!(wq->flags & WQ_UNBOUND))
check evaluates to false here, pwq is instead the result of
unbound_pwq_by_node(wq, cpu_to_node(cpu));
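For reference, the relevant part of __queue_work() (kernel/workqueue.c, roughly as it reads in 4.2) is:

	/* pwq which will be used unless @work is executing elsewhere */
	if (!(wq->flags & WQ_UNBOUND))
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
	else
		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));

	/*
	 * If @work was previously on a different pool, it might still be
	 * running there, in which case the work needs to be queued on that
	 * pool to guarantee non-reentrancy.
	 */
	last_pool = get_work_pool(work);
	if (last_pool && last_pool != pwq->pool) {

and unbound_pwq_by_node() is essentially just an RCU dereference of wq->numa_pwq_tbl[node], so a NULL entry in that table directly produces the NULL pwq dereferenced above.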
Here are the flags of the workqueue:
crash> struct workqueue_struct.flags 0xffff8803df464c00
flags = 131082
(0xffff8803df464c00 is indeed the pointer to the workqueue struct,
so the flags aren't bogus).
So reading the numa_pwq_tbl it seems that it's uninitialised:
crash> struct workqueue_struct.numa_pwq_tbl 0xffff8803df464c00
numa_pwq_tbl = 0xffff8803df464d10
crash> rd -64 0xffff8803df464d10 3
ffff8803df464d10: 0000000000000000 0000000000000000 ................
ffff8803df464d20: 0000000000000000 ........
The machine where the crash occurred has a single NUMA node, so at the
very least I would have expected a valid pointer there rather than NULL.
Also, this crash is not isolated: I have observed it on multiple
other machines running vanilla 4.2.5/4.2.6 kernels.
Any advice on how to debug this further?
Hello, Nikolay.
On Wed, Dec 09, 2015 at 02:08:56PM +0200, Nikolay Borisov wrote:
> 73309.529940] BUG: unable to handle kernel NULL pointer dereference at (null)
> [73309.530238] IP: [<ffffffff8106b663>] __queue_work+0xb3/0x390
...
> [73309.537319] <IRQ>
> [73309.537373] [<ffffffff8106b940>] ? __queue_work+0x390/0x390
> [73309.537714] [<ffffffff8106b958>] delayed_work_timer_fn+0x18/0x20
> [73309.537891] [<ffffffff810ad1d7>] call_timer_fn+0x47/0x110
> [73309.538071] [<ffffffff810be302>] ? tick_sched_timer+0x52/0xa0
> [73309.538249] [<ffffffff810adb6f>] run_timer_softirq+0x17f/0x2b0
> [73309.538425] [<ffffffff8106b940>] ? __queue_work+0x390/0x390
> [73309.538604] [<ffffffff81057f40>] __do_softirq+0xe0/0x290
> [73309.538778] [<ffffffff810581e6>] irq_exit+0xa6/0xb0
> [73309.538952] [<ffffffff8159413a>] smp_apic_timer_interrupt+0x4a/0x59
> [73309.539128] [<ffffffff815926bb>] apic_timer_interrupt+0x6b/0x70
...
> The gist is that this fail on the following line:
>
> if (last_pool && last_pool != pwq->pool) {
That's new.
> Since the pointer 'pwq' is wrong (it is loaded in %rdx) which in this
> case is 0000000000000000. Looking at the function's source pwq should
> be loaded by per_cpu_ptr since the if (!(wq->flags & WQ_UNBOUND))
> check should evaluate to false. So pwq is loaded as the result from
> unbound_pwq_by_node(wq, cpu_to_node(cpu));
>
> Here are the flags of the workqueue:
> crash> struct workqueue_struct.flags 0xffff8803df464c00
> flags = 131082
That's an ordered unbound workqueue w/ a rescuer.
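(For reference, assuming the flag values in include/linux/workqueue.h of this era: 131082 == 0x2000a == __WQ_ORDERED (1 << 17) | WQ_MEM_RECLAIM (1 << 3) | WQ_UNBOUND (1 << 1); WQ_MEM_RECLAIM is what gives it the rescuer.)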
> (0xffff8803df464c00 is indeed the pointer to the workqueue struct,
> so the flags aren't bogus).
>
> So reading the numa_pwq_tbl it seems that it's uninitialised:
>
> crash> struct workqueue_struct.numa_pwq_tbl 0xffff8803df464c00
> numa_pwq_tbl = 0xffff8803df464d10
> crash> rd -64 0xffff8803df464d10 3
> ffff8803df464d10: 0000000000000000 0000000000000000 ................
> ffff8803df464d20: 0000000000000000 ........
>
> The machine where the crash occurred has a single NUMA node, so at the
> very least I would have expected to have a pointer, rather than NULL ptr.
>
> Also this crash is not isolated in that I have observed it on multiple
> other nodes running vanilla 4.2.5/4.2.6 kernels.
>
> Any advice how to further debug that?
Adding printk or tracepoints at numa_pwq_tbl_install() to dump what's
being installed would be helpful. It should at least tell us whether
it's the table being corrupted by something else or workqueue failing
to set it up correctly to begin with. How reproducible is the
problem?
Thanks.
--
tejun
On 12/09/2015 06:08 PM, Tejun Heo wrote:
> Hello, Nikolay.
>
> On Wed, Dec 09, 2015 at 02:08:56PM +0200, Nikolay Borisov wrote:
>> 73309.529940] BUG: unable to handle kernel NULL pointer dereference at (null)
>> [73309.530238] IP: [<ffffffff8106b663>] __queue_work+0xb3/0x390
> ...
>> [73309.537319] <IRQ>
>> [73309.537373] [<ffffffff8106b940>] ? __queue_work+0x390/0x390
>> [73309.537714] [<ffffffff8106b958>] delayed_work_timer_fn+0x18/0x20
>> [73309.537891] [<ffffffff810ad1d7>] call_timer_fn+0x47/0x110
>> [73309.538071] [<ffffffff810be302>] ? tick_sched_timer+0x52/0xa0
>> [73309.538249] [<ffffffff810adb6f>] run_timer_softirq+0x17f/0x2b0
>> [73309.538425] [<ffffffff8106b940>] ? __queue_work+0x390/0x390
>> [73309.538604] [<ffffffff81057f40>] __do_softirq+0xe0/0x290
>> [73309.538778] [<ffffffff810581e6>] irq_exit+0xa6/0xb0
>> [73309.538952] [<ffffffff8159413a>] smp_apic_timer_interrupt+0x4a/0x59
>> [73309.539128] [<ffffffff815926bb>] apic_timer_interrupt+0x6b/0x70
> ...
>> The gist is that this fail on the following line:
>>
>> if (last_pool && last_pool != pwq->pool) {
>
> That's new.
>
>> Since the pointer 'pwq' is wrong (it is loaded in %rdx) which in this
>> case is 0000000000000000. Looking at the function's source pwq should
>> be loaded by per_cpu_ptr since the if (!(wq->flags & WQ_UNBOUND))
>> check should evaluate to false. So pwq is loaded as the result from
>> unbound_pwq_by_node(wq, cpu_to_node(cpu));
>>
>> Here are the flags of the workqueue:
>> crash> struct workqueue_struct.flags 0xffff8803df464c00
>> flags = 131082
>
> That's ordered unbound workqueue w/ a rescuer.
So the name of the queue is 'dm-thin'. Looking at the dm-thin sources,
the only place where a workqueue is allocated is here:
pool->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM);
But in this case I guess the caller can't be the culprit? I'm biased wrt
dm-thin because I've hit multiple bugs in it in the past few months.
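(For what it's worth, assuming the alloc_ordered_workqueue() macro of this era, it is just alloc_workqueue() with WQ_UNBOUND | __WQ_ORDERED or'd into the flags and max_active = 1, which is consistent with the flags value of 131082 above.)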
>
>> (0xffff8803df464c00 is indeed the pointer to the workqueue struct,
>> so the flags aren't bogus).
>>
>> So reading the numa_pwq_tbl it seems that it's uninitialised:
>>
>> crash> struct workqueue_struct.numa_pwq_tbl 0xffff8803df464c00
>> numa_pwq_tbl = 0xffff8803df464d10
>> crash> rd -64 0xffff8803df464d10 3
>> ffff8803df464d10: 0000000000000000 0000000000000000 ................
>> ffff8803df464d20: 0000000000000000 ........
>>
>> The machine where the crash occurred has a single NUMA node, so at the
>> very least I would have expected to have a pointer, rather than NULL ptr.
>>
>> Also this crash is not isolated in that I have observed it on multiple
>> other nodes running vanilla 4.2.5/4.2.6 kernels.
>>
>> Any advice how to further debug that?
>
> Adding printk or tracepoints at numa_pwq_tbl_install() to dump what's
> being installed would be helpful. It should at least tell us whether
> it's the table being corrupted by something else or workqueue failing
> to set it up correctly to begin with. How reproducible is the
> problem?
I think we are seeing this at least daily on at least one server (we have
multiple servers like that), so adding printk's would likely be the way
to go. Anything in particular you might be interested in knowing? I see
RCU stuff around, so it might be a tricky race condition.
>
> Thanks.
>
Hello,
On Wed, Dec 09, 2015 at 06:23:15PM +0200, Nikolay Borisov wrote:
> I think we are seeing this at least daily on at least 1 server (we have
> multiple servers like that). So adding printk's would likely be the way
> to go, anything in particular you might be interested in knowing? I see
> RCU stuff around so might be tricky race condition.
Printing out the workqueue's pointer, name, pwq's pointer, the node
being installed for and the installed pointer should give us enough
clues. There's RCU involved but the pointers shouldn't be becoming
NULLs unless we're installing NULL ptrs.
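Something like the following should do; this is an untested sketch against the 4.2 kernel/workqueue.c, so adjust it to whatever your tree looks like:

static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
						   int node,
						   struct pool_workqueue *pwq)
{
	struct pool_workqueue *old_pwq;

	lockdep_assert_held(&wq_pool_mutex);
	lockdep_assert_held(&wq->mutex);

	/* link_pwq() can handle duplicate calls */
	link_pwq(pwq);

	old_pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
	/* debug: dump every installation so it can be matched against a crash */
	pr_info("WQ: %p (%s) old_pwq: %p new_pwq: %p node: %d\n",
		wq, wq->name, old_pwq, pwq, node);
	rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
	return old_pwq;
}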
Thanks.
--
tejun
On 12/09/2015 06:27 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Dec 09, 2015 at 06:23:15PM +0200, Nikolay Borisov wrote:
>> I think we are seeing this at least daily on at least 1 server (we have
>> multiple servers like that). So adding printk's would likely be the way
>> to go, anything in particular you might be interested in knowing? I see
>> RCU stuff around so might be tricky race condition.
>
> Printing out the workqueue's pointer, name, pwq's pointer, the node
> being installed for and the installed pointer should give us enough
> clues. There's RCU involved but the pointers shouldn't be becoming
> NULLs unless we're installing NULL ptrs.
So the debug patch has been rolled out to one server, and several more
are in the process. Here is what it prints:
WQ: ffff88046f00ba00 (events_unbound) old_pwq: (null) new_pwq: ffff88046f00d300 node: 0
WQ: ffff88046f00be00 (events_power_efficient) old_pwq: (null) new_pwq: ffff88046f00d400 node: 0
WQ: ffff88046d71c000 (events_freezable_power_) old_pwq: (null) new_pwq: ffff88046f00d500 node: 0
WQ: ffff88046ce9ca00 (khelper) old_pwq: (null) new_pwq: ffff88046f00d600 node: 0
WQ: ffff88046ce9c000 (netns) old_pwq: (null) new_pwq: ffff88046f00d700 node: 0
WQ: ffff88046ce9d400 (perf) old_pwq: (null) new_pwq: ffff88046f00d800 node: 0
WQ: ffff88046c408000 (writeback) old_pwq: (null) new_pwq: ffff88046c800000 node: 0
WQ: ffff88046c409200 (kacpi_hotplug) old_pwq: (null) new_pwq: ffff88046c42e200 node: 0
WQ: ffff880468455600 (scsi_tmf_0) old_pwq: (null) new_pwq: ffff88046c801f00 node: 0
WQ: ffff8804687f4400 (scsi_tmf_1) old_pwq: (null) new_pwq: ffff88046caa6700 node: 0
WQ: ffff8804687f4c00 (scsi_tmf_2) old_pwq: (null) new_pwq: ffff88046caa6900 node: 0
WQ: ffff8804687f5400 (scsi_tmf_3) old_pwq: (null) new_pwq: ffff88046caa6b00 node: 0
WQ: ffff8804687f5c00 (scsi_tmf_4) old_pwq: (null) new_pwq: ffff88046caa6d00 node: 0
WQ: ffff8804687f6400 (scsi_tmf_5) old_pwq: (null) new_pwq: ffff88046caa7000 node: 0
WQ: ffff8804687f6c00 (scsi_tmf_6) old_pwq: (null) new_pwq: ffff88046caa7300 node: 0
WQ: ffff880467964000 (kdmremove) old_pwq: (null) new_pwq: ffff880467a3c800 node: 0
WQ: ffff880467965000 (deferwq) old_pwq: (null) new_pwq: ffff880467a3c100 node: 0
WQ: ffff8804669bc600 (ib_addr) old_pwq: (null) new_pwq: ffff88046845a600 node: 0
WQ: ffff88007d167e00 (qib0_0) old_pwq: (null) new_pwq: ffff880466c19800 node: 0
WQ: ffff88007d165a00 (qib0_1) old_pwq: (null) new_pwq: ffff880466c18e00 node: 0
WQ: ffff88007d165200 (ib_mad1) old_pwq: (null) new_pwq: ffff880466c19d00 node: 0
WQ: ffff8804665d2000 (ib_mad2) old_pwq: (null) new_pwq: ffff880466c18a00 node: 0
WQ: ffff8804667d7600 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff880469806100 node: 0
WQ: ffff880079a9fc00 (edac-poller) old_pwq: (null) new_pwq: ffff88007d5ebf00 node: 0
WQ: ffff88046b47cc00 (kvm-irqfd-cleanup) old_pwq: (null) new_pwq: ffff8804651f0f00 node: 0
WQ: ffff8804694baa00 (kloopd0) old_pwq: (null) new_pwq: ffff88046949d100 node: 0
WQ: ffff880079a9cc00 (kloopd1) old_pwq: (null) new_pwq: ffff8804698cb900 node: 0
WQ: ffff88046809dc00 (kloopd2) old_pwq: (null) new_pwq: ffff88046957aa00 node: 0
WQ: ffff88046809c000 (kloopd3) old_pwq: (null) new_pwq: ffff8804650acc00 node: 0
WQ: ffff880466f3b000 (kloopd4) old_pwq: (null) new_pwq: ffff880469575900 node: 0
WQ: ffff88046809e800 (kloopd5) old_pwq: (null) new_pwq: ffff880469888200 node: 0
WQ: ffff88046809de00 (kloopd6) old_pwq: (null) new_pwq: ffff880469827400 node: 0
WQ: ffff88007d5f1c00 (dm_bufio_cache) old_pwq: (null) new_pwq: ffff8804673dda00 node: 0
WQ: ffff88046c42a400 (dm-thin) old_pwq: (null) new_pwq: ffff880079955100 node: 0
WQ: ffff8804672d0800 (dm-thin) old_pwq: (null) new_pwq: ffff88046baed800 node: 0
WQ: ffff88046993fa00 (dm-thin) old_pwq: (null) new_pwq: ffff8804650ff100 node: 0
WQ: ffff88046993d400 (dm-thin) old_pwq: (null) new_pwq: ffff88046949d600 node: 0
WQ: ffff88046993e400 (dm-thin) old_pwq: (null) new_pwq: ffff88046b833000 node: 0
WQ: ffff880466466400 (dm-thin) old_pwq: (null) new_pwq: ffff88007da60d00 node: 0
WQ: ffff88046b3eb200 (dm-thin) old_pwq: (null) new_pwq: ffff88046633d200 node: 0
WQ: ffff8804672d0600 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff880079955400 node: 0
WQ: ffff88046b3eb600 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff880465684900 node: 0
WQ: ffff88046c42a400 (dm-thin) old_pwq: (null) new_pwq: ffff8800799ee900 node: 0
WQ: ffff880466f39a00 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff880469849e00 node: 0
WQ: ffff880467b0cc00 (dm-thin) old_pwq: (null) new_pwq: ffff88007d52fa00 node: 0
WQ: ffff8804672d4e00 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff88046ca07f00 node: 0
WQ: ffff880079a9ca00 (dm-thin) old_pwq: (null) new_pwq: ffff8802d1be9e00 node: 0
WQ: ffff880466175000 (dm-thin) old_pwq: (null) new_pwq: ffff8802d8efec00 node: 0
WQ: ffff880403f28400 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff8802e224dd00 node: 0
WQ: ffff880403f29a00 (dm-thin) old_pwq: (null) new_pwq: ffff880465685300 node: 0
WQ: ffff8804672d6c00 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff880466d69300 node: 0
WQ: ffff880466f3ba00 (dm-thin) old_pwq: (null) new_pwq: ffff880469576500 node: 0
WQ: ffff8804672d4600 (dm-thin) old_pwq: (null) new_pwq: ffff8802d1a1ee00 node: 0
WQ: ffff8803ccf5c200 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff8804657b3200 node: 0
Is this format ok? I have also observed the exact same crash
on a machine running a 4.1.12 kernel.
>
> Thanks.
>
On Thu, Dec 10, 2015 at 11:28:02AM +0200, Nikolay Borisov wrote:
> On 12/09/2015 06:27 PM, Tejun Heo wrote:
> > Hello,
> >
> > On Wed, Dec 09, 2015 at 06:23:15PM +0200, Nikolay Borisov wrote:
> >> I think we are seeing this at least daily on at least 1 server (we have
> >> multiple servers like that). So adding printk's would likely be the way
> >> to go, anything in particular you might be interested in knowing? I see
> >> RCU stuff around so might be tricky race condition.
> >
> > Printing out the workqueue's pointer, name, pwq's pointer, the node
> > being installed for and the installed pointer should give us enough
> > clues. There's RCU involved but the pointers shouldn't be becoming
> > NULLs unless we're installing NULL ptrs.
>
> So the debug patch has been rolled on 1 server and several more
> are in the process, here it is what it prints:
>
> WQ: ffff88046f00ba00 (events_unbound) old_pwq: (null) new_pwq: ffff88046f00d300 node: 0
...
> Is this format ok? Also I observed the exact same crash
> on a machine running 4.1.12 kernel as well.
Yeah, I think it can be a good starting point.
Thanks.
--
tejun
On 12/10/2015 05:29 PM, Tejun Heo wrote:
> On Thu, Dec 10, 2015 at 11:28:02AM +0200, Nikolay Borisov wrote:
>> On 12/09/2015 06:27 PM, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Wed, Dec 09, 2015 at 06:23:15PM +0200, Nikolay Borisov wrote:
>>>> I think we are seeing this at least daily on at least 1 server (we have
>>>> multiple servers like that). So adding printk's would likely be the way
>>>> to go, anything in particular you might be interested in knowing? I see
>>>> RCU stuff around so might be tricky race condition.
>>>
>>> Printing out the workqueue's pointer, name, pwq's pointer, the node
>>> being installed for and the installed pointer should give us enough
>>> clues. There's RCU involved but the pointers shouldn't be becoming
>>> NULLs unless we're installing NULL ptrs.
>>
>> So the debug patch has been rolled on 1 server and several more
>> are in the process, here it is what it prints:
>>
>> WQ: ffff88046f00ba00 (events_unbound) old_pwq: (null) new_pwq: ffff88046f00d300 node: 0
> ...
>> Is this format ok? Also I observed the exact same crash
>> on a machine running 4.1.12 kernel as well.
>
> Yeah, I think it can be a good starting point.
So I had a server with the patch just crash on me.
Here is what the queue looks like:
crash> struct workqueue_struct 0xffff8802420a4a00
struct workqueue_struct {
pwqs = {
next = 0xffff8802420a4c00,
prev = 0xffff8802420a4a00
},
list = {
next = 0xffff880351f9b210,
prev = 0xdead000000200200
},
mutex = {
count = {
counter = 1
},
wait_lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
},
wait_list = {
next = 0xffff8802420a4a28,
prev = 0xffff8802420a4a28
},
owner = 0x0,
osq = {
tail = {
counter = 0
}
}
},
work_color = 3,
flush_color = 3,
nr_pwqs_to_flush = {
counter = 0
},
first_flusher = 0x0,
flusher_queue = {
next = 0xffff8802420a4a60,
prev = 0xffff8802420a4a60
},
flusher_overflow = {
next = 0xffff8802420a4a70,
prev = 0xffff8802420a4a70
},
maydays = {
next = 0xffff8802420a4a80,
prev = 0xffff8802420a4a80
},
rescuer = 0xffff88046932ce40,
nr_drainers = 0,
saved_max_active = 1,
unbound_attrs = 0xffff8801a76c1f00,
dfl_pwq = 0x0,
wq_dev = 0x0,
name =
"dm-thin\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
rcu = {
next = 0xffff8802531c4c20,
func = 0xffffffff810692e0 <rcu_free_wq>
},
flags = 131082,
cpu_pwqs = 0x0,
numa_pwq_tbl = 0xffff8802420a4b10
}
crash> rd 0xffff8802420a4b10 2 (the machine has 2 NUMA nodes hence the
'2' argument)
ffff8802420a4b10: 0000000000000000 0000000000000000 ................
At the same time, searching for 0xffff8802420a4a00 in the debug output
shows nothing, IOW it seems the numa_pwq_tbl was apparently never
installed for this workqueue:
[root@smallvault8 ~]# grep 0xffff8802420a4a00 /var/log/messages
Dumping all the logs from the dmesg contained in the vmcore image I also
find nothing, and the following correlation:
[root@smallvault8 ~]# grep \(null\) wq.log | wc -l
1940
[root@smallvault8 ~]# wc -l wq.log
1940 wq.log
shows that every line in the log has old_pwq == (null), i.e. the
numa_pwq_tbl is only ever set at workqueue creation and never
re-assigned afterwards. So at this point it seems there is a situation
where the wq attrs are not being applied at all.
>
> Thanks.
>
Hello, Nikolay.
On Fri, Dec 11, 2015 at 05:57:22PM +0200, Nikolay Borisov wrote:
> So I had a server with the patch just crash on me:
>
> Here is how the queue looks like:
> crash> struct workqueue_struct 0xffff8802420a4a00
> struct workqueue_struct {
> pwqs = {
> next = 0xffff8802420a4c00,
> prev = 0xffff8802420a4a00
Hmmm... pwq list is already corrupt. ->prev is terminated but ->next
isn't.
> },
> list = {
> next = 0xffff880351f9b210,
> prev = 0xdead000000200200
Followed by 0xdead000000200200, which is likely from
CONFIG_ILLEGAL_POINTER_VALUE.
...
> name =
> "dm-thin\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
> rcu = {
> next = 0xffff8802531c4c20,
> func = 0xffffffff810692e0 <rcu_free_wq>
and call_rcu_sched() already called. The workqueue has already been
destroyed.
> },
> flags = 131082,
> cpu_pwqs = 0x0,
> numa_pwq_tbl = 0xffff8802420a4b10
> }
>
> crash> rd 0xffff8802420a4b10 2 (the machine has 2 NUMA nodes hence the
> '2' argument)
> ffff8802420a4b10: 0000000000000000 0000000000000000 ................
>
> At the same time searching for 0xffff8802420a4a00 in the debug output
> shows nothing IOW it seems that the numa_pwq_tbl is never installed for
> this workqueue apparently:
>
> [root@smallvault8 ~]# grep 0xffff8802420a4a00 /var/log/messages
>
> Also dumping all the logs from the dmesg contained in the vmcore image I
> find nothing and when I do the following correlation:
> [root@smallvault8 ~]# grep \(null\) wq.log | wc -l
> 1940
> [root@smallvault8 ~]# wc -l wq.log
> 1940 wq.log
>
> It seems what's happening is really just changing the numa_pwq_tbl on
> workqueue creation i.e. it is never re-assigned. So at this point I
> think it seems that there is a situation where the wqattr are not being
> applied at all.
Hmmm... No idea why it didn't show up in the debug log, but the only
way a workqueue could be in the above state is that either it got
explicitly destroyed or somehow pwq refcnting is messed up; in both
cases it should have shown up in the log.
cc'ing dm people. Is there any chance dm-thinp could be using the
workqueue after destroying it?
Thanks.
--
tejun
On Fri, Dec 11, 2015 at 7:08 PM, Tejun Heo <[email protected]> wrote:
> Hello, Nikolay.
>
> On Fri, Dec 11, 2015 at 05:57:22PM +0200, Nikolay Borisov wrote:
>> So I had a server with the patch just crash on me:
>>
>> Here is how the queue looks like:
>> crash> struct workqueue_struct 0xffff8802420a4a00
>> struct workqueue_struct {
>> pwqs = {
>> next = 0xffff8802420a4c00,
>> prev = 0xffff8802420a4a00
>
> Hmmm... pwq list is already corrupt. ->prev is terminated but ->next
> isn't.
>
>> },
>> list = {
>> next = 0xffff880351f9b210,
>> prev = 0xdead000000200200
>
> Followed by by 0xdead000000200200 which is likely from
> CONFIG_ILLEGAL_POINTER_VALUE.
>
> ...
>> name =
>> "dm-thin\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
>> rcu = {
>> next = 0xffff8802531c4c20,
>> func = 0xffffffff810692e0 <rcu_free_wq>
>
> and call_rcu_sched() already called. The workqueue has already been
> destroyed.
>
>> },
>> flags = 131082,
>> cpu_pwqs = 0x0,
>> numa_pwq_tbl = 0xffff8802420a4b10
>> }
>>
>> crash> rd 0xffff8802420a4b10 2 (the machine has 2 NUMA nodes hence the
>> '2' argument)
>> ffff8802420a4b10: 0000000000000000 0000000000000000 ................
>>
>> At the same time searching for 0xffff8802420a4a00 in the debug output
>> shows nothing IOW it seems that the numa_pwq_tbl is never installed for
>> this workqueue apparently:
>>
>> [root@smallvault8 ~]# grep 0xffff8802420a4a00 /var/log/messages
>>
>> Also dumping all the logs from the dmesg contained in the vmcore image I
>> find nothing and when I do the following correlation:
>> [root@smallvault8 ~]# grep \(null\) wq.log | wc -l
>> 1940
>> [root@smallvault8 ~]# wc -l wq.log
>> 1940 wq.log
>>
>> It seems what's happening is really just changing the numa_pwq_tbl on
>> workqueue creation i.e. it is never re-assigned. So at this point I
>> think it seems that there is a situation where the wqattr are not being
>> applied at all.
>
> Hmmm... No idea why it didn't show up in the debug log but the only
> way a workqueue could be in the above state is either it got
> explicitly destroyed or somehow pwq refcnting is messed up, in both
> cases it should have shown up in the log.
>
> cc'ing dm people. Is there any chance dm-think could be using
> workqueue after destroying it?
In __pool_destroy in dm-thin.c I don't see a call to
cancel_delayed_work before destroying the workqueue. Is it possible
that this is the cause?
>
> Thanks.
>
> --
> tejun
On Fri, Dec 11 2015 at 1:00pm -0500,
Nikolay Borisov <[email protected]> wrote:
> On Fri, Dec 11, 2015 at 7:08 PM, Tejun Heo <[email protected]> wrote:
> >
> > Hmmm... No idea why it didn't show up in the debug log but the only
> > way a workqueue could be in the above state is either it got
> > explicitly destroyed or somehow pwq refcnting is messed up, in both
> > cases it should have shown up in the log.
> >
> > cc'ing dm people. Is there any chance dm-thinp could be using
> > workqueue after destroying it?
Not that I'm aware of. But never say never?
Plus I'd think we'd see other dm-thinp specific use-after-free issues
aside from the thin-pool's workqueue.
> In __pool_destroy in dm-thin.c I don't see a call to
> cancel_delayed_work before destroying the workqueue. Is it possible
> that this is the causeI
Cannot see how, __pool_destroy()'s destroy_workqueue() would spew a
bunch of WARN_ONs (and the wq wouldn't be destroyed) if the workqueue
had outstanding work.
__pool_destroy() is called once the thin-pool's ref count drops to 0
(see __pool_dec which is called when the thin-pool is removed --
e.g. with 'dmsetup remove'). This code is only reachable when nothing
else is using the thin-pool.
And the thin-pool is only able to be removed if all thin devices that
depend on it have first been removed. And each individual thin device
waits for all outstanding IO before they can be removed.
On 12/11/2015 09:14 PM, Mike Snitzer wrote:
> On Fri, Dec 11 2015 at 1:00pm -0500,
> Nikolay Borisov <[email protected]> wrote:
>
>> On Fri, Dec 11, 2015 at 7:08 PM, Tejun Heo <[email protected]> wrote:
>>>
>>> Hmmm... No idea why it didn't show up in the debug log but the only
>>> way a workqueue could be in the above state is either it got
>>> explicitly destroyed or somehow pwq refcnting is messed up, in both
>>> cases it should have shown up in the log.
>>>
>>> cc'ing dm people. Is there any chance dm-thinp could be using
>>> workqueue after destroying it?
>
> Not that I'm aware of. But never say never?
>
> Plus I'd think we'd see other dm-thinp specific use-after-free issues
> aside from the thin-pool's workqueue.
>
>> In __pool_destroy in dm-thin.c I don't see a call to
>> cancel_delayed_work before destroying the workqueue. Is it possible
>> that this is the causeI
>
> Cannot see how, __pool_destroy()'s destroy_workqueue() would spew a
> bunch of WARN_ONs (and the wq wouldn't be destroyed) if the workqueue
> had outstanding work.
>
> __pool_destroy() is called once the thin-pool's ref count drops to 0
> (see __pool_dec which is called when the thin-pool is removed --
> e.g. with 'dmsetup remove'). This code is only reachable when nothing
> else is using the thin-pool.
>
> And the thin-pool is only able to be removed if all thin devices that
> depend on it have first been removed. And each individual thin device
> waits for all outstanding IO before they can be removed.
Ok, I had a closer look at the code now, and it indeed seems that when
the pool is suspended, its postsuspend callback cancels the delayed work
and flushes the workqueue. But given that I see those failures on at
least 2-3 servers per day, I doubt it is a hardware/machine-specific
issue. Furthermore, the fact that it is always a dm-thin queue that's
being referenced points in the direction of dm-thin, even though the
code looks solid in that regard.
Regards,
Nikolay
On 12/11/2015 07:08 PM, Tejun Heo wrote:
> Hello, Nikolay.
>
> On Fri, Dec 11, 2015 at 05:57:22PM +0200, Nikolay Borisov wrote:
>> So I had a server with the patch just crash on me:
>>
>> Here is how the queue looks like:
>> crash> struct workqueue_struct 0xffff8802420a4a00
>> struct workqueue_struct {
>> pwqs = {
>> next = 0xffff8802420a4c00,
>> prev = 0xffff8802420a4a00
>
> Hmmm... pwq list is already corrupt. ->prev is terminated but ->next
> isn't.
>
>> },
>> list = {
>> next = 0xffff880351f9b210,
>> prev = 0xdead000000200200
>
> Followed by by 0xdead000000200200 which is likely from
> CONFIG_ILLEGAL_POINTER_VALUE.
>
> ...
>> name =
>> "dm-thin\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
>> rcu = {
>> next = 0xffff8802531c4c20,
>> func = 0xffffffff810692e0 <rcu_free_wq>
>
> and call_rcu_sched() already called. The workqueue has already been
> destroyed.
>
>> },
>> flags = 131082,
>> cpu_pwqs = 0x0,
>> numa_pwq_tbl = 0xffff8802420a4b10
>> }
>>
>> crash> rd 0xffff8802420a4b10 2 (the machine has 2 NUMA nodes hence the
>> '2' argument)
>> ffff8802420a4b10: 0000000000000000 0000000000000000 ................
>>
>> At the same time searching for 0xffff8802420a4a00 in the debug output
>> shows nothing IOW it seems that the numa_pwq_tbl is never installed for
>> this workqueue apparently:
>>
>> [root@smallvault8 ~]# grep 0xffff8802420a4a00 /var/log/messages
>>
>> Also dumping all the logs from the dmesg contained in the vmcore image I
>> find nothing and when I do the following correlation:
>> [root@smallvault8 ~]# grep \(null\) wq.log | wc -l
>> 1940
>> [root@smallvault8 ~]# wc -l wq.log
>> 1940 wq.log
>>
>> It seems what's happening is really just changing the numa_pwq_tbl on
>> workqueue creation i.e. it is never re-assigned. So at this point I
>> think it seems that there is a situation where the wqattr are not being
>> applied at all.
>
> Hmmm... No idea why it didn't show up in the debug log but the only
> way a workqueue could be in the above state is either it got
> explicitly destroyed or somehow pwq refcnting is messed up, in both
> cases it should have shown up in the log.
I had another poke at the backtrace that is produced, and here is what
the delayed_work looks like:
crash> struct delayed_work ffff88036772c8c0
struct delayed_work {
work = {
data = {
counter = 1537
},
entry = {
next = 0xffff88036772c8c8,
prev = 0xffff88036772c8c8
},
func = 0xffffffffa0211a30 <do_waker>
},
timer = {
entry = {
next = 0x0,
prev = 0xdead000000200200
},
expires = 4349463655,
base = 0xffff88047fd2d602,
function = 0xffffffff8106da40 <delayed_work_timer_fn>,
data = 18446612146934696128,
slack = -1,
start_pid = -1,
start_site = 0x0,
start_comm =
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
},
wq = 0xffff88030cf65400,
cpu = 21
}
From this it seems that the timer is also cancelled/expired, judging by
the values in timer->entry. But then again, in dm-thin the pool is
first suspended, which implies the following functions were called:
cancel_delayed_work(&pool->waker);
cancel_delayed_work(&pool->no_space_timeout);
flush_workqueue(pool->wq);
so at that point dm-thin's workqueue should be empty and it shouldn't be
possible to queue any more delayed work. But the crashdump clearly shows
that the opposite is happening. So far all of this points to a race
condition, and inserting some sleeps after umount and after vgchange -Kan
(the command to disable the volume group and suspend it, so that the
cancel_delayed_work calls are invoked) seems to reduce the frequency of
crashes, though it doesn't eliminate them.
>
> cc'ing dm people. Is there any chance dm-think could be using
> workqueue after destroying it?
>
> Thanks.
>
On Mon, Dec 14 2015 at 3:41P -0500,
Nikolay Borisov <[email protected]> wrote:
> Had another poke at the backtrace that is produced and here what the
> delayed_work looks like:
>
> crash> struct delayed_work ffff88036772c8c0
> struct delayed_work {
> work = {
> data = {
> counter = 1537
> },
> entry = {
> next = 0xffff88036772c8c8,
> prev = 0xffff88036772c8c8
> },
> func = 0xffffffffa0211a30 <do_waker>
> },
> timer = {
> entry = {
> next = 0x0,
> prev = 0xdead000000200200
> },
> expires = 4349463655,
> base = 0xffff88047fd2d602,
> function = 0xffffffff8106da40 <delayed_work_timer_fn>,
> data = 18446612146934696128,
> slack = -1,
> start_pid = -1,
> start_site = 0x0,
> start_comm =
> "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
> },
> wq = 0xffff88030cf65400,
> cpu = 21
> }
>
> From this it seems that the timer is also cancelled/expired judging by
> the values in timer -> entry. But then again in dm-thin the pool is
> first suspended, which implies the following functions were called:
>
> cancel_delayed_work(&pool->waker);
> cancel_delayed_work(&pool->no_space_timeout);
> flush_workqueue(pool->wq);
>
> so at that point dm-thin's workqueue should be empty and it shouldn't be
> possible to queue any more delayed work. But the crashdump clearly shows
> that the opposite is happening. So far all of this points to a race
> condition and inserting some sleeps after umount and after vgchange -Kan
> (command to disable volume group and suspend, so the cancel_delayed_work
> is invoked) seems to reduce the frequency of crashes, though it doesn't
> eliminate them.
'vgchange -Kan' doesn't suspend the pool before it destroys the device.
So the cancel_delayed_work()s you referenced aren't applicable.
Can you try this patch?
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 63903a5..b201d887 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2750,8 +2750,11 @@ static void __pool_destroy(struct pool *pool)
 	dm_bio_prison_destroy(pool->prison);
 	dm_kcopyd_client_destroy(pool->copier);

-	if (pool->wq)
+	if (pool->wq) {
+		cancel_delayed_work(&pool->waker);
+		cancel_delayed_work(&pool->no_space_timeout);
 		destroy_workqueue(pool->wq);
+	}

 	if (pool->next_mapping)
 		mempool_free(pool->next_mapping, pool->mapping_pool);
On Mon, Dec 14, 2015 at 5:31 PM, Mike Snitzer <[email protected]> wrote:
> On Mon, Dec 14 2015 at 3:41P -0500,
> Nikolay Borisov <[email protected]> wrote:
>
>> Had another poke at the backtrace that is produced and here what the
>> delayed_work looks like:
>>
>> crash> struct delayed_work ffff88036772c8c0
>> struct delayed_work {
>> work = {
>> data = {
>> counter = 1537
>> },
>> entry = {
>> next = 0xffff88036772c8c8,
>> prev = 0xffff88036772c8c8
>> },
>> func = 0xffffffffa0211a30 <do_waker>
>> },
>> timer = {
>> entry = {
>> next = 0x0,
>> prev = 0xdead000000200200
>> },
>> expires = 4349463655,
>> base = 0xffff88047fd2d602,
>> function = 0xffffffff8106da40 <delayed_work_timer_fn>,
>> data = 18446612146934696128,
>> slack = -1,
>> start_pid = -1,
>> start_site = 0x0,
>> start_comm =
>> "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
>> },
>> wq = 0xffff88030cf65400,
>> cpu = 21
>> }
>>
>> From this it seems that the timer is also cancelled/expired judging by
>> the values in timer -> entry. But then again in dm-thin the pool is
>> first suspended, which implies the following functions were called:
>>
>> cancel_delayed_work(&pool->waker);
>> cancel_delayed_work(&pool->no_space_timeout);
>> flush_workqueue(pool->wq);
>>
>> so at that point dm-thin's workqueue should be empty and it shouldn't be
>> possible to queue any more delayed work. But the crashdump clearly shows
>> that the opposite is happening. So far all of this points to a race
>> condition and inserting some sleeps after umount and after vgchange -Kan
>> (command to disable volume group and suspend, so the cancel_delayed_work
>> is invoked) seems to reduce the frequency of crashes, though it doesn't
>> eliminate them.
>
> 'vgchange -Kan' doesn't suspend the pool before it destroys the device.
> So the cancel_delayed_work()s you referenced aren't applicable.
Hm, but doesn't it in fact suspend it before destroying it? Using the
following simple stap script proves so:
probe module("dm_thin_pool").function("__pool_destroy") {
print("=========__pool_destroy======");
print_backtrace();
}
probe module("dm_thin_pool").function("pool_postsuspend") {
printf("==== POOL_POSTSUSPEND =====\n");
print_backtrace();
}
Produces the following backtraces:
==== POOL_POSTSUSPEND =====
0xffffffffa033ad40 : pool_postsuspend+0x0/0x50 [dm_thin_pool]
0xffffffff8148a5bf : suspend_targets+0x3f/0x90 [kernel]
0xffffffff8148a668 : dm_table_postsuspend_targets+0x18/0x20 [kernel]
0xffffffff814886dc : __dm_destroy+0x17c/0x190 [kernel]
0xffffffff81488723 : dm_destroy+0x13/0x20 [kernel]
0xffffffff8148f55a : dev_remove+0xfa/0x130 [kernel]
0xffffffff8148fe94 : ctl_ioctl+0x1d4/0x2e0 [kernel]
0xffffffff8148ffb3 : dm_ctl_ioctl+0x13/0x20 [kernel]
0xffffffff811af3f3 : do_vfs_ioctl+0x73/0x380 [kernel]
0xffffffff811af792 : sys_ioctl+0x92/0xa0 [kernel]
0xffffffff8159ae2e : entry_SYSCALL_64_fastpath+0x12/0x71 [kernel]
=========__pool_destroy======
0xffffffffa033ae20 : __pool_destroy+0x0/0x110 [dm_thin_pool]
0xffffffffa033af61 : __pool_dec+0x31/0x50 [dm_thin_pool]
0xffffffffa033afae : pool_dtr+0x2e/0x70 [dm_thin_pool]
0xffffffff8148c085 : dm_table_destroy+0x65/0x120 [kernel]
0xffffffff8148868a : __dm_destroy+0x12a/0x190 [kernel]
0xffffffff81488723 : dm_destroy+0x13/0x20 [kernel]
0xffffffff8148f55a : dev_remove+0xfa/0x130 [kernel]
0xffffffff8148fe94 : ctl_ioctl+0x1d4/0x2e0 [kernel]
0xffffffff8148ffb3 : dm_ctl_ioctl+0x13/0x20 [kernel]
0xffffffff811af3f3 : do_vfs_ioctl+0x73/0x380 [kernel]
0xffffffff811af792 : sys_ioctl+0x92/0xa0 [kernel]
0xffffffff8159ae2e : entry_SYSCALL_64_fastpath+0x12/0x71 [kernel]
This is when I run vgchange -Kan on a volume group. So in __dm_destroy,
before dm_table_destroy (which calls pool_dtr), the device is checked to
see whether it is suspended, and if not, dm core invokes the pre/post
suspend hooks, and this should cause the workqueue to be flushed and
left in a quiescent state. No?
What am I missing?
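For reference, the piece of __dm_destroy() (drivers/md/dm.c) I am referring to reads roughly like this in my tree (trimmed):

	map = dm_get_live_table(md, &srcu_idx);
	...
	if (!dm_suspended_md(md)) {
		/* a still-live device gets suspended before it is torn down */
		dm_table_presuspend_targets(map);
		dm_table_postsuspend_targets(map);
	}
	...
	dm_table_destroy(__unbind(md));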
>
> Can you try this patch?
I've scheduled some machines to go online with this patch and
will report back if it changes the situation. Thanks a lot!
>
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 63903a5..b201d887 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -2750,8 +2750,11 @@ static void __pool_destroy(struct pool *pool)
> dm_bio_prison_destroy(pool->prison);
> dm_kcopyd_client_destroy(pool->copier);
>
> - if (pool->wq)
> + if (pool->wq) {
> + cancel_delayed_work(&pool->waker);
> + cancel_delayed_work(&pool->no_space_timeout);
> destroy_workqueue(pool->wq);
> + }
>
> if (pool->next_mapping)
> mempool_free(pool->next_mapping, pool->mapping_pool);
On Mon, Dec 14 2015 at 3:11pm -0500,
Nikolay Borisov <[email protected]> wrote:
> On Mon, Dec 14, 2015 at 5:31 PM, Mike Snitzer <[email protected]> wrote:
> > On Mon, Dec 14 2015 at 3:41P -0500,
> > Nikolay Borisov <[email protected]> wrote:
> >
> >> Had another poke at the backtrace that is produced and here what the
> >> delayed_work looks like:
> >>
> >> crash> struct delayed_work ffff88036772c8c0
> >> struct delayed_work {
> >> work = {
> >> data = {
> >> counter = 1537
> >> },
> >> entry = {
> >> next = 0xffff88036772c8c8,
> >> prev = 0xffff88036772c8c8
> >> },
> >> func = 0xffffffffa0211a30 <do_waker>
> >> },
> >> timer = {
> >> entry = {
> >> next = 0x0,
> >> prev = 0xdead000000200200
> >> },
> >> expires = 4349463655,
> >> base = 0xffff88047fd2d602,
> >> function = 0xffffffff8106da40 <delayed_work_timer_fn>,
> >> data = 18446612146934696128,
> >> slack = -1,
> >> start_pid = -1,
> >> start_site = 0x0,
> >> start_comm =
> >> "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
> >> },
> >> wq = 0xffff88030cf65400,
> >> cpu = 21
> >> }
> >>
> >> From this it seems that the timer is also cancelled/expired judging by
> >> the values in timer -> entry. But then again in dm-thin the pool is
> >> first suspended, which implies the following functions were called:
> >>
> >> cancel_delayed_work(&pool->waker);
> >> cancel_delayed_work(&pool->no_space_timeout);
> >> flush_workqueue(pool->wq);
> >>
> >> so at that point dm-thin's workqueue should be empty and it shouldn't be
> >> possible to queue any more delayed work. But the crashdump clearly shows
> >> that the opposite is happening. So far all of this points to a race
> >> condition and inserting some sleeps after umount and after vgchange -Kan
> >> (command to disable volume group and suspend, so the cancel_delayed_work
> >> is invoked) seems to reduce the frequency of crashes, though it doesn't
> >> eliminate them.
> >
> > 'vgchange -Kan' doesn't suspend the pool before it destroys the device.
> > So the cancel_delayed_work()s you referenced aren't applicable.
>
> Hm, but does it not in fact destroy it. Using the following simple
> stap script proves so:
>
>
> probe module("dm_thin_pool").function("__pool_destroy") {
> print("=========__pool_destroy======");
> print_backtrace();
>
> }
>
> probe module("dm_thin_pool").function("pool_postsuspend") {
>
> printf("==== POOL_POSTSUSPEND =====\n");
> print_backtrace();
>
> }
>
> Produces the following backtraces:
>
> ==== POOL_POSTSUSPEND =====
> 0xffffffffa033ad40 : pool_postsuspend+0x0/0x50 [dm_thin_pool]
> 0xffffffff8148a5bf : suspend_targets+0x3f/0x90 [kernel]
> 0xffffffff8148a668 : dm_table_postsuspend_targets+0x18/0x20 [kernel]
> 0xffffffff814886dc : __dm_destroy+0x17c/0x190 [kernel]
> 0xffffffff81488723 : dm_destroy+0x13/0x20 [kernel]
> 0xffffffff8148f55a : dev_remove+0xfa/0x130 [kernel]
> 0xffffffff8148fe94 : ctl_ioctl+0x1d4/0x2e0 [kernel]
> 0xffffffff8148ffb3 : dm_ctl_ioctl+0x13/0x20 [kernel]
> 0xffffffff811af3f3 : do_vfs_ioctl+0x73/0x380 [kernel]
> 0xffffffff811af792 : sys_ioctl+0x92/0xa0 [kernel]
> 0xffffffff8159ae2e : entry_SYSCALL_64_fastpath+0x12/0x71 [kernel]
> =========__pool_destroy====== 0xffffffffa033ae20 :
> __pool_destroy+0x0/0x110 [dm_thin_pool]
> 0xffffffffa033af61 : __pool_dec+0x31/0x50 [dm_thin_pool]
> 0xffffffffa033afae : pool_dtr+0x2e/0x70 [dm_thin_pool]
> 0xffffffff8148c085 : dm_table_destroy+0x65/0x120 [kernel]
> 0xffffffff8148868a : __dm_destroy+0x12a/0x190 [kernel]
> 0xffffffff81488723 : dm_destroy+0x13/0x20 [kernel]
> 0xffffffff8148f55a : dev_remove+0xfa/0x130 [kernel]
> 0xffffffff8148fe94 : ctl_ioctl+0x1d4/0x2e0 [kernel]
> 0xffffffff8148ffb3 : dm_ctl_ioctl+0x13/0x20 [kernel]
> 0xffffffff811af3f3 : do_vfs_ioctl+0x73/0x380 [kernel]
> 0xffffffff811af792 : sys_ioctl+0x92/0xa0 [kernel]
> 0xffffffff8159ae2e : entry_SYSCALL_64_fastpath+0x12/0x71 [kernel]
>
> When I run vgchange -Kan on a volume group. So in __dm_destroy before
> dm_table_destroy (which calls pool_dtr)
> the device is checked to see if it is suspended, and if not not dm
> core would invoke the pre/post suspend hooks, and
> this should cause the workqueue to be flushed and in quiescent state. No?
>
> What am I missing?
Nothing, clearly you're right!
> >
> > Can you try this patch?
>
> I've scheduled some machines to go online with this patch and
> will report back if it changes the situation. Thanks a lot!
Shouldn't make any difference given the above.
But the fact that the suspend hooks are used during destroy (if not
already suspended) makes this report all the more bizarre.
On 12/14/2015 10:31 PM, Mike Snitzer wrote:
> On Mon, Dec 14 2015 at 3:11pm -0500,
> Nikolay Borisov <[email protected]> wrote:
>
>> On Mon, Dec 14, 2015 at 5:31 PM, Mike Snitzer <[email protected]> wrote:
>>> On Mon, Dec 14 2015 at 3:41P -0500,
>>> Nikolay Borisov <[email protected]> wrote:
>>>
>>>> Had another poke at the backtrace that is produced and here what the
>>>> delayed_work looks like:
>>>>
>>>> crash> struct delayed_work ffff88036772c8c0
>>>> struct delayed_work {
>>>> work = {
>>>> data = {
>>>> counter = 1537
>>>> },
>>>> entry = {
>>>> next = 0xffff88036772c8c8,
>>>> prev = 0xffff88036772c8c8
>>>> },
>>>> func = 0xffffffffa0211a30 <do_waker>
>>>> },
>>>> timer = {
>>>> entry = {
>>>> next = 0x0,
>>>> prev = 0xdead000000200200
>>>> },
>>>> expires = 4349463655,
>>>> base = 0xffff88047fd2d602,
>>>> function = 0xffffffff8106da40 <delayed_work_timer_fn>,
>>>> data = 18446612146934696128,
>>>> slack = -1,
>>>> start_pid = -1,
>>>> start_site = 0x0,
>>>> start_comm =
>>>> "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
>>>> },
>>>> wq = 0xffff88030cf65400,
>>>> cpu = 21
>>>> }
>>>>
>>>> From this it seems that the timer is also cancelled/expired judging by
>>>> the values in timer -> entry. But then again in dm-thin the pool is
>>>> first suspended, which implies the following functions were called:
>>>>
>>>> cancel_delayed_work(&pool->waker);
>>>> cancel_delayed_work(&pool->no_space_timeout);
>>>> flush_workqueue(pool->wq);
>>>>
>>>> so at that point dm-thin's workqueue should be empty and it shouldn't be
>>>> possible to queue any more delayed work. But the crashdump clearly shows
>>>> that the opposite is happening. So far all of this points to a race
>>>> condition and inserting some sleeps after umount and after vgchange -Kan
>>>> (command to disable volume group and suspend, so the cancel_delayed_work
>>>> is invoked) seems to reduce the frequency of crashes, though it doesn't
>>>> eliminate them.
>>>
>>> 'vgchange -Kan' doesn't suspend the pool before it destroys the device.
>>> So the cancel_delayed_work()s you referenced aren't applicable.
>>
>> Hm, but does it not in fact destroy it. Using the following simple
>> stap script proves so:
>>
>>
>> probe module("dm_thin_pool").function("__pool_destroy") {
>> print("=========__pool_destroy======");
>> print_backtrace();
>>
>> }
>>
>> probe module("dm_thin_pool").function("pool_postsuspend") {
>>
>> printf("==== POOL_POSTSUSPEND =====\n");
>> print_backtrace();
>>
>> }
>>
>> Produces the following backtraces:
>>
>> ==== POOL_POSTSUSPEND =====
>> 0xffffffffa033ad40 : pool_postsuspend+0x0/0x50 [dm_thin_pool]
>> 0xffffffff8148a5bf : suspend_targets+0x3f/0x90 [kernel]
>> 0xffffffff8148a668 : dm_table_postsuspend_targets+0x18/0x20 [kernel]
>> 0xffffffff814886dc : __dm_destroy+0x17c/0x190 [kernel]
>> 0xffffffff81488723 : dm_destroy+0x13/0x20 [kernel]
>> 0xffffffff8148f55a : dev_remove+0xfa/0x130 [kernel]
>> 0xffffffff8148fe94 : ctl_ioctl+0x1d4/0x2e0 [kernel]
>> 0xffffffff8148ffb3 : dm_ctl_ioctl+0x13/0x20 [kernel]
>> 0xffffffff811af3f3 : do_vfs_ioctl+0x73/0x380 [kernel]
>> 0xffffffff811af792 : sys_ioctl+0x92/0xa0 [kernel]
>> 0xffffffff8159ae2e : entry_SYSCALL_64_fastpath+0x12/0x71 [kernel]
>> =========__pool_destroy====== 0xffffffffa033ae20 :
>> __pool_destroy+0x0/0x110 [dm_thin_pool]
>> 0xffffffffa033af61 : __pool_dec+0x31/0x50 [dm_thin_pool]
>> 0xffffffffa033afae : pool_dtr+0x2e/0x70 [dm_thin_pool]
>> 0xffffffff8148c085 : dm_table_destroy+0x65/0x120 [kernel]
>> 0xffffffff8148868a : __dm_destroy+0x12a/0x190 [kernel]
>> 0xffffffff81488723 : dm_destroy+0x13/0x20 [kernel]
>> 0xffffffff8148f55a : dev_remove+0xfa/0x130 [kernel]
>> 0xffffffff8148fe94 : ctl_ioctl+0x1d4/0x2e0 [kernel]
>> 0xffffffff8148ffb3 : dm_ctl_ioctl+0x13/0x20 [kernel]
>> 0xffffffff811af3f3 : do_vfs_ioctl+0x73/0x380 [kernel]
>> 0xffffffff811af792 : sys_ioctl+0x92/0xa0 [kernel]
>> 0xffffffff8159ae2e : entry_SYSCALL_64_fastpath+0x12/0x71 [kernel]
>>
>> When I run vgchange -Kan on a volume group. So in __dm_destroy before
>> dm_table_destroy (which calls pool_dtr)
>> the device is checked to see if it is suspended, and if not not dm
>> core would invoke the pre/post suspend hooks, and
>> this should cause the workqueue to be flushed and in quiescent state. No?
>>
>> What am I missing?
>
> Nothing, clearly you're right!
>
>>>
>>> Can you try this patch?
>>
>> I've scheduled some machines to go online with this patch and
>> will report back if it changes the situation. Thanks a lot!
>
> Shouldn't make any difference given the above.
>
> But in that the suspend hooks are used during destroy (if not already
> suspended): makes this report all the more bizarre.
I applied the following patch:
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 493c38e08bd2..ccbbf7823cf3 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3506,8 +3506,8 @@ static void pool_postsuspend(struct dm_target *ti)
 	struct pool_c *pt = ti->private;
 	struct pool *pool = pt->pool;

-	cancel_delayed_work(&pool->waker);
-	cancel_delayed_work(&pool->no_space_timeout);
+	cancel_delayed_work_sync(&pool->waker);
+	cancel_delayed_work_sync(&pool->no_space_timeout);
 	flush_workqueue(pool->wq);
 	(void) commit(pool);
 }
And this seems to have resolved the crashes. For the past 24 hours I
haven't seen a single server crash whereas before at least 3-5 servers
would crash.
Given that, it seems like a race condition between destroying the
workqueue from dm-thin and cancelling all the delayed work.
Tejun, I've looked at cancel_delayed_work/cancel_delayed_work_sync; they
both call try_to_grab_pending and then their behaviour diverges. Is it
possible that there is a latent race condition between canceling the
delayed work and the subsequent re-scheduling of the work item?
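For reference, the two read roughly like this in kernel/workqueue.c of this era:

bool cancel_delayed_work(struct delayed_work *dwork)
{
	unsigned long flags;
	int ret;

	do {
		ret = try_to_grab_pending(&dwork->work, true, &flags);
	} while (unlikely(ret == -EAGAIN));

	if (unlikely(ret < 0))
		return false;

	set_work_pool_and_clear_pending(&dwork->work,
					get_work_pool_id(&dwork->work));
	local_irq_restore(flags);
	return ret;	/* the work callback may still be running here */
}

bool cancel_delayed_work_sync(struct delayed_work *dwork)
{
	/* unlike the above, this also waits for a running callback to finish */
	return __cancel_work_timer(&dwork->work, true);
}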
Hello, Nikolay.
On Thu, Dec 17, 2015 at 12:46:10PM +0200, Nikolay Borisov wrote:
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 493c38e08bd2..ccbbf7823cf3 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -3506,8 +3506,8 @@ static void pool_postsuspend(struct dm_target *ti)
> struct pool_c *pt = ti->private;
> struct pool *pool = pt->pool;
>
> - cancel_delayed_work(&pool->waker);
> - cancel_delayed_work(&pool->no_space_timeout);
> + cancel_delayed_work_sync(&pool->waker);
> + cancel_delayed_work_sync(&pool->no_space_timeout);
> flush_workqueue(pool->wq);
> (void) commit(pool);
> }
>
> And this seems to have resolved the crashes. For the past 24 hours I
> haven't seen a single server crash whereas before at least 3-5 servers
> would crash.
So, that's an obvious bug on the dm-thin side.
> Given that, it seems like a race condition between destroying the
> workqueue from dm-thin and cancelling all the delayed work.
>
> Tejun, I've looked at cancel_delayed_work/cancel_delayed_work_sync and
> they both call try_to_grab_pending and then their function diverges. Is
> it possible that there is a latent race condition between canceling the
> delayed work and the subsequent re-scheduling of the work item?
It's just the wrong variant being used. cancel_delayed_work() doesn't
guarantee that the work item isn't running on return. If the work
item was running and the workqueue is destroyed afterwards, it may end
up trying to requeue itself on a destroyed workqueue.
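dm-thin's waker is exactly this kind of self-requeueing work item; roughly, from drivers/md/dm-thin.c of this vintage:

/*
 * We want to commit periodically so that not too much
 * unwritten data builds up.
 */
static void do_waker(struct work_struct *ws)
{
	struct pool *pool = container_of(to_delayed_work(ws), struct pool, waker);

	wake_worker(pool);
	/* re-arms itself on pool->wq, which is what races with destroy_workqueue() */
	queue_delayed_work(pool->wq, &pool->waker, COMMIT_PERIOD);
}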
Thanks.
--
tejun
On 12/17/2015 05:33 PM, Tejun Heo wrote:
> Hello, Nikolay.
>
> On Thu, Dec 17, 2015 at 12:46:10PM +0200, Nikolay Borisov wrote:
>> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
>> index 493c38e08bd2..ccbbf7823cf3 100644
>> --- a/drivers/md/dm-thin.c
>> +++ b/drivers/md/dm-thin.c
>> @@ -3506,8 +3506,8 @@ static void pool_postsuspend(struct dm_target *ti)
>> struct pool_c *pt = ti->private;
>> struct pool *pool = pt->pool;
>>
>> - cancel_delayed_work(&pool->waker);
>> - cancel_delayed_work(&pool->no_space_timeout);
>> + cancel_delayed_work_sync(&pool->waker);
>> + cancel_delayed_work_sync(&pool->no_space_timeout);
>> flush_workqueue(pool->wq);
>> (void) commit(pool);
>> }
>>
>> And this seems to have resolved the crashes. For the past 24 hours I
>> haven't seen a single server crash whereas before at least 3-5 servers
>> would crash.
>
> So, that's an obvious bug on dm-thin side.
Mike, if you are ok with this, shall I submit a proper patch?
>
>> Given that, it seems like a race condition between destroying the
>> workqueue from dm-thin and cancelling all the delayed work.
>>
>> Tejun, I've looked at cancel_delayed_work/cancel_delayed_work_sync and
>> they both call try_to_grab_pending and then their function diverges. Is
>> it possible that there is a latent race condition between canceling the
>> delayed work and the subsequent re-scheduling of the work item?
>
> It's just the wrong variant being used. cancel_delayed_work() doesn't
> guarantee that the work item isn't running on return. If the work
> item was running and the workqueue is destroyed afterwards, it may end
> up trying to requeue itself on a destroyed workqueue.
Right, but my initial understanding was that canceling the delayed work
and then issuing flush_workqueue would act the same way as calling
cancel_delayed_work_sync wrt this particular delayed item, no?
>
> Thanks.
>
Hello, Nikolay.
On Thu, Dec 17, 2015 at 05:43:12PM +0200, Nikolay Borisov wrote:
> Right, but my initial understanding was that when canceling the delayed
> work and then issuing flush_workqueue would act the same way as if
> cancel_delayed_work_sync is called wrt to this particular delayed item, no?
Not necessarily. cancel_delayed_work() cancels whatever is currently
pending. flush_workqueue() flushes whatever is pending and in flight
at the time of invocation. Imagine the following scenario.
1. Work item is running but hasn't requeued itself yet.
2. cancel_delayed_work_sync() doesn't do anything as it's not pending.
3. flush_workqueue() starts and waits for the running instance.
4. The running instance requeues itself but this isn't included in the
scope of the above flush_workqueue().
5. flush_workqueue() returns when the work item is finished (but it's
still queued).
Thanks.
--
tejun
On Thu, Dec 17 2015 at 10:50am -0500,
Tejun Heo <[email protected]> wrote:
> Hello, Nikolay.
>
> On Thu, Dec 17, 2015 at 05:43:12PM +0200, Nikolay Borisov wrote:
> > Right, but my initial understanding was that when canceling the delayed
> > work and then issuing flush_workqueue would act the same way as if
> > cancel_delayed_work_sync is called wrt to this particular delayed item, no?
>
> Not necessarily. cancel_delayed_work() cancels whatever is currently
> pending. flush_workqueue() flushes whatever is pending and in flight
> at the time of invocation. Imagine the following scenario.
>
> 1. Work item is running but hasn't requeued itself yet.
>
> 2. cancel_delayed_work_sync() doesn't do anything as it's not pending.
Did you mean cancel_delayed_work()?
> 3. flush_workqueue() starts and waits for the running instance.
>
> 4. The running instance requeues itself but this isn't included in the
> scope of the above flush_workqueue().
>
> 5. flush_workqueue() returns when the work item is finished (but it's
> still queued).
Hmm, the comment above cancel_delayed_work() is pretty misleading then:
* Note:
* The work callback function may still be running on return, unless
* it returns %true and the work doesn't re-arm itself. Explicitly flush or
* use cancel_delayed_work_sync() to wait on it.
Given dm-thin.c:pool_postsuspend() does:
cancel_delayed_work(&pool->waker);
cancel_delayed_work(&pool->no_space_timeout);
flush_workqueue(pool->wq);
I wouldn't have thought cancel_delayed_work_sync() was needed.
On Sat, Dec 19, 2015 at 03:34:45PM +0200, Nikolay Borisov wrote:
> Ping as Tejun might have missed this email. I'm also interested in knowing
> the logic behind the comment.
Didn't I already reply to that?
http://thread.gmane.org/gmane.linux.kernel/2104051
--
tejun
On Mon, Dec 21, 2015 at 4:44 PM, Tejun Heo <[email protected]> wrote:
> On Sat, Dec 19, 2015 at 03:34:45PM +0200, Nikolay Borisov wrote:
>> Ping as Tejun might have missed this email. I'm also interested in knowing
>> the logic behind the comment.
>
> Didn't I already reply to that?
>
> http://thread.gmane.org/gmane.linux.kernel/2104051
Oops, wrong link.
http://thread.gmane.org/gmane.linux.kernel/2104051/focus=2110844
--
tejun