From: Nikolay Borisov
To: Tejun Heo
Cc: "Linux-Kernel@Vger. Kernel. Org", SiteGround Operations
Subject: Re: corruption causing crash in __queue_work
Date: Wed, 9 Dec 2015 18:23:15 +0200
Message-ID: <56685573.1020805@kyup.com>
In-Reply-To: <20151209160803.GK30240@mtj.duckdns.org>
References: <566819D8.5090804@kyup.com> <20151209160803.GK30240@mtj.duckdns.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 12/09/2015 06:08 PM, Tejun Heo wrote:
> Hello, Nikolay.
>
> On Wed, Dec 09, 2015 at 02:08:56PM +0200, Nikolay Borisov wrote:
>> [73309.529940] BUG: unable to handle kernel NULL pointer dereference at (null)
>> [73309.530238] IP: [] __queue_work+0xb3/0x390
> ...
>> [73309.537319]
>> [73309.537373] [] ? __queue_work+0x390/0x390
>> [73309.537714] [] delayed_work_timer_fn+0x18/0x20
>> [73309.537891] [] call_timer_fn+0x47/0x110
>> [73309.538071] [] ? tick_sched_timer+0x52/0xa0
>> [73309.538249] [] run_timer_softirq+0x17f/0x2b0
>> [73309.538425] [] ? __queue_work+0x390/0x390
>> [73309.538604] [] __do_softirq+0xe0/0x290
>> [73309.538778] [] irq_exit+0xa6/0xb0
>> [73309.538952] [] smp_apic_timer_interrupt+0x4a/0x59
>> [73309.539128] [] apic_timer_interrupt+0x6b/0x70
> ...
>> The gist is that this fails on the following line:
>>
>> if (last_pool && last_pool != pwq->pool) {
>
> That's new.
>
>> The pointer 'pwq' is bogus: it is loaded into %rdx, which in this
>> case holds 0000000000000000. Looking at the function's source, pwq
>> would be loaded via per_cpu_ptr() only if the
>> if (!(wq->flags & WQ_UNBOUND)) check were true; here it evaluates
>> to false, so pwq is instead the result of
>> unbound_pwq_by_node(wq, cpu_to_node(cpu));
>>
>> Here are the flags of the workqueue:
>>
>> crash> struct workqueue_struct.flags 0xffff8803df464c00
>>   flags = 131082
>
> That's an ordered unbound workqueue w/ a rescuer.

So the name of the queue is 'dm-thin'. Looking at the dm-thin
sources, the only place where a workqueue is allocated is here:

pool->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM);

But in this case I guess the caller can't be the culprit? I'm biased
wrt dm-thin because I've hit multiple bugs in it in the past few
months.

>
>> (0xffff8803df464c00 is indeed the pointer to the workqueue struct,
>> so the flags aren't bogus.)
>>
>> Reading the numa_pwq_tbl, it seems to be uninitialised:
>>
>> crash> struct workqueue_struct.numa_pwq_tbl 0xffff8803df464c00
>>   numa_pwq_tbl = 0xffff8803df464d10
>> crash> rd -64 0xffff8803df464d10 3
>> ffff8803df464d10:  0000000000000000 0000000000000000   ................
>> ffff8803df464d20:  0000000000000000                    ........
>>
>> The machine where the crash occurred has a single NUMA node, so at
>> the very least I would have expected a valid pointer rather than a
>> NULL one.
>>
>> Also, this crash is not isolated: I have observed it on multiple
>> other nodes running vanilla 4.2.5/4.2.6 kernels.
>>
>> Any advice on how to debug this further?
>
> Adding printk or tracepoints at numa_pwq_tbl_install() to dump what's
> being installed would be helpful. It should at least tell us whether
> it's the table being corrupted by something else or workqueue failing
> to set it up correctly to begin with. How reproducible is the
> problem?

I think we are seeing this at least daily on at least one server (we
have multiple servers like that).
So adding printks would likely be the way to go. Is there anything in
particular you would be interested in knowing? I see RCU code around
the table, so this might be a tricky race condition.

>
> Thanks.
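A minimal way to act on the printk suggestion would be something along these lines against 4.2's kernel/workqueue.c (an untested sketch: the context lines are from memory and the hunk offsets are omitted):

```diff
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ static struct pool_workqueue *numa_pwq_tbl_install(...)
 	old_pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
 	rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+	pr_info("workqueue %s: node %d numa_pwq_tbl %p -> %p\n",
+		wq->name, node, old_pwq, pwq);
 	return old_pwq;
```

Logging every install should show whether the table is ever populated for the dm-thin queue and later clobbered, or never set up at all, which is the distinction Tejun asks about above.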