Date: Fri, 11 Dec 2015 20:00:29 +0200
Subject: Re: corruption causing crash in __queue_work
From: Nikolay Borisov
To: Tejun Heo
Cc: Nikolay Borisov, "Linux-Kernel@Vger. Kernel. Org", SiteGround Operations, Alasdair Kergon, Mike Snitzer, device-mapper development

On Fri, Dec 11, 2015 at 7:08 PM, Tejun Heo wrote:
> Hello, Nikolay.
>
> On Fri, Dec 11, 2015 at 05:57:22PM +0200, Nikolay Borisov wrote:
>> So I had a server with the patch just crash on me.
>>
>> Here is what the queue looks like:
>>
>> crash> struct workqueue_struct 0xffff8802420a4a00
>> struct workqueue_struct {
>>   pwqs = {
>>     next = 0xffff8802420a4c00,
>>     prev = 0xffff8802420a4a00
>
> Hmmm... pwq list is already corrupt.  ->prev is terminated but ->next
> isn't.
>
>>   },
>>   list = {
>>     next = 0xffff880351f9b210,
>>     prev = 0xdead000000200200
>
> Followed by 0xdead000000200200 which is likely from
> CONFIG_ILLEGAL_POINTER_VALUE.
>
> ...
>>   name = "dm-thin\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
>>   rcu = {
>>     next = 0xffff8802531c4c20,
>>     func = 0xffffffff810692e0
>
> and call_rcu_sched() already called.  The workqueue has already been
> destroyed.
>
>>   },
>>   flags = 131082,
>>   cpu_pwqs = 0x0,
>>   numa_pwq_tbl = 0xffff8802420a4b10
>> }
>>
>> crash> rd 0xffff8802420a4b10 2   (the machine has 2 NUMA nodes, hence
>> the '2' argument)
>> ffff8802420a4b10:  0000000000000000 0000000000000000   ................
>>
>> At the same time, searching for 0xffff8802420a4a00 in the debug output
>> shows nothing, IOW it seems the numa_pwq_tbl is never installed for
>> this workqueue:
>>
>> [root@smallvault8 ~]# grep 0xffff8802420a4a00 /var/log/messages
>>
>> Dumping all the logs from the dmesg contained in the vmcore image I
>> also find nothing, and when I do the following correlation:
>>
>> [root@smallvault8 ~]# grep \(null\) wq.log | wc -l
>> 1940
>> [root@smallvault8 ~]# wc -l wq.log
>> 1940 wq.log
>>
>> it seems the numa_pwq_tbl is only ever set on workqueue creation, i.e.
>> it is never re-assigned. So at this point it looks like there is a
>> situation where the wq attrs are not being applied at all.
>
> Hmmm... No idea why it didn't show up in the debug log, but the only
> way a workqueue could be in the above state is either it got
> explicitly destroyed or somehow pwq refcnting is messed up; in both
> cases it should have shown up in the log.
>
> cc'ing dm people.  Is there any chance dm-thin could be using a
> workqueue after destroying it?

In __pool_destroy in dm-thin.c I don't see a call to cancel_delayed_work
before destroying the workqueue. Is it possible that this is the cause?
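A minimal sketch of the kind of change I have in mind is below. The
delayed_work field names (pool->waker, pool->no_space_timeout) are my
assumption from reading dm-thin.c, so treat this as illustrative only:

    static void __pool_destroy(struct pool *pool)
    {
            /* ... existing teardown ... */

            /*
             * Make sure no delayed work is still pending (or can re-arm
             * itself) before the workqueue goes away; otherwise its timer
             * can fire after destroy_workqueue() and __queue_work() ends
             * up walking freed/poisoned pwq state.
             */
            cancel_delayed_work_sync(&pool->waker);
            cancel_delayed_work_sync(&pool->no_space_timeout);

            if (pool->wq)
                    destroy_workqueue(pool->wq);

            /* ... rest of teardown ... */
    }

If that race is possible, it would at least be consistent with the
poisoned pwq list and the destroyed-looking workqueue in the dump above,
though I haven't verified it.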
> Thanks.
>
> --
> tejun