Date: Mon, 14 Dec 2015 10:31:47 -0500
From: Mike Snitzer
To: Nikolay Borisov
Cc: Tejun Heo, "Linux-Kernel@Vger. Kernel. Org", SiteGround Operations, Alasdair Kergon, dm-devel@redhat.com
Subject: Re: corruption causing crash in __queue_work
Message-ID: <20151214153147.GA14957@redhat.com>
In-Reply-To: <566E80AE.7020502@kyup.com>

On Mon, Dec 14 2015 at 3:41P -0500,
Nikolay Borisov wrote:

> Had another poke at the backtrace that is produced, and here is what
> the delayed_work looks like:
>
> crash> struct delayed_work ffff88036772c8c0
> struct delayed_work {
>   work = {
>     data = {
>       counter = 1537
>     },
>     entry = {
>       next = 0xffff88036772c8c8,
>       prev = 0xffff88036772c8c8
>     },
>     func = 0xffffffffa0211a30
>   },
>   timer = {
>     entry = {
>       next = 0x0,
>       prev = 0xdead000000200200
>     },
>     expires = 4349463655,
>     base = 0xffff88047fd2d602,
>     function = 0xffffffff8106da40,
>     data = 18446612146934696128,
>     slack = -1,
>     start_pid = -1,
>     start_site = 0x0,
>     start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
>   },
>   wq = 0xffff88030cf65400,
>   cpu = 21
> }
>
> From this it seems that the timer has also been cancelled or has
> expired, judging by the values in timer->entry. But then again, in
> dm-thin the pool is first suspended, which implies the following
> functions were called:
>
> cancel_delayed_work(&pool->waker);
> cancel_delayed_work(&pool->no_space_timeout);
> flush_workqueue(pool->wq);
>
> so at that point dm-thin's workqueue should be empty and it shouldn't
> be possible to queue any more delayed work. But the crashdump clearly
> shows the opposite happening. So far all of this points to a race
> condition, and inserting some sleeps after umount and after
> vgchange -Kan (the command that deactivates the volume group and
> suspends it, so cancel_delayed_work is invoked) seems to reduce the
> frequency of crashes, though it doesn't eliminate them.

'vgchange -Kan' doesn't suspend the pool before it destroys the device.
So the cancel_delayed_work()s you referenced aren't applicable.

Can you try this patch?

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 63903a5..b201d887 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2750,8 +2750,11 @@ static void __pool_destroy(struct pool *pool)
 	dm_bio_prison_destroy(pool->prison);
 	dm_kcopyd_client_destroy(pool->copier);
 
-	if (pool->wq)
+	if (pool->wq) {
+		cancel_delayed_work(&pool->waker);
+		cancel_delayed_work(&pool->no_space_timeout);
 		destroy_workqueue(pool->wq);
+	}
 
 	if (pool->next_mapping)
 		mempool_free(pool->next_mapping, pool->mapping_pool);