From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Dave Chinner, Jan Kara, Nicholas Piggin, Andrew Morton, Linus Torvalds, Sasha Levin
Subject: [PATCH 4.19 148/306] mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
Date: Wed, 27 Nov 2019 21:29:58 +0100
Message-Id: <20191127203126.269615215@linuxfoundation.org>
In-Reply-To: <20191127203114.766709977@linuxfoundation.org>
References: <20191127203114.766709977@linuxfoundation.org>

From: Dave Chinner

[ Upstream commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7 ]

We've recently seen a workload on XFS filesystems with a repeatable
deadlock between background writeback and a multi-process application
doing concurrent writes and fsyncs to a small range of a file.

range_cyclic
writeback                 Process 1                 Process 2

xfs_vm_writepages
  write_cache_pages
    writeback_index = 2
    cycled = 0
    ....
    find page 2 dirty
    lock page 2
    ->writepage
      page 2 writeback
      page 2 clean
      page 2 added to bio
    no more pages
                          write()
                          locks page 1
                          dirties page 1
                          locks page 2
                          dirties page 2
                          fsync()
                          ....
                          xfs_vm_writepages
                          write_cache_pages
                            start index 0
                            find page 1 towrite
                            lock page 1
                            ->writepage
                              page 1 writeback
                              page 1 clean
                              page 1 added to bio
                            find page 2 towrite
                            lock page 2
                            page 2 is writeback
                            <blocks>
                                                    write()
                                                    locks page 1
                                                    dirties page 1
                                                    fsync()
                                                    ....
                                                    xfs_vm_writepages
                                                    write_cache_pages
                                                      start index 0

    !done && !cycled
    sets index to 0, restarts lookup
    find page 1 dirty
                                                    find page 1 towrite
                                                    lock page 1
                                                    page 1 is writeback
                                                    <blocks>

    lock page 1
    <blocks>

DEADLOCK because:

 - process 1 needs page 2 writeback to complete to make enough progress
   to issue IO pending for page 1
 - writeback needs page 1 writeback to complete so process 2 can
   progress and unlock the page it is blocked on, then it can issue the
   IO pending for page 2
 - process 2 can't make progress until process 1 issues IO for page 1

The underlying cause of the problem here is that range_cyclic writeback
is processing pages in descending index order as we hold higher index
pages in a structure controlled from above write_cache_pages().  The
write_cache_pages() caller needs to be able to submit these pages for
IO before write_cache_pages() restarts writeback at mapping index 0 to
avoid wcp inverting the page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no
private context held across write_cache_pages() - filesystems using
this infrastructure always submit pages in ->writepage immediately, so
there is no problem with range_cyclic going back to mapping index 0.

However:

	mpage_writepages() has a private bio context,
	exofs_writepages() has page_collect,
	fuse_writepages() has fuse_fill_wb_data,
	nfs_writepages() has nfs_pageio_descriptor,
	xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under
writeback in their private structures until write_cache_pages()
returns, and hence they are all susceptible to this deadlock; the
sketch below illustrates the pattern they share.

Also worth noting is that ext4 has its own bastardised version of
write_cache_pages() and so it /may/ have an equivalent deadlock.  I
looked at the code long enough to understand that it has a similar
retry loop for range_cyclic writeback reaching the end of the file, and
then promptly ran away before my eyes bled too much.  I'll leave it to
the ext4 developers to determine whether their code actually has this
deadlock and how to fix it if it does.
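As a concrete illustration of that aggregation pattern, here is a
minimal, hypothetical sketch of such a ->writepages implementation.
The my_* names and the elided bio-building details are made up for the
example; write_cache_pages(), clear_page_dirty_for_io(),
set_page_writeback(), unlock_page() and submit_bio() are the real
kernel interfaces involved:

#include <linux/bio.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/* Illustrative private context, analogous to xfs_writepage_ctx etc. */
struct my_writepage_ctx {
	struct bio *bio;	/* aggregates pages held under writeback */
};

static int my_writepage(struct page *page, struct writeback_control *wbc,
			void *data)
{
	struct my_writepage_ctx *ctx = data;

	if (!clear_page_dirty_for_io(page)) {
		unlock_page(page);
		return 0;
	}
	set_page_writeback(page);
	/* ... add the page to ctx->bio here; IO is NOT submitted yet ... */
	unlock_page(page);
	return 0;
}

static int my_writepages(struct address_space *mapping,
			 struct writeback_control *wbc)
{
	struct my_writepage_ctx ctx = { };
	int ret;

	/*
	 * Every page added to ctx.bio stays in PageWriteback until the
	 * submission below.  If write_cache_pages() wraps back to index 0
	 * internally, it can block waiting for writeback on one of those
	 * pages before we ever reach submit_bio() - the deadlock above.
	 */
	ret = write_cache_pages(mapping, wbc, my_writepage, &ctx);
	if (ctx.bio)
		submit_bio(ctx.bio);
	return ret;
}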
There are a few ways I can see to avoid this deadlock.  There are
probably more, but these are the first I've thought of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from writeback
   index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to indicate
    range_cyclic has hit EOF.  writepages implementations can then
    flush the current context and call wcp again to continue, i.e. lift
    the retry into the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it
   skips pages it can't lock without blocking.  It will already do this
   for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid blocking
    as per pages under writeback

I don't think #1 is an option - range_cyclic prevents frequently
dirtied lower file offsets from starving background writeback of rarely
touched higher file offsets.

#2 is simple, and I don't think it will have any impact on performance,
as going back to the start of the file implies an immediate seek.
We'll have exactly the same number of seeks if we switch writeback to
another inode and then come back to this one later and restart from
index 0.

#2a is pretty much "status quo without the deadlock".  Moving the retry
loop up into the wcp caller means we can issue IO on the pending pages
before calling wcp again, and so avoid locking or waiting on pages in
the wrong order.  I'm not convinced we need to do this given that we
get the same thing from #2 on the next writeback call from the
writeback infrastructure.

#3 is really just a band-aid - it doesn't fix the access/wait inversion
problem, it just prevents it from becoming a deadlock situation.  I'd
prefer we fix the inversion, not sweep it under the carpet like this.

#3a is really an optimisation that just so happens to include the
band-aid fix of #3.

So it seems that the simplest way to fix this issue is to implement
solution #2; for contrast, a rough sketch of option #2a's lifted retry
follows below.
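For comparison, here is a rough and equally hypothetical sketch of
option #2a, reusing the illustrative my_* helpers from the sketch
above.  It assumes a modified write_cache_pages() that returns -EAGAIN
when range_cyclic writeback hits EOF with work remaining; this is not
what the patch below implements:

static int my_writepages_2a(struct address_space *mapping,
			    struct writeback_control *wbc)
{
	struct my_writepage_ctx ctx = { };
	int ret;

	do {
		ret = write_cache_pages(mapping, wbc, my_writepage, &ctx);
		/* flush aggregated IO before any retry from index 0 */
		if (ctx.bio) {
			submit_bio(ctx.bio);
			ctx.bio = NULL;
		}
	} while (ret == -EAGAIN);	/* hypothetical EOF indication */
	return ret;
}

The retry only wraps after the pending IO has been issued, so no page
is held locked or under writeback across the wrap and the lock ordering
stays ascending.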
Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
Signed-off-by: Dave Chinner
Reviewed-by: Jan Kara
Cc: Nicholas Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
---
 mm/page-writeback.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ea4fd3af3b4bd..43df0c52e1ccb 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2149,6 +2149,13 @@ EXPORT_SYMBOL(tag_pages_for_writeback);
  * not miss some pages (e.g., because some other process has cleared TOWRITE
  * tag we set). The rule we follow is that TOWRITE tag can be cleared only
  * by the process clearing the DIRTY tag (and submitting the page for IO).
+ *
+ * To avoid deadlocks between range_cyclic writeback and callers that hold
+ * pages in PageWriteback to aggregate IO until write_cache_pages() returns,
+ * we do not loop back to the start of the file. Doing so causes a page
+ * lock/page writeback access order inversion - we should only ever lock
+ * multiple pages in ascending page->index order, and looping back to the start
+ * of the file violates that rule and causes deadlocks.
  */
 int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
@@ -2163,7 +2170,6 @@ int write_cache_pages(struct address_space *mapping,
 	pgoff_t index;
 	pgoff_t end;		/* Inclusive */
 	pgoff_t done_index;
-	int cycled;
 	int range_whole = 0;
 	int tag;
 
@@ -2171,23 +2177,17 @@ int write_cache_pages(struct address_space *mapping,
 	if (wbc->range_cyclic) {
 		writeback_index = mapping->writeback_index; /* prev offset */
 		index = writeback_index;
-		if (index == 0)
-			cycled = 1;
-		else
-			cycled = 0;
 		end = -1;
 	} else {
 		index = wbc->range_start >> PAGE_SHIFT;
 		end = wbc->range_end >> PAGE_SHIFT;
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
-		cycled = 1; /* ignore range_cyclic tests */
 	}
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
-retry:
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
@@ -2279,17 +2279,14 @@ int write_cache_pages(struct address_space *mapping,
 		pagevec_release(&pvec);
 		cond_resched();
 	}
-	if (!cycled && !done) {
-		/*
-		 * range_cyclic:
-		 * We hit the last page and there is more work to be done: wrap
-		 * back to the start of the file
-		 */
-		cycled = 1;
-		index = 0;
-		end = writeback_index - 1;
-		goto retry;
-	}
+
+	/*
+	 * If we hit the last page and there is more work to be done: wrap
+	 * back the index back to the start of the file for the next
+	 * time we are called.
+	 */
+	if (wbc->range_cyclic && !done)
+		done_index = 0;
 	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
 		mapping->writeback_index = done_index;
 
-- 
2.20.1