Return-Path:
Received: from mail-qk1-f196.google.com ([209.85.222.196]:42054 "EHLO
	mail-qk1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1730228AbeKADsw (ORCPT );
	Wed, 31 Oct 2018 23:48:52 -0400
Received: by mail-qk1-f196.google.com with SMTP id u68so6812487qkg.9
	for ; Wed, 31 Oct 2018 11:49:35 -0700 (PDT)
MIME-Version: 1.0
References: <1540858969-75803-1-git-send-email-bo.liu@linux.alibaba.com>
In-Reply-To: <1540858969-75803-1-git-send-email-bo.liu@linux.alibaba.com>
From: Liu Bo
Date: Wed, 31 Oct 2018 11:49:23 -0700
Message-ID:
Subject: Re: [PATCH RFC] Ext4: fix deadlock on dirty pages between fault and writeback
To: Liu Bo, "Theodore Y. Ts'o"
Cc: linux-ext4@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Sender: linux-ext4-owner@vger.kernel.org
List-ID:

Hi Ted,

Could you please take a look at this? (unfortunately I failed to come up
with a reproducer as it mixed 'short of memory, writeback and fault'.)

thanks,
liubo

On Mon, Oct 29, 2018 at 5:26 PM Liu Bo wrote:
>
> mpage_prepare_extent_to_map() tries to build up a large bio to stuff down
> the pipe. But if it needs to wait for a page lock, it needs to make sure
> and send down any pending writes so we don't deadlock with anyone who has
> the page lock and is waiting for writeback of things inside the bio.
>
> The related lock stacks are shown below,
>
> task1:
> [] wait_on_page_bit+0x82/0xa0
> [] shrink_page_list+0x907/0x960
> [] shrink_inactive_list+0x2c7/0x680
> [] shrink_node_memcg+0x404/0x830
> [] shrink_node+0xd8/0x300
> [] do_try_to_free_pages+0x10d/0x330
> [] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [] try_charge+0x14d/0x720
> [] memcg_kmem_charge_memcg+0x3c/0xa0
> [] memcg_kmem_charge+0x7e/0xd0
> [] __alloc_pages_nodemask+0x178/0x260
> [] alloc_pages_current+0x95/0x140
> [] pte_alloc_one+0x17/0x40
> [] __pte_alloc+0x1e/0x110
> [] alloc_set_pte+0x5fe/0xc20
> [] do_fault+0x103/0x970
> [] handle_mm_fault+0x61e/0xd10
> [] __do_page_fault+0x252/0x4d0
> [] do_page_fault+0x30/0x80
> [] page_fault+0x28/0x30
> [] 0xffffffffffffffff
>
> task2:
> [] __lock_page+0x86/0xa0
> [] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [] ext4_writepages+0x479/0xd60
> [] do_writepages+0x1e/0x30
> [] __writeback_single_inode+0x45/0x320
> [] writeback_sb_inodes+0x272/0x600
> [] __writeback_inodes_wb+0x92/0xc0
> [] wb_writeback+0x268/0x300
> [] wb_workfn+0xb4/0x390
> [] process_one_work+0x189/0x420
> [] worker_thread+0x4e/0x4b0
> [] kthread+0xe6/0x100
> [] ret_from_fork+0x41/0x50
> [] 0xffffffffffffffff
>
> task1 is waiting for the PageWriteback bit of the page that task2 has
> collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
> bit of the page which task1 has locked.
>
> It seems that this deadlock only happens when those pages are mapped pages,
> so that mpage_prepare_extent_to_map() can have pages queued in io_bio
> while it waits to lock the subsequent page.
>
> Signed-off-by: Liu Bo
> ---
>
> Only did build test.
>
>  fs/ext4/inode.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c3d9a42c561e..becbfb292bf0 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2681,7 +2681,26 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
>  			if (mpd->map.m_len > 0 && mpd->next_page != page->index)
>  				goto out;
>
> -			lock_page(page);
> +			if (!trylock_page(page)) {
> +				/*
> +				 * A rare race may happen between fault and
> +				 * writeback,
> +				 *
> +				 * 1. fault may have raced in and locked this
> +				 * page ahead of us, and if fault needs to
> +				 * reclaim memory via shrink_page_list(), it may
> +				 * also wait on the writeback pages we've
> +				 * collected in our mpd->io_submit.
> +				 *
> +				 * 2. We have to submit mpd->io_submit->io_bio
> +				 * to let memory reclaim make progress in order
> +				 * to avoid the deadlock between fault and
> +				 * ourselves(writeback).
> +				 */
> +				ext4_io_submit(&mpd->io_submit);
> +				lock_page(page);
> +			}
> +
>  			/*
>  			 * If the page is no longer dirty, or its mapping no
>  			 * longer corresponds to inode we are writing (which
> --
> 1.8.3.1
>
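
Since there is no kernel reproducer, the locking pattern in the hunk above
(try the page lock, and if that fails submit the pending bio before sleeping
on the lock) can at least be illustrated in user space. The following is only
a rough sketch with pthreads, not kernel code: batch_sim, page_sim,
batch_submit(), lock_page_flushing_batch(), fault_sim() and run_demo() are
all hypothetical names made up for this illustration. A mutex stands in for
the page lock, and a counter stands in for pages queued in
mpd->io_submit->io_bio.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the bio accumulated in mpd->io_submit. */
struct batch_sim {
	pthread_mutex_t mu;
	pthread_cond_t drained;	/* broadcast when pending drops to 0 */
	int pending;		/* pages queued but not yet submitted */
	int submits;		/* number of flushes performed */
};

/* Hypothetical stand-in for a page, its lock, and a startup handshake. */
struct page_sim {
	pthread_mutex_t lock;	/* models the page lock */
	pthread_mutex_t mu;	/* protects 'held' */
	pthread_cond_t cv;
	bool held;		/* set once the fault side owns the lock */
	struct batch_sim *batch;
};

/* Models ext4_io_submit(): push out everything queued so far. */
static void batch_submit(struct batch_sim *b)
{
	pthread_mutex_lock(&b->mu);
	b->pending = 0;
	b->submits++;
	pthread_cond_broadcast(&b->drained);
	pthread_mutex_unlock(&b->mu);
}

/* Writeback side: try the lock first; flush the batch before sleeping on it. */
static void lock_page_flushing_batch(struct page_sim *p)
{
	int rc = pthread_mutex_trylock(&p->lock);

	if (rc == EBUSY) {
		batch_submit(p->batch);
		rc = pthread_mutex_lock(&p->lock);
	}
	assert(rc == 0);
}

/* Fault side: holds the page lock until the queued writes are submitted. */
static void *fault_sim(void *arg)
{
	struct page_sim *p = arg;

	pthread_mutex_lock(&p->lock);
	pthread_mutex_lock(&p->mu);
	p->held = true;
	pthread_cond_signal(&p->cv);
	pthread_mutex_unlock(&p->mu);

	/* Models reclaim waiting for writeback of the queued pages. */
	pthread_mutex_lock(&p->batch->mu);
	while (p->batch->pending > 0)
		pthread_cond_wait(&p->batch->drained, &p->batch->mu);
	pthread_mutex_unlock(&p->batch->mu);

	pthread_mutex_unlock(&p->lock);
	return NULL;
}

/* Runs one scenario and returns how many times the batch was flushed. */
static int run_demo(bool contended, int queued)
{
	struct batch_sim b = { .pending = queued, .submits = 0 };
	struct page_sim p = { .held = false, .batch = &b };
	pthread_t t;

	pthread_mutex_init(&b.mu, NULL);
	pthread_cond_init(&b.drained, NULL);
	pthread_mutex_init(&p.lock, NULL);
	pthread_mutex_init(&p.mu, NULL);
	pthread_cond_init(&p.cv, NULL);

	if (contended) {
		pthread_create(&t, NULL, fault_sim, &p);
		/* Wait until the fault side really owns the page lock. */
		pthread_mutex_lock(&p.mu);
		while (!p.held)
			pthread_cond_wait(&p.cv, &p.mu);
		pthread_mutex_unlock(&p.mu);
	}

	lock_page_flushing_batch(&p);
	pthread_mutex_unlock(&p.lock);

	if (contended)
		pthread_join(t, NULL);
	return b.submits;
}
```

Calling run_demo(true, 3) goes through the contended path and reports one
flush; run_demo(false, 3) takes the trylock fast path and reports none. If
lock_page_flushing_batch() used a plain blocking lock without the flush, the
contended case would hang, which is the user-space analogue of the
task1/task2 stacks quoted above.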