Date: Wed, 12 Dec 2018 09:49:03 -0800
From: Liu Bo
To: Michal Hocko
Cc: Andrew Morton, "Kirill A. Shutemov", Jan Kara, Dave Chinner,
    Theodore Ts'o, Johannes Weiner, Vladimir Davydov, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, LKML, Michal Hocko
Subject: Re: [PATCH v2] mm, memcg: fix reclaim deadlock with writeback
Message-ID: <20181212174902.zaxfbebwmd7hjqh7@US-160370MP2.local>
Reply-To: bo.liu@linux.alibaba.com
References: <20181211132645.31053-1-mhocko@kernel.org> <20181212155055.1269-1-mhocko@kernel.org>
In-Reply-To: <20181212155055.1269-1-mhocko@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Dec 12, 2018 at 04:50:55PM +0100, Michal Hocko wrote:
> From: Michal Hocko
>
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [] wait_on_page_bit+0x82/0xa0
> [] shrink_page_list+0x907/0x960
> [] shrink_inactive_list+0x2c7/0x680
> [] shrink_node_memcg+0x404/0x830
> [] shrink_node+0xd8/0x300
> [] do_try_to_free_pages+0x10d/0x330
> [] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [] try_charge+0x14d/0x720
> [] memcg_kmem_charge_memcg+0x3c/0xa0
> [] memcg_kmem_charge+0x7e/0xd0
> [] __alloc_pages_nodemask+0x178/0x260
> [] alloc_pages_current+0x95/0x140
> [] pte_alloc_one+0x17/0x40
> [] __pte_alloc+0x1e/0x110
> [] alloc_set_pte+0x5fe/0xc20
> [] do_fault+0x103/0x970
> [] handle_mm_fault+0x61e/0xd10
> [] __do_page_fault+0x252/0x4d0
> [] do_page_fault+0x30/0x80
> [] page_fault+0x28/0x30
> [] 0xffffffffffffffff
>
> task2:
> [] __lock_page+0x86/0xa0
> [] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [] ext4_writepages+0x479/0xd60
> [] do_writepages+0x1e/0x30
> [] __writeback_single_inode+0x45/0x320
> [] writeback_sb_inodes+0x272/0x600
> [] __writeback_inodes_wb+0x92/0xc0
> [] wb_writeback+0x268/0x300
> [] wb_workfn+0xb4/0x390
> [] process_one_work+0x189/0x420
> [] worker_thread+0x4e/0x4b0
> [] kthread+0xe6/0x100
> [] ret_from_fork+0x41/0x50
> [] 0xffffffffffffffff
>
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
> : bit of the page which task1 has locked.
>
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock.

Thanks for the patch, Michal.

Could you please append the following (quoted from your reply in the other
thread)? It'd be much easier for reviewers to pick up what was happening.

-----------------------------------------------------------------
lock_page(B)
SetPageWriteback(B)
unlock_page(B)
lock_page(A)
                                        lock_page(A)
                                        pte_alloc_one
                                          shrink_page_list
                                            wait_on_page_writeback(B)
SetPageWriteback(A)
unlock_page(A)

# flush A, B to clear the writeback
-----------------------------------------------------------------

thanks,
-liubo

>
> Waiting for the writeback in legacy memcg controller is a workaround
> for premature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We have that handy
> infrastructure for that already so simply reuse the fault-around pattern
> which already does this.
>
> Reported-and-Debugged-by: Liu Bo
> Signed-off-by: Michal Hocko
> ---
>  mm/memory.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 4ad2d293ddc2..bb78e90a9b70 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret;
>
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}
> +
>  	ret = vma->vm_ops->fault(vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>  			    VM_FAULT_DONE_COW)))
> --
> 2.19.2
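
For readers who want to reason about the ordering without reading the
kernel paths, the interleaving in this thread can be replayed in a small
userspace model. This is only a sketch and not kernel code: the names
model_page, replay_deadlocks, FAULT_TASK and WRITEBACK_TASK are made up
for illustration. It assumes, as described in the thread, that the
writeback task can clear the writeback bit on B only after it has also
locked A and submitted both pages together.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task ids for this model; not kernel identifiers. */
enum { NO_TASK = -1, FAULT_TASK = 0, WRITEBACK_TASK = 1 };

/* Minimal stand-in for the two page bits that matter here. */
struct model_page {
	int lock_owner;  /* task holding the page lock, or NO_TASK */
	bool writeback;  /* true until the page has been submitted */
};

bool replay_deadlocks(bool prealloc_before_lock);

/*
 * Replay the interleaving from the diagram. Returns true when both
 * tasks end up waiting on each other (the ABBA case).
 */
bool replay_deadlocks(bool prealloc_before_lock)
{
	struct model_page a = { NO_TASK, false };
	struct model_page b = { NO_TASK, false };
	int waits_for[2] = { NO_TASK, NO_TASK };

	/* Writeback: lock_page(B); SetPageWriteback(B); unlock_page(B) */
	b.lock_owner = WRITEBACK_TASK;
	b.writeback = true;
	b.lock_owner = NO_TASK;

	/* Unfixed order: the fault path locks A before it allocates. */
	if (!prealloc_before_lock)
		a.lock_owner = FAULT_TASK;

	/* The allocation hits the limit; reclaim waits for B's writeback. */
	if (b.writeback)
		waits_for[FAULT_TASK] = WRITEBACK_TASK;

	if (a.lock_owner == FAULT_TASK) {
		/* Writeback blocks in lock_page(A); nobody can progress. */
		waits_for[WRITEBACK_TASK] = FAULT_TASK;
	} else {
		/* Writeback takes A, marks it, unlocks, then flushes A and B. */
		a.lock_owner = WRITEBACK_TASK;
		a.writeback = true;
		a.lock_owner = NO_TASK;
		a.writeback = false;
		b.writeback = false;
		assert(!a.writeback && !b.writeback);

		/* Reclaim is released; the fault path may lock A now. */
		waits_for[FAULT_TASK] = NO_TASK;
		a.lock_owner = FAULT_TASK;
	}

	return waits_for[FAULT_TASK] == WRITEBACK_TASK &&
	       waits_for[WRITEBACK_TASK] == FAULT_TASK;
}
```

Calling replay_deadlocks(false) models the current order (lock the page,
then allocate the page table), and replay_deadlocks(true) models the
order the patch introduces (allocate first, then lock). The model covers
this one interleaving only and says nothing about other reclaim paths.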