From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Michal Hocko, Liu Bo,
 "Kirill A. Shutemov", Johannes Weiner, Jan Kara, Dave Chinner,
 Theodore Tso, Vladimir Davydov, Shakeel Butt, Andrew Morton,
 Linus Torvalds
Subject: [PATCH 4.20 30/57] mm, memcg: fix reclaim deadlock with writeback
Date: Tue, 15 Jan 2019 17:36:11 +0100
Message-Id: <20190115154912.374956471@linuxfoundation.org>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20190115154910.734892368@linuxfoundation.org>
References: <20190115154910.734892368@linuxfoundation.org>
User-Agent: quilt/0.65
X-stable: review
X-Patchwork-Hint: ignore
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

4.20-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Michal Hocko

commit 63f3655f950186752236bb88a22f8252c11ce394 upstream.

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback:

  task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

  task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

He adds
 "task1 is waiting for the PageWriteback bit of the page that task2 has
  collected in mpd->io_submit->io_bio, and task2 is waiting for the
  LOCKED bit of the page which task1 has locked"

More precisely, task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg.  That in turn hits a memory
limit reclaim, and the memcg reclaim for the legacy controller is waiting
on the writeback, but that is never going to finish because the writeback
itself is waiting for the page locked in the #PF path.  So this is
essentially an ABBA deadlock (fault path on the left, writeback on the
right):

                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
  lock_page(B)
                                        lock_page(B)
  pte_alloc_one
    shrink_page_list
      wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

This accumulating of more pages to flush is used by several filesystems
to generate more optimal IO patterns.  Waiting for the writeback in the
legacy memcg controller is a workaround for premature OOM killer
invocations because there is no dirty IO throttling available for the
controller.  There is no easy way around that, unfortunately.  Therefore
fix this specific issue by pre-allocating the page table outside of the
page lock.  We already have handy infrastructure for that, so simply
reuse the fault-around pattern which already does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
made while a fs page is locked, but they should be really rare.  I am not
aware of a better solution, unfortunately.

[akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: enhance comment, per Johannes]
  Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko
Reported-by: Liu Bo
Debugged-by: Liu Bo
Acked-by: Kirill A. Shutemov
Acked-by: Johannes Weiner
Reviewed-by: Liu Bo
Cc: Jan Kara
Cc: Dave Chinner
Cc: Theodore Ts'o
Cc: Vladimir Davydov
Cc: Shakeel Butt
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

---
 mm/memory.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,29 @@ static vm_fault_t __do_fault(struct vm_f
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback:
+	 *				lock_page(A)
+	 *				SetPageWriteback(A)
+	 *				unlock_page(A)
+	 * lock_page(B)
+	 *				lock_page(B)
+	 * pte_alloc_one
+	 *   shrink_page_list
+	 *     wait_on_page_writeback(A)
+	 *				SetPageWriteback(B)
+	 *				unlock_page(B)
+	 *				# flush A, B to clear the writeback
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm,
+						  vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))