From: Michal Hocko
To: Andrew Morton, "Kirill A. Shutemov"
Cc: Liu Bo, Jan Kara, Dave Chinner, "Theodore Ts'o", Johannes Weiner,
    Vladimir Davydov, LKML, Shakeel Butt, Michal Hocko, Stable tree
Subject: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback
Date: Thu, 13 Dec 2018 10:22:21 +0100
Message-Id: <20181213092221.27270-1-mhocko@kernel.org>
In-Reply-To: <20181212155055.1269-1-mhocko@kernel.org>
References: <20181212155055.1269-1-mhocko@kernel.org>

From: Michal Hocko

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback

task1:
[] wait_on_page_bit+0x82/0xa0
[] shrink_page_list+0x907/0x960
[] shrink_inactive_list+0x2c7/0x680
[] shrink_node_memcg+0x404/0x830
[] shrink_node+0xd8/0x300
[] do_try_to_free_pages+0x10d/0x330
[] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[] try_charge+0x14d/0x720
[] memcg_kmem_charge_memcg+0x3c/0xa0
[] memcg_kmem_charge+0x7e/0xd0
[] __alloc_pages_nodemask+0x178/0x260
[] alloc_pages_current+0x95/0x140
[] pte_alloc_one+0x17/0x40
[] __pte_alloc+0x1e/0x110
[] alloc_set_pte+0x5fe/0xc20
[] do_fault+0x103/0x970
[] handle_mm_fault+0x61e/0xd10
[] __do_page_fault+0x252/0x4d0
[] do_page_fault+0x30/0x80
[] page_fault+0x28/0x30
[] 0xffffffffffffffff

task2:
[] __lock_page+0x86/0xa0
[] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[] ext4_writepages+0x479/0xd60
[] do_writepages+0x1e/0x30
[] __writeback_single_inode+0x45/0x320
[] writeback_sb_inodes+0x272/0x600
[] __writeback_inodes_wb+0x92/0xc0
[] wb_writeback+0x268/0x300
[] wb_workfn+0xb4/0x390
[] process_one_work+0x189/0x420
[] worker_thread+0x4e/0x4b0
[] kthread+0xe6/0x100
[] ret_from_fork+0x41/0x50
[] 0xffffffffffffffff

He adds
: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
: bit of the page which task1 has locked.

More precisely, task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg. That charge in turn hits
the memory limit reclaim, and the memcg reclaim for the legacy
controller waits on the writeback. The writeback is never going to
finish, however, because the writeback path itself is waiting for the
page locked in the #PF path. So this is essentially an ABBA deadlock:

task1 (#PF path)                        task2 (ext4 flusher)

                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
lock_page(B)
                                        lock_page(B)
pte_alloc_one
  shrink_page_list
    wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

This accumulation of more pages to flush is used by several filesystems
to generate more optimal IO patterns.

Waiting for the writeback in the legacy memcg controller is a workaround
for premature OOM killer invocations because there is no dirty IO
throttling available for that controller. There is no easy way around
that unfortunately. Therefore fix this specific issue by pre-allocating
the page table outside of the page lock. We already have handy
infrastructure for that, so simply reuse the fault-around pattern which
already does this.
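
To make the ordering above easier to poke at, here is a minimal userspace
analogue of the same ABBA dependency. It is only a sketch: a pthread
mutex stands in for the page lock, a flag plus condition variable for the
writeback bit, the sleeps merely order the events for the demonstration,
and every name in it is invented rather than taken from the kernel. Built
with cc -pthread, it is expected to hang in exactly the way the diagram
shows.

/* abba_sketch.c - illustrative userspace analogue of the reported deadlock */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct page {
	pthread_mutex_t lock;	/* stands in for the page lock */
	int writeback;		/* stands in for PageWriteback */
	pthread_cond_t wb_done;	/* signalled when the "IO" completes */
};

static struct page A = { PTHREAD_MUTEX_INITIALIZER, 0, PTHREAD_COND_INITIALIZER };
static struct page B = { PTHREAD_MUTEX_INITIALIZER, 0, PTHREAD_COND_INITIALIZER };

/* task2: the flusher collects A and B before submitting any IO */
static void *flusher(void *arg)
{
	(void)arg;

	pthread_mutex_lock(&A.lock);
	A.writeback = 1;			/* SetPageWriteback(A) */
	pthread_mutex_unlock(&A.lock);

	sleep(2);				/* let the fault path grab B first */
	pthread_mutex_lock(&B.lock);		/* blocks forever: task1 holds B */
	B.writeback = 1;
	pthread_mutex_unlock(&B.lock);

	/* "flush A, B to clear the writeback" - never reached */
	pthread_mutex_lock(&A.lock);
	A.writeback = 0;
	pthread_cond_broadcast(&A.wb_done);
	pthread_mutex_unlock(&A.lock);
	return NULL;
}

/* task1: the #PF path holds B while its "reclaim" waits for A's writeback */
static void *fault(void *arg)
{
	(void)arg;

	sleep(1);				/* make sure A is already "under writeback" */
	pthread_mutex_lock(&B.lock);		/* lock_page(B) */

	/* wait_on_page_writeback(A) with B still locked */
	pthread_mutex_lock(&A.lock);
	while (A.writeback)
		pthread_cond_wait(&A.wb_done, &A.lock);	/* waits forever */
	pthread_mutex_unlock(&A.lock);

	pthread_mutex_unlock(&B.lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t2, NULL, flusher, NULL);
	pthread_create(&t1, NULL, fault, NULL);

	fprintf(stderr, "both tasks are running; expect this to hang now\n");
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}

The hang goes away if fault() does its waiting before taking B's lock,
which is the ordering the patch below restores for the page table
allocation.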

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a locked fs page but they should be really rare. I am not
aware of a better solution unfortunately.

Reported-and-Debugged-by: Liu Bo
Cc: stable
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko
---
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..bb78e90a9b70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
--
2.19.2
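
For completeness, the shape of the fix can be shown outside of the kernel
as well: allocate anything that may block or recurse into reclaim before
taking the lock that the blocked path depends on. The sketch below is
purely illustrative; handle_fault() and slow_alloc() are invented names
and only the ordering mirrors the patch above.

/* prealloc_sketch.c - the shape of the fix, outside of the kernel
 * (build with cc -pthread prealloc_sketch.c)
 */
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

/* stands in for pte_alloc_one(): an allocation that may block (in the
 * kernel: recurse into memcg reclaim and wait for writeback) */
static void *slow_alloc(size_t size)
{
	return malloc(size);
}

static int handle_fault(void)
{
	/*
	 * Preallocate before taking page_lock: if the allocation has to
	 * wait for someone who in turn needs page_lock, doing it here
	 * cannot deadlock.
	 */
	void *prealloc = slow_alloc(4096);

	if (!prealloc)
		return -1;			/* VM_FAULT_OOM in the patch */

	pthread_mutex_lock(&page_lock);
	/* ... fault handling that may or may not consume prealloc ... */
	pthread_mutex_unlock(&page_lock);

	free(prealloc);				/* dropping an unused preallocation is cheap */
	return 0;
}

int main(void)
{
	return handle_fault() ? 1 : 0;
}

As far as I can tell this matches what the fault code does with an unused
vmf->prealloc_pte: it is simply freed once the fault completes.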