From: Michal Hocko
To: Andrew Morton, "Kirill A. Shutemov"
Cc: Liu Bo, Jan Kara, Dave Chinner, "Theodore Ts'o", Johannes Weiner,
    Vladimir Davydov, LKML, Michal Hocko
Subject: [PATCH] mm, memcg: fix reclaim deadlock with writeback
Date: Tue, 11 Dec 2018 14:26:45 +0100
Message-Id: <20181211132645.31053-1-mhocko@kernel.org>
X-Mailer: git-send-email 2.19.2
X-Mailing-List: linux-kernel@vger.kernel.org

From: Michal Hocko

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback

task1:
[] wait_on_page_bit+0x82/0xa0
[] shrink_page_list+0x907/0x960
[] shrink_inactive_list+0x2c7/0x680
[] shrink_node_memcg+0x404/0x830
[] shrink_node+0xd8/0x300
[] do_try_to_free_pages+0x10d/0x330
[] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[] try_charge+0x14d/0x720
[] memcg_kmem_charge_memcg+0x3c/0xa0
[] memcg_kmem_charge+0x7e/0xd0
[] __alloc_pages_nodemask+0x178/0x260
[] alloc_pages_current+0x95/0x140
[] pte_alloc_one+0x17/0x40
[] __pte_alloc+0x1e/0x110
[] alloc_set_pte+0x5fe/0xc20
[] do_fault+0x103/0x970
[] handle_mm_fault+0x61e/0xd10
[] __do_page_fault+0x252/0x4d0
[] do_page_fault+0x30/0x80
[] page_fault+0x28/0x30
[] 0xffffffffffffffff

task2:
[] __lock_page+0x86/0xa0
[] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[] ext4_writepages+0x479/0xd60
[] do_writepages+0x1e/0x30
[] __writeback_single_inode+0x45/0x320
[] writeback_sb_inodes+0x272/0x600
[] __writeback_inodes_wb+0x92/0xc0
[] wb_writeback+0x268/0x300
[] wb_workfn+0xb4/0x390
[] process_one_work+0x189/0x420
[] worker_thread+0x4e/0x4b0
[] kthread+0xe6/0x100
[] ret_from_fork+0x41/0x50
[] 0xffffffffffffffff

He adds
: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
: bit of the page which task1 has locked.

More precisely, task1 is handling a page fault and has a page locked
while it charges a new page table to a memcg. That in turn hits a memory
limit reclaim, and the memcg reclaim for the legacy controller waits on
the writeback, which is never going to finish because the writeback
itself is waiting for the page locked in the #PF path. So this is
essentially an ABBA deadlock.

Waiting for the writeback in the legacy memcg controller is a workaround
for premature OOM killer invocations because there is no dirty IO
throttling available for the controller. There is no easy way around
that unfortunately. Therefore fix this specific issue by pre-allocating
the page table outside of the page lock. We already have handy
infrastructure for that, so simply reuse the fault-around pattern which
already does this.

Reported-and-Debugged-by: Liu Bo
Signed-off-by: Michal Hocko
---

Hi,
this was originally reported here [1]. While it could get worked around
in the fs, catching the allocation early sounds like a preferable
approach. Liu Bo has noted that he is not able to reproduce this anymore
because kmem accounting has been disabled in their workload, but this
should be quite straightforward to review.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a locked fs page but they should be really rare. I am not
aware of a better solution unfortunately.

I would appreciate it if Kirill could have a look and double check I am
not doing something stupid here.

Debugging lock_page deadlocks is an absolute PITA considering a lack of
lockdep support so I would mark it for stable.

[1] http://lkml.kernel.org/r/1540858969-75803-1-git-send-email-bo.liu@linux.alibaba.com

 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..1a73d2d4659e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
-- 
2.19.2
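
For anybody who wants to look at the ordering problem in isolation, here
is a toy, single-threaded user-space model of it. This is only a sketch
and not kernel code: every toy_* name is made up for illustration, and
the "deadlock" is reduced to a counter that is bumped whenever a charged
allocation would have to wait for writeback while the page lock is still
held. It assumes, as the changelog describes, that writeback can only
complete once the faulted page's lock is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy single-threaded model of the ordering problem. Nothing here is
 * kernel code; every toy_* name is hypothetical and made up for this sketch.
 */
struct toy_state {
	bool page_locked;       /* task1 holds the lock on the faulted page */
	bool writeback_pending; /* task2 set writeback on another page and
				 * still needs the faulted page's lock */
	bool pte_ready;         /* plays the role of vmf->prealloc_pte */
	int  stuck_allocs;      /* allocations that could wait forever */
};

/*
 * Charged page table allocation. In this model reclaim waits for pending
 * writeback, and writeback can only finish once the page lock is free.
 */
void toy_alloc_pte(struct toy_state *s)
{
	assert(s != NULL);
	if (s->writeback_pending && s->page_locked)
		s->stuck_allocs++; /* lock held while waiting: ABBA */
	s->pte_ready = true;
}

/* Stand-in for vma->vm_ops->fault(): returns with the page locked. */
void toy_fault_handler(struct toy_state *s)
{
	s->page_locked = true;
}

/* Stand-in for mapping the page: needs a page table, then unlocks. */
void toy_finish_fault(struct toy_state *s)
{
	if (!s->pte_ready)
		toy_alloc_pte(s);
	s->page_locked = false;
}

/* Ordering before the patch: allocate only after the page is locked. */
int toy_fault_before_patch(struct toy_state *s)
{
	toy_fault_handler(s);
	toy_finish_fault(s);
	return s->stuck_allocs;
}

/* Ordering after the patch: allocate first, then take the page lock. */
int toy_fault_after_patch(struct toy_state *s)
{
	if (!s->pte_ready)
		toy_alloc_pte(s);
	toy_fault_handler(s);
	toy_finish_fault(s);
	return s->stuck_allocs;
}
```

Starting from a state with writeback_pending set,
toy_fault_before_patch() reports one stuck allocation while
toy_fault_after_patch() reports none, because the allocation has already
happened by the time the page lock is taken.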