Date: Thu, 13 Dec 2018 17:04:00 -0500
From: Johannes Weiner
To: Michal Hocko
Shutemov" , Liu Bo , Jan Kara , Dave Chinner , Theodore Ts'o , Vladimir Davydov , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, LKML , Shakeel Butt , Michal Hocko , Stable tree Subject: Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback Message-ID: <20181213220400.GA9829@cmpxchg.org> References: <20181212155055.1269-1-mhocko@kernel.org> <20181213092221.27270-1-mhocko@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20181213092221.27270-1-mhocko@kernel.org> User-Agent: Mutt/1.11.1 (2018-12-01) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote: > From: Michal Hocko > > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the > ext4 writeback > task1: > [] wait_on_page_bit+0x82/0xa0 > [] shrink_page_list+0x907/0x960 > [] shrink_inactive_list+0x2c7/0x680 > [] shrink_node_memcg+0x404/0x830 > [] shrink_node+0xd8/0x300 > [] do_try_to_free_pages+0x10d/0x330 > [] try_to_free_mem_cgroup_pages+0xd5/0x1b0 > [] try_charge+0x14d/0x720 > [] memcg_kmem_charge_memcg+0x3c/0xa0 > [] memcg_kmem_charge+0x7e/0xd0 > [] __alloc_pages_nodemask+0x178/0x260 > [] alloc_pages_current+0x95/0x140 > [] pte_alloc_one+0x17/0x40 > [] __pte_alloc+0x1e/0x110 > [] alloc_set_pte+0x5fe/0xc20 > [] do_fault+0x103/0x970 > [] handle_mm_fault+0x61e/0xd10 > [] __do_page_fault+0x252/0x4d0 > [] do_page_fault+0x30/0x80 > [] page_fault+0x28/0x30 > [] 0xffffffffffffffff > > task2: > [] __lock_page+0x86/0xa0 > [] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4] > [] ext4_writepages+0x479/0xd60 > [] do_writepages+0x1e/0x30 > [] __writeback_single_inode+0x45/0x320 > [] writeback_sb_inodes+0x272/0x600 > [] __writeback_inodes_wb+0x92/0xc0 > [] wb_writeback+0x268/0x300 > [] wb_workfn+0xb4/0x390 > [] process_one_work+0x189/0x420 > [] worker_thread+0x4e/0x4b0 > [] kthread+0xe6/0x100 > [] ret_from_fork+0x41/0x50 > [] 0xffffffffffffffff > > He adds > : task1 is waiting for the PageWriteback bit of the page that task2 has > : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED > : bit the page which tasks1 has locked. > > More precisely task1 is handling a page fault and it has a page locked > while it charges a new page table to a memcg. That in turn hits a memory > limit reclaim and the memcg reclaim for legacy controller is waiting on > the writeback but that is never going to finish because the writeback > itself is waiting for the page locked in the #PF path. So this is > essentially ABBA deadlock: > lock_page(A) > SetPageWriteback(A) > unlock_page(A) > lock_page(B) > lock_page(B) > pte_alloc_pne > shrink_page_list > wait_on_page_writeback(A) > SetPageWriteback(B) > unlock_page(B) > > # flush A, B to clear the writeback > > This accumulating of more pages to flush is used by several filesystems > to generate a more optimal IO patterns. > > Waiting for the writeback in legacy memcg controller is a workaround > for pre-mature OOM killer invocations because there is no dirty IO > throttling available for the controller. There is no easy way around > that unfortunately. Therefore fix this specific issue by pre-allocating > the page table outside of the page lock. We have that handy > infrastructure for that already so simply reuse the fault-around pattern > which already does this. 
>
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> from under a fs page locked but they should be really rare. I am not
> aware of a better solution unfortunately.
>
> Reported-and-Debugged-by: Liu Bo
> Cc: stable
> Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> Signed-off-by: Michal Hocko

Acked-by: Johannes Weiner

Just one nit:

> @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret;
>
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}

Could you be more specific in the deadlock comment? git blame will work
fine for a while, but it becomes a pain to find corresponding patches
after stuff gets moved around for years. In particular the race diagram
between reclaim with a page lock held and the fs doing SetPageWriteback
batches before kicking off IO would be useful directly in the code, IMO.
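Something like the below is what I have in mind. It is only a rough
sketch that reuses your hunk plus the ABBA diagram from the changelog
(with pte_alloc_pne spelled out as pte_alloc_one); the exact wording
and layout are of course up to you:

	/*
	 * Preallocate pte before we take page_lock because this might lead to
	 * deadlocks for memcg reclaim which waits for pages under writeback:
	 *
	 *				lock_page(A)
	 *				SetPageWriteback(A)
	 *				unlock_page(A)
	 * lock_page(B)
	 *				lock_page(B)
	 * pte_alloc_one
	 *   shrink_page_list
	 *     wait_on_page_writeback(A)
	 *				SetPageWriteback(B)
	 *				unlock_page(B)
	 *
	 *				# flush A, B to clear the writeback
	 */
	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
		if (!vmf->prealloc_pte)
			return VM_FAULT_OOM;
		smp_wmb(); /* See comment in __pte_alloc() */
	}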