From: Michal Hocko
To: Andrew Morton, "Kirill A. Shutemov"
Cc: Liu Bo, Jan Kara, Dave Chinner, "Theodore Ts'o", Johannes Weiner, Vladimir Davydov, LKML, Michal Hocko
Subject: [PATCH v2] mm, memcg: fix reclaim deadlock with writeback
Date: Wed, 12 Dec 2018 16:50:55 +0100
Message-Id: <20181212155055.1269-1-mhocko@kernel.org>
X-Mailer: git-send-email 2.19.2
In-Reply-To: <20181211132645.31053-1-mhocko@kernel.org>
References: <20181211132645.31053-1-mhocko@kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

From: Michal Hocko

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback:

task1:
[] wait_on_page_bit+0x82/0xa0
[] shrink_page_list+0x907/0x960
[] shrink_inactive_list+0x2c7/0x680
[] shrink_node_memcg+0x404/0x830
[] shrink_node+0xd8/0x300
[] do_try_to_free_pages+0x10d/0x330
[] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[] try_charge+0x14d/0x720
[] memcg_kmem_charge_memcg+0x3c/0xa0
[] memcg_kmem_charge+0x7e/0xd0
[] __alloc_pages_nodemask+0x178/0x260
[] alloc_pages_current+0x95/0x140
[] pte_alloc_one+0x17/0x40
[] __pte_alloc+0x1e/0x110
[] alloc_set_pte+0x5fe/0xc20
[] do_fault+0x103/0x970
[] handle_mm_fault+0x61e/0xd10
[] __do_page_fault+0x252/0x4d0
[] do_page_fault+0x30/0x80
[] page_fault+0x28/0x30
[] 0xffffffffffffffff

task2:
[] __lock_page+0x86/0xa0
[] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[] ext4_writepages+0x479/0xd60
[] do_writepages+0x1e/0x30
[] __writeback_single_inode+0x45/0x320
[] writeback_sb_inodes+0x272/0x600
[] __writeback_inodes_wb+0x92/0xc0
[] wb_writeback+0x268/0x300
[] wb_workfn+0xb4/0x390
[] process_one_work+0x189/0x420
[] worker_thread+0x4e/0x4b0
[] kthread+0xe6/0x100
[] ret_from_fork+0x41/0x50
[] 0xffffffffffffffff

He adds
: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
: bit of the page which task1 has locked.

More precisely, task1 is handling a page fault and holds a page lock
while it charges a new page table to a memcg. That charge hits the memory
limit and triggers reclaim, and memcg reclaim for the legacy controller
waits on the writeback. The writeback is never going to finish, because
it is itself waiting for the page locked in the #PF path. So this is
essentially an ABBA deadlock.

Waiting for the writeback in the legacy memcg controller is a workaround
for premature OOM killer invocations, because there is no dirty IO
throttling available for that controller. Unfortunately there is no easy
way around that. Therefore fix this specific issue by pre-allocating the
page table outside of the page lock. We already have handy infrastructure
for that, so simply reuse the fault-around pattern which already does
this.

Reported-and-Debugged-by: Liu Bo
Signed-off-by: Michal Hocko
---
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..bb78e90a9b70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
-- 
2.19.2
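
For readers who want to see the ordering argument in isolation, here is a
minimal userspace sketch of it. It is not kernel code and is not part of
the patch: every model_* name is hypothetical, and the "would deadlock"
result only stands in for task1 sleeping forever in wait_on_page_bit().
It assumes the situation described in the changelog, where the flusher
has a page queued for writeback and cannot submit it until it gets the
lock on the page held by the fault path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two tasks; none of this is kernel API. */
enum model_result { MODEL_OK, MODEL_WOULD_DEADLOCK };

struct model_state {
	bool fault_page_locked;	/* page A: locked by the fault path (task1) */
	bool writeback_pending;	/* page B: queued by the flusher (task2),
				 * not submitted until it can lock page A */
	bool pte_preallocated;	/* plays the role of vmf->prealloc_pte */
};

/*
 * Stand-in for charging a page table to a memcg at its limit. Legacy
 * reclaim waits for writeback of page B, which can only complete once
 * task2 has taken the lock on page A. If the caller holds that lock,
 * the wait never ends; report it instead of hanging.
 */
enum model_result model_charge_pte(struct model_state *s)
{
	assert(s != NULL);
	if (s->writeback_pending && s->fault_page_locked)
		return MODEL_WOULD_DEADLOCK;
	s->pte_preallocated = true;
	return MODEL_OK;
}

/* Order before the patch: lock the page first, then allocate the pte. */
enum model_result model_fault_old(struct model_state *s)
{
	enum model_result r;

	s->fault_page_locked = true;	/* ->fault() returns a locked page */
	r = model_charge_pte(s);	/* allocation under the page lock */
	s->fault_page_locked = false;
	return r;
}

/* Order after the patch: allocate the pte first, then lock the page. */
enum model_result model_fault_new(struct model_state *s)
{
	enum model_result r = model_charge_pte(s);	/* no page lock held */

	if (r != MODEL_OK)
		return r;
	s->fault_page_locked = true;	/* ->fault() returns a locked page */
	/* The page table is already there, so nothing is charged here. */
	s->fault_page_locked = false;
	return MODEL_OK;
}
```

With writeback_pending set, model_fault_old() reports the deadlock while
model_fault_new() completes. That is the whole point of moving the
allocation ahead of ->fault(): reclaim may still wait on writeback, but
the flusher is no longer blocked on a lock that the waiter holds.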