Date: Wed, 12 Dec 2018 11:05:42 +0100
From: Jan Kara
To: "Kirill A. Shutemov"
Cc: Michal Hocko, Andrew Morton, Liu Bo, Jan Kara, Dave Chinner,
 Theodore Ts'o, Johannes Weiner, Vladimir Davydov, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, LKML, Michal Hocko, Hugh Dickins
Subject: Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
Message-ID: <20181212100542.GA10902@quack2.suse.cz>
References: <20181211132645.31053-1-mhocko@kernel.org>
 <20181212094249.cw4xjrdchqsp2tkt@kshutemo-mobl1>
In-Reply-To: <20181212094249.cw4xjrdchqsp2tkt@kshutemo-mobl1>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 12-12-18 12:42:49, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > From: Michal Hocko
> >
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [] wait_on_page_bit+0x82/0xa0
> > [] shrink_page_list+0x907/0x960
> > [] shrink_inactive_list+0x2c7/0x680
> > [] shrink_node_memcg+0x404/0x830
> > [] shrink_node+0xd8/0x300
> > [] do_try_to_free_pages+0x10d/0x330
> > [] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [] try_charge+0x14d/0x720
> > [] memcg_kmem_charge_memcg+0x3c/0xa0
> > [] memcg_kmem_charge+0x7e/0xd0
> > [] __alloc_pages_nodemask+0x178/0x260
> > [] alloc_pages_current+0x95/0x140
> > [] pte_alloc_one+0x17/0x40
> > [] __pte_alloc+0x1e/0x110
> > [] alloc_set_pte+0x5fe/0xc20
> > [] do_fault+0x103/0x970
> > [] handle_mm_fault+0x61e/0xd10
> > [] __do_page_fault+0x252/0x4d0
> > [] do_page_fault+0x30/0x80
> > [] page_fault+0x28/0x30
> > [] 0xffffffffffffffff
> >
> > task2:
> > [] __lock_page+0x86/0xa0
> > [] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [] ext4_writepages+0x479/0xd60
> > [] do_writepages+0x1e/0x30
> > [] __writeback_single_inode+0x45/0x320
> > [] writeback_sb_inodes+0x272/0x600
> > [] __writeback_inodes_wb+0x92/0xc0
> > [] wb_writeback+0x268/0x300
> > [] wb_workfn+0xb4/0x390
> > [] process_one_work+0x189/0x420
> > [] worker_thread+0x4e/0x4b0
> > [] kthread+0xe6/0x100
> > [] ret_from_fork+0x41/0x50
> > [] 0xffffffffffffffff
> >
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
> > : bit of the page which task1 has locked.
> >
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially an ABBA deadlock.
>
> Side note:
>
> Do we have PG_writeback vs. PG_locked ordering documented somewhere?
>
> IIUC, the trace from task2 suggests that we must not wait for writeback
> on the locked page.

Well, waiting on writeback of page A when A is locked has always been
fine. After all, that's the only easy way to make sure you really have a
page for which no IO is running, as the page lock protects you from a new
writeback attempt starting. Waiting on writeback of page B while having
page A locked *is* problematic and prone to deadlocks due to code paths
like the one in task2.

> But that's not what I see for many wait_on_page_writeback() users: it's
> usually called with the page locked. I see it for truncate, shmem,
> swapfile, splice...
>
> Maybe the problem is within task2 codepath after all?

So ->writepages() methods in filesystems have the property that to
complete writeback on the page with index X, they may need the page lock
of page X+1. I agree that this is a bit hairy, but from the fs point of
view it makes a lot of sense, and AFAIK nothing besides the memcg IO
throttling can create deadlocks with such a locking scheme.
								Honza
-- 
Jan Kara
SUSE Labs, CR
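
Illustration (not part of the original message, and not kernel code): the
page A versus page B distinction in the reply above can be modelled as a
wait-for graph. The sketch below is a toy model under stated assumptions:
find_deadlock(), the holds/waits_for dictionaries, and the resource labels
such as "lock(A)" and "writeback(B)" are all hypothetical names made up for
this example. It only shows why waiting on writeback of B while holding the
lock on A can form a cycle, while waiting on writeback of the page you have
locked does not.

```python
def find_deadlock(holds, waits_for):
    """Return the tasks forming a wait cycle, or an empty list.

    holds:     maps a resource label to the task that must make progress
               before that resource is released (hypothetical model)
    waits_for: maps a task to the single resource it is blocked on
    """
    for start in waits_for:
        seen = []
        task = start
        # Follow "blocked on a resource -> task that must release it".
        while task in waits_for and task not in seen:
            seen.append(task)
            task = holds.get(waits_for[task])
        if task == start:
            return seen
    return []


# The reported case: task1 has page A locked and waits for writeback of
# page B; task2 owns the not yet submitted writeback of B and waits for
# the lock on page A.
abba = find_deadlock(
    holds={"lock(A)": "task1", "writeback(B)": "task2"},
    waits_for={"task1": "writeback(B)", "task2": "lock(A)"},
)

# The case described as fine: task1 has page A locked and waits for
# writeback of that same page. The IO completion (modelled here as a
# task that is blocked on nothing) needs no page lock to finish.
same_page = find_deadlock(
    holds={"lock(A)": "task1", "writeback(A)": "io_completion"},
    waits_for={"task1": "writeback(A)"},
)

print("A locked, waiting on writeback of B:", abba)
print("A locked, waiting on writeback of A:", same_page)
```

The first call reports a cycle through task1 and task2; the second reports
none, because nothing on the completion side of writeback(A) depends on a
page lock in this model.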