Date: Wed, 12 Dec 2018 10:48:32 +0100
From: Michal Hocko
To: "Kirill A. Shutemov"
Shutemov" Cc: Andrew Morton , Liu Bo , Jan Kara , Dave Chinner , Theodore Ts'o , Johannes Weiner , Vladimir Davydov , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, LKML , Hugh Dickins Subject: Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback Message-ID: <20181212094832.GN1286@dhcp22.suse.cz> References: <20181211132645.31053-1-mhocko@kernel.org> <20181212094249.cw4xjrdchqsp2tkt@kshutemo-mobl1> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20181212094249.cw4xjrdchqsp2tkt@kshutemo-mobl1> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 12-12-18 12:42:49, Kirill A. Shutemov wrote: > On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote: > > From: Michal Hocko > > > > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the > > ext4 writeback > > task1: > > [] wait_on_page_bit+0x82/0xa0 > > [] shrink_page_list+0x907/0x960 > > [] shrink_inactive_list+0x2c7/0x680 > > [] shrink_node_memcg+0x404/0x830 > > [] shrink_node+0xd8/0x300 > > [] do_try_to_free_pages+0x10d/0x330 > > [] try_to_free_mem_cgroup_pages+0xd5/0x1b0 > > [] try_charge+0x14d/0x720 > > [] memcg_kmem_charge_memcg+0x3c/0xa0 > > [] memcg_kmem_charge+0x7e/0xd0 > > [] __alloc_pages_nodemask+0x178/0x260 > > [] alloc_pages_current+0x95/0x140 > > [] pte_alloc_one+0x17/0x40 > > [] __pte_alloc+0x1e/0x110 > > [] alloc_set_pte+0x5fe/0xc20 > > [] do_fault+0x103/0x970 > > [] handle_mm_fault+0x61e/0xd10 > > [] __do_page_fault+0x252/0x4d0 > > [] do_page_fault+0x30/0x80 > > [] page_fault+0x28/0x30 > > [] 0xffffffffffffffff > > > > task2: > > [] __lock_page+0x86/0xa0 > > [] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4] > > [] ext4_writepages+0x479/0xd60 > > [] do_writepages+0x1e/0x30 > > [] __writeback_single_inode+0x45/0x320 > > [] writeback_sb_inodes+0x272/0x600 > > [] __writeback_inodes_wb+0x92/0xc0 > > [] wb_writeback+0x268/0x300 > > [] wb_workfn+0xb4/0x390 > > [] process_one_work+0x189/0x420 > > [] worker_thread+0x4e/0x4b0 > > [] kthread+0xe6/0x100 > > [] ret_from_fork+0x41/0x50 > > [] 0xffffffffffffffff > > > > He adds > > : task1 is waiting for the PageWriteback bit of the page that task2 has > > : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED > > : bit the page which tasks1 has locked. > > > > More precisely task1 is handling a page fault and it has a page locked > > while it charges a new page table to a memcg. That in turn hits a memory > > limit reclaim and the memcg reclaim for legacy controller is waiting on > > the writeback but that is never going to finish because the writeback > > itself is waiting for the page locked in the #PF path. So this is > > essentially ABBA deadlock. > > Side node: > > Do we have PG_writeback vs. PG_locked ordering documentated somewhere? I am not aware of any > IIUC, the trace from task2 suggests that we must not wait for writeback > on the locked page. > > But that not what I see for many wait_on_page_writeback() users: it usally > called with the page locked. I see it for truncate, shmem, swapfile, > splice... > > Maybe the problem is within task2 codepath after all? Jack and David have explained that this is due to an optimization multiple filesystems do. They lock and set wribeback on multiple pages and then send a largeer IO at once. 
So in this case we have the following pattern:

                                        lock_page(B)
                                        SetPageWriteback(B)
                                        unlock_page(B)

lock_page(A)
                                        lock_page(A)
pte_alloc_one
  shrink_page_list
    wait_on_page_writeback(B)
                                        SetPageWriteback(A)
                                        unlock_page(A)

                                        # flush A, B to clear the writeback

-- 
Michal Hocko
SUSE Labs
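For reference, the left-hand (task1) column corresponds roughly to the
following (again a simplified, hypothetical sketch rather than the real
fault path; example_alloc_and_charge_pte() is an invented stand-in for
the pte_alloc_one()/memcg charge step seen in the trace):

#include <linux/mm.h>
#include <linux/pagemap.h>

int example_alloc_and_charge_pte(void);	/* invented stand-in */

static vm_fault_t example_fault_path(struct page *A)
{
	vm_fault_t ret = 0;

	lock_page(A);	/* A stays locked for the rest of the fault */

	/*
	 * The charge can hit the memcg limit; legacy memcg reclaim then
	 * reaches shrink_page_list() -> wait_on_page_writeback(B), and
	 * B's writeback only completes once the flusher has managed to
	 * lock_page(A) and submit its batch: the ABBA deadlock above.
	 */
	if (example_alloc_and_charge_pte())
		ret = VM_FAULT_OOM;

	unlock_page(A);
	return ret;
}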