From: Greg Thelen
To: Wang Long, Michal Hocko, Andrew Morton, Johannes Weiner, Tejun Heo, Sasha Levin
Cc: npiggin@gmail.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Greg Thelen, stable@vger.kernel.org
Subject: [PATCH v4] writeback: safer lock nesting
Date: Wed, 11 Apr 2018 01:46:53 -0700
Message-Id: <20180411084653.254724-1-gthelen@google.com>

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains. Switches occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If an inode switch and a process memcg migration are both in flight, then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock. This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel. I suggest we should, to prevent future
surprises. And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.

Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Cc: stable@vger.kernel.org # v4.2+
Reported-by: Wang Long
Signed-off-by: Greg Thelen
Acked-by: Michal Hocko
Acked-by: Wang Long
---
Changelog since v3:
- initialize wb_lock_cookie with {} rather than {0}.
- comment grammar fix
- commit log footer cleanup (-Change-Id, +Fixes, +Acks, +stable), though patch
  does not cleanly apply to 4.4. I'll post a 4.4-stable specific patch.

Changelog since v2:
- explicitly initialize wb_lock_cookie to silence compiler warnings.

Changelog since v1:
- add wb_lock_cookie to record lock context.

 fs/fs-writeback.c                |  7 ++++---
 include/linux/backing-dev-defs.h |  5 +++++
 include/linux/backing-dev.h      | 31 +++++++++++++++++--------------
 mm/page-writeback.c              | 18 +++++++++---------
 4 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1280f915079b..b1178acfcb08 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -745,11 +745,12 @@ int inode_congested(struct inode *inode, int cong_bits)
 	 */
 	if (inode && inode_to_wb_is_valid(inode)) {
 		struct bdi_writeback *wb;
-		bool locked, congested;
+		struct wb_lock_cookie lock_cookie = {};
+		bool congested;
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
 		congested = wb_congested(wb, cong_bits);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &lock_cookie);
 		return congested;
 	}
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index bfe86b54f6c1..0bd432a4d7bd 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -223,6 +223,11 @@ static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	set_wb_congested(bdi->wb.congested, sync);
 }
 
+struct wb_lock_cookie {
+	bool locked;
+	unsigned long flags;
+};
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3e4ce54d84ab..96f4a3ddfb81 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
 /**
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
- * @lockedp: temp bool output param, to be passed to the end function
+ * @cookie: output param, to be passed to the end function
  *
  * The caller wants to access the wb associated with @inode but isn't
  * holding inode->i_lock, mapping->tree_lock or wb->list_lock. This
@@ -354,12 +354,12 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
  * association doesn't change until the transaction is finished with
  * unlocked_inode_to_wb_end().
  *
- * The caller must call unlocked_inode_to_wb_end() with *@lockdep
- * afterwards and can't sleep during transaction. IRQ may or may not be
- * disabled on return.
+ * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
+ * can't sleep during the transaction. IRQs may or may not be disabled on
+ * return.
  */
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 {
 	rcu_read_lock();
 
@@ -367,10 +367,10 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 	 * Paired with store_release in inode_switch_wb_work_fn() and
 	 * ensures that we see the new wb if we see cleared I_WB_SWITCH.
 	 */
-	*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
+	cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
 
-	if (unlikely(*lockedp))
-		spin_lock_irq(&inode->i_mapping->tree_lock);
+	if (unlikely(cookie->locked))
+		spin_lock_irqsave(&inode->i_mapping->tree_lock, cookie->flags);
 
 	/*
 	 * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
@@ -382,12 +382,14 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 /**
  * unlocked_inode_to_wb_end - end inode wb access transaction
  * @inode: target inode
- * @locked: *@lockedp from unlocked_inode_to_wb_begin()
+ * @cookie: @cookie from unlocked_inode_to_wb_begin()
  */
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+					    struct wb_lock_cookie *cookie)
 {
-	if (unlikely(locked))
-		spin_unlock_irq(&inode->i_mapping->tree_lock);
+	if (unlikely(cookie->locked))
+		spin_unlock_irqrestore(&inode->i_mapping->tree_lock,
+				       cookie->flags);
 
 	rcu_read_unlock();
 }
@@ -434,12 +436,13 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 }
 
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 {
 	return inode_to_wb(inode);
 }
 
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+					    struct wb_lock_cookie *cookie)
 {
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 586f31261c83..8369572e1f7d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2501,13 +2501,13 @@ void account_page_redirty(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie = {};
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 		current->nr_dirtied--;
 		dec_node_page_state(page, NR_DIRTIED);
 		dec_wb_stat(wb, WB_DIRTIED);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 	}
 }
 EXPORT_SYMBOL(account_page_redirty);
@@ -2613,15 +2613,15 @@ void __cancel_dirty_page(struct page *page)
 	if (mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie = {};
 
 		lock_page_memcg(page);
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 
 		if (TestClearPageDirty(page))
 			account_page_cleaned(page, mapping, wb);
 
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 		unlock_page_memcg(page);
 	} else {
 		ClearPageDirty(page);
@@ -2653,7 +2653,7 @@ int clear_page_dirty_for_io(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie = {};
 
 		/*
 		 * Yes, Virginia, this is indeed insane.
@@ -2690,14 +2690,14 @@ int clear_page_dirty_for_io(struct page *page)
 		 * always locked coming in here, so we get the desired
 		 * exclusion.
 		 */
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 		if (TestClearPageDirty(page)) {
 			dec_lruvec_page_state(page, NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 		return ret;
 	}
 	return TestClearPageDirty(page);
-- 
2.17.0.484.g0c8726318c-goog
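
The nesting hazard described in the commit message can be modelled outside the kernel. What follows is a minimal user-space sketch, not kernel code: `irqs_enabled` and the `sim_*` helpers are hypothetical stand-ins that only track the interrupt-enable state, with no real spinlocks or interrupts involved. The two exported functions report whether interrupts would be enabled while the outer (lock_page_memcg-like) section is still held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the CPU interrupt-enable flag (not a kernel API). */
static bool irqs_enabled = true;

/* Model of spin_lock_irq(): disables interrupts unconditionally. */
static void sim_lock_irq(void) { irqs_enabled = false; }

/* Model of spin_unlock_irq(): enables interrupts unconditionally. */
static void sim_unlock_irq(void) { irqs_enabled = true; }

/* Model of spin_lock_irqsave(): remembers the prior state, then disables. */
static void sim_lock_irqsave(bool *flags)
{
	*flags = irqs_enabled;
	irqs_enabled = false;
}

/* Model of spin_unlock_irqrestore(): puts back whatever was saved. */
static void sim_unlock_irqrestore(bool flags) { irqs_enabled = flags; }

/*
 * Pre-patch nesting: the inner section uses the _irq variants inside an
 * outer irqsave section.  Returns the interrupt state observed while the
 * outer section is still held.
 */
bool irqs_on_inside_outer_old(void)
{
	bool outer_flags, seen;

	irqs_enabled = true;
	sim_lock_irqsave(&outer_flags);		/* like lock_page_memcg() */
	assert(!irqs_enabled);
	sim_lock_irq();				/* like unlocked_inode_to_wb_begin() */
	sim_unlock_irq();			/* like unlocked_inode_to_wb_end() */
	seen = irqs_enabled;			/* outer section still held here */
	sim_unlock_irqrestore(outer_flags);	/* like unlock_page_memcg() */
	return seen;
}

/*
 * Post-patch nesting: the inner section saves and restores the state it
 * found, so it cannot re-enable interrupts behind the outer section's back.
 */
bool irqs_on_inside_outer_new(void)
{
	bool outer_flags, inner_flags, seen;

	irqs_enabled = true;
	sim_lock_irqsave(&outer_flags);
	assert(!irqs_enabled);
	sim_lock_irqsave(&inner_flags);
	sim_unlock_irqrestore(inner_flags);
	seen = irqs_enabled;
	sim_unlock_irqrestore(outer_flags);
	return seen;
}
```

In this model, `irqs_on_inside_outer_old()` reports interrupts enabled while the outer section is still open, which corresponds to the window in the call trace above; `irqs_on_inside_outer_new()` reports them still disabled. The `bool` used for the saved state is a simplification of the `unsigned long flags` carried in `struct wb_lock_cookie`.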