Received: by 2002:a25:c205:0:0:0:0:0 with SMTP id s5csp1209376ybf; Thu, 27 Feb 2020 06:51:29 -0800 (PST) X-Google-Smtp-Source: APXvYqyoi2vpabxLXsdYaHo/sNAk4ldx5890L4pWx274NQ2sTFqIqE5ICjJWm4aCFE8Csbt6C9vu X-Received: by 2002:a9d:6b06:: with SMTP id g6mr3639206otp.93.1582815088969; Thu, 27 Feb 2020 06:51:28 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1582815088; cv=none; d=google.com; s=arc-20160816; b=qzqiSY7MUcGBfWEWewUXENQg9dQgX7cXOgPwLFhFkuzIauJJEel8Zv7cJoM3911Bh7 BtfbZEU9BGP33k3wNLItVi3/zIwsHCTyGjsgNMMN0ap2ciQbiNM8Jn8XbrrYjq/3e4pq r4ppqg4EI8bRbETN/E2zQWmm5VVTCSTpKREFj06dEitRpY1KEk6FY6Ek3UaYYOfXe4Dx KigcEKOiApva7C3t1C0Gwv8tB1kQAxzFsvlplXqjgvNBcpC67QPQcf+0w7E+Quky2dn3 BpnBbBxd6g1bnDuZ49NWy/5K0vpjGJXiKmL3R0JMz7DAALvl5W47iWGsdMzMZAaNN8PZ 13fw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:message-id:date:subject:cc:to :from:dkim-signature; bh=+JLVb6jLPBOSLM/s1iUw8faAkRjVsKnqhG+mORDTcZE=; b=RmLZnGqrl/zLNQoGkmXBXCzutKgvYOvjG6nqyo6LAbXUf47NVfUj9nyuT7/hROjXlj 3jsPnrBxMf3OJnJxEVzncPapznK9LxX9rkge8+4DO9obOff1UUZ4NGsro97NS0ZbRq3G btof1lq+GsV0r0JBJPAmfC4txz7+oBdpj+B0sLwKSb86PESKOP4zQVebpcjLPx/Zw9dz 6Q+bO7U22evkPzEk6PzL1ACqnzjtHaxHLUQoBFO9BrLZdy/yi246uqnWxtQEnQcJoeMl aXVs46AHlXMrF8Tj93Ou+tXni6G5wLaIPOsnH6u01jSheSi5iUsYI0W/sXami8rPQ8aL cXXw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b="z/ajNT86"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id u21si1770404otq.137.2020.02.27.06.51.17; Thu, 27 Feb 2020 06:51:28 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b="z/ajNT86"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729391AbgB0Nks (ORCPT + 99 others); Thu, 27 Feb 2020 08:40:48 -0500 Received: from mail.kernel.org ([198.145.29.99]:35126 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729359AbgB0Nkq (ORCPT ); Thu, 27 Feb 2020 08:40:46 -0500 Received: from localhost (83-86-89-107.cable.dynamic.v4.ziggo.nl [83.86.89.107]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 7A1AE246A2; Thu, 27 Feb 2020 13:40:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1582810844; bh=wUIufC8Wzip5suA+4iOfai1q2QIz8TaqzWm4SZxl3ZE=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=z/ajNT86C12oUnPzzQo7Q31qe938dli1weRdbBkY0xl1xq6HHOt4DYfDiHEx9WTr1 rePj4j8C6nfMf5MYZtRVcZr28zgQiixDbIKATPeGUnCmQLMEfO7i4xgTwKZX3vevhg ZqQl7km8u353rxH6Y+piQMNuUctmositF3mvfizI= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, ryusuke1925 , Koki Mitani , Josef Bacik , Filipe Manana , David Sterba Subject: [PATCH 4.4 009/113] Btrfs: fix race between using extent maps and merging them Date: Thu, 27 Feb 2020 14:35:25 +0100 Message-Id: <20200227132213.238515204@linuxfoundation.org> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20200227132211.791484803@linuxfoundation.org> References: <20200227132211.791484803@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Filipe Manana commit ac05ca913e9f3871126d61da275bfe8516ff01ca upstream. We have a few cases where we allow an extent map that is in an extent map tree to be merged with other extents in the tree. Such cases include the unpinning of an extent after the respective ordered extent completed or after logging an extent during a fast fsync. This can lead to subtle and dangerous problems because when doing the merge some other task might be using the same extent map and as consequence see an inconsistent state of the extent map - for example sees the new length but has seen the old start offset. With luck this triggers a BUG_ON(), and not some silent bug, such as the following one in __do_readpage(): $ cat -n fs/btrfs/extent_io.c 3061 static int __do_readpage(struct extent_io_tree *tree, 3062 struct page *page, (...) 3127 em = __get_extent_map(inode, page, pg_offset, cur, 3128 end - cur + 1, get_extent, em_cached); 3129 if (IS_ERR_OR_NULL(em)) { 3130 SetPageError(page); 3131 unlock_extent(tree, cur, end); 3132 break; 3133 } 3134 extent_offset = cur - em->start; 3135 BUG_ON(extent_map_end(em) <= cur); (...) Consider the following example scenario, where we end up hitting the BUG_ON() in __do_readpage(). We have an inode with a size of 8KiB and 2 extent maps: extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by a previous transaction extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet persisted but writeback started for it already. The extent map is pinned since there's writeback and an ordered extent in progress, so it can not be merged with extent map A yet The following sequence of steps leads to the BUG_ON(): 1) The ordered extent for extent B completes, the respective page gets its writeback bit cleared and the extent map is unpinned, at that point it is not yet merged with extent map A because it's in the list of modified extents; 2) Due to memory pressure, or some other reason, the MM subsystem releases the page corresponding to extent B - btrfs_releasepage() is called and returns 1, meaning the page can be released as it's not dirty, not under writeback anymore and the extent range is not locked in the inode's iotree. However the extent map is not released, either because we are not in a context that allows memory allocations to block or because the inode's size is smaller than 16MiB - in this case our inode has a size of 8KiB; 3) Task B needs to read extent B and ends up __do_readpage() through the btrfs_readpage() callback. At __do_readpage() it gets a reference to extent map B; 4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B while holding the write lock on the inode's extent map tree - this results in try_merge_map() being called and since it's possible to merge extent map B with extent map A now (the extent map B was removed from the list of modified extents), the merging begins - it sets extent map B's start offset to 0 (was 4KiB), but before it increments the map's length to 8KiB (4kb + 4KiB), task A is at: BUG_ON(extent_map_end(em) <= cur); The call to extent_map_end() sees the extent map has a start of 0 and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so the BUG_ON() is triggered. So it's dangerous to modify an extent map that is in the tree, because some other task might have got a reference to it before and still using it, and needs to see a consistent map while using it. Generally this is very rare since most paths that lookup and use extent maps also have the file range locked in the inode's iotree. The fsync path is pretty much the only exception where we don't do it to avoid serialization with concurrent reads. Fix this by not allowing an extent map do be merged if if it's being used by tasks other then the one attempting to merge the extent map (when the reference count of the extent map is greater than 2). Reported-by: ryusuke1925 Reported-by: Koki Mitani Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211 CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik Signed-off-by: Filipe Manana Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman --- fs/btrfs/extent_map.c | 11 +++++++++++ 1 file changed, 11 insertions(+) --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -227,6 +227,17 @@ static void try_merge_map(struct extent_ struct extent_map *merge = NULL; struct rb_node *rb; + /* + * We can't modify an extent map that is in the tree and that is being + * used by another task, as it can cause that other task to see it in + * inconsistent state during the merging. We always have 1 reference for + * the tree and 1 for this task (which is unpinning the extent map or + * clearing the logging flag), so anything > 2 means it's being used by + * other tasks too. + */ + if (atomic_read(&em->refs) > 2) + return; + if (em->start != 0) { rb = rb_prev(&em->rb_node); if (rb)