Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932335AbbKXWgn (ORCPT ); Tue, 24 Nov 2015 17:36:43 -0500 Received: from shadbolt.e.decadent.org.uk ([88.96.1.126]:40300 "EHLO shadbolt.e.decadent.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755445AbbKXWfl (ORCPT ); Tue, 24 Nov 2015 17:35:41 -0500 Content-Type: text/plain; charset="UTF-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit MIME-Version: 1.0 From: Ben Hutchings To: linux-kernel@vger.kernel.org, stable@vger.kernel.org CC: akpm@linux-foundation.org, "Filipe Manana" Date: Tue, 24 Nov 2015 22:33:59 +0000 Message-ID: X-Mailer: LinuxStableQueue (scripts by bwh) Subject: [PATCH 3.2 34/52] Btrfs: fix race leading to incorrect item deletion when dropping extents In-Reply-To: X-SA-Exim-Connect-IP: 192.168.4.247 X-SA-Exim-Mail-From: ben@decadent.org.uk X-SA-Exim-Scanned: No (on shadbolt.decadent.org.uk); SAEximRunCond expanded to false Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 8253 Lines: 202 3.2.74-rc1 review patch. If anyone has any objections, please let me know. ------------------ From: Filipe Manana commit aeafbf8486c9e2bd53f5cc3c10c0b7fd7149d69c upstream. While running a stress test I got the following warning triggered: [191627.672810] ------------[ cut here ]------------ [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]() (...) [191627.701485] Call Trace: [191627.702037] [] dump_stack+0x4f/0x7b [191627.702992] [] ? console_unlock+0x356/0x3a2 [191627.704091] [] warn_slowpath_common+0xa1/0xbb [191627.705380] [] ? __btrfs_drop_extents+0x391/0xa50 [btrfs] [191627.706637] [] warn_slowpath_null+0x1a/0x1c [191627.707789] [] __btrfs_drop_extents+0x391/0xa50 [btrfs] [191627.709155] [] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0 [191627.712444] [] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18 [191627.714162] [] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs] [191627.715887] [] ? start_transaction+0x3bb/0x610 [btrfs] [191627.717287] [] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs] [191627.728865] [] finish_ordered_fn+0x15/0x17 [btrfs] [191627.730045] [] normal_work_helper+0x14c/0x32c [btrfs] [191627.731256] [] btrfs_endio_write_helper+0x12/0x14 [btrfs] [191627.732661] [] process_one_work+0x24c/0x4ae [191627.733822] [] worker_thread+0x206/0x2c2 [191627.734857] [] ? process_scheduled_works+0x2f/0x2f [191627.736052] [] ? process_scheduled_works+0x2f/0x2f [191627.737349] [] kthread+0xef/0xf7 [191627.738267] [] ? time_hardirqs_on+0x15/0x28 [191627.739330] [] ? __kthread_parkme+0xad/0xad [191627.741976] [] ret_from_fork+0x42/0x70 [191627.743080] [] ? __kthread_parkme+0xad/0xad [191627.744206] ---[ end trace bbfddacb7aaada8d ]--- $ cat -n fs/btrfs/file.c 691 int __btrfs_drop_extents(struct btrfs_trans_handle *trans, (...) 758 btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); 759 if (key.objectid > ino || 760 key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end) 761 break; 762 763 fi = btrfs_item_ptr(leaf, path->slots[0], 764 struct btrfs_file_extent_item); 765 extent_type = btrfs_file_extent_type(leaf, fi); 766 767 if (extent_type == BTRFS_FILE_EXTENT_REG || 768 extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) 774 } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) 778 } else { 779 WARN_ON(1); 780 extent_end = search_start; 781 } (...) This happened because the item we were processing did not match a file extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this case we cast the item to a struct btrfs_file_extent_item pointer and then find a type field value that does not match any of the expected values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens due to a tiny time window where a race can happen as exemplified below. For example, consider the following scenario where we're using the NO_HOLES feature and we have the following two neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather than explicit because NO_HOLES is enabled). Now if our inode has an ordered extent for the range [4K, 8K[ that is finishing, the following can happen: CPU 1 CPU 2 btrfs_finish_ordered_io() insert_reserved_file_extent() __btrfs_drop_extents() Searches for the key (257 EXTENT_DATA 4096) through btrfs_lookup_file_extent() Key not found and we get a path where path->nodes[0] == leaf X and path->slots[0] == N Because path->slots[0] is >= btrfs_header_nritems(leaf X), we call btrfs_next_leaf() btrfs_next_leaf() releases the path inserts key (257 INODE_REF 4096) at the end of leaf X, leaf X now has N + 1 keys, and the new key is at slot N btrfs_next_leaf() searches for key (257 INODE_REF 256), with path->keep_locks set to 1, because it was the last key it saw in leaf X finds it in leaf X again and notices it's no longer the last key of the leaf, so it returns 0 with path->nodes[0] == leaf X and path->slots[0] == N (which is now < btrfs_header_nritems(leaf X)), pointing to the new key (257 INODE_REF 4096) __btrfs_drop_extents() casts the item at path->nodes[0], slot path->slots[0], to a struct btrfs_file_extent_item - it does not skip keys for the target inode with a type less than BTRFS_EXTENT_DATA_KEY (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY) sees a bogus value for the type field triggering the WARN_ON in the trace shown above, and sets extent_end = search_start (4096) does the if-then-else logic to fixup 0 length extent items created by a past bug from hole punching: if (extent_end == key.offset && extent_end >= search_start) goto delete_extent_item; that evaluates to true and it ends up deleting the key pointed to by path->slots[0], (257 INODE_REF 4096), from leaf X The same could happen for example for a xattr that ends up having a key with an offset value that matches search_start (very unlikely but not impossible). So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are skipped, never casted to struct btrfs_file_extent_item and never deleted by accident. Also protect against the unexpected case of getting a key for a lower inode number by skipping that key and issuing a warning. Signed-off-by: Filipe Manana [bwh: Backported to 3.2: drop use of ASSERT()] Signed-off-by: Ben Hutchings --- fs/btrfs/file.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -605,8 +605,15 @@ next_slot: } btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); - if (key.objectid > ino || - key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end) + + if (key.objectid > ino) + break; + if (WARN_ON_ONCE(key.objectid < ino) || + key.type < BTRFS_EXTENT_DATA_KEY) { + path->slots[0]++; + goto next_slot; + } + if (key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end) break; fi = btrfs_item_ptr(leaf, path->slots[0], @@ -625,8 +632,8 @@ next_slot: btrfs_file_extent_inline_len(leaf, path->slots[0], fi); } else { - WARN_ON(1); - extent_end = search_start; + /* can't happen */ + BUG(); } if (extent_end <= search_start) { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/