From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Filipe Manana, David Sterba, Sasha Levin
Subject: [PATCH 4.16 049/272] Btrfs: fix loss of prealloc extents past i_size after fsync log replay
Date: Mon, 28 May 2018 12:01:22 +0200
Message-Id: <20180528100244.974562540@linuxfoundation.org>
X-Mailer: git-send-email 2.17.0
In-Reply-To:
 <20180528100240.256525891@linuxfoundation.org>
References: <20180528100240.256525891@linuxfoundation.org>
User-Agent: quilt/0.65
X-stable: review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

4.16-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Filipe Manana

[ Upstream commit 471d557afed155b85da237ec46c549f443eeb5de ]

Currently, if we allocate extents beyond an inode's i_size (through the
fallocate system call) and then fsync the file, we log the extents, but
after a power failure we replay them and then immediately drop them.
This behaviour has existed since about 2009, commit c71bf099abdd
("Btrfs: Avoid orphan inodes cleanup while replaying log"), because that
commit marks the inode as an orphan instead of dropping any extents
beyond i_size before replaying logged extents. As a result, after the
log replay, and while the mount operation is still ongoing, we find the
inode marked as an orphan and then perform a truncation (drop extents
beyond the inode's i_size). Because the processing of orphan inodes is
still done right after replaying the log and before the mount operation
finishes, the intention of that commit does not make any sense (at least
as of today).

However, reverting that behaviour is not enough: we cannot simply
discard all extents beyond i_size and then replay logged extents,
because we risk dropping extents beyond i_size created in past
transactions. For example:

  add prealloc extent beyond i_size
  fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
  transaction commit
  add another prealloc extent beyond i_size
  fsync - triggers the fast fsync path
  power failure

In that scenario, we would drop the first extent and then replay the
second one.
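The scenario above can be modelled in user space. The following is only
a rough sketch, not btrfs code: the toy_* types and functions are
hypothetical, an extent is reduced to a (start, len) pair, and the
offsets are made up for illustration. It assumes replay first drops
every extent at or beyond i_size and then re-adds whatever was logged,
and compares a log holding only the second prealloc extent with a log
holding all prealloc extents beyond i_size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define KiB 1024ULL
#define MAX_EXTENTS 8

/* Hypothetical toy types; these are not btrfs structures. */
struct toy_extent {
	unsigned long long start;
	unsigned long long len;
};

struct toy_inode {
	unsigned long long i_size;
	size_t nr;
	struct toy_extent ext[MAX_EXTENTS];
};

/* Drop every extent that starts at or beyond 'from'. */
static void toy_drop_extents(struct toy_inode *ino, unsigned long long from)
{
	size_t i, kept = 0;

	for (i = 0; i < ino->nr; i++)
		if (ino->ext[i].start < from)
			ino->ext[kept++] = ino->ext[i];
	ino->nr = kept;
}

/* Re-add the logged extents to the inode. */
static void toy_replay(struct toy_inode *ino, const struct toy_extent *log,
		       size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(ino->nr < MAX_EXTENTS);
		ino->ext[ino->nr++] = log[i];
	}
}

/*
 * Simulate the power failure scenario and return how many extents at or
 * beyond i_size survive the replay.
 */
size_t toy_prealloc_after_replay(bool log_all_prealloc)
{
	/* State as of the last transaction commit: data + first prealloc. */
	struct toy_inode ino = {
		.i_size = 256 * KiB,
		.nr = 2,
		.ext = { { 0, 256 * KiB }, { 256 * KiB, 1024 * KiB } },
	};
	const struct toy_extent first = { 256 * KiB, 1024 * KiB };
	const struct toy_extent second = { 2048 * KiB, 1024 * KiB };
	struct toy_extent log[2];
	size_t n = 0, i, past = 0;

	/* The fast fsync only sees the second extent unless we log all. */
	if (log_all_prealloc)
		log[n++] = first;
	log[n++] = second;

	toy_drop_extents(&ino, ino.i_size);
	toy_replay(&ino, log, n);

	for (i = 0; i < ino.nr; i++)
		if (ino.ext[i].start >= ino.i_size)
			past++;
	return past;
}
```

In this model, toy_prealloc_after_replay(false) leaves only the second
prealloc extent, while toy_prealloc_after_replay(true) keeps both.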

To fix this, just make sure that all prealloc extents beyond i_size are
logged, and if we find too many (which is far from a common case), fall
back to a full transaction commit (like we do when logging regular
extents in the fast fsync path).

Trivial reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
  $ sync
  $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
  $ xfs_io -c "fsync" /mnt/foo
  # power failure happens here

  # mount to replay log
  $ mount /dev/sdb /mnt
  # at this point the file only has one extent, at offset 0, size 256K

A test case for fstests follows soon, covering multiple scenarios that
involve adding prealloc extents with previous shrinking truncates and
without such truncates.

Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log")
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
---
 fs/btrfs/tree-log.c |   63 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 5 deletions(-)

--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2461,13 +2461,41 @@ static int replay_one_buffer(struct btrf
 			if (ret)
 				break;
 
-			/* for regular files, make sure corresponding
-			 * orphan item exist. extents past the new EOF
-			 * will be truncated later by orphan cleanup.
+			/*
+			 * Before replaying extents, truncate the inode to its
+			 * size. We need to do it now and not after log replay
+			 * because before an fsync we can have prealloc extents
+			 * added beyond the inode's i_size. If we did it after,
+			 * through orphan cleanup for example, we would drop
+			 * those prealloc extents just after replaying them.
 			 */
 			if (S_ISREG(mode)) {
-				ret = insert_orphan_item(wc->trans, root,
-							 key.objectid);
+				struct inode *inode;
+				u64 from;
+
+				inode = read_one_inode(root, key.objectid);
+				if (!inode) {
+					ret = -EIO;
+					break;
+				}
+				from = ALIGN(i_size_read(inode),
+					     root->fs_info->sectorsize);
+				ret = btrfs_drop_extents(wc->trans, root, inode,
+							 from, (u64)-1, 1);
+				/*
+				 * If the nlink count is zero here, the iput
+				 * will free the inode.  We bump it to make
+				 * sure it doesn't get freed until the link
+				 * count fixup is done.
+				 */
+				if (!ret) {
+					if (inode->i_nlink == 0)
+						inc_nlink(inode);
+					/* Update link count and nbytes. */
+					ret = btrfs_update_inode(wc->trans,
+								 root, inode);
+				}
+				iput(inode);
 				if (ret)
 					break;
 			}
@@ -4321,6 +4349,31 @@ static int btrfs_log_changed_extents(str
 		num++;
 	}
 
+	/*
+	 * Add all prealloc extents beyond the inode's i_size to make sure we
+	 * don't lose them after doing a fast fsync and replaying the log.
+	 */
+	if (inode->flags & BTRFS_INODE_PREALLOC) {
+		struct rb_node *node;
+
+		for (node = rb_last(&tree->map); node; node = rb_prev(node)) {
+			em = rb_entry(node, struct extent_map, rb_node);
+			if (em->start < i_size_read(&inode->vfs_inode))
+				break;
+			if (!list_empty(&em->list))
+				continue;
+			/* Same as above loop. */
+			if (++num > 32768) {
+				list_del_init(&tree->modified_extents);
+				ret = -EFBIG;
+				goto process;
+			}
+			refcount_inc(&em->refs);
+			set_bit(EXTENT_FLAG_LOGGING, &em->flags);
+			list_add_tail(&em->list, &extents);
+		}
+	}
+
 	list_sort(NULL, &extents, extent_cmp);
 	btrfs_get_logged_extents(inode, logged_list, logged_start, logged_end);
 	/*