From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Junxiao Bi, Joseph Qi, Mark Fasheh, Joel Becker, Changwei Ge, Gang He, Jun Piao, Andrew Morton, Linus Torvalds
Subject: [PATCH 5.13 011/104] ocfs2: issue zeroout to EOF blocks
Date: Mon, 2 Aug 2021 15:44:08 +0200
Message-Id: <20210802134344.389557585@linuxfoundation.org>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <20210802134344.028226640@linuxfoundation.org>
References: <20210802134344.028226640@linuxfoundation.org>
User-Agent: quilt/0.66
X-Mailing-List: linux-kernel@vger.kernel.org

From: Junxiao Bi

commit 9449ad33be8480f538b11a593e2dda2fb33ca06d upstream.

When punching a hole in EOF blocks, fallocate used buffered writes to
zero the EOF blocks in the last cluster.  But since ->writepage ignores
EOF pages, those zeros are never flushed.

This "looks" ok, because commit 6bba4471f0cc ("ocfs2: fix data corruption
by fallocate") zeroes the EOF blocks when the file size is extended, but
it isn't.  The problem lies in the EOF pages themselves.  Before
writeback, those pages have the DIRTY flag set, and so do all the
buffer_heads in them.  When writeback is run by write_cache_pages(), the
DIRTY flag on the page is cleared, but the DIRTY flag on the buffer_heads
is not.

When the next write hits those EOF pages, the buffer_heads already have
the DIRTY flag set, so the page is not marked DIRTY again.  Writeback
then ignores those pages forever, which causes data corruption.  Even a
direct I/O write does not help: it fails when trying to drop the page
cache before the direct I/O, because the buffer_heads for those pages
still have the DIRTY flag set, and it then falls back to buffered I/O
mode.

To summarize the issue: as writeback ignores EOF pages, once any EOF
page is generated, any write to it only goes to the page cache.  It is
never flushed to disk, even after the file size is extended and the page
is no longer an EOF page.  The fix is to avoid zeroing EOF blocks with
buffered writes.

The following syscall trace from qemu-img could trigger the corruption.

  656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
  ...
  660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680
  660   fallocate(11, 0, 2275868672, 327680) = 0
  658   pwrite64(11, "

Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Changwei Ge
Cc: Gang He
Cc: Jun Piao
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
---
 fs/ocfs2/file.c |   99 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 60 insertions(+), 39 deletions(-)

--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1529,6 +1529,45 @@ static void ocfs2_truncate_cluster_pages
 	}
 }
 
+/*
+ * zero out partial blocks of one cluster.
+ *
+ * start: file offset where zero starts, will be made upper block aligned.
+ * len: it will be trimmed to the end of current cluster if "start + len"
+ *      is bigger than it.
+ */
+static int ocfs2_zeroout_partial_cluster(struct inode *inode,
+					u64 start, u64 len)
+{
+	int ret;
+	u64 start_block, end_block, nr_blocks;
+	u64 p_block, offset;
+	u32 cluster, p_cluster, nr_clusters;
+	struct super_block *sb = inode->i_sb;
+	u64 end = ocfs2_align_bytes_to_clusters(sb, start);
+
+	if (start + len < end)
+		end = start + len;
+
+	start_block = ocfs2_blocks_for_bytes(sb, start);
+	end_block = ocfs2_blocks_for_bytes(sb, end);
+	nr_blocks = end_block - start_block;
+	if (!nr_blocks)
+		return 0;
+
+	cluster = ocfs2_bytes_to_clusters(sb, start);
+	ret = ocfs2_get_clusters(inode, cluster, &p_cluster,
+				&nr_clusters, NULL);
+	if (ret)
+		return ret;
+	if (!p_cluster)
+		return 0;
+
+	offset = start_block - ocfs2_clusters_to_blocks(sb, cluster);
+	p_block = ocfs2_clusters_to_blocks(sb, p_cluster) + offset;
+	return sb_issue_zeroout(sb, p_block, nr_blocks, GFP_NOFS);
+}
+
 static int ocfs2_zero_partial_clusters(struct inode *inode,
 				       u64 start, u64 len)
 {
@@ -1538,6 +1577,7 @@ static int ocfs2_zero_partial_clusters(s
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	unsigned int csize = osb->s_clustersize;
 	handle_t *handle;
+	loff_t isize = i_size_read(inode);
 
 	/*
 	 * The "start" and "end" values are NOT necessarily part of
@@ -1558,6 +1598,26 @@ static int ocfs2_zero_partial_clusters(s
 	if ((start & (csize - 1)) == 0 && (end & (csize - 1)) == 0)
 		goto out;
 
+	/* No page cache for EOF blocks, issue zero out to disk. */
+	if (end > isize) {
+		/*
+		 * zeroout eof blocks in last cluster starting from
+		 * "isize" even "start" > "isize" because it is
+		 * complicated to zeroout just at "start" as "start"
+		 * may be not aligned with block size, buffer write
+		 * would be required to do that, but out of eof buffer
+		 * write is not supported.
+		 */
+		ret = ocfs2_zeroout_partial_cluster(inode, isize,
+					end - isize);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+		if (start >= isize)
+			goto out;
+		end = isize;
+	}
 	handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
@@ -1856,45 +1916,6 @@ out:
 }
 
 /*
- * zero out partial blocks of one cluster.
- *
- * start: file offset where zero starts, will be made upper block aligned.
- * len: it will be trimmed to the end of current cluster if "start + len"
- *      is bigger than it.
- */
-static int ocfs2_zeroout_partial_cluster(struct inode *inode,
-					u64 start, u64 len)
-{
-	int ret;
-	u64 start_block, end_block, nr_blocks;
-	u64 p_block, offset;
-	u32 cluster, p_cluster, nr_clusters;
-	struct super_block *sb = inode->i_sb;
-	u64 end = ocfs2_align_bytes_to_clusters(sb, start);
-
-	if (start + len < end)
-		end = start + len;
-
-	start_block = ocfs2_blocks_for_bytes(sb, start);
-	end_block = ocfs2_blocks_for_bytes(sb, end);
-	nr_blocks = end_block - start_block;
-	if (!nr_blocks)
-		return 0;
-
-	cluster = ocfs2_bytes_to_clusters(sb, start);
-	ret = ocfs2_get_clusters(inode, cluster, &p_cluster,
-				&nr_clusters, NULL);
-	if (ret)
-		return ret;
-	if (!p_cluster)
-		return 0;
-
-	offset = start_block - ocfs2_clusters_to_blocks(sb, cluster);
-	p_block = ocfs2_clusters_to_blocks(sb, p_cluster) + offset;
-	return sb_issue_zeroout(sb, p_block, nr_blocks, GFP_NOFS);
-}
-
-/*
  * Parts of this function taken from xfs_change_file_space()
  */
 static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
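
For readers following the block arithmetic in
ocfs2_zeroout_partial_cluster(), the range it zeroes can be sketched in
userspace C.  This is only an illustration, not kernel code:
partial_cluster_range(), struct zero_range and the two helpers are
hypothetical names, and the sketch assumes that block and cluster sizes
are powers of two passed as bit shifts, with the rounding done by the
ocfs2 helpers approximated by plain shifts and masks.  The physical
cluster lookup and the actual zeroout call are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: first block to zero and how many blocks. */
struct zero_range {
	uint64_t start_block;
	uint64_t nr_blocks;
};

/* Round a byte offset up to a whole number of blocks. */
static uint64_t blocks_for_bytes(uint64_t bytes, unsigned int block_bits)
{
	return (bytes + (1ULL << block_bits) - 1) >> block_bits;
}

/* Round a byte offset up to the next cluster boundary. */
static uint64_t align_bytes_to_clusters(uint64_t bytes,
					unsigned int cluster_bits)
{
	uint64_t csize = 1ULL << cluster_bits;

	return (bytes + csize - 1) & ~(csize - 1);
}

/*
 * Mirror of the range computation in the patch: "start" is rounded up
 * to a block boundary, and the range is trimmed to the end of the
 * cluster that contains "start".
 */
static struct zero_range partial_cluster_range(uint64_t start, uint64_t len,
					       unsigned int block_bits,
					       unsigned int cluster_bits)
{
	struct zero_range r;
	uint64_t end;

	/* A cluster is assumed to be at least one block in size. */
	assert(block_bits <= cluster_bits);

	end = align_bytes_to_clusters(start, cluster_bits);
	if (start + len < end)
		end = start + len;

	r.start_block = blocks_for_bytes(start, block_bits);
	r.nr_blocks = blocks_for_bytes(end, block_bits) - r.start_block;
	return r;
}
```

With 4 KiB blocks and 1 MiB clusters (shifts of 12 and 20),
partial_cluster_range(5000, 1ULL << 30, 12, 20) gives start_block 2 and
nr_blocks 254: the partially used block containing offset 5000 is
skipped, and zeroing stops at the end of the first cluster.  A "start"
that is already cluster aligned gives nr_blocks 0, which corresponds to
the early "return 0" in the patch.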