Changes since v1:
 - In iomap_truncate_page() and dax_truncate_page(), use do_div() to
   calculate the offset when the truncate blocksize is not a power of 2.
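The offset calculation described in this changelog entry can be sketched in userspace as follows. This is only an illustration under stated assumptions, not the kernel code: block_offset() is a hypothetical helper, and a plain `%` stands in for the kernel's do_div(), which divides its 64-bit dividend in place and returns the remainder (so a caller that still needs the original position has to keep a copy of it).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): byte offset of pos
 * within its block, for any non-zero block size.
 */
static unsigned int block_offset(uint64_t pos, unsigned int blocksize)
{
	assert(blocksize != 0);

	/* Power of 2: masking off the low bits is sufficient. */
	if ((blocksize & (blocksize - 1)) == 0)
		return (unsigned int)(pos & (blocksize - 1));

	/* Otherwise a real modulo is needed (do_div() in the kernel). */
	return (unsigned int)(pos % blocksize);
}
```

For example, block_offset(30000, 12288) yields 5424, whereas masking with 12287 would incorrectly give 9520.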
This series fixes a stale data exposure issue reported by Chandan when
running fstests generic/561 on XFS with a realtime device [1]. The real
problem is that xfs_setattr_size() doesn't zero out a large enough range
when truncating a realtime inode; please see the third patch or [1] for
details.

The first two patches modify iomap_truncate_page() and
dax_truncate_page() to take a block size supplied by the filesystem, and
drop the assumption that it is always i_blocksize(), as Dave suggested.
The third patch fixes the issue by modifying xfs_truncate_page() to pass
the correct block size and to make sure the zeroed range has been
flushed to disk before i_size is updated.
[1] https://lore.kernel.org/linux-xfs/87ttj8ircu.fsf@debian-BULLSEYE-live-builder-AMD64/
Thanks,
Yi.
Zhang Yi (3):
iomap: pass blocksize to iomap_truncate_page()
fsdax: pass blocksize to dax_truncate_page()
xfs: correct the zeroing truncate range
fs/dax.c | 11 ++++++-----
fs/ext2/inode.c | 4 ++--
fs/iomap/buffered-io.c | 11 ++++++-----
fs/xfs/xfs_iomap.c | 36 ++++++++++++++++++++++++++++++++----
fs/xfs/xfs_iops.c | 10 ----------
include/linux/dax.h | 4 ++--
include/linux/iomap.h | 4 ++--
7 files changed, 50 insertions(+), 30 deletions(-)
--
2.39.2
From: Zhang Yi <[email protected]>
When truncating a realtime file to a shorter, unaligned size,
xfs_setattr_size() only flushes the EOF page before zeroing, and
xfs_truncate_page() only zeroes the EOF block. This could expose
stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
a write operation").

Suppose sb_rextsize is bigger than one block and a realtime inode
contains a long enough written extent. If we truncate to an unaligned
offset in the middle of this extent, xfs_itruncate_extents() could split
the extent and align its tail to sb_rextsize, so there may be more than
one block between the end of the file and the end of the extent. Since
xfs_truncate_page() only zeroes the trailing portion of an
i_blocksize()-sized block, it may leave behind blocks containing stale
data that could be exposed if we later append-write over a long enough
distance.

xfs_truncate_page() should flush and zero out the entire rtextsize range,
and make sure the entire zeroed range has been flushed to disk before
updating the inode size.
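To make the gap concrete, here is a small userspace sketch of the arithmetic. The helpers zero_len() and stale_len() are hypothetical names introduced only for illustration (they are not kernel functions), and the sketch assumes that the realtime extent size in bytes is a multiple of the block size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes from pos up to the next
 * multiple of unit, or 0 if pos is already aligned.
 */
static uint64_t zero_len(uint64_t pos, uint64_t unit)
{
	uint64_t off;

	assert(unit != 0);
	off = pos % unit;
	return off ? unit - off : 0;
}

/*
 * Hypothetical helper: bytes past the new EOF that stay unzeroed when
 * only the EOF block is zeroed but the extent tail is aligned to
 * rtextsize. Assumes rtextsize is a multiple of blocksize.
 */
static uint64_t stale_len(uint64_t pos, uint64_t blocksize,
			  uint64_t rtextsize)
{
	return zero_len(pos, rtextsize) - zero_len(pos, blocksize);
}
```

With assumed values of 4096-byte blocks and an sb_rextsize of 4 blocks (16384 bytes), truncating to byte 5000 zeroes 3192 bytes if only the EOF block is handled, while the rtextsize-aligned tail extends 11384 bytes past the new EOF. That leaves 8192 bytes (two blocks) that could still hold stale data.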
Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
Reported-by: Chandan Babu R <[email protected]>
Link: https://lore.kernel.org/linux-xfs/[email protected]
Signed-off-by: Zhang Yi <[email protected]>
---
fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
fs/xfs/xfs_iops.c | 10 ----------
2 files changed, 31 insertions(+), 14 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 4958cc3337bc..fc379450fe74 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1466,12 +1466,39 @@ xfs_truncate_page(
loff_t pos,
bool *did_zero)
{
+ struct xfs_mount *mp = ip->i_mount;
struct inode *inode = VFS_I(ip);
unsigned int blocksize = i_blocksize(inode);
+ int error;
+
+ if (XFS_IS_REALTIME_INODE(ip))
+ blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
+
+ /*
+ * iomap won't detect a dirty page over an unwritten block (or a
+ * cow block over a hole) and subsequently skips zeroing the
+ * newly post-EOF portion of the page. Flush the new EOF to
+ * convert the block before the pagecache truncate.
+ */
+ error = filemap_write_and_wait_range(inode->i_mapping, pos,
+ roundup_64(pos, blocksize));
+ if (error)
+ return error;
if (IS_DAX(inode))
- return dax_truncate_page(inode, pos, blocksize, did_zero,
- &xfs_dax_write_iomap_ops);
- return iomap_truncate_page(inode, pos, blocksize, did_zero,
- &xfs_buffered_write_iomap_ops);
+ error = dax_truncate_page(inode, pos, blocksize, did_zero,
+ &xfs_dax_write_iomap_ops);
+ else
+ error = iomap_truncate_page(inode, pos, blocksize, did_zero,
+ &xfs_buffered_write_iomap_ops);
+ if (error)
+ return error;
+
+ /*
+ * Write back path won't write dirty blocks post EOF folio,
+ * flush the entire zeroed range before updating the inode
+ * size.
+ */
+ return filemap_write_and_wait_range(inode->i_mapping, pos,
+ roundup_64(pos, blocksize));
}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 66f8c47642e8..baeeddf4a6bb 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -845,16 +845,6 @@ xfs_setattr_size(
error = xfs_zero_range(ip, oldsize, newsize - oldsize,
&did_zeroing);
} else {
- /*
- * iomap won't detect a dirty page over an unwritten block (or a
- * cow block over a hole) and subsequently skips zeroing the
- * newly post-EOF portion of the page. Flush the new EOF to
- * convert the block before the pagecache truncate.
- */
- error = filemap_write_and_wait_range(inode->i_mapping, newsize,
- newsize);
- if (error)
- return error;
error = xfs_truncate_page(ip, newsize, &did_zeroing);
}
--
2.39.2
From: Zhang Yi <[email protected]>
dax_truncate_page() always assumes that the block size of the inode
being truncated is i_blocksize(). This is not always true for some
filesystems, e.g. XFS does extent size alignment for realtime inodes.
Drop this assumption and pass the block size for zeroing into
dax_truncate_page(), allowing filesystems to indicate the correct block
size.
Suggested-by: Dave Chinner <[email protected]>
Signed-off-by: Zhang Yi <[email protected]>
---
fs/dax.c | 11 ++++++-----
fs/ext2/inode.c | 4 ++--
fs/xfs/xfs_iomap.c | 2 +-
include/linux/dax.h | 4 ++--
4 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 423fc1607dfa..f672c9a663c1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1403,16 +1403,17 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
}
EXPORT_SYMBOL_GPL(dax_zero_range);
-int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
- const struct iomap_ops *ops)
+int dax_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+ bool *did_zero, const struct iomap_ops *ops)
{
- unsigned int blocksize = i_blocksize(inode);
- unsigned int off = pos & (blocksize - 1);
+ loff_t start = pos;
+ unsigned int off = is_power_of_2(blocksize) ? (pos & (blocksize - 1)) :
+ do_div(pos, blocksize);
/* Block boundary? Nothing to do */
if (!off)
return 0;
- return dax_zero_range(inode, pos, blocksize - off, did_zero, ops);
+ return dax_zero_range(inode, start, blocksize - off, did_zero, ops);
}
EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f3d570a9302b..fbbd479f3c4e 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1278,8 +1278,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
inode_dio_wait(inode);
if (IS_DAX(inode))
- error = dax_truncate_page(inode, newsize, NULL,
- &ext2_iomap_ops);
+ error = dax_truncate_page(inode, newsize, i_blocksize(inode),
+ NULL, &ext2_iomap_ops);
else
error = block_truncate_page(inode->i_mapping,
newsize, ext2_get_block);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 31ac07bb8425..4958cc3337bc 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1470,7 +1470,7 @@ xfs_truncate_page(
unsigned int blocksize = i_blocksize(inode);
if (IS_DAX(inode))
- return dax_truncate_page(inode, pos, did_zero,
+ return dax_truncate_page(inode, pos, blocksize, did_zero,
&xfs_dax_write_iomap_ops);
return iomap_truncate_page(inode, pos, blocksize, did_zero,
&xfs_buffered_write_iomap_ops);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9d3e3327af4c..4aa8ef7c8fd4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -210,8 +210,8 @@ int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
const struct iomap_ops *ops);
int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
const struct iomap_ops *ops);
-int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
- const struct iomap_ops *ops);
+int dax_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+ bool *did_zero, const struct iomap_ops *ops);
#if IS_ENABLED(CONFIG_DAX)
int dax_read_lock(void);
--
2.39.2