These ilock patches help improve the current inode lock scalability
problem in the ext4 DIO mixed read/write workload case. The problem was first
reported by Joseph [1]. These patches are based upon upstream discussion
with Jan Kara & Joseph [2].
The problem is that in case of DIO overwrites, we start with
an exclusive lock and then downgrade it later to a shared lock. This causes a
scalability problem in the mixed DIO read/write workload case:
if there are any ongoing DIO reads and then a DIO write comes in
(since writes start with the exclusive inode lock), the write has to wait
until the shared lock is released (which only happens when the DIO reads
complete). The same is true vice versa.
This can be easily observed with a perf-tools trace analysis [3].
This patch series (patch 4) helps fix that situation even without the
dioread_nolock mount option. This is in line with the discussions with Jan [4].
More details are mentioned in the commit messages of patches 3 & 4.
These patches are based on top of Ted's ext4 master tree.
Patch description
=================
Patch-1: Fixes the ext4_dax_read/write inode locking sequence for IOCB_NOWAIT.
Patch-2: Introduces the ext4_ilock/iunlock APIs for use in the next patches.
Mainly wrapper functions for inode_lock/unlock.
Patch-3: Starts with a shared iolock in case of DIO instead of an exclusive
iolock. This fixes the reported scalability problem, but only with the
dioread_nolock mount option.
Patch-4: Does away with the dioread_nolock mount option condition
for shared locking. We still take the exclusive lock for data=journal mode,
non-extent mode, or non-regular files. The commit message describes in
detail why we don't need the exclusive lock even without dioread_nolock.
Git tree
========
https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC-v3
Testing
=======
Completed xfstests -g auto with default mkfs & mount options.
No new failures; the only failures seen are known ones that also occur
without these patches.
Performance results
===================
Collected some performance numbers for a DIO sync mixed random read/write
workload w.r.t. the number of threads (ext4) to check for scalability.
The performance histograms shown below are the percentage change in
performance with this ilock patchset as compared to a vanilla kernel.
FIO command:
fio -name=DIO-mixed-randrw -filename=./testfile -direct=1 -iodepth=1 -thread \
-rw=randrw -ioengine=psync -bs=$bs -size=10G -numjobs=$thread \
-group_reporting=1 -runtime=120
Used the fioperf tool [5] for collecting these performance scores.
Below is the performance benefit histogram (%) with this ilock patchset
w.r.t. a vanilla kernel for the mixed randrw workload (4K block size).
Notice that the percentage benefit increases with an increasing number of
threads, so this patchset helps achieve good scalability in the mentioned
workload. It also gives up to ~140% performance improvement in the 24-thread
mixed randrw workload with 4K block size.
The performance difference can be even higher with high speed storage
devices, since bandwidth without the patches seems to flatten due to the lock
contention problem with multiple threads.
[Absolute perf delta can be seen at [6]]
Performance benefit (%) data randrw (read)-4K
(default mount options)
160 +-+------+-------+--------+--------+-------+--------+-------+------+-+
| + + + + + + + |
140 +-+ ** +-+
| ** ** |
120 +-+ ** ** ** +-+
| ** ** ** |
100 +-+ ** ** ** ** +-+
| ** ** ** ** |
80 +-+ ** ** ** ** +-+
| ** ** ** ** |
| ** ** ** ** ** |
60 +-+ ** ** ** ** ** +-+
| ** ** ** ** ** |
40 +-+ ** ** ** ** ** +-+
| ** ** ** ** ** ** |
20 +-+ ** ** ** ** ** ** +-+
| ** ** ** ** ** ** |
0 +-+ ** ** ** ** ** ** ** +-+
| + + + + + + + |
-20 +-+------+-------+--------+--------+-------+--------+-------+------+-+
1 2 4 8 12 16 24
Threads
Performance benefit (%) data randrw (write)-4K
(default mount options)
160 +-+------+-------+--------+--------+-------+--------+-------+------+-+
| + + + + + + + |
140 +-+ ** +-+
| ** ** |
120 +-+ ** ** ** +-+
| ** ** ** |
100 +-+ ** ** ** ** +-+
| ** ** ** ** |
80 +-+ ** ** ** ** +-+
| ** ** ** ** |
| ** ** ** ** ** |
60 +-+ ** ** ** ** ** +-+
| ** ** ** ** ** |
40 +-+ ** ** ** ** ** +-+
| ** ** ** ** ** ** |
20 +-+ ** ** ** ** ** ** +-+
| ** ** ** ** ** ** |
0 +-+ ** ** ** ** ** ** ** +-+
| + + + + + + + |
-20 +-+------+-------+--------+--------+-------+--------+-------+------+-+
1 2 4 8 12 16 24
Threads
Previous version
================
v2: https://www.spinics.net/lists/kernel/msg3262531.html
v1: https://patchwork.ozlabs.org/cover/1163286/
References
==========
[1]: https://lore.kernel.org/linux-ext4/[email protected]/
[2]: https://lore.kernel.org/linux-ext4/[email protected]/
[3]: https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/perf.report
[4]: https://patchwork.ozlabs.org/cover/1163286/
[5]: https://github.com/riteshharjani/fioperf
[6]: https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/diff_ilock_v3_default_dio_randrw_4K.txt
-ritesh
Ritesh Harjani (4):
ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT
ext4: Add ext4_ilock & ext4_iunlock API
ext4: start with shared iolock in case of DIO instead of excl. iolock
ext4: Move to shared iolock even without dioread_nolock mount opt
fs/ext4/ext4.h | 33 ++++++
fs/ext4/extents.c | 16 +--
fs/ext4/file.c | 252 +++++++++++++++++++++++++++++++++-------------
fs/ext4/inode.c | 4 +-
fs/ext4/ioctl.c | 16 +--
fs/ext4/super.c | 12 +--
fs/ext4/xattr.c | 17 ++--
7 files changed, 246 insertions(+), 104 deletions(-)
--
2.21.0
Earlier there was no shared lock in the DIO read path.
Commit 16c54688592ce ("ext4: Allow parallel DIO reads")
simplified some of the locking mechanism while still allowing
parallel DIO reads by taking a shared lock in the inode DIO
read path.
But this created a problem with mixed read/write workloads,
because in the DIO write path we first start with an
exclusive lock and only downgrade it once we determine that the
IO is an overwrite, while with the above commit DIO reads
hold the lock in shared mode.
So this patch fixes the issue by starting with a shared lock
and then switching to an exclusive lock only when required,
based on ext4_dio_write_checks().
Other than that, it also simplifies the below cases:
1. Renamed and simplified ext4_unaligned_aio() to ext4_unaligned_io().
The previous API was abused in the sense that it was not really checking
for AIO anywhere, and it also used to check for extending writes.
The new ext4_unaligned_io() only checks whether the IO is really unaligned.
In case of unaligned direct IO, iomap_dio_rw() needs to do
zeroing of partial blocks, and that requires serialization
against other direct IOs in the same block. So we take an exclusive iolock
for any unaligned DIO.
In case of AIO we also need to wait for any outstanding IOs to
complete, so that the conversion from unwritten to written extents is done
before anyone tries to map the overlapping block. Hence we take the
exclusive iolock and also call inode_dio_wait() in the unaligned DIO case.
Note that since we anyway take an exclusive lock for unaligned IO,
inode_dio_wait() becomes a no-op in the non-AIO DIO case.
2. Added ext4_extending_io(), which checks whether the IO extends the file.
3. Added ext4_dio_write_checks().
In this we start with a shared iolock and only switch to an exclusive iolock
if required. So in most cases aligned, non-extending, dioread_nolock
overwrites are done under a shared lock.
If not, then we restart the operation in ext4_dio_write_checks()
after acquiring the exclusive iolock.
Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 191 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 142 insertions(+), 49 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ebe3f051598d..18cbf9fa52c6 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -167,19 +167,25 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
* threads are at work on the same unwritten block, they must be synchronized
* or one thread will zero the other's data, causing corruption.
*/
-static int
-ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
+static bool
+ext4_unaligned_io(struct inode *inode, struct iov_iter *from, loff_t pos)
{
struct super_block *sb = inode->i_sb;
- int blockmask = sb->s_blocksize - 1;
-
- if (pos >= ALIGN(i_size_read(inode), sb->s_blocksize))
- return 0;
+ unsigned long blockmask = sb->s_blocksize - 1;
if ((pos | iov_iter_alignment(from)) & blockmask)
- return 1;
+ return true;
- return 0;
+ return false;
+}
+
+static bool
+ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
+{
+ if (offset + len > i_size_read(inode) ||
+ offset + len > EXT4_I(inode)->i_disksize)
+ return true;
+ return false;
}
/* Is IO overwriting allocated and initialized blocks? */
@@ -205,7 +211,8 @@ static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
}
-static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
+ struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
@@ -229,11 +236,22 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
}
+ return iov_iter_count(from);
+}
+
+static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+ ssize_t ret, count;
+
+ count = ext4_generic_write_checks(iocb, from);
+ if (count <= 0)
+ return count;
+
ret = file_modified(iocb->ki_filp);
if (ret)
return ret;
- return iov_iter_count(from);
+ return count;
}
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
@@ -365,15 +383,110 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
.end_io = ext4_dio_write_end_io,
};
+/*
+ * The intention here is to start with shared lock acquired then see if any
+ * condition requires an exclusive inode lock. If yes, then we restart the
+ * whole operation by releasing the shared lock and acquiring exclusive lock.
+ *
+ * - For unaligned_io we never take shared lock as it may cause data corruption
+ * when two unaligned IO tries to modify the same block e.g. while zeroing.
+ *
+ * - For extending writes case we don't take the shared lock, since it requires
+ * updating inode i_disksize and/or orphan handling with exclusive lock.
+ *
+ * - shared locking will only be true mostly in case of overwrites with
+ * dioread_nolock mode. Otherwise we will switch to excl. iolock mode.
+ */
+static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
+ unsigned int *iolock, bool *unaligned_io,
+ bool *extend)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file_inode(file);
+ loff_t offset = iocb->ki_pos;
+ loff_t final_size;
+ size_t count;
+ ssize_t ret;
+
+restart:
+ /* Fallback to buffered I/O if the inode does not support direct I/O. */
+ if (!ext4_dio_supported(inode)) {
+ ext4_iunlock(inode, *iolock);
+ return ext4_buffered_write_iter(iocb, from);
+ }
+
+ ret = ext4_generic_write_checks(iocb, from);
+ if (ret <= 0) {
+ ext4_iunlock(inode, *iolock);
+ return ret;
+ }
+
+ /* Recalculate since offset & count may change above. */
+ offset = iocb->ki_pos;
+ count = iov_iter_count(from);
+ final_size = offset + count;
+
+ if (ext4_unaligned_io(inode, from, offset))
+ *unaligned_io = true;
+
+ if (ext4_extending_io(inode, offset, count))
+ *extend = true;
+ /*
+ * Determine whether the IO operation will overwrite allocated
+ * and initialized blocks. If so, check to see whether it is
+ * possible to take the dioread_nolock path.
+ *
+ * We need exclusive i_rwsem for changing security info
+ * in file_modified().
+ */
+ if (*iolock == EXT4_IOLOCK_SHARED &&
+ (!IS_NOSEC(inode) || *unaligned_io || *extend ||
+ !ext4_should_dioread_nolock(inode) ||
+ !ext4_overwrite_io(inode, offset, count))) {
+ ext4_iunlock(inode, *iolock);
+ *iolock = EXT4_IOLOCK_EXCL;
+ ext4_ilock(inode, *iolock);
+ goto restart;
+ }
+
+ ret = file_modified(file);
+ if (ret < 0) {
+ ext4_iunlock(inode, *iolock);
+ return ret;
+ }
+
+ return count;
+}
+
static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
ssize_t ret;
- size_t count;
- loff_t offset;
handle_t *handle;
struct inode *inode = file_inode(iocb->ki_filp);
- bool extend = false, overwrite = false, unaligned_aio = false;
- unsigned int iolock = EXT4_IOLOCK_EXCL;
+ loff_t offset = iocb->ki_pos;
+ size_t count = iov_iter_count(from);
+ bool extend = false, unaligned_io = false;
+ unsigned int iolock = EXT4_IOLOCK_SHARED;
+
+ /*
+ * We initially start with shared inode lock
+ * unless it is unaligned IO which needs
+ * exclusive lock anyways.
+ */
+ if (ext4_unaligned_io(inode, from, offset)) {
+ unaligned_io = true;
+ iolock = EXT4_IOLOCK_EXCL;
+ }
+ /*
+ * Extending writes need exclusive lock.
+ */
+ if (ext4_extending_io(inode, offset, count)) {
+ extend = true;
+ iolock = EXT4_IOLOCK_EXCL;
+ }
+
+ if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
+ iolock = EXT4_IOLOCK_EXCL;
if (iocb->ki_flags & IOCB_NOWAIT) {
if (!ext4_ilock_nowait(inode, iolock))
@@ -382,47 +495,28 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
ext4_ilock(inode, iolock);
}
- if (!ext4_dio_supported(inode)) {
- ext4_iunlock(inode, iolock);
- /*
- * Fallback to buffered I/O if the inode does not support
- * direct I/O.
- */
- return ext4_buffered_write_iter(iocb, from);
- }
-
- ret = ext4_write_checks(iocb, from);
- if (ret <= 0) {
- ext4_iunlock(inode, iolock);
+ ret = ext4_dio_write_checks(iocb, from, &iolock, &unaligned_io,
+ &extend);
+ if (ret <= 0)
return ret;
- }
- /*
- * Unaligned asynchronous direct I/O must be serialized among each
- * other as the zeroing of partial blocks of two competing unaligned
- * asynchronous direct I/O writes can result in data corruption.
- */
offset = iocb->ki_pos;
count = iov_iter_count(from);
- if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
- !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
- unaligned_aio = true;
- inode_dio_wait(inode);
- }
/*
- * Determine whether the I/O will overwrite allocated and initialized
- * blocks. If so, check to see whether it is possible to take the
- * dioread_nolock path.
+ * Unaligned direct IO must be serialized among each other as zeroing
+ * of partial blocks of two competing unaligned IOs can result in data
+ * corruption.
+ *
+ * So we make sure we don't allow any unaligned IO in flight.
+ * For IOs where we need not wait (like unaligned non-AIO DIO),
+ * below inode_dio_wait() may anyway become a no-op, since we start
+ * with exclusive lock.
*/
- if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
- ext4_should_dioread_nolock(inode)) {
- overwrite = true;
- ext4_ilock_demote(inode, iolock);
- iolock = EXT4_IOLOCK_SHARED;
- }
+ if (unaligned_io)
+ inode_dio_wait(inode);
- if (offset + count > EXT4_I(inode)->i_disksize) {
+ if (extend) {
handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
@@ -435,12 +529,11 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
goto out;
}
- extend = true;
ext4_journal_stop(handle);
}
ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
- is_sync_kiocb(iocb) || unaligned_aio || extend);
+ is_sync_kiocb(iocb) || unaligned_io || extend);
if (extend)
ret = ext4_handle_inode_extension(inode, offset, ret, count);
--
2.21.0
We were using shared locking only with the dioread_nolock
mount option in case of DIO overwrites. This mount option condition
is not needed anymore with the current code, since:
1. There is no race between buffered writes & DIO overwrites,
since buffered writes take the exclusive lock & DIO overwrites
take the shared lock. Also the DIO path makes sure
to flush and wait for any dirty page cache data.
2. There is no race between buffered reads & DIO overwrites, since
no block allocation is possible with DIO overwrites.
So no stale data exposure should happen. The same is the case
between DIO reads & DIO overwrites.
3. Other paths like truncate are also protected,
since we wait there for any DIO in flight to be over.
4. In case of buffered writes followed by DIO reads:
here too we take the exclusive lock in ext4_write_begin/end(),
so there is no risk of exposing any stale data in this case,
since after ext4_write_end() iomap_dio_rw() will flush &
wait for any dirty page cache data.
Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 18cbf9fa52c6..b97efc89cd63 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -383,6 +383,17 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
.end_io = ext4_dio_write_end_io,
};
+static bool ext4_dio_should_shared_lock(struct inode *inode)
+{
+ if (!S_ISREG(inode->i_mode))
+ return false;
+ if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+ return false;
+ if (ext4_should_journal_data(inode))
+ return false;
+ return true;
+}
+
/*
* The intention here is to start with shared lock acquired then see if any
* condition requires an exclusive inode lock. If yes, then we restart the
@@ -394,8 +405,8 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
* - For extending writes case we don't take the shared lock, since it requires
* updating inode i_disksize and/or orphan handling with exclusive lock.
*
- * - shared locking will only be true mostly in case of overwrites with
- * dioread_nolock mode. Otherwise we will switch to excl. iolock mode.
+ * - shared locking will only be true mostly in case of overwrites.
+ * Otherwise we will switch to excl. iolock mode.
*/
static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
unsigned int *iolock, bool *unaligned_io,
@@ -433,15 +444,14 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
*extend = true;
/*
* Determine whether the IO operation will overwrite allocated
- * and initialized blocks. If so, check to see whether it is
- * possible to take the dioread_nolock path.
+ * and initialized blocks.
*
* We need exclusive i_rwsem for changing security info
* in file_modified().
*/
if (*iolock == EXT4_IOLOCK_SHARED &&
(!IS_NOSEC(inode) || *unaligned_io || *extend ||
- !ext4_should_dioread_nolock(inode) ||
+ !ext4_dio_should_shared_lock(inode) ||
!ext4_overwrite_io(inode, offset, count))) {
ext4_iunlock(inode, *iolock);
*iolock = EXT4_IOLOCK_EXCL;
@@ -485,7 +495,10 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
iolock = EXT4_IOLOCK_EXCL;
}
- if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
+ /*
+ * Check if we should continue with shared iolock
+ */
+ if (iolock == EXT4_IOLOCK_SHARED && !ext4_dio_should_shared_lock(inode))
iolock = EXT4_IOLOCK_EXCL;
if (iocb->ki_flags & IOCB_NOWAIT) {
--
2.21.0
This adds the ext4_ilock/iunlock family of APIs.
These are preparation APIs to make shared
locking/unlocking & restarting with exclusive
locking/unlocking easier in the next patch.
Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/ext4.h | 33 ++++++++++++++++++++++++++++++
fs/ext4/extents.c | 16 +++++++--------
fs/ext4/file.c | 52 +++++++++++++++++++++++------------------------
fs/ext4/inode.c | 4 ++--
fs/ext4/ioctl.c | 16 +++++++--------
fs/ext4/super.c | 12 +++++------
fs/ext4/xattr.c | 17 ++++++++--------
7 files changed, 92 insertions(+), 58 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 61987c106511..b4169a92e8d0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2960,6 +2960,39 @@ do { \
#define EXT4_FREECLUSTERS_WATERMARK 0
#endif
+#define EXT4_IOLOCK_EXCL (1 << 0)
+#define EXT4_IOLOCK_SHARED (1 << 1)
+
+static inline void ext4_ilock(struct inode *inode, unsigned int iolock)
+{
+ if (iolock == EXT4_IOLOCK_EXCL)
+ inode_lock(inode);
+ else
+ inode_lock_shared(inode);
+}
+
+static inline void ext4_iunlock(struct inode *inode, unsigned int iolock)
+{
+ if (iolock == EXT4_IOLOCK_EXCL)
+ inode_unlock(inode);
+ else
+ inode_unlock_shared(inode);
+}
+
+static inline int ext4_ilock_nowait(struct inode *inode, unsigned int iolock)
+{
+ if (iolock == EXT4_IOLOCK_EXCL)
+ return inode_trylock(inode);
+ else
+ return inode_trylock_shared(inode);
+}
+
+static inline void ext4_ilock_demote(struct inode *inode, unsigned int iolock)
+{
+ BUG_ON(iolock != EXT4_IOLOCK_EXCL);
+ downgrade_write(&inode->i_rwsem);
+}
+
/* Update i_disksize. Requires i_mutex to avoid races with truncate */
static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
{
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0e8708b77da6..08dd57558533 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4754,7 +4754,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
else
max_blocks -= lblk;
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
/*
* Indirect files do not support unwritten extnets
@@ -4864,7 +4864,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
ext4_journal_stop(handle);
out_mutex:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
return ret;
}
@@ -4930,7 +4930,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (mode & FALLOC_FL_KEEP_SIZE)
flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
/*
* We only support preallocation for extent-based files only
@@ -4961,7 +4961,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
EXT4_I(inode)->i_sync_tid);
}
out:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
return ret;
}
@@ -5509,7 +5509,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
return ret;
}
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
/*
* There is no need to overlap collapse range with EOF, in which case
* it is effectively a truncate operation
@@ -5608,7 +5608,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
out_mmap:
up_write(&EXT4_I(inode)->i_mmap_sem);
out_mutex:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
return ret;
}
@@ -5659,7 +5659,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
return ret;
}
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
/* Currently just for extent based files */
if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
ret = -EOPNOTSUPP;
@@ -5786,7 +5786,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
out_mmap:
up_write(&EXT4_I(inode)->i_mmap_sem);
out_mutex:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
return ret;
}
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 977ac58dc718..ebe3f051598d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -55,14 +55,14 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
struct inode *inode = file_inode(iocb->ki_filp);
if (iocb->ki_flags & IOCB_NOWAIT) {
- if (!inode_trylock_shared(inode))
+ if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
return -EAGAIN;
} else {
- inode_lock_shared(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_SHARED);
}
if (!ext4_dio_supported(inode)) {
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
/*
* Fallback to buffered I/O if the operation being performed on
* the inode is not supported by direct I/O. The IOCB_DIRECT
@@ -76,7 +76,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
is_sync_kiocb(iocb));
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
file_accessed(iocb->ki_filp);
return ret;
@@ -89,22 +89,23 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
ssize_t ret;
if (iocb->ki_flags & IOCB_NOWAIT) {
- if (!inode_trylock_shared(inode))
+ if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
return -EAGAIN;
} else {
- inode_lock_shared(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_SHARED);
}
+
/*
* Recheck under inode lock - at this point we are sure it cannot
* change anymore
*/
if (!IS_DAX(inode)) {
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
/* Fallback to buffered IO in case we cannot support DAX */
return generic_file_read_iter(iocb, to);
}
ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
file_accessed(iocb->ki_filp);
return ret;
@@ -244,7 +245,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
if (iocb->ki_flags & IOCB_NOWAIT)
return -EOPNOTSUPP;
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
@@ -254,7 +255,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
current->backing_dev_info = NULL;
out:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
if (likely(ret > 0)) {
iocb->ki_pos += ret;
ret = generic_write_sync(iocb, ret);
@@ -372,16 +373,17 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
handle_t *handle;
struct inode *inode = file_inode(iocb->ki_filp);
bool extend = false, overwrite = false, unaligned_aio = false;
+ unsigned int iolock = EXT4_IOLOCK_EXCL;
if (iocb->ki_flags & IOCB_NOWAIT) {
- if (!inode_trylock(inode))
+ if (!ext4_ilock_nowait(inode, iolock))
return -EAGAIN;
} else {
- inode_lock(inode);
+ ext4_ilock(inode, iolock);
}
if (!ext4_dio_supported(inode)) {
- inode_unlock(inode);
+ ext4_iunlock(inode, iolock);
/*
* Fallback to buffered I/O if the inode does not support
* direct I/O.
@@ -391,7 +393,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
ret = ext4_write_checks(iocb, from);
if (ret <= 0) {
- inode_unlock(inode);
+ ext4_iunlock(inode, iolock);
return ret;
}
@@ -416,7 +418,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
ext4_should_dioread_nolock(inode)) {
overwrite = true;
- downgrade_write(&inode->i_rwsem);
+ ext4_ilock_demote(inode, iolock);
+ iolock = EXT4_IOLOCK_SHARED;
}
if (offset + count > EXT4_I(inode)->i_disksize) {
@@ -443,10 +446,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
ret = ext4_handle_inode_extension(inode, offset, ret, count);
out:
- if (overwrite)
- inode_unlock_shared(inode);
- else
- inode_unlock(inode);
+ ext4_iunlock(inode, iolock);
if (ret >= 0 && iov_iter_count(from)) {
ssize_t err;
@@ -489,10 +489,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct inode *inode = file_inode(iocb->ki_filp);
if (iocb->ki_flags & IOCB_NOWAIT) {
- if (!inode_trylock(inode))
+ if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_EXCL))
return -EAGAIN;
} else {
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
}
ret = ext4_write_checks(iocb, from);
@@ -524,7 +524,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (extend)
ret = ext4_handle_inode_extension(inode, offset, ret, count);
out:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
if (ret > 0)
ret = generic_write_sync(iocb, ret);
return ret;
@@ -757,16 +757,16 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
return generic_file_llseek_size(file, offset, whence,
maxbytes, i_size_read(inode));
case SEEK_HOLE:
- inode_lock_shared(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_SHARED);
offset = iomap_seek_hole(inode, offset,
&ext4_iomap_report_ops);
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
break;
case SEEK_DATA:
- inode_lock_shared(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_SHARED);
offset = iomap_seek_data(inode, offset,
&ext4_iomap_report_ops);
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
break;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 381813205f99..39dcc22667a1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3930,7 +3930,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
return ret;
}
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
/* No need to punch hole beyond i_size */
if (offset >= inode->i_size)
@@ -4037,7 +4037,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
out_dio:
up_write(&EXT4_I(inode)->i_mmap_sem);
out_mutex:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
return ret;
}
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 0b7f316fd30f..43b7a23dc57b 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -855,13 +855,13 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
if (err)
return err;
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
err = ext4_ioctl_check_immutable(inode,
from_kprojid(&init_user_ns, ei->i_projid),
flags);
if (!err)
err = ext4_ioctl_setflags(inode, flags);
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
mnt_drop_write_file(filp);
return err;
}
@@ -892,7 +892,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
goto setversion_out;
}
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
if (IS_ERR(handle)) {
err = PTR_ERR(handle);
@@ -907,7 +907,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
ext4_journal_stop(handle);
unlock_out:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
setversion_out:
mnt_drop_write_file(filp);
return err;
@@ -1026,9 +1026,9 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
* ext4_ext_swap_inode_data before we switch the
* inode format to prevent read.
*/
- inode_lock((inode));
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
err = ext4_ext_migrate(inode);
- inode_unlock((inode));
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
mnt_drop_write_file(filp);
return err;
}
@@ -1272,7 +1272,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
if (err)
return err;
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
ext4_fill_fsxattr(inode, &old_fa);
err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
if (err)
@@ -1287,7 +1287,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
goto out;
err = ext4_ioctl_setproject(filp, fa.fsx_projid);
out:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
mnt_drop_write_file(filp);
return err;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7796e2ffc294..48b83b2cf0ad 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2682,12 +2682,12 @@ static void ext4_orphan_cleanup(struct super_block *sb,
__func__, inode->i_ino, inode->i_size);
jbd_debug(2, "truncating inode %lu to %lld bytes\n",
inode->i_ino, inode->i_size);
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
truncate_inode_pages(inode->i_mapping, inode->i_size);
ret = ext4_truncate(inode);
if (ret)
ext4_std_error(inode->i_sb, ret);
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
nr_truncates++;
} else {
if (test_opt(sb, DEBUG))
@@ -5785,7 +5785,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
* files. If this fails, we return success anyway since quotas
* are already enabled and this is not a hard failure.
*/
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
handle = ext4_journal_start(inode, EXT4_HT_QUOTA, 1);
if (IS_ERR(handle))
goto unlock_inode;
@@ -5795,7 +5795,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
unlock_inode:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
}
return err;
}
@@ -5887,7 +5887,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
if (err || ext4_has_feature_quota(sb))
goto out_put;
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
/*
* Update modification times of quota files when userspace can
* start looking at them. If we fail, we return success anyway since
@@ -5902,7 +5902,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
out_unlock:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
out_put:
lockdep_set_quota_inode(inode, I_DATA_SEM_NORMAL);
iput(inode);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 8966a5439a22..5c2dcc4c836a 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -422,9 +422,9 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
ext4_set_inode_state(inode, EXT4_STATE_LUSTRE_EA_INODE);
ext4_xattr_inode_set_ref(inode, 1);
} else {
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
inode->i_flags |= S_NOQUOTA;
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
}
*ea_inode = inode;
@@ -976,7 +976,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
u32 hash;
int ret;
- inode_lock(ea_inode);
+ ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
ret = ext4_reserve_inode_write(handle, ea_inode, &iloc);
if (ret)
@@ -1030,7 +1030,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
ext4_warning_inode(ea_inode,
"ext4_mark_iloc_dirty() failed ret=%d", ret);
out:
- inode_unlock(ea_inode);
+ ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
return ret;
}
@@ -1380,10 +1380,11 @@ static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
block += 1;
}
- inode_lock(ea_inode);
+ ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
i_size_write(ea_inode, wsize);
ext4_update_i_disksize(ea_inode, wsize);
- inode_unlock(ea_inode);
+ ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
+
ext4_mark_inode_dirty(handle, ea_inode);
@@ -1432,9 +1433,9 @@ static struct inode *ext4_xattr_inode_create(handle_t *handle,
*/
dquot_free_inode(ea_inode);
dquot_drop(ea_inode);
- inode_lock(ea_inode);
+ ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
ea_inode->i_flags |= S_NOQUOTA;
- inode_unlock(ea_inode);
+ ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
}
return ea_inode;
--
2.21.0
On Wed, Nov 20, 2019 at 10:30:22AM +0530, Ritesh Harjani wrote:
> This adds ext4_ilock/iunlock types of APIs.
> This is the preparation APIs to make shared
> locking/unlocking & restarting with exclusive
> locking/unlocking easier in next patch.
*scratches head*
A nit, but what's with the changelog wrapping at like ~40 characters?
> +#define EXT4_IOLOCK_EXCL (1 << 0)
> +#define EXT4_IOLOCK_SHARED (1 << 1)
>
> +static inline void ext4_ilock(struct inode *inode, unsigned int iolock)
> +{
> + if (iolock == EXT4_IOLOCK_EXCL)
> + inode_lock(inode);
> + else
> + inode_lock_shared(inode);
> +}
> +
> +static inline void ext4_iunlock(struct inode *inode, unsigned int iolock)
> +{
> + if (iolock == EXT4_IOLOCK_EXCL)
> + inode_unlock(inode);
> + else
> + inode_unlock_shared(inode);
> +}
> +
> +static inline int ext4_ilock_nowait(struct inode *inode, unsigned int iolock)
> +{
> + if (iolock == EXT4_IOLOCK_EXCL)
> + return inode_trylock(inode);
> + else
> + return inode_trylock_shared(inode);
> +}
Is it really necessary for all these helpers to actually have the
'else' statement? Could we not just return/set whatever takes the
'else' branch directly from the end of these functions? I think it
would be cleaner that way.
/me doesn't really like the naming of these functions either.
What's people's opinion on changing these for example:
- ext4_inode_lock()
- ext4_inode_unlock()
Or, better yet, is there any reason why we've never actually
considered naming such functions to have the verb precede the actual
object that we're performing the operation on? In my opinion, it
totally makes way more sense from a code readability standpoint and
overall intent of the function. For example:
- ext4_lock_inode()
- ext4_unlock_inode()
> +static inline void ext4_ilock_demote(struct inode *inode, unsigned int iolock)
> +{
> + BUG_ON(iolock != EXT4_IOLOCK_EXCL);
> + downgrade_write(&inode->i_rwsem);
> +}
> +
Same principle would also apply here.
On an ending note, I'm not really sure that I like the name of these
macros. Like, for example, expand the macro 'EXT4_IOLOCK_EXCL' into
plain english words as if you were reading it. This would translate to
something like 'EXT4 INPUT/OUTPUT LOCK EXCLUSIVE' or 'EXT4 IO LOCK
EXCLUSIVE'. Just flipping the words around makes a significant
improvement for overall readability, i.e. 'EXT4_EXCL_IOLOCK', which
would expand out to 'EXT4 EXCLUSIVE IO LOCK'. Speaking of, is there
any reason why we don't mention 'INODE' here seeing as though that's
the object we're actually protecting by taking one of these locking
mechanisms?
/M
Hello Matthew,
Thanks for the review.
On 11/20/19 4:53 PM, Matthew Bobrowski wrote:
> On Wed, Nov 20, 2019 at 10:30:22AM +0530, Ritesh Harjani wrote:
>> This adds ext4_ilock/iunlock types of APIs.
>> This is the preparation APIs to make shared
>> locking/unlocking & restarting with exclusive
>> locking/unlocking easier in next patch.
>
> *scratches head*
>
> A nit, but what's with the changelog wrapping at like ~40 characters?
Yup will fix that next time. Thanks.
>
>> +#define EXT4_IOLOCK_EXCL (1 << 0)
>> +#define EXT4_IOLOCK_SHARED (1 << 1)
>>
>> +static inline void ext4_ilock(struct inode *inode, unsigned int iolock)
>> +{
>> + if (iolock == EXT4_IOLOCK_EXCL)
>> + inode_lock(inode);
>> + else
>> + inode_lock_shared(inode);
>> +}
>> +
>> +static inline void ext4_iunlock(struct inode *inode, unsigned int iolock)
>> +{
>> + if (iolock == EXT4_IOLOCK_EXCL)
>> + inode_unlock(inode);
>> + else
>> + inode_unlock_shared(inode);
>> +}
>> +
>> +static inline int ext4_ilock_nowait(struct inode *inode, unsigned int iolock)
>> +{
>> + if (iolock == EXT4_IOLOCK_EXCL)
>> + return inode_trylock(inode);
>> + else
>> + return inode_trylock_shared(inode);
>> +}
>
> Is it really necessary for all these helpers to actually have the
> 'else' statement? Could we not just return/set whatever takes the
> 'else' branch directly from the end of these functions? I think it
> would be cleaner that way.
Sure np.
>
> /me doesn't really like the naming of these functions either.
:) difference of opinion.
>
> What's people's opinion on changing these for example:
> - ext4_inode_lock()
> - ext4_inode_unlock()
>
ext4_ilock/iunlock sounds better to me as it is short too.
But if others also have a strong opinion towards
ext4_inode_lock/unlock() - I am ok with that.
> Or, better yet, is there any reason why we've never actually
> considered naming such functions to have the verb precede the actual
> object that we're performing the operation on? In my opinion, it
> totally makes way more sense from a code readability standpoint and
> overall intent of the function. For example:
> - ext4_lock_inode()
> - ext4_unlock_inode()
Not against your suggestion here.
But in the kernel I do see a preference for the object followed by the verb.
At least in VFS I see functions like inode_lock()/inode_unlock().
Plus I would not deny that this naming is also inspired by the
xfs_ilock()/xfs_iunlock() API names.
>
>> +static inline void ext4_ilock_demote(struct inode *inode, unsigned int iolock)
>> +{
>> + BUG_ON(iolock != EXT4_IOLOCK_EXCL);
>> + downgrade_write(&inode->i_rwsem);
>> +}
>> +
>
> Same principle would also apply here.
>
> On an ending note, I'm not really sure that I like the name of these
> macros. Like, for example, expand the macro 'EXT4_IOLOCK_EXCL' into
> plain english words as if you were reading it. This would translate to
> something like 'EXT4 INPUT/OUTPUT LOCK EXCLUSIVE' or 'EXT4 IO LOCK
> EXCLUSIVE'. Just flipping the words around makes a significant
> improvement for overall readability, i.e. 'EXT4_EXCL_IOLOCK', which
> would expand out to 'EXT4 EXCLUSIVE IO LOCK'. Speaking of, is there
Ditto. Unless you and others have a strong objection, I would rather
keep this as is :)
> any reason why we don't mention 'INODE' here seeing as though that's
> the object we're actually protecting by taking one of these locking
> mechanisms?
hmm, it makes the macro name longer if I do it that way.
But that's ok. Is the below macro name better?
#define EXT4_INODE_IOLOCK_EXCL (1 << 0)
#define EXT4_INODE_IOLOCK_SHARED (1 << 1)
Thanks for the review!!
-ritesh
On Wed 20-11-19 10:30:22, Ritesh Harjani wrote:
> This adds ext4_ilock/iunlock types of APIs.
> This is the preparation APIs to make shared
> locking/unlocking & restarting with exclusive
> locking/unlocking easier in next patch.
>
> Signed-off-by: Ritesh Harjani <[email protected]>
I know XFS does it this way but I don't think we need the obscurity of
additional locking helpers etc. just because of one place in
ext4_dio_write_iter() that will use this. So I'd just drop this patch...
Honza
> ---
> fs/ext4/ext4.h | 33 ++++++++++++++++++++++++++++++
> fs/ext4/extents.c | 16 +++++++--------
> fs/ext4/file.c | 52 +++++++++++++++++++++++------------------------
> fs/ext4/inode.c | 4 ++--
> fs/ext4/ioctl.c | 16 +++++++--------
> fs/ext4/super.c | 12 +++++------
> fs/ext4/xattr.c | 17 ++++++++--------
> 7 files changed, 92 insertions(+), 58 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 61987c106511..b4169a92e8d0 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2960,6 +2960,39 @@ do { \
> #define EXT4_FREECLUSTERS_WATERMARK 0
> #endif
>
> +#define EXT4_IOLOCK_EXCL (1 << 0)
> +#define EXT4_IOLOCK_SHARED (1 << 1)
> +
> +static inline void ext4_ilock(struct inode *inode, unsigned int iolock)
> +{
> + if (iolock == EXT4_IOLOCK_EXCL)
> + inode_lock(inode);
> + else
> + inode_lock_shared(inode);
> +}
> +
> +static inline void ext4_iunlock(struct inode *inode, unsigned int iolock)
> +{
> + if (iolock == EXT4_IOLOCK_EXCL)
> + inode_unlock(inode);
> + else
> + inode_unlock_shared(inode);
> +}
> +
> +static inline int ext4_ilock_nowait(struct inode *inode, unsigned int iolock)
> +{
> + if (iolock == EXT4_IOLOCK_EXCL)
> + return inode_trylock(inode);
> + else
> + return inode_trylock_shared(inode);
> +}
> +
> +static inline void ext4_ilock_demote(struct inode *inode, unsigned int iolock)
> +{
> + BUG_ON(iolock != EXT4_IOLOCK_EXCL);
> + downgrade_write(&inode->i_rwsem);
> +}
> +
> /* Update i_disksize. Requires i_mutex to avoid races with truncate */
> static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
> {
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 0e8708b77da6..08dd57558533 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4754,7 +4754,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> else
> max_blocks -= lblk;
>
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
>
> /*
> * Indirect files do not support unwritten extnets
> @@ -4864,7 +4864,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>
> ext4_journal_stop(handle);
> out_mutex:
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> return ret;
> }
>
> @@ -4930,7 +4930,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> if (mode & FALLOC_FL_KEEP_SIZE)
> flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
>
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
>
> /*
> * We only support preallocation for extent-based files only
> @@ -4961,7 +4961,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> EXT4_I(inode)->i_sync_tid);
> }
> out:
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
> return ret;
> }
> @@ -5509,7 +5509,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
> return ret;
> }
>
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> /*
> * There is no need to overlap collapse range with EOF, in which case
> * it is effectively a truncate operation
> @@ -5608,7 +5608,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
> out_mmap:
> up_write(&EXT4_I(inode)->i_mmap_sem);
> out_mutex:
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> return ret;
> }
>
> @@ -5659,7 +5659,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
> return ret;
> }
>
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> /* Currently just for extent based files */
> if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
> ret = -EOPNOTSUPP;
> @@ -5786,7 +5786,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
> out_mmap:
> up_write(&EXT4_I(inode)->i_mmap_sem);
> out_mutex:
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> return ret;
> }
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 977ac58dc718..ebe3f051598d 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -55,14 +55,14 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> struct inode *inode = file_inode(iocb->ki_filp);
>
> if (iocb->ki_flags & IOCB_NOWAIT) {
> - if (!inode_trylock_shared(inode))
> + if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
> return -EAGAIN;
> } else {
> - inode_lock_shared(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_SHARED);
> }
>
> if (!ext4_dio_supported(inode)) {
> - inode_unlock_shared(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
> /*
> * Fallback to buffered I/O if the operation being performed on
> * the inode is not supported by direct I/O. The IOCB_DIRECT
> @@ -76,7 +76,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
>
> ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
> is_sync_kiocb(iocb));
> - inode_unlock_shared(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
>
> file_accessed(iocb->ki_filp);
> return ret;
> @@ -89,22 +89,23 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
> ssize_t ret;
>
> if (iocb->ki_flags & IOCB_NOWAIT) {
> - if (!inode_trylock_shared(inode))
> + if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
> return -EAGAIN;
> } else {
> - inode_lock_shared(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_SHARED);
> }
> +
> /*
> * Recheck under inode lock - at this point we are sure it cannot
> * change anymore
> */
> if (!IS_DAX(inode)) {
> - inode_unlock_shared(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
> /* Fallback to buffered IO in case we cannot support DAX */
> return generic_file_read_iter(iocb, to);
> }
> ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
> - inode_unlock_shared(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
>
> file_accessed(iocb->ki_filp);
> return ret;
> @@ -244,7 +245,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> if (iocb->ki_flags & IOCB_NOWAIT)
> return -EOPNOTSUPP;
>
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> ret = ext4_write_checks(iocb, from);
> if (ret <= 0)
> goto out;
> @@ -254,7 +255,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> current->backing_dev_info = NULL;
>
> out:
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> if (likely(ret > 0)) {
> iocb->ki_pos += ret;
> ret = generic_write_sync(iocb, ret);
> @@ -372,16 +373,17 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> handle_t *handle;
> struct inode *inode = file_inode(iocb->ki_filp);
> bool extend = false, overwrite = false, unaligned_aio = false;
> + unsigned int iolock = EXT4_IOLOCK_EXCL;
>
> if (iocb->ki_flags & IOCB_NOWAIT) {
> - if (!inode_trylock(inode))
> + if (!ext4_ilock_nowait(inode, iolock))
> return -EAGAIN;
> } else {
> - inode_lock(inode);
> + ext4_ilock(inode, iolock);
> }
>
> if (!ext4_dio_supported(inode)) {
> - inode_unlock(inode);
> + ext4_iunlock(inode, iolock);
> /*
> * Fallback to buffered I/O if the inode does not support
> * direct I/O.
> @@ -391,7 +393,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>
> ret = ext4_write_checks(iocb, from);
> if (ret <= 0) {
> - inode_unlock(inode);
> + ext4_iunlock(inode, iolock);
> return ret;
> }
>
> @@ -416,7 +418,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> ext4_should_dioread_nolock(inode)) {
> overwrite = true;
> - downgrade_write(&inode->i_rwsem);
> + ext4_ilock_demote(inode, iolock);
> + iolock = EXT4_IOLOCK_SHARED;
> }
>
> if (offset + count > EXT4_I(inode)->i_disksize) {
> @@ -443,10 +446,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> ret = ext4_handle_inode_extension(inode, offset, ret, count);
>
> out:
> - if (overwrite)
> - inode_unlock_shared(inode);
> - else
> - inode_unlock(inode);
> + ext4_iunlock(inode, iolock);
>
> if (ret >= 0 && iov_iter_count(from)) {
> ssize_t err;
> @@ -489,10 +489,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> struct inode *inode = file_inode(iocb->ki_filp);
>
> if (iocb->ki_flags & IOCB_NOWAIT) {
> - if (!inode_trylock(inode))
> + if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_EXCL))
> return -EAGAIN;
> } else {
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> }
>
> ret = ext4_write_checks(iocb, from);
> @@ -524,7 +524,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (extend)
> ret = ext4_handle_inode_extension(inode, offset, ret, count);
> out:
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> if (ret > 0)
> ret = generic_write_sync(iocb, ret);
> return ret;
> @@ -757,16 +757,16 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
> return generic_file_llseek_size(file, offset, whence,
> maxbytes, i_size_read(inode));
> case SEEK_HOLE:
> - inode_lock_shared(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_SHARED);
> offset = iomap_seek_hole(inode, offset,
> &ext4_iomap_report_ops);
> - inode_unlock_shared(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
> break;
> case SEEK_DATA:
> - inode_lock_shared(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_SHARED);
> offset = iomap_seek_data(inode, offset,
> &ext4_iomap_report_ops);
> - inode_unlock_shared(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
> break;
> }
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 381813205f99..39dcc22667a1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3930,7 +3930,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
> return ret;
> }
>
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
>
> /* No need to punch hole beyond i_size */
> if (offset >= inode->i_size)
> @@ -4037,7 +4037,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
> out_dio:
> up_write(&EXT4_I(inode)->i_mmap_sem);
> out_mutex:
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> return ret;
> }
>
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 0b7f316fd30f..43b7a23dc57b 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -855,13 +855,13 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> if (err)
> return err;
>
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> err = ext4_ioctl_check_immutable(inode,
> from_kprojid(&init_user_ns, ei->i_projid),
> flags);
> if (!err)
> err = ext4_ioctl_setflags(inode, flags);
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> mnt_drop_write_file(filp);
> return err;
> }
> @@ -892,7 +892,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> goto setversion_out;
> }
>
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
> if (IS_ERR(handle)) {
> err = PTR_ERR(handle);
> @@ -907,7 +907,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> ext4_journal_stop(handle);
>
> unlock_out:
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> setversion_out:
> mnt_drop_write_file(filp);
> return err;
> @@ -1026,9 +1026,9 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> * ext4_ext_swap_inode_data before we switch the
> * inode format to prevent read.
> */
> - inode_lock((inode));
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> err = ext4_ext_migrate(inode);
> - inode_unlock((inode));
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> mnt_drop_write_file(filp);
> return err;
> }
> @@ -1272,7 +1272,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> if (err)
> return err;
>
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> ext4_fill_fsxattr(inode, &old_fa);
> err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
> if (err)
> @@ -1287,7 +1287,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> goto out;
> err = ext4_ioctl_setproject(filp, fa.fsx_projid);
> out:
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> mnt_drop_write_file(filp);
> return err;
> }
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 7796e2ffc294..48b83b2cf0ad 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2682,12 +2682,12 @@ static void ext4_orphan_cleanup(struct super_block *sb,
> __func__, inode->i_ino, inode->i_size);
> jbd_debug(2, "truncating inode %lu to %lld bytes\n",
> inode->i_ino, inode->i_size);
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> truncate_inode_pages(inode->i_mapping, inode->i_size);
> ret = ext4_truncate(inode);
> if (ret)
> ext4_std_error(inode->i_sb, ret);
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> nr_truncates++;
> } else {
> if (test_opt(sb, DEBUG))
> @@ -5785,7 +5785,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
> * files. If this fails, we return success anyway since quotas
> * are already enabled and this is not a hard failure.
> */
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> handle = ext4_journal_start(inode, EXT4_HT_QUOTA, 1);
> if (IS_ERR(handle))
> goto unlock_inode;
> @@ -5795,7 +5795,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
> ext4_mark_inode_dirty(handle, inode);
> ext4_journal_stop(handle);
> unlock_inode:
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> }
> return err;
> }
> @@ -5887,7 +5887,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
> if (err || ext4_has_feature_quota(sb))
> goto out_put;
>
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> /*
> * Update modification times of quota files when userspace can
> * start looking at them. If we fail, we return success anyway since
> @@ -5902,7 +5902,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
> ext4_mark_inode_dirty(handle, inode);
> ext4_journal_stop(handle);
> out_unlock:
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> out_put:
> lockdep_set_quota_inode(inode, I_DATA_SEM_NORMAL);
> iput(inode);
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 8966a5439a22..5c2dcc4c836a 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -422,9 +422,9 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
> ext4_set_inode_state(inode, EXT4_STATE_LUSTRE_EA_INODE);
> ext4_xattr_inode_set_ref(inode, 1);
> } else {
> - inode_lock(inode);
> + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> inode->i_flags |= S_NOQUOTA;
> - inode_unlock(inode);
> + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> }
>
> *ea_inode = inode;
> @@ -976,7 +976,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
> u32 hash;
> int ret;
>
> - inode_lock(ea_inode);
> + ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
>
> ret = ext4_reserve_inode_write(handle, ea_inode, &iloc);
> if (ret)
> @@ -1030,7 +1030,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
> ext4_warning_inode(ea_inode,
> "ext4_mark_iloc_dirty() failed ret=%d", ret);
> out:
> - inode_unlock(ea_inode);
> + ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
> return ret;
> }
>
> @@ -1380,10 +1380,11 @@ static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
> block += 1;
> }
>
> - inode_lock(ea_inode);
> + ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
> i_size_write(ea_inode, wsize);
> ext4_update_i_disksize(ea_inode, wsize);
> - inode_unlock(ea_inode);
> + ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
> +
>
> ext4_mark_inode_dirty(handle, ea_inode);
>
> @@ -1432,9 +1433,9 @@ static struct inode *ext4_xattr_inode_create(handle_t *handle,
> */
> dquot_free_inode(ea_inode);
> dquot_drop(ea_inode);
> - inode_lock(ea_inode);
> + ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
> ea_inode->i_flags |= S_NOQUOTA;
> - inode_unlock(ea_inode);
> + ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
> }
>
> return ea_inode;
> --
> 2.21.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 20-11-19 10:30:23, Ritesh Harjani wrote:
> Earlier there was no shared lock in the DIO read path.
> But commit 16c54688592ce ("ext4: Allow parallel DIO reads")
> simplified some of the locking mechanism while still allowing
> parallel DIO reads by adding a shared lock in the inode DIO
> read path.
>
> But this created a problem with mixed read/write workloads.
> It is due to the fact that in the DIO path, we first start with
> an exclusive lock and only when we determine that it is an overwrite
> IO do we downgrade the lock. This causes the problem, since
> with the above patch we have shared locking in DIO reads.
>
> So, this patch tries to fix this issue by starting with a
> shared lock and then switching to an exclusive lock only
> when required, based on ext4_dio_write_checks().
>
> Other than that, it also simplifies the below cases:
>
> 1. Simplified the ext4_unaligned_aio API to ext4_unaligned_io.
> The previous API was abused in the sense that it was not really checking
> for AIO anywhere, and it also used to check for extending writes.
> So this API was renamed and simplified to ext4_unaligned_io(),
> which actually only checks if the IO is really unaligned.
>
> Now, in case of unaligned direct IO, iomap_dio_rw needs to do
> zeroing of the partial block, and that will require serialization
> against other direct IOs in the same block. So we take an excl iolock
> for any unaligned DIO.
> In case of AIO we also need to wait for any outstanding IOs to
> complete, so that the conversion from unwritten to written is completed
> before anyone tries to map the overlapping block. Hence we take the
> excl iolock and also call inode_dio_wait() for the unaligned DIO case.
> Please note that since we anyway take an exclusive lock for unaligned IO,
> inode_dio_wait() becomes a no-op in the non-AIO DIO case.
>
> 2. Added ext4_extending_io(). This checks if the IO is extending the file.
>
> 3. Added ext4_dio_write_checks().
> In this we start with a shared iolock and only switch to an exclusive iolock
> if required. So in most cases aligned, non-extending, dioread_nolock
> overwrites proceed with shared locking.
> If not, then we restart the operation in ext4_dio_write_checks()
> after acquiring the excl iolock.
>
> Signed-off-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/file.c | 191 ++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 142 insertions(+), 49 deletions(-)
Thanks for the patch! Some comments below...
> @@ -365,15 +383,110 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
> .end_io = ext4_dio_write_end_io,
> };
>
> +/*
> + * The intention here is to start with shared lock acquired then see if any
> + * condition requires an exclusive inode lock. If yes, then we restart the
> + * whole operation by releasing the shared lock and acquiring exclusive lock.
> + *
> + * - For unaligned_io we never take shared lock as it may cause data corruption
> + * when two unaligned IO tries to modify the same block e.g. while zeroing.
> + *
> + * - For extending writes case we don't take the shared lock, since it requires
> + * updating inode i_disksize and/or orphan handling with exclusive lock.
> + *
> + * - shared locking will only be true mostly in case of overwrites with
> + * dioread_nolock mode. Otherwise we will switch to excl. iolock mode.
> + */
> +static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> + unsigned int *iolock, bool *unaligned_io,
> + bool *extend)
> +{
> + struct file *file = iocb->ki_filp;
> + struct inode *inode = file_inode(file);
> + loff_t offset = iocb->ki_pos;
> + loff_t final_size;
> + size_t count;
> + ssize_t ret;
> +
> +restart:
> + /* Fallback to buffered I/O if the inode does not support direct I/O. */
> + if (!ext4_dio_supported(inode)) {
> + ext4_iunlock(inode, *iolock);
> + return ext4_buffered_write_iter(iocb, from);
> + }
I don't think it is good to hide things like this fallback to buffered IO
in ext4_dio_write_checks(). Similarly with 'unaligned_io' and 'extend'
variables below. So I'd rather leave this in ext4_dio_write_iter() and just
move file_modified() from ext4_write_checks() since that seems to be the
only thing that cannot always be done with shared i_rwsem, can it?
> +
> + ret = ext4_generic_write_checks(iocb, from);
> + if (ret <= 0) {
> + ext4_iunlock(inode, *iolock);
> + return ret;
> + }
> +
> + /* Recalculate since offset & count may change above. */
> + offset = iocb->ki_pos;
> + count = iov_iter_count(from);
> + final_size = offset + count;
> +
> + if (ext4_unaligned_io(inode, from, offset))
> + *unaligned_io = true;
No need to recheck alignment here. That cannot change over time regardless
of the locks we hold...
> +
> + if (ext4_extending_io(inode, offset, count))
> + *extend = true;
> + /*
> + * Determine whether the IO operation will overwrite allocated
> + * and initialized blocks. If so, check to see whether it is
> + * possible to take the dioread_nolock path.
> + *
> + * We need exclusive i_rwsem for changing security info
> + * in file_modified().
> + */
> + if (*iolock == EXT4_IOLOCK_SHARED &&
> + (!IS_NOSEC(inode) || *unaligned_io || *extend ||
> + !ext4_should_dioread_nolock(inode) ||
> + !ext4_overwrite_io(inode, offset, count))) {
> + ext4_iunlock(inode, *iolock);
> + *iolock = EXT4_IOLOCK_EXCL;
> + ext4_ilock(inode, *iolock);
> + goto restart;
> + }
> +
> + ret = file_modified(file);
> + if (ret < 0) {
> + ext4_iunlock(inode, *iolock);
> + return ret;
> + }
> +
> + return count;
> +}
> +
> static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> ssize_t ret;
> - size_t count;
> - loff_t offset;
> handle_t *handle;
> struct inode *inode = file_inode(iocb->ki_filp);
> - bool extend = false, overwrite = false, unaligned_aio = false;
> - unsigned int iolock = EXT4_IOLOCK_EXCL;
> + loff_t offset = iocb->ki_pos;
> + size_t count = iov_iter_count(from);
> + bool extend = false, unaligned_io = false;
> + unsigned int iolock = EXT4_IOLOCK_SHARED;
> +
> + /*
> + * We initially start with shared inode lock
> + * unless it is unaligned IO which needs
> + * exclusive lock anyways.
> + */
> + if (ext4_unaligned_io(inode, from, offset)) {
> + unaligned_io = true;
> + iolock = EXT4_IOLOCK_EXCL;
> + }
> + /*
> + * Extending writes need exclusive lock.
> + */
> + if (ext4_extending_io(inode, offset, count)) {
> + extend = true;
> + iolock = EXT4_IOLOCK_EXCL;
> + }
You cannot read EXT4_I(inode)->i_disksize without some lock (either
inode->i_rwsem or EXT4_I(inode)->i_data_sem). So I'd just do a quick
check with i_size here (probably don't set extend, but just make a note to
start with exclusive i_rwsem) and later, when we hold i_rwsem, we can do a
reliable check.
> + if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
> + iolock = EXT4_IOLOCK_EXCL;
>
> if (iocb->ki_flags & IOCB_NOWAIT) {
> if (!ext4_ilock_nowait(inode, iolock))
> @@ -382,47 +495,28 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> ext4_ilock(inode, iolock);
> }
>
> - if (!ext4_dio_supported(inode)) {
> - ext4_iunlock(inode, iolock);
> - /*
> - * Fallback to buffered I/O if the inode does not support
> - * direct I/O.
> - */
> - return ext4_buffered_write_iter(iocb, from);
> - }
> -
> - ret = ext4_write_checks(iocb, from);
> - if (ret <= 0) {
> - ext4_iunlock(inode, iolock);
> + ret = ext4_dio_write_checks(iocb, from, &iolock, &unaligned_io,
> + &extend);
> + if (ret <= 0)
> return ret;
> - }
>
> - /*
> - * Unaligned asynchronous direct I/O must be serialized among each
> - * other as the zeroing of partial blocks of two competing unaligned
> - * asynchronous direct I/O writes can result in data corruption.
> - */
> offset = iocb->ki_pos;
> count = iov_iter_count(from);
> - if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> - !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> - unaligned_aio = true;
> - inode_dio_wait(inode);
> - }
>
> /*
> - * Determine whether the I/O will overwrite allocated and initialized
> - * blocks. If so, check to see whether it is possible to take the
> - * dioread_nolock path.
> + * Unaligned direct IO must be serialized among each other as zeroing
> + * of partial blocks of two competing unaligned IOs can result in data
> + * corruption.
> + *
> + * So we make sure we don't allow any unaligned IO in flight.
> + * For IOs where we need not wait (like unaligned non-AIO DIO),
> + * below inode_dio_wait() may anyway become a no-op, since we start
> + * with exclusive lock.
> */
> - if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> - ext4_should_dioread_nolock(inode)) {
> - overwrite = true;
> - ext4_ilock_demote(inode, iolock);
> - iolock = EXT4_IOLOCK_SHARED;
> - }
> + if (unaligned_io)
> + inode_dio_wait(inode);
>
> - if (offset + count > EXT4_I(inode)->i_disksize) {
> + if (extend) {
> handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> @@ -435,12 +529,11 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> goto out;
> }
>
> - extend = true;
> ext4_journal_stop(handle);
> }
>
> ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
> - is_sync_kiocb(iocb) || unaligned_aio || extend);
> + is_sync_kiocb(iocb) || unaligned_io || extend);
>
> if (extend)
> ret = ext4_handle_inode_extension(inode, offset, ret, count);
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 20-11-19 10:30:24, Ritesh Harjani wrote:
> We were using shared locking for DIO overwrites only with the
> dioread_nolock mount option. This mount option condition is not
> needed anymore with the current code, since:
>
> 1. No race between buffered writes & DIO overwrites,
> since buffered writes take exclusive locks & DIO overwrites
> take shared locking. The DIO path will also make sure
> to flush and wait for any dirty page cache data.
>
> 2. No race between buffered reads & DIO overwrites, since no
> block allocation is possible with DIO overwrites, so no stale
> data exposure should happen. The same holds
> between DIO reads & DIO overwrites.
>
> 3. Other paths like truncate are also protected,
> since we wait there for any DIO in flight to finish.
>
> 4. In case of buffered writes followed by DIO reads:
> here too we take exclusive locks in ext4_write_begin/end(),
> so there is no risk of exposing any stale data in this case,
> since after ext4_write_end(), iomap_dio_rw() will flush and
> wait for any dirty page cache data.
>
> Signed-off-by: Ritesh Harjani <[email protected]>
There's one more case to consider here, as I mentioned in [1]. There can be
an mmap write instantiating a dirty page and then someone starting writeback
against that page while a DIO read is still running, theoretically leading to
stale data exposure. Now this patch has no influence on that race
but:
1) We need to close the race mentioned above. Maybe we could do that by
proactively allocating unwritten blocks for a page being faulted when there
is direct IO running against the file - the one who fills holes through
mmap write while direct IO is running on the file deserves to suffer the
performance penalty...
2) After this patch there's no point in having dioread_nolock at all, so we
can just make that mount option a no-op and drop all the precautions
connected with dioread_nolock from the buffered IO path.
[1] https://lore.kernel.org/linux-ext4/[email protected]
Honza
> ---
> fs/ext4/file.c | 25 +++++++++++++++++++------
> 1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 18cbf9fa52c6..b97efc89cd63 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -383,6 +383,17 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
> .end_io = ext4_dio_write_end_io,
> };
>
> +static bool ext4_dio_should_shared_lock(struct inode *inode)
> +{
> + if (!S_ISREG(inode->i_mode))
This cannot happen for DIO so no point in checking here.
> + return false;
> + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
Why this?
> + return false;
> + if (ext4_should_journal_data(inode))
We don't do DIO when journalling data so no point in checking here
(dio_supported() already checked this).
Honza
> + return false;
> + return true;
> +}
> +
> /*
> * The intention here is to start with shared lock acquired then see if any
> * condition requires an exclusive inode lock. If yes, then we restart the
> @@ -394,8 +405,8 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
> * - For extending writes case we don't take the shared lock, since it requires
> * updating inode i_disksize and/or orphan handling with exclusive lock.
> *
> - * - shared locking will only be true mostly in case of overwrites with
> - * dioread_nolock mode. Otherwise we will switch to excl. iolock mode.
> + * - shared locking will only be true mostly in case of overwrites.
> + * Otherwise we will switch to excl. iolock mode.
> */
> static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> unsigned int *iolock, bool *unaligned_io,
> @@ -433,15 +444,14 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> *extend = true;
> /*
> * Determine whether the IO operation will overwrite allocated
> - * and initialized blocks. If so, check to see whether it is
> - * possible to take the dioread_nolock path.
> + * and initialized blocks.
> *
> * We need exclusive i_rwsem for changing security info
> * in file_modified().
> */
> if (*iolock == EXT4_IOLOCK_SHARED &&
> (!IS_NOSEC(inode) || *unaligned_io || *extend ||
> - !ext4_should_dioread_nolock(inode) ||
> + !ext4_dio_should_shared_lock(inode) ||
> !ext4_overwrite_io(inode, offset, count))) {
> ext4_iunlock(inode, *iolock);
> *iolock = EXT4_IOLOCK_EXCL;
> @@ -485,7 +495,10 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> iolock = EXT4_IOLOCK_EXCL;
> }
>
> - if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
> + /*
> + * Check if we should continue with shared iolock
> + */
> + if (iolock == EXT4_IOLOCK_SHARED && !ext4_dio_should_shared_lock(inode))
> iolock = EXT4_IOLOCK_EXCL;
>
> if (iocb->ki_flags & IOCB_NOWAIT) {
> --
> 2.21.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Nov 20, 2019 at 05:48:30PM +0530, Ritesh Harjani wrote:
> Not against your suggestion here.
> But in kernel I do see a preference towards object followed by a verb.
> At least in vfs I see functions like inode_lock()/unlock().
>
> Plus I would not deny that this naming is also inspired from
> xfs_ilock()/iunlock API names.
I see those names as being "classical Unix" heritage (eh, maybe SysV).
> hmm, it makes the macro name longer if I do it that way.
> But that's ok. Is the below macro name better?
>
> #define EXT4_INODE_IOLOCK_EXCL (1 << 0)
> #define EXT4_INODE_IOLOCK_SHARED (1 << 1)
In particular, Linux tends to prefer read/write instead of
shared/exclusive terminology. rwlocks, rwsems, rcu_read_lock, seqlocks.
shared/exclusive is used by file locks. And XFS ;-)
I agree with Jan; just leave them opencoded.
Probably worth adding inode_lock_downgrade() to fs.h instead of
accessing i_rwsem directly.
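[Editorial note: the key property of such a downgrade helper is that the writer becomes a reader atomically, with no window in which another writer could slip in. A single-threaded state model illustrates this; it borrows kernel-style names but is a sketch, not the kernel's real lockless rw_semaphore.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy rw-semaphore state: real rwsems also queue and sleep. */
struct toy_rwsem {
	int readers;
	bool writer;
};

static bool down_read_trylock(struct toy_rwsem *s)
{
	if (s->writer)
		return false;
	s->readers++;
	return true;
}

static bool down_write_trylock(struct toy_rwsem *s)
{
	if (s->writer || s->readers)
		return false;
	s->writer = true;
	return true;
}

/* Become a reader atomically: no writer ever observes the sem free. */
static void downgrade_write(struct toy_rwsem *s)
{
	s->writer = false;
	s->readers = 1;
}

static void up_read(struct toy_rwsem *s)
{
	s->readers--;
}
```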
On Wed, Nov 20, 2019 at 02:11:07PM +0100, Jan Kara wrote:
> On Wed 20-11-19 10:30:22, Ritesh Harjani wrote:
> > This adds ext4_ilock/iunlock types of APIs.
> > This is the preparation APIs to make shared
> > locking/unlocking & restarting with exclusive
> > locking/unlocking easier in next patch.
> >
> > Signed-off-by: Ritesh Harjani <[email protected]>
>
> I know XFS does it this way but I don't think we need the obscurity of
> additional locking helpers etc. just because of one place in
> ext4_dio_write_iter() that will use this. So I'd just drop this patch...
FWIW, if you were to add tracepoints to these inode lock and unlock
helpers, then the helpers would have the additional value that you could
use ftrace + script to diagnose inode deadlocking issues and the like.
XFS has such scripts in the xfsprogs source code and it's really nice to
have an automated system to tell you which thread forgot to let go of
something, and when.
--D
> Honza
>
> > ---
> > fs/ext4/ext4.h | 33 ++++++++++++++++++++++++++++++
> > fs/ext4/extents.c | 16 +++++++--------
> > fs/ext4/file.c | 52 +++++++++++++++++++++++------------------------
> > fs/ext4/inode.c | 4 ++--
> > fs/ext4/ioctl.c | 16 +++++++--------
> > fs/ext4/super.c | 12 +++++------
> > fs/ext4/xattr.c | 17 ++++++++--------
> > 7 files changed, 92 insertions(+), 58 deletions(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 61987c106511..b4169a92e8d0 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -2960,6 +2960,39 @@ do { \
> > #define EXT4_FREECLUSTERS_WATERMARK 0
> > #endif
> >
> > +#define EXT4_IOLOCK_EXCL (1 << 0)
> > +#define EXT4_IOLOCK_SHARED (1 << 1)
> > +
> > +static inline void ext4_ilock(struct inode *inode, unsigned int iolock)
> > +{
> > + if (iolock == EXT4_IOLOCK_EXCL)
> > + inode_lock(inode);
> > + else
> > + inode_lock_shared(inode);
> > +}
> > +
> > +static inline void ext4_iunlock(struct inode *inode, unsigned int iolock)
> > +{
> > + if (iolock == EXT4_IOLOCK_EXCL)
> > + inode_unlock(inode);
> > + else
> > + inode_unlock_shared(inode);
> > +}
> > +
> > +static inline int ext4_ilock_nowait(struct inode *inode, unsigned int iolock)
> > +{
> > + if (iolock == EXT4_IOLOCK_EXCL)
> > + return inode_trylock(inode);
> > + else
> > + return inode_trylock_shared(inode);
> > +}
> > +
> > +static inline void ext4_ilock_demote(struct inode *inode, unsigned int iolock)
> > +{
> > + BUG_ON(iolock != EXT4_IOLOCK_EXCL);
> > + downgrade_write(&inode->i_rwsem);
> > +}
> > +
> > /* Update i_disksize. Requires i_mutex to avoid races with truncate */
> > static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
> > {
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 0e8708b77da6..08dd57558533 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -4754,7 +4754,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> > else
> > max_blocks -= lblk;
> >
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> >
> > /*
> > * Indirect files do not support unwritten extnets
> > @@ -4864,7 +4864,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> >
> > ext4_journal_stop(handle);
> > out_mutex:
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > return ret;
> > }
> >
> > @@ -4930,7 +4930,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> > if (mode & FALLOC_FL_KEEP_SIZE)
> > flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
> >
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> >
> > /*
> > * We only support preallocation for extent-based files only
> > @@ -4961,7 +4961,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> > EXT4_I(inode)->i_sync_tid);
> > }
> > out:
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
> > return ret;
> > }
> > @@ -5509,7 +5509,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
> > return ret;
> > }
> >
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > /*
> > * There is no need to overlap collapse range with EOF, in which case
> > * it is effectively a truncate operation
> > @@ -5608,7 +5608,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
> > out_mmap:
> > up_write(&EXT4_I(inode)->i_mmap_sem);
> > out_mutex:
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > return ret;
> > }
> >
> > @@ -5659,7 +5659,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
> > return ret;
> > }
> >
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > /* Currently just for extent based files */
> > if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
> > ret = -EOPNOTSUPP;
> > @@ -5786,7 +5786,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
> > out_mmap:
> > up_write(&EXT4_I(inode)->i_mmap_sem);
> > out_mutex:
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > return ret;
> > }
> >
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index 977ac58dc718..ebe3f051598d 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -55,14 +55,14 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > struct inode *inode = file_inode(iocb->ki_filp);
> >
> > if (iocb->ki_flags & IOCB_NOWAIT) {
> > - if (!inode_trylock_shared(inode))
> > + if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
> > return -EAGAIN;
> > } else {
> > - inode_lock_shared(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_SHARED);
> > }
> >
> > if (!ext4_dio_supported(inode)) {
> > - inode_unlock_shared(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
> > /*
> > * Fallback to buffered I/O if the operation being performed on
> > * the inode is not supported by direct I/O. The IOCB_DIRECT
> > @@ -76,7 +76,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >
> > ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
> > is_sync_kiocb(iocb));
> > - inode_unlock_shared(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
> >
> > file_accessed(iocb->ki_filp);
> > return ret;
> > @@ -89,22 +89,23 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > ssize_t ret;
> >
> > if (iocb->ki_flags & IOCB_NOWAIT) {
> > - if (!inode_trylock_shared(inode))
> > + if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
> > return -EAGAIN;
> > } else {
> > - inode_lock_shared(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_SHARED);
> > }
> > +
> > /*
> > * Recheck under inode lock - at this point we are sure it cannot
> > * change anymore
> > */
> > if (!IS_DAX(inode)) {
> > - inode_unlock_shared(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
> > /* Fallback to buffered IO in case we cannot support DAX */
> > return generic_file_read_iter(iocb, to);
> > }
> > ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
> > - inode_unlock_shared(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
> >
> > file_accessed(iocb->ki_filp);
> > return ret;
> > @@ -244,7 +245,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > if (iocb->ki_flags & IOCB_NOWAIT)
> > return -EOPNOTSUPP;
> >
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > ret = ext4_write_checks(iocb, from);
> > if (ret <= 0)
> > goto out;
> > @@ -254,7 +255,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > current->backing_dev_info = NULL;
> >
> > out:
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > if (likely(ret > 0)) {
> > iocb->ki_pos += ret;
> > ret = generic_write_sync(iocb, ret);
> > @@ -372,16 +373,17 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > handle_t *handle;
> > struct inode *inode = file_inode(iocb->ki_filp);
> > bool extend = false, overwrite = false, unaligned_aio = false;
> > + unsigned int iolock = EXT4_IOLOCK_EXCL;
> >
> > if (iocb->ki_flags & IOCB_NOWAIT) {
> > - if (!inode_trylock(inode))
> > + if (!ext4_ilock_nowait(inode, iolock))
> > return -EAGAIN;
> > } else {
> > - inode_lock(inode);
> > + ext4_ilock(inode, iolock);
> > }
> >
> > if (!ext4_dio_supported(inode)) {
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, iolock);
> > /*
> > * Fallback to buffered I/O if the inode does not support
> > * direct I/O.
> > @@ -391,7 +393,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >
> > ret = ext4_write_checks(iocb, from);
> > if (ret <= 0) {
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, iolock);
> > return ret;
> > }
> >
> > @@ -416,7 +418,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> > ext4_should_dioread_nolock(inode)) {
> > overwrite = true;
> > - downgrade_write(&inode->i_rwsem);
> > + ext4_ilock_demote(inode, iolock);
> > + iolock = EXT4_IOLOCK_SHARED;
> > }
> >
> > if (offset + count > EXT4_I(inode)->i_disksize) {
> > @@ -443,10 +446,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > ret = ext4_handle_inode_extension(inode, offset, ret, count);
> >
> > out:
> > - if (overwrite)
> > - inode_unlock_shared(inode);
> > - else
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, iolock);
> >
> > if (ret >= 0 && iov_iter_count(from)) {
> > ssize_t err;
> > @@ -489,10 +489,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > struct inode *inode = file_inode(iocb->ki_filp);
> >
> > if (iocb->ki_flags & IOCB_NOWAIT) {
> > - if (!inode_trylock(inode))
> > + if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_EXCL))
> > return -EAGAIN;
> > } else {
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > }
> >
> > ret = ext4_write_checks(iocb, from);
> > @@ -524,7 +524,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > if (extend)
> > ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > out:
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > if (ret > 0)
> > ret = generic_write_sync(iocb, ret);
> > return ret;
> > @@ -757,16 +757,16 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
> > return generic_file_llseek_size(file, offset, whence,
> > maxbytes, i_size_read(inode));
> > case SEEK_HOLE:
> > - inode_lock_shared(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_SHARED);
> > offset = iomap_seek_hole(inode, offset,
> > &ext4_iomap_report_ops);
> > - inode_unlock_shared(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
> > break;
> > case SEEK_DATA:
> > - inode_lock_shared(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_SHARED);
> > offset = iomap_seek_data(inode, offset,
> > &ext4_iomap_report_ops);
> > - inode_unlock_shared(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
> > break;
> > }
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 381813205f99..39dcc22667a1 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3930,7 +3930,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
> > return ret;
> > }
> >
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> >
> > /* No need to punch hole beyond i_size */
> > if (offset >= inode->i_size)
> > @@ -4037,7 +4037,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
> > out_dio:
> > up_write(&EXT4_I(inode)->i_mmap_sem);
> > out_mutex:
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > return ret;
> > }
> >
> > diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> > index 0b7f316fd30f..43b7a23dc57b 100644
> > --- a/fs/ext4/ioctl.c
> > +++ b/fs/ext4/ioctl.c
> > @@ -855,13 +855,13 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > if (err)
> > return err;
> >
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > err = ext4_ioctl_check_immutable(inode,
> > from_kprojid(&init_user_ns, ei->i_projid),
> > flags);
> > if (!err)
> > err = ext4_ioctl_setflags(inode, flags);
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > mnt_drop_write_file(filp);
> > return err;
> > }
> > @@ -892,7 +892,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > goto setversion_out;
> > }
> >
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
> > if (IS_ERR(handle)) {
> > err = PTR_ERR(handle);
> > @@ -907,7 +907,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > ext4_journal_stop(handle);
> >
> > unlock_out:
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > setversion_out:
> > mnt_drop_write_file(filp);
> > return err;
> > @@ -1026,9 +1026,9 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > * ext4_ext_swap_inode_data before we switch the
> > * inode format to prevent read.
> > */
> > - inode_lock((inode));
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > err = ext4_ext_migrate(inode);
> > - inode_unlock((inode));
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > mnt_drop_write_file(filp);
> > return err;
> > }
> > @@ -1272,7 +1272,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > if (err)
> > return err;
> >
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > ext4_fill_fsxattr(inode, &old_fa);
> > err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
> > if (err)
> > @@ -1287,7 +1287,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > goto out;
> > err = ext4_ioctl_setproject(filp, fa.fsx_projid);
> > out:
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > mnt_drop_write_file(filp);
> > return err;
> > }
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 7796e2ffc294..48b83b2cf0ad 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -2682,12 +2682,12 @@ static void ext4_orphan_cleanup(struct super_block *sb,
> > __func__, inode->i_ino, inode->i_size);
> > jbd_debug(2, "truncating inode %lu to %lld bytes\n",
> > inode->i_ino, inode->i_size);
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > truncate_inode_pages(inode->i_mapping, inode->i_size);
> > ret = ext4_truncate(inode);
> > if (ret)
> > ext4_std_error(inode->i_sb, ret);
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > nr_truncates++;
> > } else {
> > if (test_opt(sb, DEBUG))
> > @@ -5785,7 +5785,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
> > * files. If this fails, we return success anyway since quotas
> > * are already enabled and this is not a hard failure.
> > */
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > handle = ext4_journal_start(inode, EXT4_HT_QUOTA, 1);
> > if (IS_ERR(handle))
> > goto unlock_inode;
> > @@ -5795,7 +5795,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
> > ext4_mark_inode_dirty(handle, inode);
> > ext4_journal_stop(handle);
> > unlock_inode:
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > }
> > return err;
> > }
> > @@ -5887,7 +5887,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
> > if (err || ext4_has_feature_quota(sb))
> > goto out_put;
> >
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > /*
> > * Update modification times of quota files when userspace can
> > * start looking at them. If we fail, we return success anyway since
> > @@ -5902,7 +5902,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
> > ext4_mark_inode_dirty(handle, inode);
> > ext4_journal_stop(handle);
> > out_unlock:
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > out_put:
> > lockdep_set_quota_inode(inode, I_DATA_SEM_NORMAL);
> > iput(inode);
> > diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> > index 8966a5439a22..5c2dcc4c836a 100644
> > --- a/fs/ext4/xattr.c
> > +++ b/fs/ext4/xattr.c
> > @@ -422,9 +422,9 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
> > ext4_set_inode_state(inode, EXT4_STATE_LUSTRE_EA_INODE);
> > ext4_xattr_inode_set_ref(inode, 1);
> > } else {
> > - inode_lock(inode);
> > + ext4_ilock(inode, EXT4_IOLOCK_EXCL);
> > inode->i_flags |= S_NOQUOTA;
> > - inode_unlock(inode);
> > + ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
> > }
> >
> > *ea_inode = inode;
> > @@ -976,7 +976,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
> > u32 hash;
> > int ret;
> >
> > - inode_lock(ea_inode);
> > + ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
> >
> > ret = ext4_reserve_inode_write(handle, ea_inode, &iloc);
> > if (ret)
> > @@ -1030,7 +1030,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
> > ext4_warning_inode(ea_inode,
> > "ext4_mark_iloc_dirty() failed ret=%d", ret);
> > out:
> > - inode_unlock(ea_inode);
> > + ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
> > return ret;
> > }
> >
> > @@ -1380,10 +1380,11 @@ static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
> > block += 1;
> > }
> >
> > - inode_lock(ea_inode);
> > + ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
> > i_size_write(ea_inode, wsize);
> > ext4_update_i_disksize(ea_inode, wsize);
> > - inode_unlock(ea_inode);
> > + ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
> > +
> >
> > ext4_mark_inode_dirty(handle, ea_inode);
> >
> > @@ -1432,9 +1433,9 @@ static struct inode *ext4_xattr_inode_create(handle_t *handle,
> > */
> > dquot_free_inode(ea_inode);
> > dquot_drop(ea_inode);
> > - inode_lock(ea_inode);
> > + ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
> > ea_inode->i_flags |= S_NOQUOTA;
> > - inode_unlock(ea_inode);
> > + ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
> > }
> >
> > return ea_inode;
> > --
> > 2.21.0
> >
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
On 11/20/19 10:05 PM, Matthew Wilcox wrote:
> On Wed, Nov 20, 2019 at 05:48:30PM +0530, Ritesh Harjani wrote:
>> Not against your suggestion here.
>> But in kernel I do see a preference towards object followed by a verb.
>> At least in vfs I see functions like inode_lock()/unlock().
>>
>> Plus I would not deny that this naming is also inspired from
>> xfs_ilock()/iunlock API names.
>
> I see those names as being "classical Unix" heritage (eh, maybe SysV).
>
>> hmm, it makes the macro name longer if I do it that way.
>> But that's ok. Is the below macro name better?
>>
>> #define EXT4_INODE_IOLOCK_EXCL (1 << 0)
>> #define EXT4_INODE_IOLOCK_SHARED (1 << 1)
>
> In particular, Linux tends to prefer read/write instead of
> shared/exclusive terminology. rwlocks, rwsems, rcu_read_lock, seqlocks.
> shared/exclusive is used by file locks. And XFS ;-)
>
> I agree with Jan; just leave them opencoded.
Sure.
>
> Probably worth adding inode_lock_downgrade() to fs.h instead of
> accessing i_rwsem directly.
>
Yup, makes sense. But since this series is independent of that change,
let me add that as a separate patch after this series.
Thanks for the review!!
-ritesh
Hello Jan,
Thanks for a thorough review.
On 11/20/19 7:25 PM, Jan Kara wrote:
> On Wed 20-11-19 10:30:23, Ritesh Harjani wrote:
>> Earlier there was no shared lock in the DIO read path.
>> But this patch (16c54688592ce: ext4: Allow parallel DIO reads)
>> simplified some of the locking mechanism while still allowing
>> for parallel DIO reads by adding a shared lock in the inode DIO
>> read path.
>>
>> But this created a problem with mixed read/write workloads,
>> due to the fact that in the DIO path, we first start with an
>> exclusive lock and only when we determine that it is an overwrite
>> IO do we downgrade the lock. This causes the problem, since
>> with the above patch we have shared locking in DIO reads.
>>
>> So, this patch tries to fix this issue by starting with a
>> shared lock and then switching to an exclusive lock only
>> when required, based on ext4_dio_write_checks().
>>
>> Other than that, it also simplifies the cases below:
>>
>> 1. Simplified the ext4_unaligned_aio API to ext4_unaligned_io.
>> The previous API was abused in the sense that it was not really
>> checking for AIO anywhere; it also used to check for extending writes.
>> So this API was renamed and simplified to ext4_unaligned_io(),
>> which actually only checks if the IO is really unaligned.
>>
>> Now, in case of unaligned direct IO, iomap_dio_rw needs to do
>> zeroing of partial blocks and that requires serialization
>> against other direct IOs in the same block. So we take an excl iolock
>> for any unaligned DIO.
>> In case of AIO we also need to wait for any outstanding IOs to
>> complete so that the conversion from unwritten to written completes
>> before anyone tries to map the overlapping block. Hence we take an
>> excl iolock and also call inode_dio_wait() for the unaligned DIO case.
>> Please note that since we anyway take an exclusive lock for unaligned IO,
>> inode_dio_wait() becomes a no-op in case of non-AIO DIO.
>>
>> 2. Added ext4_extending_io(). This checks if the IO is extending the file.
>>
>> 3. Added ext4_dio_write_checks().
>> In this we start with a shared iolock and only switch to an exclusive
>> iolock if required. So in most cases aligned, non-extending,
>> dioread_nolock overwrites are done with shared locking.
>> If not, then we restart the operation in ext4_dio_write_checks()
>> after acquiring the excl iolock.
>>
>> Signed-off-by: Ritesh Harjani <[email protected]>
>> ---
>> fs/ext4/file.c | 191 ++++++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 142 insertions(+), 49 deletions(-)
>
> Thanks for the patch! Some comments below...
>
>> @@ -365,15 +383,110 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
>> .end_io = ext4_dio_write_end_io,
>> };
>>
>> +/*
>> + * The intention here is to start with shared lock acquired then see if any
>> + * condition requires an exclusive inode lock. If yes, then we restart the
>> + * whole operation by releasing the shared lock and acquiring exclusive lock.
>> + *
>> + * - For unaligned_io we never take shared lock as it may cause data corruption
>> + * when two unaligned IO tries to modify the same block e.g. while zeroing.
>> + *
>> + * - For extending writes case we don't take the shared lock, since it requires
>> + * updating inode i_disksize and/or orphan handling with exclusive lock.
>> + *
>> + * - shared locking will only be true mostly in case of overwrites with
>> + * dioread_nolock mode. Otherwise we will switch to excl. iolock mode.
>> + */
>> +static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>> + unsigned int *iolock, bool *unaligned_io,
>> + bool *extend)
>> +{
>> + struct file *file = iocb->ki_filp;
>> + struct inode *inode = file_inode(file);
>> + loff_t offset = iocb->ki_pos;
>> + loff_t final_size;
>> + size_t count;
>> + ssize_t ret;
>> +
>> +restart:
>> + /* Fallback to buffered I/O if the inode does not support direct I/O. */
>> + if (!ext4_dio_supported(inode)) {
>> + ext4_iunlock(inode, *iolock);
>> + return ext4_buffered_write_iter(iocb, from);
>> + }
>
> I don't think it is good to hide things like this fallback to buffered IO
> in ext4_dio_write_checks(). Similarly with 'unaligned_io' and 'extend'
Yes, makes sense. Will move the above block of code into
ext4_dio_write_iter() before calling ext4_dio_write_checks().
> variables below. So I'd rather leave this in ext4_dio_write_iter() and just
> move file_modified() from ext4_write_checks() since that seems to be the
Yes, that has been done in this patch itself. We removed file_modified()
from ext4_write_checks() & renamed it to ext4_generic_write_checks().
> only thing that cannot be always done with shared i_rwsem, can it?
AFAIU, that's right. Exclusive lock must be needed to change the
security info for inode.
>
>> +
>> + ret = ext4_generic_write_checks(iocb, from);
>> + if (ret <= 0) {
>> + ext4_iunlock(inode, *iolock);
>> + return ret;
>> + }
>> +
>> + /* Recalculate since offset & count may change above. */
>> + offset = iocb->ki_pos;
>> + count = iov_iter_count(from);
>> + final_size = offset + count;
>> +
>> + if (ext4_unaligned_io(inode, from, offset))
>> + *unaligned_io = true;
>
> No need to recheck alignment here. That cannot change over time regardless
> of locks we hold...
Hmm, I think I got confused by the iov_iter_truncate() call in
ext4_generic_write_checks().
But looking at it again, I agree that the alignment won't change here.
Will remove the alignment check from here.
>
>> +
>> + if (ext4_extending_io(inode, offset, count))
>> + *extend = true;
>> + /*
>> + * Determine whether the IO operation will overwrite allocated
>> + * and initialized blocks. If so, check to see whether it is
>> + * possible to take the dioread_nolock path.
>> + *
>> + * We need exclusive i_rwsem for changing security info
>> + * in file_modified().
>> + */
>> + if (*iolock == EXT4_IOLOCK_SHARED &&
>> + (!IS_NOSEC(inode) || *unaligned_io || *extend ||
>> + !ext4_should_dioread_nolock(inode) ||
>> + !ext4_overwrite_io(inode, offset, count))) {
>> + ext4_iunlock(inode, *iolock);
>> + *iolock = EXT4_IOLOCK_EXCL;
>> + ext4_ilock(inode, *iolock);
>> + goto restart;
>> + }
>> +
>> + ret = file_modified(file);
>> + if (ret < 0) {
>> + ext4_iunlock(inode, *iolock);
>> + return ret;
>> + }
>> +
>> + return count;
>> +}
>> +
>> static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> {
>> ssize_t ret;
>> - size_t count;
>> - loff_t offset;
>> handle_t *handle;
>> struct inode *inode = file_inode(iocb->ki_filp);
>> - bool extend = false, overwrite = false, unaligned_aio = false;
>> - unsigned int iolock = EXT4_IOLOCK_EXCL;
>> + loff_t offset = iocb->ki_pos;
>> + size_t count = iov_iter_count(from);
>> + bool extend = false, unaligned_io = false;
>> + unsigned int iolock = EXT4_IOLOCK_SHARED;
>> +
>> + /*
>> + * We initially start with shared inode lock
>> + * unless it is unaligned IO which needs
>> + * exclusive lock anyways.
>> + */
>> + if (ext4_unaligned_io(inode, from, offset)) {
>> + unaligned_io = true;
>> + iolock = EXT4_IOLOCK_EXCL;
>> + }
>> + /*
>> + * Extending writes need exclusive lock.
>> + */
>> + if (ext4_extending_io(inode, offset, count)) {
>> + extend = true;
>> + iolock = EXT4_IOLOCK_EXCL;
>> + }
>
> You cannot read EXT4_I(inode)->i_disksize without some lock (either
> inode->i_rwsem or EXT4_I(inode)->i_data_sem). So I'd just do here a quick
> check with i_size here (probably don't set extend, but just make note to
> start with exclusive i_rwsem) and later when we hold i_rwsem, we can do a
> reliable check.
Hmm, my bad. I think I later moved this block of code from
ext4_dio_write_checks() down here.
Thanks for catching it.
Will only do a quick check against i_size here then, along with a
comment that the reliable check is done in ext4_dio_write_checks()
via ext4_extending_io() once the lock is held.
>
>> + if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
>> + iolock = EXT4_IOLOCK_EXCL;
>>
>> if (iocb->ki_flags & IOCB_NOWAIT) {
>> if (!ext4_ilock_nowait(inode, iolock))
>> @@ -382,47 +495,28 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> ext4_ilock(inode, iolock);
>> }
>>
>> - if (!ext4_dio_supported(inode)) {
>> - ext4_iunlock(inode, iolock);
>> - /*
>> - * Fallback to buffered I/O if the inode does not support
>> - * direct I/O.
>> - */
>> - return ext4_buffered_write_iter(iocb, from);
>> - }
>> -
>> - ret = ext4_write_checks(iocb, from);
>> - if (ret <= 0) {
>> - ext4_iunlock(inode, iolock);
>> + ret = ext4_dio_write_checks(iocb, from, &iolock, &unaligned_io,
>> + &extend);
>> + if (ret <= 0)
>> return ret;
>> - }
>>
>> - /*
>> - * Unaligned asynchronous direct I/O must be serialized among each
>> - * other as the zeroing of partial blocks of two competing unaligned
>> - * asynchronous direct I/O writes can result in data corruption.
>> - */
>> offset = iocb->ki_pos;
>> count = iov_iter_count(from);
>> - if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
>> - !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
>> - unaligned_aio = true;
>> - inode_dio_wait(inode);
>> - }
>>
>> /*
>> - * Determine whether the I/O will overwrite allocated and initialized
>> - * blocks. If so, check to see whether it is possible to take the
>> - * dioread_nolock path.
>> + * Unaligned direct IO must be serialized among each other as zeroing
>> + * of partial blocks of two competing unaligned IOs can result in data
>> + * corruption.
>> + *
>> + * So we make sure we don't allow any unaligned IO in flight.
>> + * For IOs where we need not wait (like unaligned non-AIO DIO),
>> + * below inode_dio_wait() may anyway become a no-op, since we start
>> + * with exclusive lock.
>> */
>> - if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
>> - ext4_should_dioread_nolock(inode)) {
>> - overwrite = true;
>> - ext4_ilock_demote(inode, iolock);
>> - iolock = EXT4_IOLOCK_SHARED;
>> - }
>> + if (unaligned_io)
>> + inode_dio_wait(inode);
>>
>> - if (offset + count > EXT4_I(inode)->i_disksize) {
>> + if (extend) {
>> handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>> if (IS_ERR(handle)) {
>> ret = PTR_ERR(handle);
>> @@ -435,12 +529,11 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> goto out;
>> }
>>
>> - extend = true;
>> ext4_journal_stop(handle);
>> }
>>
>> ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
>> - is_sync_kiocb(iocb) || unaligned_aio || extend);
>> + is_sync_kiocb(iocb) || unaligned_io || extend);
>>
>> if (extend)
>> ret = ext4_handle_inode_extension(inode, offset, ret, count);
>
> Honza
>
Hello Jan,
Sorry for getting to this a little late.
On 11/20/19 8:02 PM, Jan Kara wrote:
> On Wed 20-11-19 10:30:24, Ritesh Harjani wrote:
>> We were using shared locking only in case of dioread_nolock
>> mount option in case of DIO overwrites. This mount condition
>> is not needed anymore with current code, since:-
>>
>> 1. No race between buffered writes & DIO overwrites.
>> Since buffIO writes takes exclusive locks & DIO overwrites
>> will take share locking. Also DIO path will make sure
>> to flush and wait for any dirty page cache data.
>>
>> 2. No race between buffered reads & DIO overwrites, since there
>> is no block allocation that is possible with DIO overwrites.
>> So no stale data exposure should happen. Same is the case
>> between DIO reads & DIO overwrites.
>>
>> 3. Also other paths like truncate is protected,
>> since we wait there for any DIO in flight to be over.
>>
>> 4. In case of buffIO writes followed by DIO reads:
>> Since here also we take exclusive locks in ext4_write_begin/end().
>> There is no risk of exposing any stale data in this case.
>> Since after ext4_write_end, iomap_dio_rw() will wait to flush &
>> wait for any dirty page cache data.
>>
>> Signed-off-by: Ritesh Harjani <[email protected]>
>
> There's one more case to consider here as I mentioned in [1]. There can be
Yes, I should have mentioned this in the cover letter, along with my
thoughts on it.
Since the race already exists and is not introduced by this patch,
I was of the opinion that we should handle it incrementally, as a
separate patch series after this one.
Let me know your thoughts on that.
Also, I wanted to have some more discussion on this race before
making the changes.
But nevertheless, now is the right time to discuss those changes here.
> mmap write instantiating dirty page and then someone starting writeback
> against that page while DIO read is running still theoretically leading to
> stale data exposure. Now this patch does not have influence on that race
> but:
Yes, agreed.
>
> 1) We need to close the race mentioned above. Maybe we could do that by
> proactively allocating unwritten blocks for a page being faulted when there
> is direct IO running against the file - the one who fills holes through
> mmap write while direct IO is running on the file deserves to suffer the
> performance penalty...
I was giving this some thought. Even if we penalize mmap writes as you
mention above, what I am not sure about is how we can reliably detect
that a DIO is in progress.
Say we check atomic_read(&inode->i_dio_count) in the mmap
ext4_page_mkwrite() path: that cannot be reliable unless there is some
sort of lock protection, no?
Because a DIO can still sneak in right after the check, right?
2. Also, what about the delalloc case? Even for delalloc, should we go
ahead and allocate the unwritten blocks? That may also need to
start/stop the journal, which could add more latency, no?
>
> 2) After this patch there's no point in having dioread_nolock at all so we
> can just make that mount option no-op and drop all the precautions from the
> buffered IO path connected with dioread_nolock.
Yes, with some careful review we should be able to drop those
precautions related to dioread_nolock, after getting above race fixed.
>
> [1] https://lore.kernel.org/linux-ext4/[email protected]
>
> Honza
>
>> ---
>> fs/ext4/file.c | 25 +++++++++++++++++++------
>> 1 file changed, 19 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index 18cbf9fa52c6..b97efc89cd63 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -383,6 +383,17 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
>> .end_io = ext4_dio_write_end_io,
>> };
>>
>> +static bool ext4_dio_should_shared_lock(struct inode *inode)
>> +{
>> + if (!S_ISREG(inode->i_mode))
>
> This cannot happen for DIO so no point in checking here.
>
>> + return false;
>> + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
>
> Why this?
>
>> + return false;
>> + if (ext4_should_journal_data(inode))
>
> We don't do DIO when journalling data so no point in checking here
> (dio_supported() already checked this).
>
> Honza
>> + return false;
>> + return true;
>> +}
>> +
Yes, agreed, we don't need this function (ext4_dio_should_shared_lock)
anyway.
>> /*
>> * The intention here is to start with shared lock acquired then see if any
>> * condition requires an exclusive inode lock. If yes, then we restart the
>> @@ -394,8 +405,8 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
>> * - For extending writes case we don't take the shared lock, since it requires
>> * updating inode i_disksize and/or orphan handling with exclusive lock.
>> *
>> - * - shared locking will only be true mostly in case of overwrites with
>> - * dioread_nolock mode. Otherwise we will switch to excl. iolock mode.
>> + * - shared locking will only be true mostly in case of overwrites.
>> + * Otherwise we will switch to excl. iolock mode.
>> */
>> static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>> unsigned int *iolock, bool *unaligned_io,
>> @@ -433,15 +444,14 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>> *extend = true;
>> /*
>> * Determine whether the IO operation will overwrite allocated
>> - * and initialized blocks. If so, check to see whether it is
>> - * possible to take the dioread_nolock path.
>> + * and initialized blocks.
>> *
>> * We need exclusive i_rwsem for changing security info
>> * in file_modified().
>> */
>> if (*iolock == EXT4_IOLOCK_SHARED &&
>> (!IS_NOSEC(inode) || *unaligned_io || *extend ||
>> - !ext4_should_dioread_nolock(inode) ||
>> + !ext4_dio_should_shared_lock(inode) ||
>> !ext4_overwrite_io(inode, offset, count))) {
>> ext4_iunlock(inode, *iolock);
>> *iolock = EXT4_IOLOCK_EXCL;
>> @@ -485,7 +495,10 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> iolock = EXT4_IOLOCK_EXCL;
>> }
>>
>> - if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
>> + /*
>> + * Check if we should continue with shared iolock
>> + */
>> + if (iolock == EXT4_IOLOCK_SHARED && !ext4_dio_should_shared_lock(inode))
>> iolock = EXT4_IOLOCK_EXCL;
>>
>> if (iocb->ki_flags & IOCB_NOWAIT) {
>> --
>> 2.21.0
>>
On 11/26/19 4:21 PM, Ritesh Harjani wrote:
> Hello Jan,
>
> Sorry about getting a little late on this.
>
> On 11/20/19 8:02 PM, Jan Kara wrote:
>> On Wed 20-11-19 10:30:24, Ritesh Harjani wrote:
>>
>> There's one more case to consider here as I mentioned in [1]. There
>> can be
>
> Yes, I should have mentioned about this in cover letter and about my
> thoughts on that.
> I was of the opinion that since the race is already existing
> and it may not be caused due to this patch, so we should handle that in
> incremental fashion and as a separate patch series after this one.
> Let me know your thoughts on above.
>
> Also, I wanted to have some more discussions on this race before
> making the changes.
> But nevertheless, it's the right time to discuss those changes here.
>
>> mmap write instantiating dirty page and then someone starting writeback
>> against that page while DIO read is running still theoretically
>> leading to
>> stale data exposure. Now this patch does not have influence on that race
>> but:
>
> Yes, agreed.
>
>>
>> 1) We need to close the race mentioned above. Maybe we could do that by
>> proactively allocating unwritten blocks for a page being faulted when
>> there
>> is direct IO running against the file - the one who fills holes through
>> mmap write while direct IO is running on the file deserves to suffer the
>> performance penalty...
>
> I was giving this a thought. So even if we try to penalize mmap
> write as you mentioned above, what I am not sure about it, is that, how
> can we reliably detect that the DIO is in progress?
>
> Say even if we try to check for atomic_read(&inode->i_dio_count) in mmap
> ext4_page_mkwrite path, it cannot be reliable unless there is some sort
> of a lock protection, no?
> Because after the check the DIO can still snoop in, right?
IIRC, we had some discussion around this at [1] last time.
IIUC, you were suggesting always using unwritten extents in
->get_block for ext4_page_mkwrite, and replacing the
'ext4_should_dioread_nolock()' check in ext4_writepages() with an
'is there any DIO in flight' check.
It was discussed that, to make that check reliable, we should have all
pages locked for writeback. But how does that ensure that no DIO is
currently in flight? One such case could be:
The filemap_write_and_wait_range() call from the DIO side
(iomap_dio_rw) may complete, and a context switch may happen before
inode_dio_begin() is called.
In parallel we get the page fault, which somehow also results in
writeback of pages calling ext4_writepages(). Here, even though we
check 'is DIO in progress' after making sure all the writeback pages
are locked, we may still miss the DIO if the context switches back to
the DIO process at that point. Am I missing anything?
1. Is there any lock guarantee which I am overlooking here?
2. Do you think we should use some other lock to provide the guarantee
between page_mkwrite & DIO read?
3. What if we always go via unwritten blocks in ext4_writepages() too?
Hmm, but I am not sure we should really do this today, as there are
some known xfstests failures with the dioread_nolock path for
blocksize < pagesize.
Although I have mostly observed those problems with 1K blocksize &
4K pagesize on x86.
[1]
https://lore.kernel.org/linux-ext4/[email protected]/
>
>
> 2. Also what about the delalloc opt. in that case? Even for delalloc
> should we go ahead and allocate the unwritten blocks? That may even need
> to start/stop the journal which could add more latency, no?
>
One benefit of always using unwritten blocks in ext4_page_mkwrite() is
that the stale data exposure problem (from DIO reads) would go away.
One small thing to note is that it will incur some additional latency
for delalloc too, but that already exists today with nodelalloc or
when disk space is low.
What I was thinking is whether there is a reliable way of directly
detecting 'is DIO in flight' in ext4_page_mkwrite(). If yes, then we
could fall back to the unwritten-block allocation path in
ext4_page_mkwrite(), and no other changes would be needed. Right?
Hello Ritesh!
On Tue 26-11-19 16:21:15, Ritesh Harjani wrote:
> On 11/20/19 8:02 PM, Jan Kara wrote:
> > On Wed 20-11-19 10:30:24, Ritesh Harjani wrote:
> >
> > There's one more case to consider here as I mentioned in [1]. There can be
>
> Yes, I should have mentioned about this in cover letter and about my
> thoughts on that.
> I was of the opinion that since the race is already existing
> and it may not be caused due to this patch, so we should handle that in
> incremental fashion and as a separate patch series after this one.
> Let me know your thoughts on above.
Yes, I'm fine with that.
> Also, I wanted to have some more discussions on this race before
> making the changes.
> But nevertheless, it's the right time to discuss those changes here.
>
> > mmap write instantiating dirty page and then someone starting writeback
> > against that page while DIO read is running still theoretically leading to
> > stale data exposure. Now this patch does not have influence on that race
> > but:
>
> Yes, agreed.
>
> >
> > 1) We need to close the race mentioned above. Maybe we could do that by
> > proactively allocating unwritten blocks for a page being faulted when there
> > is direct IO running against the file - the one who fills holes through
> > mmap write while direct IO is running on the file deserves to suffer the
> > performance penalty...
>
> I was giving this a thought. So even if we try to penalize mmap
> write as you mentioned above, what I am not sure about it, is that, how can
> we reliably detect that the DIO is in progress?
>
> Say even if we try to check for atomic_read(&inode->i_dio_count) in mmap
> ext4_page_mkwrite path, it cannot be reliable unless there is some sort of a
> lock protection, no?
> Because after the check the DIO can still snoop in, right?
Yes, doing this reliably will need some code tweaking. Also, thinking
about this in detail, doing a reliable check in ext4_page_mkwrite() is
somewhat difficult, so it will probably be less error-prone to deal
with the race in the writeback path.
My preferred way of dealing with this would be to move inode_dio_begin()
call in iomap_dio_rw() a bit earlier before page cache invalidation and add
there smp_mb_after_atomic() (so that e.g. nrpages checks cannot get
reordered before the increment). Then the check on i_dio_count in
ext4_writepages() will be reliable if we do it after gathering and locking
pages for writeback (i.e., in mpage_map_and_submit_extent()) - either we
see i_dio_count elevated and use the safe (but slower) writeback using
unwritten extents, or we see don't and then we are sure DIO will not start
until writeback of the pages we have locked has finished because of
filemap_write_and_wait() call in iomap_dio_rw().
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 26-11-19 18:15:25, Ritesh Harjani wrote:
> On 11/26/19 4:21 PM, Ritesh Harjani wrote:
> > On 11/20/19 8:02 PM, Jan Kara wrote:
> > > On Wed 20-11-19 10:30:24, Ritesh Harjani wrote:
> > >
> > > There's one more case to consider here as I mentioned in [1]. There
> > > can be
> >
> > Yes, I should have mentioned about this in cover letter and about my
> > thoughts on that.
> > I was of the opinion that since the race is already existing
> > and it may not be caused due to this patch, so we should handle that in
> > incremental fashion and as a separate patch series after this one.
> > Let me know your thoughts on above.
> >
> > Also, I wanted to have some more discussions on this race before
> > making the changes.
> > But nevertheless, it's the right time to discuss those changes here.
> >
> > > mmap write instantiating dirty page and then someone starting writeback
> > > against that page while DIO read is running still theoretically
> > > leading to
> > > stale data exposure. Now this patch does not have influence on that race
> > > but:
> >
> > Yes, agreed.
> >
> > >
> > > 1) We need to close the race mentioned above. Maybe we could do that by
> > > proactively allocating unwritten blocks for a page being faulted
> > > when there
> > > is direct IO running against the file - the one who fills holes through
> > > mmap write while direct IO is running on the file deserves to suffer the
> > > performance penalty...
> >
> > I was giving this a thought. So even if we try to penalize mmap
> > write as you mentioned above, what I am not sure about it, is that, how
> > can we reliably detect that the DIO is in progress?
> >
> > Say even if we try to check for atomic_read(&inode->i_dio_count) in mmap
> > ext4_page_mkwrite path, it cannot be reliable unless there is some sort
> > of a lock protection, no?
> > Because after the check the DIO can still snoop in, right?
>
> IIRC, we had some discussion around this at [1] last time.
> IIUC, you were mentioning to always using unwritten extents
> as ->get_block in ext4_page_mkwrite.
> And in ext4_writepages(), we replace 'ext4_should_dioread_nolock()' check
> with 'is there any DIO in flight'.
Yes.
> It was discussed to do that check reliably we should have all pages
> locked for writeback. But how does that ensure that DIO is not currently in
> flight? One such case could be:-
> It may still happen that the check to filemap_write_and_wait_range()
> from DIO (iomap_dio_rw) got completed and before calling inode_dio_begin() a
> context switch happens.
> And in parallel we got the page fault which somehow also resulted into
> writeback of pages calling ext4_writepages(). Here when we checked for
> 'is DIO in progress' after making sure all the writeback pages are locked,
> we may still miss the reliable check if the context switches back to
> the DIO process right then. Am I missing anything?
>
> 1. Is there any lock guarantee which I am overlooking here?
>
> 2. Do you think we should use some other lock to provide the guarantee
> between page_mkwrite & DIO read?
See my previous reply.
> 3. What if we always go via unwritten blocks in ext4_writepages too?
> hmm, but I am not sure if should really do this today as there are some
> known xfstests failures for blocksize < pagesize with dioread_nolock path
> right.
> Although those problems I have mostly observed with 1K blocksize & 4K
> pagesize on x86 platform.
Well, the problems with dioread_nolock writeback need to be fixed. That's
for sure but orthogonal to this discussion. The reason why we don't want to
always use unwritten extents for writeback in ext4_writepages() is because
it is slower (we have to convert unwritten extents to written ones after IO
completion and that does add up, although it may be worthwhile to get fresh
performance numbers).
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hello Jan,
I have put together something based on our discussion.
Could you please share your thoughts on the below?
On 11/29/19 10:48 PM, Jan Kara wrote:
> Hello Ritesh!
>
> On Tue 26-11-19 16:21:15, Ritesh Harjani wrote:
>> On 11/20/19 8:02 PM, Jan Kara wrote:
>>> On Wed 20-11-19 10:30:24, Ritesh Harjani wrote:
>>>
>>> There's one more case to consider here as I mentioned in [1]. There can be
>>
>> Yes, I should have mentioned about this in cover letter and about my
>> thoughts on that.
>> I was of the opinion that since the race is already existing
>> and it may not be caused due to this patch, so we should handle that in
>> incremental fashion and as a separate patch series after this one.
>> Let me know your thoughts on above.
>
> Yes, I'm fine with that.
Sure thanks, will do that.
>
>> Also, I wanted to have some more discussions on this race before
>> making the changes.
>> But nevertheless, it's the right time to discuss those changes here.
>>
>>> mmap write instantiating dirty page and then someone starting writeback
>>> against that page while DIO read is running still theoretically leading to
>>> stale data exposure. Now this patch does not have influence on that race
>>> but:
>>
>> Yes, agreed.
>>
>>>
>>> 1) We need to close the race mentioned above. Maybe we could do that by
>>> proactively allocating unwritten blocks for a page being faulted when there
>>> is direct IO running against the file - the one who fills holes through
>>> mmap write while direct IO is running on the file deserves to suffer the
>>> performance penalty...
>>
>> I was giving this a thought. So even if we try to penalize mmap
>> write as you mentioned above, what I am not sure about it, is that, how can
>> we reliably detect that the DIO is in progress?
>>
>> Say even if we try to check for atomic_read(&inode->i_dio_count) in mmap
>> ext4_page_mkwrite path, it cannot be reliable unless there is some sort of a
>> lock protection, no?
>> Because after the check the DIO can still snoop in, right?
>
> Yes, doing this reliably will need some code tweaking. Also thinking about
> this in detail, doing a reliable check in ext4_page_mkwrite() is
> somewhat difficult so it will be probably less error-prone to deal with the
> race in the writeback path.
hmm. But if we don't do it in ext4_page_mkwrite(), then I am not sure
how to handle the nodelalloc scenario, where we directly go and
allocate a block via ext4_get_block() in ext4_page_mkwrite(),
as explained below.
I guess we may need some tweaking in both places.
>
> My preferred way of dealing with this would be to move inode_dio_begin()
> call in iomap_dio_rw() a bit earlier before page cache invalidation and add
> there smp_mb__after_atomic() (so that e.g. nrpages checks cannot get
> reordered before the increment). Then the check on i_dio_count in
> ext4_writepages() will be reliable if we do it after gathering and locking
> pages for writeback (i.e., in mpage_map_and_submit_extent()) - either we
> see i_dio_count elevated and use the safe (but slower) writeback using
> unwritten extents, or we don't and then we are sure DIO will not start
> until writeback of the pages we have locked has finished because of
> filemap_write_and_wait() call in iomap_dio_rw().
>
>
Thanks for explaining this in detail. I think I understand this part
now; earlier my understanding of mapping->nrpages was incomplete.
AFAIU, with your above suggestion the race won't happen in the delalloc
case. But what if it is a nodelalloc mount option?
Say with the above changes, i.e. after tweaking the iomap_dio_rw() code
as you mentioned, the below race could still happen, right?
DIO read task                          mmap write task
-------------                          ---------------
iomap_dio_rw()
  filemap_write_and_wait_range()
  inode_dio_begin()
  smp_mb__after_atomic()
  invalidate_inode_pages2_range()
                                       ext4_page_mkwrite()
                                         block_page_mkwrite()
                                           lock_page()
                                           ext4_get_block()
                                             ext4_map_blocks()
                                             // this will return an
                                             // IOMAP_MAPPED entry
  submit_bio()
  // this goes and reads the block
  // with stale data allocated by
  // ext4_page_mkwrite()
Now, I am assuming that ext4_get_block() via the ext4_page_mkwrite()
path may try to create the block for the hole then and there itself.
And if submit_bio() from the DIO path is serviced late, i.e. after
ext4_get_block() has already allocated a block there, then this may
expose stale data. Thoughts?
So to avoid both such races, in the delalloc & nodelalloc paths, we
should add the checks in both ext4_writepages() & ext4_page_mkwrite().
For ext4_page_mkwrite(), why don't we just change the "get_block"
function pointer which is passed to block_page_mkwrite(), as below?
This should solve our race, since ext4_dio_check_get_block() will only
be called with lock_page() held. And with inode_dio_begin() now moved
up before invalidate_inode_pages2_range(), we can be sure in the
ext4_page_mkwrite() path whether DIO is currently running or not.
Does this look correct to you?
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 381813205f99..74c33d03592c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -806,6 +806,19 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
EXT4_GET_BLOCKS_IO_CREATE_EXT);
}
+int ext4_dio_check_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ get_block_t *get_block;
+
+ if (!atomic_read(&inode->i_dio_count))
+ get_block = ext4_get_block;
+ else
+ get_block = ext4_get_block_unwritten;
+
+ return get_block(inode, iblock, bh, create);
+}
+
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096
@@ -2332,7 +2345,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
struct inode *inode = mpd->inode;
struct ext4_map_blocks *map = &mpd->map;
int get_blocks_flags;
- int err, dioread_nolock;
+ int err;
+ bool dio_in_progress = atomic_read(&inode->i_dio_count);
trace_ext4_da_write_pages_extent(inode, map);
/*
@@ -2353,8 +2367,14 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
EXT4_GET_BLOCKS_METADATA_NOFAIL |
EXT4_GET_BLOCKS_IO_SUBMIT;
- dioread_nolock = ext4_should_dioread_nolock(inode);
- if (dioread_nolock)
+
+ /*
+ * There could be race between DIO read & ext4_page_mkwrite
+ * where in delalloc case, we may go and try to allocate the
+ * block here but if DIO read is in progress then it may expose
+ * stale data, hence use unwritten blocks for allocation
+ * when DIO is in progress.
+ */
+ if (dio_in_progress)
get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
if (map->m_flags & (1 << BH_Delay))
get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
@@ -2362,7 +2382,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
if (err < 0)
return err;
- if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
+ if (dio_in_progress && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
if (!mpd->io_submit.io_end->handle &&
ext4_handle_valid(handle)) {
mpd->io_submit.io_end->handle = handle->h_rsv_handle;
@@ -5906,10 +5926,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
}
unlock_page(page);
/* OK, we need to fill the hole... */
- if (ext4_should_dioread_nolock(inode))
- get_block = ext4_get_block_unwritten;
- else
- get_block = ext4_get_block;
+ get_block = ext4_dio_check_get_block;
retry_alloc:
handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
ext4_writepage_trans_blocks(inode));
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 2f88d64c2a4d..09d0601e5ecb 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -465,6 +465,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (ret)
goto out_free_dio;
+ inode_dio_begin(inode);
+ smp_mb__after_atomic();
/*
* Try to invalidate cache pages for the range we're direct
* writing. If this invalidation fails, tough, the write will
@@ -484,8 +486,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
goto out_free_dio;
}
- inode_dio_begin(inode);
-
blk_start_plug(&plug);
do {
ret = iomap_apply(inode, pos, count, flags, ops, dio,
-ritesh
Hello Ritesh!
On Tue 03-12-19 17:24:44, Ritesh Harjani wrote:
> On 11/29/19 10:48 PM, Jan Kara wrote:
> > > Also, I wanted to have some more discussions on this race before
> > > making the changes.
> > > But nevertheless, it's the right time to discuss those changes here.
> > >
> > > > mmap write instantiating dirty page and then someone starting writeback
> > > > against that page while DIO read is running still theoretically leading to
> > > > stale data exposure. Now this patch does not have influence on that race
> > > > but:
> > >
> > > Yes, agreed.
> > >
> > > >
> > > > 1) We need to close the race mentioned above. Maybe we could do that by
> > > > proactively allocating unwritten blocks for a page being faulted when there
> > > > is direct IO running against the file - the one who fills holes through
> > > > mmap write while direct IO is running on the file deserves to suffer the
> > > > performance penalty...
> > >
> > > I was giving this a thought. So even if we try to penalize mmap
> > > write as you mentioned above, what I am not sure about it, is that, how can
> > > we reliably detect that the DIO is in progress?
> > >
> > > Say even if we try to check for atomic_read(&inode->i_dio_count) in mmap
> > > ext4_page_mkwrite path, it cannot be reliable unless there is some sort of a
> > > lock protection, no?
> > > Because after the check the DIO can still snoop in, right?
> >
> > Yes, doing this reliably will need some code tweaking. Also thinking about
> > this in detail, doing a reliable check in ext4_page_mkwrite() is
> > somewhat difficult so it will be probably less error-prone to deal with the
> > race in the writeback path.
>
> hmm. But if we don't do in ext4_page_mkwrite, then I am afraid on
> how to handle nodelalloc scenario. Where we will directly go and
> allocate block via ext4_get_block() in ext4_page_mkwrite(),
> as explained below.
> I guess we may need some tweaking at both places.
Ok, I forgot to mention that. Yes, the nodelalloc case in
ext4_page_mkwrite() still needs tweaking. But that is not a
performance-sensitive path at all. So we can just have there:
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
get_block = ext4_get_block_unwritten;
else
get_block = ext4_get_block;
and be done with it. And yes, for inodes using indirect blocks, direct IO
reads can still theoretically expose data from blocks instantiated by hole
filling from ext4_page_mkwrite(). But that race has always been there
regardless of DIO locking and is hardly fixable with that on-disk format.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 12/3/19 6:09 PM, Jan Kara wrote:
>
> Hello Ritesh!
>
> On Tue 03-12-19 17:24:44, Ritesh Harjani wrote:
>> On 11/29/19 10:48 PM, Jan Kara wrote:
>>>> Also, I wanted to have some more discussions on this race before
>>>> making the changes.
>>>> But nevertheless, it's the right time to discuss those changes here.
>>>>
>>>>> mmap write instantiating dirty page and then someone starting writeback
>>>>> against that page while DIO read is running still theoretically leading to
>>>>> stale data exposure. Now this patch does not have influence on that race
>>>>> but:
>>>>
>>>> Yes, agreed.
>>>>
>>>>>
>>>>> 1) We need to close the race mentioned above. Maybe we could do that by
>>>>> proactively allocating unwritten blocks for a page being faulted when there
>>>>> is direct IO running against the file - the one who fills holes through
>>>>> mmap write while direct IO is running on the file deserves to suffer the
>>>>> performance penalty...
>>>>
>>>> I was giving this a thought. So even if we try to penalize mmap
>>>> write as you mentioned above, what I am not sure about it, is that, how can
>>>> we reliably detect that the DIO is in progress?
>>>>
>>>> Say even if we try to check for atomic_read(&inode->i_dio_count) in mmap
>>>> ext4_page_mkwrite path, it cannot be reliable unless there is some sort of a
>>>> lock protection, no?
>>>> Because after the check the DIO can still snoop in, right?
>>>
>>> Yes, doing this reliably will need some code tweaking. Also thinking about
>>> this in detail, doing a reliable check in ext4_page_mkwrite() is
>>> somewhat difficult so it will be probably less error-prone to deal with the
>>> race in the writeback path.
>>
>> hmm. But if we don't do in ext4_page_mkwrite, then I am afraid on
>> how to handle nodelalloc scenario. Where we will directly go and
>> allocate block via ext4_get_block() in ext4_page_mkwrite(),
>> as explained below.
>> I guess we may need some tweaking at both places.
>
> Ok, I forgot to mention that. Yes, the nodelalloc case in
> ext4_page_mkwrite() still needs tweaking. But that is not performance
> sensitive path at all. So we can just have there:
hmm. My thinking was: why use unwritten blocks, or switch from the
written to the unwritten method, when we can still avoid it?
>
> if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> get_block = ext4_get_block_unwritten;
> else
> get_block = ext4_get_block;
>
That said, adding a function like ext4_dio_check_get_block(), as
described in my previous email, is also trivial, and it would avoid
using unwritten blocks here when no DIO is in progress.
But if you think it's not worth it, then I will go with your suggestion
here.
> and be done with it. And yes, for inodes using indirect blocks, direct IO
> reads can still theoretically expose data from blocks instantiated by hole
> filling from ext4_page_mkwrite(). But that race has always been there
> regardless of DIO locking and is hardly fixable with that on-disk format.
>
Agreed.
> Honza
On Tue 03-12-19 18:40:47, Ritesh Harjani wrote:
> On 12/3/19 6:09 PM, Jan Kara wrote:
> >
> > Hello Ritesh!
> >
> > On Tue 03-12-19 17:24:44, Ritesh Harjani wrote:
> > > On 11/29/19 10:48 PM, Jan Kara wrote:
> > > > > Also, I wanted to have some more discussions on this race before
> > > > > making the changes.
> > > > > But nevertheless, it's the right time to discuss those changes here.
> > > > >
> > > > > > mmap write instantiating dirty page and then someone starting writeback
> > > > > > against that page while DIO read is running still theoretically leading to
> > > > > > stale data exposure. Now this patch does not have influence on that race
> > > > > > but:
> > > > >
> > > > > Yes, agreed.
> > > > >
> > > > > >
> > > > > > 1) We need to close the race mentioned above. Maybe we could do that by
> > > > > > proactively allocating unwritten blocks for a page being faulted when there
> > > > > > is direct IO running against the file - the one who fills holes through
> > > > > > mmap write while direct IO is running on the file deserves to suffer the
> > > > > > performance penalty...
> > > > >
> > > > > I was giving this a thought. So even if we try to penalize mmap
> > > > > write as you mentioned above, what I am not sure about it, is that, how can
> > > > > we reliably detect that the DIO is in progress?
> > > > >
> > > > > Say even if we try to check for atomic_read(&inode->i_dio_count) in mmap
> > > > > ext4_page_mkwrite path, it cannot be reliable unless there is some sort of a
> > > > > lock protection, no?
> > > > > Because after the check the DIO can still snoop in, right?
> > > >
> > > > Yes, doing this reliably will need some code tweaking. Also thinking about
> > > > this in detail, doing a reliable check in ext4_page_mkwrite() is
> > > > somewhat difficult so it will be probably less error-prone to deal with the
> > > > race in the writeback path.
> > >
> > > hmm. But if we don't do in ext4_page_mkwrite, then I am afraid on
> > > how to handle nodelalloc scenario. Where we will directly go and
> > > allocate block via ext4_get_block() in ext4_page_mkwrite(),
> > > as explained below.
> > > I guess we may need some tweaking at both places.
> >
> > Ok, I forgot to mention that. Yes, the nodelalloc case in
> > ext4_page_mkwrite() still needs tweaking. But that is not performance
> > sensitive path at all. So we can just have there:
>
> hmm. My thinking was: why use unwritten blocks, or switch from the
> written to the unwritten method, when we can still avoid it?
>
> >
> > if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> > get_block = ext4_get_block_unwritten;
> > else
> > get_block = ext4_get_block;
> >
>
> That said, adding a function like ext4_dio_check_get_block(), as
> described in my previous email, is also trivial, and it would avoid
> using unwritten blocks here when no DIO is in progress.
> But if you think it's not worth it, then I will go with your
> suggestion here.
Yeah, I would prefer to keep it simple. Otherwise you would have a rare
subcase of a rare case, meaning that code path will hardly ever get
tested, and that's not good for maintainability... Also note that the
check is not 100% reliable. There's still a race like:
mmap write task                        DIO read task
---------------                        -------------
ext4_page_mkwrite()
  block_page_mkwrite()
    lock_page(page);
    ...
    -> get_block()
       if (inode_dio_count(inode) > 0)
         -> false - use ext4_get_block()
                                       iomap_dio_rw()
                                         inode_dio_begin()
                                         filemap_write_and_wait()
                                           -> no dirty page yet -> bails
                                         invalidate_mapping_pages2()
    set_page_dirty(page);
  unlock_page(page);
                                           -> bails with error because the
                                              page is dirty. Warning is
                                              issued but stale data is still
                                              exposed.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR