2019-09-24 16:46:25

by Ritesh Harjani

Subject: [RFC-v2 0/2] ext4: Improve ilock scalability to improve performance in DirectIO workloads

Hello All,

These are the ilock patches which help improve the current inode lock scalability
problem in the ext4 DIO mixed-read/write workload case. The problem was first
reported by Joseph [1]. These patches are based upon an upstream discussion
with Jan Kara & Joseph [2].

Since these patches introduce an ext4_ilock/unlock API for inode locking/
unlocking in shared v/s exclusive mode to help achieve better
scalability in the DIO writes case, it seemed best to take care of
inode_trylock for the RWF_NOWAIT case in the same patch set.
Discussion about this can be found here [3].
Please let me know if you think otherwise.

These patches are based on top of the ext4 iomap for DIO patches by
Matthew [4].

Now, since this is a very interesting scalability problem w.r.t locking,
I collected a comprehensive set of performance numbers for the DIO sync
mixed random read/write workload w.r.t the number of threads (ext4).
This cover letter presents performance benefit histograms comparing this
patch set v/s the vanilla kernel (v5.3-rc5) in the DIO mixed randrw
workload.

Git tree for this can be found at-
https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC-v2

Background
==========

The dioread_nolock feature in ext4 allows multiple DIO reads in parallel.
Also, since overwrite DIO needs no block allocation, those I/Os can run in
parallel too. This gives good, scalable I/O performance with multiple DIO
read/write threads.
In case of buffered writes with the dioread_nolock feature, ext4 allocates
an uninitialized extent and, once the IO is completed, converts it back to
initialized.

But recently DIO reads were changed to run under the shared inode rwsem by
commit 16c54688592c ("ext4: Allow parallel DIO reads").
This allowed parallel DIO reads to work fine, but degraded the mixed
DIO read/write workload.


Problem description
===================

Currently in the DIO writes case, we start with the exclusive inode rwsem
and demote it to a shared lock only once we know that it is an overwrite
(with the dioread_nolock mount option enabled).
Now with mixed DIO reads & writes, this has a scalability problem:
if there are any ongoing DIO reads, then a DIO write (since it starts with
the exclusive lock) has to wait until the shared lock is released (which
only happens when the DIO reads are completed).
The same can be easily observed with perf-tools trace analysis.


Implementation
==============

To fix this, we start with a shared lock in the DIO write case (unless an
exclusive lock is needed in certain cases, e.g. extending/unaligned writes).
Then, if it is an overwrite, we continue with the shared lock; if not, we
release the shared lock, switch to the exclusive lock and redo certain
checks.

This approach helps in achieving good scalability for mixed DIO read/write
workloads [5-6].
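A minimal userspace sketch of that decision flow (names mirror the patch,
but every function and struct here is a hypothetical model, not the kernel
implementation):

```c
#include <stdbool.h>

/* Hypothetical model of the lock-mode selection in this series. */
enum iolock { IOLOCK_SHARED, IOLOCK_EXCL };

struct dio_write {
	bool unaligned;		/* I/O not block-size aligned */
	bool extending;		/* write goes past i_size/i_disksize */
	bool dioread_nolock;	/* mount option enabled */
	bool overwrite;		/* hits only allocated, initialized blocks */
};

/* Lock mode picked up-front, before any block-mapping lookup. */
static enum iolock initial_mode(const struct dio_write *w)
{
	if (w->unaligned || w->extending || !w->dioread_nolock)
		return IOLOCK_EXCL;
	return IOLOCK_SHARED;
}

/* Under the shared lock we can check the mapping: anything other
 * than a pure overwrite means drop the shared lock, take the
 * exclusive lock and redo the checks. */
static bool must_restart_exclusive(const struct dio_write *w,
				   enum iolock mode)
{
	return mode == IOLOCK_SHARED && !w->overwrite;
}
```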

Performance results
===================

The performance numbers were collected using the fio tool with the
dioread_nolock mount option for the DIO mixed random read/write (randrw)
workload with 4K & 1M burst sizes for an increasing number of threads
(1, 2, 4, 8, 12, 16, 24).
The performance histograms shown below give the percentage change in
performance with this ilock patch set as compared to the vanilla kernel
(v5.3-rc5).

For an absolute score comparison with some colorful graphs, refer to [5-6].

FIO command:
fio -name=DIO-mixed-randrw -filename=./testfile -direct=1 -iodepth=1 -thread \
-rw=randrw -ioengine=psync -bs=$bs -size=10G -numjobs=$thread \
-group_reporting=1 -runtime=120

With dioread_nolock mount option
================================

Below is the performance benefit histogram with this ilock patch set in (%)
w.r.t the vanilla kernel (v5.3-rc5) with the dioread_nolock mount option for
the mixed randrw workload (4K burst size).
Notice that the percentage benefit increases with an increasing number of
threads, so this patch set helps achieve good scalability in the mentioned
workload. It also gives up to ~70% perf improvement in the 24-thread mixed
randrw workload with 4K burst size.
The performance difference can be even higher with high speed storage
devices, since bw speeds without the patch seem to flatten due to the lock
contention problem with multiple threads.

Absolute graph comparison can be seen here - [5]

Performance benefit (%) data randrw (read)-4K
80 +-+------+--------+-------+--------+--------+--------+-------+------+-+
| + + + + + + + |
70 +-+ ** +-+
| ** ** |
60 +-+ ** ** ** +-+
| ** ** ** |
50 +-+ ** ** ** ** +-+
| ** ** ** ** |
40 +-+ ** ** ** ** +-+
30 +-+ ** ** ** ** ** +-+
| ** ** ** ** ** |
20 +-+ ** ** ** ** ** +-+
| ** ** ** ** ** |
10 +-+ ** ** ** ** ** ** +-+
| ** ** ** ** ** ** ** |
0 +-+ ** ** ** ** ** ** ** +-+
| + + + + + + + |
-10 +-+------+--------+-------+--------+--------+--------+-------+------+-+
1, 2, 4, 8, 12, 16, 24,
No. of threads


Performance benefit (%) data randrw (write)-4K
80 +-+------+--------+-------+--------+--------+--------+-------+------+-+
| + + + + + + + |
70 +-+ ** +-+
| ** ** |
60 +-+ ** ** ** +-+
| ** ** ** |
50 +-+ ** ** ** ** +-+
| ** ** ** ** |
40 +-+ ** ** ** ** +-+
30 +-+ ** ** ** ** ** +-+
| ** ** ** ** ** |
20 +-+ ** ** ** ** ** +-+
| ** ** ** ** ** |
10 +-+ ** ** ** ** ** +-+
| ** ** ** ** ** ** ** |
0 +-+ ** ** ** ** ** ** ** +-+
| + + + + + + + |
-10 +-+------+--------+-------+--------+--------+--------+-------+------+-+
1, 2, 4, 8, 12, 16, 24,
No. of threads


Below is the performance benefit histogram with this ilock patch set in (%)
w.r.t the vanilla kernel (v5.3-rc5) with the dioread_nolock mount option for
the mixed randrw workload (1M burst size).
This shows up to ~20% perf improvement in the 24-thread workload with 1M
burst size.
Absolute graph comparison can be seen here - [6]

Performance benefit (%) data randrw (read)-1M
25 +-+------+--------+-------+--------+--------+--------+-------+------+-+
| + + + + + + + |
| ** |
20 +-+ ** +-+
| ** |
| ** |
15 +-+ ** +-+
| ** ** ** |
10 +-+ ** ** ** +-+
| ** ** ** ** |
| ** ** ** ** |
5 +-+ ** ** ** ** +-+
| ** ** ** ** |
| ** ** ** ** ** ** |
0 +-+ ** ** ** ** ** ** ** +-+
| ** |
| + + + + + + + |
-5 +-+------+--------+-------+--------+--------+--------+-------+------+-+
1, 2, 4, 8, 12, 16, 24,
No. of threads

Performance benefit (%) data randrw (write)-1M
25 +-+------+--------+-------+--------+--------+--------+-------+------+-+
| + + + + + + + |
| |
20 +-+ ** +-+
| ** |
| ** |
15 +-+ ** +-+
| ** ** |
10 +-+ ** ** ** +-+
| ** ** ** ** |
| ** ** ** ** |
5 +-+ ** ** ** ** +-+
| ** ** ** ** ** |
| ** ** ** ** ** ** |
0 +-+ ** ** ** ** ** ** ** +-+
| ** |
| + + + + + + + |
-5 +-+------+--------+-------+--------+--------+--------+-------+------+-+
1, 2, 4, 8, 12, 16, 24,
No. of threads


Without dioread_nolock mount option
===================================

I do see a little degradation without this mount option, but as per my
performance runs it is within the ~(-0.5 to -3)% limit. With 1M burst size
the difference is even better in some runs. So a ~(-3 to 3)% deviation is
mostly run-to-run variation. But for completeness' sake I still present the
data collected below.

Performance(%) data randrw (read)-4K
3 +-+------+--------+--------+--------+-------+--------+--------+------+-+
| + + + + + + + |
| |
2 +-+ +-+
| |
| |
1 +-+ +-+
| |
0 +-+ ** ** ** ** ** ** ** +-+
| ** ** ** ** ** ** ** |
| ** ** ** ** ** ** ** |
-1 +-+ ** ** ** ** ** ** +-+
| ** ** ** ** ** |
| ** ** ** ** ** |
-2 +-+ ** ** ** ** ** +-+
| ** ** ** ** |
| + + +** ** +** + + |
-3 +-+------+--------+--------+--------+-------+--------+--------+------+-+
1, 2, 4, 8, 12, 16, 24,
No. of threads


Performance(%) data randrw (write)-4K
3 +-+------+--------+--------+--------+-------+--------+--------+------+-+
| + + + + + + + |
| |
2 +-+ +-+
| |
| |
1 +-+ +-+
| |
0 +-+ ** ** ** ** ** ** ** +-+
| ** ** ** ** ** ** ** |
| ** ** ** ** ** ** ** |
-1 +-+ ** ** ** ** ** ** +-+
| ** ** ** ** ** |
| ** ** ** ** ** |
-2 +-+ ** ** ** ** ** +-+
| ** ** ** ** |
| + + +** + + + + |
-3 +-+------+--------+--------+--------+-------+--------+--------+------+-+
1, 2, 4, 8, 12, 16, 24,
No. of threads



Performance(%) data randrw (read)-1M
3 +-+------+--------+--------+--------+-------+--------+--------+------+-+
| + + + + + + + |
| |
2 +-+ ** +-+
| ** ** |
| ** ** ** |
1 +-+ ** ** ** ** +-+
| ** ** ** ** |
0 +-+ ** ** ** ** ** ** ** +-+
| ** ** ** |
| ** ** |
-1 +-+ ** +-+
| ** |
| |
-2 +-+ +-+
| |
| + + + + + + + |
-3 +-+------+--------+--------+--------+-------+--------+--------+------+-+
1, 2, 4, 8, 12, 16, 24,
No. of threads


Performance(%) data randrw (write)-1M
3 +-+------+--------+--------+--------+-------+--------+--------+------+-+
| + + + + + + +** |
| ** |
2 +-+ ** ** +-+
| ** ** |
| ** ** ** |
1 +-+ ** ** ** ** ** +-+
| ** ** ** ** ** |
0 +-+ ** ** ** ** ** ** ** +-+
| ** ** |
| ** ** |
-1 +-+ ** +-+
| ** |
| |
-2 +-+ +-+
| |
| + + + + + + + |
-3 +-+------+--------+--------+--------+-------+--------+--------+------+-+
1, 2, 4, 8, 12, 16, 24,
No. of threads


Git tree- https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC-v2

References
==========
[1]: https://lore.kernel.org/linux-ext4/[email protected]/
[2]: https://lore.kernel.org/linux-ext4/[email protected]/
[3]: https://lore.kernel.org/linux-fsdevel/[email protected]/
[4]: https://lwn.net/Articles/799184/

[5]: https://github.com/riteshharjani/LinuxStudy/blob/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-4K.png
[6]: https://github.com/riteshharjani/LinuxStudy/blob/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-1M.png

Ritesh Harjani (2):
ext4: Add ext4_ilock & ext4_iunlock API
ext4: Improve DIO writes locking sequence

fs/ext4/ext4.h | 33 ++++++
fs/ext4/extents.c | 16 +--
fs/ext4/file.c | 260 ++++++++++++++++++++++++++++++++--------------
fs/ext4/inode.c | 4 +-
fs/ext4/ioctl.c | 16 +--
fs/ext4/super.c | 12 +--
fs/ext4/xattr.c | 16 +--
7 files changed, 249 insertions(+), 108 deletions(-)

--
2.21.0


2019-09-24 16:46:25

by Ritesh Harjani

Subject: [RFC-v2 2/2] ext4: Improve DIO writes locking sequence

Earlier there was no shared lock in the DIO read path.
But commit 16c54688592ce ("ext4: Allow parallel DIO reads")
simplified some of the locking mechanism while still allowing
parallel DIO reads by taking the shared lock in the inode DIO
read path.

But this created a problem for mixed read/write workloads.
It is due to the fact that in the DIO write path we first take
the exclusive lock, and only when we determine that it is an
overwrite IO do we downgrade the lock. This causes the problem,
since with the above patch we have shared locking in DIO reads.

So, this patch fixes the issue by starting with the
shared lock and switching to the exclusive lock only
when required, based on ext4_dio_write_checks().

Other than that, it also simplifies the below cases:

1. Simplified the ext4_unaligned_aio API to
ext4_unaligned_io.
The previous API was abused in the sense that it was
not really checking for AIO anywhere; it also used to
check for extending writes.
So this API was renamed and simplified to ext4_unaligned_io(),
which actually only checks if the IO is really unaligned.
This works because in all other cases inode_dio_wait()
will anyway become a no-op, so there is no need to
over-complicate things by checking for every condition here.

2. Added the ext4_extending_io API. This checks if the IO
is extending the file.

Now we only check for
unaligned_io, extend, dioread_nolock & overwrite.

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 213 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 159 insertions(+), 54 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ce1cecbae932..faa8dcf9c08d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -166,19 +166,25 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
* threads are at work on the same unwritten block, they must be synchronized
* or one thread will zero the other's data, causing corruption.
*/
-static int
-ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
+static bool
+ext4_unaligned_io(struct inode *inode, struct iov_iter *from, loff_t pos)
{
struct super_block *sb = inode->i_sb;
- int blockmask = sb->s_blocksize - 1;
-
- if (pos >= ALIGN(i_size_read(inode), sb->s_blocksize))
- return 0;
+ unsigned long blockmask = sb->s_blocksize - 1;

if ((pos | iov_iter_alignment(from)) & blockmask)
- return 1;
+ return true;

- return 0;
+ return false;
+}
+
+static bool
+ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
+{
+ if (offset + len > i_size_read(inode) ||
+ offset + len > EXT4_I(inode)->i_disksize)
+ return true;
+ return false;
}

/* Is IO overwriting allocated and initialized blocks? */
@@ -204,7 +210,9 @@ static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
}

-static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+
+static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
+ struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
@@ -216,10 +224,6 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
if (ret <= 0)
return ret;

- ret = file_modified(iocb->ki_filp);
- if (ret)
- return 0;
-
/*
* If we have encountered a bitmap-format file, the size limit
* is smaller than s_maxbytes, which is for extent-mapped files.
@@ -231,9 +235,26 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return -EFBIG;
iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
}
+
return iov_iter_count(from);
}

+static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+ ssize_t ret;
+ ssize_t count;
+
+ count = ext4_generic_write_checks(iocb, from);
+ if (count <= 0)
+ return count;
+
+ ret = file_modified(iocb->ki_filp);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
struct iov_iter *from)
{
@@ -336,6 +357,83 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
return 0;
}

+/*
+ * The intention here is to start with the shared lock acquired
+ * (except in the unaligned IO & extending writes cases),
+ * then see if any condition requires an exclusive inode
+ * lock. If yes, then we restart the whole operation by
+ * releasing the shared lock and acquiring the exclusive lock.
+ *
+ * - For unaligned IO we never use the shared lock, as it
+ * may cause data corruption when two unaligned IOs try
+ * to modify the same block.
+ *
+ * - For extending writes we don't use the shared lock,
+ * since updating the inode i_disksize requires the
+ * exclusive lock.
+ *
+ * - Shared locking will mostly only stay in effect for
+ * overwrites in dioread_nolock mode.
+ * Otherwise we switch to exclusive locking mode.
+ */
+static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
+ unsigned int *iolock, bool *unaligned_io,
+ bool *extend)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file_inode(file);
+ loff_t offset = iocb->ki_pos;
+ loff_t final_size;
+ size_t count;
+ ssize_t ret;
+
+restart:
+ if (!ext4_dio_checks(inode)) {
+ ext4_iunlock(inode, *iolock);
+ return ext4_buffered_write_iter(iocb, from);
+ }
+
+ ret = ext4_generic_write_checks(iocb, from);
+ if (ret <= 0) {
+ ext4_iunlock(inode, *iolock);
+ return ret;
+ }
+
+ /* Recalculate since offset & count may change above. */
+ offset = iocb->ki_pos;
+ count = iov_iter_count(from);
+ final_size = offset + count;
+
+ if (ext4_unaligned_io(inode, from, offset))
+ *unaligned_io = true;
+
+ if (ext4_extending_io(inode, offset, count))
+ *extend = true;
+ /*
+ * Determine whether the IO operation will overwrite allocated
+ * and initialized blocks. If so, check to see whether it is
+ * possible to take the dioread_nolock path.
+ *
+ * We need exclusive i_rwsem for changing security info
+ * in file_modified().
+ */
+ if (*iolock == EXT4_IOLOCK_SHARED &&
+ (!IS_NOSEC(inode) || *unaligned_io || *extend ||
+ !ext4_should_dioread_nolock(inode) ||
+ !ext4_overwrite_io(inode, offset, count))) {
+ ext4_iunlock(inode, *iolock);
+ *iolock = EXT4_IOLOCK_EXCL;
+ ext4_ilock(inode, *iolock);
+ goto restart;
+ }
+
+ ret = file_modified(file);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
/*
* For a write that extends the inode size, ext4_dio_write_iter() will
* wait for the write to complete. Consequently, operations performed
@@ -371,64 +469,71 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,

static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
+
ssize_t ret;
- size_t count;
loff_t offset = iocb->ki_pos;
+ size_t count = iov_iter_count(from);
struct inode *inode = file_inode(iocb->ki_filp);
- bool extend = false, overwrite = false, unaligned_aio = false;
- unsigned int iolock = EXT4_IOLOCK_EXCL;
+ bool extend = false, unaligned_io = false;
+ unsigned int iolock = EXT4_IOLOCK_SHARED;
+
+ /*
+ * We initially start with the shared inode lock,
+ * unless it is unaligned IO, which needs the
+ * exclusive lock anyway.
+ */
+ if (ext4_unaligned_io(inode, from, offset)) {
+ unaligned_io = true;
+ iolock = EXT4_IOLOCK_EXCL;
+ }
+ /*
+ * Extending writes need the exclusive lock
+ * to update i_disksize.
+ */
+ if (ext4_extending_io(inode, offset, count)) {
+ extend = true;
+ iolock = EXT4_IOLOCK_EXCL;
+ }
+
+ if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
+ iolock = EXT4_IOLOCK_EXCL;

if (iocb->ki_flags & IOCB_NOWAIT) {
+ /*
+ * unaligned IO may anyway require waiting
+ * at other places, so bail out.
+ */
+ if (unaligned_io)
+ return -EAGAIN;
if (!ext4_ilock_nowait(inode, iolock))
return -EAGAIN;
} else {
ext4_ilock(inode, iolock);
}

- if (!ext4_dio_checks(inode)) {
- ext4_iunlock(inode, iolock);
- /*
- * Fallback to buffered IO if the operation on the
- * inode is not supported by direct IO.
- */
- return ext4_buffered_write_iter(iocb, from);
- }
-
- ret = ext4_write_checks(iocb, from);
- if (ret <= 0) {
- ext4_iunlock(inode, iolock);
+ ret = ext4_dio_write_checks(iocb, from, &iolock, &unaligned_io,
+ &extend);
+ if (ret <= 0)
return ret;
- }
-
/*
- * Unaligned direct AIO must be serialized among each other as
+ * Unaligned direct IO must be serialized among each other as
* the zeroing of partial blocks of two competing unaligned
- * AIOs can result in data corruption.
+ * IOs can result in data corruption. This can mainly
+ * happen because we may start with shared locking and, for
+ * the dioread_nolock overwrite case, may continue in
+ * shared locking mode. In that case two parallel unaligned
+ * IOs may cause data corruption.
+ *
+ * So we make sure we don't allow any unaligned IO in flight.
+ * For IOs where we need not wait (like unaligned non-AIO DIO),
+ * the below dio_wait may anyway become a no-op,
+ * since we start with exclusive locks.
*/
- if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
- !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
- unaligned_aio = true;
+ if (unaligned_io)
inode_dio_wait(inode);
- }
-
- /*
- * Determine whether the IO operation will overwrite allocated
- * and initialized blocks. If so, check to see whether it is
- * possible to take the dioread_nolock path.
- */
- count = iov_iter_count(from);
- if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
- ext4_should_dioread_nolock(inode)) {
- overwrite = true;
- ext4_ilock_demote(inode, iolock);
- iolock = EXT4_IOLOCK_SHARED;
- }

- if (offset + count > i_size_read(inode) ||
- offset + count > EXT4_I(inode)->i_disksize) {
+ if (extend)
ext4_update_i_disksize(inode, inode->i_size);
- extend = true;
- }

ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);

@@ -440,7 +545,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
* routines in ext4_dio_write_end_io() are covered by the
* inode_lock().
*/
- if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
+ if (ret == -EIOCBQUEUED && (unaligned_io || extend))
inode_dio_wait(inode);

ext4_iunlock(inode, iolock);
--
2.21.0

2019-09-24 16:46:25

by Ritesh Harjani

Subject: [RFC-v2 1/2] ext4: Add ext4_ilock & ext4_iunlock API

This adds the ext4_ilock/iunlock types of APIs.
These are preparatory APIs to make shared
locking/unlocking & restarting with exclusive
locking/unlocking easier in the next patch.

Along with the above, this also addresses the AIM7
regression problem which was only fixed for XFS in
commit 942491c9e6d6 ("xfs: fix AIM7 regression")

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/ext4.h | 33 +++++++++++++++++++++++++
fs/ext4/extents.c | 16 ++++++------
fs/ext4/file.c | 63 +++++++++++++++++++++++++----------------------
fs/ext4/inode.c | 4 +--
fs/ext4/ioctl.c | 16 ++++++------
fs/ext4/super.c | 12 ++++-----
fs/ext4/xattr.c | 16 ++++++------
7 files changed, 98 insertions(+), 62 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2ab91815f52d..9ffafbe6bc3f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2945,6 +2945,39 @@ static inline int ext4_update_inode_size(struct inode *inode, loff_t newsize)
return changed;
}

+#define EXT4_IOLOCK_EXCL (1 << 0)
+#define EXT4_IOLOCK_SHARED (1 << 1)
+
+static inline void ext4_ilock(struct inode *inode, unsigned int iolock)
+{
+ if (iolock == EXT4_IOLOCK_EXCL)
+ inode_lock(inode);
+ else
+ inode_lock_shared(inode);
+}
+
+static inline void ext4_iunlock(struct inode *inode, unsigned int iolock)
+{
+ if (iolock == EXT4_IOLOCK_EXCL)
+ inode_unlock(inode);
+ else
+ inode_unlock_shared(inode);
+}
+
+static inline int ext4_ilock_nowait(struct inode *inode, unsigned int iolock)
+{
+ if (iolock == EXT4_IOLOCK_EXCL)
+ return inode_trylock(inode);
+ else
+ return inode_trylock_shared(inode);
+}
+
+static inline void ext4_ilock_demote(struct inode *inode, unsigned int iolock)
+{
+ BUG_ON(iolock != EXT4_IOLOCK_EXCL);
+ downgrade_write(&inode->i_rwsem);
+}
+
int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
loff_t len);

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a869e206bd81..ef37f4d4ee7e 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4680,7 +4680,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
else
max_blocks -= lblk;

- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);

/*
* Indirect files do not support unwritten extnets
@@ -4790,7 +4790,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,

ext4_journal_stop(handle);
out_mutex:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
return ret;
}

@@ -4856,7 +4856,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (mode & FALLOC_FL_KEEP_SIZE)
flags |= EXT4_GET_BLOCKS_KEEP_SIZE;

- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);

/*
* We only support preallocation for extent-based files only
@@ -4887,7 +4887,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
EXT4_I(inode)->i_sync_tid);
}
out:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
return ret;
}
@@ -5387,7 +5387,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
return ret;
}

- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
/*
* There is no need to overlap collapse range with EOF, in which case
* it is effectively a truncate operation
@@ -5486,7 +5486,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
out_mmap:
up_write(&EXT4_I(inode)->i_mmap_sem);
out_mutex:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
return ret;
}

@@ -5537,7 +5537,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
return ret;
}

- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
/* Currently just for extent based files */
if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
ret = -EOPNOTSUPP;
@@ -5664,7 +5664,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
out_mmap:
up_write(&EXT4_I(inode)->i_mmap_sem);
out_mutex:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
return ret;
}

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d2ff383a8b9f..ce1cecbae932 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -57,14 +57,15 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
/*
* Get exclusion from truncate and other inode operations.
*/
- if (!inode_trylock_shared(inode)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
return -EAGAIN;
- inode_lock_shared(inode);
+ } else {
+ ext4_ilock(inode, EXT4_IOLOCK_SHARED);
}

if (!ext4_dio_checks(inode)) {
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
/*
* Fallback to buffered IO if the operation being
* performed on the inode is not supported by direct
@@ -77,7 +78,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
}

ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL);
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);

file_accessed(iocb->ki_filp);
return ret;
@@ -89,22 +90,23 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;

- if (!inode_trylock_shared(inode)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
return -EAGAIN;
- inode_lock_shared(inode);
+ } else {
+ ext4_ilock(inode, EXT4_IOLOCK_SHARED);
}
/*
* Recheck under inode lock - at this point we are sure it cannot
* change anymore
*/
if (!IS_DAX(inode)) {
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
/* Fallback to buffered IO in case we cannot support DAX */
return generic_file_read_iter(iocb, to);
}
ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);

file_accessed(iocb->ki_filp);
return ret;
@@ -241,7 +243,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
if (iocb->ki_flags & IOCB_NOWAIT)
return -EOPNOTSUPP;

- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
@@ -250,7 +252,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
current->backing_dev_info = NULL;
out:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
if (likely(ret > 0)) {
iocb->ki_pos += ret;
ret = generic_write_sync(iocb, ret);
@@ -374,15 +376,17 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
loff_t offset = iocb->ki_pos;
struct inode *inode = file_inode(iocb->ki_filp);
bool extend = false, overwrite = false, unaligned_aio = false;
+ unsigned int iolock = EXT4_IOLOCK_EXCL;

- if (!inode_trylock(inode)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!ext4_ilock_nowait(inode, iolock))
return -EAGAIN;
- inode_lock(inode);
+ } else {
+ ext4_ilock(inode, iolock);
}

if (!ext4_dio_checks(inode)) {
- inode_unlock(inode);
+ ext4_iunlock(inode, iolock);
/*
* Fallback to buffered IO if the operation on the
* inode is not supported by direct IO.
@@ -392,7 +396,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)

ret = ext4_write_checks(iocb, from);
if (ret <= 0) {
- inode_unlock(inode);
+ ext4_iunlock(inode, iolock);
return ret;
}

@@ -416,7 +420,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
ext4_should_dioread_nolock(inode)) {
overwrite = true;
- downgrade_write(&inode->i_rwsem);
+ ext4_ilock_demote(inode, iolock);
+ iolock = EXT4_IOLOCK_SHARED;
}

if (offset + count > i_size_read(inode) ||
@@ -438,10 +443,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
inode_dio_wait(inode);

- if (overwrite)
- inode_unlock_shared(inode);
- else
- inode_unlock(inode);
+ ext4_iunlock(inode, iolock);

if (ret >= 0 && iov_iter_count(from))
return ext4_buffered_write_iter(iocb, from);
@@ -457,10 +459,11 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
loff_t offset;
struct inode *inode = file_inode(iocb->ki_filp);

- if (!inode_trylock(inode)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_EXCL))
return -EAGAIN;
- inode_lock(inode);
+ } else {
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
}

ret = ext4_write_checks(iocb, from);
@@ -480,7 +483,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (error)
ret = error;
out:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
if (ret > 0)
ret = generic_write_sync(iocb, ret);
return ret;
@@ -707,14 +710,14 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
return generic_file_llseek_size(file, offset, whence,
maxbytes, i_size_read(inode));
case SEEK_HOLE:
- inode_lock_shared(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_SHARED);
offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
break;
case SEEK_DATA:
- inode_lock_shared(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_SHARED);
offset = iomap_seek_data(inode, offset, &ext4_iomap_ops);
- inode_unlock_shared(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
break;
}

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a4f0749527c7..2870699ee504 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3914,7 +3914,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
return ret;
}

- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);

/* No need to punch hole beyond i_size */
if (offset >= inode->i_size)
@@ -4021,7 +4021,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
out_dio:
up_write(&EXT4_I(inode)->i_mmap_sem);
out_mutex:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
return ret;
}

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 442f7ef873fc..c6ae48567207 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -787,13 +787,13 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
if (err)
return err;

- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
err = ext4_ioctl_check_immutable(inode,
from_kprojid(&init_user_ns, ei->i_projid),
flags);
if (!err)
err = ext4_ioctl_setflags(inode, flags);
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
mnt_drop_write_file(filp);
return err;
}
@@ -824,7 +824,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
goto setversion_out;
}

- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
if (IS_ERR(handle)) {
err = PTR_ERR(handle);
@@ -839,7 +839,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
ext4_journal_stop(handle);

unlock_out:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
setversion_out:
mnt_drop_write_file(filp);
return err;
@@ -958,9 +958,9 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
* ext4_ext_swap_inode_data before we switch the
* inode format to prevent read.
*/
- inode_lock((inode));
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
err = ext4_ext_migrate(inode);
- inode_unlock((inode));
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
mnt_drop_write_file(filp);
return err;
}
@@ -1150,7 +1150,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
if (err)
return err;

- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
ext4_fill_fsxattr(inode, &old_fa);
err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
if (err)
@@ -1165,7 +1165,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
goto out;
err = ext4_ioctl_setproject(filp, fa.fsx_projid);
out:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
mnt_drop_write_file(filp);
return err;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4079605d437a..45519036de83 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2681,12 +2681,12 @@ static void ext4_orphan_cleanup(struct super_block *sb,
__func__, inode->i_ino, inode->i_size);
jbd_debug(2, "truncating inode %lu to %lld bytes\n",
inode->i_ino, inode->i_size);
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
truncate_inode_pages(inode->i_mapping, inode->i_size);
ret = ext4_truncate(inode);
if (ret)
ext4_std_error(inode->i_sb, ret);
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
nr_truncates++;
} else {
if (test_opt(sb, DEBUG))
@@ -5763,7 +5763,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
* files. If this fails, we return success anyway since quotas
* are already enabled and this is not a hard failure.
*/
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
handle = ext4_journal_start(inode, EXT4_HT_QUOTA, 1);
if (IS_ERR(handle))
goto unlock_inode;
@@ -5773,7 +5773,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
unlock_inode:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
}
return err;
}
@@ -5865,7 +5865,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
if (err || ext4_has_feature_quota(sb))
goto out_put;

- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
/*
* Update modification times of quota files when userspace can
* start looking at them. If we fail, we return success anyway since
@@ -5880,7 +5880,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
out_unlock:
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
out_put:
lockdep_set_quota_inode(inode, I_DATA_SEM_NORMAL);
iput(inode);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 491f9ee4040e..dbe3e2900c24 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -422,9 +422,9 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
ext4_set_inode_state(inode, EXT4_STATE_LUSTRE_EA_INODE);
ext4_xattr_inode_set_ref(inode, 1);
} else {
- inode_lock(inode);
+ ext4_ilock(inode, EXT4_IOLOCK_EXCL);
inode->i_flags |= S_NOQUOTA;
- inode_unlock(inode);
+ ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
}

*ea_inode = inode;
@@ -1025,7 +1025,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
u32 hash;
int ret;

- inode_lock(ea_inode);
+ ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);

ret = ext4_reserve_inode_write(handle, ea_inode, &iloc);
if (ret)
@@ -1079,7 +1079,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
ext4_warning_inode(ea_inode,
"ext4_mark_iloc_dirty() failed ret=%d", ret);
out:
- inode_unlock(ea_inode);
+ ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
return ret;
}

@@ -1400,10 +1400,10 @@ static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
block += 1;
}

- inode_lock(ea_inode);
+ ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
i_size_write(ea_inode, wsize);
ext4_update_i_disksize(ea_inode, wsize);
- inode_unlock(ea_inode);
+ ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);

ext4_mark_inode_dirty(handle, ea_inode);

@@ -1452,9 +1452,9 @@ static struct inode *ext4_xattr_inode_create(handle_t *handle,
*/
dquot_free_inode(ea_inode);
dquot_drop(ea_inode);
- inode_lock(ea_inode);
+ ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
ea_inode->i_flags |= S_NOQUOTA;
- inode_unlock(ea_inode);
+ ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
}

return ea_inode;
--
2.21.0