2019-09-17 10:53:28

by Ritesh Harjani

Subject: [RFC 0/2] ext4: Improve locking sequence in DIO write path

Hello,

This patch series is based on the upstream discussion with Jan
& Joseph at [1].
It is based on top of Matthew's v3 ext4 iomap patch series [2].

Patch-1: Adds the ext4_ilock/unlock APIs and also replaces all
inode_lock/unlock instances in fs/ext4/*
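
To make the API shape concrete, here is a minimal userspace model of
what such wrappers look like (a sketch under assumptions, not the
actual Patch-1 code: the flag values and the inode_model struct are
made up for illustration, while the names ext4_ilock/ext4_iunlock/
ext4_ilock_nowait and EXT4_IOLOCK_SHARED/EXCL are the ones used in
Patch-2 below):

#include <pthread.h>

#define EXT4_IOLOCK_EXCL	(1 << 0)
#define EXT4_IOLOCK_SHARED	(1 << 1)

struct inode_model {
	pthread_rwlock_t i_rwsem;	/* stands in for inode->i_rwsem */
};

static void ext4_ilock(struct inode_model *inode, unsigned int iolock)
{
	if (iolock == EXT4_IOLOCK_EXCL)
		pthread_rwlock_wrlock(&inode->i_rwsem);
	else
		pthread_rwlock_rdlock(&inode->i_rwsem);
}

/* trylock variant used for IOCB_NOWAIT; returns nonzero on success */
static int ext4_ilock_nowait(struct inode_model *inode, unsigned int iolock)
{
	if (iolock == EXT4_IOLOCK_EXCL)
		return pthread_rwlock_trywrlock(&inode->i_rwsem) == 0;
	return pthread_rwlock_tryrdlock(&inode->i_rwsem) == 0;
}

static void ext4_iunlock(struct inode_model *inode, unsigned int iolock)
{
	(void)iolock;	/* pthread rwlocks use a single unlock call */
	pthread_rwlock_unlock(&inode->i_rwsem);
}

int main(void)
{
	struct inode_model inode = { PTHREAD_RWLOCK_INITIALIZER };

	ext4_ilock(&inode, EXT4_IOLOCK_SHARED);
	ext4_iunlock(&inode, EXT4_IOLOCK_SHARED);
	return ext4_ilock_nowait(&inode, EXT4_IOLOCK_EXCL) ? 0 : 1;
}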

For now I have already accounted for the trylock/lock semantics
issue (which was discussed here [3]) in the same patch. Since this
whole patch is about the inode_lock/unlock API, I thought it best
to address that issue in the same patch.
However, kindly let me know if you prefer otherwise.

Patch-2: The commit msg of this patch describes in detail what
it is doing.
In brief - we first try to take the shared lock (instead of the
exclusive lock), unless the IO is unaligned or extends the file.
Then, in ext4_dio_write_checks(), if we started with the shared
lock, we check whether we can really continue with it. If not,
we release the shared lock, acquire the exclusive lock, and
restart ext4_dio_write_checks().
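
Here is a minimal, runnable userspace model of that locking sequence
(a sketch under assumptions, not ext4 code: pthread_rwlock_t stands
in for the inode's i_rwsem, and needs_excl() stands in for the
checks done in ext4_dio_write_checks()):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t i_rwsem = PTHREAD_RWLOCK_INITIALIZER;

/* Placeholder for the real checks: only an aligned, non-extending
 * overwrite may proceed under the shared lock. */
static bool needs_excl(bool unaligned, bool extending, bool overwrite)
{
	return unaligned || extending || !overwrite;
}

static void dio_write(bool unaligned, bool extending, bool overwrite)
{
	bool excl = false;

	pthread_rwlock_rdlock(&i_rwsem);	/* start shared */
restart:
	if (!excl && needs_excl(unaligned, extending, overwrite)) {
		/* rwsems cannot be upgraded in place: drop the shared
		 * lock, take the exclusive one, and redo the checks
		 * since the file may have changed in between. */
		pthread_rwlock_unlock(&i_rwsem);
		pthread_rwlock_wrlock(&i_rwsem);
		excl = true;
		goto restart;
	}
	printf("submit DIO under %s lock\n", excl ? "exclusive" : "shared");
	pthread_rwlock_unlock(&i_rwsem);
}

int main(void)
{
	dio_write(false, false, true);	/* overwrite: stays shared */
	dio_write(false, true, false);	/* extending: restarts w/ excl */
	return 0;
}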


Tested against a few xfstests (with the dioread_nolock mount
option); those ran fine (ext4 & generic).

I tried collecting performance numbers on my VM (since I could not
get hold of any real h/w based test device). I could verify that
earlier we were taking the downgrade_write() path, and that with
this patch that path is now avoided for the fio test case
(as reported by Joseph in [4]).
But for the actual results, I am not sure that VM testing can
really give the reliable perf numbers we want to look at.
I do observe some form of perf improvement, but I could not
get any reliable numbers (not even with the same list of
with/without patches with which Joseph posted his numbers [1]).


@Joseph,
Would it be possible for you to give your test case a run with
these patches? That would be really helpful.

Branch for this is hosted at the below tree.

https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC

[1]: https://lore.kernel.org/linux-ext4/[email protected]/
[2]: https://lwn.net/Articles/799184/
[3]: https://lore.kernel.org/linux-fsdevel/[email protected]/
[4]: https://lore.kernel.org/linux-ext4/[email protected]/


Ritesh Harjani (2):
ext4: Add ext4_ilock & ext4_iunlock API
ext4: Improve DIO writes locking sequence

fs/ext4/ext4.h | 33 ++++++
fs/ext4/extents.c | 16 +--
fs/ext4/file.c | 253 ++++++++++++++++++++++++++++++++--------------
fs/ext4/inode.c | 4 +-
fs/ext4/ioctl.c | 16 +--
fs/ext4/super.c | 12 +--
fs/ext4/xattr.c | 16 +--
7 files changed, 244 insertions(+), 106 deletions(-)

--
2.21.0


2019-09-17 10:53:37

by Ritesh Harjani

Subject: [RFC 2/2] ext4: Improve DIO writes locking sequence

Earlier there was no shared lock in the DIO read path.
But commit 16c54688592ce ("ext4: Allow parallel DIO reads")
simplified some of the locking mechanism while still allowing
parallel DIO reads, by taking a shared inode lock in the DIO
read path.

But this created a problem for mixed read/write workloads.
It is due to the fact that in the DIO write path we first take
the exclusive lock, and only when we determine that it is an
overwrite IO do we downgrade the lock. This causes the problem,
since with the above patch DIO reads now take the shared lock.

So, this patch tries to fix the issue by starting with the
shared lock and then switching to the exclusive lock only
when required, based on ext4_dio_write_checks().

Other than that, it also simplifies the below cases:

1. Simplified the ext4_unaligned_aio API to
ext4_unaligned_io.
The previous API was abused in the sense that it never
really checked for AIO anywhere, and it also used to
check for extending writes.
So this API was renamed and simplified to ext4_unaligned_io(),
which actually only checks whether the IO is really unaligned.

This is because in all other cases inode_dio_wait()
anyway becomes a no-op, so there is no need to over-complicate
things by checking every condition here. (A small demo of the
alignment test follows this list.)

2. Added the ext4_extending_io API, which checks whether
the IO extends the file.

With that, we now only check for
unaligned_io, extend, dioread_nolock & overwrite.
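
For reference, a minimal userspace demo of the alignment test that
ext4_unaligned_io() performs (a sketch, assuming a 4K block size;
the "align" argument models what iov_iter_alignment() returns, i.e.
the OR of the user buffer's address and length):

#include <stdio.h>
#include <stdint.h>

static int io_is_unaligned(uint64_t pos, uint64_t align,
			   uint64_t blocksize)
{
	uint64_t blockmask = blocksize - 1;

	/* Any low bit set in the file position or in the iter's
	 * address/length means the IO touches partial blocks. */
	return ((pos | align) & blockmask) != 0;
}

int main(void)
{
	printf("%d\n", io_is_unaligned(4096, 4096, 4096)); /* 0: aligned */
	printf("%d\n", io_is_unaligned(4096, 512, 4096));  /* 1: unaligned */
	return 0;
}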

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 206 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 154 insertions(+), 52 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ce1cecbae932..45af2b7679ad 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -166,14 +166,11 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
* threads are at work on the same unwritten block, they must be synchronized
* or one thread will zero the other's data, causing corruption.
*/
-static int
-ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
+static bool
+ext4_unaligned_io(struct inode *inode, struct iov_iter *from, loff_t pos)
{
struct super_block *sb = inode->i_sb;
- int blockmask = sb->s_blocksize - 1;
-
- if (pos >= ALIGN(i_size_read(inode), sb->s_blocksize))
- return 0;
+ unsigned long blockmask = sb->s_blocksize - 1;

if ((pos | iov_iter_alignment(from)) & blockmask)
return 1;
@@ -181,6 +178,15 @@ ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
return 0;
}

+static bool
+ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
+{
+ if (offset + len > i_size_read(inode) ||
+ offset + len > EXT4_I(inode)->i_disksize)
+ return true;
+ return false;
+}
+
/* Is IO overwriting allocated and initialized blocks? */
static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
{
@@ -204,7 +210,9 @@ static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
}

-static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+
+static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
+ struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
@@ -216,10 +224,6 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
if (ret <= 0)
return ret;

- ret = file_modified(iocb->ki_filp);
- if (ret)
- return 0;
-
/*
* If we have encountered a bitmap-format file, the size limit
* is smaller than s_maxbytes, which is for extent-mapped files.
@@ -231,9 +235,26 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return -EFBIG;
iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
}
+
return iov_iter_count(from);
}

+static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+ ssize_t ret;
+ ssize_t count;
+
+ count = ext4_generic_write_checks(iocb, from);
+ if (count <= 0)
+ return count;
+
+ ret = file_modified(iocb->ki_filp);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
struct iov_iter *from)
{
@@ -336,6 +357,83 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
return 0;
}

+/*
+ * The intention here is to start with the shared lock
+ * already acquired (except for unaligned IO & extending
+ * writes), then see if any condition requires an exclusive
+ * inode lock. If yes, then we restart the whole operation
+ * by releasing the shared lock and acquiring the exclusive lock.
+ *
+ * - For unaligned IO we never take the shared lock, as
+ * two unaligned IOs modifying the same block could
+ * otherwise corrupt each other's data.
+ *
+ * - For extending writes we don't take the shared
+ * lock either, since updating the inode's i_disksize
+ * must be done under the exclusive lock.
+ *
+ * - Shared locking thus mostly only remains in effect
+ * for overwrites in dioread_nolock mode.
+ * Otherwise we switch to exclusive locking mode.
+ */
+static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
+ unsigned int *iolock, bool *unaligned_io,
+ bool *extend)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file_inode(file);
+ loff_t offset = iocb->ki_pos;
+ loff_t final_size;
+ size_t count;
+ ssize_t ret;
+
+restart:
+ if (!ext4_dio_checks(inode)) {
+ ext4_iunlock(inode, *iolock);
+ return ext4_buffered_write_iter(iocb, from);
+ }
+
+ ret = ext4_generic_write_checks(iocb, from);
+ if (ret <= 0) {
+ ext4_iunlock(inode, *iolock);
+ return ret;
+ }
+
+ /* Recalculate since offset & count may change above. */
+ offset = iocb->ki_pos;
+ count = iov_iter_count(from);
+ final_size = offset + count;
+
+ if (ext4_unaligned_io(inode, from, offset))
+ *unaligned_io = true;
+
+ if (ext4_extending_io(inode, offset, count))
+ *extend = true;
+ /*
+ * Determine whether the IO operation will overwrite allocated
+ * and initialized blocks. If so, check to see whether it is
+ * possible to take the dioread_nolock path.
+ *
+ * We need exclusive i_rwsem for changing security info
+ * in file_modified().
+ */
+ if (*iolock == EXT4_IOLOCK_SHARED &&
+ (!IS_NOSEC(inode) || *unaligned_io || *extend ||
+ !ext4_should_dioread_nolock(inode) ||
+ !ext4_overwrite_io(inode, offset, count))) {
+ ext4_iunlock(inode, *iolock);
+ *iolock = EXT4_IOLOCK_EXCL;
+ ext4_ilock(inode, *iolock);
+ goto restart;
+ }
+
+ ret = file_modified(file);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
/*
* For a write that extends the inode size, ext4_dio_write_iter() will
* wait for the write to complete. Consequently, operations performed
@@ -371,64 +469,68 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,

static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
+
ssize_t ret;
- size_t count;
loff_t offset = iocb->ki_pos;
+ size_t count = iov_iter_count(from);
struct inode *inode = file_inode(iocb->ki_filp);
- bool extend = false, overwrite = false, unaligned_aio = false;
- unsigned int iolock = EXT4_IOLOCK_EXCL;
+ bool extend = false, unaligned_io = false;
+ unsigned int iolock = EXT4_IOLOCK_SHARED;
+
+ /*
+ * We initially start with the shared inode lock,
+ * unless the IO is unaligned, which needs the
+ * exclusive lock anyway.
+ */
+ if (ext4_unaligned_io(inode, from, offset)) {
+ unaligned_io = true;
+ iolock = EXT4_IOLOCK_EXCL;
+ }
+ /*
+ * Extending writes need the exclusive lock
+ * to update i_disksize.
+ */
+ if (ext4_extending_io(inode, offset, count)) {
+ extend = true;
+ iolock = EXT4_IOLOCK_EXCL;
+ }

if (iocb->ki_flags & IOCB_NOWAIT) {
+ /*
+ * Unaligned IO may anyway require waiting
+ * at other places, so bail out here.
+ */
+ if (unaligned_io)
+ return -EAGAIN;
if (!ext4_ilock_nowait(inode, iolock))
return -EAGAIN;
} else {
ext4_ilock(inode, iolock);
}

- if (!ext4_dio_checks(inode)) {
- ext4_iunlock(inode, iolock);
- /*
- * Fallback to buffered IO if the operation on the
- * inode is not supported by direct IO.
- */
- return ext4_buffered_write_iter(iocb, from);
- }
-
- ret = ext4_write_checks(iocb, from);
- if (ret <= 0) {
- ext4_iunlock(inode, iolock);
+ ret = ext4_dio_write_checks(iocb, from, &iolock, &unaligned_io,
+ &extend);
+ if (ret <= 0)
return ret;
- }
-
/*
- * Unaligned direct AIO must be serialized among each other as
+ * Unaligned direct IO must be serialized against each other, as
 * the zeroing of partial blocks of two competing unaligned
- * AIOs can result in data corruption.
+ * IOs can result in data corruption. This can mainly
+ * happen because we may start with shared locking, and in
+ * the dioread_nolock overwrite case we may continue in
+ * shared locking mode. In that case two parallel unaligned
+ * IOs may cause data corruption.
+ *
+ * So we make sure we don't allow any unaligned IO in flight.
+ * For IOs where we need not wait (like unaligned non-AIO DIO),
+ * the dio_wait below may anyway become a no-op, since we
+ * take the exclusive lock from the start.
 */
- if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
- !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
- unaligned_aio = true;
+ if (unaligned_io)
inode_dio_wait(inode);
- }
-
- /*
- * Determine whether the IO operation will overwrite allocated
- * and initialized blocks. If so, check to see whether it is
- * possible to take the dioread_nolock path.
- */
- count = iov_iter_count(from);
- if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
- ext4_should_dioread_nolock(inode)) {
- overwrite = true;
- ext4_ilock_demote(inode, iolock);
- iolock = EXT4_IOLOCK_SHARED;
- }

- if (offset + count > i_size_read(inode) ||
- offset + count > EXT4_I(inode)->i_disksize) {
+ if (extend)
ext4_update_i_disksize(inode, inode->i_size);
- extend = true;
- }

ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);

@@ -440,7 +542,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
* routines in ext4_dio_write_end_io() are covered by the
* inode_lock().
*/
- if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
+ if (ret == -EIOCBQUEUED && (unaligned_io || extend))
inode_dio_wait(inode);

ext4_iunlock(inode, iolock);
--
2.21.0

2019-09-18 02:22:01

by Joseph Qi

Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path



On 19/9/17 18:32, Ritesh Harjani wrote:
[...]
> @Joseph,
> Would it be possible for you to give your test case a run with
> these patches? That would be really helpful.
>

Sure, will post the result ASAP.

Thanks,
Joseph


2019-09-18 06:36:16

by Joseph Qi

Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path

Hi Ritesh,

On 19/9/17 18:32, Ritesh Harjani wrote:
[...]
> @Joseph,
> Would it be possible for you to give your test case a run with
> these patches? That would be really helpful.
>
> Branch for this is hosted at the below tree.
>
> https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
>
I've tested your branch; the result is:
mounting with dioread_nolock, it behaves the same as reverting
parallel dio reads + dioread_nolock;
while mounting without dioread_nolock, there is no improvement,
or it is even worse. Please refer to the test data below.

fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
-direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
-size=20G -numjobs=8 -runtime=600 -group_reporting

w/     = with parallel dio reads
w/o    = reverting parallel dio reads
w/o+   = reverting parallel dio reads + dioread_nolock
ilock  = ext4-ilock-RFC
ilock+ = ext4-ilock-RFC + dioread_nolock

(Each result cell below reads: bandwidth, IOPS, latency.)

bs=4k:
--------------------------------------------------------------
      |            READ           |           WRITE          |
--------------------------------------------------------------
w/    | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
--------------------------------------------------------------
w/o   | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
--------------------------------------------------------------
w/o+  | 123450KB/s,30862,245.77us | 123368KB/s,30841,12.14us |
--------------------------------------------------------------
ilock | 29964KB/s,7491,326.70us   | 29940KB/s,7485,740.62us  |
--------------------------------------------------------------
ilock+| 123685KB/s,30921,245.52us | 123601KB/s,30900,12.11us |
--------------------------------------------------------------


bs=16k:
--------------------------------------------------------------
      |            READ           |           WRITE          |
--------------------------------------------------------------
w/    | 58961KB/s,3685,835.28us   | 58877KB/s,3679,1335.98us |
--------------------------------------------------------------
w/o   | 218409KB/s,13650,554.46us | 218257KB/s,13641,29.22us |
--------------------------------------------------------------
w/o+  | 222477KB/s,13904,552.94us | 222322KB/s,13895,20.28us |
--------------------------------------------------------------
ilock | 56039KB/s,3502,632.96us   | 55943KB/s,3496,1652.72us |
--------------------------------------------------------------
ilock+| 222747KB/s,13921,552.57us | 222592KB/s,13912,20.31us |
--------------------------------------------------------------

bs=64k
----------------------------------------------------------------
      |            READ            |           WRITE           |
----------------------------------------------------------------
w/    | 119396KB/s,1865,1759.38us  | 119159KB/s,1861,2532.26us |
----------------------------------------------------------------
w/o   | 422815KB/s,6606,1146.05us  | 421619KB/s,6587,60.72us   |
----------------------------------------------------------------
w/o+  | 427406KB/s,6678,1141.52us  | 426197KB/s,6659,52.79us   |
----------------------------------------------------------------
ilock | 105800KB/s,1653,1451.68us  | 105721KB/s,1651,3388.64us |
----------------------------------------------------------------
ilock+| 427678KB/s,6682,1142.13us  | 426468KB/s,6663,52.31us   |
----------------------------------------------------------------

bs=512k
----------------------------------------------------------------
      |            READ            |           WRITE           |
----------------------------------------------------------------
w/    | 392973KB/s,767,5046.35us   | 393165KB/s,767,5359.86us  |
----------------------------------------------------------------
w/o   | 590266KB/s,1152,4312.01us  | 590554KB/s,1153,2606.82us |
----------------------------------------------------------------
w/o+  | 618752KB/s,1208,4125.82us  | 619054KB/s,1209,2487.90us |
----------------------------------------------------------------
ilock | 296239KB/s,578,4703.10us   | 296384KB/s,578,9049.32us  |
----------------------------------------------------------------
ilock+| 616636KB/s,1204,4143.38us  | 616937KB/s,1204,2490.08us |
----------------------------------------------------------------

bs=1M
----------------------------------------------------------------
      |            READ            |           WRITE           |
----------------------------------------------------------------
w/    | 487779KB/s,476,8058.55us   | 485592KB/s,474,8630.51us  |
----------------------------------------------------------------
w/o   | 593927KB/s,580,7623.63us   | 591265KB/s,577,6163.42us  |
----------------------------------------------------------------
w/o+  | 615011KB/s,600,7399.93us   | 612255KB/s,597,5936.61us  |
----------------------------------------------------------------
ilock | 394762KB/s,385,7097.55us   | 392993KB/s,383,13626.98us |
----------------------------------------------------------------
ilock+| 626183KB/s,611,7319.16us   | 623377KB/s,608,5773.24us  |
----------------------------------------------------------------

Thanks,
Joseph

2019-09-18 10:04:37

by Ritesh Harjani

Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path

Hello Joseph,

First of all, thanks a lot for collecting such thorough
performance numbers.

On 9/18/19 12:05 PM, Joseph Qi wrote:
> Hi Ritesh,
>
[...]
> I've tested your branch; the result is:
> mounting with dioread_nolock, it behaves the same as reverting
> parallel dio reads + dioread_nolock;

Good sign; it means the patch is doing what it is supposed to do.


> while mounting without dioread_nolock, there is no improvement,
> or it is even worse. Please refer to the test data below.
Actually, without dioread_nolock we take the restart path,
i.e. we initially start with the SHARED lock, but if dioread_nolock
is not enabled (or some other condition, like the overwrite check,
fails), we release the shared lock and re-acquire the EXCL lock.


But as an optimization, I have just added the below diff to first
check for ext4_should_dioread_nolock() directly, before taking
the shared lock.

I think with this we should not see any performance regression
(even without the dioread_nolock mount option), since it will
directly start with the exclusive lock when the dioread_nolock
mount option is not enabled.

I have updated the tree with this diff in the same branch.

ext4_dio_file_write_iter ()
<...>

498         if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
499                 iolock = EXT4_IOLOCK_EXCL;
500
<...>
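
In other words, the initial lock-mode choice becomes something like
the following (a compact userspace sketch of the decision only, with
bools standing in for the real ext4 predicates; not the actual code):

#include <stdbool.h>
#include <stdio.h>

enum iolock { IOLOCK_SHARED, IOLOCK_EXCL };

/* Start exclusive whenever the shared path could not be kept:
 * unaligned or extending IO, or dioread_nolock disabled (in which
 * case ext4_dio_write_checks() would restart into EXCL anyway). */
static enum iolock initial_iolock(bool unaligned, bool extending,
				  bool dioread_nolock)
{
	if (unaligned || extending || !dioread_nolock)
		return IOLOCK_EXCL;
	return IOLOCK_SHARED;
}

int main(void)
{
	printf("%d\n", initial_iolock(false, false, true));  /* 0: shared */
	printf("%d\n", initial_iolock(false, false, false)); /* 1: excl */
	return 0;
}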


>
> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
> -size=20G -numjobs=8 -runtime=600 -group_reporting
>
> w/ = with parallel dio reads
> w/o = reverting parallel dio reads
> w/o+ = reverting parallel dio reads + dioread_nolock
> ilock = ext4-ilock-RFC
> ilock+ = ext4-ilock-RFC + dioread_nolock

May I request that you kindly also add "w/ + dioread_nolock" to
your list.


"w/ + dioread_nolock" v/s "ilock+" - should show some improvements.
"w/ " v/s "ilock" - should not show any regression.

But thanks for the exhaustive performance numbers you collected.


-ritesh



2019-09-18 13:04:30

by Joseph Qi

Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path



On 19/9/18 18:03, Ritesh Harjani wrote:
> Hello Joseph,
>
[...]
> May I request that you kindly also add "w/ + dioread_nolock" to
> your list.
>
I've done this test before; it still behaves poorly.
You can refer to the previous RFC thread:
https://www.spinics.net/lists/linux-ext4/msg67066.html

Thanks,
Joseph


2019-09-19 05:19:56

by Joseph Qi

Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path



On 19/9/18 18:03, Ritesh Harjani wrote:
[...]
> But as an optimization, I have just added the below diff to first
> check for ext4_should_dioread_nolock() directly, before taking
> the shared lock.
[...]
> ext4_dio_file_write_iter ()
> <...>
>
> 498         if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
> 499                 iolock = EXT4_IOLOCK_EXCL;
> 500
> <...>
>
With this optimization, when mounting without dioread_nolock, there
is a little improvement compared to the last ilock version, but it
is still poorer than the original, especially for big block sizes.

w/        = with parallel dio reads (original)
w/o       = reverting parallel dio reads
w/o+      = reverting parallel dio reads + dioread_nolock
ilock     = ext4-ilock-RFC
ilock+    = ext4-ilock-RFC + dioread_nolock
ilocknew  = ext4-ilock-RFC latest
ilocknew+ = ext4-ilock-RFC latest + dioread_nolock


bs=4k:
------------------------------------------------------------------
          |            READ           |           WRITE          |
------------------------------------------------------------------
w/        | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
------------------------------------------------------------------
w/o       | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
------------------------------------------------------------------
w/o+      | 123450KB/s,30862,245.77us | 123368KB/s,30841,12.14us |
------------------------------------------------------------------
ilock     | 29964KB/s,7491,326.70us   | 29940KB/s,7485,740.62us  |
------------------------------------------------------------------
ilocknew  | 30190KB/s,7547,497.47us   | 30159KB/s,7539,561.85us  |
------------------------------------------------------------------
ilock+    | 123685KB/s,30921,245.52us | 123601KB/s,30900,12.11us |
------------------------------------------------------------------
ilocknew+ | 123169KB/s,30792,245.81us | 123087KB/s,30771,12.85us |
------------------------------------------------------------------


bs=16k:
------------------------------------------------------------------
          |            READ           |           WRITE          |
------------------------------------------------------------------
w/        | 58961KB/s,3685,835.28us   | 58877KB/s,3679,1335.98us |
------------------------------------------------------------------
w/o       | 218409KB/s,13650,554.46us | 218257KB/s,13641,29.22us |
------------------------------------------------------------------
w/o+      | 222477KB/s,13904,552.94us | 222322KB/s,13895,20.28us |
------------------------------------------------------------------
ilock     | 56039KB/s,3502,632.96us   | 55943KB/s,3496,1652.72us |
------------------------------------------------------------------
ilocknew  | 57317KB/s,3582,1023.88us  | 57230KB/s,3576,1209.91us |
------------------------------------------------------------------
ilock+    | 222747KB/s,13921,552.57us | 222592KB/s,13912,20.31us |
------------------------------------------------------------------
ilocknew+ | 221945KB/s,13871,552.61us | 221791KB/s,13861,21.29us |
------------------------------------------------------------------

bs=64k
-------------------------------------------------------------------
          |            READ            |           WRITE           |
-------------------------------------------------------------------
w/        | 119396KB/s,1865,1759.38us  | 119159KB/s,1861,2532.26us |
-------------------------------------------------------------------
w/o       | 422815KB/s,6606,1146.05us  | 421619KB/s,6587,60.72us   |
-------------------------------------------------------------------
w/o+      | 427406KB/s,6678,1141.52us  | 426197KB/s,6659,52.79us   |
-------------------------------------------------------------------
ilock     | 105800KB/s,1653,1451.68us  | 105721KB/s,1651,3388.64us |
-------------------------------------------------------------------
ilocknew  | 107447KB/s,1678,1654.33us  | 107322KB/s,1676,3112.96us |
-------------------------------------------------------------------
ilock+    | 427678KB/s,6682,1142.13us  | 426468KB/s,6663,52.31us   |
-------------------------------------------------------------------
ilocknew+ | 427054KB/s,6672,1143.43us  | 425846KB/s,6653,52.87us   |
-------------------------------------------------------------------

bs=512k
-------------------------------------------------------------------
          |            READ            |           WRITE           |
-------------------------------------------------------------------
w/        | 392973KB/s,767,5046.35us   | 393165KB/s,767,5359.86us  |
-------------------------------------------------------------------
w/o       | 590266KB/s,1152,4312.01us  | 590554KB/s,1153,2606.82us |
-------------------------------------------------------------------
w/o+      | 618752KB/s,1208,4125.82us  | 619054KB/s,1209,2487.90us |
-------------------------------------------------------------------
ilock     | 296239KB/s,578,4703.10us   | 296384KB/s,578,9049.32us  |
-------------------------------------------------------------------
ilocknew  | 309740KB/s,604,5485.93us   | 309892KB/s,605,7666.79us  |
-------------------------------------------------------------------
ilock+    | 616636KB/s,1204,4143.38us  | 616937KB/s,1204,2490.08us |
-------------------------------------------------------------------
ilocknew+ | 618159KB/s,1207,4129.76us  | 618461KB/s,1207,2486.90us |
-------------------------------------------------------------------

bs=1M
-------------------------------------------------------------------
          |            READ            |           WRITE           |
-------------------------------------------------------------------
w/        | 487779KB/s,476,8058.55us   | 485592KB/s,474,8630.51us  |
-------------------------------------------------------------------
w/o       | 593927KB/s,580,7623.63us   | 591265KB/s,577,6163.42us  |
-------------------------------------------------------------------
w/o+      | 615011KB/s,600,7399.93us   | 612255KB/s,597,5936.61us  |
-------------------------------------------------------------------
ilock     | 394762KB/s,385,7097.55us   | 392993KB/s,383,13626.98us |
-------------------------------------------------------------------
ilocknew  | 422052KB/s,412,8338.16us   | 420161KB/s,410,11008.95us |
-------------------------------------------------------------------
ilock+    | 626183KB/s,611,7319.16us   | 623377KB/s,608,5773.24us  |
-------------------------------------------------------------------
ilocknew+ | 626006KB/s,611,7281.09us   | 623200KB/s,608,5817.84us  |
-------------------------------------------------------------------

2019-09-19 22:14:10

by Ritesh Harjani

Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path

Hello,

On 9/19/19 7:38 AM, Joseph Qi wrote:
[...]
> With this optimization, when mounting without dioread_nolock, there
> is a little improvement compared to the last ilock version, but it
> is still poorer than the original, especially for big block sizes.

Finally, I got hold of some HW. I am collecting the numbers as we speak.
Will post those tomorrow.

Thanks for your help!!

-ritesh




2019-09-24 16:47:16

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path

> Finally, I got hold of some HW. I am collecting the numbers as we speak.
> Will post those tomorrow.

I have sent RFC-v2 with a comprehensive set of performance numbers and
charts. It can be found below:

https://marc.info/?l=linux-ext4&m=156921748126221&w=2

Please take a look at it.

-ritesh

2019-09-26 08:15:13

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path

Hi Joseph!

On Wed 18-09-19 14:35:15, Joseph Qi wrote:
> On 19/9/17 18:32, Ritesh Harjani wrote:
> > Hello,
> >
> > This patch series is based on the upstream discussion with Jan
> > & Joseph @ [1].
> > It is based on top of Matthew's v3 ext4 iomap patch series [2]
> >
> > Patch-1: Adds the ext4_ilock/unlock APIs and also replaces all
> > inode_lock/unlock instances from fs/ext4/*
> >
> > For now I already accounted for trylock/lock issue symantics
> > (which was discussed here [3]) in the same patch,
> > since the this whole patch was around inode_lock/unlock API,
> > so I thought it will be best to address that issue in the same patch.
> > However, kindly let me know if otherwise.
> >
> > Patch-2: Commit msg of this patch describes in detail about
> > what it is doing.
> > In brief - we try to first take the shared lock (instead of exclusive
> > lock), unless it is a unaligned_io or extend_io. Then in
> > ext4_dio_write_checks(), if we start with shared lock, we see
> > if we can really continue with shared lock or not. If not, then
> > we release the shared lock then acquire exclusive lock
> > and restart ext4_dio_write_checks().
> >
> >
> > Tested against few xfstests (with dioread_nolock mount option),
> > those ran fine (ext4 & generic).
> >
> > I tried testing performance numbers on my VM (since I could not get
> > hold of any real h/w based test device). I could test the fact
> > that earlier we were trying to do downgrade_write() lock, but with
> > this patch, that path is now avoided for fio test case
> > (as reported by Joseph in [4]).
> > But for the actual results, I am not sure if VM machine testing could
> > really give the reliable perf numbers which we want to take a look at.
> > Though I do observe some form of perf improvements, but I could not
> > get any reliable numbers (not even with the same list of with/without
> > patches with which Joseph posted his numbers [1]).
> >
> >
> > @Joseph,
> > Would it be possible for you to give your test case a run with this
> > patches? That will be really helpful.
> >
> > Branch for this is hosted at below tree.
> >
> > https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
> >
> I've tested your branch, the result is:
> mounting with dioread_nolock, it behaves the same like reverting
> parallel dio reads + dioread_nolock;
> while mounting without dioread_nolock, no improvement, or even worse.
> Please refer the test data below.
>
> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
> -size=20G -numjobs=8 -runtime=600 -group_reporting
>
> w/ = with parallel dio reads
> w/o = reverting parallel dio reads

This is with 16c54688592ce8 "ext4: Allow parallel DIO reads" reverted,
right?

> w/o+ = reverting parallel dio reads + dioread_nolock
> ilock = ext4-ilock-RFC
> ilock+ = ext4-ilock-RFC + dioread_nolock
>
> bs=4k:
> --------------------------------------------------------------
> | READ | WRITE |
> --------------------------------------------------------------
> w/ | 30898KB/s,7724,555.00us | 30875KB/s,7718,479.70us |
> --------------------------------------------------------------
> w/o | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
> --------------------------------------------------------------

I'm really surprised by the numbers here. They would mean that when DIO
read takes i_rwsem exclusive lock instead of shared, it is a win for your
workload... Argh, now checking code in fs/direct-io.c I think I can see the
difference. The trick in do_blockdev_direct_IO() is:

if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
inode_unlock(dio->inode);
if (dio->is_async && retval == 0 && dio->result &&
(iov_iter_rw(iter) == READ || dio->result == count))
retval = -EIOCBQUEUED;
else
dio_await_completion(dio);

So actually only direct IO read submission is protected by i_rwsem with
DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.

After some thought I think the best solution for this is to just finally
finish the conversion of ext4 so that dioread_nolock is the only DIO path.
With i_rwsem held in shared mode even for "unlocked" DIO, it should be
actually relatively simple and most of the dances with unwritten extents
shouldn't be needed anymore.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-09-26 08:43:17

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path



On 9/24/19 8:40 PM, Jan Kara wrote:
> Hi Joseph!
>
> On Wed 18-09-19 14:35:15, Joseph Qi wrote:
>> On 19/9/17 18:32, Ritesh Harjani wrote:
>>> Hello,
>>>
>>> This patch series is based on the upstream discussion with Jan
>>> & Joseph @ [1].
>>> It is based on top of Matthew's v3 ext4 iomap patch series [2]
>>>
>>> Patch-1: Adds the ext4_ilock/unlock APIs and also replaces all
>>> inode_lock/unlock instances from fs/ext4/*
>>>
>>> For now I already accounted for trylock/lock issue symantics
>>> (which was discussed here [3]) in the same patch,
>>> since the this whole patch was around inode_lock/unlock API,
>>> so I thought it will be best to address that issue in the same patch.
>>> However, kindly let me know if otherwise.
>>>
>>> Patch-2: Commit msg of this patch describes in detail about
>>> what it is doing.
>>> In brief - we try to first take the shared lock (instead of exclusive
>>> lock), unless it is a unaligned_io or extend_io. Then in
>>> ext4_dio_write_checks(), if we start with shared lock, we see
>>> if we can really continue with shared lock or not. If not, then
>>> we release the shared lock then acquire exclusive lock
>>> and restart ext4_dio_write_checks().
>>>
>>>
>>> Tested against few xfstests (with dioread_nolock mount option),
>>> those ran fine (ext4 & generic).
>>>
>>> I tried testing performance numbers on my VM (since I could not get
>>> hold of any real h/w based test device). I could test the fact
>>> that earlier we were trying to do downgrade_write() lock, but with
>>> this patch, that path is now avoided for fio test case
>>> (as reported by Joseph in [4]).
>>> But for the actual results, I am not sure if VM machine testing could
>>> really give the reliable perf numbers which we want to take a look at.
>>> Though I do observe some form of perf improvements, but I could not
>>> get any reliable numbers (not even with the same list of with/without
>>> patches with which Joseph posted his numbers [1]).
>>>
>>>
>>> @Joseph,
>>> Would it be possible for you to give your test case a run with this
>>> patches? That will be really helpful.
>>>
>>> Branch for this is hosted at below tree.
>>>
>>> https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
>>>
>> I've tested your branch, the result is:
>> mounting with dioread_nolock, it behaves the same like reverting
>> parallel dio reads + dioread_nolock;
>> while mounting without dioread_nolock, no improvement, or even worse.
>> Please refer the test data below.
>>
>> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
>> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
>> -size=20G -numjobs=8 -runtime=600 -group_reporting
>>
>> w/ = with parallel dio reads
>> w/o = reverting parallel dio reads
>
> This is with 16c54688592ce8 "ext4: Allow parallel DIO reads" reverted,
> right?

He posted the same numbers when he posted the previous reverts too,
which I thought we had already noticed [1].
From [2] below, I assumed we knew this.

[2] - """
(note
that the patches actually improve performance of read-only DIO workload
when not using dioread_nolock as for that case, exclusive lock is
replaced with a shared one)
"""


[1] https://patchwork.ozlabs.org/patch/1153546/
[2]
https://lore.kernel.org/linux-ext4/[email protected]/

>
>> w/o+ = reverting parallel dio reads + dioread_nolock
>> ilock = ext4-ilock-RFC
>> ilock+ = ext4-ilock-RFC + dioread_nolock
>>
>> bs=4k:
>> --------------------------------------------------------------
>> | READ | WRITE |
>> --------------------------------------------------------------
>> w/ | 30898KB/s,7724,555.00us | 30875KB/s,7718,479.70us |
>> --------------------------------------------------------------
>> w/o | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
>> --------------------------------------------------------------
>
> I'm really surprised by the numbers here. They would mean that when DIO

While testing my patches I noticed this again, but then when I saw [2]
above, I assumed we were already aware of it.
My bad, I should have brought this point up once more before going
ahead with implementing the solution we discussed.


> read takes i_rwsem exclusive lock instead of shared, it is a win for your
> workload... Argh, now checking code in fs/direct-io.c I think I can see the
> difference. The trick in do_blockdev_direct_IO() is:
>
> if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
> inode_unlock(dio->inode);
> if (dio->is_async && retval == 0 && dio->result &&
> (iov_iter_rw(iter) == READ || dio->result == count))
> retval = -EIOCBQUEUED;
> else
> dio_await_completion(dio);
>
> So actually only direct IO read submission is protected by i_rwsem with
> DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
>
> After some thought I think the best solution for this is to just finally
> finish the conversion of ext4 so that dioread_nolock is the only DIO path.

Sorry, I still didn't get this completely. Could you please explain a
bit more?


> With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> actually relatively simple and most of the dances with unwritten extents
> shouldn't be needed anymore.

Again, maybe it's related to the above comment. Could you please give
some insights?


Or do you mean that we should do it like this:
As of now in dioread_nolock, we allocate blocks, mark the extent
entries as unwritten, do the data IO, and then finally convert the
unwritten extents to written.

So instead of that, we would first only reserve the disk blocks
(without making any on-disk changes in the extent tree), do the data
IO, and then finally make the entry into the extent tree on disk. And
going forward, keep only this as the default path (see the sketch
below).
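
(Put side by side, a conceptual sketch of the two flows; no real ext4
function names are implied here:)

	/*
	 * Current dioread_nolock write path (conceptually):
	 *   allocate blocks as unwritten extents on disk
	 *   -> submit the data IO
	 *   -> convert unwritten extents to written on IO completion
	 *
	 * Proposed path:
	 *   reserve blocks in memory only (on-disk extent tree untouched)
	 *   -> submit the data IO
	 *   -> insert written extents into the on-disk tree on completion
	 */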

The above is something I have been looking into for enabling
dioread_nolock for powerpc platforms where blocksize < page_size.
This is based upon an upstream discussion between Ted and you :)


But even with the above, in the case of extending writes we will
still have to zero out those extending blocks, no? That will require
an exclusive inode lock anyway for the zeroing (the same is done in
XFS too).

So going with the currently discussed solution of mounting with
dioread_nolock to provide performance scalability for mixed read/write
workloads should also be the right approach, no?
Also, looking at the numbers here [3] & [4], this patch seems to
improve performance with the dioread_nolock mount option as well.
Please help me understand your thoughts on this.

[3] - https://marc.info/?l=linux-ext4&m=156921748126221&w=2
[4] -
https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-4K.png


-ritesh

2019-09-26 09:03:15

by Joseph Qi

[permalink] [raw]
Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path



On 19/9/24 23:10, Jan Kara wrote:
> Hi Joseph!
>
> On Wed 18-09-19 14:35:15, Joseph Qi wrote:
>> On 19/9/17 18:32, Ritesh Harjani wrote:
>>> Hello,
>>>
>>> This patch series is based on the upstream discussion with Jan
>>> & Joseph @ [1].
>>> It is based on top of Matthew's v3 ext4 iomap patch series [2]
>>>
>>> Patch-1: Adds the ext4_ilock/unlock APIs and also replaces all
>>> inode_lock/unlock instances from fs/ext4/*
>>>
>>> For now I already accounted for trylock/lock issue symantics
>>> (which was discussed here [3]) in the same patch,
>>> since the this whole patch was around inode_lock/unlock API,
>>> so I thought it will be best to address that issue in the same patch.
>>> However, kindly let me know if otherwise.
>>>
>>> Patch-2: Commit msg of this patch describes in detail about
>>> what it is doing.
>>> In brief - we try to first take the shared lock (instead of exclusive
>>> lock), unless it is a unaligned_io or extend_io. Then in
>>> ext4_dio_write_checks(), if we start with shared lock, we see
>>> if we can really continue with shared lock or not. If not, then
>>> we release the shared lock then acquire exclusive lock
>>> and restart ext4_dio_write_checks().
>>>
>>>
>>> Tested against few xfstests (with dioread_nolock mount option),
>>> those ran fine (ext4 & generic).
>>>
>>> I tried testing performance numbers on my VM (since I could not get
>>> hold of any real h/w based test device). I could test the fact
>>> that earlier we were trying to do downgrade_write() lock, but with
>>> this patch, that path is now avoided for fio test case
>>> (as reported by Joseph in [4]).
>>> But for the actual results, I am not sure if VM machine testing could
>>> really give the reliable perf numbers which we want to take a look at.
>>> Though I do observe some form of perf improvements, but I could not
>>> get any reliable numbers (not even with the same list of with/without
>>> patches with which Joseph posted his numbers [1]).
>>>
>>>
>>> @Joseph,
>>> Would it be possible for you to give your test case a run with this
>>> patches? That will be really helpful.
>>>
>>> Branch for this is hosted at below tree.
>>>
>>> https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
>>>
>> I've tested your branch, the result is:
>> mounting with dioread_nolock, it behaves the same like reverting
>> parallel dio reads + dioread_nolock;
>> while mounting without dioread_nolock, no improvement, or even worse.
>> Please refer the test data below.
>>
>> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
>> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
>> -size=20G -numjobs=8 -runtime=600 -group_reporting
>>
>> w/ = with parallel dio reads
>> w/o = reverting parallel dio reads
>
> This is with 16c54688592ce8 "ext4: Allow parallel DIO reads" reverted,
> right?

Yes, actually, it also reverts the related patches:

Revert "ext4: remove EXT4_STATE_DIOREAD_LOCK flag"
Revert "ext4: fix off-by-one error when writing back pages before dio read"
Revert "ext4: Allow parallel DIO reads"

>
>> w/o+ = reverting parallel dio reads + dioread_nolock
>> ilock = ext4-ilock-RFC
>> ilock+ = ext4-ilock-RFC + dioread_nolock
>>
>> bs=4k:
>> --------------------------------------------------------------
>> | READ | WRITE |
>> --------------------------------------------------------------
>> w/ | 30898KB/s,7724,555.00us | 30875KB/s,7718,479.70us |
>> --------------------------------------------------------------
>> w/o | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
>> --------------------------------------------------------------
>
> I'm really surprised by the numbers here. They would mean that when DIO
> read takes i_rwsem exclusive lock instead of shared, it is a win for your
> workload... Argh, now checking code in fs/direct-io.c I think I can see the
> difference. The trick in do_blockdev_direct_IO() is:
>
> if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
> inode_unlock(dio->inode);
> if (dio->is_async && retval == 0 && dio->result &&
> (iov_iter_rw(iter) == READ || dio->result == count))
> retval = -EIOCBQUEUED;
> else
> dio_await_completion(dio);
>
> So actually only direct IO read submission is protected by i_rwsem with
> DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
>
> After some thought I think the best solution for this is to just finally
> finish the conversion of ext4 so that dioread_nolock is the only DIO path.
> With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> actually relatively simple and most of the dances with unwritten extents
> shouldn't be needed anymore.
>
> Honza
>

2019-09-26 09:22:31

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path

On Wed 25-09-19 01:18:04, Ritesh Harjani wrote:
> > read takes i_rwsem exclusive lock instead of shared, it is a win for your
> > workload... Argh, now checking code in fs/direct-io.c I think I can see the
> > difference. The trick in do_blockdev_direct_IO() is:
> >
> > if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
> > inode_unlock(dio->inode);
> > if (dio->is_async && retval == 0 && dio->result &&
> > (iov_iter_rw(iter) == READ || dio->result == count))
> > retval = -EIOCBQUEUED;
> > else
> > dio_await_completion(dio);
> >
> > So actually only direct IO read submission is protected by i_rwsem with
> > DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
> >
> > After some thought I think the best solution for this is to just finally
> > finish the conversion of ext4 so that dioread_nolock is the only DIO path.
>
> Sorry, I still didn't get this completely. Could you please explain a bit
> more?

Well, currently we have two different locking schemes for DIO - the
"normal" case and the "dioread_nolock" case. And the "normal" case really
only exists because buffered writeback needed to be more careful (so that
nolock DIO cannot expose stale data) and nobody made the effort to make that
work when blocksize < pagesize. But having two different locking schemes
for DIO is really confusing to users and a maintenance burden so we want to
get rid of the old scheme once the "dioread_nolock" scheme works for all
the configurations.

> > With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> > actually relatively simple and most of the dances with unwritten extents
> > shouldn't be needed anymore.
>
> Again, maybe it's related to above comment. Could you please give some
> insights?

Now that we hold i_rwsem in shared mode for all of DIO, it isn't really
"unlocked" anymore. Which actually very much limits the races with buffered
writes and thus page writeback (because we flush page cache before doing
DIO).
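
(For reference, a minimal sketch of that "flush page cache before DIO"
step as the generic DIO helpers do it, e.g. iomap_dio_rw(); the
function name below is made up, the rest is purely illustrative:)

	static int dio_flush_pagecache(struct kiocb *iocb, loff_t count)
	{
		struct address_space *mapping = iocb->ki_filp->f_mapping;
		loff_t pos = iocb->ki_pos;
		int ret;

		/* Write back any dirty cached pages over the DIO range... */
		ret = filemap_write_and_wait_range(mapping, pos,
						   pos + count - 1);
		if (ret)
			return ret;
		/* ...and drop them, so the page cache cannot go stale
		 * relative to what the direct IO puts on disk. */
		return invalidate_inode_pages2_range(mapping,
				pos >> PAGE_SHIFT,
				(pos + count - 1) >> PAGE_SHIFT);
	}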

> Or do you mean that we should do it like this-
> So as of now in dioread_nolock, we allocate blocks, mark the entry into
> extents as unwritten, then do the data IO, and then finally do the
> conversion of unwritten to written extents.

So this allocation of extents as unwritten in dioread_nolock mode is now
mostly unnecessary. We hold i_rwsem for reading (or even for writing) while
submitting any DIO. Because we flush page cache before starting DIO and new
pages cannot be dirtied by buffered writes (i_rwsem), DIO cannot be racing
with page writeback and thus cannot expose stale block contents. There is
one exception though - which is why I wrote "mostly" above - pages can
still be dirtied through memory mappings (there's no i_rwsem protection for
mmap writes) and thus races with page writeback exposing stale data are still
theoretically possible. In fact the "normal" DIO mode has had this kind of
race for as long as ext4 has existed.

I guess we could fix this by falling back to page writeback using unwritten
extents when we have dirty pages locked for writeback and see there's any
DIO in flight for the inode. Sadly that means we cannot get rid of the
writeback code using unwritten extents, but at least it won't be hurting
performance in the common case.
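
(Roughly something like the sketch below; the helper name is made up,
but i_dio_count is the existing per-inode counter maintained by
inode_dio_begin()/inode_dio_end():)

	/*
	 * Sketch only: in the writeback path, once the dirty pages are
	 * locked for writeback, prefer unwritten extents if any direct
	 * IO is in flight against the inode.
	 */
	static bool ext4_writeback_needs_unwritten(struct inode *inode)
	{
		return atomic_read(&inode->i_dio_count) > 0;
	}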

> So instead of that we first only reserve the disk blocks, (without
> making any on-disk changes in extent tree), do the data IO and then
> finally make an entry into extent tree on disk. And going
> forward only keep this as the default path.
>
> The above is something I have been looking into for enabling
> dioread_nolock for powerpc platforms where blocksize < page_size.
> This is based upon an upstream discussion between Ted and you :)

Yes, this is even better solution to dioread_nolock "problem" but it is
also more work then just dropping the old DIO locking mode and fix
writeback using unwritten extents for blocksize < pagesize.

> But even with above, in case of extending writes, we still
> will have to zero out those extending blocks no? Which
> will require an exclusive inode lock anyways for zeroing.
> (same which has been done in XFS too).

Yes, extending writes are a different matter.

> So going with current discussed solution of mounting with
> dioread_nolock to provide performance scalability in mixed read/write
> workload should be also the right approach, no?
> Also looking at the numbers here [3] & [4], this patch also seems
> to improve the performance with dioread_nolock mount option.
> Please help me understand your thoughts on this.

Yes, your patches are definitely going in the right direction. What I'm
discussing is mostly how to make ext4 perform well for a mixed read/write
workload by default, without the user having to use a magic mount option.

> [3] - https://marc.info/?l=linux-ext4&m=156921748126221&w=2
> [4] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-4K.png


Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-09-26 12:42:42

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path

Hello Jan,

Thanks for answering it all and giving all the info.

On 9/25/19 2:53 PM, Jan Kara wrote:
> On Wed 25-09-19 01:18:04, Ritesh Harjani wrote:
>>> read takes i_rwsem exclusive lock instead of shared, it is a win for your
>>> workload... Argh, now checking code in fs/direct-io.c I think I can see the
>>> difference. The trick in do_blockdev_direct_IO() is:
>>>
>>> if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
>>> inode_unlock(dio->inode);
>>> if (dio->is_async && retval == 0 && dio->result &&
>>> (iov_iter_rw(iter) == READ || dio->result == count))
>>> retval = -EIOCBQUEUED;
>>> else
>>> dio_await_completion(dio);
>>>
>>> So actually only direct IO read submission is protected by i_rwsem with
>>> DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
>>>
>>> After some thought I think the best solution for this is to just finally
>>> finish the conversion of ext4 so that dioread_nolock is the only DIO path.
>>
>> Sorry, I still didn't get this completely. Could you please explain a bit
>> more?
>
> Well, currently we have two different locking schemes for DIO - the
> "normal" case and the "dioread_nolock" case. And the "normal" case really
> only exists because buffered writeback needed to be more careful (so that
> nolock DIO cannot expose stale data) and nobody did the effort to make that
> work when blocksize < pagesize. But having two different locking schemes
> for DIO is really confusing to users and a maintenance burden so we want to
> get rid of the old scheme once the "dioread_nolock" scheme works for all
> the configurations.

Agreed.

>
>>> With i_rwsem held in shared mode even for "unlocked" DIO, it should be
>>> actually relatively simple and most of the dances with unwritten extents
>>> shouldn't be needed anymore.
>>
>> Again, maybe it's related to above comment. Could you please give some
>> insights?
>
> Now that we hold i_rwsem in shared mode for all of DIO, it isn't really
> "unlocked" anymore. Which actually very much limits the races with buffered
> writes and thus page writeback (because we flush page cache before doing
> DIO).

So looking at the code again based on your inputs above, we should
be able to remove this condition (<snip below>) in ext4_dio_write_checks().

What I meant is, in the DIO write path we can start with the shared lock
(which the current patch is doing), and only check the below conditions
(<snip below>) to decide whether we can continue using the shared lock
(i.e. we need not check for ext4_should_dioread_nolock(inode) anymore).

That means there should not be any race for the non-extent-based mode
either, right?


Because for overwrites in DIO (for both extents & non-extents), just
reiterating:

1. The race between bufferedIO writes and DIO overwrites is protected
against via the SHARED inode lock in DIO overwrites & the EXCL lock in
bufferedIO writes. Plus we flush the page cache before doing DIO.

2. The race between bufferedIO reads & DIO overwrites is protected
against anyway, since we don't do any block allocation during a DIO
overwrite.

3. Again, the race between DIO reads & DIO overwrites is not a problem,
since no block allocation is done anyway. So no stale data will be
exposed.


<snip of change we should do in ext4_dio_write_checks>

if (*iolock == EXT4_IOLOCK_SHARED &&
    (!IS_NOSEC(inode) || *unaligned_io || *extend ||
-    !ext4_should_dioread_nolock(inode) ||
     !ext4_overwrite_io(inode, offset, count))) {
        ext4_iunlock(inode, *iolock);
        *iolock = EXT4_IOLOCK_EXCL;
        ext4_ilock(inode, *iolock);
        goto restart;
}
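
(The unlock-then-relock, rather than an in-place upgrade, is forced by
the rwsem API: there is no shared-to-exclusive upgrade primitive, only
downgrade_write(). That is also why the goto restart has to re-run all
the checks. Roughly, assuming ext4_ilock()/ext4_iunlock() map onto the
plain rwsem calls:)

	up_read(&inode->i_rwsem);	/* drop EXT4_IOLOCK_SHARED */
	down_write(&inode->i_rwsem);	/* take EXT4_IOLOCK_EXCL; another
					 * task may have changed the file
					 * in between, so all the checks
					 * must be redone */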

>
>> Or do you mean that we should do it like this-
>> So as of now in dioread_nolock, we allocate blocks, mark the entry into
>> extents as unwritten, then do the data IO, and then finally do the
>> conversion of unwritten to written extents.
>
> So this allocation of extents as unwritten in dioread_nolock mode is now
> mostly unnecessary. We hold i_rwsem for reading (or even for writing) while
> submitting any DIO. Because we flush page cache before starting DIO and new
> pages cannot be dirtied by buffered writes (i_rwsem), DIO cannot be racing
> with page writeback and thus cannot expose stale block contents. There is
> one exception though - which is why I wrote "mostly" above - pages can
> still be dirtied through memory mappings (there's no i_rwsem protection for
> mmap writes) and thus races with page writeback exposing stale data are still
> theoretically possible. In fact the "normal" DIO mode has this kind of race
> all the time ext4 exists.

Yes, now that you've said it, I can see the below race even with the
current code. Any other races too?

i.e.

ext4_dio_read                      ext4_page_mkwrite()

filemap_write_and_wait_range()
                                   ext4_get_blocks()
submit_bio
// this could expose the stale data.
                                   mark_buffer_dirty()


Ok. I guess we cannot use any exclusive inode lock in ext4_page_mkwrite,
because then we lose the parallelism that it offers right now.
I wonder how other filesystems protect against this race (like XFS?).


> I guess we could fix this by falling back to page writeback using unwritten
> extents when we have dirty pages locked for writeback and see there's any
> DIO in flight for the inode. Sadly that means we we cannot get rid of
> writeback code using unwritten extents but at least it won't be hurting
> performance in the common case.


1. Why check for dirty pages locked for writeback (in
ext4_page_mkwrite)? We anyway lock the page which we modify or
allocate a block for, so a race against a buffered read should not
happen.

2. And even if we check with inode_dio_wait(), we still can't guarantee
that the next DIO won't snoop in while we are in ext4_page_mkwrite?

I am definitely missing something here.


>
>> So instead of that we first only reserve the disk blocks, (without
>> making any on-disk changes in extent tree), do the data IO and then
>> finally make an entry into extent tree on disk. And going
>> forward only keep this as the default path.
>>
>> The above is something I have been looking into for enabling
>> dioread_nolock for powerpc platforms where blocksize < page_size.
>> This is based upon an upstream discussion between Ted and you :)
>
> Yes, this is even better solution to dioread_nolock "problem" but it is
> also more work

Yes, I agree that we should do this incrementally.


> then just dropping the old DIO locking mode and
> fix writeback using unwritten extents for blocksize < pagesize.


So, I was looking into this (to support dioread_nolock for
blocksize < pagesize) but I could not find any history in git,
nor any thread which explains the problem.

I got below understanding while going over code -

1. This problem may be somehow related to bufferheads in the
extent conversion from unwritten to written in the writeback path.
But I couldn't see what exactly the issue is there.

I see that the completion path happens via
ext4_end_io
  |- ext4_convert_unwritten_extents()  // offset & size are properly
                                       // taken care of
  |- ext4_release_io_end()  // (the same in both DIO & writeback)


2. One thing which I noticed is the FIXME in
mpage_map_and_submit_buffers(), where for
io->io_end->size we add the whole PAGE_SIZE.
I guess we should update the size here carefully
based on the buffer_heads.


Could you please give some pointers as to what
the limitation is, or some hint which I can go
and check by myself?



>> But even with above, in case of extending writes, we still
>> will have to zero out those extending blocks no? Which
>> will require an exclusive inode lock anyways for zeroing.
>> (same which has been done in XFS too).
>
> Yes, extending writes are a different matter.
>
>> So going with current discussed solution of mounting with
>> dioread_nolock to provide performance scalability in mixed read/write
>> workload should be also the right approach, no?
>> Also looking at the numbers here [3] & [4], this patch also seems
>> to improve the performance with dioread_nolock mount option.
>> Please help me understand your thoughts on this.
>
> Yes, your patches are definitely going in the right direction. What I'm
> discussing is mostly how to make ext4 perform well for mixed read/write
> workload by default without user having to use magic mount option.

Thanks a lot for your support here.

So can we get these patches upstream incrementally?
i.e.
1. As a first step, we can remove the
ext4_should_dioread_nolock condition from the current patchset
(as mentioned above too) & get this patch rebased
on top of the iomap patchset. We should be good to merge
this patchset then, right? Since this should be able
to improve the performance even without the dioread_nolock
mount option.

2. Meanwhile I will continue to check the blocksize < pagesize
requirement for dioread_nolock. We can follow the plan
you mentioned above then.

Your thoughts?


-ritesh

2019-09-26 13:50:28

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path

Hello,

On Thu 26-09-19 18:04:55, Ritesh Harjani wrote:
> On 9/25/19 2:53 PM, Jan Kara wrote:
> > On Wed 25-09-19 01:18:04, Ritesh Harjani wrote:
> > > > read takes i_rwsem exclusive lock instead of shared, it is a win for your
> > > > workload... Argh, now checking code in fs/direct-io.c I think I can see the
> > > > difference. The trick in do_blockdev_direct_IO() is:
> > > >
> > > > if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
> > > > inode_unlock(dio->inode);
> > > > if (dio->is_async && retval == 0 && dio->result &&
> > > > (iov_iter_rw(iter) == READ || dio->result == count))
> > > > retval = -EIOCBQUEUED;
> > > > else
> > > > dio_await_completion(dio);
> > > >
> > > > So actually only direct IO read submission is protected by i_rwsem with
> > > > DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
> > > >
> > > > After some thought I think the best solution for this is to just finally
> > > > finish the conversion of ext4 so that dioread_nolock is the only DIO path.
> > >
> > > Sorry, I still didn't get this completely. Could you please explain a bit
> > > more?
> >
> > Well, currently we have two different locking schemes for DIO - the
> > "normal" case and the "dioread_nolock" case. And the "normal" case really
> > only exists because buffered writeback needed to be more careful (so that
> > nolock DIO cannot expose stale data) and nobody did the effort to make that
> > work when blocksize < pagesize. But having two different locking schemes
> > for DIO is really confusing to users and a maintenance burden so we want to
> > get rid of the old scheme once the "dioread_nolock" scheme works for all
> > the configurations.
>
> Agreed.
>
> > > > With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> > > > actually relatively simple and most of the dances with unwritten extents
> > > > shouldn't be needed anymore.
> > >
> > > Again, maybe it's related to above comment. Could you please give some
> > > insights?
> >
> > Now that we hold i_rwsem in shared mode for all of DIO, it isn't really
> > "unlocked" anymore. Which actually very much limits the races with buffered
> > writes and thus page writeback (because we flush page cache before doing
> > DIO).
>
> So looking at the code again based on your inputs from above, we should
> be able to remove this condition <snip below> in ext4_dio_write_checks.
>
> What I meant is, in DIO writes path we can start with shared lock
> (which the current patch is doing), and only check for below conditions
> <snip below> for it to continue using shared lock.
> (i.e. we need not check for ext4_should_dioread_nolock(inode) anymore).
>
> That means there should not be any race for non-extent based mode
> too right?

Yes, that is correct. But please do this as a separate change with a good
explanation of why this is OK.

> Because for overwrites in DIO (for both extents & non-extents)-
> (just reiterating)
>
> 1. Race against bufferedIO writes and DIO overwrites will be protected
> via SHARED inode lock in DIO overwrites & EXCL lock in bufferedIO
> writes. Plus we flush page cache before doing DIO.
>
> 2. Race against bufferedIO reads & DIO overwrites will be anyway protected
> since we don't do any block allocations during DIO overwrite.
>
> 3. Again race against DIO reads & DIO overwrites is not be a problem,
> since no block allocation is done anyway. So no stale data will be
> exposed.
>
>
> <snip of change we should do in ext4_dio_write_checks>
>
> if (*iolock == EXT4_IOLOCK_SHARED &&
> (!IS_NOSEC(inode) || *unaligned_io || *extend ||
> - !ext4_should_dioread_nolock(inode) ||
> !ext4_overwrite_io(inode, offset, count))) {
> ext4_iunlock(inode, *iolock);
> *iolock = EXT4_IOLOCK_EXCL;
> ext4_ilock(inode, *iolock);
> goto restart;
> }
>
> >
> > > Or do you mean that we should do it like this-
> > > So as of now in dioread_nolock, we allocate blocks, mark the entry into
> > > extents as unwritten, then do the data IO, and then finally do the
> > > conversion of unwritten to written extents.
> >
> > So this allocation of extents as unwritten in dioread_nolock mode is now
> > mostly unnecessary. We hold i_rwsem for reading (or even for writing) while
> > submitting any DIO. Because we flush page cache before starting DIO and new
> > pages cannot be dirtied by buffered writes (i_rwsem), DIO cannot be racing
> > with page writeback and thus cannot expose stale block contents. There is
> > one exception though - which is why I wrote "mostly" above - pages can
> > still be dirtied through memory mappings (there's no i_rwsem protection for
> > mmap writes) and thus races with page writeback exposing stale data are still
> > theoretically possible. In fact the "normal" DIO mode has this kind of race
> > all the time ext4 exists.
>
> Yes, now that you said I could see this below race even with current
> code. Any other race too?
>
> i.e.
> ext4_dio_read ext4_page_mkwrite()
>
> filemap_write_and_wait_range()
> ext4_get_blocks()
>
> submit_bio
> // this could expose the stale data.
>
> mark_buffer_dirty()

Yes, this is what I meant. Although note that in most cases
ext4_get_blocks() from ext4_page_mkwrite() will just reserve delalloc
blocks, so you additionally need writeback to happen on that inode to
allocate the delalloc block; only after that can the call of
ext4_iomap_begin() + submit_bio() from iomap_dio_rw() expose stale
data. So I don't think the race is very easy to trigger.

> Ok. I guess we cannot use any exclusive inode lock in ext4_page_mkwrite,
> because then we loose the parallelism that it offers right now.
> Wonder how other FS protect this race (like XFS?)

You cannot get i_rwsem at all in ext4_page_mkwrite() because mm code
holds mmap_sem when calling into ext4_page_mkwrite(). And mmap_sem ranks
below i_rwsem. And the deadlock is actually real because e.g. direct IO
code holds i_rwsem and then grabs mmap_sem as part of
bio_iov_iter_get_pages().
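
(Spelled out as lock ordering, the ABBA cycle would be:)

	/*
	 * DIO write path:   i_rwsem -> mmap_sem
	 *                   (write(2) holds i_rwsem, then
	 *                    bio_iov_iter_get_pages() takes mmap_sem)
	 *
	 * page fault path:  mmap_sem -> i_rwsem  <-- forbidden
	 *                   (the fault handler holds mmap_sem when it
	 *                    calls ->page_mkwrite())
	 */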

XFS always uses unwritten extents when writing out pages which prevents
this race.

> > I guess we could fix this by falling back to page writeback using unwritten
> > extents when we have dirty pages locked for writeback and see there's any
> > DIO in flight for the inode. Sadly that means we we cannot get rid of
> > writeback code using unwritten extents but at least it won't be hurting
> > performance in the common case.
>
> 1. So why to check for dirty pages locked for writeback (in
> ext4_page_mkwrite)? we anyway lock the page which we modify or
> allocate block for. So race against bufferedRead should not happen.
>
> 2. And even if we check for inode_dio_wait(), we still can't gurantee
> that the next DIO may not snoopin while we are in ext4_page_mkwrite?
>
> I am definitely missing something here.

I was speaking about the code in ext4_writepages(). The block mapping in
ext4_page_mkwrite() can always use unwritten extents - that is just a
fallback path in case delayed allocation is disabled or we are running out
of disk space. The code in ext4_writepages() needs to decide whether to
write back using unwritten extents or whether we can use normal ones. And
in that place I suggest replacing the current "ext4_should_dioread_nolock()"
check with "is there any dio in flight against the inode?". And to make the
check reliable (without holding i_rwsem), you need to do it only after you
have all pages locked for writeback.

> > > So instead of that we first only reserve the disk blocks, (without
> > > making any on-disk changes in extent tree), do the data IO and then
> > > finally make an entry into extent tree on disk. And going
> > > forward only keep this as the default path.
> > >
> > > The above is something I have been looking into for enabling
> > > dioread_nolock for powerpc platforms where blocksize < page_size.
> > > This is based upon an upstream discussion between Ted and you :)
> >
> > Yes, this is even better solution to dioread_nolock "problem" but it is
> > also more work
>
> Yes, I agree that we should do this incrementally.
>
> > then just dropping the old DIO locking mode and
> > fix writeback using unwritten extents for blocksize < pagesize.
>
>
> So, I was looking into this (to support dioread_nolock for
> blocksize < pagesize) but I could not find any history in git,
> nor any thread which explains the problem.

Yeah, part of the problem is that we have only a rough understanding of
what actually the problem is :)

> I got below understanding while going over code -
>
> 1. This problem may be somehow related to bufferheads in the
> extent conversion from unwritten to written in writeback path.
> But I couldn't see what exactly is the issue there?
>
> I see that the completion path happens via
> ext4_end_io
> |- ext4_convert_unwritten_extents() // offset & size is properly taken
> care.
> |- ext4_release_io_end() (which is same in both DIO & writeback).
>
>
> 2. One thing which I noticed is the FIXME in
> mpage_map_and_submit_buffers(). Where we
> io->io_end->size we add the whole PAGE_SIZE.
> I guess we should update the size here carefully
> based on buffer_heads.
>
> Could you please give some pointers as to what
> is the limitation. Or some hint which I can go
> and check by myself.

You are correct that the problem is in ext4_writepages() (and functions
called from there). Essentially the whole code there needs verification
whether unwritten extent handling works correctly when blocksize <
pagesize. As you noted, there are a couple of FIXMEs there where I was aware
that things would break but there may be other problems I've missed.
I remember things can get really complex there when there are multiple
unwritten extents underlying the page and we need to write them out.

> > > But even with above, in case of extending writes, we still
> > > will have to zero out those extending blocks no? Which
> > > will require an exclusive inode lock anyways for zeroing.
> > > (same which has been done in XFS too).
> >
> > Yes, extending writes are a different matter.
> >
> > > So going with current discussed solution of mounting with
> > > dioread_nolock to provide performance scalability in mixed read/write
> > > workload should be also the right approach, no?
> > > Also looking at the numbers here [3] & [4], this patch also seems
> > > to improve the performance with dioread_nolock mount option.
> > > Please help me understand your thoughts on this.
> >
> > Yes, your patches are definitely going in the right direction. What I'm
> > discussing is mostly how to make ext4 perform well for mixed read/write
> > workload by default without user having to use magic mount option.
>
> Really thanks for your support here.
>
> So can we get these patches upstream incrementally?
> i.e.
> 1. As a first step, we can remove this condition
> ext4_should_dioread_nolock from the current patchset
> (as mentioned above too) & get this patch rebased
> on top of iomap pathset. We should be good to merge
> this patchset then, right? Since this should be able
> to improve the performance even without dioread_nolock
> mount option.

Yes, correct.

> 2. Meanwhile I will continue to check for blocksize < pagesize
> requirement for dioread_nolock. We can follow the plan
> as you mentioned above then.

Good.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR