2023-01-19 22:12:09

by Saeed Mirzamohammadi

Subject: Phoronix pts fio io_uring test regression report on upstream v6.1 and v5.15

Hello,

I'm reporting a performance regression, introduced by the commit below, in the phoronix pts/fio test, run with the config included at the end of this email:

Link: https://lore.kernel.org/all/[email protected]/

commit 7b3188e7ed54102a5dcc73d07727f41fb528f7c8
Author: Jens Axboe <[email protected]>
Date: Mon Aug 30 19:37:41 2021 -0600

io_uring: IORING_OP_WRITE needs hash_reg_file set

We observed the regression on the latest v6.1.y and v5.15.y upstream kernels (we haven't tested other stable kernels), and performance improved by ~45% after reverting the commit above.

All of the benchmarks below regressed by around 45%:
phoronix-pts-fio-1.15.0-RandomWrite-EngineIO_uring-BufferedNo-DirectYes-BlockSize4KB-MB-s_xfs
phoronix-pts-fio-1.15.0-SequentialWrite-EngineIO_uring-BufferedNo-DirectYes-BlockSize4KB-MB-s_xfs
phoronix-pts-fio-1.15.0-SequentialWrite-EngineIO_uring-BufferedYes-DirectNo-BlockSize4KB-MB-s_xfs

We mainly see this regression on 4KB block-size tests.

We tried changing force_async, but that had no effect on the result. We also backported a modified version of the patch mentioned here (https://lkml.org/lkml/2022/7/20/854), but that didn't affect performance either.

Do you have any suggestions for fixes, or for what else we can try to narrow down the issue?

Thanks a bunch,
Saeed
--------

Here is more info on the benchmark and system:

Here is the config for fio:
[global]
rw=randwrite
ioengine=io_uring
iodepth=64
size=1g
direct=1
buffered=0
startdelay=5
force_async=4
ramp_time=5
runtime=20
time_based
disk_util=0
clat_percentiles=0
disable_lat=1
disable_clat=1
disable_slat=1
filename=/data/fiofile
[test]
name=test
bs=4k
stonewall
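
For reference, the hot path here should be IORING_OP_WRITE (the opcode
named in the commit subject) against an O_DIRECT file. In case it helps
to reproduce without the phoronix harness, below is a minimal liburing
sketch of the same I/O shape (4KB O_DIRECT writes at queue depth 64); it
is untested and only illustrative, with the file path and sizes picked
to mirror the job above:

/*
 * Untested sketch: 4KB O_DIRECT writes submitted via io_uring
 * (IORING_OP_WRITE), roughly the pattern the fio job above drives.
 * Build: gcc -O2 dio_write.c -o dio_write -luring
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BS	4096
#define QD	64

int main(void)
{
	struct io_uring ring;
	void *buf;
	int fd, i;

	fd = open("/data/fiofile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* O_DIRECT requires block-aligned buffers */
	if (posix_memalign(&buf, BS, BS))
		return 1;
	memset(buf, 0xaa, BS);

	if (io_uring_queue_init(QD, &ring, 0) < 0)
		return 1;

	/* queue QD writes at consecutive offsets, then reap them all */
	for (i = 0; i < QD; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_write(sqe, fd, buf, BS, (__u64)i * BS);
	}
	io_uring_submit(&ring);

	for (i = 0; i < QD; i++) {
		struct io_uring_cqe *cqe;

		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			return 1;
		if (cqe->res != BS)
			fprintf(stderr, "write: %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}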

df -Th output (file is on /data/):
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 252G 0 252G 0% /dev
tmpfs tmpfs 252G 0 252G 0% /dev/shm
tmpfs tmpfs 252G 18M 252G 1% /run
tmpfs tmpfs 252G 0 252G 0% /sys/fs/cgroup
/dev/mapper/ocivolume-root xfs 89G 17G 73G 19% /
/dev/mapper/ocivolume-oled xfs 10G 143M 9.9G 2% /var/oled
/dev/sda2 xfs 1014M 643M 372M 64% /boot
/dev/sda1 vfat 100M 5.0M 95M 6% /boot/efi
tmpfs tmpfs 51G 0 51G 0% /run/user/0
tmpfs tmpfs 51G 0 51G 0% /run/user/987
/dev/mapper/tank-lvm xfs 100G 1.8G 99G 2% /data


2023-01-20 04:48:11

by Jens Axboe

Subject: Re: Phoronix pts fio io_uring test regression report on upstream v6.1 and v5.15

On 1/19/23 2:36 PM, Saeed Mirzamohammadi wrote:
> Hello,
>
> I'm reporting a performance regression, introduced by the commit below,
> in the phoronix pts/fio test, run with the config included at the end
> of this email:
>
> Link: https://lore.kernel.org/all/[email protected]/
>
> commit 7b3188e7ed54102a5dcc73d07727f41fb528f7c8
> Author: Jens Axboe <[email protected]>
> Date: Mon Aug 30 19:37:41 2021 -0600
>
> io_uring: IORING_OP_WRITE needs hash_reg_file set
>
> We observed the regression on the latest v6.1.y and v5.15.y upstream
> kernels (we haven't tested other stable kernels), and performance
> improved by ~45% after reverting the commit above.
>
> All of the benchmarks below regressed by around 45%:
> phoronix-pts-fio-1.15.0-RandomWrite-EngineIO_uring-BufferedNo-DirectYes-BlockSize4KB-MB-s_xfs
> phoronix-pts-fio-1.15.0-SequentialWrite-EngineIO_uring-BufferedNo-DirectYes-BlockSize4KB-MB-s_xfs
> phoronix-pts-fio-1.15.0-SequentialWrite-EngineIO_uring-BufferedYes-DirectNo-BlockSize4KB-MB-s_xfs
>
> We mainly see this regression on 4KB block-size tests.
>
> We tried changing force_async, but that had no effect on the result.
> We also backported a modified version of the patch mentioned here
> (https://lkml.org/lkml/2022/7/20/854), but that didn't affect
> performance either.
>
> Do you have any suggestions for fixes, or for what else we can try to
> narrow down the issue?

This is really mostly by design - the previous approach of not hashing
buffered writes on regular files would cause a lot of inode lock
contention, with lots of threads hammering on the same inode.

That said, for XFS, we don't need to serialize O_DIRECT writes. I don't
think we currently have a way to detect this, as it isn't really
advertised. Something like the below might work, with the caveat that
it is totally untested.


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 595a5bcf46b9..85fdc6f2efa4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1171,7 +1171,8 @@ xfs_file_open(
{
if (xfs_is_shutdown(XFS_M(inode->i_sb)))
return -EIO;
- file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC;
+ file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
+ FMODE_ODIRECT_PARALLEL;
return generic_file_open(inode, file);
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c1769a2c5d70..8541b9e53c2d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -166,6 +166,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File supports DIRECT IO */
#define FMODE_CAN_ODIRECT ((__force fmode_t)0x400000)

+/* File supports parallel O_DIRECT writes */
+#define FMODE_ODIRECT_PARALLEL ((__force fmode_t)0x800000)
+
/* File was opened by fanotify and shouldn't generate fanotify events */
#define FMODE_NONOTIFY ((__force fmode_t)0x4000000)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e680685e8a00..1409f6f69b13 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -424,7 +424,12 @@ static void io_prep_async_work(struct io_kiocb *req)
req->flags |= io_file_get_flags(req->file) << REQ_F_SUPPORT_NOWAIT_BIT;

if (req->flags & REQ_F_ISREG) {
- if (def->hash_reg_file || (ctx->flags & IORING_SETUP_IOPOLL))
+ bool should_hash = def->hash_reg_file;
+
+ if (should_hash && (req->file->f_flags & O_DIRECT) &&
+ (req->file->f_mode & FMODE_ODIRECT_PARALLEL))
+ should_hash = false;
+ if (should_hash || (ctx->flags & IORING_SETUP_IOPOLL))
io_wq_hash_work(&req->work, file_inode(req->file));
} else if (!req->file || !S_ISBLK(file_inode(req->file)->i_mode)) {
if (def->unbound_nonreg_file)
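
For context on what "hashing" means here: io_wq_hash_work() tags the
work item with a hash derived from the inode pointer, and io-wq runs
items that share a hash value serially rather than in parallel. Roughly,
paraphrasing the upstream helper in io_uring/io-wq.h from memory (not a
verbatim copy):

static inline void io_wq_hash_work(struct io_wq_work *work, void *val)
{
	/* val is the inode; works hashing to the same bucket run serially */
	unsigned int bit = hash_ptr(val, IO_WQ_HASH_ORDER);

	work->flags |= IO_WQ_WORK_HASHED | (bit << IO_WQ_HASH_SHIFT);
}

So with hash_reg_file set, async writes to a given file are funneled
through one worker at a time; the change above skips the hashing when
the file was opened O_DIRECT and advertises FMODE_ODIRECT_PARALLEL.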

--
Jens Axboe

2023-01-26 00:22:17

by Saeed Mirzamohammadi

Subject: Re: Phoronix pts fio io_uring test regression report on upstream v6.1 and v5.15

Hi Jens,

I applied your patch (with a minor conflict in xfs_file_open(), since FMODE_BUF_WASYNC isn't in v5.15) and ran the same series of tests on the v5.15 kernel. All the io_uring benchmarks still showed a 20-45% regression with it applied. I haven't tested on v6.1 yet.

Thanks,
Saeed
________________________________________
From: Jens Axboe <[email protected]>
Sent: Thursday, January 19, 2023 8:12 PM
To: Saeed Mirzamohammadi; [email protected]
Cc: [email protected]; [email protected]
Subject: Re: Phoronix pts fio io_uring test regression report on upstream v6.1 and v5.15

On 1/19/23 2:36 PM, Saeed Mirzamohammadi wrote:
> Hello,
>
> I'm reporting a performance regression, introduced by the commit below,
> in the phoronix pts/fio test, run with the config included at the end
> of this email:
>
> Link: https://lore.kernel.org/all/[email protected]/
>
> commit 7b3188e7ed54102a5dcc73d07727f41fb528f7c8
> Author: Jens Axboe <[email protected]>
> Date: Mon Aug 30 19:37:41 2021 -0600
>
> io_uring: IORING_OP_WRITE needs hash_reg_file set
>
> We observed the regression on the latest v6.1.y and v5.15.y upstream
> kernels (we haven't tested other stable kernels), and performance
> improved by ~45% after reverting the commit above.
>
> All of the benchmarks below regressed by around 45%:
> phoronix-pts-fio-1.15.0-RandomWrite-EngineIO_uring-BufferedNo-DirectYes-BlockSize4KB-MB-s_xfs
> phoronix-pts-fio-1.15.0-SequentialWrite-EngineIO_uring-BufferedNo-DirectYes-BlockSize4KB-MB-s_xfs
> phoronix-pts-fio-1.15.0-SequentialWrite-EngineIO_uring-BufferedYes-DirectNo-BlockSize4KB-MB-s_xfs
>
> We mainly see this regression on 4KB block-size tests.
>
> We tried changing force_async, but that had no effect on the result.
> We also backported a modified version of the patch mentioned here
> (https://lkml.org/lkml/2022/7/20/854), but that didn't affect
> performance either.
>
> Do you have any suggestions for fixes, or for what else we can try to
> narrow down the issue?

This is really mostly by design - the previous approach of not hashing
buffered writes on regular files would cause a lot of inode lock
contention, with lots of threads hammering on the same inode.

That said, for XFS, we don't need to serialize O_DIRECT writes. I don't
think we currently have a way to detect this, as it isn't really
advertised. Something like the below might work, with the caveat that
it is totally untested.


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 595a5bcf46b9..85fdc6f2efa4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1171,7 +1171,8 @@ xfs_file_open(
{
if (xfs_is_shutdown(XFS_M(inode->i_sb)))
return -EIO;
- file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC;
+ file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
+ FMODE_ODIRECT_PARALLEL;
return generic_file_open(inode, file);
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c1769a2c5d70..8541b9e53c2d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -166,6 +166,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File supports DIRECT IO */
#define FMODE_CAN_ODIRECT ((__force fmode_t)0x400000)

+/* File supports parallel O_DIRECT writes */
+#define FMODE_ODIRECT_PARALLEL ((__force fmode_t)0x800000)
+
/* File was opened by fanotify and shouldn't generate fanotify events */
#define FMODE_NONOTIFY ((__force fmode_t)0x4000000)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e680685e8a00..1409f6f69b13 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -424,7 +424,12 @@ static void io_prep_async_work(struct io_kiocb *req)
req->flags |= io_file_get_flags(req->file) << REQ_F_SUPPORT_NOWAIT_BIT;

if (req->flags & REQ_F_ISREG) {
- if (def->hash_reg_file || (ctx->flags & IORING_SETUP_IOPOLL))
+ bool should_hash = def->hash_reg_file;
+
+ if (should_hash && (req->file->f_flags & O_DIRECT) &&
+ (req->file->f_mode & FMODE_ODIRECT_PARALLEL))
+ should_hash = false;
+ if (should_hash || (ctx->flags & IORING_SETUP_IOPOLL))
io_wq_hash_work(&req->work, file_inode(req->file));
} else if (!req->file || !S_ISBLK(file_inode(req->file)->i_mode)) {
if (def->unbound_nonreg_file)

--
Jens Axboe


2023-01-26 00:29:01

by Jens Axboe

Subject: Re: Phoronix pts fio io_uring test regression report on upstream v6.1 and v5.15

On 1/25/23 5:22 PM, Saeed Mirzamohammadi wrote:
> Hi Jens,
>
> I applied your patch (with a minor conflict in xfs_file_open(), since FMODE_BUF_WASYNC isn't in v5.15) and ran the same series of tests on the v5.15 kernel. All the io_uring benchmarks still showed a 20-45% regression with it applied. I haven't tested on v6.1 yet.

It should basically make the behavior the same as before once you apply
the patch, so please pass on the patch that you applied for 5.15 so we
can take a closer look.

Also, please don't top post. Replies go below the context; that's normal
OSS etiquette. Just like I did in the first reply, and in this one.

--
Jens Axboe


2023-01-26 18:35:14

by Jens Axboe

Subject: Re: Phoronix pts fio io_uring test regression report on upstream v6.1 and v5.15

On 1/26/23 11:04 AM, Saeed Mirzamohammadi wrote:
> Hi Jens,
>
>> On Jan 25, 2023, at 4:28 PM, Jens Axboe <[email protected]> wrote:
>>
>> On 1/25/23 5:22 PM, Saeed Mirzamohammadi wrote:
>>> Hi Jens,
>>>
>>> I applied your patch (with a minor conflict in xfs_file_open(), since FMODE_BUF_WASYNC isn't in v5.15) and ran the same series of tests on the v5.15 kernel. All the io_uring benchmarks still showed a 20-45% regression with it applied. I haven't tested on v6.1 yet.
>>
>> It should basically make the behavior the same as before once you apply
>> the patch, so please pass on the patch that you applied for 5.15 so we
>> can take a closer look.
>
> Attached the patch.

I tested the upstream variant, and it does what it's supposed to: it
gets parallel writes on O_DIRECT. Unpatched, any dio write results in:

fio-566 [000] ..... 131.071108: io_uring_queue_async_work: ring 00000000706cb6c0, request 00000000b21691c4, user_data 0xaaab0e8e4c00, opcode WRITE, flags 0xe0040000, hashed queue, work 000000002c5aeb79

and after the patch:

fio-376 [000] ..... 24.590994: io_uring_queue_async_work: ring 000000007bdb650a, request 000000006b5350e0, user_data 0xaaab1b3e3c00, opcode WRITE, flags 0xe0040000, normal queue, work 00000000e3e81955

where the hashed queue is serialized based on the inode, and the normal
queue is not (i.e. requests to the same file run in parallel).
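
If you want to check which queue your writes land in on your setup,
those lines are from the io_uring_queue_async_work tracepoint; something
like the below should capture them (assuming tracefs is mounted in the
usual place):

echo 1 > /sys/kernel/tracing/events/io_uring/io_uring_queue_async_work/enable
cat /sys/kernel/tracing/trace_pipe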

As mentioned, the fio job being used isn't representative of anything
that should actually be run; the force_async flag really only exists for
experimentation. Do you have a real workload that is seeing a regression?
If so, does that real workload change performance with the patch?

--
Jens Axboe



2023-02-14 18:58:57

by Saeed Mirzamohammadi

Subject: Re: Phoronix pts fio io_uring test regression report on upstream v6.1 and v5.15

Hi Jens,

> On Jan 26, 2023, at 10:35 AM, Jens Axboe <[email protected]> wrote:
>
> On 1/26/23 11:04 AM, Saeed Mirzamohammadi wrote:
>> Hi Jens,
>>
>>> On Jan 25, 2023, at 4:28 PM, Jens Axboe <[email protected]> wrote:
>>>
>>> On 1/25/23 5:22 PM, Saeed Mirzamohammadi wrote:
>>>> Hi Jens,
>>>>
>>>> I applied your patch (with a minor conflict in xfs_file_open(), since FMODE_BUF_WASYNC isn't in v5.15) and ran the same series of tests on the v5.15 kernel. All the io_uring benchmarks still showed a 20-45% regression with it applied. I haven't tested on v6.1 yet.
>>>
>>> It should basically make the behavior the same as before once you apply
>>> the patch, so please pass on the patch that you applied for 5.15 so we
>>> can take a closer look.
>>
>> Attached the patch.
>
> I tested the upstream variant, and it does what it's supposed to: it
> gets parallel writes on O_DIRECT. Unpatched, any dio write results in:
>
> fio-566 [000] ..... 131.071108: io_uring_queue_async_work: ring 00000000706cb6c0, request 00000000b21691c4, user_data 0xaaab0e8e4c00, opcode WRITE, flags 0xe0040000, hashed queue, work 000000002c5aeb79
>
> and after the patch:
>
> fio-376 [000] ..... 24.590994: io_uring_queue_async_work: ring 000000007bdb650a, request 000000006b5350e0, user_data 0xaaab1b3e3c00, opcode WRITE, flags 0xe0040000, normal queue, work 00000000e3e81955
>

Thanks for looking into this.

> where the hashed queue is serialized based on the inode, and the normal
> queue is not (i.e. requests to the same file run in parallel).
>
> As mentioned, the fio job being used isn't representative of anything
> that should actually be run; the force_async flag really only exists for
> experimentation. Do you have a real workload that is seeing a regression?
> If so, does that real workload change performance with the patch?

I tested without the force_async flag but didn't see any change in performance.

I haven't tested any real workload yet. I'll let you know if I notice anything.

Thanks,
Saeed

P.S. I ran into multipathd issues with the patch applied that I had to work through; I never saw them without the patch.

>
> --
> Jens Axboe