2019-07-19 09:25:33

by Joseph Qi

Subject: [RFC] performance regression with "ext4: Allow parallel DIO reads"

Hi Ted & Jan,
I've observed a significant performance regression with the following
commit on my Intel P3600 NVMe SSD.
16c54688592c ext4: Allow parallel DIO reads

From my initial investigation, it may be because inode_lock_shared
(down_read) costs more than inode_lock (down_write) in a mixed random
read/write workload.

Here are my test results (the three numbers in each READ/WRITE cell are
bandwidth, IOPS and average latency).

ioengine=psync
direct=1
rw=randrw
iodepth=1
numjobs=8
size=20G
runtime=600
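
For reproduction, a minimal fio job file along these lines should match
the parameters above; the job name, directory and bs value below are
assumptions to adapt, and bs was varied across 4k/16k/64k/512k/1M for
the tables that follow.

# randrw.fio -- hypothetical job file matching the parameters above
# (bs and directory are assumptions; bs was changed for each block size tested)
[global]
ioengine=psync
direct=1
rw=randrw
iodepth=1
numjobs=8
size=20G
runtime=600
bs=4k
directory=/mnt/ext4test

[randrw]

Run as "fio randrw.fio", once per block size.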

w/ parallel dio reads : kernel 5.2.0
w/o parallel dio reads: kernel 5.2.0, then revert the following commits:
1d39834fba99 ext4: remove EXT4_STATE_DIOREAD_LOCK flag (related)
e5465795cac4 ext4: fix off-by-one error when writing back pages before dio read (related)
16c54688592c ext4: Allow parallel DIO reads
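
For reference, building the "w/o parallel dio reads" kernel is presumably
just a revert of the three commits on top of v5.2, in the order listed,
assuming they still revert cleanly:

git revert --no-edit 1d39834fba99 e5465795cac4 16c54688592c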

bs=4k:
-------------------------------------------------------------------------------------------
w/ parallel dio reads | READ 30898KB/s, 7724, 555.00us | WRITE 30875KB/s, 7718, 479.70us
-------------------------------------------------------------------------------------------
w/o parallel dio reads| READ 117915KB/s, 29478, 248.18us | WRITE 117854KB/s,29463, 21.91us
-------------------------------------------------------------------------------------------

bs=16k:
-------------------------------------------------------------------------------------------
w/ parallel dio reads | READ 58961KB/s, 3685, 835.28us | WRITE 58877KB/s, 3679, 1335.98us
-------------------------------------------------------------------------------------------
w/o parallel dio reads| READ 218409KB/s, 13650, 554.46us | WRITE 218257KB/s,13641, 29.22us
-------------------------------------------------------------------------------------------

bs=64k:
-------------------------------------------------------------------------------------------
w/ parallel dio reads | READ 119396KB/s, 1865, 1759.38us | WRITE 119159KB/s, 1861, 2532.26us
-------------------------------------------------------------------------------------------
w/o parallel dio reads| READ 422815KB/s, 6606, 1146.05us | WRITE 421619KB/s, 6587, 60.72us
-------------------------------------------------------------------------------------------

bs=512k:
-------------------------------------------------------------------------------------------
w/ parallel dio reads | READ 392973KB/s, 767, 5046.35us | WRITE 393165KB/s, 767, 5359.86us
-------------------------------------------------------------------------------------------
w/o parallel dio reads| READ 590266KB/s, 1152, 4312.01us | WRITE 590554KB/s, 1153, 2606.82us
-------------------------------------------------------------------------------------------

bs=1M:
-------------------------------------------------------------------------------------------
w/ parallel dio reads | READ 487779KB/s, 476, 8058.55us | WRITE 485592KB/s, 474, 8630.51us
-------------------------------------------------------------------------------------------
w/o parallel dio reads| READ 593927KB/s, 580, 7623.63us | WRITE 591265KB/s, 577, 6163.42us
-------------------------------------------------------------------------------------------

Thanks,
Joseph


2019-07-23 21:12:50

by Joseph Qi

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

Hi Ted & Jan,
Could you please give your valuable comments?

Thanks,
Joseph

On 19/7/19 17:22, Joseph Qi wrote:
> Hi Ted & Jan,
> I've observed an significant performance regression with the following
> commit in my Intel P3600 NVMe SSD.
> 16c54688592c ext4: Allow parallel DIO reads
>
> From my initial investigation, it may be because of the
> inode_lock_shared (down_read) consumes more than inode_lock (down_write)
> in mixed random read write workload.
>
> Here is my test result.
>
> ioengine=psync
> direct=1
> rw=randrw
> iodepth=1
> numjobs=8
> size=20G
> runtime=600
>
> w/ parallel dio reads : kernel 5.2.0
> w/o parallel dio reads: kernel 5.2.0, then revert the following commits:
> 1d39834fba99 ext4: remove EXT4_STATE_DIOREAD_LOCK flag (related)
> e5465795cac4 ext4: fix off-by-one error when writing back pages before dio read (related)
> 16c54688592c ext4: Allow parallel DIO reads
>
> bs=4k:
> -------------------------------------------------------------------------------------------
> w/ parallel dio reads | READ 30898KB/s, 7724, 555.00us | WRITE 30875KB/s, 7718, 479.70us
> -------------------------------------------------------------------------------------------
> w/o parallel dio reads| READ 117915KB/s, 29478, 248.18us | WRITE 117854KB/s,29463, 21.91us
> -------------------------------------------------------------------------------------------
>
> bs=16k:
> -------------------------------------------------------------------------------------------
> w/ parallel dio reads | READ 58961KB/s, 3685, 835.28us | WRITE 58877KB/s, 3679, 1335.98us
> -------------------------------------------------------------------------------------------
> w/o parallel dio reads| READ 218409KB/s, 13650, 554.46us | WRITE 218257KB/s,13641, 29.22us
> -------------------------------------------------------------------------------------------
>
> bs=64k:
> -------------------------------------------------------------------------------------------
> w/ parallel dio reads | READ 119396KB/s, 1865, 1759.38us | WRITE 119159KB/s, 1861, 2532.26us
> -------------------------------------------------------------------------------------------
> w/o parallel dio reads| READ 422815KB/s, 6606, 1146.05us | WRITE 421619KB/s, 6587, 60.72us
> -------------------------------------------------------------------------------------------
>
> bs=512k:
> -------------------------------------------------------------------------------------------
> w/ parallel dio reads | READ 392973KB/s, 767, 5046.35us | WRITE 393165KB/s, 767, 5359.86us
> -------------------------------------------------------------------------------------------
> w/o parallel dio reads| READ 590266KB/s, 1152, 4312.01us | WRITE 590554KB/s, 1153, 2606.82us
> -------------------------------------------------------------------------------------------
>
> bs=1M:
> -------------------------------------------------------------------------------------------
> w/ parallel dio reads | READ 487779KB/s, 476, 8058.55us | WRITE 485592KB/s, 474, 8630.51us
> -------------------------------------------------------------------------------------------
> w/o parallel dio reads| READ 593927KB/s, 580, 7623.63us | WRITE 591265KB/s, 577, 6163.42us
> -------------------------------------------------------------------------------------------
>
> Thanks,
> Joseph
>

2019-07-25 21:23:04

by Andreas Dilger

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"


> On Jul 23, 2019, at 5:17 AM, Joseph Qi <[email protected]> wrote:
>
> Hi Ted & Jan,
> Could you please give your valuable comments?

It seems like the original patches should be reverted? There is no data
in the original commit message that indicates there is an actual performance
improvement from that patch, but there is data here showing it hurts both
read and write performance quite significantly.

Cheers, Andreas

> On 19/7/19 17:22, Joseph Qi wrote:
>> Hi Ted & Jan,
>> I've observed an significant performance regression with the following
>> commit in my Intel P3600 NVMe SSD.
>> 16c54688592c ext4: Allow parallel DIO reads
>>
>> From my initial investigation, it may be because of the
>> inode_lock_shared (down_read) consumes more than inode_lock (down_write)
>> in mixed random read write workload.
>>
>> Here is my test result.
>>
>> ioengine=psync
>> direct=1
>> rw=randrw
>> iodepth=1
>> numjobs=8
>> size=20G
>> runtime=600
>>
>> w/ parallel dio reads : kernel 5.2.0
>> w/o parallel dio reads: kernel 5.2.0, then revert the following commits:
>> 1d39834fba99 ext4: remove EXT4_STATE_DIOREAD_LOCK flag (related)
>> e5465795cac4 ext4: fix off-by-one error when writing back pages before dio read (related)
>> 16c54688592c ext4: Allow parallel DIO reads
>>
>> bs=4k:
>> -------------------------------------------------------------------------------------------
>> w/ parallel dio reads | READ 30898KB/s, 7724, 555.00us | WRITE 30875KB/s, 7718, 479.70us
>> -------------------------------------------------------------------------------------------
>> w/o parallel dio reads| READ 117915KB/s, 29478, 248.18us | WRITE 117854KB/s,29463, 21.91us
>> -------------------------------------------------------------------------------------------
>>
>> bs=16k:
>> -------------------------------------------------------------------------------------------
>> w/ parallel dio reads | READ 58961KB/s, 3685, 835.28us | WRITE 58877KB/s, 3679, 1335.98us
>> -------------------------------------------------------------------------------------------
>> w/o parallel dio reads| READ 218409KB/s, 13650, 554.46us | WRITE 218257KB/s,13641, 29.22us
>> -------------------------------------------------------------------------------------------
>>
>> bs=64k:
>> -------------------------------------------------------------------------------------------
>> w/ parallel dio reads | READ 119396KB/s, 1865, 1759.38us | WRITE 119159KB/s, 1861, 2532.26us
>> -------------------------------------------------------------------------------------------
>> w/o parallel dio reads| READ 422815KB/s, 6606, 1146.05us | WRITE 421619KB/s, 6587, 60.72us
>> -------------------------------------------------------------------------------------------
>>
>> bs=512k:
>> -------------------------------------------------------------------------------------------
>> w/ parallel dio reads | READ 392973KB/s, 767, 5046.35us | WRITE 393165KB/s, 767, 5359.86us
>> -------------------------------------------------------------------------------------------
>> w/o parallel dio reads| READ 590266KB/s, 1152, 4312.01us | WRITE 590554KB/s, 1153, 2606.82us
>> -------------------------------------------------------------------------------------------
>>
>> bs=1M:
>> -------------------------------------------------------------------------------------------
>> w/ parallel dio reads | READ 487779KB/s, 476, 8058.55us | WRITE 485592KB/s, 474, 8630.51us
>> -------------------------------------------------------------------------------------------
>> w/o parallel dio reads| READ 593927KB/s, 580, 7623.63us | WRITE 591265KB/s, 577, 6163.42us
>> -------------------------------------------------------------------------------------------
>>
>> Thanks,
>> Joseph
>>


Cheers, Andreas







2019-07-26 01:12:37

by Joseph Qi

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"



On 19/7/26 05:20, Andreas Dilger wrote:
>
>> On Jul 23, 2019, at 5:17 AM, Joseph Qi <[email protected]> wrote:
>>
>> Hi Ted & Jan,
>> Could you please give your valuable comments?
>
> It seems like the original patches should be reverted? There is no data

From my test result, yes.
I've also tested libaio with iodepth 16, and it behaves the same. Here is the
test data for libaio 4k randrw:

-------------------------------------------------------------------------------------------
w/ parallel dio reads | READ 78313KB/s, 19578, 1698.70us | WRITE 78313KB/s, 19578, 4837.60us
-------------------------------------------------------------------------------------------
w/o parallel dio reads| READ 387774KB/s, 96943, 1009.73us | WRITE 387656KB/s,96914, 308.87us
-------------------------------------------------------------------------------------------

Since this commit went upstream a long time ago (Linux 4.9, to be
precise), I wonder whether anyone else has observed this regression, or
whether there is anything I have missed.

Thanks,
Joseph

> in the original commit message that indicates there is an actual performance
> improvement from that patch, but there is data here showing it hurts both
> read and write performance quite significantly.
> > Cheers, Andreas
>
>> On 19/7/19 17:22, Joseph Qi wrote:
>>> Hi Ted & Jan,
>>> I've observed an significant performance regression with the following
>>> commit in my Intel P3600 NVMe SSD.
>>> 16c54688592c ext4: Allow parallel DIO reads
>>>
>>> From my initial investigation, it may be because of the
>>> inode_lock_shared (down_read) consumes more than inode_lock (down_write)
>>> in mixed random read write workload.
>>>
>>> Here is my test result.
>>>
>>> ioengine=psync
>>> direct=1
>>> rw=randrw
>>> iodepth=1
>>> numjobs=8
>>> size=20G
>>> runtime=600
>>>
>>> w/ parallel dio reads : kernel 5.2.0
>>> w/o parallel dio reads: kernel 5.2.0, then revert the following commits:
>>> 1d39834fba99 ext4: remove EXT4_STATE_DIOREAD_LOCK flag (related)
>>> e5465795cac4 ext4: fix off-by-one error when writing back pages before dio read (related)
>>> 16c54688592c ext4: Allow parallel DIO reads
>>>
>>> bs=4k:
>>> -------------------------------------------------------------------------------------------
>>> w/ parallel dio reads | READ 30898KB/s, 7724, 555.00us | WRITE 30875KB/s, 7718, 479.70us
>>> -------------------------------------------------------------------------------------------
>>> w/o parallel dio reads| READ 117915KB/s, 29478, 248.18us | WRITE 117854KB/s,29463, 21.91us
>>> -------------------------------------------------------------------------------------------
>>>
>>> bs=16k:
>>> -------------------------------------------------------------------------------------------
>>> w/ parallel dio reads | READ 58961KB/s, 3685, 835.28us | WRITE 58877KB/s, 3679, 1335.98us
>>> -------------------------------------------------------------------------------------------
>>> w/o parallel dio reads| READ 218409KB/s, 13650, 554.46us | WRITE 218257KB/s,13641, 29.22us
>>> -------------------------------------------------------------------------------------------
>>>
>>> bs=64k:
>>> -------------------------------------------------------------------------------------------
>>> w/ parallel dio reads | READ 119396KB/s, 1865, 1759.38us | WRITE 119159KB/s, 1861, 2532.26us
>>> -------------------------------------------------------------------------------------------
>>> w/o parallel dio reads| READ 422815KB/s, 6606, 1146.05us | WRITE 421619KB/s, 6587, 60.72us
>>> -------------------------------------------------------------------------------------------
>>>
>>> bs=512k:
>>> -------------------------------------------------------------------------------------------
>>> w/ parallel dio reads | READ 392973KB/s, 767, 5046.35us | WRITE 393165KB/s, 767, 5359.86us
>>> -------------------------------------------------------------------------------------------
>>> w/o parallel dio reads| READ 590266KB/s, 1152, 4312.01us | WRITE 590554KB/s, 1153, 2606.82us
>>> -------------------------------------------------------------------------------------------
>>>
>>> bs=1M:
>>> -------------------------------------------------------------------------------------------
>>> w/ parallel dio reads | READ 487779KB/s, 476, 8058.55us | WRITE 485592KB/s, 474, 8630.51us
>>> -------------------------------------------------------------------------------------------
>>> w/o parallel dio reads| READ 593927KB/s, 580, 7623.63us | WRITE 591265KB/s, 577, 6163.42us
>>> -------------------------------------------------------------------------------------------
>>>
>>> Thanks,
>>> Joseph
>>>
>
>
> Cheers, Andreas
>
>
>
>
>

2019-07-27 02:21:22

by Andreas Dilger

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

It would be useful to post some details about your test hardware
(e.g. HDD vs. SSD, CPU cores+speed, RAM), so that it is possible to make
a good comparison if someone sees different results.

Cheers, Andreas

> On Jul 25, 2019, at 19:12, Joseph Qi <[email protected]> wrote:
>
>
>
>> On 19/7/26 05:20, Andreas Dilger wrote:
>>
>>> On Jul 23, 2019, at 5:17 AM, Joseph Qi <[email protected]> wrote:
>>>
>>> Hi Ted & Jan,
>>> Could you please give your valuable comments?
>>
>> It seems like the original patches should be reverted? There is no data
>
> From my test result, yes.
> I've also tested libaio with iodepth 16, it behaves the same. Here is the test
> data for libaio 4k randrw:
>
> -------------------------------------------------------------------------------------------
> w/ parallel dio reads | READ 78313KB/s, 19578, 1698.70us | WRITE 78313KB/s, 19578, 4837.60us
> -------------------------------------------------------------------------------------------
> w/o parallel dio reads| READ 387774KB/s, 96943, 1009.73us | WRITE 387656KB/s,96914, 308.87us
> -------------------------------------------------------------------------------------------
>
> Since this commit went into upstream long time ago,to be precise, Linux
> 4.9, I wonder if someone else has also observed this regression, or
> anything I missed?
>
> Thanks,
> Joseph
>
>> in the original commit message that indicates there is an actual performance
>> improvement from that patch, but there is data here showing it hurts both
>> read and write performance quite significantly.
>>> Cheers, Andreas
>>
>>>> On 19/7/19 17:22, Joseph Qi wrote:
>>>> Hi Ted & Jan,
>>>> I've observed an significant performance regression with the following
>>>> commit in my Intel P3600 NVMe SSD.
>>>> 16c54688592c ext4: Allow parallel DIO reads
>>>>
>>>> From my initial investigation, it may be because of the
>>>> inode_lock_shared (down_read) consumes more than inode_lock (down_write)
>>>> in mixed random read write workload.
>>>>
>>>> Here is my test result.
>>>>
>>>> ioengine=psync
>>>> direct=1
>>>> rw=randrw
>>>> iodepth=1
>>>> numjobs=8
>>>> size=20G
>>>> runtime=600
>>>>
>>>> w/ parallel dio reads : kernel 5.2.0
>>>> w/o parallel dio reads: kernel 5.2.0, then revert the following commits:
>>>> 1d39834fba99 ext4: remove EXT4_STATE_DIOREAD_LOCK flag (related)
>>>> e5465795cac4 ext4: fix off-by-one error when writing back pages before dio read (related)
>>>> 16c54688592c ext4: Allow parallel DIO reads
>>>>
>>>> bs=4k:
>>>> -------------------------------------------------------------------------------------------
>>>> w/ parallel dio reads | READ 30898KB/s, 7724, 555.00us | WRITE 30875KB/s, 7718, 479.70us
>>>> -------------------------------------------------------------------------------------------
>>>> w/o parallel dio reads| READ 117915KB/s, 29478, 248.18us | WRITE 117854KB/s,29463, 21.91us
>>>> -------------------------------------------------------------------------------------------
>>>>
>>>> bs=16k:
>>>> -------------------------------------------------------------------------------------------
>>>> w/ parallel dio reads | READ 58961KB/s, 3685, 835.28us | WRITE 58877KB/s, 3679, 1335.98us
>>>> -------------------------------------------------------------------------------------------
>>>> w/o parallel dio reads| READ 218409KB/s, 13650, 554.46us | WRITE 218257KB/s,13641, 29.22us
>>>> -------------------------------------------------------------------------------------------
>>>>
>>>> bs=64k:
>>>> -------------------------------------------------------------------------------------------
>>>> w/ parallel dio reads | READ 119396KB/s, 1865, 1759.38us | WRITE 119159KB/s, 1861, 2532.26us
>>>> -------------------------------------------------------------------------------------------
>>>> w/o parallel dio reads| READ 422815KB/s, 6606, 1146.05us | WRITE 421619KB/s, 6587, 60.72us
>>>> -------------------------------------------------------------------------------------------
>>>>
>>>> bs=512k:
>>>> -------------------------------------------------------------------------------------------
>>>> w/ parallel dio reads | READ 392973KB/s, 767, 5046.35us | WRITE 393165KB/s, 767, 5359.86us
>>>> -------------------------------------------------------------------------------------------
>>>> w/o parallel dio reads| READ 590266KB/s, 1152, 4312.01us | WRITE 590554KB/s, 1153, 2606.82us
>>>> -------------------------------------------------------------------------------------------
>>>>
>>>> bs=1M:
>>>> -------------------------------------------------------------------------------------------
>>>> w/ parallel dio reads | READ 487779KB/s, 476, 8058.55us | WRITE 485592KB/s, 474, 8630.51us
>>>> -------------------------------------------------------------------------------------------
>>>> w/o parallel dio reads| READ 593927KB/s, 580, 7623.63us | WRITE 591265KB/s, 577, 6163.42us
>>>> -------------------------------------------------------------------------------------------
>>>>
>>>> Thanks,
>>>> Joseph
>>>>
>>
>>
>> Cheers, Andreas
>>
>>
>>
>>
>>

2019-07-27 02:27:03

by Joseph Qi

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"



On 19/7/27 09:57, Andreas Dilger wrote:
> It would be useful to post some details about your test hardware
> (eg. HDD vs. SSD, CPU cores+speed, RAM), so that it is possible to make
> a good comparison of someone sees different results.

Sure, as I posted before, it is an Intel P3600 NVMe SSD.
Other hardware information is as below:
CPU: Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz * 64
MEM: 256G

In fact, it behaves the same on different machines. So I think it has
more to do with the test workload (concurrent random read/write).

Thanks,
Joseph

>
> Cheers, Andreas
>
>> On Jul 25, 2019, at 19:12, Joseph Qi <[email protected]> wrote:
>>
>>
>>
>>> On 19/7/26 05:20, Andreas Dilger wrote:
>>>
>>>> On Jul 23, 2019, at 5:17 AM, Joseph Qi <[email protected]> wrote:
>>>>
>>>> Hi Ted & Jan,
>>>> Could you please give your valuable comments?
>>>
>>> It seems like the original patches should be reverted? There is no data
>>
>> From my test result, yes.
>> I've also tested libaio with iodepth 16, it behaves the same. Here is the test
>> data for libaio 4k randrw:
>>
>> -------------------------------------------------------------------------------------------
>> w/ parallel dio reads | READ 78313KB/s, 19578, 1698.70us | WRITE 78313KB/s, 19578, 4837.60us
>> -------------------------------------------------------------------------------------------
>> w/o parallel dio reads| READ 387774KB/s, 96943, 1009.73us | WRITE 387656KB/s,96914, 308.87us
>> -------------------------------------------------------------------------------------------
>>
>> Since this commit went into upstream long time ago,to be precise, Linux
>> 4.9, I wonder if someone else has also observed this regression, or
>> anything I missed?
>>
>> Thanks,
>> Joseph
>>
>>> in the original commit message that indicates there is an actual performance
>>> improvement from that patch, but there is data here showing it hurts both
>>> read and write performance quite significantly.
>>>> Cheers, Andreas
>>>
>>>>> On 19/7/19 17:22, Joseph Qi wrote:
>>>>> Hi Ted & Jan,
>>>>> I've observed an significant performance regression with the following
>>>>> commit in my Intel P3600 NVMe SSD.
>>>>> 16c54688592c ext4: Allow parallel DIO reads
>>>>>
>>>>> From my initial investigation, it may be because of the
>>>>> inode_lock_shared (down_read) consumes more than inode_lock (down_write)
>>>>> in mixed random read write workload.
>>>>>
>>>>> Here is my test result.
>>>>>
>>>>> ioengine=psync
>>>>> direct=1
>>>>> rw=randrw
>>>>> iodepth=1
>>>>> numjobs=8
>>>>> size=20G
>>>>> runtime=600
>>>>>
>>>>> w/ parallel dio reads : kernel 5.2.0
>>>>> w/o parallel dio reads: kernel 5.2.0, then revert the following commits:
>>>>> 1d39834fba99 ext4: remove EXT4_STATE_DIOREAD_LOCK flag (related)
>>>>> e5465795cac4 ext4: fix off-by-one error when writing back pages before dio read (related)
>>>>> 16c54688592c ext4: Allow parallel DIO reads
>>>>>
>>>>> bs=4k:
>>>>> -------------------------------------------------------------------------------------------
>>>>> w/ parallel dio reads | READ 30898KB/s, 7724, 555.00us | WRITE 30875KB/s, 7718, 479.70us
>>>>> -------------------------------------------------------------------------------------------
>>>>> w/o parallel dio reads| READ 117915KB/s, 29478, 248.18us | WRITE 117854KB/s,29463, 21.91us
>>>>> -------------------------------------------------------------------------------------------
>>>>>
>>>>> bs=16k:
>>>>> -------------------------------------------------------------------------------------------
>>>>> w/ parallel dio reads | READ 58961KB/s, 3685, 835.28us | WRITE 58877KB/s, 3679, 1335.98us
>>>>> -------------------------------------------------------------------------------------------
>>>>> w/o parallel dio reads| READ 218409KB/s, 13650, 554.46us | WRITE 218257KB/s,13641, 29.22us
>>>>> -------------------------------------------------------------------------------------------
>>>>>
>>>>> bs=64k:
>>>>> -------------------------------------------------------------------------------------------
>>>>> w/ parallel dio reads | READ 119396KB/s, 1865, 1759.38us | WRITE 119159KB/s, 1861, 2532.26us
>>>>> -------------------------------------------------------------------------------------------
>>>>> w/o parallel dio reads| READ 422815KB/s, 6606, 1146.05us | WRITE 421619KB/s, 6587, 60.72us
>>>>> -------------------------------------------------------------------------------------------
>>>>>
>>>>> bs=512k:
>>>>> -------------------------------------------------------------------------------------------
>>>>> w/ parallel dio reads | READ 392973KB/s, 767, 5046.35us | WRITE 393165KB/s, 767, 5359.86us
>>>>> -------------------------------------------------------------------------------------------
>>>>> w/o parallel dio reads| READ 590266KB/s, 1152, 4312.01us | WRITE 590554KB/s, 1153, 2606.82us
>>>>> -------------------------------------------------------------------------------------------
>>>>>
>>>>> bs=1M:
>>>>> -------------------------------------------------------------------------------------------
>>>>> w/ parallel dio reads | READ 487779KB/s, 476, 8058.55us | WRITE 485592KB/s, 474, 8630.51us
>>>>> -------------------------------------------------------------------------------------------
>>>>> w/o parallel dio reads| READ 593927KB/s, 580, 7623.63us | WRITE 591265KB/s, 577, 6163.42us
>>>>> -------------------------------------------------------------------------------------------
>>>>>
>>>>> Thanks,
>>>>> Joseph
>>>>>
>>>
>>>
>>> Cheers, Andreas
>>>
>>>
>>>
>>>
>>>

2019-07-28 22:56:24

by Dave Chinner

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

On Fri, Jul 26, 2019 at 09:12:07AM +0800, Joseph Qi wrote:
>
>
> On 19/7/26 05:20, Andreas Dilger wrote:
> >
> >> On Jul 23, 2019, at 5:17 AM, Joseph Qi <[email protected]> wrote:
> >>
> >> Hi Ted & Jan,
> >> Could you please give your valuable comments?
> >
> > It seems like the original patches should be reverted? There is no data
>
> From my test result, yes.
> I've also tested libaio with iodepth 16, it behaves the same. Here is the test
> data for libaio 4k randrw:
>
> -------------------------------------------------------------------------------------------
> w/ parallel dio reads | READ 78313KB/s, 19578, 1698.70us | WRITE 78313KB/s, 19578, 4837.60us
> -------------------------------------------------------------------------------------------
> w/o parallel dio reads| READ 387774KB/s, 96943, 1009.73us | WRITE 387656KB/s,96914, 308.87us
> -------------------------------------------------------------------------------------------
>
> Since this commit went into upstream long time ago,to be precise, Linux
> 4.9, I wonder if someone else has also observed this regression, or
> anything I missed?

I suspect that the second part of this set of mods that Jan had
planned to do (on the write side to use shared locking as well)
did not happen and so the DIO writes are serialising the workload.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-07-30 04:12:00

by Joseph Qi

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"



On 19/7/29 06:51, Dave Chinner wrote:
> On Fri, Jul 26, 2019 at 09:12:07AM +0800, Joseph Qi wrote:
>>
>>
>> On 19/7/26 05:20, Andreas Dilger wrote:
>>>
>>>> On Jul 23, 2019, at 5:17 AM, Joseph Qi <[email protected]> wrote:
>>>>
>>>> Hi Ted & Jan,
>>>> Could you please give your valuable comments?
>>>
>>> It seems like the original patches should be reverted? There is no data
>>
>> From my test result, yes.
>> I've also tested libaio with iodepth 16, it behaves the same. Here is the test
>> data for libaio 4k randrw:
>>
>> -------------------------------------------------------------------------------------------
>> w/ parallel dio reads | READ 78313KB/s, 19578, 1698.70us | WRITE 78313KB/s, 19578, 4837.60us
>> -------------------------------------------------------------------------------------------
>> w/o parallel dio reads| READ 387774KB/s, 96943, 1009.73us | WRITE 387656KB/s,96914, 308.87us
>> -------------------------------------------------------------------------------------------
>>
>> Since this commit went into upstream long time ago,to be precise, Linux
>> 4.9, I wonder if someone else has also observed this regression, or
>> anything I missed?
>
> I suspect that the second part of this set of mods that Jan had
> planned to do (on the write side to use shared locking as well)
> did not happen and so the DIO writes are serialising the workload.
>

Thanks for the input, Dave.
Hi Jan, could you please confirm this?
If so, should we revert this commit for now?

Thanks,
Joseph

2019-08-15 15:46:42

by Jan Kara

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

On Tue 30-07-19 09:34:39, Joseph Qi wrote:
> On 19/7/29 06:51, Dave Chinner wrote:
> > On Fri, Jul 26, 2019 at 09:12:07AM +0800, Joseph Qi wrote:
> >>
> >>
> >> On 19/7/26 05:20, Andreas Dilger wrote:
> >>>
> >>>> On Jul 23, 2019, at 5:17 AM, Joseph Qi <[email protected]> wrote:
> >>>>
> >>>> Hi Ted & Jan,
> >>>> Could you please give your valuable comments?
> >>>
> >>> It seems like the original patches should be reverted? There is no data
> >>
> >> From my test result, yes.
> >> I've also tested libaio with iodepth 16, it behaves the same. Here is the test
> >> data for libaio 4k randrw:
> >>
> >> -------------------------------------------------------------------------------------------
> >> w/ parallel dio reads | READ 78313KB/s, 19578, 1698.70us | WRITE 78313KB/s, 19578, 4837.60us
> >> -------------------------------------------------------------------------------------------
> >> w/o parallel dio reads| READ 387774KB/s, 96943, 1009.73us | WRITE 387656KB/s,96914, 308.87us
> >> -------------------------------------------------------------------------------------------
> >>
> >> Since this commit went into upstream long time ago,to be precise, Linux
> >> 4.9, I wonder if someone else has also observed this regression, or
> >> anything I missed?
> >
> > I suspect that the second part of this set of mods that Jan had
> > planned to do (on the write side to use shared locking as well)
> > did not happen and so the DIO writes are serialising the workload.
> >
>
> Thanks for the inputs, Dave.
> Hi Jan, Could you please confirm this?
> If so, should we revert this commit at present?

Sorry for getting back to you only now. I was on vacation and then catching
up with various stuff. I suppose you are not using the dioread_nolock mount
option, are you? Can you check what your results are with that mount
option?
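
(Something along these lines, with the device and mount point being
placeholders for your test setup:

  mount -t ext4 -o dioread_nolock /dev/nvme0n1p1 /mnt/test

and then rerun the same fio job against that mount.)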

I have a hard time remembering what I was thinking a couple of years back,
but I think the plan was to switch to dioread_nolock always. Somehow I
didn't finish that, and now I forget where I got stuck, because I don't see
any problem with it currently.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-08-16 13:24:27

by Joseph Qi

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

Hi Jan,
Thanks for your reply.

On 19/8/15 23:13, Jan Kara wrote:
> On Tue 30-07-19 09:34:39, Joseph Qi wrote:
>> On 19/7/29 06:51, Dave Chinner wrote:
>>> On Fri, Jul 26, 2019 at 09:12:07AM +0800, Joseph Qi wrote:
>>>>
>>>>
>>>> On 19/7/26 05:20, Andreas Dilger wrote:
>>>>>
>>>>>> On Jul 23, 2019, at 5:17 AM, Joseph Qi <[email protected]> wrote:
>>>>>>
>>>>>> Hi Ted & Jan,
>>>>>> Could you please give your valuable comments?
>>>>>
>>>>> It seems like the original patches should be reverted? There is no data
>>>>
>>>> From my test result, yes.
>>>> I've also tested libaio with iodepth 16, it behaves the same. Here is the test
>>>> data for libaio 4k randrw:
>>>>
>>>> -------------------------------------------------------------------------------------------
>>>> w/ parallel dio reads | READ 78313KB/s, 19578, 1698.70us | WRITE 78313KB/s, 19578, 4837.60us
>>>> -------------------------------------------------------------------------------------------
>>>> w/o parallel dio reads| READ 387774KB/s, 96943, 1009.73us | WRITE 387656KB/s,96914, 308.87us
>>>> -------------------------------------------------------------------------------------------
>>>>
>>>> Since this commit went into upstream long time ago,to be precise, Linux
>>>> 4.9, I wonder if someone else has also observed this regression, or
>>>> anything I missed?
>>>
>>> I suspect that the second part of this set of mods that Jan had
>>> planned to do (on the write side to use shared locking as well)
>>> did not happen and so the DIO writes are serialising the workload.
>>>
>>
>> Thanks for the inputs, Dave.
>> Hi Jan, Could you please confirm this?
>> If so, should we revert this commit at present?
>
> Sorry for getting to you only now. I was on vacation and then catching up
> with various stuff. I suppose you are not using dioread_nolock mount
> option, are you? Can you check what are your results with that mount
> option?
>
Yes, I've just used the default mount options when testing. And there is
indeed a performance improvement with dioread_nolock after reverting the 3
related commits.
I'll do a supplementary test with parallel dio reads as well as
dioread_nolock and send out the test result.

> I have hard time remembering what I was thinking those couple years back
> but I think the plan was to switch to dioread_nolock always but somehow I
> didn't finish that and now I forgot where I got stuck because I don't see
> any problem with that currently.
Do you mean making dioread_nolock the default?

Thanks,
Joseph
>
> Honza
>

2019-08-16 14:57:46

by Jan Kara

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

On Fri 16-08-19 21:23:24, Joseph Qi wrote:
> On 19/8/15 23:13, Jan Kara wrote:
> > On Tue 30-07-19 09:34:39, Joseph Qi wrote:
> >> On 19/7/29 06:51, Dave Chinner wrote:
> >>> On Fri, Jul 26, 2019 at 09:12:07AM +0800, Joseph Qi wrote:
> >>>>
> >>>>
> >>>> On 19/7/26 05:20, Andreas Dilger wrote:
> >>>>>
> >>>>>> On Jul 23, 2019, at 5:17 AM, Joseph Qi <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi Ted & Jan,
> >>>>>> Could you please give your valuable comments?
> >>>>>
> >>>>> It seems like the original patches should be reverted? There is no data
> >>>>
> >>>> From my test result, yes.
> >>>> I've also tested libaio with iodepth 16, it behaves the same. Here is the test
> >>>> data for libaio 4k randrw:
> >>>>
> >>>> -------------------------------------------------------------------------------------------
> >>>> w/ parallel dio reads | READ 78313KB/s, 19578, 1698.70us | WRITE 78313KB/s, 19578, 4837.60us
> >>>> -------------------------------------------------------------------------------------------
> >>>> w/o parallel dio reads| READ 387774KB/s, 96943, 1009.73us | WRITE 387656KB/s,96914, 308.87us
> >>>> -------------------------------------------------------------------------------------------
> >>>>
> >>>> Since this commit went into upstream long time ago,to be precise, Linux
> >>>> 4.9, I wonder if someone else has also observed this regression, or
> >>>> anything I missed?
> >>>
> >>> I suspect that the second part of this set of mods that Jan had
> >>> planned to do (on the write side to use shared locking as well)
> >>> did not happen and so the DIO writes are serialising the workload.
> >>>
> >>
> >> Thanks for the inputs, Dave.
> >> Hi Jan, Could you please confirm this?
> >> If so, should we revert this commit at present?
> >
> > Sorry for getting to you only now. I was on vacation and then catching up
> > with various stuff. I suppose you are not using dioread_nolock mount
> > option, are you? Can you check what are your results with that mount
> > option?
> >
> Yes, I've just used default mount options when testing. And it is indeed
> that there is performance improvement with dioread_nolock after reverting
> the 3 related commits.
> I'll do a supplementary test with parallel dio reads as well as
> dioread_nolock and send out the test result.
>
> > I have hard time remembering what I was thinking those couple years back
> > but I think the plan was to switch to dioread_nolock always but somehow I
> > didn't finish that and now I forgot where I got stuck because I don't see
> > any problem with that currently.
> Do you mean mark dioread_nolock as default?

Yes, in fact I'd like to just remove the other path so that we can remove
this confusing mount option.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-08-20 03:01:27

by Joseph Qi

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

Hi Jan,

On 19/8/16 22:57, Jan Kara wrote:
> On Fri 16-08-19 21:23:24, Joseph Qi wrote:
>> On 19/8/15 23:13, Jan Kara wrote:
>>> On Tue 30-07-19 09:34:39, Joseph Qi wrote:
>>>> On 19/7/29 06:51, Dave Chinner wrote:
>>>>> On Fri, Jul 26, 2019 at 09:12:07AM +0800, Joseph Qi wrote:
>>>>>>
>>>>>>
>>>>>> On 19/7/26 05:20, Andreas Dilger wrote:
>>>>>>>
>>>>>>>> On Jul 23, 2019, at 5:17 AM, Joseph Qi <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Ted & Jan,
>>>>>>>> Could you please give your valuable comments?
>>>>>>>
>>>>>>> It seems like the original patches should be reverted? There is no data
>>>>>>
>>>>>> From my test result, yes.
>>>>>> I've also tested libaio with iodepth 16, it behaves the same. Here is the test
>>>>>> data for libaio 4k randrw:
>>>>>>
>>>>>> -------------------------------------------------------------------------------------------
>>>>>> w/ parallel dio reads | READ 78313KB/s, 19578, 1698.70us | WRITE 78313KB/s, 19578, 4837.60us
>>>>>> -------------------------------------------------------------------------------------------
>>>>>> w/o parallel dio reads| READ 387774KB/s, 96943, 1009.73us | WRITE 387656KB/s,96914, 308.87us
>>>>>> -------------------------------------------------------------------------------------------
>>>>>>
>>>>>> Since this commit went into upstream long time ago,to be precise, Linux
>>>>>> 4.9, I wonder if someone else has also observed this regression, or
>>>>>> anything I missed?
>>>>>
>>>>> I suspect that the second part of this set of mods that Jan had
>>>>> planned to do (on the write side to use shared locking as well)
>>>>> did not happen and so the DIO writes are serialising the workload.
>>>>>
>>>>
>>>> Thanks for the inputs, Dave.
>>>> Hi Jan, Could you please confirm this?
>>>> If so, should we revert this commit at present?
>>>
>>> Sorry for getting to you only now. I was on vacation and then catching up
>>> with various stuff. I suppose you are not using dioread_nolock mount
>>> option, are you? Can you check what are your results with that mount
>>> option?
>>>
>> Yes, I've just used default mount options when testing. And it is indeed
>> that there is performance improvement with dioread_nolock after reverting
>> the 3 related commits.
>> I'll do a supplementary test with parallel dio reads as well as
>> dioread_nolock and send out the test result.

I've tested parallel dio reads with dioread_nolock; it doesn't show a
significant performance improvement and is still poor compared with reverting
parallel dio reads. IMO, this is because with parallel dio reads, the inode
shared lock is taken at the very beginning in ext4_direct_IO_read().

Thanks,
Joseph

2019-08-20 16:08:50

by Theodore Ts'o

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
>
> I've tested parallel dio reads with dioread_nolock, it doesn't have
> significant performance improvement and still poor compared with reverting
> parallel dio reads. IMO, this is because with parallel dio reads, it take
> inode shared lock at the very beginning in ext4_direct_IO_read().

Why is that a problem? It's a shared lock, so parallel threads should
be able to issue reads without getting serialized?

Are you using sufficiently fast storage devices that you're worried
about cache line bouncing of the shared lock? Or do you have some
other concern, such as some other thread taking an exclusive lock?

- Ted

2019-08-21 01:20:36

by Joseph Qi

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

Hi Ted,

On 19/8/21 00:08, Theodore Y. Ts'o wrote:
> On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
>>
>> I've tested parallel dio reads with dioread_nolock, it doesn't have
>> significant performance improvement and still poor compared with reverting
>> parallel dio reads. IMO, this is because with parallel dio reads, it take
>> inode shared lock at the very beginning in ext4_direct_IO_read().
>
> Why is that a problem? It's a shared lock, so parallel threads should
> be able to issue reads without getting serialized?
>
The above just reports the result that even when mounting with dioread_nolock,
parallel dio reads still performs worse than before (w/o parallel
dio reads).

> Are you using sufficiently fast storage devices that you're worried
> about cache line bouncing of the shared lock? Or do you have some
> other concern, such as some other thread taking an exclusive lock?
>
The test case is the mixed random read/write workload described in my first
mail. And from my preliminary investigation, the shared lock costs more in
such a scenario.

Thanks,
Joseph

2019-08-21 03:36:31

by Theodore Ts'o

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
> > On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
> >>
> >> I've tested parallel dio reads with dioread_nolock, it doesn't have
> >> significant performance improvement and still poor compared with reverting
> >> parallel dio reads. IMO, this is because with parallel dio reads, it take
> >> inode shared lock at the very beginning in ext4_direct_IO_read().
> >
> > Why is that a problem? It's a shared lock, so parallel threads should
> > be able to issue reads without getting serialized?
> >
> The above just tells the result that even mounting with dioread_nolock,
> parallel dio reads still has poor performance than before (w/o parallel
> dio reads).

Right, but you were asserting that the performance hit was *because* of
the shared lock. I'm asking what led you to that opinion.
The fact that parallel dio reads is slower doesn't necessarily mean that it
was because of that particular shared lock. It could be due to any
number of other things. Have you looked at /proc/lock_stat (enabled
via CONFIG_LOCK_STAT) to see where the locking bottlenecks might be?
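
Roughly, with CONFIG_LOCK_STAT built in, a sequence like the following
around the fio run should show the contended lock classes (this is the
standard lockstat interface; exact output format varies by kernel):

  echo 1 > /proc/sys/kernel/lock_stat   # make sure collection is enabled
  echo 0 > /proc/lock_stat              # zero the counters before the run
  # ... run the fio workload here ...
  cat /proc/lock_stat                   # inspect contention counts and wait times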

- Ted

2019-08-22 07:02:46

by Dave Chinner

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
> Hi Ted,
>
> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
> > On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
> >>
> >> I've tested parallel dio reads with dioread_nolock, it doesn't have
> >> significant performance improvement and still poor compared with reverting
> >> parallel dio reads. IMO, this is because with parallel dio reads, it take
> >> inode shared lock at the very beginning in ext4_direct_IO_read().
> >
> > Why is that a problem? It's a shared lock, so parallel threads should
> > be able to issue reads without getting serialized?
> >
> The above just tells the result that even mounting with dioread_nolock,
> parallel dio reads still has poor performance than before (w/o parallel
> dio reads).
>
> > Are you using sufficiently fast storage devices that you're worried
> > about cache line bouncing of the shared lock? Or do you have some
> > other concern, such as some other thread taking an exclusive lock?
> >
> The test case is random read/write described in my first mail. And

Regardless of dioread_nolock, ext4_direct_IO_read() is taking
inode_lock_shared() across the direct IO call. And writes in ext4
_always_ take the inode_lock() in ext4_file_write_iter(), even
though it gets dropped quite early when overwrite && dioread_nolock
is set. But just taking the lock exclusively in the write path for a short
while is enough to kill all shared locking concurrency...
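
To make that asymmetry concrete, here is a self-contained userspace
sketch -- an analogy using pthread_rwlock_t rather than the kernel rwsem
or any actual ext4 code, with arbitrary thread counts and delays:

/*
 * 8 "DIO readers" take the lock shared, 1 "DIO writer" takes it
 * exclusive; while the writer holds it, none of the readers can
 * make progress.  Build with: cc -O2 -pthread rwsem_demo.c
 */
#include <pthread.h>
#include <unistd.h>

static pthread_rwlock_t i_rwsem = PTHREAD_RWLOCK_INITIALIZER;

static void *dio_read(void *arg)
{
	for (int i = 0; i < 1000; i++) {
		pthread_rwlock_rdlock(&i_rwsem);	/* ~ inode_lock_shared() */
		usleep(100);				/* stand-in for the read I/O */
		pthread_rwlock_unlock(&i_rwsem);	/* ~ inode_unlock_shared() */
	}
	return NULL;
}

static void *dio_write(void *arg)
{
	for (int i = 0; i < 1000; i++) {
		pthread_rwlock_wrlock(&i_rwsem);	/* ~ inode_lock() */
		usleep(100);				/* stand-in for the write I/O */
		pthread_rwlock_unlock(&i_rwsem);	/* ~ inode_unlock() */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[9];
	int i;

	for (i = 0; i < 8; i++)
		pthread_create(&tid[i], NULL, dio_read, NULL);
	pthread_create(&tid[8], NULL, dio_write, NULL);
	for (i = 0; i < 9; i++)
		pthread_join(tid[i], NULL);
	return 0;
}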

> from my preliminary investigation, shared lock consumes more in such
> scenario.

If the write lock is also shared, then there should not be a
scalability issue. The shared dio locking is only half-done in ext4,
so perhaps comparing your workload against XFS would be an
informative exercise...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-08-22 07:08:14

by Joseph Qi

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"



On 19/8/21 11:34, Theodore Y. Ts'o wrote:
> On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
>> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
>>> On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
>>>>
>>>> I've tested parallel dio reads with dioread_nolock, it doesn't have
>>>> significant performance improvement and still poor compared with reverting
>>>> parallel dio reads. IMO, this is because with parallel dio reads, it take
>>>> inode shared lock at the very beginning in ext4_direct_IO_read().
>>>
>>> Why is that a problem? It's a shared lock, so parallel threads should
>>> be able to issue reads without getting serialized?
>>>
>> The above just tells the result that even mounting with dioread_nolock,
>> parallel dio reads still has poor performance than before (w/o parallel
>> dio reads).
>
> Right, but you were asserting that performance hit was *because* of
> the shared lock. I'm asking what leading you to have that opinion.
> The fact that parallel dioread reads doesn't necessarily say that it
> was because of that particular shared lock. It could be due to any
> number of other things. Have you looked at /proc/lock_stat (enabeld
> via CONFIG_LOCK_STAT) to see where the locking bottlenecks might be?
>
I've enabled CONFIG_LOCK_STAT and CONFIG_DEBUG_RWSEMS, but I don't see
any statistics for i_rwsem. Am I missing something?

Thanks,
Joseph

2019-08-23 10:32:07

by Joseph Qi

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

Hi Dave,

On 19/8/22 13:40, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
>> Hi Ted,
>>
>> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
>>> On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
>>>>
>>>> I've tested parallel dio reads with dioread_nolock, it doesn't have
>>>> significant performance improvement and still poor compared with reverting
>>>> parallel dio reads. IMO, this is because with parallel dio reads, it take
>>>> inode shared lock at the very beginning in ext4_direct_IO_read().
>>>
>>> Why is that a problem? It's a shared lock, so parallel threads should
>>> be able to issue reads without getting serialized?
>>>
>> The above just tells the result that even mounting with dioread_nolock,
>> parallel dio reads still has poor performance than before (w/o parallel
>> dio reads).
>>
>>> Are you using sufficiently fast storage devices that you're worried
>>> about cache line bouncing of the shared lock? Or do you have some
>>> other concern, such as some other thread taking an exclusive lock?
>>>
>> The test case is random read/write described in my first mail. And
>
> Regardless of dioread_nolock, ext4_direct_IO_read() is taking
> inode_lock_shared() across the direct IO call. And writes in ext4
> _always_ take the inode_lock() in ext4_file_write_iter(), even
> though it gets dropped quite early when overwrite && dioread_nolock
> is set. But just taking the lock exclusively in write fro a short
> while is enough to kill all shared locking concurrency...
>
>> from my preliminary investigation, shared lock consumes more in such
>> scenario.
>
> If the write lock is also shared, then there should not be a
> scalability issue. The shared dio locking is only half-done in ext4,
> so perhaps comparing your workload against XFS would be an
> informative exercise...

I've done the same test workload on xfs, it behaves the same as ext4
after reverting parallel dio reads and mounting with dioread_lock.
Here is the test result:
psync, randrw, direct=1, numofjobs=8

4k:
-----------------------------------------
ext4 | READ 123450KB/s | WRITE 123368KB/s
-----------------------------------------
xfs | READ 123848KB/s | WRITE 123761KB/s
-----------------------------------------

16k:
-----------------------------------------
ext4 | READ 222477KB/s | WRITE 222322KB/s
-----------------------------------------
xfs | READ 223261KB/s | WRITE 223106KB/s
-----------------------------------------

64k:
-----------------------------------------
ext4 | READ 427406KB/s | WRITE 426197KB/s
-----------------------------------------
xfs | READ 403697KB/s | WRITE 402555KB/s
-----------------------------------------

512k:
-----------------------------------------
ext4 | READ 618752KB/s | WRITE 619054KB/s
-----------------------------------------
xfs | READ 614954KB/s | WRITE 615254KB/s
-----------------------------------------

1M:
-----------------------------------------
ext4 | READ 615011KB/s | WRITE 612255KB/s
-----------------------------------------
xfs | READ 624087KB/s | WRITE 621290KB/s
-----------------------------------------

2019-08-23 10:34:12

by Joseph Qi

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"



On 19/8/23 15:57, Joseph Qi wrote:
> Hi Dave,
>
> On 19/8/22 13:40, Dave Chinner wrote:
>> On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
>>> Hi Ted,
>>>
>>> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
>>>> On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
>>>>>
>>>>> I've tested parallel dio reads with dioread_nolock, it doesn't have
>>>>> significant performance improvement and still poor compared with reverting
>>>>> parallel dio reads. IMO, this is because with parallel dio reads, it take
>>>>> inode shared lock at the very beginning in ext4_direct_IO_read().
>>>>
>>>> Why is that a problem? It's a shared lock, so parallel threads should
>>>> be able to issue reads without getting serialized?
>>>>
>>> The above just tells the result that even mounting with dioread_nolock,
>>> parallel dio reads still has poor performance than before (w/o parallel
>>> dio reads).
>>>
>>>> Are you using sufficiently fast storage devices that you're worried
>>>> about cache line bouncing of the shared lock? Or do you have some
>>>> other concern, such as some other thread taking an exclusive lock?
>>>>
>>> The test case is random read/write described in my first mail. And
>>
>> Regardless of dioread_nolock, ext4_direct_IO_read() is taking
>> inode_lock_shared() across the direct IO call. And writes in ext4
>> _always_ take the inode_lock() in ext4_file_write_iter(), even
>> though it gets dropped quite early when overwrite && dioread_nolock
>> is set. But just taking the lock exclusively in write fro a short
>> while is enough to kill all shared locking concurrency...
>>
>>> from my preliminary investigation, shared lock consumes more in such
>>> scenario.
>>
>> If the write lock is also shared, then there should not be a
>> scalability issue. The shared dio locking is only half-done in ext4,
>> so perhaps comparing your workload against XFS would be an
>> informative exercise...
>
> I've done the same test workload on xfs, it behaves the same as ext4
> after reverting parallel dio reads and mounting with dioread_lock.
A typo here, s/dioread_lock/dioread_nolock/

> Here is the test result:
> psync, randrw, direct=1, numofjobs=8
>
> 4k:
> -----------------------------------------
> ext4 | READ 123450KB/s | WRITE 123368KB/s
> -----------------------------------------
> xfs | READ 123848KB/s | WRITE 123761KB/s
> -----------------------------------------
>
> 16k:
> -----------------------------------------
> ext4 | READ 222477KB/s | WRITE 222322KB/s
> -----------------------------------------
> xfs | READ 223261KB/s | WRITE 223106KB/s
> -----------------------------------------
>
> 64k:
> -----------------------------------------
> ext4 | READ 427406KB/s | WRITE 426197KB/s
> -----------------------------------------
> xfs | READ 403697KB/s | WRITE 402555KB/s
> -----------------------------------------
>
> 512k:
> -----------------------------------------
> ext4 | READ 618752KB/s | WRITE 619054KB/s
> -----------------------------------------
> xfs | READ 614954KB/s | WRITE 615254KB/s
> -----------------------------------------
>
> 1M:
> -----------------------------------------
> ext4 | READ 615011KB/s | WRITE 612255KB/s
> -----------------------------------------
> xfs | READ 624087KB/s | WRITE 621290KB/s
> -----------------------------------------
>

2019-08-23 22:58:44

by Dave Chinner

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

On Fri, Aug 23, 2019 at 03:57:02PM +0800, Joseph Qi wrote:
> Hi Dave,
>
> On 19/8/22 13:40, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
> >> Hi Ted,
> >>
> >> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
> >>> On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
> >>>>
> >>>> I've tested parallel dio reads with dioread_nolock, it doesn't have
> >>>> significant performance improvement and still poor compared with reverting
> >>>> parallel dio reads. IMO, this is because with parallel dio reads, it take
> >>>> inode shared lock at the very beginning in ext4_direct_IO_read().
> >>>
> >>> Why is that a problem? It's a shared lock, so parallel threads should
> >>> be able to issue reads without getting serialized?
> >>>
> >> The above just tells the result that even mounting with dioread_nolock,
> >> parallel dio reads still has poor performance than before (w/o parallel
> >> dio reads).
> >>
> >>> Are you using sufficiently fast storage devices that you're worried
> >>> about cache line bouncing of the shared lock? Or do you have some
> >>> other concern, such as some other thread taking an exclusive lock?
> >>>
> >> The test case is random read/write described in my first mail. And
> >
> > Regardless of dioread_nolock, ext4_direct_IO_read() is taking
> > inode_lock_shared() across the direct IO call. And writes in ext4
> > _always_ take the inode_lock() in ext4_file_write_iter(), even
> > though it gets dropped quite early when overwrite && dioread_nolock
> > is set. But just taking the lock exclusively in write fro a short
> > while is enough to kill all shared locking concurrency...
> >
> >> from my preliminary investigation, shared lock consumes more in such
> >> scenario.
> >
> > If the write lock is also shared, then there should not be a
> > scalability issue. The shared dio locking is only half-done in ext4,
> > so perhaps comparing your workload against XFS would be an
> > informative exercise...
>
> I've done the same test workload on xfs, it behaves the same as ext4
> after reverting parallel dio reads and mounting with dioread_lock.

Ok, so the problem is not shared locking scalability ('cause that's
what XFS does and it scaled fine), the problem is almost certainly
that ext4 is using exclusive locking during writes...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-08-23 23:21:05

by Joseph Qi

Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"



On 19/8/23 18:16, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 03:57:02PM +0800, Joseph Qi wrote:
>> Hi Dave,
>>
>> On 19/8/22 13:40, Dave Chinner wrote:
>>> On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
>>>> Hi Ted,
>>>>
>>>> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
>>>>> On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
>>>>>>
>>>>>> I've tested parallel dio reads with dioread_nolock, it doesn't have
>>>>>> significant performance improvement and still poor compared with reverting
>>>>>> parallel dio reads. IMO, this is because with parallel dio reads, it take
>>>>>> inode shared lock at the very beginning in ext4_direct_IO_read().
>>>>>
>>>>> Why is that a problem? It's a shared lock, so parallel threads should
>>>>> be able to issue reads without getting serialized?
>>>>>
>>>> The above just tells the result that even mounting with dioread_nolock,
>>>> parallel dio reads still has poor performance than before (w/o parallel
>>>> dio reads).
>>>>
>>>>> Are you using sufficiently fast storage devices that you're worried
>>>>> about cache line bouncing of the shared lock? Or do you have some
>>>>> other concern, such as some other thread taking an exclusive lock?
>>>>>
>>>> The test case is random read/write described in my first mail. And
>>>
>>> Regardless of dioread_nolock, ext4_direct_IO_read() is taking
>>> inode_lock_shared() across the direct IO call. And writes in ext4
>>> _always_ take the inode_lock() in ext4_file_write_iter(), even
>>> though it gets dropped quite early when overwrite && dioread_nolock
>>> is set. But just taking the lock exclusively in write fro a short
>>> while is enough to kill all shared locking concurrency...
>>>
>>>> from my preliminary investigation, shared lock consumes more in such
>>>> scenario.
>>>
>>> If the write lock is also shared, then there should not be a
>>> scalability issue. The shared dio locking is only half-done in ext4,
>>> so perhaps comparing your workload against XFS would be an
>>> informative exercise...
>>
>> I've done the same test workload on xfs, it behaves the same as ext4
>> after reverting parallel dio reads and mounting with dioread_lock.
>
> Ok, so the problem is not shared locking scalability ('cause that's
> what XFS does and it scaled fine), the problem is almost certainly
> that ext4 is using exclusive locking during writes...
>

Agree. Maybe I've misled you in my previous mails. I meant that the
shared lock makes things worse in the case of mixed random read/write,
since we always take the exclusive inode lock during write. It also
conflicts with dioread_nolock: previously, reads with dioread_nolock
took no inode lock at all, but now they always take a shared lock.

Thanks,
Joseph

2019-08-24 02:20:34

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

On Fri, Aug 23, 2019 at 09:08:53PM +0800, Joseph Qi wrote:
>
>
> On 19/8/23 18:16, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 03:57:02PM +0800, Joseph Qi wrote:
> >> Hi Dave,
> >>
> >> On 19/8/22 13:40, Dave Chinner wrote:
> >>> On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
> >>>> Hi Ted,
> >>>>
> >>>> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
> >>>>> On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
> >>>>>>
> >>>>>> I've tested parallel dio reads with dioread_nolock, it
> >>>>>> doesn't have significant performance improvement and still
> >>>>>> poor compared with reverting parallel dio reads. IMO, this
> >>>>>> is because with parallel dio reads, it take inode shared
> >>>>>> lock at the very beginning in ext4_direct_IO_read().
> >>>>>
> >>>>> Why is that a problem? It's a shared lock, so parallel
> >>>>> threads should be able to issue reads without getting
> >>>>> serialized?
> >>>>>
> >>>> The above just tells the result that even mounting with
> >>>> dioread_nolock, parallel dio reads still has poor performance
> >>>> than before (w/o parallel dio reads).
> >>>>
> >>>>> Are you using sufficiently fast storage devices that you're
> >>>>> worried about cache line bouncing of the shared lock? Or do
> >>>>> you have some other concern, such as some other thread
> >>>>> taking an exclusive lock?
> >>>>>
> >>>> The test case is random read/write described in my first
> >>>> mail. And
> >>>
> >>> Regardless of dioread_nolock, ext4_direct_IO_read() is taking
> >>> inode_lock_shared() across the direct IO call. And writes in
> >>> ext4 _always_ take the inode_lock() in ext4_file_write_iter(),
> >>> even though it gets dropped quite early when overwrite &&
> >>> dioread_nolock is set. But just taking the lock exclusively
> >>> in write fro a short while is enough to kill all shared
> >>> locking concurrency...
> >>>
> >>>> from my preliminary investigation, shared lock consumes more
> >>>> in such scenario.
> >>>
> >>> If the write lock is also shared, then there should not be a
> >>> scalability issue. The shared dio locking is only half-done in
> >>> ext4, so perhaps comparing your workload against XFS would be
> >>> an informative exercise...
> >>
> >> I've done the same test workload on xfs, it behaves the same as
> >> ext4 after reverting parallel dio reads and mounting with
> >> dioread_lock.
> >
> > Ok, so the problem is not shared locking scalability ('cause
> > that's what XFS does and it scaled fine), the problem is almost
> > certainly that ext4 is using exclusive locking during
> > writes...
> >
>
> Agree. Maybe I've misled you in my previous mails.I meant shared
> lock makes worse in case of mixed random read/write, since we
> would always take inode lock during write. And it also conflicts
> with dioread_nolock. It won't take any inode lock before with
> dioread_nolock during read, but now it always takes a shared
> lock.

No, you didn't mislead me. IIUC, the shared locking was added to the
direct IO read path so that it can't run concurrently with
operations like hole punch that free the blocks the dio read might
currently be operating on (use after free).

i.e. the shared locking fixes an actual bug, but the performance
regression is a result of only partially converting the direct IO
path to use shared locking. Only half the job was done from a
performance perspective. Seems to me that the two options here to
fix the performance regression are to either finish the shared
locking conversion, or remove the shared locking on read and re-open
a potential data exposure issue...
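
The race being closed can be modelled in userspace as well (again a toy
sketch, not ext4 code): a heap buffer stands in for the file's blocks,
the "hole punch" thread frees and reallocates it under the exclusive
lock, and readers touch it under the shared lock. Remove the
rdlock/unlock pair from dio_read() and the reader can dereference freed
memory, which is the use-after-free / stale data problem above.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_rwlock_t inode_rwsem = PTHREAD_RWLOCK_INITIALIZER;
static char *blocks;			/* stands in for the file's blocks */

static void *dio_read(void *arg)
{
	char buf[64];

	(void)arg;
	for (int i = 0; i < 10000; i++) {
		pthread_rwlock_rdlock(&inode_rwsem);	/* inode_lock_shared() */
		if (blocks)
			memcpy(buf, blocks, sizeof(buf));
		pthread_rwlock_unlock(&inode_rwsem);
	}
	return NULL;
}

static void *punch_hole(void *arg)
{
	(void)arg;
	for (int i = 0; i < 10000; i++) {
		pthread_rwlock_wrlock(&inode_rwsem);	/* inode_lock() */
		free(blocks);				/* "free the blocks" */
		blocks = calloc(1, 4096);		/* "reallocate" them */
		pthread_rwlock_unlock(&inode_rwsem);
	}
	return NULL;
}

int main(void)
{
	pthread_t r, p;

	blocks = calloc(1, 4096);
	pthread_create(&r, NULL, dio_read, NULL);
	pthread_create(&p, NULL, punch_hole, NULL);
	pthread_join(r, NULL);
	pthread_join(p, NULL);
	return 0;
}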

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-08-26 08:50:53

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

On Sat 24-08-19 12:18:40, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 09:08:53PM +0800, Joseph Qi wrote:
> >
> >
> > On 19/8/23 18:16, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 03:57:02PM +0800, Joseph Qi wrote:
> > >> Hi Dave,
> > >>
> > >> On 19/8/22 13:40, Dave Chinner wrote:
> > >>> On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
> > >>>> Hi Ted,
> > >>>>
> > >>>> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
> > >>>>> On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
> > >>>>>>
> > >>>>>> I've tested parallel dio reads with dioread_nolock, it
> > >>>>>> doesn't have significant performance improvement and still
> > >>>>>> poor compared with reverting parallel dio reads. IMO, this
> > >>>>>> is because with parallel dio reads, it take inode shared
> > >>>>>> lock at the very beginning in ext4_direct_IO_read().
> > >>>>>
> > >>>>> Why is that a problem? It's a shared lock, so parallel
> > >>>>> threads should be able to issue reads without getting
> > >>>>> serialized?
> > >>>>>
> > >>>> The above just tells the result that even mounting with
> > >>>> dioread_nolock, parallel dio reads still has poor performance
> > >>>> than before (w/o parallel dio reads).
> > >>>>
> > >>>>> Are you using sufficiently fast storage devices that you're
> > >>>>> worried about cache line bouncing of the shared lock? Or do
> > >>>>> you have some other concern, such as some other thread
> > >>>>> taking an exclusive lock?
> > >>>>>
> > >>>> The test case is random read/write described in my first
> > >>>> mail. And
> > >>>
> > >>> Regardless of dioread_nolock, ext4_direct_IO_read() is taking
> > >>> inode_lock_shared() across the direct IO call. And writes in
> > >>> ext4 _always_ take the inode_lock() in ext4_file_write_iter(),
> > >>> even though it gets dropped quite early when overwrite &&
> > >>> dioread_nolock is set. But just taking the lock exclusively
> > >>> in write fro a short while is enough to kill all shared
> > >>> locking concurrency...
> > >>>
> > >>>> from my preliminary investigation, shared lock consumes more
> > >>>> in such scenario.
> > >>>
> > >>> If the write lock is also shared, then there should not be a
> > >>> scalability issue. The shared dio locking is only half-done in
> > >>> ext4, so perhaps comparing your workload against XFS would be
> > >>> an informative exercise...
> > >>
> > >> I've done the same test workload on xfs, it behaves the same as
> > >> ext4 after reverting parallel dio reads and mounting with
> > >> dioread_lock.
> > >
> > > Ok, so the problem is not shared locking scalability ('cause
> > > that's what XFS does and it scaled fine), the problem is almost
> > > certainly that ext4 is using exclusive locking during
> > > writes...
> > >
> >
> > Agree. Maybe I've misled you in my previous mails.I meant shared
> > lock makes worse in case of mixed random read/write, since we
> > would always take inode lock during write. And it also conflicts
> > with dioread_nolock. It won't take any inode lock before with
> > dioread_nolock during read, but now it always takes a shared
> > lock.
>
> No, you didn't mislead me. IIUC, the shared locking was added to the
> direct IO read path so that it can't run concurrently with
> operations like hole punch that free the blocks the dio read might
> currently be operating on (use after free).
>
> i.e. the shared locking fixes an actual bug, but the performance
> regression is a result of only partially converting the direct IO
> path to use shared locking. Only half the job was done from a
> performance perspective. Seems to me that the two options here to
> fix the performance regression are to either finish the shared
> locking conversion, or remove the shared locking on read and re-open
> a potential data exposure issue...

We actually had a separate locking mechanism in the ext4 code to avoid
stale data exposure during hole punch while unlocked DIO reads were
running, but it was kind of ugly and made things complex. I agree we
need to take the ext4 DIO path conversion further so that we avoid
taking the exclusive lock when we don't actually need it.
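
One possible shape of that conversion, sketched in userspace with
invented names (the actual condition ext4 would end up using is not
settled here): take the lock shared for plain overwrites of
already-written blocks and fall back to exclusive only when the write
needs allocation or extent conversion.

#include <pthread.h>
#include <stdbool.h>

static pthread_rwlock_t inode_rwsem = PTHREAD_RWLOCK_INITIALIZER;

/* Hypothetical predicate, purely for illustration. */
static bool write_needs_exclusive(bool allocating, bool unaligned)
{
	return allocating || unaligned;
}

static void dio_write(bool allocating, bool unaligned)
{
	if (write_needs_exclusive(allocating, unaligned))
		pthread_rwlock_wrlock(&inode_rwsem);	/* inode_lock() */
	else
		pthread_rwlock_rdlock(&inode_rwsem);	/* inode_lock_shared() */

	/* submit the direct IO write */

	pthread_rwlock_unlock(&inode_rwsem);
}

int main(void)
{
	dio_write(false, false);	/* pure overwrite: shared lock */
	dio_write(true, false);		/* allocating write: exclusive lock */
	return 0;
}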

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-08-26 20:44:48

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

On Aug 26, 2019, at 2:39 AM, Jan Kara <[email protected]> wrote:
>
> On Sat 24-08-19 12:18:40, Dave Chinner wrote:
>> On Fri, Aug 23, 2019 at 09:08:53PM +0800, Joseph Qi wrote:
>>>
>>>
>>> On 19/8/23 18:16, Dave Chinner wrote:
>>>> On Fri, Aug 23, 2019 at 03:57:02PM +0800, Joseph Qi wrote:
>>>>> Hi Dave,
>>>>>
>>>>> On 19/8/22 13:40, Dave Chinner wrote:
>>>>>> On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
>>>>>>> Hi Ted,
>>>>>>>
>>>>>>> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
>>>>>>>> On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
>>>>>>>>>
>>>>>>>>> I've tested parallel dio reads with dioread_nolock, it
>>>>>>>>> doesn't have significant performance improvement and still
>>>>>>>>> poor compared with reverting parallel dio reads. IMO, this
>>>>>>>>> is because with parallel dio reads, it take inode shared
>>>>>>>>> lock at the very beginning in ext4_direct_IO_read().
>>>>>>>>
>>>>>>>> Why is that a problem? It's a shared lock, so parallel
>>>>>>>> threads should be able to issue reads without getting
>>>>>>>> serialized?
>>>>>>>>
>>>>>>> The above just tells the result that even mounting with
>>>>>>> dioread_nolock, parallel dio reads still has poor performance
>>>>>>> than before (w/o parallel dio reads).
>>>>>>>
>>>>>>>> Are you using sufficiently fast storage devices that you're
>>>>>>>> worried about cache line bouncing of the shared lock? Or do
>>>>>>>> you have some other concern, such as some other thread
>>>>>>>> taking an exclusive lock?
>>>>>>>>
>>>>>>> The test case is random read/write described in my first
>>>>>>> mail. And
>>>>>>
>>>>>> Regardless of dioread_nolock, ext4_direct_IO_read() is taking
>>>>>> inode_lock_shared() across the direct IO call. And writes in
>>>>>> ext4 _always_ take the inode_lock() in ext4_file_write_iter(),
>>>>>> even though it gets dropped quite early when overwrite &&
>>>>>> dioread_nolock is set. But just taking the lock exclusively
>>>>>> in write fro a short while is enough to kill all shared
>>>>>> locking concurrency...
>>>>>>
>>>>>>> from my preliminary investigation, shared lock consumes more
>>>>>>> in such scenario.
>>>>>>
>>>>>> If the write lock is also shared, then there should not be a
>>>>>> scalability issue. The shared dio locking is only half-done in
>>>>>> ext4, so perhaps comparing your workload against XFS would be
>>>>>> an informative exercise...
>>>>>
>>>>> I've done the same test workload on xfs, it behaves the same as
>>>>> ext4 after reverting parallel dio reads and mounting with
>>>>> dioread_lock.
>>>>
>>>> Ok, so the problem is not shared locking scalability ('cause
>>>> that's what XFS does and it scaled fine), the problem is almost
>>>> certainly that ext4 is using exclusive locking during
>>>> writes...
>>>>
>>>
>>> Agree. Maybe I've misled you in my previous mails.I meant shared
>>> lock makes worse in case of mixed random read/write, since we
>>> would always take inode lock during write. And it also conflicts
>>> with dioread_nolock. It won't take any inode lock before with
>>> dioread_nolock during read, but now it always takes a shared
>>> lock.
>>
>> No, you didn't mislead me. IIUC, the shared locking was added to the
>> direct IO read path so that it can't run concurrently with
>> operations like hole punch that free the blocks the dio read might
>> currently be operating on (use after free).
>>
>> i.e. the shared locking fixes an actual bug, but the performance
>> regression is a result of only partially converting the direct IO
>> path to use shared locking. Only half the job was done from a
>> performance perspective. Seems to me that the two options here to
>> fix the performance regression are to either finish the shared
>> locking conversion, or remove the shared locking on read and re-open
>> a potential data exposure issue...
>
> We actually had a separate locking mechanism in ext4 code to avoid stale
> data exposure during hole punch when unlocked DIO reads were running. But
> it was kind of ugly and making things complex. I agree we need to move ext4
> DIO path conversion further to avoid taking exclusive lock when we won't
> actually need it.

It seems to me that the right solution for the short term is to revert
the patch in question, since that appears to be incomplete, and reverting
it will restore the performance. I haven't seen any comments posted with
a counter-example that the original patch actually improved performance,
or that reverting it will cause some other performance regression.

We can then leave implementing a more complete solution to a later kernel.

Cheers, Andreas

2019-08-27 01:01:06

by Joseph Qi

[permalink] [raw]
Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"



On 19/8/27 03:10, Andreas Dilger wrote:
> On Aug 26, 2019, at 2:39 AM, Jan Kara <[email protected]> wrote:
>>
>> On Sat 24-08-19 12:18:40, Dave Chinner wrote:
>>> On Fri, Aug 23, 2019 at 09:08:53PM +0800, Joseph Qi wrote:
>>>>
>>>>
>>>> On 19/8/23 18:16, Dave Chinner wrote:
>>>>> On Fri, Aug 23, 2019 at 03:57:02PM +0800, Joseph Qi wrote:
>>>>>> Hi Dave,
>>>>>>
>>>>>> On 19/8/22 13:40, Dave Chinner wrote:
>>>>>>> On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
>>>>>>>> Hi Ted,
>>>>>>>>
>>>>>>>> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
>>>>>>>>> On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
>>>>>>>>>>
>>>>>>>>>> I've tested parallel dio reads with dioread_nolock, it
>>>>>>>>>> doesn't have significant performance improvement and still
>>>>>>>>>> poor compared with reverting parallel dio reads. IMO, this
>>>>>>>>>> is because with parallel dio reads, it take inode shared
>>>>>>>>>> lock at the very beginning in ext4_direct_IO_read().
>>>>>>>>>
>>>>>>>>> Why is that a problem? It's a shared lock, so parallel
>>>>>>>>> threads should be able to issue reads without getting
>>>>>>>>> serialized?
>>>>>>>>>
>>>>>>>> The above just tells the result that even mounting with
>>>>>>>> dioread_nolock, parallel dio reads still has poor performance
>>>>>>>> than before (w/o parallel dio reads).
>>>>>>>>
>>>>>>>>> Are you using sufficiently fast storage devices that you're
>>>>>>>>> worried about cache line bouncing of the shared lock? Or do
>>>>>>>>> you have some other concern, such as some other thread
>>>>>>>>> taking an exclusive lock?
>>>>>>>>>
>>>>>>>> The test case is random read/write described in my first
>>>>>>>> mail. And
>>>>>>>
>>>>>>> Regardless of dioread_nolock, ext4_direct_IO_read() is taking
>>>>>>> inode_lock_shared() across the direct IO call. And writes in
>>>>>>> ext4 _always_ take the inode_lock() in ext4_file_write_iter(),
>>>>>>> even though it gets dropped quite early when overwrite &&
>>>>>>> dioread_nolock is set. But just taking the lock exclusively
>>>>>>> in write fro a short while is enough to kill all shared
>>>>>>> locking concurrency...
>>>>>>>
>>>>>>>> from my preliminary investigation, shared lock consumes more
>>>>>>>> in such scenario.
>>>>>>>
>>>>>>> If the write lock is also shared, then there should not be a
>>>>>>> scalability issue. The shared dio locking is only half-done in
>>>>>>> ext4, so perhaps comparing your workload against XFS would be
>>>>>>> an informative exercise...
>>>>>>
>>>>>> I've done the same test workload on xfs, it behaves the same as
>>>>>> ext4 after reverting parallel dio reads and mounting with
>>>>>> dioread_lock.
>>>>>
>>>>> Ok, so the problem is not shared locking scalability ('cause
>>>>> that's what XFS does and it scaled fine), the problem is almost
>>>>> certainly that ext4 is using exclusive locking during
>>>>> writes...
>>>>>
>>>>
>>>> Agree. Maybe I've misled you in my previous mails.I meant shared
>>>> lock makes worse in case of mixed random read/write, since we
>>>> would always take inode lock during write. And it also conflicts
>>>> with dioread_nolock. It won't take any inode lock before with
>>>> dioread_nolock during read, but now it always takes a shared
>>>> lock.
>>>
>>> No, you didn't mislead me. IIUC, the shared locking was added to the
>>> direct IO read path so that it can't run concurrently with
>>> operations like hole punch that free the blocks the dio read might
>>> currently be operating on (use after free).
>>>
>>> i.e. the shared locking fixes an actual bug, but the performance
>>> regression is a result of only partially converting the direct IO
>>> path to use shared locking. Only half the job was done from a
>>> performance perspective. Seems to me that the two options here to
>>> fix the performance regression are to either finish the shared
>>> locking conversion, or remove the shared locking on read and re-open
>>> a potential data exposure issue...
>>
>> We actually had a separate locking mechanism in ext4 code to avoid stale
>> data exposure during hole punch when unlocked DIO reads were running. But
>> it was kind of ugly and making things complex. I agree we need to move ext4
>> DIO path conversion further to avoid taking exclusive lock when we won't
>> actually need it.
>
> It seems to me that the right solution for the short term is to revert
> the patch in question, since that appears to be incomplete, and reverting
> it will restore the performance. I haven't seen any comments posted with
> a counter-example that the original patch actually improved performance,
> or that reverting it will cause some other performance regression.
>
> We can then leave implementing a more complete solution to a later kernel.
>
Thanks for the discussion.
So if no one else objects to reverting parallel dio reads for now, I'll
send out the revert patches.

Thanks,
Joseph