2021-05-20 16:51:47

by Jan Kara

Subject: Re: [LKP] [ext4] 05c2c00f37: aim7.jobs-per-min -11.8% regression

Hello!

On Thu 20-05-21 15:13:20, Xing Zhengjun wrote:
>
> Do you have time to look at this? The regression still exists in the
> latest Linux mainline v5.13-rc2.

Thanks for verification and for the ping! I had a look into this and I
think the regression is caused by the changes in orphan handling. The load
runs multiple tasks all creating and deleting files. This generally
contends on the orphan locking with fast storage (which is your case
because you use ramdisk as a backing store AFAICT). And the changes I did
move superblock checksum computation under the orphan lock so the lock hold
times are now higher.

Sadly it is not easy to move checksum update from under the orphan lock and
maintain checksum consistency since the checksum has to be recomputed
consistently with the changes of superblock state. But I have one idea how
we could maybe improve the situation. Can you check whether attached patch
recovers the regression? Because that's about how good it could get when we
are more careful when writing out superblock.
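The lock-hold-time effect described above can be modeled outside the kernel. The toy Python sketch below (not ext4 code; `ToySuperblock` and its fields are invented for illustration) shows an "orphan add" that updates a fake superblock, with the checksum computed either inside the shared lock (longer hold times, as after 05c2c00f37) or outside it (shorter hold times, but racy):

```python
import threading
import zlib

class ToySuperblock:
    """Toy model (not ext4 code): an orphan list and a superblock buffer
    guarded by one lock, with the checksum computed either inside the
    critical section (longer hold times) or outside it (racy)."""

    def __init__(self, checksum_under_lock):
        self.lock = threading.Lock()      # stands in for s_orphan_lock
        self.orphans = []
        self.sb = bytearray(1024)         # stands in for the superblock
        self.checksum = 0
        self.checksum_under_lock = checksum_under_lock

    def orphan_add(self, ino):
        with self.lock:
            self.orphans.append(ino)
            self.sb[0:4] = ino.to_bytes(4, "little")
            if self.checksum_under_lock:
                # The change moved the comparatively expensive checksum
                # here, inside the lock, so every waiter now also waits
                # for a CRC over the whole superblock.
                self.checksum = zlib.crc32(self.sb)
        if not self.checksum_under_lock:
            # Cheaper for lock hold time, but racy: another task may have
            # modified self.sb between the update and this CRC -- exactly
            # the consistency problem described above.
            self.checksum = zlib.crc32(self.sb)

def hammer(sb, n_threads=4, per_thread=1000):
    """Run several tasks all adding orphans, like the aim7 load."""
    def worker(base):
        for i in range(per_thread):
            sb.orphan_add(base + i)
    threads = [threading.Thread(target=worker, args=(t * per_thread,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(sb.orphans)
```

With the checksum under the lock the final checksum is always consistent with the superblock contents; with it outside, it may not be.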

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
0001-ext4-Experiment-with-checksum-computation-during-orp.patch (1.64 kB)

2021-05-21 09:04:43

by Xing Zhengjun

Subject: Re: [LKP] [ext4] 05c2c00f37: aim7.jobs-per-min -11.8% regression

Hi Jan,

On 5/20/2021 5:51 PM, Jan Kara wrote:
> Hello!
>
> On Thu 20-05-21 15:13:20, Xing Zhengjun wrote:
>>
>> Do you have time to look at this? The regression still exists in the
>> latest Linux mainline v5.13-rc2.
>
> Thanks for verification and for the ping! I had a look into this and I
> think the regression is caused by the changes in orphan handling. The load
> runs multiple tasks all creating and deleting files. This generally
> contends on the orphan locking with fast storage (which is your case
> because you use ramdisk as a backing store AFAICT). And the changes I did
> move superblock checksum computation under the orphan lock so the lock hold
> times are now higher.
>
> Sadly it is not easy to move checksum update from under the orphan lock and
> maintain checksum consistency since the checksum has to be recomputed
> consistently with the changes of superblock state. But I have one idea how
> we could maybe improve the situation. Can you check whether attached patch
> recovers the regression? Because that's about how good it could get when we
> are more careful when writing out superblock.
>
> Honza
>

I applied the patch on top of v5.13-rc2 and tested it. The patch does not
recover the regression; in fact, the regression became more serious (-45.7%).

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/disk/md/fs/test/load/cpufreq_governor/ucode:

lkp-csl-2sp9/aim7/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/4BRD_12G/RAID1/ext4/creat-clo/1000/performance/0x5003006

commit:                                                    aim7.jobs-per-min
  4392fbc4bab57db3760f0fb61258cb7089b37665                 13342
  05c2c00f3769abb9e323fcaca70d2de0b48af7ba                 11771 ± 2%  (-11.8%)
  v5.13-rc2                                                11450       (-14.2%)
  2a1eb1a2fc08daaaf76a5aa8ffa355b5a5013d86 (the test patch) 7240 ± 3%  (-45.7%)



--
Zhengjun Xing

2021-05-21 20:11:09

by Jan Kara

Subject: Re: [LKP] [ext4] 05c2c00f37: aim7.jobs-per-min -11.8% regression

On Fri 21-05-21 09:16:42, Xing Zhengjun wrote:
> Hi Jan,
>
> On 5/20/2021 5:51 PM, Jan Kara wrote:
> > Hello!
> >
> > On Thu 20-05-21 15:13:20, Xing Zhengjun wrote:
> > >
> > > Do you have time to look at this? The regression still exists in the
> > > latest Linux mainline v5.13-rc2.
> >
> > Thanks for verification and for the ping! I had a look into this and I
> > think the regression is caused by the changes in orphan handling. The load
> > runs multiple tasks all creating and deleting files. This generally
> > contends on the orphan locking with fast storage (which is your case
> > because you use ramdisk as a backing store AFAICT). And the changes I did
> > move superblock checksum computation under the orphan lock so the lock hold
> > times are now higher.
> >
> > Sadly it is not easy to move checksum update from under the orphan lock and
> > maintain checksum consistency since the checksum has to be recomputed
> > consistently with the changes of superblock state. But I have one idea how
> > we could maybe improve the situation. Can you check whether attached patch
> > recovers the regression? Because that's about how good it could get when we
> > are more careful when writing out superblock.
> >
> > Honza
> >
>
> I applied the patch on top of v5.13-rc2 and tested it. The patch does not
> recover the regression; in fact, the regression became more serious (-45.7%).

OK, thanks for testing. So the orphan code is indeed the likely cause of
this regression but I probably did not guess correctly what is the
contention point there. Then I guess I need to reproduce and do more
digging why the contention happens...

Honza

>
> =========================================================================================
> tbox_group/testcase/rootfs/kconfig/compiler/disk/md/fs/test/load/cpufreq_governor/ucode:
>
> lkp-csl-2sp9/aim7/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/4BRD_12G/RAID1/ext4/creat-clo/1000/performance/0x5003006
>
> commit:                                                    aim7.jobs-per-min
>   4392fbc4bab57db3760f0fb61258cb7089b37665                 13342
>   05c2c00f3769abb9e323fcaca70d2de0b48af7ba                 11771 ± 2%  (-11.8%)
>   v5.13-rc2                                                11450       (-14.2%)
>   2a1eb1a2fc08daaaf76a5aa8ffa355b5a5013d86 (the test patch) 7240 ± 3%  (-45.7%)
>
>
>
> --
> Zhengjun Xing
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-05-21 20:23:33

by Theodore Ts'o

Subject: Re: [LKP] [ext4] 05c2c00f37: aim7.jobs-per-min -11.8% regression

On Fri, May 21, 2021 at 11:27:30AM +0200, Jan Kara wrote:
>
> OK, thanks for testing. So the orphan code is indeed the likely cause of
> this regression but I probably did not guess correctly what is the
> contention point there. Then I guess I need to reproduce and do more
> digging why the contention happens...

Hmm... what if we only recalculate the superblock checksum when we do
a commit, via the callback function from the jbd2 layer to file
system?

- Ted

2021-05-25 09:25:52

by Jan Kara

Subject: Re: [LKP] [ext4] 05c2c00f37: aim7.jobs-per-min -11.8% regression

On Fri 21-05-21 12:42:16, Theodore Y. Ts'o wrote:
> On Fri, May 21, 2021 at 11:27:30AM +0200, Jan Kara wrote:
> >
> > OK, thanks for testing. So the orphan code is indeed the likely cause of
> > this regression but I probably did not guess correctly what is the
> > contention point there. Then I guess I need to reproduce and do more
> > digging why the contention happens...
>
> Hmm... what if we only recalculate the superblock checksum when we do
> a commit, via the callback function from the jbd2 layer to file
> system?

I actually have to check whether the regression is there because of the
additional locking of the buffer_head (because that's the only thing that
was added to that code in fact, adding some atomic instructions, bouncing
another cacheline) or because of the checksum computation that moved from
ext4_handle_dirty_super() closer to actual superblock update under those
locks.

If the problem is indeed just the checksum computation under all those
locks, we can move that to transaction commit time (using the t_frozen
trigger - ocfs2 uses that for all metadata checksumming). But then we have
to be very careful with unjournaled sb updates that can be running in
parallel with the journaled ones because once you drop buffer lock, sb can
get clobbered and the checksum invalidated. Also there's the question of
what to do with nojournal mode - probably we would have to keep a separate
set of places recomputing checksums just for nojournal mode, which is quite
error prone, but if it's just for the sb, I guess it's manageable.
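The commit-time idea can be sketched as a deferred-checksum pattern. The toy Python model below (not jbd2 code; `ToyJournal` and its callback are invented stand-ins for the t_frozen-style trigger) has writers only mark the superblock dirty under the lock, with the checksum computed once per commit:

```python
import threading
import zlib

class ToyJournal:
    """Toy model of commit-time checksumming (not jbd2 code): writers only
    dirty the superblock under its lock; the commit path -- playing the
    role of a t_frozen trigger -- computes the checksum once per commit."""

    def __init__(self):
        self.lock = threading.Lock()
        self.sb = bytearray(1024)
        self.sb_dirty = False
        self.checksum = 0

    def update_sb(self, off, data):
        with self.lock:
            self.sb[off:off + len(data)] = data
            self.sb_dirty = True   # cheap: no checksum on the hot path

    def commit(self):
        # Runs once per transaction commit, amortizing the checksum over
        # all updates made during the transaction.
        with self.lock:
            if self.sb_dirty:
                self.checksum = zlib.crc32(self.sb)
                self.sb_dirty = False
            return self.checksum
```

The caveat raised above still applies: any update that bypasses `update_sb()` (an unjournaled write in the real code) would invalidate the committed checksum.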

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-05-25 20:06:41

by Theodore Ts'o

Subject: Re: [LKP] [ext4] 05c2c00f37: aim7.jobs-per-min -11.8% regression

On Tue, May 25, 2021 at 11:22:05AM +0200, Jan Kara wrote:
> I actually have to check whether the regression is there because of the
> additional locking of the buffer_head (because that's the only thing that
> was added to that code in fact, adding some atomic instructions, bouncing
> another cacheline) or because of the checksum computation that moved from
> ext4_handle_dirty_super() closer to actual superblock update under those
> locks.

Hmm, yes. So the other thing we could try doing might be to drop
s_orphan_lock mutex altogether and instead use the buffer_head lock to
protect the orphan list. In actual practice the orphan list
modifications are the vast majority of the times when we need to
modify the superblock, so we could just take the buffer_head lock a
bit earlier (when we take the s_orphan_lock) and release it a bit
later and thus avoid the cacheline bounce.

What do you think?
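As a rough sketch of that suggestion (a toy Python model, not the actual ext4 change; `ToySuperblockBuffer` is invented for illustration): the separate orphan mutex disappears, and a single buffer lock is held across the whole list-plus-superblock update.

```python
import threading

class ToySuperblockBuffer:
    """Toy model of the suggestion above (not ext4 code): drop the separate
    orphan mutex and let one buffer lock -- the lock_buffer() analogue --
    protect both the orphan list and the superblock bytes."""

    def __init__(self):
        self.buffer_lock = threading.Lock()  # stands in for lock_buffer(sbh)
        self.orphans = []
        self.data = bytearray(1024)

    def orphan_add(self, ino):
        # One lock, taken slightly earlier and released slightly later,
        # covers both structures, avoiding the second lock acquisition and
        # the extra cacheline bounce.
        with self.buffer_lock:
            self.orphans.append(ino)
            self.data[0:4] = ino.to_bytes(4, "little")

    def orphan_del(self, ino):
        with self.buffer_lock:
            self.orphans.remove(ino)
```

Note that in the kernel the buffer lock is a bit spinlock rather than a mutex, so the two are not interchangeable performance-wise.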

> If the problem is indeed just the checksum computation under all those
> locks, we can move that to transaction commit time (using the t_frozen
> trigger - ocfs2 uses that for all metadata checksumming). But then we have
> to be very careful with unjournaled sb updates that can be running in
> parallel with the journaled ones because once you drop buffer lock, sb can
> get clobbered and checksum invalidated. Also there's the question what to
> do with nojournal mode - probably we would have to keep separate set of
> places recomputing checksums just for nojournal mode which is quite error
> prone but if it's just for sb, I guess it's manageable.

We lived without the orphan list in ext2 for a long time, and without
the journal, whether the orphan linked list in inode->d_time would be
anything approaching consistent is a major question in any case. One
approach then might be to drop the orphan list in nojournal mode
altogether....

- Ted

2021-05-31 17:09:51

by Jan Kara

Subject: Re: [LKP] [ext4] 05c2c00f37: aim7.jobs-per-min -11.8% regression

On Tue 25-05-21 11:22:05, Jan Kara wrote:
> On Fri 21-05-21 12:42:16, Theodore Y. Ts'o wrote:
> > On Fri, May 21, 2021 at 11:27:30AM +0200, Jan Kara wrote:
> > >
> > > OK, thanks for testing. So the orphan code is indeed the likely cause of
> > > this regression but I probably did not guess correctly what is the
> > > contention point there. Then I guess I need to reproduce and do more
> > > digging why the contention happens...
> >
> > Hmm... what if we only recalculate the superblock checksum when we do
> > a commit, via the callback function from the jbd2 layer to file
> > system?
>
> I actually have to check whether the regression is there because of the
> additional locking of the buffer_head (because that's the only thing that
> was added to that code in fact, adding some atomic instructions, bouncing
> another cacheline) or because of the checksum computation that moved from
> ext4_handle_dirty_super() closer to actual superblock update under those
> locks.

So I did a few experiments on my test machine. I saw the biggest regression
for creat_clo workload for 7 threads. The results look like:

orig patched hack1 hack2
Hmean creat_clo-7 36458.33 ( 0.00%) 23836.55 * -34.62%* 32608.70 * -10.56%* 37300.18 ( 2.31%)

where hack1 means I've removed the lock_buffer() calls from orphan handling
code and hack2 means I've additionally moved checksum recalculation from
under orphan lock. Take the numbers with a grain of salt as they are rather
variable and this is just an average of 5 runs but the tendency is pretty
clear. Both these changes contribute to the regression significantly; the
additional locking of the buffer head contributes somewhat more.

I will see how various variants of reducing the contention look (e.g.
whether just using the bh lock for everything helps at all). But honestly I
don't want to jump through too many hoops just for this workload - the
orphan list contention is pretty pathological here, and if we seriously
care about workloads like this we should rather revive the patchset with a
hashed orphan list I wrote a couple of years back... That was able to give
something like a 3x speedup to workloads like this.
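The hashed-orphan-list idea can be sketched as follows (a toy Python model under the assumption of per-bucket locking; the real patchset's design may well differ, and `HashedOrphanList` is an invented name):

```python
import threading

class HashedOrphanList:
    """Toy model (not the actual patchset): orphan inodes hash into N
    independent buckets, each with its own lock, so concurrent
    create/delete workloads rarely contend on the same lock."""

    def __init__(self, nbuckets=64):
        self.buckets = [set() for _ in range(nbuckets)]
        self.locks = [threading.Lock() for _ in range(nbuckets)]

    def _idx(self, ino):
        return ino % len(self.buckets)

    def add(self, ino):
        i = self._idx(ino)
        with self.locks[i]:
            self.buckets[i].add(ino)

    def remove(self, ino):
        i = self._idx(ino)
        with self.locks[i]:
            self.buckets[i].discard(ino)

    def count(self):
        return sum(len(b) for b in self.buckets)
```

With many buckets, two tasks adding or deleting different inodes usually hit different locks, which is where the speedup for loads like creat_clo would come from.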

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-06-03 16:12:26

by Jan Kara

Subject: Re: [LKP] [ext4] 05c2c00f37: aim7.jobs-per-min -11.8% regression

On Mon 31-05-21 18:57:46, Jan Kara wrote:
> On Tue 25-05-21 11:22:05, Jan Kara wrote:
> > On Fri 21-05-21 12:42:16, Theodore Y. Ts'o wrote:
> > > On Fri, May 21, 2021 at 11:27:30AM +0200, Jan Kara wrote:
> > > >
> > > > OK, thanks for testing. So the orphan code is indeed the likely cause of
> > > > this regression but I probably did not guess correctly what is the
> > > > contention point there. Then I guess I need to reproduce and do more
> > > > digging why the contention happens...
> > >
> > > Hmm... what if we only recalculate the superblock checksum when we do
> > > a commit, via the callback function from the jbd2 layer to file
> > > system?
> >
> > I actually have to check whether the regression is there because of the
> > additional locking of the buffer_head (because that's the only thing that
> > was added to that code in fact, adding some atomic instructions, bouncing
> > another cacheline) or because of the checksum computation that moved from
> > ext4_handle_dirty_super() closer to actual superblock update under those
> > locks.
>
> So I did a few experiments on my test machine. I saw the biggest regression
> for creat_clo workload for 7 threads. The results look like:
>
> orig patched hack1 hack2
> Hmean creat_clo-7 36458.33 ( 0.00%) 23836.55 * -34.62%* 32608.70 * -10.56%* 37300.18 ( 2.31%)
>
> where hack1 means I've removed the lock_buffer() calls from orphan handling
> code and hack2 means I've additionally moved checksum recalculation from
> under orphan lock. Take the numbers with a grain of salt as they are rather
> variable and this is just an average of 5 runs but the tendency is pretty
> clear. Both these changes contribute to the regression significantly,
> additional locking of the buffer head contributes somewhat more.
>
> I will see how various variants of reducing the contention look like (e.g.
> if just using bh lock for everything helps at all). But honestly I don't
> want to jump through too big hoops just for this workload - the orphan list
> contention is pretty pathological here and if we seriously care about
> workload like this we should rather revive the patchset with hashed orphan
> list I wrote couple years back... That was able to give like 3x speedup to
> workloads like this.

So I did some more testing. I've found out that, due to a configuration
mistake, the ramdisk created for the previous round of tests was tiny (8M
instead of 8G), so I've now rerun all the tests and added a few more:

Orig Patched Hack1 Hack2 BH orphan lock
Hmean creat_clo-1 12875.54 ( 0.00%) 12765.96 ( -0.85%) 12765.96 ( -0.85%) 12820.51 ( -0.43%) 12875.54 ( 0.00%)
Hmean creat_clo-2 20761.25 ( 0.00%) 19736.84 * -4.93%* 20408.16 ( -1.70%) 19736.84 * -4.93%* 20477.82 ( -1.37%)
Hmean creat_clo-3 22727.27 ( 0.00%) 22500.00 ( -1.00%) 24390.24 ( 7.32%) 23255.81 ( 2.33%) 21176.47 ( -6.82%)
Hmean creat_clo-4 27149.32 ( 0.00%) 24539.88 * -9.61%* 27272.73 ( 0.45%) 25806.45 ( -4.95%) 21660.65 * -20.22%*
Hmean creat_clo-5 32397.41 ( 0.00%) 27985.08 * -13.62%* 28957.53 ( -10.62%) 29821.07 * -7.95%* 23771.79 * -26.62%*
Hmean creat_clo-6 33898.30 ( 0.00%) 30769.23 ( -9.23%) 30981.07 ( -8.61%) 31858.41 ( -6.02%) 26086.96 * -23.04%*
Hmean creat_clo-7 29005.52 ( 0.00%) 29661.02 ( 2.26%) 30746.71 ( 6.00%) 33175.35 ( 14.38%) 24970.27 ( -13.91%)
Hmean creat_clo-8 30573.25 ( 0.00%) 32000.00 ( 4.67%) 29702.97 ( -2.85%) 34139.40 ( 11.66%) 23668.64 * -22.58%*

As in the previous test, 'Orig' is the original state before 05c2c00f37,
'Patched' is the state after commit 05c2c00f37, 'Hack1' is 05c2c00f37 with the
lock_buffer() calls removed from orphan handling, 'Hack2' is 05c2c00f37 with
the lock_buffer() calls removed and checksumming moved from under the
orphan_lock, and 'BH orphan lock' is 05c2c00f37 with the orphan_lock replaced
by the sb buffer lock.

As we can see, with the filesystem size fixed the regression isn't actually
that big anymore, and it roughly matches what 0-day reported. Replacing the
orphan lock with the superblock buffer_head lock makes things much worse -
not really surprising given we are replacing an optimized mutex
implementation with a bitlock. Just removing the buffer lock (Hack1 test)
doesn't seem to improve the results noticeably, so that is not the problem.
Moving checksumming out from under the orphan_lock would probably help
noticeably (Hack2 test), but there's the problem of when to compute
checksums in nojournal mode, and we'd also need to be very careful that all
the other places updating the superblock serialize against orphan
operations so they cannot invalidate the checksum - IMO not very compelling.

So, as we discussed on today's call, probably the best option is to leave
the code as-is for now and instead work on moving away from the orphan list
altogether. I'll revive my patches to do that.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR