2019-11-04 11:31:03

by Zhangshaokun

Subject: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

From: Yang Guo <[email protected]>

percpu_counter_compare is called by xfs_mod_icount/ifree to check
whether the counter is less than 0, and it is an expensive function.
Let's check it only when delta < 0; this is good for xfs's performance.

Cc: "Darrick J. Wong" <[email protected]>
Signed-off-by: Yang Guo <[email protected]>
Signed-off-by: Shaokun Zhang <[email protected]>
---
fs/xfs/xfs_mount.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index ba5b6f3b2b88..5e8314e6565e 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1174,6 +1174,9 @@ xfs_mod_icount(
int64_t delta)
{
percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
+ if (delta > 0)
+ return 0;
+
if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
ASSERT(0);
percpu_counter_add(&mp->m_icount, -delta);
@@ -1188,6 +1191,9 @@ xfs_mod_ifree(
int64_t delta)
{
percpu_counter_add(&mp->m_ifree, delta);
+ if (delta > 0)
+ return 0;
+
if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
ASSERT(0);
percpu_counter_add(&mp->m_ifree, -delta);
--
2.7.4
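[Editor's note] The effect of the patch can be modeled outside the kernel. Below is a hedged, single-variable sketch (a plain long long stands in for the per-CPU counter; mod_ifree, g_ifree, and the -22 return value are stand-ins, not kernel identifiers):

```c
#include <assert.h>

/*
 * Simplified model of the patched xfs_mod_ifree(): a positive delta can
 * only move the counter away from zero, so the (potentially expensive)
 * underflow check is needed only for delta < 0.
 * g_ifree stands in for mp->m_ifree; -22 stands in for -EINVAL.
 */
static long long g_ifree;

static int mod_ifree(long long delta)
{
    g_ifree += delta;
    if (delta > 0)
        return 0;               /* the guard added by the patch */
    if (g_ifree < 0) {          /* models percpu_counter_compare(...) < 0 */
        g_ifree -= delta;       /* undo the bad modification */
        return -22;
    }
    return 0;
}
```

In this model, as in the patch, a positive delta returns early and never triggers the underflow path, while a delta that would drive the counter negative is undone and reported as an error.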


2019-11-04 15:28:50

by Christoph Hellwig

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> From: Yang Guo <[email protected]>
>
> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> whether the counter less than 0 and it is a expensive function.
> let's check it only when delta < 0, it will be good for xfs's performance.

How much overhead do you see? In the end the compare is just a debug
check, so if it actually shows up we should remove it entirely.

2019-11-04 20:50:35

by Dave Chinner

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> From: Yang Guo <[email protected]>
>
> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> whether the counter less than 0 and it is a expensive function.
> let's check it only when delta < 0, it will be good for xfs's performance.

Hmmm. I don't recall this as being expensive.

How did you find this? Can you please always document how you found
the problem being addressed in the commit message so that we don't
then have to ask how the problem being fixed is reproduced.

> Cc: "Darrick J. Wong" <[email protected]>
> Signed-off-by: Yang Guo <[email protected]>
> Signed-off-by: Shaokun Zhang <[email protected]>
> ---
> fs/xfs/xfs_mount.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index ba5b6f3b2b88..5e8314e6565e 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1174,6 +1174,9 @@ xfs_mod_icount(
> int64_t delta)
> {
> percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
> + if (delta > 0)
> + return 0;
> +
> if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
> ASSERT(0);
> percpu_counter_add(&mp->m_icount, -delta);

I struggle to see how this is expensive when you have more than
num_online_cpus() * XFS_ICOUNT_BATCH inodes allocated.
__percpu_counter_compare() will always take the fast path so ends up
being very little code at all.
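[Editor's note] The fast-path behaviour described here can be sketched with a simplified model of the decision in lib/percpu_counter.c (a paraphrase, not the exact kernel code; needs_exact_sum is a hypothetical name):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Simplified model of __percpu_counter_compare(): if the approximate
 * global count is further from rhs than the worst-case per-CPU drift
 * (batch * nr_cpus), the cheap comparison is conclusive and the fast
 * path is taken; otherwise an exact sum over every CPU's counter is
 * required (the slow path, with its spinlock and cross-CPU reads).
 */
static int needs_exact_sum(long long approx_count, long long rhs,
                           int batch, int nr_cpus)
{
    return llabs(approx_count - rhs) <= (long long)batch * nr_cpus;
}
```

With the numbers that come up later in this thread (batch 256, 128 CPUs, m_ifree hovering around 999), needs_exact_sum(999, 0, 256, 128) is true, so every xfs_mod_ifree() call falls into the slow path; with enough allocated inodes the icount comparison stays comfortably on the fast path.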

> @@ -1188,6 +1191,9 @@ xfs_mod_ifree(
> int64_t delta)
> {
> percpu_counter_add(&mp->m_ifree, delta);
> + if (delta > 0)
> + return 0;
> +
> if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
> ASSERT(0);
> percpu_counter_add(&mp->m_ifree, -delta);

This one might have some overhead because the count is often at or
around zero, but I haven't noticed it being expensive in kernel
profiles when creating/freeing hundreds of thousands of inodes every
second.

IOWs, we typically measure the overhead of such functions by kernel
profile. Creating ~200,000 inodes a second, so hammering the icount
and ifree counters, I see:

0.16% [kernel] [k] percpu_counter_add_batch
0.03% [kernel] [k] __percpu_counter_compare

Almost nothing - it's way down the long tail of noise in the
profile.

IOWs, the CPU consumed by percpu_counter_compare() is so low that
optimisation isn't going to produce any measurable performance
improvement. Hence it's not really something we've concerned
ourselves about. The profile is pretty much identical for removing
hundreds of thousands of files a second, too, so there really isn't
any performance gain to be had here.

If you want to optimise code to make it faster and show a noticeable
performance improvement, start by running kernel profiles while your
performance critical workload is running. Then look at what the
functions and call chains that consume the most CPU and work out how
to do them better. Those are the places that optimisation will
result in measurable performance gains....

Cheers,

Dave.

--
Dave Chinner
[email protected]

2019-11-05 03:09:18

by Zhangshaokun

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

Hi Christoph,

On 2019/11/4 23:25, Christoph Hellwig wrote:
> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>> From: Yang Guo <[email protected]>
>>
>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>> whether the counter less than 0 and it is a expensive function.
>> let's check it only when delta < 0, it will be good for xfs's performance.
>
> How much overhead do you see? In the end the compare is just a debug

Thanks for your reply, and sorry for the unclear description.
__percpu_counter_compare itself is not expensive, but __percpu_counter_sum,
which it calls, is the heavy part; I will show the details in the next mail.

> check, so if it actually shows up we should remove it entirely.
>

I'm not sure about removing it entirely, so I just check the delta to keep the modification small.

Thanks,
Shaokun

>

2019-11-05 03:30:17

by Zhangshaokun

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

Hi Dave,

On 2019/11/5 4:49, Dave Chinner wrote:
> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>> From: Yang Guo <[email protected]>
>>
>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>> whether the counter less than 0 and it is a expensive function.
>> let's check it only when delta < 0, it will be good for xfs's performance.
>
> Hmmm. I don't recall this as being expensive.
>

Sorry about the misleading information in the commit message.

> How did you find this? Can you please always document how you found

If a user creates millions of files and then deletes them, we found that
__percpu_counter_compare cost 5.78% of CPU usage. You are right that the
function itself is not expensive, but it calls __percpu_counter_sum, which
takes a spinlock and reads every other CPU's count. perf record -g was used
to profile it:

- 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
   - 5.86% xfs_mod_ifree
      - 5.78% __percpu_counter_compare
           5.61% __percpu_counter_sum

> the problem being addressed in the commit message so that we don't
> then have to ask how the problem being fixed is reproduced.
>
>> Cc: "Darrick J. Wong" <[email protected]>
>> Signed-off-by: Yang Guo <[email protected]>
>> Signed-off-by: Shaokun Zhang <[email protected]>
>> ---
>> fs/xfs/xfs_mount.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
>> index ba5b6f3b2b88..5e8314e6565e 100644
>> --- a/fs/xfs/xfs_mount.c
>> +++ b/fs/xfs/xfs_mount.c
>> @@ -1174,6 +1174,9 @@ xfs_mod_icount(
>> int64_t delta)
>> {
>> percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
>> + if (delta > 0)
>> + return 0;
>> +
>> if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
>> ASSERT(0);
>> percpu_counter_add(&mp->m_icount, -delta);
>
> I struggle to see how this is expensive when you have more than
> num_online_cpus() * XFS_ICOUNT_BATCH inodes allocated.
> __percpu_counter_compare() will always take the fast path so ends up
> being very little code at all.
>
>> @@ -1188,6 +1191,9 @@ xfs_mod_ifree(
>> int64_t delta)
>> {
>> percpu_counter_add(&mp->m_ifree, delta);
>> + if (delta > 0)
>> + return 0;
>> +
>> if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
>> ASSERT(0);
>> percpu_counter_add(&mp->m_ifree, -delta);
>
> This one might have some overhead because the count is often at or
> around zero, but I haven't noticed it being expensive in kernel
> profiles when creating/freeing hundreds of thousands of inodes every
> second.
>
> IOWs, we typically measure the overhead of such functions by kernel
> profile. Creating ~200,000 inodes a second, so hammering the icount
> and ifree counters, I see:
>
> 0.16% [kernel] [k] percpu_counter_add_batch
> 0.03% [kernel] [k] __percpu_counter_compare
>

0.03% is just __percpu_counter_compare's usage.

> Almost nothing - it's way down the long tail of noise in the
> profile.
>
> IOWs, the CPU consumed by percpu_counter_compare() is low that
> optimisation isn't going to produce any measurable performance
> improvement. Hence it's not really something we've concerned
> ourselves about. The profile is pretty much identical for removing
> hundreds of thousands of files a second, too, so there really isn't
> any performance gain to be had here.
>
> If you want to optimise code to make it faster and show a noticable
> performance improvement, start by running kernel profiles while your
> performance critical workload is running. Then look at what the
> functions and call chains that consume the most CPU and work out how
> to do them better. Those are the places that optimisation will
> result in measurable performance gains....

Hmm, I have done that, but I didn't describe the problem clearly: with this
patch, the 5.78% (__percpu_counter_compare) disappears. I will follow
your method and reduce unnecessary noise.

Thanks,
Shaokun

>
> Cheers,
>
> Dave.
>

2019-11-05 04:06:42

by Dave Chinner

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/5 4:49, Dave Chinner wrote:
> > On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >> From: Yang Guo <[email protected]>
> >>
> >> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >> whether the counter less than 0 and it is a expensive function.
> >> let's check it only when delta < 0, it will be good for xfs's performance.
> >
> > Hmmm. I don't recall this as being expensive.
> >
>
> Sorry about the misunderstanding information in commit message.
>
> > How did you find this? Can you please always document how you found
>
> If user creates million of files and the delete them, We found that the
> __percpu_counter_compare costed 5.78% CPU usage, you are right that itself
> is not expensive, but it calls __percpu_counter_sum which will use
> spin_lock and read other cpu's count. perf record -g is used to profile it:
>
> - 5.88% 0.02% rm [kernel.vmlinux] [k] xfs_mod_ifree
> - 5.86% xfs_mod_ifree
> - 5.78% __percpu_counter_compare
> 5.61% __percpu_counter_sum

Interesting. Your workload is hitting the slow path, which I most
certainly do not see when creating lots of files. What's your
workload?

> > IOWs, we typically measure the overhead of such functions by kernel
> > profile. Creating ~200,000 inodes a second, so hammering the icount
> > and ifree counters, I see:
> >
> > 0.16% [kernel] [k] percpu_counter_add_batch
> > 0.03% [kernel] [k] __percpu_counter_compare
> >
>
> 0.03% is just __percpu_counter_compare's usage.

No, that's the total of _all_ the percpu counter functions captured
by the profile - it was the list of all samples filtered by
"percpu". I just re-ran the profile again, and got:


0.23% [kernel] [k] percpu_counter_add_batch
0.04% [kernel] [k] __percpu_counter_compare
0.00% [kernel] [k] collect_percpu_times
0.00% [kernel] [k] __handle_irq_event_percpu
0.00% [kernel] [k] __percpu_counter_sum
0.00% [kernel] [k] handle_irq_event_percpu
0.00% [kernel] [k] fprop_reflect_period_percpu.isra.0
0.00% [kernel] [k] percpu_ref_switch_to_atomic_rcu
0.00% [kernel] [k] free_percpu
0.00% [kernel] [k] percpu_ref_exit

So you can see that there are essentially no samples in
__percpu_counter_sum at all - my tests are not hitting the slow path
at all, despite allocating inodes continuously.

IOWs, your workload is hitting the slow path repeatedly, and so the
question that needs to be answered is "why is the slow path actually
being exercised?". IOWs, we need to know what your workload is, what
the filesystem config is, what hardware (cpus, storage, etc) you are
running on, etc. There must be some reason for the slow path being
used, and that's what we need to understand first before deciding
what the best fix might be...

I suspect that you are only running one or two threads creating
files and you have lots of idle CPU and hence the inode allocation
is not clearing the fast path batch threshold on the ifree counter.
And because you have lots of CPUs, the cost of a sum is very
expensive compared to running single threaded creates. That's my
current hypothesis based on what I see on my workloads that
xfs_mod_ifree overhead goes down as concurrency goes up....

FWIW, the profiles I took came from running this on 16 and 32p
machines:

--
dirs=""
for i in `seq 1 $THREADS`; do
    dirs="$dirs -d /mnt/scratch/$i"
done

cycles=$((512 / $THREADS))

time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
--

With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
image:

meta-data=/dev/vdc isize=512 agcount=500, agsize=268435455 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1
data = bsize=4096 blocks=134217727500, imaxpct=1
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

That's allocating enough inodes to keep the free inode counter
entirely out of the slow path...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-11-06 06:04:31

by Zhangshaokun

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

Hi Dave,

On 2019/11/5 12:03, Dave Chinner wrote:
> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>> Hi Dave,
>>
>> On 2019/11/5 4:49, Dave Chinner wrote:
>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>> From: Yang Guo <[email protected]>
>>>>
>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>> whether the counter less than 0 and it is a expensive function.
>>>> let's check it only when delta < 0, it will be good for xfs's performance.
>>>
>>> Hmmm. I don't recall this as being expensive.
>>>
>>
>> Sorry about the misunderstanding information in commit message.
>>
>>> How did you find this? Can you please always document how you found
>>
>> If user creates million of files and the delete them, We found that the
>> __percpu_counter_compare costed 5.78% CPU usage, you are right that itself
>> is not expensive, but it calls __percpu_counter_sum which will use
>> spin_lock and read other cpu's count. perf record -g is used to profile it:
>>
>> - 5.88% 0.02% rm [kernel.vmlinux] [k] xfs_mod_ifree
>> - 5.86% xfs_mod_ifree
>> - 5.78% __percpu_counter_compare
>> 5.61% __percpu_counter_sum
>
> Interesting. Your workload is hitting the slow path, which I most
> certainly do no see when creating lots of files. What's your
> workload?
>

The hardware has 128 CPU cores, and the xfs filesystem format config is the
default; the test is single-threaded, as follows:
./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2

xfs info:
meta-data=/dev/bcache2 isize=512 agcount=4, agsize=244188661 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1 spinodes=1 rmapbt=0
= reflink=0
data = bsize=4096 blocks=976754644, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=476930, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

disk info:
Disk /dev/bcache2: 4000.8 GB, 4000787021824 bytes, 7814037152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> IOWs, we typically measure the overhead of such functions by kernel
>>> profile. Creating ~200,000 inodes a second, so hammering the icount
>>> and ifree counters, I see:
>>>
>>> 0.16% [kernel] [k] percpu_counter_add_batch
>>> 0.03% [kernel] [k] __percpu_counter_compare
>>>
>>
>> 0.03% is just __percpu_counter_compare's usage.
>
> No, that's the total of _all_ the percpu counter functions captured
> by the profile - it was the list of all samples filtered by
> "percpu". I just re-ran the profile again, and got:
>
>
> 0.23% [kernel] [k] percpu_counter_add_batch
> 0.04% [kernel] [k] __percpu_counter_compare
> 0.00% [kernel] [k] collect_percpu_times
> 0.00% [kernel] [k] __handle_irq_event_percpu
> 0.00% [kernel] [k] __percpu_counter_sum
> 0.00% [kernel] [k] handle_irq_event_percpu
> 0.00% [kernel] [k] fprop_reflect_period_percpu.isra.0
> 0.00% [kernel] [k] percpu_ref_switch_to_atomic_rcu
> 0.00% [kernel] [k] free_percpu
> 0.00% [kernel] [k] percpu_ref_exit
>
> So you can see that this essentially no samples in
> __percpu_counter_sum at all - my tests are not hitting the slow path
> at all, despite allocating inodes continuously.

Got it,

>
> IOWs, your workload is hitting the slow path repeatedly, and so the
> question that needs to be answered is "why is the slow path actually
> being exercised?". IOWs, we need to know what your workload is, what
> the filesystem config is, what hardware (cpus, storage, etc) you are
> running on, etc. There must be some reason for the slow path being
> used, and that's what we need to understand first before deciding
> what the best fix might be...
>
> I suspect that you are only running one or two threads creating

Yeah, we just run a single-threaded test.

> files and you have lots of idle CPU and hence the inode allocation
> is not clearing the fast path batch threshold on the ifree counter.
> And because you have lots of CPUs, the cost of a sum is very
> expensive compared to running single threaded creates. That's my
> current hypothesis based what I see on my workloads that
> xfs_mod_ifree overhead goes down as concurrency goes up....
>

Agree. We added some debug info in xfs_mod_ifree and found that most of the
time m_ifree.count < batch * num_online_cpus(), because we have 128 online
CPUs and m_ifree.count stays around 999.


> FWIW, the profiles I took came from running this on 16 and 32p
> machines:
>
> --
> dirs=""
> for i in `seq 1 $THREADS`; do
> dirs="$dirs -d /mnt/scratch/$i"
> done
>
> cycles=$((512 / $THREADS))
>
> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
> --
>
> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
> image:
>
> meta-data=/dev/vdc isize=512 agcount=500, agsize=268435455 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=1 finobt=1, sparse=1, rmapbt=0
> = reflink=1
> data = bsize=4096 blocks=134217727500, imaxpct=1
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1
> log =internal log bsize=4096 blocks=521728, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> That's allocating enough inodes to keep the free inode counter
> entirely out of the slow path...

percpu_counter_read, which reads the shared count, incurs a cache
synchronization cost whenever another CPU changes the count. Maybe it's
better not to call percpu_counter_compare if possible.

Thanks,
Shaokun

>
> Cheers,
>
> Dave.
>

2019-11-06 21:22:16

by Dave Chinner

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/5 12:03, Dave Chinner wrote:
> > On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> >> Hi Dave,
> >>
> >> On 2019/11/5 4:49, Dave Chinner wrote:
> >>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >>>> From: Yang Guo <[email protected]>
> >>>>
> >>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >>>> whether the counter less than 0 and it is a expensive function.
> >>>> let's check it only when delta < 0, it will be good for xfs's performance.
> >>>
> >>> Hmmm. I don't recall this as being expensive.
> >>>
> >>
> >> Sorry about the misunderstanding information in commit message.
> >>
> >>> How did you find this? Can you please always document how you found
> >>
> >> If user creates million of files and the delete them, We found that the
> >> __percpu_counter_compare costed 5.78% CPU usage, you are right that itself
> >> is not expensive, but it calls __percpu_counter_sum which will use
> >> spin_lock and read other cpu's count. perf record -g is used to profile it:
> >>
> >> - 5.88% 0.02% rm [kernel.vmlinux] [k] xfs_mod_ifree
> >> - 5.86% xfs_mod_ifree
> >> - 5.78% __percpu_counter_compare
> >> 5.61% __percpu_counter_sum
> >
> > Interesting. Your workload is hitting the slow path, which I most
> > certainly do no see when creating lots of files. What's your
> > workload?
> >
>
> The hardware has 128 cpu cores, and the xfs filesystem format config is default,
> while the test is a single thread, as follow:
> ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2

What version and where do I get it?

Hmmm - isn't mdtest a MPI benchmark intended for highly concurrent
metadata workload testing? How representative is it of your actual
production workload? Is that single threaded?

> xfs info:
> meta-data=/dev/bcache2 isize=512 agcount=4, agsize=244188661 blks

only 4 AGs, which explains the lack of free inodes - there isn't
enough concurrency in the filesystem layout to push the free inode
count in all AGs beyond the batchsize * num_online_cpus().

i.e. single threaded workloads typically drain the free inode count
all the way down to zero before new inodes are allocated. Workloads
that are highly concurrent allocate from lots of AGs at once,
leaving free inodes in every AG that is not currently being actively
allocated out of.

As a test, can you remake that test filesystem with "-d agcount=32"
and see if the overhead you are seeing disappears?

> > files and you have lots of idle CPU and hence the inode allocation
> > is not clearing the fast path batch threshold on the ifree counter.
> > And because you have lots of CPUs, the cost of a sum is very
> > expensive compared to running single threaded creates. That's my
> > current hypothesis based what I see on my workloads that
> > xfs_mod_ifree overhead goes down as concurrency goes up....
> >
>
> Agree, we add some debug info in xfs_mod_ifree and found most times
> m_ifree.count < batch * num_online_cpus(), because we have 128 online
> cpus and m_ifree.count around 999.

Ok, the threshold is 32 * 128 = ~4000 to get out of the slow
path. 32 AGs may well push the count over this threshold, so it's
definitely worth trying....

> > FWIW, the profiles I took came from running this on 16 and 32p
> > machines:
> >
> > --
> > dirs=""
> > for i in `seq 1 $THREADS`; do
> > dirs="$dirs -d /mnt/scratch/$i"
> > done
> >
> > cycles=$((512 / $THREADS))
> >
> > time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
> > --
> >
> > With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
> > image:
> >
> > meta-data=/dev/vdc isize=512 agcount=500, agsize=268435455 blks
> > = sectsz=512 attr=2, projid32bit=1
> > = crc=1 finobt=1, sparse=1, rmapbt=0
> > = reflink=1
> > data = bsize=4096 blocks=134217727500, imaxpct=1
> > = sunit=0 swidth=0 blks
> > naming =version 2 bsize=4096 ascii-ci=0, ftype=1
> > log =internal log bsize=4096 blocks=521728, version=2
> > = sectsz=512 sunit=0 blks, lazy-count=1
> > realtime =none extsz=4096 blocks=0, rtextents=0
> >
> > That's allocating enough inodes to keep the free inode counter
> > entirely out of the slow path...
>
> percpu_counter_read that reads the count will cause cache synchronization
> cost if other cpu changes the count, Maybe it's better not to call
> percpu_counter_compare if possible.

Depends. Sometimes we trade off ultimate single threaded
performance and efficiency for substantially better scalability.
i.e. if we lose 5% on single threaded performance but gain 10x on
concurrent workloads, then that is a good tradeoff to make.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-11-08 06:02:41

by Zhangshaokun

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

Hi Dave,

On 2019/11/7 5:20, Dave Chinner wrote:
> On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
>> Hi Dave,
>>
>> On 2019/11/5 12:03, Dave Chinner wrote:
>>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>>>> Hi Dave,
>>>>
>>>> On 2019/11/5 4:49, Dave Chinner wrote:
>>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>>>> From: Yang Guo <[email protected]>
>>>>>>
>>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>>>> whether the counter less than 0 and it is a expensive function.
>>>>>> let's check it only when delta < 0, it will be good for xfs's performance.
>>>>>
>>>>> Hmmm. I don't recall this as being expensive.
>>>>>
>>>>
>>>> Sorry about the misunderstanding information in commit message.
>>>>
>>>>> How did you find this? Can you please always document how you found
>>>>
>>>> If user creates million of files and the delete them, We found that the
>>>> __percpu_counter_compare costed 5.78% CPU usage, you are right that itself
>>>> is not expensive, but it calls __percpu_counter_sum which will use
>>>> spin_lock and read other cpu's count. perf record -g is used to profile it:
>>>>
>>>> - 5.88% 0.02% rm [kernel.vmlinux] [k] xfs_mod_ifree
>>>> - 5.86% xfs_mod_ifree
>>>> - 5.78% __percpu_counter_compare
>>>> 5.61% __percpu_counter_sum
>>>
>>> Interesting. Your workload is hitting the slow path, which I most
>>> certainly do no see when creating lots of files. What's your
>>> workload?
>>>
>>
>> The hardware has 128 cpu cores, and the xfs filesystem format config is default,
>> while the test is a single thread, as follow:
>> ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2
>
> What version and where do I get it?

You can get mdtest from GitHub: https://github.com/LLNL/mdtest.

>
> Hmmm - isn't mdtest a MPI benchmark intended for highly concurrent
> metadata workload testing? How representative is it of your actual
> production workload? Is that single threaded?
>

We just use mdtest to test the performance of a file system; it is not
representative of the actual workload, and it is single-threaded. But we
also find that it goes into the slow path when we remove a directory with
many files. The command is below:
rm -rf xxx

>> xfs info:
>> meta-data=/dev/bcache2 isize=512 agcount=4, agsize=244188661 blks
>
> only 4 AGs, which explains the lack of free inodes - there isn't
> enough concurrency in the filesystem layout to push the free inode
> count in all AGs beyond the batchsize * num_online_cpus().
>
> i.e. single threaded workloads typically drain the free inode count
> all the way down to zero before new inodes are allocated. Workloads
> that are highly concurrent allocate from lots of AGs at once,
> leaving free inodes in every AG that is not current being actively
> allocated out of.
>
> As a test, can you remake that test filesystem with "-d agcount=32"
> and see if the overhead you are seeing disappears?
>

We tried remaking the filesystem with "-d agcount=32" and it still enters the
slow path most of the time. Printing batch * num_online_cpus() shows that it
is 32768, because percpu_counter_batch is initialized to 256 when there are
128 CPU cores. We then changed to agcount=1024, and it still goes into the
slow path frequently, because there are usually fewer than 32768 free inodes.
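[Editor's note] The threshold arithmetic here can be sketched as follows (a rough model of the kernel's compute_batch_value() in lib/percpu_counter.c, assuming the batch is max(32, 2 * nr_cpus); batch_for and slow_path_threshold are hypothetical names):

```c
#include <assert.h>

/*
 * Rough model of the kernel's compute_batch_value(): the global
 * percpu_counter batch is max(32, 2 * nr_online_cpus), and
 * __percpu_counter_compare() falls back to an exact per-CPU sum
 * whenever the counter is within batch * nr_cpus of the comparison
 * value. The threshold therefore grows with the square of the CPU
 * count.
 */
static int batch_for(int nr_cpus)
{
    int b = 2 * nr_cpus;
    return b > 32 ? b : 32;
}

static long long slow_path_threshold(int nr_cpus)
{
    return (long long)batch_for(nr_cpus) * nr_cpus;
}
```

On the 128-core machine in this thread the threshold is 256 * 128 = 32768 free inodes, which explains why no agcount tuning keeps the ifree comparison on the fast path; on a 16p machine it is only 32 * 16 = 512.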

>>> files and you have lots of idle CPU and hence the inode allocation
>>> is not clearing the fast path batch threshold on the ifree counter.
>>> And because you have lots of CPUs, the cost of a sum is very
>>> expensive compared to running single threaded creates. That's my
>>> current hypothesis based what I see on my workloads that
>>> xfs_mod_ifree overhead goes down as concurrency goes up....
>>>
>>
>> Agree, we add some debug info in xfs_mod_ifree and found most times
>> m_ifree.count < batch * num_online_cpus(), because we have 128 online
>> cpus and m_ifree.count around 999.
>
> Ok, the threshold is 32 * 128 = ~4000 to get out of the slow
> path. 32 AGs may well push the count over this threshold, so it's
> definitely worth trying....
>

Yes, we tried it and found that the threshold was 32768, because
percpu_counter_batch is initialized to 2 * num_online_cpus().

>>> FWIW, the profiles I took came from running this on 16 and 32p
>>> machines:
>>>
>>> --
>>> dirs=""
>>> for i in `seq 1 $THREADS`; do
>>> dirs="$dirs -d /mnt/scratch/$i"
>>> done
>>>
>>> cycles=$((512 / $THREADS))
>>>
>>> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
>>> --
>>>
>>> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
>>> image:
>>>
>>> meta-data=/dev/vdc isize=512 agcount=500, agsize=268435455 blks
>>> = sectsz=512 attr=2, projid32bit=1
>>> = crc=1 finobt=1, sparse=1, rmapbt=0
>>> = reflink=1
>>> data = bsize=4096 blocks=134217727500, imaxpct=1
>>> = sunit=0 swidth=0 blks
>>> naming =version 2 bsize=4096 ascii-ci=0, ftype=1
>>> log =internal log bsize=4096 blocks=521728, version=2
>>> = sectsz=512 sunit=0 blks, lazy-count=1
>>> realtime =none extsz=4096 blocks=0, rtextents=0
>>>
>>> That's allocating enough inodes to keep the free inode counter
>>> entirely out of the slow path...
>>
>> percpu_counter_read that reads the count will cause cache synchronization
>> cost if other cpu changes the count, Maybe it's better not to call
>> percpu_counter_compare if possible.
>
> Depends. Sometimes we trade off ultimate single threaded
> performance and efficiency for substantially better scalability.
> i.e. if we lose 5% on single threaded performance but gain 10x on
> concurrent workloads, then that is a good tradeoff to make.
>

Agree. What I mean is that when delta > 0, there is no need to call
percpu_counter_compare in xfs_mod_ifree/icount.

Thanks,
Shaokun

> Cheers,
>
> Dave.
>

2019-11-15 09:20:03

by Zhangshaokun

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

Hi Dave,

With "-d agcount=32", it still enters the slow path frequently when
there are 128 CPU cores - any thoughts on this issue?
Can we remove the debug check entirely, as Christoph suggested?

Thanks,
Shaokun

On 2019/11/8 13:58, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/7 5:20, Dave Chinner wrote:
>> On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
>>> Hi Dave,
>>>
>>> On 2019/11/5 12:03, Dave Chinner wrote:
>>>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>>>>> Hi Dave,
>>>>>
>>>>> On 2019/11/5 4:49, Dave Chinner wrote:
>>>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>>>>> From: Yang Guo <[email protected]>
>>>>>>>
>>>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>>>>> whether the counter less than 0 and it is a expensive function.
>>>>>>> let's check it only when delta < 0, it will be good for xfs's performance.
>>>>>>
>>>>>> Hmmm. I don't recall this as being expensive.
>>>>>>
>>>>>
>>>>> Sorry about the misunderstanding information in commit message.
>>>>>
>>>>>> How did you find this? Can you please always document how you found
>>>>>
>>>>> If user creates million of files and the delete them, We found that the
>>>>> __percpu_counter_compare costed 5.78% CPU usage, you are right that itself
>>>>> is not expensive, but it calls __percpu_counter_sum which will use
>>>>> spin_lock and read other cpu's count. perf record -g is used to profile it:
>>>>>
>>>>> - 5.88% 0.02% rm [kernel.vmlinux] [k] xfs_mod_ifree
>>>>> - 5.86% xfs_mod_ifree
>>>>> - 5.78% __percpu_counter_compare
>>>>> 5.61% __percpu_counter_sum
>>>>
>>>> Interesting. Your workload is hitting the slow path, which I most
>>>> certainly do no see when creating lots of files. What's your
>>>> workload?
>>>>
>>>
>>> The hardware has 128 cpu cores, and the xfs filesystem format config is default,
>>> while the test is a single thread, as follow:
>>> ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2
>>
>> What version and where do I get it?
>
> You can get the mdtest from github: https://github.com/LLNL/mdtest.
>
>>
>> Hmmm - isn't mdtest a MPI benchmark intended for highly concurrent
>> metadata workload testing? How representative is it of your actual
>> production workload? Is that single threaded?
>>
>
> We just use mdtest to test the performance of a file system, it can't representative
> the actual workload and it's single threaded. But we also find that it goes to slow
> path when we remove a dir with many files. The cmd is below:
> rm -rf xxx.
>
>>> xfs info:
>>> meta-data=/dev/bcache2 isize=512 agcount=4, agsize=244188661 blks
>>
>> only 4 AGs, which explains the lack of free inodes - there isn't
>> enough concurrency in the filesystem layout to push the free inode
>> count in all AGs beyond the batchsize * num_online_cpus().
>>
>> i.e. single threaded workloads typically drain the free inode count
>> all the way down to zero before new inodes are allocated. Workloads
>> that are highly concurrent allocate from lots of AGs at once,
>> leaving free inodes in every AG that is not current being actively
>> allocated out of.
>>
>> As a test, can you remake that test filesystem with "-d agcount=32"
>> and see if the overhead you are seeing disappears?
>>
>
> We try to remake the filesystem with "-d agcount=32" and it also enters slow path
> mostly. Print the batch * num_online_cpus() and find that it's 32768.
> Because percpu_counter_batch was initialized to 256 when there are 128 cpu cores.
> Then we change the agcount=1024, and it also goes to slow path frequently because
> mostly there are no 32768 free inodes.
>
>>>> files and you have lots of idle CPU and hence the inode allocation
>>>> is not clearing the fast path batch threshold on the ifree counter.
>>>> And because you have lots of CPUs, the cost of a sum is very
>>>> expensive compared to running single threaded creates. That's my
>>>> current hypothesis based what I see on my workloads that
>>>> xfs_mod_ifree overhead goes down as concurrency goes up....
>>>>
>>>
>>> Agree, we add some debug info in xfs_mod_ifree and found most times
>>> m_ifree.count < batch * num_online_cpus(), because we have 128 online
>>> cpus and m_ifree.count around 999.
>>
>> Ok, the threshold is 32 * 128 = ~4000 to get out of the slow
>> path. 32 AGs may well push the count over this threshold, so it's
>> definitely worth trying....
>>
>
> Yes, we tried it and found that threshold was 32768, because percpu_counter_batch
> was initialized to 2 * num_online_cpus().
>
>>>> FWIW, the profiles I took came from running this on 16 and 32p
>>>> machines:
>>>>
>>>> --
>>>> dirs=""
>>>> for i in `seq 1 $THREADS`; do
>>>> dirs="$dirs -d /mnt/scratch/$i"
>>>> done
>>>>
>>>> cycles=$((512 / $THREADS))
>>>>
>>>> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
>>>> --
>>>>
>>>> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
>>>> image:
>>>>
>>>> meta-data=/dev/vdc isize=512 agcount=500, agsize=268435455 blks
>>>> = sectsz=512 attr=2, projid32bit=1
>>>> = crc=1 finobt=1, sparse=1, rmapbt=0
>>>> = reflink=1
>>>> data = bsize=4096 blocks=134217727500, imaxpct=1
>>>> = sunit=0 swidth=0 blks
>>>> naming =version 2 bsize=4096 ascii-ci=0, ftype=1
>>>> log =internal log bsize=4096 blocks=521728, version=2
>>>> = sectsz=512 sunit=0 blks, lazy-count=1
>>>> realtime =none extsz=4096 blocks=0, rtextents=0
>>>>
>>>> That's allocating enough inodes to keep the free inode counter
>>>> entirely out of the slow path...
>>>
>>> percpu_counter_read that reads the count will cause cache synchronization
>>> cost if other cpu changes the count, Maybe it's better not to call
>>> percpu_counter_compare if possible.
>>
>> Depends. Sometimes we trade off ultimate single threaded
>> performance and efficiency for substantially better scalability.
>> i.e. if we lose 5% on single threaded performance but gain 10x on
>> concurrent workloads, then that is a good tradeoff to make.
>>
>
> Agree, I mean that when delta > 0, there is no need to call percpu_counter_compare in
> xfs_mod_ifree/icount.
>
> Thanks,
> Shaokun
>
>> Cheers,
>>
>> Dave.
>>

2019-11-18 08:16:02

by Dave Chinner

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

On Fri, Nov 08, 2019 at 01:58:56PM +0800, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/7 5:20, Dave Chinner wrote:
> > On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
> >> Hi Dave,
> >>
> >> On 2019/11/5 12:03, Dave Chinner wrote:
> >>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> >>>> Hi Dave,
> >>>>
> >>>> On 2019/11/5 4:49, Dave Chinner wrote:
> >>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >>>>>> From: Yang Guo <[email protected]>
> >>>>>>
> >>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >>>>>> whether the counter less than 0 and it is a expensive function.
> >>>>>> let's check it only when delta < 0, it will be good for xfs's performance.
> >>>>>
> >>>>> Hmmm. I don't recall this as being expensive.
> >>>>>
> >>>>
> >>>> Sorry about the misunderstanding information in commit message.
> >>>>
> >>>>> How did you find this? Can you please always document how you found
> >>>>
> >>>> If user creates million of files and the delete them, We found that the
> >>>> __percpu_counter_compare costed 5.78% CPU usage, you are right that itself
> >>>> is not expensive, but it calls __percpu_counter_sum which will use
> >>>> spin_lock and read other cpu's count. perf record -g is used to profile it:
> >>>>
> >>>> - 5.88% 0.02% rm [kernel.vmlinux] [k] xfs_mod_ifree
> >>>> - 5.86% xfs_mod_ifree
> >>>> - 5.78% __percpu_counter_compare
> >>>> 5.61% __percpu_counter_sum
> >>>
> >>> Interesting. Your workload is hitting the slow path, which I most
> >>> certainly do no see when creating lots of files. What's your
> >>> workload?
> >>>
> >>
> >> The hardware has 128 cpu cores, and the xfs filesystem format config is default,
> >> while the test is a single thread, as follow:
> >> ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2
> >
> > What version and where do I get it?
>
> You can get the mdtest from github: https://github.com/LLNL/mdtest.

Oh what a pain.

$ MPI_CC=mpicc make
mpicc -DLinux -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D__USE_LARGEFILE64=1 -g -o mdtest mdtest.c -lm
mdtest.c:32:10: fatal error: mpi.h: No such file or directory
32 | #include "mpi.h"
| ^~~~~~~
compilation terminated.
make: *** [Makefile:59: mdtest] Error 1

It doesn't build with a modern linux distro openmpi installation.

/me looks at the readme.

Oh, it's an unmaintained historical archive.

Hmmm, it got forked to https://github.com/MDTEST-LANL/mdtest.

That's an unmaintained archive, too.

Oh, there's a version at https://github.com/hpc/ior

Ngggh.

$./configure
....
checking for mpicc... mpicc
checking for gcc... (cached) mpicc
checking whether the C compiler works... no
configure: error: in `/home/dave/src/ior':
configure: error: C compiler cannot create executables
See `config.log' for more details
$
$ less config.log
....
configure:3805: checking whether the C compiler works
configure:3827: mpicc conftest.c >&5
/usr/bin/ld: cannot find -lopen-rte
/usr/bin/ld: cannot find -lopen-pal
/usr/bin/ld: cannot find -lhwloc
/usr/bin/ld: cannot find -levent
/usr/bin/ld: cannot find -levent_pthreads
collect2: error: ld returned 1 exit status
configure:3831: $? = 1
.....

So, the mpicc compiler wrapper uses a bunch of libraries that don't
get installed with the compiler wrapper.

Yay?

> > Hmmm - isn't mdtest a MPI benchmark intended for highly concurrent
> > metadata workload testing? How representative is it of your actual
> > production workload? Is that single threaded?
> >
>
> We just use mdtest to test the performance of a file system, it can't representative
> the actual workload and it's single threaded. But we also find that it goes to slow
> path when we remove a dir with many files. The cmd is below:
> rm -rf xxx.

The benchmark doesn't reproduce that. It will only occur when you do
a sequential inode remove, so you have no free inodes in the filesystem
and every set of 64 inodes that is removed lands in the same chunk and
so is immediately freed, leaving zero free inodes in the filesystem.

i.e. this is a result of sequential inode create, which typically
implies "empty filesystem". Aged filesystems typically don't behave
like this - they have free inodes distributed all through the inode
chunks on disk. And if you do random remove, you won't see it for
the same reason, either - the remove phase of mdtest doesn't show
any CPU usage in percpu_counter_sum() at all.

Please understand that I'm not saying that it isn't a problem,
that it doesn't happen, or that we don't need to change the
behaviour. What I'm trying to do is understand the
conditions in which this problem occurs and whether they are real
world or just micro-benchmark related problems...

> Because percpu_counter_batch was initialized to 256 when there are 128 cpu cores.
> Then we change the agcount=1024, and it also goes to slow path frequently because
> mostly there are no 32768 free inodes.

Hmmm. I had forgotten the batch size increased with CPU count like
that - I had the thought it was log(ncpus), not linear(ncpus).

That kinda points out that scaling the batch size by CPU count is
somewhat silly, as concurrency on the counter is not defined by the
number of CPUs in the system. Inode counter concurrency is defined
by the number of inode allocation/free operations that can be
performed at any given time.

As such, having a counter threshold of 256 * num_cpus doesn't make a
whole lot of sense for a 4 AG filesystem because you can't have 128
CPUs all banging on it at once. And even if you have hundreds of
AGs, the CPUs are going to be somewhat serialised through the
transaction commit path by the CIL locks that are taken to insert
objects into the CIL.

Hence the unbound concurrency of transaction commit is going to be
soaked up by the CIL list insert spin lock, and it's mostly going to be
just a dribble of CPUs at a time through the transaction commit
accounting code.

Ok, so maybe we just need a small batch size here, like a value of 4
or 8 just to avoid every inode alloc/free transaction having to pick
up a global spin lock every time...

Let me do some testing here...

> >> percpu_counter_read that reads the count will cause cache synchronization
> >> cost if other cpu changes the count, Maybe it's better not to call
> >> percpu_counter_compare if possible.
> >
> > Depends. Sometimes we trade off ultimate single threaded
> > performance and efficiency for substantially better scalability.
> > i.e. if we lose 5% on single threaded performance but gain 10x on
> > concurrent workloads, then that is a good tradeoff to make.
> >
>
> Agree, I mean that when delta > 0, there is no need to call percpu_counter_compare in
> xfs_mod_ifree/icount.

It also doesn't totally avoid the issue, either, because of the
way we dynamically alloc/free inode clusters and so the counters go
both up and down during continuous allocation or continuous
freeing.

i.e. The pattern we see on inode allocation is:

icount += 64
ifree += 64
ifree--
....
ifree--
icount += 64
ifree += 64
ifree--
....
ifree--

And on inode freeing, we see the opposite:

ifree++
....
ifree++
icount -= 64
ifree -= 64
ifree++
....
ifree++
icount -= 64
ifree -= 64

So the values of ifree will still decrement during both free and
allocation operations, hence it doesn't avoid the issue totally.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-11-20 21:10:06

by Dave Chinner

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

On Mon, Nov 18, 2019 at 07:12:12PM +1100, Dave Chinner wrote:
> On Fri, Nov 08, 2019 at 01:58:56PM +0800, Shaokun Zhang wrote:
> > On 2019/11/7 5:20, Dave Chinner wrote:
> > Because percpu_counter_batch was initialized to 256 when there are 128 cpu cores.
> > Then we change the agcount=1024, and it also goes to slow path frequently because
> > mostly there are no 32768 free inodes.
>
> Hmmm. I had forgotten the batch size increased with CPU count like
> that - I had the thought it was log(ncpus), not linear(ncpus).

[.....]

> Ok, so maybe we just need a small batch size here, like a value of 4
> or 8 just to avoid every inode alloc/free transaction having to pick
> up a global spin lock every time...
>
> Let me do some testing here...

[....]

> > Agree, I mean that when delta > 0, there is no need to call percpu_counter_compare in
> > xfs_mod_ifree/icount.

Ok, so the percpu_counter_sum() only shows up during a create
workload here, at ~1.5% of the CPU used. only doing the check when
delta < 0 makes no difference to that value. CPU usage of
percpu_counter_sum is noise for all other parts of the workload,
including the "remove files" part of the benchmark.

Hence it is this pattern:

> i.e. The pattern we see on inode allocation is:
>
> icount += 64
> ifree += 64
> ifree--
> ....
> ifree--
> icount += 64
> ifree += 64
> ifree--
> ....
> ifree--

That is causing the compare CPU usage - all the single decrements
are triggering it because we are operating on a new filesystem that
has no free inodes other than the cluster we just allocated. Hence
avoiding doing the value compare when delta > 0 makes no difference
to CPU usage because most of the modifications are subtraction.

> And on inode freeing, we see the opposite:
>
> ifree++
> ....
> ifree++
> icount -= 64
> ifree -= 64
> ifree++
> ....
> ifree++
> icount -= 64
> ifree -= 64

For freeing, we aren't always freeing from the same inode cluster,
so the ifree count actually goes up quite a bit before it starts
going down as we free clusters. Hence if we keep the batch size
small here, we should mostly stay out of the compare path, and the
"no compare when delta > 0" will also make a substantial difference
here if the ifree count is low.

So, I reduced the batch size to 8, and CPU usage during creates
dropped from 1.5% to 0.6% on a 16p machine. That's a substantial
reduction - if this translates to larger machines, that should bring
CPU usage down under 2%....

I was going to send this patch for you to test, but I left this
email sitting unsent overnight and I thought about it some more.

The reality is, as Christoph said, this ends up being just debug
code because we silently ignore the underflow error on production
kernels. Hence I think the right thing to do is gut the transaction
superblock accounting code so that we just need simple asserts and don't
bother trying to behave like it's an error we can actually handle
sanely. This removes the compare from production kernels completely,
so you should see this all go away with the patch below.

The difference in performance on my 16p machine is not significant -
it is within the normal run-to-run variability of the mdtest
benchmark, but the counter compare overhead is gone from the
profiles.

Cheers,

Dave.
--
Dave Chinner
[email protected]

xfs: gut error handling in xfs_trans_unreserve_and_mod_sb()

From: Dave Chinner <[email protected]>

The error handling is only used to fire an assert on debug
kernels, so let's get rid of all the complexity and expensive
stuff that is done to determine if an assert will fire.

Rolling back the changes in the transaction if only one counter
underruns them makes all the other counters incorrect, because we
still have made that change and are committing the transaction.
Hence we can remove all the unwinding, too.

And xfs_mod_icount/xfs_mod_ifree are only called from
xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
directly call the percpu_counter_add/percpu_counter_compare
functions.

Difference in binary size for a production kernel:

Before:
text data bss dec hex filename
9486 184 8 9678 25ce fs/xfs/xfs_trans.o
10858 89 12 10959 2acf fs/xfs/xfs_mount.o

After:
text data bss dec hex filename
8462 184 8 8654 21ce fs/xfs/xfs_trans.o
10510 89 12 10611 2973 fs/xfs/xfs_mount.o

So not only does it chop out a lot of source code, it also results
in a binary size reduction of ~1.3kB in a very frequently used
fast path of the filesystem.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_mount.c | 33 ------------
fs/xfs/xfs_mount.h | 2 -
fs/xfs/xfs_trans.c | 153 +++++++++++++----------------------------------------
3 files changed, 37 insertions(+), 151 deletions(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index ba5b6f3b2b88..c59d8f589eb9 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1163,39 +1163,6 @@ xfs_log_sbcount(xfs_mount_t *mp)
return xfs_sync_sb(mp, true);
}

-/*
- * Deltas for the inode count are +/-64, hence we use a large batch size
- * of 128 so we don't need to take the counter lock on every update.
- */
-#define XFS_ICOUNT_BATCH 128
-int
-xfs_mod_icount(
- struct xfs_mount *mp,
- int64_t delta)
-{
- percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
- if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
- ASSERT(0);
- percpu_counter_add(&mp->m_icount, -delta);
- return -EINVAL;
- }
- return 0;
-}
-
-int
-xfs_mod_ifree(
- struct xfs_mount *mp,
- int64_t delta)
-{
- percpu_counter_add(&mp->m_ifree, delta);
- if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
- ASSERT(0);
- percpu_counter_add(&mp->m_ifree, -delta);
- return -EINVAL;
- }
- return 0;
-}
-
/*
* Deltas for the block count can vary from 1 to very large, but lock contention
* only occurs on frequent small block count updates such as in the delayed
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index fdb60e09a9c5..0324412238ba 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -438,8 +438,6 @@ extern int xfs_initialize_perag(xfs_mount_t *mp, xfs_agnumber_t agcount,
xfs_agnumber_t *maxagi);
extern void xfs_unmountfs(xfs_mount_t *);

-extern int xfs_mod_icount(struct xfs_mount *mp, int64_t delta);
-extern int xfs_mod_ifree(struct xfs_mount *mp, int64_t delta);
extern int xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
bool reserved);
extern int xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index f4795fdb7389..7632868bdc92 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -527,57 +527,51 @@ xfs_trans_apply_sb_deltas(
sizeof(sbp->sb_frextents) - 1);
}

-STATIC int
+static void
xfs_sb_mod8(
uint8_t *field,
int8_t delta)
{
int8_t counter = *field;

+ if (!delta)
+ return;
counter += delta;
- if (counter < 0) {
- ASSERT(0);
- return -EINVAL;
- }
+ ASSERT(counter >= 0);
*field = counter;
- return 0;
}

-STATIC int
+static void
xfs_sb_mod32(
uint32_t *field,
int32_t delta)
{
int32_t counter = *field;

+ if (!delta)
+ return;
counter += delta;
- if (counter < 0) {
- ASSERT(0);
- return -EINVAL;
- }
+ ASSERT(counter >= 0);
*field = counter;
- return 0;
}

-STATIC int
+static void
xfs_sb_mod64(
uint64_t *field,
int64_t delta)
{
int64_t counter = *field;

+ if (!delta)
+ return;
counter += delta;
- if (counter < 0) {
- ASSERT(0);
- return -EINVAL;
- }
+ ASSERT(counter >= 0);
*field = counter;
- return 0;
}

/*
- * xfs_trans_unreserve_and_mod_sb() is called to release unused reservations
- * and apply superblock counter changes to the in-core superblock. The
+ * xfs_trans_unreserve_and_mod_sb() is called to release unused reservations and
+ * apply superblock counter changes to the in-core superblock. The
* t_res_fdblocks_delta and t_res_frextents_delta fields are explicitly NOT
* applied to the in-core superblock. The idea is that that has already been
* done.
@@ -586,7 +580,12 @@ xfs_sb_mod64(
* used block counts are not updated in the on disk superblock. In this case,
* XFS_TRANS_SB_DIRTY will not be set when the transaction is updated but we
* still need to update the incore superblock with the changes.
+ *
+ * Deltas for the inode count are +/-64, hence we use a large batch size of 128
+ * so we don't need to take the counter lock on every update.
*/
+#define XFS_ICOUNT_BATCH 128
+
void
xfs_trans_unreserve_and_mod_sb(
struct xfs_trans *tp)
@@ -622,20 +621,21 @@ xfs_trans_unreserve_and_mod_sb(
/* apply the per-cpu counters */
if (blkdelta) {
error = xfs_mod_fdblocks(mp, blkdelta, rsvd);
- if (error)
- goto out;
+ ASSERT(!error);
}

if (idelta) {
- error = xfs_mod_icount(mp, idelta);
- if (error)
- goto out_undo_fdblocks;
+ percpu_counter_add_batch(&mp->m_icount, idelta,
+ XFS_ICOUNT_BATCH);
+ if (idelta < 0)
+ ASSERT(__percpu_counter_compare(&mp->m_icount, 0,
+ XFS_ICOUNT_BATCH) >= 0);
}

if (ifreedelta) {
- error = xfs_mod_ifree(mp, ifreedelta);
- if (error)
- goto out_undo_icount;
+ percpu_counter_add(&mp->m_ifree, ifreedelta);
+ if (ifreedelta < 0)
+ ASSERT(percpu_counter_compare(&mp->m_ifree, 0) >= 0);
}

if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
@@ -643,95 +643,16 @@ xfs_trans_unreserve_and_mod_sb(

/* apply remaining deltas */
spin_lock(&mp->m_sb_lock);
- if (rtxdelta) {
- error = xfs_sb_mod64(&mp->m_sb.sb_frextents, rtxdelta);
- if (error)
- goto out_undo_ifree;
- }
-
- if (tp->t_dblocks_delta != 0) {
- error = xfs_sb_mod64(&mp->m_sb.sb_dblocks, tp->t_dblocks_delta);
- if (error)
- goto out_undo_frextents;
- }
- if (tp->t_agcount_delta != 0) {
- error = xfs_sb_mod32(&mp->m_sb.sb_agcount, tp->t_agcount_delta);
- if (error)
- goto out_undo_dblocks;
- }
- if (tp->t_imaxpct_delta != 0) {
- error = xfs_sb_mod8(&mp->m_sb.sb_imax_pct, tp->t_imaxpct_delta);
- if (error)
- goto out_undo_agcount;
- }
- if (tp->t_rextsize_delta != 0) {
- error = xfs_sb_mod32(&mp->m_sb.sb_rextsize,
- tp->t_rextsize_delta);
- if (error)
- goto out_undo_imaxpct;
- }
- if (tp->t_rbmblocks_delta != 0) {
- error = xfs_sb_mod32(&mp->m_sb.sb_rbmblocks,
- tp->t_rbmblocks_delta);
- if (error)
- goto out_undo_rextsize;
- }
- if (tp->t_rblocks_delta != 0) {
- error = xfs_sb_mod64(&mp->m_sb.sb_rblocks, tp->t_rblocks_delta);
- if (error)
- goto out_undo_rbmblocks;
- }
- if (tp->t_rextents_delta != 0) {
- error = xfs_sb_mod64(&mp->m_sb.sb_rextents,
- tp->t_rextents_delta);
- if (error)
- goto out_undo_rblocks;
- }
- if (tp->t_rextslog_delta != 0) {
- error = xfs_sb_mod8(&mp->m_sb.sb_rextslog,
- tp->t_rextslog_delta);
- if (error)
- goto out_undo_rextents;
- }
- spin_unlock(&mp->m_sb_lock);
- return;
-
-out_undo_rextents:
- if (tp->t_rextents_delta)
- xfs_sb_mod64(&mp->m_sb.sb_rextents, -tp->t_rextents_delta);
-out_undo_rblocks:
- if (tp->t_rblocks_delta)
- xfs_sb_mod64(&mp->m_sb.sb_rblocks, -tp->t_rblocks_delta);
-out_undo_rbmblocks:
- if (tp->t_rbmblocks_delta)
- xfs_sb_mod32(&mp->m_sb.sb_rbmblocks, -tp->t_rbmblocks_delta);
-out_undo_rextsize:
- if (tp->t_rextsize_delta)
- xfs_sb_mod32(&mp->m_sb.sb_rextsize, -tp->t_rextsize_delta);
-out_undo_imaxpct:
- if (tp->t_rextsize_delta)
- xfs_sb_mod8(&mp->m_sb.sb_imax_pct, -tp->t_imaxpct_delta);
-out_undo_agcount:
- if (tp->t_agcount_delta)
- xfs_sb_mod32(&mp->m_sb.sb_agcount, -tp->t_agcount_delta);
-out_undo_dblocks:
- if (tp->t_dblocks_delta)
- xfs_sb_mod64(&mp->m_sb.sb_dblocks, -tp->t_dblocks_delta);
-out_undo_frextents:
- if (rtxdelta)
- xfs_sb_mod64(&mp->m_sb.sb_frextents, -rtxdelta);
-out_undo_ifree:
+ xfs_sb_mod64(&mp->m_sb.sb_frextents, rtxdelta);
+ xfs_sb_mod64(&mp->m_sb.sb_dblocks, tp->t_dblocks_delta);
+ xfs_sb_mod32(&mp->m_sb.sb_agcount, tp->t_agcount_delta);
+ xfs_sb_mod8(&mp->m_sb.sb_imax_pct, tp->t_imaxpct_delta);
+ xfs_sb_mod32(&mp->m_sb.sb_rextsize, tp->t_rextsize_delta);
+ xfs_sb_mod32(&mp->m_sb.sb_rbmblocks, tp->t_rbmblocks_delta);
+ xfs_sb_mod64(&mp->m_sb.sb_rblocks, tp->t_rblocks_delta);
+ xfs_sb_mod64(&mp->m_sb.sb_rextents, tp->t_rextents_delta);
+ xfs_sb_mod8(&mp->m_sb.sb_rextslog, tp->t_rextslog_delta);
spin_unlock(&mp->m_sb_lock);
- if (ifreedelta)
- xfs_mod_ifree(mp, -ifreedelta);
-out_undo_icount:
- if (idelta)
- xfs_mod_icount(mp, -idelta);
-out_undo_fdblocks:
- if (blkdelta)
- xfs_mod_fdblocks(mp, -blkdelta, rsvd);
-out:
- ASSERT(error == 0);
return;
}


2019-11-21 07:00:19

by Christoph Hellwig

Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

> +static void
> xfs_sb_mod8(
> uint8_t *field,
> int8_t delta)
> {
> int8_t counter = *field;
>
> + if (!delta)
> + return;
> counter += delta;
> + ASSERT(counter >= 0);
> *field = counter;
> }

I'd almost find it easier to keep the != 0 check in the caller, and
in fact also move the assert there and just kill these helpers, e.g.

if (tp->t_imaxpct_delta != 0) {
mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;
ASSERT(mp->m_sb.sb_imax_pct >= 0);
}
if (tp->t_rextsize_delta != 0) {
mp->m_sb.sb_rextsize += tp->t_rextsize_delta;
ASSERT(mp->m_sb.sb_rextsize >= 0);
}

While this is 2 more lines per counter it is a lot more explicit and
easier to follow.

> if (idelta) {
> - error = xfs_mod_icount(mp, idelta);
> - if (error)
> - goto out_undo_fdblocks;
> + percpu_counter_add_batch(&mp->m_icount, idelta,
> + XFS_ICOUNT_BATCH);
> + if (idelta < 0)
> + ASSERT(__percpu_counter_compare(&mp->m_icount, 0,
> + XFS_ICOUNT_BATCH) >= 0)

And here I wonder if keeping single use helpers wouldn't have been a
little nicer. Not that it really matters.