2020-08-27 10:23:40

by Zhangshaokun

Subject: [PATCH] fs: Optimized fget to improve performance

From: Yuqi Jin <[email protected]>

It is well known that the performance of atomic_add is better than that of
atomic_cmpxchg.
The initial value of @f_count is 1. While @f_count is being incremented
by 1 in __fget_files, its value may be in one of three states: > 0, < 0,
or = 0. When the fixed value 0 is used as the condition for refusing the
increment, only atomic_cmpxchg can be used; when < 0 is used as the
refusal condition instead, atomic_add can be used, which performs better.

We tested the syscall benchmark from UnixBench on Huawei Kunpeng 920
(arm64) and measured a 132% performance boost.

with this patch and the patch [1]
System Call Overhead 9516926.2 lps (10.0 s, 1 samples)

System Benchmarks Partial Index BASELINE RESULT INDEX
System Call Overhead 15000.0 9516926.2 6344.6
========
System Benchmarks Index Score (Partial Only) 6344.6

with this patch and without the patch [1]
System Call Overhead 5290449.3 lps (10.0 s, 1 samples)

System Benchmarks Partial Index BASELINE RESULT INDEX
System Call Overhead 15000.0 5290449.3 3527.0
========
System Benchmarks Index Score (Partial Only) 3527.0

without any patch
System Call Overhead 4102310.5 lps (10.0 s, 1 samples)

System Benchmarks Partial Index BASELINE RESULT INDEX
System Call Overhead 15000.0 4102310.5 2734.9
========
System Benchmarks Index Score (Partial Only) 2734.9

[1] https://lkml.org/lkml/2020/6/24/283

Cc: kernel test robot <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Yuqi Jin <[email protected]>
Signed-off-by: Shaokun Zhang <[email protected]>
---
Hi Rong,

Could you please test this patch on your platform, both on its own and
together with [1]? [1] has already been tested there [2].

[2] https://lkml.org/lkml/2020/7/8/227

include/linux/fs.h | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e019ea2f1347..2a9c2a30dc58 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -972,8 +972,19 @@ static inline struct file *get_file(struct file *f)
atomic_long_inc(&f->f_count);
return f;
}
+
+static inline bool get_file_unless_negative(atomic_long_t *v, long a)
+{
+ long c = atomic_long_read(v);
+
+ if (c <= 0)
+ return 0;
+
+ return atomic_long_add_return(a, v) - 1;
+}
+
#define get_file_rcu_many(x, cnt) \
- atomic_long_add_unless(&(x)->f_count, (cnt), 0)
+ get_file_unless_negative(&(x)->f_count, (cnt))
#define get_file_rcu(x) get_file_rcu_many((x), 1)
#define file_count(x) atomic_long_read(&(x)->f_count)

--
2.7.4


2020-08-27 12:50:06

by Matthew Wilcox

Subject: Re: [PATCH] fs: Optimized fget to improve performance

On Thu, Aug 27, 2020 at 06:19:44PM +0800, Shaokun Zhang wrote:
> From: Yuqi Jin <[email protected]>
>
> It is well known that the performance of atomic_add is better than that of
> atomic_cmpxchg.

I don't think that's well-known at all.

> +static inline bool get_file_unless_negative(atomic_long_t *v, long a)
> +{
> + long c = atomic_long_read(v);
> +
> + if (c <= 0)
> + return 0;
> +
> + return atomic_long_add_return(a, v) - 1;
> +}
> +
> #define get_file_rcu_many(x, cnt) \
> - atomic_long_add_unless(&(x)->f_count, (cnt), 0)
> + get_file_unless_negative(&(x)->f_count, (cnt))
> #define get_file_rcu(x) get_file_rcu_many((x), 1)
> #define file_count(x) atomic_long_read(&(x)->f_count)

I think you should be proposing a patch to fix atomic_long_add_unless()
on arm64 instead.

2020-08-27 14:35:03

by Al Viro

Subject: [NAK] Re: [PATCH] fs: Optimized fget to improve performance

On Thu, Aug 27, 2020 at 06:19:44PM +0800, Shaokun Zhang wrote:
> From: Yuqi Jin <[email protected]>
>
> It is well known that the performance of atomic_add is better than that of
> atomic_cmpxchg.
> The initial value of @f_count is 1. While @f_count is being incremented
> by 1 in __fget_files, its value may be in one of three states: > 0, < 0,
> or = 0. When the fixed value 0 is used as the condition for refusing the
> increment, only atomic_cmpxchg can be used; when < 0 is used as the
> refusal condition instead, atomic_add can be used, which performs better.

Suppose another thread has just removed it from the descriptor table.

> +static inline bool get_file_unless_negative(atomic_long_t *v, long a)
> +{
> + long c = atomic_long_read(v);
> +
> + if (c <= 0)
> + return 0;

Still 1. Now the other thread has gotten to dropping the last reference,
decremented counter to zero and committed to freeing the struct file.

> +
> + return atomic_long_add_return(a, v) - 1;

... and you increment that sucker back to 1. Sure, you return 0, so the
caller does nothing to that struct file. Which includes undoing the
changes to its refcount.

In the meanwhile, the third thread does fget on the same descriptor,
and there we end up bumping the refcount to 2 and succeeding. Which
leaves the caller with reference to already doomed struct file...

IOW, NAK - this is completely broken. The whole point of
atomic_long_add_unless() is that the check and conditional increment
are atomic. Together. That's what your optimization takes out.
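
For reference, the generic fallback keeps the check and the increment
together by re-checking inside a cmpxchg() retry loop; a simplified sketch
of that shape (the in-tree fallback is generated and spelled differently):

static inline bool add_unless_sketch(atomic_long_t *v, long a, long u)
{
        long c = atomic_long_read(v);

        do {
                if (c == u)             /* forbidden value: refuse to add */
                        return false;
                /* on failure, try_cmpxchg refreshes c with the value it
                 * observed, so the check and the add always apply to the
                 * same snapshot */
        } while (!atomic_long_try_cmpxchg(v, &c, c + a));

        return true;
}

Any replacement for get_file_rcu_many() has to preserve that combined
check-and-increment, whatever primitive it is built from.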

2020-08-27 14:57:23

by David Laight

Subject: RE: [PATCH] fs: Optimized fget to improve performance

From: Matthew Wilcox
> Sent: 27 August 2020 13:31
> On Thu, Aug 27, 2020 at 06:19:44PM +0800, Shaokun Zhang wrote:
> > From: Yuqi Jin <[email protected]>
> >
> > It is well known that the performance of atomic_add is better than that of
> > atomic_cmpxchg.
>
> I don't think that's well-known at all.

Especially since on (probably) every architecture except x86
atomic_add() is implemented using atomic_cmpxchg().

David

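For architectures with no native atomic RMW instructions at all, the
asm-generic fallback does build the arithmetic ops on top of cmpxchg();
a simplified sketch of that shape (arm64 itself uses LL/SC or LSE
instructions rather than this path):

static inline void atomic_add_sketch(int i, atomic_t *v)
{
        int c = atomic_read(v);
        int old;

        /* retry until the cmpxchg lands on the value the addition was
         * computed from */
        while ((old = cmpxchg(&v->counter, c, c + i)) != c)
                c = old;
}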

2020-08-28 11:06:13

by Will Deacon

Subject: Re: [NAK] Re: [PATCH] fs: Optimized fget to improve performance

On Thu, Aug 27, 2020 at 03:28:48PM +0100, Al Viro wrote:
> On Thu, Aug 27, 2020 at 06:19:44PM +0800, Shaokun Zhang wrote:
> > From: Yuqi Jin <[email protected]>
> >
> > It is well known that the performance of atomic_add is better than that of
> > atomic_cmpxchg.
> > The initial value of @f_count is 1. While @f_count is being incremented
> > by 1 in __fget_files, its value may be in one of three states: > 0, < 0,
> > or = 0. When the fixed value 0 is used as the condition for refusing the
> > increment, only atomic_cmpxchg can be used; when < 0 is used as the
> > refusal condition instead, atomic_add can be used, which performs better.
>
> Suppose another thread has just removed it from the descriptor table.
>
> > +static inline bool get_file_unless_negative(atomic_long_t *v, long a)
> > +{
> > + long c = atomic_long_read(v);
> > +
> > + if (c <= 0)
> > + return 0;
>
> Still 1. Now the other thread has gotten to dropping the last reference,
> decremented counter to zero and committed to freeing the struct file.
>
> > +
> > + return atomic_long_add_return(a, v) - 1;
>
> ... and you increment that sucker back to 1. Sure, you return 0, so the
> caller does nothing to that struct file. Which includes undoing the
> changes to its refcount.
>
> In the meanwhile, the third thread does fget on the same descriptor,
> and there we end up bumping the refcount to 2 and succeeding. Which
> leaves the caller with reference to already doomed struct file...
>
> IOW, NAK - this is completely broken. The whole point of
> atomic_long_add_unless() is that the check and conditional increment
> are atomic. Together. That's what your optimization takes out.

Cheers Al, yes, this is fscked.

As an aside, I've previously toyed with the idea of implementing a form
of cmpxchg() using a pair of xchg() operations and a smp_cond_load_relaxed(),
where the thing would transition through a "reserved value", which might
perform better with the current trend of building hardware that doesn't
handle CAS failure so well.

But I've never had the time/motivation to hack it up, and it relies on that
reserved value which obviously doesn't always work (so it would have to be a
separate API).

Will
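
A rough, purely illustrative sketch of that idea in kernel terms, with
RESERVED as a hypothetical sentinel that must never occur as a legitimate
value (the limitation noted above):

#define RESERVED        LONG_MIN

static inline long cmpxchg_via_xchg(atomic_long_t *v, long old, long new)
{
        long cur;

        for (;;) {
                /* wait until the variable holds a real value ... */
                smp_cond_load_relaxed(&v->counter, VAL != RESERVED);
                /* ... then try to take ownership by parking the sentinel;
                 * only one CPU's xchg() can win the real value */
                cur = atomic_long_xchg(v, RESERVED);
                if (cur != RESERVED)
                        break;
        }

        /* the second xchg() publishes the result and releases ownership:
         * the new value on a match, the observed value otherwise */
        atomic_long_xchg(v, cur == old ? new : cur);

        return cur;
}

The two xchg() calls pass ownership through the sentinel, so the compare
and the conditional store cannot interleave with another updater.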

2020-08-31 01:46:14

by Zhangshaokun

Subject: Re: [NAK] Re: [PATCH] fs: Optimized fget to improve performance

Hi Al,

On 2020/8/27 22:28, Al Viro wrote:
> On Thu, Aug 27, 2020 at 06:19:44PM +0800, Shaokun Zhang wrote:
>> From: Yuqi Jin <[email protected]>
>>
>> It is well known that the performance of atomic_add is better than that of
>> atomic_cmpxchg.
>> The initial value of @f_count is 1. While @f_count is being incremented
>> by 1 in __fget_files, its value may be in one of three states: > 0, < 0,
>> or = 0. When the fixed value 0 is used as the condition for refusing the
>> increment, only atomic_cmpxchg can be used; when < 0 is used as the
>> refusal condition instead, atomic_add can be used, which performs better.
>
> Suppose another thread has just removed it from the descriptor table.
>
>> +static inline bool get_file_unless_negative(atomic_long_t *v, long a)
>> +{
>> + long c = atomic_long_read(v);
>> +
>> + if (c <= 0)
>> + return 0;
>
> Still 1. Now the other thread has gotten to dropping the last reference,
> decremented counter to zero and committed to freeing the struct file.
>

Apologies that I missed it.

>> +
>> + return atomic_long_add_return(a, v) - 1;
>
> ... and you increment that sucker back to 1. Sure, you return 0, so the
> caller does nothing to that struct file. Which includes undoing the
> changes to its refcount.
>
> In the meanwhile, the third thread does fget on the same descriptor,
> and there we end up bumping the refcount to 2 and succeeding. Which
> leaves the caller with reference to already doomed struct file...
>
> IOW, NAK - this is completely broken. The whole point of
> atomic_long_add_unless() is that the check and conditional increment
> are atomic. Together. That's what your optimization takes out.
>

How about this? We still try to replace atomic_cmpxchg with atomic_add to
improve performance. Since atomic_add does not check the current f_count
value, a margin of num_online_cpus() around zero is reserved to guard
against concurrent updates from other CPUs.

+
+static inline bool get_file_unless(atomic_long_t *v, long a)
+{
+ long cpus = num_online_cpus();
+ long c = atomic_long_read(v);
+ long ret;
+
+ if (c > cpus || c < -cpus)
+ ret = atomic_long_add_return(a, v) - a;
+ else
+ ret = atomic_long_add_unless(v, a, 0);
+
+ return ret;
+}
+
#define get_file_rcu_many(x, cnt) \
- atomic_long_add_unless(&(x)->f_count, (cnt), 0)
+ get_file_unless(&(x)->f_count, (cnt))

Thanks,
Shaokun


2020-08-31 03:22:35

by Al Viro

Subject: Re: [NAK] Re: [PATCH] fs: Optimized fget to improve performance

On Mon, Aug 31, 2020 at 09:43:31AM +0800, Shaokun Zhang wrote:

> How about this? We still try to replace atomic_cmpxchg with atomic_add to
> improve performance. Since atomic_add does not check the current f_count
> value, a margin of num_online_cpus() around zero is reserved to guard
> against concurrent updates from other CPUs.

No. Really, really - no. Not unless you can guarantee that process on another
CPU won't lose its timeslice, ending up with more than one increment happening on
the same CPU - done by different processes scheduled there, one after another.

If you have some change of atomic_long_add_unless(), do it there. And get it
past the arm64 folks. get_file_rcu() is nothing special in that respect *AND*
it has to cope with any architecture out there.

BTW, keep in mind that there's such thing as a KVM - race windows are much
wider there, since a thread representing a guest CPU might lose its timeslice
whenever the host feels like that. At which point you get a single instruction
on a guest CPU taking longer than many thousands of instructions on another
CPU of the same guest.

AFAIK, arm64 does support KVM with SMP guests.
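
One concrete interleaving that defeats the num_online_cpus() margin,
assuming two online CPUs (so cpus == 2 in the proposed get_file_unless())
and an initial f_count of 3:

/*
 * CPU0, thread A:  c = atomic_long_read(v);       c == 3 > cpus, so the
 *                                                 unconditional-add path is taken
 * CPU0:            A loses its timeslice before the add
 * anywhere:        the three existing references are fput(); f_count hits
 *                  0 and the final fput() commits to freeing the file
 * CPU0, thread A:  atomic_long_add_return(1, v);  the doomed file's count
 *                                                 is resurrected to 1
 * CPU1, thread B:  get_file_rcu() reads 1, takes the add_unless() path,
 *                  bumps the count to 2 and "succeeds", handing its caller
 *                  a doomed file
 */

The size of the margin never matters, because nothing bounds how many
reads, increments or fput() calls can be pending per CPU once a thread
may be preempted between its read and its add.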

2020-09-01 09:30:35

by David Laight

Subject: RE: [NAK] Re: [PATCH] fs: Optimized fget to improve performance

From: Al Viro
> Sent: 31 August 2020 04:21
>
> On Mon, Aug 31, 2020 at 09:43:31AM +0800, Shaokun Zhang wrote:
>
> > How about this? We still try to replace atomic_cmpxchg with atomic_add to
> > improve performance. Since atomic_add does not check the current f_count
> > value, a margin of num_online_cpus() around zero is reserved to guard
> > against concurrent updates from other CPUs.
>
> No. Really, really - no. Not unless you can guarantee that process on another
> CPU won't lose its timeslice, ending up with more than one increment happening on
> the same CPU - done by different processes scheduled there, one after another.
>
> If you have some change of atomic_long_add_unless(), do it there. And get it
> past the arm64 folks. get_file_rcu() is nothing special in that respect *AND*
> it has to cope with any architecture out there.
>
> BTW, keep in mind that there's such thing as a KVM - race windows are much
> wider there, since a thread representing a guest CPU might lose its timeslice
> whenever the host feels like that. At which point you get a single instruction
> on a guest CPU taking longer than many thousands of instructions on another
> CPU of the same guest.

The same thing can happen if a hardware interrupt occurs.
Not only the delays for the interrupt itself, but all the softint
processing that happens as well.
That can take a long time - even milliseconds.

David
