2023-06-22 09:04:47

by Qi Zheng

[permalink] [raw]
Subject: [PATCH 24/29] mm: vmscan: make global slab shrink lockless

The shrinker_rwsem is a global read-write lock in
shrinkers subsystem, which protects most operations
such as slab shrink, registration and unregistration
of shrinkers, etc. This can easily cause problems in
the following cases.

1) When the memory pressure is high and there are many
filesystems mounted or unmounted at the same time,
slab shrink will be affected (down_read_trylock()
failed).

Such as the real workload mentioned by Kirill Tkhai:

```
One of the real workloads from my experience is start
of an overcommitted node containing many starting
containers after node crash (or many resuming containers
after reboot for kernel update). In these cases memory
pressure is huge, and the node goes round in long reclaim.
```

2) If a shrinker is blocked (such as the case mentioned
in [1]) and a writer comes in (such as mount a fs),
then this writer will be blocked and cause all
subsequent shrinker-related operations to be blocked.

Even if there is no competitor when shrinking slab, there
may still be a problem. If we have a long shrinker list
and we do not reclaim enough memory with each shrinker,
then the down_read_trylock() may be called with high
frequency. Because of the poor multicore scalability of
atomic operations, this can lead to a significant drop
in IPC (instructions per cycle).

We used to implement the lockless slab shrink with
SRCU [1], but then kernel test robot reported -88.8%
regression in stress-ng.ramfs.ops_per_sec test case [2],
so we reverted it [3].

This commit uses the refcount+RCU method [4] proposed by
by Dave Chinner to re-implement the lockless global slab
shrink. The memcg slab shrink is handled in the subsequent
patch.

Currently, the shrinker instances can be divided into
the following three types:

a) global shrinker instance statically defined in the kernel,
such as workingset_shadow_shrinker.

b) global shrinker instance statically defined in the kernel
modules, such as mmu_shrinker in x86.

c) shrinker instance embedded in other structures.

For case a, the memory of shrinker instance is never freed.
For case b, the memory of shrinker instance will be freed
after the module is unloaded. But we will call synchronize_rcu()
in free_module() to wait for RCU read-side critical section to
exit. For case c, the memory of shrinker instance will be
dynamically freed by calling kfree_rcu(). So we can use
rcu_read_{lock,unlock}() to ensure that the shrinker instance
is valid.

The shrinker::refcount mechanism ensures that the shrinker
instance will not be run again after unregistration. So the
structure that records the pointer of shrinker instance can be
safely freed without waiting for the RCU read-side critical
section.

In this way, while we implement the lockless slab shrink, we
don't need to be blocked in unregister_shrinker() to wait
RCU read-side critical section.

The following are the test results:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) Before applying this patchset:

setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
(secs) (secs) (secs) (real time) (usr+sys time)
ramfs 880623 60.02 7.71 226.93 14671.45 3753.09
ramfs:
1 System Management Interrupt
for a 60.03s run time:
5762.40s available CPU time
7.71s user time ( 0.13%)
226.93s system time ( 3.94%)
234.64s total time ( 4.07%)
load average: 8.54 3.06 2.11
passed: 9: ramfs (9)
failed: 0
skipped: 0
successful run completed in 60.03s (1 min, 0.03 secs)

2) After applying this patchset:

setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
(secs) (secs) (secs) (real time) (usr+sys time)
ramfs 847562 60.02 7.44 230.22 14120.66 3566.23
ramfs:
4 System Management Interrupts
for a 60.12s run time:
5771.95s available CPU time
7.44s user time ( 0.13%)
230.22s system time ( 3.99%)
237.66s total time ( 4.12%)
load average: 8.18 2.43 0.84
passed: 9: ramfs (9)
failed: 0
skipped: 0
successful run completed in 60.12s (1 min, 0.12 secs)

We can see that the ops/s has hardly changed.

[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/lkml/[email protected]/
[3]. https://lore.kernel.org/all/[email protected]/
[4]. https://lore.kernel.org/lkml/[email protected]/

Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/shrinker.h | 6 ++++++
mm/vmscan.c | 33 ++++++++++++++-------------------
2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 7bfeb2f25246..b0c6c2df9db8 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -74,6 +74,7 @@ struct shrinker {

refcount_t refcount;
struct completion completion_wait;
+ struct rcu_head rcu;

void *private_data;

@@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
void shrinker_free(struct shrinker *shrinker);
void unregister_and_free_shrinker(struct shrinker *shrinker);

+static inline bool shrinker_try_get(struct shrinker *shrinker)
+{
+ return refcount_inc_not_zero(&shrinker->refcount);
+}
+
static inline void shrinker_put(struct shrinker *shrinker)
{
if (refcount_dec_and_test(&shrinker->refcount))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f9c4750effa..767569698946 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
#include <linux/khugepaged.h>
#include <linux/rculist_nulls.h>
#include <linux/random.h>
+#include <linux/rculist.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
down_write(&shrinker_rwsem);
refcount_set(&shrinker->refcount, 1);
init_completion(&shrinker->completion_wait);
- list_add_tail(&shrinker->list, &shrinker_list);
+ list_add_tail_rcu(&shrinker->list, &shrinker_list);
shrinker->flags |= SHRINKER_REGISTERED;
shrinker_debugfs_add(shrinker);
up_write(&shrinker_rwsem);
@@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
wait_for_completion(&shrinker->completion_wait);

down_write(&shrinker_rwsem);
- list_del(&shrinker->list);
+ list_del_rcu(&shrinker->list);
shrinker->flags &= ~SHRINKER_REGISTERED;
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
@@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
void unregister_and_free_shrinker(struct shrinker *shrinker)
{
unregister_shrinker(shrinker);
- kfree(shrinker);
+ kfree_rcu(shrinker, rcu);
}
EXPORT_SYMBOL(unregister_and_free_shrinker);

@@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

- if (!down_read_trylock(&shrinker_rwsem))
- goto out;
-
- list_for_each_entry(shrinker, &shrinker_list, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
.memcg = memcg,
};

+ if (!shrinker_try_get(shrinker))
+ continue;
+ rcu_read_unlock();
+
ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
- /*
- * Bail out if someone want to register a new shrinker to
- * prevent the registration from being stalled for long periods
- * by parallel ongoing shrinking.
- */
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
- }

- up_read(&shrinker_rwsem);
-out:
+ rcu_read_lock();
+ shrinker_put(shrinker);
+ }
+ rcu_read_unlock();
cond_resched();
return freed;
}
--
2.30.2



2023-06-22 16:53:56

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless



On 2023/6/22 23:12, Vlastimil Babka wrote:
> On 6/22/23 10:53, Qi Zheng wrote:
>> The shrinker_rwsem is a global read-write lock in
>> shrinkers subsystem, which protects most operations
>> such as slab shrink, registration and unregistration
>> of shrinkers, etc. This can easily cause problems in
>> the following cases.
>>
>> 1) When the memory pressure is high and there are many
>> filesystems mounted or unmounted at the same time,
>> slab shrink will be affected (down_read_trylock()
>> failed).
>>
>> Such as the real workload mentioned by Kirill Tkhai:
>>
>> ```
>> One of the real workloads from my experience is start
>> of an overcommitted node containing many starting
>> containers after node crash (or many resuming containers
>> after reboot for kernel update). In these cases memory
>> pressure is huge, and the node goes round in long reclaim.
>> ```
>>
>> 2) If a shrinker is blocked (such as the case mentioned
>> in [1]) and a writer comes in (such as mount a fs),
>> then this writer will be blocked and cause all
>> subsequent shrinker-related operations to be blocked.
>>
>> Even if there is no competitor when shrinking slab, there
>> may still be a problem. If we have a long shrinker list
>> and we do not reclaim enough memory with each shrinker,
>> then the down_read_trylock() may be called with high
>> frequency. Because of the poor multicore scalability of
>> atomic operations, this can lead to a significant drop
>> in IPC (instructions per cycle).
>>
>> We used to implement the lockless slab shrink with
>> SRCU [1], but then kernel test robot reported -88.8%
>> regression in stress-ng.ramfs.ops_per_sec test case [2],
>> so we reverted it [3].
>>
>> This commit uses the refcount+RCU method [4] proposed by
>> by Dave Chinner to re-implement the lockless global slab
>> shrink. The memcg slab shrink is handled in the subsequent
>> patch.
>>
>> Currently, the shrinker instances can be divided into
>> the following three types:
>>
>> a) global shrinker instance statically defined in the kernel,
>> such as workingset_shadow_shrinker.
>>
>> b) global shrinker instance statically defined in the kernel
>> modules, such as mmu_shrinker in x86.
>>
>> c) shrinker instance embedded in other structures.
>>
>> For case a, the memory of shrinker instance is never freed.
>> For case b, the memory of shrinker instance will be freed
>> after the module is unloaded. But we will call synchronize_rcu()
>> in free_module() to wait for RCU read-side critical section to
>> exit. For case c, the memory of shrinker instance will be
>> dynamically freed by calling kfree_rcu(). So we can use
>> rcu_read_{lock,unlock}() to ensure that the shrinker instance
>> is valid.
>>
>> The shrinker::refcount mechanism ensures that the shrinker
>> instance will not be run again after unregistration. So the
>> structure that records the pointer of shrinker instance can be
>> safely freed without waiting for the RCU read-side critical
>> section.
>>
>> In this way, while we implement the lockless slab shrink, we
>> don't need to be blocked in unregister_shrinker() to wait
>> RCU read-side critical section.
>>
>> The following are the test results:
>>
>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>
>> 1) Before applying this patchset:
>>
>> setting to a 60 second run per stressor
>> dispatching hogs: 9 ramfs
>> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
>> (secs) (secs) (secs) (real time) (usr+sys time)
>> ramfs 880623 60.02 7.71 226.93 14671.45 3753.09
>> ramfs:
>> 1 System Management Interrupt
>> for a 60.03s run time:
>> 5762.40s available CPU time
>> 7.71s user time ( 0.13%)
>> 226.93s system time ( 3.94%)
>> 234.64s total time ( 4.07%)
>> load average: 8.54 3.06 2.11
>> passed: 9: ramfs (9)
>> failed: 0
>> skipped: 0
>> successful run completed in 60.03s (1 min, 0.03 secs)
>>
>> 2) After applying this patchset:
>>
>> setting to a 60 second run per stressor
>> dispatching hogs: 9 ramfs
>> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
>> (secs) (secs) (secs) (real time) (usr+sys time)
>> ramfs 847562 60.02 7.44 230.22 14120.66 3566.23
>> ramfs:
>> 4 System Management Interrupts
>> for a 60.12s run time:
>> 5771.95s available CPU time
>> 7.44s user time ( 0.13%)
>> 230.22s system time ( 3.99%)
>> 237.66s total time ( 4.12%)
>> load average: 8.18 2.43 0.84
>> passed: 9: ramfs (9)
>> failed: 0
>> skipped: 0
>> successful run completed in 60.12s (1 min, 0.12 secs)
>>
>> We can see that the ops/s has hardly changed.
>>
>> [1]. https://lore.kernel.org/lkml/[email protected]/
>> [2]. https://lore.kernel.org/lkml/[email protected]/
>> [3]. https://lore.kernel.org/all/[email protected]/
>> [4]. https://lore.kernel.org/lkml/[email protected]/
>>
>> Signed-off-by: Qi Zheng <[email protected]>
>> ---
>> include/linux/shrinker.h | 6 ++++++
>> mm/vmscan.c | 33 ++++++++++++++-------------------
>> 2 files changed, 20 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>> index 7bfeb2f25246..b0c6c2df9db8 100644
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -74,6 +74,7 @@ struct shrinker {
>>
>> refcount_t refcount;
>> struct completion completion_wait;
>> + struct rcu_head rcu;
>>
>> void *private_data;
>>
>> @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
>> void shrinker_free(struct shrinker *shrinker);
>> void unregister_and_free_shrinker(struct shrinker *shrinker);
>>
>> +static inline bool shrinker_try_get(struct shrinker *shrinker)
>> +{
>> + return refcount_inc_not_zero(&shrinker->refcount);
>> +}
>> +
>> static inline void shrinker_put(struct shrinker *shrinker)
>> {
>> if (refcount_dec_and_test(&shrinker->refcount))
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 6f9c4750effa..767569698946 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -57,6 +57,7 @@
>> #include <linux/khugepaged.h>
>> #include <linux/rculist_nulls.h>
>> #include <linux/random.h>
>> +#include <linux/rculist.h>
>>
>> #include <asm/tlbflush.h>
>> #include <asm/div64.h>
>> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>> down_write(&shrinker_rwsem);
>> refcount_set(&shrinker->refcount, 1);
>> init_completion(&shrinker->completion_wait);
>> - list_add_tail(&shrinker->list, &shrinker_list);
>> + list_add_tail_rcu(&shrinker->list, &shrinker_list);
>> shrinker->flags |= SHRINKER_REGISTERED;
>> shrinker_debugfs_add(shrinker);
>> up_write(&shrinker_rwsem);
>> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>> wait_for_completion(&shrinker->completion_wait);
>>
>> down_write(&shrinker_rwsem);
>> - list_del(&shrinker->list);
>> + list_del_rcu(&shrinker->list);
>> shrinker->flags &= ~SHRINKER_REGISTERED;
>> if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>> unregister_memcg_shrinker(shrinker);
>> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
>> void unregister_and_free_shrinker(struct shrinker *shrinker)
>> {
>> unregister_shrinker(shrinker);
>> - kfree(shrinker);
>> + kfree_rcu(shrinker, rcu);
>> }
>> EXPORT_SYMBOL(unregister_and_free_shrinker);
>>
>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>> return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>
>> - if (!down_read_trylock(&shrinker_rwsem))
>> - goto out;
>> -
>> - list_for_each_entry(shrinker, &shrinker_list, list) {
>> + rcu_read_lock();
>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>> struct shrink_control sc = {
>> .gfp_mask = gfp_mask,
>> .nid = nid,
>> .memcg = memcg,
>> };
>>
>> + if (!shrinker_try_get(shrinker))
>> + continue;
>> + rcu_read_unlock();
>
> I don't think you can do this unlock?
>
>> +
>> ret = do_shrink_slab(&sc, shrinker, priority);
>> if (ret == SHRINK_EMPTY)
>> ret = 0;
>> freed += ret;
>> - /*
>> - * Bail out if someone want to register a new shrinker to
>> - * prevent the registration from being stalled for long periods
>> - * by parallel ongoing shrinking.
>> - */
>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>> - freed = freed ? : 1;
>> - break;
>> - }
>> - }
>>
>> - up_read(&shrinker_rwsem);
>> -out:
>> + rcu_read_lock();
>
> That new rcu_read_lock() won't help AFAIK, the whole
> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
> safe.

In the unregister_shrinker() path, we will wait for the refcount to zero
before deleting the shrinker from the linked list. Here, we first took
the rcu lock, and then decrement the refcount of this shrinker.

shrink_slab unregister_shrinker
=========== ===================

/* wait for B */
wait_for_completion()
rcu_read_lock()

shrinker_put() --> (B)
list_del_rcu()
/* wait for rcu_read_unlock() */
kfree_rcu()

/*
* so this shrinker will not be freed here,
* and can be used to traverse the next node
* normally?
*/
list_for_each_entry()

shrinker_try_get()
rcu_read_unlock()

Did I miss something?

>
> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
>
>> + shrinker_put(shrinker);
>> + }
>> + rcu_read_unlock();
>> cond_resched();
>> return freed;
>> }
>

2023-06-23 13:13:47

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless



On 2023/6/23 14:29, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>> On 6/22/23 10:53, Qi Zheng wrote:
>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>> return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>
>>> - if (!down_read_trylock(&shrinker_rwsem))
>>> - goto out;
>>> -
>>> - list_for_each_entry(shrinker, &shrinker_list, list) {
>>> + rcu_read_lock();
>>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>> struct shrink_control sc = {
>>> .gfp_mask = gfp_mask,
>>> .nid = nid,
>>> .memcg = memcg,
>>> };
>>>
>>> + if (!shrinker_try_get(shrinker))
>>> + continue;
>>> + rcu_read_unlock();
>>
>> I don't think you can do this unlock?
>>
>>> +
>>> ret = do_shrink_slab(&sc, shrinker, priority);
>>> if (ret == SHRINK_EMPTY)
>>> ret = 0;
>>> freed += ret;
>>> - /*
>>> - * Bail out if someone want to register a new shrinker to
>>> - * prevent the registration from being stalled for long periods
>>> - * by parallel ongoing shrinking.
>>> - */
>>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>>> - freed = freed ? : 1;
>>> - break;
>>> - }
>>> - }
>>>
>>> - up_read(&shrinker_rwsem);
>>> -out:
>>> + rcu_read_lock();
>>
>> That new rcu_read_lock() won't help AFAIK, the whole
>> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
>> safe.
>
> Yeah, that's the pattern we've been taught and the one we can look
> at and immediately say "this is safe".
>
> This is a different pattern, as has been explained bi Qi, and I
> think it *might* be safe.
>
> *However.*
>
> Right now I don't have time to go through a novel RCU list iteration
> pattern it one step at to determine the correctness of the
> algorithm. I'm mostly worried about list manipulations that can
> occur outside rcu_read_lock() section bleeding into the RCU
> critical section because rcu_read_lock() by itself is not a memory
> barrier.
>
> Maybe Paul has seen this pattern often enough he could simply tell
> us what conditions it is safe in. But for me to work that out from
> first principles? I just don't have the time to do that right now.

Hi Paul, can you help to confirm this?

>
>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
>
> Yes, I suggested the IDR route because radix tree lookups under RCU
> with reference counted objects are a known safe pattern that we can
> easily confirm is correct or not. Hence I suggested the unification
> + IDR route because it makes the life of reviewers so, so much
> easier...

In fact, I originally planned to try the unification + IDR method you
suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
the struct mem_cgroup is not even defined, and root_mem_cgroup and
shrinker_info will not be allocated. This required more code changes, so
I ended up keeping the shrinker_list and implementing the above pattern.

If the above pattern is not safe, I will go back to the unification +
IDR method.

Thanks,
Qi

>
> Cheers,
>
> Dave.

2023-06-23 22:21:12

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless

On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote:
> On 2023/6/23 14:29, Dave Chinner wrote:
> > On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> > > On 6/22/23 10:53, Qi Zheng wrote:
> > Yes, I suggested the IDR route because radix tree lookups under RCU
> > with reference counted objects are a known safe pattern that we can
> > easily confirm is correct or not. Hence I suggested the unification
> > + IDR route because it makes the life of reviewers so, so much
> > easier...
>
> In fact, I originally planned to try the unification + IDR method you
> suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
> the struct mem_cgroup is not even defined, and root_mem_cgroup and
> shrinker_info will not be allocated. This required more code changes, so
> I ended up keeping the shrinker_list and implementing the above pattern.

Yes. Go back and read what I originally said needed to be done
first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs
to exist that holds all of the global shrinkers. Then shrink_slab()
is only ever passed a memcg that should be iterated.

Yes, it needs changes external to the shrinker code itself to be
made to work. And even if memcg's are not enabled, we can still use
the memcg structures to ensure a common abstraction is used for the
shrinker tracking infrastructure....

> If the above pattern is not safe, I will go back to the unification +
> IDR method.

And that is exactly how we got into this mess in the first place....

-Dave
--
Dave Chinner
[email protected]

2023-07-03 16:49:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless

On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> > On 6/22/23 10:53, Qi Zheng wrote:
> > > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> > > return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> > >
> > > - if (!down_read_trylock(&shrinker_rwsem))
> > > - goto out;
> > > -
> > > - list_for_each_entry(shrinker, &shrinker_list, list) {
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> > > struct shrink_control sc = {
> > > .gfp_mask = gfp_mask,
> > > .nid = nid,
> > > .memcg = memcg,
> > > };
> > >
> > > + if (!shrinker_try_get(shrinker))
> > > + continue;
> > > + rcu_read_unlock();
> >
> > I don't think you can do this unlock?

Sorry to be slow to respond here, this one fell through the cracks.
And thank you to Qi for reminding me!

If you do this unlock, you had jolly well better nail down the current
element (the one referenced by shrinker), for example, by acquiring an
explicit reference count on the object. And presumably this is exactly
what shrinker_try_get() is doing. And a look at your 24/29 confirms this,
at least assuming that shrinker->refcount is set to zero before the call
to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't
start until *after* shrinker_put() calls complete(). Plus, as always,
the object must be removed from the list before the synchronize_rcu()
starts. (On these parts of the puzzle, I defer to those more familiar
with this code path. And I strongly suggest carefully commenting this
type of action-at-a-distance design pattern.)

Why is this important? Because otherwise that object might be freed
before you get to the call to rcu_read_lock() at the end of this loop.
And if that happens, list_for_each_entry_rcu() will be walking the
freelist, which is quite bad for the health and well-being of your kernel.

There are a few other ways to make this sort of thing work:

1. Defer the shrinker_put() to the beginning of the loop.
You would need a flag initially set to zero, and then set to
one just before (or just after) the rcu_read_lock() above.
You would also need another shrinker_old pointer to track the
old pointer. Then at the top of the loop, if the flag is set,
invoke shrinker_put() on shrinker_old. This ensures that the
previous shrinker structure stays around long enough to allow
the loop to find the next shrinker structure in the list.

This approach is attractive when the removal code path
can invoke shrinker_put() after the grace period ends.

2. Make shrinker_put() invoke call_rcu() when ->refcount reaches
zero, and have the callback function free the object. This of
course requires adding an rcu_head structure to the shrinker
structure, which might or might not be a reasonable course of
action. If adding that rcu_head is reasonable, this simplifies
the logic quite a bit.

3. For the shrinker-structure-removal code path, remove the shrinker
structure, then remove the initial count from ->refcount,
and then keep doing grace periods until ->refcount is zero,
then do one more. Of course, if the result of removing the
initial count was zero, then only a single additional grace
period is required.

This would need to be carefully commented, as it is a bit
unconventional.

There are probably many other ways, but just to give an idea of a few
other ways to do this.

> > > +
> > > ret = do_shrink_slab(&sc, shrinker, priority);
> > > if (ret == SHRINK_EMPTY)
> > > ret = 0;
> > > freed += ret;
> > > - /*
> > > - * Bail out if someone want to register a new shrinker to
> > > - * prevent the registration from being stalled for long periods
> > > - * by parallel ongoing shrinking.
> > > - */
> > > - if (rwsem_is_contended(&shrinker_rwsem)) {
> > > - freed = freed ? : 1;
> > > - break;
> > > - }
> > > - }
> > >
> > > - up_read(&shrinker_rwsem);
> > > -out:
> > > + rcu_read_lock();
> >
> > That new rcu_read_lock() won't help AFAIK, the whole
> > list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
> > safe.
>
> Yeah, that's the pattern we've been taught and the one we can look
> at and immediately say "this is safe".
>
> This is a different pattern, as has been explained bi Qi, and I
> think it *might* be safe.
>
> *However.*
>
> Right now I don't have time to go through a novel RCU list iteration
> pattern it one step at to determine the correctness of the
> algorithm. I'm mostly worried about list manipulations that can
> occur outside rcu_read_lock() section bleeding into the RCU
> critical section because rcu_read_lock() by itself is not a memory
> barrier.
>
> Maybe Paul has seen this pattern often enough he could simply tell
> us what conditions it is safe in. But for me to work that out from
> first principles? I just don't have the time to do that right now.

If the code does just the right sequence of things on the removal path
(remove, decrement reference, wait for reference to go to zero, wait for
grace period, free), then it would work. If this is what is happening,
I would argue for more comments. ;-)

Thanx, Paul

> > IIUC this is why Dave in [4] suggests unifying shrink_slab() with
> > shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
>
> Yes, I suggested the IDR route because radix tree lookups under RCU
> with reference counted objects are a known safe pattern that we can
> easily confirm is correct or not. Hence I suggested the unification
> + IDR route because it makes the life of reviewers so, so much
> easier...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2023-07-05 03:30:27

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless



On 2023/7/4 11:45, Qi Zheng wrote:
>
>
> On 2023/7/4 00:39, Paul E. McKenney wrote:
>> On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
>>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t
>>>>> gfp_mask, int nid,
>>>>>       if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>>           return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>> -    if (!down_read_trylock(&shrinker_rwsem))
>>>>> -        goto out;
>>>>> -
>>>>> -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>> +    rcu_read_lock();
>>>>> +    list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>>>           struct shrink_control sc = {
>>>>>               .gfp_mask = gfp_mask,
>>>>>               .nid = nid,
>>>>>               .memcg = memcg,
>>>>>           };
>>>>> +        if (!shrinker_try_get(shrinker))
>>>>> +            continue;
>>>>> +        rcu_read_unlock();
>>>>
>>>> I don't think you can do this unlock?
>>
>> Sorry to be slow to respond here, this one fell through the cracks.
>> And thank you to Qi for reminding me!
>>
>> If you do this unlock, you had jolly well better nail down the current
>> element (the one referenced by shrinker), for example, by acquiring an
>> explicit reference count on the object.  And presumably this is exactly
>> what shrinker_try_get() is doing.  And a look at your 24/29 confirms
>> this,
>> at least assuming that shrinker->refcount is set to zero before the call
>> to synchronize_rcu() in free_module() *and* that synchronize_rcu()
>> doesn't
>> start until *after* shrinker_put() calls complete().  Plus, as always,
>> the object must be removed from the list before the synchronize_rcu()
>> starts.  (On these parts of the puzzle, I defer to those more familiar
>> with this code path.  And I strongly suggest carefully commenting this
>> type of action-at-a-distance design pattern.)
>
> Yeah, I think I've done it like above. A more detailed timing diagram is
> below.
>
>>
>> Why is this important?  Because otherwise that object might be freed
>> before you get to the call to rcu_read_lock() at the end of this loop.
>> And if that happens, list_for_each_entry_rcu() will be walking the
>> freelist, which is quite bad for the health and well-being of your
>> kernel.
>>
>> There are a few other ways to make this sort of thing work:
>>
>> 1.    Defer the shrinker_put() to the beginning of the loop.
>>     You would need a flag initially set to zero, and then set to
>>     one just before (or just after) the rcu_read_lock() above.
>>     You would also need another shrinker_old pointer to track the
>>     old pointer.  Then at the top of the loop, if the flag is set,
>>     invoke shrinker_put() on shrinker_old.    This ensures that the
>>     previous shrinker structure stays around long enough to allow
>>     the loop to find the next shrinker structure in the list.
>>
>>     This approach is attractive when the removal code path
>>     can invoke shrinker_put() after the grace period ends.
>>
>> 2.    Make shrinker_put() invoke call_rcu() when ->refcount reaches
>>     zero, and have the callback function free the object.  This of
>>     course requires adding an rcu_head structure to the shrinker
>>     structure, which might or might not be a reasonable course of
>>     action.  If adding that rcu_head is reasonable, this simplifies
>>     the logic quite a bit.
>>
>> 3.    For the shrinker-structure-removal code path, remove the shrinker
>>     structure, then remove the initial count from ->refcount,
>>     and then keep doing grace periods until ->refcount is zero,
>>     then do one more.  Of course, if the result of removing the
>>     initial count was zero, then only a single additional grace
>>     period is required.
>>
>>     This would need to be carefully commented, as it is a bit
>>     unconventional.
>
> Thanks for such a detailed addition!
>
>>
>> There are probably many other ways, but just to give an idea of a few
>> other ways to do this.
>>
>>>>> +
>>>>>           ret = do_shrink_slab(&sc, shrinker, priority);
>>>>>           if (ret == SHRINK_EMPTY)
>>>>>               ret = 0;
>>>>>           freed += ret;
>>>>> -        /*
>>>>> -         * Bail out if someone want to register a new shrinker to
>>>>> -         * prevent the registration from being stalled for long
>>>>> periods
>>>>> -         * by parallel ongoing shrinking.
>>>>> -         */
>>>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>>>> -            freed = freed ? : 1;
>>>>> -            break;
>>>>> -        }
>>>>> -    }
>>>>> -    up_read(&shrinker_rwsem);
>>>>> -out:
>>>>> +        rcu_read_lock();
>>>>
>>>> That new rcu_read_lock() won't help AFAIK, the whole
>>>> list_for_each_entry_rcu() needs to be under the single
>>>> rcu_read_lock() to be
>>>> safe.
>>>
>>> Yeah, that's the pattern we've been taught and the one we can look
>>> at and immediately say "this is safe".
>>>
>>> This is a different pattern, as has been explained bi Qi, and I
>>> think it *might* be safe.
>>>
>>> *However.*
>>>
>>> Right now I don't have time to go through a novel RCU list iteration
>>> pattern it one step at to determine the correctness of the
>>> algorithm. I'm mostly worried about list manipulations that can
>>> occur outside rcu_read_lock() section bleeding into the RCU
>>> critical section because rcu_read_lock() by itself is not a memory
>>> barrier.
>>>
>>> Maybe Paul has seen this pattern often enough he could simply tell
>>> us what conditions it is safe in. But for me to work that out from
>>> first principles? I just don't have the time to do that right now.
>>
>> If the code does just the right sequence of things on the removal path
>> (remove, decrement reference, wait for reference to go to zero, wait for
>> grace period, free), then it would work.  If this is what is happening,
>> I would argue for more comments.  ;-)
>
> The order of the removal path is slightly different from this:
>
>     shrink_slab                 unregister_shrinker
>     ===========                 ===================
>
>    shrinker_try_get()
>    rcu_read_unlock()
>                                 1. decrement initial reference
>                 shrinker_put()
>                 2. wait for reference to go to zero
>                 wait_for_completion()
>    rcu_read_lock()
>
>    shrinker_put()
>                 3. remove the shrinker from list
>                 list_del_rcu()
>                                 4. wait for grace period
>                 kfree_rcu()/synchronize_rcu()
>
>
>    list_for_each_entry()
>
>    shrinker_try_get()
>    rcu_read_unlock()
>                 5. free the shrinker
>
> So the order is: decrement reference, wait for reference to go to zero,
> remove, wait for grace period, free.
>
> I think this can work. And we can only do the *step 3* after we hold the
> RCU read lock again, right? Please let me know if I missed something.

Oh, you are right, It would be better to move step 3 to step 1. We
should first remove the shrinker from the shrinker_list to prevent
other traversers from finding it again, otherwise the following
situations may occur theoretically:

CPU 0 CPU 1

shrinker_try_get()

shrinker_try_get()

shrinker_put()
shrinker_try_get()
shrinker_put()

Thanks,
Qi

>
> Thanks,
> Qi
>
>>
>>                             Thanx, Paul
>>
>>>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>>>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses
>>>> IDR.
>>>
>>> Yes, I suggested the IDR route because radix tree lookups under RCU
>>> with reference counted objects are a known safe pattern that we can
>>> easily confirm is correct or not.  Hence I suggested the unification
>>> + IDR route because it makes the life of reviewers so, so much
>>> easier...
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> [email protected]