2023-06-22 17:47:37

by Alan Huang

[permalink] [raw]
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless


> 2023年6月23日 上午12:42,Qi Zheng <[email protected]> 写道:
>
>
>
> On 2023/6/22 23:12, Vlastimil Babka wrote:
>> On 6/22/23 10:53, Qi Zheng wrote:
>>> The shrinker_rwsem is a global read-write lock in
>>> shrinkers subsystem, which protects most operations
>>> such as slab shrink, registration and unregistration
>>> of shrinkers, etc. This can easily cause problems in
>>> the following cases.
>>>
>>> 1) When the memory pressure is high and there are many
>>> filesystems mounted or unmounted at the same time,
>>> slab shrink will be affected (down_read_trylock()
>>> failed).
>>>
>>> Such as the real workload mentioned by Kirill Tkhai:
>>>
>>> ```
>>> One of the real workloads from my experience is start
>>> of an overcommitted node containing many starting
>>> containers after node crash (or many resuming containers
>>> after reboot for kernel update). In these cases memory
>>> pressure is huge, and the node goes round in long reclaim.
>>> ```
>>>
>>> 2) If a shrinker is blocked (such as the case mentioned
>>> in [1]) and a writer comes in (such as mount a fs),
>>> then this writer will be blocked and cause all
>>> subsequent shrinker-related operations to be blocked.
>>>
>>> Even if there is no competitor when shrinking slab, there
>>> may still be a problem. If we have a long shrinker list
>>> and we do not reclaim enough memory with each shrinker,
>>> then the down_read_trylock() may be called with high
>>> frequency. Because of the poor multicore scalability of
>>> atomic operations, this can lead to a significant drop
>>> in IPC (instructions per cycle).
>>>
>>> We used to implement the lockless slab shrink with
>>> SRCU [1], but then kernel test robot reported -88.8%
>>> regression in stress-ng.ramfs.ops_per_sec test case [2],
>>> so we reverted it [3].
>>>
>>> This commit uses the refcount+RCU method [4] proposed by
>>> by Dave Chinner to re-implement the lockless global slab
>>> shrink. The memcg slab shrink is handled in the subsequent
>>> patch.
>>>
>>> Currently, the shrinker instances can be divided into
>>> the following three types:
>>>
>>> a) global shrinker instance statically defined in the kernel,
>>> such as workingset_shadow_shrinker.
>>>
>>> b) global shrinker instance statically defined in the kernel
>>> modules, such as mmu_shrinker in x86.
>>>
>>> c) shrinker instance embedded in other structures.
>>>
>>> For case a, the memory of shrinker instance is never freed.
>>> For case b, the memory of shrinker instance will be freed
>>> after the module is unloaded. But we will call synchronize_rcu()
>>> in free_module() to wait for RCU read-side critical section to
>>> exit. For case c, the memory of shrinker instance will be
>>> dynamically freed by calling kfree_rcu(). So we can use
>>> rcu_read_{lock,unlock}() to ensure that the shrinker instance
>>> is valid.
>>>
>>> The shrinker::refcount mechanism ensures that the shrinker
>>> instance will not be run again after unregistration. So the
>>> structure that records the pointer of shrinker instance can be
>>> safely freed without waiting for the RCU read-side critical
>>> section.
>>>
>>> In this way, while we implement the lockless slab shrink, we
>>> don't need to be blocked in unregister_shrinker() to wait
>>> RCU read-side critical section.
>>>
>>> The following are the test results:
>>>
>>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>>
>>> 1) Before applying this patchset:
>>>
>>> setting to a 60 second run per stressor
>>> dispatching hogs: 9 ramfs
>>> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
>>> (secs) (secs) (secs) (real time) (usr+sys time)
>>> ramfs 880623 60.02 7.71 226.93 14671.45 3753.09
>>> ramfs:
>>> 1 System Management Interrupt
>>> for a 60.03s run time:
>>> 5762.40s available CPU time
>>> 7.71s user time ( 0.13%)
>>> 226.93s system time ( 3.94%)
>>> 234.64s total time ( 4.07%)
>>> load average: 8.54 3.06 2.11
>>> passed: 9: ramfs (9)
>>> failed: 0
>>> skipped: 0
>>> successful run completed in 60.03s (1 min, 0.03 secs)
>>>
>>> 2) After applying this patchset:
>>>
>>> setting to a 60 second run per stressor
>>> dispatching hogs: 9 ramfs
>>> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
>>> (secs) (secs) (secs) (real time) (usr+sys time)
>>> ramfs 847562 60.02 7.44 230.22 14120.66 3566.23
>>> ramfs:
>>> 4 System Management Interrupts
>>> for a 60.12s run time:
>>> 5771.95s available CPU time
>>> 7.44s user time ( 0.13%)
>>> 230.22s system time ( 3.99%)
>>> 237.66s total time ( 4.12%)
>>> load average: 8.18 2.43 0.84
>>> passed: 9: ramfs (9)
>>> failed: 0
>>> skipped: 0
>>> successful run completed in 60.12s (1 min, 0.12 secs)
>>>
>>> We can see that the ops/s has hardly changed.
>>>
>>> [1]. https://lore.kernel.org/lkml/[email protected]/
>>> [2]. https://lore.kernel.org/lkml/[email protected]/
>>> [3]. https://lore.kernel.org/all/[email protected]/
>>> [4]. https://lore.kernel.org/lkml/[email protected]/
>>>
>>> Signed-off-by: Qi Zheng <[email protected]>
>>> ---
>>> include/linux/shrinker.h | 6 ++++++
>>> mm/vmscan.c | 33 ++++++++++++++-------------------
>>> 2 files changed, 20 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>>> index 7bfeb2f25246..b0c6c2df9db8 100644
>>> --- a/include/linux/shrinker.h
>>> +++ b/include/linux/shrinker.h
>>> @@ -74,6 +74,7 @@ struct shrinker {
>>> refcount_t refcount;
>>> struct completion completion_wait;
>>> + struct rcu_head rcu;
>>> void *private_data;
>>> @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
>>> void shrinker_free(struct shrinker *shrinker);
>>> void unregister_and_free_shrinker(struct shrinker *shrinker);
>>> +static inline bool shrinker_try_get(struct shrinker *shrinker)
>>> +{
>>> + return refcount_inc_not_zero(&shrinker->refcount);
>>> +}
>>> +
>>> static inline void shrinker_put(struct shrinker *shrinker)
>>> {
>>> if (refcount_dec_and_test(&shrinker->refcount))
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 6f9c4750effa..767569698946 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -57,6 +57,7 @@
>>> #include <linux/khugepaged.h>
>>> #include <linux/rculist_nulls.h>
>>> #include <linux/random.h>
>>> +#include <linux/rculist.h>
>>> #include <asm/tlbflush.h>
>>> #include <asm/div64.h>
>>> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>>> down_write(&shrinker_rwsem);
>>> refcount_set(&shrinker->refcount, 1);
>>> init_completion(&shrinker->completion_wait);
>>> - list_add_tail(&shrinker->list, &shrinker_list);
>>> + list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>> shrinker->flags |= SHRINKER_REGISTERED;
>>> shrinker_debugfs_add(shrinker);
>>> up_write(&shrinker_rwsem);
>>> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>> wait_for_completion(&shrinker->completion_wait);
>>> down_write(&shrinker_rwsem);
>>> - list_del(&shrinker->list);
>>> + list_del_rcu(&shrinker->list);
>>> shrinker->flags &= ~SHRINKER_REGISTERED;
>>> if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>> unregister_memcg_shrinker(shrinker);
>>> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
>>> void unregister_and_free_shrinker(struct shrinker *shrinker)
>>> {
>>> unregister_shrinker(shrinker);
>>> - kfree(shrinker);
>>> + kfree_rcu(shrinker, rcu);
>>> }
>>> EXPORT_SYMBOL(unregister_and_free_shrinker);
>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>> return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>> - if (!down_read_trylock(&shrinker_rwsem))
>>> - goto out;
>>> -
>>> - list_for_each_entry(shrinker, &shrinker_list, list) {
>>> + rcu_read_lock();
>>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>> struct shrink_control sc = {
>>> .gfp_mask = gfp_mask,
>>> .nid = nid,
>>> .memcg = memcg,
>>> };
>>> + if (!shrinker_try_get(shrinker))
>>> + continue;
>>> + rcu_read_unlock();
>> I don't think you can do this unlock?
>>> +
>>> ret = do_shrink_slab(&sc, shrinker, priority);
>>> if (ret == SHRINK_EMPTY)
>>> ret = 0;
>>> freed += ret;
>>> - /*
>>> - * Bail out if someone want to register a new shrinker to
>>> - * prevent the registration from being stalled for long periods
>>> - * by parallel ongoing shrinking.
>>> - */
>>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>>> - freed = freed ? : 1;
>>> - break;
>>> - }
>>> - }
>>> - up_read(&shrinker_rwsem);
>>> -out:
>>> + rcu_read_lock();
>> That new rcu_read_lock() won't help AFAIK, the whole
>> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
>> safe.
>
> In the unregister_shrinker() path, we will wait for the refcount to zero
> before deleting the shrinker from the linked list. Here, we first took
> the rcu lock, and then decrement the refcount of this shrinker.
>
> shrink_slab unregister_shrinker
> =========== ===================
>
> /* wait for B */
> wait_for_completion()
> rcu_read_lock()
>
> shrinker_put() --> (B)
> list_del_rcu()
> /* wait for rcu_read_unlock() */
> kfree_rcu()
>
> /*
> * so this shrinker will not be freed here,
> * and can be used to traverse the next node
> * normally?
> */
> list_for_each_entry()
>
> shrinker_try_get()
> rcu_read_unlock()
>
> Did I miss something?

After calling rcu_read_unlock(), the next shrinker in the list can be freed,
so in the next iteration, use after free might happen?

Is that right?

>
>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
>>> + shrinker_put(shrinker);
>>> + }
>>> + rcu_read_unlock();
>>> cond_resched();
>>> return freed;
>>> }



2023-06-22 18:26:29

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless



On 2023/6/23 01:41, Alan Huang wrote:
>
>> 2023年6月23日 上午12:42,Qi Zheng <[email protected]> 写道:
>>
>>
>>
>> On 2023/6/22 23:12, Vlastimil Babka wrote:
>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>> The shrinker_rwsem is a global read-write lock in
>>>> shrinkers subsystem, which protects most operations
>>>> such as slab shrink, registration and unregistration
>>>> of shrinkers, etc. This can easily cause problems in
>>>> the following cases.
>>>>
>>>> 1) When the memory pressure is high and there are many
>>>> filesystems mounted or unmounted at the same time,
>>>> slab shrink will be affected (down_read_trylock()
>>>> failed).
>>>>
>>>> Such as the real workload mentioned by Kirill Tkhai:
>>>>
>>>> ```
>>>> One of the real workloads from my experience is start
>>>> of an overcommitted node containing many starting
>>>> containers after node crash (or many resuming containers
>>>> after reboot for kernel update). In these cases memory
>>>> pressure is huge, and the node goes round in long reclaim.
>>>> ```
>>>>
>>>> 2) If a shrinker is blocked (such as the case mentioned
>>>> in [1]) and a writer comes in (such as mount a fs),
>>>> then this writer will be blocked and cause all
>>>> subsequent shrinker-related operations to be blocked.
>>>>
>>>> Even if there is no competitor when shrinking slab, there
>>>> may still be a problem. If we have a long shrinker list
>>>> and we do not reclaim enough memory with each shrinker,
>>>> then the down_read_trylock() may be called with high
>>>> frequency. Because of the poor multicore scalability of
>>>> atomic operations, this can lead to a significant drop
>>>> in IPC (instructions per cycle).
>>>>
>>>> We used to implement the lockless slab shrink with
>>>> SRCU [1], but then kernel test robot reported -88.8%
>>>> regression in stress-ng.ramfs.ops_per_sec test case [2],
>>>> so we reverted it [3].
>>>>
>>>> This commit uses the refcount+RCU method [4] proposed by
>>>> by Dave Chinner to re-implement the lockless global slab
>>>> shrink. The memcg slab shrink is handled in the subsequent
>>>> patch.
>>>>
>>>> Currently, the shrinker instances can be divided into
>>>> the following three types:
>>>>
>>>> a) global shrinker instance statically defined in the kernel,
>>>> such as workingset_shadow_shrinker.
>>>>
>>>> b) global shrinker instance statically defined in the kernel
>>>> modules, such as mmu_shrinker in x86.
>>>>
>>>> c) shrinker instance embedded in other structures.
>>>>
>>>> For case a, the memory of shrinker instance is never freed.
>>>> For case b, the memory of shrinker instance will be freed
>>>> after the module is unloaded. But we will call synchronize_rcu()
>>>> in free_module() to wait for RCU read-side critical section to
>>>> exit. For case c, the memory of shrinker instance will be
>>>> dynamically freed by calling kfree_rcu(). So we can use
>>>> rcu_read_{lock,unlock}() to ensure that the shrinker instance
>>>> is valid.
>>>>
>>>> The shrinker::refcount mechanism ensures that the shrinker
>>>> instance will not be run again after unregistration. So the
>>>> structure that records the pointer of shrinker instance can be
>>>> safely freed without waiting for the RCU read-side critical
>>>> section.
>>>>
>>>> In this way, while we implement the lockless slab shrink, we
>>>> don't need to be blocked in unregister_shrinker() to wait
>>>> RCU read-side critical section.
>>>>
>>>> The following are the test results:
>>>>
>>>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>>>
>>>> 1) Before applying this patchset:
>>>>
>>>> setting to a 60 second run per stressor
>>>> dispatching hogs: 9 ramfs
>>>> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
>>>> (secs) (secs) (secs) (real time) (usr+sys time)
>>>> ramfs 880623 60.02 7.71 226.93 14671.45 3753.09
>>>> ramfs:
>>>> 1 System Management Interrupt
>>>> for a 60.03s run time:
>>>> 5762.40s available CPU time
>>>> 7.71s user time ( 0.13%)
>>>> 226.93s system time ( 3.94%)
>>>> 234.64s total time ( 4.07%)
>>>> load average: 8.54 3.06 2.11
>>>> passed: 9: ramfs (9)
>>>> failed: 0
>>>> skipped: 0
>>>> successful run completed in 60.03s (1 min, 0.03 secs)
>>>>
>>>> 2) After applying this patchset:
>>>>
>>>> setting to a 60 second run per stressor
>>>> dispatching hogs: 9 ramfs
>>>> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
>>>> (secs) (secs) (secs) (real time) (usr+sys time)
>>>> ramfs 847562 60.02 7.44 230.22 14120.66 3566.23
>>>> ramfs:
>>>> 4 System Management Interrupts
>>>> for a 60.12s run time:
>>>> 5771.95s available CPU time
>>>> 7.44s user time ( 0.13%)
>>>> 230.22s system time ( 3.99%)
>>>> 237.66s total time ( 4.12%)
>>>> load average: 8.18 2.43 0.84
>>>> passed: 9: ramfs (9)
>>>> failed: 0
>>>> skipped: 0
>>>> successful run completed in 60.12s (1 min, 0.12 secs)
>>>>
>>>> We can see that the ops/s has hardly changed.
>>>>
>>>> [1]. https://lore.kernel.org/lkml/[email protected]/
>>>> [2]. https://lore.kernel.org/lkml/[email protected]/
>>>> [3]. https://lore.kernel.org/all/[email protected]/
>>>> [4]. https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> Signed-off-by: Qi Zheng <[email protected]>
>>>> ---
>>>> include/linux/shrinker.h | 6 ++++++
>>>> mm/vmscan.c | 33 ++++++++++++++-------------------
>>>> 2 files changed, 20 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>>>> index 7bfeb2f25246..b0c6c2df9db8 100644
>>>> --- a/include/linux/shrinker.h
>>>> +++ b/include/linux/shrinker.h
>>>> @@ -74,6 +74,7 @@ struct shrinker {
>>>> refcount_t refcount;
>>>> struct completion completion_wait;
>>>> + struct rcu_head rcu;
>>>> void *private_data;
>>>> @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
>>>> void shrinker_free(struct shrinker *shrinker);
>>>> void unregister_and_free_shrinker(struct shrinker *shrinker);
>>>> +static inline bool shrinker_try_get(struct shrinker *shrinker)
>>>> +{
>>>> + return refcount_inc_not_zero(&shrinker->refcount);
>>>> +}
>>>> +
>>>> static inline void shrinker_put(struct shrinker *shrinker)
>>>> {
>>>> if (refcount_dec_and_test(&shrinker->refcount))
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 6f9c4750effa..767569698946 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -57,6 +57,7 @@
>>>> #include <linux/khugepaged.h>
>>>> #include <linux/rculist_nulls.h>
>>>> #include <linux/random.h>
>>>> +#include <linux/rculist.h>
>>>> #include <asm/tlbflush.h>
>>>> #include <asm/div64.h>
>>>> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>>>> down_write(&shrinker_rwsem);
>>>> refcount_set(&shrinker->refcount, 1);
>>>> init_completion(&shrinker->completion_wait);
>>>> - list_add_tail(&shrinker->list, &shrinker_list);
>>>> + list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>> shrinker->flags |= SHRINKER_REGISTERED;
>>>> shrinker_debugfs_add(shrinker);
>>>> up_write(&shrinker_rwsem);
>>>> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>> wait_for_completion(&shrinker->completion_wait);
>>>> down_write(&shrinker_rwsem);
>>>> - list_del(&shrinker->list);
>>>> + list_del_rcu(&shrinker->list);
>>>> shrinker->flags &= ~SHRINKER_REGISTERED;
>>>> if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>> unregister_memcg_shrinker(shrinker);
>>>> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
>>>> void unregister_and_free_shrinker(struct shrinker *shrinker)
>>>> {
>>>> unregister_shrinker(shrinker);
>>>> - kfree(shrinker);
>>>> + kfree_rcu(shrinker, rcu);
>>>> }
>>>> EXPORT_SYMBOL(unregister_and_free_shrinker);
>>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>> return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>> - if (!down_read_trylock(&shrinker_rwsem))
>>>> - goto out;
>>>> -
>>>> - list_for_each_entry(shrinker, &shrinker_list, list) {
>>>> + rcu_read_lock();
>>>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>> struct shrink_control sc = {
>>>> .gfp_mask = gfp_mask,
>>>> .nid = nid,
>>>> .memcg = memcg,
>>>> };
>>>> + if (!shrinker_try_get(shrinker))
>>>> + continue;
>>>> + rcu_read_unlock();
>>> I don't think you can do this unlock?
>>>> +
>>>> ret = do_shrink_slab(&sc, shrinker, priority);
>>>> if (ret == SHRINK_EMPTY)
>>>> ret = 0;
>>>> freed += ret;
>>>> - /*
>>>> - * Bail out if someone want to register a new shrinker to
>>>> - * prevent the registration from being stalled for long periods
>>>> - * by parallel ongoing shrinking.
>>>> - */
>>>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>>>> - freed = freed ? : 1;
>>>> - break;
>>>> - }
>>>> - }
>>>> - up_read(&shrinker_rwsem);
>>>> -out:
>>>> + rcu_read_lock();
>>> That new rcu_read_lock() won't help AFAIK, the whole
>>> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
>>> safe.
>>
>> In the unregister_shrinker() path, we will wait for the refcount to zero
>> before deleting the shrinker from the linked list. Here, we first took
>> the rcu lock, and then decrement the refcount of this shrinker.
>>
>> shrink_slab unregister_shrinker
>> =========== ===================
>>
>> /* wait for B */
>> wait_for_completion()
>> rcu_read_lock()
>>
>> shrinker_put() --> (B)
>> list_del_rcu()
>> /* wait for rcu_read_unlock() */
>> kfree_rcu()
>>
>> /*
>> * so this shrinker will not be freed here,
>> * and can be used to traverse the next node
>> * normally?
>> */
>> list_for_each_entry()
>>
>> shrinker_try_get()
>> rcu_read_unlock()
>>
>> Did I miss something?
>
> After calling rcu_read_unlock(), the next shrinker in the list can be freed,
> so in the next iteration, use after free might happen?
>
> Is that right?

IIUC, are you talking about the following scenario?

shrink_slab unregister_shrinker a
=========== =====================


rcu_read_unlock()

/* next *shrinker b* was
* removed from shrinker_list.
*/
/* wait for B */
wait_for_completion()
rcu_read_lock()

shrinker_put() --> (B)
list_del_rcu()
/* wait for rcu_read_unlock() */
kfree_rcu()

list_for_each_entry()

shrinker_try_get()
rcu_read_unlock()

When the next *shrinker b* is deleted, the *shrinker a* has not been
deleted from the shrinker_list, so it will point a->next to b->next.
Then in the next iteration, we will get the b->next instead of b?


>
>>
>>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
>>>> + shrinker_put(shrinker);
>>>> + }
>>>> + rcu_read_unlock();
>>>> cond_resched();
>>>> return freed;
>>>> }
>