2023-06-05 19:05:15

by Kirill Tkhai

Subject: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

Kernel test robot reports a -88.8% regression in the stress-ng.ramfs.ops_per_sec
test case caused by commit f95bdb700bc6 ("mm: vmscan: make global slab
shrink lockless"). Qi Zheng investigated and found that the reason is the
long synchronize_srcu() occurring in unregister_shrinker().

This patch fixes the problem by using the new unregistration interfaces,
which split unregister_shrinker() into two parts. The first part only
notifies the shrinker subsystem of the unregistration and prevents any
future calls of the shrinker's methods. The second part completes the
unregistration and ensures that the struct shrinker is no longer used
during shrinker chain iteration, so the shrinker memory may be freed.
Since the long second part is called asynchronously from delayed work,
it hides the synchronize_srcu() delay from the user.
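
For reference, a simplified sketch of what each half of the split is
responsible for (the real implementation is added by the earlier patches
in this series; only the intent is shown here):

void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
{
	/*
	 * Unhook the shrinker from the shrinker list/idr so that no new
	 * shrink_slab() iteration can call its methods. This part is
	 * cheap: no synchronize_srcu() is needed here.
	 */
}

void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
{
	/*
	 * Wait for SRCU readers that may still be walking the shrinker
	 * chain, then free the shrinker-private data (e.g. nr_deferred).
	 * This is the slow part, so it is called asynchronously, from
	 * destroy_super_work() in this patch.
	 */
}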

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 8d8d68799b34..f3e4f205ec79 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -159,6 +159,7 @@ static void destroy_super_work(struct work_struct *work)
destroy_work);
int i;

+ unregister_shrinker_delayed_finalize(&s->s_shrink);
for (i = 0; i < SB_FREEZE_LEVELS; i++)
percpu_free_rwsem(&s->s_writers.rw_sem[i]);
kfree(s);
@@ -327,7 +328,7 @@ void deactivate_locked_super(struct super_block *s)
{
struct file_system_type *fs = s->s_type;
if (atomic_dec_and_test(&s->s_active)) {
- unregister_shrinker(&s->s_shrink);
+ unregister_shrinker_delayed_initiate(&s->s_shrink);
fs->kill_sb(s);

/*



2023-06-06 01:20:12

by Roman Gushchin

Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

On Mon, Jun 05, 2023 at 10:03:25PM +0300, Kirill Tkhai wrote:
> Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
> test case caused by commit: f95bdb700bc6 ("mm: vmscan: make global slab
> shrink lockless"). Qi Zheng investigated that the reason is in long SRCU's
> synchronize_srcu() occuring in unregister_shrinker().
>
> This patch fixes the problem by using new unregistration interfaces,
> which split unregister_shrinker() in two parts. First part actually only
> notifies shrinker subsystem about the fact of unregistration and it prevents
> future shrinker methods calls. The second part completes the unregistration
> and it insures, that struct shrinker is not used during shrinker chain
> iteration anymore, so shrinker memory may be freed. Since the long second
> part is called from delayed work asynchronously, it hides synchronize_srcu()
> delay from a user.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> fs/super.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index 8d8d68799b34..f3e4f205ec79 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -159,6 +159,7 @@ static void destroy_super_work(struct work_struct *work)
> destroy_work);
> int i;
>
> + unregister_shrinker_delayed_finalize(&s->s_shrink);
> for (i = 0; i < SB_FREEZE_LEVELS; i++)
> percpu_free_rwsem(&s->s_writers.rw_sem[i]);
> kfree(s);
> @@ -327,7 +328,7 @@ void deactivate_locked_super(struct super_block *s)
> {
> struct file_system_type *fs = s->s_type;
> if (atomic_dec_and_test(&s->s_active)) {
> - unregister_shrinker(&s->s_shrink);
> + unregister_shrinker_delayed_initiate(&s->s_shrink);

Hm, it makes the API more complex and easier to mess with. Like what will happen
if the second part is never called? Or it's called without the first part being
called first?

Isn't it possible to hide it from a user and call the second part from a work
context automatically?

Thanks!

2023-06-06 01:47:17

by Dave Chinner

Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> On Mon, Jun 05, 2023 at 10:03:25PM +0300, Kirill Tkhai wrote:
> > Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
> > test case caused by commit: f95bdb700bc6 ("mm: vmscan: make global slab
> > shrink lockless"). Qi Zheng investigated that the reason is in long SRCU's
> > synchronize_srcu() occuring in unregister_shrinker().
> >
> > This patch fixes the problem by using new unregistration interfaces,
> > which split unregister_shrinker() in two parts. First part actually only
> > notifies shrinker subsystem about the fact of unregistration and it prevents
> > future shrinker methods calls. The second part completes the unregistration
> > and it insures, that struct shrinker is not used during shrinker chain
> > iteration anymore, so shrinker memory may be freed. Since the long second
> > part is called from delayed work asynchronously, it hides synchronize_srcu()
> > delay from a user.
> >
> > Signed-off-by: Kirill Tkhai <[email protected]>
> > ---
> > fs/super.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/super.c b/fs/super.c
> > index 8d8d68799b34..f3e4f205ec79 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -159,6 +159,7 @@ static void destroy_super_work(struct work_struct *work)
> > destroy_work);
> > int i;
> >
> > + unregister_shrinker_delayed_finalize(&s->s_shrink);
> > for (i = 0; i < SB_FREEZE_LEVELS; i++)
> > percpu_free_rwsem(&s->s_writers.rw_sem[i]);
> > kfree(s);
> > @@ -327,7 +328,7 @@ void deactivate_locked_super(struct super_block *s)
> > {
> > struct file_system_type *fs = s->s_type;
> > if (atomic_dec_and_test(&s->s_active)) {
> > - unregister_shrinker(&s->s_shrink);
> > + unregister_shrinker_delayed_initiate(&s->s_shrink);
>
> Hm, it makes the API more complex and easier to mess with. Like what will happen
> if the second part is never called? Or it's called without the first part being
> called first?

Bad things.

Also, it doesn't fix the three other unregister_shrinker() calls in
the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
path.

Those are just some of the unregister_shrinker() calls that have
dynamic contexts that would also need this same fix; I haven't
audited the 3 dozen other unregister_shrinker() calls around the
kernel to determine if any of them need similar treatment, too.

IOWs, this patchset is purely a band-aid to fix the reported
regression, not an actual fix for the underlying problems caused by
moving the shrinker infrastructure to SRCU protection. This is why
I really want the SRCU changeover reverted.

Not only are significant changes to the API necessary, it has also
put the entire shrinker path under an SRCU critical section. AIUI,
this means that while the shrinkers are running the RCU grace period
cannot expire and no RCU-freed memory will actually get freed until
the SRCU read lock is dropped by the shrinker.

Given the superblock shrinkers are freeing dentry and inode objects
by RCU freeing, this is also a fairly significant change of
behaviour. i.e. cond_resched() in the shrinker processing loops no
longer allows RCU grace periods to expire and memory to be freed
while the shrinkers are running.

Are there problems this will cause? I don't know, but I'm pretty
sure they haven't even been considered until now....

> Isn't it possible to hide it from a user and call the second part from a work
> context automatically?

Nope, because it has to be done before the struct shrinker is freed.
Those are embedded into other structures rather than being
dynamically allocated objects. Hence the synchronize_srcu() has to
complete before the structure the shrinker is embedded in is freed.

Now, this can be dealt with by having register_shrinker() return an
allocated struct shrinker and the callers only keep a pointer, but
that's an even bigger API change. But, IMO, it is an API change that
should have been done before SRCU was introduced precisely because
it allows decoupling of shrinker execution and completion from
the owning structure.

Then we can stop shrinker execution, wait for it to complete and
prevent future execution in unregister_shrinker(), then punt the
expensive shrinker list removal to background work where processing
delays just don't matter for dead shrinker instances. It doesn't
need SRCU at all...
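
Purely for illustration - the names and signatures below are invented,
this is not an existing kernel API - the caller side of such a change
would look something like:

	struct shrinker *shrinker;

	/* shrinker instance is allocated by the infrastructure */
	shrinker = register_shrinker(&sb_shrink_ops, "sb-%s", type->name);
	if (!shrinker)
		return -ENOMEM;
	sb->s_shrink = shrinker;	/* caller keeps only a pointer */

	.....

	/* unmount path: stop execution, wait for running instances,
	 * then punt list removal and freeing to background work */
	unregister_shrinker(sb->s_shrink);
	sb->s_shrink = NULL;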

-Dave.
--
Dave Chinner
[email protected]

2023-06-06 03:20:17

by Roman Gushchin

Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

On Tue, Jun 06, 2023 at 11:24:32AM +1000, Dave Chinner wrote:
> On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> > On Mon, Jun 05, 2023 at 10:03:25PM +0300, Kirill Tkhai wrote:
> > > Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
> > > test case caused by commit: f95bdb700bc6 ("mm: vmscan: make global slab
> > > shrink lockless"). Qi Zheng investigated that the reason is in long SRCU's
> > > synchronize_srcu() occuring in unregister_shrinker().
> > >
> > > This patch fixes the problem by using new unregistration interfaces,
> > > which split unregister_shrinker() in two parts. First part actually only
> > > notifies shrinker subsystem about the fact of unregistration and it prevents
> > > future shrinker methods calls. The second part completes the unregistration
> > > and it insures, that struct shrinker is not used during shrinker chain
> > > iteration anymore, so shrinker memory may be freed. Since the long second
> > > part is called from delayed work asynchronously, it hides synchronize_srcu()
> > > delay from a user.
> > >
> > > Signed-off-by: Kirill Tkhai <[email protected]>
> > > ---
> > > fs/super.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/super.c b/fs/super.c
> > > index 8d8d68799b34..f3e4f205ec79 100644
> > > --- a/fs/super.c
> > > +++ b/fs/super.c
> > > @@ -159,6 +159,7 @@ static void destroy_super_work(struct work_struct *work)
> > > destroy_work);
> > > int i;
> > >
> > > + unregister_shrinker_delayed_finalize(&s->s_shrink);
> > > for (i = 0; i < SB_FREEZE_LEVELS; i++)
> > > percpu_free_rwsem(&s->s_writers.rw_sem[i]);
> > > kfree(s);
> > > @@ -327,7 +328,7 @@ void deactivate_locked_super(struct super_block *s)
> > > {
> > > struct file_system_type *fs = s->s_type;
> > > if (atomic_dec_and_test(&s->s_active)) {
> > > - unregister_shrinker(&s->s_shrink);
> > > + unregister_shrinker_delayed_initiate(&s->s_shrink);
> >
> > Hm, it makes the API more complex and easier to mess with. Like what will happen
> > if the second part is never called? Or it's called without the first part being
> > called first?
>
> Bad things.

Agree.

> Also, it doesn't fix the three other unregister_shrinker() calls in
> the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> path.
>
> Those are just some of the unregister_shrinker() calls that have
> dynamic contexts that would also need this same fix; I haven't
> audited the 3 dozen other unregister_shrinker() calls around the
> kernel to determine if any of them need similar treatment, too.
>
> IOWs, this patchset is purely a band-aid to fix the reported
> regression, not an actual fix for the underlying problems caused by
> moving the shrinker infrastructure to SRCU protection. This is why
> I really want the SRCU changeover reverted.
>
> Not only are the significant changes the API being necessary, it's
> put the entire shrinker paths under a SRCU critical section. AIUI,
> this means while the shrinkers are running the RCU grace period
> cannot expire and no RCU freed memory will actually get freed until
> the srcu read lock is dropped by the shrinker.
>
> Given the superblock shrinkers are freeing dentry and inode objects
> by RCU freeing, this is also a fairly significant change of
> behaviour. i.e. cond_resched() in the shrinker processing loops no
> longer allows RCU grace periods to expire and have memory freed with
> the shrinkers are running.
>
> Are there problems this will cause? I don't know, but I'm pretty
> sure they haven't even been considered until now....
>
> > Isn't it possible to hide it from a user and call the second part from a work
> > context automatically?
>
> Nope, because it has to be done before the struct shrinker is freed.
> Those are embedded into other structures rather than being
> dynamically allocated objects.

This part we might consider revisiting, if it helps to solve other problems.
Having an extra memory allocation (or two) per mount-point doesn't look
that expensive. Again, iff it helps with more important problems.

Thanks!

2023-06-06 07:18:31

by Dave Chinner

Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

On Mon, Jun 05, 2023 at 07:56:59PM -0700, Roman Gushchin wrote:
> On Tue, Jun 06, 2023 at 11:24:32AM +1000, Dave Chinner wrote:
> > On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> > > Isn't it possible to hide it from a user and call the second part from a work
> > > context automatically?
> >
> > Nope, because it has to be done before the struct shrinker is freed.
> > Those are embedded into other structures rather than being
> > dynamically allocated objects.
>
> This part we might consider to revisit, if it helps to solve other problems.
> Having an extra memory allocation (or two) per mount-point doesn't look
> that expensive. Again, iff it helps with more important problems.

Ah, I guess if you're concerned about memory allocation overhead
during register_shrinker() calls then you really aren't familiar
with what register_shrinker() does on memcg and numa aware
shrinkers?

Let's ignore the fact that we could roll the shrinker structure
allocation into the existing shrinker->nr_deferred array allocation
(so it's effectively a zero cost modification), and just look at
what a memcg enabled shrinker must initialise if it expands the
shrinker info array because the index returned from idr_alloc()
is larger than the current array:

	for each memcg {
		for_each_node {
			info = kvmalloc_node();
			rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
		}
	}

Hmmmm?

So, there really isn't any additional cost, it completely decouples
the shrinker infrastructure from the subsystem shrinker
implementations, it enables the shrinker to control infrastructure
teardown independently of the subsystem that registered the
shrinker, and it still gives guarantees that the shrinker is never
run after unregister_shrinker() completes. What's not to like?
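
To be concrete about the "zero cost" point above - again, just a sketch
of a hypothetical allocating register path, not what the current code
does - the shrinker could be allocated together with its nr_deferred
array in a single allocation:

	size_t size = sizeof(atomic_long_t);

	/* 'flags' would be passed in by the caller of this hypothetical API */
	if (flags & SHRINKER_NUMA_AWARE)
		size *= nr_node_ids;

	shrinker = kzalloc(sizeof(*shrinker) + size, GFP_KERNEL);
	if (!shrinker)
		return NULL;
	shrinker->nr_deferred = (atomic_long_t *)(shrinker + 1);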

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-06-06 16:25:43

by Roman Gushchin

Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

On Tue, Jun 06, 2023 at 04:51:40PM +1000, Dave Chinner wrote:
> On Mon, Jun 05, 2023 at 07:56:59PM -0700, Roman Gushchin wrote:
> > On Tue, Jun 06, 2023 at 11:24:32AM +1000, Dave Chinner wrote:
> > > On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> > > > Isn't it possible to hide it from a user and call the second part from a work
> > > > context automatically?
> > >
> > > Nope, because it has to be done before the struct shrinker is freed.
> > > Those are embedded into other structures rather than being
> > > dynamically allocated objects.
> >
> > This part we might consider to revisit, if it helps to solve other problems.
> > Having an extra memory allocation (or two) per mount-point doesn't look
> > that expensive. Again, iff it helps with more important problems.
>
> Ah, I guess if you're concerned about memory allocation overhead
> during register_shrinker() calls then you really aren't familiar
> with what register_shrinker() does on memcg and numa aware
> shrinkers?

What a nice way to agree with an idea :)

>
> Let's ignore the fact that we could roll the shrinker structure
> allocation into the existing shrinker->nr_deferred array allocation
> (so it's effectively a zero cost modification), and just look at
> what a memcg enabled shrinker must initialise if it expands the
> shrinker info array because the index returned from idr_alloc()
> is larger than the current array:
>
> for each memcg {
> for_each_node {
> info = kvmalloc_node();
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
> }
> }
>
> Hmmmm?
>
> So, there really isn't any additional cost, it completely decouples
> the shrinker infrastructure from the subsystem shrinker
> implementations, it enables the shrinker to control infrastructure
> teardown independently of the subsystem that registered the
> shrinker, and it still gives guarantees that the shrinker is never
> run after unregister_shrinker() completes. What's not to like?

Yep, this sounds like a good idea.

Thanks.

2023-06-06 21:42:45

by Kirill Tkhai

Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

On 06.06.2023 04:24, Dave Chinner wrote:
> On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
>> On Mon, Jun 05, 2023 at 10:03:25PM +0300, Kirill Tkhai wrote:
>>> Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
>>> test case caused by commit: f95bdb700bc6 ("mm: vmscan: make global slab
>>> shrink lockless"). Qi Zheng investigated that the reason is in long SRCU's
>>> synchronize_srcu() occuring in unregister_shrinker().
>>>
>>> This patch fixes the problem by using new unregistration interfaces,
>>> which split unregister_shrinker() in two parts. First part actually only
>>> notifies shrinker subsystem about the fact of unregistration and it prevents
>>> future shrinker methods calls. The second part completes the unregistration
>>> and it insures, that struct shrinker is not used during shrinker chain
>>> iteration anymore, so shrinker memory may be freed. Since the long second
>>> part is called from delayed work asynchronously, it hides synchronize_srcu()
>>> delay from a user.
>>>
>>> Signed-off-by: Kirill Tkhai <[email protected]>
>>> ---
>>> fs/super.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/super.c b/fs/super.c
>>> index 8d8d68799b34..f3e4f205ec79 100644
>>> --- a/fs/super.c
>>> +++ b/fs/super.c
>>> @@ -159,6 +159,7 @@ static void destroy_super_work(struct work_struct *work)
>>> destroy_work);
>>> int i;
>>>
>>> + unregister_shrinker_delayed_finalize(&s->s_shrink);
>>> for (i = 0; i < SB_FREEZE_LEVELS; i++)
>>> percpu_free_rwsem(&s->s_writers.rw_sem[i]);
>>> kfree(s);
>>> @@ -327,7 +328,7 @@ void deactivate_locked_super(struct super_block *s)
>>> {
>>> struct file_system_type *fs = s->s_type;
>>> if (atomic_dec_and_test(&s->s_active)) {
>>> - unregister_shrinker(&s->s_shrink);
>>> + unregister_shrinker_delayed_initiate(&s->s_shrink);
>>
>> Hm, it makes the API more complex and easier to mess with. Like what will happen
>> if the second part is never called? Or it's called without the first part being
>> called first?
>
> Bad things.
>
> Also, it doesn't fix the three other unregister_shrinker() calls in
> the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> path.
>
> Those are just some of the unregister_shrinker() calls that have
> dynamic contexts that would also need this same fix; I haven't
> audited the 3 dozen other unregister_shrinker() calls around the
> kernel to determine if any of them need similar treatment, too.
>
> IOWs, this patchset is purely a band-aid to fix the reported
> regression, not an actual fix for the underlying problems caused by
> moving the shrinker infrastructure to SRCU protection. This is why
> I really want the SRCU changeover reverted.
>
> Not only are the significant changes the API being necessary, it's
> put the entire shrinker paths under a SRCU critical section. AIUI,
> this means while the shrinkers are running the RCU grace period
> cannot expire and no RCU freed memory will actually get freed until
> the srcu read lock is dropped by the shrinker.

Why so? Don't SRCU and RCU have different grace periods, so that they
do not prolong each other?

Also, it looks like every SRCU user has its own namespace, like
shrinker_srcu for the shrinker code. And different SRCU namespaces never
prolong each other, do they?!

> Given the superblock shrinkers are freeing dentry and inode objects
> by RCU freeing, this is also a fairly significant change of
> behaviour. i.e. cond_resched() in the shrinker processing loops no
> longer allows RCU grace periods to expire and have memory freed with
> the shrinkers are running.
>
> Are there problems this will cause? I don't know, but I'm pretty
> sure they haven't even been considered until now....
>
>> Isn't it possible to hide it from a user and call the second part from a work
>> context automatically?
>
> Nope, because it has to be done before the struct shrinker is freed.
> Those are embedded into other structures rather than being
> dynamically allocated objects. Hence the synchronise_srcu() has to
> complete before the structure the shrinker is embedded in is freed.
>
> Now, this can be dealt with by having register_shrinker() return an
> allocated struct shrinker and the callers only keep a pointer, but
> that's an even bigger API change. But, IMO, it is an API change that
> should have been done before SRCU was introduced precisely because
> it allows decoupling of shrinker execution and completion from
> the owning structure.
>
> Then we can stop shrinker execution, wait for it to complete and
> prevent future execution in unregister_shrinker(), then punt the
> expensive shrinker list removal to background work where processing
> delays just don't matter for dead shrinker instances. It doesn't
> need SRCU at all...
>
> -Dave.


2023-06-06 22:46:27

by Dave Chinner

Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

On Wed, Jun 07, 2023 at 12:21:42AM +0300, Kirill Tkhai wrote:
> On 06.06.2023 04:24, Dave Chinner wrote:
> > On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> >> On Mon, Jun 05, 2023 at 10:03:25PM +0300, Kirill Tkhai wrote:
> >>> Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
> >>> test case caused by commit: f95bdb700bc6 ("mm: vmscan: make global slab
> >>> shrink lockless"). Qi Zheng investigated that the reason is in long SRCU's
> >>> synchronize_srcu() occuring in unregister_shrinker().
> >>>
> >>> This patch fixes the problem by using new unregistration interfaces,
> >>> which split unregister_shrinker() in two parts. First part actually only
> >>> notifies shrinker subsystem about the fact of unregistration and it prevents
> >>> future shrinker methods calls. The second part completes the unregistration
> >>> and it insures, that struct shrinker is not used during shrinker chain
> >>> iteration anymore, so shrinker memory may be freed. Since the long second
> >>> part is called from delayed work asynchronously, it hides synchronize_srcu()
> >>> delay from a user.
> >>>
> >>> Signed-off-by: Kirill Tkhai <[email protected]>
> >>> ---
> >>> fs/super.c | 3 ++-
> >>> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/fs/super.c b/fs/super.c
> >>> index 8d8d68799b34..f3e4f205ec79 100644
> >>> --- a/fs/super.c
> >>> +++ b/fs/super.c
> >>> @@ -159,6 +159,7 @@ static void destroy_super_work(struct work_struct *work)
> >>> destroy_work);
> >>> int i;
> >>>
> >>> + unregister_shrinker_delayed_finalize(&s->s_shrink);
> >>> for (i = 0; i < SB_FREEZE_LEVELS; i++)
> >>> percpu_free_rwsem(&s->s_writers.rw_sem[i]);
> >>> kfree(s);
> >>> @@ -327,7 +328,7 @@ void deactivate_locked_super(struct super_block *s)
> >>> {
> >>> struct file_system_type *fs = s->s_type;
> >>> if (atomic_dec_and_test(&s->s_active)) {
> >>> - unregister_shrinker(&s->s_shrink);
> >>> + unregister_shrinker_delayed_initiate(&s->s_shrink);
> >>
> >> Hm, it makes the API more complex and easier to mess with. Like what will happen
> >> if the second part is never called? Or it's called without the first part being
> >> called first?
> >
> > Bad things.
> >
> > Also, it doesn't fix the three other unregister_shrinker() calls in
> > the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> > path.
> >
> > Those are just some of the unregister_shrinker() calls that have
> > dynamic contexts that would also need this same fix; I haven't
> > audited the 3 dozen other unregister_shrinker() calls around the
> > kernel to determine if any of them need similar treatment, too.
> >
> > IOWs, this patchset is purely a band-aid to fix the reported
> > regression, not an actual fix for the underlying problems caused by
> > moving the shrinker infrastructure to SRCU protection. This is why
> > I really want the SRCU changeover reverted.
> >
> > Not only are the significant changes the API being necessary, it's
> > put the entire shrinker paths under a SRCU critical section. AIUI,
> > this means while the shrinkers are running the RCU grace period
> > cannot expire and no RCU freed memory will actually get freed until
> > the srcu read lock is dropped by the shrinker.
>
> Why so? Doesn't SRCU and RCU have different grace period and they does not prolong
> each other?

No idea - Documentation/RCU/whatisRCU.rst doesn't describe any
differences between SRCU and RCU except for "use SRCU if you need to
sleep in the read side", and there's no discussion of how they
interact, either. Maybe there's some discussion in other RCU
documentation, but there's nothing in the "how to use RCU"
documentation that tells me they use different grace period
definitions...

> Also, it looks like every SRCU has it's own namespace like shrinker_srcu for shrinker.
> Don't different SRCU namespaces never prolong each other?!

Right, SRCU vs SRCU is well defined. What is not clear from anything
I've read is SRCU vs RCU interactions, so I can only assume from
the shared "RCU" in the name that there are shared implementation
details and interactions...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-06-08 17:04:37

by Theodore Ts'o

Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

On Tue, Jun 06, 2023 at 11:24:32AM +1000, Dave Chinner wrote:
> On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> > Hm, it makes the API more complex and easier to mess with. Like what will happen
> > if the second part is never called? Or it's called without the first part being
> > called first?
>
> Bad things.
>
> Also, it doesn't fix the three other unregister_shrinker() calls in
> the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> path.
>
> Those are just some of the unregister_shrinker() calls that have
> dynamic contexts that would also need this same fix; I haven't
> audited the 3 dozen other unregister_shrinker() calls around the
> kernel to determine if any of them need similar treatment, too.
>
> IOWs, this patchset is purely a band-aid to fix the reported
> regression, not an actual fix for the underlying problems caused by
> moving the shrinker infrastructure to SRCU protection. This is why
> I really want the SRCU changeover reverted.

There's been so much traffic on linux-fsdevel that I missed this thread
until Darrick pointed it out to me today. (Thanks, Darrick!)

From his description, and my quick read-through of this thread....
I'm worried.

Given that we're at -rc5 now, and the file system folks didn't get
consulted until fairly late in the process, and the fact that this
may cause use-after-free problems that could lead to security issues,
perhaps we should consider reverting the SRCU changeover now, and try
again for the next merge window?

Thanks!!

- Ted

2023-06-08 23:39:12

by Dave Chinner

Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

On Thu, Jun 08, 2023 at 12:36:22PM -0400, Theodore Ts'o wrote:
> On Tue, Jun 06, 2023 at 11:24:32AM +1000, Dave Chinner wrote:
> > On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> > > Hm, it makes the API more complex and easier to mess with. Like what will happen
> > > if the second part is never called? Or it's called without the first part being
> > > called first?
> >
> > Bad things.
> >
> > Also, it doesn't fix the three other unregister_shrinker() calls in
> > the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> > path.
> >
> > Those are just some of the unregister_shrinker() calls that have
> > dynamic contexts that would also need this same fix; I haven't
> > audited the 3 dozen other unregister_shrinker() calls around the
> > kernel to determine if any of them need similar treatment, too.
> >
> > IOWs, this patchset is purely a band-aid to fix the reported
> > regression, not an actual fix for the underlying problems caused by
> > moving the shrinker infrastructure to SRCU protection. This is why
> > I really want the SRCU changeover reverted.
>
> There's been so much traffic on linux-fsdevel so I missed this thread
> until Darrick pointed it out to me today. (Thanks, Darrick!)
>
> From his description, and my quick read-through of this thread....
> I'm worried.
>
> Given that we're at -rc5 now, and the file system folks didn't get
> consulted until fairly late in the progress, and the fact that this
> may cause use-after-free problems that could lead to security issues,
> perhaps we shoould consider reverting the SRCU changeover now, and try
> again for the next merge window?

Yes, please, because I think we can fix this in a much better way
and make things a whole lot simpler at the same time.

The root cause of the SRCU usage is that mm/memcg developers still
haven't unified the non-memcg and the memcg-based shrinker
implementations. shrink_slab_memcg() doesn't require SRCU
protection as it does not iterate the shrinker list at all; it
requires *shrinker instance* lifetime protection. The problem is
shrink_slab() doing the root memcg/global shrinker work - it
iterates the shrinker list directly, and this is the list walk that
SRCU is needed for to "make shrinkers lockless".

Going back to shrink_slab_memcg(), it does a lookup of the shrinker
instance by idr_find(); idr_find() is a wrapper around
radix_tree_lookup(), which means we can use RCU protection and
reference counting to both validate the shrinker instance *and*
guarantee that it isn't freed from under us as we execute the
shrinker.

This requires, as I mentioned elsewhere in this thread, that the
shrinker instance be dynamically allocated, not embedded in other
structures. Dynamically allocated shrinker instances and reference
counting them means we can do this in shrink_slab_memcg() to ensure
shrinker instance validity and lifetime rules are followed:


	rcu_read_lock();
	shrinker = idr_find(&shrinker_idr, i);
	if (!shrinker ||
	    !refcount_inc_not_zero(&shrinker->refcount)) {
		/* shrinker is being torn down */
		clear_bit(i, info->map);
		rcu_read_unlock();
		continue;
	}
	rcu_read_unlock();

	/* do shrinker stuff */

	if (refcount_dec_and_test(&shrinker->refcount)) {
		/* shrinker is being torn down, waiting for us */
		wake_up(&shrinker->completion_wait);
	}
	/* unsafe to reference shrinker now */

And we enable the shrinker to run simply by:

shrinker_register()
{
	.....
	/* allow the shrinker to run now */
	refcount_set(&shrinker->refcount, 1);
	return 0;
}

We then shut down the shrinker so we can tear it down like so:

shrinker_unregister()
{
	DECLARE_WAITQUEUE(wait, current);

	add_wait_queue_exclusive(&shrinker->completion_wait, &wait);
	if (!refcount_dec_and_test(&shrinker->refcount)) {
		/* Wait for running instances to exit */
		__set_current_state(TASK_UNINTERRUPTIBLE);
		schedule();
	}
	remove_wait_queue(&shrinker->completion_wait, &wait);

	/* We own the entire shrinker instance now, start tearing it down */

	.....

	/* Free the shrinker itself after an RCU grace period expires */
	kfree_rcu(shrinker, rcu_head);
}

So, essentially we don't need SRCU at all to do lockless shrinker
lookup for the memcg shrinker side of the fence, nor do we need
synchronize_srcu() to wait for shrinker instances to finish running
before we can tear stuff down. There is no global state in this at
all; everything is per-shrinker instance.

But SRCU is needed to protect the global shrinker list walk because
it hasn't been converted to always use the root memcg and be
iterated as if it is just another memcg. IOWs, using SRCU to
protect the global shrinker list walk is effectively slapping a
bandaid over a more fundamental problem that we've known about for
a long time.

So the first thing that has to be done here is unify the shrinker
infrastructure under the memcg implementation. The global shrinker
state should be tracked in the root memcg, just like any other memcg
shrinker is tracked. If memcgs are not enabled, then we should be
creating a dummy global memcg, returned by a get_root_memcg() helper
when memcgs are disabled, that tracks all the global shrinker state.
Then shrink_slab() just doesn't care about what memcg is passed to it,
it just does the same thing.
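
Something along these lines (sketch only - the dummy instance below is
invented for illustration):

static inline struct mem_cgroup *get_root_memcg(void)
{
#ifdef CONFIG_MEMCG
	return root_mem_cgroup;
#else
	/* static dummy that only exists to carry global shrinker state */
	return &global_shrinker_memcg;
#endif
}

so that shrink_slab() always has a non-NULL memcg to hand to the
memcg-aware iteration.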

IOWs, shrink_slab gets entirely replaced by shrink_slab_memcg(), and
the need for SRCU goes away entirely because shrinker instance
lookups are all RCU+refcount protected.

Yes, this requires we change how shrinker instances are instantiated
by subsystems, but this is mostly a simple transformation of existing
code. But it doesn't require changing shrinker implementations, it
doesn't require a new teardown API, and it doesn't change the
context under which shrinkers might run.

All the existing RCU protected stuff in the shrinker maps and memcgs
can remain that way. We can also keep the shrinker rwsem/mutex for
all the little internal bits of setup/teardown/non-rcu coordination
that are needed; once the list iteration is lockless, there will be
almost no contention on that lock at all...

This isn't all that hard - it's just replicating a well known design
pattern (i.e. refcount+RCU) we use all over the kernel combined with
taking advantage of IDR being backed by RCU-safe infrastructure.

If I had been cc'd on the original SRCU patches, this is exactly
what I would have suggested as a better solution to the problem. We
end up with cleaner, more robust and better performing
infrastructure. This is far better than layering more complexity on
top of what is really a poor foundation....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-06-09 00:38:10

by Andrew Morton

Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

On Fri, 9 Jun 2023 09:17:54 +1000 Dave Chinner <[email protected]> wrote:

> > Given that we're at -rc5 now, and the file system folks didn't get
> > consulted until fairly late in the progress, and the fact that this
> > may cause use-after-free problems that could lead to security issues,
> > perhaps we shoould consider reverting the SRCU changeover now, and try
> > again for the next merge window?
>
> Yes, please, because I think we can fix this in a much better way
> and make things a whole lot simpler at the same time.

Qi Zheng, if agreeable could you please prepare and send reverts of
475733dda5a ("mm: vmscan: add shrinker_srcu_generation") and of
f95bdb700bc6bb74 ("mm: vmscan: make global slab shrink lockless")?

Thanks.

2023-06-09 03:04:05

by Qi Zheng

Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration



On 2023/6/9 08:27, Andrew Morton wrote:
> On Fri, 9 Jun 2023 09:17:54 +1000 Dave Chinner <[email protected]> wrote:
>
>>> Given that we're at -rc5 now, and the file system folks didn't get
>>> consulted until fairly late in the progress, and the fact that this
>>> may cause use-after-free problems that could lead to security issues,
>>> perhaps we shoould consider reverting the SRCU changeover now, and try
>>> again for the next merge window?
>>
>> Yes, please, because I think we can fix this in a much better way
>> and make things a whole lot simpler at the same time.
>
> Qi Zheng, if agreeable could you please prepare and send reverts of
> 475733dda5a ("mm: vmscan: add shrinker_srcu_generation") and of
> f95bdb700bc6bb74 ("mm: vmscan: make global slab shrink lockless")?

OK. After reading the proposal suggested by Dave, I think it is more
feasible. I will revert the previous changes ASAP, and then try to
implement Dave's proposal.

Thanks,
Qi

>
> Thanks.