2020-02-21 20:23:18

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations

On Fri, Feb 21, 2020 at 02:14:55PM +0100, Uladzislau Rezki wrote:
> On Thu, Feb 20, 2020 at 04:30:35PM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 19, 2020 at 11:52:33PM -0500, Theodore Y. Ts'o wrote:
> > > On Tue, Feb 18, 2020 at 06:08:57PM +0100, Uladzislau Rezki wrote:
> > > > now it becomes possible to use it like:
> > > > ...
> > > > void *p = kvmalloc(PAGE_SIZE);
> > > > kvfree_rcu(p);
> > > > ...
> > > > also have a look at the example in the mm/list_lru.c diff.
> > >
> > > I certainly like the interface, thanks! I'm going to be pushing
> > > patches to fix this using ext4_kvfree_array_rcu() since there are a
> > > number of bugs in ext4's online resizing which appear to be hitting
> > > multiple cloud providers (with reports from both AWS and GCP) and I
> > > want something which can be easily backported to stable kernels.
> > >
> > > But once kvfree_rcu() hits mainline, I'll switch ext4 to use it, since
> > > your kvfree_rcu() is definitely more efficient than my expedient
> > > jury-rig.
> > >
> > > I don't feel entirely competent to review the implementation, but I do
> > > have one question. It looks like the rcutiny implementation of
> > > kfree_call_rcu() isn't going to do the right thing with kvfree_rcu(p).
> > > Am I missing something?
> >
> > Good catch! I believe that rcu_reclaim_tiny() would need to do
> > kvfree() instead of its current kfree().
> >
> > Vlad, anything I am missing here?
> >
> Yes something like that. There are some open questions about
> realization, when it comes to tiny RCU. Since we are talking
> about "headless" kvfree_rcu() interface, i mean we can not link
> freed "objects" between each other, instead we should place a
> pointer directly into array that will be drained later on.
>
> It would be much more easier to achieve that if we were talking
> about the interface like: kvfree_rcu(p, rcu), but that is not our
> case :)
>
> So, for CONFIG_TINY_RCU we should implement very similar what we
> have done for CONFIG_TREE_RCU or just simply do like Ted has done
> with his
>
> void ext4_kvfree_array_rcu(void *to_free)
>
> i mean:
>
> local_irq_save(flags);
> struct foo *ptr = kzalloc(sizeof(*ptr), GFP_ATOMIC);
>
> if (ptr) {
> ptr->ptr = to_free;
> call_rcu(&ptr->rcu, kvfree_callback);
> }
> local_irq_restore(flags);

We really do still need the emergency case, in this case for when
kzalloc() returns NULL. Which does indeed mean an rcu_head in the thing
being freed. Otherwise, you end up with an out-of-memory deadlock where
you could free memory only if you had memor to allocate.

> Also there is one more open question what to do if GFP_ATOMIC
> gets failed in case of having low memory condition. Probably
> we can make use of "mempool interface" that allows to have
> min_nr guaranteed pre-allocated pages.

But we really do still need to handle the case where everything runs out,
even the pre-allocated pages.

Thanx, Paul


2020-02-22 22:24:59

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations

On Fri, Feb 21, 2020 at 12:22:50PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 21, 2020 at 02:14:55PM +0100, Uladzislau Rezki wrote:
> > On Thu, Feb 20, 2020 at 04:30:35PM -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 19, 2020 at 11:52:33PM -0500, Theodore Y. Ts'o wrote:
> > > > On Tue, Feb 18, 2020 at 06:08:57PM +0100, Uladzislau Rezki wrote:
> > > > > now it becomes possible to use it like:
> > > > > ...
> > > > > void *p = kvmalloc(PAGE_SIZE);
> > > > > kvfree_rcu(p);
> > > > > ...
> > > > > also have a look at the example in the mm/list_lru.c diff.
> > > >
> > > > I certainly like the interface, thanks! I'm going to be pushing
> > > > patches to fix this using ext4_kvfree_array_rcu() since there are a
> > > > number of bugs in ext4's online resizing which appear to be hitting
> > > > multiple cloud providers (with reports from both AWS and GCP) and I
> > > > want something which can be easily backported to stable kernels.
> > > >
> > > > But once kvfree_rcu() hits mainline, I'll switch ext4 to use it, since
> > > > your kvfree_rcu() is definitely more efficient than my expedient
> > > > jury-rig.
> > > >
> > > > I don't feel entirely competent to review the implementation, but I do
> > > > have one question. It looks like the rcutiny implementation of
> > > > kfree_call_rcu() isn't going to do the right thing with kvfree_rcu(p).
> > > > Am I missing something?
> > >
> > > Good catch! I believe that rcu_reclaim_tiny() would need to do
> > > kvfree() instead of its current kfree().
> > >
> > > Vlad, anything I am missing here?
> > >
> > Yes something like that. There are some open questions about
> > realization, when it comes to tiny RCU. Since we are talking
> > about "headless" kvfree_rcu() interface, i mean we can not link
> > freed "objects" between each other, instead we should place a
> > pointer directly into array that will be drained later on.
> >
> > It would be much more easier to achieve that if we were talking
> > about the interface like: kvfree_rcu(p, rcu), but that is not our
> > case :)
> >
> > So, for CONFIG_TINY_RCU we should implement very similar what we
> > have done for CONFIG_TREE_RCU or just simply do like Ted has done
> > with his
> >
> > void ext4_kvfree_array_rcu(void *to_free)
> >
> > i mean:
> >
> > local_irq_save(flags);
> > struct foo *ptr = kzalloc(sizeof(*ptr), GFP_ATOMIC);
> >
> > if (ptr) {
> > ptr->ptr = to_free;
> > call_rcu(&ptr->rcu, kvfree_callback);
> > }
> > local_irq_restore(flags);
>
> We really do still need the emergency case, in this case for when
> kzalloc() returns NULL. Which does indeed mean an rcu_head in the thing
> being freed. Otherwise, you end up with an out-of-memory deadlock where
> you could free memory only if you had memor to allocate.

Can we rely on GFP_ATOMIC allocations for these? These have emergency memory
pools which are reserved.

I was thinking a 2 fold approach (just thinking out loud..):

If kfree_call_rcu() is called in atomic context or in any rcu reader, then
use GFP_ATOMIC to grow an rcu_head wrapper on the atomic memory pool and
queue that.

Otherwise, grow an rcu_head on the stack of kfree_call_rcu() and call
synchronize_rcu() inline with it.

Use preemptible() andr task_struct's rcu_read_lock_nesting to differentiate
between the 2 cases.

Thoughts?

> > Also there is one more open question what to do if GFP_ATOMIC
> > gets failed in case of having low memory condition. Probably
> > we can make use of "mempool interface" that allows to have
> > min_nr guaranteed pre-allocated pages.
>
> But we really do still need to handle the case where everything runs out,
> even the pre-allocated pages.

If *everything* runs out, you are pretty much going to OOM sooner or later
anyway :D. But I see what you mean. But the 'tradeoff' is RCU can free
head-less objects where possible.

thanks,

- Joel

2020-02-23 01:12:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations

On Sat, Feb 22, 2020 at 05:24:15PM -0500, Joel Fernandes wrote:
> On Fri, Feb 21, 2020 at 12:22:50PM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 21, 2020 at 02:14:55PM +0100, Uladzislau Rezki wrote:
> > > On Thu, Feb 20, 2020 at 04:30:35PM -0800, Paul E. McKenney wrote:
> > > > On Wed, Feb 19, 2020 at 11:52:33PM -0500, Theodore Y. Ts'o wrote:
> > > > > On Tue, Feb 18, 2020 at 06:08:57PM +0100, Uladzislau Rezki wrote:
> > > > > > now it becomes possible to use it like:
> > > > > > ...
> > > > > > void *p = kvmalloc(PAGE_SIZE);
> > > > > > kvfree_rcu(p);
> > > > > > ...
> > > > > > also have a look at the example in the mm/list_lru.c diff.
> > > > >
> > > > > I certainly like the interface, thanks! I'm going to be pushing
> > > > > patches to fix this using ext4_kvfree_array_rcu() since there are a
> > > > > number of bugs in ext4's online resizing which appear to be hitting
> > > > > multiple cloud providers (with reports from both AWS and GCP) and I
> > > > > want something which can be easily backported to stable kernels.
> > > > >
> > > > > But once kvfree_rcu() hits mainline, I'll switch ext4 to use it, since
> > > > > your kvfree_rcu() is definitely more efficient than my expedient
> > > > > jury-rig.
> > > > >
> > > > > I don't feel entirely competent to review the implementation, but I do
> > > > > have one question. It looks like the rcutiny implementation of
> > > > > kfree_call_rcu() isn't going to do the right thing with kvfree_rcu(p).
> > > > > Am I missing something?
> > > >
> > > > Good catch! I believe that rcu_reclaim_tiny() would need to do
> > > > kvfree() instead of its current kfree().
> > > >
> > > > Vlad, anything I am missing here?
> > > >
> > > Yes something like that. There are some open questions about
> > > realization, when it comes to tiny RCU. Since we are talking
> > > about "headless" kvfree_rcu() interface, i mean we can not link
> > > freed "objects" between each other, instead we should place a
> > > pointer directly into array that will be drained later on.
> > >
> > > It would be much more easier to achieve that if we were talking
> > > about the interface like: kvfree_rcu(p, rcu), but that is not our
> > > case :)
> > >
> > > So, for CONFIG_TINY_RCU we should implement very similar what we
> > > have done for CONFIG_TREE_RCU or just simply do like Ted has done
> > > with his
> > >
> > > void ext4_kvfree_array_rcu(void *to_free)
> > >
> > > i mean:
> > >
> > > local_irq_save(flags);
> > > struct foo *ptr = kzalloc(sizeof(*ptr), GFP_ATOMIC);
> > >
> > > if (ptr) {
> > > ptr->ptr = to_free;
> > > call_rcu(&ptr->rcu, kvfree_callback);
> > > }
> > > local_irq_restore(flags);
> >
> > We really do still need the emergency case, in this case for when
> > kzalloc() returns NULL. Which does indeed mean an rcu_head in the thing
> > being freed. Otherwise, you end up with an out-of-memory deadlock where
> > you could free memory only if you had memor to allocate.
>
> Can we rely on GFP_ATOMIC allocations for these? These have emergency memory
> pools which are reserved.

You can, at least until the emergency memory pools are exhausted.

> I was thinking a 2 fold approach (just thinking out loud..):
>
> If kfree_call_rcu() is called in atomic context or in any rcu reader, then
> use GFP_ATOMIC to grow an rcu_head wrapper on the atomic memory pool and
> queue that.
>
> Otherwise, grow an rcu_head on the stack of kfree_call_rcu() and call
> synchronize_rcu() inline with it.
>
> Use preemptible() andr task_struct's rcu_read_lock_nesting to differentiate
> between the 2 cases.
>
> Thoughts?

How much are we really losing by having an rcu_head in the structure,
either separately or unioned over other fields?

> > > Also there is one more open question what to do if GFP_ATOMIC
> > > gets failed in case of having low memory condition. Probably
> > > we can make use of "mempool interface" that allows to have
> > > min_nr guaranteed pre-allocated pages.
> >
> > But we really do still need to handle the case where everything runs out,
> > even the pre-allocated pages.
>
> If *everything* runs out, you are pretty much going to OOM sooner or later
> anyway :D. But I see what you mean. But the 'tradeoff' is RCU can free
> head-less objects where possible.

Would you rather pay an rcu_head tax (in cases where it cannot share
with other fields), or would you rather have states where you could free
a lot of memory if only there was some memory to allocate to track the
memory to be freed?

But yes, as you suggested above, there could be an API similar to the
userspace RCU library's API that usually just queues the callback but
sometimes sleeps for a full grace period. If enough people want this,
it should not be hard to set up.

Thanx, Paul

2020-02-24 17:40:58

by Uladzislau Rezki

[permalink] [raw]
Subject: Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations

On Sat, Feb 22, 2020 at 05:10:18PM -0800, Paul E. McKenney wrote:
> On Sat, Feb 22, 2020 at 05:24:15PM -0500, Joel Fernandes wrote:
> > On Fri, Feb 21, 2020 at 12:22:50PM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 21, 2020 at 02:14:55PM +0100, Uladzislau Rezki wrote:
> > > > On Thu, Feb 20, 2020 at 04:30:35PM -0800, Paul E. McKenney wrote:
> > > > > On Wed, Feb 19, 2020 at 11:52:33PM -0500, Theodore Y. Ts'o wrote:
> > > > > > On Tue, Feb 18, 2020 at 06:08:57PM +0100, Uladzislau Rezki wrote:
> > > > > > > now it becomes possible to use it like:
> > > > > > > ...
> > > > > > > void *p = kvmalloc(PAGE_SIZE);
> > > > > > > kvfree_rcu(p);
> > > > > > > ...
> > > > > > > also have a look at the example in the mm/list_lru.c diff.
> > > > > >
> > > > > > I certainly like the interface, thanks! I'm going to be pushing
> > > > > > patches to fix this using ext4_kvfree_array_rcu() since there are a
> > > > > > number of bugs in ext4's online resizing which appear to be hitting
> > > > > > multiple cloud providers (with reports from both AWS and GCP) and I
> > > > > > want something which can be easily backported to stable kernels.
> > > > > >
> > > > > > But once kvfree_rcu() hits mainline, I'll switch ext4 to use it, since
> > > > > > your kvfree_rcu() is definitely more efficient than my expedient
> > > > > > jury-rig.
> > > > > >
> > > > > > I don't feel entirely competent to review the implementation, but I do
> > > > > > have one question. It looks like the rcutiny implementation of
> > > > > > kfree_call_rcu() isn't going to do the right thing with kvfree_rcu(p).
> > > > > > Am I missing something?
> > > > >
> > > > > Good catch! I believe that rcu_reclaim_tiny() would need to do
> > > > > kvfree() instead of its current kfree().
> > > > >
> > > > > Vlad, anything I am missing here?
> > > > >
> > > > Yes something like that. There are some open questions about
> > > > realization, when it comes to tiny RCU. Since we are talking
> > > > about "headless" kvfree_rcu() interface, i mean we can not link
> > > > freed "objects" between each other, instead we should place a
> > > > pointer directly into array that will be drained later on.
> > > >
> > > > It would be much more easier to achieve that if we were talking
> > > > about the interface like: kvfree_rcu(p, rcu), but that is not our
> > > > case :)
> > > >
> > > > So, for CONFIG_TINY_RCU we should implement very similar what we
> > > > have done for CONFIG_TREE_RCU or just simply do like Ted has done
> > > > with his
> > > >
> > > > void ext4_kvfree_array_rcu(void *to_free)
> > > >
> > > > i mean:
> > > >
> > > > local_irq_save(flags);
> > > > struct foo *ptr = kzalloc(sizeof(*ptr), GFP_ATOMIC);
> > > >
> > > > if (ptr) {
> > > > ptr->ptr = to_free;
> > > > call_rcu(&ptr->rcu, kvfree_callback);
> > > > }
> > > > local_irq_restore(flags);
> > >
> > > We really do still need the emergency case, in this case for when
> > > kzalloc() returns NULL. Which does indeed mean an rcu_head in the thing
> > > being freed. Otherwise, you end up with an out-of-memory deadlock where
> > > you could free memory only if you had memor to allocate.
> >
> > Can we rely on GFP_ATOMIC allocations for these? These have emergency memory
> > pools which are reserved.
>
> You can, at least until the emergency memory pools are exhausted.
>
> > I was thinking a 2 fold approach (just thinking out loud..):
> >
> > If kfree_call_rcu() is called in atomic context or in any rcu reader, then
> > use GFP_ATOMIC to grow an rcu_head wrapper on the atomic memory pool and
> > queue that.
> >
I am not sure if that is acceptable, i mean what to do when GFP_ATOMIC
gets failed in atomic context? Or we can just consider it as out of
memory and another variant is to say that headless object can be called
from preemptible context only.

> >
> > Otherwise, grow an rcu_head on the stack of kfree_call_rcu() and call
> > synchronize_rcu() inline with it.
> >
> >
What do you mean here, Joel? "grow an rcu_head on the stack"?

> >
> > Use preemptible() andr task_struct's rcu_read_lock_nesting to differentiate
> > between the 2 cases.
> >
If the current context is preemptable then we can inline synchronize_rcu()
together with freeing to handle such corner case, i mean when we are run
out of memory.

As for "task_struct's rcu_read_lock_nesting". Will it be enough just
have a look at preempt_count of current process? If we have for example
nested rcu_read_locks:

<snip>
rcu_read_lock()
rcu_read_lock()
rcu_read_lock()
<snip>

the counter would be 3.

>
> How much are we really losing by having an rcu_head in the structure,
> either separately or unioned over other fields?
>
> > > > Also there is one more open question what to do if GFP_ATOMIC
> > > > gets failed in case of having low memory condition. Probably
> > > > we can make use of "mempool interface" that allows to have
> > > > min_nr guaranteed pre-allocated pages.
> > >
> > > But we really do still need to handle the case where everything runs out,
> > > even the pre-allocated pages.
> >
> > If *everything* runs out, you are pretty much going to OOM sooner or later
> > anyway :D. But I see what you mean. But the 'tradeoff' is RCU can free
> > head-less objects where possible.
>
> Would you rather pay an rcu_head tax (in cases where it cannot share
> with other fields), or would you rather have states where you could free
> a lot of memory if only there was some memory to allocate to track the
> memory to be freed?
>
> But yes, as you suggested above, there could be an API similar to the
> userspace RCU library's API that usually just queues the callback but
> sometimes sleeps for a full grace period. If enough people want this,
> it should not be hard to set up.
>

--
Vlad Rezki

2020-02-25 02:08:02

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations

On Mon, Feb 24, 2020 at 06:40:30PM +0100, Uladzislau Rezki wrote:
> On Sat, Feb 22, 2020 at 05:10:18PM -0800, Paul E. McKenney wrote:
> > On Sat, Feb 22, 2020 at 05:24:15PM -0500, Joel Fernandes wrote:
> > > On Fri, Feb 21, 2020 at 12:22:50PM -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 21, 2020 at 02:14:55PM +0100, Uladzislau Rezki wrote:
> > > > > On Thu, Feb 20, 2020 at 04:30:35PM -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Feb 19, 2020 at 11:52:33PM -0500, Theodore Y. Ts'o wrote:
> > > > > > > On Tue, Feb 18, 2020 at 06:08:57PM +0100, Uladzislau Rezki wrote:
> > > > > > > > now it becomes possible to use it like:
> > > > > > > > ...
> > > > > > > > void *p = kvmalloc(PAGE_SIZE);
> > > > > > > > kvfree_rcu(p);
> > > > > > > > ...
> > > > > > > > also have a look at the example in the mm/list_lru.c diff.
> > > > > > >
> > > > > > > I certainly like the interface, thanks! I'm going to be pushing
> > > > > > > patches to fix this using ext4_kvfree_array_rcu() since there are a
> > > > > > > number of bugs in ext4's online resizing which appear to be hitting
> > > > > > > multiple cloud providers (with reports from both AWS and GCP) and I
> > > > > > > want something which can be easily backported to stable kernels.
> > > > > > >
> > > > > > > But once kvfree_rcu() hits mainline, I'll switch ext4 to use it, since
> > > > > > > your kvfree_rcu() is definitely more efficient than my expedient
> > > > > > > jury-rig.
> > > > > > >
> > > > > > > I don't feel entirely competent to review the implementation, but I do
> > > > > > > have one question. It looks like the rcutiny implementation of
> > > > > > > kfree_call_rcu() isn't going to do the right thing with kvfree_rcu(p).
> > > > > > > Am I missing something?
> > > > > >
> > > > > > Good catch! I believe that rcu_reclaim_tiny() would need to do
> > > > > > kvfree() instead of its current kfree().
> > > > > >
> > > > > > Vlad, anything I am missing here?
> > > > > >
> > > > > Yes something like that. There are some open questions about
> > > > > realization, when it comes to tiny RCU. Since we are talking
> > > > > about "headless" kvfree_rcu() interface, i mean we can not link
> > > > > freed "objects" between each other, instead we should place a
> > > > > pointer directly into array that will be drained later on.
> > > > >
> > > > > It would be much more easier to achieve that if we were talking
> > > > > about the interface like: kvfree_rcu(p, rcu), but that is not our
> > > > > case :)
> > > > >
> > > > > So, for CONFIG_TINY_RCU we should implement very similar what we
> > > > > have done for CONFIG_TREE_RCU or just simply do like Ted has done
> > > > > with his
> > > > >
> > > > > void ext4_kvfree_array_rcu(void *to_free)
> > > > >
> > > > > i mean:
> > > > >
> > > > > local_irq_save(flags);
> > > > > struct foo *ptr = kzalloc(sizeof(*ptr), GFP_ATOMIC);
> > > > >
> > > > > if (ptr) {
> > > > > ptr->ptr = to_free;
> > > > > call_rcu(&ptr->rcu, kvfree_callback);
> > > > > }
> > > > > local_irq_restore(flags);
> > > >
> > > > We really do still need the emergency case, in this case for when
> > > > kzalloc() returns NULL. Which does indeed mean an rcu_head in the thing
> > > > being freed. Otherwise, you end up with an out-of-memory deadlock where
> > > > you could free memory only if you had memor to allocate.
> > >
> > > Can we rely on GFP_ATOMIC allocations for these? These have emergency memory
> > > pools which are reserved.
> >
> > You can, at least until the emergency memory pools are exhausted.
> >
> > > I was thinking a 2 fold approach (just thinking out loud..):
> > >
> > > If kfree_call_rcu() is called in atomic context or in any rcu reader, then
> > > use GFP_ATOMIC to grow an rcu_head wrapper on the atomic memory pool and
> > > queue that.
> > >
> I am not sure if that is acceptable, i mean what to do when GFP_ATOMIC
> gets failed in atomic context? Or we can just consider it as out of
> memory and another variant is to say that headless object can be called
> from preemptible context only.

Yes that makes sense, and we can always put disclaimer in the API's comments
saying if this object is expected to be freed a lot, then don't use the
headless-API to be extra safe.

BTW, GFP_ATOMIC the documentation says if GFP_ATOMIC reserves are depleted,
the kernel can even panic some times, so if GFP_ATOMIC allocation fails, then
there seems to be bigger problems in the system any way. I would say let us
write a patch to allocate there and see what the -mm guys think.

> > > Otherwise, grow an rcu_head on the stack of kfree_call_rcu() and call
> > > synchronize_rcu() inline with it.
> > >
> > >
> What do you mean here, Joel? "grow an rcu_head on the stack"?

By "grow on the stack", use the compiler-allocated rcu_head on the
kfree_rcu() caller's stack.

I meant here to say, if we are not in atomic context, then we use regular
GFP_KERNEL allocation, and if that fails, then we just use the stack's
rcu_head and call synchronize_rcu() or even synchronize_rcu_expedited since
the allocation failure would mean the need for RCU to free some memory is
probably great.

> > > Use preemptible() andr task_struct's rcu_read_lock_nesting to differentiate
> > > between the 2 cases.
> > >
> If the current context is preemptable then we can inline synchronize_rcu()
> together with freeing to handle such corner case, i mean when we are run
> out of memory.

Ah yes, exactly what I mean.

> As for "task_struct's rcu_read_lock_nesting". Will it be enough just
> have a look at preempt_count of current process? If we have for example
> nested rcu_read_locks:
>
> <snip>
> rcu_read_lock()
> rcu_read_lock()
> rcu_read_lock()
> <snip>
>
> the counter would be 3.

No, because preempt_count is not incremented during rcu_read_lock(). RCU
reader sections can be preempted, they just cannot goto sleep in a reader
section (unless the kernel is RT).

thanks,

- Joel