2020-04-14 13:36:58

by Joel Fernandes

Subject: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

This is a small code refactor that also removes the restriction that
headless kfree_rcu() users cannot sleep during allocation of the per-cpu
array where their pointers will be placed. Since these users are always
called from a sleepable context, we can use this information to try
harder during the array allocation stage to allocate the per-cpu array.

Also included is a possible bug-fix for a migration scenario where a
headless kfree_rcu() user can get migrated during the sleepable
allocation, end up on another CPU, and restore the wrong CPU's flags.
To remedy this, we store only the IRQ state on the stack and
save/restore the IRQ state from there. True, for the headless case we
do not need to restore flags at all. But the state save/restore code is
common between the headless and with-head kfree_rcu() variants, so it
handles all scenarios by sampling/restoring just the IRQ state rather
than saving/restoring all the flags.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
---

This is just RFC and is based on top of Vlad's latest patches:
https://lkml.org/lkml/2020/4/2/383

The git tree containing this patch is at:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/dev

(This patch will be a part of a much larger series in the future).


kernel/rcu/tree.c | 150 +++++++++++++++++++++++++++++++---------------
1 file changed, 103 insertions(+), 47 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 744d04d8b7724..2e0eaec929059 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3104,11 +3104,95 @@ static void kfree_rcu_monitor(struct work_struct *work)
spin_unlock_irqrestore(&krcp->lock, flags);
}

+static inline struct kfree_rcu_cpu *krc_this_cpu_lock(bool irqs_disabled)
+{
+ struct kfree_rcu_cpu *krcp;
+
+ // For safely calling this_cpu_ptr().
+ if (!irqs_disabled)
+ local_irq_disable();
+
+ krcp = this_cpu_ptr(&krc);
+ if (likely(krcp->initialized))
+ spin_lock(&krcp->lock);
+
+ return krcp;
+}
+
+static inline void
+krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, bool irqs_disabled)
+{
+ if (likely(krcp->initialized))
+ spin_unlock(&krcp->lock);
+
+ if (!irqs_disabled)
+ local_irq_enable();
+}
+
+// alloc_object_locked - Try to allocate an object of @size, dropping the lock if we must sleep.
+//
+// @size: Size of the object to internally allocate for kfree_rcu().
+// @slab: Whether to allocate from the slab allocator (true) or the buddy allocator (false).
+// @can_sleep: Was kfree_rcu() called in a sleepable context?
+// @krcpp: Pointer to the krcp pointer. Needed because we may have been migrated by the time we relock.
+//
+// Caveats:
+//
+// 1. Per-cpu krc's lock must be held with interrupts disabled.
+//
+// 2. Failure to allocate returns NULL and does not cause a warning.
+//
+// 3. Caller is responsible for using the correct free() APIs. If size == PAGE_SIZE,
+// then free_page() should be called for freeing. Otherwise kfree().
+//
+static inline void *alloc_object_locked(size_t size, bool slab, bool can_sleep,
+ struct kfree_rcu_cpu **krcpp)
+{
+ void *ptr;
+ gfp_t gfp_flags, wmark_flags, reclaim_flags;
+ struct kfree_rcu_cpu *krcp = *krcpp;
+
+ WARN_ON_ONCE(size == PAGE_SIZE && slab);
+
+ // Decompose the flags:
+ // wmark_flags - affect the watermark to control reserve access.
+ // reclaim_flags - these affect how reclaim works but would
+ // have no effect in an atomic or nowait context.
+ wmark_flags = (__GFP_HIGH | __GFP_ATOMIC);
+ reclaim_flags = (__GFP_RETRY_MAYFAIL);
+
+ // These flags are common to all allocations. Whether we wait,
+ // sleep, or reclaim is controlled by additional flags added
+ // later, during the actual allocation.
+ gfp_flags = (wmark_flags | reclaim_flags | __GFP_NOWARN);
+
+ // First, do an allocation without waiting.
+ ptr = (size == PAGE_SIZE) ? (void *)__get_free_page(gfp_flags | GFP_NOWAIT)
+ : (void *)kmalloc(size, gfp_flags | GFP_NOWAIT);
+ // If we cannot sleep, we are done.
+ if (ptr || !can_sleep)
+ return ptr;
+
+ // Now retry with more relaxed flags; we may enter direct reclaim.
+ //
+ // Since can_sleep == true, IRQs must have been enabled on entry,
+ // so there is no need to save/restore flags.
+ krc_this_cpu_unlock(krcp, false);
+ ptr = (size == PAGE_SIZE) ? (void *)__get_free_page(gfp_flags | GFP_KERNEL)
+ : (void *)kmalloc(size, gfp_flags | GFP_KERNEL);
+
+ // Take the lock back. We may have been migrated to another CPU here.
+ *krcpp = krc_this_cpu_lock(false);
+
+ return ptr;
+}
+
static inline bool
-kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
+kvfree_call_rcu_add_ptr_to_bulk(void *ptr, bool can_sleep, struct kfree_rcu_cpu **krcpp)
{
struct kvfree_rcu_bulk_data *bnode;
int idx;
+ struct kfree_rcu_cpu *krcp = *krcpp;

if (unlikely(!krcp->initialized))
return false;
@@ -3121,9 +3205,11 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
krcp->bkvhead[idx]->nr_records ==
KVFREE_BULK_MAX_ENTR) {
bnode = xchg(&krcp->bkvcache[idx], NULL);
- if (!bnode)
+ if (!bnode) {
bnode = (struct kvfree_rcu_bulk_data *)
- __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
+ alloc_object_locked(PAGE_SIZE, false /* use buddy */,
+ can_sleep, krcpp);
+ }

/* Switch to emergency path. */
if (unlikely(!bnode))
@@ -3144,14 +3230,15 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
return true;
}

-static inline struct rcu_head *attach_rcu_head_to_object(void *obj)
+static inline struct rcu_head *attach_rcu_head_to_object(void *obj,
+ struct kfree_rcu_cpu **krcpp)
{
unsigned long *ptr;

- /* Try hard to get the memory. */
- ptr = kmalloc(sizeof(unsigned long *) +
- sizeof(struct rcu_head), GFP_KERNEL | __GFP_ATOMIC |
- __GFP_HIGH | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
+ // Allocate rcu_head on the slab.
+ ptr = alloc_object_locked(sizeof(unsigned long *) + sizeof(struct rcu_head),
+ true /* slab */, true /* headless, so can sleep */,
+ krcpp);
if (!ptr)
return NULL;

@@ -3159,27 +3246,6 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj)
return ((struct rcu_head *) ++ptr);
}

-static inline struct kfree_rcu_cpu *
-krc_this_cpu_lock(unsigned long *flags)
-{
- struct kfree_rcu_cpu *krcp;
-
- local_irq_save(*flags); // For safely calling this_cpu_ptr().
- krcp = this_cpu_ptr(&krc);
- if (likely(krcp->initialized))
- spin_lock(&krcp->lock);
-
- return krcp;
-}
-
-static inline void
-krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
-{
- if (likely(krcp->initialized))
- spin_unlock(&krcp->lock);
- local_irq_restore(flags);
-}
-
/*
* Queue a request for lazy invocation of appropriate free routine after a
* grace period. Please note there are three paths maintained, two are the
@@ -3195,10 +3261,9 @@ krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
*/
void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
{
- unsigned long flags;
struct kfree_rcu_cpu *krcp;
bool expedited_drain = false;
- bool success;
+ bool success, irqs_disabled;
void *ptr;

if (head) {
@@ -3216,7 +3281,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
ptr = (unsigned long *) func;
}

- krcp = krc_this_cpu_lock(&flags);
+ irqs_disabled = irqs_disabled();
+ krcp = krc_this_cpu_lock(irqs_disabled);

// Queue the object but don't yet schedule the batch.
if (debug_rcu_head_queue(ptr)) {
@@ -3229,23 +3295,14 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
goto unlock_return;
}

- /*
- * Under high memory pressure GFP_NOWAIT can fail,
- * in that case the emergency path is maintained.
- */
- success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
- // Allow sleeping here only if the object is headless.
+ success = kvfree_call_rcu_add_ptr_to_bulk(ptr, !head /* can_sleep */, &krcp);
if (!success) {
/* Is headless object? */
if (head == NULL) {
- /* Drop the lock. */
- krc_this_cpu_unlock(krcp, flags);
-
- head = attach_rcu_head_to_object(ptr);
+ head = attach_rcu_head_to_object(ptr, &krcp);
if (head == NULL)
- goto inline_return;
-
- /* Take it back. */
- krcp = krc_this_cpu_lock(&flags);
+ goto unlock_return;

/*
* Tag the headless object. Such objects have a back-pointer
@@ -3280,9 +3337,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
}

unlock_return:
- krc_this_cpu_unlock(krcp, flags);
+ krc_this_cpu_unlock(krcp, irqs_disabled);

-inline_return:
/*
* High memory pressure, so inline kvfree() after
* synchronize_rcu(). We can do it from might_sleep()
--
2.26.0.110.g2183baf09c-goog


2020-04-15 13:15:13

by Paul E. McKenney

Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

On Mon, Apr 13, 2020 at 05:15:04PM -0400, Joel Fernandes (Google) wrote:
> This is a small code refactor and also removes the restriction that
> headless kfree_rcu() users cannot sleep for allocation of the per-cpu
> array where their pointers will be placed . Since they are always called
> in a sleepable context, we can use this information to try harder during
> the array allocation stage to allocate the per-cpu array.

In kernels, needing to do allocations in order to free memory must be
regarded with great suspicion. It -might- be kind-of sort-of OK here,
but only if we never impede reclaim, I/O, or OOM handling. Even then,
this can be made to work only given that it is possible to fall back
on a direct call to synchronize_rcu() in the case where absolutely no
memory is available.

Rest assured that I do understand the attractions of a single-argument
kfree_rcu(). Attractive, yes, but also very dangerous if not done
quite carefully. So let's please do it carefully! ;-)

> Also there is a possible bug-fix for a migration scenario where a
> kfree_rcu() headless user can get migrated during the
> sleepable-allocation and end up on another CPU and restoring the wrong
> CPU's flags. To remedy this, we store only the IRQ state on the stack
> and save/restore IRQ state from there. Sure, for the headless case we
> don't need to restore flags. But the code saving/restoring state is
> common between headless and with-head kfree_rcu() variants, so it
> handles all scenarios sampling/restoring just the IRQ state and not
> saving/restoring all the flags.

I will suspend disbelief while I look at the patch, but this indirect flag
handling sounds like an accident waiting to happen. So in the meantime,
is there a way to structure the code to make the flag handling more
explicitly visible at the top level?

In addition, the usual way to conditionally disable interrupts
is local_irq_save(flags) rather than conditionally invoking
local_irq_disable().
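
That is, something like this (a minimal sketch, not taken from the
patch itself):

	unsigned long flags;

	local_irq_save(flags);		/* remembers whether IRQs were already off */
	/* ... this_cpu_ptr() and the per-CPU lock are now safe ... */
	local_irq_restore(flags);	/* restores the exact prior IRQ state */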

Adding Johannes on CC for his thoughts as well. (He will not be shy
about correcting any misapprehensions that I might have.)

With that, please see below!

Thanx, Paul

> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> ---
>
> This is just RFC and is based on top of Vlad's latest patches:
> https://lkml.org/lkml/2020/4/2/383
>
> The git tree containing this patch is at:
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/dev
>
> (This patch will be a part of a much larger series in the future).
>
>
> kernel/rcu/tree.c | 150 +++++++++++++++++++++++++++++++---------------
> 1 file changed, 103 insertions(+), 47 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 744d04d8b7724..2e0eaec929059 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3104,11 +3104,95 @@ static void kfree_rcu_monitor(struct work_struct *work)
> spin_unlock_irqrestore(&krcp->lock, flags);
> }
>
> +static inline struct kfree_rcu_cpu *krc_this_cpu_lock(bool irqs_disabled)
> +{
> + struct kfree_rcu_cpu *krcp;
> +
> + // For safely calling this_cpu_ptr().
> + if (!irqs_disabled)
> + local_irq_disable();

Again, local_irq_save() is the usual approach here. And local_irq_restore()
in krc_this_cpu_unlock() below.

> +
> + krcp = this_cpu_ptr(&krc);
> + if (likely(krcp->initialized))
> + spin_lock(&krcp->lock);
> +
> + return krcp;
> +}
> +
> +static inline void
> +krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, bool irqs_disabled)
> +{
> + if (likely(krcp->initialized))
> + spin_unlock(&krcp->lock);
> +
> + if (!irqs_disabled)
> + local_irq_enable();
> +}
> +
> +// alloc_object_locked - Try to allocate an object of size while dropping the lock.
> +//
> +// @size: Size of the object to internally allocate for kfree_rcu().
> +// @slab: Do we allocate on slab or using buddy.
> +// @can_sleep: Was kfree_rcu() called in sleepable context?
> +// @krcp: The pointer to krcp. Needed if when relocking, we got migrated.
> +//
> +// Caveats:
> +//
> +// 1. Per-cpu krc's lock must be held with interrupts disabled.
> +//
> +// 2. Failure to allocate returns NULL and does not cause a warning.
> +//
> +// 3. Caller is responsible for using the correct free() APIs. If size == PAGE_SIZE,
> +// then free_page() should be called for freeing. Otherwise kfree().
> +//
> +static inline void *alloc_object_locked(size_t size, bool slab, bool can_sleep,
> + struct kfree_rcu_cpu **krcpp)
> +{
> + void *ptr;
> + gfp_t gfp_flags, wmark_flags, reclaim_flags;
> + struct kfree_rcu_cpu *krcp = *krcpp;
> +
> + WARN_ON_ONCE(size == PAGE_SIZE && slab);
> +
> + // Decompose the flags:
> + // wmark_flags - affect the watermark to control reserve access.
> + // reclaim_flags - these affect how reclaim works but would
> + // have no effect in an atomic or nowait context.
> + wmark_flags = (__GFP_HIGH | __GFP_ATOMIC);
> + reclaim_flags = (__GFP_RETRY_MAYFAIL);

You have a __GFP_RETRY_MAYFAIL here, which is good. However, if
this CPU has quite a few 4K blocks of memory already allocated for
kfree_rcu(), at some point __GFP_NORETRY becomes necessary. Again,
single-argument kfree_rcu() has the option of invoking synchronize_rcu()
and most other memory allocators do not. And double-argument kfree_rcu()
has a pre-allocated rcu_head in the structure that it can fall back on.

So let's please not get too memory-greedy here!

Note also that most systems will normally invoke an RCU callback on the
same CPU that registered it. This allows easy maintenance of an
emergency cache for these situations.

Exceptions include systems doing frequent CPU-hotplug operations and
rcu_nocbs CPUs.

> +
> + // These flags will be common to all allocations, whether we want to
> + // wait or sleep or reclaim will be controlled with additional flags
> + // later during the actual allocation.
> + gfp_flags = (wmark_flags | reclaim_flags | __GFP_NOWARN);
> +
> + // First, do an allocation without waiting.
> + ptr = (size == PAGE_SIZE) ? (void *)__get_free_page(gfp_flags | GFP_NOWAIT)
> + : (void *)kmalloc(size, gfp_flags | GFP_NOWAIT);
> + // If we cannot sleep, we are done.
> + if (ptr || !can_sleep)
> + return ptr;
> +
> + // Now try to do it with more relaxed flags, we may enter direct-reclaim.
> + //
> + // IRQs were not enabled since can_sleep == true. So there's no need to
> + // save/restore flags.
> + krc_this_cpu_unlock(krcp, false);
> + ptr = (size == PAGE_SIZE) ? (void *)__get_free_page(gfp_flags | GFP_KERNEL)
> + : (void *)kmalloc(size, gfp_flags | GFP_KERNEL);

Dropping the possibility of small allocations also simplifies this code,
and also simplifies a fair amount of code elsewhere.

> +
> + // Take it back. We may have got migrated here.
> + *krcpp = krc_this_cpu_lock(false);
> +
> + return ptr;
> +}
> +
> static inline bool
> -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> +kvfree_call_rcu_add_ptr_to_bulk(void *ptr, bool can_sleep, struct kfree_rcu_cpu **krcpp)
> {
> struct kvfree_rcu_bulk_data *bnode;
> int idx;
> + struct kfree_rcu_cpu *krcp = *krcpp;
>
> if (unlikely(!krcp->initialized))
> return false;
> @@ -3121,9 +3205,11 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> krcp->bkvhead[idx]->nr_records ==
> KVFREE_BULK_MAX_ENTR) {
> bnode = xchg(&krcp->bkvcache[idx], NULL);
> - if (!bnode)
> + if (!bnode) {
> bnode = (struct kvfree_rcu_bulk_data *)
> - __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> + alloc_object_locked(PAGE_SIZE, false /* use buddy */,
> + can_sleep, krcpp);

Given the proper GFP_ flags, use of can_sleep does make sense here, good!

> + }
>
> /* Switch to emergency path. */
> if (unlikely(!bnode))
> @@ -3144,14 +3230,15 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> return true;
> }
>
> -static inline struct rcu_head *attach_rcu_head_to_object(void *obj)
> +static inline struct rcu_head *attach_rcu_head_to_object(void *obj,
> + struct kfree_rcu_cpu **krcpp)
> {
> unsigned long *ptr;
>
> - /* Try hard to get the memory. */
> - ptr = kmalloc(sizeof(unsigned long *) +
> - sizeof(struct rcu_head), GFP_KERNEL | __GFP_ATOMIC |
> - __GFP_HIGH | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> + // Allocate rcu_head on the slab.
> + ptr = alloc_object_locked(sizeof(unsigned long *) + sizeof(struct rcu_head),
> + true /* slab */, true /* headless, so can sleep */,
> + krcpp);

This small allocation can impede reclaim, I/O, and OOM handling, so this
part of the code must be dropped. If kfree_rcu() cannot allocate a 4K
page, it needs to leave the tiny amount of remaining memory for others.
The reason for this is that these others usually don't have the option
of just calling synchronize_rcu(), and their need is quite likely to be
much greater than that of kfree_rcu().

> if (!ptr)
> return NULL;
>
> @@ -3159,27 +3246,6 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj)
> return ((struct rcu_head *) ++ptr);
> }
>
> -static inline struct kfree_rcu_cpu *
> -krc_this_cpu_lock(unsigned long *flags)
> -{
> - struct kfree_rcu_cpu *krcp;
> -
> - local_irq_save(*flags); // For safely calling this_cpu_ptr().
> - krcp = this_cpu_ptr(&krc);
> - if (likely(krcp->initialized))
> - spin_lock(&krcp->lock);
> -
> - return krcp;
> -}
> -
> -static inline void
> -krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
> -{
> - if (likely(krcp->initialized))
> - spin_unlock(&krcp->lock);
> - local_irq_restore(flags);
> -}
> -
> /*
> * Queue a request for lazy invocation of appropriate free routine after a
> * grace period. Please note there are three paths are maintained, two are the
> @@ -3195,10 +3261,9 @@ krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
> */
> void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> {
> - unsigned long flags;
> struct kfree_rcu_cpu *krcp;
> bool expedited_drain = false;
> - bool success;
> + bool success, irqs_disabled;
> void *ptr;
>
> if (head) {
> @@ -3216,7 +3281,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> ptr = (unsigned long *) func;
> }
>
> - krcp = krc_this_cpu_lock(&flags);
> + irqs_disabled = irqs_disabled();
> + krcp = krc_this_cpu_lock(irqs_disabled);
>
> // Queue the object but don't yet schedule the batch.
> if (debug_rcu_head_queue(ptr)) {
> @@ -3229,23 +3295,14 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> goto unlock_return;
> }
>
> - /*
> - * Under high memory pressure GFP_NOWAIT can fail,
> - * in that case the emergency path is maintained.
> - */
> - success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
> + // Allow sleeping here only if headless
> + success = kvfree_call_rcu_add_ptr_to_bulk(ptr, !head /* can_sleep */, &krcp);
> if (!success) {
> /* Is headless object? */
> if (head == NULL) {

We get here only if a 4K allocation fails, correct?

> - /* Drop the lock. */
> - krc_this_cpu_unlock(krcp, flags);
> -
> - head = attach_rcu_head_to_object(ptr);
> + head = attach_rcu_head_to_object(ptr, &krcp);

If so, don't bother with the small allocation. As noted earlier, others
likely need the memory more than does kfree_rcu(), given that most will
not have an out like invoking synchronize_rcu(). This approach allows
dispensing with the behind-the-scenes interrupt enabling, which again
is an accident waiting to happen.

Instead, just call synchronize_rcu(), then directly free the object,
as in the inline_return label below.
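
Roughly like this, reusing names from the quoted patch and the earlier
flags-based krc_this_cpu_unlock() (a sketch, not a tested change):

	if (!success && head == NULL) {
		/* No per-CPU array page and no rcu_head to fall back on. */
		krc_this_cpu_unlock(krcp, flags);
		synchronize_rcu();	/* wait out a full grace period... */
		kvfree(ptr);		/* ...then free the object inline. */
		return;
	}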

This also allows you to keep the earlier simpler krc_this_cpu_lock()
and krc_this_cpu_unlock() definitions and confine the interrupt flags
manipulation to this function.

> if (head == NULL)
> - goto inline_return;
> -
> - /* Take it back. */
> - krcp = krc_this_cpu_lock(&flags);
> + goto unlock_return;
>
> /*
> * Tag the headless object. Such objects have a back-pointer
> @@ -3280,9 +3337,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> }
>
> unlock_return:
> - krc_this_cpu_unlock(krcp, flags);
> + krc_this_cpu_unlock(krcp, irqs_disabled);
>
> -inline_return:
> /*
> * High memory pressure, so inline kvfree() after
> * synchronize_rcu(). We can do it from might_sleep()
> --
> 2.26.0.110.g2183baf09c-goog

2020-04-16 10:35:51

by Uladzislau Rezki

Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

Hello, Paul, Joel.

See my comments below. They are meant to build a better understanding
of the whole strategy and of the best way to handle headless objects :)

> > This is a small code refactor and also removes the restriction that
> > headless kfree_rcu() users cannot sleep for allocation of the per-cpu
> > array where their pointers will be placed. Since they are always called
> > in a sleepable context, we can use this information to try harder during
> > the array allocation stage to allocate the per-cpu array.
>
> In kernels, needing to do allocations in order to free memory must be
> regarded with great suspicion. It -might- be kind-of sort-of OK here,
> but only if we never impede reclaim, I/O, or OOM handling. Even then,
> this can be made to work only given that it is possible to fall back
> on a direct call to synchronize_rcu() in the case where absolutely no
> memory is available.
>
I see your point and agree. So, the idea is to make progress instead of
triggering OOM, I/O, or direct reclaim. It means that we should avoid
using any allocation flags which would trigger such effects, directly
or indirectly. I think we are on the same page now.

> Rest assured that I do understand the attractions of a single-argument
> kfree_rcu(). Attractive, yes, but also very dangerous if not done
> quite carefully. So let's please do it carefully! ;-)
>
Thanks for understanding and helping :)

> > Also there is a possible bug-fix for a migration scenario where a
> > kfree_rcu() headless user can get migrated during the
> > sleepable-allocation and end up on another CPU and restoring the wrong
> > CPU's flags. To remedy this, we store only the IRQ state on the stack
> > and save/restore IRQ state from there. Sure, for the headless case we
> > don't need to restore flags. But the code saving/restoring state is
> > common between headless and with-head kfree_rcu() variants, so it
> > handles all scenarios sampling/restoring just the IRQ state and not
> > saving/restoring all the flags.
>
> I will suspend disbelief while I look at the patch, but this indirect flag
> handling sounds like an accident waiting to happen. So in the meantime,
> is there a way to structure the code to make the flag handling more
> explicitly visible at the top level?
>
> In addition, the usual way to conditionally disable interrupts
> is local_irq_save(flags) rather than conditionally invoking
> local_irq_disable().
>
> Adding Johannes on CC for his thoughts as well. (He will not be shy
> about correcting any misapprehensions that I might have.)
>
> With that, please see below!
>
> Thanx, Paul
>
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > ---
> >
> > This is just RFC and is based on top of Vlad's latest patches:
> > https://lkml.org/lkml/2020/4/2/383
> >
> > The git tree containing this patch is at:
> > https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/dev
> >
> > (This patch will be a part of a much larger series in the future).
> >
> >
> > kernel/rcu/tree.c | 150 +++++++++++++++++++++++++++++++---------------
> > 1 file changed, 103 insertions(+), 47 deletions(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 744d04d8b7724..2e0eaec929059 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3104,11 +3104,95 @@ static void kfree_rcu_monitor(struct work_struct *work)
> > spin_unlock_irqrestore(&krcp->lock, flags);
> > }
> >
> > +static inline struct kfree_rcu_cpu *krc_this_cpu_lock(bool irqs_disabled)
> > +{
> > + struct kfree_rcu_cpu *krcp;
> > +
> > + // For safely calling this_cpu_ptr().
> > + if (!irqs_disabled)
> > + local_irq_disable();
>
> Again, local_irq_save() is the usual approach here. And local_irq_restore()
> in krc_this_cpu_unlock() below.
>
We discussed that with Joel, and I also think that keeping the previous
variant of krc_this_cpu_lock()/krc_this_cpu_unlock() is better and,
more importantly, simpler: we do not really need to know whether IRQs
are disabled or not, we just save the flags and restore them, and that
is it.

> > +
> > + krcp = this_cpu_ptr(&krc);
> > + if (likely(krcp->initialized))
> > + spin_lock(&krcp->lock);
> > +
> > + return krcp;
> > +}
> > +
> > +static inline void
> > +krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, bool irqs_disabled)
> > +{
> > + if (likely(krcp->initialized))
> > + spin_unlock(&krcp->lock);
> > +
> > + if (!irqs_disabled)
> > + local_irq_enable();
> > +}
> > +
> > +// alloc_object_locked - Try to allocate an object of size while dropping the lock.
> > +//
> > +// @size: Size of the object to internally allocate for kfree_rcu().
> > +// @slab: Do we allocate on slab or using buddy.
> > +// @can_sleep: Was kfree_rcu() called in sleepable context?
> > +// @krcp: The pointer to krcp. Needed if when relocking, we got migrated.
> > +//
> > +// Caveats:
> > +//
> > +// 1. Per-cpu krc's lock must be held with interrupts disabled.
> > +//
> > +// 2. Failure to allocate returns NULL and does not cause a warning.
> > +//
> > +// 3. Caller is responsible for using the correct free() APIs. If size == PAGE_SIZE,
> > +// then free_page() should be called for freeing. Otherwise kfree().
> > +//
> > +static inline void *alloc_object_locked(size_t size, bool slab, bool can_sleep,
> > + struct kfree_rcu_cpu **krcpp)
> > +{
> > + void *ptr;
> > + gfp_t gfp_flags, wmark_flags, reclaim_flags;
> > + struct kfree_rcu_cpu *krcp = *krcpp;
> > +
> > + WARN_ON_ONCE(size == PAGE_SIZE && slab);
> > +
> > + // Decompose the flags:
> > + // wmark_flags - affect the watermark to control reserve access.
> > + // reclaim_flags - these affect how reclaim works but would
> > + // have no effect in an atomic or nowait context.
> > + wmark_flags = (__GFP_HIGH | __GFP_ATOMIC);
> > + reclaim_flags = (__GFP_RETRY_MAYFAIL);
>
> You have a __GFP_RETRY_MAYFAIL here, which is good. However, if
> this CPU has quite a few 4K blocks of memory already allocated for
> kfree_rcu(), at some point __GFP_NORETRY becomes necessary. Again,
> single-argument kfree_rcu() has the option of invoking synchronize_rcu()
> and most other memory allocators do not. And double-argument kfree_rcu()
> has a pre-allocated rcu_head in the structure that it can fall back on.
>
> So let's please not get too memory-greedy here!
>
As I see it, your point is to use "light" allocation flags which will
not initiate direct reclaim, OOM, I/O waiting, and so on. Please
correct me if I am missing something.

> Note also that most systems will normally invoke an RCU callback on the
> same CPU that registered it. This allows easy maintenance of an
> emergency cache for these situations.
>
I have a patch that specifies the number of pages to be cached, but I
will send it out later once we have sorted things like that out.

> Exceptions include systems doing frequent CPU-hotplug operations and
> rcu_nocbs CPUs.
>
> > +
> > + // These flags will be common to all allocations, whether we want to
> > + // wait or sleep or reclaim will be controlled with additional flags
> > + // later during the actual allocation.
> > + gfp_flags = (wmark_flags | reclaim_flags | __GFP_NOWARN);
> > +
> > + // First, do an allocation without waiting.
> > + ptr = (size == PAGE_SIZE) ? (void *)__get_free_page(gfp_flags | GFP_NOWAIT)
> > + : (void *)kmalloc(size, gfp_flags | GFP_NOWAIT);
> > + // If we cannot sleep, we are done.
> > + if (ptr || !can_sleep)
> > + return ptr;
> > +
> > + // Now try to do it with more relaxed flags, we may enter direct-reclaim.
> > + //
> > + // IRQs were not enabled since can_sleep == true. So there's no need to
> > + // save/restore flags.
> > + krc_this_cpu_unlock(krcp, false);
> > + ptr = (size == PAGE_SIZE) ? (void *)__get_free_page(gfp_flags | GFP_KERNEL)
> > + : (void *)kmalloc(size, gfp_flags | GFP_KERNEL);
>
> Dropping the possibility of small allocations also simplifies this code,
> and also simplifies a fair amount of code elsewhere.
>
I have a question about dynamic attaching of the rcu_head. Do you think
that we should drop it? We have it because it requires only 8 +
sizeof(struct rcu_head) bytes, and it is used when we cannot allocate
the one page needed for the array, which is much more memory. Dynamic
attaching can therefore succeed where the page allocation fails,
because it uses the SLAB and requests much less memory than one page.
That gives a higher chance of bypassing synchronize_rcu() and the
inlined freeing on the stack.
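
To illustrate what dynamic attaching buys us (a rough sketch; sizes
assume a 64-bit kernel):

	/*
	 * Headless object after attach_rcu_head_to_object(), sketch only:
	 *
	 *	allocation: [ back-pointer to object | struct rcu_head ]
	 *	sizes:         8 bytes                  16 bytes
	 *
	 * The function returns a pointer just past the back-pointer, so
	 * the callback can later recover the original object.
	 */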

I agree that we should not use heavy GFP_* flags; instead we could go
with just GFP_NOWAIT | __GFP_NOWARN for head attaching only, also
dropping __GFP_ATOMIC to keep the atomic memory reserves for others.

> > +
> > + // Take it back. We may have got migrated here.
> > + *krcpp = krc_this_cpu_lock(false);
> > +
> > + return ptr;
> > +}
> > +
> > static inline bool
> > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > +kvfree_call_rcu_add_ptr_to_bulk(void *ptr, bool can_sleep, struct kfree_rcu_cpu **krcpp)
> > {
> > struct kvfree_rcu_bulk_data *bnode;
> > int idx;
> > + struct kfree_rcu_cpu *krcp = *krcpp;
> >
> > if (unlikely(!krcp->initialized))
> > return false;
> > @@ -3121,9 +3205,11 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > krcp->bkvhead[idx]->nr_records ==
> > KVFREE_BULK_MAX_ENTR) {
> > bnode = xchg(&krcp->bkvcache[idx], NULL);
> > - if (!bnode)
> > + if (!bnode) {
> > bnode = (struct kvfree_rcu_bulk_data *)
> > - __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > + alloc_object_locked(PAGE_SIZE, false /* use buddy */,
> > + can_sleep, krcpp);
>
> Given the proper GFP_ flags, use of can_sleep does make sense here, good!
>
> > + }
> >
> > /* Switch to emergency path. */
> > if (unlikely(!bnode))
> > @@ -3144,14 +3230,15 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > return true;
> > }
> >
> > -static inline struct rcu_head *attach_rcu_head_to_object(void *obj)
> > +static inline struct rcu_head *attach_rcu_head_to_object(void *obj,
> > + struct kfree_rcu_cpu **krcpp)
> > {
> > unsigned long *ptr;
> >
> > - /* Try hard to get the memory. */
> > - ptr = kmalloc(sizeof(unsigned long *) +
> > - sizeof(struct rcu_head), GFP_KERNEL | __GFP_ATOMIC |
> > - __GFP_HIGH | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > + // Allocate rcu_head on the slab.
> > + ptr = alloc_object_locked(sizeof(unsigned long *) + sizeof(struct rcu_head),
> > + true /* slab */, true /* headless, so can sleep */,
> > + krcpp);
>
> This small allocation can impede reclaim, I/O, and OOM handling, so this
> part of the code must be dropped. If kfree_rcu() cannot allocate a 4K
> page, it needs to leave the tiny amount of remaining memory for others.
> The reason for this is that these others usually don't have the option
> of just calling synchronize_rcu(), and their need is quite likely to be
> much greater than that of kfree_rcu().
>
That is about dynamic rcu_head attaching. We can drop it for sure, but
it helps to bypass "freeing on the stack". Please see above.

> > if (!ptr)
> > return NULL;
> >
> > @@ -3159,27 +3246,6 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj)
> > return ((struct rcu_head *) ++ptr);
> > }
> >
> > -static inline struct kfree_rcu_cpu *
> > -krc_this_cpu_lock(unsigned long *flags)
> > -{
> > - struct kfree_rcu_cpu *krcp;
> > -
> > - local_irq_save(*flags); // For safely calling this_cpu_ptr().
> > - krcp = this_cpu_ptr(&krc);
> > - if (likely(krcp->initialized))
> > - spin_lock(&krcp->lock);
> > -
> > - return krcp;
> > -}
> > -
> > -static inline void
> > -krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
> > -{
> > - if (likely(krcp->initialized))
> > - spin_unlock(&krcp->lock);
> > - local_irq_restore(flags);
> > -}
> > -
> > /*
> > * Queue a request for lazy invocation of appropriate free routine after a
> > * grace period. Please note there are three paths are maintained, two are the
> > @@ -3195,10 +3261,9 @@ krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
> > */
> > void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > {
> > - unsigned long flags;
> > struct kfree_rcu_cpu *krcp;
> > bool expedited_drain = false;
> > - bool success;
> > + bool success, irqs_disabled;
> > void *ptr;
> >
> > if (head) {
> > @@ -3216,7 +3281,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > ptr = (unsigned long *) func;
> > }
> >
> > - krcp = krc_this_cpu_lock(&flags);
> > + irqs_disabled = irqs_disabled();
> > + krcp = krc_this_cpu_lock(irqs_disabled);
> >
> > // Queue the object but don't yet schedule the batch.
> > if (debug_rcu_head_queue(ptr)) {
> > @@ -3229,23 +3295,14 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > goto unlock_return;
> > }
> >
> > - /*
> > - * Under high memory pressure GFP_NOWAIT can fail,
> > - * in that case the emergency path is maintained.
> > - */
> > - success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
> > + // Allow sleeping here only if headless
> > + success = kvfree_call_rcu_add_ptr_to_bulk(ptr, !head /* can_sleep */, &krcp);
> > if (!success) {
> > /* Is headless object? */
> > if (head == NULL) {
>
> We get here only if a 4K allocation fails, correct?
>
Correct.

> > - /* Drop the lock. */
> > - krc_this_cpu_unlock(krcp, flags);
> > -
> > - head = attach_rcu_head_to_object(ptr);
> > + head = attach_rcu_head_to_object(ptr, &krcp);
>
> If so, don't bother with the small allocation. As noted earlier, others
> likely need the memory more than does kfree_rcu(), given that most will
> not have an out like invoking synchronize_rcu(). This approach allows
> dispensing with the behind-the-scenes interrupt enabling, which again
> is an accident waiting to happen.
>
Same as above, and just to double-confirm with you: we can drop it.

> Instead, just call synchronize_rcu(), then directly free the object,
> as in the inline_return label below.
>

> This also allows you to keep the earlier simpler krc_this_cpu_lock()
> and krc_this_cpu_unlock() definitions and confine the interrupt flags
> manipulation to this function.
>
Agree.

> > if (head == NULL)
> > - goto inline_return;
> > -
> > - /* Take it back. */
> > - krcp = krc_this_cpu_lock(&flags);
> > + goto unlock_return;
> >
> > /*
> > * Tag the headless object. Such objects have a back-pointer
> > @@ -3280,9 +3337,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > }
> >
> > unlock_return:
> > - krc_this_cpu_unlock(krcp, flags);
> > + krc_this_cpu_unlock(krcp, irqs_disabled);
> >
> > -inline_return:
> > /*
> > * High memory pressure, so inline kvfree() after
> > * synchronize_rcu(). We can do it from might_sleep()
> > --
> > 2.26.0.110.g2183baf09c-goog

I know Joel will also write some comments because we have discussed it
via IRC. The idea is to find a common view, and do it
as best as we can :)

Thanks Paul for good comments!

--
Vlad Rezki

2020-04-16 13:22:17

by Joel Fernandes

Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

On Thu, Apr 16, 2020 at 12:30:07PM +0200, Uladzislau Rezki wrote:
> Hello, Paul, Joel.
>
> See my comments below. They are for better understanding a
> whole strategy and the best way how to drive headless objects :)

Thanks for the comments on the RFC patch :) I am on the same page and OK
with being more conservative about memory allocation.

> > > This is a small code refactor and also removes the restriction that
> > > headless kfree_rcu() users cannot sleep for allocation of the per-cpu
> > > array where their pointers will be placed. Since they are always called
> > > in a sleepable context, we can use this information to try harder during
> > > the array allocation stage to allocate the per-cpu array.
> >
> > In kernels, needing to do allocations in order to free memory must be
> > regarded with great suspicion. It -might- be kind-of sort-of OK here,
> > but only if we never impede reclaim, I/O, or OOM handling. Even then,
> > this can be made to work only given that it is possible to fall back
> > on a direct call to synchronize_rcu() in the case where absolutely no
> > memory is available.
> >
> I see your point and agree. So, the idea is to do progress instead of
> doing OOM, I/O or direct reclaiming. It means that we should avoid of
> using any allocations flags which will trigger such effects, directly
> or indirectly. I think we are on the same base now.

Right.

> > > Also there is a possible bug-fix for a migration scenario where a
> > > kfree_rcu() headless user can get migrated during the
> > > sleepable-allocation and end up on another CPU and restoring the wrong
> > > CPU's flags. To remedy this, we store only the IRQ state on the stack
> > > and save/restore IRQ state from there. Sure, for the headless case we
> > > don't need to restore flags. But the code saving/restoring state is
> > > common between headless and with-head kfree_rcu() variants, so it
> > > handles all scenarios sampling/restoring just the IRQ state and not
> > > saving/restoring all the flags.
> >
> > I will suspend disbelief while I look at the patch, but this indirect flag
> > handling sounds like an accident waiting to happen. So in the meantime,
> > is there a way to structure the code to make the flag handling more
> > explicitly visible at the top level?
> >
> > In addition, the usual way to conditionally disable interrupts
> > is local_irq_save(flags) rather than conditionally invoking
> > local_irq_disable().

I agree with the suggestions below, and they would remove the need for
the conditional disabling of interrupts.

> >
> > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > > ---
> > >
> > > This is just RFC and is based on top of Vlad's latest patches:
> > > https://lkml.org/lkml/2020/4/2/383
> > >
> > > The git tree containing this patch is at:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/dev
> > >
> > > (This patch will be a part of a much larger series in the future).
> > >
> > >
> > > kernel/rcu/tree.c | 150 +++++++++++++++++++++++++++++++---------------
> > > 1 file changed, 103 insertions(+), 47 deletions(-)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 744d04d8b7724..2e0eaec929059 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3104,11 +3104,95 @@ static void kfree_rcu_monitor(struct work_struct *work)
> > > spin_unlock_irqrestore(&krcp->lock, flags);
> > > }
> > >
> > > +static inline struct kfree_rcu_cpu *krc_this_cpu_lock(bool irqs_disabled)
> > > +{
> > > + struct kfree_rcu_cpu *krcp;
> > > +
> > > + // For safely calling this_cpu_ptr().
> > > + if (!irqs_disabled)
> > > + local_irq_disable();
> >
> > Again, local_irq_save() is the usual approach here. And local_irq_restore()
> > in krc_this_cpu_unlock() below.
> >
> We discussed that with Joel and i also think that keeping previous
> variant of the krc_this_cpu_lock()/krc_this_cpu_unlock() is better
> and what is more important is easy, i.e. we do not really need to
> understand whether IRQs are disabled or not, instead just save flags
> and restore and that is it.

Sounds good. If we are avoiding any possibility of direct reclaim, then
that removes the need for dropping the lock, and thus removes the
chance of the migration that I was concerned about.

> > > + krcp = this_cpu_ptr(&krc);
> > > + if (likely(krcp->initialized))
> > > + spin_lock(&krcp->lock);
> > > +
> > > + return krcp;
> > > +}
> > > +
> > > +static inline void
> > > +krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, bool irqs_disabled)
> > > +{
> > > + if (likely(krcp->initialized))
> > > + spin_unlock(&krcp->lock);
> > > +
> > > + if (!irqs_disabled)
> > > + local_irq_enable();
> > > +}
> > > +
> > > +// alloc_object_locked - Try to allocate an object of size while dropping the lock.
> > > +//
> > > +// @size: Size of the object to internally allocate for kfree_rcu().
> > > +// @slab: Do we allocate on slab or using buddy.
> > > +// @can_sleep: Was kfree_rcu() called in sleepable context?
> > > +// @krcp: The pointer to krcp. Needed if when relocking, we got migrated.
> > > +//
> > > +// Caveats:
> > > +//
> > > +// 1. Per-cpu krc's lock must be held with interrupts disabled.
> > > +//
> > > +// 2. Failure to allocate returns NULL and does not cause a warning.
> > > +//
> > > +// 3. Caller is responsible for using the correct free() APIs. If size == PAGE_SIZE,
> > > +// then free_page() should be called for freeing. Otherwise kfree().
> > > +//
> > > +static inline void *alloc_object_locked(size_t size, bool slab, bool can_sleep,
> > > + struct kfree_rcu_cpu **krcpp)
> > > +{
> > > + void *ptr;
> > > + gfp_t gfp_flags, wmark_flags, reclaim_flags;
> > > + struct kfree_rcu_cpu *krcp = *krcpp;
> > > +
> > > + WARN_ON_ONCE(size == PAGE_SIZE && slab);
> > > +
> > > + // Decompose the flags:
> > > + // wmark_flags - affect the watermark to control reserve access.
> > > + // reclaim_flags - these affect how reclaim works but would
> > > + // have no effect in an atomic or nowait context.
> > > + wmark_flags = (__GFP_HIGH | __GFP_ATOMIC);
> > > + reclaim_flags = (__GFP_RETRY_MAYFAIL);
> >
> > You have a __GFP_RETRY_MAYFAIL here, which is good. However, if
> > this CPU has quite a few 4K blocks of memory already allocated for
> > kfree_rcu(), at some point __GFP_NORETRY becomes necessary. Again,
> > single-argument kfree_rcu() has the option of invoking synchronize_rcu()
> > and most other memory allocators do not. And double-argument kfree_rcu()
> > has a pre-allocated rcu_head in the structure that it can fall back on.
> >
> > So let's please not get too memory-greedy here!
> >
> As i see, your point is to use "light" allocations flags which will
> not initiate direct reclaim, OOM, I/O waiting and so on. Please correct
> me if i miss something.

I agree. One thing I want to add is that any allocation, however small,
has an effect either directly or indirectly. But I agree that trying
too hard may deprive another, unrelated user of memory they may need.

> > Note also that most systems will normally invoke an RCU callback on the
> > same CPU that registered it. This allows easy maintenance of an
> > emergency cache for these situations.
> >
> I have a patch that specifies number of pages to be cached, but i will
> send out it later when we sort things like that out.

Sounds good. I am assuming that you are going to make it such that there
are 2 pages per CPU by default, which can be used for either vfree or
kfree arrays, with further caching being made user-configurable. Either
way, looking forward to the patch.

> > Exceptions include systems doing frequent CPU-hotplug operations and
> > rcu_nocbs CPUs.

Per my understanding, CPU hotplug migrates callbacks, which is why it
can result in callbacks not being invoked on the same CPU they were
queued on. But please let me know if my understanding is not correct.

> > > +
> > > + // These flags will be common to all allocations, whether we want to
> > > + // wait or sleep or reclaim will be controlled with additional flags
> > > + // later during the actual allocation.
> > > + gfp_flags = (wmark_flags | reclaim_flags | __GFP_NOWARN);
> > > +
> > > + // First, do an allocation without waiting.
> > > + ptr = (size == PAGE_SIZE) ? (void *)__get_free_page(gfp_flags | GFP_NOWAIT)
> > > + : (void *)kmalloc(size, gfp_flags | GFP_NOWAIT);
> > > + // If we cannot sleep, we are done.
> > > + if (ptr || !can_sleep)
> > > + return ptr;
> > > +
> > > + // Now try to do it with more relaxed flags, we may enter direct-reclaim.
> > > + //
> > > + // IRQs were not enabled since can_sleep == true. So there's no need to
> > > + // save/restore flags.
> > > + krc_this_cpu_unlock(krcp, false);
> > > + ptr = (size == PAGE_SIZE) ? (void *)__get_free_page(gfp_flags | GFP_KERNEL)
> > > + : (void *)kmalloc(size, gfp_flags | GFP_KERNEL);
> >
> > Dropping the possibility of small allocations also simplifies this code,
> > and also simplifies a fair amount of code elsewhere.
> >
> I have a question about dynamic attaching of the rcu_head. Do you think
> > that we should drop it? We have it because of it requires 8 + sizeof(struct rcu_head)
> bytes and is used when we can not allocate 1 page what is much more for array purpose.
> Therefore, dynamic attaching can succeed because of using SLAB and requesting much
> less memory then one page. There will be higher chance of bypassing synchronize_rcu()
> and inlining freeing on a stack.
>
> I agree that we should not use GFP_* flags instead we could go with GFP_NOWAIT |
> __GFP_NOWARN when head attaching only. Also dropping GFP_ATOMIC to keep
> atomic reserved memory for others.

I have the same question. Just to add here: previous patches added a
warning to synchronize_rcu(). Should that warning be dropped if it
becomes more normal for kfree_rcu() to enter the synchronous path when
the user has not passed in an rcu_head?

> > > if (head == NULL)
> > > - goto inline_return;
> > > -
> > > - /* Take it back. */
> > > - krcp = krc_this_cpu_lock(&flags);
> > > + goto unlock_return;
> > >
> > > /*
> > > * Tag the headless object. Such objects have a back-pointer
> > > @@ -3280,9 +3337,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > > }
> > >
> > > unlock_return:
> > > - krc_this_cpu_unlock(krcp, flags);
> > > + krc_this_cpu_unlock(krcp, irqs_disabled);
> > >
> > > -inline_return:
> > > /*
> > > * High memory pressure, so inline kvfree() after
> > > * synchronize_rcu(). We can do it from might_sleep()
> > > --
> > > 2.26.0.110.g2183baf09c-goog
>
> I know Joel will also write some comments because we has discussed it
> via IRC. The idea is to find common view, and do it
> as best as we can :)
>
> Thanks Paul for good comments!

Thanks a lot to both of you for all the comments :)

- Joel

2020-04-16 21:24:54

by Paul E. McKenney

Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

On Thu, Apr 16, 2020 at 09:17:45AM -0400, Joel Fernandes wrote:
> On Thu, Apr 16, 2020 at 12:30:07PM +0200, Uladzislau Rezki wrote:
> > Hello, Paul, Joel.
> >
> > See my comments below. They are for better understanding a
> > whole strategy and the best way how to drive headless objects :)
>
> Thanks for the comments on the RFC patch :) I am on the same page and Ok with
> being more conservative on memory allocation.
>
> > > > This is a small code refactor and also removes the restriction that
> > > > headless kfree_rcu() users cannot sleep for allocation of the per-cpu
> > > > array where their pointers will be placed. Since they are always called
> > > > in a sleepable context, we can use this information to try harder during
> > > > the array allocation stage to allocate the per-cpu array.
> > >
> > > In kernels, needing to do allocations in order to free memory must be
> > > regarded with great suspicion. It -might- be kind-of sort-of OK here,
> > > but only if we never impede reclaim, I/O, or OOM handling. Even then,
> > > this can be made to work only given that it is possible to fall back
> > > on a direct call to synchronize_rcu() in the case where absolutely no
> > > memory is available.
> > >
> > I see your point and agree. So, the idea is to do progress instead of
> > doing OOM, I/O or direct reclaiming. It means that we should avoid of
> > using any allocations flags which will trigger such effects, directly
> > or indirectly. I think we are on the same base now.
>
> Right.

To Joel's point below, it is more that others need the memory, and don't
have the synchronize_rcu() option. So this code should not take memory
that the reclaim, I/O, or OOM handling code needs, or even memory that
this code could make good use of.

For example, it is entirely possible that a given kfree_rcu() might
only be freeing 16 bytes of data. It clearly does not make much sense
to allocate another 24 bytes of data to free 16 bytes. In contrast
the reclaim, I/O, and OOM handling can often free up many hundreds of
megabytes or even large numbers of gigabytes.
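
In concrete terms, assuming the usual 64-bit sizes:

	sizeof(unsigned long *) + sizeof(struct rcu_head)
		= 8 + 16
		= 24 bytes of overhead to free a 16-byte object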

Of course, it might be that kfree_rcu() is freeing a larger block
of memory, say 64KB. But even in that case, allocating the extra 24
bytes isn't going to cause that memory to be freed any sooner than would
invoking synchronize_rcu() followed by kfree(). (Besides which, if you
have a 64KB structure, in most cases you aren't going to have too much
trouble finding 16 bytes for a rcu_head within that structure.)

Instead, the memory-footprint benefit of allocating the extra memory for
kfree_rcu() comes when the calling process will immediately do another
kfree_rcu(), but of a large structure. The problem is that there is
no way to tell within kfree_rcu() whether the caller is going to invoke
another kfree_rcu() ever, let alone immediately.

So kfree_rcu() needs to restrain itself in its use of memory during
OOM conditions. And if a 4K allocation fails, OOM conditions are
almost always in effect.

> > > > Also there is a possible bug-fix for a migration scenario where a
> > > > kfree_rcu() headless user can get migrated during the
> > > > sleepable-allocation and end up on another CPU and restoring the wrong
> > > > CPU's flags. To remedy this, we store only the IRQ state on the stack
> > > > and save/restore IRQ state from there. Sure, for the headless case we
> > > > don't need to restore flags. But the code saving/restoring state is
> > > > common between headless and with-head kfree_rcu() variants, so it
> > > > handles all scenarios sampling/restoring just the IRQ state and not
> > > > saving/restoring all the flags.
> > >
> > > I will suspend disbelief while I look at the patch, but this indirect flag
> > > handling sounds like an accident waiting to happen. So in the meantime,
> > > is there a way to structure the code to make the flag handling more
> > > explicitly visible at the top level?
> > >
> > > In addition, the usual way to conditionally disable interrupts
> > > is local_irq_save(flags) rather than conditionally invoking
> > > local_irq_disable().
>
> I agree with the below suggestions and that would remove the need for the
> conditional disabling of interrupts.
>
> > >
> > > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > > > ---
> > > >
> > > > This is just RFC and is based on top of Vlad's latest patches:
> > > > https://lkml.org/lkml/2020/4/2/383
> > > >
> > > > The git tree containing this patch is at:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/dev
> > > >
> > > > (This patch will be a part of a much larger series in the future).
> > > >
> > > >
> > > > kernel/rcu/tree.c | 150 +++++++++++++++++++++++++++++++---------------
> > > > 1 file changed, 103 insertions(+), 47 deletions(-)
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 744d04d8b7724..2e0eaec929059 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -3104,11 +3104,95 @@ static void kfree_rcu_monitor(struct work_struct *work)
> > > > spin_unlock_irqrestore(&krcp->lock, flags);
> > > > }
> > > >
> > > > +static inline struct kfree_rcu_cpu *krc_this_cpu_lock(bool irqs_disabled)
> > > > +{
> > > > + struct kfree_rcu_cpu *krcp;
> > > > +
> > > > + // For safely calling this_cpu_ptr().
> > > > + if (!irqs_disabled)
> > > > + local_irq_disable();
> > >
> > > Again, local_irq_save() is the usual approach here. And local_irq_restore()
> > > in krc_this_cpu_unlock() below.
> > >
> > We discussed that with Joel and i also think that keeping previous
> > variant of the krc_this_cpu_lock()/krc_this_cpu_unlock() is better
> > and what is more important is easy, i.e. we do not really need to
> > understand whether IRQs are disabled or not, instead just save flags
> > and restore and that is it.
>
> Sounds good, if we are avoiding any possiblity of direct-reclaim, then that
> removes the need for dropping the lock and thus removes the chance of
> migrations that I was concerned about.

Sounds good here as well! I will let you guys discuss with Sebastian, at
least for the moment. ;-)

> > > > + krcp = this_cpu_ptr(&krc);
> > > > + if (likely(krcp->initialized))
> > > > + spin_lock(&krcp->lock);
> > > > +
> > > > + return krcp;
> > > > +}
> > > > +
> > > > +static inline void
> > > > +krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, bool irqs_disabled)
> > > > +{
> > > > + if (likely(krcp->initialized))
> > > > + spin_unlock(&krcp->lock);
> > > > +
> > > > + if (!irqs_disabled)
> > > > + local_irq_enable();
> > > > +}
> > > > +
> > > > +// alloc_object_locked - Try to allocate an object of size while dropping the lock.
> > > > +//
> > > > +// @size: Size of the object to internally allocate for kfree_rcu().
> > > > +// @slab: Do we allocate on slab or using buddy.
> > > > +// @can_sleep: Was kfree_rcu() called in sleepable context?
> > > > +// @krcp: The pointer to krcp. Needed if when relocking, we got migrated.
> > > > +//
> > > > +// Caveats:
> > > > +//
> > > > +// 1. Per-cpu krc's lock must be held with interrupts disabled.
> > > > +//
> > > > +// 2. Failure to allocate returns NULL and does not cause a warning.
> > > > +//
> > > > +// 3. Caller is responsible for using the correct free() APIs. If size == PAGE_SIZE,
> > > > +// then free_page() should be called for freeing. Otherwise kfree().
> > > > +//
> > > > +static inline void *alloc_object_locked(size_t size, bool slab, bool can_sleep,
> > > > + struct kfree_rcu_cpu **krcpp)
> > > > +{
> > > > + void *ptr;
> > > > + gfp_t gfp_flags, wmark_flags, reclaim_flags;
> > > > + struct kfree_rcu_cpu *krcp = *krcpp;
> > > > +
> > > > + WARN_ON_ONCE(size == PAGE_SIZE && slab);
> > > > +
> > > > + // Decompose the flags:
> > > > + // wmark_flags - affect the watermark to control reserve access.
> > > > + // reclaim_flags - these affect how reclaim works but would
> > > > + // have no effect in an atomic or nowait context.
> > > > + wmark_flags = (__GFP_HIGH | __GFP_ATOMIC);
> > > > + reclaim_flags = (__GFP_RETRY_MAYFAIL);
> > >
> > > You have a __GFP_RETRY_MAYFAIL here, which is good. However, if
> > > this CPU has quite a few 4K blocks of memory already allocated for
> > > kfree_rcu(), at some point __GFP_NORETRY becomes necessary. Again,
> > > single-argument kfree_rcu() has the option of invoking synchronize_rcu()
> > > and most other memory allocators do not. And double-argument kfree_rcu()
> > > has a pre-allocated rcu_head in the structure that it can fall back on.
> > >
> > > So let's please not get too memory-greedy here!
> > >
> > As i see, your point is to use "light" allocations flags which will
> > not initiate direct reclaim, OOM, I/O waiting and so on. Please correct
> > me if i miss something.
>
> I agree. One thing I want to add is any allocation however small has an
> effect either directly or indirectly, but I agree trying too hard may further
> avoid another unrelated needy user getting the memory they may need.

I would say that the main point is to avoid depriving reclaim, I/O, and
OOM handling of memory that they could make good use of.

> > > Note also that most systems will normally invoke an RCU callback on the
> > > same CPU that registered it. This allows easy maintenance of an
> > > emergency cache for these situations.
> > >
> > I have a patch that specifies number of pages to be cached, but i will
> > send out it later when we sort things like that out.
>
> Sounds good. I am assuming that you are going to make it such that there are
> 2 pages per-cpu by default which can be used for either vfree or kfree
> arrays and further caching being made user-configurable. But either way
> looking forward to the patch.
>
> > > Exceptions include systems doing frequent CPU-hotplug operations and
> > > rcu_nocbs CPUs.
>
> Per my understanding, the CPU hotplug has an effect on migrating callbacks so
> that's why CPU hotplug has an effect of not invoking callbacks on the same
> CPU that they were queued on, but please let me know if my understanding is
> not correct.

You got it -- if a CPU is removed, some other CPU will adopt any remaining
callbacks from the outgoing CPU.

But it should be easy to mark the pre-allocated pages with the CPU that
they belong to and locklessly put them back. There are quite a few ways
to do that, especially if those pages are used only when allocation fails.

For example, each CPU could have an array of pointers to its reserved
set of pages, and each such page could have a pointer to its pointer.
Then just use smp_store_release() and done. (Please let me know if
you would like more detail. And, yes, this is compatible with lazy
populating of the reserve set.)
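
To make that concrete, such a reserve could look roughly like the sketch
below. All names are hypothetical, and it glosses over where the
back-pointer lives while a page is in use (the kvfree_rcu_bulk_data header
would be a natural spot):

<snip>
#define KRC_RESERVE_PAGES 2

struct krc_reserve {
	// Slots for pages reserved by this CPU; NULL means empty.
	void *pages[KRC_RESERVE_PAGES];
};

static DEFINE_PER_CPU(struct krc_reserve, krc_reserve);

// Take a page; called on the owning CPU with IRQs disabled, so plain
// accesses suffice on this side. The reserve may be lazily populated.
static void *krc_reserve_take(void)
{
	struct krc_reserve *r = this_cpu_ptr(&krc_reserve);
	int i;

	for (i = 0; i < KRC_RESERVE_PAGES; i++) {
		void *page = smp_load_acquire(&r->pages[i]);

		if (page) {
			WRITE_ONCE(r->pages[i], NULL);
			// Remember which slot to release back to.
			*(void ***)page = &r->pages[i];
			return page;
		}
	}
	return NULL;
}

// Return a page; may run on any CPU, no locks required.
static void krc_reserve_return(void *page)
{
	void **slot = *(void ***)page;

	// Publish the page back to its owning CPU's slot.
	smp_store_release(slot, page);
}
<snip>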

> > > > +
> > > > + // These flags will be common to all allocations; whether we
> > > > + // wait, sleep or reclaim will be controlled with additional flags
> > > > + // later during the actual allocation.
> > > > + gfp_flags = (wmark_flags | reclaim_flags | __GFP_NOWARN);
> > > > +
> > > > + // First, do an allocation without waiting.
> > > > + ptr = (size == PAGE_SIZE) ? (void *)__get_free_page(gfp_flags | GFP_NOWAIT)
> > > > + : (void *)kmalloc(size, gfp_flags | GFP_NOWAIT);
> > > > + // If we cannot sleep, we are done.
> > > > + if (ptr || !can_sleep)
> > > > + return ptr;
> > > > +
> > > > + // Now try to do it with more relaxed flags, we may enter direct-reclaim.
> > > > + //
> > > > + // Since can_sleep == true, the caller did not have IRQs disabled,
> > > > + // so there's no need to save/restore flags.
> > > > + krc_this_cpu_unlock(krcp, false);
> > > > + ptr = (size == PAGE_SIZE) ? (void *)__get_free_page(gfp_flags | GFP_KERNEL)
> > > > + : (void *)kmalloc(size, gfp_flags | GFP_KERNEL);
> > >
> > > Dropping the possibility of small allocations also simplifies this code,
> > > and also simplifies a fair amount of code elsewhere.
> > >
> > I have a question about dynamic attaching of the rcu_head. Do you think
> > that we should drop it? We have it because it requires only 8 +
> > sizeof(struct rcu_head) bytes and is used when we cannot allocate the 1
> > page needed for the array, which is much larger. Dynamic attaching can
> > therefore succeed, because it uses the SLAB allocator and requests much
> > less memory than one page. There will be a higher chance of bypassing the
> > synchronize_rcu() path and inline freeing on the stack.
> >
> > I agree that we should not use such GFP_* flags; instead we could go with
> > GFP_NOWAIT | __GFP_NOWARN when attaching a head only, also dropping
> > GFP_ATOMIC to keep the atomic reserve memory for others.
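
Concretely, the dynamic attaching being discussed amounts to something like
the following sketch; the struct, names, and callback wiring are invented
here for illustration and are not the patch's code:

<snip>
// When the 1-page array cannot be allocated, fall back to a tiny SLAB
// object carrying an rcu_head plus a back-pointer to the headless
// object: 8 + sizeof(struct rcu_head) bytes on 64-bit.
struct krc_dyn_head {
	struct rcu_head rh;
	void *ptr;	// the headless object to kvfree()
};

static void krc_dyn_head_cb(struct rcu_head *rh)
{
	struct krc_dyn_head *dh = container_of(rh, struct krc_dyn_head, rh);

	kvfree(dh->ptr);
	kfree(dh);
}

static bool krc_attach_dyn_head(void *ptr)
{
	struct krc_dyn_head *dh = kmalloc(sizeof(*dh),
					  GFP_NOWAIT | __GFP_NOWARN);

	if (!dh)
		return false;

	dh->ptr = ptr;
	call_rcu(&dh->rh, krc_dyn_head_cb);
	return true;
}
<snip>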

I must defer to people who understand the GFP flags better than I do.
The suggestion of __GFP_RETRY_MAYFAIL for no memory pressure (or maybe
when the CPU's reserve is not yet full) and __GFP_NORETRY otherwise came
from one of these people. ;-)

> I also have the same question. Just to add here: previous patches added a
> warning to the synchronize_rcu() path. Should that warning be dropped,
> then, if it is more normal for kfree_rcu() to enter the synchronous path
> when the user has not passed in an rcu_head?

Agreed. I could imagine a default-disabled kernel-boot flag, but people
wanting such a thing would probably be better served by the current
sysctl_panic_on_oom.

> > > > if (head == NULL)
> > > > - goto inline_return;
> > > > -
> > > > - /* Take it back. */
> > > > - krcp = krc_this_cpu_lock(&flags);
> > > > + goto unlock_return;
> > > >
> > > > /*
> > > > * Tag the headless object. Such objects have a back-pointer
> > > > @@ -3280,9 +3337,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > > > }
> > > >
> > > > unlock_return:
> > > > - krc_this_cpu_unlock(krcp, flags);
> > > > + krc_this_cpu_unlock(krcp, irqs_disabled);
> > > >
> > > > -inline_return:
> > > > /*
> > > > * High memory pressure, so inline kvfree() after
> > > > * synchronize_rcu(). We can do it from might_sleep()
> > > > --
> > > > 2.26.0.110.g2183baf09c-goog
> >
> > I know Joel will also write some comments, because we have discussed it
> > via IRC. The idea is to find a common view, and do it
> > as best as we can :)
> >
> > Thanks, Paul, for the good comments!
>
> Thanks a lot to both of you for all the comments :)

We will get there! ;-)

Thanx, Paul

2020-04-22 15:01:37

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

On Thu, Apr 16, 2020 at 11:01:00AM -0700, Paul E. McKenney wrote:
> On Thu, Apr 16, 2020 at 09:17:45AM -0400, Joel Fernandes wrote:
> > On Thu, Apr 16, 2020 at 12:30:07PM +0200, Uladzislau Rezki wrote:
> > > [...]
>
> I must defer to people who understand the GFP flags better than I do.
> The suggestion of __GFP_RETRY_MAYFAIL for no memory pressure (or maybe
> when the CPU's reserve is not yet full) and __GFP_NORETRY otherwise came
> from one of these people. ;-)

The exact flags we want here depend somewhat on the rate and size of
kfree_rcu() bursts we can expect. We may want to start with one set
and instrument allocation success rates.

Memory tends to be fully consumed by the filesystem cache, so some
form of light reclaim is necessary for almost all allocations.

GFP_NOWAIT won't do any reclaim by itself, but it'll wake kswapd.
Kswapd maintains a small pool of free pages so that even allocations
that are allowed to enter reclaim usually don't have to. It would be
safe for RCU to dip into that.

However, there are some cons to using it:

- Depending on kfree_rcu() burst size, this pool could be exhausted (it's
usually about half a percent of memory, but is affected by sysctls),
and then it would fail NOWAIT allocations until kswapd has caught up.

- This pool is shared by all GFP_NOWAIT users, and many (most? all?)
of them cannot actually sleep. Often they would have to drop locks,
restart list iterations, or suffer some other form of deterioration to
work around failing allocations.

Since rcu wouldn't have anything better to do than sleep at this
juncture, it may as well join the reclaim effort.

Using __GFP_NORETRY or __GFP_RETRY_MAYFAIL would allow it to do that
without exerting too much pressure on the VM.
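
As a rough, hedged illustration of that trade-off (hypothetical helper, not
code from the patch): a cursory attempt that only taps the kswapd-maintained
pool, followed by a bounded direct-reclaim attempt:

<snip>
// Sketch: opportunistic first, then join reclaim briefly. Swapping
// __GFP_NORETRY for __GFP_RETRY_MAYFAIL in the second step would make
// it more persistent, with the risks described above.
static void *krc_alloc_page_sketch(bool can_sleep)
{
	void *p;

	// Cursory attempt: no direct reclaim, but wakes kswapd on failure.
	p = (void *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
	if (p || !can_sleep)
		return p;

	// Join the reclaim effort, but fail fast so the caller can fall
	// back to synchronize_rcu().
	return (void *)__get_free_page(GFP_KERNEL | __GFP_NORETRY |
				       __GFP_NOWARN);
}
<snip>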

2020-04-22 15:39:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

On Wed, Apr 22, 2020 at 10:57:52AM -0400, Johannes Weiner wrote:
> [...]

Thank you for looking this over and for the feedback!

Good point on the sleeping. My assumption has been that sleeping waiting
for a grace period was highly likely to allow memory to eventually be
freed, and that there is a point of diminishing returns beyond which
adding additional tasks to the reclaim effort does not help much.

Here are some strategies right offhand when sleeping is required:

1. Always sleep in synchronize_rcu() in order to (with high
probability) free the memory. This might mean that the reclaim
effort goes slower than would be good.

2. Always sleep in the memory allocator in order to help reclaim
along. (This is a strawman version of what I expect your
proposal really is, but putting it here for completeness, please
see below.)

3. Always sleep in the memory allocator in order to help reclaim
along, but return failure at some point. Then the caller
invokes synchronize_rcu(). When to return failure?

o After some substantial but limited amount of effort has
been spent on reclaim.

o When it becomes likely that further reclaim effort
is not going to free up additional memory.

I am guessing that you are thinking in terms of specifying GFP flags to
result in some variant of #3.

Or am I missing a trick here?

Thanx, Paul

2020-04-23 17:52:51

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

On Wed, Apr 22, 2020 at 08:35:03AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 22, 2020 at 10:57:52AM -0400, Johannes Weiner wrote:
> > [...]
>
> Thank you for looking this over and for the feedback!
>
> Good point on the sleeping. My assumption has been that sleeping waiting
> for a grace period was highly likely to allow memory to eventually be
> freed, and that there is a point of diminishing returns beyond which
> adding additional tasks to the reclaim effort does not help much.

There is when the VM is struggling, but not necessarily when there is
simply a high, concurrent rate of short-lived file cache allocations.

Kswapd can easily reclaim gigabytes of clean page cache each second,
but there might be enough allocation concurrency from other threads to
starve a kfree_rcu() that makes only a very cursory attempt at getting
memory, keeping it from snapping up some of those returns.

In that scenario it makes sense to be a bit more persistent, or even
help scale out the concurrency of reclaim to that of allocations.

> Here are some strategies right offhand when sleeping is required:
>
> 1. Always sleep in synchronize_rcu() in order to (with high
> probability) free the memory. This might mean that the reclaim
> effort goes slower than would be good.
>
> 2. Always sleep in the memory allocator in order to help reclaim
> along. (This is a strawman version of what I expect your
> proposal really is, but putting it here for completeness, please
> see below.)
>
> 3. Always sleep in the memory allocator in order to help reclaim
> along, but return failure at some point. Then the caller
> invokes synchronize_rcu(). When to return failure?
>
> o After some substantial but limited amount of effort has
> been spent on reclaim.
>
> o When it becomes likely that further reclaim effort
> is not going to free up additional memory.
>
> I am guessing that you are thinking in terms of specifying GFP flags to
> result in some variant of #3.

Yes, although I would add

o After making more than one attempt at the freelist to
prevent merely losing races when the allocator/reclaim
subsystem is mobbed by a high concurrency of requests.

__GFP_NORETRY (despite its name) accomplishes this.

__GFP_RETRY_MAYFAIL is yet more persistent, but may retry for quite a
while if the allocation keeps losing the race for a page. This
increases the chance of the allocation eventually succeeding, but also
the risk of 1) trying to get memory for longer than a
synchronize_rcu() might have taken and 2) exerting more temporary
memory pressure on the workload* than might be productive.

So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
further decisions based on instrumentation of the success rates of
these opportunistic allocations.

* Reclaim and OOM handling will be fine since no reserves are tapped

2020-04-23 18:07:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

On Thu, Apr 23, 2020 at 01:48:31PM -0400, Johannes Weiner wrote:
> On Wed, Apr 22, 2020 at 08:35:03AM -0700, Paul E. McKenney wrote:
> [...]
>
> So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> further decisions based on instrumentation of the success rates of
> these opportunistic allocations.
>
> * Reclaim and OOM handling will be fine since no reserves are tapped

Thank you for the explanation! Makes sense to me!!!

Joel, Vlad, does this seem reasonable?

Thanx, Paul

2020-04-23 18:30:37

by Uladzislau Rezki

[permalink] [raw]
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

On Thu, Apr 23, 2020 at 11:02:49AM -0700, Paul E. McKenney wrote:
> On Thu, Apr 23, 2020 at 01:48:31PM -0400, Johannes Weiner wrote:
> > [...]
> > So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > further decisions based on instrumentation of the success rates of
> > these opportunistic allocations.
> >
> > * Reclaim and OOM handling will be fine since no reserves are tapped
>
> Thank you for the explanation! Makes sense to me!!!
>
> Joel, Vlad, does this seem reasonable?
>
To me that makes sense. I think such a strategy fits what we do;
I mean, we need to release memory as soon as possible. Doing it without
initiating a long memory-reclaim process, reclaiming only lightly (what
__GFP_NORETRY does), is a good approach. We have the option to fall back
to synchronize_rcu().

But that is for sleepable context.

I have a question about the non-sleeping context as well, and about how we
allocate one page:

<snip>
bnode = (struct kvfree_rcu_bulk_data *)
__get_free_page(GFP_NOWAIT | __GFP_NOWARN);

<snip>

Johannes, I saw you mention earlier that waking up kswapd is not a
good idea, which is what GFP_NOWAIT actually does. Do you recommend
excluding it? And replacing it with what? __GFP_HIGH | __GFP_ATOMIC?

Thank you in advance!

--
Vlad Rezki

2020-04-23 19:41:50

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

On Thu, Apr 23, 2020 at 08:27:50PM +0200, Uladzislau Rezki wrote:
> On Thu, Apr 23, 2020 at 11:02:49AM -0700, Paul E. McKenney wrote:
> [...]
> But that is for sleepable context.
>
> I have a question about the non-sleeping context as well, and about how we
> allocate one page:
>
> <snip>
> bnode = (struct kvfree_rcu_bulk_data *)
> __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
>
> <snip>
>
> Johannes, I saw you mention earlier that waking up kswapd is not a
> good idea, which is what GFP_NOWAIT actually does. Do you recommend
> excluding it? And replacing it with what? __GFP_HIGH | __GFP_ATOMIC?

This is best-effort, correct?

Upon memory-allocation failure, the single-argument kfree_rcu() can leak
the memory (it has presumably already splatted) and the double-argument
kfree_rcu() can make use of the rcu_head structure that was provided.

Thanx, Paul

2020-04-23 20:09:24

by Uladzislau Rezki

[permalink] [raw]
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

> > [...]
> > But that is for sleepable context.
> >
> > I have a question about the non-sleeping context as well, and about how we
> > allocate one page:
> >
> > <snip>
> > bnode = (struct kvfree_rcu_bulk_data *)
> > __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> >
> > <snip>
> >
> > Johannes, I saw you mention earlier that waking up kswapd is not a
> > good idea, which is what GFP_NOWAIT actually does. Do you recommend
> > excluding it? And replacing it with what? __GFP_HIGH | __GFP_ATOMIC?
>
> This is best-effort, correct?
>
a) Single argument (headless)
In this case we can make use of an allocator with sleepable flags,
because we document that the headless variant must follow the
might_sleep() annotation. For example, __GFP_NORETRY | __GFP_NOWARN.
__GFP_NORETRY can do some light direct reclaim, thus the caller can call
schedule(). To do such an allocation we just drop our local spinlock.

If the allocation fails, we fall directly into synchronize_rcu(),
i.e. inline freeing.

I also call this the sleepable case, that is (a).

b) Double argument (with rcu_head)
We treat this case as if it were called from atomic context, even though
it may not be. Why do we consider it atomic? We just assume so. The
reason is to keep things simple, because it is not possible to detect
whether the calling context, i.e. the one that invokes kfree_rcu(), is
atomic or not (for all kernel configurations).

In this case we do not have the synchronize_rcu() option. Instead we
have an object with an rcu_head inside. If the allocation fails we just
make use of the rcu_head inside the object, i.e. regular queuing.

In this case we do not need to try hard to obtain memory. Therefore my
question to Johannes was what the best approach is here, given that we
decided to minimize reclaiming, whereas GFP_NOWAIT wakes up kswapd when
memory is low. GFP_ATOMIC is also not good, because for (b) we do not
need to waste the atomic reserves.

>
> Upon memory-allocation failure, the single-argument kfree_rcu() can leak
> the memory (it has presumably already splatted) and the double-argument
> kfree_rcu() can make use of the rcu_head structure that was provided.
>
For the single-argument variant we inline the freeing into the current
context after synchronize_rcu(), because it follows the might_sleep()
annotation.

Sorry for the long email, I hope I covered everything :)

--
Vlad Rezki
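
Condensing the two cases above into a hedged sketch; krc_queue_on_array() is
an invented helper standing in for the per-CPU array machinery, and locking
is elided:

<snip>
// (a) Headless, sleepable context: on allocation failure, free inline.
static void krc_headless_sketch(void *ptr)
{
	might_sleep();

	if (krc_queue_on_array(ptr, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN))
		return;

	// No memory even after light reclaim: wait for a grace period
	// and free inline.
	synchronize_rcu();
	kvfree(ptr);
}

// (b) Double argument, assumed atomic: fall back to the embedded rcu_head.
static void krc_with_head_sketch(struct rcu_head *head, rcu_callback_t func)
{
	if (krc_queue_on_array((void *)head, GFP_NOWAIT | __GFP_NOWARN))
		return;

	// Regular queuing via the object's own rcu_head.
	call_rcu(head, func);
}
<snip>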

2020-04-24 04:20:18

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

Hi Vlad,
I'm mostly on the same page, some more comments below:

On Thu, Apr 23, 2020 at 4:00 PM Uladzislau Rezki <[email protected]> wrote:
[snip]
> [...]
> In this case we do not need to try hard to obtain memory. Therefore my
> question to Johannes was what the best approach is here, given that we
> decided to minimize reclaiming, whereas GFP_NOWAIT wakes up kswapd when
> memory is low. GFP_ATOMIC is also not good, because for (b) we do not
> need to waste the atomic reserves.

I think Johannes said that waking up kswapd is OK. OTOH, I did not see
a drawback in waking up kswapd to do background reclaim, since that does
not happen synchronously, right? I think Johannes said we can do better
than just waking kswapd by also doing light direct reclaim using
__GFP_NORETRY, but let me know if I missed something.

> > Upon memory-allocation failure, the single-argument kfree_rcu() can leak
> > the memory (it has presumably already splatted) and the double-argument
> > kfree_rcu() can make use of the rcu_head structure that was provided.
> >
> For the single-argument variant we inline the freeing into the current
> context after synchronize_rcu(), because it follows the might_sleep()
> annotation.

Yes.

Also, with the additional caching being planned, we could avoid
hitting the synchronize_rcu() inlining.

Thanks,

- Joel

2020-04-24 12:33:35

by Uladzislau Rezki

[permalink] [raw]
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

>
> I think Johannes said that waking up kswapd is OK. OTOH, I did not see
> a drawback in waking up kswapd to do background reclaim, since that does
> not happen synchronously, right? I think Johannes said we can do better
> than just waking kswapd by also doing light direct reclaim using
> __GFP_NORETRY, but let me know if I missed something.
>
Then I misunderstood that point. So it seems settled now: we just use
GFP_NOWAIT | __GFP_RECLAIM | __GFP_NORETRY | __GFP_NOWARN for the headless
case, i.e. when we can sleep. It will do direct reclaim (the slow path),
but a light one because of __GFP_NORETRY.

Does it sound good?
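
Spelled out (the macro name below is invented, not from any patch):
GFP_NOWAIT is just __GFP_KSWAPD_RECLAIM, so adding __GFP_RECLAIM enables
direct reclaim as well, while the absence of __GFP_IO and __GFP_FS keeps
that reclaim from starting I/O or touching filesystems:

<snip>
// Hypothetical name for the flag set proposed above, for the headless
// (sleepable) path. Direct reclaim is allowed, but it cannot start I/O
// or touch filesystems, and __GFP_NORETRY keeps it to one light pass.
#define KRC_HEADLESS_GFP \
	(GFP_NOWAIT | __GFP_RECLAIM | __GFP_NORETRY | __GFP_NOWARN)
<snip>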

> > For single argument we inline freeing into current context after
> > synchronize_rcu() because it follows might_sleep() annotation.
>
> Yes.
>
> Also, with the additional caching being planned, we could avoid the
> chances of hitting the synchronize_rcu inlining.
>
Or minimize it.

There is also one question I would like to clarify: dynamic head
attaching, which requires small allocations. Do we drop it?

Thanks!

--
Vlad Rezki

2020-05-05 18:19:43

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

On Thu, Apr 23, 2020 at 09:59:55PM +0200, Uladzislau Rezki wrote:
> [...]
> In this case we do not need to try hard to obtain memory. Therefore my
> question to Johannes was what the best approach is here, given that we
> decided to minimize reclaiming, whereas GFP_NOWAIT wakes up kswapd when
> memory is low. GFP_ATOMIC is also not good, because for (b) we do not
> need to waste the atomic reserves.

Waking kswapd is fine, because it's a shared facility that doesn't
just reclaim on your behalf but on behalf of a central goal: to get
the freelist back to the watermarks. If they're low, somebody will
sooner or later kick kswapd anyway to do exactly that.

So unless you ask kswapd for a high-order page that is unlikely to be
needed by anybody else, you're only doing the inevitable.

2020-05-05 18:35:59

by Uladzislau Rezki

[permalink] [raw]
Subject: Re: [PATCH RFC] rcu/tree: Refactor object allocation and try harder for array allocation

> > [...]
>
> Waking kswapd is fine, because it's a shared facility that doesn't
> just reclaim on your behalf but on behalf of a central goal: to get
> the freelist back to the watermarks. If they're low, somebody will
> sooner or later kick kswapd anyway to do exactly that.
>
> So unless you ask kswapd for a high-order page that is unlikely to be
> needed by anybody else, you're only doing the inevitable.
>
Johannes, thank you for the clarification!

--
Vlad Rezki