Date: Sun, 5 Apr 2020 19:30:28 -0400
From: Joel Fernandes
To: Uladzislau Rezki
Cc: LKML, "Paul E. McKenney", RCU, linux-mm@kvack.org, Andrew Morton,
 Steven Rostedt, Oleksiy Avramchenko
Subject: Re: [PATCH 1/1] rcu/tree: add emergency pool for headless case
Message-ID: <20200405233028.GC83565@google.com>
References: <20200403173051.4081-1-urezki@gmail.com>
 <20200404195129.GA83565@google.com>
 <20200405172105.GA7539@pc636>
In-Reply-To: <20200405172105.GA7539@pc636>

Hi Vlad,

On Sun, Apr 05, 2020 at 07:21:05PM +0200, Uladzislau Rezki wrote:
> On Sat, Apr 04, 2020 at 03:51:29PM -0400, Joel Fernandes wrote:
> > On Fri, Apr 03, 2020 at 07:30:51PM +0200, Uladzislau Rezki (Sony) wrote:
> > > Maintain an emergency pool for each CPU with some extra objects.
> > > There is a read-only sysfs attribute, "rcu_nr_emergency_objs", that
> > > reflects the size of the pool. For now the default value is 3.
> > >
> > > The pool is populated when a low-memory condition is detected. Please
> > > note it is only for the headless case: when the regular SLAB is not
> > > able to serve a request, the pool is used.
> > >
> > > Signed-off-by: Uladzislau Rezki (Sony)
> >
> > Hi Vlad,
> >
> > One concern I have is that this moves the problem a bit further down.
> > My belief is that we should avoid the likelihood of even needing an
> > rcu_head allocated for the headless case to begin with, rather than
> > trying to do damage control when it does happen. The only way we would
> > end up needing an rcu_head is if we could not allocate an array.
> >
> Let me share my view on all such caching. I think it is now less of an
> issue, because we now have the https://lkml.org/lkml/2020/4/2/383 patch,
> and I see that it helps a lot. I tried to simulate a low-memory condition
> and apply high memory pressure with it, and I did not manage to trigger
> the "synchronize rcu" path at all. That is because we use much more
> permissive parameters when we request memory from the SLAB (direct
> reclaim, etc.).

That's a good sign that we don't hit this path in your tests. I guess also,
with your latest patch releasing the lock so that the allocation happens in
a non-atomic context, it became even more permissive? If you drop that patch
and retest, do you still not hit the synchronous path more often? Could you
also try introducing memory pressure by reducing your system's total memory
and see how it behaves?
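(For readers following the thread: the headless slow path under discussion
boils down to roughly the chain below. This is a simplified sketch with an
invented wrapper struct and invented helper names, not the actual
kernel/rcu/tree.c code.)

#include <linux/slab.h>      /* kmalloc(), kvfree() */
#include <linux/rcupdate.h>  /* struct rcu_head, synchronize_rcu() */

/* Hypothetical wrapper and helpers standing in for the real logic. */
struct headless_wrapper {
        struct rcu_head rh;     /* dynamically allocated, not embedded */
        void *ptr;              /* the object being freed */
};
static bool add_ptr_to_bulk_array(void *ptr);
static void queue_headless_free(struct headless_wrapper *w);

static void headless_free_sketch(void *ptr)
{
        struct headless_wrapper *w;

        /* Fast path: stash the pointer in a cached bulk-array page. */
        if (add_ptr_to_bulk_array(ptr))
                return;

        /* No array slot left: fall back to a dynamic rcu_head wrapper. */
        w = kmalloc(sizeof(*w), GFP_KERNEL | __GFP_NOWARN);
        if (w) {
                w->ptr = ptr;
                queue_headless_free(w);
                return;
        }

        /* Both allocations failed: block for a grace period, then free. */
        synchronize_rcu();
        kvfree(ptr);
}

The last branch is the kfree_rcu() latency hit that both the emergency pool
and the pre-allocated cache are trying to make less likely.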
> > So instead of adding a pool for rcu_head allocations, how do you feel
> > about pre-allocating the per-cpu cache array instead, which has the
> > same effect as you are intending?
> >
> In v2 I have a list of such objects. It is also per-CPU (it scales with
> the number of CPUs), but the difference is that those objects require much
> less memory: 8 + sizeof(struct rcu_head) bytes compared to one page.
> Therefore the memory footprint is lower.

Yes, true. That is the one drawback: higher memory usage. But if you have
at least one kfree_rcu() request on each CPU, then pre-allocation does not
increase memory usage any more than it already is increased right now.
Unless we extend my proposal to cache more than 2 pages per CPU, which I
think you mentioned below.

> I have doubts that we would ever hit this emergency list, because of the
> patch mentioned above, but on the other hand I cannot guarantee that
> 100%. Just in case, we may keep it.

You really have to hit OOM in your tests to trigger it, I suppose.
Basically the emergency pool improves the situation under OOM, but
otherwise does not, due to the direct reclaim that happens, as you
mentioned. Right?

> Paul, could you please share your view and opinion? It would be
> appreciated :)
>
> > This has 4 benefits:
> > 1. It scales with the number of CPUs; no configuration is needed.
> > 2. It makes the first kfree_rcu() faster and less dependent on an
> >    allocation succeeding.
> > 3. Much simpler code, no new structures or special handling.
> > 4. In the future we can extend it to allocate more than 2 pages per CPU
> >    using the same caching mechanism.
> >
> > The obvious drawback is that it costs 2 pages per CPU, but at least it
> > scales with the number of CPUs. Something like the following (just
> > lightly tested):
> >
> > ---8<-----------------------
> >
> > From: "Joel Fernandes (Google)"
> > Subject: [PATCH] rcu/tree: Preallocate the per-cpu cache for kfree_rcu()
> >
> > In recent changes, we have made it possible to use kfree_rcu() without
> > embedding an rcu_head in the object being freed. This requires dynamic
> > allocation. In case dynamic allocation fails due to memory pressure, we
> > would end up synchronously waiting for an RCU grace period, thus
> > hurting kfree_rcu() latency.
> >
> > To make this less probable, let us pre-allocate the per-cpu cache so we
> > depend less on the dynamic allocation succeeding. This also has the
> > effect of making kfree_rcu() slightly faster at run time.
> >
> > Signed-off-by: Joel Fernandes (Google)
> > ---
> >  kernel/rcu/tree.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 6172e6296dd7d..9fbdeb4048425 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -4251,6 +4251,11 @@ static void __init kfree_rcu_batch_init(void)
> >  		krcp->krw_arr[i].krcp = krcp;
> >  	}
> >
> > +	krcp->bkvcache[0] = (struct kvfree_rcu_bulk_data *)
> > +		__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > +	krcp->bkvcache[1] = (struct kvfree_rcu_bulk_data *)
> > +		__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > +
> >  	INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> >  	krcp->initialized = true;
> >  }

> We pre-allocate it, but differently compared with your proposal :) I do
> not see how it can improve things. The difference is that you do it
> during the initialization/boot phase, whereas the current code will
> pre-allocate and cache one page after the first call to
> kvfree_call_rcu(), say within one second. So basically both variants are
> the same.
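(To make the contrast concrete: the lazy variant Vlad describes refills the
cache around the first kvfree_call_rcu() on each CPU rather than at boot,
along these lines. Again a sketch; bkvcache[] matches the diff above, but
the function itself is illustrative, not the actual tree.c code.)

/*
 * Lazy counterpart of the boot-time pre-allocation in the patch above:
 * populate the per-CPU cache on first use, then hand the page out.
 */
static struct kvfree_rcu_bulk_data *
get_cached_bnode_sketch(struct kfree_rcu_cpu *krcp)
{
        if (!krcp->bkvcache[0]) {
                /* First use on this CPU: try to populate the cache. */
                krcp->bkvcache[0] = (struct kvfree_rcu_bulk_data *)
                        __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
        }

        /* May still be NULL under memory pressure. */
        return xchg(&krcp->bkvcache[0], NULL);
}

Either way the steady state is the same (a cached page sitting in
bkvcache[] waiting for the next batch), which is Vlad's point: the
boot-time version only changes who pays for the first fill.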
Well, my proposal is only 5 lines extra ;-). That has got to be at least a
bit appealing ;-) ;-).

> But I think that we should allow two pages to be used as cached ones, no
> matter whether they hold vmalloc ptrs. or SLAB ones. So basically, the
> two cached pages can be used by both the vmalloc path and the SLAB path.
> And it probably makes sense because there are two phases: one is when we
> collect pointers, the second one is the memory-reclaim path. Thus one
> page per phase, i.e. it would be paired.

You are saying this with regard to my proposal, right? I agree the number
of pages could be increased. The caching mechanism already in place could
be the starting point for that extension.

Thanks, Vlad!

 - Joel

> Thanks, Joel!
>
> --
> Vlad Rezki