From: Uladzislau Rezki
Date: Tue, 17 Oct 2023 16:06:29 +0200
To: Joel Fernandes
Cc: "Uladzislau Rezki (Sony)", "Paul E. McKenney", RCU, Neeraj Upadhyay,
 Boqun Feng, LKML, Oleksiy Avramchenko, Frederic Weisbecker
Subject: Re: [PATCH v3 1/1] rcu: Reduce synchronize_rcu() waiting time
References: <20231016173004.14148-1-urezki@gmail.com>

Hello, Joel!

> Hello, Vlad,
>
> Looks like a nice patch, I am still looking at this more and just
> started, but a few small comments:
>
Thank you for looking at it :) And thank you for the comments.

> On Mon, Oct 16, 2023 at 1:30 PM Uladzislau Rezki (Sony) wrote:
> >
> > A call to synchronize_rcu() can be optimized from a latency point of
> > view. Different workloads can be affected by this, especially the
> > ones which use this API in their time-critical sections.
> >
> > For example, if CONFIG_RCU_NOCB_CPU is set, the wakeme_after_rcu()
> > callback can be delayed, and such a delay depends on where in the
> > nocb list it is located.
> >
> > 1. On our Android devices I can easily trigger a scenario where it
> > is the last in the list out of ~3600 callbacks:
> >
> > <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > ...
> > <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> > <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> >
> > 2. We use cpuset/cgroup to classify tasks and assign them into
> > different cgroups. For example, the "background" group binds tasks
> > only to little CPUs, whereas "foreground" makes use of all CPUs.
> > Tasks can be migrated between groups by request if an acceleration
> > is needed.
> >
> > See below an example of how the "surfaceflinger" task gets migrated.
> > Initially it is located in the "system-background" cgroup, which
> > allows it to run only on little cores. In order to speed it up, it
> > can be temporarily moved into the "foreground" cgroup, which allows
> > it to use big/all CPUs:
> >
> [...]
> > 3. To address this drawback, maintain a separate track that consists
> > of synchronize_rcu() callers only. The GP kthread, which drives a GP,
> > either wakes up a worker to drain the whole list or directly wakes up
> > the end user if it is the only one in the drain list.
>
> I wonder if there is a cut-off N, where waking up N ("a few") inline
> instead of just 1, yields similar results. For small values of N, that
> keeps the total number of wakeups lower than pushing off to a kworker.
> For instance, if 2 users, then you get 3 wakeups instead of just 2 (1
> for the kworker and another 2 for the synchronize). But if you had a
> cut-off like N=5, then 2 users results in just 2 wakeups.
>
> I don't feel too strongly about it, but for small values of N, I am
> interested in a slightly better optimization if we can squeeze it.
>
This makes sense to me. We can add a threshold to wake up several
users directly, like you mentioned, so with 2 users we do 2 wake-ups
instead of 3.
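For example, something like the sketch below. It is only an
illustration of the cut-off idea; the SR_MAX_INLINE_WAKE value and the
sr_cleanup_work item are made-up names, not part of the patch:

/*
 * Sketch only: complete up to SR_MAX_INLINE_WAKE users directly from
 * the GP kthread. Anything above the cut-off is pushed back and
 * handled by the worker as before, so a small number of users never
 * pays for the extra kworker wakeup.
 */
#define SR_MAX_INLINE_WAKE 5

static DECLARE_WORK(sr_cleanup_work, rcu_sr_normal_gp_cleanup_work);

static void rcu_sr_normal_gp_cleanup(void)
{
	struct llist_node *done, *rcu, *next;
	int i = 0;

	done = llist_del_all(&sr.done);

	llist_for_each_safe(rcu, next, done) {
		if (i++ < SR_MAX_INLINE_WAKE) {
			rcu_sr_normal_complete(rcu);
		} else {
			/* Push the rest back; order is not preserved. */
			llist_add(rcu, &sr.done);
		}
	}

	if (i > SR_MAX_INLINE_WAKE)
		queue_work(system_highpri_wq, &sr_cleanup_work);
}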
> [...]
> > + * There are three lists for handling synchronize_rcu() users.
> > + * A first list corresponds to new coming users, second for users
> > + * which wait for a grace period and third is for which a grace
> > + * period is passed.
> > + */
> > +static struct sr_normal_state {
> > +	struct llist_head curr;	/* request a GP users. */
> > +	struct llist_head wait;	/* wait for GP users. */
> > +	struct llist_head done;	/* ready for GP users. */
> > +	struct llist_node *curr_tail;
> > +	struct llist_node *wait_tail;
>
> Just for clarification, the only reason you need 'tail' is because you
> can use llist_add_batch() to quickly splice the list, instead of
> finding the tail each time (expensive for a large list), correct? It
> would be good to mention that in a comment here.
>
Right. I do not want to scan the list, so the tail lets me splice it
with llist_add_batch() in O(1). I will add a comment about it.

> > +	atomic_t active;
> > +} sr;
> > +
> > +/* Disable it by default. */
> > +static int rcu_normal_wake_from_gp = 0;
> > +module_param(rcu_normal_wake_from_gp, int, 0644);
> > +
> > +static void rcu_sr_normal_complete(struct llist_node *node)
> > +{
> > +	struct rcu_synchronize *rs = container_of(
> > +		(struct rcu_head *) node, struct rcu_synchronize, head);
> > +	unsigned long oldstate = (unsigned long) rs->head.func;
> > +
> > +	if (!poll_state_synchronize_rcu(oldstate))
> > +		WARN_ONCE(1, "A full grace period is not passed yet: %lu",
> > +			rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
>
> nit: WARN_ONCE(!poll_state_synchronize_rcu(oldstate), ...) and get
> rid of the if()?
>
Initially I had it written as you proposed, but I reworked it because
I wanted to see the delta. I do not have a strong opinion, so I can
fix it as you proposed.

> > +
> > +	/* Finally. */
> > +	complete(&rs->completion);
> > +}
> > +
> > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > +{
> > +	struct llist_node *done, *rcu, *next;
> > +
> > +	done = llist_del_all(&sr.done);
> > +	if (!done)
> > +		return;
> > +
> > +	llist_for_each_safe(rcu, next, done)
> > +		rcu_sr_normal_complete(rcu);
> > +}
> [...]
> > +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > +{
> > +	atomic_inc(&sr.active);
> > +	if (llist_add((struct llist_node *) &rs->head, &sr.curr))
> > +		/* Set the tail. Only first and one user can do that. */
> > +		WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);
> > +	atomic_dec(&sr.active);
>
> Here there is no memory ordering provided by the atomic ops. Is that
> really OK?
>
This needs to be reworked, since there is no ordering guaranteed. I
think there is a version of "atomic_inc_something" that guarantees it?
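Something like the below is what I have in mind, if I read
Documentation/atomic_t.txt right. A sketch only, to show the two
options; the final variant may differ:

/*
 * Sketch only. Value-returning RMW atomics are fully ordered, so
 * atomic_inc_return() implies a full barrier on both sides. As an
 * alternative, the relaxed atomic_inc()/atomic_dec() can be kept
 * and ordered explicitly via the smp_mb__{before,after}_atomic()
 * helpers.
 */
static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
{
	/* Fully ordered RMW, acts as a full memory barrier. */
	(void) atomic_inc_return(&sr.active);

	if (llist_add((struct llist_node *) &rs->head, &sr.curr))
		/* Set the tail. Only the first user can do that. */
		WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);

	smp_mb__before_atomic();	/* Order the llist_add() before the dec. */
	atomic_dec(&sr.active);
}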
> And same question for other usages of .active.
>
> Per: https://www.kernel.org/doc/Documentation/atomic_t.txt
> "RMW operations that have no return value are unordered;"
>
> --
> Note to self to ping my Android friends as well about this improvement
> :-). Especially the -mm Android folks are interested in app startup
> time.
>
That is good. Thank you :)

--
Uladzislau Rezki