From: Vincent Guittot
Date: Wed, 19 Jul 2023 11:09:24 +0200
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load
To: Aaron Lu
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long,
    linux-kernel@vger.kernel.org
In-Reply-To: <20230719051826.GB84059@ziqianlu-dell>
References: <20230718134120.81199-1-aaron.lu@intel.com>
    <20230718134120.81199-4-aaron.lu@intel.com>
    <20230719051826.GB84059@ziqianlu-dell>

On Wed, 19 Jul 2023 at 07:18, Aaron Lu wrote:
>
> On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> > On Tue, 18 Jul 2023 at 15:41, Aaron Lu wrote:
> > >
> > > When a workload involves many wake-time task migrations, tg->load_avg
> > > can be heavily contended among CPUs, because every migration involves
> > > removing the task's load from its src cfs_rq and attaching that load
> > > to its new cfs_rq. Both the remove and the attach require an update to
> > > tg->load_avg as well as propagating the change up the hierarchy.
> > >
> > > E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU
> > > Intel Sapphire Rapids, during a 5s window there are 14 million wakeups
> > > and 11 million migrations. Since the workload can trigger many wakeups
> > > and migrations, the accesses (both read and write) to tg->load_avg can
> > > be unbounded. For the above workload, the profile shows
> > > update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
> > > netperf/nr_client=nr_cpu/UDP_RR, there are 21 million wakeups and
> > > 14 million migrations; update_cfs_group() costs ~25% and
> > > update_load_avg() costs ~16%.
> > >
> > > This patch is an attempt to reduce the cost of accessing tg->load_avg.
> >
> > Here you mention tg->load_avg, which is updated with update_tg_load_avg(),
> > but your patch below modifies the local updates of the cfs_rq's util_avg,
> > runnable_avg and load_avg, which need to be kept up to date.
>
> Yes, since it delays propagating the removed load, the upper cfs_rqs'
> *_avg could be updated later than with the current code.
>
> > You would be better off delaying or rate limiting the calls to
> > update_tg_load_avg() or calc_group_shares()/update_cfs_group(), which
> > access tg->load_avg and are the culprit, instead of modifying other
> > places.
>
> Thanks for the suggestion, I think it makes perfect sense.
>
> I tried the below to rate limit the update to tg->load_avg to at most
> once per ms in update_tg_load_avg():
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e48fd0e6982d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  {
>         long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> +       u64 now = cfs_rq_clock_pelt(cfs_rq);
>
>         /*
>          * No need to update load_avg for root_task_group as it is not used.
> @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>         if (cfs_rq->tg == &root_task_group)
>                 return;
>
> -       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> +       if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
> +           abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>                 atomic_long_add(delta, &cfs_rq->tg->load_avg);
>                 cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> +               cfs_rq->last_update_tg_load_avg = now;
>         }
>  }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 14dfaafb3a8f..b5201d44d820 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -594,6 +594,7 @@ struct cfs_rq {
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         unsigned long tg_load_avg_contrib;
> +       u64 last_update_tg_load_avg;
>         long propagate;
>         long prop_runnable_sum;
>
> From some quick tests using postgres_sysbench and netperf/UDP_RR on SPR,
> this alone already delivers good results. For postgres_sysbench, the cost
> of the two functions dropped to 1%-2% each; for netperf/UDP_RR, it dropped
> to ~2% and ~4%. Together with some more rate limiting on the read side,
> I think the end result will be better. Does the above look reasonable to
> you?

Yes, that seems a better way to limit the write accesses.
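For illustration only, a minimal standalone user-space sketch of the
write-side rate-limiting pattern shown in the diff above (this is not the
kernel code; fake_cfs_rq, fold_tg_load_avg and the simulated clock are
invented names for the example):

/*
 * Standalone sketch (not kernel code) of the write-side rate limiting
 * discussed above: a per-runqueue contribution is folded into a shared
 * atomic only when at least 1 ms has passed AND the local value has
 * drifted by more than ~1.5% (1/64) since the last fold.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NSEC_PER_MSEC 1000000ULL

static atomic_long tg_load_avg;          /* shared, contended counter */

struct fake_cfs_rq {
        long load_avg;                   /* local, per-runqueue value  */
        long tg_load_avg_contrib;        /* what we last folded in     */
        unsigned long long last_fold_ns; /* time of the last fold      */
};

static void fold_tg_load_avg(struct fake_cfs_rq *cfs_rq,
                             unsigned long long now_ns)
{
        long delta = cfs_rq->load_avg - cfs_rq->tg_load_avg_contrib;

        /*
         * Skip the expensive shared-cacheline write unless both the
         * time-based and magnitude-based thresholds are exceeded.
         */
        if (now_ns - cfs_rq->last_fold_ns < NSEC_PER_MSEC)
                return;
        if (labs(delta) <= cfs_rq->tg_load_avg_contrib / 64)
                return;

        atomic_fetch_add(&tg_load_avg, delta);
        cfs_rq->tg_load_avg_contrib = cfs_rq->load_avg;
        cfs_rq->last_fold_ns = now_ns;
}

int main(void)
{
        struct fake_cfs_rq rq = { .load_avg = 1024 };
        unsigned long long now = 0;

        /* Simulate frequent local updates; few reach the shared atomic. */
        for (int i = 0; i < 10; i++) {
                rq.load_avg += 100;      /* pretend a task migrated in */
                now += 300 * 1000;       /* 0.3 ms between updates     */
                fold_tg_load_avg(&rq, now);
        }
        printf("shared tg_load_avg = %ld, local load_avg = %ld\n",
               atomic_load(&tg_load_avg), rq.load_avg);
        return 0;
}

Built with any C11 compiler, only a couple of the ten simulated local
updates end up touching the shared atomic; the rest are filtered out by the
1 ms and 1/64 thresholds.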
> Alternatively, I can remove some call sites of update_tg_load_avg() like
> you suggested below and only call update_tg_load_avg() when the cfs_rq has
> decayed (really just decayed, not when it has detected pending removed
> load or load propagated from its children). That would give a similar
> result to the above (roughly once per ms).
>
> > Have you tried removing update_cfs_group() from enqueue/dequeue and
> > only letting the tick update the shares periodically?
>
> Patch 4 kind of did that :-)

That's what I noticed when reading patch 4 :-)

> > Have you tried making update_tg_load_avg() return early? The change in
> > the cfs_rq's load_avg is written to tg->load_avg only when the change
> > is bigger than 1.5%. Maybe increase that to 6%?
>
> Yes, increasing the delta threshold is also a good way to limit the
> updates to tg->load_avg. Will try that too.
>
> > Or, as with update_cfs_group(), remove it from attach/detach entity and
> > let the periodic update propagate the change.
> >
> > But please don't touch the local updates of *_avg.
>
> OK, I see.
>
> Thanks again for the comments, they are very helpful.
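As a side note on the 1.5% and 6% figures mentioned above, a tiny
standalone check of the arithmetic (not kernel code; the contrib value is
just an example):

/*
 * The existing filter in update_tg_load_avg() is
 * abs(delta) > tg_load_avg_contrib / 64, i.e. ~1.5% of the current
 * contribution; a divisor of 16 would raise that to ~6%.
 */
#include <stdio.h>

int main(void)
{
        long contrib = 2048;    /* example tg_load_avg_contrib value */

        printf("divisor 64: threshold = %ld (%.2f%% of contrib)\n",
               contrib / 64, 100.0 / 64);
        printf("divisor 16: threshold = %ld (%.2f%% of contrib)\n",
               contrib / 16, 100.0 / 16);
        return 0;
}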