From: Mel Gorman <mgorman@techsingularity.net>
To: Peter Zijlstra
Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
    Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman
Subject: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers
Date: Thu, 21 Oct 2021 15:56:02 +0100
Message-Id: <20211021145603.5313-2-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.31.1
In-Reply-To: <20211021145603.5313-1-mgorman@techsingularity.net>
References: <20211021145603.5313-1-mgorman@techsingularity.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Mailing-List: linux-kernel@vger.kernel.org

This patch mitigates a problem where wake_wide() allows a heavy waker
(e.g. X) to stack an excessive number of wakees on the same CPU. This is
due to the cpu_load() check in wake_affine_weight(). As noted by the
original patch author, Mike Galbraith [1]:

	Between load updates, X, or any other waker of many, can stack
	wakees to a ludicrous depth. Tracing kbuild vs firefox playing a
	youtube clip, I watched X stack 20 of the zillion firefox minions
	while their previous CPUs all had 1 lousy task running but a
	cpu_load() higher than the cpu_load() of X's CPU. Most of those
	prev_cpus were where X had left them when it migrated. Each and
	every crazy depth migration was wake_affine_weight() deciding we
	should pull.

Paraphrasing Mike's test results from the patch: with make -j8 running
alongside firefox with two tabs, one containing youtube's suggestions and
the other a running clip, if the idle tab is in focus and the mouse is not
moved, the flips decay enough for wake_wide() to lose interest. Just
wiggling the mouse makes it start waking wide, and with the running clip
in focus it continuously wakes wide.

With this patch, record_wakee() also bumps the wakee's flip count when the
waker's flips greatly exceed the wakee's, and wake_wide() no longer rules
out waking wide purely because the wakee has few flips of its own when the
waker is flip-heavy. The end result is that heavy wakers are less likely
to stack tasks and, depending on the workload, migrations are reduced.

From additional tests on various servers, the impact is machine dependent
but generally this patch improves the situation.

hackbench-process-pipes
                          5.15.0-rc3             5.15.0-rc3
                             vanilla  sched-wakeeflips-v1r1
Amean     1        0.3667 (   0.00%)      0.3890 (  -6.09%)
Amean     4        0.5343 (   0.00%)      0.5217 (   2.37%)
Amean     7        0.5300 (   0.00%)      0.5387 (  -1.64%)
Amean     12       0.5737 (   0.00%)      0.5443 (   5.11%)
Amean     21       0.6727 (   0.00%)      0.6487 (   3.57%)
Amean     30       0.8583 (   0.00%)      0.8033 (   6.41%)
Amean     48       1.3977 (   0.00%)      1.2400 *  11.28%*
Amean     79       1.9790 (   0.00%)      1.8200 *   8.03%*
Amean     110      2.8020 (   0.00%)      2.5820 *   7.85%*
Amean     141      3.6683 (   0.00%)      3.2203 *  12.21%*
Amean     172      4.6687 (   0.00%)      3.8200 *  18.18%*
Amean     203      5.2183 (   0.00%)      4.3357 *  16.91%*
Amean     234      6.1077 (   0.00%)      4.8047 *  21.33%*
Amean     265      7.1313 (   0.00%)      5.1243 *  28.14%*
Amean     296      7.7557 (   0.00%)      5.5940 *  27.87%*

While different machines showed different results, in general there were
far fewer CPU migrations of tasks.

tbench4
                          5.15.0-rc3             5.15.0-rc3
                             vanilla  sched-wakeeflips-v1r1
Hmean     1         824.05 (   0.00%)      802.56 *  -2.61%*
Hmean     2        1578.49 (   0.00%)     1645.11 *   4.22%*
Hmean     4        2959.08 (   0.00%)     2984.75 *   0.87%*
Hmean     8        5080.09 (   0.00%)     5173.35 *   1.84%*
Hmean     16       8276.02 (   0.00%)     9327.17 *  12.70%*
Hmean     32      15501.61 (   0.00%)    15925.55 *   2.73%*
Hmean     64      27313.67 (   0.00%)    24107.81 * -11.74%*
Hmean     128     32928.19 (   0.00%)    36261.75 *  10.12%*
Hmean     256     35434.73 (   0.00%)    38670.61 *   9.13%*
Hmean     512     50098.34 (   0.00%)    53243.75 *   6.28%*
Hmean     1024    69503.69 (   0.00%)    67425.26 *  -2.99%*

Bit of a mixed bag, but it wins more than it loses.
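As an aside, and not as part of the patch itself, the behavioural
difference of the wake_wide() condition and the record_wakee() coupling
can be sketched in plain user-space C. The llc_size value, the flip
counts and the wake_wide_old()/wake_wide_new() helpers below are made up
purely for illustration; in the kernel the factor comes from the per-CPU
sd_llc_size and the flips live in task_struct:

/*
 * Illustrative user-space model of the wake_wide() change, NOT kernel code.
 * llc_size and the flip counts are assumed example values.
 */
#include <stdio.h>

static const unsigned int llc_size = 8;	/* assumed LLC width for the example */

/* Old condition: 1 means wake wide, 0 means consider an affine wakeup. */
static int wake_wide_old(unsigned int master, unsigned int slave)
{
	unsigned int factor = llc_size;

	if (master < slave) {
		unsigned int tmp = master; master = slave; slave = tmp;
	}
	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}

/* Patched condition: a sufficiently flip-heavy waker can wake wide even
 * when the wakee has accrued few flips of its own. */
static int wake_wide_new(unsigned int master, unsigned int slave)
{
	unsigned int factor = llc_size;

	if (master < slave) {
		unsigned int tmp = master; master = slave; slave = tmp;
	}
	if ((slave < factor && master < (factor >> 1) * factor) ||
	    master < slave * factor)
		return 0;
	return 1;
}

int main(void)
{
	/* An X-like waker with many flips waking a low-flip wakee. */
	unsigned int waker_flips = 200, wakee_flips = 2;

	printf("old condition: wake_wide=%d\n",
	       wake_wide_old(waker_flips, wakee_flips));
	printf("new condition: wake_wide=%d\n",
	       wake_wide_new(waker_flips, wakee_flips));

	/*
	 * The record_wakee() side of the patch: when the waker's flips exceed
	 * the wakee's by more than a factor of 2*llc_size, the wakee's own
	 * flip count is nudged upwards too, so repeated wakeups from a heavy
	 * waker push the wakee towards the wake-wide path over time.
	 */
	if (waker_flips > wakee_flips * (llc_size << 1))
		wakee_flips++;
	printf("wakee_flips after one coupled wakeup: %u\n", wakee_flips);

	return 0;
}

For these example values the old condition returns 0 (an affine wakeup is
still considered) because slave < factor, while the new condition returns
1 (wake wide) because the waker's flips are at least (factor >> 1) * factor,
and the record_wakee() coupling bumps the wakee's flips from 2 to 3 on
that wakeup.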
A new workload was added that runs a kernel build in the background with
-jNR_CPUS while NR_CPUS pairs of tasks run Netperf TCP_RR. The intent is
to see whether heavy background tasks disrupt lighter tasks.

multi subtest kernbench
                            5.15.0-rc3             5.15.0-rc3
                               vanilla  sched-wakeeflips-v1r1
Min       elsp-256     20.80 (   0.00%)      14.89 (  28.41%)
Amean     elsp-256     24.08 (   0.00%)      20.94 (  13.05%)
Stddev    elsp-256      3.32 (   0.00%)       4.68 ( -41.16%)
CoeffVar  elsp-256     13.78 (   0.00%)      22.36 ( -62.33%)
Max       elsp-256     29.11 (   0.00%)      26.49 (   9.00%)

multi subtest netperf-tcp-rr
                            5.15.0-rc3             5.15.0-rc3
                               vanilla  sched-wakeeflips-v1r1
Min       1       48286.26 (   0.00%)   49101.48 (   1.69%)
Hmean     1       62894.82 (   0.00%)   68963.51 *   9.65%*
Stddev    1        7600.56 (   0.00%)    8804.82 ( -15.84%)
Max       1       78975.16 (   0.00%)   87124.67 (  10.32%)

The variability is higher as a result of the patch, but both workloads
experienced improved performance.

[1] https://lore.kernel.org/r/02c977d239c312de5e15c77803118dcf1e11f216.camel@gmx.de

Signed-off-by: Mike Galbraith
Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..d00af3b97d8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
 	}
 
 	if (current->last_wakee != p) {
+		int min = __this_cpu_read(sd_llc_size) << 1;
+		/*
+		 * Couple the wakee flips to the waker for the case where it
+		 * doesn't accrue flips, taking care to not push the wakee
+		 * high enough that the wake_wide() heuristic fails.
+		 */
+		if (current->wakee_flips > p->wakee_flips * min)
+			p->wakee_flips++;
 		current->last_wakee = p;
 		current->wakee_flips++;
 	}
@@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
 
 	if (master < slave)
 		swap(master, slave);
-	if (slave < factor || master < slave * factor)
+	if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
 		return 0;
 	return 1;
 }
-- 
2.31.1