Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
Date:   Wed, 10 Feb 2021 15:37:12 -0500
From:   Johannes Weiner <hannes@cmpxchg.org>
To:     Chengming Zhou <zhouchengming@bytedance.com>
Cc:     mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
        vincent.guittot@linaro.org, rostedt@goodmis.org,
        linux-kernel@vger.kernel.org, songmuchun@bytedance.com
Subject: Re: [PATCH] psi: Use ONCPU state tracking machinery to detect reclaim
Message-ID: <YCRD+IkZNPkOtNYz@cmpxchg.org>
References: <20210210040605.38973-1-zhouchengming@bytedance.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20210210040605.38973-1-zhouchengming@bytedance.com>
Precedence: bulk

On Wed, Feb 10, 2021 at 12:06:05PM +0800, Chengming Zhou wrote:
> Move the reclaim detection from the timer tick to the task state
> tracking machinery using the recently added ONCPU state. And we
> also add memstall state changes checking in the psi_task_switch()
> optimization to update the parents properly.
> 
> Thanks to Johannes Weiner for pointing out the psi_task_switch()
> optimization things and the clearer changelog.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Thanks for the update, Chengming.

It would be good to include a rationale in the changelog that explains
why we're doing this. Performance and cost are a bit questionable, IMO
because it doesn't look cheaper in aggregate...

> ---
>  include/linux/psi.h  |  1 -
>  kernel/sched/core.c  |  1 -
>  kernel/sched/psi.c   | 52 ++++++++++++++++------------------------------------
>  kernel/sched/stats.h |  9 ---------
>  4 files changed, 16 insertions(+), 47 deletions(-)

...but the code is simpler and shorter this way: fewer lines, and
we're removing one of the hooks into the scheduler. So it's a
maintainability win, I would say.

I did some testing with perf bench. The new code appears to have
slightly more overhead (which is expected), although the error bars
overlap to a point where I don't think it would matter on real loads.

I tested an additional version of the code that adds unlikely()
annotations to move the pressure state branches out of line - since
they are after all exceptional situations. It seems to help -
especially the pipe bench actually looks better than on vanilla.

Attached the incremental patch below.

---

perf stat -r 3 -- perf sched bench messaging -l 10000

vanilla
        125,833.65 msec task-clock:u              #   22.102 CPUs utilized            ( +-  1.94% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            69,526      page-faults:u             #    0.553 K/sec                    ( +-  0.79% )
     8,189,667,649      cycles:u                  #    0.065 GHz                      ( +-  1.49% )  (83.31%)
     2,184,284,296      stalled-cycles-frontend:u #   26.67% frontend cycles idle     ( +-  4.37% )  (83.32%)
     1,152,703,719      stalled-cycles-backend:u  #   14.08% backend cycles idle      ( +-  0.56% )  (83.37%)
     2,483,312,679      instructions:u            #    0.30  insn per cycle         
                                                  #    0.88  stalled cycles per insn  ( +-  0.15% )  (83.33%)
       781,332,174      branches:u                #    6.209 M/sec                    ( +-  0.13% )  (83.35%)
       159,531,476      branch-misses:u           #   20.42% of all branches          ( +-  0.17% )  (83.32%)

            5.6933 +- 0.0911 seconds time elapsed  ( +-  1.60% )

patched
        129,756.92 msec task-clock:u              #   22.243 CPUs utilized            ( +-  1.92% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            69,904      page-faults:u             #    0.539 K/sec                    ( +-  0.67% )
     8,518,161,670      cycles:u                  #    0.066 GHz                      ( +-  2.19% )  (83.30%)
     2,337,165,666      stalled-cycles-frontend:u #   27.44% frontend cycles idle     ( +-  5.47% )  (83.32%)
     1,148,789,343      stalled-cycles-backend:u  #   13.49% backend cycles idle      ( +-  0.05% )  (83.35%)
     2,483,527,911      instructions:u            #    0.29  insn per cycle         
                                                  #    0.94  stalled cycles per insn  ( +-  0.18% )  (83.38%)
       782,138,388      branches:u                #    6.028 M/sec                    ( +-  0.09% )  (83.33%)
       160,131,311      branch-misses:u           #   20.47% of all branches          ( +-  0.16% )  (83.31%)

             5.834 +- 0.106 seconds time elapsed  ( +-  1.81% )

patched-unlikely
        127,437.78 msec task-clock:u              #   22.184 CPUs utilized            ( +-  0.74% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            70,063      page-faults:u             #    0.550 K/sec                    ( +-  0.53% )
     8,453,581,973      cycles:u                  #    0.066 GHz                      ( +-  1.49% )  (83.34%)
     2,327,192,242      stalled-cycles-frontend:u #   27.53% frontend cycles idle     ( +-  2.43% )  (83.32%)
     1,146,196,558      stalled-cycles-backend:u  #   13.56% backend cycles idle      ( +-  0.35% )  (83.34%)
     2,486,920,732      instructions:u            #    0.29  insn per cycle         
                                                  #    0.94  stalled cycles per insn  ( +-  0.10% )  (83.34%)
       781,067,666      branches:u                #    6.129 M/sec                    ( +-  0.15% )  (83.34%)
       160,104,212      branch-misses:u           #   20.50% of all branches          ( +-  0.10% )  (83.33%)

            5.7446 +- 0.0418 seconds time elapsed  ( +-  0.73% )

---

perf stat -r 3 -- perf bench sched pipe

vanilla
         14,086.14 msec task-clock:u              #    1.009 CPUs utilized            ( +-  6.52% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             1,467      page-faults:u             #    0.104 K/sec                    ( +-  0.06% )
       306,181,835      cycles:u                  #    0.022 GHz                      ( +-  2.13% )  (83.41%)
        43,975,811      stalled-cycles-frontend:u #   14.36% frontend cycles idle     ( +- 14.45% )  (83.05%)
        52,429,386      stalled-cycles-backend:u  #   17.12% backend cycles idle      ( +-  0.28% )  (83.58%)
        93,097,176      instructions:u            #    0.30  insn per cycle         
                                                  #    0.56  stalled cycles per insn  ( +-  0.36% )  (83.23%)
        35,351,661      branches:u                #    2.510 M/sec                    ( +-  0.21% )  (83.37%)
         6,124,932      branch-misses:u           #   17.33% of all branches          ( +-  0.51% )  (83.36%)

            13.955 +- 0.164 seconds time elapsed  ( +-  1.17% )

patched
         14,574.69 msec task-clock:u              #    1.040 CPUs utilized            ( +-  0.87% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             1,469      page-faults:u             #    0.101 K/sec                    ( +-  0.13% )
       302,769,739      cycles:u                  #    0.021 GHz                      ( +-  1.19% )  (83.17%)
        37,638,522      stalled-cycles-frontend:u #   12.43% frontend cycles idle     ( +-  0.31% )  (83.47%)
        46,206,055      stalled-cycles-backend:u  #   15.26% backend cycles idle      ( +-  6.56% )  (83.34%)
        92,566,358      instructions:u            #    0.31  insn per cycle         
                                                  #    0.50  stalled cycles per insn  ( +-  0.51% )  (83.45%)
        35,667,707      branches:u                #    2.447 M/sec                    ( +-  0.67% )  (83.23%)
         6,224,587      branch-misses:u           #   17.45% of all branches          ( +-  2.24% )  (83.35%)

            14.010 +- 0.245 seconds time elapsed  ( +-  1.75% )

patched-unlikely
         13,470.99 msec task-clock:u              #    1.024 CPUs utilized            ( +-  3.10% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             1,477      page-faults:u             #    0.110 K/sec                    ( +-  0.09% )
       310,752,740      cycles:u                  #    0.023 GHz                      ( +-  1.32% )  (83.35%)
        44,894,078      stalled-cycles-frontend:u #   14.45% frontend cycles idle     ( +- 13.24% )  (83.47%)
        52,540,903      stalled-cycles-backend:u  #   16.91% backend cycles idle      ( +-  0.36% )  (82.96%)
        92,296,178      instructions:u            #    0.30  insn per cycle         
                                                  #    0.57  stalled cycles per insn  ( +-  0.48% )  (83.44%)
        35,316,802      branches:u                #    2.622 M/sec                    ( +-  0.06% )  (83.32%)
         6,173,049      branch-misses:u           #   17.48% of all branches          ( +-  0.66% )  (83.47%)

            13.161 +- 0.293 seconds time elapsed  ( +-  2.22% )

> @@ -833,7 +827,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  		 */
>  		iter = NULL;
>  		while ((group = iterate_groups(next, &iter))) {
> -			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
> +			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU] &&
> +			    next->in_memstall == prev->in_memstall) {
>  				common = group;
>  				break;

It'd be better to compare psi_flags instead of just in_memstall: it's
clearer and also more robust against future changes (even though it's
somewhat unlikely we grow more states). It's also an invariant
throughout the loop, so we should move it out.

The comment above the loop is now stale too.

Can you fold the following into your patch?

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8735d5f291dc..6d4a246ef131 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -809,18 +809,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	void *iter;
 
 	if (next->pid) {
+		bool identical_state;
+
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
-		 * When moving state between tasks, the group that
-		 * contains them both does not change: we can stop
-		 * updating the tree once we reach the first common
-		 * ancestor. Iterate @next's ancestors until we
-		 * encounter @prev's state.
+		 * When switching between tasks that have an identical
+		 * runtime state, the cgroup that contains both tasks
+		 * does not change: we can stop updating the tree once
+		 * we reach the first common ancestor. Iterate @next's
+		 * ancestors only until we encounter @prev's ONCPU.
 		 */
+		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU] &&
-			    next->in_memstall == prev->in_memstall) {
+			if (identical_state &&
+			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}

Otherwise, this looks good to me. Peter, what do you think?

---

From cf36f1dde714a2c543f5947e47138de32468f33a Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 10 Feb 2021 13:38:34 -0500
Subject: [PATCH] psi: pressure states are unlikely

Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeup and sleeps for workloads that aren't under any
sort of resource pressure.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8735d5f291dc..7fbacd6347a6 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -216,15 +216,15 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 {
 	switch (state) {
 	case PSI_IO_SOME:
-		return tasks[NR_IOWAIT];
+		return unlikely(tasks[NR_IOWAIT]);
 	case PSI_IO_FULL:
-		return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+		return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
 	case PSI_MEM_SOME:
-		return tasks[NR_MEMSTALL];
+		return unlikely(tasks[NR_MEMSTALL]);
 	case PSI_MEM_FULL:
-		return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
+		return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
 	case PSI_CPU_SOME:
-		return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
 	case PSI_NONIDLE:
 		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
 			tasks[NR_RUNNING];
@@ -721,7 +721,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 * task in a cgroup is in_memstall, the corresponding groupc
 	 * on that cpu is in PSI_MEM_FULL state.
 	 */
-	if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
 		state_mask |= (1 << PSI_MEM_FULL);
 
 	groupc->state_mask = state_mask;
-- 
2.30.0