Date: Wed, 8 Mar 2023 15:43:32 -0800
Message-Id: <20230308234332.3355689-1-eranian@google.com>
Subject: [PATCH V2] x86/resctrl: robustify __resctrl_sched_in
From: Stephane Eranian <eranian@google.com>
To: linux-kernel@vger.kernel.org
Cc: torvalds@linux-foundation.org, peterz@infradead.org, tony.luck@intel.com,
    reinette.chatre@intel.com, fenghua.yu@intel.com, peternewman@google.com,
    ndesaulniers@google.com, morbo@google.com, jakub@redhat.com,
    james.morse@arm.com, babu.moger@amd.com, ananth.narayan@amd.com,
    vschneid@redhat.com

When compiling the kernel with clang, there is a problem with the code in
__resctrl_sched_in() because of the way the inline function is
optimized when called from __switch_to(). The effect of the problem is
that the bandwidth restriction driven by the CLOSID is not enforced.

The problem is easy to reproduce on Intel or AMD x86 systems:

1. If the resctrl filesystem is not mounted:
   $ mount -t resctrl none /sys/fs/resctrl

2. Create a resctrl group:
   $ mkdir /sys/fs/resctrl/test

3. Move the shell into the resctrl group:
   $ echo $$ > /sys/fs/resctrl/test/tasks

4. Run a bandwidth-consuming test in the background on CPU0:
   $ taskset -c 0 triad &

5. Monitor the bandwidth consumption, using perf to measure bandwidth
   on your processor.

6. Restrict the bandwidth:
   - Intel: $ echo MB:0=10  > /sys/fs/resctrl/test/schemata
   - AMD:   $ echo MB:0=240 > /sys/fs/resctrl/test/schemata

7. Monitor the bandwidth again.

At step 7, you will see that the restriction is not enforced.

The problem is located in the __resctrl_sched_in() routine, which
rewrites the active CLOSID via the PQR_ASSOC register. Because this is
an expensive operation, the kernel only does it when the context switch
involves tasks with different CLOSIDs. To check that, it needs to
access the task being scheduled in, and it does so through the current
task pointer, reading the task's closid field as current->closid.
current is actually a macro that reads the per-cpu variable
pcpu_hot.current_task.

After an investigation by compiler experts, the problem was tracked
down to the use of the get_current() macro in the __resctrl_sched_in()
code, and in particular to the per-cpu macro:

  static __always_inline struct task_struct *get_current(void)
  {
          return this_cpu_read_stable(pcpu_hot.current_task);
  }

And as per percpu.h:

  /*
   * this_cpu_read() makes gcc load the percpu variable every time it is
   * accessed while this_cpu_read_stable() allows the value to be cached.
   * this_cpu_read_stable() is more efficient and can be used if its value
   * is guaranteed to be valid across cpus. The current users include
   * get_current() and get_thread_info() both of which are actually
   * per-thread variables implemented as per-cpu variables and thus
   * stable for the duration of the respective task.
   */

The _stable version of the macro allows the value to be cached, i.e.,
the compiler is not forced to reload it. However, the __switch_to()
routine changes the current pointer. If the compiler optimizes away the
reload, then __resctrl_sched_in() looks at the previous task instead of
the new current task. This explains why we are not seeing the bandwidth
limit enforced on the benchmark.

Note that the problem was detected when compiling the kernel with clang
(v14), but it could also happen with gcc.

There are a few ways to fix the problem. In V1, we modified the
__resctrl_sched_in() function to use the this_cpu_read() form of the
macro. Given that the issue is specific to the __switch_to() routine,
we did not change get_current() itself but instead invoked the
lower-level macro directly from __resctrl_sched_in(). This had no
impact on any other user of get_current().

In V2, after further discussions with kernel maintainers and LLVM
developers, the conclusion was that the problem is not caused by the
compiler: the function is not written in a proper manner, and the
compiler is just exposing the problem. So it is best to change the
__resctrl_sched_in() function to:

 (1) be a proper static inline function

 (2) take the task pointer of the task being scheduled in, instead of
     retrieving it from the current pointer

This makes more sense and follows other sched_in type functions: if you
are scheduling something in, then you need to name what it is; you
should not rely on the current pointer. This solution was proposed by
Linus.
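To make the failure mode concrete, here is a simplified sketch
(illustrative only; this is neither the actual kernel source nor the
real generated code) of what the optimized __switch_to() path
effectively does once the compiler caches the this_cpu_read_stable()
value hidden behind get_current():

  /*
   * Illustrative sketch only. this_cpu_read_stable() tells the
   * compiler that the value will not change, so it may load
   * pcpu_hot.current_task once and keep reusing that value:
   */
  cached = pcpu_hot.current_task;    /* cached == prev task       */
  ...
  pcpu_hot.current_task = next_p;    /* the actual task switch    */
  ...
  /* inlined __resctrl_sched_in(); current is not reloaded: */
  tmp = READ_ONCE(cached->closid);   /* reads prev task's closid! */

With the V2 change below, __resctrl_sched_in() never dereferences
current at all: it operates on the next_p pointer that __switch_to()
already has in hand, so no reload is needed in the first place.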
The patch was tested on AMD Zen3 + MBA, and I verified that the
bandwidth limit is now enforced.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Stephane Eranian <eranian@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
---
 arch/x86/include/asm/resctrl.h         | 12 ++++++------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 ++--
 arch/x86/kernel/process_32.c           |  2 +-
 arch/x86/kernel/process_64.c           |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 52788f79786f..255a78d9d906 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -49,7 +49,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
  * simple as possible.
  * Must be called with preemption disabled.
  */
-static void __resctrl_sched_in(void)
+static inline void __resctrl_sched_in(struct task_struct *tsk)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
 	u32 closid = state->default_closid;
@@ -61,13 +61,13 @@ static void __resctrl_sched_in(void)
 	 * Else use the closid/rmid assigned to this cpu.
 	 */
 	if (static_branch_likely(&rdt_alloc_enable_key)) {
-		tmp = READ_ONCE(current->closid);
+		tmp = READ_ONCE(tsk->closid);
 		if (tmp)
 			closid = tmp;
 	}
 
 	if (static_branch_likely(&rdt_mon_enable_key)) {
-		tmp = READ_ONCE(current->rmid);
+		tmp = READ_ONCE(tsk->rmid);
 		if (tmp)
 			rmid = tmp;
 	}
@@ -88,17 +88,17 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val)
 	return val * scale;
 }
 
-static inline void resctrl_sched_in(void)
+static inline void resctrl_sched_in(struct task_struct *tsk)
 {
 	if (static_branch_likely(&rdt_enable_key))
-		__resctrl_sched_in();
+		__resctrl_sched_in(tsk);
 }
 
 void resctrl_cpu_detect(struct cpuinfo_x86 *c);
 
 #else
 
-static inline void resctrl_sched_in(void) {}
+static inline void resctrl_sched_in(struct task_struct *tsk) {}
 static inline void resctrl_cpu_detect(struct cpuinfo_x86 *c) {}
 
 #endif /* CONFIG_X86_CPU_RESCTRL */
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e2c1599d1b37..884b6e9a7e31 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -314,7 +314,7 @@ static void update_cpu_closid_rmid(void *info)
 	 * executing task might have its own closid selected. Just reuse
 	 * the context switch code.
 	 */
-	resctrl_sched_in();
+	resctrl_sched_in(current);
 }
 
 /*
@@ -530,7 +530,7 @@ static void _update_task_closid_rmid(void *task)
 	 * Otherwise, the MSR is updated when the task is scheduled in.
 	 */
 	if (task == current)
-		resctrl_sched_in();
+		resctrl_sched_in(task);
 }
 
 static void update_task_closid_rmid(struct task_struct *t)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 470c128759ea..708c87b88cc1 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -212,7 +212,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	switch_fpu_finish();
 
 	/* Load the Intel cache allocation PQR MSR. */
-	resctrl_sched_in();
+	resctrl_sched_in(next_p);
 
 	return prev_p;
 }
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4e34b3b68ebd..bb65a68b4b49 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -656,7 +656,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	}
 
 	/* Load the Intel cache allocation PQR MSR. */
-	resctrl_sched_in();
+	resctrl_sched_in(next_p);
 
 	return prev_p;
 }
--
2.40.0.rc0.216.gc4246ad0f0-goog