2024-03-08 15:07:33

by Mathieu Desnoyers

[permalink] [raw]
Subject: [PATCH] sched: Add missing memory barrier in switch_mm_cid

Many architectures' switch_mm() (e.g. arm64) do not have an smp_mb()
which the core scheduler code has depended upon since commit:

commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")

If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear() can
unset the actively used cid when it fails to observe the active task
after it sets lazy_put.

There *is* a memory barrier between storing to rq->curr and _return to
userspace_ (as required by membarrier), but the rseq mm_cid has stricter
requirements: the barrier needs to be issued between store to rq->curr
and switch_mm_cid(), which happens earlier than:

- spin_unlock(),
- switch_to().

So it's fine when the architecture switch_mm happens to have that barrier
already, but less so when the architecture only provides the full barrier
in switch_to() or spin_unlock().

It is a bug in the rseq switch_mm_cid() implementation. All architectures
that don't have memory barriers in switch_mm(), but rather have the full
barrier either in finish_lock_switch() or switch_to() have them too late
for the needs of switch_mm_cid().

Introduce a new smp_mb__after_switch_mm(), defined as smp_mb() in the
generic barrier.h header, and use it in switch_mm_cid() for scheduler
transitions where switch_mm() is expected to provide a memory barrier.

Architectures can override smp_mb__after_switch_mm() if their
switch_mm() implementation provides an implicit memory barrier.
Override it with a no-op on x86, which implicitly provides this memory
barrier by writing to CR3.

Link: https://lore.kernel.org/lkml/[email protected]/
Reported-by: levi.yun <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Cc: <[email protected]> # 6.4.x
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: levi.yun <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Aaron Lu <[email protected]>
---
arch/x86/include/asm/barrier.h | 3 +++
include/asm-generic/barrier.h | 8 ++++++++
kernel/sched/sched.h | 20 ++++++++++++++------
3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 35389b2af88e..0d5e54201eb2 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -79,6 +79,9 @@ do { \
#define __smp_mb__before_atomic() do { } while (0)
#define __smp_mb__after_atomic() do { } while (0)

+/* Writing to CR3 provides a full memory barrier in switch_mm(). */
+#define smp_mb__after_switch_mm() do { } while (0)
+
#include <asm-generic/barrier.h>

/*
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 961f4d88f9ef..5a6c94d7a598 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -296,5 +296,13 @@ do { \
#define io_stop_wc() do { } while (0)
#endif

+/*
+ * Architectures that guarantee an implicit smp_mb() in switch_mm()
+ * can override smp_mb__after_switch_mm.
+ */
+#ifndef smp_mb__after_switch_mm
+#define smp_mb__after_switch_mm() smp_mb()
+#endif
+
#endif /* !__ASSEMBLY__ */
#endif /* __ASM_GENERIC_BARRIER_H */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e5a95486a42..044d842c696c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -79,6 +79,8 @@
# include <asm/paravirt_api_clock.h>
#endif

+#include <asm/barrier.h>
+
#include "cpupri.h"
#include "cpudeadline.h"

@@ -3481,13 +3483,19 @@ static inline void switch_mm_cid(struct rq *rq,
 		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
 		 * Provide it here.
 		 */
-		if (!prev->mm) // from kernel
+		if (!prev->mm) { // from kernel
 			smp_mb();
-		/*
-		 * user -> user transition guarantees a memory barrier through
-		 * switch_mm() when current->mm changes. If current->mm is
-		 * unchanged, no barrier is needed.
-		 */
+		} else { // from user
+			/*
+			 * user -> user transition relies on an implicit
+			 * memory barrier in switch_mm() when
+			 * current->mm changes. If the architecture
+			 * switch_mm() does not have an implicit memory
+			 * barrier, it is emitted here. If current->mm
+			 * is unchanged, no barrier is needed.
+			 */
+			smp_mb__after_switch_mm();
+		}
 	}
 	if (prev->mm_cid_active) {
 		mm_cid_snapshot_time(rq, prev->mm);
--
2.39.2
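
For readers following the ordering argument, here is a minimal user-space
sketch of the pattern, not kernel code. It assumes C11 atomics as a stand-in
for the kernel primitives; rq_curr, cid_lazy_put, sched_side() and
remote_clear_side() are hypothetical names that model the store to rq->curr,
the lazy_put flag, the context-switch path and sched_mm_cid_remote_clear()
respectively.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for rq->curr and the per-CPU cid lazy-put flag. */
static atomic_int rq_curr;      /* 0: previous task, 1: task using the mm */
static atomic_int cid_lazy_put; /* 0: cid in use, 1: remote clear requested */

/*
 * Context-switch side: publish the new current task, then inspect the
 * cid state. The fence models the barrier needed between the rq->curr
 * store and the pcpu_cid load in switch_mm_cid().
 */
static int sched_side(void)
{
	atomic_store_explicit(&rq_curr, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&cid_lazy_put, memory_order_relaxed);
}

/*
 * Remote-clear side: request the lazy put, then check whether the task
 * is currently running before actually clearing the cid.
 */
static int remote_clear_side(void)
{
	atomic_store_explicit(&cid_lazy_put, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&rq_curr, memory_order_relaxed);
}
```

With a full fence on both sides, at least one side is guaranteed to observe
the other's store: either the remote clear sees the running task, or the
scheduler sees the lazy-put request. Without the fence in sched_side(), both
loads could return 0 on weakly ordered hardware, which corresponds to the
failure described in the commit message.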



2024-03-19 09:20:44

by Yeo Reum Yun

[permalink] [raw]
Subject: Re: [PATCH] sched: Add missing memory barrier in switch_mm_cid

Gentle ping.


2024-04-08 09:42:51

by Yeo Reum Yun

[permalink] [raw]
Subject: Re: [PATCH] sched: Add missing memory barrier in switch_mm_cid

Gentle ping...


2024-04-09 10:23:22

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH] sched: Add missing memory barrier in switch_mm_cid

On Fri, Mar 08, 2024 at 10:07:19AM -0500, Mathieu Desnoyers wrote:
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index 35389b2af88e..0d5e54201eb2 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -79,6 +79,9 @@ do { \
> #define __smp_mb__before_atomic() do { } while (0)
> #define __smp_mb__after_atomic() do { } while (0)
>
> +/* Writing to CR3 provides a full memory barrier in switch_mm(). */
> +#define smp_mb__after_switch_mm() do { } while (0)
> +
> #include <asm-generic/barrier.h>
>
> /*
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index 961f4d88f9ef..5a6c94d7a598 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -296,5 +296,13 @@ do { \
> #define io_stop_wc() do { } while (0)
> #endif
>
> +/*
> + * Architectures that guarantee an implicit smp_mb() in switch_mm()
> + * can override smp_mb__after_switch_mm.
> + */
> +#ifndef smp_mb__after_switch_mm
> +#define smp_mb__after_switch_mm() smp_mb()
> +#endif
> +
> #endif /* !__ASSEMBLY__ */
> #endif /* __ASM_GENERIC_BARRIER_H */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2e5a95486a42..044d842c696c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -79,6 +79,8 @@
> # include <asm/paravirt_api_clock.h>
> #endif
>
> +#include <asm/barrier.h>
> +
> #include "cpupri.h"
> #include "cpudeadline.h"
>
> @@ -3481,13 +3483,19 @@ static inline void switch_mm_cid(struct rq *rq,
> * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
> * Provide it here.
> */
> - if (!prev->mm) // from kernel
> + if (!prev->mm) { // from kernel
> smp_mb();
> - /*
> - * user -> user transition guarantees a memory barrier through
> - * switch_mm() when current->mm changes. If current->mm is
> - * unchanged, no barrier is needed.
> - */
> + } else { // from user
> + /*
> + * user -> user transition relies on an implicit
> + * memory barrier in switch_mm() when
> + * current->mm changes. If the architecture
> + * switch_mm() does not have an implicit memory
> + * barrier, it is emitted here. If current->mm
> + * is unchanged, no barrier is needed.
> + */
> + smp_mb__after_switch_mm();
> + }

I'm fine with the change from the arm64 perspective but I guess we need
an ack from the x86 and sched maintainers. FWIW:

Reviewed-by: Catalin Marinas <[email protected]>

--
Catalin

2024-04-10 15:30:44

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] sched: Add missing memory barrier in switch_mm_cid

Peter, Ingo, can you please consider merging this fix?

It adds a missing memory barrier in switch_mm_cid(), which is a
no-op on x86 because switch_mm already provides it. This issue was
reported more than a month ago.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-04-10 17:18:32

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] sched: Add missing memory barrier in switch_mm_cid

This fix has received a Reviewed-by from arm64 maintainer Catalin Marinas. [1]

I'm CCing the x86 maintainers who are not also scheduler maintainers
so they can give their input.

This is still waiting for feedback from scheduler maintainers.

[1] https://lore.kernel.org/lkml/[email protected]/

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-04-11 14:29:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] sched: Add missing memory barrier in switch_mm_cid

On 4/10/24 10:18, Mathieu Desnoyers wrote:
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -79,6 +79,9 @@ do {                                    \
>   #define __smp_mb__before_atomic()    do { } while (0)
>   #define __smp_mb__after_atomic()    do { } while (0)
> +/* Writing to CR3 provides a full memory barrier in switch_mm(). */
> +#define smp_mb__after_switch_mm()    do { } while (0)

I haven't gone through this in detail, but the CR3 certainly is a full
barrier and the x86 code _looks_ correct, so:

Acked-by: Dave Hansen <[email protected]> # for x86
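
The override mechanism discussed in this thread can be illustrated outside
the kernel. The following is only a sketch under assumptions: smp_mb() is
replaced by a C11 sequentially consistent fence, ARCH_SWITCH_MM_IMPLIES_MB is
a hypothetical build switch (not a kernel symbol), and fence_count and
fences_after_switch_mm() are hypothetical helpers added to make the effect
observable.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical instrumentation: number of explicit fences issued so far.
 * Plain int is enough because this sketch is single-threaded. */
static int fence_count;

/* User-space stand-in for the kernel's smp_mb(). */
static inline void smp_mb(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	fence_count++;
}

/*
 * Architecture override: an architecture whose switch_mm() already
 * implies a full barrier defines the macro as a no-op before the
 * generic fallback is seen.
 */
#ifdef ARCH_SWITCH_MM_IMPLIES_MB /* hypothetical, for illustration only */
#define smp_mb__after_switch_mm()	do { } while (0)
#endif

/* Generic fallback, same #ifndef pattern as the patch. */
#ifndef smp_mb__after_switch_mm
#define smp_mb__after_switch_mm()	smp_mb()
#endif

/* Hypothetical caller: returns how many fences one invocation emitted. */
static int fences_after_switch_mm(void)
{
	int before = fence_count;

	smp_mb__after_switch_mm();
	return fence_count - before;
}
```

Built as-is, fences_after_switch_mm() returns 1 because the generic fallback
is used. Built with -DARCH_SWITCH_MM_IMPLIES_MB, it returns 0, which mirrors
the x86 case where the CR3 write already provides the barrier.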