Following the RFC discussion at [1].
Andrea
Changes since RFC:
- add prepare_sync_core_cmd()
- add #ifdeffery for nosmp builds
[1] https://lkml.kernel.org/r/[email protected]
Andrea Parri (2):
locking: Introduce prepare_sync_core_cmd()
membarrier: riscv: Provide core serializing command
.../membarrier-sync-core/arch-support.txt | 2 +-
arch/riscv/Kconfig | 3 ++
arch/riscv/include/asm/sync_core.h | 29 +++++++++++++++++++
include/linux/sync_core.h | 16 +++++++++-
init/Kconfig | 3 ++
kernel/sched/membarrier.c | 1 +
6 files changed, 52 insertions(+), 2 deletions(-)
create mode 100644 arch/riscv/include/asm/sync_core.h
--
2.34.1
Introduce an architecture function that architectures can use to set
up ("prepare") SYNC_CORE commands.
The function will be used by RISC-V to update its "deferred icache-
flush" data structures (icache_stale_mask).
Architectures defining prepare_sync_core_cmd() static inline need to
select ARCH_HAS_PREPARE_SYNC_CORE_CMD.
Signed-off-by: Andrea Parri <[email protected]>
Suggested-by: Mathieu Desnoyers <[email protected]>
---
include/linux/sync_core.h | 16 +++++++++++++++-
init/Kconfig | 3 +++
kernel/sched/membarrier.c | 1 +
3 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
index 013da4b8b3272..67bb9794b8758 100644
--- a/include/linux/sync_core.h
+++ b/include/linux/sync_core.h
@@ -17,5 +17,19 @@ static inline void sync_core_before_usermode(void)
}
#endif
-#endif /* _LINUX_SYNC_CORE_H */
+#ifdef CONFIG_ARCH_HAS_PREPARE_SYNC_CORE_CMD
+#include <asm/sync_core.h>
+#else
+/*
+ * This is a dummy prepare_sync_core_cmd() implementation that can be used on
+ * all architectures which provide unconditional core serializing instructions
+ * in switch_mm().
+ * If your architecture doesn't provide such core serializing instructions in
+ * switch_mm(), you may need to write your own functions.
+ */
+static inline void prepare_sync_core_cmd(struct mm_struct *mm)
+{
+}
+#endif
+#endif /* _LINUX_SYNC_CORE_H */
diff --git a/init/Kconfig b/init/Kconfig
index 9ffb103fc927b..87daf50838f02 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1972,6 +1972,9 @@ source "kernel/Kconfig.locks"
config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
bool
+config ARCH_HAS_PREPARE_SYNC_CORE_CMD
+ bool
+
config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
bool
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 2ad881d07752c..58f801e013988 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -320,6 +320,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
return -EPERM;
ipi_func = ipi_sync_core;
+ prepare_sync_core_cmd(mm);
} else if (flags == MEMBARRIER_FLAG_RSEQ) {
if (!IS_ENABLED(CONFIG_RSEQ))
return -EINVAL;
--
2.34.1
RISC-V uses xRET instructions on return from interrupt and to go back
to user-space; the xRET instruction is not core serializing.
Use FENCE.I for providing core serialization as follows:
- by calling sync_core_before_usermode() on return from interrupt (cf.
ipi_sync_core()),
- via switch_mm() and sync_core_before_usermode() (respectively, for
uthread->uthread and kthread->uthread transitions) to go back to
user-space.
On RISC-V, the serialization in switch_mm() is activated by resetting
the icache_stale_mask of the mm at prepare_sync_core_cmd().
Signed-off-by: Andrea Parri <[email protected]>
Suggested-by: Palmer Dabbelt <[email protected]>
---
.../membarrier-sync-core/arch-support.txt | 2 +-
arch/riscv/Kconfig | 3 ++
arch/riscv/include/asm/sync_core.h | 29 +++++++++++++++++++
3 files changed, 33 insertions(+), 1 deletion(-)
create mode 100644 arch/riscv/include/asm/sync_core.h
diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
index d96b778b87ed8..f6f0bd871a578 100644
--- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
+++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
@@ -43,7 +43,7 @@
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
- | riscv: | TODO |
+ | riscv: | ok |
| s390: | ok |
| sh: | TODO |
| sparc: | TODO |
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 95a2a06acc6a6..3e2734a3f2957 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -27,14 +27,17 @@ config RISCV
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_HAS_KCOV
+ select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_MMIOWB
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PMEM_API
+ select ARCH_HAS_PREPARE_SYNC_CORE_CMD
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_DIRECT_MAP if MMU
select ARCH_HAS_SET_MEMORY if MMU
select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
select ARCH_HAS_STRICT_MODULE_RWX if MMU && !XIP_KERNEL
+ select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_SYSCALL_WRAPPER
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UBSAN_SANITIZE_ALL
diff --git a/arch/riscv/include/asm/sync_core.h b/arch/riscv/include/asm/sync_core.h
new file mode 100644
index 0000000000000..9153016da8f14
--- /dev/null
+++ b/arch/riscv/include/asm/sync_core.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RISCV_SYNC_CORE_H
+#define _ASM_RISCV_SYNC_CORE_H
+
+/*
+ * RISC-V implements return to user-space through an xRET instruction,
+ * which is not core serializing.
+ */
+static inline void sync_core_before_usermode(void)
+{
+ asm volatile ("fence.i" ::: "memory");
+}
+
+#ifdef CONFIG_SMP
+/*
+ * Ensure the next switch_mm() on every CPU issues a core serializing
+ * instruction for the given @mm.
+ */
+static inline void prepare_sync_core_cmd(struct mm_struct *mm)
+{
+ cpumask_setall(&mm->context.icache_stale_mask);
+}
+#else
+static inline void prepare_sync_core_cmd(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_SMP */
+
+#endif /* _ASM_RISCV_SYNC_CORE_H */
--
2.34.1
On 2023-11-27 05:32, Andrea Parri wrote:
> Introduce an architecture function that architectures can use to set
> up ("prepare") SYNC_CORE commands.
>
> The function will be used by RISC-V to update its "deferred icache-
> flush" data structures (icache_stale_mask).
>
> Architectures defining prepare_sync_core_cmd() static inline need to
> select ARCH_HAS_PREPARE_SYNC_CORE_CMD.
>
> Signed-off-by: Andrea Parri <[email protected]>
> Suggested-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On 2023-11-27 05:32, Andrea Parri wrote:
> RISC-V uses xRET instructions on return from interrupt and to go back
> to user-space; the xRET instruction is not core serializing.
>
> Use FENCE.I for providing core serialization as follows:
>
> - by calling sync_core_before_usermode() on return from interrupt (cf.
> ipi_sync_core()),
>
> - via switch_mm() and sync_core_before_usermode() (respectively, for
> uthread->uthread and kthread->uthread transitions) to go back to
> user-space.
>
> On RISC-V, the serialization in switch_mm() is activated by resetting
> the icache_stale_mask of the mm at prepare_sync_core_cmd().
>
> Signed-off-by: Andrea Parri <[email protected]>
> Suggested-by: Palmer Dabbelt <[email protected]>
> ---
[...]
> +
> +#ifdef CONFIG_SMP
> +/*
> + * Ensure the next switch_mm() on every CPU issues a core serializing
> + * instruction for the given @mm.
> + */
> +static inline void prepare_sync_core_cmd(struct mm_struct *mm)
> +{
> + cpumask_setall(&mm->context.icache_stale_mask);
I am concerned about the possibility that this change lacks two barriers in the
following scenario:
On a transition from uthread -> uthread on [CPU 0], from a thread belonging to
another mm to a thread belonging to the mm [!mm -> mm] for which a concurrent
membarrier sync-core is done on [CPU 1]:
- [CPU 1] sets all bits in the mm icache_stale_mask [A]. There are no barriers
associated with these stores.
- [CPU 0] store to rq->curr [B] (by the scheduler) vs [CPU 1] loads rq->curr [C]
within membarrier to decide if the IPI should be skipped. Let's say CPU 1 observes
cpu_rq(0)->curr->mm != mm, so it skips the IPI.
- This means membarrier relies on switch_mm() to issue the sync-core.
- [CPU 0] switch_mm() loads [D] the icache_stale_mask. If the bit is zero, switch_mm()
may incorrectly skip the sync-core.
AFAIU, [C] can be reordered before [A] because there is no barrier between those
operations within membarrier. I suspect it can cause the switch_mm() code to skip
a needed sync-core.
AFAIU, [D] can be reordered before [B] because there is no documented barrier
between those operations within the scheduler, which can also cause switch_mm()
to skip a needed sync-core.
We possibly have a similar scenario for uthread->uthread when the scheduler
switches between mm -> !mm.
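For what it's worth, the above boils down to a store-buffering pattern; a rough
LKMM-style litmus sketch of it (rq->curr and the icache_stale_mask bit modeled
as plain ints, names made up for illustration):

C membarrier-sync-core-switch-mm

{}

P0(int *rq_curr, int *stale)
{
	int r0;

	WRITE_ONCE(*rq_curr, 1);	/* [B] scheduler: rq->curr = next       */
	r0 = READ_ONCE(*stale);		/* [D] switch_mm(): cpumask_test_cpu()  */
}

P1(int *rq_curr, int *stale)
{
	int r1;

	WRITE_ONCE(*stale, 1);		/* [A] membarrier: cpumask_setall()     */
	r1 = READ_ONCE(*rq_curr);	/* [C] membarrier: load of rq->curr     */
}

exists (0:r0=0 /\ 1:r1=0) (* both sides skip the sync-core *)

Without a full barrier on each side, the "exists" clause is satisfiable under
the LKMM, i.e. both the IPI and the switch_mm() sync-core can be skipped.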
One way to fix this would be to add the following barriers (a rough sketch follows the list):
- A smp_mb() between [A] and [C], possibly just after cpumask_setall() in
prepare_sync_core_cmd(), with comments detailing the ordering it guarantees,
- A smp_mb() between [B] and [D], possibly just before cpumask_test_cpu() in
flush_icache_deferred(), with appropriate comments.
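Roughly (only a sketch against the current riscv code, to illustrate the
placement; not a tested patch):

static inline void prepare_sync_core_cmd(struct mm_struct *mm)
{
	cpumask_setall(&mm->context.icache_stale_mask);
	/*
	 * Order the cpumask stores above [A] before the loads of
	 * rq->curr [C] performed later by membarrier.
	 */
	smp_mb();
}

and, in flush_icache_deferred():

	/*
	 * Order the scheduler's store to rq->curr [B] before the
	 * icache_stale_mask test [D] below.
	 */
	smp_mb();
	if (cpumask_test_cpu(cpu, &mm->context.icache_stale_mask)) {
		/* existing deferred-flush handling */
	}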
Am I missing something ?
Thanks,
Mathieu
> +}
> +#else
> +static inline void prepare_sync_core_cmd(struct mm_struct *mm)
> +{
> +}
> +#endif /* CONFIG_SMP */
> +
> +#endif /* _ASM_RISCV_SYNC_CORE_H */
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
> I am concerned about the possibility that this change lacks two barriers in the
> following scenario:
>
> On a transition from uthread -> uthread on [CPU 0], from a thread belonging to
> another mm to a thread belonging to the mm [!mm -> mm] for which a concurrent
> membarrier sync-core is done on [CPU 1]:
>
> - [CPU 1] sets all bits in the mm icache_stale_mask [A]. There are no barriers
> associated with these stores.
>
> - [CPU 0] store to rq->curr [B] (by the scheduler) vs [CPU 1] loads rq->curr [C]
> within membarrier to decide if the IPI should be skipped. Let's say CPU 1 observes
> cpu_rq(0)->curr->mm != mm, so it skips the IPI.
>
> - This means membarrier relies on switch_mm() to issue the sync-core.
>
> - [CPU 0] switch_mm() loads [D] the icache_stale_mask. If the bit is zero, switch_mm()
> may incorrectly skip the sync-core.
>
> AFAIU, [C] can be reordered before [A] because there is no barrier between those
> operations within membarrier. I suspect it can cause the switch_mm() code to skip
> a needed sync-core.
>
> AFAIU, [D] can be reordered before [B] because there is no documented barrier
> between those operations within the scheduler, which can also cause switch_mm()
> to skip a needed sync-core.
>
> We possibly have a similar scenario for uthread->uthread when the scheduler
> switches between mm -> !mm.
>
> One way to fix this would be to add the following barriers:
>
> - A smp_mb() between [A] and [C], possibly just after cpumask_setall() in
> prepare_sync_core_cmd(), with comments detailing the ordering it guarantees,
> - A smp_mb() between [B] and [D], possibly just before cpumask_test_cpu() in
> flush_icache_deferred(), with appropriate comments.
>
> Am I missing something ?
Thanks for the detailed analysis.
AFAIU, the following barrier (in membarrier_private_expedited())
/*
* Matches memory barriers around rq->curr modification in
* scheduler.
*/
smp_mb(); /* system call entry is not a mb. */
can serve the purpose of ordering [A] before [C] (to be documented in v2).
But I agree that [B] and [D] are unordered/missing suitable synchronization.
Worse, RISC-V has currently no full barrier after [B] and before returning to
user-space: I'm thinking (inspired by the PowerPC implementation),
diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 217fd4de61342..f63222513076d 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -323,6 +323,23 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
if (unlikely(prev == next))
return;
+#if defined(CONFIG_MEMBARRIER) && defined(CONFIG_SMP)
+ /*
+ * The membarrier system call requires a full memory barrier
+ * after storing to rq->curr, before going back to user-space.
+ *
+ * Only need the full barrier when switching between processes:
+ * barrier when switching from kernel to userspace is not
+ * required here, given that it is implied by mmdrop(); barrier
+ * when switching from userspace to kernel is not needed after
+ * store to rq->curr.
+ */
+ if (unlikely(atomic_read(&next->membarrier_state) &
+ (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
+ MEMBARRIER_STATE_GLOBAL_EXPEDITED)) && prev)
+ smp_mb();
+#endif
+
/*
* Mark the current MM context as inactive, and the next as
* active. This is at least used by the icache flushing
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a708d225c28e8..a1c749fddd095 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6670,8 +6670,9 @@ static void __sched notrace __schedule(unsigned int sched_mode)
*
* Here are the schemes providing that barrier on the
* various architectures:
- * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
- * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
+ * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC,
+ * RISC-V. switch_mm() relies on membarrier_arch_switch_mm()
+ * on PowerPC.
* - finish_lock_switch() for weakly-ordered
* architectures where spin_unlock is a full barrier,
* - switch_to() for arm64 (weakly-ordered, spin_unlock
The silver lining is that similar changes (probably as a separate/preliminary
patch) also restore the desired order between [B] and [D] AFAIU; so with them,
2/2 would just need additions to document the above SYNC_CORE scenario.
Thoughts?
Andrea
On 2023-11-28 10:13, Andrea Parri wrote:
>> I am concerned about the possibility that this change lacks two barriers in the
>> following scenario:
>>
>> On a transition from uthread -> uthread on [CPU 0], from a thread belonging to
>> another mm to a thread belonging to the mm [!mm -> mm] for which a concurrent
>> membarrier sync-core is done on [CPU 1]:
>>
>> - [CPU 1] sets all bits in the mm icache_stale_mask [A]. There are no barriers
>> associated with these stores.
>>
>> - [CPU 0] store to rq->curr [B] (by the scheduler) vs [CPU 1] loads rq->curr [C]
>> within membarrier to decide if the IPI should be skipped. Let's say CPU 1 observes
>> cpu_rq(0)->curr->mm != mm, so it skips the IPI.
>>
>> - This means membarrier relies on switch_mm() to issue the sync-core.
>>
>> - [CPU 0] switch_mm() loads [D] the icache_stale_mask. If the bit is zero, switch_mm()
>> may incorrectly skip the sync-core.
>>
>> AFAIU, [C] can be reordered before [A] because there is no barrier between those
>> operations within membarrier. I suspect it can cause the switch_mm() code to skip
>> a needed sync-core.
>>
>> AFAIU, [D] can be reordered before [B] because there is no documented barrier
>> between those operations within the scheduler, which can also cause switch_mm()
>> to skip a needed sync-core.
>>
>> We possibly have a similar scenario for uthread->uthread when the scheduler
>> switches between mm -> !mm.
>>
>> One way to fix this would be to add the following barriers:
>>
>> - A smp_mb() between [A] and [C], possibly just after cpumask_setall() in
>> prepare_sync_core_cmd(), with comments detailing the ordering it guarantees,
>> - A smp_mb() between [B] and [D], possibly just before cpumask_test_cpu() in
>> flush_icache_deferred(), with appropriate comments.
>>
>> Am I missing something ?
>
> Thanks for the detailed analysis.
>
> AFAIU, the following barrier (in membarrier_private_expedited())
>
> /*
> * Matches memory barriers around rq->curr modification in
> * scheduler.
> */
> smp_mb(); /* system call entry is not a mb. */
>
> can serve the purpose of ordering [A] before [C] (to be documented in v2).
Agreed. Yes it should be documented.
>
> But I agree that [B] and [D] are unordered/missing suitable synchronization.
> Worse, RISC-V has currently no full barrier after [B] and before returning to
> user-space: I'm thinking (inspired by the PowerPC implementation),
If RISC-V currently supports the membarrier private cmd and lacks the
appropriate smp_mb() in switch_mm(), then it's a bug. This initial patch
should be a "Fix" and fast-tracked as such.
Indeed, looking at how ASID is used to implement switch_mm, it appears
to not require a full smp_mb() at all as long as there are no ASID
rollovers.
>
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 217fd4de61342..f63222513076d 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -323,6 +323,23 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> if (unlikely(prev == next))
> return;
>
> +#if defined(CONFIG_MEMBARRIER) && defined(CONFIG_SMP)
> + /*
> + * The membarrier system call requires a full memory barrier
> + * after storing to rq->curr, before going back to user-space.
> + *
> + * Only need the full barrier when switching between processes:
> + * barrier when switching from kernel to userspace is not
> + * required here, given that it is implied by mmdrop(); barrier
> + * when switching from userspace to kernel is not needed after
> + * store to rq->curr.
> + */
> + if (unlikely(atomic_read(&next->membarrier_state) &
> + (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
> + MEMBARRIER_STATE_GLOBAL_EXPEDITED)) && prev)
> + smp_mb();
> +#endif
The approach looks good. Please implement it within a separate
membarrier_arch_switch_mm() as done on powerpc.
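Something along these lines, perhaps (just a sketch mirroring the powerpc
helper; the riscv header name and the exact hook point in switch_mm() are
assumptions to be sorted out in v2):

/* arch/riscv/include/asm/membarrier.h (sketch) */
static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
					     struct mm_struct *next,
					     struct task_struct *tsk)
{
	/*
	 * Only need the full barrier when switching between processes:
	 * the barrier when switching from kernel to userspace is implied
	 * by mmdrop(), and no barrier is needed after the store to
	 * rq->curr when switching from userspace to kernel.
	 */
	if (IS_ENABLED(CONFIG_SMP) &&
	    likely(!(atomic_read(&next->membarrier_state) &
		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
		return;

	/*
	 * The membarrier system call requires a full memory barrier
	 * after storing to rq->curr, before going back to user-space.
	 */
	smp_mb();
}

with switch_mm() calling membarrier_arch_switch_mm(prev, next, task) in place
of the open-coded #ifdef block above.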
> +
> /*
> * Mark the current MM context as inactive, and the next as
> * active. This is at least used by the icache flushing
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a708d225c28e8..a1c749fddd095 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6670,8 +6670,9 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> *
> * Here are the schemes providing that barrier on the
> * various architectures:
> - * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
> - * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
> + * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC,
> + * RISC-V. switch_mm() relies on membarrier_arch_switch_mm()
> + * on PowerPC.
> * - finish_lock_switch() for weakly-ordered
> * architectures where spin_unlock is a full barrier,
> * - switch_to() for arm64 (weakly-ordered, spin_unlock
>
> The silver lining is that similar changes (probably as a separate/preliminary
> patch) also restore the desired order between [B] and [D] AFAIU; so with them,
> 2/2 would just need additions to document the above SYNC_CORE scenario.
Exactly.
> Thoughts?
I think we should be OK with the changes you suggest.
Thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
> > diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> > index 217fd4de61342..f63222513076d 100644
> > --- a/arch/riscv/mm/context.c
> > +++ b/arch/riscv/mm/context.c
> > @@ -323,6 +323,23 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > if (unlikely(prev == next))
> > return;
> > +#if defined(CONFIG_MEMBARRIER) && defined(CONFIG_SMP)
> > + /*
> > + * The membarrier system call requires a full memory barrier
> > + * after storing to rq->curr, before going back to user-space.
> > + *
> > + * Only need the full barrier when switching between processes:
> > + * barrier when switching from kernel to userspace is not
> > + * required here, given that it is implied by mmdrop(); barrier
> > + * when switching from userspace to kernel is not needed after
> > + * store to rq->curr.
> > + */
> > + if (unlikely(atomic_read(&next->membarrier_state) &
> > + (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
> > + MEMBARRIER_STATE_GLOBAL_EXPEDITED)) && prev)
> > + smp_mb();
> > +#endif
>
> The approach looks good. Please implement it within a separate
> membarrier_arch_switch_mm() as done on powerpc.
Will do. Thanks.
As regards the Fixes: tag, I guess it boils down to what we want or we
need to take for commit "riscv: Support membarrier private cmd". :-)
FWIW, a quick git-log search confirmed that MEMBARRIER has been around
for quite some time in the RISC-V world (though I'm not familiar with
any of its mainstream uses): commit 1464d00b27b2 says (at least) since
93917ad50972 ("RISC-V: Add support for restartable sequence"). I am
currently inclined to pick the latter commit (and check it w/ Palmer),
but other suggestions are welcome.
Andrea
On 2023-11-29 13:29, Andrea Parri wrote:
>>> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
>>> index 217fd4de61342..f63222513076d 100644
>>> --- a/arch/riscv/mm/context.c
>>> +++ b/arch/riscv/mm/context.c
>>> @@ -323,6 +323,23 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>> if (unlikely(prev == next))
>>> return;
>>> +#if defined(CONFIG_MEMBARRIER) && defined(CONFIG_SMP)
>>> + /*
>>> + * The membarrier system call requires a full memory barrier
>>> + * after storing to rq->curr, before going back to user-space.
>>> + *
>>> + * Only need the full barrier when switching between processes:
>>> + * barrier when switching from kernel to userspace is not
>>> + * required here, given that it is implied by mmdrop(); barrier
>>> + * when switching from userspace to kernel is not needed after
>>> + * store to rq->curr.
>>> + */
>>> + if (unlikely(atomic_read(&next->membarrier_state) &
>>> + (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
>>> + MEMBARRIER_STATE_GLOBAL_EXPEDITED)) && prev)
>>> + smp_mb();
>>> +#endif
>>
>> The approach looks good. Please implement it within a separate
>> membarrier_arch_switch_mm() as done on powerpc.
>
> Will do. Thanks.
>
> As regards the Fixes: tag, I guess it boils down to what we want or we
> need to take for commit "riscv: Support membarrier private cmd". :-)
I'm not seeing this commit in the Linux master branch, am I missing
something ?
> FWIW, a quick git-log search confirmed that MEMBARRIER has been around
> for quite some time in the RISC-V world (though I'm not familiar with
> any of its mainstream uses): commit 1464d00b27b2 says (at least) since
> 93917ad50972 ("RISC-V: Add support for restartable sequence"). I am
> currently inclined to pick the latter commit (and check it w/ Palmer),
> but other suggestions are welcome.
Supporting membarrier private expedited is not optional since Linux 4.14:
see kernel/sched/core.c:
membarrier_switch_mm(rq, prev->active_mm, next->mm);
/*
* sys_membarrier() requires an smp_mb() between setting
* rq->curr / membarrier_switch_mm() and returning to userspace.
*
* The below provides this either through switch_mm(), or in
* case 'prev->active_mm == next->mm' through
* finish_task_switch()'s mmdrop().
*/
switch_mm_irqs_off(prev->active_mm, next->mm, next);
Failure to provide the required barrier is a bug in the architecture's
switch_mm implementation when CONFIG_MEMBARRIER=y.
We should probably introduce a new
Documentation/features/sched/membarrier/arch-support.txt
to clarify this requirement.
Userspace code such as liburcu [1] heavily relies on membarrier private
expedited (when available) to speed up RCU read-side critical sections.
Various DNS servers, including BIND 9, use liburcu.
Thanks,
Mathieu
[1] https://liburcu.org
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
> > As regards the Fixes: tag, I guess it boils down to what we want or we
> > need to take for commit "riscv: Support membarrier private cmd". :-)
>
> I'm not seeing this commit in the Linux master branch, am I missing
> something ?
I don't think you're missing something: I was wondering "what/where is
this commit"? Sorry for the confusion.
> > FWIW, a quick git-log search confirmed that MEMBARRIER has been around
> > for quite some time in the RISC-V world (though I'm not familiar with
> > any of its mainstream uses): commit 1464d00b27b2 says (at least) since
> > 93917ad50972 ("RISC-V: Add support for restartable sequence"). I am
> > currently inclined to pick the latter commit (and check it w/ Palmer),
> > but other suggestions are welcome.
>
> Supporting membarrier private expedited is not optional since Linux 4.14:
>
> see kernel/sched/core.c:
>
> membarrier_switch_mm(rq, prev->active_mm, next->mm);
> /*
> * sys_membarrier() requires an smp_mb() between setting
> * rq->curr / membarrier_switch_mm() and returning to userspace.
> *
> * The below provides this either through switch_mm(), or in
> * case 'prev->active_mm == next->mm' through
> * finish_task_switch()'s mmdrop().
> */
> switch_mm_irqs_off(prev->active_mm, next->mm, next);
>
> Failure to provide the required barrier is a bug in the architecture's
> switch_mm implementation when CONFIG_MEMBARRIER=y.
>
> We should probably introduce a new
> Documentation/features/sched/membarrier/arch-support.txt
> to clarify this requirement.
>
> Userspace code such as liburcu [1] heavily relies on membarrier private
> expedited (when available) to speed up RCU read-side critical sections.
> Various DNS servers, including BIND 9, use liburcu.
Thanks for the information.
So I should probably stick to 93917ad50972, which apparently selected
CONFIG_MEMBARRIER on RISC-V, for the Fixes: tag in question.
I'll look into adding the membarrier feature you mention (as a final/
follow-up patch), unless you or someone else want to take care of it.
Andrea
On 2023-11-29 16:25, Andrea Parri wrote:
>>> As regards the Fixes: tag, I guess it boils down to what we want or we
>>> need to take for commit "riscv: Support membarrier private cmd". :-)
>>
>> I'm not seeing this commit in the Linux master branch, am I missing
>> something ?
>
> I don't think you're missing something: I was wondering "what/where is
> this commit"? Sorry for the confusion.
>
>
>>> FWIW, a quick git-log search confirmed that MEMBARRIER has been around
>>> for quite some time in the RISC-V world (though I'm not familiar with
>>> any of its mainstream uses): commit 1464d00b27b2 says (at least) since
>>> 93917ad50972 ("RISC-V: Add support for restartable sequence"). I am
>>> currently inclined to pick the latter commit (and check it w/ Palmer),
>>> but other suggestions are welcome.
>>
>> Supporting membarrier private expedited is not optional since Linux 4.14:
>>
>> see kernel/sched/core.c:
>>
>> membarrier_switch_mm(rq, prev->active_mm, next->mm);
>> /*
>> * sys_membarrier() requires an smp_mb() between setting
>> * rq->curr / membarrier_switch_mm() and returning to userspace.
>> *
>> * The below provides this either through switch_mm(), or in
>> * case 'prev->active_mm == next->mm' through
>> * finish_task_switch()'s mmdrop().
>> */
>> switch_mm_irqs_off(prev->active_mm, next->mm, next);
>>
>> Failure to provide the required barrier is a bug in the architecture's
>> switch_mm implementation when CONFIG_MEMBARRIER=y.
>>
>> We should probably introduce a new
>> Documentation/features/sched/membarrier/arch-support.txt
>> to clarify this requirement.
>>
>> Userspace code such as liburcu [1] heavily relies on membarrier private
>> expedited (when available) to speed up RCU read-side critical sections.
>> Various DNS servers, including BIND 9, use liburcu.
>
> Thanks for the information.
>
> So I should probably stick to 93917ad50972, which apparently selected
> CONFIG_MEMBARRIER on RISC-V, for the Fixes: tag in question.
I think it goes further than that, because you can explicitly set
CONFIG_MEMBARRIER=y, see init/Kconfig:
config MEMBARRIER
bool "Enable membarrier() system call" if EXPERT
default y
help
Enable the membarrier() system call that allows issuing memory
barriers across all running threads, which can be used to distribute
the cost of user-space memory barriers asymmetrically by transforming
pairs of memory barriers into pairs consisting of membarrier() and a
compiler barrier.
If unsure, say Y.
Before 1464d00b27b2, riscv just happened to set it to =n in the defconfig.
I suspect the initial port of riscv merged after v4.14 was already broken.
> I'll look into adding the membarrier feature you mention (as a final/
> follow-up patch), unless you or someone else want to take care of it.
I'll be happy to review it :)
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
> > So I should probably stick to 93917ad50972, which apparently selected
> > CONFIG_MEMBARRIER on RISC-V, for the Fixes: tag in question.
>
> I think it goes further than that, because you can explicitly set
> CONFIG_MEMBARRIER=y, see init/Kconfig:
>
> config MEMBARRIER
> bool "Enable membarrier() system call" if EXPERT
> default y
> help
> Enable the membarrier() system call that allows issuing memory
> barriers across all running threads, which can be used to distribute
> the cost of user-space memory barriers asymmetrically by transforming
> pairs of memory barriers into pairs consisting of membarrier() and a
> compiler barrier.
>
> If unsure, say Y.
>
> Before 1464d00b27b2, riscv just happened to set it to =n in the defconfig.
>
> I suspect the initial port of riscv merged after v4.14 was already broken.
I see. Oh well, guess I'll have to leave this up to the maintainers then
(I believe I've never managed to build riscv that far), Palmer?
> > I'll look into adding the membarrier feature you mention (as a final/
> > follow-up patch), unless you or someone else want to take care of it.
>
> I'll be happy to review it :)
Sweet! :-)
Andrea
On Wed, 29 Nov 2023 14:43:34 PST (-0800), [email protected] wrote:
>> > So I should probably stick to 93917ad50972, which apparently selected
>> > CONFIG_MEMBARRIER on RISC-V, for the Fixes: tag in question.
>>
>> I think it goes further than that, because you can explicitly set
>> CONFIG_MEMBARRIER=y, see init/Kconfig:
>>
>> config MEMBARRIER
>> bool "Enable membarrier() system call" if EXPERT
>> default y
>> help
>> Enable the membarrier() system call that allows issuing memory
>> barriers across all running threads, which can be used to distribute
>> the cost of user-space memory barriers asymmetrically by transforming
>> pairs of memory barriers into pairs consisting of membarrier() and a
>> compiler barrier.
>>
>> If unsure, say Y.
>>
>> Before 1464d00b27b2, riscv just happened to set it to =n in the defconfig.
>>
>> I suspect the initial port of riscv merged after v4.14 was already broken.
>
> I see. Oh well, guess I'll have to leave this up to the maintainers then
> (I believe I've never managed to build riscv that far), Palmer?
I see
$ git grep "config MEMBARRIER" fab957c11efe2f405e08b9f0d080524bc2631428
fab957c11efe2f405e08b9f0d080524bc2631428:init/Kconfig:config MEMBARRIER
so IMO this is just one of those forever bugs. So I'd lean towards
Fixes: fab957c11efe ("RISC-V: Atomic and Locking Code")
(or anything in that original patch set). It's not that big of a
backport, so I think it's safe enough?
>> > I'll look into adding the membarrier feature you mention (as a final/
>> > follow-up patch), unless you or someone else want to take care of it.
>>
>> I'll be happy to review it :)
>
> Sweet! :-)
>
> Andrea
> I see
>
> $ git grep "config MEMBARRIER" fab957c11efe2f405e08b9f0d080524bc2631428
> fab957c11efe2f405e08b9f0d080524bc2631428:init/Kconfig:config MEMBARRIER
>
> so IMO this is just one of those forever bugs. So I'd lean towards
>
> Fixes: fab957c11efe ("RISC-V: Atomic and Locking Code")
Works for me, will apply in v2.
> (or anything in that original patch set). It's not that big of a backport,
> so I think it's safe enough?
Indeed, I think so.
The final version of this fix will likely depend on some machinery/code
introduced by 3ccfebedd8cf54 ("powerpc, membarrier: Skip memory barrier
in switch_mm()"); but, yes, nothing we can't safely adjust I think.
Thanks,
Andrea
On Wed, 06 Dec 2023 06:11:07 PST (-0800), [email protected] wrote:
>> I see
>>
>> $ git grep "config MEMBARRIER" fab957c11efe2f405e08b9f0d080524bc2631428
>> fab957c11efe2f405e08b9f0d080524bc2631428:init/Kconfig:config MEMBARRIER
>>
>> so IMO this is just one of those forever bugs. So I'd lean towards
>>
>> Fixes: fab957c11efe ("RISC-V: Atomic and Locking Code")
>
> Works for me, will apply in v2.
>
>
>> (or anything in that original patch set). It's not that big of a backport,
>> so I think it's safe enough?
>
> Indeed, I think so.
>
> The final version of this fix will likely depend on some machinery/code
> introduced by 3ccfebedd8cf54 ("powerpc, membarrier: Skip memory barrier
> in switch_mm()"); but, yes, nothing we can't safely adjust I think.
Ya, I guess we'll have to look to know for sure but hopefully it's
manageable.
> Thanks,
> Andrea
> > The final version of this fix will likely depend on some machinery/code
> > introduced by 3ccfebedd8cf54 ("powerpc, membarrier: Skip memory barrier
> > in switch_mm()"); but, yes, nothing we can't safely adjust I think.
>
> Ya, I guess we'll have to look to know for sure but hopefully it's
> manageable.
Absolutely. One approach would be to follow what PowerPC did: AFAIU, before
3ccfebedd8cf54 membarrier/powerpc used to hard code the required barrier in
in finish_task_switch(), "masking" it as an smp_mb__after_unlock_lock(); riscv
could use a similar approach (though with a different/new mask function).
Alternatively, we could maybe keep the barrier in switch_mm().
But let me complete and send out v2 with the fix at stake... this should give
us a more concrete basis to discuss about these matters.
Andrea
On Wed, 06 Dec 2023 09:42:44 PST (-0800), [email protected] wrote:
>> > The final version of this fix will likely depend on some machinery/code
>> > introduced by 3ccfebedd8cf54 ("powerpc, membarrier: Skip memory barrier
>> > in switch_mm()"); but, yes, nothing we can't safely adjust I think.
>>
>> Ya, I guess we'll have to look to know for sure but hopefully it's
>> manageable.
>
> Absolutely. One approach would be to follow what PowerPC did: AFAIU, before
> 3ccfebedd8cf54 membarrier/powerpc used to hard-code the required barrier
> in finish_task_switch(), "masking" it as an smp_mb__after_unlock_lock(); riscv
> could use a similar approach (though with a different/new mask function).
IIRC we patterned our MMIOWB/after-spinlock barriers after PPC, so
that's probably a good place to start here as well.
> Alternatively, we could maybe keep the barrier in switch_mm().
>
> But let me complete and send out v2 with the fix at stake... this should give
> us a more concrete basis to discuss about these matters.
>
> Andrea